Data Factory, Data Integration

This document provides an overview of the Azure Data Factory documentation, including quickstarts, tutorials, samples, concepts, how-to guides, reference materials, and resources. Key sections include quickstarts for creating data factories using different tools/languages, tutorials on copying and transforming data both in the cloud and between on-premises and cloud, and concepts covering pipelines/activities, linked services, integration runtime, roles/permissions, and pricing.

Contents

Data Factory Documentation


Switch to version 1 documentation
Overview
Introduction to Data Factory
Compare current version to version 1
Quickstarts
Create data factory - User interface (UI)
Create data factory - Copy Data tool
Create data factory - Azure PowerShell
Create data factory - .NET
Create data factory - Python
Create data factory - REST
Create data factory - Resource Manager template
Create data flow
Tutorials
Copy data in cloud
Copy Data tool
User interface (UI)
.NET
Copy on-premises data to cloud
Copy Data tool
User interface (UI)
Azure PowerShell
Copy data in bulk
User interface (UI)
Azure PowerShell
Copy data incrementally
1 - Copy from one table
User interface (UI)
Azure PowerShell
2 - Copy from multiple tables
User interface (UI)
Azure PowerShell
3 - Use change tracking feature
User interface (UI)
Azure PowerShell
4 - Copy new files by LastModifiedDate
Copy Data tool
5 - Copy new files by time partitioned file name
Copy Data tool
Transform data in cloud
HDInsight Spark
User interface (UI)
Azure PowerShell
Databricks Notebook
User interface (UI)
Transform data in virtual network
User interface (UI)
Azure PowerShell
Add branching and chaining
User interface (UI)
.NET
Run SSIS packages in Azure
User interface (UI)
Azure PowerShell
Samples
Code samples
Azure PowerShell
Concepts
Pipelines and activities
Linked services
Datasets
Pipeline execution and triggers
Integration runtime
Mapping Data Flows
Mapping data flow concepts
Debug mode
Schema drift
Inspect pane
Column patterns
Data flow monitoring
Data flow performance
Move nodes
Optimize tab
Expression builder
Reference nodes
Expression language
Roles and permissions
Understanding pricing
Naming rules
How-to guides
Author
Visually author data factories
Continuous integration and delivery
Iterative development and debugging
Connectors
Amazon Marketplace Web Service
Amazon Redshift
Amazon S3
Azure Blob Storage
Azure Cosmos DB SQL API
Azure Cosmos DB's API for MongoDB
Azure Data Explorer
Azure Data Lake Storage Gen1
Azure Data Lake Storage Gen2
Azure Database for MariaDB
Azure Database for MySQL
Azure Database for PostgreSQL
Azure File Storage
Azure Search
Azure SQL Database
Azure SQL Database Managed Instance
Azure SQL Data Warehouse
Azure Table Storage
Cassandra
Common Data Service for Apps
Concur
Couchbase
DB2
Delimited text format
Drill
Dynamics 365
Dynamics AX
Dynamics CRM
File System
FTP
Google AdWords
Google BigQuery
Google Cloud Storage
Greenplum
HBase
HDFS
Hive
HTTP
HubSpot
Impala
Informix
Jira
Magento
MariaDB
Marketo
Microsoft Access
MongoDB
MongoDB (legacy)
MySQL
Netezza
OData
ODBC
Office 365
Oracle
Oracle Eloqua
Oracle Responsys
Oracle Service Cloud
Parquet format
Paypal
Phoenix
PostgreSQL
Presto
QuickBooks Online
REST
Salesforce
Salesforce Service Cloud
Salesforce Marketing Cloud
SAP Business Warehouse Open Hub
Load SAP BW data
SAP Business Warehouse MDX
SAP Cloud for Customer
SAP ECC
SAP HANA
SAP Table
ServiceNow
SFTP
Shopify
Spark
SQL Server
Square
Sybase
Teradata
Vertica
Web Table
Xero
Zoho
Move data
Copy data using Copy Activity
Delete files using Delete Activity
Copy Data tool
Load Data Lake Storage Gen2
Copy from Data Lake Storage Gen1
Load SQL Data Warehouse
Load Data Lake Storage Gen1
Load SAP BW data
Load Office 365 data
Read or write partitioned data
Format and compression support
Schema and type mapping
Fault tolerance
Performance and tuning
Transform data
HDInsight Hive Activity
HDInsight Pig Activity
HDInsight MapReduce Activity
HDInsight Streaming Activity
HDInsight Spark Activity
ML Batch Execution Activity
ML Update Resource Activity
Stored Procedure Activity
Data Lake U-SQL Activity
Databricks Notebook Activity
Databricks Jar Activity
Databricks Python Activity
Custom activity
Compute linked services
Control flow
Append Variable Activity
Azure Function Activity
Execute Data Flow Activity
Execute Pipeline Activity
Filter Activity
For Each Activity
Get Metadata Activity
If Condition Activity
Lookup Activity
Set Variable Activity
Until Activity
Validation Activity
Wait Activity
Web Activity
Webhook Activity
Data flow transformations
Aggregate
Alter row
Conditional split
Derived column
Exists
Filter
Join
Lookup
New branch
Pivot
Select
Sink
Sort
Source
Surrogate key
Union
Unpivot
Window
Parameterize
Parameterize linked services
Expression Language
System variables
Security
Data movement security considerations
Store credentials in Azure Key Vault
Encrypt credentials for self-hosted integration runtime
Managed identity for Data Factory
Monitor and manage
Monitor visually
Monitor with Azure Monitor
Monitor with SDKs
Monitor integration runtime
Monitor Azure-SSIS integration runtime
Reconfigure Azure-SSIS integration runtime
Copy or clone a data factory
Create integration runtime
Azure integration runtime
Self hosted integration runtime
Azure-SSIS integration runtime
Shared self-hosted integration runtime
Run SSIS packages in Azure
Run SSIS packages with Execute SSIS Package activity
Run SSIS packages with Stored Procedure activity
Schedule Azure-SSIS integration runtime
Join Azure-SSIS IR to a virtual network
Enable Azure AD authentication for Azure-SSIS IR
Provision Enterprise Edition for Azure-SSIS IR
Customize setup for Azure-SSIS IR
Install licensed components for Azure-SSIS IR
Configure high performance for Azure-SSIS IR
Configure disaster recovery for Azure-SSIS IR
Clean up SSISDB logs with Elastic Database Jobs
Create triggers
Create an event-based trigger
Create a schedule trigger
Create a tumbling window trigger
Templates
Overview of templates
Copy files from multiple containers
Copy new files by LastModifiedDate
Bulk copy with control table
Delta copy with control table
Transform data with Databricks
Reference
.NET
PowerShell
REST API
Resource Manager template
Python
Resources
Ask a question - MSDN forum
Ask a question - Stack Overflow
Request a feature
FAQ
Whitepapers
Roadmap
Pricing
Availability by region
Support options
Introduction to Azure Data Factory
1/3/2019 • 10 minutes to read

NOTE
This article applies to version 1 of Azure Data Factory. If you are using the current version of the Data Factory service, see
Introduction to Data Factory V2.

What is Azure Data Factory?


In the world of big data, how is existing data leveraged in business? Is it possible to enrich data that's generated in
the cloud by using reference data from on-premises data sources or other disparate data sources?
For example, a gaming company collects logs that are produced by games in the cloud. It wants to analyze these
logs to gain insights into customer preferences, demographics, usage behavior, and so on. The company also wants
to identify up-sell and cross-sell opportunities, develop compelling new features to drive business growth, and
provide a better experience to customers.
To analyze these logs, the company needs to use reference data such as customer information, game
information, and marketing campaign information that is in an on-premises data store. Therefore, the company
wants to ingest log data from the cloud data store and reference data from the on-premises data store.
Next they want to process the data by using Hadoop in the cloud (Azure HDInsight). They want to publish the
result data into a cloud data warehouse such as Azure SQL Data Warehouse or an on-premises data store such as
SQL Server. The company wants this workflow to run once a week.
The company needs a platform where they can create a workflow that can ingest data from both on-premises and
cloud data stores. The company also needs to be able to transform or process data by using existing compute
services such as Hadoop, and publish the results to an on-premises or cloud data store for BI applications to
consume.

Azure Data Factory is the platform for these kinds of scenarios. It is a cloud-based data integration service that
allows you to create data-driven workflows in the cloud that orchestrate and automate data movement and data
transformation. Using Azure Data Factory, you can do the following tasks:
Create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data
stores.
Process or transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure
Data Lake Analytics, and Azure Machine Learning.
Publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI)
applications to consume.
It's more of an Extract-and-Load (EL) and Transform-and-Load (TL) platform rather than a traditional Extract-
Transform-and-Load (ETL) platform. The transformations process data by using compute services rather than by
adding derived columns, counting the number of rows, sorting data, and so on.
Currently, in Azure Data Factory, the data that workflows consume and produce is time-sliced data (hourly, daily,
weekly, and so on). For example, a pipeline might read input data, process data, and produce output data once a
day. You can also run a workflow just one time.

How does it work?


The pipelines (data-driven workflows) in Azure Data Factory typically perform the following three steps:

Connect and collect


Enterprises have data of various types that are located in disparate sources. The first step in building an
information production system is to connect to all the required sources of data and processing. These sources
include SaaS services, file shares, FTP, and web services. Then, move the data as needed to a centralized location
for subsequent processing.
Without Data Factory, enterprises must build custom data movement components or write custom services to
integrate these data sources and processing. It is expensive and hard to integrate and maintain such systems.
These systems also often lack the enterprise grade monitoring, alerting, and controls that a fully managed service
can offer.
With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and cloud
source data stores to a centralized data store in the cloud for further analysis.
For example, you can collect data in Azure Data Lake Store and transform the data later by using an Azure Data
Lake Analytics compute service. Or, collect data in Azure blob storage and transform it later by using an Azure
HDInsight Hadoop cluster.
Transform and enrich
After data is present in a centralized data store in the cloud, process or transform it by using compute services such
as HDInsight Hadoop, Spark, Data Lake Analytics, or Machine Learning. You want to reliably produce transformed
data on a maintainable and controlled schedule to feed production environments with trusted data.
Publish
Deliver transformed data from the cloud to on-premises sources such as SQL Server. Alternatively, keep it in your
cloud storage sources for consumption by BI and analytics tools and other applications.

Key components
An Azure subscription can have one or more Azure Data Factory instances (or data factories). Azure Data Factory
is composed of four key components. These components work together to provide the platform on which you can
compose data-driven workflows with steps to move and transform data.
Pipeline
A data factory can have one or more pipelines. A pipeline is a group of activities. Together, the activities in a
pipeline perform a task.
For example, a pipeline can contain a group of activities that ingests data from an Azure blob, and then runs a Hive
query on an HDInsight cluster to partition the data. The benefit of this is that the pipeline allows you to manage
the activities as a set instead of each one individually. For example, you can deploy and schedule the pipeline,
instead of scheduling independent activities.
Activity
A pipeline can have one or more activities. Activities define the actions to perform on your data. For example, you
can use a copy activity to copy data from one data store to another data store. Similarly, you can use a Hive activity.
A Hive activity runs a Hive query on an Azure HDInsight cluster to transform or analyze your data. Data Factory
supports two types of activities: data movement activities and data transformation activities.
Data movement activities
Copy Activity in Data Factory copies data from a source data store to a sink data store. Data from any source can
be written to any sink. Select a data store to learn how to copy data to and from that store. Data Factory supports
the following data stores:

CATEGORY | DATA STORE | SUPPORTED AS A SOURCE | SUPPORTED AS A SINK
Azure | Azure Blob storage | ✓ | ✓
Azure | Azure Cosmos DB (SQL API) | ✓ | ✓
Azure | Azure Data Lake Storage Gen1 | ✓ | ✓
Azure | Azure SQL Database | ✓ | ✓
Azure | Azure SQL Data Warehouse | ✓ | ✓
Azure | Azure Search Index | | ✓
Azure | Azure Table storage | ✓ | ✓
Databases | Amazon Redshift | ✓ |
Databases | DB2* | ✓ |
Databases | MySQL* | ✓ |
Databases | Oracle* | ✓ | ✓
Databases | PostgreSQL* | ✓ |
Databases | SAP Business Warehouse* | ✓ |
Databases | SAP HANA* | ✓ |
Databases | SQL Server* | ✓ | ✓
Databases | Sybase* | ✓ |
Databases | Teradata* | ✓ |
NoSQL | Cassandra* | ✓ |
NoSQL | MongoDB* | ✓ |
File | Amazon S3 | ✓ |
File | File System* | ✓ | ✓
File | FTP | ✓ |
File | HDFS* | ✓ |
File | SFTP | ✓ |
Others | Generic HTTP | ✓ |
Others | Generic OData | ✓ |
Others | Generic ODBC* | ✓ |
Others | Salesforce | ✓ |
Others | Web Table (table from HTML) | ✓ |
For more information, see Move data by using Copy Activity.


Data transformation activities
Azure Data Factory supports the following transformation activities that can be added to pipelines either
individually or chained with another activity.

DATA TRANSFORMATION ACTIVITY | COMPUTE ENVIRONMENT
Hive | HDInsight [Hadoop]
Pig | HDInsight [Hadoop]
MapReduce | HDInsight [Hadoop]
Hadoop Streaming | HDInsight [Hadoop]
Spark | HDInsight [Hadoop]
Machine Learning activities: Batch Execution and Update Resource | Azure VM
Stored Procedure | Azure SQL, Azure SQL Data Warehouse, or SQL Server
Data Lake Analytics U-SQL | Azure Data Lake Analytics
DotNet | HDInsight [Hadoop] or Azure Batch


NOTE
You can use MapReduce activity to run Spark programs on your HDInsight Spark cluster. See Invoke Spark programs from
Azure Data Factory for details. You can create a custom activity to run R scripts on your HDInsight cluster with R installed.
See Run R Script using Azure Data Factory.



Custom .NET activities
Create a custom .NET activity if you need to move data to or from a data store that Copy Activity doesn't support
or if you need to transform data by using your own logic. For details about how to create and use a custom activity,
see Use custom activities in an Azure Data Factory pipeline.
Datasets
An activity takes zero or more datasets as inputs and one or more datasets as outputs. Datasets represent data
structures within the data stores. These structures point to or reference the data you want to use in your activities
(such as inputs or outputs).
For example, an Azure blob dataset specifies the blob container and folder in the Azure blob storage from which
the pipeline should read the data. Or an Azure SQL table dataset specifies the table to which the output data is
written by the activity.
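To make the shape of a dataset concrete, the following is a minimal sketch of a version 1 Azure Blob dataset definition. The names (InputBlobDataset, AzureStorageLinkedService) and the folder path are placeholders rather than values from this article, and a real definition may need additional properties such as structure or policy settings.

{
  "name": "InputBlobDataset",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/input/",
      "format": { "type": "TextFormat" }
    },
    "external": true,
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}

The availability section reflects the time-sliced scheduling model that version 1 uses for datasets.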
Linked services
Linked services are much like connection strings, which define the connection information that's needed for Data
Factory to connect to external resources. Think of it this way: a linked service defines the connection to the data
source and a dataset represents the structure of the data.
For example, an Azure Storage-linked service specifies a connection string with which to connect to the Azure
Storage account. An Azure blob dataset specifies the blob container and the folder that contains the data.
Linked services are used for two reasons in Data Factory:
To represent a data store that includes, but isn't limited to, an on-premises SQL Server database, Oracle
database, file share, or Azure blob storage account. See the Data movement activities section for a list of
supported data stores.
To represent a compute resource that can host the execution of an activity. For example, the HDInsight Hive
activity runs on an HDInsight Hadoop cluster. See the Data transformation activities section for a list of
supported compute environments. A sample definition for the data store case is sketched below.
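This is a minimal, illustrative version 1 linked service for an Azure Storage account. The account name and key are placeholders that you would replace with your own values, and a real definition might carry additional properties.

{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>"
    }
  }
}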
Relationship between Data Factory entities

Supported regions
Currently, you can create data factories in the West US, East US, and North Europe regions. However, a data
factory can access data stores and compute services in other Azure regions to move data between data stores or
process data by using compute services.
Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the
movement of data between supported data stores. It also lets you process data by using compute services in other
regions or in an on-premises environment. It also allows you to monitor and manage workflows by using both
programmatic and UI mechanisms.
Data Factory is available in only West US, East US, and North Europe regions. However, the service that powers
the data movement in Data Factory is available globally in several regions. If a data store is behind a firewall, then
a Data Management Gateway that's installed in your on-premises environment moves the data instead.
For example, let's assume that your compute environments such as an Azure HDInsight cluster and Azure
Machine Learning are located in the West Europe region. You can create and use an Azure Data Factory instance in
North Europe. Then you can use it to schedule jobs on your compute environments in West Europe. It takes a few
milliseconds for Data Factory to trigger the job on your compute environment, but the time for running the job on
your computing environment does not change.

Get started with creating a pipeline


You can use one of these tools or APIs to create data pipelines in Azure Data Factory:
Azure portal
Visual Studio
PowerShell
.NET API
REST API
Azure Resource Manager template
To learn how to build data factories with data pipelines, follow the step-by-step instructions in the following
tutorials:

TUTORIAL | DESCRIPTION
Move data between two cloud data stores | Create a data factory with a pipeline that moves data from blob storage to a SQL database.
Transform data by using Hadoop cluster | Build your first Azure data factory with a data pipeline that processes data by running a Hive script on an Azure HDInsight (Hadoop) cluster.
Move data between an on-premises data store and a cloud data store by using Data Management Gateway | Build a data factory with a pipeline that moves data from an on-premises SQL Server database to an Azure blob. As part of the walkthrough, you install and configure the Data Management Gateway on your machine.
Introduction to Azure Data Factory
2/27/2019 • 8 minutes to read

In the world of big data, raw, unorganized data is often stored in relational, non-relational, and other storage
systems. However, on its own, raw data doesn't have the proper context or meaning to provide meaningful
insights to analysts, data scientists, or business decision makers.
Big data requires a service that can orchestrate and operationalize processes to refine these enormous stores of
raw data into actionable business insights. Azure Data Factory is a managed cloud service that's built for
these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration
projects.
For example, imagine a gaming company that collects petabytes of game logs that are produced by games in
the cloud. The company wants to analyze these logs to gain insights into customer preferences,
demographics, and usage behavior. It also wants to identify up-sell and cross-sell opportunities, develop
compelling new features, drive business growth, and provide a better experience to its customers.
To analyze these logs, the company needs to use reference data such as customer information, game
information, and marketing campaign information that is in an on-premises data store. The company wants
to utilize this data from the on-premises data store, combining it with additional log data that it has in a cloud
data store.
To extract insights, it hopes to process the joined data by using a Spark cluster in the cloud (Azure HDInsight),
and publish the transformed data into a cloud data warehouse such as Azure SQL Data Warehouse to easily
build a report on top of it. They want to automate this workflow, and monitor and manage it on a daily
schedule. They also want to execute it when files land in a blob store container.
Azure Data Factory is the platform that solves such data scenarios. It is a cloud-based data integration service
that allows you to create data-driven workflows in the cloud for orchestrating and automating data
movement and data transformation. Using Azure Data Factory, you can create and schedule data-driven
workflows (called pipelines) that can ingest data from disparate data stores. It can process and transform the
data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and
Azure Machine Learning.
Additionally, you can publish output data to data stores such as Azure SQL Data Warehouse for business
intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can be organized
into meaningful data stores and data lakes for better business decisions.
How does it work?
The pipelines (data-driven workflows) in Azure Data Factory typically perform the following four steps:

Connect and collect


Enterprises have data of various types that are located in disparate sources on-premises, in the cloud,
structured, unstructured, and semi-structured, all arriving at different intervals and speeds.
The first step in building an information production system is to connect to all the required sources of data
and processing, such as software-as-a-service (SaaS) services, databases, file shares, and FTP web services.
The next step is to move the data as needed to a centralized location for subsequent processing.
Without Data Factory, enterprises must build custom data movement components or write custom services
to integrate these data sources and processing. It's expensive and hard to integrate and maintain such
systems. In addition, they often lack the enterprise-grade monitoring, alerting, and the controls that a fully
managed service can offer.
With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and
cloud source data stores to a centralized data store in the cloud for further analysis. For example, you can
collect data in Azure Data Lake Store and transform the data later by using an Azure Data Lake Analytics
compute service. You can also collect data in Azure Blob storage and transform it later by using an Azure
HDInsight Hadoop cluster.
Transform and enrich
After data is present in a centralized data store in the cloud, process or transform the collected data by using
compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning. You want to
reliably produce transformed data on a maintainable and controlled schedule to feed production
environments with trusted data.
Publish
After the raw data has been refined into a business-ready consumable form, load the data into Azure SQL Data
Warehouse, Azure SQL Database, Azure Cosmos DB, or whichever analytics engine your business users can
point to from their business intelligence tools.
Monitor
After you have successfully built and deployed your data integration pipeline, providing business value from
refined data, monitor the scheduled activities and pipelines for success and failure rates. Azure Data Factory
has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and
health panels on the Azure portal.

Top-level concepts
An Azure subscription might have one or more Azure Data Factory instances (or data factories). Azure Data
Factory is composed of four key components. These components work together to provide the platform on
which you can compose data-driven workflows with steps to move and transform data.
Pipeline
A data factory might have one or more pipelines. A pipeline is a logical grouping of activities that performs a
unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group
of activities that ingests data from an Azure blob, and then runs a Hive query on an HDInsight cluster to
partition the data.
The benefit of this is that the pipeline allows you to manage the activities as a set instead of managing each
one individually. The activities in a pipeline can be chained together to operate sequentially, or they can
operate independently in parallel.
Activity
Activities represent a processing step in a pipeline. For example, you might use a copy activity to copy data
from one data store to another data store. Similarly, you might use a Hive activity, which runs a Hive query on
an Azure HDInsight cluster, to transform or analyze your data. Data Factory supports three types of activities:
data movement activities, data transformation activities, and control activities.
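As an illustrative sketch (not taken from this article), a pipeline that implements the blob-ingestion-plus-Hive scenario described above could be defined in JSON roughly as follows. All names, paths, and the referenced datasets and linked services are hypothetical; a working definition requires those entities to exist in the factory.

{
  "name": "IngestAndPartitionPipeline",
  "properties": {
    "activities": [
      {
        "name": "IngestLogs",
        "type": "Copy",
        "inputs": [ { "referenceName": "RawLogsDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "StagedLogsDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "BlobSink" }
        }
      },
      {
        "name": "PartitionLogs",
        "type": "HDInsightHive",
        "linkedServiceName": { "referenceName": "HDInsightLinkedService", "type": "LinkedServiceReference" },
        "typeProperties": {
          "scriptPath": "scripts/partitionlogs.hql",
          "scriptLinkedService": { "referenceName": "AzureStorageLinkedService", "type": "LinkedServiceReference" }
        }
      }
    ]
  }
}

The two activities in this sketch are independent; later sections show how to chain activities with explicit dependencies so they run in sequence.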
Datasets
Datasets represent data structures within the data stores, which simply point to or reference the data you
want to use in your activities as inputs or outputs.
Linked services
Linked services are much like connection strings, which define the connection information that's needed for
Data Factory to connect to external resources. Think of it this way: a linked service defines the connection to
the data source, and a dataset represents the structure of the data. For example, an Azure Storage-linked
service specifies a connection string to connect to the Azure Storage account. Additionally, an Azure blob
dataset specifies the blob container and the folder that contains the data.
Linked services are used for two purposes in Data Factory:
To represent a data store that includes, but isn't limited to, an on-premises SQL Server database,
Oracle database, file share, or Azure blob storage account. For a list of supported data stores, see the
copy activity article.
To represent a compute resource that can host the execution of an activity. For example, the
HDInsight Hive activity runs on an HDInsight Hadoop cluster. For a list of transformation activities and
supported compute environments, see the transform data article.
Triggers
Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off.
There are different types of triggers for different types of events.
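For example, a schedule trigger that starts a pipeline once a day might be defined along these lines. The trigger name, the start time, and the referenced CopyPipeline are placeholders used only for illustration.

{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2019-04-01T00:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "CopyPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}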
Pipeline runs
A pipeline run is an instance of the pipeline execution. Pipeline runs are typically instantiated by passing the
arguments to the parameters that are defined in pipelines. The arguments can be passed manually or within
the trigger definition.
Parameters
Parameters are key-value pairs of read-only configuration. Parameters are defined in the pipeline. The
arguments for the defined parameters are passed during execution from the run context that was created by a
trigger or a pipeline that was executed manually. Activities within the pipeline consume the parameter values.
A dataset is a strongly typed parameter and a reusable/referenceable entity. An activity can reference datasets
and can consume the properties that are defined in the dataset definition.
A linked service is also a strongly typed parameter that contains the connection information to either a data
store or a compute environment. It is also a reusable/referenceable entity.
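As a small illustration, a pipeline might declare a parameter like the following fragment (the parameter name inputFolder is hypothetical):

"parameters": {
  "inputFolder": {
    "type": "String",
    "defaultValue": "input"
  }
}

An activity or dataset property inside that pipeline could then consume the value through an expression:

"folderPath": {
  "value": "@pipeline().parameters.inputFolder",
  "type": "Expression"
}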
Control flow
Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching,
defining parameters at the pipeline level, and passing arguments while invoking the pipeline on-demand or
from a trigger. It also includes custom-state passing and looping containers, that is, For-each iterators.
For more information about Data Factory concepts, see the following articles:
Dataset and linked services
Pipelines and activities
Integration runtime

Supported regions
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on
the following page, and then expand Analytics to locate Data Factory: Products available by region.
However, a data factory can access data stores and compute services in other Azure regions to move data
between data stores or process data using compute services.
Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the
movement of data between supported data stores and the processing of data using compute services in other
regions or in an on-premises environment. It also allows you to monitor and manage workflows by using
both programmatic and UI mechanisms.
Although Data Factory is available only in certain regions, the service that powers the data movement in Data
Factory is available globally in several regions. If a data store is behind a firewall, then a Self-hosted
Integration Runtime that's installed in your on-premises environment moves the data instead.
For an example, let's assume that your compute environments such as Azure HDInsight cluster and Azure
Machine Learning are running out of the West Europe region. You can create and use an Azure Data Factory
instance in East US or East US 2 and use it to schedule jobs on your compute environments in West Europe.
It takes a few milliseconds for Data Factory to trigger the job on your compute environment, but the time for
running the job on your computing environment does not change.

Accessibility
The Data Factory user experience in the Azure portal is accessible.
Compare with version 1
For a list of differences between version 1 and the current version of the Data Factory service, see Compare
with version 1.

Next steps
Get started with creating a Data Factory pipeline by using one of the following tools/SDKs:
Data Factory UI in the Azure portal
Copy Data tool in the Azure portal
PowerShell
.NET
Python
REST
Azure Resource Manager template
Compare Azure Data Factory with Data Factory
version 1
3/5/2019 • 10 minutes to read

This article compares Data Factory with Data Factory version 1. For an introduction to Data Factory, see
Introduction to Data Factory. For an introduction to Data Factory version 1, see Introduction to Azure Data Factory.

Feature comparison
The following table compares the features of Data Factory with the features of Data Factory version 1.

FEATURE: Datasets
VERSION 1: A named view of data that references the data that you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Azure Blob storage from which the activity should read the data. Availability defines the processing window slicing model for the dataset (for example, hourly, daily, and so on).
CURRENT VERSION: Datasets are the same in the current version. However, you do not need to define availability schedules for datasets. You can define a trigger resource that can schedule pipelines from a clock scheduler paradigm. For more information, see Triggers and Datasets.

FEATURE: Linked services
VERSION 1: Linked services are much like connection strings, which define the connection information that's necessary for Data Factory to connect to external resources.
CURRENT VERSION: Linked services are the same as in Data Factory V1, but with a new connectVia property to utilize the Integration Runtime compute environment of the current version of Data Factory. For more information, see Integration runtime in Azure Data Factory and Linked service properties for Azure Blob storage.

FEATURE: Pipelines
VERSION 1: A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. You use startTime, endTime, and isPaused to schedule and run pipelines.
CURRENT VERSION: Pipelines are groups of activities that are performed on data. However, the scheduling of activities in the pipeline has been separated into new trigger resources. You can think of pipelines in the current version of Data Factory more as “workflow units” that you schedule separately via triggers. Pipelines do not have “windows” of time execution in the current version of Data Factory. The Data Factory V1 concepts of startTime, endTime, and isPaused are no longer present in the current version of Data Factory. For more information, see Pipeline execution and triggers and Pipelines and activities.

FEATURE: Activities
VERSION 1: Activities define actions to perform on your data within a pipeline. Data movement (copy activity) and data transformation activities (such as Hive, Pig, and MapReduce) are supported.
CURRENT VERSION: In the current version of Data Factory, activities are still defined actions within a pipeline. The current version of Data Factory introduces new control flow activities. You use these activities in a control flow (looping and branching). Data movement and data transformation activities that were supported in V1 are supported in the current version. You can define transformation activities without using datasets in the current version.

FEATURE: Hybrid data movement and activity dispatch
VERSION 1: Now called Integration Runtime, Data Management Gateway supported moving data between on-premises and cloud.
CURRENT VERSION: Data Management Gateway is now called Self-Hosted Integration Runtime. It provides the same capability as it did in V1. The Azure-SSIS Integration Runtime in the current version of Data Factory also supports deploying and running SQL Server Integration Services (SSIS) packages in the cloud. For more information, see Integration runtime in Azure Data Factory.

FEATURE: Parameters
VERSION 1: NA
CURRENT VERSION: Parameters are key-value pairs of read-only configuration settings that are defined in pipelines. You can pass arguments for the parameters when you are manually running the pipeline. If you are using a scheduler trigger, the trigger can pass values for the parameters too. Activities within the pipeline consume the parameter values.

FEATURE: Expressions
VERSION 1: Data Factory V1 allows you to use functions and system variables in data selection queries and activity/dataset properties.
CURRENT VERSION: In the current version of Data Factory, you can use expressions anywhere in a JSON string value. For more information, see Expressions and functions in the current version of Data Factory.

FEATURE: Pipeline runs
VERSION 1: NA
CURRENT VERSION: A single instance of a pipeline execution. For example, say you have a pipeline that executes at 8 AM, 9 AM, and 10 AM. There would be three separate runs of the pipeline (pipeline runs) in this case. Each pipeline run has a unique pipeline run ID. The pipeline run ID is a GUID that uniquely defines that particular pipeline run. Pipeline runs are typically instantiated by passing arguments to parameters that are defined in the pipelines.

FEATURE: Activity runs
VERSION 1: NA
CURRENT VERSION: An instance of an activity execution within a pipeline.

FEATURE: Trigger runs
VERSION 1: NA
CURRENT VERSION: An instance of a trigger execution. For more information, see Triggers.

FEATURE: Scheduling
VERSION 1: Scheduling is based on pipeline start/end times and dataset availability.
CURRENT VERSION: Scheduler trigger or execution via external scheduler. For more information, see Pipeline execution and triggers.

The following sections provide more information about the capabilities of the current version.

Control flow
To support diverse integration flows and patterns in the modern data warehouse, the current version of Data
Factory has enabled a new flexible data pipeline model that is no longer tied to time-series data. A few common
flows that were previously not possible are now enabled. They are described in the following sections.
Chaining activities
In V1, you had to configure the output of an activity as an input of another activity to chain them. In the current
version, you can chain activities in a sequence within a pipeline. You can use the dependsOn property in an
activity definition to chain it with an upstream activity. For more information and an example, see Pipelines and
activities and Branching and chaining activities.
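For instance, a downstream activity can declare its dependency on an upstream activity named IngestLogs (a hypothetical name) with a fragment like the following inside its JSON definition:

"dependsOn": [
  {
    "activity": "IngestLogs",
    "dependencyConditions": [ "Succeeded" ]
  }
]

Dependency conditions such as Failed, Skipped, and Completed can also be used to build failure-handling paths.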
Branching activities
In the current version, you can branch activities within a pipeline. The If-condition activity provides the same
functionality that an if statement provides in programming languages. It evaluates a set of activities when the
condition evaluates to true and another set of activities when the condition evaluates to false. For examples of
branching activities, see the Branching and chaining activities tutorial.
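A sketch of an If-condition activity is shown below. The expression, the hypothetical Lookup activity named GetChangedRows, and the referenced pipeline are illustrative assumptions rather than names from this article.

{
  "name": "CheckForChanges",
  "type": "IfCondition",
  "typeProperties": {
    "expression": {
      "value": "@greater(activity('GetChangedRows').output.count, 0)",
      "type": "Expression"
    },
    "ifTrueActivities": [
      {
        "name": "RunIncrementalCopy",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": { "referenceName": "IncrementalCopyPipeline", "type": "PipelineReference" }
        }
      }
    ],
    "ifFalseActivities": [
      {
        "name": "WaitBeforeNextCheck",
        "type": "Wait",
        "typeProperties": { "waitTimeInSeconds": 60 }
      }
    ]
  }
}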
Parameters
You can define parameters at the pipeline level and pass arguments while you're invoking the pipeline on-demand
or from a trigger. Activities can consume the arguments that are passed to the pipeline. For more information, see
Pipelines and triggers.
Custom state passing
Activity outputs including state can be consumed by a subsequent activity in the pipeline. For example, in the
JSON definition of an activity, you can access the output of the previous activity by using the following syntax:
@activity('NameofPreviousActivity').output.value. By using this feature, you can build workflows where values
can pass through activities.
Looping containers
The ForEach activity defines a repeating control flow in your pipeline. This activity iterates over a collection and
runs specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping
structure in programming languages.
The Until activity provides the same functionality that a do-until looping structure provides in programming
languages. It runs a set of activities in a loop until the condition that's associated with the activity evaluates to
true. You can specify a timeout value for the Until activity in Data Factory.
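A ForEach activity could be sketched as follows. The tableList parameter, the shape of each item, and the invoked CopySingleTablePipeline are hypothetical and shown only to illustrate the structure.

{
  "name": "CopyEachTable",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@pipeline().parameters.tableList",
      "type": "Expression"
    },
    "isSequential": false,
    "activities": [
      {
        "name": "CopyOneTable",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": { "referenceName": "CopySingleTablePipeline", "type": "PipelineReference" },
          "parameters": { "tableName": "@item().name" }
        }
      }
    ]
  }
}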

Trigger-based flows
Pipelines can be triggered on demand (for example, by an event such as a blob being posted) or by wall-clock time. The pipelines and triggers
article has detailed information about triggers.
Invoking a pipeline from another pipeline
The Execute Pipeline activity allows a Data Factory pipeline to invoke another pipeline.
Delta flows
A key use case in ETL patterns is “delta loads,” in which only data that has changed since the last iteration of a
pipeline is loaded. New capabilities in the current version, such as lookup activity, flexible scheduling, and control
flow, enable this use case in a natural way. For a tutorial with step-by-step instructions, see Tutorial: Incremental
copy.
Other control flow activities
Following are a few more control flow activities that are supported by the current version of Data Factory.

CONTROL ACTIVITY | DESCRIPTION
ForEach activity | Defines a repeating control flow in your pipeline. This activity is used to iterate over a collection and runs specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping structure in programming languages.
Web activity | Calls a custom REST endpoint from a Data Factory pipeline. You can pass datasets and linked services to be consumed and accessed by the activity.
Lookup activity | Reads or looks up a record or table name value from any external source. This output can further be referenced by succeeding activities.
Get metadata activity | Retrieves the metadata of any data in Azure Data Factory.
Wait activity | Pauses the pipeline for a specified period of time.

Deploy SSIS packages to Azure


You use Azure-SSIS if you want to move your SSIS workloads to the cloud, create a data factory by using the
current version, and provision an Azure-SSIS Integration Runtime.
The Azure-SSIS Integration Runtime is a fully managed cluster of Azure VMs (nodes) that are dedicated to
running your SSIS packages in the cloud. After you provision Azure-SSIS Integration Runtime, you can use the
same tools that you have been using to deploy SSIS packages to an on-premises SSIS environment.
For example, you can use SQL Server Data Tools or SQL Server Management Studio to deploy SSIS packages to
this runtime on Azure. For step-by-step instructions, see the tutorial Deploy SQL Server integration services
packages to Azure.

Flexible scheduling
In the current version of Data Factory, you do not need to define dataset availability schedules. You can define a
trigger resource that can schedule pipelines from a clock scheduler paradigm. You can also pass parameters to
pipelines from a trigger for a flexible scheduling and execution model.
Pipelines do not have “windows” of time execution in the current version of Data Factory. The Data Factory V1
concepts of startTime, endTime, and isPaused don't exist in the current version of Data Factory. For more
information about how to build and then schedule a pipeline in the current version of Data Factory, see Pipeline
execution and triggers.

Support for more data stores


The current version supports the copying of data to and from more data stores than V1. For a list of supported
data stores, see the following articles:
Version 1 - supported data stores
Current version - supported data stores

Support for on-demand Spark cluster


The current version supports the creation of an on-demand Azure HDInsight Spark cluster. To create an
on-demand Spark cluster, specify the cluster type as Spark in your on-demand HDInsight linked service definition.
Then you can configure the Spark activity in your pipeline to use this linked service.
At runtime, when the activity is executed, the Data Factory service automatically creates the Spark cluster for you.
For more information, see the following articles:
Spark Activity in the current version of Data Factory
Azure HDInsight on-demand linked service

Custom activities
In V1, you implement (custom) DotNet activity code by creating a .NET class library project with a class that
implements the Execute method of the IDotNetActivity interface. Therefore, you need to write your custom code in
.NET Framework 4.5.2 and run it on Windows-based Azure Batch Pool nodes.
In a custom activity in the current version, you don't have to implement a .NET interface. You can directly run
commands, scripts, and your own custom code compiled as an executable.
For more information, see Difference between custom activity in Data Factory and version 1.

SDKs
The current version of Data Factory provides a richer set of SDKs that can be used to author, manage, and monitor
pipelines.
.NET SDK: The .NET SDK is updated in the current version.
PowerShell: The PowerShell cmdlets are updated in the current version. The cmdlets for the current
version have DataFactoryV2 in the name, for example: Get-AzDataFactoryV2.
Python SDK: This SDK is new in the current version.
REST API: The REST API is updated in the current version.
The SDKs that are updated in the current version are not backward-compatible with V1 clients.

Authoring experience
AUTHORING METHOD | V2 | V1
Azure portal | Yes | Yes
Azure PowerShell | Yes | Yes
.NET SDK | Yes | Yes
REST API | Yes | Yes
Python SDK | Yes | No
Resource Manager template | Yes | Yes

Roles and permissions


The Data Factory version 1 Contributor role can be used to create and manage the current version of Data Factory
resources. For more info, see Data Factory Contributor.

Monitoring experience
In the current version, you can also monitor data factories by using Azure Monitor. The new PowerShell cmdlets
support monitoring of integration runtimes. Both V1 and V2 support visual monitoring via a monitoring
application that can be launched from the Azure portal.

Next steps
Learn how to create a data factory by following step-by-step instructions in the following quickstarts: PowerShell,
.NET, Python, REST API.
Quickstart: Create a data factory by using the
Azure Data Factory UI
2/11/2019 • 10 minutes to read

This quickstart describes how to use the Azure Data Factory UI to create and monitor a data factory.
The pipeline that you create in this data factory copies data from one folder to another folder in Azure
Blob storage. For a tutorial on how to transform data by using Azure Data Factory, see Tutorial:
Transform data by using Spark.

NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory before doing this quickstart.

Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member
of the contributor or owner role, or an administrator of the Azure subscription. To view the permissions
that you have in the subscription, in the Azure portal, select your username in the upper-right corner,
and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account,
see Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.

3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.

5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a
folder named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.

2. On the Blob service page, select + Container on the toolbar.


3. In the New container dialog box, enter adftutorial for the name, and then select OK.

4. Select adftutorial in the list of containers.

5. On the Container page, select Upload on the toolbar.

6. On the Upload blob page, select Advanced.

7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not
already exist.
John, Doe
Jane, Doe

8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the
Files box.
9. Enter input as a value for the Upload to folder box.

10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.

12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
Video
Watching this video helps you understand the Data Factory UI:

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is
supported only in Microsoft Edge and Google Chrome web browsers.
2. Go to the Azure portal.
3. Select Create a resource on the left menu, select Analytics, and then select Data Factory.

4. On the New data factory page, enter ADFTutorialDataFactory for Name.


The name of the Azure data factory must be globally unique. If you see the following error,
change the name of the data factory (for example, <yourname>ADFTutorialDataFactory)
and try creating again. For naming rules for Data Factory artifacts, see the Data Factory -
naming rules article.

5. For Subscription, select your Azure subscription in which you want to create the data factory.
6. For Resource Group, use one of the following steps:
Select Use existing, and select an existing resource group from the list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. For Version, select V2.
8. For Location, select the location for the data factory.
The list shows only locations that Data Factory supports, and where your Azure Data Factory
metadata will be stored. Note that the associated data stores (like Azure Storage and
Azure SQL Database) and computes (like Azure HDInsight) that Data Factory uses can run in
other regions.
9. Select Create.
10. After the creation is complete, you see the Data Factory page. Select the Author & Monitor
tile to start the Azure Data Factory user interface (UI) application on a separate tab.

11. On the Let's get started page, switch to the Author tab in the left panel.
Create a linked service
In this procedure, you create a linked service to link your Azure storage account to the data factory. The
linked service has the connection information that the Data Factory service uses at runtime to connect
to it.
1. Select Connections, and then select the New button on the toolbar.

2. On the New Linked Service page, select Azure Blob Storage, and then select Continue.
3. Complete the following steps:
a. For Name, enter AzureStorageLinkedService.
b. For Storage account name, select the name of your Azure storage account.
c. Select Test connection to confirm that the Data Factory service can connect to the storage
account.
d. Select Finish to save the linked service.
Create datasets
In this procedure, you create two datasets: InputDataset and OutputDataset. These datasets are of
type AzureBlob. They refer to the Azure Storage linked service that you created in the previous
section.
The input dataset represents the source data in the input folder. In the input dataset definition, you
specify the blob container (adftutorial), the folder (input), and the file (emp.txt) that contain the
source data.
The output dataset represents the data that's copied to the destination. In the output dataset definition,
you specify the blob container (adftutorial), the folder (output), and the file to which the data is
copied. Each run of a pipeline has a unique ID associated with it. You can access this ID by using the
system variable RunId. The name of the output file is dynamically evaluated based on the run ID of the
pipeline.
In the linked service settings, you specified the Azure storage account that contains the source data. In
the source dataset settings, you specify where exactly the source data resides (blob container, folder,
and file). In the sink dataset settings, you specify where the data is copied to (blob container, folder, and
file).
1. Select the + (plus) button, and then select Dataset.
2. On the New Dataset page, select Azure Blob Storage, and then select Finish.

3. In the General tab for the dataset, enter InputDataset for Name.
4. Switch to the Connection tab and complete the following steps:
a. For Linked service, select AzureStorageLinkedService.
b. For File path, select the Browse button.
c. In the Choose a file or folder window, browse to the input folder in the adftutorial
container, select the emp.txt file, and then select Finish.

d. (optional) Select Preview data to preview the data in the emp.txt file.
5. Repeat the steps to create the output dataset:
a. Select the + (plus) button, and then select Dataset.
b. On the New Dataset page, select Azure Blob Storage, and then select Finish.
c. On the General tab, specify OutputDataset for the name.
d. On the Connection tab, select AzureStorageLinkedService as the linked service, and enter
adftutorial/output for the folder in the directory field. If the output folder does not exist, the
copy activity creates it at runtime.
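Behind the scenes, the UI generates JSON definitions for these datasets. The input dataset should look roughly like the following sketch; the exact JSON that the designer produces can differ in minor ways.

{
  "name": "InputDataset",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": {
      "referenceName": "AzureStorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "folderPath": "adftutorial/input",
      "fileName": "emp.txt"
    }
  }
}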

Create a pipeline
In this procedure, you create and validate a pipeline with a copy activity that uses the input and output
datasets. The copy activity copies data from the file you specified in the input dataset settings to the file
you specified in the output dataset settings. If the input dataset specifies only a folder (not the file
name), the copy activity copies all the files in the source folder to the destination.
1. Select the + (plus) button, and then select Pipeline.

2. In the General tab, specify CopyPipeline for Name.


3. In the Activities toolbox, expand Move & Transform. Drag the Copy activity from the
Activities toolbox to the pipeline designer surface. You can also search for activities in the
Activities toolbox. Specify CopyFromBlobToBlob for Name.
4. Switch to the Source tab in the copy activity settings, and select InputDataset for Source
Dataset.
5. Switch to the Sink tab in the copy activity settings, and select OutputDataset for Sink Dataset.
6. Click Validate on the pipeline toolbar above the canvas to validate the pipeline settings.
Confirm that the pipeline has been successfully validated. To close the validation output, select
the >> (right arrow) button.
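The designer also lets you view the underlying JSON for the pipeline. For this quickstart it should look roughly like the following sketch; the generated JSON may carry additional default properties such as policy settings.

{
  "name": "CopyPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromBlobToBlob",
        "type": "Copy",
        "inputs": [ { "referenceName": "InputDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "OutputDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "BlobSink" }
        }
      }
    ]
  }
}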

Debug the pipeline


In this step, you debug the pipeline before deploying it to Data Factory.
1. On the pipeline toolbar above the canvas, click Debug to trigger a test run.
2. Confirm that you see the status of the pipeline run on the Output tab of the pipeline settings at
the bottom.
3. Confirm that you see an output file in the output folder of the adftutorial container. If the
output folder does not exist, the Data Factory service automatically creates it.

Trigger the pipeline manually


In this procedure, you deploy entities (linked services, datasets, pipelines) to Azure Data Factory. Then,
you manually trigger a pipeline run.
1. Before you trigger a pipeline, you must publish entities to Data Factory. To publish, select
Publish All on the top.

2. To trigger the pipeline manually, select Trigger on the pipeline toolbar, and then select Trigger
Now.

Monitor the pipeline


1. Switch to the Monitor tab on the left. Use the Refresh button to refresh the list.
2. Select the View Activity Runs link under Actions. You see the status of the copy activity run
on this page.

3. To view details about the copy operation, select the Details (eyeglasses image) link in the
Actions column. For details about the properties, see Copy Activity overview.

4. Confirm that you see a new file in the output folder.


5. You can switch back to the Pipeline Runs view from the Activity Runs view by selecting the
Pipelines link.

Trigger the pipeline on a schedule


This procedure is optional in this tutorial. You can create a scheduler trigger to schedule the pipeline to
run periodically (hourly, daily, and so on). In this procedure, you create a trigger to run every minute
until the end date and time that you specify.
1. Switch to the Author tab.
2. Go to your pipeline, select Trigger on the pipeline toolbar, and then select New/Edit.
3. On the Add Triggers page, select Choose trigger, and then select New.
4. On the New Trigger page, under End, select On Date, specify an end time a few minutes after
the current time, and then select Apply.
A cost is associated with each pipeline run, so set the end time only a few minutes after the
start time, on the same day. Make sure that there is enough time for the
pipeline to run between the publish time and the end time. The trigger takes effect only
after you publish the solution to Data Factory, not when you save the trigger in the UI.

5. On the New Trigger page, select the Activated check box, and then select Next.
6. Review the warning message, and select Finish.

7. Select Publish All to publish changes to Data Factory.


8. Switch to the Monitor tab on the left. Select Refresh to refresh the list. You see that the pipeline
runs once every minute from the publish time to the end time.
Notice the values in the Triggered By column. The manual trigger run was from the step
(Trigger Now) that you did earlier.

9. Switch to the Trigger Runs view.

10. Confirm that an output file is created in the output folder for every pipeline run until the
specified end date and time.
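
The trigger you created through the UI can also be expressed as JSON and deployed with Azure
PowerShell. The following is only a minimal sketch: the trigger name, file name, resource group, data
factory, and start/end times are placeholder values for this quickstart's CopyPipeline, so adjust them
before running.

# Save a schedule trigger definition that runs CopyPipeline every minute until the end time.
@'
{
    "name": "RunEveryMinute",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Minute",
                "interval": 1,
                "startTime": "2019-04-08T00:00:00Z",
                "endTime": "2019-04-08T00:10:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "type": "PipelineReference",
                    "referenceName": "CopyPipeline"
                }
            }
        ]
    }
}
'@ | Set-Content .\RunEveryMinute.json

# Deploy the trigger to the data factory and start it (triggers are created in the Stopped state).
Set-AzDataFactoryV2Trigger -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" `
    -Name "RunEveryMinute" -DefinitionFile ".\RunEveryMinute.json"
Start-AzDataFactoryV2Trigger -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" `
    -Name "RunEveryMinute"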

Next steps
The pipeline in this sample copies data from one location to another location in Azure Blob storage. To
learn about using Data Factory in more scenarios, go through the tutorials.
Quickstart: Use the Copy Data tool to copy
data
4/8/2019 • 6 minutes to read

In this quickstart, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to
create a pipeline that copies data from a folder in Azure Blob storage to another folder.

NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory before doing this quickstart.

Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of
the contributor or owner role, or an administrator of the Azure subscription. To view the permissions
that you have in the subscription, in the Azure portal, select your username in the upper-right corner,
and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article, or use the PowerShell sketch that follows the links below.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
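
For example, if you manage access with Azure PowerShell, a minimal sketch of granting the Data
Factory Contributor role at the resource group scope looks like the following; the user name and
resource group name are placeholders.

# Grant the Data Factory Contributor role at the resource group scope (placeholder values shown).
New-AzRoleAssignment -SignInName "user@contoso.com" `
    -RoleDefinitionName "Data Factory Contributor" `
    -ResourceGroupName "<resourceGroupName>"

# Review the role assignments that the user already has in that resource group.
Get-AzRoleAssignment -SignInName "user@contoso.com" -ResourceGroupName "<resourceGroupName>"
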
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account, see
Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account (a PowerShell alternative
is sketched after the steps):
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.
3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.

5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
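
If you prefer to retrieve the key with Azure PowerShell instead of the portal, a minimal sketch looks
like this; the resource group and storage account names are placeholders.

# Retrieve the first access key of the storage account (placeholder names shown).
$storageKey = (Get-AzStorageAccountKey -ResourceGroupName "<resourceGroupName>" `
    -Name "<storageAccountName>")[0].Value
Write-Output $storageKey
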
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a folder
named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.

2. On the Blob service page, select + Container on the toolbar.


3. In the New container dialog box, enter adftutorial for the name, and then select OK.

4. Select adftutorial in the list of containers.

5. On the Container page, select Upload on the toolbar.

6. On the Upload blob page, select Advanced.

7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not already
exist.
John, Doe
Jane, Doe

8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the Files
box.
9. Enter input as a value for the Upload to folder box.

10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.

12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
Create a data factory
1. Select New on the left menu, select Data + Analytics, and then select Data Factory.

2. On the New data factory page, enter ADFTutorialDataFactory for Name.


The name of the Azure data factory must be globally unique. If you see the following error,
change the name of the data factory (for example, <yourname>ADFTutorialDataFactory) and
try creating again. For naming rules for Data Factory artifacts, see the Data Factory - naming
rules article.

3. For Subscription, select your Azure subscription in which you want to create the data factory.
4. For Resource Group, use one of the following steps:
Select Use existing, and select an existing resource group from the list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
5. For Version, select V2.
6. For Location, select the location for the data factory.
The list shows only the locations that Data Factory supports, and where your Azure Data Factory
metadata will be stored. Note that the associated data stores (like Azure Storage and
Azure SQL Database) and computes (like Azure HDInsight) that Data Factory uses can run in
other regions.
7. Select Create.
8. After the creation is complete, you see the Data Factory page. Select the Author & Monitor tile
to start the Azure Data Factory user interface (UI) application on a separate tab.

Start the Copy Data tool


1. On the Let's get started page, select the Copy Data tile to start the Copy Data tool.
2. On the Properties page of the Copy Data tool, you can specify a name and description for the
pipeline, and then select Next.

3. On the Source data store page, complete the following steps:


a. Click + Create new connection to add a connection.
b. Select Azure Blob Storage from the gallery, and then select Next.

c. On the Specify the Azure Blob storage account page, select your storage account from the
Storage account name list, and then select Finish.
d. Select the newly created linked service as source, then click Next.

4. On the Choose the input file or folder page, complete the following steps:
a. Click Browse to navigate to the adftutorial/input folder, select the emp.txt file, then click
Choose.
b. Check the Binary copy option to copy the file as-is, then select Next.

5. On the Destination data store page, select the Azure Blob Storage linked service you just
created, and then select Next.
6. On the Choose the output file or folder page, enter adftutorial/output for the folder path,
then select Next.

7. On the Settings page, select Next to use the default configurations.


8. On the Summary page, review all settings, and select Next.
9. On the Deployment complete page, select Monitor to monitor the pipeline that you created.

10. The application switches to the Monitor tab. You see the status of the pipeline on this tab. Select
Refresh to refresh the list.

11. Select the View Activity Runs link in the Actions column. The pipeline has only one activity of
type Copy.
12. To view details about the copy operation, select the Details (eyeglasses image) link in the
Actions column. For details about the properties, see Copy Activity overview.

13. Verify that the emp.txt file is created in the output folder of the adftutorial container. If the
output folder does not exist, the Data Factory service automatically creates it.
14. Switch to the Author tab above the Monitor tab on the left panel so that you can edit linked
services, datasets, and pipelines. To learn about editing them in the Data Factory UI, see Create a
data factory by using the Azure portal.

Next steps
The pipeline in this sample copies data from one location to another location in Azure Blob storage. To
learn about using Data Factory in more scenarios, go through the tutorials.
Quickstart: Create an Azure data factory using
PowerShell
3/5/2019 • 12 minutes to read

This quickstart describes how to use PowerShell to create an Azure data factory. The pipeline you
create in this data factory copies data from one folder to another folder in an Azure blob storage. For
a tutorial on how to transform data using Azure Data Factory, see Tutorial: Transform data using
Spark.

NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the
Azure Data Factory service, see Introduction to Azure Data Factory.

Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member
of the contributor or owner role, or an administrator of the Azure subscription. To view the
permissions that you have in the subscription, in the Azure portal, select your username in the upper-
right corner, and then select Permissions. If you have access to multiple subscriptions, select the
appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account,
see Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.

3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.

5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a
folder named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.
2. On the Blob service page, select + Container on the toolbar.

3. In the New container dialog box, enter adftutorial for the name, and then select OK.

4. Select adftutorial in the list of containers.

5. On the Container page, select Upload on the toolbar.

6. On the Upload blob page, select Advanced.

7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not
already exist.

John, Doe
Jane, Doe

8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the
Files box.
9. Enter input as a value for the Upload to folder box.

10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.
12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
Azure PowerShell

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM
module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az
module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module
installation instructions, see Install Azure PowerShell.

Install the latest Azure PowerShell modules by following instructions in How to install and configure
Azure PowerShell.
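
For example, to install the Az module from the PowerShell Gallery for the current user and confirm
that the Data Factory cmdlets are available, you can run something like the following sketch:

# Install the Az module for the current user (requires PowerShellGet).
Install-Module -Name Az -Scope CurrentUser -AllowClobber

# Confirm that the Data Factory cmdlets are available.
Get-Module -ListAvailable -Name Az.DataFactory
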
Log in to PowerShell
1. Launch PowerShell on your machine. Keep PowerShell open until the end of this quickstart. If
you close and reopen, you need to run these commands again.
2. Run the following command, and enter the same Azure user name and password that you use
to sign in to the Azure portal:

Connect-AzAccount

3. Run the following command to view all the subscriptions for this account:

Get-AzSubscription

4. If you see multiple subscriptions associated with your account, run the following command to
select the subscription that you want to work with. Replace SubscriptionId with the ID of your
Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

Create a data factory


1. Define a variable for the resource group name that you use in PowerShell commands later.
Copy the following command text to PowerShell, specify a name for the Azure resource group
in double quotes, and then run the command. For example: "ADFQuickStartRG" .

$resourceGroupName = "ADFQuickStartRG";

If the resource group already exists, you may not want to overwrite it. Assign a different value
to the $ResourceGroupName variable and run the command again.
2. To create the Azure resource group, run the following command:

$ResGrp = New-AzResourceGroup $resourceGroupName -location 'East US'

If the resource group already exists, you may not want to overwrite it. Assign a different value
to the $ResourceGroupName variable and run the command again.
3. Define a variable for the data factory name.
IMPORTANT
Update the data factory name to be globally unique. For example, ADFTutorialFactorySP1127.

$dataFactoryName = "ADFQuickStartFactory";

4. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet, using the Location
and ResourceGroupName property from the $ResGrp variable:

$DataFactory = Set-AzDataFactoryV2 -ResourceGroupName $ResGrp.ResourceGroupName `
    -Location $ResGrp.Location -Name $dataFactoryName

Note the following points:


The name of the Azure data factory must be globally unique. If you receive the following error,
change the name and try again.

The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data
Factory names must be globally unique.

To create Data Factory instances, the user account you use to log in to Azure must be a
member of contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory:
Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.

Create a linked service


Create linked services in a data factory to link your data stores and compute services to the data
factory. In this quickstart, you create an Azure Storage linked service that is used as both the source
and sink stores. The linked service has the connection information that the Data Factory service uses
at runtime to connect to it.
1. Create a JSON file named AzureStorageLinkedService.json in the C:\ADFv2QuickStartPSH
folder with the following content. (Create the folder ADFv2QuickStartPSH if it does not
already exist.)

IMPORTANT
Replace <accountName> and <accountKey> with the name and key of your Azure storage account before
saving the file.
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": {
                "value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=<accountKey>;EndpointSuffix=core.windows.net",
                "type": "SecureString"
            }
        }
    }
}

If you are using Notepad, select All files for the Save as type field in the Save as dialog box.
Otherwise, it may add the .txt extension to the file. For example,
AzureStorageLinkedService.json.txt. If you create the file in File Explorer before opening it in
Notepad, you may not see the .txt extension because the Hide extensions for known file
types option is set by default. Remove the .txt extension before proceeding to the next step.
2. In PowerShell, switch to the ADFv2QuickStartPSH folder.

Set-Location 'C:\ADFv2QuickStartPSH'

3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service
AzureStorageLinkedService:

Set-AzDataFactoryV2LinkedService -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName -Name "AzureStorageLinkedService" `
    -DefinitionFile ".\AzureStorageLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService

Create a dataset
In this step, you define a dataset that represents the data to copy from a source to a sink. The dataset
is of type AzureBlob. It refers to the Azure Storage linked service you created in the previous step.
It takes a parameter to construct the folderPath property. For an input dataset, the copy activity in
the pipeline passes the input path as a value for this parameter. Similarly, for an output dataset, the
copy activity passes the output path as a value for this parameter.
1. Create a JSON file named BlobDataset.json in the C:\ADFv2QuickStartPSH folder, with
the following content:
{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "@{dataset().path}"
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}

2. To create the dataset: BlobDataset, run the Set-AzDataFactoryV2Dataset cmdlet.

Set-AzDataFactoryV2Dataset -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName -Name "BlobDataset" `
    -DefinitionFile ".\BlobDataset.json"

Here is the sample output:

DatasetName : BlobDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset

Create a pipeline
In this quickstart, you create a pipeline with one activity that takes two parameters - input blob path
and output blob path. The values for these parameters are set when the pipeline is triggered/run. The
copy activity uses the same blob dataset created in the previous step as both input and output. When the
dataset is used as an input dataset, the input path is specified; when it is used as an output
dataset, the output path is specified.
1. Create a JSON file named Adfv2QuickStartPipeline.json in the C:\ADFv2QuickStartPSH
folder with the following content:
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
}
}
}
}

2. To create the pipeline Adfv2QuickStartPipeline, run the Set-AzDataFactoryV2Pipeline cmdlet:

$DFPipeLine = Set-AzDataFactoryV2Pipeline `
-DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName `
-Name "Adfv2QuickStartPipeline" `
-DefinitionFile ".\Adfv2QuickStartPipeline.json"

Create a pipeline run


In this step, you set values for the pipeline parameters: inputPath and outputPath with actual values
of source and sink blob paths. Then, you create a pipeline run by using these arguments.
1. Create a JSON file named PipelineParameters.json in the C:\ADFv2QuickStartPSH folder
with the following content:
{
"inputPath": "adftutorial/input",
"outputPath": "adftutorial/output"
}

2. Run the Invoke-AzDataFactoryV2Pipeline cmdlet to create a pipeline run and pass in the
parameter values. The cmdlet returns the pipeline run ID for future monitoring.

$RunId = Invoke-AzDataFactoryV2Pipeline `
-DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName `
-PipelineName $DFPipeLine.Name `
-ParameterFile .\PipelineParameters.json

Monitor the pipeline run


1. Run the following PowerShell script to continuously check the pipeline run status until it
finishes copying the data. Copy/paste the following script in the PowerShell window, and press
ENTER.

while ($True) {
$Run = Get-AzDataFactoryV2PipelineRun `
-ResourceGroupName $ResGrp.ResourceGroupName `
-DataFactoryName $DataFactory.DataFactoryName `
-PipelineRunId $RunId

if ($Run) {
if ($run.Status -ne 'InProgress') {
Write-Output ("Pipeline run finished. The status is: " + $Run.Status)
$Run
break
}
Write-Output "Pipeline is running...status: InProgress"
}

Start-Sleep -Seconds 10
}

Here is the sample output of the pipeline run:

Pipeline is running...status: InProgress


Pipeline run finished. The status is: Succeeded

ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : SPTestFactory0928
RunId : 0000000000-0000-0000-0000-0000000000000
PipelineName : Adfv2QuickStartPipeline
LastUpdated : 9/28/2017 8:28:38 PM
Parameters : {[inputPath, adftutorial/input], [outputPath, adftutorial/output]}
RunStart : 9/28/2017 8:28:14 PM
RunEnd : 9/28/2017 8:28:38 PM
DurationInMs : 24151
Status : Succeeded
Message :

You might see the following error:


Activity CopyFromBlobToBlob failed: Failed to detect region of linked service
'AzureStorage' : 'AzureStorageLinkedService' with error '[Region Resolver] Azure Storage
failed to get address for DNS. Warning: System.Net.Sockets.SocketException (0x80004005): No
such host is known

If you see the error, perform the following steps:


a. In the AzureStorageLinkedService.json, confirm that the name and key of your Azure
Storage Account are correct.
b. Verify that the format of the connection string is correct. The properties (for example,
AccountName and AccountKey) are separated by the semicolon (;) character.
c. If you have angle brackets surrounding the account name and account key, remove
them.
d. Here is an example connection string:

"connectionString": {
    "value": "DefaultEndpointsProtocol=https;AccountName=mystorageaccountname;AccountKey=mystorageaccountkey;EndpointSuffix=core.windows.net",
    "type": "SecureString"
}

e. Recreate the linked service by following steps in the Create a linked service section.
f. Rerun the pipeline by following steps in the Create a pipeline run section.
g. Run the current monitoring command again to monitor the new pipeline run.
2. Run the following script to retrieve copy activity run details, for example, size of the data
read/written.

Write-Output "Activity run details:"


$Result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName -PipelineRunId $RunId `
    -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
$Result

Write-Output "Activity 'Output' section:"


$Result.Output -join "`r`n"

Write-Output "Activity 'Error' section:"


$Result.Error -join "`r`n"

3. Confirm that you see the output similar to the following sample output of activity run result:
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : SPTestFactory0928
ActivityName : CopyFromBlobToBlob
PipelineRunId : 00000000000-0000-0000-0000-000000000000
PipelineName : Adfv2QuickStartPipeline
Input : {source, sink}
Output : {dataRead, dataWritten, copyDuration, throughput...}
LinkedServiceName :
ActivityRunStart : 9/28/2017 8:28:18 PM
ActivityRunEnd : 9/28/2017 8:28:36 PM
DurationInMs : 18095
Status : Succeeded
Error : {errorCode, message, failureType, target}

Activity 'Output' section:


"dataRead": 38
"dataWritten": 38
"copyDuration": 7
"throughput": 0.01
"errors": []
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (West US)"
"usedDataIntegrationUnits": 2
"billedDuration": 14

Verify the output


The pipeline automatically creates the output folder in the adftutorial blob container. Then, it copies
the emp.txt file from the input folder to the output folder.
1. In the Azure portal, on the adftutorial container page, click Refresh to see the output folder.

2. Click output in the folder list.


3. Confirm that the emp.txt is copied to the output folder.
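
As an alternative to browsing the portal, you can list the blobs from the same PowerShell session.
This is a minimal sketch that reuses the storage account name and key you noted earlier (shown here
as placeholders):

# Create a storage context and list the blobs under the output folder of the adftutorial container.
$ctx = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"
Get-AzStorageBlob -Container "adftutorial" -Prefix "output/" -Context $ctx
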
Clean up resources
You can clean up the resources that you created in the Quickstart in two ways. You can delete the
Azure resource group, which includes all the resources in the resource group. If you want to keep the
other resources intact, delete only the data factory you created in this tutorial.
Deleting a resource group deletes all resources including data factories in it. Run the following
command to delete the entire resource group:

Remove-AzResourceGroup -ResourceGroupName $resourcegroupname

Note: Deleting a resource group may take some time. Please be patient with the process.
If you want to delete just the data factory, not the entire resource group, run the following command:

Remove-AzDataFactoryV2 -Name $dataFactoryName -ResourceGroupName $resourceGroupName

Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob
storage. Go through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create a data factory and pipeline
using .NET SDK
4/28/2019 • 11 minutes to read

This quickstart describes how to use .NET SDK to create an Azure data factory. The pipeline you
create in this data factory copies data from one folder to another folder in an Azure blob storage. For
a tutorial on how to transform data using Azure Data Factory, see Tutorial: Transform data using
Spark.

NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the
Azure Data Factory service, see Introduction to Azure Data Factory.

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member
of the contributor or owner role, or an administrator of the Azure subscription. To view the
permissions that you have in the subscription, in the Azure portal, select your username in the upper-
right corner, and then select Permissions. If you have access to multiple subscriptions, select the
appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account,
see Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and
password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.

3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.

5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a
folder named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.
2. On the Blob service page, select + Container on the toolbar.

3. In the New container dialog box, enter adftutorial for the name, and then select OK.

4. Select adftutorial in the list of containers.

5. On the Container page, select Upload on the toolbar.

6. On the Upload blob page, select Advanced.


7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not
already exist.

John, Doe
Jane, Doe

8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the
Files box.
9. Enter input as a value for the Upload to folder box.

10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.
12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
Visual Studio
The walkthrough in this article uses Visual Studio 2017. You can also use Visual Studio 2013 or 2015.
Azure .NET SDK
Download and install Azure .NET SDK on your machine.

Create an application in Azure Active Directory


Follow the instructions from the sections in this article to do the following tasks:
1. Create an Azure Active Directory application. Create an application in Azure Active Directory
that represents the .NET application you are creating in this tutorial. For the sign-on URL, you can
provide a dummy URL as shown in the article ( https://contoso.org/exampleapp ).
2. Get the application ID and authentication key, and note down these values that you use later
in this tutorial.
3. Get the tenant ID and note down this value that you use later in this tutorial.
4. Assign the application to the Contributor role at the subscription level so that the application can
create data factories in the subscription.

Create a Visual Studio project


Using Visual Studio 2013/2015/2017, create a C# .NET console application.
1. Launch Visual Studio.
2. Click File, point to New, and click Project.
3. Select Visual C# -> Console App (.NET Framework) from the list of project types on the right.
.NET version 4.5.2 or above is required.
4. Enter ADFv2QuickStart for the Name.
5. Click OK to create the project.
Install NuGet packages
1. Click Tools -> NuGet Package Manager -> Package Manager Console.
2. In the Package Manager Console, run the following commands to install packages. Refer to
the Microsoft.Azure.Management.DataFactory NuGet package page for details.

Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory

Create a data factory client


1. Open Program.cs, include the following statements to add references to namespaces.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

2. Add the following code to the Main method that sets the variables. Replace the placeholders
with your own values. For a list of Azure regions in which Data Factory is currently available,
select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory: Products available by region. The data stores (Azure Storage, Azure SQL
Database, etc.) and computes (HDInsight, etc.) used by data factory can be in other regions.

// Set variables
string tenantID = "<your tenant ID>";
string applicationId = "<your application ID>";
string authenticationKey = "<your authentication key for the application>";
string subscriptionId = "<your subscription ID where the data factory resides>";
string resourceGroup = "<your resource group where the data factory resides>";
string region = "East US 2";
string dataFactoryName = "<specify the name of data factory to create. It must be globally unique.>";
string storageAccount = "<your storage account name to copy data>";
string storageKey = "<your storage account key>";
// Specify the container and input folder from which all files need to be copied to the output folder.
string inputBlobPath = "<the path to existing blob(s) to copy data from, e.g. containername/foldername>";
// Specify the container and output folder where the files are copied.
string outputBlobPath = "<the blob path to copy data to, e.g. containername/foldername>";

string storageLinkedServiceName = "AzureStorageLinkedService"; // name of the Azure Storage linked service
string blobDatasetName = "BlobDataset"; // name of the blob dataset
string pipelineName = "Adfv2QuickStartPipeline"; // name of the pipeline

3. Add the following code to the Main method that creates an instance of
DataFactoryManagementClient class. You use this object to create a data factory, a linked
service, datasets, and a pipeline. You also use this object to monitor the pipeline run details.
// Authenticate and create a data factory management client
var context = new AuthenticationContext("https://login.windows.net/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync("https://management.azure.com/", cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) { SubscriptionId = subscriptionId };

Create a data factory


Add the following code to the Main method that creates a data factory.

// Create a data factory


Console.WriteLine("Creating data factory " + dataFactoryName + "...");
Factory dataFactory = new Factory
{
Location = region,
Identity = new FactoryIdentity()
};
client.Factories.CreateOrUpdate(resourceGroup, dataFactoryName, dataFactory);
Console.WriteLine(SafeJsonConvert.SerializeObject(dataFactory, client.SerializationSettings));

while (client.Factories.Get(resourceGroup, dataFactoryName).ProvisioningState == "PendingCreation")
{
    System.Threading.Thread.Sleep(1000);
}

Create a linked service


Add the following code to the Main method that creates an Azure Storage linked service.
You create linked services in a data factory to link your data stores and compute services to the data
factory. In this Quickstart, you only need to create one Azure Storage linked service for both the copy
source and sink store, named "AzureStorageLinkedService" in the sample.

// Create an Azure Storage linked service


Console.WriteLine("Creating linked service " + storageLinkedServiceName + "...");

LinkedServiceResource storageLinkedService = new LinkedServiceResource(
    new AzureStorageLinkedService
{
ConnectionString = new SecureString("DefaultEndpointsProtocol=https;AccountName=" +
storageAccount + ";AccountKey=" + storageKey)
}
);
client.LinkedServices.CreateOrUpdate(resourceGroup, dataFactoryName, storageLinkedServiceName,
storageLinkedService);
Console.WriteLine(SafeJsonConvert.SerializeObject(storageLinkedService,
client.SerializationSettings));

Create a dataset
Add the following code to the Main method that creates an Azure blob dataset.
You define a dataset that represents the data to copy from a source to a sink. In this example, this
Blob dataset references the Azure Storage linked service you created in the previous step. The
dataset takes a parameter whose value is set in an activity that consumes the dataset. The parameter
is used to construct the "folderPath" that points to where the data resides.

// Create an Azure Blob dataset


Console.WriteLine("Creating dataset " + blobDatasetName + "...");
DatasetResource blobDataset = new DatasetResource(
new AzureBlobDataset
{
LinkedServiceName = new LinkedServiceReference
{
ReferenceName = storageLinkedServiceName
},
FolderPath = new Expression { Value = "@{dataset().path}" },
Parameters = new Dictionary<string, ParameterSpecification>
{
{ "path", new ParameterSpecification { Type = ParameterType.String } }

}
}
);
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobDatasetName, blobDataset);
Console.WriteLine(SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings));

Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity.
In this example, this pipeline contains one activity and takes two parameters: the input blob path and
the output blob path. The values for these parameters are set when the pipeline is triggered/run. The
copy activity refers to the same blob dataset created in the previous step as both input and output. When
the dataset is used as an input dataset, the input path is specified; when it is used as an output
dataset, the output path is specified.
// Create a pipeline with a copy activity
Console.WriteLine("Creating pipeline " + pipelineName + "...");
PipelineResource pipeline = new PipelineResource
{
Parameters = new Dictionary<string, ParameterSpecification>
{
{ "inputPath", new ParameterSpecification { Type = ParameterType.String } },
{ "outputPath", new ParameterSpecification { Type = ParameterType.String } }
},
Activities = new List<Activity>
{
new CopyActivity
{
Name = "CopyFromBlobToBlob",
Inputs = new List<DatasetReference>
{
new DatasetReference()
{
ReferenceName = blobDatasetName,
Parameters = new Dictionary<string, object>
{
{ "path", "@pipeline().parameters.inputPath" }
}
}
},
Outputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobDatasetName,
Parameters = new Dictionary<string, object>
{
{ "path", "@pipeline().parameters.outputPath" }
}
}
},
Source = new BlobSource { },
Sink = new BlobSink { }
}
}
};
client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName, pipeline);
Console.WriteLine(SafeJsonConvert.SerializeObject(pipeline, client.SerializationSettings));

Create a pipeline run


Add the following code to the Main method that triggers a pipeline run.
This code also sets values of inputPath and outputPath parameters specified in pipeline with the
actual values of source and sink blob paths.

// Create a pipeline run


Console.WriteLine("Creating pipeline run...");
Dictionary<string, object> parameters = new Dictionary<string, object>
{
{ "inputPath", inputBlobPath },
{ "outputPath", outputBlobPath }
};
CreateRunResponse runResponse = client.Pipelines.CreateRunWithHttpMessagesAsync(resourceGroup,
dataFactoryName, pipelineName, parameters: parameters).Result.Body;
Console.WriteLine("Pipeline run ID: " + runResponse.RunId);
Monitor a pipeline run
1. Add the following code to the Main method to continuously check the status until it finishes
copying the data.

// Monitor the pipeline run


Console.WriteLine("Checking pipeline run status...");
PipelineRun pipelineRun;
while (true)
{
pipelineRun = client.PipelineRuns.Get(resourceGroup, dataFactoryName,
runResponse.RunId);
Console.WriteLine("Status: " + pipelineRun.Status);
if (pipelineRun.Status == "InProgress")
System.Threading.Thread.Sleep(15000);
else
break;
}

2. Add the following code to the Main method that retrieves copy activity run details, for
example, size of the data read/written.

// Check the copy activity run details


Console.WriteLine("Checking copy activity run details...");

RunFilterParameters filterParams = new RunFilterParameters(
    DateTime.UtcNow.AddMinutes(-10), DateTime.UtcNow.AddMinutes(10));
ActivityRunsQueryResponse queryResponse = client.ActivityRuns.QueryByPipelineRun(
    resourceGroup, dataFactoryName, runResponse.RunId, filterParams);
if (pipelineRun.Status == "Succeeded")
Console.WriteLine(queryResponse.Value.First().Output);
else
Console.WriteLine(queryResponse.Value.First().Error);
Console.WriteLine("\nPress any key to exit...");
Console.ReadKey();

Run the code


Build and start the application, then verify the pipeline execution.
The console prints the progress of creating the data factory, linked service, datasets, pipeline, and
pipeline run. It then checks the pipeline run status. Wait until you see the copy activity run details with
the data read/written size. Then, use tools such as Azure Storage Explorer to check that the blob(s) were
copied from "inputBlobPath" to "outputBlobPath" as you specified in the variables.
Sample output

Creating data factory SPv2Factory0907...


{
"identity": {
"type": "SystemAssigned"
},
"location": "East US"
}
Creating linked service AzureStorageLinkedService...
{
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<storageAccountName>;AccountKey=
<storageAccountKey>",
"type": "SecureString"
}
}
}
}
Creating dataset BlobDataset...
{
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}
Creating pipeline Adfv2QuickStartPipeline...
{
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"name": "CopyFromBlobToBlob"
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
}
}
}
}
Creating pipeline run...
Pipeline run ID: 308d222d-3858-48b1-9e66-acd921feaa09
Checking pipeline run status...
Status: InProgress
Status: InProgress
Checking copy activity run details...
{
"dataRead": 331452208,
"dataWritten": 331452208,
"copyDuration": 23,
"throughput": 14073.209,
"errors": [],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (West US)",
"usedDataIntegrationUnits": 2,
"billedDuration": 23
}

Press any key to exit...

Verify the output


The pipeline automatically creates the output folder in the adftutorial blob container. Then, it copies
the emp.txt file from the input folder to the output folder.
1. In the Azure portal, on the adftutorial container page, click Refresh to see the output folder.

2. Click output in the folder list.


3. Confirm that the emp.txt is copied to the output folder.
Clean up resources
To programmatically delete the data factory, add the following lines of code to the program:

Console.WriteLine("Deleting the data factory");


client.Factories.Delete(resourceGroup, dataFactoryName);

Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob
storage. Go through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create a data factory and pipeline
using Python
3/6/2019 • 9 minutes to read

Azure Data Factory is a cloud-based data integration service that allows you to create data-driven
workflows in the cloud for orchestrating and automating data movement and data transformation.
Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that
can ingest data from disparate data stores, process/transform the data by using compute services such
as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning, and
publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI)
applications to consume.
This quickstart describes how to use Python to create an Azure data factory. The pipeline in this data
factory copies data from one folder to another folder in an Azure blob storage.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure Storage account. You use the blob storage as source and sink data store. If you don't have
an Azure storage account, see the Create a storage account article for steps to create one.
Create an application in Azure Active Directory by following these instructions. Make note of the
following values that you use in later steps: application ID, authentication key, and tenant ID.
Assign the application to the "Contributor" role by following the instructions in the same article.
Create and upload an input file
1. Launch Notepad. Copy the following text and save it as input.txt file on your disk.

John|Doe
Jane|Doe

2. Use tools such as Azure Storage Explorer to create the adfv2tutorial container and an input folder
in the container. Then, upload the input.txt file to the input folder. (An Azure PowerShell sketch of
this step follows the list.)
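
If you don't have Azure Storage Explorer installed, a minimal Azure PowerShell sketch for this step
looks like the following; the storage account name and key are placeholders, and it assumes input.txt
is in the current folder.

# Create the adfv2tutorial container and upload input.txt to its input folder (placeholder values shown).
$ctx = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"
New-AzStorageContainer -Name "adfv2tutorial" -Context $ctx
Set-AzStorageBlobContent -File ".\input.txt" -Container "adfv2tutorial" -Blob "input/input.txt" -Context $ctx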

Install the Python package


1. Open a terminal or command prompt with administrator privileges.
2. First, install the Python package for Azure management resources:

pip install azure-mgmt-resource

3. To install the Python package for Data Factory, run the following command:

pip install azure-mgmt-datafactory

The Python SDK for Data Factory supports Python 2.7, 3.3, 3.4, 3.5, 3.6 and 3.7.
Create a data factory client
1. Create a file named datafactory.py. Add the following statements to add references to
namespaces.

from azure.common.credentials import ServicePrincipalCredentials


from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import *
from datetime import datetime, timedelta
import time

2. Add the following functions that print information.

def print_item(group):
"""Print an Azure object instance."""
print("\tName: {}".format(group.name))
print("\tId: {}".format(group.id))
if hasattr(group, 'location'):
print("\tLocation: {}".format(group.location))
if hasattr(group, 'tags'):
print("\tTags: {}".format(group.tags))
if hasattr(group, 'properties'):
print_properties(group.properties)

def print_properties(props):
"""Print a ResourceGroup properties instance."""
if props and hasattr(props, 'provisioning_state') and props.provisioning_state:
print("\tProperties:")
print("\t\tProvisioning State: {}".format(props.provisioning_state))
print("\n\n")

def print_activity_run_details(activity_run):
"""Print activity run details."""
print("\n\tActivity run details\n")
print("\tActivity run status: {}".format(activity_run.status))
if activity_run.status == 'Succeeded':
print("\tNumber of bytes read: {}".format(activity_run.output['dataRead']))
print("\tNumber of bytes written: {}".format(activity_run.output['dataWritten']))
print("\tCopy duration: {}".format(activity_run.output['copyDuration']))
else:
print("\tErrors: {}".format(activity_run.error['message']))

3. Add the following code to the Main method that creates an instance of
DataFactoryManagementClient class. You use this object to create the data factory, linked
service, datasets, and pipeline. You also use this object to monitor the pipeline run details. Set
subscription_id variable to the ID of your Azure subscription. For a list of Azure regions in
which Data Factory is currently available, select the regions that interest you on the following
page, and then expand Analytics to locate Data Factory: Products available by region. The data
stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
def main():

# Azure subscription ID
subscription_id = '<Specify your Azure Subscription ID>'

# This program creates this resource group. If it's an existing resource group, comment
# out the code that creates the resource group.
rg_name = 'ADFTutorialResourceGroup'

# The data factory name. It must be globally unique.


df_name = '<Specify a name for the data factory. It must be globally unique>'

# Specify your Active Directory client ID, client secret, and tenant ID
credentials = ServicePrincipalCredentials(
    client_id='<Active Directory application/client ID>',
    secret='<client secret>', tenant='<Active Directory tenant ID>')
resource_client = ResourceManagementClient(credentials, subscription_id)
adf_client = DataFactoryManagementClient(credentials, subscription_id)

rg_params = {'location':'eastus'}
df_params = {'location':'eastus'}

Create a data factory


Add the following code to the Main method that creates a data factory. If your resource group
already exists, comment out the first create_or_update statement.

# create the resource group


# comment out if the resource group already exists
resource_client.resource_groups.create_or_update(rg_name, rg_params)

#Create a data factory


df_resource = Factory(location='eastus')
df = adf_client.factories.create_or_update(rg_name, df_name, df_resource)
print_item(df)
while df.provisioning_state != 'Succeeded':
df = adf_client.factories.get(rg_name, df_name)
time.sleep(1)

Create a linked service


Add the following code to the Main method that creates an Azure Storage linked service.
You create linked services in a data factory to link your data stores and compute services to the data
factory. In this quickstart, you only need to create one Azure Storage linked service as both the copy source
and sink store, named "AzureStorageLinkedService" in the sample. Replace <storageaccountname> and
<storageaccountkey> with the name and key of your Azure Storage account.

# Create an Azure Storage linked service


ls_name = 'storageLinkedService'

# IMPORTANT: specify the name and key of your Azure Storage account.
storage_string = SecureString('DefaultEndpointsProtocol=https;AccountName=<storageaccountname>;AccountKey=<storageaccountkey>')

ls_azure_storage = AzureStorageLinkedService(connection_string=storage_string)
ls = adf_client.linked_services.create_or_update(rg_name, df_name, ls_name, ls_azure_storage)
print_item(ls)
Create datasets
In this section, you create two datasets: one for the source and the other for the sink.
Create a dataset for source Azure Blob
Add the following code to the Main method that creates an Azure blob dataset. For information about
properties of Azure Blob dataset, see Azure blob connector article.
You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure
Storage linked service you created in the previous step.

# Create an Azure blob dataset (input)


ds_name = 'ds_in'
ds_ls = LinkedServiceReference(ls_name)
blob_path= 'adfv2tutorial/input'
blob_filename = 'input.txt'
ds_azure_blob= AzureBlobDataset(ds_ls, folder_path=blob_path, file_name = blob_filename)
ds = adf_client.datasets.create_or_update(rg_name, df_name, ds_name, ds_azure_blob)
print_item(ds)

Create a dataset for sink Azure Blob


Add the following code to the Main method that creates an Azure blob dataset. For information about
properties of Azure Blob dataset, see Azure blob connector article.
You define a dataset that represents the sink (output) data in Azure Blob. This Blob dataset refers to the Azure
Storage linked service you created in the previous step.

# Create an Azure blob dataset (output)


dsOut_name = 'ds_out'
output_blobpath = 'adfv2tutorial/output'
dsOut_azure_blob = AzureBlobDataset(ds_ls, folder_path=output_blobpath)
dsOut = adf_client.datasets.create_or_update(rg_name, df_name, dsOut_name, dsOut_azure_blob)
print_item(dsOut)

Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity.

# Create a copy activity


act_name = 'copyBlobtoBlob'
blob_source = BlobSource()
blob_sink = BlobSink()
dsin_ref = DatasetReference(ds_name)
dsOut_ref = DatasetReference(dsOut_name)
copy_activity = CopyActivity(act_name,inputs=[dsin_ref], outputs=[dsOut_ref],
source=blob_source, sink=blob_sink)

#Create a pipeline with the copy activity


p_name = 'copyPipeline'
params_for_pipeline = {}
p_obj = PipelineResource(activities=[copy_activity], parameters=params_for_pipeline)
p = adf_client.pipelines.create_or_update(rg_name, df_name, p_name, p_obj)
print_item(p)

Create a pipeline run


Add the following code to the Main method that triggers a pipeline run.
#Create a pipeline run.
run_response = adf_client.pipelines.create_run(rg_name, df_name, p_name,
{
}
)

Monitor a pipeline run


To monitor the pipeline run, add the following code to the Main method:

#Monitor the pipeline run


time.sleep(30)
pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run_response.run_id)
print("\n\tPipeline run status: {}".format(pipeline_run.status))
activity_runs_paged = list(adf_client.activity_runs.list_by_pipeline_run(rg_name, df_name,
pipeline_run.run_id, datetime.now() - timedelta(1), datetime.now() + timedelta(1)))
print_activity_run_details(activity_runs_paged[0])

Now, add the following statement to invoke the main method when the program is run:

# Start the main method


main()

Full script
Here is the full Python code:

from azure.common.credentials import ServicePrincipalCredentials


from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import *
from datetime import datetime, timedelta
import time

def print_item(group):
"""Print an Azure object instance."""
print("\tName: {}".format(group.name))
print("\tId: {}".format(group.id))
if hasattr(group, 'location'):
print("\tLocation: {}".format(group.location))
if hasattr(group, 'tags'):
print("\tTags: {}".format(group.tags))
if hasattr(group, 'properties'):
print_properties(group.properties)
print("\n")

def print_properties(props):
"""Print a ResourceGroup properties instance."""
if props and hasattr(props, 'provisioning_state') and props.provisioning_state:
print("\tProperties:")
print("\t\tProvisioning State: {}".format(props.provisioning_state))
print("\n")

def print_activity_run_details(activity_run):
"""Print activity run details."""
print("\n\tActivity run details\n")
print("\tActivity run status: {}".format(activity_run.status))
if activity_run.status == 'Succeeded':
print("\tNumber of bytes read: {}".format(activity_run.output['dataRead']))
print("\tNumber of bytes written: {}".format(activity_run.output['dataWritten']))
print("\tCopy duration: {}".format(activity_run.output['copyDuration']))
else:
print("\tErrors: {}".format(activity_run.error['message']))

def main():

# Azure subscription ID
subscription_id = '<your Azure subscription ID>'

# This program creates this resource group. If it's an existing resource group, comment out the
# code that creates the resource group.
rg_name = '<Azure resource group name>'

# The data factory name. It must be globally unique.


df_name = '<Your data factory name>'

# Specify your Active Directory client ID, client secret, and tenant ID
credentials = ServicePrincipalCredentials(client_id='<Active Directory client ID>',
secret='<client secret>', tenant='<tenant ID>')
resource_client = ResourceManagementClient(credentials, subscription_id)
adf_client = DataFactoryManagementClient(credentials, subscription_id)

rg_params = {'location':'eastus'}
df_params = {'location':'eastus'}

# create the resource group


# comment out if the resource group already exists
resource_client.resource_groups.create_or_update(rg_name, rg_params)

# Create a data factory


df_resource = Factory(location='eastus')
df = adf_client.factories.create_or_update(rg_name, df_name, df_resource)
print_item(df)
while df.provisioning_state != 'Succeeded':
df = adf_client.factories.get(rg_name, df_name)
time.sleep(1)

# Create an Azure Storage linked service


ls_name = 'storageLinkedService'

# Specify the name and key of your Azure Storage account


storage_string = SecureString('DefaultEndpointsProtocol=https;AccountName=<storage account name>;AccountKey=<storage account key>')

ls_azure_storage = AzureStorageLinkedService(connection_string=storage_string)
ls = adf_client.linked_services.create_or_update(rg_name, df_name, ls_name, ls_azure_storage)
print_item(ls)

# Create an Azure blob dataset (input)


ds_name = 'ds_in'
ds_ls = LinkedServiceReference(ls_name)
blob_path= 'adfv2tutorial/input'
blob_filename = 'input.txt'
ds_azure_blob= AzureBlobDataset(ds_ls, folder_path=blob_path, file_name = blob_filename)
ds = adf_client.datasets.create_or_update(rg_name, df_name, ds_name, ds_azure_blob)
print_item(ds)

# Create an Azure blob dataset (output)


dsOut_name = 'ds_out'
output_blobpath = 'adfv2tutorial/output'
dsOut_azure_blob = AzureBlobDataset(ds_ls, folder_path=output_blobpath)
dsOut = adf_client.datasets.create_or_update(rg_name, df_name, dsOut_name, dsOut_azure_blob)
print_item(dsOut)

# Create a copy activity


act_name = 'copyBlobtoBlob'
blob_source = BlobSource()
blob_sink = BlobSink()
dsin_ref = DatasetReference(ds_name)
dsOut_ref = DatasetReference(dsOut_name)
copy_activity = CopyActivity(act_name,inputs=[dsin_ref], outputs=[dsOut_ref],
source=blob_source, sink=blob_sink)

# Create a pipeline with the copy activity


p_name = 'copyPipeline'
params_for_pipeline = {}
p_obj = PipelineResource(activities=[copy_activity], parameters=params_for_pipeline)
p = adf_client.pipelines.create_or_update(rg_name, df_name, p_name, p_obj)
print_item(p)

# Create a pipeline run


run_response = adf_client.pipelines.create_run(rg_name, df_name, p_name,
{
}
)

# Monitor the pipeline run


time.sleep(30)
pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run_response.run_id)
print("\n\tPipeline run status: {}".format(pipeline_run.status))
activity_runs_paged = list(adf_client.activity_runs.list_by_pipeline_run(rg_name, df_name,
pipeline_run.run_id, datetime.now() - timedelta(1), datetime.now() + timedelta(1)))
print_activity_run_details(activity_runs_paged[0])

# Start the main method


main()

Run the code


Build and start the application, then verify the pipeline execution.
The console prints the progress of creating the data factory, linked service, datasets, pipeline, and pipeline
run. Wait until you see the copy activity run details with the size of the data read and written. Then, use
tools such as Azure Storage Explorer to check that the blobs were copied from "inputBlobPath" to
"outputBlobPath" as you specified in the variables.
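Before you run the script, install the Azure management packages that the imports above rely on. A minimal sketch, assuming the legacy azure-common, azure-mgmt-resource, and azure-mgmt-datafactory packages used by this quickstart and a hypothetical file name of datafactory_quickstart.py for the saved script:

# Install the legacy SDK packages the script imports (assumed package names)
pip install azure-common azure-mgmt-resource azure-mgmt-datafactory

# Run the saved script (hypothetical file name)
python datafactory_quickstart.py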
Here is the sample output:
Name: <data factory name>
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>
Location: eastus
Tags: {}

Name: storageLinkedService
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory
name>/linkedservices/storageLinkedService

Name: ds_in
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/datasets/ds_in

Name: ds_out
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/datasets/ds_out

Name: copyPipeline
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/pipelines/copyPipeline

Pipeline run status: Succeeded


Datetime with no tzinfo will be considered UTC.
Datetime with no tzinfo will be considered UTC.

Activity run details

Activity run status: Succeeded


Number of bytes read: 18
Number of bytes written: 18
Copy duration: 4

Clean up resources
To delete the data factory, add the following code to the program:

adf_client.factories.delete(rg_name,df_name)

Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage.
Go through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create an Azure data factory and
pipeline by using the REST API
3/26/2019 • 8 minutes to read • Edit Online

Azure Data Factory is a cloud-based data integration service that allows you to create data-driven
workflows in the cloud for orchestrating and automating data movement and data transformation.
Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that
can ingest data from disparate data stores, process/transform the data by using compute services such
as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning, and
publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI)
applications to consume.
This quickstart describes how to use the REST API to create an Azure data factory. The pipeline in this data
factory copies data from one location to another in Azure Blob storage.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM
module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az
module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module
installation instructions, see Install Azure PowerShell.

Azure subscription. If you don't have a subscription, you can create a free trial account.
Azure Storage account. You use the blob storage as source and sink data store. If you don't have
an Azure storage account, see the Create a storage account article for steps to create one.
Create a blob container in Blob Storage, create an input folder in the container, and upload some
files to the folder. You can use tools such as Azure Storage Explorer to connect to Azure Blob
storage, create a blob container, upload an input file, and verify the output file.
Install Azure PowerShell. Follow the instructions in How to install and configure Azure
PowerShell. This quickstart uses PowerShell to invoke REST API calls.
Create an application in Azure Active Directory by following these instructions. Make note of the
following values that you use in later steps: application ID, authentication key, and tenant ID.
Assign the application to the Contributor role (a PowerShell sketch for the role assignment follows this list).
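For that last role assignment, here is a minimal sketch, assuming you're already signed in with Connect-AzAccount and have selected the target subscription; the application ID placeholder is an assumption you replace with your own value:

# Assign the Azure AD application's service principal to the Contributor role
# at the scope of the current subscription (placeholder value is an assumption)
New-AzRoleAssignment -ApplicationId "<your application ID>" -RoleDefinitionName "Contributor"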

Set global variables


1. Launch PowerShell. Keep Azure PowerShell open until the end of this quickstart. If you close
and reopen, you need to run the commands again.
Run the following command, and enter the user name and password that you use to sign in to
the Azure portal:

Connect-AzAccount
Run the following command to view all the subscriptions for this account:

Get-AzSubscription

Run the following command to select the subscription that you want to work with. Replace
SubscriptionId with the ID of your Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

2. Run the following commands after replacing the placeholders with your own values, to set
global variables to be used in later steps.

$tenantID = "<your tenant ID>"


$appId = "<your application ID>"
$authKey = "<your authentication key for the application>"
$subsId = "<your subscription ID to create the factory>"
$resourceGroup = "<your resource group to create the factory>"
$dataFactoryName = "<specify the name of data factory to create. It must be globally
unique.>"
$apiVersion = "2018-06-01"

Authenticate with Azure AD


Run the following commands to authenticate with Azure Active Directory (Azure AD):

$AuthContext = [Microsoft.IdentityModel.Clients.ActiveDirectory.AuthenticationContext]"https://login.microsoftonline.com/${tenantId}"
$cred = New-Object -TypeName Microsoft.IdentityModel.Clients.ActiveDirectory.ClientCredential -ArgumentList ($appId, $authKey)
$result = $AuthContext.AcquireToken("https://management.core.windows.net/", $cred)
$authHeader = @{
'Content-Type'='application/json'
'Accept'='application/json'
'Authorization'=$result.CreateAuthorizationHeader()
}

Create a data factory


Run the following commands to create a data factory:

$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}?api-version=${apiVersion}"
$body = @"
{
"name": "$dataFactoryName",
"location": "East US",
"properties": {},
"identity": {
"type": "SystemAssigned"
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
Note the following points:
The name of the Azure data factory must be globally unique. If you receive the following error,
change the name and try again.

Data factory name "ADFv2QuickStartDataFactory" is not available.

For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory:
Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.
Here is the sample response:

{
"name": "<dataFactoryName>",
"tags": {
},
"properties": {
"provisioningState": "Succeeded",
"loggingStorageAccountKey": "**********",
"createTime": "2017-09-14T06:22:59.9106216Z",
"version": "2018-06-01"
},
"identity": {
"type": "SystemAssigned",
"principalId": "<service principal ID>",
"tenantId": "<tenant ID>"
},
"id": "dataFactoryName",
"type": "Microsoft.DataFactory/factories",
"location": "East US"
}

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data
factory. In this quickstart, you only need to create one Azure Storage linked service, used as both the copy
source and the sink store, named "AzureStorageLinkedService" in the sample.
Run the following commands to create a linked service named AzureStorageLinkedService.
Replace <accountName> and <accountKey> with the name and key of your Azure storage account before
executing the commands.
$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/linkedservices/AzureStorageLinkedService?api-version=${apiVersion}"
$body = @"
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>",
"type": "SecureString"
}
}
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json

Here is the sample output:

{
"id":
"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory
/factories/<dataFactoryName>/linkedservices/AzureStorageLinkedService",
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "@{value=**********; type=SecureString}"
}
},
"etag": "0000c552-0000-0000-0000-59b1459c0000"
}

Create datasets
You define a dataset that represents the data to copy from a source to a sink. In this example, the Blob
dataset refers to the Azure Storage linked service you created in the previous step. The dataset takes a
parameter whose value is set in an activity that consumes the dataset. The parameter is used to
construct the "folderPath" that points to where the data resides.
$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/datasets/BlobDataset?api-version=${apiVersion}"
$body = @"
{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json

Here is the sample output:

{
"id":
"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory
/factories/<dataFactoryName>/datasets/BlobDataset",
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "@{value=@{dataset().path}; type=Expression}"
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": "@{type=String}"
}
},
"etag": "0000c752-0000-0000-0000-59b1459d0000"
}

Create pipeline
In this example, the pipeline contains one activity and takes two parameters: the input blob path and the
output blob path. The values for these parameters are set when the pipeline is triggered or run. The copy
activity refers to the same blob dataset created in the previous step as both input and output. When the
dataset is used as an input dataset, the input path is specified; when it's used as an output dataset, the
output path is specified.
$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelines/Adfv2QuickStartPipeline?api-version=${apiVersion}"
$body = @"
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
}
}
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json

Here is the sample output:


{
"id":
"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory
/factories/<dataFactoryName>/pipelines/Adfv2QuickStartPipeline",
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
"@{name=CopyFromBlobToBlob; type=Copy; inputs=System.Object[]; outputs=System.Object[];
typeProperties=}"
],
"parameters": {
"inputPath": "@{type=String}",
"outputPath": "@{type=String}"
}
},
"etag": "0000c852-0000-0000-0000-59b1459e0000"
}

Create pipeline run


In this step, you set the values of the inputPath and outputPath parameters specified in the pipeline to the
actual values of the source and sink blob paths, and trigger a pipeline run. The pipeline run ID returned in
the response body is used in the monitoring API later.
Replace the values of inputPath and outputPath with your source and sink blob paths before running
the commands.

$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelines/Adfv2QuickStartPipeline/createRun?api-version=${apiVersion}"
$body = @"
{
"inputPath": "<the path to existing blob(s) to copy data from, e.g. containername/path>",
"outputPath": "<the blob path to copy data to, e.g. containername/path>"
}
"@
$response = Invoke-RestMethod -Method POST -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
$runId = $response.runId

Here is the sample output:

{
"runId": "2f26be35-c112-43fa-9eaa-8ba93ea57881"
}

Monitor pipeline
1. Run the following script to continuously check the pipeline run status until it finishes copying
the data.
$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelineruns/${runId}?api-version=${apiVersion}"
while ($True) {
$response = Invoke-RestMethod -Method GET -Uri $request -Header $authHeader
Write-Host "Pipeline run status: " $response.Status -foregroundcolor "Yellow"

if ($response.Status -eq "InProgress") {


Start-Sleep -Seconds 15
}
else {
$response | ConvertTo-Json
break
}
}

Here is the sample output:

{
"key": "000000000-0000-0000-0000-00000000000",
"timestamp": "2017-09-07T13:12:39.5561795Z",
"runId": "000000000-0000-0000-0000-000000000000",
"dataFactoryName": "<dataFactoryName>",
"pipelineName": "Adfv2QuickStartPipeline",
"parameters": [
"inputPath: <inputBlobPath>",
"outputPath: <outputBlobPath>"
],
"parametersCount": 2,
"parameterNames": [
"inputPath",
"outputPath"
],
"parameterNamesCount": 2,
"parameterValues": [
"<inputBlobPath>",
"<outputBlobPath>"
],
"parameterValuesCount": 2,
"runStart": "2017-09-07T13:12:00.3710792Z",
"runEnd": "2017-09-07T13:12:39.5561795Z",
"durationInMs": 39185,
"status": "Succeeded",
"message": ""
}

2. Run the following script to retrieve copy activity run details, for example, size of the data
read/written.

$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelineruns/${runId}/activityruns?api-version=${apiVersion}&startTime="+(Get-Date).ToString('yyyy-MM-dd')+"&endTime="+(Get-Date).AddDays(1).ToString('yyyy-MM-dd')+"&pipelineName=Adfv2QuickStartPipeline"
$response = Invoke-RestMethod -Method GET -Uri $request -Header $authHeader
$response | ConvertTo-Json

Here is the sample output:


{
"value": [
{
"id": "000000000-0000-0000-0000-00000000000",
"timestamp": "2017-09-07T13:12:38.4780542Z",
"pipelineRunId": "000000000-0000-00000-0000-0000000000000",
"pipelineName": "Adfv2QuickStartPipeline",
"status": "Succeeded",
"failureType": "",
"linkedServiceName": "",
"activityName": "CopyFromBlobToBlob",
"activityType": "Copy",
"activityStart": "2017-09-07T13:12:02.3299261Z",
"activityEnd": "2017-09-07T13:12:38.4780542Z",
"duration": 36148,
"input": "@{source=; sink=}",
"output": "@{dataRead=331452208; dataWritten=331452208; copyDuration=22;
throughput=14712.9; errors=System.Object[];
effectiveIntegrationRuntime=DefaultIntegrationRuntime (West US); usedDataIntegrationUnits=2;
billedDuration=22}",
"error": "@{errorCode=; message=; failureType=; target=CopyFromBlobToBlob}"
}
]
}

Verify the output


Use Azure Storage Explorer to check that the blobs were copied from "inputBlobPath" to "outputBlobPath" as
you specified when creating the pipeline run.
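If you prefer to verify from PowerShell instead of Azure Storage Explorer, here is a minimal sketch using the Az.Storage module; the storage account name, key, container, and output path placeholders are assumptions you replace with your own values:

# Build a storage context and list the blobs under the output path (placeholder values are assumptions)
$ctx = New-AzStorageContext -StorageAccountName "<storage account name>" -StorageAccountKey "<storage account key>"
Get-AzStorageBlob -Container "<container name>" -Prefix "<output folder path>" -Context $ctx | Select-Object Name, Length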

Clean up resources
You can clean up the resources that you created in the Quickstart in two ways. You can delete the Azure
resource group, which includes all the resources in the resource group. If you want to keep the other
resources intact, delete only the data factory you created in this tutorial.
Run the following command to delete the entire resource group:

Remove-AzResourceGroup -ResourceGroupName $resourcegroupname

Run the following command to delete only the data factory:

Remove-AzDataFactoryV2 -Name "<NameOfYourDataFactory>" -ResourceGroupName "<NameOfResourceGroup>"

Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage.
Go through the tutorials to learn about using Data Factory in more scenarios.
Tutorial: Create an Azure data factory using
Azure Resource Manager template
3/26/2019 • 15 minutes to read • Edit Online

This quickstart describes how to use an Azure Resource Manager template to create an Azure data
factory. The pipeline you create in this data factory copies data from one folder to another folder in an
Azure blob storage. For a tutorial on how to transform data using Azure Data Factory, see Tutorial:
Transform data using Spark.

NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the Azure
Data Factory service, see Introduction to Azure Data Factory.

Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of
the contributor or owner role, or an administrator of the Azure subscription. To view the permissions
that you have in the subscription, in the Azure portal, select your username in the upper-right corner,
and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account, see
Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.

3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.

5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
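If you're already signed in with Azure PowerShell, you can also retrieve the key without the portal. A minimal sketch; the resource group and account name placeholders are assumptions:

# List the access keys for the storage account (replace the placeholders with your own values)
Get-AzStorageAccountKey -ResourceGroupName "<your resource group name>" -Name "<your storage account name>"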
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a folder
named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.
2. On the Blob service page, select + Container on the toolbar.

3. In the New container dialog box, enter adftutorial for the name, and then select OK.

4. Select adftutorial in the list of containers.

5. On the Container page, select Upload on the toolbar.

6. On the Upload blob page, select Advanced.

7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not already
exist.

John, Doe
Jane, Doe

8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the Files
box.
9. Enter input as a value for the Upload to folder box.

10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.
12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
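If you prefer to script these steps, here is a minimal sketch using the Az.Storage module, assuming emp.txt was saved to C:\ADFv2QuickStartPSH as described above; the account name and key placeholders are assumptions:

# Create the adftutorial container and upload emp.txt to its input folder
$ctx = New-AzStorageContext -StorageAccountName "<your storage account name>" -StorageAccountKey "<your storage account key>"
New-AzStorageContainer -Name "adftutorial" -Context $ctx
Set-AzStorageBlobContent -File "C:\ADFv2QuickStartPSH\emp.txt" -Container "adftutorial" -Blob "input/emp.txt" -Context $ctx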
Azure PowerShell

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM
module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az
module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module
installation instructions, see Install Azure PowerShell.

Install the latest Azure PowerShell modules by following instructions in How to install and configure
Azure PowerShell.

Resource Manager templates


To learn about Azure Resource Manager templates in general, see Authoring Azure Resource Manager
Templates.
The following section provides the complete Resource Manager template for defining Data Factory
entities so that you can quickly run through the tutorial and test the template. To understand how each
Data Factory entity is defined, see Data Factory entities in the template section.
To learn about the JSON syntax and properties for Data Factory resources in a template, see
Microsoft.DataFactory resource types.

Data Factory JSON


Create a JSON file named ADFTutorialARM.json in C:\ADFTutorial folder with the following
content:

{
"contentVersion": "1.0.0.0",
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"parameters": {
"dataFactoryName": {
"type": "string",
"metadata": {
"description": "Name of the data factory. Must be globally unique."
}
},
"dataFactoryLocation": {
"type": "string",
"allowedValues": [
"East US",
"East US 2",
"West Europe"
],
"defaultValue": "East US",
"metadata": {
"description": "Location of the data factory. Currently, only East US, East US 2, and West
Europe are supported. "
}
},
"storageAccountName": {
"type": "string",
"metadata": {
"description": "Name of the Azure storage account that contains the input/output data."
}
},
"storageAccountKey": {
"storageAccountKey": {
"type": "securestring",
"metadata": {
"description": "Key for the Azure storage account."
}
},
"blobContainer": {
"type": "string",
"metadata": {
"description": "Name of the blob container in the Azure Storage account."
}
},
"inputBlobFolder": {
"type": "string",
"metadata": {
"description": "The folder in the blob container that has the input file."
}
},
"inputBlobName": {
"type": "string",
"metadata": {
"description": "Name of the input file/blob."
}
},
"outputBlobFolder": {
"type": "string",
"metadata": {
"description": "The folder in the blob container that will hold the transformed data."
}
},
"outputBlobName": {
"type": "string",
"metadata": {
"description": "Name of the output file/blob."
}
},
"triggerStartTime": {
"type": "string",
"metadata": {
"description": "Start time for the trigger."
}
},
"triggerEndTime": {
"type": "string",
"metadata": {
"description": "End time for the trigger."
}
}
},
"variables": {
"azureStorageLinkedServiceName": "ArmtemplateStorageLinkedService",
"inputDatasetName": "ArmtemplateTestDatasetIn",
"outputDatasetName": "ArmtemplateTestDatasetOut",
"pipelineName": "ArmtemplateSampleCopyPipeline",
"triggerName": "ArmTemplateTestTrigger"
},
"resources": [{
"name": "[parameters('dataFactoryName')]",
"apiVersion": "2018-06-01",
"type": "Microsoft.DataFactory/factories",
"location": "[parameters('dataFactoryLocation')]",
"identity": {
"type": "SystemAssigned"
},
"resources": [{
"type": "linkedservices",
"name": "[variables('azureStorageLinkedServiceName')]",
"dependsOn": [
"[parameters('dataFactoryName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureStorage",
"description": "Azure Storage linked service",
"typeProperties": {
"connectionString": {
"value": "
[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=
',parameters('storageAccountKey'))]",
"type": "SecureString"
}
}
}
},
{
"type": "datasets",
"name": "[variables('inputDatasetName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('inputBlobFolder'),
'/')]",
"fileName": "[parameters('inputBlobName')]"
},
"linkedServiceName": {
"referenceName": "[variables('azureStorageLinkedServiceName')]",
"type": "LinkedServiceReference"
}
}
},
{
"type": "datasets",
"name": "[variables('outputDatasetName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('outputBlobFolder'),
'/')]",
"fileName": "[parameters('outputBlobName')]"
},
"linkedServiceName": {
"referenceName": "[variables('azureStorageLinkedServiceName')]",
"type": "LinkedServiceReference"
}
}
},
{
"type": "pipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('inputDatasetName')]",
"[variables('outputDatasetName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"activities": [{
"type": "Copy",
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"name": "MyCopyActivity",
"inputs": [{
"referenceName": "[variables('inputDatasetName')]",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "[variables('outputDatasetName')]",
"type": "DatasetReference"
}]
}]
}
},
{
"type": "triggers",
"name": "[variables('triggerName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('inputDatasetName')]",
"[variables('outputDatasetName')]",
"[variables('pipelineName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "[parameters('triggerStartTime')]",
"endTime": "[parameters('triggerEndTime')]",
"timeZone": "UTC"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "ArmtemplateSampleCopyPipeline"
},
"parameters": {}
}]
}
}
]
}]
}

Parameters JSON
Create a JSON file named ADFTutorialARM -Parameters.json that contains parameters for the Azure
Resource Manager template.
IMPORTANT
Specify the name and key of your Azure Storage account for the storageAccountName and
storageAccountKey parameters in this parameter file. You created the adftutorial container and uploaded
the sample file (emp.txt) to the input folder in this Azure blob storage.
Specify a globally unique name for the data factory for the dataFactoryName parameter. For example:
ARMTutorialFactoryJohnDoe11282017.
For the triggerStartTime, specify the current day in the format: 2017-11-28T00:00:00 .
For the triggerEndTime, specify the next day in the format: 2017-11-29T00:00:00 . You can also check the
current UTC time and specify the next hour or two as the end time. For example, if the UTC time now is 1:32
AM, specify 2017-11-29T03:00:00 as the end time. In this case, the trigger runs the pipeline twice (at 2 AM
and 3 AM).
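If you'd rather compute suitable trigger times than type them, here is a small PowerShell sketch; the midnight-to-midnight UTC window is an assumption matching the examples above:

# Today at midnight UTC for triggerStartTime, tomorrow at midnight UTC for triggerEndTime
$triggerStart = (Get-Date).ToUniversalTime().Date.ToString("yyyy-MM-ddTHH:mm:ss")
$triggerEnd = (Get-Date).ToUniversalTime().Date.AddDays(1).ToString("yyyy-MM-ddTHH:mm:ss")
$triggerStart
$triggerEnd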

{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"dataFactoryName": {
"value": "<datafactoryname>"
},
"dataFactoryLocation": {
"value": "East US"
},
"storageAccountName": {
"value": "<yourstorageaccountname>"
},
"storageAccountKey": {
"value": "<yourstorageaccountkey>"
},
"blobContainer": {
"value": "adftutorial"
},
"inputBlobFolder": {
"value": "input"
},
"inputBlobName": {
"value": "emp.txt"
},
"outputBlobFolder": {
"value": "output"
},
"outputBlobName": {
"value": "emp.txt"
},
"triggerStartTime": {
"value": "2017-11-28T00:00:00. Set to today"
},
"triggerEndTime": {
"value": "2017-11-29T00:00:00. Set to tomorrow"
}
}
}

IMPORTANT
You may have separate parameter JSON files for development, testing, and production environments that you
can use with the same Data Factory JSON template. By using a PowerShell script, you can automate deploying
Data Factory entities to these environments.
Deploy Data Factory entities
In PowerShell, run the following command to deploy Data Factory entities using the Resource Manager
template you created earlier in this quickstart.

New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile C:\ADFTutorial\ADFTutorialARM.json -TemplateParameterFile C:\ADFTutorial\ADFTutorialARM-Parameters.json
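The command assumes the ADFTutorialResourceGroup resource group already exists. If it doesn't, here is a minimal sketch to create it first; the East US location is an assumption, and any region that suits you works:

# Create the resource group used by the deployment (location is an assumption)
New-AzResourceGroup -Name ADFTutorialResourceGroup -Location "East US"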

You see output similar to the following sample:

DeploymentName : MyARMDeployment
ResourceGroupName : ADFTutorialResourceGroup
ProvisioningState : Succeeded
Timestamp : 11/29/2017 3:11:13 AM
Mode : Incremental
TemplateLink :
Parameters :
Name Type Value
=============== ============ ==========
dataFactoryName String <data factory name>
dataFactoryLocation String East US
storageAccountName String <storage account name>
storageAccountKey SecureString
blobContainer String adftutorial
inputBlobFolder String input
inputBlobName String emp.txt
outputBlobFolder String output
outputBlobName String emp.txt
triggerStartTime String 11/29/2017 12:00:00 AM
triggerEndTime String 11/29/2017 4:00:00 AM

Outputs :
DeploymentDebugLogLevel :

Start the trigger


The template deploys the following Data Factory entities:
Azure Storage linked service
Azure Blob datasets (input and output)
Pipeline with a copy activity
Trigger to trigger the pipeline
The deployed trigger is in the stopped state. One way to start the trigger is to use the
Start-AzDataFactoryV2Trigger PowerShell cmdlet. The following procedure provides detailed steps:
1. In the PowerShell window, create a variable to hold the name of the resource group. Copy the
following command into the PowerShell window, and press ENTER. If you specified a
different resource group name for the New-AzResourceGroupDeployment command, update the
value here.

$resourceGroupName = "ADFTutorialResourceGroup"

2. Create a variable to hold the name of the data factory. Specify the same name that you specified
in the ADFTutorialARM -Parameters.json file.
$dataFactoryName = "<yourdatafactoryname>"

3. Set a variable for the name of the trigger. The name of the trigger is hardcoded in the Resource
Manager template file (ADFTutorialARM.json).

$triggerName = "ArmTemplateTestTrigger"

4. Get the status of the trigger by running the following PowerShell command after specifying the
name of your data factory and trigger:

Get-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $triggerName

Here is the sample output:

TriggerName : ArmTemplateTestTrigger
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ARMFactory1128
Properties : Microsoft.Azure.Management.DataFactory.Models.ScheduleTrigger
RuntimeState : Stopped

Notice that the runtime state of the trigger is Stopped.


5. Start the trigger. The trigger runs the pipeline defined in the template on the hour. That is, if you
executed this command at 2:25 PM, the trigger runs the pipeline at 3 PM for the first time. Then,
it runs the pipeline hourly until the end time you specified for the trigger.

Start-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -TriggerName $triggerName

Here is the sample output:

Confirm
Are you sure you want to start trigger 'ArmTemplateTestTrigger' in data factory
'ARMFactory1128'?
[Y] Yes [N] No [S] Suspend [?] Help (default is "Y"): y
True

6. Confirm that the trigger has been started by running the Get-AzDataFactoryV2Trigger command
again.

Get-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -TriggerName $triggerName

Here is the sample output:

TriggerName : ArmTemplateTestTrigger
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ARMFactory1128
Properties : Microsoft.Azure.Management.DataFactory.Models.ScheduleTrigger
RuntimeState : Started
Monitor the pipeline
1. After signing in to the Azure portal, click All services, search with a keyword such as "data fa",
and select Data factories.

2. In the Data Factories page, click the data factory you created. If needed, filter the list with the
name of your data factory.

3. In the Data factory page, click Monitor & Manage tile.


4. The Data Integration Application should open in a separate tab in the web browser. If the
monitor tab is not active, switch to the monitor tab. Notice that the pipeline run was triggered
by a scheduler trigger.

IMPORTANT
You see pipeline runs only on the hour (for example: 4 AM, 5 AM, 6 AM). Click Refresh on the
toolbar to refresh the list when the time reaches the next hour.

5. Click the link in the Actions column.

6. You see the activity runs associated with the pipeline run. In this quickstart, the pipeline has only
one activity of type: Copy. Therefore, you see a run for that activity.

7. Click the link in the Output column. You see the output from the copy operation in an Output
window. Click the maximize button to see the full output, and close the window when you're done.
8. Stop the trigger once you see a successful or failed run. The trigger runs the pipeline once an hour,
and the pipeline copies the same file from the input folder to the output folder for each run. To stop
the trigger, run the following command in the PowerShell window.

Stop-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $triggerName

Verify the output


The pipeline automatically creates the output folder in the adftutorial blob container. Then, it copies the
emp.txt file from the input folder to the output folder.
1. In the Azure portal, on the adftutorial container page, click Refresh to see the output folder.

2. Click output in the folder list.


3. Confirm that the emp.txt is copied to the output folder.
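As an alternative to the portal, here is a minimal PowerShell sketch to confirm the copy, reusing the storage context ($ctx) from the earlier upload sketch; the placeholders there remain assumptions:

# List the blobs under the output folder of the adftutorial container
Get-AzStorageBlob -Container "adftutorial" -Prefix "output/" -Context $ctx | Select-Object Name, LastModified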
Clean up resources
You can clean up the resources that you created in the Quickstart in two ways. You can delete the Azure
resource group, which includes all the resources in the resource group. If you want to keep the other
resources intact, delete only the data factory you created in this tutorial.
Deleting a resource group deletes all resources including data factories in it. Run the following
command to delete the entire resource group:

Remove-AzResourceGroup -ResourceGroupName $resourcegroupname

Note: Deleting a resource group may take some time. Please be patient with the process.
If you want to delete just the data factory, not the entire resource group, run the following command:

Remove-AzDataFactoryV2 -Name $dataFactoryName -ResourceGroupName $resourceGroupName

JSON definitions for entities


The following Data Factory entities are defined in the JSON template:
Azure Storage linked service
Azure blob input dataset
Azure Blob output dataset
Data pipeline with a copy activity
Trigger
Azure Storage linked service
The AzureStorageLinkedService links your Azure storage account to the data factory. You created a
container and uploaded data to this storage account as part of prerequisites. You specify the name and
key of your Azure storage account in this section. See Azure Storage linked service for details about
JSON properties used to define an Azure Storage linked service.
{
"type": "linkedservices",
"name": "[variables('azureStorageLinkedServiceName')]",
"dependsOn": [
"[parameters('dataFactoryName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureStorage",
"description": "Azure Storage linked service",
"typeProperties": {
"connectionString": {
"value": "
[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=
',parameters('storageAccountKey'))]",
"type": "SecureString"
}
}
}
}

The connectionString uses the storageAccountName and storageAccountKey parameters. The values
for these parameters are passed by using a configuration file. The definition also uses the
azureStorageLinkedService and dataFactoryName variables defined in the template.
Azure blob input dataset
The Azure storage linked service specifies the connection string that Data Factory service uses at run
time to connect to your Azure storage account. In Azure blob dataset definition, you specify names of
blob container, folder, and file that contains the input data. See Azure Blob dataset properties for details
about JSON properties used to define an Azure Blob dataset.

{
"type": "datasets",
"name": "[variables('inputDatasetName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('inputBlobFolder'),
'/')]",
"fileName": "[parameters('inputBlobName')]"
},
"linkedServiceName": {
"referenceName": "[variables('azureStorageLinkedServiceName')]",
"type": "LinkedServiceReference"
}
}
},

Azure blob output dataset


You specify the name of the folder in the Azure Blob Storage that holds the copied data from the input
folder. See Azure Blob dataset properties for details about JSON properties used to define an Azure
Blob dataset.
{
"type": "datasets",
"name": "[variables('outputDatasetName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('outputBlobFolder'),
'/')]",
"fileName": "[parameters('outputBlobName')]"
},
"linkedServiceName": {
"referenceName": "[variables('azureStorageLinkedServiceName')]",
"type": "LinkedServiceReference"
}
}
}

Data pipeline
You define a pipeline that copies data from one Azure blob dataset to another Azure blob dataset. See
Pipeline JSON for descriptions of JSON elements used to define a pipeline in this example.

{
"type": "pipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('inputDatasetName')]",
"[variables('outputDatasetName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"name": "MyCopyActivity",
"inputs": [{
"referenceName": "[variables('inputDatasetName')]",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "[variables('outputDatasetName')]",
"type": "DatasetReference"
}]
}]
}
}

Trigger
You define a trigger that runs the pipeline once an hour. The deployed trigger is in stopped state. Start
the trigger by using the Start-AzDataFactoryV2Trigger cmdlet. For more information about triggers,
see Pipeline execution and triggers article.

{
"type": "triggers",
"name": "[variables('triggerName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('inputDatasetName')]",
"[variables('outputDatasetName')]",
"[variables('pipelineName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2017-11-28T00:00:00",
"endTime": "2017-11-29T00:00:00",
"timeZone": "UTC"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "ArmtemplateSampleCopyPipeline"
},
"parameters": {}
}]
}
}

Reuse the template


In the tutorial, you created a template for defining Data Factory entities and a template for passing
values for parameters. To use the same template to deploy Data Factory entities to different
environments, you create a parameter file for each environment and use it when deploying to that
environment.
Example:

New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFTutorialARM.json -TemplateParameterFile ADFTutorialARM-Parameters-Dev.json

New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFTutorialARM.json -TemplateParameterFile ADFTutorialARM-Parameters-Test.json

New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFTutorialARM.json -TemplateParameterFile ADFTutorialARM-Parameters-Production.json

Notice that the first command uses the parameter file for the development environment, the second for the
test environment, and the third for the production environment.
You can also reuse the template to perform repeated tasks. For example, create many data factories with
one or more pipelines that implement the same logic but each data factory uses different Azure storage
accounts. In this scenario, you use the same template in the same environment (dev, test, or production)
with different parameter files to create data factories.
Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage.
Go through the tutorials to learn about using Data Factory in more scenarios.
Create Azure Data Factory Data Flow
5/6/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Mapping Data Flows in ADF provide a way to transform data at scale without any coding required. You can design
a data transformation job in the data flow designer by constructing a series of transformations. Start with any
number of source transformations followed by data transformation steps. Then, complete your data flow with a sink
to land your results in a destination.
Get started by first creating a new V2 Data Factory from the Azure portal. After creating your new factory, click on
the "Author & Monitor" tile to launch the Data Factory UI.

Once you are in the Data Factory UI, you can use sample Data Flows. The samples are available from the ADF
Template Gallery. In ADF, create "Pipeline from Template" and select the Data Flow category from the template
gallery.
You will be prompted to enter your Azure Blob Storage account information.

The data used for these samples can be found here. Download the sample data and store the files in your Azure
Blob storage accounts so that you can execute the samples.

Create new data flow


Use the Create Resource "plus sign" button in the ADF UI to create Data Flows.

Next steps
Begin building your data transformation with a source transformation.
Copy data from Azure Blob storage to a SQL
database by using the Copy Data tool
3/6/2019 • 5 minutes to read • Edit Online

In this tutorial, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a
pipeline that copies data from Azure Blob storage to a SQL database.

NOTE
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.

Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure storage account: Use Blob storage as the source data store. If you don't have an Azure storage account,
see the instructions in Create a storage account.
Azure SQL Database: Use a SQL database as the sink data store. If you don't have a SQL database, see the
instructions in Create a SQL database.
Create a blob and a SQL table
Prepare your Blob storage and your SQL database for the tutorial by performing these steps.
Create a source blob
1. Launch Notepad. Copy the following text and save it in a file named inputEmp.txt on your disk:

John|Doe
Jane|Doe

2. Create a container named adfv2tutorial, create an input folder in the container, and upload the inputEmp.txt
file to the input folder. You can use various tools to perform these tasks, such as Azure Storage Explorer,
or the PowerShell sketch that follows.
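Here is a minimal sketch of those steps using the Az.Storage module, assuming inputEmp.txt was saved to the current folder; the account name and key placeholders are assumptions:

# Create the adfv2tutorial container and upload inputEmp.txt to its input folder
$ctx = New-AzStorageContext -StorageAccountName "<your storage account name>" -StorageAccountKey "<your storage account key>"
New-AzStorageContainer -Name "adfv2tutorial" -Context $ctx
Set-AzStorageBlobContent -File ".\inputEmp.txt" -Container "adfv2tutorial" -Blob "input/inputEmp.txt" -Context $ctx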
Create a sink SQL table
1. Use the following SQL script to create a table named dbo.emp in your SQL database:

CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);


2. Allow Azure services to access SQL Server. Verify that the setting Allow access to Azure services is
enabled for your server that's running SQL Database. This setting lets Data Factory write data to your
database instance. To verify and turn on this setting, go to your Azure SQL server > Security > Firewalls
and virtual networks > set the Allow access to Azure services option to ON.
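If you'd rather enable that setting from PowerShell, here is a minimal sketch using the Az.Sql module; the resource group and server name placeholders are assumptions:

# Create the special firewall rule that allows Azure services to reach the server
New-AzSqlServerFirewallRule -ResourceGroupName "<your resource group name>" -ServerName "<your SQL server name>" -AllowAllAzureIPs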

Create a data factory


1. On the left menu, select + New > Data + Analytics > Data Factory:

2. On the New data factory page, under Name, enter ADFTutorialDataFactory.


The name for your data factory must be globally unique. You might receive the following error message:

If you receive an error message about the name value, enter a different name for the data factory. For
example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory
artifacts, see Data Factory naming rules.
3. Select the Azure subscription in which to create the new data factory.
4. For Resource Group, take one of the following steps:
a. Select Use existing, and select an existing resource group from the drop-down list.
b. Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under version, select V2 for the version.
6. Under location, select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores (for example, Azure Storage and SQL Database) and computes (for example,
Azure HDInsight) that are used by your data factory can be in other locations and regions.
7. Select Pin to dashboard.
8. Select Create.
9. On the dashboard, the Deploying Data Factory tile shows the process status.
10. After creation is finished, the Data Factory home page is displayed.

11. To launch the Azure Data Factory user interface (UI) in a separate tab, select the Author & Monitor tile.

Use the Copy Data tool to create a pipeline


1. On the Let's get started page, select the Copy Data tile to launch the Copy Data tool.
2. On the Properties page, under Task name, enter CopyFromBlobToSqlPipeline. Then select Next. The
Data Factory UI creates a pipeline with the specified task name.

3. On the Source data store page, complete the following steps:


a. Click + Create new connection to add a connection
b. Select Azure Blob Storage from the gallery, and then select Next.

c. On the New Linked Service page, select your storage account from the Storage account name list,
and then select Finish.
d. Select the newly created linked service as source, then click Next.

4. On the Choose the input file or folder page, complete the following steps:
a. Click Browse to navigate to the adfv2tutorial/input folder, select the inputEmp.txt file, then click
Choose.
b. Click Next to move to next step.
5. On the File format settings page, notice that the tool automatically detects the column and row delimiters.
Select Next. You also can preview data and view the schema of the input data on this page.

6. On the Destination data store page, complete the following steps:


a. Click + Create new connection to add a connection

b. Select Azure SQL Database from the gallery, and then select Next.

c. On the New Linked Service page, select your server name and DB name from the dropdown list, and
specify the username and password, then select Finish.
d. Select the newly created linked service as sink, then click Next.

7. On the Table mapping page, select the [dbo].[emp] table, and then select Next.
8. On the Schema mapping page, notice that the first and second columns in the input file are mapped to the
FirstName and LastName columns of the emp table. Select Next.

9. On the Settings page, select Next.


10. On the Summary page, review the settings, and then select Next.
11. On the Deployment page, select Monitor to monitor the pipeline (task).

12. Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to view
activity run details and to rerun the pipeline. Select Refresh to refresh the list.
13. To view the activity runs that are associated with the pipeline run, select the View Activity Runs link in the
Actions column. For details about the copy operation, select the Details link (eyeglasses icon) in the
Actions column. To go back to the Pipeline Runs view, select the Pipelines link at the top. To refresh the
view, select Refresh.

14. Verify that the data is inserted into the emp table in your SQL database.

15. Select the Author tab on the left to switch to the editor mode. You can update the linked services, datasets,
and pipelines that were created via the tool by using the editor. For details on editing these entities in the
Data Factory UI, see the Azure portal version of this tutorial.
Next steps
The pipeline in this sample copies data from Blob storage to a SQL database. You learned how to:
Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn how to copy data from on-premises to the cloud:
Copy data from on-premises to the cloud
Copy data from Azure Blob storage to a SQL
database by using Azure Data Factory
3/26/2019 • 10 minutes to read • Edit Online

In this tutorial, you create a data factory by using the Azure Data Factory user interface (UI). The pipeline in this
data factory copies data from Azure Blob storage to a SQL database. The configuration pattern in this tutorial
applies to copying from a file-based data store to a relational data store. For a list of data stores supported as
sources and sinks, see the supported data stores table.

NOTE
If you're new to Data Factory, see Introduction to Azure Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Create a pipeline with a copy activity.
Test run the pipeline.
Trigger the pipeline manually.
Trigger the pipeline on a schedule.
Monitor the pipeline and activity runs.

Prerequisites
Azure subscription. If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account. You use Blob storage as a source data store. If you don't have a storage account, see
Create an Azure storage account for steps to create one.
Azure SQL Database. You use the database as a sink data store. If you don't have a SQL database, see Create
a SQL database for steps to create one.
Create a blob and a SQL table
Now, prepare your Blob storage and SQL database for the tutorial by performing the following steps.
Create a source blob
1. Launch Notepad. Copy the following text, and save it as an emp.txt file on your disk:

John,Doe
Jane,Doe

2. Create a container named adftutorial in your Blob storage. Create a folder named input in this container.
Then, upload the emp.txt file to the input folder. Use the Azure portal or tools such as Azure Storage
Explorer to do these tasks.
Create a sink SQL table
1. Use the following SQL script to create the dbo.emp table in your SQL database:
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);

2. Allow Azure services to access SQL Server. Ensure that Allow access to Azure services is turned ON for
your SQL Server so that Data Factory can write data to your SQL Server. To verify and turn on this setting,
take the following steps:
a. On the left, select More services > SQL servers.
b. Select your server, and under SETTINGS select Firewall.
c. On the Firewall settings page, select ON for Allow access to Azure services.

Create a data factory


In this step, you create a data factory and start the Data Factory UI to create a pipeline in the data factory.
1. Open Microsoft Edge or Google Chrome. Currently, Data Factory UI is supported only in Microsoft
Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Data + Analytics > Data Factory:
3. On the New data factory page, under Name, enter ADFTutorialDataFactory.

The name of the Azure data factory must be globally unique. If you see the following error message for the
name field, change the name of the data factory (for example, yournameADFTutorialDataFactory). For
naming rules for Data Factory artifacts, see Data Factory naming rules.

4. Select the Azure subscription in which you want to create the data factory.
5. For Resource Group, take one of the following steps:
a. Select Use existing, and select an existing resource group from the drop-down list.
b. Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
6. Under Version, select V2.
7. Under Location, select a location for the data factory. Only locations that are supported are displayed in
the drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for
example, Azure HDInsight) used by the data factory can be in other regions.
8. Select Pin to dashboard.
9. Select Create.
10. On the dashboard, you see the following tile with the status Deploying Data Factory:
11. After the creation is finished, you see the Data factory page as shown in the image.

12. Select Author & Monitor to launch the Data Factory UI in a separate tab.

Create a pipeline
In this step, you create a pipeline with a copy activity in the data factory. The copy activity copies data from Blob
storage to SQL Database. In the Quickstart tutorial, you created a pipeline by following these steps:
1. Create the linked service.
2. Create input and output datasets.
3. Create a pipeline.
In this tutorial, you start with creating the pipeline. Then you create linked services and datasets when you need
them to configure the pipeline.
1. On the Let's get started page, select Create pipeline.
2. In the General tab for the pipeline, enter CopyPipeline for Name of the pipeline.
3. In the Activities tool box, expand the Move and Transform category, and drag and drop the Copy Data
activity from the tool box to the pipeline designer surface. Specify CopyFromBlobToSql for Name.

Configure source
1. Go to the Source tab. Select + New to create a source dataset.
2. In the New Dataset window, select Azure Blob Storage, and then select Finish. The source data is in
Blob storage, so you select Azure Blob Storage for the source dataset.
3. You see a new tab opened for blob dataset. On the General tab at the bottom of the Properties window,
enter SourceBlobDataset for Name.
4. Go to the Connection tab of the Properties window. Next to the Linked service text box, select + New.

5. In the New Linked Service window, enter AzureStorageLinkedService as the name, select your storage
account from the Storage account name list, and then select Save to deploy the linked service.
6. After the linked service is created, you are back in the dataset settings. Next to File path, select Browse.

7. Navigate to the adftutorial/input folder, select the emp.txt file, and then select Finish.
8. Confirm that File format is set to Text format and that Column delimiter is set to Comma ( , ). If the
source file uses different row and column delimiters, you can select Detect Text Format for File format.
The Copy Data tool detects the file format and delimiters automatically for you. You can still override these
values. To preview data on this page, select Preview data.

9. Go to the Schema tab of the Properties window, and select Import Schema. Notice that the application
detected two columns in the source file. You import the schema here so that you can map columns from the
source data store to the sink data store. If you don't need to map columns, you can skip this step. For this
tutorial, import the schema.
10. Now, go back to the pipeline -> Source tab, confirm that SourceBlobDataset is selected. To preview data
on this page, select Preview data.

Configure sink
1. Go to the Sink tab, and select + New to create a sink dataset.

2. In the New Dataset window, input "SQL" in the search box to filter the connectors, then select Azure SQL
Database, and then select Finish. In this tutorial, you copy data to a SQL database.
3. On the General tab of the Properties window, in Name, enter OutputSqlDataset.

4. Go to the Connection tab, and next to Linked service, select + New. A dataset must be associated with a
linked service. The linked service has the connection string that Data Factory uses to connect to the SQL
database at runtime. The dataset specifies the table to which the data is copied.

5. In the New Linked Service window, take the following steps:


a. Under Name, enter AzureSqlDatabaseLinkedService.
b. Under Server name, select your SQL Server instance.
c. Under Database name, select your SQL database.
d. Under User name, enter the name of the user.
e. Under Password, enter the password for the user.
f. Select Test connection to test the connection.
g. Select Save to save the linked service.

6. In Table, select [dbo].[emp].


7. Go to the Schema tab, and select Import Schema.

8. Select the ID column, and then select Delete. The ID column is an identity column in the SQL database, so
the copy activity doesn't need to insert data into this column.

9. Go to the tab with the pipeline, and in Sink Dataset, confirm that OutputSqlDataset is selected.
Configure mapping
Go to the Mapping tab at the bottom of the Properties window, and select Import Schemas. Notice that the
first and second columns in the source file are mapped to FirstName and LastName in the SQL database.

Validate the pipeline


To validate the pipeline, select Validate from the tool bar.
You can see the JSON code associated with the pipeline by clicking Code on the upper-right.

Debug and publish the pipeline


You can debug a pipeline before you publish artifacts (linked services, datasets, and pipeline) to Data Factory or
your own Azure Repos Git repository.
1. To debug the pipeline, select Debug on the toolbar. You see the status of the pipeline run in the Output tab
at the bottom of the window.
2. Once the pipeline runs successfully, in the top toolbar, select Publish All. This action publishes the entities
(datasets and pipelines) you created to Data Factory.

3. Wait until you see the Successfully published message. To see notification messages, click Show
Notifications (the bell button) on the top-right.

Trigger the pipeline manually


In this step, you manually trigger the pipeline you published in the previous step.
1. Select Trigger on the toolbar, and then select Trigger Now. On the Pipeline Run page, select Finish.
2. Go to the Monitor tab on the left. You see a pipeline run that is triggered by a manual trigger. You can use
links in the Actions column to view activity details and to rerun the pipeline.

3. To see activity runs associated with the pipeline run, select the View Activity Runs link in the Actions
column. In this example, there is only one activity, so you see only one entry in the list. For details about the
copy operation, select the Details link (eyeglasses icon) in the Actions column. Select Pipelines at the top
to go back to the Pipeline Runs view. To refresh the view, select Refresh.

4. Verify that two more rows are added to the emp table in the SQL database.
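If you'd rather check the table from the command line than from a query editor, the following sketch uses
Invoke-Sqlcmd from the SqlServer PowerShell module; the server, database, and credential placeholders are
assumptions you replace with your own values:

# Query the sink table to confirm the copied rows
Invoke-Sqlcmd -ServerInstance "<your server name>.database.windows.net" -Database "<your database name>" `
    -Username "<user name>" -Password "<password>" `
    -Query "SELECT ID, FirstName, LastName FROM dbo.emp"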

Trigger the pipeline on a schedule


In this step, you create a schedule trigger for the pipeline. The trigger runs the pipeline on the specified
schedule, such as hourly or daily. In this example, you set the trigger to run every minute until the specified end
datetime.
1. Go to the Author tab on the left, above the Monitor tab.
2. Go to your pipeline, click Trigger on the tool bar, and select New/Edit.
3. In the Add Triggers window, select Choose trigger, and then select + New.

4. In the New Trigger window, take the following steps:


a. Under Name, enter RunEveryMinute.
b. Under End, select On Date.
c. Under End On, select the drop-down list.
d. Select the current day option. By default, the end day is set to the next day.
e. Update the minutes part to be a few minutes past the current datetime. The trigger is activated only after
you publish the changes. If you set it to only a couple of minutes apart and you don't publish it by then, you
don't see a trigger run.
f. Select Apply.
g. Select the Activated option. You can deactivate it and activate it later by using this check box.
h. Select Next.
IMPORTANT
A cost is associated with each pipeline run, so set the end date appropriately.

5. On the Trigger Run Parameters page, review the warning, and then select Finish. The pipeline in this
example doesn't take any parameters.
6. Click Publish All to publish the change.
7. Go to the Monitor tab on the left to see the triggered pipeline runs.

8. To switch from the Pipeline Runs view to the Trigger Runs view, select Pipeline Runs and then select
Trigger Runs.

9. You see the trigger runs in a list.

10. Verify that two rows per minute (for each pipeline run) are inserted into the emp table until the specified
end time.
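You can also list the trigger runs from Azure PowerShell instead of the Monitor tab. A minimal sketch, assuming
your own resource group and data factory names and the trigger name used above:

# List runs of the RunEveryMinute trigger over the last hour
Get-AzDataFactoryV2TriggerRun -ResourceGroupName "<resource group name>" -DataFactoryName "<data factory name>" `
    -TriggerName "RunEveryMinute" `
    -TriggerRunStartedAfter (Get-Date).AddHours(-1) -TriggerRunStartedBefore (Get-Date)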

Next steps
The pipeline in this sample copies data from Blob storage to a SQL database. You learned how to:
Create a data factory.
Create a pipeline with a copy activity.
Test run the pipeline.
Trigger the pipeline manually.
Trigger the pipeline on a schedule.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn how to copy data from on-premises to the cloud:
Copy data from on-premises to the cloud
Copy data from Azure Blob to Azure SQL Database
using Azure Data Factory
5/15/2019 • 10 minutes to read

In this tutorial, you create a Data Factory pipeline that copies data from Azure Blob Storage to Azure SQL
Database. The configuration pattern in this tutorial applies to copying from a file-based data store to a relational
data store. For a list of data stores supported as sources and sinks, see supported data stores table.
You perform the following steps in this tutorial:
Create a data factory.
Create Azure Storage and Azure SQL Database linked services.
Create Azure Blob and Azure SQL Database datasets.
Create a pipeline that contains a Copy activity.
Start a pipeline run.
Monitor the pipeline and activity runs.
This tutorial uses the .NET SDK. You can use other mechanisms to interact with Azure Data Factory; refer to the
samples under "Quickstarts".
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure Storage account. You use the blob storage as source data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one.
Azure SQL Database. You use the database as sink data store. If you don't have an Azure SQL Database, see
the Create an Azure SQL database article for steps to create one.
Visual Studio 2015, or 2017. The walkthrough in this article uses Visual Studio 2017.
Download and install Azure .NET SDK.
Create an application in Azure Active Directory by following these instructions. Make note of the following
values that you use in later steps: application ID, authentication key, and tenant ID. Assign the application to
the "Contributor" role by following the instructions in the same article.
Create a blob and a SQL table
Now, prepare your Azure Blob and Azure SQL Database for the tutorial by performing the following steps:
Create a source blob
1. Launch Notepad. Copy the following text and save it as inputEmp.txt file on your disk.

John|Doe
Jane|Doe

2. Use tools such as Azure Storage Explorer to create the adfv2tutorial container, and to upload the
inputEmp.txt file to the container.
Create a sink SQL table
1. Use the following SQL script to create the dbo.emp table in your Azure SQL Database.
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);

2. Allow Azure services to access SQL Server. Ensure that the Allow access to Azure services setting is turned
ON for your Azure SQL server so that the Data Factory service can write data to your Azure SQL server.
To verify and turn on this setting, do the following steps:
a. Click More services on the left, and then click SQL servers.
b. Select your server, and click Firewall under SETTINGS.
c. In the Firewall settings page, click ON for Allow access to Azure services.

Create a Visual Studio project


Using Visual Studio 2015/2017, create a C# .NET console application.
1. Launch Visual Studio.
2. Click File, point to New, and click Project.
3. Select Visual C# -> Console App (.NET Framework) from the list of project types on the right. .NET
version 4.5.2 or above is required.
4. Enter ADFv2Tutorial for the Name.
5. Click OK to create the project.

Install NuGet packages


1. Click Tools -> NuGet Package Manager -> Package Manager Console.
2. In the Package Manager Console, run the following commands to install the packages. Refer to the
Microsoft.Azure.Management.DataFactory NuGet package for details.

Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory

Create a data factory client


1. Open Program.cs, and include the following statements to add references to namespaces.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

2. Add the following code to the Main method that sets variables. Replace the placeholders with your own
values. For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory: Products available
by region. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used
by data factory can be in other regions.

// Set variables
string tenantID = "<your tenant ID>";
string applicationId = "<your application ID>";
string authenticationKey = "<your authentication key for the application>";
string subscriptionId = "<your subscription ID to create the factory>";
string resourceGroup = "<your resource group to create the factory>";

string region = "East US";


string dataFactoryName = "<specify the name of a data factory to create. It must be globally unique.>";

// Specify the source Azure Blob information


string storageAccount = "<your storage account name to copy data>";
string storageKey = "<your storage account key>";
string inputBlobPath = "adfv2tutorial/";
string inputBlobName = "inputEmp.txt";

// Specify the sink Azure SQL Database information


string azureSqlConnString = "Server=tcp:<your server name>.database.windows.net,1433;Database=<your database name>;User ID=<your username>@<your server name>;Password=<your password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30";
string azureSqlTableName = "dbo.emp";

string storageLinkedServiceName = "AzureStorageLinkedService";


string sqlDbLinkedServiceName = "AzureSqlDbLinkedService";
string blobDatasetName = "BlobDataset";
string sqlDatasetName = "SqlDataset";
string pipelineName = "Adfv2TutorialBlobToSqlCopy";

3. Add the following code to the Main method that creates an instance of DataFactoryManagementClient
class. You use this object to create a data factory, linked service, datasets, and pipeline. You also use this
object to monitor the pipeline run details.

// Authenticate and create a data factory management client


var context = new AuthenticationContext("https://login.windows.net/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync("https://management.azure.com/", cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) { SubscriptionId = subscriptionId };

Create a data factory


Add the following code to the Main method that creates a data factory.
// Create a data factory
Console.WriteLine("Creating a data factory " + dataFactoryName + "...");
Factory dataFactory = new Factory
{
Location = region,
Identity = new FactoryIdentity()

};
client.Factories.CreateOrUpdate(resourceGroup, dataFactoryName, dataFactory);
Console.WriteLine(SafeJsonConvert.SerializeObject(dataFactory, client.SerializationSettings));

while (client.Factories.Get(resourceGroup, dataFactoryName).ProvisioningState == "PendingCreation")


{
System.Threading.Thread.Sleep(1000);
}

Create linked services


In this tutorial, you create two linked services, one for the source and one for the sink:
Create an Azure Storage linked service
Add the following code to the Main method that creates an Azure Storage linked service. For supported
properties and details, see Azure Blob linked service properties.

// Create an Azure Storage linked service


Console.WriteLine("Creating linked service " + storageLinkedServiceName + "...");

LinkedServiceResource storageLinkedService = new LinkedServiceResource(


new AzureStorageLinkedService
{
ConnectionString = new SecureString("DefaultEndpointsProtocol=https;AccountName=" + storageAccount +
";AccountKey=" + storageKey)
}
);
client.LinkedServices.CreateOrUpdate(resourceGroup, dataFactoryName, storageLinkedServiceName,
storageLinkedService);
Console.WriteLine(SafeJsonConvert.SerializeObject(storageLinkedService, client.SerializationSettings));

Create an Azure SQL Database linked service


Add the following code to the Main method that creates an Azure SQL Database linked service. For supported
properties and details, see Azure SQL Database linked service properties.

// Create an Azure SQL Database linked service


Console.WriteLine("Creating linked service " + sqlDbLinkedServiceName + "...");

LinkedServiceResource sqlDbLinkedService = new LinkedServiceResource(


new AzureSqlDatabaseLinkedService
{
ConnectionString = new SecureString(azureSqlConnString)
}
);
client.LinkedServices.CreateOrUpdate(resourceGroup, dataFactoryName, sqlDbLinkedServiceName,
sqlDbLinkedService);
Console.WriteLine(SafeJsonConvert.SerializeObject(sqlDbLinkedService, client.SerializationSettings));

Create datasets
In this section, you create two datasets: one for the source and the other for the sink.
Create a dataset for source Azure Blob
Add the following code to the Main method that creates an Azure blob dataset. For supported properties and
details, see Azure Blob dataset properties.
You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure Storage
linked service you create in the previous step, and describes:
The location of the blob to copy from: FolderPath and FileName;
The blob format indicating how to parse the content: TextFormat and its settings (for example, column
delimiter).
The data structure, including column names and data types which in this case map to the sink SQL table.

// Create an Azure Blob dataset


Console.WriteLine("Creating dataset " + blobDatasetName + "...");
DatasetResource blobDataset = new DatasetResource(
new AzureBlobDataset
{
LinkedServiceName = new LinkedServiceReference
{
ReferenceName = storageLinkedServiceName
},
FolderPath = inputBlobPath,
FileName = inputBlobName,
Format = new TextFormat { ColumnDelimiter = "|" },
Structure = new List<DatasetDataElement>
{
new DatasetDataElement
{
Name = "FirstName",
Type = "String"
},
new DatasetDataElement
{
Name = "LastName",
Type = "String"
}
}
}
);
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobDatasetName, blobDataset);
Console.WriteLine(SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings));

Create a dataset for sink Azure SQL Database


Add the following code to the Main method that creates an Azure SQL Database dataset. For supported
properties and details, see Azure SQL Database dataset properties.
You define a dataset that represents the sink data in Azure SQL Database. This dataset refers to the Azure SQL
Database linked service you create in the previous step. It also specifies the SQL table that holds the copied data.
// Create an Azure SQL Database dataset
Console.WriteLine("Creating dataset " + sqlDatasetName + "...");
DatasetResource sqlDataset = new DatasetResource(
new AzureSqlTableDataset
{
LinkedServiceName = new LinkedServiceReference
{
ReferenceName = sqlDbLinkedServiceName
},
TableName = azureSqlTableName
}
);
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, sqlDatasetName, sqlDataset);
Console.WriteLine(SafeJsonConvert.SerializeObject(sqlDataset, client.SerializationSettings));

Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity. In this tutorial, the
pipeline contains one activity: a copy activity that takes the Blob dataset as the source and the SQL dataset as
the sink. For details, see Copy Activity Overview.

// Create a pipeline with copy activity


Console.WriteLine("Creating pipeline " + pipelineName + "...");
PipelineResource pipeline = new PipelineResource
{
Activities = new List<Activity>
{
new CopyActivity
{
Name = "CopyFromBlobToSQL",
Inputs = new List<DatasetReference>
{
new DatasetReference()
{
ReferenceName = blobDatasetName
}
},
Outputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = sqlDatasetName
}
},
Source = new BlobSource { },
Sink = new SqlSink { }
}
}
};
client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName, pipeline);
Console.WriteLine(SafeJsonConvert.SerializeObject(pipeline, client.SerializationSettings));

Create a pipeline run


Add the following code to the Main method that triggers a pipeline run.
// Create a pipeline run
Console.WriteLine("Creating pipeline run...");
CreateRunResponse runResponse = client.Pipelines.CreateRunWithHttpMessagesAsync(resourceGroup,
dataFactoryName, pipelineName).Result.Body;
Console.WriteLine("Pipeline run ID: " + runResponse.RunId);

Monitor a pipeline run


1. Add the following code to the Main method to continuously check the status of the pipeline run until it
finishes copying the data.

// Monitor the pipeline run


Console.WriteLine("Checking pipeline run status...");
PipelineRun pipelineRun;
while (true)
{
pipelineRun = client.PipelineRuns.Get(resourceGroup, dataFactoryName, runResponse.RunId);
Console.WriteLine("Status: " + pipelineRun.Status);
if (pipelineRun.Status == "InProgress")
System.Threading.Thread.Sleep(15000);
else
break;
}

2. Add the following code to the Main method that retrieves copy activity run details, for example, the size of
the data read and written.

// Check the copy activity run details


Console.WriteLine("Checking copy activity run details...");

List<ActivityRun> activityRuns = client.ActivityRuns.ListByPipelineRun(


resourceGroup, dataFactoryName, runResponse.RunId, DateTime.UtcNow.AddMinutes(-10),
DateTime.UtcNow.AddMinutes(10)).ToList();

if (pipelineRun.Status == "Succeeded")
{
Console.WriteLine(activityRuns.First().Output);
}
else
Console.WriteLine(activityRuns.First().Error);

Console.WriteLine("\nPress any key to exit...");


Console.ReadKey();

Run the code


Build and start the application, then verify the pipeline execution.
The console prints the progress of creating a data factory, linked service, datasets, pipeline, and pipeline run. It
then checks the pipeline run status. Wait until you see the copy activity run details with data read/written size.
Then, use tools such as SSMS (SQL Server Management Studio) or Visual Studio to connect to your destination
Azure SQL Database and check if the data is copied into the table you specified.
Sample output

Creating a data factory AdfV2Tutorial...


{
"identity": {
"type": "SystemAssigned"
},
"location": "East US"
}
Creating linked service AzureStorageLinkedService...
{
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=<accountKey>"
}
}
}
}
Creating linked service AzureSqlDbLinkedService...
{
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=
<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
}
Creating dataset BlobDataset...
{
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "adfv2tutorial/",
"fileName": "inputEmp.txt",
"format": {
"type": "TextFormat",
"columnDelimiter": "|"
}
},
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "AzureStorageLinkedService"
}
}
}
Creating dataset SqlDataset...
{
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "dbo.emp"
},
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "AzureSqlDbLinkedService"
}
}
}
}
Creating pipeline Adfv2TutorialBlobToSqlCopy...
{
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink"
}
},
"inputs": [
{
"type": "DatasetReference",
"referenceName": "BlobDataset"
}
],
"outputs": [
{
"type": "DatasetReference",
"referenceName": "SqlDataset"
}
],
"name": "CopyFromBlobToSQL"
}
]
}
}
Creating pipeline run...
Pipeline run ID: 1cd03653-88a0-4c90-aabc-ae12d843e252
Checking pipeline run status...
Status: InProgress
Status: InProgress
Status: Succeeded
Checking copy activity run details...
{
"dataRead": 18,
"dataWritten": 28,
"rowsCopied": 2,
"copyDuration": 2,
"throughput": 0.01,
"errors": [],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)",
"usedDataIntegrationUnits": 2,
"billedDuration": 2
}

Press any key to exit...

Next steps
The pipeline in this sample copies data from Azure Blob storage to Azure SQL Database. You
learned how to:
Create a data factory.
Create Azure Storage and Azure SQL Database linked services.
Create Azure Blob and Azure SQL Database datasets.
Create a pipeline that contains a Copy activity.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copying data from on-premises to cloud:
Copy data from on-premises to cloud
Copy data from an on-premises SQL Server
database to Azure Blob storage by using the Copy
Data tool
4/8/2019 • 8 minutes to read

In this tutorial, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a
pipeline that copies data from an on-premises SQL Server database to Azure Blob storage.

NOTE
If you're new to Azure Data Factory, see Introduction to Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.

Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to log in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal. Select your user name in the upper-
right corner, and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription. For sample instructions on how to add a user to a role, see Manage access using RBAC and the
Azure portal.
SQL Server 2014, 2016, and 2017
In this tutorial, you use an on-premises SQL Server database as a source data store. The pipeline in the data
factory you create in this tutorial copies data from this on-premises SQL Server database (source) to Blob storage
(sink). You then create a table named emp in your SQL Server database and insert a couple of sample entries into
the table.
1. Start SQL Server Management Studio. If it's not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases, and then select New Database.
4. In the New Database window, enter a name for the database, and then select OK.
5. To create the emp table and insert some sample data into it, run the following query script against the
database. In the tree view, right-click the database that you created, and then select New Query.
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

INSERT INTO emp (FirstName, LastName) VALUES ('John', 'Doe')


INSERT INTO emp (FirstName, LastName) VALUES ('Jane', 'Doe')
GO

Azure storage account


In this tutorial, you use a general-purpose Azure storage account (specifically, Blob storage) as a destination/sink
data store. If you don't have a general-purpose storage account, see Create a storage account for instructions to
create one. The pipeline in the data factory that you create in this tutorial copies data from the on-premises SQL
Server database (source) to this Blob storage (sink).
Get the storage account name and account key
You use the name and key of your storage account in this tutorial. To get the name and key of your storage
account, take the following steps:
1. Sign in to the Azure portal with your Azure user name and password.
2. In the left pane, select More services. Filter by using the Storage keyword, and then select Storage
accounts.

3. In the list of storage accounts, filter for your storage account, if needed. Then select your storage account.
4. In the Storage account window, select Access keys.
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
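If you prefer PowerShell, you can retrieve the account key without going through the portal blades. A short
sketch, assuming your own resource group and storage account names in place of the placeholders:

# Retrieve the first access key for the storage account
$key = (Get-AzStorageAccountKey -ResourceGroupName "<resource group name>" -Name "<storage account name>")[0].Value
$key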
Create the adftutorial container
In this section, you create a blob container named adftutorial in your Blob storage.
1. In the Storage account window, switch to Overview, and then select Blobs.

2. In the Blob service window, select Container.

3. In the New container window, in the Name box, enter adftutorial, and then select OK.

4. In the list of containers, select adftutorial.


5. Keep the Container window for adftutorial open. You use it to verify the output at the end of the tutorial.
Data Factory automatically creates the output folder in this container, so you don't need to create one.

Create a data factory


1. On the menu on the left, select New > Data + Analytics > Data Factory.
2. On the New data factory page, under Name, enter ADFTutorialDataFactory.
The name of the data factory must be globally unique. If you see the following error message for the name
field, change the name of the data factory (for example, yournameADFTutorialDataFactory). For naming
rules for Data Factory artifacts, see Data Factory naming rules.

3. Select the Azure subscription in which you want to create the data factory.
4. For Resource Group, take one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under Version, select V2.
6. Under Location, select the location for the data factory. Only locations that are supported are displayed in
the drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for
example, Azure HDInsight) used by Data Factory can be in other locations/regions.
7. Select Pin to dashboard.
8. Select Create.
9. On the dashboard, you see the following tile with the status Deploying Data Factory:
10. After the creation is finished, you see the Data Factory page as shown in the image.

11. Select Author & Monitor to launch the Data Factory user interface in a separate tab.

Use the Copy Data tool to create a pipeline


1. On the Let's get started page, select Copy Data to launch the Copy Data tool.
2. On the Properties page of the Copy Data tool, under Task name, enter
CopyFromOnPremSqlToAzureBlobPipeline. Then select Next. The Copy Data tool creates a pipeline
with the name you specify for this field.

3. On the Source data store page, click on Create new connection.


4. Under New Linked Service, search for SQL Server, and then select Next.

5. Under New Linked Service (SQL Server) Name, enter SqlServerLinkedService. Select +New under
Connect via integration runtime. You must create a self-hosted integration runtime, download it to your
machine, and register it with Data Factory. The self-hosted integration runtime copies data between your
on-premises environment and the cloud.
6. In the Integration Runtime Setup dialog box, select Private Network. Then select Next.
7. In the Integration Runtime Setup dialog box under Name, enter TutorialIntegrationRuntime. Then
select Next.

8. Select Click here to launch the express setup for this computer. This action installs the integration
runtime on your machine and registers it with Data Factory. Alternatively, you can use the manual setup
option to download the installation file, run it, and use the key to register the integration runtime.
9. Run the downloaded application. You see the status of the express setup in the window.
10. Confirm that TutorialIntegrationRuntime is selected for the Integration Runtime field.
11. In Specify the on-premises SQL Server database, take the following steps:
a. Under Name, enter SqlServerLinkedService.
b. Under Server name, enter the name of your on-premises SQL Server instance.
c. Under Database name, enter the name of your on-premises database.
d. Under Authentication type, select appropriate authentication.
e. Under User name, enter the name of user with access to on-premises SQL Server.
f. Enter the password for the user. Select Finish.
12. Select Next.
13. On the Select tables from which to copy the data or use a custom query page, select the [dbo].[emp]
table in the list, and select Next. You can select any other table based on your database.

14. On the Destination data store page, select Create new connection.
15. In New Linked Service, search for and select Azure Blob Storage, and then select Continue.
16. On the New Linked Service (Azure Blob Storage) dialog, take the following steps:

a. Under Name, enter AzureStorageLinkedService.
b. Under Connect via integration runtime, select TutorialIntegrationRuntime.
c. Under Storage account name, select your storage account from the drop-down list.
d. Select Next.

17. In the Destination data store dialog, select Next. In Connection properties, for Azure storage service,
select Azure Blob Storage, and then select Next.
18. In the Choose the output file or folder dialog, under Folder path, enter adftutorial/fromonprem. You
created the adftutorial container as part of the prerequisites. If the output folder doesn't exist (in this case
fromonprem ), Data Factory automatically creates it. You also can use the Browse button to browse the
blob storage and its containers/folders. If you do not specify any value under File name, by default the
name from the source would be used (in this case dbo.emp).

19. On the File format settings dialog, select Next.


20. On the Settings dialog, select Next.
21. On the Summary dialog, review values for all the settings, and select Next.

22. On the Deployment page, select Monitor to monitor the pipeline or task you created.

23. On the Monitor tab, you can view the status of the pipeline you created. You can use the links in the Action
column to view activity runs associated with the pipeline run and to rerun the pipeline.
24. Select the View Activity Runs link in the Actions column to see activity runs associated with the pipeline
run. To see details about the copy operation, select the Details link (eyeglasses icon) in the Actions column.
To switch back to the Pipeline Runs view, select Pipelines at the top.

25. Confirm that you see the output file in the fromonprem folder of the adftutorial container.

26. Select the Edit tab on the left to switch to the editor mode. You can update the linked services, datasets, and
pipelines created by the tool by using the editor. Select Code to view the JSON code associated with the
entity opened in the editor. For details on how to edit these entities in the Data Factory UI, see the Azure
portal version of this tutorial.
Next steps
The pipeline in this sample copies data from an on-premises SQL Server database to Blob storage. You learned
how to:
Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.
For a list of data stores that are supported by Data Factory, see Supported data stores.
To learn about how to copy data in bulk from a source to a destination, advance to the following tutorial:
Copy data in bulk
Copy data from an on-premises SQL Server database
to Azure Blob storage
4/8/2019 • 9 minutes to read

In this tutorial, you use the Azure Data Factory user interface (UI) to create a data factory pipeline that copies data
from an on-premises SQL Server database to Azure Blob storage. You create and use a self-hosted integration
runtime, which moves data between on-premises and cloud data stores.

NOTE
This article doesn't provide a detailed introduction to Data Factory. For more information, see Introduction to Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Create a self-hosted integration runtime.
Create SQL Server and Azure Storage linked services.
Create SQL Server and Azure Blob datasets.
Create a pipeline with a copy activity to move the data.
Start a pipeline run.
Monitor the pipeline run.

Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to sign in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal. In the upper-right corner, select your
user name, and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription. For sample instructions on how to add a user to a role, see Manage access using RBAC and the Azure
portal.
SQL Server 2014, 2016, and 2017
In this tutorial, you use an on-premises SQL Server database as a source data store. The pipeline in the data
factory you create in this tutorial copies data from this on-premises SQL Server database (source) to Blob storage
(sink). You then create a table named emp in your SQL Server database and insert a couple of sample entries into
the table.
1. Start SQL Server Management Studio. If it's not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases, and then select New Database.
4. In the New Database window, enter a name for the database, and then select OK.
5. To create the emp table and insert some sample data into it, run the following query script against the
database:

CREATE TABLE dbo.emp


(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

INSERT INTO emp (FirstName, LastName) VALUES ('John', 'Doe')


INSERT INTO emp (FirstName, LastName) VALUES ('Jane', 'Doe')
GO

6. In the tree view, right-click the database that you created, and then select New Query.
Azure storage account
In this tutorial, you use a general-purpose Azure storage account (specifically, Blob storage) as a destination/sink
data store. If you don't have a general-purpose Azure storage account, see Create a storage account. The pipeline
in the data factory that you create in this tutorial copies data from the on-premises SQL Server database (source)
to Blob storage (sink).
Get the storage account name and account key
You use the name and key of your storage account in this tutorial. To get the name and key of your storage
account, take the following steps:
1. Sign in to the Azure portal with your Azure user name and password.
2. In the left pane, select More services. Filter by using the Storage keyword, and then select Storage
accounts.

3. In the list of storage accounts, filter for your storage account, if needed. Then select your storage account.
4. In the Storage account window, select Access keys.
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
Create the adftutorial container
In this section, you create a blob container named adftutorial in your Blob storage.
1. In the Storage account window, go to Overview, and then select Blobs.

2. In the Blob service window, select Container.

3. In the New container window, under Name, enter adftutorial. Then select OK.

4. In the list of containers, select adftutorial.


5. Keep the container window for adftutorial open. You use it to verify the output at the end of the tutorial.
Data Factory automatically creates the output folder in this container, so you don't need to create one.

Create a data factory


In this step, you create a data factory and start the Data Factory UI to create a pipeline in the data factory.
1. Open the Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only
in Microsoft Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Data + Analytics > Data Factory:
3. On the New data factory page, under Name, enter ADFTutorialDataFactory.

The name of the data factory must be globally unique. If you see the following error message for the name field,
change the name of the data factory (for example, yournameADFTutorialDataFactory). For naming rules for Data
Factory artifacts, see Data Factory naming rules.

4. Select the Azure subscription in which you want to create the data factory.
5. For Resource Group, take one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
6. Under Version, select V2.
7. Under Location, select the location for the data factory. Only locations that are supported are displayed in
the drop-down list. The data stores (for example, Storage and SQL Database) and computes (for example,
Azure HDInsight) used by Data Factory can be in other regions.
8. Select Pin to dashboard.
9. Select Create.
10. On the dashboard, you see the following tile with the status Deploying Data Factory:
11. After the creation is finished, you see the Data Factory page as shown in the image:

12. Select the Author & Monitor tile to launch the Data Factory UI in a separate tab.

Create a pipeline
1. On the Let's get started page, select Create pipeline. A pipeline is automatically created for you. You see
the pipeline in the tree view, and its editor opens.
2. On the General tab at the bottom of the Properties window, in Name, enter SQLServerToBlobPipeline.

3. In the Activities tool box, expand DataFlow. Drag and drop the Copy activity to the pipeline design
surface. Set the name of the activity to CopySqlServerToAzureBlobActivity.
4. In the Properties window, go to the Source tab, and select + New.

5. In the New Dataset window, search for SQL Server. Select SQL Server, and then select Finish. You see a
new tab titled SqlServerTable1. You also see the SqlServerTable1 dataset in the tree view on the left.
6. On the General tab at the bottom of the Properties window, in Name, enter SqlServerDataset.
7. Go to the Connection tab, and select + New. You create a connection to the source data store (SQL Server
database) in this step.

8. In the New Linked Service window, enter SqlServerLinkedService for Name. Select New under Connect
via integration runtime. In this section, you create a self-hosted integration runtime and associate it with
an on-premises machine with the SQL Server database. The self-hosted integration runtime is the
component that copies data from the SQL Server database on your machine to Blob storage.
9. In the Integration Runtime Setup window, select Private Network, and then select Next.
10. Enter a name for the integration runtime, and select Next.
11. Under Option 1: Express setup, select Click here to launch the express setup for this computer.
12. In the Integration Runtime (Self-hosted) Express Setup window, select Close.
13. In the New Linked Service window, ensure the Integration Runtime created above is selected under
Connect via integration runtime.
14. In the New Linked Service window, take the following steps:
a. Under Name, enter SqlServerLinkedService.
b. Under Connect via integration runtime, confirm that the self-hosted integration runtime you created
earlier shows up.
c. Under Server name, enter the name of your SQL Server instance.
d. Under Database name, enter the name of the database with the emp table.
e. Under Authentication type, select the appropriate authentication type that Data Factory should use to
connect to your SQL Server database.
f. Under User name and Password, enter the user name and password. If you need to use a backslash (\) in
your user account or server name, precede it with the escape character (\). For example, use
mydomain\\myuser.
g. Select Test connection. Do this step to confirm that Data Factory can connect to your SQL Server
database by using the self-hosted integration runtime you created.
h. To save the linked service, select Finish.
15. You should be back in the window with the source dataset opened. On the Connection tab of the
Properties window, take the following steps:
a. In Linked service, confirm that you see SqlServerLinkedService.
b. In Table, select [dbo].[emp].

16. Go to the tab with SQLServerToBlobPipeline, or select SQLServerToBlobPipeline in the tree view.

17. Go to the Sink tab at the bottom of the Properties window, and select + New.
18. In the New Dataset window, select Azure Blob Storage. Then select Finish. You see a new tab opened for
the dataset. You also see the dataset in the tree view.
19. In Name, enter AzureBlobDataset.
20. Go to the Connection tab at the bottom of the Properties window. Next to Linked service, select + New.

21. In the New Linked Service window, take the following steps:
a. Under Name, enter AzureStorageLinkedService.
b. Under Storage account name, select your storage account.
c. To test the connection to your storage account, select Test connection.
d. Select Save.

22. You should be back in the window with the sink dataset open. On the Connection tab, take the following
steps:
a. In Linked service, confirm that AzureStorageLinkedService is selected.
b. For the folder/directory part of File path, enter adftutorial/fromonprem. If the output folder doesn't
exist in the adftutorial container, Data Factory automatically creates the output folder.
c. For the file name part of File path, select Add dynamic content.
d. Add @CONCAT(pipeline().RunId, '.txt'), and then select Finish. This expression names the output file after the pipeline run ID, with a .txt extension.
23. Go to the tab with the pipeline opened, or select the pipeline in the tree view. In Sink Dataset, confirm that
AzureBlobDataset is selected.

24. To validate the pipeline settings, select Validate on the toolbar for the pipeline. To close the Pipeline
Validation Report, select Close.
25. To publish entities you created to Data Factory, select Publish All.

26. Wait until you see the Publishing succeeded pop-up. To check the status of publishing, select the Show
Notifications link on the left. To close the notification window, select Close.
Trigger a pipeline run
Select Trigger on the toolbar for the pipeline, and then select Trigger Now.
Monitor the pipeline run
1. Go to the Monitor tab. You see the pipeline that you manually triggered in the previous step.

2. To view activity runs associated with the pipeline run, select the View Activity Runs link in the Actions
column. You see only one activity run because there is only one activity in the pipeline. To see details about the
copy operation, select the Details link (eyeglasses icon) in the Actions column. To go back to the Pipeline
Runs view, select Pipelines at the top.

Verify the output


The pipeline automatically creates the output folder named fromonprem in the adftutorial blob container.
Confirm that you see a file named [pipeline().RunId].txt (the pipeline run ID with a .txt extension) in the output folder.
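You can also list the blobs in the output folder from Azure PowerShell rather than browsing the container. A
minimal sketch, assuming your own storage account name and key in place of the placeholders:

# List the blobs that the pipeline wrote to the fromonprem folder
$ctx = New-AzStorageContext -StorageAccountName "<storage account name>" -StorageAccountKey "<storage account key>"
Get-AzStorageBlob -Container "adftutorial" -Prefix "fromonprem/" -Context $ctx | Select-Object Name, Length, LastModified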
Next steps
The pipeline in this sample copies data from an on-premises SQL Server database to Azure Blob storage. You learned how to:
Create a data factory.
Create a self-hosted integration runtime.
Create SQL Server and Storage linked services.
Create SQL Server and Blob storage datasets.
Create a pipeline with a copy activity to move the data.
Start a pipeline run.
Monitor the pipeline run.
For a list of data stores that are supported by Data Factory, see Supported data stores.
To learn how to copy data in bulk from a source to a destination, advance to the following tutorial:
Copy data in bulk
Tutorial: Copy data from an on-premises SQL Server
database to Azure Blob storage
4/8/2019 • 15 minutes to read

In this tutorial, you use Azure PowerShell to create a data-factory pipeline that copies data from an on-premises
SQL Server database to Azure Blob storage. You create and use a self-hosted integration runtime, which moves
data between on-premises and cloud data stores.

NOTE
This article does not provide a detailed introduction to the Data Factory service. For more information, see Introduction to
Azure Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Create a self-hosted integration runtime.
Create SQL Server and Azure Storage linked services.
Create SQL Server and Azure Blob datasets.
Create a pipeline with a copy activity to move the data.
Start a pipeline run.
Monitor the pipeline run.

Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to log in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal, select your username at the top-right
corner, and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription. For sample instructions on adding a user to a role, see the Manage access using RBAC and the Azure
portal article.
SQL Server 2014, 2016, and 2017
In this tutorial, you use an on-premises SQL Server database as a source data store. The pipeline in the data
factory you create in this tutorial copies data from this on-premises SQL Server database (source) to Azure Blob
storage (sink). You then create a table named emp in your SQL Server database, and insert a couple of sample
entries into the table.
1. Start SQL Server Management Studio. If it is not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases, and then select New Database.
4. In the New Database window, enter a name for the database, and then select OK.
5. To create the emp table and insert some sample data into it, run the following query script against the
database:

CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

INSERT INTO emp (FirstName, LastName) VALUES ('John', 'Doe')
INSERT INTO emp (FirstName, LastName) VALUES ('Jane', 'Doe')
GO

6. In the tree view, right-click the database that you created, and then select New Query.
Azure Storage account
In this tutorial, you use a general-purpose Azure storage account (specifically, Azure Blob storage) as a
destination/sink data store. If you don't have a general-purpose Azure storage account, see Create a storage
account. The pipeline in the data factory that you create in this tutorial copies data from the on-premises SQL
Server database (source) to this Azure Blob storage (sink).
Get storage account name and account key
You use the name and key of your Azure storage account in this tutorial. Get the name and key of your storage
account by doing the following:
1. Sign in to the Azure portal with your Azure username and password.
2. In the left pane, select More services, filter by using the Storage keyword, and then select Storage
accounts.

3. In the list of storage accounts, filter for your storage account (if needed), and then select your storage
account.
4. In the Storage account window, select Access keys.
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
Create the adftutorial container
In this section, you create a blob container named adftutorial in your Azure Blob storage.
1. In the Storage account window, switch to Overview, and then select Blobs.

2. In the Blob service window, select Container.

3. In the New container window, in the Name box, enter adftutorial, and then select OK.

4. In the list of containers, select adftutorial.


5. Keep the container window for adftutorial open. You use it to verify the output at the end of the tutorial.
Data Factory automatically creates the output folder in this container, so you don't need to create one.

Windows PowerShell
Install Azure PowerShell

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Install the latest version of Azure PowerShell if you don't already have it on your machine. For detailed
instructions, see How to install and configure Azure PowerShell.
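If the Az module isn't installed yet, one common way to install it from the PowerShell Gallery is shown below.
This is a sketch rather than the only supported method; the installation article linked above covers other options:

# Install the Az module for the current user
Install-Module -Name Az -Repository PSGallery -Scope CurrentUser -AllowClobber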
Log in to PowerShell
1. Start PowerShell on your machine, and keep it open through the completion of this tutorial. If you
close and reopen it, you'll need to run these commands again.
2. Run the following command, and then enter the Azure username and password that you use to sign in to
the Azure portal:

Connect-AzAccount

3. If you have multiple Azure subscriptions, run the following command to select the subscription that you
want to work with. Replace SubscriptionId with the ID of your Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

Create a data factory


1. Define a variable for the resource group name that you'll use later in PowerShell commands. Copy the
following command to PowerShell, specify a name for the Azure resource group (enclosed in double
quotation marks; for example, "adfrg" ), and then run the command.

$resourceGroupName = "ADFTutorialResourceGroup"

2. To create the Azure resource group, run the following commands. The $location variable is also used later
when you create the data factory:

$location = "East US"
New-AzResourceGroup -Name $resourceGroupName -Location $location

If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$resourceGroupName variable and run the command again.

3. Define a variable for the data factory name that you can use in PowerShell commands later. The name must
start with a letter or a number, and it can contain only letters, numbers, and the dash (-) character.

IMPORTANT
Update the data factory name with a globally unique name. An example is ADFTutorialFactorySP1127.

$dataFactoryName = "ADFTutorialFactory"

4. Confirm the variable for the location of the data factory (you set it in step 2). Change it if you want to use a
different supported region:

$location = "East US"

5. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet:

Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location $location -Name $dataFactoryName

NOTE
The name of the data factory must be globally unique. If you receive the following error, change the name and try again.

The specified data factory name 'ADFv2TutorialDataFactory' is already in use. Data factory names
must be globally unique.

To create data-factory instances, the user account that you use to sign in to Azure must be assigned a contributor or
owner role or must be an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on the
following page, and then expand Analytics to locate Data Factory: Products available by region. The data stores (Azure
Storage, Azure SQL Database, and so on) and computes (Azure HDInsight and so on) used by the data factory can be in
other regions.
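
As an optional check, you can confirm that the data factory was created by retrieving it with the Get-AzDataFactoryV2 cmdlet:

# Returns the factory's resource ID, location, and provisioning state.
Get-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Name $dataFactoryName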

Create a self-hosted integration runtime


In this section, you create a self-hosted integration runtime and associate it with an on-premises machine with the
SQL Server database. The self-hosted integration runtime is the component that copies data from the SQL Server
database on your machine to Azure Blob storage.
1. Create a variable for the name of integration runtime. Use a unique name, and note the name. You use it
later in this tutorial.

$integrationRuntimeName = "ADFTutorialIR"

2. Create a self-hosted integration runtime.

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $integrationRuntimeName -Type SelfHosted -Description "selfhosted IR description"

Here is the sample output:


Id : /subscriptions/<subscription
ID>/resourceGroups/ADFTutorialResourceGroup/providers/Microsoft.DataFactory/factories/onpremdf0914/inte
grationruntimes/myonpremirsp0914
Type : SelfHosted
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Name : myonpremirsp0914
Description : selfhosted IR description

3. To retrieve the status of the created integration runtime, run the following command:

Get-AzDataFactoryV2IntegrationRuntime -Name $integrationRuntimeName -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Status

Here is the sample output:

Nodes : {}
CreateTime : 9/14/2017 10:01:21 AM
InternalChannelEncryption :
Version :
Capabilities : {}
ScheduledUpdateDate :
UpdateDelayOffset :
LocalTimeZoneOffset :
AutoUpdate :
ServiceUrls : {eu.frontend.clouddatahub.net, *.servicebus.windows.net}
ResourceGroupName : <ResourceGroup name>
DataFactoryName : <DataFactory name>
Name : <Integration Runtime name>
State : NeedRegistration

4. To retrieve the authentication keys for registering the self-hosted integration runtime with the Data Factory
service in the cloud, run the following command. Copy one of the keys (excluding the quotation marks) for
registering the self-hosted integration runtime that you install on your machine in the next step.

Get-AzDataFactoryV2IntegrationRuntimeKey -Name $integrationRuntimeName -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName | ConvertTo-Json

Here is the sample output:

{
"AuthKey1": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=",
"AuthKey2": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy="
}

Install the integration runtime


1. Download Azure Data Factory Integration Runtime on a local Windows machine, and then run the
installation.
2. In the Welcome to Microsoft Integration Runtime Setup wizard, select Next.
3. In the End-User License Agreement window, accept the terms and license agreement, and select Next.
4. In the Destination Folder window, select Next.
5. In the Ready to install Microsoft Integration Runtime window, select Install.
6. If you see a warning message about the computer being configured to enter sleep or hibernate mode when
not in use, select OK.
7. If a Power Options window is displayed, close it, and switch to the setup window.
8. In the Completed the Microsoft Integration Runtime Setup wizard, select Finish.
9. In the Register Integration Runtime (Self-hosted) window, paste the key you saved in the previous
section, and then select Register.

When the self-hosted integration runtime is registered successfully, the following message is displayed:
10. In the New Integration Runtime (Self-hosted) Node window, select Next.

11. In the Intranet Communication Channel window, select Skip.


You can select a TLS/SSL certificate for securing intra-node communication in a multi-node integration
runtime environment.
12. In the Register Integration Runtime (Self-hosted) window, select Launch Configuration Manager.
13. When the node is connected to the cloud service, the following message is displayed:

14. Test the connectivity to your SQL Server database by doing the following:
a. In the Configuration Manager window, switch to the Diagnostics tab.
b. In the Data source type box, select SqlServer.
c. Enter the server name.
d. Enter the database name.
e. Select the authentication mode.
f. Enter the username.
g. Enter the password that's associated with the username.
h. To confirm that the integration runtime can connect to SQL Server, select Test.
If the connection is successful, a green checkmark icon is displayed. Otherwise, you'll receive an error
message associated with the failure. Fix any issues, and ensure that the integration runtime can connect to
your SQL Server instance.
Note all the preceding values for later use in this tutorial.
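
You can also confirm the registration from PowerShell. Rerunning the status command from earlier should now report a State of Online instead of NeedRegistration:

# Re-check the self-hosted integration runtime after registration.
Get-AzDataFactoryV2IntegrationRuntime -Name $integrationRuntimeName -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Status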

Create linked services


To link your data stores and compute services to the data factory, create linked services in the data factory. In this
tutorial, you link your Azure storage account and on-premises SQL Server instance to the data factory. The linked
services have the connection information that the Data Factory service uses at runtime to connect to them.
Create an Azure Storage linked service (destination/sink)
In this step, you link your Azure storage account to the data factory.
1. Create a JSON file named AzureStorageLinkedService.json in the C:\ADFv2Tutorial folder with the
following code. If the ADFv2Tutorial folder does not already exist, create it.
IMPORTANT
Before you save the file, replace <accountName> and <accountKey> with the name and key of your Azure storage
account. You noted them in the Prerequisites section.

{
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>;EndpointSuffix=core.windows.net"
}
}
},
"name": "AzureStorageLinkedService"
}

2. In PowerShell, switch to the C:\ADFv2Tutorial folder.


3. To create the linked service, AzureStorageLinkedService, run the following
Set-AzDataFactoryV2LinkedService cmdlet:

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureStorageLinkedService" -File ".\AzureStorageLinkedService.json"

Here is a sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService

If you receive a "file not found" error, confirm that the file exists by running the dir command. If the file
name has a .txt extension (for example, AzureStorageLinkedService.json.txt), remove it, and then run the
PowerShell command again.
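
As an optional check, you can list the linked services that exist in the data factory so far:

# Lists all linked services in the factory; AzureStorageLinkedService should appear.
Get-AzDataFactoryV2LinkedService -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName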
Create and encrypt a SQL Server linked service (source)
In this step, you link your on-premises SQL Server instance to the data factory.
1. Create a JSON file named SqlServerLinkedService.json in the C:\ADFv2Tutorial folder by using the
following code:

IMPORTANT
Select the section that's based on the authentication that you use to connect to SQL Server.

Using SQL authentication (sa):


{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<servername>;Database=<databasename>;User ID=<username>;Password=<password>;Timeout=60"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
}
},
"name": "SqlServerLinkedService"
}

Using Windows authentication:

{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Database=<database>;Integrated Security=True"
},
"userName": "<user> or <domain>\\<user>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
}
},
"name": "SqlServerLinkedService"
}

IMPORTANT
Select the section that's based on the authentication you use to connect to your SQL Server instance.
Replace <integration runtime name> with the name of your integration runtime.
Before you save the file, replace <servername>, <databasename>, <username>, and <password> with the
values of your SQL Server instance.
If you need to use a backslash (\) in your user account or server name, precede it with the escape character (\).
For example, use mydomain\\myuser.

2. To encrypt the sensitive data (username, password, and so on), run the
New-AzDataFactoryV2LinkedServiceEncryptedCredential cmdlet.
This encryption ensures that the credentials are encrypted by using the Data Protection Application Programming
Interface (DPAPI). The encrypted credentials are stored locally on the self-hosted integration runtime node
(your local machine). The output payload is redirected to another JSON file (in this case,
encryptedSqlServerLinkedService.json) that contains the encrypted credentials.

New-AzDataFactoryV2LinkedServiceEncryptedCredential -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -IntegrationRuntimeName $integrationRuntimeName -File ".\SqlServerLinkedService.json" > encryptedSqlServerLinkedService.json

3. Run the following command, which creates EncryptedSqlServerLinkedService:

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "EncryptedSqlServerLinkedService" -File ".\encryptedSqlServerLinkedService.json"

Create datasets
In this step, you create input and output datasets. They represent input and output data for the copy operation,
which copies data from the on-premises SQL Server database to Azure Blob storage.
Create a dataset for the source SQL Server database
In this step, you define a dataset that represents data in the SQL Server database instance. The dataset is of type
SqlServerTable. It refers to the SQL Server linked service that you created in the preceding step. The linked
service has the connection information that the Data Factory service uses to connect to your SQL Server instance
at runtime. This dataset specifies the SQL table in the database that contains the data. In this tutorial, the emp
table contains the source data.
1. Create a JSON file named SqlServerDataset.json in the C:\ADFv2Tutorial folder, with the following code:

{
"properties": {
"type": "SqlServerTable",
"typeProperties": {
"tableName": "dbo.emp"
},
"structure": [
{
"name": "ID",
"type": "String"
},
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"linkedServiceName": {
"referenceName": "EncryptedSqlServerLinkedService",
"type": "LinkedServiceReference"
}
},
"name": "SqlServerDataset"
}

2. To create the dataset SqlServerDataset, run the Set-AzDataFactoryV2Dataset cmdlet.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SqlServerDataset" -File ".\SqlServerDataset.json"

Here is the sample output:


DatasetName : SqlServerDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Structure : {"name": "ID" "type": "String", "name": "FirstName" "type": "String", "name":
"LastName" "type": "String"}
Properties : Microsoft.Azure.Management.DataFactory.Models.SqlServerTableDataset

Create a dataset for Azure Blob storage (sink)


In this step, you define a dataset that represents data that will be copied to Azure Blob storage. The dataset is of
the type AzureBlob. It refers to the Azure Storage linked service that you created earlier in this tutorial.
The linked service has the connection information that the data factory uses at runtime to connect to your Azure
storage account. This dataset specifies the folder in the Azure storage to which the data is copied from the SQL
Server database. In this tutorial, the folder is adftutorial/fromonprem, where adftutorial is the blob container
and fromonprem is the folder.
1. Create a JSON file named AzureBlobDataset.json in the C:\ADFv2Tutorial folder, with the following code:

{
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "adftutorial/fromonprem",
"format": {
"type": "TextFormat"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
},
"name": "AzureBlobDataset"
}

2. To create the dataset AzureBlobDataset, run the Set-AzDataFactoryV2Dataset cmdlet.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureBlobDataset" -File ".\AzureBlobDataset.json"

Here is the sample output:

DatasetName : AzureBlobDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset
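
Both datasets should now exist in the data factory. As an optional check, you can list them:

# Lists all datasets in the factory; SqlServerDataset and AzureBlobDataset should appear.
Get-AzDataFactoryV2Dataset -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName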

Create a pipeline
In this tutorial, you create a pipeline with a copy activity. The copy activity uses SqlServerDataset as the input
dataset and AzureBlobDataset as the output dataset. The source type is set to SqlSource and the sink type is set to
BlobSink.
1. Create a JSON file named SqlServerToBlobPipeline.json in the C:\ADFv2Tutorial folder, with the following
code:
{
"name": "SQLServerToBlobPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource"
},
"sink": {
"type":"BlobSink"
}
},
"name": "CopySqlServerToAzureBlobActivity",
"inputs": [
{
"referenceName": "SqlServerDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlobDataset",
"type": "DatasetReference"
}
]
}
]
}
}

2. To create the pipeline SQLServerToBlobPipeline, run the Set-AzDataFactoryV2Pipeline cmdlet.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SQLServerToBlobPipeline" -File ".\SQLServerToBlobPipeline.json"

Here is the sample output:

PipelineName : SQLServerToBlobPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Activities : {CopySqlServerToAzureBlobActivity}
Parameters :

Create a pipeline run


Start a pipeline run for the SQLServerToBlobPipeline pipeline, and capture the pipeline run ID for future
monitoring.

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineName 'SQLServerToBlobPipeline'

Monitor the pipeline run


1. To continuously check the run status of pipeline SQLServerToBlobPipeline, run the following script in
PowerShell, and print the final result:
while ($True) {
    $result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)

    if (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
        Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
        Start-Sleep -Seconds 30
    }
    else {
        Write-Host "Pipeline 'SQLServerToBlobPipeline' run finished. Result:" -foregroundcolor "Yellow"
        $result
        break
    }
}

Here is the output of the sample run:

ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
ActivityName : copy
PipelineRunId : 4ec8980c-62f6-466f-92fa-e69b10f33640
PipelineName : SQLServerToBlobPipeline
Input :
Output :
LinkedServiceName :
ActivityRunStart : 9/13/2017 1:35:22 PM
ActivityRunEnd : 9/13/2017 1:35:42 PM
DurationInMs : 20824
Status : Succeeded
Error : {errorCode, message, failureType, target}

2. You can get the run ID of pipeline SQLServerToBlobPipeline and check the detailed activity run result by
running the following command:

Write-Host "Pipeline 'SQLServerToBlobPipeline' run result:" -foregroundcolor "Yellow"
($result | Where-Object {$_.ActivityName -eq "CopySqlServerToAzureBlobActivity"}).Output.ToString()

Here is the output of the sample run:

{
"dataRead": 36,
"dataWritten": 24,
"rowsCopied": 2,
"copyDuration": 3,
"throughput": 0.01171875,
"errors": [],
"effectiveIntegrationRuntime": "MyIntegrationRuntime",
"billedDuration": 3
}
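
You can also check the overall status of the pipeline run (rather than the individual activity runs) by reusing the run ID captured earlier:

# Shows the run status, start and end times, and duration for the whole pipeline run.
Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineRunId $runId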

Verify the output


The pipeline automatically creates the output folder named fromonprem in the adftutorial blob container.
Confirm that you see the dbo.emp.txt file in the output folder.
1. In the Azure portal, in the adftutorial container window, select Refresh to see the output folder.
2. Select fromonprem in the list of folders.
3. Confirm that you see a file named dbo.emp.txt .
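
Alternatively, you can list the blobs from PowerShell instead of the portal. This is a minimal sketch that assumes $storageAccountName and $storageAccountKey hold the values you noted in the prerequisites:

# List the blobs that the copy activity wrote to the fromonprem folder.
$context = New-AzStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
Get-AzStorageBlob -Container "adftutorial" -Prefix "fromonprem/" -Context $context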

Next steps
The pipeline in this sample copies data from an on-premises SQL Server database to Azure Blob storage. You learned how to:
Create a data factory.
Create a self-hosted integration runtime.
Create SQL Server and Azure Storage linked services.
Create SQL Server and Azure Blob datasets.
Create a pipeline with a copy activity to move the data.
Start a pipeline run.
Monitor the pipeline run.
For a list of data stores that are supported by Data Factory, see supported data stores.
To learn about copying data in bulk from a source to a destination, advance to the following tutorial:
Copy data in bulk
Copy multiple tables in bulk by using Azure Data
Factory
4/8/2019 • 14 minutes to read

This tutorial demonstrates copying a number of tables from Azure SQL Database to Azure SQL Data
Warehouse. You can apply the same pattern in other copy scenarios as well: for example, copying tables from
SQL Server or Oracle to Azure SQL Database, SQL Data Warehouse, or Azure Blob storage, or copying data from
different paths in Blob storage to Azure SQL Database tables.

NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory.

At a high level, this tutorial involves the following steps:


Create a data factory.
Create Azure SQL Database, Azure SQL Data Warehouse, and Azure Storage linked services.
Create Azure SQL Database and Azure SQL Data Warehouse datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
This tutorial uses the Azure portal. To learn about using other tools/SDKs to create a data factory, see Quickstarts.

End-to-end workflow
In this scenario, you have a number of tables in Azure SQL Database that you want to copy to SQL Data
Warehouse. Here is the logical sequence of steps in the workflow that happens in pipelines:

The first pipeline looks up the list of tables that need to be copied over to the sink data store. Alternatively,
you can maintain a metadata table that lists all the tables to be copied to the sink data store. Then, the pipeline
triggers another pipeline, which iterates over each table in the database and performs the data copy operation.
The second pipeline performs the actual copy. It takes the list of tables as a parameter. For each table in the list,
it copies the table in Azure SQL Database to the corresponding table in SQL Data Warehouse by using staged
copy via Blob storage and PolyBase for the best performance. In this example, the first pipeline passes the list of
tables as a value for the parameter.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure Storage account. The Azure Storage account is used as staging blob storage in the bulk copy
operation.
Azure SQL Database. This database contains the source data.
Azure SQL Data Warehouse. This data warehouse holds the data copied over from the SQL Database.
Prepare SQL Database and SQL Data Warehouse
Prepare the source Azure SQL Database:
Create an Azure SQL Database with Adventure Works LT sample data following Create an Azure SQL database
article. This tutorial copies all the tables from this sample database to a SQL data warehouse.
Prepare the sink Azure SQL Data Warehouse:
1. If you don't have an Azure SQL Data Warehouse, see the Create a SQL Data Warehouse article for steps to
create one.
2. Create corresponding table schemas in SQL Data Warehouse. You can use Migration Utility to migrate
schema from Azure SQL Database to Azure SQL Data Warehouse. You use Azure Data Factory to
migrate/copy data in a later step.

Azure services to access SQL server


For both SQL Database and SQL Data Warehouse, allow Azure services to access SQL server. Ensure that Allow
access to Azure services setting is turned ON for your Azure SQL server. This setting allows the Data Factory
service to read data from your Azure SQL Database and write data to your Azure SQL Data Warehouse. To verify
and turn on this setting, do the following steps:
1. Click More services hub on the left and click SQL servers.
2. Select your server, and click Firewall under SETTINGS.
3. In the Firewall settings page, click ON for Allow access to Azure services.
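
If you prefer to script this setting, a minimal PowerShell sketch follows. It uses the -AllowAllAzureIPs parameter of New-AzSqlServerFirewallRule; the resource group and server names are placeholders that you must replace with your own values.

# Creates the built-in firewall rule that allows Azure services to access the logical SQL server.
New-AzSqlServerFirewallRule -ResourceGroupName "<your resource group>" -ServerName "<your SQL server name>" -AllowAllAzureIPs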

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Data + Analytics > Data Factory:
3. In the New data factory page, enter ADFTutorialBulkCopyDF for the name.

The name of the Azure data factory must be globally unique. If you see the following error for the name
field, change the name of the data factory (for example, yournameADFTutorialBulkCopyDF). See the Data
Factory - Naming Rules article for naming rules for Data Factory artifacts.

`Data factory name "ADFTutorialBulkCopyDF" is not available`

4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version.
7. Select the location for the data factory. For a list of Azure regions in which Data Factory is currently
available, select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory: Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.

11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click Author & Monitor tile to launch the Data Factory UI application in a separate tab.
13. In the get started page, switch to the Edit tab in the left panel as shown in the following image:

Create linked services


You create linked services to link your data stores and computes to a data factory. A linked service has the
connection information that the Data Factory service uses to connect to the data store at runtime.
In this tutorial, you link your Azure SQL Database, Azure SQL Data Warehouse, and Azure Blob Storage data
stores to your data factory. The Azure SQL Database is the source data store. The Azure SQL Data Warehouse is
the sink/destination data store. The Azure Blob Storage is to stage the data before the data is loaded into SQL
Data Warehouse by using PolyBase.
Create the source Azure SQL Database linked service
In this step, you create a linked service to link your Azure SQL database to the data factory.
1. Click Connections at the bottom of the window, and click + New on the toolbar.

2. In the New Linked Service window, select Azure SQL Database, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for Name.
b. Select your Azure SQL server for Server name.
c. Select your Azure SQL database for Database name.
d. Enter the name of the user to connect to the Azure SQL database.
e. Enter the password for the user.
f. To test the connection to the Azure SQL database by using the specified information, click Test connection.
g. Click Save.

Create the sink Azure SQL Data Warehouse linked service


1. In the Connections tab, click + New on the toolbar again.
2. In the New Linked Service window, select Azure SQL Data Warehouse, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDWLinkedService for Name.
b. Select your Azure SQL server for Server name.
c. Select your Azure SQL Data Warehouse database for Database name.
d. Enter the name of the user to connect to the data warehouse.
e. Enter the password for the user.
f. To test the connection to the data warehouse by using the specified information, click Test connection.
g. Click Save.
Create the staging Azure Storage linked service
In this tutorial, you use Azure Blob storage as an interim staging area to enable PolyBase for a better copy
performance.
1. In the Connections tab, click + New on the toolbar again.
2. In the New Linked Service window, select Azure Blob Storage, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureStorageLinkedService for Name.
b. Select your Azure Storage account for Storage account name.
c. Click Save.

Create datasets
In this tutorial, you create source and sink datasets, which specify the location where the data is stored.
The input dataset AzureSqlDatabaseDataset refers to the AzureSqlDatabaseLinkedService. The linked
service specifies the connection string to connect to the database. The dataset specifies the name of the database
and the table that contains the source data.
The output dataset AzureSqlDWDataset refers to the AzureSqlDWLinkedService. The linked service specifies
the connection string to connect to the data warehouse. The dataset specifies the database and the table to which
the data is copied.
In this tutorial, the source and destination SQL tables are not hard-coded in the dataset definitions. Instead, the
ForEach activity passes the name of the table at runtime to the Copy activity.
Create a dataset for source SQL Database
1. Click + (plus) in the left pane, and click Dataset.

2. In the New Dataset window, select Azure SQL Database, and click Finish. You should see a new tab titled
AzureSqlTable1.
3. In the properties window at the bottom, enter AzureSqlDatabaseDataset for Name.
4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select any table for Table. This table is a dummy table. You specify a query on the source dataset
when creating a pipeline. The query is used to extract data from the Azure SQL database.
Alternatively, you can select the Edit check box and enter dummyName as the table name.

Create a dataset for sink SQL Data Warehouse


1. Click + (plus) in the left pane, and click Dataset.
2. In the New Dataset window, select Azure SQL Data Warehouse, and click Finish. You should see a new
tab titled AzureSqlDWTable1.
3. In the properties window at the bottom, enter AzureSqlDWDataset for Name.
4. Switch to the Parameters tab, click + New, and enter DWTableName for the parameter name. If you
copy/paste this name from the page, ensure that there is no trailing space character at the end of
DWTableName.
5. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDWLinkedService for Linked service.
b. For Table, check the Edit option, click into the table name input box, then click the Add dynamic
content link below.

c. In the Add Dynamic Content page, click DWTableName under Parameters, which automatically
populates the top expression text box with @dataset().DWTableName , and then click Finish. The
tableName property of the dataset is set to the value that's passed as an argument for the
DWTableName parameter. The ForEach activity iterates through a list of tables and passes them one by one to
the Copy activity.

Create pipelines
In this tutorial, you create two pipelines: IterateAndCopySQLTables and GetTableListAndTriggerCopyData.
The GetTableListAndTriggerCopyData pipeline performs two steps:
Looks up the Azure SQL Database system table to get the list of tables to be copied.
Triggers the pipeline IterateAndCopySQLTables to do the actual data copy.
The IterateAndCopySQLTables takes a list of tables as a parameter. For each table in the list, it copies data from
the table in Azure SQL Database to Azure SQL Data Warehouse using staged copy and PolyBase.
Create the pipeline IterateAndCopySQLTables
1. In the left pane, click + (plus), and click Pipeline.

2. In the General tab, specify IterateAndCopySQLTables for name.


3. Switch to the Parameters tab, and do the following actions:
a. Click + New.
b. Enter tableList for the parameter name.
c. Select Array for Type.

4. In the Activities toolbox, expand Iteration & Conditions, and drag-drop the ForEach activity to the
pipeline design surface. You can also search for activities in the Activities toolbox.
a. In the General tab at the bottom, enter IterateSQLTables for Name.
b. Switch to the Settings tab, click the input box for Items, and then click the Add dynamic content link below.
c. In the Add Dynamic Content page, collapse the System Variables and Functions section, click
tableList under Parameters, which automatically populates the top expression text box as
@pipeline().parameters.tableList , and then click Finish.

d. Switch to Activities tab, click Add activity to add a child activity to the ForEach activity.
5. In the Activities toolbox, expand DataFlow, and drag-drop Copy activity into the pipeline designer
surface. Notice the breadcrumb menu at the top. IterateAndCopySQLTables is the pipeline name, and
IterateSQLTables is the ForEach activity name. The designer is in the activity scope. To switch back to the
pipeline editor from the ForEach editor, click the link in the breadcrumb menu.
6. Switch to the Source tab, and do the following steps:
a. Select AzureSqlDatabaseDataset for Source Dataset.
b. Select the Query option for Use Query.
c. Click the Query input box -> select the Add dynamic content below -> enter the following
expression for Query -> select Finish.

SELECT * FROM [@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]


7. Switch to the Sink tab, and do the following steps:
a. Select AzureSqlDWDataset for Sink Dataset.
b. Click the input box for the VALUE of the DWTableName parameter -> select the Add dynamic content
below -> enter the expression [@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}] -> select Finish.
c. Expand Polybase Settings, and select Allow polybase.
d. Clear the Use Type default option.
e. Click the Pre-copy Script input box -> select the Add dynamic content below -> enter the
following expression as script -> select Finish.

TRUNCATE TABLE [@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]

8. Switch to the Settings tab, and do the following steps:


a. Select True for Enable Staging.
b. Select AzureStorageLinkedService for Store Account Linked Service.

9. To validate the pipeline settings, click Validate on the top pipeline toolbar. Confirm that there is no
validation error. To close the Pipeline Validation Report, click >>.
Create the pipeline GetTableListAndTriggerCopyData
This pipeline performs two steps:
Looks up the Azure SQL Database system table to get the list of tables to be copied.
Triggers the pipeline "IterateAndCopySQLTables" to do the actual data copy.
1. In the left pane, click + (plus), and click Pipeline.

2. In the Properties window, change the name of the pipeline to GetTableListAndTriggerCopyData.


3. In the Activities toolbox, expand General, and drag-and-drop Lookup activity to the pipeline designer
surface, and do the following steps:
a. Enter LookupTableList for Name.
b. Enter Retrieve the table list from Azure SQL database for Description.
4. Switch to the Settings page, and do the following steps:
a. Select AzureSqlDatabaseDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.

SELECT TABLE_SCHEMA, TABLE_NAME FROM information_schema.TABLES WHERE TABLE_TYPE = 'BASE TABLE' and TABLE_SCHEMA = 'SalesLT' and TABLE_NAME <> 'ProductModel'

d. Clear the checkbox for the First row only field.


5. Drag-and-drop Execute Pipeline activity from the Activities toolbox to the pipeline designer surface, and
set the name to TriggerCopy.

6. Switch to the Settings page, and do the following steps:


a. Select IterateAndCopySQLTables for Invoked pipeline.
b. Expand the Advanced section.
c. Click + New in the Parameters section.
d. Enter tableList for parameter name.
e. Click the VALUE input box -> select the Add dynamic content below -> enter
@activity('LookupTableList').output.value as the value -> select Finish. You are setting the
result list from the Lookup activity as an input to the second pipeline. The result list contains the list
of tables whose data needs to be copied to the destination.
7. Connect the Lookup activity to the Execute Pipeline activity by dragging the green box attached to the
Lookup activity to the left of Execute Pipeline activity.

8. To validate the pipeline, click Validate on the toolbar. Confirm that there are no validation errors. To close
the Pipeline Validation Report, click >>.
9. To publish entities (datasets, pipelines, etc.) to the Data Factory service, click Publish All on top of the
window. Wait until the publishing succeeds.

Trigger a pipeline run


Go to pipeline GetTableListAndTriggerCopyData, click Trigger, and click Trigger Now.

Monitor the pipeline run


1. Switch to the Monitor tab. Click Refresh until you see runs for both the pipelines in your solution.
Continue refreshing the list until you see the Succeeded status.

2. To view the activity runs associated with the GetTableListAndTriggerCopyData pipeline, click the first link in the
Actions column for that pipeline. You should see two activity runs for this pipeline run.

3. To view the output of the Lookup activity, click the link in the Output column for that activity. You can
maximize and restore the Output window. After reviewing the output, click X to close the Output window.
{
"count": 9,
"value": [
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Customer"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductDescription"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Product"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductModelProductDescription"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductCategory"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Address"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "CustomerAddress"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "SalesOrderDetail"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "SalesOrderHeader"
}
],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)",
"effectiveIntegrationRuntimes": [
{
"name": "DefaultIntegrationRuntime",
"type": "Managed",
"location": "East US",
"billedDuration": 0,
"nodes": null
}
]
}

4. To switch back to the Pipeline Runs view, click the Pipelines link at the top. Click the View Activity Runs link
(the first link in the Actions column) for the IterateAndCopySQLTables pipeline. You should see output as
shown in the following image. Notice that there is one Copy activity run for each table in the Lookup
activity output.
5. Confirm that the data was copied to the target SQL Data Warehouse you used in this tutorial.

Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create Azure SQL Database, Azure SQL Data Warehouse, and Azure Storage linked services.
Create Azure SQL Database and Azure SQL Data Warehouse datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copying data incrementally from a source to a destination:
Copy data incrementally
Copy multiple tables in bulk by using Azure Data
Factory
3/5/2019 • 11 minutes to read

This tutorial demonstrates copying a number of tables from Azure SQL Database to Azure SQL Data
Warehouse. You can apply the same pattern in other copy scenarios as well: for example, copying tables from
SQL Server or Oracle to Azure SQL Database, SQL Data Warehouse, or Azure Blob storage, or copying data from
different paths in Blob storage to Azure SQL Database tables.
At a high level, this tutorial involves the following steps:
Create a data factory.
Create Azure SQL Database, Azure SQL Data Warehouse, and Azure Storage linked services.
Create Azure SQL Database and Azure SQL Data Warehouse datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
This tutorial uses Azure PowerShell. To learn about using other tools/SDKs to create a data factory, see
Quickstarts.

End-to-end workflow
In this scenario, we have a number of tables in Azure SQL Database that we want to copy to SQL Data
Warehouse. Here is the logical sequence of steps in the workflow that happens in pipelines:

The first pipeline looks up the list of tables that need to be copied over to the sink data store. Alternatively,
you can maintain a metadata table that lists all the tables to be copied to the sink data store. Then, the pipeline
triggers another pipeline, which iterates over each table in the database and performs the data copy operation.
The second pipeline performs the actual copy. It takes the list of tables as a parameter. For each table in the list,
it copies the table in Azure SQL Database to the corresponding table in SQL Data Warehouse by using staged
copy via Blob storage and PolyBase for the best performance. In this example, the first pipeline passes the list of
tables as a value for the parameter.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Azure Storage account. The Azure Storage account is used as staging blob storage in the bulk copy operation.
Azure SQL Database. This database contains the source data.
Azure SQL Data Warehouse. This data warehouse holds the data copied over from the SQL Database.
Prepare SQL Database and SQL Data Warehouse
Prepare the source Azure SQL Database:
Create an Azure SQL Database with Adventure Works LT sample data following Create an Azure SQL database
article. This tutorial copies all the tables from this sample database to a SQL data warehouse.
Prepare the sink Azure SQL Data Warehouse:
1. If you don't have an Azure SQL Data Warehouse, see the Create a SQL Data Warehouse article for steps to
create one.
2. Create corresponding table schemas in SQL Data Warehouse. You can use Migration Utility to migrate
schema from Azure SQL Database to Azure SQL Data Warehouse. You use Azure Data Factory to
migrate/copy data in a later step.

Azure services to access SQL server


For both SQL Database and SQL Data Warehouse, allow Azure services to access SQL server. Ensure that Allow
access to Azure services setting is turned ON for your Azure SQL server. This setting allows the Data Factory
service to read data from your Azure SQL Database and write data to your Azure SQL Data Warehouse. To verify
and turn on this setting, do the following steps:
1. Click All services on the left and click SQL servers.
2. Select your server, and click Firewall under SETTINGS.
3. In the Firewall settings page, click ON for Allow access to Azure services.

Create a data factory


1. Launch PowerShell. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen, you
need to run the commands again.
Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:

Connect-AzAccount

Run the following command to view all the subscriptions for this account:

Get-AzSubscription

Run the following command to select the subscription that you want to work with. Replace SubscriptionId
with the ID of your Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

2. Run the Set-AzDataFactoryV2 cmdlet to create a data factory. Replace place-holders with your own
values before executing the command.

$resourceGroupName = "<your resource group to create the factory>"
$dataFactoryName = "<specify the name of data factory to create. It must be globally unique.>"
Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName

Note the following points:


The name of the Azure data factory must be globally unique. If you receive the following error,
change the name and try again.

The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory
names must be globally unique.

To create Data Factory instances, the user account that you use to sign in to Azure must be assigned a
contributor or owner role, or must be an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest
you on the following page, and then expand Analytics to locate Data Factory: Products available by
region. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.)
used by data factory can be in other regions.

Create linked services


In this tutorial, you create three linked services for source, sink, and staging blob respectively, which includes
connections to your data stores:
Create the source Azure SQL Database linked service
1. Create a JSON file named AzureSqlDatabaseLinkedService.json in C:\ADFv2TutorialBulkCopy folder
with the following content: (Create the folder ADFv2TutorialBulkCopy if it does not already exist.)

IMPORTANT
Replace <servername>, <databasename>, <username>@<servername> and <password> with values of your
Azure SQL Database before saving the file.
{
"name": "AzureSqlDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
}

2. In Azure PowerShell, switch to the ADFv2TutorialBulkCopy folder.


3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service:
AzureSqlDatabaseLinkedService.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSqlDatabaseLinkedService" -File ".\AzureSqlDatabaseLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureSqlDatabaseLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService

Create the sink Azure SQL Data Warehouse linked service


1. Create a JSON file named AzureSqlDWLinkedService.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content:

IMPORTANT
Replace <servername>, <databasename>, <username>@<servername> and <password> with values of your
Azure SQL Database before saving the file.

{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
}

2. To create the linked service: AzureSqlDWLinkedService, run the Set-AzDataFactoryV2LinkedService cmdlet.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSqlDWLinkedService" -File ".\AzureSqlDWLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureSqlDWLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDWLinkedService

Create the staging Azure Storage linked service


In this tutorial, you use Azure Blob storage as an interim staging area to enable PolyBase for a better copy
performance.
1. Create a JSON file named AzureStorageLinkedService.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content:

IMPORTANT
Replace <accountName> and <accountKey> with name and key of your Azure storage account before saving the
file.

{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=<accountKey>"
}
}
}
}

2. To create the linked service: AzureStorageLinkedService, run the Set-AzDataFactoryV2LinkedService cmdlet.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureStorageLinkedService" -File ".\AzureStorageLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService

Create datasets
In this tutorial, you create source and sink datasets, which specify the location where the data is stored:
Create a dataset for source SQL Database
1. Create a JSON file named AzureSqlDatabaseDataset.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content. The "tableName" value is a dummy one, because you later use a SQL query in the copy
activity to retrieve the data.

{
"name": "AzureSqlDatabaseDataset",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "AzureSqlDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "dummy"
}
}
}

2. To create the dataset: AzureSqlDatabaseDataset, run the Set-AzDataFactoryV2Dataset cmdlet.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSqlDatabaseDataset" -File ".\AzureSqlDatabaseDataset.json"

Here is the sample output:

DatasetName : AzureSqlDatabaseDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a dataset for sink SQL Data Warehouse


1. Create a JSON file named AzureSqlDWDataset.json in the C:\ADFv2TutorialBulkCopy folder, with the
following content. The "tableName" is set as a parameter; the copy activity that references this dataset later
passes the actual value into the dataset.

{
"name": "AzureSqlDWDataset",
"properties": {
"type": "AzureSqlDWTable",
"linkedServiceName": {
"referenceName": "AzureSqlDWLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": {
"value": "@{dataset().DWTableName}",
"type": "Expression"
}
},
"parameters":{
"DWTableName":{
"type":"String"
}
}
}
}

2. To create the dataset: AzureSqlDWDataset, run the Set-AzDataFactoryV2Dataset cmdlet.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSqlDWDataset" -File ".\AzureSqlDWDataset.json"

Here is the sample output:

DatasetName : AzureSqlDWDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDwTableDataset

Create pipelines
In this tutorial, you create two pipelines:
Create the pipeline "IterateAndCopySQLTables"
This pipeline takes a list of tables as a parameter. For each table in the list, it copies data from the table in Azure
SQL Database to Azure SQL Data Warehouse using staged copy and PolyBase.
1. Create a JSON file named IterateAndCopySQLTables.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content:
{
"name": "IterateAndCopySQLTables",
"properties": {
"activities": [
{
"name": "IterateSQLTables",
"type": "ForEach",
"typeProperties": {
"isSequential": "false",
"items": {
"value": "@pipeline().parameters.tableList",
"type": "Expression"
},
"activities": [
{
"name": "CopyData",
"description": "Copy data from SQL database to SQL DW",
"type": "Copy",
"inputs": [
{
"referenceName": "AzureSqlDatabaseDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSqlDWDataset",
"type": "DatasetReference",
"parameters": {
"DWTableName": "[@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]"
}
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT * FROM [@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]"
},
"sink": {
"type": "SqlDWSink",
"preCopyScript": "TRUNCATE TABLE [@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]",
"allowPolyBase": true
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
}
]
}
}
],
"parameters": {
"tableList": {
"type": "Object"
}
}
}
}
2. To create the pipeline: IterateAndCopySQLTables, run the Set-AzDataFactoryV2Pipeline cmdlet.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "IterateAndCopySQLTables" -File ".\IterateAndCopySQLTables.json"

Here is the sample output:

PipelineName : IterateAndCopySQLTables
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {IterateSQLTables}
Parameters : {[tableList, Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}

Create the pipeline "GetTableListAndTriggerCopyData"


This pipeline performs two steps:
Looks up the Azure SQL Database system table to get the list of tables to be copied.
Triggers the pipeline "IterateAndCopySQLTables" to do the actual data copy.
1. Create a JSON file named GetTableListAndTriggerCopyData.json in the C:\ADFv2TutorialBulkCopy
folder, with the following content:
{
"name":"GetTableListAndTriggerCopyData",
"properties":{
"activities":[
{
"name": "LookupTableList",
"description": "Retrieve the table list from Azure SQL database",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT TABLE_SCHEMA, TABLE_NAME FROM information_schema.TABLES WHERE TABLE_TYPE = 'BASE TABLE' and TABLE_SCHEMA = 'SalesLT' and TABLE_NAME <> 'ProductModel'"
},
"dataset": {
"referenceName": "AzureSqlDatabaseDataset",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "TriggerCopy",
"type": "ExecutePipeline",
"typeProperties": {
"parameters": {
"tableList": {
"value": "@activity('LookupTableList').output.value",
"type": "Expression"
}
},
"pipeline": {
"referenceName": "IterateAndCopySQLTables",
"type": "PipelineReference"
},
"waitOnCompletion": true
},
"dependsOn": [
{
"activity": "LookupTableList",
"dependencyConditions": [
"Succeeded"
]
}
]
}
]
}
}

2. To create the pipeline: GetTableListAndTriggerCopyData, run the Set-AzDataFactoryV2Pipeline cmdlet.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "GetTableListAndTriggerCopyData" -File ".\GetTableListAndTriggerCopyData.json"

Here is the sample output:


PipelineName : GetTableListAndTriggerCopyData
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {LookupTableList, TriggerCopy}
Parameters :
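
Both pipelines should now exist in the data factory. As an optional check, you can list them:

# Lists all pipelines in the factory; IterateAndCopySQLTables and GetTableListAndTriggerCopyData should appear.
Get-AzDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName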

Start and monitor pipeline run


1. Start a pipeline run for the main GetTableListAndTriggerCopyData pipeline, and capture the pipeline run
ID for future monitoring. Under the covers, this run triggers a run of the IterateAndCopySQLTables pipeline, as
specified in the ExecutePipeline activity.

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineName 'GetTableListAndTriggerCopyData'

2. Run the following script to continuously check the run status of pipeline
GetTableListAndTriggerCopyData, and print out the final pipeline run and activity run result.

while ($True) {
    $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineRunId $runId

    if ($run) {
        if ($run.Status -ne 'InProgress') {
            Write-Host "Pipeline run finished. The status is: " $run.Status -foregroundcolor "Yellow"
            Write-Host "Pipeline run details:" -foregroundcolor "Yellow"
            $run
            break
        }
        Write-Host "Pipeline is running...status: InProgress" -foregroundcolor "Yellow"
    }

    Start-Sleep -Seconds 15
}

$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
Write-Host "Activity run details:" -foregroundcolor "Yellow"
$result

Here is the output of the sample run:


Pipeline run details:
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
RunId : 0000000000-00000-0000-0000-000000000000
PipelineName : GetTableListAndTriggerCopyData
LastUpdated : 9/18/2017 4:08:15 PM
Parameters : {}
RunStart : 9/18/2017 4:06:44 PM
RunEnd : 9/18/2017 4:08:15 PM
DurationInMs : 90637
Status : Succeeded
Message :

Activity run details:


ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
ActivityName : LookupTableList
PipelineRunId : 0000000000-00000-0000-0000-000000000000
PipelineName : GetTableListAndTriggerCopyData
Input : {source, dataset, firstRowOnly}
Output : {count, value, effectiveIntegrationRuntime}
LinkedServiceName :
ActivityRunStart : 9/18/2017 4:06:46 PM
ActivityRunEnd : 9/18/2017 4:07:09 PM
DurationInMs : 22995
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
ActivityName : TriggerCopy
PipelineRunId : 0000000000-00000-0000-0000-000000000000
PipelineName : GetTableListAndTriggerCopyData
Input : {pipeline, parameters, waitOnCompletion}
Output : {pipelineRunId}
LinkedServiceName :
ActivityRunStart : 9/18/2017 4:07:11 PM
ActivityRunEnd : 9/18/2017 4:08:14 PM
DurationInMs : 62581
Status : Succeeded
Error : {errorCode, message, failureType, target}

3. You can get the run ID of the pipeline "IterateAndCopySQLTables" and check the detailed activity run result
as follows.

Write-Host "Pipeline 'IterateAndCopySQLTables' run result:" -foregroundcolor "Yellow"
($result | Where-Object {$_.ActivityName -eq "TriggerCopy"}).Output.ToString()

Here is the output of the sample run:

{
"pipelineRunId": "7514d165-14bf-41fb-b5fb-789bea6c9e58"
}

$result2 = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId <copy above run ID> -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
$result2

4. Connect to your sink Azure SQL Data Warehouse and confirm that data has been copied from Azure SQL
Database properly.
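
For example, a quick row-count check of one of the copied tables can be run from PowerShell. This is a sketch that assumes the SqlServer module (which provides Invoke-Sqlcmd) is installed and that you substitute your own server, database, and credentials; SalesLT.Customer is one of the tables copied from the sample database.

# Count the rows copied into one of the SalesLT tables in the data warehouse.
Invoke-Sqlcmd -ServerInstance "<servername>.database.windows.net" -Database "<data warehouse name>" -Username "<username>" -Password "<password>" -Query "SELECT COUNT(*) AS CustomerRows FROM SalesLT.Customer"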

Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create Azure SQL Database, Azure SQL Data Warehouse, and Azure Storage linked services.
Create Azure SQL Database and Azure SQL Data Warehouse datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copying data incrementally from a source to a destination:
Copy data incrementally
Incrementally load data from a source data store to a
destination data store
5/10/2019 • 2 minutes to read

In a data integration solution, incrementally (or delta) loading data after an initial full data load is a widely used
scenario. The tutorials in this section show you different ways of loading data incrementally by using Azure Data
Factory.

Delta data loading from database by using a watermark


In this case, you define a watermark in your source database. A watermark is a column that has the last updated
time stamp or an incrementing key. The delta loading solution loads the changed data between an old watermark
and a new watermark. The workflow for this approach is depicted in the following diagram:

For step-by-step instructions, see the following tutorials:


Incrementally copy data from one table in Azure SQL Database to Azure Blob storage
Incrementally copy data from multiple tables in on-premises SQL Server to Azure SQL Database
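
At its core, the watermark pattern is a range filter on the watermark column. The sketch below shows the shape of the query a copy run typically issues; the table and column names match the data_source_table example used later in these tutorials, and the literal watermark values are illustrative only.

# The old and new watermark values normally come from two Lookup activities.
$oldWatermark = '2010-01-01 00:00:00'
$newWatermark = '2017-09-05 08:06:00'

# Only rows changed after the old watermark and up to (and including) the new one are copied.
$deltaQuery = "SELECT * FROM data_source_table " +
              "WHERE LastModifytime > '$oldWatermark' AND LastModifytime <= '$newWatermark'"
$deltaQuery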

Delta data loading from SQL DB by using the Change Tracking technology
Change Tracking technology is a lightweight solution in SQL Server and Azure SQL Database that provides an
efficient change tracking mechanism for applications. It enables an application to easily identify data that was
inserted, updated, or deleted.
The workflow for this approach is depicted in the following diagram:

For step-by-step instructions, see the following tutorial:


Incrementally copy data from Azure SQL Database to Azure Blob storage by using Change Tracking technology
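
For background, the sketch below shows roughly how Change Tracking is enabled and queried with T-SQL; the server, database, and credential values are placeholders, the retention settings are illustrative, and the tracked table must have a primary key.

# Enable Change Tracking on the database and on one table (run once; names are placeholders).
$enableChangeTracking = @"
ALTER DATABASE [<database>] SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);
ALTER TABLE [dbo].[data_source_table] ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);
"@

# Read only the rows that changed since the last saved sync version, then capture the new version.
$readChanges = @"
DECLARE @last_sync_version BIGINT = 0;  -- stored from the previous run
SELECT CT.PersonID, CT.SYS_CHANGE_OPERATION, t.Name
FROM CHANGETABLE(CHANGES [dbo].[data_source_table], @last_sync_version) AS CT
LEFT JOIN [dbo].[data_source_table] AS t ON t.PersonID = CT.PersonID;
SELECT CHANGE_TRACKING_CURRENT_VERSION() AS CurrentVersion;  -- save this as the next watermark
"@

Invoke-Sqlcmd -ServerInstance "<server>.database.windows.net" -Database "<database>" `
    -Username "<user>" -Password "<password>" -Query $readChanges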

Loading new and changed files only by using LastModifiedDate


You can copy only the new and changed files, based on their LastModifiedDate, to the destination store. ADF scans all
the files in the source store, filters them by LastModifiedDate, and copies to the destination only the files that are
new or updated since the last run. Be aware that if you let ADF scan a huge number of files but copy only a few of
them to the destination, the run can still take a long time, because the file scan itself is time consuming.
For step-by-step instructions, see the following tutorial:
Incrementally copy new and changed files based on LastModifiedDate from Azure Blob storage to Azure Blob
storage

Loading new files only by using a time-partitioned folder or file name
You can copy only new files when the files or folders are already time partitioned, with time-slice information as
part of the file or folder name (for example, /yyyy/mm/dd/file.csv). This is the most performant approach for
incrementally loading new files.
For step-by-step instructions, see the following tutorial:
Incrementally copy new files based on time partitioned folder or file name from Azure Blob storage to Azure Blob
storage

Next steps
Advance to the following tutorial:
Incrementally copy data from one table in Azure SQL Database to Azure Blob storage
Incrementally load data from an Azure SQL database
to Azure Blob storage
3/26/2019 • 13 minutes to read

In this tutorial, you create an Azure data factory with a pipeline that loads delta data from a table in an Azure SQL
database to Azure Blob storage.
You perform the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
Review results
Add more data to the source.
Run the pipeline again.
Monitor the second pipeline run
Review results from the second run

Overview
Here is the high-level solution diagram:

Here are the important steps to create this solution:


1. Select the watermark column. Select one column in the source data store, which can be used to slice the
new or updated records for every run. Normally, the data in this selected column (for example,
last_modify_time or ID ) keeps increasing when rows are created or updated. The maximum value in this
column is used as a watermark.
2. Prepare a data store to store the watermark value. In this tutorial, you store the watermark value in a
SQL database.
3. Create a pipeline with the following workflow:
The pipeline in this solution has the following activities:
Create two Lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the
second Lookup activity to retrieve the new watermark value. These watermark values are passed to the
Copy activity.
Create a Copy activity that copies rows from the source data store with the value of the watermark
column greater than the old watermark value and less than or equal to the new watermark value. Then, it
copies the delta data from the source data store to Blob storage as a new file.
Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next time.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure SQL Database. You use the database as the source data store. If you don't have a SQL database, see
Create an Azure SQL database for steps to create one.
Azure Storage. You use the blob storage as the sink data store. If you don't have a storage account, see Create
a storage account for steps to create one. Create a container named adftutorial.
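
If you'd rather create the container from PowerShell than from the portal, a short sketch using the Az.Storage cmdlets follows; the storage account name and key are placeholders.

# Build a storage context from the account name and key, then create the adftutorial container.
$ctx = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"
New-AzStorageContainer -Name "adftutorial" -Context $ctx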
Create a data source table in your SQL database
1. Open SQL Server Management Studio. In Server Explorer, right-click the database, and choose New
Query.
2. Run the following SQL command against your SQL database to create a table named data_source_table as
the data source store:

create table data_source_table


(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

INSERT INTO data_source_table


(PersonID, Name, LastModifytime)
VALUES
(1, 'aaaa','9/1/2017 12:56:00 AM'),
(2, 'bbbb','9/2/2017 5:23:00 AM'),
(3, 'cccc','9/3/2017 2:36:00 AM'),
(4, 'dddd','9/4/2017 3:21:00 AM'),
(5, 'eeee','9/5/2017 8:06:00 AM');

In this tutorial, you use LastModifytime as the watermark column. The data in the data source store is
shown in the following table:

PersonID | Name | LastModifytime


-------- | ---- | --------------
1 | aaaa | 2017-09-01 00:56:00.000
2 | bbbb | 2017-09-02 05:23:00.000
3 | cccc | 2017-09-03 02:36:00.000
4 | dddd | 2017-09-04 03:21:00.000
5 | eeee | 2017-09-05 08:06:00.000

Create another table in your SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to
store the watermark value:
create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime
);

2. Set the default value of the high watermark with the table name of source data store. In this tutorial, the
table name is data_source_table.

INSERT INTO watermarktable


VALUES ('data_source_table','1/1/2010 12:00:00 AM')

3. Review the data in the table watermarktable .

Select * from watermarktable

Output:

TableName | WatermarkValue
---------- | --------------
data_source_table | 2010-01-01 00:00:00.000

Create a stored procedure in your SQL database


Run the following command to create a stored procedure in your SQL database:

CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)


AS

BEGIN

UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName

END
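
If you want to sanity-check the stored procedure before the pipeline calls it, you can invoke it directly and read the watermark table back; a sketch using Invoke-Sqlcmd (connection values are placeholders) is shown below.

# Call the procedure with an arbitrary timestamp, then read the watermark table back.
# If you run this before the first pipeline run, reset WatermarkValue to 1/1/2010 afterwards
# so that the first run still picks up all of the existing rows.
$testProc = @"
EXEC usp_write_watermark @LastModifiedtime = '2017-09-05 08:06:00', @TableName = 'data_source_table';
SELECT * FROM watermarktable;
"@
Invoke-Sqlcmd -ServerInstance "<server>.database.windows.net" -Database "<database>" `
    -Username "<user>" -Password "<password>" -Query $testProc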

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Data + Analytics > Data Factory:
3. In the New data factory page, enter ADFIncCopyTutorialDF for the name.

The name of the Azure data factory must be globally unique. If you see a red exclamation mark with the
following error, change the name of the data factory (for example, yournameADFIncCopyTutorialDF ) and
try creating again. See Data Factory - Naming Rules article for naming rules for Data Factory artifacts.

`Data factory name "ADFIncCopyTutorialDF" is not available`

4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down
list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.

11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) in a separate tab.

Create a pipeline
In this tutorial, you create a pipeline with two Lookup activities, one Copy activity, and one StoredProcedure
activity chained in one pipeline.
1. In the get started page of Data Factory UI, click the Create pipeline tile.

2. In the General page of the Properties window for the pipeline, enter IncrementalCopyPipeline for the name.
3. Let's add the first lookup activity to get the old watermark value. In the Activities toolbox, expand General,
and drag-drop the Lookup activity to the pipeline designer surface. Change the name of the activity to
LookupOldWaterMarkActivity.
4. Switch to the Settings tab, and click + New for Source Dataset. In this step, you create a dataset to
represent data in the watermarktable. This table contains the old watermark that was used in the previous
copy operation.
5. In the New Dataset window, select Azure SQL Database, and click Finish. You see a new tab opened for
the dataset.
6. In the properties window for the dataset, enter WatermarkDataset for Name.
7. Switch to the Connection tab, and click + New to make a connection (create a linked service) to your Azure
SQL database.

8. In the New Linked Service window, do the following steps:


a. Enter AzureSqlDatabaseLinkedService for Name.
b. Select your Azure SQL server for Server name.
c. Enter the name of the user to access for the Azure SQL server.
d. Enter the password for the user.
e. To test connection to the Azure SQL database, click Test connection.
f. Click Save.
g. In the Connection tab, confirm that AzureSqlDatabaseLinkedService is selected for Linked
service.
9. Select [dbo].[watermarktable] for Table. If you want to preview data in the table, click Preview data.

10. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking the name of the pipeline in
the tree view on the left. In the properties window for the Lookup activity, confirm that
WatermarkDataset is selected for the Source Dataset field.
11. In the Activities toolbox, expand General, and drag-drop another Lookup activity to the pipeline designer
surface, and set the name to LookupNewWaterMarkActivity in the General tab of the properties
window. This Lookup activity gets the new watermark value from the table with the source data to be copied
to the destination.
12. In the properties window for the second Lookup activity, switch to the Settings tab, and click New. You
create a dataset to point to the source table that contains the new watermark value (maximum value of
LastModifyTime).
13. In the New Dataset window, select Azure SQL Database, and click Finish. You see a new tab opened for
this dataset. You also see the dataset in the tree view.
14. In the General tab of the properties window, enter SourceDataset for Name.

15. Switch to the Connection tab, and do the following steps:


a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select [dbo].[data_source_table] for Table. You specify a query on this dataset later in the tutorial.
The query takes the precedence over the table you specify in this step.

16. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking the name of the pipeline in
the tree view on the left. In the properties window for the Lookup activity, confirm that SourceDataset is
selected for the Source Dataset field.
17. Select Query for the Use Query field, and enter the following query: it selects only the maximum value of
LastModifytime from data_source_table. Without this query, the dataset would return all the rows from the table,
because you specified the table name (data_source_table) in the dataset definition.

select MAX(LastModifytime) as NewWatermarkvalue from data_source_table


18. In the Activities toolbox, expand DataFlow, drag-drop the Copy activity to the pipeline designer surface,
and set its name to IncrementalCopyActivity.
19. Connect both Lookup activities to the Copy activity by dragging the green button attached to the
Lookup activities to the Copy activity. Release the mouse button when you see the border color of the Copy
activity changes to blue.

20. Select the Copy activity and confirm that you see the properties for the activity in the Properties window.
21. Switch to the Source tab in the Properties window, and do the following steps:
a. Select SourceDataset for the Source Dataset field.
b. Select Query for the Use Query field.
c. Enter the following SQL query for the Query field.

select * from data_source_table
where LastModifytime > '@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}'
and LastModifytime <= '@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'
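
To check what this query returns outside of Data Factory, you can substitute literal watermark values for the two activity expressions; a sketch follows (connection values are placeholders, and the literals stand in for the Lookup outputs).

# The two @{activity(...)} expressions resolve at run time; literals are used here only for testing.
$deltaTest = @"
SELECT * FROM data_source_table
WHERE LastModifytime > '2010-01-01 00:00:00' AND LastModifytime <= '2017-09-05 08:06:00';
"@
Invoke-Sqlcmd -ServerInstance "<server>.database.windows.net" -Database "<database>" `
    -Username "<user>" -Password "<password>" -Query $deltaTest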
22. Switch to the Sink tab, and click + New for the Sink Dataset field.

23. In this tutorial sink data store is of type Azure Blob Storage. Therefore, select Azure Blob Storage, and click
Finish in the New Dataset window.
24. In the General tab of the Properties window for the dataset, enter SinkDataset for Name.
25. Switch to the Connection tab, and click + New. In this step, you create a connection (linked service) to your
Azure Blob storage.

26. In the New Linked Service window, do the following steps:


a. Enter AzureStorageLinkedService for Name.
b. Select your Azure Storage account for Storage account name.
c. Click Save.
27. In the Connection tab, do the following steps:
a. Confirm that AzureStorageLinkedService is selected for Linked service.
b. For the folder part of the File path field, enter adftutorial/incrementalcopy. adftutorial is the
blob container name and incrementalcopy is the folder name. This snippet assumes that you have a
blob container named adftutorial in your blob storage. Create the container if it doesn't exist, or set it
to the name of an existing one. Azure Data Factory automatically creates the output folder
incrementalcopy if it does not exist. You can also use the Browse button for the File path to
navigate to a folder in a blob container.
c. For the file name part of the File path field, enter @CONCAT('Incremental-', pipeline().RunId, '.txt').
The file name is dynamically generated by using the expression. Each pipeline run has a unique ID.
The Copy activity uses the run ID to generate the file name.
28. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking the name of the pipeline in
the tree view on the left.
29. In the Activities toolbox, expand General, and drag-drop the Stored Procedure activity from the
Activities toolbox to the pipeline designer surface. Connect the green (Success) output of the Copy
activity to the Stored Procedure activity.

30. Select Stored Procedure Activity in the pipeline designer, change its name to
StoredProceduretoWriteWatermarkActivity.
31. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked service.

32. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name, select usp_write_watermark.
b. To specify values for the stored procedure parameters, click Import parameter, and enter following
values for the parameters:

NAME | TYPE | VALUE
LastModifiedtime | DateTime | @{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}
TableName | String | @{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}
33. To validate the pipeline settings, click Validate on the toolbar. Confirm that there are no validation errors. To
close the Pipeline Validation Report window, click >>.

34. Publish entities (linked services, datasets, and pipelines) to the Azure Data Factory service by selecting the
Publish All button. Wait until you see a message that the publishing succeeded.
Trigger a pipeline run
1. Click Trigger on the toolbar, and click Trigger Now.

2. In the Pipeline Run window, select Finish.
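
If you prefer scripting to the UI, the same run can also be triggered and inspected from PowerShell; the sketch below assumes the Az.DataFactory module is installed and that the resource group and data factory name placeholders are replaced with your own values.

# Trigger the pipeline and capture the run ID.
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "<resourceGroupName>" `
    -DataFactoryName "<dataFactoryName>" -PipelineName "IncrementalCopyPipeline"

# List the activity runs for this pipeline run (the time window is arbitrary).
Get-AzDataFactoryV2ActivityRun -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" `
    -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)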

Monitor the pipeline run


1. Switch to the Monitor tab on the left. You can see the status of the pipeline run triggered by the manual
trigger. Click Refresh button to refresh the list.

2. To view activity runs associated with this pipeline run, click the first link (View Activity Runs) in the
Actions column. You can go back to the previous view by clicking Pipelines at the top. Click Refresh
button to refresh the list.

Review the results


1. Connect to your Azure Storage Account by using tools such as Azure Storage Explorer. Verify that an output
file is created in the incrementalcopy folder of the adftutorial container.

2. Open the output file and notice that all the data is copied from the data_source_table to the blob file.

1,aaaa,2017-09-01 00:56:00.0000000
2,bbbb,2017-09-02 05:23:00.0000000
3,cccc,2017-09-03 02:36:00.0000000
4,dddd,2017-09-04 03:21:00.0000000
5,eeee,2017-09-05 08:06:00.0000000

3. Check the latest value from watermarktable . You see that the watermark value was updated.

Select * from watermarktable

Here is the output:

TABLENAME WATERMARKVALUE

data_source_table 2017-09-05 8:06:00.000

Add more data to source


Insert new data into the SQL database (data source store).

INSERT INTO data_source_table


VALUES (6, 'newdata','9/6/2017 2:23:00 AM')

INSERT INTO data_source_table


VALUES (7, 'newdata','9/7/2017 9:01:00 AM')

The updated data in the SQL database is:

PersonID | Name | LastModifytime


-------- | ---- | --------------
1 | aaaa | 2017-09-01 00:56:00.000
2 | bbbb | 2017-09-02 05:23:00.000
3 | cccc | 2017-09-03 02:36:00.000
4 | dddd | 2017-09-04 03:21:00.000
5 | eeee | 2017-09-05 08:06:00.000
6 | newdata | 2017-09-06 02:23:00.000
7 | newdata | 2017-09-07 09:01:00.000

Trigger another pipeline run


1. Switch to the Edit tab. Click the pipeline in the tree view if it's not opened in the designer.
2. Click Trigger on the toolbar, and click Trigger Now.

Monitor the second pipeline run


1. Switch to the Monitor tab on the left. You can see the status of the pipeline run triggered by the manual
trigger. Click Refresh button to refresh the list.

2. To view activity runs associated with this pipeline run, click the first link (View Activity Runs) in the
Actions column. You can go back to the previous view by clicking Pipelines at the top. Click Refresh
button to refresh the list.

Verify the second output


1. In the blob storage, you see that another file was created. In this tutorial, the new file name is
Incremental-<GUID>.txt . Open that file, and you see two rows of records in it.

6,newdata,2017-09-06 02:23:00.0000000
7,newdata,2017-09-07 09:01:00.0000000

2. Check the latest value from watermarktable . You see that the watermark value was updated again.

Select * from watermarktable

Sample output:

TABLENAME WATERMARKVALUE

data_source_table 2017-09-07 09:01:00.000

Next steps
You performed the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
Review results
Add more data to the source.
Run the pipeline again.
Monitor the second pipeline run
Review results from the second run
In this tutorial, the pipeline copied data from a single table in a SQL database to Blob storage. Advance to the
following tutorial to learn how to copy data from multiple tables in an on-premises SQL Server database to a SQL
database.
Incrementally load data from multiple tables in SQL Server to Azure SQL Database
Incrementally load data from an Azure SQL database
to Azure Blob storage
3/14/2019 • 13 minutes to read

In this tutorial, you create an Azure data factory with a pipeline that loads delta data from a table in an Azure SQL
database to Azure Blob storage.
You perform the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.

Overview
Here is the high-level solution diagram:

Here are the important steps to create this solution:


1. Select the watermark column. Select one column in the source data store, which can be used to slice the
new or updated records for every run. Normally, the data in this selected column (for example,
last_modify_time or ID ) keeps increasing when rows are created or updated. The maximum value in this
column is used as a watermark.
2. Prepare a data store to store the watermark value.
In this tutorial, you store the watermark value in a SQL database.
3. Create a pipeline with the following workflow:
The pipeline in this solution has the following activities:
Create two Lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the
second Lookup activity to retrieve the new watermark value. These watermark values are passed to the
Copy activity.
Create a Copy activity that copies rows from the source data store with the value of the watermark
column greater than the old watermark value and less than or equal to the new watermark value. Then, it
copies the delta data from the source data store to Blob storage as a new file.
Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next time.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure SQL Database. You use the database as the source data store. If you don't have a SQL database, see
Create an Azure SQL database for steps to create one.
Azure Storage. You use the blob storage as the sink data store. If you don't have a storage account, see Create
a storage account for steps to create one. Create a container named adftutorial.
Azure PowerShell. Follow the instructions in Install and configure Azure PowerShell.
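
A minimal sketch of preparing a PowerShell session for the commands in this tutorial is shown below; the subscription value is a placeholder.

# Install the Az module for the current user and sign in to Azure.
Install-Module -Name Az -Scope CurrentUser -Repository PSGallery -Force
Connect-AzAccount

# If you have access to multiple subscriptions, select the one to use.
Set-AzContext -Subscription "<subscriptionNameOrId>"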
Create a data source table in your SQL database
1. Open SQL Server Management Studio. In Server Explorer, right-click the database, and choose New
Query.
2. Run the following SQL command against your SQL database to create a table named data_source_table
as the data source store:

create table data_source_table


(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

INSERT INTO data_source_table


(PersonID, Name, LastModifytime)
VALUES
(1, 'aaaa','9/1/2017 12:56:00 AM'),
(2, 'bbbb','9/2/2017 5:23:00 AM'),
(3, 'cccc','9/3/2017 2:36:00 AM'),
(4, 'dddd','9/4/2017 3:21:00 AM'),
(5, 'eeee','9/5/2017 8:06:00 AM');

In this tutorial, you use LastModifytime as the watermark column. The data in the data source store is
shown in the following table:

PersonID | Name | LastModifytime


-------- | ---- | --------------
1 | aaaa | 2017-09-01 00:56:00.000
2 | bbbb | 2017-09-02 05:23:00.000
3 | cccc | 2017-09-03 02:36:00.000
4 | dddd | 2017-09-04 03:21:00.000
5 | eeee | 2017-09-05 08:06:00.000

Create another table in your SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to
store the watermark value:
create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime
);

2. Set the default value of the high watermark with the table name of source data store. In this tutorial, the
table name is data_source_table.

INSERT INTO watermarktable


VALUES ('data_source_table','1/1/2010 12:00:00 AM')

3. Review the data in the table watermarktable .

Select * from watermarktable

Output:

TableName | WatermarkValue
---------- | --------------
data_source_table | 2010-01-01 00:00:00.000

Create a stored procedure in your SQL database


Run the following command to create a stored procedure in your SQL database:

CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)


AS

BEGIN

UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName

END

Create a data factory


1. Define a variable for the resource group name that you use in PowerShell commands later. Copy the
following command text to PowerShell, specify a name for the Azure resource group in double quotation
marks, and then run the command. An example is "adfrg" .

$resourceGroupName = "ADFTutorialResourceGroup";

If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.

2. Define a variable for the location of the data factory.

$location = "East US"


3. To create the Azure resource group, run the following command:

New-AzResourceGroup $resourceGroupName $location

If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.

4. Define a variable for the data factory name.

IMPORTANT
Update the data factory name to make it globally unique. An example is ADFTutorialFactorySP1127.

$dataFactoryName = "ADFIncCopyTutorialFactory";

5. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet:

Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName

Note the following points:


The name of the data factory must be globally unique. If you receive the following error, change the name
and try again:

The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory names must
be globally unique.

To create Data Factory instances, the user account you use to sign in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (Storage, SQL Database, etc.) and computes (Azure HDInsight, etc.) used by the data
factory can be in other regions.

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In
this section, you create linked services to your storage account and SQL database.
Create a Storage linked service
1. Create a JSON file named AzureStorageLinkedService.json in the C:\ADF folder with the following content.
(Create the folder ADF if it doesn't already exist.) Replace <accountName> and <accountKey> with the name
and key of your storage account before you save the file.
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>",
"type": "SecureString"
}
}
}
}

2. In PowerShell, switch to the ADF folder.


3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service
AzureStorageLinkedService. In the following example, you pass values for the ResourceGroupName and
DataFactoryName parameters:

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureStorageLinkedService" -File ".\AzureStorageLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService

Create a SQL Database linked service


1. Create a JSON file named AzureSQLDatabaseLinkedService.json in the C:\ADF folder with the following
content. (Create the folder ADF if it doesn't already exist.) Replace <server>, <database>, <user id>, and
<password> with the name of your server, database, user ID, and password before you save the file.

{
"name": "AzureSQLDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"value": "Server = tcp:<server>.database.windows.net,1433;Initial Catalog=<database>; Persist
Security Info=False; User ID=<user> ; Password=<password>; MultipleActiveResultSets = False; Encrypt =
True; TrustServerCertificate = False; Connection Timeout = 30;",
"type": "SecureString"
}
}
}
}

2. In PowerShell, switch to the ADF folder.


3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service
AzureSQLDatabaseLinkedService.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSQLDatabaseLinkedService" -File ".\AzureSQLDatabaseLinkedService.json"
Here is the sample output:

LinkedServiceName : AzureSQLDatabaseLinkedService
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService
ProvisioningState :

Create datasets
In this step, you create datasets to represent source and sink data.
Create a source dataset
1. Create a JSON file named SourceDataset.json in the same folder with the following content:

{
"name": "SourceDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "data_source_table"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}

In this tutorial, you use the table name data_source_table. Replace it if you use a table with a different name.
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset SourceDataset.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SourceDataset" -File ".\SourceDataset.json"

Here is the sample output of the cmdlet:

DatasetName : SourceDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a sink dataset


1. Create a JSON file named SinkDataset.json in the same folder with the following content:
{
"name": "SinkDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "adftutorial/incrementalcopy",
"fileName": "@CONCAT('Incremental-', pipeline().RunId, '.txt')",
"format": {
"type": "TextFormat"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}

IMPORTANT
This snippet assumes that you have a blob container named adftutorial in your blob storage. Create the container if
it doesn't exist, or set it to the name of an existing one. The output folder incrementalcopy is automatically
created if it doesn't exist in the container. In this tutorial, the file name is dynamically generated by using the
expression @CONCAT('Incremental-', pipeline().RunId, '.txt') .

2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset SinkDataset.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SinkDataset" -File ".\SinkDataset.json"

Here is the sample output of the cmdlet:

DatasetName : SinkDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset

Create a dataset for a watermark


In this step, you create a dataset for storing a high watermark value.
1. Create a JSON file named WatermarkDataset.json in the same folder with the following content:

{
"name": " WatermarkDataset ",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "watermarktable"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset WatermarkDataset.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "WatermarkDataset" -File ".\WatermarkDataset.json"

Here is the sample output of the cmdlet:

DatasetName : WatermarkDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a pipeline
In this tutorial, you create a pipeline with two Lookup activities, one Copy activity, and one StoredProcedure
activity chained in one pipeline.
1. Create a JSON file IncrementalCopyPipeline.json in the same folder with the following content:

{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [
{
"name": "LookupOldWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from watermarktable"
},

"dataset": {
"referenceName": "WatermarkDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupNewWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select MAX(LastModifytime) as NewWatermarkvalue from data_source_table"
},

"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},

{
"name": "IncrementalCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from data_source_table where LastModifytime >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and LastModifytime <=
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and LastModifytime <=
'@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'"
},
"sink": {
"type": "BlobSink"
}
},
"dependsOn": [
{
"activity": "LookupNewWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupOldWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
}
],

"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}
]
},

{
"name": "StoredProceduretoWriteWatermarkActivity",
"type": "SqlServerStoredProcedure",
"typeProperties": {

"storedProcedureName": "usp_write_watermark",
"storedProcedureParameters": {
"LastModifiedtime": {"value":
"@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}", "type": "datetime" },
"TableName": { "value":"@{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}",
"type":"String"}
}
},

"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},

"dependsOn": [
{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
]

}
}
2. Run the Set-AzDataFactoryV2Pipeline cmdlet to create the pipeline IncrementalCopyPipeline.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "IncrementalCopyPipeline" -File ".\IncrementalCopyPipeline.json"

Here is the sample output:

PipelineName : IncrementalCopyPipeline
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Activities : {LookupOldWaterMarkActivity, LookupNewWaterMarkActivity, IncrementalCopyActivity,
StoredProceduretoWriteWatermarkActivity}
Parameters :

Run the pipeline


1. Run the pipeline IncrementalCopyPipeline by using the Invoke-AzDataFactoryV2Pipeline cmdlet.
Replace placeholders with your own resource group and data factory name.

$RunId = Invoke-AzDataFactoryV2Pipeline -PipelineName "IncrementalCopyPipeline" -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName

2. Check the status of the pipeline by running the Get-AzDataFactoryV2ActivityRun cmdlet until you see
all the activities running successfully. Replace placeholders with your own appropriate time for the
parameters RunStartedAfter and RunStartedBefore. In this tutorial, you use -RunStartedAfter
"2017/09/14" and -RunStartedBefore "2017/09/15".

Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $RunId -RunStartedAfter "<start time>" -RunStartedBefore "<end time>"

Here is the sample output:


ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupNewWaterMarkActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {NewWatermarkvalue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:42:42 AM
ActivityRunEnd : 9/14/2017 7:42:50 AM
DurationInMs : 7777
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupOldWaterMarkActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {TableName, WatermarkValue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:42:42 AM
ActivityRunEnd : 9/14/2017 7:43:07 AM
DurationInMs : 25437
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : IncrementalCopyActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {source, sink}
Output : {dataRead, dataWritten, rowsCopied, copyDuration...}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:43:10 AM
ActivityRunEnd : 9/14/2017 7:43:29 AM
DurationInMs : 19769
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : StoredProceduretoWriteWatermarkActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {storedProcedureName, storedProcedureParameters}
Output : {}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:43:32 AM
ActivityRunEnd : 9/14/2017 7:43:47 AM
DurationInMs : 14467
Status : Succeeded
Error : {errorCode, message, failureType, target}
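
Rather than rerunning the cmdlet by hand until everything succeeds, you can poll the pipeline run until it completes; a sketch follows (the 15-second interval is arbitrary).

# Poll the pipeline run until it leaves the Queued/InProgress states.
while ($true) {
    $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName `
        -DataFactoryName $dataFactoryName -PipelineRunId $RunId
    if ($run.Status -notin @("Queued", "InProgress")) { break }
    Write-Host "Pipeline is still running..." -ForegroundColor Yellow
    Start-Sleep -Seconds 15
}
Write-Host ("Pipeline run finished with status: " + $run.Status)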

Review the results


1. In the blob storage (sink store), you see that the data was copied to the file defined in SinkDataset. In the
current tutorial, the file name is Incremental-d4bf3ce2-5d60-43f3-9318-923155f61037.txt . Open the file, and
you can see records in the file that are the same as the data in the SQL database.
1,aaaa,2017-09-01 00:56:00.0000000
2,bbbb,2017-09-02 05:23:00.0000000
3,cccc,2017-09-03 02:36:00.0000000
4,dddd,2017-09-04 03:21:00.0000000
5,eeee,2017-09-05 08:06:00.0000000

2. Check the latest value from watermarktable . You see that the watermark value was updated.

Select * from watermarktable

Here is the sample output:

TABLENAME WATERMARKVALUE

data_source_table 2017-09-05 8:06:00.000

Insert data into the data source store to verify delta data loading
1. Insert new data into the SQL database (data source store).

INSERT INTO data_source_table


VALUES (6, 'newdata','9/6/2017 2:23:00 AM')

INSERT INTO data_source_table


VALUES (7, 'newdata','9/7/2017 9:01:00 AM')

The updated data in the SQL database is:

PersonID | Name | LastModifytime


-------- | ---- | --------------
1 | aaaa | 2017-09-01 00:56:00.000
2 | bbbb | 2017-09-02 05:23:00.000
3 | cccc | 2017-09-03 02:36:00.000
4 | dddd | 2017-09-04 03:21:00.000
5 | eeee | 2017-09-05 08:06:00.000
6 | newdata | 2017-09-06 02:23:00.000
7 | newdata | 2017-09-07 09:01:00.000

2. Run the pipeline IncrementalCopyPipeline again by using the Invoke-AzDataFactoryV2Pipeline cmdlet.


Replace placeholders with your own resource group and data factory name.

$RunId = Invoke-AzDataFactoryV2Pipeline -PipelineName "IncrementalCopyPipeline" -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName

3. Check the status of the pipeline by running the Get-AzDataFactoryV2ActivityRun cmdlet until you see
all the activities running successfully. Replace placeholders with your own appropriate time for the
parameters RunStartedAfter and RunStartedBefore. In this tutorial, you use -RunStartedAfter
"2017/09/14" and -RunStartedBefore "2017/09/15".

Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $RunId -RunStartedAfter "<start time>" -RunStartedBefore "<end time>"

Here is the sample output:


ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupNewWaterMarkActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {NewWatermarkvalue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:52:26 AM
ActivityRunEnd : 9/14/2017 8:52:58 AM
DurationInMs : 31758
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupOldWaterMarkActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {TableName, WatermarkValue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:52:26 AM
ActivityRunEnd : 9/14/2017 8:52:52 AM
DurationInMs : 25497
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : IncrementalCopyActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {source, sink}
Output : {dataRead, dataWritten, rowsCopied, copyDuration...}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:53:00 AM
ActivityRunEnd : 9/14/2017 8:53:20 AM
DurationInMs : 20194
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : StoredProceduretoWriteWatermarkActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {storedProcedureName, storedProcedureParameters}
Output : {}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:53:23 AM
ActivityRunEnd : 9/14/2017 8:53:41 AM
DurationInMs : 18502
Status : Succeeded
Error : {errorCode, message, failureType, target}

4. In the blob storage, you see that another file was created. In this tutorial, the new file name is
Incremental-2fc90ab8-d42c-4583-aa64-755dba9925d7.txt . Open that file, and you see two rows of records in it.

5. Check the latest value from watermarktable . You see that the watermark value was updated again.

Select * from watermarktable


Sample output:

TABLENAME WATERMARKVALUE

data_source_table 2017-09-07 09:01:00.000

Next steps
You performed the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
In this tutorial, the pipeline copied data from a single table in a SQL database to Blob storage. Advance to the
following tutorial to learn how to copy data from multiple tables in an on-premises SQL Server database to a SQL
database.
Incrementally load data from multiple tables in SQL Server to Azure SQL Database
Incrementally load data from multiple tables in SQL Server to an
Azure SQL database
4/14/2019 • 17 minutes to read

In this tutorial, you create an Azure data factory with a pipeline that loads delta data from multiple tables in an on-premises SQL Server database to an
Azure SQL database.
You perform the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime.
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.

Overview
Here are the important steps to create this solution:
1. Select the watermark column.
Select one column for each table in the source data store, which can be used to identify the new or updated records for every run.
Normally, the data in this selected column (for example, last_modify_time or ID ) keeps increasing when rows are created or updated. The
maximum value in this column is used as a watermark.
2. Prepare a data store to store the watermark value.
In this tutorial, you store the watermark value in a SQL database.
3. Create a pipeline with the following activities:
a. Create a ForEach activity that iterates through a list of source table names that is passed as a parameter to the pipeline. For each source
table, it invokes the following activities to perform delta loading for that table.
b. Create two lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the second Lookup activity to
retrieve the new watermark value. These watermark values are passed to the Copy activity.
c. Create a Copy activity that copies rows from the source data store with the value of the watermark column greater than the old
watermark value and less than or equal to the new watermark value. Then, it copies the delta data from the source data store to the
destination Azure SQL database.
d. Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next time.
Here is the high-level solution diagram:

If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
SQL Server. You use an on-premises SQL Server database as the source data store in this tutorial.
Azure SQL Database. You use a SQL database as the sink data store. If you don't have a SQL database, see Create an Azure SQL database
for steps to create one.
Create source tables in your SQL Server database
1. Open SQL Server Management Studio, and connect to your on-premises SQL Server database.
2. In Server Explorer, right-click the database and choose New Query.
3. Run the following SQL command against your database to create tables named customer_table and project_table :

create table customer_table


(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

create table project_table


(
Project varchar(255),
Creationtime datetime
);

INSERT INTO customer_table


(PersonID, Name, LastModifytime)
VALUES
(1, 'John','9/1/2017 12:56:00 AM'),
(2, 'Mike','9/2/2017 5:23:00 AM'),
(3, 'Alice','9/3/2017 2:36:00 AM'),
(4, 'Andy','9/4/2017 3:21:00 AM'),
(5, 'Anny','9/5/2017 8:06:00 AM');

INSERT INTO project_table


(Project, Creationtime)
VALUES
('project1','1/1/2015 0:00:00 AM'),
('project2','2/2/2016 1:23:00 AM'),
('project3','3/4/2017 5:16:00 AM');

Create destination tables in your Azure SQL database


1. Open SQL Server Management Studio, and connect to your Azure SQL database.
2. In Server Explorer, right-click the database and choose New Query.
3. Run the following SQL command against your SQL database to create tables named customer_table and project_table :

create table customer_table


(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

create table project_table


(
Project varchar(255),
Creationtime datetime
);

Create another table in the Azure SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to store the watermark value:

create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime
);

2. Insert initial watermark values for both source tables into the watermark table.
INSERT INTO watermarktable
VALUES
('customer_table','1/1/2010 12:00:00 AM'),
('project_table','1/1/2010 12:00:00 AM');

Create a stored procedure in the Azure SQL database


Run the following command to create a stored procedure in your SQL database. This stored procedure updates the watermark value after every
pipeline run.

CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)


AS

BEGIN

UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName

END

Create data types and additional stored procedures in Azure SQL database
Run the following query to create two stored procedures and two data types in your SQL database. They're used to merge the data from source
tables into destination tables.
To keep this tutorial simple to start with, the Copy activity passes the delta data to these stored procedures through a table variable, and the
procedures then merge it into the destination tables. Be aware that this approach is not intended for a large number of delta rows (more than about
100) per run, because they are all held in the table variable.
If you do need to merge a large number of delta rows into the destination store, we suggest using a Copy activity to copy all the delta data into a
temporary "staging" table in the destination store first, and then building your own stored procedure, without a table variable, to merge the rows
from the "staging" table into the "final" table.
CREATE TYPE DataTypeforCustomerTable AS TABLE(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

GO

CREATE PROCEDURE usp_upsert_customer_table @customer_table DataTypeforCustomerTable READONLY


AS

BEGIN
MERGE customer_table AS target
USING @customer_table AS source
ON (target.PersonID = source.PersonID)
WHEN MATCHED THEN
UPDATE SET Name = source.Name,LastModifytime = source.LastModifytime
WHEN NOT MATCHED THEN
INSERT (PersonID, Name, LastModifytime)
VALUES (source.PersonID, source.Name, source.LastModifytime);
END

GO

CREATE TYPE DataTypeforProjectTable AS TABLE(


Project varchar(255),
Creationtime datetime
);

GO

CREATE PROCEDURE usp_upsert_project_table @project_table DataTypeforProjectTable READONLY


AS

BEGIN
MERGE project_table AS target
USING @project_table AS source
ON (target.Project = source.Project)
WHEN MATCHED THEN
UPDATE SET Creationtime = source.Creationtime
WHEN NOT MATCHED THEN
INSERT (Project, Creationtime)
VALUES (source.Project, source.Creationtime);
END
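
To see the upsert behavior before Data Factory drives it, you can exercise one of these procedures manually with a table variable; the sketch below targets the customer table, and the connection values are placeholders.

# Build a small batch in a table variable and pass it to the upsert procedure.
$upsertTest = @"
DECLARE @rows DataTypeforCustomerTable;
INSERT INTO @rows VALUES (1, 'John - updated', '2017-09-08 10:00:00'), (6, 'Frank', '2017-09-08 11:00:00');
EXEC usp_upsert_customer_table @customer_table = @rows;
SELECT * FROM customer_table;
"@
Invoke-Sqlcmd -ServerInstance "<server>.database.windows.net" -Database "<database>" `
    -Username "<user>" -Password "<password>" -Query $upsertTest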

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in Microsoft Edge and Google
Chrome web browsers.
2. Click New on the left menu, click Data + Analytics, and click Data Factory.
3. In the New data factory page, enter ADFMultiIncCopyTutorialDF for the name.

The name of the Azure data factory must be globally unique. If you receive the following error, change the name of the data factory (for
example, yournameADFMultiIncCopyTutorialDF ) and try creating again. See Data Factory - Naming Rules article for naming rules for
Data Factory artifacts.

`Data factory name ADFMultiIncCopyTutorialDF is not available`

4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 (Preview) for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down list. The data stores (Azure
Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.

11. After the creation is complete, you see the Data Factory page as shown in the image.

12. Click Author & Monitor tile to launch Azure Data Factory user interface (UI) in a separate tab.
13. In the get started page of Azure Data Factory UI, click Create pipeline (or) switch to the Edit tab.
Create self-hosted integration runtime
As you are moving data from a data store in a private network (on-premises) to an Azure data store, install a self-hosted integration runtime (IR )
in your on-premises environment. The self-hosted IR moves data between your private network and Azure.
1. Click Connections at the bottom of the left pane, and switch to the Integration Runtimes in the Connections window.

2. In the Integration Runtimes tab, click + New.

3. In the Integration Runtime Setup window, select Perform data movement and dispatch activities to external computes, and
click Next.
4. Select Private Network, and click Next.

5. Enter MySelfHostedIR for Name, and click Next.


6. Click Click here to launch the express setup for this computer in the Option 1: Express setup section.
7. In the Integration Runtime (Self-hosted) Express Setup window, click Close.

8. In the Web browser, in the Integration Runtime Setup window, click Finish.
9. Confirm that you see MySelfHostedIR in the list of integration runtimes.
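
As an alternative to the portal steps above, the integration runtime definition and its authentication key can also be created and retrieved with PowerShell, which is useful when you script the setup; the resource group and data factory names below are placeholders.

# Create a self-hosted integration runtime definition in the data factory.
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<resourceGroupName>" `
    -DataFactoryName "<dataFactoryName>" -Name "MySelfHostedIR" -Type SelfHosted `
    -Description "Self-hosted IR for the on-premises SQL Server source"

# Retrieve an authentication key used to register the on-premises node with this runtime.
Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName "<resourceGroupName>" `
    -DataFactoryName "<dataFactoryName>" -Name "MySelfHostedIR"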

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In this section, you create linked
services to your on-premises SQL Server database and SQL database.
Create the SQL Server linked service
In this step, you link your on-premises SQL Server database to the data factory.
1. In the Connections window, switch from Integration Runtimes tab to the Linked Services tab, and click + New.
2. In the New Linked Service window, select SQL Server, and click Continue.

3. In the New Linked Service window, do the following steps:


a. Enter SqlServerLinkedService for Name.
b. Select MySelfHostedIR for Connect via integration runtime. This is an important step. The default integration runtime
cannot connect to an on-premises data store. Use the self-hosted integration runtime you created earlier.
c. For Server name, enter the name of your computer that has the SQL Server database.
d. For Database name, enter the name of the database in your SQL Server that has the source data. You created a table and
inserted data into this database as part of the prerequisites.
e. For Authentication type, select the type of the authentication you want to use to connect to the database.
f. For User name, enter the name of a user that has access to the SQL Server database. If you need to use a backslash (\) in
your user account or server name, escape it with another backslash (\\). An example is mydomain\\myuser.
g. For Password, enter the password for the user.
h. To test whether Data Factory can connect to your SQL Server database, click Test connection. Fix any errors until the connection
succeeds.
i. To save the linked service, click Save.

Create the Azure SQL Database linked service


In the last step, you create a linked service to link your source SQL Server database to the data factory. In this step, you link your
destination/sink Azure SQL database to the data factory.
1. In the Connections window, switch from Integration Runtimes tab to the Linked Services tab, and click + New.

2. In the New Linked Service window, select Azure SQL Database, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for Name.
b. For Server name, select the name of your Azure SQL server from the drop-down list.
c. For Database name, select the Azure SQL database in which you created customer_table and project_table as part of the
prerequisites.
d. For User name, enter the name of user that has access to the Azure SQL database.
e. For Password, enter the password for the user.
f. To test whether Data Factory can connect to your Azure SQL database, click Test connection. Fix any errors until the connection
succeeds.
g. To save the linked service, click Save.

4. Confirm that you see two linked services in the list.


Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the watermark.
Create a source dataset
1. In the left pane, click + (plus), and click Dataset.

2. In the New Dataset window, select SQL Server, click Finish.


3. You see a new tab opened in the Web browser for configuring the dataset. You also see a dataset in the treeview. In the General tab of the
Properties window at the bottom, enter SourceDataset for Name.

4. Switch to the Connection tab in the Properties window, and select SqlServerLinkedService for Linked service. You do not select a
table here. The Copy activity in the pipeline uses a SQL query to load the data rather than load the entire table.
Create a sink dataset
1. In the left pane, click + (plus), and click Dataset.

2. In the New Dataset window, select Azure SQL Database, and click Finish.
3. You see a new tab opened in the Web browser for configuring the dataset. You also see a dataset in the treeview. In the General tab of the
Properties window at the bottom, enter SinkDataset for Name.

4. Switch to the Parameters tab in the Properties window, and do the following steps:
a. Click New in the Create/update parameters section.
b. Enter SinkTableName for the name, and String for the type. This dataset takes SinkTableName as a parameter. The
SinkTableName parameter is set by the pipeline dynamically at runtime. The ForEach activity in the pipeline iterates through a list
of table names and passes the table name to this dataset in each iteration.

5. Switch to the Connection tab in the Properties window, and select AzureSqlLinkedService for Linked service. For Table property,
click Add dynamic content.

6. Select SinkTableName in the Parameters section

7. After clicking Finish, you see @dataset().SinkTableName as the table name.


Create a dataset for a watermark
In this step, you create a dataset for storing a high watermark value.
1. In the left pane, click + (plus), and click Dataset.

2. In the New Dataset window, select Azure SQL Database, and click Finish.
3. In the General tab of the Properties window at the bottom, enter WatermarkDataset for Name.
4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select [dbo].[watermarktable] for Table.

Create a pipeline
The pipeline takes a list of table names as a parameter. The ForEach activity iterates through the list of table names and performs the following
operations:
1. Use the Lookup activity to retrieve the old watermark value (the initial value or the one that was used in the last iteration).
2. Use the Lookup activity to retrieve the new watermark value (the maximum value of the watermark column in the source table).
3. Use the Copy activity to copy data between these two watermark values from the source database to the destination database.
4. Use the StoredProcedure activity to update the old watermark value to be used in the first step of the next iteration.
Create the pipeline
1. In the left pane, click + (plus), and click Pipeline.

2. In the General tab of the Properties window, enter IncrementalCopyPipeline for Name.

3. In the Properties window, do the following steps:


a. Click + New.
b. Enter tableList for the parameter name.
c. Select Object for the parameter type.

4. In the Activities toolbox, expand Iteration & Conditionals, and drag-drop the ForEach activity to the pipeline designer surface. In the
General tab of the Properties window, enter IterateSQLTables.

5. Switch to the Settings tab in the Properties window, and enter @pipeline().parameters.tableList for Items. The ForEach activity
iterates through a list of tables and performs the incremental copy operation.

6. Select the ForEach activity in the pipeline if it isn't already selected. Click the Edit (Pencil icon) button.
7. In the Activities toolbox, expand General, drag-drop the Lookup activity to the pipeline designer surface, and enter
LookupOldWaterMarkActivity for Name.

8. Switch to the Settings tab of the Properties window, and do the following steps:
a. Select WatermarkDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.

select * from watermarktable where TableName = '@{item().TABLE_NAME}'


9. Drag-drop the Lookup activity from the Activities toolbox, and enter LookupNewWaterMarkActivity for Name.
10. Switch to the Settings tab.
a. Select SourceDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.

select MAX(@{item().WaterMark_Column}) as NewWatermarkvalue from @{item().TABLE_NAME}

11. Drag-drop the Copy activity from the Activities toolbox, and enter IncrementalCopyActivity for Name.
12. Connect Lookup activities to the Copy activity one by one. To connect, start dragging at the green box attached to the Lookup activity
and drop it on the Copy activity. Release the mouse button when the border color of the Copy activity changes to blue.

13. Select the Copy activity in the pipeline. Switch to the Source tab in the Properties window.
a. Select SourceDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.

select * from @{item().TABLE_NAME} where @{item().WaterMark_Column} >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and @{item().WaterMark_Column} <=
'@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'
14. Switch to the Sink tab, and select SinkDataset for Sink Dataset.

15. Do the following steps:


a. In the Dataset property, for SinkTableName parameter, enter @{item().TABLE_NAME} .
b. For Stored Procedure Name property, enter @{item().StoredProcedureNameForMergeOperation} .
c. For Table Type property, enter @{item().TableType} .
16. Drag-and-drop the Stored Procedure activity from the Activities toolbox to the pipeline designer surface. Connect the Copy activity to
the Stored Procedure activity.

17. Select the Stored Procedure activity in the pipeline, and enter StoredProceduretoWriteWatermarkActivity for Name in the General
tab of the Properties window.

18. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked Service.
19. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name, select usp_write_watermark .
b. Select Import parameter.
c. Specify the following values for the parameters:

NAME               TYPE       VALUE
LastModifiedtime   DateTime   @{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}
TableName          String     @{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}

20. In the left pane, click Publish. This action publishes the entities you created to the Data Factory service.

21. Wait until you see the Successfully published message. To see the notifications, click the Show Notifications link. Close the
notifications window by clicking X.
Run the pipeline
1. On the toolbar for the pipeline, click Trigger, and click Trigger Now.

2. In the Pipeline Run window, enter the following value for the tableList parameter, and click Finish.
[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]

Monitor the pipeline


1. Switch to the Monitor tab on the left. You see the pipeline run triggered by the manual trigger. Click Refresh button to refresh the list.
Links in the Actions column allow you to view activity runs associated with the pipeline run, and to rerun the pipeline.

2. Click View Activity Runs link in the Actions column. You see all the activity runs associated with the selected pipeline run.
Review the results
In SQL Server Management Studio, run the following queries against the target SQL database to verify that the data was copied from source
tables to destination tables:
Query

select * from customer_table

Output

===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 Alice 2017-09-03 02:36:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000

Query

select * from project_table

Output

===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000

Query

select * from watermarktable

Output

======================================
TableName WatermarkValue
======================================
customer_table 2017-09-05 08:06:00.000
project_table 2017-03-04 05:16:00.000

Notice that the watermark values for both tables were updated.
Add more data to the source tables
Run the following query against the source SQL Server database to update an existing row in customer_table. Insert a new row into
project_table.

UPDATE customer_table
SET [LastModifytime] = '2017-09-08T00:00:00Z', [name]='NewName' where [PersonID] = 3

INSERT INTO project_table


(Project, Creationtime)
VALUES
('NewProject','10/1/2017 0:00:00 AM');

Rerun the pipeline


1. In the web browser window, switch to the Edit tab on the left.
2. On the toolbar for the pipeline, click Trigger, and click Trigger Now.

3. In the Pipeline Run window, enter the following value for the tableList parameter, and click Finish.

[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]

Monitor the pipeline again


1. Switch to the Monitor tab on the left. You see the pipeline run triggered by the manual trigger. Click Refresh button to refresh the list.
Links in the Actions column allow you to view activity runs associated with the pipeline run, and to rerun the pipeline.
2. Click View Activity Runs link in the Actions column. You see all the activity runs associated with the selected pipeline run.

Review the final results


In SQL Server Management Studio, run the following queries against the target database to verify that the updated/new data was copied from
source tables to destination tables.
Query

select * from customer_table

Output

===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 NewName 2017-09-08 00:00:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000

Notice the new values of Name and LastModifytime for the row with PersonID 3.
Query

select * from project_table

Output

===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
NewProject 2017-10-01 00:00:00.000

Notice that the NewProject entry was added to project_table.


Query

select * from watermarktable

Output
======================================
TableName WatermarkValue
======================================
customer_table 2017-09-08 00:00:00.000
project_table 2017-10-01 00:00:00.000

Notice that the watermark values for both tables were updated.

Next steps
You performed the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime (IR).
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.
Advance to the following tutorial to learn how to incrementally load data by using the Change Tracking technology:
Incrementally load data from Azure SQL Database to Azure Blob storage by using Change Tracking technology
Incrementally load data from multiple tables in SQL
Server to an Azure SQL database

In this tutorial, you create an Azure data factory with a pipeline that loads delta data from multiple tables in on-
premises SQL Server to an Azure SQL database.
You perform the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime.
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.

Overview
Here are the important steps to create this solution:
1. Select the watermark column. Select one column for each table in the source data store, which can be
used to identify the new or updated records for every run. Normally, the data in this selected column (for
example, last_modify_time or ID ) keeps increasing when rows are created or updated. The maximum value
in this column is used as a watermark.
2. Prepare a data store to store the watermark value.
In this tutorial, you store the watermark value in a SQL database.
3. Create a pipeline with the following activities:
a. Create a ForEach activity that iterates through a list of source table names that is passed as a parameter
to the pipeline. For each source table, it invokes the following activities to perform delta loading for that
table.
b. Create two lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the
second Lookup activity to retrieve the new watermark value. These watermark values are passed to the
Copy activity.
c. Create a Copy activity that copies rows from the source data store with the value of the watermark
column greater than the old watermark value and less than or equal to the new watermark value. Then, it copies the
delta data from the source data store to the destination store.
d. Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next time.
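
To make the watermark pattern concrete, here is roughly what one iteration of the delta load does for the sample
customer_table and watermarktable that you create in the prerequisites below. This is an illustration only, not a
tutorial step; the literal watermark values stand in for the outputs of the Lookup activities in the pipeline.

-- Retrieve the old watermark value saved after the previous run.
select WatermarkValue from watermarktable where TableName = 'customer_table';

-- Retrieve the new watermark value: the current maximum of the watermark column.
select MAX(LastModifytime) as NewWatermarkvalue from customer_table;

-- Copy only the rows that changed between the two watermark values
-- (illustrative literal values shown instead of the Lookup outputs).
select * from customer_table
where LastModifytime > '2017-09-01 00:56:00' and LastModifytime <= '2017-09-05 08:06:00';
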
Here is the high-level solution diagram:
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
SQL Server. You use an on-premises SQL Server database as the source data store in this tutorial.
Azure SQL Database. You use a SQL database as the sink data store. If you don't have a SQL database, see
Create an Azure SQL database for steps to create one.
Create source tables in your SQL Server database
1. Open SQL Server Management Studio, and connect to your on-premises SQL Server database.
2. In Server Explorer, right-click the database and choose New Query.
3. Run the following SQL command against your database to create tables named customer_table and
project_table :

create table customer_table


(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

create table project_table


(
Project varchar(255),
Creationtime datetime
);

INSERT INTO customer_table


(PersonID, Name, LastModifytime)
VALUES
(1, 'John','9/1/2017 12:56:00 AM'),
(2, 'Mike','9/2/2017 5:23:00 AM'),
(3, 'Alice','9/3/2017 2:36:00 AM'),
(4, 'Andy','9/4/2017 3:21:00 AM'),
(5, 'Anny','9/5/2017 8:06:00 AM');

INSERT INTO project_table


(Project, Creationtime)
VALUES
('project1','1/1/2015 0:00:00 AM'),
('project2','2/2/2016 1:23:00 AM'),
('project3','3/4/2017 5:16:00 AM');

Create destination tables in your Azure SQL database


1. Open SQL Server Management Studio, and connect to your Azure SQL database.
2. In Server Explorer, right-click the database and choose New Query.
3. Run the following SQL command against your SQL database to create tables named customer_table and
project_table :

create table customer_table


(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

create table project_table


(
Project varchar(255),
Creationtime datetime
);

Create another table in the Azure SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to
store the watermark value:

create table watermarktable


(

TableName varchar(255),
WatermarkValue datetime
);

2. Insert initial watermark values for both source tables into the watermark table.

INSERT INTO watermarktable


VALUES
('customer_table','1/1/2010 12:00:00 AM'),
('project_table','1/1/2010 12:00:00 AM');

Create a stored procedure in the Azure SQL database


Run the following command to create a stored procedure in your SQL database. This stored procedure updates
the watermark value after every pipeline run.

CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)


AS

BEGIN

UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName

END
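
For reference, the StoredProcedure activity in the pipeline that you build later effectively issues a call like the
following. The parameter values are illustrative; at run time they come from the Lookup activities.

-- Illustrative call only; the pipeline supplies the real values at run time.
EXEC usp_write_watermark @LastModifiedtime = '2017-09-05 08:06:00', @TableName = 'customer_table';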

Create data types and additional stored procedures in the Azure SQL database
Run the following query to create two stored procedures and two data types in your SQL database. They're used
to merge the data from source tables into destination tables.
To keep this tutorial easy to start with, the stored procedures pass the delta data in via a table variable and then
merge it into the destination store. Be cautious that this approach is not designed for a large number of delta rows
(more than 100) in the table variable.
If you do need to merge a large number of delta rows into the destination store, we suggest that you use the copy
activity to copy all the delta data into a temporary staging table in the destination store first, and then build your
own stored procedure that merges the rows from the staging table into the final table without using a table variable.
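
A minimal sketch of that staging-table approach for customer_table follows. The staging table and stored procedure
names (stage_customer_table, usp_upsert_customer_table_from_stage) are hypothetical and are not part of this tutorial.

-- Hypothetical staging table; a copy activity would load the delta rows into it first.
create table stage_customer_table
(
    PersonID int,
    Name varchar(255),
    LastModifytime datetime
);
GO

-- Hypothetical stored procedure that merges from the staging table instead of a table variable.
CREATE PROCEDURE usp_upsert_customer_table_from_stage
AS
BEGIN
    MERGE customer_table AS target
    USING stage_customer_table AS source
    ON (target.PersonID = source.PersonID)
    WHEN MATCHED THEN
        UPDATE SET Name = source.Name, LastModifytime = source.LastModifytime
    WHEN NOT MATCHED THEN
        INSERT (PersonID, Name, LastModifytime)
        VALUES (source.PersonID, source.Name, source.LastModifytime);

    -- Clear the staging table for the next run.
    TRUNCATE TABLE stage_customer_table;
END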

CREATE TYPE DataTypeforCustomerTable AS TABLE(


PersonID int,
Name varchar(255),
LastModifytime datetime
);

GO

CREATE PROCEDURE usp_upsert_customer_table @customer_table DataTypeforCustomerTable READONLY


AS

BEGIN
MERGE customer_table AS target
USING @customer_table AS source
ON (target.PersonID = source.PersonID)
WHEN MATCHED THEN
UPDATE SET Name = source.Name,LastModifytime = source.LastModifytime
WHEN NOT MATCHED THEN
INSERT (PersonID, Name, LastModifytime)
VALUES (source.PersonID, source.Name, source.LastModifytime);
END

GO

CREATE TYPE DataTypeforProjectTable AS TABLE(


Project varchar(255),
Creationtime datetime
);

GO

CREATE PROCEDURE usp_upsert_project_table @project_table DataTypeforProjectTable READONLY


AS

BEGIN
MERGE project_table AS target
USING @project_table AS source
ON (target.Project = source.Project)
WHEN MATCHED THEN
UPDATE SET Creationtime = source.Creationtime
WHEN NOT MATCHED THEN
INSERT (Project, Creationtime)
VALUES (source.Project, source.Creationtime);
END

Azure PowerShell
Install the latest Azure PowerShell modules by following the instructions in Install and configure Azure
PowerShell.

Create a data factory


1. Define a variable for the resource group name that you use in PowerShell commands later. Copy the
following command text to PowerShell, specify a name for the Azure resource group in double quotation
marks, and then run the command. An example is "adfrg" .

$resourceGroupName = "ADFTutorialResourceGroup";

If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.

2. Define a variable for the location of the data factory.

$location = "East US"

3. To create the Azure resource group, run the following command:

New-AzureRmResourceGroup $resourceGroupName $location

If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.

4. Define a variable for the data factory name.

IMPORTANT
Update the data factory name to make it globally unique. An example is ADFIncMultiCopyTutorialFactorySP1127.

$dataFactoryName = "ADFIncMultiCopyTutorialFactory";

5. To create the data factory, run the following Set-AzureRmDataFactoryV2 cmdlet:

Set-AzureRmDataFactoryV2 -ResourceGroupName $resourceGroupName -Location $location -Name $dataFactoryName

Note the following points:


The name of the data factory must be globally unique. If you receive the following error, change the name
and try again:

The specified Data Factory name 'ADFIncMultiCopyTutorialFactory' is already in use. Data Factory names
must be globally unique.

To create Data Factory instances, the user account you use to sign in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on
the following page, and then expand Analytics to locate Data Factory: Products available by region. The
data stores (Azure Storage, SQL Database, etc.) and computes (Azure HDInsight, etc.) used by the data
factory can be in other regions.

Create a self-hosted integration runtime


In this section, you create a self-hosted integration runtime and associate it with the on-premises machine that hosts the
SQL Server database. The self-hosted integration runtime is the component that copies data from SQL Server on
your machine to your Azure SQL database.
1. Create a variable for the name of the integration runtime. Use a unique name, and make a note of it. You
use it later in this tutorial.

$integrationRuntimeName = "ADFTutorialIR"

2. Create a self-hosted integration runtime.

Set-AzDataFactoryV2IntegrationRuntime -Name $integrationRuntimeName -Type SelfHosted -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName

Here is the sample output:

Id : /subscriptions/<subscription
ID>/resourceGroups/ADFTutorialResourceGroup/providers/Microsoft.DataFactory/factories/onpremdf0914/inte
grationruntimes/myonpremirsp0914
Type : SelfHosted
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Name : myonpremirsp0914
Description :

3. To retrieve the status of the created integration runtime, run the following command. Confirm that the value
of the State property is set to NeedRegistration.

Get-AzDataFactoryV2IntegrationRuntime -Name $integrationRuntimeName -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Status

Here is the sample output:

Nodes : {}
CreateTime : 9/14/2017 10:01:21 AM
InternalChannelEncryption :
Version :
Capabilities : {}
ScheduledUpdateDate :
UpdateDelayOffset :
LocalTimeZoneOffset :
AutoUpdate :
ServiceUrls : {eu.frontend.clouddatahub.net, *.servicebus.windows.net}
ResourceGroupName : <ResourceGroup name>
DataFactoryName : <DataFactory name>
Name : <Integration Runtime name>
State : NeedRegistration

4. To retrieve the authentication keys used to register the self-hosted integration runtime with Azure Data
Factory service in the cloud, run the following command:

Get-AzDataFactoryV2IntegrationRuntimeKey -Name $integrationRuntimeName -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName | ConvertTo-Json

Here is the sample output:


{
"AuthKey1": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=",
"AuthKey2": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy="
}

5. Copy one of the keys (exclude the double quotation marks) used to register the self-hosted integration
runtime that you install on your machine in the following steps.

Install the integration runtime


1. If you already have the integration runtime on your machine, uninstall it by using Add or Remove
Programs.
2. Download the self-hosted integration runtime on a local Windows machine. Run the installation.
3. On the Welcome to Microsoft Integration Runtime Setup page, select Next.
4. On the End-User License Agreement page, accept the terms and license agreement, and select Next.
5. On the Destination Folder page, select Next.
6. On the Ready to install Microsoft Integration Runtime page, select Install.
7. If you see a warning message about configuring the computer to enter sleep or hibernate mode when not
in use, select OK.
8. If you see the Power Options page, close it, and go to the setup page.
9. On the Completed the Microsoft Integration Runtime Setup page, select Finish.
10. On the Register Integration Runtime (Self-hosted) page, paste the key you saved in the previous
section, and select Register.

11. When the self-hosted integration runtime is registered successfully, you see the following message:
12. On the New Integration Runtime (Self-hosted) Node page, select Next.

13. On the Intranet Communication Channel page, select Skip. (Select a TLS/SSL certificate only if you want to secure
intranode communication in a multinode integration runtime environment.)
14. On the Register Integration Runtime (Self-hosted) page, select Launch Configuration Manager.
15. When the node is connected to the cloud service, you see the following page:

16. Now, test the connectivity to your SQL Server database.


a. On the Configuration Manager page, go to the Diagnostics tab.
b. Select SqlServer for the data source type.
c. Enter the server name.
d. Enter the database name.
e. Select the authentication mode.
f. Enter the user name.
g. Enter the password for the user name.
h. Select Test to confirm that the integration runtime can connect to SQL Server. If the connection is
successful, you see a green check mark. If the connection is not successful, you see an error message. Fix
any issues, and ensure that the integration runtime can connect to SQL Server.

NOTE
Make a note of the values for authentication type, server, database, user, and password. You use them later in this
tutorial.

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In this
section, you create linked services to your on-premises SQL Server database and SQL database.
Create the SQL Server linked service
In this step, you link your on-premises SQL Server database to the data factory.
1. Create a JSON file named SqlServerLinkedService.json in the C:\ADFTutorials\IncCopyMultiTableTutorial
folder with the following content. Select the right section based on the authentication you use to connect to
SQL Server. Create the local folders if they don't already exist.
IMPORTANT
Select the right section based on the authentication you use to connect to SQL Server.

If you use SQL authentication, copy the following JSON definition:

{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<servername>;Database=<databasename>;User ID=<username>;Password=
<password>;Timeout=60"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
}
},
"name": "SqlServerLinkedService"
}

If you use Windows authentication, copy the following JSON definition:

{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Database=<database>;Integrated Security=True"
},
"userName": "<user> or <domain>\\<user>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
}
},
"name": "SqlServerLinkedService"
}

IMPORTANT
Select the right section based on the authentication you use to connect to SQL Server.
Replace <integration runtime name> with the name of your integration runtime.
Replace <servername>, <databasename>, <username>, and <password> with values of your SQL Server
database before you save the file.
If you need to use a slash character ( \ ) in your user account or server name, use the escape character ( \\ ). An
example is mydomain\\myuser.

2. In PowerShell, switch to the C:\ADFTutorials\IncCopyMultiTableTutorial folder.


3. Run the Set-AzureRmDataFactoryV2LinkedService cmdlet to create the linked service
SqlServerLinkedService. In the following example, you pass values for the ResourceGroupName and
DataFactoryName parameters:

Set-AzureRmDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SqlServerLinkedService" -File ".\SqlServerLinkedService.json"

Here is the sample output:

LinkedServiceName : SqlServerLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Properties : Microsoft.Azure.Management.DataFactory.Models.SqlServerLinkedService

Create the SQL database linked service


1. Create a JSON file named AzureSQLDatabaseLinkedService.json in the
C:\ADFTutorials\IncCopyMultiTableTutorial folder with the following content. (Create the folder if it
doesn't already exist.) Replace <server>, <database name>, <user name>, and <password> with the name of
your Azure SQL server, the database name, the user name, and the password before you save the file.

{
"name": "AzureSQLDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"value": "Server = tcp:<server>.database.windows.net,1433;Initial Catalog=<database name>; Persist
Security Info=False; User ID=<user name>; Password=<password>; MultipleActiveResultSets = False;
Encrypt = True; TrustServerCertificate = False; Connection Timeout = 30;",
"type": "SecureString"
}
}
}
}

2. In PowerShell, run the Set-AzureRmDataFactoryV2LinkedService cmdlet to create the linked service


AzureSQLDatabaseLinkedService.

Set-AzureRmDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSQLDatabaseLinkedService" -File ".\AzureSQLDatabaseLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureSQLDatabaseLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService

Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
watermark.
Create a source dataset
1. Create a JSON file named SourceDataset.json in the same folder with the following content:
{
"name": "SourceDataset",
"properties": {
"type": "SqlServerTable",
"typeProperties": {
"tableName": "dummyName"
},
"linkedServiceName": {
"referenceName": "SqlServerLinkedService",
"type": "LinkedServiceReference"
}
}
}

The table name is a dummy name. The Copy activity in the pipeline uses a SQL query to load the data
rather than load the entire table.
2. Run the Set-AzureRmDataFactoryV2Dataset cmdlet to create the dataset SourceDataset.

Set-AzureRmDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SourceDataset" -File ".\SourceDataset.json"

Here is the sample output of the cmdlet:

DatasetName : SourceDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.SqlServerTableDataset

Create a sink dataset


1. Create a JSON file named SinkDataset.json in the same folder with the following content. The tableName
element is set by the pipeline dynamically at runtime. The ForEach activity in the pipeline iterates through a
list of table names and passes the table name to this dataset in each iteration.

{
"name": "SinkDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": {
"value": "@{dataset().SinkTableName}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"SinkTableName": {
"type": "String"
}
}
}
}

2. Run the Set-AzureRmDataFactoryV2Dataset cmdlet to create the dataset SinkDataset.


Set-AzureRmDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName
-Name "SinkDataset" -File ".\SinkDataset.json"

Here is the sample output of the cmdlet:

DatasetName : SinkDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a dataset for a watermark


In this step, you create a dataset for storing a high watermark value.
1. Create a JSON file named WatermarkDataset.json in the same folder with the following content:

{
"name": " WatermarkDataset ",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "watermarktable"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}

2. Run the Set-AzureRmDataFactoryV2Dataset cmdlet to create the dataset WatermarkDataset.

Set-AzureRmDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "WatermarkDataset" -File ".\WatermarkDataset.json"

Here is the sample output of the cmdlet:

DatasetName : WatermarkDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : <data factory name>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a pipeline
The pipeline takes a list of table names as a parameter. The ForEach activity iterates through the list of table names
and performs the following operations:
1. Use the Lookup activity to retrieve the old watermark value (the initial value or the one that was used in the
last iteration).
2. Use the Lookup activity to retrieve the new watermark value (the maximum value of the watermark column
in the source table).
3. Use the Copy activity to copy data between these two watermark values from the source database to the
destination database.
4. Use the StoredProcedure activity to update the old watermark value to be used in the first step of the next
iteration.
Create the pipeline
1. Create a JSON file named IncrementalCopyPipeline.json in the same folder with the following content:

{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [{

"name": "IterateSQLTables",
"type": "ForEach",
"typeProperties": {
"isSequential": "false",
"items": {
"value": "@pipeline().parameters.tableList",
"type": "Expression"
},

"activities": [
{
"name": "LookupOldWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from watermarktable where TableName = '@{item().TABLE_NAME}'"
},

"dataset": {
"referenceName": "WatermarkDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupNewWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select MAX(@{item().WaterMark_Column}) as NewWatermarkvalue from
@{item().TABLE_NAME}"
},

"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},

{
"name": "IncrementalCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from @{item().TABLE_NAME} where @{item().WaterMark_Column} >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and
@{item().WaterMark_Column} <=
'@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'"
},
"sink": {
"type": "SqlSink",
"SqlWriterTableType": "@{item().TableType}",
"SqlWriterTableType": "@{item().TableType}",
"SqlWriterStoredProcedureName": "@{item().StoredProcedureNameForMergeOperation}"
}
},
"dependsOn": [{
"activity": "LookupNewWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupOldWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
}
],

"inputs": [{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "SinkDataset",
"type": "DatasetReference",
"parameters": {
"SinkTableName": "@{item().TableType}"
}
}]
},

{
"name": "StoredProceduretoWriteWatermarkActivity",
"type": "SqlServerStoredProcedure",
"typeProperties": {

"storedProcedureName": "usp_write_watermark",
"storedProcedureParameters": {
"LastModifiedtime": {
"value": "@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}",
"type": "datetime"
},
"TableName": {
"value": "@{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}",
"type": "String"
}
}
},

"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},

"dependsOn": [{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}]
}

}
}],

"parameters": {
"tableList": {
"type": "Object"
"type": "Object"
}
}
}
}

2. Run the Set-AzureRmDataFactoryV2Pipeline cmdlet to create the pipeline IncrementalCopyPipeline.

Set-AzureRmDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "IncrementalCopyPipeline" -File ".\IncrementalCopyPipeline.json"

Here is the sample output:

PipelineName : IncrementalCopyPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Activities : {IterateSQLTables}
Parameters : {[tableList,
Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}

Run the pipeline


1. Create a parameter file named Parameters.json in the same folder with the following content:

{
"tableList":
[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]
}

2. Run the pipeline IncrementalCopyPipeline by using the Invoke-AzureRmDataFactoryV2Pipeline cmdlet. The command
uses the $resourceGroupName and $dataFactoryName variables that you defined earlier, and it captures the pipeline run ID in $RunId.

$RunId = Invoke-AzureRmDataFactoryV2Pipeline -PipelineName "IncrementalCopyPipeline" -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -ParameterFile ".\Parameters.json"

Monitor the pipeline


1. Sign in to the Azure portal.
2. Select All services, search with the keyword Data factories, and select Data factories.
3. Search for your data factory in the list of data factories, and select it to open the Data factory page.

4. On the Data factory page, select Monitor & Manage.

5. The Data Integration Application opens in a separate tab. You can see all the pipeline runs and their
status. Notice that in the following example, the status of the pipeline run is Succeeded. To check
parameters passed to the pipeline, select the link in the Parameters column. If an error occurred, you see a
link in the Error column. Select the link in the Actions column.

6. When you select the link in the Actions column, you see the following page that shows all the activity runs
for the pipeline:

7. To go back to the Pipeline Runs view, select Pipelines as shown in the image.

Review the results


In SQL Server Management Studio, run the following queries against the target SQL database to verify that the
data was copied from source tables to destination tables:
Query

select * from customer_table

Output

===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 Alice 2017-09-03 02:36:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000

Query

select * from project_table

Output

===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000

Query

select * from watermarktable


Output

======================================
TableName WatermarkValue
======================================
customer_table 2017-09-05 08:06:00.000
project_table 2017-03-04 05:16:00.000

Notice that the watermark values for both tables were updated.

Add more data to the source tables


Run the following query against the source SQL Server database to update an existing row in customer_table.
Insert a new row into project_table.

UPDATE customer_table
SET [LastModifytime] = '2017-09-08T00:00:00Z', [name]='NewName' where [PersonID] = 3

INSERT INTO project_table


(Project, Creationtime)
VALUES
('NewProject','10/1/2017 0:00:00 AM');

Rerun the pipeline


1. Now, rerun the pipeline by executing the following PowerShell command:

$RunId = Invoke-AzureRmDataFactoryV2Pipeline -PipelineName "IncrementalCopyPipeline" -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -ParameterFile ".\Parameters.json"

2. Monitor the pipeline runs by following the instructions in the Monitor the pipeline section. Because the
pipeline status is In Progress, you see another action link under Actions to cancel the pipeline run.

3. Select Refresh to refresh the list until the pipeline run succeeds.

4. Optionally, select the View Activity Runs link under Actions to see all the activity runs associated with
this pipeline run.

Review the final results


In SQL Server Management Studio, run the following queries against the target database to verify that the
updated/new data was copied from source tables to destination tables.
Query

select * from customer_table

Output

===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 NewName 2017-09-08 00:00:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000

Notice the new values of Name and LastModifytime for the row with PersonID 3.
Query

select * from project_table

Output

===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
NewProject 2017-10-01 00:00:00.000

Notice that the NewProject entry was added to project_table.


Query

select * from watermarktable

Output

======================================
TableName WatermarkValue
======================================
customer_table 2017-09-08 00:00:00.000
project_table 2017-10-01 00:00:00.000

Notice that the watermark values for both tables were updated.

Next steps
You performed the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime (IR).
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.
Advance to the following tutorial to learn how to incrementally load data by using the Change Tracking technology:
Incrementally load data from Azure SQL Database to Azure Blob storage by using Change Tracking technology
Incrementally load data from Azure SQL Database to
Azure Blob Storage using change tracking
information

In this tutorial, you create an Azure data factory with a pipeline that loads delta data based on change tracking
information in the source Azure SQL database to an Azure blob storage.
You perform the following steps in this tutorial:
Prepare the source data store
Create a data factory.
Create linked services.
Create source, sink, and change tracking datasets.
Create, run, and monitor the full copy pipeline
Add or update data in the source table
Create, run, and monitor the incremental copy pipeline

Overview
In a data integration solution, incrementally loading data after initial data loads is a widely used scenario. In some
cases, the changed data within a period in your source data store can easily be sliced out (for example, by a
LastModifyTime or CreationTime column). In other cases, there is no explicit way to identify the delta data since the
last time you processed the data. The Change Tracking technology supported by data stores such as Azure SQL Database and
SQL Server can be used to identify the delta data. This tutorial describes how to use Azure Data Factory with SQL
Change Tracking technology to incrementally load delta data from Azure SQL Database into Azure Blob Storage.
For more concrete information about SQL Change Tracking technology, see Change tracking in SQL Server.

End-to-end workflow
Here are the typical end-to-end workflow steps to incrementally load data using the Change Tracking technology.

NOTE
Both Azure SQL Database and SQL Server support the Change Tracking technology. This tutorial uses Azure SQL Database as
the source data store. You can also use an on-premises SQL Server.

1. Initial loading of historical data (run once):


a. Enable Change Tracking technology in the source Azure SQL database.
b. Get the initial value of SYS_CHANGE_VERSION in the Azure SQL database as the baseline to capture
changed data.
c. Load full data from the Azure SQL database into an Azure blob storage.
2. Incremental loading of delta data on a schedule (run periodically after the initial loading of data):
a. Get the old and new SYS_CHANGE_VERSION values.
b. Load the delta data by joining the primary keys of changed rows (between two
SYS_CHANGE_VERSION values) from sys.change_tracking_tables with data in the source table, and
then move the delta data to destination.
c. Update the SYS_CHANGE_VERSION for the delta loading next time.
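
Expressed directly in T-SQL, step 2 of this workflow boils down to a pattern like the following sketch. It uses the
data_source_table, the table_store_ChangeTracking_version table, and the Update_ChangeTracking_Version stored
procedure that you create in the prerequisites below; in the pipeline, the version values come from Lookup activities
instead of local variables.

-- 2a. Old version (stored after the previous run) and current version.
DECLARE @old_version BIGINT =
    (SELECT SYS_CHANGE_VERSION FROM table_store_ChangeTracking_version WHERE TableName = 'data_source_table');
DECLARE @new_version BIGINT = CHANGE_TRACKING_CURRENT_VERSION();

-- 2b. Rows changed between the two versions, joined back to the source table.
SELECT s.PersonID, s.Name, s.Age, CT.SYS_CHANGE_VERSION, CT.SYS_CHANGE_OPERATION
FROM data_source_table AS s
RIGHT OUTER JOIN CHANGETABLE(CHANGES data_source_table, @old_version) AS CT
    ON s.PersonID = CT.PersonID
WHERE CT.SYS_CHANGE_VERSION <= @new_version;

-- 2c. Save the new version for the next run.
EXEC Update_ChangeTracking_Version @CurrentTrackingVersion = @new_version, @TableName = 'data_source_table';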

High-level solution
In this tutorial, you create two pipelines that perform the following two operations:
1. Initial load: you create a pipeline with a copy activity that copies the entire data from the source data store
(Azure SQL Database) to the destination data store (Azure Blob Storage).

2. Incremental load: you create a pipeline with the following activities, and run it periodically.
a. Create two lookup activities to get the old and new SYS_CHANGE_VERSION from Azure SQL
Database and pass it to copy activity.
b. Create one copy activity to copy the inserted/updated/deleted data between the two
SYS_CHANGE_VERSION values from Azure SQL Database to Azure Blob Storage.
c. Create one stored procedure activity to update the value of SYS_CHANGE_VERSION for the next
pipeline run.

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure SQL Database. You use the database as the source data store. If you don't have an Azure SQL
Database, see the Create an Azure SQL database article for steps to create one.
Azure Storage account. You use the blob storage as the sink data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one. Create a container named adftutorial.
Create a data source table in your Azure SQL database
1. Launch SQL Server Management Studio, and connect to your Azure SQL server.
2. In Server Explorer, right-click your database and choose New Query.
3. Run the following SQL command against your Azure SQL database to create a table named
data_source_table as data source store.
create table data_source_table
(
PersonID int NOT NULL,
Name varchar(255),
Age int,
PRIMARY KEY (PersonID)
);

INSERT INTO data_source_table


(PersonID, Name, Age)
VALUES
(1, 'aaaa', 21),
(2, 'bbbb', 24),
(3, 'cccc', 20),
(4, 'dddd', 26),
(5, 'eeee', 22);

4. Enable Change Tracking mechanism on your database and the source table (data_source_table) by
running the following SQL query:

NOTE
Replace <your database name> with the name of your Azure SQL database that has the data_source_table.
The changed data is kept for two days in the current example. If you load the changed data only every three days
or more, some changed data is not included. You need to either change the value of CHANGE_RETENTION to a bigger
number or ensure that your period to load the changed data is within two days. For more information, see Enable
change tracking for a database.

ALTER DATABASE <your database name>


SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON)

ALTER TABLE data_source_table


ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON)
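
Optionally, you can confirm that change tracking is enabled by querying the change tracking catalog views. A quick
check, assuming the statements above succeeded:

-- Databases with change tracking enabled, with their retention settings.
SELECT DB_NAME(database_id) AS database_name, retention_period, is_auto_cleanup_on
FROM sys.change_tracking_databases;

-- Tables tracked in the current database.
SELECT OBJECT_NAME(object_id) AS table_name, is_track_columns_updated_on
FROM sys.change_tracking_tables;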

5. Create a new table and store the ChangeTracking_version with a default value by running the following
query:

create table table_store_ChangeTracking_version


(
TableName varchar(255),
SYS_CHANGE_VERSION BIGINT
);

DECLARE @ChangeTracking_version BIGINT


SET @ChangeTracking_version = CHANGE_TRACKING_CURRENT_VERSION();

INSERT INTO table_store_ChangeTracking_version


VALUES ('data_source_table', @ChangeTracking_version)

NOTE
If the data is not changed after you enabled the change tracking for SQL Database, the value of the change tracking
version is 0.
6. Run the following query to create a stored procedure in your Azure SQL database. The pipeline invokes this
stored procedure to update the change tracking version in the table you created in the previous step.

CREATE PROCEDURE Update_ChangeTracking_Version @CurrentTrackingVersion BIGINT, @TableName varchar(50)


AS

BEGIN

UPDATE table_store_ChangeTracking_version
SET [SYS_CHANGE_VERSION] = @CurrentTrackingVersion
WHERE [TableName] = @TableName

END

Azure PowerShell

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Data + Analytics > Data Factory:
3. In the New data factory page, enter ADFTutorialDataFactory for the name.

The name of the Azure data factory must be globally unique. If you receive the following error, change the
name of the data factory (for example, yournameADFTutorialDataFactory) and try creating again. See Data
Factory - Naming Rules article for naming rules for Data Factory artifacts.
`Data factory name “ADFTutorialDataFactory” is not available`

4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 (Preview) for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down
list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.

11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) in a separate tab.
13. In the get started page, switch to the Edit tab in the left panel as shown in the following image:

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In this
section, you create linked services to your Azure Storage account and Azure SQL database.
Create Azure Storage linked service.
In this step, you link your Azure Storage Account to the data factory.
1. Click Connections, and click + New.

2. In the New Linked Service window, select Azure Blob Storage, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureStorageLinkedService for Name.
b. Select your Azure Storage account for Storage account name.
c. Click Save.
Create Azure SQL Database linked service.
In this step, you link your Azure SQL database to the data factory.
1. Click Connections, and click + New.
2. In the New Linked Service window, select Azure SQL Database, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for the Name field.
b. Select your Azure SQL server for the Server name field.
c. Select your Azure SQL database for the Database name field.
d. Enter name of the user for the User name field.
e. Enter password for the user for the Password field.
f. Click Test connection to test the connection.
g. Click Save to save the linked service.
Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
SYS_CHANGE_VERSION.
Create a dataset to represent source data
In this step, you create a dataset to represent the source data.
1. In the treeview, click + (plus), and click Dataset.
2. Select Azure SQL Database, and click Finish.
3. You see a new tab for configuring the dataset. You also see the dataset in the treeview. In the Properties
window, change the name of the dataset to SourceDataset.
4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select [dbo].[data_source_table] for Table.

Create a dataset to represent data copied to sink data store.


In this step, you create a dataset to represent the data that is copied from the source data store. You created the
adftutorial container in your Azure Blob Storage as part of the prerequisites. Create the container if it does not
exist, or set it to the name of an existing one. In this tutorial, the output file name is dynamically generated by
using the expression: @CONCAT('Incremental-', pipeline().RunId, '.txt') .
1. In the treeview, click + (plus), and click Dataset.
2. Select Azure Blob Storage, and click Finish.
3. You see a new tab for configuring the dataset. You also see the dataset in the treeview. In the Properties
window, change the name of the dataset to SinkDataset.
4. Switch to the Connection tab in the Properties window, and do the following steps:
a. Select AzureStorageLinkedService for Linked service.
b. Enter adftutorial/incchgtracking for folder part of the filePath.
c. Enter @CONCAT('Incremental-', pipeline().RunId, '.txt') for file part of the filePath.

Create a dataset to represent change tracking data


In this step, you create a dataset for storing the change tracking version. You created the table
table_store_ChangeTracking_version as part of the prerequisites.
1. In the treeview, click + (plus), and click Dataset.
2. Select Azure SQL Database, and click Finish.
3. You see a new tab for configuring the dataset. You also see the dataset in the treeview. In the Properties
window, change the name of the dataset to ChangeTrackingDataset.
4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select [dbo].[table_store_ChangeTracking_version] for Table.

Create a pipeline for the full copy


In this step, you create a pipeline with a copy activity that copies the entire data from the source data store (Azure
SQL Database) to the destination data store (Azure Blob Storage).
1. Click + (plus) in the left pane, and click Pipeline.

2. You see a new tab for configuring the pipeline. You also see the pipeline in the treeview. In the Properties
window, change the name of the pipeline to FullCopyPipeline.
3. In the Activities toolbox, expand Data Flow, and drag-drop the Copy activity to the pipeline designer
surface, and set the name FullCopyActivity.
4. Switch to the Source tab, and select SourceDataset for the Source Dataset field.

5. Switch to the Sink tab, and select SinkDataset for the Sink Dataset field.

6. To validate the pipeline definition, click Validate on the toolbar. Confirm that there is no validation error.
Close the Pipeline Validation Report by clicking >>.
7. To publish entities (linked services, datasets, and pipelines), click Publish. Wait until the publishing
succeeds.

8. Wait until you see the Successfully published message.

9. You can also see notifications by clicking the Show Notifications button on the left. To close the
notifications window, click X.
Run the full copy pipeline
Click Trigger on the toolbar for the pipeline, and click Trigger Now.

Monitor the full copy pipeline


1. Click the Monitor tab on the left. You see the pipeline run in the list and its status. To refresh the list, click
Refresh. The links in the Actions column let you view activity runs associated with the pipeline run and to
rerun the pipeline.
2. To view activity runs associated with the pipeline run, click the View Activity Runs link in the Actions
column. There is only one activity in the pipeline, so you see only one entry in the list. To switch back to the
pipeline runs view, click Pipelines link at the top.

Review the results


You see a file named incremental-<GUID>.txt in the incchgtracking folder of the adftutorial container.

The file should have the data from the Azure SQL database:

1,aaaa,21
2,bbbb,24
3,cccc,20
4,dddd,26
5,eeee,22

Add more data to the source table


Run the following query against the Azure SQL database to add a row and update a row.
INSERT INTO data_source_table
(PersonID, Name, Age)
VALUES
(6, 'new','50');

UPDATE data_source_table
SET [Age] = '10', [name]='update' where [PersonID] = 1
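
Optionally, before you build the incremental pipeline, you can see what change tracking recorded for these two
statements by querying CHANGETABLE directly. A sketch, using the version value stored in
table_store_ChangeTracking_version as the baseline:

-- Shows the primary keys and operations (I = insert, U = update) recorded since the stored version.
DECLARE @last_version BIGINT =
    (SELECT SYS_CHANGE_VERSION FROM table_store_ChangeTracking_version WHERE TableName = 'data_source_table');

SELECT CT.PersonID, CT.SYS_CHANGE_VERSION, CT.SYS_CHANGE_OPERATION
FROM CHANGETABLE(CHANGES data_source_table, @last_version) AS CT;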

Create a pipeline for the delta copy


In this step, you create a pipeline with the following activities, and run it periodically. The lookup activities get the
old and new SYS_CHANGE_VERSION from Azure SQL Database and pass it to copy activity. The copy activity
copies the inserted/updated/deleted data between the two SYS_CHANGE_VERSION values from Azure SQL
Database to Azure Blob Storage. The stored procedure activity updates the value of SYS_CHANGE_VERSION
for the next pipeline run.
1. In the Data Factory UI, switch to the Edit tab. Click + (plus) in the left pane, and click Pipeline.

2. You see a new tab for configuring the pipeline. You also see the pipeline in the treeview. In the Properties
window, change the name of the pipeline to IncrementalCopyPipeline.
3. Expand General in the Activities toolbox, and drag-drop the Lookup activity to the pipeline designer
surface. Set the name of the activity to LookupLastChangeTrackingVersionActivity. This activity gets the
change tracking version used in the last copy operation that is stored in the table
table_store_ChangeTracking_version.
4. Switch to the Settings tab in the Properties window, and select ChangeTrackingDataset for the Source
Dataset field.

5. Drag-and-drop the Lookup activity from the Activities toolbox to the pipeline designer surface. Set the
name of the activity to LookupCurrentChangeTrackingVersionActivity. This activity gets the current
change tracking version.

6. Switch to the Settings tab in the Properties window, and do the following steps:
a. Select SourceDataset for the Source Dataset field.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.

SELECT CHANGE_TRACKING_CURRENT_VERSION() as CurrentChangeTrackingVersion


7. In the Activities toolbox, expand Data Flow, and drag-drop the Copy activity to the pipeline designer
surface. Set the name of the activity to IncrementalCopyActivity. This activity copies the data between
last change tracking version and the current change tracking version to the destination data store.

8. Switch to the Source tab in the Properties window, and do the following steps:
a. Select SourceDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.
select data_source_table.PersonID, data_source_table.Name, data_source_table.Age,
    CT.SYS_CHANGE_VERSION, SYS_CHANGE_OPERATION
from data_source_table
RIGHT OUTER JOIN CHANGETABLE(CHANGES data_source_table, @{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.SYS_CHANGE_VERSION}) as CT
on data_source_table.PersonID = CT.PersonID
where CT.SYS_CHANGE_VERSION <= @{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion}

9. Switch to the Sink tab, and select SinkDataset for the Sink Dataset field.

10. Connect both Lookup activities to the Copy activity one by one. Drag the green button attached to the
Lookup activity to the Copy activity.

11. Drag-and-drop the Stored Procedure activity from the Activities toolbox to the pipeline designer surface.
Set the name of the activity to StoredProceduretoUpdateChangeTrackingActivity. This activity updates
the change tracking version in the table_store_ChangeTracking_version table.
12. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked service.

13. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name, select Update_ChangeTracking_Version.
b. Select Import parameter.
c. In the Stored procedure parameters section, specify the following values for the parameters:

NAME                      TYPE      VALUE
CurrentTrackingVersion    Int64     @{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion}
TableName                 String    @{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.TableName}
14. Connect the Copy activity to the Stored Procedure Activity. Drag-and-drop the green button attached
to the Copy activity to the Stored Procedure activity.

15. Click Validate on the toolbar. Confirm that there are no validation errors. Close the Pipeline Validation
Report window by clicking >>.

16. Publish entities (linked services, datasets, and pipelines) to the Data Factory service by clicking the Publish
All button. Wait until you see the Publishing succeeded message.
Run the incremental copy pipeline
1. Click Trigger on the toolbar for the pipeline, and click Trigger Now.

2. In the Pipeline Run window, select Finish.


Monitor the incremental copy pipeline
1. Click the Monitor tab on the left. You see the pipeline run in the list and its status. To refresh the list, click
Refresh. The links in the Actions column let you view activity runs associated with the pipeline run and to
rerun the pipeline.
2. To view activity runs associated with the pipeline run, click the View Activity Runs link in the Actions
column. There is only one activity in the pipeline, so you see only one entry in the list. To switch back to the
pipeline runs view, click Pipelines link at the top.

Review the results


You see the second file in the incchgtracking folder of the adftutorial container.

The file should have only the delta data from the Azure SQL database. The record with U is the row that was
updated in the database, and the record with I is the row that was added.

1,update,10,2,U
6,new,50,1,I

The first three columns are the changed data from data_source_table. The last two columns are metadata from the
change tracking system table. The fourth column is the SYS_CHANGE_VERSION for each changed row, and the fifth
column is the operation: U = update, I = insert. For details about the change tracking information, see
CHANGETABLE.
==================================================================
PersonID Name Age SYS_CHANGE_VERSION SYS_CHANGE_OPERATION
==================================================================
1 update 10 2 U
6 new 50 1 I

Next steps
Advance to the following tutorial to learn about copying new and changed files only based on their
LastModifiedDate:
Copy new files by lastmodifieddate
Incrementally load data from Azure SQL Database to
Azure Blob Storage using change tracking
information
3/5/2019 • 14 minutes to read

In this tutorial, you create an Azure data factory with a pipeline that loads delta data based on change tracking
information in the source Azure SQL database to an Azure blob storage.
You perform the following steps in this tutorial:
Prepare the source data store
Create a data factory.
Create linked services.
Create source, sink, and change tracking datasets.
Create, run, and monitor the full copy pipeline
Add or update data in the source table
Create, run, and monitor the incremental copy pipeline

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Overview
In a data integration solution, incrementally loading data after initial data loads is a widely used scenario. In some
cases, the changed data within a period in your source data store can be easily sliced (for example, by
LastModifyTime or CreationTime). In other cases, there is no explicit way to identify the delta data since the last time you
processed the data. The Change Tracking technology supported by data stores such as Azure SQL Database and
SQL Server can be used to identify the delta data. This tutorial describes how to use Azure Data Factory with SQL
Change Tracking technology to incrementally load delta data from Azure SQL Database into Azure Blob Storage.
For more concrete information about SQL Change Tracking technology, see Change tracking in SQL Server.

End-to-end workflow
Here are the typical end-to-end workflow steps to incrementally load data using the Change Tracking technology.

NOTE
Both Azure SQL Database and SQL Server support the Change Tracking technology. This tutorial uses Azure SQL Database
as the source data store. You can also use an on-premises SQL Server.

1. Initial loading of historical data (run once):


a. Enable Change Tracking technology in the source Azure SQL database.
b. Get the initial value of SYS_CHANGE_VERSION in the Azure SQL database as the baseline to capture
changed data.
c. Load full data from the Azure SQL database into an Azure blob storage.
2. Incremental loading of delta data on a schedule (run periodically after the initial loading of data):
a. Get the old and new SYS_CHANGE_VERSION values.
b. Load the delta data by joining the primary keys of changed rows (between two
SYS_CHANGE_VERSION values) from sys.change_tracking_tables with data in the source table,
and then move the delta data to destination.
c. Update the SYS_CHANGE_VERSION for the delta loading next time.

High-level solution
In this tutorial, you create two pipelines that perform the following two operations:
1. Initial load: you create a pipeline with a copy activity that copies the entire data from the source data store
(Azure SQL Database) to the destination data store (Azure Blob Storage).

2. Incremental load: you create a pipeline with the following activities, and run it periodically.
a. Create two lookup activities to get the old and new SYS_CHANGE_VERSION from Azure SQL
Database and pass it to copy activity.
b. Create one copy activity to copy the inserted/updated/deleted data between the two
SYS_CHANGE_VERSION values from Azure SQL Database to Azure Blob Storage.
c. Create one stored procedure activity to update the value of SYS_CHANGE_VERSION for the next
pipeline run.

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure PowerShell. Install the latest Azure PowerShell modules by following instructions in How to install and
configure Azure PowerShell.
Azure SQL Database. You use the database as the source data store. If you don't have an Azure SQL
Database, see the Create an Azure SQL database article for steps to create one.
Azure Storage account. You use the blob storage as the sink data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one. Create a container named adftutorial.
Create a data source table in your Azure SQL database
1. Launch SQL Server Management Studio, and connect to your Azure SQL server.
2. In Server Explorer, right-click your database and choose New Query.
3. Run the following SQL command against your Azure SQL database to create a table named
data_source_table as data source store.

create table data_source_table
(
    PersonID int NOT NULL,
    Name varchar(255),
    Age int,
    PRIMARY KEY (PersonID)
);

INSERT INTO data_source_table
    (PersonID, Name, Age)
VALUES
    (1, 'aaaa', 21),
    (2, 'bbbb', 24),
    (3, 'cccc', 20),
    (4, 'dddd', 26),
    (5, 'eeee', 22);

4. Enable Change Tracking mechanism on your database and the source table (data_source_table) by
running the following SQL query:

NOTE
Replace <your database name> with the name of your Azure SQL database that has the data_source_table.
The changed data is kept for two days in the current example. If you load the changed data only every three days
or more, some changed data is not included. You need to either change the value of CHANGE_RETENTION to a
bigger number or ensure that the interval at which you load the changed data is within two days. For more
information, see Enable change tracking for a database

ALTER DATABASE <your database name>
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON)

ALTER TABLE data_source_table
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON)

5. Create a new table and store the ChangeTracking_version with a default value by running the following
query:

create table table_store_ChangeTracking_version
(
    TableName varchar(255),
    SYS_CHANGE_VERSION BIGINT
);

DECLARE @ChangeTracking_version BIGINT
SET @ChangeTracking_version = CHANGE_TRACKING_CURRENT_VERSION();

INSERT INTO table_store_ChangeTracking_version
VALUES ('data_source_table', @ChangeTracking_version)
NOTE
If the data is not changed after you enabled the change tracking for SQL Database, the value of the change tracking
version is 0.

6. Run the following query to create a stored procedure in your Azure SQL database. The pipeline invokes this
stored procedure to update the change tracking version in the table you created in the previous step.

CREATE PROCEDURE Update_ChangeTracking_Version @CurrentTrackingVersion BIGINT, @TableName varchar(50)
AS
BEGIN
    UPDATE table_store_ChangeTracking_version
    SET [SYS_CHANGE_VERSION] = @CurrentTrackingVersion
    WHERE [TableName] = @TableName
END
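
If you prefer to run or verify this setup from PowerShell instead of SQL Server Management Studio, the following is a minimal sketch. It assumes the SqlServer module (which provides Invoke-Sqlcmd) is installed, and the server, database, and credential values are placeholders that you replace with your own.

# Minimal verification sketch; assumes the SqlServer module (Invoke-Sqlcmd) is installed.
# The server, database, user, and password values are placeholders.
$sqlParams = @{
    ServerInstance = "<your server name>.database.windows.net"
    Database       = "<your database name>"
    Username       = "<user name>"
    Password       = "<password>"
}

# Confirm that change tracking is enabled for the source table.
Invoke-Sqlcmd @sqlParams -Query "SELECT t.name AS TableName FROM sys.change_tracking_tables ct JOIN sys.tables t ON ct.object_id = t.object_id;"

# Confirm that the baseline change tracking version was stored.
Invoke-Sqlcmd @sqlParams -Query "SELECT TableName, SYS_CHANGE_VERSION FROM table_store_ChangeTracking_version;"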

Azure PowerShell
Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.

Create a data factory


1. Define a variable for the resource group name that you use in PowerShell commands later. Copy the
following command text to PowerShell, specify a name for the Azure resource group in double quotes, and
then run the command. For example: "adfrg" .

$resourceGroupName = "ADFTutorialResourceGroup";

If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$resourceGroupName variable and run the command again

2. Define a variable for the location of the data factory:

$location = "East US"

3. To create the Azure resource group, run the following command:

New-AzResourceGroup $resourceGroupName $location

If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$resourceGroupName variable and run the command again.

4. Define a variable for the data factory name.

IMPORTANT
Update the data factory name to be globally unique.
$dataFactoryName = "IncCopyChgTrackingDF";

5. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet:

Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location $location -Name $dataFactoryName

Note the following points:


The name of the Azure data factory must be globally unique. If you receive the following error, change the
name and try again.

The specified Data Factory name 'ADFIncCopyChangeTrackingTestFactory' is already in use. Data Factory
names must be globally unique.

To create Data Factory instances, the user account you use to log in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on
the following page, and then expand Analytics to locate Data Factory: Products available by region. The
data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data factory
can be in other regions.
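
If the factory name you chose in step 4 is already taken, one option is to append a random suffix before you run Set-AzDataFactoryV2 again. A small sketch (the base name is just an example):

# Sketch: append a random suffix to reduce the chance of a global naming collision.
$dataFactoryName = "IncCopyChgTrackingDF" + (Get-Random -Maximum 99999)
Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location $location -Name $dataFactoryName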

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In this
section, you create linked services to your Azure Storage account and Azure SQL database.
Create Azure Storage linked service.
In this step, you link your Azure Storage Account to the data factory.
1. Create a JSON file named AzureStorageLinkedService.json in the
C:\ADFTutorials\IncCopyChangeTrackingTutorial folder with the following content. (Create the folder
if it does not already exist.) Replace <accountName> and <accountKey> with the name and key of your Azure storage
account before saving the file.

{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>",
"type": "SecureString"
}
}
}
}

2. In Azure PowerShell, switch to the C:\ADFTutorials\IncCopyChangeTrackingTutorial folder.


3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service:
AzureStorageLinkedService. In the following example, you pass values for the ResourceGroupName
and DataFactoryName parameters.
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureStorageLinkedService" -File ".\AzureStorageLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService

Create Azure SQL Database linked service.


In this step, you link your Azure SQL database to the data factory.
1. Create a JSON file named AzureSQLDatabaseLinkedService.json in the
C:\ADFTutorials\IncCopyChangeTrackingTutorial folder with the following content. Replace <server>,
<database name>, <user name>, and <password> with the name of your Azure SQL server, the name of your
database, your user name, and your password before saving the file.

{
"name": "AzureSQLDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"value": "Server = tcp:<server>.database.windows.net,1433;Initial Catalog=<database name>; Persist
Security Info=False; User ID=<user name>; Password=<password>; MultipleActiveResultSets = False;
Encrypt = True; TrustServerCertificate = False; Connection Timeout = 30;",
"type": "SecureString"
}
}
}
}

2. In Azure PowerShell, run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service:
AzureSQLDatabaseLinkedService.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSQLDatabaseLinkedService" -File ".\AzureSQLDatabaseLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureSQLDatabaseLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService
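
Optionally, you can confirm that both linked services were created by listing them; a minimal sketch that uses the same variables as above:

# Optional check: list the linked services that now exist in the data factory.
Get-AzDataFactoryV2LinkedService -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName |
    Select-Object LinkedServiceName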

Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
SYS_CHANGE_VERSION.
Create a source dataset
In this step, you create a dataset to represent the source data.
1. Create a JSON file named SourceDataset.json in the same folder with the following content:
{
"name": "SourceDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "data_source_table"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}

2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset: SourceDataset

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SourceDataset" -File ".\SourceDataset.json"

Here is the sample output of the cmdlet:

DatasetName : SourceDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a sink dataset


In this step, you create a dataset to represent the data that is copied from the source data store.
1. Create a JSON file named SinkDataset.json in the same folder with the following content:

{
"name": "SinkDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "adftutorial/incchgtracking",
"fileName": "@CONCAT('Incremental-', pipeline().RunId, '.txt')",
"format": {
"type": "TextFormat"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}

You create the adftutorial container in your Azure Blob Storage as part of the prerequisites. Create the
container if it does not exist, or set it to the name of an existing one. In this tutorial, the output file name is
dynamically generated by using the expression: @CONCAT('Incremental-', pipeline().RunId, '.txt').
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset: SinkDataset

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SinkDataset" -File ".\SinkDataset.json"
Here is the sample output of the cmdlet:

DatasetName : SinkDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset

Create a change tracking dataset


In this step, you create a dataset for storing the change tracking version.
1. Create a JSON file named ChangeTrackingDataset.json in the same folder with the following content:

{
"name": " ChangeTrackingDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "table_store_ChangeTracking_version"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}

You create the table table_store_ChangeTracking_version as part of the prerequisites.


2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset: ChangeTrackingDataset

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "ChangeTrackingDataset" -File ".\ChangeTrackingDataset.json"

Here is the sample output of the cmdlet:

DatasetName : ChangeTrackingDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a pipeline for the full copy


In this step, you create a pipeline with a copy activity that copies the entire data from the source data store (Azure
SQL Database) to the destination data store (Azure Blob Storage).
1. Create a JSON file: FullCopyPipeline.json in same folder with the following content:
{
"name": "FullCopyPipeline",
"properties": {
"activities": [{
"name": "FullCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource"
},
"sink": {
"type": "BlobSink"
}
},

"inputs": [{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}]
}]
}
}

2. Run the Set-AzDataFactoryV2Pipeline cmdlet to create the pipeline: FullCopyPipeline.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "FullCopyPipeline" -File ".\FullCopyPipeline.json"

Here is the sample output:

PipelineName : FullCopyPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Activities : {FullCopyActivity}
Parameters :

Run the full copy pipeline


Run the pipeline: FullCopyPipeline by using Invoke-AzDataFactoryV2Pipeline cmdlet.

Invoke-AzDataFactoryV2Pipeline -PipelineName "FullCopyPipeline" -ResourceGroup $resourceGroupName -


dataFactoryName $dataFactoryName
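
You can also monitor the run from PowerShell rather than the portal. Invoke-AzDataFactoryV2Pipeline returns the pipeline run ID, so a sketch like the following (using the same variables as above) captures the ID when you trigger the pipeline and then polls the run and activity status:

# Sketch: capture the run ID when triggering the pipeline, then check its status from PowerShell.
$runId = Invoke-AzDataFactoryV2Pipeline -PipelineName "FullCopyPipeline" -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName

# Overall pipeline run status.
Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineRunId $runId

# Activity runs (here, the single copy activity) within a recent time window.
Get-AzDataFactoryV2ActivityRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddHours(-1) -RunStartedBefore (Get-Date).AddHours(1)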

Monitor the full copy pipeline


1. Log in to Azure portal.
2. Click All services, search with the keyword data factories , and select Data factories.
3. Search for your data factory in the list of data factories, and select it to launch the Data factory page.

4. In the Data factory page, click Monitor & Manage tile.

5. The Data Integration Application launches in a separate tab. You can see all the pipeline runs and their
statuses. Notice that in the following example, the status of the pipeline run is Succeeded. You can check
the parameters passed to the pipeline by clicking the link in the Parameters column. If there was an error, you see
a link in the Error column. Click the link in the Actions column.
6. When you click the link in the Actions column, you see the following page that shows all the activity runs
for the pipeline.

7. To switch back to the Pipeline runs view, click Pipelines as shown in the image.
Review the results
You see a file named Incremental-<GUID>.txt in the incchgtracking folder of the adftutorial container.

The file should have the data from the Azure SQL database:

1,aaaa,21
2,bbbb,24
3,cccc,20
4,dddd,26
5,eeee,22

Add more data to the source table


Run the following query against the Azure SQL database to add a row and update a row.

INSERT INTO data_source_table


(PersonID, Name, Age)
VALUES
(6, 'new','50');

UPDATE data_source_table
SET [Age] = '10', [name]='update' where [PersonID] = 1

Create a pipeline for the delta copy


In this step, you create a pipeline with the following activities, and run it periodically. The lookup activities get the
old and new SYS_CHANGE_VERSION from Azure SQL Database and pass it to copy activity. The copy activity
copies the inserted/updated/deleted data between the two SYS_CHANGE_VERSION values from Azure SQL
Database to Azure Blob Storage. The stored procedure activity updates the value of SYS_CHANGE_VERSION
for the next pipeline run.
1. Create a JSON file: IncrementalCopyPipeline.json in same folder with the following content:

{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [
{
"name": "LookupLastChangeTrackingVersionActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from table_store_ChangeTracking_version"
},

"dataset": {
"referenceName": "ChangeTrackingDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupCurrentChangeTrackingVersionActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT CHANGE_TRACKING_CURRENT_VERSION() as
CurrentChangeTrackingVersion"
},

"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},

{
"name": "IncrementalCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select
data_source_table.PersonID,data_source_table.Name,data_source_table.Age, CT.SYS_CHANGE_VERSION,
SYS_CHANGE_OPERATION from data_source_table RIGHT OUTER JOIN CHANGETABLE(CHANGES data_source_table,
@{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.SYS_CHANGE_VERSION}) as CT on
data_source_table.PersonID = CT.PersonID where CT.SYS_CHANGE_VERSION <=
@{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion}"
},
"sink": {
"type": "BlobSink"
}
},
"dependsOn": [
{
"activity": "LookupLastChangeTrackingVersionActivity",
"dependencyConditions": [
"Succeeded"
]
                    },
{
"activity": "LookupCurrentChangeTrackingVersionActivity",
"dependencyConditions": [
"Succeeded"
]
}
],

"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}
]
},

{
"name": "StoredProceduretoUpdateChangeTrackingActivity",
"type": "SqlServerStoredProcedure",
"typeProperties": {

"storedProcedureName": "Update_ChangeTracking_Version",
"storedProcedureParameters": {
"CurrentTrackingVersion": {"value":
"@{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion}
", "type": "INT64" },
"TableName": {
"value":"@{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.TableName}",
"type":"String"}
}
},

"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},

"dependsOn": [
{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
]

}
}

2. Run the Set-AzDataFactoryV2Pipeline cmdlet to create the pipeline: IncrementalCopyPipeline.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "IncrementalCopyPipeline" -File ".\IncrementalCopyPipeline.json"

Here is the sample output:


PipelineName : IncrementalCopyPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Activities : {LookupLastChangeTrackingVersionActivity,
LookupCurrentChangeTrackingVersionActivity, IncrementalCopyActivity,
StoredProceduretoUpdateChangeTrackingActivity}
Parameters :

Run the incremental copy pipeline


Run the pipeline: IncrementalCopyPipeline by using Invoke-AzDataFactoryV2Pipeline cmdlet.

Invoke-AzDataFactoryV2Pipeline -PipelineName "IncrementalCopyPipeline" -ResourceGroup $resourceGroupName -


dataFactoryName $dataFactoryName

Monitor the incremental copy pipeline


1. In the Data Integration Application, refresh the pipeline runs view. Confirm that you see the
IncrementalCopyPipeline in the list. Click the link in the Actions column.

2. When you click the link in the Actions column, you see the following page that shows all the activity runs
for the pipeline.

3. To switch back to the Pipeline runs view, click Pipelines as shown in the image.
Review the results
You see the second file in the incchgtracking folder of the adftutorial container.
The file should have only the delta data from the Azure SQL database. The record with U is the row that was
updated in the database, and the record with I is the row that was added.

1,update,10,2,U
6,new,50,1,I

The first three columns are the changed data from data_source_table. The last two columns are metadata from the
change tracking system table. The fourth column is the SYS_CHANGE_VERSION for each changed row, and the fifth
column is the operation: U = update, I = insert. For details about the change tracking information, see
CHANGETABLE.

==================================================================
PersonID Name Age SYS_CHANGE_VERSION SYS_CHANGE_OPERATION
==================================================================
1 update 10 2 U
6 new 50 1 I
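
To check the output from PowerShell instead of the portal or Azure Storage Explorer, you can list the blobs under the incchgtracking prefix. A minimal sketch that assumes the Az.Storage module, with placeholder account values:

# Sketch: list the output files produced by the full and incremental copy runs.
# The storage account name and key are placeholders; replace them with your own values.
$ctx = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"
Get-AzStorageBlob -Container "adftutorial" -Prefix "incchgtracking/" -Context $ctx |
    Select-Object Name, LastModified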

Next steps
Advance to the following tutorial to learn about copying new and changed files only based on their
LastModifiedDate:
Copy new files by lastmodifieddate
Incrementally copy new and changed files based on
LastModifiedDate by using the Copy Data tool
5/10/2019 • 5 minutes to read

In this tutorial, you'll use the Azure portal to create a data factory. Then, you'll use the Copy Data tool to create a
pipeline that incrementally copies new and changed files only, based on their LastModifiedDate from Azure Blob
storage to Azure Blob storage.
With this approach, ADF scans all the files in the source store, filters them by their LastModifiedDate, and copies
only the files that are new or updated since the last run to the destination store. Note that if ADF scans a huge
number of files but copies only a few of them to the destination, the run can still take a long time, because the
file scanning itself is time consuming.

NOTE
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.

In this tutorial, you will perform the following tasks:


Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.

Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure storage account: Use Blob storage as the source and sink data store. If you don't have an Azure
storage account, see the instructions in Create a storage account.
Create two containers in Blob storage
Prepare your Blob storage for the tutorial by performing these steps.
1. Create a container named source. You can use various tools to perform this task, such as Azure Storage
Explorer.
2. Create a container named destination.
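
If you'd rather create the two containers from PowerShell than from a tool like Azure Storage Explorer, here is a minimal sketch. It assumes the Az.Storage module, and the storage account name and key are placeholders:

# Sketch: create the source and destination containers with the Az.Storage module.
# Replace the placeholder storage account name and key with your own values.
$ctx = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"
New-AzStorageContainer -Name "source" -Context $ctx
New-AzStorageContainer -Name "destination" -Context $ctx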

Create a data factory


1. On the left menu, select Create a resource > Data + Analytics > Data Factory:
2. On the New data factory page, under Name, enter ADFTutorialDataFactory.

The name for your data factory must be globally unique. You might receive the following error message:
If you receive an error message about the name value, enter a different name for the data factory. For
example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory
artifacts, see Data Factory naming rules.
3. Select the Azure subscription in which you'll create the new data factory.
4. For Resource Group, take one of the following steps:
Select Use existing and select an existing resource group from the drop-down list.
Select Create new and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under version, select V2.
6. Under location, select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores (for example, Azure Storage and SQL Database) and computes (for example,
Azure HDInsight) that your data factory uses can be in other locations and regions.
7. Select Pin to dashboard.
8. Select Create.
9. On the dashboard, refer to the Deploying Data Factory tile to see the process status.

10. After creation is finished, the Data Factory home page is displayed.
11. To open the Azure Data Factory user interface (UI) on a separate tab, select the Author & Monitor tile.

Use the Copy Data tool to create a pipeline


1. On the Let's get started page, select the Copy Data tile to open the Copy Data tool.

2. On the Properties page, take the following steps:


a. Under Task name, enter DeltaCopyFromBlobPipeline.
b. Under Task cadence or Task schedule, select Run regularly on schedule.
c. Under Trigger Type, select Tumbling Window.
d. Under Recurrence, enter 15 Minute(s).
e. Select Next.
The Data Factory UI creates a pipeline with the specified task name.

3. On the Source data store page, complete the following steps:


a. Select + Create new connection, to add a connection.
b. Select Azure Blob Storage from the gallery, and then select Continue.

c. On the New Linked Service page, select your storage account from the Storage account name list
and then select Finish.
d. Select the newly created linked service and then select Next.

4. On the Choose the input file or folder page, complete the following steps:
a. Browse and select the source folder, and then select Choose.
b. Under File loading behavior, select Incremental load: LastModifiedDate.

c. Check Binary copy and select Next.


5. On the Destination data store page, select AzureBlobStorage. This is the same storage account as the
source data store. Then select Next.

6. On the Choose the output file or folder page, complete the following steps:
a. Browse and select the destination folder, and then select Choose.

b. Select Next.

7. On the Settings page, select Next.


8. On the Summary page, review the settings and then select Next.

9. On the Deployment page, select Monitor to monitor the pipeline (task).


10. Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to view
activity run details and to rerun the pipeline. Select Refresh to refresh the list, and select the View Activity
Runs link in the Actions column.

11. There's only one activity (the copy activity) in the pipeline, so you see only one entry. For details about the
copy operation, select the Details link (eyeglasses icon) in the Actions column.

Because there is no file in the source container in your Blob storage account, you will not see any file
copied to the destination container in your Blob storage account.
12. Create an empty text file and name it file1.txt. Upload this text file to the source container in your storage
account. You can use various tools to perform these tasks, such as Azure Storage Explorer.

13. To go back to the Pipeline Runs view, select All Pipeline Runs, and wait for the same pipeline to be
triggered again automatically.

14. Select View Activity Run for the second pipeline run when you see it. Then review the details in the same
way you did for the first pipeline run.
You will see that one file (file1.txt) has been copied from the source container to the destination container
of your Blob storage account.

15. Create another empty text file and name it file2.txt. Upload this text file to the source container in your
Blob storage account.
16. Repeat steps 13 and 14 for this second text file. You will see that only the new file (file2.txt) has been copied
from the source container to the destination container of your storage account in the next pipeline run.

You can also verify this by using Azure Storage Explorer to scan the files.
Next steps
Advance to the following tutorial to learn about transforming data by using an Apache Spark cluster on Azure:
Transform data in the cloud by using an Apache Spark cluster
Incrementally copy new files based on time
partitioned file name by using the Copy Data tool
3/26/2019 • 5 minutes to read

In this tutorial, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a
pipeline that incrementally copies new files based on time partitioned file name from Azure Blob storage to Azure
Blob storage.

NOTE
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.

Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure storage account: Use Blob storage as the source and sink data store. If you don't have an Azure storage
account, see the instructions in Create a storage account.
Create two containers in Blob storage
Prepare your Blob storage for the tutorial by performing these steps.
1. Create a container named source. Create a folder path as 2019/02/26/14 in your container. Create an
empty text file, and name it as file1.txt. Upload the file1.txt to the folder path source/2019/02/26/14 in
your storage account. You can use various tools to perform these tasks, such as Azure Storage Explorer.
NOTE
Please adjust the folder name with your UTC time. For example, if the current UTC time is 2:03 PM on Feb 26th,
2019, you can create the folder path as source/2019/02/26/14/ by the rule of
source/{Year}/{Month}/{Day}/{Hour}/.

2. Create a container named destination. You can use various tools to perform these tasks, such as Azure
Storage Explorer.
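
If you prefer to script this preparation, the following sketch builds the folder path from the current UTC hour and uploads file1.txt into the source container. It assumes the Az.Storage module and that the source container from step 1 already exists; the storage account name and key are placeholders:

# Sketch: build the time-partitioned path from the current UTC time and upload an empty file1.txt.
# Replace the placeholder storage account name and key with your own values.
$ctx = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"

# For example, 2:03 PM UTC on Feb 26, 2019 produces "2019/02/26/14".
$utcFolder = (Get-Date).ToUniversalTime().ToString("yyyy'/'MM'/'dd'/'HH")

New-Item -ItemType File -Name "file1.txt" -Force | Out-Null
Set-AzStorageBlobContent -File ".\file1.txt" -Container "source" -Blob "$utcFolder/file1.txt" -Context $ctx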

Create a data factory


1. On the left menu, select Create a resource > Data + Analytics > Data Factory:

2. On the New data factory page, under Name, enter ADFTutorialDataFactory.


The name for your data factory must be globally unique. You might receive the following error message:

If you receive an error message about the name value, enter a different name for the data factory. For
example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory
artifacts, see Data Factory naming rules.
3. Select the Azure subscription in which to create the new data factory.
4. For Resource Group, take one of the following steps:
a. Select Use existing, and select an existing resource group from the drop-down list.
b. Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under version, select V2 for the version.
6. Under location, select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores (for example, Azure Storage and SQL Database) and computes (for example,
Azure HDInsight) that are used by your data factory can be in other locations and regions.
7. Select Pin to dashboard.
8. Select Create.
9. On the dashboard, the Deploying Data Factory tile shows the process status.
10. After creation is finished, the Data Factory home page is displayed.

11. To launch the Azure Data Factory user interface (UI) in a separate tab, select the Author & Monitor tile.

Use the Copy Data tool to create a pipeline


1. On the Let's get started page, select the Copy Data tile to launch the Copy Data tool.
2. On the Properties page, take the following steps:
a. Under Task name, enter DeltaCopyFromBlobPipeline.
b. Under Task cadence or Task schedule, select Run regularly on schedule.
c. Under Trigger type, select Tumbling Window.
d. Under Recurrence, enter 1 Hour(s).
e. Select Next.
The Data Factory UI creates a pipeline with the specified task name.
3. On the Source data store page, complete the following steps:
a. Click + Create new connection, to add a connection.
b. Select Azure Blob Storage from the gallery, and then click Continue.

c. On the New Linked Service page, select your storage account from the Storage account name list,
and then click Finish.

d. Select the newly created linked service, then click Next.


4. On the Choose the input file or folder page, do the following steps:
a. Browse and select the source container, then select Choose.

b. Under File loading behavior, select Incremental load: time-partitioned folder/file names.
c. Write the dynamic folder path as source/{year}/{month}/{day}/{hour}/, and change the format as
follows:
d. Check Binary copy and click Next.

5. On the Destination data store page, select the AzureBlobStorage, which is the same storage account as
data source store, and then click Next.
6. On the Choose the output file or folder page, do the following steps:
a. Browse and select the destination folder, then click Choose.

b. Write the dynamic folder path as destination/{year}/{month}/{day}/{hour}/, and change the format as
follows:
c. Click Next.

7. On the Settings page, select Next.


8. On the Summary page, review the settings, and then select Next.

9. On the Deployment page, select Monitor to monitor the pipeline (task).


10. Notice that the Monitor tab on the left is automatically selected. You need to wait for the pipeline run to be
triggered automatically (after about one hour). When it runs, the Actions column includes links to view
activity run details and to rerun the pipeline. Select Refresh to refresh the list, and select the View Activity
Runs link in the Actions column.

11. There's only one activity (copy activity) in the pipeline, so you see only one entry. You can see the source file
(file1.txt) has been copied from source/2019/02/26/14/ to destination/2019/02/26/14/ with the same
file name.

You can also verify the same by using Azure Storage Explorer (https://storageexplorer.com/) to scan the files.
12. Create another empty text file with the new name as file2.txt. Upload the file2.txt file to the folder path
source/2019/02/26/15 in your storage account. You can use various tools to perform these tasks, such as
Azure Storage Explorer.

NOTE
You might be aware that a new folder path is required to be created. Please adjust the folder name with your UTC
time. For example, if the current UTC time is 3:20 PM on Feb 26th, 2019, you can create the folder path as
source/2019/02/26/15/ by the rule of {Year}/{Month}/{Day}/{Hour}/.

13. To go back to the Pipeline Runs view, select All Pipeline Runs, and wait for the same pipeline to be
triggered again automatically after another hour.
14. Select View Activity Run for the second pipeline run when it appears, and review the details in the same way.

You can see the source file (file2.txt) has been copied from source/2019/02/26/15/ to
destination/2019/02/26/15/ with the same file name.
You can also verify the same by using Azure Storage Explorer (https://storageexplorer.com/) to scan the files
in the destination container.

Next steps
Advance to the following tutorial to learn about transforming data by using a Spark cluster on Azure:
Transform data using Spark cluster in cloud
Transform data in the cloud by using a Spark activity
in Azure Data Factory
3/7/2019 • 7 minutes to read

In this tutorial, you use the Azure portal to create an Azure Data Factory pipeline. This pipeline transforms data by
using a Spark activity and an on-demand Azure HDInsight linked service.
You perform the following steps in this tutorial:
Create a data factory.
Create a pipeline that uses a Spark activity.
Trigger a pipeline run.
Monitor the pipeline run.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure storage account. You create a Python script and an input file, and you upload them to Azure Storage.
The output from the Spark program is stored in this storage account. The on-demand Spark cluster uses the
same storage account as its primary storage.

NOTE
HDInsight supports only general-purpose storage accounts with the standard tier. Make sure that the account is not a premium
or blob-only storage account.

Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Upload the Python script to your Blob storage account
1. Create a Python file named WordCount_Spark.py with the following content:
import sys
from operator import add

from pyspark.sql import SparkSession

def main():
    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/inputfiles/minecraftstory.txt").rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add)

    counts.saveAsTextFile("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/outputfiles/wordcount")

    spark.stop()

if __name__ == "__main__":
    main()

2. Replace <storageAccountName> with the name of your Azure storage account. Then, save the file.
3. In Azure Blob storage, create a container named adftutorial if it does not exist.
4. Create a folder named spark.
5. Create a subfolder named script under the spark folder.
6. Upload the WordCount_Spark.py file to the script subfolder.
Upload the input file
1. Create a file named minecraftstory.txt with some text. The Spark program counts the number of words in
this text.
2. Create a subfolder named inputfiles in the spark folder.
3. Upload the minecraftstory.txt file to the inputfiles subfolder.

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. Select New on the left menu, select Data + Analytics, and then select Data Factory.
3. In the New data factory pane, enter ADFTutorialDataFactory under Name.
The name of the Azure data factory must be globally unique. If you see the following error, change the
name of the data factory. (For example, use <yourname>ADFTutorialDataFactory). For naming rules for
Data Factory artifacts, see the Data Factory - naming rules article.

4. For Subscription, select your Azure subscription in which you want to create the data factory.
5. For Resource Group, take one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
Some of the steps in this quickstart assume that you use the name ADFTutorialResourceGroup for the
resource group. To learn about resource groups, see Using resource groups to manage your Azure
resources.
6. For Version, select V2.
7. For Location, select the location for the data factory.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (like Azure Storage and Azure SQL Database) and computes (like HDInsight) that Data
Factory uses can be in other regions.
8. Select Create.
9. After the creation is complete, you see the Data factory page. Select the Author & Monitor tile to start
the Data Factory UI application on a separate tab.

Create linked services


You author two linked services in this section:
An Azure Storage linked service that links an Azure storage account to the data factory. This storage is used
by the on-demand HDInsight cluster. It also contains the Spark script to be run.
An on-demand HDInsight linked service. Azure Data Factory automatically creates an HDInsight cluster
and runs the Spark program. It then deletes the HDInsight cluster after the cluster is idle for a preconfigured
time.
Create an Azure Storage linked service
1. On the Let's get started page, switch to the Edit tab in the left panel.
2. Select Connections at the bottom of the window, and then select + New.

3. In the New Linked Service window, select Data Store > Azure Blob Storage, and then select Continue.
4. For Storage account name, select the name from the list, and then select Save.
Create an on-demand HDInsight linked service
1. Select the + New button again to create another linked service.
2. In the New Linked Service window, select Compute > Azure HDInsight, and then select Continue.
3. In the New Linked Service window, complete the following steps:
a. For Name, enter AzureHDInsightLinkedService.
b. For Type, confirm that On-demand HDInsight is selected.
c. For Azure Storage Linked Service, select AzureStorage1. You created this linked service earlier. If you
used a different name, specify the right name here.
d. For Cluster type, select spark.
e. For Service principal id, enter the ID of the service principal that has permission to create an HDInsight
cluster.
This service principal needs to be a member of the Contributor role of the subscription or the resource
group in which the cluster is created. For more information, see Create an Azure Active Directory
application and service principal.
f. For Service principal key, enter the key.
g. For Resource group, select the same resource group that you used when you created the data factory.
The Spark cluster is created in this resource group.
h. Expand OS type.
i. Enter a name for Cluster user name.
j. Enter the Cluster password for the user.
k. Select Finish.
NOTE
Azure HDInsight limits the total number of cores that you can use in each Azure region that it supports. For the on-demand
HDInsight linked service, the HDInsight cluster is created in the same Azure Storage location that's used as its primary
storage. Ensure that you have enough core quotas for the cluster to be created successfully. For more information, see Set
up clusters in HDInsight with Hadoop, Spark, Kafka, and more.

Create a pipeline
1. Select the + (plus) button, and then select Pipeline on the menu.

2. In the Activities toolbox, expand HDInsight. Drag the Spark activity from the Activities toolbox to the
pipeline designer surface.
3. In the properties for the Spark activity window at the bottom, complete the following steps:
a. Switch to the HDI Cluster tab.
b. Select AzureHDInsightLinkedService (which you created in the previous procedure).
4. Switch to the Script/Jar tab, and complete the following steps:
a. For Job Linked Service, select AzureStorage1.
b. Select Browse Storage.

c. Browse to the adftutorial/spark/script folder, select WordCount_Spark.py, and then select Finish.
5. To validate the pipeline, select the Validate button on the toolbar. Select the >> (right arrow ) button to
close the validation window.
6. Select Publish All. The Data Factory UI publishes entities (linked services and pipeline) to the Azure Data
Factory service.

Trigger a pipeline run


Select Trigger on the toolbar, and then select Trigger Now.
Monitor the pipeline run
1. Switch to the Monitor tab. Confirm that you see a pipeline run. It takes approximately 20 minutes to create
a Spark cluster.
2. Select Refresh periodically to check the status of the pipeline run.

3. To see activity runs associated with the pipeline run, select View Activity Runs in the Actions column.

You can switch back to the pipeline runs view by selecting the Pipelines link at the top.

Verify the output


Verify that the output file is created in the spark/outputfiles/wordcount folder of the adftutorial container.
The file should have each word from the input text file and the number of times the word appeared in the file. For
example:

(u'This', 1)
(u'a', 1)
(u'is', 1)
(u'test', 1)
(u'file', 1)

Next steps
The pipeline in this sample transforms data by using a Spark activity and an on-demand HDInsight linked service.
You learned how to:
Create a data factory.
Create a pipeline that uses a Spark activity.
Trigger a pipeline run.
Monitor the pipeline run.
To learn how to transform data by running a Hive script on an Azure HDInsight cluster that's in a virtual network,
advance to the next tutorial:
Tutorial: Transform data using Hive in Azure Virtual Network.
Transform data in the cloud by using Spark activity in
Azure Data Factory
3/7/2019 • 7 minutes to read

In this tutorial, you use Azure PowerShell to create a Data Factory pipeline that transforms data using Spark
Activity and an on-demand HDInsight linked service. You perform the following steps in this tutorial:
Create a data factory.
Author and deploy linked services.
Author and deploy a pipeline.
Start a pipeline run.
Monitor the pipeline run.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure Storage account. You create a python script and an input file, and upload them to the Azure storage.
The output from the spark program is stored in this storage account. The on-demand Spark cluster uses the
same storage account as its primary storage.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Upload python script to your Blob Storage account
1. Create a python file named WordCount_Spark.py with the following content:
import sys
from operator import add

from pyspark.sql import SparkSession

def main():
    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/inputfiles/minecraftstory.txt").rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add)

    counts.saveAsTextFile("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/outputfiles/wordcount")

    spark.stop()

if __name__ == "__main__":
    main()

2. Replace <storageAccountName> with the name of your Azure Storage account. Then, save the file.
3. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
4. Create a folder named spark.
5. Create a subfolder named script under spark folder.
6. Upload the WordCount_Spark.py file to the script subfolder.
Upload the input file
1. Create a file named minecraftstory.txt with some text. The spark program counts the number of words in this
text.
2. Create a subfolder named inputfiles in the spark folder.
3. Upload the minecraftstory.txt to the inputfiles subfolder.
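
If you'd rather perform these uploads from PowerShell than from a storage tool, a minimal sketch follows. It assumes the Az.Storage module, and the storage account name and key are placeholders; the blob names create the spark/script and spark/inputfiles folder structure used later by the pipeline.

# Sketch: create the adftutorial container (if needed) and upload the Spark script and input file.
# Replace the placeholder storage account name and key with your own values.
$ctx = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"
New-AzStorageContainer -Name "adftutorial" -Context $ctx -ErrorAction SilentlyContinue

Set-AzStorageBlobContent -File ".\WordCount_Spark.py" -Container "adftutorial" -Blob "spark/script/WordCount_Spark.py" -Context $ctx
Set-AzStorageBlobContent -File ".\minecraftstory.txt" -Container "adftutorial" -Blob "spark/inputfiles/minecraftstory.txt" -Context $ctx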

Author linked services


You author two Linked Services in this section:
An Azure Storage Linked Service that links an Azure Storage account to the data factory. This storage is used
by the on-demand HDInsight cluster. It also contains the Spark script to be executed.
An On-Demand HDInsight Linked Service. Azure Data Factory automatically creates an HDInsight cluster, runs
the Spark program, and then deletes the HDInsight cluster after it's idle for a preconfigured time.
Azure Storage linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure Storage linked
service, and then save the file as MyStorageLinkedService.json.
{
"name": "MyStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<storageAccountName>;AccountKey=
<storageAccountKey>",
"type": "SecureString"
}
}
}
}

Update the <storageAccountName> and <storageAccountKey> with the name and key of your Azure Storage
account.
On-demand HDInsight linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure HDInsight linked
service, and save the file as MyOnDemandSparkLinkedService.json.

{
"name": "MyOnDemandSparkLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"clusterSize": 2,
"clusterType": "spark",
"timeToLive": "00:15:00",
"hostSubscriptionId": "<subscriptionID> ",
"servicePrincipalId": "<servicePrincipalID>",
"servicePrincipalKey": {
"value": "<servicePrincipalKey>",
"type": "SecureString"
},
"tenant": "<tenant ID>",
"clusterResourceGroup": "<resourceGroupofHDICluster>",
"version": "3.6",
"osType": "Linux",
"clusterNamePrefix":"ADFSparkSample",
"linkedServiceName": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
}

Update values for the following properties in the linked service definition:
hostSubscriptionId. Replace <subscriptionID> with the ID of your Azure subscription. The on-demand
HDInsight cluster is created in this subscription.
tenant. Replace <tenantID> with the ID of your Azure tenant.
servicePrincipalId, servicePrincipalKey. Replace <servicePrincipalID> and <servicePrincipalKey> with the ID and key of your service principal in Azure Active Directory. This service principal needs to be a member of the Contributor role of the subscription or the resource group in which the cluster is created. See create Azure Active Directory application and service principal for details.
clusterResourceGroup. Replace <resourceGroupOfHDICluster> with the name of the resource group in
which the HDInsight cluster needs to be created.
NOTE
Azure HDInsight limits the total number of cores that you can use in each Azure region it supports. For the On-Demand HDInsight Linked Service, the HDInsight cluster is created in the same location as the Azure Storage account used as its primary storage. Ensure that you have enough core quota for the cluster to be created successfully. For more information, see Set up clusters in HDInsight with Hadoop, Spark, Kafka, and more.
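If you still need to gather the subscription ID, tenant ID, and a service principal for the placeholders above, the following Azure PowerShell sketch shows one way to do it. The display name is an arbitrary example, and the exact property names on the service principal object (for the application ID and the generated secret) vary between Az module versions, so treat this as a starting point rather than the only approach.

# Look up the subscription and tenant IDs for the current sign-in.
$context = Get-AzContext
$context.Subscription.Id    # value for hostSubscriptionId
$context.Tenant.Id          # value for tenant

# Create a service principal (display name is an arbitrary example) and grant it
# Contributor on the resource group where the on-demand cluster will be created.
$sp = New-AzADServicePrincipal -DisplayName "ADFSparkTutorialSP"
# On recent Az versions the application ID is $sp.AppId (older versions expose $sp.ApplicationId),
# and the generated client secret is returned on the same object.
New-AzRoleAssignment -ApplicationId $sp.AppId -RoleDefinitionName "Contributor" -ResourceGroupName "<resourceGroupOfHDICluster>"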

Author a pipeline
In this step, you create a new pipeline with a Spark activity. The activity uses the word count sample. Download
the contents from this location if you haven't already done so.
Create a JSON file in your preferred editor, copy the following JSON definition of a pipeline definition, and save it
as MySparkOnDemandPipeline.json.

{
"name": "MySparkOnDemandPipeline",
"properties": {
"activities": [
{
"name": "MySparkActivity",
"type": "HDInsightSpark",
"linkedServiceName": {
"referenceName": "MyOnDemandSparkLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"rootPath": "adftutorial/spark",
"entryFilePath": "script/WordCount_Spark.py",
"getDebugInfo": "Failure",
"sparkJobLinkedService": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
]
}
}

Note the following points:


rootPath points to the spark folder of the adftutorial container.
entryFilePath points to the WordCount_Spark.py file in the script subfolder of the spark folder.

Create a data factory


You have authored linked service and pipeline definitions in JSON files. Now, let's create a data factory, and
deploy the linked service and pipeline JSON files by using PowerShell cmdlets. Run the following PowerShell
commands one by one:
1. Set variables one by one.
Resource Group Name

$resourceGroupName = "ADFTutorialResourceGroup"

Data Factory Name. Must be globally unique


$dataFactoryName = "MyDataFactory09102017"

Pipeline name

$pipelineName = "MySparkOnDemandPipeline" # Name of the pipeline

2. Launch PowerShell. Keep Azure PowerShell open until the end of this quickstart. If you close and reopen,
you need to run the commands again. For a list of Azure regions in which Data Factory is currently
available, select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory: Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.
Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:

Connect-AzAccount

Run the following command to view all the subscriptions for this account:

Get-AzSubscription

Run the following command to select the subscription that you want to work with. Replace SubscriptionId
with the ID of your Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

3. Create the resource group: ADFTutorialResourceGroup.

New-AzResourceGroup -Name $resourceGroupName -Location "East US"

4. Create the data factory.

$df = Set-AzDataFactoryV2 -Location EastUS -Name $dataFactoryName -ResourceGroupName $resourceGroupName

Execute the following command to see the output:

$df

5. Switch to the folder where you created JSON files, and run the following command to deploy an Azure
Storage linked service:

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "MyStorageLinkedService" -File "MyStorageLinkedService.json"

6. Run the following command to deploy an on-demand Spark linked service:


Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName
$resourceGroupName -Name "MyOnDemandSparkLinkedService" -File "MyOnDemandSparkLinkedService.json"

7. Run the following command to deploy a pipeline:

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name $pipelineName -File "MySparkOnDemandPipeline.json"

Start and monitor a pipeline run


1. Start a pipeline run. The following command also captures the pipeline run ID for future monitoring.

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineName $pipelineName

2. Run the following script to continuously check the pipeline run status until it finishes.

while ($True) {
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName
$resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore
(Get-Date).AddMinutes(30)

if(!$result) {
Write-Host "Waiting for pipeline to start..." -foregroundcolor "Yellow"
}
elseif (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
}
else {
Write-Host "Pipeline '"$pipelineName"' run finished. Result:" -foregroundcolor "Yellow"
$result
break
}
($result | Format-List | Out-String)
Start-Sleep -Seconds 15
}

Write-Host "Activity `Output` section:" -foregroundcolor "Yellow"


$result.Output -join "`r`n"

Write-Host "Activity `Error` section:" -foregroundcolor "Yellow"


$result.Error -join "`r`n"

3. Here is the output of the sample run:


Pipeline run status: In Progress
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName :
ActivityName : MySparkActivity
PipelineRunId : 94e71d08-a6fa-4191-b7d1-cf8c71cb4794
PipelineName : MySparkOnDemandPipeline
Input : {rootPath, entryFilePath, getDebugInfo, sparkJobLinkedService}
Output :
LinkedServiceName :
ActivityRunStart : 9/20/2017 6:33:47 AM
ActivityRunEnd :
DurationInMs :
Status : InProgress
Error :

Pipeline ' MySparkOnDemandPipeline' run finished. Result:


ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : MyDataFactory09102017
ActivityName : MySparkActivity
PipelineRunId : 94e71d08-a6fa-4191-b7d1-cf8c71cb4794
PipelineName : MySparkOnDemandPipeline
Input : {rootPath, entryFilePath, getDebugInfo, sparkJobLinkedService}
Output : {clusterInUse, jobId, ExecutionProgress, effectiveIntegrationRuntime}
LinkedServiceName :
ActivityRunStart : 9/20/2017 6:33:47 AM
ActivityRunEnd : 9/20/2017 6:46:30 AM
DurationInMs : 763466
Status : Succeeded
Error : {errorCode, message, failureType, target}

Activity Output section:


"clusterInUse": "https://ADFSparkSamplexxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.azurehdinsight.net/"
"jobId": "0"
"ExecutionProgress": "Succeeded"
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)"
Activity Error section:
"errorCode": ""
"message": ""
"failureType": ""
"target": "MySparkActivity"

4. Confirm that a folder named outputfiles is created in the spark folder of the adftutorial container with the output from the Spark program.
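If you prefer to verify the output from PowerShell instead of the portal, a minimal sketch such as the following lists the word-count result blobs. The part-file name in the download step is an assumption; the exact names Spark writes may differ from run to run.

# List the output blobs written by the Spark job.
$storageContext = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"
Get-AzStorageBlob -Container "adftutorial" -Prefix "spark/outputfiles/wordcount/" -Context $storageContext |
    Select-Object Name, Length

# Optionally download one of the part files to inspect the word counts locally
# (the exact part-file name may differ).
Get-AzStorageBlobContent -Container "adftutorial" -Blob "spark/outputfiles/wordcount/part-00000" -Destination "." -Context $storageContext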

Next steps
The pipeline in this sample transforms data by running a Spark program on an on-demand HDInsight cluster. You
learned how to:
Create a data factory.
Author and deploy linked services.
Author and deploy a pipeline.
Start a pipeline run.
Monitor the pipeline run.
Advance to the next tutorial to learn how to transform data by running Hive script on an Azure HDInsight cluster
that is in a virtual network.
Tutorial: transform data using Hive in Azure Virtual Network.
Run a Databricks notebook with the Databricks
Notebook Activity in Azure Data Factory
5/22/2019 • 5 minutes to read • Edit Online

In this tutorial, you use the Azure portal to create an Azure Data Factory pipeline that executes a Databricks
notebook against the Databricks jobs cluster. It also passes Azure Data Factory parameters to the Databricks
notebook during execution.
You perform the following steps in this tutorial:
Create a data factory.
Create a pipeline that uses Databricks Notebook Activity.
Trigger a pipeline run.
Monitor the pipeline run.
If you don't have an Azure subscription, create a free account before you begin.
For an eleven-minute introduction and demonstration of this feature, watch the following video:

Prerequisites
Azure Databricks workspace. Create a Databricks workspace or use an existing one. You create a Python
notebook in your Azure Databricks workspace. Then you execute the notebook and pass parameters to it using
Azure Data Factory.

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. Select Create a resource on the left menu, select Analytics, and then select Data Factory.
3. In the New data factory pane, enter ADFTutorialDataFactory under Name.
The name of the Azure data factory must be globally unique. If you see the following error, change the name
of the data factory. (For example, use <yourname>ADFTutorialDataFactory). For naming rules for Data
Factory artifacts, see the Data Factory - naming rules article.
4. For Subscription, select your Azure subscription in which you want to create the data factory.
5. For Resource Group, take one of the following steps:
Select Use existing and select an existing resource group from the drop-down list.
Select Create new and enter the name of a resource group.
Some of the steps in this quickstart assume that you use the name ADFTutorialResourceGroup for the
resource group. To learn about resource groups, see Using resource groups to manage your Azure
resources.
6. For Version, select V2.
7. For Location, select the location for the data factory.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on
the following page, and then expand Analytics to locate Data Factory: Products available by region. The
data stores (like Azure Storage and Azure SQL Database) and computes (like HDInsight) that Data Factory
uses can be in other regions.
8. Select Create.
9. After the creation is complete, you see the Data factory page. Select the Author & Monitor tile to start
the Data Factory UI application on a separate tab.

Create linked services


In this section, you author a Databricks linked service. This linked service contains the connection information to
the Databricks cluster:
Create an Azure Databricks linked service
1. On the Let's get started page, switch to the Edit tab in the left panel.

2. Select Connections at the bottom of the window, and then select + New.
3. In the New Linked Service window, select Compute > Azure Databricks, and then select Continue.
4. In the New Linked Service window, complete the following steps:
a. For Name, enter AzureDatabricks_LinkedService.
b. Select the appropriate Databricks workspace that you will run your notebook in.
c. For Select cluster, select New job cluster.
d. For Domain/Region, the information should auto-populate.
e. For Access Token, generate it from the Azure Databricks workspace. You can find the steps here.
f. For Cluster version, select 4.2 (with Apache Spark 2.3.1, Scala 2.11).
g. For Cluster node type, select Standard_D3_v2 under the General Purpose (HDD) category for this tutorial.
h. For Workers, enter 2.
i. Select Finish.
Create a pipeline
1. Select the + (plus) button, and then select Pipeline on the menu.
2. Create a parameter to be used in the pipeline. Later you pass this parameter to the Databricks Notebook Activity. In the empty pipeline, click the Parameters tab, then select New, and name the parameter 'name'.
3. In the Activities toolbox, expand Databricks. Drag the Notebook activity from the Activities toolbox to
the pipeline designer surface.

4. In the properties for the Databricks Notebook activity window at the bottom, complete the following
steps:
a. Switch to the Azure Databricks tab.
b. Select AzureDatabricks_LinkedService (which you created in the previous procedure).
c. Switch to the Settings tab.
d. Browse to select a Databricks Notebook path. Let's create a notebook and specify the path here. You get
the Notebook Path by following the next few steps.
a. Launch your Azure Databricks workspace.
b. Create a new folder in the workspace and call it adftutorial.
c. Create a new notebook (Python). Let's call it mynotebook under the adftutorial folder, and click Create.
d. In the newly created notebook "mynotebook", add the following code:

# Creating widgets for leveraging parameters, and printing the parameters

dbutils.widgets.text("input", "","")
dbutils.widgets.get("input")
y = getArgument("input")
print ("Param -\'input':")
print (y)

e. The Notebook Path in this case is /adftutorial/mynotebook


5. Switch back to the Data Factory UI authoring tool. Navigate to Settings Tab under the Notebook1
Activity.
a. Add a parameter to the Notebook activity. You use the same parameter that you added earlier to the pipeline.
b. Name the parameter input and provide the value as the expression @pipeline().parameters.name.
6. To validate the pipeline, select the Validate button on the toolbar. To close the validation window, select the >> (right arrow) button.

7. Select Publish All. The Data Factory UI publishes entities (linked services and pipeline) to the Azure Data
Factory service.
Trigger a pipeline run
Select Trigger on the toolbar, and then select Trigger Now.

The Pipeline Run dialog box asks for the name parameter. Use /path/filename as the parameter here. Click
Finish.
Monitor the pipeline run
1. Switch to the Monitor tab. Confirm that you see a pipeline run. It takes approximately 5-8 minutes to create
a Databricks job cluster, where the notebook is executed.

2. Select Refresh periodically to check the status of the pipeline run.


3. To see activity runs associated with the pipeline run, select View Activity Runs in the Actions column.

You can switch back to the pipeline runs view by selecting the Pipelines link at the top.
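If you would rather poll the run status from the command line than refresh the Monitor tab, a minimal Azure PowerShell sketch (assuming the Az.DataFactory module is installed and that the resource group and factory name placeholders are replaced with your own values) could look like this:

# List pipeline runs updated in the last hour and show their status.
Get-AzDataFactoryV2PipelineRun -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" `
    -LastUpdatedAfter (Get-Date).AddHours(-1) -LastUpdatedBefore (Get-Date) |
    Select-Object PipelineName, RunId, Status, RunStart, RunEnd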
Verify the output
You can log on to your Azure Databricks workspace and go to Clusters, where you can see the job status as pending execution, running, or terminated.

You can click the job name and navigate to see further details. On a successful run, you can validate the parameters passed and the output of the Python notebook.

Next steps
The pipeline in this sample triggers a Databricks Notebook activity and passes a parameter to it. You learned how
to:
Create a data factory.
Create a pipeline that uses a Databricks Notebook activity.
Trigger a pipeline run.
Monitor the pipeline run.
Transform data in Azure Virtual Network using Hive
activity in Azure Data Factory
3/15/2019 • 9 minutes to read • Edit Online

In this tutorial, you use Azure portal to create a Data Factory pipeline that transforms data using Hive Activity on a
HDInsight cluster that is in an Azure Virtual Network (VNet). You perform the following steps in this tutorial:
Create a data factory.
Create a self-hosted integration runtime
Create Azure Storage and Azure HDInsight linked services
Create a pipeline with Hive activity.
Trigger a pipeline run.
Monitor the pipeline run
Verify the output
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure Storage account. You create a Hive script and upload it to Azure Storage. The output from the
Hive script is stored in this storage account. In this sample, the HDInsight cluster uses this Azure Storage
account as its primary storage.
Azure Virtual Network. If you don't have an Azure virtual network, create it by following these
instructions. In this sample, the HDInsight cluster is in an Azure virtual network. Here is a sample configuration of an Azure virtual network.
HDInsight cluster. Create a HDInsight cluster and join it to the virtual network you created in the previous
step by following this article: Extend Azure HDInsight using an Azure Virtual Network. Here is a sample
configuration of HDInsight in a virtual network.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
A virtual machine. Create an Azure virtual machine (VM) and join it to the same virtual network that contains your HDInsight cluster. For details, see How to create virtual machines.
Upload Hive script to your Blob Storage account
1. Create a Hive SQL file named hivescript.hql with the following content:
DROP TABLE IF EXISTS HiveSampleOut;
CREATE EXTERNAL TABLE HiveSampleOut (clientid string, market string, devicemodel string, state string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '${hiveconf:Output}';

INSERT OVERWRITE TABLE HiveSampleOut
SELECT
    clientid,
    market,
    devicemodel,
    state
FROM hivesampletable;
2. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
3. Create a folder named hivescripts.
4. Upload the hivescript.hql file to the hivescripts subfolder.

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. Log in to the Azure portal.
3. Click New on the left menu, click Data + Analytics, and click Data Factory.
4. In the New data factory page, enter ADFTutorialHiveFactory for the name.
The name of the Azure data factory must be globally unique. If you receive the following error, change the
name of the data factory (for example, yournameMyAzureSsisDataFactory) and try creating again. See Data
Factory - Naming Rules article for naming rules for Data Factory artifacts.

`Data factory name “MyAzureSsisDataFactory” is not available`

5. Select your Azure subscription in which you want to create the data factory.
6. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. Select V2 for the version.
8. Select the location for the data factory. Only locations that are supported for creation of data factories are
shown in the list.
9. Select Pin to dashboard.
10. Click Create.
11. On the dashboard, you see the following tile with status: Deploying data factory.
12. After the creation is complete, you see the Data Factory page as shown in the image.

13. Click Author & Monitor to launch the Data Factory User Interface (UI) in a separate tab.
14. In the get started page, switch to the Edit tab in the left panel as shown in the following image:
Create a self-hosted integration runtime
As the Hadoop cluster is inside a virtual network, you need to install a self-hosted integration runtime (IR) in the same virtual network. In this section, you create a new VM, join it to the same virtual network, and install the self-hosted IR on it. The self-hosted IR allows the Data Factory service to dispatch processing requests to a compute service such as HDInsight inside a virtual network. It also allows you to move data between data stores inside a virtual network and Azure. You also use a self-hosted IR when the data store or compute is in an on-premises environment.
1. In the Azure Data Factory UI, click Connections at the bottom of the window, switch to the Integration
Runtimes tab, and click + New button on the toolbar.

2. In the Integration Runtime Setup window, select the Perform data movement and dispatch activities to external computes option, and click Next.
3. Select Private Network, and click Next.
4. Enter MySelfHostedIR for Name, and click Next.

5. Copy the authentication key for the integration runtime by clicking the copy button, and save it. Keep the
window open. You use this key to register the IR installed in a virtual machine.
Install IR on a virtual machine
1. On the Azure VM, download self-hosted integration runtime. Use the authentication key obtained in the
previous step to manually register the self-hosted integration runtime.
2. You see the following message when the self-hosted integration runtime is registered successfully.

3. Click Launch Configuration Manager. You see the following page when the node is connected to the
cloud service:
Self-hosted IR in the Azure Data Factory UI
1. In the Azure Data Factory UI, you should see the name of the self-hosted IR and its status.
2. Click Finish to close the Integration Runtime Setup window. You see the self-hosted IR in the list of
integration runtimes.

Create linked services


You author and deploy two Linked Services in this section:
An Azure Storage Linked Service that links an Azure Storage account to the data factory. This storage is the
primary storage used by your HDInsight cluster. In this case, you use this Azure Storage account to store the
Hive script and output of the script.
An HDInsight Linked Service. Azure Data Factory submits the Hive script to this HDInsight cluster for
execution.
Create Azure Storage linked service
1. Switch to the Linked Services tab, and click New.

2. In the New Linked Service window, select Azure Blob Storage, and click Continue.

3. In the New Linked Service window, do the following steps:


a. Enter AzureStorageLinkedService for Name.
b. Select MySelfHostedIR for Connect via integration runtime.
c. Select your Azure storage account for Storage account name.
d. To test the connection to storage account, click Test connection.
e. Click Save.

Create HDInsight linked service


1. Click New again to create another linked service.
2. Switch to the Compute tab, select Azure HDInsight, and click Continue.

3. In the New Linked Service window, do the following steps:


a. Enter AzureHDInsightLinkedService for Name.
b. Select Bring your own HDInsight.
c. Select your HDInsight cluster for Hdi cluster.
d. Enter the user name for the HDInsight cluster.
e. Enter the password for the user.
This article assumes that you have access to the cluster over the internet. For example, that you can connect to the cluster at https://clustername.azurehdinsight.net. This address uses the public gateway, which is not available if you have used network security groups (NSGs) or user-defined routes (UDRs) to restrict access from the internet. For Data Factory to be able to submit jobs to an HDInsight cluster in an Azure virtual network, you need to configure your Azure virtual network in such a way that the URL can be resolved to the private IP address of the gateway used by HDInsight.
1. From Azure portal, open the Virtual Network the HDInsight is in. Open the network interface with name
starting with nic-gateway-0 . Note down its private IP address. For example, 10.6.0.15.
2. If your Azure virtual network has a DNS server, update the DNS record so that the HDInsight cluster URL
https://<clustername>.azurehdinsight.net can be resolved to 10.6.0.15. If you don't have a DNS server in
your Azure virtual network, you can temporarily work around this by editing the hosts file
(C:\Windows\System32\drivers\etc) of all VMs that are registered as self-hosted integration runtime nodes,
adding an entry similar to the following one:
10.6.0.15 myHDIClusterName.azurehdinsight.net
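After updating DNS or the hosts file, you can confirm from the self-hosted IR VM (Windows) that the cluster URL now resolves to the private gateway IP. A minimal sketch, with the cluster name as a placeholder:

# Run these on the VM that hosts the self-hosted integration runtime.
Resolve-DnsName -Name "<clustername>.azurehdinsight.net"                          # should return the private IP, for example 10.6.0.15
Test-NetConnection -ComputerName "<clustername>.azurehdinsight.net" -Port 443     # checks that the gateway port is reachable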

Create a pipeline
In this step, you create a new pipeline with a Hive activity. The activity executes Hive script to return data from a
sample table and save it to a path you defined.
Note the following points:
scriptPath points to the path of the Hive script in the Azure Storage account that you used for AzureStorageLinkedService. The path is case-sensitive.
Output is an argument used in the Hive script. Use the format of
wasb://<Container>@<StorageAccount>.blob.core.windows.net/outputfolder/ to point it to an existing folder on
your Azure Storage. The path is case-sensitive.
1. In the Data Factory UI, click + (plus) in the left pane, and click Pipeline.

2. In the Activities toolbox, expand HDInsight, and drag-drop Hive activity to the pipeline designer surface.
3. In the properties window, switch to the HDI Cluster tab, and select AzureHDInsightLinkedService for
HDInsight Linked Service.

4. Switch to the Scripts tab, and do the following steps:


a. Select AzureStorageLinkedService for Script Linked Service.
b. For File Path, click Browse Storage.
c. In the Choose a file or folder window, navigate to hivescripts folder of the adftutorial container,
select hivescript.hql, and click Finish.

d. Confirm that you see adftutorial/hivescripts/hivescript.hql for File Path.

e. In the Script tab, expand Advanced section.


f. Click Auto-fill from script for Parameters.
g. Enter the value for the Output parameter in the following format:
wasb://<Blob Container>@<StorageAccount>.blob.core.windows.net/outputfolder/ . For example:
wasb://adftutorial@mystorageaccount.blob.core.windows.net/outputfolder/ .
5. To publish artifacts to Data Factory, click Publish.

Trigger a pipeline run


1. First, validate the pipeline by clicking the Validate button on the toolbar. Close the Pipeline Validation
Output window by clicking right-arrow (>>).
2. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger Now.

Monitor the pipeline run


1. Switch to the Monitor tab on the left. You see a pipeline run in the Pipeline Runs list.

2. To refresh the list, click Refresh.


3. To view activity runs associated with the pipeline runs, click View activity runs in the Action column.
Other action links are for stopping/rerunning the pipeline.

4. You see only one activity run since there is only one activity in the pipeline of type HDInsightHive. To
switch back to the previous view, click Pipelines link at the top.
5. Confirm that you see an output file in the outputfolder of the adftutorial container.

Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create a self-hosted integration runtime
Create Azure Storage and Azure HDInsight linked services
Create a pipeline with Hive activity.
Trigger a pipeline run.
Monitor the pipeline run
Verify the output
Advance to the following tutorial to learn about transforming data by using a Spark cluster on Azure:
Branching and chaining Data Factory control flow
Transform data in Azure Virtual Network using Hive
activity in Azure Data Factory
4/11/2019 • 9 minutes to read • Edit Online

In this tutorial, you use Azure PowerShell to create a Data Factory pipeline that transforms data using Hive Activity
on a HDInsight cluster that is in an Azure Virtual Network (VNet). You perform the following steps in this tutorial:
Create a data factory.
Author and set up a self-hosted integration runtime.
Author and deploy linked services.
Author and deploy a pipeline that contains a Hive activity.
Start a pipeline run.
Monitor the pipeline run.
Verify the output.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure Storage account. You create a Hive script and upload it to Azure Storage. The output from the
Hive script is stored in this storage account. In this sample, the HDInsight cluster uses this Azure Storage
account as its primary storage.
Azure Virtual Network. If you don't have an Azure virtual network, create it by following these
instructions. In this sample, the HDInsight cluster is in an Azure virtual network. Here is a sample configuration of an Azure virtual network.
HDInsight cluster. Create a HDInsight cluster and join it to the virtual network you created in the previous
step by following this article: Extend Azure HDInsight using an Azure Virtual Network. Here is a sample
configuration of HDInsight in a virtual network.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Upload Hive script to your Blob Storage account
1. Create a Hive SQL file named hivescript.hql with the following content:

DROP TABLE IF EXISTS HiveSampleOut;

CREATE EXTERNAL TABLE HiveSampleOut (clientid string, market string, devicemodel string, state string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '${hiveconf:Output}';

INSERT OVERWRITE TABLE HiveSampleOut
SELECT
    clientid,
    market,
    devicemodel,
    state
FROM hivesampletable;

2. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
3. Create a folder named hivescripts.
4. Upload the hivescript.hql file to the hivescripts subfolder (or use the PowerShell sketch that follows).
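Because the rest of this tutorial uses Azure PowerShell, you can also script the container and upload steps above. A minimal sketch, with the storage account name and key as placeholders:

$storageContext = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"
New-AzStorageContainer -Name "adftutorial" -Context $storageContext -ErrorAction SilentlyContinue
# The hivescripts "folder" is just a prefix in the blob name.
Set-AzStorageBlobContent -File ".\hivescript.hql" -Container "adftutorial" -Blob "hivescripts/hivescript.hql" -Context $storageContext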
Create a data factory
1. Set the resource group name. You create a resource group as part of this tutorial. However, you can use an
existing resource group if you like.

$resourceGroupName = "ADFTutorialResourceGroup"

2. Specify the data factory name. Must be globally unique.

$dataFactoryName = "MyDataFactory09142017"

3. Specify a name for the pipeline.

$pipelineName = "MyHivePipeline" #

4. Specify a name for the self-hosted integration runtime. You need a self-hosted integration runtime when the
Data Factory needs to access resources (such as Azure SQL Database) inside a VNet.

$selfHostedIntegrationRuntimeName = "MySelfHostedIR09142017"

5. Launch PowerShell. Keep Azure PowerShell open until the end of this quickstart. If you close and reopen,
you need to run the commands again. For a list of Azure regions in which Data Factory is currently available,
select the regions that interest you on the following page, and then expand Analytics to locate Data
Factory: Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.
Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:

Connect-AzAccount

Run the following command to view all the subscriptions for this account:

Get-AzSubscription

Run the following command to select the subscription that you want to work with. Replace SubscriptionId
with the ID of your Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

6. Create the resource group: ADFTutorialResourceGroup if it does not exist already in your subscription.

New-AzResourceGroup -Name $resourceGroupName -Location "East US"

7. Create the data factory.

$df = Set-AzDataFactoryV2 -Location EastUS -Name $dataFactoryName -ResourceGroupName $resourceGroupName

Execute the following command to see the output:


$df

Create self-hosted IR
In this section, you create a self-hosted integration runtime and associate it with an Azure VM in the same Azure
Virtual Network where your HDInsight cluster is in.
1. Create the self-hosted integration runtime. Use a unique name if another integration runtime with the same name already exists.

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $selfHostedIntegrationRuntimeName -Type SelfHosted

This command creates a logical registration of the self-hosted integration runtime.


2. Use PowerShell to retrieve authentication keys to register the self-hosted integration runtime. Copy one of
the keys for registering the self-hosted integration runtime.

Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $selfHostedIntegrationRuntimeName | ConvertTo-Json

Here is the sample output:

{
"AuthKey1": "IR@0000000000000000000000000000000000000=",
"AuthKey2": "IR@0000000000000000000000000000000000000="
}

Note down the value of AuthKey1 without the quotation marks.


3. Create an Azure VM and join it to the same virtual network that contains your HDInsight cluster. For details, see How to create virtual machines.
4. On the Azure VM, download self-hosted integration runtime. Use the Authentication Key obtained in the
previous step to manually register the self-hosted integration runtime.
You see the following message when the self-hosted integration runtime is registered successfully:

You see the following page when the node is connected to the cloud service:
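You can also confirm the registration from PowerShell. The following sketch uses the Get-AzDataFactoryV2IntegrationRuntime cmdlet with the -Status switch, which shows the runtime state and the nodes that have registered:

# Check the state of the self-hosted integration runtime and its registered nodes.
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -Name $selfHostedIntegrationRuntimeName -Status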
Author linked services
You author and deploy two Linked Services in this section:
An Azure Storage Linked Service that links an Azure Storage account to the data factory. This storage is the
primary storage used by your HDInsight cluster. In this case, we also use this Azure Storage account to keep the
Hive script and output of the script.
An HDInsight Linked Service. Azure Data Factory submits the Hive script to this HDInsight cluster for
execution.
Azure Storage linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure Storage linked
service, and then save the file as MyStorageLinkedService.json.

{
    "name": "MyStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": {
                "value": "DefaultEndpointsProtocol=https;AccountName=<storageAccountName>;AccountKey=<storageAccountKey>",
                "type": "SecureString"
            }
        },
        "connectVia": {
            "referenceName": "MySelfhostedIR",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Replace <storageAccountName> and <storageAccountKey> with the name and key of your Azure Storage account.
HDInsight linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure HDInsight linked
service, and save the file as MyHDInsightLinkedService.json.

{
"name": "MyHDInsightLinkedService",
"properties": {
"type": "HDInsight",
"typeProperties": {
"clusterUri": "https://<clustername>.azurehdinsight.net",
"userName": "<username>",
"password": {
"value": "<password>",
"type": "SecureString"
},
"linkedServiceName": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
},
"connectVia": {
"referenceName": "MySelfhostedIR",
"type": "IntegrationRuntimeReference"
}
}
}

Update values for the following properties in the linked service definition:
userName. Name of the cluster login user that you specified when creating the cluster.
password. The password for the user.
clusterUri. Specify the URL of your HDInsight cluster in the following format:
https://<clustername>.azurehdinsight.net. This article assumes that you have access to the cluster over the
internet. For example, you can connect to the cluster at https://clustername.azurehdinsight.net. This
address uses the public gateway, which is not available if you have used network security groups (NSGs) or
user-defined routes (UDRs) to restrict access from the internet. For Data Factory to submit jobs to
HDInsight clusters in Azure Virtual Network, your Azure Virtual Network needs to be configured in such a
way that the URL can be resolved to the private IP address of the gateway used by HDInsight.
1. From Azure portal, open the Virtual Network the HDInsight is in. Open the network interface with
name starting with nic-gateway-0 . Note down its private IP address. For example, 10.6.0.15.
2. If your Azure Virtual Network has a DNS server, update the DNS record so that the HDInsight cluster URL
https://<clustername>.azurehdinsight.net can be resolved to 10.6.0.15 . This is the recommended
approach. If you don’t have a DNS server in your Azure Virtual Network, you can temporarily work
around this by editing the hosts file (C:\Windows\System32\drivers\etc) of all VMs that registered as
self-hosted integration runtime nodes by adding an entry like this:
10.6.0.15 myHDIClusterName.azurehdinsight.net

Create linked services


1. In PowerShell, switch to the folder where you created the JSON files.
2. Run the following command to create the Azure Storage linked service.
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName
-Name "MyStorageLinkedService" -File "MyStorageLinkedService.json"

3. Run the following command to create an Azure HDInsight linked service.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "MyHDInsightLinkedService" -File "MyHDInsightLinkedService.json"

Author a pipeline
In this step, you create a new pipeline with a Hive activity. The activity executes Hive script to return data from a
sample table and save it to a path you defined. Create a JSON file in your preferred editor, copy the following
JSON definition of a pipeline definition, and save it as MyHivePipeline.json.

{
"name": "MyHivePipeline",
"properties": {
"activities": [
{
"name": "MyHiveActivity",
"type": "HDInsightHive",
"linkedServiceName": {
"referenceName": "MyHDILinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scriptPath": "adftutorial\\hivescripts\\hivescript.hql",
"getDebugInfo": "Failure",
"defines": {
"Output": "wasb://<Container>@<StorageAccount>.blob.core.windows.net/outputfolder/"
},
"scriptLinkedService": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
]
}
}

Note the following points:


scriptPath points to the path of the Hive script in the Azure Storage account that you used for MyStorageLinkedService. The path is case-sensitive.
Output is an argument used in the Hive script. Use the format of
wasb://<Container>@<StorageAccount>.blob.core.windows.net/outputfolder/ to point it to an existing folder on
your Azure Storage. The path is case-sensitive.
Switch to the folder where you created JSON files, and run the following command to deploy the pipeline:

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name $pipelineName -File "MyHivePipeline.json"

Start the pipeline


1. Start a pipeline run. The following command also captures the pipeline run ID for future monitoring.

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineName $pipelineName

2. Run the following script to continuously check the pipeline run status until it finishes.

while ($True) {
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName
$resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore
(Get-Date).AddMinutes(30)

if(!$result) {
Write-Host "Waiting for pipeline to start..." -foregroundcolor "Yellow"
}
elseif (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
}
else {
Write-Host "Pipeline '"$pipelineName"' run finished. Result:" -foregroundcolor "Yellow"
$result
break
}
($result | Format-List | Out-String)
Start-Sleep -Seconds 15
}

Write-Host "Activity `Output` section:" -foregroundcolor "Yellow"


$result.Output -join "`r`n"

Write-Host "Activity `Error` section:" -foregroundcolor "Yellow"


$result.Error -join "`r`n"

Here is the output of the sample run:


Pipeline run status: In Progress

ResourceGroupName : ADFV2SampleRG2
DataFactoryName : SampleV2DataFactory2
ActivityName : MyHiveActivity
PipelineRunId : 000000000-0000-0000-000000000000000000
PipelineName : MyHivePipeline
Input : {getDebugInfo, scriptPath, scriptLinkedService, defines}
Output :
LinkedServiceName :
ActivityRunStart : 9/18/2017 6:58:13 AM
ActivityRunEnd :
DurationInMs :
Status : InProgress
Error :

Pipeline ' MyHivePipeline' run finished. Result:

ResourceGroupName : ADFV2SampleRG2
DataFactoryName : SampleV2DataFactory2
ActivityName : MyHiveActivity
PipelineRunId : 0000000-0000-0000-0000-000000000000
PipelineName : MyHivePipeline
Input : {getDebugInfo, scriptPath, scriptLinkedService, defines}
Output : {logLocation, clusterInUse, jobId, ExecutionProgress...}
LinkedServiceName :
ActivityRunStart : 9/18/2017 6:58:13 AM
ActivityRunEnd : 9/18/2017 6:59:16 AM
DurationInMs : 63636
Status : Succeeded
Error : {errorCode, message, failureType, target}

Activity Output section:


"logLocation": "wasbs://adfjobs@adfv2samplestor.blob.core.windows.net/HiveQueryJobs/000000000-0000-47c3-
9b28-1cdc7f3f2ba2/18_09_2017_06_58_18_023/Status"
"clusterInUse": "https://adfv2HivePrivate.azurehdinsight.net"
"jobId": "job_1505387997356_0024"
"ExecutionProgress": "Succeeded"
"effectiveIntegrationRuntime": "MySelfhostedIR"
Activity Error section:
"errorCode": ""
"message": ""
"failureType": ""
"target": "MyHiveActivity"

3. Check the outputfolder folder for a new file created as the Hive query result. It should look like the following sample output:

8 en-US SCH-i500 California
23 en-US Incredible Pennsylvania
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
246 en-US SCH-i500 District Of Columbia
246 en-US SCH-i500 District Of Columbia
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Author and set up a self-hosted integration runtime.
Author and deploy linked services.
Author and deploy a pipeline that contains a Hive activity.
Start a pipeline run.
Monitor the pipeline run.
Verify the output.
Advance to the following tutorial to learn about transforming data by using a Spark cluster on Azure:
Branching and chaining Data Factory control flow
Branching and chaining activities in a Data Factory
pipeline
3/26/2019 • 10 minutes to read • Edit Online

In this tutorial, you create a Data Factory pipeline that showcases some of the control flow features. This pipeline
does a simple copy from a container in Azure Blob Storage to another container in the same storage account. If the
copy activity succeeds, the pipeline sends details of the successful copy operation (such as the amount of data
written) in a success email. If the copy activity fails, the pipeline sends details of copy failure (such as the error
message) in a failure email. Throughout the tutorial, you see how to pass parameters.
A high-level overview of the scenario:

You perform the following steps in this tutorial:


Create a data factory.
Create an Azure Storage linked service.
Create an Azure Blob dataset
Create a pipeline that contains a Copy activity and a Web activity
Send outputs of activities to subsequent activities
Utilize parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
This tutorial uses the Azure portal. To use other mechanisms to interact with Azure Data Factory, refer to "Quickstarts" in the table of contents.

Prerequisites
Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
Azure Storage account. You use the blob storage as source data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one.
Azure SQL Database. You use the database as sink data store. If you don't have an Azure SQL Database, see
the Create an Azure SQL database article for steps to create one.
Create blob table
1. Launch Notepad. Copy the following text and save it as input.txt file on your disk.

John,Doe
Jane,Doe

2. Use tools such as Azure Storage Explorer to do the following steps:


a. Create the adfv2branch container.
b. Create input folder in the adfv2branch container.
c. Upload the input.txt file to the container (or use the PowerShell sketch that follows).
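If you prefer scripting over Storage Explorer, the following Azure PowerShell sketch performs the same steps; the storage account name and key are placeholders.

$storageContext = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"
New-AzStorageContainer -Name "adfv2branch" -Context $storageContext -ErrorAction SilentlyContinue
# Uploading to the "input/" prefix creates the input folder implicitly.
Set-AzStorageBlobContent -File ".\input.txt" -Container "adfv2branch" -Blob "input/input.txt" -Context $storageContext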

Create email workflow endpoints


To trigger sending an email from the pipeline, you use Logic Apps to define the workflow. For details on creating a
Logic App workflow, see How to create a logic app.
Success email workflow
Create a Logic App workflow named CopySuccessEmail . Define the workflow trigger as
When an HTTP request is received , and add an action of Office 365 Outlook – Send an email .

For your request trigger, fill in the Request Body JSON Schema with the following JSON:
{
"properties": {
"dataFactoryName": {
"type": "string"
},
"message": {
"type": "string"
},
"pipelineName": {
"type": "string"
},
"receiver": {
"type": "string"
}
},
"type": "object"
}

The Request in the Logic App Designer should look like the following image:

For the Send Email action, customize how you wish to format the email, utilizing the properties passed in the
request Body JSON schema. Here is an example:
Save the workflow. Make a note of your HTTP Post request URL for your success email workflow:

//Success Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-10-01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

Fail email workflow


Follow the same steps to create another Logic Apps workflow of CopyFailEmail. In the request trigger, the
Request Body JSON schema is the same. Change the format of your email like the Subject to tailor toward a failure
email. Here is an example:
Save the workflow. Make a note of your HTTP Post request URL for your failure email workflow:

//Fail Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-10-01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

You should now have two workflow URLs:

//Success Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-10-01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

//Fail Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-10-01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000
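Before wiring these URLs into the pipeline, you can optionally test one of them by posting a sample payload that matches the Request Body JSON schema. This is a minimal sketch using Invoke-RestMethod; the URL, factory name, and receiver address are placeholders, and a test email should arrive if the workflow is configured correctly.

# Post a test payload to the success workflow URL to confirm the email arrives.
$payload = @{
    dataFactoryName = "ADFTutorialDataFactory"
    message         = "Test message from PowerShell"
    pipelineName    = "TestPipeline"
    receiver        = "<your email address>"
} | ConvertTo-Json

Invoke-RestMethod -Method Post -Uri "<your success workflow URL>" -ContentType "application/json" -Body $payload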

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Data + Analytics > Data Factory:
3. In the New data factory page, enter ADFTutorialDataFactory for the name.

The name of the Azure data factory must be globally unique. If you receive the following error, change the
name of the data factory (for example, yournameADFTutorialDataFactory) and try creating again. See Data
Factory - Naming Rules article for naming rules for Data Factory artifacts.
`Data factory name “ADFTutorialDataFactory” is not available`

4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down
list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.

11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) in a separate tab.

Create a pipeline
In this step, you create a pipeline with one Copy activity and two Web activities. You use the following features to
create the pipeline:
Parameters for the pipeline that are accessed by datasets.
Web activity to invoke logic apps workflows to send success/failure emails.
Connecting one activity with another activity (on success and failure)
Using output from an activity as an input to the subsequent activity
1. In the get started page of Data Factory UI, click the Create pipeline tile.
2. In the properties window for the pipeline, switch to the Parameters tab, and use the New button to add the
following three parameters of type String: sourceBlobContainer, sinkBlobContainer, and receiver.
sourceBlobContainer - parameter in the pipeline consumed by the source blob dataset.
sinkBlobContainer – parameter in the pipeline consumed by the sink blob dataset
receiver – this parameter is used by the two Web activities in the pipeline that send success or failure
emails to the receiver whose email address is specified by this parameter.

3. In the Activities toolbox, expand Data Flow, and drag-drop Copy activity to the pipeline designer surface.
4. In the Properties window for the Copy activity at the bottom, switch to the Source tab, and click + New.
You create a source dataset for the copy activity in this step.

5. In the New Dataset window, select Azure Blob Storage, and click Finish.
6. You see a new tab titled AzureBlob1. Change the name of the dataset to SourceBlobDataset.
7. Switch to the Connection tab in the Properties window, and click New for the Linked service. You create
a linked service to link your Azure Storage account to the data factory in this step.
8. In the New Linked Service window, do the following steps:
a. Enter AzureStorageLinkedService for Name.
b. Select your Azure storage account for the Storage account name.
c. Click Save.
9. Enter @pipeline().parameters.sourceBlobContainer for the folder and input.txt for the file name. You use the sourceBlobContainer pipeline parameter to set the folder path for the dataset.
10. Switch to the pipeline tab (or click the pipeline in the tree view). Confirm that SourceBlobDataset is selected for Source Dataset.


11. In the properties window, switch to the Sink tab, and click + New for Sink Dataset. You create a sink dataset for the copy activity in this step, similar to the way you created the source dataset.

12. In the New Dataset window, select Azure Blob Storage, and click Finish.
13. In the General settings page for the dataset, enter SinkBlobDataset for Name.
14. Switch to the Connection tab, and do the following steps:
a. Select AzureStorageLinkedService for LinkedService.
b. Enter @pipeline().parameters.sinkBlobContainer for the folder.
c. Enter @CONCAT(pipeline().RunId, '.txt') for the file name. The expression uses the ID of the current
pipeline run for the file name. For the supported list of system variables and expressions, see System
variables and Expression language.
15. Switch to the pipeline tab at the top. Expand General in the Activities toolbox, and drag-drop a Web
activity to the pipeline designer surface. Set the name of the activity to SendSuccessEmailActivity. The
Web Activity allows a call to any REST endpoint. For more information about the activity, see Web Activity.
This pipeline uses a Web Activity to call the Logic Apps email workflow.

16. Switch to the Settings tab from the General tab, and do the following steps:
a. For URL, specify URL for the logic apps workflow that sends the success email.
b. Select POST for Method.
c. Click + Add header link in the Headers section.
d. Add a header Content-Type and set it to application/json.
e. Specify the following JSON for Body.

{
"message": "@{activity('Copy1').output.dataWritten}",
"dataFactoryName": "@{pipeline().DataFactory}",
"pipelineName": "@{pipeline().Pipeline}",
"receiver": "@pipeline().parameters.receiver"
}

The message body contains the following properties:


Message – Passing value of @{activity('Copy1').output.dataWritten}. Accesses a property of
the previous copy activity and passes the value of dataWritten. For the failure case, pass the
error output instead: @{activity('Copy1').error.message}.
Data Factory Name – Passing value of @{pipeline().DataFactory}. This is a system variable,
allowing you to access the corresponding data factory name. For a list of system variables, see
the System Variables article.
Pipeline Name – Passing value of @{pipeline().Pipeline}. This is also a system variable,
allowing you to access the corresponding pipeline name.
Receiver – Passing value of @pipeline().parameters.receiver. Accessing the pipeline
parameters.

17. Connect the Copy activity to the Web activity by dragging the green button next to the Copy activity and dropping it on the Web activity.
18. Drag-drop another Web activity from the Activities toolbox to the pipeline designer surface, and set the name to SendFailureEmailActivity.

19. Switch to the Settings tab, and do the following steps:


a. For URL, specify URL for the logic apps workflow that sends the failure email.
b. Select POST for Method.
c. Click + Add header link in the Headers section.
d. Add a header Content-Type and set it to application/json.
e. Specify the following JSON for Body.
{
"message": "@{activity('Copy1').error.message}",
"dataFactoryName": "@{pipeline().DataFactory}",
"pipelineName": "@{pipeline().Pipeline}",
"receiver": "@pipeline().parameters.receiver"
}

20. Select the Copy activity in the pipeline designer, click the +-> button, and select Error.

21. Drag the red button next to the Copy activity to the second Web activity, SendFailureEmailActivity. You can move the activities around so that the pipeline looks like the following image:
22. To validate the pipeline, click the Validate button on the toolbar. Close the Pipeline Validation Output window by clicking the >> button.

23. To publish the entities (datasets, pipelines, etc.) to the Data Factory service, select Publish All. Wait until you see the Successfully published message.
Trigger a pipeline run that succeeds
1. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger Now.

2. In the Pipeline Run window, do the following steps:


a. Enter adftutorial/adfv2branch/input for the sourceBlobContainer parameter.
b. Enter adftutorial/adfv2branch/output for the sinkBlobContainer parameter.
c. Enter an email address of the receiver.
d. Click Finish
Monitor the successful pipeline run
1. To monitor the pipeline run, switch to the Monitor tab on the left. You see the pipeline run that was
triggered manually by you. Use the Refresh button to refresh the list.

2. To view activity runs associated with this pipeline run, click the first link in the Actions column. You can
switch back to the previous view by clicking Pipelines at the top. Use the Refresh button to refresh the list.

Trigger a pipeline run that fails


1. Switch to the Edit tab on the left.
2. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger Now.
3. In the Pipeline Run window, do the following steps:
a. Enter adftutorial/dummy/input for the sourceBlobContainer parameter. Ensure that the dummy
folder does not exist in the adftutorial container.
b. Enter adftutorial/dummy/output for the sinkBlobContainer parameter.
c. Enter an email address of the receiver.
d. Click Finish.

Monitor the failed pipeline run


1. To monitor the pipeline run, switch to the Monitor tab on the left. You see the pipeline run that was
triggered manually by you. Use the Refresh button to refresh the list.
2. Click Error link for the pipeline run to see details about the error.

3. To view activity runs associated with this pipeline run, click the first link in the Actions column. Use the Refresh button to refresh the list. Notice that the Copy activity in the pipeline failed. The Web activity succeeded in sending the failure email to the specified receiver.

4. Click Error link in the Actions column to see details about the error.
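If you want to pull the same error details programmatically, a minimal Azure PowerShell sketch (the resource group and factory names are placeholders) can query recent failed runs and show the error for their activity runs:

# Find recent pipeline runs and show the error for any failed activity runs.
$runs = Get-AzDataFactoryV2PipelineRun -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" `
    -LastUpdatedAfter (Get-Date).AddHours(-1) -LastUpdatedBefore (Get-Date)

foreach ($run in ($runs | Where-Object { $_.Status -eq "Failed" })) {
    Get-AzDataFactoryV2ActivityRun -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" `
        -PipelineRunId $run.RunId -RunStartedAfter (Get-Date).AddHours(-1) -RunStartedBefore (Get-Date) |
        Select-Object ActivityName, Status, Error
}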

Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create an Azure Storage linked service.
Create an Azure Blob dataset
Create a pipeline that contains a copy activity and a web activity
Send outputs of activities to subsequent activities
Utilize parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
You can now proceed to the Concepts section for more information about Azure Data Factory.
Pipelines and activities
Branching and chaining activities in a Data Factory
pipeline
3/29/2019 • 14 minutes to read • Edit Online

In this tutorial, you create a Data Factory pipeline that showcases some of the control flow features. This pipeline
does a simple copy from a container in Azure Blob Storage to another container in the same storage account. If
the copy activity succeeds, you want to send details of the successful copy operation (such as the amount of data
written) in a success email. If the copy activity fails, you want to send details of copy failure (such as the error
message) in a failure email. Throughout the tutorial, you see how to pass parameters.
A high-level overview of the scenario:

You perform the following steps in this tutorial:


Create a data factory.
Create an Azure Storage linked service.
Create an Azure Blob dataset
Create a pipeline that contains a copy activity and a web activity
Send outputs of activities to subsequent activities
Utilize parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
This tutorial uses the .NET SDK. To use other mechanisms to interact with Azure Data Factory, refer to
"Quickstarts" in the table of contents.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure Storage account. You use the blob storage as source data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one.
Azure SQL Database. You use the database as sink data store. If you don't have an Azure SQL Database, see
the Create an Azure SQL database article for steps to create one.
Visual Studio 2013, 2015, or 2017. The walkthrough in this article uses Visual Studio 2017.
Download and install Azure .NET SDK.
Create an application in Azure Active Directory following these instructions. Make note of the following
values that you use in later steps: application ID, authentication key, and tenant ID. Assign application to
"Contributor" role by following instructions in the same article.
Create blob table
1. Launch Notepad. Copy the following text and save it as input.txt file on your disk.

John|Doe
Jane|Doe

2. Use tools such as Azure Storage Explorer to create the adfv2branch container, and to upload the input.txt
file to the container.
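If you prefer scripting this step, the following is a minimal PowerShell sketch (assuming the Az.Storage module is installed; the storage account name and key are placeholders you replace with your own values) that creates the container and uploads the file to its input folder:

# Minimal sketch: create the container and upload input.txt with Az.Storage.
$ctx = New-AzStorageContext -StorageAccountName "<Azure Storage account name>" -StorageAccountKey "<Azure Storage account key>"

# Create the adfv2branch container if it doesn't exist yet.
New-AzStorageContainer -Name "adfv2branch" -Context $ctx -ErrorAction SilentlyContinue

# Upload input.txt into the input folder of the container.
Set-AzStorageBlobContent -File ".\input.txt" -Container "adfv2branch" -Blob "input/input.txt" -Context $ctx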

Create Visual Studio project


Using Visual Studio 2015/2017, create a C# .NET console application.
1. Launch Visual Studio.
2. Click File, point to New, and click Project. .NET version 4.5.2 or above is required.
3. Select Visual C# -> Console App (.NET Framework) from the list of project types on the right.
4. Enter ADFv2BranchTutorial for the Name.
5. Click OK to create the project.

Install NuGet packages


1. Click Tools -> NuGet Package Manager -> Package Manager Console.
2. In the Package Manager Console, run the following commands to install packages. Refer to
Microsoft.Azure.Management.DataFactory nuget package with details.

Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory

Create a data factory client


1. Open Program.cs, include the following statements to add references to namespaces.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

2. Add these static variables to the Program class. Replace place-holders with your own values. For a list of
Azure regions in which Data Factory is currently available, select the regions that interest you on the
following page, and then expand Analytics to locate Data Factory: Products available by region. The data
stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data factory can
be in other regions.

// Set variables
static string tenantID = "<tenant ID>";
static string applicationId = "<application ID>";
static string authenticationKey = "<Authentication key for your application>";
static string subscriptionId = "<Azure subscription ID>";
static string resourceGroup = "<Azure resource group name>";

static string region = "East US";


static string dataFactoryName = "<Data factory name>";

// Specify the source Azure Blob information


static string storageAccount = "<Azure Storage account name>";
static string storageKey = "<Azure Storage account key>";
// Confirm that you have the input.txt file placed in the input folder of the adfv2branch container.
static string inputBlobPath = "adfv2branch/input";
static string inputBlobName = "input.txt";
static string outputBlobPath = "adfv2branch/output";
static string emailReceiver = "<specify email address of the receiver>";

static string storageLinkedServiceName = "AzureStorageLinkedService";


static string blobSourceDatasetName = "SourceStorageDataset";
static string blobSinkDatasetName = "SinkStorageDataset";
static string pipelineName = "Adfv2TutorialBranchCopy";

static string copyBlobActivity = "CopyBlobtoBlob";


static string sendFailEmailActivity = "SendFailEmailActivity";
static string sendSuccessEmailActivity = "SendSuccessEmailActivity";

3. Add the following code to the Main method that creates an instance of DataFactoryManagementClient
class. You use this object to create data factory, linked service, datasets, and pipeline. You also use this
object to monitor the pipeline run details.

// Authenticate and create a data factory management client


var context = new AuthenticationContext("https://login.windows.net/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync("https://management.azure.com/", cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) { SubscriptionId = subscriptionId };

Create a data factory


Create a “CreateOrUpdateDataFactory” function in your Program.cs file:
static Factory CreateOrUpdateDataFactory(DataFactoryManagementClient client)
{
Console.WriteLine("Creating data factory " + dataFactoryName + "...");
Factory resource = new Factory
{
Location = region
};
Console.WriteLine(SafeJsonConvert.SerializeObject(resource, client.SerializationSettings));

Factory response = client.Factories.CreateOrUpdate(resourceGroup, dataFactoryName, resource);

while (client.Factories.Get(resourceGroup, dataFactoryName).ProvisioningState == "PendingCreation")
{
System.Threading.Thread.Sleep(1000);
}
return response;
}

Add the following code to Main method that creates a data factory.

Factory df = CreateOrUpdateDataFactory(client);

Create an Azure Storage linked service


Create a “StorageLinkedServiceDefinition” function in your Program.cs file:

static LinkedServiceResource StorageLinkedServiceDefinition(DataFactoryManagementClient client)


{
Console.WriteLine("Creating linked service " + storageLinkedServiceName + "...");
AzureStorageLinkedService storageLinkedService = new AzureStorageLinkedService
{
ConnectionString = new SecureString("DefaultEndpointsProtocol=https;AccountName=" + storageAccount +
";AccountKey=" + storageKey)
};
Console.WriteLine(SafeJsonConvert.SerializeObject(storageLinkedService, client.SerializationSettings));
LinkedServiceResource linkedService = new LinkedServiceResource(storageLinkedService,
name:storageLinkedServiceName);
return linkedService;
}

Add the following code to the Main method that creates an Azure Storage linked service. Learn more from
Azure Blob linked service properties on supported properties and details.

client.LinkedServices.CreateOrUpdate(resourceGroup, dataFactoryName, storageLinkedServiceName,


StorageLinkedServiceDefinition(client));

Create datasets
In this section, you create two datasets: one for the source and the other for the sink.
Create a dataset for source Azure Blob
Add the following code to the Main method that creates an Azure blob dataset. Learn more from Azure Blob
dataset properties on supported properties and details.
You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure Storage
linked service you create in the previous step, and describes:
The location of the blob to copy from: FolderPath and FileName;
Notice the use of parameters for the FolderPath. “sourceBlobContainer” is the name of the parameter and the
expression is replaced with the values passed in the pipeline run. The syntax to define parameters is
@pipeline().parameters.<parameterName>

Create a “SourceBlobDatasetDefinition” function in your Program.cs file

static DatasetResource SourceBlobDatasetDefinition(DataFactoryManagementClient client)


{
Console.WriteLine("Creating dataset " + blobSourceDatasetName + "...");
AzureBlobDataset blobDataset = new AzureBlobDataset
{
FolderPath = new Expression { Value = "@pipeline().parameters.sourceBlobContainer" },
FileName = inputBlobName,
LinkedServiceName = new LinkedServiceReference
{
ReferenceName = storageLinkedServiceName
}
};
Console.WriteLine(SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings));
DatasetResource dataset = new DatasetResource(blobDataset, name:blobSourceDatasetName);
return dataset;
}

Create a dataset for sink Azure Blob


Create a “SinkBlobDatasetDefinition” function in your Program.cs file:

static DatasetResource SinkBlobDatasetDefinition(DataFactoryManagementClient client)


{
Console.WriteLine("Creating dataset " + blobSinkDatasetName + "...");
AzureBlobDataset blobDataset = new AzureBlobDataset
{
FolderPath = new Expression { Value = "@pipeline().parameters.sinkBlobContainer" },
LinkedServiceName = new LinkedServiceReference
{
ReferenceName = storageLinkedServiceName
}
};
Console.WriteLine(SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings));
DatasetResource dataset = new DatasetResource(blobDataset, name: blobSinkDatasetName);
return dataset;
}

Add the following code to the Main method that creates both Azure Blob source and sink datasets.

client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobSourceDatasetName,


SourceBlobDatasetDefinition(client));

client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobSinkDatasetName,


SinkBlobDatasetDefinition(client));

Create a C# class: EmailRequest


In your C# project, create a class named EmailRequest. This class defines the properties that the pipeline sends in the
request body when it sends an email. In this tutorial, the pipeline sends four properties from the pipeline to the
email:
Message: body of the email. In the case of a successful copy, this property contains details of the run (the amount
of data written). In the case of a failed copy, this property contains details of the error.
Data factory name: name of the data factory
Pipeline name: name of the pipeline
Receiver: Parameter that is passed through. This property specifies the receiver of the email.

class EmailRequest
{
[Newtonsoft.Json.JsonProperty(PropertyName = "message")]
public string message;

[Newtonsoft.Json.JsonProperty(PropertyName = "dataFactoryName")]
public string dataFactoryName;

[Newtonsoft.Json.JsonProperty(PropertyName = "pipelineName")]
public string pipelineName;

[Newtonsoft.Json.JsonProperty(PropertyName = "receiver")]
public string receiver;

public EmailRequest(string input, string df, string pipeline, string receiverName)


{
message = input;
dataFactoryName = df;
pipelineName = pipeline;
receiver = receiverName;
}
}

Create email workflow endpoints


To trigger sending an email, you use Logic Apps to define the workflow. For details on creating a Logic App
workflow, see How to create a logic app.
Success email workflow
Create a Logic App workflow named CopySuccessEmail . Define the workflow trigger as
When an HTTP request is received , and add an action of Office 365 Outlook – Send an email .

For your request trigger, fill in the Request Body JSON Schema with the following JSON:
{
"properties": {
"dataFactoryName": {
"type": "string"
},
"message": {
"type": "string"
},
"pipelineName": {
"type": "string"
},
"receiver": {
"type": "string"
}
},
"type": "object"
}

This aligns with the EmailRequest class you created in the previous section.
Your Request should look like this in the Logic App Designer:

For the Send Email action, customize how you wish to format the email, utilizing the properties passed in the
request Body JSON schema. Here is an example:
Make a note of your HTTP Post request URL for your success email workflow:

//Success Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000
Fail email workflow
Clone your CopySuccessEmail workflow and create another Logic Apps workflow named CopyFailEmail. In the request
trigger, the Request Body JSON schema is the same. Simply change the format of your email, such as the Subject, to
tailor it toward a failure email. Here is an example:

Make a note of your HTTP Post request URL for your failure email workflow:

//Fail Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

You should now have two workflow URLs:

//Success Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

//Fail Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity and the dependsOn property.
In this tutorial, the pipeline contains one copy activity, which takes in the Blob dataset as a source and another
Blob dataset as a sink, plus two Web activities. Depending on whether the copy activity succeeds or fails, the pipeline calls a different email task.
In this pipeline, you use the following features:
Parameters
Web Activity
Activity dependency
Using output from an activity as an input to the subsequent activity
Let’s break down the following pipeline section by section:

static PipelineResource PipelineDefinition(DataFactoryManagementClient client)


{
Console.WriteLine("Creating pipeline " + pipelineName + "...");
PipelineResource resource = new PipelineResource
{
Parameters = new Dictionary<string, ParameterSpecification>
{
{ "sourceBlobContainer", new ParameterSpecification { Type = ParameterType.String } },
{ "sinkBlobContainer", new ParameterSpecification { Type = ParameterType.String } },
{ "receiver", new ParameterSpecification { Type = ParameterType.String } }

},
Activities = new List<Activity>
{
new CopyActivity
{
Name = copyBlobActivity,
Inputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobSourceDatasetName
}
},
Outputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobSinkDatasetName
}
},
Source = new BlobSource { },
Sink = new BlobSink { }
},
new WebActivity
{
Name = sendSuccessEmailActivity,
Method = WebActivityMethod.POST,
Url =
"https://prodxxx.eastus.logic.azure.com:443/workflows/00000000000000000000000000000000000/triggers/manual/path
s/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=0000000000000000000000000000000000000000000000",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').output.dataWritten}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Succeeded" }
}
}
},
new WebActivity
{
Name = sendFailEmailActivity,
Method =WebActivityMethod.POST,
Url =
"https://prodxxx.eastus.logic.azure.com:443/workflows/000000000000000000000000000000000/triggers/manual/paths/
invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=0000000000000000000000000000000000000000000",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').error.message}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Failed" }
}
}
}
}
};
Console.WriteLine(SafeJsonConvert.SerializeObject(resource, client.SerializationSettings));
return resource;
}

Add the following code to the Main method that creates the pipeline:

client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName, PipelineDefinition(client));

Parameters
The first section of our pipeline defines parameters.
sourceBlobContainer - parameter in the pipeline consumed by the source blob dataset.
sinkBlobContainer – parameter in the pipeline consumed by the sink blob dataset
receiver – this parameter is used by the two Web activities in the pipeline that send success or failure emails to
the receiver whose email address is specified by this parameter.

Parameters = new Dictionary<string, ParameterSpecification>


{
{ "sourceBlobContainer", new ParameterSpecification { Type = ParameterType.String } },
{ "sinkBlobContainer", new ParameterSpecification { Type = ParameterType.String } },
{ "receiver", new ParameterSpecification { Type = ParameterType.String } }
},

Web Activity
The Web Activity allows a call to any REST endpoint. For more information about the activity, see Web Activity.
This pipeline uses a Web Activity to call the Logic Apps email workflow. You create two web activities: one that
calls the CopySuccessEmail workflow and one that calls the CopyFailEmail workflow.
new WebActivity
{
Name = sendSuccessEmailActivity,
Method = WebActivityMethod.POST,
Url = "https://prodxxx.eastus.logic.azure.com:443/workflows/12345",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').output.dataWritten}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Succeeded" }
}
}
}

In the “Url” property, paste the Request URL endpoint from the corresponding Logic Apps workflow. In the “Body”
property, pass an instance of the “EmailRequest” class. The email request contains the following properties:
Message – Passes the value of @{activity('CopyBlobtoBlob').output.dataWritten}. This expression accesses a property of the
previous copy activity and passes the value of dataWritten. For the failure case, pass the error output instead:
@{activity('CopyBlobtoBlob').error.message}.
Data Factory Name – Passes the value of @{pipeline().DataFactory}. This is a system variable, allowing you to
access the corresponding data factory name. For a list of system variables, see the System Variables article.
Pipeline Name – Passes the value of @{pipeline().Pipeline}. This is also a system variable, allowing you to
access the corresponding pipeline name.
Receiver – Passes the value of @pipeline().parameters.receiver, accessing the pipeline parameter.
This code creates a new activity dependency on the previous copy activity, with a Succeeded dependency condition.

Create a pipeline run


Add the following code to the Main method that triggers a pipeline run.

// Create a pipeline run


Console.WriteLine("Creating pipeline run...");
Dictionary<string, object> arguments = new Dictionary<string, object>
{
{ "sourceBlobContainer", inputBlobPath },
{ "sinkBlobContainer", outputBlobPath },
{ "receiver", emailReceiver }
};

CreateRunResponse runResponse = client.Pipelines.CreateRunWithHttpMessagesAsync(resourceGroup,


dataFactoryName, pipelineName, arguments).Result.Body;
Console.WriteLine("Pipeline run ID: " + runResponse.RunId);

Main class
Your final Main method should look like this. Build and run your program to trigger a pipeline run!
// Authenticate and create a data factory management client
var context = new AuthenticationContext("https://login.windows.net/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync("https://management.azure.com/", cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) { SubscriptionId = subscriptionId };

Factory df = CreateOrUpdateDataFactory(client);

client.LinkedServices.CreateOrUpdate(resourceGroup, dataFactoryName, storageLinkedServiceName,


StorageLinkedServiceDefinition(client));
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobSourceDatasetName,
SourceBlobDatasetDefinition(client));
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobSinkDatasetName,
SinkBlobDatasetDefinition(client));

client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName, PipelineDefinition(client));

Console.WriteLine("Creating pipeline run...");


Dictionary<string, object> arguments = new Dictionary<string, object>
{
{ "sourceBlobContainer", inputBlobPath },
{ "sinkBlobContainer", outputBlobPath },
{ "receiver", emailReceiver }
};

CreateRunResponse runResponse = client.Pipelines.CreateRunWithHttpMessagesAsync(resourceGroup,


dataFactoryName, pipelineName, arguments).Result.Body;
Console.WriteLine("Pipeline run ID: " + runResponse.RunId);

Monitor a pipeline run


1. Add the following code to the Main method to continuously check the status of the pipeline run until it
finishes copying the data.

// Monitor the pipeline run


Console.WriteLine("Checking pipeline run status...");
PipelineRun pipelineRun;
while (true)
{
pipelineRun = client.PipelineRuns.Get(resourceGroup, dataFactoryName, runResponse.RunId);
Console.WriteLine("Status: " + pipelineRun.Status);
if (pipelineRun.Status == "InProgress")
System.Threading.Thread.Sleep(15000);
else
break;
}

2. Add the following code to the Main method that retrieves copy activity run details, for example, size of the
data read/written.
// Check the copy activity run details
Console.WriteLine("Checking copy activity run details...");

List<ActivityRun> activityRuns = client.ActivityRuns.ListByPipelineRun(


resourceGroup, dataFactoryName, runResponse.RunId, DateTime.UtcNow.AddMinutes(-10),
DateTime.UtcNow.AddMinutes(10)).ToList();

if (pipelineRun.Status == "Succeeded")
{
Console.WriteLine(activityRuns.First().Output);
//SaveToJson(SafeJsonConvert.SerializeObject(activityRuns.First().Output,
client.SerializationSettings), "ActivityRunResult.json", folderForJsons);
}
else
Console.WriteLine(activityRuns.First().Error);

Console.WriteLine("\nPress any key to exit...");


Console.ReadKey();

Run the code


Build and start the application, then verify the pipeline execution. The console prints the progress of creating the data
factory, linked service, datasets, pipeline, and pipeline run. It then checks the pipeline run status. Wait until you see
the copy activity run details with the data read/written size. Then, use a tool such as Azure Storage Explorer to check
that the blob was copied from "inputBlobPath" to "outputBlobPath", as you specified in the variables.
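If you prefer the command line to Storage Explorer for this check, a minimal PowerShell sketch such as the following (assuming the Az.Storage module is installed; the account name and key are placeholders you replace with the values you configured earlier) lists the blobs copied to the output folder:

# Minimal sketch: list the blobs copied to the output folder of the adfv2branch container.
$ctx = New-AzStorageContext -StorageAccountName "<Azure Storage account name>" -StorageAccountKey "<Azure Storage account key>"
Get-AzStorageBlob -Container "adfv2branch" -Prefix "output" -Context $ctx | Select-Object Name, Length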
Sample output:

Creating data factory DFTutorialTest...


{
"location": "East US"
}
Creating linked service AzureStorageLinkedService...
{
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=***;AccountKey=***"
}
}
}
Creating dataset SourceStorageDataset...
{
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"type": "Expression",
"value": "@pipeline().parameters.sourceBlobContainer"
},
"fileName": "input.txt"
},
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "AzureStorageLinkedService"
}
}
Creating dataset SinkStorageDataset...
{
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"type": "Expression",
"value": "@pipeline().parameters.sinkBlobContainer"
}
},
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "AzureStorageLinkedService"
}
}
Creating pipeline Adfv2TutorialBranchCopy...
{
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"inputs": [
{
"type": "DatasetReference",
"referenceName": "SourceStorageDataset"
}
],
"outputs": [
{
"type": "DatasetReference",
"referenceName": "SinkStorageDataset"
}
],
"name": "CopyBlobtoBlob"
},
{
"type": "WebActivity",
"typeProperties": {
"method": "POST",
"url": "https://xxxx.eastus.logic.azure.com:443/workflows/... ",
"body": {
"message": "@{activity('CopyBlobtoBlob').output.dataWritten}",
"dataFactoryName": "@{pipeline().DataFactory}",
"pipelineName": "@{pipeline().Pipeline}",
"receiver": "@pipeline().parameters.receiver"
}
},
"name": "SendSuccessEmailActivity",
"dependsOn": [
{
"activity": "CopyBlobtoBlob",
"dependencyConditions": [
"Succeeded"
]
}
]
},
{
"type": "WebActivity",
"typeProperties": {
"method": "POST",
"url": "https://xxx.eastus.logic.azure.com:443/workflows/... ",
"body": {
"message": "@{activity('CopyBlobtoBlob').error.message}",
"dataFactoryName": "@{pipeline().DataFactory}",
"pipelineName": "@{pipeline().Pipeline}",
"receiver": "@pipeline().parameters.receiver"
}
},
"name": "SendFailEmailActivity",
"dependsOn": [
{
"activity": "CopyBlobtoBlob",
"dependencyConditions": [
"Failed"
]
}
]
}
],
"parameters": {
"sourceBlobContainer": {
"type": "String"
},
"sinkBlobContainer": {
"type": "String"
},
"receiver": {
"type": "String"
}
}
}
}
Creating pipeline run...
Pipeline run ID: 00000000-0000-0000-0000-0000000000000
Checking pipeline run status...
Status: InProgress
Status: InProgress
Status: Succeeded
Checking copy activity run details...
{
"dataRead": 20,
"dataWritten": 20,
"copyDuration": 4,
"throughput": 0.01,
"errors": [],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)"
}
{}

Press any key to exit...

Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create an Azure Storage linked service.
Create an Azure Blob dataset
Create a pipeline that contains a copy activity and a web activity
Send outputs of activities to subsequent activities
Utilize parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
You can now proceed to the Concepts section for more information about Azure Data Factory.
Pipelines and activities
Provision the Azure-SSIS Integration Runtime in
Azure Data Factory
3/5/2019 • 9 minutes to read • Edit Online

This tutorial provides steps for using the Azure portal to provision an Azure-SSIS integration runtime (IR) in Azure
Data Factory. Then, you can use SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS) to
deploy and run SQL Server Integration Services (SSIS) packages in this runtime in Azure. For conceptual
information on Azure-SSIS IRs, see Azure-SSIS integration runtime overview.
In this tutorial, you complete the following steps:
Create a data factory.
Provision an Azure-SSIS integration runtime.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
Azure SQL Database server. If you don't already have a database server, create one in the Azure portal before
you get started. Azure Data Factory creates the SSIS Catalog (SSISDB database) on this database server. We
recommend that you create the database server in the same Azure region as the integration runtime. This
configuration lets the integration runtime write execution logs to the SSISDB database without crossing Azure
regions.
Based on the selected database server, SSISDB can be created on your behalf as a single database, as part of an
elastic pool, or in a Managed Instance, and it can be accessible in a public network or by joining a virtual network. If you
use Azure SQL Database with virtual network service endpoints/Managed Instance to host SSISDB, or if you require
access to on-premises data, you need to join your Azure-SSIS IR to a virtual network. See Create Azure-SSIS IR
in a virtual network.
Confirm that the Allow access to Azure services setting is enabled for the database server. This setting is not
applicable when you use Azure SQL Database with virtual network service endpoints/Managed Instance to
host SSISDB. For more information, see Secure your Azure SQL database. To enable this setting by using
PowerShell, see New-AzSqlServerFirewallRule. A minimal PowerShell sketch appears after this list.
Add the IP address of the client machine, or a range of IP addresses that includes the IP address of the client
machine, to the client IP address list in the firewall settings for the database server. For more information, see
Azure SQL Database server-level and database-level firewall rules.
You can connect to the database server using SQL authentication with your server admin credentials, or Azure
Active Directory (AAD) authentication with the managed identity for your Azure Data Factory (ADF). For the
latter, you need to add the managed identity for your ADF into an AAD group with access permissions to the
database server. See Create Azure-SSIS IR with AAD authentication.
Confirm that your Azure SQL Database server does not have an SSIS Catalog (SSISDB database). The
provisioning of an Azure-SSIS IR does not support using an existing SSIS Catalog.
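The following is a minimal PowerShell sketch of the two firewall prerequisites above (an illustration only; the resource group, server name, and client IP address are placeholders you replace with your own values, and it assumes the Az.Sql module is installed):

# Minimal sketch: allow Azure services (such as the Azure-SSIS IR) to reach the database server.
New-AzSqlServerFirewallRule -ResourceGroupName "<resource group>" -ServerName "<server name>" -AllowAllAzureIPs

# Allow your client machine's IP address so that you can connect with SSMS/SSDT.
New-AzSqlServerFirewallRule -ResourceGroupName "<resource group>" -ServerName "<server name>" `
    -FirewallRuleName "ClientIPAddress" -StartIpAddress "<client IP>" -EndIpAddress "<client IP>"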
NOTE
For a list of Azure regions in which Data Factory and Azure-SSIS Integration Runtime are currently available, see ADF +
SSIS IR availability by region.

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. Sign in to the Azure portal.
3. Select New on the left menu, select Data + Analytics, and then select Data Factory.

4. On the New data factory page, enter MyAzureSsisDataFactory under Name.


The name of the Azure data factory must be globally unique. If you receive the following error, change the
name of the data factory (for example, <yourname>MyAzureSsisDataFactory) and try creating again.
For naming rules for Data Factory artifacts, see the Data Factory - naming rules article.
Data factory name “MyAzureSsisDataFactory” is not available

5. For Subscription, select your Azure subscription in which you want to create the data factory.
6. For Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. For Version, select V2 (Preview).
8. For Location, select the location for the data factory. The list shows only locations that are supported for the
creation of data factories.
9. Select Pin to dashboard.
10. Select Create.
11. On the dashboard, you see the following tile with the status Deploying data factory:

12. After the creation is complete, you see the Data factory page.
13. Select Author & Monitor to open the Data Factory user interface (UI) on a separate tab.

Create an Azure-SSIS integration runtime


From the Data Factory overview
1. On the Let's get started page, select the Configure SSIS Integration Runtime tile.

2. For the remaining steps to set up an Azure-SSIS IR, see the Provision an Azure-SSIS integration runtime
section.
From the Authoring UI
1. In the Azure Data Factory UI, switch to the Edit tab, select Connections, and then switch to the
Integration Runtimes tab to view existing integration runtimes in your data factory.
2. Select New to create an Azure-SSIS IR.

3. In the Integration Runtime Setup window, select Lift-and-shift existing SSIS packages to execute in
Azure, and then select Next.
4. For the remaining steps to set up an Azure-SSIS IR, see the Provision an Azure-SSIS integration runtime
section.

Provision an Azure-SSIS integration runtime


1. On the General Settings page of Integration Runtime Setup, complete the following steps:
a. For Name, enter the name of your integration runtime.
b. For Description, enter the description of your integration runtime.
c. For Location, select the location of your integration runtime. Only supported locations are displayed. We
recommend that you select the same location of your database server to host SSISDB.
d. For Node Size, select the size of the node in your integration runtime cluster. Only supported node sizes are
displayed. Select a large node size (scale up) if you want to run many compute- or memory-intensive
packages.
e. For Node Number, select the number of nodes in your integration runtime cluster. Only supported node
numbers are displayed. Select a large cluster with many nodes (scale out) if you want to run many packages
in parallel.
f. For Edition/License, select the SQL Server edition/license for your integration runtime: Standard or
Enterprise. Select Enterprise if you want to use advanced/premium features on your integration runtime.
g. For Save Money, select the Azure Hybrid Benefit (AHB) option for your integration runtime: Yes or No.
Select Yes if you want to bring your own SQL Server license with Software Assurance to benefit from cost
savings with hybrid use.
h. Click Next.
2. On the SQL Settings page, complete the following steps:

a. For Subscription, select the Azure subscription that has your database server to host SSISDB.
b. For Location, select the location of your database server to host SSISDB. We recommend that you select
the same location of your integration runtime.
c. For Catalog Database Server Endpoint, select the endpoint of your database server to host SSISDB.
Based on the selected database server, SSISDB can be created on your behalf as a single database, part of
an elastic pool, or in a Managed Instance and accessible in public network or by joining a virtual network.
For guidance in choosing the type of database server to host SSISDB, see Compare Azure SQL Database
single databases/elastic pools and Managed Instance. If you select Azure SQL Database with virtual
network service endpoints/Managed Instance to host SSISDB or require access to on-premises data, you
need to join your Azure-SSIS IR to a virtual network. See Create Azure-SSIS IR in a virtual network.
d. On the Use AAD authentication... checkbox, select the authentication method for your database server to
host SSISDB: SQL, or Azure Active Directory (AAD) with the managed identity for your Azure Data Factory
(ADF). If you check it, you need to add the managed identity for your ADF into an AAD group with access
permissions to the database server. See Create Azure-SSIS IR with AAD authentication.
e. For Admin Username, enter SQL authentication username for your database server to host SSISDB.
f. For Admin Password, enter SQL authentication password for your database server to host SSISDB.
g. For Catalog Database Service Tier, select the service tier for your database server to host SSISDB:
Basic/Standard/Premium tier or elastic pool name.
h. Click Test Connection and if successful, click Next.
3. On the Advanced Settings page, complete the following steps:
a. For Maximum Parallel Executions Per Node, select the maximum number of packages to execute
concurrently per node in your integration runtime cluster. Only supported package numbers are displayed.
Select a low number if you want to use more than one core to run a single large, compute- or memory-intensive
package. Select a high number if you want to run one or more small, lightweight packages on a single core.
b. For Custom Setup Container SAS URI, optionally enter the Shared Access Signature (SAS) Uniform
Resource Identifier (URI) of the Azure Storage blob container where your setup script and its associated
files are stored. See Custom setup for Azure-SSIS IR.
c. On the Select a VNet... checkbox, select whether you want to join your integration runtime to a virtual
network. Check it if you use Azure SQL Database with virtual network service endpoints/Managed Instance
to host SSISDB, or if you require access to on-premises data. See Create Azure-SSIS IR in a virtual network.
4. Click Finish to start the creation of your integration runtime.

IMPORTANT
This process takes approximately 20 to 30 minutes to complete.
The Data Factory service connects to your Azure SQL Database server to prepare the SSIS Catalog (SSISDB database).
When you provision an instance of an Azure-SSIS IR, the Azure Feature Pack for SSIS and the Access Redistributable
are also installed. These components provide connectivity to Excel and Access files and to various Azure data sources,
in addition to the data sources supported by the built-in components. You can also install additional components. For
more info, see Custom setup for the Azure-SSIS integration runtime.

5. On the Connections tab, switch to Integration Runtimes if needed. Select Refresh to refresh the status.

6. Use the links in the Actions column to stop/start, edit, or delete the integration runtime. Use the last link to
view JSON code for the integration runtime. The edit and delete buttons are enabled only when the IR is
stopped.
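If you prefer scripting these management actions, the following is a minimal PowerShell sketch (assuming the Az.DataFactory module is installed; the resource group, data factory, and integration runtime names are placeholders you replace with your own values):

# Minimal sketch: check the current status of the Azure-SSIS IR.
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<resource group>" `
    -DataFactoryName "<data factory name>" -Name "<Azure-SSIS IR name>" -Status

# Stop the IR when you are not running packages, to avoid unnecessary charges.
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<resource group>" `
    -DataFactoryName "<data factory name>" -Name "<Azure-SSIS IR name>" -Force

# Start it again before the next package run.
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<resource group>" `
    -DataFactoryName "<data factory name>" -Name "<Azure-SSIS IR name>" -Force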

Deploy SSIS packages


Now, use SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS) to deploy your SSIS
packages to Azure. Connect to the Azure SQL Database server that hosts the SSIS Catalog (SSISDB database).
The name of the Azure SQL Database server is in the format <servername>.database.windows.net.
See the following articles from the SSIS documentation:
Deploy, run, and monitor an SSIS package on Azure
Connect to the SSIS Catalog on Azure
Schedule package execution on Azure
Connect to on-premises data sources with Windows authentication

Next steps
In this tutorial, you learned how to:
Create a data factory.
Provision an Azure-SSIS integration runtime.
To learn about customizing your Azure-SSIS integration runtime, advance to the following article:
Customize Azure-SSIS IR
Provision the Azure-SSIS Integration Runtime in
Azure Data Factory with PowerShell
3/15/2019 • 11 minutes to read • Edit Online

This tutorial provides steps for provisioning an Azure-SSIS integration runtime (IR) in Azure Data Factory. Then,
you can use SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS) to deploy and run SQL
Server Integration Services (SSIS) packages in this runtime in Azure. In this tutorial, you do the following steps:

NOTE
This article uses Azure PowerShell to provision an Azure SSIS IR. To use the Data Factory user interface (UI) to provision an
Azure SSIS IR, see Tutorial: Create an Azure SSIS integration runtime.

Create a data factory.


Create an Azure-SSIS integration runtime
Start the Azure-SSIS integration runtime
Deploy SSIS packages
Review the complete script

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure subscription. If you don't have an Azure subscription, create a free account before you begin. For
conceptual information on Azure-SSIS IR, see Azure-SSIS integration runtime overview.
Azure SQL Database server. If you don't already have a database server, create one in the Azure portal before
you get started. This server hosts the SSIS Catalog database (SSISDB ). We recommend that you create the
database server in the same Azure region as the integration runtime. This configuration lets the integration
runtime write execution logs to SSISDB without crossing Azure regions.
Based on the selected database server, SSISDB can be created on your behalf as a single database, as part
of an elastic pool, or in a Managed Instance, and it can be accessible in a public network or by joining a
virtual network. For guidance in choosing the type of database server to host SSISDB, see Compare Azure SQL
Database single databases/elastic pools and Managed Instance. If you use Azure SQL Database with
virtual network service endpoints/Managed Instance to host SSISDB, or if you require access to on-premises
data, you need to join your Azure-SSIS IR to a virtual network. See Create Azure-SSIS IR in a virtual
network.
Confirm that the "Allow access to Azure services" setting is ON for the database server. This setting is
not applicable when you use Azure SQL Database with virtual network service endpoints/Managed
Instance to host SSISDB. For more information, see Secure your Azure SQL database. To enable this
setting by using PowerShell, see New-AzSqlServerFirewallRule.
Add the IP address of the client machine or a range of IP addresses that includes the IP address of client
machine to the client IP address list in the firewall settings for the database server. For more information,
see Azure SQL Database server-level and database-level firewall rules.
You can connect to the database server using SQL authentication with your server admin credentials, or
Azure Active Directory (AAD) authentication with the managed identity for your Azure Data Factory. For
the latter, you need to add the managed identity for your ADF into an AAD group with access
permissions to the database server. See Create Azure-SSIS IR with AAD authentication.
Confirm that your Azure SQL Database server does not have an SSIS Catalog (SSISDB database). The
provisioning of Azure-SSIS IR does not support using an existing SSIS Catalog.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell. You use
PowerShell to run a script to provision an Azure-SSIS integration runtime that runs SSIS packages in the cloud.
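If Azure PowerShell is not yet set up, a minimal sketch of the installation and sign-in looks like the following (parameter choices such as the installation scope are illustrative; skip this if the Az module is already installed):

# Minimal sketch: install the Az module for the current user and sign in.
Install-Module -Name Az -Repository PSGallery -Scope CurrentUser -AllowClobber
Connect-AzAccount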

NOTE
For a list of Azure regions in which Data Factory and Azure-SSIS Integration Runtime are currently available, see ADF +
SSIS IR availability by region.

Launch Windows PowerShell ISE


Start Windows PowerShell ISE with administrative privileges.

Create variables
Copy and paste the following script, and specify values for the variables. For a list of supported pricing tiers for Azure
SQL Database, see SQL Database resource limits.
# Azure Data Factory information
# If your input contains a PSH special character, e.g. "$", precede it with the escape character "`" like "`$"
$SubscriptionName = "[Azure subscription name]"
$ResourceGroupName = "[Azure resource group name]"
# Data factory name. Must be globally unique
$DataFactoryName = "[Data factory name]"
$DataFactoryLocation = "EastUS"

# Azure-SSIS integration runtime information - This is a Data Factory compute resource for running SSIS
packages
$AzureSSISName = "[Specify a name for your Azure-SSIS IR]"
$AzureSSISDescription = "[Specify a description for your Azure-SSIS IR]"
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium features
on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your own
on-premises SQL Server license to earn cost savings from Azure Hybrid Benefit (AHB) option
# For a Standard_D1_v2 node, 1-4 parallel executions per node are supported, but for other nodes, 1-8 are
currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info
$SetupScriptContainerSasUri = "" # OPTIONAL to provide SAS URI of blob container where your custom setup script
and its associated files are stored

# SSISDB info
$SSISDBServerEndpoint = "[your Azure SQL Database server name].database.windows.net" # WARNING: Please ensure
that there is no existing SSISDB, so we can prepare and manage one on your behalf
$SSISDBServerAdminUserName = "[your server admin username for SQL authentication]"
$SSISDBServerAdminPassword = "[your server admin password for SQL authentication]"
# For the basic pricing tier, specify "Basic", not "B" - For standard/premium/elastic pool tiers, specify "S0",
"S1", "S2", "S3", etc.
$SSISDBPricingTier = "[Basic|S0|S1|S2|S3|S4|S6|S7|S9|S12|P1|P2|P4|P6|P11|P15|…|ELASTIC_POOL(name =
<elastic_pool_name>)]"

Validate the connection to database


Add the following script to validate your Azure SQL Database server, <servername>.database.windows.net .

$SSISDBConnectionString = "Data Source=" + $SSISDBServerEndpoint + ";User ID=" + $SSISDBServerAdminUserName +


";Password=" + $SSISDBServerAdminPassword
$sqlConnection = New-Object System.Data.SqlClient.SqlConnection $SSISDBConnectionString;
Try
{
$sqlConnection.Open();
}
Catch [System.Data.SqlClient.SqlException]
{
Write-Warning "Cannot connect to your Azure SQL Database server, exception: $_";
Write-Warning "Please make sure the server you specified has already been created. Do you want to proceed?
[Y/N]"
$yn = Read-Host
if(!($yn -ieq "Y"))
{
Return;
}
}

To create an Azure SQL Database server as part of the script, see the following example. Set values for the
variables that haven't been defined already (for example, SSISDBServerName and FirewallIPAddress).

New-AzSqlServer -ResourceGroupName $ResourceGroupName `


-ServerName $SSISDBServerName `
-Location $DataFactoryLocation `
-SqlAdministratorCredentials $(New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList
$SSISDBServerAdminUserName, $(ConvertTo-SecureString -String $SSISDBServerAdminPassword -AsPlainText -Force))

New-AzSqlServerFirewallRule -ResourceGroupName $ResourceGroupName `


-ServerName $SSISDBServerName `
-FirewallRuleName "ClientIPAddress_$today" -StartIpAddress $FirewallIPAddress -EndIpAddress
$FirewallIPAddress

New-AzSqlServerFirewallRule -ResourceGroupName $ResourceGroupName -ServerName $SSISDBServerName -


AllowAllAzureIPs

Log in and select subscription


Add the following code to the script to log in and select your Azure subscription:

Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName

Create a resource group


Create an Azure resource group using the New-AzResourceGroup command. A resource group is a logical
container into which Azure resources are deployed and managed as a group. The following command creates a
resource group with the name and location that you specified in the variables.
If your resource group already exists, don't copy this code to your script.

New-AzResourceGroup -Location $DataFactoryLocation -Name $ResourceGroupName

Create a data factory


Run the following command to create a data factory:

Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `


-Location $DataFactoryLocation `
-Name $DataFactoryName

Create an integration runtime


Run the following command to create an Azure-SSIS integration runtime that runs SSIS packages in Azure:
$secpasswd = ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force
$serverCreds = New-Object System.Management.Automation.PSCredential($SSISDBServerAdminUserName, $secpasswd)

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `


-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Description $AzureSSISDescription `
-Type Managed `
-Location $AzureSSISLocation `
-NodeSize $AzureSSISNodeSize `
-NodeCount $AzureSSISNodeNumber `
-Edition $AzureSSISEdition `
-LicenseType $AzureSSISLicenseType `
-MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode `
-CatalogServerEndpoint $SSISDBServerEndpoint `
-CatalogAdminCredential $serverCreds `
-CatalogPricingTier $SSISDBPricingTier

if(![string]::IsNullOrEmpty($SetupScriptContainerSasUri))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-SetupScriptContainerSasUri $SetupScriptContainerSasUri
}

Start integration runtime


Run the following command to start the Azure-SSIS integration runtime:

write-host("##### Starting your Azure-SSIS integration runtime. This command takes 20 to 30 minutes to
complete. #####")
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force

write-host("##### Completed #####")


write-host("If any cmdlet is unsuccessful, please consider using -Debug option for diagnostics.")

This command takes from 20 to 30 minutes to complete.
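If you want to check progress from a second PowerShell session while the start command runs, a sketch like the following polls the runtime state (the State property reflects the current Az.DataFactory module and may vary by version; the variables are the ones defined at the beginning of the script):

# Minimal sketch: poll the Azure-SSIS IR until it reports Started.
while ($true)
{
    $status = Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
        -DataFactoryName $DataFactoryName -Name $AzureSSISName -Status
    write-host("Current state: " + $status.State)
    if ($status.State -eq "Started") { break }
    Start-Sleep -Seconds 60
}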

Deploy SSIS packages


Now, use SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS) to deploy your SSIS
packages to Azure. Connect to the Azure SQL Database server that hosts the SSIS Catalog (SSISDB). The name of
the Azure SQL Database server is in the format <servername>.database.windows.net.
See the following articles from SSIS documentation:
Deploy, run, and monitor an SSIS package on Azure
Connect to SSIS catalog on Azure
Schedule package execution on Azure
Connect to on-premises data sources with Windows authentication

Full script
The PowerShell script in this section configures an instance of Azure-SSIS integration runtime in the cloud that
runs SSIS packages. After you run this script successfully, you can deploy and run SSIS packages in the Microsoft
Azure cloud with SSISDB hosted in Azure SQL Database.
1. Launch the Windows PowerShell Integrated Scripting Environment (ISE ).
2. In the ISE, run the following command from the command prompt.

Set-ExecutionPolicy Unrestricted -Scope CurrentUser

3. Copy the PowerShell script in this section and paste it into the ISE.
4. Provide appropriate values for all parameters at the beginning of the script.
5. Run the script. The Start-AzDataFactoryV2IntegrationRuntime command near the end of the script runs for 20 to
30 minutes.

NOTE
The script connects to your Azure SQL Database server to prepare the SSIS Catalog database (SSISDB).
When you provision an instance of Azure-SSIS IR, the Azure Feature Pack for SSIS and the Access Redistributable are
also installed. These components provide connectivity to Excel and Access files and to various Azure data sources, in
addition to the data sources supported by the built-in components. You can also install additional components. For
more info, see Custom setup for the Azure-SSIS integration runtime.

For a list of supported pricing tiers for Azure SQL Database, see SQL Database resource limits.
For a list of Azure regions in which Data Factory and Azure-SSIS Integration Runtime are currently available, see
ADF + SSIS IR availability by region.

# Azure Data Factory information


# If your input contains a PSH special character, e.g. "$", precede it with the escape character "`" like "`$"
$SubscriptionName = "[Azure subscription name]"
$ResourceGroupName = "[Azure resource group name]"
# Data factory name. Must be globally unique
$DataFactoryName = "[Data factory name]"
$DataFactoryLocation = "EastUS"

# Azure-SSIS integration runtime information - This is a Data Factory compute resource for running SSIS
packages
$AzureSSISName = "[Specify a name for your Azure-SSIS IR]"
$AzureSSISDescription = "[Specify a description for your Azure-SSIS IR]"
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium features
on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your own
on-premises SQL Server license to earn cost savings from Azure Hybrid Benefit (AHB) option
# For a Standard_D1_v2 node, 1-4 parallel executions per node are supported, but for other nodes, 1-8 are
currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info
$SetupScriptContainerSasUri = "" # OPTIONAL to provide SAS URI of blob container where your custom setup script
and its associated files are stored

# SSISDB info
$SSISDBServerEndpoint = "[your Azure SQL Database server name].database.windows.net" # WARNING: Please ensure
that there is no existing SSISDB, so we can prepare and manage one on your behalf
$SSISDBServerAdminUserName = "[your server admin username for SQL authentication]"
$SSISDBServerAdminPassword = "[your server admin password for SQL authentication]"
# For the basic pricing tier, specify "Basic", not "B" - For standard/premium/elastic pool tiers, specify "S0",
"S1", "S2", "S3", etc.
$SSISDBPricingTier = "[Basic|S0|S1|S2|S3|S4|S6|S7|S9|S12|P1|P2|P4|P6|P11|P15|…|ELASTIC_POOL(name =
<elastic_pool_name>)]"

$SSISDBConnectionString = "Data Source=" + $SSISDBServerEndpoint + ";User ID=" + $SSISDBServerAdminUserName +


";Password=" + $SSISDBServerAdminPassword
$sqlConnection = New-Object System.Data.SqlClient.SqlConnection $SSISDBConnectionString;
Try
{
$sqlConnection.Open();
}
Catch [System.Data.SqlClient.SqlException]
{
Write-Warning "Cannot connect to your Azure SQL Database server, exception: $_";
Write-Warning "Please make sure the server you specified has already been created. Do you want to proceed?
[Y/N]"
$yn = Read-Host
if(!($yn -ieq "Y"))
{
Return;
}
}

Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName

Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `


-Location $DataFactoryLocation `
-Name $DataFactoryName

$secpasswd = ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force


$serverCreds = New-Object System.Management.Automation.PSCredential($SSISDBServerAdminUserName, $secpasswd)

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `


-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Description $AzureSSISDescription `
-Type Managed `
-Location $AzureSSISLocation `
-NodeSize $AzureSSISNodeSize `
-NodeCount $AzureSSISNodeNumber `
-Edition $AzureSSISEdition `
-LicenseType $AzureSSISLicenseType `
-MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode `
-CatalogServerEndpoint $SSISDBServerEndpoint `
-CatalogAdminCredential $serverCreds `
-CatalogPricingTier $SSISDBPricingTier

if(![string]::IsNullOrEmpty($SetupScriptContainerSasUri))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-SetupScriptContainerSasUri $SetupScriptContainerSasUri
}

write-host("##### Starting your Azure-SSIS integration runtime. This command takes 20 to 30 minutes to
complete. #####")
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force

write-host("##### Completed #####")


write-host("If any cmdlet is unsuccessful, please consider using -Debug option for diagnostics.")
write-host("If any cmdlet is unsuccessful, please consider using -Debug option for diagnostics.")

Join Azure-SSIS IR to a virtual network


If you use Azure SQL Database with virtual network service endpoints/Managed Instance that joins a virtual
network to host SSISDB, you must also join your Azure-SSIS integration runtime to the same virtual network.
Azure Data Factory lets you join your Azure-SSIS integration runtime to a virtual network. For more information,
see Join Azure-SSIS integration runtime to a virtual network.
For a full script to create an Azure-SSIS integration runtime that joins a virtual network, see Create an Azure-SSIS
integration runtime.

Monitor and manage Azure-SSIS IR


See the following articles for details about monitoring and managing an Azure-SSIS IR.
Monitor an Azure-SSIS integration runtime
Manage an Azure-SSIS integration runtime

Next steps
In this tutorial, you learned how to:
Create a data factory.
Create an Azure-SSIS integration runtime
Start the Azure-SSIS integration runtime
Deploy SSIS packages
Review the complete script
To learn about customizing your Azure-SSIS integration runtime, advance to the following article:
Customize Azure-SSIS IR
Azure PowerShell samples for Azure Data Factory

The following table includes links to sample Azure PowerShell scripts for Azure Data Factory.

Copy data

Copy blobs from a folder to another folder in an Azure Blob Storage - This PowerShell script copies blobs from a folder in Azure Blob Storage to another folder in the same Blob Storage.

Copy data from on-premises SQL Server to Azure Blob Storage - This PowerShell script copies data from an on-premises SQL Server database to an Azure blob storage.

Bulk copy - This sample PowerShell script copies data from multiple tables in an Azure SQL database to an Azure SQL data warehouse.

Incremental copy - This sample PowerShell script loads only new or updated records from a source data store to a sink data store after the initial full copy of data from the source to the sink.

Transform data

Transform data using a Spark cluster - This PowerShell script transforms data by running a program on a Spark cluster.

Lift and shift SSIS packages to Azure

Create Azure-SSIS integration runtime - This PowerShell script provisions an Azure-SSIS integration runtime that runs SQL Server Integration Services (SSIS) packages in Azure.
Pipelines and activities in Azure Data Factory

This article helps you understand pipelines and activities in Azure Data Factory and use them to
construct end-to-end data-driven workflows for your data movement and data processing scenarios.

Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that
together perform a task. For example, a pipeline could contain a set of activities that ingest and clean
log data, and then kick off a Spark job on an HDInsight cluster to analyze the log data. The beauty of
this is that the pipeline allows you to manage the activities as a set instead of each one individually.
For example, you can deploy and schedule the pipeline, instead of the activities independently.
The activities in a pipeline define actions to perform on your data. For example, you may use a copy
activity to copy data from an on-premises SQL Server to an Azure Blob Storage. Then, use a Hive
activity that runs a Hive script on an Azure HDInsight cluster to process/transform data from the
blob storage to produce output data. Finally, use a second copy activity to copy the output data to an
Azure SQL Data Warehouse on top of which business intelligence (BI) reporting solutions are built.
Data Factory supports three types of activities: data movement activities, data transformation
activities, and control activities. An activity can take zero or more input datasets and produce one or
more output datasets. The following diagram shows the relationship between pipeline, activity, and
dataset in Data Factory:

An input dataset represents the input for an activity in the pipeline and an output dataset represents
the output for the activity. Datasets identify data within different data stores, such as tables, files,
folders, and documents. After you create a dataset, you can use it with activities in a pipeline. For
example, a dataset can be an input/output dataset of a Copy Activity or an HDInsightHive Activity.
For more information about datasets, see Datasets in Azure Data Factory article.

Data movement activities


Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory
supports the data stores listed in the table in this section. Data from any source can be written to any
sink. Click a data store to learn how to copy data to and from that store.

CATEGORY    DATA STORE    SUPPORTED AS A SOURCE    SUPPORTED AS A SINK    SUPPORTED BY AZURE IR    SUPPORTED BY SELF-HOSTED IR

Azure Azure Blob ✓ ✓ ✓ ✓


Storage

Azure Cosmos ✓ ✓ ✓ ✓
DB (SQL API)

Azure Cosmos ✓ ✓ ✓ ✓
DB's API for
MongoDB

Azure Data ✓ ✓ ✓ ✓
Explorer

Azure Data ✓ ✓ ✓ ✓
Lake Storage
Gen1

Azure Data ✓ ✓ ✓ ✓
Lake Storage
Gen2

Azure ✓ ✓ ✓
Database for
MariaDB

Azure ✓ ✓ ✓
Database for
MySQL

Azure ✓ ✓ ✓
Database for
PostgreSQL

Azure File ✓ ✓ ✓ ✓
Storage

Azure SQL ✓ ✓ ✓ ✓
Database

Azure SQL ✓ ✓ ✓
Database
Managed
Instance

Azure SQL ✓ ✓ ✓ ✓
Data
Warehouse

Azure Search ✓ ✓ ✓
Index

Azure Table ✓ ✓ ✓ ✓
Storage

Database Amazon ✓ ✓ ✓
Redshift

DB2 ✓ ✓ ✓

Drill (Preview) ✓ ✓ ✓

Google ✓ ✓ ✓
BigQuery

Greenplum ✓ ✓ ✓

HBase ✓ ✓ ✓

Hive ✓ ✓ ✓

Apache Impala ✓ ✓ ✓
(Preview)

Informix ✓ ✓

MariaDB ✓ ✓ ✓

Microsoft ✓ ✓
Access

MySQL ✓ ✓ ✓

Netezza ✓ ✓ ✓

Oracle ✓ ✓ ✓ ✓

Phoenix ✓ ✓ ✓

PostgreSQL ✓ ✓ ✓

Presto ✓ ✓ ✓
(Preview)

SAP Business ✓ ✓
Warehouse
Open Hub

SAP Business ✓ ✓
Warehouse via
MDX

SAP HANA ✓ ✓ ✓

SAP Table ✓ ✓ ✓

Spark ✓ ✓ ✓

SQL Server ✓ ✓ ✓ ✓

Sybase ✓ ✓

Teradata ✓ ✓

Vertica ✓ ✓ ✓

NoSQL Cassandra ✓ ✓ ✓

Couchbase ✓ ✓ ✓
(Preview)

MongoDB ✓ ✓ ✓

File Amazon S3 ✓ ✓ ✓

File System ✓ ✓ ✓ ✓

FTP ✓ ✓ ✓

Google Cloud ✓ ✓ ✓
Storage

HDFS ✓ ✓ ✓

SFTP ✓ ✓ ✓

Generic Generic HTTP ✓ ✓ ✓


protocol

Generic OData ✓ ✓ ✓

Generic ODBC ✓ ✓ ✓

Generic REST ✓ ✓ ✓

Services and Amazon ✓ ✓ ✓


apps Marketplace
Web Service
(Preview)

Common Data ✓ ✓ ✓ ✓
Service for
Apps

Concur ✓ ✓ ✓
(Preview)

Dynamics 365 ✓ ✓ ✓ ✓

Dynamics AX ✓ ✓ ✓
(Preview)

Dynamics ✓ ✓ ✓ ✓
CRM

Google ✓ ✓ ✓
AdWords
(Preview)

HubSpot ✓ ✓ ✓
(Preview)

Jira (Preview) ✓ ✓ ✓

Magento ✓ ✓ ✓
(Preview)

Marketo ✓ ✓ ✓
(Preview)

Office 365 ✓ ✓ ✓

Oracle Eloqua ✓ ✓ ✓
(Preview)

Oracle ✓ ✓ ✓
Responsys
(Preview)

Oracle Service ✓ ✓ ✓
Cloud
(Preview)

Paypal ✓ ✓ ✓
(Preview)

QuickBooks ✓ ✓ ✓
(Preview)

Salesforce ✓ ✓ ✓ ✓

Salesforce ✓ ✓ ✓ ✓
Service Cloud

Salesforce ✓ ✓ ✓
Marketing
Cloud
(Preview)

SAP Cloud for ✓ ✓ ✓ ✓


Customer
(C4C)

SAP ECC ✓ ✓ ✓

ServiceNow ✓ ✓ ✓

Shopify ✓ ✓ ✓
(Preview)

Square ✓ ✓ ✓
(Preview)

Web Table ✓ ✓
(HTML table)

Xero (Preview) ✓ ✓ ✓

Zoho (Preview) ✓ ✓ ✓

NOTE
Any connector marked as Preview means that you can try it out and give us feedback. If you want to take a
dependency on preview connectors in your solution, please contact Azure support.

For more information, see Copy Activity - Overview article.

Data transformation activities


Azure Data Factory supports the following transformation activities that can be added to pipelines
either individually or chained with another activity.

DATA TRANSFORMATION ACTIVITY - COMPUTE ENVIRONMENT

Hive - HDInsight [Hadoop]
Pig - HDInsight [Hadoop]
MapReduce - HDInsight [Hadoop]
Hadoop Streaming - HDInsight [Hadoop]
Spark - HDInsight [Hadoop]
Machine Learning activities: Batch Execution and Update Resource - Azure VM
Stored Procedure - Azure SQL, Azure SQL Data Warehouse, or SQL Server
U-SQL - Azure Data Lake Analytics
Custom Code - Azure Batch
Databricks Notebook - Azure Databricks

For more information, see the data transformation activities article.


Control activities
The following control flow activities are supported:

Execute Pipeline Activity - The Execute Pipeline activity allows a Data Factory pipeline to invoke another pipeline.

ForEach Activity - The ForEach activity defines a repeating control flow in your pipeline. This activity is used to iterate over a collection and executes specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping structure in programming languages.

Web Activity - The Web activity can be used to call a custom REST endpoint from a Data Factory pipeline. You can pass datasets and linked services to be consumed and accessed by the activity.

Lookup Activity - The Lookup activity can be used to read or look up a record, table name, or value from any external source. This output can further be referenced by succeeding activities.

Get Metadata Activity - The GetMetadata activity can be used to retrieve metadata of any data in Azure Data Factory.

Until Activity - The Until activity implements a Do-Until loop that is similar to the Do-Until looping structure in programming languages. It executes a set of activities in a loop until the condition associated with the activity evaluates to true. You can specify a timeout value for the Until activity in Data Factory.

If Condition Activity - The If Condition activity can be used to branch based on a condition that evaluates to true or false. It provides the same functionality that an if statement provides in programming languages: it evaluates a set of activities when the condition evaluates to true and another set of activities when the condition evaluates to false.

Wait Activity - When you use a Wait activity in a pipeline, the pipeline waits for the specified period of time before continuing with execution of subsequent activities.

Pipeline JSON
Here is how a pipeline is defined in JSON format:
{
"name": "PipelineName",
"properties":
{
"description": "pipeline description",
"activities":
[
],
"parameters": {
}
}
}

name (String, required) - Name of the pipeline. Specify a name that represents the action that the pipeline performs. Maximum number of characters: 140. Must start with a letter, a number, or an underscore (_). The following characters are not allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\".

description (String, optional) - Specify the text describing what the pipeline is used for.

activities (Array, required) - The activities section can have one or more activities defined within it. See the Activity JSON section for details about the activities JSON element.

parameters (List, optional) - The parameters section can have one or more parameters defined within the pipeline, making your pipeline flexible for reuse.

Activity JSON
The activities section can have one or more activities defined within it. There are two main types of
activities: Execution and Control Activities.
Execution activities
Execution activities include data movement and data transformation activities. They have the
following top-level structure:

{
"name": "Execution Activity Name",
"description": "description",
"type": "<ActivityType>",
"typeProperties":
{
},
"linkedServiceName": "MyLinkedService",
"policy":
{
},
"dependsOn":
{
}
}

The following table describes properties in the activity JSON definition:

name (required) - Name of the activity. Specify a name that represents the action that the activity performs. Maximum number of characters: 55. Must start with a letter, a number, or an underscore (_). The following characters are not allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\".

description (required) - Text describing what the activity is used for.

type (required) - Type of the activity. See the Data movement activities, Data transformation activities, and Control activities sections for different types of activities.

linkedServiceName (required for HDInsight Activity, Azure Machine Learning Batch Scoring Activity, and Stored Procedure Activity; not required for all others) - Name of the linked service used by the activity. An activity may require that you specify the linked service that links to the required compute environment.

typeProperties (optional) - Properties in the typeProperties section depend on each type of activity. To see the type properties for an activity, click the links to the activity in the previous section.

policy (optional) - Policies that affect the run-time behavior of the activity. This property includes timeout and retry behavior. If it is not specified, default values are used. For more information, see the Activity policy section.

dependsOn (optional) - This property is used to define activity dependencies, and how subsequent activities depend on previous activities. For more information, see Activity dependency.

Activity policy
Policies affect the run-time behavior of an activity, giving configurability options. Activity Policies are
only available for execution activities.
Activity policy JSON definition

{
"name": "MyPipelineName",
"properties": {
"activities": [
{
"name": "MyCopyBlobtoSqlActivity"
"type": "Copy",
"typeProperties": {
...
},
"policy": {
"timeout": "00:10:00",
"retry": 1,
"retryIntervalInSeconds": 60,
"secureOutput": true
}
}
],
"parameters": {
...
}
}
}

timeout (Timespan, optional) - Specifies the timeout for the activity to run. The default timeout is 7 days.

retry (Integer, optional) - Maximum retry attempts. The default is 0.

retryIntervalInSeconds (Integer, optional) - The delay between retry attempts in seconds. The default is 30 seconds.

secureOutput (Boolean, optional) - When set to true, output from the activity is considered secure and is not logged to monitoring. The default is false.

Control activity
Control activities have the following top-level structure:

{
"name": "Control Activity Name",
"description": "description",
"type": "<ActivityType>",
"typeProperties":
{
},
"dependsOn":
{
}
}

name (required) - Name of the activity. Specify a name that represents the action that the activity performs. Maximum number of characters: 55. Must start with a letter, a number, or an underscore (_). The following characters are not allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\".

description (required) - Text describing what the activity is used for.

type (required) - Type of the activity. See the data movement activities, data transformation activities, and control activities sections for different types of activities.

typeProperties (optional) - Properties in the typeProperties section depend on each type of activity. To see the type properties for an activity, click the links to the activity in the previous section.

dependsOn (optional) - This property is used to define activity dependency, and how subsequent activities depend on previous activities. For more information, see activity dependency.
Activity dependency
Activity Dependency defines how subsequent activities depend on previous activities, determining
the condition whether to continue executing the next task. An activity can depend on one or multiple
previous activities with different dependency conditions.
The different dependency conditions are: Succeeded, Failed, Skipped, Completed.
For example, if a pipeline has Activity A -> Activity B, the different scenarios that can happen are:
Activity B has dependency condition on Activity A with succeeded: Activity B only runs if Activity
A has a final status of succeeded
Activity B has dependency condition on Activity A with failed: Activity B only runs if Activity A
has a final status of failed
Activity B has dependency condition on Activity A with completed: Activity B runs if Activity A
has a final status of succeeded or failed
Activity B has dependency condition on Activity A with skipped: Activity B runs if Activity A has a
final status of skipped. Skipped occurs in the scenario of Activity X -> Activity Y -> Activity Z,
where each activity runs only if the previous activity succeeds. If Activity X fails, then Activity Y
has a status of “Skipped” because it never executes. Similarly, Activity Z has a status of “Skipped”
as well.
Example: Activity 2 depends on the Activity 1 succeeding

{
"name": "PipelineName",
"properties":
{
"description": "pipeline description",
"activities": [
{
"name": "MyFirstActivity",
"type": "Copy",
"typeProperties": {
},
"linkedServiceName": {
}
},
{
"name": "MySecondActivity",
"type": "Copy",
"typeProperties": {
},
"linkedServiceName": {
},
"dependsOn": [
{
"activity": "MyFirstActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
],
"parameters": {
}
}
}

Sample copy pipeline


In the following sample pipeline, there is one activity of type Copy in the activities section. In this
sample, the copy activity copies data from an Azure Blob storage to an Azure SQL database.

{
"name": "CopyPipeline",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"type": "Copy",
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60:00:00"
}
},
"policy": {
"retry": 2,
"timeout": "01:00:00"
}
}
]
}
}

Note the following points:


In the activities section, there is only one activity whose type is set to Copy.
Input for the activity is set to InputDataset and output for the activity is set to OutputDataset.
See Datasets article for defining datasets in JSON.
In the typeProperties section, BlobSource is specified as the source type and SqlSink is
specified as the sink type. In the data movement activities section, click the data store that you
want to use as a source or a sink to learn more about moving data to/from that data store.
For a complete walkthrough of creating this pipeline, see Quickstart: create a data factory.
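As an illustration that is not part of the original walkthrough, the pipeline JSON above can also be deployed and run with Azure PowerShell. The sketch below assumes the definition is saved locally as CopyPipeline.json and that $ResourceGroupName and $DataFactoryName identify an existing data factory:

# Deploy the sample copy pipeline from the local JSON file shown above.
Set-AzDataFactoryV2Pipeline -ResourceGroupName $ResourceGroupName `
                            -DataFactoryName $DataFactoryName `
                            -Name "CopyPipeline" `
                            -DefinitionFile ".\CopyPipeline.json"

# Run it once on demand and capture the run ID for monitoring.
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName $ResourceGroupName `
                                        -DataFactoryName $DataFactoryName `
                                        -PipelineName "CopyPipeline"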

Sample transformation pipeline


In the following sample pipeline, there is one activity of type HDInsightHive in the activities
section. In this sample, the HDInsight Hive activity transforms data from an Azure Blob storage by
running a Hive script file on an Azure HDInsight Hadoop cluster.
{
"name": "TransformPipeline",
"properties": {
"description": "My first Azure Data Factory pipeline",
"activities": [
{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "adfgetstarted/script/partitionweblogs.hql",
"scriptLinkedService": "AzureStorageLinkedService",
"defines": {
"inputtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
"partitionedtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
}
},
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"policy": {
"retry": 3
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "HDInsightOnDemandLinkedService"
}
]
}
}

Note the following points:


In the activities section, there is only one activity whose type is set to HDInsightHive.
The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the scriptLinkedService, called AzureStorageLinkedService), in the script folder of the adfgetstarted container.
The defines section is used to specify the runtime settings that are passed to the Hive script as Hive configuration values (for example, ${hiveconf:inputtable}, ${hiveconf:partitionedtable}).

The typeProperties section is different for each transformation activity. To learn about type
properties supported for a transformation activity, click the transformation activity in the Data
transformation activities.
For a complete walkthrough of creating this pipeline, see Tutorial: transform data using Spark.

Multiple activities in a pipeline


The previous two sample pipelines have only one activity in them. You can have more than one
activity in a pipeline. If you have multiple activities in a pipeline and subsequent activities are not
dependent on previous activities, the activities may run in parallel.
You can chain two activities by using activity dependency, which defines how subsequent activities depend on previous activities and determines whether to continue executing the next task. An activity can depend on one or more previous activities with different dependency conditions.
Scheduling pipelines
Pipelines are scheduled by triggers. There are different types of triggers (scheduler trigger, which
allows pipelines to be triggered on a wall-clock schedule, as well as manual trigger, which triggers
pipelines on-demand). For more information about triggers, see pipeline execution and triggers
article.
To have your trigger kick off a pipeline run, you must include a pipeline reference to the particular pipeline in the trigger definition. Pipelines and triggers have a many-to-many (n-m) relationship: multiple triggers can kick off a single pipeline, and the same trigger can kick off multiple pipelines. Once the trigger is defined, you must start the trigger to have it begin triggering the pipeline. For more information about triggers, see the pipeline execution and triggers article.
For example, say you have a scheduler trigger, "Trigger A," that you want to kick off your pipeline, "MyCopyPipeline." You define the trigger as shown in the following example:
Trigger A definition

{
    "name": "TriggerA",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            ...
        },
        "pipeline": {
            "pipelineReference": {
                "type": "PipelineReference",
                "referenceName": "MyCopyPipeline"
            },
            "parameters": {
                "copySourceName": "FileSource"
            }
        }
    }
}
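Once defined, the trigger must be deployed to the data factory and then started before it begins invoking the pipeline. As a minimal sketch that is not part of the original article, the following Azure PowerShell assumes the definition above is saved locally as TriggerA.json and that $ResourceGroupName and $DataFactoryName are already set:

# Deploy the trigger definition, then start it so it begins firing on its schedule.
Set-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName `
                           -DataFactoryName $DataFactoryName `
                           -Name "TriggerA" `
                           -DefinitionFile ".\TriggerA.json"

Start-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName `
                             -DataFactoryName $DataFactoryName `
                             -Name "TriggerA" `
                             -Force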

Next steps
See the following tutorials for step-by-step instructions for creating pipelines with activities:
Build a pipeline with a copy activity
Build a pipeline with a data transformation activity
Linked services in Azure Data Factory

This article describes what linked services are, how they are defined in JSON format, and how they are used in
Azure Data Factory pipelines.
If you are new to Data Factory, see Introduction to Azure Data Factory for an overview.

Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform
a task. The activities in a pipeline define actions to perform on your data. For example, you might use a copy
activity to copy data from an on-premises SQL Server to Azure Blob storage. Then, you might use a Hive activity
that runs a Hive script on an Azure HDInsight cluster to process data from Blob storage to produce output data.
Finally, you might use a second copy activity to copy the output data to Azure SQL Data Warehouse, on top of
which business intelligence (BI) reporting solutions are built. For more information about pipelines and activities,
see Pipelines and activities in Azure Data Factory.
Now, a dataset is a named view of data that simply points or references the data you want to use in your
activities as inputs and outputs.
Before you create a dataset, you must create a linked service to link your data store to the data factory. Linked
services are much like connection strings, which define the connection information needed for Data Factory to
connect to external resources. Think of it this way; the dataset represents the structure of the data within the linked
data stores, and the linked service defines the connection to the data source. For example, an Azure Storage linked
service links a storage account to the data factory. An Azure Blob dataset represents the blob container and the
folder within that Azure storage account that contains the input blobs to be processed.
Here is a sample scenario. To copy data from Blob storage to a SQL database, you create two linked services: Azure
Storage and Azure SQL Database. Then, create two datasets: Azure Blob dataset (which refers to the Azure Storage
linked service) and Azure SQL Table dataset (which refers to the Azure SQL Database linked service). The Azure
Storage and Azure SQL Database linked services contain connection strings that Data Factory uses at runtime to
connect to your Azure Storage and Azure SQL Database, respectively. The Azure Blob dataset specifies the blob
container and blob folder that contains the input blobs in your Blob storage. The Azure SQL Table dataset specifies
the SQL table in your SQL database to which the data is to be copied.
The following diagram shows the relationships among pipeline, activity, dataset, and linked service in Data Factory:

Linked service JSON


A linked service in Data Factory is defined in JSON format as follows:

{
"name": "<Name of the linked service>",
"properties": {
"type": "<Type of the linked service>",
"typeProperties": {
"<data store or compute-specific type properties>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

The following table describes properties in the above JSON:

name (required) - Name of the linked service. See Azure Data Factory - Naming rules.

type (required) - Type of the linked service. For example: AzureStorage (data store) or AzureBatch (compute). See the description for typeProperties.

typeProperties (required) - The type properties are different for each data store or compute. For the supported data store types and their type properties, see the dataset type table in this article. Navigate to the data store connector article to learn about type properties specific to a data store. For the supported compute types and their type properties, see Compute linked services.

connectVia (optional) - The Integration Runtime to be used to connect to the data store. You can use the Azure Integration Runtime or a Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime.

Linked service example


The following linked service is an Azure Storage linked service. Notice that the type is set to AzureStorage. The
type properties for the Azure Storage linked service include a connection string. The Data Factory service uses this
connection string to connect to the data store at runtime.
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Create linked services


You can create linked services by using one of these tools or SDKs: .NET API, PowerShell, REST API, Azure Resource Manager template, and the Azure portal.
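For example, with Azure PowerShell a linked service can be created from a JSON definition file. This is only a minimal sketch; it assumes the Azure Storage linked service JSON shown above is saved locally as AzureStorageLinkedService.json and that $ResourceGroupName and $DataFactoryName identify an existing data factory:

# Create (or update) the linked service from the local JSON definition file.
Set-AzDataFactoryV2LinkedService -ResourceGroupName $ResourceGroupName `
                                 -DataFactoryName $DataFactoryName `
                                 -Name "AzureStorageLinkedService" `
                                 -DefinitionFile ".\AzureStorageLinkedService.json"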

Data store linked services


The data stores you can connect to are listed in supported data stores and formats. Reference that list for the specific connection properties needed for different stores.

Compute linked services


See compute environments supported for details about the different compute environments you can connect to from your data factory, as well as the different configurations.

Next steps
See the following tutorial for step-by-step instructions for creating pipelines and datasets by using one of these
tools or SDKs.
Quickstart: create a data factory using .NET
Quickstart: create a data factory using PowerShell
Quickstart: create a data factory using REST API
Quickstart: create a data factory using Azure portal
Datasets in Azure Data Factory

This article describes what datasets are, how they are defined in JSON format, and how they are used in
Azure Data Factory pipelines.
If you are new to Data Factory, see Introduction to Azure Data Factory for an overview.

Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that
together perform a task. The activities in a pipeline define actions to perform on your data. Now, a
dataset is a named view of data that simply points or references the data you want to use in your
activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files,
folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in
Blob storage from which the activity should read the data.
Before you create a dataset, you must create a linked service to link your data store to the data factory.
Linked services are much like connection strings, which define the connection information needed for
Data Factory to connect to external resources. Think of it this way; the dataset represents the structure of
the data within the linked data stores, and the linked service defines the connection to the data source.
For example, an Azure Storage linked service links a storage account to the data factory. An Azure Blob
dataset represents the blob container and the folder within that Azure storage account that contains the
input blobs to be processed.
Here is a sample scenario. To copy data from Blob storage to a SQL database, you create two linked
services: Azure Storage and Azure SQL Database. Then, create two datasets: Azure Blob dataset (which
refers to the Azure Storage linked service) and Azure SQL Table dataset (which refers to the Azure SQL
Database linked service). The Azure Storage and Azure SQL Database linked services contain
connection strings that Data Factory uses at runtime to connect to your Azure Storage and Azure SQL
Database, respectively. The Azure Blob dataset specifies the blob container and blob folder that contains
the input blobs in your Blob storage. The Azure SQL Table dataset specifies the SQL table in your SQL
database to which the data is to be copied.
The following diagram shows the relationships among pipeline, activity, dataset, and linked service in
Data Factory:

Dataset JSON
A dataset in Data Factory is defined in the following JSON format:
{
"name": "<name of dataset>",
"properties": {
"type": "<type of dataset: AzureBlob, AzureSql etc...>",
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference",
},
"structure": [
{
"name": "<Name of the column>",
"type": "<Name of the type>"
}
],
"typeProperties": {
"<type specific property>": "<value>",
"<type specific property 2>": "<value 2>",
}
}
}

The following table describes properties in the above JSON:

name (required) - Name of the dataset. See Azure Data Factory - Naming rules.

type (required) - Type of the dataset. Specify one of the types supported by Data Factory (for example: AzureBlob, AzureSqlTable). For details, see Dataset types.

structure (optional) - Schema of the dataset. For details, see Dataset schema.

typeProperties (required) - The type properties are different for each type (for example: Azure Blob, Azure SQL table). For details on the supported types and their properties, see Dataset type.

Data flow compatible dataset

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer
SLA provisions.

See supported dataset types for a list of dataset types that are Data Flow compatible. Datasets that are
compatible for Data Flow require fine-grained dataset definitions for transformations. Thus, the JSON
definition is slightly different. Instead of a structure property, datasets that are Data Flow compatible
have a schema property.
In Data Flow, datasets are used in source and sink transformations. The datasets define the basic data
schemas. If your data has no schema, you can use schema drift for your source and sink. The schema in
the dataset represents the physical data type and shape.
By defining the schema from the dataset, you'll get the related data types, data formats, file location, and
connection information from the associated Linked service. Metadata from the datasets appears in your
source transformation as the source projection. The projection in the source transformation represents
the Data Flow data with defined names and types.
When you import the schema of a Data Flow dataset, select the Import Schema button and choose to
import from the source or from a local file. In most cases, you'll import the schema directly from the
source. But if you already have a local schema file (a Parquet file or CSV with headers), you can direct
Data Factory to base the schema on that file.

{
"name": "<name of dataset>",
"properties": {
"type": "<type of dataset: AzureBlob, AzureSql etc...>",
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference",
},
"schema": [
{
"name": "<Name of the column>",
"type": "<Name of the type>"
}
],
"typeProperties": {
"<type specific property>": "<value>",
"<type specific property 2>": "<value 2>",
}
}
}

The following table describes properties in the above JSON:

name (required) - Name of the dataset. See Azure Data Factory - Naming rules.

type (required) - Type of the dataset. Specify one of the types supported by Data Factory (for example: AzureBlob, AzureSqlTable). For details, see Dataset types.

schema (optional) - Schema of the dataset. For details, see Data Flow compatible datasets.

typeProperties (required) - The type properties are different for each type (for example: Azure Blob, Azure SQL table). For details on the supported types and their properties, see Dataset type.

Dataset example
In the following example, the dataset represents a table named MyTable in a SQL database.
{
"name": "DatasetSample",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "MyAzureSqlLinkedService",
"type": "LinkedServiceReference",
},
"typeProperties":
{
"tableName": "MyTable"
},
}
}

Note the following points:


type is set to AzureSqlTable.
tableName type property (specific to AzureSqlTable type) is set to MyTable.
linkedServiceName refers to a linked service of type AzureSqlDatabase, which is defined in the next
JSON snippet.

Dataset type
There are many different types of datasets, depending on the data store you use. See the following table
for a list of data stores supported by Data Factory. Click a data store to learn how to create a linked
service and a dataset for that data store.

CATEGORY    DATA STORE    SUPPORTED AS A COPY ACTIVITY SOURCE    SUPPORTED AS A COPY ACTIVITY SINK    SUPPORTED BY AZURE IR    SUPPORTED BY SELF-HOSTED IR    SUPPORTED BY DATA FLOW

Azure Azure Blob ✓ ✓ ✓ ✓ ✓


Storage Supported
Formats:
Delimited Text,
Parquet

Azure ✓ ✓ ✓ ✓
Cosmos DB
(SQL API)

Azure ✓ ✓ ✓ ✓
Cosmos
DB's API for
MongoDB

Azure Data ✓ ✓ ✓ ✓
Explorer

Azure Data ✓ ✓ ✓ ✓ ✓
Lake Supported
Storage Formats:
Gen1 Delimited Text,
Parquet

Azure Data ✓ ✓ ✓ ✓ ✓
Lake Supported
Storage Formats:
Gen2 Delimited Text,
Parquet

Azure ✓ ✓ ✓
Database for
MariaDB

Azure ✓ ✓ ✓
Database for
MySQL

Azure ✓ ✓ ✓
Database for
PostgreSQL

Azure File ✓ ✓ ✓ ✓
Storage

Azure SQL ✓ ✓ ✓ ✓ ✓
Database

Azure SQL ✓ ✓ ✓
Database
Managed
Instance

Azure SQL ✓ ✓ ✓ ✓ ✓
Data
Warehouse

Azure ✓ ✓ ✓
Search Index

Azure Table ✓ ✓ ✓ ✓
Storage

Database Amazon ✓ ✓ ✓
Redshift

DB2 ✓ ✓ ✓

Drill ✓ ✓ ✓
(Preview)

Google ✓ ✓ ✓
BigQuery

Greenplum ✓ ✓ ✓

HBase ✓ ✓ ✓

Hive ✓ ✓ ✓

Apache ✓ ✓ ✓
Impala
(Preview)

Informix ✓ ✓

MariaDB ✓ ✓ ✓

Microsoft ✓ ✓
Access

MySQL ✓ ✓ ✓

Netezza ✓ ✓ ✓

Oracle ✓ ✓ ✓ ✓

Phoenix ✓ ✓ ✓

PostgreSQL ✓ ✓ ✓

Presto ✓ ✓ ✓
(Preview)

SAP ✓ ✓
Business
Warehouse
Open Hub

SAP ✓ ✓
Business
Warehouse
via MDX

SAP HANA ✓ ✓ ✓

SAP Table ✓ ✓ ✓

Spark ✓ ✓ ✓

SQL Server ✓ ✓ ✓ ✓

Sybase ✓ ✓

Teradata ✓ ✓

Vertica ✓ ✓ ✓

NoSQL Cassandra ✓ ✓ ✓

Couchbase ✓ ✓ ✓
(Preview)

MongoDB ✓ ✓ ✓

File Amazon S3 ✓ ✓ ✓

File System ✓ ✓ ✓ ✓

FTP ✓ ✓ ✓

Google ✓ ✓ ✓
Cloud
Storage

HDFS ✓ ✓ ✓

SFTP ✓ ✓ ✓

Generic Generic ✓ ✓ ✓
protocol HTTP

Generic ✓ ✓ ✓
OData

Generic ✓ ✓ ✓
ODBC

Generic ✓ ✓ ✓
REST

Services Amazon ✓ ✓ ✓
and apps Marketplace
Web Service
(Preview)

Common ✓ ✓ ✓ ✓
Data Service
for Apps

Concur ✓ ✓ ✓
(Preview)

Dynamics ✓ ✓ ✓ ✓
365

Dynamics ✓ ✓ ✓
AX (Preview)

Dynamics ✓ ✓ ✓ ✓
CRM

Google ✓ ✓ ✓
AdWords
(Preview)

HubSpot ✓ ✓ ✓
(Preview)

Jira ✓ ✓ ✓
(Preview)

Magento ✓ ✓ ✓
(Preview)

Marketo ✓ ✓ ✓
(Preview)

Office 365 ✓ ✓ ✓

Oracle ✓ ✓ ✓
Eloqua
(Preview)

Oracle ✓ ✓ ✓
Responsys
(Preview)

Oracle ✓ ✓ ✓
Service
Cloud
(Preview)

Paypal ✓ ✓ ✓
(Preview)

QuickBooks ✓ ✓ ✓
(Preview)

Salesforce ✓ ✓ ✓ ✓

Salesforce ✓ ✓ ✓ ✓
Service
Cloud

Salesforce ✓ ✓ ✓
Marketing
Cloud
(Preview)

SAP Cloud ✓ ✓ ✓ ✓
for
Customer
(C4C)

SAP ECC ✓ ✓ ✓

ServiceNow ✓ ✓ ✓

Shopify ✓ ✓ ✓
(Preview)

Square ✓ ✓ ✓
(Preview)

Web Table ✓ ✓
(HTML
table)

Xero ✓ ✓ ✓
(Preview)

Zoho ✓ ✓ ✓
(Preview)

NOTE
Any connector marked as Preview means that you can try it out and give us feedback. If you want to take a
dependency on preview connectors in your solution, please contact Azure support.

In the example in the previous section, the type of the dataset is set to AzureSqlTable. Similarly, for an
Azure Blob dataset, the type of the dataset is set to AzureBlob, as shown in the following JSON:
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference",
},

"typeProperties": {
"fileName": "input.log",
"folderPath": "adfgetstarted/inputdata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
}
}
}

Dataset structure or schema


The structure section or the schema (Data Flow compatible) section of a dataset is optional. It defines the schema of the dataset by containing a collection of names and data types of columns. You use the structure section to provide type information that is used to convert types and map columns from the source to the destination.
Each column in the structure contains the following properties:

name (required) - Name of the column.

type (optional) - Data type of the column. Data Factory supports the following interim data types as allowed values: Int16, Int32, Int64, Single, Double, Decimal, Byte[], Boolean, String, Guid, Datetime, Datetimeoffset, and Timespan.

culture (optional) - .NET-based culture to be used when the type is a .NET type: Datetime or Datetimeoffset. The default is en-us.

format (optional) - Format string to be used when the type is a .NET type: Datetime or Datetimeoffset. Refer to Custom Date and Time Format Strings for how to format datetime values.

Example
In the following example, suppose the source Blob data is in CSV format and contains three columns:
userid, name, and lastlogindate. They are of type Int64, String, and Datetime with a custom datetime
format using abbreviated French names for day of the week.
Define the Blob dataset structure as follows along with type definitions for the columns:
"structure":
[
{ "name": "userid", "type": "Int64"},
{ "name": "name", "type": "String"},
{ "name": "lastlogindate", "type": "Datetime", "culture": "fr-fr", "format": "ddd-MM-YYYY"}
]

Guidance
The following guidelines help you understand when to include structure information, and what to
include in the structure section. Learn more on how data factory maps source data to sink and when to
specify structure information from Schema and type mapping.
For strong schema data sources, specify the structure section only if you want map source
columns to sink columns, and their names are not the same. This kind of structured data source
stores data schema and type information along with the data itself. Examples of structured data
sources include SQL Server, Oracle, and Azure SQL Database.

As type information is already available for structured data sources, you should not include type
information when you do include the structure section.
For no/weak schema data sources (for example, a text file in Blob storage), include structure when the dataset is an input for a copy activity and the data types of the source dataset should be converted to the native types of the sink. Also include structure when you want to map source columns to sink columns.

Create datasets
You can create datasets by using one of these tools or SDKs: .NET API, PowerShell, REST API, Azure Resource Manager template, and the Azure portal.
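For example, with Azure PowerShell a dataset can be created from a JSON definition file. This is only a minimal sketch; it assumes the AzureSqlTable dataset shown earlier is saved locally as DatasetSample.json and that $ResourceGroupName and $DataFactoryName identify an existing data factory:

# Create (or update) the dataset from the local JSON definition file.
Set-AzDataFactoryV2Dataset -ResourceGroupName $ResourceGroupName `
                           -DataFactoryName $DataFactoryName `
                           -Name "DatasetSample" `
                           -DefinitionFile ".\DatasetSample.json"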

Current version vs. version 1 datasets


Here are some differences between Data Factory and Data Factory version 1 datasets:
The external property is not supported in the current version. It's replaced by a trigger.
The policy and availability properties are not supported in the current version. The start time for a
pipeline depends on triggers.
Scoped datasets (datasets defined in a pipeline) are not supported in the current version.

Next steps
See the following tutorial for step-by-step instructions for creating pipelines and datasets by using one
of these tools or SDKs.
Quickstart: create a data factory using .NET
Quickstart: create a data factory using PowerShell
Quickstart: create a data factory using REST API
Quickstart: create a data factory using Azure portal
Pipeline execution and triggers in Azure Data
Factory

A pipeline run in Azure Data Factory defines an instance of a pipeline execution. For example, say you have a
pipeline that executes at 8:00 AM, 9:00 AM, and 10:00 AM. In this case, there are three separate runs of the
pipeline, or pipeline runs. Each pipeline run has a unique pipeline run ID. A run ID is a GUID that uniquely
defines that particular pipeline run.
Pipeline runs are typically instantiated by passing arguments to parameters that you define in the pipeline. You
can execute a pipeline either manually or by using a trigger. This article provides details about both ways of
executing a pipeline.

Manual execution (on-demand)


The manual execution of a pipeline is also referred to as on-demand execution.
For example, say you have a basic pipeline named copyPipeline that you want to execute. The pipeline has a
single activity that copies from an Azure Blob storage source folder to a destination folder in the same storage.
The following JSON definition shows this sample pipeline:
{
"name": "copyPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"name": "CopyBlobtoBlob",
"inputs": [
{
"referenceName": "sourceBlobDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "sinkBlobDataset",
"type": "DatasetReference"
}
]
}
],
"parameters": {
"sourceBlobContainer": {
"type": "String"
},
"sinkBlobContainer": {
"type": "String"
}
}
}
}

In the JSON definition, the pipeline takes two parameters: sourceBlobContainer and sinkBlobContainer.
You pass values to these parameters at runtime.
You can manually run your pipeline by using one of the following methods:
.NET SDK
Azure PowerShell module
REST API
Python SDK
REST API
The following sample command shows you how to manually run your pipeline by using the REST API:

POST https://management.azure.com/subscriptions/mySubId/resourceGroups/myResourceGroup/providers/Microsoft.DataFactory/factories/myDataFactory/pipelines/copyPipeline/createRun?api-version=2017-03-01-preview

For a complete sample, see Quickstart: Create a data factory by using the REST API.
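The same call can also be issued from a script. The following is a minimal sketch, not part of the original quickstart; it assumes a valid Azure AD bearer token for https://management.azure.com is already available in $token, and it reuses the placeholder subscription, resource group, factory, and pipeline names from the URL above:

# Build the createRun request and pass pipeline parameters in the request body.
$uri = "https://management.azure.com/subscriptions/mySubId/resourceGroups/myResourceGroup/providers/Microsoft.DataFactory/factories/myDataFactory/pipelines/copyPipeline/createRun?api-version=2017-03-01-preview"
$headers = @{ Authorization = "Bearer $token" }
$body = @{ sourceBlobContainer = "MySourceFolder"; sinkBlobContainer = "MySinkFolder" } | ConvertTo-Json

# The response body contains the runId of the new pipeline run.
Invoke-RestMethod -Method Post -Uri $uri -Headers $headers -Body $body -ContentType "application/json"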
Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which
will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.

The following sample command shows you how to manually run your pipeline by using Azure PowerShell:

Invoke-AzDataFactoryV2Pipeline -DataFactory $df -PipelineName "Adfv2QuickStartPipeline" -ParameterFile .\PipelineParameters.json

You pass parameters in the body of the request payload. In the .NET SDK, Azure PowerShell, and the Python
SDK, you pass values in a dictionary that's passed as an argument to the call:

{
"sourceBlobContainer": "MySourceFolder",
"sinkBlobContainer": "MySinkFolder"
}

The response payload is a unique ID of the pipeline run:

{
"runId": "0448d45a-a0bd-23f3-90a5-bfeea9264aed"
}

For a complete sample, see Quickstart: Create a data factory by using Azure PowerShell.
.NET SDK
The following sample call shows you how to manually run your pipeline by using the .NET SDK:

client.Pipelines.CreateRunWithHttpMessagesAsync(resourceGroup, dataFactoryName, pipelineName, parameters)

For a complete sample, see Quickstart: Create a data factory by using the .NET SDK.

NOTE
You can use the .NET SDK to invoke Data Factory pipelines from Azure Functions, from your own web services, and so on.

Trigger execution
Triggers are another way that you can execute a pipeline run. Triggers represent a unit of processing that
determines when a pipeline execution needs to be kicked off. Currently, Data Factory supports three types of
triggers:
Schedule trigger: A trigger that invokes a pipeline on a wall-clock schedule.
Tumbling window trigger: A trigger that operates on a periodic interval, while also retaining state.
Event-based trigger: A trigger that responds to an event.
Pipelines and triggers have a many-to-many relationship. Multiple triggers can kick off a single pipeline, or a
single trigger can kick off multiple pipelines. In the following trigger definition, the pipelines property refers to
a list of pipelines that are triggered by the particular trigger. The property definition includes values for the
pipeline parameters.
Basic trigger definition

{
"properties": {
"name": "MyTrigger",
"type": "<type of trigger>",
"typeProperties": {...},
"pipelines": [
{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "<Name of your pipeline>"
},
"parameters": {
"<parameter 1 Name>": {
"type": "Expression",
"value": "<parameter 1 Value>"
},
"<parameter 2 Name>": "<parameter 2 Value>"
}
}
]
}
}

Schedule trigger
A schedule trigger runs pipelines on a wall-clock schedule. This trigger supports periodic and advanced
calendar options. For example, the trigger supports intervals like "weekly" or "Monday at 5:00 PM and
Thursday at 9:00 PM." The schedule trigger is flexible because the dataset pattern is agnostic, and the trigger
doesn't discern between time-series and non-time-series data.
For more information about schedule triggers and for examples, see Create a schedule trigger.

Schedule trigger definition


When you create a schedule trigger, you specify scheduling and recurrence by using a JSON definition.
To have your schedule trigger kick off a pipeline run, include a pipeline reference of the particular pipeline in the
trigger definition. Pipelines and triggers have a many-to-many relationship. Multiple triggers can kick off a
single pipeline. A single trigger can kick off multiple pipelines.
{
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": <<Minute, Hour, Day, Week, Year>>,
"interval": <<int>>, // How often to fire
"startTime": <<datetime>>,
"endTime": <<datetime>>,
"timeZone": "UTC",
"schedule": { // Optional (advanced scheduling specifics)
"hours": [<<0-24>>],
"weekDays": [<<Monday-Sunday>>],
"minutes": [<<0-60>>],
"monthDays": [<<1-31>>],
"monthlyOccurrences": [
{
"day": <<Monday-Sunday>>,
"occurrence": <<1-5>>
}
]
}
}
},
"pipelines": [
{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "<Name of your pipeline>"
},
"parameters": {
"<parameter 1 Name>": {
"type": "Expression",
"value": "<parameter 1 Value>"
},
"<parameter 2 Name>": "<parameter 2 Value>"
}
}
]}
}

IMPORTANT
The parameters property is a mandatory property of the pipelines element. If your pipeline doesn't take any
parameters, you must include an empty JSON definition for the parameters property.

Schema overview
The following table provides a high-level overview of the major schema elements that are related to recurrence
and scheduling a trigger:

startTime - A date-time value. For basic schedules, the value of the startTime property applies to the first occurrence. For complex schedules, the trigger starts no sooner than the specified startTime value.

endTime - The end date and time for the trigger. The trigger doesn't execute after the specified end date and time. The value for the property can't be in the past.

timeZone - The time zone. Currently, only the UTC time zone is supported.

recurrence - A recurrence object that specifies the recurrence rules for the trigger. The recurrence object supports the frequency, interval, endTime, count, and schedule elements. When a recurrence object is defined, the frequency element is required. The other elements of the recurrence object are optional.

frequency - The unit of frequency at which the trigger recurs. The supported values include "minute", "hour", "day", "week", and "month".

interval - A positive integer that denotes the interval for the frequency value. The frequency value determines how often the trigger runs. For example, if the interval is 3 and the frequency is "week", the trigger recurs every three weeks.

schedule - The recurrence schedule for the trigger. A trigger with a specified frequency value alters its recurrence based on a recurrence schedule. The schedule property contains modifications for the recurrence that are based on minutes, hours, week days, month days, and week number.

Schedule trigger example

{
"properties": {
"name": "MyTrigger",
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2017-11-01T09:00:00-08:00",
"endTime": "2017-11-02T22:00:00-08:00"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "SQLServerToBlobPipeline"
},
"parameters": {}
},
{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "SQLServerToAzureSQLPipeline"
},
"parameters": {}
}
]
}
}

Schema defaults, limits, and examples


startTime - String. Required. No default. Valid values: ISO 8601 date-times. Example: "startTime" : "2013-01-09T09:30:00-08:00"

recurrence - Object. Required. No default. Valid values: a recurrence object. Example: "recurrence" : { "frequency" : "monthly", "interval" : 1 }

interval - Number. Not required. Default: 1. Valid values: 1 to 1000. Example: "interval": 10

endTime - String. Required. No default. Valid values: a date-time value that represents a time in the future. Example: "endTime" : "2013-02-09T09:30:00-08:00"

schedule - Object. Not required. No default. Valid values: a schedule object. Example: "schedule" : { "minute" : [30], "hour" : [8,17] }

startTime property
The following table shows you how the startTime property controls a trigger run:

Start time is in the past:
Recurrence without schedule - Calculates the first future execution time after the start time, and runs at that time. Runs subsequent executions calculated from the last execution time. See the example that follows this table.
Recurrence with schedule - The trigger starts no sooner than the specified start time. The first occurrence is based on the schedule, calculated from the start time. Runs subsequent executions based on the recurrence schedule.

Start time is in the future or the current time:
Recurrence without schedule - Runs once at the specified start time. Runs subsequent executions calculated from the last execution time.
Recurrence with schedule - The trigger starts no sooner than the specified start time. The first occurrence is based on the schedule, calculated from the start time. Runs subsequent executions based on the recurrence schedule.

Let's look at an example of what happens when the start time is in the past, with a recurrence, but no schedule.
Assume that the current time is 2017-04-08 13:00, the start time is 2017-04-07 14:00, and the recurrence is
every two days. (The recurrence value is defined by setting the frequency property to "day" and the interval
property to 2.) Notice that the startTime value is in the past and occurs before the current time.
Under these conditions, the first execution is 2017-04-09 at 14:00. The Scheduler engine calculates execution
occurrences from the start time. Any instances in the past are discarded. The engine uses the next instance that
occurs in the future. In this scenario, the start time is 2017-04-07 at 2:00 PM. The next instance is two days from
that time, which is on 2017-04-09 at 2:00 PM.
The first execution time is the same even whether startTime is 2017-04-05 14:00 or 2017-04-01 14:00. After
the first execution, subsequent executions are calculated by using the schedule. Therefore, the subsequent
executions are on 2017-04-11 at 2:00 PM, then on 2017-04-13 at 2:00 PM, then on 2017-04-15 at 2:00 PM,
and so on.
Finally, when hours or minutes aren’t set in the schedule for a trigger, the hours or minutes of the first execution
are used as defaults.
schedule property
You can use schedule to limit the number of trigger executions. For example, if a trigger with a monthly
frequency is scheduled to run only on day 31, the trigger runs only in those months that have a thirty-first day.
You can also use schedule to expand the number of trigger executions. For example, a trigger with a monthly
frequency that's scheduled to run on month days 1 and 2, runs on the first and second days of the month, rather
than once a month.
If multiple schedule elements are specified, the order of evaluation is from the largest to the smallest schedule
setting: week number, month day, week day, hour, minute.
The following table describes the schedule elements in detail:

minutes - Minutes of the hour at which the trigger runs. Valid values: an integer, or an array of integers.

hours - Hours of the day at which the trigger runs. Valid values: an integer, or an array of integers.

weekDays - Days of the week on which the trigger runs. The value can be specified only with a weekly frequency. Valid values: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday, or an array of day values (maximum array size is 7). Day values are not case-sensitive.

monthlyOccurrences - Days of the month on which the trigger runs. The value can be specified with a monthly frequency only. Valid values: an array of monthlyOccurrence objects of the form { "day": day, "occurrence": occurrence }. The day attribute is the day of the week on which the trigger runs; for example, a monthlyOccurrences property with a day value of {Sunday} means every Sunday of the month. The day attribute is required. The occurrence attribute is the occurrence of the specified day during the month; for example, a monthlyOccurrences property with day and occurrence values of {Sunday, -1} means the last Sunday of the month. The occurrence attribute is optional.

monthDays - Day of the month on which the trigger runs. The value can be specified with a monthly frequency only. Valid values: any value <= -1 and >= -31, any value >= 1 and <= 31, or an array of values.

Tumbling window trigger


Tumbling window triggers are a type of trigger that fires at a periodic time interval from a specified start time,
while retaining state. Tumbling windows are a series of fixed-sized, non-overlapping, and contiguous time
intervals.
For more information about tumbling window triggers and for examples, see Create a tumbling window trigger.

Event-based trigger
An event-based trigger runs pipelines in response to an event, such as the arrival of a file, or the deletion of a
file, in Azure Blob Storage.
For more information about event-based triggers, see Create a trigger that runs a pipeline in response to an
event.

Examples of trigger recurrence schedules


This section provides examples of recurrence schedules. It focuses on the schedule object and its elements.
The examples assume that the interval value is 1, and that the frequency value is correct according to the
schedule definition. For example, you can't have a frequency value of "day" and also have a monthDays
modification in the schedule object. These kinds of restrictions are described in the table in the preceding
section.

EXAMPLE DESCRIPTION

{"hours":[5]} Run at 5:00 AM every day.

{"minutes":[15], "hours":[5]} Run at 5:15 AM every day.

{"minutes":[15], "hours":[5,17]} Run at 5:15 AM and 5:15 PM every day.

{"minutes":[15,45], "hours":[5,17]} Run at 5:15 AM, 5:45 AM, 5:15 PM, and 5:45 PM every day.

{"minutes":[0,15,30,45]} Run every 15 minutes.


{"hours":[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]}   Run every hour.
This trigger runs every hour. The minutes are controlled by the startTime value, when a value is specified. If a value isn't specified, the minutes are controlled by the creation time. For example, if the start time or creation time (whichever applies) is 12:25 PM, the trigger runs at 00:25, 01:25, 02:25, ..., and 23:25.
This schedule is equivalent to having a trigger with a frequency value of "hour", an interval value of 1, and no schedule. This schedule can be used with different frequency and interval values to create other triggers. For example, when the frequency value is "month", the schedule runs only once a month, rather than every day when the frequency value is "day".

{"minutes":[0]}   Run every hour on the hour.
This trigger runs every hour on the hour starting at 12:00 AM, 1:00 AM, 2:00 AM, and so on.
This schedule is equivalent to a trigger with a frequency value of "hour" and a startTime value of zero minutes, and no schedule but a frequency value of "day". If the frequency value is "week" or "month", the schedule executes one day a week or one day a month only, respectively.

{"minutes":[15]}   Run at 15 minutes past every hour.
This trigger runs every hour at 15 minutes past the hour starting at 00:15 AM, 1:15 AM, 2:15 AM, and so on, and ending at 11:15 PM.

{"hours":[17], "weekDays":["saturday"]} Run at 5:00 PM on Saturdays every week.

{"hours":[17], "weekDays":["monday", "wednesday", Run at 5:00 PM on Monday, Wednesday, and Friday every
"friday"]} week.

{"minutes":[15,45], "hours":[17], "weekDays": Run at 5:15 PM and 5:45 PM on Monday, Wednesday, and
["monday", "wednesday", "friday"]} Friday every week.

{"minutes":[0,15,30,45], "weekDays":["monday", Run every 15 minutes on weekdays.


"tuesday", "wednesday", "thursday", "friday"]}

{"minutes":[0,15,30,45], "hours": [9, 10, 11, 12, Run every 15 minutes on weekdays between 9:00 AM and
13, 14, 15, 16] "weekDays":["monday", "tuesday", 4:45 PM.
"wednesday", "thursday", "friday"]}

{"weekDays":["tuesday", "thursday"]} Run on Tuesdays and Thursdays at the specified start time.

{"minutes":[0], "hours":[6], "monthDays":[28]} Run at 6:00 AM on the twenty-eighth day of every month
(assuming a frequency value of "month").
EXAMPLE DESCRIPTION

{"minutes":[0], "hours":[6], "monthDays":[-1]} Run at 6:00 AM on the last day of the month.

To run a trigger on the last day of a month, use -1 instead of


day 28, 29, 30, or 31.

{"minutes":[0], "hours":[6], "monthDays":[1,-1]} Run at 6:00 AM on the first and last day of every month.

{monthDays":[1,14]} Run on the first and fourteenth day of every month at the
specified start time.

{"minutes":[0], "hours":[5], "monthlyOccurrences": Run on the first Friday of every month at 5:00 AM.
[{"day":"friday", "occurrence":1}]}

{"monthlyOccurrences":[{"day":"friday", Run on the first Friday of every month at the specified start
"occurrence":1}]} time.

{"monthlyOccurrences":[{"day":"friday", Run on the third Friday from the end of the month, every
"occurrence":-3}]} month, at the specified start time.

{"minutes":[15], "hours":[5], "monthlyOccurrences": Run on the first and last Friday of every month at 5:15 AM.
[{"day":"friday", "occurrence":1},{"day":"friday",
"occurrence":-1}]}

{"monthlyOccurrences":[{"day":"friday", Run on the first and last Friday of every month at the
"occurrence":1},{"day":"friday", "occurrence":-1}]} specified start time.

{"monthlyOccurrences":[{"day":"friday", Run on the fifth Friday of every month at the specified start
"occurrence":5}]} time.

When there's no fifth Friday in a month, the pipeline doesn't


run. To run the trigger on the last occurring Friday of the
month, consider using -1 instead of 5 for the occurrence
value.

{"minutes":[0,15,30,45], "monthlyOccurrences": Run every 15 minutes on the last Friday of the month.
[{"day":"friday", "occurrence":-1}]}

{"minutes":[15,45], "hours":[5,17], Run at 5:15 AM, 5:45 AM, 5:15 PM, and 5:45 PM on the
"monthlyOccurrences":[{"day":"wednesday", third Wednesday of every month.
"occurrence":3}]}

Trigger type comparison


The tumbling window trigger and the schedule trigger both operate on time heartbeats. How are they different?
The following table provides a comparison of the tumbling window trigger and schedule trigger:

Backfill scenarios
Tumbling window trigger: Supported. Pipeline runs can be scheduled for windows in the past.
Schedule trigger: Not supported. Pipeline runs can be executed only on time periods from the current time and the future.

Reliability
Tumbling window trigger: 100% reliability. Pipeline runs can be scheduled for all windows from a specified start date without gaps.
Schedule trigger: Less reliable.

Retry capability
Tumbling window trigger: Supported. Failed pipeline runs have a default retry policy of 0, or a policy that's specified by the user in the trigger definition. Automatically retries when pipeline runs fail due to concurrency/server/throttling limits (that is, status codes 400: User Error, 429: Too many requests, and 500: Internal Server error).
Schedule trigger: Not supported.

Concurrency
Tumbling window trigger: Supported. Users can explicitly set concurrency limits for the trigger. Allows between 1 and 50 concurrent triggered pipeline runs.
Schedule trigger: Not supported.

System variables
Tumbling window trigger: Supports the use of the WindowStart and WindowEnd system variables. Users can access triggerOutputs().windowStartTime and triggerOutputs().windowEndTime as trigger system variables in the trigger definition. The values are used as the window start time and window end time, respectively. For example, for a tumbling window trigger that runs every hour, for the window 1:00 AM to 2:00 AM, the definition is triggerOutputs().WindowStartTime = 2017-09-01T01:00:00Z and triggerOutputs().WindowEndTime = 2017-09-01T02:00:00Z.
Schedule trigger: Not supported.

Pipeline-to-trigger relationship
Tumbling window trigger: Supports a one-to-one relationship. Only one pipeline can be triggered.
Schedule trigger: Supports many-to-many relationships. Multiple triggers can kick off a single pipeline. A single trigger can kick off multiple pipelines.

Next steps
See the following tutorials:
Quickstart: Create a data factory by using the .NET SDK
Create a schedule trigger
Create a tumbling window trigger
Integration runtime in Azure Data Factory
5/31/2019 • 11 minutes to read • Edit Online

The Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide the
following data integration capabilities across different network environments:
Data Flow: Execute a Data Flow in a managed Azure compute environment.
Data movement: Copy data across data stores in a public network and data stores in a private
network (on-premises or virtual private network). It provides support for built-in connectors, format
conversion, column mapping, and performant and scalable data transfer.
Activity dispatch: Dispatch and monitor transformation activities running on a variety of compute
services such as Azure Databricks, Azure HDInsight, Azure Machine Learning, Azure SQL Database,
SQL Server, and more.
SSIS package execution: Natively execute SQL Server Integration Services (SSIS) packages in a
managed Azure compute environment.
In Data Factory, an activity defines the action to be performed. A linked service defines a target data
store or a compute service. An integration runtime provides the bridge between the activity and linked
services. It is referenced by the linked service or activity, and provides the compute environment where
the activity either runs or gets dispatched from. This way, the activity can be performed in the region
closest to the target data store or compute service, in the most performant way, while meeting
security and compliance needs.

Integration runtime types


Data Factory offers three types of Integration Runtime. Choose the type that best serves the data
integration capabilities and network environment needs that you are looking for. These three types
are:
Azure
Self-hosted
Azure-SSIS
The following table describes the capabilities and network support for each of the integration runtime
types:

Azure
Public network: Data Flow, Data movement, Activity dispatch

Self-hosted
Public network: Data movement, Activity dispatch
Private network: Data movement, Activity dispatch

Azure-SSIS
Public network: SSIS package execution
Private network: SSIS package execution
The following diagram shows how the different integration runtimes can be used in combination to
offer rich data integration capabilities and network support:
Azure integration runtime
An Azure integration runtime is capable of:
Running Data Flows in Azure
Running copy activity between cloud data stores
Dispatching the following transform activities in public network: Databricks Notebook/Jar/Python
activities, HDInsight Hive activity, HDInsight Pig activity, HDInsight MapReduce activity, HDInsight
Spark activity, HDInsight Streaming activity, Machine Learning Batch Execution activity, Machine
Learning Update Resource activity, Stored Procedure activity, Data Lake Analytics U-SQL activity,
.NET custom activity, Web activity, Lookup activity, and Get Metadata activity.
Azure IR network environment
Azure Integration Runtime supports connecting to data stores and compute services with publicly
accessible endpoints. Use a self-hosted integration runtime for an Azure Virtual Network environment.
Azure IR compute resource and scaling
Azure integration runtime provides a fully managed, serverless compute in Azure. You don't have to
worry about infrastructure provisioning, software installation, patching, or capacity scaling. In addition,
you only pay for the duration of the actual utilization.
Azure integration runtime provides the native compute to move data between cloud data stores in a
secure, reliable, and high-performance manner. You can set how many data integration units to use on
the copy activity, and the compute size of the Azure IR is elastically scaled up accordingly without you
having to explicitly adjust the size of the Azure Integration Runtime.
Activity dispatch is a lightweight operation that routes the activity to the target compute service, so there
is no need to scale up the compute size for this scenario.
For information about creating and configuring an Azure IR, see How to create and configure Azure IR
under how to guides.

NOTE
The Azure Integration Runtime has properties related to the Data Flow runtime, which define the underlying compute
infrastructure that is used to run the data flows.
Self-hosted integration runtime
A self-hosted IR is capable of:
Running copy activity between cloud data stores and a data store in a private network.
Dispatching the following transform activities against compute resources in an on-premises network or an Azure
Virtual Network: HDInsight Hive activity (BYOC, Bring Your Own Cluster), HDInsight Pig activity
(BYOC), HDInsight MapReduce activity (BYOC), HDInsight Spark activity (BYOC), HDInsight
Streaming activity (BYOC), Machine Learning Batch Execution activity, Machine Learning Update
Resource activity, Stored Procedure activity, Data Lake Analytics U-SQL activity, .NET custom
activity, Lookup activity, and Get Metadata activity.

NOTE
Use a self-hosted integration runtime to support data stores that require a bring-your-own driver, such as SAP
HANA, MySQL, and so on. For more information, see supported data stores.

Self-hosted IR network environment


If you want to perform data integration securely in a private network environment, which does not
have a direct line of sight from the public cloud environment, you can install a self-hosted IR in an on-premises
environment behind your corporate firewall, or inside a virtual private network. The self-hosted
integration runtime makes only outbound HTTP-based connections to the open internet.
Self-hosted IR compute resource and scaling
Self-hosted IR needs to be installed on an on-premises machine or a virtual machine inside a private
network. Currently, we only support running the self-hosted IR on a Windows operating system.
For high availability and scalability, you can scale out the self-hosted IR by associating the logical
instance with multiple on-premises machines in active-active mode. For more information, see the
how to create and configure a self-hosted IR article under the how-to guides.
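
For orientation, the logical self-hosted IR resource itself is only a small JSON definition; the compute comes from the machines on which you install and register the self-hosted IR. The name and description below are hypothetical placeholders.

{
    "name": "CorpNetworkSelfHostedIR",
    "properties": {
        "type": "SelfHosted",
        "description": "Runs on on-premises machines behind the corporate firewall"
    }
}

After creating this logical resource, you install the self-hosted integration runtime on one or more Windows machines and register each node against it; registering multiple nodes is what enables the active-active scale-out described above.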

Azure-SSIS Integration Runtime


To lift and shift existing SSIS workload, you can create an Azure-SSIS IR to natively execute SSIS
packages.
Azure-SSIS IR network environment
Azure-SSIS IR can be provisioned in either public network or private network. On-premises data
access is supported by joining Azure-SSIS IR to a Virtual Network that is connected to your on-
premises network.
Azure-SSIS IR compute resource and scaling
Azure-SSIS IR is a fully managed cluster of Azure VMs dedicated to running your SSIS packages. You can
bring your own Azure SQL Database or Managed Instance server to host the catalog of SSIS
projects/packages (SSISDB) that will be attached to it. You can scale up the power of the
compute by specifying node size and scale it out by specifying the number of nodes in the cluster. You
can manage the cost of running your Azure-SSIS Integration Runtime by stopping and starting it as
you see fit.
For more information, see the how to create and configure Azure-SSIS IR article under the how-to guides.
Once created, you can deploy and manage your existing SSIS packages with little to no change using
familiar tools such as SQL Server Data Tools (SSDT) and SQL Server Management Studio (SSMS ),
just like using SSIS on premises.
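
To make the node size, node count, and SSISDB catalog settings concrete, the following is a hedged sketch of an Azure-SSIS IR definition as it might be submitted through the REST API or an ARM template. Every value shown (the IR name, region, node size, node counts, and catalog server details) is a placeholder assumption, not a recommendation.

{
    "name": "MyAzureSsisIR",
    "properties": {
        "type": "Managed",
        "description": "Azure-SSIS IR for lifted-and-shifted SSIS packages",
        "typeProperties": {
            "computeProperties": {
                "location": "West Europe",
                "nodeSize": "Standard_D4_v3",
                "numberOfNodes": 2,
                "maxParallelExecutionsPerNode": 4
            },
            "ssisProperties": {
                "catalogInfo": {
                    "catalogServerEndpoint": "<server>.database.windows.net",
                    "catalogAdminUserName": "<admin user>",
                    "catalogAdminPassword": {
                        "type": "SecureString",
                        "value": "<password>"
                    },
                    "catalogPricingTier": "S1"
                }
            }
        }
    }
}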
For more information about Azure-SSIS runtime, see the following articles:
Tutorial: deploy SSIS packages to Azure. This article provides step-by-step instructions to create an
Azure-SSIS IR and uses an Azure SQL database to host the SSIS catalog.
How to: Create an Azure-SSIS integration runtime. This article expands on the tutorial and provides
instructions on using Azure SQL Database Managed Instance and joining the IR to a virtual
network.
Monitor an Azure-SSIS IR. This article shows you how to retrieve information about an Azure-SSIS
IR and descriptions of statuses in the returned information.
Manage an Azure-SSIS IR. This article shows you how to stop, start, or remove an Azure-SSIS IR.
It also shows you how to scale out your Azure-SSIS IR by adding more nodes to the IR.
Join an Azure-SSIS IR to a virtual network. This article provides conceptual information about
joining an Azure-SSIS IR to an Azure virtual network. It also provides steps to use Azure portal to
configure virtual network so that Azure-SSIS IR can join the virtual network.

Integration runtime location


The Data Factory location is where the metadata of the data factory is stored and where the triggering
of the pipeline is initiated from. Meanwhile, a data factory can access data stores and compute services
in other Azure regions to move data between data stores or process data using compute services. This
behavior is realized through the globally available IR to ensure data compliance, efficiency, and reduced
network egress costs.
The IR Location defines the location of its back-end compute, and essentially the location where the
data movement, activity dispatching, and SSIS package execution are performed. The IR location can
be different from the location of the data factory it belongs to.
Azure IR location
You can set a certain location of an Azure IR, in which case the data movement or activity dispatch will
happen in that specific region.
If you choose to use the auto-resolve Azure IR, which is the default:
For copy activity, ADF will make a best effort to automatically detect your sink and source data
stores to choose the best location, either in the same region if available or the closest one in the
same geography; if the location is not detectable, the data factory region is used as an alternative.
For Lookup/GetMetadata/Delete activity execution (also known as Pipeline activities),
transformation activity dispatching (also known as External activities), and authoring operations
(test connection, browse folder list and table list, preview data), ADF will use the IR in the data
factory region.
For Data Flow, ADF will use the IR in the data factory region.

TIP
A good practice is to ensure that your Data Flow runs in the same region as your corresponding data
stores (if possible). You can achieve this either with the auto-resolve Azure IR (if the data store location is the same as
the Data Factory location), or by creating a new Azure IR instance in the same region as your data stores
and then executing the data flow on it.

You can monitor which IR location takes effect during activity execution in the pipeline activity monitoring
view in the UI, or in the activity monitoring payload.
TIP
If you have strict data compliance requirements and need to ensure that data does not leave a certain geography,
you can explicitly create an Azure IR in a certain region and point the linked service to this IR using the connectVia
property. For example, if you want to copy data from Blob storage in UK South to SQL DW in UK South and want to
ensure data does not leave the UK, create an Azure IR in UK South and link both linked services to this IR.
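
As a sketch of the pattern in the tip above, the first JSON below pins an Azure IR to UK South, and the second shows a linked service that routes its traffic through that IR via the connectVia reference. The resource names and the connection string are hypothetical placeholders.

An Azure IR pinned to a specific region:

{
    "name": "AzureIR-UKSouth",
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "UK South"
            }
        }
    }
}

A linked service that uses this IR through connectVia:

{
    "name": "UKSouthBlobStorage",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "<storage connection string>"
            }
        },
        "connectVia": {
            "referenceName": "AzureIR-UKSouth",
            "type": "IntegrationRuntimeReference"
        }
    }
}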

Self-hosted IR location
The self-hosted IR is logically registered to the Data Factory, and the compute used to support its
functionality is provided by you. Therefore, there is no explicit location property for a self-hosted IR.
When used to perform data movement, the self-hosted IR extracts data from the source and writes it to
the destination.
Azure-SSIS IR location
Selecting the right location for your Azure-SSIS IR is essential to achieve high performance in your
extract-transform-load (ETL ) workflows.
The location of your Azure-SSIS IR does not need to be the same as the location of your data factory,
but it should be the same as the location of your own Azure SQL Database/Managed Instance
server where SSISDB is to be hosted. This way, your Azure-SSIS Integration Runtime can easily
access SSISDB without incurring excessive traffic between different locations.
If you do not have an existing Azure SQL Database/Managed Instance server to host SSISDB, but
you have on-premises data sources/destinations, you should create a new Azure SQL
Database/Managed Instance server in the same location as a virtual network connected to your on-
premises network. This way, you can create your Azure-SSIS IR using the new Azure SQL
Database/Managed Instance server and join that virtual network, all in the same location,
effectively minimizing data movement across different locations.
If the location of your existing Azure SQL Database/Managed Instance server where SSISDB is
hosted is not the same as the location of a virtual network connected to your on-premises network,
first create your Azure-SSIS IR using the existing Azure SQL Database/Managed Instance server
and join another virtual network in the same location, and then configure a virtual network to
virtual network connection between the two locations.
The following diagram shows the location settings of a Data Factory and its integration runtimes:

Determining which IR to use


Copy activity
The Copy activity requires source and sink linked services to define the direction of data flow. The
following logic is used to determine which integration runtime instance is used to perform the copy:
Copying between two cloud data sources: when both source and sink linked services are using
an Azure IR, ADF uses the regional Azure IR if you specified one, or automatically determines the location of the Azure
IR if you chose the auto-resolve IR (the default), as described in the Integration runtime location section.
Copying between a cloud data source and a data source in private network: if either source
or sink linked service points to a self-hosted IR, the copy activity is executed on that self-hosted
Integration Runtime.
Copying between two data sources in private network: both the source and sink Linked
Service must point to the same instance of integration runtime, and that integration runtime is used
to execute the copy Activity.
Lookup and GetMetadata activity
The Lookup and GetMetadata activities are executed on the integration runtime associated with the data
store linked service.
Transformation activity
Each transformation activity has a target compute Linked Service, which points to an integration
runtime. This integration runtime instance is where the transformation activity is dispatched from.
Data Flow activity
The Data Flow activity is executed on the integration runtime associated with it.

Next steps
See the following articles:
Create Azure integration runtime
Create self-hosted integration runtime
Create an Azure-SSIS integration runtime. This article expands on the tutorial and provides
instructions on using Azure SQL Database Managed Instance and joining the IR to a virtual
network.
What are Mapping Data Flows?
5/6/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Mapping Data Flows are visually designed data transformations in Azure Data Factory. Data Flows allow data
engineers to develop graphical data transformation logic without writing code. The resulting data flows are
executed as activities within Azure Data Factory pipelines using scaled-out Azure Databricks clusters.
The intent of Azure Data Factory Data Flow is to provide a fully visual experience with no coding required. Your
Data Flows will execute on your own execution cluster for scaled-out data processing. Azure Data Factory
handles all of the code translation, path optimization, and execution of your data flow jobs.
Start by creating data flows in Debug mode so that you can validate your transformation logic interactively. Next,
add a Data Flow activity to your pipeline to execute and test your data flow in pipeline debug, or use "Trigger
Now" in the pipeline to test your Data Flow from a pipeline Activity.
You will then schedule and monitor your data flow activities by using Azure Data Factory pipelines that execute
the Data Flow activity.
The Debug Mode toggle switch on the Data Flow design surface allows interactive building of data
transformations. Debug Mode provides a data prep environment for data flow construction.
Mapping Data Flow Debug Mode
5/23/2019 • 3 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Azure Data Factory Mapping Data Flow has a debug mode, which can be switched on with the Data Flow Debug
button at the top of the design surface. When designing data flows, setting debug mode on will allow you to
interactively watch the data shape transform while you build and debug your data flows. The Debug session can be
used both in Data Flow design sessions as well as during pipeline debug execution of data flows.

Overview
When Debug mode is on, you will interactively build your data flow with an active Spark cluster. The session will
close once you turn debug off in Azure Data Factory. You should be aware of the hourly charges incurred by Azure
Databricks during the time that you have the debug session turned on.
In most cases, it is a good practice to build your Data Flows in debug mode so that you can validate your business
logic and view your data transformations before publishing your work in Azure Data Factory. You should also use
the "Debug" button on the pipeline panel to test your data flow inside of a pipeline.

NOTE
While the debug mode light is green on the Data Factory toolbar, you will be charged at the Data Flow debug rate of 8
cores/hr of general compute with a 60-minute time-to-live.

Debug mode on
When you switch on debug mode, you will be prompted with a side-panel form that asks you to point to
your interactive Azure Databricks cluster and select options for source sampling. You must use an interactive
cluster from Azure Databricks and either select a sampling size for each of your Source transforms, or pick a text
file to use for your test data.
NOTE
When running in Debug Mode in Data Flow, your data will not be written to the Sink transform. A Debug session is intended
to serve as a test harness for your transformations. Sinks are not required during debug and are ignored in your data flow.
If you wish to test writing the data in your Sink, execute the Data Flow from an Azure Data Factory Pipeline and use the
Debug execution from a pipeline.

Debug settings
Each Source from your Data Flow will appear in the side panel and can also be edited by
selecting "source settings" on the Data Flow designer toolbar. You can select the limits and/or the file source to use for
each of your Source transformations here. The row limits in this setting are only for the current debug session. You can
also use the Sampling setting in the Source to limit the rows coming into the Source transformation.

Cluster status
There is a cluster status indicator at the top of the design surface that will turn green when the cluster is ready for
debug. If your cluster is already warm, then the green indicator will appear almost instantly. If your cluster was not
already running when you entered debug mode, then you will have to wait 5-7 minutes for the cluster to spin up.
The indicator light will be yellow until it is ready. Once your cluster is ready for Data Flow debug, the indicator light
will turn green.
When you are finished with your debugging, turn the Debug switch off so that your Azure Databricks cluster can
terminate and you will no longer be billed for debug activity.

Data preview
With debug on, the Data Preview tab will light-up on the bottom panel. Without debug mode on, Data Flow will
show you only the current metadata in and out of each of your transformations in the Inspect tab. The data preview
will only query the number of rows that you have set as your limit in your debug settings. You may need to click
"Fetch data" to refresh the data preview.

Data profiles
Selecting individual columns in your data preview tab will pop up a chart on the far right of your data grid with
detailed statistics about each field. Azure Data Factory will make a determination based upon the data sampling of
which type of chart to display. High-cardinality fields will default to NULL / NOT NULL charts while categorical
and numeric data that has low cardinality will display bar charts showing data value frequency. You will also see
the maximum and minimum length of string fields, minimum and maximum values in numeric fields, standard deviation, percentiles, counts, and averages.

Next steps
Once you are finished building and debugging your data flow, execute it from a pipeline.
When testing your pipeline with a data flow, use the pipeline Debug run execution option.
Mapping Data Flow Schema Drift
4/12/2019 • 3 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Schema Drift is the case where your sources often change metadata. Fields, columns, types, and so on can
be added, removed, or changed on the fly. Without handling for Schema Drift, your Data Flow becomes vulnerable
to upstream data source changes. When incoming columns and fields change, typical ETL patterns fail
because they tend to be tied to those source names.
In order to protect against Schema Drift, it is important to have the facilities in a Data Flow tool to allow you, as a
Data Engineer, to:
Define sources that have mutable field names, data types, values and sizes
Define transformation parameters that can work with data patterns instead of hard-coded fields and values
Define expressions that understand patterns to match incoming fields, instead of using named fields

How to implement schema drift


Choose "Allow Schema Drift" in your Source Transformation

When you've selected this option, all incoming fields will be read from your source on every Data Flow
execution and will be passed through the entire flow to the Sink.
Make sure to use "Auto-Map" to map all new fields in the Sink Transformation so that all new fields get
picked-up and landed in your destination:

Everything will work when new fields are introduced in that scenario with a simple Source -> Sink (aka
Copy) mapping.
To add transformations in that workflow that handles schema drift, you can use pattern matching to match
columns by name, type, and value.
Click on "Add Column Pattern" in the Derived Column or Aggregate transformation if you wish to create a
transformation that understands "Schema Drift".

NOTE
You need to make an architectural decision in your data flow to accept schema drift throughout your flow. When you do this,
you can protect against schema changes from the sources. However, you will lose early-binding of your columns and types
throughout your data flow. Azure Data Factory treats schema drift flows as late-binding flows, so when you build your
transformations, the column names will not be available to you in the schema views throughout the flow.

In the Taxi Demo sample Data Flow, there is a sample Schema Drift in the bottom data flow with the TripFare
source. In the Aggregate transformation, notice that we are using the "column pattern" design for the aggregation
fields. Instead of naming specific columns, or looking for columns by position, we assume that the data can change
and may not appear in the same order between runs.
In this example of Azure Data Factory Data Flow schema drift handling, we've built an aggregation that scans for
columns of type 'double', knowing that the data domain contains prices for each trip. We can then perform an
aggregate math calculation across all double fields in the source, regardless of where the column lands and
regardless of the column's naming.
The Azure Data Factory Data Flow syntax uses $$ to represent each matched column from your matching pattern.
You can also match on column names using complex string search and regular expression functions. In this case,
we are going to create a new aggregated field name based on each match of a 'double' type of column and append
the text _total to each of those matched names:
concat($$, '_total')

Then, we will round and sum the values for each of those matched columns:
round(sum($$))

You can test this out with the Azure Data Factory Data Flow sample "Taxi Demo". Switch on the Debug session
using the Debug toggle at the top of the Data Flow design surface so that you can see your results interactively:
Access new columns downstream
When you generate new columns with column patterns, you can access those new columns later in your data flow
transformations using the "byName" expression function.

Next steps
In the Data Flow Expression Language you will find additional facilities for column patterns and schema drift
including "byName" and "byPosition".
Azure Data Factory Mapping Data Flow
Transformation Inspect Tab
2/22/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

The Inspect Pane provides a view into the metadata of the data stream that you're transforming. You will be able to
see the column counts, columns changed, columns added, data types, column ordering, and column references.
"Inspect" is a read-only view of your metadata. You do not need to have Debug mode enabled in order to see
metadata in the Inspect Pane.
As you change the shape of your data through transformations, you will see the metadata changes flow through
the Inspect Pane. If there is not a defined schema in your Source transformation, then metadata will not be visible
in the Inspect Pane. Lack of metadata is common in Schema Drift scenarios.
Data Preview is a pane that provides a read-only view of your data as it is being transformed. You can view the
output of your transformation and expressions to validate your data flow. You must have the Debug mode
switched-on to see data previews. When you click on columns in the data preview grid, you will see a subsequent
panel to the right. The pop-out panel will show the profile statistics about each of the columns that you select.
Azure data factory mapping data flows column
patterns
5/31/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Several Azure Data Factory Data Flow transformations support the idea of "Column Patterns" so that you can
create template columns based on patterns instead of hard-coded column names. You can use this feature within
the Expression Builder to define patterns to match columns for transformation instead of requiring exact, specific
field names. Patterns are useful if incoming source fields change often, particularly in the case of changing columns
in text files or NoSQL databases. This condition is sometimes referred to as "Schema Drift".

Column patterns are useful for handling both Schema Drift scenarios as well as general scenarios. It is good for
conditions where you are not able to fully know each column name. You can pattern match on column name and
column data type and build an expression for transformation that will perform that operation against any field in
the data stream that matches your name & type patterns.
When adding an expression to a transform that accepts patterns, choose "Add Column Pattern". Column Patterns
allows schema drift column matching patterns.
When building template column patterns, use $$ in the expression to represent a reference to each matched field
from the input data stream.
If you choose to use one of the Expression Builder regex functions, you can then subsequently use $1, $2, $3 ... to
reference the sub-patterns matched from your regex expression.
An example of Column Pattern scenario is using SUM with a series of incoming fields. The aggregate SUM
calculations are in the Aggregate transformation. You can then use SUM on every match of field types that match
"integer" and then use $$ to reference each match in your expression.
Monitor Data Flows
2/22/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

After you have completed building and debugging your data flow, you will want to schedule your data flow to
execute on a schedule within the context of a pipeline. You can schedule the pipeline from Azure Data Factory using
Triggers. Or you can use the Trigger Now option from the Azure Data Factory Pipeline Builder to execute a single-
run execution to test your data flow within the pipeline context.
When you execute your pipeline, you will be able to monitor the pipeline and all of the activities contained in the
pipeline including the Data Flow activity. Click on the monitor icon in the left-hand Azure Data Factory UI panel.
You will see a screen similar to the one below. The highlighted icons will allow you to drill into the activities in the
pipeline, including the Data Flow activity.

You will see stats at this level as well, including the run times and status. The Run ID at the activity level is different
from the Run ID at the pipeline level; the Run ID at the pipeline level is for the pipeline itself. Clicking the eyeglasses will
give you deep details on your data flow execution.

When you are in the graphical node monitoring view, you will see a simplified, view-only version of your data flow
graph.
View Data Flow Execution Plans
When your Data Flow is executed in Databricks, Azure Data Factory determines optimal code paths based on the
entirety of your data flow. Additionally, the execution paths may occur on different scale-out nodes and data
partitions. Therefore, the monitoring graph represents the design of your flow, taking into account the execution
path of your transformations. When you click on individual nodes, you will see "groupings" that represent code
that was executed together on the cluster. The timings and counts that you see represent those groups as opposed
to the individual steps in your design.
When you click on the open space in the monitoring window, the stats in the bottom pane will display
timing and row counts for each Sink and the transformations that led to the sink data for transformation
lineage.
When you select individual transformations, you will receive additional feedback on the right-hand panel
that shows partition stats, column counts, skewness (how evenly the data is distributed across partitions),
and kurtosis (how spiky the data is).
When you click on the Sink in the node view, you will see column lineage. There are three different methods
that columns are accumulated throughout your data flow to land in the Sink. They are:
Computed: You use the column for conditional processing or within an expression in your data flow, but
do not land it in the Sink
Derived: The column is a new column that you generated in your flow, i.e. it was not present in the
Source
Mapped: The column originated from the source and you are mapping it to a sink field

Monitor Icons
This icon means that the transformation data was already cached on the cluster, so the timings and execution path
have taken that into account:

You will also see green circle icons in the transformation. They represent a count of the number of sinks that data is
flowing into.
Mapping data flows performance and tuning guide
6/3/2019 • 6 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Azure Data Factory Mapping Data Flows provide a code-free browser interface to design, deploy, and orchestrate
data transformations at scale.

NOTE
If you are not familiar with ADF Mapping Data Flows in general, see Data Flows Overview before reading this article.

NOTE
When you are designing and testing Data Flows from the ADF UI, make sure to turn on the Debug switch so that you can
execute your data flows in real-time without waiting for a cluster to warm up.

Monitor data flow performance


While designing your mapping data flows in the browser, you can unit test each individual transformation by
clicking on the data preview tab in the bottom settings pane for each transformation. The next step you should take
is to test your data flow end-to-end in the pipeline designer. Add an Execute Data Flow activity and use the Debug
button to test the performance of your data flow. In the bottom pane of the pipeline window, you will see an
eyeglass icon under "actions":

Clicking that icon will display the execution plan and subsequent performance profile of your data flow. You can use
this information to estimate the performance of your data flow against different sized data sources. Note that you
can assume 1 minute of cluster job execution set-up time in your overall performance calculations and if you are
using the default Azure Integration Runtime, you may need to add 5 minutes of cluster spin-up time as well.
Optimizing for Azure SQL Database and Azure SQL Data Warehouse

Partition your source data


Go to "Optimize" and select "Source". Set either a specific table column or a type in a query.
If you chose "column", then pick the partition column.
Also, set the maximum number of connections to your Azure SQL DB. You can try a higher setting to gain
parallel connections to your database. However, some cases may result in faster performance with a limited
number of connections.
Your source database tables do not need to be partitioned.
Setting a query in your Source transformation that matches the partitioning scheme of your database table will
allow the source database engine to leverage partition elimination.
If your source is not already partitioned, ADF will still use data partitioning in the Spark transformation
environment based on the key that you select in the Source transformation.
Set batch size and query on source

Setting batch size will instruct ADF to store data in sets in memory instead of row-by-row. It is an optional
setting and you may run out of resources on the compute nodes if they are not sized properly.
Setting a query can allow you to filter rows right at the source before they even arrive for Data Flow for
processing, which can make the initial data acquisition faster.
If you use a query, you can add optional query hints for your Azure SQL DB, i.e. READ UNCOMMITTED
Set sink batch size

In order to avoid row-by-row processing of your data flows, set the "Batch size" in the sink settings for Azure
SQL DB. This will tell ADF to process database writes in batches based on the size provided.
Set partitioning options on your sink
Even if you don't have your data partitioned in your destination Azure SQL DB tables, go to the Optimize tab
and set partitioning.
Very often, simply telling ADF to use Round Robin partitioning on the Spark execution clusters results in much
faster data loading instead of forcing all connections from a single node/partition.
Increase size of your compute engine in Azure Integration Runtime
Increase the number of cores, which will increase the number of nodes, and provide you with more processing
power to query and write to your Azure SQL DB.
Try "Compute Optimized" and "Memory Optimized" options to apply more resources to your compute nodes.
Unit test and performance test with debug
When unit testing data flows, set the "Data Flow Debug" button to ON.
Inside of the Data Flow designer, use the Data Preview tab on transformations to view the results of your
transformation logic.
Unit test your data flows from the pipeline designer by placing a Data Flow activity on the pipeline design
canvas and use the "Debug" button to test.
Testing in debug mode will work against a live warmed cluster environment without the need to wait for a just-
in-time cluster spin-up.
Disable indexes on write
Use an ADF pipeline stored procedure activity prior to your Data Flow activity that disables indexes on your
target tables that are being written to from your Sink.
After your Data Flow activity, add another stored procedure activity that re-enables those indexes.
Increase the size of your Azure SQL DB
Schedule a resizing of your source and sink Azure SQL DB before you run your pipeline to increase the
throughput and minimize Azure throttling once you reach DTU limits.
After your pipeline execution is complete, you can resize your databases back to their normal run rate.

Optimizing for Azure SQL Data Warehouse


Use staging to load data in bulk via Polybase
In order to avoid row-by-row processing of your data flows, set the "Staging" option in the Sink settings so that
ADF can leverage Polybase to avoid row-by-row inserts into DW. This will instruct ADF to use Polybase so that
data can be loaded in bulk.
When you execute your data flow activity from a pipeline, with Staging turned on, you will need to select the
Blob store location of your staging data for bulk loading.
Increase the size of your Azure SQL DW
Schedule a resizing of your source and sink Azure SQL DW before you run your pipeline to increase the
throughput and minimize Azure throttling once you reach DWU limits.
After your pipeline execution is complete, you can resize your databases back to their normal run rate.

Optimize for files


You can control how many partitions that ADF will use. On each Source & Sink transformation, as well as each
individual transformation, you can set a partitioning scheme. For smaller files, you may find selecting "Single
Partition" can sometimes work better and faster than asking Spark to partition your small files.
If you do not have enough information about your source data, you can choose "Round Robin" partitioning and
set the number of partitions.
If you explore your data and find that you have columns that can be good hash keys, use the Hash partitioning
option.
File naming options
The default nature of writing transformed data in ADF Mapping Data Flows is to write to a dataset that has a
Blob or ADLS Linked Service. You should set that dataset to point to a folder or container, not a named file.
Data Flows use Azure Databricks Spark for execution, which means that your output will be split over multiple
files based on either default Spark partitioning or the partitioning scheme that you've explicitly chosen.
A very common operation in ADF Data Flows is to choose "Output to single file" so that all of your output PART
files are merged together into a single output file.
However, this operation requires that the output reduces to a single partition on a single cluster node.
Keep this in mind when choosing this popular option. You can run out of cluster node resources if you are
combining many large source files into a single output file partition.
To avoid exhausting compute node resources, you can keep the default or explicit partitioning scheme in ADF,
which optimizes for performance, and then add a subsequent Copy Activity in the pipeline that merges all of the
PART files from the output folder to a new single file. Essentially, this technique separates the action of
transformation from file merging and achieves the same result as setting "output to single file".

Next steps
See the other Data Flow articles:
Data Flow overview
Data Flow activity
Monitor Data Flow performance
Mapping Data Flow Move Nodes
5/10/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

The Azure Data Factory Data Flow design surface is a "construction" surface where you build data flows top-down,
left-to-right. There is a toolbox attached to each transform with a plus (+) symbol. Concentrate on your business
logic instead of connecting nodes via edges in a free-form DAG environment.
So, without a drag-and-drop paradigm, the way to "move" a transformation node is to change its incoming
stream.

Streams of data inside of data flow


In Azure Data Factory Data Flow, streams represent the flow of data. On the transformation settings pane, you will
see an "Incoming Stream" field. This tells you which incoming data stream is feeding that transformation. You can
change the physical location of your transform node on the graph by clicking the Incoming Stream name and
selecting another data stream. The current transformation along with all subsequent transforms on that stream will
then move to the new location.
If you are moving a transformation with one or more transformations after it, then the new location in the data
flow will be joined via a new branch.
If you have no subsequent transformations after the node you've selected, then only that transform will move to
the new location.

Next steps
After completing your Data Flow design, turn the debug button on and test it out in debug mode either directly in
the data flow designer or pipeline debug.
Mapping Data Flow Transformation Optimize Tab
2/22/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Each Data Flow transformation has an "Optimize" tab. The optimize tab contains optional settings to configure
partitioning schemes for data flows.

The default setting is "use current partitioning". Current Partitioning instructs Azure Data Factory to use the
partitioning scheme native to Data Flows running on Spark in Azure Databricks. Generally, this is the
recommended approach.
However, there are instances where you may wish to adjust the partitioning. For instance, if you want to output
your transformations to a single file in the lake, then choose "single partition" on the Optimize tab for partitioning in
the Sink Transformation.
Another case where you may wish to exercise control over the partitioning schemes being used for your data
transformations is in terms of performance. Adjusting the partitioning of data provides a level of control over the
distribution of your data across compute nodes and data locality optimizations that can have both positive as well
as negative effects on your overall data flow performance.
If you wish to change partitioning on any transformation, simply click the Optimize tab and select the "Set
Partitioning" radio button. You will then be presented with a series of options for partitioning. The best method of
partitioning to implement will differ based on your data volumes, candidate keys, null values and cardinality. Best
practice is to start with default partitioning and then try the different partitioning options. You can test using the
Debug run in Pipeline and then view the time spent in each transformation grouping as well as partition usage
from the Monitoring view.
Round Robin
Round Robin is a simple partitioning scheme that automatically distributes data equally across partitions. Use Round Robin
when you do not have good key candidates to implement a solid, smart partitioning strategy. You can set the
number of physical partitions.
Hash
Azure Data Factory will produce a hash of columns to produce uniform partitions such that rows with similar
values will fall in the same partition. When using the Hash option, test for possible partition skew. You can set the
number of physical partitions.
Dynamic Range
Dynamic Range will use Spark dynamic ranges based on the columns or expressions that you provide. You can set
the number of physical partitions.
Fixed Range
You must build an expression that provides a fixed range for values within your partitioned data columns. You
should have a good understanding of your data before using this option in order to avoid partition skew. The value
that you enter for the expression will be used as part of a partition function. You can set the number of physical
partitions.
Key
If you have a good understanding of the cardinality of your data, key partitioning may be a good partition strategy.
Key partitioning will create partitions for each unique value in your column. You cannot set the number of
partitions because the number will be based on unique values in the data.
Mapping Data Flow Expression Builder
4/9/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

In Azure Data Factory Mapping Data Flow, you'll find expression boxes where you can enter expressions for data
transformation. Use columns, fields, variables, parameters, functions from your data flow in these boxes. To build
the expression, use the Expression Builder, which is launched by clicking in the expression text box inside the
transformation. You'll also sometimes see "Computed Column" options when selecting columns for
transformation. When you click that, you'll also see the Expression Builder launched.

The Expression Builder tool defaults to the text editor option. The auto-complete feature reads from the entire
Azure Data Factory Data Flow object model with syntax checking and highlighting.
Currently Working on Field

At the top left of the Expression Builder UI, you will see a field called "Currently Working On" with the name of the
field that you are currently working on. The expression that you build in the UI will be applied just to that currently
working field. If you wish to transform another field, save your current work and use this drop-down to select
another field and build an expression for the other fields.

Data Preview in Debug mode


When you are working on your expressions, you can optionally switch on Debug mode from the Azure Data
Factory Data Flow design surface, enabling live in-progress preview of your data results from the expression that
you are building. Real-time live debugging is enabled for your expressions.

Comments
Add comments to your expressions using single line and multi-line comment syntax:

Regular Expressions
The Azure Data Factory Data Flow expression language, full reference documentation here, enables functions that
include regular expression syntax. When using regular expression functions, the Expression Builder will try to
interpret backslash (\) as an escape character sequence. When using backslashes in your regular expression, either
enclose the entire regex in ticks (`) or use a double backslash.
Example using ticks

regex_replace('100 and 200', `(\d+)`, 'digits')

or using double slash

regex_replace('100 and 200', '(\\d+)', 'digits')

Addressing array indexes


With expression functions that return arrays, use square brackets [] to address specific indexes inside that return
array object. The array is one-based.

Handling names with special characters


When you have column names that include special characters or spaces, surround the name with curly braces.
{[dbo].this_is my complex name$$$}

Next steps
Begin building data transformation expressions
Mapping Data Flow Reference Node
2/22/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

A reference node is automatically added to the canvas to signify that the node it is attached to references another
existing node on the canvas. Think of a reference node as a pointer or a reference to another data flow
transformation.
For example: When you Join or Union more than one stream of data, the Data Flow canvas may add a reference
node that reflects the name and settings of the non-primary incoming stream.
The reference node cannot be moved or deleted. However, you can click into the node to modify the originating
transformation settings.
The UI rules that govern when Data Flow adds the reference node are based upon available space and vertical
spacing between rows.
Data transformation expressions in Mapping Data
Flow
5/6/2019 • 28 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Expression functions
In Data Factory, use the expression language of the Mapping Data Flow feature to configure data transformations.

abs

abs(<value1> : number) => number

Absolute value of a number.


abs(-20) -> 20
abs(10) -> 10

acos

acos(<value1> : number) => double

Calculates an inverse cosine value


acos(1) -> 0.0

add

add(<value1> : any, <value2> : any) => any

Adds a pair of strings or numbers. Adds a date to a number of days. Appends one array of similar type to another.
Same as the + operator
add(10, 20) -> 30
10 + 20 -> 30
add('ice', 'cream') -> 'icecream'
'ice' + 'cream' + ' cone' -> 'icecream cone'
add(toDate('2012-12-12'), 3) -> 2012-12-15 (date value)
toDate('2012-12-12') + 3 -> 2012-12-15 (date value)
[10, 20] + [30, 40] => [10, 20, 30, 40]

addDays

addDays(<date/timestamp> : datetime, <days to add> : integral) => datetime


Add days to a date or timestamp. Same as the + operator for date
addDays(toDate('2016-08-08'), 1) -> 2016-08-09

addMonths

addMonths(<date/timestamp> : datetime, <months to add> : integral) => datetime

Add months to a date or timestamp


addMonths(toDate('2016-08-31'), 1) -> 2016-09-30
addMonths(toTimestamp('2016-09-30 10:10:10'), -1) -> 2016-08-31 10:10:10

and

and(<value1> : boolean, <value2> : boolean) => boolean

Logical AND operator. Same as &&


and(true, false) -> false
true && false -> false

asin

asin(<value1> : number) => double

Calculates an inverse sine value


asin(0) -> 0.0

atan

atan(<value1> : number) => double

Calculates an inverse tangent value


atan(0) -> 0.0

atan2

atan2(<value1> : number, <value2> : number) => double

Returns the angle in radians between the positive x-axis of a plane and the point given by the coordinates
atan2(0, 0) -> 0.0

avg

avg(<value1> : number) => number

Gets the average of values of a column


avg(sales) -> 7523420.234

avgIf

avgIf(<value1> : boolean, <value2> : number) => number


Based on a criteria gets the average of values of a column
avgIf(region == 'West', sales) -> 7523420.234

byName

byName(<column name> : string) => any

Selects a column value by name in the stream. If there are multiple matches, the first match is returned. If there is no
match, it returns a NULL value. The returned value has to be type converted by one of the type conversion
functions (TO_DATE, TO_STRING, ...). Column names known at design time should be addressed just by their name.
Computed inputs are not supported, but you can use parameter substitutions.
toString(byName('parent')) -> appa
toLong(byName('income')) -> 9000000000009
toBoolean(byName('foster')) -> false
toLong(byName($debtCol)) -> 123456890
birthDate -> 12/31/2050
toString(byName('Bogus Column')) -> NULL

byPosition

byPosition(<position> : integer) => any

Selects a column value by its relative position (1-based) in the stream. If the position is out of bounds, it returns a
NULL value. The returned value has to be type converted by one of the type conversion functions (TO_DATE,
TO_STRING, ...). Computed inputs are not supported, but you can use parameter substitutions.
toString(byPosition(1)) -> amma
toDecimal(byPosition(2), 10, 2) -> 199990.99
toBoolean(byName(4)) -> false
toString(byName($colName)) -> family
toString(byPosition(1234)) -> NULL

case

case(<condition> : boolean, <true_expression> : any, <false_expression> : any, ...) => any

Based on alternating conditions, applies one value or the other. If the number of inputs is even, the default value
for the final condition is NULL
case(custType == 'Premium', 10, 4.5)
case(custType == 'Premium', price*0.95, custType == 'Elite', price*0.9, price*2)
case(dayOfWeek(saleDate) == 1, 'Sunday', dayOfWeek(saleDate) == 6, 'Saturday')

cbrt

cbrt(<value1> : number) => double

Calculate the cube root of a number


cbrt(8) -> 2.0
ceil

ceil(<value1> : number) => number

Returns the smallest integer not smaller than the number


ceil(-0.1) -> 0

concat

concat(<this> : string, <that> : string, ...) => string

Concatenates a variable number of strings together. Same as the + operator with strings
concat('Awesome', 'Cool', 'Product') -> 'AwesomeCoolProduct'
'Awesome' + 'Cool' + 'Product' -> 'AwesomeCoolProduct'
concat(addrLine1, ' ', addrLine2, ' ', city, ' ', state, ' ', zip)
addrLine1 + ' ' + addrLine2 + ' ' + city + ' ' + state + ' ' + zip

concatWS

concatWS(<separator> : string, <this> : string, <that> : string, ...) => string

Concatenates a variable number of strings together with a separator. The first parameter is the separator
concatWS(' ', 'Awesome', 'Cool', 'Product') -> 'Awesome Cool Product'
concatWS(' ' , addrLine1, addrLine2, city, state, zip) ->
concatWS(',' , toString(order_total), toString(order_discount))

cos

cos(<value1> : number) => double

Calculates a cosine value


cos(10) -> -0.83907152907

cosh

cosh(<value1> : number) => double

Calculates a hyperbolic cosine of a value


cosh(0) -> 1.0

count

count([<value1> : any]) => long

Gets the aggregate count of values. If the optional column(s) is specified, it ignores NULL values in the count
count(custId) -> 100
count(custId, custName) -> 50
count() -> 125
count(iif(isNull(custId), 1, NULL)) -> 5
countDistinct

countDistinct(<value1> : any, [<value2> : any], ...) => long

Gets the aggregate count of distinct values of a set of columns


countDistinct(custId, custName) -> 60

countIf

countIf(<value1> : boolean, [<value2> : any]) => long

Based on a criteria gets the aggregate count of values. If the optional column is specified, it ignores NULL values
in the count
countIf(state == 'CA' && commission < 10000, name) -> 100

covariancePopulation

covariancePopulation(<value1> : number, <value2> : number) => double

Gets the population covariance between two columns


covariancePopulation(sales, profit) -> 122.12

covariancePopulationIf

covariancePopulationIf(<value1> : boolean, <value2> : number, <value3> : number) => double

Based on a criteria, gets the population covariance of two columns


covariancePopulationIf(region == 'West', sales, profit) -> 122.12

covarianceSample

covarianceSample(<value1> : number, <value2> : number) => double

Gets the sample covariance of two columns


covarianceSample(sales, profit) -> 122.12

covarianceSampleIf

covarianceSampleIf(<value1> : boolean, <value2> : number, <value3> : number) => double

Based on a criteria, gets the sample covariance of two columns


covarianceSampleIf(region == 'West', sales, profit) -> 122.12

crc32

crc32(<value1> : any, ...) => long

Calculates the CRC32 hash of a set of columns of varying primitive datatypes, given a bit length that can only be of
values 0(256), 224, 256, 384, 512. It can be used to calculate a fingerprint for a row
crc32(256, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4')) -> 3630253689
cumeDist

cumeDist() => integer

The CumeDist function computes the position of a value relative to all values in the partition. The result is the
number of rows preceding or equal to the current row in the ordering of the partition divided by the total number
of rows in the window partition. Any tie values in the ordering will evaluate to the same position.
cumeDist() -> 1

currentDate

currentDate([<value1> : string]) => date

Gets the current date when this job starts to run. You can pass an optional timezone in the form of 'GMT', 'PST',
'UTC', 'America/Cayman'. The local timezone is used as the default.
currentDate() -> 12-12-2030
currentDate('PST') -> 12-31-2050

currentTimestamp

currentTimestamp() => timestamp

Gets the current timestamp when the job starts to run with local time zone
currentTimestamp() -> 12-12-2030T12:12:12

currentUTC

currentUTC([<value1> : string]) => timestamp

Gets the current timestamp as UTC. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. It is defaulted to the current timezone
currentUTC() -> 12-12-2030T19:18:12
currentUTC('Asia/Seoul') -> 12-13-2030T11:18:12

dayOfMonth

dayOfMonth(<value1> : datetime) => integer

Gets the day of the month given a date


dayOfMonth(toDate('2018-06-08')) -> 08

dayOfWeek

dayOfWeek(<value1> : datetime) => integer

Gets the day of the week given a date. 1 - Sunday, 2 - Monday ..., 7 - Saturday
dayOfWeek(toDate('2018-06-08')) -> 7

dayOfYear

dayOfYear(<value1> : datetime) => integer


Gets the day of the year given a date
dayOfYear(toDate('2016-04-09')) -> 100

degrees

degrees(<value1> : number) => double

Converts radians to degrees


degrees(3.141592653589793) -> 180

denseRank

denseRank(<value1> : any, ...) => integer

Computes the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to
the current row in the ordering of the partition. The values will not produce gaps in the sequence. Dense Rank
works even when data is not sorted and looks for change in values
denseRank(salesQtr, salesAmt) -> 1

divide

divide(<value1> : any, <value2> : any) => any

Divides pair of numbers. Same as the / operator


divide(20, 10) -> 2
20 / 10 -> 2

endsWith

endsWith(<string> : string, <substring to check> : string) => boolean

Checks if the string ends with the supplied string


endsWith('great', 'eat') -> true

equals

equals(<value1> : any, <value2> : any) => boolean

Comparison equals operator. Same as == operator


equals(12, 24) -> false
12==24 -> false
'bad'=='bad' -> true
'good'== NULL -> false
NULL===NULL -> false

equalsIgnoreCase

equalsIgnoreCase(<value1> : string, <value2> : string) => boolean


Comparison equals operator ignoring case. Same as <=> operator
'abc'<==>'abc' -> true
equalsIgnoreCase('abc', 'Abc') -> true

factorial

factorial(<value1> : number) => long

Calculate the factorial of a number


factorial(5) -> 120

false

false() => boolean

Always returns a false value. Use the function syntax(false()) if there is a column named 'false'
isDiscounted == false()
isDiscounted() == false

first

first(<value1> : any, [<value2> : boolean]) => any

Gets the first value of a column group. If the second parameter ignoreNulls is omitted, it is assumed false
first(sales) -> 12233.23
first(sales, false) -> NULL

floor

floor(<value1> : number) => number

Returns the largest integer not greater than the number


floor(-0.1) -> -1

fromUTC

fromUTC(<value1> : timestamp, [<value2> : string]) => timestamp

Converts to the timestamp from UTC. You can optionally pass the timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. It is defaulted to the current timezone
fromUTC(currentTimeStamp()) -> 12-12-2030T19:18:12
fromUTC(currentTimeStamp(), 'Asia/Seoul') -> 12-13-2030T11:18:12

greater

greater(<value1> : any, <value2> : any) => boolean

Comparison greater operator. Same as > operator


greater(12, 24) -> false
'abcd' > 'abc' -> true

greaterOrEqual

greaterOrEqual(<value1> : any, <value2> : any) => boolean

Comparison greater than or equal operator. Same as >= operator


greaterOrEqual(12, 12) -> true
'abcd' >= 'abc' -> true

greatest

greatest(<value1> : any, ...) => any

Returns the greatest value among the list of values as input. Returns null if all inputs are null
greatest(10, 30, 15, 20) -> 30
greatest(toDate('12/12/2010'), toDate('12/12/2011'), toDate('12/12/2000')) -> '12/12/2011'

hour

hour(<value1> : timestamp, [<value2> : string]) => integer

Gets the hour value of a timestamp. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. The local timezone is used as the default.
hour(toTimestamp('2009-07-30T12:58:59')) -> 12
hour(toTimestamp('2009-07-30T12:58:59'), 'PST') -> 12

iif

iif(<condition> : boolean, <true_expression> : any, [<false_expression> : any]) => any

Based on a condition applies one value or the other. If other is unspecified it is considered NULL. Both the values
must be compatible(numeric, string...)
iif(custType == 'Premium', 10, 4.5)
iif(amount > 100, 'High')
iif(dayOfWeek(saleDate) == 6, 'Weekend', 'Weekday')

in

in(<array of items> : array, <item to find> : any) => boolean

Checks if an item is in the array


in([10, 20, 30], 10) -> true
in(['good', 'kid'], 'bad') -> false

initCap

initCap(<value1> : string) => string

Converts the first letter of every word to uppercase. Words are identified as separated by whitespace
initCap('cool iceCREAM') -> 'Cool IceCREAM'

instr

instr(<string> : string, <substring to find> : string) => integer

Finds the position(1 based) of the substring within a string. 0 is returned if not found
instr('great', 'eat') -> 3
instr('microsoft', 'o') -> 7
instr('good', 'bad') -> 0

isDelete

isDelete([<value1> : integer]) => boolean

Checks if the row is marked for delete. For transformations taking more than one input stream you can pass the
(1-based) index of the stream. Default value for the stream index is 1
isDelete() -> true
isDelete(1) -> false

isError

isError([<value1> : integer]) => boolean

Checks if the row is marked as error. For transformations taking more than one input stream you can pass the (1-
based) index of the stream. Default value for the stream index is 1
isError() -> true
isError(1) -> false

isIgnore

isIgnore([<value1> : integer]) => boolean

Checks if the row is marked to be ignored. For transformations taking more than one input stream you can pass
the (1-based) index of the stream. Default value for the stream index is 1
isIgnore() -> true
isIgnore(1) -> false

isInsert

isInsert([<value1> : integer]) => boolean

Checks if the row is marked for insert. For transformations taking more than one input stream you can pass the
(1-based) index of the stream. Default value for the stream index is 1
isInsert() -> true
isInsert(1) -> false

isMatch

isMatch([<value1> : integer]) => boolean


Checks if the row is matched at lookup. For transformations taking more than one input stream you can pass the
(1-based) index of the stream. Default value for the stream index is 1
isMatch() -> true
isMatch(1) -> false

isNull

isNull(<value1> : any) => boolean

Checks if the value is NULL


isNull(NULL()) -> true
isNull('') -> false

isUpdate

isUpdate([<value1> : integer]) => boolean

Checks if the row is marked for update. For transformations taking more than one input stream you can pass the
(1-based) index of the stream. Default value for the stream index is 1
isUpdate() -> true
isUpdate(1) -> false

kurtosis

kurtosis(<value1> : number) => double

Gets the kurtosis of a column


kurtosis(sales) -> 122.12

kurtosisIf

kurtosisIf(<value1> : boolean, <value2> : number) => double

Based on a criteria, gets the kurtosis of a column


kurtosisIf(region == 'West', sales) -> 122.12

lag

lag(<value> : any, [<number of rows to look before> : number], [<default value> : any]) => any

Gets the value of the first parameter evaluated n rows before the current row. The second parameter is the number
of rows to look back and the default value is 1. If there are not as many rows a value of null is returned unless a
default value is specified
lag(amount, 2) -> 60
lag(amount, 2000, 100) -> 100

last

last(<value1> : any, [<value2> : boolean]) => any


Gets the last value of a column group. If the second parameter ignoreNulls is omitted, it is assumed false
last(sales) -> 523.12
last(sales, false) -> NULL

lastDayOfMonth

lastDayOfMonth(<value1> : datetime) => date

Gets the last date of the month given a date


lastDayOfMonth(toDate('2009-01-12')) -> 2009-01-31

lead

lead(<value> : any, [<number of rows to look after> : number], [<default value> : any]) => any

Gets the value of the first parameter evaluated n rows after the current row. The second parameter is the number
of rows to look forward and the default value is 1. If there are not as many rows a value of null is returned unless a
default value is specified
lead(amount, 2) -> 60
lead(amount, 2000, 100) -> 100

least

least(<value1> : any, ...) => any

Returns the smallest value among the list of values as input. Returns null if all inputs are null


least(10, 30, 15, 20) -> 10
least(toDate('12/12/2010'), toDate('12/12/2011'), toDate('12/12/2000')) -> '12/12/2000'

left

left(<string to subset> : string, <number of characters> : integral) => string

Extracts a substring starting at index 1 with the specified number of characters. Same as SUBSTRING(str, 1, n)
left('bojjus', 2) -> 'bo'
left('bojjus', 20) -> 'bojjus'

length

length(<value1> : string) => integer

Returns the length of the string


length('kiddo') -> 5

lesser

lesser(<value1> : any, <value2> : any) => boolean

Comparison less operator. Same as < operator


lesser(12, 24) -> true
'abcd' < 'abc' -> false

lesserOrEqual

lesserOrEqual(<value1> : any, <value2> : any) => boolean

Comparison lesser than or equal operator. Same as <= operator


lesserOrEqual(12, 12) -> true
'abcd' <= 'abc' -> false

levenshtein

levenshtein(<from string> : string, <to string> : string) => integer

Gets the levenshtein distance between two strings


levenshtein('boys', 'girls') -> 4

like

like(<string> : string, <pattern match> : string) => boolean

The pattern is a string that is matched literally. The exceptions are the following special symbols: _ matches any
one character in the input (similar to . in posix regular expressions) % matches zero or more characters in the input
(similar to .* in posix regular expressions). The escape character is ''. If an escape character precedes a special
symbol or another escape character, the following character is matched literally. It is invalid to escape any other
character.
like('icecream', 'ice%') -> true

locate

locate(<substring to find> : string, <string> : string, [<from index - 1-based> : integral]) => integer

Finds the position(1 based) of the substring within a string starting a certain position. If the position is omitted it is
considered from the beginning of the string. 0 is returned if not found
locate('eat', 'great') -> 3
locate('o', 'microsoft', 6) -> 7
locate('bad', 'good') -> 0

log

log(<value1> : number, [<value2> : number]) => double

Calculates the log value. An optional base can be supplied; otherwise Euler's number (e) is used
log(100, 10) -> 2

log10

log10(<value1> : number) => double


Calculates the log value with base 10
log10(100) -> 2

lower

lower(<value1> : string) => string

Lowercases a string
lower('GunChus') -> 'gunchus'

lpad

lpad(<string to pad> : string, <final padded length> : integral, <padding> : string) => string

Left pads the string by the supplied padding until it is of a certain length. If the string is equal to or greater than
the length, then it is considered a no-op
lpad('great', 10, '-') -> '-----great'
lpad('great', 4, '-') -> 'great'
lpad('great', 8, '<>') -> '<><great'

ltrim

ltrim(<string to trim> : string, <trim characters> : string) => string

Left trims a string of leading characters. If second parameter is unspecified, it trims whitespace. Else it trims any
character specified in the second parameter
ltrim('!--!wor!ld!', '-!') -> 'wor!ld!'

max

max(<value1> : any) => any

Gets the maximum value of a column


MAX(sales) -> 12312131.12

maxIf

maxIf(<value1> : boolean, <value2> : any) => any

Based on a criteria, gets the maximum value of a column


maxIf(region == 'West', sales) -> 99999.56

md5

md5(<value1> : any, ...) => string

Calculates the MD5 digest of a set of columns of varying primitive datatypes and returns a 32-character hex string.
It can be used to calculate a fingerprint for a row
md5(5, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4')) -> 'c1527622a922c83665e49835e46350fe'
mean

mean(<value1> : number) => number

Gets the mean of values of a column. Same as AVG


mean(sales) -> 7523420.234

meanIf

meanIf(<value1> : boolean, <value2> : number) => number

Based on a criteria gets the mean of values of a column. Same as avgIf


meanIf(region == 'West', sales) -> 7523420.234

min

min(<value1> : any) => any

Gets the minimum value of a column


min(sales) -> 00.01
min(orderDate) -> 12/12/2000

minIf

minIf(<value1> : boolean, <value2> : any) => any

Based on a criteria, gets the minimum value of a column


minIf(region == 'West', sales) -> 00.01

minus

minus(<value1> : any, <value2> : any) => any

Subtracts numbers. Subtracts a number of days from a date. Same as the - operator
minus(20, 10) -> 10
20 - 10 -> 10
minus(toDate('2012-12-15'), 3) -> 2012-12-12 (date value)
toDate('2012-12-15') - 3 -> 2012-12-12 (date value)

minute

minute(<value1> : timestamp, [<value2> : string]) => integer

Gets the minute value of a timestamp. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. The local timezone is used as the default.
minute(toTimestamp('2009-07-30T12:58:59')) -> 58
minute(toTimestamp('2009-07-30T12:58:59'), 'PST') -> 58

mod
mod(<value1> : any, <value2> : any) => any

Modulus of pair of numbers. Same as the % operator


mod(20, 8) -> 4
20 % 8 -> 4

month

month(<value1> : datetime) => integer

Gets the month value of a date or timestamp


month(toDate('2012-8-8')) -> 8

monthsBetween

monthsBetween(<from date/timestamp> : datetime, <to date/timestamp> : datetime, [<time zone> : boolean], [<value4> : string]) => double

Gets the number of months between two dates. You can pass an optional timezone in the form of 'GMT', 'PST',
'UTC', 'America/Cayman'. The local timezone is used as the default.
monthsBetween(toDate('1997-02-28 10:30:00'), toDate('1996-10-30')) -> 3.94959677

multiply

multiply(<value1> : any, <value2> : any) => any

Multiplies pair of numbers. Same as the * operator


multiply(20, 10) -> 200
20 * 10 -> 200

nTile

nTile([<value1> : integer]) => integer

The NTile function divides the rows for each window partition into n buckets ranging from 1 to at most n.
Bucket values differ by at most 1. If the number of rows in the partition does not divide evenly into the
number of buckets, the remainder values are distributed one per bucket, starting with the first bucket. The
NTile function is useful for calculating tertiles, quartiles, deciles, and other common summary statistics. The
function calculates two variables during initialization: the size of a regular bucket and the number of buckets
that will have one extra row added to them. Both variables are based on the size of the current partition. During
the calculation, the function keeps track of the current row number, the current bucket number, and the row
number at which the bucket will change (bucketThreshold). When the current row number reaches the bucket
threshold, the bucket value is increased by one and the threshold is increased by the bucket size (plus one extra
row if the current bucket is padded).
nTile() -> 1
nTile(numOfBuckets) -> 1
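The bucket-assignment logic described above can be sketched in a few lines. The following is a minimal, illustrative Python model of how one ordered partition's rows might be distributed into buckets; the function name and structure are assumptions for illustration only, not the Data Flow implementation.

def ntile(row_count, n_buckets):
    """Assign rows 1..row_count of one ordered partition to buckets 1..n_buckets."""
    base_size = row_count // n_buckets      # size of a regular bucket
    padded = row_count % n_buckets          # the first `padded` buckets get one extra row

    assignments = []
    bucket = 1
    threshold = base_size + (1 if padded >= 1 else 0)   # last row number belonging to bucket 1
    for row in range(1, row_count + 1):
        if row > threshold:                 # the current row crosses into the next bucket
            bucket += 1
            threshold += base_size + (1 if bucket <= padded else 0)
        assignments.append(bucket)
    return assignments

print(ntile(10, 4))   # [1, 1, 1, 2, 2, 2, 3, 3, 4, 4] -> bucket sizes 3, 3, 2, 2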

negate

negate(<value1> : number) => number

Negates a number. Turns positive numbers to negative and vice versa


negate(13) -> -13

nextSequence

nextSequence() => long

Returns the next unique sequence. The number is consecutive only within a partition and is prefixed by the
partitionId
nextSequence() -> 12313112

normalize

normalize(<String to normalize> : string) => string

Normalize the string value to separate accented unicode characters


normalize('boys') -> 'boys'

not

not(<value1> : boolean) => boolean

Logical negation operator


not(true) -> false
not(premium)

notEquals

notEquals(<value1> : any, <value2> : any) => boolean

Comparison not equals operator. Same as != operator


12!=24 -> true
'abc'!='abc' -> false

null

null() => null

Returns a NULL value. Use the function syntax (null()) if there is a column named 'null'. Any operation that uses
a NULL value will result in a NULL
custId = NULL (for derived field)
custId == NULL -> NULL
'nothing' + NULL -> NULL
10 * NULL -> NULL
NULL == '' -> NULL

or

or(<value1> : boolean, <value2> : boolean) => boolean

Logical OR operator. Same as ||


or(true, false) -> true
true || false -> true

pMod

pMod(<value1> : any, <value2> : any) => any

Positive Modulus of pair of numbers.


pmod(-20, 8) -> 4

power

power(<value1> : number, <value2> : number) => double

Raises one number to the power of another


power(10, 2) -> 100

rank

rank(<value1> : any, ...) => integer

Computes the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to
the current row in the ordering of the partition. The values will produce gaps in the sequence. Rank works even
when data is not sorted and looks for change in values
rank(salesQtr, salesAmt) -> 1

regexExtract

regexExtract(<string> : string, <regex to find> : string, [<match group 1-based index> : integral]) => string

Extract a matching substring for a given regex pattern. The last parameter identifies the match group and is
defaulted to 1 if omitted. Use <regex> (back quote) to match a string without escaping
regexExtract('Cost is between 600 and 800 dollars', '(\\d+) and (\\d+)', 2) -> '800'
regexExtract('Cost is between 600 and 800 dollars', `(\d+) and (\d+)`, 2) -> '800'

regexMatch

regexMatch(<string> : string, <regex to match> : string) => boolean

Checks if the string matches the given regex pattern. Use <regex> (back quote) to match a string without escaping
regexMatch('200.50', '(\\d+).(\\d+)') -> true
regexMatch('200.50', `(\d+).(\d+)`) -> true

regexReplace

regexReplace(<string> : string, <regex to find> : string, <substring to replace> : string) => string

Replace all occurrences of a regex pattern with another substring in the given string. Use <regex> (back quote) to
match a string without escaping
regexReplace('100 and 200', '(\\d+)', 'bojjus') -> 'bojjus and bojjus'
regexReplace('100 and 200', `(\d+)`, 'gunchus') -> 'gunchus and gunchus'

regexSplit

regexSplit(<string to split> : string, <regex expression> : string) => array

Splits a string based on a regex delimiter and returns an array of strings
regexSplit('oneAtwoBthreeC', '[CAB]') -> ['one', 'two', 'three']
regexSplit('oneAtwoBthreeC', '[CAB]')[1] -> 'one'
regexSplit('oneAtwoBthreeC', '[CAB]')[0] -> NULL
regexSplit('oneAtwoBthreeC', '[CAB]')[20] -> NULL

replace

replace(<string> : string, <substring to find> : string, <substring to replace> : string) => string

Replace all occurrences of a substring with another substring in the given string
replace('doggie dog', 'dog', 'cat') -> 'catgie cat'
replace('doggie dog', 'dog', '') -> 'gie'

reverse

reverse(<value1> : string) => string

Reverses a string
reverse('gunchus') -> 'suhcnug'

right

right(<string to subset> : string, <number of characters> : integral) => string

Extracts a substring with number of characters from the right. Same as SUBSTRING (str, LENGTH(str) - n, n)
right('bojjus', 2) -> 'us'
right('bojjus', 20) -> 'bojjus'

rlike

rlike(<string> : string, <pattern match> : string) => boolean

Checks if the string matches the given regex pattern


rlike('200.50', '(\d+).(\d+)') -> true

round

round(<number> : number, [<scale to round> : number], [<rounding option> : integral]) => double

Rounds a number given an optional scale and an optional rounding mode. If the scale is omitted, it is defaulted to
0. If the mode is omitted, it is defaulted to ROUND_HALF_UP (5). The values for the rounding mode are:
1 - ROUND_UP, 2 - ROUND_DOWN, 3 - ROUND_CEILING, 4 - ROUND_FLOOR, 5 - ROUND_HALF_UP,
6 - ROUND_HALF_DOWN, 7 - ROUND_HALF_EVEN, 8 - ROUND_UNNECESSARY
round(100.123) -> 100.0
round(2.5, 0) -> 3.0
round(5.3999999999999995, 2, 7) -> 5.40
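For readers more familiar with other rounding libraries, the mode numbers above line up closely with the rounding constants in Python's decimal module. The sketch below is only a rough cross-reference under that assumption (mode 8, ROUND_UNNECESSARY, is not mapped here), not a description of the Data Flow internals.

from decimal import (Decimal, ROUND_UP, ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR,
                     ROUND_HALF_UP, ROUND_HALF_DOWN, ROUND_HALF_EVEN)

# Mode numbers as listed in this article.
MODES = {1: ROUND_UP, 2: ROUND_DOWN, 3: ROUND_CEILING, 4: ROUND_FLOOR,
         5: ROUND_HALF_UP, 6: ROUND_HALF_DOWN, 7: ROUND_HALF_EVEN}

def round_like(value, scale=0, mode=5):
    """Round value to `scale` decimal places using one of the modes above."""
    quantum = Decimal(1).scaleb(-scale)              # scale=2 -> Decimal('0.01')
    return Decimal(str(value)).quantize(quantum, rounding=MODES[mode])

print(round_like(100.123))                   # 100
print(round_like(2.5, 0))                    # 3 (ROUND_HALF_UP)
print(round_like(5.3999999999999995, 2, 7))  # 5.40 (ROUND_HALF_EVEN)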

rowNumber

rowNumber() => integer

Assigns a sequential row numbering for rows in a window starting with 1


rowNumber() -> 1

rpad

rpad(<string to pad> : string, <final padded length> : integral, <padding> : string) => string

Right pads the string by the supplied padding until it is of a certain length. If the string is equal to or greater than
the length, then it is considered a no-op
rpad('great', 10, '-') -> 'great-----'
rpad('great', 4, '-') -> 'great'
rpad('great', 8, '<>') -> 'great<><'

rtrim

rtrim(<string to trim> : string, <trim characters> : string) => string

Right trims a string of trailing characters. If the second parameter is unspecified, it trims whitespace. Else it trims
any character specified in the second parameter
rtrim('!--!wor!ld!', '-!') -> '!--!wor!ld'

second

second(<value1> : timestamp, [<value2> : string]) => integer

Gets the second value of a timestamp. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. The local timezone is used as the default.
second(toTimestamp('2009-07-30T12:58:59')) -> 59

sha1

sha1(<value1> : any, ...) => string

Calculates the SHA-1 digest of a set of columns of varying primitive datatypes and returns a 40-character hex
string. It can be used to calculate a fingerprint for a row
sha1(5, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4')) -> '63849fd2abb65fbc626c60b1f827bd05573f0cea'

sha2

sha2(<value1> : integer, <value2> : any, ...) => string

Calculates the SHA-2 digest of a set of columns of varying primitive datatypes, given a bit length that can only be
of values 0(256), 224, 256, 384, 512. It can be used to calculate a fingerprint for a row
sha2(256, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4')) ->
'd3b2bff62c3a00e9370b1ac85e428e661a7df73959fa1a96ae136599e9ee20fd'

sin

sin(<value1> : number) => double

Calculates a sine value


sin(2) -> 0.90929742682

sinh

sinh(<value1> : number) => double

Calculates a hyperbolic sine value


sinh(0) -> 0.0

skewness

skewness(<value1> : number) => double

Gets the skewness of a column


skewness(sales) -> 122.12

skewnessIf

skewnessIf(<value1> : boolean, <value2> : number) => double

Based on a criteria, gets the skewness of a column


skewnessIf(region == 'West', sales) -> 122.12

slice

slice(<array to slice> : array, <from 1-based index> : integral, [<number of items> : integral]) => array

Extracts a subset of an array from a position. Position is 1 based. If the length is omitted, it is defaulted to the end
of the array
slice([10, 20, 30, 40], 1, 2) -> [10, 20]
slice([10, 20, 30, 40], 2) -> [20, 30, 40]
slice([10, 20, 30, 40], 2)[1] -> 20
slice([10, 20, 30, 40], 2)[0] -> NULL
slice([10, 20, 30, 40], 2)[20] -> NULL
slice([10, 20, 30, 40], 8) -> []

soundex

soundex(<value1> : string) => string

Gets the soundex code for the string


soundex('genius') -> 'G520'

split

split(<string to split> : string, <split characters> : string) => array

Splits a string based on a delimiter and returns an array of strings


split('100,200,300', ',') -> ['100', '200', '300']
split('100,200,300', '|') -> ['100,200,300']
split('100, 200, 300', ', ') -> ['100', '200', '300']
split('100, 200, 300', ', ')[1] -> '100'
split('100, 200, 300', ', ')[0] -> NULL
split('100, 200, 300', ', ')[20] -> NULL
split('100200300', ',') -> ['100200300']

sqrt

sqrt(<value1> : number) => double

Calculates the square root of a number


sqrt(9) -> 3

startsWith

startsWith(<string> : string, <substring to check> : string) => boolean

Checks if the string starts with the supplied string


startsWith('great', 'gr') -> true

stddev

stddev(<value1> : number) => double

Gets the standard deviation of a column


stdDev(sales) -> 122.12

stddevIf

stddevIf(<value1> : boolean, <value2> : number) => double

Based on a criteria, gets the standard deviation of a column


stddevIf(region == 'West', sales) -> 122.12

stddevPopulation

stddevPopulation(<value1> : number) => double

Gets the population standard deviation of a column


stddevPopulation(sales) -> 122.12
stddevPopulationIf

stddevPopulationIf(<value1> : boolean, <value2> : number) => double

Based on a criteria, gets the population standard deviation of a column


stddevPopulationIf(region == 'West', sales) -> 122.12

stddevSample

stddevSample(<value1> : number) => double

Gets the sample standard deviation of a column


stddevSample(sales) -> 122.12

stddevSampleIf

stddevSampleIf(<value1> : boolean, <value2> : number) => double

Based on a criteria, gets the sample standard deviation of a column


stddevSampleIf(region == 'West', sales) -> 122.12

subDays

subDays(<date/timestamp> : datetime, <days to subtract> : integral) => datetime

Subtract days from a date or timestamp. Same as the - operator for date


subDays(toDate('2016-08-08'), 1) -> 2016-08-07

subMonths

subMonths(<date/timestamp> : datetime, <months to subtract> : integral) => datetime

Subtract months from a date or timestamp


subMonths(toDate('2016-09-30'), 1) -> 2016-08-31

substring

substring(<string to subset> : string, <from 1-based index> : integral, [<number of characters> : integral]) =>
string

Extracts a substring of a certain length from a position. Position is 1 based. If the length is omitted, it is defaulted to
end of the string
substring('Cat in the hat', 5, 2) -> 'in'
substring('Cat in the hat', 5, 100) -> 'in the hat'
substring('Cat in the hat', 5) -> 'in the hat'
substring('Cat in the hat', 100, 100) -> ''

sum

sum(<value1> : number) => number


Gets the aggregate sum of a numeric column
sum(col) -> value

sumDistinct

sumDistinct(<value1> : number) => number

Gets the aggregate sum of distinct values of a numeric column


sumDistinct(col) -> value

sumDistinctIf

sumDistinctIf(<value1> : boolean, <value2> : number) => number

Based on criteria gets the aggregate sum of distinct values of a numeric column. The condition can be based on any column
sumDistinctIf(state == 'CA' && commission < 10000, sales) -> value
sumDistinctIf(true, sales) -> SUM(sales)

sumIf

sumIf(<value1> : boolean, <value2> : number) => number

Based on criteria gets the aggregate sum of a numeric column. The condition can be based on any column
sumIf(state == 'CA' && commission < 10000, sales) -> value
sumIf(true, sales) -> SUM(sales)

tan

tan(<value1> : number) => double

Calculates a tangent value


tan(0) -> 0.0

tanh

tanh(<value1> : number) => double

Calculates a hyperbolic tangent value


tanh(0) -> 0.0

toBoolean

toBoolean(<value1> : string) => boolean

Converts a value of ('t', 'true', 'y', 'yes', '1') to true and ('f', 'false', 'n', 'no', '0') to false and NULL for any other value
toBoolean('true') -> true
toBoolean('n') -> false
toBoolean('truthy') -> NULL
toDate

toDate(<string> : any, [<date format> : string]) => date

Converts a string to a date given an optional date format. Refer to Java SimpleDateFormat for all possible formats.
If the date format is omitted, combinations of the following are accepted: [ yyyy, yyyy-[M]M, yyyy-[M]M-[d]d,
yyyy-[M]M-[d]dT* ]
toDate('2012-8-8') -> 2012-8-8
toDate('12/12/2012', 'MM/dd/yyyy') -> 2012-12-12

toDecimal

toDecimal(<value> : any, [<precision> : integral], [<scale> : integral], [<format> : string], [<locale> :


string]) => decimal(10,0)

Converts any numeric or string to a decimal value. If precision and scale are not specified, it is defaulted to
(10,2). An optional Java decimal format can be used for the conversion, along with an optional locale format in the
form of a BCP47 language tag like en-US, de, or zh-CN
toDecimal(123.45) -> 123.45
toDecimal('123.45', 8, 4) -> 123.4500
toDecimal('$123.45', 8, 4,'$###.00') -> 123.4500
toDecimal('Ç123,45', 10, 2, 'Ç###,##', 'de') -> 123.45

toDouble

toDouble(<value> : any, [<format> : string], [<locale> : string]) => double

Converts any numeric or string to a double value. An optional Java decimal format can be used for the conversion,
along with an optional locale format in the form of a BCP47 language tag like en-US, de, or zh-CN
toDouble(123.45) -> 123.45
toDouble('123.45') -> 123.45
toDouble('$123.45', '$###.00') -> 123.45
toDouble('Ç123,45', 'Ç###,##', 'de') -> 123.45

toFloat

toFloat(<value> : any, [<format> : string], [<locale> : string]) => float

Converts any numeric or string to a float value. An optional Java decimal format can be used for the conversion.
Truncates any double
toFloat(123.45) -> 123.45
toFloat('123.45') -> 123.45
toFloat('$123.45', '$###.00') -> 123.45

toInteger

toInteger(<value> : any, [<format> : string], [<locale> : string]) => integer

Converts any numeric or string to an integer value. An optional Java decimal format can be used for the
conversion. Truncates any long, float, double
toInteger(123) -> 123
toInteger('123') -> 123
toInteger('$123', '$###') -> 123

toLong

toLong(<value> : any, [<format> : string], [<locale> : string]) => long

Converts any numeric or string to a long value. An optional Java decimal format can be used for the conversion.
Truncates any float, double
toLong(123) -> 123
toLong('123') -> 123
toLong('$123', '$###') -> 123

toShort

toShort(<value> : any, [<format> : string], [<locale> : string]) => short

Converts any numeric or string to a short value. An optional Java decimal format can be used for the conversion.
Truncates any integer, long, float, double
toShort(123) -> 123
toShort('123') -> 123
toShort('$123', '$###') -> 123

toString

toString(<value> : any, [<number format/date format> : string]) => string

Converts a primitive datatype to a string. For numbers and dates, a format can be specified. If unspecified, the
system default is picked. Java decimal format is used for numbers. Refer to Java SimpleDateFormat for all possible
date formats; the default date format is yyyy-MM-dd
toString(10) -> '10'
toString('engineer') -> 'engineer'
toString(123456.789, '##,###.##') -> '123,456.79'
toString(123.78, '000000.000') -> '000123.780'
toString(12345, '##0.#####E0') -> '12.345E3'
toString(toDate('2018-12-31')) -> '2018-12-31'
toString(toDate('2018-12-31'), 'MM/dd/yy') -> '12/31/18'
toString(4 == 20) -> 'false'

toTimestamp

toTimestamp(<string> : any, [<timestamp format> : string], [<time zone> : string]) => timestamp

Converts a string to a timestamp given an optional timestamp format. Refer to Java SimpleDateFormat for all
possible formats. If the timestamp format is omitted, the default pattern yyyy-[M]M-[d]d hh:mm:ss[.f...] is used
toTimestamp('2016-12-31 00:12:00') -> 2016-12-31T00:12:00
toTimestamp('2016/12/31T00:12:00', 'MM/dd/yyyyThh:mm:ss') -> 2016-12-31T00:12:00
toUTC

toUTC(<value1> : timestamp, [<value2> : string]) => timestamp

Converts the timestamp to UTC. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. It is defaulted to the current timezone
toUTC(currentTimeStamp()) -> 12-12-2030T19:18:12
toUTC(currentTimeStamp(), 'Asia/Seoul') -> 12-13-2030T11:18:12

translate

translate(<string to translate> : string, <lookup characters> : string, <replace characters> : string) =>
string

Replace one set of characters by another set of characters in the string. Characters have 1 to 1 replacement
translate('(Hello)', '()', '[]') -> '[Hello]'
translate('(Hello)', '()', '[') -> '[Hello'

trim

trim(<string to trim> : string, [<trim characters> : string]) => string

Trims a string of leading and trailing characters. If second parameter is unspecified, it trims whitespace. Else it
trims any character specified in the second parameter
trim('!--!wor!ld!', '-!') -> 'wor!ld'

true

true() => boolean

Always returns a true value. Use the function syntax(true()) if there is a column named 'true'
isDiscounted == true()
isDiscounted() == true

typeMatch

typeMatch(<type> : string, <base type> : string) => boolean

Matches the type of the column. Can only be used in pattern expressions. number matches short, integer, long,
double, float, or decimal; integral matches short, integer, long; fractional matches double, float, decimal; and
datetime matches date or timestamp type
typeMatch(type, 'number') -> true
typeMatch('date', 'number') -> false

upper

upper(<value1> : string) => string

Uppercases a string
upper('bojjus') -> 'BOJJUS'
variance

variance(<value1> : number) => double

Gets the variance of a column


variance(sales) -> 122.12

varianceIf

varianceIf(<value1> : boolean, <value2> : number) => double

Based on a criteria, gets the variance of a column


varianceIf(region == 'West', sales) -> 122.12

variancePopulation

variancePopulation(<value1> : number) => double

Gets the population variance of a column


variancePopulation(sales) -> 122.12

variancePopulationIf

variancePopulationIf(<value1> : boolean, <value2> : number) => double

Based on a criteria, gets the population variance of a column


variancePopulationIf(region == 'West', sales) -> 122.12

varianceSample

varianceSample(<value1> : number) => double

Gets the unbiased variance of a column


varianceSample(sales) -> 122.12

varianceSampleIf

varianceSampleIf(<value1> : boolean, <value2> : number) => double

Based on a criteria, gets the unbiased variance of a column


varianceSampleIf(region == 'West', sales) -> 122.12

weekOfYear

weekOfYear(<value1> : datetime) => integer

Gets the week of the year given a date


weekOfYear(toDate('2008-02-20')) -> 8

xor
xor(<value1> : boolean, <value2> : boolean) => boolean

Logical XOR operator. Same as ^ operator


xor(true, false) -> true
xor(true, true) -> false
true ^ false -> true

year

year(<value1> : datetime) => integer

Gets the year value of a date


year(toDate('2012-8-8')) -> 2012

Next steps
Learn how to use Expression Builder.
Roles and permissions for Azure Data Factory
3/7/2019 • 3 minutes to read

This article describes the roles required to create and manage Azure Data Factory resources, and the permissions
granted by those roles.

Roles and requirements


To create Data Factory instances, the user account that you use to sign in to Azure must be a member of the
contributor or owner role, or an administrator of the Azure subscription. To view the permissions that you have in
the subscription, in the Azure portal, select your username in the upper-right corner, and then select Permissions.
If you have access to multiple subscriptions, select the appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines, triggers, and
integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory Contributor
role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the resource level
or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.

Set up permissions
After you create a Data Factory, you may want to let other users work with the data factory. To give this access to
other users, you have to add them to the built-in Data Factory Contributor role on the resource group that
contains the data factory.
Scope of the Data Factory Contributor role
Membership in the Data Factory Contributor role lets users do the following things:
Create, edit, and delete data factories and child resources including datasets, linked services, pipelines, triggers,
and integration runtimes.
Deploy Resource Manager templates. Resource Manager deployment is the deployment method used by Data
Factory in the Azure portal.
Manage App Insights alerts for a data factory.
Create support tickets.
For more info about this role, see Data Factory Contributor role.
Resource Manager template deployment
The Data Factory Contributor role, at the resource group level or above, lets users deploy Resource Manager
templates. As a result, members of the role can use Resource Manager templates to deploy both data factories
and their child resources, including datasets, linked services, pipelines, triggers, and integration runtimes.
Membership in this role does not let the user create other resources, however.
Permissions on Azure Repos and GitHub are independent of Data Factory permissions. As a result, a user with
repo permissions who is only a member of the Reader role can edit Data Factory child resources and commit
changes to the repo, but can't publish these changes.
IMPORTANT
Resource Manager template deployment with the Data Factory Contributor role does not elevate your permissions. For
example, if you deploy a template that creates an Azure virtual machine, and you don't have permission to create virtual
machines, the deployment fails with an authorization error.

Custom scenarios and custom roles


Sometimes you may need to grant different access levels for different data factory users. For example:
You may need a group where users only have permissions on a specific data factory.
Or you may need a group where users can only monitor a data factory (or factories) but can't modify it.
You can achieve these custom scenarios by creating custom roles and assigning users to those roles. For more
info about custom roles, see Custom roles in Azure.
Here are a few examples that demonstrate what you can achieve with custom roles:
Let a user create, edit, or delete any data factory in a resource group from the Azure portal.
Assign the built-in Data Factory contributor role at the resource group level for the user. If you want to
allow access to any data factory in a subscription, assign the role at the subscription level.
Let a user view (read) and monitor a data factory, but not edit or change it.
Assign the built-in reader role on the data factory resource for the user.
Let a user edit a single data factory in the Azure portal.
This scenario requires two role assignments.
1. Assign the built-in contributor role at the data factory level.
2. Create a custom role with the permission Microsoft.Resources/deployments/. Assign this custom
role to the user at resource group level.
Let a user update a data factory from PowerShell or the SDK, but not in the Azure portal.
Assign the built-in contributor role on the data factory resource for the user. This role lets the user see the
resources in the Azure portal, but the user can't access the Publish and Publish All buttons.

Next steps
Learn more about roles in Azure - Understand role definitions
Learn more about the Data Factory contributor role - Data Factory Contributor role.
Understanding Data Factory pricing through
examples
5/6/2019 • 6 minutes to read

This article explains and demonstrates the Azure Data Factory pricing model with detailed examples.

NOTE
The prices used in the examples below are hypothetical and are not intended to imply actual pricing.

Copy data from AWS S3 to Azure Blob storage hourly


In this scenario, you want to copy data from AWS S3 to Azure Blob storage on an hourly schedule.
To accomplish the scenario, you need to create a pipeline with the following items:
1. A copy activity with an input dataset for the data to be copied from AWS S3.
2. An output dataset for the data on Azure Storage.
3. A schedule trigger to execute the pipeline every hour.

OPERATIONS | TYPES AND UNITS

Create Linked Service | 2 Read/Write entities
Create Datasets | 4 Read/Write entities (2 for dataset creation, 2 for linked service references)
Create Pipeline | 3 Read/Write entities (1 for pipeline creation, 2 for dataset references)
Get Pipeline | 1 Read/Write entity
Run Pipeline | 2 Activity runs (1 for trigger run, 1 for activity run)
Copy Data Assumption: execution time = 10 min | 10 * 4 DIU-minutes on Azure Integration Runtime (default DIU setting = 4). For more information on data integration units and optimizing copy performance, see this article
Monitor Pipeline Assumption: Only 1 run occurred | 2 Monitoring run records retrieved (1 for pipeline run, 1 for activity run)

Total Scenario pricing: $0.16811


Data Factory Operations = $0.00011
Read/Write = 10 * $0.00001 = $0.0001 [1 R/W = $0.50/50000 = $0.00001]
Monitoring = 2 * $0.000005 = $0.00001 [1 Monitoring = $0.25/50000 = $0.000005]
Pipeline Orchestration & Execution = $0.168
Activity Runs = $0.001 * 2 = $0.002 [1 run = $1/1000 = $0.001]
Data Movement Activities = $0.166 (Prorated for 10 minutes of execution time at 4 DIUs. $0.25/DIU-hour on Azure
Integration Runtime)
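To make the arithmetic easier to follow, here is a minimal Python sketch that recomputes this scenario's total from the hypothetical unit prices quoted above. The prices and the per-DIU-hour assumption come from this example, not from the official price list.

# Hypothetical unit prices used in this article (not actual Azure pricing).
READ_WRITE = 0.50 / 50000      # per Read/Write entity operation
MONITORING = 0.25 / 50000      # per monitoring run record retrieved
ACTIVITY_RUN = 1.00 / 1000     # per activity or trigger run
DIU_HOUR = 0.25                # data movement, per DIU-hour on the Azure Integration Runtime

# Hourly copy from AWS S3 to Azure Blob storage: one pipeline run.
operations = 10 * READ_WRITE + 2 * MONITORING     # Data Factory operations
orchestration = 2 * ACTIVITY_RUN                  # 1 trigger run + 1 activity run
data_movement = (10 / 60) * 4 * DIU_HOUR          # 10 minutes at 4 DIUs

print(round(operations + orchestration + data_movement, 5))
# ~0.16878; the article rounds the copy charge to $0.166, giving $0.16811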

Copy data and transform with Azure Databricks hourly


In this scenario, you want to copy data from AWS S3 to Azure Blob storage and transform the data with Azure
Databricks on an hourly schedule.
To accomplish the scenario, you need to create a pipeline with the following items:
1. One copy activity with an input dataset for the data to be copied from AWS S3, and an output dataset for the
data on Azure storage.
2. One Azure Databricks activity for the data transformation.
3. One schedule trigger to execute the pipeline every hour.

OPERATIONS | TYPES AND UNITS

Create Linked Service | 3 Read/Write entities
Create Datasets | 4 Read/Write entities (2 for dataset creation, 2 for linked service references)
Create Pipeline | 3 Read/Write entities (1 for pipeline creation, 2 for dataset references)
Get Pipeline | 1 Read/Write entity
Run Pipeline | 3 Activity runs (1 for trigger run, 2 for activity runs)
Copy Data Assumption: execution time = 10 min | 10 * 4 DIU-minutes on Azure Integration Runtime (default DIU setting = 4). For more information on data integration units and optimizing copy performance, see this article
Monitor Pipeline Assumption: Only 1 run occurred | 3 Monitoring run records retrieved (1 for pipeline run, 2 for activity runs)
Execute Databricks activity Assumption: execution time = 10 min | 10 min External Pipeline Activity Execution

Total Scenario pricing: $0.16916


Data Factory Operations = $0.00012
Read/Write = 11 * $0.00001 = $0.00011 [1 R/W = $0.50/50000 = $0.00001]
Monitoring = 3 * $0.000005 = $0.000015 [1 Monitoring = $0.25/50000 = $0.000005]
Pipeline Orchestration & Execution = $0.16904
Activity Runs = $0.001 * 3 = $0.003 [1 run = $1/1000 = $0.001]
Data Movement Activities = $0.166 (Prorated for 10 minutes of execution time at 4 DIUs. $0.25/DIU-hour on Azure
Integration Runtime)
External Pipeline Activity = $0.000041 (Prorated for 10 minutes of execution time. $0.00025/hour on
Azure Integration Runtime)

Copy data and transform with dynamic parameters hourly


In this scenario, you want to copy data from AWS S3 to Azure Blob storage and transform with Azure Databricks
(with dynamic parameters in the script) on an hourly schedule.
To accomplish the scenario, you need to create a pipeline with the following items:
1. One copy activity with an input dataset for the data to be copied from AWS S3, an output dataset for the data
on Azure storage.
2. One Lookup activity for passing parameters dynamically to the transformation script.
3. One Azure Databricks activity for the data transformation.
4. One schedule trigger to execute the pipeline every hour.
OPERATIONS | TYPES AND UNITS

Create Linked Service | 3 Read/Write entities
Create Datasets | 4 Read/Write entities (2 for dataset creation, 2 for linked service references)
Create Pipeline | 3 Read/Write entities (1 for pipeline creation, 2 for dataset references)
Get Pipeline | 1 Read/Write entity
Run Pipeline | 4 Activity runs (1 for trigger run, 3 for activity runs)
Copy Data Assumption: execution time = 10 min | 10 * 4 DIU-minutes on Azure Integration Runtime (default DIU setting = 4). For more information on data integration units and optimizing copy performance, see this article
Monitor Pipeline Assumption: Only 1 run occurred | 4 Monitoring run records retrieved (1 for pipeline run, 3 for activity runs)
Execute Lookup activity Assumption: execution time = 1 min | 1 min Pipeline Activity execution
Execute Databricks activity Assumption: execution time = 10 min | 10 min External Pipeline Activity execution

Total Scenario pricing: $0.17020


Data Factory Operations = $0.00013
Read/Write = 11 * $0.00001 = $0.00011 [1 R/W = $0.50/50000 = $0.00001]
Monitoring = 4 * $0.000005 = $0.00002 [1 Monitoring = $0.25/50000 = $0.000005]
Pipeline Orchestration & Execution = $0.17007
Activity Runs = $0.001 * 4 = $0.004 [1 run = $1/1000 = $0.001]
Data Movement Activities = $0.166 (Prorated for 10 minutes of execution time at 4 DIUs. $0.25/DIU-hour on Azure
Integration Runtime)
Pipeline Activity = $0.00003 (Prorated for 1 minute of execution time. $0.002/hour on Azure Integration
Runtime)
External Pipeline Activity = $0.000041 (Prorated for 10 minutes of execution time. $0.00025/hour on
Azure Integration Runtime)

Using mapping data flow debug for a normal workday (Preview Pricing)
As a Data Engineer, you are responsible for designing, building, and testing Mapping Data Flows every day. You
log into the ADF UI in the morning and enable the Debug mode for Data Flows. The default TTL for Debug
sessions is 60 minutes. You work throughout the day for 10 hours, so your Debug session never expires. Therefore,
your charge for the day will be:
10 (hours) x 8 (cores) x $0.112 = $8.96

Transform data in blob store with mapping data flows (Preview Pricing)
In this scenario, you want to transform data in Blob Store visually in ADF Mapping Data Flows on an hourly
schedule.
To accomplish the scenario, you need to create a pipeline with the following items:
1. A Data Flow activity with the transformation logic.
2. An input dataset for the data on Azure Storage.
3. An output dataset for the data on Azure Storage.
4. A schedule trigger to execute the pipeline every hour.

OPERATIONS | TYPES AND UNITS

Create Linked Service | 2 Read/Write entities
Create Datasets | 4 Read/Write entities (2 for dataset creation, 2 for linked service references)
Create Pipeline | 3 Read/Write entities (1 for pipeline creation, 2 for dataset references)
Get Pipeline | 1 Read/Write entity
Run Pipeline | 2 Activity runs (1 for trigger run, 1 for activity run)
Data Flow Assumption: execution time = 10 min + 10 min TTL | 10 * 8 cores of General Compute with a TTL of 10 minutes
Monitor Pipeline Assumption: Only 1 run occurred | 2 Monitoring run records retrieved (1 for pipeline run, 1 for activity run)

Total Scenario pricing: $0.3011


Data Factory Operations = $0.00011
Read/Write = 10 * $0.00001 = $0.0001 [1 R/W = $0.50/50000 = $0.00001]
Monitoring = 2 * $0.000005 = $0.00001 [1 Monitoring = $0.25/50000 = $0.000005]
Pipeline Orchestration & Execution = $0.301
Activity Runs = $0.001 * 2 = $0.002 [1 run = $1/1000 = $0.001]
Data Flow Activities = $0.299 (Prorated for 20 minutes: 10 min execution time + 10 min TTL. $0.112 per
vCore-hour on Azure Integration Runtime with 8 cores of general compute)
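The data flow meters prorate the same way, but per vCore-hour of the compute that runs the data flow. A minimal sketch under the preview prices quoted in this article (the per-vCore-hour reading of the $0.112 rate is an assumption taken from these examples):

VCORE_HOUR = 0.112   # hypothetical preview price per vCore-hour of General Compute

# Hourly data flow run: 10 min execution + 10 min TTL on 8 cores.
data_flow_activity = ((10 + 10) / 60) * 8 * VCORE_HOUR
print(round(data_flow_activity, 3))   # 0.299

# Debug-session example earlier in this article: 10-hour workday on 8 cores.
debug_session = 10 * 8 * VCORE_HOUR
print(round(debug_session, 2))        # 8.96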

Next steps
Now that you understand the pricing for Azure Data Factory, you can get started!
Create a data factory by using the Azure Data Factory UI
Introduction to Azure Data Factory
Visual authoring in Azure Data Factory
Azure Data Factory - naming rules
1/3/2019 • 2 minutes to read

The following table provides naming rules for Data Factory artifacts.

NAME | NAME UNIQUENESS | VALIDATION CHECKS

Data Factory | Unique across Microsoft Azure. Names are case-insensitive; that is, MyDF and mydf refer to the same data factory. | Each data factory is tied to exactly one Azure subscription. Object names must start with a letter or a number, and can contain only letters, numbers, and the dash (-) character. Every dash (-) character must be immediately preceded and followed by a letter or a number. Consecutive dashes are not permitted in container names. The name can be 3-63 characters long.

Linked Services/Datasets/Pipelines | Unique within a data factory. Names are case-insensitive. | Object names must start with a letter, number, or an underscore (_). The following characters are not allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\". Dashes ("-") are not allowed in the names of linked services and of datasets only.

Resource Group | Unique across Microsoft Azure. Names are case-insensitive. | For more info, see Azure naming rules and restrictions.

Next steps
Learn how to create data factories by following step-by-step instructions in Quickstart: create a data factory
article.
Visual authoring in Azure Data Factory
5/9/2019 • 14 minutes to read

The Azure Data Factory user interface experience (UX) lets you visually author and deploy resources for your data
factory without having to write any code. You can drag activities to a pipeline canvas, perform test runs, debug
iteratively, and deploy and monitor your pipeline runs. There are two approaches for using the UX to perform
visual authoring:
Author directly with the Data Factory service.
Author with Azure Repos Git integration for collaboration, source control, and versioning.

Author directly with the Data Factory service


Visual authoring with the Data Factory service differs from visual authoring with Git integration in two ways:
The Data Factory service doesn't include a repository for storing the JSON entities for your changes.
The Data Factory service isn't optimized for collaboration or version control.

When you use the UX Authoring canvas to author directly with the Data Factory service, only the Publish All
mode is available. Any changes that you make are published directly to the Data Factory service.
Author with Azure Repos Git integration
Visual authoring with Azure Repos Git integration supports source control and collaboration for work on your data
factory pipelines. You can associate a data factory with an Azure Repos Git organization repository for source
control, collaboration, versioning, and so on. A single Azure Repos Git organization can have multiple repositories,
but an Azure Repos Git repository can be associated with only one data factory. If you don't have an Azure Repos
organization or repository, follow these instructions to create your resources.

NOTE
You can store script and data files in an Azure Repos Git repository. However, you have to upload the files manually to Azure
Storage. A Data Factory pipeline does not automatically upload script or data files stored in an Azure Repos Git repository to
Azure Storage.

Configure an Azure Repos Git repository with Azure Data Factory


You can configure an Azure Repos Git repository with a data factory through two methods.
Configuration method 1 (Azure Repos Git repo): Let's get started page
In Azure Data Factory, go to the Let's get started page. Select Configure Code Repository:
The Repository Settings configuration pane appears:
The pane shows the following Azure Repos code repository settings:
SETTING | DESCRIPTION | VALUE

Repository Type | The type of the Azure Repos code repository. | Azure Repos Git

Azure Active Directory | Your Azure AD tenant name. | <your tenant name>

Azure Repos Organization | Your Azure Repos organization name. You can locate your Azure Repos organization name at https://{organization name}.visualstudio.com. You can sign in to your Azure Repos organization to access your Visual Studio profile and see your repositories and projects. | <your organization name>

ProjectName | Your Azure Repos project name. You can locate your Azure Repos project name at https://{organization name}.visualstudio.com/{project name}. | <your Azure Repos project name>

RepositoryName | Your Azure Repos code repository name. Azure Repos projects contain Git repositories to manage your source code as your project grows. You can create a new repository or use an existing repository that's already in your project. | <your Azure Repos code repository name>

Collaboration branch | Your Azure Repos collaboration branch that is used for publishing. By default, it is master. Change this setting in case you want to publish resources from another branch. | <your collaboration branch name>

Root folder | Your root folder in your Azure Repos collaboration branch. | <your root folder name>

Import existing Data Factory resources to repository | Specifies whether to import existing data factory resources from the UX Authoring canvas into an Azure Repos Git repository. Select the box to import your data factory resources into the associated Git repository in JSON format. This action exports each resource individually (that is, the linked services and datasets are exported into separate JSONs). When this box isn't selected, the existing resources aren't imported. | Selected (default)

Configuration method 2 (Azure Repos Git repo): UX authoring canvas


In the Azure Data Factory UX Authoring canvas, locate your data factory. Select the Data Factory drop-down
menu, and then select Configure Code Repository.
A configuration pane appears. For details about the configuration settings, see the descriptions in Configuration
method 1.

Use a different Azure Active Directory tenant


You can create an Azure Repos Git repo in a different Azure Active Directory tenant. To specify a different Azure
AD tenant, you have to have administrator permissions for the Azure subscription that you're using.
Use your personal Microsoft account
To use a personal Microsoft account for Git integration, you can link your personal Azure Repo to your
organization's Active Directory.
1. Add your personal Microsoft account to your organization's Active Directory as a guest. For more info, see
Add Azure Active Directory B2B collaboration users in the Azure portal.
2. Log in to the Azure portal with your personal Microsoft account. Then switch to your organization's Active
Directory.
3. Go to the Azure DevOps section, where you now see your personal repo. Select the repo and connect with
Active Directory.
After these configuration steps, your personal repo is available when you set up Git integration in the Data Factory
UI.
For more info about connecting Azure Repos to your organization's Active Directory, see Connect your Azure
DevOps organization to Azure Active Directory.
Switch to a different Git repo
To switch to a different Git repo, locate the icon in the upper right corner of the Data Factory overview page, as
shown in the following screenshot. If you can’t see the icon, clear your local browser cache. Select the icon to
remove the association with the current repo.
After you remove the association with the current repo, you can configure your Git settings to use a different repo.
Then you can import existing Data Factory resources to the new repo.
Use version control
Version control systems (also known as source control) let developers collaborate on code and track changes that
are made to the code base. Source control is an essential tool for multi-developer projects.
Each Azure Repos Git repository that's associated with a data factory has a collaboration branch ( master is the default collaboration branch). Users can also create feature branches by clicking + New Branch and develop in those feature branches.

When the feature development in your feature branch is complete, you can click Create pull request. This action takes you to Azure Repos Git, where you can raise pull requests, do code reviews, and merge changes to your collaboration branch ( master is the default). You are only allowed to publish to the Data Factory service from your collaboration branch.
Configure publishing settings
To configure the publish branch - that is, the branch where Resource Manager templates are saved - add a
publish_config.json file to the root folder in the collaboration branch. Data Factory reads this file, looks for the
field publishBranch , and creates a new branch (if it doesn't already exist) with the value provided. Then it saves all
Resource Manager templates to the specified location. For example:

{
"publishBranch": "factory/adf_publish"
}

When you publish from Git mode, you can confirm that Data Factory is using the publish branch that you expect,
as shown in the following screenshot:

When you specify a new publish branch, Data Factory doesn't delete the previous publish branch. If you want to remove the previous publish branch, delete it manually.
Data Factory only reads the publish_config.json file when it loads the factory. If you already have the factory
loaded in the portal, refresh the browser to make your changes take effect.
Publish code changes
After you have merged changes to the collaboration branch ( master is the default), select Publish to manually
publish your code changes in the master branch to the Data Factory service.

IMPORTANT
The master branch is not representative of what's deployed in the Data Factory service. The master branch must be
published manually to the Data Factory service.

Advantages of Git integration


Source Control. As your data factory workloads become critical, you'll want to integrate your factory with Git to take advantage of source control benefits like the following:
Ability to track/audit changes.
Ability to revert changes that introduced bugs.
Partial Saves. In regular LIVE mode, you can't save your changes as a draft when they aren't ready, and you risk losing work if your computer crashes. With Git integration, you can save your changes incrementally and publish to the factory only when you are ready. Git acts as a staging area for your work until you have tested your changes to your satisfaction.
Collaboration and Control. If you have multiple team members contributing to the same factory, you may want to let your teammates collaborate with each other via a code review process. You can also set up your factory so that not every contributor has permission to deploy to the factory. Team members may be allowed to make changes via Git, while only certain people on the team are allowed to "Publish" the changes to the factory.
Showing diffs. In Git mode, you can see a diff of the payload that's about to be published to the factory. This diff shows all resources/entities that were modified, added, or deleted since the last time you published to your factory. Based on this diff, you can either continue with publishing or go back and review your changes first.
Better CI/CD. In Git mode, you can configure your release pipeline to trigger automatically as soon as any changes are made in the dev factory. You can also customize the properties in your factory that are available as parameters in the Resource Manager template. It can be useful to keep only the required set of properties as parameters and have everything else hard-coded.
Better Performance. An average factory loads about 10 times faster in Git mode than in regular LIVE mode, because the resources are downloaded via Git.
Best practices for Git integration
Permissions. Typically you don't want all team members to have permission to update the factory.
All team members should have read permissions to the data factory.
Only a select set of people should be allowed to publish to the factory, and for that they need to be part of the "Data Factory contributor" role on the factory.
Another good source control practice is to not allow direct check-ins to the collaboration branch. This requirement helps prevent bugs, because every check-in goes through a pull request process.
Switching modes.
Once you are in Git mode, we don't recommend switching back and forth to LIVE mode, primarily because changes made in LIVE mode are not visible when you switch back to Git. Make your changes in Git mode and then publish them via the UI.
Similarly, don't use any Data Factory PowerShell cmdlets either, because they have the same effect of applying the provided changes directly to the live factory.
Use passwords from Azure Key Vault.
We strongly recommend using Azure Key Vault to store any connection strings or passwords for Data Factory linked services.
We don't store any such secret information in Git (for security reasons), so any changes to linked services are published to the live factory right away. This immediate publishing is sometimes not desired, because the changes may not have been tested, which defeats the purpose of Git.
As a result, all such secrets should be fetched from linked services that reference Azure Key Vault, as shown in the example after this list.
Another benefit of using Key Vault is that it makes CI/CD easier, because you don't have to provide these secrets during Resource Manager template deployment.
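For illustration, a linked service definition that resolves its connection string from Azure Key Vault at runtime, instead of storing it inline, can look like the following minimal sketch. The linked service name, the Key Vault linked service reference, and the secret name are placeholders, not values from this article.

{
    "name": "AzureSqlDatabase1",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "AzureKeyVault1",
                    "type": "LinkedServiceReference"
                },
                "secretName": "sql-connection-string"
            }
        }
    }
}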

Author with GitHub integration


Visual authoring with GitHub integration supports source control and collaboration for work on your data factory
pipelines. You can associate a data factory with a GitHub account repository for source control, collaboration,
versioning. A single GitHub account can have multiple repositories, but a GitHub repository can be associated with
only one data factory. If you don't have a GitHub account or repository, follow these instructions to create your
resources.
The GitHub integration with Data Factory supports both public GitHub (that is, https://github.com) and GitHub Enterprise. You can use both public and private GitHub repositories with Data Factory as long as you have read and write permission to the repository in GitHub.
To configure a GitHub repo, you have to have administrator permissions for the Azure subscription that you're
using.
For a nine-minute introduction and demonstration of this feature, watch the following video:

Limitations
You can store script and data files in a GitHub repository. However, you have to upload the files manually to
Azure Storage. A Data Factory pipeline does not automatically upload script or data files stored in a GitHub
repository to Azure Storage.
GitHub Enterprise with a version older than 2.14.0 doesn't work in the Microsoft Edge browser.
GitHub integration with the Data Factory visual authoring tools only works in the generally available version of Data Factory.
Configure a public GitHub repository with Azure Data Factory
You can configure a GitHub repository with a data factory through two methods.
Configuration method 1 (public repo): Let's get started page
In Azure Data Factory, go to the Let's get started page. Select Configure Code Repository:

The Repository Settings configuration pane appears:

The pane shows the following GitHub repository settings:

Repository Type: The type of the code repository. Value: GitHub

GitHub account: Your GitHub account name. This name can be found at https://github.com/{account name}/{repository name}. Navigating to this page prompts you to enter GitHub OAuth credentials for your GitHub account.

RepositoryName: Your GitHub code repository name. GitHub accounts contain Git repositories to manage your source code. You can create a new repository or use an existing repository that's already in your account.

Collaboration branch: Your GitHub collaboration branch that is used for publishing. By default, it is master. Change this setting in case you want to publish resources from another branch.

Root folder: Your root folder in your GitHub collaboration branch.

Import existing Data Factory resources to repository: Specifies whether to import existing data factory resources from the UX Authoring canvas into a GitHub repository. Select the box to import your data factory resources into the associated Git repository in JSON format. This action exports each resource individually (that is, the linked services and datasets are exported into separate JSONs). When this box isn't selected, the existing resources aren't imported. Value: Selected (default)

Branch to import resource into: Specifies into which branch the data factory resources (pipelines, datasets, linked services, and so on) are imported. You can import resources into one of the following branches: a. Collaboration b. Create new c. Use Existing

Configuration method 2 (public repo): UX authoring canvas


In the Azure Data Factory UX Authoring canvas, locate your data factory. Select the Data Factory drop-down
menu, and then select Configure Code Repository.
A configuration pane appears. For details about the configuration settings, see the descriptions in Configuration
method 1 above.
Configure a GitHub Enterprise Repository with Azure Data Factory
You can configure a GitHub Enterprise repository with a data factory through two methods.
Configuration method 1 (Enterprise repo): Let's get started page
In Azure Data Factory, go to the Let's get started page. Select Configure Code Repository:

The Repository Settings configuration pane appears:

The pane shows the following GitHub repository settings:

Repository Type: The type of the code repository. Value: GitHub

Use GitHub Enterprise: Checkbox to select GitHub Enterprise.

GitHub Enterprise URL: The GitHub Enterprise root URL. For example: https://github.mydomain.com

GitHub account: Your GitHub account name. This name can be found at https://github.com/{account name}/{repository name}. Navigating to this page prompts you to enter GitHub OAuth credentials for your GitHub account.

RepositoryName: Your GitHub code repository name. GitHub accounts contain Git repositories to manage your source code. You can create a new repository or use an existing repository that's already in your account.

Collaboration branch: Your GitHub collaboration branch that is used for publishing. By default, it is master. Change this setting in case you want to publish resources from another branch.

Root folder: Your root folder in your GitHub collaboration branch.

Import existing Data Factory resources to repository: Specifies whether to import existing data factory resources from the UX Authoring canvas into a GitHub repository. Select the box to import your data factory resources into the associated Git repository in JSON format. This action exports each resource individually (that is, the linked services and datasets are exported into separate JSONs). When this box isn't selected, the existing resources aren't imported. Value: Selected (default)

Branch to import resource into: Specifies into which branch the data factory resources (pipelines, datasets, linked services, and so on) are imported. You can import resources into one of the following branches: a. Collaboration b. Create new c. Use Existing

Configuration method 2 (Enterprise repo): UX authoring canvas


In the Azure Data Factory UX Authoring canvas, locate your data factory. Select the Data Factory drop-down
menu, and then select Configure Code Repository.
A configuration pane appears. For details about the configuration settings, see the descriptions in Configuration
method 1 above.

Use the expression language


You can specify expressions for property values by using the expression language that's supported by Azure Data
Factory.
Specify expressions for property values by selecting Add Dynamic Content:

Use functions and parameters


You can use functions or specify parameters for pipelines and datasets in the Data Factory expression builder:
For information about the supported expressions, see Expressions and functions in Azure Data Factory.
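For example, a copy activity's source folder path could be set with an expression like the following, which combines a pipeline parameter with built-in date functions (the parameter name folderName is hypothetical):

@concat(pipeline().parameters.folderName, '/', formatDateTime(utcnow(), 'yyyy/MM/dd'))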
Provide feedback
Select Feedback to comment about features or to notify Microsoft about issues with the tool:

Next steps
To learn more about monitoring and managing pipelines, see Monitor and manage pipelines programmatically.
Continuous integration and delivery (CI/CD) in Azure
Data Factory
5/22/2019 • 26 minutes to read

Continuous Integration is the practice of testing each change done to your codebase automatically and as early as
possible. Continuous Delivery follows the testing that happens during Continuous Integration and pushes changes
to a staging or production system.
For Azure Data Factory, continuous integration & delivery means moving Data Factory pipelines from one
environment (development, test, production) to another. To do continuous integration & delivery, you can use Data
Factory UI integration with Azure Resource Manager templates. The Data Factory UI can generate a Resource
Manager template when you select the ARM template options. When you select Export ARM template, the
portal generates the Resource Manager template for the data factory and a configuration file that includes all your
connections strings and other parameters. Then you have to create one configuration file for each environment
(development, test, production). The main Resource Manager template file remains the same for all the
environments.
For a nine-minute introduction and demonstration of this feature, watch the following video:

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Create a Resource Manager template for each environment


Select Export ARM template to export the Resource Manager template for your data factory in the development
environment.

Then go to your test data factory and production data factory and select Import ARM template.
This action takes you to the Azure portal, where you can import the exported template. Select Build your own template in the editor, then Load file, and select the generated Resource Manager template. Provide the settings, and the data factory and the entire pipeline are imported into your production environment.
Select Load file to select the exported Resource Manager template and provide all the configuration values (for
example, linked services).
Connection strings. You can find the info required to create connection strings in the articles about the individual
connectors. For example, for Azure SQL Database, see Copy data to or from Azure SQL Database by using Azure
Data Factory. To verify the correct connection string - for a linked service, for example - you can also open code
view for the resource in the Data Factory UI. In code view, however, the password or account key portion of the
connection string is removed. To open code view, select the icon highlighted in the following screenshot.
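If you prefer scripting to the portal import, you could deploy the exported template together with an environment-specific parameter file by using Azure PowerShell, roughly as in the following sketch. The resource group name, file paths, and parameter file name shown here are placeholders.

# Deploy the exported factory template with an environment-specific parameter file
# (resource group, paths, and file names here are illustrative)
New-AzResourceGroupDeployment `
    -ResourceGroupName "myTestResourceGroup" `
    -TemplateFile ".\ARMTemplateForFactory.json" `
    -TemplateParameterFile ".\ARMTemplateParametersForFactory.test.json" `
    -Mode Incremental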

Continuous integration lifecycle


Here is the entire lifecycle for continuous integration & delivery that you can use after you enable Azure Repos Git
integration in the Data Factory UI:
1. Set up a development data factory with Azure Repos in which all developers can author Data Factory
resources like pipelines, datasets, and so forth.
2. Then developers can modify resources such as pipelines. As they make their modifications, they can select
Debug to see how the pipeline runs with the most recent changes.
3. After developers are satisfied with their changes, they can create a pull request from their branch to the
master branch (or the collaboration branch) to get their changes reviewed by peers.
4. After changes are in the master branch, they can publish to the development factory by selecting Publish.
5. When the team is ready to promote changes to the test factory and the production factory, they can export
the Resource Manager template from the master branch, or from any other branch in case their master
branch backs the live development Data Factory.
6. The exported Resource Manager template can be deployed with different parameter files to the test factory
and the production factory.

Automate continuous integration with Azure Pipelines releases


Here are the steps to set up an Azure Pipelines release so you can automate the deployment of a data factory to
multiple environments.

Requirements
An Azure subscription linked to Team Foundation Server or Azure Repos using the Azure Resource
Manager service endpoint.
A Data Factory with Azure Repos Git integration configured.
An Azure Key Vault containing the secrets.
Set up an Azure Pipelines release
1. Go to your Azure Repos page in the same project as the one configured with the Data Factory.
2. Click on the top menu Azure Pipelines > Releases > Create release definition.

3. Select the Empty process template.


4. Enter the name of your environment.
5. Add a Git artifact and select the same repo configured with the Data Factory. Choose adf_publish as the
default branch with latest default version.

6. Add an Azure Resource Manager Deployment task:


a. Create new task, search for Azure Resource Group Deployment, and add it.
b. In the Deployment task, choose the subscription, resource group, and location for the target Data Factory,
and provide credentials if necessary.
c. Select the Create or update resource group action.
d. Select … in the Template field. Browse for the Resource Manager template
(ARMTemplateForFactory.json) that was created by the publish action in the portal. Look for this file in the
folder <FactoryName> of the adf_publish branch.
e. Do the same thing for the parameters file. Choose the correct file depending on whether you created a
copy or you’re using the default file ARMTemplateParametersForFactory.json.
f. Select … next to the Override template parameters field and fill in the information for the target Data Factory. For the credentials that come from key vault, use the same name for the secret in the following format: assuming the secret's name is cred1 , enter "$(cred1)" (including the quotes). An illustrative example of this field appears after these steps.

g. Select the Incremental Deployment Mode.

WARNING
If you select Complete deployment mode, existing resources may be deleted, including all the resources in the target
resource group that are not defined in the Resource Manager template.

7. Save the release pipeline.


8. Create a new release from this release pipeline.
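For illustration only, the Override template parameters value mentioned in step 6f might look like the following line. The factory name and linked service parameter name are hypothetical and depend on your generated template; sqlDbConnString is assumed to be a secret fetched by the Azure Key Vault task described in the next section.

-factoryName "mydatafactory-prod" -AzureSqlDatabase1_connectionString "$(sqlDbConnString)"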

Optional - Get the secrets from Azure Key Vault


If you have secrets to pass in an Azure Resource Manager template, we recommend using Azure Key Vault with the
Azure Pipelines release.
There are two ways to handle the secrets:
1. Add the secrets to the parameters file. For more info, see Use Azure Key Vault to pass secure parameter value
during deployment.
Create a copy of the parameters file that is uploaded to the publish branch and set the values of the
parameters you want to get from key vault with the following format:

{
    "parameters": {
        "azureSqlReportingDbPassword": {
            "reference": {
                "keyVault": {
                    "id": "/subscriptions/<subId>/resourceGroups/<resourcegroupId>/providers/Microsoft.KeyVault/vaults/<vault-name>"
                },
                "secretName": "<secret-name>"
            }
        }
    }
}

When you use this method, the secret is pulled from the key vault automatically.
The parameters file needs to be in the publish branch as well.
2. Add an Azure Key Vault task before the Azure Resource Manager Deployment described in the previous
section:
Select the Tasks tab, create a new task, search for Azure Key Vault and add it.
In the Key Vault task, choose the subscription in which you created the key vault, provide credentials
if necessary, and then choose the key vault.

Grant permissions to the Azure Pipelines agent


The Azure Key Vault task may fail the first time with an Access Denied error. Download the logs
for the release, and locate the .ps1 file with the command to give permissions to the Azure Pipelines agent. You
can run the command directly, or you can copy the principal ID from the file and add the access policy manually in
the Azure portal. (Get and List are the minimum permissions required).
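If you'd rather grant the permissions by script than in the portal, a minimal Azure PowerShell sketch could look like the following. The vault name and object ID are placeholders; the object ID is the principal ID you copied from the log file.

# Grant the release pipeline's service principal the minimum secret permissions (Get and List)
Set-AzKeyVaultAccessPolicy `
    -VaultName "<your key vault name>" `
    -ObjectId "<principal ID copied from the log>" `
    -PermissionsToSecrets Get,List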
Update active triggers
Deployment can fail if you try to update active triggers. To update active triggers, you need to manually stop them before the deployment and restart them after the deployment. You can add an Azure PowerShell task for this purpose, as shown in the following example:
1. In the Tasks tab of the release, search for Azure PowerShell and add it.
2. Choose Azure Resource Manager as the connection type and select your subscription.
3. Choose Inline Script as the script type and then provide your code. The following example stops the
triggers:

$triggersADF = Get-AzDataFactoryV2Trigger -DataFactoryName $DataFactoryName -ResourceGroupName $ResourceGroupName

$triggersADF | ForEach-Object { Stop-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $_.name -Force }

You can follow similar steps and use similar code (with the Start-AzDataFactoryV2Trigger function) to restart the
triggers after deployment.
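As a minimal sketch, the corresponding post-deployment inline script to restart the triggers could look like this, mirroring the stop script above:

$triggersADF = Get-AzDataFactoryV2Trigger -DataFactoryName $DataFactoryName -ResourceGroupName $ResourceGroupName

# Restart every trigger that exists in the deployed factory
$triggersADF | ForEach-Object { Start-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $_.name -Force }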

IMPORTANT
In continuous integration and deployment scenarios, the Integration Runtime type across different environments must be
the same. For example, if you have a Self-Hosted Integration Runtime (IR) in the development environment, the same IR
must be of type Self-Hosted in other environments such as test and production also. Similarly, if you're sharing integration
runtimes across multiple stages, you have to configure the Integration Runtimes as Linked Self-Hosted in all environments,
such as development, test, and production.

Sample deployment template


Here is a sample deployment template that you can import in Azure Pipelines.

{
"source": 2,
"id": 1,
"revision": 51,
"name": "Data Factory Prod Deployment",
"description": null,
"createdBy": {
"displayName": "Sample User",
"url": "https://pde14b1dc-d2c9-49e5-88cb-45ccd58d0335.codex.ms/vssps/_apis/Identities/c9f828d1-2dbb-4e39-
b096-f1c53d82bc2c",
"id": "c9f828d1-2dbb-4e39-b096-f1c53d82bc2c",
"uniqueName": "sampleuser@microsoft.com",
"imageUrl": "https://sampleuser.visualstudio.com/_api/_common/identityImage?id=c9f828d1-2dbb-4e39-b096-
f1c53d82bc2c",
"descriptor": "aad.M2Y2N2JlZGUtMDViZC03ZWI3LTgxYWMtMDcwM2UyODMxNTBk"
},
"createdOn": "2018-03-01T22:57:25.660Z",
"modifiedBy": {
"displayName": "Sample User",
"url": "https://pde14b1dc-d2c9-49e5-88cb-45ccd58d0335.codex.ms/vssps/_apis/Identities/c9f828d1-2dbb-4e39-
b096-f1c53d82bc2c",
"id": "c9f828d1-2dbb-4e39-b096-f1c53d82bc2c",
"uniqueName": "sampleuser@microsoft.com",
"imageUrl": "https://sampleuser.visualstudio.com/_api/_common/identityImage?id=c9f828d1-2dbb-4e39-b096-
f1c53d82bc2c",
"descriptor": "aad.M2Y2N2JlZGUtMDViZC03ZWI3LTgxYWMtMDcwM2UyODMxNTBk"
},
"modifiedOn": "2018-03-14T17:58:11.643Z",
"isDeleted": false,
"path": "\\",
"variables": {},
"variableGroups": [],
"environments": [{
"id": 1,
"name": "Prod",
"rank": 1,
"owner": {
"displayName": "Sample User",
"url": "https://pde14b1dc-d2c9-49e5-88cb-45ccd58d0335.codex.ms/vssps/_apis/Identities/c9f828d1-2dbb-4e39-
b096-f1c53d82bc2c",
"id": "c9f828d1-2dbb-4e39-b096-f1c53d82bc2c",
"uniqueName": "sampleuser@microsoft.com",
"imageUrl": "https://sampleuser.visualstudio.com/_api/_common/identityImage?id=c9f828d1-2dbb-4e39-b096-
f1c53d82bc2c",
"descriptor": "aad.M2Y2N2JlZGUtMDViZC03ZWI3LTgxYWMtMDcwM2UyODMxNTBk"
},
"variables": {
"factoryName": {
"value": "sampleuserprod"
}
},
"variableGroups": [],
"preDeployApprovals": {
"approvals": [{
"rank": 1,
"isAutomated": true,
"isNotificationOn": false,
"id": 1
}],
"approvalOptions": {
"requiredApproverCount": null,
"releaseCreatorCanBeApprover": false,
"autoTriggeredAndPreviousEnvironmentApprovedCanBeSkipped": false,
"enforceIdentityRevalidation": false,
"timeoutInMinutes": 0,
"executionOrder": 1
}
},
"deployStep": {
"id": 2
},
"postDeployApprovals": {
"approvals": [{
"rank": 1,
"isAutomated": true,
"isNotificationOn": false,
"id": 3
}],
"approvalOptions": {
"requiredApproverCount": null,
"releaseCreatorCanBeApprover": false,
"releaseCreatorCanBeApprover": false,
"autoTriggeredAndPreviousEnvironmentApprovedCanBeSkipped": false,
"enforceIdentityRevalidation": false,
"timeoutInMinutes": 0,
"executionOrder": 2
}
},
"deployPhases": [{
"deploymentInput": {
"parallelExecution": {
"parallelExecutionType": "none"
},
"skipArtifactsDownload": false,
"artifactsDownloadInput": {
"downloadInputs": []
},
"queueId": 19,
"demands": [],
"enableAccessToken": false,
"timeoutInMinutes": 0,
"jobCancelTimeoutInMinutes": 1,
"condition": "succeeded()",
"overrideInputs": {}
},
"rank": 1,
"phaseType": 1,
"name": "Run on agent",
"workflowTasks": [{
"taskId": "72a1931b-effb-4d2e-8fd8-f8472a07cb62",
"version": "2.*",
"name": "Azure PowerShell script: FilePath",
"refName": "",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": "task",
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"ConnectedServiceNameSelector": "ConnectedServiceNameARM",
"ConnectedServiceName": "",
"ConnectedServiceNameARM": "e4e2ef4b-8289-41a6-ba7c-92ca469700aa",
"ScriptType": "FilePath",
"ScriptPath": "$(System.DefaultWorkingDirectory)/Dev/deployment.ps1",
"Inline": "param\n(\n [parameter(Mandatory = $false)] [String]
$rootFolder=\"C:\\Users\\sampleuser\\Downloads\\arm_template\",\n [parameter(Mandatory = $false)] [String]
$armTemplate=\"$rootFolder\\arm_template.json\",\n [parameter(Mandatory = $false)] [String]
$armTemplateParameters=\"$rootFolder\\arm_template_parameters.json\",\n [parameter(Mandatory = $false)]
[String] $domain=\"microsoft.onmicrosoft.com\",\n [parameter(Mandatory = $false)] [String]
$TenantId=\"72f988bf-86f1-41af-91ab-2d7cd011db47\",\n [parame",
"ScriptArguments": "-rootFolder \"$(System.DefaultWorkingDirectory)/Dev/\" -DataFactoryName $(factoryname)
-predeployment $true",
"TargetAzurePs": "LatestVersion",
"CustomTargetAzurePs": "5.*"
}
}, {
"taskId": "1e244d32-2dd4-4165-96fb-b7441ca9331e",
"version": "1.*",
"name": "Azure Key Vault: sampleuservault",
"refName": "secret1",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": "task",
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"ConnectedServiceName": "e4e2ef4b-8289-41a6-ba7c-92ca469700aa",
"ConnectedServiceName": "e4e2ef4b-8289-41a6-ba7c-92ca469700aa",
"KeyVaultName": "sampleuservault",
"SecretsFilter": "*"
}
}, {
"taskId": "94a74903-f93f-4075-884f-dc11f34058b4",
"version": "2.*",
"name": "Azure Deployment:Create Or Update Resource Group action on sampleuser-datafactory",
"refName": "",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": "task",
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"ConnectedServiceName": "e4e2ef4b-8289-41a6-ba7c-92ca469700aa",
"action": "Create Or Update Resource Group",
"resourceGroupName": "sampleuser-datafactory",
"location": "East US",
"templateLocation": "Linked artifact",
"csmFileLink": "",
"csmParametersFileLink": "",
"csmFile": "$(System.DefaultWorkingDirectory)/Dev/ARMTemplateForFactory.json",
"csmParametersFile": "$(System.DefaultWorkingDirectory)/Dev/ARMTemplateParametersForFactory.json",
"overrideParameters": "-factoryName \"$(factoryName)\" -linkedService1_connectionString
\"$(linkedService1-connectionString)\" -linkedService2_connectionString \"$(linkedService2-
connectionString)\"",
"deploymentMode": "Incremental",
"enableDeploymentPrerequisites": "None",
"deploymentGroupEndpoint": "",
"project": "",
"deploymentGroupName": "",
"copyAzureVMTags": "true",
"outputVariable": "",
"deploymentOutputs": ""
}
}, {
"taskId": "72a1931b-effb-4d2e-8fd8-f8472a07cb62",
"version": "2.*",
"name": "Azure PowerShell script: FilePath",
"refName": "",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": "task",
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"ConnectedServiceNameSelector": "ConnectedServiceNameARM",
"ConnectedServiceName": "",
"ConnectedServiceNameARM": "e4e2ef4b-8289-41a6-ba7c-92ca469700aa",
"ScriptType": "FilePath",
"ScriptPath": "$(System.DefaultWorkingDirectory)/Dev/deployment.ps1",
"Inline": "# You can write your azure powershell scripts inline here. \n# You can also pass predefined and
custom variables to this script using arguments",
"ScriptArguments": "-rootFolder \"$(System.DefaultWorkingDirectory)/Dev/\" -DataFactoryName $(factoryname)
-predeployment $false",
"TargetAzurePs": "LatestVersion",
"CustomTargetAzurePs": ""
}
}]
}],
"environmentOptions": {
"emailNotificationType": "OnlyOnFailure",
"emailRecipients": "release.environment.owner;release.creator",
"skipArtifactsDownload": false,
"timeoutInMinutes": 0,
"timeoutInMinutes": 0,
"enableAccessToken": false,
"publishDeploymentStatus": true,
"badgeEnabled": false,
"autoLinkWorkItems": false
},
"demands": [],
"conditions": [{
"name": "ReleaseStarted",
"conditionType": 1,
"value": ""
}],
"executionPolicy": {
"concurrencyCount": 1,
"queueDepthCount": 0
},
"schedules": [],
"retentionPolicy": {
"daysToKeep": 30,
"releasesToKeep": 3,
"retainBuild": true
},
"processParameters": {
"dataSourceBindings": [{
"dataSourceName": "AzureRMWebAppNamesByType",
"parameters": {
"WebAppKind": "$(WebAppKind)"
},
"endpointId": "$(ConnectedServiceName)",
"target": "WebAppName"
}]
},
"properties": {},
"preDeploymentGates": {
"id": 0,
"gatesOptions": null,
"gates": []
},
"postDeploymentGates": {
"id": 0,
"gatesOptions": null,
"gates": []
},
"badgeUrl": "https://sampleuser.vsrm.visualstudio.com/_apis/public/Release/badge/19749ef3-2f42-49b5-9696-
f28b49faebcb/1/1"
}, {
"id": 2,
"name": "Staging",
"rank": 2,
"owner": {
"displayName": "Sample User",
"url": "https://pde14b1dc-d2c9-49e5-88cb-45ccd58d0335.codex.ms/vssps/_apis/Identities/c9f828d1-2dbb-4e39-
b096-f1c53d82bc2c",
"id": "c9f828d1-2dbb-4e39-b096-f1c53d82bc2c",
"uniqueName": "sampleuser@microsoft.com",
"imageUrl": "https://sampleuser.visualstudio.com/_api/_common/identityImage?id=c9f828d1-2dbb-4e39-b096-
f1c53d82bc2c",
"descriptor": "aad.M2Y2N2JlZGUtMDViZC03ZWI3LTgxYWMtMDcwM2UyODMxNTBk"
},
"variables": {
"factoryName": {
"value": "sampleuserstaging"
}
},
"variableGroups": [],
"preDeployApprovals": {
"approvals": [{
"rank": 1,
"isAutomated": true,
"isNotificationOn": false,
"isNotificationOn": false,
"id": 4
}],
"approvalOptions": {
"requiredApproverCount": null,
"releaseCreatorCanBeApprover": false,
"autoTriggeredAndPreviousEnvironmentApprovedCanBeSkipped": false,
"enforceIdentityRevalidation": false,
"timeoutInMinutes": 0,
"executionOrder": 1
}
},
"deployStep": {
"id": 5
},
"postDeployApprovals": {
"approvals": [{
"rank": 1,
"isAutomated": true,
"isNotificationOn": false,
"id": 6
}],
"approvalOptions": {
"requiredApproverCount": null,
"releaseCreatorCanBeApprover": false,
"autoTriggeredAndPreviousEnvironmentApprovedCanBeSkipped": false,
"enforceIdentityRevalidation": false,
"timeoutInMinutes": 0,
"executionOrder": 2
}
},
"deployPhases": [{
"deploymentInput": {
"parallelExecution": {
"parallelExecutionType": "none"
},
"skipArtifactsDownload": false,
"artifactsDownloadInput": {
"downloadInputs": []
},
"queueId": 19,
"demands": [],
"enableAccessToken": false,
"timeoutInMinutes": 0,
"jobCancelTimeoutInMinutes": 1,
"condition": "succeeded()",
"overrideInputs": {}
},
"rank": 1,
"phaseType": 1,
"name": "Run on agent",
"workflowTasks": [{
"taskId": "72a1931b-effb-4d2e-8fd8-f8472a07cb62",
"version": "2.*",
"name": "Azure PowerShell script: FilePath",
"refName": "",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": "task",
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"ConnectedServiceNameSelector": "ConnectedServiceNameARM",
"ConnectedServiceName": "",
"ConnectedServiceNameARM": "e4e2ef4b-8289-41a6-ba7c-92ca469700aa",
"ScriptType": "FilePath",
"ScriptPath": "$(System.DefaultWorkingDirectory)/Dev/deployment.ps1",
"Inline": "# You can write your azure powershell scripts inline here. \n# You can also pass predefined and
custom variables to this script using arguments",
"ScriptArguments": "-rootFolder \"$(System.DefaultWorkingDirectory)/Dev/\" -DataFactoryName $(factoryname)
-predeployment $true",
"TargetAzurePs": "LatestVersion",
"CustomTargetAzurePs": ""
}
}, {
"taskId": "1e244d32-2dd4-4165-96fb-b7441ca9331e",
"version": "1.*",
"name": "Azure Key Vault: sampleuservault",
"refName": "",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": "task",
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"ConnectedServiceName": "e4e2ef4b-8289-41a6-ba7c-92ca469700aa",
"KeyVaultName": "sampleuservault",
"SecretsFilter": "*"
}
}, {
"taskId": "94a74903-f93f-4075-884f-dc11f34058b4",
"version": "2.*",
"name": "Azure Deployment:Create Or Update Resource Group action on sampleuser-datafactory",
"refName": "",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": "task",
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"ConnectedServiceName": "e4e2ef4b-8289-41a6-ba7c-92ca469700aa",
"action": "Create Or Update Resource Group",
"resourceGroupName": "sampleuser-datafactory",
"location": "East US",
"templateLocation": "Linked artifact",
"csmFileLink": "",
"csmParametersFileLink": "",
"csmFile": "$(System.DefaultWorkingDirectory)/Dev/ARMTemplateForFactory.json",
"csmParametersFile": "$(System.DefaultWorkingDirectory)/Dev/ARMTemplateParametersForFactory.json",
"overrideParameters": "-factoryName \"$(factoryName)\" -linkedService1_connectionString
\"$(linkedService1-connectionString)\" -linkedService2_connectionString \"$(linkedService2-
connectionString)\"",
"deploymentMode": "Incremental",
"enableDeploymentPrerequisites": "None",
"deploymentGroupEndpoint": "",
"project": "",
"deploymentGroupName": "",
"copyAzureVMTags": "true",
"outputVariable": "",
"deploymentOutputs": ""
}
}, {
"taskId": "72a1931b-effb-4d2e-8fd8-f8472a07cb62",
"version": "2.*",
"name": "Azure PowerShell script: FilePath",
"refName": "",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": "task",
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"ConnectedServiceNameSelector": "ConnectedServiceNameARM",
"ConnectedServiceName": "",
"ConnectedServiceNameARM": "16a37943-8b58-4c2f-a3d6-052d6f032a07",
"ScriptType": "FilePath",
"ScriptPath": "$(System.DefaultWorkingDirectory)/Dev/deployment.ps1",
"Inline": "param(\n$x,\n$y,\n$z)\nwrite-host \"----------\"\nwrite-host $x\nwrite-host $y\nwrite-host $z |
ConvertTo-SecureString\nwrite-host \"----------\"",
"ScriptArguments": "-rootFolder \"$(System.DefaultWorkingDirectory)/Dev/\" -DataFactoryName $(factoryname)
-predeployment $false",
"TargetAzurePs": "LatestVersion",
"CustomTargetAzurePs": ""
}
}]
}],
"environmentOptions": {
"emailNotificationType": "OnlyOnFailure",
"emailRecipients": "release.environment.owner;release.creator",
"skipArtifactsDownload": false,
"timeoutInMinutes": 0,
"enableAccessToken": false,
"publishDeploymentStatus": true,
"badgeEnabled": false,
"autoLinkWorkItems": false
},
"demands": [],
"conditions": [{
"name": "ReleaseStarted",
"conditionType": 1,
"value": ""
}],
"executionPolicy": {
"concurrencyCount": 1,
"queueDepthCount": 0
},
"schedules": [],
"retentionPolicy": {
"daysToKeep": 30,
"releasesToKeep": 3,
"retainBuild": true
},
"processParameters": {
"dataSourceBindings": [{
"dataSourceName": "AzureRMWebAppNamesByType",
"parameters": {
"WebAppKind": "$(WebAppKind)"
},
"endpointId": "$(ConnectedServiceName)",
"target": "WebAppName"
}]
},
"properties": {},
"preDeploymentGates": {
"id": 0,
"gatesOptions": null,
"gates": []
},
"postDeploymentGates": {
"id": 0,
"gatesOptions": null,
"gates": []
},
"badgeUrl": "https://sampleuser.vsrm.visualstudio.com/_apis/public/Release/badge/19749ef3-2f42-49b5-9696-
f28b49faebcb/1/2"
}],
"artifacts": [{
"sourceId": "19749ef3-2f42-49b5-9696-f28b49faebcb:a6c88f30-5e1f-4de8-b24d-279bb209d85f",
"type": "Git",
"type": "Git",
"alias": "Dev",
"definitionReference": {
"branches": {
"id": "adf_publish",
"name": "adf_publish"
},
"checkoutSubmodules": {
"id": "",
"name": ""
},
"defaultVersionSpecific": {
"id": "",
"name": ""
},
"defaultVersionType": {
"id": "latestFromBranchType",
"name": "Latest from default branch"
},
"definition": {
"id": "a6c88f30-5e1f-4de8-b24d-279bb209d85f",
"name": "Dev"
},
"fetchDepth": {
"id": "",
"name": ""
},
"gitLfsSupport": {
"id": "",
"name": ""
},
"project": {
"id": "19749ef3-2f42-49b5-9696-f28b49faebcb",
"name": "Prod"
}
},
"isPrimary": true
}],
"triggers": [{
"schedule": {
"jobId": "b5ef09b6-8dfd-4b91-8b48-0709e3e67b2d",
"timeZoneId": "UTC",
"startHours": 3,
"startMinutes": 0,
"daysToRelease": 31
},
"triggerType": 2
}],
"releaseNameFormat": "Release-$(rev:r)",
"url": "https://sampleuser.vsrm.visualstudio.com/19749ef3-2f42-49b5-9696-
f28b49faebcb/_apis/Release/definitions/1",
"_links": {
"self": {
"href": "https://sampleuser.vsrm.visualstudio.com/19749ef3-2f42-49b5-9696-
f28b49faebcb/_apis/Release/definitions/1"
},
"web": {
"href": "https://sampleuser.visualstudio.com/19749ef3-2f42-49b5-9696-f28b49faebcb/_release?definitionId=1"
}
},
"tags": [],
"properties": {
"DefinitionCreationSource": {
"$type": "System.String",
"$value": "ReleaseNew"
}
}
}
Sample script to stop and restart triggers and clean up
Here is a sample script to stop triggers before deployment and to restart triggers afterwards. The script also
includes code to delete resources that have been removed. To install the latest version of Azure PowerShell, see
Install Azure PowerShell on Windows with PowerShellGet.

param
(
[parameter(Mandatory = $false)] [String] $rootFolder,
[parameter(Mandatory = $false)] [String] $armTemplate,
[parameter(Mandatory = $false)] [String] $ResourceGroupName,
[parameter(Mandatory = $false)] [String] $DataFactoryName,
[parameter(Mandatory = $false)] [Bool] $predeployment=$true,
[parameter(Mandatory = $false)] [Bool] $deleteDeployment=$false
)

$templateJson = Get-Content $armTemplate | ConvertFrom-Json


$resources = $templateJson.resources

#Triggers
Write-Host "Getting triggers"
$triggersADF = Get-AzDataFactoryV2Trigger -DataFactoryName $DataFactoryName -ResourceGroupName
$ResourceGroupName
$triggersTemplate = $resources | Where-Object { $_.type -eq "Microsoft.DataFactory/factories/triggers" }
$triggerNames = $triggersTemplate | ForEach-Object {$_.name.Substring(37, $_.name.Length-40)}
$activeTriggerNames = $triggersTemplate | Where-Object { $_.properties.runtimeState -eq "Started" -and
($_.properties.pipelines.Count -gt 0 -or $_.properties.pipeline.pipelineReference -ne $null)} | ForEach-Object
{$_.name.Substring(37, $_.name.Length-40)}
$deletedtriggers = $triggersADF | Where-Object { $triggerNames -notcontains $_.Name }
$triggerstostop = $triggerNames | where { ($triggersADF | Select-Object name).name -contains $_ }

if ($predeployment -eq $true) {


#Stop all triggers
Write-Host "Stopping deployed triggers"
$triggerstostop | ForEach-Object {
Write-host "Disabling trigger " $_
Stop-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -
Name $_ -Force
}
}
else {
#Deleted resources
#pipelines
Write-Host "Getting pipelines"
$pipelinesADF = Get-AzDataFactoryV2Pipeline -DataFactoryName $DataFactoryName -ResourceGroupName
$ResourceGroupName
$pipelinesTemplate = $resources | Where-Object { $_.type -eq "Microsoft.DataFactory/factories/pipelines" }
$pipelinesNames = $pipelinesTemplate | ForEach-Object {$_.name.Substring(37, $_.name.Length-40)}
$deletedpipelines = $pipelinesADF | Where-Object { $pipelinesNames -notcontains $_.Name }
#datasets
Write-Host "Getting datasets"
$datasetsADF = Get-AzDataFactoryV2Dataset -DataFactoryName $DataFactoryName -ResourceGroupName
$ResourceGroupName
$datasetsTemplate = $resources | Where-Object { $_.type -eq "Microsoft.DataFactory/factories/datasets" }
$datasetsNames = $datasetsTemplate | ForEach-Object {$_.name.Substring(37, $_.name.Length-40) }
$deleteddataset = $datasetsADF | Where-Object { $datasetsNames -notcontains $_.Name }
#linkedservices
Write-Host "Getting linked services"
$linkedservicesADF = Get-AzDataFactoryV2LinkedService -DataFactoryName $DataFactoryName -ResourceGroupName
$ResourceGroupName
$linkedservicesTemplate = $resources | Where-Object { $_.type -eq
"Microsoft.DataFactory/factories/linkedservices" }
$linkedservicesNames = $linkedservicesTemplate | ForEach-Object {$_.name.Substring(37, $_.name.Length-40)}
$deletedlinkedservices = $linkedservicesADF | Where-Object { $linkedservicesNames -notcontains $_.Name }
#Integrationruntimes
Write-Host "Getting integration runtimes"
$integrationruntimesADF = Get-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -
ResourceGroupName $ResourceGroupName
$integrationruntimesTemplate = $resources | Where-Object { $_.type -eq
"Microsoft.DataFactory/factories/integrationruntimes" }
$integrationruntimesNames = $integrationruntimesTemplate | ForEach-Object {$_.name.Substring(37,
$_.name.Length-40)}
$deletedintegrationruntimes = $integrationruntimesADF | Where-Object { $integrationruntimesNames -
notcontains $_.Name }

#Delete resources
Write-Host "Deleting triggers"
$deletedtriggers | ForEach-Object {
Write-Host "Deleting trigger " $_.Name
$trig = Get-AzDataFactoryV2Trigger -name $_.Name -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName
if ($trig.RuntimeState -eq "Started") {
Stop-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName
-Name $_.Name -Force
}
Remove-AzDataFactoryV2Trigger -Name $_.Name -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName -Force
}
Write-Host "Deleting pipelines"
$deletedpipelines | ForEach-Object {
Write-Host "Deleting pipeline " $_.Name
Remove-AzDataFactoryV2Pipeline -Name $_.Name -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName -Force
}
Write-Host "Deleting datasets"
$deleteddataset | ForEach-Object {
Write-Host "Deleting dataset " $_.Name
Remove-AzDataFactoryV2Dataset -Name $_.Name -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName -Force
}
Write-Host "Deleting linked services"
$deletedlinkedservices | ForEach-Object {
Write-Host "Deleting Linked Service " $_.Name
Remove-AzDataFactoryV2LinkedService -Name $_.Name -ResourceGroupName $ResourceGroupName -
DataFactoryName $DataFactoryName -Force
}
Write-Host "Deleting integration runtimes"
$deletedintegrationruntimes | ForEach-Object {
Write-Host "Deleting integration runtime " $_.Name
Remove-AzDataFactoryV2IntegrationRuntime -Name $_.Name -ResourceGroupName $ResourceGroupName -
DataFactoryName $DataFactoryName -Force
}

if ($deleteDeployment -eq $true) {


Write-Host "Deleting ARM deployment ... under resource group: " $ResourceGroupName
$deployments = Get-AzResourceGroupDeployment -ResourceGroupName $ResourceGroupName
$deploymentsToConsider = $deployments | Where { $_.DeploymentName -like "ArmTemplate_master*" -or
$_.DeploymentName -like "ArmTemplateForFactory*" } | Sort-Object -Property Timestamp -Descending
$deploymentName = $deploymentsToConsider[0].DeploymentName

Write-Host "Deployment to be deleted: " $deploymentName


$deploymentOperations = Get-AzResourceGroupDeploymentOperation -DeploymentName $deploymentName -
ResourceGroupName $ResourceGroupName
$deploymentsToDelete = $deploymentOperations | Where { $_.properties.targetResource.id -like
"*Microsoft.Resources/deployments*" }

$deploymentsToDelete | ForEach-Object {
Write-host "Deleting inner deployment: " $_.properties.targetResource.id
Remove-AzResourceGroupDeployment -Id $_.properties.targetResource.id
}
Write-Host "Deleting deployment: " $deploymentName
Remove-AzResourceGroupDeployment -ResourceGroupName $ResourceGroupName -Name $deploymentName
}

#Start Active triggers - After cleanup efforts


Write-Host "Starting active triggers"
Write-Host "Starting active triggers"
$activeTriggerNames | ForEach-Object {
Write-host "Enabling trigger " $_
Start-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -
Name $_ -Force
}
}

Use custom parameters with the Resource Manager template


If you are in GIT mode, you can override the default properties in your Resource Manager template to set
properties that are parameterized in the template and properties that are hard-coded. You might want to override
the default parameterization template in these scenarios:
You use automated CI/CD and you want to change some properties during Resource Manager deployment, but
the properties aren't parameterized by default.
Your factory is so large that the default Resource Manager template is invalid because it has more than the
maximum allowed parameters (256).
Under these conditions, to override the default parameterization template, create a file named arm-template-parameters-definition.json in the root folder of the repository. The file name must exactly match. Data Factory tries
to read this file from whichever branch you are currently on in the Azure Data Factory portal, not just from the
collaboration branch. You can create or edit the file from a private branch, where you can test your changes by
using the Export ARM template in the UI. Then, you can merge the file into the collaboration branch. If no file is
found, the default template is used.
Syntax of a custom parameters file
Here are some guidelines to use when you author the custom parameters file. The file consists of a section for each
entity type: trigger, pipeline, linkedservice, dataset, integrationruntime, and so on.
Enter the property path under the relevant entity type.
When you set a property name to * , you indicate that you want to parameterize all properties under it (only down to the first level, not recursively). You can also provide any exceptions to this.
When you set the value of a property as a string, you indicate that you want to parameterize the property. Use
the format <action>:<name>:<stype> .
<action> can be one of the following characters:
= means keep the current value as the default value for the parameter.
- means do not keep the default value for the parameter.
| is a special case for secrets from Azure Key Vault for connection strings or keys.
<name> is the name of the parameter. If it is blank, it takes the name of the property. If the value starts
with a - character, the name is shortened. For example,
AzureStorage1_properties_typeProperties_connectionString would be shortened to
AzureStorage1_connectionString .
<stype> is the type of parameter. If <stype> is blank, the default type is string . Supported values:
string , bool , number , object , and securestring .
When you specify an array in the definition file, you indicate that the matching property in the template is an array. Data Factory iterates through all the objects in the array by using the definition that's specified in the first object of the array. The second object, a string, becomes the name of the property, which is used as the name for the parameter for each iteration.
It's not possible to have a definition that's specific for a resource instance. Any definition applies to all resources
of that type.
By default, all secure strings, such as connection strings, keys, and tokens, and all Key Vault secrets are parameterized.
Sample parameterization template
{
"Microsoft.DataFactory/factories/pipelines": {
"properties": {
"activities": [{
"typeProperties": {
"waitTimeInSeconds": "-::number",
"headers": "=::object"
}
}]
}
},
"Microsoft.DataFactory/factories/integrationRuntimes": {
"properties": {
"typeProperties": {
"*": "="
}
}
},
"Microsoft.DataFactory/factories/triggers": {
"properties": {
"typeProperties": {
"recurrence": {
"*": "=",
"interval": "=:triggerSuffix:number",
"frequency": "=:-freq"
},
"maxConcurrency": "="
}
}
},
"Microsoft.DataFactory/factories/linkedServices": {
"*": {
"properties": {
"typeProperties": {
"accountName": "=",
"username": "=",
"connectionString": "|:-connectionString:secureString",
"secretAccessKey": "|"
}
}
},
"AzureDataLakeStore": {
"properties": {
"typeProperties": {
"dataLakeStoreUri": "="
}
}
}
},
"Microsoft.DataFactory/factories/datasets": {
"properties": {
"typeProperties": {
"*": "="
}
}
}
}

Explanation:
Pipelines
Any property in the path activities/typeProperties/waitTimeInSeconds is parameterized. This means that any
activity in a pipeline that has a code-level property named waitTimeInSeconds (for example, the Wait activity) is
parameterized as a number, with a default name. But, it won't have a default value in the Resource Manager
template. It will be a mandatory input during the Resource Manager deployment.
Similarly, a property called headers (for example, in a Web activity) is parameterized with type object
(JObject). It has a default value, which is the same value as in the source factory.
IntegrationRuntimes
Only properties, and all properties, under the path typeProperties are parameterized, with their respective
default values. For example, as of today's schema, there are two properties under IntegrationRuntimes type
properties: computeProperties and ssisProperties . Both property types are created with their respective
default values and types (Object).
Triggers
Under typeProperties , two properties are parameterized. The first one is maxConcurrency , which is specified to
have a default value, and the type would be string . It has the default parameter name of
<entityName>_properties_typeProperties_maxConcurrency .
The recurrence property also is parameterized. Under it, all properties at that level are specified to be
parameterized as strings, with default values and parameter names. An exception is the interval property,
which is parameterized as number type, and with the parameter name suffixed with
<entityName>_properties_typeProperties_recurrence_triggerSuffix . Similarly, the freq property is a string and
is parameterized as a string. However, the freq property is parameterized without a default value. The name is
shortened and suffixed. For example, <entityName>_freq .
LinkedServices
Linked services are unique. Because linked services and datasets can potentially be of several types, you can
provide type-specific customization. For example, you might say that for all linked services of type
AzureDataLakeStore , a specific template will be applied, and for all others (via * ) a different template will be
applied.
In the preceding example, the connectionString property will be parameterized as a securestring value, it
won't have a default value, and it will have a shortened parameter name that's suffixed with connectionString .
The property secretAccessKey , however, happens to be an AzureKeyVaultSecret (for instance, an AmazonS3
linked service). Thus, it is automatically parameterized as an Azure Key Vault secret, and it's fetched from the key
vault that it's configured with in the source factory. You can also parameterize the key vault, itself.
Datasets
Even though type-specific customization is available for datasets, configuration can be provided without
explicitly having a *-level configuration. In the preceding example, all dataset properties under typeProperties
are parameterized.
The default parameterization template can change, but the following is the current default template. Knowing the default is useful if you just need to add one additional property as a parameter and you don't want to lose the existing parameterizations and have to re-create them.

{
"Microsoft.DataFactory/factories/pipelines": {
},
"Microsoft.DataFactory/factories/integrationRuntimes":{
"properties": {
"typeProperties": {
"ssisProperties": {
"catalogInfo": {
"catalogServerEndpoint": "=",
"catalogAdminUserName": "=",
"catalogAdminPassword": {
"value": "-::secureString"
}
},
"customSetupScriptProperties": {
"customSetupScriptProperties": {
"sasToken": {
"value": "-::secureString"
}
}
},
"linkedInfo": {
"key": {
"value": "-::secureString"
},
"resourceId": "="
}
}
}
},
"Microsoft.DataFactory/factories/triggers": {
"properties": {
"pipelines": [{
"parameters": {
"*": "="
}
},
"pipelineReference.referenceName"
],
"pipeline": {
"parameters": {
"*": "="
}
},
"typeProperties": {
"scope": "="
}

}
},
"Microsoft.DataFactory/factories/linkedServices": {
"*": {
"properties": {
"typeProperties": {
"accountName": "=",
"username": "=",
"userName": "=",
"accessKeyId": "=",
"servicePrincipalId": "=",
"userId": "=",
"clientId": "=",
"clusterUserName": "=",
"clusterSshUserName": "=",
"hostSubscriptionId": "=",
"clusterResourceGroup": "=",
"subscriptionId": "=",
"resourceGroupName": "=",
"tenant": "=",
"dataLakeStoreUri": "=",
"baseUrl": "=",
"database": "=",
"serviceEndpoint": "=",
"batchUri": "=",
"databaseName": "=",
"systemNumber": "=",
"server": "=",
"url":"=",
"aadResourceId": "=",
"connectionString": "|:-connectionString:secureString"
}
}
},
"Odbc": {
"properties": {
"typeProperties": {
"typeProperties": {
"userName": "=",
"connectionString": {
"secretName": "="
}
}
}
}
},
"Microsoft.DataFactory/factories/datasets": {
"*": {
"properties": {
"typeProperties": {
"folderPath": "=",
"fileName": "="
}
}
}}
}

Example: Add a Databricks Interactive cluster ID (from a Databricks Linked Service) to the parameters file:

{
"Microsoft.DataFactory/factories/pipelines": {
},
"Microsoft.DataFactory/factories/integrationRuntimes":{
"properties": {
"typeProperties": {
"ssisProperties": {
"catalogInfo": {
"catalogServerEndpoint": "=",
"catalogAdminUserName": "=",
"catalogAdminPassword": {
"value": "-::secureString"
}
},
"customSetupScriptProperties": {
"sasToken": {
"value": "-::secureString"
}
}
},
"linkedInfo": {
"key": {
"value": "-::secureString"
},
"resourceId": "="
}
}
}
},
"Microsoft.DataFactory/factories/triggers": {
"properties": {
"pipelines": [{
"parameters": {
"*": "="
}
},
"pipelineReference.referenceName"
],
"pipeline": {
"parameters": {
"*": "="
}
},
"typeProperties": {
"scope": "="
}
}

}
},
"Microsoft.DataFactory/factories/linkedServices": {
"*": {
"properties": {
"typeProperties": {
"accountName": "=",
"username": "=",
"userName": "=",
"accessKeyId": "=",
"servicePrincipalId": "=",
"userId": "=",
"clientId": "=",
"clusterUserName": "=",
"clusterSshUserName": "=",
"hostSubscriptionId": "=",
"clusterResourceGroup": "=",
"subscriptionId": "=",
"resourceGroupName": "=",
"tenant": "=",
"dataLakeStoreUri": "=",
"baseUrl": "=",
"database": "=",
"serviceEndpoint": "=",
"batchUri": "=",
"databaseName": "=",
"systemNumber": "=",
"server": "=",
"url":"=",
"aadResourceId": "=",
"connectionString": "|:-connectionString:secureString",
"existingClusterId": "-"
}
}
},
"Odbc": {
"properties": {
"typeProperties": {
"userName": "=",
"connectionString": {
"secretName": "="
}
}
}
}
},
"Microsoft.DataFactory/factories/datasets": {
"*": {
"properties": {
"typeProperties": {
"folderPath": "=",
"fileName": "="
}
}
}}
}
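For context, the "existingClusterId": "-" entry above targets a linked service property like the one in the following minimal sketch of an Azure Databricks linked service (all values are placeholders, and your actual linked service may carry additional properties):

{
    "name": "AzureDatabricksLinkedService",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://<region>.azuredatabricks.net",
            "accessToken": {
                "type": "SecureString",
                "value": "<access token>"
            },
            "existingClusterId": "<interactive cluster id>"
        }
    }
}

Because the template entry uses "-" rather than "=", the property is parameterized without keeping the current value as the default, so each environment can supply its own interactive cluster ID at deployment time.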

Linked Resource Manager templates


If you've set up continuous integration and deployment (CI/CD) for your Data Factories, you may observe that, as
your factory grows bigger, you run into the Resource Manager template limits, like the maximum number of
resources or the maximum payload in a Resource Manager template. For scenarios like these, along with
generating the full Resource Manager template for a factory, Data Factory also now generates Linked Resource
Manager templates. As a result, you have the entire factory payload broken down into several files, so that you
don’t run into the mentioned limits.
If you have Git configured, the linked templates are generated and saved alongside the full Resource Manager
templates, in the adf_publish branch, under a new folder called linkedTemplates .

The Linked Resource Manager templates usually have a master template and a set of child templates linked to the
master. The parent template is called ArmTemplate_master.json , and child templates are named with the pattern
ArmTemplate_0.json , ArmTemplate_1.json , and so on. To move over from using the full Resource Manager template
to using the linked templates, update your CI/CD task to point to ArmTemplate_master.json instead of pointing to
ArmTemplateForFactory.json (that is, the full Resource Manager template). Resource Manager also requires you to
upload the linked templates into a storage account so that they can be accessed by Azure during deployment. For
more info, see Deploying Linked ARM Templates with VSTS.
Remember to add the Data Factory scripts in your CI/CD pipeline before and after the deployment task.
If you don’t have Git configured, the linked templates are accessible via the Export ARM template gesture.
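To make the mechanics concrete, the master template typically wires in each child template as a nested deployment resource whose templateLink points at the uploaded file. The following is an illustrative sketch only (the storage account, container, SAS token, and parameter name are placeholders, and the exact shape of the generated master template may differ):

{
    "type": "Microsoft.Resources/deployments",
    "apiVersion": "2018-05-01",
    "name": "ArmTemplate_0",
    "properties": {
        "mode": "Incremental",
        "templateLink": {
            "uri": "https://<storage account>.blob.core.windows.net/<container>/ArmTemplate_0.json?<SAS token>",
            "contentVersion": "1.0.0.0"
        },
        "parameters": {
            "factoryName": {
                "value": "[parameters('factoryName')]"
            }
        }
    }
}

This is also why the linked templates must be uploaded to a storage location that Resource Manager can reach during deployment.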

Best practices for CI/CD


If you're using Git integration with your data factory, and you have a CI/CD pipeline that moves your changes from
Development into Test and then to Production, we recommend the following best practices:
Git Integration. You are only required to configure your Development data factory with Git integration.
Changes to Test and Production are deployed via CI/CD, and they don't need to have Git integration.
Data Factory CI/CD script. Before the Resource Manager deployment step in CI/CD, you must take care
of things like stopping the triggers and other kinds of factory cleanup. We recommend using this script
because it takes care of all these things. Run the script once before the deployment and once after, using the
appropriate flags.
Integration Runtimes and sharing. Integration Runtimes are one of the infrastructural components in
your data factory, which undergo changes less often, and are similar across all stages in your CI/CD. As a
result, Data Factory expects you to have the same name and same type of Integration Runtimes across all
stages of CI/CD. If you are looking to share Integration Runtimes across all stages - for instance, the Self-
hosted Integration Runtimes - one way to share is by hosting the Self-hosted IR in a ternary factory, just for
containing the shared Integration Runtimes. Then you can use them in Dev/Test/Prod as a Linked IR type.
Key Vault. When you use the recommended Azure Key Vault based linked services, you can take their
advantages one level further by keeping separate key vaults for Dev/Test/Prod. You can also configure
separate permission levels for each of them; you may not want your team members to have permissions to
the Production secrets. We also recommend that you keep the same secret names across all stages. If you
keep the same names, you don't have to change your Resource Manager templates across CI/CD, since the
only thing that needs to be changed is the key vault name, which is one of the Resource Manager template
parameters (see the sketch below).
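As a minimal sketch of this setup, each stage can have its own Azure Key Vault linked service whose base URL becomes a Resource Manager template parameter, so only the vault name changes between Dev/Test/Prod (the names below are placeholders):

{
    "name": "AzureKeyVaultLinkedService",
    "properties": {
        "type": "AzureKeyVault",
        "typeProperties": {
            "baseUrl": "https://<key vault name>.vault.azure.net"
        }
    }
}

Because the secret names referenced by the other linked services stay the same across stages, swapping the baseUrl parameter value per environment is the only change needed at deployment time.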

Unsupported features
You can't publish individual resources, because data factory entities depend on each other. For example,
triggers depend on pipelines, and pipelines depend on datasets and other pipelines. Tracking changing
dependencies is hard. If it were possible to select the resources to publish manually, it would be possible to
pick only a subset of the entire set of changes, which could lead to unexpected behavior after
publishing.
You can't publish from private branches.
You can't host projects on Bitbucket.
Iterative development and debugging with Azure
Data Factory
3/7/2019 • 2 minutes to read

Azure Data Factory lets you iteratively develop and debug Data Factory pipelines.
For an eight-minute introduction and demonstration of this feature, watch the following video:

Iterative debugging features


Create pipelines and do test runs using the Debug capability in the pipeline canvas without writing a single line of
code.

View the results of your test runs in the Output window of the pipeline canvas.
After a test run succeeds, add more activities to your pipeline and continue debugging in an iterative manner. You
can also Cancel a test run while it is in progress.

When you do test runs, you don't have to publish your changes to the data factory before you select Debug. This
feature is helpful in scenarios where you want to make sure that the changes work as expected before you update
the data factory workflow.

IMPORTANT
Selecting Debug actually runs the pipeline. So, for example, if the pipeline contains a copy activity, the test run copies data
from source to destination. As a result, we recommend that you use test folders in your copy activities and other activities
when debugging. After you've debugged the pipeline, switch to the actual folders that you want to use in normal operations.

Visualizing debug runs


You can visualize all the debug runs that are in progress for your data factory in one place. Select View debug
runs in the upper right corner of the page. This feature is useful in scenarios where you have master pipelines
kicking off debug runs for child pipelines, and you want a single view to see all the active debug runs.

Monitoring debug runs


The test runs initiated with the Debug capability are not available in the list on the Monitor tab. You can only see
runs triggered with Trigger Now, Schedule, or Tumbling Window triggers in the Monitor tab. You can see the
last test run initiated with the Debug capability in the Output window of the pipeline canvas.

Setting breakpoints for debugging


Data Factory also lets you debug until you reach a particular activity on the pipeline canvas. Just put a breakpoint
on the activity up to which you want to test, and select Debug. Data Factory ensures that the test runs only until the
breakpoint activity on the pipeline canvas. This Debug Until feature is useful when you don't want to test the entire
pipeline, but only a subset of activities inside the pipeline.

To set a breakpoint, select an element on the pipeline canvas. A Debug Until option appears as an empty red circle
at the upper right corner of the element.

After you select the Debug Until option, it changes to a filled red circle to indicate the breakpoint is enabled.

Next steps
Continuous integration and deployment in Azure Data Factory
Copy data from Amazon Marketplace Web Service
using Azure Data Factory (Preview)
1/3/2019 • 3 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Amazon Marketplace
Web Service. It builds on the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Amazon Marketplace Web Service to any supported sink data store. For a list of data
stores that are supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install
any driver to use this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Amazon Marketplace Web Service connector.

Linked service properties


The following properties are supported for Amazon Marketplace Web Service linked service:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to: Yes


AmazonMWS

endpoint The endpoint of the Amazon MWS Yes


server, (that is,
mws.amazonservices.com)

marketplaceID The Amazon Marketplace ID you want Yes


to retrieve data from. To retrieve data
from multiple Marketplace IDs, separate
them with a comma ( , ). (that is,
A2EUQ1WTGCTBG2)

sellerID The Amazon seller ID. Yes

mwsAuthToken The Amazon MWS authentication Yes


token. Mark this field as a SecureString
to store it securely in Data Factory, or
reference a secret stored in Azure Key
Vault.

accessKeyId The access key ID used to access data. Yes

secretKey The secret key used to access data. Yes


Mark this field as a SecureString to
store it securely in Data Factory, or
reference a secret stored in Azure Key
Vault.

useEncryptedEndpoints Specifies whether the data source No


endpoints are encrypted using HTTPS.
The default value is true.

useHostVerification Specifies whether to require the host No


name in the server's certificate to
match the host name of the server
when connecting over SSL. The default
value is true.

usePeerVerification Specifies whether to verify the identity No


of the server when connecting over
SSL. The default value is true.

Example:
{
"name": "AmazonMWSLinkedService",
"properties": {
"type": "AmazonMWS",
"typeProperties": {
"endpoint" : "mws.amazonservices.com",
"marketplaceID" : "A2EUQ1WTGCTBG2",
"sellerID" : "<sellerID>",
"mwsAuthToken": {
"type": "SecureString",
"value": "<mwsAuthToken>"
},
"accessKeyId" : "<accessKeyId>",
"secretKey": {
"type": "SecureString",
"value": "<secretKey>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Amazon Marketplace Web Service dataset.
To copy data from Amazon Marketplace Web Service, set the type property of the dataset to
AmazonMWSObject. The following properties are supported:

PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset must Yes


be set to: AmazonMWSObject

tableName Name of the table. No (if "query" in activity source is


specified)

Example

{
"name": "AmazonMWSDataset",
"properties": {
"type": "AmazonMWSObject",
"linkedServiceName": {
"referenceName": "<AmazonMWS linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Amazon Marketplace Web Service source.
Amazon MWS as source
To copy data from Amazon Marketplace Web Service, set the source type in the copy activity to
AmazonMWSSource. The following properties are supported in the copy activity source section:

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes


source must be set to:
AmazonMWSSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM Orders where
Amazon_Order_Id = 'xx'"
.

Example:

"activities":[
{
"name": "CopyFromAmazonMWS",
"type": "Copy",
"inputs": [
{
"referenceName": "<AmazonMWS input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AmazonMWSSource",
"query": "SELECT * FROM Orders where Amazon_Order_Id = 'xx'"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Amazon Redshift using Azure Data
Factory
3/14/2019 • 5 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Amazon Redshift. It
builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from Amazon Redshift to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Amazon Redshift connector supports retrieving data from Redshift using query or built-in
Redshift UNLOAD support.

TIP
To achieve the best performance when copying large amounts of data from Redshift, consider using the built-in Redshift
UNLOAD through Amazon S3. See Use UNLOAD to copy data from Amazon Redshift section for details.

Prerequisites
If you are copying data to an on-premises data store using a Self-hosted Integration Runtime, grant the Integration
Runtime (use the IP address of the machine) access to the Amazon Redshift cluster. See Authorize access to the
cluster for instructions.
If you are copying data to an Azure data store, see Azure Data Center IP Ranges for the Compute IP address
and SQL ranges used by the Azure data centers.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Amazon Redshift connector.

Linked service properties


The following properties are supported for Amazon Redshift linked service:
PROPERTY DESCRIPTION REQUIRED

type The type property must be set to: Yes


AmazonRedshift

server IP address or host name of the Amazon Yes


Redshift server.

port The number of the TCP port that the No, default is 5439
Amazon Redshift server uses to listen
for client connections.

database Name of the Amazon Redshift Yes


database.

username Name of user who has access to the Yes


database.

password Password for the user account. Mark Yes


this field as a SecureString to store it
securely in Data Factory, or reference a
secret stored in Azure Key Vault.

connectVia The Integration Runtime to be used to No


connect to the data store. You can use
Azure Integration Runtime or Self-
hosted Integration Runtime (if your
data store is located in private
network). If not specified, it uses the
default Azure Integration Runtime.

Example:

{
"name": "AmazonRedshiftLinkedService",
"properties":
{
"type": "AmazonRedshift",
"typeProperties":
{
"server": "<server name>",
"database": "<database name>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
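As the table above notes, the password can also reference a secret stored in Azure Key Vault. Here is an illustrative sketch of that variant, following the same Azure Key Vault reference pattern used elsewhere in this documentation (names are placeholders):

{
    "name": "AmazonRedshiftLinkedService",
    "properties": {
        "type": "AmazonRedshift",
        "typeProperties": {
            "server": "<server name>",
            "database": "<database name>",
            "username": "<username>",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secret name>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}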

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Amazon Redshift dataset.
To copy data from Amazon Redshift, set the type property of the dataset to RelationalTable. The following
properties are supported:

PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset must Yes


be set to: RelationalTable

tableName Name of the table in the Amazon No (if "query" in activity source is
Redshift. specified)

Example

{
"name": "AmazonRedshiftDataset",
"properties":
{
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<Amazon Redshift linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Amazon Redshift source.
Amazon Redshift as source
To copy data from Amazon Redshift, set the source type in the copy activity to AmazonRedshiftSource. The
following properties are supported in the copy activity source section:

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes


source must be set to:
AmazonRedshiftSource

query Use the custom query to read data. For No (if "tableName" in dataset is
example: select * from MyTable. specified)

redshiftUnloadSettings Property group when using Amazon No


Redshift UNLOAD.

s3LinkedServiceName Refers to an Amazon S3 to-be-used as Yes if using UNLOAD


an interim store by specifying a linked
service name of "AmazonS3" type.

bucketName Indicate the S3 bucket to store the Yes if using UNLOAD


interim data. If not provided, Data
Factory service generates it
automatically.

Example: Amazon Redshift source in copy activity using UNLOAD


"source": {
"type": "AmazonRedshiftSource",
"query": "<SQL query>",
"redshiftUnloadSettings": {
"s3LinkedServiceName": {
"referenceName": "<Amazon S3 linked service>",
"type": "LinkedServiceReference"
},
"bucketName": "bucketForUnload"
}
}

Learn more about how to use UNLOAD to copy data from Amazon Redshift efficiently in the next section.

Use UNLOAD to copy data from Amazon Redshift


UNLOAD is a mechanism provided by Amazon Redshift that can unload the results of a query to one or more
files on Amazon Simple Storage Service (Amazon S3). It is the way recommended by Amazon for copying large
data sets from Redshift.
Example: copy data from Amazon Redshift to Azure SQL Data Warehouse using UNLOAD, staged copy
and PolyBase
For this sample use case, the copy activity unloads data from Amazon Redshift to Amazon S3 as configured in
"redshiftUnloadSettings", then copies the data from Amazon S3 to Azure Blob storage as specified in "stagingSettings",
and finally uses PolyBase to load the data into SQL Data Warehouse. All of the interim formats are handled properly by
the copy activity.
"activities":[
{
"name": "CopyFromAmazonRedshiftToSQLDW",
"type": "Copy",
"inputs": [
{
"referenceName": "AmazonRedshiftDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSQLDWDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AmazonRedshiftSource",
"query": "select * from MyTable",
"redshiftUnloadSettings": {
"s3LinkedServiceName": {
"referenceName": "AmazonS3LinkedService",
"type": "LinkedServiceReference"
},
"bucketName": "bucketForUnload"
}
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": "AzureStorageLinkedService",
"path": "adfstagingcopydata"
},
"dataIntegrationUnits": 32
}
}
]

Data type mapping for Amazon Redshift


When copying data from Amazon Redshift, the following mappings are used from Amazon Redshift data types to
Azure Data Factory interim data types. See Schema and data type mappings to learn about how copy activity
maps the source schema and data type to the sink.

AMAZON REDSHIFT DATA TYPE DATA FACTORY INTERIM DATA TYPE

BIGINT Int64

BOOLEAN String

CHAR String

DATE DateTime

DECIMAL Decimal

DOUBLE PRECISION Double

INTEGER Int32

REAL Single

SMALLINT Int16

TEXT String

TIMESTAMP DateTime

VARCHAR String

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Amazon Simple Storage Service
using Azure Data Factory
5/10/2019 • 11 minutes to read

This article outlines how to copy data from Amazon Simple Storage Service (Amazon S3). To learn about Azure
Data Factory, read the introductory article.

Supported capabilities
This Amazon S3 connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Specifically, this Amazon S3 connector supports copying files as-is or parsing files with the supported file
formats and compression codecs. It uses AWS Signature Version 4 to authenticate requests to S3.

TIP
You can use this Amazon S3 connector to copy data from any S3-compatible storage provider, e.g. Google Cloud
Storage. Specify the corresponding service URL in the linked service configuration.

Required permissions
To copy data from Amazon S3, make sure you have been granted the following permissions:
For copy activity execution: s3:GetObject and s3:GetObjectVersion for Amazon S3 Object Operations.
For Data Factory GUI authoring: s3:ListAllMyBuckets and s3:ListBucket / s3:GetBucketLocation for
Amazon S3 Bucket Operations are additionally required for operations like test connection and
browsing/navigating file paths. If you don't want to grant these permissions, skip the test connection in the linked
service creation page and specify the path directly in the dataset settings (a sample policy sketch follows below).
For details about the full list of Amazon S3 permissions, see Specifying Permissions in a Policy.
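For illustration only, an IAM policy granting the permissions listed above might look like the following sketch (the bucket name is a placeholder; adjust the resources to your own scope):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
            "Resource": "arn:aws:s3:::<bucket name>/*"
        },
        {
            "Effect": "Allow",
            "Action": [ "s3:ListBucket", "s3:GetBucketLocation" ],
            "Resource": "arn:aws:s3:::<bucket name>"
        },
        {
            "Effect": "Allow",
            "Action": [ "s3:ListAllMyBuckets" ],
            "Resource": "*"
        }
    ]
}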

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Amazon S3.

Linked service properties


The following properties are supported for Amazon S3 linked service:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to Yes


AmazonS3.

accessKeyId ID of the secret access key. Yes

secretAccessKey The secret access key itself. Mark this Yes


field as a SecureString to store it
securely in Data Factory, or reference a
secret stored in Azure Key Vault.

serviceUrl Specify the custom S3 endpoint if you No


are copying data from a S3-compatible
storage provider other than the official
Amazon S3 service. For example, to
copy data from Google Cloud Storage,
specify
https://storage.googleapis.com .

connectVia The Integration Runtime to be used to No


connect to the data store. You can use
Azure Integration Runtime or Self-
hosted Integration Runtime (if your
data store is located in private
network). If not specified, it uses the
default Azure Integration Runtime.

TIP
Specify the custom S3 service URL if you are copying data from an S3-compatible storage provider other than the official
Amazon S3 service.

NOTE
This connector requires access keys for an IAM account to copy data from Amazon S3. Temporary security credentials are
not supported.

Here is an example:
{
"name": "AmazonS3LinkedService",
"properties": {
"type": "AmazonS3",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": {
"type": "SecureString",
"value": "<secret access key>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
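Because the same linked service type accepts a custom serviceUrl, copying from an S3-compatible provider is a matter of adding that property. The following is an illustrative sketch for Google Cloud Storage using its S3-compatible (HMAC) credentials; the name and values are placeholders:

{
    "name": "S3CompatibleLinkedService",
    "properties": {
        "type": "AmazonS3",
        "typeProperties": {
            "serviceUrl": "https://storage.googleapis.com",
            "accessKeyId": "<HMAC access key id>",
            "secretAccessKey": {
                "type": "SecureString",
                "value": "<HMAC secret access key>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}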

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data from Amazon S3 in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based dataset and supported settings. The following properties are supported for
Amazon S3 under location settings in format-based dataset:

PROPERTY DESCRIPTION REQUIRED

type The type property under location in Yes


dataset must be set to
AmazonS3Location.

bucketName The S3 bucket name. Yes

folderPath The path to folder under the given No


bucket. If you want to use wildcard to
filter folder, skip this setting and specify
in activity source settings.

fileName The file name under the given bucket + No


folderPath. If you want to use wildcard
to filter files, skip this setting and
specify in activity source settings.

NOTE
The AmazonS3Object type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the
Copy/Lookup/GetMetadata activities for backward compatibility, but it doesn't work with Mapping Data Flow. We suggest
that you use this new model going forward, and the ADF authoring UI has switched to generating these new types.

Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Amazon S3 linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AmazonS3Location",
"bucketName": "bucketname",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Other format dataset


To copy data from Amazon S3 in ORC/Avro/JSON/Binary format, the following properties are supported:

PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset must Yes


be set to: AmazonS3Object

bucketName The S3 bucket name. Wildcard filter is Yes for Copy/Lookup activity, No for
not supported. GetMetadata activity

key The name or wildcard filter of S3 No


object key under the specified bucket.
Applies only when "prefix" property is
not specified.

The wildcard filter is supported for both


folder part and file name part. Allowed
wildcards are: * (matches zero or
more characters) and ? (matches zero
or single character).
- Example 1:
"key":
"rootfolder/subfolder/*.csv"
- Example 2:
"key": "rootfolder/subfolder/???
20180427.txt"
See more example in Folder and file
filter examples. Use ^ to escape if
your actual folder/file name has
wildcard or this escape char inside.

prefix Prefix for the S3 object key. Objects No


whose keys start with this prefix are
selected. Applies only when "key"
property is not specified.

version The version of the S3 object, if S3 No


versioning is enabled. If not specified,
the latest version will be fetched.

modifiedDatetimeStart Files filter based on the attribute: Last No


Modified. The files will be selected if
their last modified time are within the
time range between
modifiedDatetimeStart and
modifiedDatetimeEnd . The time is
applied to UTC time zone in the format
of "2018-12-01T05:00:00Z".

Be aware the overall performance of


data movement will be impacted by
enabling this setting when you want to
do file filter from huge amounts of files.

The properties can be NULL which


mean no file attribute filter will be
applied to the dataset. When
modifiedDatetimeStart has datetime
value but modifiedDatetimeEnd is
NULL, it means the files whose last
modified attribute is greater than or
equal with the datetime value will be
selected. When modifiedDatetimeEnd
has datetime value but
modifiedDatetimeStart is NULL, it
means the files whose last modified
attribute is less than the datetime value
will be selected.

modifiedDatetimeEnd Files filter based on the attribute: Last No


Modified. The files will be selected if
their last modified time are within the
time range between
modifiedDatetimeStart and
modifiedDatetimeEnd . The time is
applied to UTC time zone in the format
of "2018-12-01T05:00:00Z".

Be aware the overall performance of


data movement will be impacted by
enabling this setting when you want to
do file filter from huge amounts of files.

The properties can be NULL which


mean no file attribute filter will be
applied to the dataset. When
modifiedDatetimeStart has datetime
value but modifiedDatetimeEnd is
NULL, it means the files whose last
modified attribute is greater than or
equal with the datetime value will be
selected. When modifiedDatetimeEnd
has datetime value but
modifiedDatetimeStart is NULL, it
means the files whose last modified
attribute is less than the datetime value
will be selected.

format If you want to copy files as-is No (only for binary copy scenario)
between file-based stores (binary copy),
skip the format section in both input
and output dataset definitions.

If you want to parse or generate files


with a specific format, the following file
format types are supported:
TextFormat, JsonFormat,
AvroFormat, OrcFormat,
ParquetFormat. Set the type
property under format to one of these
values. For more information, see Text
Format, Json Format, Avro Format, Orc
Format, and Parquet Format sections.

compression Specify the type and level of No


compression for the data. For more
information, see Supported file formats
and compression codecs.
Supported types are: GZip, Deflate,
BZip2, and ZipDeflate.
Supported levels are: Optimal and
Fastest.
TIP
To copy all files under a folder, specify bucketName for bucket and prefix for folder part.
To copy a single file with a given name, specify bucketName for bucket and key for folder part plus file name.
To copy a subset of files under a folder, specify bucketName for bucket and key for folder part plus wildcard filter.

Example: using prefix

{
"name": "AmazonS3Dataset",
"properties": {
"type": "AmazonS3Object",
"linkedServiceName": {
"referenceName": "<Amazon S3 linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"bucketName": "testbucket",
"prefix": "testFolder/test",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

Example: using key and version (optional)

{
"name": "AmazonS3Dataset",
"properties": {
"type": "AmazonS3",
"linkedServiceName": {
"referenceName": "<Amazon S3 linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"bucketName": "testbucket",
"key": "testFolder/testfile.csv.gz",
"version": "XXXXXXXXXczm0CJajYkHf0_k6LhBmkcL",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
Copy activity properties
For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Amazon S3 source.
Amazon S3 as source
For copy from Parquet and delimited text format, refer to Parquet and delimited text format source
section.
For copy from other formats like ORC/Avro/JSON/Binary format, refer to Other format source section.
Parquet and delimited text format source
To copy data from Amazon S3 in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based copy activity source and supported settings. The following properties are
supported for Amazon S3 under storeSettings settings in format-based copy source:

PROPERTY DESCRIPTION REQUIRED

type The type property under Yes


storeSettings must be set to
AmazonS3ReadSetting.

recursive Indicates whether the data is read No


recursively from the subfolders or only
from the specified folder. Note that
when recursive is set to true and the
sink is a file-based store, an empty
folder or subfolder isn't copied or
created at the sink. Allowed values are
true (default) and false.

prefix Prefix for the S3 object key under the No


given bucket configured in dataset to
filter source objects. Objects whose
keys start with this prefix are selected.
Applies only when
wildcardFolderPath and
wildcardFileName properties are not
specified.

wildcardFolderPath The folder path with wildcard No


characters under the given bucket
configured in dataset to filter source
folders.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character); use
^ to escape if your actual folder name
has wildcard or this escape char inside.
See more examples in Folder and file
filter examples.

wildcardFileName The file name with wildcard characters Yes if fileName in dataset and
under the given bucket + prefix are not specified
folderPath/wildcardFolderPath to filter
source files.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character); use
^ to escape if your actual folder name
has wildcard or this escape char inside.
See more examples in Folder and file
filter examples.

modifiedDatetimeStart Files filter based on the attribute: Last No


Modified. The files will be selected if
their last modified time are within the
time range between
modifiedDatetimeStart and
modifiedDatetimeEnd . The time is
applied to UTC time zone in the format
of "2018-12-01T05:00:00Z".
The properties can be NULL which
mean no file attribute filter will be
applied to the dataset. When
modifiedDatetimeStart has datetime
value but modifiedDatetimeEnd is
NULL, it means the files whose last
modified attribute is greater than or
equal with the datetime value will be
selected. When modifiedDatetimeEnd
has datetime value but
modifiedDatetimeStart is NULL, it
means the files whose last modified
attribute is less than the datetime value
will be selected.

modifiedDatetimeEnd Same as above. No

maxConcurrentConnections The number of the connections to No


connect to storage store concurrently.
Specify only when you want to limit the
concurrent connection to the data
store.

NOTE
For Parquet/delimited text format, the FileSystemSource type copy activity source mentioned in the next section is still
supported as-is for backward compatibility. We suggest that you use this new model going forward, and the ADF authoring
UI has switched to generating these new types.

Example:
"activities":[
{
"name": "CopyFromAmazonS3",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "AmazonS3ReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Other format source


To copy data from Amazon S3 in ORC/Avro/JSON/Binary format, the following properties are supported in
the copy activity source section:

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes


source must be set to:
FileSystemSource

recursive Indicates whether the data is read No


recursively from the sub folders or only
from the specified folder. Note when
recursive is set to true and sink is file-
based store, empty folder/sub-folder
will not be copied/created at sink.
Allowed values are: true (default), false

maxConcurrentConnections The number of the connections to No


connect to the data store concurrently.
Specify only when you want to limit the
concurrent connection to the data
store.

Example:
"activities":[
{
"name": "CopyFromAmazonS3",
"type": "Copy",
"inputs": [
{
"referenceName": "<Amazon S3 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

BUCKET    KEY    RECURSIVE    SOURCE FOLDER STRUCTURE AND FILTER RESULT (FILES IN BOLD ARE RETRIEVED)

bucket Folder*/* false bucket


FolderA
File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

bucket Folder*/* true bucket


FolderA
File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

bucket Folder*/*.csv false bucket


FolderA
File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

bucket Folder*/*.csv true bucket


FolderA
File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

Next steps
For a list of data stores that are supported as sources and sinks by the copy activity in Azure Data Factory, see
supported data stores.
Copy data to or from Azure Blob storage by using
Azure Data Factory
5/6/2019 • 22 minutes to read

This article outlines how to copy data to and from Azure Blob storage. To learn about Azure Data Factory, read
the introductory article.

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which
will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.

Supported capabilities
This Azure Blob connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
GetMetadata activity
Specifically, this Blob storage connector supports:
Copying blobs to and from general-purpose Azure storage accounts and hot/cool blob storage.
Copying blobs by using account key, service shared access signature, service principal or managed identities
for Azure resources authentications.
Copying blobs from block, append, or page blobs and copying data to only block blobs.
Copying blobs as is or parsing or generating blobs with supported file formats and compression codecs.

NOTE
If you enable the "Allow trusted Microsoft services to access this storage account" option on Azure Storage firewall
settings, using Azure Integration Runtime to connect to Blob storage will fail with a forbidden error, as ADF is not treated
as a trusted Microsoft service. Please connect via a Self-hosted Integration Runtime instead.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Blob storage.

Linked service properties


The Azure Blob connector supports the following authentication types; refer to the corresponding section for details:
Account key authentication
Shared access signature authentication
Service principal authentication
Managed identities for Azure resources authentication

NOTE
HDInsight, Azure Machine Learning, and Azure SQL Data Warehouse PolyBase load only support Azure Blob storage
account key authentication.

Account key authentication


To use storage account key authentication, the following properties are supported:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to Yes


AzureBlobStorage (suggested) or
AzureStorage (see notes below).

connectionString Specify the information needed to Yes


connect to Storage for the
connectionString property.
Mark this field as a SecureString to
store it securely in Data Factory. You
can also put account key in Azure Key
Vault and pull the accountKey
configuration out of the connection
string. Refer to the following samples
and Store credentials in Azure Key
Vault article with more details.

connectVia The integration runtime to be used to No


connect to the data store. You can use
Azure Integration Runtime or Self-
hosted Integration Runtime (if your
data store is in a private network). If
not specified, it uses the default Azure
Integration Runtime.

NOTE
If you were using the "AzureStorage" type linked service, it is still supported as-is, but we suggest that you use the new
"AzureBlobStorage" linked service type going forward.

Example:
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store account key in Azure Key Vault

{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;"
},
"accountKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Shared access signature authentication


A shared access signature provides delegated access to resources in your storage account. You can use a shared
access signature to grant a client limited permissions to objects in your storage account for a specified time. You
don't have to share your account access keys. The shared access signature is a URI that encompasses in its query
parameters all the information necessary for authenticated access to a storage resource. To access storage
resources with the shared access signature, the client only needs to pass in the shared access signature to the
appropriate constructor or method. For more information about shared access signatures, see Shared access
signatures: Understand the shared access signature model.

NOTE
Data Factory now supports both service shared access signatures and account shared access signatures. For more
information about these two types and how to construct them, see Types of shared access signatures.
In later dataset configuration, the folder path is the absolute path starting from container level. You need to configure
one aligned with the path in your SAS URI.
TIP
To generate a service shared access signature for your storage account, you can execute the following PowerShell
commands. Replace the placeholders and grant the needed permission.
$context = New-AzStorageContext -StorageAccountName <accountName> -StorageAccountKey <accountKey>
New-AzStorageContainerSASToken -Name <containerName> -Context $context -Permission rwdl -StartTime
<startTime> -ExpiryTime <endTime> -FullUri

To use shared access signature authentication, the following properties are supported:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to Yes


AzureBlobStorage (suggested) or
AzureStorage (see notes below).

sasUri Specify the shared access signature URI Yes


to the Storage resources such as
blob/container.
Mark this field as a SecureString to
store it securely in Data Factory. You
can also put SAS token in Azure Key
Vault to leverage auto rotation and
remove the token portion. Refer to the
following samples and Store credentials
in Azure Key Vault article with more
details.

connectVia The integration runtime to be used to No


connect to the data store. You can use
the Azure Integration Runtime or the
Self-hosted Integration Runtime (if
your data store is located in a private
network). If not specified, it uses the
default Azure Integration Runtime.

NOTE
If you were using the "AzureStorage" type linked service, it is still supported as-is, but we suggest that you use the new
"AzureBlobStorage" linked service type going forward.

Example:
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the Azure Storage resource e.g.
https://<container>.blob.core.windows.net/?sv=<storage version>&amp;st=<start time>&amp;se=<expire
time>&amp;sr=<resource>&amp;sp=<permissions>&amp;sip=<ip range>&amp;spr=<protocol>&amp;sig=<signature>>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store the SAS token in Azure Key Vault

{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the Azure Storage resource without token e.g.
https://<container>.blob.core.windows.net/>"
},
"sasToken": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

When you create a shared access signature URI, consider the following points:
Set appropriate read/write permissions on objects based on how the linked service (read, write, read/write) is
used in your data factory.
Set Expiry time appropriately. Make sure that the access to Storage objects doesn't expire within the active
period of the pipeline.
The URI should be created at the right container/blob based on the need. A shared access signature URI to a
blob allows Data Factory to access that particular blob. A shared access signature URI to a Blob storage
container allows Data Factory to iterate through blobs in that container. To provide access to more or fewer
objects later, or to update the shared access signature URI, remember to update the linked service with the
new URI.
Service principal authentication
For Azure Storage service principal authentication in general, refer to Authenticate access to Azure Storage using
Azure Active Directory.
To use service principal authentication, follow these steps:
1. Register an application entity in Azure Active Directory (Azure AD) by following Register your application
with an Azure AD tenant. Make note of the following values, which you use to define the linked service:
Application ID
Application key
Tenant ID
2. Grant the service principal proper permission in Azure Blob storage. Refer to Manage access rights to
Azure Storage data with RBAC with more details on the roles.
As source, in Access control (IAM ), grant at least Storage Blob Data Reader role.
As sink, in Access control (IAM ), grant at least Storage Blob Data Contributor role.
These properties are supported for an Azure Blob storage linked service:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to Yes


AzureBlobStorage.

serviceEndpoint Specify the Azure Blob storage service Yes


endpoint with the pattern of
https://<accountName>.blob.core.windows.net/
.

servicePrincipalId Specify the application's client ID. Yes

servicePrincipalKey Specify the application's key. Mark this Yes


field as a SecureString to store it
securely in Data Factory, or reference a
secret stored in Azure Key Vault.

tenant Specify the tenant information (domain Yes


name or tenant ID) under which your
application resides. Retrieve it by
hovering the mouse in the top-right
corner of the Azure portal.

connectVia The integration runtime to be used to No


connect to the data store. You can use
Azure Integration Runtime or Self-
hosted Integration Runtime (if your
data store is in a private network). If
not specified, it uses the default Azure
Integration Runtime.

NOTE
Service principal authentication is only supported by "AzureBlobStorage" type linked service but not previous
"AzureStorage" type linked service.

Example:
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"serviceEndpoint": "https://<accountName>.blob.core.windows.net/",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Managed identities for Azure resources authentication


A data factory can be associated with a managed identity for Azure resources, which represents this specific data
factory. You can directly use this managed identity for Blob storage authentication similar to using your own
service principal. It allows this designated factory to access and copy data from/to your Blob storage.
Refer to Authenticate access to Azure Storage using Azure Active Directory for Azure Storage authentication in
general. To use managed identities for Azure resources authentication, follow these steps:
1. Retrieve data factory managed identity information by copying the value of "SERVICE IDENTITY
APPLICATION ID" generated along with your factory.
2. Grant the managed identity proper permission in Azure Blob storage. Refer to Manage access rights to
Azure Storage data with RBAC with more details on the roles.
As source, in Access control (IAM ), grant at least Storage Blob Data Reader role.
As sink, in Access control (IAM ), grant at least Storage Blob Data Contributor role.
These properties are supported for an Azure Blob storage linked service:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to Yes


AzureBlobStorage.

serviceEndpoint Specify the Azure Blob storage service Yes


endpoint with the pattern of
https://<accountName>.blob.core.windows.net/
.

connectVia The integration runtime to be used to No


connect to the data store. You can use
Azure Integration Runtime or Self-
hosted Integration Runtime (if your
data store is in a private network). If
not specified, it uses the default Azure
Integration Runtime.
NOTE
Managed identities for Azure resources authentication is only supported by "AzureBlobStorage" type linked service but not
previous "AzureStorage" type linked service.

Example:

{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"serviceEndpoint": "https://<accountName>.blob.core.windows.net/"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data to and from Blob storage in Parquet or delimited text format, refer to Parquet format and Delimited
text format article on format-based dataset and supported settings. The following properties are supported for
Azure Blob under location settings in format-based dataset:

PROPERTY DESCRIPTION REQUIRED

type The type property of the location in Yes


dataset must be set to
AzureBlobStorageLocation.

container The blob container. Yes

folderPath The path to folder under the given No


container. If you want to use wildcard
to filter folder, skip this setting and
specify in activity source settings.

fileName The file name under the given No


container + folderPath. If you want to
use wildcard to filter files, skip this
setting and specify in activity source
settings.
NOTE
The AzureBlob type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the
Copy/Lookup/GetMetadata activities for backward compatibility, but it doesn't work with Mapping Data Flow. We suggest
that you use this new model going forward, and the ADF authoring UI has switched to generating these new types.

Example:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Other format dataset


To copy data to and from Blob storage in ORC/Avro/JSON/Binary format, set the type property of the dataset to
AzureBlob. The following properties are supported.

PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset must Yes


be set to AzureBlob.

folderPath Path to the container and folder in the Yes for Copy/Lookup activity, No for
blob storage. GetMetadata activity

Wildcard filter is supported for the


path excluding container name.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character); use
^ to escape if your actual folder name
has wildcard or this escape char inside.

Examples:
myblobcontainer/myblobfolder/, see
more examples in Folder and file filter
examples.

fileName Name or wildcard filter for the No


blob(s) under the specified "folderPath".
If you don't specify a value for this
property, the dataset points to all
blobs in the folder.

For filter, allowed wildcards are: *


(matches zero or more characters) and
? (matches zero or single character).
- Example 1: "fileName": "*.csv"
- Example 2:
"fileName": "???20180427.txt"
Use ^ to escape if your actual file
name has wildcard or this escape char
inside.

When fileName isn't specified for an


output dataset and
preserveHierarchy isn't specified in
the activity sink, the copy activity
automatically generates the blob name
with the following pattern: "Data.
[activity run ID GUID].[GUID if
FlattenHierarchy].[format if
configured].[compression if
configured]", e.g. "Data.0a405f8a-93ff-
4c6f-b3be-f69616f1df7a.txt.gz"; if you
copy from tabular source using table
name instead of query, the name
pattern is "[table name].[format ].
[compression if configured]", e.g.
"MyTable.csv".

modifiedDatetimeStart Files filter based on the attribute: Last No


Modified. The files will be selected if
their last modified time are within the
time range between
modifiedDatetimeStart and
modifiedDatetimeEnd . The time is
applied to UTC time zone in the format
of "2018-12-01T05:00:00Z".

Be aware the overall performance of


data movement will be impacted by
enabling this setting when you want to
do file filter from huge amounts of files.

The properties can be NULL that mean


no file attribute filter will be applied to
the dataset. When
modifiedDatetimeStart has
datetime value but
modifiedDatetimeEnd is NULL, it
means the files whose last modified
attribute is greater than or equal with
the datetime value will be selected.
When modifiedDatetimeEnd has
datetime value but
modifiedDatetimeStart is NULL, it
means the files whose last modified
attribute is less than the datetime
value will be selected.

modifiedDatetimeEnd Files filter based on the attribute: Last No


Modified. The files will be selected if
their last modified time are within the
time range between
modifiedDatetimeStart and
modifiedDatetimeEnd . The time is
applied to UTC time zone in the format
of "2018-12-01T05:00:00Z".

Be aware the overall performance of


data movement will be impacted by
enabling this setting when you want to
do file filter from huge amounts of files.

The properties can be NULL that mean


no file attribute filter will be applied to
the dataset. When
modifiedDatetimeStart has
datetime value but
modifiedDatetimeEnd is NULL, it
means the files whose last modified
attribute is greater than or equal with
the datetime value will be selected.
When modifiedDatetimeEnd has
datetime value but
modifiedDatetimeStart is NULL, it
means the files whose last modified
attribute is less than the datetime
value will be selected.

format If you want to copy files as is between No (only for binary copy scenario)
file-based stores (binary copy), skip the
format section in both the input and
output dataset definitions.

If you want to parse or generate files


with a specific format, the following file
format types are supported:
TextFormat, JsonFormat,
AvroFormat, OrcFormat, and
ParquetFormat. Set the type
property under format to one of these
values. For more information, see the
Text format, JSON format, Avro format,
Orc format, and Parquet format
sections.

compression Specify the type and level of No


compression for the data. For more
information, see Supported file formats
and compression codecs.
Supported types are GZip, Deflate,
BZip2, and ZipDeflate.
Supported levels are Optimal and
Fastest.

TIP
To copy all blobs under a folder, specify folderPath only.
To copy a single blob with a given name, specify folderPath with folder part and fileName with file name.
To copy a subset of blobs under a folder, specify folderPath with folder part and fileName with wildcard filter.

Example:
{
"name": "AzureBlobDataset",
"properties": {
"type": "AzureBlob",
"linkedServiceName": {
"referenceName": "<Azure Blob storage linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Blob storage source and sink.
Blob storage as a source type
For copy from Parquet and delimited text format, refer to Parquet and delimited text format source
section.
For copy from other formats like ORC/Avro/JSON/Binary format, refer to Other format source section.
Parquet and delimited text format source
To copy data from Blob storage in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based copy activity source and supported settings. The following properties are
supported for Azure Blob under storeSettings settings in format-based copy source:

type: The type property under storeSettings must be set to AzureBlobStorageReadSetting. Required: Yes.

recursive: Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false. Required: No.

wildcardFolderPath: The folder path with wildcard characters under the given container configured in the dataset, used to filter source folders. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has a wildcard or this escape character inside. See more examples in Folder and file filter examples. Required: No.

wildcardFileName: The file name with wildcard characters under the given container + folderPath/wildcardFolderPath, used to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual file name has a wildcard or this escape character inside. See more examples in Folder and file filter examples. Required: Yes if fileName isn't specified in the dataset.

modifiedDatetimeStart: Files are filtered based on the attribute Last Modified. Files are selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z". The properties can be NULL, which means that no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value are selected. Required: No.

modifiedDatetimeEnd: Same as above. Required: No.

maxConcurrentConnections: The number of connections that can connect to the storage store concurrently. Specify a value only when you want to limit concurrent connections to the data store. Required: No.
NOTE
For Parquet/delimited text format, the BlobSource type copy activity source mentioned in the next section is still supported as-is for backward compatibility. We recommend that you use this new model going forward; the ADF authoring UI has switched to generating these new types.

Example:

"activities":[
{
"name": "CopyFromBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "AzureBlobStorageReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Other format source


To copy data from Blob storage in ORC/Avro/JSON/Binary format, set the source type in the copy activity to
BlobSource. The following properties are supported in the copy activity source section.

type: The type property of the copy activity source must be set to BlobSource. Required: Yes.

recursive: Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false. Required: No.

maxConcurrentConnections: The number of connections that can connect to the storage store concurrently. Specify a value only when you want to limit concurrent connections to the data store. Required: No.

Example:

"activities":[
{
"name": "CopyFromBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure Blob input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

Blob storage as a sink type


For copy to Parquet and delimited text format, refer to Parquet and delimited text format sink section.
For copy to other formats like ORC/Avro/JSON/Binary format, refer to Other format sink section.
Parquet and delimited text format sink
To copy data to Blob storage in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based copy activity sink and supported settings. The following properties are supported
for Azure Blob under storeSettings settings in format-based copy sink:

type: The type property under storeSettings must be set to AzureBlobStorageWriteSetting. Required: Yes.

copyBehavior: Defines the copy behavior when the source is files from a file-based data store. Allowed values are:
- PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
- FlattenHierarchy: All files from the source folder are placed in the first level of the target folder. The target files have autogenerated names.
- MergeFiles: Merges all files from the source folder into one file. If the file or blob name is specified, the merged file name is the specified name. Otherwise, it's an autogenerated file name.
Required: No.

maxConcurrentConnections: The number of connections that can connect to the storage store concurrently. Specify a value only when you want to limit concurrent connections to the data store. Required: No.

NOTE
For Parquet/delimited text format, the BlobSink type copy activity sink mentioned in the next section is still supported as-is for backward compatibility. We recommend that you use this new model going forward; the ADF authoring UI has switched to generating these new types.

Example:
"activities":[
{
"name": "CopyFromBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "AzureBlobStorageWriteSetting",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]

Other format sink


To copy data to Blob storage, set the sink type in the copy activity to BlobSink. The following properties are
supported in the sink section.

type: The type property of the copy activity sink must be set to BlobSink. Required: Yes.

copyBehavior: Defines the copy behavior when the source is files from a file-based data store. Allowed values are:
- PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
- FlattenHierarchy: All files from the source folder are placed in the first level of the target folder. The target files have autogenerated names.
- MergeFiles: Merges all files from the source folder into one file. If the file or blob name is specified, the merged file name is the specified name. Otherwise, it's an autogenerated file name.
Required: No.

maxConcurrentConnections: The number of connections that can connect to the storage store concurrently. Specify a value only when you want to limit concurrent connections to the data store. Required: No.

Example:

"activities":[
{
"name": "CopyToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Blob output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "BlobSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]

Folder and file filter examples

This section describes the resulting behavior of the folder path and file name with wildcard filters. In each example, the source folder structure is:

container
    FolderA
        File1.csv
        File2.json
        Subfolder1
            File3.csv
            File4.json
            File5.csv
    AnotherFolderB
        File6.csv

folderPath: container/Folder*, fileName: (empty, use default), recursive: false
Retrieved files: FolderA/File1.csv and FolderA/File2.json.

folderPath: container/Folder*, fileName: (empty, use default), recursive: true
Retrieved files: FolderA/File1.csv, FolderA/File2.json, FolderA/Subfolder1/File3.csv, FolderA/Subfolder1/File4.json, and FolderA/Subfolder1/File5.csv.

folderPath: container/Folder*, fileName: *.csv, recursive: false
Retrieved files: FolderA/File1.csv.

folderPath: container/Folder*, fileName: *.csv, recursive: true
Retrieved files: FolderA/File1.csv, FolderA/Subfolder1/File3.csv, and FolderA/Subfolder1/File5.csv.

Some recursive and copyBehavior examples

This section describes the resulting behavior of the Copy operation for different combinations of recursive and copyBehavior values. In each example, the source folder structure is:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

recursive: true, copyBehavior: preserveHierarchy
The target folder Folder1 is created with the same structure as the source:
Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

recursive: true, copyBehavior: flattenHierarchy
The target Folder1 is created with the following structure:
Folder1
    autogenerated name for File1
    autogenerated name for File2
    autogenerated name for File3
    autogenerated name for File4
    autogenerated name for File5

recursive: true, copyBehavior: mergeFiles
The target Folder1 is created with the following structure:
Folder1
    File1 + File2 + File3 + File4 + File5 contents are merged into one file with an autogenerated file name.

recursive: false, copyBehavior: preserveHierarchy
The target folder Folder1 is created with the following structure:
Folder1
    File1
    File2
Subfolder1 with File3, File4, and File5 is not picked up.

recursive: false, copyBehavior: flattenHierarchy
The target folder Folder1 is created with the following structure:
Folder1
    autogenerated name for File1
    autogenerated name for File2
Subfolder1 with File3, File4, and File5 is not picked up.

recursive: false, copyBehavior: mergeFiles
The target folder Folder1 is created with the following structure:
Folder1
    File1 + File2 contents are merged into one file with an autogenerated file name.
Subfolder1 with File3, File4, and File5 is not picked up.
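For instance, the last combination above (recursive set to false with the mergeFiles behavior) could be expressed with a copy activity such as the following minimal sketch. The activity and dataset names are placeholders, and only properties documented in the source and sink tables above are used; adjust it to your own pipeline before relying on it.

"activities":[
    {
        "name": "CopyAndMergeTopLevelBlobs",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Azure Blob input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<Azure Blob output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "BlobSource",
                "recursive": false
            },
            "sink": {
                "type": "BlobSink",
                "copyBehavior": "MergeFiles"
            }
        }
    }
]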

Mapping Data Flow properties


Learn details from source transformation and sink transformation in Mapping Data Flow.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data to or from Azure Cosmos DB (SQL API)
by using Azure Data Factory
2/6/2019

This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure Cosmos DB
(SQL API). The article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy
Activity.

NOTE
This connector only supports copying data to/from the Cosmos DB SQL API. For the MongoDB API, refer to the connector for Azure Cosmos DB's API for MongoDB. Other API types are not supported now.

Supported capabilities
You can copy data from Azure Cosmos DB (SQL API) to any supported sink data store, or copy data from any
supported source data store to Azure Cosmos DB (SQL API). For a list of data stores that Copy Activity supports
as sources and sinks, see Supported data stores and formats.
You can use the Azure Cosmos DB (SQL API) connector to:
Copy data from and to the Azure Cosmos DB SQL API.
Write to Azure Cosmos DB as insert or upsert.
Import and export JSON documents as-is, or copy data from or to a tabular dataset. Examples include a SQL
database and a CSV file. To copy documents as-is to or from JSON files or to or from another Azure Cosmos
DB collection, see Import or export JSON documents.
Data Factory integrates with the Azure Cosmos DB bulk executor library to provide the best performance when
you write to Azure Cosmos DB.

TIP
The Data Migration video walks you through the steps of copying data from Azure Blob storage to Azure Cosmos DB. The
video also describes performance-tuning considerations for ingesting data to Azure Cosmos DB in general.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to Azure Cosmos DB (SQL API).

Linked service properties


The following properties are supported for the Azure Cosmos DB (SQL API) linked service:

type: The type property must be set to CosmosDb. Required: Yes.

connectionString: Specify the information that's required to connect to the Azure Cosmos DB database. Note: You must specify database information in the connection string as shown in the examples that follow. Mark this field as a SecureString to store it securely in Data Factory. You can also put the account key in Azure Key Vault and pull the accountKey configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. Required: Yes.

connectVia: The Integration Runtime to use to connect to the data store. You can use the Azure Integration Runtime or a self-hosted integration runtime (if your data store is located in a private network). If this property isn't specified, the default Azure Integration Runtime is used. Required: No.

Example

{
"name": "CosmosDbSQLAPILinkedService",
"properties": {
"type": "CosmosDb",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store account key in Azure Key Vault


{
"name": "CosmosDbSQLAPILinkedService",
"properties": {
"type": "CosmosDb",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "AccountEndpoint=<EndpointUrl>;Database=<Database>"
},
"accountKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
This section provides a list of properties that the Azure Cosmos DB (SQL API) dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
To copy data from or to Azure Cosmos DB (SQL API), set the type property of the dataset to
DocumentDbCollection. The following properties are supported:

type: The type property of the dataset must be set to DocumentDbCollection. Required: Yes.

collectionName: The name of the Azure Cosmos DB document collection. Required: Yes.

Example

{
"name": "CosmosDbSQLAPIDataset",
"properties": {
"type": "DocumentDbCollection",
"linkedServiceName":{
"referenceName": "<Azure Cosmos DB linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"collectionName": "<collection name>"
}
}
}

Schema by Data Factory

For schema-free data stores like Azure Cosmos DB, Copy Activity infers the schema in one of the following ways. Unless you want to import or export JSON documents as-is, the best practice is to specify the structure of the data in the structure section (a sketch follows this list).
- If you specify the structure of the data by using the structure property in the dataset definition, Data Factory honors this structure as the schema. If a row doesn't contain a value for a column, a null value is provided for the column value.
- If you don't specify the structure of the data by using the structure property in the dataset definition, the Data Factory service infers the schema by using the first row of the data. If the first row doesn't contain the full schema, some columns will be missing in the result of the copy operation.
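As a minimal sketch of the first option, a dataset that declares the structure explicitly might look like the following; the column names and types here are hypothetical and should be replaced with the columns of your own documents.

{
    "name": "CosmosDbSQLAPIDataset",
    "properties": {
        "type": "DocumentDbCollection",
        "linkedServiceName": {
            "referenceName": "<Azure Cosmos DB linked service name>",
            "type": "LinkedServiceReference"
        },
        "structure": [
            { "name": "id", "type": "String" },
            { "name": "FirstName", "type": "String" },
            { "name": "EmailPromotion", "type": "Int32" }
        ],
        "typeProperties": {
            "collectionName": "<collection name>"
        }
    }
}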

Copy Activity properties


This section provides a list of properties that the Azure Cosmos DB (SQL API) source and sink support.
For a full list of sections and properties that are available for defining activities, see Pipelines.
Azure Cosmos DB (SQL API) as source
To copy data from Azure Cosmos DB (SQL API), set the source type in Copy Activity to
DocumentDbCollectionSource.
The following properties are supported in the Copy Activity source section:

type: The type property of the copy activity source must be set to DocumentDbCollectionSource. Required: Yes.

query: Specify the Azure Cosmos DB query to read data.
Example: SELECT c.BusinessEntityID, c.Name.First AS FirstName, c.Name.Middle AS MiddleName, c.Name.Last AS LastName, c.Suffix, c.EmailPromotion FROM c WHERE c.ModifiedDate > \"2009-01-01T00:00:00\"
Required: No. If not specified, this SQL statement is executed: select <columns defined in structure> from mycollection.

nestingSeparator: A special character that indicates that the document is nested and how to flatten the result set. For example, if an Azure Cosmos DB query returns the nested result "Name": {"First": "John"}, Copy Activity identifies the column name as Name.First, with the value "John", when the nestingSeparator value is . (dot). Required: No (the default is . (dot)).

Example
"activities":[
{
"name": "CopyFromCosmosDBSQLAPI",
"type": "Copy",
"inputs": [
{
"referenceName": "<Cosmos DB SQL API input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": "SELECT c.BusinessEntityID, c.Name.First AS FirstName, c.Name.Middle AS MiddleName,
c.Name.Last AS LastName, c.Suffix, c.EmailPromotion FROM c WHERE c.ModifiedDate > \"2009-01-01T00:00:00\""
},
"sink": {
"type": "<sink type>"
}
}
}
]

Azure Cosmos DB (SQL API) as sink

To copy data to Azure Cosmos DB (SQL API), set the sink type in Copy Activity to DocumentDbCollectionSink.
The following properties are supported in the Copy Activity sink section:

type: The type property of the Copy Activity sink must be set to DocumentDbCollectionSink. Required: Yes.

writeBehavior: Describes how to write data to Azure Cosmos DB. Allowed values: insert and upsert. The behavior of upsert is to replace the document if a document with the same ID already exists; otherwise, insert the document. Note: Data Factory automatically generates an ID for a document if an ID isn't specified either in the original document or by column mapping. This means that you must ensure that, for upsert to work as expected, your document has an ID. Required: No (the default is insert).

writeBatchSize: Data Factory uses the Azure Cosmos DB bulk executor library to write data to Azure Cosmos DB. The writeBatchSize property controls the size of the document batch that ADF provides to the library. You can try increasing the value for writeBatchSize to improve performance, and decreasing the value if your documents are large; see the tips below. Required: No (the default is 10,000).

nestingSeparator: A special character in the source column name that indicates that a nested document is needed. For example, Name.First in the output dataset structure generates the following JSON structure in the Azure Cosmos DB document when the nestingSeparator is . (dot): "Name": {"First": "[value maps to this column from source]"}. Required: No (the default is . (dot)).

TIP
Cosmos DB limits a single request's size to 2 MB. The formula is: Request Size = Single Document Size * Write Batch Size. If you hit an error saying "Request size is too large.", reduce the writeBatchSize value in the copy sink configuration (a sketch follows the next example).

Example

"activities":[
{
"name": "CopyToCosmosDBSQLAPI",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Document DB output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "DocumentDbCollectionSink",
"writeBehavior": "upsert"
}
}
}
]
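If you hit the request-size limit described in the preceding tip, you could, for instance, add writeBatchSize to the sink shown above. This is a sketch only; the value 1000 is illustrative, not a recommendation from this article.

"sink": {
    "type": "DocumentDbCollectionSink",
    "writeBehavior": "upsert",
    "writeBatchSize": 1000
}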
Import or export JSON documents
You can use this Azure Cosmos DB (SQL API) connector to easily:
- Import JSON documents from various sources to Azure Cosmos DB, including from Azure Blob storage, Azure Data Lake Store, and other file-based stores that Azure Data Factory supports.
- Export JSON documents from an Azure Cosmos DB collection to various file-based stores.
- Copy documents between two Azure Cosmos DB collections as-is.
To achieve schema-agnostic copy:
- When you use the Copy Data tool, select the Export as-is to JSON files or Cosmos DB collection option.
- When you use activity authoring, don't specify the structure (also called schema) section in the Azure Cosmos DB dataset. Also, don't specify the nestingSeparator property in the Azure Cosmos DB source or sink in Copy Activity. When you import from or export to JSON files, in the corresponding file store dataset, specify the format type as JsonFormat and configure the filePattern as described in the JSON format section. Then, don't specify the structure section and skip the rest of the format settings. A sketch of such a file store dataset follows this list.
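As a minimal sketch of the file store dataset used in such an as-is copy (assuming Azure Blob storage as the file store; the dataset name, path, and filePattern value are placeholders to adjust):

{
    "name": "JsonFilesDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "<Azure Blob storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "mycontainer/jsonfolder/",
            "format": {
                "type": "JsonFormat",
                "filePattern": "setOfObjects"
            }
        }
    }
}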

Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see supported
data stores.
Copy data to or from Azure Cosmos DB's API for
MongoDB by using Azure Data Factory
2/6/2019

This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure Cosmos DB's
API for MongoDB. The article builds on Copy Activity in Azure Data Factory, which presents a general overview of
Copy Activity.

NOTE
This connector only supports copying data to/from Azure Cosmos DB's API for MongoDB. For the SQL API, refer to the Cosmos DB SQL API connector. Other API types are not supported now.

Supported capabilities
You can copy data from Azure Cosmos DB's API for MongoDB to any supported sink data store, or copy data
from any supported source data store to Azure Cosmos DB's API for MongoDB. For a list of data stores that Copy
Activity supports as sources and sinks, see Supported data stores and formats.
You can use the Azure Cosmos DB's API for MongoDB connector to:
Copy data from and to the Azure Cosmos DB's API for MongoDB.
Write to Azure Cosmos DB as insert or upsert.
Import and export JSON documents as-is, or copy data from or to a tabular dataset. Examples include a SQL
database and a CSV file. To copy documents as-is to or from JSON files or to or from another Azure Cosmos
DB collection, see Import or export JSON documents.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are specific
to Azure Cosmos DB's API for MongoDB.

Linked service properties


The following properties are supported for the Azure Cosmos DB's API for MongoDB linked service:
type: The type property must be set to CosmosDbMongoDbApi. Required: Yes.

connectionString: Specify the connection string for your Azure Cosmos DB's API for MongoDB. You can find it in the Azure portal -> your Cosmos DB blade -> primary or secondary connection string, with the pattern mongodb://<cosmosdb-name>:<password>@<cosmosdb-name>.documents.azure.com:10255/?ssl=true&replicaSet=globaldb. Mark this field as a SecureString type to store it securely in Data Factory. You can also reference a secret stored in Azure Key Vault. Required: Yes.

database: Name of the database that you want to access. Required: Yes.

connectVia: The Integration Runtime to use to connect to the data store. You can use the Azure Integration Runtime or a self-hosted integration runtime (if your data store is located in a private network). If this property isn't specified, the default Azure Integration Runtime is used. Required: No.

Example

{
"name": "CosmosDbMongoDBAPILinkedService",
"properties": {
"type": "CosmosDbMongoDbApi",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "mongodb://<cosmosdb-name>:<password>@<cosmosdb-name>.documents.azure.com:10255/?
ssl=true&replicaSet=globaldb"
},
"database": "myDatabase"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following properties are supported for Azure Cosmos DB's API for MongoDB dataset:
type: The type property of the dataset must be set to CosmosDbMongoDbApiCollection. Required: Yes.

collectionName: The name of the Azure Cosmos DB collection. Required: Yes.

Example

{
"name": "CosmosDbMongoDBAPIDataset",
"properties": {
"type": "CosmosDbMongoDbApiCollection",
"linkedServiceName":{
"referenceName": "<Azure Cosmos DB's API for MongoDB linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"collectionName": "<collection name>"
}
}
}

Copy Activity properties


This section provides a list of properties that the Azure Cosmos DB's API for MongoDB source and sink support.
For a full list of sections and properties that are available for defining activities, see Pipelines.
Azure Cosmos DB's API for MongoDB as source
The following properties are supported in the Copy Activity source section:

type: The type property of the copy activity source must be set to CosmosDbMongoDbApiSource. Required: Yes.

filter: Specifies a selection filter using query operators. To return all documents in a collection, omit this parameter or pass an empty document ({}). Required: No.

cursorMethods.project: Specifies the fields to return in the documents for projection. To return all fields in the matching documents, omit this parameter. Required: No.

cursorMethods.sort: Specifies the order in which the query returns matching documents. Refer to cursor.sort(). Required: No.

cursorMethods.limit: Specifies the maximum number of documents the server returns. Refer to cursor.limit(). Required: No.

cursorMethods.skip: Specifies the number of documents to skip and from where MongoDB begins to return results. Refer to cursor.skip(). Required: No.

batchSize: Specifies the number of documents to return in each batch of the response from the MongoDB instance. In most cases, modifying the batch size won't affect the user or the application. Cosmos DB limits each batch to no more than 40 MB (the sum of the sizes of the batchSize number of documents), so decrease this value if your documents are large. Required: No (the default is 100).

TIP
ADF supports consuming BSON documents in Strict mode. Make sure your filter query is in Strict mode instead of Shell mode. More details can be found in the MongoDB manual.

Example

"activities":[
{
"name": "CopyFromCosmosDBMongoDBAPI",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure Cosmos DB's API for MongoDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "CosmosDbMongoDbApiSource",
"filter": "{datetimeData: {$gte: ISODate(\"2018-12-11T00:00:00.000Z\"),$lt: ISODate(\"2018-12-
12T00:00:00.000Z\")}, _id: ObjectId(\"5acd7c3d0000000000000000\") }",
"cursorMethods": {
"project": "{ _id : 1, name : 1, age: 1, datetimeData: 1 }",
"sort": "{ age : 1 }",
"skip": 3,
"limit": 3
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Azure Cosmos DB's API for MongoDB as sink


The following properties are supported in the Copy Activity sink section:
type: The type property of the Copy Activity sink must be set to CosmosDbMongoDbApiSink. Required: Yes.

writeBehavior: Describes how to write data to Azure Cosmos DB. Allowed values: insert and upsert. The behavior of upsert is to replace the document if a document with the same ID already exists; otherwise, insert the document. Note: Data Factory automatically generates an ID for a document if an ID isn't specified either in the original document or by column mapping. This means that you must ensure that, for upsert to work as expected, your document has an ID. Required: No (the default is insert).

writeBatchSize: The writeBatchSize property controls the size of the document batch to write in each batch. You can try increasing the value for writeBatchSize to improve performance, and decreasing the value if your documents are large. Required: No (the default is 10,000).

writeBatchTimeout: The wait time for the batch insert operation to finish before it times out. The allowed value is a timespan. Required: No (the default is 00:30:00 - 30 minutes).

Example
"activities":[
{
"name": "CopyToCosmosDBMongoDBAPI",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Document DB output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "CosmosDbMongoDbApiSink",
"writeBehavior": "upsert"
}
}
}
]

TIP
To import JSON documents as-is, refer to Import or export JSON documents section; to copy from tabular-shaped data,
refer to Schema mapping.

Import or export JSON documents


You can use this Azure Cosmos DB connector to easily:
Import JSON documents from various sources to Azure Cosmos DB, including from Azure Blob storage,
Azure Data Lake Store, and other file-based stores that Azure Data Factory supports.
Export JSON documents from an Azure Cosmos DB collection to various file-based stores.
Copy documents between two Azure Cosmos DB collections as-is.
To achieve such schema-agnostic copy, skip the "structure" (also called schema) section in the dataset and the schema mapping in the copy activity.

Schema mapping
To copy data from Azure Cosmos DB's API for MongoDB to a tabular sink, or the reverse, refer to schema mapping.
Specifically, when writing into Cosmos DB, if you want to populate Cosmos DB with the right object ID from your source data (for example, you have an "id" column in a SQL database table and want to use its value as the document ID in MongoDB for insert/upsert), you need to set the proper schema mapping according to the MongoDB strict mode definition ( _id.$oid ), as in the sketch that follows:
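A minimal sketch of such a mapping is shown below. It assumes the tabular translator's columnMappings syntax and a hypothetical source column named id; verify it against the schema mapping article for your copy activity before using it.

"translator": {
    "type": "TabularTranslator",
    "columnMappings": "id: _id.$oid"
}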
After the copy activity runs, the following BSON ObjectId is generated in the sink:

{
"_id": ObjectId("592e07800000000000000000")
}

Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see supported data
stores.
Copy data to or from Azure Data Explorer using
Azure Data Factory
4/18/2019

This article outlines how to use the Copy Activity in Azure Data Factory to copy data to or from Azure Data
Explorer. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from any supported source data store to Azure Data Explorer. You can also copy data from
Azure Data Explorer to any supported sink data store. For a list of data stores that are supported as sources or
sinks by the copy activity, see the Supported data stores table.

NOTE
Copying data to/from Azure Data Explorer from/to an on-premises data store using the self-hosted integration runtime is supported since version 3.14.

The Azure Data Explorer connector allows you to do the following:
- Copy data by using Azure Active Directory (Azure AD) application token authentication with a service principal.
- As a source, retrieve data by using a KQL (Kusto) query.
- As a sink, append data to a destination table.

Getting started
TIP
For a walkthrough of using Azure Data Explorer connector, see Copy data to/from Azure Data Explorer using Azure Data
Factory.

You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Data Explorer connector.

Linked service properties


The Azure Data Explorer connector uses service principal authentication. Follow these steps to get a service principal and grant permissions:
1. Register an application entity in Azure Active Directory (Azure AD) by following Register your application with an Azure AD tenant. Make note of the following values, which you use to define the linked service:
   - Application ID
   - Application key
   - Tenant ID
2. Grant the service principal the proper permissions in Azure Data Explorer. Refer to Manage Azure Data Explorer database permissions for detailed information on roles and permissions and a walkthrough of managing permissions. In general, you need to:
   - As source, grant at least the Database viewer role to your database.
   - As sink, grant at least the Database ingestor role to your database.

NOTE
When you use the ADF UI to author, the operations of listing databases on the linked service or listing tables on the dataset may require higher-privileged permissions granted to the service principal. Alternatively, you can choose to manually input the database name and table name. Copy activity execution works as long as the service principal is granted the proper permissions to read/write data.

The following properties are supported for Azure Data Explorer linked service:

type: The type property must be set to AzureDataExplorer. Required: Yes.

endpoint: Endpoint URL of the Azure Data Explorer cluster, with the format https://<clusterName>.<regionName>.kusto.windows.net. Required: Yes.

database: Name of the database. Required: Yes.

tenant: Specify the tenant information (domain name or tenant ID) under which your application resides. This is what you normally know as the "Authority ID" in a Kusto connection string. Retrieve it by hovering the mouse in the top-right corner of the Azure portal. Required: Yes.

servicePrincipalId: Specify the application's client ID. This is what you normally know as the "AAD application client ID" in a Kusto connection string. Required: Yes.

servicePrincipalKey: Specify the application's key. This is what you normally know as the "AAD application key" in a Kusto connection string. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes.

Linked Service Properties Example:

{
"name": "AzureDataExplorerLinkedService",
"properties": {
"type": "AzureDataExplorer",
"typeProperties": {
"endpoint": "https://<clusterName>.<regionName>.kusto.windows.net ",
"database": "<database name>",
"tenant": "<tenant name/id e.g. microsoft.onmicrosoft.com>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties that are supported by the Azure Data Explorer dataset.
To copy data to Azure Data Explorer, set the type property of the dataset to AzureDataExplorerTable.
The following properties are supported:

type: The type property must be set to AzureDataExplorerTable. Required: Yes.

table: The name of the table that the linked service refers to. Required: Yes for sink; No for source.

Dataset Properties Example


{
"name": "AzureDataExplorerDataset",
"properties": {
"type": "AzureDataExplorerTable",
"linkedServiceName": {
"referenceName": "<Azure Data Explorer linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"table": "<table name>"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Azure Data Explorer source and sink.
Azure Data Explorer as source
To copy data from Azure Data Explorer, set the type property in the Copy activity source to
AzureDataExplorerSource. The following properties are supported in the copy activity source section:

type: The type property of the copy activity source must be set to AzureDataExplorerSource. Required: Yes.

query: A read-only request given in KQL format. Use the custom KQL query as a reference. Required: Yes.

queryTimeout: The wait time before the query request times out. The default value is 10 minutes (00:10:00); the allowed maximum value is 1 hour (01:00:00). Required: No.

NOTE
The Azure Data Explorer source by default has a size limit of 500,000 records or 64 MB. To retrieve all the records without truncation, you can specify set notruncation; at the beginning of your query. Refer to Query limits for more details.
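For instance, a source configuration that lifts the truncation limit could look like the following sketch; the table name and row count are placeholders.

"source": {
    "type": "AzureDataExplorerSource",
    "query": "set notruncation; TestTable1 | take 1000000",
    "queryTimeout": "00:30:00"
}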

Example:
"activities":[
{
"name": "CopyFromAzureDataExplorer",
"type": "Copy",
"typeProperties": {
"source": {
"type": "AzureDataExplorerSource",
"query": "TestTable1 | take 10",
"queryTimeout": "00:10:00"
},
"sink": {
"type": "<sink type>"
}
},
"inputs": [
{
"referenceName": "<Azure Data Explorer input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
]
}
]

Azure Data Explorer as sink


To copy data to Azure Data Explorer, set the type property in the copy activity sink to AzureDataExplorerSink.
The following properties are supported in the copy activity sink section:

type: The type property of the copy activity sink must be set to AzureDataExplorerSink. Required: Yes.

ingestionMappingName: Name of a pre-created mapping on a Kusto table. To map the columns from source to Azure Data Explorer (which applies to all supported source stores and formats, including CSV/JSON/Avro formats, etc.), you can use the copy activity column mapping (implicitly by name or explicitly as configured) and/or Azure Data Explorer mappings. Required: No.

Example:
"activities":[
{
"name": "CopyToAzureDataExplorer",
"type": "Copy",
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureDataExplorerSink",
"ingestionMappingName": "<optional Azure Data Explorer mapping name>"
}
},
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Data Explorer output dataset name>",
"type": "DatasetReference"
}
]
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see
supported data stores.
Learn more about Copy data from Azure Data Factory to Azure Data Explorer.
Copy data to or from Azure Data Lake Storage
Gen1 by using Azure Data Factory
5/13/2019

This article outlines how to copy data to and from Azure Data Lake Storage Gen1 (ADLS Gen1). To learn about
Azure Data Factory, read the introductory article.

Supported capabilities
This Azure Data Lake Storage Gen1 connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
GetMetadata activity
Specifically, this connector supports:
Copying files by using one of the following methods of authentication: service principal or managed
identities for Azure resources.
Copying files as-is, or parsing or generating files with the supported file formats and compression codecs.

IMPORTANT
If you copy data using the self-hosted integration runtime, configure the corporate firewall to allow outbound traffic to
<ADLS account name>.azuredatalakestore.net and login.microsoftonline.com/<tenant>/oauth2/token on port
443. The latter is the Azure Security Token Service that the integration runtime needs to communicate with to get the
access token.

Get started
TIP
For a walkthrough of using the Azure Data Lake Store connector, see Load data into Azure Data Lake Store.

You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Data Lake Store.
Linked service properties
The following properties are supported for the Azure Data Lake Store linked service:

type: The type property must be set to AzureDataLakeStore. Required: Yes.

dataLakeStoreUri: Information about the Azure Data Lake Store account. This information takes one of the following formats: https://[accountname].azuredatalakestore.net/webhdfs/v1 or adl://[accountname].azuredatalakestore.net/. Required: Yes.

subscriptionId: The Azure subscription ID to which the Data Lake Store account belongs. Required: Yes for sink.

resourceGroupName: The Azure resource group name to which the Data Lake Store account belongs. Required: Yes for sink.

connectVia: The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or a self-hosted integration runtime (if your data store is located in a private network). If this property is not specified, the default Azure integration runtime is used. Required: No.

Use service principal authentication


To use service principal authentication, register an application entity in Azure Active Directory and grant it access
to Data Lake Store. For detailed steps, see Service-to-service authentication. Make note of the following values,
which you use to define the linked service:
Application ID
Application key
Tenant ID
IMPORTANT
Make sure you grant the service principal proper permission in Data Lake Store:
As source: In Data explorer > Access, grant at least Read + Execute permission to list and copy the files in folders
and subfolders. Or, you can grant Read permission to copy a single file. You can choose to add to This folder and all
children for recursive, and add as an access permission and a default permission entry. There's no requirement on
account level access control (IAM).
As sink: In Data explorer > Access, grant at least Write + Execute permission to create child items in the folder. You
can choose to add to This folder and all children for recursive, and add as an access permission and a default
permission entry. If you use Azure integration runtime to copy (both source and sink are in the cloud), in IAM, grant
at least the Reader role in order to let Data Factory detect the region for Data Lake Store. If you want to avoid this
IAM role, explicitly create an Azure integration runtime with the location of Data Lake Store. For example, if your Data
Lake Store is in West Europe, create an Azure integration runtime with location set to "West Europe". Associate them in
the Data Lake Store linked service as in the following example.

NOTE
To list folders starting from the root, you must grant the service principal "Execute" permission at the root level. This is true when you use the:
- Copy Data tool to author the copy pipeline.
- Data Factory UI to test the connection and navigate folders during authoring. If you have concerns about granting permission at the root level, you can skip the test connection and input the path manually during authoring. The copy activity will still work as long as the service principal is granted the proper permission on the files to be copied.

The following properties are supported:

servicePrincipalId: Specify the application's client ID. Required: Yes.

servicePrincipalKey: Specify the application's key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes.

tenant: Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse in the upper-right corner of the Azure portal. Required: Yes.

Example:
{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"subscriptionId": "<subscription of ADLS>",
"resourceGroupName": "<resource group of ADLS>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Use managed identities for Azure resources authentication


A data factory can be associated with a managed identity for Azure resources, which represents this specific data
factory. You can directly use this managed identity for Data Lake Store authentication, similar to using your own
service principal. It allows this designated factory to access and copy data to or from Data Lake Store.
To use managed identities for Azure resources authentication:
1. Retrieve the data factory managed identity information by copying the value of the "Service Identity
Application ID" generated along with your factory.
2. Grant the managed identity access to Data Lake Store, the same way you do for service principal, following
these notes.

IMPORTANT
Make sure you grant the data factory managed identity proper permission in Data Lake Store:
As source: In Data explorer > Access, grant at least Read + Execute permission to list and copy the files in folders
and subfolders. Or, you can grant Read permission to copy a single file. You can choose to add to This folder and all
children for recursive, and add as an access permission and a default permission entry. There's no requirement on
account level access control (IAM).
As sink: In Data explorer > Access, grant at least Write + Execute permission to create child items in the folder. You
can choose to add to This folder and all children for recursive, and add as an access permission and a default
permission entry. If you use Azure integration runtime to copy (both source and sink are in the cloud), in IAM, grant
at least the Reader role in order to let Data Factory detect the region for Data Lake Store. If you want to avoid this
IAM role, explicitly create an Azure integration runtime with the location of Data Lake Store. Associate them in the Data
Lake Store linked service as the following example.
NOTE
To list folders starting from the root, you must grant the managed identity "Execute" permission at the root level. This is true when you use the:
- Copy Data tool to author the copy pipeline.
- Data Factory UI to test the connection and navigate folders during authoring. If you have concerns about granting permission at the root level, you can skip the test connection and input the path manually during authoring. The copy activity will still work as long as the managed identity is granted the proper permission on the files to be copied.

In Azure Data Factory, you don't need to specify any properties besides the general Data Lake Store information
in the linked service.
Example:

{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"subscriptionId": "<subscription of ADLS>",
"resourceGroupName": "<resource group of ADLS>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data to and from ADLS Gen1 in Parquet or delimited text format, refer to Parquet format and
Delimited text format article on format-based dataset and supported settings. The following properties are
supported for ADLS Gen1 under location settings in format-based dataset:

type: The type property under location in the dataset must be set to AzureDataLakeStoreLocation. Required: Yes.

folderPath: The path to the folder. If you want to use a wildcard to filter folders, skip this setting and specify it in the activity source settings. Required: No.

fileName: The file name under the given folderPath. If you want to use a wildcard to filter files, skip this setting and specify it in the activity source settings. Required: No.
NOTE
The AzureDataLakeStoreFile type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the Copy/Lookup/GetMetadata activity for backward compatibility, but it doesn't work with Mapping Data Flow. We recommend that you use this new model going forward; the ADF authoring UI has switched to generating these new types.

Example:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<ADLS Gen1 linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AzureDataLakeStoreLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Other format dataset


To copy data to and from ADLS Gen1 in ORC/Avro/JSON/Binary format, the following properties are
supported:

type: The type property of the dataset must be set to AzureDataLakeStoreFile. Required: Yes.

folderPath: Path to the folder in Data Lake Store. If not specified, it points to the root. Wildcard filter is supported; allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has a wildcard or this escape character inside. Example: rootfolder/subfolder/; see more examples in Folder and file filter examples. Required: No.

fileName: Name or wildcard filter for the file(s) under the specified "folderPath". If you don't specify a value for this property, the dataset points to all files in the folder. For the filter, allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character).
- Example 1: "fileName": "*.csv"
- Example 2: "fileName": "???20180427.txt"
Use ^ to escape if your actual file name has a wildcard or this escape character inside.
When fileName isn't specified for an output dataset and preserveHierarchy isn't specified in the activity sink, the copy activity automatically generates the file name with the following pattern: "Data.[activity run ID GUID].[GUID if FlattenHierarchy].[format if configured].[compression if configured]". For example, "Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt.gz". If you copy from a tabular source using a table name instead of a query, the name pattern is "[table name].[format].[compression if configured]". For example, "MyTable.csv". Required: No.

modifiedDatetimeStart: Files are filtered based on the attribute Last Modified. Files are selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z". Be aware that the overall performance of data movement is affected when you enable this setting to filter files from huge numbers of files. The properties can be NULL, which means that no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value are selected. Required: No.

modifiedDatetimeEnd: Same as above. Required: No.

format: If you want to copy files as-is between file-based stores (binary copy), skip the format section in both the input and output dataset definitions. If you want to parse or generate files with a specific format, the following file format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, and ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections. Required: No (only for binary copy scenario).

compression: Specify the type and level of compression for the data. For more information, see Supported file formats and compression codecs. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. Required: No.

TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a particular name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.

Example:

{
"name": "ADLSDataset",
"properties": {
"type": "AzureDataLakeStoreFile",
"linkedServiceName":{
"referenceName": "<ADLS linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "datalake/myfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
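If your actual file name contains a literal wildcard character, escape it with ^ as described in the fileName property above. The following is a minimal sketch with hypothetical folder and file names; "fileName": "sales^*2018.csv" points to a file literally named sales*2018.csv:

{
    "name": "ADLSDatasetEscapedFileName",
    "properties": {
        "type": "AzureDataLakeStoreFile",
        "linkedServiceName": {
            "referenceName": "<ADLS linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "datalake/myfolder/",
            "fileName": "sales^*2018.csv"
        }
    }
}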
Copy Activity properties
For a full list of sections and properties available for defining activities, see Pipelines. This section provides a list
of properties supported by Azure Data Lake Store source and sink.
Azure Data Lake Store as source
For copy from Parquet and delimited text format, refer to Parquet and delimited text format source
section.
For copy from other formats like ORC/Avro/JSON/Binary format, refer to Other format source section.
Parquet and delimited text format source
To copy data from ADLS Gen1 in Parquet or delimited text format, refer to Parquet format and Delimited
text format article on format-based copy activity source and supported settings. The following properties are
supported for ADLS Gen1 under storeSettings settings in format-based copy source:

type (Required: Yes)
The type property under storeSettings must be set to AzureDataLakeStoreReadSetting.

recursive (Required: No)
Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false.

wildcardFolderPath (Required: No)
The folder path with wildcard characters to filter source folders. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or one character); use ^ to escape if your actual folder name has a wildcard or this escape character inside. See more examples in Folder and file filter examples.

wildcardFileName (Required: Yes if fileName isn't specified in the dataset)
The file name with wildcard characters under the given folderPath/wildcardFolderPath to filter source files. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or one character); use ^ to escape if your actual file name has a wildcard or this escape character inside. See more examples in Folder and file filter examples.

modifiedDatetimeStart (Required: No)
Filters files based on the Last Modified attribute. Files are selected if their last modified time falls within the range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z".
The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, files whose last modified attribute is less than the datetime value are selected.

modifiedDatetimeEnd (Required: No)
Same as above.

maxConcurrentConnections (Required: No)
The number of connections that can connect to the storage store concurrently. Specify a value only when you want to limit concurrent connections to the data store.

NOTE
For the Parquet/delimited text format, the AzureDataLakeStoreSource type copy activity source mentioned in the next section is still supported as-is for backward compatibility. We recommend that you use the new model going forward; the ADF authoring UI has switched to generating these new types.

Example:
"activities":[
{
"name": "CopyFromADLSGen1",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "AzureDataLakeStoreReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
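As a variation, the storeSettings block can also carry the modifiedDatetimeStart/modifiedDatetimeEnd filter and the maxConcurrentConnections limit described in the table above. The following is a minimal sketch of just the source block, with hypothetical timestamps and connection limit:

"source": {
    "type": "DelimitedTextSource",
    "formatSettings": {
        "type": "DelimitedTextReadSetting"
    },
    "storeSettings": {
        "type": "AzureDataLakeStoreReadSetting",
        "recursive": true,
        "wildcardFileName": "*.csv",
        "modifiedDatetimeStart": "2018-12-01T05:00:00Z",
        "modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
        "maxConcurrentConnections": 5
    }
}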

Other format source


To copy data from ADLS Gen1 in ORC/Avro/JSON/Binary format, the following properties are supported in
the copy activity source section:

type (Required: Yes)
The type property of the Copy Activity source must be set to AzureDataLakeStoreSource.

recursive (Required: No)
Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false.

maxConcurrentConnections (Required: No)
The number of connections that can connect to the data store concurrently. Specify a value only when you want to limit concurrent connections to the data store.
Example:

"activities":[
{
"name": "CopyFromADLSGen1",
"type": "Copy",
"inputs": [
{
"referenceName": "<ADLS Gen1 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureDataLakeStoreSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

Azure Data Lake Store as sink


For copy to Parquet and delimited text format, refer to Parquet and delimited text format sink section.
For copy to other formats like ORC/Avro/JSON/Binary format, refer to Other format sink section.
Parquet and delimited text format sink
To copy data to ADLS Gen1 in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based copy activity sink and supported settings. The following properties are
supported for ADLS Gen1 under storeSettings settings in format-based copy sink:

type (Required: Yes)
The type property under storeSettings must be set to AzureDataLakeStoreWriteSetting.

copyBehavior (Required: No)
Defines the copy behavior when the source is files from a file-based data store. Allowed values are:
- PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
- FlattenHierarchy: All files from the source folder are placed in the first level of the target folder. The target files have autogenerated names.
- MergeFiles: Merges all files from the source folder into one file. If the file name is specified, the merged file name is the specified name. Otherwise, it's an autogenerated file name.

maxConcurrentConnections (Required: No)
The number of connections that can connect to the data store concurrently. Specify a value only when you want to limit concurrent connections to the data store.

NOTE
For the Parquet/delimited text format, the AzureDataLakeStoreSink type copy activity sink mentioned in the next section is still supported as-is for backward compatibility. We recommend that you use the new model going forward; the ADF authoring UI has switched to generating these new types.

Example:
"activities":[
{
"name": "CopyToADLSGen1",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "AzureDataLakeStoreWriteSetting",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]

Other format sink


To copy data to ADLS Gen1 in ORC/Avro/JSON/Binary format, the following properties are supported in the
sink section:

type (Required: Yes)
The type property of the Copy Activity sink must be set to AzureDataLakeStoreSink.

copyBehavior (Required: No)
Defines the copy behavior when the source is files from a file-based data store. Allowed values are:
- PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
- FlattenHierarchy: All files from the source folder are placed in the first level of the target folder. The target files have auto-generated names.
- MergeFiles: Merges all files from the source folder into one file. If the file name is specified, the merged file name is the specified name. Otherwise, the file name is auto-generated.

maxConcurrentConnections (Required: No)
The number of connections that can connect to the data store concurrently. Specify a value only when you want to limit concurrent connections to the data store.

Example:

"activities":[
{
"name": "CopyToADLSGen1",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<ADLS Gen1 output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureDataLakeStoreSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]
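The other copyBehavior values follow the same shape. As a sketch based on the table above, the following sink block merges all source files into a single output file; if the output dataset specifies a fileName, the merged file uses that name, otherwise the name is auto-generated:

"sink": {
    "type": "AzureDataLakeStoreSink",
    "copyBehavior": "MergeFiles"
}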

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

The examples below assume the following source folder structure:

FolderA
    File1.csv
    File2.json
    Subfolder1
        File3.csv
        File4.json
        File5.csv
AnotherFolderB
    File6.csv

folderPath: Folder*, fileName: (empty, use default), recursive: false
Retrieved files: File1.csv and File2.json in FolderA. Subfolder1 and AnotherFolderB are not picked up.

folderPath: Folder*, fileName: (empty, use default), recursive: true
Retrieved files: File1.csv and File2.json in FolderA, plus File3.csv, File4.json, and File5.csv in Subfolder1. AnotherFolderB is not picked up because it doesn't match the folderPath wildcard.

folderPath: Folder*, fileName: *.csv, recursive: false
Retrieved files: File1.csv in FolderA.

folderPath: Folder*, fileName: *.csv, recursive: true
Retrieved files: File1.csv in FolderA, plus File3.csv and File5.csv in Subfolder1.

Examples of behavior of the copy operation


This section describes the resulting behavior of the copy operation for different combinations of recursive and
copyBehavior values.

All examples below assume the source folder Folder1 contains File1, File2, and Subfolder1, which in turn contains File3, File4, and File5.

recursive = true, copyBehavior = preserveHierarchy
The target Folder1 is created with the same structure as the source: Folder1 containing File1, File2, and Subfolder1 with File3, File4, and File5.

recursive = true, copyBehavior = flattenHierarchy
The target Folder1 is created with the following structure: Folder1 containing an auto-generated name for File1, an auto-generated name for File2, an auto-generated name for File3, an auto-generated name for File4, and an auto-generated name for File5.

recursive = true, copyBehavior = mergeFiles
The target Folder1 is created with the following structure: Folder1 containing one file with an auto-generated name, into which the contents of File1 + File2 + File3 + File4 + File5 are merged.

recursive = false, copyBehavior = preserveHierarchy
The target Folder1 is created with the following structure: Folder1 containing File1 and File2. Subfolder1 with File3, File4, and File5 is not picked up.

recursive = false, copyBehavior = flattenHierarchy
The target Folder1 is created with the following structure: Folder1 containing an auto-generated name for File1 and an auto-generated name for File2. Subfolder1 with File3, File4, and File5 is not picked up.

recursive = false, copyBehavior = mergeFiles
The target Folder1 is created with the following structure: Folder1 containing one file with an auto-generated name, into which the contents of File1 + File2 are merged. Subfolder1 with File3, File4, and File5 is not picked up.
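As a sketch that ties one of these rows to JSON, the recursive = false, copyBehavior = flattenHierarchy combination corresponds to the following source and sink fragments, using only the properties documented in the tables above:

"source": {
    "type": "AzureDataLakeStoreSource",
    "recursive": false
},
"sink": {
    "type": "AzureDataLakeStoreSink",
    "copyBehavior": "FlattenHierarchy"
}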

Preserve ACLs to Data Lake Storage Gen2


If you want to replicate the ACLs along with data files when upgrading from Data Lake Storage Gen1 to Gen2,
refer to Preserve ACLs from Data Lake Storage Gen1.

Mapping Data Flow properties


Learn details from source transformation and sink transformation in Mapping Data Flow.

Next steps
For a list of data stores supported as sources and sinks by Copy Activity in Azure Data Factory, see supported
data stores.
Copy data to or from Azure Data Lake Storage
Gen2 using Azure Data Factory
5/24/2019

Azure Data Lake Storage Gen2 (ADLS Gen2) is a set of capabilities dedicated to big data analytics, built into
Azure Blob storage. It allows you to interface with your data using both file system and object storage
paradigms.
This article outlines how to copy data to and from Azure Data Lake Storage Gen2. To learn about Azure Data
Factory, read the introductory article.

Supported capabilities
This Azure Data Lake Storage Gen2 connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
GetMetadata activity
Specifically, this connector supports:
Copying data by using account key, service principal, or managed identities for Azure resources authentication.
Copying files as-is, or parsing or generating files with supported file formats and compression codecs.

TIP
If you enable the hierarchical namespace, there is currently no interoperability of operations between the Blob and ADLS Gen2 APIs. If you hit the error "ErrorCode=FilesystemNotFound" with the detailed message "The specified filesystem does not exist.", the cause is that the specified sink file system was created elsewhere via the Blob API instead of the ADLS Gen2 API. To fix the issue, specify a new file system with a name that doesn't already exist as the name of a Blob container, and ADF will automatically create that file system during the data copy.

NOTE
If you enables "Allow trusted Microsoft services to access this storage account" option on Azure Storage firewall settings,
using Azure Integration Runtime to connect to Data Lake Storage Gen2 will fail with forbidden error, as ADF are not
treated as trusted Microsoft service. Please use Self-hosted Integration Runtime as connect via instead.

Get started
TIP
For a walkthrough of using Data Lake Storage Gen2 connector, see Load data into Azure Data Lake Storage Gen2.

You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Data Lake Storage Gen2.

Linked service properties


The Azure Data Lake Storage Gen2 connector supports the following authentication types; refer to the corresponding section for details:
Account key authentication
Service principal authentication
Managed identities for Azure resources authentication
Account key authentication
To use storage account key authentication, the following properties are supported:

type (Required: Yes)
The type property must be set to AzureBlobFS.

url (Required: Yes)
Endpoint for Data Lake Storage Gen2, with the pattern https://<accountname>.dfs.core.windows.net.

accountKey (Required: Yes)
Account key for the Data Lake Storage Gen2 service. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault.

connectVia (Required: No)
The integration runtime to be used to connect to the data store. You can use the Azure Integration Runtime or a Self-hosted Integration Runtime (if your data store is in a private network). If not specified, it uses the default Azure Integration Runtime.

Example:
{
"name": "AzureDataLakeStorageGen2LinkedService",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://<accountname>.dfs.core.windows.net",
"accountkey": {
"type": "SecureString",
"value": "<accountkey>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
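As noted in the table above, the account key can instead be referenced from Azure Key Vault. The following is a minimal sketch, assuming an Azure Key Vault linked service has already been defined and following the same secret-reference pattern used in the Store credentials in Azure Key Vault article:

{
    "name": "AzureDataLakeStorageGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<accountname>.dfs.core.windows.net",
            "accountKey": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secretName>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}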

Service principal authentication


To use service principal authentication, follow these steps:
1. Register an application entity in Azure Active Directory (Azure AD) by following Register your application with an Azure AD tenant. Make note of the following values, which you use to define the linked service:
Application ID
Application key
Tenant ID
2. Grant the service principal proper permission.
As source, in Storage Explorer, grant at least Read + Execute permission to list and copy the files in folders and subfolders, or grant Read permission to copy a single file. Alternatively, in Access control (IAM), grant at least the Storage Blob Data Reader role.
As sink, in Storage Explorer, grant at least Write + Execute permission to create child items in the folder. Alternatively, in Access control (IAM), grant at least the Storage Blob Data Contributor role.

NOTE
To list folders starting from the account level or to test the connection, the service principal must be granted "Execute" permission on the storage account in IAM. This is true when you use the:
Copy Data Tool to author a copy pipeline.
Data Factory UI to test the connection and navigate folders during authoring. If you have concerns about granting permission at the account level, you can skip the connection test and enter the path manually during authoring. Copy activity will still work as long as the service principal is granted proper permission on the files to be copied.

These properties are supported in linked service:

type (Required: Yes)
The type property must be set to AzureBlobFS.

url (Required: Yes)
Endpoint for Data Lake Storage Gen2, with the pattern https://<accountname>.dfs.core.windows.net.

servicePrincipalId (Required: Yes)
Specify the application's client ID.

servicePrincipalKey (Required: Yes)
Specify the application's key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault.

tenant (Required: Yes)
Specify the tenant information (domain name or tenant ID) under which your application resides. Retrieve it by hovering the mouse over the top-right corner of the Azure portal.

connectVia (Required: No)
The integration runtime to be used to connect to the data store. You can use the Azure Integration Runtime or a Self-hosted Integration Runtime (if your data store is in a private network). If not specified, it uses the default Azure Integration Runtime.

Example:

{
"name": "AzureDataLakeStorageGen2LinkedService",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://<accountname>.dfs.core.windows.net",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Managed identities for Azure resources authentication


A data factory can be associated with a managed identity for Azure resources, which represents this specific data
factory. You can directly use this managed identity for ADLS Gen2 authentication similar to using your own
service principal. It allows this designated factory to access and copy data from/to your ADLS Gen2.
To use managed identities for Azure resources authentication, follow these steps:
1. Retrieve data factory managed identity information by copying the value of "SERVICE IDENTITY
APPLICATION ID" generated along with your factory.
2. Grant the managed identity proper permission.
As source, in Storage Explorer, grant at least Read + Execute permission to list and copy the files in folders and subfolders, or grant Read permission to copy a single file. Alternatively, in Access control (IAM), grant at least the Storage Blob Data Reader role.
As sink, in Storage Explorer, grant at least Write + Execute permission to create child items in the folder. Alternatively, in Access control (IAM), grant at least the Storage Blob Data Contributor role.

NOTE
To list folders starting from the account level or to test the connection, the managed identity must be granted "Execute" permission on the storage account in IAM. This is true when you use the:
Copy Data Tool to author a copy pipeline.
Data Factory UI to test the connection and navigate folders during authoring. If you have concerns about granting permission at the account level, you can skip the connection test and enter the path manually during authoring. Copy activity will still work as long as the managed identity is granted proper permission on the files to be copied.

IMPORTANT
If you use PolyBase to load data from ADLS Gen2 into SQL DW, when using ADLS Gen2 managed identity authentication,
make sure you also follow the steps #1 and #2 in this guidance to register your SQL Database server with Azure Active
Directory (AAD) and assign Storage Blob Data Contributor RBAC role to your SQL Database server; the rest will be handled
by ADF. If your ADLS Gen2 is configured with VNet service endpoint, to use PolyBase to load data from it, you must use
managed identity authentication.

These properties are supported in linked service:

type (Required: Yes)
The type property must be set to AzureBlobFS.

url (Required: Yes)
Endpoint for Data Lake Storage Gen2, with the pattern https://<accountname>.dfs.core.windows.net.

connectVia (Required: No)
The integration runtime to be used to connect to the data store. You can use the Azure Integration Runtime or a Self-hosted Integration Runtime (if your data store is in a private network). If not specified, it uses the default Azure Integration Runtime.

Example:

{
"name": "AzureDataLakeStorageGen2LinkedService",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://<accountname>.dfs.core.windows.net",
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data to and from ADLS Gen2 in Parquet or delimited text format, refer to Parquet format and
Delimited text format article on format-based dataset and supported settings. The following properties are
supported for ADLS Gen2 under location settings in format-based dataset:

type (Required: Yes)
The type property under location in the dataset must be set to AzureBlobFSLocation.

fileSystem (Required: No)
The ADLS Gen2 file system name.

folderPath (Required: No)
The path to the folder under the given file system. If you want to use a wildcard to filter folders, skip this setting and specify it in the activity source settings.

fileName (Required: No)
The file name under the given fileSystem + folderPath. If you want to use a wildcard to filter files, skip this setting and specify it in the activity source settings.

NOTE
The AzureBlobFSFile type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the Copy/Lookup/GetMetadata activities for backward compatibility, but it doesn't work with Mapping Data Flow. We recommend that you use the new model going forward; the ADF authoring UI has switched to generating these new types.

Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<ADLS Gen2 linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobFSLocation",
"fileSystem": "filesystemname",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
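The same location block applies to the Parquet format dataset described in the Parquet format article. The following is a minimal sketch, assuming the Parquet dataset type documented there and reusing the hypothetical file system and folder names from the example above:

{
    "name": "ParquetDataset",
    "properties": {
        "type": "Parquet",
        "linkedServiceName": {
            "referenceName": "<ADLS Gen2 linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "filesystemname",
                "folderPath": "folder/subfolder"
            }
        }
    }
}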

Other format dataset


To copy data to and from ADLS Gen2 in ORC/Avro/JSON/Binary format, the following properties are
supported:

type (Required: Yes)
The type property of the dataset must be set to AzureBlobFSFile.

folderPath (Required: No)
Path to the folder in Data Lake Storage Gen2. If not specified, it points to the root.
Wildcard filter is supported. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or one character); use ^ to escape if your actual folder name has a wildcard or this escape character inside.
Example: filesystem/folder/. See more examples in Folder and file filter examples.

fileName (Required: No)
Name or wildcard filter for the file(s) under the specified "folderPath". If you don't specify a value for this property, the dataset points to all files in the folder.
For the filter, allowed wildcards are * (matches zero or more characters) and ? (matches zero or one character). Example 1: "fileName": "*.csv". Example 2: "fileName": "???20180427.txt". Use ^ to escape if your actual file name has a wildcard or this escape character inside.
When fileName isn't specified for an output dataset and preserveHierarchy isn't specified in the activity sink, the copy activity automatically generates the file name with the following pattern: "Data.[activity run ID GUID].[GUID if FlattenHierarchy].[format if configured].[compression if configured]", for example "Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt.gz". If you copy from a tabular source using a table name instead of a query, the name pattern is "[table name].[format].[compression if configured]", for example "MyTable.csv".

modifiedDatetimeStart (Required: No)
Filters files based on the Last Modified attribute. Files are selected if their last modified time falls within the range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z".
Be aware that the overall performance of data movement is affected when you enable this setting to filter among huge numbers of files.
The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, files whose last modified attribute is less than the datetime value are selected.

modifiedDatetimeEnd (Required: No)
Same as above.

format (Required: No (only for the binary copy scenario))
If you want to copy files as-is between file-based stores (binary copy), skip the format section in both the input and output dataset definitions.
If you want to parse or generate files with a specific format, the following file format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, and ParquetFormat. Set the type property under format to one of these values. For more information, see the Text format, JSON format, Avro format, Orc format, and Parquet format sections.

compression (Required: No)
Specify the type and level of compression for the data. For more information, see Supported file formats and compression codecs.
Supported types are GZip, Deflate, BZip2, and ZipDeflate. Supported levels are Optimal and Fastest.

TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.

Example:
{
"name": "ADLSGen2Dataset",
"properties": {
"type": "AzureBlobFSFile",
"linkedServiceName": {
"referenceName": "<Azure Data Lake Storage Gen2 linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "myfilesystem/myfolder",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the copy activity configurations and
pipelines and activities article. This section provides a list of properties supported by the Data Lake Storage
Gen2 source and sink.
Azure Data Lake Storage Gen2 as a source type
For copy from Parquet and delimited text format, refer to Parquet and delimited text format source
section.
For copy from other formats like ORC/Avro/JSON/Binary format, refer to Other format source section.
Parquet and delimited text format source
To copy data from ADLS Gen2 in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based copy activity source and supported settings. The following properties are
supported for ADLS Gen2 under storeSettings settings in format-based copy source:

type (Required: Yes)
The type property under storeSettings must be set to AzureBlobFSReadSetting.

recursive (Required: No)
Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false.

wildcardFolderPath (Required: No)
The folder path with wildcard characters under the file system configured in the dataset, used to filter source folders. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or one character); use ^ to escape if your actual folder name has a wildcard or this escape character inside. See more examples in Folder and file filter examples.

wildcardFileName (Required: Yes if fileName isn't specified in the dataset)
The file name with wildcard characters under the given file system + folderPath/wildcardFolderPath, used to filter source files. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or one character); use ^ to escape if your actual file name has a wildcard or this escape character inside. See more examples in Folder and file filter examples.

modifiedDatetimeStart (Required: No)
Filters files based on the Last Modified attribute. Files are selected if their last modified time falls within the range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z".
The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, files whose last modified attribute is less than the datetime value are selected.

modifiedDatetimeEnd (Required: No)
Same as above.

maxConcurrentConnections (Required: No)
The number of connections that can connect to the storage store concurrently. Specify a value only when you want to limit concurrent connections to the data store.
NOTE
For the Parquet/delimited text format, the AzureBlobFSSource type copy activity source mentioned in the next section is still supported as-is for backward compatibility. We recommend that you use the new model going forward; the ADF authoring UI has switched to generating these new types.

Example:

"activities":[
{
"name": "CopyFromADLSGen2",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "AzureBlobFSReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
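The storeSettings block can also carry the modifiedDatetimeStart/modifiedDatetimeEnd filter from the table above instead of the wildcard settings. The following is a minimal sketch of just the source block, with hypothetical timestamps and assuming fileName is specified in the dataset (so wildcardFileName isn't required):

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobFSReadSetting",
        "recursive": true,
        "modifiedDatetimeStart": "2018-12-01T05:00:00Z",
        "modifiedDatetimeEnd": "2018-12-01T06:00:00Z"
    }
}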

Other format source


To copy data from ADLS Gen2 in ORC/Avro/JSON/Binary format, the following properties are supported in
the copy activity source section:

type (Required: Yes)
The type property of the copy activity source must be set to AzureBlobFSSource.

recursive (Required: No)
Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false.

maxConcurrentConnections (Required: No)
The number of connections that can connect to the data store concurrently. Specify a value only when you want to limit concurrent connections to the data store.

Example:

"activities":[
{
"name": "CopyFromADLSGen2",
"type": "Copy",
"inputs": [
{
"referenceName": "<ADLS Gen2 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureBlobFSSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

Azure Data Lake Storage Gen2 as a sink type


For copy to Parquet and delimited text format, refer to Parquet and delimited text format sink section.
For copy to other formats like ORC/Avro/JSON/Binary format, refer to Other format sink section.
Parquet and delimited text format sink
To copy data to ADLS Gen2 in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based copy activity sink and supported settings. The following properties are supported
for ADLS Gen2 under storeSettings settings in format-based copy sink:

type (Required: Yes)
The type property under storeSettings must be set to AzureBlobFSWriteSetting.

copyBehavior (Required: No)
Defines the copy behavior when the source is files from a file-based data store. Allowed values are:
- PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
- FlattenHierarchy: All files from the source folder are placed in the first level of the target folder. The target files have autogenerated names.
- MergeFiles: Merges all files from the source folder into one file. If the file name is specified, the merged file name is the specified name. Otherwise, it's an autogenerated file name.

maxConcurrentConnections (Required: No)
The number of connections that can connect to the data store concurrently. Specify a value only when you want to limit concurrent connections to the data store.

NOTE
For the Parquet/delimited text format, the AzureBlobFSSink type copy activity sink mentioned in the next section is still supported as-is for backward compatibility. We recommend that you use the new model going forward; the ADF authoring UI has switched to generating these new types.

Example:
"activities":[
{
"name": "CopyToADLSGen2",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "AzureBlobFSWriteSetting",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]

Other format sink


To copy data to ADLS Gen2 in ORC/Avro/JSON/Binary format, the following properties are supported in the
sink section:

type (Required: Yes)
The type property of the copy activity sink must be set to AzureBlobFSSink.

copyBehavior (Required: No)
Defines the copy behavior when the source is files from a file-based data store. Allowed values are:
- PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
- FlattenHierarchy: All files from the source folder are placed in the first level of the target folder. The target files have autogenerated names.
- MergeFiles: Merges all files from the source folder into one file. If the file name is specified, the merged file name is the specified name. Otherwise, it's an autogenerated file name.

maxConcurrentConnections (Required: No)
The number of connections that can connect to the data store concurrently. Specify a value only when you want to limit concurrent connections to the data store.

Example:

"activities":[
{
"name": "CopyToADLSGen2",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<ADLS Gen2 output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureBlobFSSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

The examples below assume the following source folder structure:

FolderA
    File1.csv
    File2.json
    Subfolder1
        File3.csv
        File4.json
        File5.csv
AnotherFolderB
    File6.csv

folderPath: Folder*, fileName: (empty, use default), recursive: false
Retrieved files: File1.csv and File2.json in FolderA. Subfolder1 and AnotherFolderB are not picked up.

folderPath: Folder*, fileName: (empty, use default), recursive: true
Retrieved files: File1.csv and File2.json in FolderA, plus File3.csv, File4.json, and File5.csv in Subfolder1. AnotherFolderB is not picked up because it doesn't match the folderPath wildcard.

folderPath: Folder*, fileName: *.csv, recursive: false
Retrieved files: File1.csv in FolderA.

folderPath: Folder*, fileName: *.csv, recursive: true
Retrieved files: File1.csv in FolderA, plus File3.csv and File5.csv in Subfolder1.

Some recursive and copyBehavior examples


This section describes the resulting behavior of the Copy operation for different combinations of recursive and
copyBehavior values.

All examples below assume the source folder Folder1 contains File1, File2, and Subfolder1, which in turn contains File3, File4, and File5.

recursive = true, copyBehavior = preserveHierarchy
The target folder Folder1 is created with the same structure as the source: Folder1 containing File1, File2, and Subfolder1 with File3, File4, and File5.

recursive = true, copyBehavior = flattenHierarchy
The target folder Folder1 is created with the following structure: Folder1 containing an autogenerated name for File1, an autogenerated name for File2, an autogenerated name for File3, an autogenerated name for File4, and an autogenerated name for File5.

recursive = true, copyBehavior = mergeFiles
The target folder Folder1 is created with the following structure: Folder1 containing one file with an autogenerated name, into which the contents of File1 + File2 + File3 + File4 + File5 are merged.

recursive = false, copyBehavior = preserveHierarchy
The target folder Folder1 is created with the following structure: Folder1 containing File1 and File2. Subfolder1 with File3, File4, and File5 is not picked up.

recursive = false, copyBehavior = flattenHierarchy
The target folder Folder1 is created with the following structure: Folder1 containing an autogenerated name for File1 and an autogenerated name for File2. Subfolder1 with File3, File4, and File5 is not picked up.

recursive = false, copyBehavior = mergeFiles
The target folder Folder1 is created with the following structure: Folder1 containing one file with an autogenerated name, into which the contents of File1 + File2 are merged. Subfolder1 with File3, File4, and File5 is not picked up.

Preserve ACLs from Data Lake Storage Gen1


TIP
To copy data from Azure Data Lake Storage Gen1 to Gen2 in general, see Copy data from Azure Data Lake Storage Gen1 to Gen2 with Azure Data Factory, which includes a walkthrough and best practices.

When you copy files from Azure Data Lake Storage (ADLS) Gen1 to Gen2, you can choose to preserve the POSIX access control lists (ACLs) along with the data. For details on access control, refer to Access control in Azure Data Lake Storage Gen1 and Access control in Azure Data Lake Storage Gen2.
The following types of ACLs can be preserved by the Azure Data Factory Copy activity; you can select one or more types:
ACL: Copy and preserve POSIX access control lists on files and directories. It will copy the full existing
ACLs from source to sink.
Owner: Copy and preserve the owning user of files and directories. Super-user access to sink ADLS Gen2 is
required.
Group: Copy and preserve the owning group of files and directories. Super-user access to sink ADLS Gen2,
or the owning user (if the owning user is also a member of the target group) is required.
If you specify to copy from a folder, Data Factory replicates the ACLs for that folder as well as the files and directories under it (if recursive is set to true). If you specify to copy from a single file, the ACLs on that file are copied.

IMPORTANT
When you choose to preserve ACLs, make sure you grant high enough permissions for ADF to operate against your sink
ADLS Gen2 account. For example, use account key authentication, or assign Storage Blob Data Owner role to the service
principal/managed identity.

When you configure the source as ADLS Gen1 with the binary copy option or binary format, and the sink as ADLS Gen2 with the binary copy option or binary format, you can find the Preserve option on the Copy Data Tool Settings page or on the Copy Activity -> Settings tab when authoring the activity.
Here is an example of the JSON configuration (see preserve):

"activities":[
{
"name": "CopyFromGen1ToGen2",
"type": "Copy",
"typeProperties": {
"source": {
"type": "AzureDataLakeStoreSource",
"recursive": true
},
"sink": {
"type": "AzureBlobFSSink",
"copyBehavior": "PreserveHierarchy"
},
"preserve": [
"ACL",
"Owner",
"Group"
]
},
"inputs": [
{
"referenceName": "<Azure Data Lake Storage Gen1 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Data Lake Storage Gen2 output dataset name>",
"type": "DatasetReference"
}
]
}
]

Mapping Data Flow properties


Learn details from source transformation and sink transformation in Mapping Data Flow.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Azure Database for MariaDB using
Azure Data Factory
2/1/2019

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Azure Database for
MariaDB. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from Azure Database for MariaDB to any supported sink data store. For a list of data stores
that are supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any driver to use this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Database for MariaDB connector.

Linked service properties


The following properties are supported for Azure Database for MariaDB linked service:

type (Required: Yes)
The type property must be set to: MariaDB

connectionString (Required: Yes)
A connection string to connect to Azure Database for MariaDB. You can find it in the Azure portal: go to your Azure Database for MariaDB, select Connection strings, and use the ADO.NET one. Mark this field as a SecureString to store it securely in Data Factory. You can also put the password in Azure Key Vault and pull the pwd configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details.

connectVia (Required: No)
The Integration Runtime to be used to connect to the data store. You can use a Self-hosted Integration Runtime or the Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime.

Example:

{
"name": "AzureDatabaseForMariaDBLinkedService",
"properties": {
"type": "MariaDB",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server={your_server}.mariadb.database.azure.com; Port=3306; Database=
{your_database}; Uid={your_user}@{your_server}; Pwd={your_password}; SslMode=Preferred;"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store password in Azure Key Vault


{
"name": "AzureDatabaseForMariaDBLinkedService",
"properties": {
"type": "MariaDB",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server={your_server}.mariadb.database.azure.com; Port=3306; Database=
{your_database}; Uid={your_user}@{your_server}; SslMode=Preferred;"
},
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Azure Database for MariaDB dataset.
To copy data from Azure Database for MariaDB, set the type property of the dataset to MariaDBTable. The
following properties are supported:

type (Required: Yes)
The type property of the dataset must be set to: MariaDBTable

tableName (Required: No, if "query" in the activity source is specified)
Name of the table.

Example

{
"name": "AzureDatabaseForMariaDBDataset",
"properties": {
"type": "MariaDBTable",
"linkedServiceName": {
"referenceName": "<Azure Database for MariaDB linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Azure Database for MariaDB source.
Azure Database for MariaDB as source
To copy data from Azure Database for MariaDB, set the source type in the copy activity to MariaDBSource. The
following properties are supported in the copy activity source section:

type (Required: Yes)
The type property of the copy activity source must be set to: MariaDBSource

query (Required: No, if "tableName" in the dataset is specified)
Use the custom SQL query to read data. For example: "SELECT * FROM MyTable".

Example:

"activities":[
{
"name": "CopyFromAzureDatabaseForMariaDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure Database for MariaDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MariaDBSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Azure Database for MySQL using
Azure Data Factory
4/19/2019

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Azure Database for
MySQL. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from Azure Database for MySQL to any supported sink data store. For a list of data stores that
are supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any driver to use this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Database for MySQL connector.

Linked service properties


The following properties are supported for Azure Database for MySQL linked service:

type (Required: Yes)
The type property must be set to: AzureMySql

connectionString (Required: Yes)
Specify the information needed to connect to the Azure Database for MySQL instance. Mark this field as a SecureString to store it securely in Data Factory. You can also put the password in Azure Key Vault and pull the password configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details.

connectVia (Required: No)
The Integration Runtime to be used to connect to the data store. You can use the Azure Integration Runtime or a Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime.

A typical connection string is
Server=<server>.mysql.database.azure.com;Port=<port>;Database=<database>;UID=<username>;PWD=<password>. More properties that you can set, depending on your scenario:

SSLMode (Required: No)
This option specifies whether the driver uses SSL encryption and verification when connecting to MySQL. Options: DISABLED (0) / PREFERRED (1) (Default) / REQUIRED (2) / VERIFY_CA (3) / VERIFY_IDENTITY (4). For example: SSLMode=<0/1/2/3/4>

UseSystemTrustStore (Required: No)
This option specifies whether to use a CA certificate from the system trust store or from a specified PEM file. Options: Enabled (1) / Disabled (0) (Default). For example: UseSystemTrustStore=<0/1>;

Example:
{
"name": "AzureDatabaseForMySQLLinkedService",
"properties": {
"type": "AzureMySql",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>.mysql.database.azure.com;Port=<port>;Database=<database>;UID=
<username>;PWD=<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
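To apply the SSLMode and UseSystemTrustStore options from the table above, append them to the connection string. A minimal sketch of just the connectionString field, with the rest of the linked service unchanged (the option values here are illustrative):

"connectionString": {
    "type": "SecureString",
    "value": "Server=<server>.mysql.database.azure.com;Port=<port>;Database=<database>;UID=<username>;PWD=<password>;SSLMode=1;UseSystemTrustStore=0;"
}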

Example: store password in Azure Key Vault

{
"name": "AzureDatabaseForMySQLLinkedService",
"properties": {
"type": "AzureMySql",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>.mysql.database.azure.com;Port=<port>;Database=<database>;UID=
<username>;"
},
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Azure Database for MySQL dataset.
To copy data from Azure Database for MySQL, set the type property of the dataset to AzureMySqlTable. The
following properties are supported:

type (Required: Yes)
The type property of the dataset must be set to: AzureMySqlTable

tableName (Required: No, if "query" in the activity source is specified)
Name of the table in the MySQL database.

Example

{
"name": "AzureMySQLDataset",
"properties": {
"type": "AzureMySqlTable",
"linkedServiceName": {
"referenceName": "<Azure MySQL linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<table name>"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Azure Database for MySQL source.
Azure Database for MySQL as source
To copy data from Azure Database for MySQL, set the source type in the copy activity to AzureMySqlSource.
The following properties are supported in the copy activity source section:

type (Required: Yes)
The type property of the copy activity source must be set to: AzureMySqlSource

query (Required: No, if "tableName" in the dataset is specified)
Use the custom SQL query to read data. For example: "SELECT * FROM MyTable".

queryCommandTimeout (Required: No)
The wait time before the query request times out. The default is 120 minutes (02:00:00).

Example:
"activities":[
{
"name": "CopyFromAzureDatabaseForMySQL",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure MySQL input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureMySqlSource",
"query": "<custom query e.g. SELECT * FROM MyTable>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
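To override the default 120-minute query timeout described above, add queryCommandTimeout to the source. A minimal sketch of just the source block, with a hypothetical 30-minute timeout:

"source": {
    "type": "AzureMySqlSource",
    "query": "SELECT * FROM MyTable",
    "queryCommandTimeout": "00:30:00"
}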

Data type mapping for Azure Database for MySQL


When copying data from Azure Database for MySQL, the following mappings are used from MySQL data types
to Azure Data Factory interim data types. See Schema and data type mappings to learn about how copy activity
maps the source schema and data type to the sink.

AZURE DATABASE FOR MYSQL DATA TYPE DATA FACTORY INTERIM DATA TYPE

bigint Int64

bigint unsigned Decimal

bit Boolean

bit(M), M>1 Byte[]

blob Byte[]

bool Int16

char String

date Datetime

datetime Datetime

decimal Decimal, String



double Double

double precision Double

enum String

float Single

int Int32

int unsigned Int64

integer Int32

integer unsigned Int64

long varbinary Byte[]

long varchar String

longblob Byte[]

longtext String

mediumblob Byte[]

mediumint Int32

mediumint unsigned Int64

mediumtext String

numeric Decimal

real Double

set String

smallint Int16

smallint unsigned Int32

text String

time TimeSpan

timestamp Datetime

tinyblob Byte[]

tinyint Int16

tinyint unsigned Int16

tinytext String

varchar String

year Int32

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Azure Database for PostgreSQL
using Azure Data Factory
3/15/2019

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Azure Database for
PostgreSQL. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from Azure Database for PostgreSQL to any supported sink data store. For a list of data stores
that are supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any
driver to use this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Database for PostgreSQL connector.

Linked service properties


The following properties are supported for Azure Database for PostgreSQL linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: AzurePostgreSql | Yes
connectionString | An ODBC connection string to connect to Azure Database for PostgreSQL. Mark this field as a SecureString to store it securely in Data Factory. You can also put the password in Azure Key Vault and pull the password configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. You can use Azure Integration Runtime or Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. | No

A typical connection string is:

Server=<server>.postgres.database.azure.com;Database=<database>;Port=<port>;UID=<username>;Password=<Password>

More properties you can set, depending on your scenario:

PROPERTY | DESCRIPTION | OPTIONS | REQUIRED
EncryptionMethod (EM) | The method the driver uses to encrypt data sent between the driver and the database server. E.g. EncryptionMethod=<0/1/6>; | 0 (No Encryption) (Default) / 1 (SSL) / 6 (RequestSSL) | No
ValidateServerCertificate (VSC) | Determines whether the driver validates the certificate that is sent by the database server when SSL encryption is enabled (EncryptionMethod=1). E.g. ValidateServerCertificate=<0/1>; | 0 (Disabled) (Default) / 1 (Enabled) | No
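
As an illustration only (assembled from the options above, not taken from a sample in this article), the SSL-related settings are appended to the same ODBC connection string:

Server=<server>.postgres.database.azure.com;Database=<database>;Port=<port>;UID=<username>;Password=<Password>;EncryptionMethod=1;ValidateServerCertificate=1;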

Example:

{
"name": "AzurePostgreSqlLinkedService",
"properties": {
"type": "AzurePostgreSql",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>.postgres.database.azure.com;Database=<database>;Port=<port>;UID=
<username>;Password=<Password>"
}
}
}
}

Example: store password in Azure Key Vault


{
"name": "AzurePostgreSqlLinkedService",
"properties": {
"type": "AzurePostgreSql",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>.postgres.database.azure.com;Database=<database>;Port=<port>;UID=
<username>;"
},
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Azure Database for PostgreSQL dataset.
To copy data from Azure Database for PostgreSQL, set the type property of the dataset to
AzurePostgreSqlTable. The following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: AzurePostgreSqlTable | Yes
tableName | Name of the table. | No (if "query" in activity source is specified)

Example

{
"name": "AzurePostgreSqlDataset",
"properties": {
"type": "AzurePostgreSqlTable",
"linkedServiceName": {
"referenceName": "<AzurePostgreSql linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Azure Database for PostgreSQL source.
Azure Database for PostgreSQL as source
To copy data from Azure Database for PostgreSQL, set the source type in the copy activity to
AzurePostgreSqlSource. The following properties are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: AzurePostgreSqlSource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in dataset is specified)

Example:

"activities":[
{
"name": "CopyFromAzurePostgreSql",
"type": "Copy",
"inputs": [
{
"referenceName": "<AzurePostgreSql input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzurePostgreSqlSource",
"query": "<custom query e.g. SELECT * FROM MyTable>"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from or to Azure File Storage by using
Azure Data Factory
5/6/2019 • 14 minutes to read

This article outlines how to copy data to and from Azure File Storage. To learn about Azure Data Factory, read the
introductory article.

Supported capabilities
This Azure File Storage connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Specifically, this Azure File Storage connector supports copying files as-is or parsing/generating files with the
supported file formats and compression codecs.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure File Storage.

Linked service properties


The following properties are supported for Azure File Storage linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: FileServer. | Yes
host | Specifies the Azure File Storage endpoint. Using UI: specify \\<storage name>.file.core.windows.net\<file service name>. Using JSON: "host": "\\\\<storage name>.file.core.windows.net\\<file service name>". | Yes
userid | Specify the user to access the Azure File Storage. Using UI: specify AZURE\<storage name>. Using JSON: "userid": "AZURE\\<storage name>". | Yes
password | Specify the storage access key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. You can use Azure Integration Runtime or Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. | No for source, Yes for sink

IMPORTANT
To copy data into Azure File Storage using the Azure Integration Runtime, explicitly create an Azure IR with the location of
your File Storage, and associate it in the linked service as in the following example.
To copy data from/to Azure File Storage using a Self-hosted Integration Runtime outside of Azure, remember to open
outbound TCP port 445 in your local network.
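
A minimal sketch of such an Azure IR definition (the name "AzureIRWestUS" and the region "West US" are placeholders chosen for illustration; the JSON shape is assumed to follow the standard integration runtime resource format):

{
    "name": "AzureIRWestUS",
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "West US"
            }
        }
    }
}

The connectVia section of the linked service then references this IR by name, as in the example later in this section.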

TIP
When using the ADF UI for authoring, you can find the specific "Azure File Storage" entry for linked service creation;
underneath, it generates an object of type FileServer.

Example:
{
"name": "AzureFileStorageLinkedService",
"properties": {
"type": "FileServer",
"typeProperties": {
"host": "\\\\<storage name>.file.core.windows.net\\<file service name>",
"userid": "AZURE\\<storage name>",
"password": {
"type": "SecureString",
"value": "<storage access key>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data to and from Azure File Storage in Parquet or delimited text format, refer to Parquet format and
Delimited text format article on format-based dataset and supported settings. The following properties are
supported for Azure File Storage under location settings in format-based dataset:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property under location in the dataset must be set to FileServerLocation. | Yes
folderPath | The path to the folder. If you want to use a wildcard to filter folders, skip this setting and specify it in the activity source settings. | No
fileName | The file name under the given folderPath. If you want to use a wildcard to filter files, skip this setting and specify it in the activity source settings. | No

NOTE
The FileShare type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the
Copy/Lookup/GetMetadata activity for backward compatibility. We suggest that you use the new model going forward;
the ADF authoring UI has switched to generating these new types.

Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Azure File Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "FileServerLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Other format dataset


To copy data to and from Azure File Storage in ORC/Avro/JSON/Binary format, the following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: FileShare | Yes
folderPath | Path to the folder. Wildcard filter is supported; allowed wildcards are: * (matches zero or more characters) and ? (matches zero or a single character). Use ^ to escape if your actual folder name has a wildcard or this escape char inside. Example: rootfolder/subfolder/; see more examples in Folder and file filter examples. | Yes
fileName | Name or wildcard filter for the file(s) under the specified "folderPath". If you don't specify a value for this property, the dataset points to all files in the folder. For the filter, allowed wildcards are: * (matches zero or more characters) and ? (matches zero or a single character). Example 1: "fileName": "*.csv"; Example 2: "fileName": "???20180427.txt". Use ^ to escape if your actual file name has a wildcard or this escape char inside. When fileName isn't specified for an output dataset and preserveHierarchy isn't specified in the activity sink, the copy activity automatically generates the file name with the pattern "Data.[activity run ID GUID].[GUID if FlattenHierarchy].[format if configured].[compression if configured]", e.g. "Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt.gz"; if you copy from a tabular source using the table name instead of a query, the name pattern is "[table name].[format].[compression if configured]", e.g. "MyTable.csv". | No
modifiedDatetimeStart | File filter based on the attribute Last Modified. Files are selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z". Be aware that the overall performance of data movement is impacted by enabling this setting when you filter huge amounts of files. The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, files whose last modified attribute is less than the datetime value are selected. | No
modifiedDatetimeEnd | Same as above. | No
format | If you want to copy files as-is between file-based stores (binary copy), skip the format section in both the input and output dataset definitions. If you want to parse or generate files with a specific format, the following file format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections. | No (only for binary copy scenario)
compression | Specify the type and level of compression for the data. For more information, see Supported file formats and compression codecs. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. | No

TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.

NOTE
If you were using the "fileFilter" property for file filtering, it is still supported as-is, but we suggest that you use the new
filter capability added to "fileName" going forward.

Example:
{
"name": "AzureFileStorageDataset",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<Azure File Storage linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Azure File Storage source and sink.
Azure File Storage as source
For copy from Parquet and delimited text format, refer to Parquet and delimited text format source section.
For copy from other formats like ORC/Avro/JSON/Binary format, refer to Other format source section.
Parquet and delimited text format source
To copy data from Azure File Storage in Parquet or delimited text format, refer to Parquet format and
Delimited text format article on format-based copy activity source and supported settings. The following
properties are supported for Azure File Storage under storeSettings settings in format-based copy source:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property under storeSettings must be set to FileServerReadSetting. | Yes
recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false. | No
wildcardFolderPath | The folder path with wildcard characters to filter source folders. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or a single character); use ^ to escape if your actual folder name has a wildcard or this escape char inside. See more examples in Folder and file filter examples. | No
wildcardFileName | The file name with wildcard characters under the given folderPath/wildcardFolderPath to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or a single character); use ^ to escape if your actual file name has a wildcard or this escape char inside. See more examples in Folder and file filter examples. | Yes if fileName is not specified in the dataset
modifiedDatetimeStart | File filter based on the attribute Last Modified. Files are selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z". The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, files whose last modified attribute is less than the datetime value are selected. | No
modifiedDatetimeEnd | Same as above. | No
maxConcurrentConnections | The number of connections to connect to the storage store concurrently. Specify only when you want to limit concurrent connections to the data store. | No
NOTE
For Parquet/delimited text format, the FileSystemSource type copy activity source mentioned in the next section is still
supported as-is for backward compatibility. We suggest that you use the new model going forward; the ADF authoring UI
has switched to generating these new types.

Example:

"activities":[
{
"name": "CopyFromAzureFileStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "FileServerReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Other format source


To copy data from Azure File Storage in ORC/Avro/JSON/Binary format, the following properties are
supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: FileSystemSource | Yes
recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder will not be copied or created at the sink. Allowed values are: true (default), false | No
maxConcurrentConnections | The number of connections to connect to the storage store concurrently. Specify only when you want to limit concurrent connections to the data store. | No

Example:

"activities":[
{
"name": "CopyFromAzureFileStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure File Storage input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

Azure File Storage as sink


For copy to Parquet and delimited text format, refer to Parquet and delimited text format sink section.
For copy to other formats like ORC/Avro/JSON/Binary format, refer to Other format sink section.
Parquet and delimited text format sink
To copy data to Azure File Storage in Parquet or delimited text format, refer to Parquet format and Delimited
text format article on format-based copy activity sink and supported settings. The following properties are
supported for Azure File Storage under storeSettings settings in format-based copy sink:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property under storeSettings must be set to FileServerWriteSetting. | Yes
copyBehavior | Defines the copy behavior when the source is files from a file-based data store. Allowed values are: - PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder. - FlattenHierarchy: All files from the source folder are in the first level of the target folder. The target files have autogenerated names. - MergeFiles: Merges all files from the source folder to one file. If the file name is specified, the merged file name is the specified name. Otherwise, it's an autogenerated file name. | No
maxConcurrentConnections | The number of connections to connect to the data store concurrently. Specify only when you want to limit concurrent connections to the data store. | No

NOTE
For Parquet/delimited text format, the FileSystemSink type copy activity sink mentioned in the next section is still supported
as-is for backward compatibility. We suggest that you use the new model going forward; the ADF authoring UI has switched
to generating these new types.

Example:
"activities":[
{
"name": "CopyToAzureFileStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "FileServerWriteSetting",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]

Other format sink


To copy data to Azure File Storage in ORC/Avro/JSON/Binary format, the following properties are supported
in the sink section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity sink must be set to: FileSystemSink | Yes
copyBehavior | Defines the copy behavior when the source is files from a file-based data store. Allowed values are: - PreserveHierarchy (default): preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder. - FlattenHierarchy: all files from the source folder are in the first level of the target folder. The target files have autogenerated names. - MergeFiles: merges all files from the source folder to one file. If the File/Blob Name is specified, the merged file name is the specified name; otherwise, it's an autogenerated file name. | No
maxConcurrentConnections | The number of connections to connect to the storage store concurrently. Specify only when you want to limit concurrent connections to the data store. | No

Example:

"activities":[
{
"name": "CopyToAzureFileStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure File Storage output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "FileSystemSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

FOLDERPATH | FILENAME | RECURSIVE | SOURCE FOLDER STRUCTURE AND FILTER RESULT (retrieved files are marked with *)

Folder* | (empty, use default) | false |
FolderA
    File1.csv *
    File2.json *
    Subfolder1
        File3.csv
        File4.json
        File5.csv
AnotherFolderB
    File6.csv

Folder* | (empty, use default) | true |
FolderA
    File1.csv *
    File2.json *
    Subfolder1
        File3.csv *
        File4.json *
        File5.csv *
AnotherFolderB
    File6.csv

Folder* | *.csv | false |
FolderA
    File1.csv *
    File2.json
    Subfolder1
        File3.csv
        File4.json
        File5.csv
AnotherFolderB
    File6.csv

Folder* | *.csv | true |
FolderA
    File1.csv *
    File2.json
    Subfolder1
        File3.csv *
        File4.json
        File5.csv *
AnotherFolderB
    File6.csv

recursive and copyBehavior examples


This section describes the resulting behavior of the Copy operation for different combinations of recursive and
copyBehavior values.

RECURSIVE | COPYBEHAVIOR | SOURCE FOLDER STRUCTURE | RESULTING TARGET
true | preserveHierarchy | Folder1 with File1, File2, and Subfolder1 containing File3, File4, File5 | The target folder Folder1 is created with the same structure as the source: Folder1 with File1, File2, and Subfolder1 containing File3, File4, File5.
true | flattenHierarchy | Folder1 with File1, File2, and Subfolder1 containing File3, File4, File5 | The target Folder1 is created with the following structure: Folder1 with auto-generated names for File1, File2, File3, File4, and File5.
true | mergeFiles | Folder1 with File1, File2, and Subfolder1 containing File3, File4, File5 | The target Folder1 is created with the following structure: Folder1 with File1 + File2 + File3 + File4 + File5 contents merged into one file with an auto-generated file name.
false | preserveHierarchy | Folder1 with File1, File2, and Subfolder1 containing File3, File4, File5 | The target folder Folder1 is created with the following structure: Folder1 with File1 and File2. Subfolder1 with File3, File4, and File5 is not picked up.
false | flattenHierarchy | Folder1 with File1, File2, and Subfolder1 containing File3, File4, File5 | The target folder Folder1 is created with the following structure: Folder1 with auto-generated names for File1 and File2. Subfolder1 with File3, File4, and File5 is not picked up.
false | mergeFiles | Folder1 with File1, File2, and Subfolder1 containing File3, File4, File5 | The target folder Folder1 is created with the following structure: Folder1 with File1 + File2 contents merged into one file with an auto-generated file name. Subfolder1 with File3, File4, and File5 is not picked up.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data to an Azure Search index using Azure
Data Factory
5/24/2019 • 4 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data into Azure Search index. It
builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from any supported source data store into Azure Search index. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Search connector.

Linked service properties


The following properties are supported for Azure Search linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: AzureSearch | Yes
url | URL for the Azure Search service. | Yes
key | Admin key for the Azure Search service. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. You can use Azure Integration Runtime or Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. | No

IMPORTANT
When copying data from a cloud data store into an Azure Search index, in the Azure Search linked service you need to refer
to an Azure Integration Runtime with an explicit region in connectVia. Set the region to the one where your Azure Search
service resides. Learn more from Azure Integration Runtime.

Example:

{
"name": "AzureSearchLinkedService",
"properties": {
"type": "AzureSearch",
"typeProperties": {
"url": "https://<service>.search.windows.net",
"key": {
"type": "SecureString",
"value": "<AdminKey>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Azure Search dataset.
To copy data into Azure Search, the following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: AzureSearchIndex | Yes
indexName | Name of the Azure Search index. Data Factory does not create the index. The index must exist in Azure Search. | Yes

Example:
{
"name": "AzureSearchIndexDataset",
"properties": {
"type": "AzureSearchIndex",
"linkedServiceName": {
"referenceName": "<Azure Search linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties" : {
"indexName": "products"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Azure Search sink.
Azure Search as sink
To copy data into Azure Search, set the sink type in the copy activity to AzureSearchIndexSink. The following
properties are supported in the copy activity sink section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity sink must be set to: AzureSearchIndexSink | Yes
writeBehavior | Specifies whether to merge or replace when a document already exists in the index. See the WriteBehavior property. Allowed values are: Merge (default) and Upload. | No
writeBatchSize | Uploads data into the Azure Search index when the buffer size reaches writeBatchSize. See the WriteBatchSize property for details. Allowed values are: integer 1 to 1,000; default is 1000. | No

WriteBehavior property
AzureSearchSink upserts when writing data. In other words, when writing a document, if the document key
already exists in the Azure Search index, Azure Search updates the existing document rather than throwing a
conflict exception.
The AzureSearchSink provides the following two upsert behaviors (by using AzureSearch SDK):
Merge: combine all the columns in the new document with the existing one. For columns with null value in the
new document, the value in the existing one is preserved.
Upload: The new document replaces the existing one. For columns not specified in the new document, the
value is set to null whether there is a non-null value in the existing document or not.
The default behavior is Merge.
WriteBatchSize Property
Azure Search service supports writing documents as a batch. A batch can contain 1 to 1,000 Actions. An action
handles one document to perform the upload/merge operation.
Example:

"activities":[
{
"name": "CopyToAzureSearch",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Search output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureSearchIndexSink",
"writeBehavior": "Merge"
}
}
}
]
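
As a variation on the sink above, the upsert behavior and the batch size can be tuned together. A minimal sketch (the value 500 is an arbitrary illustration within the allowed 1 to 1,000 range):

"sink": {
    "type": "AzureSearchIndexSink",
    "writeBehavior": "Upload",
    "writeBatchSize": 500
}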

Data type support


The following table specifies whether an Azure Search data type is supported or not.

AZURE SEARCH DATA TYPE SUPPORTED IN AZURE SEARCH SINK

String Y

Int32 Y

Int64 Y

Double Y

Boolean Y

DateTimeOffset Y

String Array N

GeographyPoint N

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data to or from Azure SQL Database by using
Azure Data Factory
5/6/2019 • 14 minutes to read

This article outlines how to copy data to and from Azure SQL Database. To learn about Azure Data Factory, read
the introductory article.

Supported capabilities
This Azure SQL Database connector is supported for the following activities:
Copy activity with supported source/sink matrix table
Mapping data flow
Lookup activity
GetMetadata activity
Specifically, this Azure SQL Database connector supports these functions:
Copy data by using SQL authentication and Azure Active Directory (Azure AD) Application token
authentication with a service principal or managed identities for Azure resources.
As a source, retrieve data by using a SQL query or stored procedure.
As a sink, append data to a destination table or invoke a stored procedure with custom logic during the copy.
Azure SQL Database Always Encrypted is not currently supported.

IMPORTANT
If you copy data by using Azure Data Factory Integration Runtime, configure an Azure SQL server firewall so that Azure
Services can access the server. If you copy data by using a self-hosted integration runtime, configure the Azure SQL server
firewall to allow the appropriate IP range. This range includes the machine's IP that is used to connect to Azure SQL
Database.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to an
Azure SQL Database connector.
Linked service properties
These properties are supported for an Azure SQL Database linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to AzureSqlDatabase. | Yes
connectionString | Specify the information needed to connect to the Azure SQL Database instance for the connectionString property. Mark this field as a SecureString to store it securely in Data Factory. You can also put the password/service principal key in Azure Key Vault, and if it's SQL authentication, pull the password configuration out of the connection string. See the JSON example below the table and the Store credentials in Azure Key Vault article for more details. | Yes
servicePrincipalId | Specify the application's client ID. | Yes, when you use Azure AD authentication with a service principal.
servicePrincipalKey | Specify the application's key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes, when you use Azure AD authentication with a service principal.
tenant | Specify the tenant information (domain name or tenant ID) under which your application resides. Retrieve it by hovering the mouse in the top-right corner of the Azure portal. | Yes, when you use Azure AD authentication with a service principal.
connectVia | The integration runtime to be used to connect to the data store. You can use Azure Integration Runtime or a self-hosted integration runtime if your data store is located in a private network. If not specified, it uses the default Azure Integration Runtime. | No

For different authentication types, refer to the following sections on prerequisites and JSON samples,
respectively:
SQL authentication
Azure AD application token authentication: Service principal
Azure AD application token authentication: Managed identities for Azure resources

TIP
If you hit an error with the error code "UserErrorFailedToConnectToSqlServer" and a message like "The session limit for the
database is XXX and has been reached.", add Pooling=false to your connection string and try again.
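
For example, appended to a SQL authentication connection string such as the one used below (an illustration only; the rest of the string is unchanged):

Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30;Pooling=false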

SQL authentication
Linked service example that uses SQL authentication
{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=
<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Password in Azure Key Vault:

{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=
<username>@<servername>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
},
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Service principal authentication


To use a service principal-based Azure AD application token authentication, follow these steps:
1. Create an Azure Active Directory application from the Azure portal. Make note of the application
name and the following values that define the linked service:
Application ID
Application key
Tenant ID
2. Provision an Azure Active Directory administrator for your Azure SQL server on the Azure portal if
you haven't already done so. The Azure AD administrator must be an Azure AD user or Azure AD group,
but it can't be a service principal. This step is done so that, in the next step, you can use an Azure AD
identity to create a contained database user for the service principal.
3. Create contained database users for the service principal. Connect to the database from or to which you
want to copy data by using tools like SSMS, with an Azure AD identity that has at least ALTER ANY USER
permission. Run the following T-SQL:

CREATE USER [your application name] FROM EXTERNAL PROVIDER;

4. Grant the service principal needed permissions as you normally do for SQL users or others. Run the
following code, or refer to more options here.

EXEC sp_addrolemember [role name], [your application name];

5. Configure an Azure SQL Database linked service in Azure Data Factory.


Linked service example that uses service principal authentication

{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;Connection Timeout=30"
},
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Managed identities for Azure resources authentication


A data factory can be associated with a managed identity for Azure resources that represents the specific data
factory. You can use this managed identity for Azure SQL Database authentication. The designated factory can
access and copy data from or to your database by using this identity.
To use managed identity authentication, follow these steps:
1. Provision an Azure Active Directory administrator for your Azure SQL server on the Azure portal if
you haven't already done so. The Azure AD administrator can be an Azure AD user or Azure AD group. If
you grant the group with managed identity an admin role, skip steps 3 and 4. The administrator will have
full access to the database.
2. Create contained database users for the Data Factory Managed Identity. Connect to the database from
or to which you want to copy data by using tools like SSMS, with an Azure AD identity that has at least
ALTER ANY USER permission. Run the following T-SQL:

CREATE USER [your Data Factory name] FROM EXTERNAL PROVIDER;

3. Grant the Data Factory Managed Identity needed permissions as you normally do for SQL users
and others. Run the following code, or refer to more options here.

EXEC sp_addrolemember [role name], [your Data Factory name];

4. Configure an Azure SQL Database linked service in Azure Data Factory.


Example:

{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;Connection Timeout=30"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Azure SQL Database dataset.
To copy data from or to Azure SQL Database, the following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to AzureSqlTable. | Yes
tableName | The name of the table or view in the Azure SQL Database instance that the linked service refers to. | No for source, Yes for sink

Dataset properties example

{
"name": "AzureSQLDbDataset",
"properties":
{
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "<Azure SQL Database linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"tableName": "MyTable"
}
}
}
Copy Activity properties
For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Azure SQL Database source and sink.
Azure SQL Database as the source
To copy data from Azure SQL Database, set the type property in the Copy Activity source to SqlSource. The
following properties are supported in the Copy Activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the Copy Activity source must be set to SqlSource. | Yes
sqlReaderQuery | Use the custom SQL query to read data. Example: select * from MyTable. | No
sqlReaderStoredProcedureName | The name of the stored procedure that reads data from the source table. The last SQL statement must be a SELECT statement in the stored procedure. | No
storedProcedureParameters | Parameters for the stored procedure. Allowed values are name or value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. | No

Points to note
If the sqlReaderQuery is specified for SqlSource, Copy Activity runs this query against the Azure SQL
Database source to get the data. Or you can specify a stored procedure. Specify
sqlReaderStoredProcedureName and storedProcedureParameters if the stored procedure takes
parameters.
If you don't specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in
the structure section of the dataset JSON are used to construct a query, for example
select column1, column2 from mytable, which runs against Azure SQL Database. If the dataset definition doesn't have
the structure, all columns are selected from the table.
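
For instance, a dataset that carries a structure section like the following sketch (the column names are placeholders for illustration) would cause Copy Activity to construct select column1, column2 from MyTable when neither a query nor a stored procedure is specified:

"structure": [
    { "name": "column1" },
    { "name": "column2" }
],
"typeProperties": {
    "tableName": "MyTable"
}
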
SQL query example
"activities":[
{
"name": "CopyFromAzureSQLDatabase",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure SQL Database input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Stored procedure example

"activities":[
{
"name": "CopyFromAzureSQLDatabase",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure SQL Database input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type":
"Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Stored procedure definition


CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters
(
@stringData varchar(20),
@identifier int
)
AS
SET NOCOUNT ON;
BEGIN
select *
from dbo.UnitTestSrcTable
where dbo.UnitTestSrcTable.stringData != stringData
and dbo.UnitTestSrcTable.identifier != identifier
END
GO

Azure SQL Database as the sink


To copy data to Azure SQL Database, set the type property in the Copy Activity sink to SqlSink. The following
properties are supported in the Copy Activity sink section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the Copy Activity sink must be set to SqlSink. | Yes
writeBatchSize | Number of rows to insert into the SQL table per batch. The allowed value is integer (number of rows). By default, Data Factory dynamically determines the appropriate batch size based on the row size. | No
writeBatchTimeout | The wait time for the batch insert operation to finish before it times out. The allowed value is timespan. Example: "00:30:00" (30 minutes). | No
preCopyScript | Specify a SQL query for Copy Activity to run before writing data into Azure SQL Database. It's only invoked once per copy run. Use this property to clean up preloaded data. | No
sqlWriterStoredProcedureName | The name of the stored procedure that defines how to apply source data to a target table. An example is to do upserts or transform by using your own business logic. This stored procedure is invoked per batch. For operations that run only once and have nothing to do with source data, such as delete or truncate, use the preCopyScript property. | No
storedProcedureParameters | Parameters for the stored procedure. Allowed values are name and value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. | No
sqlWriterTableType | Specify a table type name to be used in the stored procedure. Copy Activity makes the data being moved available in a temporary table with this table type. Stored procedure code can then merge the data being copied with existing data. | No

TIP
When you copy data to Azure SQL Database, Copy Activity appends data to the sink table by default. To do an upsert or
additional business logic, use the stored procedure in SqlSink. Learn more details from Invoking stored procedure from
SQL Sink.

Append data example

"activities":[
{
"name": "CopyToAzureSQLDatabase",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure SQL Database output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 100000
}
}
}
]

Invoke a stored procedure during copy for upsert example


Learn more details from Invoking stored procedure from SQL Sink.
"activities":[
{
"name": "CopyToAzureSQLDatabase",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure SQL Database output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SqlSink",
"sqlWriterStoredProcedureName": "CopyTestStoredProcedureWithParameters",
"sqlWriterTableType": "CopyTestTableType",
"storedProcedureParameters": {
"identifier": { "value": "1", "type": "Int" },
"stringData": { "value": "str1" }
}
}
}
}
]

Identity columns in the target database


This section shows you how to copy data from a source table without an identity column to a destination table
with an identity column.
Source table

create table dbo.SourceTbl


(
name varchar(100),
age int
)

Destination table

create table dbo.TargetTbl


(
identifier int identity(1,1),
name varchar(100),
age int
)

NOTE
The target table has an identity column.

Source dataset JSON definition


{
"name": "SampleSource",
"properties": {
"type": " AzureSqlTable",
"linkedServiceName": {
"referenceName": "TestIdentitySQL",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "SourceTbl"
}
}
}

Destination dataset JSON definition

{
"name": "SampleTarget",
"properties": {
"structure": [
{ "name": "name" },
{ "name": "age" }
],
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "TestIdentitySQL",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "TargetTbl"
}
}
}

NOTE
Your source and target tables have different schemas.

The target has an additional identity column. In this scenario, you must specify the structure property in
the target dataset definition, which doesn't include the identity column.

Invoke stored procedure from SQL sink


When you copy data into Azure SQL Database, you can also configure and invoke a user-specified stored
procedure with additional parameters.
You can use a stored procedure when built-in copy mechanisms don't serve the purpose. They're typically used
when an upsert, insert plus update, or extra processing must be done before the final insertion of source data into
the destination table. Some extra processing examples are merge columns, look up additional values, and
insertion into more than one table.
The following sample shows how to use a stored procedure to do an upsert into a table in Azure SQL Database.
Assume that input data and the sink Marketing table each have three columns: ProfileID, State, and Category.
Do the upsert based on the ProfileID column, and only apply it for a specific category.
Output dataset: the "tableName" should be the same table type parameter name in your stored procedure (see
below stored procedure script).
{
"name": "AzureSQLDbDataset",
"properties":
{
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "<Azure SQL Database linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "Marketing"
}
}
}

Define the SQL sink section in copy activity as follows.

"sink": {
"type": "SqlSink",
"SqlWriterTableType": "MarketingType",
"SqlWriterStoredProcedureName": "spOverwriteMarketing",
"storedProcedureParameters": {
"category": {
"value": "ProductA"
}
}
}

In your database, define the stored procedure with the same name as the SqlWriterStoredProcedureName. It
handles input data from your specified source and merges into the output table. The parameter name of the table
type in the stored procedure should be the same as the tableName defined in the dataset.

CREATE PROCEDURE spOverwriteMarketing @Marketing [dbo].[MarketingType] READONLY, @category varchar(256)


AS
BEGIN
MERGE [dbo].[Marketing] AS target
USING @Marketing AS source
ON (target.ProfileID = source.ProfileID and target.Category = @category)
WHEN MATCHED THEN
UPDATE SET State = source.State
WHEN NOT MATCHED THEN
INSERT (ProfileID, State, Category)
VALUES (source.ProfileID, source.State, source.Category);
END

In your database, define the table type with the same name as the sqlWriterTableType. The schema of the table
type should be same as the schema returned by your input data.

CREATE TYPE [dbo].[MarketingType] AS TABLE(


[ProfileID] [varchar](256) NOT NULL,
[State] [varchar](256) NOT NULL,
[Category] [varchar](256) NOT NULL
)

The stored procedure feature takes advantage of Table-Valued Parameters.

Mapping Data Flow properties


Learn details from source transformation and sink transformation in Mapping Data Flow.
Data type mapping for Azure SQL Database
When you copy data from or to Azure SQL Database, the following mappings are used from Azure SQL
Database data types to Azure Data Factory interim data types. See Schema and data type mappings to learn how
Copy Activity maps the source schema and data type to the sink.

AZURE SQL DATABASE DATA TYPE DATA FACTORY INTERIM DATA TYPE

bigint Int64

binary Byte[]

bit Boolean

char String, Char[]

date DateTime

Datetime DateTime

datetime2 DateTime

Datetimeoffset DateTimeOffset

Decimal Decimal

FILESTREAM attribute (varbinary(max)) Byte[]

Float Double

image Byte[]

int Int32

money Decimal

nchar String, Char[]

ntext String, Char[]

numeric Decimal

nvarchar String, Char[]

real Single

rowversion Byte[]

smalldatetime DateTime

smallint Int16

smallmoney Decimal

sql_variant Object

text String, Char[]

time TimeSpan

timestamp Byte[]

tinyint Byte

uniqueidentifier Guid

varbinary Byte[]

varchar String, Char[]

xml Xml

NOTE
For data types that map to the Decimal interim type, ADF currently supports precision up to 28. If you have data with
precision larger than 28, consider converting it to a string in the SQL query.
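
A minimal sketch of such a query (the table and column names are placeholders for illustration):

SELECT CAST(HighPrecisionColumn AS VARCHAR(50)) AS HighPrecisionColumn, OtherColumn
FROM MyTable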

Next steps
For a list of data stores supported as sources and sinks by Copy Activity in Azure Data Factory, see Supported
data stores and formats.
Copy data to and from Azure SQL Database
Managed Instance by using Azure Data Factory
5/6/2019 • 12 minutes to read

This article outlines how to use the copy activity in Azure Data Factory to copy data to and from Azure SQL
Database Managed Instance. It builds on the Copy activity overview article that presents a general overview of the
copy activity.

Supported capabilities
You can copy data from Azure SQL Database Managed Instance to any supported sink data store. You also can
copy data from any supported source data store to the managed instance. For a list of data stores that are
supported as sources and sinks by the copy activity, see the Supported data stores table.
Specifically, this Azure SQL Database Managed Instance connector supports:
Copying data by using SQL or Windows authentication.
As a source, retrieving data by using a SQL query or stored procedure.
As a sink, appending data to a destination table or invoking a stored procedure with custom logic during copy.
SQL Server Always Encrypted is not currently supported.

Prerequisites
To copy data from an Azure SQL Database Managed Instance that's located in a virtual network, set up a self-
hosted integration runtime that can access the database. For more information, see Self-hosted integration
runtime.
If you provision your self-hosted integration runtime in the same virtual network as your managed instance, make
sure that your integration runtime machine is in a different subnet than your managed instance. If you provision
your self-hosted integration runtime in a different virtual network than your managed instance, you can use either
a virtual network peering or virtual network to virtual network connection. For more information, see Connect
your application to Azure SQL Database Managed Instance.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to the
Azure SQL Database Managed Instance connector.
Linked service properties
The following properties are supported for the Azure SQL Database Managed Instance linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to SqlServer. | Yes
connectionString | This property specifies the connectionString information that's needed to connect to the managed instance by using either SQL authentication or Windows authentication. For more information, see the following examples. Mark this field as a SecureString to store it securely in Data Factory. You can also put the password in Azure Key Vault, and if it's SQL authentication, pull the password configuration out of the connection string. See the JSON example below the table and the Store credentials in Azure Key Vault article for more details. | Yes
userName | This property specifies a user name if you use Windows authentication. An example is domainname\username. | No
password | This property specifies a password for the user account you specified for the user name. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No
connectVia | This integration runtime is used to connect to the data store. Provision the self-hosted integration runtime in the same virtual network as your managed instance. | Yes

TIP
You might see the error code "UserErrorFailedToConnectToSqlServer" with a message like "The session limit for the database
is XXX and has been reached." If this error occurs, add Pooling=false to your connection string and try again.
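
For reference, a SQL authentication connection string with pooling disabled is sketched below; it reuses the placeholders from the examples in this article, and only the Pooling=false segment is added.

"connectionString": {
    "type": "SecureString",
    "value": "Data Source=<servername>\\<instance name if using named instance>;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;Pooling=false;"
}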

Example 1: Use SQL authentication


{
"name": "AzureSqlMILinkedService",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Data Source=<servername>\\<instance name if using named instance>;Initial Catalog=
<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: Use SQL authentication with password in Azure Key Vault

{
"name": "AzureSqlMILinkedService",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Data Source=<servername>\\<instance name if using named instance>;Initial Catalog=
<databasename>;Integrated Security=False;User ID=<username>;"
},
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 3: Use Windows authentication


{
"name": "AzureSqlMILinkedService",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Data Source=<servername>\\<instance name if using named instance>;Initial Catalog=
<databasename>;Integrated Security=True;"
},
"userName": "<domain\\username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for use to define datasets, see the datasets article. This section
provides a list of properties supported by the Azure SQL Database Managed Instance dataset.
To copy data to and from Azure SQL Database Managed Instance, the following properties are supported:

PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset must Yes.


be set to SqlServerTable.

tableName This property is the name of the table No for source. Yes for sink.
or view in the database instance that
the linked service refers to.

Example

{
"name": "AzureSqlMIDataset",
"properties":
{
"type": "SqlServerTable",
"linkedServiceName": {
"referenceName": "<Managed Instance linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"tableName": "MyTable"
}
}
}

Copy activity properties


For a full list of sections and properties available for use to define activities, see the Pipelines article. This section
provides a list of properties supported by the Azure SQL Database Managed Instance source and sink.
Azure SQL Database Managed Instance as a source
To copy data from Azure SQL Database Managed Instance, set the source type in the copy activity to SqlSource.
The following properties are supported in the copy activity source section:

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes.


source must be set to SqlSource.

sqlReaderQuery This property uses the custom SQL No.


query to read data. An example is
select * from MyTable .

sqlReaderStoredProcedureName This property is the name of the stored No.


procedure that reads data from the
source table. The last SQL statement
must be a SELECT statement in the
stored procedure.

storedProcedureParameters These parameters are for the stored No.


procedure.
Allowed values are name or value pairs.
The names and casing of the
parameters must match the names and
casing of the stored procedure
parameters.

Note the following points:


If sqlReaderQuery is specified for SqlSource, the copy activity runs this query against the managed instance
source to get the data. You also can specify a stored procedure by specifying
sqlReaderStoredProcedureName and storedProcedureParameters if the stored procedure takes
parameters.
If you don't specify either the sqlReaderQuery or sqlReaderStoredProcedureName property, the columns
defined in the "structure" section of the dataset JSON are used to construct a query. The query
select column1, column2 from mytable runs against the managed instance. If the dataset definition doesn't have
"structure," all columns are selected from the table.
Example: Use a SQL query
"activities":[
{
"name": "CopyFromAzureSqlMI",
"type": "Copy",
"inputs": [
{
"referenceName": "<Managed Instance input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Example: Use a stored procedure

"activities":[
{
"name": "CopyFromAzureSqlMI",
"type": "Copy",
"inputs": [
{
"referenceName": "<Managed Instance input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type": "Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

The stored procedure definition


CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters
(
@stringData varchar(20),
@identifier int
)
AS
SET NOCOUNT ON;
BEGIN
select *
from dbo.UnitTestSrcTable
where dbo.UnitTestSrcTable.stringData != @stringData
and dbo.UnitTestSrcTable.identifier != @identifier
END
GO

Azure SQL Database Managed Instance as a sink


To copy data to Azure SQL Database Managed Instance, set the sink type in the copy activity to SqlSink. The
following properties are supported in the copy activity sink section:

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes.


sink must be set to SqlSink.

writeBatchSize Number of rows to insert into the SQL No.


table per batch.
Allowed values are integers for the
number of rows. By default, Data
Factory dynamically determines the
appropriate batch size based on the
row size.

writeBatchTimeout This property specifies the wait time for No.


the batch insert operation to complete
before it times out.
Allowed values are for the time span.
An example is “00:30:00,” which is 30
minutes.

preCopyScript This property specifies a SQL query for No.


the copy activity to execute before
writing data into the managed instance.
It's invoked only once per copy run.
You can use this property to clean up
preloaded data.

sqlWriterStoredProcedureName This name is for the stored procedure No.


that defines how to apply source data
into the target table. Examples of
procedures are to do upserts or
transforms by using your own business
logic.

This stored procedure is invoked per


batch. To do an operation that runs
only once and has nothing to do with
source data, for example, delete or
truncate, use the preCopyScript
property.

storedProcedureParameters These parameters are used for the No.


stored procedure.
Allowed values are name or value pairs.
The names and casing of the
parameters must match the names and
casing of the stored procedure
parameters.

sqlWriterTableType This property specifies a table type No.


name to be used in the stored
procedure. The copy activity makes the
data being moved available in a temp
table with this table type. Stored
procedure code can then merge the
data that's being copied with existing
data.

TIP
When data is copied to Azure SQL Database Managed Instance, the copy activity appends data to the sink table by default.
To perform an upsert or additional business logic, use the stored procedure in SqlSink. For more information, see Invoke a
stored procedure from a SQL sink.

Example 1: Append data

"activities":[
{
"name": "CopyToAzureSqlMI",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Managed Instance output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 100000
}
}
}
]
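
If you need to clear the destination before each run, the preCopyScript property can be combined with the same sink. The following is only a sketch; the schema and table name are placeholders.

"sink": {
    "type": "SqlSink",
    "preCopyScript": "TRUNCATE TABLE <schema>.<target table name>",
    "writeBatchSize": 100000
}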

Example 2: Invoke a stored procedure during copy for upsert


Learn more details from Invoke a stored procedure from a SQL sink.
"activities":[
{
"name": "CopyToAzureSqlMI",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Managed Instance output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SqlSink",
"sqlWriterStoredProcedureName": "CopyTestStoredProcedureWithParameters",
"sqlWriterTableType": "CopyTestTableType",
"storedProcedureParameters": {
"identifier": { "value": "1", "type": "Int" },
"stringData": { "value": "str1" }
}
}
}
}
]

Identity columns in the target database


The following example copies data from a source table with no identity column to a destination table with an
identity column.
Source table

create table dbo.SourceTbl


(
name varchar(100),
age int
)

Destination table

create table dbo.TargetTbl


(
identifier int identity(1,1),
name varchar(100),
age int
)

Notice that the target table has an identity column.


Source dataset JSON definition
{
"name": "SampleSource",
"properties": {
"type": " SqlServerTable",
"linkedServiceName": {
"referenceName": "TestIdentitySQL",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "SourceTbl"
}
}
}

Destination dataset JSON definition

{
"name": "SampleTarget",
"properties": {
"structure": [
{ "name": "name" },
{ "name": "age" }
],
"type": "SqlServerTable",
"linkedServiceName": {
"referenceName": "TestIdentitySQL",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "TargetTbl"
}
}
}

Notice that your source and target tables have different schemas: the target table has an identity column. In this
scenario, specify the "structure" property in the target dataset definition and omit the identity column from it.

Invoke a stored procedure from a SQL sink


When data is copied into Azure SQL Database Managed Instance, a stored procedure can be configured and
invoked with additional parameters that you specify.
You can use a stored procedure when built-in copy mechanisms don't serve the purpose. It's typically used when
an upsert (update + insert) or extra processing must be done before the final insertion of source data in the
destination table. Extra processing can include tasks such as merging columns, looking up additional values, and
insertion into multiple tables.
The following sample shows how to use a stored procedure to do an upsert into a table in the SQL Server
database. Assume that input data and the sink Marketing table each have three columns: ProfileID, State, and
Category. Do the upsert based on the ProfileID column, and only apply it for a specific category.
Output dataset: the "tableName" must be set to the same name as the table type parameter in your stored
procedure (see the stored procedure script below).
{
"name": "AzureSqlMIDataset",
"properties":
{
"type": "SqlServerTable",
"linkedServiceName": {
"referenceName": "<Managed Instance linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "Marketing"
}
}
}

Define the SQL sink section in copy activity as follows.

"sink": {
"type": "SqlSink",
"SqlWriterTableType": "MarketingType",
"SqlWriterStoredProcedureName": "spOverwriteMarketing",
"storedProcedureParameters": {
"category": {
"value": "ProductA"
}
}
}

In your database, define the stored procedure with the same name as SqlWriterStoredProcedureName. It
handles input data from your specified source and merges it into the output table. The parameter name of the table
type in the stored procedure should be the same as the tableName defined in the dataset.

CREATE PROCEDURE spOverwriteMarketing @Marketing [dbo].[MarketingType] READONLY, @category varchar(256)


AS
BEGIN
MERGE [dbo].[Marketing] AS target
USING @Marketing AS source
ON (target.ProfileID = source.ProfileID and target.Category = @category)
WHEN MATCHED THEN
UPDATE SET State = source.State
WHEN NOT MATCHED THEN
INSERT (ProfileID, State, Category)
VALUES (source.ProfileID, source.State, source.Category);
END

In your database, define the table type with the same name as sqlWriterTableType. The schema of the table type is
the same as the schema returned by your input data.

CREATE TYPE [dbo].[MarketingType] AS TABLE(


[ProfileID] [varchar](256) NOT NULL,
[State] [varchar](256) NOT NULL,
[Category] [varchar](256) NOT NULL
)

The stored procedure feature takes advantage of table-valued parameters.

Data type mapping for Azure SQL Database Managed Instance


When data is copied to and from Azure SQL Database Managed Instance, the following mappings are used from
Azure SQL Database Managed Instance data types to Azure Data Factory interim data types. To learn how the
copy activity maps from the source schema and data type to the sink, see Schema and data type mappings.

AZURE SQL DATABASE MANAGED INSTANCE DATA TYPE AZURE DATA FACTORY INTERIM DATA TYPE

bigint Int64

binary Byte[]

bit Boolean

char String, Char[]

date DateTime

Datetime DateTime

datetime2 DateTime

Datetimeoffset DateTimeOffset

Decimal Decimal

FILESTREAM attribute (varbinary(max)) Byte[]

Float Double

image Byte[]

int Int32

money Decimal

nchar String, Char[]

ntext String, Char[]

numeric Decimal

nvarchar String, Char[]

real Single

rowversion Byte[]

smalldatetime DateTime

smallint Int16

smallmoney Decimal

sql_variant Object

text String, Char[]

time TimeSpan

timestamp Byte[]

tinyint Int16

uniqueidentifier Guid

varbinary Byte[]

varchar String, Char[]

xml Xml

NOTE
For data types that map to the Decimal interim type, currently Azure Data Factory supports precision up to 28. If you have
data that requires precision larger than 28, consider converting to a string in a SQL query.
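
For example, a higher-precision column could be brought across as a string by casting it in the source query. The snippet below is only a sketch; the table, column, and length values are placeholders.

"source": {
    "type": "SqlSource",
    "sqlReaderQuery": "SELECT CAST(<high precision column> AS VARCHAR(50)) AS <column alias> FROM <schema>.<table name>"
}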

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see Supported
data stores.
Copy data to or from Azure SQL Data Warehouse
by using Azure Data Factory
5/31/2019 • 18 minutes to read

This article outlines how to copy data to and from Azure SQL Data Warehouse. To learn about Azure Data
Factory, read the introductory article.

Supported capabilities
This Azure SQL Data Warehouse connector is supported for the following activities:
Copy activity with supported source/sink matrix table
Mapping data flow
Lookup activity
GetMetadata activity
Specifically, this Azure SQL Data Warehouse connector supports these functions:
Copy data by using SQL authentication and Azure Active Directory (Azure AD) Application token
authentication with a service principal or managed identities for Azure resources.
As a source, retrieve data by using a SQL query or stored procedure.
As a sink, load data by using PolyBase or a bulk insert. We recommend PolyBase for better copy performance.

IMPORTANT
If you copy data by using Azure Data Factory Integration Runtime, configure an Azure SQL server firewall so that Azure
services can access the server. If you copy data by using a self-hosted integration runtime, configure the Azure SQL server
firewall to allow the appropriate IP range. This range includes the machine's IP that is used to connect to Azure SQL
Database.

Get started
TIP
To achieve best performance, use PolyBase to load data into Azure SQL Data Warehouse. The Use PolyBase to load data
into Azure SQL Data Warehouse section has details. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data
Warehouse under 15 minutes with Azure Data Factory.

You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that define Data Factory entities specific to an Azure SQL
Data Warehouse connector.

Linked service properties


The following properties are supported for an Azure SQL Data Warehouse linked service:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to Yes


AzureSqlDW.

connectionString Specify the information needed to Yes


connect to the Azure SQL Data
Warehouse instance for the
connectionString property.
Mark this field as a SecureString to store it securely in Data Factory. You can also put the password or service
principal key in Azure Key Vault and, if you use SQL authentication, pull the password configuration out of the
connection string. See the JSON example below the table and the Store credentials in Azure Key Vault article
for more details.

servicePrincipalId Specify the application's client ID. Yes, when you use Azure AD
authentication with a service principal.

servicePrincipalKey Specify the application's key. Mark this Yes, when you use Azure AD
field as a SecureString to store it authentication with a service principal.
securely in Data Factory, or reference a
secret stored in Azure Key Vault.

tenant Specify the tenant information (domain Yes, when you use Azure AD
name or tenant ID) under which your authentication with a service principal.
application resides. You can retrieve it
by hovering the mouse in the top-right
corner of the Azure portal.

connectVia The integration runtime to be used to No


connect to the data store. You can use
Azure Integration Runtime or a self-
hosted integration runtime (if your data
store is located in a private network). If
not specified, it uses the default Azure
Integration Runtime.

For different authentication types, refer to the following sections on prerequisites and JSON samples,
respectively:
SQL authentication
Azure AD application token authentication: Service principal
Azure AD application token authentication: Managed identities for Azure resources
TIP
If you hit an error with the error code "UserErrorFailedToConnectToSqlServer" and a message like "The session limit for the
database is XXX and has been reached.", add Pooling=false to your connection string and try again.

SQL authentication
Linked service example that uses SQL authentication

{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=
<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Password in Azure Key Vault:

{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=
<username>@<servername>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
},
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Service principal authentication


To use service principal-based Azure AD application token authentication, follow these steps:
1. Create an Azure Active Directory application from the Azure portal. Make note of the application
name and the following values that define the linked service:
Application ID
Application key
Tenant ID
2. Provision an Azure Active Directory administrator for your Azure SQL server on the Azure portal if
you haven't already done so. The Azure AD administrator can be an Azure AD user or Azure AD group. If
you grant the group with managed identity an admin role, skip steps 3 and 4. The administrator will have
full access to the database.
3. Create contained database users for the service principal. Connect to the data warehouse from or to
which you want to copy data by using tools like SSMS, with an Azure AD identity that has at least ALTER
ANY USER permission. Run the following T-SQL:

CREATE USER [your application name] FROM EXTERNAL PROVIDER;

4. Grant the service principal needed permissions as you normally do for SQL users or others. Run the
following code, or refer to more options here. If you want to use PolyBase to load the data, learn the
required database permission.

EXEC sp_addrolemember db_owner, [your application name];

5. Configure an Azure SQL Data Warehouse linked service in Azure Data Factory.
Linked service example that uses service principal authentication

{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;Connection Timeout=30"
},
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Managed identities for Azure resources authentication


A data factory can be associated with a managed identity for Azure resources that represents the specific factory.
You can use this managed identity for Azure SQL Data Warehouse authentication. The designated factory can
access and copy data from or to your data warehouse by using this identity.
To use managed identity authentication, follow these steps:
1. Provision an Azure Active Directory administrator for your Azure SQL server on the Azure portal if
you haven't already done so. The Azure AD administrator can be an Azure AD user or Azure AD group. If
you grant the group with managed identity an admin role, skip steps 3 and 4. The administrator will have
full access to the database.
2. Create contained database users for the Data Factory Managed Identity. Connect to the data warehouse
from or to which you want to copy data by using tools like SSMS, with an Azure AD identity that has at
least ALTER ANY USER permission. Run the following T-SQL.

CREATE USER [your Data Factory name] FROM EXTERNAL PROVIDER;

3. Grant the Data Factory Managed Identity needed permissions as you normally do for SQL users and
others. Run the following code, or refer to more options here. If you want to use PolyBase to load the data,
learn the required database permission.

EXEC sp_addrolemember db_owner, [your Data Factory name];

4. Configure an Azure SQL Data Warehouse linked service in Azure Data Factory.
Example:

{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;Connection Timeout=30"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Azure SQL Data Warehouse dataset.
To copy data from or to Azure SQL Data Warehouse, the following properties are supported:

PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset must Yes


be set to AzureSqlDWTable.

tableName The name of the table or view in the No for source, Yes for sink
Azure SQL Data Warehouse instance
that the linked service refers to.

Dataset properties example


{
"name": "AzureSQLDWDataset",
"properties":
{
"type": "AzureSqlDWTable",
"linkedServiceName": {
"referenceName": "<Azure SQL Data Warehouse linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"tableName": "MyTable"
}
}
}

Copy Activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Azure SQL Data Warehouse source and sink.
Azure SQL Data Warehouse as the source
To copy data from Azure SQL Data Warehouse, set the type property in the Copy Activity source to
SqlDWSource. The following properties are supported in the Copy Activity source section:

PROPERTY DESCRIPTION REQUIRED

type The type property of the Copy Activity Yes


source must be set to SqlDWSource.

sqlReaderQuery Use the custom SQL query to read No


data. Example:
select * from MyTable .

sqlReaderStoredProcedureName The name of the stored procedure that No


reads data from the source table. The
last SQL statement must be a SELECT
statement in the stored procedure.

storedProcedureParameters Parameters for the stored procedure. No


Allowed values are name or value pairs.
Names and casing of parameters must
match the names and casing of the
stored procedure parameters.

Points to note
If the sqlReaderQuery is specified for the SqlDWSource, the Copy Activity runs this query against the Azure
SQL Data Warehouse source to get the data. Or you can specify a stored procedure. Specify the
sqlReaderStoredProcedureName and storedProcedureParameters if the stored procedure takes
parameters.
If you don't specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in
the structure section of the dataset JSON are used to construct a query.
select column1, column2 from mytable runs against Azure SQL Data Warehouse. If the dataset definition
doesn't have the structure, all columns are selected from the table.
SQL query example
"activities":[
{
"name": "CopyFromAzureSQLDW",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure SQL DW input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Stored procedure example

"activities":[
{
"name": "CopyFromAzureSQLDW",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure SQL DW input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type":
"Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Stored procedure definition


CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters
(
@stringData varchar(20),
@identifier int
)
AS
SET NOCOUNT ON;
BEGIN
select *
from dbo.UnitTestSrcTable
where dbo.UnitTestSrcTable.stringData != @stringData
and dbo.UnitTestSrcTable.identifier != @identifier
END
GO

Azure SQL Data Warehouse as sink


To copy data to Azure SQL Data Warehouse, set the sink type in Copy Activity to SqlDWSink. The following
properties are supported in the Copy Activity sink section:

PROPERTY DESCRIPTION REQUIRED

type The type property of the Copy Activity Yes


sink must be set to SqlDWSink.

allowPolyBase Indicates whether to use PolyBase, No


when applicable, instead of the
BULKINSERT mechanism.

We recommend that you load data into


SQL Data Warehouse by using
PolyBase. See the Use PolyBase to load
data into Azure SQL Data Warehouse
section for constraints and details.

Allowed values are True and False


(default).

polyBaseSettings A group of properties that can be No


specified when the allowPolybase
property is set to true.

rejectValue Specifies the number or percentage of No


rows that can be rejected before the
query fails.

Learn more about PolyBase’s reject


options in the Arguments section of
CREATE EXTERNAL TABLE (Transact-
SQL).

Allowed values are 0 (default), 1, 2, etc.

rejectType Specifies whether the rejectValue No


option is a literal value or a percentage.

Allowed values are Value (default) and


Percentage.

rejectSampleValue Determines the number of rows to Yes, if the rejectType is percentage.


retrieve before PolyBase recalculates
the percentage of rejected rows.

Allowed values are 1, 2, etc.

useTypeDefault Specifies how to handle missing values No


in delimited text files when PolyBase
retrieves data from the text file.

Learn more about this property from


the Arguments section in CREATE
EXTERNAL FILE FORMAT (Transact-
SQL).

Allowed values are True and False


(default).

See troubleshooting tips related to


this setting.

writeBatchSize Number of rows to insert into the SQL No


table per batch. Applies only when
PolyBase isn't used.

The allowed value is integer (number


of rows). By default, Data Factory
dynamically determines the appropriate
batch size based on the row size.

writeBatchTimeout Wait time for the batch insert No


operation to finish before it times out.
Applies only when PolyBase isn't used.

The allowed value is timespan.


Example: “00:30:00” (30 minutes).

preCopyScript Specify a SQL query for Copy Activity No


to run before writing data into Azure
SQL Data Warehouse in each run. Use
this property to clean up the preloaded
data.

SQL Data Warehouse sink example

"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"polyBaseSettings":
{
"rejectType": "percentage",
"rejectValue": 10.0,
"rejectSampleValue": 100,
"useTypeDefault": true
}
}

Learn more about how to use PolyBase to efficiently load SQL Data Warehouse in the next section.
Use PolyBase to load data into Azure SQL Data Warehouse
Using PolyBase is an efficient way to load a large amount of data into Azure SQL Data Warehouse with high
throughput. You'll see a large gain in the throughput by using PolyBase instead of the default BULKINSERT
mechanism. See Performance reference for a detailed comparison. For a walkthrough with a use case, see Load 1
TB into Azure SQL Data Warehouse.
If your source data is in Azure Blob, Azure Data Lake Storage Gen1 or Azure Data Lake Storage Gen2,
and the format is PolyBase compatible, you can use copy activity to directly invoke PolyBase to let Azure
SQL Data Warehouse pull the data from source. For details, see Direct copy by using PolyBase.
If your source data store and format isn't originally supported by PolyBase, use the Staged copy by using
PolyBase feature instead. The staged copy feature also provides better throughput. It automatically
converts the data into PolyBase-compatible format. And it stores the data in Azure Blob storage. It then loads
the data into SQL Data Warehouse.

TIP
Learn more on Best practices for using PolyBase.

Direct copy by using PolyBase


SQL Data Warehouse PolyBase directly supports Azure Blob, Azure Data Lake Storage Gen1 and Azure Data
Lake Storage Gen2. If your source data meets the criteria described in this section, use PolyBase to copy directly
from the source data store to Azure SQL Data Warehouse. Otherwise, use Staged copy by using PolyBase.

TIP
To copy data efficiently to SQL Data Warehouse, learn more from Azure Data Factory makes it even easier and convenient
to uncover insights from data when using Data Lake Store with SQL Data Warehouse.

If the requirements aren't met, Azure Data Factory checks the settings and automatically falls back to the
BULKINSERT mechanism for the data movement.
1. The source linked service is with the following types and authentication methods:

SUPPORTED SOURCE DATA STORE TYPE SUPPORTED SOURCE AUTHENTICATION TYPE

Azure Blob Account key authentication, managed identity


authentication

Azure Data Lake Storage Gen1 Service principal authentication

Azure Data Lake Storage Gen2 Account key authentication, managed identity
authentication

IMPORTANT
If your Azure Storage is configured with VNet service endpoint, you must use managed identity authentication.
Refer to Impact of using VNet Service Endpoints with Azure storage

2. The source data format is Parquet, ORC, or delimited text, with the following configurations:
a. The folder path doesn't contain a wildcard filter.
b. The file name points to a single file or is * or *.* .
c. rowDelimiter must be \n.
d. nullValue is either set to empty string ("") or left as default, and treatEmptyAsNull is left as default or
set to true.
e. encodingName is set to utf-8, which is the default value.
f. quoteChar , escapeChar , and skipLineCount aren't specified. PolyBase supports skipping a header row,
which can be configured as firstRowAsHeader in ADF.
g. compression can be no compression, GZip, or Deflate.

"activities":[
{
"name": "CopyFromAzureBlobToSQLDataWarehouseViaPolyBase",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSQLDWDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
}
}
}
]
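
The activity above references a Blob dataset. As a sketch only, a delimited-text Blob dataset that satisfies the criteria above might look like the following; the folder and file values are placeholders, and the exact format property names should be verified against the Azure Blob Storage connector article.

{
    "name": "BlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "<Azure Blob storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "<container>/<folder without wildcards>",
            "fileName": "*",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "rowDelimiter": "\n",
                "nullValue": "",
                "encodingName": "UTF-8",
                "firstRowAsHeader": true
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        }
    }
}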

Staged copy by using PolyBase


When your source data doesn’t meet the criteria in the previous section, enable data copying via an interim
staging Azure Blob storage instance. It can't be Azure Premium Storage. In this case, Azure Data Factory
automatically runs transformations on the data to meet the data format requirements of PolyBase. Then it uses
PolyBase to load data into SQL Data Warehouse. Finally, it cleans up your temporary data from the blob storage.
See Staged copy for details about copying data via a staging Azure Blob storage instance.
To use this feature, create an Azure Storage linked service that refers to the Azure storage account with the
interim blob storage. Then specify the enableStaging and stagingSettings properties for the Copy Activity as
shown in the following code:
"activities":[
{
"name": "CopyFromSQLServerToSQLDataWarehouseViaPolyBase",
"type": "Copy",
"inputs": [
{
"referenceName": "SQLServerDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSQLDWDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "MyStagingBlob",
"type": "LinkedServiceReference"
}
}
}
}
]

Best practices for using PolyBase


The following sections provide best practices in addition to those mentioned in Best practices for Azure SQL Data
Warehouse.
Required database permission
To use PolyBase, the user that loads data into SQL Data Warehouse must have "CONTROL" permission on the
target database. One way to achieve that is to add the user as a member of the db_owner role. Learn how to do
that in the SQL Data Warehouse overview.
Row size and data type limits
PolyBase loads are limited to rows smaller than 1 MB. It cannot be used to load to VARCHAR(MAX),
NVARCHAR(MAX), or VARBINARY(MAX) columns. For more information, see SQL Data Warehouse service capacity
limits.
When your source data has rows greater than 1 MB, you might want to vertically split the source tables into
several small ones. Make sure that the largest size of each row doesn't exceed the limit. The smaller tables can
then be loaded by using PolyBase and merged together in Azure SQL Data Warehouse.
Alternatively, for data with such wide columns, you can load the data in ADF without PolyBase by turning off the
"allow PolyBase" setting.
PolyBase troubleshooting
Loading to Decimal column
If your source data is in text format or another non-PolyBase compatible store (using staged copy and PolyBase),
and it contains an empty value to be loaded into a SQL Data Warehouse Decimal column, you may hit the following
error:

ErrorCode=FailedDbOperation, ......HadoopSqlException: Error converting data type VARCHAR to


DECIMAL.....Detailed Message=Empty string can't be converted to DECIMAL.....

The solution is to unselect the "Use type default" option (set it to false) in the copy activity sink -> PolyBase settings.
"USE_TYPE_DEFAULT" is a native PolyBase configuration that specifies how to handle missing values in
delimited text files when PolyBase retrieves data from the text file.
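
Expressed in the copy activity JSON, that setting corresponds roughly to the following sketch:

"sink": {
    "type": "SqlDWSink",
    "allowPolyBase": true,
    "polyBaseSettings": {
        "useTypeDefault": false
    }
}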
Others
For more known PolyBase issues, see Troubleshooting Azure SQL Data Warehouse PolyBase load.
SQL Data Warehouse resource class
To achieve the best possible throughput, assign a larger resource class to the user that loads data into SQL Data
Warehouse via PolyBase.
tableName in Azure SQL Data Warehouse
The following table gives examples of how to specify the tableName property in the JSON dataset. It shows
several combinations of schema and table names.

DB SCHEMA TABLE NAME TABLENAME JSON PROPERTY

dbo MyTable MyTable or dbo.MyTable or [dbo].


[MyTable]

dbo1 MyTable dbo1.MyTable or [dbo1].[MyTable]

dbo My.Table [My.Table] or [dbo].[My.Table]

dbo1 My.Table [dbo1].[My.Table]

If you see the following error, the problem might be the value you specified for the tableName property. See the
preceding table for the correct way to specify values for the tableName JSON property.

Type=System.Data.SqlClient.SqlException,Message=Invalid object name 'stg.Account_test'.,Source=.Net SqlClient


Data Provider
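
For instance, for the stg.Account_test object in the error above, the dataset would carry the schema-qualified name, roughly as follows:

"typeProperties": {
    "tableName": "[stg].[Account_test]"
}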

Columns with default values


Currently, the PolyBase feature in Data Factory accepts only the same number of columns as in the target table.
An example is a table with four columns where one of them is defined with a default value. The input data still
needs to have four columns. A three-column input dataset yields an error similar to the following message:

All columns of the table must be specified in the INSERT BULK statement.

The NULL value is a special form of the default value. If the column is nullable, the input data in the blob for that
column might be empty. But it can't be missing from the input dataset. PolyBase inserts NULL for missing values
in Azure SQL Data Warehouse.

Mapping Data Flow properties


Learn details from source transformation and sink transformation in Mapping Data Flow.
Data type mapping for Azure SQL Data Warehouse
When you copy data from or to Azure SQL Data Warehouse, the following mappings are used from Azure SQL
Data Warehouse data types to Azure Data Factory interim data types. See schema and data type mappings to
learn how Copy Activity maps the source schema and data type to the sink.

TIP
Refer to Table data types in Azure SQL Data Warehouse article on SQL DW supported data types and the workarounds for
unsupported ones.

AZURE SQL DATA WAREHOUSE DATA TYPE DATA FACTORY INTERIM DATA TYPE

bigint Int64

binary Byte[]

bit Boolean

char String, Char[]

date DateTime

Datetime DateTime

datetime2 DateTime

Datetimeoffset DateTimeOffset

Decimal Decimal

FILESTREAM attribute (varbinary(max)) Byte[]

Float Double

image Byte[]

int Int32

money Decimal

nchar String, Char[]

numeric Decimal

nvarchar String, Char[]

real Single

rowversion Byte[]

smalldatetime DateTime

smallint Int16

smallmoney Decimal

time TimeSpan

tinyint Byte

uniqueidentifier Guid

varbinary Byte[]

varchar String, Char[]

Next steps
For a list of data stores supported as sources and sinks by Copy Activity in Azure Data Factory, see supported
data stores and formats.
Copy data to and from Azure Table storage by using
Azure Data Factory
3/5/2019 • 10 minutes to read

This article outlines how to use Copy Activity in Azure Data Factory to copy data to and from Azure Table storage.
It builds on the Copy Activity overview article that presents a general overview of Copy Activity.

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Supported capabilities
You can copy data from any supported source data store to Table storage. You also can copy data from Table
storage to any supported sink data store. For a list of data stores that are supported as sources or sinks by the
copy activity, see the Supported data stores table.
Specifically, this Azure Table connector supports copying data by using account key and service shared access
signature authentications.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Table storage.

Linked service properties


Use an account key
You can create an Azure Storage linked service by using the account key. It provides the data factory with global
access to Storage. The following properties are supported.

PROPERTY DESCRIPTION REQUIRED



type The type property must be set to Yes


AzureTableStorage.

connectionString Specify the information needed to Yes


connect to Storage for the
connectionString property.
Mark this field as a SecureString to
store it securely in Data Factory. You
can also put account key in Azure Key
Vault and pull the accountKey
configuration out of the connection
string. Refer to the following samples
and Store credentials in Azure Key Vault
article for more details.

connectVia The integration runtime to be used to No


connect to the data store. You can use
Azure Integration Runtime or Self-
hosted Integration Runtime (if your
data store is located in a private
network). If not specified, it uses the
default Azure Integration Runtime.

NOTE
If you were using "AzureStorage" type linked service, it is still supported as-is, while you are suggested to use this new
"AzureTableStorage" linked service type going forward.

Example:

{
"name": "AzureTableStorageLinkedService",
"properties": {
"type": "AzureTableStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store account key in Azure Key Vault


{
"name": "AzureTableStorageLinkedService",
"properties": {
"type": "AzureTableStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;"
},
"accountKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Use shared access signature authentication


You also can create a Storage linked service by using a shared access signature. It provides the data factory with
restricted/time-bound access to all/specific resources in the storage.
A shared access signature provides delegated access to resources in your storage account. You can use it to grant
a client limited permissions to objects in your storage account for a specified time and with a specified set of
permissions. You don't have to share your account access keys. The shared access signature is a URI that
encompasses in its query parameters all the information necessary for authenticated access to a storage resource.
To access storage resources with the shared access signature, the client only needs to pass in the shared access
signature to the appropriate constructor or method. For more information about shared access signatures, see
Shared access signatures: Understand the shared access signature model.

NOTE
Data Factory now supports both service shared access signatures and account shared access signatures. For more
information about these two types and how to construct them, see Types of shared access signatures.

TIP
To generate a service shared access signature for your storage account, you can execute the following PowerShell
commands. Replace the placeholders and grant the needed permission.
$context = New-AzStorageContext -StorageAccountName <accountName> -StorageAccountKey <accountKey>
New-AzStorageContainerSASToken -Name <containerName> -Context $context -Permission rwdl -StartTime
<startTime> -ExpiryTime <endTime> -FullUri

To use shared access signature authentication, the following properties are supported.

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to Yes


AzureTableStorage.

sasUri Specify the shared access signature Yes


URI to the table.
Mark this field as a SecureString to
store it securely in Data Factory. You
can also put SAS token in Azure Key
Vault to leverage auto rotation and
remove the token portion. Refer to the
following samples and Store credentials
in Azure Key Vault article for more
details.

connectVia The integration runtime to be used to No


connect to the data store. You can use
the Azure Integration Runtime or the
Self-hosted Integration Runtime (if your
data store is located in a private
network). If not specified, it uses the
default Azure Integration Runtime.

NOTE
If you were using "AzureStorage" type linked service, it is still supported as-is, while you are suggested to use this new
"AzureTableStorage" linked service type going forward.

Example:

{
"name": "AzureTableStorageLinkedService",
"properties": {
"type": "AzureTableStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the Azure Storage resource e.g.
https://<account>.table.core.windows.net/<table>?sv=<storage version>&amp;st=<start time>&amp;se=<expire
time>&amp;sr=<resource>&amp;sp=<permissions>&amp;sip=<ip range>&amp;spr=<protocol>&amp;sig=<signature>>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store the SAS token in Azure Key Vault


{
"name": "AzureTableStorageLinkedService",
"properties": {
"type": "AzureTableStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the Azure Storage resource without token e.g.
https://<account>.table.core.windows.net/<table>>"
},
"sasToken": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

When you create a shared access signature URI, consider the following points:
Set appropriate read/write permissions on objects based on how the linked service (read, write, read/write) is
used in your data factory.
Set Expiry time appropriately. Make sure that the access to Storage objects doesn't expire within the active
period of the pipeline.
The URI should be created at the right table level based on the need.

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Azure Table dataset.
To copy data to and from Azure Table, set the type property of the dataset to AzureTable. The following
properties are supported.

PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset must Yes


be set to AzureTable.

tableName The name of the table in the Table Yes


storage database instance that the
linked service refers to.

Example:
{
"name": "AzureTableDataset",
"properties":
{
"type": "AzureTable",
"linkedServiceName": {
"referenceName": "<Azure Table storage linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "MyTable"
}
}
}

Schema by Data Factory


For schema-free data stores such as Azure Table, Data Factory infers the schema in one of the following ways:
If you specify the structure of data by using the structure property in the dataset definition, Data Factory
honors this structure as the schema. In this case, if a row doesn't contain a value for a column, a null value is
provided for it.
If you don't specify the structure of data by using the structure property in the dataset definition, Data Factory
infers the schema by using the first row in the data. In this case, if the first row doesn't contain the full schema,
some columns will be missing in the result of the copy operation.
For schema-free data sources, the best practice is to specify the structure of data by using the structure property.
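
A sketch of an Azure Table dataset that declares its structure explicitly follows; the column names are illustrative only.

{
    "name": "AzureTableDatasetWithStructure",
    "properties": {
        "type": "AzureTable",
        "structure": [
            { "name": "PartitionKey", "type": "String" },
            { "name": "RowKey", "type": "String" },
            { "name": "LastModifiedTime", "type": "DateTime" }
        ],
        "linkedServiceName": {
            "referenceName": "<Azure Table storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "tableName": "MyTable"
        }
    }
}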

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Azure Table source and sink.
Azure Table as a source type
To copy data from Azure Table, set the source type in the copy activity to AzureTableSource. The following
properties are supported in the copy activity source section.

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes


source must be set to
AzureTableSource.

azureTableSourceQuery Use the custom Table storage query to No


read data. See examples in the following
section.

azureTableSourceIgnoreTableNotFound Indicates whether to allow the No


exception when the table does not exist.
Allowed values are True and False
(default).

azureTableSourceQuery examples
If the Azure Table column is of the datetime type:

"azureTableSourceQuery": "LastModifiedTime gt datetime'2017-10-01T00:00:00' and LastModifiedTime le


datetime'2017-10-02T00:00:00'"
If the Azure Table column is of the string type:

"azureTableSourceQuery": "LastModifiedTime ge '201710010000_0000' and LastModifiedTime le '201710010000_9999'"

If you use a pipeline parameter, cast the datetime value to the proper format according to the previous samples.
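
Put together in a copy activity source section, such a query would be used roughly as follows; the filter value simply repeats the first sample above.

"source": {
    "type": "AzureTableSource",
    "azureTableSourceQuery": "LastModifiedTime gt datetime'2017-10-01T00:00:00'",
    "azureTableSourceIgnoreTableNotFound": false
}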
Azure Table as a sink type
To copy data to Azure Table, set the sink type in the copy activity to AzureTableSink. The following properties are
supported in the copy activity sink section.

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes


sink must be set to AzureTableSink.

azureTableDefaultPartitionKeyValue The default partition key value that can No


be used by the sink.

azureTablePartitionKeyName Specify the name of the column whose No


values are used as partition keys. If not
specified,
"AzureTableDefaultPartitionKeyValue" is
used as the partition key.

azureTableRowKeyName Specify the name of the column whose No


column values are used as the row key.
If not specified, use a GUID for each
row.

azureTableInsertType The mode to insert data into Azure No


Table. This property controls whether
existing rows in the output table with
matching partition and row keys have
their values replaced or merged.

Allowed values are merge (default) and


replace.

This setting applies at the row level not


the table level. Neither option deletes
rows in the output table that do not
exist in the input. To learn about how
the merge and replace settings work,
see Insert or merge entity and Insert or
replace entity.

writeBatchSize Inserts data into Azure Table when No (default is 10,000)


writeBatchSize or writeBatchTimeout is
hit.
Allowed values are integer (number of
rows).

writeBatchTimeout Inserts data into Azure Table when No (default is 90 seconds, storage
writeBatchSize or writeBatchTimeout is client's default timeout)
hit.
Allowed values are timespan. An
example is "00:20:00" (20 minutes).

Example:
"activities":[
{
"name": "CopyToAzureTable",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Table output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureTableSink",
"azureTablePartitionKeyName": "<column name>",
"azureTableRowKeyName": "<column name>"
}
}
}
]

azureTablePartitionKeyName
Map a source column to a destination column by using the "translator" property before you can use the
destination column as azureTablePartitionKeyName.
In the following example, source column DivisionID is mapped to the destination column DivisionID:

"translator": {
"type": "TabularTranslator",
"columnMappings": "DivisionID: DivisionID, FirstName: FirstName, LastName: LastName"
}

"DivisionID" is specified as the partition key.

"sink": {
"type": "AzureTableSink",
"azureTablePartitionKeyName": "DivisionID"
}

Data type mapping for Azure Table


When you copy data from and to Azure Table, the following mappings are used from Azure Table data types to
Data Factory interim data types. To learn about how the copy activity maps the source schema and data type to the
sink, see Schema and data type mappings.
When you move data to and from Azure Table, the following mappings defined by Azure Table are used from
Azure Table OData types to .NET type and vice versa.
AZURE TABLE DATA TYPE DATA FACTORY INTERIM DATA TYPE DETAILS

Edm.Binary byte[] An array of bytes up to 64 KB.

Edm.Boolean bool A Boolean value.

Edm.DateTime DateTime A 64-bit value expressed as


Coordinated Universal Time (UTC). The
supported DateTime range begins
midnight, January 1, 1601 A.D. (C.E.),
UTC. The range ends December 31,
9999.

Edm.Double double A 64-bit floating point value.

Edm.Guid Guid A 128-bit globally unique identifier.

Edm.Int32 Int32 A 32-bit integer.

Edm.Int64 Int64 A 64-bit integer.

Edm.String String A UTF-16-encoded value. String values


can be up to 64 KB.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Cassandra using Azure Data Factory
3/14/2019 • 7 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a Cassandra database. It
builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from Cassandra database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Cassandra connector supports:
Cassandra versions 2.x and 3.x.
Copying data using Basic or Anonymous authentication.

NOTE
For activity running on Self-hosted Integration Runtime, Cassandra 3.x is supported since IR version 3.7 and above.

Prerequisites
To copy data from a Cassandra database that is not publicly accessible, you need to set up a Self-hosted
Integration Runtime. See Self-hosted Integration Runtime article to learn details. The Integration Runtime
provides a built-in Cassandra driver, therefore you don't need to manually install any driver when copying data
from/to Cassandra.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Cassandra connector.

Linked service properties


The following properties are supported for Cassandra linked service:
PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Cassandra | Yes
host | One or more IP addresses or host names of Cassandra servers. Specify a comma-separated list of IP addresses or host names to connect to all servers concurrently. | Yes
port | The TCP port that the Cassandra server uses to listen for client connections. | No (default is 9042)
authenticationType | Type of authentication used to connect to the Cassandra database. Allowed values are: Basic, and Anonymous. | Yes
username | Specify user name for the user account. | Yes, if authenticationType is set to Basic.
password | Specify password for the user account. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes, if authenticationType is set to Basic.
connectVia | The Integration Runtime to be used to connect to the data store. You can use Self-hosted Integration Runtime or Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. | No

NOTE
Currently connection to Cassandra using SSL is not supported.

Example:
{
"name": "CassandraLinkedService",
"properties": {
"type": "Cassandra",
"typeProperties": {
"host": "<host>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Cassandra dataset.
To copy data from Cassandra, set the type property of the dataset to CassandraTable. The following properties
are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: CassandraTable | Yes
keyspace | Name of the keyspace or schema in Cassandra database. | No (if "query" for "CassandraSource" is specified)
tableName | Name of the table in Cassandra database. | No (if "query" for "CassandraSource" is specified)

Example:

{
"name": "CassandraDataset",
"properties": {
"type": "CassandraTable",
"linkedServiceName": {
"referenceName": "<Cassandra linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"keySpace": "<keyspace name>",
"tableName": "<table name>"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Cassandra source.
Cassandra as source
To copy data from Cassandra, set the source type in the copy activity to CassandraSource. The following
properties are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: CassandraSource | Yes
query | Use the custom query to read data. SQL-92 query or CQL query. See CQL reference. When using SQL query, specify keyspace name.table name to represent the table you want to query. | No (if "tableName" and "keyspace" in dataset are specified).
consistencyLevel | The consistency level specifies how many replicas must respond to a read request before returning data to the client application. Cassandra checks the specified number of replicas for data to satisfy the read request. See Configuring data consistency for details. Allowed values are: ONE, TWO, THREE, QUORUM, ALL, LOCAL_QUORUM, EACH_QUORUM, and LOCAL_ONE. | No (default is ONE)

Example:

"activities":[
{
"name": "CopyFromCassandra",
"type": "Copy",
"inputs": [
{
"referenceName": "<Cassandra input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "CassandraSource",
"query": "select id, firstname, lastname from mykeyspace.mytable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
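
The preceding example relies on the default consistency level (ONE). If your scenario needs a stronger read guarantee, consistencyLevel can be added to the same source block; this is only a sketch with an illustrative value:

"source": {
    "type": "CassandraSource",
    "query": "select id, firstname, lastname from mykeyspace.mytable",
    "consistencyLevel": "LOCAL_QUORUM"
}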
Data type mapping for Cassandra
When copying data from Cassandra, the following mappings are used from Cassandra data types to Azure Data
Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the source
schema and data type to the sink.

CASSANDRA DATA TYPE | DATA FACTORY INTERIM DATA TYPE
ASCII | String
BIGINT | Int64
BLOB | Byte[]
BOOLEAN | Boolean
DECIMAL | Decimal
DOUBLE | Double
FLOAT | Single
INET | String
INT | Int32
TEXT | String
TIMESTAMP | DateTime
TIMEUUID | Guid
UUID | Guid
VARCHAR | String
VARINT | Decimal

NOTE
For collection types (map, set, list, etc.), refer to the Work with Cassandra collection types using virtual table section.
User-defined types are not supported.
The length of binary and string columns cannot be greater than 4000.

Work with collections using virtual table


Azure Data Factory uses a built-in ODBC driver to connect to and copy data from your Cassandra database. For
collection types including map, set and list, the driver renormalizes the data into corresponding virtual tables.
Specifically, if a table contains any collection columns, the driver generates the following virtual tables:
A base table, which contains the same data as the real table except for the collection columns. The base table
uses the same name as the real table that it represents.
A virtual table for each collection column, which expands the nested data. The virtual tables that represent
collections are named using the name of the real table, a separator "vt" and the name of the column.
Virtual tables refer to the data in the real table, enabling the driver to access the denormalized data. See Example
section for details. You can access the content of Cassandra collections by querying and joining the virtual tables.
Example
For example, the following "ExampleTable" is a Cassandra database table that contains an integer primary key
column named "pk_int", a text column named value, a list column, a map column, and a set column (named
"StringSet").

PK_INT | VALUE | LIST | MAP | STRINGSET
1 | "sample value 1" | ["1", "2", "3"] | {"S1": "a", "S2": "b"} | {"A", "B", "C"}
3 | "sample value 3" | ["100", "101", "102", "105"] | {"S1": "t"} | {"A", "E"}

The driver would generate multiple virtual tables to represent this single table. The foreign key columns in the
virtual tables reference the primary key columns in the real table, and indicate which real table row the virtual
table row corresponds to.
The first virtual table is the base table, named "ExampleTable", shown in the following table:

PK_INT | VALUE
1 | "sample value 1"
3 | "sample value 3"

The base table contains the same data as the original database table except for the collections, which are omitted
from this table and expanded in other virtual tables.
The following tables show the virtual tables that renormalize the data from the List, Map, and StringSet columns.
The columns with names that end with "_index" or "_key" indicate the position of the data within the original list or
map. The columns with names that end with "_value" contain the expanded data from the collection.
Table "ExampleTable_vt_List":

PK_INT LIST_INDEX LIST_VALUE

1 0 1

1 1 2

1 2 3

3 0 100

3 1 101

3 2 102

3 3 103
Table "ExampleTable_vt_Map":

PK_INT MAP_KEY MAP_VALUE

1 S1 A

1 S2 b

3 S1 t

Table "ExampleTable_vt_StringSet":

PK_INT STRINGSET_VALUE

1 A

1 B

1 C

3 A

3 E
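
To read the expanded collection values together with their parent rows, join a virtual table to the base table on the primary key column. The following source sketch assumes the tables are exposed as shown above; identifier quoting can vary by driver, so treat the query as illustrative:

"source": {
    "type": "CassandraSource",
    "query": "SELECT b.pk_int, b.value, l.List_index, l.List_value FROM ExampleTable b JOIN ExampleTable_vt_List l ON b.pk_int = l.pk_int"
}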

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to Dynamics 365 (Common Data
Service) or Dynamics CRM by using Azure Data
Factory
5/29/2019 • 9 minutes to read

This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Microsoft Dynamics
365 or Microsoft Dynamics CRM. It builds on the Copy Activity overview article that presents a general overview
of Copy Activity.

Supported capabilities
You can copy data from Dynamics 365 (Common Data Service) or Dynamics CRM to any supported sink data
store. You also can copy data from any supported source data store to Dynamics 365 (Common Data Service) or
Dynamics CRM. For a list of data stores supported as sources or sinks by the copy activity, see the Supported data
stores table.
This Dynamics connector supports the following Dynamics versions and authentication types. (IFD is short for
internet-facing deployment.)

DYNAMICS VERSIONS | AUTHENTICATION TYPES | LINKED SERVICE SAMPLES
Dynamics 365 online, Dynamics CRM Online | Office365 | Dynamics online + Office365 auth
Dynamics 365 on-premises with IFD, Dynamics CRM 2016 on-premises with IFD, Dynamics CRM 2015 on-premises with IFD | IFD | Dynamics on-premises with IFD + IFD auth

For Dynamics 365 specifically, the following application types are supported:
Dynamics 365 for Sales
Dynamics 365 for Customer Service
Dynamics 365 for Field Service
Dynamics 365 for Project Service Automation
Dynamics 365 for Marketing
Other application types, such as Finance and Operations and Talent, are not supported by this connector.

TIP
To copy data from Dynamics 365 Finance and Operations, you can use the Dynamics AX connector.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-step
instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Dynamics.

Linked service properties


The following properties are supported for the Dynamics linked service.
Dynamics 365 and Dynamics CRM Online
PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to Dynamics. | Yes
deploymentType | The deployment type of the Dynamics instance. It must be "Online" for Dynamics online. | Yes
serviceUri | The service URL of your Dynamics instance, e.g. https://adfdynamics.crm.dynamics.com. | Yes
authenticationType | The authentication type to connect to a Dynamics server. Specify "Office365" for Dynamics online. | Yes
username | Specify the user name to connect to Dynamics. | Yes
password | Specify the password for the user account you specified for username. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
connectVia | The integration runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime. | No for source, Yes for sink if the source linked service doesn't have an integration runtime

NOTE
The Dynamics connector formerly used the optional "organizationName" property to identify your Dynamics CRM/365 Online instance. While that property still works, we suggest that you specify the new "serviceUri" property instead to gain better performance for instance discovery.

Example: Dynamics online using Office365 authentication


{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"description": "Dynamics online linked service using Office365 authentication",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://adfdynamics.crm.dynamics.com",
"authenticationType": "Office365",
"username": "test@contoso.onmicrosoft.com",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dynamics 365 and Dynamics CRM on-premises with IFD


Compared to Dynamics online, the on-premises with IFD linked service supports the additional properties "hostName" and "port".

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to Dynamics. | Yes
deploymentType | The deployment type of the Dynamics instance. It must be "OnPremisesWithIfd" for Dynamics on-premises with IFD. | Yes
hostName | The host name of the on-premises Dynamics server. | Yes
port | The port of the on-premises Dynamics server. | No, default is 443
organizationName | The organization name of the Dynamics instance. | Yes
authenticationType | The authentication type to connect to the Dynamics server. Specify "Ifd" for Dynamics on-premises with IFD. | Yes
username | Specify the user name to connect to Dynamics. | Yes
password | Specify the password for the user account you specified for username. You can choose to mark this field as a SecureString to store it securely in ADF, or store the password in Azure Key Vault and let the copy activity pull it from there when performing data copy - learn more from Store credentials in Key Vault. | Yes
connectVia | The integration runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime. | No for source, Yes for sink

Example: Dynamics on-premises with IFD using IFD authentication

{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"description": "Dynamics on-premises with IFD linked service using IFD authentication",
"typeProperties": {
"deploymentType": "OnPremisesWithIFD",
"hostName": "contosodynamicsserver.contoso.com",
"port": 443,
"organizationName": "admsDynamicsTest",
"authenticationType": "Ifd",
"username": "test@contoso.onmicrosoft.com",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by Dynamics dataset.
To copy data from and to Dynamics, set the type property of the dataset to DynamicsEntity. The following
properties are supported.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to DynamicsEntity. | Yes
entityName | The logical name of the entity to retrieve. | No for source (if "query" in the activity source is specified), Yes for sink
IMPORTANT
When you copy data from Dynamics, the "structure" section is optional but highly recommended in the Dynamics dataset to ensure a deterministic copy result. It defines the column name and data type for Dynamics data that you want to copy over. To learn more, see Dataset structure and Data type mapping for Dynamics.
When you import a schema in the authoring UI, ADF infers the schema by sampling the top rows from the Dynamics query result to initialize the structure construction, in which case columns with no values are omitted. The same behavior applies to copy executions if there is no explicit structure definition. You can review and add more columns to the Dynamics dataset schema/structure as needed; they will be honored during copy runtime.
When you copy data to Dynamics, the "structure" section is optional in the Dynamics dataset. Which columns to copy into is determined by the source data schema. If your source is a CSV file without a header, specify the "structure" in the input dataset with the column names and data types. They map to fields in the CSV file one by one, in order.

Example:

{
"name": "DynamicsDataset",
"properties": {
"type": "DynamicsEntity",
"structure": [
{
"name": "accountid",
"type": "Guid"
},
{
"name": "name",
"type": "String"
},
{
"name": "marketingonly",
"type": "Boolean"
},
{
"name": "modifiedon",
"type": "Datetime"
}
],
"typeProperties": {
"entityName": "account"
},
"linkedServiceName": {
"referenceName": "<Dynamics linked service name>",
"type": "linkedservicereference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Dynamics source and sink types.
Dynamics as a source type
To copy data from Dynamics, set the source type in the copy activity to DynamicsSource. The following properties
are supported in the copy activity source section.
PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to DynamicsSource. | Yes
query | FetchXML is a proprietary query language that is used in Dynamics (online and on-premises). See the following example. To learn more, see Build queries with FetchXML. | No (if "entityName" in the dataset is specified)

NOTE
The PK column will always be copied out even if the column projection you configure in the FetchXML query doesn't contain
it.

Example:

"activities":[
{
"name": "CopyFromDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<Dynamics input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DynamicsSource",
"query": "<FetchXML Query>"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Sample FetchXML query


<fetch>
<entity name="account">
<attribute name="accountid" />
<attribute name="name" />
<attribute name="marketingonly" />
<attribute name="modifiedon" />
<order attribute="modifiedon" descending="false" />
<filter type="and">
<condition attribute ="modifiedon" operator="between">
<value>2017-03-10 18:40:00z</value>
<value>2017-03-12 20:40:00z</value>
</condition>
</filter>
</entity>
</fetch>

Dynamics as a sink type


To copy data to Dynamics, set the sink type in the copy activity to DynamicsSink. The following properties are
supported in the copy activity sink section.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity sink must be set to DynamicsSink. | Yes
writeBehavior | The write behavior of the operation. Allowed value is "Upsert". | Yes
writeBatchSize | The row count of data written to Dynamics in each batch. | No (default is 10)
ignoreNullValues | Indicates whether to ignore null values from input data (except key fields) during a write operation. Allowed values are true and false. - True: Leave the data in the destination object unchanged when you do an upsert/update operation. Insert a defined default value when you do an insert operation. - False: Update the data in the destination object to NULL when you do an upsert/update operation. Insert a NULL value when you do an insert operation. | No (default is false)

NOTE
The default value of the sink "writeBatchSize" and the copy activity "parallelCopies" for the Dynamics sink are both 10.
Therefore, 100 records are submitted to Dynamics concurrently.

For Dynamics 365 online, there is a limit of 2 concurrent batch calls per organization. If that limit is exceeded, a "Server Busy" fault is thrown before the first request is ever executed. Keeping "writeBatchSize" less than or equal to 10 avoids such throttling of concurrent calls.
The optimal combination of "writeBatchSize" and "parallelCopies" depends on the schema of your entity, e.g. the number of columns, row size, and the number of plugins/workflows/workflow activities hooked up to those calls. The default setting of 10 writeBatchSize * 10 parallelCopies is the recommendation from the Dynamics service; it works for most Dynamics entities, though it may not give the best performance. You can tune the performance by adjusting the combination in your copy activity settings.
Example:

"activities":[
{
"name": "CopyToDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Dynamics output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "DynamicsSink",
"writeBehavior": "Upsert",
"writeBatchSize": 10,
"ignoreNullValues": true
}
}
}
]
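
If you tune the combination described in the note above, "parallelCopies" sits at the same level as the source and sink in the copy activity's typeProperties. The values below are illustrative placeholders, not a recommendation for your entity:

"typeProperties": {
    "source": {
        "type": "<source type>"
    },
    "sink": {
        "type": "DynamicsSink",
        "writeBehavior": "Upsert",
        "writeBatchSize": 10,
        "ignoreNullValues": true
    },
    "parallelCopies": 10
}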

Data type mapping for Dynamics


When you copy data from Dynamics, the following mappings are used from Dynamics data types to Data Factory
interim data types. To learn how the copy activity maps the source schema and data type to the sink, see Schema
and data type mappings.
Configure the corresponding Data Factory data type in a dataset structure based on your source Dynamics data
type by using the following mapping table.

DYNAMICS DATA TYPE | DATA FACTORY INTERIM DATA TYPE | SUPPORTED AS SOURCE | SUPPORTED AS SINK
AttributeTypeCode.BigInt | Long | ✓ | ✓
AttributeTypeCode.Boolean | Boolean | ✓ | ✓
AttributeType.Customer | Guid | ✓ |
AttributeType.DateTime | Datetime | ✓ | ✓
AttributeType.Decimal | Decimal | ✓ | ✓
AttributeType.Double | Double | ✓ | ✓
AttributeType.EntityName | String | ✓ | ✓
AttributeType.Integer | Int32 | ✓ | ✓
AttributeType.Lookup | Guid | ✓ | ✓ (with single target associated)
AttributeType.ManagedProperty | Boolean | ✓ |
AttributeType.Memo | String | ✓ | ✓
AttributeType.Money | Decimal | ✓ | ✓
AttributeType.Owner | Guid | ✓ |
AttributeType.Picklist | Int32 | ✓ | ✓
AttributeType.Uniqueidentifier | Guid | ✓ | ✓
AttributeType.String | String | ✓ | ✓
AttributeType.State | Int32 | ✓ | ✓
AttributeType.Status | Int32 | ✓ | ✓

NOTE
The Dynamics data types AttributeType.CalendarRules and AttributeType.PartyList aren't supported.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Concur using Azure Data Factory
(Preview)
2/1/2019 • 3 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Concur. It builds on the
copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Concur to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

NOTE
Partner account is currently not supported.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Concur connector.

Linked service properties


The following properties are supported for Concur linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Concur | Yes
clientId | Application client_id supplied by Concur App Management. | Yes
username | The user name that you use to access Concur Service. | Yes
password | The password corresponding to the user name that you provided in the username field. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No
useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over SSL. The default value is true. | No
usePeerVerification | Specifies whether to verify the identity of the server when connecting over SSL. The default value is true. | No

Example:

{
"name": "ConcurLinkedService",
"properties": {
"type": "Concur",
"typeProperties": {
"clientId" : "<clientId>",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
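
The optional SSL-related properties from the table above go in the same typeProperties block. The following sketch is illustrative only, for example when testing against a non-production endpoint; in production you would normally leave the defaults (all true) in place:

{
    "name": "ConcurLinkedService",
    "properties": {
        "type": "Concur",
        "typeProperties": {
            "clientId" : "<clientId>",
            "username" : "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            },
            "useEncryptedEndpoints": true,
            "useHostVerification": false,
            "usePeerVerification": false
        }
    }
}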

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Concur dataset.
To copy data from Concur, set the type property of the dataset to ConcurObject. There is no additional type-
specific property in this type of dataset. The following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: ConcurObject | Yes
tableName | Name of the table. | No (if "query" in activity source is specified)

Example

{
"name": "ConcurDataset",
"properties": {
"type": "ConcurObject",
"linkedServiceName": {
"referenceName": "<Concur linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Concur source.
ConcurSource as source
To copy data from Concur, set the source type in the copy activity to ConcurSource. The following properties are
supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: ConcurSource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM Opportunities where Id = xxx". | No (if "tableName" in dataset is specified)

Example:
"activities":[
{
"name": "CopyFromConcur",
"type": "Copy",
"inputs": [
{
"referenceName": "<Concur input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ConcurSource",
"query": "SELECT * FROM Opportunities where Id = xxx"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Couchbase using Azure Data
Factory (Preview)
2/1/2019 • 3 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Couchbase. It builds on
the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Couchbase to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Couchbase connector.

Linked service properties


The following properties are supported for Couchbase linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Couchbase | Yes
connectionString | An ODBC connection string to connect to Couchbase. Mark this field as a SecureString to store it securely in Data Factory. You can also put the credential string in Azure Key Vault and pull the credString configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. You can use Self-hosted Integration Runtime or Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. | No

Example:

{
"name": "CouchbaseLinkedService",
"properties": {
"type": "Couchbase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>; Port=<port>;AuthMech=1;CredString=[{\"user\": \"JSmith\",
\"pass\":\"access123\"}, {\"user\": \"Admin\", \"pass\":\"simba123\"}];"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store credential string in Azure Key Vault


{
"name": "CouchbaseLinkedService",
"properties": {
"type": "Couchbase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>; Port=<port>;AuthMech=1;"
},
"credString": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Couchbase dataset.
To copy data from Couchbase, set the type property of the dataset to CouchbaseTable. The following properties
are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: CouchbaseTable | Yes
tableName | Name of the table. | No (if "query" in activity source is specified)

Example

{
"name": "CouchbaseDataset",
"properties": {
"type": "CouchbaseTable",
"linkedServiceName": {
"referenceName": "<Couchbase linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Couchbase source.
CouchbaseSource as source
To copy data from Couchbase, set the source type in the copy activity to CouchbaseSource. The following
properties are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: CouchbaseSource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in dataset is specified)

Example:

"activities":[
{
"name": "CopyFromCouchbase",
"type": "Copy",
"inputs": [
{
"referenceName": "<Couchbase input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "CouchbaseSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from DB2 by using Azure Data Factory
1/3/2019 • 4 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a DB2 database. It
builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from DB2 database to any supported sink data store. For a list of data stores that are supported
as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this DB2 connector supports the following IBM DB2 platforms and versions with Distributed Relational Database Architecture (DRDA) SQL Access Manager (SQLAM) versions 9, 10, and 11:
IBM DB2 for z/OS 11.1
IBM DB2 for z/OS 10.1
IBM DB2 for i 7.2
IBM DB2 for i 7.1
IBM DB2 for LUW 11
IBM DB2 for LUW 10.5
IBM DB2 for LUW 10.1

TIP
If you receive an error message that states "The package corresponding to an SQL statement execution request was not found. SQLSTATE=51002 SQLCODE=-805", the reason is that a needed package is not created for the normal user on such an OS. Follow these instructions according to your DB2 server type:
DB2 for i (AS400): let a power user create the collection for the login user before using the copy activity. Command: create collection <username>
DB2 for z/OS or LUW: use a high-privilege account - a power user or admin with package authorities and BIND, BINDADD, GRANT EXECUTE TO PUBLIC permissions - to run the copy activity once; the needed package is then automatically created during the copy. Afterwards, you can switch back to the normal user for your subsequent copy runs.

Prerequisites
To copy data from a DB2 database that is not publicly accessible, you need to set up a Self-hosted Integration
Runtime. To learn about Self-hosted integration runtimes, see Self-hosted Integration Runtime article. The
Integration Runtime provides a built-in DB2 driver, therefore you don't need to manually install any driver when
copying data from DB2.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
DB2 connector.

Linked service properties


The following properties are supported for DB2 linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Db2 | Yes
server | Name of the DB2 server. You can specify the port number following the server name, delimited by a colon, e.g. server:port. | Yes
database | Name of the DB2 database. | Yes
authenticationType | Type of authentication used to connect to the DB2 database. Allowed value is: Basic. | Yes
username | Specify user name to connect to the DB2 database. | Yes
password | Specify password for the user account you specified for the username. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. You can use Self-hosted Integration Runtime or Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. | No

Example:
{
"name": "Db2LinkedService",
"properties": {
"type": "Db2",
"typeProperties": {
"server": "<servername:port>",
"database": "<dbname>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
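
As noted in the password row above, you can also reference a secret stored in Azure Key Vault instead of embedding the password. The following sketch follows the AzureKeyVaultSecret pattern used elsewhere in this documentation; the names are placeholders:

{
    "name": "Db2LinkedService",
    "properties": {
        "type": "Db2",
        "typeProperties": {
            "server": "<servername:port>",
            "database": "<dbname>",
            "authenticationType": "Basic",
            "username": "<username>",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secretName>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}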

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by DB2 dataset.
To copy data from DB2, set the type property of the dataset to RelationalTable. The following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: RelationalTable | Yes
tableName | Name of the table in the DB2 database. | No (if "query" in activity source is specified)

Example

{
"name": "DB2Dataset",
"properties":
{
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<DB2 linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by DB2 source.
DB2 as source
To copy data from DB2, set the source type in the copy activity to RelationalSource. The following properties are
supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: RelationalSource | Yes
query | Use the custom SQL query to read data. For example: "query": "SELECT * FROM \"DB2ADMIN\".\"Customers\"". | No (if "tableName" in dataset is specified)

Example:

"activities":[
{
"name": "CopyFromDB2",
"type": "Copy",
"inputs": [
{
"referenceName": "<DB2 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT * FROM \"DB2ADMIN\".\"Customers\""
},
"sink": {
"type": "<sink type>"
}
}
}
]

Data type mapping for DB2


When copying data from DB2, the following mappings are used from DB2 data types to Azure Data Factory
interim data types. See Schema and data type mappings to learn about how copy activity maps the source schema
and data type to the sink.

DB2 DATABASE TYPE | DATA FACTORY INTERIM DATA TYPE
BigInt | Int64
Binary | Byte[]
Blob | Byte[]
Char | String
Clob | String
Date | Datetime
DB2DynArray | String
DbClob | String
Decimal | Decimal
DecimalFloat | Decimal
Double | Double
Float | Double
Graphic | String
Integer | Int32
LongVarBinary | Byte[]
LongVarChar | String
LongVarGraphic | String
Numeric | Decimal
Real | Single
SmallInt | Int16
Time | TimeSpan
Timestamp | DateTime
VarBinary | Byte[]
VarChar | String
VarGraphic | String
Xml | Byte[]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Delimited text format in Azure Data Factory
5/6/2019 • 5 minutes to read

Follow this article when you want to parse the delimited text files or write the data into delimited text
format.
Delimited text format is supported for the following connectors: Amazon S3, Azure Blob, Azure Data Lake
Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage,
HDFS, HTTP, and SFTP.

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the delimited text dataset.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to DelimitedText. | Yes
location | Location settings of the file(s). Each file-based connector has its own location type and supported properties under location. See details in the connector article -> Dataset properties section. | Yes
columnDelimiter | The character(s) used to separate columns in a file. Currently, a multi-char delimiter is only supported for Mapping Data Flow but not Copy activity. The default value is comma ",". When the column delimiter is defined as an empty string, which means no delimiter, the whole line is taken as a single column. | No
rowDelimiter | The single character or "\r\n" used to separate rows in a file. The default value is any of the following values on read: ["\r\n", "\r", "\n"], and "\n" or "\r\n" on write by Mapping Data Flow and Copy activity respectively. When rowDelimiter is set to no delimiter (empty string), the columnDelimiter must be set as no delimiter (empty string) as well, which means to treat the entire content as a single value. | No
quoteChar | The single character to quote column values if they contain the column delimiter. The default value is double quotes ("). For Mapping Data Flow, quoteChar cannot be an empty string. For Copy activity, when quoteChar is defined as an empty string, it means there is no quote char and the column value is not quoted, and escapeChar is used to escape the column delimiter and itself. | No
escapeChar | The single character to escape quotes inside a quoted value. The default value is backslash (\). For Mapping Data Flow, escapeChar cannot be an empty string. For Copy activity, when escapeChar is defined as an empty string, the quoteChar must be set as an empty string as well, in which case make sure all column values don't contain delimiters. | No
firstRowAsHeader | Specifies whether to treat/make the first row as a header line with names of columns. Allowed values are true and false (default). | No
nullValue | Specifies the string representation of null value. The default value is an empty string. | No
encodingName | The encoding type used to read/write text files. Allowed values are as follows: "UTF-8", "UTF-16", "UTF-16BE", "UTF-32", "UTF-32BE", "US-ASCII", "UTF-7", "BIG5", "EUC-JP", "EUC-KR", "GB2312", "GB18030", "JOHAB", "SHIFT-JIS", "CP875", "CP866", "IBM00858", "IBM037", "IBM273", "IBM437", "IBM500", "IBM737", "IBM775", "IBM850", "IBM852", "IBM855", "IBM857", "IBM860", "IBM861", "IBM863", "IBM864", "IBM865", "IBM869", "IBM870", "IBM01140", "IBM01141", "IBM01142", "IBM01143", "IBM01144", "IBM01145", "IBM01146", "IBM01147", "IBM01148", "IBM01149", "ISO-2022-JP", "ISO-2022-KR", "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-13", "ISO-8859-15", "WINDOWS-874", "WINDOWS-1250", "WINDOWS-1251", "WINDOWS-1252", "WINDOWS-1253", "WINDOWS-1254", "WINDOWS-1255", "WINDOWS-1256", "WINDOWS-1257", "WINDOWS-1258". Note: Mapping Data Flow doesn't support UTF-7 encoding. | No
compressionCodec | The compression codec used to read/write text files. Allowed values are bzip2, gzip, deflate, ZipDeflate, snappy, or lz4. Note: currently Copy activity doesn't support "snappy" and "lz4", and Mapping Data Flow doesn't support "ZipDeflate". | No
compressionLevel | The compression ratio. Allowed values are Optimal or Fastest. - Fastest: The compression operation should complete as quickly as possible, even if the resulting file is not optimally compressed. - Optimal: The compression operation should be optimally compressed, even if the operation takes a longer time to complete. For more information, see the Compression Level topic. | No

Below is an example of delimited text dataset on Azure Blob Storage:


{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder",
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
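
If you also need to control the row delimiter, encoding, or null representation, those properties go in the same typeProperties block. The values below are illustrative only:

"typeProperties": {
    "location": {
        "type": "AzureBlobStorageLocation",
        "container": "containername",
        "folderPath": "folder/subfolder"
    },
    "columnDelimiter": ";",
    "rowDelimiter": "\n",
    "encodingName": "UTF-8",
    "nullValue": "NULL",
    "firstRowAsHeader": true
}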

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the delimited text source and sink.
Delimited text as source
The following properties are supported in the copy activity *source* section.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to DelimitedTextSource. | Yes
formatSettings | A group of properties. Refer to the Delimited text read settings table below. | No
storeSettings | A group of properties on how to read data from a data store. Each file-based connector has its own supported read settings under storeSettings. See details in the connector article -> Copy activity properties section. | No

Supported delimited text read settings under formatSettings:

PROPERTY | DESCRIPTION | REQUIRED
type | The type of formatSettings must be set to DelimitedTextReadSetting. | Yes
skipLineCount | Indicates the number of non-empty rows to skip when reading data from input files. If both skipLineCount and firstRowAsHeader are specified, the lines are skipped first and then the header information is read from the input file. | No

Delimited text as sink


The following properties are supported in the copy activity *sink* section.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity sink must be set to DelimitedTextSink. | Yes
formatSettings | A group of properties. Refer to the Delimited text write settings table below. | No
storeSettings | A group of properties on how to write data to a data store. Each file-based connector has its own supported write settings under storeSettings. See details in the connector article -> Copy activity properties section. | No

Supported delimited text write settings under formatSettings:

PROPERTY | DESCRIPTION | REQUIRED
type | The type of formatSettings must be set to DelimitedTextWriteSetting. | Yes
fileExtension | The file extension used to name the output files, e.g. .csv, .txt. It must be specified when the fileName is not specified in the output DelimitedText dataset. | Yes when the file name is not specified in the output dataset
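
The read and write settings above plug into a copy activity as shown in the following sketch. The dataset names are placeholders, the storeSettings blocks are omitted for brevity, and the values are illustrative:

"activities":[
    {
        "name": "CopyDelimitedText",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<delimited text input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<delimited text output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "formatSettings": {
                    "type": "DelimitedTextReadSetting",
                    "skipLineCount": 2
                }
            },
            "sink": {
                "type": "DelimitedTextSink",
                "formatSettings": {
                    "type": "DelimitedTextWriteSetting",
                    "fileExtension": ".csv"
                }
            }
        }
    }
]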

Mapping Data Flow properties


Learn details from source transformation and sink transformation in Mapping Data Flow.

Next steps
Copy activity overview
Mapping data flow
Lookup activity
GetMetadata activity
Copy data from Drill using Azure Data Factory
(Preview)
2/1/2019 • 3 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Drill. It builds on the
copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Drill to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Drill connector.

Linked service properties


The following properties are supported for Drill linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Drill | Yes
connectionString | An ODBC connection string to connect to Drill. Mark this field as a SecureString to store it securely in Data Factory. You can also put the password in Azure Key Vault and pull the pwd configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. You can use Self-hosted Integration Runtime or Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. | No

Example:

{
"name": "DrillLinkedService",
"properties": {
"type": "Drill",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "ConnectionType=Direct;Host=<host>;Port=<port>;AuthenticationType=Plain;UID=<user
name>;PWD=<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store password in Azure Key Vault


{
"name": "DrillLinkedService",
"properties": {
"type": "Drill",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "ConnectionType=Direct;Host=<host>;Port=<port>;AuthenticationType=Plain;UID=<user
name>;"
},
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Drill dataset.
To copy data from Drill, set the type property of the dataset to DrillTable. The following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: DrillTable | Yes
tableName | Name of the table. | No (if "query" in activity source is specified)

Example

{
"name": "DrillDataset",
"properties": {
"type": "DrillTable",
"linkedServiceName": {
"referenceName": "<Drill linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Drill source.
DrillSource as source
To copy data from Drill, set the source type in the copy activity to DrillSource. The following properties are
supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: DrillSource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in dataset is specified)

Example:

"activities":[
{
"name": "CopyFromDrill",
"type": "Copy",
"inputs": [
{
"referenceName": "<Drill input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DrillSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to Dynamics 365 (Common Data
Service) or Dynamics CRM by using Azure Data
Factory
5/29/2019 • 9 minutes to read

This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Microsoft Dynamics
365 or Microsoft Dynamics CRM. It builds on the Copy Activity overview article that presents a general overview
of Copy Activity.

Supported capabilities
You can copy data from Dynamics 365 (Common Data Service) or Dynamics CRM to any supported sink data
store. You also can copy data from any supported source data store to Dynamics 365 (Common Data Service) or
Dynamics CRM. For a list of data stores supported as sources or sinks by the copy activity, see the Supported data
stores table.
This Dynamics connector supports the following Dynamics versions and authentication types. (IFD is short for
internet-facing deployment.)

DYNAMICS VERSIONS | AUTHENTICATION TYPES | LINKED SERVICE SAMPLES
Dynamics 365 online, Dynamics CRM Online | Office365 | Dynamics online + Office365 auth
Dynamics 365 on-premises with IFD, Dynamics CRM 2016 on-premises with IFD, Dynamics CRM 2015 on-premises with IFD | IFD | Dynamics on-premises with IFD + IFD auth

For Dynamics 365 specifically, the following application types are supported:
Dynamics 365 for Sales
Dynamics 365 for Customer Service
Dynamics 365 for Field Service
Dynamics 365 for Project Service Automation
Dynamics 365 for Marketing
Other application types, such as Finance and Operations and Talent, are not supported by this connector.

TIP
To copy data from Dynamics 365 Finance and Operations, you can use the Dynamics AX connector.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-step
instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Dynamics.

Linked service properties


The following properties are supported for the Dynamics linked service.
Dynamics 365 and Dynamics CRM Online
PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to Dynamics. | Yes
deploymentType | The deployment type of the Dynamics instance. It must be "Online" for Dynamics online. | Yes
serviceUri | The service URL of your Dynamics instance, e.g. https://adfdynamics.crm.dynamics.com. | Yes
authenticationType | The authentication type to connect to a Dynamics server. Specify "Office365" for Dynamics online. | Yes
username | Specify the user name to connect to Dynamics. | Yes
password | Specify the password for the user account you specified for username. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
connectVia | The integration runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime. | No for source, Yes for sink if the source linked service doesn't have an integration runtime

NOTE
The Dynamics connector formerly used the optional "organizationName" property to identify your Dynamics CRM/365 Online instance. While that property still works, we suggest that you specify the new "serviceUri" property instead to gain better performance for instance discovery.

Example: Dynamics online using Office365 authentication


{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"description": "Dynamics online linked service using Office365 authentication",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://adfdynamics.crm.dynamics.com",
"authenticationType": "Office365",
"username": "test@contoso.onmicrosoft.com",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dynamics 365 and Dynamics CRM on-premises with IFD


Compared to Dynamics online, the on-premises with IFD linked service supports the additional properties "hostName" and "port".

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to Dynamics. | Yes
deploymentType | The deployment type of the Dynamics instance. It must be "OnPremisesWithIfd" for Dynamics on-premises with IFD. | Yes
hostName | The host name of the on-premises Dynamics server. | Yes
port | The port of the on-premises Dynamics server. | No, default is 443
organizationName | The organization name of the Dynamics instance. | Yes
authenticationType | The authentication type to connect to the Dynamics server. Specify "Ifd" for Dynamics on-premises with IFD. | Yes
username | Specify the user name to connect to Dynamics. | Yes
password | Specify the password for the user account you specified for username. You can choose to mark this field as a SecureString to store it securely in ADF, or store the password in Azure Key Vault and let the copy activity pull it from there when performing data copy - learn more from Store credentials in Key Vault. | Yes
connectVia | The integration runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime. | No for source, Yes for sink

Example: Dynamics on-premises with IFD using IFD authentication

{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"description": "Dynamics on-premises with IFD linked service using IFD authentication",
"typeProperties": {
"deploymentType": "OnPremisesWithIFD",
"hostName": "contosodynamicsserver.contoso.com",
"port": 443,
"organizationName": "admsDynamicsTest",
"authenticationType": "Ifd",
"username": "test@contoso.onmicrosoft.com",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by Dynamics dataset.
To copy data from and to Dynamics, set the type property of the dataset to DynamicsEntity. The following
properties are supported.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to DynamicsEntity. | Yes
entityName | The logical name of the entity to retrieve. | No for source (if "query" in the activity source is specified), Yes for sink
IMPORTANT
When you copy data from Dynamics, the "structure" section is optional but highly recommended in the Dynamics dataset to ensure a deterministic copy result. It defines the column name and data type for Dynamics data that you want to copy over. To learn more, see Dataset structure and Data type mapping for Dynamics.
When you import a schema in the authoring UI, ADF infers the schema by sampling the top rows from the Dynamics query result to initialize the structure construction, in which case columns with no values are omitted. The same behavior applies to copy executions if there is no explicit structure definition. You can review and add more columns to the Dynamics dataset schema/structure as needed; they will be honored during copy runtime.
When you copy data to Dynamics, the "structure" section is optional in the Dynamics dataset. Which columns to copy into is determined by the source data schema. If your source is a CSV file without a header, specify the "structure" in the input dataset with the column names and data types. They map to fields in the CSV file one by one, in order.

Example:

{
"name": "DynamicsDataset",
"properties": {
"type": "DynamicsEntity",
"structure": [
{
"name": "accountid",
"type": "Guid"
},
{
"name": "name",
"type": "String"
},
{
"name": "marketingonly",
"type": "Boolean"
},
{
"name": "modifiedon",
"type": "Datetime"
}
],
"typeProperties": {
"entityName": "account"
},
"linkedServiceName": {
"referenceName": "<Dynamics linked service name>",
"type": "linkedservicereference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Dynamics source and sink types.
Dynamics as a source type
To copy data from Dynamics, set the source type in the copy activity to DynamicsSource. The following properties
are supported in the copy activity source section.
PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity source must be set to DynamicsSource. | Yes
query | FetchXML is a proprietary query language that is used in Dynamics (online and on-premises). See the following example. To learn more, see Build queries with FetchXML. | No (if "entityName" in the dataset is specified)

NOTE
The PK column will always be copied out even if the column projection you configure in the FetchXML query doesn't contain
it.

Example:

"activities":[
{
"name": "CopyFromDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<Dynamics input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DynamicsSource",
"query": "<FetchXML Query>"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Sample FetchXML query


<fetch>
<entity name="account">
<attribute name="accountid" />
<attribute name="name" />
<attribute name="marketingonly" />
<attribute name="modifiedon" />
<order attribute="modifiedon" descending="false" />
<filter type="and">
<condition attribute ="modifiedon" operator="between">
<value>2017-03-10 18:40:00z</value>
<value>2017-03-12 20:40:00z</value>
</condition>
</filter>
</entity>
</fetch>

Dynamics as a sink type


To copy data to Dynamics, set the sink type in the copy activity to DynamicsSink. The following properties are
supported in the copy activity sink section.

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity sink must be set to DynamicsSink. | Yes
writeBehavior | The write behavior of the operation. Allowed value is "Upsert". | Yes
writeBatchSize | The row count of data written to Dynamics in each batch. | No (default is 10)
ignoreNullValues | Indicates whether to ignore null values from input data (except key fields) during a write operation. Allowed values are true and false. - True: Leave the data in the destination object unchanged when you do an upsert/update operation. Insert a defined default value when you do an insert operation. - False: Update the data in the destination object to NULL when you do an upsert/update operation. Insert a NULL value when you do an insert operation. | No (default is false)

NOTE
The default value of the sink "writeBatchSize" and of the copy activity "parallelCopies" for the Dynamics sink are both 10, so 100 records are submitted to Dynamics concurrently.

For Dynamics 365 online, there is a limit of two concurrent batch calls per organization. If that limit is exceeded, a "Server Busy" fault is thrown before the first request is ever executed. Keeping "writeBatchSize" at 10 or less avoids such throttling of concurrent calls.
The optimal combination of "writeBatchSize" and "parallelCopies" depends on the schema of your entity, for example, the number of columns, the row size, and the number of plugins, workflows, or workflow activities hooked up to those calls. The default setting of writeBatchSize 10 * parallelCopies 10 is the recommendation from the Dynamics service; it works for most Dynamics entities, though it might not give the best performance. You can tune performance by adjusting this combination in your copy activity settings, as shown in the sketch after the following example.
Example:

"activities":[
{
"name": "CopyToDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Dynamics output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "DynamicsSink",
"writeBehavior": "Upsert",
"writeBatchSize": 10,
"ignoreNullValues": true
}
}
}
]
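
The following is a minimal sketch showing how "parallelCopies" can be set alongside the sink settings to tune throughput, as mentioned in the note above. The numbers are illustrative assumptions rather than recommendations for your workload; "parallelCopies" sits with the other copy activity settings under typeProperties.

"typeProperties": {
    "parallelCopies": 2,
    "source": {
        "type": "<source type>"
    },
    "sink": {
        "type": "DynamicsSink",
        "writeBehavior": "Upsert",
        "writeBatchSize": 10,
        "ignoreNullValues": true
    }
}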

Data type mapping for Dynamics


When you copy data from Dynamics, the following mappings are used from Dynamics data types to Data Factory
interim data types. To learn how the copy activity maps the source schema and data type to the sink, see Schema
and data type mappings.
Configure the corresponding Data Factory data type in a dataset structure based on your source Dynamics data
type by using the following mapping table.

DYNAMICS DATA TYPE | DATA FACTORY INTERIM DATA TYPE | SUPPORTED AS SOURCE | SUPPORTED AS SINK

AttributeTypeCode.BigInt | Long | ✓ | ✓
AttributeTypeCode.Boolean | Boolean | ✓ | ✓
AttributeType.Customer | Guid | ✓ |
AttributeType.DateTime | Datetime | ✓ | ✓
AttributeType.Decimal | Decimal | ✓ | ✓
AttributeType.Double | Double | ✓ | ✓
AttributeType.EntityName | String | ✓ | ✓
AttributeType.Integer | Int32 | ✓ | ✓
AttributeType.Lookup | Guid | ✓ | ✓ (with single target associated)
AttributeType.ManagedProperty | Boolean | ✓ |
AttributeType.Memo | String | ✓ | ✓
AttributeType.Money | Decimal | ✓ | ✓
AttributeType.Owner | Guid | ✓ |
AttributeType.Picklist | Int32 | ✓ | ✓
AttributeType.Uniqueidentifier | Guid | ✓ | ✓
AttributeType.String | String | ✓ | ✓
AttributeType.State | Int32 | ✓ | ✓
AttributeType.Status | Int32 | ✓ | ✓

NOTE
The Dynamics data types AttributeType.CalendarRules and AttributeType.PartyList aren't supported.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Dynamics AX by using Azure Data
Factory (Preview)
3/6/2019 • 3 minutes to read

This article outlines how to use Copy Activity in Azure Data Factory to copy data from a Dynamics AX source. The article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.

Supported capabilities
You can copy data from Dynamics AX to any supported sink data store. For a list of data stores that Copy Activity
supports as sources and sinks, see Supported data stores and formats.
Specifically, this Dynamics AX connector supports copying data from Dynamics AX using OData protocol with
Service Principal authentication.

TIP
You can also use this connector to copy data from Dynamics 365 Finance and Operations. Refer to Dynamics 365's
OData support and Authentication method.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to Dynamics AX connector.

Prerequisites
To use service principal authentication, follow these steps:
1. Register an application entity in Azure Active Directory (Azure AD) by following Register your application with an Azure AD tenant. Make note of the following values, which you use to define the linked service:
Application ID
Application key
Tenant ID
2. Go to Dynamics AX, and grant this service principal proper permission to access your Dynamics AX.
Linked service properties
The following properties are supported for Dynamics AX linked service:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to DynamicsAX. | Yes
url | The Dynamics AX (or Dynamics 365 Finance and Operations) instance OData endpoint. | Yes
servicePrincipalId | Specify the application's client ID. | Yes
servicePrincipalKey | Specify the application's key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
tenant | Specify the tenant information (domain name or tenant ID) under which your application resides. Retrieve it by hovering the mouse in the top-right corner of the Azure portal. | Yes
aadResourceId | Specify the AAD resource you are requesting for authorization. For example, if your Dynamics URL is https://sampledynamics.sandbox.operations.dynamics.com/data/, the corresponding AAD resource is usually https://sampledynamics.sandbox.operations.dynamics.com. | Yes
connectVia | The Integration Runtime to use to connect to the data store. You can choose the Azure Integration Runtime or a self-hosted Integration Runtime (if your data store is located in a private network). If not specified, the default Azure Integration Runtime is used. | No

Example
{
"name": "DynamicsAXLinkedService",
"properties": {
"type": "DynamicsAX",
"typeProperties": {
"url": "<Dynamics AX instance OData endpoint>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"aadResourceId": "<AAD resource, e.g. https://sampledynamics.sandbox.operations.dynamics.com>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}

Dataset properties
This section provides a list of properties that the Dynamics AX dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
To copy data from Dynamics AX, set the type property of the dataset to DynamicsAXResource. The following
properties are supported:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to DynamicsAXResource. | Yes
path | The path to the Dynamics AX OData entity. | Yes

Example

{
"name": "DynamicsAXResourceDataset",
"properties": {
"type": "DynamicsAXResource",
"typeProperties": {
"path": "<entity path e.g. dd04tentitySet>"
},
"linkedServiceName": {
"referenceName": "<Dynamics AX linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy Activity properties


This section provides a list of properties that the Dynamics AX source supports.
For a full list of sections and properties that are available for defining activities, see Pipelines.
Dynamics AX as source
To copy data from Dynamics AX, set the source type in Copy Activity to DynamicsAXSource. The following
properties are supported in the Copy Activity source section:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the Copy Activity source must be set to DynamicsAXSource. | Yes
query | OData query options for filtering data. Example: "?$select=Name,Description&$top=5". Note: The connector copies data from the combined URL: [URL specified in linked service]/[path specified in dataset][query specified in copy activity source]. For more information, see OData URL components. | No
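
To make the combined URL concrete, here is a hedged illustration; the entity set name and query below are hypothetical placeholders rather than values defined in this article:

[URL specified in linked service]            https://sampledynamics.sandbox.operations.dynamics.com/data
[path specified in dataset]                  CustomersV3   (hypothetical entity set)
[query specified in copy activity source]    ?$select=CustomerAccount&$top=5

Combined URL requested by the connector:
https://sampledynamics.sandbox.operations.dynamics.com/data/CustomersV3?$select=CustomerAccount&$top=5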

Example

"activities":[
{
"name": "CopyFromDynamicsAX",
"type": "Copy",
"inputs": [
{
"referenceName": "<Dynamics AX input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DynamicsAXSource",
"query": "$top=10"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from and to Dynamics 365 (Common
Data Service) or Dynamics CRM by using Azure
Data Factory
5/29/2019 • 9 minutes to read

This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Microsoft
Dynamics 365 or Microsoft Dynamics CRM. It builds on the Copy Activity overview article that presents a
general overview of Copy Activity.

Supported capabilities
You can copy data from Dynamics 365 (Common Data Service) or Dynamics CRM to any supported sink data
store. You also can copy data from any supported source data store to Dynamics 365 (Common Data Service)
or Dynamics CRM. For a list of data stores supported as sources or sinks by the copy activity, see the Supported
data stores table.
This Dynamics connector supports the following Dynamics versions and authentication types. (IFD is short for
internet-facing deployment.)

DYNAMICS VERSIONS | AUTHENTICATION TYPES | LINKED SERVICE SAMPLES

Dynamics 365 online, Dynamics CRM Online | Office365 | Dynamics online + Office365 auth
Dynamics 365 on-premises with IFD, Dynamics CRM 2016 on-premises with IFD, Dynamics CRM 2015 on-premises with IFD | IFD | Dynamics on-premises with IFD + IFD auth

For Dynamics 365 specifically, the following application types are supported:
Dynamics 365 for Sales
Dynamics 365 for Customer Service
Dynamics 365 for Field Service
Dynamics 365 for Project Service Automation
Dynamics 365 for Marketing
Other application types, such as Finance and Operations and Talent, are not supported by this connector.

TIP
To copy data from Dynamics 365 Finance and Operations, you can use the Dynamics AX connector.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Dynamics.

Linked service properties


The following properties are supported for the Dynamics linked service.
Dynamics 365 and Dynamics CRM Online
PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to Dynamics. | Yes
deploymentType | The deployment type of the Dynamics instance. It must be "Online" for Dynamics online. | Yes
serviceUri | The service URL of your Dynamics instance, e.g. https://adfdynamics.crm.dynamics.com. | Yes
authenticationType | The authentication type to connect to a Dynamics server. Specify "Office365" for Dynamics online. | Yes
username | Specify the user name to connect to Dynamics. | Yes
password | Specify the password for the user account you specified for username. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
connectVia | The integration runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime. | No for source, Yes for sink if the source linked service doesn't have an integration runtime

NOTE
The Dynamics connector formerly used the optional "organizationName" property to identify your Dynamics CRM/365 Online instance. While that property still works, we recommend that you specify the new "serviceUri" property instead to get better performance for instance discovery.
Example: Dynamics online using Office365 authentication
Example: Dynamics online using Office365 authentication

{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"description": "Dynamics online linked service using Office365 authentication",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://adfdynamics.crm.dynamics.com",
"authenticationType": "Office365",
"username": "test@contoso.onmicrosoft.com",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
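
If you prefer not to keep the password in the Data Factory definition at all, the password field can reference a secret stored in Azure Key Vault instead of a SecureString, as the table above mentions. The following is a minimal sketch assuming you have already created an Azure Key Vault linked service and stored the Dynamics password as a secret; the names are placeholders:

"password": {
    "type": "AzureKeyVaultSecret",
    "store": {
        "referenceName": "<Azure Key Vault linked service name>",
        "type": "LinkedServiceReference"
    },
    "secretName": "<name of the secret that holds the Dynamics password>"
}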

Dynamics 365 and Dynamics CRM on-premises with IFD


Compared to Dynamics online, the additional properties are "hostName" and "port".

PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to Dynamics. | Yes
deploymentType | The deployment type of the Dynamics instance. It must be "OnPremisesWithIfd" for Dynamics on-premises with IFD. | Yes
hostName | The host name of the on-premises Dynamics server. | Yes
port | The port of the on-premises Dynamics server. | No, default is 443
organizationName | The organization name of the Dynamics instance. | Yes
authenticationType | The authentication type to connect to the Dynamics server. Specify "Ifd" for Dynamics on-premises with IFD. | Yes
username | Specify the user name to connect to Dynamics. | Yes
password | Specify the password for the user account you specified for username. You can choose to mark this field as a SecureString to store it securely in ADF, or store the password in Azure Key Vault and let the copy activity pull it from there when performing data copy - learn more from Store credentials in Key Vault. | Yes
connectVia | The integration runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime. | No for source, Yes for sink

Example: Dynamics on-premises with IFD using IFD authentication

{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"description": "Dynamics on-premises with IFD linked service using IFD authentication",
"typeProperties": {
"deploymentType": "OnPremisesWithIFD",
"hostName": "contosodynamicsserver.contoso.com",
"port": 443,
"organizationName": "admsDynamicsTest",
"authenticationType": "Ifd",
"username": "test@contoso.onmicrosoft.com",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by Dynamics dataset.
To copy data from and to Dynamics, set the type property of the dataset to DynamicsEntity. The following
properties are supported.

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to DynamicsEntity. | Yes
entityName | The logical name of the entity to retrieve. | No for source (if "query" in the activity source is specified), Yes for sink
IMPORTANT
When you copy data from Dynamics, the "structure" section is optional but highly recommended in the Dynamics dataset to ensure a deterministic copy result. It defines the column name and data type for the Dynamics data that you want to copy over. To learn more, see Dataset structure and Data type mapping for Dynamics.
When you import a schema in the authoring UI, ADF infers the schema by sampling the top rows of the Dynamics query result to initialize the structure construction; columns with no values are omitted. The same behavior applies to copy executions if there is no explicit structure definition. You can review and add more columns to the Dynamics dataset schema/structure as needed, and they will be honored at copy runtime.
When you copy data to Dynamics, the "structure" section is optional in the Dynamics dataset. Which columns to copy into is determined by the source data schema. If your source is a CSV file without a header, specify the "structure" with the column names and data types in the input dataset. They map to fields in the CSV file one by one, in order.

Example:

{
"name": "DynamicsDataset",
"properties": {
"type": "DynamicsEntity",
"structure": [
{
"name": "accountid",
"type": "Guid"
},
{
"name": "name",
"type": "String"
},
{
"name": "marketingonly",
"type": "Boolean"
},
{
"name": "modifiedon",
"type": "Datetime"
}
],
"typeProperties": {
"entityName": "account"
},
"linkedServiceName": {
"referenceName": "<Dynamics linked service name>",
"type": "linkedservicereference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Dynamics source and sink types.
Dynamics as a source type
To copy data from Dynamics, set the source type in the copy activity to DynamicsSource. The following
properties are supported in the copy activity source section.
PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity source must be set to DynamicsSource. | Yes
query | FetchXML is a proprietary query language that is used in Dynamics (online and on-premises). See the following example. To learn more, see Build queries with FetchXML. | No (if "entityName" in the dataset is specified)

NOTE
The PK column will always be copied out even if the column projection you configure in the FetchXML query doesn't
contain it.

Example:

"activities":[
{
"name": "CopyFromDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<Dynamics input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DynamicsSource",
"query": "<FetchXML Query>"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Sample FetchXML query


<fetch>
<entity name="account">
<attribute name="accountid" />
<attribute name="name" />
<attribute name="marketingonly" />
<attribute name="modifiedon" />
<order attribute="modifiedon" descending="false" />
<filter type="and">
<condition attribute ="modifiedon" operator="between">
<value>2017-03-10 18:40:00z</value>
<value>2017-03-12 20:40:00z</value>
</condition>
</filter>
</entity>
</fetch>

Dynamics as a sink type


To copy data to Dynamics, set the sink type in the copy activity to DynamicsSink. The following properties are
supported in the copy activity sink section.

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity sink must be set to DynamicsSink. | Yes
writeBehavior | The write behavior of the operation. Allowed value is "Upsert". | Yes
writeBatchSize | The row count of data written to Dynamics in each batch. | No (default is 10)
ignoreNullValues | Indicates whether to ignore null values from input data (except key fields) during a write operation. Allowed values are true and false. - True: Leave the data in the destination object unchanged when you do an upsert/update operation. Insert a defined default value when you do an insert operation. - False: Update the data in the destination object to NULL when you do an upsert/update operation. Insert a NULL value when you do an insert operation. | No (default is false)

NOTE
The default value of the sink "writeBatchSize" and of the copy activity "parallelCopies" for the Dynamics sink are both 10, so 100 records are submitted to Dynamics concurrently.

For Dynamics 365 online, there is a limit of two concurrent batch calls per organization. If that limit is exceeded, a "Server Busy" fault is thrown before the first request is ever executed. Keeping "writeBatchSize" at 10 or less avoids such throttling of concurrent calls.
The optimal combination of "writeBatchSize" and "parallelCopies" depends on the schema of your entity, for example, the number of columns, the row size, and the number of plugins, workflows, or workflow activities hooked up to those calls. The default setting of writeBatchSize 10 * parallelCopies 10 is the recommendation from the Dynamics service; it works for most Dynamics entities, though it might not give the best performance. You can tune performance by adjusting this combination in your copy activity settings.
Example:

"activities":[
{
"name": "CopyToDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Dynamics output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "DynamicsSink",
"writeBehavior": "Upsert",
"writeBatchSize": 10,
"ignoreNullValues": true
}
}
}
]

Data type mapping for Dynamics


When you copy data from Dynamics, the following mappings are used from Dynamics data types to Data
Factory interim data types. To learn how the copy activity maps the source schema and data type to the sink, see
Schema and data type mappings.
Configure the corresponding Data Factory data type in a dataset structure based on your source Dynamics data
type by using the following mapping table.

DYNAMICS DATA TYPE | DATA FACTORY INTERIM DATA TYPE | SUPPORTED AS SOURCE | SUPPORTED AS SINK

AttributeTypeCode.BigInt | Long | ✓ | ✓
AttributeTypeCode.Boolean | Boolean | ✓ | ✓
AttributeType.Customer | Guid | ✓ |
AttributeType.DateTime | Datetime | ✓ | ✓
AttributeType.Decimal | Decimal | ✓ | ✓
AttributeType.Double | Double | ✓ | ✓
AttributeType.EntityName | String | ✓ | ✓
AttributeType.Integer | Int32 | ✓ | ✓
AttributeType.Lookup | Guid | ✓ | ✓ (with single target associated)
AttributeType.ManagedProperty | Boolean | ✓ |
AttributeType.Memo | String | ✓ | ✓
AttributeType.Money | Decimal | ✓ | ✓
AttributeType.Owner | Guid | ✓ |
AttributeType.Picklist | Int32 | ✓ | ✓
AttributeType.Uniqueidentifier | Guid | ✓ | ✓
AttributeType.String | String | ✓ | ✓
AttributeType.State | Int32 | ✓ | ✓
AttributeType.Status | Int32 | ✓ | ✓

NOTE
The Dynamics data types AttributeType.CalendarRules and AttributeType.PartyList aren't supported.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data to or from a file system by using Azure
Data Factory
5/6/2019 • 15 minutes to read

This article outlines how to copy data to and from a file system. To learn about Azure Data Factory, read the introductory article.

Supported capabilities
This file system connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Specifically, this file system connector supports:
Copying files from/to local machine or network file share. To use a Linux file share, install Samba on your
Linux server.
Copying files using Windows authentication.
Copying files as-is or parsing/generating files with the supported file formats and compression codecs.

Prerequisites
To copy data from/to a file system that is not publicly accessible, you need to set up a Self-hosted Integration
Runtime. See Self-hosted Integration Runtime article for details.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to file
system.

Linked service properties


The following properties are supported for file system linked service:
PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to: FileServer. | Yes
host | Specifies the root path of the folder that you want to copy. Use the escape character "\" for special characters in the string. See Sample linked service and dataset definitions for examples. | Yes
userid | Specify the ID of the user who has access to the server. | Yes
password | Specify the password for the user (userid). Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. You can use a Self-hosted Integration Runtime or the Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. | No

Sample linked service and dataset definitions


SCENARIO "HOST" IN LINKED SERVICE DEFINITION "FOLDERPATH" IN DATASET DEFINITION

Local folder on Integration Runtime In JSON: D:\\ In JSON: .\\ or folder\\subfolder


machine: On UI: D:\ On UI: .\ or folder\subfolder

Examples: D:\* or D:\folder\subfolder\*

Remote shared folder: In JSON: \\\\myserver\\share In JSON: .\\ or folder\\subfolder


On UI: \\myserver\share On UI: .\ or folder\subfolder
Examples: \\myserver\share\* or
\\myserver\share\folder\subfolder\*

NOTE
When authoring via the UI, you don't need to input a double backslash ( \\ ) to escape as you do via JSON; specify a single backslash instead.

Example:
{
"name": "FileLinkedService",
"properties": {
"type": "FileServer",
"typeProperties": {
"host": "<host>",
"userid": "<domain>\\<user>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
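
For a remote shared folder, a hedged variant of the same typeProperties block escapes the backslashes in JSON as shown in the sample definitions above; the server and share names are placeholders:

"typeProperties": {
    "host": "\\\\myserver\\share",
    "userid": "<domain>\\<user>",
    "password": {
        "type": "SecureString",
        "value": "<password>"
    }
}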

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data to and from file system in Parquet or delimited text format, refer to Parquet format and
Delimited text format article on format-based dataset and supported settings. The following properties are
supported for file system under location settings in format-based dataset:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property under location in the dataset must be set to FileServerLocation. | Yes
folderPath | The path to the folder. If you want to use a wildcard to filter folders, skip this setting and specify it in the activity source settings. | No
fileName | The file name under the given folderPath. If you want to use a wildcard to filter files, skip this setting and specify it in the activity source settings. | No

NOTE
The FileShare type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the Copy/Lookup/GetMetadata activities for backward compatibility, but it doesn't work with Mapping Data Flow. We recommend that you use this new model going forward; the ADF authoring UI has switched to generating these new types.

Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<File system linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "FileServerLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Other format dataset


To copy data to and from file system in ORC/Avro/JSON/Binary format, the following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to: FileShare | Yes
folderPath | Path to the folder. Wildcard filter is supported; allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has a wildcard or this escape char inside. Examples: rootfolder/subfolder/; see more examples in Sample linked service and dataset definitions and Folder and file filter examples. | No
fileName | Name or wildcard filter for the file(s) under the specified "folderPath". If you don't specify a value for this property, the dataset points to all files in the folder. For the filter, allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character). Example 1: "fileName": "*.csv". Example 2: "fileName": "???20180427.txt". Use ^ to escape if your actual file name has a wildcard or this escape char inside. When fileName isn't specified for an output dataset and preserveHierarchy isn't specified in the activity sink, the copy activity automatically generates the file name with the following pattern: "Data.[activity run ID GUID].[GUID if FlattenHierarchy].[format if configured].[compression if configured]", e.g. "Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt.gz"; if you copy from a tabular source using a table name instead of a query, the name pattern is "[table name].[format].[compression if configured]", e.g. "MyTable.csv". | No
modifiedDatetimeStart | Files filter based on the attribute: Last Modified. The files will be selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z". Be aware that the overall performance of data movement will be impacted by enabling this setting when you want to filter huge numbers of files. The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value are selected. | No
modifiedDatetimeEnd | Files filter based on the attribute: Last Modified. The files will be selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z". Be aware that the overall performance of data movement will be impacted by enabling this setting when you want to filter huge numbers of files. The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value are selected. | No
format | If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. If you want to parse or generate files with a specific format, the following file format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections. | No (only for binary copy scenario)
compression | Specify the type and level of compression for the data. For more information, see Supported file formats and compression codecs. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. | No

TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.

NOTE
If you were using the "fileFilter" property for the file filter, it is still supported as-is, but we recommend that you use the new filter capability added to "fileName" going forward.

Example:
{
"name": "FileSystemDataset",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<file system linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by file system source and sink.
File system as source
For copy from Parquet and delimited text format, refer to Parquet and delimited text format source
section.
For copy from other formats like ORC/Avro/JSON/Binary format, refer to Other format source section.
Parquet and delimited text format source
To copy data from file system in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based copy activity source and supported settings. The following properties are
supported for file system under storeSettings settings in format-based copy source:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property under storeSettings must be set to FileServerReadSetting. | Yes
recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false. | No
wildcardFolderPath | The folder path with wildcard characters to filter source folders. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has a wildcard or this escape char inside. See more examples in Folder and file filter examples. | No
wildcardFileName | The file name with wildcard characters under the given folderPath/wildcardFolderPath to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual file name has a wildcard or this escape char inside. See more examples in Folder and file filter examples. | Yes if fileName is not specified in the dataset
modifiedDatetimeStart | Files filter based on the attribute: Last Modified. The files will be selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z". The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value are selected. | No
modifiedDatetimeEnd | Same as above. | No
maxConcurrentConnections | The number of connections that can connect to the storage store concurrently. Specify a value only when you want to limit concurrent connections to the data store. | No
NOTE
For Parquet/delimited text format, the FileSystemSource type copy activity source mentioned in the next section is still supported as-is for backward compatibility. We recommend that you use this new model going forward; the ADF authoring UI has switched to generating these new types.

Example:

"activities":[
{
"name": "CopyFromFileSystem",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "FileServerReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
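
If you want to pick up only files changed in a given window instead of (or in addition to) using wildcards, a hedged variant of the storeSettings block uses the last-modified filter properties described in the table above; the timestamps are illustrative:

"storeSettings": {
    "type": "FileServerReadSetting",
    "recursive": true,
    "modifiedDatetimeStart": "2018-12-01T05:00:00Z",
    "modifiedDatetimeEnd": "2018-12-01T06:00:00Z"
}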

Other format source


To copy data from file system in ORC/Avro/JSON/Binary format, the following properties are supported in
the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity source must be set to: FileSystemSource | Yes
recursive | Indicates whether the data is read recursively from the sub-folders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder/sub-folder will not be copied/created at the sink. Allowed values are: true (default), false | No
maxConcurrentConnections | The number of connections to connect to the storage store concurrently. Specify only when you want to limit the concurrent connections to the data store. | No

Example:

"activities":[
{
"name": "CopyFromFileSystem",
"type": "Copy",
"inputs": [
{
"referenceName": "<file system input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

File system as sink


For copy to Parquet and delimited text format, refer to Parquet and delimited text format sink section.
For copy to other formats like ORC/Avro/JSON/Binary format, refer to Other format sink section.
Parquet and delimited text format sink
To copy data to file system in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based copy activity sink and supported settings. The following properties are supported
for file system under storeSettings settings in format-based copy sink:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property under storeSettings must be set to FileServerWriteSetting. | Yes
copyBehavior | Defines the copy behavior when the source is files from a file-based data store. Allowed values are: - PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder. - FlattenHierarchy: All files from the source folder are in the first level of the target folder. The target files have autogenerated names. - MergeFiles: Merges all files from the source folder to one file. If the file name is specified, the merged file name is the specified name. Otherwise, it's an autogenerated file name. | No
maxConcurrentConnections | The number of connections to connect to the data store concurrently. Specify only when you want to limit the concurrent connections to the data store. | No

NOTE
For Parquet/delimited text format, the FileSystemSink type copy activity sink mentioned in the next section is still supported as-is for backward compatibility. We recommend that you use this new model going forward; the ADF authoring UI has switched to generating these new types.

Example:
"activities":[
{
"name": "CopyToFileSystem",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "FileServerWriteSetting",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]

Other format sink


To copy data to file system in ORC/Avro/JSON/Binary format, the following properties are supported in the
sink section:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity sink must be set to: FileSystemSink | Yes
copyBehavior | Defines the copy behavior when the source is files from a file-based data store. Allowed values are: - PreserveHierarchy (default): preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder. - FlattenHierarchy: all files from the source folder are in the first level of the target folder. The target files have auto-generated names. - MergeFiles: merges all files from the source folder to one file. If the File/Blob Name is specified, the merged file name is the specified name; otherwise, it is an auto-generated file name. | No
maxConcurrentConnections | The number of connections to connect to the storage store concurrently. Specify only when you want to limit the concurrent connections to the data store. | No

Example:

"activities":[
{
"name": "CopyToFileSystem",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<file system output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "FileSystemSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

FOLDERPATH | FILENAME | RECURSIVE | FILTER RESULT

The source structure in every case is: FolderA containing File1.csv, File2.json, and Subfolder1 (with File3.csv, File4.json, File5.csv), plus a sibling folder AnotherFolderB containing File6.csv.

Folder* | (empty, use default) | false | Retrieved: FolderA's File1.csv and File2.json.
Folder* | (empty, use default) | true | Retrieved: FolderA's File1.csv and File2.json, plus Subfolder1's File3.csv, File4.json, and File5.csv.
Folder* | *.csv | false | Retrieved: FolderA's File1.csv.
Folder* | *.csv | true | Retrieved: FolderA's File1.csv, plus Subfolder1's File3.csv and File5.csv.

AnotherFolderB and its File6.csv are never retrieved, because "AnotherFolderB" doesn't match the folder path filter Folder*.

recursive and copyBehavior examples


This section describes the resulting behavior of the Copy operation for different combinations of recursive and
copyBehavior values.

RECURSIVE | COPYBEHAVIOR | SOURCE FOLDER STRUCTURE | RESULTING TARGET

true | preserveHierarchy | Folder1 containing File1, File2, and Subfolder1 with File3, File4, File5 | The target folder Folder1 is created with the same structure as the source: Folder1 containing File1, File2, and Subfolder1 with File3, File4, File5.
true | flattenHierarchy | Folder1 containing File1, File2, and Subfolder1 with File3, File4, File5 | The target Folder1 is created with auto-generated names for File1, File2, File3, File4, and File5, all in its first level.
true | mergeFiles | Folder1 containing File1, File2, and Subfolder1 with File3, File4, File5 | The target Folder1 is created with one file whose contents are File1 + File2 + File3 + File4 + File5 merged together, with an auto-generated file name.
false | preserveHierarchy | Folder1 containing File1, File2, and Subfolder1 with File3, File4, File5 | The target folder Folder1 is created with File1 and File2. Subfolder1 with File3, File4, and File5 is not picked up.
false | flattenHierarchy | Folder1 containing File1, File2, and Subfolder1 with File3, File4, File5 | The target folder Folder1 is created with auto-generated names for File1 and File2. Subfolder1 with File3, File4, and File5 is not picked up.
false | mergeFiles | Folder1 containing File1, File2, and Subfolder1 with File3, File4, File5 | The target folder Folder1 is created with one file whose contents are File1 + File2 merged together, with an auto-generated file name. Subfolder1 with File3, File4, and File5 is not picked up.
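
As a minimal sketch of the last row above, the following copy activity fragment copies only the top-level files of the source folder and merges them into a single output file. It reuses the FileSystemSource and FileSystemSink properties documented earlier in this article; the surrounding inputs/outputs are omitted and would use the same placeholders as the earlier examples:

"typeProperties": {
    "source": {
        "type": "FileSystemSource",
        "recursive": false
    },
    "sink": {
        "type": "FileSystemSink",
        "copyBehavior": "MergeFiles"
    }
}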

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from FTP server by using Azure Data
Factory
5/6/2019 • 9 minutes to read

This article outlines how to copy data from an FTP server. To learn about Azure Data Factory, read the introductory article.

Supported capabilities
This FTP connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Specifically, this FTP connector supports:
Copying files using Basic or Anonymous authentication.
Copying files as-is or parsing files with the supported file formats and compression codecs.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
FTP.

Linked service properties


The following properties are supported for FTP linked service:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to: FtpServer. | Yes
host | Specify the name or IP address of the FTP server. | Yes
port | Specify the port on which the FTP server is listening. Allowed values are: integer, default value is 21. | No
enableSsl | Specify whether to use FTP over an SSL/TLS channel. Allowed values are: true (default), false. | No
enableServerCertificateValidation | Specify whether to enable server SSL certificate validation when you are using FTP over an SSL/TLS channel. Allowed values are: true (default), false. | No
authenticationType | Specify the authentication type. Allowed values are: Basic, Anonymous | Yes
userName | Specify the user who has access to the FTP server. | No
password | Specify the password for the user (userName). Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No
connectVia | The Integration Runtime to be used to connect to the data store. You can use the Azure Integration Runtime or a Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. | No

NOTE
The FTP connector supports accessing FTP server with either no encryption or explicit SSL/TLS encryption; it doesn’t
support implicit SSL/TLS encryption.

Example 1: using Anonymous authentication


{
"name": "FTPLinkedService",
"properties": {
"type": "FtpServer",
"typeProperties": {
"host": "<ftp server>",
"port": 21,
"enableSsl": true,
"enableServerCertificateValidation": true,
"authenticationType": "Anonymous"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: using Basic authentication

{
"name": "FTPLinkedService",
"properties": {
"type": "FtpServer",
"typeProperties": {
"host": "<ftp server>",
"port": 21,
"enableSsl": true,
"enableServerCertificateValidation": true,
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data from FTP in Parquet or delimited text format, refer to Parquet format and Delimited text format
article on format-based dataset and supported settings. The following properties are supported for FTP under
location settings in format-based dataset:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property under location in the dataset must be set to FtpServerLocation. | Yes
folderPath | The path to the folder. If you want to use a wildcard to filter folders, skip this setting and specify it in the activity source settings. | No
fileName | The file name under the given folderPath. If you want to use a wildcard to filter files, skip this setting and specify it in the activity source settings. | No

NOTE
The FileShare type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the Copy/Lookup/GetMetadata activities for backward compatibility. We recommend that you use this new model going forward; the ADF authoring UI has switched to generating these new types.

Example:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<FTP linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "FtpServerLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Other format dataset


To copy data from FTP in ORC/Avro/JSON/Binary format, the following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to: FileShare | Yes
folderPath | Path to the folder. Wildcard filter is supported; allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has a wildcard or this escape char inside. Examples: rootfolder/subfolder/; see more examples in Folder and file filter examples. | Yes
fileName | Name or wildcard filter for the file(s) under the specified "folderPath". If you don't specify a value for this property, the dataset points to all files in the folder. For the filter, allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character). Example 1: "fileName": "*.csv". Example 2: "fileName": "???20180427.txt". Use ^ to escape if your actual file name has a wildcard or this escape char inside. | No
format | If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. If you want to parse files with a specific format, the following file format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections. | No (only for binary copy scenario)
compression | Specify the type and level of compression for the data. For more information, see Supported file formats and compression codecs. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. | No
useBinaryTransfer | Specify whether to use the binary transfer mode. The values are true for binary mode (default), and false for ASCII. | No
TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.

NOTE
If you were using the "fileFilter" property for the file filter, it is still supported as-is, but we recommend that you use the new filter capability added to "fileName" going forward.

Example:

{
"name": "FTPDataset",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<FTP linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "myfile.csv.gz",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by FTP source.
FTP as source
For copy from Parquet and delimited text format, refer to Parquet and delimited text format source
section.
For copy from other formats like ORC/Avro/JSON/Binary format, refer to Other format source section.
Parquet and delimited text format source
To copy data from FTP in Parquet or delimited text format, refer to Parquet format and Delimited text format
article on format-based copy activity source and supported settings. The following properties are supported for
FTP under storeSettings settings in format-based copy source:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property under storeSettings must be set to FtpReadSetting. | Yes
recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false. | No
wildcardFolderPath | The folder path with wildcard characters to filter source folders. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has a wildcard or this escape char inside. See more examples in Folder and file filter examples. | No
wildcardFileName | The file name with wildcard characters under the given folderPath/wildcardFolderPath to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual file name has a wildcard or this escape char inside. See more examples in Folder and file filter examples. | Yes if fileName is not specified in the dataset
modifiedDatetimeStart | Files filter based on the attribute: Last Modified. The files will be selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z". The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value are selected. | No
modifiedDatetimeEnd | Same as above. | No
useBinaryTransfer | Specify whether to use the binary transfer mode for FTP stores. The values are true for binary mode (default), and false for ASCII. | No
maxConcurrentConnections | The number of connections that can connect to the storage store concurrently. Specify a value only when you want to limit concurrent connections to the data store. | No

NOTE
For Parquet/delimited text format, the FileSystemSource type copy activity source mentioned in the next section is still supported as-is for backward compatibility. However, we recommend that you use the new model going forward; the ADF authoring UI has switched to generating these new types.

Example:

"activities":[
{
"name": "CopyFromFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "FtpReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Other format source


To copy data from FTP in ORC/Avro/JSON/Binary format, the following properties are supported in the copy
activity source section:

type: The type property of the copy activity source must be set to: FileSystemSource (Required: Yes)

recursive: Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are: true (default), false. (Required: No)

maxConcurrentConnections: The number of connections used to connect to the storage store concurrently. Specify only when you want to limit the concurrent connections to the data store. (Required: No)

Example:

"activities":[
{
"name": "CopyFromFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<FTP input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

The examples assume the following source folder structure:

FolderA
    File1.csv
    File2.json
    Subfolder1
        File3.csv
        File4.json
        File5.csv
AnotherFolderB
    File6.csv

folderPath: Folder*, fileName: (empty, use default), recursive: false
Result: File1.csv and File2.json are retrieved.

folderPath: Folder*, fileName: (empty, use default), recursive: true
Result: File1.csv, File2.json, File3.csv, File4.json, and File5.csv are retrieved.

folderPath: Folder*, fileName: *.csv, recursive: false
Result: File1.csv is retrieved.

folderPath: Folder*, fileName: *.csv, recursive: true
Result: File1.csv, File3.csv, and File5.csv are retrieved.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Google AdWords using Azure Data
Factory (Preview)
2/1/2019 • 4 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Google AdWords. It
builds on the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and provide feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Google AdWords to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install any driver to use this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Google AdWords connector.

Linked service properties


The following properties are supported for Google AdWords linked service:

type: The type property must be set to: GoogleAdWords (Required: Yes)

clientCustomerID: The Client customer ID of the AdWords account that you want to fetch report data for. (Required: Yes)

developerToken: The developer token associated with the manager account that you use to grant access to the AdWords API. You can choose to mark this field as a SecureString to store it securely in ADF, or store it in Azure Key Vault and let the ADF copy activity pull it from there when performing the data copy; learn more from Store credentials in Key Vault. (Required: Yes)

authenticationType: The OAuth 2.0 authentication mechanism used for authentication. ServiceAuthentication can only be used on a self-hosted IR. Allowed values are: ServiceAuthentication, UserAuthentication. (Required: Yes)

refreshToken: The refresh token obtained from Google for authorizing access to AdWords for UserAuthentication. You can choose to mark this field as a SecureString to store it securely in ADF, or store it in Azure Key Vault and let the ADF copy activity pull it from there when performing the data copy; learn more from Store credentials in Key Vault. (Required: No)

clientId: The client ID of the Google application used to acquire the refresh token. You can choose to mark this field as a SecureString to store it securely in ADF, or store it in Azure Key Vault and let the ADF copy activity pull it from there when performing the data copy; learn more from Store credentials in Key Vault. (Required: No)

clientSecret: The client secret of the Google application used to acquire the refresh token. You can choose to mark this field as a SecureString to store it securely in ADF, or store it in Azure Key Vault and let the ADF copy activity pull it from there when performing the data copy; learn more from Store credentials in Key Vault. (Required: No)

email: The service account email ID that is used for ServiceAuthentication and can only be used on a self-hosted IR. (Required: No)

keyFilePath: The full path to the .p12 key file that is used to authenticate the service account email address and can only be used on a self-hosted IR. (Required: No)

trustedCertPath: The full path of the .pem file containing trusted CA certificates for verifying the server when connecting over SSL. This property can only be set when using SSL on a self-hosted IR. The default value is the cacerts.pem file installed with the IR. (Required: No)

useSystemTrustStore: Specifies whether to use a CA certificate from the system trust store or from a specified PEM file. The default value is false. (Required: No)

Example:

{
"name": "GoogleAdWordsLinkedService",
"properties": {
"type": "GoogleAdWords",
"typeProperties": {
"clientCustomerID" : "<clientCustomerID>",
"developerToken": {
"type": "SecureString",
"value": "<developerToken>"
},
"authenticationType" : "ServiceAuthentication",
"refreshToken": {
"type": "SecureString",
"value": "<refreshToken>"
},
"clientId": {
"type": "SecureString",
"value": "<clientId>"
},
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"email" : "<email>",
"keyFilePath" : "<keyFilePath>",
"trustedCertPath" : "<trustedCertPath>",
"useSystemTrustStore" : true,
}
}
}
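
The table above notes that secrets such as developerToken can also be pulled from Azure Key Vault instead of being stored inline. The following is an illustrative sketch only (names and secret references are placeholders); it reuses the same AzureKeyVaultSecret pattern shown elsewhere in this documentation:

{
    "name": "GoogleAdWordsLinkedService",
    "properties": {
        "type": "GoogleAdWords",
        "typeProperties": {
            "clientCustomerID": "<clientCustomerID>",
            "developerToken": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secret that holds the developer token>"
            },
            "authenticationType": "UserAuthentication",
            "refreshToken": {
                "type": "SecureString",
                "value": "<refreshToken>"
            },
            "clientId": {
                "type": "SecureString",
                "value": "<clientId>"
            },
            "clientSecret": {
                "type": "SecureString",
                "value": "<clientSecret>"
            }
        }
    }
}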

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Google AdWords dataset.
To copy data from Google AdWords, set the type property of the dataset to GoogleAdWordsObject. The
following properties are supported:

type: The type property of the dataset must be set to: GoogleAdWordsObject (Required: Yes)

tableName: Name of the table. (Required: No, if "query" in the activity source is specified)

Example

{
"name": "GoogleAdWordsDataset",
"properties": {
"type": "GoogleAdWordsObject",
"linkedServiceName": {
"referenceName": "<GoogleAdWords linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Google AdWords source.
Google AdWords as source
To copy data from Google AdWords, set the source type in the copy activity to GoogleAdWordsSource. The
following properties are supported in the copy activity source section:

type: The type property of the copy activity source must be set to: GoogleAdWordsSource (Required: Yes)

query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". (Required: No, if "tableName" in the dataset is specified)

Example:
"activities":[
{
"name": "CopyFromGoogleAdWords",
"type": "Copy",
"inputs": [
{
"referenceName": "<GoogleAdWords input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "GoogleAdWordsSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
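
The query in the example above is a generic placeholder. As an illustrative sketch only (the report and column names below are hypothetical and depend on the report schema exposed by the built-in driver), a query that reads selected columns from a campaign performance report is supplied in the same way:

"source": {
    "type": "GoogleAdWordsSource",
    "query": "SELECT CampaignId, Impressions, Clicks FROM CAMPAIGN_PERFORMANCE_REPORT"
}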

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Google BigQuery by using Azure
Data Factory
1/30/2019 • 5 minutes to read

This article outlines how to use Copy Activity in Azure Data Factory to copy data from Google BigQuery. It builds
on the Copy Activity overview article that presents a general overview of the copy activity.

Supported capabilities
You can copy data from Google BigQuery to any supported sink data store. For a list of data stores that are
supported as sources or sinks by the copy activity, see the Supported data stores table.
Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install a driver
to use this connector.

NOTE
This Google BigQuery connector is built on top of the BigQuery APIs. Be aware that BigQuery limits the maximum rate of incoming requests and enforces appropriate quotas on a per-project basis; refer to Quotas & Limits - API requests. Make sure you do not trigger too many concurrent requests to the account.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to the
Google BigQuery connector.

Linked service properties


The following properties are supported for the Google BigQuery linked service.

type: The type property must be set to GoogleBigQuery. (Required: Yes)

project: The project ID of the default BigQuery project to query against. (Required: Yes)

additionalProjects: A comma-separated list of project IDs of public BigQuery projects to access. (Required: No)

requestGoogleDriveScope: Whether to request access to Google Drive. Allowing Google Drive access enables support for federated tables that combine BigQuery data with data from Google Drive. The default value is false. (Required: No)

authenticationType: The OAuth 2.0 authentication mechanism used for authentication. ServiceAuthentication can be used only on Self-hosted Integration Runtime. Allowed values are UserAuthentication and ServiceAuthentication. Refer to the sections below this table for more properties and JSON samples for those authentication types, respectively. (Required: Yes)

Using user authentication


Set "authenticationType" property to UserAuthentication, and specify the following properties along with
generic properties described in the previous section:

clientId: ID of the application used to generate the refresh token. (Required: No)

clientSecret: Secret of the application used to generate the refresh token. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. (Required: No)

refreshToken: The refresh token obtained from Google, used to authorize access to BigQuery. Learn how to get one from Obtaining OAuth 2.0 access tokens and this community blog. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. (Required: No)

Example:
{
"name": "GoogleBigQueryLinkedService",
"properties": {
"type": "GoogleBigQuery",
"typeProperties": {
"project" : "<project ID>",
"additionalProjects" : "<additional project IDs>",
"requestGoogleDriveScope" : true,
"authenticationType" : "UserAuthentication",
"clientId": "<id of the application used to generate the refresh token>",
"clientSecret": {
"type": "SecureString",
"value":"<secret of the application used to generate the refresh token>"
},
"refreshToken": {
"type": "SecureString",
"value": "<refresh token>"
}
}
}
}

Using service authentication


Set "authenticationType" property to ServiceAuthentication, and specify the following properties along with
generic properties described in the previous section. This authentication type can be used only on Self-hosted
Integration Runtime.

email: The service account email ID that is used for ServiceAuthentication. It can be used only on Self-hosted Integration Runtime. (Required: No)

keyFilePath: The full path to the .p12 key file that is used to authenticate the service account email address. (Required: No)

trustedCertPath: The full path of the .pem file that contains trusted CA certificates used to verify the server when you connect over SSL. This property can be set only when you use SSL on Self-hosted Integration Runtime. The default value is the cacerts.pem file installed with the integration runtime. (Required: No)

useSystemTrustStore: Specifies whether to use a CA certificate from the system trust store or from a specified .pem file. The default value is false. (Required: No)

Example:
{
"name": "GoogleBigQueryLinkedService",
"properties": {
"type": "GoogleBigQuery",
"typeProperties": {
"project" : "<project id>",
"requestGoogleDriveScope" : true,
"authenticationType" : "ServiceAuthentication",
"email": "<email>",
"keyFilePath": "<.p12 key path on the IR machine>"
},
"connectVia": {
"referenceName": "<name of Self-hosted Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Google BigQuery dataset.
To copy data from Google BigQuery, set the type property of the dataset to GoogleBigQueryObject. The
following properties are supported:

type: The type property of the dataset must be set to: GoogleBigQueryObject (Required: Yes)

tableName: Name of the table. (Required: No, if "query" in the activity source is specified)

Example

{
"name": "GoogleBigQueryDataset",
"properties": {
"type": "GoogleBigQueryObject",
"linkedServiceName": {
"referenceName": "<GoogleBigQuery linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Google BigQuery source type.
GoogleBigQuerySource as a source type
To copy data from Google BigQuery, set the source type in the copy activity to GoogleBigQuerySource. The
following properties are supported in the copy activity source section.
type: The type property of the copy activity source must be set to GoogleBigQuerySource. (Required: Yes)

query: Use the custom SQL query to read data. An example is "SELECT * FROM MyTable". (Required: No, if "tableName" in the dataset is specified)

Example:

"activities":[
{
"name": "CopyFromGoogleBigQuery",
"type": "Copy",
"inputs": [
{
"referenceName": "<GoogleBigQuery input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "GoogleBigQuerySource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
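
The query in the example above is a placeholder. As an illustrative sketch only (the dataset and table names are hypothetical, and the exact qualification syntax depends on the SQL dialect your BigQuery driver is configured for), a dataset-qualified query is supplied the same way:

"source": {
    "type": "GoogleBigQuerySource",
    "query": "SELECT * FROM MyDataset.MyTable"
}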

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Google Cloud Storage using Azure
Data Factory
5/6/2019 • 10 minutes to read

This article outlines how to copy data from Google Cloud Storage. To learn about Azure Data Factory, read the
introductory article.

Supported capabilities
This Google Cloud Storage connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Specifically, this Google Cloud Storage connector supports copying files as-is or parsing files with the supported
file formats and compression codecs.

NOTE
Copying data from Google Cloud Storage leverages the Amazon S3 connector with corresponding custom S3 endpoint, as
Google Cloud Storage provides S3-compatible interoperability.

Required permissions
To copy data from Google Cloud Storage, make sure you have been granted the following permissions:
For copy activity execution: s3:GetObject and s3:GetObjectVersion for Object Operations.
For Data Factory GUI authoring: s3:ListAllMyBuckets and s3:ListBucket / s3:GetBucketLocation for Bucket Operations permissions are additionally required for operations like test connection and browsing/navigating file paths. If you don't want to grant these permissions, skip the test connection on the linked service creation page and specify the path directly in the dataset settings.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Google Cloud Storage.
Linked service properties
The following properties are supported for Google Cloud Storage linked service:

type: The type property must be set to AmazonS3. (Required: Yes)

accessKeyId: ID of the secret access key. To find the access key and secret, go to Google Cloud Storage > Settings > Interoperability. (Required: Yes)

secretAccessKey: The secret access key itself. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. (Required: Yes)

serviceUrl: Specify the custom S3 endpoint as https://storage.googleapis.com . (Required: Yes)

connectVia: The Integration Runtime to be used to connect to the data store. You can use Azure Integration Runtime or Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. (Required: No)

Here is an example:

{
"name": "GoogleCloudStorageLinkedService",
"properties": {
"type": "AmazonS3",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": {
"type": "SecureString",
"value": "<secret access key>"
},
"serviceUrl": "https://storage.googleapis.com"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data from Google Cloud Storage in Parquet or delimited text format, refer to Parquet format and
Delimited text format article on format-based dataset and supported settings. The following properties are
supported for Google Cloud Storage under location settings in format-based dataset:

type: The type property under location in the dataset must be set to AmazonS3Location. (Required: Yes)

bucketName: The S3 bucket name. (Required: Yes)

folderPath: The path to the folder under the given bucket. If you want to use a wildcard to filter the folder, skip this setting and specify it in the activity source settings. (Required: No)

fileName: The file name under the given bucket + folderPath. If you want to use a wildcard to filter files, skip this setting and specify it in the activity source settings. (Required: No)

NOTE
The AmazonS3Object type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the Copy/Lookup/GetMetadata activity for backward compatibility. However, we recommend that you use the new model going forward; the ADF authoring UI has switched to generating these new types.

Example:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Google Cloud Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AmazonS3Location",
"bucketName": "bucketname",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Other format dataset


To copy data from Google Cloud Storage in ORC/Avro/JSON/Binary format, the following properties are
supported:
type: The type property of the dataset must be set to: AmazonS3Object (Required: Yes)

bucketName: The S3 bucket name. Wildcard filter is not supported. (Required: Yes for Copy/Lookup activity, No for GetMetadata activity)

key: The name or wildcard filter of the S3 object key under the specified bucket. Applies only when the "prefix" property is not specified. The wildcard filter is supported for both the folder part and the file name part. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character). Example 1: "key": "rootfolder/subfolder/*.csv". Example 2: "key": "rootfolder/subfolder/???20180427.txt". See more examples in Folder and file filter examples. Use ^ to escape if your actual folder/file name has a wildcard or this escape character inside. (Required: No)

prefix: Prefix for the S3 object key. Objects whose keys start with this prefix are selected. Applies only when the "key" property is not specified. (Required: No)

version: The version of the S3 object, if S3 versioning is enabled. (Required: No)

modifiedDatetimeStart: Files filter based on the attribute Last Modified. The files are selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z". The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value are selected. (Required: No)

modifiedDatetimeEnd: Same as above. (Required: No)

format: If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. If you want to parse or generate files with a specific format, the following file format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections. (Required: No (only for binary copy scenario))

compression: Specify the type and level of compression for the data. For more information, see Supported file formats and compression codecs. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. (Required: No)

TIP
To copy all files under a folder, specify bucketName for bucket and prefix for folder part.
To copy a single file with a given name, specify bucketName for bucket and key for folder part plus file name.
To copy a subset of files under a folder, specify bucketName for bucket and key for folder part plus wildcard filter.

Example: using prefix


{
"name": "GoogleCloudStorageDataset",
"properties": {
"type": "AmazonS3Object",
"linkedServiceName": {
"referenceName": "<linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"bucketName": "testbucket",
"prefix": "testFolder/test",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
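
Example: using key with a wildcard filter

The following is an illustrative sketch only (not from the original article; the bucket and path values are placeholders). It uses the key property with a wildcard, as described in the table above, instead of prefix:

{
    "name": "GoogleCloudStorageDataset",
    "properties": {
        "type": "AmazonS3Object",
        "linkedServiceName": {
            "referenceName": "<linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "bucketName": "testbucket",
            "key": "testFolder/*.csv",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "rowDelimiter": "\n"
            }
        }
    }
}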

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Google Cloud Storage source.
Google Cloud Storage as source
For copy from Parquet and delimited text format, refer to Parquet and delimited text format source
section.
For copy from other formats like ORC/Avro/JSON/Binary format, refer to Other format source section.
Parquet and delimited text format source
To copy data from Google Cloud Storage in Parquet or delimited text format, refer to Parquet format and
Delimited text format article on format-based copy activity source and supported settings. The following
properties are supported for Google Cloud Storage under storeSettings settings in format-based copy source:

type: The type property under storeSettings must be set to AmazonS3ReadSetting. (Required: Yes)

recursive: Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false. (Required: No)

prefix: Prefix for the S3 object key under the given bucket configured in the dataset, used to filter source objects. Objects whose keys start with this prefix are selected. Applies only when the wildcardFolderPath and wildcardFileName properties are not specified. (Required: No)

wildcardFolderPath: The folder path with wildcard characters under the given bucket configured in the dataset, used to filter source folders. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has a wildcard or this escape character inside. See more examples in Folder and file filter examples. (Required: No)

wildcardFileName: The file name with wildcard characters under the given bucket + folderPath/wildcardFolderPath, used to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual file name has a wildcard or this escape character inside. See more examples in Folder and file filter examples. (Required: Yes if fileName in the dataset and prefix are not specified)

modifiedDatetimeStart: Files filter based on the attribute Last Modified. The files are selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z". The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value are selected. (Required: No)

modifiedDatetimeEnd: Same as above. (Required: No)

maxConcurrentConnections: The number of connections used to connect to the storage store concurrently. Specify only when you want to limit the concurrent connections to the data store. (Required: No)

NOTE
For Parquet/delimited text format, the FileSystemSource type copy activity source mentioned in the next section is still supported as-is for backward compatibility. However, we recommend that you use the new model going forward; the ADF authoring UI has switched to generating these new types.

Example:

"activities":[
{
"name": "CopyFromGoogleCloudStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "AmazonS3ReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
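
The example above filters with wildcards. As an illustrative sketch only (the prefix value is a placeholder), the same source can instead filter objects by prefix under storeSettings, as described in the table above:

"source": {
    "type": "DelimitedTextSource",
    "formatSettings": {
        "type": "DelimitedTextReadSetting"
    },
    "storeSettings": {
        "type": "AmazonS3ReadSetting",
        "recursive": true,
        "prefix": "myfolder/myfileprefix"
    }
}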

Other format source


To copy data from Google Cloud Storage in ORC/Avro/JSON/Binary format, the following properties are
supported in the copy activity source section:

type: The type property of the copy activity source must be set to: FileSystemSource (Required: Yes)

recursive: Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are: true (default), false. (Required: No)

maxConcurrentConnections: The number of connections used to connect to the storage store concurrently. Specify only when you want to limit the concurrent connections to the data store. (Required: No)

Example:

"activities":[
{
"name": "CopyFromGoogleCloudStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

The examples assume the following source folder structure (the bucket is named "bucket"):

bucket
    FolderA
        File1.csv
        File2.json
        Subfolder1
            File3.csv
            File4.json
            File5.csv
    AnotherFolderB
        File6.csv

bucket: bucket, key: Folder*/*, recursive: false
Result: File1.csv and File2.json are retrieved.

bucket: bucket, key: Folder*/*, recursive: true
Result: File1.csv, File2.json, File3.csv, File4.json, and File5.csv are retrieved.

bucket: bucket, key: Folder*/*.csv, recursive: false
Result: File1.csv is retrieved.

bucket: bucket, key: Folder*/*.csv, recursive: true
Result: File1.csv, File3.csv, and File5.csv are retrieved.

Next steps
For a list of data stores that are supported as sources and sinks by the copy activity in Azure Data Factory, see
supported data stores.
Copy data from Greenplum using Azure Data
Factory
2/1/2019 • 3 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Greenplum. It builds on
the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from Greenplum to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install any driver to use this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Greenplum connector.

Linked service properties


The following properties are supported for Greenplum linked service:

type: The type property must be set to: Greenplum (Required: Yes)

connectionString: An ODBC connection string to connect to Greenplum. Mark this field as a SecureString to store it securely in Data Factory. You can also put the password in Azure Key Vault and pull the pwd configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. (Required: Yes)

connectVia: The Integration Runtime to be used to connect to the data store. You can use Self-hosted Integration Runtime or Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. (Required: No)

Example:

{
"name": "GreenplumLinkedService",
"properties": {
"type": "Greenplum",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "HOST=<server>;PORT=<port>;DB=<database>;UID=<user name>;PWD=<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store password in Azure Key Vault

{
"name": "GreenplumLinkedService",
"properties": {
"type": "Greenplum",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "HOST=<server>;PORT=<port>;DB=<database>;UID=<user name>;"
},
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Greenplum dataset.
To copy data from Greenplum, set the type property of the dataset to GreenplumTable. The following properties
are supported:

type: The type property of the dataset must be set to: GreenplumTable (Required: Yes)

tableName: Name of the table. (Required: No, if "query" in the activity source is specified)

Example

{
"name": "GreenplumDataset",
"properties": {
"type": "GreenplumTable",
"linkedServiceName": {
"referenceName": "<Greenplum linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Greenplum source.
GreenplumSource as source
To copy data from Greenplum, set the source type in the copy activity to GreenplumSource. The following
properties are supported in the copy activity source section:

type: The type property of the copy activity source must be set to: GreenplumSource (Required: Yes)

query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". (Required: No, if "tableName" in the dataset is specified)

Example:
"activities":[
{
"name": "CopyFromGreenplum",
"type": "Copy",
"inputs": [
{
"referenceName": "<Greenplum input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "GreenplumSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from HBase using Azure Data Factory
3/14/2019 • 4 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from HBase. It builds on the
copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from HBase to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install any driver to use this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
HBase connector.

Linked service properties


The following properties are supported for HBase linked service:

type: The type property must be set to: HBase (Required: Yes)

host: The IP address or host name of the HBase server (for example, [clustername].azurehdinsight.net or 192.168.222.160). (Required: Yes)

port: The TCP port that the HBase instance uses to listen for client connections. The default value is 9090. If you connect to Azure HDInsight, specify the port as 443. (Required: No)

httpPath: The partial URL corresponding to the HBase server, e.g. /hbaserest0 when using an HDInsight cluster. (Required: No)

authenticationType: The authentication mechanism to use to connect to the HBase server. Allowed values are: Anonymous, Basic. (Required: Yes)

username: The user name used to connect to the HBase instance. (Required: No)

password: The password corresponding to the user name. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. (Required: No)

enableSsl: Specifies whether the connections to the server are encrypted using SSL. The default value is false. (Required: No)

trustedCertPath: The full path of the .pem file containing trusted CA certificates for verifying the server when connecting over SSL. This property can only be set when using SSL on a self-hosted IR. The default value is the cacerts.pem file installed with the IR. (Required: No)

allowHostNameCNMismatch: Specifies whether to require a CA-issued SSL certificate name to match the host name of the server when connecting over SSL. The default value is false. (Required: No)

allowSelfSignedServerCert: Specifies whether to allow self-signed certificates from the server. The default value is false. (Required: No)

connectVia: The Integration Runtime to be used to connect to the data store. You can use Self-hosted Integration Runtime or Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. (Required: No)

NOTE
If your cluster doesn't support sticky sessions (for example, HDInsight), explicitly add the node index at the end of the http path setting; for example, specify /hbaserest0 instead of /hbaserest.

Example for HDInsight HBase:


{
"name": "HBaseLinkedService",
"properties": {
"type": "HBase",
"typeProperties": {
"host" : "<cluster name>.azurehdinsight.net",
"port" : "443",
"httpPath" : "/hbaserest0",
"authenticationType" : "Basic",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"enableSsl" : true
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example for generic HBase:

{
"name": "HBaseLinkedService",
"properties": {
"type": "HBase",
"typeProperties": {
"host" : "<host e.g. 192.168.222.160>",
"port" : "<port>",
"httpPath" : "<e.g. /gateway/sandbox/hbase/version>",
"authenticationType" : "Basic",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"enableSsl" : true,
"trustedCertPath" : "<trustedCertPath>",
"allowHostNameCNMismatch" : true,
"allowSelfSignedServerCert" : true
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by HBase dataset.
To copy data from HBase, set the type property of the dataset to HBaseObject. The following properties are
supported:
type: The type property of the dataset must be set to: HBaseObject (Required: Yes)

tableName: Name of the table. (Required: No, if "query" in the activity source is specified)

Example

{
"name": "HBaseDataset",
"properties": {
"type": "HBaseObject",
"linkedServiceName": {
"referenceName": "<HBase linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by HBase source.
HBaseSource as source
To copy data from HBase, set the source type in the copy activity to HBaseSource. The following properties are
supported in the copy activity source section:

type: The type property of the copy activity source must be set to: HBaseSource (Required: Yes)

query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". (Required: No, if "tableName" in the dataset is specified)

Example:
"activities":[
{
"name": "CopyFromHBase",
"type": "Copy",
"inputs": [
{
"referenceName": "<HBase input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "HBaseSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from HDFS using Azure Data Factory
5/6/2019 • 15 minutes to read

This article outlines how to copy data from HDFS server. To learn about Azure Data Factory, read the
introductory article.

Supported capabilities
This HDFS connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
Specifically, this HDFS connector supports:
Copying files using Windows (Kerberos) or Anonymous authentication.
Copying files using webhdfs protocol or built-in DistCp support.
Copying files as-is or parsing/generating files with the supported file formats and compression codecs.

Prerequisites
To copy data from an HDFS that is not publicly accessible, you need to set up a Self-hosted Integration Runtime.
See Self-hosted Integration Runtime article to learn details.

NOTE
Make sure the Integration Runtime can access ALL the [name node server]:[name node port] and [data node servers]:[data node port] of the Hadoop cluster. The default [name node port] is 50070, and the default [data node port] is 50075.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
HDFS.

Linked service properties


The following properties are supported for HDFS linked service:
type: The type property must be set to: Hdfs (Required: Yes)

url: URL to the HDFS. (Required: Yes)

authenticationType: Allowed values are: Anonymous, or Windows. To use Kerberos authentication for the HDFS connector, refer to this section to set up your on-premises environment accordingly. (Required: Yes)

userName: Username for Windows authentication. For Kerberos authentication, specify <username>@<domain>.com. (Required: Yes, for Windows authentication)

password: Password for Windows authentication. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. (Required: Yes, for Windows authentication)

connectVia: The Integration Runtime to be used to connect to the data store. You can use Self-hosted Integration Runtime or Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. (Required: No)

Example: using Anonymous authentication

{
"name": "HDFSLinkedService",
"properties": {
"type": "Hdfs",
"typeProperties": {
"url" : "http://<machine>:50070/webhdfs/v1/",
"authenticationType": "Anonymous",
"userName": "hadoop"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: using Windows authentication


{
"name": "HDFSLinkedService",
"properties": {
"type": "Hdfs",
"typeProperties": {
"url" : "http://<machine>:50070/webhdfs/v1/",
"authenticationType": "Windows",
"userName": "<username>@<domain>.com (for Kerberos auth)",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data from HDFS in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based dataset and supported settings. The following properties are supported for HDFS
under location settings in format-based dataset:

type: The type property under location in the dataset must be set to HdfsLocation. (Required: Yes)

folderPath: The path to the folder. If you want to use a wildcard to filter the folder, skip this setting and specify it in the activity source settings. (Required: No)

fileName: The file name under the given folderPath. If you want to use a wildcard to filter files, skip this setting and specify it in the activity source settings. (Required: No)

NOTE
The FileShare type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the Copy/Lookup activity for backward compatibility. However, we recommend that you use the new model going forward; the ADF authoring UI has switched to generating these new types.

Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<HDFS linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "HdfsLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Other format dataset


To copy data from HDFS in ORC/Avro/JSON/Binary format, the following properties are supported:

type: The type property of the dataset must be set to: FileShare (Required: Yes)

folderPath: Path to the folder. Wildcard filter is supported; allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has a wildcard or this escape character inside. Example: rootfolder/subfolder/; see more examples in Folder and file filter examples. (Required: Yes)

fileName: Name or wildcard filter for the file(s) under the specified "folderPath". If you don't specify a value for this property, the dataset points to all files in the folder. For the filter, allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character). Example 1: "fileName": "*.csv". Example 2: "fileName": "???20180427.txt". Use ^ to escape if your actual file name has a wildcard or this escape character inside. (Required: No)

modifiedDatetimeStart: Files filter based on the attribute Last Modified. The files are selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z". Be aware that the overall performance of data movement will be impacted by enabling this setting when you want to filter huge amounts of files. The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value are selected. (Required: No)

modifiedDatetimeEnd: Same as above. (Required: No)

format: If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. If you want to parse files with a specific format, the following file format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections. (Required: No (only for binary copy scenario))

compression: Specify the type and level of compression for the data. For more information, see Supported file formats and compression codecs. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. (Required: No)

TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.

Example:

{
"name": "HDFSDataset",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<HDFS linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
Copy activity properties
For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by HDFS source.
HDFS as source
For copy from Parquet and delimited text format, refer to Parquet and delimited text format source
section.
For copy from other formats like ORC/Avro/JSON/Binary format, refer to Other format source section.
Parquet and delimited text format source
To copy data from HDFS in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based copy activity source and supported settings. The following properties are
supported for HDFS under storeSettings settings in format-based copy source:

type: The type property under storeSettings must be set to HdfsReadSetting. (Required: Yes)

recursive: Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false. (Required: No)

wildcardFolderPath: The folder path with wildcard characters to filter source folders. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has a wildcard or this escape character inside. See more examples in Folder and file filter examples. (Required: No)

wildcardFileName: The file name with wildcard characters under the given folderPath/wildcardFolderPath to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual file name has a wildcard or this escape character inside. See more examples in Folder and file filter examples. (Required: Yes if fileName is not specified in the dataset)

modifiedDatetimeStart: Files filter based on the attribute Last Modified. The files are selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z". The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value are selected. (Required: No)

modifiedDatetimeEnd: Same as above. (Required: No)

maxConcurrentConnections: The number of connections used to connect to the storage store concurrently. Specify only when you want to limit the concurrent connections to the data store. (Required: No)

NOTE
For Parquet/delimited text format, the FileSystemSource type copy activity source mentioned in the next section is still supported as-is for backward compatibility. However, we recommend that you use the new model going forward; the ADF authoring UI has switched to generating these new types.

Example:
"activities":[
{
"name": "CopyFromHDFS",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "HdfsReadSetting",
"recursive": true
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Other format source


To copy data from HDFS in ORC/Avro/JSON/Binary format, the following properties are supported in the
copy activity source section:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of the copy activity source must be set to: HdfsSource | Yes |
| recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder/subfolder will not be copied/created at the sink. Allowed values are: true (default), false | No |
| distcpSettings | Property group to use when using HDFS DistCp. | No |
| resourceManagerEndpoint | The Yarn Resource Manager endpoint. | Yes if using DistCp |
| tempScriptPath | A folder path used to store the temporary DistCp command script. The script file is generated by Data Factory and is removed after the copy job finishes. | Yes if using DistCp |
| distcpOptions | Additional options provided to the DistCp command. | No |
| maxConcurrentConnections | The number of connections that can connect to the storage store concurrently. Specify a value only when you want to limit the concurrent connections to the data store. | No |

Example: HDFS source in copy activity using DistCp

"source": {
"type": "HdfsSource",
"distcpSettings": {
"resourceManagerEndpoint": "resourcemanagerendpoint:8088",
"tempScriptPath": "/usr/hadoop/tempscript",
"distcpOptions": "-m 100"
}
}

Learn more about how to use DistCp to copy data from HDFS efficiently in the next section.
Folder and file filter examples
This section describes the resulting behavior of the folder path and file name with wildcard filters.

| FOLDERPATH | FILENAME | RECURSIVE | SOURCE FOLDER STRUCTURE AND FILTER RESULT (FILES IN **BOLD** ARE RETRIEVED) |
| --- | --- | --- | --- |
| Folder* | (empty, use default) | false | FolderA<br/>    **File1.csv**<br/>    **File2.json**<br/>    Subfolder1<br/>        File3.csv<br/>        File4.json<br/>        File5.csv<br/>AnotherFolderB<br/>    File6.csv |
| Folder* | (empty, use default) | true | FolderA<br/>    **File1.csv**<br/>    **File2.json**<br/>    Subfolder1<br/>        **File3.csv**<br/>        **File4.json**<br/>        **File5.csv**<br/>AnotherFolderB<br/>    File6.csv |
| Folder* | *.csv | false | FolderA<br/>    **File1.csv**<br/>    File2.json<br/>    Subfolder1<br/>        File3.csv<br/>        File4.json<br/>        File5.csv<br/>AnotherFolderB<br/>    File6.csv |
| Folder* | *.csv | true | FolderA<br/>    **File1.csv**<br/>    File2.json<br/>    Subfolder1<br/>        **File3.csv**<br/>        File4.json<br/>        **File5.csv**<br/>AnotherFolderB<br/>    File6.csv |

Use DistCp to copy data from HDFS


DistCp is a Hadoop-native command-line tool for doing distributed copy within a Hadoop cluster. When you run a DistCp command, it first lists all the files to be copied and then creates several Map jobs in the Hadoop cluster; each Map job does a binary copy from source to sink.
Copy Activity supports using DistCp to copy files as-is into Azure Blob (including staged copy) or Azure Data Lake Store, in which case it can fully leverage your cluster's power instead of running on the Self-hosted Integration Runtime. This provides better copy throughput, especially if your cluster is very powerful. Based on your configuration in Azure Data Factory, the copy activity automatically constructs a DistCp command, submits it to your Hadoop cluster, and monitors the copy status.
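For reference, the submitted command is conceptually similar to the following sketch. The -m value mirrors the distcpOptions example in this article; the namenode host, paths, container, and storage account name are illustrative placeholders, and the real command is generated for you by Data Factory from distcpSettings and the sink configuration:

# Distributed copy from an HDFS folder to an Azure Blob container (WASB), using at most 100 map tasks
hadoop distcp -m 100 hdfs://<namenode-host>:8020/source/folder wasbs://<container>@<storageaccount>.blob.core.windows.net/target/folder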
Prerequisites
To use DistCp to copy files as-is from HDFS to Azure Blob (including staged copy) or Azure Data Lake Store,
make sure your Hadoop cluster meets the following requirements:
1. The MapReduce and Yarn services are enabled.
2. The Yarn version is 2.5 or above.
3. The HDFS server is integrated with your target data store: Azure Blob or Azure Data Lake Store.
The Azure Blob FileSystem is natively supported since Hadoop 2.7. You only need to specify the jar path in the Hadoop environment configuration.
The Azure Data Lake Store FileSystem is packaged starting from Hadoop 3.0.0-alpha1. If your Hadoop cluster is on a lower version, you need to manually import the ADLS-related jar packages (azure-datalake-store.jar) into the cluster from here, and specify the jar path in the Hadoop environment configuration.
4. Prepare a temp folder in HDFS. This temp folder is used to store the DistCp shell script, so it occupies only KB-level space.
5. Make sure the user account provided in the HDFS linked service has permission to: a) submit applications in Yarn; b) create subfolders and read/write files under the temp folder above.
Configurations
See DistCp-related configurations and examples in the HDFS as source section.

Use Kerberos authentication for HDFS connector


There are two options for setting up the on-premises environment to use Kerberos authentication with the HDFS
connector. You can choose the one that better fits your case.
Option 1: Join Self-hosted Integration Runtime machine in Kerberos realm
Option 2: Enable mutual trust between Windows domain and Kerberos realm
Option 1: Join Self-hosted Integration Runtime machine in Kerberos realm
Requirements
The Self-hosted Integration Runtime machine needs to join the Kerberos realm and can’t join any Windows
domain.
How to configure
On Self-hosted Integration Runtime machine:
1. Run the Ksetup utility to configure the Kerberos KDC server and realm.
The machine must be configured as a member of a workgroup since a Kerberos realm is different from a
Windows domain. This can be achieved by setting the Kerberos realm and adding a KDC server as follows.
Replace REALM.COM with your own respective realm as needed.

C:> Ksetup /setdomain REALM.COM
C:> Ksetup /addkdc REALM.COM <your_kdc_server_address>

Restart the machine after executing these two commands.

2. Verify the configuration with the Ksetup command. The output should look like:

C:> Ksetup
default realm = REALM.COM (external)
REALM.com:
kdc = <your_kdc_server_address>

In Azure Data Factory:


Configure the HDFS connector using Windows authentication together with your Kerberos principal name and password to connect to the HDFS data source. See the HDFS linked service properties section for configuration details.
Option 2: Enable mutual trust between Windows domain and Kerberos realm
Requirements
The Self-hosted Integration Runtime machine must join a Windows domain.
You need permission to update the domain controller's settings.
How to configure

NOTE
Replace REALM.COM and AD.COM in the following tutorial with your own respective realm and domain controller as
needed.

On KDC server:
1. Edit the KDC configuration in the krb5.conf file to let the KDC trust the Windows domain, referring to the following
configuration template. By default, the configuration is located at /etc/krb5.conf.

[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log

[libdefaults]
default_realm = REALM.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true

[realms]
REALM.COM = {
kdc = node.REALM.COM
admin_server = node.REALM.COM
}
AD.COM = {
kdc = windc.ad.com
admin_server = windc.ad.com
}

[domain_realm]
.REALM.COM = REALM.COM
REALM.COM = REALM.COM
.ad.com = AD.COM
ad.com = AD.COM

[capaths]
AD.COM = {
REALM.COM = .
}

Restart the KDC service after configuration.


2. Prepare a principal named krbtgt/REALM.COM@AD.COM in KDC server with the following command:

Kadmin> addprinc krbtgt/REALM.COM@AD.COM

3. Add the following rule to the hadoop.security.auth_to_local setting in the HDFS service configuration:

RULE:[1:$1@$0](.*\@AD.COM)s/\@.*//
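For example, the resulting hadoop.security.auth_to_local property (typically maintained in core-site.xml) might look like the following sketch; this assumes a typical configuration layout, and any rules you already have, including the DEFAULT rule, should stay in place:

<property>
    <!-- Map AD.COM principals to local short names; keep your existing rules and the DEFAULT entry -->
    <name>hadoop.security.auth_to_local</name>
    <value>
        RULE:[1:$1@$0](.*\@AD.COM)s/\@.*//
        DEFAULT
    </value>
</property>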

On domain controller:
1. Run the following Ksetup commands to add a realm entry:

C:> Ksetup /addkdc REALM.COM <your_kdc_server_address>
C:> ksetup /addhosttorealmmap HDFS-service-FQDN REALM.COM

2. Establish trust from Windows Domain to Kerberos Realm. [password] is the password for the principal
krbtgt/REALM.COM@AD.COM.

C:> netdom trust REALM.COM /Domain: AD.COM /add /realm /passwordt:[password]

3. Select the encryption algorithm used in Kerberos.


a. Go to Server Manager > Group Policy Management > Domain > Group Policy Objects > Default or
Active Domain Policy, and Edit.
b. In the Group Policy Management Editor popup window, go to Computer Configuration >
Policies > Windows Settings > Security Settings > Local Policies > Security Options, and configure
Network security: Configure Encryption types allowed for Kerberos.
c. Select the encryption algorithm you want to use when connecting to the KDC. Commonly, you can simply
select all the options.

d. Use Ksetup command to specify the encryption algorithm to be used on the specific REALM.

C:> ksetup /SetEncTypeAttr REALM.COM DES-CBC-CRC DES-CBC-MD5 RC4-HMAC-MD5 AES128-CTS-HMAC-SHA1-96 AES256-CTS-HMAC-SHA1-96

4. Create the mapping between the domain account and the Kerberos principal, so that you can use the Kerberos
principal in the Windows domain.
a. Start Administrative Tools > Active Directory Users and Computers.
b. Configure advanced features by clicking View > Advanced Features.
c. Locate the account to which you want to create mappings, right-click it, select Name Mappings, and then click the Kerberos Names tab.
d. Add a principal from the realm.

On Self-hosted Integration Runtime machine:


Run the following Ksetup commands to add a realm entry.

C:> Ksetup /addkdc REALM.COM <your_kdc_server_address>
C:> ksetup /addhosttorealmmap HDFS-service-FQDN REALM.COM

In Azure Data Factory:


Configure the HDFS connector using Windows authentication together with either your domain account or Kerberos principal to connect to the HDFS data source. See the HDFS linked service properties section for configuration details.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Hive using Azure Data Factory
2/1/2019 • 4 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Hive. It builds on the
copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from Hive to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install any driver to use this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Hive connector.

Linked service properties


The following properties are supported for Hive linked service:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property must be set to: Hive | Yes |
| host | IP address or host name of the Hive server, separated by ';' for multiple hosts (only when serviceDiscoveryMode is enabled). | Yes |
| port | The TCP port that the Hive server uses to listen for client connections. If you connect to Azure HDInsight, specify port as 443. | Yes |
| serverType | The type of Hive server. Allowed values are: HiveServer1, HiveServer2, HiveThriftServer | No |
| thriftTransportProtocol | The transport protocol to use in the Thrift layer. Allowed values are: Binary, SASL, HTTP | No |
| authenticationType | The authentication method used to access the Hive server. Allowed values are: Anonymous, Username, UsernameAndPassword, WindowsAzureHDInsightService | Yes |
| serviceDiscoveryMode | true to indicate using the ZooKeeper service; false if not. | No |
| zooKeeperNameSpace | The namespace on ZooKeeper under which Hive Server 2 nodes are added. | No |
| useNativeQuery | Specifies whether the driver uses native HiveQL queries, or converts them into an equivalent form in HiveQL. | No |
| username | The user name that you use to access Hive Server. | No |
| password | The password corresponding to the user. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No |
| httpPath | The partial URL corresponding to the Hive server. | No |
| enableSsl | Specifies whether the connections to the server are encrypted using SSL. The default value is false. | No |
| trustedCertPath | The full path of the .pem file containing trusted CA certificates for verifying the server when connecting over SSL. This property can only be set when using SSL on self-hosted IR. The default value is the cacerts.pem file installed with the IR. | No |
| useSystemTrustStore | Specifies whether to use a CA certificate from the system trust store or from a specified PEM file. The default value is false. | No |
| allowHostNameCNMismatch | Specifies whether to require a CA-issued SSL certificate name to match the host name of the server when connecting over SSL. The default value is false. | No |
| allowSelfSignedServerCert | Specifies whether to allow self-signed certificates from the server. The default value is false. | No |
| connectVia | The Integration Runtime to be used to connect to the data store. You can use Self-hosted Integration Runtime or Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. | No |

Example:

{
"name": "HiveLinkedService",
"properties": {
"type": "Hive",
"typeProperties": {
"host" : "<cluster>.azurehdinsight.net",
"port" : "<port>",
"authenticationType" : "WindowsAzureHDInsightService",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
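If your Hive Server 2 instances are registered in ZooKeeper, a linked service sketch based on the serviceDiscoveryMode and zooKeeperNameSpace properties described above might look like the following; the host list, port, namespace, and credentials are placeholders, and the exact value formats should be verified against your environment:

{
    "name": "HiveLinkedService",
    "properties": {
        "type": "Hive",
        "typeProperties": {
            "host" : "<zk-node1>;<zk-node2>;<zk-node3>",
            "port" : "<port>",
            "serviceDiscoveryMode": true,
            "zooKeeperNameSpace": "<ZooKeeper namespace, for example hiveserver2>",
            "authenticationType" : "UsernameAndPassword",
            "username" : "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        }
    }
}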

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Hive dataset.
To copy data from Hive, set the type property of the dataset to HiveObject. The following properties are
supported:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of the dataset must be set to: HiveObject | Yes |
| tableName | Name of the table. | No (if "query" in the activity source is specified) |

Example
{
"name": "HiveDataset",
"properties": {
"type": "HiveObject",
"linkedServiceName": {
"referenceName": "<Hive linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Hive source.
HiveSource as source
To copy data from Hive, set the source type in the copy activity to HiveSource. The following properties are
supported in the copy activity source section:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of the copy activity source must be set to: HiveSource | Yes |
| query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in the dataset is specified) |

Example:

"activities":[
{
"name": "CopyFromHive",
"type": "Copy",
"inputs": [
{
"referenceName": "<Hive input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "HiveSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from an HTTP endpoint by using Azure
Data Factory
5/6/2019 • 9 minutes to read • Edit Online

This article outlines how to use Copy Activity in Azure Data Factory to copy data from an HTTP endpoint. The
article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.
The differences among this HTTP connector, the REST connector, and the Web table connector are:
The REST connector specifically supports copying data from RESTful APIs.
The HTTP connector is generic and retrieves data from any HTTP endpoint, for example, to download a file. Before the REST connector became available, you may have used the HTTP connector to copy data from a RESTful API, which is supported but less capable than the REST connector.
The Web table connector extracts table content from an HTML webpage.

Supported capabilities
You can copy data from an HTTP source to any supported sink data store. For a list of data stores that Copy
Activity supports as sources and sinks, see Supported data stores and formats.
You can use this HTTP connector to:
Retrieve data from an HTTP/S endpoint by using the HTTP GET or POST methods.
Retrieve data by using one of the following authentications: Anonymous, Basic, Digest, Windows, or
ClientCertificate.
Copy the HTTP response as-is or parse it by using supported file formats and compression codecs.

TIP
To test an HTTP request for data retrieval before you configure the HTTP connector in Data Factory, learn about the API
specification for header and body requirements. You can use tools like Postman or a web browser to validate.
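For example, a quick way to validate the endpoint from a command line before creating the linked service is a curl call like the following sketch (the URL, credentials, and header value are placeholders):

# Send a GET request with Basic authentication and print the response headers plus body
curl -i -u <username>:<password> -H "Accept: application/json" "https://<your-http-endpoint>/<relative-url>"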

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to the HTTP connector.
Linked service properties
The following properties are supported for the HTTP linked service:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property must be set to HttpServer. | Yes |
| url | The base URL to the web server. | Yes |
| enableServerCertificateValidation | Specify whether to enable server SSL certificate validation when you connect to an HTTP endpoint. If your HTTPS server uses a self-signed certificate, set this property to false. | No (the default is true) |
| authenticationType | Specifies the authentication type. Allowed values are Anonymous, Basic, Digest, Windows, and ClientCertificate. See the sections that follow this table for more properties and JSON samples for these authentication types. | Yes |
| connectVia | The Integration Runtime to use to connect to the data store. You can use the Azure Integration Runtime or a self-hosted Integration Runtime (if your data store is located in a private network). If not specified, this property uses the default Azure Integration Runtime. | No |

Using Basic, Digest, or Windows authentication


Set the authenticationType property to Basic, Digest, or Windows. In addition to the generic properties that
are described in the preceding section, specify the following properties:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| userName | The user name to use to access the HTTP endpoint. | Yes |
| password | The password for the user (the userName value). Mark this field as a SecureString type to store it securely in Data Factory. You can also reference a secret stored in Azure Key Vault. | Yes |

Example
{
"name": "HttpLinkedService",
"properties": {
"type": "HttpServer",
"typeProperties": {
"authenticationType": "Basic",
"url" : "<HTTP endpoint>",
"userName": "<user name>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Using ClientCertificate authentication


To use ClientCertificate authentication, set the authenticationType property to ClientCertificate. In addition to
the generic properties that are described in the preceding section, specify the following properties:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| embeddedCertData | Base64-encoded certificate data. | Specify either embeddedCertData or certThumbprint. |
| certThumbprint | The thumbprint of the certificate that's installed on your self-hosted Integration Runtime machine's cert store. Applies only when the self-hosted type of Integration Runtime is specified in the connectVia property. | Specify either embeddedCertData or certThumbprint. |
| password | The password that's associated with the certificate. Mark this field as a SecureString type to store it securely in Data Factory. You can also reference a secret stored in Azure Key Vault. | No |

If you use certThumbprint for authentication and the certificate is installed in the personal store of the local
computer, grant read permissions to the self-hosted Integration Runtime:
1. Open the Microsoft Management Console (MMC). Add the Certificates snap-in that targets Local
Computer.
2. Expand Certificates > Personal, and then select Certificates.
3. Right-click the certificate from the personal store, and then select All Tasks > Manage Private Keys.
4. On the Security tab, add the user account under which the Integration Runtime Host Service
(DIAHostService) is running, with read access to the certificate.

Example 1: Using certThumbprint


{
"name": "HttpLinkedService",
"properties": {
"type": "HttpServer",
"typeProperties": {
"authenticationType": "ClientCertificate",
"url": "<HTTP endpoint>",
"certThumbprint": "<thumbprint of certificate>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: Using embeddedCertData

{
"name": "HttpLinkedService",
"properties": {
"type": "HttpServer",
"typeProperties": {
"authenticationType": "ClientCertificate",
"url": "<HTTP endpoint>",
"embeddedCertData": "<Base64-encoded cert data>",
"password": {
"type": "SecureString",
"value": "password of cert"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to the Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary, refer to the Other format dataset section.
Parquet and delimited text format dataset
To copy data from HTTP in Parquet or delimited text format, refer to the Parquet format and Delimited text format articles for the format-based dataset and supported settings. The following properties are supported for HTTP under location settings in the format-based dataset:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property under location in the dataset must be set to HttpServerLocation. | Yes |
| relativeUrl | A relative URL to the resource that contains the data. | No |
NOTE
The supported HTTP request payload size is around 500 KB. If the payload size you want to pass to your web endpoint is
larger than 500 KB, consider batching the payload in smaller chunks.

NOTE
The HttpFile type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the Copy/Lookup activity for backward compatibility. We recommend that you use this new model going forward; the ADF authoring UI has switched to generating these new types.

Example:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<HTTP linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "HttpServerLocation",
"relativeUrl": "<relative url>"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Other format dataset


To copy data from HTTP in ORC/Avro/JSON/Binary format, the following properties are supported:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of the dataset must be set to HttpFile. | Yes |
| relativeUrl | A relative URL to the resource that contains the data. When this property isn't specified, only the URL that's specified in the linked service definition is used. | No |
| requestMethod | The HTTP method. Allowed values are Get (default) and Post. | No |
| additionalHeaders | Additional HTTP request headers. | No |
| requestBody | The body for the HTTP request. | No |
| format | If you want to retrieve data from the HTTP endpoint as-is without parsing it, and then copy the data to a file-based store, skip the format section in both the input and output dataset definitions. If you want to parse the HTTP response content during copy, the following file format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, and ParquetFormat. Under format, set the type property to one of these values. For more information, see JSON format, Text format, Avro format, Orc format, and Parquet format. | No |
| compression | Specify the type and level of compression for the data. For more information, see Supported file formats and compression codecs. Supported types: GZip, Deflate, BZip2, and ZipDeflate. Supported levels: Optimal and Fastest. | No |

NOTE
The supported HTTP request payload size is around 500 KB. If the payload size you want to pass to your web endpoint is
larger than 500 KB, consider batching the payload in smaller chunks.

Example 1: Using the Get method (default)

{
"name": "HttpSourceDataInput",
"properties": {
"type": "HttpFile",
"linkedServiceName": {
"referenceName": "<HTTP linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"relativeUrl": "<relative url>",
"additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n"
}
}
}

Example 2: Using the Post method


{
"name": "HttpSourceDataInput",
"properties": {
"type": "HttpFile",
"linkedServiceName": {
"referenceName": "<HTTP linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"relativeUrl": "<relative url>",
"requestMethod": "Post",
"requestBody": "<body for POST HTTP request>"
}
}
}

Copy Activity properties


This section provides a list of properties that the HTTP source supports.
For a full list of sections and properties that are available for defining activities, see Pipelines.
HTTP as source
To copy from Parquet or delimited text format, refer to the Parquet and delimited text format source section.
To copy from other formats like ORC/Avro/JSON/Binary, refer to the Other format source section.
Parquet and delimited text format source
To copy data from HTTP in Parquet or delimited text format, refer to the Parquet format and Delimited text format articles for the format-based copy activity source and supported settings. The following properties are supported for HTTP under storeSettings in the format-based copy source:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property under storeSettings must be set to HttpReadSetting. | Yes |
| requestMethod | The HTTP method. Allowed values are Get (default) and Post. | No |
| additionalHeaders | Additional HTTP request headers. | No |
| requestBody | The body for the HTTP request. | No |
| requestTimeout | The timeout (the TimeSpan value) for the HTTP request to get a response. This value is the timeout to get a response, not the timeout to read response data. The default value is 00:01:40. | No |
| maxConcurrentConnections | The number of connections that can connect to the storage store concurrently. Specify a value only when you want to limit the concurrent connections to the data store. | No |

NOTE
For Parquet/delimited text format, the HttpSource type copy activity source mentioned in the next section is still supported as-is for backward compatibility. We recommend that you use this new model going forward; the ADF authoring UI has switched to generating these new types.

Example:

"activities":[
{
"name": "CopyFromHTTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "HttpReadSetting",
"requestMethod": "Post",
"additionalHeaders": "<header key: header value>\n<header key: header value>\n",
"requestBody": "<body for POST HTTP request>"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Other format source


To copy data from HTTP in ORC/Avro/JSON/Binary format, the following properties are supported in the
copy activity source section:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of the copy activity source must be set to HttpSource. | Yes |
| httpRequestTimeout | The timeout (the TimeSpan value) for the HTTP request to get a response. This value is the timeout to get a response, not the timeout to read response data. The default value is 00:01:40. | No |

Example

"activities":[
{
"name": "CopyFromHTTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<HTTP input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "HttpSource",
"httpRequestTimeout": "00:01:00"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from HubSpot using Azure Data Factory
(Preview)
2/1/2019 • 3 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from HubSpot. It builds on
the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from HubSpot to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install any driver to use this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
HubSpot connector.

Linked service properties


The following properties are supported for HubSpot linked service:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property must be set to: Hubspot | Yes |
| clientId | The client ID associated with your HubSpot application. | Yes |
| clientSecret | The client secret associated with your HubSpot application. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes |
| accessToken | The access token obtained when initially authenticating your OAuth integration. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes |
| refreshToken | The refresh token obtained when initially authenticating your OAuth integration. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes |
| useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No |
| useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over SSL. The default value is true. | No |
| usePeerVerification | Specifies whether to verify the identity of the server when connecting over SSL. The default value is true. | No |

Example:

{
"name": "HubspotLinkedService",
"properties": {
"type": "Hubspot",
"typeProperties": {
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"accessToken": {
"type": "SecureString",
"value": "<accessToken>"
},
"refreshToken": {
"type": "SecureString",
"value": "<refreshToken>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by HubSpot dataset.
To copy data from HubSpot, set the type property of the dataset to HubspotObject. The following properties are
supported:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of the dataset must be set to: HubspotObject | Yes |
| tableName | Name of the table. | No (if "query" in the activity source is specified) |

Example

{
"name": "HubspotDataset",
"properties": {
"type": "HubspotObject",
"linkedServiceName": {
"referenceName": "<Hubspot linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by HubSpot source.
HubspotSource as source
To copy data from HubSpot, set the source type in the copy activity to HubspotSource. The following properties
are supported in the copy activity source section:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of the copy activity source must be set to: HubspotSource | Yes |
| query | Use the custom SQL query to read data. For example: "SELECT * FROM Companies where Company_Id = xxx". | No (if "tableName" in the dataset is specified) |

Example:
"activities":[
{
"name": "CopyFromHubspot",
"type": "Copy",
"inputs": [
{
"referenceName": "<Hubspot input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "HubspotSource",
"query": "SELECT * FROM Companies where Company_Id = xxx"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Impala by using Azure Data Factory
(Preview)
2/1/2019 • 4 minutes to read • Edit Online

This article outlines how to use Copy Activity in Azure Data Factory to copy data from Impala. It builds on the
Copy Activity overview article that presents a general overview of the copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and provide feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Impala to any supported sink data store. For a list of data stores that are supported as
sources or sinks by the copy activity, see the Supported data stores table.
Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install a driver
to use this connector.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to the
Impala connector.

Linked service properties


The following properties are supported for Impala linked service.

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property must be set to Impala. | Yes |
| host | The IP address or host name of the Impala server (that is, 192.168.222.160). | Yes |
| port | The TCP port that the Impala server uses to listen for client connections. The default value is 21050. | No |
| authenticationType | The authentication type to use. Allowed values are Anonymous, SASLUsername, and UsernameAndPassword. | Yes |
| username | The user name used to access the Impala server. The default value is anonymous when you use SASLUsername. | No |
| password | The password that corresponds to the user name when you use UsernameAndPassword. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No |
| enableSsl | Specifies whether the connections to the server are encrypted by using SSL. The default value is false. | No |
| trustedCertPath | The full path of the .pem file that contains trusted CA certificates used to verify the server when you connect over SSL. This property can be set only when you use SSL on Self-hosted Integration Runtime. The default value is the cacerts.pem file installed with the integration runtime. | No |
| useSystemTrustStore | Specifies whether to use a CA certificate from the system trust store or from a specified PEM file. The default value is false. | No |
| allowHostNameCNMismatch | Specifies whether to require a CA-issued SSL certificate name to match the host name of the server when you connect over SSL. The default value is false. | No |
| allowSelfSignedServerCert | Specifies whether to allow self-signed certificates from the server. The default value is false. | No |
| connectVia | The integration runtime to be used to connect to the data store. You can use Self-hosted Integration Runtime or Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. | No |
Example:

{
"name": "ImpalaLinkedService",
"properties": {
"type": "Impala",
"typeProperties": {
"host" : "<host>",
"port" : "<port>",
"authenticationType" : "UsernameAndPassword",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Impala dataset.
To copy data from Impala, set the type property of the dataset to ImpalaObject. The following properties are
supported:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of the dataset must be set to: ImpalaObject | Yes |
| tableName | Name of the table. | No (if "query" in the activity source is specified) |

Example

{
"name": "ImpalaDataset",
"properties": {
"type": "ImpalaObject",
"linkedServiceName": {
"referenceName": "<Impala linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Impala source type.
Impala as a source type
To copy data from Impala, set the source type in the copy activity to ImpalaSource. The following properties are
supported in the copy activity source section.

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of the copy activity source must be set to ImpalaSource. | Yes |
| query | Use the custom SQL query to read data. An example is "SELECT * FROM MyTable". | No (if "tableName" in the dataset is specified) |

Example:

"activities":[
{
"name": "CopyFromImpala",
"type": "Copy",
"inputs": [
{
"referenceName": "<Impala input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ImpalaSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from and to ODBC data stores using
Azure Data Factory
3/18/2019 • 8 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from and to an ODBC data
store. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from ODBC source to any supported sink data store, or copy from any supported source data
store to ODBC sink. For a list of data stores that are supported as sources/sinks by the copy activity, see the
Supported data stores table.
Specifically, this ODBC connector supports copying data from/to any ODBC-compatible data store using
Basic or Anonymous authentication.

Prerequisites
To use this ODBC connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the ODBC driver for the data store on the Integration Runtime machine.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-step
instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
ODBC connector.

Linked service properties


The following properties are supported for ODBC linked service:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property must be set to: Odbc | Yes |
| connectionString | The connection string excluding the credential portion. You can specify the connection string with a pattern like "Driver={SQL Server};Server=Server.database.windows.net;Database=TestDatabase;", or use the system DSN (Data Source Name) you set up on the Integration Runtime machine with "DSN=<name of the DSN on IR machine>;" (you still need to specify the credential portion in the linked service accordingly). Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes |
| authenticationType | Type of authentication used to connect to the ODBC data store. Allowed values are: Basic and Anonymous. | Yes |
| userName | Specify the user name if you are using Basic authentication. | No |
| password | Specify the password for the user account you specified for the userName. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No |
| credential | The access credential portion of the connection string specified in driver-specific property-value format. Example: "RefreshToken=<secret refresh token>;". Mark this field as a SecureString. | No |
| connectVia | The Integration Runtime to be used to connect to the data store. A Self-hosted Integration Runtime is required, as mentioned in Prerequisites. | Yes |

Example 1: using Basic authentication


{
"name": "ODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "<connection string>"
},
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: using Anonymous authentication

{
"name": "ODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "<connection string>"
},
"authenticationType": "Anonymous",
"credential": {
"type": "SecureString",
"value": "RefreshToken=<secret refresh token>;"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
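If you prefer to reference a system DSN that you configured on the Self-hosted Integration Runtime machine, the linked service can combine the DSN pattern from the table above with the credential properties, as in the following sketch (the DSN name and credentials are placeholders):

{
    "name": "ODBCLinkedService",
    "properties": {
        "type": "Odbc",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "DSN=<name of the DSN on IR machine>;"
            },
            "authenticationType": "Basic",
            "userName": "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}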

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section provides
a list of properties supported by ODBC dataset.
To copy data from/to an ODBC-compatible data store, set the type property of the dataset to RelationalTable. The
following properties are supported:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of the dataset must be set to: RelationalTable | Yes |
| tableName | Name of the table in the ODBC data store. | No for source (if "query" in the activity source is specified); Yes for sink |
Example

{
"name": "ODBCDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<ODBC linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<table name>"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by ODBC source.
ODBC as source
To copy data from an ODBC-compatible data store, set the source type in the copy activity to RelationalSource. The
following properties are supported in the copy activity source section:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of the copy activity source must be set to: RelationalSource | Yes |
| query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in the dataset is specified) |

Example:
"activities":[
{
"name": "CopyFromODBC",
"type": "Copy",
"inputs": [
{
"referenceName": "<ODBC input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

ODBC as sink
To copy data to an ODBC-compatible data store, set the sink type in the copy activity to OdbcSink. The following
properties are supported in the copy activity sink section:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of the copy activity sink must be set to: OdbcSink | Yes |
| writeBatchTimeout | Wait time for the batch insert operation to complete before it times out. Allowed values are: timespan. Example: "00:30:00" (30 minutes). | No |
| writeBatchSize | Inserts data into the SQL table when the buffer size reaches writeBatchSize. Allowed values are: integer (number of rows). | No (default is 0 - auto detected) |
| preCopyScript | Specify a SQL query for Copy Activity to execute before writing data into the data store in each run. You can use this property to clean up the pre-loaded data. | No |

NOTE
For "writeBatchSize", if it's not set (auto-detected), copy activity first detects whether the driver supports batch operations,
and set it to 10000 if it does, or set it to 1 if it doesn’t. If you explicitly set the value other than 0, copy activity honors the
value and fails at runtime if the driver doesn’t support batch operations.
Example:

"activities":[
{
"name": "CopyToODBC",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<ODBC output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "OdbcSink",
"writeBatchSize": 100000
}
}
}
]
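If you need to clean up previously loaded data before each run, you can combine the sink with a preCopyScript, as in this sketch (the DELETE statement and table name are placeholders for whatever cleanup your target requires):

"sink": {
    "type": "OdbcSink",
    "preCopyScript": "DELETE FROM <target table>",
    "writeBatchSize": 100000
}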

IBM Informix source


You can copy data from IBM Informix database using the generic ODBC connector.
Set up a Self-hosted Integration Runtime on a machine with access to your data store. The Integration Runtime
uses the ODBC driver for Informix to connect to the data store. Therefore, install the driver if it is not already
installed on the same machine. For example, you can use driver "IBM INFORMIX ODBC DRIVER (64-bit)". See
Prerequisites section for details.
Before you use the Informix source in a Data Factory solution, verify whether the Integration Runtime can connect
to the data store using instructions in Troubleshoot connectivity issues section.
Create an ODBC linked service to link an IBM Informix data store to an Azure data factory as shown in the
following example:
{
"name": "InformixLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "<Informix connection string or DSN>"
},
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Read the article from the beginning for a detailed overview of using ODBC data stores as source/sink data stores in
a copy operation.

Microsoft Access source


You can copy data from Microsoft Access database using the generic ODBC connector.
Set up a Self-hosted Integration Runtime on a machine with access to your data store. The Integration Runtime
uses the ODBC driver for Microsoft Access to connect to the data store. Therefore, install the driver if it is not
already installed on the same machine. See Prerequisites section for details.
Before you use the Microsoft Access source in a Data Factory solution, verify whether the Integration Runtime can
connect to the data store using instructions in Troubleshoot connectivity issues section.
Create an ODBC linked service to link a Microsoft Access database to an Azure data factory as shown in the
following example:
{
"name": "MicrosoftAccessLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Driver={Microsoft Access Driver (*.mdb, *.accdb)};Dbq=<path to your DB file e.g.
C:\\mydatabase.accdb>;"
},
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Read the article from the beginning for a detailed overview of using ODBC data stores as source/sink data stores in
a copy operation.

SAP HANA sink


NOTE
To copy data from an SAP HANA data store, refer to the native SAP HANA connector. To copy data to SAP HANA, follow this instruction to use the ODBC connector. Note that the linked services for the SAP HANA connector and the ODBC connector have different types and thus cannot be reused.

You can copy data to SAP HANA database using the generic ODBC connector.
Set up a Self-hosted Integration Runtime on a machine with access to your data store. The Integration Runtime
uses the ODBC driver for SAP HANA to connect to the data store. Therefore, install the driver if it is not already
installed on the same machine. See Prerequisites section for details.
Before you use the SAP HANA sink in a Data Factory solution, verify whether the Integration Runtime can connect
to the data store using instructions in Troubleshoot connectivity issues section.
Create an ODBC linked service to link a SAP HANA data store to an Azure data factory as shown in the following
example:
{
"name": "SAPHANAViaODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Driver={HDBODBC};servernode=<HANA server>.clouddatahub-int.net:30015"
},
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Read the article from the beginning for a detailed overview of using ODBC data stores as source/sink data stores in
a copy operation.

Troubleshoot connectivity issues


To troubleshoot connection issues, use the Diagnostics tab of Integration Runtime Configuration Manager.
1. Launch Integration Runtime Configuration Manager.
2. Switch to the Diagnostics tab.
3. Under the "Test Connection" section, select the type of data store (linked service).
4. Specify the connection string that is used to connect to the data store, choose the authentication type, and enter
the user name, password, and/or credentials.
5. Click Test connection to test the connection to the data store.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Jira using Azure Data Factory
(Preview)
2/1/2019 • 3 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Jira. It builds on the
copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Jira to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install any driver to use this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to Jira
connector.

Linked service properties


The following properties are supported for Jira linked service:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property must be set to: Jira | Yes |
| host | The IP address or host name of the Jira service. (for example, jira.example.com) | Yes |
| port | The TCP port that the Jira server uses to listen for client connections. The default value is 443 if connecting through HTTPS, or 8080 if connecting through HTTP. | No |
| username | The user name that you use to access Jira Service. | Yes |
| password | The password corresponding to the user name that you provided in the username field. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes |
| useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No |
| useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over SSL. The default value is true. | No |
| usePeerVerification | Specifies whether to verify the identity of the server when connecting over SSL. The default value is true. | No |

Example:

{
"name": "JiraLinkedService",
"properties": {
"type": "Jira",
"typeProperties": {
"host" : "<host>",
"port" : "<port>",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Jira dataset.
To copy data from Jira, set the type property of the dataset to JiraObject. The following properties are supported:
| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of the dataset must be set to: JiraObject | Yes |
| tableName | Name of the table. | No (if "query" in the activity source is specified) |

Example

{
"name": "JiraDataset",
"properties": {
"type": "JiraObject",
"linkedServiceName": {
"referenceName": "<Jira linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Jira source.
JiraSource as source
To copy data from Jira, set the source type in the copy activity to JiraSource. The following properties are
supported in the copy activity source section:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of the copy activity source must be set to: JiraSource | Yes |
| query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in the dataset is specified) |

Example:
"activities":[
{
"name": "CopyFromJira",
"type": "Copy",
"inputs": [
{
"referenceName": "<Jira input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "JiraSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Magento using Azure Data Factory
(Preview)
2/1/2019 • 3 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Magento. It builds on
the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Magento to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Magento connector.

Linked service properties


The following properties are supported for Magento linked service:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property must be set to: Magento | Yes |
| host | The URL of the Magento instance. (that is, 192.168.222.110/magento3) | Yes |
| accessToken | The access token from Magento. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes |
| useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No |
| useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over SSL. The default value is true. | No |
| usePeerVerification | Specifies whether to verify the identity of the server when connecting over SSL. The default value is true. | No |

Example:

{
"name": "MagentoLinkedService",
"properties": {
"type": "Magento",
"typeProperties": {
"host" : "192.168.222.110/magento3",
"accessToken": {
"type": "SecureString",
"value": "<accessToken>"
},
"useEncryptedEndpoints" : true,
"useHostVerification" : true,
"usePeerVerification" : true
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Magento dataset.
To copy data from Magento, set the type property of the dataset to MagentoObject. The following properties are
supported:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property of the dataset must be set to: MagentoObject | Yes |
| tableName | Name of the table. | No (if "query" in activity source is specified) |

Example
{
"name": "MagentoDataset",
"properties": {
"type": "MagentoObject",
"linkedServiceName": {
"referenceName": "<Magento linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Magento source.
Magento as source
To copy data from Magento, set the source type in the copy activity to MagentoSource. The following properties
are supported in the copy activity source section:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property of the copy activity source must be set to: MagentoSource | Yes |
| query | Use the custom SQL query to read data. For example: "SELECT * FROM Customers". | No (if "tableName" in dataset is specified) |

Example:

"activities":[
{
"name": "CopyFromMagento",
"type": "Copy",
"inputs": [
{
"referenceName": "<Magento input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MagentoSource",
"query": "SELECT * FROM Customers where Id > XXX"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from MariaDB using Azure Data Factory
2/1/2019 • 3 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from MariaDB. It builds on
the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from MariaDB to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
This connector currently supports MariaDB versions 10.0 through 10.2.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MariaDB connector.

Linked service properties


The following properties are supported for MariaDB linked service:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property must be set to: MariaDB | Yes |
| connectionString | An ODBC connection string to connect to MariaDB. Mark this field as a SecureString to store it securely in Data Factory. You can also put the password in Azure Key Vault and pull the pwd configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. | Yes |
| connectVia | The Integration Runtime to be used to connect to the data store. You can use Self-hosted Integration Runtime or Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. | No |

Example:

{
"name": "MariaDBLinkedService",
"properties": {
"type": "MariaDB",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<host>;Port=<port>;Database=<database>;UID=<user name>;PWD=<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store password in Azure Key Vault

{
"name": "MariaDBLinkedService",
"properties": {
"type": "MariaDB",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<host>;Port=<port>;Database=<database>;UID=<user name>;"
},
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by MariaDB dataset.
To copy data from MariaDB, set the type property of the dataset to MariaDBTable. There is no additional type-
specific property in this type of dataset.
Example

{
"name": "MariaDBDataset",
"properties": {
"type": "MariaDBTable",
"linkedServiceName": {
"referenceName": "<MariaDB linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by MariaDB source.
MariaDB as source
To copy data from MariaDB, set the source type in the copy activity to MariaDBSource. The following properties
are supported in the copy activity source section:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property of the copy activity source must be set to: MariaDBSource | Yes |
| query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in dataset is specified) |

Example:
"activities":[
{
"name": "CopyFromMariaDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<MariaDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MariaDBSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Marketo using Azure Data Factory
(Preview)
4/22/2019 • 3 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Marketo. It builds on
the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Marketo to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

NOTE
This Marketo connector is built on top of the Marketo REST API. Be aware that Marketo enforces a concurrent request limit on the service side. If you hit errors such as "Error while attempting to use REST API: Max rate limit '100' exceeded with in '20' secs (606)" or "Error while attempting to use REST API: Concurrent access limit '10' reached (615)", consider reducing the number of concurrent copy activity runs to reduce the number of requests to the service.
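
As a minimal sketch of one way to do that, you can cap the number of concurrent runs of the pipeline that contains the copy activity by setting the pipeline-level concurrency property. The pipeline name CopyMarketoPipeline and the dataset names are illustrative placeholders, not values from this article:

{
    "name": "CopyMarketoPipeline",
    "properties": {
        "concurrency": 1,
        "activities": [
            {
                "name": "CopyFromMarketo",
                "type": "Copy",
                "inputs": [
                    { "referenceName": "<Marketo input dataset name>", "type": "DatasetReference" }
                ],
                "outputs": [
                    { "referenceName": "<output dataset name>", "type": "DatasetReference" }
                ],
                "typeProperties": {
                    "source": { "type": "MarketoSource" },
                    "sink": { "type": "<sink type>" }
                }
            }
        ]
    }
}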

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Marketo connector.

Linked service properties


The following properties are supported for Marketo linked service:
| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property must be set to: Marketo | Yes |
| endpoint | The endpoint of the Marketo server. (for example, 123-ABC-321.mktorest.com) | Yes |
| clientId | The client Id of your Marketo service. | Yes |
| clientSecret | The client secret of your Marketo service. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes |
| useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No |
| useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over SSL. The default value is true. | No |
| usePeerVerification | Specifies whether to verify the identity of the server when connecting over SSL. The default value is true. | No |

Example:

{
"name": "MarketoLinkedService",
"properties": {
"type": "Marketo",
"typeProperties": {
"endpoint" : "123-ABC-321.mktorest.com",
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Marketo dataset.
To copy data from Marketo, set the type property of the dataset to MarketoObject. The following properties are
supported:
| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property of the dataset must be set to: MarketoObject | Yes |
| tableName | Name of the table. | No (if "query" in activity source is specified) |

Example

{
"name": "MarketoDataset",
"properties": {
"type": "MarketoObject",
"linkedServiceName": {
"referenceName": "<Marketo linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Marketo source.
Marketo as source
To copy data from Marketo, set the source type in the copy activity to MarketoSource. The following properties
are supported in the copy activity source section:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property of the copy activity source must be set to: MarketoSource | Yes |
| query | Use the custom SQL query to read data. For example: "SELECT * FROM Activitiy_Types". | No (if "tableName" in dataset is specified) |

Example:
"activities":[
{
"name": "CopyFromMarketo",
"type": "Copy",
"inputs": [
{
"referenceName": "<Marketo input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MarketoSource",
"query": "SELECT top 1000 * FROM Activitiy_Types"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to ODBC data stores using
Azure Data Factory
3/18/2019 • 8 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from and to an ODBC data
store. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from ODBC source to any supported sink data store, or copy from any supported source data
store to ODBC sink. For a list of data stores that are supported as sources/sinks by the copy activity, see the
Supported data stores table.
Specifically, this ODBC connector supports copying data from/to any ODBC-compatible data stores using Basic or Anonymous authentication.

Prerequisites
To use this ODBC connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the ODBC driver for the data store on the Integration Runtime machine.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-step
instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
ODBC connector.

Linked service properties


The following properties are supported for ODBC linked service:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property must be set to: Odbc | Yes |
| connectionString | The connection string excluding the credential portion. You can specify the connection string with a pattern like "Driver={SQL Server};Server=Server.database.windows.net;Database=TestDatabase;", or use the system DSN (Data Source Name) you set up on the Integration Runtime machine with "DSN=<name of the DSN on IR machine>;" (you still need to specify the credential portion in the linked service accordingly). Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes |
| authenticationType | Type of authentication used to connect to the ODBC data store. Allowed values are: Basic and Anonymous. | Yes |
| userName | Specify the user name if you are using Basic authentication. | No |
| password | Specify the password for the user account you specified for the userName. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No |
| credential | The access credential portion of the connection string specified in driver-specific property-value format. Example: "RefreshToken=<secret refresh token>;". Mark this field as a SecureString. | No |
| connectVia | The Integration Runtime to be used to connect to the data store. A Self-hosted Integration Runtime is required, as mentioned in Prerequisites. | Yes |

Example 1: using Basic authentication


{
"name": "ODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "<connection string>"
},
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: using Anonymous authentication

{
"name": "ODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "<connection string>"
},
"authenticationType": "Anonymous",
"credential": {
"type": "SecureString",
"value": "RefreshToken=<secret refresh token>;"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
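
As an additional illustration (not one of the official samples), the linked service can also reference a system DSN instead of a full connection string, as described in the connectionString property above. The DSN name MyOdbcDsn is a hypothetical placeholder for a DSN you have configured on the Self-hosted Integration Runtime machine:

{
    "name": "ODBCLinkedService",
    "properties": {
        "type": "Odbc",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "DSN=MyOdbcDsn;"
            },
            "authenticationType": "Basic",
            "userName": "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}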

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section provides
a list of properties supported by ODBC dataset.
To copy data from/to an ODBC-compatible data store, set the type property of the dataset to RelationalTable. The
following properties are supported:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property of the dataset must be set to: RelationalTable | Yes |
| tableName | Name of the table in the ODBC data store. | No for source (if "query" in activity source is specified); Yes for sink |

Example

{
"name": "ODBCDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<ODBC linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<table name>"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by ODBC source.
ODBC as source
To copy data from an ODBC-compatible data store, set the source type in the copy activity to RelationalSource. The
following properties are supported in the copy activity source section:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property of the copy activity source must be set to: RelationalSource | Yes |
| query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in dataset is specified) |

Example:
"activities":[
{
"name": "CopyFromODBC",
"type": "Copy",
"inputs": [
{
"referenceName": "<ODBC input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

ODBC as sink
To copy data to an ODBC-compatible data store, set the sink type in the copy activity to OdbcSink. The following
properties are supported in the copy activity sink section:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property of the copy activity sink must be set to: OdbcSink | Yes |
| writeBatchTimeout | Wait time for the batch insert operation to complete before it times out. Allowed values are: timespan. Example: "00:30:00" (30 minutes). | No |
| writeBatchSize | Inserts data into the SQL table when the buffer size reaches writeBatchSize. Allowed values are: integer (number of rows). | No (default is 0 - auto detected) |
| preCopyScript | Specify a SQL query for Copy Activity to execute before writing data into the data store in each run. You can use this property to clean up the pre-loaded data. | No |

NOTE
For "writeBatchSize", if it's not set (auto-detected), copy activity first detects whether the driver supports batch operations,
and set it to 10000 if it does, or set it to 1 if it doesn’t. If you explicitly set the value other than 0, copy activity honors the
value and fails at runtime if the driver doesn’t support batch operations.
Example:

"activities":[
{
"name": "CopyToODBC",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<ODBC output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "OdbcSink",
"writeBatchSize": 100000
}
}
}
]
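
If you need to clear pre-loaded data before each run, a hedged variation of the sink above adds the preCopyScript property from the table; the table name StagingTable is illustrative only:

"sink": {
    "type": "OdbcSink",
    "preCopyScript": "DELETE FROM StagingTable",
    "writeBatchSize": 100000
}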

IBM Informix source


You can copy data from IBM Informix database using the generic ODBC connector.
Set up a Self-hosted Integration Runtime on a machine with access to your data store. The Integration Runtime
uses the ODBC driver for Informix to connect to the data store. Therefore, install the driver if it is not already
installed on the same machine. For example, you can use driver "IBM INFORMIX ODBC DRIVER (64-bit)". See
Prerequisites section for details.
Before you use the Informix source in a Data Factory solution, verify whether the Integration Runtime can connect
to the data store using instructions in Troubleshoot connectivity issues section.
Create an ODBC linked service to link an IBM Informix data store to an Azure data factory as shown in the
following example:
{
"name": "InformixLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "<Informix connection string or DSN>"
},
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Read the article from the beginning for a detailed overview of using ODBC data stores as source/sink data stores in
a copy operation.

Microsoft Access source


You can copy data from Microsoft Access database using the generic ODBC connector.
Set up a Self-hosted Integration Runtime on a machine with access to your data store. The Integration Runtime
uses the ODBC driver for Microsoft Access to connect to the data store. Therefore, install the driver if it is not
already installed on the same machine. See Prerequisites section for details.
Before you use the Microsoft Access source in a Data Factory solution, verify whether the Integration Runtime can
connect to the data store using instructions in Troubleshoot connectivity issues section.
Create an ODBC linked service to link a Microsoft Access database to an Azure data factory as shown in the
following example:
{
"name": "MicrosoftAccessLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Driver={Microsoft Access Driver (*.mdb, *.accdb)};Dbq=<path to your DB file e.g.
C:\\mydatabase.accdb>;"
},
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Read the article from the beginning for a detailed overview of using ODBC data stores as source/sink data stores in
a copy operation.

SAP HANA sink


NOTE
To copy data from an SAP HANA data store, refer to the native SAP HANA connector. To copy data to SAP HANA, follow this instruction to use the ODBC connector. Note that the linked services for the SAP HANA connector and the ODBC connector have different types and thus cannot be reused.

You can copy data to SAP HANA database using the generic ODBC connector.
Set up a Self-hosted Integration Runtime on a machine with access to your data store. The Integration Runtime
uses the ODBC driver for SAP HANA to connect to the data store. Therefore, install the driver if it is not already
installed on the same machine. See Prerequisites section for details.
Before you use the SAP HANA sink in a Data Factory solution, verify whether the Integration Runtime can connect
to the data store using instructions in Troubleshoot connectivity issues section.
Create an ODBC linked service to link a SAP HANA data store to an Azure data factory as shown in the following
example:
{
"name": "SAPHANAViaODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Driver={HDBODBC};servernode=<HANA server>.clouddatahub-int.net:30015"
},
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Read the article from the beginning for a detailed overview of using ODBC data stores as source/sink data stores in
a copy operation.

Troubleshoot connectivity issues


To troubleshoot connection issues, use the Diagnostics tab of Integration Runtime Configuration Manager.
1. Launch Integration Runtime Configuration Manager.
2. Switch to the Diagnostics tab.
3. Under the "Test Connection" section, select the type of data store (linked service).
4. Specify the connection string that is used to connect to the data store, choose the authentication and enter
user name, password, and/or credentials.
5. Click Test connection to test the connection to the data store.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from MongoDB using Azure Data Factory
2/1/2019 • 4 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a MongoDB database. It
builds on the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
ADF has released this new version of the MongoDB connector, which provides better native MongoDB support. If you are using the previous MongoDB connector in your solution, it is supported as-is for backward compatibility; refer to the MongoDB connector (legacy) article.

Supported capabilities
You can copy data from MongoDB database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this MongoDB connector supports versions up to 3.4.

Prerequisites
To copy data from a MongoDB database that is not publicly accessible, you need to set up a Self-hosted
Integration Runtime. See Self-hosted Integration Runtime article to learn details.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MongoDB connector.

Linked service properties


The following properties are supported for MongoDB linked service:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property must be set to: MongoDbV2 | Yes |
| connectionString | Specify the MongoDB connection string, e.g. mongodb://[username:password@]host[:port][/[database][?options]]. Refer to the MongoDB manual on connection string for more details. Mark this field as a SecureString type to store it securely in Data Factory. You can also reference a secret stored in Azure Key Vault. | Yes |
| database | Name of the database that you want to access. | Yes |
| connectVia | The Integration Runtime to be used to connect to the data store. You can use Self-hosted Integration Runtime or Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. | No |

Example:

{
"name": "MongoDBLinkedService",
"properties": {
"type": "MongoDbV2",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "mongodb://[username:password@]host[:port][/[database][?options]]"
},
"database": "myDatabase"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
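
The table above notes that you can also reference a secret stored in Azure Key Vault. A minimal sketch, following the same pattern the other connectors in this documentation use for Key Vault references, stores the whole connection string as a secret; the secret name is a placeholder:

{
    "name": "MongoDBLinkedService",
    "properties": {
        "type": "MongoDbV2",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secretName>"
            },
            "database": "myDatabase"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}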

Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following properties are supported for MongoDB dataset:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property of the dataset must be set to: MongoDbV2Collection | Yes |
| collectionName | Name of the collection in the MongoDB database. | Yes |

Example:
{
"name": "MongoDbDataset",
"properties": {
"type": "MongoDbV2Collection",
"linkedServiceName": {
"referenceName": "<MongoDB linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"collectionName": "<Collection name>"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by MongoDB source.
MongoDB as source
The following properties are supported in the copy activity source section:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property of the copy activity source must be set to: MongoDbV2Source | Yes |
| filter | Specifies a selection filter using query operators. To return all documents in a collection, omit this parameter or pass an empty document ({}). | No |
| cursorMethods.project | Specifies the fields to return in the documents for projection. To return all fields in the matching documents, omit this parameter. | No |
| cursorMethods.sort | Specifies the order in which the query returns matching documents. Refer to cursor.sort(). | No |
| cursorMethods.limit | Specifies the maximum number of documents the server returns. Refer to cursor.limit(). | No |
| cursorMethods.skip | Specifies the number of documents to skip and from where MongoDB begins to return results. Refer to cursor.skip(). | No |
| batchSize | Specifies the number of documents to return in each batch of the response from the MongoDB instance. In most cases, modifying the batch size will not affect the user or the application. Cosmos DB limits each batch to no more than 40 MB in size (the sum of the sizes of the batchSize number of documents), so decrease this value if your documents are large. | No (the default is 100) |

TIP
ADF supports consuming BSON documents in Strict mode. Make sure your filter query is in Strict mode instead of Shell mode. More details can be found in the MongoDB manual.

Example:

"activities":[
{
"name": "CopyFromMongoDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<MongoDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MongoDbV2Source",
"filter": "{datetimeData: {$gte: ISODate(\"2018-12-11T00:00:00.000Z\"),$lt: ISODate(\"2018-12-
12T00:00:00.000Z\")}, _id: ObjectId(\"5acd7c3d0000000000000000\") }",
"cursorMethods": {
"project": "{ _id : 1, name : 1, age: 1, datetimeData: 1 }",
"sort": "{ age : 1 }",
"skip": 3,
"limit": 3
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Export JSON documents as-is


You can use this MongoDB connector to export JSON documents as-is from a MongoDB collection to various
file-based stores or to Azure Cosmos DB. To achieve such schema-agnostic copy, skip the "structure" (also called
schema) section in dataset and schema mapping in copy activity.
Schema mapping
To copy data from MongoDB to tabular sink, refer to schema mapping.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from MongoDB using Azure Data Factory
1/15/2019 • 6 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a MongoDB database. It
builds on the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
ADF has released a new MongoDB connector that provides better native MongoDB support compared to this ODBC-based implementation; refer to the MongoDB connector article for details. This legacy MongoDB connector remains supported as-is for backward compatibility, but for any new workload, please use the new connector.

Supported capabilities
You can copy data from MongoDB database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this MongoDB connector supports:
MongoDB versions 2.4, 2.6, 3.0, 3.2, 3.4 and 3.6.
Copying data using Basic or Anonymous authentication.

Prerequisites
To copy data from a MongoDB database that is not publicly accessible, you need to set up a Self-hosted
Integration Runtime. See Self-hosted Integration Runtime article to learn details. The Integration Runtime
provides a built-in MongoDB driver, therefore you don't need to manually install any driver when copying data
from MongoDB.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MongoDB connector.

Linked service properties


The following properties are supported for MongoDB linked service:
| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property must be set to: MongoDb | Yes |
| server | IP address or host name of the MongoDB server. | Yes |
| port | TCP port that the MongoDB server uses to listen for client connections. | No (default is 27017) |
| databaseName | Name of the MongoDB database that you want to access. | Yes |
| authenticationType | Type of authentication used to connect to the MongoDB database. Allowed values are: Basic, and Anonymous. | Yes |
| username | User account to access MongoDB. | Yes (if basic authentication is used) |
| password | Password for the user. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes (if basic authentication is used) |
| authSource | Name of the MongoDB database that you want to use to check your credentials for authentication. | No. For basic authentication, the default is to use the admin account and the database specified using the databaseName property. |
| enableSsl | Specifies whether the connections to the server are encrypted using SSL. The default value is false. | No |
| allowSelfSignedServerCert | Specifies whether to allow self-signed certificates from the server. The default value is false. | No |
| connectVia | The Integration Runtime to be used to connect to the data store. You can use Self-hosted Integration Runtime or Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. | No |

Example:
{
"name": "MongoDBLinkedService",
"properties": {
"type": "MongoDb",
"typeProperties": {
"server": "<server name>",
"databaseName": "<database name>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
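
As a hedged variation of the example above, the linked service can turn on encrypted connections, and (for servers that use self-signed certificates) allow them, using the enableSsl and allowSelfSignedServerCert properties from the table:

{
    "name": "MongoDBLinkedService",
    "properties": {
        "type": "MongoDb",
        "typeProperties": {
            "server": "<server name>",
            "databaseName": "<database name>",
            "authenticationType": "Basic",
            "username": "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            },
            "enableSsl": true,
            "allowSelfSignedServerCert": true
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}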

Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following properties are supported for MongoDB dataset:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property of the dataset must be set to: MongoDbCollection | Yes |
| collectionName | Name of the collection in the MongoDB database. | Yes |

Example:

{
"name": "MongoDbDataset",
"properties": {
"type": "MongoDbCollection",
"linkedServiceName": {
"referenceName": "<MongoDB linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"collectionName": "<Collection name>"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by MongoDB source.
MongoDB as source
The following properties are supported in the copy activity source section:
| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property of the copy activity source must be set to: MongoDbSource | Yes |
| query | Use the custom SQL-92 query to read data. For example: select * from MyTable. | No (if "collectionName" in dataset is specified) |

Example:

"activities":[
{
"name": "CopyFromMongoDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<MongoDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MongoDbSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

TIP
When specifying the SQL query, pay attention to the DateTime format. For example:
SELECT * FROM Account WHERE LastModifiedDate >= '2018-06-01' AND LastModifiedDate < '2018-06-02', or to use a parameter:
SELECT * FROM Account WHERE LastModifiedDate >= '@{formatDateTime(pipeline().parameters.StartTime,'yyyy-MM-dd HH:mm:ss')}' AND LastModifiedDate < '@{formatDateTime(pipeline().parameters.EndTime,'yyyy-MM-dd HH:mm:ss')}'

Schema by Data Factory


Azure Data Factory service infers schema from a MongoDB collection by using the latest 100 documents in the
collection. If these 100 documents do not contain full schema, some columns may be ignored during the copy
operation.

Data type mapping for MongoDB


When copying data from MongoDB, the following mappings are used from MongoDB data types to Azure Data
Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the source
schema and data type to the sink.

| MONGODB DATA TYPE | DATA FACTORY INTERIM DATA TYPE |
| ----------------- | ------------------------------ |
| Binary | Byte[] |
| Boolean | Boolean |
| Date | DateTime |
| NumberDouble | Double |
| NumberInt | Int32 |
| NumberLong | Int64 |
| ObjectID | String |
| String | String |
| UUID | Guid |
| Object | Re-normalized into flattened columns with "_" as the nested separator |

NOTE
To learn about support for arrays using virtual tables, refer to Support for complex types using virtual tables section.
Currently, the following MongoDB data types are not supported: DBPointer, JavaScript, Max/Min key, Regular Expression,
Symbol, Timestamp, Undefined.

Support for complex types using virtual tables


Azure Data Factory uses a built-in ODBC driver to connect to and copy data from your MongoDB database. For
complex types such as arrays or objects with different types across the documents, the driver re-normalizes data
into corresponding virtual tables. Specifically, if a table contains such columns, the driver generates the following
virtual tables:
A base table, which contains the same data as the real table except for the complex type columns. The base
table uses the same name as the real table that it represents.
A virtual table for each complex type column, which expands the nested data. The virtual tables are named
using the name of the real table, a separator "_", and the name of the array or object.
Virtual tables refer to the data in the real table, enabling the driver to access the denormalized data. You can access
the content of MongoDB arrays by querying and joining the virtual tables.
Example
For example, ExampleTable here is a MongoDB table that has one column with an array of Objects in each cell –
Invoices, and one column with an array of Scalar types – Ratings.

| _ID | CUSTOMER NAME | INVOICES | SERVICE LEVEL | RATINGS |
| --- | ------------- | -------- | ------------- | ------- |
| 1111 | ABC | [{invoice_id:"123", item:"toaster", price:"456", discount:"0.2"}, {invoice_id:"124", item:"oven", price:"1235", discount:"0.2"}] | Silver | [5,6] |
| 2222 | XYZ | [{invoice_id:"135", item:"fridge", price:"12543", discount:"0.0"}] | Gold | [1,2] |

The driver would generate multiple virtual tables to represent this single table. The first virtual table is the base
table named "ExampleTable", shown in the example. The base table contains all the data of the original table, but
the data from the arrays has been omitted and is expanded in the virtual tables.

| _ID | CUSTOMER NAME | SERVICE LEVEL |
| --- | ------------- | ------------- |
| 1111 | ABC | Silver |
| 2222 | XYZ | Gold |

The following tables show the virtual tables that represent the original arrays in the example. These tables contain
the following:
A reference back to the original primary key column corresponding to the row of the original array (via the _id
column)
An indication of the position of the data within the original array
The expanded data for each element within the array
Table "ExampleTable_Invoices":

| _ID | EXAMPLETABLE_INVOICES_DIM1_IDX | INVOICE_ID | ITEM | PRICE | DISCOUNT |
| --- | ------------------------------ | ---------- | ---- | ----- | -------- |
| 1111 | 0 | 123 | toaster | 456 | 0.2 |
| 1111 | 1 | 124 | oven | 1235 | 0.2 |
| 2222 | 0 | 135 | fridge | 12543 | 0.0 |

Table "ExampleTable_Ratings":

| _ID | EXAMPLETABLE_RATINGS_DIM1_IDX | EXAMPLETABLE_RATINGS |
| --- | ----------------------------- | -------------------- |
| 1111 | 0 | 5 |
| 1111 | 1 | 6 |
| 2222 | 0 | 1 |
| 2222 | 1 | 2 |
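
To illustrate the statement above about querying and joining virtual tables, a hedged sketch of a copy activity source could join the base table with the Invoices virtual table. The exact SQL-92 dialect accepted depends on the built-in ODBC driver, so treat this as illustrative rather than a verified query:

"source": {
    "type": "MongoDbSource",
    "query": "SELECT e._id, i.invoice_id, i.item, i.price FROM ExampleTable e JOIN ExampleTable_Invoices i ON e._id = i._id"
}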

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from MySQL using Azure Data Factory
3/15/2019 • 5 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a MySQL database. It
builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from MySQL database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this MySQL connector supports MySQL version 5.6 and 5.7.

Prerequisites
If your MySQL database is not publicly accessible, you need to set up a Self-hosted Integration Runtime. To learn
about Self-hosted integration runtimes, see Self-hosted Integration Runtime article. The Integration Runtime
provides a built-in MySQL driver starting from version 3.7, therefore you don't need to manually install any driver.
For Self-hosted IR versions earlier than 3.7, you need to install the MySQL Connector/Net for Microsoft Windows (a version between 6.6.5 and 6.10.7) on the Integration Runtime machine. This 32-bit driver is compatible with the 64-bit IR.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MySQL connector.

Linked service properties


The following properties are supported for MySQL linked service:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property must be set to: MySql | Yes |
| connectionString | Specify the information needed to connect to the MySQL instance. Mark this field as a SecureString to store it securely in Data Factory. You can also put the password in Azure Key Vault and pull the password configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. | Yes |
| connectVia | The Integration Runtime to be used to connect to the data store. You can use Self-hosted Integration Runtime or Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. | No |

A typical connection string is Server=<server>;Port=<port>;Database=<database>;UID=<username>;PWD=<password> .


More properties that you can set, depending on your case:

| PROPERTY | DESCRIPTION | OPTIONS | REQUIRED |
| -------- | ----------- | ------- | -------- |
| SSLMode | This option specifies whether the driver uses SSL encryption and verification when connecting to MySQL. E.g. SSLMode=<0/1/2/3/4> | DISABLED (0) / PREFERRED (1) (Default) / REQUIRED (2) / VERIFY_CA (3) / VERIFY_IDENTITY (4) | No |
| UseSystemTrustStore | This option specifies whether to use a CA certificate from the system trust store, or from a specified PEM file. E.g. UseSystemTrustStore=<0/1>; | Enabled (1) / Disabled (0) (Default) | No |

Example:

{
"name": "MySQLLinkedService",
"properties": {
"type": "MySql",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Port=<port>;Database=<database>;UID=<username>;PWD=<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
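
A hedged variation of the connection string above appends the optional SSL-related settings from the table; the chosen values (SSLMode=2 for REQUIRED, UseSystemTrustStore=1 for enabled) are illustrative only:

"connectionString": {
    "type": "SecureString",
    "value": "Server=<server>;Port=<port>;Database=<database>;UID=<username>;PWD=<password>;SSLMode=2;UseSystemTrustStore=1;"
}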
Example: store password in Azure Key Vault

{
"name": "MySQLLinkedService",
"properties": {
"type": "MySql",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Port=<port>;Database=<database>;UID=<username>;"
},
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

If you were using a MySQL linked service with the following payload, it is still supported as-is, but you are encouraged to use the new one going forward.
Previous payload:

{
"name": "MySQLLinkedService",
"properties": {
"type": "MySql",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by MySQL dataset.
To copy data from MySQL, set the type property of the dataset to RelationalTable. The following properties are
supported:
| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property of the dataset must be set to: RelationalTable | Yes |
| tableName | Name of the table in the MySQL database. | No (if "query" in activity source is specified) |

Example

{
"name": "MySQLDataset",
"properties":
{
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<MySQL linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by MySQL source.
MySQL as source
To copy data from MySQL, set the source type in the copy activity to RelationalSource. The following properties
are supported in the copy activity source section:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property of the copy activity source must be set to: RelationalSource | Yes |
| query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in dataset is specified) |

Example:
"activities":[
{
"name": "CopyFromMySQL",
"type": "Copy",
"inputs": [
{
"referenceName": "<MySQL input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Data type mapping for MySQL


When copying data from MySQL, the following mappings are used from MySQL data types to Azure Data
Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the source
schema and data type to the sink.

| MYSQL DATA TYPE | DATA FACTORY INTERIM DATA TYPE |
| --------------- | ------------------------------ |
| bigint | Int64 |
| bigint unsigned | Decimal |
| bit(1) | Boolean |
| bit(M), M>1 | Byte[] |
| blob | Byte[] |
| bool | Int16 |
| char | String |
| date | Datetime |
| datetime | Datetime |
| decimal | Decimal, String |
| double | Double |
| double precision | Double |
| enum | String |
| float | Single |
| int | Int32 |
| int unsigned | Int64 |
| integer | Int32 |
| integer unsigned | Int64 |
| long varbinary | Byte[] |
| long varchar | String |
| longblob | Byte[] |
| longtext | String |
| mediumblob | Byte[] |
| mediumint | Int32 |
| mediumint unsigned | Int64 |
| mediumtext | String |
| numeric | Decimal |
| real | Double |
| set | String |
| smallint | Int16 |
| smallint unsigned | Int32 |
| text | String |
| time | TimeSpan |
| timestamp | Datetime |
| tinyblob | Byte[] |
| tinyint | Int16 |
| tinyint unsigned | Int16 |
| tinytext | String |
| varchar | String |
| year | Int |

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Netezza by using Azure Data
Factory
2/1/2019 • 3 minutes to read • Edit Online

This article outlines how to use Copy Activity in Azure Data Factory to copy data from Netezza. The article builds
on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.

Supported capabilities
You can copy data from Netezza to any supported sink data store. For a list of data stores that Copy Activity
supports as sources and sinks, see Supported data stores and formats.
Azure Data Factory provides a built-in driver to enable connectivity. You don't need to manually install any driver
to use this connector.

Get started
You can create a pipeline that uses a copy activity by using the .NET SDK, the Python SDK, Azure PowerShell, the
REST API, or an Azure Resource Manager template. See the Copy Activity tutorial for step-by-step instructions on
how to create a pipeline that has a copy activity.
The following sections provide details about properties you can use to define Data Factory entities that are specific
to the Netezza connector.

Linked service properties


The following properties are supported for the Netezza linked service:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property must be set to Netezza. | Yes |
| connectionString | An ODBC connection string to connect to Netezza. Mark this field as a SecureString to store it securely in Data Factory. You can also put the password in Azure Key Vault and pull the pwd configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. | Yes |
| connectVia | The Integration Runtime to use to connect to the data store. You can choose a self-hosted Integration Runtime or the Azure Integration Runtime (if your data store is publicly accessible). If not specified, the default Azure Integration Runtime is used. | No |
A typical connection string is Server=<server>;Port=<port>;Database=<database>;UID=<user name>;PWD=<password> .
The following table describes more properties that you can set:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| SecurityLevel | The level of security (SSL/TLS) that the driver uses for the connection to the data store. Example: SecurityLevel=preferredSecured. Supported values are: Only unsecured (onlyUnSecured): the driver doesn't use SSL; Preferred unsecured (preferredUnSecured) (default): if the server provides a choice, the driver doesn't use SSL; Preferred secured (preferredSecured): if the server provides a choice, the driver uses SSL; Only secured (onlySecured): the driver doesn't connect unless an SSL connection is available. | No |
| CaCertFile | The full path to the SSL certificate that's used by the server. Example: CaCertFile=<cert path>; | Yes, if SSL is enabled |

Example

{
"name": "NetezzaLinkedService",
"properties": {
"type": "Netezza",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;PWD=<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
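
A hedged variation of the connection string above adds the SSL-related settings described earlier; the certificate path is a hypothetical placeholder:

"connectionString": {
    "type": "SecureString",
    "value": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;PWD=<password>;SecurityLevel=onlySecured;CaCertFile=C:\\certs\\netezza-ca.pem;"
}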

Example: store password in Azure Key Vault


{
"name": "NetezzaLinkedService",
"properties": {
"type": "Netezza",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;"
},
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
This section provides a list of properties that the Netezza dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets.
To copy data from Netezza, set the type property of the dataset to NetezzaTable. The following properties are
supported:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property of the dataset must be set to: NetezzaTable | Yes |
| tableName | Name of the table. | No (if "query" in activity source is specified) |

Example

{
"name": "NetezzaDataset",
"properties": {
"type": "NetezzaTable",
"linkedServiceName": {
"referenceName": "<Netezza linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy Activity properties


This section provides a list of properties that the Netezza source supports.
For a full list of sections and properties that are available for defining activities, see Pipelines.
Netezza as source
To copy data from Netezza, set the source type in Copy Activity to NetezzaSource. The following properties are
supported in the Copy Activity source section:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property of the Copy Activity source must be set to NetezzaSource. | Yes |
| query | Use the custom SQL query to read data. Example: "SELECT * FROM MyTable" | No (if "tableName" in dataset is specified) |

Example:

"activities":[
{
"name": "CopyFromNetezza",
"type": "Copy",
"inputs": [
{
"referenceName": "<Netezza input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "NetezzaSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from an OData source by using Azure
Data Factory
3/6/2019 • 5 minutes to read • Edit Online

This article outlines how to use Copy Activity in Azure Data Factory to copy data from an OData source. The
article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.

Supported capabilities
You can copy data from an OData source to any supported sink data store. For a list of data stores that Copy
Activity supports as sources and sinks, see Supported data stores and formats.
Specifically, this OData connector supports:
OData version 3.0 and 4.0.
Copying data by using one of the following authentications: Anonymous, Basic, Windows, AAD service
principal, and managed identities for Azure resources.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are specific
to an OData connector.

Linked service properties


The following properties are supported for an OData linked service:

| PROPERTY | DESCRIPTION | REQUIRED |
| -------- | ----------- | -------- |
| type | The type property must be set to OData. | Yes |
| url | The root URL of the OData service. | Yes |
| authenticationType | The type of authentication used to connect to the OData source. Allowed values are Anonymous, Basic, Windows, AadServicePrincipal, and ManagedServiceIdentity. User-based OAuth isn't supported. | Yes |
| userName | Specify userName if you use Basic or Windows authentication. | No |
| password | Specify password for the user account you specified for userName. Mark this field as a SecureString type to store it securely in Data Factory. You also can reference a secret stored in Azure Key Vault. | No |
| servicePrincipalId | Specify the Azure Active Directory application's client ID. | No |
| aadServicePrincipalCredentialType | Specify the credential type to use for service principal authentication. Allowed values are: ServicePrincipalKey or ServicePrincipalCert. | No |
| servicePrincipalKey | Specify the Azure Active Directory application's key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No |
| servicePrincipalEmbeddedCert | Specify the base64 encoded certificate of your application registered in Azure Active Directory. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No |
| servicePrincipalEmbeddedCertPassword | Specify the password of your certificate if your certificate is secured with a password. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No |
| tenant | Specify the tenant information (domain name or tenant ID) under which your application resides. Retrieve it by hovering the mouse in the top-right corner of the Azure portal. | No |
| aadResourceId | Specify the AAD resource you are requesting for authorization. | No |
| connectVia | The Integration Runtime to use to connect to the data store. You can choose Azure Integration Runtime or a self-hosted Integration Runtime (if your data store is located in a private network). If not specified, the default Azure Integration Runtime is used. | No |

Example 1: Using Anonymous authentication

{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "https://services.odata.org/OData/OData.svc",
"authenticationType": "Anonymous"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: Using Basic authentication

{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "<endpoint of OData source>",
"authenticationType": "Basic",
"userName": "<user name>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 3: Using Windows authentication


{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "<endpoint of OData source>",
"authenticationType": "Windows",
"userName": "<domain>\\<user>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 4: Using service principal key authentication

{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "<endpoint of OData source>",
"authenticationType": "AadServicePrincipal",
"servicePrincipalId": "<service principal id>",
"aadServicePrincipalCredentialType": "ServicePrincipalKey",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"aadResourceId": "<AAD resource URL>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}

Example 5: Using service principal cert authentication


{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "<endpoint of OData source>",
"authenticationType": "AadServicePrincipal",
"servicePrincipalId": "<service principal id>",
"aadServicePrincipalCredentialType": "ServicePrincipalCert",
"servicePrincipalEmbeddedCert": {
"type": "SecureString",
"value": "<base64 encoded string of (.pfx) certificate data>"
},
"servicePrincipalEmbeddedCertPassword": {
"type": "SecureString",
"value": "<password of your certificate>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"aadResourceId": "<AAD resource e.g. https://tenant.sharepoint.com>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
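
If you need to produce the base64 encoded certificate value for servicePrincipalEmbeddedCert, one possible approach (a sketch, assuming the .pfx file is available locally; the path is a placeholder) is a PowerShell one-liner such as:

# Read the .pfx file and emit its contents as a base64 string (file path is a placeholder)
[System.Convert]::ToBase64String([System.IO.File]::ReadAllBytes("<path to your .pfx file>"))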

Dataset properties
This section provides a list of properties that the OData dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
To copy data from OData, set the type property of the dataset to ODataResource. The following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to ODataResource. | Yes
path | The path to the OData resource. | Yes

Example

{
"name": "ODataDataset",
"properties":
{
"type": "ODataResource",
"linkedServiceName": {
"referenceName": "<OData linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties":
{
"path": "Products"
}
}
}
Copy Activity properties
This section provides a list of properties that the OData source supports.
For a full list of sections and properties that are available for defining activities, see Pipelines.
OData as source
To copy data from OData, set the source type in Copy Activity to RelationalSource. The following properties are
supported in the Copy Activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the Copy Activity source must be set to RelationalSource. | Yes
query | OData query options for filtering data. Example: "?$select=Name,Description&$top=5". | No

Note: The OData connector copies data from the combined URL: [URL specified in linked service]/[path specified in dataset][query specified in copy activity source]. For more information, see OData URL components.
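
For instance, with the hypothetical values https://services.odata.org/OData/OData.svc as the linked service url, Products as the dataset path, and "?$select=Name,Description&$top=5" as the copy activity source query, the connector would read from https://services.odata.org/OData/OData.svc/Products?$select=Name,Description&$top=5.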

Example

"activities":[
{
"name": "CopyFromOData",
"type": "Copy",
"inputs": [
{
"referenceName": "<OData input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "?$select=Name,Description&$top=5"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Data type mapping for OData


When you copy data from OData, the following mappings are used between OData data types and Azure Data
Factory interim data types. To learn how Copy Activity maps the source schema and data type to the sink, see
Schema and data type mappings.

ODATA DATA TYPE DATA FACTORY INTERIM DATA TYPE

Edm.Binary Byte[]

Edm.Boolean Bool

Edm.Byte Byte[]

Edm.DateTime DateTime

Edm.Decimal Decimal

Edm.Double Double

Edm.Single Single

Edm.Guid Guid

Edm.Int16 Int16

Edm.Int32 Int32

Edm.Int64 Int64

Edm.SByte Int16

Edm.String String

Edm.Time TimeSpan

Edm.DateTimeOffset DateTimeOffset

NOTE
OData complex data types (such as Object) aren't supported.

Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from and to ODBC data stores using
Azure Data Factory
3/18/2019 • 8 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from and to an ODBC data
store. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from ODBC source to any supported sink data store, or copy from any supported source data
store to ODBC sink. For a list of data stores that are supported as sources/sinks by the copy activity, see the
Supported data stores table.
Specifically, this ODBC connector supports copying data from/to any ODBC-compatible data stores using Basic or Anonymous authentication.

Prerequisites
To use this ODBC connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the ODBC driver for the data store on the Integration Runtime machine.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
ODBC connector.

Linked service properties


The following properties are supported for ODBC linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Odbc | Yes
connectionString | The connection string excluding the credential portion. You can specify the connection string with a pattern like "Driver={SQL Server};Server=Server.database.windows.net;Database=TestDatabase;", or use the system DSN (Data Source Name) you set up on the Integration Runtime machine with "DSN=<name of the DSN on IR machine>;" (you still need to specify the credential portion in the linked service accordingly). Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
authenticationType | Type of authentication used to connect to the ODBC data store. Allowed values are: Basic and Anonymous. | Yes
userName | Specify user name if you are using Basic authentication. | No
password | Specify password for the user account you specified for the userName. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No
credential | The access credential portion of the connection string specified in driver-specific property-value format. Example: "RefreshToken=<secret refresh token>;". Mark this field as a SecureString. | No
connectVia | The Integration Runtime to be used to connect to the data store. A Self-hosted Integration Runtime is required as mentioned in Prerequisites. | Yes

Example 1: using Basic authentication


{
"name": "ODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "<connection string>"
},
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: using Anonymous authentication

{
"name": "ODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "<connection string>"
},
"authenticationType": "Anonymous",
"credential": {
"type": "SecureString",
"value": "RefreshToken=<secret refresh token>;"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
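
Example 3 (a sketch): using a system DSN. Assuming a DSN named MyDsn has been set up on the self-hosted Integration Runtime machine, the linked service could reference it like the following, with the credential still supplied in the linked service (all values are placeholders):

{
    "name": "ODBCLinkedService",
    "properties": {
        "type": "Odbc",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "DSN=MyDsn;"
            },
            "authenticationType": "Basic",
            "userName": "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}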

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by ODBC dataset.
To copy data from/to an ODBC-compatible data store, set the type property of the dataset to RelationalTable. The
following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: RelationalTable | Yes
tableName | Name of the table in the ODBC data store. | No for source (if "query" in activity source is specified); Yes for sink

Example

{
"name": "ODBCDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<ODBC linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<table name>"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by ODBC source.
ODBC as source
To copy data from an ODBC-compatible data store, set the source type in the copy activity to RelationalSource. The
following properties are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: RelationalSource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in dataset is specified)

Example:
"activities":[
{
"name": "CopyFromODBC",
"type": "Copy",
"inputs": [
{
"referenceName": "<ODBC input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

ODBC as sink
To copy data to an ODBC-compatible data store, set the sink type in the copy activity to OdbcSink. The following
properties are supported in the copy activity sink section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity sink must be set to: OdbcSink | Yes
writeBatchTimeout | Wait time for the batch insert operation to complete before it times out. Allowed values are: timespan. Example: "00:30:00" (30 minutes). | No
writeBatchSize | Inserts data into the SQL table when the buffer size reaches writeBatchSize. Allowed values are: integer (number of rows). | No (default is 0 - auto detected)
preCopyScript | Specify a SQL query for Copy Activity to execute before writing data into the data store in each run. You can use this property to clean up the pre-loaded data. | No

NOTE
For "writeBatchSize", if it's not set (auto-detected), copy activity first detects whether the driver supports batch operations,
and set it to 10000 if it does, or set it to 1 if it doesn’t. If you explicitly set the value other than 0, copy activity honors the
value and fails at runtime if the driver doesn’t support batch operations.
Example:

"activities":[
{
"name": "CopyToODBC",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<ODBC output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "OdbcSink",
"writeBatchSize": 100000
}
}
}
]

IBM Informix source


You can copy data from IBM Informix database using the generic ODBC connector.
Set up a Self-hosted Integration Runtime on a machine with access to your data store. The Integration Runtime
uses the ODBC driver for Informix to connect to the data store. Therefore, install the driver if it is not already
installed on the same machine. For example, you can use driver "IBM INFORMIX ODBC DRIVER (64-bit)". See
Prerequisites section for details.
Before you use the Informix source in a Data Factory solution, verify whether the Integration Runtime can
connect to the data store using instructions in Troubleshoot connectivity issues section.
Create an ODBC linked service to link an IBM Informix data store to an Azure data factory as shown in the
following example:
{
"name": "InformixLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "<Informix connection string or DSN>"
},
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Read the article from the beginning for a detailed overview of using ODBC data stores as source/sink data stores
in a copy operation.

Microsoft Access source


You can copy data from Microsoft Access database using the generic ODBC connector.
Set up a Self-hosted Integration Runtime on a machine with access to your data store. The Integration Runtime
uses the ODBC driver for Microsoft Access to connect to the data store. Therefore, install the driver if it is not
already installed on the same machine. See Prerequisites section for details.
Before you use the Microsoft Access source in a Data Factory solution, verify whether the Integration Runtime
can connect to the data store using instructions in Troubleshoot connectivity issues section.
Create an ODBC linked service to link a Microsoft Access database to an Azure data factory as shown in the
following example:
{
"name": "MicrosoftAccessLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Driver={Microsoft Access Driver (*.mdb, *.accdb)};Dbq=<path to your DB file e.g.
C:\\mydatabase.accdb>;"
},
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Read the article from the beginning for a detailed overview of using ODBC data stores as source/sink data stores
in a copy operation.

SAP HANA sink


NOTE
To copy data from an SAP HANA data store, refer to the native SAP HANA connector. To copy data to SAP HANA, follow this instruction to use the ODBC connector. Note that the linked services for the SAP HANA connector and the ODBC connector have different types and therefore cannot be reused.

You can copy data to SAP HANA database using the generic ODBC connector.
Set up a Self-hosted Integration Runtime on a machine with access to your data store. The Integration Runtime
uses the ODBC driver for SAP HANA to connect to the data store. Therefore, install the driver if it is not already
installed on the same machine. See Prerequisites section for details.
Before you use the SAP HANA sink in a Data Factory solution, verify whether the Integration Runtime can
connect to the data store using instructions in Troubleshoot connectivity issues section.
Create an ODBC linked service to link a SAP HANA data store to an Azure data factory as shown in the following
example:
{
"name": "SAPHANAViaODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Driver={HDBODBC};servernode=<HANA server>.clouddatahub-int.net:30015"
},
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Read the article from the beginning for a detailed overview of using ODBC data stores as source/sink data stores
in a copy operation.

Troubleshoot connectivity issues


To troubleshoot connection issues, use the Diagnostics tab of Integration Runtime Configuration Manager.
1. Launch Integration Runtime Configuration Manager.
2. Switch to the Diagnostics tab.
3. Under the "Test Connection" section, select the type of data store (linked service).
4. Specify the connection string that is used to connect to the data store, choose the authentication and enter
user name, password, and/or credentials.
5. Click Test connection to test the connection to the data store.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Office 365 into Azure using Azure
Data Factory
6/3/2019 • 11 minutes to read

Azure Data Factory integrates with Microsoft Graph data connect, allowing you to bring the rich organizational
data in your Office 365 tenant into Azure in a scalable way and build analytics applications and extract insights
based on these valuable data assets. Integration with Privileged Access Management provides secured access
control for the valuable curated data in Office 365. Please refer to this link for an overview on Microsoft Graph
data connect and refer to this link for licensing information.
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Office 365. It builds on
the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
The ADF Office 365 connector and Microsoft Graph data connect enable at-scale ingestion of different types of datasets from Exchange email-enabled mailboxes, including address book contacts, calendar events, email messages, user information, mailbox settings, and so on. Refer here to see the complete list of datasets available.
For now, within a single copy activity you can only copy data from Office 365 into Azure Blob Storage, Azure
Data Lake Storage Gen1, and Azure Data Lake Storage Gen2 in JSON format (type setOfObjects). If you
want to load Office 365 into other types of data stores or in other formats, you can chain the first copy activity
with a subsequent copy activity to further load data into any of the supported ADF destination stores (refer to
"supported as a sink" column in the "Supported data stores and formats" table).

IMPORTANT
The Azure subscription containing the data factory and the sink data store must be under the same Azure Active
Directory (Azure AD) tenant as Office 365 tenant.
Ensure the Azure Integration Runtime region used for copy activity as well as the destination is in the same region where
the Office 365 tenant users' mailbox is located. Refer here to understand how the Azure IR location is determined. Refer
to table here for the list of supported Office regions and corresponding Azure regions.
Service Principal authentication is the only authentication mechanism supported for Azure Blob Storage, Azure Data Lake
Storage Gen1, and Azure Data Lake Storage Gen2 as destination stores.

Prerequisites
To copy data from Office 365 into Azure, you need to complete the following prerequisite steps:
Your Office 365 tenant admin must complete on-boarding actions as described here.
Create and configure an Azure AD web application in Azure Active Directory. For instructions, see Create an
Azure AD application.
Make note of the following values, which you will use to define the linked service for Office 365:
Tenant ID. For instructions, see Get tenant ID.
Application ID and Application key. For instructions, see Get application ID and authentication key.
Add the user identity who will be making the data access request as the owner of the Azure AD web
application (from the Azure AD web application > Settings > Owners > Add owner).
The user identity must be in the Office 365 organization you are getting data from and must not be a
Guest user.

Approving new data access requests


If this is the first time you are requesting data for this context (a combination of which data table is being accessed, which destination account the data is being loaded into, and which user identity is making the data access request), you will see the copy activity status as "In Progress", and only when you click the "Details" link under Actions will you see the status as "RequestingConsent". A member of the data access approver group needs to approve the request in Privileged Access Management before the data extraction can proceed.
Refer here on how the approver can approve the data access request, and refer here for an explanation on the
overall integration with Privileged Access Management, including how to set up the data access approver group.

Policy validation
If ADF is created as part of a managed app and Azure policy assignments are made on resources within the management resource group, then for every copy activity run, ADF will check to make sure the policy assignments are enforced. Refer here for a list of supported policies.

Getting started
TIP
For a walkthrough of using Office 365 connector, see Load data from Office 365 article.

You can create a pipeline with the copy activity by using one of the following tools or SDKs. Select a link to go to a
tutorial with step-by-step instructions to create a pipeline with a copy activity.
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template.
The following sections provide details about properties that are used to define Data Factory entities specific to
Office 365 connector.

Linked service properties


The following properties are supported for Office 365 linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Office365 | Yes
office365TenantId | Azure tenant ID to which the Office 365 account belongs. | Yes
servicePrincipalTenantId | Specify the tenant information under which your Azure AD web application resides. | Yes
servicePrincipalId | Specify the application's client ID. | Yes
servicePrincipalKey | Specify the application's key. Mark this field as a SecureString to store it securely in Data Factory. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime. | No

NOTE
The difference between office365TenantId and servicePrincipalTenantId and the corresponding value to provide:
If you are an enterprise developer developing an application against Office 365 data for your own organization's usage,
then you should supply the same tenant ID for both properties, which is your organization's AAD tenant ID.
If you are an ISV developer developing an application for your customers, then office365TenantId will be your customer’s
(application installer) AAD tenant ID and servicePrincipalTenantId will be your company’s AAD tenant ID.

Example:

{
"name": "Office365LinkedService",
"properties": {
"type": "Office365",
"typeProperties": {
"office365TenantId": "<Office 365 tenant id>",
"servicePrincipalTenantId": "<AAD app service principal tenant id>",
"servicePrincipalId": "<AAD app service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<AAD app service principal key>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Office 365 dataset.
To copy data from Office 365, the following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: Office365Table | Yes
tableName | Name of the dataset to extract from Office 365. Refer here for the list of Office 365 datasets available for extraction. | Yes
allowedGroups | Group selection predicate. Use this property to select up to 10 user groups for whom the data will be retrieved. If no groups are specified, then data will be returned for the entire organization. | No
userScopeFilterUri | When the allowedGroups property is not specified, you can use a predicate expression that is applied on the entire tenant to filter the specific rows to extract from Office 365. The predicate format should match the query format of Microsoft Graph APIs, e.g. https://graph.microsoft.com/v1.0/users?$filter=Department eq 'Finance'. | No
dateFilterColumn | Name of the DateTime filter column. Use this property to limit the time range for which Office 365 data is extracted. | Yes if the dataset has one or more DateTime columns. Refer here for the list of datasets that require this DateTime filter.
startTime | Start DateTime value to filter on. | Yes if dateFilterColumn is specified
endTime | End DateTime value to filter on. | Yes if dateFilterColumn is specified

Example

{
"name": "DS_May2019_O365_Message",
"properties": {
"type": "Office365Table",
"linkedServiceName": {
"referenceName": "<Office 365 linked service name>",
"type": "LinkedServiceReference"
},
"structure": [
{
"name": "Id",
"type": "String",
"description": "The unique identifier of the event."
},
{
"name": "CreatedDateTime",
"type": "DateTime",
"description": "The date and time that the event was created."
},
{
"name": "LastModifiedDateTime",
"type": "DateTime",
"description": "The date and time that the event was last modified."
},
{
"name": "ChangeKey",
"type": "String",
"description": "Identifies the version of the event object. Every time the event is changed,
ChangeKey changes as well. This allows Exchange to apply changes to the correct version of the object."
},
{
"name": "Categories",
"type": "String",
"description": "The categories associated with the event. Format: ARRAY<STRING>"
},
{
"name": "OriginalStartTimeZone",
"type": "String",
"description": "The start time zone that was set when the event was created. See
DateTimeTimeZone for a list of valid time zones."
},
{
"name": "OriginalEndTimeZone",
"type": "String",
"description": "The end time zone that was set when the event was created. See
DateTimeTimeZone for a list of valid time zones."
},
{
"name": "ResponseStatus",
"type": "String",
"description": "Indicates the type of response sent in response to an event message. Format:
STRUCT<Response: STRING, Time: STRING>"
},
{
"name": "iCalUId",
"type": "String",
"description": "A unique identifier that is shared by all instances of an event across
different calendars."
},
{
"name": "ReminderMinutesBeforeStart",
"type": "Int32",
"description": "The number of minutes before the event start time that the reminder alert
occurs."
},
{
"name": "IsReminderOn",
"type": "Boolean",
"description": "Set to true if an alert is set to remind the user of the event."
},
{
"name": "HasAttachments",
"type": "Boolean",
"description": "Set to true if the event has attachments."
},
{
"name": "Subject",
"type": "String",
"description": "The text of the event's subject line."
},
{
"name": "Body",
"type": "String",
"description": "The body of the message associated with the event.Format: STRUCT<ContentType:
STRING, Content: STRING>"
},
{
"name": "Importance",
"type": "String",
"description": "The importance of the event: Low, Normal, High."
},
{
"name": "Sensitivity",
"type": "String",
"description": "Indicates the level of privacy for the event: Normal, Personal, Private,
Confidential."
},
{
"name": "Start",
"type": "String",
"description": "The start time of the event. Format: STRUCT<DateTime: STRING, TimeZone:
STRING>"
STRING>"
},
{
"name": "End",
"type": "String",
"description": "The date and time that the event ends. Format: STRUCT<DateTime: STRING,
TimeZone: STRING>"
},
{
"name": "Location",
"type": "String",
"description": "Location information of the event. Format: STRUCT<DisplayName: STRING,
Address: STRUCT<Street: STRING, City: STRING, State: STRING, CountryOrRegion: STRING, PostalCode: STRING>,
Coordinates: STRUCT<Altitude: DOUBLE, Latitude: DOUBLE, Longitude: DOUBLE, Accuracy: DOUBLE, AltitudeAccuracy:
DOUBLE>>"
},
{
"name": "IsAllDay",
"type": "Boolean",
"description": "Set to true if the event lasts all day. Adjusting this property requires
adjusting the Start and End properties of the event as well."
},
{
"name": "IsCancelled",
"type": "Boolean",
"description": "Set to true if the event has been canceled."
},
{
"name": "IsOrganizer",
"type": "Boolean",
"description": "Set to true if the message sender is also the organizer."
},
{
"name": "Recurrence",
"type": "String",
"description": "The recurrence pattern for the event. Format: STRUCT<Pattern: STRUCT<Type:
STRING, `Interval`: INT, Month: INT, DayOfMonth: INT, DaysOfWeek: ARRAY<STRING>, FirstDayOfWeek: STRING,
Index: STRING>, `Range`: STRUCT<Type: STRING, StartDate: STRING, EndDate: STRING, RecurrenceTimeZone: STRING,
NumberOfOccurrences: INT>>"
},
{
"name": "ResponseRequested",
"type": "Boolean",
"description": "Set to true if the sender would like a response when the event is accepted or
declined."
},
{
"name": "ShowAs",
"type": "String",
"description": "The status to show: Free, Tentative, Busy, Oof, WorkingElsewhere, Unknown."
},
{
"name": "Type",
"type": "String",
"description": "The event type: SingleInstance, Occurrence, Exception, SeriesMaster."
},
{
"name": "Attendees",
"type": "String",
"description": "The collection of attendees for the event. Format: ARRAY<STRUCT<EmailAddress:
STRUCT<Name: STRING, Address: STRING>, Status: STRUCT<Response: STRING, Time: STRING>, Type: STRING>>"
},
{
"name": "Organizer",
"type": "String",
"description": "The organizer of the event. Format: STRUCT<EmailAddress: STRUCT<Name: STRING,
Address: STRING>>"
},
{
"name": "WebLink",
"name": "WebLink",
"type": "String",
"description": "The URL to open the event in Outlook Web App."
},
{
"name": "Attachments",
"type": "String",
"description": "The FileAttachment and ItemAttachment attachments for the message. Navigation
property. Format: ARRAY<STRUCT<LastModifiedDateTime: STRING, Name: STRING, ContentType: STRING, Size: INT,
IsInline: BOOLEAN, Id: STRING>>"
},
{
"name": "BodyPreview",
"type": "String",
"description": "The preview of the message associated with the event. It is in text format."
},
{
"name": "Locations",
"type": "String",
"description": "The locations where the event is held or attended from. The location and
locations properties always correspond with each other. Format: ARRAY<STRUCT<DisplayName: STRING, Address:
STRUCT<Street: STRING, City: STRING, State: STRING, CountryOrRegion: STRING, PostalCode: STRING>, Coordinates:
STRUCT<Altitude: DOUBLE, Latitude: DOUBLE, Longitude: DOUBLE, Accuracy: DOUBLE, AltitudeAccuracy: DOUBLE>,
LocationEmailAddress: STRING, LocationUri: STRING, LocationType: STRING, UniqueId: STRING, UniqueIdType:
STRING>>"
},
{
"name": "OnlineMeetingUrl",
"type": "String",
"description": "A URL for an online meeting. The property is set only when an organizer
specifies an event as an online meeting such as a Skype meeting"
},
{
"name": "OriginalStart",
"type": "DateTime",
"description": "The start time that was set when the event was created in UTC time."
},
{
"name": "SeriesMasterId",
"type": "String",
"description": "The ID for the recurring series master item, if this event is part of a
recurring series."
}
],
"typeProperties": {
"tableName": "BasicDataSet_v0.Event_v1",
"dateFilterColumn": "CreatedDateTime",
"startTime": "2019-04-28T16:00:00.000Z",
"endTime": "2019-05-05T16:00:00.000Z",
"userScopeFilterUri": "https://graph.microsoft.com/v1.0/users?$filter=Department eq 'Finance'"
}
}
}
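
A more minimal dataset sketch that restricts extraction to specific user groups could use allowedGroups together with the DateTime filter instead of userScopeFilterUri; the table name matches the example above, and the group identifiers shown are placeholders:

{
    "name": "DS_O365_Event_ByGroup",
    "properties": {
        "type": "Office365Table",
        "linkedServiceName": {
            "referenceName": "<Office 365 linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "tableName": "BasicDataSet_v0.Event_v1",
            "allowedGroups": [ "<group id 1>", "<group id 2>" ],
            "dateFilterColumn": "CreatedDateTime",
            "startTime": "2019-04-28T16:00:00.000Z",
            "endTime": "2019-05-05T16:00:00.000Z"
        }
    }
}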

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Office 365 source.
Office 365 as source
To copy data from Office 365, set the source type in the copy activity to Office365Source. No additional
properties are supported in the copy activity source section.
Example:
"activities": [
{
"name": "CopyFromO365ToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<Office 365 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "Office365Source"
},
"sink": {
"type": "BlobSink"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to Oracle by using Azure Data
Factory
3/5/2019 • 7 minutes to read

This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to an Oracle database.
It builds on the Copy Activity overview article that presents a general overview of the copy activity.

Supported capabilities
You can copy data from an Oracle database to any supported sink data store. You also can copy data from any
supported source data store to an Oracle database. For a list of data stores that are supported as sources or sinks
by the copy activity, see the Supported data stores table.
Specifically, this Oracle connector supports the following versions of an Oracle database. It also supports Basic and OID authentication:
Oracle 12c R1 (12.1)
Oracle 11g R1, R2 (11.1, 11.2)
Oracle 10g R1, R2 (10.1, 10.2)
Oracle 9i R1, R2 (9.0.1, 9.2)
Oracle 8i R3 (8.1.7)

NOTE
Oracle proxy server is not supported.

Prerequisites
To copy data from and to an Oracle database that isn't publicly accessible, you need to set up a Self-hosted
Integration Runtime. For more information about integration runtime, see Self-hosted Integration Runtime. The
integration runtime provides a built-in Oracle driver. Therefore, you don't need to manually install a driver when
you copy data from and to Oracle.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to the
Oracle connector.
Linked service properties
The following properties are supported for the Oracle linked service.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to Oracle. | Yes
connectionString | Specifies the information needed to connect to the Oracle Database instance. Mark this field as a SecureString to store it securely in Data Factory. You can also put the password in Azure Key Vault and pull the password configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. Supported connection type: you can use Oracle SID or Oracle Service Name to identify your database. If you use SID: Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=<password>; If you use Service Name: Host=<host>;Port=<port>;ServiceName=<servicename>;User Id=<username>;Password=<password>; | Yes
connectVia | The integration runtime to be used to connect to the data store. You can use Self-hosted Integration Runtime or Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. | No

TIP
If you hit an error saying "ORA-01025: UPI parameter out of range" and your Oracle version is 8i, add WireProtocolMode=1 to your connection string and try again.
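For example, a SID-based connection string with this setting appended might look like the following (all values are placeholders):
Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=<password>;WireProtocolMode=1;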

To enable encryption on Oracle connection, you have two options:


1. To use Triple-DES Encryption (3DES) and Advanced Encryption Standard (AES), on the Oracle server
side, go to Oracle Advanced Security (OAS) and configure the encryption settings; refer to details here.
The ADF Oracle connector automatically negotiates the encryption method to use the one you configure in
OAS when establishing the connection to Oracle.
2. To use SSL, follow below steps:
a. Get SSL certificate info. Get the DER encoded certificate information of your SSL cert, and save the
output (----- Begin Certificate … End Certificate -----) as a text file.
openssl x509 -inform DER -in [Full Path to the DER Certificate including the name of the DER
Certificate] -text

Example: extract cert info from DERcert.cer; then, save the output to cert.txt

openssl x509 -inform DER -in DERcert.cer -text


Output:
-----BEGIN CERTIFICATE-----
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXX
-----END CERTIFICATE-----

b. Build the keystore or truststore. The following command creates the truststore file with or without a
password in PKCS-12 format.

openssl pkcs12 -in [Path to the file created in the previous step] -out [Path and name of
TrustStore] -passout pass:[Keystore PWD] -nokeys -export

Example: creates a PKCS12 truststore file named MyTrustStoreFile with a password

openssl pkcs12 -in cert.txt -out MyTrustStoreFile -passout pass:ThePWD -nokeys -export

c. Place the truststore file on the Self-hosted IR machine, e.g. at C:\MyTrustStoreFile.


d. In ADF, configure the Oracle connection string with EncryptionMethod=1 and corresponding
TrustStore / TrustStorePassword value, e.g.
Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=
<password>;EncryptionMethod=1;TrustStore=C:\\MyTrustStoreFile;TrustStorePassword=
<trust_store_password>
.
Example:

{
"name": "OracleLinkedService",
"properties": {
"type": "Oracle",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=<password>;"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store password in Azure Key Vault


{
"name": "OracleLinkedService",
"properties": {
"type": "Oracle",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;"
},
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Oracle dataset.
To copy data from and to Oracle, set the type property of the dataset to OracleTable. The following properties are
supported.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to OracleTable. | Yes
tableName | The name of the table in the Oracle database that the linked service refers to. | Yes

Example:

{
"name": "OracleDataset",
"properties":
{
"type": "OracleTable",
"linkedServiceName": {
"referenceName": "<Oracle linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "MyTable"
}
}
}
Copy activity properties
For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Oracle source and sink.
Oracle as a source type
To copy data from Oracle, set the source type in the copy activity to OracleSource. The following properties are
supported in the copy activity source section.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to OracleSource. | Yes
oracleReaderQuery | Use the custom SQL query to read data. An example is "SELECT * FROM MyTable". | No

If you don't specify "oracleReaderQuery", the columns defined in the "structure" section of the dataset are used to
construct a query ( select column1, column2 from mytable ) to run against the Oracle database. If the dataset
definition doesn't have "structure", all columns are selected from the table.
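For illustration (the column and table names are placeholders), a dataset defined with a "structure" section such as the following would cause the copy activity to generate a query similar to SELECT EMPNO, ENAME FROM MyTable when no oracleReaderQuery is specified:

{
    "name": "OracleDataset",
    "properties": {
        "type": "OracleTable",
        "linkedServiceName": {
            "referenceName": "<Oracle linked service name>",
            "type": "LinkedServiceReference"
        },
        "structure": [
            { "name": "EMPNO", "type": "Int32" },
            { "name": "ENAME", "type": "String" }
        ],
        "typeProperties": {
            "tableName": "MyTable"
        }
    }
}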
Example:

"activities":[
{
"name": "CopyFromOracle",
"type": "Copy",
"inputs": [
{
"referenceName": "<Oracle input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "OracleSource",
"oracleReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Oracle as a sink type


To copy data to Oracle, set the sink type in the copy activity to OracleSink. The following properties are
supported in the copy activity sink section.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity sink must be set to OracleSink. | Yes
writeBatchSize | Inserts data into the SQL table when the buffer size reaches writeBatchSize. Allowed values are Integer (number of rows). | No (default is 10,000)
writeBatchTimeout | Wait time for the batch insert operation to complete before it times out. Allowed values are Timespan. An example is 00:30:00 (30 minutes). | No
preCopyScript | Specify a SQL query for the copy activity to execute before writing data into Oracle in each run. You can use this property to clean up the preloaded data. | No

Example:

"activities":[
{
"name": "CopyToOracle",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Oracle output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "OracleSink"
}
}
}
]
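
As a sketch of combining these sink properties (the target table name is a placeholder), a sink section that clears previously loaded rows before each run could look like:

"sink": {
    "type": "OracleSink",
    "writeBatchSize": 10000,
    "preCopyScript": "DELETE FROM MyTargetTable"
}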

Data type mapping for Oracle


When you copy data from and to Oracle, the following mappings are used from Oracle data types to Data Factory
interim data types. To learn about how the copy activity maps the source schema and data type to the sink, see
Schema and data type mappings.

ORACLE DATA TYPE DATA FACTORY INTERIM DATA TYPE

BFILE Byte[]

BLOB Byte[]
(only supported on Oracle 10g and higher)

CHAR String

CLOB String

DATE DateTime

FLOAT Decimal, String (if precision > 28)

INTEGER Decimal, String (if precision > 28)

LONG String

LONG RAW Byte[]

NCHAR String

NCLOB String

NUMBER Decimal, String (if precision > 28)

NVARCHAR2 String

RAW Byte[]

ROWID String

TIMESTAMP DateTime

TIMESTAMP WITH LOCAL TIME ZONE String

TIMESTAMP WITH TIME ZONE String

UNSIGNED INTEGER Number

VARCHAR2 String

XML String

NOTE
The data types INTERVAL YEAR TO MONTH and INTERVAL DAY TO SECOND aren't supported.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Oracle Eloqua using Azure Data
Factory (Preview)
1/3/2019 • 3 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Oracle Eloqua. It builds
on the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and provide feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Oracle Eloqua to any supported sink data store. For a list of data stores that are supported
as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Oracle Eloqua connector.

Linked service properties


The following properties are supported for Oracle Eloqua linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Eloqua | Yes
endpoint | The endpoint of the Eloqua server. Eloqua supports multiple data centers; to determine your endpoint, sign in to https://login.eloqua.com with your credential, then copy the base URL portion from the redirected URL with the pattern of xxx.xxx.eloqua.com. | Yes
username | The site name and user name of your Eloqua account in the form SiteName\Username, e.g. Eloqua\Alice. | Yes
password | The password corresponding to the user name. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No
useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over SSL. The default value is true. | No
usePeerVerification | Specifies whether to verify the identity of the server when connecting over SSL. The default value is true. | No

Example:

{
"name": "EloquaLinkedService",
"properties": {
"type": "Eloqua",
"typeProperties": {
"endpoint" : "<base URL e.g. xxx.xxx.eloqua.com>",
"username" : "<site name>\\<user name e.g. Eloqua\\Alice>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Oracle Eloqua dataset.
To copy data from Oracle Eloqua, set the type property of the dataset to EloquaObject. The following properties
are supported:
PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: EloquaObject | Yes
tableName | Name of the table. | No (if "query" in activity source is specified)

Example

{
"name": "EloquaDataset",
"properties": {
"type": "EloquaObject",
"linkedServiceName": {
"referenceName": "<Eloqua linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Oracle Eloqua source.
Eloqua as source
To copy data from Oracle Eloqua, set the source type in the copy activity to EloquaSource. The following
properties are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: EloquaSource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM Accounts". | No (if "tableName" in dataset is specified)

Example:
"activities":[
{
"name": "CopyFromEloqua",
"type": "Copy",
"inputs": [
{
"referenceName": "<Eloqua input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "EloquaSource",
"query": "SELECT * FROM Accounts"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of supported data stored by Azure Data Factory, see supported data stores.
Copy data from Oracle Responsys using Azure Data
Factory (Preview)
1/16/2019 • 3 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Oracle Responsys. It
builds on the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Oracle Responsys to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
You can create a pipeline with copy activity using .NET SDK, Python SDK, Azure PowerShell, REST API, or Azure
Resource Manager template. See Copy activity tutorial for step-by-step instructions to create a pipeline with a
copy activity.
The following sections provide details about properties that are used to define Data Factory entities specific to
Oracle Responsys connector.

Linked service properties


The following properties are supported for Oracle Responsys linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Responsys | Yes
endpoint | The endpoint of the Responsys server. | Yes
clientId | The client ID associated with the Responsys application. | Yes
clientSecret | The client secret associated with the Responsys application. You can choose to mark this field as a SecureString to store it securely in ADF, or store the password in Azure Key Vault and let ADF copy activity pull from there when performing data copy - learn more from Store credentials in Key Vault. | Yes
useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No
useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over SSL. The default value is true. | No
usePeerVerification | Specifies whether to verify the identity of the server when connecting over SSL. The default value is true. | No

Example:

{
"name": "OracleResponsysLinkedService",
"properties": {
"type": "Responsys",
"typeProperties": {
"endpoint" : "<endpoint>",
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"useEncryptedEndpoints" : true,
"useHostVerification" : true,
"usePeerVerification" : true
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Oracle Responsys dataset.
To copy data from Oracle Responsys, set the type property of the dataset to ResponsysObject. The following
properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: ResponsysObject | Yes
tableName | Name of the table. | No (if "query" in activity source is specified)

Example

{
"name": "OracleResponsysDataset",
"properties": {
"type": "ResponsysObject",
"linkedServiceName": {
"referenceName": "<Oracle Responsys linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Oracle Responsys source.
Oracle Responsys as source
To copy data from Oracle Responsys, set the source type in the copy activity to ResponsysSource. The following
properties are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: ResponsysSource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in dataset is specified)

Example:
"activities":[
{
"name": "CopyFromOracleResponsys",
"type": "Copy",
"inputs": [
{
"referenceName": "<Oracle Responsys input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ResponsysSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Oracle Service Cloud using Azure
Data Factory (Preview)
1/16/2019 • 3 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Oracle Service Cloud. It
builds on the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and provide feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Oracle Service Cloud to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Oracle Service Cloud connector.

Linked service properties


The following properties are supported for Oracle Service Cloud linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: OracleServiceCloud | Yes
host | The URL of the Oracle Service Cloud instance. | Yes
username | The user name that you use to access the Oracle Service Cloud server. | Yes
password | The password corresponding to the user name that you provided in the username key. You can choose to mark this field as a SecureString to store it securely in ADF, or store the password in Azure Key Vault and let ADF copy activity pull from there when performing data copy - learn more from Store credentials in Key Vault. | Yes
useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No
useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over SSL. The default value is true. | No
usePeerVerification | Specifies whether to verify the identity of the server when connecting over SSL. The default value is true. | No

Example:

{
"name": "OracleServiceCloudLinkedService",
"properties": {
"type": "OracleServiceCloud",
"typeProperties": {
"host" : "<host>",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"useEncryptedEndpoints" : true,
"useHostVerification" : true,
"usePeerVerification" : true,
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Oracle Service Cloud dataset.
To copy data from Oracle Service Cloud, set the type property of the dataset to OracleServiceCloudObject. The
following properties are supported:
PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: OracleServiceCloudObject | Yes
tableName | Name of the table. | No (if "query" in activity source is specified)

Example

{
"name": "OracleServiceCloudDataset",
"properties": {
"type": "OracleServiceCloudObject",
"linkedServiceName": {
"referenceName": "<OracleServiceCloud linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Oracle Service Cloud source.
Oracle Service Cloud as source
To copy data from Oracle Service Cloud, set the source type in the copy activity to OracleServiceCloudSource.
The following properties are supported in the copy activity source section:

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes


source must be set to:
OracleServiceCloudSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .

Example:
"activities":[
{
"name": "CopyFromOracleServiceCloud",
"type": "Copy",
"inputs": [
{
"referenceName": "<OracleServiceCloud input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "OracleServiceCloudSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Parquet format in Azure Data Factory
5/6/2019 • 3 minutes to read

Follow this article when you want to parse Parquet files or write data in Parquet format.
Parquet format is supported for the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage
Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS,
HTTP, and SFTP.

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Parquet dataset.

PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset Yes


must be set to Parquet.

location Location settings of the file(s). Each Yes


file-based connector has its own
location type and supported
properties under location . See
details in connector article ->
Dataset properties section.

compressionCodec The compression codec to use when No


writing to Parquet files. When
reading from Parquet files, Data
Factory automatically determines the
compression codec based on the file
metadata.
Supported types are "none", "gzip",
"snappy" (default), and "lzo". Note that
currently the Copy activity doesn't
support LZO.

NOTE
White space in column names is not supported for Parquet files.

Below is an example of Parquet dataset on Azure Blob Storage:


{
"name": "ParquetDataset",
"properties": {
"type": "Parquet",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder",
},
"compressionCodec": "snappy"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Parquet source and sink.
Parquet as source
The following properties are supported in the copy activity source section.

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy Yes


activity source must be set to
ParquetSource.

storeSettings A group of properties on how to No


read data from a data store. Each file-
based connector has its own
supported read settings under
storeSettings . See details in
connector article -> Copy activity
properties section.

Parquet as sink
The following properties are supported in the copy activity sink section.

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy Yes


activity source must be set to
ParquetSink.

storeSettings A group of properties on how to No


write data to a data store. Each file-
based connector has its own
supported write settings under
storeSettings . See details in
connector article -> Copy activity
properties section.
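
For illustration, below is a minimal sketch of a copy activity that reads and writes Parquet files. The dataset names are placeholders, and the storeSettings types shown here assume Azure Blob Storage on both sides; substitute the read/write settings of whichever file-based connector you actually use (see that connector's article):

"activities":[
    {
        "name": "CopyParquetFiles",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Parquet input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<Parquet output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "ParquetSource",
                "storeSettings": {
                    "type": "AzureBlobStorageReadSettings",
                    "recursive": true
                }
            },
            "sink": {
                "type": "ParquetSink",
                "storeSettings": {
                    "type": "AzureBlobStorageWriteSettings"
                }
            }
        }
    }
]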
Mapping Data Flow properties
Learn details from source transformation and sink transformation in Mapping Data Flow.

Data type support


Parquet complex data types are currently not supported (e.g. MAP, LIST, STRUCT).

Using Self-hosted Integration Runtime


IMPORTANT
For copy empowered by the Self-hosted Integration Runtime, e.g. between on-premises and cloud data stores, if you are not copying Parquet files as-is, you need to install the 64-bit JRE 8 (Java Runtime Environment) or OpenJDK on your IR machine. See the following paragraph for more details.

For copy running on the Self-hosted IR with Parquet file serialization/deserialization, ADF locates the Java
runtime by first checking the registry
(SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for JRE; if not found, it then
checks the system variable JAVA_HOME for OpenJDK.
To use JRE: The 64-bit IR requires a 64-bit JRE. You can find it from here.
To use OpenJDK: It's supported since IR version 3.13. Package the jvm.dll with all other required
assemblies of OpenJDK onto the Self-hosted IR machine, and set the system environment variable JAVA_HOME
accordingly.

TIP
If you copy data to/from Parquet format using the Self-hosted Integration Runtime and hit an error saying "An error
occurred when invoking java, message: java.lang.OutOfMemoryError:Java heap space", you can add an
environment variable _JAVA_OPTIONS on the machine that hosts the Self-hosted IR to adjust the min/max heap size
for the JVM, and then rerun the pipeline.

Example: set the variable _JAVA_OPTIONS with the value -Xms256m -Xmx16g. The flag Xms specifies the initial
memory allocation pool for a Java Virtual Machine (JVM), while Xmx specifies the maximum memory
allocation pool. This means that the JVM will be started with Xms amount of memory and will be able to use a
maximum of Xmx amount of memory. By default, ADF uses a minimum of 64 MB and a maximum of 1 GB.

Next steps
Copy activity overview
Mapping data flow
Lookup activity
GetMetadata activity
Copy data from PayPal using Azure Data Factory
(Preview)
1/3/2019 • 3 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from PayPal. It builds on the
copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from PayPal to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any driver to use this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
PayPal connector.

Linked service properties


The following properties are supported for PayPal linked service:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to: Yes


PayPal

host The URL of the PayPal instance. (that is, Yes


api.sandbox.paypal.com)

clientId The client ID associated with your Yes


PayPal application.

clientSecret The client secret associated with your Yes


PayPal application. Mark this field as a
SecureString to store it securely in Data
Factory, or reference a secret stored in
Azure Key Vault.

useEncryptedEndpoints Specifies whether the data source No


endpoints are encrypted using HTTPS.
The default value is true.

useHostVerification Specifies whether to require the host No


name in the server's certificate to
match the host name of the server
when connecting over SSL. The default
value is true.

usePeerVerification Specifies whether to verify the identity No


of the server when connecting over
SSL. The default value is true.

Example:

{
"name": "PayPalLinkedService",
"properties": {
"type": "PayPal",
"typeProperties": {
"host" : "api.sandbox.paypal.com",
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by PayPal dataset.
To copy data from PayPal, set the type property of the dataset to PayPalObject. The following properties are
supported:

PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset must Yes


be set to: PayPalObject

tableName Name of the table. No (if "query" in activity source is


specified)
Example

{
"name": "PayPalDataset",
"properties": {
"type": "PayPalObject",
"linkedServiceName": {
"referenceName": "<PayPal linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by PayPal source.
PayPal as source
To copy data from PayPal, set the source type in the copy activity to PayPalSource. The following properties are
supported in the copy activity source section:

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes


source must be set to: PayPalSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM
Payment_Experience"
.

Example:
"activities":[
{
"name": "CopyFromPayPal",
"type": "Copy",
"inputs": [
{
"referenceName": "<PayPal input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "PayPalSource",
"query": "SELECT * FROM Payment_Experience"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Phoenix using Azure Data Factory
1/3/2019 • 4 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Phoenix. It builds on the
copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from Phoenix to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any driver to use this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Phoenix connector.

Linked service properties


The following properties are supported for Phoenix linked service:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to: Yes


Phoenix

host The IP address or host name of the Yes


Phoenix server. (that is,
192.168.222.160)

port The TCP port that the Phoenix server No


uses to listen for client connections. The
default value is 8765. If you connect to
Azure HDInsight, specify the port as 443.

httpPath The partial URL corresponding to the No


Phoenix server. (that is,
/gateway/sandbox/phoenix/version).
Specify /hbasephoenix0 if using
an HDInsight cluster.

authenticationType The authentication mechanism used to Yes


connect to the Phoenix server.
Allowed values are: Anonymous,
UsernameAndPassword,
WindowsAzureHDInsightService

username The user name used to connect to the No


Phoenix server.

password The password corresponding to the No


user name. Mark this field as a
SecureString to store it securely in Data
Factory, or reference a secret stored in
Azure Key Vault.

enableSsl Specifies whether the connections to No


the server are encrypted using SSL. The
default value is false.

trustedCertPath The full path of the .pem file containing No


trusted CA certificates for verifying the
server when connecting over SSL. This
property can only be set when using
SSL on self-hosted IR. The default value
is the cacerts.pem file installed with the
IR.

useSystemTrustStore Specifies whether to use a CA certificate No


from the system trust store or from a
specified PEM file. The default value is
false.

allowHostNameCNMismatch Specifies whether to require a CA- No


issued SSL certificate name to match
the host name of the server when
connecting over SSL. The default value
is false.

allowSelfSignedServerCert Specifies whether to allow self-signed No


certificates from the server. The default
value is false.

connectVia The Integration Runtime to be used to No


connect to the data store. You can use
Self-hosted Integration Runtime or
Azure Integration Runtime (if your data
store is publicly accessible). If not
specified, it uses the default Azure
Integration Runtime.
NOTE
If your cluster doesn't support sticky sessions, e.g. HDInsight, explicitly add the node index at the end of the HTTP path setting, e.g.
specify /hbasephoenix0 instead of /hbasephoenix .

Example:

{
"name": "PhoenixLinkedService",
"properties": {
"type": "Phoenix",
"typeProperties": {
"host" : "<cluster>.azurehdinsight.net",
"port" : "443",
"httpPath" : "/hbasephoenix0",
"authenticationType" : "WindowsAzureHDInsightService",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Phoenix dataset.
To copy data from Phoenix, set the type property of the dataset to PhoenixObject. The following properties are
supported:

PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset must Yes


be set to: PhoenixObject

tableName Name of the table. No (if "query" in activity source is


specified)

Example

{
"name": "PhoenixDataset",
"properties": {
"type": "PhoenixObject",
"linkedServiceName": {
"referenceName": "<Phoenix linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Phoenix source.
Phoenix as source
To copy data from Phoenix, set the source type in the copy activity to PhoenixSource. The following properties
are supported in the copy activity source section:

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes


source must be set to: PhoenixSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .

Example:

"activities":[
{
"name": "CopyFromPhoenix",
"type": "Copy",
"inputs": [
{
"referenceName": "<Phoenix input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "PhoenixSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from PostgreSQL by using Azure Data
Factory
3/15/2019 • 4 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a PostgreSQL database.
It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from PostgreSQL database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this PostgreSQL connector supports PostgreSQL version 7.4 and above.

Prerequisites
If your PostgreSQL database is not publicly accessible, you need to set up a Self-hosted Integration Runtime. To
learn about self-hosted integration runtimes, see the Self-hosted Integration Runtime article. The Integration
Runtime provides a built-in PostgreSQL driver starting from version 3.7, so you don't need to manually
install any driver.
For Self-hosted IR versions lower than 3.7, you need to install the Npgsql data provider for PostgreSQL, with a
version between 2.0.12 and 3.1.9, on the Integration Runtime machine.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
PostgreSQL connector.

Linked service properties


The following properties are supported for PostgreSQL linked service:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to: Yes


PostgreSql

connectionString An ODBC connection string to connect Yes


to the PostgreSQL database.
Mark this field as a SecureString to
store it securely in Data Factory. You
can also put password in Azure Key
Vault and pull the password
configuration out of the connection
string. Refer to the following samples
and Store credentials in Azure Key Vault
article with more details.

connectVia The Integration Runtime to be used to No


connect to the data store. You can use
Self-hosted Integration Runtime or
Azure Integration Runtime (if your data
store is publicly accessible). If not
specified, it uses the default Azure
Integration Runtime.

A typical connection string is Server=<server>;Database=<database>;Port=<port>;UID=<username>;Password=<Password>. More properties that you can set per your case:

PROPERTY DESCRIPTION OPTIONS REQUIRED

EncryptionMethod (EM) The method the driver uses 0 (No Encryption) (Default) No
to encrypt data sent / 1 (SSL) / 6 (RequestSSL)
between the driver and the
database server. E.g.
EncryptionMethod=
<0/1/6>;

ValidateServerCertificate Determines whether the 0 (Disabled) (Default) / 1 No


(VSC) driver validates the (Enabled)
certificate that is sent by the
database server when SSL
encryption is enabled
(Encryption Method=1). E.g.
ValidateServerCertificate=
<0/1>;

Example:

{
"name": "PostgreSqlLinkedService",
"properties": {
"type": "PostgreSql",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Database=<database>;Port=<port>;UID=<username>;Password=<Password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
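Example: enable SSL in the connection string
The following sketch simply appends the driver options described above (EncryptionMethod and ValidateServerCertificate) to the connection string; all other values are placeholders.

{
    "name": "PostgreSqlLinkedService",
    "properties": {
        "type": "PostgreSql",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "Server=<server>;Database=<database>;Port=<port>;UID=<username>;Password=<Password>;EncryptionMethod=1;ValidateServerCertificate=1"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}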
Example: store password in Azure Key Vault

{
"name": "PostgreSqlLinkedService",
"properties": {
"type": "PostgreSql",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Database=<database>;Port=<port>;UID=<username>;"
},
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

If you were using a PostgreSQL linked service with the following payload, it is still supported as-is, but you are encouraged to use the new one going forward.
Previous payload:

{
"name": "PostgreSqlLinkedService",
"properties": {
"type": "PostgreSql",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by PostgreSQL dataset.
To copy data from PostgreSQL, set the type property of the dataset to RelationalTable. The following properties
are supported:
PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset must Yes


be set to: RelationalTable

tableName Name of the table in the PostgreSQL No (if "query" in activity source is
database. specified)

Example

{
"name": "PostgreSQLDataset",
"properties":
{
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<PostgreSQL linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by PostgreSQL source.
PostgreSQL as source
To copy data from PostgreSQL, set the source type in the copy activity to RelationalSource. The following
properties are supported in the copy activity source section:

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes


source must be set to:
RelationalSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"query": "SELECT * FROM
\"MySchema\".\"MyTable\""
.

NOTE
Schema and table names are case-sensitive. Enclose them in "" (double quotes) in the query.

Example:
"activities":[
{
"name": "CopyFromPostgreSQL",
"type": "Copy",
"inputs": [
{
"referenceName": "<PostgreSQL input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT * FROM \"MySchema\".\"MyTable\""
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Presto using Azure Data Factory
(Preview)
1/3/2019 • 4 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Presto. It builds on the
copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Presto to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any driver to use this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Presto connector.

Linked service properties


The following properties are supported for Presto linked service:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to: Yes


Presto

host The IP address or host name of the Yes


Presto server. (i.e. 192.168.222.160)

serverVersion The version of the Presto server. (i.e. Yes


0.148-t)

catalog The catalog context for all requests Yes


against the server.

port The TCP port that the Presto server No


uses to listen for client connections. The
default value is 8080.

authenticationType The authentication mechanism used to Yes


connect to the Presto server.
Allowed values are: Anonymous, LDAP

username The user name used to connect to the No


Presto server.

password The password corresponding to the No


user name. Mark this field as a
SecureString to store it securely in Data
Factory, or reference a secret stored in
Azure Key Vault.

enableSsl Specifies whether the connections to No


the server are encrypted using SSL. The
default value is false.

trustedCertPath The full path of the .pem file containing No


trusted CA certificates for verifying the
server when connecting over SSL. This
property can only be set when using
SSL on self-hosted IR. The default value
is the cacerts.pem file installed with the
IR.

useSystemTrustStore Specifies whether to use a CA certificate No


from the system trust store or from a
specified PEM file. The default value is
false.

allowHostNameCNMismatch Specifies whether to require a CA- No


issued SSL certificate name to match
the host name of the server when
connecting over SSL. The default value
is false.

allowSelfSignedServerCert Specifies whether to allow self-signed No


certificates from the server. The default
value is false.

timeZoneID The local time zone used by the No


connection. Valid values for this option
are specified in the IANA Time Zone
Database. The default value is the
system time zone.

Example:
{
"name": "PrestoLinkedService",
"properties": {
"type": "Presto",
"typeProperties": {
"host" : "<host>",
"serverVersion" : "0.148-t",
"catalog" : "<catalog>",
"port" : "<port>",
"authenticationType" : "LDAP",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"timeZoneID" : "Europe/Berlin"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Presto dataset.
To copy data from Presto, set the type property of the dataset to PrestoObject. The following properties are
supported:

PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset must Yes


be set to: PrestoObject

tableName Name of the table. No (if "query" in activity source is


specified)

Example

{
"name": "PrestoDataset",
"properties": {
"type": "PrestoObject",
"linkedServiceName": {
"referenceName": "<Presto linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Presto source.
Presto as source
To copy data from Presto, set the source type in the copy activity to PrestoSource. The following properties are
supported in the copy activity source section:
PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes


source must be set to: PrestoSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .

Example:

"activities":[
{
"name": "CopyFromPresto",
"type": "Copy",
"inputs": [
{
"referenceName": "<Presto input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "PrestoSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from QuickBooks Online using Azure Data
Factory (Preview)
3/14/2019 • 3 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from QuickBooks Online. It
builds on the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from QuickBooks Online to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any driver to use this connector.
Currently this connector only supports OAuth 1.0a, which means you need to have a developer account with apps created before July 17, 2017.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
QuickBooks connector.

Linked service properties


The following properties are supported for QuickBooks linked service:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to: Yes


QuickBooks

endpoint The endpoint of the QuickBooks Online Yes


server. (that is,
quickbooks.api.intuit.com)

companyId The company ID of the QuickBooks Yes


company to authorize. For info about
how to find the company ID, see How
do I find my Company ID?.

consumerKey The consumer key for OAuth 1.0 Yes


authentication.

consumerSecret The consumer secret for OAuth 1.0 Yes


authentication. Mark this field as a
SecureString to store it securely in Data
Factory, or reference a secret stored in
Azure Key Vault.

accessToken The access token for OAuth 1.0 Yes


authentication. Mark this field as a
SecureString to store it securely in Data
Factory, or reference a secret stored in
Azure Key Vault.

accessTokenSecret The access token secret for OAuth 1.0 Yes


authentication. Mark this field as a
SecureString to store it securely in Data
Factory, or reference a secret stored in
Azure Key Vault.

useEncryptedEndpoints Specifies whether the data source No


endpoints are encrypted using HTTPS.
The default value is true.

Example:
{
"name": "QuickBooksLinkedService",
"properties": {
"type": "QuickBooks",
"typeProperties": {
"endpoint" : "quickbooks.api.intuit.com",
"companyId" : "<companyId>",
"consumerKey": "<consumerKey>",
"consumerSecret": {
"type": "SecureString",
"value": "<consumerSecret>"
},
"accessToken": {
"type": "SecureString",
"value": "<accessToken>"
},
"accessTokenSecret": {
"type": "SecureString",
"value": "<accessTokenSecret>"
},
"useEncryptedEndpoints" : true
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by QuickBooks dataset.
To copy data from QuickBooks Online, set the type property of the dataset to QuickBooksObject. The following
properties are supported:

PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset must Yes


be set to: QuickBooksObject

tableName Name of the table. No (if "query" in activity source is


specified)

Example

{
"name": "QuickBooksDataset",
"properties": {
"type": "QuickBooksObject",
"linkedServiceName": {
"referenceName": "<QuickBooks linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by QuickBooks source.
QuickBooks as source
To copy data from QuickBooks Online, set the source type in the copy activity to QuickBooksSource. The
following properties are supported in the copy activity source section:

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes


source must be set to:
QuickBooksSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM "Bill" WHERE Id =
'123'"
.

Example:

"activities":[
{
"name": "CopyFromQuickBooks",
"type": "Copy",
"inputs": [
{
"referenceName": "<QuickBooks input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "QuickBooksSource",
"query": "SELECT * FROM \"Bill\" WHERE Id = '123' "
},
"sink": {
"type": "<sink type>"
}
}
}
]

Copy data from QuickBooks Desktop


The Copy Activity in Azure Data Factory cannot copy data directly from QuickBooks Desktop. To copy data from
QuickBooks Desktop, export your QuickBooks data to a comma-separated values (CSV) file and then upload the
file to Azure Blob Storage. From there, you can use Data Factory to copy the data to the sink of your choice.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from a REST endpoint by using Azure
Data Factory
4/1/2019 • 8 minutes to read

This article outlines how to use Copy Activity in Azure Data Factory to copy data from a REST endpoint. The
article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.
The differences between this REST connector, the HTTP connector, and the Web table connector are:
The REST connector specifically supports copying data from RESTful APIs.
The HTTP connector is generic to retrieve data from any HTTP endpoint, e.g. to download a file. Before this REST
connector became available, you might have used the HTTP connector to copy data from a RESTful API, which is
supported but less functional compared to the REST connector.
The Web table connector extracts table content from an HTML webpage.

Supported capabilities
You can copy data from a REST source to any supported sink data store. For a list of data stores that Copy Activity
supports as sources and sinks, see Supported data stores and formats.
Specifically, this generic REST connector supports:
Retrieving data from a REST endpoint by using the GET or POST methods.
Retrieving data by using one of the following authentications: Anonymous, Basic, AAD service principal,
and managed identities for Azure resources.
Pagination in the REST APIs.
Copying the REST JSON response as-is or parsing it by using schema mapping. Only a response payload in
JSON is supported.

TIP
To test a request for data retrieval before you configure the REST connector in Data Factory, learn about the API
specification for header and body requirements. You can use tools like Postman or a web browser to validate.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are specific
to the REST connector.
Linked service properties
The following properties are supported for the REST linked service:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to Yes


RestService.

url The base URL of the REST service. Yes

enableServerCertificateValidation Whether to validate server side SSL No


certificate when connecting to the (the default is true)
endpoint.

authenticationType Type of authentication used to connect Yes


to the REST service. Allowed values are
Anonymous, Basic,
AadServicePrincipal and
ManagedServiceIdentity. Refer to
corresponding sections below on more
properties and examples respectively.

connectVia The Integration Runtime to use to No


connect to the data store. You can use
the Azure Integration Runtime or a self-
hosted Integration Runtime (if your
data store is located in a private
network). If not specified, this property
uses the default Azure Integration
Runtime.

Use basic authentication


Set the authenticationType property to Basic. In addition to the generic properties that are described in the
preceding section, specify the following properties:

PROPERTY DESCRIPTION REQUIRED

userName The user name to use to access the Yes


REST endpoint.

password The password for the user (the Yes


userName value). Mark this field as a
SecureString type to store it securely
in Data Factory. You can also reference
a secret stored in Azure Key Vault.

Example
{
"name": "RESTLinkedService",
"properties": {
"type": "RestService",
"typeProperties": {
"authenticationType": "Basic",
"url" : "<REST endpoint>",
"userName": "<user name>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Use AAD service principal authentication


Set the authenticationType property to AadServicePrincipal. In addition to the generic properties that are
described in the preceding section, specify the following properties:

PROPERTY DESCRIPTION REQUIRED

servicePrincipalId Specify the Azure Active Directory Yes


application's client ID.

servicePrincipalKey Specify the Azure Active Directory Yes


application's key. Mark this field as a
SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.

tenant Specify the tenant information (domain Yes


name or tenant ID) under which your
application resides. Retrieve it by
hovering the mouse in the top-right
corner of the Azure portal.

aadResourceId Specify the AAD resource you are Yes


requesting for authorization, e.g.
https://management.core.windows.net
.

Example
{
"name": "RESTLinkedService",
"properties": {
"type": "RestService",
"typeProperties": {
"url": "<REST endpoint e.g. https://www.example.com/>",
"authenticationType": "AadServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"value": "<service principal key>",
"type": "SecureString"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"aadResourceId": "<AAD resource URL e.g. https://management.core.windows.net>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Use managed identities for Azure resources authentication


Set the authenticationType property to ManagedServiceIdentity. In addition to the generic properties that
are described in the preceding section, specify the following properties:

PROPERTY DESCRIPTION REQUIRED

aadResourceId Specify the AAD resource you are Yes


requesting for authorization, e.g.
https://management.core.windows.net
.

Example

{
"name": "RESTLinkedService",
"properties": {
"type": "RestService",
"typeProperties": {
"url": "<REST endpoint e.g. https://www.example.com/>",
"authenticationType": "ManagedServiceIdentity",
"aadResourceId": "<AAD resource URL e.g. https://management.core.windows.net>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
This section provides a list of properties that the REST dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
To copy data from REST, the following properties are supported:
PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset must Yes


be set to RestResource.

relativeUrl A relative URL to the resource that No


contains the data. When this property
isn't specified, only the URL that's
specified in the linked service definition
is used.

requestMethod The HTTP method. Allowed values are No


Get (default) and Post.

additionalHeaders Additional HTTP request headers. No

requestBody The body for the HTTP request. No

paginationRules The pagination rules to compose next No


page requests. Refer to pagination
support section on details.

Example 1: Using the Get method with pagination

{
"name": "RESTDataset",
"properties": {
"type": "RestResource",
"linkedServiceName": {
"referenceName": "<REST linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"relativeUrl": "<relative url>",
"additionalHeaders": {
"x-user-defined": "helloworld"
},
"paginationRules": {
"AbsoluteUrl": "$.paging.next"
}
}
}
}

Example 2: Using the Post method


{
"name": "RESTDataset",
"properties": {
"type": "RestResource",
"linkedServiceName": {
"referenceName": "<REST linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"relativeUrl": "<relative url>",
"requestMethod": "Post",
"requestBody": "<body for POST REST request>"
}
}
}

Copy Activity properties


This section provides a list of properties that the REST source supports.
For a full list of sections and properties that are available for defining activities, see Pipelines.
REST as source
The following properties are supported in the copy activity source section:

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes


source must be set to RestSource.

httpRequestTimeout The timeout (the TimeSpan value) for No


the HTTP request to get a response.
This value is the timeout to get a
response, not the timeout to read
response data. The default value is
00:01:40.

requestInterval The time to wait before sending the No


request for next page. The default value
is 00:00:01

Example
"activities":[
{
"name": "CopyFromREST",
"type": "Copy",
"inputs": [
{
"referenceName": "<REST input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RestSource",
"httpRequestTimeout": "00:01:00"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Pagination support
Normally, a REST API limits the response payload size of a single request to a reasonable size; to return a
large amount of data, it splits the result into multiple pages and requires callers to send consecutive requests to
get the next page of the result. Usually, the request for one page is dynamic and composed from the information
returned in the response of the previous page.
This generic REST connector supports the following pagination patterns:
Next request’s absolute or relative URL = property value in current response body
Next request’s absolute or relative URL = header value in current response headers
Next request’s query parameter = property value in current response body
Next request’s query parameter = header value in current response headers
Next request’s header = property value in current response body
Next request’s header = header value in current response headers
Pagination rules are defined as a dictionary in the dataset, which contains one or more case-sensitive key-value pairs.
The configuration is used to generate the requests starting from the second page. The connector stops
iterating when it gets HTTP status code 204 (No Content), or when any of the JSONPath expressions in
"paginationRules" returns null.
Supported keys in pagination rules:

KEY DESCRIPTION

AbsoluteUrl Indicates the URL to issue the next request. It can be either
absolute URL or relative URL.

QueryParameters.request_query_parameter OR "request_query_parameter" is user-defined which references


QueryParameters['request_query_parameter'] one query parameter name in the next HTTP request URL.

Headers.request_header OR Headers['request_header'] "request_header" is user-defined which references one header


name in the next HTTP request.

Supported values in pagination rules:

VALUE DESCRIPTION

Headers.response_header OR Headers['response_header'] "response_header" is user-defined which references one


header name in the current HTTP response, the value of
which will be used to issue next request.

A JSONPath expression starting with "$" (representing the root of the response body): The response body should contain only one JSON object. The JSONPath expression should return a single primitive value, which will be used to issue the next request.

Example:
The Facebook Graph API returns a response in the following structure, in which case the next page's URL is represented in
paging.next:

{
"data": [
{
"created_time": "2017-12-12T14:12:20+0000",
"name": "album1",
"id": "1809938745705498_1809939942372045"
},
{
"created_time": "2017-12-12T14:14:03+0000",
"name": "album2",
"id": "1809938745705498_1809941802371859"
},
{
"created_time": "2017-12-12T14:14:11+0000",
"name": "album3",
"id": "1809938745705498_1809941879038518"
}
],
"paging": {
"cursors": {
"after": "MTAxNTExOTQ1MjAwNzI5NDE=",
"before": "NDMyNzQyODI3OTQw"
},
"previous": "https://graph.facebook.com/me/albums?limit=25&before=NDMyNzQyODI3OTQw",
"next": "https://graph.facebook.com/me/albums?limit=25&after=MTAxNTExOTQ1MjAwNzI5NDE="
}
}

The corresponding REST dataset configuration, especially the paginationRules, is as follows:


{
"name": "MyFacebookAlbums",
"properties": {
"type": "RestResource",
"typeProperties": {
"relativeUrl": "albums",
"paginationRules": {
"AbsoluteUrl": "$.paging.next"
}
},
"linkedServiceName": {
"referenceName": "MyRestService",
"type": "LinkedServiceReference"
}
}
}
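
As a second, hypothetical illustration, suppose an API returns a continuation token in its response body (say in $.metadata.nextToken) that must be sent as the query parameter token on the next request; the parameter name and JSONPath here are made up for the sketch. The pagination rule follows the QueryParameters pattern from the table above:

{
    "name": "RESTDatasetWithTokenPaging",
    "properties": {
        "type": "RestResource",
        "typeProperties": {
            "relativeUrl": "<relative url>",
            "paginationRules": {
                "QueryParameters.token": "$.metadata.nextToken"
            }
        },
        "linkedServiceName": {
            "referenceName": "<REST linked service name>",
            "type": "LinkedServiceReference"
        }
    }
}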

Export JSON response as-is


You can use this REST connector to export a REST API JSON response as-is to various file-based stores. To achieve
such a schema-agnostic copy, skip the "structure" (also called schema) section in the dataset and the schema mapping in
the copy activity.
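For example, a schema-agnostic copy might look like the following sketch. The dataset and sink names are placeholders for any supported file-based sink; note that neither the dataset nor the copy activity defines a schema ("structure") or a mapping:

"activities":[
    {
        "name": "ExportRESTResponseAsIs",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<REST input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<file-based output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "RestSource"
            },
            "sink": {
                "type": "<file-based sink type>"
            }
        }
    }
]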

Schema mapping
To copy data from a REST endpoint to a tabular sink, refer to schema mapping.

Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from and to Salesforce by using Azure
Data Factory
4/19/2019 • 9 minutes to read

This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Salesforce. It builds
on the Copy Activity overview article that presents a general overview of the copy activity.

Supported capabilities
You can copy data from Salesforce to any supported sink data store. You also can copy data from any supported
source data store to Salesforce. For a list of data stores that are supported as sources or sinks by the Copy activity,
see the Supported data stores table.
Specifically, this Salesforce connector supports:
Salesforce Developer, Professional, Enterprise, or Unlimited editions.
Copying data from and to Salesforce production, sandbox, and custom domain.
The Salesforce connector is built on top of the Salesforce REST/Bulk API, using v45 to copy data from Salesforce and v40 to copy data to Salesforce.

Prerequisites
API permission must be enabled in Salesforce. For more information, see Enable API access in Salesforce by
permission set

Salesforce request limits


Salesforce has limits for both total API requests and concurrent API requests. Note the following points:
If the number of concurrent requests exceeds the limit, throttling occurs and you see random failures.
If the total number of requests exceeds the limit, the Salesforce account is blocked for 24 hours.
You might also receive the "REQUEST_LIMIT_EXCEEDED" error message in both scenarios. For more information,
see the "API request limits" section in Salesforce developer limits.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-step
instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to the
Salesforce connector.

Linked service properties


The following properties are supported for the Salesforce linked service.

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to Yes


Salesforce.

environmentUrl Specify the URL of the Salesforce No


instance.
- Default is
"https://login.salesforce.com" .
- To copy data from sandbox, specify
"https://test.salesforce.com" .
- To copy data from custom domain,
specify, for example,
"https://[domain].my.salesforce.com"
.

username Specify a user name for the user Yes


account.

password Specify a password for the user account. Yes

Mark this field as a SecureString to store


it securely in Data Factory, or reference
a secret stored in Azure Key Vault.

securityToken Specify a security token for the user Yes


account. For instructions on how to
reset and get a security token, see Get a
security token. To learn about security
tokens in general, see Security and the
API.

Mark this field as a SecureString to store


it securely in Data Factory, or reference
a secret stored in Azure Key Vault.

connectVia The integration runtime to be used to No for source, Yes for sink if the source
connect to the data store. If not linked service doesn't have integration
specified, it uses the default Azure runtime
Integration Runtime.

IMPORTANT
When you copy data into Salesforce, the default Azure Integration Runtime can't be used to execute copy. In other words, if
your source linked service doesn't have a specified integration runtime, explicitly create an Azure Integration Runtime with a
location near your Salesforce instance. Associate the Salesforce linked service as in the following example.

Example: Store credentials in Data Factory


{
"name": "SalesforceLinkedService",
"properties": {
"type": "Salesforce",
"typeProperties": {
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"securityToken": {
"type": "SecureString",
"value": "<security token>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: Store credentials in Key Vault

{
"name": "SalesforceLinkedService",
"properties": {
"type": "Salesforce",
"typeProperties": {
"username": "<username>",
"password": {
"type": "AzureKeyVaultSecret",
"secretName": "<secret name of password in AKV>",
"store":{
"referenceName": "<Azure Key Vault linked service>",
"type": "LinkedServiceReference"
}
},
"securityToken": {
"type": "AzureKeyVaultSecret",
"secretName": "<secret name of security token in AKV>",
"store":{
"referenceName": "<Azure Key Vault linked service>",
"type": "LinkedServiceReference"
}
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
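
If you copy from a sandbox or a custom domain, add environmentUrl to the type properties. The following sketch assumes a sandbox and reuses placeholder credentials:

{
    "name": "SalesforceSandboxLinkedService",
    "properties": {
        "type": "Salesforce",
        "typeProperties": {
            "environmentUrl": "https://test.salesforce.com",
            "username": "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            },
            "securityToken": {
                "type": "SecureString",
                "value": "<security token>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}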

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Salesforce dataset.
To copy data from and to Salesforce, set the type property of the dataset to SalesforceObject. The following
properties are supported.
PROPERTY DESCRIPTION REQUIRED

type The type property must be set to Yes


SalesforceObject.

objectApiName The Salesforce object name to retrieve No for source, Yes for sink
data from.

IMPORTANT
The "__c" part of API Name is needed for any custom object.

Example:

{
"name": "SalesforceDataset",
"properties": {
"type": "SalesforceObject",
"linkedServiceName": {
"referenceName": "<Salesforce linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"objectApiName": "MyTable__c"
}
}
}

NOTE
For backward compatibility: When you copy data from Salesforce, if you use the previous "RelationalTable" type dataset, it
keeps working while you see a suggestion to switch to the new "SalesforceObject" type.

PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset must Yes


be set to RelationalTable.

tableName Name of the table in Salesforce. No (if "query" in the activity source is
specified)

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Salesforce source and sink.
Salesforce as a source type
To copy data from Salesforce, set the source type in the copy activity to SalesforceSource. The following
properties are supported in the copy activity source section.
PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes


source must be set to
SalesforceSource.

query Use the custom query to read data. You No (if "objectApiName" in the dataset is
can use Salesforce Object Query specified)
Language (SOQL) query or SQL-92
query. See more tips in query tips
section. If query is not specified, all the
data of the Salesforce object specified in
"objectApiName" in dataset will be
retrieved.

readBehavior Indicates whether to query the existing No


records, or query all records including
the deleted ones. If not specified, the
default behavior is the former.
Allowed values: query (default),
queryAll.

IMPORTANT
The "__c" part of API Name is needed for any custom object.

Example:
"activities":[
{
"name": "CopyFromSalesforce",
"type": "Copy",
"inputs": [
{
"referenceName": "<Salesforce input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SalesforceSource",
"query": "SELECT Col_Currency__c, Col_Date__c, Col_Email__c FROM AllDataType__c"
},
"sink": {
"type": "<sink type>"
}
}
}
]

NOTE
For backward compatibility: When you copy data from Salesforce, if you use the previous "RelationalSource" type copy, the
source keeps working while you see a suggestion to switch to the new "SalesforceSource" type.

Salesforce as a sink type


To copy data to Salesforce, set the sink type in the copy activity to SalesforceSink. The following properties are
supported in the copy activity sink section.

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes


sink must be set to SalesforceSink.

writeBehavior The write behavior for the operation. No (default is Insert)


Allowed values are Insert and Upsert.

externalIdFieldName The name of the external ID field for the Yes for "Upsert"
upsert operation. The specified field
must be defined as "External Id Field" in
the Salesforce object. It can't have NULL
values in the corresponding input data.

writeBatchSize The row count of data written to No (default is 5,000)


Salesforce in each batch.

ignoreNullValues Indicates whether to ignore NULL No (default is false)


values from input data during a write
operation.
Allowed values are true and false.
- True: Leave the data in the destination
object unchanged when you do an
upsert or update operation. Insert a
defined default value when you do an
insert operation.
- False: Update the data in the
destination object to NULL when you
do an upsert or update operation.
Insert a NULL value when you do an
insert operation.

Example: Salesforce sink in a copy activity

"activities":[
{
"name": "CopyToSalesforce",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Salesforce output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SalesforceSink",
"writeBehavior": "Upsert",
"externalIdFieldName": "CustomerId__c",
"writeBatchSize": 10000,
"ignoreNullValues": true
}
}
}
]

Query tips
Retrieve data from a Salesforce report
You can retrieve data from Salesforce reports by specifying a query as {call "<report name>"} . An example is
"query": "{call \"TestReport\"}" .

Retrieve deleted records from the Salesforce Recycle Bin


To query the soft deleted records from the Salesforce Recycle Bin, you can specify readBehavior as queryAll .
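For example, a copy activity source that also returns soft-deleted records might look like this sketch (the query itself is only illustrative):

"source": {
    "type": "SalesforceSource",
    "query": "SELECT Id, Name FROM Account WHERE IsDeleted = True",
    "readBehavior": "queryAll"
}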
Difference between SOQL and SQL query syntax
When copying data from Salesforce, you can use either a SOQL query or a SQL query. Note that these two have
different syntax and functionality support; do not mix them. You are encouraged to use the SOQL query, which is natively
supported by Salesforce. The following table lists the main differences:

SYNTAX SOQL MODE SQL MODE

Column selection Need to enumerate the fields to be SELECT * is supported in addition to


copied in the query, e.g. column selection.
SELECT field1, filed2 FROM
objectname

Quotation marks Filed/object names cannot be quoted. Field/object names can be quoted, e.g.
SELECT "id" FROM "Account"

Datetime format Refer to details here and samples in next Refer to details here and samples in next
section. section.

Boolean values Represented as False and True , e.g. Represented as 0 or 1, e.g.


SELECT … WHERE IsDeleted=True . SELECT … WHERE IsDeleted=1 .

Column renaming Not supported. Supported, e.g.:


SELECT a AS b FROM … .

Relationship Supported, e.g. Not supported.


Account_vod__r.nvs_Country__c .

Retrieve data by using a where clause on the DateTime column


When you specify the SOQL or SQL query, pay attention to the DateTime format difference. For example:
SOQL sample:
SELECT Id, Name, BillingCity FROM Account WHERE LastModifiedDate >=
@{formatDateTime(pipeline().parameters.StartTime,'yyyy-MM-ddTHH:mm:ssZ')} AND LastModifiedDate <
@{formatDateTime(pipeline().parameters.EndTime,'yyyy-MM-ddTHH:mm:ssZ')}
SQL sample:
SELECT * FROM Account WHERE LastModifiedDate >= {ts'@{formatDateTime(pipeline().parameters.StartTime,'yyyy-
MM-dd HH:mm:ss')}'} AND LastModifiedDate < {ts'@{formatDateTime(pipeline().parameters.EndTime,'yyyy-MM-dd
HH:mm:ss')}'}
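Both samples assume that the pipeline defines StartTime and EndTime parameters, which are referenced through pipeline().parameters above. A minimal sketch of that parameters section (the names simply match the expressions shown) might look like:

"parameters": {
    "StartTime": {
        "type": "String"
    },
    "EndTime": {
        "type": "String"
    }
}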

Error of MALFORMED_QUERY: Truncated
If you hit the error "MALFORMED_QUERY: Truncated", it is usually because your data includes a column of the JunctionIdList type, and Salesforce has a limitation on supporting such data with a large number of rows. To mitigate, try to exclude the JunctionIdList column or limit the number of rows to copy (you can partition the work into multiple copy activity runs).

Data type mapping for Salesforce


When you copy data from Salesforce, the following mappings are used from Salesforce data types to Data Factory
interim data types. To learn about how the copy activity maps the source schema and data type to the sink, see
Schema and data type mappings.

SALESFORCE DATA TYPE DATA FACTORY INTERIM DATA TYPE

Auto Number String

Checkbox Boolean

Currency Decimal

Date DateTime

Date/Time DateTime

Email String

Id String

Lookup Relationship String

Multi-Select Picklist String

Number Decimal

Percent Decimal

Phone String

Picklist String

Text String

Text Area String

Text Area (Long) String

Text Area (Rich) String

Text (Encrypted) String

URL String

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Salesforce Marketing Cloud using
Azure Data Factory (Preview)
1/16/2019 • 3 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Salesforce Marketing
Cloud. It builds on the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Salesforce Marketing Cloud to any supported sink data store. For a list of data stores that
are supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install any driver to use this connector.

NOTE
This connector doesn't support retrieving custom objects or custom data extensions.

Getting started
You can create a pipeline with copy activity using .NET SDK, Python SDK, Azure PowerShell, REST API, or Azure
Resource Manager template. See Copy activity tutorial for step-by-step instructions to create a pipeline with a
copy activity.
The following sections provide details about properties that are used to define Data Factory entities specific to
Salesforce Marketing Cloud connector.

Linked service properties


The following properties are supported for Salesforce Marketing Cloud linked service:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to: Yes


SalesforceMarketingCloud

clientId The client ID associated with the Yes


Salesforce Marketing Cloud application.

clientSecret The client secret associated with the Yes


Salesforce Marketing Cloud application.
You can choose to mark this field as a SecureString to store it securely in ADF, or store the secret in Azure Key Vault and let the ADF copy activity pull it from there when performing the data copy - learn more from Store credentials in Key Vault.

useEncryptedEndpoints Specifies whether the data source No


endpoints are encrypted using HTTPS.
The default value is true.

useHostVerification Specifies whether to require the host No


name in the server's certificate to
match the host name of the server
when connecting over SSL. The default
value is true.

usePeerVerification Specifies whether to verify the identity No


of the server when connecting over
SSL. The default value is true.

Example:

{
"name": "SalesforceMarketingCloudLinkedService",
"properties": {
"type": "SalesforceMarketingCloud",
"typeProperties": {
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"useEncryptedEndpoints" : true,
"useHostVerification" : true,
"usePeerVerification" : true
}
}
}
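If you prefer to keep the client secret in Azure Key Vault rather than inline, a hedged sketch of the same linked service using the standard AzureKeyVaultSecret reference might look like the following (the Key Vault linked service name and secret name are placeholders):

{
    "name": "SalesforceMarketingCloudLinkedService",
    "properties": {
        "type": "SalesforceMarketingCloud",
        "typeProperties": {
            "clientId": "<clientId>",
            "clientSecret": {
                "type": "AzureKeyVaultSecret",
                "secretName": "<secret name of client secret in AKV>",
                "store": {
                    "referenceName": "<Azure Key Vault linked service>",
                    "type": "LinkedServiceReference"
                }
            },
            "useEncryptedEndpoints": true,
            "useHostVerification": true,
            "usePeerVerification": true
        }
    }
}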

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Salesforce Marketing Cloud dataset.
To copy data from Salesforce Marketing Cloud, set the type property of the dataset to
SalesforceMarketingCloudObject. The following properties are supported:

PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset must Yes


be set to:
SalesforceMarketingCloudObject

tableName Name of the table. No (if "query" in activity source is


specified)

Example

{
"name": "SalesforceMarketingCloudDataset",
"properties": {
"type": "SalesforceMarketingCloudObject",
"linkedServiceName": {
"referenceName": "<SalesforceMarketingCloud linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
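If you want to bind the dataset to a specific table instead of supplying a query in the copy activity, a sketch that sets the tableName property described above might look like this (the table name is a hypothetical example):

{
    "name": "SalesforceMarketingCloudDataset",
    "properties": {
        "type": "SalesforceMarketingCloudObject",
        "linkedServiceName": {
            "referenceName": "<SalesforceMarketingCloud linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "tableName": "MyTable"
        }
    }
}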

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Salesforce Marketing Cloud source.
Salesforce Marketing Cloud as source
To copy data from Salesforce Marketing Cloud, set the source type in the copy activity to
SalesforceMarketingCloudSource. The following properties are supported in the copy activity source section:

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes


source must be set to:
SalesforceMarketingCloudSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .

Example:
"activities":[
{
"name": "CopyFromSalesforceMarketingCloud",
"type": "Copy",
"inputs": [
{
"referenceName": "<SalesforceMarketingCloud input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SalesforceMarketingCloudSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP Business Warehouse via Open
Hub using Azure Data Factory
5/28/2019 • 7 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from an SAP Business
Warehouse (BW ) via Open Hub. It builds on the copy activity overview article that presents a general overview of
copy activity.

Supported capabilities
You can copy data from SAP Business Warehouse via Open Hub to any supported sink data store. For a list of
data stores that are supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP Business Warehouse Open Hub connector supports:
SAP Business Warehouse version 7.01 or higher (in a recent SAP Support Package Stack released after
the year 2015).
Copying data via Open Hub Destination local table which underneath can be DSO, InfoCube, MultiProvider,
DataSource, etc.
Copying data using basic authentication.
Connecting to Application Server.

SAP BW Open Hub Integration


SAP BW Open Hub Service is an efficient way to extract data from SAP BW. The following diagram shows one of the typical flows customers have in their SAP system, in which data flows from SAP ECC -> PSA -> DSO -> Cube.
SAP BW Open Hub Destination (OHD) defines the target to which the SAP data is relayed. Any object supported by the SAP Data Transfer Process (DTP) can be used as an open hub data source, for example, DSO, InfoCube, DataSource, etc. The Open Hub Destination type - where the relayed data is stored - can be database tables (local or remote) or flat files. This SAP BW Open Hub connector supports copying data from the OHD local table in BW. If you are using other types, you can connect directly to the database or file system by using other connectors.
Delta extraction flow
The ADF SAP BW Open Hub connector offers two optional properties, excludeLastRequest and baseRequestId, which can be used to handle delta loads from Open Hub.
excludeLastRequest: Whether to exclude the records of the last request. The default value is true.
baseRequestId: The ID of the request for delta loading. Once it is set, only data with a request ID larger than the value of this property will be retrieved.
Overall, the extraction from SAP InfoProviders to Azure Data Factory (ADF ) consists of 2 steps:
1. SAP BW Data Transfer Process (DTP ) This step copies the data from an SAP BW InfoProvider to an
SAP BW Open Hub table
2. ADF data copy In this step, the Open Hub table is read by the ADF Connector

In the first step, a DTP is executed. Each execution creates a new SAP request ID. The request ID is stored in the
Open Hub table and is then used by the ADF connector to identify the delta. The two steps run asynchronously:
the DTP is triggered by SAP, and the ADF data copy is triggered through ADF.
By default, ADF does not read the latest delta from the Open Hub table (the "exclude last request" option is true). As a result, the data in ADF is not 100% up to date with the data in the Open Hub table (the last delta is missing). In return, this procedure ensures that no rows are lost because of the asynchronous extraction. It works fine even when ADF is reading the Open Hub table while the DTP is still writing into the same table.
You typically store the max request ID copied in the last ADF run in a staging data store (such as Azure Blob in the above diagram). Therefore, the same request is not read a second time by ADF in the subsequent run. Meanwhile, note that the data is not automatically deleted from the Open Hub table.
For proper delta handling, request IDs from different DTPs are not allowed in the same Open Hub table. Therefore, you must not create more than one DTP for each Open Hub Destination (OHD). If you need both full and delta extraction from the same InfoProvider, create two OHDs for the same InfoProvider.

Prerequisites
To use this SAP Business Warehouse Open Hub connector, you need to:
Set up a Self-hosted Integration Runtime with version 3.13 or above. See Self-hosted Integration Runtime
article for details.
Download the 64-bit SAP .NET Connector 3.0 from SAP's website, and install it on the Self-hosted IR
machine. When installing, in the optional setup steps window, make sure you select the Install
Assemblies to GAC option as shown in the following image.

The SAP user used in the Data Factory BW connector needs to have the following permissions:
Authorization for RFC and SAP BW.
Permissions to the “Execute” Activity of Authorization Object “S_SDSAUTH”.
Create the SAP Open Hub Destination with type Database Table and the "Technical Key" option checked. It is also recommended to leave Deleting Data from Table unchecked, although this is not required. Use the DTP (execute it directly or integrate it into an existing process chain) to land data from the source object you have chosen (such as a cube) in the open hub destination table.

Getting started
TIP
For a walkthrough of using SAP BW Open Hub connector, see Load data from SAP Business Warehouse (BW) by using
Azure Data Factory.

You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP Business Warehouse Open Hub connector.

Linked service properties


The following properties are supported for SAP Business Warehouse Open Hub linked service:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to: Yes


SapOpenHub

server Name of the server on which the SAP Yes


BW instance resides.

systemNumber System number of the SAP BW system. Yes


Allowed value: two-digit decimal
number represented as a string.

clientId Client ID of the client in the SAP BW Yes


system.
Allowed value: three-digit decimal
number represented as a string.

language Language that the SAP system uses. No (default value is EN)

userName Name of the user who has access to Yes


the SAP server.

password Password for the user. Mark this field as Yes


a SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.

connectVia The Integration Runtime to be used to Yes


connect to the data store. A Self-
hosted Integration Runtime is required
as mentioned in Prerequisites.

Example:
{
"name": "SapBwOpenHubLinkedService",
"properties": {
"type": "SapOpenHub",
"typeProperties": {
"server": "<server name>",
"systemNumber": "<system number>",
"clientId": "<client id>",
"userName": "<SAP user>",
"password": {
"type": "SecureString",
"value": "<Password for SAP user>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the SAP BW Open Hub dataset.
To copy data from and to SAP BW Open Hub, set the type property of the dataset to SapOpenHubTable. The
following properties are supported.

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to Yes


SapOpenHubTable.

openHubDestinationName The name of the Open Hub Destination Yes


to copy data from.

excludeLastRequest Whether to exclude the records of the No (default is true)


last request.

baseRequestId The ID of request for delta loading. No


Once it is set, only data with requestId
larger than the value of this property
will be retrieved.

TIP
If your Open Hub table only contains data generated by a single request ID (for example, you always do a full load and overwrite the existing data in the table, or you ran the DTP only once for a test), remember to uncheck the "excludeLastRequest" option (that is, set it to false) in order to copy the data out.

Example:
{
"name": "SAPBWOpenHubDataset",
"properties": {
"type": "SapOpenHubTable",
"linkedServiceName": {
"referenceName": "<SAP BW Open Hub linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"openHubDestinationName": "<open hub destination name>"
}
}
}
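For a delta load, a hedged sketch of the same dataset with the optional delta properties described above might look like the following; the baseRequestId value is a hypothetical example that you would typically replace with the max request ID recorded from your previous run:

{
    "name": "SAPBWOpenHubDeltaDataset",
    "properties": {
        "type": "SapOpenHubTable",
        "linkedServiceName": {
            "referenceName": "<SAP BW Open Hub linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "openHubDestinationName": "<open hub destination name>",
            "excludeLastRequest": true,
            "baseRequestId": 1000
        }
    }
}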

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by SAP BW Open Hub source.
SAP BW Open Hub as source
To copy data from SAP BW Open Hub, set the source type in the copy activity to SapOpenHubSource. There
are no additional type-specific properties needed in the copy activity source section.
Example:

"activities":[
{
"name": "CopyFromSAPBWOpenHub",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP BW Open Hub input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapOpenHubSource"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Data type mapping for SAP BW Open Hub


When copying data from SAP BW Open Hub, the following mappings are used from SAP BW data types to
Azure Data Factory interim data types. See Schema and data type mappings to learn about how copy activity
maps the source schema and data type to the sink.
SAP ABAP TYPE DATA FACTORY INTERIM DATA TYPE

C (String) String

I (integer) Int32

F (Float) Double

D (Date) String

T (Time) String

P (BCD Packed, Currency, Decimal, Qty) Decimal

N (Numc) String

X (Binary and Raw) String

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP Business Warehouse by using
Azure Data Factory
5/22/2019 • 10 minutes to read • Edit Online

This article shows how to use Azure Data Factory to copy data from SAP Business Warehouse (BW ) via Open Hub
to Azure Data Lake Storage Gen2. You can use a similar process to copy data to other supported sink data stores.

TIP
For general information about copying data from SAP BW, including SAP BW Open Hub integration and delta extraction flow,
see Copy data from SAP Business Warehouse via Open Hub by using Azure Data Factory.

Prerequisites
Azure Data Factory: If you don't have one, follow the steps to create a data factory.
SAP BW Open Hub Destination (OHD ) with destination type "Database Table": To create an OHD or
to check that your OHD is configured correctly for Data Factory integration, see the SAP BW Open Hub
Destination configurations section of this article.
The SAP BW user needs the following permissions:
Authorization for Remote Function Calls (RFC ) and SAP BW.
Permissions to the “Execute” activity of the S_SDSAUTH authorization object.
A self-hosted integration runtime (IR) with SAP .NET connector 3.0. Follow these setup steps:
1. Install and register the self-hosted integration runtime, version 3.13 or later. (This process is described
later in this article.)
2. Download the 64-bit SAP Connector for Microsoft .NET 3.0 from SAP's website, and install it on the
same computer as the self-hosted IR. During installation, make sure that you select Install
Assemblies to GAC in the Optional setup steps dialog box, as the following image shows:
Do a full copy from SAP BW Open Hub
In the Azure portal, go to your data factory. Select Author & Monitor to open the Data Factory UI in a separate
tab.
1. On the Let's get started page, select Copy Data to open the Copy Data tool.
2. On the Properties page, specify a Task name, and then select Next.
3. On the Source data store page, select +Create new connection. Select SAP BW Open Hub from the
connector gallery, and then select Continue. To filter the connectors, you can type SAP in the search box.
4. On the Specify SAP BW Open Hub connection page, follow these steps to create a new connection.

a. From the Connect via integration runtime list, select an existing self-hosted IR. Or, choose to
create one if you don't have one yet.
To create a new self-hosted IR, select +New, and then select Self-hosted. Enter a Name, and then
select Next. Select Express setup to install on the current computer, or follow the Manual setup
steps that are provided.
As mentioned in Prerequisites, make sure that you have SAP Connector for Microsoft .NET 3.0
installed on the same computer where the self-hosted IR is running.
b. Fill in the SAP BW Server name, System number, Client ID, Language (if other than EN ), User
name, and Password.
c. Select Test connection to validate the settings, and then select Finish.
d. A new connection is created. Select Next.
5. On the Select Open Hub Destinations page, browse the Open Hub Destinations that are available in your
SAP BW. Select the OHD to copy data from, and then select Next.

6. Specify a filter, if you need one. If your OHD only contains data from a single data-transfer process (DTP )
execution with a single request ID, or you're sure that your DTP is finished and you want to copy the data,
clear the Exclude Last Request check box.
Learn more about these settings in the SAP BW Open Hub Destination configurations section of this article.
Select Validate to double-check what data will be returned. Then select Next.
7. On the Destination data store page, select +Create new connection > Azure Data Lake Storage Gen2
> Continue.
8. On the Specify Azure Data Lake Storage connection page, follow these steps to create a connection.
a. Select your Data Lake Storage Gen2-capable account from the Name drop-down list.
b. Select Finish to create the connection. Then select Next.
9. On the Choose the output file or folder page, enter copyfromopenhub as the output folder name. Then
select Next.
10. On the File format setting page, select Next to use the default settings.

11. On the Settings page, expand Performance settings. Enter a value for Degree of copy parallelism such
as 5 to load from SAP BW in parallel. Then select Next.
12. On the Summary page, review the settings. Then select Next.
13. On the Deployment page, select Monitor to monitor the pipeline.

14. Notice that the Monitor tab on the left side of the page is automatically selected. The Actions column
includes links to view activity-run details and to rerun the pipeline.
15. To view activity runs that are associated with the pipeline run, select View Activity Runs in the Actions
column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To switch back to
the pipeline-runs view, select the Pipelines link at the top. Select Refresh to refresh the list.

16. To monitor the execution details for each copy activity, select the Details link, which is an eyeglasses icon
below Actions in the activity-monitoring view. Available details include the data volume copied from the
source to the sink, data throughput, execution steps and duration, and configurations used.

17. To view the maximum Request ID, go back to the activity-monitoring view and select Output under
Actions.

Do an incremental copy from SAP BW Open Hub


TIP
See SAP BW Open Hub connector delta extraction flow to learn how the SAP BW Open Hub connector in Data Factory copies
incremental data from SAP BW. This article can also help you understand basic connector configuration.
Now, let's continue to configure incremental copy from SAP BW Open Hub.
Incremental copy uses a "high-watermark" mechanism that's based on the request ID. That ID is automatically
generated in SAP BW Open Hub Destination by the DTP. The following diagram shows this workflow:

On the data factory Let's get started page, select Create pipeline from template to use the built-in template.
1. Search for SAP BW to find and select the Incremental copy from SAP BW to Azure Data Lake Storage
Gen2 template. This template copies data into Azure Data Lake Storage Gen2. You can use a similar
workflow to copy to other sink types.
2. On the template's main page, select or create the following three connections, and then select Use this
template in the lower-right corner of the window.
Azure Blob storage: In this walkthrough, we use Azure Blob storage to store the high watermark, which
is the max copied request ID.
SAP BW Open Hub: This is the source to copy data from. Refer to the previous full-copy walkthrough
for detailed configuration.
Azure Data Lake Storage Gen2: This is the sink to copy data to. Refer to the previous full-copy
walkthrough for detailed configuration.

3. This template generates a pipeline with the following three activities and makes them chained on-success:
Lookup, Copy Data, and Web.
Go to the pipeline Parameters tab. You see all the configurations that you need to provide.
SAPOpenHubDestinationName: Specify the Open Hub table name to copy data from.
ADLSGen2SinkPath: Specify the destination Azure Data Lake Storage Gen2 path to copy data to. If
the path doesn't exist, the Data Factory copy activity creates a path during execution.
HighWatermarkBlobPath: Specify the path to store the high-watermark value, such as
container/path .

HighWatermarkBlobName: Specify the blob name to store the high watermark value, such as
requestIdCache.txt . In Blob storage, go to the corresponding path of
HighWatermarkBlobPath+HighWatermarkBlobName, such as container/path/requestIdCache.txt.
Create a blob with content 0.

LogicAppURL: In this template, we use WebActivity to call Azure Logic Apps to set the high-
watermark value in Blob storage. Or, you can use Azure SQL Database to store it. Use a stored
procedure activity to update the value.
You must first create a logic app, as the following image shows. Then, paste in the HTTP POST URL.
a. Go to the Azure portal. Select a new Logic Apps service. Select +Blank Logic App to go to
Logic Apps Designer.
b. Create a trigger of When an HTTP request is received. Specify the HTTP request body as
follows:

{
"properties": {
"sapOpenHubMaxRequestId": {
"type": "string"
}
},
"type": "object"
}

c. Add a Create blob action. For Folder path and Blob name, use the same values that you
configured previously in HighWatermarkBlobPath and HighWatermarkBlobName.
d. Select Save. Then, copy the value of HTTP POST URL to use in the Data Factory pipeline.
4. After you provide the Data Factory pipeline parameters, select Debug > Finish to invoke a run to validate
the configuration. Or, select Publish All to publish the changes, and then select Trigger to execute a run.

SAP BW Open Hub Destination configurations


This section introduces configuration of the SAP BW side to use the SAP BW Open Hub connector in Data Factory
to copy data.
Configure delta extraction in SAP BW
If you need both historical copy and incremental copy or only incremental copy, configure delta extraction in SAP
BW.
1. Create the Open Hub Destination. You can create the OHD in SAP Transaction RSA1, which automatically
creates the required transformation and data-transfer process. Use the following settings:
ObjectType: You can use any object type. Here, we use InfoCube as an example.
Destination Type: Select Database Table.
Key of the Table: Select Technical Key.
Extraction: Select Keep Data and Insert Records into Table.

You might increase the number of parallel running SAP work processes for the DTP:

2. Schedule the DTP in process chains.


A delta DTP for a cube only works if the necessary rows haven't been compressed. Make sure that BW cube
compression isn't running before the DTP to the Open Hub table. The easiest way to do this is to integrate
the DTP into your existing process chains. In the following example, the DTP (to the OHD ) is inserted into
the process chain between the Adjust (aggregate rollup) and Collapse (cube compression) steps.

Configure full extraction in SAP BW


In addition to delta extraction, you might want a full extraction of the same SAP BW InfoProvider. This usually
applies if you want to do full copy but not incremental, or you want to resync delta extraction.
You can't have more than one DTP for the same OHD. So, you must create an additional OHD before delta
extraction.

For a full load OHD, choose different options than for delta extraction:
In OHD: Set the Extraction option to Delete Data and Insert Records. Otherwise, data will be extracted
many times when you repeat the DTP in a BW process chain.
In the DTP: Set Extraction Mode to Full. You must change the automatically created DTP from Delta to
Full immediately after the OHD is created, as this image shows:
In the BW Open Hub connector of Data Factory: Turn off Exclude last request. Otherwise, nothing will be
extracted.
You typically run the full DTP manually. Or, you can create a process chain for the full DTP. It's typically a separate
chain that's independent of your existing process chains. In either case, make sure that the DTP is finished before
you start the extraction by using Data Factory copy. Otherwise, only partial data will be copied.
Run delta extraction the first time
The first delta extraction is technically a full extraction. By default, the SAP BW Open Hub connector excludes the
last request when it copies data. For the first delta extraction, no data is extracted by the Data Factory copy activity
until a subsequent DTP generates delta data in the table with a separate request ID. There are two ways to avoid
this scenario:
Turn off the Exclude last request option for the first delta extraction. Make sure that the first delta DTP is
finished before you start the delta extraction the first time.
Use the procedure for resyncing the delta extraction, as described in the next section.
Resync delta extraction
The following scenarios change the data in SAP BW cubes but are not considered by the delta DTP:
SAP BW selective deletion (of rows by using any filter condition)
SAP BW request deletion (of faulty requests)
An SAP Open Hub Destination isn't a data-mart-controlled data target (in all SAP BW support packages since
2015). So, you can delete data from a cube without changing the data in the OHD. You must then resync the data of
the cube with Data Factory:
1. Run a full extraction in Data Factory (by using a full DTP in SAP ).
2. Delete all rows in the Open Hub table for the delta DTP.
3. Set the status of the delta DTP to Fetched.
After this, all subsequent delta DTPs and Data Factory delta extractions work as expected.
To set the status of the delta DTP to Fetched, you can use the following option to run the delta DTP manually:

No Data Transfer; Delta Status in Source: Fetched

Next steps
Learn about SAP BW Open Hub connector support:
SAP Business Warehouse Open Hub connector
Copy data from SAP Business Warehouse using
Azure Data Factory
1/3/2019 • 4 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from an SAP Business
Warehouse (BW ). It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from SAP Business Warehouse to any supported sink data store. For a list of data stores that
are supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP Business Warehouse connector supports:
SAP Business Warehouse version 7.x.
Copying data from InfoCubes and QueryCubes (including BEx queries) using MDX queries.
Copying data using basic authentication.

Prerequisites
To use this SAP Business Warehouse connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the SAP NetWeaver library on the Integration Runtime machine. You can get the SAP Netweaver
library from your SAP administrator, or directly from the SAP Software Download Center. Search for the SAP
Note #1025361 to get the download location for the most recent version. Make sure that you pick the 64-bit
SAP NetWeaver library which matches your Integration Runtime installation. Then install all files included in
the SAP NetWeaver RFC SDK according to the SAP Note. The SAP NetWeaver library is also included in the
SAP Client Tools installation.

TIP
To troubleshoot connectivity issue to SAP BW, make sure:
All dependency libraries extracted from the NetWeaver RFC SDK are in place in the %windir%\system32 folder. Usually it
has icudt34.dll, icuin34.dll, icuuc34.dll, libicudecnumber.dll, librfc32.dll, libsapucum.dll, sapcrypto.dll, sapcryto_old.dll,
sapnwrfc.dll.
The ports needed to connect to the SAP server are enabled on the Self-hosted IR machine, which usually are ports 3300 and 3201.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP Business Warehouse connector.

Linked service properties


The following properties are supported for SAP Business Warehouse (BW ) linked service:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to: Yes


SapBw

server Name of the server on which the SAP Yes


BW instance resides.

systemNumber System number of the SAP BW system. Yes


Allowed value: two-digit decimal
number represented as a string.

clientId Client ID of the client in the SAP BW Yes


system.
Allowed value: three-digit decimal
number represented as a string.

userName Name of the user who has access to the Yes


SAP server.

password Password for the user. Mark this field as Yes


a SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.

connectVia The Integration Runtime to be used to Yes


connect to the data store. A Self-hosted
Integration Runtime is required as
mentioned in Prerequisites.

Example:
{
"name": "SapBwLinkedService",
"properties": {
"type": "SapBw",
"typeProperties": {
"server": "<server name>",
"systemNumber": "<system number>",
"clientId": "<client id>",
"userName": "<SAP user>",
"password": {
"type": "SecureString",
"value": "<Password for SAP user>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
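As noted in the table above, the password can also reference a secret stored in Azure Key Vault instead of an inline SecureString. A hedged fragment for the typeProperties section (the secret and linked service names are placeholders) might look like:

"password": {
    "type": "AzureKeyVaultSecret",
    "secretName": "<secret name of SAP BW password in AKV>",
    "store": {
        "referenceName": "<Azure Key Vault linked service>",
        "type": "LinkedServiceReference"
    }
}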

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by SAP BW dataset.
To copy data from SAP BW, set the type property of the dataset to RelationalTable. There are no type-specific properties supported for the SAP BW dataset of type RelationalTable.
Example:

{
"name": "SAPBWDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<SAP BW linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by SAP BW source.
SAP BW as source
To copy data from SAP BW, set the source type in the copy activity to RelationalSource. The following properties
are supported in the copy activity source section:

PROPERTY DESCRIPTION REQUIRED

type The type property of the copy activity Yes


source must be set to:
RelationalSource

query Specifies the MDX query to read data Yes


from the SAP BW instance.

Example:

"activities":[
{
"name": "CopyFromSAPBW",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP BW input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "<MDX query for SAP BW>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
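The query property takes an MDX statement that is executed against the SAP BW instance. As a purely illustrative sketch (the measure, characteristic, and InfoCube names below are hypothetical placeholders), the source section might look like:

"source": {
    "type": "RelationalSource",
    "query": "SELECT { [Measures].[<measure>] } ON COLUMNS, NON EMPTY { [<characteristic>].MEMBERS } ON ROWS FROM [<InfoCube name>]"
}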

Data type mapping for SAP BW


When copying data from SAP BW, the following mappings are used from SAP BW data types to Azure Data
Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the source
schema and data type to the sink.

SAP BW DATA TYPE DATA FACTORY INTERIM DATA TYPE

ACCP Int

CHAR String

CLNT String

CURR Decimal

CUKY String

DEC Decimal

FLTP Double

INT1 Byte

INT2 Int16

INT4 Int

LANG String

LCHR String

LRAW Byte[]

PREC Int16

QUAN Decimal

RAW Byte[]

RAWSTRING Byte[]

STRING String

UNIT String

DATS String

NUMC String

TIMS String

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP Cloud for Customer (C4C)
using Azure Data Factory
1/16/2019 • 4 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from/to SAP Cloud for
Customer (C4C ). It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from SAP Cloud for Customer to any supported sink data store, or copy data from any
supported source data store to SAP Cloud for Customer. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this connector enables Azure Data Factory to copy data from/to SAP Cloud for Customer including
the SAP Cloud for Sales, SAP Cloud for Service, and SAP Cloud for Social Engagement solutions.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP Cloud for Customer connector.

Linked service properties


The following properties are supported for SAP Cloud for Customer linked service:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to: Yes


SapCloudForCustomer.

url The URL of the SAP C4C OData service. Yes

username Specify the user name to connect to the Yes


SAP C4C.

password Specify the password for the user Yes


account you specified for the username.
Mark this field as a SecureString to
store it securely in Data Factory, or
reference a secret stored in Azure Key
Vault.

connectVia The Integration Runtime to be used to No for source, Yes for sink
connect to the data store. If not
specified, it uses the default Azure
Integration Runtime.

IMPORTANT
To copy data into SAP Cloud for Customer, explicitly create an Azure IR with a location near your SAP Cloud for Customer instance, and associate it in the linked service as in the following example:

Example:

{
"name": "SAPC4CLinkedService",
"properties": {
"type": "SapCloudForCustomer",
"typeProperties": {
"url": "https://<tenantname>.crm.ondemand.com/sap/c4c/odata/v1/c4codata/" ,
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by SAP Cloud for Customer dataset.
To copy data from SAP Cloud for Customer, set the type property of the dataset to
SapCloudForCustomerResource. The following properties are supported:

PROPERTY DESCRIPTION REQUIRED

type The type property of the dataset must Yes


be set to:
SapCloudForCustomerResource

path Specify path of the SAP C4C OData Yes


entity.

Example:
{
"name": "SAPC4CDataset",
"properties": {
"type": "SapCloudForCustomerResource",
"typeProperties": {
"path": "<path e.g. LeadCollection>"
},
"linkedServiceName": {
"referenceName": "<SAP C4C linked service>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by SAP Cloud for Customer source.
SAP C4C as source
To copy data from SAP Cloud for Customer, set the source type in the copy activity to
SapCloudForCustomerSource. The following properties are supported in the copy activity source section:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to: Yes


SapCloudForCustomerSource

query Specify the custom OData query to No


read data.

Sample query to get data for a specific day:


"query": "$filter=CreatedOn ge datetimeoffset'2017-07-31T10:02:06.4202620Z' and CreatedOn le
datetimeoffset'2017-08-01T10:02:06.4202620Z'"

Example:
"activities":[
{
"name": "CopyFromSAPC4C",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP C4C input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapCloudForCustomerSource",
"query": "<custom query e.g. $top=10>"
},
"sink": {
"type": "<sink type>"
}
}
}
]

SAP C4C as sink


To copy data to SAP Cloud for Customer, set the sink type in the copy activity to SapCloudForCustomerSink.
The following properties are supported in the copy activity sink section:

PROPERTY DESCRIPTION REQUIRED

type The type property must be set to: Yes


SapCloudForCustomerSink

writeBehavior The write behavior of the operation. No. Default “Insert”.


Could be “Insert”, “Update”.

writeBatchSize The batch size of the write operation. No. Default 10.
The batch size that gives the best performance
may differ for different tables or servers.

Example:
"activities":[
{
"name": "CopyToSapC4c",
"type": "Copy",
"inputs": [{
"type": "DatasetReference",
"referenceName": "<dataset type>"
}],
"outputs": [{
"type": "DatasetReference",
"referenceName": "SapC4cDataset"
}],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SapCloudForCustomerSink",
"writeBehavior": "Insert",
"writeBatchSize": 30
},
"parallelCopies": 10,
"dataIntegrationUnits": 4,
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": {
"referenceName": "ErrorLogBlobLinkedService",
"type": "LinkedServiceReference"
},
"path": "incompatiblerows"
}
}
}
]

Data type mapping for SAP Cloud for Customer


When copying data from SAP Cloud for Customer, the following mappings are used from SAP Cloud for
Customer data types to Azure Data Factory interim data types. See Schema and data type mappings to learn
about how copy activity maps the source schema and data type to the sink.

SAP C4C ODATA DATA TYPE DATA FACTORY INTERIM DATA TYPE

Edm.Binary Byte[]

Edm.Boolean Bool

Edm.Byte Byte[]

Edm.DateTime DateTime

Edm.Decimal Decimal

Edm.Double Double

Edm.Single Single

Edm.Guid Guid

Edm.Int16 Int16

Edm.Int32 Int32

Edm.Int64 Int64

Edm.SByte Int16

Edm.String String

Edm.Time TimeSpan

Edm.DateTimeOffset DateTimeOffset

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP ECC using Azure Data Factory
5/24/2019 • 4 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from SAP ECC (SAP
Enterprise Central Component). It builds on the copy activity overview article that presents a general overview of
copy activity.

Supported capabilities
You can copy data from SAP ECC to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP ECC connector supports:
Copying data from SAP ECC on SAP NetWeaver version 7.0 and above.
Copying data from any objects exposed by SAP ECC OData services (e.g. SAP Table/Views, BAPI, Data Extractors, etc.), or data/IDOCs sent to SAP PI that can be received as OData via the relevant adapters.
Copying data using basic authentication.

TIP
To copy data from SAP ECC via SAP table/view, you can use SAP Table connector which is more performant and scalable.

Prerequisites
Generally, SAP ECC exposes entities via OData services through SAP Gateway. To use this SAP ECC connector, you need to:
Set up SAP Gateway. For servers with SAP NetWeaver versions higher than 7.4, SAP Gateway is already installed. Otherwise, you need to install the embedded Gateway or a Gateway hub before exposing SAP ECC data through OData services. Learn how to set up SAP Gateway from the installation guide.
Activate and configure the SAP OData service. You can activate the OData service through TCODE SICF in seconds. You can also configure which objects need to be exposed. Here is sample step-by-step guidance.

Getting started
You can create a pipeline with copy activity using .NET SDK, Python SDK, Azure PowerShell, REST API, or Azure
Resource Manager template. See Copy activity tutorial for step-by-step instructions to create a pipeline with a
copy activity.
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP ECC connector.

Linked service properties


The following properties are supported for SAP ECC linked service:
PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: SapEcc | Yes
url | The URL of the SAP ECC OData service. | Yes
username | The username used to connect to SAP ECC. | No
password | The plaintext password used to connect to SAP ECC. | No
connectVia | The Integration Runtime to be used to connect to the data store. You can use the Self-hosted Integration Runtime or the Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. | No

Example:

{
"name": "SapECCLinkedService",
"properties": {
"type": "SapEcc",
"typeProperties": {
"url": "<SAP ECC OData url e.g. http://eccsvrname:8000/sap/opu/odata/sap/zgw100_dd02l_so_srv/>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by SAP ECC dataset.
To copy data from SAP ECC, set the type property of the dataset to SapEccResource. The following properties
are supported:

PROPERTY DESCRIPTION REQUIRED

path Path of the SAP ECC OData entity. Yes

Example
{
"name": "SapEccDataset",
"properties": {
"type": "SapEccResource",
"typeProperties": {
"path": "<entity path e.g. dd04tentitySet>"
},
"linkedServiceName": {
"referenceName": "<SAP ECC linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by SAP ECC source.
SAP ECC as source
To copy data from SAP ECC, set the source type in the copy activity to SapEccSource. The following properties
are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: SapEccSource | Yes
query | OData query options to filter data. Example: "$select=Name,Description&$top=10". The SAP ECC connector copies data from the combined URL: (url specified in linked service)/(path specified in dataset)?(query specified in copy activity source). Refer to OData URL components. | No

Example:
"activities":[
{
"name": "CopyFromSAPECC",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP ECC input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapEccSource",
"query": "$top=10"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Data type mapping for SAP ECC


When copying data from SAP ECC, the following mappings are used from OData data types for SAP ECC data to
Azure Data Factory interim data types. See Schema and data type mappings to learn about how copy activity
maps the source schema and data type to the sink.

ODATA DATA TYPE DATA FACTORY INTERIM DATA TYPE

Edm.Binary String

Edm.Boolean Bool

Edm.Byte String

Edm.DateTime DateTime

Edm.Decimal Decimal

Edm.Double Double

Edm.Single Single

Edm.Guid String

Edm.Int16 Int16

Edm.Int32 Int32

Edm.Int64 Int64

Edm.SByte Int16

Edm.String String

Edm.Time TimeSpan

Edm.DateTimeOffset DateTimeOffset

NOTE
Complex data types are currently not supported.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP HANA using Azure Data
Factory
1/3/2019 • 4 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from an SAP HANA
database. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from SAP HANA database to any supported sink data store. For a list of data stores supported
as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP HANA connector supports:
Copying data from any version of SAP HANA database.
Copying data from HANA information models (such as Analytic and Calculation views) and Row/Column
tables using SQL queries.
Copying data using Basic or Windows authentication.

NOTE
To copy data into an SAP HANA data store, use the generic ODBC connector. See the SAP HANA sink section for details. Note that the linked services for the SAP HANA connector and the ODBC connector have different types and therefore cannot be reused.

Prerequisites
To use this SAP HANA connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the SAP HANA ODBC driver on the Integration Runtime machine. You can download the SAP HANA
ODBC driver from the SAP Software Download Center. Search with the keyword SAP HANA CLIENT for
Windows.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP HANA connector.
Linked service properties
The following properties are supported for SAP HANA linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: SapHana | Yes
server | Name of the server on which the SAP HANA instance resides. If your server uses a customized port, specify server:port. | Yes
authenticationType | Type of authentication used to connect to the SAP HANA database. Allowed values are: Basic and Windows. | Yes
userName | Name of the user who has access to the SAP server. | Yes
password | Password for the user. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. A Self-hosted Integration Runtime is required, as mentioned in Prerequisites. | Yes

Example:

{
"name": "SapHanaLinkedService",
"properties": {
"type": "SapHana",
"typeProperties": {
"server": "<server>:<port (optional)>",
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
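As the password row in the table above notes, you can also reference a secret stored in Azure Key Vault instead of an inline SecureString. The following is a minimal sketch of that variant, following the same Azure Key Vault pattern used elsewhere in this documentation; it assumes you have already defined an Azure Key Vault linked service and stored the password as a secret:

{
    "name": "SapHanaLinkedService",
    "properties": {
        "type": "SapHana",
        "typeProperties": {
            "server": "<server>:<port (optional)>",
            "authenticationType": "Basic",
            "userName": "<username>",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secretName>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}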

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by SAP HANA dataset.
To copy data from SAP HANA, set the type property of the dataset to RelationalTable. There are no type-specific properties supported for the SAP HANA dataset of type RelationalTable.
Example:

{
"name": "SAPHANADataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<SAP HANA linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by SAP HANA source.
SAP HANA as source
To copy data from SAP HANA, set the source type in the copy activity to RelationalSource. The following
properties are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: RelationalSource | Yes
query | Specifies the SQL query to read data from the SAP HANA instance. | Yes

Example:
"activities":[
{
"name": "CopyFromSAPHANA",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP HANA input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "<SQL query for SAP HANA>"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Data type mapping for SAP HANA


When copying data from SAP HANA, the following mappings are used from SAP HANA data types to Azure
Data Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the
source schema and data type to the sink.

SAP HANA DATA TYPE DATA FACTORY INTERIM DATA TYPE

ALPHANUM String

BIGINT Int64

BLOB Byte[]

BOOLEAN Byte

CLOB Byte[]

DATE DateTime

DECIMAL Decimal

DOUBLE Single

INT Int32

NVARCHAR String

REAL Single

SECONDDATE DateTime

SMALLINT Int16

TIME TimeSpan

TIMESTAMP DateTime

TINYINT Byte

VARCHAR String

Known limitations
There are a few known limitations when copying data from SAP HANA:
NVARCHAR strings are truncated to a maximum length of 4000 Unicode characters.
SMALLDECIMAL is not supported.
VARBINARY is not supported.
Valid dates are between 1899/12/30 and 9999/12/31.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP Table using Azure Data Factory
5/24/2019 • 7 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from an SAP Table. It builds
on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from SAP Table to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP Table connector supports:
Copying data from SAP Table in SAP Business Suite with version 7.01 or higher (in a recent SAP Support
Package Stack released after the year 2015) or S/4HANA.
Copying data from both SAP Transparent Table and View.
Copying data using basic authentication or SNC (Secure Network Communications) if SNC is configured.
Connecting to Application Server or Message Server.

Prerequisites
To use this SAP Table connector, you need to:
Set up a Self-hosted Integration Runtime with version 3.17 or above. See Self-hosted Integration Runtime
article for details.
Download the 64-bit SAP .NET Connector 3.0 from SAP's website, and install it on the Self-hosted IR
machine. When installing, in the optional setup steps window, make sure you select the Install Assemblies
to GAC option.

The SAP user being used by the Data Factory SAP Table connector needs to have the following permissions:
Authorization for RFC.
Permissions for the "Execute" activity of the authorization object "S_SDSAUTH".

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP Table connector.

Linked service properties


The following properties are supported for the SAP Table linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: SapTable | Yes
server | Name of the server on which the SAP instance resides. Applicable if you want to connect to the SAP Application Server. | No
systemNumber | System number of the SAP system. Applicable if you want to connect to the SAP Application Server. Allowed value: a two-digit decimal number represented as a string. | No
messageServer | The hostname of the SAP Message Server. Applicable if you want to connect to the SAP Message Server. | No
messageServerService | The service name or port number of the Message Server. Applicable if you want to connect to the SAP Message Server. | No
systemId | SystemID of the SAP system where the table is located. Applicable if you want to connect to the SAP Message Server. | No
logonGroup | The logon group for the SAP system. Applicable if you want to connect to the SAP Message Server. | No
clientId | Client ID of the client in the SAP system. Allowed value: a three-digit decimal number represented as a string. | Yes
language | Language that the SAP system uses. | No (default value is EN)
userName | Name of the user who has access to the SAP server. | Yes
password | Password for the user. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
sncMode | SNC activation indicator to access the SAP server where the table is located. Applicable if you want to use SNC to connect to the SAP server. Allowed values are: 0 (off, default) or 1 (on). | No
sncMyName | Initiator's SNC name to access the SAP server where the table is located. Applicable when sncMode is on. | No
sncPartnerName | Communication partner's SNC name to access the SAP server where the table is located. Applicable when sncMode is on. | No
sncLibraryPath | External security product's library to access the SAP server where the table is located. Applicable when sncMode is on. | No
sncQop | SNC Quality of Protection. Applicable when sncMode is on. Allowed values are: 1 (Authentication), 2 (Integrity), 3 (Privacy), 8 (Default), 9 (Maximum). | No
connectVia | The Integration Runtime to be used to connect to the data store. A Self-hosted Integration Runtime is required, as mentioned in Prerequisites. | Yes

Example 1: connecting to the SAP Application Server


{
"name": "SapTableLinkedService",
"properties": {
"type": "SapTable",
"typeProperties": {
"server": "<server name>",
"systemNumber": "<system number>",
"clientId": "<client id>",
"userName": "<SAP user>",
"password": {
"type": "SecureString",
"value": "<Password for SAP user>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: connecting to the SAP Message Server

{
"name": "SapTableLinkedService",
"properties": {
"type": "SapTable",
"typeProperties": {
"messageServer": "<message server name>",
"messageServerService": "<service name or port>",
"systemId": "<system id>",
"logonGroup": "<logon group>",
"clientId": "<client id>",
"userName": "<SAP user>",
"password": {
"type": "SecureString",
"value": "<Password for SAP user>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 3: connecting using SNC


{
"name": "SapTableLinkedService",
"properties": {
"type": "SapTable",
"typeProperties": {
"server": "<server name>",
"systemNumber": "<system number>",
"clientId": "<client id>",
"userName": "<SAP user>",
"password": {
"type": "SecureString",
"value": "<Password for SAP user>"
},
"sncMode": 1,
"sncMyName": "snc myname",
"sncPartnerName": "snc partner name",
"sncLibraryPath": "snc library path",
"sncQop": "8"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the SAP Table dataset.
To copy data from SAP Table, the following properties are supported.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to SapTableResource. | Yes
tableName | The name of the SAP table to copy data from. | Yes

Example:

{
"name": "SAPTableDataset",
"properties": {
"type": "SapTableResource",
"linkedServiceName": {
"referenceName": "<SAP Table linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<SAP table name>"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by SAP Table source.
SAP Table as source
To copy data from SAP Table, the following properties are supported.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to SapTableSource. | Yes
rowCount | Number of rows to be retrieved. | No
rfcTableFields | Fields to copy from the SAP table. For example, column0, column1. | No
rfcTableOptions | Options to filter the rows in the SAP table. For example, COLUMN0 EQ 'SOMEVALUE'. | No
customRfcReadTableFunctionModule | Custom RFC function module that can be used to read data from the SAP table. | No
partitionOption | The partition mechanism to read from the SAP table. The supported options include: None, PartitionOnInt (normal integer or integer values with zero padding on the left, such as 0000012345), PartitionOnCalendarYear (4 digits in the format "YYYY"), PartitionOnCalendarMonth (6 digits in the format "YYYYMM"), and PartitionOnCalendarDate (8 digits in the format "YYYYMMDD"). | No
partitionColumnName | The name of the column used to partition the data. | No
partitionUpperBound | The maximum value of the column specified in partitionColumnName that will be used for partitioning. | No
partitionLowerBound | The minimum value of the column specified in partitionColumnName that will be used for partitioning. | No
maxPartitionsNumber | The maximum number of partitions to split the data into. | No
TIP
If your SAP table has a large volume of data, such as several billion rows, use partitionOption and partitionSettings to split the data into smaller partitions. In that case, the data is read partition by partition, and each data partition is retrieved from your SAP server via a single RFC call.
Taking partitionOption as PartitionOnInt as an example, the number of rows in each partition is calculated as (total rows falling between partitionUpperBound and partitionLowerBound) / maxPartitionsNumber.
If you want to further run partitions in parallel to speed up the copy, it is strongly recommended to make maxPartitionsNumber a multiple of the value of parallelCopies (learn more from Parallel Copy).

Example:

"activities":[
{
"name": "CopyFromSAPTable",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP Table input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapTableSource",
"partitionOption": "PartitionOnInt",
"partitionSettings": {
"partitionColumnName": "<partition column name>",
"partitionUpperBound": "2000",
"partitionLowerBound": "1",
"maxPartitionsNumber": 500
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Data type mapping for SAP Table


When copying data from SAP Table, the following mappings are used from SAP Table data types to Azure Data
Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the source
schema and data type to the sink.

SAP ABAP TYPE DATA FACTORY INTERIM DATA TYPE

C (String) String

I (Integer) Int32

F (Float) Double

D (Date) String

T (Time) String

P (BCD Packed, Currency, Decimal, Qty) Decimal

N (Numeric) String

X (Binary and Raw) String

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from ServiceNow using Azure Data
Factory
1/16/2019 • 4 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from ServiceNow. It builds
on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from ServiceNow to any supported sink data store. For a list of data stores that are supported
as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
ServiceNow connector.

Linked service properties


The following properties are supported for ServiceNow linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: ServiceNow | Yes
endpoint | The endpoint of the ServiceNow server (http://<instance>.service-now.com). | Yes
authenticationType | The authentication type to use. Allowed values are: Basic, OAuth2. | Yes
username | The user name used to connect to the ServiceNow server for Basic and OAuth2 authentication. | Yes
password | The password corresponding to the user name for Basic and OAuth2 authentication. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
clientId | The client ID for OAuth2 authentication. | No
clientSecret | The client secret for OAuth2 authentication. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No
useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No
useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over SSL. The default value is true. | No
usePeerVerification | Specifies whether to verify the identity of the server when connecting over SSL. The default value is true. | No

Example:

{
"name": "ServiceNowLinkedService",
"properties": {
"type": "ServiceNow",
"typeProperties": {
"endpoint" : "http://<instance>.service-now.com",
"authenticationType" : "Basic",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
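The preceding example uses Basic authentication. The linked service table also lists OAuth2 authentication with clientId and clientSecret; a minimal sketch of an OAuth2 variant, using placeholders rather than values verified against a live instance, might look like the following:

{
    "name": "ServiceNowLinkedService",
    "properties": {
        "type": "ServiceNow",
        "typeProperties": {
            "endpoint" : "http://<instance>.service-now.com",
            "authenticationType" : "OAuth2",
            "username" : "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            },
            "clientId": "<client id>",
            "clientSecret": {
                "type": "SecureString",
                "value": "<client secret>"
            }
        }
    }
}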

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by ServiceNow dataset.
To copy data from ServiceNow, set the type property of the dataset to ServiceNowObject. The following
properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: ServiceNowObject | Yes
tableName | Name of the table. | No (if "query" in the activity source is specified)

Example

{
"name": "ServiceNowDataset",
"properties": {
"type": "ServiceNowObject",
"linkedServiceName": {
"referenceName": "<ServiceNow linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by ServiceNow source.
ServiceNow as source
To copy data from ServiceNow, set the source type in the copy activity to ServiceNowSource. The following
properties are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: ServiceNowSource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM Actual.alm_asset". | No (if "tableName" in the dataset is specified)

Note the following when specifying the schema and column for ServiceNow in the query, and refer to Performance tips for the copy performance implications.
Schema: specify the schema as Actual or Display in the ServiceNow query; you can think of it as the sysparm_display_value parameter being set to true or false when calling the ServiceNow RESTful APIs.
Column: the column name for an actual value under the Actual schema is [column name]_value, while for a display value under the Display schema it is [column name]_display_value. Note that the column name needs to map to the schema being used in the query.
Sample query: SELECT col_value FROM Actual.alm_asset or SELECT col_display_value FROM Display.alm_asset

Example:
"activities":[
{
"name": "CopyFromServiceNow",
"type": "Copy",
"inputs": [
{
"referenceName": "<ServiceNow input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ServiceNowSource",
"query": "SELECT * FROM Actual.alm_asset"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Performance tips
Schema to use
ServiceNow has two different schemas: "Actual", which returns actual data, and "Display", which returns the display values of the data.
If you have a filter in your query, use the "Actual" schema, which has better copy performance. When querying against the "Actual" schema, ServiceNow natively supports the filter when fetching the data and returns only the filtered result set, whereas when querying the "Display" schema, ADF retrieves all the data and applies the filter internally.
Index
A ServiceNow table index can help improve query performance; refer to Create a table index.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SFTP server using Azure Data
Factory
5/6/2019 • 12 minutes to read • Edit Online

This article outlines how to copy data from SFTP server. To learn about Azure Data Factory, read the introductory
article.

Supported capabilities
This SFTP connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Specifically, this SFTP connector supports:
Copying files using Basic or SshPublicKey authentication.
Copying files as-is or parsing files with the supported file formats and compression codecs.

Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SFTP.

Linked service properties


The following properties are supported for SFTP linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Sftp. | Yes
host | Name or IP address of the SFTP server. | Yes
port | Port on which the SFTP server is listening. Allowed value is an integer; the default value is 22. | No
skipHostKeyValidation | Specify whether to skip host key validation. Allowed values are: true, false (default). | No
hostKeyFingerprint | Specify the fingerprint of the host key. | Yes if "skipHostKeyValidation" is set to false.
authenticationType | Specify the authentication type. Allowed values are: Basic, SshPublicKey. Refer to the Using basic authentication and Using SSH public key authentication sections for more properties and JSON samples, respectively. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. You can use the Azure Integration Runtime or a Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. | No

Using basic authentication


To use basic authentication, set "authenticationType" property to Basic, and specify the following properties
besides the SFTP connector generic ones introduced in the last section:

PROPERTY | DESCRIPTION | REQUIRED
userName | User who has access to the SFTP server. | Yes
password | Password for the user (userName). Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes

Example:
{
"name": "SftpLinkedService",
"type": "linkedservices",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "<sftp server>",
"port": 22,
"skipHostKeyValidation": false,
"hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00",
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Using SSH public key authentication


To use SSH public key authentication, set "authenticationType" property as SshPublicKey, and specify the
following properties besides the SFTP connector generic ones introduced in the last section:

PROPERTY | DESCRIPTION | REQUIRED
userName | User who has access to the SFTP server. | Yes
privateKeyPath | Specify the absolute path to the private key file that the Integration Runtime can access. Applies only when the Self-hosted type of Integration Runtime is specified in "connectVia". | Specify either privateKeyPath or privateKeyContent.
privateKeyContent | Base64 encoded SSH private key content. The SSH private key should be in OpenSSH format. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Specify either privateKeyPath or privateKeyContent.
passPhrase | Specify the pass phrase/password to decrypt the private key if the key file is protected by a pass phrase. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes if the private key file is protected by a pass phrase.

NOTE
The SFTP connector supports RSA/DSA OpenSSH keys. Make sure your key file content starts with "-----BEGIN [RSA/DSA] PRIVATE KEY-----". If the private key file is in ppk format, use the PuTTY tool to convert it from .ppk to OpenSSH format.

Example 1: SshPublicKey authentication using private key filePath


{
"name": "SftpLinkedService",
"type": "Linkedservices",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "<sftp server>",
"port": 22,
"skipHostKeyValidation": true,
"authenticationType": "SshPublicKey",
"userName": "xxx",
"privateKeyPath": "D:\\privatekey_openssh",
"passPhrase": {
"type": "SecureString",
"value": "<pass phrase>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: SshPublicKey authentication using private key content

{
"name": "SftpLinkedService",
"type": "Linkedservices",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "<sftp server>",
"port": 22,
"skipHostKeyValidation": true,
"authenticationType": "SshPublicKey",
"userName": "<username>",
"privateKeyContent": {
"type": "SecureString",
"value": "<base64 string of the private key content>"
},
"passPhrase": {
"type": "SecureString",
"value": "<pass phrase>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
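Because privateKeyContent (like passPhrase) can be marked as a SecureString or referenced from Azure Key Vault, you can keep the key material out of the linked service JSON entirely. The following is a minimal sketch of that approach, following the Azure Key Vault reference pattern used elsewhere in this documentation, with placeholder names:

{
    "name": "SftpLinkedService",
    "type": "Linkedservices",
    "properties": {
        "type": "Sftp",
        "typeProperties": {
            "host": "<sftp server>",
            "port": 22,
            "skipHostKeyValidation": true,
            "authenticationType": "SshPublicKey",
            "userName": "<username>",
            "privateKeyContent": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secretName>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}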

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data from SFTP in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based dataset and supported settings. The following properties are supported for SFTP
under location settings in format-based dataset:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property under location in the dataset must be set to SftpLocation. | Yes
folderPath | The path to the folder. If you want to use a wildcard to filter folders, skip this setting and specify it in the activity source settings. | No
fileName | The file name under the given folderPath. If you want to use a wildcard to filter files, skip this setting and specify it in the activity source settings. | No

NOTE
The FileShare type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the Copy/Lookup/GetMetadata activity for backward compatibility, but you are encouraged to use the new model going forward; the ADF authoring UI has switched to generating these new types.

Example:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<SFTP linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "SftpLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Other format dataset


To copy data from SFTP in ORC/Avro/JSON/Binary format, the following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: FileShare | Yes
folderPath | Path to the folder. Wildcard filter is supported; allowed wildcards are * (matches zero or more characters) and ? (matches zero or single character). Use ^ to escape if your actual folder name has a wildcard or this escape char inside. Example: rootfolder/subfolder/; see more examples in Folder and file filter examples. | Yes
fileName | Name or wildcard filter for the file(s) under the specified "folderPath". If you don't specify a value for this property, the dataset points to all files in the folder. For the filter, allowed wildcards are * (matches zero or more characters) and ? (matches zero or single character). Example 1: "fileName": "*.csv"; Example 2: "fileName": "???20180427.txt". Use ^ to escape if your actual file name has a wildcard or this escape char inside. | No
modifiedDatetimeStart | Files filter based on the attribute Last Modified. The files are selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z". Be aware that the overall performance of data movement will be impacted by enabling this setting when you want to filter huge numbers of files. The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value are selected. | No
modifiedDatetimeEnd | Same behavior as modifiedDatetimeStart above. | No
format | If you want to copy files as-is between file-based stores (binary copy), skip the format section in both the input and output dataset definitions. If you want to parse files with a specific format, the following file format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections. | No (only for binary copy scenario)
compression | Specify the type and level of compression for the data. For more information, see Supported file formats and compression codecs. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. | No
TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.

NOTE
If you were using "fileFilter" property for file filter, it is still supported as-is, while you are suggested to use the new filter
capability added to "fileName" going forward.

Example:

{
"name": "SFTPDataset",
"type": "Datasets",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<SFTP linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
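As the format row in the table above describes, for a binary (as-is) copy between file-based stores you simply omit the format and compression sections. A minimal sketch of such a dataset:

{
    "name": "SFTPBinaryDataset",
    "type": "Datasets",
    "properties": {
        "type": "FileShare",
        "linkedServiceName":{
            "referenceName": "<SFTP linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "folder/subfolder/",
            "fileName": "*"
        }
    }
}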

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by SFTP source.
SFTP as source
For copy from Parquet and delimited text format, refer to Parquet and delimited text format source
section.
For copy from other formats like ORC/Avro/JSON/Binary format, refer to Other format source section.
Parquet and delimited text format source
To copy data from SFTP in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based copy activity source and supported settings. The following properties are
supported for SFTP under storeSettings settings in format-based copy source:
PROPERTY | DESCRIPTION | REQUIRED
type | The type property under storeSettings must be set to SftpReadSetting. | Yes
recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false. | No
wildcardFolderPath | The folder path with wildcard characters to filter source folders. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has a wildcard or this escape char inside. See more examples in Folder and file filter examples. | No
wildcardFileName | The file name with wildcard characters under the given folderPath/wildcardFolderPath to filter source files. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual file name has a wildcard or this escape char inside. See more examples in Folder and file filter examples. | Yes if fileName is not specified in the dataset
modifiedDatetimeStart | Files filter based on the attribute Last Modified. The files are selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z". The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value are selected. | No
modifiedDatetimeEnd | Same as above. | No
maxConcurrentConnections | The number of connections used to connect to the storage store concurrently. Specify this only when you want to limit concurrent connections to the data store. | No

NOTE
For Parquet/delimited text format, the FileSystemSource type copy activity source mentioned in the next section is still supported as-is for backward compatibility, but you are encouraged to use the new model going forward; the ADF authoring UI has switched to generating these new types.

Example:

"activities":[
{
"name": "CopyFromSFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "SftpReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Other format source


To copy data from SFTP in ORC/Avro/JSON/Binary format, the following properties are supported in the
copy activity source section:
PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: FileSystemSource | Yes
recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder/subfolder will not be copied/created at the sink. Allowed values are: true (default), false. | No
maxConcurrentConnections | The number of connections used to connect to the storage store concurrently. Specify this only when you want to limit concurrent connections to the data store. | No

Example:

"activities":[
{
"name": "CopyFromSFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<SFTP input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

Each example below assumes the following source folder structure:
FolderA
    File1.csv
    File2.json
    Subfolder1
        File3.csv
        File4.json
        File5.csv
AnotherFolderB
    File6.csv

FOLDERPATH | FILENAME | RECURSIVE | FILTER RESULT (RETRIEVED FILES)
Folder* | (empty, use default) | false | FolderA/File1.csv and FolderA/File2.json are retrieved.
Folder* | (empty, use default) | true | FolderA/File1.csv, FolderA/File2.json, Subfolder1/File3.csv, Subfolder1/File4.json, and Subfolder1/File5.csv are retrieved.
Folder* | *.csv | false | FolderA/File1.csv is retrieved.
Folder* | *.csv | true | FolderA/File1.csv, Subfolder1/File3.csv, and Subfolder1/File5.csv are retrieved.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Shopify using Azure Data Factory
(Preview)
1/3/2019 • 3 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Shopify. It builds on the
copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Shopify to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Shopify connector.

Linked service properties


The following properties are supported for Shopify linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Shopify | Yes
host | The endpoint of the Shopify server (that is, mystore.myshopify.com). | Yes
accessToken | The API access token that can be used to access Shopify's data. The token does not expire if it is in offline mode. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No
useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over SSL. The default value is true. | No
usePeerVerification | Specifies whether to verify the identity of the server when connecting over SSL. The default value is true. | No

Example:

{
"name": "ShopifyLinkedService",
"properties": {
"type": "Shopify",
"typeProperties": {
"host" : "mystore.myshopify.com",
"accessToken": {
"type": "SecureString",
"value": "<accessToken>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Shopify dataset.
To copy data from Shopify, set the type property of the dataset to ShopifyObject. The following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: ShopifyObject | Yes
tableName | Name of the table. | No (if "query" in the activity source is specified)

Example
{
"name": "ShopifyDataset",
"properties": {
"type": "ShopifyObject",
"linkedServiceName": {
"referenceName": "<Shopify linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Shopify source.
Shopify as source
To copy data from Shopify, set the source type in the copy activity to ShopifySource. The following properties
are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: ShopifySource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM "Products" WHERE Product_Id = '123'". | No (if "tableName" in the dataset is specified)

Example:
"activities":[
{
"name": "CopyFromShopify",
"type": "Copy",
"inputs": [
{
"referenceName": "<Shopify input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ShopifySource",
"query": "SELECT * FROM \"Products\" WHERE Product_Id = '123'"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Spark using Azure Data Factory
1/3/2019 • 4 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Spark. It builds on the
copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from Spark to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Spark connector.

Linked service properties


The following properties are supported for Spark linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Spark | Yes
host | IP address or host name of the Spark server. | Yes
port | The TCP port that the Spark server uses to listen for client connections. If you connect to Azure HDInsight, specify port 443. | Yes
serverType | The type of Spark server. Allowed values are: SharkServer, SharkServer2, SparkThriftServer. | No
thriftTransportProtocol | The transport protocol to use in the Thrift layer. Allowed values are: Binary, SASL, HTTP. | No
authenticationType | The authentication method used to access the Spark server. Allowed values are: Anonymous, Username, UsernameAndPassword, WindowsAzureHDInsightService. | Yes
username | The user name that you use to access the Spark server. | No
password | The password corresponding to the user. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No
httpPath | The partial URL corresponding to the Spark server. | No
enableSsl | Specifies whether the connections to the server are encrypted using SSL. The default value is false. | No
trustedCertPath | The full path of the .pem file containing trusted CA certificates for verifying the server when connecting over SSL. This property can only be set when using SSL on a self-hosted IR. The default value is the cacerts.pem file installed with the IR. | No
useSystemTrustStore | Specifies whether to use a CA certificate from the system trust store or from a specified PEM file. The default value is false. | No
allowHostNameCNMismatch | Specifies whether to require a CA-issued SSL certificate name to match the host name of the server when connecting over SSL. The default value is false. | No
allowSelfSignedServerCert | Specifies whether to allow self-signed certificates from the server. The default value is false. | No
connectVia | The Integration Runtime to be used to connect to the data store. You can use the Self-hosted Integration Runtime or the Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. | No
Example:

{
"name": "SparkLinkedService",
"properties": {
"type": "Spark",
"typeProperties": {
"host" : "<cluster>.azurehdinsight.net",
"port" : "<port>",
"authenticationType" : "WindowsAzureHDInsightService",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
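The table above also lists SSL-related properties for connections made through a Self-hosted Integration Runtime. The following is a minimal, unverified sketch of a Spark Thrift Server linked service that uses UsernameAndPassword authentication with SSL enabled; adjust the host, port, and certificate path for your environment:

{
    "name": "SparkLinkedService",
    "properties": {
        "type": "Spark",
        "typeProperties": {
            "host" : "<spark server host>",
            "port" : "<port>",
            "serverType" : "SparkThriftServer",
            "authenticationType" : "UsernameAndPassword",
            "username" : "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            },
            "enableSsl": true,
            "trustedCertPath": "<full path to the .pem file with trusted CA certificates>"
        },
        "connectVia": {
            "referenceName": "<name of Self-hosted Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}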

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Spark dataset.
To copy data from Spark, set the type property of the dataset to SparkObject. The following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: SparkObject | Yes
tableName | Name of the table. | No (if "query" in the activity source is specified)

Example

{
"name": "SparkDataset",
"properties": {
"type": "SparkObject",
"linkedServiceName": {
"referenceName": "<Spark linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Spark source.
Spark as source
To copy data from Spark, set the source type in the copy activity to SparkSource. The following properties are
supported in the copy activity source section:
PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: SparkSource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in the dataset is specified)

Example:

"activities":[
{
"name": "CopyFromSpark",
"type": "Copy",
"inputs": [
{
"referenceName": "<Spark input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SparkSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data to and from SQL Server using Azure Data
Factory
5/6/2019 • 12 minutes to read • Edit Online

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from and to an SQL Server
database. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from a SQL Server database to any supported sink data store, or copy data from any supported source data store to a SQL Server database. For a list of data stores that are supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SQL Server connector supports:
SQL Server versions 2016, 2014, 2012, 2008 R2, 2008, and 2005
Copying data using SQL or Windows authentication.
As source, retrieving data using a SQL query or a stored procedure.
As sink, appending data to a destination table or invoking a stored procedure with custom logic during copy.
SQL Server Always Encrypted is not currently supported.

Prerequisites
To copy data from a SQL Server database that is not publicly accessible, you need to set up a Self-hosted Integration Runtime. See the Self-hosted Integration Runtime article for details. The Integration Runtime provides a built-in SQL Server database driver, so you don't need to manually install any driver when copying data from/to a SQL Server database.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SQL Server database connector.

Linked service properties


The following properties are supported for SQL Server linked service:
PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: SqlServer | Yes
connectionString | Specify the connectionString information needed to connect to the SQL Server database using either SQL authentication or Windows authentication. Refer to the following samples. Mark this field as a SecureString to store it securely in Data Factory. You can also put the password in Azure Key Vault and, if it's SQL authentication, pull the password configuration out of the connection string. See the JSON example below the table and the Store credentials in Azure Key Vault article for more details. | Yes
userName | Specify a user name if you are using Windows Authentication. Example: domainname\username. | No
password | Specify a password for the user account you specified for userName. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No
connectVia | The Integration Runtime to be used to connect to the data store. You can use the Self-hosted Integration Runtime or the Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. | No

TIP
If you hit error with error code as "UserErrorFailedToConnectToSqlServer" and message like "The session limit for the
database is XXX and has been reached.", add Pooling=false to your connection string and try again.
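
For reference, here is a minimal sketch of a connection string with that setting applied; the server, database, and credential values are placeholders rather than values from this article:

"connectionString": {
    "type": "SecureString",
    "value": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;Pooling=false;"
}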

Example 1: using SQL authentication


{
"name": "SqlServerLinkedService",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Data Source=<servername>\\<instance name if using named instance>;Initial Catalog=
<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: using SQL authentication with password in Azure Key Vault

{
"name": "SqlServerLinkedService",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Data Source=<servername>\\<instance name if using named instance>;Initial Catalog=
<databasename>;Integrated Security=False;User ID=<username>;"
},
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 3: using Windows authentication


{
"name": "SqlServerLinkedService",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Data Source=<servername>\\<instance name if using named instance>;Initial Catalog=
<databasename>;Integrated Security=True;"
},
"userName": "<domain\\username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by SQL Server dataset.
To copy data from/to SQL Server database, the following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to: SqlServerTable | Yes

tableName | Name of the table or view in the SQL Server database instance that the linked service refers to. | No for source, Yes for sink

Example:

{
"name": "SQLServerDataset",
"properties":
{
"type": "SqlServerTable",
"linkedServiceName": {
"referenceName": "<SQL Server linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"tableName": "MyTable"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by SQL Server source and sink.
SQL Server as source
To copy data from SQL Server, set the source type in the copy activity to SqlSource. The following properties are
supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity source must be set to: SqlSource | Yes

sqlReaderQuery | Use the custom SQL query to read data. Example: select * from MyTable. | No

sqlReaderStoredProcedureName | Name of the stored procedure that reads data from the source table. The last SQL statement must be a SELECT statement in the stored procedure. | No

storedProcedureParameters | Parameters for the stored procedure. Allowed values are: name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. | No

Points to note:
If the sqlReaderQuery is specified for the SqlSource, the Copy Activity runs this query against the SQL
Server source to get the data. Alternatively, you can specify a stored procedure by specifying the
sqlReaderStoredProcedureName and storedProcedureParameters (if the stored procedure takes
parameters).
If you do not specify either "sqlReaderQuery" or "sqlReaderStoredProcedureName", the columns defined in
the "structure" section of the dataset JSON are used to construct a query (
select column1, column2 from mytable ) to run against the SQL Server. If the dataset definition does not have
the "structure", all columns are selected from the table.
Example: using SQL query
"activities":[
{
"name": "CopyFromSQLServer",
"type": "Copy",
"inputs": [
{
"referenceName": "<SQL Server input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Example: using stored procedure

"activities":[
{
"name": "CopyFromSQLServer",
"type": "Copy",
"inputs": [
{
"referenceName": "<SQL Server input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type": "Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

The stored procedure definition:


CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters
(
@stringData varchar(20),
@identifier int
)
AS
SET NOCOUNT ON;
BEGIN
select *
from dbo.UnitTestSrcTable
where dbo.UnitTestSrcTable.stringData != stringData
and dbo.UnitTestSrcTable.identifier != identifier
END
GO

SQL Server as sink


To copy data to SQL Server, set the sink type in the copy activity to SqlSink. The following properties are
supported in the copy activity sink section:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity sink must be set to: SqlSink | Yes

writeBatchSize | Number of rows to insert into the SQL table per batch. Allowed values are: integer (number of rows). By default, Data Factory dynamically determines the appropriate batch size based on the row size. | No

writeBatchTimeout | Wait time for the batch insert operation to complete before it times out. Allowed values are: timespan. Example: "00:30:00" (30 minutes). | No

preCopyScript | Specify a SQL query for Copy Activity to execute before writing data into SQL Server. It will only be invoked once per copy run. You can use this property to clean up the pre-loaded data. | No

sqlWriterStoredProcedureName | Name of the stored procedure that defines how to apply source data into the target table, for example, to do upserts or transform using your own business logic. Note this stored procedure will be invoked per batch. If you want to do an operation that only runs once and has nothing to do with source data, for example, delete/truncate, use the preCopyScript property. | No

storedProcedureParameters | Parameters for the stored procedure. Allowed values are: name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. | No

sqlWriterTableType | Specify a table type name to be used in the stored procedure. Copy activity makes the data being moved available in a temp table with this table type. Stored procedure code can then merge the data being copied with existing data. | No

TIP
When copying data to SQL Server, the copy activity appends data to the sink table by default. To perform an UPSERT or
additional business logic, use the stored procedure in SqlSink. Learn more details from Invoking stored procedure for SQL
Sink.

Example 1: appending data

"activities":[
{
"name": "CopyToSQLServer",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<SQL Server output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 100000
}
}
}
]

Example 2: invoking a stored procedure during copy for upsert


Learn more details from Invoking stored procedure for SQL Sink.
"activities":[
{
"name": "CopyToSQLServer",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<SQL Server output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SqlSink",
"sqlWriterStoredProcedureName": "CopyTestStoredProcedureWithParameters",
"sqlWriterTableType": "CopyTestTableType",
"storedProcedureParameters": {
"identifier": { "value": "1", "type": "Int" },
"stringData": { "value": "str1" }
}
}
}
}
]

Identity columns in the target database


This section provides an example that copies data from a source table with no identity column to a destination
table with an identity column.
Source table:

create table dbo.SourceTbl


(
name varchar(100),
age int
)

Destination table:

create table dbo.TargetTbl


(
identifier int identity(1,1),
name varchar(100),
age int
)

Notice that the target table has an identity column.


Source dataset JSON definition
{
"name": "SampleSource",
"properties": {
"type": " SqlServerTable",
"linkedServiceName": {
"referenceName": "TestIdentitySQL",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "SourceTbl"
}
}
}

Destination dataset JSON definition

{
"name": "SampleTarget",
"properties": {
"structure": [
{ "name": "name" },
{ "name": "age" }
],
"type": "SqlServerTable",
"linkedServiceName": {
"referenceName": "TestIdentitySQL",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "TargetTbl"
}
}
}

Notice that your source and target tables have different schemas (the target has an additional identity column).
In this scenario, you need to specify the structure property in the target dataset definition, which doesn't include
the identity column.

Invoke stored procedure from SQL sink


When copying data into a SQL Server database, a user-specified stored procedure can be configured and invoked
with additional parameters.
A stored procedure can be used when built-in copy mechanisms do not serve the purpose. It is typically used
when an upsert (insert + update) or extra processing (merging columns, looking up additional values, inserting into
multiple tables, and so on) needs to be done before the final insertion of source data into the destination table.
The following sample shows how to use a stored procedure to do an upsert into a table in the SQL Server
database. Assume that the input data and the sink Marketing table each have three columns: ProfileID, State, and
Category. Do the upsert based on the ProfileID column, and only apply it for a specific category.
Output dataset: the "tableName" should be the same as the table type parameter name in your stored procedure
(see the stored procedure script below).
{
"name": "SQLServerDataset",
"properties":
{
"type": "SqlServerTable",
"linkedServiceName": {
"referenceName": "<SQL Server linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "Marketing"
}
}
}

Define the SQL sink section in copy activity as follows.

"sink": {
"type": "SqlSink",
"SqlWriterTableType": "MarketingType",
"SqlWriterStoredProcedureName": "spOverwriteMarketing",
"storedProcedureParameters": {
"category": {
"value": "ProductA"
}
}
}

In your database, define the stored procedure with the same name as the SqlWriterStoredProcedureName. It
handles input data from your specified source and merges it into the output table. The parameter name of the table
type in the stored procedure should be the same as the tableName defined in the dataset.

CREATE PROCEDURE spOverwriteMarketing @Marketing [dbo].[MarketingType] READONLY, @category varchar(256)


AS
BEGIN
MERGE [dbo].[Marketing] AS target
USING @Marketing AS source
ON (target.ProfileID = source.ProfileID and target.Category = @category)
WHEN MATCHED THEN
UPDATE SET State = source.State
WHEN NOT MATCHED THEN
INSERT (ProfileID, State, Category)
VALUES (source.ProfileID, source.State, source.Category);
END

In your database, define the table type with the same name as sqlWriterTableType. Notice that the schema of the
table type should be the same as the schema returned by your input data.

CREATE TYPE [dbo].[MarketingType] AS TABLE(


[ProfileID] [varchar](256) NOT NULL,
[State] [varchar](256) NOT NULL,
[Category] [varchar](256) NOT NULL
)

The stored procedure feature takes advantage of Table-Valued Parameters.

Data type mapping for SQL server


When copying data from/to SQL Server, the following mappings are used from SQL Server data types to Azure
Data Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the
source schema and data type to the sink.

SQL SERVER DATA TYPE DATA FACTORY INTERIM DATA TYPE

bigint Int64

binary Byte[]

bit Boolean

char String, Char[]

date DateTime

Datetime DateTime

datetime2 DateTime

Datetimeoffset DateTimeOffset

Decimal Decimal

FILESTREAM attribute (varbinary(max)) Byte[]

Float Double

image Byte[]

int Int32

money Decimal

nchar String, Char[]

ntext String, Char[]

numeric Decimal

nvarchar String, Char[]

real Single

rowversion Byte[]

smalldatetime DateTime

smallint Int16

smallmoney Decimal

sql_variant Object

text String, Char[]

time TimeSpan

timestamp Byte[]

tinyint Int16

uniqueidentifier Guid

varbinary Byte[]

varchar String, Char[]

xml Xml

NOTE
For data types that map to the Decimal interim type, ADF currently supports precision up to 28. If you have data
with precision larger than 28, consider converting it to a string in the SQL query.
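
For example, a source query along the following lines could read a high-precision decimal column as a string; the table and column names here are placeholders, not values from this article:

"source": {
    "type": "SqlSource",
    "sqlReaderQuery": "SELECT CAST(HighPrecisionColumn AS VARCHAR(50)) AS HighPrecisionColumn, OtherColumn FROM MyTable"
}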

Troubleshooting connection issues


1. Configure your SQL Server to accept remote connections. Launch SQL Server Management Studio,
right-click server, and click Properties. Select Connections from the list and check Allow remote
connections to the server.

See Configure the remote access Server Configuration Option for detailed steps.
2. Launch SQL Server Configuration Manager. Expand SQL Server Network Configuration for the
instance you want, and select Protocols for MSSQLSERVER. You should see protocols in the right-pane.
Enable TCP/IP by right-clicking TCP/IP and clicking Enable.

See Enable or Disable a Server Network Protocol for details and alternate ways of enabling TCP/IP
protocol.
3. In the same window, double-click TCP/IP to launch TCP/IP Properties window.
4. Switch to the IP Addresses tab. Scroll down to see IPAll section. Note down the TCP Port (default is
1433).
5. Create a rule for the Windows Firewall on the machine to allow incoming traffic through this port.
6. Verify connection: To connect to SQL Server by using its fully qualified name, use SQL Server
Management Studio from a different machine. For example: "<machine>.<domain>.corp.<company>.com,1433".
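
The same fully qualified name and port can then be reused in the linked service connection string; the following is only a sketch with placeholder catalog and credential values:

"connectionString": {
    "type": "SecureString",
    "value": "Data Source=<machine>.<domain>.corp.<company>.com,1433;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;"
}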

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Square using Azure Data Factory
(Preview)
3/21/2019 • 3 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Square. It builds on the
copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Square to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Square connector.

Linked service properties


The following properties are supported for Square linked service:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to: Square | Yes

host | The URL of the Square instance (i.e. mystore.mysquare.com). | Yes

clientId | The client ID associated with your Square application. | Yes

clientSecret | The client secret associated with your Square application. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes

redirectUri | The redirect URL assigned in the Square application dashboard (i.e. http://localhost:2500). | Yes

useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No

useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over SSL. The default value is true. | No

usePeerVerification | Specifies whether to verify the identity of the server when connecting over SSL. The default value is true. | No

Example:

{
"name": "SquareLinkedService",
"properties": {
"type": "Square",
"typeProperties": {
"host" : "mystore.mysquare.com",
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"redirectUri" : "http://localhost:2500"
}
}
}
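
As noted in the property table, the clientSecret can instead reference a secret stored in Azure Key Vault. The following is a sketch of that variant, assuming a Key Vault linked service and secret name of your own:

{
    "name": "SquareLinkedService",
    "properties": {
        "type": "Square",
        "typeProperties": {
            "host" : "mystore.mysquare.com",
            "clientId" : "<clientId>",
            "clientSecret": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secretName>"
            },
            "redirectUri" : "http://localhost:2500"
        }
    }
}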

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Square dataset.
To copy data from Square, set the type property of the dataset to SquareObject. The following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to: SquareObject | Yes

tableName | Name of the table. | No (if "query" in activity source is specified)

Example

{
"name": "SquareDataset",
"properties": {
"type": "SquareObject",
"linkedServiceName": {
"referenceName": "<Square linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Square source.
Square as source
To copy data from Square, set the source type in the copy activity to SquareSource. The following properties are
supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity source must be set to: SquareSource | Yes

query | Use the custom SQL query to read data. For example: "SELECT * FROM Business". | No (if "tableName" in dataset is specified)

Example:
"activities":[
{
"name": "CopyFromSquare",
"type": "Copy",
"inputs": [
{
"referenceName": "<Square input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SquareSource",
"query": "SELECT * FROM Business"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Sybase using Azure Data Factory
1/3/2019 • 3 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a Sybase database. It
builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from Sybase database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Sybase connector supports:
SAP Sybase SQL Anywhere (ASA) version 16 and above; IQ and ASE are not supported.
Copying data using Basic or Windows authentication.

Prerequisites
To use this Sybase connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the data provider for Sybase iAnywhere.Data.SQLAnywhere 16 or above on the Integration Runtime
machine.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Sybase connector.

Linked service properties


The following properties are supported for Sybase linked service:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to: Sybase | Yes

server | Name of the Sybase server. | Yes

database | Name of the Sybase database. | Yes

authenticationType | Type of authentication used to connect to the Sybase database. Allowed values are: Basic, and Windows. | Yes

username | Specify user name to connect to the Sybase database. | Yes

password | Specify password for the user account you specified for the username. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes

connectVia | The Integration Runtime to be used to connect to the data store. A Self-hosted Integration Runtime is required as mentioned in Prerequisites. | Yes

Example:

{
"name": "SybaseLinkedService",
"properties": {
"type": "Sybase",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
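
For Windows authentication, the linked service is expected to look similar, with authenticationType switched to Windows and a domain-qualified user name. This is a sketch based on the property table above rather than a sample from the original article:

{
    "name": "SybaseLinkedService",
    "properties": {
        "type": "Sybase",
        "typeProperties": {
            "server": "<server>",
            "database": "<database>",
            "authenticationType": "Windows",
            "username": "<domain>\\<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}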

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Sybase dataset.
To copy data from Sybase, set the type property of the dataset to RelationalTable. The following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to: RelationalTable | Yes

tableName | Name of the table in the Sybase database. | No (if "query" in activity source is specified)

Example

{
"name": "SybaseDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<Sybase linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Sybase source.
Sybase as source
To copy data from Sybase, set the source type in the copy activity to RelationalSource. The following properties
are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity source must be set to: RelationalSource | Yes

query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in dataset is specified)

Example:
"activities":[
{
"name": "CopyFromSybase",
"type": "Copy",
"inputs": [
{
"referenceName": "<Sybase input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Data type mapping for Sybase


When copying data from Sybase, the following mappings are used from Sybase data types to Azure Data Factory
interim data types. See Schema and data type mappings to learn about how copy activity maps the source schema
and data type to the sink.
Sybase supports T-SQL types. For a mapping table from SQL types to Azure Data Factory interim data types, see
Azure SQL Database Connector - data type mapping section.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Teradata using Azure Data Factory
1/3/2019 • 4 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a Teradata database. It
builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from Teradata database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Teradata connector supports:
Teradata version 12 and above.
Copying data using Basic or Windows authentication.

Prerequisites
To use this Teradata connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the .NET Data Provider for Teradata version 14 or above on the Integration Runtime machine.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Teradata connector.

Linked service properties


The following properties are supported for Teradata linked service:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to: Teradata | Yes

server | Name of the Teradata server. | Yes

authenticationType | Type of authentication used to connect to the Teradata database. Allowed values are: Basic, and Windows. | Yes

username | Specify user name to connect to the Teradata database. | Yes

password | Specify password for the user account you specified for the username. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes

connectVia | The Integration Runtime to be used to connect to the data store. A Self-hosted Integration Runtime is required as mentioned in Prerequisites. | Yes

Example:

{
"name": "TeradataLinkedService",
"properties": {
"type": "Teradata",
"typeProperties": {
"server": "<server>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
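
As noted in the property table, the password can also reference a secret stored in Azure Key Vault instead of being stored inline; the following sketch assumes a Key Vault linked service and secret name of your own:

{
    "name": "TeradataLinkedService",
    "properties": {
        "type": "Teradata",
        "typeProperties": {
            "server": "<server>",
            "authenticationType": "Basic",
            "username": "<username>",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secretName>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}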

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Teradata dataset.
To copy data from Teradata, set the type property of the dataset to RelationalTable. The following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to: RelationalTable | Yes

tableName | Name of the table in the Teradata database. | No (if "query" in activity source is specified)

Example:

{
"name": "TeradataDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<Teradata linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Teradata source.
Teradata as source
To copy data from Teradata, set the source type in the copy activity to RelationalSource. The following properties
are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity source must be set to: RelationalSource | Yes

query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in dataset is specified)

Example:
"activities":[
{
"name": "CopyFromTeradata",
"type": "Copy",
"inputs": [
{
"referenceName": "<Teradata input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Data type mapping for Teradata


When copying data from Teradata, the following mappings are used from Teradata data types to Azure Data
Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the source
schema and data type to the sink.

TERADATA DATA TYPE DATA FACTORY INTERIM DATA TYPE

BigInt Int64

Blob Byte[]

Byte Byte[]

ByteInt Int16

Char String

Clob String

Date DateTime

Decimal Decimal

Double Double

Graphic String

Integer Int32

Interval Day TimeSpan

Interval Day To Hour TimeSpan

Interval Day To Minute TimeSpan

Interval Day To Second TimeSpan

Interval Hour TimeSpan

Interval Hour To Minute TimeSpan

Interval Hour To Second TimeSpan

Interval Minute TimeSpan

Interval Minute To Second TimeSpan

Interval Month String

Interval Second TimeSpan

Interval Year String

Interval Year To Month String

Number Double

Period(Date) String

Period(Time) String

Period(Time With Time Zone) String

Period(Timestamp) String

Period(Timestamp With Time Zone) String

SmallInt Int16

Time TimeSpan

Time With Time Zone String

Timestamp DateTime

Timestamp With Time Zone DateTimeOffset

VarByte Byte[]

VarChar String

VarGraphic String

Xml String

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Vertica using Azure Data Factory
2/1/2019 • 3 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Vertica. It builds on the
copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from Vertica to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
You can create a pipeline with copy activity using .NET SDK, Python SDK, Azure PowerShell, REST API, or Azure
Resource Manager template. See Copy activity tutorial for step-by-step instructions to create a pipeline with a
copy activity.
The following sections provide details about properties that are used to define Data Factory entities specific to
Vertica connector.

Linked service properties


The following properties are supported for Vertica linked service:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to: Vertica | Yes

connectionString | An ODBC connection string to connect to Vertica. Mark this field as a SecureString to store it securely in Data Factory. You can also put the password in Azure Key Vault and pull the pwd configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. | Yes

connectVia | The Integration Runtime to be used to connect to the data store. You can use Self-hosted Integration Runtime or Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime. | No

Example:
{
"name": "VerticaLinkedService",
"properties": {
"type": "Vertica",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;PWD=<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store password in Azure Key Vault

{
"name": "VerticaLinkedService",
"properties": {
"type": "Vertica",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;"
},
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Vertica dataset.
To copy data from Vertica, set the type property of the dataset to VerticaTable. The following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to: VerticaTable | Yes

tableName | Name of the table. | No (if "query" in activity source is specified)

Example
{
"name": "VerticaDataset",
"properties": {
"type": "VerticaTable",
"linkedServiceName": {
"referenceName": "<Vertica linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Vertica source.
Vertica as source
To copy data from Vertica, set the source type in the copy activity to VerticaSource. The following properties are
supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity source must be set to: VerticaSource | Yes

query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in dataset is specified)

Example:

"activities":[
{
"name": "CopyFromVertica",
"type": "Copy",
"inputs": [
{
"referenceName": "<Vertica input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "VerticaSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Web table by using Azure Data
Factory
1/3/2019 • 4 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a Web table database.
It builds on the copy activity overview article that presents a general overview of copy activity.
The differences among this Web table connector, the REST connector, and the HTTP connector are:
The Web table connector extracts table content from an HTML webpage.
The REST connector specifically supports copying data from RESTful APIs.
The HTTP connector is generic to retrieve data from any HTTP endpoint, for example, to download a file.

Supported capabilities
You can copy data from a Web table to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Web table connector supports extracting table content from an HTML page.

Prerequisites
To use this Web table connector, you need to set up a Self-hosted Integration Runtime. See Self-hosted
Integration Runtime article for details.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Web table connector.

Linked service properties


The following properties are supported for Web table linked service:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to: Web | Yes

url | URL to the Web source | Yes

authenticationType | Allowed value is: Anonymous. | Yes

connectVia | The Integration Runtime to be used to connect to the data store. A Self-hosted Integration Runtime is required as mentioned in Prerequisites. | Yes

Example:

{
"name": "WebLinkedService",
"properties": {
"type": "Web",
"typeProperties": {
"url" : "https://en.wikipedia.org/wiki/",
"authenticationType": "Anonymous"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Web table dataset.
To copy data from Web table, set the type property of the dataset to WebTable. The following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to: WebTable | Yes

path | A relative URL to the resource that contains the table. | No. When path is not specified, only the URL specified in the linked service definition is used.

index | The index of the table in the resource. See the Get index of a table in an HTML page section for steps to getting the index of a table in an HTML page. | Yes

Example:
{
"name": "WebTableInput",
"properties": {
"type": "WebTable",
"linkedServiceName": {
"referenceName": "<Web linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"index": 1,
"path": "AFI's_100_Years...100_Movies"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Web table source.
Web table as source
To copy data from Web table, set the source type in the copy activity to WebSource; no additional properties are
supported.
Example:

"activities":[
{
"name": "CopyFromWebTable",
"type": "Copy",
"inputs": [
{
"referenceName": "<Web table input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "WebSource"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Get index of a table in an HTML page


To get the index of a table that you need to configure in the dataset properties, you can use, for example, Excel 2016
as the tool as follows:
1. Launch Excel 2016 and switch to the Data tab.
2. Click New Query on the toolbar, point to From Other Sources and click From Web.
3. In the From Web dialog box, enter URL that you would use in linked service JSON (for example:
https://en.wikipedia.org/wiki/) along with path you would specify for the dataset (for example:
AFI%27s_100_Years...100_Movies), and click OK.

URL used in this example: https://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movies


4. If you see Access Web content dialog box, select the right URL, authentication, and click Connect.
5. Click a table item in the tree view to see content from the table and then click Edit button at the bottom.

6. In the Query Editor window, click Advanced Editor button on the toolbar.
7. In the Advanced Editor dialog box, the number next to "Source" is the index.

If you are using Excel 2013, use Microsoft Power Query for Excel to get the index. See Connect to a web page
article for details. The steps are similar if you are using Microsoft Power BI for Desktop.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Xero using Azure Data Factory
(Preview)
1/3/2019 • 4 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Xero. It builds on the
copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and provide feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Xero to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Xero connector supports:
Xero private application but not public application.
All Xero tables (API endpoints) except "Reports".
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Xero connector.

Linked service properties


The following properties are supported for Xero linked service:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to: Xero | Yes

host | The endpoint of the Xero server (api.xero.com). | Yes

consumerKey | The consumer key associated with the Xero application. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes

privateKey | The private key from the .pem file that was generated for your Xero private application, see Create a public/private key pair. Note to generate the privatekey.pem with numbits of 512 using "openssl genrsa -out privatekey.pem 512"; 1024 is not supported. Include all the text from the .pem file including the Unix line endings (\n), see sample below. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes

useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No

useHostVerification | Specifies whether the host name is required in the server's certificate to match the host name of the server when connecting over SSL. The default value is true. | No

usePeerVerification | Specifies whether to verify the identity of the server when connecting over SSL. The default value is true. | No

Example:
{
"name": "XeroLinkedService",
"properties": {
"type": "Xero",
"typeProperties": {
"host" : "api.xero.com",
"consumerKey": {
"type": "SecureString",
"value": "<consumerKey>"
},
"privateKey": {
"type": "SecureString",
"value": "<privateKey>"
}
}
}
}

Sample private key value:


Include all the text from the .pem file including the Unix line endings(\n).

"-----BEGIN RSA PRIVATE KEY-----


\nMII***************************************************P\nbu*************************************************
***s\nU/****************************************************B\nA**********************************************
*******W\njH****************************************************e\nsx*****************************************
************l\nq******************************************************X\nh************************************
*****************i\nd*****************************************************s\nA********************************
*********************dsfb\nN*****************************************************M\np*************************
****************************Ly\nK*****************************************************Y=\n-----END RSA PRIVATE
KEY-----"

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Xero dataset.
To copy data from Xero, set the type property of the dataset to XeroObject. The following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to: XeroObject | Yes

tableName | Name of the table. | No (if "query" in activity source is specified)

Example
{
"name": "XeroDataset",
"properties": {
"type": "XeroObject",
"linkedServiceName": {
"referenceName": "<Xero linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Xero source.
Xero as source
To copy data from Xero, set the source type in the copy activity to XeroSource. The following properties are
supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity source must be set to: XeroSource | Yes

query | Use the custom SQL query to read data. For example: "SELECT * FROM Contacts". | No (if "tableName" in dataset is specified)

Example:

"activities":[
{
"name": "CopyFromXero",
"type": "Copy",
"inputs": [
{
"referenceName": "<Xero input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "XeroSource",
"query": "SELECT * FROM Contacts"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Note the following when specifying the Xero query:
Tables with complex items are split into multiple tables. For example, Bank Transactions has a complex
data structure "LineItems", so bank transaction data is mapped to the tables Bank_Transaction and
Bank_Transaction_Line_Items, with Bank_Transaction_ID as the foreign key that links them together.

Xero data is available through two schemas: Minimal (default) and Complete. The Complete schema
contains prerequisite call tables that require additional data (for example, an ID column) before making the
desired query. A sample query against the Complete schema is shown after the table lists below.
The following tables have the same information in the Minimal and Complete schemas. To reduce the number of
API calls, use the Minimal schema (the default).
Bank_Transactions
Contact_Groups
Contacts
Contacts_Sales_Tracking_Categories
Contacts_Phones
Contacts_Addresses
Contacts_Purchases_Tracking_Categories
Credit_Notes
Credit_Notes_Allocations
Expense_Claims
Expense_Claim_Validation_Errors
Invoices
Invoices_Credit_Notes
Invoices_Prepayments
Invoices_Overpayments
Manual_Journals
Overpayments
Overpayments_Allocations
Prepayments
Prepayments_Allocations
Receipts
Receipt_Validation_Errors
Tracking_Categories
The following tables can only be queried with complete schema:
Complete.Bank_Transaction_Line_Items
Complete.Bank_Transaction_Line_Item_Tracking
Complete.Contact_Group_Contacts
Complete.Contacts_Contact_Persons
Complete.Credit_Note_Line_Items
Complete.Credit_Notes_Line_Items_Tracking
Complete.Expense_Claim_Payments
Complete.Expense_Claim_Receipts
Complete.Invoice_Line_Items
Complete.Invoices_Line_Items_Tracking
Complete.Manual_Journal_Lines
Complete.Manual_Journal_Line_Tracking
Complete.Overpayment_Line_Items
Complete.Overpayment_Line_Items_Tracking
Complete.Prepayment_Line_Items
Complete.Prepayment_Line_Item_Tracking
Complete.Receipt_Line_Items
Complete.Receipt_Line_Item_Tracking
Complete.Tracking_Category_Options
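
For instance, a copy activity source that reads one of the Complete-schema tables listed above could use a query like this sketch:

"source": {
    "type": "XeroSource",
    "query": "SELECT * FROM Complete.Invoice_Line_Items"
}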

Next steps
For a list of supported data stores by the copy activity, see supported data stores.
Copy data from Zoho using Azure Data Factory
(Preview)
1/3/2019 • 3 minutes to read

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Zoho. It builds on the
copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.

Supported capabilities
You can copy data from Zoho to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Zoho connector.

Linked service properties


The following properties are supported for Zoho linked service:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to: Zoho | Yes

endpoint | The endpoint of the Zoho server (crm.zoho.com/crm/private). | Yes

accessToken | The access token for Zoho authentication. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes

useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No

useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over SSL. The default value is true. | No

usePeerVerification | Specifies whether to verify the identity of the server when connecting over SSL. The default value is true. | No

Example:

{
"name": "ZohoLinkedService",
"properties": {
"type": "Zoho",
"typeProperties": {
"endpoint" : "crm.zoho.com/crm/private",
"accessToken": {
"type": "SecureString",
"value": "<accessToken>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Zoho dataset.
To copy data from Zoho, set the type property of the dataset to ZohoObject. The following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to: ZohoObject | Yes

tableName | Name of the table. | No (if "query" in activity source is specified)

Example
{
"name": "ZohoDataset",
"properties": {
"type": "ZohoObject",
"linkedServiceName": {
"referenceName": "<Zoho linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Zoho source.
Zoho as source
To copy data from Zoho, set the source type in the copy activity to ZohoSource. The following properties are
supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity source must be set to: ZohoSource | Yes

query | Use the custom SQL query to read data. For example: "SELECT * FROM Accounts". | No (if "tableName" in dataset is specified)

Example:

"activities":[
{
"name": "CopyFromZoho",
"type": "Copy",
"inputs": [
{
"referenceName": "<Zoho input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ZohoSource",
"query": "SELECT * FROM Accounts"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy Activity in Azure Data Factory
5/6/2019 • 12 minutes to read

Overview
In Azure Data Factory, you can use Copy Activity to copy data among data stores
located on-premises and in the cloud. After the data is copied, it can be further
transformed and analyzed. You can also use Copy Activity to publish transformation
and analysis results for business intelligence (BI) and application consumption.

Copy Activity is executed on an Integration Runtime. Different data copy scenarios can use different flavors of
Integration Runtime:
When copying data between data stores that are both publicly accessible, copy
activity can run on the Azure Integration Runtime, which is secure,
reliable, scalable, and globally available.
When copying data from/to data stores located on-premises or in a network with
access control (for example, an Azure Virtual Network), you need to set up a Self-hosted
Integration Runtime to perform the copy.
An Integration Runtime needs to be associated with each source and sink data store.
Learn details on how copy activity determines which IR to use.
Copy Activity goes through the following stages to copy data from a source to a sink.
The service that powers Copy Activity:
1. Reads data from a source data store.
2. Performs serialization/deserialization, compression/decompression, column
mapping, etc. It does these operations based on the configurations of the input
dataset, output dataset, and Copy Activity.
3. Writes data to the sink/destination data store.

Supported data stores and formats


CATEGORY | DATA STORE | SUPPORTED AS A SOURCE | SUPPORTED AS A SINK | SUPPORTED BY AZURE IR | SUPPORTED BY SELF-HOSTED IR

Azure Azure Blob ✓ ✓ ✓ ✓


Storage

Azure ✓ ✓ ✓ ✓
Cosmos DB
(SQL API)

Azure ✓ ✓ ✓ ✓
Cosmos
DB's API for
MongoDB

Azure Data ✓ ✓ ✓ ✓
Explorer

Azure Data ✓ ✓ ✓ ✓
Lake
Storage
Gen1

Azure Data ✓ ✓ ✓ ✓
Lake
Storage
Gen2

Azure ✓ ✓ ✓
Database
for MariaDB

Azure ✓ ✓ ✓
Database
for MySQL

Azure ✓ ✓ ✓
Database
for
PostgreSQL

Azure File ✓ ✓ ✓ ✓
Storage

Azure SQL ✓ ✓ ✓ ✓
Database

Azure SQL ✓ ✓ ✓
Database
Managed
Instance

Azure SQL ✓ ✓ ✓ ✓
Data
Warehouse

Azure ✓ ✓ ✓
Search
Index

Azure Table ✓ ✓ ✓ ✓
Storage

Database Amazon ✓ ✓ ✓
Redshift

DB2 ✓ ✓ ✓

Drill ✓ ✓ ✓
(Preview)

Google ✓ ✓ ✓
BigQuery

Greenplum ✓ ✓ ✓

HBase ✓ ✓ ✓

Hive ✓ ✓ ✓

Apache ✓ ✓ ✓
Impala
(Preview)

Informix ✓ ✓

MariaDB ✓ ✓ ✓

Microsoft ✓ ✓
Access

MySQL ✓ ✓ ✓

Netezza ✓ ✓ ✓

Oracle ✓ ✓ ✓ ✓

Phoenix ✓ ✓ ✓

PostgreSQL ✓ ✓ ✓

Presto ✓ ✓ ✓
(Preview)

SAP ✓ ✓
Business
Warehouse
Open Hub

SAP ✓ ✓
Business
Warehouse
via MDX

SAP HANA ✓ ✓ ✓

SAP Table ✓ ✓ ✓

Spark ✓ ✓ ✓

SQL Server ✓ ✓ ✓ ✓

Sybase ✓ ✓

Teradata ✓ ✓

Vertica ✓ ✓ ✓

NoSQL Cassandra ✓ ✓ ✓

Couchbase ✓ ✓ ✓
(Preview)

MongoDB ✓ ✓ ✓

File Amazon S3 ✓ ✓ ✓

File System ✓ ✓ ✓ ✓

FTP ✓ ✓ ✓

Google ✓ ✓ ✓
Cloud
Storage

HDFS ✓ ✓ ✓

SFTP ✓ ✓ ✓

Generic Generic ✓ ✓ ✓
protocol HTTP

Generic ✓ ✓ ✓
OData

Generic ✓ ✓ ✓
ODBC

Generic ✓ ✓ ✓
REST

Services Amazon ✓ ✓ ✓
and apps Marketplac
e Web
Service
(Preview)

Common ✓ ✓ ✓ ✓
Data
Service for
Apps

Concur ✓ ✓ ✓
(Preview)

Dynamics ✓ ✓ ✓ ✓
365

Dynamics ✓ ✓ ✓
AX
(Preview)

Dynamics ✓ ✓ ✓ ✓
CRM

Google ✓ ✓ ✓
AdWords
(Preview)

HubSpot ✓ ✓ ✓
(Preview)

Jira ✓ ✓ ✓
(Preview)

Magento ✓ ✓ ✓
(Preview)

Marketo ✓ ✓ ✓
(Preview)

Office 365 ✓ ✓ ✓

Oracle ✓ ✓ ✓
Eloqua
(Preview)

Oracle ✓ ✓ ✓
Responsys
(Preview)

Oracle ✓ ✓ ✓
Service
Cloud
(Preview)

Paypal ✓ ✓ ✓
(Preview)

QuickBooks ✓ ✓ ✓
(Preview)

Salesforce ✓ ✓ ✓ ✓

Salesforce ✓ ✓ ✓ ✓
Service
Cloud

Salesforce ✓ ✓ ✓
Marketing
Cloud
(Preview)

SAP Cloud ✓ ✓ ✓ ✓
for
Customer
(C4C)

SAP ECC ✓ ✓ ✓

ServiceNow ✓ ✓ ✓

Shopify ✓ ✓ ✓
(Preview)

Square ✓ ✓ ✓
(Preview)

Web Table ✓ ✓
(HTML
table)

Xero ✓ ✓ ✓
(Preview)

Zoho ✓ ✓ ✓
(Preview)

NOTE
Any connector marked as Preview means that you can try it out and give us feedback. If
you want to take a dependency on preview connectors in your solution, please contact
Azure support.
Supported file formats
You can use Copy Activity to copy files as-is between two file-based data stores, in
which case the data is copied efficiently without any serialization or deserialization.
Copy Activity also supports reading from and writing to files in specified formats:
Text, JSON, Avro, ORC, and Parquet, and compressing and decompressing files
with the following codecs: GZip, Deflate, BZip2, and ZipDeflate. See Supported
file and compression formats for details.
For example, you can do the following copy activities:
Copy data from an on-premises SQL Server database and write it to Azure Data Lake
Storage Gen2 in Parquet format.
Copy files in text (CSV) format from an on-premises file system and write them to Azure
Blob storage in Avro format.
Copy zipped files from an on-premises file system, decompress them on the fly, and land
them in Azure Data Lake Storage Gen2.
Copy data in GZip-compressed text (CSV) format from Azure Blob storage and write it to
Azure SQL Database.
And many other scenarios that require serialization/deserialization or
compression/decompression.
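For instance, the GZip-compressed text (CSV) scenario above only needs format and compression settings on the source dataset. The following is a minimal sketch of such an Azure Blob dataset; the dataset name, linked service name, and folder path are illustrative placeholders, not values from this article.

{
    "name": "GzippedCsvInputDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "inputcontainer/raw",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "firstRowAsHeader": true
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        }
    }
}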

Supported regions
The service that powers Copy Activity is available globally in the regions and
geographies listed in Azure Integration Runtime locations. The globally available
topology ensures efficient data movement that usually avoids cross-region hops. See
Services by region for availability of Data Factory and Data Movement in a region.

Configuration
To use the copy activity in Azure Data Factory, you need to:
1. Create linked services for the source data store and the sink data store. Refer to the
connector article's "Linked service properties" section for how to configure them and the
supported properties. You can find the supported connector list in the Supported
data stores and formats section.
2. Create datasets for the source and sink. Refer to the source and sink connector
articles' "Dataset properties" section for how to configure them and the supported
properties.
3. Create a pipeline with the copy activity. The next section provides an example.
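As an illustration of steps 1 and 2, a minimal Azure Blob Storage linked service and dataset might look like the following sketch. The names and the connection string placeholder are hypothetical; the connector articles list the full set of supported properties.

{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>"
            }
        }
    }
}

{
    "name": "BlobInputDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "inputcontainer/raw",
            "fileName": "data.csv"
        }
    }
}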
Syntax
The following template of a copy activity contains an exhaustive list of supported
properties. Specify the ones that fit your scenario.
"activities":[
{
"name": "CopyActivityTemplate",
"type": "Copy",
"inputs": [
{
"referenceName": "<source dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<sink dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>",
<properties>
},
"sink": {
"type": "<sink type>"
<properties>
},
"translator":
{
"type": "TabularTranslator",
"columnMappings": "<column mapping>"
},
"dataIntegrationUnits": <number>,
"parallelCopies": <number>,
"enableStaging": true/false,
"stagingSettings": {
<properties>
},
"enableSkipIncompatibleRow": true/false,
"redirectIncompatibleRowSettings": {
<properties>
}
}
}
]

Syntax details
| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of a copy activity must be set to: Copy | Yes |
| inputs | Specify the dataset you created that points to the source data. The copy activity supports only a single input. | Yes |
| outputs | Specify the dataset you created that points to the sink data. The copy activity supports only a single output. | Yes |
| typeProperties | A group of properties to configure the copy activity. | Yes |
| source | Specify the copy source type and the corresponding properties on how to retrieve data. Learn details from the "Copy activity properties" section of the connector article listed in Supported data stores and formats. | Yes |
| sink | Specify the copy sink type and the corresponding properties on how to write data. Learn details from the "Copy activity properties" section of the connector article listed in Supported data stores and formats. | Yes |
| translator | Specify explicit column mappings from source to sink. Applies when the default copy behavior cannot fulfill your need. Learn details from Schema and data type mapping. | No |
| dataIntegrationUnits | Specify the measure of power of the Azure Integration Runtime used for the data copy. Formerly known as cloud Data Movement Units (DMU). Learn details from Data Integration Units. | No |
| parallelCopies | Specify the parallelism that you want the copy activity to use when reading data from the source and writing data to the sink. Learn details from Parallel copy. | No |
| enableStaging, stagingSettings | Choose to stage the interim data in a blob storage instead of directly copying data from source to sink. Learn the useful scenarios and configuration details from Staged copy. | No |
| enableSkipIncompatibleRow, redirectIncompatibleRowSettings | Choose how to handle incompatible rows when copying data from source to sink. Learn details from Fault tolerance. | No |

Monitoring
You can monitor the copy activity run in the Azure Data Factory "Author & Monitor" UI
or programmatically. You can then compare the performance and configuration of
your scenario to Copy Activity's performance reference from in-house testing.
Monitor visually
To visually monitor the copy activity run, go to your data factory -> Author &
Monitor -> Monitor tab, where you see a list of pipeline runs with a "View Activity Runs"
link in the Actions column.

Click the link to see the list of activities in this pipeline run. In the Actions column, you have
links to the copy activity's input, output, errors (if the copy activity run fails), and details.

Click the "Details" link under Actions to see the copy activity's execution details and
performance characteristics. It shows information including the volume/rows/files of
data copied from source to sink, the throughput, the steps the copy goes through with
corresponding durations, and the configurations used for your copy scenario.
TIP
For some scenarios, you will also see "Performance tuning tips" at the top of the copy
monitoring page, which tell you the bottleneck identified and guide you on what to
change to boost copy throughput. See an example with details here.

Example: copy from Amazon S3 to Azure Data Lake Store

Example: copy from Azure SQL Database to Azure SQL Data Warehouse
using staged copy

Monitor programmatically
Copy activity execution details and performance characteristics are also returned in
the Copy Activity run result -> Output section. Below is an exhaustive list; only the
properties applicable to your copy scenario show up. Learn how to monitor activity
runs from the quickstart's monitoring section.

| PROPERTY NAME | DESCRIPTION | UNIT |
| --- | --- | --- |
| dataRead | Data size read from the source. | Int64 value, in bytes |
| dataWritten | Data size written to the sink. | Int64 value, in bytes |
| filesRead | Number of files copied when copying data from file storage. | Int64 value (no unit) |
| filesWritten | Number of files copied when copying data to file storage. | Int64 value (no unit) |
| rowsRead | Number of rows read from the source (not applicable for binary copy). | Int64 value (no unit) |
| rowsCopied | Number of rows copied to the sink (not applicable for binary copy). | Int64 value (no unit) |
| rowsSkipped | Number of incompatible rows that were skipped. You can turn on this feature by setting "enableSkipIncompatibleRow" to true. | Int64 value (no unit) |
| throughput | Rate at which data is transferred. | Floating point number, in KB/s |
| copyDuration | The duration of the copy. | Int32 value, in seconds |
| sourcePeakConnections | Peak number of concurrent connections established to the source data store during the copy. | Int32 value |
| sinkPeakConnections | Peak number of concurrent connections established to the sink data store during the copy. | Int32 value |
| sqlDwPolyBase | Whether PolyBase is used when copying data into SQL Data Warehouse. | Boolean |
| redshiftUnload | Whether UNLOAD is used when copying data from Redshift. | Boolean |
| hdfsDistcp | Whether DistCp is used when copying data from HDFS. | Boolean |
| effectiveIntegrationRuntime | The Integration Runtime(s) used to power the activity run, in the format <IR name> (<region if it's Azure IR>). | Text (string) |
| usedDataIntegrationUnits | The effective Data Integration Units during the copy. | Int32 value |
| usedParallelCopies | The effective parallelCopies during the copy. | Int32 value |
| redirectRowPath | Path to the log of skipped incompatible rows in the blob storage you configure under "redirectIncompatibleRowSettings". See the example below. | Text (string) |
| executionDetails | More details on the stages the copy activity goes through, and the corresponding steps, durations, used configurations, and so on. It's not recommended to parse this section because it may change. | Array |

"output": {
"dataRead": 107280845500,
"dataWritten": 107280845500,
"filesRead": 10,
"filesWritten": 10,
"copyDuration": 224,
"throughput": 467707.344,
"errors": [],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US 2)",
"usedDataIntegrationUnits": 32,
"usedParallelCopies": 8,
"executionDetails": [
{
"source": {
"type": "AmazonS3"
},
"sink": {
"type": "AzureDataLakeStore"
},
"status": "Succeeded",
"start": "2018-01-17T15:13:00.3515165Z",
"duration": 221,
"usedDataIntegrationUnits": 32,
"usedParallelCopies": 8,
"detailedDurations": {
"queuingDuration": 2,
"transferDuration": 219
}
}
]
}

Schema and data type mapping


See Schema and data type mapping, which describes how the copy activity maps
your source data to the sink.
Fault tolerance
By default, the copy activity stops copying data and returns a failure when it encounters
incompatible data between source and sink. You can explicitly configure the activity to skip and
log the incompatible rows and copy only the compatible data so that the copy
succeeds. See Copy Activity fault tolerance for more details.
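As a sketch, the relevant settings sit under the copy activity's typeProperties; the linked service name and log path below are illustrative placeholders:

"typeProperties": {
    "source": {
        "type": "SqlSource"
    },
    "sink": {
        "type": "SqlSink"
    },
    "enableSkipIncompatibleRow": true,
    "redirectIncompatibleRowSettings": {
        "linkedServiceName": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "path": "errorlogs/incompatiblerows"
    }
}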

Performance and tuning


See the Copy Activity performance and tuning guide, which describes key factors
that affect the performance of data movement (Copy Activity) in Azure Data Factory.
It also lists the observed performance during internal testing and discusses various
ways to optimize the performance of Copy Activity.
In some cases, when you execute a copy activity in ADF, you will see
"Performance tuning tips" at the top of the copy activity monitoring page, as shown in
the following example. The tips not only tell you the bottleneck identified for the given
copy run, but also guide you on what to change to boost copy throughput. The
performance tuning tips currently provide suggestions such as using PolyBase when
copying data into Azure SQL Data Warehouse, increasing Azure Cosmos DB RUs or
Azure SQL Database DTUs when the resource on the data store side is the bottleneck, and
removing unnecessary staged copies. The performance tuning rules will be
gradually enriched as well.
Example: copy into Azure SQL DB with performance tuning tips
In this sample, during the copy run, ADF notices that the sink Azure SQL database reaches high
DTU utilization, which slows down the write operations. The suggestion is to
increase the Azure SQL Database tier to get more DTUs.

Incremental copy
Data Factory supports scenarios for incrementally copying delta data from a source
data store to a destination data store. See Tutorial: incrementally copy data.

Read and write partitioned data


In version 1, Azure Data Factory supported reading or writing partitioned data by
using the SliceStart/SliceEnd/WindowStart/WindowEnd system variables. In the current
version, you can achieve this behavior by using a pipeline parameter and the trigger's
start time or scheduled time as the value of that parameter. For more information, see
How to read or write partitioned data.
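For example, a schedule trigger can pass its scheduled time to a pipeline parameter (here called windowStart, an illustrative name), which the pipeline then passes down to a dataset parameter of the same name:

"pipelines": [
    {
        "pipelineReference": {
            "referenceName": "CopyPartitionedData",
            "type": "PipelineReference"
        },
        "parameters": {
            "windowStart": "@trigger().scheduledTime"
        }
    }
]

The dataset can then build a time-partitioned folder path from that value:

"folderPath": {
    "value": "@concat('rawdata/', formatDateTime(dataset().windowStart, 'yyyy/MM/dd'))",
    "type": "Expression"
}

This is only a sketch under those naming assumptions; the sample pipelines in the Delete Activity article that follows use the same pattern end to end.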

Next steps
See the following quickstarts, tutorials, and samples:
Copy data from one location to another location in the same Azure Blob Storage
Copy data from Azure Blob Storage to Azure SQL Database
Copy data from on-premises SQL Server to Azure
Delete Activity in Azure Data Factory
4/2/2019 • 7 minutes to read • Edit Online

You can use the Delete Activity in Azure Data Factory to delete files or folders from on-premises storage stores or
cloud storage stores. Use this activity to clean up or archive files when they are no longer needed.

WARNING
Deleted files or folders cannot be restored. Be cautious when using the Delete activity to delete files or folders.

Best practices
Here are some recommendations for using the Delete activity:
Back up your files before deleting them with the Delete activity in case you need to restore them in the
future.
Make sure that Data Factory has write permissions to delete folders or files from the storage store.
Make sure you are not deleting files that are being written at the same time.
If you want to delete files or folders from an on-premises system, make sure you are using a self-hosted
integration runtime with a version greater than 3.14.

Supported data stores


Azure Blob storage
Azure Data Lake Storage Gen1
Azure Data Lake Storage Gen2
File system data stores
File System
FTP
SFTP
Amazon S3

Syntax
{
"name": "DeleteActivity",
"type": "Delete",
"typeProperties": {
"dataset": {
"referenceName": "<dataset name>",
"type": "DatasetReference"
},
"recursive": true/false,
"maxConcurrentConnections": <number>,
"enableLogging": true/false,
"logStorageSettings": {
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference"
},
"path": "<path to save log file>"
}
}
}

Type properties
| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| dataset | Provides the dataset reference that determines which files or folder to delete. | Yes |
| recursive | Indicates whether the files are deleted recursively from the subfolders or only from the specified folder. | No. The default is false. |
| maxConcurrentConnections | The number of concurrent connections to the storage store for deleting folders or files. | No. The default is 1. |
| enablelogging | Indicates whether you need to record the folder or file names that have been deleted. If true, you need to further provide a storage account to save the log file, so that you can track the behavior of the Delete activity by reading the log file. | No |
| logStorageSettings | Only applicable when enablelogging = true. A group of storage properties that specify where to save the log file containing the folder or file names deleted by the Delete activity. | No |
| linkedServiceName | Only applicable when enablelogging = true. The linked service of Azure Storage, Azure Data Lake Storage Gen1, or Azure Data Lake Storage Gen2 used to store the log file that contains the folder or file names deleted by the Delete activity. | No |
| path | Only applicable when enablelogging = true. The path for saving the log file in your storage account. If you do not provide a path, the service creates a container for you. | No |

Monitoring
There are two places where you can see and monitor the results of the Delete activity:
From the output of the Delete activity.
From the log file.
Sample output of the Delete activity

{
"datasetName": "AmazonS3",
"type": "AmazonS3Object",
"prefix": "test",
"bucketName": "adf",
"recursive": true,
"isWildcardUsed": false,
"maxConcurrentConnections": 2,
"filesDeleted": 4,
"logPath": "https://sample.blob.core.windows.net/mycontainer/5c698705-a6e2-40bf-911e-e0a927de3f07",
"effectiveIntegrationRuntime": "MyAzureIR (West Central US)",
"executionDuration": 650
}

Sample log file of the Delete activity


| NAME | CATEGORY | STATUS | ERROR |
| --- | --- | --- | --- |
| test1/yyy.json | File | Deleted | |
| test2/hello789.txt | File | Deleted | |
| test2/test3/hello000.txt | File | Deleted | |
| test2/test3/zzz.json | File | Deleted | |

Examples of using the Delete activity


Delete specific folders or files
The store has the following folder structure:
Root/
    Folder_A_1/
        1.txt
        2.txt
        3.csv
    Folder_A_2/
        4.txt
        5.csv
        Folder_B_1/
            6.txt
            7.csv
        Folder_B_2/
            8.txt
The following examples show how the Delete activity deletes folders or files for different combinations of
property values from the dataset and the Delete activity:

| FOLDERPATH (FROM DATASET) | FILENAME (FROM DATASET) | RECURSIVE (FROM THE DELETE ACTIVITY) | OUTPUT |
| --- | --- | --- | --- |
| Root/Folder_A_2 | NULL | False | Only the files directly under Folder_A_2 (4.txt and 5.csv) are deleted. Folder_B_1, Folder_B_2, and their files remain, as do Folder_A_1 and its files. |
| Root/Folder_A_2 | NULL | True | All files under Folder_A_2, including those in the subfolders Folder_B_1 and Folder_B_2 (4.txt, 5.csv, 6.txt, 7.csv, and 8.txt), are deleted. Folder_A_1 and its files remain. |
| Root/Folder_A_2 | *.txt | False | Only 4.txt, directly under Folder_A_2, is deleted. All other files and folders remain. |
| Root/Folder_A_2 | *.txt | True | All .txt files under Folder_A_2 and its subfolders (4.txt, 6.txt, and 8.txt) are deleted. The .csv files and Folder_A_1 remain. |

Periodically clean up the time-partitioned folder or files


You can create a pipeline to periodically clean up the time-partitioned folder or files. For example, the folder
structure is similar to /mycontainer/2018/12/14/*.csv. You can leverage the ADF system variables from the schedule trigger
to identify which folders or files should be deleted in each pipeline run.
Sample pipeline
{
"name": "cleanup_time_partitioned_folder",
"properties": {
"activities": [
{
"name": "DeleteOneFolder",
"type": "Delete",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"dataset": {
"referenceName": "PartitionedFolder",
"type": "DatasetReference",
"parameters": {
"TriggerTime": {
"value": "@formatDateTime(pipeline().parameters.TriggerTime, 'yyyy/MM/dd')",
"type": "Expression"
}
}
},
"recursive": true,
"logStorageSettings": {
"linkedServiceName": {
"referenceName": "BloblinkedService",
"type": "LinkedServiceReference"
},
"path": "mycontainer/log"
},
"enableLogging": true
}
}
],
"parameters": {
"TriggerTime": {
"type": "String"
}
}
},
"type": "Microsoft.DataFactory/factories/pipelines"
}

Sample dataset
{
"name": "PartitionedFolder",
"properties": {
"linkedServiceName": {
"referenceName": "BloblinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"TriggerTime": {
"type": "String"
}
},
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@concat('mycontainer/',dataset().TriggerTime)",
"type": "Expression"
}
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}

Sample trigger

{
"name": "DailyTrigger",
"properties": {
"runtimeState": "Started",
"pipelines": [
{
"pipelineReference": {
"referenceName": "cleanup_time_partitioned_folder",
"type": "PipelineReference"
},
"parameters": {
"TriggerTime": "@trigger().scheduledTime"
}
}
],
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Day",
"interval": 1,
"startTime": "2018-12-13T00:00:00.000Z",
"timeZone": "UTC",
"schedule": {
"minutes": [
59
],
"hours": [
23
]
}
}
}
}
}

Clean up expired files that were last modified before January 1, 2018
You can create a pipeline to clean up old or expired files by leveraging the file attribute filter "LastModified" in
the dataset.
Sample pipeline

{
"name": "CleanupExpiredFiles",
"properties": {
"activities": [
{
"name": "DeleteFilebyLastModified",
"type": "Delete",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"dataset": {
"referenceName": "BlobFilesLastModifiedBefore201811",
"type": "DatasetReference"
},
"recursive": true,
"logStorageSettings": {
"linkedServiceName": {
"referenceName": "BloblinkedService",
"type": "LinkedServiceReference"
},
"path": "mycontainer/log"
},
"enableLogging": true
}
}
]
}
}

Sample dataset

{
"name": "BlobFilesLastModifiedBefore201811",
"properties": {
"linkedServiceName": {
"referenceName": "BloblinkedService",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"fileName": "*",
"folderPath": "mycontainer",
"modifiedDatetimeEnd": "2018-01-01T00:00:00.000Z"
}
}
}

Move files by chaining the Copy activity and the Delete activity
You can move a file by using a copy activity to copy a file and then a delete activity to delete a file in a pipeline.
When you want to move multiple files, you can use the GetMetadata activity + Filter activity + Foreach activity +
Copy activity + Delete activity as in the following sample:
NOTE
If you want to move an entire folder by defining a dataset that contains only a folder path, and then using a copy activity
and a Delete activity that reference the same dataset (representing the folder), be very careful. You have
to make sure that no new files arrive in the folder between the copy operation and the delete operation. If
new files arrive in the folder at the moment when your copy activity has just completed the copy job but the Delete
activity has not yet started, the Delete activity may delete these newly arrived files, which have NOT yet been copied
to the destination, by deleting the entire folder.

Sample pipeline

{
"name": "MoveFiles",
"properties": {
"activities": [
{
"name": "GetFileList",
"type": "GetMetadata",
"typeProperties": {
"dataset": {
"referenceName": "OneSourceFolder",
"type": "DatasetReference"
},
"fieldList": [
"childItems"
]
}
},
{
"name": "FilterFiles",
"type": "Filter",
"dependsOn": [
{
"activity": "GetFileList",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "@activity('GetFileList').output.childItems",
"type": "Expression"
},
"condition": {
"value": "@equals(item().type, 'File')",
"type": "Expression"
}
}
},
{
"name": "ForEachFile",
"type": "ForEach",
"dependsOn": [
{
"activity": "FilterFiles",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "@activity('FilterFiles').output.value",
"type": "Expression"
},
"batchCount": 20,
"activities": [
{
"name": "CopyAFile",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": false
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "OneSourceFile",
"type": "DatasetReference",
"parameters": {
"path": "myFolder",
"filename": {
"value": "@item().name",
"type": "Expression"
}
}
}
],
"outputs": [
{
"referenceName": "OneDestinationFile",
"type": "DatasetReference",
"parameters": {
"DestinationFileName": {
"value": "@item().name",
"type": "Expression"
}
}
}
]
},
{
"name": "DeleteAFile",
"type": "Delete",
"dependsOn": [
{
"activity": "CopyAFile",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"dataset": {
"referenceName": "OneSourceFile",
"type": "DatasetReference",
"parameters": {
"path": "myFolder",
"filename": {
"value": "@item().name",
"type": "Expression"
}
}
},
"logStorageSettings": {
"linkedServiceName": {
"referenceName": "BloblinkedService",
"type": "LinkedServiceReference"
},
"path": "Container/log"
},
"enableLogging": true
}
}
]
}
}
]
}
}

Sample datasets
Dataset used by GetMetadata activity to enumerate the file list.

{
"name": "OneSourceFolder",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"fileName": "",
"folderPath": "myFolder"
}
}
}

Dataset for data source used by copy activity and the Delete activity.
{
"name": "OneSourceFile",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
},
"filename": {
"type": "String"
}
},
"type": "AzureBlob",
"typeProperties": {
"fileName": {
"value": "@dataset().filename",
"type": "Expression"
},
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
}
}
}

Dataset for data destination used by copy activity.

{
"name": "OneDestinationFile",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"DestinationFileName": {
"type": "String"
}
},
"type": "AzureBlob",
"typeProperties": {
"fileName": {
"value": "@dataset().DestinationFileName",
"type": "Expression"
},
"folderPath": "mycontainer/dest"
}
}
}

Known limitation
The Delete activity does not support deleting a list of folders described by a wildcard.
When using the file attribute filters modifiedDatetimeStart and modifiedDatetimeEnd to select files to be
deleted, make sure to set "fileName": "*" in the dataset.

Next steps
Learn more about moving files in Azure Data Factory.
Copy Data tool in Azure Data Factory
3/15/2019 • 5 minutes to read • Edit Online

The Azure Data Factory Copy Data tool eases and optimizes the process of ingesting data into a data lake, which is
usually a first step in an end-to-end data integration scenario. It saves time, especially when you use Azure Data
Factory to ingest data from a data source for the first time. Some of the benefits of using this tool are:
When using the Azure Data Factory Copy Data tool, you do not need to understand Data Factory definitions for
linked services, datasets, pipelines, activities, and triggers.
The flow of the Copy Data tool is intuitive for loading data into a data lake. The tool automatically creates all the
necessary Data Factory resources to copy data from the selected source data store to the selected
destination/sink data store.
The Copy Data tool helps you validate the data that is being ingested at the time of authoring, which helps you
catch potential errors early on.
If you need to implement complex business logic to load data into a data lake, you can still edit the Data Factory
resources created by the Copy Data tool by using the per-activity authoring in the Data Factory UI.
The following table provides guidance on when to use the Copy Data tool vs. per-activity authoring in Data
Factory UI:

| COPY DATA TOOL | PER-ACTIVITY (COPY ACTIVITY) AUTHORING |
| --- | --- |
| You want to easily build a data loading task without learning about Azure Data Factory entities (linked services, datasets, pipelines, etc.). | You want to implement complex and flexible logic for loading data into a data lake. |
| You want to quickly load a large number of data artifacts into a data lake. | You want to chain the Copy activity with subsequent activities for cleansing or processing data. |

To start the Copy Data tool, click the Copy Data tile on the home page of your data factory.

Intuitive flow for loading data into a data lake


This tool allows you to easily move data from a wide variety of sources to destinations in minutes with an intuitive
flow:
1. Configure settings for the source.
2. Configure settings for the destination.
3. Configure advanced settings for the copy operation such as column mapping, performance settings, and
fault tolerance settings.
4. Specify a schedule for the data loading task.
5. Review summary of Data Factory entities to be created.
6. Edit the pipeline to update settings for the copy activity as needed.
The tool is designed with big data in mind from the start, with support for diverse data and object types.
You can use it to move hundreds of folders, files, or tables. The tool supports automatic data preview,
schema capture and automatic mapping, and data filtering as well.

Automatic data preview


You can preview part of the data from the selected source data store, which allows you to validate the data that is
being copied. In addition, if the source data is in a text file, the Copy Data tool parses the text file to automatically
detect the row and column delimiters, and schema.
After the detection:

Schema capture and automatic mapping


The schema of the data source may not be the same as the schema of the data destination in many cases. In this scenario, you
need to map columns from the source schema to columns from the destination schema.
The Copy Data tool monitors and learns your behavior when you are mapping columns between the source and
destination stores. After you pick one or a few columns from the source data store and map them to the destination
schema, the Copy Data tool starts to analyze the pattern for the column pairs you picked from both sides. Then, it
applies the same pattern to the rest of the columns, so you see all the columns mapped to the
destination the way you want after just a few clicks. If you are not satisfied with the column mapping
provided by the Copy Data tool, you can ignore it and continue to map the columns manually. Meanwhile, the
Copy Data tool constantly learns and updates the pattern, and ultimately reaches the right pattern for the column
mapping you want to achieve.
NOTE
When copying data from SQL Server or Azure SQL Database into Azure SQL Data Warehouse, if the table does not exist in
the destination store, Copy Data tool supports creation of the table automatically by using the source schema.

Filter data
You can filter source data to select only the data that needs to be copied to the sink data store. Filtering reduces the
volume of the data to be copied to the sink data store and therefore enhances the throughput of the copy
operation. Copy Data tool provides a flexible way to filter data in a relational database by using the SQL query
language, or files in an Azure blob folder.
Filter data in a database
The following screenshot shows a SQL query to filter the data.

Filter data in an Azure blob folder


You can use variables in the folder path to copy data from a folder. The supported variables are: {year}, {month},
{day}, {hour}, and {minute}. For example: inputfolder/{year}/{month}/{day}.
Suppose that you have input folders in the following format:
2016/03/01/01
2016/03/01/02
2016/03/01/03
...

Click the Browse button for File or folder, browse to one of these folders (for example, 2016->03->01->02), and
click Choose. You should see 2016/03/01/02 in the text box.
Then, replace 2016 with {year}, 03 with {month}, 01 with {day}, and 02 with {hour}, and press the Tab key. You
should see drop-down lists to select the format for these four variables:

The Copy Data tool generates parameters with expressions, functions, and system variables that can be used to
represent {year}, {month}, {day}, {hour}, and {minute} when creating pipeline. For more information, see the How to
read or write partitioned data article.
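For example, after you replace the folder segments with {year}/{month}/{day}/{hour}, the generated dataset typically ends up with a parameterized folder path conceptually along these lines; the parameter name windowStart is an illustrative placeholder, and the tool may generate different names:

"folderPath": {
    "value": "@concat('inputfolder/', formatDateTime(dataset().windowStart, 'yyyy/MM/dd/HH'))",
    "type": "Expression"
}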

Scheduling options
You can run the copy operation once or on a schedule (hourly, daily, and so on). These options can be used for the
connectors across different environments, including on-premises, cloud, and local desktop.
A one-time copy operation enables data movement from a source to a destination only once. It applies to data of
any size and any supported format. The scheduled copy allows you to copy data on a recurrence that you specify.
You can use rich settings (like retry, timeout, and alerts) to configure the scheduled copy.
Next steps
Try these tutorials that use the Copy Data tool:
Quickstart: create a data factory using the Copy Data tool
Tutorial: copy data in Azure using the Copy Data tool
Tutorial: copy on-premises data to Azure using the Copy Data tool
Load data into Azure Data Lake Storage Gen2 with
Azure Data Factory
5/13/2019 • 4 minutes to read • Edit Online

Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built into Azure Blob storage.
It allows you to interface with your data using both file system and object storage paradigms.
Azure Data Factory (ADF) is a fully managed cloud-based data integration service. You can use the service to
populate the lake with data from a rich set of on-premises and cloud-based data stores and save time when
building your analytics solutions. For a detailed list of supported connectors, see the table of Supported data
stores.
Azure Data Factory offers a scale-out, managed data movement solution. Due to the scale-out architecture of ADF,
it can ingest data at a high throughput. For details, see Copy activity performance.
This article shows you how to use the Data Factory Copy Data tool to load data from Amazon Web Services S3
service into Azure Data Lake Storage Gen2. You can follow similar steps to copy data from other types of data
stores.

TIP
For copying data from Azure Data Lake Storage Gen1 into Gen2, refer to this specific walkthrough.

Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure Storage account with Data Lake Storage Gen2 enabled: If you don't have a Storage account, create an
account.
AWS account with an S3 bucket that contains data: This article shows how to copy data from Amazon S3. You
can use other data stores by following similar steps.

Create a data factory


1. On the left menu, select Create a resource > Data + Analytics > Data Factory:
2. In the New data factory page, provide values for the fields that are shown in the following image:

Name: Enter a globally unique name for your Azure data factory. If you receive the error "Data factory
name "LoadADLSDemo" is not available," enter a different name for the data factory. For example, you
could use the name yournameADFTutorialDataFactory. Try creating the data factory again. For the
naming rules for Data Factory artifacts, see Data Factory naming rules.
Subscription: Select your Azure subscription in which to create the data factory.
Resource Group: Select an existing resource group from the drop-down list, or select the Create new
option and enter the name of a resource group. To learn about resource groups, see Using resource
groups to manage your Azure resources.
Version: Select V2.
Location: Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by data factory can be in other locations and regions.
3. Select Create.
4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the
following image:

Select the Author & Monitor tile to launch the Data Integration Application in a separate tab.

Load data into Azure Data Lake Storage Gen2


1. In the Get started page, select the Copy Data tile to launch the Copy Data tool:
2. In the Properties page, specify CopyFromAmazonS3ToADLS for the Task name field, and select Next:

3. In the Source data store page, click + Create new connection:


Select Amazon S3 from the connector gallery, and select Continue

4. In the Specify Amazon S3 connection page, do the following steps:


a. Specify the Access Key ID value.
b. Specify the Secret Access Key value.
c. Click Test connection to validate the settings, then select Finish.
d. You will see that a new connection has been created. Select Next.
5. In the Choose the input file or folder page, browse to the folder and file that you want to copy over.
Select the folder/file, select Choose:

6. Specify the copy behavior by checking the Copy files recursively and Binary copy options. Select Next:
7. In the Destination data store page, click + Create new connection, and then select Azure Data Lake
Storage Gen2, and select Continue:

8. In the Specify Azure Data Lake Storage connection page, do the following steps:
a. Select your Data Lake Storage Gen2 capable account from the "Storage account name" drop down list.
b. Select Finish to create the connection. Then select Next.
9. In the Choose the output file or folder page, enter copyfroms3 as the output folder name, and select
Next. ADF will create the corresponding ADLS Gen2 file system and sub-folders during the copy if they don't
exist.

10. In the Settings page, select Next to use the default settings:
11. In the Summary page, review the settings, and select Next:

12. In the Deployment page, select Monitor to monitor the pipeline:


13. Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to view
activity run details and to rerun the pipeline:

14. To view activity runs that are associated with the pipeline run, select the View Activity Runs link in the
Actions column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To switch
back to the pipeline runs view, select the Pipelines link at the top. Select Refresh to refresh the list.

15. To monitor the execution details for each copy activity, select the Details link (eyeglasses image) under
Actions in the activity monitoring view. You can monitor details like the volume of data copied from the
source to the sink, data throughput, execution steps with corresponding duration, and used configurations:
16. Verify that the data is copied into your Data Lake Storage Gen2 account.
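Behind the scenes, the Copy Data tool creates ordinary Data Factory resources. For reference, the two connections created in steps 4 and 8 correspond roughly to linked service definitions like the following sketch; the names and placeholder values are illustrative, not what the tool actually generates:

{
    "name": "AmazonS3LinkedService",
    "properties": {
        "type": "AmazonS3",
        "typeProperties": {
            "accessKeyId": "<access key ID>",
            "secretAccessKey": {
                "type": "SecureString",
                "value": "<secret access key>"
            }
        }
    }
}

{
    "name": "ADLSGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<account name>.dfs.core.windows.net",
            "accountKey": {
                "type": "SecureString",
                "value": "<account key>"
            }
        }
    }
}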

Next steps
Copy activity overview
Azure Data Lake Storage Gen2 connector
Copy data from Azure Data Lake Storage Gen1 to
Gen2 with Azure Data Factory
5/13/2019 • 7 minutes to read • Edit Online

Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built into Azure Blob storage.
It allows you to interface with your data using both file system and object storage paradigms.
If you are currently using Azure Data Lake Storage Gen1, you can evaluate the Gen2 new capability by copying
data from Data Lake Storage Gen1 to Gen2 using Azure Data Factory.
Azure Data Factory is a fully managed cloud-based data integration service. You can use the service to populate
the lake with data from a rich set of on-premises and cloud-based data stores and save time when building your
analytics solutions. For a detailed list of supported connectors, see the table of Supported data stores.
Azure Data Factory offers a scale-out, managed data movement solution. Due to the scale-out architecture of ADF,
it can ingest data at a high throughput. For details, see Copy activity performance.
This article shows you how to use the Data Factory Copy Data tool to copy data from Azure Data Lake Storage
Gen1 into Azure Data Lake Storage Gen2. You can follow similar steps to copy data from other types of data
stores.

Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure Data Lake Storage Gen1 account with data in it.
Azure Storage account with Data Lake Storage Gen2 enabled: If you don't have a Storage account, create an
account.

Create a data factory


1. On the left menu, select Create a resource > Data + Analytics > Data Factory:
2. In the New data factory page, provide values for the fields that are shown in the following image:

Name: Enter a globally unique name for your Azure data factory. If you receive the error "Data factory
name "LoadADLSDemo" is not available," enter a different name for the data factory. For example, you
could use the name yournameADFTutorialDataFactory. Try creating the data factory again. For the
naming rules for Data Factory artifacts, see Data Factory naming rules.
Subscription: Select your Azure subscription in which to create the data factory.
Resource Group: Select an existing resource group from the drop-down list, or select the Create new
option and enter the name of a resource group. To learn about resource groups, see Using resource
groups to manage your Azure resources.
Version: Select V2.
Location: Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by data factory can be in other locations and regions.
3. Select Create.
4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the
following image:

Select the Author & Monitor tile to launch the Data Integration Application in a separate tab.

Load data into Azure Data Lake Storage Gen2


1. In the Get started page, select the Copy Data tile to launch the Copy Data tool:
2. In the Properties page, specify CopyFromADLSGen1ToGen2 for the Task name field, and select Next:

3. In the Source data store page, click + Create new connection:


Select Azure Data Lake Storage Gen1 from the connector gallery, and select Continue

4. In the Specify Azure Data Lake Storage Gen1 connection page, do the following steps:
a. Select your Data Lake Storage Gen1 for the account name, and specify or validate the Tenant.
b. Click Test connection to validate the settings, then select Finish.
c. You will see that a new connection has been created. Select Next.

IMPORTANT
In this walkthrough, you use a managed identity for Azure resources to authenticate your Data Lake Storage Gen1.
Be sure to grant the MSI the proper permissions in Azure Data Lake Storage Gen1 by following these instructions.
5. In the Choose the input file or folder page, browse to the folder and file that you want to copy over.
Select the folder/file, select Choose:

6. Specify the copy behavior by checking the Copy files recursively and Binary copy options. Select Next:
7. In the Destination data store page, click + Create new connection, and then select Azure Data Lake
Storage Gen2, and select Continue:

8. In the Specify Azure Data Lake Storage Gen2 connection page, do the following steps:
a. Select your Data Lake Storage Gen2 capable account from the "Storage account name" drop down list.
b. Select Finish to create the connection. Then select Next.
9. In the Choose the output file or folder page, enter copyfromadlsgen1 as the output folder name, and
select Next. ADF will create the corresponding ADLS Gen2 file system and sub-folders during the copy if they
don't exist.

10. In the Settings page, select Next to use the default settings.
11. In the Summary page, review the settings, and select Next:
12. In the Deployment page, select Monitor to monitor the pipeline:

13. Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to view
activity run details and to rerun the pipeline:
14. To view activity runs that are associated with the pipeline run, select the View Activity Runs link in the
Actions column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To switch
back to the pipeline runs view, select the Pipelines link at the top. Select Refresh to refresh the list.

15. To monitor the execution details for each copy activity, select the Details link (eyeglasses image) under
Actions in the activity monitoring view. You can monitor details like the volume of data copied from the
source to the sink, data throughput, execution steps with corresponding duration, and used configurations:

16. Verify that the data is copied into your Data Lake Storage Gen2 account.

Best practices
To assess upgrading from Azure Data Lake Storage (ADLS) Gen1 to Gen2 in general, refer to Upgrade your big
data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2. The following
sections introduce best practices for using ADF for the data upgrade from Gen1 to Gen2.
Data partition for historical data copy
If your total data size in ADLS Gen1 is less than 30 TB and the number of files is less than 1 million, you can
copy all the data in a single Copy activity run.
If you have a larger amount of data to copy, or you want the flexibility to manage the data migration in batches and
complete each batch within a specific time window, partition the data. Partitioning also
reduces the risk of any unexpected issue.
A PoC (Proof of Concept) is highly recommended in order to verify the end-to-end solution and test the copy
throughput in your environment. Major steps for the PoC:
1. Create one ADF pipeline with a single copy activity to copy several TBs of data from ADLS Gen1 to ADLS
Gen2 and get a copy performance baseline, starting with a Data Integration Units (DIU) setting of 128.
2. Based on the copy throughput you get in step #1, calculate the estimated time required for the entire data
migration.
3. (Optional) Create a control table and define the file filter to partition the files to be migrated. The ways to
partition the files are as follows (see the sketch after this list):
Partitioned by folder name, or by folder name with a wildcard filter (suggested)
Partitioned by the file's last modified time
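For instance, one migration batch could be scoped by a folder-partitioned source dataset similar to the following sketch; the dataset name, linked service name, and paths are illustrative placeholders:

{
    "name": "ADLSGen1Partition201905",
    "properties": {
        "type": "AzureDataLakeStoreFile",
        "linkedServiceName": {
            "referenceName": "ADLSGen1LinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "historydata/2019/05",
            "fileName": "*"
        }
    }
}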
Network bandwidth and storage I/O
You can control the concurrency of the ADF copy jobs that read data from ADLS Gen1 and write data to ADLS
Gen2, so that you can manage the storage I/O usage and avoid impacting normal business workloads on ADLS
Gen1 during the migration.
Permissions
In Data Factory, the ADLS Gen1 connector supports Service Principal and Managed Identity for Azure resources
authentication; the ADLS Gen2 connector supports account key, Service Principal, and Managed Identity for Azure
resources authentication. To make Data Factory able to navigate and copy all the files and ACLs as you need, make
sure you grant high enough permissions to the account you provide so that it can access, read, and write all files and set ACLs if
you choose to. We suggest granting it the super-user/owner role during the migration period.
Preserve ACLs from Data Lake Storage Gen1
If you want to replicate the ACLs along with data files when upgrading from Data Lake Storage Gen1 to Gen2,
refer to Preserve ACLs from Data Lake Storage Gen1.
Incremental copy
Several approaches can be used to load only the new or updated files from ADLS Gen1:
Load new or updated files by time partitioned folder or file name, e.g. /2019/05/13/*;
Load new or updated files by LastModifiedDate;
Identify new or updated files by any 3rd party tool/solution, then pass the file or folder name to ADF pipeline
via parameter or a table/file.
The proper frequency for incremental loads depends on the total number of files in ADLS Gen1 and the volume
of new or updated files to be loaded each time.
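For the LastModifiedDate approach, the source dataset can carry a modified-time window, as in this sketch; the names and the time window are illustrative, and each incremental run would shift the window forward:

{
    "name": "ADLSGen1IncrementalFiles",
    "properties": {
        "type": "AzureDataLakeStoreFile",
        "linkedServiceName": {
            "referenceName": "ADLSGen1LinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "historydata",
            "fileName": "*",
            "modifiedDatetimeStart": "2019-05-13T00:00:00.000Z",
            "modifiedDatetimeEnd": "2019-05-14T00:00:00.000Z"
        }
    }
}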

Next steps
Copy activity overview
Azure Data Lake Storage Gen1 connector
Azure Data Lake Storage Gen2 connector
Load data into Azure SQL Data Warehouse by using
Azure Data Factory
3/26/2019 • 6 minutes to read • Edit Online

Azure SQL Data Warehouse is a cloud-based, scale-out database that's capable of processing massive volumes of
data, both relational and non-relational. SQL Data Warehouse is built on the massively parallel processing (MPP)
architecture that's optimized for enterprise data warehouse workloads. It offers cloud elasticity with the flexibility to
scale storage and compute independently.
Getting started with Azure SQL Data Warehouse is now easier than ever when you use Azure Data Factory. Azure
Data Factory is a fully managed cloud-based data integration service. You can use the service to populate a SQL
Data Warehouse with data from your existing system and save time when building your analytics solutions.
Azure Data Factory offers the following benefits for loading data into Azure SQL Data Warehouse:
Easy to set up: An intuitive 5-step wizard with no scripting required.
Rich data store support: Built-in support for a rich set of on-premises and cloud-based data stores. For a
detailed list, see the table of Supported data stores.
Secure and compliant: Data is transferred over HTTPS or ExpressRoute. The global service presence ensures
that your data never leaves the geographical boundary.
Unparalleled performance by using PolyBase: PolyBase is the most efficient way to move data into Azure
SQL Data Warehouse. Use the staging blob feature to achieve high load speeds from all types of data stores,
including Azure Blob storage and Data Lake Store. (PolyBase supports Azure Blob storage and Azure Data Lake
Store by default.) For details, see Copy activity performance and the sketch after this list.
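When you author the copy activity directly rather than through the wizard, the PolyBase and staging behavior described above corresponds to sink and staging settings like the following minimal sketch; the staging linked service name and path are illustrative placeholders:

"sink": {
    "type": "SqlDWSink",
    "allowPolyBase": true,
    "polyBaseSettings": {
        "rejectType": "percentage",
        "rejectValue": 10.0,
        "rejectSampleValue": 100,
        "useTypeDefault": true
    }
},
"enableStaging": true,
"stagingSettings": {
    "linkedServiceName": {
        "referenceName": "StagingStorageLinkedService",
        "type": "LinkedServiceReference"
    },
    "path": "stagingcontainer/path"
}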
This article shows you how to use the Data Factory Copy Data tool to load data from Azure SQL Database into
Azure SQL Data Warehouse. You can follow similar steps to copy data from other types of data stores.

NOTE
For more information, see Copy data to or from Azure SQL Data Warehouse by using Azure Data Factory.

Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure SQL Data Warehouse: The data warehouse holds the data that's copied over from the SQL database. If
you don't have an Azure SQL Data Warehouse, see the instructions in Create a SQL Data Warehouse.
Azure SQL Database: This tutorial copies data from an Azure SQL database with Adventure Works LT sample
data. You can create a SQL database by following the instructions in Create an Azure SQL database.
Azure storage account: Azure Storage is used as the staging blob in the bulk copy operation. If you don't have
an Azure storage account, see the instructions in Create a storage account.

Create a data factory


1. On the left menu, select Create a resource > Data + Analytics > Data Factory:
2. In the New data factory page, provide values for the fields that are shown in the following image:

Name: Enter a globally unique name for your Azure data factory. If you receive the error "Data factory
name "LoadSQLDWDemo" is not available," enter a different name for the data factory. For example,
you could use the name yournameADFTutorialDataFactory. Try creating the data factory again. For
the naming rules for Data Factory artifacts, see Data Factory naming rules.
Subscription: Select your Azure subscription in which to create the data factory.
Resource Group: Select an existing resource group from the drop-down list, or select the Create new
option and enter the name of a resource group. To learn about resource groups, see Using resource
groups to manage your Azure resources.
Version: Select V2.
Location: Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by data factory can be in other locations and regions. These data
stores include Azure Data Lake Store, Azure Storage, Azure SQL Database, and so on.
3. Select Create.
4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the
following image:

Select the Author & Monitor tile to launch the Data Integration Application in a separate tab.

Load data into Azure SQL Data Warehouse


1. In the Get started page, select the Copy Data tile to launch the Copy Data tool:
2. In the Properties page, specify CopyFromSQLToSQLDW for the Task name field, and select Next:

3. In the Source data store page, complete the following steps:


a. click + Create new connection:
b. Select Azure SQL Database from the gallery, and select Continue. You can type "SQL" in the search
box to filter the connectors.

c. In the New Linked Service page, select your server name and DB name from the dropdown list, and
specify the username and password. Click Test connection to validate the settings, then select Finish.
d. Select the newly created linked service as source, then click Next.

4. In the Select tables from which to copy the data or use a custom query page, enter SalesLT to filter
the tables. Choose the (Select all) box to use all of the tables for the copy, and then select Next:
5. In the Destination data store page, complete the following steps:
a. Click + Create new connection to add a connection

b. Select Azure SQL Data Warehouse from the gallery, and select Next.
c. In the New Linked Service page, select your server name and DB name from the dropdown list, and
specify the username and password. Click Test connection to validate the settings, then select Finish.

d. Select the newly created linked service as sink, then click Next.
6. In the Table mapping page, review the content, and select Next. An intelligent table mapping displays. The
source tables are mapped to the destination tables based on the table names. If a source table doesn't exist
in the destination, Azure Data Factory creates a destination table with the same name by default. You can
also map a source table to an existing destination table.

NOTE
Automatic table creation for the SQL Data Warehouse sink applies when SQL Server or Azure SQL Database is the
source. If you copy data from another source data store, you need to pre-create the schema in the sink Azure SQL
Data Warehouse before executing the data copy.

7. In the Schema mapping page, review the content, and select Next. The intelligent table mapping is based
on the column name. If you let Data Factory automatically create the tables, data type conversion can occur
when there are incompatibilities between the source and destination stores. If there's an unsupported data
type conversion between the source and destination column, you see an error message next to the
corresponding table.

8. In the Settings page, complete the following steps:


a. In the Staging settings section, click + New to add a staging storage account. The storage is used for staging the
data before it's loaded into SQL Data Warehouse by using PolyBase. After the copy is complete, the interim
data in Azure Storage is automatically cleaned up.
b. In the New Linked Service page, select your storage account, and select Finish.

c. In the Advanced settings section, deselect the Use type default option, then select Next.

9. In the Summary page, review the settings, and select Next:


10. In the Deployment page, select Monitor to monitor the pipeline (task):

11. Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to view
activity run details and to rerun the pipeline:

12. To view activity runs that are associated with the pipeline run, select the View Activity Runs link in the
Actions column. To switch back to the pipeline runs view, select the Pipelines link at the top. Select
Refresh to refresh the list.
13. To monitor the execution details for each copy activity, select the Details link under Actions in the activity
monitoring view. You can monitor details like the volume of data copied from the source to the sink, data
throughput, execution steps with corresponding duration, and used configurations:

Next steps
Advance to the following article to learn about Azure SQL Data Warehouse support:
Azure SQL Data Warehouse connector
Load data into Azure Data Lake Storage Gen1 by
using Azure Data Factory
3/26/2019 • 4 minutes to read • Edit Online

Azure Data Lake Storage Gen1 (previously known as Azure Data Lake Store) is an enterprise-wide hyper-scale
repository for big data analytic workloads. Data Lake Storage Gen1 lets you capture data of any size, type, and
ingestion speed. The data is captured in a single place for operational and exploratory analytics.
Azure Data Factory is a fully managed cloud-based data integration service. You can use the service to populate
the lake with data from your existing system and save time when building your analytics solutions.
Azure Data Factory offers the following benefits for loading data into Data Lake Storage Gen1:
Easy to set up: An intuitive 5-step wizard with no scripting required.
Rich data store support: Built-in support for a rich set of on-premises and cloud-based data stores. For a
detailed list, see the table of Supported data stores.
Secure and compliant: Data is transferred over HTTPS or ExpressRoute. The global service presence ensures
that your data never leaves the geographical boundary.
High performance: Up to 1-GB/s data loading speed into Data Lake Storage Gen1. For details, see Copy
activity performance.
This article shows you how to use the Data Factory Copy Data tool to load data from Amazon S3 into Data Lake
Storage Gen1. You can follow similar steps to copy data from other types of data stores.

NOTE
For more information, see Copy data to or from Data Lake Storage Gen1 by using Azure Data Factory.

Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Data Lake Storage Gen1 account: If you don't have a Data Lake Storage Gen1 account, see the instructions in
Create a Data Lake Storage Gen1 account.
Amazon S3: This article shows how to copy data from Amazon S3. You can use other data stores by following
similar steps.

Create a data factory


1. On the left menu, select Create a resource > Analytics > Data Factory:
2. In the New data factory page, provide values for the fields that are shown in the following image:

Name: Enter a globally unique name for your Azure data factory. If you receive the error "Data factory
name "LoadADLSG1Demo" is not available," enter a different name for the data factory. For example,
you could use the name yournameADFTutorialDataFactory. Try creating the data factory again. For
the naming rules for Data Factory artifacts, see Data Factory naming rules.
Subscription: Select your Azure subscription in which to create the data factory.
Resource Group: Select an existing resource group from the drop-down list, or select the Create new
option and enter the name of a resource group. To learn about resource groups, see Using resource
groups to manage your Azure resources.
Version: Select V2.
Location: Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by data factory can be in other locations and regions. These data
stores include Azure Data Lake Storage Gen1, Azure Storage, Azure SQL Database, and so on.
3. Select Create.
4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the
following image:

Select the Author & Monitor tile to launch the Data Integration Application in a separate tab.

Load data into Data Lake Storage Gen1


1. In the Get started page, select the Copy Data tile to launch the Copy Data tool:

2. In the Properties page, specify CopyFromAmazonS3ToADLS for the Task name field, and select Next:
3. In the Source data store page, click + Create new connection:

Select Amazon S3, and select Continue


4. In the Specify Amazon S3 connection page, do the following steps:
a. Specify the Access Key ID value.
b. Specify the Secret Access Key value.
c. Select Finish.

d. You will see a new connection. Select Next.


5. In the Choose the input file or folder page, browse to the folder and file that you want to copy over.
Select the folder/file, select Choose, and then select Next:

6. Choose the copy behavior by selecting the Copy files recursively and Binary copy (copy files as-is)
options. Select Next:
7. In the Destination data store page, click + Create new connection, and then select Azure Data Lake
Storage Gen1, and select Continue:

8. In the New Linked Service (Azure Data Lake Storage Gen1) page, do the following steps:
a. Select your Data Lake Storage Gen1 account for the Data Lake Store account name.
b. Specify the Tenant, and select Finish.
c. Select Next.

IMPORTANT
In this walkthrough, you use a managed identity for Azure resources to authenticate your Data Lake Storage Gen1
account. Be sure to grant the MSI the proper permissions in Data Lake Storage Gen1 by following these instructions.
9. In the Choose the output file or folder page, enter copyfroms3 as the output folder name, and select
Next:

10. In the Settings page, select Next:


11. In the Summary page, review the settings, and select Next:
12. In the Deployment page, select Monitor to monitor the pipeline (task):
13. Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to view
activity run details and to rerun the pipeline:

14. To view activity runs that are associated with the pipeline run, select the View Activity Runs link in the
Actions column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To switch
back to the pipeline runs view, select the Pipelines link at the top. Select Refresh to refresh the list.

15. To monitor the execution details for each copy activity, select the Details link under Actions in the activity
monitoring view. You can monitor details like the volume of data copied from the source to the sink, data
throughput, execution steps with corresponding duration, and used configurations:
16. Verify that the data is copied into your Data Lake Storage Gen1 account:
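Behind the scenes, the Copy Data tool authored a Data Lake Storage Gen1 linked service that authenticates with the data
factory's managed identity. The following is only a minimal sketch of such a definition, with property names taken from the
Data Lake Storage Gen1 connector and placeholder values rather than output captured from this walkthrough:

{
    "name": "AzureDataLakeStorageGen1LinkedService",
    "properties": {
        "type": "AzureDataLakeStore",
        "typeProperties": {
            "dataLakeStoreUri": "https://<data lake store account name>.azuredatalakestore.net/webhdfs/v1",
            "tenant": "<tenant>",
            "subscriptionId": "<subscription id>",
            "resourceGroupName": "<resource group name>"
        }
    }
}

Because no service principal credentials are specified, Data Factory uses the factory's managed identity, which is why the
MSI needs permissions in Data Lake Storage Gen1 as noted in the earlier IMPORTANT callout.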

Next steps
Advance to the following article to learn about Data Lake Storage Gen1 support:
Azure Data Lake Storage Gen1 connector
Copy data from SAP Business Warehouse by using
Azure Data Factory
5/22/2019 • 10 minutes to read • Edit Online

This article shows how to use Azure Data Factory to copy data from SAP Business Warehouse (BW ) via Open Hub
to Azure Data Lake Storage Gen2. You can use a similar process to copy data to other supported sink data stores.

TIP
For general information about copying data from SAP BW, including SAP BW Open Hub integration and delta extraction flow,
see Copy data from SAP Business Warehouse via Open Hub by using Azure Data Factory.

Prerequisites
Azure Data Factory: If you don't have one, follow the steps to create a data factory.
SAP BW Open Hub Destination (OHD ) with destination type "Database Table": To create an OHD
or to check that your OHD is configured correctly for Data Factory integration, see the SAP BW Open Hub
Destination configurations section of this article.
The SAP BW user needs the following permissions:
Authorization for Remote Function Calls (RFC ) and SAP BW.
Permissions to the “Execute” activity of the S_SDSAUTH authorization object.
A self-hosted integration runtime (IR) with SAP .NET connector 3.0. Follow these setup steps:
1. Install and register the self-hosted integration runtime, version 3.13 or later. (This process is
described later in this article.)
2. Download the 64-bit SAP Connector for Microsoft .NET 3.0 from SAP's website, and install it on the
same computer as the self-hosted IR. During installation, make sure that you select Install
Assemblies to GAC in the Optional setup steps dialog box, as the following image shows:
Do a full copy from SAP BW Open Hub
In the Azure portal, go to your data factory. Select Author & Monitor to open the Data Factory UI in a separate
tab.
1. On the Let's get started page, select Copy Data to open the Copy Data tool.
2. On the Properties page, specify a Task name, and then select Next.
3. On the Source data store page, select +Create new connection. Select SAP BW Open Hub from the
connector gallery, and then select Continue. To filter the connectors, you can type SAP in the search box.
4. On the Specify SAP BW Open Hub connection page, follow these steps to create a new connection.

a. From the Connect via integration runtime list, select an existing self-hosted IR. Or, choose to
create one if you don't have one yet.
To create a new self-hosted IR, select +New, and then select Self-hosted. Enter a Name, and then
select Next. Select Express setup to install on the current computer, or follow the Manual setup
steps that are provided.
As mentioned in Prerequisites, make sure that you have SAP Connector for Microsoft .NET 3.0
installed on the same computer where the self-hosted IR is running.
b. Fill in the SAP BW Server name, System number, Client ID, Language (if other than EN ), User
name, and Password.
c. Select Test connection to validate the settings, and then select Finish.
d. A new connection is created. Select Next.
5. On the Select Open Hub Destinations page, browse the Open Hub Destinations that are available in
your SAP BW. Select the OHD to copy data from, and then select Next.

6. Specify a filter, if you need one. If your OHD only contains data from a single data-transfer process (DTP )
execution with a single request ID, or you're sure that your DTP is finished and you want to copy the data,
clear the Exclude Last Request check box.
Learn more about these settings in the SAP BW Open Hub Destination configurations section of this
article. Select Validate to double-check what data will be returned. Then select Next.
7. On the Destination data store page, select +Create new connection > Azure Data Lake Storage
Gen2 > Continue.
8. On the Specify Azure Data Lake Storage connection page, follow these steps to create a connection.
a. Select your Data Lake Storage Gen2-capable account from the Name drop-down list.
b. Select Finish to create the connection. Then select Next.
9. On the Choose the output file or folder page, enter copyfromopenhub as the output folder name. Then
select Next.
10. On the File format setting page, select Next to use the default settings.

11. On the Settings page, expand Performance settings. Enter a value for Degree of copy parallelism such
as 5 to load from SAP BW in parallel. Then select Next.
12. On the Summary page, review the settings. Then select Next.
13. On the Deployment page, select Monitor to monitor the pipeline.

14. Notice that the Monitor tab on the left side of the page is automatically selected. The Actions column
includes links to view activity-run details and to rerun the pipeline.
15. To view activity runs that are associated with the pipeline run, select View Activity Runs in the Actions
column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To switch back to
the pipeline-runs view, select the Pipelines link at the top. Select Refresh to refresh the list.

16. To monitor the execution details for each copy activity, select the Details link, which is an eyeglasses icon
below Actions in the activity-monitoring view. Available details include the data volume copied from the
source to the sink, data throughput, execution steps and duration, and configurations used.

17. To view the maximum Request ID, go back to the activity-monitoring view and select Output under
Actions.
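The Output pane shows the copy activity's output JSON. As a rough sketch only (exact field names and metric values can
vary by service version), the portion that matters for the next section looks like the following, where
sapOpenHubMaxRequestId carries the maximum request ID that was copied:

{
    "dataRead": 1180089300,
    "dataWritten": 1180089300,
    "rowsCopied": 736539,
    "sapOpenHubMaxRequestId": 35
}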

Do an incremental copy from SAP BW Open Hub


TIP
See SAP BW Open Hub connector delta extraction flow to learn how the SAP BW Open Hub connector in Data Factory
copies incremental data from SAP BW. This article can also help you understand basic connector configuration.
Now, let's continue to configure incremental copy from SAP BW Open Hub.
Incremental copy uses a "high-watermark" mechanism that's based on the request ID. That ID is automatically
generated in SAP BW Open Hub Destination by the DTP. The following diagram shows this workflow:

On the data factory Let's get started page, select Create pipeline from template to use the built-in template.
1. Search for SAP BW to find and select the Incremental copy from SAP BW to Azure Data Lake
Storage Gen2 template. This template copies data into Azure Data Lake Storage Gen2. You can use a
similar workflow to copy to other sink types.
2. On the template's main page, select or create the following three connections, and then select Use this
template in the lower-right corner of the window.
Azure Blob storage: In this walkthrough, we use Azure Blob storage to store the high watermark,
which is the max copied request ID.
SAP BW Open Hub: This is the source to copy data from. Refer to the previous full-copy walkthrough
for detailed configuration.
Azure Data Lake Storage Gen2: This is the sink to copy data to. Refer to the previous full-copy
walkthrough for detailed configuration.

3. This template generates a pipeline with the following three activities and chains them on success:
Lookup, Copy Data, and Web.

Go to the pipeline Parameters tab. You see all the configurations that you need to provide.
SAPOpenHubDestinationName: Specify the Open Hub table name to copy data from.
ADLSGen2SinkPath: Specify the destination Azure Data Lake Storage Gen2 path to copy data to. If
the path doesn't exist, the Data Factory copy activity creates a path during execution.
HighWatermarkBlobPath: Specify the path to store the high-watermark value, such as
container/path .

HighWatermarkBlobName: Specify the blob name to store the high watermark value, such as
requestIdCache.txt . In Blob storage, go to the corresponding path of
HighWatermarkBlobPath+HighWatermarkBlobName, such as container/path/requestIdCache.txt.
Create a blob with content 0.

LogicAppURL: In this template, we use WebActivity to call Azure Logic Apps to set the high-
watermark value in Blob storage. Or, you can use Azure SQL Database to store it. Use a stored
procedure activity to update the value.
You must first create a logic app, as the following image shows. Then, paste in the HTTP POST URL.
a. Go to the Azure portal. Select a new Logic Apps service. Select +Blank Logic App to go to
Logic Apps Designer.
b. Create a trigger of When an HTTP request is received. Specify the HTTP request body as
follows:

{
"properties": {
"sapOpenHubMaxRequestId": {
"type": "string"
}
},
"type": "object"
}

c. Add a Create blob action. For Folder path and Blob name, use the same values that you
configured previously in HighWatermarkBlobPath and HighWatermarkBlobName.
d. Select Save. Then, copy the value of HTTP POST URL to use in the Data Factory pipeline.
4. After you provide the Data Factory pipeline parameters, select Debug > Finish to invoke a run to validate
the configuration. Or, select Publish All to publish the changes, and then select Trigger to execute a run.
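For reference, the body that the template's Web activity posts to the Logic App has to match the trigger schema shown
above, so it is a single JSON object carrying the new high-watermark value. A minimal sketch follows; the activity name in
the expression is a placeholder for the copy activity in the template, not a name you must use:

{
    "sapOpenHubMaxRequestId": "@{activity('CopyFromSAPBWOpenHub').output.sapOpenHubMaxRequestId}"
}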

SAP BW Open Hub Destination configurations


This section introduces configuration of the SAP BW side to use the SAP BW Open Hub connector in Data Factory
to copy data.
Configure delta extraction in SAP BW
If you need both historical copy and incremental copy or only incremental copy, configure delta extraction in SAP
BW.
1. Create the Open Hub Destination. You can create the OHD in SAP Transaction RSA1, which automatically
creates the required transformation and data-transfer process. Use the following settings:
ObjectType: You can use any object type. Here, we use InfoCube as an example.
Destination Type: Select Database Table.
Key of the Table: Select Technical Key.
Extraction: Select Keep Data and Insert Records into Table.

You might increase the number of parallel running SAP work processes for the DTP:

2. Schedule the DTP in process chains.


A delta DTP for a cube only works if the necessary rows haven't been compressed. Make sure that BW cube
compression isn't running before the DTP to the Open Hub table. The easiest way to do this is to integrate
the DTP into your existing process chains. In the following example, the DTP (to the OHD ) is inserted into
the process chain between the Adjust (aggregate rollup) and Collapse (cube compression) steps.

Configure full extraction in SAP BW


In addition to delta extraction, you might want a full extraction of the same SAP BW InfoProvider. This usually
applies if you want to do full copy but not incremental, or you want to resync delta extraction.
You can't have more than one DTP for the same OHD. So, you must create an additional OHD before delta
extraction.

For a full load OHD, choose different options than for delta extraction:
In OHD: Set the Extraction option to Delete Data and Insert Records. Otherwise, data will be extracted
many times when you repeat the DTP in a BW process chain.
In the DTP: Set Extraction Mode to Full. You must change the automatically created DTP from Delta to
Full immediately after the OHD is created, as this image shows:
In the BW Open Hub connector of Data Factory: Turn off Exclude last request. Otherwise, nothing will be
extracted.
You typically run the full DTP manually. Or, you can create a process chain for the full DTP. It's typically a separate
chain that's independent of your existing process chains. In either case, make sure that the DTP is finished before
you start the extraction by using Data Factory copy. Otherwise, only partial data will be copied.
Run delta extraction the first time
The first delta extraction is technically a full extraction. By default, the SAP BW Open Hub connector excludes the
last request when it copies data. For the first delta extraction, no data is extracted by the Data Factory copy activity
until a subsequent DTP generates delta data in the table with a separate request ID. There are two ways to avoid
this scenario:
Turn off the Exclude last request option for the first delta extraction. Make sure that the first delta DTP is
finished before you start the delta extraction the first time.
Use the procedure for resyncing the delta extraction, as described in the next section.
Resync delta extraction
The following scenarios change the data in SAP BW cubes but are not considered by the delta DTP:
SAP BW selective deletion (of rows by using any filter condition)
SAP BW request deletion (of faulty requests)
An SAP Open Hub Destination isn't a data-mart-controlled data target (in all SAP BW support packages since
2015). So, you can delete data from a cube without changing the data in the OHD. You must then resync the data
of the cube with Data Factory:
1. Run a full extraction in Data Factory (by using a full DTP in SAP ).
2. Delete all rows in the Open Hub table for the delta DTP.
3. Set the status of the delta DTP to Fetched.
After this, all subsequent delta DTPs and Data Factory delta extractions work as expected.
To set the status of the delta DTP to Fetched, you can use the following option to run the delta DTP manually:

*No Data Transfer; Delta Status in Source: Fetched*

Next steps
Learn about SAP BW Open Hub connector support:
SAP Business Warehouse Open Hub connector
Load data from Office 365 by using Azure Data
Factory
3/26/2019 • 5 minutes to read • Edit Online

This article shows you how to use Data Factory to load data from Office 365 into Azure Blob storage. You can
follow similar steps to copy data to Azure Data Lake Storage Gen1 or Gen2. Refer to the Office 365 connector article for
general information about copying data from Office 365.

Create a data factory


1. On the left menu, select Create a resource > Data + Analytics > Data Factory:

2. In the New data factory page, provide values for the fields that are shown in the following image:
Name: Enter a globally unique name for your Azure data factory. If you receive the error "Data factory
name "LoadFromOffice365Demo" is not available," enter a different name for the data factory. For
example, you could use the name yournameLoadFromOffice365Demo. Try creating the data factory
again. For the naming rules for Data Factory artifacts, see Data Factory naming rules.
Subscription: Select your Azure subscription in which to create the data factory.
Resource Group: Select an existing resource group from the drop-down list, or select the Create new
option and enter the name of a resource group. To learn about resource groups, see Using resource
groups to manage your Azure resources.
Version: Select V2.
Location: Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by data factory can be in other locations and regions. These data
stores include Azure Data Lake Store, Azure Storage, Azure SQL Database, and so on.
3. Select Create.
4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the
following image:
5. Select the Author & Monitor tile to launch the Data Integration Application in a separate tab.

Create a pipeline
1. On the "Let's get started" page, select Create pipeline.

2. In the General tab for the pipeline, enter "CopyPipeline" for Name of the pipeline.
3. In the Activities tool box > Move & Transform category > drag and drop the Copy activity from the tool
box to the pipeline designer surface. Specify "CopyFromOffice365ToBlob" as activity name.
Configure source
1. Go to the pipeline > Source tab, click + New to create a source dataset.
2. In the New Dataset window, select Office 365, and then select Finish.

3. You see a new tab opened for Office 365 dataset. On the General tab at the bottom of the Properties
window, enter "SourceOffice365Dataset" for Name.
4. Go to the Connection tab of the Properties window. Next to the Linked service text box, click + New.
5. In the New Linked Service window, enter "Office365LinkedService" as name, enter the service principal ID
and service principal key, then select Save to deploy the linked service.
6. After the linked service is created, you are back in the dataset settings. Next to "Table", choose the down-
arrow to expand the list of available Office 365 datasets, and choose "BasicDataSet_v0.Contact_v0" from the
drop-down list:
7. Go to the Schema tab of the Properties window and select Import Schema. Notice that the schema and
sample values for the Contact dataset are displayed.
8. Now, go back to the pipeline > Source tab, and confirm that SourceOffice365Dataset is selected.
Configure sink
1. Go to the pipeline > Sink tab, and select + New to create a sink dataset.
2. In the New Dataset window, notice that only the supported destinations are displayed when copying from
Office 365. Select Azure Blob Storage, and then select Finish. In this tutorial, you copy Office 365 data
into Azure Blob storage.
3. On the General tab of the Properties window, in Name, enter "OutputBlobDataset".
4. Go to the Connection tab of the Properties window. Next to the Linked service text box, select + New.
5. In the New Linked Service window, enter "AzureStorageLinkedService" as name, select "Service Principal"
from the dropdown list of authentication methods, fill in the Service Endpoint, Tenant, Service principal ID,
and Service principal key, then select Save to deploy the linked service. Refer here for how to set up service
principal authentication for Azure Blob Storage.
6. After the linked service is created, you are back in the dataset settings. Next to File path, select Browse to
choose the output folder where the Office 365 data will be extracted to. Under "File Format Settings", next
to File Format, choose "JSON format", and next to File Pattern, choose "Set of objects".
7. Go back to the pipeline > Sink tab, confirm that OutputBlobDataset is selected.
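Before validating, it can help to look at what the authoring UI produced. The JSON behind the Office 365 source dataset
and linked service that you created in the Configure source steps looks roughly like the following sketch. The type names
come from the Office 365 connector; the values are placeholders, and the JSON you see through the Code button may
differ in detail:

{
    "name": "SourceOffice365Dataset",
    "properties": {
        "type": "Office365Table",
        "linkedServiceName": {
            "referenceName": "Office365LinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "tableName": "BasicDataSet_v0.Contact_v0"
        }
    }
}

{
    "name": "Office365LinkedService",
    "properties": {
        "type": "Office365",
        "typeProperties": {
            "office365TenantId": "<Office 365 tenant id>",
            "servicePrincipalTenantId": "<AAD tenant id>",
            "servicePrincipalId": "<service principal id>",
            "servicePrincipalKey": {
                "type": "SecureString",
                "value": "<service principal key>"
            }
        }
    }
}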

Validate the pipeline


To validate the pipeline, select Validate from the tool bar.
You can also see the JSON code associated with the pipeline by clicking Code on the upper-right.

Publish the pipeline


In the top toolbar, select Publish All. This action publishes entities (datasets, and pipelines) you created to Data
Factory.

Trigger the pipeline manually


Select Trigger on the toolbar, and then select Trigger Now. On the Pipeline Run page, select Finish.

Monitor the pipeline


Go to the Monitor tab on the left. You see a pipeline run that is triggered by a manual trigger. You can use links in
the Actions column to view activity details and to rerun the pipeline.
To see activity runs associated with the pipeline run, select the View Activity Runs link in the Actions column. In
this example, there is only one activity, so you see only one entry in the list. For details about the copy operation,
select the Details link (eyeglasses icon) in the Actions column.

If this is the first time you are requesting data for this context (a combination of which data table is being accessed,
which destination account the data is being loaded into, and which user identity is making the data access request),
the copy activity status shows as "In Progress", and only when you select the Details link under Actions do you see the
status "RequestingConsent". A member of the data access approver group needs to approve the request in
Privileged Access Management before the data extraction can proceed.
Status as requesting consent:

Status as extracting data:


Once the consent is provided, data extraction will continue and, after some time, the pipeline run will show as
completed.

Now go to the destination Azure Blob Storage and verify that Office 365 data has been extracted in JSON format.

Next steps
Advance to the following article to learn about Office 365 support:
Office 365 connector
How to read or write partitioned data in Azure Data
Factory
1/3/2019 • 2 minutes to read • Edit Online

In Azure Data Factory version 1, you could read or write partitioned data by using the SliceStart, SliceEnd,
WindowStart, and WindowEnd system variables. In the current version of Data Factory, you can achieve this
behavior by using a pipeline parameter and a trigger's start time or scheduled time as a value of the parameter.

Use a pipeline parameter


In Data Factory version 1, you could use the partitionedBy property and SliceStart system variable as shown in
the following example:

"folderPath": "adfcustomerprofilingsample/logs/marketingcampaigneffectiveness/{Year}/{Month}/{Day}/",
"partitionedBy": [
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "%M" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "%d" } }
],

For more information about the partitionedBy property, see Copy data to or from Azure Blob storage by using
Azure Data Factory.
To achieve this behavior in the current version of Data Factory:
1. Define a pipeline parameter of type string. In the following example, the name of the pipeline parameter is
windowStartTime.
2. Set folderPath in the dataset definition to reference the value of the pipeline parameter.
3. Pass the actual value for the parameter when you invoke the pipeline on demand (see the sketch after the following
example). You can also pass a trigger's start time or scheduled time dynamically at runtime.

"folderPath": {
"value":
"adfcustomerprofilingsample/logs/marketingcampaigneffectiveness/@{formatDateTime(pipeline().parameters.windowS
tartTime, 'yyyy/MM/dd')}/",
"type": "Expression"
},
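For example, when you invoke the pipeline on demand through the create-run REST API call (or the equivalent SDK or
PowerShell option), the parameter values are passed as a simple JSON object of name/value pairs. A minimal sketch for
this pipeline:

{
    "windowStartTime": "2018-05-15T00:00:00Z"
}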

Pass in a value from a trigger


In the following tumbling window trigger definition, the window start time of the trigger is passed as a value for
the pipeline parameter windowStartTime:
{
"name": "MyTrigger",
"properties": {
"type": "TumblingWindowTrigger",
"typeProperties": {
"frequency": "Hour",
"interval": "1",
"startTime": "2018-05-15T00:00:00Z",
"delay": "00:10:00",
"maxConcurrency": 10
},
"pipeline": {
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "MyPipeline"
},
"parameters": {
"windowStartTime": "@trigger().outputs.windowStartTime"
}
}
}
}
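If you use a schedule trigger instead of a tumbling window trigger, you can pass its scheduled time in the same way; only
the parameters section of the trigger's pipeline reference changes, as in this minimal sketch:

"parameters": {
    "windowStartTime": "@trigger().scheduledTime"
}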

Example
Here is a sample dataset definition:

{
"name": "SampleBlobDataset",
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value":
"adfcustomerprofilingsample/logs/marketingcampaigneffectiveness/@{formatDateTime(pipeline().parameters.windowS
tartTime, 'yyyy/MM/dd')}/",
"type": "Expression"
},
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"structure": [
{ "name": "ProfileID", "type": "String" },
{ "name": "SessionStart", "type": "String" },
{ "name": "Duration", "type": "Int32" },
{ "name": "State", "type": "String" },
{ "name": "SrcIPAddress", "type": "String" },
{ "name": "GameType", "type": "String" },
{ "name": "Multiplayer", "type": "String" },
{ "name": "EndRank", "type": "String" },
{ "name": "WeaponsUsed", "type": "Int32" },
{ "name": "UsersInteractedWith", "type": "String" },
{ "name": "Impressions", "type": "String" }
],
"linkedServiceName": {
"referenceName": "churnStorageLinkedService",
"type": "LinkedServiceReference"
}
}

Pipeline definition:
{
"properties": {
"activities": [{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": {
"value": "@concat(pipeline().parameters.blobContainer, '/scripts/',
pipeline().parameters.partitionHiveScriptFile)",
"type": "Expression"
},
"scriptLinkedService": {
"referenceName": "churnStorageLinkedService",
"type": "LinkedServiceReference"
},
"defines": {
"RAWINPUT": {
"value": "@concat('wasb://', pipeline().parameters.blobContainer, '@',
pipeline().parameters.blobStorageAccount, '.blob.core.windows.net/logs/',
pipeline().parameters.inputRawLogsFolder, '/')",
"type": "Expression"
},
"Year": {
"value": "@formatDateTime(pipeline().parameters.windowStartTime, 'yyyy')",
"type": "Expression"
},
"Month": {
"value": "@formatDateTime(pipeline().parameters.windowStartTime, 'MM')",
"type": "Expression"
},
"Day": {
"value": "@formatDateTime(pipeline().parameters.windowStartTime, 'dd')",
"type": "Expression"
}
}
},
"linkedServiceName": {
"referenceName": "HdiLinkedService",
"type": "LinkedServiceReference"
},
"name": "HivePartitionGameLogs"
}],
"parameters": {
"windowStartTime": {
"type": "String"
},
"blobStorageAccount": {
"type": "String"
},
"blobContainer": {
"type": "String"
},
"inputRawLogsFolder": {
"type": "String"
}
}
}
}

Next steps
For a complete walkthrough of how to create a data factory that has a pipeline, see Quickstart: Create a data
factory.
Supported file formats and compression codecs in
Azure Data Factory
5/22/2019 • 17 minutes to read • Edit Online

This article applies to the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen1, Azure
Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, HTTP, and SFTP.
If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input
and output dataset definitions. If you want to parse or generate files with a specific format, Azure Data
Factory supports the following file format types:
Text format
JSON format
Parquet format
ORC format
Avro format

TIP
Learn how copy activity maps your source data to sink from Schema mapping in copy activity.

Text format
NOTE
Data Factory introduced a new delimited text format dataset; see the Delimited text format article for details. The following
configurations on file-based data store datasets are still supported as-is for backward compatibility. We suggest that you
use the new model going forward.

If you want to read from a text file or write to a text file, set the type property in the format section of the
dataset to TextFormat. You can also specify the following optional properties in the format section. See
TextFormat example section on how to configure.

columnDelimiter
    Description: The character used to separate columns in a file. Consider using a rare unprintable character that may not exist in your data; for example, specify "\u0001", which represents Start of Heading (SOH).
    Allowed values: Only one character is allowed. The default value is comma (','). To use a Unicode character, refer to Unicode Characters to get the corresponding code for it.
    Required: No

rowDelimiter
    Description: The character used to separate rows in a file.
    Allowed values: Only one character is allowed. The default value is any of the following values on read: ["\r\n", "\r", "\n"], and "\r\n" on write.
    Required: No

escapeChar
    Description: The special character used to escape a column delimiter in the content of the input file. You cannot specify both escapeChar and quoteChar for a table.
    Allowed values: Only one character is allowed. No default value. Example: if you have comma (',') as the column delimiter but you want to have the comma character in the text (example: "Hello, world"), you can define '$' as the escape character and use the string "Hello$, world" in the source.
    Required: No

quoteChar
    Description: The character used to quote a string value. The column and row delimiters inside the quote characters are treated as part of the string value. This property is applicable to both input and output datasets. You cannot specify both escapeChar and quoteChar for a table.
    Allowed values: Only one character is allowed. No default value. For example, if you have comma (',') as the column delimiter but you want to have the comma character in the text (example: <Hello, world>), you can define " (double quote) as the quote character and use the string "Hello, world" in the source.
    Required: No

nullValue
    Description: One or more characters used to represent a null value.
    Allowed values: One or more characters. The default values are "\N" and "NULL" on read, and "\N" on write.
    Required: No

encodingName
    Description: Specify the encoding name.
    Allowed values: A valid encoding name; see Encoding.EncodingName Property. Example: windows-1250 or shift_jis. The default value is UTF-8.
    Required: No

firstRowAsHeader
    Description: Specifies whether to consider the first row as a header. For an input dataset, Data Factory reads the first row as a header. For an output dataset, Data Factory writes the first row as a header. See Scenarios for using firstRowAsHeader and skipLineCount for sample scenarios.
    Allowed values: True, False (default)
    Required: No

skipLineCount
    Description: Indicates the number of non-empty rows to skip when reading data from input files. If both skipLineCount and firstRowAsHeader are specified, the lines are skipped first and then the header information is read from the input file. See Scenarios for using firstRowAsHeader and skipLineCount for sample scenarios.
    Allowed values: Integer
    Required: No

treatEmptyAsNull
    Description: Specifies whether to treat a null or empty string as a null value when reading data from an input file.
    Allowed values: True (default), False
    Required: No

TextFormat example
In the following JSON definition for a dataset, some of the optional properties are specified.

"typeProperties":
{
"folderPath": "mycontainer/myfolder",
"fileName": "myblobname",
"format":
{
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": ";",
"quoteChar": "\"",
"NullValue": "NaN",
"firstRowAsHeader": true,
"skipLineCount": 0,
"treatEmptyAsNull": true
}
},

To use an escapeChar instead of quoteChar , replace the line with quoteChar with the following escapeChar:

"escapeChar": "$",

Scenarios for using firstRowAsHeader and skipLineCount


You are copying from a non-file source to a text file and would like to add a header line containing the
schema metadata (for example: SQL schema). Specify firstRowAsHeader as true in the output dataset for this
scenario.
You are copying from a text file containing a header line to a non-file sink and would like to drop that line.
Specify firstRowAsHeader as true in the input dataset.
You are copying from a text file and want to skip a few lines at the beginning that contain no data or header
information. Specify skipLineCount to indicate the number of lines to be skipped. If the rest of the file
contains a header line, you can also specify firstRowAsHeader . If both skipLineCount and firstRowAsHeader
are specified, the lines are skipped first and then the header information is read from the input file.
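For instance, the third scenario maps to a format section like the following minimal sketch, which skips the first two
non-data lines and then reads the next line as the header:

"format":
{
    "type": "TextFormat",
    "skipLineCount": 2,
    "firstRowAsHeader": true
}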
JSON format
To import/export a JSON file as-is into/from Azure Cosmos DB, see Import/export JSON documents
section in Move data to/from Azure Cosmos DB article.
If you want to parse the JSON files or write the data in JSON format, set the type property in the format
section to JsonFormat. You can also specify the following optional properties in the format section. See
JsonFormat example section on how to configure.

filePattern
    Description: Indicate the pattern of data stored in each JSON file. Allowed values are: setOfObjects and arrayOfObjects. The default value is setOfObjects. See the JSON file patterns section for details about these patterns.
    Required: No

jsonNodeReference
    Description: If you want to iterate and extract data from the objects inside an array field with the same pattern, specify the JSON path of that array. This property is supported only when copying data from JSON files.
    Required: No

jsonPathDefinition
    Description: Specify the JSON path expression for each column mapping with a customized column name (start with lowercase). This property is supported only when copying data from JSON files, and you can extract data from an object or array. For fields under the root object, start with root $; for fields inside the array chosen by the jsonNodeReference property, start from the array element. See the JsonFormat example section on how to configure.
    Required: No

encodingName
    Description: Specify the encoding name. For the list of valid encoding names, see: Encoding.EncodingName Property. For example: windows-1250 or shift_jis. The default value is: UTF-8.
    Required: No

nestingSeparator
    Description: Character that is used to separate nesting levels. The default value is '.' (dot).
    Required: No

NOTE
For the case of cross-applying data in an array into multiple rows (case 1 -> sample 2 in JsonFormat examples), you can only
choose to expand a single array by using the jsonNodeReference property.

JSON file patterns


Copy activity can parse the following patterns of JSON files:
Type I: setOfObjects
Each file contains a single object, or line-delimited/concatenated multiple objects. When this option is
chosen in an output dataset, copy activity produces a single JSON file with each object per line (line-
delimited).
single object JSON example

{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
}

line-delimited JSON example

{"time":"2015-04-
29T07:12:20.9100000Z","callingimsi":"466920403025604","callingnum1":"678948008","callingnum2":
"567834760","switch1":"China","switch2":"Germany"}
{"time":"2015-04-
29T07:13:21.0220000Z","callingimsi":"466922202613463","callingnum1":"123436380","callingnum2":
"789037573","switch1":"US","switch2":"UK"}
{"time":"2015-04-
29T07:13:21.4370000Z","callingimsi":"466923101048691","callingnum1":"678901578","callingnum2":
"345626404","switch1":"Germany","switch2":"UK"}

concatenated JSON example

{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
}
{
"time": "2015-04-29T07:13:21.0220000Z",
"callingimsi": "466922202613463",
"callingnum1": "123436380",
"callingnum2": "789037573",
"switch1": "US",
"switch2": "UK"
}
{
"time": "2015-04-29T07:13:21.4370000Z",
"callingimsi": "466923101048691",
"callingnum1": "678901578",
"callingnum2": "345626404",
"switch1": "Germany",
"switch2": "UK"
}

Type II: arrayOfObjects


Each file contains an array of objects.
[
{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
},
{
"time": "2015-04-29T07:13:21.0220000Z",
"callingimsi": "466922202613463",
"callingnum1": "123436380",
"callingnum2": "789037573",
"switch1": "US",
"switch2": "UK"
},
{
"time": "2015-04-29T07:13:21.4370000Z",
"callingimsi": "466923101048691",
"callingnum1": "678901578",
"callingnum2": "345626404",
"switch1": "Germany",
"switch2": "UK"
}
]

JsonFormat example
Case 1: Copying data from JSON files
Sample 1: extract data from object and array
In this sample, you expect one root JSON object to map to a single record in the tabular result. If you have a JSON file
with the following content:

{
"id": "ed0e4960-d9c5-11e6-85dc-d7996816aad3",
"context": {
"device": {
"type": "PC"
},
"custom": {
"dimensions": [
{
"TargetResourceType": "Microsoft.Compute/virtualMachines"
},
{
"ResourceManagementProcessRunId": "827f8aaa-ab72-437c-ba48-d8917a7336a3"
},
{
"OccurrenceTime": "1/13/2017 11:24:37 AM"
}
]
}
}
}

and you want to copy it into an Azure SQL table in the following format, by extracting data from both objects
and array:
ID | DEVICETYPE | TARGETRESOURCETYPE | RESOURCEMANAGEMENTPROCESSRUNID | OCCURRENCETIME
ed0e4960-d9c5-11e6-85dc-d7996816aad3 | PC | Microsoft.Compute/virtualMachines | 827f8aaa-ab72-437c-ba48-d8917a7336a3 | 1/13/2017 11:24:37 AM

The input dataset with JsonFormat type is defined as follows: (partial definition with only the relevant parts).
More specifically:
structure section defines the customized column names and the corresponding data type while converting
to tabular data. This section is optional unless you need to do column mapping. For more information, see
Map source dataset columns to destination dataset columns.
jsonPathDefinition specifies the JSON path for each column indicating where to extract the data from. To
copy data from array, you can use array[x].property to extract value of the given property from the xth
object, or you can use array[*].property to find the value from any object containing such property.

"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "deviceType",
"type": "String"
},
{
"name": "targetResourceType",
"type": "String"
},
{
"name": "resourceManagementProcessRunId",
"type": "String"
},
{
"name": "occurrenceTime",
"type": "DateTime"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects",
"jsonPathDefinition": {"id": "$.id", "deviceType": "$.context.device.type",
"targetResourceType": "$.context.custom.dimensions[0].TargetResourceType",
"resourceManagementProcessRunId": "$.context.custom.dimensions[1].ResourceManagementProcessRunId",
"occurrenceTime": " $.context.custom.dimensions[2].OccurrenceTime"}
}
}
}

Sample 2: cross apply multiple objects with the same pattern from array
In this sample, you expect to transform one root JSON object into multiple records in tabular result. If you have
a JSON file with the following content:
{
"ordernumber": "01",
"orderdate": "20170122",
"orderlines": [
{
"prod": "p1",
"price": 23
},
{
"prod": "p2",
"price": 13
},
{
"prod": "p3",
"price": 231
}
],
"city": [ { "sanmateo": "No 1" } ]
}

and you want to copy it into an Azure SQL table in the following format, by flattening the data inside the array
and cross-joining it with the common root info:

ORDERNUMBER | ORDERDATE | ORDER_PD | ORDER_PRICE | CITY
01 | 20170122 | P1 | 23 | [{"sanmateo":"No 1"}]
01 | 20170122 | P2 | 13 | [{"sanmateo":"No 1"}]
01 | 20170122 | P3 | 231 | [{"sanmateo":"No 1"}]

The input dataset with JsonFormat type is defined as follows: (partial definition with only the relevant parts).
More specifically:
structure section defines the customized column names and the corresponding data type while converting
to tabular data. This section is optional unless you need to do column mapping. For more information, see
Map source dataset columns to destination dataset columns.
jsonNodeReference indicates to iterate and extract data from the objects with the same pattern under array
orderlines .
jsonPathDefinition specifies the JSON path for each column indicating where to extract the data from. In
this example, ordernumber , orderdate , and city are under root object with JSON path starting with $. ,
while order_pd and order_price are defined with path derived from the array element without $. .
"properties": {
"structure": [
{
"name": "ordernumber",
"type": "String"
},
{
"name": "orderdate",
"type": "String"
},
{
"name": "order_pd",
"type": "String"
},
{
"name": "order_price",
"type": "Int64"
},
{
"name": "city",
"type": "String"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects",
"jsonNodeReference": "$.orderlines",
"jsonPathDefinition": {"ordernumber": "$.ordernumber", "orderdate": "$.orderdate", "order_pd":
"prod", "order_price": "price", "city": " $.city"}
}
}
}

Note the following points:


If the structure and jsonPathDefinition are not defined in the Data Factory dataset, the Copy Activity
detects the schema from the first object and flattens the whole object.
If the JSON input has an array, by default the Copy Activity converts the entire array value into a string. You
can choose to extract data from it using jsonNodeReference and/or jsonPathDefinition , or skip it by not
specifying it in jsonPathDefinition .
If there are duplicate names at the same level, the Copy Activity picks the last one.
Property names are case-sensitive. Two properties with same name but different casings are treated as two
separate properties.
Case 2: Writing data to JSON file
If you have the following table in SQL Database:

ID ORDER_DATE ORDER_PRICE ORDER_BY

1 20170119 2000 David

2 20170120 3500 Patrick

3 20170121 4000 Jason

and for each record, you expect to write to a JSON object in the following format:
{
"id": "1",
"order": {
"date": "20170119",
"price": 2000,
"customer": "David"
}
}

The output dataset with JsonFormat type is defined as follows: (partial definition with only the relevant parts).
More specifically, the structure section defines the customized property names in the destination file, and
nestingSeparator (default is ".") is used to identify the nesting layer from the name. This section is optional
unless you want to change the property names compared with the source column names, or nest some of the
properties.

"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "order.date",
"type": "String"
},
{
"name": "order.price",
"type": "Int64"
},
{
"name": "order.customer",
"type": "String"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat"
}
}
}

Parquet format
NOTE
Data Factory introduced a new Parquet format dataset; see the Parquet format article for details. The following configurations
on file-based data store datasets are still supported as-is for backward compatibility. We suggest that you use the new
model going forward.

If you want to parse the Parquet files or write the data in Parquet format, set the format type property to
ParquetFormat. You do not need to specify any properties in the Format section within the typeProperties
section. Example:
"format":
{
"type": "ParquetFormat"
}

Note the following points:


Complex data types are not supported (MAP, LIST).
White space in column name is not supported.
Parquet file has the following compression-related options: NONE, SNAPPY, GZIP, and LZO. Data Factory
supports reading data from Parquet file in any of these compressed formats except LZO - it uses the
compression codec in the metadata to read the data. However, when writing to a Parquet file, Data Factory
chooses SNAPPY, which is the default for Parquet format. Currently, there is no option to override this
behavior.

IMPORTANT
For copy empowered by Self-hosted Integration Runtime e.g. between on-premises and cloud data stores, if you are not
copying Parquet files as-is, you need to install the 64-bit JRE 8 (Java Runtime Environment) or OpenJDK on your IR
machine. See the following paragraph with more details.

For copy running on Self-hosted IR with Parquet file serialization/deserialization, ADF locates the Java runtime
by firstly checking the registry (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for
JRE, if not found, secondly checking system variable JAVA_HOME for OpenJDK.
To use JRE: The 64-bit IR requires 64-bit JRE. You can find it from here.
To use OpenJDK: it's supported since IR version 3.13. Package the jvm.dll with all other required
assemblies of OpenJDK into Self-hosted IR machine, and set system environment variable JAVA_HOME
accordingly.

TIP
If you copy data to/from Parquet format using Self-hosted Integration Runtime and hit error saying "An error occurred
when invoking java, message: java.lang.OutOfMemoryError:Java heap space", you can add an environment variable
_JAVA_OPTIONS in the machine that hosts the Self-hosted IR to adjust the min/max heap size for JVM to empower such
copy, then rerun the pipeline.

Example: set variable _JAVA_OPTIONS with value -Xms256m -Xmx16g . The flag Xms specifies the initial memory
allocation pool for a Java Virtual Machine (JVM ), while Xmx specifies the maximum memory allocation pool.
This means that JVM will be started with Xms amount of memory and will be able to use a maximum of Xmx
amount of memory. By default, ADF uses a minimum of 64 MB and a maximum of 1 GB.
Data type mapping for Parquet files
DATA FACTORY INTERIM DATA TYPE | PARQUET PRIMITIVE TYPE | PARQUET ORIGINAL TYPE (DESERIALIZE) | PARQUET ORIGINAL TYPE (SERIALIZE)

Boolean Boolean N/A N/A

SByte Int32 Int8 Int8

Byte Int32 UInt8 Int16

Int16 Int32 Int16 Int16

UInt16 Int32 UInt16 Int32

Int32 Int32 Int32 Int32

UInt32 Int64 UInt32 Int64

Int64 Int64 Int64 Int64

UInt64 Int64/Binary UInt64 Decimal

Single Float N/A N/A

Double Double N/A N/A

Decimal Binary Decimal Decimal

String Binary Utf8 Utf8

DateTime Int96 N/A N/A

TimeSpan Int96 N/A N/A

DateTimeOffset Int96 N/A N/A

ByteArray Binary N/A N/A

Guid Binary Utf8 Utf8

Char Binary Utf8 Utf8

CharArray Not supported N/A N/A

ORC format
If you want to parse the ORC files or write the data in ORC format, set the format type property to
OrcFormat. You do not need to specify any properties in the Format section within the typeProperties section.
Example:
"format":
{
"type": "OrcFormat"
}

Note the following points:


Complex data types are not supported (STRUCT, MAP, LIST, UNION ).
White space in column name is not supported.
ORC file has three compression-related options: NONE, ZLIB, SNAPPY. Data Factory supports reading data
from an ORC file in any of these compressed formats. It uses the compression codec in the metadata to read
the data. However, when writing to an ORC file, Data Factory chooses ZLIB, which is the default for ORC.
Currently, there is no option to override this behavior.

IMPORTANT
For copy empowered by Self-hosted Integration Runtime e.g. between on-premises and cloud data stores, if you are not
copying ORC files as-is, you need to install the 64-bit JRE 8 (Java Runtime Environment) or OpenJDK on your IR
machine. See the following paragraph with more details.

For copy running on Self-hosted IR with ORC file serialization/deserialization, ADF locates the Java runtime by
firstly checking the registry (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for JRE, if
not found, secondly checking system variable JAVA_HOME for OpenJDK.
To use JRE: The 64-bit IR requires 64-bit JRE. You can find it from here.
To use OpenJDK: it's supported since IR version 3.13. Package the jvm.dll with all other required
assemblies of OpenJDK into Self-hosted IR machine, and set system environment variable JAVA_HOME
accordingly.
Data type mapping for ORC files
DATA FACTORY INTERIM DATA TYPE ORC TYPES

Boolean Boolean

SByte Byte

Byte Short

Int16 Short

UInt16 Int

Int32 Int

UInt32 Long

Int64 Long

UInt64 String

Single Float

Double Double

Decimal Decimal

String String

DateTime Timestamp

DateTimeOffset Timestamp

TimeSpan Timestamp

ByteArray Binary

Guid String

Char Char(1)

AVRO format
If you want to parse the Avro files or write the data in Avro format, set the format type property to
AvroFormat. You do not need to specify any properties in the Format section within the typeProperties section.
Example:

"format":
{
"type": "AvroFormat",
}

To use Avro format in a Hive table, you can refer to Apache Hive’s tutorial.
Note the following points:
Complex data types are not supported (records, enums, arrays, maps, unions, and fixed).

Compression support
Azure Data Factory supports compressing and decompressing data during copy. When you specify the compression property
in an input dataset, the copy activity reads the compressed data from the source and decompresses it; when you specify
the property in an output dataset, the copy activity compresses the data and then writes it to the sink. Here are a
few sample scenarios:
Read GZIP compressed data from an Azure blob, decompress it, and write result data to an Azure SQL
database. You define the input Azure Blob dataset with the compression type property as GZIP.
Read data from a plain-text file from on-premises File System, compress it using GZip format, and write the
compressed data to an Azure blob. You define an output Azure Blob dataset with the compression type
property as GZip.
Read .zip file from FTP server, decompress it to get the files inside, and land those files in Azure Data Lake
Store. You define an input FTP dataset with the compression type property as ZipDeflate.
Read GZIP-compressed data from an Azure blob, decompress it, compress it using BZIP2, and write the result
data to an Azure blob. You define the input Azure Blob dataset with compression type set to GZIP and the
output dataset with compression type set to BZIP2.
To specify compression for a dataset, use the compression property in the dataset JSON as in the following
example:

{
"name": "AzureBlobDataSet",
"properties": {
"type": "AzureBlob",
"linkedServiceName": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"fileName": "pagecounts.csv.gz",
"folderPath": "compression/file/",
"format": {
"type": "TextFormat"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

The compression section has two properties:


Type: the compression codec, which can be GZIP, Deflate, BZIP2, or ZipDeflate.
Level: the compression ratio, which can be Optimal or Fastest.
Fastest: The compression operation should complete as quickly as possible, even if the resulting
file is not optimally compressed.
Optimal: The compression operation should be optimally compressed, even if the operation takes
a longer time to complete.
For more information, see Compression Level topic.
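As an illustration of the last scenario listed earlier (writing BZIP2-compressed output), the compression section of the
output dataset would look like the following minimal sketch; the level value is optional and shown only as an example:

"compression": {
    "type": "BZip2",
    "level": "Optimal"
}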

NOTE
Compression settings are not supported for data in the AvroFormat, OrcFormat, or ParquetFormat. When reading
files in these formats, Data Factory detects and uses the compression codec in the metadata. When writing to files in
these formats, Data Factory chooses the default compression codec for that format. For example, ZLIB for OrcFormat and
SNAPPY for ParquetFormat.

Unsupported file types and compression formats


You can use the extensibility features of Azure Data Factory to transform files that aren't supported. Two options
include Azure Functions and custom tasks by using Azure Batch.
You can see a sample that uses an Azure function to extract the contents of a tar file. For more information, see
Azure Functions activity.
You can also build this functionality using a custom dotnet activity. Further information is available here

Next steps
See the following articles for file-based data stores supported by Azure Data Factory:
Azure Blob Storage connector
Azure Data Lake Store connector
Amazon S3 connector
File System connector
FTP connector
SFTP connector
HDFS connector
HTTP connector
Schema mapping in copy activity
4/29/2019 • 6 minutes to read • Edit Online

This article describes how the Azure Data Factory copy activity does schema mapping and data type mapping
from source data to sink data when executing the data copy.

Schema mapping
Column mapping applies when copying data from source to sink. By default, the copy activity maps source
data to sink by column names. You can specify an explicit mapping to customize the column mapping
based on your needs. More specifically, the copy activity:
1. Reads the data from the source and determines the source schema.
2. Uses default column mapping to map columns by name, or applies the explicit column mapping if specified.
3. Writes the data to the sink.
Explicit mapping
You can specify the columns to map in copy activity -> translator -> mappings property. The following
example defines a copy activity in a pipeline to copy data from delimited text to Azure SQL Database.
{
"name": "CopyActivity",
"type": "Copy",
"inputs": [{
"referenceName": "DelimitedTextInput",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "AzureSqlOutput",
"type": "DatasetReference"
}],
"typeProperties": {
"source": { "type": "DelimitedTextSource" },
"sink": { "type": "SqlSink" },
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"name": "UserId",
"type": "Guid"
},
"sink": {
"name": "MyUserId"
}
},
{
"source": {
"name": "Name",
"type": "String"
},
"sink": {
"name": "MyName"
}
},
{
"source": {
"name": "Group",
"type": "String"
},
"sink": {
"name": "MyGroup"
}
}
]
}
}
}

The following properties are supported under translator -> mappings -> object with source and sink :

name
    Description: Name of the source or sink column.
    Required: Yes

ordinal
    Description: Column index. Starts with 1. Applies to, and is required for, delimited text without a header line.
    Required: No

path
    Description: JSON path expression for each field to extract or map. Applies to hierarchical data, e.g. MongoDB/REST. For fields under the root object, the JSON path starts with root $; for fields inside the array chosen by the collectionReference property, the JSON path starts from the array element.
    Required: No

type
    Description: Data Factory interim data type of the source or sink column.
    Required: No

culture
    Description: Culture of the source or sink column. Applies when type is Datetime or Datetimeoffset. The default is en-us.
    Required: No

format
    Description: Format string to be used when type is Datetime or Datetimeoffset. Refer to Custom Date and Time Format Strings on how to format datetime.
    Required: No

The following properties are supported under translator -> mappings in addition to object with source
and sink :

collectionReference
    Description: Supported only when hierarchical data, e.g. MongoDB/REST, is the source. If you want to iterate and extract data from the objects inside an array field with the same pattern and convert them to one row per object, specify the JSON path of that array to do a cross-apply.
    Required: No
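Putting these properties together, the following is a hedged sketch of a translator that flattens a hierarchical source (for
example, MongoDB or REST) into tabular sink columns. The JSON paths, column names, and types are illustrative only:

"translator": {
    "type": "TabularTranslator",
    "collectionReference": "$.orders",
    "mappings": [
        {
            "source": { "path": "$.number" },
            "sink": { "name": "orderNumber" }
        },
        {
            "source": { "path": "prod" },
            "sink": { "name": "order_pd" }
        },
        {
            "source": { "path": "price", "type": "Int64" },
            "sink": { "name": "order_price" }
        }
    ]
}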

Alternative column mapping


You can specify copy activity -> translator -> columnMappings to map between tabular-shaped data. In
this case, the "structure" section is required for both input and output datasets. Column mapping supports
mapping all or a subset of columns in the source dataset "structure" to all columns in the sink
dataset "structure". The following are error conditions that result in an exception:
Source data store query result does not have a column name that is specified in the input dataset
"structure" section.
Sink data store (if with pre-defined schema) does not have a column name that is specified in the
output dataset "structure" section.
Either fewer columns or more columns in the "structure" of sink dataset than specified in the mapping.
Duplicate mapping.
In the following example, the input dataset has a structure and it points to a table in an on-premises
Oracle database.
{
"name": "OracleDataset",
"properties": {
"structure":
[
{ "name": "UserId"},
{ "name": "Name"},
{ "name": "Group"}
],
"type": "OracleTable",
"linkedServiceName": {
"referenceName": "OracleLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "SourceTable"
}
}
}

In this sample, the output dataset has a structure and it points to a table in Salesforce.

{
"name": "SalesforceDataset",
"properties": {
"structure":
[
{ "name": "MyUserId"},
{ "name": "MyName" },
{ "name": "MyGroup"}
],
"type": "SalesforceObject",
"linkedServiceName": {
"referenceName": "SalesforceLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "SinkTable"
}
}
}

The following JSON defines a copy activity in a pipeline. The columns from the source are mapped to columns in the
sink by using the translator -> columnMappings property.
{
"name": "CopyActivity",
"type": "Copy",
"inputs": [
{
"referenceName": "OracleDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SalesforceDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": { "type": "OracleSource" },
"sink": { "type": "SalesforceSink" },
"translator":
{
"type": "TabularTranslator",
"columnMappings":
{
"UserId": "MyUserId",
"Group": "MyGroup",
"Name": "MyName"
}
}
}
}

If you use the syntax "columnMappings": "UserId: MyUserId, Group: MyGroup, Name: MyName" to specify column mapping, it is still supported as-is.
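For reference, that string form sits inside the translator as follows and is equivalent to the object-style mapping shown above:

"translator": {
    "type": "TabularTranslator",
    "columnMappings": "UserId: MyUserId, Group: MyGroup, Name: MyName"
}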
Alternative schema mapping
You can specify copy activity -> translator -> schemaMapping to map between hierarchical-shaped data and tabular-shaped data, for example to copy from MongoDB/REST to a text file, or from Oracle to Azure Cosmos DB's API for MongoDB. The following properties are supported in the copy activity translator section:

type - The type property of the copy activity translator must be set to TabularTranslator. Required: Yes.
schemaMapping - A collection of key-value pairs that represents the mapping relation from the source side to the sink side. Required: Yes.
- Key: represents the source. For a tabular source, specify the column name as defined in the dataset structure; for a hierarchical source, specify the JSON path expression for each field to extract and map.
- Value: represents the sink. For a tabular sink, specify the column name as defined in the dataset structure; for a hierarchical sink, specify the JSON path expression for each field to extract and map.
In the case of hierarchical data, for fields under the root object, the JSON path starts with root $; for fields inside the array chosen by the collectionReference property, the JSON path starts from the array element.
collectionReference - If you want to iterate and extract data from the objects inside an array field with the same pattern, and convert them to one row per object, specify the JSON path of that array to do a cross-apply. This property is supported only when hierarchical data is the source. Required: No.

Example: copy from MongoDB to Oracle:


For example, if you have a MongoDB document with the following content:

{
"id": {
"$oid": "592e07800000000000000000"
},
"number": "01",
"date": "20170122",
"orders": [
{
"prod": "p1",
"price": 23
},
{
"prod": "p2",
"price": 13
},
{
"prod": "p3",
"price": 231
}
],
"city": [ { "name": "Seattle" } ]
}

and you want to copy it into an Oracle table in the following format, by flattening the data inside the
array (order_pd and order_price) and cross-joining it with the common root information (number, date, and city):

ORDERNUMBER ORDERDATE ORDER_PD ORDER_PRICE CITY

01 20170122 P1 23 Seattle

01 20170122 P2 13 Seattle

01 20170122 P3 231 Seattle

Configure the schema-mapping rule as in the following copy activity JSON sample:

{
"name": "CopyFromMongoDBToOracle",
"type": "Copy",
"typeProperties": {
"source": {
"type": "MongoDbV2Source"
},
"sink": {
"type": "OracleSink"
},
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"orderNumber": "$.number",
"orderDate": "$.date",
"order_pd": "prod",
"order_price": "price",
"city": " $.city[0].name"
},
"collectionReference": "$.orders"
}
}
}

Data type mapping


Copy activity maps source types to sink types with the following two-step approach:
1. Convert from native source types to Azure Data Factory interim data types
2. Convert from Azure Data Factory interim data types to native sink types
You can find the mapping between native types and interim types in the "Data type mapping" section of each connector topic.
Supported data types
Data Factory supports the following interim data types, which you can specify when configuring type information in the dataset structure:
Byte[]
Boolean
Datetime
Datetimeoffset
Decimal
Double
Guid
Int16
Int32
Int64
Single
String
Timespan
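For example, the earlier Oracle dataset "structure" could declare interim types explicitly. This is a sketch only; the LastLogin column and its format string are hypothetical additions for illustration:

"structure":
[
    { "name": "UserId", "type": "Guid" },
    { "name": "Name", "type": "String" },
    { "name": "LastLogin", "type": "Datetime", "format": "yyyy-MM-dd HH:mm:ss" }
]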

Next steps
See the other Copy Activity articles:
Copy activity overview
Fault tolerance of copy activity in Azure Data Factory
4/8/2019 • 3 minutes to read • Edit Online

The copy activity in Azure Data Factory offers you two ways to handle incompatible rows when copying data
between source and sink data stores:
You can abort and fail the copy activity when incompatible data is encountered (default behavior).
You can continue to copy all of the data by adding fault tolerance and skipping incompatible data rows. In
addition, you can log the incompatible rows in Azure Blob storage or Azure Data Lake Store. You can then
examine the log to learn the cause for the failure, fix the data on the data source, and retry the copy activity.

Supported scenarios
Copy Activity supports three scenarios for detecting, skipping, and logging incompatible data:
Incompatibility between the source data type and the sink native type.
For example: Copy data from a CSV file in Blob storage to a SQL database with a schema definition that
contains three INT type columns. The CSV file rows that contain numeric data, such as 123,456,789 are
copied successfully to the sink store. However, the rows that contain non-numeric values, such as 123,456,
abc are detected as incompatible and are skipped.
Mismatch in the number of columns between the source and the sink.
For example: Copy data from a CSV file in Blob storage to a SQL database with a schema definition that
contains six columns. The CSV file rows that contain six columns are copied successfully to the sink store.
The CSV file rows that contain more or fewer than six columns are detected as incompatible and are
skipped.
Primary key violation when writing to SQL Server/Azure SQL Database/Azure Cosmos DB.
For example: Copy data from a SQL server to a SQL database. A primary key is defined in the sink SQL
database, but no such primary key is defined in the source SQL server. The duplicated rows that exist in the
source cannot be copied to the sink. Copy Activity copies only the first row of the source data into the sink.
The subsequent source rows that contain the duplicated primary key value are detected as incompatible
and are skipped.

NOTE
For loading data into SQL Data Warehouse using PolyBase, configure PolyBase's native fault tolerance settings by specifying reject policies via "polyBaseSettings" in the copy activity. You can still enable redirecting rows that PolyBase finds incompatible to Blob or ADLS as normal, as shown below.
This feature doesn't apply when the copy activity is configured to invoke Amazon Redshift Unload.
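As a sketch only (the reject thresholds below are hypothetical; check the Azure SQL Data Warehouse connector article for the exact polyBaseSettings options), the PolyBase reject policy is configured on the sink like this, while the redirect settings themselves are configured as shown in the Configuration section that follows:

"sink": {
    "type": "SqlDWSink",
    "allowPolyBase": true,
    "polyBaseSettings": {
        "rejectType": "percentage",
        "rejectValue": 10.0,
        "rejectSampleValue": 100,
        "useTypeDefault": true
    }
}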

Configuration
The following example provides a JSON definition to configure skipping the incompatible rows in Copy Activity:
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
},
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": {
"referenceName": "<Azure Storage or Data Lake Store linked service>",
"type": "LinkedServiceReference"
},
"path": "redirectcontainer/erroroutput"
}
}

enableSkipIncompatibleRow - Specifies whether to skip incompatible rows during copy or not. Allowed values: True, False (default). Required: No.
redirectIncompatibleRowSettings - A group of properties that can be specified when you want to log the incompatible rows. Required: No.
linkedServiceName - The linked service of Azure Storage or Azure Data Lake Store to store the log that contains the skipped rows. Allowed values: the name of an AzureStorage or AzureDataLakeStore type linked service, which refers to the instance that you want to use to store the log file. Required: No.
path - The path of the log file that contains the skipped rows. Allowed values: specify the path that you want to use to log the incompatible data. If you do not provide a path, the service creates a container for you. Required: No.

Monitor skipped rows


After the copy activity run completes, you can see the number of skipped rows in the output of the copy activity:

"output": {
"dataRead": 95,
"dataWritten": 186,
"rowsCopied": 9,
"rowsSkipped": 2,
"copyDuration": 16,
"throughput": 0.01,
"redirectRowPath": "https://myblobstorage.blob.core.windows.net//myfolder/a84bf8d4-233f-4216-8cb5-
45962831cd1b/",
"errors": []
},
If you configure logging of the incompatible rows, you can find the log file at this path:
https://[your-blob-account].blob.core.windows.net/[path-if-configured]/[copy-activity-run-id]/[auto-generated-GUID].csv
The log files are always csv files. The original data being skipped is logged with a comma as the column delimiter. Two additional columns, "ErrorCode" and "ErrorMessage", are appended to the original source data in the log file so that you can see the root cause of the incompatibility. The ErrorCode and ErrorMessage values are quoted with double quotes.
An example of the log file content is as follows:

data1, data2, data3, "UserErrorInvalidDataValue", "Column 'Prop_2' contains an invalid value 'data3'. Cannot
convert 'data3' to type 'DateTime'."
data4, data5, data6, "2627", "Violation of PRIMARY KEY constraint 'PK_tblintstrdatetimewithpk'. Cannot insert
duplicate key in object 'dbo.tblintstrdatetimewithpk'. The duplicate key value is (data4)."

Next steps
See the other Copy Activity articles:
Copy activity overview
Copy activity performance
Copy Activity performance and tuning guide
5/31/2019 • 24 minutes to read • Edit Online

Azure Data Factory Copy Activity delivers a first-class secure, reliable, and high-performance data loading
solution. It enables you to copy tens of terabytes of data every day across a rich variety of cloud and on-
premises data stores. Blazing-fast data loading performance is key to ensure you can focus on the core “big
data” problem: building advanced analytics solutions and getting deep insights from all that data.
Azure provides a set of enterprise-grade data storage and data warehouse solutions, and Copy Activity offers a
highly optimized data loading experience that is easy to configure and set up. With just a single copy activity,
you can achieve:
Loading data into Azure SQL Data Warehouse at 1.2 GBps.
Loading data into Azure Blob storage at 1.0 GBps
Loading data into Azure Data Lake Store at 1.0 GBps
This article describes:
Performance reference numbers for supported source and sink data stores to help you plan your project;
Features that can boost the copy throughput in different scenarios, including data integration units, parallel
copy, and staged Copy;
Performance tuning guidance on how to tune the performance and the key factors that can impact copy
performance.

NOTE
If you are not familiar with Copy Activity in general, see Copy Activity Overview before reading this article.

Performance reference
As a reference, the following table shows the copy throughput number in MBps for the given source and sink pairs in a single copy activity run, based on in-house testing. For comparison, it also demonstrates how different settings of Data Integration Units or Self-hosted Integration Runtime scalability (multiple nodes) can help copy performance.
IMPORTANT
When copy activity is executed on an Azure Integration Runtime, the minimal allowed Data Integration Units (formerly
known as Data Movement Units) is two. If not specified, see default Data Integration Units being used in Data Integration
Units.

Points to note:
Throughput is calculated by using the following formula: [size of data read from source]/[Copy Activity run
duration].
The performance reference numbers in the table were measured using TPC -H dataset in a single copy
activity run. Test files for file-based stores are multiple files with 10GB in size.
In Azure data stores, the source and sink are in the same Azure region.
For hybrid copy between on-premises and cloud data stores, each Self-hosted Integration Runtime node was running on a machine that was separate from the data store, with the following specification. When a single activity was running, the copy operation consumed only a small portion of the test machine's CPU, memory, or network bandwidth.

CPU 32 cores 2.20 GHz Intel Xeon E5-2660 v2

Memory 128 GB

Network Internet interface: 10 Gbps; intranet interface: 40 Gbps


TIP
You can achieve higher throughput by using more Data Integration Units (DIU). For example, with 100 DIUs, you can
achieve copying data from Azure Blob into Azure Data Lake Store at 1.0GBps. See the Data Integration Units section for
details about this feature and the supported scenario.

Data integration units


A Data Integration Unit (DIU) (formerly known as Cloud Data Movement Unit or DMU) is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Data Factory. DIU applies only to the Azure Integration Runtime, not to the Self-hosted Integration Runtime.
The minimum number of Data Integration Units needed to power a Copy Activity run is two. If not specified, the following table lists the default DIUs used in different copy scenarios:

Copy data between file-based stores: between 4 and 32, depending on the number and size of the files.
All other copy scenarios: 4

To override this default, specify a value for the dataIntegrationUnits property as follows. The allowed value for the dataIntegrationUnits property is up to 256. The actual number of DIUs that the copy operation uses at run time is equal to or less than the configured value, depending on your data pattern. For information about the level of performance gain you might get when you configure more units for a specific copy source and sink, see the performance reference.
You can see the Data Integration Units actually used for each copy run in the copy activity output when you monitor an activity run. For details, see Copy activity monitoring.

NOTE
Setting of DIUs larger than 4 currently applies only when you copy multiple files from Azure Storage/Data Lake
Storage/Amazon S3/Google Cloud Storage/cloud FTP/cloud SFTP to any other cloud data stores.

Example:

"activities":[
{
"name": "Sample copy activity",
"type": "Copy",
"inputs": [...],
"outputs": [...],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "AzureDataLakeStoreSink"
},
"dataIntegrationUnits": 32
}
}
]
Data Integration Units billing impact
It's important to remember that you are charged based on the total time of the copy operation. The total
duration you are billed for data movement is the sum of duration across DIUs. If a copy job used to take one
hour with two cloud units and now it takes 15 minutes with eight cloud units, the overall bill remains almost the
same.

Parallel Copy
You can use the parallelCopies property to indicate the parallelism that you want Copy Activity to use. You can
think of this property as the maximum number of threads within Copy Activity that can read from your source
or write to your sink data stores in parallel.
For each Copy Activity run, Data Factory determines the number of parallel copies to use to copy data from the
source data store and to the destination data store. The default number of parallel copies that it uses depends
on the type of source and sink that you are using:

Copy data between file-based stores: depends on the size of the files and the number of Data Integration Units (DIUs) used to copy data between two cloud data stores, or the physical configuration of the Self-hosted Integration Runtime machine.
Copy data from any source data store to Azure Table storage: 4
All other copy scenarios: 1

TIP
When copying data between file-based stores, the default behavior (auto-determined) usually gives you the best throughput.

To control the load on machines that host your data stores, or to tune copy performance, you may choose to
override the default value and specify a value for the parallelCopies property. The value must be an integer
greater than or equal to 1. At run time, for the best performance, Copy Activity uses a value that is less than or
equal to the value that you set.

"activities":[
{
"name": "Sample copy activity",
"type": "Copy",
"inputs": [...],
"outputs": [...],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "AzureDataLakeStoreSink"
},
"parallelCopies": 32
}
}
]
Points to note:
When you copy data between file-based stores, parallelCopies determines the parallelism at the file level. Chunking within a single file happens automatically and transparently; it uses the chunk size best suited to a given source data store type to load data in parallel, and it is orthogonal to parallelCopies. The actual number of parallel copies the data movement service uses for the copy operation at run time is no more than the number of files you have. If the copy behavior is mergeFile, Copy Activity cannot take advantage of file-level parallelism.
When you specify a value for the parallelCopies property, consider the load increase on your source and sink data stores, and on the Self-hosted Integration Runtime if the copy activity uses it, for example for hybrid copy. This applies especially when you have multiple activities or concurrent runs of the same activities that run against the same data store. If you notice that either the data store or the Self-hosted Integration Runtime is overwhelmed with the load, decrease the parallelCopies value to relieve the load.
When you copy data from stores that are not file-based to stores that are file-based, the data movement
service ignores the parallelCopies property. Even if parallelism is specified, it's not applied in this case.
parallelCopies is orthogonal to dataIntegrationUnits. The former is counted across all the Data
Integration Units.

Staged copy
When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an
interim staging store. Staging is especially useful in the following cases:
You want to ingest data from various data stores into SQL Data Warehouse via PolyBase. SQL Data
Warehouse uses PolyBase as a high-throughput mechanism to load a large amount of data into SQL Data
Warehouse. However, the source data must be in Blob storage or Azure Data Lake Store, and it must meet
additional criteria. When you load data from a data store other than Blob storage or Azure Data Lake Store,
you can activate data copying via interim staging Blob storage. In that case, Data Factory performs the
required data transformations to ensure that it meets the requirements of PolyBase. Then it uses PolyBase to
load data into SQL Data Warehouse efficiently. For more information, see Use PolyBase to load data into
Azure SQL Data Warehouse.
Sometimes it takes a while to perform a hybrid data movement (that is, to copy from an on-
premises data store to a cloud data store) over a slow network connection. To improve performance,
you can use staged copy to compress the data on-premises so that it takes less time to move data to the
staging data store in the cloud, and then decompress the data in the staging store before loading it into the
destination data store.
You don't want to open ports other than port 80 and port 443 in your firewall, because of corporate
IT policies. For example, when you copy data from an on-premises data store to an Azure SQL Database
sink or an Azure SQL Data Warehouse sink, you need to activate outbound TCP communication on port
1433 for both the Windows firewall and your corporate firewall. In this scenario, staged copy can take
advantage of the Self-hosted Integration Runtime to first copy data to a Blob storage staging instance over
HTTP or HTTPS on port 443, then load the data into SQL Database or SQL Data Warehouse from Blob
storage staging. In this flow, you don't need to enable port 1433.
How staged copy works
When you activate the staging feature, first the data is copied from the source data store to the staging Blob
storage (bring your own). Next, the data is copied from the staging data store to the sink data store. Data
Factory automatically manages the two-stage flow for you. Data Factory also cleans up temporary data from the
staging storage after the data movement is complete.
When you activate data movement by using a staging store, you can specify whether you want the data to be
compressed before moving data from the source data store to an interim or staging data store, and then
decompressed before moving data from an interim or staging data store to the sink data store.
Currently, you can't copy data between two on-premises data stores by using a staging store.
Configuration
Configure the enableStaging setting in Copy Activity to specify whether you want the data to be staged in
Blob storage before you load it into a destination data store. When you set enableStaging to TRUE, specify the additional properties listed in the next table. If you don't already have one, you also need to create an Azure Storage linked service or a Storage shared-access-signature linked service for staging.

enableStaging - Specify whether you want to copy data via an interim staging store. Default value: False. Required: No.
linkedServiceName - Specify the name of an AzureStorage linked service, which refers to the instance of Storage that you use as an interim staging store. You cannot use Storage with a shared access signature to load data into SQL Data Warehouse via PolyBase; you can use it in all other scenarios. Default value: N/A. Required: Yes, when enableStaging is set to TRUE.
path - Specify the Blob storage path that you want to contain the staged data. If you do not provide a path, the service creates a container to store temporary data. Specify a path only if you use Storage with a shared access signature, or you require temporary data to be in a specific location. Default value: N/A. Required: No.
enableCompression - Specifies whether data should be compressed before it is copied to the destination. This setting reduces the volume of data being transferred. Default value: False. Required: No.
NOTE
If you use staged copy with compression enabled, service principal or MSI authentication for staging blob linked service is
not supported.

Here's a sample definition of Copy Activity with the properties that are described in the preceding table:

"activities":[
{
"name": "Sample copy activity",
"type": "Copy",
"inputs": [...],
"outputs": [...],
"typeProperties": {
"source": {
"type": "SqlSource",
},
"sink": {
"type": "SqlSink"
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "MyStagingBlob",
"type": "LinkedServiceReference"
},
"path": "stagingcontainer/path",
"enableCompression": true
}
}
}
]

Staged copy billing impact


You are charged based on two steps: copy duration and copy type.
When you use staging during a cloud copy (copying data from a cloud data store to another cloud data store,
both stages empowered by Azure Integration Runtime), you are charged the [sum of copy duration for step
1 and step 2] x [cloud copy unit price].
When you use staging during a hybrid copy (copying data from an on-premises data store to a cloud data
store, one stage empowered by Self-hosted Integration Runtime), you are charged for [hybrid copy duration]
x [hybrid copy unit price] + [cloud copy duration] x [cloud copy unit price].
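As a purely hypothetical illustration of the second formula: if the hybrid stage takes 30 minutes at a hybrid copy unit price of $0.10 per hour and the cloud stage takes 12 minutes at a cloud copy unit price of $0.25 per hour, the charge would be roughly 0.5 x $0.10 + 0.2 x $0.25 = $0.10. The unit prices here are invented for the arithmetic only; see the Data Factory pricing page for actual rates.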

Performance tuning steps


We suggest that you take these steps to tune the performance of your Data Factory service with Copy Activity:
1. Establish a baseline. During the development phase, test your pipeline by using Copy Activity against a
representative data sample. Collect execution details and performance characteristics following Copy
activity monitoring.
2. Diagnose and optimize performance. If the performance you observe doesn't meet your expectations,
you need to identify performance bottlenecks. Then, optimize performance to remove or reduce the
effect of bottlenecks.
In some cases, when you execute a copy activity in ADF, you will see "Performance tuning tips" at the top of the copy activity monitoring page, as shown in the following example. They not only tell you the bottleneck identified for the given copy run, but also guide you on what to change to boost copy throughput. The performance tuning tips currently provide suggestions such as using PolyBase when copying data into Azure SQL Data Warehouse, increasing the Azure Cosmos DB RUs or Azure SQL DB DTUs when the resource on the data store side is the bottleneck, and removing unnecessary staged copies. The performance tuning rules will be gradually enriched over time.
Example: copy into Azure SQL DB with performance tuning tips
In this sample, during the copy run, ADF notices that the sink Azure SQL DB reaches high DTU utilization, which slows down the write operations. The suggestion is to increase the Azure SQL DB tier to get more DTUs.

In addition, the following are some common considerations. A full description of performance diagnosis
is beyond the scope of this article.
Performance features:
Parallel copy
Data integration units
Staged copy
Self-hosted Integration Runtime scalability
Self-hosted Integration Runtime
Source
Sink
Serialization and deserialization
Compression
Column mapping
Other considerations
3. Expand the configuration to your entire data set. When you're satisfied with the execution results
and performance, you can expand the definition and pipeline to cover your entire data set.

Considerations for Self-hosted Integration Runtime


If your copy activity is executed on a Self-hosted Integration Runtime, note the following:
Setup: We recommend that you use a dedicated machine to host Integration Runtime. See Considerations for
using Self-hosted Integration Runtime.
Scale out: A single logical Self-hosted Integration Runtime with one or more nodes can serve multiple Copy Activity runs concurrently. If you have a heavy need for hybrid data movement, either with a large number of concurrent copy activity runs or with a large volume of data to copy, consider scaling out the Self-hosted Integration Runtime to provision more resources to power the copy.

Considerations for the source


General
Be sure that the underlying data store is not overwhelmed by other workloads that are running on or against it.
For Microsoft data stores, see monitoring and tuning topics that are specific to data stores, and help you
understand data store performance characteristics, minimize response times, and maximize throughput.
If you copy data from Blob storage to SQL Data Warehouse, consider using PolyBase to boost
performance. See Use PolyBase to load data into Azure SQL Data Warehouse for details.
If you copy data from HDFS to Azure Blob/Azure Data Lake Store, consider using DistCp to boost
performance. See Use DistCp to copy data from HDFS for details.
If you copy data from Redshift to Azure SQL Data Warehouse/Azure Blob/Azure Data Lake Store,
consider using UNLOAD to boost performance. See Use UNLOAD to copy data from Amazon Redshift for
details.
File -based data stores
Average file size and file count: Copy Activity transfers data one file at a time. With the same amount of
data to be moved, the overall throughput is lower if the data consists of many small files rather than a few
large files due to the bootstrap phase for each file. Therefore, if possible, combine small files into larger files
to gain higher throughput.
File format and compression: For more ways to improve performance, see the Considerations for
serialization and deserialization and Considerations for compression sections.
Relational data stores
Data pattern: Your table schema affects copy throughput. To copy the same amount of data, a large row size gives you better performance than a small row size, because the database can more efficiently retrieve fewer batches of data containing fewer rows.
Query or stored procedure: Optimize the logic of the query or stored procedure you specify in the Copy
Activity source to fetch data more efficiently.

Considerations for the sink


General
Be sure that the underlying data store is not overwhelmed by other workloads that are running on or against it.
For Microsoft data stores, refer to monitoring and tuning topics that are specific to data stores. These topics can
help you understand data store performance characteristics and how to minimize response times and maximize
throughput.
If you copy data from Blob storage to SQL Data Warehouse, consider using PolyBase to boost
performance. See Use PolyBase to load data into Azure SQL Data Warehouse for details.
If you copy data from HDFS to Azure Blob/Azure Data Lake Store, consider using DistCp to boost
performance. See Use DistCp to copy data from HDFS for details.
If you copy data from Redshift to Azure SQL Data Warehouse/Azure Blob/Azure Data Lake Store,
consider using UNLOAD to boost performance. See Use UNLOAD to copy data from Amazon Redshift for
details.
File -based data stores
Copy behavior: If you copy data from a different file-based data store, Copy Activity has three options via
the copyBehavior property. It preserves hierarchy, flattens hierarchy, or merges files. Either preserving or
flattening hierarchy has little or no performance overhead, but merging files causes performance overhead
to increase.
File format and compression: See the Considerations for serialization and deserialization and
Considerations for compression sections for more ways to improve performance.
Relational data stores
Copy behavior: Depending on the properties you've set for sqlSink, Copy Activity writes data to the
destination database in different ways.
By default, the data movement service uses the Bulk Copy API to insert data in append mode, which
provides the best performance.
If you configure a stored procedure in the sink, the database applies the data one row at a time
instead of as a bulk load. Performance drops significantly. If your data set is large, when applicable,
consider switching to using the preCopyScript property.
If you configure the preCopyScript property, for each Copy Activity run the service triggers the script first and then uses the Bulk Copy API to insert the data. For example, to overwrite the entire table with the latest data, you can specify a script to first delete all records before bulk-loading the new data from the source.
Data pattern and batch size:
Your table schema affects copy throughput. To copy the same amount of data, a large row size gives
you better performance than a small row size because the database can more efficiently commit fewer
batches of data.
Copy Activity inserts data in a series of batches. You can set the number of rows in a batch by using the writeBatchSize property. If your data has small rows, you can set the writeBatchSize property to a higher value to benefit from lower batch overhead and higher throughput. If the row size of your data is large, be careful when you increase writeBatchSize; a high value might lead to a copy failure caused by overloading the database (a configuration sketch follows below).
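As a sketch only, assuming an Azure SQL Database sink (the table name and batch size are hypothetical), the preCopyScript and writeBatchSize properties discussed above sit on the sink like this:

"sink": {
    "type": "SqlSink",
    "preCopyScript": "TRUNCATE TABLE dbo.MyTargetTable",
    "writeBatchSize": 10000
}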
NoSQL stores
For Table storage:
Partition: Writing data to interleaved partitions dramatically degrades performance. Sort your source
data by partition key so that the data is inserted efficiently into one partition after another, or adjust
the logic to write the data to a single partition.

Considerations for serialization and deserialization


Serialization and deserialization can occur when your input data set or output data set is a file. See Supported file and compression formats for details on the file formats supported by Copy Activity.
Copy behavior:
Copying files between file-based data stores:
When the input and output data sets both have the same file format settings, or no file format settings, the data movement service executes a binary copy without any serialization or deserialization. You see a higher throughput compared to the scenario in which the source and sink file format settings differ from each other.
When input and output data sets both are in text format and only the encoding type is different, the
data movement service only does encoding conversion. It doesn't do any serialization and
deserialization, which causes some performance overhead compared to a binary copy.
When input and output data sets both have different file formats or different configurations, like
delimiters, the data movement service deserializes source data to stream, transform, and then serialize
it into the output format you indicated. This operation results in a much more significant performance
overhead compared to other scenarios.
When you copy files to/from a data store that is not file-based (for example, from a file-based store to a
relational store), the serialization or deserialization step is required. This step results in significant
performance overhead.
File format: The file format you choose might affect copy performance. For example, Avro is a compact binary
format that stores metadata with data. It has broad support in the Hadoop ecosystem for processing and
querying. However, Avro is more expensive for serialization and deserialization, which results in lower copy
throughput compared to text format. Make your choice of file format throughout the processing flow
holistically. Start with what form the data is stored in, source data stores or to be extracted from external
systems; the best format for storage, analytical processing, and querying; and in what format the data should be
exported into data marts for reporting and visualization tools. Sometimes a file format that is suboptimal for
read and write performance might be a good choice when you consider the overall analytical process.

Considerations for compression


When your input or output data set is a file, you can set Copy Activity to perform compression or
decompression as it writes data to the destination. When you choose compression, you make a tradeoff
between input/output (I/O ) and CPU. Compressing the data costs extra in compute resources. But in return, it
reduces network I/O and storage. Depending on your data, you may see a boost in overall copy throughput.
Codec: Each compression codec has advantages. For example, bzip2 has the lowest copy throughput, but you
get the best Hive query performance with bzip2 because you can split it for processing. Gzip is the most
balanced option, and it is used the most often. Choose the codec that best suits your end-to-end scenario.
Level: You can choose from two options for each compression codec: fastest compressed and optimally
compressed. The fastest compressed option compresses the data as quickly as possible, even if the resulting file
is not optimally compressed. The optimally compressed option spends more time on compression and yields a
minimal amount of data. You can test both options to see which provides better overall performance in your
case.
A consideration: To copy a large amount of data between an on-premises store and the cloud, consider using
Staged copy with compression enabled. Using interim storage is helpful when the bandwidth of your corporate
network and your Azure services is the limiting factor, and you want the input data set and output data set both
to be in uncompressed form.
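As an illustrative sketch (the container, folder, and compression level are hypothetical), a Blob dataset that tells Copy Activity to write gzip-compressed text could include:

"typeProperties": {
    "folderPath": "mycontainer/output/",
    "format": {
        "type": "TextFormat"
    },
    "compression": {
        "type": "GZip",
        "level": "Optimal"
    }
}

Switching the level to Fastest trades a larger output file for less CPU time, which is the tradeoff described above.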

Considerations for column mapping


You can set the columnMappings property in Copy Activity to map all or a subset of the input columns to the
output columns. After the data movement service reads the data from the source, it needs to perform column
mapping on the data before it writes the data to the sink. This extra processing reduces copy throughput.
If your source data store is queryable, for example, if it's a relational store like SQL Database or SQL Server, or
if it's a NoSQL store like Table storage or Azure Cosmos DB, consider pushing the column filtering and
reordering logic to the query property instead of using column mapping. This way, the projection occurs while
the data movement service reads data from the source data store, where it is much more efficient.
Learn more from Copy Activity schema mapping.
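For instance, instead of mapping away unused columns, a relational source can project and reorder only the needed columns in its query (a sketch; the query and table name are hypothetical):

"source": {
    "type": "SqlSource",
    "sqlReaderQuery": "SELECT UserId, Name FROM dbo.SourceTable"
}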

Other considerations
If the size of data you want to copy is large, you can adjust your business logic to further partition the data and
schedule Copy Activity to run more frequently to reduce the data size for each Copy Activity run.
Be cautious about the number of data sets and copy activities requiring Data Factory to connect to the same
data store at the same time. Many concurrent copy jobs might throttle a data store and lead to degraded
performance, copy job internal retries, and in some cases, execution failures.
Sample scenario: Copy from an on-premises SQL Server to Blob
storage
Scenario: A pipeline is built to copy data from an on-premises SQL Server to Blob storage in CSV format. To
make the copy job faster, the CSV files should be compressed into bzip2 format.
Test and analysis: The throughput of Copy Activity is less than 2 MBps, which is much slower than the
performance benchmark.
Performance analysis and tuning: To troubleshoot the performance issue, let’s look at how the data is
processed and moved.
1. Read data: Integration runtime opens a connection to SQL Server and sends the query. SQL Server
responds by sending the data stream to integration runtime via the intranet.
2. Serialize and compress data: Integration runtime serializes the data stream to CSV format, and
compresses the data to a bzip2 stream.
3. Write data: Integration runtime uploads the bzip2 stream to Blob storage via the Internet.
As you can see, the data is being processed and moved in a streaming sequential manner: SQL Server > LAN >
Integration runtime > WAN > Blob storage. The overall performance is gated by the minimum
throughput across the pipeline.

One or more of the following factors might cause the performance bottleneck:
Source: SQL Server itself has low throughput because of heavy loads.
Self-hosted Integration Runtime:
LAN: Integration runtime is located far from the SQL Server machine and has a low-bandwidth connection.
Integration runtime: Integration runtime has reached its load limitations to perform the following
operations:
Serialization: Serializing the data stream to CSV format has slow throughput.
Compression: You chose a slow compression codec (for example, bzip2, which is 2.8 MBps
with Core i7).
WAN: The bandwidth between the corporate network and your Azure services is low (for example, T1
= 1,544 kbps; T2 = 6,312 kbps).
Sink: Blob storage has low throughput. (This scenario is unlikely because its SLA guarantees a minimum of
60 MBps.)
In this case, bzip2 data compression might be slowing down the entire pipeline. Switching to a gzip
compression codec might ease this bottleneck.

Reference
Here are performance monitoring and tuning references for some of the supported data stores:
Azure Storage (including Blob storage and Table storage): Azure Storage scalability targets and Azure
Storage performance and scalability checklist
Azure SQL Database: You can monitor the performance and check the database transaction unit (DTU )
percentage
Azure SQL Data Warehouse: Its capability is measured in data warehouse units (DWUs); see Manage
compute power in Azure SQL Data Warehouse (Overview)
Azure Cosmos DB: Performance levels in Azure Cosmos DB
On-premises SQL Server: Monitor and tune for performance
On-premises file server: Performance tuning for file servers

Next steps
See the other Copy Activity articles:
Copy activity overview
Copy Activity schema mapping
Copy activity fault tolerance
Transform data in Azure Data Factory
3/7/2019 • 4 minutes to read • Edit Online

Overview
This article explains data transformation activities in Azure Data Factory that you can use to transform and process your raw data into predictions and insights. A transformation activity executes in a computing environment such as an Azure HDInsight cluster or Azure Batch. It provides links to articles with detailed information on each transformation activity.
information on each transformation activity.
Data Factory supports the following data transformation activities that can be added to pipelines either
individually or chained with another activity.

HDInsight Hive activity


The HDInsight Hive activity in a Data Factory pipeline executes Hive queries on your own or on-demand
Windows/Linux-based HDInsight cluster. See Hive activity article for details about this activity.

HDInsight Pig activity


The HDInsight Pig activity in a Data Factory pipeline executes Pig queries on your own or on-demand
Windows/Linux-based HDInsight cluster. See Pig activity article for details about this activity.

HDInsight MapReduce activity


The HDInsight MapReduce activity in a Data Factory pipeline executes MapReduce programs on your own or
on-demand Windows/Linux-based HDInsight cluster. See MapReduce activity article for details about this
activity.

HDInsight Streaming activity


The HDInsight Streaming activity in a Data Factory pipeline executes Hadoop Streaming programs on your own
or on-demand Windows/Linux-based HDInsight cluster. See HDInsight Streaming activity for details about this
activity.

HDInsight Spark activity


The HDInsight Spark activity in a Data Factory pipeline executes Spark programs on your own HDInsight cluster.
For details, see Invoke Spark programs from Azure Data Factory.

Machine Learning activities


Azure Data Factory enables you to easily create pipelines that use a published Azure Machine Learning web
service for predictive analytics. Using the Batch Execution activity in an Azure Data Factory pipeline, you can
invoke a Machine Learning web service to make predictions on the data in batch.
Over time, the predictive models in the Machine Learning scoring experiments need to be retrained using new
input datasets. After you are done with retraining, you want to update the scoring web service with the retrained
Machine Learning model. You can use the Update Resource activity to update the web service with the newly
trained model.
See Use Machine Learning activities for details about these Machine Learning activities.

Stored procedure activity


You can use the SQL Server Stored Procedure activity in a Data Factory pipeline to invoke a stored procedure in
one of the following data stores: Azure SQL Database, Azure SQL Data Warehouse, SQL Server Database in
your enterprise or an Azure VM. See Stored Procedure activity article for details.

Data Lake Analytics U-SQL activity


Data Lake Analytics U-SQL activity runs a U-SQL script on an Azure Data Lake Analytics cluster. See the Data Lake Analytics U-SQL activity article for details.

Databricks Notebook activity


The Azure Databricks Notebook Activity in a Data Factory pipeline runs a Databricks notebook in your Azure Databricks workspace. Azure Databricks is a managed platform for running Apache Spark. See Transform data by running a Databricks notebook.

Databricks Jar activity


The Azure Databricks Jar Activity in a Data Factory pipeline runs a Spark Jar in your Azure Databricks cluster.
Azure Databricks is a managed platform for running Apache Spark. See Transform data by running a Jar activity
in Azure Databricks.

Databricks Python activity


The Azure Databricks Python Activity in a Data Factory pipeline runs a Python file in your Azure Databricks
cluster. Azure Databricks is a managed platform for running Apache Spark. See Transform data by running a
Python activity in Azure Databricks.

Custom activity
If you need to transform data in a way that is not supported by Data Factory, you can create a custom activity
with your own data processing logic and use the activity in the pipeline. You can configure the custom .NET
activity to run using either an Azure Batch service or an Azure HDInsight cluster. See Use custom activities
article for details.
You can create a custom activity to run R scripts on your HDInsight cluster with R installed. See Run R Script
using Azure Data Factory.

Compute environments
You create a linked service for the compute environment and then use the linked service when defining a
transformation activity. There are two types of compute environments supported by Data Factory.
On-Demand: In this case, the computing environment is fully managed by Data Factory. It is automatically
created by the Data Factory service before a job is submitted to process data and removed when the job is
completed. You can configure and control granular settings of the on-demand compute environment for job
execution, cluster management, and bootstrapping actions.
Bring Your Own: In this case, you can register your own computing environment (for example HDInsight
cluster) as a linked service in Data Factory. The computing environment is managed by you and the Data
Factory service uses it to execute the activities.
See Compute Linked Services article to learn about compute services supported by Data Factory.

Next steps
See the following tutorial for an example of using a transformation activity: Tutorial: transform data using Spark
Transform data using Hadoop Hive activity in Azure
Data Factory
3/7/2019 • 2 minutes to read • Edit Online

The HDInsight Hive activity in a Data Factory pipeline executes Hive queries on your own or on-demand
HDInsight cluster. This article builds on the data transformation activities article, which presents a general
overview of data transformation and the supported transformation activities.
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the Tutorial:
transform data before reading this article.

Syntax
{
"name": "Hive Activity",
"description": "description",
"type": "HDInsightHive",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scriptLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"scriptPath": "MyAzureStorage\\HiveScripts\\MyHiveSript.hql",
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
],
"defines": {
"param1": "param1Value"
}
}
}

Syntax details
name - Name of the activity. Required: Yes.
description - Text describing what the activity is used for. Required: No.
type - For Hive Activity, the activity type is HDInsightHive. Required: Yes.
linkedServiceName - Reference to the HDInsight cluster registered as a linked service in Data Factory. To learn about this linked service, see the Compute linked services article. Required: Yes.
scriptLinkedService - Reference to an Azure Storage linked service used to store the Hive script to be executed. If you don't specify this linked service, the Azure Storage linked service defined in the HDInsight linked service is used. Required: No.
scriptPath - Provide the path to the script file stored in the Azure Storage referred to by scriptLinkedService. The file name is case-sensitive. Required: Yes.
getDebugInfo - Specifies when the log files are copied to the Azure Storage used by the HDInsight cluster or specified by scriptLinkedService. Allowed values: None, Always, or Failure. Default value: None. Required: No.
arguments - Specifies an array of arguments for a Hadoop job. The arguments are passed as command-line arguments to each task. Required: No.
defines - Specify parameters as key/value pairs for referencing within the Hive script. Required: No.
queryTimeout - Query timeout value (in minutes). Applicable when the HDInsight cluster has the Enterprise Security Package enabled. Required: No.

Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Machine Learning Batch Execution activity
Stored procedure activity
Transform data using Hadoop Pig activity in Azure
Data Factory
3/7/2019 • 2 minutes to read • Edit Online

The HDInsight Pig activity in a Data Factory pipeline executes Pig queries on your own or on-demand HDInsight
cluster. This article builds on the data transformation activities article, which presents a general overview of data
transformation and the supported transformation activities.
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the Tutorial:
transform data before reading this article.

Syntax
{
"name": "Pig Activity",
"description": "description",
"type": "HDInsightPig",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scriptLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"scriptPath": "MyAzureStorage\\PigScripts\\MyPigSript.pig",
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
],
"defines": {
"param1": "param1Value"
}
}
}

Syntax details
name - Name of the activity. Required: Yes.
description - Text describing what the activity is used for. Required: No.
type - For Pig Activity, the activity type is HDInsightPig. Required: Yes.
linkedServiceName - Reference to the HDInsight cluster registered as a linked service in Data Factory. To learn about this linked service, see the Compute linked services article. Required: Yes.
scriptLinkedService - Reference to an Azure Storage linked service used to store the Pig script to be executed. If you don't specify this linked service, the Azure Storage linked service defined in the HDInsight linked service is used. Required: No.
scriptPath - Provide the path to the script file stored in the Azure Storage referred to by scriptLinkedService. The file name is case-sensitive. Required: No.
getDebugInfo - Specifies when the log files are copied to the Azure Storage used by the HDInsight cluster or specified by scriptLinkedService. Allowed values: None, Always, or Failure. Default value: None. Required: No.
arguments - Specifies an array of arguments for a Hadoop job. The arguments are passed as command-line arguments to each task. Required: No.
defines - Specify parameters as key/value pairs for referencing within the Pig script. Required: No.

Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Machine Learning Batch Execution activity
Stored procedure activity
Transform data using Hadoop MapReduce activity in
Azure Data Factory
3/7/2019 • 2 minutes to read • Edit Online

The HDInsight MapReduce activity in a Data Factory pipeline invokes MapReduce program on your own or on-
demand HDInsight cluster. This article builds on the data transformation activities article, which presents a
general overview of data transformation and the supported transformation activities.
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the tutorial:
Tutorial: transform data before reading this article.
See Pig and Hive for details about running Pig/Hive scripts on a HDInsight cluster from a pipeline by using
HDInsight Pig and Hive activities.

Syntax
{
"name": "Map Reduce Activity",
"description": "Description",
"type": "HDInsightMapReduce",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"className": "org.myorg.SampleClass",
"jarLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"jarFilePath": "MyAzureStorage/jars/sample.jar",
"getDebugInfo": "Failure",
"arguments": [
"-SampleHadoopJobArgument1"
],
"defines": {
"param1": "param1Value"
}
}
}

Syntax details
name - Name of the activity. Required: Yes.
description - Text describing what the activity is used for. Required: No.
type - For MapReduce Activity, the activity type is HDInsightMapReduce. Required: Yes.
linkedServiceName - Reference to the HDInsight cluster registered as a linked service in Data Factory. To learn about this linked service, see the Compute linked services article. Required: Yes.
className - Name of the class to be executed. Required: Yes.
jarLinkedService - Reference to an Azure Storage linked service used to store the Jar files. If you don't specify this linked service, the Azure Storage linked service defined in the HDInsight linked service is used. Required: No.
jarFilePath - Provide the path to the Jar files stored in the Azure Storage referred to by jarLinkedService. The file name is case-sensitive. Required: Yes.
jarlibs - String array of the paths to the Jar library files referenced by the job, stored in the Azure Storage defined in jarLinkedService. The file name is case-sensitive. Required: No.
getDebugInfo - Specifies when the log files are copied to the Azure Storage used by the HDInsight cluster or specified by jarLinkedService. Allowed values: None, Always, or Failure. Default value: None. Required: No.
arguments - Specifies an array of arguments for a Hadoop job. The arguments are passed as command-line arguments to each task. Required: No.
defines - Specify parameters as key/value pairs for referencing within the script. Required: No.

Example
You can use the HDInsight MapReduce Activity to run any MapReduce jar file on an HDInsight cluster. In the
following sample JSON definition of a pipeline, the HDInsight Activity is configured to run a Mahout JAR file.
{
"name": "MapReduce Activity for Mahout",
"description": "Custom MapReduce to generate Mahout result",
"type": "HDInsightMapReduce",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"className": "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
"jarLinkedService": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
},
"jarFilePath": "adfsamples/Mahout/jars/mahout-examples-0.9.0.2.2.7.1-34.jar",
"arguments": [
"-s",
"SIMILARITY_LOGLIKELIHOOD",
"--input",
"wasb://adfsamples@spestore.blob.core.windows.net/Mahout/input",
"--output",
"wasb://adfsamples@spestore.blob.core.windows.net/Mahout/output/",
"--maxSimilaritiesPerItem",
"500",
"--tempDir",
"wasb://adfsamples@spestore.blob.core.windows.net/Mahout/temp/mahout"
]
}
}

You can specify any arguments for the MapReduce program in the arguments section. At runtime, you see a few extra arguments (for example: mapreduce.job.tags) from the MapReduce framework. To differentiate your arguments from the MapReduce arguments, consider using both the option and its value as arguments, as shown in the following example (-s, --input, --output, etc. are options immediately followed by their values).

Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Machine Learning Batch Execution activity
Stored procedure activity
Transform data using Hadoop Streaming activity in
Azure Data Factory
3/7/2019 • 2 minutes to read • Edit Online

The HDInsight Streaming Activity in a Data Factory pipeline executes Hadoop Streaming programs on your own
or on-demand HDInsight cluster. This article builds on the data transformation activities article, which presents a
general overview of data transformation and the supported transformation activities.
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the Tutorial:
transform data before reading this article.

JSON sample
{
"name": "Streaming Activity",
"description": "Description",
"type": "HDInsightStreaming",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"mapper": "MyMapper.exe",
"reducer": "MyReducer.exe",
"combiner": "MyCombiner.exe",
"fileLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"filePaths": [
"<containername>/example/apps/MyMapper.exe",
"<containername>/example/apps/MyReducer.exe",
"<containername>/example/apps/MyCombiner.exe"
],
"input": "wasb://<containername>@<accountname>.blob.core.windows.net/example/input/MapperInput.txt",
"output":
"wasb://<containername>@<accountname>.blob.core.windows.net/example/output/ReducerOutput.txt",
"commandEnvironment": [
"CmdEnvVarName=CmdEnvVarValue"
],
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
],
"defines": {
"param1": "param1Value"
}
}
}

Syntax details
name - Name of the activity. Required: Yes.
description - Text describing what the activity is used for. Required: No.
type - For Hadoop Streaming Activity, the activity type is HDInsightStreaming. Required: Yes.
linkedServiceName - Reference to the HDInsight cluster registered as a linked service in Data Factory. To learn about this linked service, see the Compute linked services article. Required: Yes.
mapper - Specifies the name of the mapper executable. Required: Yes.
reducer - Specifies the name of the reducer executable. Required: Yes.
combiner - Specifies the name of the combiner executable. Required: No.
fileLinkedService - Reference to an Azure Storage linked service used to store the Mapper, Combiner, and Reducer programs to be executed. If you don't specify this linked service, the Azure Storage linked service defined in the HDInsight linked service is used. Required: No.
filePaths - Provide an array of paths to the Mapper, Combiner, and Reducer programs stored in the Azure Storage referred to by fileLinkedService. The path is case-sensitive. Required: Yes.
input - Specifies the WASB path to the input file for the Mapper. Required: Yes.
output - Specifies the WASB path to the output file for the Reducer. Required: Yes.
getDebugInfo - Specifies when the log files are copied to the Azure Storage used by the HDInsight cluster or specified by scriptLinkedService. Allowed values: None, Always, or Failure. Default value: None. Required: No.
arguments - Specifies an array of arguments for a Hadoop job. The arguments are passed as command-line arguments to each task. Required: No.
defines - Specify parameters as key/value pairs for referencing within the script. Required: No.
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Spark activity
.NET custom activity
Machine Learning Batch Execution activity
Stored procedure activity
Transform data using Spark activity in Azure Data
Factory
3/7/2019 • 3 minutes to read

The Spark activity in a Data Factory pipeline executes a Spark program on your own or on-demand HDInsight
cluster. This article builds on the data transformation activities article, which presents a general overview of
data transformation and the supported transformation activities. When you use an on-demand Spark linked
service, Data Factory automatically creates a Spark cluster for you just-in-time to process the data and then
deletes the cluster once the processing is complete.

IMPORTANT
Spark Activity does not support HDInsight Spark clusters that use an Azure Data Lake Store as primary storage.

Spark activity properties


Here is the sample JSON definition of a Spark Activity:

{
"name": "Spark Activity",
"description": "Description",
"type": "HDInsightSpark",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"sparkJobLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"rootPath": "adfspark\\pyFiles",
"entryFilePath": "test.py",
"sparkConfig": {
"ConfigItem1": "Value"
},
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
]
}
}

The following table describes the JSON properties used in the JSON definition:

PROPERTY | DESCRIPTION | REQUIRED
name | Name of the activity in the pipeline. | Yes
description | Text describing what the activity does. | No
type | For Spark Activity, the activity type is HDInsightSpark. | Yes
linkedServiceName | Name of the HDInsight Spark Linked Service on which the Spark program runs. To learn about this linked service, see Compute linked services article. | Yes
sparkJobLinkedService | The Azure Storage linked service that holds the Spark job file, dependencies, and logs. If you do not specify a value for this property, the storage associated with the HDInsight cluster is used. The value of this property can only be an Azure Storage linked service. | No
rootPath | The Azure Blob container and folder that contains the Spark file. The file name is case-sensitive. Refer to the folder structure section (next section) for details about the structure of this folder. | Yes
entryFilePath | Relative path to the root folder of the Spark code/package. The entry file must be either a Python file or a .jar file. | Yes
className | Application's Java/Spark main class | No
arguments | A list of command-line arguments to the Spark program. | No
proxyUser | The user account to impersonate to execute the Spark program | No
sparkConfig | Specify values for Spark configuration properties listed in the topic: Spark Configuration - Application properties. | No
getDebugInfo | Specifies when the Spark log files are copied to the Azure storage used by the HDInsight cluster (or) specified by sparkJobLinkedService. Allowed values: None, Always, or Failure. Default value: None. | No
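As an illustration, the sparkConfig section accepts standard Spark application property names; the following is only a sketch with placeholder values, not a recommended configuration:

"sparkConfig": {
    "spark.executor.memory": "4g",
    "spark.executor.cores": "2",
    "spark.yarn.maxAppAttempts": "1"
}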

Folder structure
Spark jobs are more extensible than Pig/Hive jobs. For Spark jobs, you can provide multiple dependencies
such as jar packages (placed in the java CLASSPATH), python files (placed on the PYTHONPATH), and any
other files.
Create the following folder structure in the Azure Blob storage referenced by the HDInsight linked service.
Then, upload dependent files to the appropriate sub folders in the root folder represented by entryFilePath.
For example, upload python files to the pyFiles subfolder and jar files to the jars subfolder of the root folder. At
runtime, Data Factory service expects the following folder structure in the Azure Blob storage:

PATH | DESCRIPTION | REQUIRED | TYPE
. (root) | The root path of the Spark job in the storage linked service | Yes | Folder
<user defined> | The path pointing to the entry file of the Spark job | Yes | File
./jars | All files under this folder are uploaded and placed on the java classpath of the cluster | No | Folder
./pyFiles | All files under this folder are uploaded and placed on the PYTHONPATH of the cluster | No | Folder
./files | All files under this folder are uploaded and placed on the executor working directory | No | Folder
./archives | All files under this folder are uncompressed | No | Folder
./logs | The folder that contains logs from the Spark cluster. | No | Folder

Here is an example for a storage containing two Spark job files in the Azure Blob Storage referenced by the
HDInsight linked service.

SparkJob1
    main.jar
    files
        input1.txt
        input2.txt
    jars
        package1.jar
        package2.jar
    logs

SparkJob2
    main.py
    pyFiles
        script1.py
        script2.py
    logs
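Assuming the two job folders above live in a container named adfspark, the Spark activity for SparkJob2 would reference this layout roughly as follows (a sketch only; the container name is illustrative):

"typeProperties": {
    "rootPath": "adfspark\\SparkJob2",
    "entryFilePath": "main.py"
}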

Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Machine Learning Batch Execution activity
Stored procedure activity
Create predictive pipelines using Azure Machine
Learning and Azure Data Factory
3/12/2019 • 7 minutes to read

Azure Machine Learning enables you to build, test, and deploy predictive analytics solutions. From a high-level
point of view, it is done in three steps:
1. Create a training experiment. You do this step by using the Azure Machine Learning studio. Azure
Machine Learning studio is a collaborative visual development environment that you use to train and test a
predictive analytics model using training data.
2. Convert it to a predictive experiment. Once your model has been trained with existing data and you are
ready to use it to score new data, you prepare and streamline your experiment for scoring.
3. Deploy it as a web service. You can publish your scoring experiment as an Azure web service. You can
send data to your model via this web service end point and receive result predictions from the model.
Data Factory and Machine Learning together
Azure Data Factory enables you to easily create pipelines that use a published Azure Machine Learning web
service for predictive analytics. Using the Batch Execution Activity in an Azure Data Factory pipeline, you can
invoke an Azure Machine Learning studio web service to make predictions on the data in batch.
Over time, the predictive models in the Azure Machine Learning studio scoring experiments need to be
retrained using new input datasets. You can retrain a model from a Data Factory pipeline by doing the following
steps:
1. Publish the training experiment (not predictive experiment) as a web service. You do this step in the Azure
Machine Learning studio as you did to expose predictive experiment as a web service in the previous
scenario.
2. Use the Azure Machine Learning studio Batch Execution Activity to invoke the web service for the training
experiment. Basically, you can use the Azure Machine Learning studio Batch Execution activity to invoke
both training web service and scoring web service.
After you are done with retraining, update the scoring web service (predictive experiment exposed as a web
service) with the newly trained model by using the Azure Machine Learning studio Update Resource
Activity. See Updating models using Update Resource Activity article for details.

Azure Machine Learning linked service


You create an Azure Machine Learning linked service to link an Azure Machine Learning Web Service to an
Azure data factory. The Linked Service is used by Azure Machine Learning Batch Execution Activity and Update
Resource Activity.
{
"type" : "linkedServices",
"name": "AzureMLLinkedService",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "URL to Azure ML Predictive Web Service",
"apiKey": {
"type": "SecureString",
"value": "api key"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

See Compute linked services article for descriptions about properties in the JSON definition.
Azure Machine Learning supports both Classic Web Services and New Web Services for your predictive
experiment. You can choose the right one to use from Data Factory. To get the information required to create
the Azure Machine Learning Linked Service, go to https://services.azureml.net, where all your (new) Web
Services and Classic Web Services are listed. Click the Web Service you would like to access, and then click the
Consume page. Copy the Primary Key for the apiKey property, and the Batch Requests value for the mlEndpoint property.

Azure Machine Learning Batch Execution activity


The following JSON snippet defines an Azure Machine Learning Batch Execution activity. The activity definition
has a reference to the Azure Machine Learning linked service you created earlier.
{
"name": "AzureMLExecutionActivityTemplate",
"description": "description",
"type": "AzureMLBatchExecution",
"linkedServiceName": {
"referenceName": "AzureMLLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"webServiceInputs": {
"<web service input name 1>": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService1",
"type": "LinkedServiceReference"
},
"FilePath":"path1"
},
"<web service input name 2>": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService1",
"type": "LinkedServiceReference"
},
"FilePath":"path2"
}
},
"webServiceOutputs": {
"<web service output name 1>": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService2",
"type": "LinkedServiceReference"
},
"FilePath":"path3"
},
"<web service output name 2>": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService2",
"type": "LinkedServiceReference"
},
"FilePath":"path4"
}
},
"globalParameters": {
"<Parameter 1 Name>": "<parameter value>",
"<parameter 2 name>": "<parameter 2 value>"
}
}
}

PROPERTY | DESCRIPTION | REQUIRED
name | Name of the activity in the pipeline | Yes
description | Text describing what the activity does. | No
type | For Azure Machine Learning Batch Execution activity, the activity type is AzureMLBatchExecution. | Yes
linkedServiceName | Reference to the Azure Machine Learning Linked Service. To learn about this linked service, see Compute linked services article. | Yes
webServiceInputs | Key, Value pairs, mapping the names of Azure Machine Learning Web Service Inputs. Key must match the input parameters defined in the published Azure Machine Learning Web Service. Value is an Azure Storage Linked Service and FilePath properties pair specifying the input Blob locations. | No
webServiceOutputs | Key, Value pairs, mapping the names of Azure Machine Learning Web Service Outputs. Key must match the output parameters defined in the published Azure Machine Learning Web Service. Value is an Azure Storage Linked Service and FilePath properties pair specifying the output Blob locations. | No
globalParameters | Key, Value pairs to be passed to the Azure Machine Learning studio Batch Execution Service endpoint. Keys must match the names of web service parameters defined in the published Azure Machine Learning studio web service. Values are passed in the GlobalParameters property of the Azure Machine Learning studio batch execution request. | No

Scenario 1: Experiments using Web service inputs/outputs that refer to data in Azure Blob Storage
In this scenario, the Azure Machine Learning Web service makes predictions using data from a file in Azure
Blob storage and stores the prediction results in the blob storage. The following JSON defines a Data Factory
pipeline with an AzureMLBatchExecution activity. The input and output data in Azure Blob Storage is referenced
by using a LinkedServiceName and FilePath pair. In the sample, the Linked Services for the inputs and outputs
are different; you can use a different Linked Service for each of your inputs/outputs so that Data Factory can pick
up the right files and send them to the Azure Machine Learning studio Web Service.

IMPORTANT
In your Azure Machine Learning studio experiment, web service input and output ports, and global parameters have
default names ("input1", "input2") that you can customize. The names you use for webServiceInputs, webServiceOutputs,
and globalParameters settings must exactly match the names in the experiments. You can view the sample request
payload on the Batch Execution Help page for your Azure Machine Learning studio endpoint to verify the expected
mapping.
{
"name": "AzureMLExecutionActivityTemplate",
"description": "description",
"type": "AzureMLBatchExecution",
"linkedServiceName": {
"referenceName": "AzureMLLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"webServiceInputs": {
"input1": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService1",
"type": "LinkedServiceReference"
},
"FilePath":"amltest/input/in1.csv"
},
"input2": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService1",
"type": "LinkedServiceReference"
},
"FilePath":"amltest/input/in2.csv"
}
},
"webServiceOutputs": {
"outputName1": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService2",
"type": "LinkedServiceReference"
},
"FilePath":"amltest2/output/out1.csv"
},
"outputName2": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService2",
"type": "LinkedServiceReference"
},
"FilePath":"amltest2/output/out2.csv"
}
}
}
}

Scenario 2: Experiments using Reader/Writer Modules to refer to data in various storages


Another common scenario when creating Azure Machine Learning studio experiments is to use the Import Data
and Output Data modules. The Import Data module is used to load data into an experiment, and the Output
Data module is used to save data from your experiments. For details about the Import Data and Output Data
modules, see the Import Data and Output Data topics on the MSDN Library.
When using the Import Data and Output Data modules, it is good practice to use a Web service parameter for
each property of these modules. These Web service parameters enable you to configure the values at runtime. For
example, you could create an experiment with an Import Data module that uses an Azure SQL Database:
XXX.database.windows.net. After the web service has been deployed, you want to enable the consumers of the
web service to specify another Azure SQL server called YYY.database.windows.net. You can use a Web service
parameter to allow this value to be configured.
NOTE
Web service input and output are different from Web service parameters. In the first scenario, you have seen how an
input and output can be specified for an Azure Machine Learning studio Web service. In this scenario, you pass
parameters for a Web service that correspond to properties of Import Data/Output Data modules.

Let's look at a scenario for using Web service parameters. You have a deployed Azure Machine Learning web
service that uses a reader module to read data from one of the data sources supported by Azure Machine
Learning (for example: Azure SQL Database). After the batch execution is performed, the results are written
using a Writer module (Azure SQL Database). No web service inputs and outputs are defined in the
experiments. In this case, we recommend that you configure relevant web service parameters for the reader and
writer modules. This configuration allows the reader/writer modules to be configured when using the
AzureMLBatchExecution activity. You specify Web service parameters in the globalParameters section in the
activity JSON as follows.

"typeProperties": {
"globalParameters": {
"Database server name": "<myserver>.database.windows.net",
"Database name": "<database>",
"Server user account name": "<user name>",
"Server user account password": "<password>"
}
}

NOTE
The Web service parameters are case-sensitive, so ensure that the names you specify in the activity JSON match the ones
exposed by the Web service.

After you are done with retraining, update the scoring web service (predictive experiment exposed as a web
service) with the newly trained model by using the Azure Machine Learning studio Update Resource
Activity. See Updating models using Update Resource Activity article for details.

Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Stored procedure activity
Update Azure Machine Learning models by using
Update Resource activity
3/18/2019 • 6 minutes to read

This article complements the main Azure Data Factory - Azure Machine Learning integration article: Create
predictive pipelines using Azure Machine Learning and Azure Data Factory. If you haven't already done so, review
the main article before reading through this article.

Overview
As part of the process of operationalizing Azure Machine Learning models, your model is trained and saved. You
then use it to create a predictive Web service. The Web service can then be consumed in web sites, dashboards,
and mobile apps.
Models you create using Machine Learning are typically not static. As new data becomes available, or when the
consumer of the API has their own data, the model needs to be retrained. Refer to Retrain a Machine Learning
Model for details about how you can retrain a model in Azure Machine Learning.
Retraining may occur frequently. With the Batch Execution activity and the Update Resource activity, you can
operationalize Azure Machine Learning model retraining and update the predictive Web Service by using Data
Factory.
The following picture depicts the relationship between training and predictive Web Services.

Azure Machine Learning update resource activity


The following JSON snippet defines an Azure Machine Learning Update Resource activity.
{
"name": "amlUpdateResource",
"type": "AzureMLUpdateResource",
"description": "description",
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "updatableScoringEndpoint2"
},
"typeProperties": {
"trainedModelName": "ModelName",
"trainedModelLinkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "StorageLinkedService"
},
"trainedModelFilePath": "ilearner file path"
}
}

PROPERTY | DESCRIPTION | REQUIRED
name | Name of the activity in the pipeline | Yes
description | Text describing what the activity does. | No
type | For Azure Machine Learning Update Resource activity, the activity type is AzureMLUpdateResource. | Yes
linkedServiceName | Azure Machine Learning linked service that contains the updateResourceEndpoint property. | Yes
trainedModelName | Name of the Trained Model module in the Web Service experiment to be updated | Yes
trainedModelLinkedServiceName | Name of the Azure Storage linked service holding the ilearner file that is uploaded by the update operation | Yes
trainedModelFilePath | The relative file path in trainedModelLinkedService to represent the ilearner file that is uploaded by the update operation | Yes

End-to-end workflow
The entire process of operationalizing the retraining of a model and updating the predictive Web Service involves the
following steps:
Invoke the training Web Service by using the Batch Execution activity. Invoking a training Web Service is
the same as invoking a predictive Web Service described in Create predictive pipelines using Azure Machine
Learning and Data Factory Batch Execution activity. The output of the training Web Service is an iLearner file
that you can use to update the predictive Web Service.
Invoke the update resource endpoint of the predictive Web Service by using the Update Resource
activity to update the Web Service with the newly trained model.
Azure Machine Learning linked service
For the end-to-end workflow described above to work, you need to create two Azure Machine Learning linked
services:
1. An Azure Machine Learning linked service to the training web service. This linked service is used by the Batch
Execution activity in the same way as described in Create predictive pipelines using Azure Machine
Learning and Data Factory Batch Execution activity. The difference is that the output of the training web service is an
iLearner file, which is then used by the Update Resource activity to update the predictive web service.
2. An Azure Machine Learning linked service to the update resource endpoint of the predictive web service. This
linked service is used by the Update Resource activity to update the predictive web service using the iLearner file
returned from the previous step.
For the second Azure Machine Learning linked service, the configuration differs depending on whether your Azure
Machine Learning Web Service is a classic Web Service or a new Web Service. The differences are discussed separately in
the following sections.

Web service is new Azure Resource Manager web service


If the web service is the new type of web service that exposes an Azure Resource Manager endpoint, you do not
need to add the second non-default endpoint. The updateResourceEndpoint in the linked service is of the
format:

https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resource-group-
name}/providers/Microsoft.MachineLearning/webServices/{web-service-name}?api-version=2016-05-01-preview

You can get values for the placeholders in the URL when querying the web service on the Azure Machine Learning
Web Services Portal.
The new type of update resource endpoint requires service principal authentication. To use service principal
authentication, register an application entity in Azure Active Directory (Azure AD) and grant it the Contributor or
Owner role of the subscription or the resource group where the web service belongs. See how to create a
service principal and assign permissions to manage Azure resources. Make note of the following values, which you
use to define the linked service:
Application ID
Application key
Tenant ID
Here is a sample linked service definition:
{
"name": "AzureMLLinkedService",
"properties": {
"type": "AzureML",
"description": "The linked service for AML web service.",
"typeProperties": {
"mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/0000000000000000
000000000000000000000/services/0000000000000000000000000000000000000/jobs?api-version=2.0",
"apiKey": {
"type": "SecureString",
"value": "APIKeyOfEndpoint1"
},
"updateResourceEndpoint":
"https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resource-group-
name}/providers/Microsoft.MachineLearning/webServices/{web-service-name}?api-version=2016-05-01-preview",
"servicePrincipalId": "000000000-0000-0000-0000-0000000000000",
"servicePrincipalKey": {
"type": "SecureString",
"value": "servicePrincipalKey"
},
"tenant": "mycompany.com"
}
}
}

The following scenario provides more details. It has an example for retraining and updating Azure Machine
Learning studio models from an Azure Data Factory pipeline.

Sample: Retraining and updating an Azure Machine Learning model


This section provides a sample pipeline that uses the Azure Machine Learning studio Batch Execution
activity to retrain a model. The pipeline also uses the Azure Machine Learning studio Update Resource
activity to update the model in the scoring web service. The section also provides JSON snippets for all the
linked services, datasets, and pipeline in the example.
Azure Blob storage linked service:
The Azure Storage holds the following data:
training data. The input data for the Azure Machine Learning studio training web service.
iLearner file. The output from the Azure Machine Learning studio training web service. This file is also the
input to the Update Resource activity.
Here is the sample JSON definition of the linked service:

{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=name;AccountKey=key"
}
}
}

Linked service for Azure Machine Learning studio training endpoint


The following JSON snippet defines an Azure Machine Learning linked service that points to the default endpoint
of the training web service.
{
"name": "trainingEndpoint",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/xxx/services/--training
experiment--/jobs",
"apiKey": "myKey"
}
}
}

In Azure Machine Learning studio, do the following to get values for mlEndpoint and apiKey:
1. Click WEB SERVICES on the left menu.
2. Click the training web service in the list of web services.
3. Click copy next to API key text box. Paste the key in the clipboard into the Data Factory JSON editor.
4. In the Azure Machine Learning studio, click BATCH EXECUTION link.
5. Copy the Request URI from the Request section and paste it into the Data Factory JSON editor.
Linked service for Azure Machine Learning studio updatable scoring endpoint:
The following JSON snippet defines an Azure Machine Learning linked service that points to the updatable
endpoint of the scoring web service.

{
"name": "updatableScoringEndpoint2",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint":
"https://ussouthcentral.services.azureml.net/workspaces/00000000eb0abe4d6bbb1d7886062747d7/services/0000000002
6734a5889e02fbb1f65cefd/jobs?api-version=2.0",
"apiKey":
"sooooooooooh3WvG1hBfKS2BNNcfwSO7hhY6dY98noLfOdqQydYDIXyf2KoIaN3JpALu/AKtflHWMOCuicm/Q==",
"updateResourceEndpoint": "https://management.azure.com/subscriptions/00000000-0000-0000-0000-
000000000000/resourceGroups/Default-MachineLearning-
SouthCentralUS/providers/Microsoft.MachineLearning/webServices/myWebService?api-version=2016-05-01-preview",
"servicePrincipalId": "fe200044-c008-4008-a005-94000000731",
"servicePrincipalKey": "zWa0000000000Tp6FjtZOspK/WMA2tQ08c8U+gZRBlw=",
"tenant": "mycompany.com"
}
}
}

Pipeline
The pipeline has two activities: AzureMLBatchExecution and AzureMLUpdateResource. The Batch Execution
activity takes the training data as input and produces an iLearner file as output. The Update Resource activity
then takes this iLearner file and uses it to update the predictive web service.
{
"name": "LookupPipelineDemo",
"properties": {
"activities": [
{
"name": "amlBEGetilearner",
"description": "Use AML BES to get the ileaner file from training web service",
"type": "AzureMLBatchExecution",
"linkedServiceName": {
"referenceName": "trainingEndpoint",
"type": "LinkedServiceReference"
},
"typeProperties": {
"webServiceInputs": {
"input1": {
"LinkedServiceName":{
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"FilePath":"azuremltesting/input"
},
"input2": {
"LinkedServiceName":{
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"FilePath":"azuremltesting/input"
}
},
"webServiceOutputs": {
"output1": {
"LinkedServiceName":{
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"FilePath":"azuremltesting/output"
}
}
}
},
{
"name": "amlUpdateResource",
"type": "AzureMLUpdateResource",
"description": "Use AML Update Resource to update the predict web service",
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "updatableScoringEndpoint2"
},
"typeProperties": {
"trainedModelName": "ADFV2Sample Model [trained model]",
"trainedModelLinkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "StorageLinkedService"
},
"trainedModelFilePath": "azuremltesting/output/newModelForArm.ilearner"
},
"dependsOn": [
{
"activity": "amlbeGetilearner",
"dependencyConditions": [ "Succeeded" ]
}
]
}
]
}
}
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Stored procedure activity
Transform data by using the SQL Server Stored
Procedure activity in Azure Data Factory
3/7/2019 • 3 minutes to read

You use data transformation activities in a Data Factory pipeline to transform and process raw data into
predictions and insights. The Stored Procedure Activity is one of the transformation activities that Data Factory
supports. This article builds on the transform data article, which presents a general overview of data
transformation and the supported transformation activities in Data Factory.

NOTE
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the tutorial: Tutorial:
transform data before reading this article.

You can use the Stored Procedure Activity to invoke a stored procedure in one of the following data stores in
your enterprise or on an Azure virtual machine (VM ):
Azure SQL Database
Azure SQL Data Warehouse
SQL Server Database. If you are using SQL Server, install Self-hosted integration runtime on the same
machine that hosts the database or on a separate machine that has access to the database. Self-Hosted
integration runtime is a component that connects data sources on-premises/on Azure VM with cloud
services in a secure and managed way. See Self-hosted integration runtime article for details.

IMPORTANT
When copying data into Azure SQL Database or SQL Server, you can configure the SqlSink in copy activity to invoke a
stored procedure by using the sqlWriterStoredProcedureName property. For details about the property, see following
connector articles: Azure SQL Database, SQL Server. Invoking a stored procedure while copying data into an Azure SQL
Data Warehouse by using a copy activity is not supported. But, you can use the stored procedure activity to invoke a
stored procedure in a SQL Data Warehouse.
When copying data from Azure SQL Database or SQL Server or Azure SQL Data Warehouse, you can configure
SqlSource in copy activity to invoke a stored procedure to read data from the source database by using the
sqlReaderStoredProcedureName property. For more information, see the following connector articles: Azure SQL
Database, SQL Server, Azure SQL Data Warehouse

Syntax details
Here is the JSON format for defining a Stored Procedure Activity:
{
"name": "Stored Procedure Activity",
"description":"Description",
"type": "SqlServerStoredProcedure",
"linkedServiceName": {
"referenceName": "AzureSqlLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"storedProcedureName": "usp_sample",
"storedProcedureParameters": {
"identifier": { "value": "1", "type": "Int" },
"stringData": { "value": "str1" }

}
}
}

The following table describes these JSON properties:

PROPERTY | DESCRIPTION | REQUIRED
name | Name of the activity | Yes
description | Text describing what the activity is used for | No
type | For Stored Procedure Activity, the activity type is SqlServerStoredProcedure | Yes
linkedServiceName | Reference to the Azure SQL Database or Azure SQL Data Warehouse or SQL Server registered as a linked service in Data Factory. To learn about this linked service, see Compute linked services article. | Yes
storedProcedureName | Specify the name of the stored procedure to invoke. | Yes
storedProcedureParameters | Specify the values for stored procedure parameters. Use "param1": { "value": "param1Value", "type": "param1Type" } to pass parameter values and their type supported by the data source. If you need to pass null for a parameter, use "param1": { "value": null } (all lower case). | No
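For instance, a storedProcedureParameters sketch that passes a typed value, an untyped string, and a null might look like the following (the parameter names are made up for illustration):

"storedProcedureParameters": {
    "OrderCount": { "value": "100", "type": "Int" },
    "RegionName": { "value": "West US" },
    "Comments": { "value": null }
}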

Error info
When a stored procedure fails and returns error details, you can't capture the error info directly in the activity
output. However, Data Factory pumps all of its activity run events, including the error details, to Azure Monitor.
You can, for example, set up email alerts from those events. For more info, see Alert and Monitor data factories
using Azure Monitor.
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL Activity
Hive Activity
Pig Activity
MapReduce Activity
Hadoop Streaming Activity
Spark Activity
.NET custom activity
Machine Learning Batch Execution Activity
Stored procedure activity
Transform data by running U-SQL scripts on Azure
Data Lake Analytics
3/11/2019 • 5 minutes to read

A pipeline in an Azure data factory processes data in linked storage services by using linked compute services.
It contains a sequence of activities where each activity performs a specific processing operation. This article
describes the Data Lake Analytics U-SQL Activity that runs a U-SQL script on an Azure Data Lake
Analytics compute linked service.
Create an Azure Data Lake Analytics account before creating a pipeline with a Data Lake Analytics U-SQL
Activity. To learn about Azure Data Lake Analytics, see Get started with Azure Data Lake Analytics.

Azure Data Lake Analytics linked service


You create an Azure Data Lake Analytics linked service to link an Azure Data Lake Analytics compute service
to an Azure data factory. The Data Lake Analytics U-SQL activity in the pipeline refers to this linked service.
The following table provides descriptions for the generic properties used in the JSON definition.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property should be set to: AzureDataLakeAnalytics. | Yes
accountName | Azure Data Lake Analytics Account Name. | Yes
dataLakeAnalyticsUri | Azure Data Lake Analytics URI. | No
subscriptionId | Azure subscription ID | No
resourceGroupName | Azure resource group name | No
Service principal authentication


The Azure Data Lake Analytics linked service requires service principal authentication to connect to the Azure
Data Lake Analytics service. To use service principal authentication, register an application entity in Azure Active
Directory (Azure AD) and grant it access to both the Data Lake Analytics and the Data Lake Store it uses.
For detailed steps, see Service-to-service authentication. Make note of the following values, which you use to
define the linked service:
Application ID
Application key
Tenant ID
Grant the service principal permission to your Azure Data Lake Analytics by using the Add User Wizard.
Use service principal authentication by specifying the following properties:
PROPERTY | DESCRIPTION | REQUIRED
servicePrincipalId | Specify the application's client ID. | Yes
servicePrincipalKey | Specify the application's key. | Yes
tenant | Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse in the upper-right corner of the Azure portal. | Yes

Example: Service principal authentication

{
"name": "AzureDataLakeAnalyticsLinkedService",
"properties": {
"type": "AzureDataLakeAnalytics",
"typeProperties": {
"accountName": "<account name>",
"dataLakeAnalyticsUri": "<azure data lake analytics URI>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"value": "<service principal key>",
"type": "SecureString"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"subscriptionId": "<optional, subscription id of ADLA>",
"resourceGroupName": "<optional, resource group name of ADLA>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

To learn more about the linked service, see Compute linked services.

Data Lake Analytics U-SQL Activity


The following JSON snippet defines a pipeline with a Data Lake Analytics U-SQL Activity. The activity
definition has a reference to the Azure Data Lake Analytics linked service you created earlier. To execute a Data
Lake Analytics U-SQL script, Data Factory submits the script you specified to Data Lake Analytics, and the
required inputs and outputs are defined in the script for Data Lake Analytics to fetch and output.
{
"name": "ADLA U-SQL Activity",
"description": "description",
"type": "DataLakeAnalyticsU-SQL",
"linkedServiceName": {
"referenceName": "<linked service name of Azure Data Lake Analytics>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scriptLinkedService": {
"referenceName": "<linked service name of Azure Data Lake Store or Azure Storage which contains
the U-SQL script>",
"type": "LinkedServiceReference"
},
"scriptPath": "scripts\\kona\\SearchLogProcessing.txt",
"degreeOfParallelism": 3,
"priority": 100,
"parameters": {
"in": "/datalake/input/SearchLog.tsv",
"out": "/datalake/output/Result.tsv"
}
}
}

The following table describes names and descriptions of properties that are specific to this activity.

PROPERTY | DESCRIPTION | REQUIRED
name | Name of the activity in the pipeline | Yes
description | Text describing what the activity does. | No
type | For Data Lake Analytics U-SQL activity, the activity type is DataLakeAnalyticsU-SQL. | Yes
linkedServiceName | Linked Service to Azure Data Lake Analytics. To learn about this linked service, see Compute linked services article. | Yes
scriptPath | Path to the folder that contains the U-SQL script. Name of the file is case-sensitive. | Yes
scriptLinkedService | Linked service that links the Azure Data Lake Store or Azure Storage that contains the script to the data factory | Yes
degreeOfParallelism | The maximum number of nodes simultaneously used to run the job. | No
priority | Determines which jobs out of all that are queued should be selected to run first. The lower the number, the higher the priority. | No
parameters | Parameters to pass into the U-SQL script. | No
runtimeVersion | Runtime version of the U-SQL engine to use. | No
compilationMode | Compilation mode of U-SQL. Must be one of these values: Semantic: Only perform semantic checks and necessary sanity checks. Full: Perform the full compilation, including syntax check, optimization, code generation, etc. SingleBox: Perform the full compilation, with TargetType setting to SingleBox. If you don't specify a value for this property, the server determines the optimal compilation mode. | No

See SearchLogProcessing.txt for the script definition.

Sample U-SQL script


@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int,
Urls string,
ClickedUrls string
FROM @in
USING Extractors.Tsv(nullEscape:"#NULL#");

@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "en-gb";

@rs1 =
SELECT Start, Region, Duration
FROM @rs1
WHERE Start <= DateTime.Parse("2012/02/19");

OUTPUT @rs1
TO @out
USING Outputters.Tsv(quoting:false, dateTimeFormat:null);

In the above script example, the input and output of the script are defined in the @in and @out parameters. The
values for the @in and @out parameters in the U-SQL script are passed dynamically by Data Factory using the
'parameters' section.
You can specify other properties such as degreeOfParallelism and priority as well in your pipeline definition for
the jobs that run on the Azure Data Lake Analytics service.

Dynamic parameters
In the sample pipeline definition, in and out parameters are assigned with hard-coded values.

"parameters": {
"in": "/datalake/input/SearchLog.tsv",
"out": "/datalake/output/Result.tsv"
}

It is possible to use dynamic parameters instead. For example:

"parameters": {
"in": "/datalake/input/@{formatDateTime(pipeline().parameters.WindowStart,'yyyy/MM/dd')}/data.tsv",
"out": "/datalake/output/@{formatDateTime(pipeline().parameters.WindowStart,'yyyy/MM/dd')}/result.tsv"
}

In this case, input files are still picked up from the /datalake/input folder and output files are generated in the
/datalake/output folder. The file names are dynamic, based on the window start time that is passed in when the
pipeline gets triggered.
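For this pattern to work, the pipeline itself needs a parameter such as WindowStart that the trigger (or a manual run) supplies. A minimal sketch of the pipeline-level parameter definition could look like the following (the default value is purely illustrative):

"parameters": {
    "WindowStart": {
        "type": "String",
        "defaultValue": "2019-01-01T00:00:00Z"
    }
}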

Next steps
See the following articles that explain how to transform data in other ways:
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Machine Learning Batch Execution activity
Stored procedure activity
Transform data by running a Databricks notebook
3/7/2019 • 2 minutes to read

The Azure Databricks Notebook Activity in a Data Factory pipeline runs a Databricks notebook in your Azure
Databricks workspace. This article builds on the data transformation activities article, which presents a general
overview of data transformation and the supported transformation activities. Azure Databricks is a managed
platform for running Apache Spark.

Databricks Notebook activity definition


Here is the sample JSON definition of a Databricks Notebook Activity:

{
"activity": {
"name": "MyActivity",
"description": "MyActivity description",
"type": "DatabricksNotebook",
"linkedServiceName": {
"referenceName": "MyDatabricksLinkedservice",
"type": "LinkedServiceReference"
},
"typeProperties": {
"notebookPath": "/Users/user@example.com/ScalaExampleNotebook",
"baseParameters": {
"inputpath": "input/folder1/",
"outputpath": "output/"
},
"libraries": [
{
"jar": "dbfs:/docs/library.jar"
}
]
}
}
}

Databricks Notebook activity properties


The following table describes the JSON properties used in the JSON definition:

PROPERTY | DESCRIPTION | REQUIRED
name | Name of the activity in the pipeline. | Yes
description | Text describing what the activity does. | No
type | For Databricks Notebook Activity, the activity type is DatabricksNotebook. | Yes
linkedServiceName | Name of the Databricks Linked Service on which the Databricks notebook runs. To learn about this linked service, see Compute linked services article. | Yes
notebookPath | The absolute path of the notebook to be run in the Databricks Workspace. This path must begin with a slash. | Yes
baseParameters | An array of Key-Value pairs. Base parameters can be used for each activity run. If the notebook takes a parameter that is not specified, the default value from the notebook will be used. Find more on parameters in Databricks Notebooks. | No
libraries | A list of libraries to be installed on the cluster that will execute the job. It can be an array of <string, object>. | No

Supported libraries for Databricks activities


In the above Databricks activity definition, you specify these library types: jar, egg, maven, pypi, cran.

{
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"egg": "dbfs:/mnt/libraries/library.egg"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2",
"exclusions": [ "slf4j:slf4j" ]
}
},
{
"pypi": {
"package": "simplejson",
"repo": "http://my-pypi-mirror.com"
}
},
{
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
}
}
]
}

For more details, see the Databricks documentation for library types.

How to upload a library in Databricks


Using Databricks workspace UI
To obtain the dbfs path of the library added using UI, you can use the Databricks CLI (installation).
Typically, the Jar libraries are stored under dbfs:/FileStore/jars while using the UI. You can list all through the CLI:
databricks fs ls dbfs:/FileStore/jars.
Copy library using Databricks CLI
Example: databricks fs cp SparkPi-assembly-0.1.jar dbfs:/FileStore/jars
Transform data by running a Jar activity in Azure
Databricks
3/7/2019 • 2 minutes to read

The Azure Databricks Jar Activity in a Data Factory pipeline runs a Spark Jar in your Azure Databricks cluster. This
article builds on the data transformation activities article, which presents a general overview of data
transformation and the supported transformation activities. Azure Databricks is a managed platform for running
Apache Spark.
For an eleven-minute introduction and demonstration of this feature, watch the following video:

Databricks Jar activity definition


Here is the sample JSON definition of a Databricks Jar Activity:

{
"name": "SparkJarActivity",
"type": "DatabricksSparkJar",
"linkedServiceName": {
"referenceName": "AzureDatabricks",
"type": "LinkedServiceReference"
},
"typeProperties": {
"mainClassName": "org.apache.spark.examples.SparkPi",
"parameters": [ "10" ],
"libraries": [
{
"jar": "dbfs:/docs/sparkpi.jar"
}
]
}
}

Databricks Jar activity properties


The following table describes the JSON properties used in the JSON definition:

PROPERTY | DESCRIPTION | REQUIRED
name | Name of the activity in the pipeline. | Yes
description | Text describing what the activity does. | No
type | For Databricks Jar Activity, the activity type is DatabricksSparkJar. | Yes
linkedServiceName | Name of the Databricks Linked Service on which the Jar activity runs. To learn about this linked service, see Compute linked services article. | Yes
mainClassName | The full name of the class containing the main method to be executed. This class must be contained in a JAR provided as a library. | Yes
parameters | Parameters that will be passed to the main method. This is an array of strings. | No
libraries | A list of libraries to be installed on the cluster that will execute the job. It can be an array of <string, object>. | Yes (at least one containing the mainClassName method)

Supported libraries for databricks activities


In the above Databricks activity definition you specify these library types: jar, egg, maven, pypi, cran.

{
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"egg": "dbfs:/mnt/libraries/library.egg"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2",
"exclusions": [ "slf4j:slf4j" ]
}
},
{
"pypi": {
"package": "simplejson",
"repo": "http://my-pypi-mirror.com"
}
},
{
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
}
}
]
}

For more details, refer to the Databricks documentation for library types.

How to upload a library in Databricks


Using Databricks workspace UI
To obtain the dbfs path of the library added using UI, you can use Databricks CLI (installation).
Typically, the Jar libraries are stored under dbfs:/FileStore/jars while using the UI. You can list all through the CLI:
databricks fs ls dbfs:/FileStore/job-jars
Copy library using Databricks CLI
Use Databricks CLI (installation steps).
Example - copying JAR to dbfs: dbfs cp SparkPi-assembly-0.1.jar dbfs:/docs/sparkpi.jar
Transform data by running a Python activity in Azure
Databricks
5/22/2019 • 2 minutes to read

The Azure Databricks Python Activity in a Data Factory pipeline runs a Python file in your Azure Databricks cluster.
This article builds on the data transformation activities article, which presents a general overview of data
transformation and the supported transformation activities. Azure Databricks is a managed platform for running
Apache Spark.
For an eleven-minute introduction and demonstration of this feature, watch the following video:

Databricks Python activity definition


Here is the sample JSON definition of a Databricks Python Activity:

{
"activity": {
"name": "MyActivity",
"description": "MyActivity description",
"type": "DatabricksSparkPython",
"linkedServiceName": {
"referenceName": "MyDatabricksLinkedservice",
"type": "LinkedServiceReference"
},
"typeProperties": {
"pythonFile": "dbfs:/docs/pi.py",
"parameters": [
"10"
],
"libraries": [
{
"pypi": {
"package": "tensorflow"
}
}
]
}
}
}

Databricks Python activity properties


The following table describes the JSON properties used in the JSON definition:

PROPERTY | DESCRIPTION | REQUIRED
name | Name of the activity in the pipeline. | Yes
description | Text describing what the activity does. | No
type | For Databricks Python Activity, the activity type is DatabricksSparkPython. | Yes
linkedServiceName | Name of the Databricks Linked Service on which the Python activity runs. To learn about this linked service, see Compute linked services article. | Yes
pythonFile | The URI of the Python file to be executed. Only DBFS paths are supported. | Yes
parameters | Command line parameters that will be passed to the Python file. This is an array of strings. | No
libraries | A list of libraries to be installed on the cluster that will execute the job. It can be an array of <string, object>. | No

Supported libraries for databricks activities


In the above Databricks activity definition you specify these library types: jar, egg, maven, pypi, cran.

{
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"egg": "dbfs:/mnt/libraries/library.egg"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2",
"exclusions": [ "slf4j:slf4j" ]
}
},
{
"pypi": {
"package": "simplejson",
"repo": "http://my-pypi-mirror.com"
}
},
{
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
}
}
]
}

For more details, refer to the Databricks documentation for library types.

How to upload a library in Databricks


Using Databricks workspace UI
To obtain the dbfs path of the library added using UI, you can use Databricks CLI (installation).
Typically the Jar libraries are stored under dbfs:/FileStore/jars while using the UI. You can list all through the CLI:
databricks fs ls dbfs:/FileStore/jars
Copy library using Databricks CLI
Example: databricks fs cp SparkPi-assembly-0.1.jar dbfs:/FileStore/jars
Use custom activities in an Azure Data Factory
pipeline
4/3/2019 • 12 minutes to read

There are two types of activities that you can use in an Azure Data Factory pipeline.
Data movement activities to move data between supported source and sink data stores.
Data transformation activities to transform data using compute services such as Azure HDInsight, Azure
Batch, and Azure Machine Learning.
To move data to/from a data store that Data Factory does not support, or to transform/process data in a way
that isn't supported by Data Factory, you can create a Custom activity with your own data movement or
transformation logic and use the activity in a pipeline. The custom activity runs your customized code logic on
an Azure Batch pool of virtual machines.

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which
will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.

See the following articles if you are new to the Azure Batch service:

Azure Batch basics for an overview of the Azure Batch service.
New-AzBatchAccount cmdlet to create an Azure Batch account (or) Azure portal to create the Azure Batch
account by using the Azure portal. See the Using PowerShell to manage Azure Batch Account article for detailed
instructions on using the cmdlet.
New-AzBatchPool cmdlet to create an Azure Batch pool.

Azure Batch linked service


The following JSON defines a sample Azure Batch linked service. For details, see Compute environments
supported by Azure Data Factory
{
"name": "AzureBatchLinkedService",
"properties": {
"type": "AzureBatch",
"typeProperties": {
"accountName": "batchaccount",
"accessKey": {
"type": "SecureString",
"value": "access key"
},
"batchUri": "https://batchaccount.region.batch.azure.com",
"poolName": "poolname",
"linkedServiceName": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
}

To learn more about Azure Batch linked service, see Compute linked services article.

Custom activity
The following JSON snippet defines a pipeline with a simple Custom Activity. The activity definition has a
reference to the Azure Batch linked service.

{
"name": "MyCustomActivityPipeline",
"properties": {
"description": "Custom activity sample",
"activities": [{
"type": "Custom",
"name": "MyCustomActivity",
"linkedServiceName": {
"referenceName": "AzureBatchLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"command": "helloworld.exe",
"folderPath": "customactv2/helloworld",
"resourceLinkedService": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
}
}
}]
}
}

In this sample, helloworld.exe is a custom application stored in the customactv2/helloworld folder of the
Azure Storage account used in the resourceLinkedService. The Custom activity submits this custom application
to be executed on Azure Batch. You can replace the command with any preferred application that can be
executed on the target Operating System of the Azure Batch Pool nodes.
The following table describes names and descriptions of properties that are specific to this activity.

PROPERTY | DESCRIPTION | REQUIRED
name | Name of the activity in the pipeline | Yes
description | Text describing what the activity does. | No
type | For Custom activity, the activity type is Custom. | Yes
linkedServiceName | Linked Service to Azure Batch. To learn about this linked service, see Compute linked services article. | Yes
command | Command of the custom application to be executed. If the application is already available on the Azure Batch Pool Node, the resourceLinkedService and folderPath can be skipped. For example, you can specify the command to be cmd /c dir, which is natively supported by the Windows Batch Pool node. | Yes
resourceLinkedService | Azure Storage Linked Service to the Storage account where the custom application is stored | No *
folderPath | Path to the folder of the custom application and all its dependencies. If you have dependencies stored in subfolders - that is, in a hierarchical folder structure under folderPath - the folder structure is currently flattened when the files are copied to Azure Batch. That is, all files are copied into a single folder with no subfolders. To work around this behavior, consider compressing the files, copying the compressed file, and then unzipping it with custom code in the desired location. | No *
referenceObjects | An array of existing Linked Services and Datasets. The referenced Linked Services and Datasets are passed to the custom application in JSON format so your custom code can reference resources of the Data Factory | No
extendedProperties | User-defined properties that can be passed to the custom application in JSON format so your custom code can reference additional properties | No
retentionTimeInDays | The retention time for the files submitted for custom activity. Default value is 30 days. | No

* The properties resourceLinkedService and folderPath must either both be specified or both be omitted.
NOTE
If you are passing linked services as referenceObjects in Custom Activity, it is a good security practice to pass an Azure
Key Vault enabled linked service (since it does not contain any secure strings) and fetch the credentials using secret name
directly from Key Vault from the code. You can find an example here that references AKV enabled linked service, retrieves
the credentials from Key Vault, and then accesses the storage in the code.
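For example, a typeProperties sketch that passes a Key Vault enabled linked service through referenceObjects and shortens the retention period might look like the following (the linked service name and property values are illustrative assumptions):

"typeProperties": {
    "command": "SampleApp.exe",
    "folderPath": "customactv2/SampleApp",
    "resourceLinkedService": {
        "referenceName": "StorageLinkedService",
        "type": "LinkedServiceReference"
    },
    "referenceObjects": {
        "linkedServices": [{
            "referenceName": "AzureKeyVaultLinkedService",
            "type": "LinkedServiceReference"
        }]
    },
    "retentionTimeInDays": 10
}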

Custom activity permissions


The custom activity sets the Azure Batch auto-user account to Non-admin access with task scope (the default
auto-user specification). You can't change the permission level of the auto-user account. For more info, see Run
tasks under user accounts in Batch | Auto-user accounts.

Executing commands
You can directly execute a command using Custom Activity. The following example runs the "echo hello world"
command on the target Azure Batch Pool nodes and prints the output to stdout.

{
"name": "MyCustomActivity",
"properties": {
"description": "Custom activity sample",
"activities": [{
"type": "Custom",
"name": "MyCustomActivity",
"linkedServiceName": {
"referenceName": "AzureBatchLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"command": "cmd /c echo hello world"
}
}]
}
}

Passing objects and properties


This sample shows how you can use the referenceObjects and extendedProperties to pass Data Factory objects
and user-defined properties to your custom application.
{
"name": "MyCustomActivityPipeline",
"properties": {
"description": "Custom activity sample",
"activities": [{
"type": "Custom",
"name": "MyCustomActivity",
"linkedServiceName": {
"referenceName": "AzureBatchLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"command": "SampleApp.exe",
"folderPath": "customactv2/SampleApp",
"resourceLinkedService": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"referenceObjects": {
"linkedServices": [{
"referenceName": "AzureBatchLinkedService",
"type": "LinkedServiceReference"
}]
},
"extendedProperties": {
"connectionString": {
"type": "SecureString",
"value": "aSampleSecureString"
},
"PropertyBagPropertyName1": "PropertyBagValue1",
"propertyBagPropertyName2": "PropertyBagValue2",
"dateTime1": "2015-04-12T12:13:14Z"
}
}
}]
}
}

When the activity is executed, referenceObjects and extendedProperties are stored in the following files, which are deployed to the same execution folder as SampleApp.exe:

activity.json
Stores extendedProperties and properties of the custom activity.

linkedServices.json
Stores an array of linked services defined in the referenceObjects property.

datasets.json
Stores an array of datasets defined in the referenceObjects property.

The following sample code demonstrates how SampleApp.exe can access the required information from the JSON files:
using Newtonsoft.Json;
using System;
using System.IO;

namespace SampleApp
{
class Program
{
static void Main(string[] args)
{
// From extendedProperties
dynamic activity = JsonConvert.DeserializeObject(File.ReadAllText("activity.json"));
Console.WriteLine(activity.typeProperties.extendedProperties.connectionString.value);

// From LinkedServices
dynamic linkedServices =
JsonConvert.DeserializeObject(File.ReadAllText("linkedServices.json"));
Console.WriteLine(linkedServices[0].properties.typeProperties.accountName);
}
}
}

Retrieve execution outputs


You can start a pipeline run using the following PowerShell command:

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName -PipelineName $pipelineName

When the pipeline is running, you can check the execution output using the following commands:

while ($True) {
    $result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName `
        -ResourceGroupName $resourceGroupName -PipelineRunId $runId `
        -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)

if(!$result) {
Write-Host "Waiting for pipeline to start..." -foregroundcolor "Yellow"
}
elseif (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
}
else {
Write-Host "Pipeline '"$pipelineName"' run finished. Result:" -foregroundcolor "Yellow"
$result
break
}
($result | Format-List | Out-String)
Start-Sleep -Seconds 15
}

Write-Host "Activity `Output` section:" -foregroundcolor "Yellow"


$result.Output -join "`r`n"

Write-Host "Activity `Error` section:" -foregroundcolor "Yellow"


$result.Error -join "`r`n"

The stdout and stderr output of your custom application are saved to the adfjobs container in the Azure Storage linked service that you defined when creating the Azure Batch linked service, under a folder named with the GUID of the task. You can get the detailed path from the activity run output, as shown in the following snippet:
Pipeline ' MyCustomActivity' run finished. Result:

ResourceGroupName : resourcegroupname
DataFactoryName : datafactoryname
ActivityName : MyCustomActivity
PipelineRunId : xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
PipelineName : MyCustomActivity
Input : {command}
Output : {exitcode, outputs, effectiveIntegrationRuntime}
LinkedServiceName :
ActivityRunStart : 10/5/2017 3:33:06 PM
ActivityRunEnd : 10/5/2017 3:33:28 PM
DurationInMs : 21203
Status : Succeeded
Error : {errorCode, message, failureType, target}

Activity Output section:


"exitcode": 0
"outputs": [
"https://<container>.blob.core.windows.net/adfjobs/<GUID>/output/stdout.txt",
"https://<container>.blob.core.windows.net/adfjobs/<GUID>/output/stderr.txt"
]
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)"
Activity Error section:
"errorCode": ""
"message": ""
"failureType": ""
"target": "MyCustomActivity"

If you would like to consume the content of stdout.txt in downstream activities, you can get the path to the stdout.txt file with the expression "@activity('MyCustomActivity').output.outputs[0]".
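For example (a hedged sketch, not from the original article), a downstream Web activity could fetch the file by passing that path as its URL; note that the blob must be reachable by the activity, for instance by appending a SAS token:

{
    "name": "ReadStdout",
    "type": "WebActivity",
    "typeProperties": {
        "url": "@activity('MyCustomActivity').output.outputs[0]",
        "method": "GET"
    },
    "dependsOn": [
        {
            "activity": "MyCustomActivity",
            "dependencyConditions": [ "Succeeded" ]
        }
    ]
}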

IMPORTANT
The activity.json, linkedServices.json, and datasets.json files are stored in the runtime folder of the Batch task. For this example, they are stored under https://adfv2storage.blob.core.windows.net/adfjobs/<GUID>/runtime/. If needed, clean them up separately.
For linked services that use the self-hosted integration runtime, sensitive information such as keys or passwords is encrypted by the self-hosted integration runtime to ensure that credentials stay in your customer-defined private network environment. Some sensitive fields can therefore be missing when referenced by your custom application code in this way. If needed, use a SecureString in extendedProperties instead of a linked service reference.

Pass outputs to another activity


You can send custom values from your code in a Custom Activity back to Azure Data Factory. You can do so by
writing them into outputs.json from your application. Data Factory copies the content of outputs.json and
appends it into the Activity Output as the value of the customOutput property. (The size limit is 2MB.) If you
want to consume the content of outputs.json in downstream activities, you can get the value by using the
expression @activity('<MyCustomActivity>').output.customOutput .
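A minimal sketch (assuming Newtonsoft.Json, as in the earlier SampleApp.exe code, and hypothetical output values) of how a custom application might write outputs.json before it exits:

using System.IO;
using Newtonsoft.Json;

namespace SampleApp
{
    static class CustomOutputWriter
    {
        // Writes an arbitrary object to outputs.json in the working folder.
        // Data Factory copies this content into the activity output as customOutput (2-MB limit).
        public static void Write()
        {
            var custom = new { rowsProcessed = 42, status = "OK" }; // hypothetical values
            File.WriteAllText("outputs.json", JsonConvert.SerializeObject(custom));
        }
    }
}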

Retrieve SecureString outputs


Sensitive property values designated as type SecureString, as shown in some of the examples in this article, are
masked out in the Monitoring tab in the Data Factory user interface. In actual pipeline execution, however, a
SecureString property is serialized as JSON within the activity.json file as plain text. For example:
"extendedProperties": {
"connectionString": {
"type": "SecureString",
"value": "aSampleSecureString"
}
}

This serialization is not truly secure, and is not intended to be secure. The intent is to hint to Data Factory to
mask the value in the Monitoring tab.
To access properties of type SecureString from a custom activity, read the activity.json file, which is placed in
the same folder as your .EXE, deserialize the JSON, and then access the JSON property (extendedProperties
=> [propertyName] => value).

Compare v2 Custom Activity and version 1 (Custom) DotNet Activity


In Azure Data Factory version 1, you implement a (Custom) DotNet Activity by creating a .NET Class Library
project with a class that implements the Execute method of the IDotNetActivity interface. The Linked
Services, Datasets, and Extended Properties in the JSON payload of a (Custom) DotNet Activity are passed to
the execution method as strongly-typed objects. For details about the version 1 behavior, see (Custom) DotNet
in version 1. Because of this implementation, your version 1 DotNet Activity code has to target .NET
Framework 4.5.2. The version 1 DotNet Activity also has to be executed on Windows-based Azure Batch Pool
nodes.
In the Azure Data Factory V2 Custom Activity, you are not required to implement a .NET interface. You can now
directly run commands, scripts, and your own custom code, compiled as an executable. To configure this
implementation, you specify the Command property together with the folderPath property. The Custom Activity
uploads the executable and its dependencies to folderpath and executes the command for you.
The Linked Services, Datasets (defined in referenceObjects), and Extended Properties defined in the JSON
payload of a Data Factory v2 Custom Activity can be accessed by your executable as JSON files. You can access
the required properties using a JSON serializer as shown in the preceding SampleApp.exe code sample.
With the changes introduced in the Data Factory V2 Custom Activity, you can write your custom code logic in your preferred language and execute it on the Windows and Linux operating systems supported by Azure Batch.
The following table describes the differences between the Data Factory V2 Custom Activity and the Data
Factory version 1 (Custom) DotNet Activity:

How custom logic is defined - Custom Activity: by providing an executable. Version 1 (Custom) DotNet Activity: by implementing a .NET DLL.

Execution environment of the custom logic - Custom Activity: Windows or Linux. Version 1 (Custom) DotNet Activity: Windows (.NET Framework 4.5.2).

Executing scripts - Custom Activity: supports executing scripts directly (for example, "cmd /c echo hello world" on a Windows VM). Version 1 (Custom) DotNet Activity: requires implementation in the .NET DLL.

Dataset required - Custom Activity: optional. Version 1 (Custom) DotNet Activity: required to chain activities and pass information.

Pass information from activity to custom logic - Custom Activity: through referenceObjects (linked services and datasets) and extendedProperties (custom properties). Version 1 (Custom) DotNet Activity: through extended properties (custom properties), input, and output datasets.

Retrieve information in custom logic - Custom Activity: parses activity.json, linkedServices.json, and datasets.json stored in the same folder as the executable. Version 1 (Custom) DotNet Activity: through the .NET SDK (.NET Framework 4.5.2).

Logging - Custom Activity: writes directly to STDOUT. Version 1 (Custom) DotNet Activity: by implementing a Logger in the .NET DLL.

If you have existing .NET code written for a version 1 (Custom) DotNet Activity, you need to modify your code
for it to work with the current version of the Custom Activity. Update your code by following these high-level
guidelines:
Change the project from a .NET Class Library to a Console App.
Start your application with the Main method. The Execute method of the IDotNetActivity interface is no
longer required.
Read and parse the Linked Services, Datasets and Activity with a JSON serializer, and not as strongly-typed
objects. Pass the values of required properties to your main custom code logic. Refer to the preceding
SampleApp.exe code as an example.
The Logger object is no longer supported. Output from your executable can be printed to the console and is
saved to stdout.txt.
The Microsoft.Azure.Management.DataFactories NuGet package is no longer required.
Compile your code, upload the executable and its dependencies to Azure Storage, and define the path in the
folderPath property.

For a complete example of how to rewrite the end-to-end DLL and pipeline sample described in the Data Factory version 1 article Use custom activities in an Azure Data Factory pipeline as a Data Factory Custom Activity, see the Data Factory Custom Activity sample.

Auto-scaling of Azure Batch


You can also create an Azure Batch pool with the autoscale feature. For example, you could create an Azure Batch pool with 0 dedicated VMs and an autoscale formula based on the number of pending tasks.
The sample formula here achieves the following behavior: When the pool is initially created, it starts with 1 VM. The $PendingTasks metric defines the number of tasks in the running and active (queued) states. The formula finds the average number of pending tasks over the last 180 seconds and sets TargetDedicated accordingly, while ensuring that TargetDedicated never goes beyond 25 VMs. So, as new tasks are submitted, the pool automatically grows, and as tasks complete, VMs become free one by one and autoscaling shrinks the pool. You can adjust startingNumberOfVMs and maxNumberofVMs to your needs.
Autoscale formula:

startingNumberOfVMs = 1;
maxNumberofVMs = 25;
pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second));
$TargetDedicated = min(maxNumberofVMs, pendingTaskSamples);

See Automatically scale compute nodes in an Azure Batch pool for details.
If the pool is using the default autoScaleEvaluationInterval, the Batch service could take 15-30 minutes to
prepare the VM before running the custom activity. If the pool is using a different autoScaleEvaluationInterval,
the Batch service could take autoScaleEvaluationInterval + 10 minutes.

Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
Machine Learning Batch Execution activity
Stored procedure activity
Compute environments supported by Azure Data Factory
6/3/2019 • 19 minutes to read

This article explains the different compute environments that you can use to process or transform data. It also provides details about the different configurations (on-demand versus bring your own) that Data Factory supports when you configure linked services that link these compute environments to a data factory.
The following table provides a list of compute environments supported by Data Factory and the activities that
can run on them.

On-demand HDInsight cluster or your own HDInsight cluster: Hive, Pig, Spark, MapReduce, Hadoop Streaming
Azure Batch: Custom
Azure Machine Learning: Machine Learning activities (Batch Execution and Update Resource)
Azure Data Lake Analytics: Data Lake Analytics U-SQL
Azure SQL, Azure SQL Data Warehouse, SQL Server: Stored Procedure
Azure Databricks: Notebook, Jar, Python

On-demand HDInsight compute environment


In this type of configuration, the computing environment is fully managed by the Azure Data Factory service.
It is automatically created by the Data Factory service before a job is submitted to process data and removed
when the job is completed. You can create a linked service for the on-demand compute environment,
configure it, and control granular settings for job execution, cluster management, and bootstrapping actions.

NOTE
The on-demand configuration is currently supported only for Azure HDInsight clusters. Azure Databricks also supports on-demand jobs using job clusters; refer to the Azure Databricks linked service for more details.

Azure HDInsight on-demand linked service


The Azure Data Factory service can automatically create an on-demand HDInsight cluster to process data. The
cluster is created in the same region as the storage account (linkedServiceName property in the JSON )
associated with the cluster. The storage account must be a general-purpose standard Azure storage account.
Note the following important points about on-demand HDInsight linked service:
The on-demand HDInsight cluster is created under your Azure subscription. You are able to see the cluster
in your Azure portal when the cluster is up and running.
The logs for jobs that are run on an on-demand HDInsight cluster are copied to the storage account
associated with the HDInsight cluster. The clusterUserName, clusterPassword, clusterSshUserName,
clusterSshPassword defined in your linked service definition are used to log in to the cluster for in-depth
troubleshooting during the lifecycle of the cluster.
You are charged only for the time when the HDInsight cluster is up and running jobs.
You can use a Script Action with the Azure HDInsight on-demand linked service.

IMPORTANT
It typically takes 20 minutes or more to provision an Azure HDInsight cluster on demand.

Example
The following JSON defines a Linux-based on-demand HDInsight linked service. The Data Factory service
automatically creates a Linux-based HDInsight cluster to process the required activity.

{
"name": "HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"clusterType": "hadoop",
"clusterSize": 1,
"timeToLive": "00:15:00",
"hostSubscriptionId": "<subscription ID>",
"servicePrincipalId": "<service principal ID>",
"servicePrincipalKey": {
"value": "<service principal key>",
"type": "SecureString"
},
"tenant": "<tenent id>",
"clusterResourceGroup": "<resource group name>",
"version": "3.6",
"osType": "Linux",
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

IMPORTANT
The HDInsight cluster creates a default container in the blob storage you specified in the JSON (linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With an on-demand HDInsight linked service, an HDInsight cluster is created every time a slice needs to be processed, unless there is an existing live cluster (timeToLive); the cluster is deleted when the processing is done.
As more activity runs occur, you see many containers in your Azure blob storage. If you do not need them for troubleshooting the jobs, you may want to delete them to reduce the storage cost. The names of these containers follow the pattern adf<yourdatafactoryname>-<linkedservicename>-<datetimestamp>. Use a tool such as Microsoft Azure Storage Explorer to delete containers in your Azure blob storage.
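As a hedged sketch (not part of the original article), assuming the Az.Storage PowerShell module and that you have the storage account name and key at hand, you could list and remove those containers like this:

# Connect to the storage account used by the on-demand HDInsight linked service.
$ctx = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"

# List containers whose names start with "adf" and remove them.
# Review the list first; the name pattern is adf<datafactoryname>-<linkedservicename>-<datetimestamp>.
Get-AzStorageContainer -Context $ctx -Name "adf*" |
    ForEach-Object { Remove-AzStorageContainer -Name $_.Name -Context $ctx -Force }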

Properties
type: The type property should be set to HDInsightOnDemand. (Required: Yes)

clusterSize: Number of worker/data nodes in the cluster. The HDInsight cluster is created with 2 head nodes along with the number of worker nodes you specify for this property. The nodes are of size Standard_D3, which has 4 cores, so a 4-worker-node cluster takes 24 cores (4*4 = 16 cores for worker nodes, plus 2*4 = 8 cores for head nodes). See Set up clusters in HDInsight with Hadoop, Spark, Kafka, and more for details. (Required: Yes)

linkedServiceName: Azure Storage linked service to be used by the on-demand cluster for storing and processing data. The HDInsight cluster is created in the same region as this Azure Storage account. Azure HDInsight has a limit on the total number of cores you can use in each Azure region it supports. Make sure you have enough core quota in that Azure region to meet the required clusterSize. For details, refer to Set up clusters in HDInsight with Hadoop, Spark, Kafka, and more. Currently, you cannot create an on-demand HDInsight cluster that uses an Azure Data Lake Store as the storage. If you want to store the result data from HDInsight processing in an Azure Data Lake Store, use a Copy Activity to copy the data from the Azure Blob Storage to the Azure Data Lake Store. (Required: Yes)

clusterResourceGroup: The HDInsight cluster is created in this resource group. (Required: Yes)

timeToLive: The allowed idle time for the on-demand HDInsight cluster. Specifies how long the on-demand HDInsight cluster stays alive after completion of an activity run if there are no other active jobs in the cluster. The minimum allowed value is 5 minutes (00:05:00). For example, if an activity run takes 6 minutes and timeToLive is set to 5 minutes, the cluster stays alive for 5 minutes after the 6 minutes of processing the activity run. If another activity run is executed within that window, it is processed by the same cluster. Creating an on-demand HDInsight cluster is an expensive operation (it can take a while), so use this setting as needed to improve performance of a data factory by reusing an on-demand HDInsight cluster. If you set the timeToLive value to 0, the cluster is deleted as soon as the activity run completes. If you set a high value, the cluster may stay idle for you to log on for troubleshooting purposes, but it could result in high costs. Therefore, it is important that you set an appropriate value based on your needs. If the timeToLive property value is appropriately set, multiple pipelines can share the instance of the on-demand HDInsight cluster. (Required: Yes)

clusterType: The type of the HDInsight cluster to be created. Allowed values are "hadoop" and "spark". If not specified, the default value is hadoop. An Enterprise Security Package enabled cluster cannot be created on-demand; instead, use an existing cluster (bring your own compute). (Required: No)

version: Version of the HDInsight cluster. If not specified, it uses the current HDInsight-defined default version. (Required: No)

hostSubscriptionId: The Azure subscription ID used to create the HDInsight cluster. If not specified, it uses the subscription ID of your Azure login context. (Required: No)

clusterNamePrefix: The prefix of the HDInsight cluster name; a timestamp is automatically appended at the end of the cluster name. (Required: No)

sparkVersion: The version of Spark if the cluster type is "spark". (Required: No)

additionalLinkedServiceNames: Specifies additional storage accounts for the HDInsight linked service so that the Data Factory service can register them on your behalf. These storage accounts must be in the same region as the HDInsight cluster, which is created in the same region as the storage account specified by linkedServiceName. (Required: No)

osType: Type of operating system. Allowed values are Linux and Windows (for HDInsight 3.3 only). Default is Linux. (Required: No)

hcatalogLinkedServiceName: The name of the Azure SQL linked service that points to the HCatalog database. The on-demand HDInsight cluster is created by using the Azure SQL database as the metastore. (Required: No)

connectVia: The integration runtime to be used to dispatch the activities to this HDInsight linked service. The on-demand HDInsight linked service only supports the Azure Integration Runtime. If not specified, it uses the default Azure Integration Runtime. (Required: No)

clusterUserName: The username to access the cluster. (Required: No)

clusterPassword: The password, as a secure string, to access the cluster. (Required: No)

clusterSshUserName: The username to SSH remotely connect to the cluster's node (for Linux). (Required: No)

clusterSshPassword: The password, as a secure string, to SSH remotely connect to the cluster's node (for Linux). (Required: No)

scriptActions: Specify scripts for HDInsight cluster customizations during on-demand cluster creation. Currently, Azure Data Factory's user interface authoring tool supports specifying only one script action, but you can get around this limitation in the JSON (specify multiple script actions in the JSON). (Required: No)
IMPORTANT
HDInsight supports multiple Hadoop cluster versions that can be deployed. Each version choice creates a specific version of the Hortonworks Data Platform (HDP) distribution and a set of components that are contained within that distribution. The list of supported HDInsight versions keeps being updated to provide the latest Hadoop ecosystem components and fixes. Always refer to the latest information on supported HDInsight versions and OS types to ensure that you are using a supported version of HDInsight.

IMPORTANT
Currently, HDInsight linked services do not support HBase, Interactive Query (Hive LLAP), or Storm.

additionalLinkedServiceNames JSON example

"additionalLinkedServiceNames": [{
"referenceName": "MyStorageLinkedService2",
"type": "LinkedServiceReference"
}]

Service principal authentication


The On-Demand HDInsight linked service requires a service principal authentication to create HDInsight
clusters on your behalf. To use service principal authentication, register an application entity in Azure Active
Directory (Azure AD ) and grant it the Contributor role of the subscription or the resource group in which the
HDInsight cluster is created. For detailed steps, see Use portal to create an Azure Active Directory application
and service principal that can access resources. Make note of the following values, which you use to define the
linked service:
Application ID
Application key
Tenant ID
Use service principal authentication by specifying the following properties:

servicePrincipalId: Specify the application's client ID. (Required: Yes)

servicePrincipalKey: Specify the application's key. (Required: Yes)

tenant: Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse over the upper-right corner of the Azure portal. (Required: Yes)

Advanced Properties
You can also specify the following properties for the granular configuration of the on-demand HDInsight
cluster.

coreConfiguration: Specifies the core configuration parameters (as in core-site.xml) for the HDInsight cluster to be created. (Required: No)

hBaseConfiguration: Specifies the HBase configuration parameters (hbase-site.xml) for the HDInsight cluster. (Required: No)

hdfsConfiguration: Specifies the HDFS configuration parameters (hdfs-site.xml) for the HDInsight cluster. (Required: No)

hiveConfiguration: Specifies the Hive configuration parameters (hive-site.xml) for the HDInsight cluster. (Required: No)

mapReduceConfiguration: Specifies the MapReduce configuration parameters (mapred-site.xml) for the HDInsight cluster. (Required: No)

oozieConfiguration: Specifies the Oozie configuration parameters (oozie-site.xml) for the HDInsight cluster. (Required: No)

stormConfiguration: Specifies the Storm configuration parameters (storm-site.xml) for the HDInsight cluster. (Required: No)

yarnConfiguration: Specifies the Yarn configuration parameters (yarn-site.xml) for the HDInsight cluster. (Required: No)

Example – On-demand HDInsight cluster configuration with advanced properties


{
"name": " HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"clusterSize": 16,
"timeToLive": "01:30:00",
"hostSubscriptionId": "<subscription ID>",
"servicePrincipalId": "<service principal ID>",
"servicePrincipalKey": {
"value": "<service principal key>",
"type": "SecureString"
},
"tenant": "<tenent id>",
"clusterResourceGroup": "<resource group name>",
"version": "3.6",
"osType": "Linux",
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"coreConfiguration": {
"templeton.mapper.memory.mb": "5000"
},
"hiveConfiguration": {
"templeton.mapper.memory.mb": "5000"
},
"mapReduceConfiguration": {
"mapreduce.reduce.java.opts": "-Xmx4000m",
"mapreduce.map.java.opts": "-Xmx4000m",
"mapreduce.map.memory.mb": "5000",
"mapreduce.reduce.memory.mb": "5000",
"mapreduce.job.reduce.slowstart.completedmaps": "0.8"
},
"yarnConfiguration": {
"yarn.app.mapreduce.am.resource.mb": "5000",
"mapreduce.map.memory.mb": "5000"
},
"additionalLinkedServiceNames": [{
"referenceName": "MyStorageLinkedService2",
"type": "LinkedServiceReference"
}]
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Node sizes
You can specify the sizes of head, data, and zookeeper nodes using the following properties:

headNodeSize: Specifies the size of the head node. The default value is Standard_D3. See the Specifying node sizes section for details. (Required: No)

dataNodeSize: Specifies the size of the data node. The default value is Standard_D3. (Required: No)

zookeeperNodeSize: Specifies the size of the ZooKeeper node. The default value is Standard_D3. (Required: No)

Specifying node sizes


See the Sizes of Virtual Machines article for string values you need to specify for the properties mentioned in
the previous section. The values need to conform to the CMDLETs & APIS referenced in the article. As you
can see in the article, the data node of Large (default) size has 7-GB memory, which may not be good enough
for your scenario.
If you want to create D4 sized head nodes and worker nodes, specify Standard_D4 as the value for
headNodeSize and dataNodeSize properties.

"headNodeSize": "Standard_D4",
"dataNodeSize": "Standard_D4",

If you specify a wrong value for these properties, you may receive the following error: Failed to create cluster.
Exception: Unable to complete the cluster create operation. Operation failed with code '400'. Cluster left
behind state: 'Error'. Message: 'PreClusterCreationValidationFailure'. When you receive this error, ensure that
you are using the CMDLET & APIS name from the table in the Sizes of Virtual Machines article.

Bring your own compute environment


In this type of configuration, users can register an already existing computing environment as a linked service
in Data Factory. The computing environment is managed by the user and the Data Factory service uses it to
execute the activities.
This type of configuration is supported for the following compute environments:
Azure HDInsight
Azure Batch
Azure Machine Learning
Azure Data Lake Analytics
Azure SQL DB, Azure SQL DW, SQL Server

Azure HDInsight linked service


You can create an Azure HDInsight linked service to register your own HDInsight cluster with Data Factory.
Example
{
"name": "HDInsightLinkedService",
"properties": {
"type": "HDInsight",
"typeProperties": {
"clusterUri": " https://<hdinsightclustername>.azurehdinsight.net/",
"userName": "username",
"password": {
"value": "passwordvalue",
"type": "SecureString"
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Properties
type: The type property should be set to HDInsight. (Required: Yes)

clusterUri: The URI of the HDInsight cluster. (Required: Yes)

userName: Specify the name of the user to be used to connect to the existing HDInsight cluster. (Required: Yes)

password: Specify the password for the user account. (Required: Yes)

linkedServiceName: Name of the Azure Storage linked service that refers to the Azure blob storage used by the HDInsight cluster. Currently, you cannot specify an Azure Data Lake Store linked service for this property. If the HDInsight cluster has access to the Data Lake Store, you may access data in the Azure Data Lake Store from Hive/Pig scripts. (Required: Yes)

isEspEnabled: Specify 'true' if the HDInsight cluster is Enterprise Security Package enabled. Default is 'false'. (Required: No)

connectVia: The integration runtime to be used to dispatch the activities to this linked service. You can use the Azure Integration Runtime or a self-hosted integration runtime. If not specified, it uses the default Azure Integration Runtime. For an Enterprise Security Package (ESP) enabled HDInsight cluster, use a self-hosted integration runtime that has line of sight to the cluster, or deploy it inside the same virtual network as the ESP HDInsight cluster. (Required: No)
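For example, a hedged sketch (not from the original article) of registering an ESP-enabled cluster; the property names follow the table above, the integration runtime name is a placeholder, and the exact value type expected for isEspEnabled should be confirmed against the JSON the authoring UI generates:

{
    "name": "HDInsightEspLinkedService",
    "properties": {
        "type": "HDInsight",
        "typeProperties": {
            "clusterUri": "https://<hdinsightclustername>.azurehdinsight.net/",
            "userName": "<domain user>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            },
            "isEspEnabled": true,
            "linkedServiceName": {
                "referenceName": "AzureStorageLinkedService",
                "type": "LinkedServiceReference"
            }
        },
        "connectVia": {
            "referenceName": "<self-hosted IR with line of sight to the cluster>",
            "type": "IntegrationRuntimeReference"
        }
    }
}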

IMPORTANT
HDInsight supports multiple Hadoop cluster versions that can be deployed. Each version choice creates a specific version of the Hortonworks Data Platform (HDP) distribution and a set of components that are contained within that distribution. The list of supported HDInsight versions keeps being updated to provide the latest Hadoop ecosystem components and fixes. Always refer to the latest information on supported HDInsight versions and OS types to ensure that you are using a supported version of HDInsight.

IMPORTANT
Currently, HDInsight linked services do not support HBase, Interactive Query (Hive LLAP), or Storm.

Azure Batch linked service


NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module,
which will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and
AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions,
see Install Azure PowerShell.

You can create an Azure Batch linked service to register a Batch pool of virtual machines (VMs) to a data factory. You can run a Custom activity using Azure Batch.
See the following topics if you are new to the Azure Batch service:
Azure Batch basics for an overview of the Azure Batch service.
New-AzBatchAccount cmdlet to create an Azure Batch account, or the Azure portal to create the Azure Batch account by using the portal. See the Using PowerShell to manage Azure Batch Account topic for detailed instructions on using the cmdlet.
New-AzBatchPool cmdlet to create an Azure Batch pool.
Example
{
"name": "AzureBatchLinkedService",
"properties": {
"type": "AzureBatch",
"typeProperties": {
"accountName": "batchaccount",
"accessKey": {
"type": "SecureString",
"value": "access key"
},
"batchUri": "https://batchaccount.region.batch.azure.com",
"poolName": "poolname",
"linkedServiceName": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Properties
type: The type property should be set to AzureBatch. (Required: Yes)

accountName: Name of the Azure Batch account. (Required: Yes)

accessKey: Access key for the Azure Batch account. (Required: Yes)

batchUri: URL to your Azure Batch account, in the format https://batchaccountname.region.batch.azure.com. (Required: Yes)

poolName: Name of the pool of virtual machines. (Required: Yes)

linkedServiceName: Name of the Azure Storage linked service associated with this Azure Batch linked service. This linked service is used for staging the files required to run the activity. (Required: Yes)

connectVia: The integration runtime to be used to dispatch the activities to this linked service. You can use the Azure Integration Runtime or a self-hosted integration runtime. If not specified, it uses the default Azure Integration Runtime. (Required: No)

Azure Machine Learning linked service


You create an Azure Machine Learning linked service to register a Machine Learning batch scoring endpoint
to a data factory.
Example

{
"name": "AzureMLLinkedService",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://[batch scoring endpoint]/jobs",
"apiKey": {
"type": "SecureString",
"value": "access key"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Properties
type: The type property should be set to AzureML. (Required: Yes)

mlEndpoint: The batch scoring URL. (Required: Yes)

apiKey: The published workspace model's API key. (Required: Yes)

updateResourceEndpoint: The Update Resource URL for an Azure ML Web Service endpoint used to update the predictive Web Service with the trained model file. (Required: No)

servicePrincipalId: Specify the application's client ID. (Required: only if updateResourceEndpoint is specified)

servicePrincipalKey: Specify the application's key. (Required: only if updateResourceEndpoint is specified)

tenant: Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse over the upper-right corner of the Azure portal. (Required: only if updateResourceEndpoint is specified)

connectVia: The integration runtime to be used to dispatch the activities to this linked service. You can use the Azure Integration Runtime or a self-hosted integration runtime. If not specified, it uses the default Azure Integration Runtime. (Required: No)

Azure Data Lake Analytics linked service


You create an Azure Data Lake Analytics linked service to link an Azure Data Lake Analytics compute service to an Azure data factory. The Data Lake Analytics U-SQL activity in the pipeline refers to this linked service.
Example

{
"name": "AzureDataLakeAnalyticsLinkedService",
"properties": {
"type": "AzureDataLakeAnalytics",
"typeProperties": {
"accountName": "adftestaccount",
"dataLakeAnalyticsUri": "azuredatalakeanalytics URI",
"servicePrincipalId": "service principal id",
"servicePrincipalKey": {
"value": "service principal key",
"type": "SecureString"
},
"tenant": "tenant ID",
"subscriptionId": "<optional, subscription id of ADLA>",
"resourceGroupName": "<optional, resource group name of ADLA>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Properties
type: The type property should be set to AzureDataLakeAnalytics. (Required: Yes)

accountName: Azure Data Lake Analytics account name. (Required: Yes)

dataLakeAnalyticsUri: Azure Data Lake Analytics URI. (Required: No)

subscriptionId: Azure subscription ID. (Required: No)

resourceGroupName: Azure resource group name. (Required: No)

servicePrincipalId: Specify the application's client ID. (Required: Yes)

servicePrincipalKey: Specify the application's key. (Required: Yes)

tenant: Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse over the upper-right corner of the Azure portal. (Required: Yes)

connectVia: The integration runtime to be used to dispatch the activities to this linked service. You can use the Azure Integration Runtime or a self-hosted integration runtime. If not specified, it uses the default Azure Integration Runtime. (Required: No)

Azure Databricks linked service


You can create an Azure Databricks linked service to register the Databricks workspace that you will use to run the Databricks workloads (notebooks).
Example - Using new job cluster in Databricks

{
"name": "AzureDatabricks_LS",
"properties": {
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://eastus.azuredatabricks.net",
"newClusterNodeType": "Standard_D3_v2",
"newClusterNumOfWorker": "1:10",
"newClusterVersion": "4.0.x-scala2.11",
"accessToken": {
"type": "SecureString",
"value": "dapif33c9c721144c3a790b35000b57f7124f"
}
}
}
}

Example - Using existing Interactive cluster in Databricks

{
"name": "AzureDatabricksLinkedService",
"properties": {
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://westeurope.azuredatabricks.net",
"accessToken": {
"type": "SecureString",
"value": "dapif33c9c72344c3a790b35000b57f7124f"
},
"existingClusterId": "{clusterId}"
}
}
}

Properties
name: Name of the linked service. (Required: Yes)

type: The type property should be set to AzureDatabricks. (Required: Yes)

domain: Specify the Azure region based on the region of the Databricks workspace. Example: https://eastus.azuredatabricks.net (Required: Yes)

accessToken: An access token is required for Data Factory to authenticate to Azure Databricks. The access token needs to be generated from the Databricks workspace. More detailed steps to find the access token can be found here. (Required: Yes)

existingClusterId: Cluster ID of an existing cluster to run all jobs on. This should be an already created interactive cluster. You may need to manually restart the cluster if it stops responding. Databricks suggests running jobs on new clusters for greater reliability. You can find the cluster ID of an interactive cluster in the Databricks workspace -> Clusters -> Interactive Cluster Name -> Configuration -> Tags. More details. (Required: No)

newClusterVersion: The Spark version of the cluster. It creates a job cluster in Databricks. (Required: No)

newClusterNumOfWorker: Number of worker nodes that this cluster should have. A cluster has one Spark driver and num_workers executors, for a total of num_workers + 1 Spark nodes. A string formatted as an Int32, like "1", means numOfWorker is 1; "1:10" means auto-scale from 1 as the minimum to 10 as the maximum. (Required: No)

newClusterNodeType: This field encodes, through a single value, the resources available to each of the Spark nodes in this cluster. For example, the Spark nodes can be provisioned and optimized for memory- or compute-intensive workloads. This field is required for a new cluster. (Required: No)

newClusterSparkConf: A set of optional, user-specified Spark configuration key-value pairs. Users can also pass in a string of extra JVM options to the driver and the executors via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions respectively. (Required: No)

newClusterInitScripts: A set of optional, user-defined initialization scripts for the new cluster, specified as DBFS paths to the init scripts. (Required: No)

Azure SQL Database linked service


You create an Azure SQL linked service and use it with the Stored Procedure Activity to invoke a stored
procedure from a Data Factory pipeline. See Azure SQL Connector article for details about this linked service.

Azure SQL Data Warehouse linked service


You create an Azure SQL Data Warehouse linked service and use it with the Stored Procedure Activity to
invoke a stored procedure from a Data Factory pipeline. See Azure SQL Data Warehouse Connector article
for details about this linked service.

SQL Server linked service


You create a SQL Server linked service and use it with the Stored Procedure Activity to invoke a stored
procedure from a Data Factory pipeline. See SQL Server connector article for details about this linked service.

Next steps
For a list of the transformation activities supported by Azure Data Factory, see Transform data.
Append Variable Activity in Azure Data Factory
3/7/2019 • 2 minutes to read

Use the Append Variable activity to add a value to an existing array variable defined in a Data Factory pipeline.

Type properties
name: Name of the activity in the pipeline. (Required: Yes)

description: Text describing what the activity does. (Required: No)

type: The activity type is AppendVariable. (Required: Yes)

value: String literal or expression object value used to append into the specified variable. (Required: Yes)

variableName: Name of the variable that will be modified by the activity; the variable must be of type Array. (Required: Yes)
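For example, a hedged sketch of the activity JSON (the variable and parameter names are placeholders; the pipeline is assumed to declare an Array variable named processedFiles):

{
    "name": "AppendFileName",
    "type": "AppendVariable",
    "typeProperties": {
        "variableName": "processedFiles",
        "value": "@pipeline().parameters.fileName"
    }
}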

Next steps
Learn about a related control flow activity supported by Data Factory:
Set Variable Activity
Azure Function activity in Azure Data Factory
4/24/2019 • 2 minutes to read

The Azure Function activity allows you to run Azure Functions in a Data Factory pipeline. To run an Azure Function,
you need to create a linked service connection and an activity that specifies the Azure Function that you plan to
execute.
For an eight-minute introduction and demonstration of this feature, watch the following video:

Azure Function linked service


The return type of the Azure function has to be a valid JObject. (Keep in mind that JArray is not a JObject.) Any return type other than JObject fails and raises the user error Response Content is not a valid JObject.

type: The type property must be set to AzureFunction. (Required: yes)

function app url: URL for the Azure Function App, in the format https://<accountname>.azurewebsites.net. This URL is the value under the URL section when viewing your Function App in the Azure portal. (Required: yes)

function key: Access key for the Azure Function. Click on the Manage section for the respective function, and copy either the Function Key or the Host key. Find out more here: Azure Functions HTTP triggers and bindings. (Required: yes)
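A hedged sketch of the linked service JSON: the JSON property names functionAppUrl and functionKey are assumed to correspond to the "function app url" and "function key" entries above, so confirm them against the JSON that the authoring UI generates:

{
    "name": "AzureFunctionLinkedService",
    "properties": {
        "type": "AzureFunction",
        "typeProperties": {
            "functionAppUrl": "https://<accountname>.azurewebsites.net",
            "functionKey": {
                "type": "SecureString",
                "value": "<function or host key>"
            }
        }
    }
}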

Azure Function activity


name: Name of the activity in the pipeline. Allowed values: String. (Required: yes)

type: Type of activity is 'AzureFunctionActivity'. Allowed values: String. (Required: yes)

linked service: The Azure Function linked service for the corresponding Azure Function App. Allowed values: Linked service reference. (Required: yes)

function name: Name of the function in the Azure Function App that this activity calls. Allowed values: String. (Required: yes)

method: REST API method for the function call. Allowed values: String; supported types are "GET", "POST", and "PUT". (Required: yes)

header: Headers that are sent to the request. For example, to set the language and type on a request: "headers": { "Accept-Language": "en-us", "Content-Type": "application/json" }. Allowed values: String (or expression with resultType of string). (Required: No)

body: Body that is sent along with the request to the function API method. Allowed values: String (or expression with resultType of string) or object. (Required: for PUT/POST methods)

See the schema of the request payload in the Request payload schema section.
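A hedged sketch of an activity definition that pulls these properties together (the activity name, function name, and body values are placeholders):

{
    "name": "CallMyFunction",
    "type": "AzureFunctionActivity",
    "linkedServiceName": {
        "referenceName": "AzureFunctionLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "functionName": "HttpTriggerCSharp",
        "method": "POST",
        "headers": {
            "Content-Type": "application/json"
        },
        "body": {
            "name": "hello"
        }
    }
}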

Routing and queries


The Azure Function Activity supports routing. For example, if your Azure Function has the endpoint https://functionAPP.azurewebsites.net/api/<functionName>/<value>?code=<secret>, then the functionName to use in the Azure Function Activity is <functionName>/<value>. You can parameterize this function to provide the desired functionName at runtime.

The Azure Function Activity also supports queries. A query has to be included as part of the functionName . For
example, when the function name is HttpTriggerCSharp and the query that you want to include is name=hello , then
you can construct the functionName in the Azure Function Activity as HttpTriggerCSharp?name=hello . This function
can be parameterized so the value can be determined at runtime.

Timeout and long running functions


Azure Functions times out after 230 seconds regardless of the functionTimeout setting you've configured in the
settings. For more information, see this article. To work around this behavior, follow an async pattern or use
Durable Functions. The benefit of Durable Functions is that they offer their own state-tracking mechanism, so you
won't have to implement your own.
Learn more about Durable Functions in this article. You can set up an Azure Function Activity to call the Durable
Function, which will return a response with a different URI, such as this example. Because statusQueryGetUri
returns HTTP Status 202 while the function is running, you can poll the status of the function by using a Web
Activity. Simply set up a Web Activity with the url field set to
@activity('<AzureFunctionActivityName>').output.statusQueryGetUri . When the Durable Function completes, the
output of the function will be the output of the Web Activity.
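As a hedged sketch (activity names are placeholders), the polling Web Activity could look like the following; in practice you would typically place it inside an Until activity that checks the returned runtimeStatus:

{
    "name": "PollDurableFunctionStatus",
    "type": "WebActivity",
    "typeProperties": {
        "url": "@activity('StartDurableFunction').output.statusQueryGetUri",
        "method": "GET"
    },
    "dependsOn": [
        {
            "activity": "StartDurableFunction",
            "dependencyConditions": [ "Succeeded" ]
        }
    ]
}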

Next steps
Learn more about activities in Data Factory in Pipelines and activities in Azure Data Factory.
Execute data flow activity in Azure Data Factory
5/23/2019 • 2 minutes to read

Use the execute data flow activity to run your ADF data flow in pipeline debug (sandbox) runs and in pipeline
triggered runs.

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Syntax
{
"name": "MyDataFlowActivity",
"type": "ExecuteDataFlow",
"typeProperties": {
"dataflow": {
"referenceName": "dataflow1",
"type": "DataFlowReference"
},
"compute": {
"computeType": "General",
"coreCount": 8
}
}
}

Type properties
dataflow is the name of the data flow entity that you wish to execute
compute describes the Spark execution environment
coreCount is the number of cores to assign to this activity execution of your data flow
Debugging pipelines with data flows

Use the Data Flow Debug to utilize a warmed cluster for testing your data flows interactively in a pipeline debug
run. Use the Pipeline Debug option to test your data flows inside a pipeline.
Run on
This is a required field that defines which Integration Runtime to use for your Data Flow activity execution. By
default, Data Factory will use the default auto-resolve Azure Integration runtime. However, you can create your
own Azure Integration Runtimes that define specific regions, compute type, core counts, and TTL for your data
flow activity execution.
The default setting for Data Flow executions is 8 cores of general compute with a TTL of 60 minutes.
Choose the compute environment for this execution of your data flow. The default is the Azure Auto-Resolve
Default Integration Runtime. This choice will execute the data flow on the Spark environment in the same region
as your data factory. The compute type will be a job cluster, which means the compute environment will take
several minutes to start-up.
You have control over the Spark execution environment for your Data Flow activities. In the Azure integration
runtime are settings to set the compute type (general purpose, memory optimized, and compute optimized),
number of worker cores, and time-to-live to match the execution engine with your Data Flow compute
requirements. Also, setting TTL will allow you to maintain a warm cluster that is immediately available for job
executions.
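A hedged sketch of how a custom Azure Integration Runtime might be referenced from the activity (the integrationRuntime property name and the runtime name are assumptions; confirm against the JSON the authoring UI generates):

"typeProperties": {
    "dataflow": {
        "referenceName": "dataflow1",
        "type": "DataFlowReference"
    },
    "integrationRuntime": {
        "referenceName": "MyDataFlowAzureIR",
        "type": "IntegrationRuntimeReference"
    },
    "compute": {
        "computeType": "MemoryOptimized",
        "coreCount": 16
    }
}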
NOTE
The Integration Runtime selection in the Data Flow activity only applies to triggered executions of your pipeline. Debugging
your pipeline with Data Flows with Debug will execute against the 8-core default Spark cluster.

Staging area
If you are sinking your data into Azure SQL Data Warehouse, you must choose a staging location for your PolyBase batch load.
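A hedged sketch of the staging settings in the activity's typeProperties (the shape of the staging property and the names used here are assumptions; confirm against the JSON the authoring UI generates):

"typeProperties": {
    "dataflow": {
        "referenceName": "dataflow1",
        "type": "DataFlowReference"
    },
    "staging": {
        "linkedService": {
            "referenceName": "StagingBlobLinkedService",
            "type": "LinkedServiceReference"
        },
        "folderPath": "staging/polybase"
    }
}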

Parameterized datasets
If you are using parameterized datasets, be sure to set the parameter values.
Debugging parameterized data flows
You can only debug data flows with parameterized datasets from the Pipeline Debug run using the execute data
flow activity. Currently, interactive debug sessions in ADF Data Flow do not work with parameterized data sets.
Pipeline executions and debug runs will work with parameters.
A good practice is to build your data flow with a static dataset so that you have full metadata column propagation
available at design-time. Then replace the static dataset with a dynamic parameterized dataset when you
operationalize your data flow pipeline.

Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
Execute Pipeline activity in Azure Data Factory
3/14/2019 • 2 minutes to read

The Execute Pipeline activity allows a Data Factory pipeline to invoke another pipeline.

Syntax
{
"name": "MyPipeline",
"properties": {
"activities": [
{
"name": "ExecutePipelineActivity",
"type": "ExecutePipeline",
"typeProperties": {
"parameters": {
"mySourceDatasetFolderPath": {
"value": "@pipeline().parameters.mySourceDatasetFolderPath",
"type": "Expression"
}
},
"pipeline": {
"referenceName": "<InvokedPipelineName>",
"type": "PipelineReference"
},
"waitOnCompletion": true
}
}
],
"parameters": [
{
"mySourceDatasetFolderPath": {
"type": "String"
}
}
]
}
}

Type properties
name: Name of the execute pipeline activity. Allowed values: String. (Required: Yes)

type: Must be set to ExecutePipeline. Allowed values: String. (Required: Yes)

pipeline: Pipeline reference to the dependent pipeline that this pipeline invokes. A pipeline reference object has two properties: referenceName and type. The referenceName property specifies the name of the reference pipeline. The type property must be set to PipelineReference. Allowed values: PipelineReference. (Required: Yes)

parameters: Parameters to be passed to the invoked pipeline. Allowed values: a JSON object that maps parameter names to argument values. (Required: No)

waitOnCompletion: Defines whether activity execution waits for the dependent pipeline execution to finish. Default is false. Allowed values: Boolean. (Required: No)

Sample
This scenario has two pipelines:
Master pipeline - This pipeline has one Execute Pipeline activity that calls the invoked pipeline. The master
pipeline takes two parameters: masterSourceBlobContainer , masterSinkBlobContainer .
Invoked pipeline - This pipeline has one Copy activity that copies data from an Azure Blob source to Azure
Blob sink. The invoked pipeline takes two parameters: sourceBlobContainer , sinkBlobContainer .
Master pipeline definition
{
"name": "masterPipeline",
"properties": {
"activities": [
{
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "invokedPipeline",
"type": "PipelineReference"
},
"parameters": {
"sourceBlobContainer": {
"value": "@pipeline().parameters.masterSourceBlobContainer",
"type": "Expression"
},
"sinkBlobContainer": {
"value": "@pipeline().parameters.masterSinkBlobContainer",
"type": "Expression"
}
},
"waitOnCompletion": true
},
"name": "MyExecutePipelineActivity"
}
],
"parameters": {
"masterSourceBlobContainer": {
"type": "String"
},
"masterSinkBlobContainer": {
"type": "String"
}
}
}
}

Invoked pipeline definition


{
"name": "invokedPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"name": "CopyBlobtoBlob",
"inputs": [
{
"referenceName": "SourceBlobDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "sinkBlobDataset",
"type": "DatasetReference"
}
]
}
],
"parameters": {
"sourceBlobContainer": {
"type": "String"
},
"sinkBlobContainer": {
"type": "String"
}
}
}
}

Linked service

{
"name": "BlobStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=*****",
"type": "SecureString"
}
}
}
}

Source dataset
{
"name": "SourceBlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@pipeline().parameters.sourceBlobContainer",
"type": "Expression"
},
"fileName": "salesforce.txt"
},
"linkedServiceName": {
"referenceName": "BlobStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}

Sink dataset

{
"name": "sinkBlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@pipeline().parameters.sinkBlobContainer",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "BlobStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}

Running the pipeline


To run the master pipeline in this example, the following values are passed for the masterSourceBlobContainer
and masterSinkBlobContainer parameters:

{
"masterSourceBlobContainer": "executetest",
"masterSinkBlobContainer": "executesink"
}

The master pipeline forwards these values to the invoked pipeline as shown in the following example:
{
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "invokedPipeline",
"type": "PipelineReference"
},
"parameters": {
"sourceBlobContainer": {
"value": "@pipeline().parameters.masterSourceBlobContainer",
"type": "Expression"
},
"sinkBlobContainer": {
"value": "@pipeline().parameters.masterSinkBlobContainer",
"type": "Expression"
}
},

....
}

Next steps
See other control flow activities supported by Data Factory:
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Filter activity in Azure Data Factory
1/3/2019 • 2 minutes to read

You can use a Filter activity in a pipeline to apply a filter expression to an input array.

Syntax
{
"name": "MyFilterActivity",
"type": "filter",
"typeProperties": {
"condition": "<condition>",
"items": "<input array>"
}
}

Type properties
name: Name of the Filter activity. Allowed values: String. (Required: Yes)

type: Must be set to filter. Allowed values: String. (Required: Yes)

condition: Condition to be used for filtering the input. Allowed values: Expression. (Required: Yes)

items: Input array on which the filter should be applied. Allowed values: Expression. (Required: Yes)

Example
In this example, the pipeline has two activities: Filter and ForEach. The Filter activity is configured to filter the
input array for items with a value greater than 3. The ForEach activity then iterates over the filtered values and
waits for the number of seconds specified by the current value.
{
"name": "PipelineName",
"properties": {
"activities": [{
"name": "MyFilterActivity",
"type": "filter",
"typeProperties": {
"condition": "@greater(item(),3)",
"items": "@pipeline().parameters.inputs"
}
},
{
"name": "MyForEach",
"type": "ForEach",
"typeProperties": {
"isSequential": "false",
"batchCount": 1,
"items": "@activity('MyFilterActivity').output.value",
"activities": [{
"type": "Wait",
"typeProperties": {
"waitTimeInSeconds": "@item()"
},
"name": "MyWaitActivity"
}]
},
"dependsOn": [{
"activity": "MyFilterActivity",
"dependencyConditions": ["Succeeded"]
}]
}
],
"parameters": {
"inputs": {
"type": "Array",
"defaultValue": [1, 2, 3, 4, 5, 6]
}
}
}
}

Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
ForEach activity in Azure Data Factory
3/15/2019 • 5 minutes to read

The ForEach activity defines a repeating control flow in your pipeline. This activity is used to iterate over a collection and executes the specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping structure in programming languages.

Syntax
The properties are described later in this article. The items property is the collection, and each item in the collection is referred to by using @item(), as shown in the following syntax:

{
"name":"MyForEachActivityName",
"type":"ForEach",
"typeProperties":{
"isSequential":"true",
"items": {
"value": "@pipeline().parameters.mySinkDatasetFolderPathCollection",
"type": "Expression"
},
"activities":[
{
"name":"MyCopyActivity",
"type":"Copy",
"typeProperties":{
...
},
"inputs":[
{
"referenceName":"MyDataset",
"type":"DatasetReference",
"parameters":{
"MyFolderPath":"@pipeline().parameters.mySourceDatasetFolderPath"
}
}
],
"outputs":[
{
"referenceName":"MyDataset",
"type":"DatasetReference",
"parameters":{
"MyFolderPath":"@item()"
}
}
]
}
]
}
}

Type properties
name: Name of the for-each activity. Allowed values: String. (Required: Yes)

type: Must be set to ForEach. Allowed values: String. (Required: Yes)

isSequential: Specifies whether the loop should be executed sequentially or in parallel. A maximum of 20 loop iterations can be executed at once in parallel. For example, if you have a ForEach activity iterating over a copy activity with 10 different source and sink datasets with isSequential set to False, all copies are executed at once. Default is False. If isSequential is set to False, ensure that there is a correct configuration to run multiple executables. Otherwise, this property should be used with caution to avoid incurring write conflicts. For more information, see the Parallel execution section. Allowed values: Boolean. (Required: No. Default is False.)

batchCount: Batch count to be used for controlling the number of parallel executions (when isSequential is set to false). Allowed values: Integer (maximum 50). (Required: No. Default is 20.)

items: An expression that returns a JSON Array to be iterated over. Allowed values: Expression (which returns a JSON Array). (Required: Yes)

activities: The activities to be executed. Allowed values: List of activities. (Required: Yes)

Parallel execution
If isSequential is set to false, the activity iterates in parallel with a maximum of 20 concurrent iterations. This
setting should be used with caution. If the concurrent iterations are writing to the same folder but to different
files, this approach is fine. If the concurrent iterations are writing concurrently to the exact same file, this
approach most likely causes an error.
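
For example, a minimal sketch of a parallel ForEach configuration (the fileList parameter and the inner activity details are hypothetical placeholders, following the syntax shown above):

{
    "name": "MyParallelForEach",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": false,
        "batchCount": 10,
        "items": {
            "value": "@pipeline().parameters.fileList",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "CopyOneFile",
                "type": "Copy",
                "typeProperties": {
                    ...
                }
            }
        ]
    }
}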

Iteration expression language


In the ForEach activity, provide an array to be iterated over for the property items. Use @item() to iterate over
a single enumeration in ForEach activity. For example, if items is an array: [1, 2, 3], @item() returns 1 in the first
iteration, 2 in the second iteration, and 3 in the third iteration.
Iterating over a single activity
Scenario: Copy from the same source file in Azure Blob to multiple destination files in Azure Blob.
Pipeline definition
{
"name": "<MyForEachPipeline>",
"properties": {
"activities": [
{
"name": "<MyForEachActivity>",
"type": "ForEach",
"typeProperties": {
"isSequential": "true",
"items": {
"value": "@pipeline().parameters.mySinkDatasetFolderPath",
"type": "Expression"
},
"activities": [
{
"name": "MyCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": "false"
},
"sink": {
"type": "BlobSink",
"copyBehavior": "PreserveHierarchy"
}
},
"inputs": [
{
"referenceName": "<MyDataset>",
"type": "DatasetReference",
"parameters": {
"MyFolderPath": "@pipeline().parameters.mySourceDatasetFolderPath"
}
}
],
"outputs": [
{
"referenceName": "MyDataset",
"type": "DatasetReference",
"parameters": {
"MyFolderPath": "@item()"
}
}
]
}
]
}
}
],
"parameters": {
"mySourceDatasetFolderPath": {
"type": "String"
},
"mySinkDatasetFolderPath": {
"type": "Array"
}
}
}
}

Blob dataset definition


{
"name":"<MyDataset>",
"properties":{
"type":"AzureBlob",
"typeProperties":{
"folderPath":{
"value":"@dataset().MyFolderPath",
"type":"Expression"
}
},
"linkedServiceName":{
"referenceName":"StorageLinkedService",
"type":"LinkedServiceReference"
},
"parameters":{
"MyFolderPath":{
"type":"String"
}
}
}
}

Run parameter values

{
"mySourceDatasetFolderPath": "input/",
"mySinkDatasetFolderPath": [ "outputs/file1", "outputs/file2" ]
}

Iterate over multiple activities


It's possible to iterate over multiple activities (for example, copy and web activities) in a ForEach activity. In this
scenario, we recommend that you abstract the multiple activities into a separate pipeline. Then, you can use the
Execute Pipeline activity inside the ForEach activity to invoke the separate pipeline with multiple
activities.
Syntax
{
"name": "masterPipeline",
"properties": {
"activities": [
{
"type": "ForEach",
"name": "<MyForEachMultipleActivities>",
"typeProperties": {
"isSequential": true,
"items": {
...
},
"activities": [
{
"type": "ExecutePipeline",
"name": "<MyInnerPipeline>",
"typeProperties": {
"pipeline": {
"referenceName": "<copyHttpPipeline>",
"type": "PipelineReference"
},
"parameters": {
...
},
"waitOnCompletion": true
}
}
]
}
}
],
"parameters": {
...
}
}
}

Example
Scenario: Iterate over an inner pipeline within a ForEach activity by using the Execute Pipeline activity. The inner
pipeline copies data with parameterized schema definitions.
Master Pipeline definition
{
"name": "masterPipeline",
"properties": {
"activities": [
{
"type": "ForEach",
"name": "MyForEachActivity",
"typeProperties": {
"isSequential": true,
"items": {
"value": "@pipeline().parameters.inputtables",
"type": "Expression"
},
"activities": [
{
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "InnerCopyPipeline",
"type": "PipelineReference"
},
"parameters": {
"sourceTableName": {
"value": "@item().SourceTable",
"type": "Expression"
},
"sourceTableStructure": {
"value": "@item().SourceTableStructure",
"type": "Expression"
},
"sinkTableName": {
"value": "@item().DestTable",
"type": "Expression"
},
"sinkTableStructure": {
"value": "@item().DestTableStructure",
"type": "Expression"
}
},
"waitOnCompletion": true
},
"name": "ExecuteCopyPipeline"
}
]
}
}
],
"parameters": {
"inputtables": {
"type": "Array"
}
}
}
}

Inner pipeline definition

{
"name": "InnerCopyPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource"
},
"sink": {
"type": "SqlSink"
}
},
"name": "CopyActivity",
"inputs": [
{
"referenceName": "sqlSourceDataset",
"parameters": {
"SqlTableName": {
"value": "@pipeline().parameters.sourceTableName",
"type": "Expression"
},
"SqlTableStructure": {
"value": "@pipeline().parameters.sourceTableStructure",
"type": "Expression"
}
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "sqlSinkDataset",
"parameters": {
"SqlTableName": {
"value": "@pipeline().parameters.sinkTableName",
"type": "Expression"
},
"SqlTableStructure": {
"value": "@pipeline().parameters.sinkTableStructure",
"type": "Expression"
}
},
"type": "DatasetReference"
}
]
}
],
"parameters": {
"sourceTableName": {
"type": "String"
},
"sourceTableStructure": {
"type": "String"
},
"sinkTableName": {
"type": "String"
},
"sinkTableStructure": {
"type": "String"
}
}
}
}

Source dataset definition


{
"name": "sqlSourceDataset",
"properties": {
"type": "SqlServerTable",
"typeProperties": {
"tableName": {
"value": "@dataset().SqlTableName",
"type": "Expression"
}
},
"structure": {
"value": "@dataset().SqlTableStructure",
"type": "Expression"
},
"linkedServiceName": {
"referenceName": "sqlserverLS",
"type": "LinkedServiceReference"
},
"parameters": {
"SqlTableName": {
"type": "String"
},
"SqlTableStructure": {
"type": "String"
}
}
}
}

Sink dataset definition

{
"name": "sqlSinkDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": {
"value": "@dataset().SqlTableName",
"type": "Expression"
}
},
"structure": {
"value": "@dataset().SqlTableStructure",
"type": "Expression"
},
"linkedServiceName": {
"referenceName": "azureSqlLS",
"type": "LinkedServiceReference"
},
"parameters": {
"SqlTableName": {
"type": "String"
},
"SqlTableStructure": {
"type": "String"
}
}
}
}

Master pipeline parameters


{
"inputtables": [
{
"SourceTable": "department",
"SourceTableStructure": [
{
"name": "departmentid",
"type": "int"
},
{
"name": "departmentname",
"type": "string"
}
],
"DestTable": "department2",
"DestTableStructure": [
{
"name": "departmentid",
"type": "int"
},
{
"name": "departmentname",
"type": "string"
}
]
}
]
}

Aggregating outputs
To aggregate outputs of the ForEach activity, use Variables and the Append Variable activity.
First, declare an array variable on the pipeline. Then, invoke the Append Variable activity inside the ForEach loop.
Subsequently, you can retrieve the aggregation from your array.
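
A minimal sketch of this pattern (the pipeline, parameter, variable, and activity names are hypothetical placeholders):

{
    "name": "AggregateForEachOutputs",
    "properties": {
        "parameters": {
            "inputs": { "type": "Array" }
        },
        "variables": {
            "aggregatedValues": { "type": "Array", "defaultValue": [] }
        },
        "activities": [
            {
                "name": "MyForEach",
                "type": "ForEach",
                "typeProperties": {
                    "items": {
                        "value": "@pipeline().parameters.inputs",
                        "type": "Expression"
                    },
                    "activities": [
                        {
                            "name": "AppendCurrentItem",
                            "type": "AppendVariable",
                            "typeProperties": {
                                "variableName": "aggregatedValues",
                                "value": "@item()"
                            }
                        }
                    ]
                }
            }
        ]
    }
}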

Limitations and workarounds


Here are some limitations of the ForEach activity and suggested workarounds.

LIMITATION | WORKAROUND
You can't nest a ForEach loop inside another ForEach loop (or an Until loop). | Design a two-level pipeline where the outer pipeline with the outer ForEach loop iterates over an inner pipeline with the nested loop.
The ForEach activity has a maximum batchCount of 50 for parallel processing, and a maximum of 100,000 items. | Design a two-level pipeline where the outer pipeline with the ForEach activity iterates over an inner pipeline.

Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline Activity
Get Metadata Activity
Lookup Activity
Web Activity
Get metadata activity in Azure Data Factory
3/11/2019 • 4 minutes to read

GetMetadata activity can be used to retrieve metadata of any data in Azure Data Factory. This activity can be
used in the following scenarios:
Validate the metadata information of any data
Trigger a pipeline when data is ready/available
The following functionality is available in the control flow:
The output from the GetMetadata activity can be used in conditional expressions to perform validation.
A pipeline can be triggered when the condition is satisfied via Do-Until looping

Supported capabilities
The GetMetadata Activity takes a dataset as a required input, and outputs metadata information available as
activity output. Currently, the following connectors with corresponding retrievable metadata are supported,
and the maximum supported metadata size is up to 1MB.

NOTE
If you run GetMetadata activity on a Self-hosted Integration Runtime, the latest capability is supported on version 3.6
or above.

Supported connectors
File storage:

CONNECTOR/METADATA | ITEMNAME (file/folder) | ITEMTYPE (file/folder) | SIZE (file) | CREATED (file/folder) | LASTMODIFIED (file/folder) | CHILDITEMS (folder) | CONTENTMD5 (file) | STRUCTURE (file) | COLUMNCOUNT (file) | EXISTS (file/folder)
Amazon S3 | √/√ | √/√ | √ | x/x | √/√* | √ | x | √ | √ | √/√*
Google Cloud Storage | √/√ | √/√ | √ | x/x | √/√* | √ | x | √ | √ | √/√*
Azure Blob | √/√ | √/√ | √ | x/x | √/√* | √ | √ | √ | √ | √/√
Azure Data Lake Storage Gen1 | √/√ | √/√ | √ | x/x | √/√ | √ | x | √ | √ | √/√
Azure Data Lake Storage Gen2 | √/√ | √/√ | √ | x/x | √/√ | √ | x | √ | √ | √/√
Azure File Storage | √/√ | √/√ | √ | √/√ | √/√ | √ | x | √ | √ | √/√
File System | √/√ | √/√ | √ | √/√ | √/√ | √ | x | √ | √ | √/√
SFTP | √/√ | √/√ | √ | x/x | √/√ | √ | x | √ | √ | √/√
FTP | √/√ | √/√ | √ | x/x | √/√ | √ | x | √ | √ | √/√

For Amazon S3 and Google Cloud Storage, lastModified applies to the bucket and key but not to a virtual
folder, and exists applies to the bucket and key but not to a prefix or virtual folder.
For Azure Blob, lastModified applies to the container and blob but not to a virtual folder.
Relational database:

CONNECTOR/METADATA | STRUCTURE | COLUMNCOUNT | EXISTS
Azure SQL Database | √ | √ | √
Azure SQL Database Managed Instance | √ | √ | √
Azure SQL Data Warehouse | √ | √ | √
SQL Server | √ | √ | √

Metadata options
The following metadata types can be specified in the GetMetadata activity field list to retrieve:

METADATA TYPE | DESCRIPTION
itemName | Name of the file or folder.
itemType | Type of the file or folder. Output value is File or Folder.
size | Size of the file, in bytes. Applicable to files only.
created | Created datetime of the file or folder.
lastModified | Last modified datetime of the file or folder.
childItems | List of subfolders and files inside the given folder. Applicable to folders only. Output value is a list of the name and type of each child item.
contentMD5 | MD5 of the file. Applicable to files only.
structure | Data structure inside the file or relational database table. Output value is a list of column names and column types.
columnCount | Number of columns inside the file or relational table.
exists | Whether a file, folder, or table exists. Note that if exists is specified in the GetMetadata field list, the activity won't fail even when the item (file/folder/table) doesn't exist; instead, it returns exists: false in the output.

TIP
When you want to validate if a file/folder/table exists or not, specify exists in the GetMetadata activity field list, then
you can check the exists: true/false result from the activity output. If exists is not configured in the field list,
the GetMetadata activity will fail when the object is not found.
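
For example, a minimal sketch (the activity and dataset names are hypothetical placeholders) of chaining the exists result into an If Condition activity:

{
    "name": "CheckFileExists",
    "type": "GetMetadata",
    "typeProperties": {
        "fieldList": [ "exists" ],
        "dataset": {
            "referenceName": "MyDataset",
            "type": "DatasetReference"
        }
    }
},
{
    "name": "RunOnlyIfFileExists",
    "type": "IfCondition",
    "dependsOn": [
        {
            "activity": "CheckFileExists",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "expression": {
            "value": "@activity('CheckFileExists').output.exists",
            "type": "Expression"
        },
        "ifTrueActivities": [
            ...
        ]
    }
}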

Syntax
GetMetadata activity:

{
"name": "MyActivity",
"type": "GetMetadata",
"typeProperties": {
"fieldList" : ["size", "lastModified", "structure"],
"dataset": {
"referenceName": "MyDataset",
"type": "DatasetReference"
}
}
}

Dataset:
{
"name": "MyDataset",
"properties": {
"type": "AzureBlob",
"linkedServiceName": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath":"container/folder",
"filename": "file.json",
"format":{
"type":"JsonFormat"
}
}
}
}

Type properties
The GetMetadata activity supports the following type properties.

PROPERTY | DESCRIPTION | REQUIRED
fieldList | Lists the types of metadata information required. See details in the Metadata options section on supported metadata. | Yes
dataset | The reference dataset whose metadata is to be retrieved by the GetMetadata activity. See the Supported capabilities section on supported connectors, and refer to the connector topic for dataset syntax details. | Yes

Sample output
The GetMetadata result is shown in the activity output. Below are two samples with exhaustive metadata options
selected in the field list, as reference. To use the result in a subsequent activity, use the pattern
@{activity('MyGetMetadataActivity').output.itemName} .

Get a file's metadata


{
"exists": true,
"itemName": "test.csv",
"itemType": "File",
"size": 104857600,
"lastModified": "2017-02-23T06:17:09Z",
"created": "2017-02-23T06:17:09Z",
"contentMD5": "cMauY+Kz5zDm3eWa9VpoyQ==",
"structure": [
{
"name": "id",
"type": "Int64"
},
{
"name": "name",
"type": "String"
}
],
"columnCount": 2
}

Get a folder's metadata

{
"exists": true,
"itemName": "testFolder",
"itemType": "Folder",
"lastModified": "2017-02-23T06:17:09Z",
"created": "2017-02-23T06:17:09Z",
"childItems": [
{
"name": "test.avro",
"type": "File"
},
{
"name": "folder hello",
"type": "Folder"
}
]
}
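
The childItems list is commonly passed to a ForEach activity. A minimal sketch (the ForEach name is a hypothetical placeholder) of iterating over the folder's children, assuming a preceding GetMetadata activity named MyGetMetadataActivity:

{
    "name": "ForEachChildItem",
    "type": "ForEach",
    "dependsOn": [
        {
            "activity": "MyGetMetadataActivity",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('MyGetMetadataActivity').output.childItems",
            "type": "Expression"
        },
        "activities": [
            ...
        ]
    }
}

Inside the loop, @item().name and @item().type return each child item's name and type.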

Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline Activity
For Each Activity
Lookup Activity
Web Activity
If Condition activity in Azure Data Factory
3/5/2019 • 3 minutes to read

The If Condition activity provides the same functionality that an if statement provides in programming languages.
It evaluates a set of activities when the condition evaluates to true and another set of activities when the
condition evaluates to false .

Syntax
{
"name": "<Name of the activity>",
"type": "IfCondition",
"typeProperties": {
"expression": {
"value": "<expression that evaluates to true or false>",
"type": "Expression"
},

"ifTrueActivities": [
{
"<Activity 1 definition>"
},
{
"<Activity 2 definition>"
},
{
"<Activity N definition>"
}
],

"ifFalseActivities": [
{
"<Activity 1 definition>"
},
{
"<Activity 2 definition>"
},
{
"<Activity N definition>"
}
]
}
}

Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
name | Name of the if-condition activity. | String | Yes
type | Must be set to IfCondition. | String | Yes
expression | Expression that must evaluate to true or false. | Expression with result type boolean | Yes
ifTrueActivities | Set of activities that are executed when the expression evaluates to true. | Array | Yes
ifFalseActivities | Set of activities that are executed when the expression evaluates to false. | Array | Yes

Example
The pipeline in this example copies data from an input folder to an output folder. The output folder is determined
by the value of the pipeline parameter routeSelection. If the value of routeSelection is true, the data is copied to
outputPath1. If the value of routeSelection is false, the data is copied to outputPath2.

NOTE
This section provides JSON definitions and sample PowerShell commands to run the pipeline. For a walkthrough with step-
by-step instructions to create a Data Factory pipeline by using Azure PowerShell and JSON definitions, see tutorial: create a
data factory by using Azure PowerShell.

Pipeline with IF -Condition activity (Adfv2QuickStartPipeline.json)

{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "MyIfCondition",
"type": "IfCondition",
"typeProperties": {
"expression": {
"value": "@bool(pipeline().parameters.routeSelection)",
"type": "Expression"
},

"ifTrueActivities": [
{
"name": "CopyFromBlobToBlob1",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath1"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
],
"ifFalseActivities": [
{
"name": "CopyFromBlobToBlob2",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath2"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
]
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath1": {
"type": "String"
},
"outputPath2": {
"type": "String"
},
"routeSelection": {
"type": "String"
}
}
}
}

Another example of an expression is:


"expression": {
"value": "@equals(pipeline().parameters.routeSelection, 1)",
"type": "Expression"
}

Azure Storage linked service (AzureStorageLinkedService.json)

{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<Azure Storage account name>;AccountKey=
<Azure Storage account key>",
"type": "SecureString"
}
}
}
}

Parameterized Azure Blob dataset (BlobDataset.json)


The pipeline sets the folderPath to the value of either the outputPath1 or the outputPath2 parameter of the pipeline.

{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}

Pipeline parameter JSON (PipelineParameters.json)

{
"inputPath": "adftutorial/input",
"outputPath1": "adftutorial/outputIf",
"outputPath2": "adftutorial/outputElse",
"routeSelection": "false"
}

PowerShell commands
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.

These commands assume that you have saved the JSON files into the folder: C:\ADF.

Connect-AzAccount
Select-AzSubscription "<Your subscription name>"

$resourceGroupName = "<Resource Group Name>"


$dataFactoryName = "<Data Factory Name. Must be globally unique>";
Remove-AzDataFactoryV2 $dataFactoryName -ResourceGroupName $resourceGroupName -force

Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName


Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -
Name "AzureStorageLinkedService" -DefinitionFile "C:\ADF\AzureStorageLinkedService.json"
Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name
"BlobDataset" -DefinitionFile "C:\ADF\BlobDataset.json"
Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name
"Adfv2QuickStartPipeline" -DefinitionFile "C:\ADF\Adfv2QuickStartPipeline.json"
$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName
$resourceGroupName -PipelineName "Adfv2QuickStartPipeline" -ParameterFile C:\ADF\PipelineParameters.json
while ($True) {
$run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName
$DataFactoryName -PipelineRunId $runId

if ($run) {
if ($run.Status -ne 'InProgress') {
Write-Host "Pipeline run finished. The status is: " $run.Status -foregroundcolor "Yellow"
$run
break
}
Write-Host "Pipeline is running...status: InProgress" -foregroundcolor "Yellow"
}

Start-Sleep -Seconds 30
}
Write-Host "Activity run details:" -foregroundcolor "Yellow"
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName
$resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-
Date).AddMinutes(30)
$result

Write-Host "Activity 'Output' section:" -foregroundcolor "Yellow"


$result.Output -join "`r`n"

Write-Host "`nActivity 'Error' section:" -foregroundcolor "Yellow"


$result.Error -join "`r`n"

Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Lookup activity in Azure Data Factory
3/15/2019 • 6 minutes to read

Lookup activity can retrieve a dataset from any of the Azure Data Factory-supported data sources. Use it in
the following scenario:
Dynamically determine which objects to operate on in a subsequent activity, instead of hard coding the
object name. Some object examples are files and tables.
Lookup activity reads and returns the content of a configuration file or table. It also returns the result of
executing a query or stored procedure. The output from Lookup activity can be used in a subsequent copy
or transformation activity if it's a singleton value. The output can be used in a ForEach activity if it's an array
of attributes.

Supported capabilities
The following data sources are supported for Lookup activity. The largest number of rows that can be
returned by Lookup activity is 5,000, up to 2 MB in size. Currently, the longest duration for Lookup activity
before timeout is one hour.

CATEGORY | DATA STORE

Azure | Azure Blob storage

Azure Cosmos DB (SQL API)

Azure Data Explorer

Azure Data Lake Storage Gen1

Azure Data Lake Storage Gen2

Azure Database for MariaDB

Azure Database for MySQL

Azure Database for PostgreSQL

Azure Files

Azure SQL Database

Azure SQL Database Managed Instance

Azure SQL Data Warehouse

Azure Table storage

Database | Amazon Redshift



DB2

Drill (Preview)

Google BigQuery

Greenplum

HBase

Hive

Apache Impala (Preview)

Informix

MariaDB

Microsoft Access

MySQL

Netezza

Oracle

Phoenix

PostgreSQL

Presto (Preview)

SAP Business Warehouse Open Hub

SAP Business Warehouse via MDX

SAP HANA

SAP Table

Spark

SQL Server

Sybase

Teradata

Vertica

NoSQL | Cassandra

Couchbase (Preview)

File | Amazon S3

File System

FTP

Google Cloud Storage

HDFS

SFTP

Generic protocol | Generic HTTP

Generic OData

Generic ODBC

Services and apps | Amazon Marketplace Web Service (Preview)

Common Data Service for Apps

Concur (Preview)

Dynamics 365

Dynamics AX (Preview)

Dynamics CRM

Google AdWords (Preview)

HubSpot (Preview)

Jira (Preview)

Magento (Preview)

Marketo (Preview)

Oracle Eloqua (Preview)

Oracle Responsys (Preview)

Oracle Service Cloud (Preview)



Paypal (Preview)

QuickBooks (Preview)

Salesforce

Salesforce Service Cloud

Salesforce Marketing Cloud (Preview)

SAP Cloud for Customer (C4C)

SAP ECC

ServiceNow

Shopify (Preview)

Square (Preview)

Web Table (HTML table)

Xero (Preview)

Zoho (Preview)

NOTE
Any connector marked as Preview means that you can try it out and give us feedback. If you want to take a
dependency on preview connectors in your solution, please contact Azure support.

Syntax
{
"name": "LookupActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "<source type>"
<additional source specific properties (optional)>
},
"dataset": {
"referenceName": "<source dataset name>",
"type": "DatasetReference"
},
"firstRowOnly": false
}
}

Type properties
NAME | DESCRIPTION | TYPE | REQUIRED?
dataset | Provides the dataset reference for the lookup. Get details from the Dataset properties section in each corresponding connector article. | Key/value pair | Yes
source | Contains dataset-specific source properties, the same as the Copy Activity source. Get details from the Copy Activity properties section in each corresponding connector article. | Key/value pair | Yes
firstRowOnly | Indicates whether to return only the first row or all rows. | Boolean | No. The default is true.

NOTE
Source columns with ByteArray type aren't supported.
Structure isn't supported in dataset definitions. For text-format files, use the header row to provide the column
name.
If your lookup source is a JSON file, the jsonPathDefinition setting for reshaping the JSON object isn't
supported. The entire object is retrieved.

Use the Lookup activity result in a subsequent activity


The lookup result is returned in the output section of the activity run result.
When firstRowOnly is set to true (default), the output format is as shown in the following code.
The lookup result is under a fixed firstRow key. To use the result in subsequent activity, use the
pattern of @{activity('MyLookupActivity').output.firstRow.TableName} .

{
"firstRow":
{
"Id": "1",
"TableName" : "Table1"
}
}

When firstRowOnly is set to false , the output format is as shown in the following code. A count
field indicates how many records are returned. Detailed values are displayed under a fixed value
array. In such a case, the Lookup activity is followed by a Foreach activity. You pass the value array
to the ForEach activity items field by using the pattern of
@activity('MyLookupActivity').output.value . To access elements in the value array, use the following
syntax: @{activity('lookupActivity').output.value[zero based index].propertyname} . An example is
@{activity('lookupActivity').output.value[0].tablename} .
{
"count": "2",
"value": [
{
"Id": "1",
"TableName" : "Table1"
},
{
"Id": "2",
"TableName" : "Table2"
}
]
}
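
For example, a minimal sketch (the activity names are hypothetical placeholders) of wiring the value array into a ForEach activity:

{
    "name": "ForEachTable",
    "type": "ForEach",
    "dependsOn": [
        {
            "activity": "MyLookupActivity",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('MyLookupActivity').output.value",
            "type": "Expression"
        },
        "activities": [
            ...
        ]
    }
}

Inside the loop, each record is available as @item(), for example @item().TableName.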

Copy Activity example


In this example, Copy Activity copies data from a SQL table in your Azure SQL Database instance to Azure
Blob storage. The name of the SQL table is stored in a JSON file in Blob storage. The Lookup activity looks
up the table name at runtime. JSON is modified dynamically by using this approach. You don't need to
redeploy pipelines or datasets.
This example demonstrates lookup for the first row only. For lookup for all rows and to chain the results
with ForEach activity, see the samples in Copy multiple tables in bulk by using Azure Data Factory.
Pipeline
This pipeline contains two activities: Lookup and Copy.
The Lookup activity is configured to use LookupDataset, which refers to a location in Azure Blob
storage. The Lookup activity reads the name of the SQL table from a JSON file in this location.
Copy Activity uses the output of the Lookup activity, which is the name of the SQL table. The
tableName property in the SourceDataset is configured to use the output from the Lookup activity.
Copy Activity copies data from the SQL table to a location in Azure Blob storage. The location is specified
by the SinkDataset property.
{
"name": "LookupPipelineDemo",
"properties": {
"activities": [
{
"name": "LookupActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"dataset": {
"referenceName": "LookupDataset",
"type": "DatasetReference"
}
}
},
{
"name": "CopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from
@{activity('LookupActivity').output.firstRow.tableName}"
},
"sink": {
"type": "BlobSink"
}
},
"dependsOn": [
{
"activity": "LookupActivity",
"dependencyConditions": [ "Succeeded" ]
}
],
"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}
]
}
]
}
}

Lookup dataset
The lookup dataset is the sourcetable.json file in the Azure Storage lookup folder specified by the
AzureStorageLinkedService type.
{
"name": "LookupDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "lookup",
"fileName": "sourcetable.json",
"format": {
"type": "JsonFormat",
"filePattern": "SetOfObjects"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}

Source dataset for Copy Activity


The source dataset uses the output of the Lookup activity, which is the name of the SQL table. Copy Activity
copies data from this SQL table to a location in Azure Blob storage. The location is specified by the sink
dataset.

{
"name": "SourceDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties":{
"tableName": "@{activity('LookupActivity').output.firstRow.tableName}"
},
"linkedServiceName": {
"referenceName": "AzureSqlLinkedService",
"type": "LinkedServiceReference"
}
}
}

Sink dataset for Copy Activity


Copy Activity copies data from the SQL table to the filebylookup.csv file in the csv folder in Azure
Storage. The file is specified by the AzureStorageLinkedService property.

{
"name": "SinkDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "csv",
"fileName": "filebylookup.csv",
"format": {
"type": "TextFormat"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
Azure Storage linked service
This storage account contains the JSON file with the names of the SQL tables.

{
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<StorageAccountName>;AccountKey=
<StorageAccountKey>",
"type": "SecureString"
}
}
},
"name": "AzureStorageLinkedService"
}

Azure SQL Database linked service


This Azure SQL Database instance contains the data to be copied to Blob storage.

{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"description": "",
"typeProperties": {
"connectionString": {
"value": "Server=<server>;Initial Catalog=<database>;User ID=<user>;Password=<password>;",
"type": "SecureString"
}
}
}
}

sourcetable.json
Set of objects

{
"Id": "1",
"tableName": "Table1"
}
{
"Id": "2",
"tableName": "Table2"
}

Array of objects

[
{
"Id": "1",
"tableName": "Table1"
},
{
"Id": "2",
"tableName": "Table2"
}
]
Limitations and workarounds
Here are some limitations of the Lookup activity and suggested workarounds.

LIMITATION | WORKAROUND
The Lookup activity has a maximum of 5,000 rows, and a maximum size of 2 MB. | Design a two-level pipeline where the outer pipeline iterates over an inner pipeline, which retrieves data that doesn't exceed the maximum rows or size.

Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline activity
ForEach activity
GetMetadata activity
Web activity
Set Variable Activity in Azure Data Factory
3/7/2019 • 2 minutes to read

Use the Set Variable activity to set the value of an existing variable of type String, Bool, or Array defined in a Data
Factory pipeline.

Type properties
PROPERTY | DESCRIPTION | REQUIRED
name | Name of the activity in the pipeline | Yes
description | Text describing what the activity does | No
type | Activity type is SetVariable | Yes
value | String literal or expression object value used to set the specified variable | Yes
variableName | Name of the variable that will be set by this activity | Yes
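
A minimal sketch of a pipeline that declares a variable and sets it with a Set Variable activity (the pipeline, variable, and value expression are hypothetical placeholders):

{
    "name": "MyPipeline",
    "properties": {
        "variables": {
            "myStringVar": { "type": "String", "defaultValue": "" }
        },
        "activities": [
            {
                "name": "SetMyVariable",
                "type": "SetVariable",
                "typeProperties": {
                    "variableName": "myStringVar",
                    "value": "@concat('run-', pipeline().RunId)"
                }
            }
        ]
    }
}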

Next steps
Learn about a related control flow activity supported by Data Factory:
Append Variable Activity
Until activity in Azure Data Factory
3/5/2019 • 4 minutes to read

The Until activity provides the same functionality that a do-until looping structure provides in programming
languages. It executes a set of activities in a loop until the condition associated with the activity evaluates to true.
You can specify a timeout value for the until activity in Data Factory.

Syntax
{
"type": "Until",
"typeProperties": {
"expression": {
"value": "<expression that evaluates to true or false>",
"type": "Expression"
},
"timeout": "<time out for the loop. for example: 00:01:00 (1 minute)>",
"activities": [
{
"<Activity 1 definition>"
},
{
"<Activity 2 definition>"
},
{
"<Activity N definition>"
}
]
},
"name": "MyUntilActivity"
}

Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
name | Name of the Until activity. | String | Yes
type | Must be set to Until. | String | Yes
expression | Expression that must evaluate to true or false. | Expression. | Yes
timeout | The do-until loop times out after the specified time here. | String. d.hh:mm:ss (or) hh:mm:ss. The default value is 7 days. Maximum value is 90 days. | No
Activities | Set of activities that are executed until the expression evaluates to true. | Array of activities. | Yes
Example 1
NOTE
This section provides JSON definitions and sample PowerShell commands to run the pipeline. For a walkthrough with step-
by-step instructions to create a Data Factory pipeline by using Azure PowerShell and JSON definitions, see tutorial: create a
data factory by using Azure PowerShell.

Pipeline with Until activity


In this example, the pipeline has two activities: Until and Wait. The Wait activity waits for the specified period of
time before running the Web activity in the loop. To learn about expressions and functions in Data Factory, see
Expression language and functions.

{
"name": "DoUntilPipeline",
"properties": {
"activities": [
{
"type": "Until",
"typeProperties": {
"expression": {
"value": "@equals('Failed', coalesce(body('MyUnauthenticatedActivity')?.status,
actions('MyUnauthenticatedActivity')?.status, 'null'))",
"type": "Expression"
},
"timeout": "00:00:01",
"activities": [
{
"name": "MyUnauthenticatedActivity",
"type": "WebActivity",
"typeProperties": {
"method": "get",
"url": "https://www.fake.com/",
"headers": {
"Content-Type": "application/json"
}
},
"dependsOn": [
{
"activity": "MyWaitActivity",
"dependencyConditions": [ "Succeeded" ]
}
]
},
{
"type": "Wait",
"typeProperties": {
"waitTimeInSeconds": 1
},
"name": "MyWaitActivity"
}
]
},
"name": "MyUntilActivity"
}
]
}
}

Example 2
The pipeline in this sample copies data from an input folder to an output folder in a loop. The loop terminates
when the value for the repeat parameter is set to false or it times out after one minute.
Pipeline with Until activity (Adfv2QuickStartPipeline.json)

{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"type": "Until",
"typeProperties": {
"expression": {
"value": "@equals('false', pipeline().parameters.repeat)",
"type": "Expression"
},
"timeout": "00:01:00",
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"policy": {
"retry": 1,
"timeout": "00:10:00",
"retryIntervalInSeconds": 60
}
}
]
},
"name": "MyUntilActivity"
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
},
"repeat": {
"type": "String"
}
}
}
}

Azure Storage linked service (AzureStorageLinkedService.json)

{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<Azure Storage account name>;AccountKey=
<Azure Storage account key>",
"type": "SecureString"
}
}
}
}

Parameterized Azure Blob dataset (BlobDataset.json)


The pipeline sets the folderPath to the value of either the inputPath or the outputPath parameter of the pipeline.

{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}

Pipeline parameter JSON (PipelineParameters.json)

{
"inputPath": "adftutorial/input",
"outputPath": "adftutorial/outputUntil",
"repeat": "true"
}

PowerShell commands
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

These commands assume that you have saved the JSON files into the folder: C:\ADF.

Connect-AzAccount
Select-AzSubscription "<Your subscription name>"

$resourceGroupName = "<Resource Group Name>"


$dataFactoryName = "<Data Factory Name. Must be globally unique>";
Remove-AzDataFactoryV2 $dataFactoryName -ResourceGroupName $resourceGroupName -force

Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName


Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -
Name "AzureStorageLinkedService" -DefinitionFile "C:\ADF\AzureStorageLinkedService.json"
Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name
"BlobDataset" -DefinitionFile "C:\ADF\BlobDataset.json"
Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name
"Adfv2QuickStartPipeline" -DefinitionFile "C:\ADF\Adfv2QuickStartPipeline.json"
$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName
$resourceGroupName -PipelineName "Adfv2QuickStartPipeline" -ParameterFile C:\ADF\PipelineParameters.json

while ($True) {
$run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName
$DataFactoryName -PipelineRunId $runId

if ($run) {
if ($run.Status -ne 'InProgress') {
Write-Host "Pipeline run finished. The status is: " $run.Status -foregroundcolor "Yellow"
$run
break
}
Write-Host "Pipeline is running...status: InProgress" -foregroundcolor "Yellow"
Write-Host "Activity run details:" -foregroundcolor "Yellow"
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName
$resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-
Date).AddMinutes(30)
$result

Write-Host "Activity 'Output' section:" -foregroundcolor "Yellow"


$result.Output -join "`r`n"
}

Start-Sleep -Seconds 15
}

Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Validation activity in Azure Data Factory
3/27/2019 • 2 minutes to read

You can use a Validation activity in a pipeline to ensure that the pipeline only continues execution once it has validated
that the attached dataset reference exists, that it meets the specified criteria, or that the timeout has been reached.

Syntax
{
"name": "Validation_Activity",
"type": "Validation",
"typeProperties": {
"dataset": {
"referenceName": "Storage_File",
"type": "DatasetReference"
},
"timeout": "7.00:00:00",
"sleep": 10,
"minimumSize": 20
}
},
{
"name": "Validation_Activity_Folder",
"type": "Validation",
"typeProperties": {
"dataset": {
"referenceName": "Storage_Folder",
"type": "DatasetReference"
},
"timeout": "7.00:00:00",
"sleep": 10,
"childItems": true
}
}

Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
name | Name of the Validation activity. | String | Yes
type | Must be set to Validation. | String | Yes
dataset | The activity will block execution until it has validated that this dataset reference exists and that it meets the specified criteria, or the timeout has been reached. The dataset provided should support the "MinimumSize" or "ChildItems" property. | Dataset reference | Yes
timeout | Specifies the timeout for the activity to run. If no value is specified, the default value is 7 days ("7.00:00:00"). Format is d.hh:mm:ss. | String | No
sleep | A delay in seconds between validation attempts. If no value is specified, the default value is 10 seconds. | Integer | No
childItems | Checks whether the folder has child items. Can be set to true (validate that the folder exists and that it has items; blocks until at least one item is present in the folder or the timeout value is reached) or false (validate that the folder exists and that it is empty; blocks until the folder is empty or the timeout value is reached). If no value is specified, the activity blocks until the folder exists or until the timeout is reached. | Boolean | No
minimumSize | Minimum size of a file in bytes. If no value is specified, the default value is 0 bytes. | Integer | No

Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
Execute wait activity in Azure Data Factory
2/25/2019 • 2 minutes to read

When you use a Wait activity in a pipeline, the pipeline waits for the specified period of time before continuing
with execution of subsequent activities.

Syntax
{
"name": "MyWaitActivity",
"type": "Wait",
"typeProperties": {
"waitTimeInSeconds": 1
}
}

Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
name | Name of the Wait activity. | String | Yes
type | Must be set to Wait. | String | Yes
waitTimeInSeconds | The number of seconds that the pipeline waits before continuing with the processing. | Integer | Yes

Example
NOTE
This section provides JSON definitions and sample PowerShell commands to run the pipeline. For a walkthrough with step-
by-step instructions to create a Data Factory pipeline by using Azure PowerShell and JSON definitions, see tutorial: create a
data factory by using Azure PowerShell.

Pipeline with Wait activity


In this example, the pipeline has two activities: Until and Wait. The Wait activity is configured to wait for one
second. The pipeline runs the Web activity in a loop with one second waiting time between each run.
{
"name": "DoUntilPipeline",
"properties": {
"activities": [
{
"type": "Until",
"typeProperties": {
"expression": {
"value": "@equals('Failed', coalesce(body('MyUnauthenticatedActivity')?.status,
actions('MyUnauthenticatedActivity')?.status, 'null'))",
"type": "Expression"
},
"timeout": "00:00:01",
"activities": [
{
"name": "MyUnauthenticatedActivity",
"type": "WebActivity",
"typeProperties": {
"method": "get",
"url": "https://www.fake.com/",
"headers": {
"Content-Type": "application/json"
}
},
"dependsOn": [
{
"activity": "MyWaitActivity",
"dependencyConditions": [ "Succeeded" ]
}
]
},
{
"type": "Wait",
"typeProperties": {
"waitTimeInSeconds": 1
},
"name": "MyWaitActivity"
}
]
},
"name": "MyUntilActivity"
}
]
}
}

Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
Web activity in Azure Data Factory
1/10/2019 • 3 minutes to read

Web Activity can be used to call a custom REST endpoint from a Data Factory pipeline. You can pass datasets
and linked services to be consumed and accessed by the activity.

Syntax
{
"name":"MyWebActivity",
"type":"WebActivity",
"typeProperties":{
"method":"Post",
"url":"<URLEndpoint>",
"headers":{
"Content-Type":"application/json"
},
"authentication":{
"type":"ClientCertificate",
"pfx":"****",
"password":"****"
},
"datasets":[
{
"referenceName":"<ConsumedDatasetName>",
"type":"DatasetReference",
"parameters":{
...
}
}
],
"linkedServices":[
{
"referenceName":"<ConsumedLinkedServiceName>",
"type":"LinkedServiceReference"
}
]
}
}

Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
name | Name of the web activity. | String | Yes
type | Must be set to WebActivity. | String | Yes
method | REST API method for the target endpoint. | String. Supported types: "GET", "POST", "PUT" | Yes
url | Target endpoint and path. | String (or expression with resultType of string). The activity will time out at 1 minute with an error if it does not receive a response from the endpoint. | Yes
headers | Headers that are sent to the request. For example, to set the language and type on a request: "headers" : { "Accept-Language": "en-us", "Content-Type": "application/json" }. | String (or expression with resultType of string) | Yes, the Content-Type header is required: "headers":{ "Content-Type":"application/json" }
body | Represents the payload that is sent to the endpoint. See the schema of the request payload in the Request payload schema section. | String (or expression with resultType of string). | Required for POST/PUT methods.
authentication | Authentication method used for calling the endpoint. Supported types are "Basic" or "ClientCertificate." For more information, see the Authentication section. If authentication is not required, exclude this property. | String (or expression with resultType of string) | No
datasets | List of datasets passed to the endpoint. | Array of dataset references. Can be an empty array. | Yes
linkedServices | List of linked services passed to the endpoint. | Array of linked service references. Can be an empty array. | Yes

NOTE
REST endpoints that the web activity invokes must return a response of type JSON. The activity will timeout at 1 minute
with an error if it does not receive a response from the endpoint.

The following table shows the requirements for JSON content:

VALUE TYPE | REQUEST BODY | RESPONSE BODY
JSON object | Supported | Supported
JSON array | Supported | Unsupported (At present, JSON arrays don't work as a result of a bug. A fix is in progress.)
JSON value | Supported | Unsupported
Non-JSON type | Unsupported | Unsupported

Authentication
None
If authentication is not required, do not include the "authentication" property.
Basic
Specify user name and password to use with the basic authentication.

"authentication":{
"type":"Basic",
"username":"****",
"password":"****"
}

Client certificate
Specify base64-encoded contents of a PFX file and the password.

"authentication":{
"type":"ClientCertificate",
"pfx":"****",
"password":"****"
}

Managed Identity
Specify the resource uri for which the access token will be requested using the managed identity for the data
factory. To call the Azure Resource Management API, use https://management.azure.com/ . For more
information about how managed identities works see the managed identities for Azure resources overview
page.

"authentication": {
"type": "MSI",
"resource": "https://management.azure.com/"
}

Request payload schema


When you use the POST/PUT method, the body property represents the payload that is sent to the endpoint.
You can pass linked services and datasets as part of the payload. Here is the schema for the payload:
{
"body": {
"myMessage": "Sample",
"datasets": [{
"name": "MyDataset1",
"properties": {
...
}
}],
"linkedServices": [{
"name": "MyStorageLinkedService1",
"properties": {
...
}
}]
}
}

Example
In this example, the Web activity in the pipeline calls a REST endpoint. It passes an Azure SQL linked service
and an Azure SQL dataset to the endpoint. The REST endpoint uses the Azure SQL connection string to
connect to the Azure SQL server and returns the name of the instance of SQL Server.
Pipeline definition
{
"name": "<MyWebActivityPipeline>",
"properties": {
"activities": [
{
"name": "<MyWebActivity>",
"type": "WebActivity",
"typeProperties": {
"method": "Post",
"url": "@pipeline().parameters.url",
"headers": {
"Content-Type": "application/json"
},
"authentication": {
"type": "ClientCertificate",
"pfx": "*****",
"password": "*****"
},
"datasets": [
{
"referenceName": "MySQLDataset",
"type": "DatasetReference",
"parameters": {
"SqlTableName": "@pipeline().parameters.sqlTableName"
}
}
],
"linkedServices": [
{
"referenceName": "SqlLinkedService",
"type": "LinkedServiceReference"
}
]
}
}
],
"parameters": {
"sqlTableName": {
"type": "String"
},
"url": {
"type": "String"
}
}
}
}

Pipeline parameter values

{
"sqlTableName": "department",
"url": "https://adftes.azurewebsites.net/api/execute/running"
}

Web service endpoint code


[HttpPost]
public HttpResponseMessage Execute(JObject payload)
{
Trace.TraceInformation("Start Execute");

JObject result = new JObject();


result.Add("status", "complete");

JArray datasets = payload.GetValue("datasets") as JArray;


result.Add("sinktable", datasets[0]["properties"]["typeProperties"]["tableName"].ToString());

JArray linkedServices = payload.GetValue("linkedServices") as JArray;


string connString = linkedServices[0]["properties"]["typeProperties"]["connectionString"].ToString();

System.Data.SqlClient.SqlConnection sqlConn = new System.Data.SqlClient.SqlConnection(connString);

result.Add("sinkServer", sqlConn.DataSource);

Trace.TraceInformation("Stop Execute");

return this.Request.CreateResponse(HttpStatusCode.OK, result);


}

Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Webhook activity in Azure Data Factory
4/10/2019 • 2 minutes to read

You can use a webhook activity to control the execution of pipelines through your custom code. Using the
webhook activity, customers can call an endpoint and pass a callback URL. The pipeline run waits for the callback to
be invoked before proceeding to the next activity.

Syntax
{
"name": "MyWebHookActivity",
"type": "WebHook",
"typeProperties": {
"method": "POST",
"url": "<URLEndpoint>",
"headers": {
"Content-Type": "application/json"
},
"body": {
"key": "value"
},
"timeout": "00:03:00",
"authentication": {
"type": "ClientCertificate",
"pfx": "****",
"password": "****"
}
}
}

Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
name | Name of the webhook activity. | String | Yes
type | Must be set to WebHook. | String | Yes
method | REST API method for the target endpoint. | String. Supported types: 'POST' | Yes
url | Target endpoint and path. | String (or expression with resultType of string). | Yes
headers | Headers that are sent to the request. For example, to set the language and type on a request: "headers" : { "Accept-Language": "en-us", "Content-Type": "application/json" }. | String (or expression with resultType of string) | Yes, the Content-Type header is required: "headers":{ "Content-Type":"application/json" }
body | Represents the payload that is sent to the endpoint. | The body passed back to the callback URI should be valid JSON. See the schema of the request payload in the Request payload schema section. | Yes
authentication | Authentication method used for calling the endpoint. Supported types are "Basic" or "ClientCertificate." For more information, see the Authentication section. If authentication is not required, exclude this property. | String (or expression with resultType of string) | No
timeout | How long the activity will wait for the 'callBackUri' to be invoked. The default value is 10 minutes ("00:10:00"). Format is a Timespan, that is, d.hh:mm:ss. | String | No

Additional notes
Azure Data Factory will pass an additional property "callBackUri" in the body to the URL endpoint, and will expect
this URI to be invoked before the specified timeout value. If the URI is not invoked, the activity will fail with the status
'TimedOut'.
The webhook activity itself fails only when the call to the custom endpoint fails. Any error message can be added
into the body of the callback and used in a subsequent activity.

Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
Azure Data Factory Mapping Data Flow Aggregate
Transformation
2/22/2019 • 2 minutes to read

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

The Aggregate transformation is where you'll define aggregations of columns in your data streams. In the
Expression Builder, you can define different types of aggregations (such as SUM, MIN, MAX, and COUNT) and create a
new field in your output that includes these aggregations, with optional group-by fields.

Group By
(Optional) Choose a Group-by clause for your aggregation and use either the name of an existing column or a new
name. Use "Add Column" to add more group-by clauses, and click on the text box next to the column name to launch
the Expression Builder to select an existing column, a combination of columns, or an expression for your
grouping.

The Aggregate Column tab


(Required) Choose the Aggregate Column tab to build the aggregation expressions. You can either choose an
existing column to overwrite the value with the aggregation, or create a new field with the new name for the
aggregation. The expression that you wish to use for the aggregation will be entered in the right-hand box next to
the column name selector. Clicking on that text box will open up the Expression Builder.
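
For example, a hypothetical aggregation (the column names are placeholders) might group by year and define aggregate columns such as:

Group by: year

totalGross = sum(toInteger(gross))
averageRating = avg(toInteger(rating))
titleCount = count()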
Data Preview in Expression Builder
In Debug mode, the expression builder cannot produce data previews with Aggregate functions. To view data
previews for aggregate transformations, close the expression builder and view the data profile from the data flow
designer.
Azure Data Factory Alter Row Transformation
5/10/2019 • 2 minutes to read

Use the Alter Row transformation to set insert, delete, update, and upsert policies on rows. You can add one-to-many
conditions as expressions. Each of those conditions can result in a row (or rows) being inserted, updated,
deleted, or upserted. Alter Row can produce both DDL and DML actions against your database.
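
For example, hypothetical alter-row conditions on a status column (the column name and values are placeholders) might be:

Insert if: status == 'new'
Upsert if: status == 'active'
Delete if: status == 'retired'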

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

NOTE
Alter Row transformations will only operate on database sinks in your data flow. The actions that you assign to rows (insert,
update, delete, upsert) will not occur during debug sessions. You must add an Execute Data Flow task to a pipeline and use
pipeline debug or triggers to enact the alter row policies on your database tables.

View policies
Switch the Data Flow Debug mode to on and then view the results of your alter row policies in the Data Preview
pane. Executing an alter row in Data Flow Debug mode will not produce DDL or DML actions against your target.
In order to have those actions occur, execute the data flow inside an Execute Data Flow activity inside a pipeline.
This will allow you to verify and view the state of each row based on your conditions. There are icons representing
each insert, update, delete, and upsert action that will occur in your data flow, indicating which action will take place
when you execute the data flow inside a pipeline.

Sink settings
You must have a database sink type for Alter Row to work. In the sink Settings, you must set each action to be
allowed.

The default behavior in ADF Data Flow with database sinks is to insert rows. If you want to allow updates, upserts,
and deletes as well, you must also check these boxes in the sink to allow the actions.

NOTE
If your inserts, updates, or upserts modify the schema of the target table in the sink, your data flow will fail. In order to
modify the target schema in your database, you must choose the "Recreate table" option in the sink. This will drop and
recreate your table with the new schema definition.

Next steps
After the Alter Row transformation, you may want to sink your data into a destination data store.
Mapping data flow conditional split transformation
5/15/2019 • 2 minutes to read

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

The Conditional Split transformation can route data rows to different streams depending on the content of the
data. The implementation of the Conditional Split transformation is similar to a CASE decision structure in a
programming language. The transformation evaluates expressions, and based on the results, directs the data row
to the specified stream. This transformation also provides a default output, so that if a row matches no expression it
is directed to the default output.

Multiple paths
To add additional conditions, select "Add Stream" in the bottom configuration pane and click in the Expression
Builder text box to build your expression.
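
For example, hypothetical split conditions on a year column (the column name is a placeholder) might route rows like this:

stream1: year < 1980
stream2: year >= 1980 && year < 2000
default: all remaining rows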

Next steps
Common data flow transformations used with conditional split: Join transformation, Lookup transformation, Select
transformation
Mapping data flow derived column transformation
4/29/2019 • 2 minutes to read

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Use the Derived Column transformation to generate new columns in your data flow or to modify existing fields.

You can perform multiple Derived Column actions in a single Derived Column transformation. Click "Add
Column" to transform more than one column in a single transformation step.
In the Column field, either select an existing column to overwrite with a new derived value, or click "Create New
Column" to generate a new column with the newly derived value.
The Expression text box will open the Expression Builder where you can build the expression for the derived
columns using expression functions.
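
A minimal sketch of derived-column expressions, assuming hypothetical firstName, lastName, and city columns:

fullName:   concat(firstName, ' ', lastName)
cityUpper:  toUpper(city)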

Column patterns
If your column names are variable from your sources, you may wish to build transformations inside of the Derived
Column using Column Patterns instead of using named columns. See the Schema Drift article for more details.
Next steps
Learn more about the Data Factory expression language for transformations and the Expression Builder
Mapping data flow exists transformation
5/6/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

The Exists transformation is a row filtering transformation that stops or allows rows in your data to flow through.
The Exists transform is similar to SQL WHERE EXISTS and SQL WHERE NOT EXISTS . After the Exists transformation, the
resulting rows from your data stream will include either all rows where column values from source 1 exist in
source 2, or all rows where they do not exist in source 2.

Choose the second source for your Exists so that Data Flow can compare values from Stream 1 against Stream 2.
Select the column from Source 1 and from Source 2 whose values you wish to check against for Exists or Not
Exists.

Multiple exists conditions


Next to each row in your column conditions for Exists, you'll find a + sign available when you hover over each
row. This allows you to add multiple rows for Exists conditions. Each additional condition is combined with a logical AND.

Custom expression
You can click "Custom Expression" to instead create a free-form expression as your exists or not-exists condition.
Checking this box will allow you to type in your own expression as a condition.
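
As a sketch, a custom expression might compare more than one column across the two streams; the stream names (SourceOrders, SourceCustomers) and columns used here are hypothetical:

SourceOrders@CustomerId == SourceCustomers@CustomerId && SourceOrders@Region == SourceCustomers@Region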

Next steps
Similar transformations are Lookup and Join.
Azure data factory filter transformation
5/24/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

The Filter transformation provides row filtering. Build an expression that defines the filter condition. Click in the text
box to launch the Expression Builder. Inside the Expression Builder, build a filter expression to control which rows
from the current data stream are allowed to pass through (filter) to the next transformation. Think of the Filter
transformation as the WHERE clause of a SQL statement.

Filter on loan_status column:


in(['Default', 'Charged Off', 'Fully Paid'], loan_status)

Filter on the year column in the Movies demo:

year > 1980

Next steps
Try a column filtering transformation, the Select transformation
Mapping Data Flow Join Transformation
3/27/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Use Join to combine data from two tables in your Data Flow. Click on the transformation that will be the left
relationship and add a Join transformation from the toolbox. Inside the Join transform, you will select another
data stream from your data flow to be the right relationship.

Join types
Selecting Join Type is required for the Join transformation.
Inner Join
Inner join will pass through only rows that match the column conditions from both tables.
Left Outer
All rows from the left stream not meeting the join condition are passed through, and output columns from the
other table are set to NULL in addition to all rows returned by the inner join.
Right Outer
All rows from the right stream not meeting the join condition are passed through, and output columns that
correspond to the other table are set to NULL, in addition to all rows returned by the inner join.
Full Outer
Full Outer produces all columns and rows from both sides with NULL values for columns that are not present in
the other table.
Cross Join
Specify the cross product of the two streams with an expression. You can use this to create custom join conditions.
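
For instance, a cross-join expression can implement a range condition between the two streams; the stream and column names below are hypothetical:

Orders@amount >= Tiers@minAmount && Orders@amount < Tiers@maxAmount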

Specify Join Conditions


The Left Join condition is from the data stream connected to the left of your Join. The Right Join condition is the
second data stream connected to your Join on the bottom, which will either be a direct connector to another
stream or a reference to another stream.
You are required to enter at least one (1..n) join condition. The conditions can be fields that are either referenced directly,
selected from the drop-down menu, or expressions.

Join Performance Optimizations


Unlike Merge Join in tools like SSIS, Join in ADF Data Flow is not a mandatory merge join operation. Therefore,
the join keys do not need to be sorted first. The Join operation will occur in Spark using Databricks based on the
optimal join operation in Spark: Broadcast / Map-side join:

If your dataset can fit into the Databricks worker node memory, we can optimize your Join performance. You can
also specify partitioning of your data on the Join operation to create sets of data that can fit better into memory
per worker.

Self-Join
You can achieve self-join conditions in ADF Data Flow by using the Select transformation to alias an existing
stream. First, create a "New Branch" from a stream, then add a Select to alias the entire original stream.
In the above diagram, the Select transform is at the top. All it's doing is aliasing the original stream to
"OrigSourceBatting". In the highlighted Join transform below it you can see that we use this Select alias stream as
the right-hand join, allowing us to reference the same key in both the Left & Right side of the Inner Join.

Composite and custom keys


You can build custom and composite keys on the fly inside the Join transformation. Add rows for additional join
columns with the plus sign (+) next to each relationship row. Or compute a new key value in the Expression
Builder for an on-the-fly join value.
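
As a sketch, an on-the-fly composite key could be computed in the Expression Builder on each side of the join; orderId and lineNumber are hypothetical columns:

Left:   concat(toString(orderId), '-', toString(lineNumber))
Right:  concat(toString(orderId), '-', toString(lineNumber))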

Next steps
After joining data, you can then create new columns and sink your data to a destination data store.
Azure Data Factory Mapping Data Flow Lookup
Transformation
4/28/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Use Lookup to add reference data from another source to your Data Flow. The Lookup transform requires a
defined source that points to your reference table and matches on key fields.

Select the key fields that you wish to match on between the incoming stream fields and the fields from the
reference source. You must first have created a new source on the Data Flow design canvas to use as the right-side
for the lookup.
When matches are found, the resulting rows and columns from the reference source will be added to your data
flow. You can choose which fields of interest that you wish to include in your Sink at the end of your Data Flow.

Match / No match
After your Lookup transformation, you can use subsequent transformations to inspect the results of each matched
row by using the expression function isMatch() to make further choices in your logic based on whether or not the
Lookup resulted in a row match.
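
For example, a Derived Column placed after the Lookup could tag each row with a hypothetical matchStatus field:

matchStatus:  iif(isMatch(), 'matched', 'unmatched')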

Optimizations
In Data Factory, Data Flows execute in scaled-out Spark environments. If your dataset can fit into worker node
memory space, we can optimize your Lookup performance.
Broadcast join
Select Left and/or Right side broadcast join to request ADF to push the entire dataset from either side of the
Lookup relationship into memory.
Data partitioning
You can also specify partitioning of your data by selecting "Set Partitioning" on the Optimize tab of the Lookup
transformation to create sets of data that can fit better into memory per worker.

Next steps
Join and Exists transformations perform similar tasks in ADF Mapping Data Flows. Take a look at those
transformations next.
Azure Data Factory Mapping Data Flow New Branch
Transformation
2/22/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Branching will take the current data stream in your data flow and replicate it to another stream. Use New Branch to
perform multiple sets of operations and transformations against the same data stream.
Example: Your data flow has a Source Transform with a selected set of columns and data type conversions. You
then place a Derived Column immediately following that Source. In the Derived Column, you've created a new field
that combines first name and last name to make a new "full name" field.
You can treat that new stream with a set of transformations and a sink on one branch and use New Branch to create a
copy of that stream where you can transform that same data with a different set of transformations. By
transforming that copied data in a separate branch, you can subsequently sink that data to a separate location.

NOTE
"New Branch" will only show as an action on the + Transformation menu when there is a subsequent transformation
following the current location where you are attempting to branch. i.e. You will not see a "New Branch" option at the end here
until you add another transformation after the Select
Azure data factory pivot transformation
4/10/2019 • 3 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Use Pivot in ADF Data Flow as an aggregation where one or more grouping columns has its distinct row values
transformed into individual columns. Essentially, you can Pivot row values into new columns (turn data into
metadata).
Group by

First, set the columns that you wish to group by for your pivot aggregation. You can set more than one column here
with the + sign next to the column list.

Pivot key

The Pivot Key is the column that ADF will pivot from row to column. By default, each unique value in the dataset
for this field will pivot to a column. However, you can optionally enter the values from the dataset that you wish to
pivot to column values. This is the column that will determine the new columns that will be created.

Pivoted columns
Lastly, you will choose the aggregation that you wish to use for the pivoted values and how you would like the
columns to be displayed in the new output projection from the transformation.
(Optional) You can set a naming pattern with a prefix, middle, and suffix to be added to each new column name
from the row values.
For instance, pivoting "Sales" by "Region" would result in new column values from each sales value, i.e. "25", "50",
"1000", etc. However, if you set a prefix value of "Sales-", each column value would add "Sales-" to the beginning of
the value.

Setting the Column Arrangement to "Normal" will group together all of the pivoted columns with their aggregated
values. Setting the columns arrangement to "Lateral" will alternate between column and value.
Aggregation
To set the aggregation you wish to use for the pivot values, click on the field at the bottom of the Pivoted Columns
pane. You will enter into the ADF Data Flow expression builder where you can build an aggregation expression and
provide a descriptive alias name for your new aggregated values.
Use the ADF Data Flow Expression Language to describe the pivoted column transformations in the Expression
Builder: https://aka.ms/dataflowexpressions.
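
A minimal sketch of an aggregation expression for the pivoted values, assuming the numeric "Sales" column from the example above:

TotalSales:  sum(toInteger(Sales))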

Pivot metadata
The Pivot transformation will produce new column names that are dynamic based on your incoming data. The
Pivot Key produces the values for each new column name. If you do not specify individual values and wish to
create dynamic column names for each unique value in your Pivot Key, then the UI will not display the metadata in
Inspect and there will be no column propagation to the Sink transformation. If you set values for the Pivot Key,
then ADF can determine the new column names and those column names will be available to you in the Inspect
and Sink mapping.
Landing new columns in Sink
Even with dynamic column names in Pivot, you can still sink your new column names and values into your
destination store. Just set "Allow Schema Drift" to on in your Sink settings. You will not see the new dynamic
names in your column metadata, but the schema drift option will allow you to land the data.
View metadata in design mode
If you wish to view the new column names as metadata in Inspect and you wish to see the columns propagate
explicitly to the Sink transformation, then set explicit Values in the Pivot Key tab.
How to rejoin original fields
The Pivot transformation will only project the columns used in the aggregation, grouping, and pivot action. If you
wish to include the other columns from the previous step in your flow, use a New Branch from the previous step
and use the self-join pattern to connect the flow with the original metadata.

Next steps
Try the unpivot transformation to turn column values into row values.
Azure Data Factory Mapping Data Flow Select
Transformation
2/22/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Use this transformation for column selectivity (reducing number of columns) or to alias columns and stream
names.
The Select transform allows you to alias an entire stream, or columns in that stream, assign different names
(aliases) and then reference those new names later in your data flow. This transform is useful for self-join
scenarios. The way to implement a self-join in ADF Data Flow is to take a stream, branch it with "New Branch",
then immediately afterward, add a "Select" transform. That stream will now have a new name that you can use to
join back to the original stream, creating a self-join:

In the above diagram, the Select transform is at the top. This is aliasing the original stream to "OrigSourceBatting".
In the highlighted Join transform below it, you can see that we use this Select alias stream as the right-hand join,
allowing us to reference the same key in both the Left & Right side of the Inner Join.
Select can also be used as a way to deselect columns from your data flow. For example, if you have six columns
defined in your sink, but you only wish to pick a specific three to transform and then flow to the sink, you can select
just those three by using the Select transform.

NOTE
You must switch off "Select All" to pick only specific columns

Options
The default setting for "Select" is to include all incoming columns and keep those original names. You can alias the
stream by setting the name of the Select transform.
To alias individual columns, deselect "Select All" and use the column mapping at the bottom.
Sink transformation for a data flow
5/13/2019 • 3 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

After you transform your data flow, you can sink the data into a destination dataset. In the sink transformation,
choose a dataset definition for the destination output data. You can have as many sink transformations as your
data flow requires.
To account for schema drift and changes in incoming data, sink the output data to a folder without a defined
schema in the output dataset. You can also account for column changes in your sources by selecting Allow
schema drift in the source. Then automap all fields in the sink.

To sink all incoming fields, turn on Auto Map. To choose the fields to sink to the destination, or to change the
names of the fields at the destination, turn off Auto Map. Then open the Mapping tab to map output fields.
Output
For Azure Blob storage or Data Lake Storage sink types, output the transformed data into a folder. Spark
generates partitioned output data files based on the partitioning scheme that the sink transformation uses.
You can set the partitioning scheme from the Optimize tab. If you want Data Factory to merge your output into
a single file, select Single partition.

Field mapping
On the Mapping tab of your sink transformation, you can map the incoming columns on the left to the
destinations on the right. When you sink data flows to files, Data Factory will always write new files to a folder.
When you map to a database dataset, you can generate a new table that uses this schema by setting Save
Policy to Overwrite. Or insert new rows in an existing table and then map the fields to the existing schema.

In the mapping table, you can multiselect to link multiple columns, delink multiple columns, or map multiple
rows to the same column name.
To always map the incoming set of fields to a target as they are and to fully accept flexible schema definitions,
select Allow schema drift.
To reset your column mappings, select Re-map.

Select Validate schema to fail the sink if the schema changes.


Select Clear the folder to truncate the contents of the sink folder before writing the destination files in that
target folder.

File name options


Set up file naming:
Default: Allow Spark to name files based on PART defaults.
Pattern: Enter a pattern for your output files. For example, loans[n] will create loans1.csv, loans2.csv, and so
on.
Per partition: Enter one file name per partition.
As data in column: Set the output file to the value of a column.
Output to a single file: With this option, ADF will combine the partitioned output files into a single named
file. To use this option, your dataset should resolve to a folder name. Also, be aware that this merge
operation can fail depending on node size.
NOTE
File operations start only when you're running the Execute Data Flow activity. They don't start in Data Flow Debug mode.

Database options
Choose database settings:
Update method: The default is to allow inserts. Clear Allow insert if you want to stop inserting new rows
from your source. To update, upsert, or delete rows, first add an alter-row transformation to tag rows for
those actions.
Recreate table: Drop or create your target table before the data flow finishes.
Truncate table: Remove all rows from your target table before the data flow finishes.
Batch size: Enter a number to bucket writes into chunks. Use this option for large data loads.
Enable staging: Use PolyBase when you load Azure SQL Data Warehouse as your sink dataset.

NOTE
In Data Flow, you can direct Data Factory to create a new table definition in your target database. To create the table
definition, set a dataset in the sink transformation that has a new table name. In the SQL dataset, below the table name,
select Edit and enter a new table name. Then, in the sink transformation, turn on Allow schema drift. Set Import
schema to None.
NOTE
When you update or delete rows in your database sink, you must set the key column. This setting allows the alter-row
transformation to determine the unique row for the data manipulation language (DML) operation.

Next steps
Now that you've created your data flow, add a Data Flow activity to your pipeline.
Azure Data Factory Data Flow Sort Transformations
3/13/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

The Sort transformation allows you to sort the incoming rows on the current data stream. The outgoing rows from
the Sort transformation will subsequently follow the ordering rules that you set. You can choose individual
columns and sort them ascending (ASC) or descending (DESC), using the arrow indicator next to each field. If you need to modify the column
before applying the sort, click "Computed Columns" to launch the expression editor. This provides an
opportunity to build an expression for the sort operation instead of simply applying a column for the sort.
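
For example, a computed column might normalize casing before sorting; lastName is a hypothetical column:

toUpper(lastName)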

Case insensitive
You can turn on "Case insensitive" if you wish to ignore case when sorting string or text fields.
"Sort Only Within Partitions" leverages Spark data partitioning. By sorting incoming data only within each
partition, Data Flows can sort partitioned data instead of sorting the entire data stream.
Each of the sort conditions in the Sort Transformation can be rearranged. So if you need to move a column higher
in the sort precedence, grab that row with your mouse and move it higher or lower in the sorting list.
Partitioning effects on Sort
ADF Data Flow is executed on big data Spark clusters with data distributed across multiple nodes and partitions. It
is important to keep this in mind when architecting your data flow if you are depending on the Sort transform to
keep data in that same order. If you choose to repartition your data in a subsequent transformation, you may lose
your sorting due to that reshuffling of data.

Next steps
After sorting, you may want to use the Aggregate Transformation
Source transformation for Mapping Data Flow
5/24/2019 • 4 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

A source transformation configures your data source for the data flow. A data flow can include more than one
source transformation. When designing data flows, always begin with a source transformation.
Every data flow requires at least one source transformation. Add as many sources as necessary to complete your
data transformations. You can join those sources together with a join transformation or a union transformation.

NOTE
When you debug your data flow, data is read from the source by using the sampling setting or the debug source limits. To
write data to a sink, you must run your data flow from a pipeline Data Flow activity.

Associate your Data Flow source transformation with exactly one Data Factory dataset. The dataset defines the
shape and location of the data you want to write to or read from. You can use wildcards and file lists in your
source to work with more than one file at a time.

Data Flow staging areas


Data Flow works with staging datasets that are all in Azure. Use these datasets for staging when you're
transforming your data.
Data Factory has access to nearly 80 native connectors. To include data from those other sources in your data
flow, use the Copy Activity tool to stage that data in one of the Data Flow dataset staging areas.

Options
Choose schema and sampling options for your data.
Allow schema drift
Select Allow schema drift if the source columns will change often. This setting allows all incoming source fields
to flow through the transformations to the sink.
Validate schema
If the incoming version of the source data doesn't match the defined schema, the data flow will fail to run.

Sample the data


Enable Sampling to limit the number of rows from your source. Use this setting when you test or sample data
from your source for debugging purposes.

Define schema
When your source files aren't strongly typed (for example, flat files rather than Parquet files), define the data
types for each field here in the source transformation.

You can later change the column names in a select transformation. Use a derived-column transformation to
change the data types. For strongly typed sources, you can modify the data types in a later select transformation.
Optimize the source transformation
On the Optimize tab for the source transformation, you might see a Source partition type. This option is
available only when your source is Azure SQL Database. This is because Data Factory tries to parallelize connections
to run large queries against your SQL Database source.

You don't have to partition data on your SQL Database source, but partitions are useful for large queries. You
can base your partition on a column or a query.
Use a column to partition data
From your source table, select a column to partition on. Also set the maximum number of connections.
Use a query to partition data
You can choose to partition the connections based on a query. Simply enter the contents of a WHERE predicate.
For example, enter year > 1980.

Source file management


Choose settings to manage files in your source.
Wildcard path: From your source folder, choose a series of files that match a pattern. This setting overrides
any file in your dataset definition.
List of files: This is a file set. Create a text file that includes a list of relative path files to process. Point to this
text file.
Column to store file name: Store the name of the source file in a column in your data. Enter a new name
here to store the file name string.
After completion: Choose to do nothing with the source file after the data flow runs, delete the source file,
or move the source file. The paths for the move are relative.
SQL datasets
If your source is in SQL Database or SQL Data Warehouse, you have additional options for source file
management.
Query: Enter a SQL query for your source. This setting overrides any table that you've chosen in the dataset.
Note that Order By clauses aren't supported here, but you can set a full SELECT FROM statement. You can
also use user-defined table functions. select * from udfGetData() is a UDF in SQL that returns a table. This
query will produce a source table that you can use in your data flow.
Batch size: Enter a batch size to chunk large data into reads.

NOTE
File operations run only when you start the data flow from a pipeline run (a pipeline debug or execution run) that uses the
Execute Data Flow activity in a pipeline. File operations do not run in Data Flow debug mode.

Projection
Like schemas in datasets, the projection in a source defines the data columns, types, and formats from the source
data.
If your text file has no defined schema, select Detect data type so that Data Factory will sample and infer the
data types. Select Define default format to autodetect the default data formats.
You can modify the column data types in a later derived-column transformation. Use a select transformation to
modify the column names.

Next steps
Begin building a derived-column transformation and a select transformation.
Mapping Data Flow Surrogate Key Transformation
4/17/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Use the Surrogate Key transformation to add an incrementing, non-business, arbitrary key value to your data flow
rowset. This is useful when designing dimension tables in a star-schema analytical data model, where each member
in your dimension tables needs a unique non-business key, as part of the Kimball DW methodology.

"Key Column" is the name that you will give to your new surrogate key column.
"Start Value" is the beginning point of the incremental value.

Increment keys from existing sources


If you'd like to start your sequence from a value that exists in a Source, you can use a Derived Column
transformation immediately following your Surrogate Key transformation and add the two values together:
To seed the key value with the previous max, there are two techniques that you can use:
Database sources
Use the "Query" option to select MAX() from your source using the Source transformation:

File sources
If your previous max value is in a file, you can use your Source transformation together with an Aggregate
transformation and use the MAX() expression function to get the previous max value:
In both cases, you must Join your incoming new data together with your source that contains the previous max
value:
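
A hedged sketch of this seeding pattern, using hypothetical table, column, and key names:

Source transformation query (database source):
    SELECT MAX(CustomerKey) AS previousMaxKey FROM dbo.DimCustomer

Derived Column expression after joining previousMaxKey to the incoming rows:
    surrogateKey + previousMaxKey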

Next steps
These examples use the Join and Derived Column transformations.
Mapping data flow union transformation
3/12/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Union will combine multiple data streams into one, with the SQL Union of those streams as the new output from
the Union transformation. All of the schemas from each input stream will be combined inside of your data flow,
without needing a join key.
You can combine any number of streams in the settings table by selecting the "+" icon next to each configured row,
including both source data and streams from existing transformations in your data flow.

In this case, you can combine disparate metadata from multiple sources (in this example, three different source
files) and combine them into a single stream:

To achieve this, add additional rows in the Union Settings by including all sources you wish to add. There is no need
for a common lookup or join key.
If you add a Select transformation after your Union, you will be able to rename overlapping fields or fields that were
not named from headerless sources. Click "Inspect" to see the combined metadata, with 132 total columns in this
example from three different sources:

Name and position


When you choose "union by name", each column value will drop into the corresponding column from each source,
with a new concatenated metadata schema.
If you choose "union by position", each column value will drop into the original position from each corresponding
source, resulting in a new combined stream of data where the data from each source is added to the same stream:
Next steps
Explore similar transformations including Join and Exists.
Azure Data Factory Unpivot Transformation
3/13/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Use Unpivot in ADF Mapping Data Flow as a way to turn an unnormalized dataset into a more normalized version
by expanding values from multiple columns in a single record into multiple records with the same values in a
single column.

Ungroup By

First, set the columns that you wish to ungroup by for your unpivot aggregation. Set one or more columns for
ungrouping with the + sign next to the column list.

Unpivot Key
The Unpivot Key is the column that ADF will use to unpivot values from columns into rows. By default, each unique value in the dataset
for this field will be unpivoted. However, you can optionally enter the values from the dataset that you wish to
unpivot.

Unpivoted Columns

Lastly, choose the aggregation that you wish to use for the unpivoted values and how you would like the columns to
be displayed in the new output projection from the transformation.
(Optional) You can set a naming pattern with a prefix, middle, and suffix to be added to each new column name
from the row values.
For instance, pivoting "Sales" by "Region" would simply give you new column values from each sales value. For
example: "25", "50", "1000", ... However, if you set a prefix value of "Sales", then "Sales" will be prefixed to the
values.
Setting the Column Arrangement to "Normal" will group together all of the pivoted columns with their aggregated
values. Setting the columns arrangement to "Lateral" will alternate between column and value.

The final unpivoted data result set shows the column totals now unpivoted into separate row values.

Next steps
Use the Pivot transformation to pivot rows to columns.
Azure Data Factory Window Transformation
3/13/2019 • 2 minutes to read • Edit Online

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

The Window transformation is where you will define window-based aggregations of columns in your data streams.
In the Expression Builder, you can define different types of aggregations based on data or time windows
(the SQL OVER clause), such as LEAD, LAG, NTILE, CUMEDIST, and RANK. A new field will be generated in your
output that includes these aggregations. You can also include optional group-by fields.
Over
Set the partitioning of column data for your window transformation. The SQL equivalent is the Partition By portion of
the Over clause in SQL. If you wish to create a calculation or an expression to use for the partitioning, you
can do so by hovering over the column name and selecting "computed column".

Sort
Another part of the Over clause is setting the Order By . This sets the data sort ordering. You can also create an
expression for a calculated value in this column field for sorting.

Range By
Next, set the window frame as Unbounded or Bounded. To set an unbounded window frame, set the slider to
Unbounded on both ends. If you choose a setting between Unbounded and Current Row, then you must set the
Offset start and end values. Both values will be positive integers. You can use either relative numbers or values
from your data.
The window slider has two values to set: the values before the current row and the values after the current row. The
Start and End offset matches the two selectors on the slider.
Window columns
Lastly, use the Expression Builder to define the aggregations you wish to use with the data windows such as RANK,
COUNT, MIN, MAX, DENSE RANK, LEAD, LAG, etc.
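
As a sketch, window columns defined in the Expression Builder might look like the following; SalesRank, PreviousSales, and the Sales column are hypothetical:

SalesRank:      denseRank()
PreviousSales:  lag(Sales, 1)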

The full list of aggregation and analytical functions available for you to use in the ADF Data Flow Expression
Language via the Expression Builder are listed here: https://aka.ms/dataflowexpressions.

Next steps
If you are looking for a simple group-by aggregation, use the Aggregate transformation
Parameterize linked services in Azure Data Factory
3/7/2019 • 2 minutes to read • Edit Online

You can now parameterize a linked service and pass dynamic values at run time. For example, if you want to
connect to different databases on the same Azure SQL Database server, you can now parameterize the database
name in the linked service definition. This prevents you from having to create a linked service for each database on
the Azure SQL database server. You can parameterize other properties in the linked service definition as well - for
example, User name.
You can use the Data Factory UI in the Azure Portal or a programming interface to parameterize linked services.

TIP
We recommend that you not parameterize passwords or secrets. Store all connection strings in Azure Key Vault instead, and
parameterize the Secret Name.

For a seven-minute introduction and demonstration of this feature, watch the following video:

Supported data stores


At this time, linked service parameterization is supported in the Data Factory UI in the Azure portal for the
following data stores. For all other data stores, you can parameterize the linked service by selecting the Code icon
on the Connections tab and using the JSON editor.
Azure SQL Database
Azure SQL Data Warehouse
SQL Server
Oracle
Cosmos DB
Amazon Redshift
MySQL
Azure Database for MySQL

Data Factory UI
JSON
{
    "name": "AzureSqlDatabase",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": {
                "value": "Server=tcp:myserver.database.windows.net,1433;Database=@{linkedService().DBName};User ID=user;Password=fake;Trusted_Connection=False;Encrypt=True;Connection Timeout=30",
                "type": "SecureString"
            }
        },
        "connectVia": null,
        "parameters": {
            "DBName": {
                "type": "String"
            }
        }
    }
}
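
As a usage sketch (the dataset name, table name, and parameter value below are hypothetical), a dataset that references this linked service supplies the DBName value through the linked service reference:

{
    "name": "SqlTableDataset",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "AzureSqlDatabase",
            "type": "LinkedServiceReference",
            "parameters": {
                "DBName": "MyDatabase"
            }
        },
        "typeProperties": {
            "tableName": "dbo.MyTable"
        }
    }
}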
Expressions and functions in Azure Data Factory
4/26/2019 • 22 minutes to read • Edit Online

This article provides details about expressions and functions supported by Azure Data Factory.

Introduction
JSON values in the definition can be literal or expressions that are evaluated at runtime. For example:

"name": "value"

(or)

"name": "@pipeline().parameters.password"

Expressions
Expressions can appear anywhere in a JSON string value and always result in another JSON value. If a JSON
value is an expression, the body of the expression is extracted by removing the at-sign (@). If a literal string is
needed that starts with @, it must be escaped by using @@. The following examples show how expressions are
evaluated.

JSON VALUE RESULT

"parameters" The characters 'parameters' are returned.

"parameters[1]" The characters 'parameters[1]' are returned.

"@@" A 1 character string that contains '@' is returned.

" @" A 2 character string that contains ' @' is returned.

Expressions can also appear inside strings, using a feature called string interpolation where expressions are
wrapped in @{ ... } . For example:
"name" : "First Name: @{pipeline().parameters.firstName} Last Name: @{pipeline().parameters.lastName}"

Using string interpolation, the result is always a string. Say I have defined myNumber as 42 and myString as foo
:

JSON VALUE RESULT

"@pipeline().parameters.myString" Returns foo as a string.

"@{pipeline().parameters.myString}" Returns foo as a string.

"@pipeline().parameters.myNumber" Returns 42 as a number.

"@{pipeline().parameters.myNumber}" Returns 42 as a string.


"Answer is: @{pipeline().parameters.myNumber}" Returns the string Answer is: 42 .

"@concat('Answer is: ', string(pipeline().parameters.myNumber))" Returns the string Answer is: 42 .

"Answer is: @@{pipeline().parameters.myNumber}" Returns the string Answer is: @{pipeline().parameters.myNumber} .

Examples
A dataset with a parameter
In the following example, the BlobDataset takes a parameter named path. Its value is used to set a value for the
folderPath property by using the expression: dataset().path .

{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "@dataset().path"
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}

A pipeline with a parameter


In the following example, the pipeline takes inputPath and outputPath parameters. The path for the
parameterized blob dataset is set by using values of these parameters. The syntax used here is:
pipeline().parameters.parametername .
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
}
}
}
}

Functions
You can call functions within expressions. The following sections provide information about the functions that can
be used in an expression.

String functions
The following functions only apply to strings. You can also use a number of the collection functions on strings.

FUNCTION NAME DESCRIPTION



concat Combines any number of strings together. For example, if


parameter1 is foo, the following expression would return
somevalue-foo-somevalue :
concat('somevalue-',pipeline().parameters.parameter1,'-somevalue')

Parameter number: 1 ... n

Name: String n

Description: Required. The strings to combine into a single


string.

substring Returns a subset of characters from a string. For example, the


following expression:

substring('somevalue-foo-somevalue',10,3)

returns:

foo

Parameter number: 1

Name: String

Description: Required. The string from which the substring is


taken.

Parameter number: 2

Name: Start index

Description: Required. The index of where the substring


begins in parameter 1.

Parameter number: 3

Name: Length

Description: Required. The length of the substring.



replace Replaces a string with a given string. For example, the


expression:

replace('the old string', 'old', 'new')

returns:

the new string

Parameter number: 1

Name: string

Description: Required. If parameter 2 is found in parameter


1, the string that is searched for parameter 2 and updated
with parameter 3.

Parameter number: 2

Name: Old string

Description: Required. The string to replace with parameter 3


when a match is found in parameter 1

Parameter number: 3

Name: New string

Description: Required. The string that is used to replace the


string in parameter 2 when a match is found in parameter 1.

guid Generates a globally unique string (aka. guid). For example,


the following output could be generated
c2ecc88d-88c8-4096-912c-d6f2e2b138ce :

guid()

Parameter number: 1

Name: Format

Description: Optional. A single format specifier that indicates


how to format the value of this Guid. The format parameter
can be "N", "D", "B", "P", or "X". If format is not provided, "D" is
used.

toLower Converts a string to lowercase. For example, the following


returns two by two is four :
toLower('Two by Two is Four')

Parameter number: 1

Name: String

Description: Required. The string to convert to lower casing.


If a character in the string does not have a lowercase
equivalent, it is included unchanged in the returned string.

toUpper Converts a string to uppercase. For example, the following


expression returns TWO BY TWO IS FOUR :
toUpper('Two by Two is Four')

Parameter number: 1

Name: String

Description: Required. The string to convert to upper casing.


If a character in the string does not have an uppercase
equivalent, it is included unchanged in the returned string.

indexof Find the index of a value within a string case insensitively. For
example, the following expression returns 7 :
indexof('hello, world.', 'world')

Parameter number: 1

Name: String

Description: Required. The string that may contain the value.

Parameter number: 2

Name: String

Description: Required. The value to search the index of.

lastindexof Find the last index of a value within a string case insensitively.
For example, the following expression returns 3 :
lastindexof('foofoo', 'foo')

Parameter number: 1

Name: String

Description: Required. The string that may contain the value.

Parameter number: 2

Name: String

Description: Required. The value to search the index of.

startswith Checks if the string starts with a value case insensitively. For
example, the following expression returns true :
startswith('hello, world', 'hello')

Parameter number: 1

Name: String

Description: Required. The string that may contain the value.

Parameter number: 2

Name: String

Description: Required. The value the string may start with.



endswith Checks if the string ends with a value case insensitively. For
example, the following expression returns true :
endswith('hello, world', 'world')

Parameter number: 1

Name: String

Description: Required. The string that may contain the value.

Parameter number: 2

Name: String

Description: Required. The value the string may end with.

split Splits the string using a separator. For example, the following
expression returns ["a", "b", "c"] : split('a;b;c',';')

Parameter number: 1

Name: String

Description: Required. The string that is split.

Parameter number: 2

Name: String

Description: Required. The separator.

Collection functions
These functions operate over collections such as arrays, strings, and sometimes dictionaries.

FUNCTION NAME DESCRIPTION

contains Returns true if dictionary contains a key, list contains value, or


string contains substring. For example, the following
expression returns true: contains('abacaba','aca')

Parameter number: 1

Name: Within collection

Description: Required. The collection to search within.

Parameter number: 2

Name: Find object

Description: Required. The object to find inside the Within


collection.

length Returns the number of elements in an array or string. For


example, the following expression returns 3 :
length('abc')

Parameter number: 1

Name: Collection

Description: Required. The collection to get the length of.

empty Returns true if object, array, or string is empty. For example,


the following expression returns true :

empty('')

Parameter number: 1

Name: Collection

Description: Required. The collection to check if it is empty.

intersection Returns a single array or object with the common elements


between the arrays or objects passed to it. For example, this
function returns [1, 2] :

intersection([1, 2, 3], [101, 2, 1, 10], [6, 8, 1, 2])

The parameters for the function can either be a set of objects


or a set of arrays (not a mixture thereof). If there are two
objects with the same name, the last object with that name
appears in the final object.

Parameter number: 1 ... n

Name: Collection n

Description: Required. The collections to evaluate. An object


must be in all collections passed in to appear in the result.

union Returns a single array or object with all of the elements that
are in either array or object passed to it. For example, this
function returns [1, 2, 3, 10, 101]:

union([1, 2, 3], [101, 2, 1, 10])

The parameters for the function can either be a set of objects


or a set of arrays (not a mixture thereof). If there are two
objects with the same name in the final output, the last object
with that name appears in the final object.

Parameter number: 1 ... n

Name: Collection n

Description: Required. The collections to evaluate. An object


that appears in any of the collections appears in the result.

first Returns the first element in the array or string passed in. For
example, this function returns 0 :

first([0,2,3])

Parameter number: 1

Name: Collection

Description: Required. The collection to take the first object


from.

last Returns the last element in the array or string passed in. For
example, this function returns 3 :

last('0123')

Parameter number: 1

Name: Collection

Description: Required. The collection to take the last object


from.

take Returns the first Count elements from the array or string
passed in, for example this function returns [1, 2] :
take([1, 2, 3, 4], 2)

Parameter number: 1

Name: Collection

Description: Required. The collection to take the first Count


objects from.

Parameter number: 2

Name: Count

Description: Required. The number of objects to take from


the Collection. Must be a positive integer.

skip Returns the elements in the array starting at index Count, for
example this function returns [3, 4] :

skip([1, 2 ,3 ,4], 2)

Parameter number: 1

Name: Collection

Description: Required. The collection to skip the first Count


objects from.

Parameter number: 2

Name: Count

Description: Required. The number of objects to remove


from the front of Collection. Must be a positive integer.

Logical functions
These functions are useful inside conditions; they can be used to evaluate any type of logic.

FUNCTION NAME DESCRIPTION

equals Returns true if two values are equal. For example, if


parameter1 is foo, the following expression would return
true :
equals(pipeline().parameters.parameter1, 'foo')

Parameter number: 1

Name: Object 1

Description: Required. The object to compare to Object 2.

Parameter number: 2

Name: Object 2

Description: Required. The object to compare to Object 1.



less Returns true if the first argument is less than the second.
Note, values can only be of type integer, float, or string. For
example, the following expression returns true :
less(10,100)

Parameter number: 1

Name: Object 1

Description: Required. The object to check if it is less than


Object 2.

Parameter number: 2

Name: Object 2

Description: Required. The object to check if it is greater than


Object 1.

lessOrEquals Returns true if the first argument is less than or equal to the
second. Note, values can only be of type integer, float, or
string. For example, the following expression returns true :
lessOrEquals(10,10)

Parameter number: 1

Name: Object 1

Description: Required. The object to check if it is less or equal


to Object 2.

Parameter number: 2

Name: Object 2

Description: Required. The object to check if it is greater than


or equal to Object 1.

greater Returns true if the first argument is greater than the second.
Note, values can only be of type integer, float, or string. For
example, the following expression returns false :
greater(10,10)

Parameter number: 1

Name: Object 1

Description: Required. The object to check if it is greater than


Object 2.

Parameter number: 2

Name: Object 2

Description: Required. The object to check if it is less than


Object 1.

greaterOrEquals Returns true if the first argument is greater than or equal to


the second. Note, values can only be of type integer, float, or
string. For example, the following expression returns false :
greaterOrEquals(10,100)

Parameter number: 1

Name: Object 1

Description: Required. The object to check if it is greater than


or equal to Object 2.

Parameter number: 2

Name: Object 2

Description: Required. The object to check if it is less than or


equal to Object 1.

and Returns true if both of the parameters are true. Both


arguments need to be Booleans. The following returns
false : and(greater(1,10),equals(0,0))

Parameter number: 1

Name: Boolean 1

Description: Required. The first argument that must be


true .

Parameter number: 2

Name: Boolean 2

Description: Required. The second argument must be true .

or Returns true if either of the parameters are true. Both


arguments need to be Booleans. The following returns true :
or(greater(1,10),equals(0,0))

Parameter number: 1

Name: Boolean 1

Description: Required. The first argument that may be true


.

Parameter number: 2

Name: Boolean 2

Description: Required. The second argument may be true .



not Returns true if the parameter is false . Both arguments


need to be Booleans. The following returns true :
not(contains('200 Success','Fail'))

Parameter number: 1

Name: Boolean

Description: Required. The Boolean value to negate.

if Returns a specified value based on if the expression provided


results in true or false . For example, the following
returns "yes" : if(equals(1, 1), 'yes', 'no')

Parameter number: 1

Name: Expression

Description: Required. A boolean value that determines


which value is returned by the expression.

Parameter number: 2

Name: True

Description: Required. The value to return if the expression is


true .

Parameter number: 3

Name: False

Description: Required. The value to return if the expression is


false .

Conversion functions
These functions are used to convert between each of the native types in the language:
string
integer
float
boolean
arrays
dictionaries

FUNCTION NAME DESCRIPTION



int Convert the parameter to an integer. For example, the


following expression returns 100 as a number, rather than a
string: int('100')

Parameter number: 1

Name: Value

Description: Required. The value that is converted to an


integer.

string Convert the parameter to a string. For example, the following


expression returns '10' : string(10) You can also convert
an object to a string, for example if the foo parameter is an
object with one property bar : baz , then the following
would return {"bar" : "baz"}
string(pipeline().parameters.foo)

Parameter number: 1

Name: Value

Description: Required. The value that is converted to a


string.

json Convert the parameter to a JSON type value. It is the


opposite of string(). For example, the following expression
returns [1,2,3] as an array, rather than a string:

json('[1,2,3]')

Likewise, you can convert a string to an object. For example,


json('{"bar" : "baz"}') returns:

{ "bar" : "baz" }

Parameter number: 1

Name: String

Description: Required. The string that is converted to a


native type value.

The json function supports xml input as well. For example, the
parameter value of:

<?xml version="1.0"?> <root> <person id='1'>


<name>Alan</name> <occupation>Engineer</occupation>
</person> </root>

is converted to the following json:

{ "?xml": { "@version": "1.0" }, "root": { "person":


[ { "@id": "1", "name": "Alan", "occupation":
"Engineer" } ] } }

float Convert the parameter argument to a floating-point number.


For example, the following expression returns 10.333 :
float('10.333')

Parameter number: 1

Name: Value

Description: Required. The value that is converted to a


floating-point number.

bool Convert the parameter to a Boolean. For example, the


following expression returns false : bool(0)

Parameter number: 1

Name: Value

Description: Required. The value that is converted to a


boolean.

coalesce Returns the first non-null object in the arguments passed in.
Note: an empty string is not null. For example, if parameters 1
and 2 are not defined, this returns fallback :
coalesce(pipeline().parameters.parameter1, pipeline().parameters.parameter2, 'fallback')

Parameter number: 1 ... n

Name: Objectn

Description: Required. The objects to check for null .

base64 Returns the base64 representation of the input string. For


example, the following expression returns
c29tZSBzdHJpbmc= : base64('some string')

Parameter number: 1

Name: String 1

Description: Required. The string to encode into base64


representation.

base64ToBinary Returns a binary representation of a base64 encoded string.


For example, the following expression returns the binary
representation of some string:
base64ToBinary('c29tZSBzdHJpbmc=') .

Parameter number: 1

Name: String

Description: Required. The base64 encoded string.



base64ToString Returns a string representation of a based64 encoded string.


For example, the following expression returns some string:
base64ToString('c29tZSBzdHJpbmc=') .

Parameter number: 1

Name: String

Description: Required. The base64 encoded string.

binary Returns a binary representation of a value. For example, the following expression returns a binary representation of some
string: binary('some string').

Parameter number: 1

Name: Value

Description: Required. The value that is converted to binary.

dataUriToBinary Returns a binary representation of a data URI. For example,


the following expression returns the binary representation of
some string:
dataUriToBinary('data:;base64,c29tZSBzdHJpbmc=')

Parameter number: 1

Name: String

Description: Required. The data URI to convert to binary


representation.

dataUriToString Returns a string representation of a data URI. For example,


the following expression returns some string:
dataUriToString('data:;base64,c29tZSBzdHJpbmc=')

Parameter number: 1

Name: String

Description: Required. The data URI to convert to String


representation.

dataUri Returns a data URI of a value. For example, the following


expression returns data:
text/plain;charset=utf8;base64,c29tZSBzdHJpbmc=:
dataUri('some string')

Parameter number: 1

Name: Value

Description: Required. The value to convert to data URI.



decodeBase64 Returns a string representation of an input based64 string.


For example, the following expression returns some string :
decodeBase64('c29tZSBzdHJpbmc=')

Parameter number: 1

Name: String

Description: Returns a string representation of an input


based64 string.

encodeUriComponent URL-escapes the string that's passed in. For example, the
following expression returns You+Are%3ACool%2FAwesome :
encodeUriComponent('You Are:Cool/Awesome')

Parameter number: 1

Name: String

Description: Required. The string to escape URL-unsafe


characters from.

decodeUriComponent Un-URL-escapes the string that's passed in. For example, the
following expression returns You Are:Cool/Awesome :
decodeUriComponent('You+Are%3ACool%2FAwesome')

Parameter number: 1

Name: String

Description: Required. The string to decode the URL-unsafe


characters from.

decodeDataUri Returns a binary representation of an input data URI string.


For example, the following expression returns the binary
representation of some string :
decodeDataUri('data:;base64,c29tZSBzdHJpbmc=')

Parameter number: 1

Name: String

Description: Required. The dataURI to decode into a binary


representation.

uriComponent Returns a URI encoded representation of a value. For


example, the following expression returns
You+Are%3ACool%2FAwesome: uriComponent('You
Are:Cool/Awesome ')

Parameter Details: Number: 1, Name: String, Description:


Required. The string to be URI encoded.

uriComponentToBinary Returns a binary representation of a URI encoded string. For


example, the following expression returns a binary
representation of You Are:Cool/Awesome :
uriComponentToBinary('You+Are%3ACool%2FAwesome')

Parameter number: 1

Name: String

Description: Required. The URI encoded string.

uriComponentToString Returns a string representation of a URI encoded string. For


example, the following expression returns
You Are:Cool/Awesome :
uriComponentToString('You+Are%3ACool%2FAwesome')

Parameter number: 1

Name: String

Description: Required. The URI encoded string.

xml Return an xml representation of the value. For example, the


following expression returns an xml content represented by
'<name>Alan</name>' : xml('<name>Alan</name>') .
The xml function supports JSON object input as well. For
example, the parameter { "abc": "xyz" } is converted to
an xml content <abc>xyz</abc>

Parameter number: 1

Name: Value

Description: Required. The value to convert to XML.

xpath Return an array of xml nodes matching the xpath expression


of a value that the xpath expression evaluates to.

Example 1

Assume the value of parameter ‘p1’ is a string representation


of the following XML:

<?xml version="1.0"?> <lab> <robot> <parts>5</parts>


<name>R1</name> </robot> <robot> <parts>8</parts>
<name>R2</name> </robot> </lab>

1. This code:
xpath(xml(pipeline().parameters.p1),
'/lab/robot/name')

would return

[ <name>R1</name>, <name>R2</name> ]

whereas

2. This code:
xpath(xml(pipeline().parameters.p1),
'sum(/lab/robot/parts)')
would return
13

Example 2

Given the following XML content:

<?xml version="1.0"?> <File xmlns="http://foo.com">


<Location>bar</Location> </File>

1. This code:
@xpath(xml(body('Http')), '/*[name()=\"File\"]/*
[name()=\"Location\"]')

or

2. This code:
@xpath(xml(body('Http')), '/*[local-name()=\"File\"
and namespace-uri()=\"http://foo.com\"]/*[local-
name()=\"Location\" and namespace-uri()=\"\"]')

returns

<Location xmlns="http://foo.com">bar</Location>

and

3. This code:
@xpath(xml(body('Http')), 'string(/*
[name()=\"File\"]/*[name()=\"Location\"])')

returns

bar

Parameter number: 1

Name: Xml

Description: Required. The XML on which to evaluate the


XPath expression.

Parameter number: 2

Name: XPath

Description: Required. The XPath expression to evaluate.

array Convert the parameter to an array. For example, the following


expression returns ["abc"] : array('abc')

Parameter number: 1

Name: Value

Description: Required. The value that is converted to an


array.

createArray Creates an array from the parameters. For example, the


following expression returns ["a", "c"] :
createArray('a', 'c')

Parameter number: 1 ... n

Name: Any n

Description: Required. The values to combine into an array.
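
The conversion functions above are most often used inline in activity properties. The fragment below is an illustrative sketch only: it URL-encodes a hypothetical pipeline parameter named searchTerm before a Web activity calls an endpoint. The activity name, URL host, and parameter are assumptions, not values taken from this reference.

{
    "name": "CallSearchApi",
    "type": "WebActivity",
    "typeProperties": {
        "method": "GET",
        "url": "@concat('https://<yourApiHost>/search?q=', uriComponent(pipeline().parameters.searchTerm))"
    }
}

At run time, uriComponent escapes the parameter value before it is appended to the query string.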

Math functions
These functions can be used with both types of numbers: integers and floats.

FUNCTION NAME DESCRIPTION

add Returns the result of the addition of the two numbers. For
example, this function returns 20.333 : add(10,10.333)

Parameter number: 1

Name: Summand 1

Description: Required. The number to add to Summand 2.

Parameter number: 2

Name: Summand 2

Description: Required. The number to add to Summand 1.

sub Returns the result of the subtraction of the two numbers. For
example, this function returns: -0.333 :

sub(10,10.333)

Parameter number: 1

Name: Minuend

Description: Required. The number that Subtrahend is


removed from.

Parameter number: 2

Name: Subtrahend

Description: Required. The number to remove from the


Minuend.

mul Returns the result of the multiplication of the two numbers.


For example, the following returns 103.33 :

mul(10,10.333)

Parameter number: 1

Name: Multiplicand 1

Description: Required. The number to multiply Multiplicand


2 with.

Parameter number: 2

Name: Multiplicand 2

Description: Required. The number to multiply Multiplicand


1 with.

div Returns the result of the division of the two numbers. For
example, the following returns 1.0333 :

div(10.333,10)

Parameter number: 1

Name: Dividend

Description: Required. The number to divide by the Divisor.

Parameter number: 2

Name: Divisor

Description: Required. The number to divide the Dividend


by.

mod Returns the result of the remainder after the division of the
two numbers (modulo). For example, the following expression
returns 2 :

mod(10,4)

Parameter number: 1

Name: Dividend

Description: Required. The number to divide by the Divisor.

Parameter number: 2

Name: Divisor

Description: Required. The number to divide the Dividend


by. After the division, the remainder is taken.

min There are two different patterns for calling this function:
min([0,1,2]) Here min takes an array. This expression
returns 0 . Alternatively, this function can take a comma-
separated list of values: min(0,1,2) , which also
returns 0 . Note that all values must be numbers, so if the
parameter is an array it must contain only numbers.

Parameter number: 1

Name: Collection or Value

Description: Required. It can either be an array of values to


find the minimum value, or the first value of a set.

Parameter number: 2 ... n

Name: Value n

Description: Optional. If the first parameter is a Value, then


you can pass additional values and the minimum of all passed
values are returned.

max There are two different patterns for calling this function:
max([0,1,2])

Here max takes an array. This expression returns 2 .


Alternatively, this function can take a comma-separated list of
values: max(0,1,2) , which also returns 2 . Note that all
values must be numbers, so if the parameter is an array it
must contain only numbers.

Parameter number: 1

Name: Collection or Value

Description: Required. It can either be an array of values to


find the maximum value, or the first value of a set.

Parameter number: 2 ... n

Name: Value n

Description: Optional. If the first parameter is a Value, then


you can pass additional values and the maximum of all passed
values are returned.

range Generates an array of integers starting from a certain


number, and you define the length of the returned array. For
example, this function returns [3,4,5,6] :

range(3,4)

Parameter number: 1

Name: Start index

Description: Required. It is the first integer in the array.

Parameter number: 2

Name: Count

Description: Required. Number of integers that are in the


array.

rand Generates a random integer within the specified range


(inclusive on both ends). For example, this could return 42 :

rand(-1000,1000)

Parameter number: 1

Name: Minimum

Description: Required. The lowest integer that could be


returned.

Parameter number: 2

Name: Maximum

Description: Required. The highest integer that could be


returned.
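
As a hedged sketch of how these functions compose, the Set Variable activity fragment below computes a jittered back-off delay in seconds. The variable name backoffSeconds and the integer pipeline parameter attempt are hypothetical and only illustrate the syntax.

{
    "name": "ComputeBackoffSeconds",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "backoffSeconds",
        "value": "@string(add(mul(pipeline().parameters.attempt, 30), rand(0, 10)))"
    }
}

Because pipeline variables hold strings, the numeric result is wrapped in the string conversion function before assignment.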

Date functions
FUNCTION NAME DESCRIPTION

utcnow Returns the current timestamp as a string. For example


2015-03-15T13:27:36Z :

utcnow()

Parameter number: 1

Name: Format

Description: Optional. Either a single format specifier


character or a custom format pattern that indicates how to
format the value of this timestamp. If format is not provided,
the ISO 8601 format ("o") is used.

addseconds Adds an integer number of seconds to a string timestamp


passed in. The number of seconds can be positive or negative.
The result is a string in ISO 8601 format ("o") by default,
unless a format specifier is provided. For example
2015-03-15T13:27:00Z :

addseconds('2015-03-15T13:27:36Z', -36)

Parameter number: 1

Name: Timestamp

Description: Required. A string that contains the time.

Parameter number: 2

Name: Seconds

Description: Required. The number of seconds to add. May


be negative to subtract seconds.

Parameter number: 3

Name: Format

Description: Optional. Either a single format specifier


character or a custom format pattern that indicates how to
format the value of this timestamp. If format is not provided,
the ISO 8601 format ("o") is used.

addminutes Adds an integer number of minutes to a string timestamp


passed in. The number of minutes can be positive or negative.
The result is a string in ISO 8601 format ("o") by default,
unless a format specifier is provided. For example,
2015-03-15T14:00:36Z :

addminutes('2015-03-15T13:27:36Z', 33)

Parameter number: 1

Name: Timestamp

Description: Required. A string that contains the time.

Parameter number: 2

Name: Minutes

Description: Required. The number of minutes to add. May


be negative to subtract minutes.

Parameter number: 3

Name: Format

Description: Optional. Either a single format specifier


character or a custom format pattern that indicates how to
format the value of this timestamp. If format is not provided,
the ISO 8601 format ("o") is used.

addhours Adds an integer number of hours to a string timestamp


passed in. The number of hours can be positive or negative.
The result is a string in ISO 8601 format ("o") by default,
unless a format specifier is provided. For example
2015-03-16T01:27:36Z :

addhours('2015-03-15T13:27:36Z', 12)

Parameter number: 1

Name: Timestamp

Description: Required. A string that contains the time.

Parameter number: 2

Name: Hours

Description: Required. The number of hours to add. May be


negative to subtract hours.

Parameter number: 3

Name: Format

Description: Optional. Either a single format specifier


character or a custom format pattern that indicates how to
format the value of this timestamp. If format is not provided,
the ISO 8601 format ("o") is used.

adddays Adds an integer number of days to a string timestamp passed


in. The number of days can be positive or negative. The result
is a string in ISO 8601 format ("o") by default, unless a format
specifier is provided. For example 2015-02-23T13:27:36Z :

adddays('2015-03-15T13:27:36Z', -20)

Parameter number: 1

Name: Timestamp

Description: Required. A string that contains the time.

Parameter number: 2

Name: Days

Description: Required. The number of days to add. May be


negative to subtract days.

Parameter number: 3

Name: Format

Description: Optional. Either a single format specifier


character or a custom format pattern that indicates how to
format the value of this timestamp. If format is not provided,
the ISO 8601 format ("o") is used.

formatDateTime Returns a string in date format. The result is a string in ISO


8601 format ("o") by default, unless a format specifier is
provided. For example 2015-02-23T13:27:36Z :

formatDateTime('2015-03-15T13:27:36Z', 'o')

Parameter number: 1

Name: Date

Description: Required. A string that contains the date.

Parameter number: 2

Name: Format

Description: Optional. Either a single format specifier


character or a custom format pattern that indicates how to
format the value of this timestamp. If format is not provided,
the ISO 8601 format ("o") is used.
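
A common pattern is to combine the date functions with formatDateTime to build time-partitioned paths. The dataset fragment below is a minimal sketch that targets yesterday's yyyy/MM/dd folder; the dataset name, linked service reference, and container are placeholders rather than values taken from this reference.

{
    "name": "DailyBlobOutput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "<AzureStorageLinkedService>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": {
                "value": "@concat('<containerName>/', formatDateTime(adddays(utcnow(), -1), 'yyyy/MM/dd'))",
                "type": "Expression"
            }
        }
    }
}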

Next steps
For a list of system variables you can use in expressions, see System variables.
System variables supported by Azure Data Factory
5/6/2019 • 2 minutes to read • Edit Online

This article describes system variables supported by Azure Data Factory. You can use these variables in
expressions when defining Data Factory entities.

Pipeline scope
These system variables can be referenced anywhere in the pipeline JSON.

VARIABLE NAME DESCRIPTION

@pipeline().DataFactory Name of the data factory the pipeline run is running within

@pipeline().Pipeline Name of the pipeline

@pipeline().RunId ID of the specific pipeline run

@pipeline().TriggerType Type of the trigger that invoked the pipeline (Manual,


Scheduler)

@pipeline().TriggerId ID of the trigger that invokes the pipeline

@pipeline().TriggerName Name of the trigger that invokes the pipeline

@pipeline().TriggerTime Time when the trigger invoked the pipeline. The trigger
time is the actual fired time, not the scheduled time. For
example, 13:20:08.0149599Z is returned instead of
13:20:00.00Z
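
For illustration, the activity fragment below passes several of these variables to a logging stored procedure. The activity, procedure name, parameter names, and linked service reference are hypothetical; the sketch only shows how the variables are referenced inside expressions.

{
    "name": "LogPipelineRun",
    "type": "SqlServerStoredProcedure",
    "linkedServiceName": {
        "referenceName": "<AzureSqlDatabaseLinkedService>",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "storedProcedureName": "<usp_LogPipelineRun>",
        "storedProcedureParameters": {
            "DataFactoryName": { "value": "@pipeline().DataFactory", "type": "String" },
            "PipelineName": { "value": "@pipeline().Pipeline", "type": "String" },
            "RunId": { "value": "@pipeline().RunId", "type": "String" },
            "TriggerTime": { "value": "@pipeline().TriggerTime", "type": "String" }
        }
    }
}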

Schedule Trigger scope


These system variables can be referenced anywhere in the trigger JSON if the trigger is of type:
"ScheduleTrigger."

VARIABLE NAME DESCRIPTION

@trigger().scheduledTime Time when the trigger was scheduled to invoke the pipeline
run. For example, for a trigger that fires every 5 min, this
variable would return 2017-06-01T22:20:00Z ,
2017-06-01T22:25:00Z , 2017-06-01T22:29:00Z
respectively.

@trigger().startTime Time when the trigger actually fired to invoke the pipeline
run. For example, for a trigger that fires every 5 min, this
variable might return something like this
2017-06-01T22:20:00.4061448Z ,
2017-06-01T22:25:00.7958577Z ,
2017-06-01T22:29:00.9935483Z respectively. (Note: The
timestamp is by default in ISO 8601 format)
Tumbling Window Trigger scope
These system variables can be referenced anywhere in the trigger JSON if the trigger is of type:
"TumblingWindowTrigger." (Note: The timestamp is by default in ISO 8601 format)

VARIABLE NAME DESCRIPTION

@trigger().outputs.windowStartTime Start of the window when the trigger was scheduled to invoke
the pipeline run. If the tumbling window trigger has a
frequency of "hourly" this would be the time at the beginning
of the hour.

@trigger().outputs.windowEndTime End of the window when the trigger was scheduled to invoke
the pipeline run. If the tumbling window trigger has a
frequency of "hourly" this would be the time at the end of the
hour.
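
These window variables are typically mapped to pipeline parameters in the trigger definition, as in the hedged sketch below. The trigger name, pipeline reference, start time, and the parameter names windowStart and windowEnd are placeholders.

{
    "name": "<HourlyTumblingWindowTrigger>",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 1,
            "startTime": "<start time in ISO 8601 format>",
            "maxConcurrency": 1
        },
        "pipeline": {
            "pipelineReference": {
                "referenceName": "<pipelineName>",
                "type": "PipelineReference"
            },
            "parameters": {
                "windowStart": "@trigger().outputs.windowStartTime",
                "windowEnd": "@trigger().outputs.windowEndTime"
            }
        }
    }
}

The pipeline can then use windowStart and windowEnd, for example, to filter a source query to the window being processed.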

Next steps
For information about how these variables are used in expressions, see Expression language & functions.
Security considerations for data movement in Azure
Data Factory
4/19/2019 • 10 minutes to read • Edit Online

This article describes basic security infrastructure that data movement services in Azure Data Factory use to help
secure your data. Data Factory management resources are built on Azure security infrastructure and use all
possible security measures offered by Azure.
In a Data Factory solution, you create one or more data pipelines. A pipeline is a logical grouping of activities that
together perform a task. These pipelines reside in the region where the data factory was created.
Even though Data Factory is available in only a few regions, the data movement service is available globally to
ensure data compliance, efficiency, and reduced network egress costs.
Azure Data Factory does not store any data except for linked service credentials for cloud data stores, which are
encrypted by using certificates. With Data Factory, you create data-driven workflows to orchestrate movement of
data between supported data stores, and processing of data by using compute services in other regions or in an
on-premises environment. You can also monitor and manage workflows by using SDKs and Azure Monitor.
Data Factory has been certified for:

CSA STAR CERTIFICATION

ISO 20000-1:2011

ISO 22301:2012

ISO 27001:2013

ISO 27017:2015

ISO 27018:2014

ISO 9001:2015

SOC 1, 2, 3

HIPAA BAA

If you're interested in Azure compliance and how Azure secures its own infrastructure, visit the Microsoft Trust
Center. For the latest list of all Azure Compliance offerings check - https://aka.ms/AzureCompliance.
In this article, we review security considerations in the following two data movement scenarios:
Cloud scenario: In this scenario, both your source and your destination are publicly accessible through the
internet. These include managed cloud storage services such as Azure Storage, Azure SQL Data Warehouse,
Azure SQL Database, Azure Data Lake Store, Amazon S3, Amazon Redshift, SaaS services such as Salesforce,
and web protocols such as FTP and OData. Find a complete list of supported data sources in Supported data
stores and formats.
Hybrid scenario: In this scenario, either your source or your destination is behind a firewall or inside an on-
premises corporate network. Or, the data store is in a private network or virtual network (most often the
source) and is not publicly accessible. Database servers hosted on virtual machines also fall under this scenario.

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Cloud scenarios
Securing data store credentials
Store encrypted credentials in an Azure Data Factory managed store. Data Factory helps protect your
data store credentials by encrypting them with certificates managed by Microsoft. These certificates are rotated
every two years (which includes certificate renewal and the migration of credentials). The encrypted credentials
are securely stored in an Azure storage account managed by Azure Data Factory management services. For
more information about Azure Storage security, see Azure Storage security overview.
Store credentials in Azure Key Vault. You can also store the data store's credential in Azure Key Vault. Data
Factory retrieves the credential during the execution of an activity. For more information, see Store credential in
Azure Key Vault.
Data encryption in transit
If the cloud data store supports HTTPS or TLS, all data transfers between data movement services in Data Factory
and a cloud data store take place over a secure HTTPS or TLS channel.

NOTE
All connections to Azure SQL Database and Azure SQL Data Warehouse require encryption (SSL/TLS) while data is in transit
to and from the database. When you're authoring a pipeline by using JSON, add the encryption property and set it to true in
the connection string. For Azure Storage, you can use HTTPS in the connection string.
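
As a hedged sketch of the JSON authoring approach described in the note above, an Azure SQL Database linked service with encryption enabled in its connection string might look like the following; the server, database, and credential values are placeholders.

{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>;Password=<password>;Encrypt=True;Connection Timeout=30"
            }
        }
    }
}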

NOTE
To enable encryption in transit while moving data from Oracle, follow one of the options below:
1. In the Oracle server, go to Oracle Advanced Security (OAS) and configure the encryption settings, which support Triple-DES
Encryption (3DES) and Advanced Encryption Standard (AES); refer here for details. ADF automatically negotiates the
encryption method to use the one you configure in OAS when establishing the connection to Oracle.
2. In ADF, you can add EncryptionMethod=1 in the connection string (in the Linked Service). This will use SSL/TLS as the
encryption method. To use this, you need to disable non-SSL encryption settings in OAS on the Oracle server side to
avoid encryption conflict.
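
As a sketch of option 2 above (placeholders only, not a complete or authoritative Oracle connection string), EncryptionMethod=1 is added to the connection string of the Oracle linked service:

{
    "name": "OracleLinkedService",
    "properties": {
        "type": "Oracle",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=<password>;EncryptionMethod=1"
            }
        },
        "connectVia": {
            "type": "integrationRuntimeReference",
            "referenceName": "<integration runtime name>"
        }
    }
}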

NOTE
TLS version used is 1.2.

Data encryption at rest


Some data stores support encryption of data at rest. We recommend that you enable the data encryption
mechanism for those data stores.
Azure SQL Data Warehouse
Transparent Data Encryption (TDE ) in Azure SQL Data Warehouse helps protect against the threat of malicious
activity by performing real-time encryption and decryption of your data at rest. This behavior is transparent to the
client. For more information, see Secure a database in SQL Data Warehouse.
Azure SQL Database
Azure SQL Database also supports transparent data encryption (TDE ), which helps protect against the threat of
malicious activity by performing real-time encryption and decryption of the data, without requiring changes to the
application. This behavior is transparent to the client. For more information, see Transparent data encryption for
SQL Database and Data Warehouse.
Azure Data Lake Store
Azure Data Lake Store also provides encryption for data stored in the account. When enabled, Data Lake Store
automatically encrypts data before persisting and decrypts before retrieval, making it transparent to the client that
accesses the data. For more information, see Security in Azure Data Lake Store.
Azure Blob storage and Azure Table storage
Azure Blob storage and Azure Table storage support Storage Service Encryption (SSE ), which automatically
encrypts your data before persisting to storage and decrypts before retrieval. For more information, see Azure
Storage Service Encryption for Data at Rest.
Amazon S3
Amazon S3 supports both client and server encryption of data at rest. For more information, see Protecting Data
Using Encryption.
Amazon Redshift
Amazon Redshift supports cluster encryption for data at rest. For more information, see Amazon Redshift
Database Encryption.
Salesforce
Salesforce supports Shield Platform Encryption that allows encryption of all files, attachments, and custom fields.
For more information, see Understanding the Web Server OAuth Authentication Flow.

Hybrid scenarios
Hybrid scenarios require self-hosted integration runtime to be installed in an on-premises network, inside a virtual
network (Azure), or inside a virtual private cloud (Amazon). The self-hosted integration runtime must be able to
access the local data stores. For more information about self-hosted integration runtime, see How to create and
configure self-hosted integration runtime.

The command channel allows communication between data movement services in Data Factory and self-hosted
integration runtime. The communication contains information related to the activity. The data channel is used for
transferring data between on-premises data stores and cloud data stores.
On-premises data store credentials
The credentials for your on-premises data stores are always encrypted and stored. They can be either stored locally
on the self-hosted integration runtime machine, or stored in Azure Data Factory managed storage (just like cloud
store credentials).
Store credentials locally. If you want to encrypt and store credentials locally on the self-hosted integration
runtime, follow the steps in Encrypt credentials for on-premises data stores in Azure Data Factory. All
connectors support this option. The self-hosted integration runtime uses Windows DPAPI to encrypt the
sensitive data and credential information.
Use the New-AzDataFactoryV2LinkedServiceEncryptedCredential cmdlet to encrypt linked service
credentials and sensitive details in the linked service. You can then use the JSON returned (with the
EncryptedCredential element in the connection string) to create a linked service by using the Set-
AzDataFactoryV2LinkedService cmdlet.
Store in Azure Data Factory managed storage. If you directly use the Set-
AzDataFactoryV2LinkedService cmdlet with the connection strings and credentials inline in the JSON,
the linked service is encrypted and stored in Azure Data Factory managed storage. The sensitive
information is still encrypted by certificate, and Microsoft manages these certificates.
Ports used when encrypting linked service on self-hosted integration runtime
By default, PowerShell uses port 8050 on the machine with self-hosted integration runtime for secure
communication. If necessary, this port can be changed.

Encryption in transit
All data transfers are via secure channel HTTPS and TLS over TCP to prevent man-in-the-middle attacks during
communication with Azure services.
You can also use IPSec VPN or Azure ExpressRoute to further secure the communication channel between your
on-premises network and Azure.
Azure Virtual Network is a logical representation of your network in the cloud. You can connect an on-premises
network to your virtual network by setting up IPSec VPN (site-to-site) or ExpressRoute (private peering).
The following table summarizes the network and self-hosted integration runtime configuration recommendations
based on different combinations of source and destination locations for hybrid data movement.

SOURCE DESTINATION NETWORK CONFIGURATION INTEGRATION RUNTIME SETUP

On-premises Virtual machines and cloud IPSec VPN (point-to-site or The self-hosted integration
services deployed in virtual site-to-site) runtime should be installed
networks on an Azure virtual machine
in the virtual network.

On-premises Virtual machines and cloud ExpressRoute (private The self-hosted integration
services deployed in virtual peering) runtime should be installed
networks on an Azure virtual machine
in the virtual network.

On-premises Azure-based services that ExpressRoute (Microsoft The self-hosted integration


have a public endpoint peering) runtime can be installed on-
premises or on an Azure
virtual machine.

The following images show the use of self-hosted integration runtime for moving data between an on-premises
database and Azure services by using ExpressRoute and IPSec VPN (with Azure Virtual Network):
ExpressRoute

IPSec VPN
Firewall configurations and whitelisting IP addresses
Firewall requirements for on-premises/private network
In an enterprise, a corporate firewall runs on the central router of the organization. Windows Firewall runs as a
daemon on the local machine in which the self-hosted integration runtime is installed.
The following table provides outbound port and domain requirements for corporate firewalls:

DOMAIN NAMES OUTBOUND PORTS DESCRIPTION

*.servicebus.windows.net 443 Required by the self-hosted integration


runtime to connect to data movement
services in Data Factory.

*.frontend.clouddatahub.net 443 Required by the self-hosted integration


runtime to connect to the Data Factory
service.

download.microsoft.com 443 Required by the self-hosted integration


runtime for downloading the updates. If
you have disabled auto-update then
you may skip this.

*.core.windows.net 443 Used by the self-hosted integration


runtime to connect to the Azure
storage account when you use the
staged copy feature.

*.database.windows.net 1433 (Optional) Required when you copy


from or to Azure SQL Database or
Azure SQL Data Warehouse. Use the
staged copy feature to copy data to
Azure SQL Database or Azure SQL Data
Warehouse without opening port 1433.

*.azuredatalakestore.net 443 (Optional) Required when you copy
login.microsoftonline.com/<tenant>/oauth2/token from or to Azure Data Lake Store.
NOTE
You might have to manage ports or whitelisting domains at the corporate firewall level as required by the respective data
sources. This table only uses Azure SQL Database, Azure SQL Data Warehouse, and Azure Data Lake Store as examples.

The following table provides inbound port requirements for Windows Firewall:

INBOUND PORTS DESCRIPTION

8060 (TCP) Required by the PowerShell encryption cmdlet as described in


Encrypt credentials for on-premises data stores in Azure Data
Factory, and by the credential manager application to securely
set credentials for on-premises data stores on the self-hosted
integration runtime.

IP configurations and whitelisting in data stores


Some data stores in the cloud also require that you whitelist the IP address of the machine accessing the store.
Ensure that the IP address of the self-hosted integration runtime machine is whitelisted or configured in the
firewall appropriately.
The following cloud data stores require that you whitelist the IP address of the self-hosted integration runtime
machine. Some of these data stores, by default, might not require whitelisting.
Azure SQL Database
Azure SQL Data Warehouse
Azure Data Lake Store
Azure Cosmos DB
Amazon Redshift
Frequently asked questions
Can the self-hosted integration runtime be shared across different data factories?
Yes. More details here.
What are the port requirements for the self-hosted integration runtime to work?
The self-hosted integration runtime makes HTTP-based connections to access the internet. Outbound port
443 must be opened for the self-hosted integration runtime to make this connection. Open inbound port 8050
only at the machine level (not the corporate firewall level) for credential manager application. If Azure SQL
Database or Azure SQL Data Warehouse is used as the source or the destination, you need to open port 1433 as
well. For more information, see the Firewall configurations and whitelisting IP addresses section.

Next steps
For information about Azure Data Factory Copy Activity performance, see Copy Activity performance and tuning
guide.
Store credential in Azure Key Vault
3/13/2019 • 2 minutes to read • Edit Online

You can store credentials for data stores and computes in an Azure Key Vault. Azure
Data Factory retrieves the credentials when executing an activity that uses the data
store/compute.
Currently, all activity types except custom activity support this feature. For connector
configuration specifically, check the "linked service properties" section in each
connector topic for details.

Prerequisites
This feature relies on the data factory managed identity. Learn how it works from
Managed identity for Data Factory and make sure your data factory has an
associated one.

Steps
To reference a credential stored in Azure Key Vault, you need to:
1. Retrieve data factory managed identity by copying the value of "SERVICE
IDENTITY APPLICATION ID" generated along with your factory. If you use ADF
authoring UI, the managed identity application ID will be shown on the Azure
Key Vault linked service creation window; you can also retrieve it from Azure
portal, refer to Retrieve data factory managed identity.
2. Grant the managed identity access to your Azure Key Vault. In your key
vault -> Access policies -> Add new -> search for this managed identity application
ID and grant it Get permission in the Secret permissions dropdown. This allows the
designated factory to access secrets in the key vault.
3. Create a linked service pointing to your Azure Key Vault. Refer to Azure
Key Vault linked service.
4. Create data store linked service, inside which reference the
corresponding secret stored in key vault. Refer to reference secret stored in
key vault.

Azure Key Vault linked service


The following properties are supported for Azure Key Vault linked service:

PROPERTY DESCRIPTION REQUIRED

type The type property must be Yes


set to: AzureKeyVault.

baseUrl Specify the Azure Key Vault Yes


URL.

Using authoring UI:


Click Connections -> Linked Services -> +New -> search for "Azure Key Vault":
Select the provisioned Azure Key Vault where your credentials are stored. You can
do Test Connection to make sure your AKV connection is valid.

JSON example:
{
"name": "AzureKeyVaultLinkedService",
"properties": {
"type": "AzureKeyVault",
"typeProperties": {
"baseUrl": "https://<azureKeyVaultName>.vault.azure.net"
}
}
}

Reference secret stored in key vault


The following properties are supported when you configure a field in linked service
referencing a key vault secret:

PROPERTY DESCRIPTION REQUIRED

type The type property of the Yes


field must be set to:
AzureKeyVaultSecret.

secretName The name of the secret in Yes


Azure Key Vault.

secretVersion The version of the secret in No


Azure Key Vault.
If not specified, it always
uses the latest version of
the secret.
If specified, then it sticks to
the given version.

store Refers to an Azure Key Vault Yes


linked service that you use
to store the credential.

Using authoring UI:


Select Azure Key Vault for secret fields while creating the connection to your data
store/compute. Select the provisioned Azure Key Vault Linked Service and provide
the Secret name. You can optionally provide a secret version as well.

TIP
For connectors using connection string in linked service like SQL Server, Blob storage, etc.,
you can choose either to store only the secret field e.g. password in AKV, or to store the
entire connection string in AKV. You can find both options on the UI.
JSON example: (see the "password" section)

{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "<>",
"organizationName": "<>",
"authenticationType": "<>",
"username": "<>",
"password": {
"type": "AzureKeyVaultSecret",
"secretName": "<secret name in AKV>",
"store":{
"referenceName": "<Azure Key Vault linked service>",
"type": "LinkedServiceReference"
}
}
}
}
}
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure
Data Factory, see supported data stores.
Encrypt credentials for on-premises data stores in
Azure Data Factory
4/3/2019 • 2 minutes to read • Edit Online

You can encrypt and store credentials for your on-premises data stores (linked services with sensitive information)
on a machine with self-hosted integration runtime.

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

You pass a JSON definition file with credentials to the


New-AzDataFactoryV2LinkedServiceEncryptedCredential cmdlet to produce an output JSON definition file
with the encrypted credentials. Then, use the updated JSON definition to create the linked services.

Author SQL Server linked service


Create a JSON file named SqlServerLinkedService.json in any folder with the following content:
Replace <servername> , <databasename> , <username> , and <password> with values for your SQL Server before
saving the file. And, replace <integration runtime name> with the name of your integration runtime.

{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<servername>;Database=<databasename>;User ID=<username>;Password=<password>;Timeout=60"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
},
"name": "SqlServerLinkedService"
}
}

Encrypt credentials
To encrypt the sensitive data from the JSON payload on an on-premises self-hosted integration runtime, run
New-AzDataFactoryV2LinkedServiceEncryptedCredential, and pass on the JSON payload. This cmdlet
ensures the credentials are encrypted using DPAPI and stored on the self-hosted integration runtime node locally.
The output payload containing the encrypted reference to the credential can be redirected to another JSON file (in
this case 'encryptedLinkedService.json').
New-AzDataFactoryV2LinkedServiceEncryptedCredential -DataFactoryName $dataFactoryName -ResourceGroupName
$ResourceGroupName -Name "SqlServerLinkedService" -DefinitionFile ".\SQLServerLinkedService.json" >
encryptedSQLServerLinkedService.json

Use the JSON with encrypted credentials


Now, use the output JSON file from the previous command containing the encrypted credential to set up the
SqlServerLinkedService.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $ResourceGroupName -Name


"EncryptedSqlServerLinkedService" -DefinitionFile ".\encryptedSqlServerLinkedService.json"

Next steps
For information about security considerations for data movement, see Data movement security considerations.
Managed identity for Data Factory
4/8/2019 • 4 minutes to read • Edit Online

This article helps you understand what managed identity for Data Factory is (formerly known as Managed
Service Identity/MSI) and how it works.

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.

Overview
When creating a data factory, a managed identity can be created along with factory creation. The managed
identity is a managed application registered to Azure Active Directory, and represents this specific data factory.
Managed identity for Data Factory is used by the following features:
Store credential in Azure Key Vault, in which case data factory managed identity is used for Azure Key Vault
authentication.
Connectors including Azure Blob storage, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2,
Azure SQL Database, and Azure SQL Data Warehouse.
Web activity.

Generate managed identity


Managed identity for Data Factory is generated as follows:
When creating data factory through Azure portal or PowerShell, managed identity will always be created
automatically.
When creating data factory through SDK, managed identity will be created only if you specify "Identity = new
FactoryIdentity()" in the factory object for creation. See example in .NET quickstart - create data factory .
When creating data factory through REST API, managed identity will be created only if you specify "identity"
section in request body. See example in REST quickstart - create data factory .
If you find that your data factory doesn't have a managed identity associated with it (see the retrieve managed identity
instructions), you can explicitly generate one by updating the data factory with the identity initiator programmatically:
Generate managed identity using PowerShell
Generate managed identity using REST API
Generate managed identity using an Azure Resource Manager template
Generate managed identity using SDK
NOTE
Managed identity cannot be modified. Updating a data factory which already has a managed identity won't have any
impact; the managed identity is kept unchanged.
If you update a data factory which already has a managed identity without specifying the "identity" parameter in the
factory object or without specifying the "identity" section in the REST request body, you will get an error.
When you delete a data factory, the associated managed identity is deleted along with it.

Generate managed identity using PowerShell


Call Set-AzDataFactoryV2 command again, then you see "Identity" fields being newly generated:

PS C:\WINDOWS\system32> Set-AzDataFactoryV2 -ResourceGroupName <resourceGroupName> -Name <dataFactoryName> -


Location <region>

DataFactoryName : ADFV2DemoFactory
DataFactoryId :
/subscriptions/<subsID>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/factories/ADFV2Dem
oFactory
ResourceGroupName : <resourceGroupName>
Location : East US
Tags : {}
Identity : Microsoft.Azure.Management.DataFactory.Models.FactoryIdentity
ProvisioningState : Succeeded

Generate managed identity using REST API


Call the API below with the "identity" section in the request body:

PATCH
https://management.azure.com/subscriptions/<subsID>/resourceGroups/<resourceGroupName>/providers/Microsoft.Da
taFactory/factories/<data factory name>?api-version=2018-06-01

Request body: add "identity": { "type": "SystemAssigned" }.

{
"name": "<dataFactoryName>",
"location": "<region>",
"properties": {},
"identity": {
"type": "SystemAssigned"
}
}

Response: managed identity is created automatically, and "identity" section is populated accordingly.
{
"name": "<dataFactoryName>",
"tags": {},
"properties": {
"provisioningState": "Succeeded",
"loggingStorageAccountKey": "**********",
"createTime": "2017-09-26T04:10:01.1135678Z",
"version": "2018-06-01"
},
"identity": {
"type": "SystemAssigned",
"principalId": "765ad4ab-XXXX-XXXX-XXXX-51ed985819dc",
"tenantId": "72f988bf-XXXX-XXXX-XXXX-2d7cd011db47"
},
"id":
"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/factories
/ADFV2DemoFactory",
"type": "Microsoft.DataFactory/factories",
"location": "<region>"
}

Generate managed identity using an Azure Resource Manager template


Template: add "identity": { "type": "SystemAssigned" }.

{
"contentVersion": "1.0.0.0",
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"resources": [{
"name": "<dataFactoryName>",
"apiVersion": "2018-06-01",
"type": "Microsoft.DataFactory/factories",
"location": "<region>",
"identity": {
"type": "SystemAssigned"
}
}]
}

Generate managed identity using SDK


Call the data factory create_or_update function with Identity=new FactoryIdentity(). Sample code using .NET:

Factory dataFactory = new Factory


{
Location = <region>,
Identity = new FactoryIdentity()
};
client.Factories.CreateOrUpdate(resourceGroup, dataFactoryName, dataFactory);

Retrieve managed identity


You can retrieve the managed identity from Azure portal or programmatically. The following sections show some
samples.

TIP
If you don't see the managed identity, generate managed identity by updating your factory.

Retrieve managed identity using Azure portal


You can find the managed identity information from Azure portal -> your data factory -> Properties:
Managed Identity Object ID
Managed Identity Tenant
Managed Identity Application ID > copy this value

Retrieve managed identity using PowerShell


The managed identity principal ID and tenant ID will be returned when you get a specific data factory as follows:

PS C:\WINDOWS\system32> (Get-AzDataFactoryV2 -ResourceGroupName <resourceGroupName> -Name


<dataFactoryName>).Identity

PrincipalId TenantId
----------- --------
765ad4ab-XXXX-XXXX-XXXX-51ed985819dc 72f988bf-XXXX-XXXX-XXXX-2d7cd011db47

Copy the principal ID, then run the Azure Active Directory command below with the principal ID as a parameter to get the
ApplicationId, which you use to grant access:

PS C:\WINDOWS\system32> Get-AzADServicePrincipal -ObjectId 765ad4ab-XXXX-XXXX-XXXX-51ed985819dc

ServicePrincipalNames : {76f668b3-XXXX-XXXX-XXXX-1b3348c75e02,
https://identity.azure.net/P86P8g6nt1QxfPJx22om8MOooMf/Ag0Qf/nnREppHkU=}
ApplicationId : 76f668b3-XXXX-XXXX-XXXX-1b3348c75e02
DisplayName : ADFV2DemoFactory
Id : 765ad4ab-XXXX-XXXX-XXXX-51ed985819dc
Type : ServicePrincipal

Next steps
See the following topics which introduce when and how to use data factory managed identity:
Store credential in Azure Key Vault
Copy data from/to Azure Data Lake Store using managed identities for Azure resources authentication
See Managed Identities for Azure Resources Overview for more background on managed identities for Azure
resources, which data factory managed identity is based upon.
Visually monitor Azure data factories
1/18/2019 • 4 minutes to read • Edit Online

Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the
cloud for orchestrating and automating data movement and data transformation. Using Azure Data Factory, you
can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores,
process/transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake
Analytics, and Azure Machine Learning, and publish output data to data stores such as Azure SQL Data Warehouse
for business intelligence (BI) applications to consume.
In this quickstart, you learn how to visually monitor Data Factory pipelines without writing a single line of code.
If you don't have an Azure subscription, create a free account before you begin.

Monitor Data Factory pipelines


Monitor pipeline and activity runs with a simple list view interface. All the runs are displayed in the local browser
time zone. You can change the time zone and all the date time fields snap to the selected time zone.
1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. Log in to the Azure portal.
3. Navigate to the created data factory blade in Azure portal and click the 'Monitor & Manage' tile to launch the
Data Factory visual monitoring experience.

Monitor pipeline runs


List view showcasing each pipeline run for your data factory v2 pipelines. Included columns:

COLUMN NAME DESCRIPTION

Pipeline Name Name of the pipeline.

Actions Single action available to view activity runs.

Run Start Pipeline run start date time (MM/DD/YYYY, HH:MM:SS


AM/PM)

Duration Run duration (HH:MM:SS)

Triggered By Manual trigger, Schedule trigger

Status Failed, Succeeded, In Progress

Parameters Pipeline run parameters (name, value pairs)

Error Pipeline run error (if/any)

Run ID ID of the pipeline run


Monitor activity runs
List view showcasing activity runs corresponding to each pipeline run. Click 'Activity Runs' icon under the
'Actions' column to view activity runs for each pipeline run. Included columns:

COLUMN NAME DESCRIPTION

Activity Name Name of the activity inside the pipeline.

Activity Type Type of the activity, such as Copy, HDInsightSpark,


HDInsightHive, etc.

Run Start Activity run start date time (MM/DD/YYYY, HH:MM:SS


AM/PM)

Duration Run duration (HH:MM:SS)

Status Failed, Succeeded, In Progress

Input JSON array describing the activity inputs

Output JSON array describing the activity outputs

Error Activity run error (if/any)

IMPORTANT
You need to click the 'Refresh' icon at the top to refresh the list of pipeline and activity runs. Auto-refresh is currently not
supported.
Select a data factory to monitor
Hover over the Data Factory icon on the top left. Click on the 'Arrow' icon to see a list of Azure subscriptions and
data factories that you can monitor.

Configure the list view


Apply rich ordering and filtering
Order pipeline runs in desc/asc by Run Start and filter pipeline runs by following columns:

COLUMN NAME DESCRIPTION

Pipeline Name Name of the pipeline.

Run Start Pipeline run start date time. Options include quick filters for 'Last 24
hours', 'Last week', 'Last 30 days', or select a custom date time.

Run Status Filter runs by status - Succeeded, Failed, In Progress


Add or remove columns
Right-click the list view header and choose columns that you want to appear in the list view

Adjust column widths


Increase and decrease the column widths in list view by hovering over the column header

Promote user properties to monitor


You can promote any pipeline activity property as a user property so that it becomes an entity that you can
monitor. For example, you can promote the Source and Destination properties of the Copy activity in your
pipeline as user properties. You can also select Auto Generate to generate the Source and Destination user
properties for a Copy activity.

NOTE
You can only promote up to 5 pipeline activity properties as user properties.

After you create the user properties, you can then monitor them in the monitoring list views. If the source for the
Copy activity is a table name, you can monitor the source table name as a column in the activity runs list view.
Rerun activities inside a pipeline
You can now rerun activities inside a pipeline. Click View activity runs and select the activity in your pipeline from
which point you want to rerun your pipeline.

View rerun history


You can view the rerun history for all the pipeline runs in the list view.
You can also view rerun history for a particular pipeline run.

Guided Tours
Click on the 'Information Icon' in lower left and click 'Guided Tours' to get step-by-step instructions on how to
monitor your pipeline and activity runs.
Feedback
Click on the 'Feedback' icon to give us feedback on various features or any issues that you might be facing.

Alerts
You can raise alerts on supported metrics in Data Factory. Select Monitor -> Alerts & Metrics on the Data
Factory Monitor page to get started.
For a seven-minute introduction and demonstration of this feature, watch the following video:

Create Alerts
1. Click New Alert rule to create a new alert.

2. Specify the rule name and select the alert Severity.

3. Select the Alert Criteria.


4. Configure the Alert logic. You can create an alert for the selected metric for all pipelines and corresponding
activities. You can also select a particular activity type, activity name, pipeline name, or a failure type.

5. Configure Email/SMS/Push/Voice notifications for the alert. Create or choose an existing Action Group
for the alert notifications.
6. Create the alert rule.
Next steps
See Monitor and manage pipelines programmatically article to learn about monitoring and managing pipelines.
Alert and Monitor data factories using Azure Monitor
3/15/2019 • 10 minutes to read • Edit Online

Cloud applications are complex with many moving parts. Monitoring provides data to ensure that your application stays up and running in a healthy state. It also helps you to
stave off potential problems or troubleshoot past ones. In addition, you can use monitoring data to gain deep insights about your application. This knowledge can help you to
improve application performance or maintainability, or automate actions that would otherwise require manual intervention.
Azure Monitor provides base level infrastructure metrics and logs for most services in Microsoft Azure. For details, see monitoring overview. Azure Diagnostic logs are logs
emitted by a resource that provide rich, frequent data about the operation of that resource. Data Factory outputs diagnostic logs in Azure Monitor.

Persist Data Factory Data


Data Factory only stores pipeline run data for 45 days. If you want to persist pipeline run data for more than 45 days, you can use Azure Monitor: not only can you route diagnostic
logs for analysis, you can also persist them into a storage account so you have factory information for the duration of your choosing.

Diagnostic logs
Save them to a Storage Account for auditing or manual inspection. You can specify the retention time (in days) using the diagnostic settings.
Stream them to Event Hubs for ingestion by a third-party service or custom analytics solution such as Power BI.
Analyze them with Log Analytics
You can use a storage account or event hub namespace that is not in the same subscription as the resource that is emitting logs. The user who configures the setting must
have the appropriate role-based access control (RBAC) access to both subscriptions.

Set up diagnostic logs


Diagnostic Settings
Diagnostic Logs for non-compute resources are configured using diagnostic settings. Diagnostic settings for a resource control:
Where diagnostic logs are sent (Storage Account, Event Hubs, or Azure Monitor logs).
Which log categories are sent.
How long each log category should be retained in a storage account.
A retention of zero days means logs are kept forever. Otherwise, the value can be any number of days between 1 and 2147483647.
If retention policies are set but storing logs in a storage account is disabled (for example, only Event Hubs or Azure Monitor logs options are selected), the retention
policies have no effect.
Retention policies are applied per-day, so at the end of a day (UTC), logs from the day that is now beyond the retention policy are deleted. For example, if you had a
retention policy of one day, at the beginning of the day today the logs from the day before yesterday would be deleted.
Enable diagnostic logs via REST APIs
Create or update a diagnostics setting in Azure Monitor REST API
Request

PUT
https://management.azure.com/{resource-id}/providers/microsoft.insights/diagnosticSettings/service?api-version={api-version}

Headers
Replace {api-version} with 2016-09-01 .
Replace {resource-id} with the resource ID of the resource for which you would like to edit diagnostic settings. For more information, see Using Resource groups to manage
your Azure resources.
Set the Content-Type header to application/json .
Set the authorization header to a JSON web token that you obtain from Azure Active Directory. For more information, see Authenticating requests.
Body
{
"properties": {
"storageAccountId": "/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.Storage/storageAccounts/<storageAccountName>",
"serviceBusRuleId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.EventHub/namespaces/<eventHubName>/authorizationrules/RootManageSharedAccessKey",
"workspaceId": "/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.OperationalInsights/workspaces/<LogAnalyticsName>",
"metrics": [
],
"logs": [
{
"category": "PipelineRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "TriggerRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "ActivityRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
}
]
},
"location": ""
}

PROPERTY TYPE DESCRIPTION

storageAccountId String The resource ID of the storage account to which you would
like to send Diagnostic Logs

serviceBusRuleId String The service bus rule ID of the service bus namespace in which
you would like to have Event Hubs created for streaming
Diagnostic Logs. The rule ID is of the format: "{service bus
resource ID}/authorizationrules/{key name}".

workspaceId String The workspace ID of the Log Analytics workspace to which
you would like to send Diagnostic Logs

metrics Complex Type Array of metric time grains and their retention policies.
Currently, this property is empty.

logs Complex Type Array of log categories and their retention policies

category String Name of a Diagnostic Log category for a resource type. To
obtain the list of Diagnostic Log categories for a resource, first
perform a GET diagnostic settings operation.

timeGrain String The granularity of metrics that are captured in ISO 8601
duration format. Must be PT1M (one minute)

enabled Boolean Specifies whether collection of that metric or log category is


enabled for this resource

retentionPolicy Complex Type Describes the retention policy for a metric or log category.
Used for storage account option only.

days Int Number of days to retain the metrics or logs. A value of 0


retains the logs indefinitely. Used for storage account option
only.

Response
200 OK
{
"id": "/subscriptions/<subID>/resourcegroups/adf/providers/microsoft.datafactory/factories/shloadobetest2/providers/microsoft.insights/diagnosticSettings/service",
"type": null,
"name": "service",
"location": null,
"kind": null,
"tags": null,
"properties": {
"storageAccountId": "/subscriptions/<subID>/resourceGroups/<resourceGroupName>//providers/Microsoft.Storage/storageAccounts/<storageAccountName>",
"serviceBusRuleId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>//providers/Microsoft.EventHub/namespaces/<eventHubName>/authorizationrules/RootManageSharedAccessKey",
"workspaceId": "/subscriptions/<subID>/resourceGroups/<resourceGroupName>//providers/Microsoft.OperationalInsights/workspaces/<LogAnalyticsName>",
"eventHubAuthorizationRuleId": null,
"eventHubName": null,
"metrics": [],
"logs": [
{
"category": "PipelineRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "TriggerRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "ActivityRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
}
]
},
"identity": null
}

Get information about diagnostics setting in Azure Monitor REST API


Request

GET
https://management.azure.com/{resource-id}/providers/microsoft.insights/diagnosticSettings/service?api-version={api-version}

Headers
Replace {api-version} with 2016-09-01 .
Replace {resource-id} with the resource ID of the resource for which you would like to edit diagnostic settings. For more information, see Using Resource groups to manage
your Azure resources.
Set the Content-Type header to application/json .
Set the authorization header to a JSON Web Token that you obtain from Azure Active Directory. For more information, see Authenticating requests.
Response
200 OK
{
"id": "/subscriptions/<subID>/resourcegroups/adf/providers/microsoft.datafactory/factories/shloadobetest2/providers/microsoft.insights/diagnosticSettings/service",
"type": null,
"name": "service",
"location": null,
"kind": null,
"tags": null,
"properties": {
"storageAccountId": "/subscriptions/<subID>/resourceGroups/shloprivate/providers/Microsoft.Storage/storageAccounts/azmonlogs",
"serviceBusRuleId":
"/subscriptions/<subID>/resourceGroups/shloprivate/providers/Microsoft.EventHub/namespaces/shloeventhub/authorizationrules/RootManageSharedAccessKey",
"workspaceId": "/subscriptions/<subID>/resourceGroups/ADF/providers/Microsoft.OperationalInsights/workspaces/mihaipie",
"eventHubAuthorizationRuleId": null,
"eventHubName": null,
"metrics": [],
"logs": [
{
"category": "PipelineRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "TriggerRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "ActivityRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
}
]
},
"identity": null
}

More info here

Schema of Logs & Events


Activity Run Logs Attributes

{
"Level": "",
"correlationId":"",
"time":"",
"activityRunId":"",
"pipelineRunId":"",
"resourceId":"",
"category":"ActivityRuns",
"level":"Informational",
"operationName":"",
"pipelineName":"",
"activityName":"",
"start":"",
"end":"",
"properties":
{
"Input": "{
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}",
"Output": "{"dataRead":121,"dataWritten":121,"copyDuration":5,
"throughput":0.0236328132,"errors":[]}",
"Error": "{
"errorCode": "null",
"message": "null",
"failureType": "null",
"target": "CopyBlobtoBlob"
}
}
}

PROPERTY TYPE DESCRIPTION EXAMPLE

Level          String   Level of the diagnostic logs. Level 4 is always the case for activity run logs.   4
correlationId  String   Unique ID to track a particular request end-to-end.   319dc6b4-f348-405e-b8d7-aafc77b73e77
time           String   Time of the event in timespan, UTC format YYYY-MM-DDTHH:MM:SS.00000Z.   2017-06-28T21:00:27.3534352Z
activityRunId  String   ID of the activity run.   3a171e1f-b36e-4b80-8a54-5625394f4354
pipelineRunId  String   ID of the pipeline run.   9f6069d6-e522-4608-9f99-21807bfc3c70
resourceId     String   Associated resource ID for the data factory resource.   /SUBSCRIPTIONS/<subID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS
category       String   Category of the diagnostic logs. Set this property to "ActivityRuns".   ActivityRuns
level          String   Level of the diagnostic logs. Set this property to "Informational".   Informational
operationName  String   Name of the activity with status. If the status is the start heartbeat, it is "MyActivity -". If the status is the end heartbeat, it is "MyActivity - Succeeded" with the final status.   MyActivity - Succeeded
pipelineName   String   Name of the pipeline.   MyPipeline
activityName   String   Name of the activity.   MyActivity
start          String   Start of the activity run in timespan, UTC format.   2017-06-26T20:55:29.5007959Z
end            String   End of the activity run in timespan, UTC format. If the activity has not ended yet (diagnostic log for an activity starting), a default value of 1601-01-01T00:00:00Z is set.   2017-06-26T20:55:29.5007959Z

Pipeline Run Logs Attributes

{
"Level": "",
"correlationId":"",
"time":"",
"runId":"",
"resourceId":"",
"category":"PipelineRuns",
"level":"Informational",
"operationName":"",
"pipelineName":"",
"start":"",
"end":"",
"status":"",
"properties":
{
"Parameters": {
"<parameter1Name>": "<parameter1Value>"
},
"SystemParameters": {
"ExecutionStart": "",
"TriggerId": "",
"SubscriptionId": ""
}
}
}

PROPERTY TYPE DESCRIPTION EXAMPLE

Level          String   Level of the diagnostic logs. Level 4 is the case for pipeline run logs.   4
correlationId  String   Unique ID to track a particular request end-to-end.   319dc6b4-f348-405e-b8d7-aafc77b73e77
time           String   Time of the event in timespan, UTC format YYYY-MM-DDTHH:MM:SS.00000Z.   2017-06-28T21:00:27.3534352Z
runId          String   ID of the pipeline run.   9f6069d6-e522-4608-9f99-21807bfc3c70
resourceId     String   Associated resource ID for the data factory resource.   /SUBSCRIPTIONS/<subID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS
category       String   Category of the diagnostic logs. Set this property to "PipelineRuns".   PipelineRuns
level          String   Level of the diagnostic logs. Set this property to "Informational".   Informational
operationName  String   Name of the pipeline with status. "Pipeline - Succeeded" with the final status when the pipeline run is completed.   MyPipeline - Succeeded
pipelineName   String   Name of the pipeline.   MyPipeline
start          String   Start of the pipeline run in timespan, UTC format.   2017-06-26T20:55:29.5007959Z
end            String   End of the pipeline run in timespan, UTC format. If the run has not ended yet (diagnostic log for a run starting), a default value of 1601-01-01T00:00:00Z is set.   2017-06-26T20:55:29.5007959Z
status         String   Final status of the pipeline run (Succeeded or Failed).   Succeeded

Trigger Run Logs Attributes

{
"Level": "",
"correlationId":"",
"time":"",
"triggerId":"",
"resourceId":"",
"category":"TriggerRuns",
"level":"Informational",
"operationName":"",
"triggerName":"",
"triggerType":"",
"triggerEvent":"",
"start":"",
"status":"",
"properties":
{
"Parameters": {
"TriggerTime": "",
"ScheduleTime": ""
},
"SystemParameters": {}
}
}

PROPERTY TYPE DESCRIPTION EXAMPLE

Level          String   Level of the diagnostic logs. Set to level 4 for trigger run logs.   4
correlationId  String   Unique ID to track a particular request end-to-end.   319dc6b4-f348-405e-b8d7-aafc77b73e77
time           String   Time of the event in timespan, UTC format YYYY-MM-DDTHH:MM:SS.00000Z.   2017-06-28T21:00:27.3534352Z
triggerId      String   ID of the trigger run.   08587023010602533858661257311
resourceId     String   Associated resource ID for the data factory resource.   /SUBSCRIPTIONS/<subID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS
category       String   Category of the diagnostic logs. Set this property to "TriggerRuns".   TriggerRuns
level          String   Level of the diagnostic logs. Set this property to "Informational".   Informational
operationName  String   Name of the trigger with the final status indicating whether it successfully fired. "MyTrigger - Succeeded" if the heartbeat was successful.   MyTrigger - Succeeded
triggerName    String   Name of the trigger.   MyTrigger
triggerType    String   Type of the trigger (Manual trigger or Schedule trigger).   ScheduleTrigger
triggerEvent   String   Event of the trigger.   ScheduleTime - 2017-07-06T01:50:25Z
start          String   Start of the trigger fire in timespan, UTC format.   2017-06-26T20:55:29.5007959Z
status         String   Final status showing whether the trigger successfully fired (Succeeded or Failed).   Succeeded

Metrics
Azure Monitor enables you to consume telemetry to gain visibility into the performance and health of your workloads on Azure. The most important type of Azure telemetry
data is the metrics (also called performance counters) emitted by most Azure resources. Azure Monitor provides several ways to configure and consume these metrics for
monitoring and troubleshooting.
ADFV2 emits the following metrics:

METRIC                 METRIC DISPLAY NAME              UNIT   AGGREGATION TYPE   DESCRIPTION

PipelineSucceededRuns  Succeeded pipeline runs metrics  Count  Total              Total pipeline runs succeeded within a minute window
PipelineFailedRuns     Failed pipeline runs metrics     Count  Total              Total pipeline runs failed within a minute window
ActivitySucceededRuns  Succeeded activity runs metrics  Count  Total              Total activity runs succeeded within a minute window
ActivityFailedRuns     Failed activity runs metrics     Count  Total              Total activity runs failed within a minute window
TriggerSucceededRuns   Succeeded trigger runs metrics   Count  Total              Total trigger runs succeeded within a minute window
TriggerFailedRuns      Failed trigger runs metrics      Count  Total              Total trigger runs failed within a minute window

To access the metrics, follow the instructions in the article: https://docs.microsoft.com/azure/monitoring-and-diagnostics/monitoring-overview-metrics
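
As a hedged sketch (not from the original article), the same metrics can also be pulled with the Az.Monitor Get-AzMetric cmdlet. The variable $dataFactoryResourceId is an assumption: the full resource ID of your data factory.

# Sketch: read one of the metrics listed above over the last hour at one-minute granularity.
# $dataFactoryResourceId is assumed, e.g. /subscriptions/<subID>/resourceGroups/<rg>/providers/Microsoft.DataFactory/factories/<factoryName>
Get-AzMetric -ResourceId $dataFactoryResourceId `
    -MetricName "PipelineFailedRuns" `
    -TimeGrain 00:01:00 `
    -StartTime (Get-Date).AddHours(-1) `
    -EndTime (Get-Date) `
    -AggregationType Total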

Monitor Data Factory Metrics with Azure Monitor


You can use Azure Data Factory integration with Azure Monitor to route data to Azure Monitor. This integration is useful in the following scenarios:
1. You want to write complex queries on a rich set of metrics that is published by Data Factory to Azure Monitor. You can also create custom alerts on these queries via
Azure Monitor.
2. You want to monitor across data factories. You can route data from multiple data factories to a single Azure Monitor workspace.

Configure Diagnostic Settings and Workspace


Enable Diagnostic Settings for your data factory.
1. Select Azure Monitor -> Diagnostics settings -> Select the data factory -> Turn on diagnostics.

2. Provide diagnostic settings including configuration of the workspace.
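
If you prefer scripting over the portal steps above, the following is a hedged PowerShell sketch using the Az.Monitor Set-AzDiagnosticSetting cmdlet. The variables $dataFactoryResourceId and $workspaceResourceId are assumptions: the full resource IDs of your data factory and Log Analytics workspace.

# Sketch: enable the three Data Factory log categories and route them to a Log Analytics workspace.
Set-AzDiagnosticSetting -ResourceId $dataFactoryResourceId `
    -WorkspaceId $workspaceResourceId `
    -Enabled $true `
    -Category PipelineRuns,TriggerRuns,ActivityRuns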

Install Azure Data Factory Analytics from Azure Marketplace


Click Create and select the Workspace and Workspace settings.
Monitor Data Factory Metrics
Installing Azure Data Factory Analytics creates a default set of views that enables the following metrics:
ADF Runs- 1) Pipeline Runs by Data Factory
ADF Runs- 2) Activity Runs by Data Factory
ADF Runs- 3) Trigger Runs by Data Factory
ADF Errors- 1) Top 10 Pipeline Errors by Data Factory
ADF Errors- 2) Top 10 Activity Runs by Data Factory
ADF Errors- 3) Top 10 Trigger Errors by Data Factory
ADF Statistics- 1) Activity Runs by Type
ADF Statistics- 2) Trigger Runs by Type
ADF Statistics- 3) Max Pipeline Runs Duration
You can visualize the above metrics, look at the queries behind these metrics, edit the queries, create alerts, and so forth.
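
To illustrate querying the routed logs programmatically, here is a hedged sketch using Invoke-AzOperationalInsightsQuery. It assumes diagnostics are routed to the AzureDiagnostics table (the default for Data Factory when this article was written); the column name status_s and the variable $workspaceId (the workspace customer ID GUID) are assumptions.

# Sketch: count failed pipeline runs per factory from the routed diagnostic logs.
$query = @"
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DATAFACTORY" and Category == "PipelineRuns"
| where status_s == "Failed"
| summarize FailedRuns = count() by Resource
"@
$result = Invoke-AzOperationalInsightsQuery -WorkspaceId $workspaceId -Query $query
$result.Results | Format-Table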

Alerts
Log in to the Azure portal and click Monitor -> Alerts to create alerts.
Create Alerts
1. Click + New Alert rule to create a new alert.

2. Define the Alert condition.


NOTE
Make sure to select All in the Filter by resource type.

3. Define the Alert details.


4. Define the Action group.
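
As an alternative to the portal steps above, here is a hedged PowerShell sketch that creates a metric alert with the Az.Monitor cmdlets. The alert name, thresholds, and the variables $resourceGroupName, $dataFactoryResourceId, and $actionGroupId are assumptions; the metric name comes from the metrics table earlier in this article.

# Sketch: alert whenever any pipeline run fails within a 5-minute window.
$criteria = New-AzMetricAlertRuleV2Criteria -MetricName "PipelineFailedRuns" -TimeAggregation Total -Operator GreaterThan -Threshold 0
Add-AzMetricAlertRuleV2 -Name "ADF-PipelineFailedRuns" `
    -ResourceGroupName $resourceGroupName `
    -TargetResourceId $dataFactoryResourceId `
    -Condition $criteria `
    -WindowSize (New-TimeSpan -Minutes 5) `
    -Frequency (New-TimeSpan -Minutes 1) `
    -Severity 3 `
    -ActionGroupId $actionGroupId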
Next steps
See Monitor and manage pipelines programmatically article to learn about monitoring and managing pipelines with code.
Programmatically monitor an Azure data factory
3/26/2019 • 3 minutes to read • Edit Online

This article describes how to monitor a pipeline in a data factory by using different software development kits
(SDKs).

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Data range
Data Factory only stores pipeline run data for 45 days. When you query programmatically for data about Data
Factory pipeline runs - for example, with the PowerShell command Get-AzDataFactoryV2PipelineRun - there are no
maximum dates for the optional LastUpdatedAfter and LastUpdatedBefore parameters. But if you query for data
for the past year, for example, the query does not return an error, but only returns pipeline run data from the last
45 days.
If you want to persist pipeline run data for more than 45 days, set up your own diagnostic logging with Azure
Monitor.
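
As a minimal sketch (not part of the original article) of the behavior described above, the following query asks for a full year of runs but only gets back data from the last 45 days. The variables $resourceGroupName and $dataFactoryName are assumptions.

# Sketch: query pipeline runs with a one-year window; only the most recent 45 days are returned.
$runs = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -LastUpdatedAfter (Get-Date).AddDays(-365) `
    -LastUpdatedBefore (Get-Date)
$runs | Select-Object PipelineName, RunId, Status, RunStart | Format-Table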

.NET
For a complete walkthrough of creating and monitoring a pipeline using .NET SDK, see Create a data factory and
pipeline using .NET.
1. Add the following code to continuously check the status of the pipeline run until it finishes copying the data.

// Monitor the pipeline run


Console.WriteLine("Checking pipeline run status...");
PipelineRun pipelineRun;
while (true)
{
pipelineRun = client.PipelineRuns.Get(resourceGroup, dataFactoryName, runResponse.RunId);
Console.WriteLine("Status: " + pipelineRun.Status);
if (pipelineRun.Status == "InProgress")
System.Threading.Thread.Sleep(15000);
else
break;
}

2. Add the following code that retrieves copy activity run details, for example, the size of the data read/written.
// Check the copy activity run details
Console.WriteLine("Checking copy activity run details...");

List<ActivityRun> activityRuns = client.ActivityRuns.ListByPipelineRun(
    resourceGroup, dataFactoryName, runResponse.RunId, DateTime.UtcNow.AddMinutes(-10),
    DateTime.UtcNow.AddMinutes(10)).ToList();
if (pipelineRun.Status == "Succeeded")
Console.WriteLine(activityRuns.First().Output);
else
Console.WriteLine(activityRuns.First().Error);
Console.WriteLine("\nPress any key to exit...");
Console.ReadKey();

For complete documentation on .NET SDK, see Data Factory .NET SDK reference.

Python
For a complete walkthrough of creating and monitoring a pipeline using Python SDK, see Create a data factory
and pipeline using Python.
To monitor the pipeline run, add the following code:

#Monitor the pipeline run


time.sleep(30)
pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run_response.run_id)
print("\n\tPipeline run status: {}".format(pipeline_run.status))
activity_runs_paged = list(adf_client.activity_runs.list_by_pipeline_run(rg_name, df_name,
pipeline_run.run_id, datetime.now() - timedelta(1), datetime.now() + timedelta(1)))
print_activity_run_details(activity_runs_paged[0])

For complete documentation on Python SDK, see Data Factory Python SDK reference.

REST API
For a complete walkthrough of creating and monitoring a pipeline using REST API, see Create a data factory and
pipeline using REST API.
1. Run the following script to continuously check the pipeline run status until it finishes copying the data.

$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelineruns/${runId}?api-version=${apiVersion}"
while ($True) {
$response = Invoke-RestMethod -Method GET -Uri $request -Header $authHeader
Write-Host "Pipeline run status: " $response.Status -foregroundcolor "Yellow"

if ($response.Status -eq "InProgress") {


Start-Sleep -Seconds 15
}
else {
$response | ConvertTo-Json
break
}
}

2. Run the following script to retrieve copy activity run details, for example, size of the data read/written.
$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelineruns/${runId}/activityruns?api-version=${apiVersion}&startTime="+(Get-Date).ToString('yyyy-MM-dd')+"&endTime="+(Get-Date).AddDays(1).ToString('yyyy-MM-dd')+"&pipelineName=Adfv2QuickStartPipeline"
$response = Invoke-RestMethod -Method GET -Uri $request -Header $authHeader
$response | ConvertTo-Json

For complete documentation on REST API, see Data Factory REST API reference.

PowerShell
For a complete walkthrough of creating and monitoring a pipeline using PowerShell, see Create a data factory and
pipeline using PowerShell.
1. Run the following script to continuously check the pipeline run status until it finishes copying the data.

while ($True) {
    $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $DataFactoryName -PipelineRunId $runId

    if ($run) {
        if ($run.Status -ne 'InProgress') {
            Write-Host "Pipeline run finished. The status is: " $run.Status -foregroundcolor "Yellow"
            $run
            break
        }
        Write-Host "Pipeline is running...status: InProgress" -foregroundcolor "Yellow"
    }

    Start-Sleep -Seconds 30
}

2. Run the following script to retrieve copy activity run details, for example, the size of the data read/written.

Write-Host "Activity run details:" -foregroundcolor "Yellow"
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
$result

Write-Host "Activity 'Output' section:" -foregroundcolor "Yellow"
$result.Output -join "`r`n"

Write-Host "`nActivity 'Error' section:" -foregroundcolor "Yellow"
$result.Error -join "`r`n"

For complete documentation on PowerShell cmdlets, see Data Factory PowerShell cmdlet reference.

Next steps
See Monitor pipelines using Azure Monitor article to learn about using Azure Monitor to monitor Data Factory
pipelines.
Monitor an integration runtime in Azure Data Factory
3/7/2019 • 9 minutes to read • Edit Online

Integration runtime is the compute infrastructure used by Azure Data Factory to provide various data integration
capabilities across different network environments. There are three types of integration runtimes offered by Data
Factory:
Azure integration runtime
Self-hosted integration runtime
Azure-SSIS integration runtime

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

To get the status of an instance of integration runtime (IR), run the following PowerShell command:

Get-AzDataFactoryV2IntegrationRuntime -DataFactoryName MyDataFactory -ResourceGroupName MyResourceGroup -Name MyAzureIR -Status

The cmdlet returns different information for different types of integration runtime. This article explains the
properties and statuses for each type of integration runtime.

Azure integration runtime


The compute resource for an Azure integration runtime is fully managed elastically in Azure. The following table
provides descriptions for properties returned by the Get-AzDataFactoryV2IntegrationRuntime command:
Properties
The following table provides descriptions of properties returned by the cmdlet for an Azure integration runtime:

PROPERTY DESCRIPTION

Name                 Name of the Azure integration runtime.
State                Status of the Azure integration runtime.
Location             Location of the Azure integration runtime. For details about the location of an Azure integration runtime, see Introduction to integration runtime.
DataFactoryName      Name of the data factory that the Azure integration runtime belongs to.
ResourceGroupName    Name of the resource group that the data factory belongs to.
Description          Description of the integration runtime.

Status
The following table provides possible statuses of an Azure integration runtime:

STATUS COMMENTS/SCENARIOS

Online     The Azure integration runtime is online and ready to be used.
Offline    The Azure integration runtime is offline due to an internal error.

Self-hosted integration runtime


This section provides descriptions for properties returned by the Get-AzDataFactoryV2IntegrationRuntime cmdlet.

NOTE
The returned properties and status contain information about overall self-hosted integration runtime and each node in the
runtime.

Properties
The following table provides descriptions of monitoring Properties for each node:

PROPERTY DESCRIPTION

Name                              Name of the self-hosted integration runtime and nodes associated with it. A node is an on-premises Windows machine that has the self-hosted integration runtime installed on it.

Status                            The status of the overall self-hosted integration runtime and each node. Example: Online/Offline/Limited/etc. For information about these statuses, see the next section.

Version                           The version of the self-hosted integration runtime and each node. The version of the self-hosted integration runtime is determined based on the version of the majority of nodes in the group. If there are nodes with different versions in the self-hosted integration runtime setup, only the nodes with the same version number as the logical self-hosted integration runtime function properly. Others are in the limited mode and need to be manually updated (only in case auto-update fails).

Available memory                  Available memory on a self-hosted integration runtime node. This value is a near real-time snapshot.

CPU utilization                   CPU utilization of a self-hosted integration runtime node. This value is a near real-time snapshot.

Networking (In/Out)               Network utilization of a self-hosted integration runtime node. This value is a near real-time snapshot.

Concurrent Jobs (Running/Limit)   Running: the number of jobs or tasks running on each node. This value is a near real-time snapshot. Limit: the maximum number of concurrent jobs for each node. This value is defined based on the machine size. You can increase the limit to scale up concurrent job execution in advanced scenarios, when activities are timing out even when CPU, memory, or network is under-utilized. This capability is also available with a single-node self-hosted integration runtime.

Role                              There are two types of roles in a multi-node self-hosted integration runtime – dispatcher and worker. All nodes are workers, which means they can all be used to execute jobs. There is only one dispatcher node, which is used to pull tasks/jobs from cloud services and dispatch them to the different worker nodes. The dispatcher node is also a worker node.

Some settings of the properties make more sense when there are two or more nodes in the self-hosted integration
runtime (that is, in a scale out scenario).
Concurrent jobs limit
The default value of the concurrent jobs limit is set based on the machine size. The factors used to calculate this
value depend on the amount of RAM and the number of CPU cores of the machine. So the more cores and the
more memory, the higher the default limit of concurrent jobs.
You scale out by increasing the number of nodes. When you increase the number of nodes, the concurrent jobs
limit is the sum of the concurrent job limit values of all the available nodes. For example, if one node lets you run a
maximum of twelve concurrent jobs, then adding three more similar nodes lets you run a maximum of 48
concurrent jobs (that is, 4 x 12). We recommend that you increase the concurrent jobs limit only when you see low
resource usage with the default values on each node.
You can override the calculated default value in the Azure portal. Select Author > Connections > Integration Runtimes > Edit > Nodes > Modify concurrent job value per node. You can also use the Update-AzDataFactoryV2IntegrationRuntimeNode PowerShell cmdlet, as in the sketch that follows.
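
Here is a hedged sketch of that PowerShell alternative. The node name "Node_1" and the new limit are assumptions; substitute the values reported for your own self-hosted integration runtime nodes.

# Sketch: raise the concurrent jobs limit on a single self-hosted IR node.
Update-AzDataFactoryV2IntegrationRuntimeNode -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -IntegrationRuntimeName $selfHostedIntegrationRuntimeName `
    -Name "Node_1" `
    -ConcurrentJobsLimit 8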
Status (per node)
The following table provides possible statuses of a self-hosted integration runtime node:

STATUS DESCRIPTION

Online       Node is connected to the Data Factory service.
Offline      Node is offline.
Upgrading    The node is being auto-updated.
Limited      Due to a connectivity issue. May be due to an HTTP port 8050 issue, a service bus connectivity issue, or a credential sync issue.
Inactive     Node is in a configuration different from the configuration of the other majority nodes. A node can be inactive when it cannot connect to other nodes.

Status (overall self-hosted integration runtime)
The following table provides possible statuses of a self-hosted integration runtime. This status depends on the statuses of all nodes that belong to the runtime.

STATUS DESCRIPTION

Need Registration    No node is registered to this self-hosted integration runtime yet.
Online               All nodes are online.
Offline              No node is online.
Limited              Not all nodes in this self-hosted integration runtime are in a healthy state. This status is a warning that some nodes might be down. This status could be due to a credential sync issue on the dispatcher/worker node.

Use the Get-AzDataFactoryV2IntegrationRuntimeMetric cmdlet to fetch the JSON payload containing the
detailed self-hosted integration runtime properties, and their snapshot values during the time of execution of the
cmdlet.

Get-AzDataFactoryV2IntegrationRuntimeMetric -Name $integrationRuntimeName -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName | ConvertTo-Json

Sample output (assumes that there are two nodes associated with this self-hosted integration runtime):

{
"IntegrationRuntimeName": "<Name of your integration runtime>",
"ResourceGroupName": "<Resource Group Name>",
"DataFactoryName": "<Data Factory Name>",
"Nodes": [
{
"NodeName": "<Node Name>",
"AvailableMemoryInMB": <Value>,
"CpuUtilization": <Value>,
"ConcurrentJobsLimit": <Value>,
"ConcurrentJobsRunning": <Value>,
"MaxConcurrentJobs": <Value>,
"SentBytes": <Value>,
"ReceivedBytes": <Value>
},
{
"NodeName": "<Node Name>",
"AvailableMemoryInMB": <Value>,
"CpuUtilization": <Value>,
"ConcurrentJobsLimit": <Value>,
"ConcurrentJobsRunning": <Value>,
"MaxConcurrentJobs": <Value>,
"SentBytes": <Value>,
"ReceivedBytes": <Value>
}

]
}

Azure-SSIS integration runtime


Azure-SSIS integration runtime is a fully managed cluster of Azure virtual machines (or nodes) dedicated to run
your SSIS packages. It does not run any other activities of Azure Data Factory. Once provisioned, you can query its
properties and monitor its overall/node-specific statuses.
Properties
PROPERTY/STATUS DESCRIPTION

CreateTime                     The UTC time when your Azure-SSIS integration runtime was created.
Nodes                          The allocated/available nodes of your Azure-SSIS integration runtime with node-specific statuses (starting/available/recycling/unavailable) and actionable errors.
OtherErrors                    The non-node-specific actionable errors on your Azure-SSIS integration runtime.
LastOperation                  The result of the last start/stop operation on your Azure-SSIS integration runtime with actionable error(s) if it failed.
State                          The overall status (initial/starting/started/stopping/stopped) of your Azure-SSIS integration runtime.
Location                       The location of your Azure-SSIS integration runtime.
NodeSize                       The size of each node of your Azure-SSIS integration runtime.
NodeCount                      The number of nodes in your Azure-SSIS integration runtime.
MaxParallelExecutionsPerNode   The number of parallel executions per node in your Azure-SSIS integration runtime.
CatalogServerEndpoint          The endpoint of your existing Azure SQL Database/Managed Instance server to host SSISDB.
CatalogAdminUserName           The admin username of your existing Azure SQL Database/Managed Instance server. The Data Factory service uses this information to prepare and manage SSISDB on your behalf.
CatalogAdminPassword           The admin password of your existing Azure SQL Database/Managed Instance server.
CatalogPricingTier             The pricing tier for SSISDB hosted by your existing Azure SQL Database server. Not applicable to Azure SQL Database Managed Instance hosting SSISDB.
VNetId                         The virtual network resource ID for your Azure-SSIS integration runtime to join.
Subnet                         The subnet name for your Azure-SSIS integration runtime to join.
ID                             The resource ID of your Azure-SSIS integration runtime.
Type                           The type (Managed/Self-Hosted) of your Azure-SSIS integration runtime.
ResourceGroupName              The name of your Azure resource group, in which your data factory and Azure-SSIS integration runtime were created.
DataFactoryName                The name of your Azure data factory.
Name                           The name of your Azure-SSIS integration runtime.
Description                    The description of your Azure-SSIS integration runtime.

Status (per node)

STATUS DESCRIPTION

Starting       This node is being prepared.
Available      This node is ready for you to deploy/execute SSIS packages.
Recycling      This node is being repaired/restarting.
Unavailable    This node is not ready for you to deploy/execute SSIS packages and has actionable errors/issues that you could resolve.

Status (overall Azure-SSIS integration runtime)

OVERALL STATUS DESCRIPTION

Initial     The nodes of your Azure-SSIS integration runtime have not been allocated/prepared.
Starting    The nodes of your Azure-SSIS integration runtime are being allocated/prepared and billing has started.
Started     The nodes of your Azure-SSIS integration runtime have been allocated/prepared and they are ready for you to deploy/execute SSIS packages.
Stopping    The nodes of your Azure-SSIS integration runtime are being released.
Stopped     The nodes of your Azure-SSIS integration runtime have been released and billing has stopped.

Monitor the Azure-SSIS integration runtime in the Azure portal

The following screenshots show how to select the Azure-SSIS IR to monitor, and provide an example of the information that's displayed.
Monitor the Azure-SSIS integration runtime with PowerShell
Use a script like the following example to check the status of the Azure-SSIS IR.

Get-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -Name $AzureSSISName -ResourceGroupName $ResourceGroupName -Status

More info about the Azure-SSIS integration runtime


See the following articles to learn more about Azure-SSIS integration runtime:
Azure-SSIS Integration Runtime. This article provides conceptual information about integration runtimes in
general including the Azure-SSIS IR.
Tutorial: deploy SSIS packages to Azure. This article provides step-by-step instructions to create an Azure-SSIS
IR and uses an Azure SQL database to host the SSIS catalog.
How to: Create an Azure-SSIS integration runtime. This article expands on the tutorial and provides instructions
on using Azure SQL Database Managed Instance and joining the IR to a virtual network.
Manage an Azure-SSIS IR. This article shows you how to stop, start, or remove an Azure-SSIS IR. It also shows
you how to scale out your Azure-SSIS IR by adding more nodes to the IR.
Join an Azure-SSIS IR to a virtual network. This article provides conceptual information about joining an Azure-
SSIS IR to an Azure virtual network. It also provides steps to use Azure portal to configure the virtual network
so that the Azure-SSIS IR can join the virtual network.

Next steps
See the following articles for monitoring pipelines in different ways:
Quickstart: create a data factory.
Use Azure Monitor to monitor Data Factory pipelines
Reconfigure the Azure-SSIS integration runtime
3/5/2019 • 3 minutes to read • Edit Online

This article describes how to reconfigure an existing Azure-SSIS integration runtime. To create an Azure-SSIS integration runtime (IR) in Azure Data Factory, see Create an Azure-SSIS integration runtime.

Data Factory UI
You can use Data Factory UI to stop, edit/reconfigure, or delete an Azure-SSIS IR.
1. In the Data Factory UI, switch to the Edit tab. To launch Data Factory UI, click Author & Monitor on the
home page of your data factory.
2. In the left pane, click Connections.
3. In the right pane, switch to the Integration Runtimes.
4. You can use buttons in the Actions column to stop, edit, or delete the integration runtime. The Code
button in the Actions column lets you view the JSON definition associated with the integration runtime.

To reconfigure an Azure-SSIS IR


1. Stop the integration runtime by clicking Stop in the Actions column. To refresh the list view, click Refresh
on the toolbar. After the IR is stopped, you see that the first action lets you start the IR.

2. Edit/reconfigure the IR by clicking the Edit button in the Actions column. In the Integration Runtime Setup window, change the settings (for example, the size of the node, the number of nodes, or the maximum parallel executions per node).
3. To restart the IR, click the Start button in the Actions column.
Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.

After you provision and start an instance of Azure-SSIS integration runtime, you can reconfigure it by running a
sequence of Stop - Set - Start PowerShell cmdlets consecutively. For example, the following PowerShell
script changes the number of nodes allocated for the Azure-SSIS integration runtime instance to five.
Reconfigure an Azure-SSIS IR
1. First, stop the Azure-SSIS integration runtime by using the Stop-AzDataFactoryV2IntegrationRuntime
cmdlet. This command releases all of its nodes and stops billing.

Stop-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -Name $AzureSSISName -ResourceGroupName $ResourceGroupName

2. Next, reconfigure the Azure-SSIS IR by using the Set-AzDataFactoryV2IntegrationRuntime cmdlet. The following sample command scales out an Azure-SSIS integration runtime to five nodes.

Set-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -Name $AzureSSISName -ResourceGroupName $ResourceGroupName -NodeCount 5

3. Then, start the Azure-SSIS integration runtime by using the Start-AzDataFactoryV2IntegrationRuntime cmdlet. This command allocates all of its nodes for running SSIS packages.

Start-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -Name $AzureSSISName -ResourceGroupName $ResourceGroupName

Delete an Azure-SSIS IR


1. First, list all existing Azure SSIS IRs under your data factory.

Get-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -ResourceGroupName $ResourceGroupName -Status

2. Next, stop all existing Azure SSIS IRs in your data factory.

Stop-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -Name $AzureSSISName -ResourceGroupName $ResourceGroupName -Force

3. Next, remove all existing Azure SSIS IRs in your data factory one by one.

Remove-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -Name $AzureSSISName -ResourceGroupName $ResourceGroupName -Force

4. Finally, remove your data factory.


Remove-AzDataFactoryV2 -Name $DataFactoryName -ResourceGroupName $ResourceGroupName -Force

5. If you had created a new resource group, remove the resource group.

Remove-AzResourceGroup -Name $ResourceGroupName -Force

Next steps
For more information about Azure-SSIS runtime, see the following topics:
Azure-SSIS Integration Runtime. This article provides conceptual information about integration runtimes in
general including the Azure-SSIS IR.
Tutorial: deploy SSIS packages to Azure. This article provides step-by-step instructions to create an Azure-
SSIS IR and uses an Azure SQL database to host the SSIS catalog.
How to: Create an Azure-SSIS integration runtime. This article expands on the tutorial and provides
instructions on using Azure SQL Database Managed Instance and joining the IR to a virtual network.
Join an Azure-SSIS IR to a virtual network. This article provides conceptual information about joining an
Azure-SSIS IR to an Azure virtual network. It also provides steps to use Azure portal to configure virtual
network so that Azure-SSIS IR can join the virtual network.
Monitor an Azure-SSIS IR. This article shows you how to retrieve information about an Azure-SSIS IR and
descriptions of statuses in the returned information.
Copy or clone a data factory in Azure Data Factory
3/7/2019 • 2 minutes to read • Edit Online

This article describes how to copy or clone a data factory in Azure Data Factory.

Use cases for cloning a data factory


Here are some of the circumstances in which you may find it useful to copy or clone a data factory:
Renaming resources. Azure doesn't support renaming resources. If you want to rename a data factory, you
can clone the data factory with a different name, and then delete the existing one.
Debugging changes when the debug features aren't sufficient. Sometimes to test your changes, you may
want to test your changes in a different factory before applying them to your main one. In most scenarios,
you can use Debug. Changes in triggers, however, such as how your changes behave when a trigger is
invoked automatically, or over a time window, may not be testable easily without checking in. In these cases,
cloning the factory and applying your changes there makes a lot of sense. Since Azure Data Factory charges
primarily by the number of runs, the second factory does not lead to any additional charges.

How to clone a data factory


1. The Data Factory UI in the Azure portal lets you export the entire payload of your data factory into a
Resource Manager template, along with a parameter file that lets you change any values you want to change
when you clone your factory.
2. As a prerequisite, you need to create your target data factory from the Azure portal.
3. If you have a SelfHosted IntegrationRuntime in your source factory, you need to precreate it with the same
name in the target factory. If you want to share the SelfHosted IRs between different factories, you can use
the pattern published here.
4. If you are in GIT mode, every time you publish from the portal, the factory's Resource Manager template is
saved into GIT in the adf_publish branch of the repository.
5. For other scenarios, the Resource Manager template can be downloaded by clicking on the Export
Resource Manager template button in the portal.
6. After you download the Resource Manager template, you can deploy it via standard Resource Manager template deployment methods, as in the PowerShell sketch after this list.
7. For security reasons, the generated Resource Manager template does not contain any secret information,
such as passwords for linked services. As a result, you have to provide these passwords as deployment
parameters. If providing parameters is not desirable, you have to obtain the connection strings and
passwords of the linked services from Azure Key Vault.
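
As a minimal sketch of step 6 above (the folder and file names are assumptions based on the default export, and the target resource group variable is yours to set), the exported template can be deployed with standard Resource Manager cmdlets:

# Sketch: deploy the exported factory template and its parameter file into the target resource group.
New-AzResourceGroupDeployment -ResourceGroupName $targetResourceGroupName `
    -TemplateFile ".\arm_template\arm_template.json" `
    -TemplateParameterFile ".\arm_template\arm_template_parameters.json"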

Next steps
Review the guidance for creating a data factory in the Azure portal in Create a data factory by using the Azure Data
Factory UI.
How to create and configure Azure Integration
Runtime
3/7/2019 • 2 minutes to read • Edit Online

The Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide data integration
capabilities across different network environments. For more information about IR, see Integration runtime.
Azure IR provides a fully managed compute to natively perform data movement and dispatch data transformation
activities to compute services like HDInsight. It is hosted in Azure environment and supports connecting to
resources in public network environment with public accessible endpoints.
This document introduces how you can create and configure Azure Integration Runtime.

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Default Azure IR
By default, each data factory has an Azure IR in the backend that supports operations on cloud data stores and
compute services in public network. The location of that Azure IR is auto-resolve. If connectVia property is not
specified in the linked service definition, the default Azure IR is used. You only need to explicitly create an Azure IR
when you would like to explicitly define the location of the IR, or if you would like to virtually group the activity
executions on different IRs for management purpose.

Create Azure IR
Integration Runtime can be created using the Set-AzDataFactoryV2IntegrationRuntime PowerShell cmdlet. To
create an Azure IR, you specify the name, location and type to the command. Here is a sample command to create
an Azure IR with location set to "West Europe":

Set-AzDataFactoryV2IntegrationRuntime -DataFactoryName "SampleV2DataFactory1" -Name "MySampleAzureIR" -


ResourceGroupName "ADFV2SampleRG" -Type Managed -Location "West Europe"

For Azure IR, the type must be set to Managed. You do not need to specify compute details because it is fully
managed elastically in cloud. Specify compute details like node size and node count when you would like to create
Azure-SSIS IR. For more information, see Create and Configure Azure-SSIS IR.
You can configure an existing Azure IR to change its location using the Set-AzDataFactoryV2IntegrationRuntime
PowerShell cmdlet. For more information about the location of an Azure IR, see Introduction to integration
runtime.
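
For example, here is a hedged sketch of changing the location of the Azure IR created above (the new location value is only an illustration):

# Sketch: reconfigure the existing Azure IR to run in a different region.
Set-AzDataFactoryV2IntegrationRuntime -DataFactoryName "SampleV2DataFactory1" -Name "MySampleAzureIR" -ResourceGroupName "ADFV2SampleRG" -Type Managed -Location "North Europe"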

Use Azure IR
Once an Azure IR is created, you can reference it in your Linked Service definition. Below is a sample of how you
can reference the Azure Integration Runtime created above from an Azure Storage Linked Service:
{
"name": "MyStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=myaccountname;AccountKey=...",
"type": "SecureString"
}
},
"connectVia": {
"referenceName": "MySampleAzureIR",
"type": "IntegrationRuntimeReference"
}
}
}

Next steps
See the following articles on how to create other types of integration runtimes:
Create self-hosted integration runtime
Create Azure-SSIS integration runtime
Create and configure a self-hosted integration
runtime
5/21/2019 • 19 minutes to read • Edit Online

The integration runtime (IR) is the compute infrastructure that Azure Data Factory uses to provide data-integration capabilities across different network environments. For details about IR, see Integration runtime overview.
A self-hosted integration runtime can run copy activities between a cloud data store and a data store in a private network, and it can dispatch transform activities against compute resources in an on-premises network or an Azure virtual network. The self-hosted integration runtime needs to be installed on an on-premises machine or a virtual machine (VM) inside a private network.
This document describes how you can create and configure a self-hosted IR.

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module,
which will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and
AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions,
see Install Azure PowerShell.

High-level steps to install a self-hosted IR


1. Create a self-hosted integration runtime. You can use the Azure Data Factory UI for this task. Here is a
PowerShell example:

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $selfHostedIntegrationRuntimeName -Type SelfHosted -Description "selfhosted IR description"

2. Download and install the self-hosted integration runtime on a local machine.


3. Retrieve the authentication key and register the self-hosted integration runtime with the key. Here is a
PowerShell example:

Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $selfHostedIntegrationRuntimeName

Setting up a self-hosted IR on an Azure VM by using an Azure Resource Manager template (automation)
You can automate self-hosted IR setup on an Azure virtual machine by using this Azure Resource Manager
template. This template provides an easy way to have a fully functioning self-hosted IR inside an Azure
virtual network with high-availability and scalability features (as long as you set the node count to 2 or
higher).
Command flow and data flow
When you move data between on-premises and the cloud, the activity uses a self-hosted integration runtime
to transfer the data from an on-premises data source to the cloud and vice versa.
Here is a high-level data flow for the summary of steps for copying with a self-hosted IR:

1. The data developer creates a self-hosted integration runtime within an Azure data factory by using a
PowerShell cmdlet. Currently, the Azure portal does not support this feature.
2. The data developer creates a linked service for an on-premises data store by specifying the self-hosted
integration runtime instance that it should use to connect to data stores.
3. The self-hosted integration runtime node encrypts the credentials by using Windows Data Protection
Application Programming Interface (DPAPI) and saves the credentials locally. If multiple nodes are set for
high availability, the credentials are further synchronized across other nodes. Each node encrypts the
credentials by using DPAPI and stores them locally. Credential synchronization is transparent to the data
developer and is handled by the self-hosted IR.
4. The Data Factory service communicates with the self-hosted integration runtime for scheduling and
management of jobs via a control channel that uses a shared Azure Service Bus Relay. When an activity
job needs to be run, Data Factory queues the request along with any credential information (in case
credentials are not already stored on the self-hosted integration runtime). The self-hosted integration
runtime kicks off the job after polling the queue.
5. The self-hosted integration runtime copies data from an on-premises store to a cloud storage, or vice
versa depending on how the copy activity is configured in the data pipeline. For this step, the self-hosted
integration runtime directly communicates with cloud-based storage services such as Azure Blob storage
over a secure (HTTPS ) channel.
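As a hedged illustration of step 2 above, the following sketch defines a linked service for an on-premises SQL Server that routes through the self-hosted IR by way of its connectVia property, assuming the Set-AzDataFactoryV2LinkedService cmdlet with its -DefinitionFile parameter. The names, file path, and connection string are placeholders.

# Hypothetical names and connection string; adjust to your environment.
$definition = @'
{
    "name": "OnPremSqlServerLinkedService",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "Server=<on-premises server>;Database=<database>;User ID=<user>;Password=<password>;"
            }
        },
        "connectVia": {
            "referenceName": "<your self-hosted IR name>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
'@
Set-Content -Path .\OnPremSqlServerLinkedService.json -Value $definition

Set-AzDataFactoryV2LinkedService -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -Name "OnPremSqlServerLinkedService" `
    -DefinitionFile ".\OnPremSqlServerLinkedService.json"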
Considerations for using a self-hosted IR
A single self-hosted integration runtime can be used for multiple on-premises data sources. A single self-
hosted integration runtime can be shared with another data factory within the same Azure Active
Directory tenant. For more information, see Sharing a self-hosted integration runtime.
You can have only one instance of a self-hosted integration runtime installed on a single machine. If you
have two data factories that need to access on-premises data sources, either use the self-hosted IR
sharing feature to share the self-hosted integration runtime, or install the self-hosted integration runtime
on two on-premises computers, one for each data factory.
The self-hosted integration runtime does not need to be on the same machine as the data source.
However, having the self-hosted integration runtime closer to the data source reduces the time for the
self-hosted integration runtime to connect to the data source. We recommend that you install the self-
hosted integration runtime on a machine that is different from the one that hosts on-premises data
source. When the self-hosted integration runtime and data source are on different machines, the self-
hosted integration runtime does not compete for resources with the data source.
You can have multiple self-hosted integration runtimes on different machines that connect to the same
on-premises data source. For example, you might have two self-hosted integration runtimes that serve
two data factories, but the same on-premises data source is registered with both the data factories.
If you already have a gateway installed on your computer to serve a Power BI scenario, install a separate
self-hosted integration runtime for Azure Data Factory on another machine.
The self-hosted integration runtime must be used for supporting data integration within an Azure virtual
network.
Treat your data source as an on-premises data source that is behind a firewall, even when you use Azure
ExpressRoute. Use the self-hosted integration runtime to establish connectivity between the service and
the data source.
You must use the self-hosted integration runtime even if the data store is in the cloud on an Azure IaaS
virtual machine.
Tasks might fail in a self-hosted integration runtime that's installed on a Windows server on which FIPS-compliant
encryption is enabled. To work around this problem, disable FIPS-compliant encryption on the server by changing
the following registry value from 1 (enabled) to 0 (disabled):
HKLM\System\CurrentControlSet\Control\Lsa\FIPSAlgorithmPolicy\Enabled. A sketch of this change follows.
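For reference, a minimal sketch of that registry change from an elevated PowerShell session on the self-hosted IR machine (a reboot is typically required for the change to take effect):

# Disable FIPS-compliant encryption by setting the value described above to 0 (disabled).
Set-ItemProperty `
    -Path 'HKLM:\System\CurrentControlSet\Control\Lsa\FIPSAlgorithmPolicy' `
    -Name 'Enabled' `
    -Value 0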

Prerequisites
The supported operating system versions are Windows 7 Service Pack 1, Windows 8.1, Windows 10,
Windows Server 2008 R2 SP1, Windows Server 2012, Windows Server 2012 R2, and Windows Server
2016. Installation of the self-hosted integration runtime on a domain controller is not supported.
.NET Framework 4.6.1 or later is required. If you're installing the self-hosted integration runtime on a
Windows 7 machine, install .NET Framework 4.6.1 or later. See .NET Framework System Requirements
for details.
The recommended configuration for the self-hosted integration runtime machine is at least 2 GHz, four
cores, 8 GB of RAM, and an 80-GB disk.
If the host machine hibernates, the self-hosted integration runtime does not respond to data requests.
Configure an appropriate power plan on the computer before you install the self-hosted integration
runtime. If the machine is configured to hibernate, the self-hosted integration runtime installation
prompts a message.
You must be an administrator on the machine to install and configure the self-hosted integration runtime
successfully.
Copy activity runs happen on a specific frequency. Resource usage (CPU, memory) on the machine
follows the same pattern with peak and idle times. Resource utilization also depends heavily on the
amount of data being moved. When multiple copy jobs are in progress, you see resource usage go up
during peak times.

Installation best practices


You can install the self-hosted integration runtime by downloading an MSI setup package from the
Microsoft Download Center. See Move data between on-premises and cloud article for step-by-step
instructions.
Configure a power plan on the host machine for the self-hosted integration runtime so that the machine
does not hibernate. If the host machine hibernates, the self-hosted integration runtime goes offline.
Back up the credentials associated with the self-hosted integration runtime regularly.

Install and register self-hosted IR from the Download Center


1. Go to the Microsoft integration runtime download page.
2. Select Download, select the 64-bit version (32-bit is not supported), and select Next.
3. Run the MSI file directly, or save it to your hard disk and run it.
4. On the Welcome page, select a language and select Next.
5. Accept the Microsoft Software License Terms and select Next.
6. Select folder to install the self-hosted integration runtime, and select Next.
7. On the Ready to install page, select Install.
8. Click Finish to complete installation.
9. Get the authentication key by using Azure PowerShell. Here's a PowerShell example for retrieving the
authentication key:

Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -Name $selfHostedIntegrationRuntimeName

10. On the Register Integration Runtime (Self-hosted) page of Microsoft Integration Runtime
Configuration Manager running on your machine, take the following steps:
a. Paste the authentication key in the text area.
b. Optionally, select Show authentication key to see the key text.
c. Select Register.

High availability and scalability


A self-hosted integration runtime can be associated with multiple on-premises machines or Virtual Machines
in Azure. These machines are called nodes. You can have up to four nodes associated with a self-hosted
integration runtime. The benefits of having multiple nodes (on-premises machines with a gateway installed)
for a logical gateway are:
Higher availability of the self-hosted integration runtime so that it's no longer the single point of failure in
your big data solution or cloud data integration with Azure Data Factory, ensuring continuity with up to
four nodes.
Improved performance and throughput during data movement between on-premises and cloud data
stores. Get more information on performance comparisons.
You can associate multiple nodes by installing the self-hosted integration runtime software from the
Download Center. Then, register it by using either of the authentication keys obtained from the
New-AzDataFactoryV2IntegrationRuntimeKey cmdlet, as described in the tutorial.

NOTE
You don't need to create a new self-hosted integration runtime to associate each node. You can install the self-hosted
integration runtime on another machine and register it by using the same authentication key.

NOTE
Before you add another node for high availability and scalability, ensure that the Remote access to intranet option is
enabled on the first node (Microsoft Integration Runtime Configuration Manager > Settings > Remote access
to intranet).

Scale considerations
Scale out
When the available memory on the self-hosted IR is low and the CPU usage is high, adding a new node
helps scale out the load across machines. If activities are failing because they're timing out or because the
self-hosted IR node is offline, it helps if you add a node to the gateway.
Scale up
When the available memory and CPU are not utilized well, but the execution of concurrent jobs is reaching
the limit, you should scale up by increasing the number of concurrent jobs that can run on a node. You might
also want to scale up when activities are timing out because the self-hosted IR is overloaded. You can increase
the maximum capacity for a node, as sketched below:
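A hedged PowerShell sketch of that change, assuming the Update-AzDataFactoryV2IntegrationRuntimeNode cmdlet and its -ConcurrentJobsLimit parameter are available in your Az.DataFactory version; the node name is a placeholder you can read from the integration runtime's status output:

# "Node_1" is a placeholder node name; 8 is an example concurrent-jobs limit.
Update-AzDataFactoryV2IntegrationRuntimeNode `
    -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -IntegrationRuntimeName $selfHostedIntegrationRuntimeName `
    -Name "Node_1" `
    -ConcurrentJobsLimit 8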

TLS/SSL certificate requirements


Here are the requirements for the TLS/SSL certificate that is used for securing communications between
integration runtime nodes:
The certificate must be a publicly trusted X509 v3 certificate. We recommend that you use certificates
that are issued by a public (partner) certification authority (CA).
Each integration runtime node must trust this certificate.
We don't recommend Subject Alternative Name (SAN ) certificates because only the last SAN item will be
used and all others will be ignored due to current limitations. For example, if you have a SAN certificate
whose SANs are node1.domain.contoso.com and node2.domain.contoso.com, you can use this
certificate only on a machine whose FQDN is node2.domain.contoso.com.
The certificate supports any key size supported by Windows Server 2012 R2 for SSL certificates.
Certificates that use CNG keys are not supported.

NOTE
This certificate is used to encrypt ports on the self-hosted IR node. It is used for node-to-node communication (for state
synchronization, which includes synchronization of linked services' credentials across nodes) and when you use a
PowerShell cmdlet to set linked service credentials from within the local network. We suggest using this certificate if your
private network environment is not secure or if you want to secure the communication between nodes within
your private network as well. Data movement in transit from the self-hosted IR to other data stores always happens over an
encrypted channel, regardless of whether this certificate is set.

Sharing the self-hosted integration runtime with multiple data factories
You can reuse an existing self-hosted integration runtime infrastructure that you already set up in a data
factory. This enables you to create a linked self-hosted integration runtime in a different data factory by
referencing an existing self-hosted IR (shared).
To share a self-hosted integration runtime by using PowerShell, see Create a shared self-hosted integration
runtime in Azure Data Factory with PowerShell.
For a twelve-minute introduction and demonstration of this feature, see the video embedded in the online version of this article.

Terminology
Shared IR: The original self-hosted IR that's running on a physical infrastructure.
Linked IR: The IR that references another shared IR. This is a logical IR and uses the infrastructure of
another self-hosted IR (shared).
High-level steps for creating a linked self-hosted IR
1. In the self-hosted IR to be shared, grant permission to the data factory in which you want to create
the linked IR.
2. Note the resource ID of the self-hosted IR to be shared.
3. In the data factory to which the permissions were granted, create a new self-hosted IR (linked) and
enter the resource ID.
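The following is a minimal PowerShell sketch of those three steps. It assumes that the target data factory already has a managed identity, that your Az.DataFactory version supports the -SharedIntegrationRuntimeResourceId parameter, and that the shared IR's resource ID is exposed as $sharedIr.Id; the factory and IR names are placeholders.

# Steps 1-2: look up the shared IR, note its resource ID, and grant the target factory's
# managed identity the Contributor role on it.
$sharedIr = Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName `
    -DataFactoryName "<shared data factory name>" `
    -Name "<shared self-hosted IR name>"
$targetFactory = Get-AzDataFactoryV2 -ResourceGroupName $resourceGroupName `
    -Name "<data factory that will use the linked IR>"
New-AzRoleAssignment -ObjectId $targetFactory.Identity.PrincipalId `
    -RoleDefinitionName "Contributor" `
    -Scope $sharedIr.Id

# Step 3: in the target factory, create a linked IR that references the shared IR's resource ID.
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName `
    -DataFactoryName "<data factory that will use the linked IR>" `
    -Name "LinkedSelfHostedIR" `
    -Type SelfHosted `
    -SharedIntegrationRuntimeResourceId $sharedIr.Id `
    -Description "Linked IR that uses the shared self-hosted IR's infrastructure"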
Monitoring
Shared IR
Linked IR

Known limitations of self-hosted IR sharing


The data factory in which a linked IR will be created must have an MSI. By default, the data factories
created in the Azure portal or PowerShell cmdlets have an MSI created implicitly. But when a data
factory is created through an Azure Resource Manager template or SDK, the Identity property must
be set explicitly to ensure that Azure Resource Manager creates a data factory that contains an MSI.
The Azure Data Factory .NET SDK that supports this feature is version 1.1.0 or later.
To grant permission, the user needs the Owner role or the inherited Owner role in the data factory
where the shared IR exists.
Sharing feature works only for Data Factories within the same Azure Active Directory tenant.
For Active Directory guest users, the search functionality (listing all data factories by using a search
keyword) in the UI does not work. But as long as the guest user is the Owner of the data factory, they
can share the IR without the search functionality, by directly typing the MSI of the data factory with
which the IR needs to be shared in the Assign Permission text box and selecting Add in the Azure
Data Factory UI.

NOTE
This feature is available only in Azure Data Factory V2.

Notification area icons and notifications


If you move your cursor over the icon or message in the notification area, you can find details about the state
of the self-hosted integration runtime.

Ports and firewall


There are two firewalls to consider: the corporate firewall running on the central router of the organization,
and the Windows firewall configured as a daemon on the local machine where the self-hosted integration
runtime is installed.

At the corporate firewall level, you need to configure the following domains and outbound ports:

DOMAIN NAMES                   PORTS   DESCRIPTION
*.servicebus.windows.net       443     Used for communication with the back-end data movement service
*.core.windows.net             443     Used for staged copy through Azure Blob storage (if configured)
*.frontend.clouddatahub.net    443     Used for communication with the back-end data movement service
download.microsoft.com         443     Used for downloading the updates

At the Windows firewall level (machine level), these outbound ports are normally enabled. If not, you can
configure the domains and ports accordingly on a self-hosted integration runtime machine.

NOTE
Based on your source and sinks, you might have to whitelist additional domains and outbound ports in your
corporate firewall or Windows firewall.
For some cloud databases (for example, Azure SQL Database and Azure Data Lake), you might need to whitelist IP
addresses of self-hosted integration runtime machines on their firewall configuration.

Copy data from a source to a sink


Ensure that the firewall rules are enabled properly on the corporate firewall, the Windows firewall on the
self-hosted integration runtime machine, and the data store itself. Enabling these rules allows the self-hosted
integration runtime to connect to both source and sink successfully. Enable rules for each data store that is
involved in the copy operation.
For example, to copy from an on-premises data store to an Azure SQL Database sink or an Azure SQL Data
Warehouse sink, take the following steps (a PowerShell sketch follows the list):
1. Allow outbound TCP communication on port 1433 for both Windows firewall and corporate firewall.
2. Configure the firewall settings of the Azure SQL database to add the IP address of the self-hosted
integration runtime machine to the list of allowed IP addresses.
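For illustration, a hedged sketch of both steps from PowerShell; the rule names, server name, and IP address are placeholders.

# Step 1: allow outbound TCP 1433 through the Windows firewall on the self-hosted IR machine
# (run in an elevated session; apply an equivalent rule on the corporate firewall).
New-NetFirewallRule -DisplayName "ADF self-hosted IR - outbound SQL 1433" `
    -Direction Outbound `
    -Protocol TCP `
    -RemotePort 1433 `
    -Action Allow

# Step 2: add the public IP address of the self-hosted IR machine to the Azure SQL server firewall.
# Requires the Az.Sql module.
New-AzSqlServerFirewallRule -ResourceGroupName "<your resource group>" `
    -ServerName "<your Azure SQL Database server name>" `
    -FirewallRuleName "SelfHostedIrMachine" `
    -StartIpAddress "<public IP of the IR machine>" `
    -EndIpAddress "<public IP of the IR machine>"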

NOTE
If your firewall does not allow outbound port 1433, the self-hosted integration runtime can't access the Azure SQL
database directly. In this case, you can use a staged copy to Azure SQL Database and Azure SQL Data Warehouse. In
this scenario, you would require only HTTPS (port 443) for the data movement.

Proxy server considerations


If your corporate network environment uses a proxy server to access the internet, configure the self-hosted
integration runtime to use appropriate proxy settings. You can set the proxy during the initial registration
phase.
When configured, the self-hosted integration runtime uses the proxy server to connect to the cloud service and to
sources and destinations that use the HTTP or HTTPS protocol. To set the proxy, select the Change link during
initial setup; the proxy-setting dialog box appears.

There are three configuration options:


Do not use proxy: The self-hosted integration runtime does not explicitly use any proxy to connect to
cloud services.
Use system proxy: The self-hosted integration runtime uses the proxy setting that is configured in
diahost.exe.config and diawp.exe.config. If no proxy is configured in diahost.exe.config and
diawp.exe.config, the self-hosted integration runtime connects to the cloud service directly without going
through a proxy.
Use custom proxy: Configure the HTTP proxy setting to use for the self-hosted integration runtime,
instead of using configurations in diahost.exe.config and diawp.exe.config. Address and Port are
required. User Name and Password are optional depending on your proxy’s authentication setting. All
settings are encrypted with Windows DPAPI on the self-hosted integration runtime and stored locally on
the machine.
The integration runtime Host Service restarts automatically after you save the updated proxy settings.
After the self-hosted integration runtime has been successfully registered, if you want to view or update
proxy settings, use Integration Runtime Configuration Manager.
1. Open Microsoft Integration Runtime Configuration Manager.
2. Switch to the Settings tab.
3. Select the Change link in the HTTP Proxy section to open the Set HTTP Proxy dialog box.
4. Select Next. You then see a warning that asks for your permission to save the proxy setting and restart
the integration runtime Host Service.
You can view and update the HTTP proxy by using the Configuration Manager tool.

NOTE
If you set up a proxy server with NTLM authentication, the integration runtime Host Service runs under the domain
account. If you change the password for the domain account later, remember to update the configuration settings for
the service and restart it accordingly. Due to this requirement, we suggest that you use a dedicated domain account
to access the proxy server that does not require you to update the password frequently.

Configure proxy server settings


If you select the Use system proxy setting for the HTTP proxy, the self-hosted integration runtime uses the
proxy setting in diahost.exe.config and diawp.exe.config. If no proxy is specified in diahost.exe.config and
diawp.exe.config, the self-hosted integration runtime connects to the cloud service directly without going
through proxy. The following procedure provides instructions for updating the diahost.exe.config file:
1. In File Explorer, make a safe copy of C:\Program Files\Microsoft Integration
Runtime\3.0\Shared\diahost.exe.config to back up the original file.
2. Open Notepad.exe running as an administrator, and open the text file C:\Program Files\Microsoft
Integration Runtime\3.0\Shared\diahost.exe.config. Find the default tag for system.net as shown in
the following code:
<system.net>
<defaultProxy useDefaultCredentials="true" />
</system.net>

You can then add proxy server details as shown in the following example:

<system.net>
<defaultProxy enabled="true">
<proxy bypassonlocal="true" proxyaddress="http://proxy.domain.org:8888/" />
</defaultProxy>
</system.net>

Additional properties are allowed inside the proxy tag to specify the required settings like
scriptLocation . See proxy Element ( Network Settings) for syntax.

<proxy autoDetect="true|false|unspecified" bypassonlocal="true|false|unspecified"
    proxyaddress="uriString" scriptLocation="uriString" usesystemdefault="true|false|unspecified" />

3. Save the configuration file in the original location. Then restart the self-hosted integration runtime
Host Service, which picks up the changes.
To restart the service, use the services applet from the control panel. Or from Integration Runtime
Configuration Manager, select the Stop Service button, and then select Start Service.
If the service does not start, it's likely that an incorrect XML tag syntax was added in the application
configuration file that was edited.
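If you prefer to restart the service from PowerShell, a minimal sketch follows; it assumes the service's display name contains "Integration Runtime", which is how the self-hosted integration runtime service is typically registered on the machine.

# Run in an elevated session on the self-hosted IR machine.
Get-Service -DisplayName "*Integration Runtime*" | Restart-Service -Verbose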

IMPORTANT
Don't forget to update both diahost.exe.config and diawp.exe.config.

You also need to make sure that Microsoft Azure is in your company’s whitelist. You can download the list of
valid Microsoft Azure IP addresses from the Microsoft Download Center.
Possible symptoms for firewall and proxy server-related issues
If you encounter errors similar to the following ones, it's likely due to improper configuration of the firewall
or proxy server, which blocks the self-hosted integration runtime from connecting to Data Factory to
authenticate itself. To ensure that your firewall and proxy server are properly configured, refer to the
previous section.
When you try to register the self-hosted integration runtime, you receive the following error: "Failed
to register this Integration Runtime node! Confirm that the Authentication key is valid and the
integration service Host Service is running on this machine."
When you open Integration Runtime Configuration Manager, you see a status of Disconnected or
Connecting. When you're viewing Windows event logs, under Event Viewer > Application and
Services Logs > Microsoft Integration Runtime, you see error messages like this one:

Unable to connect to the remote server


A component of Integration Runtime has become unresponsive and restarts automatically. Component
name: Integration Runtime (Self-hosted).

Enabling remote access from an intranet


If you use PowerShell to encrypt credentials from another machine (in the network) other than where the
self-hosted integration runtime is installed, you can enable the Remote Access from Intranet option. If you
run PowerShell to encrypt credentials on the same machine where the self-hosted integration runtime is
installed, you can't enable Remote Access from Intranet.
You should enable Remote Access from Intranet before you add another node for high availability and
scalability.
During self-hosted integration runtime setup (version 3.3.xxxx.x or later), by default, the self-hosted integration
runtime installation disables Remote Access from Intranet on the self-hosted integration runtime
machine.
If you're using a third-party firewall, you can manually open port 8060 (or the user-configured port). If you
have a firewall problem while setting up the self-hosted integration runtime, try using the following
command to install the self-hosted integration runtime without configuring the firewall:

msiexec /q /i IntegrationRuntime.msi NOFIREWALL=1

If you choose not to open port 8060 on the self-hosted integration runtime machine, use mechanisms other
than the Setting Credentials application to configure data store credentials. For example, you can use the
New-AzDataFactoryV2LinkedServiceEncryptCredential PowerShell cmdlet.

Next steps
See the following tutorial for step-by-step instructions: Tutorial: Copy on-premises data to cloud.
Create Azure-SSIS Integration Runtime in Azure
Data Factory
4/9/2019 • 23 minutes to read

This article provides steps for provisioning Azure-SSIS Integration Runtime (IR ) in Azure Data Factory (ADF ).
Then, you can use SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS ) to deploy and
run SQL Server Integration Services (SSIS ) packages on this integration runtime in Azure.
The Tutorial: Deploy SSIS packages to Azure shows you how to create Azure-SSIS IR by using Azure SQL
Database server to host SSIS catalog database (SSISDB ). This article expands on the tutorial and shows you
how to do the following things:
Optionally use Azure SQL Database server with virtual network service endpoints/Managed Instance to
host SSISDB. For guidance in choosing the type of database server to host SSISDB, see Compare Azure
SQL Database single databases/elastic pools and Managed Instance. As a prerequisite, you need to join
your Azure-SSIS IR to a virtual network and configure virtual network permissions/settings as necessary.
See Join Azure-SSIS IR to a virtual network.
Optionally use Azure Active Directory (AAD ) authentication with the managed identity for your ADF to
connect to the database server. As a prerequisite, you will need to add the managed identity for your ADF
as a contained database user capable of creating SSISDB in your Azure SQL Database server/Managed
Instance, see Enable AAD authentication for Azure-SSIS IR.

Overview
This article shows different ways of provisioning Azure-SSIS IR:
Azure portal
Azure PowerShell
Azure Resource Manager template
When you create Azure-SSIS IR, ADF service connects to your Azure SQL Database server/Managed Instance
to prepare SSISDB. It also configures permissions/settings for your virtual network, if specified, and joins your
Azure-SSIS IR to the virtual network.
When you provision Azure-SSIS IR, Azure Feature Pack for SSIS and Access Redistributable are also installed.
These components provide connectivity to Excel/Access files and various Azure data sources, in addition to the
data sources supported by built-in components. You can also install additional components. For more info, see
Custom setup for the Azure-SSIS integration runtime.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which
will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.

Azure subscription. If you do not already have a subscription, you can create a free trial account.
Azure SQL Database server or Managed Instance. If you do not already have a database server, you
can create one in Azure portal before you get started. This server will host SSISDB. We recommend that
you create the database server in the same Azure region as your integration runtime. This configuration
lets your integration runtime write execution logs to SSISDB without crossing Azure regions. Based on
the selected database server, SSISDB can be created on your behalf as a single database, part of an
elastic pool, or in your Managed Instance and accessible in public network or by joining a virtual network.
For a list of supported pricing tiers for Azure SQL Database, see SQL Database resource limits.
Make sure that your Azure SQL Database server/Managed Instance does not already have an SSISDB.
The provisioning of Azure-SSIS IR does not support using an existing SSISDB.
Azure Resource Manager virtual network (optional). You must have an Azure Resource Manager
virtual network if at least one of the following conditions is true:
You are hosting SSISDB in Azure SQL Database server with virtual network service endpoints or in
Managed Instance that is inside a virtual network.
You want to connect to on-premises data stores from SSIS packages running on your Azure-SSIS IR.
Azure PowerShell. Follow the instructions on How to install and configure Azure PowerShell, if you
want to run a PowerShell script to provision Azure-SSIS IR.
Region support
For a list of Azure regions, in which ADF and Azure-SSIS IR are currently available, see ADF + SSIS IR
availability by region.
Compare SQL Database single database/elastic pool and SQL Database Managed Instance
The following table compares certain features of an Azure SQL Database server and a Managed Instance as they
relate to the Azure-SSIS IR:

Scheduling
- Single database/elastic pool: SQL Server Agent is not available. See Schedule a package execution in ADF pipeline.
- Managed Instance: Managed Instance Agent is available.

Authentication
- Single database/elastic pool: You can create SSISDB with a contained database user representing any AAD group with the managed identity of your ADF as a member in the db_owner role. See Enable Azure AD authentication to create SSISDB in Azure SQL Database server.
- Managed Instance: You can create SSISDB with a contained database user representing the managed identity of your ADF. See Enable Azure AD authentication to create SSISDB in Azure SQL Database Managed Instance.

Service tier
- Single database/elastic pool: When you create Azure-SSIS IR with your Azure SQL Database server, you can select the service tier for SSISDB. There are multiple service tiers.
- Managed Instance: When you create Azure-SSIS IR with your Managed Instance, you cannot select the service tier for SSISDB. All databases in your Managed Instance share the same resource allocated to that instance.

Virtual network
- Single database/elastic pool: Supports only Azure Resource Manager virtual networks for your Azure-SSIS IR to join if you use Azure SQL Database server with virtual network service endpoints or require access to on-premises data stores.
- Managed Instance: Supports only Azure Resource Manager virtual networks for your Azure-SSIS IR to join. The virtual network is always required. If you join your Azure-SSIS IR to the same virtual network as your Managed Instance, make sure that your Azure-SSIS IR is in a different subnet than your Managed Instance. If you join your Azure-SSIS IR to a different virtual network than your Managed Instance, we recommend either a virtual network peering or a virtual network to virtual network connection. See Connect your application to Azure SQL Database Managed Instance.

Distributed transactions
- Single database/elastic pool: Supported through Elastic Transactions. Microsoft Distributed Transaction Coordinator (MSDTC) transactions are not supported. If your SSIS packages use MSDTC to coordinate distributed transactions, consider migrating to Elastic Transactions for Azure SQL Database. For more info, see Distributed transactions across cloud databases.
- Managed Instance: Not supported.

Azure portal
In this section, you use Azure portal, specifically ADF User Interface (UI)/app, to create Azure-SSIS IR.
Create a data factory
1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only
in Microsoft Edge and Google Chrome web browsers.
2. Sign in to the Azure portal.
3. Click New on the left menu, click Data + Analytics, and click Data Factory.
4. In the New data factory page, enter MyAzureSsisDataFactory for the name.
The name of the Azure data factory must be globally unique. If you receive the following error, change
the name of the data factory (for example, yournameMyAzureSsisDataFactory) and try creating again.
See Data Factory - Naming Rules article for naming rules for Data Factory artifacts.
Data factory name “MyAzureSsisDataFactory” is not available

5. Select your Azure subscription in which you want to create the data factory.
6. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. Select V2 for the version.
8. Select the location for the data factory. Only locations that are supported for creation of data factories
are shown in the list.
9. Select Pin to dashboard.
10. Click Create.
11. On the dashboard, you see the following tile with status: Deploying data factory.

12. After the creation is complete, you see the Data Factory page as shown in the image.
13. Click Author & Monitor to launch the Data Factory User Interface (UI) in a separate tab.
Provision an Azure SSIS integration runtime
1. In the get started page, click Configure SSIS Integration Runtime tile.

2. On the General Settings page of Integration Runtime Setup, complete the following steps:
a. For Name, enter the name of your integration runtime.
b. For Description, enter the description of your integration runtime.
c. For Location, select the location of your integration runtime. Only supported locations are displayed.
We recommend that you select the same location of your database server to host SSISDB.
d. For Node Size, select the size of node in your integration runtime cluster. Only supported node sizes
are displayed. Select a large node size (scale up), if you want to run many compute/memory –intensive
packages.
e. For Node Number, select the number of nodes in your integration runtime cluster. Only supported
node numbers are displayed. Select a large cluster with many nodes (scale out), if you want to run many
packages in parallel.
f. For Edition/License, select SQL Server edition/license for your integration runtime: Standard or
Enterprise. Select Enterprise, if you want to use advanced/premium features on your integration runtime.
g. For Save Money, select Azure Hybrid Benefit (AHB ) option for your integration runtime: Yes or No.
Select Yes, if you want to bring your own SQL Server license with Software Assurance to benefit from
cost savings with hybrid use.
h. Click Next.
3. On the SQL Settings page, complete the following steps:

a. For Subscription, select the Azure subscription that has your database server to host SSISDB.
b. For Location, select the location of your database server to host SSISDB. We recommend that you
select the same location of your integration runtime.
c. For Catalog Database Server Endpoint, select the endpoint of your database server to host SSISDB.
Based on the selected database server, SSISDB can be created on your behalf as a single database, part
of an elastic pool, or in a Managed Instance and accessible in public network or by joining a virtual
network.
d. On Use AAD authentication... checkbox, select the authentication method for your database server
to host SSISDB: SQL or Azure Active Directory (AAD ) with the managed identity for your Azure Data
Factory. If you check it, you need to add the managed identity for your ADF into an AAD group with
access permissions to the database server, see Enable AAD authentication for Azure-SSIS IR.
e. For Admin Username, enter SQL authentication username for your database server to host SSISDB.
f. For Admin Password, enter SQL authentication password for your database server to host SSISDB.
g. For Catalog Database Service Tier, select the service tier for your database server to host SSISDB:
Basic/Standard/Premium tier or elastic pool name.
h. Click Test Connection and if successful, click Next.
4. On the Advanced Settings page, complete the following steps:
a. For Maximum Parallel Executions Per Node, select the maximum number of packages to execute
concurrently per node in your integration runtime cluster. Only supported package numbers are
displayed. Select a low number, if you want to use more than one core to run a single large/heavy-weight
package that is compute/memory -intensive. Select a high number, if you want to run one or more
small/light-weight packages in a single core.
b. For Custom Setup Container SAS URI, optionally enter Shared Access Signature (SAS ) Uniform
Resource Identifier (URI) of your Azure Storage Blob container where your setup script and its associated
files are stored, see Custom setup for Azure-SSIS IR.
5. On Select a virtual network... checkbox, select whether you want to join your integration runtime to a
virtual network. Check it if you use Azure SQL Database with virtual network service
endpoints/Managed Instance to host SSISDB or require access to on-premises data; that is, you have on-
premises data sources/destinations in your SSIS packages, see Join Azure-SSIS IR to a virtual network. If
you check it, complete the following steps:

a. For Subscription, select the Azure subscription that has your virtual network.
b. For Location, the same location of your integration runtime is selected.
c. For Type, select the type of your virtual network: Classic or Azure Resource Manager. We recommend
that you select Azure Resource Manager virtual network, since Classic virtual network will be deprecated
soon.
d. For VNet Name, select the name of your virtual network. This virtual network should be the same
virtual network used for Azure SQL Database with virtual network service endpoints/Managed Instance
to host SSISDB and/or the one connected to your on-premises network.
e. For Subnet Name, select the name of subnet for your virtual network. This should be a different
subnet than the one used for Managed Instance to host SSISDB.
6. Click VNet Validation and if successful, click Finish to start the creation of your Azure-SSIS integration
runtime.

IMPORTANT
This process takes approximately 20 to 30 minutes to complete.
The Data Factory service connects to your Azure SQL Database to prepare the SSIS Catalog database (SSISDB).
It also configures permissions and settings for your virtual network, if specified, and joins the new instance of
Azure-SSIS integration runtime to the virtual network.

7. In the Connections window, switch to Integration Runtimes if needed. Click Refresh to refresh the
status.

8. Use the links under Actions column to stop/start, edit, or delete the integration runtime. Use the last link
to view JSON code for the integration runtime. The edit and delete buttons are enabled only when the IR
is stopped.
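For reference, a hedged sketch of the same stop, start, and delete actions from PowerShell, reusing the variable names defined later in the Azure PowerShell section of this article:

Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Force

Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Force

# Remove the IR only after it has been stopped.
Remove-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Force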

Azure SSIS integration runtimes in the portal


1. In the Azure Data Factory UI, switch to the Edit tab, click Connections, and then switch to Integration
Runtimes tab to view existing integration runtimes in your data factory.

2. Click New to create a new Azure-SSIS IR.


3. To create an Azure-SSIS integration runtime, click New as shown in the image.
4. In the Integration Runtime Setup window, select Lift-and-shift existing SSIS packages to execute in
Azure, and then click Next.

5. See the Provision an Azure SSIS integration runtime section for the remaining steps to set up an Azure-
SSIS IR.

Azure PowerShell
In this section, you use the Azure PowerShell to create an Azure-SSIS IR.
Create variables
Define variables for use in the script in this tutorial:
### Azure Data Factory information
# If your input contains a PSH special character, e.g. "$", precede it with the escape character "`" like
"`$"
$SubscriptionName = "[your Azure subscription name]"
$ResourceGroupName = "[your Azure resource group name]"
$DataFactoryName = "[your data factory name]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$DataFactoryLocation = "EastUS"

### Azure-SSIS integration runtime information - This is a Data Factory compute resource for running SSIS
packages
$AzureSSISName = "[specify a name for your Azure-SSIS IR]"
$AzureSSISDescription = "[specify a description for your Azure-SSIS IR]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium
features on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your
own on-premises SQL Server license with Software Assurance to earn cost savings from Azure Hybrid Benefit
(AHB) option
# For a Standard_D1_v2 node, up to 4 parallel executions per node are supported, but for other nodes, up to
(2 x number of cores) are currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info
$SetupScriptContainerSasUri = "" # OPTIONAL to provide SAS URI of blob container where your custom setup
script and its associated files are stored
# Virtual network info: Classic or Azure Resource Manager
$VnetId = "[your virtual network resource ID or leave it empty]" # REQUIRED if you use Azure SQL Database
with virtual network service endpoints/Managed Instance/on-premises data, Azure Resource Manager virtual
network is recommended, Classic virtual network will be deprecated soon
$SubnetName = "[your subnet name or leave it empty]" # WARNING: Please use the same subnet as the one used
with your Azure SQL Database with virtual network service endpoints or a different subnet than the one used
for your Managed Instance

### SSISDB info


$SSISDBServerEndpoint = "[your Azure SQL Database server name or Managed Instance name.DNS
prefix].database.windows.net" # WARNING: Please ensure that there is no existing SSISDB, so we can prepare
and manage one on your behalf
# Authentication info: SQL or Azure Active Directory (AAD)
$SSISDBServerAdminUserName = "[your server admin username for SQL authentication or leave it empty for AAD
authentication]"
$SSISDBServerAdminPassword = "[your server admin password for SQL authentication or leave it empty for AAD
authentication]"
$SSISDBPricingTier = "[Basic|S0|S1|S2|S3|S4|S6|S7|S9|S12|P1|P2|P4|P6|P11|P15|…|ELASTIC_POOL(name =
<elastic_pool_name>) for Azure SQL Database or leave it empty for Managed Instance]"

Sign in and select subscription


Add the following code to the script to sign in and select your Azure subscription:

Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName

Validate the connection to database


Add the following script to validate your Azure SQL Database server endpoint.
# Validate only when you do not use VNet nor AAD authentication
if([string]::IsNullOrEmpty($VnetId) -and [string]::IsNullOrEmpty($SubnetName))
{
if(![string]::IsNullOrEmpty($SSISDBServerAdminUserName) -and !
[string]::IsNullOrEmpty($SSISDBServerAdminPassword))
{
$SSISDBConnectionString = "Data Source=" + $SSISDBServerEndpoint + ";User ID=" +
$SSISDBServerAdminUserName + ";Password=" + $SSISDBServerAdminPassword
$sqlConnection = New-Object System.Data.SqlClient.SqlConnection $SSISDBConnectionString;
Try
{
$sqlConnection.Open();
}
Catch [System.Data.SqlClient.SqlException]
{
Write-Warning "Cannot connect to your Azure SQL Database server, exception: $_";
Write-Warning "Please make sure the server you specified has already been created. Do you want
to proceed? [Y/N]"
$yn = Read-Host
if(!($yn -ieq "Y"))
{
Return;
}
}
}
}

Configure virtual network


Add the following script to automatically configure virtual network permissions/settings for your Azure-SSIS
integration runtime to join.

# Make sure to run this script against the subscription to which the virtual network belongs
if(![string]::IsNullOrEmpty($VnetId) -and ![string]::IsNullOrEmpty($SubnetName))
{
# Register to the Azure Batch resource provider
$BatchApplicationId = "ddbf3205-c6bd-46ae-8127-60eb93363864"
$BatchObjectId = (Get-AzADServicePrincipal -ServicePrincipalName $BatchApplicationId).Id
Register-AzResourceProvider -ProviderNamespace Microsoft.Batch
while(!(Get-AzResourceProvider -ProviderNamespace
"Microsoft.Batch").RegistrationState.Contains("Registered"))
{
Start-Sleep -s 10
}
if($VnetId -match "/providers/Microsoft.ClassicNetwork/")
{
# Assign the VM contributor role to Microsoft.Batch
        New-AzRoleAssignment -ObjectId $BatchObjectId -RoleDefinitionName "Classic Virtual Machine Contributor" -Scope $VnetId
}
}

Create a resource group


Create an Azure resource group using the New-AzResourceGroup command. A resource group is a logical
container into which Azure resources are deployed and managed as a group.

New-AzResourceGroup -Location $DataFactoryLocation -Name $ResourceGroupName

Create a data factory


Run the following command to create a data factory.
Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `
-Location $DataFactoryLocation `
-Name $DataFactoryName

Create an integration runtime


Run the following commands to create an Azure-SSIS integration runtime that runs SSIS packages in Azure.
If you do not use Azure SQL Database with virtual network service endpoints/Managed Instance to host
SSISDB nor require access to on-premises data, you can omit VNetId and Subnet parameters or pass empty
values for them. Otherwise, you cannot omit them and must pass valid values from your virtual network
configuration, see Join Azure-SSIS IR to a virtual network.
If you use Managed Instance to host SSISDB, you can omit CatalogPricingTier parameter or pass an empty
value for it. Otherwise, you cannot omit it and must pass a valid value from the list of supported pricing tiers for
Azure SQL Database, see SQL Database resource limits.
If you use Azure Active Directory (AAD ) authentication with the managed identity for your Azure Data Factory
to connect to the database server, you can omit CatalogAdminCredential parameter, but you must add the
managed identity for your ADF into an AAD group with access permissions to the database server, see Enable
AAD authentication for Azure-SSIS IR. Otherwise, you cannot omit it and must pass a valid object formed from
your server admin username and password for SQL authentication.

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Description $AzureSSISDescription `
    -Type Managed `
    -Location $AzureSSISLocation `
    -NodeSize $AzureSSISNodeSize `
    -NodeCount $AzureSSISNodeNumber `
    -Edition $AzureSSISEdition `
    -LicenseType $AzureSSISLicenseType `
    -MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode `
    -VnetId $VnetId `
    -Subnet $SubnetName `
    -CatalogServerEndpoint $SSISDBServerEndpoint `
    -CatalogPricingTier $SSISDBPricingTier

# Add SetupScriptContainerSasUri parameter when you use custom setup


if(![string]::IsNullOrEmpty($SetupScriptContainerSasUri))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-SetupScriptContainerSasUri $SetupScriptContainerSasUri
}

# Add CatalogAdminCredential parameter when you do not use AAD authentication


if(![string]::IsNullOrEmpty($SSISDBServerAdminUserName) -and ![string]::IsNullOrEmpty($SSISDBServerAdminPassword))
{
$secpasswd = ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force
$serverCreds = New-Object System.Management.Automation.PSCredential($SSISDBServerAdminUserName,
$secpasswd)

    Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
        -DataFactoryName $DataFactoryName `
        -Name $AzureSSISName `
        -CatalogAdminCredential $serverCreds
}
Start integration runtime
Run the following command to start the Azure-SSIS integration runtime:

write-host("##### Starting #####")


Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force

write-host("##### Completed #####")


write-host("If any cmdlet is unsuccessful, please consider using -Debug option for diagnostics.")

This command takes from 20 to 30 minutes to complete.
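To confirm readiness, you can then query the integration runtime status, as in the following sketch; the returned State should report Started when the Azure-SSIS IR is ready to run packages.

Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Status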


Full script
Here is the full script that creates an Azure-SSIS integration runtime.

### Azure Data Factory information


# If your input contains a PSH special character, e.g. "$", precede it with the escape character "`" like
"`$"
$SubscriptionName = "[your Azure subscription name]"
$ResourceGroupName = "[your Azure resource group name]"
$DataFactoryName = "[your data factory name]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$DataFactoryLocation = "EastUS"

### Azure-SSIS integration runtime information - This is a Data Factory compute resource for running SSIS
packages
$AzureSSISName = "[specify a name for your Azure-SSIS IR]"
$AzureSSISDescription = "[specify a description for your Azure-SSIS IR]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium
features on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your
own on-premises SQL Server license with Software Assurance to earn cost savings from Azure Hybrid Benefit
(AHB) option
# For a Standard_D1_v2 node, up to 4 parallel executions per node are supported, but for other nodes, up to
(2 x number of cores) are currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info
$SetupScriptContainerSasUri = "" # OPTIONAL to provide SAS URI of blob container where your custom setup
script and its associated files are stored
# Virtual network info: Classic or Azure Resource Manager
$VnetId = "[your virtual network resource ID or leave it empty]" # REQUIRED if you use Azure SQL Database
with virtual network service endpoints/Managed Instance/on-premises data, Azure Resource Manager virtual
network is recommended, Classic virtual network will be deprecated soon
$SubnetName = "[your subnet name or leave it empty]" # WARNING: Please use the same subnet as the one used
with your Azure SQL Database with virtual network service endpoints or a different subnet than the one used
for your Managed Instance

### SSISDB info


$SSISDBServerEndpoint = "[your Azure SQL Database server name or Managed Instance name.DNS
prefix].database.windows.net" # WARNING: Please ensure that there is no existing SSISDB, so we can prepare
and manage one on your behalf
# Authentication info: SQL or Azure Active Directory (AAD)
$SSISDBServerAdminUserName = "[your server admin username for SQL authentication or leave it empty for AAD authentication]"
$SSISDBServerAdminPassword = "[your server admin password for SQL authentication or leave it empty for AAD
authentication]"
$SSISDBPricingTier = "[Basic|S0|S1|S2|S3|S4|S6|S7|S9|S12|P1|P2|P4|P6|P11|P15|…|ELASTIC_POOL(name =
<elastic_pool_name>) for Azure SQL Database or leave it empty for Managed Instance]"

### Log in and select subscription


Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName

### Validate the connection to database


# Validate only when you do not use VNet nor AAD authentication
if([string]::IsNullOrEmpty($VnetId) -and [string]::IsNullOrEmpty($SubnetName))
{
if(![string]::IsNullOrEmpty($SSISDBServerAdminUserName) -and !
[string]::IsNullOrEmpty($SSISDBServerAdminPassword))
{
$SSISDBConnectionString = "Data Source=" + $SSISDBServerEndpoint + ";User ID=" +
$SSISDBServerAdminUserName + ";Password=" + $SSISDBServerAdminPassword
$sqlConnection = New-Object System.Data.SqlClient.SqlConnection $SSISDBConnectionString;
Try
{
$sqlConnection.Open();
}
Catch [System.Data.SqlClient.SqlException]
{
Write-Warning "Cannot connect to your Azure SQL Database server, exception: $_";
Write-Warning "Please make sure the server you specified has already been created. Do you want
to proceed? [Y/N]"
$yn = Read-Host
if(!($yn -ieq "Y"))
{
Return;
}
}
}
}

### Configure virtual network


# Make sure to run this script against the subscription to which the virtual network belongs
if(![string]::IsNullOrEmpty($VnetId) -and ![string]::IsNullOrEmpty($SubnetName))
{
# Register to the Azure Batch resource provider
$BatchApplicationId = "ddbf3205-c6bd-46ae-8127-60eb93363864"
$BatchObjectId = (Get-AzADServicePrincipal -ServicePrincipalName $BatchApplicationId).Id
Register-AzResourceProvider -ProviderNamespace Microsoft.Batch
while(!(Get-AzResourceProvider -ProviderNamespace
"Microsoft.Batch").RegistrationState.Contains("Registered"))
{
Start-Sleep -s 10
}
if($VnetId -match "/providers/Microsoft.ClassicNetwork/")
{
# Assign the VM contributor role to Microsoft.Batch
        New-AzRoleAssignment -ObjectId $BatchObjectId -RoleDefinitionName "Classic Virtual Machine Contributor" -Scope $VnetId
}
}

### Create a data factory


Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `
-Location $DataFactoryLocation `
-Name $DataFactoryName

### Create an integration runtime


Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Description $AzureSSISDescription `
    -Type Managed `
    -Location $AzureSSISLocation `
    -NodeSize $AzureSSISNodeSize `
    -NodeCount $AzureSSISNodeNumber `
    -Edition $AzureSSISEdition `
    -LicenseType $AzureSSISLicenseType `
    -MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode `
    -VnetId $VnetId `
    -Subnet $SubnetName `
    -CatalogServerEndpoint $SSISDBServerEndpoint `
    -CatalogPricingTier $SSISDBPricingTier

# Add SetupScriptContainerSasUri parameter when you use custom setup


if(![string]::IsNullOrEmpty($SetupScriptContainerSasUri))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-SetupScriptContainerSasUri $SetupScriptContainerSasUri
}

# Add CatalogAdminCredential parameter when you do not use AAD authentication


if(![string]::IsNullOrEmpty($SSISDBServerAdminUserName) -and ![string]::IsNullOrEmpty($SSISDBServerAdminPassword))
{
$secpasswd = ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force
$serverCreds = New-Object System.Management.Automation.PSCredential($SSISDBServerAdminUserName,
$secpasswd)

    Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
        -DataFactoryName $DataFactoryName `
        -Name $AzureSSISName `
        -CatalogAdminCredential $serverCreds
}

### Start integration runtime


write-host("##### Starting your Azure-SSIS integration runtime. This command takes 20 to 30 minutes to complete. #####")
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force

write-host("##### Completed #####")


write-host("If any cmdlet is unsuccessful, please consider using -Debug option for diagnostics.")

Azure Resource Manager template


In this section, you use the Azure Resource Manager template to create Azure-SSIS integration runtime. Here is
a sample walkthrough:
1. Create a JSON file with the following Azure Resource Manager template. Replace values in the angled
brackets (place holders) with your own values.
{
    "contentVersion": "1.0.0.0",
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "parameters": {},
    "variables": {},
    "resources": [{
        "name": "<Specify a name for your data factory>",
        "apiVersion": "2018-06-01",
        "type": "Microsoft.DataFactory/factories",
        "location": "East US",
        "properties": {},
        "resources": [{
            "type": "integrationruntimes",
            "name": "<Specify a name for your Azure-SSIS IR>",
            "dependsOn": [ "<The name of the data factory you specified at the beginning>" ],
            "apiVersion": "2018-06-01",
            "properties": {
                "type": "Managed",
                "typeProperties": {
                    "computeProperties": {
                        "location": "East US",
                        "nodeSize": "Standard_D8_v3",
                        "numberOfNodes": 1,
                        "maxParallelExecutionsPerNode": 8
                    },
                    "ssisProperties": {
                        "catalogInfo": {
                            "catalogServerEndpoint": "<Azure SQL Database server name>.database.windows.net",
                            "catalogAdminUserName": "<Azure SQL Database server admin username>",
                            "catalogAdminPassword": {
                                "type": "SecureString",
                                "value": "<Azure SQL Database server admin password>"
                            },
                            "catalogPricingTier": "Basic"
                        }
                    }
                }
            }
        }]
    }]
}

2. To deploy the Azure Resource Manager template, run the New-AzResourceGroupDeployment command as
shown in the following example, where ADFTutorialResourceGroup is the name of your resource group
and ADFTutorialARM.json is the file that contains JSON definition for your data factory and Azure-SSIS
IR.

New-AzResourceGroupDeployment -Name MyARMDeployment `
    -ResourceGroupName ADFTutorialResourceGroup `
    -TemplateFile ADFTutorialARM.json

This command creates your data factory and Azure-SSIS IR in it, but it does not start the IR.
3. To start your Azure-SSIS IR, run Start-AzDataFactoryV2IntegrationRuntime command:

Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<Resource Group Name>" `
    -DataFactoryName "<Data Factory Name>" `
    -Name "<Azure SSIS IR Name>" `
    -Force

Deploy SSIS packages


Now, use SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS ) to deploy your SSIS
packages to Azure. Connect to your database server that hosts the SSIS catalog (SSISDB). The name of the
database server is in the format <Azure SQL Database server name>.database.windows.net or <Managed
Instance name.DNS prefix>.database.windows.net. See the Deploy packages article for instructions.

Next steps
See the other Azure-SSIS IR topics in this documentation:
Azure-SSIS Integration Runtime. This article provides conceptual information about integration runtimes in
general including the Azure-SSIS IR.
Tutorial: deploy SSIS packages to Azure. This article provides step-by-step instructions to create an Azure-
SSIS IR and uses an Azure SQL database to host the SSIS catalog.
Monitor an Azure-SSIS IR. This article shows you how to retrieve information about an Azure-SSIS IR and
descriptions of statuses in the returned information.
Manage an Azure-SSIS IR. This article shows you how to stop, start, or remove an Azure-SSIS IR. It also
shows you how to scale out your Azure-SSIS IR by adding more nodes to the IR.
Join an Azure-SSIS IR to a virtual network. This article provides conceptual information about joining your
Azure-SSIS IR to an Azure virtual network. It also provides steps to use Azure portal to configure virtual
network so that Azure-SSIS IR can join the virtual network.
Create a shared self-hosted integration runtime in
Azure Data Factory with PowerShell
3/26/2019 • 4 minutes to read

This step-by-step guide shows you how to create a shared self-hosted integration runtime in Azure Data Factory
by using Azure PowerShell. Then you can use the shared self-hosted integration runtime in another data factory. In
this tutorial, you take the following steps:
1. Create a data factory.
2. Create a self-hosted integration runtime.
3. Share the self-hosted integration runtime with other data factories.
4. Create a linked integration runtime.
5. Revoke the sharing.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
Azure PowerShell. Follow the instructions in Install Azure PowerShell on Windows with PowerShellGet.
You use PowerShell to run a script to create a self-hosted integration runtime that can be shared with other
data factories.

NOTE
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on Products
available by region.

Create a data factory


1. Launch the Windows PowerShell Integrated Scripting Environment (ISE).
2. Create variables. Copy and paste the following script. Replace the variables, such as SubscriptionName
and ResourceGroupName, with actual values:
# If input contains a PSH special character, e.g. "$", precede it with the escape character "`" like "`$".
$SubscriptionName = "[Azure subscription name]"
$ResourceGroupName = "[Azure resource group name]"
$DataFactoryLocation = "EastUS"

# Shared Self-hosted integration runtime information. This is a Data Factory compute resource for running any activities
# Data factory name. Must be globally unique
$SharedDataFactoryName = "[Shared Data factory name]"
$SharedIntegrationRuntimeName = "[Shared Integration Runtime Name]"
$SharedIntegrationRuntimeDescription = "[Description for Shared Integration Runtime]"

# Linked integration runtime information. This is a Data Factory compute resource for running any activities
# Data factory name. Must be globally unique
$LinkedDataFactoryName = "[Linked Data factory name]"
$LinkedIntegrationRuntimeName = "[Linked Integration Runtime Name]"
$LinkedIntegrationRuntimeDescription = "[Description for Linked Integration Runtime]"

3. Sign in and select a subscription. Add the following code to the script to sign in and select your Azure
subscription:

Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName

4. Create a resource group and a data factory.

NOTE
This step is optional. If you already have a data factory, skip this step.

Create an Azure resource group by using the New-AzResourceGroup command. A resource group is a
logical container into which Azure resources are deployed and managed as a group. The following example
creates a resource group with the name and location you set in the $ResourceGroupName and $DataFactoryLocation variables:

New-AzResourceGroup -Location $DataFactoryLocation -Name $ResourceGroupName

Run the following command to create a data factory:

Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `
    -Location $DataFactoryLocation `
    -Name $SharedDataFactoryName

Create a self-hosted integration runtime


NOTE
This step is optional. If you already have the self-hosted integration runtime that you want to share with other data factories,
skip this step.

Run the following command to create a self-hosted integration runtime:


$SharedIR = Set-AzDataFactoryV2IntegrationRuntime `
-ResourceGroupName $ResourceGroupName `
-DataFactoryName $SharedDataFactoryName `
-Name $SharedIntegrationRuntimeName `
-Type SelfHosted `
-Description $SharedIntegrationRuntimeDescription

Get the integration runtime authentication key and register a node


Run the following command to get the authentication key for the self-hosted integration runtime:

Get-AzDataFactoryV2IntegrationRuntimeKey `
-ResourceGroupName $ResourceGroupName `
-DataFactoryName $SharedDataFactoryName `
-Name $SharedIntegrationRuntimeName

The response contains the authentication key for this self-hosted integration runtime. You use this key when you
register the integration runtime node.
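If you prefer to keep the key in a variable for scripting the next step, here is a minimal sketch (it reuses the variables defined earlier; $AuthKey is an illustrative name, not part of the original script):

# Capture the authentication keys returned for the shared self-hosted integration runtime.
# Either AuthKey1 or AuthKey2 can be used to register a node in the next step.
$IRKeys = Get-AzDataFactoryV2IntegrationRuntimeKey `
    -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $SharedDataFactoryName `
    -Name $SharedIntegrationRuntimeName

$AuthKey = $IRKeys.AuthKey1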
Install and register the self-hosted integration runtime
1. Download the self-hosted integration runtime installer from Azure Data Factory Integration Runtime.
2. Run the installer to install the self-hosted integration on a local computer.
3. Register the new self-hosted integration with the authentication key that you retrieved in a previous step.

Share the self-hosted integration runtime with another data factory


Create another data factory

NOTE
This step is optional. If you already have the data factory that you want to share with, skip this step.

$factory = Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `
    -Location $DataFactoryLocation `
    -Name $LinkedDataFactoryName

Grant permission
Grant permission to the data factory that needs to access the self-hosted integration runtime you created and
registered.

IMPORTANT
Do not skip this step!

# Grant the Contributor role on the shared IR to the managed identity (MSI) of the data factory
# that needs to use it. 'b24988ac-6180-42a0-ab88-20f7382dd24c' is the built-in Contributor role ID.
New-AzRoleAssignment `
    -ObjectId $factory.Identity.PrincipalId `
    -RoleDefinitionId 'b24988ac-6180-42a0-ab88-20f7382dd24c' `
    -Scope $SharedIR.Id

Create a linked self-hosted integration runtime


Run the following command to create a linked self-hosted integration runtime:
Set-AzDataFactoryV2IntegrationRuntime `
-ResourceGroupName $ResourceGroupName `
-DataFactoryName $LinkedDataFactoryName `
-Name $LinkedIntegrationRuntimeName `
-Type SelfHosted `
-SharedIntegrationRuntimeResourceId $SharedIR.Id `
-Description $LinkedIntegrationRuntimeDescription

Now you can use this linked integration runtime in any linked service. The linked integration runtime uses the
shared integration runtime to run activities.
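For illustration, here is a hedged sketch of a linked service that routes through the linked integration runtime via a connectVia reference. The SQL Server store type, file name, and connection string are placeholders, not part of the original walkthrough; substitute the data store and credentials you actually use.

# A sketch only: write a linked service definition that references the linked IR through
# "connectVia", then create it in the linked data factory.
$definition = @"
{
    "name": "OnPremSqlServerLinkedService",
    "properties": {
        "type": "SqlServer",
        "connectVia": {
            "referenceName": "$LinkedIntegrationRuntimeName",
            "type": "IntegrationRuntimeReference"
        },
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "Data Source=<on-premises server>;Initial Catalog=<database>;Integrated Security=True"
            }
        }
    }
}
"@
Set-Content -Path ".\OnPremSqlServerLinkedService.json" -Value $definition

Set-AzDataFactoryV2LinkedService -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $LinkedDataFactoryName `
    -Name "OnPremSqlServerLinkedService" `
    -File ".\OnPremSqlServerLinkedService.json"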

Revoke integration runtime sharing from a data factory


To revoke the access of a data factory from the shared integration runtime, run the following command:

Remove-AzRoleAssignment `
-ObjectId $factory.Identity.PrincipalId `
-RoleDefinitionId 'b24988ac-6180-42a0-ab88-20f7382dd24c' `
-Scope $SharedIR.Id

To remove the existing linked integration runtime, run the following command against the shared integration
runtime:

Remove-AzDataFactoryV2IntegrationRuntime `
-ResourceGroupName $ResourceGroupName `
-DataFactoryName $SharedDataFactoryName `
-Name $SharedIntegrationRuntimeName `
-Links `
-LinkedDataFactoryName $LinkedDataFactoryName

Next steps
Review integration runtime concepts in Azure Data Factory.
Learn how to create a self-hosted integration runtime in the Azure portal.
Run an SSIS package with the Execute SSIS Package
activity in Azure Data Factory
3/20/2019 • 9 minutes to read

This article describes how to run an SSIS package in an Azure Data Factory (ADF) pipeline by using the Execute SSIS
Package activity.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Create an Azure-SSIS Integration Runtime (IR) if you do not have one already by following the step-by-step
instructions in the Tutorial: Deploy SSIS packages to Azure.

Run a package in the Azure portal


In this section, you use ADF User Interface (UI)/app to create an ADF pipeline with Execute SSIS Package activity
that runs your SSIS package.
Create a pipeline with an Execute SSIS Package activity
In this step, you use ADF UI/app to create a pipeline. You add an Execute SSIS Package activity to the pipeline and
configure it to run your SSIS package.
1. On your ADF overview/home page in Azure portal, click on the Author & Monitor tile to launch ADF
UI/app in a separate tab.
On the Let's get started page, click Create pipeline:

2. In the Activities toolbox, expand General, then drag & drop an Execute SSIS Package activity to the
pipeline designer surface.
3. On the General tab for Execute SSIS Package activity, provide a name and description for the activity. Set
optional timeout and retry values.

4. On the Settings tab for Execute SSIS Package activity, select your Azure-SSIS IR that is associated with
SSISDB database where the package is deployed. If your package uses Windows authentication to access
data stores, e.g. SQL Servers/file shares on premises, Azure Files, etc., check the Windows authentication
checkbox and enter the domain/username/password for your package execution. If your package needs 32-
bit runtime to run, check the 32-Bit runtime checkbox. For Logging level, select a predefined scope of
logging for your package execution. Check the Customized checkbox, if you want to enter your customized
logging name instead. When your Azure-SSIS IR is running and the Manual entries checkbox is
unchecked, you can browse and select your existing folders/projects/packages/environments from SSISDB.
Click the Refresh button to fetch your newly added folders/projects/packages/environments from SSISDB,
so they are available for browsing and selection.
When your Azure-SSIS IR is not running or the Manual entries checkbox is checked, you can enter your
package and environment paths from SSISDB directly in the following formats:
<folder name>/<project name>/<package name>.dtsx and <folder name>/<environment name> .

5. On the SSIS Parameters tab for Execute SSIS Package activity, when your Azure-SSIS IR is running and
the Manual entries checkbox on Settings tab is unchecked, the existing SSIS parameters in your selected
project/package from SSISDB will be displayed for you to assign values to them. Otherwise, you can enter
them one by one to assign values to them manually – Please ensure that they exist and are correctly entered
for your package execution to succeed. You can add dynamic content to their values using expressions,
functions, ADF system variables, and ADF pipeline parameters/variables. Alternatively, you can use secrets
stored in your Azure Key Vault (AKV ) as their values. To do so, click on the AZURE KEY VAULT checkbox
next to the relevant parameter, select/edit your existing AKV linked service or create a new one, and then
select the secret name/version for your parameter value. When you create/edit your AKV linked service, you
can select/edit your existing AKV or create a new one, but please grant ADF managed identity access to
your AKV if you have not done so already. You can also enter your secrets directly in the following format:
<AKV linked service name>/<secret name>/<secret version> .
6. On the Connection Managers tab for Execute SSIS Package activity, when your Azure-SSIS IR is running
and the Manual entries checkbox on Settings tab is unchecked, the existing connection managers in your
selected project/package from SSISDB will be displayed for you to assign values to their properties.
Otherwise, you can enter them one by one to assign values to their properties manually – Please ensure
that they exist and are correctly entered for your package execution to succeed. You can add dynamic
content to their property values using expressions, functions, ADF system variables, and ADF pipeline
parameters/variables. Alternatively, you can use secrets stored in your Azure Key Vault (AKV ) as their
property values. To do so, click on the AZURE KEY VAULT checkbox next to the relevant property,
select/edit your existing AKV linked service or create a new one, and then select the secret name/version for
your property value. When you create/edit your AKV linked service, you can select/edit your existing AKV
or create a new one, but please grant ADF managed identity access to your AKV if you have not done so
already. You can also enter your secrets directly in the following format:
<AKV linked service name>/<secret name>/<secret version> .

7. On the Property Overrides tab for Execute SSIS Package activity, you can enter the paths of existing
properties in your selected package from SSISDB one by one to assign values to them manually – Please
ensure that they exist and are correctly entered for your package execution to succeed, e.g. to override the
value of your user variable, enter its path in the following format:
\Package.Variables[User::YourVariableName].Value . You can also add dynamic content to their values using
expressions, functions, ADF system variables, and ADF pipeline parameters/variables.
8. To validate the pipeline configuration, click Validate on the toolbar. To close the Pipeline Validation
Report, click >>.
9. Publish the pipeline to ADF by clicking Publish All button.
Run the pipeline
In this step, you trigger a pipeline run.
1. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger now.

2. In the Pipeline Run window, select Finish.


Monitor the pipeline
1. Switch to the Monitor tab on the left. You see the pipeline run and its status along with other information
(such as Run Start time). To refresh the view, click Refresh.
2. Click View Activity Runs link in the Actions column. You see only one activity run as the pipeline has only
one activity (the Execute SSIS Package activity).

3. You can run the following query against the SSISDB database in your Azure SQL server to verify that the
package executed.

select * from catalog.executions

4. You can also get the SSISDB execution ID from the output of the pipeline activity run, and use the ID to
check more comprehensive execution logs and error messages in SSMS.
Schedule the pipeline with a trigger
You can also create a scheduled trigger for your pipeline so that the pipeline runs on a schedule (hourly, daily, etc.).
For an example, see Create a data factory - Data Factory UI.

Run a package with PowerShell


In this section, you use Azure PowerShell to create an ADF pipeline with Execute SSIS Package activity that runs
your SSIS package.
Install the latest Azure PowerShell modules by following the step-by-step instructions in How to install and
configure Azure PowerShell.
Create an ADF with Azure-SSIS IR
You can either use an existing ADF that already has Azure-SSIS IR provisioned or create a new ADF with Azure-
SSIS IR following the step-by-step instructions in the Tutorial: Deploy SSIS packages to Azure via PowerShell.
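The cmdlets in the rest of this walkthrough reference $ResGrp and $DataFactory objects for that existing ADF. A minimal sketch that populates them (the names in angle brackets are placeholders):

# Retrieve the existing resource group and data factory objects used by the later cmdlets.
$ResGrp = Get-AzResourceGroup -Name "<Resource Group Name>"

$DataFactory = Get-AzDataFactoryV2 -ResourceGroupName $ResGrp.ResourceGroupName `
    -Name "<Data Factory Name>"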
Create a pipeline with an Execute SSIS Package activity
In this step, you create a pipeline with an Execute SSIS Package activity. The activity runs your SSIS package.
1. Create a JSON file named RunSSISPackagePipeline.json in the C:\ADF\RunSSISPackage folder with
content similar to the following example:

IMPORTANT
Replace object names, descriptions, and paths, property and parameter values, passwords, and other variable values
before saving the file.

{
"name": "RunSSISPackagePipeline",
"properties": {
"activities": [{
"name": "mySSISActivity",
"description": "My SSIS package/activity description",
"type": "ExecuteSSISPackage",
"typeProperties": {
"connectVia": {
"referenceName": "myAzureSSISIR",
"type": "IntegrationRuntimeReference"
},
"executionCredential": {
"domain": "MyDomain",
"userName": "MyUsername",
"password": {
"type": "SecureString",
"value": "**********"
}
},
"runtime": "x64",
"loggingLevel": "Basic",
"packageLocation": {
"packagePath": "FolderName/ProjectName/PackageName.dtsx"
},
"environmentPath": "FolderName/EnvironmentName",
"projectParameters": {
"project_param_1": {
"value": "123"
},
"project_param_2": {
"value": {
"value": "@pipeline().parameters.MyPipelineParameter",
"type": "Expression"
}
}
},
"packageParameters": {
"package_param_1": {
"value": "345"
},
"package_param_2": {
"value": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "myAKV",
"type": "LinkedServiceReference"
},
"secretName": "MySecret"
}
}
},
"projectConnectionManagers": {
"MyAdonetCM": {
"userName": {
"value": "sa"
},
"passWord": {
"value": {
"type": "SecureString",
"value": "abc"
}
}
}
},
"packageConnectionManagers": {
"MyOledbCM": {
"userName": {
"value": {
"value": "@pipeline().parameters.MyUsername",
"type": "Expression"
}
},
"passWord": {
"value": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "myAKV",
"type": "LinkedServiceReference"
},
"secretName": "MyPassword",
"secretVersion": "3a1b74e361bf4ef4a00e47053b872149"
}
}
}
},
"propertyOverrides": {
"\\Package.MaxConcurrentExecutables": {
"value": 8,
"isSensitive": false
}
}
},
"policy": {
"timeout": "0.01:00:00",
"retry": 0,
"retryIntervalInSeconds": 30
}
}]
}
}

2. In Azure PowerShell, switch to the C:\ADF\RunSSISPackage folder.


3. To create the pipeline RunSSISPackagePipeline, run the Set-AzDataFactoryV2Pipeline cmdlet.

$DFPipeLine = Set-AzDataFactoryV2Pipeline -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -Name "RunSSISPackagePipeline" `
    -DefinitionFile ".\RunSSISPackagePipeline.json"

Here is the sample output:

PipelineName : RunSSISPackagePipeline
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {mySSISActivity}
Parameters :
Run the pipeline
Use the Invoke-AzDataFactoryV2Pipeline cmdlet to run the pipeline. The cmdlet returns the pipeline run ID for
future monitoring.

$RunId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -PipelineName $DFPipeLine.Name

Monitor the pipeline


Run the following PowerShell script to continuously check the pipeline run status until it finishes running your SSIS package.
Copy and paste the script into the PowerShell window, and press ENTER.

while ($True) {
$Run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $ResGrp.ResourceGroupName `
-DataFactoryName $DataFactory.DataFactoryName `
-PipelineRunId $RunId

if ($Run) {
if ($run.Status -ne 'InProgress') {
Write-Output ("Pipeline run finished. The status is: " + $Run.Status)
$Run
break
}
Write-Output "Pipeline is running...status: InProgress"
}

Start-Sleep -Seconds 10
}

You can also monitor the pipeline using the Azure portal. For step-by-step instructions, see Monitor the pipeline.
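If you want the SSISDB execution ID without opening the portal, a hedged sketch (reusing $ResGrp, $DataFactory, and $RunId from the steps above; the time window is an assumption you can widen) that reads the activity run output:

# Fetch the activity runs for this pipeline run; the Execute SSIS Package activity's Output
# includes the SSISDB execution ID that you can look up in SSMS.
$ActivityRuns = Get-AzDataFactoryV2ActivityRun `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -PipelineRunId $RunId `
    -RunStartedAfter (Get-Date).AddDays(-1) `
    -RunStartedBefore (Get-Date).AddDays(1)

$ActivityRuns | ForEach-Object { $_.ActivityName; $_.Output }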
Schedule the pipeline with a trigger
In the previous step, you ran the pipeline on-demand. You can also create a schedule trigger to run the pipeline on
a schedule (hourly, daily, etc.).
1. Create a JSON file named MyTrigger.json in C:\ADF\RunSSISPackage folder with the following content:

{
"properties": {
"name": "MyTrigger",
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2017-12-07T00:00:00-08:00",
"endTime": "2017-12-08T00:00:00-08:00"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "RunSSISPackagePipeline"
},
"parameters": {}
}]
}
}

2. In Azure PowerShell, switch to the C:\ADF\RunSSISPackage folder.


3. Run the Set-AzDataFactoryV2Trigger cmdlet, which creates the trigger.

Set-AzDataFactoryV2Trigger -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -Name "MyTrigger" -DefinitionFile ".\MyTrigger.json"

4. By default, the trigger is in stopped state. Start the trigger by running the Start-AzDataFactoryV2Trigger
cmdlet.

Start-AzDataFactoryV2Trigger -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -Name "MyTrigger"

5. Confirm that the trigger is started by running the Get-AzDataFactoryV2Trigger cmdlet.

Get-AzDataFactoryV2Trigger -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -Name "MyTrigger"

6. Run the following command after the next hour. For example, if the current time is 3:25 PM UTC, run the
command at 4 PM UTC.

Get-AzDataFactoryV2TriggerRun -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -TriggerName "MyTrigger" `
    -TriggerRunStartedAfter "2017-12-06" `
    -TriggerRunStartedBefore "2017-12-09"

You can run the following query against the SSISDB database in your Azure SQL server to verify that the
package executed.

select * from catalog.executions
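If you prefer to run that check from PowerShell rather than SSMS, a hedged sketch using Invoke-Sqlcmd from the SqlServer module (server name and credentials are placeholders; in catalog.executions, status 7 means Succeeded):

Invoke-Sqlcmd -ServerInstance "<Azure SQL Database server name>.database.windows.net" `
    -Database "SSISDB" `
    -Username "<username>" `
    -Password "<password>" `
    -Query "SELECT TOP (10) execution_id, folder_name, project_name, package_name, status, start_time FROM catalog.executions ORDER BY execution_id DESC"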

Next steps
See the following blog post:
Modernize and extend your ETL/ELT workflows with SSIS activities in ADF pipelines
Run an SSIS package with the Stored Procedure
activity in Azure Data Factory
4/8/2019 • 10 minutes to read

This article describes how to run an SSIS package in an Azure Data Factory pipeline by using a Stored Procedure
activity.

Prerequisites
Azure SQL Database
The walkthrough in this article uses an Azure SQL database that hosts the SSIS catalog. You can also use an Azure
SQL Database Managed Instance.

Create an Azure-SSIS integration runtime


Create an Azure-SSIS integration runtime if you don't have one by following the step-by-step instructions in the
Tutorial: Deploy SSIS packages.

Data Factory UI (Azure portal)


In this section, you use Data Factory UI to create a Data Factory pipeline with a stored procedure activity that
invokes an SSIS package.
Create a data factory
First step is to create a data factory by using the Azure portal.
1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. Navigate to the Azure portal.
3. Click New on the left menu, click Data + Analytics, and click Data Factory.
4. In the New data factory page, enter ADFTutorialDataFactory for the name.
The name of the Azure data factory must be globally unique. If you see the following error for the name
field, change the name of the data factory (for example, yournameADFTutorialDataFactory). See Data
Factory - Naming Rules article for naming rules for Data Factory artifacts.

5. Select your Azure subscription in which you want to create the data factory.
6. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. Select V2 for the version.
8. Select the location for the data factory. Only locations that are supported by Data Factory are shown in the
drop-down list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.)
used by data factory can be in other locations.
9. Select Pin to dashboard.
10. Click Create.
11. On the dashboard, you see the following tile with status: Deploying data factory.
12. After the creation is complete, you see the Data Factory page as shown in the image.

13. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) application in a separate
tab.
Create a pipeline with stored procedure activity
In this step, you use the Data Factory UI to create a pipeline. You add a stored procedure activity to the pipeline and
configure it to run the SSIS package by using the sp_executesql stored procedure.
1. In the get started page, click Create pipeline:
2. In the Activities toolbox, expand General, and drag-drop Stored Procedure activity to the pipeline
designer surface.

3. In the properties window for the stored procedure activity, switch to the SQL Account tab, and click +
New. You create a connection to the Azure SQL database that hosts the SSIS catalog (SSISDB database).
4. In the New Linked Service window, do the following steps:
a. Select Azure SQL Database for Type.
b. Select the Default Azure Integration Runtime to connect to the Azure SQL Database that hosts the
SSISDB database.

c. Select the Azure SQL Database that hosts the SSISDB database for the Server name field.
d. Select SSISDB for Database name.
e. For User name, enter the name of user who has access to the database.
f. For Password, enter the password of the user.
g. Test the connection to the database by clicking Test connection button.
h. Save the linked service by clicking the Save button.
5. In the properties window, switch to the Stored Procedure tab from the SQL Account tab, and do the
following steps:
a. Select Edit.
b. For the Stored procedure name field, Enter sp_executesql .
c. Click + New in the Stored procedure parameters section.
d. For name of the parameter, enter stmt.
e. For type of the parameter, enter String.
f. For value of the parameter, enter the following SQL query:
In the SQL query, specify the right values for the folder_name, project_name, and package_name
parameters.
DECLARE @return_value INT, @exe_id BIGINT, @err_msg NVARCHAR(150)

EXEC @return_value = [SSISDB].[catalog].[create_execution]
    @folder_name = N'<FOLDER name in SSIS Catalog>',
    @project_name = N'<PROJECT name in SSIS Catalog>',
    @package_name = N'<PACKAGE name>.dtsx',
    @use32bitruntime = 0,
    @runinscaleout = 1,
    @useanyworker = 1,
    @execution_id = @exe_id OUTPUT

EXEC [SSISDB].[catalog].[set_execution_parameter_value]
    @exe_id,
    @object_type = 50,
    @parameter_name = N'SYNCHRONIZED',
    @parameter_value = 1

EXEC [SSISDB].[catalog].[start_execution] @execution_id = @exe_id, @retry_count = 0

IF (SELECT [status] FROM [SSISDB].[catalog].[executions] WHERE execution_id = @exe_id) <> 7
BEGIN
    SET @err_msg = N'Your package execution did not succeed for execution ID: ' + CAST(@exe_id AS NVARCHAR(20))
    RAISERROR(@err_msg, 15, 1)
END

6. To validate the pipeline configuration, click Validate on the toolbar. To close the Pipeline Validation
Report, click >>.
7. Publish the pipeline to Data Factory by clicking Publish All button.

Run and monitor the pipeline


In this section, you trigger a pipeline run and then monitor it.
1. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger now.

2. In the Pipeline Run window, select Finish.


3. Switch to the Monitor tab on the left. You see the pipeline run and its status along with other information
(such as Run Start time). To refresh the view, click Refresh.

4. Click View Activity Runs link in the Actions column. You see only one activity run as the pipeline has only
one activity (stored procedure activity).

5. You can run the following query against the SSISDB database in your Azure SQL server to verify that the
package executed.

select * from catalog.executions

NOTE
You can also create a scheduled trigger for your pipeline so that the pipeline runs on a schedule (hourly, daily, etc.). For an
example, see Create a data factory - Data Factory UI.

Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

In this section, you use Azure PowerShell to create a Data Factory pipeline with a stored procedure activity that
invokes an SSIS package.
Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.
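As a quick optional check before you continue, a minimal sketch that verifies the Az.DataFactory module is available, installs it for the current user if it is not, and signs in (the CurrentUser scope is a choice, not a requirement):

if (-not (Get-Module -ListAvailable -Name Az.DataFactory)) {
    Install-Module -Name Az.DataFactory -Scope CurrentUser
}
Connect-AzAccount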
Create a data factory
You can either use the same data factory that has the Azure-SSIS IR or create a separate data factory. The
following procedure provides steps to create a data factory. You create a pipeline with a stored procedure activity in
this data factory. The stored procedure activity executes a stored procedure in the SSISDB database to run your
SSIS package.
1. Define a variable for the resource group name that you use in PowerShell commands later. Copy the
following command text to PowerShell, specify a name for the Azure resource group in double quotes, and
then run the command. For example: "adfrg" .

$resourceGroupName = "ADFTutorialResourceGroup";

If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$ResourceGroupName variable and run the command again.

2. To create the Azure resource group, run the following command:

$ResGrp = New-AzResourceGroup $resourceGroupName -location 'eastus'

If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$ResourceGroupName variable and run the command again.

3. Define a variable for the data factory name.

IMPORTANT
Update the data factory name to be globally unique.

$DataFactoryName = "ADFTutorialFactory";

4. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet, using the Location and
ResourceGroupName property from the $ResGrp variable:

$DataFactory = Set-AzDataFactoryV2 -ResourceGroupName $ResGrp.ResourceGroupName `
    -Location $ResGrp.Location `
    -Name $DataFactoryName

Note the following points:


The name of the Azure data factory must be globally unique. If you receive the following error, change the
name and try again.

The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory names must
be globally unique.

To create Data Factory instances, the user account you use to log in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on
the following page, and then expand Analytics to locate Data Factory: Products available by region. The
data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data factory
can be in other regions.
Create an Azure SQL Database linked service
Create a linked service to link your Azure SQL database that hosts the SSIS catalog to your data factory. Data
Factory uses information in this linked service to connect to SSISDB database, and executes a stored procedure to
run an SSIS package.
1. Create a JSON file named AzureSqlDatabaseLinkedService.json in C:\ADF\RunSSISPackage folder
with the following content:

IMPORTANT
Replace <servername>, <username>, and <password> with values of your Azure SQL Database before saving the
file.

{
"name": "AzureSqlDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=SSISDB;User ID=
<username>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
}

2. In Azure PowerShell, switch to the C:\ADF\RunSSISPackage folder.


3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service:
AzureSqlDatabaseLinkedService.

Set-AzDataFactoryV2LinkedService -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -Name "AzureSqlDatabaseLinkedService" `
    -File ".\AzureSqlDatabaseLinkedService.json"

Create a pipeline with stored procedure activity


In this step, you create a pipeline with a stored procedure activity. The activity invokes the sp_executesql stored
procedure to run your SSIS package.
1. Create a JSON file named RunSSISPackagePipeline.json in the C:\ADF\RunSSISPackage folder with
the following content:
IMPORTANT
Replace <FOLDER NAME>, <PROJECT NAME>, <PACKAGE NAME> with names of folder, project, and package in the
SSIS catalog before saving the file.

{
"name": "RunSSISPackagePipeline",
"properties": {
"activities": [
{
"name": "My SProc Activity",
"description":"Runs an SSIS package",
"type": "SqlServerStoredProcedure",
"linkedServiceName": {
"referenceName": "AzureSqlDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"storedProcedureName": "sp_executesql",
"storedProcedureParameters": {
"stmt": {
"value": "DECLARE @return_value INT, @exe_id BIGINT, @err_msg NVARCHAR(150)
EXEC @return_value=[SSISDB].[catalog].[create_execution] @folder_name=N'<FOLDER NAME>',
@project_name=N'<PROJECT NAME>', @package_name=N'<PACKAGE NAME>', @use32bitruntime=0, @runinscaleout=1,
@useanyworker=1, @execution_id=@exe_id OUTPUT EXEC [SSISDB].[catalog].[set_execution_parameter_value]
@exe_id, @object_type=50, @parameter_name=N'SYNCHRONIZED', @parameter_value=1 EXEC [SSISDB].
[catalog].[start_execution] @execution_id=@exe_id, @retry_count=0 IF(SELECT [status] FROM [SSISDB].
[catalog].[executions] WHERE execution_id=@exe_id)<>7 BEGIN SET @err_msg=N'Your package execution did
not succeed for execution ID: ' + CAST(@exe_id AS NVARCHAR(20)) RAISERROR(@err_msg,15,1) END"
}
}
}
}
]
}
}

2. To create the pipeline RunSSISPackagePipeline, run the Set-AzDataFactoryV2Pipeline cmdlet.

$DFPipeLine = Set-AzDataFactoryV2Pipeline -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -Name "RunSSISPackagePipeline" `
    -DefinitionFile ".\RunSSISPackagePipeline.json"

Here is the sample output:

PipelineName : RunSSISPackagePipeline
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {My SProc Activity}
Parameters :

Create a pipeline run


Use the Invoke-AzDataFactoryV2Pipeline cmdlet to run the pipeline. The cmdlet returns the pipeline run ID for
future monitoring.

$RunId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -PipelineName $DFPipeLine.Name
Monitor the pipeline run
Run the following PowerShell script to continuously check the pipeline run status until it finishes running your SSIS package.
Copy and paste the script into the PowerShell window, and press ENTER.

while ($True) {
$Run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $ResGrp.ResourceGroupName -DataFactoryName
$DataFactory.DataFactoryName -PipelineRunId $RunId

if ($Run) {
if ($run.Status -ne 'InProgress') {
Write-Output ("Pipeline run finished. The status is: " + $Run.Status)
$Run
break
}
Write-Output "Pipeline is running...status: InProgress"
}

Start-Sleep -Seconds 10
}

Create a trigger
In the previous step, you invoked the pipeline on-demand. You can also create a schedule trigger to run the
pipeline on a schedule (hourly, daily, etc.).
1. Create a JSON file named MyTrigger.json in C:\ADF\RunSSISPackage folder with the following content:

{
"properties": {
"name": "MyTrigger",
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2017-12-07T00:00:00-08:00",
"endTime": "2017-12-08T00:00:00-08:00"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "RunSSISPackagePipeline"
},
"parameters": {}
}
]
}
}

2. In Azure PowerShell, switch to the C:\ADF\RunSSISPackage folder.


3. Run the Set-AzDataFactoryV2Trigger cmdlet, which creates the trigger.

Set-AzDataFactoryV2Trigger -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -Name "MyTrigger" -DefinitionFile ".\MyTrigger.json"

4. By default, the trigger is in stopped state. Start the trigger by running the Start-AzDataFactoryV2Trigger
cmdlet.
Start-AzDataFactoryV2Trigger -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -Name "MyTrigger"

5. Confirm that the trigger is started by running the Get-AzDataFactoryV2Trigger cmdlet.

Get-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name "MyTrigger"

6. Run the following command after the next hour. For example, if the current time is 3:25 PM UTC, run the
command at 4 PM UTC.

Get-AzDataFactoryV2TriggerRun -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -TriggerName "MyTrigger" `
    -TriggerRunStartedAfter "2017-12-06" `
    -TriggerRunStartedBefore "2017-12-09"

You can run the following query against the SSISDB database in your Azure SQL server to verify that the
package executed.

select * from catalog.executions

Next steps
You can also monitor the pipeline using the Azure portal. For step-by-step instructions, see Monitor the pipeline.
How to start and stop Azure-SSIS Integration Runtime on a schedule
3/28/2019 • 13 minutes to read

This article describes how to schedule the starting and stopping of the Azure-SSIS Integration Runtime (IR) by using Azure Data Factory (ADF). The Azure-SSIS IR is an ADF compute
resource dedicated to executing SQL Server Integration Services (SSIS) packages. Running the Azure-SSIS IR has a cost associated with it. Therefore, you typically want to run
your IR only when you need to execute SSIS packages in Azure and stop it when you do not need it anymore. You can use the ADF User Interface (UI)/app or Azure
PowerShell to manually start or stop your IR.
Alternatively, you can create Web activities in ADF pipelines to start/stop your IR on a schedule, e.g. starting it in the morning before executing your daily ETL workloads and
stopping it in the afternoon after they are done. You can also chain an Execute SSIS Package activity between two Web activities that start and stop your IR, so your IR will
start/stop on demand, just in time before/after your package execution. For more info about the Execute SSIS Package activity, see the Run an SSIS package using Execute SSIS
Package activity in ADF pipeline article.

IMPORTANT
Using this Azure feature from PowerShell requires the AzureRM module installed. This is an older module only available for Windows PowerShell 5.1 that no longer receives new features. The
Az and AzureRM modules are not compatible when installed for the same versions of PowerShell. If you need both versions:

1. Uninstall the Az module from a PowerShell 5.1 session.


2. Install the AzureRM module from a PowerShell 5.1 session.
3. Download and install PowerShell Core 6.x or later.
4. Install the Az module in a PowerShell Core session.

Prerequisites
If you have not provisioned your Azure-SSIS IR already, provision it by following instructions in the tutorial.

Create and schedule ADF pipelines that start and/or stop Azure-SSIS IR
This section shows you how to use Web activities in ADF pipelines to start/stop your Azure-SSIS IR on a schedule or start and stop it on demand. We will guide you through creating
three pipelines:
1. The first pipeline contains a Web activity that starts your Azure-SSIS IR.
2. The second pipeline contains a Web activity that stops your Azure-SSIS IR.
3. The third pipeline contains an Execute SSIS Package activity chained between two Web activities that start/stop your Azure-SSIS IR.
After you create and test those pipelines, you can create a schedule trigger and associate it with any pipeline. The schedule trigger defines a schedule for running the
associated pipeline.
For example, you can create two triggers, the first one is scheduled to run daily at 6 AM and associated with the first pipeline, while the second one is scheduled to run daily at
6 PM and associated with the second pipeline. In this way, you have a period between 6 AM to 6 PM every day when your IR is running, ready to execute your daily ETL
workloads.
If you create a third trigger that is scheduled to run daily at midnight and associated with the third pipeline, that pipeline will run at midnight every day, starting your IR just
before package execution, subsequently executing your package, and immediately stopping your IR just after package execution, so your IR will not be running idly.
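For reference, here is a hedged PowerShell sketch of the 6 AM trigger described above, following the same ScheduleTrigger JSON pattern used elsewhere in this documentation. The trigger name, pipeline reference, file path, start date, and factory names are placeholders; a mirror-image trigger with an 18:00 start time would cover the 6 PM stop pipeline.

# Trigger that runs the "start IR" pipeline daily at 6 AM UTC.
$startTrigger = @'
{
    "properties": {
        "name": "StartIRDaily",
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2019-05-01T06:00:00Z"
            }
        },
        "pipelines": [{
            "pipelineReference": {
                "type": "PipelineReference",
                "referenceName": "<name of your pipeline that starts the IR>"
            },
            "parameters": {}
        }]
    }
}
'@
Set-Content -Path ".\StartIRDaily.json" -Value $startTrigger

Set-AzDataFactoryV2Trigger -ResourceGroupName "<Resource Group Name>" `
    -DataFactoryName "<Data Factory Name>" `
    -Name "StartIRDaily" -DefinitionFile ".\StartIRDaily.json"

Start-AzDataFactoryV2Trigger -ResourceGroupName "<Resource Group Name>" `
    -DataFactoryName "<Data Factory Name>" `
    -Name "StartIRDaily"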
Create your ADF
1. Sign in to Azure portal.
2. Click New on the left menu, click Data + Analytics, and click Data Factory.
3. In the New data factory page, enter MyAzureSsisDataFactory for Name.

The name of your ADF must be globally unique. If you receive the following error, change the name of your ADF (e.g. yournameMyAzureSsisDataFactory) and try
creating it again. See Data Factory - Naming Rules article to learn about naming rules for ADF artifacts.
Data factory name MyAzureSsisDataFactory is not available

4. Select your Azure Subscription under which you want to create your ADF.
5. For Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of your new resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources article.
6. For Version, select V2 .
7. For Location, select one of the locations supported for ADF creation from the drop-down list.
8. Select Pin to dashboard.
9. Click Create.
10. On Azure dashboard, you will see the following tile with status: Deploying Data Factory.
11. After the creation is complete, you can see your ADF page as shown below.

12. Click Author & Monitor to launch ADF UI/app in a separate tab.
Create your pipelines
1. In Let's get started page, select Create pipeline.

2. In Activities toolbox, expand General menu, and drag & drop a Web activity onto the pipeline designer surface. In General tab of the activity properties window,
change the activity name to startMyIR. Switch to Settings tab, and do the following actions.
a. For URL, enter the following URL for the REST API that starts the Azure-SSIS IR, replacing {subscriptionId} , {resourceGroupName} , {factoryName} , and
{integrationRuntimeName} with the actual values for your IR:
https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/integrationRuntimes/{integrationRuntimeName}/start?api-version=2018-06-01
Alternatively, you can also copy & paste the resource ID of your IR from its monitoring page on ADF UI/app to replace the following part of the above URL:
/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/integrationRuntimes/{integrationRuntimeName}
b. For Method, select POST.
c. For Body, enter {"message":"Start my IR"} .
d. For Authentication, select MSI to use the managed identity for your ADF, see Managed identity for Data Factory article for more info.
e. For Resource, enter https://management.azure.com/ . (An equivalent pipeline definition deployed with PowerShell is sketched after this list.)

3. Clone the first pipeline to create a second one, changing the activity name to stopMyIR and replacing the following properties.
a. For URL, enter the following URL for the REST API that stops the Azure-SSIS IR, replacing {subscriptionId} , {resourceGroupName} , {factoryName} , and
{integrationRuntimeName} with the actual values for your IR:
https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/integrationRuntimes/{integrationRuntimeName}/stop?api-version=2018-06-01

b. For Body, enter {"message":"Stop my IR"} .


4. Create a third pipeline, drag & drop an Execute SSIS Package activity from Activities toolbox onto the pipeline designer surface, and configure it following the
instructions in Invoke an SSIS package using Execute SSIS Package activity in ADF article. Alternatively, you can use a Stored Procedure activity instead and
configure it following the instructions in Invoke an SSIS package using Stored Procedure activity in ADF article. Next, chain the Execute SSIS Package/Stored
Procedure activity between two Web activities that start/stop your IR, similar to those Web activities in the first/second pipelines.
5. Assign the managed identity of your ADF the Contributor role on the ADF itself, so that Web activities in its pipelines can call the REST API to start/stop the Azure-SSIS IRs provisioned in
it. On your ADF page in the Azure portal, click Access control (IAM), click + Add role assignment, and then on the Add role assignment blade, do the following actions.
a. For Role, select Contributor.
b. For Assign access to, select Azure AD user, group, or service principal.
c. For Select, search for your ADF name and select it.
d. Click Save.

6. Validate your ADF and all pipeline settings by clicking Validate all/Validate on the factory/pipeline toolbar. Close Factory/Pipeline Validation Output by clicking
>> button.
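For reference, here is a hedged sketch of what the first pipeline above could look like as a JSON definition deployed with PowerShell. The pipeline and activity names, the placeholder IDs in the URL, the file path, and the factory names are illustrative assumptions; the Web activity settings mirror steps 2a through 2e.

# Pipeline with a single Web activity that calls the ADF REST API to start the Azure-SSIS IR,
# authenticating with the data factory's managed identity (MSI). Replace the {placeholders}
# in the URL with your own values before deploying.
$startPipeline = @'
{
    "name": "StartMyIRPipeline",
    "properties": {
        "activities": [{
            "name": "startMyIR",
            "type": "WebActivity",
            "typeProperties": {
                "url": "https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/integrationRuntimes/{integrationRuntimeName}/start?api-version=2018-06-01",
                "method": "POST",
                "body": { "message": "Start my IR" },
                "authentication": {
                    "type": "MSI",
                    "resource": "https://management.azure.com/"
                }
            }
        }]
    }
}
'@
Set-Content -Path ".\StartMyIRPipeline.json" -Value $startPipeline

Set-AzDataFactoryV2Pipeline -ResourceGroupName "<Resource Group Name>" `
    -DataFactoryName "<Data Factory Name>" `
    -Name "StartMyIRPipeline" -DefinitionFile ".\StartMyIRPipeline.json"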

Test run your pipelines


1. Select Test Run on the toolbar for each pipeline and see Output window in the bottom pane.
2. To test the third pipeline, launch SQL Server Management Studio (SSMS). In Connect to Server window, do the following actions.
a. For Server name, enter <your Azure SQL Database server name>.database.windows.net.
b. Select Options >>.
c. For Connect to database, select SSISDB.
d. Select Connect.
e. Expand Integration Services Catalogs -> SSISDB -> Your folder -> Projects -> Your SSIS project -> Packages.
f. Right-click the specified SSIS package to run and select Reports -> Standard Reports -> All Executions.
g. Verify that it ran.

Schedule your pipelines


Now that your pipelines work as you expected, you can create triggers to run them at specified cadences. For details about associating triggers with pipelines, see Trigger the
pipeline on a schedule article.
1. On the pipeline toolbar, select Trigger and select New/Edit.

2. In Add Triggers pane, select + New.


3. In New Trigger pane, do the following actions:
a. For Name, enter a name for the trigger. In the following example, Run daily is the trigger name.
b. For Type, select Schedule.
c. For Start Date (UTC), enter a start date and time in UTC.
d. For Recurrence, enter a cadence for the trigger. In the following example, it is Daily once.
e. For End, select No End or enter an end date and time after selecting On Date.
f. Select Activated to activate the trigger immediately after you publish the whole ADF settings.
g. Select Next.

4. In Trigger Run Parameters page, review any warning, and select Finish.
5. Publish the whole ADF settings by selecting Publish All in the factory toolbar.

Monitor your pipelines and triggers in Azure portal


1. To monitor trigger runs and pipeline runs, use Monitor tab on the left of ADF UI/app. For detailed steps, see Monitor the pipeline article.
2. To view the activity runs associated with a pipeline run, select the first link (View Activity Runs) in Actions column. For the third pipeline, you will see three activity
runs, one for each chained activity in the pipeline (Web activity to start your IR, Stored Procedure activity to execute your package, and Web activity to stop your IR). To
view the pipeline runs again, select Pipelines link at the top.

3. To view the trigger runs, select Trigger Runs from the drop-down list under Pipeline Runs at the top.

Monitor your pipelines and triggers with PowerShell


Use scripts like the following examples to monitor your pipelines and triggers.
1. Get the status of a pipeline run.

Get-AzDataFactoryV2PipelineRun -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -PipelineRunId $myPipelineRun

2. Get info about a trigger.

Get-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "myTrigger"

3. Get the status of a trigger run.

Get-AzDataFactoryV2TriggerRun -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -TriggerName "myTrigger" -TriggerRunStartedAfter "2018-07-15" -TriggerRunStartedBefore "2018-07-16"

Create and schedule Azure Automation runbook that starts/stops Azure-SSIS IR


In this section, you will learn to create an Azure Automation runbook that executes a PowerShell script to start/stop your Azure-SSIS IR on a schedule. This is useful when
you want to execute additional scripts before/after starting/stopping your IR for pre/post-processing.
Create your Azure Automation account
If you do not have an Azure Automation account already, create one by following the instructions in this step. For detailed steps, see Create an Azure Automation account
article. As part of this step, you create an Azure Run As account (a service principal in your Azure Active Directory) and assign it a Contributor role in your Azure
subscription. Ensure that it is the same subscription that contains your ADF with Azure SSIS IR. Azure Automation will use this account to authenticate to Azure Resource
Manager and operate on your resources.
1. Launch Microsoft Edge or Google Chrome web browser. Currently, ADF UI/app is only supported in Microsoft Edge and Google Chrome web browsers.
2. Sign in to Azure portal.
3. Select New on the left menu, select Monitoring + Management, and select Automation.
4. In Add Automation Account pane, do the following actions.
a. For Name, enter a name for your Azure Automation account.
b. For Subscription, select the subscription that has your ADF with Azure-SSIS IR.
c. For Resource group, select Create new to create a new resource group or Use existing to select an existing one.
d. For Location, select a location for your Azure Automation account.
e. Confirm Create Azure Run As account as Yes. A service principal will be created in your Azure Active Directory and assigned a Contributor role in your Azure
subscription.
f. Select Pin to dashboard to display it permanently in Azure dashboard.
g. Select Create.

5. You will see the deployment status of your Azure Automation account in Azure dashboard and notifications.
6. You will see the homepage of your Azure Automation account after it is created successfully.

Import ADF modules


1. Select Modules in SHARED RESOURCES section on the left menu and verify whether you have AzureRM.DataFactoryV2 + AzureRM.Profile in the list of
modules.

2. If you do not have AzureRM.DataFactoryV2, go to the PowerShell Gallery for AzureRM.DataFactoryV2 module, select Deploy to Azure Automation, select your
Azure Automation account, and then select OK. Go back to view Modules in SHARED RESOURCES section on the left menu and wait until you see STATUS of
AzureRM.DataFactoryV2 module changed to Available.
3. If you do not have AzureRM.Profile, go to the PowerShell Gallery for AzureRM.Profile module, select Deploy to Azure Automation, select your Azure Automation
account, and then select OK. Go back to view Modules in SHARED RESOURCES section on the left menu and wait until you see STATUS of the AzureRM.Profile
module changed to Available.

Create your PowerShell runbook


The following section provides steps for creating a PowerShell runbook. The script associated with your runbook starts or stops your Azure-SSIS IR based on the value
you specify for the OPERATION parameter. This section does not provide the complete details for creating a runbook. For more information, see the Create a runbook article.
1. Switch to Runbooks tab and select + Add a runbook from the toolbar.

2. Select Create a new runbook and do the following actions:


a. For Name, enter StartStopAzureSsisRuntime.
b. For Runbook type, select PowerShell.
c. Select Create.

3. Copy & paste the following PowerShell script to your runbook script window. Save and then publish your runbook by using Save and Publish buttons on the toolbar.

Param
(
    [Parameter (Mandatory = $true)]
    [String] $ResourceGroupName,

    [Parameter (Mandatory = $true)]
    [String] $DataFactoryName,

    [Parameter (Mandatory = $true)]
    [String] $AzureSSISName,

    [Parameter (Mandatory = $true)]
    [String] $Operation
)

$connectionName = "AzureRunAsConnection"
try
{
# Get the connection "AzureRunAsConnection "
$servicePrincipalConnection=Get-AutomationConnection -Name $connectionName

"Logging in to Azure..."
Connect-AzAccount `
-ServicePrincipal `
-TenantId $servicePrincipalConnection.TenantId `
-ApplicationId $servicePrincipalConnection.ApplicationId `
-CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint
}
catch {
if (!$servicePrincipalConnection)
{
$ErrorMessage = "Connection $connectionName not found."
throw $ErrorMessage
} else{
Write-Error -Message $_.Exception
throw $_.Exception
}
}

if($Operation -eq "START" -or $operation -eq "start")


{
"##### Starting #####"
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $AzureSSISName -Force
}
elseif($Operation -eq "STOP" -or $operation -eq "stop")
{
"##### Stopping #####"
Stop-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -Name $AzureSSISName -ResourceGroupName $ResourceGroupName -Force
}
"##### Completed #####"
4. Test your runbook by selecting the Start button on the toolbar (or start it from PowerShell, as sketched after this list).

5. In Start Runbook pane, do the following actions:


a. For RESOURCE GROUP NAME, enter the name of resource group that has your ADF with Azure-SSIS IR.
b. For DATA FACTORY NAME, enter the name of your ADF with Azure-SSIS IR.
c. For AZURESSISNAME, enter the name of Azure-SSIS IR.
d. For OPERATION, enter START.
e. Select OK.
6. In the job window, select Output tile. In the output window, wait for the message ##### Completed ##### after you see ##### Starting #####. Starting Azure-SSIS
IR takes approximately 20 minutes. Close Job window and get back to Runbook window.

7. Repeat the previous two steps using STOP as the value for OPERATION. Start your runbook again by selecting Start button on the toolbar. Enter your resource
group, ADF, and Azure-SSIS IR names. For OPERATION, enter STOP. In the output window, wait for the message ##### Completed ##### after you see #####
Stopping #####. Stopping Azure-SSIS IR does not take as long as starting it. Close Job window and get back to Runbook window.
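If you prefer to trigger the test from PowerShell instead of the portal (as noted in step 4), here is a hedged sketch using the Az.Automation module on your workstation; the account and resource names are placeholders.

# Start the runbook and pass the same parameters you would enter in the Start Runbook pane.
Start-AzAutomationRunbook `
    -ResourceGroupName "<Automation account resource group>" `
    -AutomationAccountName "<Automation account name>" `
    -Name "StartStopAzureSsisRuntime" `
    -Parameters @{
        ResourceGroupName = "<ADF resource group name>"
        DataFactoryName   = "<Data factory name>"
        AzureSSISName     = "<Azure-SSIS IR name>"
        Operation         = "START"
    }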

Create schedules for your runbook to start/stop Azure-SSIS IR


In the previous section, you have created your Azure Automation runbook that can either start or stop Azure-SSIS IR. In this section, you will create two schedules for your
runbook. When configuring the first schedule, you specify START for OPERATION. Similarly, when configuring the second one, you specify STOP for OPERATION. For
detailed steps to create schedules, see Create a schedule article.
1. In Runbook window, select Schedules, and select + Add a schedule on the toolbar.

2. In Schedule Runbook pane, do the following actions:


a. Select Link a schedule to your runbook.
b. Select Create a new schedule.
c. In New Schedule pane, enter Start IR daily for Name.
d. For Starts, enter a time that is a few minutes past the current time.
e. For Recurrence, select Recurring.
f. For Recur every, enter 1 and select Day.
g. Select Create.
3. Switch to Parameters and run settings tab. Specify your resource group, ADF, and Azure-SSIS IR names. For OPERATION, enter START and select OK. Select OK
again to see the schedule on Schedules page of your runbook.

4. Repeat the previous two steps to create a schedule named Stop IR daily. Enter a time that is at least 30 minutes after the time you specified for Start IR daily
schedule. For OPERATION, enter STOP and select OK. Select OK again to see the schedule on Schedules page of your runbook.
5. In Runbook window, select Jobs on the left menu. You should see the jobs created by your schedules at the specified times and their statuses. You can see the job
details, such as its output, similar to what you have seen after you tested your runbook.
6. After you are done testing, disable your schedules by editing them. Select Schedules on the left menu, select Start IR daily/Stop IR daily, and select No for
Enabled.

Next steps
See the following blog post:
Modernize and extend your ETL/ELT workflows with SSIS activities in ADF pipelines
See the following articles from SSIS documentation:
Deploy, run, and monitor an SSIS package on Azure
Connect to SSIS catalog on Azure
Schedule package execution on Azure
Connect to on-premises data sources with Windows authentication
Join an Azure-SSIS integration runtime to a virtual
network
4/16/2019 • 17 minutes to read

Join your Azure-SSIS integration runtime (IR) to an Azure virtual network in the following scenarios:
You want to connect to on-premises data stores from SSIS packages running on an Azure-SSIS
integration runtime.
You are hosting the SQL Server Integration Services (SSIS) catalog database in Azure SQL Database with
virtual network service endpoints/Managed Instance.
Azure Data Factory lets you join your Azure-SSIS integration runtime to a virtual network created
through the classic deployment model or the Azure Resource Manager deployment model.

IMPORTANT
The classic virtual network is currently being deprecated, so please use the Azure Resource Manager virtual network
instead. If you already use the classic virtual network, please switch to use the Azure Resource Manager virtual network as
soon as possible.

Access to on-premises data stores


If SSIS packages access only public cloud data stores, you don't need to join the Azure-SSIS IR to a virtual
network. If SSIS packages access on-premises data stores, you must join the Azure-SSIS IR to a virtual network
that is connected to the on-premises network.
Here are a few important points to note:
If there is no existing virtual network connected to your on-premises network, first create an Azure
Resource Manager virtual network or a classic virtual network for your Azure-SSIS integration runtime to
join. Then, configure a site-to-site VPN gateway connection or ExpressRoute connection from that virtual
network to your on-premises network.
If there is an existing Azure Resource Manager or classic virtual network connected to your on-premises
network in the same location as your Azure-SSIS IR, you can join the IR to that virtual network.
If there is an existing classic virtual network connected to your on-premises network in a different location
from your Azure-SSIS IR, you can first create a classic virtual network for your Azure-SSIS IR to join.
Then, configure a classic-to-classic virtual network connection. Or you can create an Azure Resource
Manager virtual network for your Azure-SSIS integration runtime to join. Then configure a classic-to-
Azure Resource Manager virtual network connection.
If there is an existing Azure Resource Manager virtual network connected to your on-premises network in
a different location from your Azure-SSIS IR, you can first create an Azure Resource Manager virtual
network for your Azure-SSIS IR to join. Then, configure an Azure Resource Manager-to-Azure Resource
Manager virtual network connection. Or, you can create a classic virtual network for your Azure-SSIS IR
to join. Then, configure a classic-to-Azure Resource Manager virtual network connection.

Host the SSIS Catalog database in Azure SQL Database with virtual
network service endpoints/Managed Instance
If the SSIS catalog is hosted in Azure SQL Database with virtual network service endpoints, or Managed
Instance, you can join your Azure-SSIS IR to:
The same virtual network
A different virtual network that has a network-to-network connection with the one that is used for the
Managed Instance
If you host your SSIS catalog in Azure SQL Database with virtual network service endpoints, make sure that you
join your Azure-SSIS IR to the same virtual network and subnet.
If you join your Azure-SSIS IR to the same virtual network as the Managed Instance, make sure that the Azure-
SSIS IR is in a different subnet than the Managed Instance. If you join your Azure-SSIS IR to a different virtual
network than the Managed Instance, we recommend either virtual network peering (which is limited to the same
region) or a virtual network to virtual network connection. See Connect your application to Azure SQL Database
Managed Instance.
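For the same-region case, here is a minimal sketch of creating the virtual network peering with Azure PowerShell; the network, resource group, and peering names are placeholders.

# Peer the Azure-SSIS IR virtual network and the Managed Instance virtual network in both directions.
$IrVnet = Get-AzVirtualNetwork -Name "<IR virtual network name>" -ResourceGroupName "<its resource group>"
$MiVnet = Get-AzVirtualNetwork -Name "<Managed Instance virtual network name>" -ResourceGroupName "<its resource group>"
Add-AzVirtualNetworkPeering -Name "SsisIrToMi" -VirtualNetwork $IrVnet -RemoteVirtualNetworkId $MiVnet.Id
Add-AzVirtualNetworkPeering -Name "MiToSsisIr" -VirtualNetwork $MiVnet -RemoteVirtualNetworkId $IrVnet.Id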
In all cases, the virtual network can only be deployed through the Azure Resource Manager deployment model.
The following sections provide more details.

Requirements for virtual network configuration


Make sure that Microsoft.Batch is a registered provider under the subscription that contains the virtual network subnet hosting the Azure-SSIS IR. If you are using a classic virtual network, also join MicrosoftAzureBatch to the Classic Virtual Machine Contributor role for that virtual network.

Make sure you have the required permissions. See Required permissions.
Select the proper subnet to host the Azure-SSIS IR. See Select the subnet.
If you are using your own Domain Name Services (DNS) server on the virtual network, see Domain Name Services server.
If you are using a Network Security Group (NSG) on the subnet, see Network security group.
If you are using Azure ExpressRoute or configuring a User Defined Route (UDR), see Use Azure ExpressRoute or User Defined Route.
Make sure the resource group of the virtual network can create and delete certain Azure network resources. See Requirements for Resource Group.
Here is a diagram showing the required connections for your Azure-SSIS IR:
Required permissions
The user who creates the Azure-SSIS Integration Runtime must have the following permissions:
If you're joining your SSIS IR to an Azure Resource Manager virtual network, you have two options:
Use the built-in Network Contributor role. This role comes with the Microsoft.Network/* permission, which has a much larger scope than necessary.
Create a custom role that includes only the necessary Microsoft.Network/virtualNetworks/*/join/action permission (see the sketch after this list).
If you're joining your SSIS IR to a classic virtual network, we recommend that you use the built-in Classic Virtual Machine Contributor role. Otherwise, you have to define a custom role that includes the permission to join the virtual network.
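Here is a minimal sketch of the custom-role option with Azure PowerShell; the role name and assignable scope are illustrative, not required values.

# Clone a built-in role as a template, then narrow it to the single join permission.
$Role = Get-AzRoleDefinition -Name "Network Contributor"
$Role.Id = $null
$Role.Name = "Virtual Network Joiner"   # hypothetical role name
$Role.Description = "Can join resources to subnets of a virtual network."
$Role.Actions.Clear()
$Role.Actions.Add("Microsoft.Network/virtualNetworks/*/join/action")
$Role.AssignableScopes.Clear()
$Role.AssignableScopes.Add("/subscriptions/<your subscription ID>")
New-AzRoleDefinition -Role $Role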
Select the subnet
Do not select the GatewaySubnet for deploying an Azure-SSIS Integration Runtime, because it is
dedicated for virtual network gateways.
Ensure that the subnet you select has sufficient available address space for the Azure-SSIS IR to use. Leave at least 2 * (the number of IR nodes) available IP addresses. Azure reserves some IP addresses within each subnet, and these addresses can't be used. The first and last IP addresses of each subnet are reserved for protocol conformance, along with three more addresses used for Azure services. For more information, see Are there any restrictions on using IP addresses within these subnets?.
Don't use a subnet that is exclusively occupied by other Azure services (for example, SQL Database Managed Instance, App Service, and so on).
Domain Name Services server
If you need to use your own Domain Name Services (DNS) server in a virtual network joined by your Azure-SSIS integration runtime, make sure it can resolve public Azure host names (for example, an Azure Storage blob name, <your storage account>.blob.core.windows.net).
The following steps are recommended:
Configure Custom DNS to forward requests to Azure DNS. You can forward unresolved DNS records to
the IP address of Azure's recursive resolvers (168.63.129.16) on your own DNS server.
Set up the Custom DNS as primary and Azure DNS as secondary for the virtual network. Register the IP
address of Azure's recursive resolvers (168.63.129.16) as a secondary DNS server in case your own DNS
server is unavailable.
For more info, see Name resolution that uses your own DNS server.
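As a quick, hedged check, you can verify resolution from a virtual machine inside the virtual network; the storage account name is a placeholder.

# Run on a VM in the virtual network that uses your custom DNS server;
# the public Azure host name should resolve to a public IP address.
Resolve-DnsName -Name "<your storage account>.blob.core.windows.net"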
Network security group
If you need to implement a network security group (NSG) for the subnet used by your Azure-SSIS integration runtime, allow inbound and outbound traffic through the following ports:

Direction: Inbound
Transport protocol: TCP
Source: AzureCloud (or a larger scope like Internet)
Source port range: *
Destination: VirtualNetwork
Destination port range: 29876, 29877 (if you join the IR to an Azure Resource Manager virtual network) or 10100, 20100, 30100 (if you join the IR to a classic virtual network)
Comments: The Data Factory service uses these ports to communicate with the nodes of your Azure-SSIS integration runtime in the virtual network. Whether you create a subnet-level NSG or not, Data Factory always configures an NSG at the level of the network interface cards (NICs) attached to the virtual machines that host the Azure-SSIS IR. Only inbound traffic from Data Factory IP addresses on the specified ports is allowed by that NIC-level NSG. Even if you open these ports to Internet traffic at the subnet level, traffic from IP addresses that are not Data Factory IP addresses is blocked at the NIC level.

Direction: Outbound
Transport protocol: TCP
Source: VirtualNetwork
Source port range: *
Destination: AzureCloud (or a larger scope like Internet)
Destination port range: 443
Comments: The nodes of your Azure-SSIS integration runtime in the virtual network use this port to access Azure services, such as Azure Storage and Azure Event Hubs.

Direction: Outbound
Transport protocol: TCP
Source: VirtualNetwork
Source port range: *
Destination: Internet
Destination port range: 80
Comments: The nodes of your Azure-SSIS integration runtime in the virtual network use this port to download a certificate revocation list from the Internet.

Direction: Outbound
Transport protocol: TCP
Source: VirtualNetwork
Source port range: *
Destination: Sql (or a larger scope like Internet)
Destination port range: 1433, 11000-11999, 14000-14999
Comments: The nodes of your Azure-SSIS integration runtime in the virtual network use these ports to access SSISDB hosted by your Azure SQL Database server. This purpose is not applicable to SSISDB hosted by Managed Instance.
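The following is a minimal sketch of creating these rules with Azure PowerShell for an Azure Resource Manager virtual network. The NSG name, rule names, and priorities are illustrative; adjust the inbound port list for a classic virtual network as noted above.

# Inbound rule: let the Data Factory service reach the IR nodes on ports 29876 and 29877.
$Inbound = New-AzNetworkSecurityRuleConfig -Name "Allow-DataFactory-Inbound" -Direction Inbound `
    -Access Allow -Protocol Tcp -Priority 100 -SourceAddressPrefix "AzureCloud" -SourcePortRange "*" `
    -DestinationAddressPrefix "VirtualNetwork" -DestinationPortRange "29876","29877"
# Outbound rules: Azure services over 443, certificate revocation list over 80, SSISDB over the SQL ports.
$OutAzure = New-AzNetworkSecurityRuleConfig -Name "Allow-AzureCloud-Outbound" -Direction Outbound `
    -Access Allow -Protocol Tcp -Priority 100 -SourceAddressPrefix "VirtualNetwork" -SourcePortRange "*" `
    -DestinationAddressPrefix "AzureCloud" -DestinationPortRange "443"
$OutCrl = New-AzNetworkSecurityRuleConfig -Name "Allow-CRL-Outbound" -Direction Outbound `
    -Access Allow -Protocol Tcp -Priority 110 -SourceAddressPrefix "VirtualNetwork" -SourcePortRange "*" `
    -DestinationAddressPrefix "Internet" -DestinationPortRange "80"
$OutSql = New-AzNetworkSecurityRuleConfig -Name "Allow-SSISDB-Outbound" -Direction Outbound `
    -Access Allow -Protocol Tcp -Priority 120 -SourceAddressPrefix "VirtualNetwork" -SourcePortRange "*" `
    -DestinationAddressPrefix "Sql" -DestinationPortRange "1433","11000-11999","14000-14999"
New-AzNetworkSecurityGroup -Name "ssisir-nsg" -ResourceGroupName "<your resource group>" `
    -Location "<your region>" -SecurityRules $Inbound,$OutAzure,$OutCrl,$OutSql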

Use Azure ExpressRoute or User Defined Route


You can connect an Azure ExpressRoute circuit to your virtual network infrastructure to extend your on-premises
network to Azure.
A common configuration is to use forced tunneling (advertise a BGP route, 0.0.0.0/0, to the virtual network), which forces outbound Internet traffic from the virtual network to flow to an on-premises network appliance for inspection and logging. This traffic flow breaks connectivity between the Azure-SSIS IR in the virtual network and dependent Azure Data Factory services. The solution is to define one (or more) user-defined routes (UDRs) on the subnet that contains the Azure-SSIS IR. A UDR defines subnet-specific routes that are honored instead of the BGP route.
Or you can define user-defined routes (UDRs) to force outbound Internet traffic from the subnet that hosts the Azure-SSIS IR to another subnet, which hosts a virtual network appliance as a firewall or a DMZ host for inspection and logging.
In both cases, apply a 0.0.0.0/0 route with the next hop type as Internet on the subnet that hosts the Azure-SSIS IR, so that communication between the Data Factory service and the Azure-SSIS IR can succeed.

If you're concerned about losing the ability to inspect outbound Internet traffic from that subnet, you can also add
an NSG rule on the subnet to restrict outbound destinations to Azure data center IP addresses.
See this PowerShell script for an example. You have to run the script weekly to keep the Azure data center IP
address list up-to-date.
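Here is a minimal sketch of the 0.0.0.0/0 route described above; the route table and route names are illustrative.

# Create a route table whose default route sends 0.0.0.0/0 to Internet,
# so the Data Factory service can still reach the Azure-SSIS IR nodes.
$Route = New-AzRouteConfig -Name "DefaultToInternet" -AddressPrefix "0.0.0.0/0" -NextHopType Internet
$RouteTable = New-AzRouteTable -Name "ssisir-routes" -ResourceGroupName "<your resource group>" `
    -Location "<your region>" -Route $Route
# Then associate $RouteTable with the subnet that hosts the Azure-SSIS IR (for example, with Set-AzVirtualNetworkSubnetConfig).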
Requirements for Resource Group
The Azure-SSIS IR needs to create certain network resources under the same resource group as the
virtual network. These resources include the following:
An Azure load balancer, with the name <Guid>-azurebatch-cloudserviceloadbalancer.
An Azure public IP address, with the name <Guid>-azurebatch-cloudservicepublicip.
A network security group, with the name <Guid>-azurebatch-cloudservicenetworksecuritygroup.
Make sure that you don't have any resource lock on the Resource Group or Subscription to which the virtual network belongs. If you configure either a read-only lock or a delete lock, starting and stopping the IR may fail or stop responding. (A quick way to check for locks is sketched after this list.)
Make sure that you don't have an Azure policy which prevents the following resources from being created
under the Resource Group or Subscription to which the virtual network belongs:
Microsoft.Network/LoadBalancers
Microsoft.Network/NetworkSecurityGroups
Microsoft.Network/PublicIPAddresses
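As a quick check for the resource-lock requirement above, the following sketch lists any locks on the resource group of your virtual network; the resource group name is a placeholder.

# Any read-only or delete lock returned here can make starting or stopping the IR fail or hang.
Get-AzResourceLock -ResourceGroupName "<resource group of your virtual network>"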

Azure portal (Data Factory UI)


This section shows you how to join an existing Azure-SSIS runtime to a virtual network (classic or Azure
Resource Manager) by using the Azure portal and Data Factory UI. First, you need to configure the virtual
network appropriately before joining your Azure-SSIS IR to it. Go through one of the next two sections based on
the type of your virtual network (classic or Azure Resource Manager). Then, continue with the third section to
join your Azure-SSIS IR to the virtual network.
Use the portal to configure an Azure Resource Manager virtual network
You need to configure a virtual network before you can join an Azure-SSIS IR to it.
1. Start Microsoft Edge or Google Chrome. Currently, the Data Factory UI is supported only in those web
browsers.
2. Sign in to the Azure portal.
3. Select More services. Filter for and select Virtual networks.
4. Filter for and select your virtual network in the list.
5. On the Virtual network page, select Properties.
6. Select the copy button for RESOURCE ID to copy the resource ID for the virtual network to the clipboard.
Save the ID from the clipboard in OneNote or a file.
7. Select Subnets on the left menu. Ensure that the number of available addresses is greater than the
nodes in your Azure-SSIS integration runtime.
8. Verify that the Azure Batch provider is registered in the Azure subscription that has the virtual network.
Or, register the Azure Batch provider. If you already have an Azure Batch account in your subscription,
then your subscription is registered for Azure Batch. (If you create the Azure-SSIS IR in the Data Factory
portal, the Azure Batch provider is automatically registered for you.)
a. In Azure portal, select Subscriptions on the left menu.
b. Select your subscription.
c. Select Resource providers on the left, and confirm that Microsoft.Batch is a registered provider.

If you don't see Microsoft.Batch in the list, to register it, create an empty Azure Batch account in your
subscription. You can delete it later.
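If you prefer not to create a placeholder Azure Batch account, you can also register the provider directly with Azure PowerShell; run it against the subscription that contains the virtual network.

# Register the Azure Batch resource provider in the current subscription.
Register-AzResourceProvider -ProviderNamespace Microsoft.Batch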
Use the portal to configure a classic virtual network
You need to configure a virtual network before you can join an Azure-SSIS IR to it.
1. Start Microsoft Edge or Google Chrome. Currently, the Data Factory UI is supported only in these web
browsers.
2. Sign in to the Azure portal.
3. Select More services. Filter for and select Virtual networks (classic).
4. Filter for and select your virtual network in the list.
5. On the Virtual network (classic) page, select Properties.

6. Select the copy button for RESOURCE ID to copy the resource ID for the classic network to the clipboard.
Save the ID from the clipboard in OneNote or a file.
7. Select Subnets on the left menu. Ensure that the number of available addresses is greater than the
nodes in your Azure-SSIS integration runtime.

8. Join MicrosoftAzureBatch to the Classic Virtual Machine Contributor role for the virtual network.
a. Select Access control (IAM ) on the left menu, and select the Role assignments tab.

b. Select Add role assignment.


c. On the Add role assignment page, select Classic Virtual Machine Contributor for Role. Paste
ddbf3205-c6bd-46ae-8127-60eb93363864 in the Select box, and then select Microsoft Azure Batch
from the list of search results.

d. Select Save to save the settings and to close the page.

e. Confirm that you see Microsoft Azure Batch in the list of contributors.

9. Verify that the Azure Batch provider is registered in the Azure subscription that has the virtual network.
Or, register the Azure Batch provider. If you already have an Azure Batch account in your subscription,
then your subscription is registered for Azure Batch. (If you create the Azure-SSIS IR in the Data Factory
portal, the Azure Batch provider is automatically registered for you.)
a. In Azure portal, select Subscriptions on the left menu.
b. Select your subscription.
c. Select Resource providers on the left, and confirm that Microsoft.Batch is a registered provider.

If you don't see Microsoft.Batch in the list, to register it, create an empty Azure Batch account in your
subscription. You can delete it later.
Join the Azure-SSIS IR to a virtual network
1. Start Microsoft Edge or Google Chrome. Currently, the Data Factory UI is supported only in those web
browsers.
2. In the Azure portal, select Data factories on the left menu. If you don't see Data factories on the menu, select More services, and then select Data factories in the INTELLIGENCE + ANALYTICS section.
3. Select your data factory with the Azure-SSIS integration runtime in the list. You see the home page for
your data factory. Select the Author & Deploy tile. You see the Data Factory UI on a separate tab.

4. In the Data Factory UI, switch to the Edit tab, select Connections, and switch to the Integration
Runtimes tab.
5. If your Azure-SSIS IR is running, in the integration runtime list, select the Stop button in the Actions
column for your Azure-SSIS IR. You cannot edit an IR until you stop it.

6. In the integration runtime list, select the Edit button in the Actions column for your Azure-SSIS IR.

7. On the General Settings page of the Integration Runtime Setup window, select Next.
8. On the SQL Settings page, enter the administrator password, and select Next.
9. On the Advanced Settings page, do the following actions:
a. Select the check box for Select a VNet for your Azure-SSIS Integration Runtime to join and
allow Azure services to configure VNet permissions/settings.
b. For Type, select whether the virtual network is a classic virtual network or an Azure Resource Manager
virtual network.
c. For VNet Name, select your virtual network.
d. For Subnet Name, select your subnet in the virtual network.
e. Select VNet Validation and, if it succeeds, select Update.
10. Now, you can start the IR by using the Start button in the Actions column for your Azure-SSIS IR. It
takes approximately 20 to 30 minutes to start an Azure-SSIS IR.

Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which
will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.

Configure a virtual network


You need to configure a virtual network before you can join your Azure-SSIS IR to it. To automatically configure
virtual network permissions/settings for your Azure-SSIS integration runtime to join the virtual network, add the
following script:

# Make sure to run this script against the subscription to which the virtual network belongs.
if(![string]::IsNullOrEmpty($VnetId) -and ![string]::IsNullOrEmpty($SubnetName))
{
    # Register to the Azure Batch resource provider
    $BatchApplicationId = "ddbf3205-c6bd-46ae-8127-60eb93363864"
    $BatchObjectId = (Get-AzADServicePrincipal -ServicePrincipalName $BatchApplicationId).Id
    Register-AzResourceProvider -ProviderNamespace Microsoft.Batch
    while(!(Get-AzResourceProvider -ProviderNamespace "Microsoft.Batch").RegistrationState.Contains("Registered"))
    {
        Start-Sleep -s 10
    }
    if($VnetId -match "/providers/Microsoft.ClassicNetwork/")
    {
        # Assign the VM contributor role to Microsoft.Batch
        New-AzRoleAssignment -ObjectId $BatchObjectId -RoleDefinitionName "Classic Virtual Machine Contributor" -Scope $VnetId
    }
}

Create an Azure-SSIS IR and join it to a virtual network


You can create an Azure-SSIS IR and join it to a virtual network at the same time. For the complete script and
instructions, see Create an Azure-SSIS integration runtime.
Join an existing Azure-SSIS IR to a virtual network
The script in the Create an Azure-SSIS integration runtime article shows you how to create an Azure-SSIS IR
and join it to a virtual network in the same script. If you have an existing Azure-SSIS IR, perform the following
steps to join it to the virtual network:
1. Stop the Azure-SSIS IR.
2. Configure the Azure-SSIS IR to join the virtual network.
3. Start the Azure-SSIS IR.
Define the variables

$ResourceGroupName = "<your Azure resource group name>"
$DataFactoryName = "<your Data Factory name>"
$AzureSSISName = "<your Azure-SSIS IR name>"
# Specify the information about your classic or Azure Resource Manager virtual network.
$VnetId = "<your Azure virtual network resource ID>"
$SubnetName = "<the name of subnet in your virtual network>"

Stop the Azure-SSIS IR


You must stop the Azure-SSIS integration runtime before you can join it to a virtual network. This command releases all of its nodes and stops billing:

Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Force

Configure virtual network settings for the Azure-SSIS IR to join


# Make sure to run this script against the subscription to which the virtual network belongs.
if(![string]::IsNullOrEmpty($VnetId) -and ![string]::IsNullOrEmpty($SubnetName))
{
    # Register to the Azure Batch resource provider
    $BatchApplicationId = "ddbf3205-c6bd-46ae-8127-60eb93363864"
    $BatchObjectId = (Get-AzADServicePrincipal -ServicePrincipalName $BatchApplicationId).Id
    Register-AzResourceProvider -ProviderNamespace Microsoft.Batch
    while(!(Get-AzResourceProvider -ProviderNamespace "Microsoft.Batch").RegistrationState.Contains("Registered"))
    {
        Start-Sleep -s 10
    }
    if($VnetId -match "/providers/Microsoft.ClassicNetwork/")
    {
        # Assign VM contributor role to Microsoft.Batch
        New-AzRoleAssignment -ObjectId $BatchObjectId -RoleDefinitionName "Classic Virtual Machine Contributor" -Scope $VnetId
    }
}

Configure the Azure-SSIS IR


To configure the Azure-SSIS integration runtime to join the virtual network, run the
Set-AzDataFactoryV2IntegrationRuntime command:

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Type Managed `
    -VnetId $VnetId `
    -Subnet $SubnetName

Start the Azure-SSIS IR


To start the Azure-SSIS integration runtime, run the following command:

Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Force

This command takes 20 to 30 minutes to finish.

Next steps
For more information about the Azure-SSIS runtime, see the following topics:
Azure-SSIS integration runtime. This article provides conceptual information about integration runtimes in
general, including the Azure-SSIS IR.
Tutorial: deploy SSIS packages to Azure. This article provides step-by-step instructions to create an Azure-
SSIS IR. It uses Azure SQL Database to host the SSIS catalog.
Create an Azure-SSIS integration runtime. This article expands on the tutorial and provides instructions on
using Azure SQL Database with virtual network service endpoints/Managed Instance to host the SSIS
catalog and joining the IR to a virtual network.
Monitor an Azure-SSIS IR. This article shows you how to retrieve information about an Azure-SSIS IR and
descriptions of statuses in the returned information.
Manage an Azure-SSIS IR. This article shows you how to stop, start, or remove an Azure-SSIS IR. It also
shows you how to scale out your Azure-SSIS IR by adding nodes.
Enable Azure Active Directory authentication for
Azure-SSIS Integration Runtime
5/13/2019 • 7 minutes to read • Edit Online

This article shows you how to enable Azure Active Directory (Azure AD ) authentication with the managed identity
for your Azure Data Factory (ADF ) and use it instead of SQL authentication to create an Azure-SSIS Integration
Runtime (IR ) that will in turn provision SSIS catalog database (SSISDB ) in Azure SQL Database server/Managed
Instance on your behalf.
For more info about the managed identity for your ADF, see Managed identity for Data Factory.

NOTE
In this scenario, Azure AD authentication with the managed identity for your ADF is only used in the creation and
subsequent starting operations of your SSIS IR that will in turn provision and connect to SSISDB. For SSIS package
executions, your SSIS IR will still connect to SSISDB using SQL authentication with fully managed accounts that are created
during SSISDB provisioning.
If you have already created your SSIS IR using SQL authentication, you cannot reconfigure it to use Azure AD authentication via PowerShell at this time, but you can do so via the Azure portal/ADF app.

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Enable Azure AD on Azure SQL Database


Azure SQL Database server supports creating a database with an Azure AD user. First, you need to create an Azure
AD group with the managed identity for your ADF as a member. Next, you need to set an Azure AD user as the
Active Directory admin for your Azure SQL Database server and then connect to it on SQL Server Management
Studio (SSMS ) using that user. Finally, you need to create a contained user representing the Azure AD group, so
the managed identity for your ADF can be used by Azure-SSIS IR to create SSISDB on your behalf.
Create an Azure AD group with the managed identity for your ADF as a member
You can use an existing Azure AD group or create a new one using Azure AD PowerShell.
1. Install the Azure AD PowerShell module.
2. Sign in using Connect-AzureAD , run the following cmdlet to create a group, and save it in a variable:

$Group = New-AzureADGroup -DisplayName "SSISIrGroup" `
    -MailEnabled $false `
    -SecurityEnabled $true `
    -MailNickName "NotSet"

The result looks like the following example, which also displays the variable value:
$Group

ObjectId DisplayName Description


-------- ----------- -----------
6de75f3c-8b2f-4bf4-b9f8-78cc60a18050 SSISIrGroup

3. Add the managed identity for your ADF to the group. You can follow the article Managed identity for Data Factory to get the principal Managed Identity Object ID (for example, 765ad4ab-XXXX-XXXX-XXXX-51ed985819dc, but do not use the Managed Identity Application ID for this purpose).

Add-AzureAdGroupMember -ObjectId $Group.ObjectId -RefObjectId 765ad4ab-XXXX-XXXX-XXXX-51ed985819dc

You can also check the group membership afterwards.

Get-AzureAdGroupMember -ObjectId $Group.ObjectId
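The previous step needs the Managed Identity Object ID of your data factory. Here is a hedged sketch of one way to look it up with Azure PowerShell; the resource group and factory names are placeholders.

# Returns the principal (object) ID of the managed identity for the data factory.
(Get-AzDataFactoryV2 -ResourceGroupName "<your resource group>" -Name "<your data factory>").Identity.PrincipalId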

Configure Azure AD authentication for Azure SQL Database server


You can Configure and manage Azure AD authentication with SQL using the following steps:
1. In Azure portal, select All services -> SQL servers from the left-hand navigation.
2. Select your Azure SQL Database server to be configured with Azure AD authentication.
3. In the Settings section of the blade, select Active Directory admin.
4. In the command bar, select Set admin.
5. Select an Azure AD user account to be made administrator of the server, and then select Select.
6. In the command bar, select Save.
Create a contained user in Azure SQL Database server representing the Azure AD group
For this next step, you need Microsoft SQL Server Management Studio (SSMS ).
1. Start SSMS.
2. In the Connect to Server dialog, enter your Azure SQL Database server name in the Server name field.
3. In the Authentication field, select Active Directory - Universal with MFA support (you can also use the
other two Active Directory authentication types, see Configure and manage Azure AD authentication with
SQL ).
4. In the User name field, enter the name of Azure AD account that you set as the server administrator, e.g.
testuser@xxxonline.com.
5. Select Connect and complete the sign-in process.
6. In the Object Explorer, expand the Databases -> System Databases folder.
7. Right-click on master database and select New query.
8. In the query window, enter the following T-SQL command, and select Execute on the toolbar.

CREATE USER [SSISIrGroup] FROM EXTERNAL PROVIDER

The command should complete successfully, creating a contained user to represent the group.
9. Clear the query window, enter the following T-SQL command, and select Execute on the toolbar.

ALTER ROLE dbmanager ADD MEMBER [SSISIrGroup]

The command should complete successfully, granting the contained user the ability to create a database
(SSISDB ).
10. If your SSISDB was created using SQL authentication and you want to switch to use Azure AD
authentication for your Azure-SSIS IR to access it, right-click on SSISDB database and select New query.
11. In the query window, enter the following T-SQL command, and select Execute on the toolbar.

CREATE USER [SSISIrGroup] FROM EXTERNAL PROVIDER

The command should complete successfully, creating a contained user to represent the group.
12. Clear the query window, enter the following T-SQL command, and select Execute on the toolbar.

ALTER ROLE db_owner ADD MEMBER [SSISIrGroup]

The command should complete successfully, granting the contained user the ability to access SSISDB.

Enable Azure AD on Azure SQL Database Managed Instance


Azure SQL Database Managed Instance supports creating a database with the managed identity for your ADF
directly. You need not join the managed identity for your ADF to an Azure AD group nor create a contained user
representing that group in your Managed Instance.
Configure Azure AD authentication for Azure SQL Database Managed Instance
1. In Azure portal, select All services -> SQL servers from the left-hand navigation.
2. Select your Managed Instance to be configured with Azure AD authentication.
3. In the Settings section of the blade, select Active Directory admin.
4. In the command bar, select Set admin.
5. Select an Azure AD user account to be made administrator of the server, and then select Select.
6. In the command bar, select Save.
Add the managed identity for your ADF as a user in Azure SQL Database Managed Instance
For this next step, you need Microsoft SQL Server Management Studio (SSMS ).
1. Start SSMS.
2. Connect to your Managed Instance using your SQL/Active Directory admin account.
3. In the Object Explorer, expand the Databases -> System Databases folder.
4. Right-click on master database and select New query.
5. Get the managed identity for your ADF. You can follow the article Managed identity for Data Factory to get the principal Managed Identity Application ID (but do not use the Managed Identity Object ID for this purpose).
6. In the query window, execute the following T-SQL script to convert the managed identity for your ADF to
binary type:
DECLARE @applicationId uniqueidentifier = '{your Managed Identity Application ID}'
select CAST(@applicationId AS varbinary)

The command should complete successfully, displaying the managed identity for your ADF as binary.
7. Clear the query window and execute the following T-SQL script to add the managed identity for your ADF
as a user

CREATE LOGIN [{a name for the managed identity}] FROM EXTERNAL PROVIDER WITH SID = {your Managed Identity Application ID as binary}, TYPE = E
ALTER SERVER ROLE [dbcreator] ADD MEMBER [{the managed identity name}]
ALTER SERVER ROLE [securityadmin] ADD MEMBER [{the managed identity name}]

The command should complete successfully, granting the managed identity for your ADF the ability to
create a database (SSISDB ).
8. If your SSISDB was created using SQL authentication and you want to switch to use Azure AD
authentication for your Azure-SSIS IR to access it, right-click on SSISDB database and select New query.
9. In the query window, enter the following T-SQL command, and select Execute on the toolbar.

CREATE USER [{the managed identity name}] FOR LOGIN [{the managed identity name}] WITH DEFAULT_SCHEMA = dbo
ALTER ROLE db_owner ADD MEMBER [{the managed identity name}]

The command should complete successfully, granting the managed identity for your ADF the ability to
access SSISDB.

Provision Azure-SSIS IR in Azure portal/ADF app


When you provision your Azure-SSIS IR in Azure portal/ADF app, on SQL Settings page, select Use AAD
authentication with the managed identity for your ADF option. The following screenshot shows the settings
for IR with Azure SQL Database server hosting SSISDB. For IR with Managed Instance hosting SSISDB, the
Catalog Database Service Tier and Allow Azure services to access settings are not applicable, while other
settings are the same.
For more info about how to create an Azure-SSIS IR, see Create an Azure-SSIS integration runtime in Azure Data
Factory.
Provision Azure-SSIS IR with PowerShell
To provision your Azure-SSIS IR with PowerShell, do the following things:
1. Install Azure PowerShell module.
2. In your script, do not set CatalogAdminCredential parameter. For example:

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Description $AzureSSISDescription `
    -Type Managed `
    -Location $AzureSSISLocation `
    -NodeSize $AzureSSISNodeSize `
    -NodeCount $AzureSSISNodeNumber `
    -Edition $AzureSSISEdition `
    -MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode `
    -CatalogServerEndpoint $SSISDBServerEndpoint `
    -CatalogPricingTier $SSISDBPricingTier

Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName
Provision Enterprise Edition for the Azure-SSIS
Integration Runtime
3/5/2019 • 3 minutes to read • Edit Online

The Enterprise Edition of the Azure-SSIS Integration Runtime lets you use the following advanced and premium
features:
Change Data Capture (CDC ) components
Oracle, Teradata, and SAP BW connectors
SQL Server Analysis Services (SSAS ) and Azure Analysis Services (AAS ) connectors and transformations
Fuzzy Grouping and Fuzzy Lookup transformations
Term Extraction and Term Lookup transformations
Some of these features require you to install additional components to customize the Azure-SSIS IR. For more
info about how to install additional components, see Custom setup for the Azure-SSIS integration runtime.

Enterprise features
CDC components: The CDC Source, Control Task, and Splitter Transformation are preinstalled on the Azure-SSIS IR Enterprise Edition. To connect to Oracle, you also need to install the CDC Designer and Service on another computer.

Oracle connectors: The Oracle Connection Manager, Source, and Destination are preinstalled on the Azure-SSIS IR Enterprise Edition. You also need to install the Oracle Call Interface (OCI) driver, and if necessary configure the Oracle Transport Network Substrate (TNS), on the Azure-SSIS IR. For more info, see Custom setup for the Azure-SSIS integration runtime.

Teradata connectors: You need to install the Teradata Connection Manager, Source, and Destination, as well as the Teradata Parallel Transporter (TPT) API and Teradata ODBC driver, on the Azure-SSIS IR Enterprise Edition. For more info, see Custom setup for the Azure-SSIS integration runtime.

SAP BW connectors: The SAP BW Connection Manager, Source, and Destination are preinstalled on the Azure-SSIS IR Enterprise Edition. You also need to install the SAP BW driver on the Azure-SSIS IR. These connectors support SAP BW 7.0 or earlier versions. To connect to later versions of SAP BW or other SAP products, you can purchase and install SAP connectors from third-party ISVs on the Azure-SSIS IR. For more info about how to install additional components, see Custom setup for the Azure-SSIS integration runtime.

Analysis Services components: The Data Mining Model Training Destination, the Dimension Processing Destination, and the Partition Processing Destination, as well as the Data Mining Query Transformation, are preinstalled on the Azure-SSIS IR Enterprise Edition. All these components support SQL Server Analysis Services (SSAS), but only the Partition Processing Destination supports Azure Analysis Services (AAS). To connect to SSAS, you also need to configure Windows Authentication credentials in SSISDB. In addition to these components, the Analysis Services Execute DDL Task, the Analysis Services Processing Task, and the Data Mining Query Task are also preinstalled on the Azure-SSIS IR Standard/Enterprise Edition.

Fuzzy Grouping and Fuzzy Lookup transformations: The Fuzzy Grouping and Fuzzy Lookup transformations are preinstalled on the Azure-SSIS IR Enterprise Edition. These components support both SQL Server and Azure SQL Database for storing reference data.

Term Extraction and Term Lookup transformations: The Term Extraction and Term Lookup transformations are preinstalled on the Azure-SSIS IR Enterprise Edition. These components support both SQL Server and Azure SQL Database for storing reference data.

Instructions
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

1. Download and install Azure PowerShell.


2. When you provision or reconfigure the Azure-SSIS IR with PowerShell, run
Set-AzDataFactoryV2IntegrationRuntime with Enterprise as the value for the Edition parameter before you
start the Azure-SSIS IR. Here is a sample script:

$MyAzureSsisIrEdition = "Enterprise"

Set-AzDataFactoryV2IntegrationRuntime -DataFactoryName $MyDataFactoryName `
    -Name $MyAzureSsisIrName `
    -ResourceGroupName $MyResourceGroupName `
    -Edition $MyAzureSsisIrEdition

Start-AzDataFactoryV2IntegrationRuntime -DataFactoryName $MyDataFactoryName `
    -Name $MyAzureSsisIrName `
    -ResourceGroupName $MyResourceGroupName

Next steps
Custom setup for the Azure-SSIS integration runtime
How to develop paid or licensed custom components for the Azure-SSIS integration runtime
Customize setup for the Azure-SSIS integration
runtime
4/24/2019 • 9 minutes to read • Edit Online

Custom setup for the Azure-SSIS Integration Runtime provides an interface to add your own setup steps during the provisioning or reconfiguration of your Azure-SSIS IR. Custom setup lets you alter the default operating configuration or environment (for example, to start additional Windows services or persist access credentials for file shares) or install additional components (for example, assemblies, drivers, or extensions) on each node of your Azure-SSIS IR.
You configure your custom setup by preparing a script and its associated files, and uploading them into a blob
container in your Azure Storage account. You provide a Shared Access Signature (SAS ) Uniform Resource
Identifier (URI) for your container when you provision or reconfigure your Azure-SSIS IR. Each node of your
Azure-SSIS IR then downloads the script and its associated files from your container and runs your custom
setup with elevated privileges. When custom setup is finished, each node uploads the standard output of
execution and other logs into your container.
You can install both free or unlicensed components, and paid or licensed components. If you're an ISV, see How
to develop paid or licensed components for the Azure-SSIS IR.

IMPORTANT
The v2-series nodes of Azure-SSIS IR are not suitable for custom setup, so please use the v3-series nodes instead. If you
already use the v2-series nodes, please switch to use the v3-series nodes as soon as possible.

Current limitations
If you want to use gacutil.exe to install assemblies in the Global Assembly Cache (GAC ), you need to
provide gacutil.exe as part of your custom setup, or use the copy provided in the Public Preview
container.
If you want to reference a subfolder in your script, msiexec.exe does not support the .\ notation to
reference the root folder. Use a command like msiexec /i "MySubfolder\MyInstallerx64.msi" ... instead
of msiexec /i ".\MySubfolder\MyInstallerx64.msi" ... .
If you need to join your Azure-SSIS IR with custom setup to a virtual network, only Azure Resource
Manager virtual network is supported. Classic virtual network is not supported.
Administrative share is currently not supported on the Azure-SSIS IR.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which
will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.
To customize your Azure-SSIS IR, you need the following things:
Azure subscription
An Azure SQL Database or Managed Instance server
Provision your Azure-SSIS IR
An Azure Storage account. For custom setup, you upload and store your custom setup script and its
associated files in a blob container. The custom setup process also uploads its execution logs to the same
blob container.

Instructions
1. Download and install Azure PowerShell.
2. Prepare your custom setup script and its associated files (for example, .bat, .cmd, .exe, .dll, .msi, or .ps1
files).
a. You must have a script file named main.cmd , which is the entry point of your custom setup.
b. If you want additional logs generated by other tools (for example, msiexec.exe ) to be uploaded
into your container, specify the predefined environment variable, CUSTOM_SETUP_SCRIPT_LOG_DIR as
the log folder in your scripts (for example,
msiexec /i xxx.msi /quiet /lv %CUSTOM_SETUP_SCRIPT_LOG_DIR%\install.log ).

3. Download, install, and launch Azure Storage Explorer.


a. Under (Local and Attached), right-select Storage Accounts and select Connect to Azure
storage.

b. Select Use a storage account name and key and select Next.
c. Enter your Azure Storage account name and key, select Next, and then select Connect.

d. Under your connected Azure Storage account, right-click on Blob Containers, select Create Blob
Container, and name the new container.
e. Select the new container and upload your custom setup script and its associated files. Make sure
that you upload main.cmd at the top level of your container, not in any folder. Please also ensure
that your container contains only the necessary custom setup files, so downloading them onto
your Azure-SSIS IR later will not take a long time. The maximum period for custom setup is
currently set at 45 minutes before it times out and this includes the time to download all files from
your container and install them on Azure-SSIS IR. If a longer period is needed, please raise a
support ticket.

f. Right-click the container and select Get Shared Access Signature.


g. Create the SAS URI for your container with a sufficiently long expiry time and with read + write + list permissions. You need the SAS URI to download and run your custom setup script and its associated files whenever any node of your Azure-SSIS IR is reimaged or restarted, and you need write permission to upload setup execution logs. (For one way to generate such a SAS URI, see the PowerShell sketch after this list.)

IMPORTANT
Please ensure that the SAS URI does not expire and custom setup resources are always available during the
whole lifecycle of your Azure-SSIS IR, from creation to deletion, especially if you regularly stop and start
your Azure-SSIS IR during this period.

h. Copy and save the SAS URI of your container.


i. When you provision or reconfigure your Azure-SSIS IR with Data Factory UI, before you start
your Azure-SSIS IR, enter the SAS URI of your container in the appropriate field on Advanced
Settings panel:
When you provision or reconfigure your Azure-SSIS IR with PowerShell, before you start your
Azure-SSIS IR, run the Set-AzDataFactoryV2IntegrationRuntime cmdlet with the SAS URI of your
container as the value for new SetupScriptContainerSasUri parameter. For example:

Set-AzDataFactoryV2IntegrationRuntime -DataFactoryName $MyDataFactoryName `
    -Name $MyAzureSsisIrName `
    -ResourceGroupName $MyResourceGroupName `
    -SetupScriptContainerSasUri $MySetupScriptContainerSasUri

Start-AzDataFactoryV2IntegrationRuntime -DataFactoryName $MyDataFactoryName `
    -Name $MyAzureSsisIrName `
    -ResourceGroupName $MyResourceGroupName

j. After your custom setup finishes and your Azure-SSIS IR starts, you can find the standard output
of main.cmd and other execution logs in the main.cmd.log folder of your storage container.
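Here is a hedged sketch of one way to generate such a SAS URI with Azure PowerShell instead of Azure Storage Explorer; the account, key, and container names are placeholders, and you should set the expiry to cover the whole lifecycle of your Azure-SSIS IR.

# Build a container-level SAS URI with read, write, and list permissions.
$Context = New-AzStorageContext -StorageAccountName "<your storage account>" -StorageAccountKey "<your account key>"
New-AzStorageContainerSASToken -Name "<your container>" -Permission rwl `
    -ExpiryTime (Get-Date).AddYears(5) -FullUri -Context $Context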
4. To see other custom setup examples, connect to the Public Preview container with Azure Storage
Explorer.
a. Under (Local and Attached), right-click Storage Accounts, select Connect to Azure storage, select
Use a connection string or a shared access signature URI, and then select Next.

b. Select Use a SAS URI and enter the following SAS URI for the Public Preview container. Select Next, and then select Connect.
https://ssisazurefileshare.blob.core.windows.net/publicpreview?sp=rl&st=2018-04-08T14%3A10%3A00Z&se=2020-04-10T14%3A10%3A00Z&sv=2017-04-17&sig=mFxBSnaYoIlMmWfxu9iMlgKIvydn85moOnOch6%2F%2BheE%3D&sr=c

c. Select the connected Public Preview container and double-click the CustomSetupScript folder. In this
folder are the following items:
a. A Sample folder, which contains a custom setup to install a basic task on each node of your Azure-
SSIS IR. The task does nothing but sleep for a few seconds. The folder also contains a gacutil
folder, the whole content of which ( gacutil.exe , gacutil.exe.config , and 1033\gacutlrc.dll ) can
be copied as is into your container. Additionally, main.cmd contains comments to persist access
credentials for file shares.
b. A UserScenarios folder, which contains several custom setups for real user scenarios.

d. Double-click the UserScenarios folder. In this folder are the following items:
a. A .NET FRAMEWORK 3.5 folder, which contains a custom setup to install an earlier version of the .NET
Framework that might be required for custom components on each node of your Azure-SSIS IR.
b. A BCP folder, which contains a custom setup to install SQL Server command-line utilities (MsSqlCmdLnUtils.msi), including the bulk copy program (bcp), on each node of your Azure-SSIS IR.
c. An EXCEL folder, which contains a custom setup to install open-source assemblies (DocumentFormat.OpenXml.dll, ExcelDataReader.DataSet.dll, and ExcelDataReader.dll) on each node of your Azure-SSIS IR.
d. An ORACLE ENTERPRISE folder, which contains a custom setup script ( main.cmd ) and silent install
config file ( client.rsp ) to install the Oracle connectors and OCI driver on each node of your
Azure-SSIS IR Enterprise Edition. This setup lets you use the Oracle Connection Manager, Source,
and Destination. First, download Microsoft Connectors v5.0 for Oracle (
AttunitySSISOraAdaptersSetup.msi and AttunitySSISOraAdaptersSetup64.msi ) from Microsoft
Download Center and the latest Oracle client - for example, winx64_12102_client.zip - from
Oracle, then upload them all together with main.cmd and client.rsp into your container. If you
use TNS to connect to Oracle, you also need to download tnsnames.ora , edit it, and upload it into
your container, so it can be copied into the Oracle installation folder during setup.
e. An ORACLE STANDARD ADO.NET folder, which contains a custom setup script ( main.cmd ) to install the
Oracle ODP.NET driver on each node of your Azure-SSIS IR. This setup lets you use the
ADO.NET Connection Manager, Source, and Destination. First, download the latest Oracle
ODP.NET driver - for example, ODP.NET_Managed_ODAC122cR1.zip - from Oracle, and then upload it
together with main.cmd into your container.
f. An ORACLE STANDARD ODBC folder, which contains a custom setup script ( main.cmd ) to install the
Oracle ODBC driver and configure DSN on each node of your Azure-SSIS IR. This setup lets you
use the ODBC Connection Manager/Source/Destination or Power Query Connection
Manager/Source with ODBC data source kind to connect to Oracle server. First, download the
latest Oracle Instant Client (Basic Package or Basic Lite Package) and ODBC Package - for
example, the 64-bit packages from here (Basic Package:
instantclient-basic-windows.x64-18.3.0.0.0dbru.zip , Basic Lite Package:
instantclient-basiclite-windows.x64-18.3.0.0.0dbru.zip , ODBC Package:
instantclient-odbc-windows.x64-18.3.0.0.0dbru.zip) or the 32-bit packages from here (Basic
Package: instantclient-basic-nt-18.3.0.0.0dbru.zip, Basic Lite Package:
instantclient-basiclite-nt-18.3.0.0.0dbru.zip , ODBC Package:
instantclient-odbc-nt-18.3.0.0.0dbru.zip ), and then upload them together with main.cmd into
your container.
g. An SAP BW folder, which contains a custom setup script ( main.cmd ) to install the SAP .NET
connector assembly ( librfc32.dll ) on each node of your Azure-SSIS IR Enterprise Edition. This
setup lets you use the SAP BW Connection Manager, Source, and Destination. First, upload the
64-bit or the 32-bit version of librfc32.dll from the SAP installation folder into your container,
together with main.cmd . The script then copies the SAP assembly into the %windir%\SysWow64 or
%windir%\System32 folder during setup.

h. A STORAGE folder, which contains a custom setup to install Azure PowerShell on each node of your
Azure-SSIS IR. This setup lets you deploy and run SSIS packages that run PowerShell scripts to
manipulate your Azure Storage account. Copy main.cmd , a sample AzurePowerShell.msi (or install
the latest version), and storage.ps1 to your container. Use PowerShell.dtsx as a template for your
packages. The package template combines an Azure Blob Download Task, which downloads
storage.ps1 as a modifiable PowerShell script, and an Execute Process Task that executes the
script on each node.
i. A TERADATA folder, which contains a custom setup script ( main.cmd ), its associated file (
install.cmd ), and installer packages ( .msi ). These files install Teradata connectors, the TPT API,
and the ODBC driver on each node of your Azure-SSIS IR Enterprise Edition. This setup lets you
use the Teradata Connection Manager, Source, and Destination. First, download the Teradata Tools
and Utilities (TTU ) 15.x zip file (for example,
TeradataToolsAndUtilitiesBase__windows_indep.15.10.22.00.zip ) from Teradata, and then upload it
together with the above .cmd and .msi files into your container.
e. To try these custom setup samples, copy and paste the content from the selected folder into your
container. When you provision or reconfigure your Azure-SSIS IR with PowerShell, run the
Set-AzDataFactoryV2IntegrationRuntime cmdlet with the SAS URI of your container as the value for new
SetupScriptContainerSasUri parameter.

Next steps
Enterprise Edition of the Azure-SSIS Integration Runtime
How to develop paid or licensed custom components for the Azure-SSIS integration runtime
Install paid or licensed custom components for the
Azure-SSIS integration runtime
1/3/2019 • 3 minutes to read • Edit Online

This article describes how an ISV can develop and install paid or licensed custom components for SQL Server
Integration Services (SSIS ) packages that run in Azure in the Azure-SSIS integration runtime.

The problem
The nature of the Azure-SSIS integration runtime presents several challenges, which make the typical licensing
methods used for the on-premises installation of custom components inadequate. As a result, the Azure-SSIS IR
requires a different approach.
The nodes of the Azure-SSIS IR are volatile and can be allocated or released at any time. For example, you
can start or stop nodes to manage the cost, or scale up and down through various node sizes. As a result,
binding a third-party component license to a particular node by using machine-specific info such as MAC
address or CPU ID is no longer viable.
You can also scale the Azure-SSIS IR in or out, so that the number of nodes can shrink or expand at any
time.

The solution
As a result of the limitations of traditional licensing methods described in the previous section, the Azure-SSIS IR
provides a new solution. This solution uses Windows environment variables and SSIS system variables for the
license binding and validation of third-party components. ISVs can use these variables to obtain unique and
persistent info for an Azure-SSIS IR, such as Cluster ID and Cluster Node Count. With this info, ISVs can then
bind the license for their component to an Azure-SSIS IR as a cluster. This binding uses an ID that doesn't change
when customers start or stop, scale up or down, scale in or out, or reconfigure the Azure-SSIS IR in any way.
The following diagram shows the typical installation, activation and license binding, and validation flows for third-
party components that use these new variables:

Instructions
1. ISVs can offer their licensed components in various SKUs or tiers (for example, single node, up to 5 nodes,
up to 10 nodes, and so forth). The ISV provides the corresponding Product Key when customers purchase a
product. The ISV can also provide an Azure Storage blob container that contains an ISV Setup script and
associated files. Customers can copy these files into their own storage container and modify them with their
own Product Key (for example, by running IsvSetup.exe -pid xxxx-xxxx-xxxx ). Customers can then
provision or reconfigure the Azure-SSIS IR with the SAS URI of their container as parameter. For more
info, see Custom setup for the Azure-SSIS integration runtime.
2. When the Azure-SSIS IR is provisioned or reconfigured, ISV Setup runs on each node to query the
Windows environment variables, SSIS_CLUSTERID and SSIS_CLUSTERNODECOUNT . Then the Azure-SSIS IR
submits its Cluster ID and the Product Key for the licensed product to the ISV Activation Server to generate
an Activation Key.
3. After receiving the Activation Key, ISV Setup can store the key locally on each node (for example, in the
Registry).
4. When customers run a package that uses the ISV's licensed component on a node of the Azure-SSIS IR,
the package reads the locally stored Activation Key and validates it against the node's Cluster ID. The
package can also optionally report the Cluster Node Count to the ISV activation server.
Here is an example of code that validates the activation key and reports the cluster node count:

public override DTSExecResult Validate(Connections connections, VariableDispenser variableDispenser,
    IDTSComponentEvents componentEvents, IDTSLogging log)
{
    Variables vars = null;

    variableDispenser.LockForRead("System::ClusterID");
    variableDispenser.LockForRead("System::ClusterNodeCount");
    variableDispenser.GetVariables(ref vars);

    // Validate the Activation Key with ClusterID
    // Report on ClusterNodeCount

    vars.Unlock();

    return base.Validate(connections, variableDispenser, componentEvents, log);
}

ISV partners
You can find a list of ISV partners who have adapted their components and extensions for the Azure-SSIS IR at
the end of this blog post - Enterprise Edition, Custom Setup, and 3rd Party Extensibility for SSIS in ADF.

Next steps
Custom setup for the Azure-SSIS integration runtime
Enterprise Edition of the Azure-SSIS Integration Runtime
Configure the Azure-SSIS Integration Runtime for
high performance
5/7/2019 • 8 minutes to read • Edit Online

This article describes how to configure an Azure-SSIS Integration Runtime (IR ) for high performance. The Azure-
SSIS IR allows you to deploy and run SQL Server Integration Services (SSIS ) packages in Azure. For more
information about Azure-SSIS IR, see Integration runtime article. For information about deploying and running
SSIS packages on Azure, see Lift and shift SQL Server Integration Services workloads to the cloud.

IMPORTANT
This article contains performance results and observations from in-house testing done by members of the SSIS development
team. Your results may vary. Do your own testing before you finalize your configuration settings, which affect both cost and
performance.

Properties to configure
The following portion of a configuration script shows the properties that you can configure when you create an
Azure-SSIS Integration Runtime. For the complete PowerShell script and description, see Deploy SQL Server
Integration Services packages to Azure.
# If your input contains a PSH special character, e.g. "$", precede it with the escape character "`" like "`$"
$SubscriptionName = "[your Azure subscription name]"
$ResourceGroupName = "[your Azure resource group name]"
$DataFactoryName = "[your data factory name]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$DataFactoryLocation = "EastUS"

### Azure-SSIS integration runtime information - This is a Data Factory compute resource for running SSIS packages
$AzureSSISName = "[specify a name for your Azure-SSIS IR]"
$AzureSSISDescription = "[specify a description for your Azure-SSIS IR]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium features on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your own on-premises SQL Server license with Software Assurance to earn cost savings from the Azure Hybrid Benefit (AHB) option
# For a Standard_D1_v2 node, up to 4 parallel executions per node are supported, but for other nodes, up to max(2 x number of cores, 8) are currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info
$SetupScriptContainerSasUri = "" # OPTIONAL to provide SAS URI of blob container where your custom setup script and its associated files are stored
# Virtual network info: Classic or Azure Resource Manager
$VnetId = "[your virtual network resource ID or leave it empty]" # REQUIRED if you use Azure SQL Database with virtual network service endpoints/Managed Instance/on-premises data, Azure Resource Manager virtual network is recommended, Classic virtual network will be deprecated soon
$SubnetName = "[your subnet name or leave it empty]" # WARNING: Please use the same subnet as the one used with your Azure SQL Database with virtual network service endpoints or a different subnet than the one used for your Managed Instance

### SSISDB info
$SSISDBServerEndpoint = "[your Azure SQL Database server name or Managed Instance name.DNS prefix].database.windows.net" # WARNING: Please ensure that there is no existing SSISDB, so we can prepare and manage one on your behalf
# Authentication info: SQL or Azure Active Directory (AAD)
$SSISDBServerAdminUserName = "[your server admin username for SQL authentication or leave it empty for AAD authentication]"
$SSISDBServerAdminPassword = "[your server admin password for SQL authentication or leave it empty for AAD authentication]"
$SSISDBPricingTier = "[Basic|S0|S1|S2|S3|S4|S6|S7|S9|S12|P1|P2|P4|P6|P11|P15|…|ELASTIC_POOL(name = <elastic_pool_name>) for Azure SQL Database or leave it empty for Managed Instance]"

AzureSSISLocation
AzureSSISLocation is the location for the integration runtime worker node. The worker node maintains a
constant connection to the SSIS Catalog database (SSISDB) on an Azure SQL database. Set
AzureSSISLocation to the same location as the SQL Database server that hosts SSISDB, which lets the
integration runtime work as efficiently as possible.

AzureSSISNodeSize
Data Factory, including the Azure-SSIS IR, supports the following options:
Standard_A4_v2
Standard_A8_v2
Standard_D1_v2
Standard_D2_v2
Standard_D3_v2
Standard_D4_v2
Standard_D2_v3
Standard_D4_v3
Standard_D8_v3
Standard_D16_v3
Standard_D32_v3
Standard_D64_v3
Standard_E2_v3
Standard_E4_v3
Standard_E8_v3
Standard_E16_v3
Standard_E32_v3
Standard_E64_v3
In unofficial in-house testing by the SSIS engineering team, the D series appeared to be more suitable for SSIS
package execution than the A series:
The performance/price ratio of the D series is higher than that of the A series, and the performance/price ratio
of the v3 series is higher than that of the v2 series.
The throughput of the D series is higher than that of the A series at the same price, and the throughput of the
v3 series is higher than that of the v2 series at the same price.
The v2 series nodes of the Azure-SSIS IR are not suitable for custom setup, so use the v3 series nodes instead.
If you already use v2 series nodes, switch to v3 series nodes as soon as possible.
The E series consists of memory-optimized VM sizes that provide a higher memory-to-CPU ratio than other
machines. If your package requires a lot of memory, consider choosing an E-series VM.
Configure for execution speed
If you don't have many packages to run, and you want packages to run quickly, use the information in the following
chart to choose a virtual machine type suitable for your scenario.
This data represents a single package execution on a single worker node. The package loads 3 million records with
first name and last name columns from Azure Blob Storage, generates a full name column, and writes the records
that have the full name longer than 20 characters to Azure Blob Storage.
Configure for overall throughput
If you have lots of packages to run, and you care most about the overall throughput, use the information in the
following chart to choose a virtual machine type suitable for your scenario.

AzureSSISNodeNumber
AzureSSISNodeNumber adjusts the scalability of the integration runtime. The throughput of the integration
runtime is proportional to the AzureSSISNodeNumber. Set the AzureSSISNodeNumber to a small value at
first, monitor the throughput of the integration runtime, then adjust the value for your scenario. To reconfigure the
worker node count, see Manage an Azure-SSIS integration runtime.
AzureSSISMaxParallelExecutionsPerNode
When you're already using a powerful worker node to run packages, increasing
AzureSSISMaxParallelExecutionsPerNode may increase the overall throughput of the integration runtime. For
Standard_D1_v2 nodes, 1-4 parallel executions per node are supported. For all other node types, 1 to max(2 x
number of cores, 8) parallel executions per node are supported. If you need a value of
AzureSSISMaxParallelExecutionsPerNode beyond the supported maximum, you can open a support ticket to have
the maximum increased; after that, use Azure PowerShell to update AzureSSISMaxParallelExecutionsPerNode.
You can estimate the appropriate value based on the cost of your package and the following configurations for the
worker nodes. For more information, see General-purpose virtual machine sizes.

SIZE | VCPU | MEMORY: GIB | TEMP STORAGE (SSD) GIB | MAX TEMP STORAGE THROUGHPUT: IOPS / READ MBPS / WRITE MBPS | MAX DATA DISKS / THROUGHPUT: IOPS | MAX NICS / EXPECTED NETWORK PERFORMANCE (MBPS)
Standard_D1_v2 | 1 | 3.5 | 50 | 3000 / 46 / 23 | 2 / 2x500 | 2 / 750
Standard_D2_v2 | 2 | 7 | 100 | 6000 / 93 / 46 | 4 / 4x500 | 2 / 1500
Standard_D3_v2 | 4 | 14 | 200 | 12000 / 187 / 93 | 8 / 8x500 | 4 / 3000
Standard_D4_v2 | 8 | 28 | 400 | 24000 / 375 / 187 | 16 / 16x500 | 8 / 6000
Standard_A4_v2 | 4 | 8 | 40 | 4000 / 80 / 40 | 8 / 8x500 | 4 / 1000
Standard_A8_v2 | 8 | 16 | 80 | 8000 / 160 / 80 | 16 / 16x500 | 8 / 2000
Standard_D2_v3 | 2 | 8 | 50 | 3000 / 46 / 23 | 4 / 6x500 | 2 / 1000
Standard_D4_v3 | 4 | 16 | 100 | 6000 / 93 / 46 | 8 / 12x500 | 2 / 2000
Standard_D8_v3 | 8 | 32 | 200 | 12000 / 187 / 93 | 16 / 24x500 | 4 / 4000
Standard_D16_v3 | 16 | 64 | 400 | 24000 / 375 / 187 | 32 / 48x500 | 8 / 8000
Standard_D32_v3 | 32 | 128 | 800 | 48000 / 750 / 375 | 32 / 96x500 | 8 / 16000
Standard_D64_v3 | 64 | 256 | 1600 | 96000 / 1000 / 500 | 32 / 192x500 | 8 / 30000
Standard_E2_v3 | 2 | 16 | 50 | 3000 / 46 / 23 | 4 / 6x500 | 2 / 1000
Standard_E4_v3 | 4 | 32 | 100 | 6000 / 93 / 46 | 8 / 12x500 | 2 / 2000
Standard_E8_v3 | 8 | 64 | 200 | 12000 / 187 / 93 | 16 / 24x500 | 4 / 4000
Standard_E16_v3 | 16 | 128 | 400 | 24000 / 375 / 187 | 32 / 48x500 | 8 / 8000
Standard_E32_v3 | 32 | 256 | 800 | 48000 / 750 / 375 | 32 / 96x500 | 8 / 16000
Standard_E64_v3 | 64 | 432 | 1600 | 96000 / 1000 / 500 | 32 / 192x500 | 8 / 30000

Here are the guidelines for setting the right value for the AzureSSISMaxParallelExecutionsPerNode property:
1. Set it to a small value at first.
2. Increase it by a small amount to check whether the overall throughput is improved.
3. Stop increasing the value when the overall throughput reaches the maximum value.
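Here is a minimal PowerShell sketch of that tuning loop. It is not part of the original configuration script: the resource group, data factory, and IR names are placeholders, the chosen values are only illustrative, and stopping the IR interrupts any packages that are currently running.

# Stop the Azure-SSIS IR before changing its compute settings (placeholder names).
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "your resource group name" `
    -DataFactoryName "your data factory name" -Name "your Azure-SSIS IR name" -Force

# Apply a new worker node count and per-node parallelism, then measure throughput again.
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "your resource group name" `
    -DataFactoryName "your data factory name" -Name "your Azure-SSIS IR name" `
    -NodeCount 4 `
    -MaxParallelExecutionsPerNode 8

# Start the IR again so packages can run with the new settings.
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "your resource group name" `
    -DataFactoryName "your data factory name" -Name "your Azure-SSIS IR name"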

SSISDBPricingTier
SSISDBPricingTier is the pricing tier for the SSIS Catalog database (SSISDB) on an Azure SQL database. This
setting affects the maximum number of workers in the IR instance, the speed to queue a package execution, and
the speed to load the execution log.
If you don't care about the speed to queue package execution and to load the execution log, you can choose
the lowest database pricing tier. Azure SQL Database with Basic pricing supports 8 workers in an
integration runtime instance.
Choose a more powerful database than Basic if the worker count is more than 8, or the core count is more
than 50. Otherwise the database becomes the bottleneck of the integration runtime instance and the overall
performance is negatively impacted.
Choose a more powerful database such as S3 if the logging level is set to verbose. According to our unofficial
in-house testing, the S3 pricing tier can support SSIS package execution with 2 nodes, a parallel count of 128, and
the verbose logging level.
You can also adjust the database pricing tier based on database transaction unit (DTU) usage information available
on the Azure portal.
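For example, the following sketch scales SSISDB to a higher pricing tier with the Az.Sql module after you have reviewed its DTU usage. The server and resource group names are placeholders, and S3 is only an example target tier.

# Scale SSISDB to a higher pricing tier (placeholder names; requires the Az.Sql module).
Set-AzSqlDatabase -ResourceGroupName "your resource group name" `
    -ServerName "your Azure SQL Database server name" `
    -DatabaseName "SSISDB" `
    -RequestedServiceObjectiveName "S3"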

Design for high performance


Designing an SSIS package to run on Azure is different from designing a package for on-premises execution.
Instead of combining multiple independent tasks in the same package, separate them into several packages for
more efficient execution in the Azure-SSIS IR. Create a package execution for each package, so that they don’t
have to wait for each other to finish. This approach benefits from the scalability of the Azure-SSIS integration
runtime and improves the overall throughput.
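To illustrate the point, here is a rough PowerShell sketch that starts an independent SSISDB execution for each package so they run side by side instead of waiting on one another. It uses the standard SSIS catalog stored procedures rather than anything specific to this article, and the server, credentials, folder, project, and package names are placeholders.

# Start one SSISDB execution per package so the packages run in parallel (placeholder names).
# Requires the SqlServer PowerShell module for Invoke-Sqlcmd.
$server   = "yourserver.database.windows.net"
$packages = @("LoadCustomers.dtsx", "LoadOrders.dtsx", "LoadProducts.dtsx")

foreach ($package in $packages) {
    $sql = @"
DECLARE @execution_id BIGINT;
EXEC [catalog].[create_execution]
    @folder_name = N'YourFolder',
    @project_name = N'YourProject',
    @package_name = N'$package',
    @use32bitruntime = 0,
    @execution_id = @execution_id OUTPUT;
-- start_execution returns immediately, so the next package is queued without waiting.
EXEC [catalog].[start_execution] @execution_id;
"@
    Invoke-Sqlcmd -ServerInstance $server -Database "SSISDB" `
        -Username "your server admin username" -Password "your server admin password" -Query $sql
}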
Next steps
Learn more about the Azure-SSIS Integration Runtime. See Azure-SSIS Integration Runtime.
Configure the Azure-SSIS Integration Runtime with
Azure SQL Database geo-replication and failover

This article describes how to configure the Azure-SSIS Integration Runtime with Azure SQL Database geo-
replication for the SSISDB database. When a failover occurs, you can ensure that the Azure-SSIS IR keeps working
with the secondary database.
For more info about geo-replication and failover for SQL Database, see Overview: Active geo-replication and auto-
failover groups.

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Scenario 1 - Azure-SSIS IR is pointing to read-write listener endpoint


Conditions
This section applies when the following conditions are true:
The Azure-SSIS IR is pointing to the read-write listener endpoint of the failover group.
AND
The SQL Database server is not configured with the virtual network service endpoint rule.
Solution
When failover occurs, it is transparent to the Azure-SSIS IR. The Azure-SSIS IR automatically connects to the new
primary of the failover group.

Scenario 2 - Azure-SSIS IR is pointing to primary server endpoint


Conditions
This section applies when one of the following conditions is true:
The Azure-SSIS IR is pointing to the primary server endpoint of the failover group. This endpoint changes
when failover occurs.
OR
The Azure SQL Database server is configured with the virtual network service endpoint rule.
OR
The database server is a SQL Database Managed Instance configured with a virtual network.
Solution
When failover occurs, you have to do the following things:
1. Stop the Azure-SSIS IR.
2. Reconfigure the IR to point to the new primary endpoint and to a virtual network in the new region.
3. Restart the IR.
The following sections describe these steps in more detail.
Prerequisites
Make sure that you have enabled disaster recovery for your Azure SQL Database server in case the server
has an outage at the same time. For more info, see Overview of business continuity with Azure SQL
Database.
If you are using a virtual network in the current region, you need to use another virtual network in the new
region to connect your Azure-SSIS integration runtime. For more info, see Join an Azure-SSIS integration
runtime to a virtual network.
If you are using a custom setup, you may need to prepare another SAS URI for the blob container that
stores your custom setup script and associated files, so it continues to be accessible during an outage. For
more info, see Configure a custom setup on an Azure-SSIS integration runtime.
Steps
Follow these steps to stop your Azure-SSIS IR, switch the IR to a new region, and start it again.
1. Stop the IR in the original region.
2. Call the following command in PowerShell to update the IR with the new settings.

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "your resource group name" `
    -DataFactoryName "your data factory name" `
    -Name "your Azure-SSIS IR name" `
    -Location "new region" `
    -CatalogServerEndpoint "Azure SQL Database server endpoint" `
    -CatalogAdminCredential "Azure SQL Database server admin credentials" `
    -VNetId "new VNet" `
    -Subnet "new subnet" `
    -SetupScriptContainerSasUri "new custom setup SAS URI"

For more info about this PowerShell command, see Create the Azure-SSIS integration runtime in Azure
Data Factory
3. Start the IR again.
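As a rough sketch of steps 1 and 3 (the stop and start around the reconfiguration above), assuming the same resource group, data factory, and IR names you used when provisioning the IR:

# Step 1: stop the Azure-SSIS IR in the original region (placeholder names).
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "your resource group name" `
    -DataFactoryName "your data factory name" -Name "your Azure-SSIS IR name" -Force

# Step 2: run the Set-AzDataFactoryV2IntegrationRuntime command shown above.

# Step 3: start the IR again so it connects to the new primary.
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "your resource group name" `
    -DataFactoryName "your data factory name" -Name "your Azure-SSIS IR name"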

Scenario 3 - Attaching an existing SSISDB (SSIS catalog) to a new Azure-SSIS IR
When an Azure Data Factory or Azure-SSIS IR disaster occurs in the current region, you can keep your SSISDB
working with a new Azure-SSIS IR in a new region.
Prerequisites
If you are using a virtual network in the current region, you need to use another virtual network in the new
region to connect your Azure-SSIS integration runtime. For more info, see Join an Azure-SSIS integration
runtime to a virtual network.
If you are using a custom setup, you may need to prepare another SAS URI for the blob container that
stores your custom setup script and associated files, so it continues to be accessible during an outage. For
more info, see Configure a custom setup on an Azure-SSIS integration runtime.
Steps
Follow these steps to attach your existing SSISDB to a new Azure-SSIS IR in a new region.
1. Execute the following stored procedure to attach SSISDB to <new_data_factory_name> or
<new_integration_runtime_name>.

EXEC [catalog].[failover_integration_runtime] @data_factory_name='<new_data_factory_name>',
@integration_runtime_name='<new_integration_runtime_name>'

2. Create a new data factory named <new_data_factory_name> in the new region. For more info, see Create
a data factory.

Set-AzDataFactoryV2 -ResourceGroupName "new resource group name" `
    -Location "new region" `
    -Name "<new_data_factory_name>"

For more info about this PowerShell command, see Create an Azure data factory using PowerShell
3. Create a new Azure-SSIS IR named <new_integration_runtime_name> in the new region using Azure
PowerShell.

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "new resource group name" `
    -DataFactoryName "new data factory name" `
    -Name "<new_integration_runtime_name>" `
    -Description $AzureSSISDescription `
    -Type Managed `
    -Location $AzureSSISLocation `
    -NodeSize $AzureSSISNodeSize `
    -NodeCount $AzureSSISNodeNumber `
    -Edition $AzureSSISEdition `
    -LicenseType $AzureSSISLicenseType `
    -MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode `
    -VnetId "new vnet" `
    -Subnet "new subnet" `
    -CatalogServerEndpoint $SSISDBServerEndpoint `
    -CatalogPricingTier $SSISDBPricingTier

For more info about this PowerShell command, see Create the Azure-SSIS integration runtime in Azure
Data Factory
4. Start the IR again.

Next steps
Consider these other configuration options for the Azure-SSIS IR:
Configure the Azure-SSIS Integration Runtime for high performance
Customize setup for the Azure-SSIS integration runtime
Provision Enterprise Edition for the Azure-SSIS Integration Runtime
Clean up SSISDB logs with Azure Elastic Database
Jobs

This article describes how to use Azure Elastic Database Jobs to trigger the stored procedure that cleans up logs
for the SQL Server Integration Services catalog database, SSISDB .
Elastic Database Jobs is an Azure service that makes it easy to automate and run jobs against a database or a
group of databases. You can schedule, run, and monitor these jobs by using the Azure portal, Transact-SQL,
PowerShell, or REST APIs. Use the Elastic Database Job to trigger the stored procedure for log cleanup one time or
on a schedule. You can choose the schedule interval based on SSISDB resource usage to avoid heavy database
load.
For more info, see Manage groups of databases with Elastic Database Jobs.
The following sections describe how to trigger the stored procedure
[internal].[cleanup_server_retention_window_exclusive] , which removes SSISDB logs that are outside the
retention window set by the administrator.
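Before you create the cleanup job, it can help to check or adjust the retention window itself. The following is a rough sketch that uses the standard SSISDB catalog properties; the server name and credentials are placeholders, and the 30-day value is only an example.

# Inspect and adjust the SSISDB log retention window, in days (placeholder names).
# Requires the SqlServer PowerShell module for Invoke-Sqlcmd.
$server = "yourserver.database.windows.net"

# Check the current retention window.
Invoke-Sqlcmd -ServerInstance $server -Database "SSISDB" -Username "your admin username" -Password "your admin password" `
    -Query "SELECT property_name, property_value FROM [catalog].[catalog_properties] WHERE property_name = 'RETENTION_WINDOW'"

# Optionally shorten it, for example to 30 days, before scheduling the cleanup job.
Invoke-Sqlcmd -ServerInstance $server -Database "SSISDB" -Username "your admin username" -Password "your admin password" `
    -Query "EXEC [catalog].[configure_catalog] @property_name = N'RETENTION_WINDOW', @property_value = 30"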

Clean up logs with PowerShell


IMPORTANT
Using this Azure feature from PowerShell requires the AzureRM module installed. This is an older module only available for
Windows PowerShell 5.1 that no longer receives new features. The Az and AzureRM modules are not compatible when
installed for the same versions of PowerShell. If you need both versions:
1. Uninstall the Az module from a PowerShell 5.1 session.
2. Install the AzureRM module from a PowerShell 5.1 session.
3. Download and install PowerShell Core 6.x or later.
4. Install the Az module in a PowerShell Core session.

The following sample PowerShell scripts create a new Elastic Job to trigger the stored procedure for SSISDB log
cleanup. For more info, see Create an Elastic Job agent using PowerShell.
Create parameters
# Parameters needed to create the Job Database
param(
$ResourceGroupName = $(Read-Host "Please enter an existing resource group name"),
$AgentServerName = $(Read-Host "Please enter the name of an existing Azure SQL server (for example, yhxserver) to hold the SSISDBLogCleanup job database"),
$SSISDBLogCleanupJobDB = $(Read-Host "Please enter a name for the Job Database to be created in the given SQL server"),
# The Job Database should be a clean, empty, S0 or higher service tier. We set S0 as the default.
$PricingTier = "S0",

# Parameters needed to create the Elastic Job agent
$SSISDBLogCleanupAgentName = $(Read-Host "Please enter a name for your new Elastic Job agent"),

# Parameters needed to create the job credential in the Job Database to connect to SSISDB
$PasswordForSSISDBCleanupUser = $(Read-Host "Please provide a new password for the SSISDBLogCleanup job user to connect to the SSISDB database for log cleanup"),
# Parameters needed to create a login and a user in the SSISDB of the target server
$SSISDBServerEndpoint = $(Read-Host "Please enter the name of the target Azure SQL server which contains the SSISDB you need to clean up, for example, myserver") + '.database.windows.net',
$SSISDBServerAdminUserName = $(Read-Host "Please enter the target server admin username for SQL authentication"),
$SSISDBServerAdminPassword = $(Read-Host "Please enter the target server admin password for SQL authentication"),
$SSISDBName = "SSISDB",

# Parameters needed to set job scheduling to trigger execution of the cleanup stored procedure
$RunJobOrNot = $(Read-Host "Please indicate whether you want to run the job to clean up SSISDB logs outside the log retention window immediately (Y/N). Make sure the retention window is set appropriately before running the following PowerShell scripts. Removed SSISDB logs cannot be recovered."),
$IntervalType = $(Read-Host "Please enter the interval type for the execution schedule of the SSISDB log cleanup stored procedure. Year, Month, Day, Hour, Minute, and Second are supported."),
$IntervalCount = $(Read-Host "Please enter the detailed interval value in the given interval type for the execution schedule of the SSISDB log cleanup stored procedure"),
# StartTime of the execution schedule is set to the current time by default.
$StartTime = (Get-Date)
)

Trigger the cleanup stored procedure

# Install the latest PackageManagement powershell package which PowershellGet v1.6.5 is dependent on
Find-Package PackageManagement -RequiredVersion 1.1.7.2 | Install-Package -Force
# You may need to restart the powershell session
# Install the latest PowershellGet module which adds the -AllowPrerelease flag to Install-Module
Find-Package PowerShellGet -RequiredVersion 1.6.5 | Install-Package -Force

# Place AzureRM.Sql preview cmdlets side by side with existing AzureRM.Sql version
Install-Module -Name AzureRM.Sql -AllowPrerelease -Force

# Sign in to your Azure account


Connect-AzureRmAccount

# Create a Job Database which is used for defining jobs of triggering SSISDB log cleanup stored procedure and tracking cleanup history of jobs
Write-Output "Creating a blank SQL database to be used as the SSISDBLogCleanup Job Database ..."
$JobDatabase = New-AzureRmSqlDatabase -ResourceGroupName $ResourceGroupName -ServerName $AgentServerName -DatabaseName $SSISDBLogCleanupJobDB -RequestedServiceObjectiveName $PricingTier
$JobDatabase

# Enable the Elastic Jobs preview in your Azure subscription


Register-AzureRmProviderFeature -FeatureName sqldb-JobAccounts -ProviderNamespace Microsoft.Sql

# Create the Elastic Job agent


Write-Output "Creating the Elastic Job agent..."
$JobAgent = $JobDatabase | New-AzureRmSqlElasticJobAgent -Name $SSISDBLogCleanupAgentName
$JobAgent

# Create the job credential in the Job Database to connect to SSISDB database in the target server for log cleanup
Write-Output "Creating job credential to connect to SSISDB database..."
$JobCredSecure = ConvertTo-SecureString -String $PasswordForSSISDBCleanupUser -AsPlainText -Force
$JobCred = New-Object -TypeName "System.Management.Automation.PSCredential" -ArgumentList "SSISDBLogCleanupUser", $JobCredSecure
$JobCred = $JobAgent | New-AzureRmSqlElasticJobCredential -Name "SSISDBLogCleanupUser" -Credential $JobCred

# In the master database of the target SQL server which contains SSISDB to cleanup
# - Create the job user login
Write-Output "Grant permissions on the master database of the target server..."
$Params = @{
'Database' = 'master'
'ServerInstance' = $SSISDBServerEndpoint
'Username' = $SSISDBServerAdminUserName
'Password' = $SSISDBServerAdminPassword
'OutputSqlErrors' = $true
'Query' = "CREATE LOGIN SSISDBLogCleanupUser WITH PASSWORD = '" + $PasswordForSSISDBCleanupUser + "'"
}
Invoke-SqlCmd @Params

# For SSISDB database of the target SQL server


# - Create the SSISDBLogCleanup user from the SSISDBlogCleanup user login
# - Grant permissions for the execution of SSISDB log cleanup stored procedure
Write-Output "Grant appropriate permissions on SSISDB database..."
$TargetDatabase = $SSISDBName
$CreateJobUser = "CREATE USER SSISDBLogCleanupUser FROM LOGIN SSISDBLogCleanupUser"
$GrantStoredProcedureExecution = "GRANT EXECUTE ON internal.cleanup_server_retention_window_exclusive TO SSISDBLogCleanupUser"

$TargetDatabase | % {
$Params.Database = $_
$Params.Query = $CreateJobUser
Invoke-SqlCmd @Params
$Params.Query = $GrantStoredProcedureExecution
Invoke-SqlCmd @Params
}

# Create a target group which includes SSISDB database needed to cleanup


Write-Output "Creating the target group including only SSISDB database needed to cleanup ..."
$SSISDBTargetGroup = $JobAgent | New-AzureRmSqlElasticJobTargetGroup -Name "SSISDBTargetGroup"
$SSISDBTargetGroup | Add-AzureRmSqlElasticJobTarget -ServerName $SSISDBServerEndpoint -Database $SSISDBName

# Create the job to trigger execution of SSISDB log cleanup stored procedure
Write-Output "Creating a new job to trigger execution of the stored procedure for SSISDB log cleanup"
$JobName = "CleanupSSISDBLog"
$Job = $JobAgent | New-AzureRmSqlElasticJob -Name $JobName -RunOnce
$Job

# Add the job step to execute internal.cleanup_server_retention_window_exclusive


Write-Output "Adding the job step for the cleanup stored procedure execution"
$SqlText = "EXEC internal.cleanup_server_retention_window_exclusive"
$Job | Add-AzureRmSqlElasticJobStep -Name "step to execute cleanup stored procedure" -TargetGroupName $SSISDBTargetGroup.TargetGroupName -CredentialName $JobCred.CredentialName -CommandText $SqlText

# Run the job to immediately start cleanup stored procedure execution for once
IF(($RunJobOrNot -eq "Y") -Or ($RunJobOrNot -eq "y"))
{
Write-Output "Start a new execution of the stored procedure for SSISDB log cleanup immediately..."
$JobExecution = $Job | Start-AzureRmSqlElasticJob
$JobExecution
}

# Schedule the job running to trigger stored procedure execution on schedule for removing SSISDB logs outside the retention window
Write-Output "Start the execution schedule of the stored procedure for SSISDB log cleanup..."
$Job | Set-AzureRmSqlElasticJob -IntervalType $IntervalType -IntervalCount $IntervalCount -StartTime $StartTime -Enable
Clean up logs with Transact-SQL
The following sample Transact-SQL scripts create a new Elastic Job to trigger the stored procedure for SSISDB log
cleanup. For more info, see Use Transact-SQL (T-SQL ) to create and manage Elastic Database Jobs.
1. Create or identify an empty S0 or higher Azure SQL Database to be the SSISDBCleanup Job Database.
Then create an Elastic Job Agent in the Azure portal.
2. In the Job Database, create a credential for the SSISDB log cleanup job. This credential is used to connect to
your SSISDB database to clean up the logs.

-- Connect to the job database specified when creating the job agent
-- Create a database master key if one does not already exist, using your own password.
CREATE MASTER KEY ENCRYPTION BY PASSWORD= '<EnterStrongPasswordHere>';

-- Create a credential for SSISDB log cleanup.


CREATE DATABASE SCOPED CREDENTIAL SSISDBLogCleanupCred WITH IDENTITY = 'SSISDBLogCleanupUser', SECRET =
'<EnterStrongPasswordHere>';

3. Define the target group that includes the SSISDB database for which you want to run the cleanup stored
procedure.

-- Connect to the job database


-- Add a target group
EXEC jobs.sp_add_target_group 'SSISDBTargetGroup'

-- Add SSISDB database into the target group


EXEC jobs.sp_add_target_group_member 'SSISDBTargetGroup',
@target_type = 'SqlDatabase',
@server_name = '<EnterSSISDBTargetServerName>',
@database_name = '<EnterSSISDBName>'

--View the recently created target group and target group members
SELECT * FROM jobs.target_groups WHERE target_group_name = 'SSISDBTargetGroup';
SELECT * FROM jobs.target_group_members WHERE target_group_name = 'SSISDBTargetGroup';

4. Grant appropriate permissions for the SSISDB database. The SSISDB catalog must have proper
permissions for the stored procedure to run SSISDB log cleanup successfully. For detailed guidance, see
Manage logins.

-- Connect to the master database in the target server including SSISDB


CREATE LOGIN SSISDBLogCleanupUser WITH PASSWORD = '<strong_password>';

-- Connect to SSISDB database in the target server to cleanup logs


CREATE USER SSISDBLogCleanupUser FROM LOGIN SSISDBLogCleanupUser;
GRANT EXECUTE ON internal.cleanup_server_retention_window_exclusive TO SSISDBLogCleanupUser

5. Create the job and add a job step to trigger the execution of the stored procedure for SSISDB log cleanup.
--Connect to the job database
--Add the job for the execution of SSISDB log cleanup stored procedure.
EXEC jobs.sp_add_job @job_name='CleanupSSISDBLog', @description='Remove SSISDB logs which are outside
the retention window'

--Add a job step to execute internal.cleanup_server_retention_window_exclusive


EXEC jobs.sp_add_jobstep @job_name='CleanupSSISDBLog',
@command=N'EXEC internal.cleanup_server_retention_window_exclusive',
@credential_name='SSISDBLogCleanupCred',
@target_group_name='SSISDBTargetGroup'

6. Before you continue, make sure the retention window has been set appropriately. SSISDB logs outside the
window are deleted and can't be recovered.
Then you can run the job immediately to begin SSISDB log cleanup.

--Connect to the job database


--Run the job immediately to execute the stored procedure for SSISDB log cleanup
declare @je uniqueidentifier
exec jobs.sp_start_job 'CleanupSSISDBLog', @job_execution_id = @je output

--Watch the execution results for SSISDB log cleanup


select @je
select * from jobs.job_executions where job_execution_id = @je

7. Optionally, schedule job executions to remove SSISDB logs outside the retention window on a schedule.
Use a similar statement to update the job parameters.

--Connect to the job database


EXEC jobs.sp_update_job
@job_name='CleanupSSISDBLog',
@enabled=1,
@schedule_interval_type='<EnterIntervalType(Month,Day,Hour,Minute,Second)>',
@schedule_interval_count='<EnterDetailedIntervalValue>',
@schedule_start_time='<EnterProperStartTimeForSchedule>',
@schedule_end_time='<EnterProperEndTimeForSchedule>'

Monitor the cleanup job in the Azure portal


You can monitor the execution of the cleanup job in the Azure portal. For each execution, you see the status, start
time, and end time of the job.
Monitor the cleanup job with Transact-SQL
You can also use Transact-SQL to view the execution history of the cleanup job.

--Connect to the job database


--View all execution statuses for the job to cleanup SSISDB logs
SELECT * FROM jobs.job_executions WHERE job_name = 'CleanupSSISDBLog'
ORDER BY start_time DESC

-- View all active executions


SELECT * FROM jobs.job_executions WHERE is_active = 1
ORDER BY start_time DESC

Next steps
For management and monitoring tasks related to the Azure-SSIS Integration Runtime, see the following articles.
The Azure-SSIS IR is the runtime engine for SSIS packages stored in SSISDB in Azure SQL Database.
Reconfigure the Azure-SSIS integration runtime
Monitor the Azure-SSIS integration runtime.
Create a trigger that runs a pipeline in response to an
event

This article describes the event-based triggers that you can create in your Data Factory pipelines.
Event-driven architecture (EDA) is a common data integration pattern that involves production, detection,
consumption, and reaction to events. Data integration scenarios often require Data Factory customers to trigger
pipelines based on events. Data Factory is now integrated with Azure Event Grid, which lets you trigger pipelines
on an event.
For a ten-minute introduction and demonstration of this feature, watch the following video:

NOTE
The integration described in this article depends on Azure Event Grid. Make sure that your subscription is registered with the
Event Grid resource provider. For more info, see Resource providers and types.
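If you aren't sure whether the provider is registered, a quick check from Azure PowerShell looks roughly like this (assuming the Az.Resources module and sufficient permissions on the subscription):

# Check the registration state of the Event Grid resource provider.
Get-AzResourceProvider -ProviderNamespace Microsoft.EventGrid |
    Select-Object ProviderNamespace, RegistrationState

# Register the provider if the state is not "Registered".
Register-AzResourceProvider -ProviderNamespace Microsoft.EventGrid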

Data Factory UI
Create a new event trigger
A typical event is the arrival of a file, or the deletion of a file, in your Azure Storage account. You can create a
trigger that responds to this event in your Data Factory pipeline.

NOTE
This integration supports only version 2 Storage accounts (General purpose).
Configure the event trigger
With the Blob path begins with and Blob path ends with properties, you can specify the containers, folders,
and blob names for which you want to receive events. You can use a variety of patterns for both the Blob path
begins with and Blob path ends with properties, as shown in the examples later in this article. At least one of
these properties is required.

Select the event trigger type


As soon as the file arrives in your storage location and the corresponding blob is created, this event triggers and
runs your Data Factory pipeline. You can create a trigger that responds to a blob creation event, a blob deletion
event, or both events, in your Data Factory pipelines.

Map trigger properties to pipeline parameters


When an event trigger fires for a specific blob, the event captures the folder path and file name of the blob into the
properties @triggerBody().folderPath and @triggerBody().fileName . To use the values of these properties in a
pipeline, you must map the properties to pipeline parameters. After mapping the properties to parameters, you
can access the values captured by the trigger through the @pipeline().parameters.parameterName expression
throughout the pipeline.

For example, in the preceding screenshot, the trigger is configured to fire when a blob path ending in .csv is
created in the Storage Account. As a result, when a blob with the .csv extension is created anywhere in the
Storage Account, the folderPath and fileName properties capture the location of the new blob. For example,
@triggerBody().folderPath has a value like /containername/foldername/nestedfoldername and
@triggerBody().fileName has a value like filename.csv . These values are mapped in the example to the pipeline
parameters sourceFolder and sourceFile . You can use them throughout the pipeline as
@pipeline().parameters.sourceFolder and @pipeline().parameters.sourceFile respectively.

JSON schema
The following table provides an overview of the schema elements that are related to event-based triggers:

JSON ELEMENT | DESCRIPTION | TYPE | ALLOWED VALUES | REQUIRED
scope | The Azure Resource Manager resource ID of the Storage Account. | String | Azure Resource Manager ID | Yes
events | The type of events that cause this trigger to fire. | Array | Microsoft.Storage.BlobCreated, Microsoft.Storage.BlobDeleted | Yes, any combination of these values.
blobPathBeginsWith | The blob path must begin with the pattern provided for the trigger to fire. For example, /records/blobs/december/ only fires the trigger for blobs in the december folder under the records container. | String | | You have to provide a value for at least one of these properties: blobPathBeginsWith or blobPathEndsWith.
blobPathEndsWith | The blob path must end with the pattern provided for the trigger to fire. For example, december/boxes.csv only fires the trigger for blobs named boxes in a december folder. | String | | You have to provide a value for at least one of these properties: blobPathBeginsWith or blobPathEndsWith.

Examples of event-based triggers


This section provides examples of event-based trigger settings.

IMPORTANT
You have to include the /blobs/ segment of the path, as shown in the following examples, whenever you specify container
and folder, container and file, or container, folder, and file.

PROPERTY | EXAMPLE | DESCRIPTION
Blob path begins with | /containername/ | Receives events for any blob in the container.
Blob path begins with | /containername/blobs/foldername/ | Receives events for any blobs in the containername container and foldername folder.
Blob path begins with | /containername/blobs/foldername/subfoldername/ | You can also reference a subfolder.
Blob path begins with | /containername/blobs/foldername/file.txt | Receives events for a blob named file.txt in the foldername folder under the containername container.
Blob path ends with | file.txt | Receives events for a blob named file.txt in any path.
Blob path ends with | /containername/blobs/file.txt | Receives events for a blob named file.txt under the containername container.
Blob path ends with | foldername/file.txt | Receives events for a blob named file.txt in the foldername folder under any container.
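For reference, here is a minimal sketch of what such a trigger can look like when created from Azure PowerShell instead of the UI. The JSON shape follows the schema table above; the trigger type name (BlobEventsTrigger), the storage account resource ID, the pipeline name, and the parameter names sourceFolder and sourceFile are assumptions for this sketch, so adjust them to your own factory before using it.

# Define a blob-created event trigger in JSON and create it with PowerShell (placeholder names and IDs).
$triggerDefinition = @'
{
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "scope": "/subscriptions/<subscription ID>/resourceGroups/<resource group>/providers/Microsoft.Storage/storageAccounts/<storage account>",
            "events": [ "Microsoft.Storage.BlobCreated" ],
            "blobPathBeginsWith": "/containername/blobs/foldername/",
            "blobPathEndsWith": ".csv"
        },
        "pipelines": [{
            "pipelineReference": { "type": "PipelineReference", "referenceName": "MyEventPipeline" },
            "parameters": {
                "sourceFolder": "@triggerBody().folderPath",
                "sourceFile": "@triggerBody().fileName"
            }
        }]
    }
}
'@
Set-Content -Path "C:\ADFv2QuickStartPSH\MyEventTrigger.json" -Value $triggerDefinition

Set-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName `
    -Name "MyEventTrigger" -DefinitionFile "C:\ADFv2QuickStartPSH\MyEventTrigger.json"
Start-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyEventTrigger"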

Next steps
For detailed information about triggers, see Pipeline execution and triggers.
Create a trigger that runs a pipeline on a schedule

This article provides information about the schedule trigger and the steps to create, start, and monitor a schedule
trigger. For other types of triggers, see Pipeline execution and triggers.
When creating a schedule trigger, you specify a schedule (start date, recurrence, end date, etc.) for the trigger, and
associate it with a pipeline. Pipelines and triggers have a many-to-many relationship. Multiple triggers can kick off
a single pipeline, and a single trigger can kick off multiple pipelines.
The following sections provide steps to create a schedule trigger in different ways.

Data Factory UI
You can create a schedule trigger to schedule a pipeline to run periodically (hourly, daily, etc.).

NOTE
For a complete walkthrough of creating a pipeline and a schedule trigger, associating the trigger with the pipeline, and
running and monitoring the pipeline, see Quickstart: create a data factory using Data Factory UI.

1. Switch to the Edit tab.

2. Click Trigger on the menu, and click New/Edit.


3. In the Add Triggers page, click Choose trigger..., and click New.

4. In the New Trigger page, do the following steps:


a. Confirm that Schedule is selected for Type.
b. Specify the start datetime of the trigger for Start Date (UTC ). It's set to the current datetime by
default.
c. Specify Recurrence for the trigger. Select one of the values from the drop-down list (Every minute,
Hourly, Daily, Weekly, and Monthly). Enter the multiplier in the text box. For example, if you want the
trigger to run once for every 15 minutes, you select Every Minute, and enter 15 in the text box.
d. For the End field, if you do not want to specify an end datetime for the trigger, select No End. To
specify an end date time, select On Date, and specify end datetime, and click Apply. There is a cost
associated with each pipeline run. If you are testing, you may want to ensure that the pipeline is
triggered only a couple of times. However, ensure that there is enough time for the pipeline to run
between the publish time and the end time. The trigger comes into effect only after you publish the
solution to Data Factory, not when you save the trigger in the UI.
5. In the New Trigger window, check the Activated option, and click Next. You can use this checkbox to
deactivate the trigger later.
6. In the New Trigger page, review the warning message, and click Finish.

7. Click Publish to publish changes to Data Factory. Until you publish changes to Data Factory, the trigger
does not start triggering the pipeline runs.
8. Switch to the Monitor tab on the left. Click Refresh to refresh the list. You see the pipeline runs triggered
by the scheduled trigger. Notice the values in the Triggered By column. If you use Trigger Now option,
you see the manual trigger run in the list.

9. Click the down-arrow next to Pipeline Runs to switch to the Trigger Runs view.

Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

This section shows you how to use Azure PowerShell to create, start, and monitor a schedule trigger. To see this
sample working, first go through the Quickstart: Create a data factory by using Azure PowerShell. Then, add the
following code to the main method, which creates and starts a schedule trigger that runs every 15 minutes. The
trigger is associated with a pipeline named Adfv2QuickStartPipeline that you create as part of the Quickstart.
1. Create a JSON file named MyTrigger.json in the C:\ADFv2QuickStartPSH\ folder with the following
content:

IMPORTANT
Before you save the JSON file, set the value of the startTime element to the current UTC time. Set the value of the
endTime element to one hour past the current UTC time.

{
"properties": {
"name": "MyTrigger",
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Minute",
"interval": 15,
"startTime": "2017-12-08T00:00:00",
"endTime": "2017-12-08T01:00:00"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "Adfv2QuickStartPipeline"
},
"parameters": {
"inputPath": "adftutorial/input",
"outputPath": "adftutorial/output"
}
}
]
}
}

In the JSON snippet:


The type element of the trigger is set to "ScheduleTrigger."
The frequency element is set to "Minute" and the interval element is set to 15. Therefore, the trigger
runs the pipeline every 15 minutes between the start and end times.
The endTime element is one hour after the value of the startTime element. Therefore, the trigger runs
the pipeline 15 minutes, 30 minutes, and 45 minutes after the start time. Don't forget to update the start
time to the current UTC time, and the end time to one hour past the start time.
The trigger is associated with the Adfv2QuickStartPipeline pipeline. To associate multiple pipelines
with a trigger, add more pipelineReference sections.
The pipeline in the Quickstart takes two parameters values: inputPath and outputPath. Therefore, you
pass values for these parameters from the trigger.
2. Create a trigger by using the Set-AzDataFactoryV2Trigger cmdlet:

Set-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger" -DefinitionFile "C:\ADFv2QuickStartPSH\MyTrigger.json"

3. Confirm that the status of the trigger is Stopped by using the Get-AzDataFactoryV2Trigger cmdlet:
Get-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger"

4. Start the trigger by using the Start-AzDataFactoryV2Trigger cmdlet:

Start-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger"

5. Confirm that the status of the trigger is Started by using the Get-AzDataFactoryV2Trigger cmdlet:

Get-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger"

6. Get the trigger runs in Azure PowerShell by using the Get-AzDataFactoryV2TriggerRun cmdlet. To get
the information about the trigger runs, execute the following command periodically. Update the
TriggerRunStartedAfter and TriggerRunStartedBefore values to match the values in your trigger
definition:

Get-AzDataFactoryV2TriggerRun -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -TriggerName "MyTrigger" -TriggerRunStartedAfter "2017-12-08T00:00:00" -TriggerRunStartedBefore "2017-12-08T01:00:00"

To monitor the trigger runs and pipeline runs in the Azure portal, see Monitor pipeline runs.

.NET SDK
This section shows you how to use the .NET SDK to create, start, and monitor a trigger. To see this sample
working, first go through the Quickstart: Create a data factory by using the .NET SDK. Then, add the following
code to the main method, which creates and starts a schedule trigger that runs every 15 minutes. The trigger is
associated with a pipeline named Adfv2QuickStartPipeline that you create as part of the Quickstart.
To create and start a schedule trigger that runs every 15 minutes, add the following code to the main method:
// Create the trigger
Console.WriteLine("Creating the trigger");

// Set the start time to the current UTC time


DateTime startTime = DateTime.UtcNow;

// Specify values for the inputPath and outputPath parameters


Dictionary<string, object> pipelineParameters = new Dictionary<string, object>();
pipelineParameters.Add("inputPath", "adftutorial/input");
pipelineParameters.Add("outputPath", "adftutorial/output");

// Create a schedule trigger


string triggerName = "MyTrigger";
ScheduleTrigger myTrigger = new ScheduleTrigger()
{
Pipelines = new List<TriggerPipelineReference>()
{
// Associate the Adfv2QuickStartPipeline pipeline with the trigger
new TriggerPipelineReference()
{
PipelineReference = new PipelineReference(pipelineName),
Parameters = pipelineParameters,
}
},
Recurrence = new ScheduleTriggerRecurrence()
{
        // Set the start time to the current UTC time and the end time to one hour after the start time
StartTime = startTime,
TimeZone = "UTC",
EndTime = startTime.AddHours(1),
Frequency = RecurrenceFrequency.Minute,
Interval = 15,
}
};

// Now, create the trigger by invoking the CreateOrUpdate method


TriggerResource triggerResource = new TriggerResource()
{
Properties = myTrigger
};
client.Triggers.CreateOrUpdate(resourceGroup, dataFactoryName, triggerName, triggerResource);

// Start the trigger


Console.WriteLine("Starting the trigger");
client.Triggers.Start(resourceGroup, dataFactoryName, triggerName);

To monitor a trigger run, add the following code before the last Console.WriteLine statement in the sample:
// Check that the trigger runs every 15 minutes
Console.WriteLine("Trigger runs. You see the output every 15 minutes");

for (int i = 0; i < 3; i++)


{
System.Threading.Thread.Sleep(TimeSpan.FromMinutes(15));
List<TriggerRun> triggerRuns = client.Triggers.ListRuns(resourceGroup, dataFactoryName,
triggerName, DateTime.UtcNow.AddMinutes(-15 * (i + 1)), DateTime.UtcNow.AddMinutes(2)).ToList();
Console.WriteLine("{0} trigger runs found", triggerRuns.Count);

foreach (TriggerRun run in triggerRuns)


{
foreach (KeyValuePair<string, string> triggeredPipeline in run.TriggeredPipelines)
{
PipelineRun triggeredPipelineRun = client.PipelineRuns.Get(resourceGroup,
dataFactoryName, triggeredPipeline.Value);
Console.WriteLine("Pipeline run ID: {0}, Status: {1}", triggeredPipelineRun.RunId,
triggeredPipelineRun.Status);
List<ActivityRun> runs = client.ActivityRuns.ListByPipelineRun(resourceGroup,
dataFactoryName, triggeredPipelineRun.RunId, run.TriggerRunTimestamp.Value,
run.TriggerRunTimestamp.Value.AddMinutes(20)).ToList();
}
}
}

To monitor the trigger runs and pipeline runs in the Azure portal, see Monitor pipeline runs.

Python SDK
This section shows you how to use the Python SDK to create, start, and monitor a trigger. To see this sample
working, first go through the Quickstart: Create a data factory by using the Python SDK. Then, add the following
code block after the "monitor the pipeline run" code block in the Python script. This code creates a schedule trigger
that runs every 15 minutes between the specified start and end times. Update the start_time variable to the
current UTC time, and the end_time variable to one hour past the current UTC time.

# Create a trigger
tr_name = 'mytrigger'
scheduler_recurrence = ScheduleTriggerRecurrence(frequency='Minute', interval='15', start_time='2017-12-12T04:00:00', end_time='2017-12-12T05:00:00', time_zone='UTC')
pipeline_parameters = {'inputPath':'adftutorial/input', 'outputPath':'adftutorial/output'}
pipelines_to_run = []
pipeline_reference = PipelineReference('copyPipeline')
pipelines_to_run.append(TriggerPipelineReference(pipeline_reference, pipeline_parameters))
tr_properties = ScheduleTrigger(description='My scheduler trigger', pipelines=pipelines_to_run, recurrence=scheduler_recurrence)
adf_client.triggers.create_or_update(rg_name, df_name, tr_name, tr_properties)

# Start the trigger


adf_client.triggers.start(rg_name, df_name, tr_name)

To monitor the trigger runs and pipeline runs in the Azure portal, see Monitor pipeline runs.

Azure Resource Manager template


You can use an Azure Resource Manager template to create a trigger. For step-by-step instructions, see Create an
Azure data factory by using a Resource Manager template.

Pass the trigger start time to a pipeline


Azure Data Factory version 1 supports reading or writing partitioned data by using the system variables:
SliceStart, SliceEnd, WindowStart, and WindowEnd. In the current version of Azure Data Factory, you can
achieve this behavior by using a pipeline parameter. The start time and scheduled time for the trigger are set as the
value for the pipeline parameter. In the following example, the scheduled time for the trigger is passed as a value
to the pipeline scheduledRunTime parameter:

"parameters": {
"scheduledRunTime": "@trigger().scheduledTime"
}

For more information, see the instructions in How to read or write partitioned data.

JSON schema
The following JSON definition shows you how to create a schedule trigger with scheduling and recurrence:

{
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": <<Minute, Hour, Day, Week, Month>>,
"interval": <<int>>, // Optional, specifies how often to fire (default to 1)
"startTime": <<datetime>>,
"endTime": <<datetime - optional>>,
"timeZone": "UTC"
"schedule": { // Optional (advanced scheduling specifics)
"hours": [<<0-23>>],
"weekDays": [<<Monday-Sunday>>],
"minutes": [<<0-59>>],
"monthDays": [<<1-31>>],
"monthlyOccurrences": [
{
"day": <<Monday-Sunday>>,
"occurrence": <<1-5>>
}
]
}
}
},
"pipelines": [
{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "<Name of your pipeline>"
},
"parameters": {
"<parameter 1 Name>": {
"type": "Expression",
"value": "<parameter 1 Value>"
},
"<parameter 2 Name>" : "<parameter 2 Value>"
}
}
]
}
}

IMPORTANT
The parameters property is a mandatory property of the pipelines element. If your pipeline doesn't take any parameters,
you must include an empty JSON definition for the parameters property.
Schema overview
The following table provides a high-level overview of the major schema elements that are related to recurrence
and scheduling of a trigger:

JSON PROPERTY | DESCRIPTION
startTime | A Date-Time value. For simple schedules, the value of the startTime property applies to the first occurrence. For complex schedules, the trigger starts no sooner than the specified startTime value.
endTime | The end date and time for the trigger. The trigger doesn't execute after the specified end date and time. The value for the property can't be in the past. This property is optional.
timeZone | The time zone. Currently, only the UTC time zone is supported.
recurrence | A recurrence object that specifies the recurrence rules for the trigger. The recurrence object supports the frequency, interval, endTime, count, and schedule elements. When a recurrence object is defined, the frequency element is required. The other elements of the recurrence object are optional.
frequency | The unit of frequency at which the trigger recurs. The supported values include "minute," "hour," "day," "week," and "month."
interval | A positive integer that denotes the interval for the frequency value, which determines how often the trigger runs. For example, if the interval is 3 and the frequency is "week," the trigger recurs every 3 weeks.
schedule | The recurrence schedule for the trigger. A trigger with a specified frequency value alters its recurrence based on a recurrence schedule. The schedule property contains modifications for the recurrence that are based on minutes, hours, weekdays, month days, and week number.

Schema defaults, limits, and examples


JSON PROPERTY | TYPE | REQUIRED | DEFAULT VALUE | VALID VALUES | EXAMPLE
startTime | String | Yes | None | ISO-8601 Date-Times | "startTime" : "2013-01-09T09:30:00-08:00"
recurrence | Object | Yes | None | Recurrence object | "recurrence" : { "frequency" : "monthly", "interval" : 1 }
interval | Number | No | 1 | 1 to 1,000 | "interval" : 10
endTime | String | Yes | None | A Date-Time value that represents a time in the future. | "endTime" : "2013-02-09T09:30:00-08:00"
schedule | Object | No | None | Schedule object | "schedule" : { "minute" : [30], "hour" : [8,17] }

startTime property
The following table shows you how the startTime property controls a trigger run:

STARTTIME VALUE | RECURRENCE WITHOUT SCHEDULE | RECURRENCE WITH SCHEDULE
Start time in past | Calculates the first future execution time after the start time and runs at that time. Runs subsequent executions based on calculating from the last execution time. See the example that follows this table. | The trigger starts no sooner than the specified start time. The first occurrence is based on the schedule that's calculated from the start time. Runs subsequent executions based on the recurrence schedule.
Start time in future or at present | Runs once at the specified start time. Runs subsequent executions based on calculating from the last execution time. | The trigger starts no sooner than the specified start time. The first occurrence is based on the schedule that's calculated from the start time. Runs subsequent executions based on the recurrence schedule.

Let's see an example of what happens when the start time is in the past, with a recurrence, but no schedule.
Assume that the current time is 2017-04-08 13:00 , the start time is 2017-04-07 14:00 , and the recurrence is every
two days. (The recurrence value is defined by setting the frequency property to "day" and the interval property
to 2.) Notice that the startTime value is in the past and occurs before the current time.
Under these conditions, the first execution is at 2017-04-09 at 14:00 . The Scheduler engine calculates execution
occurrences from the start time. Any instances in the past are discarded. The engine uses the next instance that
occurs in the future. In this scenario, the start time is 2017-04-07 at 2:00pm , so the next instance is two days from
that time, which is 2017-04-09 at 2:00pm .
The first execution time is the same even if the startTime value is 2017-04-05 14:00 or 2017-04-01 14:00 . After
the first execution, subsequent executions are calculated by using the schedule. Therefore, the subsequent
executions are at 2017-04-11 at 2:00pm , then 2017-04-13 at 2:00pm , then 2017-04-15 at 2:00pm , and so on.
Finally, when the hours or minutes aren’t set in the schedule for a trigger, the hours or minutes of the first
execution are used as the defaults.
schedule property
On one hand, the use of a schedule can limit the number of trigger executions. For example, if a trigger with a
monthly frequency is scheduled to run only on day 31, the trigger runs only in those months that have a 31st day.
On the other hand, a schedule can also expand the number of trigger executions. For example, a trigger with a
monthly frequency that's scheduled to run on month days 1 and 2 runs on the 1st and 2nd days of the month,
rather than once a month.
If multiple schedule elements are specified, the order of evaluation is from the largest to the smallest schedule
setting. The evaluation starts with week number, and then month day, weekday, hour, and finally, minute.
The following table describes the schedule elements in detail:

JSON ELEMENT | DESCRIPTION | VALID VALUES
minutes | Minutes of the hour at which the trigger runs. | Integer; array of integers.
hours | Hours of the day at which the trigger runs. | Integer; array of integers.
weekDays | Days of the week on which the trigger runs. The value can be specified with a weekly frequency only. | Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday; array of day values (maximum array size is 7); day values are not case-sensitive.
monthlyOccurrences | Days of the month on which the trigger runs. The value can be specified with a monthly frequency only. | Array of monthlyOccurrence objects: { "day": day, "occurrence": occurrence }. The day attribute is the day of the week on which the trigger runs. For example, a monthlyOccurrences property with a day value of {Sunday} means every Sunday of the month. The day attribute is required. The occurrence attribute is the occurrence of the specified day during the month. For example, a monthlyOccurrences property with day and occurrence values of {Sunday, -1} means the last Sunday of the month. The occurrence attribute is optional.
monthDays | Day of the month on which the trigger runs. The value can be specified with a monthly frequency only. | Any value <= -1 and >= -31; any value >= 1 and <= 31; array of values.

Examples of trigger recurrence schedules


This section provides examples of recurrence schedules and focuses on the schedule object and its elements.
The examples assume that the interval value is 1, and that the frequency value is correct according to the
schedule definition. For example, you can't have a frequency value of "day" and also have a "monthDays"
modification in the schedule object. Restrictions such as these are mentioned in the table in the previous section.

{"hours":[5]} - Run at 5:00 AM every day.

{"minutes":[15], "hours":[5]} - Run at 5:15 AM every day.

{"minutes":[15], "hours":[5,17]} - Run at 5:15 AM and 5:15 PM every day.

{"minutes":[15,45], "hours":[5,17]} - Run at 5:15 AM, 5:45 AM, 5:15 PM, and 5:45 PM every day.

{"minutes":[0,15,30,45]} - Run every 15 minutes.

{"hours":[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]} - Run every hour. The minutes are controlled by the startTime value, when a value is specified. If a value is not specified, the minutes are controlled by the creation time. For example, if the start time or creation time (whichever applies) is 12:25 PM, the trigger runs at 00:25, 01:25, 02:25, ..., and 23:25. This schedule is equivalent to having a trigger with a frequency value of "hour," an interval value of 1, and no schedule. This schedule can be used with different frequency and interval values to create other triggers. For example, when the frequency value is "month," the schedule runs only once a month, rather than every day as it does when the frequency value is "day."

{"minutes":[0]} - Run every hour on the hour. This trigger runs every hour on the hour starting at 12:00 AM, 1:00 AM, 2:00 AM, and so on. This schedule is equivalent to a trigger with a frequency value of "hour" and a startTime value of zero minutes, or no schedule but a frequency value of "day." If the frequency value is "week" or "month," the schedule executes one day a week or one day a month only, respectively.

{"minutes":[15]} - Run at 15 minutes past every hour. This trigger runs every hour at 15 minutes past the hour starting at 00:15 AM, 1:15 AM, 2:15 AM, and so on, and ending at 11:15 PM.

{"hours":[17], "weekDays":["saturday"]} - Run at 5:00 PM on Saturdays every week.

{"hours":[17], "weekDays":["monday", "wednesday", "friday"]} - Run at 5:00 PM on Monday, Wednesday, and Friday every week.

{"minutes":[15,45], "hours":[17], "weekDays":["monday", "wednesday", "friday"]} - Run at 5:15 PM and 5:45 PM on Monday, Wednesday, and Friday every week.

{"minutes":[0,15,30,45], "weekDays":["monday", "tuesday", "wednesday", "thursday", "friday"]} - Run every 15 minutes on weekdays.

{"minutes":[0,15,30,45], "hours":[9, 10, 11, 12, 13, 14, 15, 16], "weekDays":["monday", "tuesday", "wednesday", "thursday", "friday"]} - Run every 15 minutes on weekdays between 9:00 AM and 4:45 PM.

{"weekDays":["tuesday", "thursday"]} - Run on Tuesdays and Thursdays at the specified start time.

{"minutes":[0], "hours":[6], "monthDays":[28]} - Run at 6:00 AM on the 28th day of every month (assuming a frequency value of "month").

{"minutes":[0], "hours":[6], "monthDays":[-1]} - Run at 6:00 AM on the last day of the month. To run a trigger on the last day of a month, use -1 instead of day 28, 29, 30, or 31.

{"minutes":[0], "hours":[6], "monthDays":[1,-1]} - Run at 6:00 AM on the first and last day of every month.

{"monthDays":[1,14]} - Run on the first and 14th day of every month at the specified start time.

{"minutes":[0], "hours":[5], "monthlyOccurrences":[{"day":"friday", "occurrence":1}]} - Run on the first Friday of every month at 5:00 AM.

{"monthlyOccurrences":[{"day":"friday", "occurrence":1}]} - Run on the first Friday of every month at the specified start time.

{"monthlyOccurrences":[{"day":"friday", "occurrence":-3}]} - Run on the third Friday from the end of the month, every month, at the specified start time.

{"minutes":[15], "hours":[5], "monthlyOccurrences":[{"day":"friday", "occurrence":1},{"day":"friday", "occurrence":-1}]} - Run on the first and last Friday of every month at 5:15 AM.

{"monthlyOccurrences":[{"day":"friday", "occurrence":1},{"day":"friday", "occurrence":-1}]} - Run on the first and last Friday of every month at the specified start time.

{"monthlyOccurrences":[{"day":"friday", "occurrence":5}]} - Run on the fifth Friday of every month at the specified start time. When there's no fifth Friday in a month, the pipeline doesn't run, since it's scheduled to run only on fifth Fridays. To run the trigger on the last occurring Friday of the month, consider using -1 instead of 5 for the occurrence value.

{"minutes":[0,15,30,45], "monthlyOccurrences":[{"day":"friday", "occurrence":-1}]} - Run every 15 minutes on the last Friday of the month.

{"minutes":[15,45], "hours":[5,17], "monthlyOccurrences":[{"day":"wednesday", "occurrence":3}]} - Run at 5:15 AM, 5:45 AM, 5:15 PM, and 5:45 PM on the third Wednesday of every month.

Next steps
For detailed information about triggers, see Pipeline execution and triggers.
Create a trigger that runs a pipeline on a tumbling
window
3/5/2019 • 6 minutes to read

This article provides steps to create, start, and monitor a tumbling window trigger. For general information about
triggers and the supported types, see Pipeline execution and triggers.
Tumbling window triggers are a type of trigger that fires at a periodic time interval from a specified start time,
while retaining state. Tumbling windows are a series of fixed-sized, non-overlapping, and contiguous time
intervals. A tumbling window trigger has a one-to-one relationship with a pipeline and can only reference a
singular pipeline.

Data Factory UI
To create a tumbling window trigger in the Azure portal, select Trigger > Tumbling window > Next, and then
configure the properties that define the tumbling window.

Tumbling window trigger type properties


A tumbling window has the following trigger type properties:
{
"name": "MyTriggerName",
"properties": {
"type": "TumblingWindowTrigger",
"runtimeState": "<<Started/Stopped/Disabled - readonly>>",
"typeProperties": {
"frequency": "<<Minute/Hour>>",
"interval": <<int>>,
"startTime": "<<datetime>>",
"endTime: "<<datetime – optional>>"",
"delay": "<<timespan – optional>>",
“maxConcurrency”: <<int>> (required, max allowed: 50),
"retryPolicy": {
"count": <<int - optional, default: 0>>,
“intervalInSeconds”: <<int>>,
}
},
"pipeline": {
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "MyPipelineName"
},
"parameters": {
"parameter1": {
"type": "Expression",
"value": "@{concat('output',formatDateTime(trigger().outputs.windowStartTime,'-dd-MM-yyyy-
HH-mm-ss-ffff'))}"
},
"parameter2": {
"type": "Expression",
"value": "@{concat('output',formatDateTime(trigger().outputs.windowEndTime,'-dd-MM-yyyy-
HH-mm-ss-ffff'))}"
},
"parameter3": "https://mydemo.azurewebsites.net/api/demoapi"
}
}
}
}

The following table provides a high-level overview of the major JSON elements that are related to recurrence and
scheduling of a tumbling window trigger:

type (String, required): The type of the trigger. The type is the fixed value "TumblingWindowTrigger". Allowed values: "TumblingWindowTrigger".

runtimeState (String, required): The current state of the trigger run time. Allowed values: "Started", "Stopped", "Disabled". Note: This element is read-only.

frequency (String, required): A string that represents the frequency unit (minutes or hours) at which the trigger recurs. If the startTime date values are more granular than the frequency value, the startTime dates are considered when the window boundaries are computed. For example, if the frequency value is hourly and the startTime value is 2017-09-01T10:10:10Z, the first window is (2017-09-01T10:10:10Z, 2017-09-01T11:10:10Z). Allowed values: "minute", "hour".

interval (Integer, required): A positive integer that denotes the interval for the frequency value, which determines how often the trigger runs. For example, if the interval is 3 and the frequency is "hour," the trigger recurs every 3 hours.

startTime (DateTime, required): The first occurrence, which can be in the past. The first trigger interval is (startTime, startTime + interval).

endTime (DateTime, required): The last occurrence, which can be in the past.

delay (Timespan, optional): The amount of time to delay the start of data processing for the window. The pipeline run is started after the expected execution time plus the amount of delay. The delay defines how long the trigger waits past the due time before triggering a new run. The delay doesn't alter the window startTime. For example, a delay value of 00:10:00 implies a delay of 10 minutes. A timespan value (hh:mm:ss) where the default is 00:00:00.

maxConcurrency (Integer, required): The number of simultaneous trigger runs that are fired for windows that are ready. For example, backfilling hourly runs for yesterday results in 24 windows. If maxConcurrency = 10, trigger events are fired only for the first 10 windows (00:00-01:00 through 09:00-10:00). After the first 10 triggered pipeline runs are complete, trigger runs are fired for the next 10 windows (10:00-11:00 through 19:00-20:00). Continuing with this example of maxConcurrency = 10, if there are 10 windows ready, there are 10 total pipeline runs. If there's only 1 window ready, there's only 1 pipeline run. An integer between 1 and 50.

retryPolicy: count (Integer, optional): The number of retries before the pipeline run is marked as "Failed." The default is 0 (no retries).

retryPolicy: intervalInSeconds (Integer, optional): The delay between retry attempts, specified in seconds. The default is 30 seconds.

WindowStart and WindowEnd system variables


You can use the WindowStart and WindowEnd system variables of the tumbling window trigger in your
pipeline definition (that is, for part of a query). Pass the system variables as parameters to your pipeline in the
trigger definition. The following example shows you how to pass these variables as parameters:

{
"name": "MyTriggerName",
"properties": {
"type": "TumblingWindowTrigger",
...
"pipeline": {
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "MyPipelineName"
},
"parameters": {
"MyWindowStart": {
"type": "Expression",
"value": "@{concat('output',formatDateTime(trigger().outputs.windowStartTime,'-dd-MM-yyyy-
HH-mm-ss-ffff'))}"
},
"MyWindowEnd": {
"type": "Expression",
"value": "@{concat('output',formatDateTime(trigger().outputs.windowEndTime,'-dd-MM-yyyy-
HH-mm-ss-ffff'))}"
}
}
}
}
}

To use the WindowStart and WindowEnd system variable values in the pipeline definition, use your
"MyWindowStart" and "MyWindowEnd" parameters, accordingly.
Execution order of windows in a backfill scenario
When there are multiple windows up for execution (especially in a backfill scenario), the order of execution for
windows is deterministic, from oldest to newest intervals. Currently, this behavior can't be modified.
Existing TriggerResource elements
The following points apply to existing TriggerResource elements:
If the value for the frequency element (or window size) of the trigger changes, the state of the windows that
are already processed is not reset. The trigger continues to fire for the windows from the last window that it
executed by using the new window size.
If the value for the endTime element of the trigger changes (added or updated), the state of the windows that
are already processed is not reset. The trigger honors the new endTime value. If the new endTime value is
before the windows that are already executed, the trigger stops. Otherwise, the trigger stops when the new
endTime value is encountered.

Sample for Azure PowerShell


NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

This section shows you how to use Azure PowerShell to create, start, and monitor a trigger.
1. Create a JSON file named MyTrigger.json in the C:\ADFv2QuickStartPSH\ folder with the following
content:

IMPORTANT
Before you save the JSON file, set the value of the startTime element to the current UTC time. Set the value of the
endTime element to one hour past the current UTC time.

{
"name": "PerfTWTrigger",
"properties": {
"type": "TumblingWindowTrigger",
"typeProperties": {
"frequency": "Minute",
"interval": "15",
"startTime": "2017-09-08T05:30:00Z",
"delay": "00:00:01",
"retryPolicy": {
"count": 2,
"intervalInSeconds": 30
},
"maxConcurrency": 50
},
"pipeline": {
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "DynamicsToBlobPerfPipeline"
},
"parameters": {
"windowStart": "@trigger().outputs.windowStartTime",
"windowEnd": "@trigger().outputs.windowEndTime"
}
},
"runtimeState": "Started"
}
}

2. Create a trigger by using the Set-AzDataFactoryV2Trigger cmdlet:

Set-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger" -DefinitionFile "C:\ADFv2QuickStartPSH\MyTrigger.json"

3. Confirm that the status of the trigger is Stopped by using the Get-AzDataFactoryV2Trigger cmdlet:

Get-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger"

4. Start the trigger by using the Start-AzDataFactoryV2Trigger cmdlet:

Start-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger"

5. Confirm that the status of the trigger is Started by using the Get-AzDataFactoryV2Trigger cmdlet:

Get-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger"
6. Get the trigger runs in Azure PowerShell by using the Get-AzDataFactoryV2TriggerRun cmdlet. To get
information about the trigger runs, execute the following command periodically. Update the
TriggerRunStartedAfter and TriggerRunStartedBefore values to match the values in your trigger
definition:

Get-AzDataFactoryV2TriggerRun -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -TriggerName "MyTrigger" -TriggerRunStartedAfter "2017-12-08T00:00:00" -TriggerRunStartedBefore "2017-12-08T01:00:00"

To monitor trigger runs and pipeline runs in the Azure portal, see Monitor pipeline runs.
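After testing, you may want to stop the trigger so that no further windows are executed. A minimal sketch, assuming the same $ResourceGroupName and $DataFactoryName variables used above; removing the trigger is optional cleanup.

# Stop the trigger so that no further trigger runs are fired
Stop-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger"

# Optionally delete the trigger definition from the data factory
Remove-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger"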

Next steps
For detailed information about triggers, see Pipeline execution and triggers.
Templates
3/14/2019 • 2 minutes to read

Templates are predefined Azure Data Factory pipelines that let you get started quickly, which is especially useful
when you're new to Data Factory. They reduce the development time for building data integration projects,
thereby improving developer productivity.

Create Data Factory pipelines from templates


You can get started creating a Data Factory pipeline from a template in the following two ways:
1. Select Create pipeline from template on the Overview page to open the template gallery.

2. On the Author tab in Resource Explorer, select +, then Pipeline from template to open the template
gallery.
Template Gallery

Out of the box Data Factory templates


Data Factory uses Azure Resource Manager templates for saving data factory pipeline templates. You can see all
the Resource Manager templates, along with the manifest file used for out of the box Data Factory templates, in the
official Azure Data Factory GitHub repo. The predefined templates provided by Microsoft include but are not
limited to the following items:
Copy templates:
Bulk copy from Database
Copy new files by LastModifiedDate
Copy multiple file containers between file-based stores
Delta copy from Database
Copy from <source> to <destination>
From Amazon S3 to Azure Data Lake Store Gen 2
From Google Big Query to Azure Data Lake Store Gen 2
From HDFS to Azure Data Lake Store Gen 2
From Netezza to Azure Data Lake Store Gen 1
From SQL Server on premises to Azure SQL Database
From SQL Server on premises to Azure SQL Data Warehouse
From Oracle on premises to Azure SQL Data Warehouse
SSIS templates
Schedule Azure-SSIS Integration Runtime to execute SSIS packages
Transform templates
ETL with Azure Databricks
My Templates
You can also save a pipeline as a template by selecting Save as template on the Pipeline tab.

You can view pipelines saved as templates in the My Templates section of the Template Gallery. You can also see
them in the Templates section in the Resource Explorer.
NOTE
To use the My Templates feature, you have to enable GIT integration. Both Azure DevOps GIT and GitHub are supported.
Copy files from multiple containers with Azure Data
Factory
3/6/2019 • 2 minutes to read

This article describes a solution template that you can use to copy files from multiple containers between file
stores. For example, you can use it to migrate your data lake from AWS S3 to Azure Data Lake Store. Or, you
could use the template to replicate everything from one Azure Blob storage account to another.

NOTE
If you want to copy files from a single container, it's more efficient to use the Copy Data Tool to create a pipeline with a single
copy activity. The template in this article is more than you need for that simple scenario.

About this solution template


This template enumerates the containers from your source storage store. It then copies those containers to the
destination store.
The template contains three activities:
GetMetadata scans your source storage store and gets the container list.
ForEach gets the container list from the GetMetadata activity and then iterates over the list and passes each
container to the Copy activity.
Copy copies each container from the source storage store to the destination store.
The template defines two parameters:
SourceFilePath is the path of your data source store, where you can get a list of the containers. In most cases,
the path is the root directory, which contains multiple container folders. The default value of this parameter is
/ .
DestinationFilePath is the path where the files will be copied to in your destination store. The default value of
this parameter is / .
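As a rough sketch of how the ForEach activity typically consumes the GetMetadata output in a pipeline like this, the items property points at the childItems array returned by GetMetadata; the activity name 'GetContainerList' is an assumption, not necessarily the name used in the template.

"items": {
    "value": "@activity('GetContainerList').output.childItems",
    "type": "Expression"
}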

How to use this solution template


1. Go to the Copy multiple files containers between File Stores template. Create a New connection to
your source storage store. The source storage store is the store from which you want to copy files from
multiple containers.
2. Create a New connection to your destination storage store.

3. Select Use this template.


4. You'll see the pipeline, as in the following example:

5. Select Debug, enter the Parameters, and then select Finish.


6. Review the result.

Next steps
Bulk copy from a database by using a control table with Azure Data Factory
Copy files from multiple containers with Azure Data Factory
Copy new and changed files by LastModifiedDate
with Azure Data Factory
3/15/2019 • 3 minutes to read

This article describes a solution template that you can use to copy new and changed files only by
LastModifiedDate from a file-based store to a destination store.

About this solution template


This template first selects the new and changed files only by their LastModifiedDate attribute, and then copies
those selected files from the data source store to the data destination store.
The template contains one activity:
Copy to copy new and changed files only by LastModifiedDate from a file store to a destination store.
The template defines four parameters:
FolderPath_Source is the folder path where you can read the files from the source store. You need to replace
the default value with your own folder path.
FolderPath_Destination is the folder path where you want to copy files to the destination store. You need to
replace the default value with your own folder path.
LastModified_From is used to select the files whose LastModifiedDate attribute is after or equal to this datetime
value. In order to select only the new files, which have not been copied previously, this datetime value can be the
time when the pipeline was last triggered. You can replace the default value '2019-02-01T00:00:00Z' with your
expected LastModifiedDate in the UTC time zone.
LastModified_To is used to select the files whose LastModifiedDate attribute is before this datetime value. In
order to select only the new files, which have not been copied previously, this datetime value can be the present
time. You can replace the default value '2019-02-01T00:00:00Z' with your expected LastModifiedDate in the UTC
time zone.

How to use this solution template


1. Go to template Copy new files only by LastModifiedDate. Create a New connection to your source
storage store. The source storage store is where you want to copy files from.
2. First select the storage Type. After that, enter the storage account name and the account key. Finally,
select Finish.
3. Create a New connection to your destination store. The destination store is where you want to copy files to.
You also need to enter the connection information of the data destination store, similar to what you did in step 2.
4. Select Use this template.

5. You will see the pipeline available in the panel, as shown in the following example:
6. Select Debug, enter the values for the Parameters, and then select Finish. In the example below, the
parameters are set as follows:
FolderPath_Source = /source/
FolderPath_Destination = /destination/
LastModified_From = 2019-02-01T00:00:00Z
LastModified_To = 2019-03-01T00:00:00Z
This example indicates that the files last modified between 2019-02-01T00:00:00Z and
2019-03-01T00:00:00Z will be copied from the folder /source/ to the folder /destination/. You can
replace these values with your own parameters.

7. Review the result. You will see that only the files last modified within the configured timespan have been copied to
the destination store.
8. Now you can add a tumbling window trigger to automate this pipeline, so that the pipeline can always
copy new and changed files only by LastModifiedDate periodically. Select Add trigger, and select
New/Edit.

9. In the Add Triggers window, select + New.

10. Select Tumbling Window for the trigger type, set Every 15 minute(s) as the recurrence (you can change
to any interval time), and then select Next.
11. Enter the values for the Trigger Run Parameters as follows, and select Finish. (A JSON sketch of these
trigger run parameters appears after this procedure.)
FolderPath_Source = /source/. You can replace this with your folder in the source data store.
FolderPath_Destination = /destination/. You can replace this with your folder in the destination data store.
LastModified_From = @trigger().outputs.windowStartTime. It is a system variable from the trigger
determining the time when the pipeline was triggered last time.
LastModified_To = @trigger().outputs.windowEndTime. It is a system variable from the trigger
determining the time when the pipeline is triggered this time.
12. Select Publish All.
13. Create new files in the source folder of your data source store. The pipeline will now be triggered
automatically, and only the new files will be copied to the destination store.
14. Select the Monitoring tab in the left navigation panel, and wait for about 15 minutes if the recurrence of the
trigger has been set to every 15 minutes.

15. Review the result. You will see that your pipeline is triggered automatically every 15 minutes, and only the
new or changed files from the source store are copied to the destination store in each pipeline run.
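Expressed as trigger JSON, the trigger run parameters configured in step 11 would look roughly like the following sketch; the folder values shown are the sample /source/ and /destination/ paths used above.

"parameters": {
    "FolderPath_Source": "/source/",
    "FolderPath_Destination": "/destination/",
    "LastModified_From": "@trigger().outputs.windowStartTime",
    "LastModified_To": "@trigger().outputs.windowEndTime"
}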
Next steps
Introduction to Azure Data Factory
Bulk copy from a database with a control table
3/6/2019 • 3 minutes to read

To copy data from a data warehouse in Oracle Server, Netezza, Teradata, or SQL Server to Azure SQL Data
Warehouse, you have to load huge amounts of data from multiple tables. Usually, the data has to be partitioned in
each table so that you can load rows with multiple threads in parallel from a single table. This article describes a
template to use in these scenarios.

NOTE
If you want to copy data from a small number of tables with relatively small data volume to SQL Data
Warehouse, it's more efficient to use the Azure Data Factory Copy Data tool. The template that's described in
this article is more than you need for that scenario.

About this solution template


This template retrieves a list of source database partitions to copy from an external control table. Then it iterates
over each partition in the source database and copies the data to the destination.
The template contains three activities:
Lookup retrieves the list of source database partitions from an external control table.
ForEach gets the partition list from the Lookup activity, iterates over each partition, and passes it to the Copy activity.
Copy copies each partition from the source database store to the destination store.
The template defines five parameters:
Control_Table_Name is your external control table, which stores the partition list for the source database.
Control_Table_Schema_PartitionID is the name of the column in your external control table that stores
each partition ID. Make sure that the partition ID is unique for each partition in the source database.
Control_Table_Schema_SourceTableName is the name of the column in your external control table that stores
each table name from the source database.
Control_Table_Schema_FilterQuery is the name of the column in your external control table that stores the
filter query to get the data from each partition in the source database. For example, if you partitioned the data
by year, the query that's stored in each row might be similar to 'select * from datasource where LastModifytime
>= ''2015-01-01 00:00:00'' and LastModifytime <= ''2015-12-31 23:59:59.999'''.
Data_Destination_Folder_Path is the path where the data is copied into your destination store. This parameter
is only visible if the destination that you choose is file-based storage. If you choose SQL Data Warehouse as
the destination store, this parameter is not required. But the table names and the schema in SQL Data
Warehouse must be the same as the ones in the source database.
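To give a sense of how these parameters and activities fit together, the ForEach items and the Copy source query typically use expressions along the following lines. The Lookup activity name is an assumption, and FilterQuery refers to the column in the sample control table created in the next section.

ForEach items:
"items": {
    "value": "@activity('LookupPartitionList').output.value",
    "type": "Expression"
}

Copy activity source query:
"sqlReaderQuery": {
    "value": "@item().FilterQuery",
    "type": "Expression"
}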

How to use this solution template


1. Create a control table in SQL Server or Azure SQL Database to store the source database partition list for
bulk copy. In the following example, there are five partitions in the source database. Three partitions are for
the datasource_table, and two are for the project_table. The column LastModifytime is used to partition the
data in table datasource_table from the source database. The query that's used to read the first partition is
'select * from datasource_table where LastModifytime >= ''2015-01-01 00:00:00'' and LastModifytime <=
''2015-12-31 23:59:59.999'''. You can use a similar query to read data from other partitions.
Create table ControlTableForTemplate
(
PartitionID int,
SourceTableName varchar(255),
FilterQuery varchar(255)
);

INSERT INTO ControlTableForTemplate


(PartitionID, SourceTableName, FilterQuery)
VALUES
(1, 'datasource_table','select * from datasource_table where LastModifytime >= ''2015-01-01
00:00:00'' and LastModifytime <= ''2015-12-31 23:59:59.999'''),
(2, 'datasource_table','select * from datasource_table where LastModifytime >= ''2016-01-01
00:00:00'' and LastModifytime <= ''2016-12-31 23:59:59.999'''),
(3, 'datasource_table','select * from datasource_table where LastModifytime >= ''2017-01-01
00:00:00'' and LastModifytime <= ''2017-12-31 23:59:59.999'''),
(4, 'project_table','select * from project_table where ID >= 0 and ID < 1000'),
(5, 'project_table','select * from project_table where ID >= 1000 and ID < 2000');

2. Go to the Bulk Copy from Database template. Create a New connection to the external control table that
you created in step 1.

3. Create a New connection to the source database that you're copying data from.
4. Create a New connection to the destination data store that you're copying the data to.

5. Select Use this template.


6. You see the pipeline, as shown in the following example:

7. Select Debug, enter the Parameters, and then select Finish.


8. You see results that are similar to the following example:

9. (Optional) If you chose SQL Data Warehouse as the data destination, you must enter a connection to Azure
Blob storage for staging, as required by SQL Data Warehouse Polybase. Make sure that the container in
Blob storage has already been created.
Next steps
Introduction to Azure Data Factory
Delta copy from a database with a control table
3/11/2019 • 4 minutes to read

This article describes a template that's available to incrementally load new or updated rows from a database table
to Azure by using an external control table that stores a high-watermark value.
This template requires that the schema of the source database contains a timestamp column or incrementing key
to identify new or updated rows.

NOTE
If you have a timestamp column in your source database to identify new or updated rows but you don't want to create an
external control table to use for delta copy, you can instead use the Azure Data Factory Copy Data tool to get a pipeline. That
tool uses a trigger-scheduled time as a variable to read new rows from the source database.

About this solution template


This template first retrieves the old watermark value and compares it with the current watermark value. After that,
it copies only the changes from the source database, based on a comparison between the two watermark values.
Finally, it stores the new high-watermark value to an external control table for delta data loading next time.
The template contains four activities:
Lookup retrieves the old high-watermark value, which is stored in an external control table.
Another Lookup activity retrieves the current high-watermark value from the source database.
Copy copies only changes from the source database to the destination store. The query that identifies the
changes in the source database is similar to 'SELECT * FROM Data_Source_Table WHERE
TIMESTAMP_Column > "last high-watermark" and TIMESTAMP_Column <= "current high-watermark"'.
SqlServerStoredProcedure writes the current high-watermark value to an external control table for delta
copy next time.
The template defines five parameters:
Data_Source_Table_Name is the table in the source database that you want to load data from.
Data_Source_WaterMarkColumn is the name of the column in the source table that's used to identify new or
updated rows. The type of this column is typically datetime, INT, or similar.
Data_Destination_Folder_Path or Data_Destination_Table_Name is the place where the data is copied to in
your destination store.
Control_Table_Table_Name is the external control table that stores the high-watermark value.
Control_Table_Column_Name is the column in the external control table that stores the high-watermark value.
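As an illustration of how the watermark comparison can be expressed, a Copy activity source query built from these parameters and the two Lookup outputs might look like the following sketch. The Lookup activity names and output column names are assumptions; NewWatermarkValue matches the reference used later in this procedure.

"sqlReaderQuery": {
    "value": "select * from @{pipeline().parameters.Data_Source_Table_Name} where @{pipeline().parameters.Data_Source_WaterMarkColumn} > '@{activity('LookupOldWaterMark').output.firstRow.WatermarkValue}' and @{pipeline().parameters.Data_Source_WaterMarkColumn} <= '@{activity('LookupCurrentWaterMark').output.firstRow.NewWatermarkValue}'",
    "type": "Expression"
}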

How to use this solution template


1. Explore the source table that you want to load, and define the high-watermark column that can be used to
identify new or updated rows. The type of this column might be datetime, INT, or similar. This column's
value increases as new rows are added. From the following sample source table (data_source_table), we can
use the LastModifytime column as the high-watermark column.
PersonID Name LastModifytime
1 aaaa 2017-09-01 00:56:00.000
2 bbbb 2017-09-02 05:23:00.000
3 cccc 2017-09-03 02:36:00.000
4 dddd 2017-09-04 03:21:00.000
5 eeee 2017-09-05 08:06:00.000
6 fffffff 2017-09-06 02:23:00.000
7 gggg 2017-09-07 09:01:00.000
8 hhhh 2017-09-08 09:01:00.000
9 iiiiiiiii 2017-09-09 09:01:00.000

2. Create a control table in SQL Server or Azure SQL Database to store the high-watermark value for delta
data loading. In the following example, the name of the control table is watermarktable. In this table,
WatermarkValue is the column that stores the high-watermark value, and its type is datetime.

create table watermarktable
(
WatermarkValue datetime
);

INSERT INTO watermarktable
VALUES ('1/1/2010 12:00:00 AM')

3. Create a stored procedure in the same SQL Server or Azure SQL Database instance that you used to create
the control table. The stored procedure is used to write the new high-watermark value to the external
control table for delta data loading next time.

CREATE PROCEDURE update_watermark @LastModifiedtime datetime


AS

BEGIN

UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime

END

4. Go to the Delta copy from Database template. Create a New connection to the source database that you
want to copy data from.
5. Create a New connection to the destination data store that you want to copy the data to.

6. Create a New connection to the external control table and stored procedure that you created in steps 2 and
3.
7. Select Use this template.

8. You see the available pipeline, as shown in the following example:


9. Select Stored Procedure. For Stored procedure name, choose [update_watermark]. Select Import
parameter, and then select Add dynamic content.

10. Write the content @{activity('LookupCurrentWaterMark').output.firstRow.NewWatermarkValue},


and then select Finish.
11. Select Debug, enter the Parameters, and then select Finish.

12. Results similar to the following example are displayed:


13. You can create new rows in your source table. Here is sample SQL to create new rows:

INSERT INTO data_source_table


VALUES (10, 'newdata','9/10/2017 2:23:00 AM')

INSERT INTO data_source_table


VALUES (11, 'newdata','9/11/2017 9:01:00 AM')

14. To run the pipeline again, select Debug, enter the Parameters, and then select Finish.

You see that only new rows were copied to the destination.
15. (Optional) If you selected SQL Data Warehouse as the data destination, you must also provide a connection
to Azure Blob storage for staging, which is required by SQL Data Warehouse Polybase. Make sure that the
container has already been created in Blob storage.
Next steps
Bulk copy from a database by using a control table with Azure Data Factory
Copy files from multiple containers with Azure Data Factory
Transform data by using Databricks in Azure Data
Factory
3/29/2019 • 3 minutes to read

In this tutorial, you create an end-to-end pipeline containing Lookup, Copy, and Databricks notebook activities
in Data Factory.
Lookup or GetMetadata activity is used to ensure the source dataset is ready for downstream consumption,
before triggering the copy and analytics job.
Copy activity copies the source file/dataset to the sink storage. The sink storage is mounted as DBFS in
the Databricks notebook so that the dataset can be directly consumed by Spark.
Databricks notebook activity triggers the Databricks notebook that transforms the dataset, and adds it to
a processed folder or Azure SQL Data Warehouse.
To keep this template simple, the template doesn't create a scheduled trigger. You can add that if necessary.

Prerequisites
1. Create a blob storage account and a container called sinkdata to be used as sink. Keep a note of the
storage account name, container name, and access key, since they are referenced later in the template.
2. Ensure you have an Azure Databricks workspace or create a new one.
3. Import the notebook for ETL. Import the below Transform notebook to the Databricks workspace. (It
does not have to be in the same location as below, but remember the path that you choose for later.) Import
the notebook from the following URL by entering this URL in the URL field:
https://adflabstaging1.blob.core.windows.net/share/Transformations.html . Select Import.
4. Now update the Transformation notebook with your storage connection information (name and
access key). Go to command 5 in the imported notebook, and replace it with the following code snippet
after replacing the highlighted values. Ensure this account is the same storage account created earlier and
contains the sinkdata container.

# Supply storageName and accessKey values
storageName = "<storage name>"
accessKey = "<access key>"

try:
    # Mount the sinkdata blob container as DBFS so the notebook can read and write it directly
    dbutils.fs.mount(
        source = "wasbs://sinkdata@" + storageName + ".blob.core.windows.net/",
        mount_point = "/mnt/ADFdata",
        extra_configs = {"fs.azure.account.key." + storageName + ".blob.core.windows.net": accessKey})

except Exception as e:
    # The error message has a long stack trace. This code tries to print just the relevant line indicating what failed.
    import re
    result = re.findall(r"^\s*Caused by:\s*\S+:\s*(.*)$", str(e), flags=re.MULTILINE)
    if result:
        print(result[-1])  # Print only the relevant error message
    else:
        print(e)  # Otherwise print the whole error

5. Generate a Databricks access token for Data Factory to access Databricks. Save the access token for
later use in creating a Databricks linked service, which looks something like
'dapi32db32cbb4w6eee18b7d87e45exxxxxx'
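For reference, a minimal sketch of what the Databricks linked service JSON might look like when the access token is used with a new job cluster; the linked service name, workspace domain, cluster version, and node type shown here are illustrative assumptions.

{
    "name": "AzureDatabricks_LS",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://<region>.azuredatabricks.net",
            "accessToken": {
                "type": "SecureString",
                "value": "<your Databricks access token>"
            },
            "newClusterVersion": "5.5.x-scala2.11",
            "newClusterNumOfWorker": "2",
            "newClusterNodeType": "Standard_DS3_v2"
        }
    }
}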

Create linked services and datasets


1. Create new linked services in the Data Factory UI by going to Connections > Linked services > + New.
a. Source – for accessing source data. You can use the public blob storage containing the source files
for this sample.
Select Blob Storage, use the below SAS URI to connect to source storage (read-only access).
https://storagewithdata.blob.core.windows.net/?sv=2017-11-09&ss=b&srt=sco&sp=rl&se=2019-12-
31T21:40:53Z&st=2018-10-24T13:40:53Z&spr=https&sig=K8nRio7c4xMLnUV0wWVAmqr5H4P3JDwBaG9HCevI7kU%3D
b. Sink – for copying data into.
Select a storage created in the prerequisite 1, in the sink linked service.
c. Databricks – for connecting to Databricks cluster
Create a Databricks linked service using the access token generated in the prerequisites. If you have an
interactive cluster, you may select that. (This example uses the New job cluster option.)
2. Create datasets
a. Create 'sourceAvailability_Dataset' to check if the source data is available.
b. Source dataset – for copying the source data (using binary copy).
c. Sink dataset – for copying into the sink/destination location.
Linked service - select 'sinkBlob_LS' created in 1.b.
File path - 'sinkdata/staged_sink'.
Create activities
1. Create a Lookup activity 'Availability flag' for doing a Source Availability check (Lookup or GetMetadata
can be used). Select 'sourceAvailability_Dataset' created in 2.a.

2. Create a Copy activity 'file-to-blob' for copying dataset from source to sink. In this case, the data is binary
file. Reference the below screenshots for source and sink configurations in the copy activity.
3. Define pipeline parameters
4. Create a Databricks activity
Select the linked service created in a previous step.

Configure the settings. Create Base Parameters as shown in the screenshot and create parameters to be
passed to the Databricks notebook from Data Factory. Browse and select the correct notebook path
uploaded in prerequisite 2.
5. Run the pipeline. You can find a link to the Databricks logs for more detailed Spark logs.

You can also verify the data file using storage explorer. (For correlating with Data Factory pipeline runs, this
example appends the pipeline run ID from data factory to the output folder. This way you can track back the
files generated via each run.)
Next steps
Introduction to Azure Data Factory
Azure Data Factory FAQ
4/4/2019 • 13 minutes to read

This article provides answers to frequently asked questions about Azure Data Factory.

What is Azure Data Factory?


Data Factory is a fully managed, cloud-based, data-integration service that automates the movement and
transformation of data. Like a factory that runs equipment to transform raw materials into finished goods, Azure
Data Factory orchestrates existing services that collect raw data and transform it into ready-to-use information.
By using Azure Data Factory, you can create data-driven workflows to move data between on-premises and cloud
data stores. And you can process and transform data by using compute services such as Azure HDInsight, Azure
Data Lake Analytics, and the SQL Server Integration Services (SSIS) integration runtime.
With Data Factory, you can execute your data processing either on an Azure-based cloud service or in your own
self-hosted compute environment, such as SSIS, SQL Server, or Oracle. After you create a pipeline that performs
the action you need, you can schedule it to run periodically (hourly, daily, or weekly, for example), schedule it on
a time window, or trigger the pipeline from an event occurrence. For more information, see Introduction to Azure Data
Factory.
Control flows and scale
To support the diverse integration flows and patterns in the modern data warehouse, Data Factory enables flexible
data pipeline modeling. This entails full control flow programming paradigms, which include conditional execution,
branching in data pipelines, and the ability to explicitly pass parameters within and across these flows. Control flow
also encompasses transforming data through activity dispatch to external execution engines and data flow
capabilities, including data movement at scale, via the Copy activity.
Data Factory provides freedom to model any flow style that's required for data integration and that can be
dispatched on demand or repeatedly on a schedule. A few common flows that this model enables are:
Control flows:
Activities can be chained together in a sequence within a pipeline.
Activities can be branched within a pipeline.
Parameters:
Parameters can be defined at the pipeline level and arguments can be passed while you invoke the
pipeline on demand or from a trigger.
Activities can consume the arguments that are passed to the pipeline.
Custom state passing:
Activity outputs, including state, can be consumed by a subsequent activity in the pipeline.
Looping containers:
The foreach activity will iterate over a specified collection of activities in a loop.
Trigger-based flows:
Pipelines can be triggered on demand or by wall-clock time.
Delta flows:
Parameters can be used to define your high-water mark for delta copy while moving dimension or
reference tables from a relational store, either on-premises or in the cloud, to load the data into the lake.
For more information, see Tutorial: Control flows.
Data transformed at scale with code -free pipelines
The new browser-based tooling experience provides code-free pipeline authoring and deployment with a modern,
interactive web-based experience.
For visual data developers and data engineers, the Data Factory web UI is the code-free design environment that
you will use to build pipelines. It's fully integrated with Visual Studio Online Git and provides integration for CI/CD
and iterative development with debugging options.
Rich cross-platform SDKs for advanced users
Data Factory V2 provides a rich set of SDKs that can be used to author, manage, and monitor pipelines by using
your favorite IDE, including:
Python SDK
PowerShell CLI
C# SDK
Users can also use the documented REST APIs to interface with Data Factory V2.
Iterative development and debugging by using visual tools
Azure Data Factory visual tools enable iterative development and debugging. You can create your pipelines and do
test runs by using the Debug capability in the pipeline canvas without writing a single line of code. You can view
the results of your test runs in the Output window of your pipeline canvas. After your test run succeeds, you can
add more activities to your pipeline and continue debugging in an iterative manner. You can also cancel your test
runs after they are in progress.
You are not required to publish your changes to the data factory service before selecting Debug. This is helpful in
scenarios where you want to make sure that the new additions or changes will work as expected before you update
your data factory workflows in development, test, or production environments.
Ability to deploy SSIS packages to Azure
If you want to move your SSIS workloads, you can create a Data Factory and provision an Azure-SSIS integration
runtime. An Azure-SSIS integration runtime is a fully managed cluster of Azure VMs (nodes) that are dedicated to
run your SSIS packages in the cloud. For step-by-step instructions, see the Deploy SSIS packages to Azure tutorial.
SDKs
If you are an advanced user and looking for a programmatic interface, Data Factory provides a rich set of SDKs that
you can use to author, manage, or monitor pipelines by using your favorite IDE. Language support includes .NET,
PowerShell, Python, and REST.
Monitoring
You can monitor your Data Factories via PowerShell, SDK, or the Visual Monitoring Tools in the browser user
interface. You can monitor and manage on-demand, trigger-based, and clock-driven custom flows in an efficient
and effective manner. Cancel existing tasks, see failures at a glance, drill down to get detailed error messages, and
debug the issues, all from a single pane of glass without context switching or navigating back and forth between
screens.
New features for SSIS in Data Factory
Since the initial public preview release in 2017, Data Factory has added the following features for SSIS:
Support for three more configurations/variants of Azure SQL Database to host the SSIS database (SSISDB) of
projects/packages:
SQL Database with virtual network service endpoints
Managed instance
Elastic pool
Support for an Azure Resource Manager virtual network on top of a classic virtual network to be deprecated in
the future, which lets you inject/join your Azure-SSIS integration runtime to a virtual network configured for
SQL Database with virtual network service endpoints/MI/on-premises data access. For more information, see
also Join an Azure-SSIS integration runtime to a virtual network.
Support for Azure Active Directory (Azure AD ) authentication and SQL authentication to connect to the
SSISDB, allowing Azure AD authentication with your Data Factory managed identity for Azure resources
Support for bringing your own on-premises SQL Server license to earn substantial cost savings from the Azure
Hybrid Benefit option
Support for Enterprise Edition of the Azure-SSIS integration runtime that lets you use advanced/premium
features, a custom setup interface to install additional components/extensions, and a partner ecosystem. For
more information, see also Enterprise Edition, Custom Setup, and 3rd Party Extensibility for SSIS in ADF.
Deeper integration of SSIS in Data Factory that lets you invoke/trigger first-class Execute SSIS Package
activities in Data Factory pipelines and schedule them via SSMS. For more information, see also Modernize and
extend your ETL/ELT workflows with SSIS activities in ADF pipelines.

What is the integration runtime?


The integration runtime is the compute infrastructure that Azure Data Factory uses to provide the following data
integration capabilities across various network environments:
Data movement: For data movement, the integration runtime moves the data between the source and
destination data stores, while providing support for built-in connectors, format conversion, column mapping,
and performant and scalable data transfer.
Dispatch activities: For transformation, the integration runtime provides capability to natively execute SSIS
packages.
Execute SSIS packages: The integration runtime natively executes SSIS packages in a managed Azure
compute environment. The integration runtime also supports dispatching and monitoring transformation
activities running on a variety of compute services, such as Azure HDInsight, Azure Machine Learning, SQL
Database, and SQL Server.
You can deploy one or many instances of the integration runtime as required to move and transform data. The
integration runtime can run on an Azure public network or on a private network (on-premises, Azure Virtual
Network, or Amazon Web Services virtual private cloud [VPC]).
For more information, see Integration runtime in Azure Data Factory.

What is the limit on the number of integration runtimes?


There is no hard limit on the number of integration runtime instances you can have in a data factory. There is,
however, a limit on the number of VM cores that the integration runtime can use per subscription for SSIS package
execution. For more information, see Data Factory limits.

What are the top-level concepts of Azure Data Factory?


An Azure subscription can have one or more Azure Data Factory instances (or data factories). Azure Data Factory
contains four key components that work together as a platform on which you can compose data-driven workflows
with steps to move and transform data.
Pipelines
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities to perform a unit of
work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group of activities
that ingest data from an Azure blob and then run a Hive query on an HDInsight cluster to partition the data. The
benefit is that you can use a pipeline to manage the activities as a set instead of having to manage each activity
individually. You can chain together the activities in a pipeline to operate them sequentially, or you can operate
them independently, in parallel.
Activities
Activities represent a processing step in a pipeline. For example, you can use a Copy activity to copy data from one
data store to another data store. Similarly, you can use a Hive activity, which runs a Hive query on an Azure
HDInsight cluster to transform or analyze your data. Data Factory supports three types of activities: data
movement activities, data transformation activities, and control activities.
Datasets
Datasets represent data structures within the data stores, which simply point to or reference the data you want to
use in your activities as inputs or outputs.
Linked services
Linked services are much like connection strings, which define the connection information needed for Data Factory
to connect to external resources. Think of it this way: A linked service defines the connection to the data source, and
a dataset represents the structure of the data. For example, an Azure Storage linked service specifies the connection
string to connect to the Azure Storage account. And an Azure blob dataset specifies the blob container and the
folder that contains the data.
Linked services have two purposes in Data Factory:
To represent a data store that includes, but is not limited to, an on-premises SQL Server instance, an Oracle
database instance, a file share, or an Azure Blob storage account. For a list of supported data stores, see Copy
Activity in Azure Data Factory.
To represent a compute resource that can host the execution of an activity. For example, the HDInsight Hive
activity runs on an HDInsight Hadoop cluster. For a list of transformation activities and supported compute
environments, see Transform data in Azure Data Factory.
Triggers
Triggers represent units of processing that determine when a pipeline execution is kicked off. There are different
types of triggers for different types of events.
Pipeline runs
A pipeline run is an instance of a pipeline execution. You usually instantiate a pipeline run by passing arguments to
the parameters that are defined in the pipeline. You can pass the arguments manually or within the trigger
definition.
Parameters
Parameters are key-value pairs in a read-only configuration. You define parameters in a pipeline, and you pass the
arguments for the defined parameters during execution from a run context. The run context is created by a trigger
or from a pipeline that you execute manually. Activities within the pipeline consume the parameter values.
A dataset is a strongly typed parameter and an entity that you can reuse or reference. An activity can reference
datasets, and it can consume the properties that are defined in the dataset definition.
A linked service is also a strongly typed parameter that contains connection information to either a data store or a
compute environment. It's also an entity that you can reuse or reference.
Control flows
Control flows orchestrate pipeline activities that include chaining activities in a sequence, branching, parameters
that you define at the pipeline level, and arguments that you pass as you invoke the pipeline on demand or from a
trigger. Control flows also include custom state passing and looping containers (that is, foreach iterators).
For more information about Data Factory concepts, see the following articles:
Dataset and linked services
Pipelines and activities
Integration runtime

What is the pricing model for Data Factory?


For Azure Data Factory pricing details, see Data Factory pricing details.

How can I stay up-to-date with information about Data Factory?


For the most up-to-date information about Azure Data Factory, go to the following sites:
Blog
Documentation home page
Product home page

Technical deep dive


How can I schedule a pipeline?
You can use the scheduler trigger or time window trigger to schedule a pipeline. The trigger uses a wall-clock
calendar schedule, which can schedule pipelines periodically or in calendar-based recurrent patterns (for example,
on Mondays at 6:00 PM and Thursdays at 9:00 PM). For more information, see Pipeline execution and triggers.
Can I pass parameters to a pipeline run?
Yes, parameters are a first-class, top-level concept in Data Factory. You can define parameters at the pipeline level
and pass arguments as you execute the pipeline run on demand or by using a trigger.
Can I define default values for the pipeline parameters?
Yes. You can define default values for the parameters in the pipelines.
Can an activity in a pipeline consume arguments that are passed to a pipeline run?
Yes. Each activity within the pipeline can consume the parameter value that's passed to the pipeline and run with
the @parameter construct.
Can an activity output property be consumed in another activity?
Yes. An activity output can be consumed in a subsequent activity with the @activity construct.
How do I gracefully handle null values in an activity output?
You can use the @coalesce construct in the expressions to handle null values gracefully.
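For example, the following hypothetical expression falls back to a default path when a Lookup activity (named 'LookupConfig' purely for illustration) returns a null value for the referenced column:

@coalesce(activity('LookupConfig').output.firstRow.outputPath, '/defaultoutput/')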

Mapping data flows


Which Data Factory version do I use to create data flows?
Use the Data Factory V2 version to create data flows.
I was a previous private preview customer who used data flows, and I used the Data Factory V2 preview version
for data flows.
This version is now obsolete. Use Data Factory V2 for data flows.
What has changed from private preview to limited public preview in regard to data flows?
You will no longer have to bring your own Azure Databricks clusters. Data Factory will manage cluster creation and
tear-down. Blob datasets and Azure Data Lake Storage Gen2 datasets are separated into delimited text and Apache
Parquet datasets. You can still use Data Lake Storage Gen2 and Blob storage to store those files. Use the
appropriate linked service for those storage engines.
Can I migrate my private preview factories to Data Factory V2?
Yes. Follow the instructions.
I need help troubleshooting my data flow logic. What info do I need to provide to get help?
When Microsoft provides help or troubleshooting with data flows, please provide the DSL code plan. To do this,
follow these steps:
1. From the Data Flow Designer, select Code in the top-right corner. This will display the editable JSON code for
the data flow.
2. From the code view, select Plan on the top-right corner. This toggle will switch from JSON to the read-only
formatted DSL script plan.
3. Copy and paste this script or save it in a text file.
How do I access data by using the other 80 dataset types in Data Factory?
The Mapping Data Flow feature currently allows Azure SQL Database, Azure SQL Data Warehouse, delimited text
files from Azure Blob storage or Azure Data Lake Storage Gen2, and Parquet files from Blob storage or Data Lake
Storage Gen2 natively for source and sink.
Use the Copy activity to stage data from any of the other connectors, and then execute a Data Flow activity to
transform data after it's been staged. For example, your pipeline will first copy into Blob storage, and then a Data
Flow activity will use a dataset in source to transform that data.

Next steps
For step-by-step instructions to create a data factory, see the following tutorials:
Quickstart: Create a data factory
Tutorial: Copy data in the cloud
Azure Data Factory whitepapers
4/25/2019 • 2 minutes to read

Whitepapers allow you to explore Azure Data Factory at a deeper level. This article provides you with a list of
available whitepapers for Azure Data Factory.

Whitepaper: Azure Data Factory—Data Integration in the Cloud
Description: This paper describes how Azure Data Factory can enable you to build a modern data warehouse,
enable advanced analytics to drive intelligent SaaS applications, and lift your SQL Server Integration Services
packages to Azure.

Whitepaper: Data Migration from on-premise relational Data Warehouse to Azure using Azure Data Factory
Description: This paper addresses the complexity of migrating tens of TB of data from existing on-premises
relational data warehouses (for example, Netezza, Oracle, Teradata, SQL Server) to Azure (for example, Blob
Storage or Azure Data Lake Storage) using Azure Data Factory. The challenges and best practices are illustrated
around resilience, performance, scalability, management, and security for the big data ingestion journey to Azure
by Azure Data Factory.

Whitepaper: Azure Data Factory: SSIS in the Cloud
Description: This paper goes over why you would want to migrate your existing SSIS workloads to Azure Data
Factory and addresses common considerations and concerns. It then walks you through the technical details of
creating an Azure-SSIS IR and shows you how to upload, execute, and monitor your packages through Azure
Data Factory using the tools you are probably familiar with, like SQL Server Management Studio (SSMS).

You might also like