Data Factory, Data Integration
NOTE
This article applies to version 1 of Azure Data Factory. If you are using the current version of the Data Factory service, see
Introduction to Data Factory V2.
Azure Data Factory is the platform for these kinds of scenarios. It is a cloud-based data integration service that
allows you to create data-driven workflows in the cloud that orchestrate and automate data movement and data
transformation. Using Azure Data Factory, you can do the following tasks:
Create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data
stores.
Process or transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure
Data Lake Analytics, and Azure Machine Learning.
Publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI)
applications to consume.
It's more of an Extract-and-Load (EL) and Transform-and-Load (TL) platform than a traditional Extract-
Transform-and-Load (ETL) platform. The transformations process data by using compute services rather than by
adding derived columns, counting the number of rows, sorting data, and so on.
Currently, in Azure Data Factory, the data that workflows consume and produce is time-sliced data (hourly, daily,
weekly, and so on). For example, a pipeline might read input data, process data, and produce output data once a
day. You can also run a workflow just one time.
Key components
An Azure subscription can have one or more Azure Data Factory instances (or data factories). Azure Data Factory
is composed of four key components. These components work together to provide the platform on which you can
compose data-driven workflows with steps to move and transform data.
Pipeline
A data factory can have one or more pipelines. A pipeline is a group of activities. Together, the activities in a
pipeline perform a task.
For example, a pipeline can contain a group of activities that ingests data from an Azure blob, and then runs a Hive
query on an HDInsight cluster to partition the data. The benefit of this is that the pipeline allows you to manage
the activities as a set instead of each one individually. For example, you can deploy and schedule the pipeline,
instead of scheduling independent activities.
Activity
A pipeline can have one or more activities. Activities define the actions to perform on your data. For example, you
can use a copy activity to copy data from one data store to another data store. Similarly, you can use a Hive
activity, which runs a Hive query on an Azure HDInsight cluster to transform or analyze your data. Data Factory
supports two types of activities: data movement activities and data transformation activities.
Data movement activities
Copy Activity in Data Factory copies data from a source data store to a sink data store. Data from any source can
be written to any sink. Select a data store to learn how to copy data to and from that store. Data Factory supports
the following data stores:
CATEGORY | DATA STORE | SUPPORTED AS A SOURCE | SUPPORTED AS A SINK
Database | DB2* | ✓ |
Database | MySQL* | ✓ |
Database | Oracle* | ✓ | ✓
Database | PostgreSQL* | ✓ |
Database | SAP HANA* | ✓ |
Database | SQL Server* | ✓ | ✓
Database | Sybase* | ✓ |
Database | Teradata* | ✓ |
NoSQL | Cassandra* | ✓ |
NoSQL | MongoDB* | ✓ |
File | Amazon S3 | ✓ |
File | File System* | ✓ | ✓
File | FTP | ✓ |
File | HDFS* | ✓ |
File | SFTP | ✓ |
Generic | OData | ✓ |
Generic | ODBC* | ✓ |
Other | Salesforce | ✓ |
Data transformation activities
Data transformation activities include the Stored Procedure activity, which runs on Azure SQL, Azure SQL Data Warehouse, or SQL Server.
Supported regions
Currently, you can create data factories in the West US, East US, and North Europe regions. However, a data
factory can access data stores and compute services in other Azure regions to move data between data stores or
process data by using compute services.
Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the
movement of data between supported data stores. It also lets you process data by using compute services in other
regions or in an on-premises environment. It also allows you to monitor and manage workflows by using both
programmatic and UI mechanisms.
Data Factory is available only in the West US, East US, and North Europe regions. However, the service that powers
the data movement in Data Factory is available globally in several regions. If a data store is behind a firewall, then
a Data Management Gateway that's installed in your on-premises environment moves the data instead.
For example, assume that your compute environments, such as an Azure HDInsight cluster and Azure
Machine Learning, are located in the West Europe region. You can create and use an Azure Data Factory instance in
North Europe. Then you can use it to schedule jobs on your compute environments in West Europe. It takes a few
milliseconds for Data Factory to trigger the job on your compute environment, but the time for running the job on
your computing environment does not change.
Move data between two cloud data stores: Create a data factory with a pipeline that moves data from blob storage to a SQL database.
Transform data by using Hadoop cluster: Build your first Azure data factory with a data pipeline that processes data by running a Hive script on an Azure HDInsight (Hadoop) cluster.
Move data between an on-premises data store and a cloud data store by using Data Management Gateway: Build a data factory with a pipeline that moves data from an on-premises SQL Server database to an Azure blob. As part of the walkthrough, you install and configure the Data Management Gateway on your machine.
Introduction to Azure Data Factory
In the world of big data, raw, unorganized data is often stored in relational, non-relational, and other storage
systems. However, on its own, raw data doesn't have the proper context or meaning to provide meaningful
insights to analysts, data scientists, or business decision makers.
Big data requires a service that can orchestrate and operationalize processes to refine these enormous stores of
raw data into actionable business insights. Azure Data Factory is a managed cloud service that's built for
these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration
projects.
For example, imagine a gaming company that collects petabytes of game logs that are produced by games in
the cloud. The company wants to analyze these logs to gain insights into customer preferences,
demographics, and usage behavior. It also wants to identify up-sell and cross-sell opportunities, develop
compelling new features, drive business growth, and provide a better experience to its customers.
To analyze these logs, the company needs to use reference data such as customer information, game
information, and marketing campaign information that is in an on-premises data store. The company wants
to utilize this data from the on-premises data store, combining it with additional log data that it has in a cloud
data store.
To extract insights, it hopes to process the joined data by using a Spark cluster in the cloud (Azure HDInsight),
and publish the transformed data into a cloud data warehouse such as Azure SQL Data Warehouse to easily
build a report on top of it. They want to automate this workflow, and monitor and manage it on a daily
schedule. They also want to execute it when files land in a blob store container.
Azure Data Factory is the platform that solves such data scenarios. It is a cloud-based data integration service
that allows you to create data-driven workflows in the cloud for orchestrating and automating data
movement and data transformation. Using Azure Data Factory, you can create and schedule data-driven
workflows (called pipelines) that can ingest data from disparate data stores. It can process and transform the
data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and
Azure Machine Learning.
Additionally, you can publish output data to data stores such as Azure SQL Data Warehouse for business
intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can be organized
into meaningful data stores and data lakes for better business decisions.
How does it work?
The pipelines (data-driven workflows) in Azure Data Factory typically perform four steps: connect and collect, transform and enrich, publish, and monitor.
Top-level concepts
An Azure subscription might have one or more Azure Data Factory instances (or data factories). Azure Data
Factory is composed of four key components. These components work together to provide the platform on
which you can compose data-driven workflows with steps to move and transform data.
Pipeline
A data factory might have one or more pipelines. A pipeline is a logical grouping of activities that performs a
unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group
of activities that ingests data from an Azure blob, and then runs a Hive query on an HDInsight cluster to
partition the data.
The benefit of this is that the pipeline allows you to manage the activities as a set instead of managing each
one individually. The activities in a pipeline can be chained together to operate sequentially, or they can
operate independently in parallel.
Activity
Activities represent a processing step in a pipeline. For example, you might use a copy activity to copy data
from one data store to another data store. Similarly, you might use a Hive activity, which runs a Hive query on
an Azure HDInsight cluster, to transform or analyze your data. Data Factory supports three types of activities:
data movement activities, data transformation activities, and control activities.
Datasets
Datasets represent data structures within the data stores, which simply point to or reference the data you
want to use in your activities as inputs or outputs.
Linked services
Linked services are much like connection strings, which define the connection information that's needed for
Data Factory to connect to external resources. Think of it this way: a linked service defines the connection to
the data source, and a dataset represents the structure of the data. For example, an Azure Storage-linked
service specifies a connection string to connect to the Azure Storage account. Additionally, an Azure blob
dataset specifies the blob container and the folder that contains the data.
Linked services are used for two purposes in Data Factory:
To represent a data store that includes, but isn't limited to, an on-premises SQL Server database,
Oracle database, file share, or Azure blob storage account. For a list of supported data stores, see the
copy activity article.
To represent a compute resource that can host the execution of an activity. For example, the
HDInsightHive activity runs on an HDInsight Hadoop cluster. For a list of transformation activities and
supported compute environments, see the transform data article.
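To make the relationship concrete, the following PowerShell sketch (with hypothetical names and definition files) deploys a linked service and a dataset as separate Data Factory resources; the linked service JSON would hold the connection information, and the dataset JSON would describe the structure and location of the data:
# Deploy a linked service (connection info) and a dataset (data structure/location)
# from JSON definition files. Names and file paths here are placeholders.
Set-AzDataFactoryV2LinkedService -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" -Name "MyStorageLinkedService" `
    -DefinitionFile ".\MyStorageLinkedService.json"
Set-AzDataFactoryV2Dataset -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" -Name "MyBlobDataset" `
    -DefinitionFile ".\MyBlobDataset.json"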
Triggers
Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off.
There are different types of triggers for different types of events.
Pipeline runs
A pipeline run is an instance of the pipeline execution. Pipeline runs are typically instantiated by passing the
arguments to the parameters that are defined in pipelines. The arguments can be passed manually or within
the trigger definition.
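For example, a pipeline that defines inputPath and outputPath parameters could be run manually from PowerShell, passing the arguments as a hashtable. This is a sketch with hypothetical names; the cmdlet returns a run ID that you can use for monitoring:
# Start one run of the pipeline, supplying arguments for its parameters
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" -PipelineName "MyPipeline" `
    -Parameter @{ inputPath = "adftutorial/input"; outputPath = "adftutorial/output" }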
Parameters
Parameters are key-value pairs of read-only configuration. Parameters are defined in the pipeline. The
arguments for the defined parameters are passed during execution from the run context that was created by a
trigger or a pipeline that was executed manually. Activities within the pipeline consume the parameter values.
A dataset is a strongly typed parameter and a reusable/referenceable entity. An activity can reference datasets
and can consume the properties that are defined in the dataset definition.
A linked service is also a strongly typed parameter that contains the connection information to either a data
store or a compute environment. It is also a reusable/referenceable entity.
Control flow
Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching,
defining parameters at the pipeline level, and passing arguments while invoking the pipeline on-demand or
from a trigger. It also includes custom-state passing and looping containers, that is, For-each iterators.
For more information about Data Factory concepts, see the following articles:
Dataset and linked services
Pipelines and activities
Integration runtime
Supported regions
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on
the following page, and then expand Analytics to locate Data Factory: Products available by region.
However, a data factory can access data stores and compute services in other Azure regions to move data
between data stores or process data using compute services.
Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the
movement of data between supported data stores and the processing of data using compute services in other
regions or in an on-premises environment. It also allows you to monitor and manage workflows by using
both programmatic and UI mechanisms.
Although Data Factory is available only in certain regions, the service that powers the data movement in Data
Factory is available globally in several regions. If a data store is behind a firewall, then a Self-hosted
Integration Runtime that's installed in your on-premises environment moves the data instead.
For example, assume that your compute environments, such as an Azure HDInsight cluster and Azure
Machine Learning, are running in the West Europe region. You can create and use an Azure Data Factory
instance in East US or East US 2 and use it to schedule jobs on your compute environments in West Europe.
It takes a few milliseconds for Data Factory to trigger the job on your compute environment, but the time for
running the job on your computing environment does not change.
Accessibility
The Data Factory user experience in the Azure portal is accessible.
Compare with version 1
For a list of differences between version 1 and the current version of the Data Factory service, see Compare
with version 1.
Next steps
Get started with creating a Data Factory pipeline by using one of the following tools/SDKs:
Data Factory UI in the Azure portal
Copy Data tool in the Azure portal
PowerShell
.NET
Python
REST
Azure Resource Manager template
Compare Azure Data Factory with Data Factory
version 1
This article compares Data Factory with Data Factory version 1. For an introduction to Data Factory, see
Introduction to Data Factory. For an introduction to Data Factory version 1, see Introduction to Azure Data Factory.
Feature comparison
The following table compares the features of Data Factory with the features of Data Factory version 1.
Datasets
Data Factory V1: A named view of data that references the data that you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Azure Blob storage from which the activity should read the data.
Current version: Datasets are the same in the current version. However, you do not need to define availability schedules for datasets. You can define a trigger resource that can schedule pipelines from a clock scheduler paradigm. For more information, see Triggers and Datasets.

Linked services
Data Factory V1: Linked services are much like connection strings, which define the connection information that's necessary for Data Factory to connect to external resources.
Current version: Linked services are the same as in Data Factory V1, but with a new connectVia property to utilize the Integration Runtime compute environment of the current version of Data Factory. For more information, see Integration runtime in Azure Data Factory and Linked service properties for Azure Blob storage.

Pipelines
Data Factory V1: A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. You use startTime, endTime, and isPaused to schedule and run pipelines.
Current version: Pipelines are groups of activities that are performed on data. However, the scheduling of activities in the pipeline has been separated into new trigger resources. You can think of pipelines in the current version of Data Factory more as "workflow units" that you schedule separately via triggers.

Activities
Data Factory V1: Activities define actions to perform on your data within a pipeline. Data movement (copy activity) and data transformation activities (such as Hive, Pig, and MapReduce) are supported.
Current version: Activities are still defined actions within a pipeline. The current version of Data Factory introduces new control flow activities, which you use in a control flow (looping and branching). Data movement and data transformation activities that were supported in V1 are supported in the current version. You can define transformation activities without using datasets in the current version.

Hybrid data movement and activity dispatch
Data Factory V1: Data Management Gateway (now called Integration Runtime) supported moving data between on-premises and cloud.
Current version: Data Management Gateway is now called Self-Hosted Integration Runtime. It provides the same capability as it did in V1.

Expressions
Data Factory V1: Data Factory V1 allows you to use functions and system variables in data selection queries and activity/dataset properties.
Current version: In the current version of Data Factory, you can use expressions anywhere in a JSON string value. For more information, see Expressions and functions in the current version of Data Factory.
The following sections provide more information about the capabilities of the current version.
Control flow
To support diverse integration flows and patterns in the modern data warehouse, the current version of Data
Factory has enabled a new flexible data pipeline model that is no longer tied to time-series data. A few common
flows that were previously not possible are now enabled. They are described in the following sections.
Chaining activities
In V1, you had to configure the output of an activity as an input of another activity to chain them. In the current
version, you can chain activities in a sequence within a pipeline. You can use the dependsOn property in an
activity definition to chain it with an upstream activity. For more information and an example, see Pipelines and
activities and Branching and chaining activities.
Branching activities
In the current version, you can branch activities within a pipeline. The If-condition activity provides the same
functionality that an if statement provides in programming languages. It evaluates a set of activities when the
condition evaluates to true and another set of activities when the condition evaluates to false. For examples of
branching activities, see the Branching and chaining activities tutorial.
Parameters
You can define parameters at the pipeline level and pass arguments while you're invoking the pipeline on-demand
or from a trigger. Activities can consume the arguments that are passed to the pipeline. For more information, see
Pipelines and triggers.
Custom state passing
Activity outputs including state can be consumed by a subsequent activity in the pipeline. For example, in the
JSON definition of an activity, you can access the output of the previous activity by using the following syntax:
@activity('NameofPreviousActivity').output.value . By using this feature, you can build workflows where values
can pass through activities.
Looping containers
The ForEach activity defines a repeating control flow in your pipeline. This activity iterates over a collection and
runs specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping
structure in programming languages.
The Until activity provides the same functionality that a do-until looping structure provides in programming
languages. It runs a set of activities in a loop until the condition that's associated with the activity evaluates to
true. You can specify a timeout value for the Until activity in Data Factory.
Trigger-based flows
Pipelines can be triggered on demand, by events (for example, when a blob is posted to a container), or on a wall-clock schedule. The pipelines and triggers
article has detailed information about triggers.
Invoking a pipeline from another pipeline
The Execute Pipeline activity allows a Data Factory pipeline to invoke another pipeline.
Delta flows
A key use case in ETL patterns is “delta loads,” in which only data that has changed since the last iteration of a
pipeline is loaded. New capabilities in the current version, such as lookup activity, flexible scheduling, and control
flow, enable this use case in a natural way. For a tutorial with step-by-step instructions, see Tutorial: Incremental
copy.
Other control flow activities
Following are a few more control flow activities that are supported by the current version of Data Factory.
ForEach activity: Defines a repeating control flow in your pipeline. This activity is used to iterate over a collection and runs the specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping structure in programming languages.
Web activity: Calls a custom REST endpoint from a Data Factory pipeline. You can pass datasets and linked services to be consumed and accessed by the activity.
Lookup activity: Reads or looks up a record or table name value from any external source. This output can further be referenced by succeeding activities.
Get metadata activity: Retrieves the metadata of any data in Azure Data Factory.
Flexible scheduling
In the current version of Data Factory, you do not need to define dataset availability schedules. You can define a
trigger resource that can schedule pipelines from a clock scheduler paradigm. You can also pass parameters to
pipelines from a trigger for a flexible scheduling and execution model.
Pipelines do not have “windows” of time execution in the current version of Data Factory. The Data Factory V1
concepts of startTime, endTime, and isPaused don't exist in the current version of Data Factory. For more
information about how to build and then schedule a pipeline in the current version of Data Factory, see Pipeline
execution and triggers.
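As a sketch of how this looks in practice (hypothetical names; the JSON definition file would describe the schedule and the pipelines to run), a trigger is deployed and started as its own resource, separate from the pipelines it schedules:
# Deploy a trigger from a JSON definition, then activate it
Set-AzDataFactoryV2Trigger -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" -Name "MyScheduleTrigger" `
    -DefinitionFile ".\MyScheduleTrigger.json"
Start-AzDataFactoryV2Trigger -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" -Name "MyScheduleTrigger"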
Custom activities
In V1, you implement (custom) DotNet activity code by creating a .NET class library project with a class that
implements the Execute method of the IDotNetActivity interface. Therefore, you need to write your custom code in
.NET Framework 4.5.2 and run it on Windows-based Azure Batch Pool nodes.
In a custom activity in the current version, you don't have to implement a .NET interface. You can directly run
commands, scripts, and your own custom code compiled as an executable.
For more information, see Difference between custom activity in Data Factory and version 1.
SDKs
The current version of Data Factory provides a richer set of SDKs that can be used to author, manage, and monitor
pipelines.
.NET SDK: The .NET SDK is updated in the current version.
PowerShell: The PowerShell cmdlets are updated in the current version. The cmdlets for the current
version have DataFactoryV2 in the name, for example: Get-AzDataFactoryV2.
Python SDK: This SDK is new in the current version.
REST API: The REST API is updated in the current version.
The SDKs that are updated in the current version are not backward-compatible with V1 clients.
Authoring experience
Monitoring experience
In the current version, you can also monitor data factories by using Azure Monitor. The new PowerShell cmdlets
support monitoring of integration runtimes. Both V1 and V2 support visual monitoring via a monitoring
application that can be launched from the Azure portal.
Next steps
Learn how to create a data factory by following step-by-step instructions in the following quickstarts: PowerShell,
.NET, Python, REST API.
Quickstart: Create a data factory by using the
Azure Data Factory UI
This quickstart describes how to use the Azure Data Factory UI to create and monitor a data factory.
The pipeline that you create in this data factory copies data from one folder to another folder in Azure
Blob storage. For a tutorial on how to transform data by using Azure Data Factory, see Tutorial:
Transform data by using Spark.
NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory before doing this quickstart.
Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member
of the contributor or owner role, or an administrator of the Azure subscription. To view the permissions
that you have in the subscription, in the Azure portal, select your username in the upper-right corner,
and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account,
see Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.
3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.
5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a
folder named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.
7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not
already exist.
John, Doe
Jane, Doe
8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the
Files box.
9. Enter input as a value for the Upload to folder box.
10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.
12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
Video
Watching this video helps you understand the Data Factory UI:
5. For Subscription, select your Azure subscription in which you want to create the data factory.
6. For Resource Group, use one of the following steps:
Select Use existing, and select an existing resource group from the list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. For Version, select V2.
8. For Location, select the location for the data factory.
The list shows only locations that Data Factory supports, and where your Azure Data Factory
metadata will be stored. Note that the associated data stores (like Azure Storage and
Azure SQL Database) and computes (like Azure HDInsight) that Data Factory uses can run in
other regions.
9. Select Create.
10. After the creation is complete, you see the Data Factory page. Select the Author & Monitor
tile to start the Azure Data Factory user interface (UI) application on a separate tab.
11. On the Let's get started page, switch to the Author tab in the left panel.
Create a linked service
In this procedure, you create a linked service to link your Azure storage account to the data factory. The
linked service has the connection information that the Data Factory service uses at runtime to connect
to it.
1. Select Connections, and then select the New button on the toolbar.
2. On the New Linked Service page, select Azure Blob Storage, and then select Continue.
3. Complete the following steps:
a. For Name, enter AzureStorageLinkedService.
b. For Storage account name, select the name of your Azure storage account.
c. Select Test connection to confirm that the Data Factory service can connect to the storage
account.
d. Select Finish to save the linked service.
Create datasets
In this procedure, you create two datasets: InputDataset and OutputDataset. These datasets are of
type AzureBlob. They refer to the Azure Storage linked service that you created in the previous
section.
The input dataset represents the source data in the input folder. In the input dataset definition, you
specify the blob container (adftutorial), the folder (input), and the file (emp.txt) that contains the
source data.
The output dataset represents the data that's copied to the destination. In the output dataset definition,
you specify the blob container (adftutorial), the folder (output), and the file to which the data is
copied. Each run of a pipeline has a unique ID associated with it. You can access this ID by using the
system variable RunId. The name of the output file is dynamically evaluated based on the run ID of the
pipeline.
In the linked service settings, you specified the Azure storage account that contains the source data. In
the source dataset settings, you specify where exactly the source data resides (blob container, folder,
and file). In the sink dataset settings, you specify where the data is copied to (blob container, folder, and
file).
1. Select the + (plus) button, and then select Dataset.
2. On the New Dataset page, select Azure Blob Storage, and then select Finish.
3. In the General tab for the dataset, enter InputDataset for Name.
4. Switch to the Connection tab and complete the following steps:
a. For Linked service, select AzureStorageLinkedService.
b. For File path, select the Browse button.
c. In the Choose a file or folder window, browse to the input folder in the adftutorial
container, select the emp.txt file, and then select Finish.
d. (optional) Select Preview data to preview the data in the emp.txt file.
5. Repeat the steps to create the output dataset:
a. Select the + (plus) button, and then select Dataset.
b. On the New Dataset page, select Azure Blob Storage, and then select Finish.
c. On the General tab, enter OutputDataset for Name.
d. On the Connection tab, select AzureStorageLinkedService as the linked service, and enter
adftutorial/output for the folder in the directory field. If the output folder does not exist, the
copy activity creates it at runtime.
Create a pipeline
In this procedure, you create and validate a pipeline with a copy activity that uses the input and output
datasets. The copy activity copies data from the file you specified in the input dataset settings to the file
you specified in the output dataset settings. If the input dataset specifies only a folder (not the file
name), the copy activity copies all the files in the source folder to the destination.
1. Select the + (plus) button, and then select Pipeline.
2. To trigger the pipeline manually, select Trigger on the pipeline toolbar, and then select Trigger
Now.
3. To view details about the copy operation, select the Details (eyeglasses image) link in the
Actions column. For details about the properties, see Copy Activity overview.
5. On the New Trigger page, select the Activated check box, and then select Next.
6. Review the warning message, and select Finish.
10. Confirm that an output file is created for every pipeline run until the specified end date and time
in the output folder.
Next steps
The pipeline in this sample copies data from one location to another location in Azure Blob storage. To
learn about using Data Factory in more scenarios, go through the tutorials.
Quickstart: Use the Copy Data tool to copy
data
In this quickstart, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to
create a pipeline that copies data from a folder in Azure Blob storage to another folder.
NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory before doing this quickstart.
Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of
the contributor or owner role, or an administrator of the Azure subscription. To view the permissions
that you have in the subscription, in the Azure portal, select your username in the upper-right corner,
and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account, see
Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.
3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.
5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a folder
named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.
7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not already
exist.
John, Doe
Jane, Doe
8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the Files
box.
9. Enter input as a value for the Upload to folder box.
10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.
12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
Create a data factory
1. Select New on the left menu, select Data + Analytics, and then select Data Factory.
3. For Subscription, select your Azure subscription in which you want to create the data factory.
4. For Resource Group, use one of the following steps:
Select Use existing, and select an existing resource group from the list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
5. For Version, select V2.
6. For Location, select the location for the data factory.
The list shows only locations that Data Factory supports, and where your Azure Data Factory
metadata will be stored. Note that the associated data stores (like Azure Storage and
Azure SQL Database) and computes (like Azure HDInsight) that Data Factory uses can run in
other regions.
7. Select Create.
8. After the creation is complete, you see the Data Factory page. Select the Author & Monitor tile
to start the Azure Data Factory user interface (UI) application on a separate tab.
c. On the Specify the Azure Blob storage account page, select your storage account from the
Storage account name list, and then select Finish.
d. Select the newly created linked service as source, then click Next.
4. On the Choose the input file or folder page, complete the following steps:
a. Click Browse to navigate to the adftutorial/input folder, select the emp.txt file, then click
Choose.
d. Check the Binary copy option to copy the file as-is, and then select Next.
5. On the Destination data store page, select the Azure Blob Storage linked service you just
created, and then select Next.
6. On the Choose the output file or folder page, enter adftutorial/output for the folder path,
then select Next.
10. The application switches to the Monitor tab. You see the status of the pipeline on this tab. Select
Refresh to refresh the list.
11. Select the View Activity Runs link in the Actions column. The pipeline has only one activity of
type Copy.
12. To view details about the copy operation, select the Details (eyeglasses image) link in the
Actions column. For details about the properties, see Copy Activity overview.
13. Verify that the emp.txt file is created in the output folder of the adftutorial container. If the
output folder does not exist, the Data Factory service automatically creates it.
14. Switch to the Author tab above the Monitor tab on the left panel so that you can edit linked
services, datasets, and pipelines. To learn about editing them in the Data Factory UI, see Create a
data factory by using the Azure portal.
Next steps
The pipeline in this sample copies data from one location to another location in Azure Blob storage. To
learn about using Data Factory in more scenarios, go through the tutorials.
Quickstart: Create an Azure data factory using
PowerShell
This quickstart describes how to use PowerShell to create an Azure data factory. The pipeline you
create in this data factory copies data from one folder to another folder in an Azure blob storage. For
a tutorial on how to transform data using Azure Data Factory, see Tutorial: Transform data using
Spark.
NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the
Azure Data Factory service, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member
of the contributor or owner role, or an administrator of the Azure subscription. To view the
permissions that you have in the subscription, in the Azure portal, select your username in the upper-
right corner, and then select Permissions. If you have access to multiple subscriptions, select the
appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account,
see Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.
3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.
5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a
folder named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.
2. On the Blob service page, select + Container on the toolbar.
3. In the New container dialog box, enter adftutorial for the name, and then select OK.
7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not
already exist.
John, Doe
Jane, Doe
8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the
Files box.
9. Enter input as a value for the Upload to folder box.
10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.
12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM
module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az
module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module
installation instructions, see Install Azure PowerShell.
Install the latest Azure PowerShell modules by following instructions in How to install and configure
Azure PowerShell.
Log in to PowerShell
1. Launch PowerShell on your machine. Keep PowerShell open until the end of this quickstart. If
you close and reopen, you need to run these commands again.
2. Run the following command, and enter the same Azure user name and password that you use
to sign in to the Azure portal:
Connect-AzAccount
3. Run the following command to view all the subscriptions for this account:
Get-AzSubscription
4. If you see multiple subscriptions associated with your account, run the following command to
select the subscription that you want to work with. Replace SubscriptionId with the ID of your
Azure subscription:
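One way to do this is with Set-AzContext (a sketch; substitute your own subscription ID):
Set-AzContext -SubscriptionId "<SubscriptionId>"
Next, define a variable for the resource group name that you use in later commands: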
$resourceGroupName = "ADFQuickStartRG";
If the resource group already exists, you may not want to overwrite it. Assign a different value
to the $ResourceGroupName variable and run the command again
2. To create the Azure resource group, run the following command:
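A typical form of that command, shown here as a sketch that captures the result in $ResGrp for later steps and uses East US as an example location, is:
$ResGrp = New-AzResourceGroup $resourceGroupName -Location "East US"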
If the resource group already exists, you may not want to overwrite it. Assign a different value
to the $ResourceGroupName variable and run the command again.
3. Define a variable for the data factory name.
IMPORTANT
Update the data factory name to be globally unique. For example, ADFTutorialFactorySP1127.
$dataFactoryName = "ADFQuickStartFactory";
4. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet, using the Location
and ResourceGroupName property from the $ResGrp variable:
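A sketch of that call, capturing the new factory in the $DataFactory variable used by later commands:
$DataFactory = Set-AzDataFactoryV2 -ResourceGroupName $ResGrp.ResourceGroupName `
    -Location $ResGrp.Location -Name $dataFactoryName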
To create Data Factory instances, the user account you use to log in to Azure must be a
member of contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory:
Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.
IMPORTANT
Replace <accountName> and <accountKey> with name and key of your Azure storage account before
saving the file.
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=
<accountName>;AccountKey=<accountKey>;EndpointSuffix=core.windows.net",
"type": "SecureString"
}
}
}
}
If you are using Notepad, select All files for the Save as type field in the Save As dialog box.
Otherwise, Notepad may add the .txt extension to the file, for example,
AzureStorageLinkedService.json.txt. If you create the file in File Explorer before opening it in
Notepad, you may not see the .txt extension, because the Hide extensions for known file
types option is set by default. Remove the .txt extension before proceeding to the next step.
2. In PowerShell, switch to the ADFv2QuickStartPSH folder.
Set-Location 'C:\ADFv2QuickStartPSH'
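3. To deploy the linked service from the JSON definition, run the Set-AzDataFactoryV2LinkedService cmdlet. The following call is a sketch that matches the sample output shown next:
$AzStorageLinkedService = Set-AzDataFactoryV2LinkedService `
    -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -Name "AzureStorageLinkedService" `
    -DefinitionFile ".\AzureStorageLinkedService.json"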
LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService
Create a dataset
In this step, you define a dataset that represents the data to copy from a source to a sink. The dataset
is of type AzureBlob. It refers to the Azure Storage linked service you created in the previous step.
It takes a parameter to construct the folderPath property. For an input dataset, the copy activity in
the pipeline passes the input path as a value for this parameter. Similarly, for an output dataset, the
copy activity passes the output path as a value for this parameter.
1. Create a JSON file named BlobDataset.json in the C:\ADFv2QuickStartPSH folder, with
the following content:
{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "@{dataset().path}"
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}
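2. To deploy the dataset, run the Set-AzDataFactoryV2Dataset cmdlet. The following call is a sketch that matches the sample output shown next:
$DFDataset = Set-AzDataFactoryV2Dataset `
    -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -Name "BlobDataset" `
    -DefinitionFile ".\BlobDataset.json"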
DatasetName : BlobDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset
Create a pipeline
In this quickstart, you create a pipeline with one activity that takes two parameters - input blob path
and output blob path. The values for these parameters are set when the pipeline is triggered/run. The
copy activity uses the same blob dataset created in the previous step as input and output. When the
dataset is used as an input dataset, input path is specified. And, when the dataset is used as an output
dataset, the output path is specified.
1. Create a JSON file named Adfv2QuickStartPipeline.json in the C:\ADFv2QuickStartPSH
folder with the following content:
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
}
}
}
}
$DFPipeLine = Set-AzDataFactoryV2Pipeline `
-DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName `
-Name "Adfv2QuickStartPipeline" `
-DefinitionFile ".\Adfv2QuickStartPipeline.json"
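The Invoke-AzDataFactoryV2Pipeline call in the next step reads its arguments from a PipelineParameters.json file. A sketch of how to create that file, using the input and output blob paths from this quickstart:
# Write the parameter file that the pipeline run will use
@'
{
    "inputPath": "adftutorial/input",
    "outputPath": "adftutorial/output"
}
'@ | Set-Content -Path '.\PipelineParameters.json'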
2. Run the Invoke-AzDataFactoryV2Pipeline cmdlet to create a pipeline run and pass in the
parameter values. The cmdlet returns the pipeline run ID for future monitoring.
$RunId = Invoke-AzDataFactoryV2Pipeline `
-DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName `
-PipelineName $DFPipeLine.Name `
-ParameterFile .\PipelineParameters.json
while ($True) {
$Run = Get-AzDataFactoryV2PipelineRun `
-ResourceGroupName $ResGrp.ResourceGroupName `
-DataFactoryName $DataFactory.DataFactoryName `
-PipelineRunId $RunId
if ($Run) {
if ($run.Status -ne 'InProgress') {
Write-Output ("Pipeline run finished. The status is: " + $Run.Status)
$Run
break
}
Write-Output "Pipeline is running...status: InProgress"
}
Start-Sleep -Seconds 10
}
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : SPTestFactory0928
RunId : 0000000000-0000-0000-0000-0000000000000
PipelineName : Adfv2QuickStartPipeline
LastUpdated : 9/28/2017 8:28:38 PM
Parameters : {[inputPath, adftutorial/input], [outputPath, adftutorial/output]}
RunStart : 9/28/2017 8:28:14 PM
RunEnd : 9/28/2017 8:28:38 PM
DurationInMs : 24151
Status : Succeeded
Message :
"connectionString": {
"value":
"DefaultEndpointsProtocol=https;AccountName=mystorageaccountname;AccountKey=mystorag
eaccountkey;EndpointSuffix=core.windows.net",
"type": "SecureString"
}
e. Recreate the linked service by following steps in the Create a linked service section.
f. Rerun the pipeline by following steps in the Create a pipeline run section.
g. Run the current monitoring command again to monitor the new pipeline run.
2. Run the following script to retrieve copy activity run details, for example, size of the data
read/written.
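A sketch of such a script, using the Get-AzDataFactoryV2ActivityRun cmdlet with an example time window around the run:
$Result = Get-AzDataFactoryV2ActivityRun `
    -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -PipelineRunId $RunId `
    -RunStartedAfter (Get-Date).AddMinutes(-30) `
    -RunStartedBefore (Get-Date).AddMinutes(30)
$Result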
3. Confirm that you see the output similar to the following sample output of activity run result:
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : SPTestFactory0928
ActivityName : CopyFromBlobToBlob
PipelineRunId : 00000000000-0000-0000-0000-000000000000
PipelineName : Adfv2QuickStartPipeline
Input : {source, sink}
Output : {dataRead, dataWritten, copyDuration, throughput...}
LinkedServiceName :
ActivityRunStart : 9/28/2017 8:28:18 PM
ActivityRunEnd : 9/28/2017 8:28:36 PM
DurationInMs : 18095
Status : Succeeded
Error : {errorCode, message, failureType, target}
Note: Deleting a resource group may take some time. Please be patient with the process.
If you want to delete just the data factory, not the entire resource group, run the following command:
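A sketch of that command, using the variables defined earlier in this quickstart:
Remove-AzDataFactoryV2 -Name $dataFactoryName -ResourceGroupName $resourceGroupName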
Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob
storage. Go through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create a data factory and pipeline
using .NET SDK
This quickstart describes how to use .NET SDK to create an Azure data factory. The pipeline you
create in this data factory copies data from one folder to another folder in an Azure blob storage. For
a tutorial on how to transform data using Azure Data Factory, see Tutorial: Transform data using
Spark.
NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the
Azure Data Factory service, see Introduction to Azure Data Factory.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member
of the contributor or owner role, or an administrator of the Azure subscription. To view the
permissions that you have in the subscription, in the Azure portal, select your username in the upper-
right corner, and then select Permissions. If you have access to multiple subscriptions, select the
appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account,
see Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and
password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.
3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.
5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a
folder named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.
2. On the Blob service page, select + Container on the toolbar.
3. In the New container dialog box, enter adftutorial for the name, and then select OK.
John, Doe
Jane, Doe
8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the
Files box.
9. Enter input as a value for the Upload to folder box.
10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.
12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
Visual Studio
The walkthrough in this article uses Visual Studio 2017. You can also use Visual Studio 2013 or 2015.
Azure .NET SDK
Download and install Azure .NET SDK on your machine.
Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
2. Add the following code to the Main method that sets the variables. Replace the placeholders
with your own values. For a list of Azure regions in which Data Factory is currently available,
select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory: Products available by region. The data stores (Azure Storage, Azure SQL
Database, etc.) and computes (HDInsight, etc.) used by data factory can be in other regions.
// Set variables
string tenantID = "<your tenant ID>";
string applicationId = "<your application ID>";
string authenticationKey = "<your authentication key for the application>";
string subscriptionId = "<your subscription ID where the data factory resides>";
string resourceGroup = "<your resource group where the data factory resides>";
string region = "East US 2";
string dataFactoryName = "<specify the name of data factory to create. It must be globally
unique.>";
string storageAccount = "<your storage account name to copy data>";
string storageKey = "<your storage account key>";
// specify the container and input folder from which all files need to be copied to the
output folder.
string inputBlobPath = "<the path to existing blob(s) to copy data from, e.g.
containername/foldername>";
// Specify the container and output folder to which the files are copied.
string outputBlobPath = "<the blob path to copy data to, e.g. containername/foldername>";
3. Add the following code to the Main method that creates an instance of
DataFactoryManagementClient class. You use this object to create a data factory, a linked
service, datasets, and a pipeline. You also use this object to monitor the pipeline run details.
// Authenticate and create a data factory management client
var context = new AuthenticationContext("https://login.windows.net/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync("https://management.azure.com/",
cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) { SubscriptionId = subscriptionId };
Create a dataset
Add the following code to the Main method that creates an Azure blob dataset.
You define a dataset that represents the data to copy from a source to a sink. In this example, this Blob dataset refers to the Azure Storage linked service you created in the previous step. The dataset takes a parameter whose value is set in an activity that consumes the dataset. The parameter is used to construct the "folderPath" value that points to where the data resides.
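A minimal sketch of this dataset definition, completed by the closing braces and the CreateOrUpdate call that follow, might look like the code below. It assumes the linked service created in the previous step is referenced by a variable named storageLinkedServiceName; that variable name and the console message are assumptions, since only blobDatasetName appears in the surrounding code.
Console.WriteLine("Creating dataset " + blobDatasetName + "...");
DatasetResource blobDataset = new DatasetResource(
    new AzureBlobDataset
    {
        // Reference the Azure Storage linked service created in the previous step.
        LinkedServiceName = new LinkedServiceReference { ReferenceName = storageLinkedServiceName },
        // The folder path is an expression resolved from the dataset's "path" parameter at run time.
        FolderPath = new Expression { Value = "@{dataset().path}" },
        Parameters = new Dictionary<string, ParameterSpecification>
        {
            { "path", new ParameterSpecification { Type = ParameterType.String } }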
}
}
);
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobDatasetName, blobDataset);
Console.WriteLine(SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings));
Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity.
In this example, this pipeline contains one activity and takes two parameters - input blob path and
output blob path. The values for these parameters are set when the pipeline is triggered/run. The
copy activity refers to the same blob dataset created in the previous step as both input and output. When the dataset is used as an input dataset, the input path is specified; when it is used as an output dataset, the output path is specified.
// Create a pipeline with a copy activity
Console.WriteLine("Creating pipeline " + pipelineName + "...");
PipelineResource pipeline = new PipelineResource
{
Parameters = new Dictionary<string, ParameterSpecification>
{
{ "inputPath", new ParameterSpecification { Type = ParameterType.String } },
{ "outputPath", new ParameterSpecification { Type = ParameterType.String } }
},
Activities = new List<Activity>
{
new CopyActivity
{
Name = "CopyFromBlobToBlob",
Inputs = new List<DatasetReference>
{
new DatasetReference()
{
ReferenceName = blobDatasetName,
Parameters = new Dictionary<string, object>
{
{ "path", "@pipeline().parameters.inputPath" }
}
}
},
Outputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobDatasetName,
Parameters = new Dictionary<string, object>
{
{ "path", "@pipeline().parameters.outputPath" }
}
}
},
Source = new BlobSource { },
Sink = new BlobSink { }
}
}
};
client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName, pipeline);
Console.WriteLine(SafeJsonConvert.SerializeObject(pipeline, client.SerializationSettings));
2. Add the following code to the Main method that retrieves copy activity run details, for
example, size of the data read/written.
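The monitoring code itself is not shown here. A minimal sketch, assuming the pipeline run ID from the (not shown) run-creation step is stored in a variable named runId and that a recent version of the Microsoft.Azure.Management.DataFactory SDK is used, might look like this:
// Check the copy activity run details (sketch; runId is assumed to come from the run-creation step)
Console.WriteLine("Checking copy activity run details...");
ActivityRunsQueryResponse queryResponse = client.ActivityRuns.QueryByPipelineRun(
    resourceGroup, dataFactoryName, runId,
    new RunFilterParameters(DateTime.UtcNow.AddMinutes(-10), DateTime.UtcNow.AddMinutes(10)));
ActivityRun activityRun = queryResponse.Value.First();
if (activityRun.Status == "Succeeded")
    Console.WriteLine(activityRun.Output);   // includes dataRead, dataWritten, and copyDuration
else
    Console.WriteLine(activityRun.Error);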
Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob
storage. Go through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create a data factory and pipeline
using Python
3/6/2019
Azure Data Factory is a cloud-based data integration service that allows you to create data-driven
workflows in the cloud for orchestrating and automating data movement and data transformation.
Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that
can ingest data from disparate data stores, process/transform the data by using compute services such
as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning, and
publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI)
applications to consume.
This quickstart describes how to use Python to create an Azure data factory. The pipeline in this data
factory copies data from one folder to another folder in an Azure blob storage.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure Storage account. You use the blob storage as source and sink data store. If you don't have
an Azure storage account, see the Create a storage account article for steps to create one.
Create an application in Azure Active Directory by following these instructions. Make note of the following values that you use in later steps: application ID, authentication key, and tenant ID. Assign the application to the Contributor role by following the instructions in the same article.
Create and upload an input file
1. Launch Notepad. Copy the following text and save it as input.txt file on your disk.
John|Doe
Jane|Doe
2. Use tools such as Azure Storage Explorer to create the adfv2tutorial container, and input folder
in the container. Then, upload the input.txt file to the input folder.
3. To install the Python package for Data Factory, run the following command:
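The command below assumes the azure-mgmt-datafactory package from PyPI:
pip install azure-mgmt-datafactory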
The Python SDK for Data Factory supports Python 2.7, 3.3, 3.4, 3.5, 3.6 and 3.7.
Create a data factory client
1. Create a file named datafactory.py. Add the following statements to add references to
namespaces.
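The import statements themselves are not reproduced here. A minimal set, assuming the azure-common, azure-mgmt-resource, and azure-mgmt-datafactory packages are installed, might look like this:
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import *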
2. Add the following functions that print information:
def print_item(group):
"""Print an Azure object instance."""
print("\tName: {}".format(group.name))
print("\tId: {}".format(group.id))
if hasattr(group, 'location'):
print("\tLocation: {}".format(group.location))
if hasattr(group, 'tags'):
print("\tTags: {}".format(group.tags))
if hasattr(group, 'properties'):
print_properties(group.properties)
def print_properties(props):
"""Print a ResourceGroup properties instance."""
if props and hasattr(props, 'provisioning_state') and props.provisioning_state:
print("\tProperties:")
print("\t\tProvisioning State: {}".format(props.provisioning_state))
print("\n\n")
def print_activity_run_details(activity_run):
"""Print activity run details."""
print("\n\tActivity run details\n")
print("\tActivity run status: {}".format(activity_run.status))
if activity_run.status == 'Succeeded':
print("\tNumber of bytes read: {}".format(activity_run.output['dataRead']))
print("\tNumber of bytes written: {}".format(activity_run.output['dataWritten']))
print("\tCopy duration: {}".format(activity_run.output['copyDuration']))
else:
print("\tErrors: {}".format(activity_run.error['message']))
3. Add the following code to the Main method that creates an instance of
DataFactoryManagementClient class. You use this object to create the data factory, linked
service, datasets, and pipeline. You also use this object to monitor the pipeline run details. Set
subscription_id variable to the ID of your Azure subscription. For a list of Azure regions in
which Data Factory is currently available, select the regions that interest you on the following
page, and then expand Analytics to locate Data Factory: Products available by region. The data
stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
def main():
# Azure subscription ID
subscription_id = '<Specify your Azure Subscription ID>'
# This program creates this resource group. If it's an existing resource group, comment
out the code that creates the resource group
rg_name = 'ADFTutorialResourceGroup'
# Specify your Active Directory client ID, client secret, and tenant ID
credentials = ServicePrincipalCredentials(client_id='<Active Directory application/client
ID>', secret='<client secret>', tenant='<Active Directory tenant ID>')
resource_client = ResourceManagementClient(credentials, subscription_id)
adf_client = DataFactoryManagementClient(credentials, subscription_id)
rg_params = {'location':'eastus'}
df_params = {'location':'eastus'}
# IMPORTANT: specify the name and key of your Azure Storage account.
storage_string = SecureString('DefaultEndpointsProtocol=https;AccountName=
<storageaccountname>;AccountKey=<storageaccountkey>')
ls_azure_storage = AzureStorageLinkedService(connection_string=storage_string)
ls = adf_client.linked_services.create_or_update(rg_name, df_name, ls_name, ls_azure_storage)
print_item(ls)
Create datasets
In this section, you create two datasets: one for the source and the other for the sink.
Create a dataset for source Azure Blob
Add the following code to the Main method that creates an Azure blob dataset. For information about
properties of Azure Blob dataset, see Azure blob connector article.
You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure Storage linked service you created in the previous step.
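A minimal sketch of this step follows. It assumes the adfv2tutorial container and input folder created earlier, the dataset name ds_in that appears in the sample output later in this quickstart, and a linked service name variable ls_name set in the previous step:
# Create a dataset for the source blob (sketch)
ds_name = 'ds_in'
ds_ls = LinkedServiceReference(reference_name=ls_name)
blob_path = 'adfv2tutorial/input'
blob_filename = 'input.txt'
ds_azure_blob = AzureBlobDataset(linked_service_name=ds_ls, folder_path=blob_path, file_name=blob_filename)
ds = adf_client.datasets.create_or_update(rg_name, df_name, ds_name, ds_azure_blob)
print_item(ds)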
Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity.
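A minimal sketch of the pipeline definition follows. It assumes the dataset names ds_in and ds_out that appear in the sample output later in this quickstart (the sink dataset step is not shown here):
# Create a copy activity that copies from the source dataset to the sink dataset (sketch)
blob_source = BlobSource()
blob_sink = BlobSink()
dsin_ref = DatasetReference(reference_name='ds_in')
dsout_ref = DatasetReference(reference_name='ds_out')
copy_activity = CopyActivity(name='copyBlobtoBlob', inputs=[dsin_ref], outputs=[dsout_ref], source=blob_source, sink=blob_sink)

# Create the pipeline that contains the copy activity
p_name = 'copyPipeline'
p_obj = PipelineResource(activities=[copy_activity], parameters={})
p = adf_client.pipelines.create_or_update(rg_name, df_name, p_name, p_obj)
print_item(p)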
Now, add the following statement to invoke the main method when the program is run:
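The statement is a call to the main function defined above:
# Start the main method
main()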
Full script
Here is the full Python code:
def print_item(group):
"""Print an Azure object instance."""
print("\tName: {}".format(group.name))
print("\tId: {}".format(group.id))
if hasattr(group, 'location'):
print("\tLocation: {}".format(group.location))
if hasattr(group, 'tags'):
print("\tTags: {}".format(group.tags))
if hasattr(group, 'properties'):
print_properties(group.properties)
print("\n")
def print_properties(props):
"""Print a ResourceGroup properties instance."""
if props and hasattr(props, 'provisioning_state') and props.provisioning_state:
print("\tProperties:")
print("\t\tProvisioning State: {}".format(props.provisioning_state))
print("\n")
def print_activity_run_details(activity_run):
"""Print activity run details."""
print("\n\tActivity run details\n")
print("\tActivity run status: {}".format(activity_run.status))
if activity_run.status == 'Succeeded':
print("\tNumber of bytes read: {}".format(activity_run.output['dataRead']))
print("\tNumber of bytes written: {}".format(activity_run.output['dataWritten']))
print("\tCopy duration: {}".format(activity_run.output['copyDuration']))
else:
print("\tErrors: {}".format(activity_run.error['message']))
def main():
# Azure subscription ID
subscription_id = '<your Azure subscription ID>'
# This program creates this resource group. If it's an existing resource group, comment out the
code that creates the resource group
rg_name = '<Azure resource group name>'
# Specify your Active Directory client ID, client secret, and tenant ID
credentials = ServicePrincipalCredentials(client_id='<Active Directory client ID>',
secret='<client secret>', tenant='<tenant ID>')
resource_client = ResourceManagementClient(credentials, subscription_id)
adf_client = DataFactoryManagementClient(credentials, subscription_id)
rg_params = {'location':'eastus'}
df_params = {'location':'eastus'}
ls_azure_storage = AzureStorageLinkedService(connection_string=storage_string)
ls = adf_client.linked_services.create_or_update(rg_name, df_name, ls_name, ls_azure_storage)
print_item(ls)
Name: storageLinkedService
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory
name>/linkedservices/storageLinkedService
Name: ds_in
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/datasets/ds_in
Name: ds_out
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/datasets/ds_out
Name: copyPipeline
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/pipelines/copyPipeline
Clean up resources
To delete the data factory, add the following code to the program:
adf_client.factories.delete(rg_name,df_name)
Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage.
Go through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create an Azure data factory and
pipeline by using the REST API
3/26/2019
Azure Data Factory is a cloud-based data integration service that allows you to create data-driven
workflows in the cloud for orchestrating and automating data movement and data transformation.
Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that
can ingest data from disparate data stores, process/transform the data by using compute services such
as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning, and
publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI)
applications to consume.
This quickstart describes how to use the REST API to create an Azure data factory. The pipeline in this data
factory copies data from one location to another location in an Azure blob storage.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM
module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az
module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module
installation instructions, see Install Azure PowerShell.
Azure subscription. If you don't have a subscription, you can create a free trial account.
Azure Storage account. You use the blob storage as source and sink data store. If you don't have
an Azure storage account, see the Create a storage account article for steps to create one.
Create a blob container in Blob Storage, create an input folder in the container, and upload some
files to the folder. You can use tools such as Azure Storage explorer to connect to Azure Blob
storage, create a blob container, upload input file, and verify the output file.
Install Azure PowerShell. Follow the instructions in How to install and configure Azure
PowerShell. This quickstart uses PowerShell to invoke REST API calls.
Create an application in Azure Active Directory by following these instructions. Make note of the following values that you use in later steps: application ID, authentication key, and tenant ID. Assign the application to the Contributor role.
Connect-AzAccount
Run the following command to view all the subscriptions for this account:
Get-AzSubscription
Run the following command to select the subscription that you want to work with. Replace
SubscriptionId with the ID of your Azure subscription:
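One way to do this is with the Set-AzContext cmdlet:
Set-AzContext -Subscription "<SubscriptionId>"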
2. Run the following commands after replacing the placeholders with your own values, to set global variables to be used in later steps.
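A sketch of these assignments, using the variable names that appear in the later commands (the api-version value is an assumption):
$tenantId = "<your tenant ID>"
$appId = "<your application ID>"
$authKey = "<your authentication key for the application>"
$subsId = "<your subscription ID>"
$resourceGroup = "<your resource group name>"
$dataFactoryName = "<a globally unique data factory name>"
$apiVersion = "2018-06-01"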
$AuthContext =
[Microsoft.IdentityModel.Clients.ActiveDirectory.AuthenticationContext]"https://login.microsoftonli
ne.com/${tenantId}"
$cred = New-Object -TypeName Microsoft.IdentityModel.Clients.ActiveDirectory.ClientCredential -
ArgumentList ($appId, $authKey)
$result = $AuthContext.AcquireToken("https://management.core.windows.net/", $cred)
$authHeader = @{
'Content-Type'='application/json'
'Accept'='application/json'
'Authorization'=$result.CreateAuthorizationHeader()
}
$request =
"https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Mic
rosoft.DataFactory/factories/${dataFactoryName}?api-version=${apiVersion}"
$body = @"
{
"name": "$dataFactoryName",
"location": "East US",
"properties": {},
"identity": {
"type": "SystemAssigned"
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
Note the following points:
The name of the Azure data factory must be globally unique. If you receive the following error,
change the name and try again.
For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory:
Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.
Here is the sample response:
{
"name": "<dataFactoryName>",
"tags": {
},
"properties": {
"provisioningState": "Succeeded",
"loggingStorageAccountKey": "**********",
"createTime": "2017-09-14T06:22:59.9106216Z",
"version": "2018-06-01"
},
"identity": {
"type": "SystemAssigned",
"principalId": "<service principal ID>",
"tenantId": "<tenant ID>"
},
"id": "dataFactoryName",
"type": "Microsoft.DataFactory/factories",
"location": "East US"
}
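Create a linked service
You create an Azure Storage linked service to link your storage account to the data factory. The request below is a sketch that follows the same pattern as the other calls in this quickstart; the account name and key placeholders are assumptions.
$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/linkedservices/AzureStorageLinkedService?api-version=${apiVersion}"
$body = @"
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=<accountKey>"
            }
        }
    }
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
Here is the sample response: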
{
"id":
"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory
/factories/<dataFactoryName>/linkedservices/AzureStorageLinkedService",
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "@{value=**********; type=SecureString}"
}
},
"etag": "0000c552-0000-0000-0000-59b1459c0000"
}
Create datasets
You define a dataset that represents the data to copy from a source to a sink. In this example, this Blob dataset refers to the Azure Storage linked service you created in the previous step. The dataset takes a parameter whose value is set in an activity that consumes the dataset. The parameter is used to construct the "folderPath" value that points to where the data resides.
$request =
"https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Mic
rosoft.DataFactory/factories/${dataFactoryName}/datasets/BlobDataset?api-version=${apiVersion}"
$body = @"
{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
{
"id":
"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory
/factories/<dataFactoryName>/datasets/BlobDataset",
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "@{value=@{dataset().path}; type=Expression}"
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": "@{type=String}"
}
},
"etag": "0000c752-0000-0000-0000-59b1459d0000"
}
Create pipeline
In this example, this pipeline contains one activity and takes two parameters: the input blob path and the output blob path. The values for these parameters are set when the pipeline is triggered or run. The copy activity refers to the same blob dataset created in the previous step as both input and output. When the dataset is used as an input dataset, the input path is specified; when it is used as an output dataset, the output path is specified.
$request =
"https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Mic
rosoft.DataFactory/factories/${dataFactoryName}/pipelines/Adfv2QuickStartPipeline?api-
version=${apiVersion}"
$body = @"
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
}
}
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
$request =
"https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Mic
rosoft.DataFactory/factories/${dataFactoryName}/pipelines/Adfv2QuickStartPipeline/createRun?api-
version=${apiVersion}"
$body = @"
{
"inputPath": "<the path to existing blob(s) to copy data from, e.g. containername/path>",
"outputPath": "<the blob path to copy data to, e.g. containername/path>"
}
"@
$response = Invoke-RestMethod -Method POST -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
$runId = $response.runId
{
"runId": "2f26be35-c112-43fa-9eaa-8ba93ea57881"
}
Monitor pipeline
1. Run the following script to continuously check the pipeline run status until it finishes copying
the data.
$request =
"https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/provid
ers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelineruns/${runId}?api-
version=${apiVersion}"
while ($True) {
    $response = Invoke-RestMethod -Method GET -Uri $request -Header $authHeader
    Write-Host "Pipeline run status: " $response.Status -foregroundcolor "Yellow"
    if ($response.Status -ne "InProgress") { $response | ConvertTo-Json; break }
    Start-Sleep -Seconds 15
}
Here is the sample output of the pipeline run:
{
"key": "000000000-0000-0000-0000-00000000000",
"timestamp": "2017-09-07T13:12:39.5561795Z",
"runId": "000000000-0000-0000-0000-000000000000",
"dataFactoryName": "<dataFactoryName>",
"pipelineName": "Adfv2QuickStartPipeline",
"parameters": [
"inputPath: <inputBlobPath>",
"outputPath: <outputBlobPath>"
],
"parametersCount": 2,
"parameterNames": [
"inputPath",
"outputPath"
],
"parameterNamesCount": 2,
"parameterValues": [
"<inputBlobPath>",
"<outputBlobPath>"
],
"parameterValuesCount": 2,
"runStart": "2017-09-07T13:12:00.3710792Z",
"runEnd": "2017-09-07T13:12:39.5561795Z",
"durationInMs": 39185,
"status": "Succeeded",
"message": ""
}
2. Run the following script to retrieve copy activity run details, for example, size of the data
read/written.
$request =
"https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/provid
ers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelineruns/${runId}/activityruns?
api-version=${apiVersion}&startTime="+(Get-Date).ToString('yyyy-MM-dd')+"&endTime="+(Get-
Date).AddDays(1).ToString('yyyy-MM-dd')+"&pipelineName=Adfv2QuickStartPipeline"
$response = Invoke-RestMethod -Method GET -Uri $request -Header $authHeader
$response | ConvertTo-Json
Clean up resources
You can clean up the resources that you created in the Quickstart in two ways. You can delete the Azure
resource group, which includes all the resources in the resource group. If you want to keep the other
resources intact, delete only the data factory you created in this tutorial.
Run the following command to delete the entire resource group:
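For example, assuming the $resourceGroup variable set earlier still holds the resource group name:
Remove-AzResourceGroup -Name $resourceGroup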
Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage.
Go through the tutorials to learn about using Data Factory in more scenarios.
Tutorial: Create an Azure data factory using
Azure Resource Manager template
3/26/2019
This quickstart describes how to use an Azure Resource Manager template to create an Azure data
factory. The pipeline you create in this data factory copies data from one folder to another folder in an
Azure blob storage. For a tutorial on how to transform data using Azure Data Factory, see Tutorial:
Transform data using Spark.
NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the Azure
Data Factory service, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of
the contributor or owner role, or an administrator of the Azure subscription. To view the permissions
that you have in the subscription, in the Azure portal, select your username in the upper-right corner,
and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account, see
Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.
3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.
5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a folder
named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.
2. On the Blob service page, select + Container on the toolbar.
3. In the New container dialog box, enter adftutorial for the name, and then select OK.
7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not already
exist.
John, Doe
Jane, Doe
8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the Files
box.
9. Enter input as a value for the Upload to folder box.
10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.
12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM
module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az
module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module
installation instructions, see Install Azure PowerShell.
Install the latest Azure PowerShell modules by following instructions in How to install and configure
Azure PowerShell.
Create a JSON file named ADFTutorialARM.json with the following content:
{
"contentVersion": "1.0.0.0",
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"parameters": {
"dataFactoryName": {
"type": "string",
"metadata": {
"description": "Name of the data factory. Must be globally unique."
}
},
"dataFactoryLocation": {
"type": "string",
"allowedValues": [
"East US",
"East US 2",
"West Europe"
],
"defaultValue": "East US",
"metadata": {
"description": "Location of the data factory. Currently, only East US, East US 2, and West
Europe are supported. "
}
},
"storageAccountName": {
"type": "string",
"metadata": {
"description": "Name of the Azure storage account that contains the input/output data."
}
},
"storageAccountKey": {
"type": "securestring",
"metadata": {
"description": "Key for the Azure storage account."
}
},
"blobContainer": {
"type": "string",
"metadata": {
"description": "Name of the blob container in the Azure Storage account."
}
},
"inputBlobFolder": {
"type": "string",
"metadata": {
"description": "The folder in the blob container that has the input file."
}
},
"inputBlobName": {
"type": "string",
"metadata": {
"description": "Name of the input file/blob."
}
},
"outputBlobFolder": {
"type": "string",
"metadata": {
"description": "The folder in the blob container that will hold the transformed data."
}
},
"outputBlobName": {
"type": "string",
"metadata": {
"description": "Name of the output file/blob."
}
},
"triggerStartTime": {
"type": "string",
"metadata": {
"description": "Start time for the trigger."
}
},
"triggerEndTime": {
"type": "string",
"metadata": {
"description": "End time for the trigger."
}
}
},
"variables": {
"azureStorageLinkedServiceName": "ArmtemplateStorageLinkedService",
"inputDatasetName": "ArmtemplateTestDatasetIn",
"outputDatasetName": "ArmtemplateTestDatasetOut",
"pipelineName": "ArmtemplateSampleCopyPipeline",
"triggerName": "ArmTemplateTestTrigger"
},
"resources": [{
"name": "[parameters('dataFactoryName')]",
"apiVersion": "2018-06-01",
"type": "Microsoft.DataFactory/factories",
"location": "[parameters('dataFactoryLocation')]",
"identity": {
"type": "SystemAssigned"
},
"resources": [{
"type": "linkedservices",
"name": "[variables('azureStorageLinkedServiceName')]",
"dependsOn": [
"[parameters('dataFactoryName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureStorage",
"description": "Azure Storage linked service",
"typeProperties": {
"connectionString": {
"value": "[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',parameters('storageAccountKey'))]",
"type": "SecureString"
}
}
}
},
{
"type": "datasets",
"name": "[variables('inputDatasetName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('inputBlobFolder'),
'/')]",
"fileName": "[parameters('inputBlobName')]"
},
"linkedServiceName": {
"referenceName": "[variables('azureStorageLinkedServiceName')]",
"type": "LinkedServiceReference"
}
}
},
{
"type": "datasets",
"name": "[variables('outputDatasetName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('outputBlobFolder'),
'/')]",
"fileName": "[parameters('outputBlobName')]"
},
"linkedServiceName": {
"referenceName": "[variables('azureStorageLinkedServiceName')]",
"type": "LinkedServiceReference"
}
}
},
{
"type": "pipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('inputDatasetName')]",
"[variables('outputDatasetName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"name": "MyCopyActivity",
"inputs": [{
"referenceName": "[variables('inputDatasetName')]",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "[variables('outputDatasetName')]",
"type": "DatasetReference"
}]
}]
}
},
{
"type": "triggers",
"name": "[variables('triggerName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('inputDatasetName')]",
"[variables('outputDatasetName')]",
"[variables('pipelineName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "[parameters('triggerStartTime')]",
"endTime": "[parameters('triggerEndTime')]",
"timeZone": "UTC"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "ArmtemplateSampleCopyPipeline"
},
"parameters": {}
}]
}
}
]
}]
}
Parameters JSON
Create a JSON file named ADFTutorialARM-Parameters.json that contains parameters for the Azure
Resource Manager template.
IMPORTANT
Specify the name and key of your Azure Storage account for the storageAccountName and
storageAccountKey parameters in this parameter file. You created the adftutorial container and uploaded
the sample file (emp.txt) to the input folder in this Azure blob storage.
Specify a globally unique name for the data factory for the dataFactoryName parameter. For example:
ARMTutorialFactoryJohnDoe11282017.
For the triggerStartTime, specify the current day in the format: 2017-11-28T00:00:00 .
For the triggerEndTime, specify the next day in the format: 2017-11-29T00:00:00 . You can also check the
current UTC time and specify the next hour or two as the end time. For example, if the UTC time now is 1:32 AM, specify 2017-11-29T03:00:00 as the end time. In this case, the trigger runs the pipeline twice (at 2 AM and 3 AM).
{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"dataFactoryName": {
"value": "<datafactoryname>"
},
"dataFactoryLocation": {
"value": "East US"
},
"storageAccountName": {
"value": "<yourstorageaccountname>"
},
"storageAccountKey": {
"value": "<yourstorageaccountkey>"
},
"blobContainer": {
"value": "adftutorial"
},
"inputBlobFolder": {
"value": "input"
},
"inputBlobName": {
"value": "emp.txt"
},
"outputBlobFolder": {
"value": "output"
},
"outputBlobName": {
"value": "emp.txt"
},
"triggerStartTime": {
"value": "2017-11-28T00:00:00. Set to today"
},
"triggerEndTime": {
"value": "2017-11-29T00:00:00. Set to tomorrow"
}
}
}
IMPORTANT
You may have separate parameter JSON files for the development, test, and production environments that you can use with the same Data Factory JSON template. By using a PowerShell script, you can automate deploying Data Factory entities in these environments.
Deploy Data Factory entities
In PowerShell, run the following command to deploy Data Factory entities using the Resource Manager
template you created earlier in this quickstart.
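A sketch of the deployment command, assuming the template and parameter files are in the current folder and are named ADFTutorialARM.json and ADFTutorialARM-Parameters.json:
New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile .\ADFTutorialARM.json -TemplateParameterFile .\ADFTutorialARM-Parameters.json
You see output similar to the following sample: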
DeploymentName : MyARMDeployment
ResourceGroupName : ADFTutorialResourceGroup
ProvisioningState : Succeeded
Timestamp : 11/29/2017 3:11:13 AM
Mode : Incremental
TemplateLink :
Parameters :
Name Type Value
=============== ============ ==========
dataFactoryName String <data factory name>
dataFactoryLocation String East US
storageAccountName String <storage account name>
storageAccountKey SecureString
blobContainer String adftutorial
inputBlobFolder String input
inputBlobName String emp.txt
outputBlobFolder String output
outputBlobName String emp.txt
triggerStartTime String 11/29/2017 12:00:00 AM
triggerEndTime String 11/29/2017 4:00:00 AM
Outputs :
DeploymentDebugLogLevel :
Start the trigger
1. Create a variable to hold the name of the resource group:
$resourceGroupName = "ADFTutorialResourceGroup"
2. Create a variable to hold the name of the data factory. Specify the same name that you specified in the ADFTutorialARM-Parameters.json file.
$dataFactoryName = "<yourdatafactoryname>"
3. Set a variable for the name of the trigger. The name of the trigger is hardcoded in the Resource
Manager template file (ADFTutorialARM.json).
$triggerName = "ArmTemplateTestTrigger"
4. Get the status of the trigger by running the following PowerShell command after specifying the
name of your data factory and trigger:
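A sketch of the command, using the variables set in the previous steps:
Get-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $triggerName
Here is the sample output: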
TriggerName : ArmTemplateTestTrigger
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ARMFactory1128
Properties : Microsoft.Azure.Management.DataFactory.Models.ScheduleTrigger
RuntimeState : Stopped
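5. The deployed trigger is in the stopped state. Start the trigger with the Start-AzDataFactoryV2Trigger cmdlet (a sketch using the same variables):
Start-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $triggerName
The cmdlet asks you to confirm the operation: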
Confirm
Are you sure you want to start trigger 'ArmTemplateTestTrigger' in data factory
'ARMFactory1128'?
[Y] Yes [N] No [S] Suspend [?] Help (default is "Y"): y
True
6. Confirm that the trigger has been started by running the Get-AzDataFactoryV2Trigger command
again.
TriggerName : ArmTemplateTestTrigger
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ARMFactory1128
Properties : Microsoft.Azure.Management.DataFactory.Models.ScheduleTrigger
RuntimeState : Started
Monitor the pipeline
1. After signing in to the Azure portal, click All services, search with a keyword such as "data fa", and select Data factories.
2. In the Data Factories page, click the data factory you created. If needed, filter the list with the
name of your data factory.
IMPORTANT
You see pipeline runs only on the hour (for example: 4 AM, 5 AM, 6 AM, and so on). Click Refresh on the toolbar to refresh the list when the time reaches the next hour.
6. You see the activity runs associated with the pipeline run. In this quickstart, the pipeline has only
one activity of type: Copy. Therefore, you see a run for that activity.
7. Click the link in the Output column. You see the output from the copy operation in an Output window. Click the maximize button to see the full output, and then close the maximized output window.
8. Stop the trigger once you see a successful or failed run. The trigger runs the pipeline once an hour.
The pipeline copies the same file from the input folder to the output folder for each run. To stop
the trigger, run the following command in the PowerShell window.
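A sketch of the command, using the variables defined earlier:
Stop-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $triggerName
To delete the resource group that you created in this quickstart, which removes everything in it including the data factory, run the following command:
Remove-AzResourceGroup -Name $resourceGroupName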
Note: Deleting a resource group may take some time. Please be patient with the process.
If you want to delete just the data factory, not the entire resource group, run the following command:
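A sketch of that command:
Remove-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Name $dataFactoryName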
Azure Storage linked service
The connectionString uses the storageAccountName and storageAccountKey parameters. The values for these parameters are passed by using a parameter file. The definition also uses the azureStorageLinkedServiceName variable and the dataFactoryName parameter, which are defined in the template.
Azure blob input dataset
The Azure storage linked service specifies the connection string that Data Factory service uses at run
time to connect to your Azure storage account. In Azure blob dataset definition, you specify names of
blob container, folder, and file that contains the input data. See Azure Blob dataset properties for details
about JSON properties used to define an Azure Blob dataset.
{
"type": "datasets",
"name": "[variables('inputDatasetName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('inputBlobFolder'),
'/')]",
"fileName": "[parameters('inputBlobName')]"
},
"linkedServiceName": {
"referenceName": "[variables('azureStorageLinkedServiceName')]",
"type": "LinkedServiceReference"
}
}
},
Data pipeline
You define a pipeline that copies data from one Azure blob dataset to another Azure blob dataset. See
Pipeline JSON for descriptions of JSON elements used to define a pipeline in this example.
{
"type": "pipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('inputDatasetName')]",
"[variables('outputDatasetName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"name": "MyCopyActivity",
"inputs": [{
"referenceName": "[variables('inputDatasetName')]",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "[variables('outputDatasetName')]",
"type": "DatasetReference"
}]
}]
}
}
Trigger
You define a trigger that runs the pipeline once an hour. The deployed trigger is in stopped state. Start
the trigger by using the Start-AzDataFactoryV2Trigger cmdlet. For more information about triggers,
see Pipeline execution and triggers article.
{
"type": "triggers",
"name": "[variables('triggerName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('inputDatasetName')]",
"[variables('outputDatasetName')]",
"[variables('pipelineName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2017-11-28T00:00:00",
"endTime": "2017-11-29T00:00:00",
"timeZone": "UTC"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "ArmtemplateSampleCopyPipeline"
},
"parameters": {}
}]
}
}
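For example, deployments to the development, test, and production environments might look like the following sketch (the parameter file names are assumptions):
New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile .\ADFTutorialARM.json -TemplateParameterFile .\ADFTutorialARM-Parameters-Dev.json
New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile .\ADFTutorialARM.json -TemplateParameterFile .\ADFTutorialARM-Parameters-Test.json
New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile .\ADFTutorialARM.json -TemplateParameterFile .\ADFTutorialARM-Parameters-Production.json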
Notice that the first command uses the parameter file for the development environment, the second one for the test environment, and the third one for the production environment.
You can also reuse the template to perform repeated tasks. For example, create many data factories with
one or more pipelines that implement the same logic but each data factory uses different Azure storage
accounts. In this scenario, you use the same template in the same environment (dev, test, or production)
with different parameter files to create data factories.
Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage.
Go through the tutorials to learn about using Data Factory in more scenarios.
Create Azure Data Factory Data Flow
5/6/2019
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Mapping Data Flows in ADF provide a way to transform data at scale without any coding required. You can design
a data transformation job in the data flow designer by constructing a series of transformations. Start with any
number of source transformations followed by data transformation steps. Then, complete your data flow with a sink
to land your results in a destination.
Get started by first creating a new V2 Data Factory from the Azure portal. After creating your new factory, click on
the "Author & Monitor" tile to launch the Data Factory UI.
Once you are in the Data Factory UI, you can use sample Data Flows. The samples are available from the ADF
Template Gallery. In ADF, create "Pipeline from Template" and select the Data Flow category from the template
gallery.
You will be prompted to enter your Azure Blob Storage account information.
The data used for these samples can be found here. Download the sample data and store the files in your Azure
Blob storage accounts so that you can execute the samples.
Next steps
Begin building your data transformation with a source transformation.
Copy data from Azure Blob storage to a SQL
database by using the Copy Data tool
3/6/2019
In this tutorial, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a
pipeline that copies data from Azure Blob storage to a SQL database.
NOTE
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure storage account: Use Blob storage as the source data store. If you don't have an Azure storage account,
see the instructions in Create a storage account.
Azure SQL Database: Use a SQL database as the sink data store. If you don't have a SQL database, see the
instructions in Create a SQL database.
Create a blob and a SQL table
Prepare your Blob storage and your SQL database for the tutorial by performing these steps.
Create a source blob
1. Launch Notepad. Copy the following text and save it in a file named inputEmp.txt on your disk:
John|Doe
Jane|Doe
2. Create a container named adfv2tutorial and upload the inputEmp.txt file to the container. You can use
various tools to perform these tasks, such as Azure Storage Explorer.
Create a sink SQL table
1. Use the following SQL script to create a table named dbo.emp in your SQL database:
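The script is the same one used later in this article:
CREATE TABLE dbo.emp
(
    ID int IDENTITY(1,1) NOT NULL,
    FirstName varchar(50),
    LastName varchar(50)
)
GO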
If you receive an error message about the name value, enter a different name for the data factory. For
example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory
artifacts, see Data Factory naming rules.
3. Select the Azure subscription in which to create the new data factory.
4. For Resource Group, take one of the following steps:
a. Select Use existing, and select an existing resource group from the drop-down list.
b. Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under version, select V2 for the version.
6. Under location, select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores (for example, Azure Storage and SQL Database) and computes (for example,
Azure HDInsight) that are used by your data factory can be in other locations and regions.
7. Select Pin to dashboard.
8. Select Create.
9. On the dashboard, the Deploying Data Factory tile shows the process status.
10. After creation is finished, the Data Factory home page is displayed.
11. To launch the Azure Data Factory user interface (UI) in a separate tab, select the Author & Monitor tile.
c. On the New Linked Service page, select your storage account from the Storage account name list,
and then select Finish.
d. Select the newly created linked service as source, then click Next.
4. On the Choose the input file or folder page, complete the following steps:
a. Click Browse to navigate to the adfv2tutorial/input folder, select the inputEmp.txt file, then click
Choose.
b. Click Next to move to next step.
5. On the File format settings page, notice that the tool automatically detects the column and row delimiters.
Select Next. You also can preview data and view the schema of the input data on this page.
b. Select Azure SQL Database from the gallery, and then select Next.
c. On the New Linked Service page, select your server name and DB name from the dropdown list, and
specify the username and password, then select Finish.
d. Select the newly created linked service as sink, then click Next.
7. On the Table mapping page, select the [dbo].[emp] table, and then select Next.
8. On the Schema mapping page, notice that the first and second columns in the input file are mapped to the
FirstName and LastName columns of the emp table. Select Next.
12. Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to view
activity run details and to rerun the pipeline. Select Refresh to refresh the list.
13. To view the activity runs that are associated with the pipeline run, select the View Activity Runs link in the
Actions column. For details about the copy operation, select the Details link (eyeglasses icon) in the
Actions column. To go back to the Pipeline Runs view, select the Pipelines link at the top. To refresh the
view, select Refresh.
14. Verify that the data is inserted into the emp table in your SQL database.
15. Select the Author tab on the left to switch to the editor mode. You can update the linked services, datasets,
and pipelines that were created via the tool by using the editor. For details on editing these entities in the
Data Factory UI, see the Azure portal version of this tutorial.
Next steps
The pipeline in this sample copies data from Blob storage to a SQL database. You learned how to:
Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn how to copy data from on-premises to the cloud:
Copy data from on-premises to the cloud
Copy data from Azure Blob storage to a SQL
database by using Azure Data Factory
3/26/2019
In this tutorial, you create a data factory by using the Azure Data Factory user interface (UI). The pipeline in this
data factory copies data from Azure Blob storage to a SQL database. The configuration pattern in this tutorial
applies to copying from a file-based data store to a relational data store. For a list of data stores supported as
sources and sinks, see the supported data stores table.
NOTE
If you're new to Data Factory, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription. If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account. You use Blob storage as a source data store. If you don't have a storage account, see
Create an Azure storage account for steps to create one.
Azure SQL Database. You use the database as a sink data store. If you don't have a SQL database, see Create
a SQL database for steps to create one.
Create a blob and a SQL table
Now, prepare your Blob storage and SQL database for the tutorial by performing the following steps.
Create a source blob
1. Launch Notepad. Copy the following text, and save it as an emp.txt file on your disk:
John,Doe
Jane,Doe
2. Create a container named adftutorial in your Blob storage. Create a folder named input in this container.
Then, upload the emp.txt file to the input folder. Use the Azure portal or tools such as Azure Storage
Explorer to do these tasks.
Create a sink SQL table
1. Use the following SQL script to create the dbo.emp table in your SQL database:
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO
2. Allow Azure services to access SQL Server. Ensure that Allow access to Azure services is turned ON for
your SQL Server so that Data Factory can write data to your SQL Server. To verify and turn on this setting,
take the following steps:
a. On the left, select More services > SQL servers.
b. Select your server, and under SETTINGS select Firewall.
c. On the Firewall settings page, select ON for Allow access to Azure services.
The name of the Azure data factory must be globally unique. If you see the following error message for the
name field, change the name of the data factory (for example, yournameADFTutorialDataFactory). For
naming rules for Data Factory artifacts, see Data Factory naming rules.
4. Select the Azure subscription in which you want to create the data factory.
5. For Resource Group, take one of the following steps:
a. Select Use existing, and select an existing resource group from the drop-down list.
b. Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
6. Under Version, select V2.
7. Under Location, select a location for the data factory. Only locations that are supported are displayed in
the drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for
example, Azure HDInsight) used by the data factory can be in other regions.
8. Select Pin to dashboard.
9. Select Create.
10. On the dashboard, you see the following tile with the status Deploying Data Factory:
11. After the creation is finished, you see the Data factory page as shown in the image.
12. Select Author & Monitor to launch the Data Factory UI in a separate tab.
Create a pipeline
In this step, you create a pipeline with a copy activity in the data factory. The copy activity copies data from Blob
storage to SQL Database. In the Quickstart tutorial, you created a pipeline by following these steps:
1. Create the linked service.
2. Create input and output datasets.
3. Create a pipeline.
In this tutorial, you start with creating the pipeline. Then you create linked services and datasets when you need
them to configure the pipeline.
1. On the Let's get started page, select Create pipeline.
2. In the General tab for the pipeline, enter CopyPipeline for Name of the pipeline.
3. In the Activities toolbox, expand the Move and Transform category, and drag and drop the Copy Data
activity from the tool box to the pipeline designer surface. Specify CopyFromBlobToSql for Name.
Configure source
1. Go to the Source tab. Select + New to create a source dataset.
2. In the New Dataset window, select Azure Blob Storage, and then select Finish. The source data is in
Blob storage, so you select Azure Blob Storage for the source dataset.
3. You see a new tab opened for blob dataset. On the General tab at the bottom of the Properties window,
enter SourceBlobDataset for Name.
4. Go to the Connection tab of the Properties window. Next to the Linked service text box, select + New.
5. In the New Linked Service window, enter AzureStorageLinkedService as name, select your storage
account from the Storage account name list, then select Save to deploy the linked service.
6. After the linked service is created, you are back in the dataset settings. Next to File path, select Browse.
7. Navigate to the adftutorial/input folder, select the emp.txt file, and then select Finish.
8. Confirm that File format is set to Text format and that Column delimiter is set to Comma ( , ). If the
source file uses different row and column delimiters, you can select Detect Text Format for File format.
The Copy Data tool detects the file format and delimiters automatically for you. You can still override these
values. To preview data on this page, select Preview data.
9. Go to the Schema tab of the Properties window, and select Import Schema. Notice that the application
detected two columns in the source file. You import the schema here so that you can map columns from the
source data store to the sink data store. If you don't need to map columns, you can skip this step. For this
tutorial, import the schema.
10. Now, go back to the pipeline -> Source tab, confirm that SourceBlobDataset is selected. To preview data
on this page, select Preview data.
Configure sink
1. Go to the Sink tab, and select + New to create a sink dataset.
2. In the New Dataset window, input "SQL" in the search box to filter the connectors, then select Azure SQL
Database, and then select Finish. In this tutorial, you copy data to a SQL database.
3. On the General tab of the Properties window, in Name, enter OutputSqlDataset.
4. Go to the Connection tab, and next to Linked service, select + New. A dataset must be associated with a
linked service. The linked service has the connection string that Data Factory uses to connect to the SQL
database at runtime. The dataset specifies the container, folder, and the file (optional) to which the data is
copied.
8. Select the ID column, and then select Delete. The ID column is an identity column in the SQL database, so
the copy activity doesn't need to insert data into this column.
9. Go to the tab with the pipeline, and in Sink Dataset, confirm that OutputSqlDataset is selected.
Configure mapping
Go to the Mapping tab at the bottom of the Properties window, and select Import Schemas. Notice that the
first and second columns in the source file are mapped to FirstName and LastName in the SQL database.
3. Wait until you see the Successfully published message. To see notification messages, click the Show
Notifications on the top-right (bell button).
3. To see activity runs associated with the pipeline run, select the View Activity Runs link in the Actions
column. In this example, there is only one activity, so you see only one entry in the list. For details about the
copy operation, select the Details link (eyeglasses icon) in the Actions column. Select Pipelines at the top
to go back to the Pipeline Runs view. To refresh the view, select Refresh.
4. Verify that two more rows are added to the emp table in the SQL database.
5. On the Trigger Run Parameters page, review the warning, and then select Finish. The pipeline in this
example doesn't take any parameters.
6. Click Publish All to publish the change.
7. Go to the Monitor tab on the left to see the triggered pipeline runs.
8. To switch from the Pipeline Runs view to the Trigger Runs view, select Pipeline Runs and then select
Trigger Runs.
10. Verify that two rows per minute (for each pipeline run) are inserted into the emp table until the specified
end time.
Next steps
The pipeline in this sample copies data from Azure Blob storage to Azure SQL Database. You learned how to:
Create a data factory.
Create a pipeline with a copy activity.
Test run the pipeline.
Trigger the pipeline manually.
Trigger the pipeline on a schedule.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn how to copy data from on-premises to the cloud:
Copy data from on-premises to the cloud
Copy data from Azure Blob to Azure SQL Database
using Azure Data Factory
5/15/2019 • 10 minutes to read
In this tutorial, you create a Data Factory pipeline that copies data from Azure Blob Storage to Azure SQL
Database. The configuration pattern in this tutorial applies to copying from a file-based data store to a relational
data store. For a list of data stores supported as sources and sinks, see supported data stores table.
You perform the following steps in this tutorial:
Create a data factory.
Create Azure Storage and Azure SQL Database linked services.
Create Azure Blob and Azure SQL Database datasets.
Create a pipeline that contains a Copy activity.
Start a pipeline run.
Monitor the pipeline and activity runs.
This tutorial uses the .NET SDK. To use other mechanisms to interact with Azure Data Factory, refer to the samples
under "Quickstarts".
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure Storage account. You use the blob storage as the source data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one.
Azure SQL Database. You use the database as the sink data store. If you don't have an Azure SQL database, see
the Create an Azure SQL database article for steps to create one.
Visual Studio 2015 or 2017. The walkthrough in this article uses Visual Studio 2017.
Download and install the Azure .NET SDK.
Create an application in Azure Active Directory by following these instructions. Make note of the following
values that you use in later steps: application ID, authentication key, and tenant ID. Assign the application to the
"Contributor" role by following the instructions in the same article.
Create a blob and a SQL table
Now, prepare your Azure Blob and Azure SQL Database for the tutorial by performing the following steps:
Create a source blob
1. Launch Notepad. Copy the following text and save it on your disk as a file named inputEmp.txt.
John|Doe
Jane|Doe
2. Use tools such as Azure Storage Explorer to create the adfv2tutorial container, and to upload the
inputEmp.txt file to the container.
Create a sink SQL table
1. Use the following SQL script to create the dbo.emp table in your Azure SQL Database.
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO
2. Allow Azure services to access SQL Server. Ensure that the Allow access to Azure services setting is turned
ON for your Azure SQL server so that the Data Factory service can write data to it.
To verify and turn on this setting, do the following steps:
a. Click the More services hub on the left, and then click SQL servers.
b. Select your server, and click Firewall under SETTINGS.
c. In the Firewall settings page, click ON for Allow access to Azure services.
Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Rest.Serialization;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
2. Add the following code to the Main method that sets variables. Replace the placeholders with your own
values. For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory: Products available
by region. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used
by data factory can be in other regions.
// Set variables
string tenantID = "<your tenant ID>";
string applicationId = "<your application ID>";
string authenticationKey = "<your authentication key for the application>";
string subscriptionId = "<your subscription ID to create the factory>";
string resourceGroup = "<your resource group to create the factory>";
// Names used by the code later in this walkthrough
string region = "<region in which to create the data factory, for example East US>";
string dataFactoryName = "<globally unique name for the data factory>";
3. Add the following code to the Main method that creates an instance of DataFactoryManagementClient
class. You use this object to create a data factory, linked service, datasets, and pipeline. You also use this
object to monitor the pipeline run details.
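A minimal sketch of this step, assuming the packages listed earlier; the exact authentication calls can vary by SDK version. The data factory itself is then created by the code that follows.
// Authenticate with Azure AD by using the service principal from the prerequisites,
// and build the Data Factory management client used in the rest of this walkthrough.
var context = new AuthenticationContext("https://login.microsoftonline.com/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync("https://management.azure.com/", cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) { SubscriptionId = subscriptionId };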
// Create a data factory in the selected region
Factory dataFactory = new Factory { Location = region };
client.Factories.CreateOrUpdate(resourceGroup, dataFactoryName, dataFactory);
Console.WriteLine(SafeJsonConvert.SerializeObject(dataFactory, client.SerializationSettings));
Create datasets
In this section, you create two datasets: one for the source and the other for the sink.
Create a dataset for source Azure Blob
Add the following code to the Main method that creates an Azure Blob dataset. For supported properties and
details, see Azure Blob dataset properties.
You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure Storage
linked service you create in the previous step, and describes:
The location of the blob to copy from: FolderPath and FileName;
The blob format indicating how to parse the content: TextFormat and its settings (for example, column
delimiter).
The data structure, including column names and data types, which in this case map to the sink SQL table.
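A minimal sketch of the dataset-creation code, based on the SDK model classes. The dataset name (blobDatasetName) is a placeholder, and the linked service name AzureStorageLinkedService is an assumption; the container, file name, and column delimiter follow the source blob you created earlier.
// Define a delimited-text Blob dataset that points to inputEmp.txt in the adfv2tutorial container.
string blobDatasetName = "BlobDataset"; // placeholder name
DatasetResource blobDataset = new DatasetResource(
    new AzureBlobDataset
    {
        LinkedServiceName = new LinkedServiceReference { ReferenceName = "AzureStorageLinkedService" },
        FolderPath = "adfv2tutorial/",
        FileName = "inputEmp.txt",
        Format = new TextFormat { ColumnDelimiter = "|" },
        Structure = new List<DatasetDataElement>
        {
            new DatasetDataElement { Name = "FirstName", Type = "String" },
            new DatasetDataElement { Name = "LastName", Type = "String" }
        }
    });
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobDatasetName, blobDataset);
Console.WriteLine(SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings));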
Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity. In this tutorial, the
pipeline contains one activity, a copy activity, which takes the Blob dataset as the source and the SQL dataset as the sink.
For details, see the Copy Activity overview.
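A minimal sketch of the pipeline code. The pipeline name, activity name, and SQL dataset name below are placeholders (the SQL sink dataset is created the same way as the blob dataset); blobDatasetName comes from the dataset sketch above.
// Create a pipeline with a single copy activity: Blob source -> Azure SQL sink.
string pipelineName = "BlobToSqlPipeline";   // placeholder name
string sqlDatasetName = "AzureSqlDataset";   // placeholder; name of the sink dataset
PipelineResource pipeline = new PipelineResource
{
    Activities = new List<Activity>
    {
        new CopyActivity
        {
            Name = "CopyFromBlobToSql",
            Inputs = new List<DatasetReference> { new DatasetReference { ReferenceName = blobDatasetName } },
            Outputs = new List<DatasetReference> { new DatasetReference { ReferenceName = sqlDatasetName } },
            Source = new BlobSource(),
            Sink = new SqlSink()
        }
    }
};
client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName, pipeline);
Console.WriteLine(SafeJsonConvert.SerializeObject(pipeline, client.SerializationSettings));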
2. Add the following code to the Main method that retrieves copy activity run details, for example, the size of
the data that was read and written.
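The run-creation and monitoring code is sketched below, assuming the client and the names from the earlier sketches are in scope; the exact run and query methods (for example, CreateRunWithHttpMessagesAsync and QueryByPipelineRun) differ between SDK versions. It produces the pipelineRun and activityRuns values used by the snippet that follows.
// Start a pipeline run, wait for it to finish, and then query the activity runs.
CreateRunResponse runResponse = client.Pipelines
    .CreateRunWithHttpMessagesAsync(resourceGroup, dataFactoryName, pipelineName).Result.Body;
Console.WriteLine("Pipeline run ID: " + runResponse.RunId);

PipelineRun pipelineRun;
while (true)
{
    pipelineRun = client.PipelineRuns.Get(resourceGroup, dataFactoryName, runResponse.RunId);
    Console.WriteLine("Status: " + pipelineRun.Status);
    if (pipelineRun.Status == "InProgress")
        System.Threading.Thread.Sleep(15000);   // poll every 15 seconds
    else
        break;
}

// Query activity runs in a window around the current time (parameter names can vary by SDK version).
RunFilterParameters filterParams = new RunFilterParameters(
    DateTime.UtcNow.AddMinutes(-10), DateTime.UtcNow.AddMinutes(10));
var activityRuns = client.ActivityRuns
    .QueryByPipelineRun(resourceGroup, dataFactoryName, runResponse.RunId, filterParams).Value;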
if (pipelineRun.Status == "Succeeded")
{
Console.WriteLine(activityRuns.First().Output);
}
else
Console.WriteLine(activityRuns.First().Error);
Next steps
The pipeline in this sample copies data from Azure Blob storage to Azure SQL Database. You
learned how to:
Create a data factory.
Create Azure Storage and Azure SQL Database linked services.
Create Azure Blob and Azure SQL Database datasets.
Create a pipeline that contains a Copy activity.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copying data from on-premises to the cloud:
Copy data from on-premises to cloud
Copy data from an on-premises SQL Server
database to Azure Blob storage by using the Copy
Data tool
4/8/2019 • 8 minutes to read
In this tutorial, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a
pipeline that copies data from an on-premises SQL Server database to Azure Blob storage.
NOTE
If you're new to Azure Data Factory, see Introduction to Data Factory.
Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to log in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal. Select your user name in the upper-
right corner, and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription. For sample instructions on how to add a user to a role, see Manage access using RBAC and the
Azure portal.
SQL Server 2014, 2016, and 2017
In this tutorial, you use an on-premises SQL Server database as a source data store. The pipeline in the data
factory you create in this tutorial copies data from this on-premises SQL Server database (source) to Blob storage
(sink). You then create a table named emp in your SQL Server database and insert a couple of sample entries into
the table.
1. Start SQL Server Management Studio. If it's not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases, and then select New Database.
4. In the New Database window, enter a name for the database, and then select OK.
5. To create the emp table and insert some sample data into it, run the following query script against the
database. In the tree view, right-click the database that you created, and then select New Query.
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

-- Insert a couple of sample rows (any sample values work)
INSERT INTO emp (FirstName, LastName) VALUES ('John', 'Doe')
INSERT INTO emp (FirstName, LastName) VALUES ('Jane', 'Doe')
GO
3. In the list of storage accounts, filter for your storage account, if needed. Then select your storage account.
4. In the Storage account window, select Access keys.
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
Create the adftutorial container
In this section, you create a blob container named adftutorial in your Blob storage.
1. In the Storage account window, switch to Overview, and then select Blobs.
3. In the New container window, in the Name box, enter adftutorial, and then select OK.
3. Select the Azure subscription in which you want to create the data factory.
4. For Resource Group, take one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under Version, select V2.
6. Under Location, select the location for the data factory. Only locations that are supported are displayed in
the drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for
example, Azure HDInsight) used by Data Factory can be in other locations/regions.
7. Select Pin to dashboard.
8. Select Create.
9. On the dashboard, you see the following tile with the status Deploying Data Factory:
10. After the creation is finished, you see the Data Factory page as shown in the image.
11. Select Author & Monitor to launch the Data Factory user interface in a separate tab.
5. In the New Linked Service (SQL Server) window, under Name, enter SqlServerLinkedService. Select + New under
Connect via integration runtime. You must create a self-hosted integration runtime, download it to your
machine, and register it with Data Factory. The self-hosted integration runtime copies data between your
on-premises environment and the cloud.
6. In the Integration Runtime Setup dialog box, select Private Network. Then select Next.
7. In the Integration Runtime Setup dialog box under Name, enter TutorialIntegrationRuntime. Then
select Next.
8. Select Click here to launch the express setup for this computer. This action installs the integration
runtime on your machine and registers it with Data Factory. Alternatively, you can use the manual setup
option to download the installation file, run it, and use the key to register the integration runtime.
9. Run the downloaded application. You see the status of the express setup in the window.
10. Confirm that TutorialIntegrationRuntime is selected for the Integration Runtime field.
11. In Specify the on-premises SQL Server database, take the following steps:
a. Under Name, enter SqlServerLinkedService.
b. Under Server name, enter the name of your on-premises SQL Server instance.
c. Under Database name, enter the name of your on-premises database.
d. Under Authentication type, select appropriate authentication.
e. Under User name, enter the name of a user with access to the on-premises SQL Server instance.
f. Enter the password for the user. Select Finish.
12. Select Next.
13. On the Select tables from which to copy the data or use a custom query page, select the [dbo].[emp]
table in the list, and select Next. You can select any other table based on your database.
14. On the Destination data store page, select Create new connection.
15. In the New Linked Service window, search for and select Azure Blob Storage, and then select Continue.
16. On the New Linked Service (Azure Blob Storage) dialog, take the following steps:
c. Under Storage account name, select your storage account from the drop-down list.
d. Select Next.
17. In the Destination data store dialog, select Next. In Connection properties, for Azure storage service,
select Azure Blob Storage, and then select Next.
18. In the Choose the output file or folder dialog, under Folder path, enter adftutorial/fromonprem. You
created the adftutorial container as part of the prerequisites. If the output folder doesn't exist (in this case
fromonprem ), Data Factory automatically creates it. You also can use the Browse button to browse the
blob storage and its containers/folders. If you do not specify any value under File name, the name from
the source is used by default (in this case, dbo.emp).
22. On the Deployment page, select Monitor to monitor the pipeline or task you created.
23. On the Monitor tab, you can view the status of the pipeline you created. You can use the links in the Action
column to view activity runs associated with the pipeline run and to rerun the pipeline.
24. Select the View Activity Runs link in the Actions column to see activity runs associated with the pipeline
run. To see details about the copy operation, select the Details link (eyeglasses icon) in the Actions column.
To switch back to the Pipeline Runs view, select Pipelines at the top.
25. Confirm that you see the output file in the fromonprem folder of the adftutorial container.
26. Select the Edit tab on the left to switch to the editor mode. You can update the linked services, datasets, and
pipelines created by the tool by using the editor. Select Code to view the JSON code associated with the
entity opened in the editor. For details on how to edit these entities in the Data Factory UI, see the Azure
portal version of this tutorial.
Next steps
The pipeline in this sample copies data from an on-premises SQL Server database to Blob storage. You learned
how to:
Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.
For a list of data stores that are supported by Data Factory, see Supported data stores.
To learn about how to copy data in bulk from a source to a destination, advance to the following tutorial:
Copy data in bulk
Copy data from an on-premises SQL Server database
to Azure Blob storage
4/8/2019 • 9 minutes to read
In this tutorial, you use the Azure Data Factory user interface (UI) to create a data factory pipeline that copies data
from an on-premises SQL Server database to Azure Blob storage. You create and use a self-hosted integration
runtime, which moves data between on-premises and cloud data stores.
NOTE
This article doesn't provide a detailed introduction to Data Factory. For more information, see Introduction to Data Factory.
Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to sign in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal. In the upper-right corner, select your
user name, and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription. For sample instructions on how to add a user to a role, see Manage access using RBAC and the Azure
portal.
SQL Server 2014, 2016, and 2017
In this tutorial, you use an on-premises SQL Server database as a source data store. The pipeline in the data
factory you create in this tutorial copies data from this on-premises SQL Server database (source) to Blob storage
(sink). You then create a table named emp in your SQL Server database and insert a couple of sample entries into
the table.
1. Start SQL Server Management Studio. If it's not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases, and then select New Database.
4. In the New Database window, enter a name for the database, and then select OK.
5. To create the emp table and insert some sample data into it, run the following query script against the
database:
6. In the tree view, right-click the database that you created, and then select New Query.
Azure storage account
In this tutorial, you use a general-purpose Azure storage account (specifically, Blob storage) as a destination/sink
data store. If you don't have a general-purpose Azure storage account, see Create a storage account. The pipeline
in the data factory that you create in this tutorial copies data from the on-premises SQL Server database (source)
to Blob storage (sink).
Get the storage account name and account key
You use the name and key of your storage account in this tutorial. To get the name and key of your storage
account, take the following steps:
1. Sign in to the Azure portal with your Azure user name and password.
2. In the left pane, select More services. Filter by using the Storage keyword, and then select Storage
accounts.
3. In the list of storage accounts, filter for your storage account, if needed. Then select your storage account.
4. In the Storage account window, select Access keys.
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
Create the adftutorial container
In this section, you create a blob container named adftutorial in your Blob storage.
1. In the Storage account window, go to Overview, and then select Blobs.
3. In the New container window, under Name, enter adftutorial. Then select OK.
The name of the data factory must be globally unique. If you see the following error message for the name field,
change the name of the data factory (for example, yournameADFTutorialDataFactory). For naming rules for Data
Factory artifacts, see Data Factory naming rules.
1. Select the Azure subscription in which you want to create the data factory.
2. For Resource Group, take one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
3. Under Version, select V2.
4. Under Location, select the location for the data factory. Only locations that are supported are displayed in
the drop-down list. The data stores (for example, Storage and SQL Database) and computes (for example,
Azure HDInsight) used by Data Factory can be in other regions.
5. Select Pin to dashboard.
6. Select Create.
7. On the dashboard, you see the following tile with the status Deploying Data Factory:
8. After the creation is finished, you see the Data Factory page as shown in the image:
9. Select the Author & Monitor tile to launch the Data Factory UI in a separate tab.
Create a pipeline
1. On the Let's get started page, select Create pipeline. A pipeline is automatically created for you. You see
the pipeline in the tree view, and its editor opens.
2. On the General tab at the bottom of the Properties window, in Name, enter SQLServerToBlobPipeline.
3. In the Activities tool box, expand DataFlow. Drag and drop the Copy activity to the pipeline design
surface. Set the name of the activity to CopySqlServerToAzureBlobActivity.
4. In the Properties window, go to the Source tab, and select + New.
5. In the New Dataset window, search for SQL Server. Select SQL Server, and then select Finish. You see a
new tab titled SqlServerTable1. You also see the SqlServerTable1 dataset in the tree view on the left.
6. On the General tab at the bottom of the Properties window, in Name, enter SqlServerDataset.
7. Go to the Connection tab, and select + New. You create a connection to the source data store (SQL Server
database) in this step.
8. In the New Linked Service window, enter SqlServerLinkedService for Name. Select New under Connect
via integration runtime. In this section, you create a self-hosted integration runtime and associate it with
an on-premises machine with the SQL Server database. The self-hosted integration runtime is the
component that copies data from the SQL Server database on your machine to Blob storage.
9. In the Integration Runtime Setup window, select Private Network, and then select Next.
10. Enter a name for the integration runtime, and select Next.
11. Under Option 1: Express setup, select Click here to launch the express setup for this computer.
12. In the Integration Runtime (Self-hosted) Express Setup window, select Close.
13. In the New Linked Service window, ensure the Integration Runtime created above is selected under
Connect via integration runtime.
14. In the New Linked Service window, take the following steps:
a. Under Name, enter SqlServerLinkedService.
b. Under Connect via integration runtime, confirm that the self-hosted integration runtime you created
earlier shows up.
c. Under Server name, enter the name of your SQL Server instance.
d. Under Database name, enter the name of the database with the emp table.
e. Under Authentication type, select the appropriate authentication type that Data Factory should use to
connect to your SQL Server database.
f. Under User name and Password, enter the user name and password. If you need to use a backslash (\) in
your user account or server name, precede it with the escape character (\). For example, use
mydomain\\myuser.
g. Select Test connection. Do this step to confirm that Data Factory can connect to your SQL Server
database by using the self-hosted integration runtime you created.
h. To save the linked service, select Finish.
15. You should be back in the window with the source dataset opened. On the Connection tab of the
Properties window, take the following steps:
a. In Linked service, confirm that you see SqlServerLinkedService.
b. In Table, select [dbo].[emp].
16. Go to the tab with SQLServerToBlobPipeline, or select SQLServerToBlobPipeline in the tree view.
17. Go to the Sink tab at the bottom of the Properties window, and select + New.
18. In the New Dataset window, select Azure Blob Storage. Then select Finish. You see a new tab opened for
the dataset. You also see the dataset in the tree view.
19. In Name, enter AzureBlobDataset.
20. Go to the Connection tab at the bottom of the Properties window. Next to Linked service, select + New.
21. In the New Linked Service window, take the following steps:
a. Under Name, enter AzureStorageLinkedService.
b. Under Storage account name, select your storage account.
c. To test the connection to your storage account, select Test connection.
d. Select Save.
22. You should be back in the window with the sink dataset open. On the Connection tab, take the following
steps:
a. In Linked service, confirm that AzureStorageLinkedService is selected.
b. For the folder/directory part of File path, enter adftutorial/fromonprem. If the output folder doesn't
exist in the adftutorial container, Data Factory automatically creates it.
c. For the file name part of File path, select Add dynamic content.
d. Enter @CONCAT(pipeline().RunId, '.txt'), and then select Finish. This expression names the output file after the pipeline run ID, with a .txt extension.
23. Go to the tab with the pipeline opened, or select the pipeline in the tree view. In Sink Dataset, confirm that
AzureBlobDataset is selected.
24. To validate the pipeline settings, select Validate on the toolbar for the pipeline. To close the Pipeline
Validation Report, select Close.
25. To publish entities you created to Data Factory, select Publish All.
26. Wait until you see the Publishing succeeded pop-up. To check the status of publishing, select the Show
Notifications link on the left. To close the notification window, select Close.
Trigger a pipeline run
Select Trigger on the toolbar for the pipeline, and then select Trigger Now.
Monitor the pipeline run
1. Go to the Monitor tab. You see the pipeline that you manually triggered in the previous step.
2. To view activity runs associated with the pipeline run, select the View Activity Runs link in the Actions
column. You see only one activity run because there is only one activity in the pipeline. To see details about the
copy operation, select the Details link (eyeglasses icon) in the Actions column. To go back to the Pipeline
Runs view, select Pipelines at the top.
In this tutorial, you use Azure PowerShell to create a data factory pipeline that copies data from an on-premises
SQL Server database to Azure Blob storage. You create and use a self-hosted integration runtime, which moves
data between on-premises and cloud data stores.
NOTE
This article does not provide a detailed introduction to the Data Factory service. For more information, see Introduction to
Azure Data Factory.
Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to log in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal, select your username at the top-right
corner, and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription. For sample instructions on adding a user to a role, see the Manage access using RBAC and the Azure
portal article.
SQL Server 2014, 2016, and 2017
In this tutorial, you use an on-premises SQL Server database as a source data store. The pipeline in the data
factory you create in this tutorial copies data from this on-premises SQL Server database (source) to Azure Blob
storage (sink). You then create a table named emp in your SQL Server database, and insert a couple of sample
entries into the table.
1. Start SQL Server Management Studio. If it is not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases, and then select New Database.
4. In the New Database window, enter a name for the database, and then select OK.
5. To create the emp table and insert some sample data into it, run the following query script against the
database:
6. In the tree view, right-click the database that you created, and then select New Query.
Azure Storage account
In this tutorial, you use a general-purpose Azure storage account (specifically, Azure Blob storage) as a
destination/sink data store. If you don't have a general-purpose Azure storage account, see Create a storage
account. The pipeline in the data factory that you create in this tutorial copies data from the on-premises SQL
Server database (source) to this Azure Blob storage (sink).
Get storage account name and account key
You use the name and key of your Azure storage account in this tutorial. Get the name and key of your storage
account by doing the following:
1. Sign in to the Azure portal with your Azure username and password.
2. In the left pane, select More services, filter by using the Storage keyword, and then select Storage
accounts.
3. In the list of storage accounts, filter for your storage account (if needed), and then select your storage
account.
4. In the Storage account window, select Access keys.
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
Create the adftutorial container
In this section, you create a blob container named adftutorial in your Azure Blob storage.
1. In the Storage account window, switch to Overview, and then select Blobs.
3. In the New container window, in the Name box, enter adftutorial, and then select OK.
Windows PowerShell
Install Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Install the latest version of Azure PowerShell if you don't already have it on your machine. For detailed
instructions, see How to install and configure Azure PowerShell.
Log in to PowerShell
1. Start PowerShell on your machine, and keep it open through completion of this quickstart tutorial. If you
close and reopen it, you'll need to run these commands again.
2. Run the following command, and then enter the Azure username and password that you use to sign in to
the Azure portal:
Connect-AzAccount
3. If you have multiple Azure subscriptions, run the following command to select the subscription that you
want to work with. Replace SubscriptionId with the ID of your Azure subscription:
$resourceGroupName = "ADFTutorialResourceGroup"
If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$resourceGroupName variable and run the command again.
3. Define a variable for the data factory name that you can use in PowerShell commands later. The name must
start with a letter or a number, and it can contain only letters, numbers, and the dash (-) character.
IMPORTANT
Update the data factory name with a globally unique name. An example is ADFTutorialFactorySP1127.
$dataFactoryName = "ADFTutorialFactory"
NOTE
The name of the data factory must be globally unique. If you receive the following error, change the name and try again.
The specified data factory name 'ADFv2TutorialDataFactory' is already in use. Data factory names
must be globally unique.
To create data factory instances, the user account that you use to sign in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on the
following page, and then expand Analytics to locate Data Factory: Products available by region. The data stores (Azure
Storage, Azure SQL Database, and so on) and computes (Azure HDInsight and so on) used by the data factory can be in
other regions.
$integrationRuntimeName = "ADFTutorialIR"
3. To retrieve the status of the created integration runtime, run the following command:
Nodes : {}
CreateTime : 9/14/2017 10:01:21 AM
InternalChannelEncryption :
Version :
Capabilities : {}
ScheduledUpdateDate :
UpdateDelayOffset :
LocalTimeZoneOffset :
AutoUpdate :
ServiceUrls : {eu.frontend.clouddatahub.net, *.servicebus.windows.net}
ResourceGroupName : <ResourceGroup name>
DataFactoryName : <DataFactory name>
Name : <Integration Runtime name>
State : NeedRegistration
4. To retrieve the authentication keys for registering the self-hosted integration runtime with the Data Factory
service in the cloud, run the following command. Copy one of the keys (excluding the quotation marks) for
registering the self-hosted integration runtime that you install on your machine in the next step.
{
"AuthKey1": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=",
"AuthKey2": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy="
}
When the self-hosted integration runtime is registered successfully, the following message is displayed:
10. In the New Integration Runtime (Self-hosted) Node window, select Next.
14. Test the connectivity to your SQL Server database by doing the following:
a. In the Configuration Manager window, switch to the Diagnostics tab.
b. In the Data source type box, select SqlServer.
c. Enter the server name.
d. Enter the database name.
e. Select the authentication mode.
f. Enter the username.
g. Enter the password that's associated with the username.
h. To confirm that the integration runtime can connect to the SQL Server instance, select Test.
If the connection is successful, a green checkmark icon is displayed. Otherwise, you'll receive an error
message associated with the failure. Fix any issues, and ensure that the integration runtime can connect to
your SQL Server instance.
Note all the preceding values for later use in this tutorial.
{
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>;EndpointSuffix=core.windows.net"
}
}
},
"name": "AzureStorageLinkedService"
}
LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService
If you receive a "file not found" error, confirm that the file exists by running the dir command. If the file
name has a .txt extension (for example, AzureStorageLinkedService.json.txt), remove it, and then run the
PowerShell command again.
Create and encrypt a SQL Server linked service (source)
In this step, you link your on-premises SQL Server instance to the data factory.
1. Create a JSON file named SqlServerLinkedService.json in the C:\ADFv2Tutorial folder by using the
following code:
IMPORTANT
Select the section that's based on the authentication that you use to connect to SQL Server.
{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Database=<database>;Integrated Security=True"
},
"userName": "<user> or <domain>\\<user>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
}
},
"name": "SqlServerLinkedService"
}
IMPORTANT
Select the section that's based on the authentication you use to connect to your SQL Server instance.
Replace <integration runtime name> with the name of your integration runtime.
Before you save the file, replace <servername>, <databasename>, <username>, and <password> with the
values of your SQL Server instance.
If you need to use a backslash (\) in your user account or server name, precede it with the escape character (\).
For example, use mydomain\\myuser.
2. To encrypt the sensitive data (username, password, and so on), run the
New-AzDataFactoryV2LinkedServiceEncryptedCredential cmdlet.
This encryption ensures that the credentials are encrypted using Data Protection Application Programming
Interface (DPAPI). The encrypted credentials are stored locally on the self-hosted integration runtime node
(local machine). The output payload can be redirected to another JSON file (in this case,
encryptedLinkedService.json) that contains encrypted credentials.
New-AzDataFactoryV2LinkedServiceEncryptedCredential -DataFactoryName $dataFactoryName -
ResourceGroupName $ResourceGroupName -IntegrationRuntimeName $integrationRuntimeName -File
".\SQLServerLinkedService.json" > encryptedSQLServerLinkedService.json
Create datasets
In this step, you create input and output datasets. They represent input and output data for the copy operation,
which copies data from the on-premises SQL Server database to Azure Blob storage.
Create a dataset for the source SQL Server database
In this step, you define a dataset that represents data in the SQL Server database instance. The dataset is of type
SqlServerTable. It refers to the SQL Server linked service that you created in the preceding step. The linked
service has the connection information that the Data Factory service uses to connect to your SQL Server instance
at runtime. This dataset specifies the SQL table in the database that contains the data. In this tutorial, the emp
table contains the source data.
1. Create a JSON file named SqlServerDataset.json in the C:\ADFv2Tutorial folder, with the following code:
{
"properties": {
"type": "SqlServerTable",
"typeProperties": {
"tableName": "dbo.emp"
},
"structure": [
{
"name": "ID",
"type": "String"
},
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"linkedServiceName": {
"referenceName": "EncryptedSqlServerLinkedService",
"type": "LinkedServiceReference"
}
},
"name": "SqlServerDataset"
}
{
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "adftutorial/fromonprem",
"format": {
"type": "TextFormat"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
},
"name": "AzureBlobDataset"
}
DatasetName : AzureBlobDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset
Create a pipeline
In this tutorial, you create a pipeline with a copy activity. The copy activity uses SqlServerDataset as the input
dataset and AzureBlobDataset as the output dataset. The source type is set to SqlSource and the sink type is set to
BlobSink.
1. Create a JSON file named SqlServerToBlobPipeline.json in the C:\ADFv2Tutorial folder, with the following
code:
{
"name": "SQLServerToBlobPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource"
},
"sink": {
"type":"BlobSink"
}
},
"name": "CopySqlServerToAzureBlobActivity",
"inputs": [
{
"referenceName": "SqlServerDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlobDataset",
"type": "DatasetReference"
}
]
}
]
}
}
PipelineName : SQLServerToBlobPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Activities : {CopySqlServerToAzureBlobActivity}
Parameters :
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
ActivityName : copy
PipelineRunId : 4ec8980c-62f6-466f-92fa-e69b10f33640
PipelineName : SQLServerToBlobPipeline
Input :
Output :
LinkedServiceName :
ActivityRunStart : 9/13/2017 1:35:22 PM
ActivityRunEnd : 9/13/2017 1:35:42 PM
DurationInMs : 20824
Status : Succeeded
Error : {errorCode, message, failureType, target}
2. You can get the run ID of pipeline SQLServerToBlobPipeline and check the detailed activity run result by
running the following command:
{
"dataRead": 36,
"dataWritten": 24,
"rowsCopied": 2,
"copyDuration": 3,
"throughput": 0.01171875,
"errors": [],
"effectiveIntegrationRuntime": "MyIntegrationRuntime",
"billedDuration": 3
}
Next steps
The pipeline in this sample copies data from an on-premises SQL Server database to Azure Blob storage. You learned how to:
Create a data factory.
Create a self-hosted integration runtime.
Create SQL Server and Azure Storage linked services.
Create SQL Server and Azure Blob datasets.
Create a pipeline with a copy activity to move the data.
Start a pipeline run.
Monitor the pipeline run.
For a list of data stores that are supported by Data Factory, see supported data stores.
To learn about copying data in bulk from a source to a destination, advance to the following tutorial:
Copy data in bulk
Copy multiple tables in bulk by using Azure Data
Factory
4/8/2019 • 14 minutes to read
This tutorial demonstrates copying a number of tables from Azure SQL Database to Azure SQL Data
Warehouse. You can apply the same pattern in other copy scenarios as well, for example, copying tables from
SQL Server or Oracle to Azure SQL Database, SQL Data Warehouse, or Azure Blob storage, or copying different
paths from Blob storage to Azure SQL Database tables.
NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory.
End-to-end workflow
In this scenario, you have a number of tables in Azure SQL Database that you want to copy to SQL Data
Warehouse. Here is the logical sequence of steps in the workflow that happens in pipelines:
The first pipeline looks up the list of tables that need to be copied over to the sink data store. Alternatively,
you can maintain a metadata table that lists all the tables to be copied to the sink data store. Then, the pipeline
triggers another pipeline, which iterates over each table in the database and performs the data copy operation.
The second pipeline performs the actual copy. It takes the list of tables as a parameter. For each table in the list,
it copies the table in Azure SQL Database to the corresponding table in SQL Data Warehouse by using staged
copy via Blob storage and PolyBase for the best performance. In this example, the first pipeline passes the list of
tables as a value for the parameter.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure Storage account. The Azure Storage account is used as staging blob storage in the bulk copy
operation.
Azure SQL Database. This database contains the source data.
Azure SQL Data Warehouse. This data warehouse holds the data copied over from the SQL Database.
Prepare SQL Database and SQL Data Warehouse
Prepare the source Azure SQL Database:
Create an Azure SQL Database with Adventure Works LT sample data following Create an Azure SQL database
article. This tutorial copies all the tables from this sample database to a SQL data warehouse.
Prepare the sink Azure SQL Data Warehouse:
1. If you don't have an Azure SQL Data Warehouse, see the Create a SQL Data Warehouse article for steps to
create one.
2. Create corresponding table schemas in SQL Data Warehouse. You can use Migration Utility to migrate
schema from Azure SQL Database to Azure SQL Data Warehouse. You use Azure Data Factory to
migrate/copy data in a later step.
The name of the Azure data factory must be globally unique. If you see the following error for the name
field, change the name of the data factory (for example, yournameADFTutorialBulkCopyDF ). See Data
Factory - Naming Rules article for naming rules for Data Factory artifacts.
4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version.
7. Select the location for the data factory. For a list of Azure regions in which Data Factory is currently
available, select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory: Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.
11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click the Author & Monitor tile to launch the Data Factory UI application in a separate tab.
13. On the get started page, switch to the Edit tab in the left panel as shown in the following image:
2. In the New Linked Service window, select Azure SQL Database, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for Name.
b. Select your Azure SQL server for Server name.
c. Select your Azure SQL database for Database name.
d. Enter the name of the user to connect to the Azure SQL database.
e. Enter the password for the user.
f. To test the connection to the Azure SQL database by using the specified information, click Test connection.
g. Click Save.
Create datasets
In this tutorial, you create source and sink datasets, which specify the location where the data is stored.
The input dataset AzureSqlDatabaseDataset refers to the AzureSqlDatabaseLinkedService. The linked
service specifies the connection string to connect to the database. The dataset specifies the name of the database
and the table that contains the source data.
The output dataset AzureSqlDWDataset refers to the AzureSqlDWLinkedService. The linked service specifies
the connection string to connect to the data warehouse. The dataset specifies the database and the table to which
the data is copied.
In this tutorial, the source and destination SQL tables are not hard-coded in the dataset definitions. Instead, the
ForEach activity passes the name of the table at runtime to the Copy activity.
Create a dataset for source SQL Database
1. Click + (plus) in the left pane, and click Dataset.
2. In the New Dataset window, select Azure SQL Database, and click Finish. You should see a new tab titled
AzureSqlTable1.
3. In the properties window at the bottom, enter AzureSqlDatabaseDataset for Name.
4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select any table for Table. This table is a dummy table. You specify a query on the source dataset
when you create a pipeline. The query is used to extract data from the Azure SQL database.
Alternatively, you can select the Edit check box and enter dummyName as the table name.
c. In the Add Dynamic Content page, click DWTableName under Parameters, which automatically
populates the top expression text box with @dataset().DWTableName, and then click Finish. The
tableName property of the dataset is set to the value that's passed as an argument for the
DWTableName parameter. The ForEach activity iterates through the list of tables and passes them one by one to
the Copy activity.
Create pipelines
In this tutorial, you create two pipelines: IterateAndCopySQLTables and GetTableListAndTriggerCopyData.
The GetTableListAndTriggerCopyData pipeline performs two steps:
Looks up the Azure SQL Database system table to get the list of tables to be copied.
Triggers the pipeline IterateAndCopySQLTables to do the actual data copy.
The IterateAndCopySQLTables takes a list of tables as a parameter. For each table in the list, it copies data from
the table in Azure SQL Database to Azure SQL Data Warehouse using staged copy and PolyBase.
Create the pipeline IterateAndCopySQLTables
1. In the left pane, click + (plus), and click Pipeline.
4. In the Activities toolbox, expand Iteration & Conditions, and drag-drop the ForEach activity to the
pipeline design surface. You can also search for activities in the Activities toolbox.
a. In the General tab at the bottom, enter IterateSQLTables for Name.
b. Switch to the Settings tab, click the input box for Items, and then click the Add dynamic content link below it.
c. In the Add Dynamic Content page, collapse the System Variables and Functions section, and click
tableList under Parameters, which automatically populates the top expression text box with
@pipeline().parameters.tableList. Then click Finish.
d. Switch to Activities tab, click Add activity to add a child activity to the ForEach activity.
5. In the Activities toolbox, expand DataFlow, and drag and drop the Copy activity onto the pipeline designer
surface. Notice the breadcrumb menu at the top: IterateAndCopySQLTables is the pipeline name, and
IterateSQLTables is the ForEach activity name. The designer is in the activity scope. To switch back to the
pipeline editor from the ForEach editor, click the link in the breadcrumb menu.
6. Switch to the Source tab, and do the following steps:
a. Select AzureSqlDatabaseDataset for Source Dataset.
b. Select the Query option for Use Query.
c. Click the Query input box, select Add dynamic content below it, enter the following
expression for Query, and then select Finish.
9. To validate the pipeline settings, click Validate on the pipeline toolbar at the top. Confirm that there is no
validation error. To close the Pipeline Validation Report, click >>.
Create the pipeline GetTableListAndTriggerCopyData
This pipeline performs two steps:
Looks up the Azure SQL Database system table to get the list of tables to be copied.
Triggers the pipeline "IterateAndCopySQLTables" to do the actual data copy.
1. In the left pane, click + (plus), and click Pipeline.
8. To validate the pipeline, click Validate on the toolbar. Confirm that there are no validation errors. To close
the Pipeline Validation Report, click >>.
9. To publish entities (datasets, pipelines, etc.) to the Data Factory service, click Publish All on top of the
window. Wait until the publishing succeeds.
2. To view activity runs associated with the GetTableListAndTriggerCopyData pipeline, click the first link in the
Actions column for that pipeline. You should see two activity runs for this pipeline run.
3. To view the output of the Lookup activity, click the link in the Output column for that activity. You can
maximize and restore the Output window. After reviewing it, click X to close the Output window.
{
"count": 9,
"value": [
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Customer"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductDescription"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Product"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductModelProductDescription"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductCategory"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Address"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "CustomerAddress"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "SalesOrderDetail"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "SalesOrderHeader"
}
],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)",
"effectiveIntegrationRuntimes": [
{
"name": "DefaultIntegrationRuntime",
"type": "Managed",
"location": "East US",
"billedDuration": 0,
"nodes": null
}
]
}
4. To switch back to the Pipeline Runs view, click the Pipelines link at the top. Click the View Activity Runs link
(the first link in the Actions column) for the IterateAndCopySQLTables pipeline. Notice that there is one
Copy activity run for each table in the Lookup activity output.
5. Confirm that the data was copied to the target SQL Data Warehouse you used in this tutorial.
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create Azure SQL Database, Azure SQL Data Warehouse, and Azure Storage linked services.
Create Azure SQL Database and Azure SQL Data Warehouse datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copying data incrementally from a source to a destination:
Copy data incrementally
Copy multiple tables in bulk by using Azure Data
Factory
3/5/2019 • 11 minutes to read
This tutorial demonstrates copying a number of tables from Azure SQL Database to Azure SQL Data
Warehouse. You can apply the same pattern in other copy scenarios as well, for example, copying tables from
SQL Server or Oracle to Azure SQL Database, SQL Data Warehouse, or Azure Blob storage, or copying different
paths from Blob storage to Azure SQL Database tables.
At a high level, this tutorial involves the following steps:
Create a data factory.
Create Azure SQL Database, Azure SQL Data Warehouse, and Azure Storage linked services.
Create Azure SQL Database and Azure SQL Data Warehouse datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
This tutorial uses Azure PowerShell. To learn about using other tools/SDKs to create a data factory, see
Quickstarts.
End-to-end workflow
In this scenario, we have a number of tables in Azure SQL Database that we want to copy to SQL Data
Warehouse. Here is the logical sequence of steps in the workflow that happens in pipelines:
The first pipeline looks up the list of tables that need to be copied over to the sink data store. Alternatively,
you can maintain a metadata table that lists all the tables to be copied to the sink data store. Then, the pipeline
triggers another pipeline, which iterates over each table in the database and performs the data copy operation.
The second pipeline performs the actual copy. It takes the list of tables as a parameter. For each table in the list,
it copies the table in Azure SQL Database to the corresponding table in SQL Data Warehouse by using staged
copy via Blob storage and PolyBase for the best performance. In this example, the first pipeline passes the list of
tables as a value for the parameter.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Azure Storage account. The Azure Storage account is used as staging blob storage in the bulk copy operation.
Azure SQL Database. This database contains the source data.
Azure SQL Data Warehouse. This data warehouse holds the data copied over from the SQL Database.
Prepare SQL Database and SQL Data Warehouse
Prepare the source Azure SQL Database:
Create an Azure SQL Database with Adventure Works LT sample data following Create an Azure SQL database
article. This tutorial copies all the tables from this sample database to a SQL data warehouse.
Prepare the sink Azure SQL Data Warehouse:
1. If you don't have an Azure SQL Data Warehouse, see the Create a SQL Data Warehouse article for steps to
create one.
2. Create corresponding table schemas in SQL Data Warehouse. You can use Migration Utility to migrate
schema from Azure SQL Database to Azure SQL Data Warehouse. You use Azure Data Factory to
migrate/copy data in a later step.
Connect-AzAccount
Run the following command to view all the subscriptions for this account:
Get-AzSubscription
Run the following command to select the subscription that you want to work with. Replace SubscriptionId
with the ID of your Azure subscription:
2. Run the Set-AzDataFactoryV2 cmdlet to create a data factory. Replace the placeholders with your own
values before you run the command.
The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory
names must be globally unique.
To create Data Factory instances, you must be a Contributor or Administrator of the Azure
subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest
you on the following page, and then expand Analytics to locate Data Factory: Products available by
region. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.)
used by data factory can be in other regions.
IMPORTANT
Replace <servername>, <databasename>, <username>@<servername> and <password> with values of your
Azure SQL Database before saving the file.
{
"name": "AzureSqlDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User
ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection
Timeout=30"
}
}
}
}
LinkedServiceName : AzureSqlDatabaseLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService
IMPORTANT
Replace <servername>, <databasename>, <username>@<servername> and <password> with values of your
Azure SQL Data Warehouse before saving the file.
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User
ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection
Timeout=30"
}
}
}
}
LinkedServiceName : AzureSqlDWLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDWLinkedService
IMPORTANT
Replace <accountName> and <accountKey> with the name and key of your Azure storage account before saving the
file.
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>"
}
}
}
}
LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService
Create datasets
In this tutorial, you create source and sink datasets, which specify the location where the data is stored:
Create a dataset for source SQL Database
1. Create a JSON file named AzureSqlDatabaseDataset.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content. The "tableName" value is a placeholder, because the copy activity later uses a SQL
query to retrieve the data.
{
"name": "AzureSqlDatabaseDataset",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "AzureSqlDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "dummy"
}
}
}
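A sketch of the deployment command, assuming the file was saved as AzureSqlDatabaseDataset.json as described above; the same pattern is used for the AzureSqlDWDataset defined next:
# Create the source dataset from the JSON definition
Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSqlDatabaseDataset" -DefinitionFile ".\AzureSqlDatabaseDataset.json"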
DatasetName : AzureSqlDatabaseDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
{
"name": "AzureSqlDWDataset",
"properties": {
"type": "AzureSqlDWTable",
"linkedServiceName": {
"referenceName": "AzureSqlDWLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": {
"value": "@{dataset().DWTableName}",
"type": "Expression"
}
},
"parameters":{
"DWTableName":{
"type":"String"
}
}
}
}
DatasetName : AzureSqlDWDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDwTableDataset
Create pipelines
In this tutorial, you create two pipelines:
Create the pipeline "IterateAndCopySQLTables"
This pipeline takes a list of tables as a parameter. For each table in the list, it copies data from the table in Azure
SQL Database to Azure SQL Data Warehouse using staged copy and PolyBase.
1. Create a JSON file named IterateAndCopySQLTables.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content:
{
"name": "IterateAndCopySQLTables",
"properties": {
"activities": [
{
"name": "IterateSQLTables",
"type": "ForEach",
"typeProperties": {
"isSequential": "false",
"items": {
"value": "@pipeline().parameters.tableList",
"type": "Expression"
},
"activities": [
{
"name": "CopyData",
"description": "Copy data from SQL database to SQL DW",
"type": "Copy",
"inputs": [
{
"referenceName": "AzureSqlDatabaseDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSqlDWDataset",
"type": "DatasetReference",
"parameters": {
"DWTableName": "[@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]"
}
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT * FROM [@{item().TABLE_SCHEMA}].
[@{item().TABLE_NAME}]"
},
"sink": {
"type": "SqlDWSink",
"preCopyScript": "TRUNCATE TABLE [@{item().TABLE_SCHEMA}].
[@{item().TABLE_NAME}]",
"allowPolyBase": true
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
}
]
}
}
],
"parameters": {
"tableList": {
"type": "Object"
}
}
}
}
2. To create the pipeline IterateAndCopySQLTables, run the Set-AzDataFactoryV2Pipeline cmdlet.
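A sketch of that command, assuming the JSON file from the previous step is in the current folder:
# Deploy the IterateAndCopySQLTables pipeline
Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "IterateAndCopySQLTables" -DefinitionFile ".\IterateAndCopySQLTables.json"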
PipelineName : IterateAndCopySQLTables
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {IterateSQLTables}
Parameters : {[tableList, Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}
2. Run the following script to continuously check the run status of the pipeline
GetTableListAndTriggerCopyData, and to print the final pipeline run and activity run results.
while ($True) {
$run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $DataFactoryName -PipelineRunId $runId
if ($run) {
if ($run.Status -ne 'InProgress') {
Write-Host "Pipeline run finished. The status is: " $run.Status -foregroundcolor "Yellow"
Write-Host "Pipeline run details:" -foregroundcolor "Yellow"
$run
break
}
Write-Host "Pipeline is running...status: InProgress" -foregroundcolor "Yellow"
}
Start-Sleep -Seconds 15
}
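The script assumes that $runId holds the run ID that was returned when the GetTableListAndTriggerCopyData pipeline was started. A minimal sketch of how that value would be captured, assuming the outer pipeline takes no parameters because its Lookup activity supplies the table list:
# Start the outer pipeline and keep its run ID for monitoring
$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineName "GetTableListAndTriggerCopyData"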
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
ActivityName : TriggerCopy
PipelineRunId : 0000000000-00000-0000-0000-000000000000
PipelineName : GetTableListAndTriggerCopyData
Input : {pipeline, parameters, waitOnCompletion}
Output : {pipelineRunId}
LinkedServiceName :
ActivityRunStart : 9/18/2017 4:07:11 PM
ActivityRunEnd : 9/18/2017 4:08:14 PM
DurationInMs : 62581
Status : Succeeded
Error : {errorCode, message, failureType, target}
3. You can get the run ID of the pipeline "IterateAndCopySQLTables", and check the detailed activity run result
as follows.
{
"pipelineRunId": "7514d165-14bf-41fb-b5fb-789bea6c9e58"
}
4. Connect to your sink Azure SQL Data Warehouse and confirm that data has been copied from Azure SQL
Database properly.
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create Azure SQL Database, Azure SQL Data Warehouse, and Azure Storage linked services.
Create Azure SQL Database and Azure SQL Data Warehouse datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copying data incrementally from a source to a destination:
Copy data incrementally
Incrementally load data from a source data store to a
destination data store
5/10/2019 • 2 minutes to read • Edit Online
In a data integration solution, incrementally (or delta) loading data after an initial full data load is a widely used
scenario. The tutorials in this section show you different ways of loading data incrementally by using Azure Data
Factory.
Loading new files only by using time partitioned folder or file name.
You can copy only the new files when the files or folders have already been time partitioned with time-slice information as
part of the file or folder name (for example, /yyyy/mm/dd/file.csv). It is the most performant approach for
incrementally loading new files.
For step-by-step instructions, see the following tutorial:
Incrementally copy new files based on time partitioned folder or file name from Azure Blob storage to Azure Blob
storage
Next steps
Advance to the following tutorial:
Incrementally copy data from one table in Azure SQL Database to Azure Blob storage
Incrementally load data from an Azure SQL database
to Azure Blob storage
3/26/2019 • 13 minutes to read • Edit Online
In this tutorial, you create an Azure data factory with a pipeline that loads delta data from a table in an Azure SQL
database to Azure Blob storage.
You perform the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
Review results
Add more data to the source.
Run the pipeline again.
Monitor the second pipeline run
Review results from the second run
Overview
Here is the high-level solution diagram:
Prerequisites
Azure SQL Database. You use the database as the source data store. If you don't have a SQL database, see
Create an Azure SQL database for steps to create one.
Azure Storage. You use the blob storage as the sink data store. If you don't have a storage account, see Create
a storage account for steps to create one. Create a container named adftutorial.
Create a data source table in your SQL database
1. Open SQL Server Management Studio. In Server Explorer, right-click the database, and choose New
Query.
2. Run the following SQL command against your SQL database to create a table named data_source_table as
the data source store:
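A minimal sketch of such a table, consistent with the sample rows shown later in this tutorial; only the LastModifytime column is named in the text, so the PersonID and Name columns are assumptions:
create table data_source_table
(
    PersonID int,            -- assumed identifier column
    Name varchar(255),       -- assumed text column (values such as 'aaaa')
    LastModifytime datetime  -- watermark column used in this tutorial
);

INSERT INTO data_source_table (PersonID, Name, LastModifytime)
VALUES
(1, 'aaaa', '9/1/2017 12:56:00 AM'),
(2, 'bbbb', '9/2/2017 5:23:00 AM'),
(3, 'cccc', '9/3/2017 2:36:00 AM'),
(4, 'dddd', '9/4/2017 3:21:00 AM'),
(5, 'eeee', '9/5/2017 8:06:00 AM');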
In this tutorial, you use LastModifytime as the watermark column. The data in the data source store is
shown in the following table:
Create another table in your SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to
store the watermark value:
create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime
);
2. Set the default value of the high watermark by inserting a row with the table name of the source data store. In this tutorial, the
table name is data_source_table.
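A sketch of that insert, matching the output shown below:
INSERT INTO watermarktable
VALUES ('data_source_table', '1/1/2010 12:00:00 AM');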
Output:
TableName | WatermarkValue
---------- | --------------
data_source_table | 2010-01-01 00:00:00.000
-- Stored procedure used later in this tutorial to update the watermark; parameter types are assumed
CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)
AS
BEGIN
UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName
END
The name of the Azure data factory must be globally unique. If you see a red exclamation mark with the
following error, change the name of the data factory (for example, yournameADFIncCopyTutorialDF ) and
try creating again. See Data Factory - Naming Rules article for naming rules for Data Factory artifacts.
4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down
list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.
11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) in a separate tab.
Create a pipeline
In this tutorial, you create a pipeline with two Lookup activities, one Copy activity, and one StoredProcedure
activity chained in one pipeline.
1. In the get started page of Data Factory UI, click the Create pipeline tile.
2. In the General page of the Properties window for the pipeline, enter IncrementalCopyPipeline for the name.
3. Let's add the first lookup activity to get the old watermark value. In the Activities toolbox, expand General,
and drag-drop the Lookup activity to the pipeline designer surface. Change the name of the activity to
LookupOldWaterMarkActivity.
4. Switch to the Settings tab, and click + New for Source Dataset. In this step, you create a dataset to
represent data in the watermarktable. This table contains the old watermark that was used in the previous
copy operation.
5. In the New Dataset window, select Azure SQL Database, and click Finish. You see a new tab opened for
the dataset.
6. In the properties window for the dataset, enter WatermarkDataset for Name.
7. Switch to the Connection tab, and click + New to make a connection (create a linked service) to your Azure
SQL database.
10. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking the name of the pipeline in
the tree view on the left. In the properties window for the Lookup activity, confirm that
WatermarkDataset is selected for the Source Dataset field.
11. In the Activities toolbox, expand General, and drag-drop another Lookup activity to the pipeline designer
surface, and set the name to LookupNewWaterMarkActivity in the General tab of the properties
window. This Lookup activity gets the new watermark value from the table with the source data to be copied
to the destination.
12. In the properties window for the second Lookup activity, switch to the Settings tab, and click New. You
create a dataset to point to the source table that contains the new watermark value (maximum value of
LastModifyTime).
13. In the New Dataset window, select Azure SQL Database, and click Finish. You see a new tab opened for
this dataset. You also see the dataset in the tree view.
14. In the General tab of the properties window, enter SourceDataset for Name.
16. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking the name of the pipeline in
the tree view on the left. In the properties window for the Lookup activity, confirm that SourceDataset is
selected for the Source Dataset field.
17. Select Query for the Use Query field, and enter the following query. You select only the maximum
value of LastModifytime from the data_source_table. If you don't use this query, the dataset gets all the
rows from the table, because you specified the table name (data_source_table) in the dataset definition.
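A sketch of that query; the NewWatermarkvalue alias matches the name the downstream activities reference later in this article:
select MAX(LastModifytime) as NewWatermarkvalue from data_source_table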
20. Select the Copy activity and confirm that you see the properties for the activity in the Properties window.
21. Switch to the Source tab in the Properties window, and do the following steps:
a. Select SourceDataset for the Source Dataset field.
b. Select Query for the Use Query field.
c. Enter the following SQL query for the Query field.
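A sketch of that query, using the watermark values returned by the two Lookup activities; the same expression appears in the JSON pipeline definition later in this article:
select * from data_source_table
where LastModifytime > '@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}'
and LastModifytime <= '@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'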
23. In this tutorial, the sink data store is of type Azure Blob Storage. Therefore, select Azure Blob Storage, and click
Finish in the New Dataset window.
24. In the General tab of the Properties window for the dataset, enter SinkDataset for Name.
25. Switch to the Connection tab, and click + New. In this step, you create a connection (linked service) to your
Azure Blob storage.
30. Select the Stored Procedure activity in the pipeline designer, and change its name to
StoredProceduretoWriteWatermarkActivity.
31. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked service.
32. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name, select usp_write_watermark.
b. To specify values for the stored procedure parameters, click Import parameter, and enter the following
values for the parameters:
34. Publish entities (linked services, datasets, and pipelines) to the Azure Data Factory service by selecting the
Publish All button. Wait until you see a message that the publishing succeeded.
Trigger a pipeline run
1. Click Trigger on the toolbar, and click Trigger Now.
2. To view activity runs associated with this pipeline run, click the first link (View Activity Runs) in the
Actions column. You can go back to the previous view by clicking Pipelines at the top. Click Refresh
button to refresh the list.
2. Open the output file and notice that all the data is copied from the data_source_table to the blob file.
1,aaaa,2017-09-01 00:56:00.0000000
2,bbbb,2017-09-02 05:23:00.0000000
3,cccc,2017-09-03 02:36:00.0000000
4,dddd,2017-09-04 03:21:00.0000000
5,eeee,2017-09-05 08:06:00.0000000
3. Check the latest value from watermarktable . You see that the watermark value was updated.
TABLENAME WATERMARKVALUE
2. To view activity runs associated with this pipeline run, click the first link (View Activity Runs) in the
Actions column. You can go back to the previous view by clicking Pipelines at the top. Click Refresh
button to refresh the list.
6,newdata,2017-09-06 02:23:00.0000000
7,newdata,2017-09-07 09:01:00.0000000
2. Check the latest value from watermarktable . You see that the watermark value was updated again.
sample output:
TABLENAME WATERMARKVALUE
Next steps
You performed the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
Review results
Add more data to the source.
Run the pipeline again.
Monitor the second pipeline run
Review results from the second run
In this tutorial, the pipeline copied data from a single table in a SQL database to Blob storage. Advance to the
following tutorial to learn how to copy data from multiple tables in an on-premises SQL Server database to a SQL
database.
Incrementally load data from multiple tables in SQL Server to Azure SQL Database
Incrementally load data from an Azure SQL database
to Azure Blob storage
3/14/2019 • 13 minutes to read • Edit Online
In this tutorial, you create an Azure data factory with a pipeline that loads delta data from a table in an Azure SQL
database to Azure Blob storage.
You perform the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
Overview
Here is the high-level solution diagram:
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure SQL Database. You use the database as the source data store. If you don't have a SQL database, see
Create an Azure SQL database for steps to create one.
Azure Storage. You use the blob storage as the sink data store. If you don't have a storage account, see Create
a storage account for steps to create one. Create a container named adftutorial.
Azure PowerShell. Follow the instructions in Install and configure Azure PowerShell.
Create a data source table in your SQL database
1. Open SQL Server Management Studio. In Server Explorer, right-click the database, and choose New
Query.
2. Run the following SQL command against your SQL database to create a table named data_source_table
as the data source store:
In this tutorial, you use LastModifytime as the watermark column. The data in the data source store is
shown in the following table:
Create another table in your SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to
store the watermark value:
create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime
);
2. Set the default value of the high watermark by inserting a row with the table name of the source data store. In this tutorial, the
table name is data_source_table.
Output:
TableName | WatermarkValue
---------- | --------------
data_source_table | 2010-01-01 00:00:00.000
-- Stored procedure used later in this tutorial to update the watermark; parameter types are assumed
CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)
AS
BEGIN
UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName
END
$resourceGroupName = "ADFTutorialResourceGroup";
If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.
IMPORTANT
Update the data factory name to make it globally unique. An example is ADFTutorialFactorySP1127.
$dataFactoryName = "ADFIncCopyTutorialFactory";
The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory names must
be globally unique.
To create Data Factory instances, the user account you use to sign in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (Storage, SQL Database, etc.) and computes (Azure HDInsight, etc.) used by the data
factory can be in other regions.
LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService
{
"name": "AzureSQLDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"value": "Server = tcp:<server>.database.windows.net,1433;Initial Catalog=<database>; Persist
Security Info=False; User ID=<user> ; Password=<password>; MultipleActiveResultSets = False; Encrypt =
True; TrustServerCertificate = False; Connection Timeout = 30;",
"type": "SecureString"
}
}
}
}
LinkedServiceName : AzureSQLDatabaseLinkedService
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService
ProvisioningState :
Create datasets
In this step, you create datasets to represent source and sink data.
Create a source dataset
1. Create a JSON file named SourceDataset.json in the same folder with the following content:
{
"name": "SourceDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "data_source_table"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
In this tutorial, you use the table name data_source_table. Replace it if you use a table with a different name.
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset SourceDataset.
DatasetName : SourceDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
IMPORTANT
This snippet assumes that you have a blob container named adftutorial in your blob storage. Create the container if
it doesn't exist, or set it to the name of an existing one. The output folder incrementalcopy is automatically
created if it doesn't exist in the container. In this tutorial, the file name is dynamically generated by using the
expression @CONCAT('Incremental-', pipeline().RunId, '.txt') .
DatasetName : SinkDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset
{
"name": " WatermarkDataset ",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "watermarktable"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset WatermarkDataset.
DatasetName : WatermarkDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
Create a pipeline
In this tutorial, you create a pipeline with two Lookup activities, one Copy activity, and one StoredProcedure
activity chained in one pipeline.
1. Create a JSON file IncrementalCopyPipeline.json in the same folder with the following content:
{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [
{
"name": "LookupOldWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from watermarktable"
},
"dataset": {
"referenceName": "WatermarkDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupNewWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select MAX(LastModifytime) as NewWatermarkvalue from data_source_table"
},
"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},
{
"name": "IncrementalCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from data_source_table where LastModifytime >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and LastModifytime <=
'@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'"
},
"sink": {
"type": "BlobSink"
}
},
"dependsOn": [
{
"activity": "LookupNewWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupOldWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}
]
},
{
"name": "StoredProceduretoWriteWatermarkActivity",
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "usp_write_watermark",
"storedProcedureParameters": {
"LastModifiedtime": {"value":
"@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}", "type": "datetime" },
"TableName": { "value":"@{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}",
"type":"String"}
}
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"dependsOn": [
{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
]
}
}
2. Run the Set-AzDataFactoryV2Pipeline cmdlet to create the pipeline IncrementalCopyPipeline.
PipelineName : IncrementalCopyPipeline
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Activities : {LookupOldWaterMarkActivity, LookupNewWaterMarkActivity, IncrementalCopyActivity,
StoredProceduretoWriteWatermarkActivity}
Parameters :
2. Check the status of the pipeline by running the Get-AzDataFactoryV2ActivityRun cmdlet until you see
all the activities running successfully. Replace placeholders with your own appropriate time for the
parameters RunStartedAfter and RunStartedBefore. In this tutorial, you use -RunStartedAfter
"2017/09/14" and -RunStartedBefore "2017/09/15".
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupOldWaterMarkActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {TableName, WatermarkValue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:42:42 AM
ActivityRunEnd : 9/14/2017 7:43:07 AM
DurationInMs : 25437
Status : Succeeded
Error : {errorCode, message, failureType, target}
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : IncrementalCopyActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {source, sink}
Output : {dataRead, dataWritten, rowsCopied, copyDuration...}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:43:10 AM
ActivityRunEnd : 9/14/2017 7:43:29 AM
DurationInMs : 19769
Status : Succeeded
Error : {errorCode, message, failureType, target}
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : StoredProceduretoWriteWatermarkActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {storedProcedureName, storedProcedureParameters}
Output : {}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:43:32 AM
ActivityRunEnd : 9/14/2017 7:43:47 AM
DurationInMs : 14467
Status : Succeeded
Error : {errorCode, message, failureType, target}
2. Check the latest value from watermarktable . You see that the watermark value was updated.
TABLENAME WATERMARKVALUE
Insert data into the data source store to verify delta data loading
1. Insert new data into the SQL database (data source store).
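A sketch of such an insert, using the assumed data_source_table schema from earlier in this article; the specific values are illustrative:
INSERT INTO data_source_table
VALUES (6, 'newdata', '9/6/2017 2:23:00 AM'),
       (7, 'newdata', '9/7/2017 9:01:00 AM');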
3. Check the status of the pipeline by running the Get-AzDataFactoryV2ActivityRun cmdlet until you see
all the activities running successfully. Replace placeholders with your own appropriate time for the
parameters RunStartedAfter and RunStartedBefore. In this tutorial, you use -RunStartedAfter
"2017/09/14" and -RunStartedBefore "2017/09/15".
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupOldWaterMarkActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {TableName, WatermarkValue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:52:26 AM
ActivityRunEnd : 9/14/2017 8:52:52 AM
DurationInMs : 25497
Status : Succeeded
Error : {errorCode, message, failureType, target}
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : IncrementalCopyActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {source, sink}
Output : {dataRead, dataWritten, rowsCopied, copyDuration...}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:53:00 AM
ActivityRunEnd : 9/14/2017 8:53:20 AM
DurationInMs : 20194
Status : Succeeded
Error : {errorCode, message, failureType, target}
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : StoredProceduretoWriteWatermarkActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {storedProcedureName, storedProcedureParameters}
Output : {}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:53:23 AM
ActivityRunEnd : 9/14/2017 8:53:41 AM
DurationInMs : 18502
Status : Succeeded
Error : {errorCode, message, failureType, target}
4. In the blob storage, you see that another file was created. In this tutorial, the new file name is
Incremental-2fc90ab8-d42c-4583-aa64-755dba9925d7.txt . Open that file, and you see two rows of records in it.
5. Check the latest value from watermarktable . You see that the watermark value was updated again.
TABLENAME WATERMARKVALUE
Next steps
You performed the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
In this tutorial, the pipeline copied data from a single table in a SQL database to Blob storage. Advance to the
following tutorial to learn how to copy data from multiple tables in an on-premises SQL Server database to a SQL
database.
Incrementally load data from multiple tables in SQL Server to Azure SQL Database
Incrementally load data from multiple tables in SQL Server to an
Azure SQL database
4/14/2019 • 17 minutes to read • Edit Online
In this tutorial, you create an Azure data factory with a pipeline that loads delta data from multiple tables in on-premises SQL Server to an
Azure SQL database.
You perform the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime.
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.
Overview
Here are the important steps to create this solution:
1. Select the watermark column.
Select one column for each table in the source data store, which can be used to identify the new or updated records for every run.
Normally, the data in this selected column (for example, last_modify_time or ID ) keeps increasing when rows are created or updated. The
maximum value in this column is used as a watermark.
2. Prepare a data store to store the watermark value.
In this tutorial, you store the watermark value in a SQL database.
3. Create a pipeline with the following activities:
a. Create a ForEach activity that iterates through a list of source table names that is passed as a parameter to the pipeline. For each source
table, it invokes the following activities to perform delta loading for that table.
b. Create two lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the second Lookup activity to
retrieve the new watermark value. These watermark values are passed to the Copy activity.
c. Create a Copy activity that copies rows from the source data store with the value of the watermark column greater than the old
watermark value and less than the new watermark value. Then, it copies the delta data from the source data store to Azure Blob storage
as a new file.
d. Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next time.
Here is the high-level solution diagram:
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
SQL Server. You use an on-premises SQL Server database as the source data store in this tutorial.
Azure SQL Database. You use a SQL database as the sink data store. If you don't have a SQL database, see Create an Azure SQL database
for steps to create one.
Create source tables in your SQL Server database
1. Open SQL Server Management Studio, and connect to your on-premises SQL Server database.
2. In Server Explorer, right-click the database and choose New Query.
3. Run the following SQL command against your database to create tables named customer_table and project_table :
Create another table in the Azure SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to store the watermark value:
create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime
);
2. Insert initial watermark values for both source tables into the watermark table.
INSERT INTO watermarktable
VALUES
('customer_table','1/1/2010 12:00:00 AM'),
('project_table','1/1/2010 12:00:00 AM');
-- Stored procedure used later in this tutorial to update the watermark; parameter types are assumed
CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)
AS
BEGIN
UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName
END
Create data types and additional stored procedures in Azure SQL database
Run the following query to create two stored procedures and two data types in your SQL database. They're used to merge the data from source
tables into destination tables.
To keep this tutorial easy to start with, these stored procedures receive the delta data through a table variable and then
merge it into the destination store. Be aware that this approach does not expect a large number of delta rows (more than 100) to be stored in the
table variable.
If you do need to merge a large number of delta rows into the destination store, we suggest that you use a copy activity to copy all the delta data
into a temporary staging table in the destination store first, and then build your own stored procedure, without a table variable, to merge
the rows from the staging table into the final table.
CREATE TYPE DataTypeforCustomerTable AS TABLE(
PersonID int,
Name varchar(255),
LastModifytime datetime
);
GO
CREATE PROCEDURE usp_upsert_customer_table @customer_table DataTypeforCustomerTable READONLY
AS
BEGIN
MERGE customer_table AS target
USING @customer_table AS source
ON (target.PersonID = source.PersonID)
WHEN MATCHED THEN
UPDATE SET Name = source.Name,LastModifytime = source.LastModifytime
WHEN NOT MATCHED THEN
INSERT (PersonID, Name, LastModifytime)
VALUES (source.PersonID, source.Name, source.LastModifytime);
END
GO
GO
CREATE PROCEDURE usp_upsert_project_table @project_table DataTypeforProjectTable READONLY
AS
BEGIN
MERGE project_table AS target
USING @project_table AS source
ON (target.Project = source.Project)
WHEN MATCHED THEN
UPDATE SET Creationtime = source.Creationtime
WHEN NOT MATCHED THEN
INSERT (Project, Creationtime)
VALUES (source.Project, source.Creationtime);
END
The name of the Azure data factory must be globally unique. If you receive the following error, change the name of the data factory (for
example, yournameADFMultiIncCopyTutorialDF ) and try creating again. See Data Factory - Naming Rules article for naming rules for
Data Factory artifacts.
4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 (Preview) for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down list. The data stores (Azure
Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.
11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click Author & Monitor tile to launch Azure Data Factory user interface (UI) in a separate tab.
13. In the get started page of Azure Data Factory UI, click Create pipeline (or) switch to the Edit tab.
Create self-hosted integration runtime
As you are moving data from a data store in a private network (on-premises) to an Azure data store, install a self-hosted integration runtime (IR )
in your on-premises environment. The self-hosted IR moves data between your private network and Azure.
1. Click Connections at the bottom of the left pane, and switch to the Integration Runtimes in the Connections window.
3. In the Integration Runtime Setup window, select Perform data movement and dispatch activities to external computes, and
click Next.
4. Select Private Network, and click Next.
8. In the Web browser, in the Integration Runtime Setup window, click Finish.
9. Confirm that you see MySelfHostedIR in the list of integration runtimes.
2. In the New Linked Service window, select Azure SQL Database, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for Name.
b. For Server name, select the name of your Azure SQL server from the drop-down list.
c. For Database name, select the Azure SQL database in which you created customer_table and project_table as part of the
prerequisites.
d. For User name, enter the name of user that has access to the Azure SQL database.
e. For Password, enter the password for the user.
f. To test whether Data Factory can connect to your SQL Server database, click Test connection. Fix any errors until the connection
succeeds.
g. To save the linked service, click Save.
4. Switch to the Connection tab in the Properties window, and select SqlServerLinkedService for Linked service. You do not select a
table here. The Copy activity in the pipeline uses a SQL query to load the data rather than load the entire table.
Create a sink dataset
1. In the left pane, click + (plus), and click Dataset.
2. In the New Dataset window, select Azure SQL Database, and click Finish.
3. You see a new tab opened in the Web browser for configuring the dataset. You also see a dataset in the treeview. In the General tab of the
Properties window at the bottom, enter SinkDataset for Name.
4. Switch to the Parameters tab in the Properties window, and do the following steps:
a. Click New in the Create/update parameters section.
b. Enter SinkTableName for the name, and String for the type. This dataset takes SinkTableName as a parameter. The
SinkTableName parameter is set by the pipeline dynamically at runtime. The ForEach activity in the pipeline iterates through a list
of table names and passes the table name to this dataset in each iteration.
5. Switch to the Connection tab in the Properties window, and select AzureSqlDatabaseLinkedService for Linked service. For the Table property,
click Add dynamic content.
2. In the New Dataset window, select Azure SQL Database, and click Finish.
3. In the General tab of the Properties window at the bottom, enter WatermarkDataset for Name.
4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select [dbo].[watermarktable] for Table.
Create a pipeline
The pipeline takes a list of table names as a parameter. The ForEach activity iterates through the list of table names and performs the following
operations:
1. Use the Lookup activity to retrieve the old watermark value (the initial value or the one that was used in the last iteration).
2. Use the Lookup activity to retrieve the new watermark value (the maximum value of the watermark column in the source table).
3. Use the Copy activity to copy data between these two watermark values from the source database to the destination database.
4. Use the StoredProcedure activity to update the old watermark value to be used in the first step of the next iteration.
Create the pipeline
1. In the left pane, click + (plus), and click Pipeline.
2. In the General tab of the Properties window, enter IncrementalCopyPipeline for Name.
4. In the Activities toolbox, expand Iteration & Conditionals, and drag-drop the ForEach activity to the pipeline designer surface. In the
General tab of the Properties window, enter IterateSQLTables.
5. Switch to the Settings tab in the Properties window, and enter @pipeline().parameters.tableList for Items. The ForEach activity
iterates through a list of tables and performs the incremental copy operation.
6. Select the ForEach activity in the pipeline if it isn't already selected. Click the Edit (Pencil icon) button.
7. In the Activities toolbox, expand General, drag-drop the Lookup activity to the pipeline designer surface, and enter
LookupOldWaterMarkActivity for Name.
8. Switch to the Settings tab of the Properties window, and do the following steps:
a. Select WatermarkDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.
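A sketch of that query; it retrieves the watermark row for the table handled by the current ForEach iteration, and the same query appears in the JSON version of this pipeline later in this article:
select * from watermarktable where TableName = '@{item().TABLE_NAME}'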
11. Drag-drop the Copy activity from the Activities toolbox, and enter IncrementalCopyActivity for Name.
12. Connect Lookup activities to the Copy activity one by one. To connect, start dragging at the green box attached to the Lookup activity
and drop it on the Copy activity. Release the mouse button when the border color of the Copy activity changes to blue.
13. Select the Copy activity in the pipeline. Switch to the Source tab in the Properties window.
a. Select SourceDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.
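A sketch of that query, assembled from the pipeline parameters used in this tutorial: TABLE_NAME and WaterMark_Column come from the tableList items, and the two Lookup activities supply the old and new watermark values. Treat the exact text as an assumption:
select * from @{item().TABLE_NAME}
where @{item().WaterMark_Column} > '@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}'
and @{item().WaterMark_Column} <= '@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'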
17. Select the Stored Procedure activity in the pipeline, and enter StoredProceduretoWriteWatermarkActivity for Name in the General
tab of the Properties window.
18. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked Service.
19. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name, select usp_write_watermark .
b. Select Import parameter.
c. Specify the following values for the parameters:
20. In the left pane, click Publish. This action publishes the entities you created to the Data Factory service.
21. Wait until you see the Successfully published message. To see the notifications, click the Show Notifications link. Close the
notifications window by clicking X.
Run the pipeline
1. On the toolbar for the pipeline, click Trigger, and click Trigger Now.
2. In the Pipeline Run window, enter the following value for the tableList parameter, and click Finish.
[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]
2. Click View Activity Runs link in the Actions column. You see all the activity runs associated with the selected pipeline run.
Review the results
In SQL Server Management Studio, run the following queries against the target SQL database to verify that the data was copied from source
tables to destination tables:
Query
Output
===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 Alice 2017-09-03 02:36:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000
Query
Output
===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
Query
Output
======================================
TableName WatermarkValue
======================================
customer_table 2017-09-05 08:06:00.000
project_table 2017-03-04 05:16:00.000
Notice that the watermark values for both tables were updated.
Add more data to the source tables
Run the following query against the source SQL Server database to update an existing row in customer_table. Insert a new row into
project_table.
UPDATE customer_table
SET [LastModifytime] = '2017-09-08T00:00:00Z', [name]='NewName' where [PersonID] = 3
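A sketch of the corresponding insert for project_table; the row matches the NewProject entry shown in the query output below:
INSERT INTO project_table (Project, Creationtime)
VALUES ('NewProject', '10/1/2017 12:00:00 AM');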
3. In the Pipeline Run window, enter the following value for the tableList parameter, and click Finish.
[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]
Output
===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 NewName 2017-09-08 00:00:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000
Notice the new values of Name and LastModifytime for the row with PersonID 3.
Query
Output
===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
NewProject 2017-10-01 00:00:00.000
Output
======================================
TableName WatermarkValue
======================================
customer_table 2017-09-08 00:00:00.000
project_table 2017-10-01 00:00:00.000
Notice that the watermark values for both tables were updated.
Next steps
You performed the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime (IR ).
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.
Advance to the following tutorial to learn about transforming data by using a Spark cluster on Azure:
Incrementally load data from Azure SQL Database to Azure Blob storage by using Change Tracking technology
Incrementally load data from multiple tables in SQL
Server to an Azure SQL database
4/8/2019 • 18 minutes to read • Edit Online
In this tutorial, you create an Azure data factory with a pipeline that loads delta data from multiple tables in on-
premises SQL Server to an Azure SQL database.
You perform the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime.
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.
Overview
Here are the important steps to create this solution:
1. Select the watermark column. Select one column for each table in the source data store, which can be
used to identify the new or updated records for every run. Normally, the data in this selected column (for
example, last_modify_time or ID ) keeps increasing when rows are created or updated. The maximum value
in this column is used as a watermark.
2. Prepare a data store to store the watermark value.
In this tutorial, you store the watermark value in a SQL database.
3. Create a pipeline with the following activities:
a. Create a ForEach activity that iterates through a list of source table names that is passed as a parameter
to the pipeline. For each source table, it invokes the following activities to perform delta loading for that
table.
b. Create two lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the
second Lookup activity to retrieve the new watermark value. These watermark values are passed to the
Copy activity.
c. Create a Copy activity that copies rows from the source data store with the value of the watermark
column greater than the old watermark value and less than the new watermark value. Then, it copies the
delta data from the source data store to Azure Blob storage as a new file.
d. Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next time.
Here is the high-level solution diagram:
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
SQL Server. You use an on-premises SQL Server database as the source data store in this tutorial.
Azure SQL Database. You use a SQL database as the sink data store. If you don't have a SQL database, see
Create an Azure SQL database for steps to create one.
Create source tables in your SQL Server database
1. Open SQL Server Management Studio, and connect to your on-premises SQL Server database.
2. In Server Explorer, right-click the database and choose New Query.
3. Run the following SQL command against your database to create tables named customer_table and
project_table :
Create another table in the Azure SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to
store the watermark value:
create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime
);
2. Insert initial watermark values for both source tables into the watermark table.
-- Stored procedure used later in this tutorial to update the watermark; parameter types are assumed
CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)
AS
BEGIN
UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName
END
Create data types and additional stored procedures in the Azure SQL database
Run the following query to create two stored procedures and two data types in your SQL database. They're used
to merge the data from source tables into destination tables.
To keep this tutorial easy to start with, these stored procedures receive the delta data through a table variable and
then merge it into the destination store. Be aware that this approach does not expect a large
number of delta rows (more than 100) to be stored in the table variable.
If you do need to merge a large number of delta rows into the destination store, we suggest that you use a copy
activity to copy all the delta data into a temporary staging table in the destination store first, and then build your
own stored procedure, without a table variable, to merge the rows from the staging table into the final table.
GO
CREATE PROCEDURE usp_upsert_customer_table @customer_table DataTypeforCustomerTable READONLY
AS
BEGIN
MERGE customer_table AS target
USING @customer_table AS source
ON (target.PersonID = source.PersonID)
WHEN MATCHED THEN
UPDATE SET Name = source.Name,LastModifytime = source.LastModifytime
WHEN NOT MATCHED THEN
INSERT (PersonID, Name, LastModifytime)
VALUES (source.PersonID, source.Name, source.LastModifytime);
END
GO
GO
CREATE PROCEDURE usp_upsert_project_table @project_table DataTypeforProjectTable READONLY
AS
BEGIN
MERGE project_table AS target
USING @project_table AS source
ON (target.Project = source.Project)
WHEN MATCHED THEN
UPDATE SET Creationtime = source.Creationtime
WHEN NOT MATCHED THEN
INSERT (Project, Creationtime)
VALUES (source.Project, source.Creationtime);
END
Azure PowerShell
Install the latest Azure PowerShell modules by following the instructions in Install and configure Azure
PowerShell.
$resourceGroupName = "ADFTutorialResourceGroup";
If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.
IMPORTANT
Update the data factory name to make it globally unique. An example is ADFIncMultiCopyTutorialFactorySP1127.
$dataFactoryName = "ADFIncMultiCopyTutorialFactory";
The specified Data Factory name 'ADFIncMultiCopyTutorialFactory' is already in use. Data Factory names
must be globally unique.
To create Data Factory instances, the user account you use to sign in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on
the following page, and then expand Analytics to locate Data Factory: Products available by region. The
data stores (Azure Storage, SQL Database, etc.) and computes (Azure HDInsight, etc.) used by the data
factory can be in other regions.
$integrationRuntimeName = "ADFTutorialIR"
Id : /subscriptions/<subscription
ID>/resourceGroups/ADFTutorialResourceGroup/providers/Microsoft.DataFactory/factories/onpremdf0914/inte
grationruntimes/myonpremirsp0914
Type : SelfHosted
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Name : myonpremirsp0914
Description :
3. To retrieve the status of the created integration runtime, run the following command. Confirm that the value
of the State property is set to NeedRegistration.
Nodes : {}
CreateTime : 9/14/2017 10:01:21 AM
InternalChannelEncryption :
Version :
Capabilities : {}
ScheduledUpdateDate :
UpdateDelayOffset :
LocalTimeZoneOffset :
AutoUpdate :
ServiceUrls : {eu.frontend.clouddatahub.net, *.servicebus.windows.net}
ResourceGroupName : <ResourceGroup name>
DataFactoryName : <DataFactory name>
Name : <Integration Runtime name>
State : NeedRegistration
4. To retrieve the authentication keys used to register the self-hosted integration runtime with Azure Data
Factory service in the cloud, run the following command:
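A sketch of that command, shown with the Az cmdlet name; substitute the AzureRm prefix if you are following this tutorial with the AzureRM module:
# Retrieve the authentication keys for the self-hosted integration runtime
Get-AzDataFactoryV2IntegrationRuntimeKey -Name $integrationRuntimeName -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName | ConvertTo-Json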
5. Copy one of the keys (exclude the double quotation marks) used to register the self-hosted integration
runtime that you install on your machine in the following steps.
11. When the self-hosted integration runtime is registered successfully, you see the following message:
12. On the New Integration Runtime (Self-hosted) Node page, select Next.
13. On the Intranet Communication Channel page, select Skip, or select a TLS/SSL certificate if you want to secure
intranode communication in a multinode integration runtime environment.
14. On the Register Integration Runtime (Self-hosted) page, select Launch Configuration Manager.
15. When the node is connected to the cloud service, you see the following page:
NOTE
Make a note of the values for authentication type, server, database, user, and password. You use them later in this
tutorial.
{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<servername>;Database=<databasename>;User ID=<username>;Password=
<password>;Timeout=60"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
}
},
"name": "SqlServerLinkedService"
}
{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Database=<database>;Integrated Security=True"
},
"userName": "<user> or <domain>\\<user>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
}
},
"name": "SqlServerLinkedService"
}
IMPORTANT
Select the right section based on the authentication you use to connect to SQL Server.
Replace <integration runtime name> with the name of your integration runtime.
Replace <servername>, <databasename>, <username>, and <password> with values of your SQL Server
database before you save the file.
If you need to use a backslash character ( \ ) in your user account or server name, escape it with another backslash ( \\ ). An
example is mydomain\\myuser .
LinkedServiceName : SqlServerLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Properties : Microsoft.Azure.Management.DataFactory.Models.SqlServerLinkedService
{
"name": "AzureSQLDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"value": "Server = tcp:<server>.database.windows.net,1433;Initial Catalog=<database name>; Persist
Security Info=False; User ID=<user name>; Password=<password>; MultipleActiveResultSets = False;
Encrypt = True; TrustServerCertificate = False; Connection Timeout = 30;",
"type": "SecureString"
}
}
}
}
LinkedServiceName : AzureSQLDatabaseLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService
Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
watermark.
Create a source dataset
1. Create a JSON file named SourceDataset.json in the same folder with the following content:
{
"name": "SourceDataset",
"properties": {
"type": "SqlServerTable",
"typeProperties": {
"tableName": "dummyName"
},
"linkedServiceName": {
"referenceName": "SqlServerLinkedService",
"type": "LinkedServiceReference"
}
}
}
The table name is a dummy name. The Copy activity in the pipeline uses a SQL query to load the data
rather than load the entire table.
2. Run the Set-AzureRmDataFactoryV2Dataset cmdlet to create the dataset SourceDataset.
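A minimal sketch of that cmdlet call, assuming the JSON file from step 1 is in the current folder:
Set-AzureRmDataFactoryV2Dataset -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -Name "SourceDataset" `
    -DefinitionFile ".\SourceDataset.json"
The SinkDataset and WatermarkDataset definitions that follow are deployed with the same cmdlet; only the -Name and -DefinitionFile values change.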
DatasetName : SourceDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.SqlServerTableDataset
{
"name": "SinkDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": {
"value": "@{dataset().SinkTableName}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"SinkTableName": {
"type": "String"
}
}
}
}
DatasetName : SinkDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
{
"name": " WatermarkDataset ",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "watermarktable"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
DatasetName : WatermarkDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : <data factory name>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
Create a pipeline
The pipeline takes a list of table names as a parameter. The ForEach activity iterates through the list of table names
and performs the following operations:
1. Use the Lookup activity to retrieve the old watermark value (the initial value or the one that was used in the
last iteration).
2. Use the Lookup activity to retrieve the new watermark value (the maximum value of the watermark column
in the source table).
3. Use the Copy activity to copy data between these two watermark values from the source database to the
destination database.
4. Use the StoredProcedure activity to update the old watermark value to be used in the first step of the next
iteration.
Create the pipeline
1. Create a JSON file named IncrementalCopyPipeline.json in the same folder with the following content:
{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [{
"name": "IterateSQLTables",
"type": "ForEach",
"typeProperties": {
"isSequential": "false",
"items": {
"value": "@pipeline().parameters.tableList",
"type": "Expression"
},
"activities": [
{
"name": "LookupOldWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from watermarktable where TableName = '@{item().TABLE_NAME}'"
},
"dataset": {
"referenceName": "WatermarkDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupNewWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select MAX(@{item().WaterMark_Column}) as NewWatermarkvalue from
@{item().TABLE_NAME}"
},
"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},
{
"name": "IncrementalCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from @{item().TABLE_NAME} where @{item().WaterMark_Column} >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and
@{item().WaterMark_Column} <=
'@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'"
},
"sink": {
"type": "SqlSink",
"SqlWriterTableType": "@{item().TableType}",
"SqlWriterTableType": "@{item().TableType}",
"SqlWriterStoredProcedureName": "@{item().StoredProcedureNameForMergeOperation}"
}
},
"dependsOn": [{
"activity": "LookupNewWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupOldWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"inputs": [{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "SinkDataset",
"type": "DatasetReference",
"parameters": {
"SinkTableName": "@{item().TableType}"
}
}]
},
{
"name": "StoredProceduretoWriteWatermarkActivity",
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "usp_write_watermark",
"storedProcedureParameters": {
"LastModifiedtime": {
"value": "@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}",
"type": "datetime"
},
"TableName": {
"value": "@{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}",
"type": "String"
}
}
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"dependsOn": [{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}]
}
}
}],
"parameters": {
"tableList": {
"type": "Object"
"type": "Object"
}
}
}
}
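The pipeline is deployed from this JSON definition. A minimal sketch, assuming the file is saved as IncrementalCopyPipeline.json:
Set-AzureRmDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -Name "IncrementalCopyPipeline" `
    -DefinitionFile ".\IncrementalCopyPipeline.json"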
PipelineName : IncrementalCopyPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Activities : {IterateSQLTables}
Parameters : {[tableList,
Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}
{
"tableList":
[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]
}
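To start a pipeline run with this parameter list, a sketch like the following can be used, assuming the JSON above is saved as Parameters.json and that the cmdlet's -ParameterFile parameter is available in your module version:
$RunId = Invoke-AzureRmDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -PipelineName "IncrementalCopyPipeline" `
    -ParameterFile ".\Parameters.json"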
5. The Data Integration Application opens in a separate tab. You can see all the pipeline runs and their
status. Notice that in the following example, the status of the pipeline run is Succeeded. To check
parameters passed to the pipeline, select the link in the Parameters column. If an error occurred, you see a
link in the Error column. Select the link in the Actions column.
6. When you select the link in the Actions column, you see the following page that shows all the activity runs
for the pipeline:
7. To go back to the Pipeline Runs view, select Pipelines as shown in the image.
Output
===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 Alice 2017-09-03 02:36:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000
Query
Output
===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
Query
======================================
TableName WatermarkValue
======================================
customer_table 2017-09-05 08:06:00.000
project_table 2017-03-04 05:16:00.000
Notice that the watermark values for both tables were updated.
UPDATE customer_table
SET [LastModifytime] = '2017-09-08T00:00:00Z', [name]='NewName' where [PersonID] = 3
2. Monitor the pipeline runs by following the instructions in the Monitor the pipeline section. Because the
pipeline status is In Progress, you see another action link under Actions to cancel the pipeline run.
3. Select Refresh to refresh the list until the pipeline run succeeds.
4. Optionally, select the View Activity Runs link under Actions to see all the activity runs associated with
this pipeline run.
Output
===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 NewName 2017-09-08 00:00:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000
Notice the new values of Name and LastModifytime for the row with PersonID 3.
Query
Output
===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
NewProject 2017-10-01 00:00:00.000
Output
======================================
TableName WatermarkValue
======================================
customer_table 2017-09-08 00:00:00.000
project_table 2017-10-01 00:00:00.000
Notice that the watermark values for both tables were updated.
Next steps
You performed the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime (IR).
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.
Advance to the following tutorial to learn about incrementally loading data by using change tracking information:
Incrementally load data from Azure SQL Database to Azure Blob storage by using Change Tracking technology
Incrementally load data from Azure SQL Database to
Azure Blob Storage using change tracking
information
3/26/2019 • 15 minutes to read • Edit Online
In this tutorial, you create an Azure data factory with a pipeline that loads delta data based on change tracking
information in the source Azure SQL database to an Azure blob storage.
You perform the following steps in this tutorial:
Prepare the source data store
Create a data factory.
Create linked services.
Create source, sink, and change tracking datasets.
Create, run, and monitor the full copy pipeline
Add or update data in the source table
Create, run, and monitor the incremental copy pipeline
Overview
In a data integration solution, incrementally loading data after initial data loads is a widely used scenario. In some
cases, the changed data within a period in your source data store can easily be sliced up (for example, by
LastModifyTime or CreationTime). In other cases, there is no explicit way to identify the delta data since the last time you
processed the data. The Change Tracking technology supported by data stores such as Azure SQL Database and
SQL Server can be used to identify the delta data. This tutorial describes how to use Azure Data Factory with SQL
Change Tracking technology to incrementally load delta data from Azure SQL Database into Azure Blob Storage.
For more concrete information about SQL Change Tracking technology, see Change tracking in SQL Server.
End-to-end workflow
Here are the typical end-to-end workflow steps to incrementally load data using the Change Tracking technology.
NOTE
Both Azure SQL Database and SQL Server support the Change Tracking technology. This tutorial uses Azure SQL Database as
the source data store. You can also use an on-premises SQL Server.
High-level solution
In this tutorial, you create two pipelines that perform the following two operations:
1. Initial load: you create a pipeline with a copy activity that copies the entire data from the source data store
(Azure SQL Database) to the destination data store (Azure Blob Storage).
2. Incremental load: you create a pipeline with the following activities, and run it periodically.
a. Create two lookup activities to get the old and new SYS_CHANGE_VERSION values from Azure SQL
Database and pass them to the copy activity.
b. Create one copy activity to copy the inserted/updated/deleted data between the two
SYS_CHANGE_VERSION values from Azure SQL Database to Azure Blob Storage.
c. Create one stored procedure activity to update the value of SYS_CHANGE_VERSION for the next
pipeline run.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure SQL Database. You use the database as the source data store. If you don't have an Azure SQL
Database, see the Create an Azure SQL database article for steps to create one.
Azure Storage account. You use the blob storage as the sink data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one. Create a container named adftutorial.
Create a data source table in your Azure SQL database
1. Launch SQL Server Management Studio, and connect to your Azure SQL server.
2. In Server Explorer, right-click your database and choose New Query.
3. Run the following SQL command against your Azure SQL database to create a table named
data_source_table as the data source store.
create table data_source_table
(
    PersonID int NOT NULL,
    Name varchar(255),
    Age int,
    PRIMARY KEY (PersonID)
);
4. Enable Change Tracking mechanism on your database and the source table (data_source_table) by
running the following SQL query:
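A sketch of the statements that enable change tracking with the two-day retention described in the note below, wrapped in Invoke-Sqlcmd so that all code stays in PowerShell; it assumes the SqlServer module and your own server, database, and credential values:
# Enable change tracking on the database and on data_source_table (all bracketed values are placeholders).
$sql = @"
ALTER DATABASE [<your database name>]
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE data_source_table
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON);
"@
Invoke-Sqlcmd -ServerInstance "<server>.database.windows.net" -Database "<your database name>" `
    -Username "<user name>" -Password "<password>" -Query $sql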
NOTE
Replace <your database name> with the name of your Azure SQL database that has the data_source_table.
The changed data is kept for two days in the current example. If you load the changed data every three days
or more, some changed data is not included. You need to either change the value of CHANGE_RETENTION to a
bigger number or ensure that your period for loading the changed data is within two days. For more
information, see Enable change tracking for a database.
5. Create a new table and store the ChangeTracking_version with a default value by running the following
query:
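A sketch of that query, reconstructed from the table and column names used later in this tutorial (table_store_ChangeTracking_version, TableName, SYS_CHANGE_VERSION); run it the same way as the previous query:
$sql = @"
CREATE TABLE table_store_ChangeTracking_version
(
    TableName varchar(255),
    SYS_CHANGE_VERSION BIGINT
);

DECLARE @ChangeTracking_version BIGINT;
SET @ChangeTracking_version = CHANGE_TRACKING_CURRENT_VERSION();

INSERT INTO table_store_ChangeTracking_version
VALUES ('data_source_table', @ChangeTracking_version);
"@
Invoke-Sqlcmd -ServerInstance "<server>.database.windows.net" -Database "<your database name>" `
    -Username "<user name>" -Password "<password>" -Query $sql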
NOTE
If the data is not changed after you enabled the change tracking for SQL Database, the value of the change tracking
version is 0.
6. Run the following query to create a stored procedure in your Azure SQL database. The pipeline invokes this
stored procedure to update the change tracking version in the table you created in the previous step.
-- CREATE PROCEDURE header reconstructed from the names used in the pipeline definition; the parameter types shown are assumed.
CREATE PROCEDURE Update_ChangeTracking_Version @CurrentTrackingVersion BIGINT, @TableName varchar(50)
AS
BEGIN
UPDATE table_store_ChangeTracking_version
SET [SYS_CHANGE_VERSION] = @CurrentTrackingVersion
WHERE [TableName] = @TableName
END
Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.
The name of the Azure data factory must be globally unique. If you receive the following error, change the
name of the data factory (for example, yournameADFTutorialDataFactory) and try creating again. See Data
Factory - Naming Rules article for naming rules for Data Factory artifacts.
`Data factory name “ADFTutorialDataFactory” is not available`
4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 (Preview) for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down
list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.
11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) in a separate tab.
13. In the get started page, switch to the Edit tab in the left panel as shown in the following image:
2. In the New Linked Service window, select Azure Blob Storage, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureStorageLinkedService for Name.
b. Select your Azure Storage account for Storage account name.
c. Click Save.
Create Azure SQL Database linked service.
In this step, you link your Azure SQL database to the data factory.
1. Click Connections, and click + New.
2. In the New Linked Service window, select Azure SQL Database, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for the Name field.
b. Select your Azure SQL server for the Server name field.
c. Select your Azure SQL database for the Database name field.
d. Enter the name of the user for the User name field.
e. Enter the password for the user for the Password field.
f. Click Test connection to test the connection.
g. Click Save to save the linked service.
Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
SYS_CHANGE_VERSION.
Create a dataset to represent source data
In this step, you create a dataset to represent the source data.
1. In the treeview, click + (plus), and click Dataset.
2. Select Azure SQL Database, and click Finish.
3. You see a new tab for configuring the dataset. You also see the dataset in the treeview. In the Properties
window, change the name of the dataset to SourceDataset.
4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select [dbo].[data_source_table] for Table.
2. You see a new tab for configuring the pipeline. You also see the pipeline in the treeview. In the Properties
window, change the name of the pipeline to FullCopyPipeline.
3. In the Activities toolbox, expand Data Flow, and drag-drop the Copy activity to the pipeline designer
surface, and set the name FullCopyActivity.
4. Switch to the Source tab, and select SourceDataset for the Source Dataset field.
5. Switch to the Sink tab, and select SinkDataset for the Sink Dataset field.
6. To validate the pipeline definition, click Validate on the toolbar. Confirm that there is no validation error.
Close the Pipeline Validation Report by clicking >>.
7. To publish entities (linked services, datasets, and pipelines), click Publish. Wait until the publishing
succeeds.
9. You can also see notifications by clicking the Show Notifications button on the left. To close the
notifications window, click X.
Run the full copy pipeline
Click Trigger on the toolbar for the pipeline, and click Trigger Now.
The file should have the data from the Azure SQL database:
1,aaaa,21
2,bbbb,24
3,cccc,20
4,dddd,26
5,eeee,22
UPDATE data_source_table
SET [Age] = '10', [name]='update' where [PersonID] = 1
2. You see a new tab for configuring the pipeline. You also see the pipeline in the treeview. In the Properties
window, change the name of the pipeline to IncrementalCopyPipeline.
3. Expand General in the Activities toolbox, and drag-drop the Lookup activity to the pipeline designer
surface. Set the name of the activity to LookupLastChangeTrackingVersionActivity. This activity gets the
change tracking version used in the last copy operation that is stored in the table
table_store_ChangeTracking_version.
4. Switch to the Settings in the Properties window, and select ChangeTrackingDataset for the Source
Dataset field.
5. Drag-and-drop the Lookup activity from the Activities toolbox to the pipeline designer surface. Set the
name of the activity to LookupCurrentChangeTrackingVersionActivity. This activity gets the current
change tracking version.
6. Switch to the Settings in the Properties window, and do the following steps:
a. Select SourceDataset for the Source Dataset field.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.
8. Switch to the Source tab in the Properties window, and do the following steps:
a. Select SourceDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.
select data_source_table.PersonID,data_source_table.Name,data_source_table.Age,
CT.SYS_CHANGE_VERSION, SYS_CHANGE_OPERATION from data_source_table RIGHT OUTER JOIN
CHANGETABLE(CHANGES data_source_table,
@{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.SYS_CHANGE_VERSION}) as CT
on data_source_table.PersonID = CT.PersonID where CT.SYS_CHANGE_VERSION <=
@{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVer
sion}
9. Switch to the Sink tab, and select SinkDataset for the Sink Dataset field.
10. Connect both Lookup activities to the Copy activity one by one. Drag the green button attached to the
Lookup activity to the Copy activity.
11. Drag-and-drop the Stored Procedure activity from the Activities toolbox to the pipeline designer surface.
Set the name of the activity to StoredProceduretoUpdateChangeTrackingActivity. This activity updates
the change tracking version in the table_store_ChangeTracking_version table.
12. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked service.
13. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name, select Update_ChangeTracking_Version.
b. Select Import parameter.
c. In the Stored procedure parameters section, specify following values for the parameters:
15. Click Validate on the toolbar. Confirm that there are no validation errors. Close the Pipeline Validation
Report window by clicking >>.
16. Publish entities (linked services, datasets, and pipelines) to the Data Factory service by clicking the Publish
All button. Wait until you see the Publishing succeeded message.
Run the incremental copy pipeline
1. Click Trigger on the toolbar for the pipeline, and click Trigger Now.
The file should have only the delta data from the Azure SQL database. The record with U is the updated row in
the database, and the record with I is the newly added row.
1,update,10,2,U
6,new,50,1,I
The first three columns are changed data from data_source_table. The last two columns are the metadata from
the change tracking system table. The fourth column is the SYS_CHANGE_VERSION for each changed row. The fifth
column is the operation: U = update, I = insert. For details about the change tracking information, see
CHANGETABLE.
==================================================================
PersonID Name Age SYS_CHANGE_VERSION SYS_CHANGE_OPERATION
==================================================================
1 update 10 2 U
6 new 50 1 I
Next steps
Advance to the following tutorial to learn about copying new and changed files only based on their
LastModifiedDate:
Copy new files by lastmodifieddate
Incrementally load data from Azure SQL Database to
Azure Blob Storage using change tracking
information
3/5/2019 • 14 minutes to read • Edit Online
In this tutorial, you create an Azure data factory with a pipeline that loads delta data based on change tracking
information in the source Azure SQL database to an Azure blob storage.
You perform the following steps in this tutorial:
Prepare the source data store
Create a data factory.
Create linked services.
Create source, sink, and change tracking datasets.
Create, run, and monitor the full copy pipeline
Add or update data in the source table
Create, run, and monitor the incremental copy pipeline
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Overview
In a data integration solution, incrementally loading data after initial data loads is a widely used scenario. In some
cases, the changed data within a period in your source data store can easily be sliced up (for example, by
LastModifyTime or CreationTime). In other cases, there is no explicit way to identify the delta data since the last time you
processed the data. The Change Tracking technology supported by data stores such as Azure SQL Database and
SQL Server can be used to identify the delta data. This tutorial describes how to use Azure Data Factory with SQL
Change Tracking technology to incrementally load delta data from Azure SQL Database into Azure Blob Storage.
For more concrete information about SQL Change Tracking technology, see Change tracking in SQL Server.
End-to-end workflow
Here are the typical end-to-end workflow steps to incrementally load data using the Change Tracking technology.
NOTE
Both Azure SQL Database and SQL Server support the Change Tracking technology. This tutorial uses Azure SQL Database
as the source data store. You can also use an on-premises SQL Server.
High-level solution
In this tutorial, you create two pipelines that perform the following two operations:
1. Initial load: you create a pipeline with a copy activity that copies the entire data from the source data store
(Azure SQL Database) to the destination data store (Azure Blob Storage).
2. Incremental load: you create a pipeline with the following activities, and run it periodically.
a. Create two lookup activities to get the old and new SYS_CHANGE_VERSION values from Azure SQL
Database and pass them to the copy activity.
b. Create one copy activity to copy the inserted/updated/deleted data between the two
SYS_CHANGE_VERSION values from Azure SQL Database to Azure Blob Storage.
c. Create one stored procedure activity to update the value of SYS_CHANGE_VERSION for the next
pipeline run.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure PowerShell. Install the latest Azure PowerShell modules by following instructions in How to install and
configure Azure PowerShell.
Azure SQL Database. You use the database as the source data store. If you don't have an Azure SQL
Database, see the Create an Azure SQL database article for steps to create one.
Azure Storage account. You use the blob storage as the sink data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one. Create a container named adftutorial.
Create a data source table in your Azure SQL database
1. Launch SQL Server Management Studio, and connect to your Azure SQL server.
2. In Server Explorer, right-click your database and choose New Query.
3. Run the following SQL command against your Azure SQL database to create a table named
data_source_table as the data source store.
4. Enable Change Tracking mechanism on your database and the source table (data_source_table) by
running the following SQL query:
NOTE
Replace <your database name> with the name of your Azure SQL database that has the data_source_table.
The changed data is kept for two days in the current example. If you load the changed data every three days
or more, some changed data is not included. You need to either change the value of CHANGE_RETENTION to a
bigger number or ensure that your period for loading the changed data is within two days. For more
information, see Enable change tracking for a database.
5. Create a new table and store the ChangeTracking_version with a default value by running the following
query:
6. Run the following query to create a stored procedure in your Azure SQL database. The pipeline invokes this
stored procedure to update the change tracking version in the table you created in the previous step.
-- CREATE PROCEDURE header reconstructed from the names used in the pipeline definition; the parameter types shown are assumed.
CREATE PROCEDURE Update_ChangeTracking_Version @CurrentTrackingVersion BIGINT, @TableName varchar(50)
AS
BEGIN
UPDATE table_store_ChangeTracking_version
SET [SYS_CHANGE_VERSION] = @CurrentTrackingVersion
WHERE [TableName] = @TableName
END
Azure PowerShell
Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.
$resourceGroupName = "ADFTutorialResourceGroup";
If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$resourceGroupName variable and run the command again.
IMPORTANT
Update the data factory name to be globally unique.
$dataFactoryName = "IncCopyChgTrackingDF";
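A minimal sketch of creating the resource group and the data factory with the Az cmdlets referenced in the note above; the region is only an example:
New-AzResourceGroup -Name $resourceGroupName -Location "East US"
Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName `
    -Location "East US" -Name $dataFactoryName
If the data factory name is already in use, you see an error like the following.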
The specified Data Factory name 'ADFIncCopyChangeTrackingTestFactory' is already in use. Data Factory
names must be globally unique.
To create Data Factory instances, the user account you use to log in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on
the following page, and then expand Analytics to locate Data Factory: Products available by region. The
data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data factory
can be in other regions.
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>",
"type": "SecureString"
}
}
}
}
LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService
{
"name": "AzureSQLDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"value": "Server = tcp:<server>.database.windows.net,1433;Initial Catalog=<database name>; Persist
Security Info=False; User ID=<user name>; Password=<password>; MultipleActiveResultSets = False;
Encrypt = True; TrustServerCertificate = False; Connection Timeout = 30;",
"type": "SecureString"
}
}
}
}
2. In Azure PowerShell, run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service:
AzureSQLDatabaseLinkedService.
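A minimal sketch of that call, assuming the JSON above is saved as AzureSQLDatabaseLinkedService.json (the AzureStorageLinkedService definition shown earlier is deployed the same way):
Set-AzDataFactoryV2LinkedService -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -Name "AzureSQLDatabaseLinkedService" `
    -DefinitionFile ".\AzureSQLDatabaseLinkedService.json"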
LinkedServiceName : AzureSQLDatabaseLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService
Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
SYS_CHANGE_VERSION.
Create a source dataset
In this step, you create a dataset to represent the source data.
1. Create a JSON file named SourceDataset.json in the same folder with the following content:
{
"name": "SourceDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "data_source_table"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
DatasetName : SourceDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
{
"name": "SinkDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "adftutorial/incchgtracking",
"fileName": "@CONCAT('Incremental-', pipeline().RunId, '.txt')",
"format": {
"type": "TextFormat"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
You create the adftutorial container in your Azure Blob storage as part of the prerequisites. Create the
container if it does not exist, or set the name to that of an existing container. In this tutorial, the output file name is
dynamically generated by using the expression: @CONCAT('Incremental-', pipeline().RunId, '.txt').
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset: SinkDataset
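A minimal sketch of that cmdlet call, assuming the JSON above is saved as SinkDataset.json:
Set-AzDataFactoryV2Dataset -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -Name "SinkDataset" `
    -DefinitionFile ".\SinkDataset.json"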
DatasetName : SinkDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset
{
"name": " ChangeTrackingDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "table_store_ChangeTracking_version"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
DatasetName : ChangeTrackingDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
"inputs": [{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}]
}]
}
}
PipelineName : FullCopyPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Activities : {FullCopyActivity}
Parameters :
5. The Data Integration Application launches in a separate tab. You can see all the pipeline runs and their
statuses. Notice that in the following example, the status of the pipeline run is Succeeded. You can check the
parameters passed to the pipeline by clicking the link in the Parameters column. If there was an error, you see
a link in the Error column. Click the link in the Actions column.
6. When you click the link in the Actions column, you see the following page that shows all the activity runs
for the pipeline.
7. To switch back to the Pipeline runs view, click Pipelines as shown in the image.
Review the results
You see a file named incremental-<GUID>.txt in the incchgtracking folder of the adftutorial container.
The file should have the data from the Azure SQL database:
1,aaaa,21
2,bbbb,24
3,cccc,20
4,dddd,26
5,eeee,22
UPDATE data_source_table
SET [Age] = '10', [name]='update' where [PersonID] = 1
{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [
{
"name": "LookupLastChangeTrackingVersionActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from table_store_ChangeTracking_version"
},
"dataset": {
"referenceName": "ChangeTrackingDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupCurrentChangeTrackingVersionActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT CHANGE_TRACKING_CURRENT_VERSION() as
CurrentChangeTrackingVersion"
},
"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},
{
"name": "IncrementalCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select
data_source_table.PersonID,data_source_table.Name,data_source_table.Age, CT.SYS_CHANGE_VERSION,
SYS_CHANGE_OPERATION from data_source_table RIGHT OUTER JOIN CHANGETABLE(CHANGES data_source_table,
@{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.SYS_CHANGE_VERSION}) as CT on
data_source_table.PersonID = CT.PersonID where CT.SYS_CHANGE_VERSION <=
@{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion}"
},
"sink": {
"type": "BlobSink"
}
},
"dependsOn": [
{
"activity": "LookupLastChangeTrackingVersionActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupCurrentChangeTrackingVersionActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}
]
},
{
"name": "StoredProceduretoUpdateChangeTrackingActivity",
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "Update_ChangeTracking_Version",
"storedProcedureParameters": {
"CurrentTrackingVersion": {"value":
"@{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion}
", "type": "INT64" },
"TableName": {
"value":"@{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.TableName}",
"type":"String"}
}
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"dependsOn": [
{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
]
}
}
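The pipeline is deployed and then triggered before the monitoring steps below. A minimal sketch, assuming the JSON is saved as IncrementalCopyPipeline.json:
Set-AzDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -Name "IncrementalCopyPipeline" `
    -DefinitionFile ".\IncrementalCopyPipeline.json"
$RunId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -PipelineName "IncrementalCopyPipeline"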
2. When you click the link in the Actions column, you see the following page that shows all the activity runs
for the pipeline.
3. To switch back to the Pipeline runs view, click Pipelines as shown in the image.
Review the results
You see the second file in the incchgtracking folder of the adftutorial container.
The file should have only the delta data from the Azure SQL database. The record with U is the updated row in
the database, and the record with I is the newly added row.
1,update,10,2,U
6,new,50,1,I
The first three columns are changed data from data_source_table. The last two columns are the metadata from
the change tracking system table. The fourth column is the SYS_CHANGE_VERSION for each changed row. The fifth
column is the operation: U = update, I = insert. For details about the change tracking information, see
CHANGETABLE.
==================================================================
PersonID Name Age SYS_CHANGE_VERSION SYS_CHANGE_OPERATION
==================================================================
1 update 10 2 U
6 new 50 1 I
Next steps
Advance to the following tutorial to learn about copying new and changed files only based on their
LastModifiedDate:
Copy new files by lastmodifieddate
Incrementally copy new and changed files based on
LastModifiedDate by using the Copy Data tool
5/10/2019 • 5 minutes to read • Edit Online
In this tutorial, you'll use the Azure portal to create a data factory. Then, you'll use the Copy Data tool to create a
pipeline that incrementally copies new and changed files only, based on their LastModifiedDate from Azure Blob
storage to Azure Blob storage.
With this approach, ADF scans all the files in the source store, applies a file filter based on their LastModifiedDate, and
copies only the files that are new or have been updated since the last run to the destination store. Note that if ADF
scans a large number of files but copies only a few of them to the destination, the run can still take a long time,
because the file scanning itself is time consuming.
NOTE
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure storage account: Use Blob storage as the source and sink data store. If you don't have an Azure
storage account, see the instructions in Create a storage account.
Create two containers in Blob storage
Prepare your Blob storage for the tutorial by performing these steps.
1. Create a container named source. You can use various tools to perform this task, such as Azure Storage
Explorer.
2. Create a container named destination.
The name for your data factory must be globally unique. You might receive the following error message:
If you receive an error message about the name value, enter a different name for the data factory. For
example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory
artifacts, see Data Factory naming rules.
3. Select the Azure subscription in which you'll create the new data factory.
4. For Resource Group, take one of the following steps:
Select Use existing and select an existing resource group from the drop-down list.
Select Create new and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under version, select V2.
6. Under location, select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores (for example, Azure Storage and SQL Database) and computes (for example,
Azure HDInsight) that your data factory uses can be in other locations and regions.
7. Select Pin to dashboard.
8. Select Create.
9. On the dashboard, refer to the Deploying Data Factory tile to see the process status.
10. After creation is finished, the Data Factory home page is displayed.
11. To open the Azure Data Factory user interface (UI) on a separate tab, select the Author & Monitor tile.
c. On the New Linked Service page, select your storage account from the Storage account name list
and then select Finish.
d. Select the newly created linked service and then select Next.
4. On the Choose the input file or folder page, complete the following steps:
a. Browse and select the source folder, and then select Choose.
b. Under File loading behavior, select Incremental load: LastModifiedDate.
6. On the Choose the output file or folder page, complete the following steps:
a. Browse and select the destination folder, and then select Choose.
b. Select Next.
11. There's only one activity (the copy activity) in the pipeline, so you see only one entry. For details about the
copy operation, select the Details link (eyeglasses icon) in the Actions column.
Because there is no file in the source container in your Blob storage account, you will not see any file
copied to the destination container in your Blob storage account.
12. Create an empty text file and name it file1.txt. Upload this text file to the source container in your storage
account. You can use various tools to perform these tasks, such as Azure Storage Explorer.
13. To go back to the Pipeline Runs view, select All Pipeline Runs, and wait for the same pipeline to be
triggered again automatically.
14. Select View Activity Run for the second pipeline run when you see it. Then review the details in the same
way you did for the first pipeline run.
You will see that one file (file1.txt) has been copied from the source container to the destination container
of your Blob storage account.
15. Create another empty text file and name it file2.txt. Upload this text file to the source container in your
Blob storage account.
16. Repeat steps 13 and 14 for this second text file. You will see that only the new file (file2.txt) has been copied
from the source container to the destination container of your storage account in the next pipeline run.
You can also verify this by using Azure Storage Explorer to scan the files.
Next steps
Advance to the following tutorial to learn about transforming data by using an Apache Spark cluster on Azure:
Transform data in the cloud by using an Apache Spark cluster
Incrementally copy new files based on time
partitioned file name by using the Copy Data tool
3/26/2019 • 5 minutes to read • Edit Online
In this tutorial, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a
pipeline that incrementally copies new files based on time partitioned file name from Azure Blob storage to Azure
Blob storage.
NOTE
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure storage account: Use Blob storage as the source and sink data store. If you don't have an Azure storage
account, see the instructions in Create a storage account.
Create two containers in Blob storage
Prepare your Blob storage for the tutorial by performing these steps.
1. Create a container named source. Create a folder path as 2019/02/26/14 in your container. Create an
empty text file, and name it as file1.txt. Upload the file1.txt to the folder path source/2019/02/26/14 in
your storage account. You can use various tools to perform these tasks, such as Azure Storage Explorer.
NOTE
Adjust the folder name based on the current UTC time. For example, if the current UTC time is 2:03 PM on February 26,
2019, create the folder path source/2019/02/26/14/, following the rule
source/{Year}/{Month}/{Day}/{Hour}/.
2. Create a container named destination. You can use various tools to perform these tasks, such as Azure
Storage Explorer.
If you receive an error message about the name value, enter a different name for the data factory. For
example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory
artifacts, see Data Factory naming rules.
3. Select the Azure subscription in which to create the new data factory.
4. For Resource Group, take one of the following steps:
a. Select Use existing, and select an existing resource group from the drop-down list.
b. Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under version, select V2 for the version.
6. Under location, select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores (for example, Azure Storage and SQL Database) and computes (for example,
Azure HDInsight) that are used by your data factory can be in other locations and regions.
7. Select Pin to dashboard.
8. Select Create.
9. On the dashboard, the Deploying Data Factory tile shows the process status.
10. After creation is finished, the Data Factory home page is displayed.
11. To launch the Azure Data Factory user interface (UI) in a separate tab, select the Author & Monitor tile.
c. On the New Linked Service page, select your storage account from the Storage account name list,
and then click Finish.
b. Under File loading behavior, select Incremental load: time-partitioned folder/file names.
c. Write the dynamic folder path as source/{year}/{month}/{day}/{hour}/, and change the format as
follows:
d. Check Binary copy and click Next.
5. On the Destination data store page, select the AzureBlobStorage, which is the same storage account as
data source store, and then click Next.
6. On the Choose the output file or folder page, do the following steps:
a. Browse and select the destination folder, then click Choose.
b. Write the dynamic folder path as source/{year}/{month}/{day}/{hour}/, and change the format as
follows:
c. Click Next.
11. There's only one activity (copy activity) in the pipeline, so you see only one entry. You can see the source file
(file1.txt) has been copied from source/2019/02/26/14/ to destination/2019/02/26/14/ with the same
file name.
You can also verify the same by using Azure Storage Explorer (https://storageexplorer.com/) to scan the files.
12. Create another empty text file with the new name as file2.txt. Upload the file2.txt file to the folder path
source/2019/02/26/15 in your storage account. You can use various tools to perform these tasks, such as
Azure Storage Explorer.
NOTE
A new folder path is required here. Adjust the folder name based on the current UTC time. For example, if the current
UTC time is 3:20 PM on February 26, 2019, create the folder path source/2019/02/26/15/, following the rule
{Year}/{Month}/{Day}/{Hour}/.
13. To go back to the Pipeline Runs view, select All Pipeline Runs, and wait for the same pipeline to be
triggered again automatically after another hour.
14. Select View Activity Run for the second pipeline run when it appears, and review the details in the same way.
You can see the source file (file2.txt) has been copied from source/2019/02/26/15/ to
destination/2019/02/26/15/ with the same file name.
You can also verify the same by using Azure Storage Explorer (https://storageexplorer.com/) to scan the files
in destination container
Next steps
Advance to the following tutorial to learn about transforming data by using a Spark cluster on Azure:
Transform data using Spark cluster in cloud
Transform data in the cloud by using a Spark activity
in Azure Data Factory
3/7/2019 • 7 minutes to read • Edit Online
In this tutorial, you use the Azure portal to create an Azure Data Factory pipeline. This pipeline transforms data by
using a Spark activity and an on-demand Azure HDInsight linked service.
You perform the following steps in this tutorial:
Create a data factory.
Create a pipeline that uses a Spark activity.
Trigger a pipeline run.
Monitor the pipeline run.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure storage account. You create a Python script and an input file, and you upload them to Azure Storage.
The output from the Spark program is stored in this storage account. The on-demand Spark cluster uses the
same storage account as its primary storage.
NOTE
HDInsight supports only general-purpose storage accounts with the standard tier. Make sure that the account is not a
premium or Blob-only storage account.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Upload the Python script to your Blob storage account
1. Create a Python file named WordCount_Spark.py with the following content:
import sys
from operator import add
from pyspark.sql import SparkSession

def main():
    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/inputfiles/minecraftstory.txt").rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add)
    counts.saveAsTextFile("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/outputfiles/wordcount")

    spark.stop()

if __name__ == "__main__":
    main()
2. Replace <storageAccountName> with the name of your Azure storage account. Then, save the file.
3. In Azure Blob storage, create a container named adftutorial if it does not exist.
4. Create a folder named spark.
5. Create a subfolder named script under the spark folder.
6. Upload the WordCount_Spark.py file to the script subfolder.
Upload the input file
1. Create a file named minecraftstory.txt with some text. The Spark program counts the number of words in
this text.
2. Create a subfolder named inputfiles in the spark folder.
3. Upload the minecraftstory.txt file to the inputfiles subfolder.
4. For Subscription, select your Azure subscription in which you want to create the data factory.
5. For Resource Group, take one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
Some of the steps in this quickstart assume that you use the name ADFTutorialResourceGroup for the
resource group. To learn about resource groups, see Using resource groups to manage your Azure
resources.
6. For Version, select V2.
7. For Location, select the location for the data factory.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (like Azure Storage and Azure SQL Database) and computes (like HDInsight) that Data
Factory uses can be in other regions.
8. Select Create.
9. After the creation is complete, you see the Data factory page. Select the Author & Monitor tile to start
the Data Factory UI application on a separate tab.
3. In the New Linked Service window, select Data Store > Azure Blob Storage, and then select Continue.
4. For Storage account name, select the name from the list, and then select Save.
Create an on-demand HDInsight linked service
1. Select the + New button again to create another linked service.
2. In the New Linked Service window, select Compute > Azure HDInsight, and then select Continue.
3. In the New Linked Service window, complete the following steps:
a. For Name, enter AzureHDInsightLinkedService.
b. For Type, confirm that On-demand HDInsight is selected.
c. For Azure Storage Linked Service, select AzureStorage1. You created this linked service earlier. If you
used a different name, specify the right name here.
d. For Cluster type, select spark.
e. For Service principal id, enter the ID of the service principal that has permission to create an HDInsight
cluster.
This service principal needs to be a member of the Contributor role of the subscription or the resource
group in which the cluster is created. For more information, see Create an Azure Active Directory
application and service principal.
f. For Service principal key, enter the key.
g. For Resource group, select the same resource group that you used when you created the data factory.
The Spark cluster is created in this resource group.
h. Expand OS type.
i. Enter a name for Cluster user name.
j. Enter the Cluster password for the user.
k. Select Finish.
NOTE
Azure HDInsight limits the total number of cores that you can use in each Azure region that it supports. For the on-demand
HDInsight linked service, the HDInsight cluster is created in the same Azure Storage location that's used as its primary
storage. Ensure that you have enough core quotas for the cluster to be created successfully. For more information, see Set
up clusters in HDInsight with Hadoop, Spark, Kafka, and more.
Create a pipeline
1. Select the + (plus) button, and then select Pipeline on the menu.
2. In the Activities toolbox, expand HDInsight. Drag the Spark activity from the Activities toolbox to the
pipeline designer surface.
3. In the properties for the Spark activity window at the bottom, complete the following steps:
a. Switch to the HDI Cluster tab.
b. Select AzureHDInsightLinkedService (which you created in the previous procedure).
4. Switch to the Script/Jar tab, and complete the following steps:
a. For Job Linked Service, select AzureStorage1.
b. Select Browse Storage.
c. Browse to the adftutorial/spark/script folder, select WordCount_Spark.py, and then select Finish.
5. To validate the pipeline, select the Validate button on the toolbar. Select the >> (right arrow ) button to
close the validation window.
6. Select Publish All. The Data Factory UI publishes entities (linked services and pipeline) to the Azure Data
Factory service.
3. To see activity runs associated with the pipeline run, select View Activity Runs in the Actions column.
You can switch back to the pipeline runs view by selecting the Pipelines link at the top.
(u'This', 1)
(u'a', 1)
(u'is', 1)
(u'test', 1)
(u'file', 1)
Next steps
The pipeline in this sample transforms data by using a Spark activity and an on-demand HDInsight linked service.
You learned how to:
Create a data factory.
Create a pipeline that uses a Spark activity.
Trigger a pipeline run.
Monitor the pipeline run.
To learn how to transform data by running a Hive script on an Azure HDInsight cluster that's in a virtual network,
advance to the next tutorial:
Tutorial: Transform data using Hive in Azure Virtual Network.
Transform data in the cloud by using Spark activity in
Azure Data Factory
3/7/2019 • 7 minutes to read • Edit Online
In this tutorial, you use Azure PowerShell to create a Data Factory pipeline that transforms data using Spark
Activity and an on-demand HDInsight linked service. You perform the following steps in this tutorial:
Create a data factory.
Author and deploy linked services.
Author and deploy a pipeline.
Start a pipeline run.
Monitor the pipeline run.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure Storage account. You create a python script and an input file, and upload them to the Azure storage.
The output from the spark program is stored in this storage account. The on-demand Spark cluster uses the
same storage account as its primary storage.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Upload python script to your Blob Storage account
1. Create a python file named WordCount_Spark.py with the following content:
import sys
from operator import add
from pyspark.sql import SparkSession

def main():
    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/inputfiles/minecraftstory.txt").rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add)
    counts.saveAsTextFile("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/outputfiles/wordcount")

    spark.stop()

if __name__ == "__main__":
    main()
2. Replace <storageAccountName> with the name of your Azure Storage account. Then, save the file.
3. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
4. Create a folder named spark.
5. Create a subfolder named script under the spark folder.
6. Upload the WordCount_Spark.py file to the script subfolder. (A scripted alternative using PowerShell is sketched after these steps.)
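If you prefer to script these uploads instead of using the portal, here is a minimal sketch using the Az.Storage module; the storage account name and key are placeholders:

# Placeholders: replace with your own storage account name and key
$ctx = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"
# Create the container if it doesn't exist, then upload the script to spark/script/
New-AzStorageContainer -Name "adftutorial" -Context $ctx -ErrorAction SilentlyContinue
Set-AzStorageBlobContent -File ".\WordCount_Spark.py" -Container "adftutorial" -Blob "spark/script/WordCount_Spark.py" -Context $ctx

The input file in the next section can be uploaded the same way, to spark/inputfiles/minecraftstory.txt.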
Upload the input file
1. Create a file named minecraftstory.txt with some text. The Spark program counts the number of words in this
text.
2. Create a subfolder named inputfiles in the spark folder.
3. Upload the minecraftstory.txt file to the inputfiles subfolder.
In the Azure Storage linked service definition (MyStorageLinkedService.json), update <storageAccountName> and
<storageAccountKey> with the name and key of your Azure Storage account.
On-demand HDInsight linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure HDInsight linked
service, and save the file as MyOnDemandSparkLinkedService.json.
{
"name": "MyOnDemandSparkLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"clusterSize": 2,
"clusterType": "spark",
"timeToLive": "00:15:00",
"hostSubscriptionId": "<subscriptionID> ",
"servicePrincipalId": "<servicePrincipalID>",
"servicePrincipalKey": {
"value": "<servicePrincipalKey>",
"type": "SecureString"
},
"tenant": "<tenant ID>",
"clusterResourceGroup": "<resourceGroupofHDICluster>",
"version": "3.6",
"osType": "Linux",
"clusterNamePrefix":"ADFSparkSample",
"linkedServiceName": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
}
Update values for the following properties in the linked service definition:
hostSubscriptionId. Replace <subscriptionID> with the ID of your Azure subscription. The on-demand
HDInsight cluster is created in this subscription.
tenant. Replace <tenantID> with the ID of your Azure tenant.
servicePrincipalId, servicePrincipalKey. Replace <servicePrincipalID> and <servicePrincipalKey> with the ID
and key of your service principal in Azure Active Directory. This service principal needs to be a member of
the Contributor role on the subscription or on the resource group in which the cluster is created. See Create an Azure
Active Directory application and service principal for details.
clusterResourceGroup. Replace <resourceGroupOfHDICluster> with the name of the resource group in
which the HDInsight cluster needs to be created.
NOTE
Azure HDInsight has a limit on the total number of cores that you can use in each Azure region it supports. For the on-demand
HDInsight linked service, the HDInsight cluster is created in the same location as the Azure Storage account used as its primary
storage. Ensure that you have enough core quota for the cluster to be created successfully. For more information, see Set
up clusters in HDInsight with Hadoop, Spark, Kafka, and more.
Author a pipeline
In this step, you create a new pipeline with a Spark activity. The activity uses the word count sample. Download
the contents from this location if you haven't already done so.
Create a JSON file in your preferred editor, copy the following JSON definition of a pipeline definition, and save it
as MySparkOnDemandPipeline.json.
{
"name": "MySparkOnDemandPipeline",
"properties": {
"activities": [
{
"name": "MySparkActivity",
"type": "HDInsightSpark",
"linkedServiceName": {
"referenceName": "MyOnDemandSparkLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"rootPath": "adftutorial/spark",
"entryFilePath": "script/WordCount_Spark.py",
"getDebugInfo": "Failure",
"sparkJobLinkedService": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
]
}
}
1. Define the variables that are used by the commands later in this tutorial. The pipeline name matches the name in MySparkOnDemandPipeline.json:

$resourceGroupName = "ADFTutorialResourceGroup"
$pipelineName = "MySparkOnDemandPipeline"
2. Launch PowerShell. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen,
you need to run the commands again. For a list of Azure regions in which Data Factory is currently
available, select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory: Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.
Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:
Connect-AzAccount
Run the following command to view all the subscriptions for this account:
Get-AzSubscription
Run the following command to select the subscription that you want to work with. Replace SubscriptionId
with the ID of your Azure subscription:
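A sketch of the subscription-selection command; the subscription ID is a placeholder:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"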
4. Run the Set-AzDataFactoryV2 cmdlet to create a data factory, and store the result in the $df variable. Replace the placeholders with a globally unique data factory name and a region where Data Factory is available:

$dataFactoryName = "<dataFactoryName>"   # must be globally unique
$df = Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "<region>" -Name $dataFactoryName
5. Switch to the folder where you created JSON files, and run the following command to deploy an Azure
Storage linked service:
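A minimal sketch of the deployment and run commands, assuming the Azure Storage linked service definition was saved as MyStorageLinkedService.json and that all JSON files are in the current folder; the last command captures the run ID that the monitoring script below expects in $runId:

# Deploy both linked services and the pipeline
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "MyStorageLinkedService" -DefinitionFile ".\MyStorageLinkedService.json"
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "MyOnDemandSparkLinkedService" -DefinitionFile ".\MyOnDemandSparkLinkedService.json"
Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name $pipelineName -DefinitionFile ".\MySparkOnDemandPipeline.json"
# Start a pipeline run and capture the run ID for monitoring
$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineName $pipelineName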
2. Run the following script to continuously check the pipeline run status until it finishes.
while ($True) {
    $result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)

    if(!$result) {
        Write-Host "Waiting for pipeline to start..." -foregroundcolor "Yellow"
    }
    elseif (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
        Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
    }
    else {
        Write-Host "Pipeline '"$pipelineName"' run finished. Result:" -foregroundcolor "Yellow"
        $result
        break
    }
    ($result | Format-List | Out-String)
    Start-Sleep -Seconds 15
}
4. Confirm that a folder named outputfiles is created in the spark folder of the adftutorial container, with the
output from the Spark program.
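A quick way to check from PowerShell, assuming the same storage context ($ctx) used for the uploads:

Get-AzStorageBlob -Container "adftutorial" -Prefix "spark/outputfiles" -Context $ctx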
Next steps
The pipeline in this sample transforms data by using a Spark activity and an on-demand HDInsight linked service. You
learned how to:
Create a data factory.
Author and deploy linked services.
Author and deploy a pipeline.
Start a pipeline run.
Monitor the pipeline run.
Advance to the next tutorial to learn how to transform data by running a Hive script on an Azure HDInsight cluster
that is in a virtual network.
Tutorial: transform data using Hive in Azure Virtual Network.
Run a Databricks notebook with the Databricks
Notebook Activity in Azure Data Factory
5/22/2019 • 5 minutes to read • Edit Online
In this tutorial, you use the Azure portal to create an Azure Data Factory pipeline that executes a Databricks
notebook against the Databricks jobs cluster. It also passes Azure Data Factory parameters to the Databricks
notebook during execution.
You perform the following steps in this tutorial:
Create a data factory.
Create a pipeline that uses Databricks Notebook Activity.
Trigger a pipeline run.
Monitor the pipeline run.
If you don't have an Azure subscription, create a free account before you begin.
For an eleven-minute introduction and demonstration of this feature, watch the following video:
Prerequisites
Azure Databricks workspace. Create a Databricks workspace or use an existing one. You create a Python
notebook in your Azure Databricks workspace. Then you execute the notebook and pass parameters to it using
Azure Data Factory.
2. Select Connections at the bottom of the window, and then select + New.
3. In the New Linked Service window, select Compute > Azure Databricks, and then select Continue.
4. In the New Linked Service window, complete the following steps:
a. For Name, enter AzureDatabricks_LinkedService.
b. Select the appropriate Databricks workspace that you will run your notebook in.
c. For Select cluster, select New job cluster.
d. For Domain/Region, the information should auto-populate.
e. For Access Token, generate it from your Azure Databricks workspace. You can find the steps here.
f. For Cluster version, select 4.2 (with Apache Spark 2.3.1, Scala 2.11).
g. For Cluster node type, select Standard_D3_v2 under the General Purpose (HDD) category for this
tutorial.
h. For Workers, enter 2.
i. Select Finish.
Create a pipeline
1. Select the + (plus) button, and then select Pipeline on the menu.
2. Create a parameter to be used in the pipeline. Later you pass this parameter to the Databricks Notebook
Activity. In the empty pipeline, click the Parameters tab, select New, and name the parameter 'name'.
3. In the Activities toolbox, expand Databricks. Drag the Notebook activity from the Activities toolbox to
the pipeline designer surface.
4. In the properties for the Databricks Notebook activity window at the bottom, complete the following
steps:
a. Switch to the Azure Databricks tab.
b. Select AzureDatabricks_LinkedService (which you created in the previous procedure).
c. Switch to the Settings tab.
d. Browse to select a Databricks notebook path. Let's create a notebook and specify the path here. You get
the notebook path by following the next few steps.
a. Launch your Azure Databricks workspace.
b. Create a new folder in the workspace and name it adftutorial.
c. Create a new notebook (Python) named mynotebook under the adftutorial folder, and then click Create.
d. In the newly created notebook "mynotebook", add the following code:
# Creates a text widget named "input" and reads the value passed in by Data Factory
dbutils.widgets.text("input", "", "")
y = dbutils.widgets.get("input")
print("Param -'input':")
print(y)
7. Select Publish All. The Data Factory UI publishes entities (linked services and pipeline) to the Azure Data
Factory service.
Trigger a pipeline run
Select Trigger on the toolbar, and then select Trigger Now.
The Pipeline Run dialog box asks for the name parameter. Use /path/filename as the parameter here. Click
Finish.
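You can also start the run from PowerShell; a sketch using Invoke-AzDataFactoryV2Pipeline, with the resource group, data factory, and pipeline names as placeholders for the values you used in this tutorial:

# The "name" parameter matches the pipeline parameter created earlier
Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" -PipelineName "<pipelineName>" -Parameter @{ "name" = "/path/filename" }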
Monitor the pipeline run
1. Switch to the Monitor tab. Confirm that you see a pipeline run. It takes approximately 5-8 minutes to create
a Databricks job cluster, where the notebook is executed.
You can switch back to the pipeline runs view by selecting the Pipelines link at the top.
Verify the output
You can log on to the Azure Databricks workspace and go to Clusters, where you can see the job status: pending
execution, running, or terminated.
You can click the job name and navigate to further details. On a successful run, you can validate the
parameters that were passed and the output of the Python notebook.
Next steps
The pipeline in this sample triggers a Databricks Notebook activity and passes a parameter to it. You learned how
to:
Create a data factory.
Create a pipeline that uses a Databricks Notebook activity.
Trigger a pipeline run.
Monitor the pipeline run.
Transform data in Azure Virtual Network using Hive
activity in Azure Data Factory
3/15/2019 • 9 minutes to read • Edit Online
In this tutorial, you use the Azure portal to create a Data Factory pipeline that transforms data by using a Hive activity on an
HDInsight cluster that is in an Azure virtual network (VNet). You perform the following steps in this tutorial:
Create a data factory.
Create a self-hosted integration runtime
Create Azure Storage and Azure HDInsight linked services
Create a pipeline with Hive activity.
Trigger a pipeline run.
Monitor the pipeline run
Verify the output
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure Storage account. You create a Hive script and upload it to Azure Storage. The output from the
Hive script is stored in this storage account. In this sample, the HDInsight cluster uses this Azure Storage
account as the primary storage.
Azure Virtual Network. If you don't have an Azure virtual network, create one by following these
instructions. In this sample, the HDInsight cluster is in an Azure virtual network. Here is a sample configuration of
the Azure virtual network.
HDInsight cluster. Create an HDInsight cluster and join it to the virtual network you created in the previous
step by following this article: Extend Azure HDInsight using an Azure Virtual Network. Here is a sample
configuration of HDInsight in a virtual network.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
A virtual machine. Create an Azure virtual machine (VM) and join it to the same virtual network that
contains your HDInsight cluster. For details, see How to create virtual machines.
Upload Hive script to your Blob Storage account
1. Create a Hive SQL file named hivescript.hql with the following content:
DROP TABLE IF EXISTS HiveSampleOut;
CREATE EXTERNAL TABLE HiveSampleOut (clientid string, market string, devicemodel string, state string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '${hiveconf:Output}';

INSERT INTO TABLE HiveSampleOut
SELECT clientid, market, devicemodel, state FROM hivesampletable;

The script writes rows from hivesampletable (the sample table included with every HDInsight cluster) to the location passed in through the Output argument.
2. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
3. Create a folder named hivescripts.
4. Upload the hivescript.hql file to the hivescripts subfolder.
5. Select your Azure subscription in which you want to create the data factory.
6. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. Select V2 for the version.
8. Select the location for the data factory. Only locations that are supported for creation of data factories are
shown in the list.
9. Select Pin to dashboard.
10. Click Create.
11. On the dashboard, you see the following tile with status: Deploying data factory.
12. After the creation is complete, you see the Data Factory page as shown in the image.
13. Click Author & Monitor to launch the Data Factory User Interface (UI) in a separate tab.
14. In the get started page, switch to the Edit tab in the left panel as shown in the following image:
Create a self-hosted integration runtime
As the Hadoop cluster is inside a virtual network, you need to install a self-hosted integration runtime (IR) in the
same virtual network. In this section, you create a new VM, join it to the same virtual network, and install the self-
hosted IR on it. The self-hosted IR allows the Data Factory service to dispatch processing requests to a compute
service such as HDInsight inside a virtual network. It also allows you to move data between data stores inside a
virtual network and Azure. You also use a self-hosted IR when the data store or compute is in an on-premises
environment.
1. In the Azure Data Factory UI, click Connections at the bottom of the window, switch to the Integration
Runtimes tab, and click + New button on the toolbar.
2. In the Integration Runtime Setup window, Select Perform data movement and dispatch activities to
external computes option, and click Next.
3. Select Private Network, and click Next.
4. Enter MySelfHostedIR for Name, and click Next.
5. Copy the authentication key for the integration runtime by clicking the copy button, and save it. Keep the
window open. You use this key to register the IR installed in a virtual machine.
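If you need the key again later, you can also retrieve it with PowerShell; a sketch with placeholder names:

Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" -Name "MySelfHostedIR"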
Install IR on a virtual machine
1. On the Azure VM, download self-hosted integration runtime. Use the authentication key obtained in the
previous step to manually register the self-hosted integration runtime.
2. You see the following message when the self-hosted integration runtime is registered successfully.
3. Click Launch Configuration Manager. You see the following page when the node is connected to the
cloud service:
Self-hosted IR in the Azure Data Factory UI
1. In the Azure Data Factory UI, you should see the name of the self-hosted IR and its status.
2. Click Finish to close the Integration Runtime Setup window. You see the self-hosted IR in the list of
integration runtimes.
2. In the New Linked Service window, select Azure Blob Storage, and click Continue.
Create a pipeline
In this step, you create a new pipeline with a Hive activity. The activity executes a Hive script to return data from a
sample table and save it to a path you defined.
Note the following points:
scriptPath points to the path of the Hive script on the Azure Storage account that you used for MyStorageLinkedService.
The path is case-sensitive.
Output is an argument used in the Hive script. Use the format
wasb://<Container>@<StorageAccount>.blob.core.windows.net/outputfolder/ to point it to an existing folder on
your Azure Storage account. The path is case-sensitive.
1. In the Data Factory UI, click + (plus) in the left pane, and click Pipeline.
2. In the Activities toolbox, expand HDInsight, and drag-drop Hive activity to the pipeline designer surface.
3. In the properties window, switch to the HDI Cluster tab, and select AzureHDInsightLinkedService for
HDInsight Linked Service.
4. You see only one activity run because the pipeline contains only one activity, of type HDInsightHive. To
switch back to the previous view, click the Pipelines link at the top.
5. Confirm that you see an output file in the outputfolder of the adftutorial container.
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create a self-hosted integration runtime
Create Azure Storage and Azure HDInsight linked services
Create a pipeline with Hive activity.
Trigger a pipeline run.
Monitor the pipeline run
Verify the output
Advance to the following tutorial to learn about branching and chaining control flow in Data Factory:
Branching and chaining Data Factory control flow
Transform data in Azure Virtual Network using Hive
activity in Azure Data Factory
4/11/2019 • 9 minutes to read • Edit Online
In this tutorial, you use Azure PowerShell to create a Data Factory pipeline that transforms data by using a Hive activity
on an HDInsight cluster that is in an Azure virtual network (VNet). You perform the following steps in this tutorial:
Create a data factory.
Author and set up a self-hosted integration runtime
Author and deploy linked services.
Author and deploy a pipeline that contains a Hive activity.
Start a pipeline run.
Monitor the pipeline run
Verify the output.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure Storage account. You create a Hive script and upload it to Azure Storage. The output from the
Hive script is stored in this storage account. In this sample, the HDInsight cluster uses this Azure Storage
account as the primary storage.
Azure Virtual Network. If you don't have an Azure virtual network, create one by following these
instructions. In this sample, the HDInsight cluster is in an Azure virtual network. Here is a sample configuration of
the Azure virtual network.
HDInsight cluster. Create an HDInsight cluster and join it to the virtual network you created in the previous
step by following this article: Extend Azure HDInsight using an Azure Virtual Network. Here is a sample
configuration of HDInsight in a virtual network.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Upload Hive script to your Blob Storage account
1. Create a Hive SQL file named hivescript.hql with the following content:

DROP TABLE IF EXISTS HiveSampleOut;
CREATE EXTERNAL TABLE HiveSampleOut (clientid string, market string, devicemodel string, state string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '${hiveconf:Output}';

INSERT INTO TABLE HiveSampleOut
SELECT clientid, market, devicemodel, state FROM hivesampletable;
2. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
3. Create a folder named hivescripts.
4. Upload the hivescript.hql file to the hivescripts subfolder.
Create a data factory
1. Set the resource group name. You create a resource group as part of this tutorial. However, you can use an
existing resource group if you like.
$resourceGroupName = "ADFTutorialResourceGroup"
$dataFactoryName = "MyDataFactory09142017"
$pipelineName = "MyHivePipeline"
4. Specify a name for the self-hosted integration runtime. You need a self-hosted integration runtime when the
Data Factory needs to access resources (such as Azure SQL Database) inside a VNet.
$selfHostedIntegrationRuntimeName = "MySelfHostedIR09142017"
5. Launch PowerShell. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen,
you need to run the commands again. For a list of Azure regions in which Data Factory is currently available,
select the regions that interest you on the following page, and then expand Analytics to locate Data
Factory: Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.
Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:
Connect-AzAccount
Run the following command to view all the subscriptions for this account:
Get-AzSubscription
Run the following command to select the subscription that you want to work with. Replace SubscriptionId
with the ID of your Azure subscription:
6. Create the resource group ADFTutorialResourceGroup if it does not already exist in your subscription.
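A minimal sketch of these setup commands, using the variables defined earlier; the East US location is an assumption, so use any region where Data Factory is available:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"
# East US is an assumption; use any region where Data Factory V2 is available
New-AzResourceGroup -Name $resourceGroupName -Location "East US"
Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName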
Create self-hosted IR
In this section, you create a self-hosted integration runtime and associate it with an Azure VM in the same Azure
Virtual Network where your HDInsight cluster is in.
1. Create the self-hosted integration runtime. Use a unique name if another integration runtime with the
same name already exists.
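A sketch of the creation and key-retrieval commands, using the variables defined earlier; the ConvertTo-Json output resembles the keys shown below:

# Create the self-hosted IR in the data factory
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $selfHostedIntegrationRuntimeName -Type SelfHosted
# Retrieve the authentication keys used to register the IR on the VM
Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $selfHostedIntegrationRuntimeName | ConvertTo-Json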
{
"AuthKey1": "IR@0000000000000000000000000000000000000=",
"AuthKey2": "IR@0000000000000000000000000000000000000="
}
You see the following page when the node is connected to the cloud service:
Author linked services
You author and deploy two Linked Services in this section:
An Azure Storage Linked Service that links an Azure Storage account to the data factory. This storage is the
primary storage used by your HDInsight cluster. In this case, we also use this Azure Storage account to keep the
Hive script and output of the script.
An HDInsight Linked Service. Azure Data Factory submits the Hive script to this HDInsight cluster for
execution.
Azure Storage linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure Storage linked
service, and then save the file as MyStorageLinkedService.json.
{
"name": "MyStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<storageAccountName>;AccountKey=
<storageAccountKey>",
"type": "SecureString"
}
},
"connectVia": {
"referenceName": "MySelfhostedIR",
"type": "IntegrationRuntimeReference"
}
}
}
Replace <storageAccountName> and <storageAccountKey> with the name and key of your Azure Storage account.
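A sketch of the deployment command, assuming the JSON file is saved in the current folder:

Set-AzDataFactoryV2LinkedService -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name "MyStorageLinkedService" -DefinitionFile ".\MyStorageLinkedService.json"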
HDInsight linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure HDInsight linked
service, and save the file as MyHDInsightLinkedService.json.
{
"name": "MyHDInsightLinkedService",
"properties": {
"type": "HDInsight",
"typeProperties": {
"clusterUri": "https://<clustername>.azurehdinsight.net",
"userName": "<username>",
"password": {
"value": "<password>",
"type": "SecureString"
},
"linkedServiceName": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
},
"connectVia": {
"referenceName": "MySelfhostedIR",
"type": "IntegrationRuntimeReference"
}
}
}
Update values for the following properties in the linked service definition:
userName. Name of the cluster login user that you specified when creating the cluster.
password. The password for the user.
clusterUri. Specify the URL of your HDInsight cluster in the following format:
https://<clustername>.azurehdinsight.net . This article assumes that you have access to the cluster over the
internet. For example, you can connect to the cluster at https://clustername.azurehdinsight.net . This
address uses the public gateway, which is not available if you have used network security groups (NSGs) or
user-defined routes (UDRs) to restrict access from the internet. For Data Factory to submit jobs to
HDInsight clusters in Azure Virtual Network, your Azure Virtual Network needs to be configured in such a
way that the URL can be resolved to the private IP address of the gateway used by HDInsight.
1. From Azure portal, open the Virtual Network the HDInsight is in. Open the network interface with
name starting with nic-gateway-0 . Note down its private IP address. For example, 10.6.0.15.
2. If your Azure Virtual Network has DNS server, update the DNS record so the HDInsight cluster URL
https://<clustername>.azurehdinsight.net can be resolved to 10.6.0.15 . This is the recommended
approach. If you don’t have a DNS server in your Azure Virtual Network, you can temporarily work
around this by editing the hosts file (C:\Windows\System32\drivers\etc) of all VMs that registered as
self-hosted integration runtime nodes by adding an entry like this:
10.6.0.15 myHDIClusterName.azurehdinsight.net
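After updating the values, the HDInsight linked service can be deployed the same way; a sketch:

Set-AzDataFactoryV2LinkedService -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name "MyHDInsightLinkedService" -DefinitionFile ".\MyHDInsightLinkedService.json"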
Author a pipeline
In this step, you create a new pipeline with a Hive activity. The activity executes Hive script to return data from a
sample table and save it to a path you defined. Create a JSON file in your preferred editor, copy the following
JSON definition of a pipeline definition, and save it as MyHivePipeline.json.
{
"name": "MyHivePipeline",
"properties": {
"activities": [
{
"name": "MyHiveActivity",
"type": "HDInsightHive",
"linkedServiceName": {
"referenceName": "MyHDILinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scriptPath": "adftutorial\\hivescripts\\hivescript.hql",
"getDebugInfo": "Failure",
"defines": {
"Output": "wasb://<Container>@<StorageAccount>.blob.core.windows.net/outputfolder/"
},
"scriptLinkedService": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
]
}
}
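A sketch of the deployment and run commands that precede the monitoring step; the last command captures the run ID that the script below expects in $runId:

Set-AzDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $pipelineName -DefinitionFile ".\MyHivePipeline.json"
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineName $pipelineName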
2. Run the following script to continuously check the pipeline run status until it finishes.
while ($True) {
    $result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)

    if(!$result) {
        Write-Host "Waiting for pipeline to start..." -foregroundcolor "Yellow"
    }
    elseif (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
        Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
    }
    else {
        Write-Host "Pipeline '"$pipelineName"' run finished. Result:" -foregroundcolor "Yellow"
        $result
        break
    }
    ($result | Format-List | Out-String)
    Start-Sleep -Seconds 15
}
ResourceGroupName : ADFV2SampleRG2
DataFactoryName : SampleV2DataFactory2
ActivityName : MyHiveActivity
PipelineRunId : 000000000-0000-0000-000000000000000000
PipelineName : MyHivePipeline
Input : {getDebugInfo, scriptPath, scriptLinkedService, defines}
Output :
LinkedServiceName :
ActivityRunStart : 9/18/2017 6:58:13 AM
ActivityRunEnd :
DurationInMs :
Status : InProgress
Error :
ResourceGroupName : ADFV2SampleRG2
DataFactoryName : SampleV2DataFactory2
ActivityName : MyHiveActivity
PipelineRunId : 0000000-0000-0000-0000-000000000000
PipelineName : MyHivePipeline
Input : {getDebugInfo, scriptPath, scriptLinkedService, defines}
Output : {logLocation, clusterInUse, jobId, ExecutionProgress...}
LinkedServiceName :
ActivityRunStart : 9/18/2017 6:58:13 AM
ActivityRunEnd : 9/18/2017 6:59:16 AM
DurationInMs : 63636
Status : Succeeded
Error : {errorCode, message, failureType, target}
3. Check the outputfolder folder for a new file created as the Hive query result. It should look like the following
sample output:
In this tutorial, you create a Data Factory pipeline that showcases some of the control flow features. This pipeline
does a simple copy from a container in Azure Blob Storage to another container in the same storage account. If the
copy activity succeeds, the pipeline sends details of the successful copy operation (such as the amount of data
written) in a success email. If the copy activity fails, the pipeline sends details of copy failure (such as the error
message) in a failure email. Throughout the tutorial, you see how to pass parameters.
A high-level overview of the scenario:
Prerequisites
Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
Azure Storage account. You use the blob storage as source data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one.
Azure SQL Database. You use the database as sink data store. If you don't have an Azure SQL Database, see
the Create an Azure SQL database article for steps to create one.
Create blob table
1. Launch Notepad. Copy the following text and save it as input.txt file on your disk.
John,Doe
Jane,Doe
For your request trigger, fill in the Request Body JSON Schema with the following JSON:
{
"properties": {
"dataFactoryName": {
"type": "string"
},
"message": {
"type": "string"
},
"pipelineName": {
"type": "string"
},
"receiver": {
"type": "string"
}
},
"type": "object"
}
The Request in the Logic App Designer should look like the following image:
For the Send Email action, customize how you wish to format the email, utilizing the properties passed in the
request Body JSON schema. Here is an example:
Save the workflow. Make a note of your HTTP Post request URL for your success email workflow:
The name of the Azure data factory must be globally unique. If you receive the following error, change the
name of the data factory (for example, yournameADFTutorialDataFactory) and try creating again. See Data
Factory - Naming Rules article for naming rules for Data Factory artifacts.
`Data factory name "ADFTutorialDataFactory" is not available`
4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down
list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.
11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) in a separate tab.
Create a pipeline
In this step, you create a pipeline with one Copy activity and two Web activities. You use the following features to
create the pipeline:
Parameters for the pipeline that are accessed by datasets.
Web activity to invoke logic apps workflows to send success/failure emails.
Connecting one activity with another activity (on success and failure)
Using output from an activity as an input to the subsequent activity
1. In the get started page of Data Factory UI, click the Create pipeline tile.
2. In the properties window for the pipeline, switch to the Parameters tab, and use the New button to add the
following three parameters of type String: sourceBlobContainer, sinkBlobContainer, and receiver.
sourceBlobContainer - parameter in the pipeline consumed by the source blob dataset.
sinkBlobContainer – parameter in the pipeline consumed by the sink blob dataset
receiver – this parameter is used by the two Web activities in the pipeline that send success or failure
emails to the receiver whose email address is specified by this parameter.
3. In the Activities toolbox, expand Data Flow, and drag-drop Copy activity to the pipeline designer surface.
4. In the Properties window for the Copy activity at the bottom, switch to the Source tab, and click + New.
You create a source dataset for the copy activity in this step.
5. In the New Dataset window, select Azure Blob Storage, and click Finish.
6. You see a new tab titled AzureBlob1. Change the name of the dataset to SourceBlobDataset.
7. Switch to the Connection tab in the Properties window, and click New for the Linked service. You create
a linked service to link your Azure Storage account to the data factory in this step.
8. In the New Linked Service window, do the following steps:
a. Enter AzureStorageLinkedService for Name.
b. Select your Azure storage account for the Storage account name.
c. Click Save.
9. Enter @pipeline().parameters.sourceBlobContainer for the folder and input.txt for the file name. You use the
sourceBlobContainer pipeline parameter to set the folder path for the dataset.
13. Switch to the pipeline tab (or) click the pipeline in the treeview. Confirm that SourceBlobDataset is selected
for Source Dataset.

13. In the properties window, switch to the Sink tab, and click + New for Sink Dataset. You create a sink
dataset for the copy activity in this step similar to the way you created the source dataset.
14. In the New Dataset window, select Azure Blob Storage, and click Finish.
15. In the General settings page for the dataset, enter SinkBlobDataset for Name.
16. Switch to the Connection tab, and do the following steps:
a. Select AzureStorageLinkedService for LinkedService.
b. Enter @pipeline().parameters.sinkBlobContainer for the folder.
c. Enter @CONCAT(pipeline().RunId, '.txt') for the file name. The expression uses the ID of the current
pipeline run for the file name. For the supported list of system variables and expressions, see System
variables and Expression language.
17. Switch to the pipeline tab at the top. Expand General in the Activities toolbox, and drag-drop a Web
activity to the pipeline designer surface. Set the name of the activity to SendSuccessEmailActivity. The
Web Activity allows a call to any REST endpoint. For more information about the activity, see Web Activity.
This pipeline uses a Web Activity to call the Logic Apps email workflow.
18. Switch to the Settings tab from the General tab, and do the following steps:
a. For URL, specify URL for the logic apps workflow that sends the success email.
b. Select POST for Method.
c. Click + Add header link in the Headers section.
d. Add a header Content-Type and set it to application/json.
e. Specify the following JSON for Body.
{
"message": "@{activity('Copy1').output.dataWritten}",
"dataFactoryName": "@{pipeline().DataFactory}",
"pipelineName": "@{pipeline().Pipeline}",
"receiver": "@pipeline().parameters.receiver"
}
19. Connect the Copy activity to the Web activity by dragging the green button next to the Copy activity and
dropping on the Web activity.
20. Drag-drop another Web activity from the Activities toolbox to the pipeline designer surface, and set the
name to SendFailureEmailActivity.
22. Select the Copy activity in the pipeline designer, click the +-> button, and then select Error.
23. Drag the red button next to the Copy activity to the second Web activity SendFailureEmailActivity. You
can move the activities around so that the pipeline looks like in the following image:
24. To validate the pipeline, click Validate button on the toolbar. Close the Pipeline Validation Output
window by clicking the >> button.
25. To publish the entities (datasets, pipelines, etc.) to Data Factory service, select Publish All. Wait until you
see the Successfully published message.
Trigger a pipeline run that succeeds
1. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger Now.
2. To view activity runs associated with this pipeline run, click the first link in the Actions column. You can
switch back to the previous view by clicking Pipelines at the top. Use the Refresh button to refresh the list.
3. To view activity runs associated with this pipeline run, click the first link in the Actions column. Use the
Refresh button to refresh the list. Notice that the Copy activity in the pipeline failed. The Web activity
succeeded in sending the failure email to the specified receiver.
4. Click Error link in the Actions column to see details about the error.
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create an Azure Storage linked service.
Create an Azure Blob dataset
Create a pipeline that contains a copy activity and a web activity
Send outputs of activities to subsequent activities
Utilize parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
You can now proceed to the Concepts section for more information about Azure Data Factory.
Pipelines and activities
Branching and chaining activities in a Data Factory
pipeline
3/29/2019 • 14 minutes to read • Edit Online
In this tutorial, you create a Data Factory pipeline that showcases some of the control flow features. This pipeline
does a simple copy from a container in Azure Blob Storage to another container in the same storage account. If
the copy activity succeeds, you want to send details of the successful copy operation (such as the amount of data
written) in a success email. If the copy activity fails, you want to send details of copy failure (such as the error
message) in a failure email. Throughout the tutorial, you see how to pass parameters.
A high-level overview of the scenario:
Prerequisites
Azure Storage account. You use the blob storage as source data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one.
Azure SQL Database. You use the database as sink data store. If you don't have an Azure SQL Database, see
the Create an Azure SQL database article for steps to create one.
Visual Studio 2013, 2015, or 2017. The walkthrough in this article uses Visual Studio 2017.
Download and install Azure .NET SDK.
Create an application in Azure Active Directory following these instructions. Make note of the following
values that you use in later steps: application ID, authentication key, and tenant ID. Assign application to
"Contributor" role by following instructions in the same article.
Create blob table
1. Launch Notepad. Copy the following text and save it as input.txt file on your disk.
John|Doe
Jane|Doe
2. Use tools such as Azure Storage Explorer to create the adfv2branch container, and to upload the input.txt
file to the container.
Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
2. Add these static variables to the Program class. Replace the placeholders with your own values. For a list of
Azure regions in which Data Factory is currently available, select the regions that interest you on the
following page, and then expand Analytics to locate Data Factory: Products available by region. The data
stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data factory can
be in other regions.
// Set variables
static string tenantID = "<tenant ID>";
static string applicationId = "<application ID>";
static string authenticationKey = "<Authentication key for your application>";
static string subscriptionId = "<Azure subscription ID>";
static string resourceGroup = "<Azure resource group name>";
3. Add the following code to the Main method that creates an instance of DataFactoryManagementClient
class. You use this object to create data factory, linked service, datasets, and pipeline. You also use this
object to monitor the pipeline run details.
Factory response;
{
response = client.Factories.CreateOrUpdate(resourceGroup, dataFactoryName, resource);
}
Add the following code to Main method that creates a data factory.
Factory df = CreateOrUpdateDataFactory(client);
Add the following code to the Main method that creates an Azure Storage linked service. Learn more from
Azure Blob linked service properties on supported properties and details.
Create datasets
In this section, you create two datasets: one for the source and the other for the sink.
Create a dataset for source Azure Blob
Add the following code to the Main method that creates an Azure blob dataset. Learn more from Azure Blob
dataset properties on supported properties and details.
You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure Storage
linked service you create in the previous step, and describes:
The location of the blob to copy from: FolderPath and FileName;
Notice the use of parameters for the FolderPath. “sourceBlobContainer” is the name of the parameter and the
expression is replaced with the values passed in the pipeline run. The syntax to define parameters is
@pipeline().parameters.<parameterName>
Add the following code to the Main method that creates both Azure Blob source and sink datasets.
class EmailRequest
{
    [Newtonsoft.Json.JsonProperty(PropertyName = "message")]
    public string message;
    [Newtonsoft.Json.JsonProperty(PropertyName = "dataFactoryName")]
    public string dataFactoryName;
    [Newtonsoft.Json.JsonProperty(PropertyName = "pipelineName")]
    public string pipelineName;
    [Newtonsoft.Json.JsonProperty(PropertyName = "receiver")]
    public string receiver;

    // Constructor used later when the Web activities build their request bodies
    public EmailRequest(string input, string df, string pipeline, string receiverValue)
    {
        message = input;
        dataFactoryName = df;
        pipelineName = pipeline;
        receiver = receiverValue;
    }
}
For your request trigger, fill in the Request Body JSON Schema with the following JSON:
{
"properties": {
"dataFactoryName": {
"type": "string"
},
"message": {
"type": "string"
},
"pipelineName": {
"type": "string"
},
"receiver": {
"type": "string"
}
},
"type": "object"
}
This aligns with the EmailRequest class you created in the previous section.
Your Request should look like this in the Logic App Designer:
For the Send Email action, customize how you wish to format the email, utilizing the properties passed in the
request Body JSON schema. Here is an example:
Make a note of your HTTP Post request URL for your success email workflow:
Make a note of your HTTP Post request URL for your failure email workflow:
Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity and dependsOn property.
In this tutorial, the pipeline contains one activity: a copy activity, which takes in a Blob dataset as a source and
another Blob dataset as a sink. Depending on whether the copy activity succeeds or fails, it calls different email tasks.
In this pipeline, you use the following features:
Parameters
Web Activity
Activity dependency
Using output from an activity as an input to the subsequent activity
Let’s break down the following pipeline section by section:
},
Activities = new List<Activity>
{
new CopyActivity
{
Name = copyBlobActivity,
Inputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobSourceDatasetName
}
},
Outputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobSinkDatasetName
}
},
Source = new BlobSource { },
Sink = new BlobSink { }
},
new WebActivity
{
Name = sendSuccessEmailActivity,
Method = WebActivityMethod.POST,
Url =
"https://prodxxx.eastus.logic.azure.com:443/workflows/00000000000000000000000000000000000/triggers/manual/path
s/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=0000000000000000000000000000000000000000000000",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').output.dataWritten}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Succeeded" }
}
}
},
new WebActivity
{
Name = sendFailEmailActivity,
Method =WebActivityMethod.POST,
Url =
"https://prodxxx.eastus.logic.azure.com:443/workflows/000000000000000000000000000000000/triggers/manual/paths/
invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=0000000000000000000000000000000000000000000",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').error.message}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Failed" }
}
}
}
}
};
Console.WriteLine(SafeJsonConvert.SerializeObject(resource, client.SerializationSettings));
return resource;
}
Add the following code to the Main method that creates the pipeline:
Parameters
The first section of our pipeline defines parameters.
sourceBlobContainer - parameter in the pipeline consumed by the source blob dataset.
sinkBlobContainer – parameter in the pipeline consumed by the sink blob dataset
receiver – this parameter is used by the two Web activities in the pipeline that send success or failure emails to
the receiver whose email address is specified by this parameter.
Web Activity
The Web Activity allows a call to any REST endpoint. For more information about the activity, see Web Activity.
This pipeline uses a Web Activity to call the Logic Apps email workflow. You create two web activities: one that
calls to the CopySuccessEmail workflow and one that calls the CopyFailWorkFlow.
new WebActivity
{
Name = sendCopyEmailActivity,
Method = WebActivityMethod.POST,
Url = "https://prodxxx.eastus.logic.azure.com:443/workflows/12345",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').output.dataWritten}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Succeeded" }
}
}
}
In the “Url” property, paste the Request URL endpoints from your Logic Apps workflow accordingly. In the “Body”
property, pass an instance of the “EmailRequest” class. The email request contains the following properties:
Message – Passes the value of @{activity('CopyBlobtoBlob').output.dataWritten} . This accesses a property of the
previous copy activity and passes the value of dataWritten. For the failure case, pass the error output
@{activity('CopyBlobtoBlob').error.message} instead.
Data Factory Name – Passes the value of @{pipeline().DataFactory} . This is a system variable, allowing you to
access the corresponding data factory name. For a list of system variables, see the System Variables article.
Pipeline Name – Passes the value of @{pipeline().Pipeline} . This is also a system variable, allowing you to
access the corresponding pipeline name.
Receiver – Passes the value of @pipeline().parameters.receiver . This accesses the pipeline parameters.
This code creates a new activity dependency on the previous copy activity, so that the web activity runs only when the copy succeeds.
Main class
Your final Main method should look like this. Build and run your program to trigger a pipeline run!
// Authenticate and create a data factory management client
var context = new AuthenticationContext("https://login.windows.net/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync("https://management.azure.com/", cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) { SubscriptionId = subscriptionId };
Factory df = CreateOrUpdateDataFactory(client);
2. Add the following code to the Main method that retrieves copy activity run details, for example, size of the
data read/written.
// Check the copy activity run details
Console.WriteLine("Checking copy activity run details...");
if (pipelineRun.Status == "Succeeded")
{
Console.WriteLine(activityRuns.First().Output);
//SaveToJson(SafeJsonConvert.SerializeObject(activityRuns.First().Output,
client.SerializationSettings), "ActivityRunResult.json", folderForJsons);
}
else
Console.WriteLine(activityRuns.First().Error);
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create an Azure Storage linked service.
Create an Azure Blob dataset
Create a pipeline that contains a copy activity and a web activity
Send outputs of activities to subsequent activities
Utilize parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
You can now proceed to the Concepts section for more information about Azure Data Factory.
Pipelines and activities
Provision the Azure-SSIS Integration Runtime in
Azure Data Factory
3/5/2019 • 9 minutes to read • Edit Online
This tutorial provides steps for using the Azure portal to provision an Azure-SSIS integration runtime (IR) in Azure
Data Factory. Then, you can use SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS) to
deploy and run SQL Server Integration Services (SSIS) packages in this runtime in Azure. For conceptual
information on Azure-SSIS IRs, see Azure-SSIS integration runtime overview.
In this tutorial, you complete the following steps:
Create a data factory.
Provision an Azure-SSIS integration runtime.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
Azure SQL Database server. If you don't already have a database server, create one in the Azure portal before
you get started. Azure Data Factory creates the SSIS Catalog (SSISDB database) on this database server. We
recommend that you create the database server in the same Azure region as the integration runtime. This
configuration lets the integration runtime write execution logs to the SSISDB database without crossing Azure
regions.
Based on the selected database server, SSISDB can be created on your behalf as a single database, part of an
elastic pool, or in a Managed Instance and accessible in public network or by joining a virtual network. If you
use Azure SQL Database with virtual network service endpoints/Managed Instance to host SSISDB or require
access to on-premises data, you need to join your Azure-SSIS IR to a virtual network, see Create Azure-SSIS IR
in a virtual network.
Confirm that the Allow access to Azure services setting is enabled for the database server. This is not
applicable when you use Azure SQL Database with virtual network service endpoints/Managed Instance to
host SSISDB. For more information, see Secure your Azure SQL database. To enable this setting by using
PowerShell, see New-AzSqlServerFirewallRule (a sketch follows this list).
Add the IP address of the client machine, or a range of IP addresses that includes the IP address of the client
machine, to the client IP address list in the firewall settings for the database server. For more information, see
Azure SQL Database server-level and database-level firewall rules.
You can connect to the database server using SQL authentication with your server admin credentials or Azure
Active Directory (AAD ) authentication with the managed identity for your Azure Data Factory (ADF ). For the
latter, you need to add the managed identity for your ADF into an AAD group with access permissions to the
database server, see Create Azure-SSIS IR with AAD authentication.
Confirm that your Azure SQL Database server does not have an SSIS Catalog (SSISDB database). The
provisioning of an Azure-SSIS IR does not support using an existing SSIS Catalog.
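A sketch of these firewall settings from PowerShell; the server name, resource group name, and client IP address are placeholders:

# Allow Azure services (including the Azure-SSIS IR) to reach the database server
New-AzSqlServerFirewallRule -ResourceGroupName "<resourceGroupName>" -ServerName "<serverName>" -AllowAllAzureIPs
# Allow your client machine's IP address
New-AzSqlServerFirewallRule -ResourceGroupName "<resourceGroupName>" -ServerName "<serverName>" -FirewallRuleName "ClientIPRule" -StartIpAddress "<clientIP>" -EndIpAddress "<clientIP>"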
NOTE
For a list of Azure regions in which Data Factory and Azure-SSIS Integration Runtime are currently available, see ADF +
SSIS IR availability by region.
5. For Subscription, select your Azure subscription in which you want to create the data factory.
6. For Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. For Version, select V2 (Preview).
8. For Location, select the location for the data factory. The list shows only locations that are supported for the
creation of data factories.
9. Select Pin to dashboard.
10. Select Create.
11. On the dashboard, you see the following tile with the status Deploying data factory:
12. After the creation is complete, you see the Data factory page.
13. Select Author & Monitor to open the Data Factory user interface (UI) on a separate tab.
2. For the remaining steps to set up an Azure-SSIS IR, see the Provision an Azure-SSIS integration runtime
section.
From the Authoring UI
1. In the Azure Data Factory UI, switch to the Edit tab, select Connections, and then switch to the
Integration Runtimes tab to view existing integration runtimes in your data factory.
2. Select New to create an Azure-SSIS IR.
3. In the Integration Runtime Setup window, select Lift-and-shift existing SSIS packages to execute in
Azure, and then select Next.
4. For the remaining steps to set up an Azure-SSIS IR, see the Provision an Azure-SSIS integration runtime
section.
a. For Subscription, select the Azure subscription that has your database server to host SSISDB.
b. For Location, select the location of your database server to host SSISDB. We recommend that you select
the same location of your integration runtime.
c. For Catalog Database Server Endpoint, select the endpoint of your database server to host SSISDB.
Based on the selected database server, SSISDB can be created on your behalf as a single database, part of
an elastic pool, or in a Managed Instance and accessible in public network or by joining a virtual network.
For guidance in choosing the type of database server to host SSISDB, see Compare Azure SQL Database
single databases/elastic pools and Managed Instance. If you select Azure SQL Database with virtual
network service endpoints/Managed Instance to host SSISDB or require access to on-premises data, you
need to join your Azure-SSIS IR to a virtual network. See Create Azure-SSIS IR in a virtual network.
d. On Use AAD authentication... checkbox, select the authentication method for your database server to
host SSISDB: SQL or Azure Active Directory (AAD ) with the managed identity for your Azure Data Factory
(ADF ). If you check it, you need to add the managed identity for your ADF into an AAD group with access
permissions to the database server, see Create Azure-SSIS IR with AAD authentication.
e. For Admin Username, enter SQL authentication username for your database server to host SSISDB.
f. For Admin Password, enter SQL authentication password for your database server to host SSISDB.
g. For Catalog Database Service Tier, select the service tier for your database server to host SSISDB:
Basic/Standard/Premium tier or elastic pool name.
h. Click Test Connection and if successful, click Next.
3. On the Advanced Settings page, complete the following steps:
a. For Maximum Parallel Executions Per Node, select the maximum number of packages to execute
concurrently per node in your integration runtime cluster. Only supported package numbers are displayed.
Select a low number if you want to use more than one core to run a single large, compute- or memory-intensive
package. Select a high number if you want to run one or more small, lightweight
packages on a single core.
b. For Custom Setup Container SAS URI, optionally enter Shared Access Signature (SAS ) Uniform
Resource Identifier (URI) of your Azure Storage Blob container where your setup script and its associated
files are stored, see Custom setup for Azure-SSIS IR.
c. On Select a VNet... checkbox, select whether you want to join your integration runtime to a virtual
network. You should check it if you use Azure SQL Database with virtual network service
endpoints/Managed Instance to host SSISDB or require access to on-premises data, see Create Azure-SSIS
IR in a virtual network.
4. Click Finish to start the creation of your integration runtime.
IMPORTANT
This process takes approximately 20 to 30 minutes to complete.
The Data Factory service connects to your Azure SQL Database server to prepare the SSIS Catalog (SSISDB database).
When you provision an instance of an Azure-SSIS IR, the Azure Feature Pack for SSIS and the Access Redistributable
are also installed. These components provide connectivity to Excel and Access files and to various Azure data sources,
in addition to the data sources supported by the built-in components. You can also install additional components. For
more info, see Custom setup for the Azure-SSIS integration runtime.
5. On the Connections tab, switch to Integration Runtimes if needed. Select Refresh to refresh the status.
6. Use the links in the Actions column to stop/start, edit, or delete the integration runtime. Use the last link to
view JSON code for the integration runtime. The edit and delete buttons are enabled only when the IR is
stopped.
Next steps
In this tutorial, you learned how to:
Create a data factory.
Provision an Azure-SSIS integration runtime.
To learn about customizing your Azure-SSIS integration runtime, advance to the following article:
Customize Azure-SSIS IR
Provision the Azure-SSIS Integration Runtime in
Azure Data Factory with PowerShell
3/15/2019 • 11 minutes to read • Edit Online
This tutorial provides steps for provisioning an Azure-SSIS integration runtime (IR ) in Azure Data Factory. Then,
you can use SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS ) to deploy and run SQL
Server Integration Services (SSIS ) packages in this runtime in Azure. In this tutorial, you do the following steps:
NOTE
This article uses Azure PowerShell to provision an Azure SSIS IR. To use the Data Factory user interface (UI) to provision an
Azure SSIS IR, see Tutorial: Create an Azure SSIS integration runtime.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure subscription. If you don't have an Azure subscription, create a free account before you begin. For
conceptual information on Azure-SSIS IR, see Azure-SSIS integration runtime overview.
Azure SQL Database server. If you don't already have a database server, create one in the Azure portal before
you get started. This server hosts the SSIS Catalog database (SSISDB ). We recommend that you create the
database server in the same Azure region as the integration runtime. This configuration lets the integration
runtime write execution logs to SSISDB without crossing Azure regions.
Based on the selected database server, SSISDB can be created on your behalf as a single database, part
of an elastic pool, or in a Managed Instance and accessible in public network or by joining a virtual
network. For guidance in choosing the type of database server to host SSISDB, see Compare Azure SQL
Database single databases/elastic pools and Managed Instance. If you use Azure SQL Database with
virtual network service endpoints/Managed Instance to host SSISDB or require access to on-premises
data, you need to join your Azure-SSIS IR to a virtual network. See Create Azure-SSIS IR in a virtual
network.
Confirm that the "Allow access to Azure services" setting is ON for the database server. This setting is
not applicable when you use Azure SQL Database with virtual network service endpoints/Managed
Instance to host SSISDB. For more information, see Secure your Azure SQL database. To enable this
setting by using PowerShell, see New-AzSqlServerFirewallRule.
Add the IP address of the client machine, or a range of IP addresses that includes the IP address of the client
machine, to the client IP address list in the firewall settings for the database server. For more information,
see Azure SQL Database server-level and database-level firewall rules.
You can connect to the database server using SQL authentication with your server admin credentials or
Azure Active Directory (AAD ) authentication with the managed identity for your Azure Data Factory. For
the latter, you need to add the managed identity for your ADF to an AAD group with access
permissions to the database server. For details, see Create Azure-SSIS IR with AAD authentication.
Confirm that your Azure SQL Database server does not have an SSIS Catalog (SSISDB database). The
provisioning of Azure-SSIS IR does not support using an existing SSIS Catalog.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell. You use
PowerShell to run a script to provision an Azure-SSIS integration runtime that runs SSIS packages in the cloud.
NOTE
For a list of Azure regions in which Data Factory and Azure-SSIS Integration Runtime are currently available, see ADF +
SSIS IR availability by region.
Create variables
Copy and paste the following script, and specify values for the variables. For a list of supported pricing tiers for Azure
SQL Database, see SQL Database resource limits.
# Azure Data Factory information
# If your input contains a PSH special character, e.g. "$", precede it with the escape character "`" like "`$"
$SubscriptionName = "[Azure subscription name]"
$ResourceGroupName = "[Azure resource group name]"
# Data factory name. Must be globally unique
$DataFactoryName = "[Data factory name]"
$DataFactoryLocation = "EastUS"
# Azure-SSIS integration runtime information - This is a Data Factory compute resource for running SSIS packages
$AzureSSISName = "[Specify a name for your Azure-SSIS IR]"
$AzureSSISDescription = "[Specify a description for your Azure-SSIS IR]"
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium features on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your own on-premises SQL Server license to earn cost savings from Azure Hybrid Benefit (AHB) option
# For a Standard_D1_v2 node, 1-4 parallel executions per node are supported, but for other nodes, 1-8 are currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info
$SetupScriptContainerSasUri = "" # OPTIONAL to provide SAS URI of blob container where your custom setup script and its associated files are stored
# SSISDB info
$SSISDBServerEndpoint = "[your Azure SQL Database server name].database.windows.net" # WARNING: Please ensure that there is no existing SSISDB, so we can prepare and manage one on your behalf
$SSISDBServerAdminUserName = "[your server admin username for SQL authentication]"
$SSISDBServerAdminPassword = "[your server admin password for SQL authentication]"
# For the basic pricing tier, specify "Basic", not "B" - For standard/premium/elastic pool tiers, specify "S0", "S1", "S2", "S3", etc.
$SSISDBPricingTier = "[Basic|S0|S1|S2|S3|S4|S6|S7|S9|S12|P1|P2|P4|P6|P11|P15|…|ELASTIC_POOL(name = <elastic_pool_name>)]"
To create an Azure SQL Database server as part of the script, set values for the variables that haven't been
defined already (for example, SSISDBServerName and FirewallIPAddress), and then use commands like the ones
in the following example.
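A minimal sketch of that example, assuming you create a new logical server with New-AzSqlServer and open the firewall with New-AzSqlServerFirewallRule (the rule names below are illustrative placeholders):
$SSISDBServerName = "[your Azure SQL Database server name]"
$FirewallIPAddress = "[your client machine's IP address]"
$secpasswd = ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force
$serverCreds = New-Object System.Management.Automation.PSCredential ($SSISDBServerAdminUserName, $secpasswd)
# Create the Azure SQL Database logical server that will host SSISDB.
New-AzSqlServer -ResourceGroupName $ResourceGroupName `
                -ServerName $SSISDBServerName `
                -Location $DataFactoryLocation `
                -SqlAdministratorCredentials $serverCreds
# Allow your client machine to reach the server.
New-AzSqlServerFirewallRule -ResourceGroupName $ResourceGroupName `
                            -ServerName $SSISDBServerName `
                            -FirewallRuleName "ClientIPAddress" `
                            -StartIpAddress $FirewallIPAddress `
                            -EndIpAddress $FirewallIPAddress
# Turn on the "Allow access to Azure services" setting (the 0.0.0.0 rule is the documented convention for it).
New-AzSqlServerFirewallRule -ResourceGroupName $ResourceGroupName `
                            -ServerName $SSISDBServerName `
                            -FirewallRuleName "AllowAllAzureIps" `
                            -StartIpAddress "0.0.0.0" `
                            -EndIpAddress "0.0.0.0"
The script then signs in and continues with the integration runtime commands: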
Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName
if(![string]::IsNullOrEmpty($SetupScriptContainerSasUri))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-SetupScriptContainerSasUri $SetupScriptContainerSasUri
}
write-host("##### Starting your Azure-SSIS integration runtime. This command takes 20 to 30 minutes to
complete. #####")
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force
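Before the custom-setup and start commands above can succeed, the data factory and the Azure-SSIS IR itself must already exist. A minimal sketch of creating them with the variables defined earlier (a hedged outline, not the article's exact script; verify parameter names against your installed Az.DataFactory module):
# Sketch: create the data factory, then create the Azure-SSIS IR that will host SSISDB on the specified server.
Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `
                    -Location $DataFactoryLocation `
                    -Name $DataFactoryName
$secpasswd = ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force
$serverCreds = New-Object System.Management.Automation.PSCredential ($SSISDBServerAdminUserName, $secpasswd)
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
                                      -Name $AzureSSISName `
                                      -Description $AzureSSISDescription `
                                      -Type Managed `
                                      -Location $AzureSSISLocation `
                                      -NodeSize $AzureSSISNodeSize `
                                      -NodeCount $AzureSSISNodeNumber `
                                      -Edition $AzureSSISEdition `
                                      -LicenseType $AzureSSISLicenseType `
                                      -MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode `
                                      -CatalogServerEndpoint $SSISDBServerEndpoint `
                                      -CatalogAdminCredential $serverCreds `
                                      -CatalogPricingTier $SSISDBPricingTier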
Full script
The PowerShell script in this section configures an instance of Azure-SSIS integration runtime in the cloud that
runs SSIS packages. After you run this script successfully, you can deploy and run SSIS packages in the Microsoft
Azure cloud with SSISDB hosted in Azure SQL Database.
1. Launch the Windows PowerShell Integrated Scripting Environment (ISE ).
2. In the ISE, run the following command from the command prompt.
3. Copy the PowerShell script in this section and paste it into the ISE.
4. Provide appropriate values for all parameters at the beginning of the script.
5. Run the script. The Start-AzDataFactoryV2IntegrationRuntime command near the end of the script runs for 20 to
30 minutes.
NOTE
The script connects to your Azure SQL Database server to prepare the SSIS Catalog database (SSISDB).
When you provision an instance of Azure-SSIS IR, the Azure Feature Pack for SSIS and the Access Redistributable are
also installed. These components provide connectivity to Excel and Access files and to various Azure data sources, in
addition to the data sources supported by the built-in components. You can also install additional components. For
more info, see Custom setup for the Azure-SSIS integration runtime.
For a list of supported pricing tiers for Azure SQL Database, see SQL Database resource limits.
For a list of Azure regions in which Data Factory and Azure-SSIS Integration Runtime are currently available, see
ADF + SSIS IR availability by region.
# Azure-SSIS integration runtime information - This is a Data Factory compute resource for running SSIS packages
$AzureSSISName = "[Specify a name for your Azure-SSIS IR]"
$AzureSSISDescription = "[Specify a description for your Azure-SSIS IR]"
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium features on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your own on-premises SQL Server license to earn cost savings from Azure Hybrid Benefit (AHB) option
# For a Standard_D1_v2 node, 1-4 parallel executions per node are supported, but for other nodes, 1-8 are currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info
$SetupScriptContainerSasUri = "" # OPTIONAL to provide SAS URI of blob container where your custom setup script and its associated files are stored
# SSISDB info
$SSISDBServerEndpoint = "[your Azure SQL Database server name].database.windows.net" # WARNING: Please ensure that there is no existing SSISDB, so we can prepare and manage one on your behalf
$SSISDBServerAdminUserName = "[your server admin username for SQL authentication]"
$SSISDBServerAdminPassword = "[your server admin password for SQL authentication]"
# For the basic pricing tier, specify "Basic", not "B" - For standard/premium/elastic pool tiers, specify "S0", "S1", "S2", "S3", etc.
$SSISDBPricingTier = "[Basic|S0|S1|S2|S3|S4|S6|S7|S9|S12|P1|P2|P4|P6|P11|P15|…|ELASTIC_POOL(name = <elastic_pool_name>)]"
Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName
if(![string]::IsNullOrEmpty($SetupScriptContainerSasUri))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-SetupScriptContainerSasUri $SetupScriptContainerSasUri
}
write-host("##### Starting your Azure-SSIS integration runtime. This command takes 20 to 30 minutes to
complete. #####")
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force
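After the start command completes, you can optionally confirm that the integration runtime is up. A minimal sketch using the status switch of the Get cmdlet (the State property should report Started once provisioning has finished):
# Check the provisioning state of the Azure-SSIS IR.
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
                                      -Name $AzureSSISName `
                                      -Status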
Next steps
In this tutorial, you learned how to:
Create a data factory.
Create an Azure-SSIS integration runtime
Start the Azure-SSIS integration runtime
Deploy SSIS packages
Review the complete script
To learn about customizing your Azure-SSIS integration runtime, advance to the following article:
Customize Azure-SSIS IR
Azure PowerShell samples for Azure Data Factory
3/7/2019 • 2 minutes to read • Edit Online
The following table includes links to sample Azure PowerShell scripts for Azure Data Factory.
Copy data
Copy blobs from a folder to another folder in an Azure Blob Storage: This PowerShell script copies blobs from a folder in Azure Blob Storage to another folder in the same Blob Storage.
Copy data from on-premises SQL Server to Azure Blob Storage: This PowerShell script copies data from an on-premises SQL Server database to an Azure blob storage.
Bulk copy: This sample PowerShell script copies data from multiple tables in an Azure SQL database to an Azure SQL data warehouse.
Incremental copy: This sample PowerShell script loads only new or updated records from a source data store to a sink data store after the initial full copy of data from the source to the sink.
Transform data
Transform data using a Spark cluster: This PowerShell script transforms data by running a program on a Spark cluster.
Create Azure-SSIS integration runtime: This PowerShell script provisions an Azure-SSIS integration runtime that runs SQL Server Integration Services (SSIS) packages in Azure.
Pipelines and activities in Azure Data Factory
5/15/2019 • 14 minutes to read • Edit Online
This article helps you understand pipelines and activities in Azure Data Factory and use them to
construct end-to-end data-driven workflows for your data movement and data processing scenarios.
Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that
together perform a task. For example, a pipeline could contain a set of activities that ingest and clean
log data, and then kick off a Spark job on an HDInsight cluster to analyze the log data. The beauty of
this is that the pipeline allows you to manage the activities as a set instead of each one individually.
For example, you can deploy and schedule the pipeline, instead of the activities independently.
The activities in a pipeline define actions to perform on your data. For example, you may use a copy
activity to copy data from an on-premises SQL Server to an Azure Blob Storage. Then, use a Hive
activity that runs a Hive script on an Azure HDInsight cluster to process/transform data from the
blob storage to produce output data. Finally, use a second copy activity to copy the output data to an
Azure SQL Data Warehouse on top of which business intelligence (BI) reporting solutions are built.
Data Factory supports three types of activities: data movement activities, data transformation
activities, and control activities. An activity can take zero or more input datasets and produce one or
more output datasets. The following diagram shows the relationship between pipeline, activity, and
dataset in Data Factory:
An input dataset represents the input for an activity in the pipeline and an output dataset represents
the output for the activity. Datasets identify data within different data stores, such as tables, files,
folders, and documents. After you create a dataset, you can use it with activities in a pipeline. For
example, a dataset can be an input/output dataset of a Copy Activity or an HDInsight Hive Activity.
For more information about datasets, see the Datasets in Azure Data Factory article.
CATEGORY | DATA STORE | SUPPORTED AS A SOURCE | SUPPORTED AS A SINK | SUPPORTED BY AZURE IR | SUPPORTED BY SELF-HOSTED IR
Azure Cosmos DB (SQL API) ✓ ✓ ✓ ✓
Azure Cosmos DB's API for MongoDB ✓ ✓ ✓ ✓
Azure Data ✓ ✓ ✓ ✓
Explorer
Azure Data ✓ ✓ ✓ ✓
Lake Storage
Gen1
Azure Data ✓ ✓ ✓ ✓
Lake Storage
Gen2
Azure ✓ ✓ ✓
Database for
MariaDB
Azure ✓ ✓ ✓
Database for
MySQL
Azure ✓ ✓ ✓
Database for
PostgreSQL
Azure File ✓ ✓ ✓ ✓
Storage
Azure SQL ✓ ✓ ✓ ✓
Database
Azure SQL ✓ ✓ ✓
Database
Managed
Instance
Azure SQL ✓ ✓ ✓ ✓
Data
Warehouse
Azure Search ✓ ✓ ✓
Index
Azure Table ✓ ✓ ✓ ✓
Storage
Database Amazon ✓ ✓ ✓
Redshift
DB2 ✓ ✓ ✓
Drill (Preview) ✓ ✓ ✓
Google ✓ ✓ ✓
BigQuery
Greenplum ✓ ✓ ✓
HBase ✓ ✓ ✓
Hive ✓ ✓ ✓
Apache Impala ✓ ✓ ✓
(Preview)
Informix ✓ ✓
MariaDB ✓ ✓ ✓
Microsoft ✓ ✓
Access
MySQL ✓ ✓ ✓
Netezza ✓ ✓ ✓
Oracle ✓ ✓ ✓ ✓
Phoenix ✓ ✓ ✓
PostgreSQL ✓ ✓ ✓
Presto ✓ ✓ ✓
(Preview)
SAP Business ✓ ✓
Warehouse
Open Hub
SAP Business ✓ ✓
Warehouse via
MDX
SAP HANA ✓ ✓ ✓
SAP Table ✓ ✓ ✓
Spark ✓ ✓ ✓
SQL Server ✓ ✓ ✓ ✓
Sybase ✓ ✓
Teradata ✓ ✓
Vertica ✓ ✓ ✓
NoSQL Cassandra ✓ ✓ ✓
Couchbase ✓ ✓ ✓
(Preview)
MongoDB ✓ ✓ ✓
File Amazon S3 ✓ ✓ ✓
File System ✓ ✓ ✓ ✓
FTP ✓ ✓ ✓
Google Cloud ✓ ✓ ✓
Storage
HDFS ✓ ✓ ✓
SFTP ✓ ✓ ✓
Generic OData ✓ ✓ ✓
Generic ODBC ✓ ✓ ✓
Generic REST ✓ ✓ ✓
Common Data ✓ ✓ ✓ ✓
Service for
Apps
Concur ✓ ✓ ✓
(Preview)
Dynamics 365 ✓ ✓ ✓ ✓
Dynamics AX ✓ ✓ ✓
(Preview)
Dynamics ✓ ✓ ✓ ✓
CRM
Google ✓ ✓ ✓
AdWords
(Preview)
HubSpot ✓ ✓ ✓
(Preview)
Jira (Preview) ✓ ✓ ✓
Magento ✓ ✓ ✓
(Preview)
Marketo ✓ ✓ ✓
(Preview)
Office 365 ✓ ✓ ✓
Oracle Eloqua ✓ ✓ ✓
(Preview)
Oracle ✓ ✓ ✓
Responsys
(Preview)
Oracle Service ✓ ✓ ✓
Cloud
(Preview)
Paypal ✓ ✓ ✓
(Preview)
QuickBooks ✓ ✓ ✓
(Preview)
Salesforce ✓ ✓ ✓ ✓
Salesforce ✓ ✓ ✓ ✓
Service Cloud
Salesforce ✓ ✓ ✓
Marketing
Cloud
(Preview)
SAP ECC ✓ ✓ ✓
ServiceNow ✓ ✓ ✓
Shopify ✓ ✓ ✓
(Preview)
Square ✓ ✓ ✓
(Preview)
Web Table ✓ ✓
(HTML table)
Xero (Preview) ✓ ✓ ✓
Zoho (Preview) ✓ ✓ ✓
NOTE
Any connector marked as Preview means that you can try it out and give us feedback. If you want to take a
dependency on preview connectors in your solution, please contact Azure support.
Stored Procedure Azure SQL, Azure SQL Data Warehouse, or SQL Server
Execute Pipeline Activity Execute Pipeline activity allows a Data Factory pipeline
to invoke another pipeline.
Wait Activity When you use a Wait activity in a pipeline, the pipeline
waits for the specified period of time before continuing
with execution of subsequent activities.
Pipeline JSON
Here is how a pipeline is defined in JSON format:
{
"name": "PipelineName",
"properties":
{
"description": "pipeline description",
"activities":
[
],
"parameters": {
}
}
}
Activity JSON
The activities section can have one or more activities defined within it. There are two main types of
activities: Execution and Control Activities.
Execution activities
Execution activities include data movement and data transformation activities. They have the
following top-level structure:
{
"name": "Execution Activity Name",
"description": "description",
"type": "<ActivityType>",
"typeProperties":
{
},
"linkedServiceName": "MyLinkedService",
"policy":
{
},
"dependsOn":
{
}
}
linkedServiceName - Name of the linked service used by the activity. An activity may require that you specify the linked service that links to the required compute environment. Required: Yes for HDInsight Activity, Azure Machine Learning Batch Scoring Activity, and Stored Procedure Activity; No for all others.
Activity policy
Policies affect the run-time behavior of an activity and give you configuration options. Activity policies are
available only for execution activities.
Activity policy JSON definition
{
"name": "MyPipelineName",
"properties": {
"activities": [
{
"name": "MyCopyBlobtoSqlActivity"
"type": "Copy",
"typeProperties": {
...
},
"policy": {
"timeout": "00:10:00",
"retry": 1,
"retryIntervalInSeconds": 60,
"secureOutput": true
}
}
],
"parameters": {
...
}
}
}
Control activity
Control activities have the following top-level structure:
{
"name": "Control Activity Name",
"description": "description",
"type": "<ActivityType>",
"typeProperties":
{
},
"dependsOn":
{
}
}
{
"name": "PipelineName",
"properties":
{
"description": "pipeline description",
"activities": [
{
"name": "MyFirstActivity",
"type": "Copy",
"typeProperties": {
},
"linkedServiceName": {
}
},
{
"name": "MySecondActivity",
"type": "Copy",
"typeProperties": {
},
"linkedServiceName": {
},
"dependsOn": [
{
"activity": "MyFirstActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
],
"parameters": {
}
}
}
{
"name": "CopyPipeline",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"type": "Copy",
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60:00:00"
}
},
"policy": {
"retry": 2,
"timeout": "01:00:00"
}
}
]
}
}
The typeProperties section is different for each transformation activity. To learn about the type
properties supported for a transformation activity, click the transformation activity in the Data
transformation activities section.
For a complete walkthrough of creating this pipeline, see Tutorial: transform data using Spark.
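Pipeline definitions like the CopyPipeline sample above are usually authored in the Data Factory UI, but they can also be deployed from a local JSON file with PowerShell. A minimal sketch, assuming the JSON is saved as CopyPipeline.json and using placeholder resource group and factory names:
# Sketch: deploy a pipeline from a local JSON definition file (names and path are placeholders).
Set-AzDataFactoryV2Pipeline -ResourceGroupName "ADFTutorialResourceGroup" `
                            -DataFactoryName "ADFTutorialFactory" `
                            -Name "CopyPipeline" `
                            -DefinitionFile ".\CopyPipeline.json"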
The following sample shows a trigger definition that invokes a pipeline and passes it a parameter value:
{
"name": "TriggerA",
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
...
},
"pipeline": {
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "MyCopyPipeline"
},
"parameters": {
"copySourceName": "FileSource"
}
}
}
}
Next steps
See the following tutorials for step-by-step instructions for creating pipelines with activities:
Build a pipeline with a copy activity
Build a pipeline with a data transformation activity
Linked services in Azure Data Factory
4/28/2019 • 4 minutes to read • Edit Online
This article describes what linked services are, how they are defined in JSON format, and how they are used in
Azure Data Factory pipelines.
If you are new to Data Factory, see Introduction to Azure Data Factory for an overview.
Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform
a task. The activities in a pipeline define actions to perform on your data. For example, you might use a copy
activity to copy data from an on-premises SQL Server to Azure Blob storage. Then, you might use a Hive activity
that runs a Hive script on an Azure HDInsight cluster to process data from Blob storage to produce output data.
Finally, you might use a second copy activity to copy the output data to Azure SQL Data Warehouse, on top of
which business intelligence (BI) reporting solutions are built. For more information about pipelines and activities,
see Pipelines and activities in Azure Data Factory.
Now, a dataset is a named view of data that simply points or references the data you want to use in your
activities as inputs and outputs.
Before you create a dataset, you must create a linked service to link your data store to the data factory. Linked
services are much like connection strings, which define the connection information needed for Data Factory to
connect to external resources. Think of it this way: the dataset represents the structure of the data within the linked
data stores, and the linked service defines the connection to the data source. For example, an Azure Storage linked
service links a storage account to the data factory. An Azure Blob dataset represents the blob container and the
folder within that Azure storage account that contains the input blobs to be processed.
Here is a sample scenario. To copy data from Blob storage to a SQL database, you create two linked services: Azure
Storage and Azure SQL Database. Then, create two datasets: Azure Blob dataset (which refers to the Azure Storage
linked service) and Azure SQL Table dataset (which refers to the Azure SQL Database linked service). The Azure
Storage and Azure SQL Database linked services contain connection strings that Data Factory uses at runtime to
connect to your Azure Storage and Azure SQL Database, respectively. The Azure Blob dataset specifies the blob
container and blob folder that contains the input blobs in your Blob storage. The Azure SQL Table dataset specifies
the SQL table in your SQL database to which the data is to be copied.
The following diagram shows the relationships among pipeline, activity, dataset, and linked service in Data Factory:
{
"name": "<Name of the linked service>",
"properties": {
"type": "<Type of the linked service>",
"typeProperties": {
"<data store or compute-specific type properties>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
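A linked service definition like this can be deployed with any of the tools or SDKs listed below. As a hedged example, a PowerShell sketch that deploys a definition saved locally as AzureStorageLinkedService.json (the file, resource group, and factory names are placeholders):
# Sketch: deploy a linked service from a local JSON definition file (placeholder names).
Set-AzDataFactoryV2LinkedService -ResourceGroupName "ADFTutorialResourceGroup" `
                                 -DataFactoryName "ADFTutorialFactory" `
                                 -Name "AzureStorageLinkedService" `
                                 -DefinitionFile ".\AzureStorageLinkedService.json"
Data Factory reads the connection information in the deployed definition at runtime when an activity references the linked service.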
Next steps
See the following tutorial for step-by-step instructions for creating pipelines and datasets by using one of these
tools or SDKs.
Quickstart: create a data factory using .NET
Quickstart: create a data factory using PowerShell
Quickstart: create a data factory using REST API
Quickstart: create a data factory using Azure portal
Datasets in Azure Data Factory
4/28/2019 • 11 minutes to read • Edit Online
This article describes what datasets are, how they are defined in JSON format, and how they are used in
Azure Data Factory pipelines.
If you are new to Data Factory, see Introduction to Azure Data Factory for an overview.
Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that
together perform a task. The activities in a pipeline define actions to perform on your data. Now, a
dataset is a named view of data that simply points or references the data you want to use in your
activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files,
folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in
Blob storage from which the activity should read the data.
Before you create a dataset, you must create a linked service to link your data store to the data factory.
Linked services are much like connection strings, which define the connection information needed for
Data Factory to connect to external resources. Think of it this way: the dataset represents the structure of
the data within the linked data stores, and the linked service defines the connection to the data source.
For example, an Azure Storage linked service links a storage account to the data factory. An Azure Blob
dataset represents the blob container and the folder within that Azure storage account that contains the
input blobs to be processed.
Here is a sample scenario. To copy data from Blob storage to a SQL database, you create two linked
services: Azure Storage and Azure SQL Database. Then, create two datasets: Azure Blob dataset (which
refers to the Azure Storage linked service) and Azure SQL Table dataset (which refers to the Azure SQL
Database linked service). The Azure Storage and Azure SQL Database linked services contain
connection strings that Data Factory uses at runtime to connect to your Azure Storage and Azure SQL
Database, respectively. The Azure Blob dataset specifies the blob container and blob folder that contains
the input blobs in your Blob storage. The Azure SQL Table dataset specifies the SQL table in your SQL
database to which the data is to be copied.
The following diagram shows the relationships among pipeline, activity, dataset, and linked service in
Data Factory:
Dataset JSON
A dataset in Data Factory is defined in the following JSON format:
{
"name": "<name of dataset>",
"properties": {
"type": "<type of dataset: AzureBlob, AzureSql etc...>",
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference",
},
"structure": [
{
"name": "<Name of the column>",
"type": "<Name of the type>"
}
],
"typeProperties": {
"<type specific property>": "<value>",
"<type specific property 2>": "<value 2>",
}
}
}
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer
SLA provisions.
See supported dataset types for a list of dataset types that are compatible with Data Flow. Datasets that are
compatible with Data Flow require fine-grained dataset definitions for transformations. Thus, the JSON
definition is slightly different: instead of a structure property, Data Flow-compatible datasets
have a schema property.
In Data Flow, datasets are used in source and sink transformations. The datasets define the basic data
schemas. If your data has no schema, you can use schema drift for your source and sink. The schema in
the dataset represents the physical data type and shape.
By defining the schema from the dataset, you'll get the related data types, data formats, file location, and
connection information from the associated Linked service. Metadata from the datasets appears in your
source transformation as the source projection. The projection in the source transformation represents
the Data Flow data with defined names and types.
When you import the schema of a Data Flow dataset, select the Import Schema button and choose to
import from the source or from a local file. In most cases, you'll import the schema directly from the
source. But if you already have a local schema file (a Parquet file or CSV with headers), you can direct
Data Factory to base the schema on that file.
{
"name": "<name of dataset>",
"properties": {
"type": "<type of dataset: AzureBlob, AzureSql etc...>",
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference",
},
"schema": [
{
"name": "<Name of the column>",
"type": "<Name of the type>"
}
],
"typeProperties": {
"<type specific property>": "<value>",
"<type specific property 2>": "<value 2>",
}
}
}
Dataset example
In the following example, the dataset represents a table named MyTable in a SQL database.
{
"name": "DatasetSample",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "MyAzureSqlLinkedService",
"type": "LinkedServiceReference",
},
"typeProperties":
{
"tableName": "MyTable"
}
}
}
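A dataset definition such as DatasetSample can be deployed with any of the tools or SDKs covered later in this article. As a hedged example, a PowerShell sketch that deploys the JSON above from a local file (placeholder resource group, factory, and file names):
# Sketch: deploy a dataset from a local JSON definition file (placeholder names).
Set-AzDataFactoryV2Dataset -ResourceGroupName "ADFTutorialResourceGroup" `
                           -DataFactoryName "ADFTutorialFactory" `
                           -Name "DatasetSample" `
                           -DefinitionFile ".\DatasetSample.json"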
Dataset type
There are many different types of datasets, depending on the data store you use. See the following table
for a list of data stores supported by Data Factory. Click a data store to learn how to create a linked
service and a dataset for that data store.
CATEGORY | DATA STORE | SUPPORTED AS A COPY ACTIVITY SOURCE | SUPPORTED AS A COPY ACTIVITY SINK | SUPPORTED BY AZURE IR | SUPPORTED BY SELF-HOSTED IR | SUPPORTED BY DATA FLOW
Azure ✓ ✓ ✓ ✓
Cosmos DB
(SQL API)
Azure ✓ ✓ ✓ ✓
Cosmos
DB's API for
MongoDB
Azure Data ✓ ✓ ✓ ✓
Explorer
Azure Data Lake Storage Gen1 ✓ ✓ ✓ ✓ ✓ (Data Flow supported formats: Delimited Text, Parquet)
Azure Data Lake Storage Gen2 ✓ ✓ ✓ ✓ ✓ (Data Flow supported formats: Delimited Text, Parquet)
Azure ✓ ✓ ✓
Database for
MariaDB
Azure ✓ ✓ ✓
Database for
MySQL
Azure ✓ ✓ ✓
Database for
PostgreSQL
Azure File ✓ ✓ ✓ ✓
Storage
Azure SQL ✓ ✓ ✓ ✓ ✓
Database
Azure SQL ✓ ✓ ✓
Database
Managed
Instance
Azure SQL ✓ ✓ ✓ ✓ ✓
Data
Warehouse
Azure ✓ ✓ ✓
Search Index
Azure Table ✓ ✓ ✓ ✓
Storage
Database Amazon ✓ ✓ ✓
Redshift
DB2 ✓ ✓ ✓
Drill ✓ ✓ ✓
(Preview)
Google ✓ ✓ ✓
BigQuery
Greenplum ✓ ✓ ✓
HBase ✓ ✓ ✓
Hive ✓ ✓ ✓
Apache ✓ ✓ ✓
Impala
(Preview)
Informix ✓ ✓
MariaDB ✓ ✓ ✓
Microsoft ✓ ✓
Access
MySQL ✓ ✓ ✓
Netezza ✓ ✓ ✓
Oracle ✓ ✓ ✓ ✓
Phoenix ✓ ✓ ✓
PostgreSQL ✓ ✓ ✓
Presto ✓ ✓ ✓
(Preview)
SAP ✓ ✓
Business
Warehouse
Open Hub
SAP ✓ ✓
Business
Warehouse
via MDX
SAP HANA ✓ ✓ ✓
SAP Table ✓ ✓ ✓
Spark ✓ ✓ ✓
SQL Server ✓ ✓ ✓ ✓
Sybase ✓ ✓
Teradata ✓ ✓
Vertica ✓ ✓ ✓
NoSQL Cassandra ✓ ✓ ✓
Couchbase ✓ ✓ ✓
(Preview)
MongoDB ✓ ✓ ✓
File Amazon S3 ✓ ✓ ✓
File System ✓ ✓ ✓ ✓
FTP ✓ ✓ ✓
Google ✓ ✓ ✓
Cloud
Storage
HDFS ✓ ✓ ✓
SFTP ✓ ✓ ✓
Generic Generic ✓ ✓ ✓
protocol HTTP
Generic ✓ ✓ ✓
OData
Generic ✓ ✓ ✓
ODBC
Generic ✓ ✓ ✓
REST
Services Amazon ✓ ✓ ✓
and apps Marketplace
Web Service
(Preview)
Common ✓ ✓ ✓ ✓
Data Service
for Apps
Concur ✓ ✓ ✓
(Preview)
Dynamics ✓ ✓ ✓ ✓
365
Dynamics ✓ ✓ ✓
AX (Preview)
Dynamics ✓ ✓ ✓ ✓
CRM
Google ✓ ✓ ✓
AdWords
(Preview)
HubSpot ✓ ✓ ✓
(Preview)
Jira ✓ ✓ ✓
(Preview)
Magento ✓ ✓ ✓
(Preview)
Marketo ✓ ✓ ✓
(Preview)
Office 365 ✓ ✓ ✓
Oracle ✓ ✓ ✓
Eloqua
(Preview)
Oracle ✓ ✓ ✓
Responsys
(Preview)
Oracle ✓ ✓ ✓
Service
Cloud
(Preview)
Paypal ✓ ✓ ✓
(Preview)
QuickBooks ✓ ✓ ✓
(Preview)
Salesforce ✓ ✓ ✓ ✓
Salesforce ✓ ✓ ✓ ✓
Service
Cloud
Salesforce ✓ ✓ ✓
Marketing
Cloud
(Preview)
SAP Cloud ✓ ✓ ✓ ✓
for
Customer
(C4C)
SAP ECC ✓ ✓ ✓
ServiceNow ✓ ✓ ✓
Shopify ✓ ✓ ✓
(Preview)
Square ✓ ✓ ✓
(Preview)
Web Table ✓ ✓
(HTML
table)
Xero ✓ ✓ ✓
(Preview)
Zoho ✓ ✓ ✓
(Preview)
NOTE
Any connector marked as Preview means that you can try it out and give us feedback. If you want to take a
dependency on preview connectors in your solution, please contact Azure support.
In the example in the previous section, the type of the dataset is set to AzureSqlTable. Similarly, for an
Azure Blob dataset, the type of the dataset is set to AzureBlob, as shown in the following JSON:
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference",
},
"typeProperties": {
"fileName": "input.log",
"folderPath": "adfgetstarted/inputdata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
}
}
}
Example
In the following example, suppose the source Blob data is in CSV format and contains three columns:
userid, name, and lastlogindate. They are of type Int64, String, and Datetime with a custom datetime
format using abbreviated French names for day of the week.
Define the Blob dataset structure as follows along with type definitions for the columns:
"structure":
[
{ "name": "userid", "type": "Int64"},
{ "name": "name", "type": "String"},
{ "name": "lastlogindate", "type": "Datetime", "culture": "fr-fr", "format": "ddd-MM-YYYY"}
]
Guidance
The following guidelines help you understand when to include structure information, and what to
include in the structure section. Learn more on how data factory maps source data to sink and when to
specify structure information from Schema and type mapping.
For strong-schema data sources, specify the structure section only if you want to map source
columns to sink columns and their names are not the same. This kind of structured data source
stores data schema and type information along with the data itself. Examples of structured data
sources include SQL Server, Oracle, and Azure SQL Database.
Because type information is already available for structured data sources, you should not include type
information when you do include the structure section.
For no/weak-schema data sources (for example, a text file in blob storage), include structure when the
dataset is an input for a copy activity and the data types of the source dataset should be converted to the native
types of the sink. Also include structure when you want to map source columns to sink columns.
Create datasets
You can create datasets by using one of these tools or SDKs: .NET API, PowerShell, REST API, Azure
Resource Manager Template, and Azure portal
Next steps
See the following tutorial for step-by-step instructions for creating pipelines and datasets by using one
of these tools or SDKs.
Quickstart: create a data factory using .NET
Quickstart: create a data factory using PowerShell
Quickstart: create a data factory using REST API
Quickstart: create a data factory using Azure portal
Pipeline execution and triggers in Azure Data
Factory
3/5/2019 • 15 minutes to read • Edit Online
A pipeline run in Azure Data Factory defines an instance of a pipeline execution. For example, say you have a
pipeline that executes at 8:00 AM, 9:00 AM, and 10:00 AM. In this case, there are three separate runs of the
pipeline, or pipeline runs. Each pipeline run has a unique pipeline run ID. A run ID is a GUID that uniquely
defines that particular pipeline run.
Pipeline runs are typically instantiated by passing arguments to parameters that you define in the pipeline. You
can execute a pipeline either manually or by using a trigger. This article provides details about both ways of
executing a pipeline.
In the JSON definition, the pipeline takes two parameters: sourceBlobContainer and sinkBlobContainer.
You pass values to these parameters at runtime.
You can manually run your pipeline by using one of the following methods:
.NET SDK
Azure PowerShell module
REST API
Python SDK
REST API
The following sample command shows you how to manually run your pipeline by using the REST API:
POST https://management.azure.com/subscriptions/mySubId/resourceGroups/myResourceGroup/providers/Microsoft.DataFactory/factories/myDataFactory/pipelines/copyPipeline/createRun?api-version=2017-03-01-preview
For a complete sample, see Quickstart: Create a data factory by using the REST API.
Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which
will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.
The following sample command shows you how to manually run your pipeline by using Azure PowerShell:
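A minimal sketch of that command, with placeholder names for the resource group, data factory, pipeline, and parameter file:
# Sketch: invoke a pipeline run and capture the run ID that the call returns (placeholder names).
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "ADFTutorialResourceGroup" `
                                        -DataFactoryName "ADFTutorialFactory" `
                                        -PipelineName "copyPipeline" `
                                        -ParameterFile .\PipelineParameters.json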
You pass parameters in the body of the request payload. In the .NET SDK, Azure PowerShell, and the Python
SDK, you pass values in a dictionary that's passed as an argument to the call:
{
"sourceBlobContainer": "MySourceFolder",
"sinkBlobContainer": "MySinkFolder"
}
{
"runId": "0448d45a-a0bd-23f3-90a5-bfeea9264aed"
}
For a complete sample, see Quickstart: Create a data factory by using Azure PowerShell.
.NET SDK
The following sample call shows you how to manually run your pipeline by using the .NET SDK:
For a complete sample, see Quickstart: Create a data factory by using the .NET SDK.
NOTE
You can use the .NET SDK to invoke Data Factory pipelines from Azure Functions, from your own web services, and so on.
Trigger execution
Triggers are another way that you can execute a pipeline run. Triggers represent a unit of processing that
determines when a pipeline execution needs to be kicked off. Currently, Data Factory supports three types of
triggers:
Schedule trigger: A trigger that invokes a pipeline on a wall-clock schedule.
Tumbling window trigger: A trigger that operates on a periodic interval, while also retaining state.
Event-based trigger: A trigger that responds to an event.
Pipelines and triggers have a many-to-many relationship. Multiple triggers can kick off a single pipeline, or a
single trigger can kick off multiple pipelines. In the following trigger definition, the pipelines property refers to
a list of pipelines that are triggered by the particular trigger. The property definition includes values for the
pipeline parameters.
Basic trigger definition
{
"properties": {
"name": "MyTrigger",
"type": "<type of trigger>",
"typeProperties": {...},
"pipelines": [
{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "<Name of your pipeline>"
},
"parameters": {
"<parameter 1 Name>": {
"type": "Expression",
"value": "<parameter 1 Value>"
},
"<parameter 2 Name>": "<parameter 2 Value>"
}
}
]
}
}
Schedule trigger
A schedule trigger runs pipelines on a wall-clock schedule. This trigger supports periodic and advanced
calendar options. For example, the trigger supports intervals like "weekly" or "Monday at 5:00 PM and
Thursday at 9:00 PM." The schedule trigger is flexible because the dataset pattern is agnostic, and the trigger
doesn't discern between time-series and non-time-series data.
For more information about schedule triggers and for examples, see Create a schedule trigger.
IMPORTANT
The parameters property is a mandatory property of the pipelines element. If your pipeline doesn't take any
parameters, you must include an empty JSON definition for the parameters property.
Schema overview
The following table provides a high-level overview of the major schema elements that are related to recurrence
and scheduling a trigger:
JSON PROPERTY - DESCRIPTION
endTime - The end date and time for the trigger. The trigger doesn't execute after the specified end date and time. The value for the property can't be in the past.
timeZone - The time zone. Currently, only the UTC time zone is supported.
recurrence - A recurrence object that specifies the recurrence rules for the trigger. The recurrence object supports the frequency, interval, endTime, count, and schedule elements. When a recurrence object is defined, the frequency element is required. The other elements of the recurrence object are optional.
{
"properties": {
"name": "MyTrigger",
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2017-11-01T09:00:00-08:00",
"endTime": "2017-11-02T22:00:00-08:00"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "SQLServerToBlobPipeline"
},
"parameters": {}
},
{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "SQLServerToAzureSQLPipeline"
},
"parameters": {}
}
]
}
}
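To deploy a trigger definition like the preceding one with PowerShell, a hedged sketch is to save the JSON locally (here as MyTrigger.json, a placeholder path) and then create and start the trigger:
# Sketch: create the trigger from a local JSON definition file, then activate it (placeholder names).
Set-AzDataFactoryV2Trigger -ResourceGroupName "ADFTutorialResourceGroup" `
                           -DataFactoryName "ADFTutorialFactory" `
                           -Name "MyTrigger" `
                           -DefinitionFile ".\MyTrigger.json"
# Triggers are created in a stopped state; start the trigger so that it begins invoking its pipelines.
Start-AzDataFactoryV2Trigger -ResourceGroupName "ADFTutorialResourceGroup" `
                             -DataFactoryName "ADFTutorialFactory" `
                             -Name "MyTrigger"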
startTime property
The following table shows you how the startTime property controls a trigger run:
Start time is in the past: Without a schedule, the trigger calculates the first future execution time after the start time and runs at that time; subsequent executions are calculated from the last execution time (see the example that follows this table). With a schedule, the trigger starts no sooner than the specified start time; the first occurrence is based on the schedule, calculated from the start time, and subsequent executions are based on the recurrence schedule.
Start time is in the future or the current time: Without a schedule, the trigger runs once at the specified start time; subsequent executions are calculated from the last execution time. With a schedule, the trigger starts no sooner than the specified start time; the first occurrence is based on the schedule, calculated from the start time.
Let's look at an example of what happens when the start time is in the past, with a recurrence, but no schedule.
Assume that the current time is 2017-04-08 13:00, the start time is 2017-04-07 14:00, and the recurrence is
every two days. (The recurrence value is defined by setting the frequency property to "day" and the interval
property to 2.) Notice that the startTime value is in the past and occurs before the current time.
Under these conditions, the first execution is 2017-04-09 at 14:00. The Scheduler engine calculates execution
occurrences from the start time. Any instances in the past are discarded. The engine uses the next instance that
occurs in the future. In this scenario, the start time is 2017-04-07 at 2:00 PM. The next instance is two days from
that time, which is on 2017-04-09 at 2:00 PM.
The first execution time is the same whether startTime is 2017-04-05 14:00 or 2017-04-01 14:00. After
the first execution, subsequent executions are calculated by using the schedule. Therefore, the subsequent
executions are on 2017-04-11 at 2:00 PM, then on 2017-04-13 at 2:00 PM, then on 2017-04-15 at 2:00 PM,
and so on.
Finally, when hours or minutes aren’t set in the schedule for a trigger, the hours or minutes of the first execution
are used as defaults.
schedule property
You can use schedule to limit the number of trigger executions. For example, if a trigger with a monthly
frequency is scheduled to run only on day 31, the trigger runs only in those months that have a thirty-first day.
You can also use schedule to expand the number of trigger executions. For example, a trigger with a monthly
frequency that's scheduled to run on month days 1 and 2, runs on the first and second days of the month, rather
than once a month.
If multiple schedule elements are specified, the order of evaluation is from the largest to the smallest schedule
setting: week number, month day, week day, hour, minute.
The following table describes the schedule elements in detail:
monthDays - Day of the month on which the trigger runs. The value can be specified with a monthly frequency only. Allowed values: any value <= -1 and >= -31, any value >= 1 and <= 31, or an array of values.
Event-based trigger
An event-based trigger runs pipelines in response to an event, such as the arrival of a file, or the deletion of a
file, in Azure Blob Storage.
For more information about event-based triggers, see Create a trigger that runs a pipeline in response to an
event.
EXAMPLE - DESCRIPTION
{"minutes":[15,45], "hours":[5,17]} - Run at 5:15 AM, 5:45 AM, 5:15 PM, and 5:45 PM every day.
{"hours":[17], "weekDays":["monday", "wednesday", "friday"]} - Run at 5:00 PM on Monday, Wednesday, and Friday every week.
{"minutes":[15,45], "hours":[17], "weekDays":["monday", "wednesday", "friday"]} - Run at 5:15 PM and 5:45 PM on Monday, Wednesday, and Friday every week.
{"minutes":[0,15,30,45], "hours":[9, 10, 11, 12, 13, 14, 15, 16], "weekDays":["monday", "tuesday", "wednesday", "thursday", "friday"]} - Run every 15 minutes on weekdays between 9:00 AM and 4:45 PM.
{"weekDays":["tuesday", "thursday"]} - Run on Tuesdays and Thursdays at the specified start time.
{"minutes":[0], "hours":[6], "monthDays":[28]} - Run at 6:00 AM on the twenty-eighth day of every month (assuming a frequency value of "month").
{"minutes":[0], "hours":[6], "monthDays":[-1]} - Run at 6:00 AM on the last day of the month.
{"minutes":[0], "hours":[6], "monthDays":[1,-1]} - Run at 6:00 AM on the first and last day of every month.
{"monthDays":[1,14]} - Run on the first and fourteenth day of every month at the specified start time.
{"minutes":[0], "hours":[5], "monthlyOccurrences":[{"day":"friday", "occurrence":1}]} - Run on the first Friday of every month at 5:00 AM.
{"monthlyOccurrences":[{"day":"friday", "occurrence":1}]} - Run on the first Friday of every month at the specified start time.
{"monthlyOccurrences":[{"day":"friday", "occurrence":-3}]} - Run on the third Friday from the end of the month, every month, at the specified start time.
{"minutes":[15], "hours":[5], "monthlyOccurrences":[{"day":"friday", "occurrence":1},{"day":"friday", "occurrence":-1}]} - Run on the first and last Friday of every month at 5:15 AM.
{"monthlyOccurrences":[{"day":"friday", "occurrence":1},{"day":"friday", "occurrence":-1}]} - Run on the first and last Friday of every month at the specified start time.
{"monthlyOccurrences":[{"day":"friday", "occurrence":5}]} - Run on the fifth Friday of every month at the specified start time.
{"minutes":[0,15,30,45], "monthlyOccurrences":[{"day":"friday", "occurrence":-1}]} - Run every 15 minutes on the last Friday of the month.
{"minutes":[15,45], "hours":[5,17], "monthlyOccurrences":[{"day":"wednesday", "occurrence":3}]} - Run at 5:15 AM, 5:45 AM, 5:15 PM, and 5:45 PM on the third Wednesday of every month.
Backfill scenarios: Tumbling window trigger - Supported; pipeline runs can be scheduled for windows in the past. Schedule trigger - Not supported; pipeline runs can be executed only on time periods from the current time and the future.
Next steps
See the following tutorials:
Quickstart: Create a data factory by using the .NET SDK
Create a schedule trigger
Create a tumbling window trigger
Integration runtime in Azure Data Factory
5/31/2019 • 11 minutes to read • Edit Online
The Integration Runtime (IR ) is the compute infrastructure used by Azure Data Factory to provide the
following data integration capabilities across different network environments:
Data Flow: Execute a Data Flow in a managed Azure compute environment.
Data movement: Copy data across data stores in a public network and data stores in a private
network (on-premises or virtual private network). It provides support for built-in connectors, format
conversion, column mapping, and performant and scalable data transfer.
Activity dispatch: Dispatch and monitor transformation activities running on a variety of compute
services such as Azure Databricks, Azure HDInsight, Azure Machine Learning, Azure SQL Database,
SQL Server, and more.
SSIS package execution: Natively execute SQL Server Integration Services (SSIS ) packages in a
managed Azure compute environment.
In Data Factory, an activity defines the action to be performed. A linked service defines a target data
store or a compute service. An integration runtime provides the bridge between the activity and linked
services. It is referenced by the linked service or activity, and provides the compute environment where
the activity either runs or gets dispatched from. This way, the activity can be performed in the region
closest to the target data store or compute service, in the most performant way, while meeting
security and compliance needs.
The following diagram shows how the different integration runtimes can be used in combination to
offer rich data integration capabilities and network support:
Azure integration runtime
An Azure integration runtime is capable of:
Running Data Flows in Azure
Running copy activity between cloud data stores
Dispatching the following transform activities in a public network: Databricks Notebook/Jar/Python
activity, HDInsight Hive activity, HDInsight Pig activity, HDInsight MapReduce activity, HDInsight
Spark activity, HDInsight Streaming activity, Machine Learning Batch Execution activity, Machine
Learning Update Resource activities, Stored Procedure activity, Data Lake Analytics U -SQL activity,
.NET custom activity, Web activity, Lookup activity, and Get Metadata activity.
Azure IR network environment
Azure Integration Runtime supports connecting to data stores and compute services with publicly
accessible endpoints. Use a self-hosted integration runtime for an Azure Virtual Network environment.
Azure IR compute resource and scaling
Azure integration runtime provides a fully managed, serverless compute in Azure. You don't have to
worry about infrastructure provisioning, software installation, patching, or capacity scaling. In addition,
you only pay for the duration of the actual utilization.
Azure integration runtime provides the native compute to move data between cloud data stores in a
secure, reliable, and high-performance manner. You can set how many data integration units to use on
the copy activity, and the compute size of the Azure IR is elastically scaled up accordingly without your
having to explicitly adjust the size of the Azure Integration Runtime.
Activity dispatch is a lightweight operation to route the activity to the target compute service, so there
is no need to scale up the compute size for this scenario.
For information about creating and configuring an Azure IR, see How to create and configure Azure IR
under how to guides.
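As a hedged illustration, creating a basic Azure IR in a specific region with PowerShell can look like the following sketch (placeholder names; verify parameters against the Az.DataFactory module):
# Sketch: create an Azure integration runtime pinned to a specific region (placeholder names).
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "ADFTutorialResourceGroup" `
                                      -DataFactoryName "ADFTutorialFactory" `
                                      -Name "MyAzureIR" `
                                      -Type Managed `
                                      -Location "West Europe"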
NOTE
The Azure integration runtime has properties related to the Data Flow runtime, which define the underlying compute
infrastructure used to run your data flows.
Self-hosted integration runtime
A self-hosted IR is capable of:
Running copy activity between cloud data stores and a data store in a private network.
Dispatching the following transform activities against compute resources in an on-premises network or an Azure
virtual network: HDInsight Hive activity (BYOC - Bring Your Own Cluster), HDInsight Pig activity
(BYOC ), HDInsight MapReduce activity (BYOC ), HDInsight Spark activity (BYOC ), HDInsight
Streaming activity (BYOC ), Machine Learning Batch Execution activity, Machine Learning Update
Resource activities, Stored Procedure activity, Data Lake Analytics U -SQL activity, .NET custom
activity, Lookup activity, and Get Metadata activity.
NOTE
Use a self-hosted integration runtime to support data stores that require a bring-your-own driver, such as SAP
HANA, MySQL, and so on. For more information, see supported data stores.
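As a hedged sketch, registering a self-hosted IR involves creating the IR resource in the data factory and then retrieving an authentication key, which you use to register the IR software installed on your own machine (placeholder names):
# Sketch: create a self-hosted integration runtime and list its authentication keys (placeholder names).
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "ADFTutorialResourceGroup" `
                                      -DataFactoryName "ADFTutorialFactory" `
                                      -Name "MySelfHostedIR" `
                                      -Type SelfHosted `
                                      -Description "self-hosted IR for a private network"
# The returned keys are entered into the self-hosted IR setup on the on-premises machine.
Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName "ADFTutorialResourceGroup" `
                                         -DataFactoryName "ADFTutorialFactory" `
                                         -Name "MySelfHostedIR"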
TIP
A good practice is to ensure that your data flow runs in the same region as your corresponding data
stores when possible. You can achieve this either by using the auto-resolve Azure IR (if the data store location is the same as
the Data Factory location), or by creating a new Azure IR instance in the same region as your data stores
and then executing the data flow on it.
You can monitor which IR location takes effect during activity execution in the pipeline activity monitoring
view in the UI or in the activity monitoring payload.
TIP
If you have strict data compliance requirements and need to ensure that data does not leave a certain geography,
you can explicitly create an Azure IR in a certain region and point the linked service to this IR by using the connectVia
property. For example, if you want to copy data from Blob storage in UK South to SQL Data Warehouse in UK South and want to
ensure that data does not leave the UK, create an Azure IR in UK South and link both linked services to this IR.
Self-hosted IR location
The self-hosted IR is logically registered to the data factory, and the compute used to support its
functionality is provided by you. Therefore, there is no explicit location property for a self-hosted IR.
When used to perform data movement, the self-hosted IR extracts data from the source and writes into
the destination.
Azure -SSIS IR location
Selecting the right location for your Azure-SSIS IR is essential to achieve high performance in your
extract-transform-load (ETL ) workflows.
The location of your Azure-SSIS IR does not need to be the same as the location of your data factory,
but it should be the same as the location of your own Azure SQL Database/Managed Instance
server where SSISDB is to be hosted. This way, your Azure-SSIS Integration Runtime can easily
access SSISDB without incurring excessive traffic between different locations.
If you do not have an existing Azure SQL Database/Managed Instance server to host SSISDB, but
you have on-premises data sources/destinations, you should create a new Azure SQL
Database/Managed Instance server in the same location as a virtual network connected to your on-
premises network. This way, you can create your Azure-SSIS IR using the new Azure SQL
Database/Managed Instance server and joining that virtual network, all in the same location,
effectively minimizing data movements across different locations.
If the location of your existing Azure SQL Database/Managed Instance server where SSISDB is
hosted is not the same as the location of a virtual network connected to your on-premises network,
first create your Azure-SSIS IR using an existing Azure SQL Database/Managed Instance server
and joining another virtual network in the same location, and then configure a virtual network to
virtual network connection between different locations.
The following diagram shows location settings of Data Factory and its integration run times:
Next steps
See the following articles:
Create Azure integration runtime
Create self-hosted integration runtime
Create an Azure-SSIS integration runtime. This article expands on the tutorial and provides
instructions on using Azure SQL Database Managed Instance and joining the IR to a virtual
network.
What are Mapping Data Flows?
5/6/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Mapping Data Flows are visually designed data transformations in Azure Data Factory. Data Flows allow data
engineers to develop graphical data transformation logic without writing code. The resulting data flows are
executed as activities within Azure Data Factory Pipelines using scaled-out Azure Databricks clusters.
The intent of Azure Data Factory Data Flow is to provide a fully visual experience with no coding required. Your
Data Flows will execute on your own execution cluster for scaled-out data processing. Azure Data Factory
handles all of the code translation, path optimization, and execution of your data flow jobs.
Start by creating data flows in Debug mode so that you can validate your transformation logic interactively. Next,
add a Data Flow activity to your pipeline to execute and test your data flow in pipeline debug, or use "Trigger
Now" in the pipeline to test your Data Flow from a pipeline Activity.
You will then schedule and monitor your data flow activities by using Azure Data Factory pipelines that execute
the Data Flow activity.
The Debug Mode toggle switch on the Data Flow design surface allows interactive building of data
transformations. Debug Mode provides a data prep environment for data flow construction.
Mapping Data Flow Debug Mode
5/23/2019 • 3 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Azure Data Factory Mapping Data Flow has a debug mode, which can be switched on with the Data Flow Debug
button at the top of the design surface. When designing data flows, setting debug mode on will allow you to
interactively watch the data shape transform while you build and debug your data flows. The Debug session can be
used both in Data Flow design sessions as well as during pipeline debug execution of data flows.
Overview
When Debug mode is on, you will interactively build your data flow with an active Spark cluster. The session will
close once you turn debug off in Azure Data Factory. You should be aware of the hourly charges incurred by Azure
Databricks during the time that you have the debug session turned on.
In most cases, it is a good practice to build your Data Flows in debug mode so that you can validate your business
logic and view your data transformations before publishing your work in Azure Data Factory. You should also use
the "Debug" button on the pipeline panel to test your data flow inside of a pipeline.
NOTE
While the debug mode light is green on the Data Factory toolbar, you will be charged at the Data Flow debug rate of 8
cores/hr of general compute with a 60-minute time-to-live.
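As a rough illustration of what that rate means (the session length below is hypothetical), leaving debug switched on for 3 hours of a working day corresponds to:
3 hours x 8 cores = 24 core-hours of general compute billed at the Data Flow debug rate
Turning the debug switch off as soon as you finish, as described below, is the simplest way to keep this cost down.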
Debug mode on
When you switch on debug mode, you will be prompted with a side-panel form that will request you to point to
your interactive Azure Databricks cluster and select options for the source sampling. You must use an interactive
cluster from Azure Databricks and select either a sampling size from each of your Source transformations, or pick a text
file to use for your test data.
NOTE
When running in Debug Mode in Data Flow, your data will not be written to the Sink transform. A Debug session is intended
to serve as a test harness for your transformations. Sinks are not required during debug and are ignored in your data flow.
If you wish to test writing the data in your Sink, execute the Data Flow from an Azure Data Factory Pipeline and use the
Debug execution from a pipeline.
Debug settings
Debug settings can be edited in the side panel: each Source from your Data Flow will appear there and can also be edited by
selecting "source settings" on the Data Flow designer toolbar. You can select the row limits and/or the file source to use for
each of your Source transformations here. The row limits in this setting apply only to the current debug session. You can
also use the Sampling setting in the source to limit the rows flowing into the Source transformation.
Cluster status
There is a cluster status indicator at the top of the design surface that will turn green when the cluster is ready for
debug. If your cluster is already warm, then the green indicator will appear almost instantly. If your cluster was not
already running when you entered debug mode, then you will have to wait 5-7 minutes for the cluster to spin up.
The indicator light will be yellow until it is ready. Once your cluster is ready for Data Flow debug, the indicator light
will turn green.
When you are finished with your debugging, turn the Debug switch off so that your Azure Databricks cluster can
terminate and you will no longer be billed for debug activity.
Data preview
With debug on, the Data Preview tab will light up on the bottom panel. Without debug mode on, Data Flow will
show you only the current metadata in and out of each of your transformations in the Inspect tab. The data preview
will only query the number of rows that you have set as your limit in your debug settings. You may need to click
"Fetch data" to refresh the data preview.
Data profiles
Selecting individual columns in your data preview tab will pop up a chart on the far right of your data grid with
detailed statistics about each field. Azure Data Factory will determine which type of chart to display based on the
data sampling. High-cardinality fields will default to NULL / NOT NULL charts, while categorical and numeric data
that has low cardinality will display bar charts showing data value frequency. You will also see the max/min length
of string fields, min/max values in numeric fields, standard deviation, percentiles, counts, and averages.
Next steps
Once you are finished building and debugging your data flow, execute it from a pipeline.
When testing your pipeline with a data flow, use the pipeline Debug run execution option.
Mapping Data Flow Schema Drift
4/12/2019 • 3 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Schema Drift is the case where your sources often change metadata. Fields, columns, types, and so on can
be added, removed, or changed on the fly. Without handling for Schema Drift, your Data Flow becomes vulnerable
to upstream data source changes. When incoming columns and fields change, typical ETL patterns fail
because they tend to be tied to those source names.
In order to protect against Schema Drift, it is important to have the facilities in a Data Flow tool to allow you, as a
Data Engineer, to:
Define sources that have mutable field names, data types, values and sizes
Define transformation parameters that can work with data patterns instead of hard-coded fields and values
Define expressions that understand patterns to match incoming fields, instead of using named fields
When you've selected this option, all incoming fields will be read from your source on every Data Flow
execution and will be passed through the entire flow to the Sink.
Make sure to use "Auto-Map" to map all new fields in the Sink Transformation so that all new fields get
picked-up and landed in your destination:
In that scenario, everything will work when new fields are introduced with a simple Source -> Sink (that is, Copy)
mapping.
To add transformations to that workflow that handle schema drift, you can use pattern matching to match
columns by name, type, and value.
Click on "Add Column Pattern" in the Derived Column or Aggregate transformation if you wish to create a
transformation that understands "Schema Drift".
NOTE
You need to make an architectural decision in your data flow to accept schema drift throughout your flow. When you do this,
you can protect against schema changes from the sources. However, you will lose early-binding of your columns and types
throughout your data flow. Azure Data Factory treats schema drift flows as late-binding flows, so when you build your
transformations, the column names will not be available to you in the schema views throughout the flow.
In the Taxi Demo sample Data Flow, there is a sample Schema Drift in the bottom data flow with the TripFare
source. In the Aggregate transformation, notice that we are using the "column pattern" design for the aggregation
fields. Instead of naming specific columns, or looking for columns by position, we assume that the data can change
and may not appear in the same order between runs.
In this example of Azure Data Factory Data Flow schema drift handling, we've built an aggregation that scans for
columns of type 'double', knowing that the data domain contains prices for each trip. We can then perform an
aggregate math calculation across all double fields in the source, regardless of where the column lands and
regardless of the column's naming.
The Azure Data Factory Data Flow syntax uses $$ to represent each matched column from your matching pattern.
You can also match on column names using complex string search and regular expression functions. In this case,
we are going to create a new aggregated field name based on each match of a 'double' type of column and append
the text _total to each of those matched names:
concat($$, '_total')
Then, we will round and sum the values for each of those matched columns:
round(sum ($$))
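Putting those two expressions together, a single column pattern in that Aggregate transformation could be sketched as follows. The matching condition shown is an assumption based on the type-based matching described above, and fare_amount is only a hypothetical input column:
Matching condition: type == 'double'
Column name expression: concat($$, '_total')
Aggregate expression: round(sum($$))
At runtime, a double column such as fare_amount would then land in the output as fare_amount_total.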
You can test this out with the Azure Data Factory Data Flow sample "Taxi Demo". Switch on the Debug session
using the Debug toggle at the top of the Data Flow design surface so that you can see your results interactively:
Access new columns downstream
When you generate new columns with column patterns, you can access those new columns later in your data flow
transformations using the "byName" expression function.
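For example, a downstream Derived Column or Aggregate expression could pick up one of the generated columns by name; the column name here is hypothetical and assumes the '_total' naming pattern from the earlier example:
toString(byName('fare_amount_total'))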
Next steps
In the Data Flow Expression Language you will find additional facilities for column patterns and schema drift
including "byName" and "byPosition".
Azure Data Factory Mapping Data Flow
Transformation Inspect Tab
2/22/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
The Inspect Pane provides a view into the metadata of the data stream that you're transforming. You will be able to
see the column counts, columns changed, columns added, data types, column ordering, and column references.
"Inspect" is a read-only view of your metadata. You do not need to have Debug mode enabled in order to see
metadate in the Inspect Pane.
As you change the shape of your data through transformations, you will see the metadata changes flow through
the Inspect Pane. If there is not a defined schema in your Source transformation, then metadata will not be visible
in the Inspect Pane. Lack of metadata is common in Schema Drift scenarios.
Data Preview is a pane that provides a read-only view of your data as it is being transformed. You can view the
output of your transformation and expressions to validate your data flow. You must have the Debug mode
switched-on to see data previews. When you click on columns in the data preview grid, you will see a subsequent
panel to the right. The pop-out panel will show the profile statistics about each of the columns that you select.
Azure data factory mapping data flows column
patterns
5/31/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Several Azure Data Factory Data Flow transformations support the idea of "Column Patterns" so that you can
create template columns based on patterns instead of hard-coded column names. You can use this feature within
the Expression Builder to define patterns to match columns for transformation instead of requiring exact, specific
field names. Patterns are useful if incoming source fields change often, particularly in the case of changing columns
in text files or NoSQL databases. This condition is sometimes referred to as "Schema Drift".
Column patterns are useful for handling both Schema Drift scenarios and general scenarios. They are a good fit for
conditions where you are not able to fully know each column name in advance. You can pattern match on column name and
column data type and build a transformation expression that will perform that operation against any field in
the data stream that matches your name and type patterns.
When adding an expression to a transformation that accepts patterns, choose "Add Column Pattern". Column Patterns
allow you to define matching patterns for schema-drifted columns.
When building template column patterns, use $$ in the expression to represent a reference to each matched field
from the input data stream.
If you choose to use one of the Expression Builder regex functions, you can then subsequently use $1, $2, $3 ... to
reference the sub-patterns matched from your regex expression.
An example of a Column Pattern scenario is using SUM with a series of incoming fields. The aggregate SUM
calculations are in the Aggregate transformation. You can then apply SUM to every field whose type matches
"integer" and use $$ to reference each match in your expression.
Monitor Data Flows
2/22/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
After you have completed building and debugging your data flow, you will want to schedule your data flow to
execute on a schedule within the context of a pipeline. You can schedule the pipeline from Azure Data Factory using
Triggers. Or you can use the Trigger Now option from the Azure Data Factory Pipeline Builder to execute a single-
run execution to test your data flow within the pipeline context.
When you execute your pipeline, you will be able to monitor the pipeline and all of the activities contained in the
pipeline including the Data Flow activity. Click on the monitor icon in the left-hand Azure Data Factory UI panel.
You will see a screen similar to the one below. The highlighted icons will allow you to drill into the activities in the
pipeline, including the Data Flow activity.
You will see stats at this level as well, including the run times and status. The Run ID at the activity level is different
from the Run ID at the pipeline level. The Run ID at the previous level is for the pipeline. Clicking the eyeglasses will
give you deep details on your data flow execution.
When you are in the graphical node monitoring view, you will see a simplified view-only version of your data flow
graph.
View Data Flow Execution Plans
When your Data Flow is executed in Databricks, Azure Data Factory determines optimal code paths based on the
entirety of your data flow. Additionally, the execution paths may occur on different scale-out nodes and data
partitions. Therefore, the monitoring graph represents the design of your flow, taking into account the execution
path of your transformations. When you click on individual nodes, you will see "groupings" that represent code
that was executed together on the cluster. The timings and counts that you see represent those groups as opposed
to the individual steps in your design.
When you click on the open space in the monitoring window, the stats in the bottom pane will display
timing and row counts for each Sink and the transformations that led to the sink data for transformation
lineage.
When you select individual transformations, you will receive additional feedback on the right-hand panel
that shows partition stats, column counts, skewness (how evenly the data is distributed across partitions),
and kurtosis (how spiky the data is).
When you click on the Sink in the node view, you will see column lineage. There are three different methods
that columns are accumulated throughout your data flow to land in the Sink. They are:
Computed: You use the column for conditional processing or within an expression in your data flow, but
do not land it in the Sink
Derived: The column is a new column that you generated in your flow, i.e. it was not present in the
Source
Mapped: The column originated from the source and you are mapping it to a sink field
Monitor Icons
This icon means that the transformation data was already cached on the cluster, so the timings and execution path
have taken that into account:
You will also see green circle icons in the transformation. They represent a count of the number of sinks that data is
flowing into.
Mapping data flows performance and tuning guide
6/3/2019 • 6 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Azure Data Factory Mapping Data Flows provide a code-free browser interface to design, deploy, and orchestrate
data transformations at scale.
NOTE
If you are not familiar with ADF Mapping Data Flows in general, see Data Flows Overview before reading this article.
NOTE
When you are designing and testing Data Flows from the ADF UI, make sure to turn on the Debug switch so that you can
execute your data flows in real-time without waiting for a cluster to warm up.
Clicking that icon will display the execution plan and subsequent performance profile of your data flow. You can use
this information to estimate the performance of your data flow against different-sized data sources. Note that you
can assume 1 minute of cluster job execution set-up time in your overall performance calculations, and if you are
using the default Azure Integration Runtime, you may need to add 5 minutes of cluster spin-up time as well.
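For example, if the performance profile shows your transformations themselves taking about 4 minutes (a hypothetical figure), a rough end-to-end estimate on the default Azure Integration Runtime would be:
5 min (cluster spin-up) + 1 min (job set-up) + 4 min (transformation time) = roughly 10 minutes per run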
Optimizing for Azure SQL Database and Azure SQL Data Warehouse
Setting batch size will instruct ADF to store data in sets in memory instead of row-by-row. It is an optional
setting and you may run out of resources on the compute nodes if they are not sized properly.
Setting a query can allow you to filter rows right at the source before they even arrive at Data Flow for
processing, which can make the initial data acquisition faster.
If you use a query, you can add optional query hints for your Azure SQL DB, i.e. READ UNCOMMITTED
Set sink batch size
In order to avoid row-by-row processing of your data flows, set the "Batch size" in the sink settings for Azure
SQL DB. This will tell ADF to process database writes in batches based on the size provided.
Set partitioning options on your sink
Even if you don't have your data partitioned in your destination Azure SQL DB tables, go to the Optimize tab
and set partitioning.
Very often, simply telling ADF to use Round Robin partitioning on the Spark execution clusters results in much
faster data loading instead of forcing all connections from a single node/partition.
Increase size of your compute engine in Azure Integration Runtime
Increase the number of cores, which will increase the number of nodes, and provide you with more processing
power to query and write to your Azure SQL DB.
Try "Compute Optimized" and "Memory Optimized" options to apply more resources to your compute nodes.
Unit test and performance test with debug
When unit testing data flows, set the "Data Flow Debug" button to ON.
Inside of the Data Flow designer, use the Data Preview tab on transformations to view the results of your
transformation logic.
Unit test your data flows from the pipeline designer by placing a Data Flow activity on the pipeline design
canvas and use the "Debug" button to test.
Testing in debug mode will work against a live warmed cluster environment without the need to wait for a just-
in-time cluster spin-up.
Disable indexes on write
Use an ADF pipeline stored procedure activity prior to your Data Flow activity that disables indexes on your
target tables that are being written to from your Sink.
After your Data Flow activity, add another stored proc activity that re-enables those indexes.
Increase the size of your Azure SQL DB
Schedule a resizing of your source and sink Azure SQL DB before you run your pipeline to increase the
throughput and minimize Azure throttling once you reach DTU limits.
After your pipeline execution is complete, you can resize your databases back to their normal run rate.
Next steps
See the other Data Flow articles:
Data Flow overview
Data Flow activity
Monitor Data Flow performance
Mapping Data Flow Move Nodes
5/10/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
The Azure Data Factory Data Flow design surface is a "construction" surface where you build data flows top-down,
left-to-right. There is a toolbox attached to each transform with a plus (+) symbol. Concentrate on your business
logic instead of connecting nodes via edges in a free-form DAG environment.
So, without a drag-and-drop paradigm, the way to "move" a transformation node is to change its incoming stream:
you move transformations around by changing the "incoming stream" of the node.
Next steps
After completing your Data Flow design, turn the debug button on and test it out in debug mode either directly in
the data flow designer or pipeline debug.
Mapping Data Flow Transformation Optimize Tab
2/22/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Each Data Flow transformation has an "Optimize" tab. The optimize tab contains optional settings to configure
partitioning schemes for data flows.
The default setting is "use current partitioning". Current Partitioning instructs Azure Data Factory to use the
partitioning scheme native to Data Flows running on Spark in Azure Databricks. Generally, this is the
recommended approach.
However, there are instances where you may wish to adjust the partitioning. For instance, if you want to output
your transformations to a single file in the lake, then choose "single partition" on the Optimize tab for partitioning in
the Sink Transformation.
Another case where you may wish to exercise control over the partitioning schemes being used for your data
transformations is in terms of performance. Adjusting the partitioning of data provides a level of control over the
distribution of your data across compute nodes and data locality optimizations that can have both positive as well
as negative effects on your overall data flow performance.
If you wish to change partitioning on any transformation, simply click the Optimize tab and select the "Set
Partitioning" radio button. You will then be presented with a series of options for partitioning. The best method of
partitioning to implement will differ based on your data volumes, candidate keys, null values and cardinality. Best
practice is to start with default partitioning and then try the different partitioning options. You can test using the
Debug run in Pipeline and then view the time spent in each transformation grouping as well as partition usage
from the Monitoring view.
Round Robin
Round Robin is a simple partitioning scheme that automatically distributes data equally across partitions. Use Round Robin
when you do not have good key candidates to implement a solid, smart partitioning strategy. You can set the
number of physical partitions.
Hash
Azure Data Factory will produce a hash of columns to produce uniform partitions such that rows with similar
values will fall in the same partition. When using the Hash option, test for possible partition skew. You can set the
number of physical partitions.
Dynamic Range
Dynamic Range will use Spark dynamic ranges based on the columns or expressions that you provide. You can set
the number of physical partitions.
Fixed Range
You must build an expression that provides a fixed range for values within your partitioned data columns. You
should have a good understanding of your data before using this option in order to avoid partition skew. The value
that you enter for the expression will be used as part of a partition function. You can set the number of physical
partitions.
Key
If you have a good understanding of the cardinality of your data, key partitioning may be a good partition strategy.
Key partitioning will create partitions for each unique value in your column. You cannot set the number of
partitions because the number will be based on unique values in the data.
Mapping Data Flow Expression Builder
4/9/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
In Azure Data Factory Mapping Data Flow, you'll find expression boxes where you can enter expressions for data
transformation. Use columns, fields, variables, parameters, functions from your data flow in these boxes. To build
the expression, use the Expression Builder, which is launched by clicking in the expression text box inside the
transformation. You'll also sometimes see "Computed Column" options when selecting columns for
transformation. When you click that, you'll also see the Expression Builder launched.
The Expression Builder tool defaults to the text editor option. The auto-complete feature reads from the entire
Azure Data Factory Data Flow object model with syntax checking and highlighting.
Currently Working on Field
At the top left of the Expression Builder UI, you will see a field called "Currently Working On" with the name of the
field that you are currently working on. The expression that you build in the UI will be applied just to that currently
working field. If you wish to transform another field, save your current work and use this drop-down to select
another field and build an expression for it.
Comments
Add comments to your expressions using single line and multi-line comment syntax:
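A small sketch of both styles is shown below; the column name is hypothetical, and the comment markers assume the C-style syntax that the expression language accepts:
/* Normalize the city value before it is
   compared downstream */
upper(city) // single-line comment: uppercase the value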
Regular Expressions
The Azure Data Factory Data Flow expression language (see the full reference documentation) includes functions that
support regular expression syntax. When using regular expression functions, the Expression Builder will try to
interpret a backslash (\) as an escape character sequence. When using backslashes in your regular expression, either
enclose the entire regex in ticks (`) or use a double backslash.
Example using ticks
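For instance, a replacement that collapses repeated whitespace can be written with ticks so the backslash does not need to be doubled; city is a hypothetical string column, and regexReplace is the function documented in the expression reference later in this article set:
regexReplace(city, `\s+`, ' ')
The equivalent with escaped backslashes would be regexReplace(city, '\\s+', ' ').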
Next steps
Begin building data transformation expressions
Mapping Data Flow Reference Node
2/22/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
A reference node is automatically added to the canvas to signify that the node it is attached to references another
existing node on the canvas. Think of a reference node as a pointer or a reference to another data flow
transformation.
For example: When you Join or Union more than one stream of data, the Data Flow canvas may add a reference
node that reflects the name and settings of the non-primary incoming stream.
The reference node cannot be moved or deleted. However, you can click into the node to modify the originating
transformation settings.
The UI rules that govern when Data Flow adds the reference node are based upon available space and vertical
spacing between rows.
Data transformation expressions in Mapping Data
Flow
5/6/2019 • 28 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Expression functions
In Data Factory, use the expression language of the Mapping Data Flow feature to configure data transformations.
abs
acos
add
Adds a pair of strings or numbers. Adds a date to a number of days. Appends one array of similar type to another.
Same as the + operator
add(10, 20) -> 30
10 + 20 -> 30
add('ice', 'cream') -> 'icecream'
'ice' + 'cream' + ' cone' -> 'icecream cone'
add(toDate('2012-12-12'), 3) -> 2012-12-15 (date value)
toDate('2012-12-12') + 3 -> 2012-12-15 (date value)
[10, 20] + [30, 40] => [10, 20, 30, 40]
addDays
addMonths
and
asin
atan
atan2
Returns the angle in radians between the positive x-axis of a plane and the point given by the coordinates
atan2(0, 0) -> 0.0
avg
avgIf
byName
Selects a column value by name in the stream. If there are multiple matches, the first match is returned. If there is no
match, it returns a NULL value. The returned value has to be type converted by one of the type conversion
functions (TO_DATE, TO_STRING, ...). Column names known at design time should be addressed just by their name.
Computed inputs are not supported, but you can use parameter substitutions
toString(byName('parent')) -> appa
toLong(byName('income')) -> 9000000000009
toBoolean(byName('foster')) -> false
toLong(byName($debtCol)) -> 123456890
birthDate -> 12/31/2050
toString(byName('Bogus Column')) -> NULL
byPosition
Selects a column value by its relative position (1-based) in the stream. If the position is out of bounds, it returns a
NULL value. The returned value has to be type converted by one of the type conversion functions (TO_DATE,
TO_STRING, ...). Computed inputs are not supported, but you can use parameter substitutions
toString(byPosition(1)) -> amma
toDecimal(byPosition(2), 10, 2) -> 199990.99
toBoolean(byPosition(4)) -> false
toString(byName($colName)) -> family
toString(byPosition(1234)) -> NULL
case
Based on alternating conditions, applies one value or the other. If the number of inputs is even, the other is defaulted
to NULL for the last condition
case(custType == 'Premium', 10, 4.5)
case(custType == 'Premium', price*0.95, custType == 'Elite', price*0.9, price*2)
case(dayOfWeek(saleDate) == 1, 'Sunday', dayOfWeek(saleDate) == 6, 'Saturday')
cbrt
concat
Concatenates a variable number of strings together. Same as the + operator with strings
concat('Awesome', 'Cool', 'Product') -> 'AwesomeCoolProduct'
'Awesome' + 'Cool' + 'Product' -> 'AwesomeCoolProduct'
concat(addrLine1, ' ', addrLine2, ' ', city, ' ', state, ' ', zip)
addrLine1 + ' ' + addrLine2 + ' ' + city + ' ' + state + ' ' + zip
concatWS
Concatenates a variable number of strings together with a separator. The first parameter is the separator
concatWS(' ', 'Awesome', 'Cool', 'Product') -> 'Awesome Cool Product'
concatWS(' ' , addrLine1, addrLine2, city, state, zip) ->
concatWS(',' , toString(order_total), toString(order_discount))
cos
cosh
count
Gets the aggregate count of values. If the optional column(s) are specified, NULL values are ignored in the count
count(custId) -> 100
count(custId, custName) -> 50
count() -> 125
count(iif(isNull(custId), 1, NULL)) -> 5
countDistinct
countIf
Based on criteria, gets the aggregate count of values. If the optional column is specified, NULL values are ignored
in the count
countIf(state == 'CA' && commission < 10000, name) -> 100
covariancePopulation
covariancePopulationIf
covarianceSample
covarianceSampleIf
crc32
Calculates the CRC32 hash of a set of columns of varying primitive datatypes, given a bit length which can only be of
values 0(256), 224, 256, 384, 512. It can be used to calculate a fingerprint for a row
crc32(256, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4')) -> 3630253689
cumeDist
The CumeDist function computes the position of a value relative to all values in the partition. The result is the
number of rows preceding or equal to the current row in the ordering of the partition divided by the total number
of rows in the window partition. Any tie values in the ordering will evaluate to the same position.
cumeDist() -> 1
currentDate
Gets the current date when this job starts to run. You can pass an optional timezone in the form of 'GMT', 'PST',
'UTC', 'America/Cayman'. The local timezone is used as the default.
currentDate() -> 12-12-2030
currentDate('PST') -> 12-31-2050
currentTimestamp
Gets the current timestamp when the job starts to run with local time zone
currentTimestamp() -> 12-12-2030T12:12:12
currentUTC
Gets the current timestamp as UTC. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. It is defaulted to the current timezone
currentUTC() -> 12-12-2030T19:18:12
currentUTC('Asia/Seoul') -> 12-13-2030T11:18:12
dayOfMonth
dayOfWeek
Gets the day of the week given a date. 1 - Sunday, 2 - Monday ..., 7 - Saturday
dayOfWeek(toDate('2018-06-08')) -> 7
dayOfYear
degrees
denseRank
Computes the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to
the current row in the ordering of the partition. The values will not produce gaps in the sequence. Dense Rank
works even when data is not sorted and looks for change in values
denseRank(salesQtr, salesAmt) -> 1
divide
endsWith
equals
equalsIgnoreCase
factorial
false
Always returns a false value. Use the function syntax(false()) if there is a column named 'false'
isDiscounted == false()
isDiscounted() == false
first
Gets the first value of a column group. If the second parameter ignoreNulls is omitted, it is assumed false
first(sales) -> 12233.23
first(sales, false) -> NULL
floor
fromUTC
Converts to the timestamp from UTC. You can optionally pass the timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. It is defaulted to the current timezone
fromUTC(currentTimeStamp()) -> 12-12-2030T19:18:12
fromUTC(currentTimeStamp(), 'Asia/Seoul') -> 12-13-2030T11:18:12
greater
greaterOrEqual
greatest
Returns the greatest value among the list of values as input. Returns null if all inputs are null
greatest(10, 30, 15, 20) -> 30
greatest(toDate('12/12/2010'), toDate('12/12/2011'), toDate('12/12/2000')) -> '12/12/2011'
hour
Gets the hour value of a timestamp. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. The local timezone is used as the default.
hour(toTimestamp('2009-07-30T12:58:59')) -> 12
hour(toTimestamp('2009-07-30T12:58:59'), 'PST') -> 12
iif
Based on a condition applies one value or the other. If other is unspecified it is considered NULL. Both the values
must be compatible(numeric, string...)
iif(custType == 'Premium', 10, 4.5)
iif(amount > 100, 'High')
iif(dayOfWeek(saleDate) == 6, 'Weekend', 'Weekday')
in
initCap
Converts the first letter of every word to uppercase. Words are identified as separated by whitespace
initCap('cool iceCREAM') -> 'Cool IceCREAM'
instr
Finds the position(1 based) of the substring within a string. 0 is returned if not found
instr('great', 'eat') -> 3
instr('microsoft', 'o') -> 7
instr('good', 'bad') -> 0
isDelete
Checks if the row is marked for delete. For transformations taking more than one input stream you can pass the
(1-based) index of the stream. Default value for the stream index is 1
isDelete() -> true
isDelete(1) -> false
isError
Checks if the row is marked as error. For transformations taking more than one input stream you can pass the (1-
based) index of the stream. Default value for the stream index is 1
isError() -> true
isError(1) -> false
isIgnore
Checks if the row is marked to be ignored. For transformations taking more than one input stream you can pass
the (1-based) index of the stream. Default value for the stream index is 1
isIgnore() -> true
isIgnore(1) -> false
isInsert
Checks if the row is marked for insert. For transformations taking more than one input stream you can pass the
(1-based) index of the stream. Default value for the stream index is 1
isInsert() -> true
isInsert(1) -> false
isMatch
isNull
isUpdate
Checks if the row is marked for update. For transformations taking more than one input stream you can pass the
(1-based) index of the stream. Default value for the stream index is 1
isUpdate() -> true
isUpdate(1) -> false
kurtosis
kurtosisIf
lag
lag(<value> : any, [<number of rows to look before> : number], [<default value> : any]) => any
Gets the value of the first parameter evaluated n rows before the current row. The second parameter is the number
of rows to look back and the default value is 1. If there are not as many rows a value of null is returned unless a
default value is specified
lag(amount, 2) -> 60
lag(amount, 2000, 100) -> 100
last
lastDayOfMonth
lead
lead(<value> : any, [<number of rows to look after> : number], [<default value> : any]) => any
Gets the value of the first parameter evaluated n rows after the current row. The second parameter is the number
of rows to look forward and the default value is 1. If there are not as many rows a value of null is returned unless a
default value is specified
lead(amount, 2) -> 60
lead(amount, 2000, 100) -> 100
least
left
Extracts a substring starting at index 1 with the given number of characters. Same as SUBSTRING (str, 1, n)
left('bojjus', 2) -> 'bo'
left('bojjus', 20) -> 'bojjus'
length
lesser
lesserOrEqual
levenshtein
like
The pattern is a string that is matched literally. The exceptions are the following special symbols: _ matches any
one character in the input (similar to . in posix regular expressions) % matches zero or more characters in the input
(similar to .* in posix regular expressions). The escape character is ''. If an escape character precedes a special
symbol or another escape character, the following character is matched literally. It is invalid to escape any other
character.
like('icecream', 'ice%') -> true
locate
locate(<substring to find> : string, <string> : string, [<from index - 1-based> : integral]) => integer
Finds the position (1-based) of the substring within a string, starting at a certain position. If the position is omitted, it is
considered to be from the beginning of the string. 0 is returned if not found
locate('eat', 'great') -> 3
locate('o', 'microsoft', 6) -> 7
locate('bad', 'good') -> 0
log
Calculates the log value. An optional base can be supplied; otherwise Euler's number is used
log(100, 10) -> 2
log10
lower
Lowercases a string
lower('GunChus') -> 'gunchus'
lpad
lpad(<string to pad> : string, <final padded length> : integral, <padding> : string) => string
Left pads the string by the supplied padding until it is of a certain length. If the string is equal to or greater than
the length, then it is considered a no-op
lpad('great', 10, '-') -> '-----great'
lpad('great', 4, '-') -> 'great'
lpad('great', 8, '<>') -> '<><great'
ltrim
Left trims a string of leading characters. If second parameter is unspecified, it trims whitespace. Else it trims any
character specified in the second parameter
ltrim('!--!wor!ld!', '-!') -> 'wor!ld!'
max
maxIf
md5
Calculates the MD5 digest of a set of columns of varying primitive datatypes and returns a 32 character hex string. It
can be used to calculate a fingerprint for a row
md5(5, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4')) -> 'c1527622a922c83665e49835e46350fe'
mean
meanIf
min
minIf
minus
Subtracts numbers. Subtracts a number of days from a date. Same as the - operator
minus(20, 10) -> 10
20 - 10 -> 10
minus(toDate('2012-12-15'), 3) -> 2012-12-12 (date value)
toDate('2012-12-15') - 3 -> 2012-12-12 (date value)
minute
Gets the minute value of a timestamp. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. The local timezone is used as the default.
minute(toTimestamp('2009-07-30T12:58:59')) -> 58
minute(toTimestamp('2009-07-30T12:58:59', 'PST')) -> 58
mod
mod(<value1> : any, <value2> : any) => any
month
monthsBetween
Gets the number of months between two dates. You can pass an optional timezone in the form of 'GMT', 'PST',
'UTC', 'America/Cayman'. The local timezone is used as the default.
monthsBetween(toDate('1997-02-28 10:30:00'), toDate('1996-10-30')) -> 3.94959677
multiply
nTile
The NTile function divides the rows for each window partition into n buckets ranging from 1 to at most n .
Bucket values will differ by at most 1. If the number of rows in the partition does not divide evenly into the
number of buckets, then the remainder values are distributed one per bucket, starting with the first bucket. The
NTile function is useful for the calculation of tertiles, quartiles, deciles, and other common summary statistics. The
function calculates two variables during initialization: the size of a regular bucket and the size of a padded bucket,
which has one extra row added to it. Both variables are based on the size of the current partition. During the calculation process, the function keeps
track of the current row number, the current bucket number, and the row number at which the bucket will change
(bucketThreshold). When the current row number reaches bucket threshold, the bucket value is increased by one
and the threshold is increased by the bucket size (plus one extra if the current bucket is padded).
nTile() -> 1
nTile(numOfBuckets) -> 1
negate
nextSequence
Returns the next unique sequence. The number is consecutive only within a partition and is prefixed by the
partitionId
nextSequence() -> 12313112
normalize
not
notEquals
null
Returns a NULL value. Use the function syntax (null()) if there is a column named 'null'. Any operation that uses
the NULL value will result in a NULL
custId = NULL (for derived field)
custId == NULL -> NULL
'nothing' + NULL -> NULL
10 * NULL -> NULL
NULL == '' -> NULL
or
pMod
power
rank
Computes the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to
the current row in the ordering of the partition. The values will produce gaps in the sequence. Rank works even
when data is not sorted and looks for change in values
rank(salesQtr, salesAmt) -> 1
regexExtract
regexExtract(<string> : string, <regex to find> : string, [<match group 1-based index> : integral]) => string
Extract a matching substring for a given regex pattern. The last parameter identifies the match group and is
defaulted to 1 if omitted. Use <regex> (back quote) to match a string without escaping
regexExtract('Cost is between 600 and 800 dollars', '(\\d+) and (\\d+)', 2) -> '800'
regexExtract('Cost is between 600 and 800 dollars', `(\d+) and (\d+)`, 2) -> '800'
regexMatch
Checks if the string matches the given regex pattern. Use <regex> (back quote) to match a string without escaping
regexMatch('200.50', '(\\d+).(\\d+)') -> true
regexMatch('200.50', `(\d+).(\d+)`) -> true
regexReplace
regexReplace(<string> : string, <regex to find> : string, <substring to replace> : string) => string
Replace all occurrences of a regex pattern with another substring in the given string. Use <regex> (back quote) to
match a string without escaping
regexReplace('100 and 200', '(\\d+)', 'bojjus') -> 'bojjus and bojjus'
regexReplace('100 and 200', `(\d+)`, 'gunchus') -> 'gunchus and gunchus'
regexSplit
Splits a string based on a delimiter based on regex and returns an array of strings
regexSplit('oneAtwoBthreeC', '[CAB]') -> ['one', 'two', 'three']
regexSplit('oneAtwoBthreeC', '[CAB]')[1] -> 'one'
regexSplit('oneAtwoBthreeC', '[CAB]')[0] -> NULL
regexSplit('oneAtwoBthreeC', '[CAB]')[20] -> NULL
replace
replace(<string> : string, <substring to find> : string, <substring to replace> : string) => string
Replace all occurrences of a substring with another substring in the given string
replace('doggie dog', 'dog', 'cat') -> 'catgie cat'
replace('doggie dog', 'dog', '') -> 'gie'
reverse
Reverses a string
reverse('gunchus') -> 'suhcnug'
right
Extracts a substring with number of characters from the right. Same as SUBSTRING (str, LENGTH(str) - n, n)
right('bojjus', 2) -> 'us'
right('bojjus', 20) -> 'bojjus'
rlike
round
round(<number> : number, [<scale to round> : number], [<rounding option> : integral]) => double
Rounds a number given an optional scale and an optional rounding mode. If the scale is omitted, it is defaulted to
0. If the mode is omitted, it is defaulted to ROUND_HALF_UP (5). The values for rounding include 1 - ROUND_UP
2 - ROUND_DOWN 3 - ROUND_CEILING 4 - ROUND_FLOOR 5 - ROUND_HALF_UP 6 -
ROUND_HALF_DOWN 7 - ROUND_HALF_EVEN 8 - ROUND_UNNECESSARY
round(100.123) -> 100.0
round(2.5, 0) -> 3.0
round(5.3999999999999995, 2, 7) -> 5.40
rowNumber
rpad
rpad(<string to pad> : string, <final padded length> : integral, <padding> : string) => string
Right pads the string by the supplied padding until it is of a certain length. If the string is equal to or greater than
the length, then it is considered a no-op
rpad('great', 10, '-') -> 'great-----'
rpad('great', 4, '-') -> 'great'
rpad('great', 8, '<>') -> 'great<><'
rtrim
Right trims a string of trailing characters. If the second parameter is unspecified, it trims whitespace. Else it trims any
character specified in the second parameter
rtrim('!--!wor!ld!', '-!') -> '!--!wor!ld'
second
Gets the second value of a timestamp. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. The local timezone is used as the default.
second(toTimestamp('2009-07-30T12:58:59')) -> 59
sha1
Calculates the SHA-1 digest of a set of columns of varying primitive datatypes and returns a 40 character hex string.
It can be used to calculate a fingerprint for a row
sha1(5, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4')) -> '63849fd2abb65fbc626c60b1f827bd05573f0cea'
sha2
Calculates the SHA-2 digest of a set of columns of varying primitive datatypes, given a bit length which can only be of
values 0(256), 224, 256, 384, 512. It can be used to calculate a fingerprint for a row
sha2(256, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4')) ->
'd3b2bff62c3a00e9370b1ac85e428e661a7df73959fa1a96ae136599e9ee20fd'
sin
sinh
skewness
skewnessIf
slice
slice(<array to slice> : array, <from 1-based index> : integral, [<number of items> : integral]) => array
Extracts a subset of an array from a position. Position is 1-based. If the length is omitted, it is defaulted to the end of
the array
slice([10, 20, 30, 40], 1, 2) -> [10, 20]
slice([10, 20, 30, 40], 2) -> [20, 30, 40]
slice([10, 20, 30, 40], 2)[1] -> 20
slice([10, 20, 30, 40], 2)[0] -> NULL
slice([10, 20, 30, 40], 2)[20] -> NULL
slice([10, 20, 30, 40], 8) -> []
soundex
split
sqrt
startsWith
stddev
stddevIf
stddevPopulation
stddevSample
stddevSampleIf
subDays
subMonths
substring
substring(<string to subset> : string, <from 1-based index> : integral, [<number of characters> : integral]) =>
string
Extracts a substring of a certain length from a position. Position is 1 based. If the length is omitted, it is defaulted to
end of the string
substring('Cat in the hat', 5, 2) -> 'in'
substring('Cat in the hat', 5, 100) -> 'in the hat'
substring('Cat in the hat', 5) -> 'in the hat'
substring('Cat in the hat', 100, 100) -> ''
sum
sumDistinct
sumDistinctIf
Based on criteria, gets the aggregate sum of the distinct values of a numeric column. The condition can be based on any column
sumDistinctIf(state == 'CA' && commission < 10000, sales) -> value
sumDistinctIf(true, sales) -> SUM(sales)
sumIf
Based on criteria gets the aggregate sum of a numeric column. The condition can be based on any column
sumIf(state == 'CA' && commission < 10000, sales) -> value
sumIf(true, sales) -> SUM(sales)
tan
tanh
toBoolean
Converts a value of ('t', 'true', 'y', 'yes', '1') to true and ('f', 'false', 'n', 'no', '0') to false and NULL for any other value
toBoolean('true') -> true
toBoolean('n') -> false
toBoolean('truthy') -> NULL
toDate
Converts a string to a date given an optional date format. Refer to Java SimpleDateFormat for all possible formats.
If the date format is omitted, combinations of the following are accepted: [ yyyy, yyyy-[M]M, yyyy-[M]M-[d]d, yyyy-[M]M-[d]dT* ]
toDate('2012-8-8') -> 2012-8-8
toDate('12/12/2012', 'MM/dd/yyyy') -> 2012-12-12
toDecimal
Converts any numeric or string to a decimal value. If precision and scale are not specified, it is defaulted to
(10,2). An optional Java decimal format can be used for the conversion. An optional locale format in the form of
BCP47 language like en-US, de, zh-CN
toDecimal(123.45) -> 123.45
toDecimal('123.45', 8, 4) -> 123.4500
toDecimal('$123.45', 8, 4,'$###.00') -> 123.4500
toDecimal('€123,45', 10, 2, '€###,##', 'de') -> 123.45
toDouble
Converts any numeric or string to a double value. An optional Java decimal format can be used for the conversion.
An optional locale format in the form of BCP47 language like en-US, de, zh-CN
toDouble(123.45) -> 123.45
toDouble('123.45') -> 123.45
toDouble('$123.45', '$###.00') -> 123.45
toDouble('€123,45', '€###,##', 'de') -> 123.45
toFloat
Converts any numeric or string to a float value. An optional Java decimal format can be used for the conversion.
Truncates any double
toFloat(123.45) -> 123.45
toFloat('123.45') -> 123.45
toFloat('$123.45', '$###.00') -> 123.45
toInteger
Converts any numeric or string to an integer value. An optional Java decimal format can be used for the
conversion. Truncates any long, float, double
toInteger(123) -> 123
toInteger('123') -> 123
toInteger('$123', '$###') -> 123
toLong
Converts any numeric or string to a long value. An optional Java decimal format can be used for the conversion.
Truncates any float, double
toLong(123) -> 123
toLong('123') -> 123
toLong('$123', '$###') -> 123
toShort
Converts any numeric or string to a short value. An optional Java decimal format can be used for the conversion.
Truncates any integer, long, float, double
toShort(123) -> 123
toShort('123') -> 123
toShort('$123', '$###') -> 123
toString
Converts a primitive datatype to a string. For numbers and dates a format can be specified. If unspecified, the
system default is picked. Java decimal format is used for numbers. Refer to Java SimpleDateFormat for all possible
date formats; the default format is yyyy-MM-dd
toString(10) -> '10'
toString('engineer') -> 'engineer'
toString(123456.789, '##,###.##') -> '123,456.79'
toString(123.78, '000000.000') -> '000123.780'
toString(12345, '##0.#####E0') -> '12.345E3'
toString(toDate('2018-12-31')) -> '2018-12-31'
toString(toDate('2018-12-31'), 'MM/dd/yy') -> '12/31/18'
toString(4 == 20) -> 'false'
toTimestamp
toTimestamp(<string> : any, [<timestamp format> : string], [<time zone> : string]) => timestamp
Converts a string to a timestamp given an optional timestamp format. Refer to Java SimpleDateFormat for all possible
formats. If the timestamp format is omitted, the default pattern yyyy-[M]M-[d]d hh:mm:ss[.f...] is used
toTimestamp('2016-12-31 00:12:00') -> 2016-12-31T00:12:00
toTimestamp('2016/12/31T00:12:00', 'MM/dd/yyyyThh:mm:ss') -> 2016-12-31T00:12:00
toUTC
Converts the timestamp to UTC. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. It is defaulted to the current timezone
toUTC(currentTimeStamp()) -> 12-12-2030T19:18:12
toUTC(currentTimeStamp(), 'Asia/Seoul') -> 12-13-2030T11:18:12
translate
translate(<string to translate> : string, <lookup characters> : string, <replace characters> : string) =>
string
Replace one set of characters by another set of characters in the string. Characters have 1 to 1 replacement
translate('(Hello)', '()', '[]') -> '[Hello]'
translate('(Hello)', '()', '[') -> '[Hello'
trim
Trims a string of leading and trailing characters. If second parameter is unspecified, it trims whitespace. Else it
trims any character specified in the second parameter
trim('!--!wor!ld!', '-!') -> 'wor!ld'
true
Always returns a true value. Use the function syntax(true()) if there is a column named 'true'
isDiscounted == true()
isDiscounted() == true
typeMatch
Matches the type of the column. Can only be used in pattern expressions. 'number' matches short, integer, long,
double, float, or decimal; 'integral' matches short, integer, long; 'fractional' matches double, float, decimal; and
'datetime' matches date or timestamp types
typeMatch(type, 'number') -> true
typeMatch('date', 'number') -> false
upper
Uppercases a string
upper('bojjus') -> 'BOJJUS'
variance
varianceIf
variancePopulation
variancePopulationIf
varianceSample
varianceSampleIf
weekOfYear
xor
xor(<value1> : boolean, <value2> : boolean) => boolean
year
Next steps
Learn how to use Expression Builder.
Roles and permissions for Azure Data Factory
3/7/2019 • 3 minutes to read • Edit Online
This article describes the roles required to create and manage Azure Data Factory resources, and the permissions
granted by those roles.
Set up permissions
After you create a Data Factory, you may want to let other users work with the data factory. To give this access to
other users, you have to add them to the built-in Data Factory Contributor role on the resource group that
contains the data factory.
Scope of the Data Factory Contributor role
Membership in the Data Factory Contributor role lets users do the following things:
Create, edit, and delete data factories and child resources including datasets, linked services, pipelines, triggers,
and integration runtimes.
Deploy Resource Manager templates. Resource Manager deployment is the deployment method used by Data
Factory in the Azure portal.
Manage App Insights alerts for a data factory.
Create support tickets.
For more info about this role, see Data Factory Contributor role.
Resource Manager template deployment
The Data Factory Contributor role, at the resource group level or above, lets users deploy Resource Manager
templates. As a result, members of the role can use Resource Manager templates to deploy both data factories
and their child resources, including datasets, linked services, pipelines, triggers, and integration runtimes.
Membership in this role does not let the user create other resources, however.
Permissions on Azure Repos and GitHub are independent of Data Factory permissions. As a result, a user with
repo permissions who is only a member of the Reader role can edit Data Factory child resources and commit
changes to the repo, but can't publish these changes.
IMPORTANT
Resource Manager template deployment with the Data Factory Contributor role does not elevate your permissions. For
example, if you deploy a template that creates an Azure virtual machine, and you don't have permission to create virtual
machines, the deployment fails with an authorization error.
Next steps
Learn more about roles in Azure - Understand role definitions
Learn more about the Data Factory contributor role - Data Factory Contributor role.
Understanding Data Factory pricing through
examples
5/6/2019 • 6 minutes to read • Edit Online
This article explains and demonstrates the Azure Data Factory pricing model with detailed examples.
NOTE
The prices used in the examples below are hypothetical and are not intended to imply actual pricing.
OPERATIONS | TYPES AND UNITS
Run Pipeline | 2 Activity runs (1 for trigger run, 1 for activity runs)
Copy Data (Assumption: execution time = 10 min) | 10 * 4 Azure Integration Runtime (default DIU setting = 4). For more information on data integration units and optimizing copy performance, see this article
Monitor Pipeline (Assumption: only 1 run occurred) | 2 Monitoring run records retrieved (1 for pipeline run, 1 for activity run)
Run Pipeline | 3 Activity runs (1 for trigger run, 2 for activity runs)
Copy Data (Assumption: execution time = 10 min) | 10 * 4 Azure Integration Runtime (default DIU setting = 4). For more information on data integration units and optimizing copy performance, see this article
Monitor Pipeline (Assumption: only 1 run occurred) | 3 Monitoring run records retrieved (1 for pipeline run, 2 for activity runs)
Execute Databricks activity (Assumption: execution time = 10 min) | 10 min External Pipeline Activity Execution
Run Pipeline | 4 Activity runs (1 for trigger run, 3 for activity runs)
Copy Data (Assumption: execution time = 10 min) | 10 * 4 Azure Integration Runtime (default DIU setting = 4). For more information on data integration units and optimizing copy performance, see this article
Monitor Pipeline (Assumption: only 1 run occurred) | 4 Monitoring run records retrieved (1 for pipeline run, 3 for activity runs)
Execute Lookup activity (Assumption: execution time = 1 min) | 1 min Pipeline Activity execution
Execute Databricks activity (Assumption: execution time = 10 min) | 10 min External Pipeline Activity execution
Using mapping data flow debug for a normal workday (Preview Pricing)
As a Data Engineer, you are responsible for designing, building, and testing Mapping Data Flows every day. You
log into the ADF UI in the morning and enable the Debug mode for Data Flows. The default TTL for Debug
sessions is 60 minutes. You work throughout the day for 10 hours, so your Debug session never expires. Therefore,
your charge for the day will be:
10 (hours) x 8 (cores) x $0.112 = $8.96
Transform data in blob store with mapping data flows (Preview Pricing)
In this scenario, you want to transform data in Blob Store visually in ADF Mapping Data Flows on an hourly
schedule.
To accomplish the scenario, you need to create a pipeline with the following items:
1. A Data Flow activity with the transformation logic.
2. An input dataset for the data on Azure Storage.
3. An output dataset for the data on Azure Storage.
4. A schedule trigger to execute the pipeline every hour.
Run Pipeline | 2 Activity runs (1 for trigger run, 1 for activity runs)
Data Flow (Assumptions: execution time = 10 min + 10 min TTL) | 10 * 8 cores of General Compute with TTL of 10
Monitor Pipeline (Assumption: only 1 run occurred) | 2 Monitoring run records retrieved (1 for pipeline run, 1 for activity run)
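To turn that into a dollar figure, and assuming purely for illustration the same hypothetical $0.112 per core-hour used in the debug example above, the Data Flow portion of a single hourly run would be roughly:
(10 min execution + 10 min TTL) / 60 x 8 cores x $0.112 = about $0.30 per run
Activity runs and monitoring run records are billed on top of this at their own rates.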
Next steps
Now that you understand the pricing for Azure Data Factory, you can get started!
Create a data factory by using the Azure Data Factory UI
Introduction to Azure Data Factory
Visual authoring in Azure Data Factory
Azure Data Factory - naming rules
1/3/2019 • 2 minutes to read • Edit Online
The following table provides naming rules for Data Factory artifacts.
Data Factory: Unique across Microsoft Azure. Names are case-insensitive; that is, MyDF and mydf refer to the same data factory. Each data factory is tied to exactly one Azure subscription. Object names must start with a letter or a number, and can contain only letters, numbers, and the dash (-) character. Every dash (-) character must be immediately preceded and followed by a letter or a number. Consecutive dashes are not permitted in container names. The name can be 3-63 characters long.
Linked Services/Datasets/Pipelines: Unique within a data factory. Names are case-insensitive. Object names must start with a letter, number, or an underscore (_). The following characters are not allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\". Dashes ("-") are not allowed in the names of linked services and of datasets only.
Resource Group: Unique across Microsoft Azure. Names are case-insensitive. For more info, see Azure naming rules and restrictions.
Next steps
Learn how to create data factories by following step-by-step instructions in Quickstart: create a data factory
article.
Visual authoring in Azure Data Factory
5/9/2019 • 14 minutes to read • Edit Online
The Azure Data Factory user interface experience (UX) lets you visually author and deploy resources for your data
factory without having to write any code. You can drag activities to a pipeline canvas, perform test runs, debug
iteratively, and deploy and monitor your pipeline runs. There are two approaches for using the UX to perform
visual authoring:
Author directly with the Data Factory service.
Author with Azure Repos Git integration for collaboration, source control, and versioning.
When you use the UX Authoring canvas to author directly with the Data Factory service, only the Publish All
mode is available. Any changes that you make are published directly to the Data Factory service.
Author with Azure Repos Git integration
Visual authoring with Azure Repos Git integration supports source control and collaboration for work on your data
factory pipelines. You can associate a data factory with an Azure Repos Git organization repository for source
control, collaboration, versioning, and so on. A single Azure Repos Git organization can have multiple repositories,
but an Azure Repos Git repository can be associated with only one data factory. If you don't have an Azure Repos
organization or repository, follow these instructions to create your resources.
NOTE
You can store script and data files in an Azure Repos Git repository. However, you have to upload the files manually to Azure
Storage. A Data Factory pipeline does not automatically upload script or data files stored in an Azure Repos Git repository to
Azure Storage.
The pane shows the following Azure Repos code repository settings:
Repository Type: The type of the Azure Repos code repository. Value: Azure Repos Git
Azure Active Directory: Your Azure AD tenant name. Value: <your tenant name>
Azure Repos Organization: Your Azure Repos organization name. You can locate your Azure Repos organization name at https://{organization name}.visualstudio.com. You can sign in to your Azure Repos organization to access your Visual Studio profile and see your repositories and projects. Value: <your organization name>
ProjectName: Your Azure Repos project name. You can locate your Azure Repos project name at https://{organization name}.visualstudio.com/{project name}. Value: <your Azure Repos project name>
RepositoryName: Your Azure Repos code repository name. Azure Repos projects contain Git repositories to manage your source code as your project grows. You can create a new repository or use an existing repository that's already in your project. Value: <your Azure Repos code repository name>
Collaboration branch: Your Azure Repos collaboration branch that is used for publishing. By default, it is master. Change this setting if you want to publish resources from another branch. Value: <your collaboration branch name>
Root folder: Your root folder in your Azure Repos collaboration branch. Value: <your root folder name>
Import existing Data Factory resources to repository: Specifies whether to import existing data factory resources from the UX Authoring canvas into an Azure Repos Git repository. Select the box to import your data factory resources into the associated Git repository in JSON format. This action exports each resource individually (that is, the linked services and datasets are exported into separate JSONs). When this box isn't selected, the existing resources aren't imported. Value: Selected (default)
When you are ready with the feature development in your feature branch, you can click Create pull request. This
action takes you to Azure Repos Git where you can raise pull requests, do code reviews, and merge changes to
your collaboration branch. ( master is the default). You are only allowed to publish to the Data Factory service from
your collaboration branch.
Configure publishing settings
To configure the publish branch - that is, the branch where Resource Manager templates are saved - add a
publish_config.json file to the root folder in the collaboration branch. Data Factory reads this file, looks for the
field publishBranch , and creates a new branch (if it doesn't already exist) with the value provided. Then it saves all
Resource Manager templates to the specified location. For example:
{
"publishBranch": "factory/adf_publish"
}
When you publish from Git mode, you can confirm that Data Factory is using the publish branch that you expect,
as shown in the following screenshot:
When you specify a new publish branch, Data Factory doesn't delete the previous publish branch. If you want to remove the previous publish branch, delete it manually.
Data Factory only reads the publish_config.json file when it loads the factory. If you already have the factory
loaded in the portal, refresh the browser to make your changes take effect.
Publish code changes
After you have merged changes to the collaboration branch ( master is the default), select Publish to manually
publish your code changes in the master branch to the Data Factory service.
IMPORTANT
The master branch is not representative of what's deployed in the Data Factory service. The master branch must be
published manually to the Data Factory service.
Limitations
You can store script and data files in a GitHub repository. However, you have to upload the files manually to
Azure Storage. A Data Factory pipeline does not automatically upload script or data files stored in a GitHub
repository to Azure Storage.
GitHub Enterprise with a version older than 2.14.0 doesn't work in the Microsoft Edge browser.
GitHub integration with the Data Factory visual authoring tools only works in the generally available version of Data Factory.
Configure a public GitHub repository with Azure Data Factory
You can configure a GitHub repository with a data factory through two methods.
Configuration method 1 (public repo): Let's get started page
In Azure Data Factory, go to the Let's get started page. Select Configure Code Repository:
The pane shows the following GitHub code repository settings:
Import existing Data Factory resources to repository: Specifies whether to import existing data factory resources from the UX Authoring canvas into a GitHub repository. Select the box to import your data factory resources into the associated Git repository in JSON format. This action exports each resource individually (that is, the linked services and datasets are exported into separate JSONs). When this box isn't selected, the existing resources aren't imported. Value: Selected (default)
Branch to import resource into: Specifies into which branch the data factory resources (pipelines, datasets, linked services, and so on) are imported. You can import resources into one of the following branches: a. Collaboration b. Create new c. Use Existing
Next steps
To learn more about monitoring and managing pipelines, see Monitor and manage pipelines programmatically.
Continuous integration and delivery (CI/CD) in Azure
Data Factory
5/22/2019 • 26 minutes to read
Continuous Integration is the practice of automatically testing each change made to your codebase as early as possible. Continuous Delivery follows the testing that happens during Continuous Integration and pushes changes
to a staging or production system.
For Azure Data Factory, continuous integration & delivery means moving Data Factory pipelines from one
environment (development, test, production) to another. To do continuous integration & delivery, you can use Data
Factory UI integration with Azure Resource Manager templates. The Data Factory UI can generate a Resource
Manager template when you select the ARM template options. When you select Export ARM template, the
portal generates the Resource Manager template for the data factory and a configuration file that includes all your connection strings and other parameters. Then you have to create one configuration file for each environment
(development, test, production). The main Resource Manager template file remains the same for all the
environments.
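Instead of importing the template through the portal, you can deploy the exported template from a script, which is also what a release pipeline does. A minimal sketch, assuming the file names produced by Export ARM template and a per-environment copy of the parameters file (the resource group and file names here are placeholders):
# Deploy the exported factory template to the test environment
New-AzResourceGroupDeployment `
  -ResourceGroupName "adf-test-rg" `
  -TemplateFile ".\ARMTemplateForFactory.json" `
  -TemplateParameterFile ".\ARMTemplateParametersForFactory.test.json" `
  -Mode Incremental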
For a nine-minute introduction and demonstration of this feature, watch the following video:
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Then go to your test data factory and production data factory and select Import ARM template.
This action takes you to the Azure portal, where you can import the exported template. Select Build your own
template in the editor, select Load file, and choose the generated Resource Manager template. Provide the settings, and the data factory and the entire pipeline are imported into your production environment.
Select Load file to select the exported Resource Manager template and provide all the configuration values (for
example, linked services).
Connection strings. You can find the info required to create connection strings in the articles about the individual
connectors. For example, for Azure SQL Database, see Copy data to or from Azure SQL Database by using Azure
Data Factory. To verify the correct connection string - for a linked service, for example - you can also open code
view for the resource in the Data Factory UI. In code view, however, the password or account key portion of the
connection string is removed. To open code view, select the icon highlighted in the following screenshot.
Requirements
An Azure subscription linked to Team Foundation Server or Azure Repos using the Azure Resource
Manager service endpoint.
A Data Factory with Azure Repos Git integration configured.
An Azure Key Vault containing the secrets.
Set up an Azure Pipelines release
1. Go to your Azure Repos page in the same project as the one configured with the Data Factory.
2. Click on the top menu Azure Pipelines > Releases > Create release definition.
WARNING
If you select Complete deployment mode, existing resources may be deleted, including all the resources in the target
resource group that are not defined in the Resource Manager template.
{
    "parameters": {
        "azureSqlReportingDbPassword": {
            "reference": {
                "keyVault": {
                    "id": "/subscriptions/<subId>/resourceGroups/<resourcegroupId>/providers/Microsoft.KeyVault/vaults/<vault-name>"
                },
                "secretName": "<secret-name>"
            }
        }
    }
}
When you use this method, the secret is pulled from the key vault automatically.
The parameters file needs to be in the publish branch as well.
2. Add an Azure Key Vault task before the Azure Resource Manager Deployment described in the previous
section:
Select the Tasks tab, create a new task, search for Azure Key Vault and add it.
In the Key Vault task, choose the subscription in which you created the key vault, provide credentials
if necessary, and then choose the key vault.
You can follow similar steps and use similar code (with the Start-AzDataFactoryV2Trigger function) to restart the
triggers after deployment.
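For example, a minimal post-deployment step that restarts triggers could look like the following sketch. The resource group and factory names are placeholders, and in practice you would restart only the triggers you stopped before deployment.
# Restart triggers after deployment (names are placeholders)
Get-AzDataFactoryV2Trigger -ResourceGroupName "adf-prod-rg" -DataFactoryName "mydatafactory-prod" |
  Where-Object { $_.RuntimeState -ne "Started" } |
  ForEach-Object {
    Start-AzDataFactoryV2Trigger -ResourceGroupName "adf-prod-rg" -DataFactoryName "mydatafactory-prod" -Name $_.Name -Force
  }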
IMPORTANT
In continuous integration and deployment scenarios, the Integration Runtime type across different environments must be
the same. For example, if you have a Self-Hosted Integration Runtime (IR) in the development environment, the same IR
must be of type Self-Hosted in other environments such as test and production also. Similarly, if you're sharing integration
runtimes across multiple stages, you have to configure the Integration Runtimes as Linked Self-Hosted in all environments,
such as development, test, and production.
{
"source": 2,
"id": 1,
"revision": 51,
"name": "Data Factory Prod Deployment",
"description": null,
"createdBy": {
"displayName": "Sample User",
"url": "https://pde14b1dc-d2c9-49e5-88cb-45ccd58d0335.codex.ms/vssps/_apis/Identities/c9f828d1-2dbb-4e39-
b096-f1c53d82bc2c",
"id": "c9f828d1-2dbb-4e39-b096-f1c53d82bc2c",
"uniqueName": "sampleuser@microsoft.com",
"imageUrl": "https://sampleuser.visualstudio.com/_api/_common/identityImage?id=c9f828d1-2dbb-4e39-b096-
f1c53d82bc2c",
"descriptor": "aad.M2Y2N2JlZGUtMDViZC03ZWI3LTgxYWMtMDcwM2UyODMxNTBk"
},
"createdOn": "2018-03-01T22:57:25.660Z",
"modifiedBy": {
"displayName": "Sample User",
"url": "https://pde14b1dc-d2c9-49e5-88cb-45ccd58d0335.codex.ms/vssps/_apis/Identities/c9f828d1-2dbb-4e39-
b096-f1c53d82bc2c",
"id": "c9f828d1-2dbb-4e39-b096-f1c53d82bc2c",
"uniqueName": "sampleuser@microsoft.com",
"imageUrl": "https://sampleuser.visualstudio.com/_api/_common/identityImage?id=c9f828d1-2dbb-4e39-b096-
f1c53d82bc2c",
"descriptor": "aad.M2Y2N2JlZGUtMDViZC03ZWI3LTgxYWMtMDcwM2UyODMxNTBk"
},
"modifiedOn": "2018-03-14T17:58:11.643Z",
"isDeleted": false,
"path": "\\",
"variables": {},
"variableGroups": [],
"environments": [{
"id": 1,
"name": "Prod",
"rank": 1,
"owner": {
"displayName": "Sample User",
"url": "https://pde14b1dc-d2c9-49e5-88cb-45ccd58d0335.codex.ms/vssps/_apis/Identities/c9f828d1-2dbb-4e39-
b096-f1c53d82bc2c",
"id": "c9f828d1-2dbb-4e39-b096-f1c53d82bc2c",
"uniqueName": "sampleuser@microsoft.com",
"imageUrl": "https://sampleuser.visualstudio.com/_api/_common/identityImage?id=c9f828d1-2dbb-4e39-b096-
f1c53d82bc2c",
"descriptor": "aad.M2Y2N2JlZGUtMDViZC03ZWI3LTgxYWMtMDcwM2UyODMxNTBk"
},
"variables": {
"factoryName": {
"value": "sampleuserprod"
}
},
"variableGroups": [],
"preDeployApprovals": {
"approvals": [{
"rank": 1,
"isAutomated": true,
"isNotificationOn": false,
"id": 1
}],
"approvalOptions": {
"requiredApproverCount": null,
"releaseCreatorCanBeApprover": false,
"autoTriggeredAndPreviousEnvironmentApprovedCanBeSkipped": false,
"enforceIdentityRevalidation": false,
"timeoutInMinutes": 0,
"executionOrder": 1
}
},
"deployStep": {
"id": 2
},
"postDeployApprovals": {
"approvals": [{
"rank": 1,
"isAutomated": true,
"isNotificationOn": false,
"id": 3
}],
"approvalOptions": {
"requiredApproverCount": null,
"releaseCreatorCanBeApprover": false,
"releaseCreatorCanBeApprover": false,
"autoTriggeredAndPreviousEnvironmentApprovedCanBeSkipped": false,
"enforceIdentityRevalidation": false,
"timeoutInMinutes": 0,
"executionOrder": 2
}
},
"deployPhases": [{
"deploymentInput": {
"parallelExecution": {
"parallelExecutionType": "none"
},
"skipArtifactsDownload": false,
"artifactsDownloadInput": {
"downloadInputs": []
},
"queueId": 19,
"demands": [],
"enableAccessToken": false,
"timeoutInMinutes": 0,
"jobCancelTimeoutInMinutes": 1,
"condition": "succeeded()",
"overrideInputs": {}
},
"rank": 1,
"phaseType": 1,
"name": "Run on agent",
"workflowTasks": [{
"taskId": "72a1931b-effb-4d2e-8fd8-f8472a07cb62",
"version": "2.*",
"name": "Azure PowerShell script: FilePath",
"refName": "",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": "task",
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"ConnectedServiceNameSelector": "ConnectedServiceNameARM",
"ConnectedServiceName": "",
"ConnectedServiceNameARM": "e4e2ef4b-8289-41a6-ba7c-92ca469700aa",
"ScriptType": "FilePath",
"ScriptPath": "$(System.DefaultWorkingDirectory)/Dev/deployment.ps1",
"Inline": "param\n(\n [parameter(Mandatory = $false)] [String]
$rootFolder=\"C:\\Users\\sampleuser\\Downloads\\arm_template\",\n [parameter(Mandatory = $false)] [String]
$armTemplate=\"$rootFolder\\arm_template.json\",\n [parameter(Mandatory = $false)] [String]
$armTemplateParameters=\"$rootFolder\\arm_template_parameters.json\",\n [parameter(Mandatory = $false)]
[String] $domain=\"microsoft.onmicrosoft.com\",\n [parameter(Mandatory = $false)] [String]
$TenantId=\"72f988bf-86f1-41af-91ab-2d7cd011db47\",\n [parame",
"ScriptArguments": "-rootFolder \"$(System.DefaultWorkingDirectory)/Dev/\" -DataFactoryName $(factoryname)
-predeployment $true",
"TargetAzurePs": "LatestVersion",
"CustomTargetAzurePs": "5.*"
}
}, {
"taskId": "1e244d32-2dd4-4165-96fb-b7441ca9331e",
"version": "1.*",
"name": "Azure Key Vault: sampleuservault",
"refName": "secret1",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": "task",
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"ConnectedServiceName": "e4e2ef4b-8289-41a6-ba7c-92ca469700aa",
"ConnectedServiceName": "e4e2ef4b-8289-41a6-ba7c-92ca469700aa",
"KeyVaultName": "sampleuservault",
"SecretsFilter": "*"
}
}, {
"taskId": "94a74903-f93f-4075-884f-dc11f34058b4",
"version": "2.*",
"name": "Azure Deployment:Create Or Update Resource Group action on sampleuser-datafactory",
"refName": "",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": "task",
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"ConnectedServiceName": "e4e2ef4b-8289-41a6-ba7c-92ca469700aa",
"action": "Create Or Update Resource Group",
"resourceGroupName": "sampleuser-datafactory",
"location": "East US",
"templateLocation": "Linked artifact",
"csmFileLink": "",
"csmParametersFileLink": "",
"csmFile": "$(System.DefaultWorkingDirectory)/Dev/ARMTemplateForFactory.json",
"csmParametersFile": "$(System.DefaultWorkingDirectory)/Dev/ARMTemplateParametersForFactory.json",
"overrideParameters": "-factoryName \"$(factoryName)\" -linkedService1_connectionString
\"$(linkedService1-connectionString)\" -linkedService2_connectionString \"$(linkedService2-
connectionString)\"",
"deploymentMode": "Incremental",
"enableDeploymentPrerequisites": "None",
"deploymentGroupEndpoint": "",
"project": "",
"deploymentGroupName": "",
"copyAzureVMTags": "true",
"outputVariable": "",
"deploymentOutputs": ""
}
}, {
"taskId": "72a1931b-effb-4d2e-8fd8-f8472a07cb62",
"version": "2.*",
"name": "Azure PowerShell script: FilePath",
"refName": "",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": "task",
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"ConnectedServiceNameSelector": "ConnectedServiceNameARM",
"ConnectedServiceName": "",
"ConnectedServiceNameARM": "e4e2ef4b-8289-41a6-ba7c-92ca469700aa",
"ScriptType": "FilePath",
"ScriptPath": "$(System.DefaultWorkingDirectory)/Dev/deployment.ps1",
"Inline": "# You can write your azure powershell scripts inline here. \n# You can also pass predefined and
custom variables to this script using arguments",
"ScriptArguments": "-rootFolder \"$(System.DefaultWorkingDirectory)/Dev/\" -DataFactoryName $(factoryname)
-predeployment $false",
"TargetAzurePs": "LatestVersion",
"CustomTargetAzurePs": ""
}
}]
}],
"environmentOptions": {
"emailNotificationType": "OnlyOnFailure",
"emailRecipients": "release.environment.owner;release.creator",
"skipArtifactsDownload": false,
"timeoutInMinutes": 0,
"timeoutInMinutes": 0,
"enableAccessToken": false,
"publishDeploymentStatus": true,
"badgeEnabled": false,
"autoLinkWorkItems": false
},
"demands": [],
"conditions": [{
"name": "ReleaseStarted",
"conditionType": 1,
"value": ""
}],
"executionPolicy": {
"concurrencyCount": 1,
"queueDepthCount": 0
},
"schedules": [],
"retentionPolicy": {
"daysToKeep": 30,
"releasesToKeep": 3,
"retainBuild": true
},
"processParameters": {
"dataSourceBindings": [{
"dataSourceName": "AzureRMWebAppNamesByType",
"parameters": {
"WebAppKind": "$(WebAppKind)"
},
"endpointId": "$(ConnectedServiceName)",
"target": "WebAppName"
}]
},
"properties": {},
"preDeploymentGates": {
"id": 0,
"gatesOptions": null,
"gates": []
},
"postDeploymentGates": {
"id": 0,
"gatesOptions": null,
"gates": []
},
"badgeUrl": "https://sampleuser.vsrm.visualstudio.com/_apis/public/Release/badge/19749ef3-2f42-49b5-9696-
f28b49faebcb/1/1"
}, {
"id": 2,
"name": "Staging",
"rank": 2,
"owner": {
"displayName": "Sample User",
"url": "https://pde14b1dc-d2c9-49e5-88cb-45ccd58d0335.codex.ms/vssps/_apis/Identities/c9f828d1-2dbb-4e39-
b096-f1c53d82bc2c",
"id": "c9f828d1-2dbb-4e39-b096-f1c53d82bc2c",
"uniqueName": "sampleuser@microsoft.com",
"imageUrl": "https://sampleuser.visualstudio.com/_api/_common/identityImage?id=c9f828d1-2dbb-4e39-b096-
f1c53d82bc2c",
"descriptor": "aad.M2Y2N2JlZGUtMDViZC03ZWI3LTgxYWMtMDcwM2UyODMxNTBk"
},
"variables": {
"factoryName": {
"value": "sampleuserstaging"
}
},
"variableGroups": [],
"preDeployApprovals": {
"approvals": [{
"rank": 1,
"isAutomated": true,
"isNotificationOn": false,
"isNotificationOn": false,
"id": 4
}],
"approvalOptions": {
"requiredApproverCount": null,
"releaseCreatorCanBeApprover": false,
"autoTriggeredAndPreviousEnvironmentApprovedCanBeSkipped": false,
"enforceIdentityRevalidation": false,
"timeoutInMinutes": 0,
"executionOrder": 1
}
},
"deployStep": {
"id": 5
},
"postDeployApprovals": {
"approvals": [{
"rank": 1,
"isAutomated": true,
"isNotificationOn": false,
"id": 6
}],
"approvalOptions": {
"requiredApproverCount": null,
"releaseCreatorCanBeApprover": false,
"autoTriggeredAndPreviousEnvironmentApprovedCanBeSkipped": false,
"enforceIdentityRevalidation": false,
"timeoutInMinutes": 0,
"executionOrder": 2
}
},
"deployPhases": [{
"deploymentInput": {
"parallelExecution": {
"parallelExecutionType": "none"
},
"skipArtifactsDownload": false,
"artifactsDownloadInput": {
"downloadInputs": []
},
"queueId": 19,
"demands": [],
"enableAccessToken": false,
"timeoutInMinutes": 0,
"jobCancelTimeoutInMinutes": 1,
"condition": "succeeded()",
"overrideInputs": {}
},
"rank": 1,
"phaseType": 1,
"name": "Run on agent",
"workflowTasks": [{
"taskId": "72a1931b-effb-4d2e-8fd8-f8472a07cb62",
"version": "2.*",
"name": "Azure PowerShell script: FilePath",
"refName": "",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": "task",
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"ConnectedServiceNameSelector": "ConnectedServiceNameARM",
"ConnectedServiceName": "",
"ConnectedServiceNameARM": "e4e2ef4b-8289-41a6-ba7c-92ca469700aa",
"ScriptType": "FilePath",
"ScriptPath": "$(System.DefaultWorkingDirectory)/Dev/deployment.ps1",
"Inline": "# You can write your azure powershell scripts inline here. \n# You can also pass predefined and
custom variables to this script using arguments",
"ScriptArguments": "-rootFolder \"$(System.DefaultWorkingDirectory)/Dev/\" -DataFactoryName $(factoryname)
-predeployment $true",
"TargetAzurePs": "LatestVersion",
"CustomTargetAzurePs": ""
}
}, {
"taskId": "1e244d32-2dd4-4165-96fb-b7441ca9331e",
"version": "1.*",
"name": "Azure Key Vault: sampleuservault",
"refName": "",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": "task",
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"ConnectedServiceName": "e4e2ef4b-8289-41a6-ba7c-92ca469700aa",
"KeyVaultName": "sampleuservault",
"SecretsFilter": "*"
}
}, {
"taskId": "94a74903-f93f-4075-884f-dc11f34058b4",
"version": "2.*",
"name": "Azure Deployment:Create Or Update Resource Group action on sampleuser-datafactory",
"refName": "",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": "task",
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"ConnectedServiceName": "e4e2ef4b-8289-41a6-ba7c-92ca469700aa",
"action": "Create Or Update Resource Group",
"resourceGroupName": "sampleuser-datafactory",
"location": "East US",
"templateLocation": "Linked artifact",
"csmFileLink": "",
"csmParametersFileLink": "",
"csmFile": "$(System.DefaultWorkingDirectory)/Dev/ARMTemplateForFactory.json",
"csmParametersFile": "$(System.DefaultWorkingDirectory)/Dev/ARMTemplateParametersForFactory.json",
"overrideParameters": "-factoryName \"$(factoryName)\" -linkedService1_connectionString
\"$(linkedService1-connectionString)\" -linkedService2_connectionString \"$(linkedService2-
connectionString)\"",
"deploymentMode": "Incremental",
"enableDeploymentPrerequisites": "None",
"deploymentGroupEndpoint": "",
"project": "",
"deploymentGroupName": "",
"copyAzureVMTags": "true",
"outputVariable": "",
"deploymentOutputs": ""
}
}, {
"taskId": "72a1931b-effb-4d2e-8fd8-f8472a07cb62",
"version": "2.*",
"name": "Azure PowerShell script: FilePath",
"refName": "",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": "task",
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"ConnectedServiceNameSelector": "ConnectedServiceNameARM",
"ConnectedServiceName": "",
"ConnectedServiceNameARM": "16a37943-8b58-4c2f-a3d6-052d6f032a07",
"ScriptType": "FilePath",
"ScriptPath": "$(System.DefaultWorkingDirectory)/Dev/deployment.ps1",
"Inline": "param(\n$x,\n$y,\n$z)\nwrite-host \"----------\"\nwrite-host $x\nwrite-host $y\nwrite-host $z |
ConvertTo-SecureString\nwrite-host \"----------\"",
"ScriptArguments": "-rootFolder \"$(System.DefaultWorkingDirectory)/Dev/\" -DataFactoryName $(factoryname)
-predeployment $false",
"TargetAzurePs": "LatestVersion",
"CustomTargetAzurePs": ""
}
}]
}],
"environmentOptions": {
"emailNotificationType": "OnlyOnFailure",
"emailRecipients": "release.environment.owner;release.creator",
"skipArtifactsDownload": false,
"timeoutInMinutes": 0,
"enableAccessToken": false,
"publishDeploymentStatus": true,
"badgeEnabled": false,
"autoLinkWorkItems": false
},
"demands": [],
"conditions": [{
"name": "ReleaseStarted",
"conditionType": 1,
"value": ""
}],
"executionPolicy": {
"concurrencyCount": 1,
"queueDepthCount": 0
},
"schedules": [],
"retentionPolicy": {
"daysToKeep": 30,
"releasesToKeep": 3,
"retainBuild": true
},
"processParameters": {
"dataSourceBindings": [{
"dataSourceName": "AzureRMWebAppNamesByType",
"parameters": {
"WebAppKind": "$(WebAppKind)"
},
"endpointId": "$(ConnectedServiceName)",
"target": "WebAppName"
}]
},
"properties": {},
"preDeploymentGates": {
"id": 0,
"gatesOptions": null,
"gates": []
},
"postDeploymentGates": {
"id": 0,
"gatesOptions": null,
"gates": []
},
"badgeUrl": "https://sampleuser.vsrm.visualstudio.com/_apis/public/Release/badge/19749ef3-2f42-49b5-9696-
f28b49faebcb/1/2"
}],
"artifacts": [{
"sourceId": "19749ef3-2f42-49b5-9696-f28b49faebcb:a6c88f30-5e1f-4de8-b24d-279bb209d85f",
"type": "Git",
"type": "Git",
"alias": "Dev",
"definitionReference": {
"branches": {
"id": "adf_publish",
"name": "adf_publish"
},
"checkoutSubmodules": {
"id": "",
"name": ""
},
"defaultVersionSpecific": {
"id": "",
"name": ""
},
"defaultVersionType": {
"id": "latestFromBranchType",
"name": "Latest from default branch"
},
"definition": {
"id": "a6c88f30-5e1f-4de8-b24d-279bb209d85f",
"name": "Dev"
},
"fetchDepth": {
"id": "",
"name": ""
},
"gitLfsSupport": {
"id": "",
"name": ""
},
"project": {
"id": "19749ef3-2f42-49b5-9696-f28b49faebcb",
"name": "Prod"
}
},
"isPrimary": true
}],
"triggers": [{
"schedule": {
"jobId": "b5ef09b6-8dfd-4b91-8b48-0709e3e67b2d",
"timeZoneId": "UTC",
"startHours": 3,
"startMinutes": 0,
"daysToRelease": 31
},
"triggerType": 2
}],
"releaseNameFormat": "Release-$(rev:r)",
"url": "https://sampleuser.vsrm.visualstudio.com/19749ef3-2f42-49b5-9696-
f28b49faebcb/_apis/Release/definitions/1",
"_links": {
"self": {
"href": "https://sampleuser.vsrm.visualstudio.com/19749ef3-2f42-49b5-9696-
f28b49faebcb/_apis/Release/definitions/1"
},
"web": {
"href": "https://sampleuser.visualstudio.com/19749ef3-2f42-49b5-9696-f28b49faebcb/_release?definitionId=1"
}
},
"tags": [],
"properties": {
"DefinitionCreationSource": {
"$type": "System.String",
"$value": "ReleaseNew"
}
}
}
Sample script to stop and restart triggers and clean up
Here is a sample script to stop triggers before deployment and to restart triggers afterwards. The script also
includes code to delete resources that have been removed. To install the latest version of Azure PowerShell, see
Install Azure PowerShell on Windows with PowerShellGet.
param
(
[parameter(Mandatory = $false)] [String] $rootFolder,
[parameter(Mandatory = $false)] [String] $armTemplate,
[parameter(Mandatory = $false)] [String] $ResourceGroupName,
[parameter(Mandatory = $false)] [String] $DataFactoryName,
[parameter(Mandatory = $false)] [Bool] $predeployment=$true,
[parameter(Mandatory = $false)] [Bool] $deleteDeployment=$false
)
#Triggers
Write-Host "Getting triggers"
$triggersADF = Get-AzDataFactoryV2Trigger -DataFactoryName $DataFactoryName -ResourceGroupName
$ResourceGroupName
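# NOTE: This excerpt omits parts of the full sample. $resources (used below), $deletedpipelines,
# $deleteddataset, $deletedlinkedservices, $deletedintegrationruntimes, $deploymentsToDelete, and
# $deploymentName are assumed to be defined in the omitted portions; $resources is assumed to come
# from parsing the exported Resource Manager template ($armTemplate).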
$triggersTemplate = $resources | Where-Object { $_.type -eq "Microsoft.DataFactory/factories/triggers" }
$triggerNames = $triggersTemplate | ForEach-Object {$_.name.Substring(37, $_.name.Length-40)}
$activeTriggerNames = $triggersTemplate | Where-Object { $_.properties.runtimeState -eq "Started" -and
($_.properties.pipelines.Count -gt 0 -or $_.properties.pipeline.pipelineReference -ne $null)} | ForEach-Object
{$_.name.Substring(37, $_.name.Length-40)}
$deletedtriggers = $triggersADF | Where-Object { $triggerNames -notcontains $_.Name }
$triggerstostop = $triggerNames | where { ($triggersADF | Select-Object name).name -contains $_ }
#Delete resources
Write-Host "Deleting triggers"
$deletedtriggers | ForEach-Object {
Write-Host "Deleting trigger " $_.Name
$trig = Get-AzDataFactoryV2Trigger -name $_.Name -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName
if ($trig.RuntimeState -eq "Started") {
Stop-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName
-Name $_.Name -Force
}
Remove-AzDataFactoryV2Trigger -Name $_.Name -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName -Force
}
Write-Host "Deleting pipelines"
$deletedpipelines | ForEach-Object {
Write-Host "Deleting pipeline " $_.Name
Remove-AzDataFactoryV2Pipeline -Name $_.Name -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName -Force
}
Write-Host "Deleting datasets"
$deleteddataset | ForEach-Object {
Write-Host "Deleting dataset " $_.Name
Remove-AzDataFactoryV2Dataset -Name $_.Name -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName -Force
}
Write-Host "Deleting linked services"
$deletedlinkedservices | ForEach-Object {
Write-Host "Deleting Linked Service " $_.Name
Remove-AzDataFactoryV2LinkedService -Name $_.Name -ResourceGroupName $ResourceGroupName -
DataFactoryName $DataFactoryName -Force
}
Write-Host "Deleting integration runtimes"
$deletedintegrationruntimes | ForEach-Object {
Write-Host "Deleting integration runtime " $_.Name
Remove-AzDataFactoryV2IntegrationRuntime -Name $_.Name -ResourceGroupName $ResourceGroupName -
DataFactoryName $DataFactoryName -Force
}
$deploymentsToDelete | ForEach-Object {
Write-host "Deleting inner deployment: " $_.properties.targetResource.id
Remove-AzResourceGroupDeployment -Id $_.properties.targetResource.id
}
Write-Host "Deleting deployment: " $deploymentName
Remove-AzResourceGroupDeployment -ResourceGroupName $ResourceGroupName -Name $deploymentName
}
Explanation:
Pipelines
Any property in the path activities/typeProperties/waitTimeInSeconds is parameterized. This means that any
activity in a pipeline that has a code-level property named waitTimeInSeconds (for example, the Wait activity) is
parameterized as a number, with a default name. But, it won't have a default value in the Resource Manager
template. It will be a mandatory input during the Resource Manager deployment.
Similarly, a property called headers (for example, in a Web activity) is parameterized with type object
(JObject). It has a default value, which is the same value as in the source factory.
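If you maintain a custom parameterization template in your repository, the pipeline behavior described above can be expressed roughly as in the following fragment. This is an illustrative sketch of the parameter-definition syntax used later in this article, not the exact template Data Factory generates.
"Microsoft.DataFactory/factories/pipelines": {
    "properties": {
        "activities": [{
            "typeProperties": {
                "waitTimeInSeconds": "-::number",
                "headers": "=::object"
            }
        }]
    }
}
Here "-::number" means the property is parameterized as a number without keeping the source value as a default, and "=::object" keeps the source value as the default.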
IntegrationRuntimes
Only properties, and all properties, under the path typeProperties are parameterized, with their respective
default values. For example, as of today's schema, there are two properties under IntegrationRuntimes type
properties: computeProperties and ssisProperties . Both property types are created with their respective
default values and types (Object).
Triggers
Under typeProperties , two properties are parameterized. The first one is maxConcurrency , which is specified to
have a default value, and the type would be string . It has the default parameter name of
<entityName>_properties_typeProperties_maxConcurrency .
The recurrence property also is parameterized. Under it, all properties at that level are specified to be
parameterized as strings, with default values and parameter names. An exception is the interval property,
which is parameterized as number type, and with the parameter name suffixed with
<entityName>_properties_typeProperties_recurrence_triggerSuffix . Similarly, the freq property is a string and
is parameterized as a string. However, the freq property is parameterized without a default value. The name is
shortened and suffixed. For example, <entityName>_freq .
LinkedServices
Linked services are unique. Because linked services and datasets can potentially be of several types, you can
provide type-specific customization. For example, you might say that for all linked services of type
AzureDataLakeStore , a specific template will be applied, and for all others (via * ) a different template will be
applied.
In the preceding example, the connectionString property will be parameterized as a securestring value, it
won't have a default value, and it will have a shortened parameter name that's suffixed with connectionString .
The property secretAccessKey , however, happens to be an AzureKeyVaultSecret (for instance, an AmazonS3
linked service). Thus, it is automatically parameterized as an Azure Key Vault secret, and it's fetched from the key
vault that it's configured with in the source factory. You can also parameterize the key vault, itself.
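A sketch of what such type-specific customization might look like in the parameterization template: an entry for AzureDataLakeStore linked services plus the catch-all * entry. The exact property list shown is illustrative.
"Microsoft.DataFactory/factories/linkedServices": {
    "AzureDataLakeStore": {
        "properties": {
            "typeProperties": {
                "dataLakeStoreUri": "="
            }
        }
    },
    "*": {
        "properties": {
            "typeProperties": {
                "connectionString": "|:-connectionString:secureString"
            }
        }
    }
}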
Datasets
Even though type-specific customization is available for datasets, configuration can be provided without
explicitly having a *-level configuration. In the preceding example, all dataset properties under typeProperties
are parameterized.
The default parameterization template can change, but the following is the current template. Knowing the default template is useful if you only need to add one additional property as a parameter and don't want to lose the existing parameterizations or have to re-create them.
{
"Microsoft.DataFactory/factories/pipelines": {
},
"Microsoft.DataFactory/factories/integrationRuntimes":{
"properties": {
"typeProperties": {
"ssisProperties": {
"catalogInfo": {
"catalogServerEndpoint": "=",
"catalogAdminUserName": "=",
"catalogAdminPassword": {
"value": "-::secureString"
}
},
"customSetupScriptProperties": {
"customSetupScriptProperties": {
"sasToken": {
"value": "-::secureString"
}
}
},
"linkedInfo": {
"key": {
"value": "-::secureString"
},
"resourceId": "="
}
}
}
},
"Microsoft.DataFactory/factories/triggers": {
"properties": {
"pipelines": [{
"parameters": {
"*": "="
}
},
"pipelineReference.referenceName"
],
"pipeline": {
"parameters": {
"*": "="
}
},
"typeProperties": {
"scope": "="
}
}
},
"Microsoft.DataFactory/factories/linkedServices": {
"*": {
"properties": {
"typeProperties": {
"accountName": "=",
"username": "=",
"userName": "=",
"accessKeyId": "=",
"servicePrincipalId": "=",
"userId": "=",
"clientId": "=",
"clusterUserName": "=",
"clusterSshUserName": "=",
"hostSubscriptionId": "=",
"clusterResourceGroup": "=",
"subscriptionId": "=",
"resourceGroupName": "=",
"tenant": "=",
"dataLakeStoreUri": "=",
"baseUrl": "=",
"database": "=",
"serviceEndpoint": "=",
"batchUri": "=",
"databaseName": "=",
"systemNumber": "=",
"server": "=",
"url":"=",
"aadResourceId": "=",
"connectionString": "|:-connectionString:secureString"
}
}
},
"Odbc": {
"properties": {
"typeProperties": {
"typeProperties": {
"userName": "=",
"connectionString": {
"secretName": "="
}
}
}
}
},
"Microsoft.DataFactory/factories/datasets": {
"*": {
"properties": {
"typeProperties": {
"folderPath": "=",
"fileName": "="
}
}
}}
}
Example: Add a Databricks Interactive cluster ID (from a Databricks Linked Service) to the parameters file:
{
"Microsoft.DataFactory/factories/pipelines": {
},
"Microsoft.DataFactory/factories/integrationRuntimes":{
"properties": {
"typeProperties": {
"ssisProperties": {
"catalogInfo": {
"catalogServerEndpoint": "=",
"catalogAdminUserName": "=",
"catalogAdminPassword": {
"value": "-::secureString"
}
},
"customSetupScriptProperties": {
"sasToken": {
"value": "-::secureString"
}
}
},
"linkedInfo": {
"key": {
"value": "-::secureString"
},
"resourceId": "="
}
}
}
},
"Microsoft.DataFactory/factories/triggers": {
"properties": {
"pipelines": [{
"parameters": {
"*": "="
}
},
"pipelineReference.referenceName"
],
"pipeline": {
"parameters": {
"*": "="
}
},
"typeProperties": {
"scope": "="
}
}
}
},
"Microsoft.DataFactory/factories/linkedServices": {
"*": {
"properties": {
"typeProperties": {
"accountName": "=",
"username": "=",
"userName": "=",
"accessKeyId": "=",
"servicePrincipalId": "=",
"userId": "=",
"clientId": "=",
"clusterUserName": "=",
"clusterSshUserName": "=",
"hostSubscriptionId": "=",
"clusterResourceGroup": "=",
"subscriptionId": "=",
"resourceGroupName": "=",
"tenant": "=",
"dataLakeStoreUri": "=",
"baseUrl": "=",
"database": "=",
"serviceEndpoint": "=",
"batchUri": "=",
"databaseName": "=",
"systemNumber": "=",
"server": "=",
"url":"=",
"aadResourceId": "=",
"connectionString": "|:-connectionString:secureString",
"existingClusterId": "-"
}
}
},
"Odbc": {
"properties": {
"typeProperties": {
"userName": "=",
"connectionString": {
"secretName": "="
}
}
}
}
},
"Microsoft.DataFactory/factories/datasets": {
"*": {
"properties": {
"typeProperties": {
"folderPath": "=",
"fileName": "="
}
}
}}
}
The Linked Resource Manager templates usually have a master template and a set of child templates linked to the
master. The parent template is called ArmTemplate_master.json , and child templates are named with the pattern
ArmTemplate_0.json , ArmTemplate_1.json , and so on. To move over from using the full Resource Manager template
to using the linked templates, update your CI/CD task to point to ArmTemplate_master.json instead of pointing to
ArmTemplateForFactory.json (that is, the full Resource Manager template). Resource Manager also requires you to
upload the linked templates into a storage account so that they can be accessed by Azure during deployment. For
more info, see Deploying Linked ARM Templates with VSTS.
Remember to add the Data Factory scripts in your CI/CD pipeline before and after the deployment task.
If you don’t have Git configured, the linked templates are accessible via the Export ARM template gesture.
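After the linked templates are uploaded to a storage account, the release task deploys the master template from its blob URI instead of a local file. A minimal sketch, assuming the container is reachable with a SAS token; the account, container, and parameter file names are placeholders and may not match your export exactly.
$containerUri = "https://<account>.blob.core.windows.net/adf-arm-templates"
$sasToken = "?<sas-token>"
New-AzResourceGroupDeployment `
  -ResourceGroupName "adf-prod-rg" `
  -TemplateUri "$containerUri/ArmTemplate_master.json$sasToken" `
  -TemplateParameterFile ".\ArmTemplateParameters_master.json" `
  -Mode Incremental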
Unsupported features
You can't publish individual resources, because data factory entities depend on each other. For example, triggers depend on pipelines, and pipelines depend on datasets and other pipelines. Tracking changing dependencies is hard. If it were possible to select the resources to publish manually, you could pick only a subset of the entire set of changes, which would lead to unexpected behavior after publishing.
You can't publish from private branches.
You can't host projects on Bitbucket.
Iterative development and debugging with Azure
Data Factory
3/7/2019 • 2 minutes to read
Azure Data Factory lets you iteratively develop and debug Data Factory pipelines.
For an eight-minute introduction and demonstration of this feature, watch the following video:
View the results of your test runs in the Output window of the pipeline canvas.
After a test run succeeds, add more activities to your pipeline and continue debugging in an iterative manner. You
can also Cancel a test run while it is in progress.
When you do test runs, you don't have to publish your changes to the data factory before you select Debug. This
feature is helpful in scenarios where you want to make sure that the changes work as expected before you update
the data factory workflow.
IMPORTANT
Selecting Debug actually runs the pipeline. So, for example, if the pipeline contains copy activity, the test run copies data
from source to destination. As a result, we recommend that you use test folders in your copy activities and other activities
when debugging. After you've debugged the pipeline, switch to the actual folders that you want to use in normal operations.
To set a breakpoint, select an element on the pipeline canvas. A Debug Until option appears as an empty red circle
at the upper right corner of the element.
After you select the Debug Until option, it changes to a filled red circle to indicate the breakpoint is enabled.
Next steps
Continuous integration and deployment in Azure Data Factory
Copy data from Amazon Marketplace Web Service
using Azure Data Factory (Preview)
1/3/2019 • 3 minutes to read
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Amazon Marketplace
Web Service. It builds on the copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Amazon Marketplace Web Service to any supported sink data store. For a list of data
stores that are supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install any driver to use this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Amazon Marketplace Web Service connector.
Example:
{
"name": "AmazonMWSLinkedService",
"properties": {
"type": "AmazonMWS",
"typeProperties": {
"endpoint" : "mws.amazonservices.com",
"marketplaceID" : "A2EUQ1WTGCTBG2",
"sellerID" : "<sellerID>",
"mwsAuthToken": {
"type": "SecureString",
"value": "<mwsAuthToken>"
},
"accessKeyId" : "<accessKeyId>",
"secretKey": {
"type": "SecureString",
"value": "<secretKey>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Amazon Marketplace Web Service dataset.
To copy data from Amazon Marketplace Web Service, set the type property of the dataset to
AmazonMWSObject. The following properties are supported:
Example
{
"name": "AmazonMWSDataset",
"properties": {
"type": "AmazonMWSObject",
"linkedServiceName": {
"referenceName": "<AmazonMWS linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Orders where Amazon_Order_Id = 'xx'". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromAmazonMWS",
"type": "Copy",
"inputs": [
{
"referenceName": "<AmazonMWS input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AmazonMWSSource",
"query": "SELECT * FROM Orders where Amazon_Order_Id = 'xx'"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Amazon Redshift using Azure Data
Factory
3/14/2019 • 5 minutes to read
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Amazon Redshift. It builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from Amazon Redshift to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Amazon Redshift connector supports retrieving data from Redshift using query or built-in
Redshift UNLOAD support.
TIP
To achieve the best performance when copying large amounts of data from Redshift, consider using the built-in Redshift
UNLOAD through Amazon S3. See Use UNLOAD to copy data from Amazon Redshift section for details.
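When you use UNLOAD, the copy activity source also needs an S3 location for the interim data. The following source fragment is a sketch based on the property names this connector uses (redshiftUnloadSettings with an S3 linked service reference and bucket name); verify them against the current connector reference.
"source": {
    "type": "AmazonRedshiftSource",
    "query": "select * from MyTable",
    "redshiftUnloadSettings": {
        "s3LinkedServiceName": {
            "referenceName": "<Amazon S3 linked service name>",
            "type": "LinkedServiceReference"
        },
        "bucketName": "<S3 bucket for interim data>"
    }
}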
Prerequisites
If you are copying data to an on-premises data store by using a Self-hosted Integration Runtime, grant the Integration Runtime (use the IP address of the machine) access to the Amazon Redshift cluster. See Authorize access to the cluster for instructions.
If you are copying data to an Azure data store, see Azure Data Center IP Ranges for the Compute IP address
and SQL ranges used by the Azure data centers.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Amazon Redshift connector.
port: The number of the TCP port that the Amazon Redshift server uses to listen for client connections. Required: No, default is 5439.
Example:
{
"name": "AmazonRedshiftLinkedService",
"properties":
{
"type": "AmazonRedshift",
"typeProperties":
{
"server": "<server name>",
"database": "<database name>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Amazon Redshift dataset.
To copy data from Amazon Redshift, set the type property of the dataset to RelationalTable. The following
properties are supported:
tableName: Name of the table in Amazon Redshift. Required: No (if "query" in activity source is specified).
Example
{
"name": "AmazonRedshiftDataset",
"properties":
{
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<Amazon Redshift linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom query to read data. For example: select * from MyTable. Required: No (if "tableName" in dataset is specified).
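A copy activity that reads from Amazon Redshift with a query source follows the same pattern as the other connector examples in this article; the dataset names below are placeholders.
"activities":[
    {
        "name": "CopyFromAmazonRedshift",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Amazon Redshift input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "AmazonRedshiftSource",
                "query": "select * from MyTable"
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]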
Learn more about how to use UNLOAD to copy data from Amazon Redshift efficiently in the next section.
AMAZON REDSHIFT DATA TYPE: DATA FACTORY INTERIM DATA TYPE
BIGINT: Int64
BOOLEAN: String
CHAR: String
DATE: DateTime
DECIMAL: Decimal
INTEGER: Int32
REAL: Single
SMALLINT: Int16
TEXT: String
TIMESTAMP: DateTime
VARCHAR: String
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Amazon Simple Storage Service
using Azure Data Factory
5/10/2019 • 11 minutes to read
This article outlines how to copy data from Amazon Simple Storage Service (Amazon S3). To learn about Azure
Data Factory, read the introductory article.
Supported capabilities
This Amazon S3 connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Specifically, this Amazon S3 connector supports copying files as-is or parsing files with the supported file
formats and compression codecs. It uses AWS Signature Version 4 to authenticate requests to S3.
TIP
You can use this Amazon S3 connector to copy data from any S3-compatible storage provider, for example, Google Cloud Storage. Specify the corresponding service URL in the linked service configuration.
Required permissions
To copy data from Amazon S3, make sure you have been granted the following permissions:
For copy activity execution: s3:GetObject and s3:GetObjectVersion for Amazon S3 Object Operations.
For Data Factory GUI authoring: s3:ListAllMyBuckets and s3:ListBucket / s3:GetBucketLocation for Amazon S3 Bucket Operations permissions are additionally required for operations like test connection and browse/navigate file paths. If you don't want to grant these permissions, skip test connection on the linked service creation page and specify the path directly in dataset settings.
For details about the full list of Amazon S3 permissions, see Specifying Permissions in a Policy.
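A minimal IAM policy sketch that grants the permissions listed above. The bucket name is a placeholder; scope the resources more tightly for production use.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:GetObjectVersion"],
            "Resource": "arn:aws:s3:::<your-bucket>/*"
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": "arn:aws:s3:::<your-bucket>"
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListAllMyBuckets"],
            "Resource": "*"
        }
    ]
}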
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Amazon S3.
TIP
Specify the custom S3 service URL if you are copying data from an S3-compatible storage service other than the official Amazon S3 service.
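For example, a linked service pointing at an S3-compatible endpoint might look like the following sketch. The serviceUrl property is how the connector exposes a custom endpoint at the time of writing; verify the property name in the connector reference, and treat the endpoint shown as a placeholder.
{
    "name": "S3CompatibleLinkedService",
    "properties": {
        "type": "AmazonS3",
        "typeProperties": {
            "serviceUrl": "https://storage.googleapis.com",
            "accessKeyId": "<access key id>",
            "secretAccessKey": {
                "type": "SecureString",
                "value": "<secret access key>"
            }
        }
    }
}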
NOTE
This connector requires access keys for an IAM account to copy data from Amazon S3. Temporary security credentials are not supported.
Here is an example:
{
"name": "AmazonS3LinkedService",
"properties": {
"type": "AmazonS3",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": {
"type": "SecureString",
"value": "<secret access key>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data from Amazon S3 in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based dataset and supported settings. The following properties are supported for
Amazon S3 under location settings in format-based dataset:
NOTE
The AmazonS3Object type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the Copy/Lookup/GetMetadata activity for backward compatibility, but it doesn't work with Mapping Data Flow. We suggest that you use the new model going forward; the ADF authoring UI has switched to generating these new types.
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Amazon S3 linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AmazonS3Location",
"bucketName": "bucketname",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
bucketName: The S3 bucket name. Wildcard filter is not supported. Required: Yes for Copy/Lookup activity, No for GetMetadata activity.
format: If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. Required: No (only for binary copy scenario).
{
"name": "AmazonS3Dataset",
"properties": {
"type": "AmazonS3Object",
"linkedServiceName": {
"referenceName": "<Amazon S3 linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"bucketName": "testbucket",
"prefix": "testFolder/test",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
{
"name": "AmazonS3Dataset",
"properties": {
"type": "AmazonS3",
"linkedServiceName": {
"referenceName": "<Amazon S3 linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"bucketName": "testbucket",
"key": "testFolder/testfile.csv.gz",
"version": "XXXXXXXXXczm0CJajYkHf0_k6LhBmkcL",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
Copy activity properties
For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Amazon S3 source.
Amazon S3 as source
For copy from Parquet and delimited text format, refer to Parquet and delimited text format source
section.
For copy from other formats like ORC/Avro/JSON/Binary format, refer to Other format source section.
Parquet and delimited text format source
To copy data from Amazon S3 in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based copy activity source and supported settings. The following properties are
supported for Amazon S3 under storeSettings settings in format-based copy source:
wildcardFileName: The file name with wildcard characters under the given bucket + folderPath/wildcardFolderPath to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has a wildcard or this escape char inside. See more examples in Folder and file filter examples. Required: Yes if fileName in dataset and prefix are not specified.
NOTE
For Parquet/delimited text format, the FileSystemSource type copy activity source mentioned in the next section is still supported as-is for backward compatibility. We suggest that you use the new model going forward; the ADF authoring UI has switched to generating these new types.
Example:
"activities":[
{
"name": "CopyFromAmazonS3",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "AmazonS3ReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example:
"activities":[
{
"name": "CopyFromAmazonS3",
"type": "Copy",
"inputs": [
{
"referenceName": "<Amazon S3 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores that are supported as sources and sinks by the copy activity in Azure Data Factory, see
supported data stores.
Copy data to or from Azure Blob storage by using
Azure Data Factory
5/6/2019 • 22 minutes to read
This article outlines how to copy data to and from Azure Blob storage. To learn about Azure Data Factory, read
the introductory article.
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which
will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.
Supported capabilities
This Azure Blob connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
GetMetadata activity
Specifically, this Blob storage connector supports:
Copying blobs to and from general-purpose Azure storage accounts and hot/cool blob storage.
Copying blobs by using account key, service shared access signature, service principal, or managed identities for Azure resources authentication.
Copying blobs from block, append, or page blobs and copying data to only block blobs.
Copying blobs as is or parsing or generating blobs with supported file formats and compression codecs.
NOTE
If you enable the "Allow trusted Microsoft services to access this storage account" option on Azure Storage firewall
settings, using Azure Integration Runtime to connect to Blob storage will fail with a forbidden error, as ADF is not treated
as a trusted Microsoft service. Please connect via a Self-hosted Integration Runtime instead.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Blob storage.
NOTE
HDInsight, Azure Machine Learning, and Azure SQL Data Warehouse PolyBase load only support Azure Blob storage
account key authentication.
NOTE
If you were using the "AzureStorage" type linked service, it is still supported as-is, but we recommend that you use the
new "AzureBlobStorage" linked service type going forward.
Example:
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;"
},
"accountKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
NOTE
Data Factory now supports both service shared access signatures and account shared access signatures. For more
information about these two types and how to construct them, see Types of shared access signatures.
In the later dataset configuration, the folder path is the absolute path starting at the container level. You need to configure
a path that's aligned with the path in your SAS URI.
TIP
To generate a service shared access signature for your storage account, you can execute the following PowerShell
commands. Replace the placeholders and grant the needed permission.
$context = New-AzStorageContext -StorageAccountName <accountName> -StorageAccountKey <accountKey>
New-AzStorageContainerSASToken -Name <containerName> -Context $context -Permission rwdl -StartTime <startTime> -ExpiryTime <endTime> -FullUri
To use shared access signature authentication, the following properties are supported:
NOTE
If you were using the "AzureStorage" type linked service, it is still supported as-is, but we recommend that you use the
new "AzureBlobStorage" linked service type going forward.
Example:
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the Azure Storage resource e.g.
https://<container>.blob.core.windows.net/?sv=<storage version>&st=<start time>&se=<expire
time>&sr=<resource>&sp=<permissions>&sip=<ip range>&spr=<protocol>&sig=<signature>>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the Azure Storage resource without token e.g.
https://<container>.blob.core.windows.net/>"
},
"sasToken": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
When you create a shared access signature URI, consider the following points:
Set appropriate read/write permissions on objects based on how the linked service is used in your data factory
(read, write, or read/write).
Set the expiry time appropriately. Make sure that the access to Storage objects doesn't expire within the active
period of the pipeline.
The URI should be created at the right container or blob level based on your needs. A shared access signature URI to a
blob allows Data Factory to access that particular blob. A shared access signature URI to a Blob storage
container allows Data Factory to iterate through blobs in that container. To provide access to more or fewer
objects later, or to update the shared access signature URI, remember to update the linked service with the
new URI.
Service principal authentication
For Azure Storage service principal authentication in general, refer to Authenticate access to Azure Storage using
Azure Active Directory.
To use service principal authentication, follow these steps:
1. Register an application entity in Azure Active Directory (Azure AD) by following Register your application
with an Azure AD tenant. Make note of the following values, which you use to define the linked service:
Application ID
Application key
Tenant ID
2. Grant the service principal proper permission in Azure Blob storage. Refer to Manage access rights to
Azure Storage data with RBAC for more details on the roles.
As source, in Access control (IAM), grant at least the Storage Blob Data Reader role.
As sink, in Access control (IAM), grant at least the Storage Blob Data Contributor role.
These properties are supported for an Azure Blob storage linked service:
NOTE
Service principal authentication is supported only by the "AzureBlobStorage" type linked service, not by the previous
"AzureStorage" type linked service.
Example:
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"serviceEndpoint": "https://<accountName>.blob.core.windows.net/",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example:
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"serviceEndpoint": "https://<accountName>.blob.core.windows.net/"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data to and from Blob storage in Parquet or delimited text format, refer to the Parquet format and Delimited
text format articles on the format-based dataset and supported settings. The following properties are supported for
Azure Blob under location settings in a format-based dataset:
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
folderPath: Path to the container and folder in the blob storage. Example: myblobcontainer/myblobfolder/; see
more examples in Folder and file filter examples. Required: yes for the Copy/Lookup activity, no for the
GetMetadata activity.
format: If you want to copy files as-is between file-based stores (binary copy), skip the format section in both the
input and output dataset definitions. Required: no (only for binary copy scenario).
TIP
To copy all blobs under a folder, specify folderPath only.
To copy a single blob with a given name, specify folderPath with folder part and fileName with file name.
To copy a subset of blobs under a folder, specify folderPath with folder part and fileName with wildcard filter.
Example:
{
"name": "AzureBlobDataset",
"properties": {
"type": "AzureBlob",
"linkedServiceName": {
"referenceName": "<Azure Blob storage linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
wildcardFileName: The file name with wildcard characters under the given container + folderPath/wildcardFolderPath
to filter source files. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or a single
character); use ^ to escape if your actual folder name has a wildcard or this escape character inside. See more
examples in Folder and file filter examples. Required: yes if fileName is not specified in the dataset.
Example:
"activities":[
{
"name": "CopyFromBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "AzureBlobStorageReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example:
"activities":[
{
"name": "CopyFromBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure Blob input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
NOTE
For Parquet/delimited text format, the BlobSink type copy activity sink mentioned in the next section is still supported as-is
for backward compatibility. We recommend that you use this new model going forward; the ADF authoring UI has switched
to generating these new types.
Example:
"activities":[
{
"name": "CopyFromBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "AzureBlobStorageWriteSetting",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]
Example:
"activities":[
{
"name": "CopyToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Blob output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "BlobSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data to or from Azure Cosmos DB (SQL API)
by using Azure Data Factory
2/6/2019 • 8 minutes to read • Edit Online
This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure Cosmos DB
(SQL API). The article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy
Activity.
NOTE
This connector only supports copying data to/from the Cosmos DB SQL API. For the MongoDB API, refer to the connector for
Azure Cosmos DB's API for MongoDB. Other API types are not supported at this time.
Supported capabilities
You can copy data from Azure Cosmos DB (SQL API) to any supported sink data store, or copy data from any
supported source data store to Azure Cosmos DB (SQL API). For a list of data stores that Copy Activity supports
as sources and sinks, see Supported data stores and formats.
You can use the Azure Cosmos DB (SQL API) connector to:
Copy data from and to the Azure Cosmos DB SQL API.
Write to Azure Cosmos DB as insert or upsert.
Import and export JSON documents as-is, or copy data from or to a tabular dataset. Examples include a SQL
database and a CSV file. To copy documents as-is to or from JSON files or to or from another Azure Cosmos
DB collection, see Import or export JSON documents.
Data Factory integrates with the Azure Cosmos DB bulk executor library to provide the best performance when
you write to Azure Cosmos DB.
TIP
The Data Migration video walks you through the steps of copying data from Azure Blob storage to Azure Cosmos DB. The
video also describes performance-tuning considerations for ingesting data to Azure Cosmos DB in general.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to Azure Cosmos DB (SQL API).
Example
{
"name": "CosmosDbSQLAPILinkedService",
"properties": {
"type": "CosmosDb",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
This section provides a list of properties that the Azure Cosmos DB (SQL API) dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
To copy data from or to Azure Cosmos DB (SQL API), set the type property of the dataset to
DocumentDbCollection. The following properties are supported:
Example
{
"name": "CosmosDbSQLAPIDataset",
"properties": {
"type": "DocumentDbCollection",
"linkedServiceName":{
"referenceName": "<Azure Cosmos DB linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"collectionName": "<collection name>"
}
}
}
Example
"activities":[
{
"name": "CopyFromCosmosDBSQLAPI",
"type": "Copy",
"inputs": [
{
"referenceName": "<Cosmos DB SQL API input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": "SELECT c.BusinessEntityID, c.Name.First AS FirstName, c.Name.Middle AS MiddleName,
c.Name.Last AS LastName, c.Suffix, c.EmailPromotion FROM c WHERE c.ModifiedDate > \"2009-01-01T00:00:00\""
},
"sink": {
"type": "<sink type>"
}
}
}
]
TIP
Cosmos DB limits a single request's size to 2 MB. The formula is Request Size = Single Document Size * Write Batch Size. If
you hit an error saying "Request size is too large.", reduce the writeBatchSize value in the copy sink configuration.
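For example, a sink configuration with a reduced batch size might look like the following sketch; the value 100 is
illustrative and should be tuned to your document size:
"sink": {
    "type": "DocumentDbCollectionSink",
    "writeBehavior": "upsert",
    "writeBatchSize": 100
}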
Example
"activities":[
{
"name": "CopyToCosmosDBSQLAPI",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Document DB output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "DocumentDbCollectionSink",
"writeBehavior": "upsert"
}
}
}
]
Import or export JSON documents
You can use this Azure Cosmos DB (SQL API) connector to easily:
Import JSON documents from various sources to Azure Cosmos DB, including from Azure Blob storage,
Azure Data Lake Store, and other file-based stores that Azure Data Factory supports.
Export JSON documents from an Azure Cosmos DB collection to various file-based stores.
Copy documents between two Azure Cosmos DB collections as-is.
To achieve schema-agnostic copy:
When you use the Copy Data tool, select the Export as-is to JSON files or Cosmos DB collection option.
When you use activity authoring, don't specify the structure (also called schema) section in the Azure Cosmos
DB dataset. Also, don't specify the nestingSeparator property in the Azure Cosmos DB source or sink in
Copy Activity. When you import from or export to JSON files, in the corresponding file store dataset, specify
the format type as JsonFormat and configure the filePattern as described in the JSON format section. Then,
don't specify the structure section and skip the rest of the format settings.
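As an illustration, a file store dataset used for as-is import or export of JSON documents might carry a format section
like the following sketch (the dataset name, container, and folder are placeholders, and no structure section is defined):
{
    "name": "JsonFilesDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "<Azure Blob storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "mycontainer/myfolder",
            "format": {
                "type": "JsonFormat",
                "filePattern": "setOfObjects"
            }
        }
    }
}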
Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see supported
data stores.
Copy data to or from Azure Cosmos DB's API for
MongoDB by using Azure Data Factory
2/6/2019 • 6 minutes to read • Edit Online
This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure Cosmos DB's
API for MongoDB. The article builds on Copy Activity in Azure Data Factory, which presents a general overview of
Copy Activity.
NOTE
This connector only supports copying data to/from Azure Cosmos DB's API for MongoDB. For the SQL API, refer to the Cosmos
DB SQL API connector. Other API types are not supported at this time.
Supported capabilities
You can copy data from Azure Cosmos DB's API for MongoDB to any supported sink data store, or copy data
from any supported source data store to Azure Cosmos DB's API for MongoDB. For a list of data stores that Copy
Activity supports as sources and sinks, see Supported data stores and formats.
You can use the Azure Cosmos DB's API for MongoDB connector to:
Copy data from and to the Azure Cosmos DB's API for MongoDB.
Write to Azure Cosmos DB as insert or upsert.
Import and export JSON documents as-is, or copy data from or to a tabular dataset. Examples include a SQL
database and a CSV file. To copy documents as-is to or from JSON files or to or from another Azure Cosmos
DB collection, see Import or export JSON documents.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are specific
to Azure Cosmos DB's API for MongoDB.
Example
{
"name": "CosmosDbMongoDBAPILinkedService",
"properties": {
"type": "CosmosDbMongoDbApi",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "mongodb://<cosmosdb-name>:<password>@<cosmosdb-name>.documents.azure.com:10255/?
ssl=true&replicaSet=globaldb"
},
"database": "myDatabase"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following properties are supported for Azure Cosmos DB's API for MongoDB dataset:
PROPERTY DESCRIPTION REQUIRED
Example
{
"name": "CosmosDbMongoDBAPIDataset",
"properties": {
"type": "CosmosDbMongoDbApiCollection",
"linkedServiceName":{
"referenceName": "<Azure Cosmos DB's API for MongoDB linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"collectionName": "<collection name>"
}
}
}
TIP
ADF supports consuming BSON documents in Strict mode. Make sure your filter query is in Strict mode instead of Shell
mode. More details can be found in the MongoDB manual.
Example
"activities":[
{
"name": "CopyFromCosmosDBMongoDBAPI",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure Cosmos DB's API for MongoDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "CosmosDbMongoDbApiSource",
"filter": "{datetimeData: {$gte: ISODate(\"2018-12-11T00:00:00.000Z\"),$lt: ISODate(\"2018-12-
12T00:00:00.000Z\")}, _id: ObjectId(\"5acd7c3d0000000000000000\") }",
"cursorMethods": {
"project": "{ _id : 1, name : 1, age: 1, datetimeData: 1 }",
"sort": "{ age : 1 }",
"skip": 3,
"limit": 3
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example
"activities":[
{
"name": "CopyToCosmosDBMongoDBAPI",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Document DB output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "CosmosDbMongoDbApiSink",
"writeBehavior": "upsert"
}
}
}
]
TIP
To import JSON documents as-is, refer to the Import or export JSON documents section; to copy from tabular-shaped data,
refer to Schema mapping.
Schema mapping
To copy data from Azure Cosmos DB's API for MongoDB to a tabular sink, or the reverse, refer to schema mapping.
Specifically for writing into Cosmos DB, you need to make sure you populate Cosmos DB with the right object ID from
your source data. For example, if you have an "id" column in a SQL database table and want to use its value as the
document ID in MongoDB for insert/upsert, you need to set the proper schema mapping according to the MongoDB
strict mode definition ( _id.$oid ), as the following shows:
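A minimal sketch of such a mapping in the copy activity's translator, assuming a source column named id (the column
name and the exact translator schema are illustrative and may differ by service version):
"translator": {
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "id" },
            "sink": { "path": "_id.$oid" }
        }
    ]
}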
After the copy activity runs, a BSON ObjectId like the following is generated in the sink:
{
"_id": ObjectId("592e07800000000000000000")
}
Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see supported data
stores.
Copy data to or from Azure Data Explorer using
Azure Data Factory
4/18/2019 • 5 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data to or from Azure Data
Explorer. It builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from any supported source data store to Azure Data Explorer. You can also copy data from
Azure Data Explorer to any supported sink data store. For a list of data stores that are supported as sources or
sinks by the copy activity, see the Supported data stores table.
NOTE
Copying data to/from Azure Data Explorer from/to an on-premises data store by using the Self-hosted Integration Runtime
is supported since version 3.14.
Getting started
TIP
For a walkthrough of using Azure Data Explorer connector, see Copy data to/from Azure Data Explorer using Azure Data
Factory.
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Data Explorer connector.
NOTE
When you use the ADF UI to author, listing databases on the linked service or listing tables on the dataset may require a
higher-privileged permission granted to the service principal. Alternatively, you can manually enter the database name and
table name. Copy activity execution works as long as the service principal is granted the proper permission to
read/write data.
The following properties are supported for Azure Data Explorer linked service:
{
"name": "AzureDataExplorerLinkedService",
"properties": {
"type": "AzureDataExplorer",
"typeProperties": {
"endpoint": "https://<clusterName>.<regionName>.kusto.windows.net ",
"database": "<database name>",
"tenant": "<tenant name/id e.g. microsoft.onmicrosoft.com>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties that are supported by the Azure Data Explorer dataset.
To copy data to Azure Data Explorer, set the type property of the dataset to AzureDataExplorerTable.
The following properties are supported:
table: The name of the table that the linked service refers to. Required: yes for sink, no for source.
NOTE
The Azure Data Explorer source by default has a size limit of 500,000 records or 64 MB. To retrieve all the records without
truncation, you can specify set notruncation; at the beginning of your query. Refer to Query limits for more details.
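For instance, a source query that disables result truncation might look like the following sketch (the table name
TestTable1 is illustrative):
"source": {
    "type": "AzureDataExplorerSource",
    "query": "set notruncation;\nTestTable1",
    "queryTimeout": "00:10:00"
}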
Example:
"activities":[
{
"name": "CopyFromAzureDataExplorer",
"type": "Copy",
"typeProperties": {
"source": {
"type": "AzureDataExplorerSource",
"query": "TestTable1 | take 10",
"queryTimeout": "00:10:00"
},
"sink": {
"type": "<sink type>"
}
},
"inputs": [
{
"referenceName": "<Azure Data Explorer input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
]
}
]
Example:
"activities":[
{
"name": "CopyToAzureDataExplorer",
"type": "Copy",
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureDataExplorerSink",
"ingestionMappingName": "<optional Azure Data Explorer mapping name>"
}
},
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Data Explorer output dataset name>",
"type": "DatasetReference"
}
]
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see
supported data stores.
Learn more about Copy data from Azure Data Factory to Azure Data Explorer.
Copy data to or from Azure Data Lake Storage
Gen1 by using Azure Data Factory
5/13/2019 • 19 minutes to read • Edit Online
This article outlines how to copy data to and from Azure Data Lake Storage Gen1 (ADLS Gen1). To learn about
Azure Data Factory, read the introductory article.
Supported capabilities
This Azure Data Lake Storage Gen1 connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
GetMetadata activity
Specifically, this connector supports:
Copying files by using one of the following methods of authentication: service principal or managed
identities for Azure resources.
Copying files as-is, or parsing or generating files with the supported file formats and compression codecs.
IMPORTANT
If you copy data using the self-hosted integration runtime, configure the corporate firewall to allow outbound traffic to
<ADLS account name>.azuredatalakestore.net and login.microsoftonline.com/<tenant>/oauth2/token on port
443. The latter is the Azure Security Token Service that the integration runtime needs to communicate with to get the
access token.
Get started
TIP
For a walkthrough of using the Azure Data Lake Store connector, see Load data into Azure Data Lake Store.
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Data Lake Store.
Linked service properties
The following properties are supported for the Azure Data Lake Store linked service:
NOTE
To list folders starting from the root, you must grant the service principal "Execute" permission at the root level. This is
true when you use the:
Copy Data Tool to author copy pipeline.
Data Factory UI to test connection and navigate folders during authoring. If you have concerns about granting
permission at the root level, you can skip the test connection and enter the path manually during authoring. Copy activity
will still work as long as the service principal is granted the proper permission on the files to be copied.
Example:
{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"subscriptionId": "<subscription of ADLS>",
"resourceGroupName": "<resource group of ADLS>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
IMPORTANT
Make sure you grant the data factory managed identity proper permission in Data Lake Store:
As source: In Data explorer > Access, grant at least Read + Execute permission to list and copy the files in folders
and subfolders. Or, you can grant Read permission to copy a single file. You can choose to add to This folder and all
children for recursive, and add as an access permission and a default permission entry. There's no requirement on
account level access control (IAM).
As sink: In Data explorer > Access, grant at least Write + Execute permission to create child items in the folder. You
can choose to add to This folder and all children for recursive, and add as an access permission and a default
permission entry. If you use Azure integration runtime to copy (both source and sink are in the cloud), in IAM, grant
at least the Reader role in order to let Data Factory detect the region for Data Lake Store. If you want to avoid this
IAM role, explicitly create an Azure integration runtime with the location of Data Lake Store. Associate them in the Data
Lake Store linked service as the following example.
NOTE
To list folders starting from the root, you must grant the managed identity "Execute" permission at the root level. This is
true when you use the:
Copy Data Tool to author copy pipeline.
Data Factory UI to test connection and navigate folders during authoring. If you have concerns about granting
permission at the root level, you can skip the test connection and enter the path manually during authoring. Copy activity
will still work as long as the managed identity is granted the proper permission on the files to be copied.
In Azure Data Factory, you don't need to specify any properties besides the general Data Lake Store information
in the linked service.
Example:
{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"subscriptionId": "<subscription of ADLS>",
"resourceGroupName": "<resource group of ADLS>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data to and from ADLS Gen1 in Parquet or delimited text format, refer to the Parquet format and
Delimited text format articles on the format-based dataset and supported settings. The following properties are
supported for ADLS Gen1 under location settings in a format-based dataset:
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<ADLS Gen1 linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AzureDataLakeStoreLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
format: If you want to copy files as-is between file-based stores (binary copy), skip the format section in both the
input and output dataset definitions. Required: no (only for binary copy scenario).
TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a particular name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.
Example:
{
"name": "ADLSDataset",
"properties": {
"type": "AzureDataLakeStoreFile",
"linkedServiceName":{
"referenceName": "<ADLS linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "datalake/myfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
Copy Activity properties
For a full list of sections and properties available for defining activities, see Pipelines. This section provides a list
of properties supported by Azure Data Lake Store source and sink.
Azure Data Lake Store as source
For copy from Parquet and delimited text format, refer to Parquet and delimited text format source
section.
For copy from other formats like ORC/Avro/JSON/Binary format, refer to Other format source section.
Parquet and delimited text format source
To copy data from ADLS Gen1 in Parquet or delimited text format, refer to Parquet format and Delimited
text format article on format-based copy activity source and supported settings. The following properties are
supported for ADLS Gen1 under storeSettings settings in format-based copy source:
wildcardFileName: The file name with wildcard characters under the given folderPath/wildcardFolderPath to filter
source files. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or a single character);
use ^ to escape if your actual folder name has a wildcard or this escape character inside. See more examples in
Folder and file filter examples. Required: yes if fileName is not specified in the dataset.
NOTE
For Parquet/delimited text format, the AzureDataLakeStoreSource type copy activity source mentioned in the next section is
still supported as-is for backward compatibility. We recommend that you use this new model going forward; the ADF
authoring UI has switched to generating these new types.
Example:
"activities":[
{
"name": "CopyFromADLSGen1",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "AzureDataLakeStoreReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
"activities":[
{
"name": "CopyFromADLSGen1",
"type": "Copy",
"inputs": [
{
"referenceName": "<ADLS Gen1 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureDataLakeStoreSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
NOTE
For Parquet/delimited text format, the AzureDataLakeStoreSink type copy activity sink mentioned in the next section is still
supported as-is for backward compatibility. We recommend that you use this new model going forward; the ADF
authoring UI has switched to generating these new types.
Example:
"activities":[
{
"name": "CopyToADLSGen1",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "AzureDataLakeStoreWriteSetting",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]
Example:
"activities":[
{
"name": "CopyToADLSGen1",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<ADLS Gen1 output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureDataLakeStoreSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by Copy Activity in Azure Data Factory, see supported
data stores.
Copy data to or from Azure Data Lake Storage
Gen2 using Azure Data Factory
5/24/2019 • 22 minutes to read • Edit Online
Azure Data Lake Storage Gen2 (ADLS Gen2) is a set of capabilities dedicated to big data analytics, built into
Azure Blob storage. It allows you to interface with your data using both file system and object storage
paradigms.
This article outlines how to copy data to and from Azure Data Lake Storage Gen2. To learn about Azure Data
Factory, read the introductory article.
Supported capabilities
This Azure Data Lake Storage Gen2 connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
GetMetadata activity
Specifically, this connector supports:
Copying data by using account key, service principal, or managed identities for Azure resources
authentication.
Copying files as-is or parsing or generating files with supported file formats and compression codecs.
TIP
If you enable the hierarchical namespace, there is currently no interoperability of operations between the Blob and ADLS Gen2
APIs. If you hit the error "ErrorCode=FilesystemNotFound" with the detailed message "The specified filesystem does
not exist.", it's caused by the specified sink file system having been created via the Blob API instead of the ADLS Gen2 API
elsewhere. To fix the issue, specify a new file system with a name that does not exist as the name of a Blob container, and
ADF will automatically create that file system during the data copy.
NOTE
If you enable the "Allow trusted Microsoft services to access this storage account" option on Azure Storage firewall settings,
using the Azure Integration Runtime to connect to Data Lake Storage Gen2 will fail with a forbidden error, as ADF is not
treated as a trusted Microsoft service. Please connect via a Self-hosted Integration Runtime instead.
Get started
TIP
For a walkthrough of using Data Lake Storage Gen2 connector, see Load data into Azure Data Lake Storage Gen2.
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Data Lake Storage Gen2.
Example:
{
"name": "AzureDataLakeStorageGen2LinkedService",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://<accountname>.dfs.core.windows.net",
"accountkey": {
"type": "SecureString",
"value": "<accountkey>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
NOTE
To list folders starting from the account level or to test the connection, you need to grant the service principal "Execute"
permission on the storage account in IAM. This is true when you use the:
Copy Data Tool to author copy pipeline.
Data Factory UI to test connection and navigate folders during authoring. If you have concerns about granting
permission at the account level, you can skip the test connection and enter the path manually during authoring. Copy
activity will still work as long as the service principal is granted the proper permission on the files to be copied.
Example:
{
"name": "AzureDataLakeStorageGen2LinkedService",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://<accountname>.dfs.core.windows.net",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
NOTE
To list folders starting from the account level or to test the connection, you need to grant the managed identity "Execute"
permission on the storage account in IAM. This is true when you use the:
Copy Data Tool to author copy pipeline.
Data Factory UI to test connection and navigate folders during authoring. If you have concerns about granting
permission at the account level, you can skip the test connection and enter the path manually during authoring. Copy
activity will still work as long as the managed identity is granted the proper permission on the files to be copied.
IMPORTANT
If you use PolyBase to load data from ADLS Gen2 into SQL DW, when using ADLS Gen2 managed identity authentication,
make sure you also follow the steps #1 and #2 in this guidance to register your SQL Database server with Azure Active
Directory (AAD) and assign Storage Blob Data Contributor RBAC role to your SQL Database server; the rest will be handled
by ADF. If your ADLS Gen2 is configured with VNet service endpoint, to use PolyBase to load data from it, you must use
managed identity authentication.
Example:
{
"name": "AzureDataLakeStorageGen2LinkedService",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://<accountname>.dfs.core.windows.net",
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data to and from ADLS Gen2 in Parquet or delimited text format, refer to the Parquet format and
Delimited text format articles on the format-based dataset and supported settings. The following properties are
supported for ADLS Gen2 under location settings in a format-based dataset:
NOTE
The AzureBlobFSFile type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the
Copy/Lookup/GetMetadata activity for backward compatibility, but it doesn't work with Mapping Data Flow. We recommend
that you use this new model going forward; the ADF authoring UI has switched to generating these new types.
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<ADLS Gen2 linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobFSLocation",
"fileSystem": "filesystemname",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
format: If you want to copy files as-is between file-based stores (binary copy), skip the format section in both the
input and output dataset definitions. Required: no (only for binary copy scenario).
TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.
Example:
{
"name": "ADLSGen2Dataset",
"properties": {
"type": "AzureBlobFSFile",
"linkedServiceName": {
"referenceName": "<Azure Data Lake Storage Gen2 linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "myfilesystem/myfolder",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
wildcardFileName: The file name with wildcard characters under the given file system + folderPath/wildcardFolderPath
to filter source files. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or a single
character); use ^ to escape if your actual folder name has a wildcard or this escape character inside. See more
examples in Folder and file filter examples. Required: yes if fileName is not specified in the dataset.
Example:
"activities":[
{
"name": "CopyFromADLSGen2",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "AzureBlobFSReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example:
"activities":[
{
"name": "CopyFromADLSGen2",
"type": "Copy",
"inputs": [
{
"referenceName": "<ADLS Gen2 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureBlobFSSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
NOTE
For Parquet/delimited text format, the AzureBlobFSSink type copy activity sink mentioned in the next section is still supported
as-is for backward compatibility. We recommend that you use this new model going forward; the ADF authoring UI has
switched to generating these new types.
Example:
"activities":[
{
"name": "CopyToADLSGen2",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "AzureBlobFSWriteSetting",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]
Example:
"activities":[
{
"name": "CopyToADLSGen2",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<ADLS Gen2 output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureBlobFSSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]
When copying files from Azure Data Lake Storage (ADLS) Gen1 to Gen2, you can choose to preserve the POSIX
access control lists (ACLs) along with the data. For details on access control, refer to Access control in Azure Data
Lake Storage Gen1 and Access control in Azure Data Lake Storage Gen2.
The following types of ACLs can be preserved by using the Azure Data Factory Copy activity; you can select one or
more types:
ACL: Copy and preserve POSIX access control lists on files and directories. It will copy the full existing
ACLs from source to sink.
Owner: Copy and preserve the owning user of files and directories. Super-user access to sink ADLS Gen2 is
required.
Group: Copy and preserve the owning group of files and directories. Super-user access to sink ADLS Gen2,
or the owning user (if the owning user is also a member of the target group) is required.
If you specify to copy from a folder, Data Factory replicates the ACLs for that folder as well as the files and
directories under it (if recursive is set to true). If you specify to copy from a single file, the ACLs on that file are
copied.
IMPORTANT
When you choose to preserve ACLs, make sure you grant high enough permissions for ADF to operate against your sink
ADLS Gen2 account. For example, use account key authentication, or assign Storage Blob Data Owner role to the service
principal/managed identity.
When you configure the source as ADLS Gen1 with the binary copy option or binary format, and the sink as ADLS Gen2 with
the binary copy option or binary format, you can find the Preserve option on the Copy Data Tool Settings page or on the
Copy Activity -> Settings tab for activity authoring.
Here is an example of the JSON configuration (see preserve ):
"activities":[
{
"name": "CopyFromGen1ToGen2",
"type": "Copy",
"typeProperties": {
"source": {
"type": "AzureDataLakeStoreSource",
"recursive": true
},
"sink": {
"type": "AzureBlobFSSink",
"copyBehavior": "PreserveHierarchy"
},
"preserve": [
"ACL",
"Owner",
"Group"
]
},
"inputs": [
{
"referenceName": "<Azure Data Lake Storage Gen1 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Data Lake Storage Gen2 output dataset name>",
"type": "DatasetReference"
}
]
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Azure Database for MariaDB using
Azure Data Factory
2/1/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Azure Database for
MariaDB. It builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from Azure Database for MariaDB to any supported sink data store. For a list of data stores
that are supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Database for MariaDB connector.
Example:
{
"name": "AzureDatabaseForMariaDBLinkedService",
"properties": {
"type": "MariaDB",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server={your_server}.mariadb.database.azure.com; Port=3306; Database=
{your_database}; Uid={your_user}@{your_server}; Pwd={your_password}; SslMode=Preferred;"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Azure Database for MariaDB dataset.
To copy data from Azure Database for MariaDB, set the type property of the dataset to MariaDBTable. The
following properties are supported:
Example
{
"name": "AzureDatabaseForMariaDBDataset",
"properties": {
"type": "MariaDBTable",
"linkedServiceName": {
"referenceName": "<Azure Database for MariaDB linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .
Example:
"activities":[
{
"name": "CopyFromAzureDatabaseForMariaDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure Database for MariaDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MariaDBSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Azure Database for MySQL using
Azure Data Factory
4/19/2019 • 4 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Azure Database for
MySQL. It builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from Azure Database for MySQL to any supported sink data store. For a list of data stores that
are supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Database for MySQL connector.
Example:
{
"name": "AzureDatabaseForMySQLLinkedService",
"properties": {
"type": "AzureMySql",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>.mysql.database.azure.com;Port=<port>;Database=<database>;UID=
<username>;PWD=<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "AzureDatabaseForMySQLLinkedService",
"properties": {
"type": "AzureMySql",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>.mysql.database.azure.com;Port=<port>;Database=<database>;UID=
<username>;"
},
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Azure Database for MySQL dataset.
To copy data from Azure Database for MySQL, set the type property of the dataset to AzureMySqlTable. The
following properties are supported:
tableName Name of the table in the MySQL No (if "query" in activity source is
database. specified)
Example
{
"name": "AzureMySQLDataset",
"properties": {
"type": "AzureMySqlTable",
"linkedServiceName": {
"referenceName": "<Azure MySQL linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<table name>"
}
}
}
query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .
Example:
"activities":[
{
"name": "CopyFromAzureDatabaseForMySQL",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure MySQL input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureMySqlSource",
"query": "<custom query e.g. SELECT * FROM MyTable>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
AZURE DATABASE FOR MYSQL DATA TYPE DATA FACTORY INTERIM DATA TYPE
bigint Int64
bit Boolean
blob Byte[]
bool Int16
char String
date Datetime
datetime Datetime
double Double
enum String
float Single
int Int32
integer Int32
longblob Byte[]
longtext String
mediumblob Byte[]
mediumint Int32
mediumtext String
numeric Decimal
real Double
set String
smallint Int16
text String
time TimeSpan
timestamp Datetime
tinyblob Byte[]
tinyint Int16
tinytext String
varchar String
year Int32
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Azure Database for PostgreSQL
using Azure Data Factory
3/15/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Azure Database for
PostgreSQL. It builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from Azure Database for PostgreSQL to any supported sink data store. For a list of data stores
that are supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Database for PostgreSQL connector.
EncryptionMethod (EM): The method the driver uses to encrypt data sent between the driver and the database server. For example: EncryptionMethod=<0/1/6>;. Allowed values: 0 (No Encryption) (default), 1 (SSL), 6 (RequestSSL). Required: No.
Example:
{
"name": "AzurePostgreSqlLinkedService",
"properties": {
"type": "AzurePostgreSql",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>.postgres.database.azure.com;Database=<database>;Port=<port>;UID=
<username>;Password=<Password>"
}
}
}
}
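The encryption options described above are specified inside the same connection string. As a sketch (assuming the driver accepts the EncryptionMethod option shown in the table; the value 1 requests SSL):
"connectionString": {
    "type": "SecureString",
    "value": "Server=<server>.postgres.database.azure.com;Database=<database>;Port=<port>;UID=<username>;Password=<Password>;EncryptionMethod=1"
}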
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Azure Database for PostgreSQL dataset.
To copy data from Azure Database for PostgreSQL, set the type property of the dataset to
AzurePostgreSqlTable. The following properties are supported:
Example
{
"name": "AzurePostgreSqlDataset",
"properties": {
"type": "AzurePostgreSqlTable",
"linkedServiceName": {
"referenceName": "<AzurePostgreSql linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in the dataset is specified).
Example:
"activities":[
{
"name": "CopyFromAzurePostgreSql",
"type": "Copy",
"inputs": [
{
"referenceName": "<AzurePostgreSql input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzurePostgreSqlSource",
"query": "<custom query e.g. SELECT * FROM MyTable>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from or to Azure File Storage by using
Azure Data Factory
5/6/2019 • 14 minutes to read
This article outlines how to copy data to and from Azure File Storage. To learn about Azure Data Factory, read the
introductory article.
Supported capabilities
This Azure File Storage connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Specifically, this Azure File Storage connector supports copying files as-is or parsing/generating files with the
supported file formats and compression codecs.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure File Storage.
connectVia: The Integration Runtime to be used to connect to the data store. You can use the Azure Integration Runtime or a Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. Required: No for source, Yes for sink.
IMPORTANT
To copy data into Azure File Storage by using the Azure Integration Runtime, explicitly create an Azure IR with the location of your File Storage, and associate it in the linked service as shown in the following example.
To copy data from/to Azure File Storage by using a Self-hosted Integration Runtime outside of Azure, remember to open outbound TCP port 445 in your local network.
TIP
When you use the ADF UI for authoring, you can find the specific "Azure File Storage" entry for linked service creation, which underneath generates an object of type FileServer.
Example:
{
"name": "AzureFileStorageLinkedService",
"properties": {
"type": "FileServer",
"typeProperties": {
"host": "\\\\<storage name>.file.core.windows.net\\<file service name>",
"userid": "AZURE\\<storage name>",
"password": {
"type": "SecureString",
"value": "<storage access key>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data to and from Azure File Storage in Parquet or delimited text format, refer to Parquet format and
Delimited text format article on format-based dataset and supported settings. The following properties are
supported for Azure File Storage under location settings in format-based dataset:
NOTE
The FileShare type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the Copy/Lookup/GetMetadata activities for backward compatibility. We suggest that you use this new model going forward; the ADF authoring UI has switched to generating these new types.
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Azure File Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "FileServerLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
format: If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. Required: No (only for binary copy scenario).
TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.
NOTE
If you were using the "fileFilter" property for file filtering, it is still supported as-is, but we suggest that you use the new filter capability added to "fileName" going forward.
Example:
{
"name": "AzureFileStorageDataset",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<Azure File Storage linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
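For instance, to copy only the CSV files under that folder (the subset-of-files case from the tip above), the dataset typeProperties might look like the following sketch; the wildcard pattern is illustrative:
"typeProperties": {
    "folderPath": "folder/subfolder/",
    "fileName": "*.csv",
    "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
    }
}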
wildcardFileName: The file name with wildcard characters under the given folderPath/wildcardFolderPath to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or a single character); use ^ to escape if your actual file name has a wildcard or this escape char inside. See more examples in Folder and file filter examples. Required: Yes if fileName is not specified in the dataset.
Example:
"activities":[
{
"name": "CopyFromAzureFileStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "FileServerReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example:
"activities":[
{
"name": "CopyFromAzureFileStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure File Storage input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
NOTE
For Parquet/delimited text format, the FileSystemSink type copy activity sink mentioned in the next section is still supported as-is for backward compatibility. We suggest that you use this new model going forward; the ADF authoring UI has switched to generating these new types.
Example:
"activities":[
{
"name": "CopyToAzureFileStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "FileServerWriteSetting",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]
Example:
"activities":[
{
"name": "CopyToAzureFileStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure File Storage output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "FileSystemSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data to an Azure Search index using Azure
Data Factory
5/24/2019 • 4 minutes to read
This article outlines how to use the Copy Activity in Azure Data Factory to copy data into Azure Search index. It
builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from any supported source data store into Azure Search index. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Search connector.
IMPORTANT
When copying data from a cloud data store into an Azure Search index, in the Azure Search linked service you need to reference an Azure Integration Runtime with an explicit region in connectVia. Set the region to the one where your Azure Search service resides. Learn more from Azure Integration Runtime.
Example:
{
"name": "AzureSearchLinkedService",
"properties": {
"type": "AzureSearch",
"typeProperties": {
"url": "https://<service>.search.windows.net",
"key": {
"type": "SecureString",
"value": "<AdminKey>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Azure Search dataset.
To copy data into Azure Search, the following properties are supported:
Example:
{
"name": "AzureSearchIndexDataset",
"properties": {
"type": "AzureSearchIndex",
"linkedServiceName": {
"referenceName": "<Azure Search linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties" : {
"indexName": "products"
}
}
}
WriteBehavior property
AzureSearchSink upserts when writing data. In other words, when writing a document, if the document key
already exists in the Azure Search index, Azure Search updates the existing document rather than throwing a
conflict exception.
The AzureSearchSink provides the following two upsert behaviors (by using the Azure Search SDK):
Merge: combines all the columns in the new document with the existing one. For columns with a null value in the new document, the value in the existing one is preserved.
Upload: the new document replaces the existing one. For columns not specified in the new document, the value is set to null, whether there is a non-null value in the existing document or not.
The default behavior is Merge.
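For example, a sink that replaces existing documents instead of merging them would set the behavior to Upload (a minimal sketch):
"sink": {
    "type": "AzureSearchIndexSink",
    "writeBehavior": "Upload"
}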
WriteBatchSize Property
Azure Search service supports writing documents as a batch. A batch can contain 1 to 1,000 Actions. An action
handles one document to perform the upload/merge operation.
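A minimal sink sketch that sets the batch size explicitly (the value shown is illustrative and must stay within the 1 to 1,000 range):
"sink": {
    "type": "AzureSearchIndexSink",
    "writeBatchSize": 1000
}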
Example:
"activities":[
{
"name": "CopyToAzureSearch",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Search output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureSearchIndexSink",
"writeBehavior": "Merge"
}
}
}
]
Data type support
The following table specifies whether an Azure Search data type is supported by the Azure Search sink.
AZURE SEARCH DATA TYPE SUPPORTED IN AZURE SEARCH SINK
String Y
Int32 Y
Int64 Y
Double Y
Boolean Y
DateTimeOffset Y
String Array N
GeographyPoint N
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data to or from Azure SQL Database by using
Azure Data Factory
5/6/2019 • 14 minutes to read
This article outlines how to copy data to and from Azure SQL Database. To learn about Azure Data Factory, read
the introductory article.
Supported capabilities
This Azure SQL Database connector is supported for the following activities:
Copy activity with supported source/sink matrix table
Mapping data flow
Lookup activity
GetMetadata activity
Specifically, this Azure SQL Database connector supports these functions:
Copy data by using SQL authentication and Azure Active Directory (Azure AD) application token authentication with a service principal or managed identities for Azure resources.
As a source, retrieve data by using a SQL query or stored procedure.
As a sink, append data to a destination table or invoke a stored procedure with custom logic during the copy.
Azure SQL Database Always Encrypted is not currently supported.
IMPORTANT
If you copy data by using Azure Data Factory Integration Runtime, configure an Azure SQL server firewall so that Azure
Services can access the server. If you copy data by using a self-hosted integration runtime, configure the Azure SQL server
firewall to allow the appropriate IP range. This range includes the machine's IP that is used to connect to Azure SQL
Database.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to an
Azure SQL Database connector.
Linked service properties
These properties are supported for an Azure SQL Database linked service:
servicePrincipalId: Specify the application's client ID. Required: Yes, when you use Azure AD authentication with a service principal.
servicePrincipalKey: Specify the application's key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes, when you use Azure AD authentication with a service principal.
tenant: Specify the tenant information (domain name or tenant ID) under which your application resides. Retrieve it by hovering the mouse in the top-right corner of the Azure portal. Required: Yes, when you use Azure AD authentication with a service principal.
For different authentication types, refer to the following sections on prerequisites and JSON samples,
respectively:
SQL authentication
Azure AD application token authentication: Service principal
Azure AD application token authentication: Managed identities for Azure resources
TIP
If you hit an error with the error code "UserErrorFailedToConnectToSqlServer" and a message like "The session limit for the database is XXX and has been reached.", add Pooling=false to your connection string and try again.
SQL authentication
Linked service example that uses SQL authentication
{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=
<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=
<username>@<servername>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
},
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
4. Grant the service principal needed permissions as you normally do for SQL users or others. Run the
following code, or refer to more options here.
{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;Connection Timeout=30"
},
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
3. Grant the Data Factory Managed Identity needed permissions as you normally do for SQL users
and others. Run the following code, or refer to more options here.
{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;Connection Timeout=30"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Azure SQL Database dataset.
To copy data from or to Azure SQL Database, the following properties are supported:
tableName: The name of the table or view in the Azure SQL Database instance that the linked service refers to. Required: No for source, Yes for sink.
{
"name": "AzureSQLDbDataset",
"properties":
{
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "<Azure SQL Database linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"tableName": "MyTable"
}
}
}
Copy Activity properties
For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Azure SQL Database source and sink.
Azure SQL Database as the source
To copy data from Azure SQL Database, set the type property in the Copy Activity source to SqlSource. The
following properties are supported in the Copy Activity source section:
Points to note
If the sqlReaderQuery is specified for the SqlSource, Copy Activity runs this query against the Azure SQL
Database source to get the data. Or you can specify a stored procedure. Specify
sqlReaderStoredProcedureName and storedProcedureParameters if the stored procedure takes
parameters.
If you don't specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section of the dataset JSON are used to construct a query (select column1, column2 from mytable) that runs against Azure SQL Database. If the dataset definition doesn't have the structure, all columns are selected from the table.
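For instance, a dataset whose structure section lists two columns would cause Copy Activity to build a query over just those columns (a sketch; the column names and types are illustrative):
"structure": [
    { "name": "column1", "type": "String" },
    { "name": "column2", "type": "Int32" }
]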
SQL query example
"activities":[
{
"name": "CopyFromAzureSQLDatabase",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure SQL Database input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
"activities":[
{
"name": "CopyFromAzureSQLDatabase",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure SQL Database input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type":
"Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
TIP
When you copy data to Azure SQL Database, Copy Activity appends data to the sink table by default. To do an upsert or
additional business logic, use the stored procedure in SqlSink. Learn more details from Invoking stored procedure from
SQL Sink.
"activities":[
{
"name": "CopyToAzureSQLDatabase",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure SQL Database output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 100000
}
}
}
]
Destination table
NOTE
The target table has an identity column.
{
"name": "SampleTarget",
"properties": {
"structure": [
{ "name": "name" },
{ "name": "age" }
],
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "TestIdentitySQL",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "TargetTbl"
}
}
}
NOTE
Your source and target tables have different schemas. The target has an additional column with an identity. In this scenario, you must specify the structure property in the target dataset definition, which doesn't include the identity column.
"sink": {
"type": "SqlSink",
"SqlWriterTableType": "MarketingType",
"SqlWriterStoredProcedureName": "spOverwriteMarketing",
"storedProcedureParameters": {
"category": {
"value": "ProductA"
}
}
}
In your database, define the stored procedure with the same name as the SqlWriterStoredProcedureName. It
handles input data from your specified source and merges into the output table. The parameter name of the table
type in the stored procedure should be the same as the tableName defined in the dataset.
In your database, define the table type with the same name as the sqlWriterTableType. The schema of the table
type should be same as the schema returned by your input data.
AZURE SQL DATABASE DATA TYPE DATA FACTORY INTERIM DATA TYPE
bigint Int64
binary Byte[]
bit Boolean
date DateTime
Datetime DateTime
datetime2 DateTime
Datetimeoffset DateTimeOffset
Decimal Decimal
Float Double
image Byte[]
int Int32
money Decimal
numeric Decimal
real Single
rowversion Byte[]
smalldatetime DateTime
smallint Int16
smallmoney Decimal
sql_variant Object
time TimeSpan
timestamp Byte[]
tinyint Byte
uniqueidentifier Guid
varbinary Byte[]
xml Xml
NOTE
For data types that map to the Decimal interim type, ADF currently supports precision up to 28. If you have data with precision larger than 28, consider converting it to a string in the SQL query.
Next steps
For a list of data stores supported as sources and sinks by Copy Activity in Azure Data Factory, see Supported
data stores and formats.
Copy data to and from Azure SQL Database
Managed Instance by using Azure Data Factory
5/6/2019 • 12 minutes to read
This article outlines how to use the copy activity in Azure Data Factory to copy data to and from Azure SQL
Database Managed Instance. It builds on the Copy activity overview article that presents a general overview of the
copy activity.
Supported capabilities
You can copy data from Azure SQL Database Managed Instance to any supported sink data store. You also can
copy data from any supported source data store to the managed instance. For a list of data stores that are
supported as sources and sinks by the copy activity, see the Supported data stores table.
Specifically, this Azure SQL Database Managed Instance connector supports:
Copying data by using SQL or Windows authentication.
As a source, retrieving data by using a SQL query or stored procedure.
As a sink, appending data to a destination table or invoking a stored procedure with custom logic during copy.
SQL Server Always Encrypted is not currently supported.
Prerequisites
To copy data from an Azure SQL Database Managed Instance that's located in a virtual network, set up a self-hosted integration runtime that can access the database. For more information, see Self-hosted integration runtime.
If you provision your self-hosted integration runtime in the same virtual network as your managed instance, make
sure that your integration runtime machine is in a different subnet than your managed instance. If you provision
your self-hosted integration runtime in a different virtual network than your managed instance, you can use either
a virtual network peering or virtual network to virtual network connection. For more information, see Connect
your application to Azure SQL Database Managed Instance.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to the
Azure SQL Database Managed Instance connector.
Linked service properties
The following properties are supported for the Azure SQL Database Managed Instance linked service:
TIP
You might see the error code "UserErrorFailedToConnectToSqlServer" with a message like "The session limit for the database
is XXX and has been reached." If this error occurs, add Pooling=false to your connection string and try again.
{
"name": "AzureSqlMILinkedService",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Data Source=<servername>\\<instance name if using named instance>;Initial Catalog=
<databasename>;Integrated Security=False;User ID=<username>;"
},
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for use to define datasets, see the datasets article. This section
provides a list of properties supported by the Azure SQL Database Managed Instance dataset.
To copy data to and from Azure SQL Database Managed Instance, the following properties are supported:
tableName This property is the name of the table No for source. Yes for sink.
or view in the database instance that
the linked service refers to.
Example
{
"name": "AzureSqlMIDataset",
"properties":
{
"type": "SqlServerTable",
"linkedServiceName": {
"referenceName": "<Managed Instance linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"tableName": "MyTable"
}
}
}
"activities":[
{
"name": "CopyFromAzureSqlMI",
"type": "Copy",
"inputs": [
{
"referenceName": "<Managed Instance input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type": "Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
TIP
When data is copied to Azure SQL Database Managed Instance, the copy activity appends data to the sink table by default.
To perform an upsert or additional business logic, use the stored procedure in SqlSink. For more information, see Invoke a
stored procedure from a SQL sink.
"activities":[
{
"name": "CopyToAzureSqlMI",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Managed Instance output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 100000
}
}
}
]
Destination table
{
"name": "SampleTarget",
"properties": {
"structure": [
{ "name": "name" },
{ "name": "age" }
],
"type": "SqlServerTable",
"linkedServiceName": {
"referenceName": "TestIdentitySQL",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "TargetTbl"
}
}
}
Notice that your source and target tables have different schemas. The target table has an identity column. In this scenario, specify the "structure" property in the target dataset definition, which doesn't include the identity column.
"sink": {
"type": "SqlSink",
"SqlWriterTableType": "MarketingType",
"SqlWriterStoredProcedureName": "spOverwriteMarketing",
"storedProcedureParameters": {
"category": {
"value": "ProductA"
}
}
}
In your database, define the stored procedure with the same name as the SqlWriterStoredProcedureName. It
handles input data from your specified source and merges into the output table. The parameter name of the table
type in the stored procedure should be the same as the tableName defined in the dataset.
In your database, define the table type with the same name as sqlWriterTableType. The schema of the table type is
the same as the schema returned by your input data.
AZURE SQL DATABASE MANAGED INSTANCE DATA TYPE AZURE DATA FACTORY INTERIM DATA TYPE
bigint Int64
binary Byte[]
bit Boolean
date DateTime
Datetime DateTime
datetime2 DateTime
Datetimeoffset DateTimeOffset
Decimal Decimal
Float Double
image Byte[]
int Int32
money Decimal
numeric Decimal
real Single
rowversion Byte[]
smalldatetime DateTime
smallint Int16
smallmoney Decimal
sql_variant Object
time TimeSpan
timestamp Byte[]
tinyint Int16
uniqueidentifier Guid
varbinary Byte[]
xml Xml
NOTE
For data types that map to the Decimal interim type, currently Azure Data Factory supports precision up to 28. If you have
data that requires precision larger than 28, consider converting to a string in a SQL query.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see Supported
data stores.
Copy data to or from Azure SQL Data Warehouse
by using Azure Data Factory
5/31/2019 • 18 minutes to read
This article outlines how to copy data to and from Azure SQL Data Warehouse. To learn about Azure Data
Factory, read the introductory article.
Supported capabilities
This Azure SQL Data Warehouse connector is supported for the following activities:
Copy activity with supported source/sink matrix table
Mapping data flow
Lookup activity
GetMetadata activity
Specifically, this Azure SQL Data Warehouse connector supports these functions:
Copy data by using SQL authentication and Azure Active Directory (Azure AD ) Application token
authentication with a service principal or managed identities for Azure resources.
As a source, retrieve data by using a SQL query or stored procedure.
As a sink, load data by using PolyBase or a bulk insert. We recommend PolyBase for better copy performance.
IMPORTANT
If you copy data by using Azure Data Factory Integration Runtime, configure an Azure SQL server firewall so that Azure
services can access the server. If you copy data by using a self-hosted integration runtime, configure the Azure SQL server
firewall to allow the appropriate IP range. This range includes the machine's IP that is used to connect to Azure SQL
Database.
Get started
TIP
To achieve best performance, use PolyBase to load data into Azure SQL Data Warehouse. The Use PolyBase to load data
into Azure SQL Data Warehouse section has details. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data
Warehouse under 15 minutes with Azure Data Factory.
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that define Data Factory entities specific to an Azure SQL
Data Warehouse connector.
servicePrincipalId: Specify the application's client ID. Required: Yes, when you use Azure AD authentication with a service principal.
servicePrincipalKey: Specify the application's key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes, when you use Azure AD authentication with a service principal.
tenant: Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse in the top-right corner of the Azure portal. Required: Yes, when you use Azure AD authentication with a service principal.
For different authentication types, refer to the following sections on prerequisites and JSON samples,
respectively:
SQL authentication
Azure AD application token authentication: Service principal
Azure AD application token authentication: Managed identities for Azure resources
TIP
If you hit an error with the error code "UserErrorFailedToConnectToSqlServer" and a message like "The session limit for the database is XXX and has been reached.", add Pooling=false to your connection string and try again.
SQL authentication
Linked service example that uses SQL authentication
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=
<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=
<username>@<servername>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
},
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
4. Grant the service principal needed permissions as you normally do for SQL users or others. Run the
following code, or refer to more options here. If you want to use PolyBase to load the data, learn the
required database permission.
5. Configure an Azure SQL Data Warehouse linked service in Azure Data Factory.
Linked service example that uses service principal authentication
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;Connection Timeout=30"
},
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
3. Grant the Data Factory Managed Identity needed permissions as you normally do for SQL users and
others. Run the following code, or refer to more options here. If you want to use PolyBase to load the data,
learn the required database permission.
4. Configure an Azure SQL Data Warehouse linked service in Azure Data Factory.
Example:
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;Connection Timeout=30"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Azure SQL Data Warehouse dataset.
To copy data from or to Azure SQL Data Warehouse, the following properties are supported:
tableName: The name of the table or view in the Azure SQL Data Warehouse instance that the linked service refers to. Required: No for source, Yes for sink.
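A minimal dataset sketch that uses this property (names are placeholders; this assumes the AzureSqlDWTable dataset type for SQL Data Warehouse):
{
    "name": "AzureSQLDWDataset",
    "properties": {
        "type": "AzureSqlDWTable",
        "linkedServiceName": {
            "referenceName": "<Azure SQL Data Warehouse linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "tableName": "MyTable"
        }
    }
}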
Points to note
If the sqlReaderQuery is specified for the SqlSource, the Copy Activity runs this query against the Azure
SQL Data Warehouse source to get the data. Or you can specify a stored procedure. Specify the
sqlReaderStoredProcedureName and storedProcedureParameters if the stored procedure takes
parameters.
If you don't specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section of the dataset JSON are used to construct a query (select column1, column2 from mytable) that runs against Azure SQL Data Warehouse. If the dataset definition doesn't have the structure, all columns are selected from the table.
SQL query example
"activities":[
{
"name": "CopyFromAzureSQLDW",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure SQL DW input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
"activities":[
{
"name": "CopyFromAzureSQLDW",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure SQL DW input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type":
"Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"polyBaseSettings":
{
"rejectType": "percentage",
"rejectValue": 10.0,
"rejectSampleValue": 100,
"useTypeDefault": true
}
}
Learn more about how to use PolyBase to efficiently load SQL Data Warehouse in the next section.
Use PolyBase to load data into Azure SQL Data Warehouse
Using PolyBase is an efficient way to load a large amount of data into Azure SQL Data Warehouse with high
throughput. You'll see a large gain in the throughput by using PolyBase instead of the default BULKINSERT
mechanism. See Performance reference for a detailed comparison. For a walkthrough with a use case, see Load 1
TB into Azure SQL Data Warehouse.
If your source data is in Azure Blob, Azure Data Lake Storage Gen1 or Azure Data Lake Storage Gen2,
and the format is PolyBase compatible, you can use copy activity to directly invoke PolyBase to let Azure
SQL Data Warehouse pull the data from source. For details, see Direct copy by using PolyBase.
If your source data store and format aren't natively supported by PolyBase, use the Staged copy by using PolyBase feature instead. The staged copy feature also gives you better throughput. It automatically converts the data into a PolyBase-compatible format, stores the data in Azure Blob storage, and then loads the data into SQL Data Warehouse.
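A sketch of what staged copy can look like in the copy activity typeProperties, assuming the enableStaging/stagingSettings properties and an Azure Blob storage staging account (all names are placeholders):
"typeProperties": {
    "source": {
        "type": "<source type>"
    },
    "sink": {
        "type": "SqlDWSink",
        "allowPolyBase": true
    },
    "enableStaging": true,
    "stagingSettings": {
        "linkedServiceName": {
            "referenceName": "<Azure Blob storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "path": "<container/path for interim data>"
    }
}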
TIP
Learn more on Best practices for using PolyBase.
TIP
To copy data efficiently to SQL Data Warehouse, learn more from Azure Data Factory makes it even easier and convenient
to uncover insights from data when using Data Lake Store with SQL Data Warehouse.
If the requirements aren't met, Azure Data Factory checks the settings and automatically falls back to the
BULKINSERT mechanism for the data movement.
1. The source linked service uses one of the following types and authentication methods: Azure Data Lake Storage Gen2 with account key authentication or managed identity authentication.
IMPORTANT
If your Azure Storage is configured with VNet service endpoint, you must use managed identity authentication.
Refer to Impact of using VNet Service Endpoints with Azure storage
2. The source data format is Parquet, ORC, or Delimited text, with the following configurations:
a. The folder path doesn't contain a wildcard filter.
b. The file name points to a single file or is * or *.*.
c. rowDelimiter must be \n.
d. nullValue is either set to an empty string ("") or left as default, and treatEmptyAsNull is left as default or set to true.
e. encodingName is set to utf-8, which is the default value.
f. quoteChar, escapeChar, and skipLineCount aren't specified. PolyBase supports skipping header rows, which can be configured as firstRowAsHeader in ADF.
g. compression can be no compression, GZip, or Deflate.
"activities":[
{
"name": "CopyFromAzureBlobToSQLDataWarehouseViaPolyBase",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSQLDWDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
}
}
}
]
The solution is to unselect the "Use type default" option (set it to false) in the copy activity sink -> PolyBase settings. "USE_TYPE_DEFAULT" is a PolyBase native configuration that specifies how to handle missing values in delimited text files when PolyBase retrieves data from the text file.
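Expressed in JSON, the same fix builds on the polyBaseSettings shown earlier (a minimal sketch):
"sink": {
    "type": "SqlDWSink",
    "allowPolyBase": true,
    "polyBaseSettings": {
        "useTypeDefault": false
    }
}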
Others
For more known PolyBase issues, refer to Troubleshooting Azure SQL Data Warehouse PolyBase load.
SQL Data Warehouse resource class
To achieve the best possible throughput, assign a larger resource class to the user that loads data into SQL Data
Warehouse via PolyBase.
tableName in Azure SQL Data Warehouse
The following table gives examples of how to specify the tableName property in the JSON dataset. It shows
several combinations of schema and table names.
If you see the following error, the problem might be the value you specified for the tableName property. See the
preceding table for the correct way to specify values for the tableName JSON property.
All columns of the table must be specified in the INSERT BULK statement.
The NULL value is a special form of the default value. If the column is nullable, the input data in the blob for that
column might be empty. But it can't be missing from the input dataset. PolyBase inserts NULL for missing values
in Azure SQL Data Warehouse.
TIP
Refer to Table data types in Azure SQL Data Warehouse article on SQL DW supported data types and the workarounds for
unsupported ones.
AZURE SQL DATA WAREHOUSE DATA TYPE DATA FACTORY INTERIM DATA TYPE
bigint Int64
binary Byte[]
bit Boolean
date DateTime
Datetime DateTime
datetime2 DateTime
Datetimeoffset DateTimeOffset
Decimal Decimal
Float Double
image Byte[]
int Int32
money Decimal
numeric Decimal
real Single
rowversion Byte[]
smalldatetime DateTime
smallint Int16
smallmoney Decimal
time TimeSpan
tinyint Byte
uniqueidentifier Guid
varbinary Byte[]
Next steps
For a list of data stores supported as sources and sinks by Copy Activity in Azure Data Factory, see supported
data stores and formats.
Copy data to and from Azure Table storage by using
Azure Data Factory
3/5/2019 • 10 minutes to read
This article outlines how to use Copy Activity in Azure Data Factory to copy data to and from Azure Table storage.
It builds on the Copy Activity overview article that presents a general overview of Copy Activity.
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Supported capabilities
You can copy data from any supported source data store to Table storage. You also can copy data from Table
storage to any supported sink data store. For a list of data stores that are supported as sources or sinks by the
copy activity, see the Supported data stores table.
Specifically, this Azure Table connector supports copying data by using account key and service shared access
signature authentications.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Table storage.
NOTE
If you were using the "AzureStorage" type linked service, it is still supported as-is, but we suggest that you use the new "AzureTableStorage" linked service type going forward.
Example:
{
"name": "AzureTableStorageLinkedService",
"properties": {
"type": "AzureTableStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
NOTE
Data Factory now supports both service shared access signatures and account shared access signatures. For more
information about these two types and how to construct them, see Types of shared access signatures.
TIP
To generate a service shared access signature for your storage account, you can execute the following PowerShell
commands. Replace the placeholders and grant the needed permission.
$context = New-AzStorageContext -StorageAccountName <accountName> -StorageAccountKey <accountKey>
New-AzStorageContainerSASToken -Name <containerName> -Context $context -Permission rwdl -StartTime
<startTime> -ExpiryTime <endTime> -FullUri
To use shared access signature authentication, the following properties are supported.
NOTE
If you were using the "AzureStorage" type linked service, it is still supported as-is, but we suggest that you use the new "AzureTableStorage" linked service type going forward.
Example:
{
"name": "AzureTableStorageLinkedService",
"properties": {
"type": "AzureTableStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the Azure Storage resource e.g.
https://<account>.table.core.windows.net/<table>?sv=<storage version>&st=<start time>&se=<expire
time>&sr=<resource>&sp=<permissions>&sip=<ip range>&spr=<protocol>&sig=<signature>>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
When you create a shared access signature URI, consider the following points:
Set appropriate read/write permissions on objects based on how the linked service (read, write, read/write) is
used in your data factory.
Set Expiry time appropriately. Make sure that the access to Storage objects doesn't expire within the active
period of the pipeline.
The URI should be created at the right table level based on the need.
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Azure Table dataset.
To copy data to and from Azure Table, set the type property of the dataset to AzureTable. The following
properties are supported.
Example:
{
"name": "AzureTableDataset",
"properties":
{
"type": "AzureTable",
"linkedServiceName": {
"referenceName": "<Azure Table storage linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "MyTable"
}
}
}
azureTableSourceQuery examples
If the Azure Table column is of the datetime type, compare it against a datetime value in the query. If you use a pipeline parameter, cast the datetime value to the proper format, as sketched below.
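A sketch of such source queries, assuming a datetime column named DeploymentEndTime and a pipeline parameter named StartTime (the column name, timestamps, and parameter name are illustrative):
"azureTableSourceQuery": "DeploymentEndTime gt datetime'2018-12-01T00:00:00' and DeploymentEndTime le datetime'2018-12-10T00:00:00'"
"azureTableSourceQuery": "DeploymentEndTime gt datetime'@{pipeline().parameters.StartTime}'"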
Azure Table as a sink type
To copy data to Azure Table, set the sink type in the copy activity to AzureTableSink. The following properties are
supported in the copy activity sink section.
writeBatchTimeout: Inserts data into Azure Table when writeBatchSize or writeBatchTimeout is hit. Allowed values are timespan; an example is "00:20:00" (20 minutes). Required: No (default is 90 seconds, the storage client's default timeout).
Example:
"activities":[
{
"name": "CopyToAzureTable",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Table output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureTableSink",
"azureTablePartitionKeyName": "<column name>",
"azureTableRowKeyName": "<column name>"
}
}
}
]
azureTablePartitionKeyName
Map a source column to a destination column by using the "translator" property before you can use the
destination column as azureTablePartitionKeyName.
In the following example, source column DivisionID is mapped to the destination column DivisionID:
"translator": {
"type": "TabularTranslator",
"columnMappings": "DivisionID: DivisionID, FirstName: FirstName, LastName: LastName"
}
"sink": {
"type": "AzureTableSink",
"azureTablePartitionKeyName": "DivisionID"
}
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Cassandra using Azure Data Factory
3/14/2019 • 7 minutes to read
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a Cassandra database. It
builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from Cassandra database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Cassandra connector supports:
Cassandra versions 2.x and 3.x.
Copying data using Basic or Anonymous authentication.
NOTE
For activities running on a Self-hosted Integration Runtime, Cassandra 3.x is supported with IR version 3.7 and later.
Prerequisites
To copy data from a Cassandra database that is not publicly accessible, you need to set up a Self-hosted
Integration Runtime. See Self-hosted Integration Runtime article to learn details. The Integration Runtime
provides a built-in Cassandra driver, therefore you don't need to manually install any driver when copying data
from/to Cassandra.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Cassandra connector.
port: The TCP port that the Cassandra server uses to listen for client connections. Required: No (default is 9042).
username: Specify the user name for the user account. Required: Yes, if authenticationType is set to Basic.
password: Specify the password for the user account. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes, if authenticationType is set to Basic.
NOTE
Currently connection to Cassandra using SSL is not supported.
Example:
{
"name": "CassandraLinkedService",
"properties": {
"type": "Cassandra",
"typeProperties": {
"host": "<host>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Cassandra dataset.
To copy data from Cassandra, set the type property of the dataset to CassandraTable. The following properties
are supported:
Example:
{
"name": "CassandraDataset",
"properties": {
"type": "CassandraTable",
"linkedServiceName": {
"referenceName": "<Cassandra linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"keySpace": "<keyspace name>",
"tableName": "<table name>"
}
}
}
query: Use the custom query to read data, as a SQL-92 query or CQL query. See the CQL reference. Required: No (if "tableName" and "keyspace" in the dataset are specified).
Example:
"activities":[
{
"name": "CopyFromCassandra",
"type": "Copy",
"inputs": [
{
"referenceName": "<Cassandra input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "CassandraSource",
"query": "select id, firstname, lastname from mykeyspace.mytable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Data type mapping for Cassandra
When copying data from Cassandra, the following mappings are used from Cassandra data types to Azure Data
Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the source
schema and data type to the sink.
ASCII String
BIGINT Int64
BLOB Byte[]
BOOLEAN Boolean
DECIMAL Decimal
DOUBLE Double
FLOAT Single
INET String
INT Int32
TEXT String
TIMESTAMP DateTime
TIMEUUID Guid
UUID Guid
VARCHAR String
VARINT Decimal
NOTE
For collection types (map, set, list, etc.), refer to the Work with Cassandra collection types using virtual table section.
User-defined types are not supported.
The length of Binary and String columns cannot be greater than 4000.
1 "sample value 1" ["1", "2", "3"] {"S1": "a", "S2": "b"} {"A", "B", "C"}
3 "sample value 3" ["100", "101", "102", {"S1": "t"} {"A", "E"}
"105"]
The driver would generate multiple virtual tables to represent this single table. The foreign key columns in the
virtual tables reference the primary key columns in the real table, and indicate which real table row the virtual
table row corresponds to.
The first virtual table is the base table, named "ExampleTable", shown in the following table:
PK_INT VALUE
1 "sample value 1"
3 "sample value 3"
The base table contains the same data as the original database table except for the collections, which are omitted
from this table and expanded in other virtual tables.
The following tables show the virtual tables that renormalize the data from the List, Map, and StringSet columns.
The columns with names that end with "_index" or "_key" indicate the position of the data within the original list or
map. The columns with names that end with "_value" contain the expanded data from the collection.
Table "ExampleTable_vt_List":
1 0 1
1 1 2
1 2 3
3 0 100
3 1 101
3 2 102
3 3 103
Table "ExampleTable_vt_Map":
1 S1 A
1 S2 b
3 S1 t
Table "ExampleTable_vt_StringSet":
PK_INT STRINGSET_VALUE
1 A
1 B
1 C
3 A
3 E
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to Dynamics 365 (Common Data
Service) or Dynamics CRM by using Azure Data
Factory
5/29/2019 • 9 minutes to read
This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Microsoft Dynamics
365 or Microsoft Dynamics CRM. It builds on the Copy Activity overview article that presents a general overview
of Copy Activity.
Supported capabilities
You can copy data from Dynamics 365 (Common Data Service) or Dynamics CRM to any supported sink data
store. You also can copy data from any supported source data store to Dynamics 365 (Common Data Service) or
Dynamics CRM. For a list of data stores supported as sources or sinks by the copy activity, see the Supported data
stores table.
This Dynamics connector supports the following Dynamics versions and authentication types. (IFD is short for
internet-facing deployment.)
Dynamics 365 online and Dynamics CRM Online, with Office365 authentication
Dynamics 365 on-premises with IFD, Dynamics CRM 2016 on-premises with IFD, and Dynamics CRM 2015 on-premises with IFD, with IFD authentication
For Dynamics 365 specifically, the following application types are supported:
Dynamics 365 for Sales
Dynamics 365 for Customer Service
Dynamics 365 for Field Service
Dynamics 365 for Project Service Automation
Dynamics 365 for Marketing
Other application types, such as Finance and Operations and Talent, are not supported by this connector.
TIP
To copy data from Dynamics 365 Finance and Operations, you can use the Dynamics AX connector.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-step
instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Dynamics.
connectVia    The integration runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime.    Required: No for source; Yes for sink if the source linked service doesn't have an integration runtime
NOTE
The Dynamics connector formerly used the optional "organizationName" property to identify your Dynamics CRM/365 Online instance. While that property still works, we suggest that you specify the new "serviceUri" property instead to gain better performance for instance discovery.
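Example: Dynamics online using Office365 authentication (the serviceUri shown is a sample instance URL; replace it with your own):
{
    "name": "DynamicsLinkedService",
    "properties": {
        "type": "Dynamics",
        "description": "Dynamics online linked service using Office365 authentication",
        "typeProperties": {
            "deploymentType": "Online",
            "serviceUri": "https://adfdynamics.crm.dynamics.com",
            "authenticationType": "Office365",
            "username": "test@contoso.onmicrosoft.com",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}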
connectVia    The integration runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime.    Required: No for source, Yes for sink
{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"description": "Dynamics on-premises with IFD linked service using IFD authentication",
"typeProperties": {
"deploymentType": "OnPremisesWithIFD",
"hostName": "contosodynamicsserver.contoso.com",
"port": 443,
"organizationName": "admsDynamicsTest",
"authenticationType": "Ifd",
"username": "test@contoso.onmicrosoft.com",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by Dynamics dataset.
To copy data from and to Dynamics, set the type property of the dataset to DynamicsEntity. The following
properties are supported.
entityName    The logical name of the entity to retrieve.    Required: No for source (if "query" in the activity source is specified), Yes for sink
IMPORTANT
When you copy data from Dynamics, the "structure" section is optional but highly recommanded in the Dynamics dataset
to ensure a deterministic copy result. It defines the column name and data type for Dynamics data that you want to copy
over. To learn more, see Dataset structure and Data type mapping for Dynamics.
When importing schema in authoring UI, ADF infer the schema by sampling the top rows from the Dynamics query result
to initialize the structure construction, in which case columns with no values will be omitted. The same behavior applies to
copy executions if there is no explicit structure definition. You can review and add more columns into the Dynamics dataset
schema/structure as needed, which will be honored during copy runtime.
When you copy data to Dynamics, the "structure" section is optional in the Dynamics dataset. Which columns to copy into
is determined by the source data schema. If your source is a CSV file without a header, in the input dataset, specify the
"structure" with the column name and data type. They map to fields in the CSV file one by one in order.
Example:
{
"name": "DynamicsDataset",
"properties": {
"type": "DynamicsEntity",
"structure": [
{
"name": "accountid",
"type": "Guid"
},
{
"name": "name",
"type": "String"
},
{
"name": "marketingonly",
"type": "Boolean"
},
{
"name": "modifiedon",
"type": "Datetime"
}
],
"typeProperties": {
"entityName": "account"
},
"linkedServiceName": {
"referenceName": "<Dynamics linked service name>",
"type": "linkedservicereference"
}
}
}
NOTE
The PK column will always be copied out even if the column projection you configure in the FetchXML query doesn't contain
it.
Example:
"activities":[
{
"name": "CopyFromDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<Dynamics input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DynamicsSource",
"query": "<FetchXML Query>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
NOTE
The default value of the sink "writeBatchSize" and the copy activity "parallelCopies" for the Dynamics sink are both 10.
Therefore, 100 records are submitted to Dynamics concurrently.
For Dynamics 365 online, there is a limit of 2 concurrent batch calls per organization. If that limit is exceeded, a "Server Busy" fault is thrown before the first request is ever executed. Keeping "writeBatchSize" less than or equal to 10 avoids such throttling of concurrent calls.
The optimal combination of "writeBatchSize" and "parallelCopies" depends on the schema of your entity: for example, the number of columns, the row size, and the number of plugins, workflows, or workflow activities hooked up to those calls. The default setting of writeBatchSize 10 * parallelCopies 10 is the recommendation from the Dynamics service; it works for most Dynamics entities, though it might not give the best performance. You can tune the performance by adjusting the combination in your copy activity settings.
Example:
"activities":[
{
"name": "CopyToDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Dynamics output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "DynamicsSink",
"writeBehavior": "Upsert",
"writeBatchSize": 10,
"ignoreNullValues": true
}
}
}
]
Data type mapping for Dynamics
When copying data from Dynamics, the following mappings are used from Dynamics data types to Azure Data Factory interim data types.
DYNAMICS DATA TYPE    DATA FACTORY INTERIM DATA TYPE    SUPPORTED AS SOURCE    SUPPORTED AS SINK
AttributeTypeCode.BigInt    Long    ✓    ✓
AttributeTypeCode.Boolean    Boolean    ✓    ✓
AttributeType.Customer    Guid    ✓
AttributeType.DateTime    Datetime    ✓    ✓
AttributeType.Decimal    Decimal    ✓    ✓
AttributeType.Double    Double    ✓    ✓
AttributeType.EntityName    String    ✓    ✓
AttributeType.Integer    Int32    ✓    ✓
AttributeType.ManagedProperty    Boolean    ✓
AttributeType.Memo    String    ✓    ✓
AttributeType.Money    Decimal    ✓    ✓
AttributeType.Owner    Guid    ✓
AttributeType.Picklist    Int32    ✓    ✓
AttributeType.Uniqueidentifier    Guid    ✓    ✓
AttributeType.String    String    ✓    ✓
AttributeType.State    Int32    ✓    ✓
AttributeType.Status    Int32    ✓    ✓
NOTE
The Dynamics data types AttributeType.CalendarRules and AttributeType.PartyList aren't supported.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Concur using Azure Data Factory
(Preview)
2/1/2019 • 3 minutes to read
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Concur. It builds on the
copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Concur to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any driver to use this connector.
NOTE
Partner account is currently not supported.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Concur connector.
Example:
{
"name": "ConcurLinkedService",
"properties": {
"type": "Concur",
"typeProperties": {
"clientId" : "<clientId>",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Concur dataset.
To copy data from Concur, set the type property of the dataset to ConcurObject. There is no additional type-
specific property in this type of dataset. The following properties are supported:
Example
{
"name": "ConcurDataset",
"properties": {
"type": "ConcurObject",
"linkedServiceName": {
"referenceName": "<Concur linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query    Use the custom SQL query to read data. For example: "SELECT * FROM Opportunities where Id = xxx".    Required: No (if "tableName" in dataset is specified)
Example:
"activities":[
{
"name": "CopyFromConcur",
"type": "Copy",
"inputs": [
{
"referenceName": "<Concur input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ConcurSource",
"query": "SELECT * FROM Opportunities where Id = xxx"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Couchbase using Azure Data
Factory (Preview)
2/1/2019 • 3 minutes to read
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Couchbase. It builds on
the copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Couchbase to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any driver to use this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Couchbase connector.
Example:
{
"name": "CouchbaseLinkedService",
"properties": {
"type": "Couchbase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>; Port=<port>;AuthMech=1;CredString=[{\"user\": \"JSmith\",
\"pass\":\"access123\"}, {\"user\": \"Admin\", \"pass\":\"simba123\"}];"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Couchbase dataset.
To copy data from Couchbase, set the type property of the dataset to CouchbaseTable. The following properties
are supported:
Example
{
"name": "CouchbaseDataset",
"properties": {
"type": "CouchbaseTable",
"linkedServiceName": {
"referenceName": "<Couchbase linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query    Use the custom SQL query to read data. For example: "SELECT * FROM MyTable".    Required: No (if "tableName" in dataset is specified)
Example:
"activities":[
{
"name": "CopyFromCouchbase",
"type": "Copy",
"inputs": [
{
"referenceName": "<Couchbase input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "CouchbaseSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from DB2 by using Azure Data Factory
1/3/2019 • 4 minutes to read
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a DB2 database. It
builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from DB2 database to any supported sink data store. For a list of data stores that are supported
as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this DB2 connector supports the following IBM DB2 platforms and versions with Distributed
Relational Database Architecture (DRDA) SQL Access Manager (SQLAM ) version 9, 10 and 11:
IBM DB2 for z/OS 11.1
IBM DB2 for z/OS 10.1
IBM DB2 for i 7.2
IBM DB2 for i 7.1
IBM DB2 for LUW 11
IBM DB2 for LUW 10.5
IBM DB2 for LUW 10.1
TIP
If you receive an error message that states "The package corresponding to an SQL statement execution request was not found. SQLSTATE=51002 SQLCODE=-805", the reason is that a needed package has not been created for the normal user on that OS. Follow these instructions according to your DB2 server type:
DB2 for i (AS400): have a power user create the collection for the login user before using the copy activity. Command: create collection <username>
DB2 for z/OS or LUW: use a high-privilege account (a power user or admin with package authorities and BIND, BINDADD, GRANT EXECUTE TO PUBLIC permissions) to run the copy activity once; the needed package is then created automatically during the copy. Afterwards, you can switch back to the normal user for your subsequent copy runs.
Prerequisites
To copy data from a DB2 database that is not publicly accessible, you need to set up a Self-hosted Integration Runtime. To learn about self-hosted integration runtimes, see the Self-hosted Integration Runtime article. The Integration Runtime provides a built-in DB2 driver, so you don't need to manually install any driver when copying data from DB2.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
DB2 connector.
Example:
{
"name": "Db2LinkedService",
"properties": {
"type": "Db2",
"typeProperties": {
"server": "<servername:port>",
"database": "<dbname>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by DB2 dataset.
To copy data from DB2, set the type property of the dataset to RelationalTable. The following properties are
supported:
tableName    Name of the table in the DB2 database.    Required: No (if "query" in activity source is specified)
Example
{
"name": "DB2Dataset",
"properties":
{
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<DB2 linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query    Use the custom SQL query to read data. For example: "query": "SELECT * FROM \"DB2ADMIN\".\"Customers\"".    Required: No (if "tableName" in dataset is specified)
Example:
"activities":[
{
"name": "CopyFromDB2",
"type": "Copy",
"inputs": [
{
"referenceName": "<DB2 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT * FROM \"DB2ADMIN\".\"Customers\""
},
"sink": {
"type": "<sink type>"
}
}
}
]
Data type mapping for DB2
When copying data from DB2, the following mappings are used from DB2 database types to Azure Data Factory interim data types.
DB2 DATABASE TYPE    DATA FACTORY INTERIM DATA TYPE
BigInt    Int64
Binary    Byte[]
Blob    Byte[]
Char    String
Clob    String
Date    Datetime
DB2DynArray    String
DbClob    String
Decimal    Decimal
DecimalFloat    Decimal
Double    Double
Float    Double
Graphic    String
Integer    Int32
LongVarBinary    Byte[]
LongVarChar    String
LongVarGraphic    String
Numeric    Decimal
Real    Single
SmallInt    Int16
Time    TimeSpan
Timestamp    DateTime
VarBinary    Byte[]
VarChar    String
VarGraphic    String
Xml    Byte[]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Delimited text format in Azure Data Factory
5/6/2019 • 5 minutes to read
Follow this article when you want to parse delimited text files or write data in delimited text format.
Delimited text format is supported for the following connectors: Amazon S3, Azure Blob, Azure Data Lake
Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage,
HDFS, HTTP, and SFTP.
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the delimited text dataset.
fileExtension    The file extension used to name the output files, for example .csv or .txt. It must be specified when fileName is not specified in the output DelimitedText dataset.    Required: Yes when the file name is not specified in the output dataset
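As a sketch, a delimited text output dataset that combines these properties might look like the following, assuming a file system linked service and the FileServerLocation type shown later in this document; the placement of fileExtension under typeProperties is an assumption based on the property description above:
{
    "name": "DelimitedTextOutputDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<file system linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "FileServerLocation",
                "folderPath": "root/folder/subfolder"
            },
            "columnDelimiter": ",",
            "quoteChar": "\"",
            "firstRowAsHeader": true,
            "compressionCodec": "gzip",
            "fileExtension": ".csv"
        }
    }
}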
Next steps
Copy activity overview
Mapping data flow
Lookup activity
GetMetadata activity
Copy data from Drill using Azure Data Factory
(Preview)
2/1/2019 • 3 minutes to read
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Drill. It builds on the
copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Drill to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any driver to use this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Drill connector.
Example:
{
"name": "DrillLinkedService",
"properties": {
"type": "Drill",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "ConnectionType=Direct;Host=<host>;Port=<port>;AuthenticationType=Plain;UID=<user
name>;PWD=<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Drill dataset.
To copy data from Drill, set the type property of the dataset to DrillTable. The following properties are supported:
Example
{
"name": "DrillDataset",
"properties": {
"type": "DrillTable",
"linkedServiceName": {
"referenceName": "<Drill linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query    Use the custom SQL query to read data. For example: "SELECT * FROM MyTable".    Required: No (if "tableName" in dataset is specified)
Example:
"activities":[
{
"name": "CopyFromDrill",
"type": "Copy",
"inputs": [
{
"referenceName": "<Drill input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DrillSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to Dynamics 365 (Common Data
Service) or Dynamics CRM by using Azure Data
Factory
5/29/2019 • 9 minutes to read
This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Microsoft Dynamics
365 or Microsoft Dynamics CRM. It builds on the Copy Activity overview article that presents a general overview
of Copy Activity.
Supported capabilities
You can copy data from Dynamics 365 (Common Data Service) or Dynamics CRM to any supported sink data
store. You also can copy data from any supported source data store to Dynamics 365 (Common Data Service) or
Dynamics CRM. For a list of data stores supported as sources or sinks by the copy activity, see the Supported data
stores table.
This Dynamics connector supports the following Dynamics versions and authentication types. (IFD is short for
internet-facing deployment.)
Dynamics 365 online and Dynamics CRM Online, with Office365 authentication
Dynamics 365 on-premises with IFD, Dynamics CRM 2016 on-premises with IFD, and Dynamics CRM 2015 on-premises with IFD, with IFD authentication
For Dynamics 365 specifically, the following application types are supported:
Dynamics 365 for Sales
Dynamics 365 for Customer Service
Dynamics 365 for Field Service
Dynamics 365 for Project Service Automation
Dynamics 365 for Marketing
Other application types, such as Finance and Operations and Talent, are not supported by this connector.
TIP
To copy data from Dynamics 365 Finance and Operations, you can use the Dynamics AX connector.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-step
instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Dynamics.
connectVia    The integration runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime.    Required: No for source; Yes for sink if the source linked service doesn't have an integration runtime
NOTE
The Dynamics connector formerly used the optional "organizationName" property to identify your Dynamics CRM/365 Online instance. While that property still works, we suggest that you specify the new "serviceUri" property instead to gain better performance for instance discovery.
connectVia    The integration runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime.    Required: No for source, Yes for sink
{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"description": "Dynamics on-premises with IFD linked service using IFD authentication",
"typeProperties": {
"deploymentType": "OnPremisesWithIFD",
"hostName": "contosodynamicsserver.contoso.com",
"port": 443,
"organizationName": "admsDynamicsTest",
"authenticationType": "Ifd",
"username": "test@contoso.onmicrosoft.com",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by Dynamics dataset.
To copy data from and to Dynamics, set the type property of the dataset to DynamicsEntity. The following
properties are supported.
entityName    The logical name of the entity to retrieve.    Required: No for source (if "query" in the activity source is specified), Yes for sink
IMPORTANT
When you copy data from Dynamics, the "structure" section is optional but highly recommanded in the Dynamics dataset
to ensure a deterministic copy result. It defines the column name and data type for Dynamics data that you want to copy
over. To learn more, see Dataset structure and Data type mapping for Dynamics.
When importing schema in authoring UI, ADF infer the schema by sampling the top rows from the Dynamics query result
to initialize the structure construction, in which case columns with no values will be omitted. The same behavior applies to
copy executions if there is no explicit structure definition. You can review and add more columns into the Dynamics dataset
schema/structure as needed, which will be honored during copy runtime.
When you copy data to Dynamics, the "structure" section is optional in the Dynamics dataset. Which columns to copy into
is determined by the source data schema. If your source is a CSV file without a header, in the input dataset, specify the
"structure" with the column name and data type. They map to fields in the CSV file one by one in order.
Example:
{
"name": "DynamicsDataset",
"properties": {
"type": "DynamicsEntity",
"structure": [
{
"name": "accountid",
"type": "Guid"
},
{
"name": "name",
"type": "String"
},
{
"name": "marketingonly",
"type": "Boolean"
},
{
"name": "modifiedon",
"type": "Datetime"
}
],
"typeProperties": {
"entityName": "account"
},
"linkedServiceName": {
"referenceName": "<Dynamics linked service name>",
"type": "linkedservicereference"
}
}
}
NOTE
The PK column will always be copied out even if the column projection you configure in the FetchXML query doesn't contain
it.
Example:
"activities":[
{
"name": "CopyFromDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<Dynamics input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DynamicsSource",
"query": "<FetchXML Query>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
NOTE
The default value of the sink "writeBatchSize" and the copy activity "parallelCopies" for the Dynamics sink are both 10.
Therefore, 100 records are submitted to Dynamics concurrently.
For Dynamics 365 online, there is a limit of 2 concurrent batch calls per organization. If that limit is exceeded, a "Server Busy" fault is thrown before the first request is ever executed. Keeping "writeBatchSize" less than or equal to 10 avoids such throttling of concurrent calls.
The optimal combination of "writeBatchSize" and "parallelCopies" depends on the schema of your entity: for example, the number of columns, the row size, and the number of plugins, workflows, or workflow activities hooked up to those calls. The default setting of writeBatchSize 10 * parallelCopies 10 is the recommendation from the Dynamics service; it works for most Dynamics entities, though it might not give the best performance. You can tune the performance by adjusting the combination in your copy activity settings.
Example:
"activities":[
{
"name": "CopyToDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Dynamics output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "DynamicsSink",
"writeBehavior": "Upsert",
"writeBatchSize": 10,
"ignoreNullValues": true
}
}
}
]
Data type mapping for Dynamics
When copying data from Dynamics, the following mappings are used from Dynamics data types to Azure Data Factory interim data types.
DYNAMICS DATA TYPE    DATA FACTORY INTERIM DATA TYPE    SUPPORTED AS SOURCE    SUPPORTED AS SINK
AttributeTypeCode.BigInt    Long    ✓    ✓
AttributeTypeCode.Boolean    Boolean    ✓    ✓
AttributeType.Customer    Guid    ✓
AttributeType.DateTime    Datetime    ✓    ✓
AttributeType.Decimal    Decimal    ✓    ✓
AttributeType.Double    Double    ✓    ✓
AttributeType.EntityName    String    ✓    ✓
AttributeType.Integer    Int32    ✓    ✓
AttributeType.ManagedProperty    Boolean    ✓
AttributeType.Memo    String    ✓    ✓
AttributeType.Money    Decimal    ✓    ✓
AttributeType.Owner    Guid    ✓
AttributeType.Picklist    Int32    ✓    ✓
AttributeType.Uniqueidentifier    Guid    ✓    ✓
AttributeType.String    String    ✓    ✓
AttributeType.State    Int32    ✓    ✓
AttributeType.Status    Int32    ✓    ✓
NOTE
The Dynamics data types AttributeType.CalendarRules and AttributeType.PartyList aren't supported.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Dynamics AX by using Azure Data
Factory (Preview)
3/6/2019 • 3 minutes to read
This article outlines how to use Copy Activity in Azure Data Factory to copy data from a Dynamics AX source. The
article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.
Supported capabilities
You can copy data from Dynamics AX to any supported sink data store. For a list of data stores that Copy Activity
supports as sources and sinks, see Supported data stores and formats.
Specifically, this Dynamics AX connector supports copying data from Dynamics AX using OData protocol with
Service Principal authentication.
TIP
You can also use this connector to copy data from Dynamics 365 Finance and Operations. Refer to Dynamics 365's
OData support and Authentication method.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to Dynamics AX connector.
Prerequisites
To use service principal authentication, follow these steps:
1. Register an application entity in Azure Active Directory (Azure AD ) by following Register your application
with an Azure AD tenant. Make note of the following values, which you use to define the linked service:
Application ID
Application key
Tenant ID
2. Go to Dynamics AX, and grant this service principal proper permission to access your Dynamics AX.
Linked service properties
The following properties are supported for Dynamics AX linked service:
Example
{
"name": "DynamicsAXLinkedService",
"properties": {
"type": "DynamicsAX",
"typeProperties": {
"url": "<Dynamics AX instance OData endpoint>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"aadResourceId": "<AAD resource, e.g. https://sampledynamics.sandbox.operations.dynamics.com>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
Dataset properties
This section provides a list of properties that the Dynamics AX dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
To copy data from Dynamics AX, set the type property of the dataset to DynamicsAXResource. The following
properties are supported:
Example
{
"name": "DynamicsAXResourceDataset",
"properties": {
"type": "DynamicsAXResource",
"typeProperties": {
"path": "<entity path e.g. dd04tentitySet>"
},
"linkedServiceName": {
"referenceName": "<Dynamics AX linked service name>",
"type": "LinkedServiceReference"
}
}
}
Example
"activities":[
{
"name": "CopyFromDynamicsAX",
"type": "Copy",
"inputs": [
{
"referenceName": "<Dynamics AX input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DynamicsAXSource",
"query": "$top=10"
},
"sink": {
"type": "<sink type>"
}
}
}
]
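The "query" property of DynamicsAXSource accepts OData query options. As a sketch, you can combine options such as $select and $top; the field names here are hypothetical and depend on your entity:
"source": {
    "type": "DynamicsAXSource",
    "query": "$select=SalesOrderNumber,OrderDate&$top=100"
}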
Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from and to Dynamics 365 (Common
Data Service) or Dynamics CRM by using Azure
Data Factory
5/29/2019 • 9 minutes to read
This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Microsoft
Dynamics 365 or Microsoft Dynamics CRM. It builds on the Copy Activity overview article that presents a
general overview of Copy Activity.
Supported capabilities
You can copy data from Dynamics 365 (Common Data Service) or Dynamics CRM to any supported sink data
store. You also can copy data from any supported source data store to Dynamics 365 (Common Data Service)
or Dynamics CRM. For a list of data stores supported as sources or sinks by the copy activity, see the Supported
data stores table.
This Dynamics connector supports the following Dynamics versions and authentication types. (IFD is short for
internet-facing deployment.)
Dynamics 365 online and Dynamics CRM Online, with Office365 authentication
Dynamics 365 on-premises with IFD, Dynamics CRM 2016 on-premises with IFD, and Dynamics CRM 2015 on-premises with IFD, with IFD authentication
For Dynamics 365 specifically, the following application types are supported:
Dynamics 365 for Sales
Dynamics 365 for Customer Service
Dynamics 365 for Field Service
Dynamics 365 for Project Service Automation
Dynamics 365 for Marketing
Other application types, such as Finance and Operations and Talent, are not supported by this connector.
TIP
To copy data from Dynamics 365 Finance and Operations, you can use the Dynamics AX connector.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Dynamics.
connectVia    The integration runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime.    Required: No for source; Yes for sink if the source linked service doesn't have an integration runtime
NOTE
The Dynamics connector formerly used the optional "organizationName" property to identify your Dynamics CRM/365 Online instance. While that property still works, we suggest that you specify the new "serviceUri" property instead to gain better performance for instance discovery.
Example: Dynamics online using Office365 authentication
{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"description": "Dynamics online linked service using Office365 authentication",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://adfdynamics.crm.dynamics.com",
"authenticationType": "Office365",
"username": "test@contoso.onmicrosoft.com",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
connectVia    The integration runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime.    Required: No for source, Yes for sink
{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"description": "Dynamics on-premises with IFD linked service using IFD authentication",
"typeProperties": {
"deploymentType": "OnPremisesWithIFD",
"hostName": "contosodynamicsserver.contoso.com",
"port": 443,
"organizationName": "admsDynamicsTest",
"authenticationType": "Ifd",
"username": "test@contoso.onmicrosoft.com",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by Dynamics dataset.
To copy data from and to Dynamics, set the type property of the dataset to DynamicsEntity. The following
properties are supported.
entityName    The logical name of the entity to retrieve.    Required: No for source (if "query" in the activity source is specified), Yes for sink
IMPORTANT
When you copy data from Dynamics, the "structure" section is optional but highly recommanded in the Dynamics
dataset to ensure a deterministic copy result. It defines the column name and data type for Dynamics data that you
want to copy over. To learn more, see Dataset structure and Data type mapping for Dynamics.
When importing schema in authoring UI, ADF infer the schema by sampling the top rows from the Dynamics query
result to initialize the structure construction, in which case columns with no values will be omitted. The same behavior
applies to copy executions if there is no explicit structure definition. You can review and add more columns into the
Dynamics dataset schema/structure as needed, which will be honored during copy runtime.
When you copy data to Dynamics, the "structure" section is optional in the Dynamics dataset. Which columns to copy
into is determined by the source data schema. If your source is a CSV file without a header, in the input dataset, specify
the "structure" with the column name and data type. They map to fields in the CSV file one by one in order.
Example:
{
"name": "DynamicsDataset",
"properties": {
"type": "DynamicsEntity",
"structure": [
{
"name": "accountid",
"type": "Guid"
},
{
"name": "name",
"type": "String"
},
{
"name": "marketingonly",
"type": "Boolean"
},
{
"name": "modifiedon",
"type": "Datetime"
}
],
"typeProperties": {
"entityName": "account"
},
"linkedServiceName": {
"referenceName": "<Dynamics linked service name>",
"type": "linkedservicereference"
}
}
}
NOTE
The PK column will always be copied out even if the column projection you configure in the FetchXML query doesn't
contain it.
Example:
"activities":[
{
"name": "CopyFromDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<Dynamics input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DynamicsSource",
"query": "<FetchXML Query>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
NOTE
The default value of the sink "writeBatchSize" and the copy activity "parallelCopies" for the Dynamics sink are both 10.
Therefore, 100 records are submitted to Dynamics concurrently.
For Dynamics 365 online, there is a limit of 2 concurrent batch calls per organization. If that limit is exceeded, a "Server Busy" fault is thrown before the first request is ever executed. Keeping "writeBatchSize" less than or equal to 10 avoids such throttling of concurrent calls.
The optimal combination of "writeBatchSize" and "parallelCopies" depends on the schema of your entity: for example, the number of columns, the row size, and the number of plugins, workflows, or workflow activities hooked up to those calls. The default setting of writeBatchSize 10 * parallelCopies 10 is the recommendation from the Dynamics service; it works for most Dynamics entities, though it might not give the best performance. You can tune the performance by adjusting the combination in your copy activity settings.
Example:
"activities":[
{
"name": "CopyToDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Dynamics output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "DynamicsSink",
"writeBehavior": "Upsert",
"writeBatchSize": 10,
"ignoreNullValues": true
}
}
}
]
Data type mapping for Dynamics
When copying data from Dynamics, the following mappings are used from Dynamics data types to Azure Data Factory interim data types.
DYNAMICS DATA TYPE    DATA FACTORY INTERIM DATA TYPE    SUPPORTED AS SOURCE    SUPPORTED AS SINK
AttributeTypeCode.BigInt    Long    ✓    ✓
AttributeTypeCode.Boolean    Boolean    ✓    ✓
AttributeType.Customer    Guid    ✓
AttributeType.DateTime    Datetime    ✓    ✓
AttributeType.Decimal    Decimal    ✓    ✓
AttributeType.Double    Double    ✓    ✓
AttributeType.EntityName    String    ✓    ✓
AttributeType.Integer    Int32    ✓    ✓
AttributeType.ManagedProperty    Boolean    ✓
AttributeType.Memo    String    ✓    ✓
AttributeType.Money    Decimal    ✓    ✓
AttributeType.Owner    Guid    ✓
AttributeType.Picklist    Int32    ✓    ✓
AttributeType.Uniqueidentifier    Guid    ✓    ✓
AttributeType.String    String    ✓    ✓
AttributeType.State    Int32    ✓    ✓
AttributeType.Status    Int32    ✓    ✓
NOTE
The Dynamics data types AttributeType.CalendarRules and AttributeType.PartyList aren't supported.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data to or from a file system by using Azure
Data Factory
5/6/2019 • 15 minutes to read
This article outlines how to copy data to and from file system. To learn about Azure Data Factory, read the
introductory article.
Supported capabilities
This file system connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Specifically, this file system connector supports:
Copying files from/to local machine or network file share. To use a Linux file share, install Samba on your
Linux server.
Copying files using Windows authentication.
Copying files as-is or parsing/generating files with the supported file formats and compression codecs.
Prerequisites
To copy data from/to a file system that is not publicly accessible, you need to set up a Self-hosted Integration
Runtime. See Self-hosted Integration Runtime article for details.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to file
system.
NOTE
When authoring via the UI, you don't need to input a double backslash ( \\ ) to escape as you do via JSON; specify a single backslash.
Example:
{
"name": "FileLinkedService",
"properties": {
"type": "FileServer",
"typeProperties": {
"host": "<host>",
"userid": "<domain>\\<user>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data to and from file system in Parquet or delimited text format, refer to Parquet format and
Delimited text format article on format-based dataset and supported settings. The following properties are
supported for file system under location settings in format-based dataset:
NOTE
The FileShare type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the Copy/Lookup/GetMetadata activities for backward compatibility, but it doesn't work with Mapping Data Flow. We suggest that you use this new model going forward; the ADF authoring UI has switched to generating these new types.
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<File system linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "FileServerLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
format    If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions.    Required: No (only for binary copy scenario)
TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.
NOTE
If you were using "fileFilter" property for file filter, it is still supported as-is, while you are suggested to use the new filter
capability added to "fileName" going forward.
Example:
{
"name": "FileSystemDataset",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<file system linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
wildcardFileName    The file name with wildcard characters under the given folderPath/wildcardFolderPath to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual file name has a wildcard or this escape character inside. See more examples in Folder and file filter examples.    Required: Yes if fileName is not specified in the dataset
Example:
"activities":[
{
"name": "CopyFromFileSystem",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "FileServerReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example:
"activities":[
{
"name": "CopyFromFileSystem",
"type": "Copy",
"inputs": [
{
"referenceName": "<file system input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
NOTE
For Parquet/delimited text format, the FileSystemSink type copy activity sink mentioned in the next section is still supported as-is for backward compatibility. We suggest that you use this new model going forward; the ADF authoring UI has switched to generating these new types.
Example:
"activities":[
{
"name": "CopyToFileSystem",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "FileServerWriteSetting",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]
Example:
"activities":[
{
"name": "CopyToFileSystem",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<file system output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "FileSystemSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from FTP server by using Azure Data
Factory
5/6/2019 • 9 minutes to read
This article outlines how to copy data from FTP server. To learn about Azure Data Factory, read the introductory
article.
Supported capabilities
This FTP connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Specifically, this FTP connector supports:
Copying files using Basic or Anonymous authentication.
Copying files as-is or parsing files with the supported file formats and compression codecs.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
FTP.
NOTE
The FTP connector supports accessing FTP server with either no encryption or explicit SSL/TLS encryption; it doesn’t
support implicit SSL/TLS encryption.
{
"name": "FTPLinkedService",
"properties": {
"type": "FtpServer",
"typeProperties": {
"host": "<ftp server>",
"port": 21,
"enableSsl": true,
"enableServerCertificateValidation": true,
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
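If the FTP server permits anonymous logins, a minimal sketch of the linked service can use Anonymous authentication and omit the credential properties (an assumption shown for illustration; keep enableSsl settings as your server requires):
{
    "name": "FTPLinkedService",
    "properties": {
        "type": "FtpServer",
        "typeProperties": {
            "host": "<ftp server>",
            "port": 21,
            "enableSsl": true,
            "enableServerCertificateValidation": true,
            "authenticationType": "Anonymous"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}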
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data from FTP in Parquet or delimited text format, refer to Parquet format and Delimited text format
article on format-based dataset and supported settings. The following properties are supported for FTP under
location settings in format-based dataset:
NOTE
The FileShare type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the Copy/Lookup/GetMetadata activities for backward compatibility. We suggest that you use this new model going forward; the ADF authoring UI has switched to generating these new types.
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<FTP linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "FtpServerLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
format    If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions.    Required: No (only for binary copy scenario)
NOTE
If you were using "fileFilter" property for file filter, it is still supported as-is, while you are suggested to use the new filter
capability added to "fileName" going forward.
Example:
{
"name": "FTPDataset",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<FTP linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "myfile.csv.gz",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
wildcardFileName    The file name with wildcard characters under the given folderPath/wildcardFolderPath to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual file name has a wildcard or this escape character inside. See more examples in Folder and file filter examples.    Required: Yes if fileName is not specified in the dataset
NOTE
For Parquet/delimited text format, the FileSystemSource type copy activity source mentioned in the next section is still supported as-is for backward compatibility. We suggest that you use this new model going forward; the ADF authoring UI has switched to generating these new types.
Example:
"activities":[
{
"name": "CopyFromFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "FtpReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example:
"activities":[
{
"name": "CopyFromFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<FTP input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Google AdWords using Azure Data
Factory (Preview)
2/1/2019 • 4 minutes to read
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Google AdWords. It
builds on the copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and provide feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Google AdWords to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any driver to use this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Google AdWords connector.
Example:
{
"name": "GoogleAdWordsLinkedService",
"properties": {
"type": "GoogleAdWords",
"typeProperties": {
"clientCustomerID" : "<clientCustomerID>",
"developerToken": {
"type": "SecureString",
"value": "<developerToken>"
},
"authenticationType" : "ServiceAuthentication",
"refreshToken": {
"type": "SecureString",
"value": "<refreshToken>"
},
"clientId": {
"type": "SecureString",
"value": "<clientId>"
},
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"email" : "<email>",
"keyFilePath" : "<keyFilePath>",
"trustedCertPath" : "<trustedCertPath>",
"useSystemTrustStore" : true,
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Google AdWords dataset.
To copy data from Google AdWords, set the type property of the dataset to GoogleAdWordsObject. The
following properties are supported:
Example
{
"name": "GoogleAdWordsDataset",
"properties": {
"type": "GoogleAdWordsObject",
"linkedServiceName": {
"referenceName": "<GoogleAdWords linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query    Use the custom SQL query to read data. For example: "SELECT * FROM MyTable".    Required: No (if "tableName" in dataset is specified)
Example:
"activities":[
{
"name": "CopyFromGoogleAdWords",
"type": "Copy",
"inputs": [
{
"referenceName": "<GoogleAdWords input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "GoogleAdWordsSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Google BigQuery by using Azure
Data Factory
1/30/2019 • 5 minutes to read
This article outlines how to use Copy Activity in Azure Data Factory to copy data from Google BigQuery. It builds
on the Copy Activity overview article that presents a general overview of the copy activity.
Supported capabilities
You can copy data from Google BigQuery to any supported sink data store. For a list of data stores that are
supported as sources or sinks by the copy activity, see the Supported data stores table.
Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install a driver
to use this connector.
NOTE
This Google BigQuery connector is built on top of the BigQuery APIs. Be aware that BigQuery limits the maximum rate of incoming requests and enforces appropriate quotas on a per-project basis; refer to Quotas & Limits - API requests. Make sure you do not trigger too many concurrent requests to the account.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to the
Google BigQuery connector.
Example:
{
"name": "GoogleBigQueryLinkedService",
"properties": {
"type": "GoogleBigQuery",
"typeProperties": {
"project" : "<project ID>",
"additionalProjects" : "<additional project IDs>",
"requestGoogleDriveScope" : true,
"authenticationType" : "UserAuthentication",
"clientId": "<id of the application used to generate the refresh token>",
"clientSecret": {
"type": "SecureString",
"value":"<secret of the application used to generate the refresh token>"
},
"refreshToken": {
"type": "SecureString",
"value": "<refresh token>"
}
}
}
}
Example:
{
"name": "GoogleBigQueryLinkedService",
"properties": {
"type": "GoogleBigQuery",
"typeProperties": {
"project" : "<project id>",
"requestGoogleDriveScope" : true,
"authenticationType" : "ServiceAuthentication",
"email": "<email>",
"keyFilePath": "<.p12 key path on the IR machine>"
},
"connectVia": {
"referenceName": "<name of Self-hosted Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Google BigQuery dataset.
To copy data from Google BigQuery, set the type property of the dataset to GoogleBigQueryObject. The
following properties are supported:
Example
{
"name": "GoogleBigQueryDataset",
"properties": {
"type": "GoogleBigQueryObject",
"linkedServiceName": {
"referenceName": "<GoogleBigQuery linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. An example is "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromGoogleBigQuery",
"type": "Copy",
"inputs": [
{
"referenceName": "<GoogleBigQuery input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "GoogleBigQuerySource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Google Cloud Storage using Azure
Data Factory
5/6/2019 • 10 minutes to read • Edit Online
This article outlines how to copy data from Google Cloud Storage. To learn about Azure Data Factory, read the
introductory article.
Supported capabilities
This Google Cloud Storage connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Specifically, this Google Cloud Storage connector supports copying files as-is or parsing files with the supported
file formats and compression codecs.
NOTE
Copying data from Google Cloud Storage leverages the Amazon S3 connector with a corresponding custom S3 endpoint, because Google Cloud Storage provides S3-compatible interoperability.
Required permissions
To copy data from Google Cloud Storage, make sure you have been granted the following permissions:
For copy activity execution: s3:GetObject and s3:GetObjectVersion for Object Operations.
For Data Factory GUI authoring: s3:ListAllMyBuckets and s3:ListBucket / s3:GetBucketLocation for Bucket Operations permissions are additionally required for operations like test connection and browsing/navigating file paths. If you don't want to grant these permissions, skip the test connection during linked service creation and specify the path directly in the dataset settings.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Google Cloud Storage.
Linked service properties
The following properties are supported for Google Cloud Storage linked service:
Here is an example:
{
"name": "GoogleCloudStorageLinkedService",
"properties": {
"type": "AmazonS3",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": {
"type": "SecureString",
"value": "<secret access key>"
},
"serviceUrl": "https://storage.googleapis.com"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data from Google Cloud Storage in Parquet or delimited text format, refer to Parquet format and
Delimited text format article on format-based dataset and supported settings. The following properties are
supported for Google Cloud Storage under location settings in format-based dataset:
NOTE
The AmazonS3Object type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the Copy/Lookup/GetMetadata activity for backward compatibility. We recommend that you use this new model going forward; the ADF authoring UI has switched to generating these new types.
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Google Cloud Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AmazonS3Location",
"bucketName": "bucketname",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
bucketName: The S3 bucket name. Wildcard filter is not supported. Required: Yes for Copy/Lookup activity, No for GetMetadata activity.
format: If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. Required: No (only for binary copy scenario).
TIP
To copy all files under a folder, specify bucketName for bucket and prefix for folder part.
To copy a single file with a given name, specify bucketName for bucket and key for folder part plus file name.
To copy a subset of files under a folder, specify bucketName for bucket and key for folder part plus wildcard filter.
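As an illustration of the tip above, a dataset sketch for copying a single object might look like the following; it assumes the Google Cloud Storage dataset takes the same bucketName, key, format, and compression properties as the Amazon S3 dataset it is built on, and the bucket, folder, and file names are placeholders:
{
    "name": "GoogleCloudStorageDataset",
    "properties": {
        "type": "AmazonS3Object",
        "linkedServiceName": {
            "referenceName": "<Google Cloud Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "bucketName": "bucketname",
            "key": "folder/subfolder/myfile.csv.gz",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "rowDelimiter": "\n"
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        }
    }
}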
wildcardFileName: The file name with wildcard characters under the given bucket + folderPath/wildcardFolderPath to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has a wildcard or this escape character inside. See more examples in Folder and file filter examples. Required: Yes if fileName in dataset and prefix are not specified.
NOTE
For Parquet/delimited text format, the FileSystemSource type copy activity source mentioned in the next section is still supported as-is for backward compatibility. We recommend that you use this new model going forward; the ADF authoring UI has switched to generating these new types.
Example:
"activities":[
{
"name": "CopyFromGoogleCloudStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "AmazonS3ReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example:
"activities":[
{
"name": "CopyFromGoogleCloudStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores that are supported as sources and sinks by the copy activity in Azure Data Factory, see
supported data stores.
Copy data from Greenplum using Azure Data
Factory
2/1/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Greenplum. It builds on
the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from Greenplum to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Greenplum connector.
Example:
{
"name": "GreenplumLinkedService",
"properties": {
"type": "Greenplum",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "HOST=<server>;PORT=<port>;DB=<database>;UID=<user name>;PWD=<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "GreenplumLinkedService",
"properties": {
"type": "Greenplum",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "HOST=<server>;PORT=<port>;DB=<database>;UID=<user name>;"
},
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Greenplum dataset.
To copy data from Greenplum, set the type property of the dataset to GreenplumTable. The following properties
are supported:
Example
{
"name": "GreenplumDataset",
"properties": {
"type": "GreenplumTable",
"linkedServiceName": {
"referenceName": "<Greenplum linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromGreenplum",
"type": "Copy",
"inputs": [
{
"referenceName": "<Greenplum input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "GreenplumSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from HBase using Azure Data Factory
3/14/2019 • 4 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from HBase. It builds on the
copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from HBase to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
HBase connector.
NOTE
If your cluster doesn't support sticky sessions (for example, HDInsight), explicitly add the node index at the end of the HTTP path setting. For example, specify /hbaserest0 instead of /hbaserest.
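As a sketch of that scenario, a linked service for HDInsight HBase might look like the following; the cluster name and credentials are placeholders, and it assumes the HDInsight gateway listens over SSL on port 443. The example that follows shows the full set of properties, including the additional TLS-related ones.
{
    "name": "HBaseLinkedService",
    "properties": {
        "type": "HBase",
        "typeProperties": {
            "host" : "<cluster name>.azurehdinsight.net",
            "port" : "443",
            "httpPath" : "/hbaserest0",
            "authenticationType" : "Basic",
            "username" : "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            },
            "enableSsl" : true
        }
    }
}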
{
"name": "HBaseLinkedService",
"properties": {
"type": "HBase",
"typeProperties": {
"host" : "<host e.g. 192.168.222.160>",
"port" : "<port>",
"httpPath" : "<e.g. /gateway/sandbox/hbase/version>",
"authenticationType" : "Basic",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"enableSsl" : true,
"trustedCertPath" : "<trustedCertPath>",
"allowHostNameCNMismatch" : true,
"allowSelfSignedServerCert" : true
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by HBase dataset.
To copy data from HBase, set the type property of the dataset to HBaseObject. The following properties are
supported:
Example
{
"name": "HBaseDataset",
"properties": {
"type": "HBaseObject",
"linkedServiceName": {
"referenceName": "<HBase linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromHBase",
"type": "Copy",
"inputs": [
{
"referenceName": "<HBase input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "HBaseSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from HDFS using Azure Data Factory
5/6/2019 • 15 minutes to read • Edit Online
This article outlines how to copy data from HDFS server. To learn about Azure Data Factory, read the
introductory article.
Supported capabilities
This HDFS connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
Specifically, this HDFS connector supports:
Copying files using Windows (Kerberos) or Anonymous authentication.
Copying files using webhdfs protocol or built-in DistCp support.
Copying files as-is or parsing/generating files with the supported file formats and compression codecs.
Prerequisites
To copy data from an HDFS that is not publicly accessible, you need to set up a Self-hosted Integration Runtime.
See Self-hosted Integration Runtime article to learn details.
NOTE
Make sure the Integration Runtime can access all the [name node server]:[name node port] and [data node servers]:[data node port] addresses of the Hadoop cluster. The default [name node port] is 50070, and the default [data node port] is 50075.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
HDFS.
{
"name": "HDFSLinkedService",
"properties": {
"type": "Hdfs",
"typeProperties": {
"url" : "http://<machine>:50070/webhdfs/v1/",
"authenticationType": "Anonymous",
"userName": "hadoop"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
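For the Windows (Kerberos) authentication mentioned earlier, a minimal sketch might look like the following; it assumes the authenticationType value is Windows and that the user name takes the <username>@<domain>.com form, with the password supplied as a SecureString:
{
    "name": "HDFSLinkedService",
    "properties": {
        "type": "Hdfs",
        "typeProperties": {
            "url" : "http://<machine>:50070/webhdfs/v1/",
            "authenticationType": "Windows",
            "userName": "<username>@<domain>.com",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}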
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data from HDFS in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based dataset and supported settings. The following properties are supported for HDFS
under location settings in format-based dataset:
NOTE
The FileShare type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the Copy/Lookup activity for backward compatibility. We recommend that you use this new model going forward; the ADF authoring UI has switched to generating these new types.
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<HDFS linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "HdfsLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
format: If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. Required: No (only for binary copy scenario).
TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.
Example:
{
"name": "HDFSDataset",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<HDFS linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
Copy activity properties
For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by HDFS source.
HDFS as source
For copy from Parquet and delimited text format, refer to Parquet and delimited text format source
section.
For copy from other formats like ORC/Avro/JSON/Binary format, refer to Other format source section.
Parquet and delimited text format source
To copy data from HDFS in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based copy activity source and supported settings. The following properties are
supported for HDFS under storeSettings settings in format-based copy source:
wildcardFileName: The file name with wildcard characters under the given folderPath/wildcardFolderPath to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has a wildcard or this escape character inside. See more examples in Folder and file filter examples. Required: Yes if fileName is not specified in dataset.
NOTE
For Parquet/delimited text format, the FileSystemSource type copy activity source mentioned in the next section is still supported as-is for backward compatibility. We recommend that you use this new model going forward; the ADF authoring UI has switched to generating these new types.
Example:
"activities":[
{
"name": "CopyFromHDFS",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "HdfsReadSetting",
"recursive": true
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
"source": {
"type": "HdfsSource",
"distcpSettings": {
"resourceManagerEndpoint": "resourcemanagerendpoint:8088",
"tempScriptPath": "/usr/hadoop/tempscript",
"distcpOptions": "-m 100"
}
}
Learn more about how to use DistCp to copy data from HDFS efficiently in the next section.
Folder and file filter examples
This section describes the resulting behavior of the folder path and file name with wildcard filters.
C:> Ksetup
default realm = REALM.COM (external)
REALM.com:
kdc = <your_kdc_server_address>
NOTE
Replace REALM.COM and AD.COM in the following tutorial with your own respective realm and domain controller as
needed.
On KDC server:
1. Edit the KDC configuration in krb5.conf file to let KDC trust Windows Domain referring to the following
configuration template. By default, the configuration is located at /etc/krb5.conf.
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
[libdefaults]
default_realm = REALM.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
[realms]
REALM.COM = {
kdc = node.REALM.COM
admin_server = node.REALM.COM
}
AD.COM = {
kdc = windc.ad.com
admin_server = windc.ad.com
}
[domain_realm]
.REALM.COM = REALM.COM
REALM.COM = REALM.COM
.ad.com = AD.COM
ad.com = AD.COM
[capaths]
AD.COM = {
REALM.COM = .
}
On domain controller:
1. Run the Ksetup command to add a realm entry.
2. Establish trust from the Windows domain to the Kerberos realm. [password] is the password for the principal krbtgt/REALM.COM@AD.COM.
3. Use the Ksetup command to specify the encryption algorithm to be used on the specific realm.
4. Create the mapping between the domain account and the Kerberos principal, so that you can use the Kerberos principal in the Windows domain.
a. Start Administrative tools > Active Directory Users and Computers.
b. Configure advanced features by clicking View > Advanced Features.
c. Locate the account to which you want to create mappings, right-click it to view Name Mappings, and then click the Kerberos Names tab.
d. Add a principal from the realm.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Hive using Azure Data Factory
2/1/2019 • 4 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Hive. It builds on the
copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from Hive to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Hive connector.
port: The TCP port that the Hive server uses to listen for client connections. If you connect to Azure HDInsight, specify port as 443. Required: Yes.
Example:
{
"name": "HiveLinkedService",
"properties": {
"type": "Hive",
"typeProperties": {
"host" : "<cluster>.azurehdinsight.net",
"port" : "<port>",
"authenticationType" : "WindowsAzureHDInsightService",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Hive dataset.
To copy data from Hive, set the type property of the dataset to HiveObject. The following properties are
supported:
Example
{
"name": "HiveDataset",
"properties": {
"type": "HiveObject",
"linkedServiceName": {
"referenceName": "<Hive linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
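If you prefer to point the dataset at a table instead of supplying a query in the copy activity, a sketch using the tableName property might look like this; the property name is assumed from the "tableName" option referenced in the source settings below:
{
    "name": "HiveDataset",
    "properties": {
        "type": "HiveObject",
        "linkedServiceName": {
            "referenceName": "<Hive linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "tableName": "<table name>"
        }
    }
}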
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromHive",
"type": "Copy",
"inputs": [
{
"referenceName": "<Hive input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "HiveSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from an HTTP endpoint by using Azure
Data Factory
5/6/2019 • 9 minutes to read • Edit Online
This article outlines how to use Copy Activity in Azure Data Factory to copy data from an HTTP endpoint. The
article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.
The differences among this HTTP connector, the REST connector, and the Web table connector are:
The REST connector specifically supports copying data from RESTful APIs.
The HTTP connector is generic for retrieving data from any HTTP endpoint, for example, to download a file. Before the REST connector became available, you may have used the HTTP connector to copy data from RESTful APIs, which is supported but less functional compared with the REST connector.
The Web table connector extracts table content from an HTML webpage.
Supported capabilities
You can copy data from an HTTP source to any supported sink data store. For a list of data stores that Copy
Activity supports as sources and sinks, see Supported data stores and formats.
You can use this HTTP connector to:
Retrieve data from an HTTP/S endpoint by using the HTTP GET or POST methods.
Retrieve data by using one of the following authentications: Anonymous, Basic, Digest, Windows, or
ClientCertificate.
Copy the HTTP response as-is or parse it by using supported file formats and compression codecs.
TIP
To test an HTTP request for data retrieval before you configure the HTTP connector in Data Factory, learn about the API
specification for header and body requirements. You can use tools like Postman or a web browser to validate.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to the HTTP connector.
Linked service properties
The following properties are supported for the HTTP linked service:
Example
{
"name": "HttpLinkedService",
"properties": {
"type": "HttpServer",
"typeProperties": {
"authenticationType": "Basic",
"url" : "<HTTP endpoint>",
"userName": "<user name>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
If you use certThumbprint for authentication and the certificate is installed in the personal store of the local
computer, grant read permissions to the self-hosted Integration Runtime:
1. Open the Microsoft Management Console (MMC). Add the Certificates snap-in that targets Local Computer.
2. Expand Certificates > Personal, and then select Certificates.
3. Right-click the certificate from the personal store, and then select All Tasks > Manage Private Keys.
4. On the Security tab, add the user account under which the Integration Runtime Host Service
(DIAHostService) is running, with read access to the certificate.
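As a sketch, a linked service that references the installed certificate by certThumbprint (rather than embedding it) might look like the following; the endpoint and thumbprint values are placeholders. The example that follows instead supplies the certificate inline through embeddedCertData and its password.
{
    "name": "HttpLinkedService",
    "properties": {
        "type": "HttpServer",
        "typeProperties": {
            "authenticationType": "ClientCertificate",
            "url": "<HTTP endpoint>",
            "certThumbprint": "<thumbprint of the certificate installed on the Self-hosted Integration Runtime machine>"
        },
        "connectVia": {
            "referenceName": "<name of Self-hosted Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}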
{
"name": "HttpLinkedService",
"properties": {
"type": "HttpServer",
"typeProperties": {
"authenticationType": "ClientCertificate",
"url": "<HTTP endpoint>",
"embeddedCertData": "<Base64-encoded cert data>",
"password": {
"type": "SecureString",
"value": "password of cert"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data from HTTP in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based dataset and supported settings. The following properties are supported for HTTP
under location settings in format-based dataset:
NOTE
The HttpFile type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the Copy/Lookup activity for backward compatibility. We recommend that you use this new model going forward; the ADF authoring UI has switched to generating these new types.
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<HTTP linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "HttpServerLocation",
"relativeUrl": "<relative url>"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
NOTE
The supported HTTP request payload size is around 500 KB. If the payload size you want to pass to your web endpoint is
larger than 500 KB, consider batching the payload in smaller chunks.
{
"name": "HttpSourceDataInput",
"properties": {
"type": "HttpFile",
"linkedServiceName": {
"referenceName": "<HTTP linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"relativeUrl": "<relative url>",
"additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n"
}
}
}
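Where the endpoint expects a POST, a dataset sketch adding requestMethod and requestBody might look like the following; the property names are assumed to follow the HttpFile dataset shown above, the header value is illustrative, and the body is subject to the payload size note earlier:
{
    "name": "HttpSourceDataInput",
    "properties": {
        "type": "HttpFile",
        "linkedServiceName": {
            "referenceName": "<HTTP linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "relativeUrl": "<relative url>",
            "requestMethod": "Post",
            "requestBody": "<body for POST HTTP request>",
            "additionalHeaders": "Content-Type: application/json\n"
        }
    }
}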
NOTE
For Parquet/delimited text format, the HttpSource type copy activity source mentioned in the next section is still supported as-is for backward compatibility. We recommend that you use this new model going forward; the ADF authoring UI has switched to generating these new types.
Example:
"activities":[
{
"name": "CopyFromHTTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "HttpReadSetting",
"requestMethod": "Post",
"additionalHeaders": "<header key: header value>\n<header key: header value>\n",
"requestBody": "<body for POST HTTP request>"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example
"activities":[
{
"name": "CopyFromHTTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<HTTP input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "HttpSource",
"httpRequestTimeout": "00:01:00"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from HubSpot using Azure Data Factory
(Preview)
2/1/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from HubSpot. It builds on
the copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from HubSpot to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
HubSpot connector.
Example:
{
"name": "HubspotLinkedService",
"properties": {
"type": "Hubspot",
"typeProperties": {
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"accessToken": {
"type": "SecureString",
"value": "<accessToken>"
},
"refreshToken": {
"type": "SecureString",
"value": "<refreshToken>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by HubSpot dataset.
To copy data from HubSpot, set the type property of the dataset to HubspotObject. The following properties are
supported:
Example
{
"name": "HubspotDataset",
"properties": {
"type": "HubspotObject",
"linkedServiceName": {
"referenceName": "<Hubspot linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Companies where Company_Id = xxx". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromHubspot",
"type": "Copy",
"inputs": [
{
"referenceName": "<Hubspot input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "HubspotSource",
"query": "SELECT * FROM Companies where Company_Id = xxx"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Impala by using Azure Data Factory
(Preview)
2/1/2019 • 4 minutes to read • Edit Online
This article outlines how to use Copy Activity in Azure Data Factory to copy data from Impala. It builds on the
Copy Activity overview article that presents a general overview of the copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and provide feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Impala to any supported sink data store. For a list of data stores that are supported as
sources or sinks by the copy activity, see the Supported data stores table.
Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install a driver
to use this connector.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to the
Impala connector.
{
"name": "ImpalaLinkedService",
"properties": {
"type": "Impala",
"typeProperties": {
"host" : "<host>",
"port" : "<port>",
"authenticationType" : "UsernameAndPassword",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Impala dataset.
To copy data from Impala, set the type property of the dataset to ImpalaObject. The following properties are
supported:
Example
{
"name": "ImpalaDataset",
"properties": {
"type": "ImpalaObject",
"linkedServiceName": {
"referenceName": "<Impala linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. An example is "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromImpala",
"type": "Copy",
"inputs": [
{
"referenceName": "<Impala input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ImpalaSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from and to ODBC data stores using
Azure Data Factory
3/18/2019 • 8 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from and to an ODBC data
store. It builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from ODBC source to any supported sink data store, or copy from any supported source data
store to ODBC sink. For a list of data stores that are supported as sources/sinks by the copy activity, see the
Supported data stores table.
Specifically, this ODBC connector supports copying data from/to any ODBC-compatible data stores using Basic or Anonymous authentication.
Prerequisites
To use this ODBC connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the ODBC driver for the data store on the Integration Runtime machine.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-step
instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
ODBC connector.
{
"name": "ODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "<connection string>"
},
"authenticationType": "Anonymous",
"credential": {
"type": "SecureString",
"value": "RefreshToken=<secret refresh token>;"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
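For Basic authentication (the other option noted above), a minimal sketch replaces the credential with a user name and password; the property names mirror the SAP HANA example later in this article:
{
    "name": "ODBCLinkedService",
    "properties": {
        "type": "Odbc",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "<connection string>"
            },
            "authenticationType": "Basic",
            "userName": "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}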
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section provides
a list of properties supported by ODBC dataset.
To copy data from/to an ODBC-compatible data store, set the type property of the dataset to RelationalTable. The following properties are supported:
tableName: Name of the table in the ODBC data store. Required: No for source (if "query" in activity source is specified); Yes for sink.
Example
{
"name": "ODBCDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<ODBC linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<table name>"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromODBC",
"type": "Copy",
"inputs": [
{
"referenceName": "<ODBC input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
ODBC as sink
To copy data to an ODBC-compatible data store, set the sink type in the copy activity to OdbcSink. The following properties are supported in the copy activity sink section:
writeBatchSize: Inserts data into the SQL table when the buffer size reaches writeBatchSize. Allowed values are: integer (number of rows). Required: No (default is 0 - auto detected).
NOTE
For "writeBatchSize", if it's not set (auto-detected), copy activity first detects whether the driver supports batch operations,
and set it to 10000 if it does, or set it to 1 if it doesn’t. If you explicitly set the value other than 0, copy activity honors the
value and fails at runtime if the driver doesn’t support batch operations.
Example:
"activities":[
{
"name": "CopyToODBC",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<ODBC output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "OdbcSink",
"writeBatchSize": 100000
}
}
}
]
Read the article from the beginning for a detailed overview of using ODBC data stores as source/sink data stores in
a copy operation.
You can copy data to an SAP HANA database by using the generic ODBC connector.
Set up a Self-hosted Integration Runtime on a machine with access to your data store. The Integration Runtime uses the ODBC driver for SAP HANA to connect to the data store. Therefore, install the driver if it is not already installed on the same machine. See the Prerequisites section for details.
Before you use the SAP HANA sink in a Data Factory solution, verify whether the Integration Runtime can connect to the data store by using the instructions in the Troubleshoot connectivity issues section.
Create an ODBC linked service to link an SAP HANA data store to an Azure data factory as shown in the following example:
{
"name": "SAPHANAViaODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Driver={HDBODBC};servernode=<HANA server>.clouddatahub-int.net:30015"
},
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Read the article from the beginning for a detailed overview of using ODBC data stores as source/sink data stores in
a copy operation.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Jira using Azure Data Factory
(Preview)
2/1/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Jira. It builds on the
copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Jira to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to Jira
connector.
Example:
{
"name": "JiraLinkedService",
"properties": {
"type": "Jira",
"typeProperties": {
"host" : "<host>",
"port" : "<port>",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Jira dataset.
To copy data from Jira, set the type property of the dataset to JiraObject. The following properties are supported:
Example
{
"name": "JiraDataset",
"properties": {
"type": "JiraObject",
"linkedServiceName": {
"referenceName": "<Jira linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromJira",
"type": "Copy",
"inputs": [
{
"referenceName": "<Jira input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "JiraSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Magento using Azure Data Factory
(Preview)
2/1/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Magento. It builds on
the copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Magento to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Magento connector.
Example:
{
"name": "MagentoLinkedService",
"properties": {
"type": "Magento",
"typeProperties": {
"host" : "192.168.222.110/magento3",
"accessToken": {
"type": "SecureString",
"value": "<accessToken>"
},
"useEncryptedEndpoints" : true,
"useHostVerification" : true,
"usePeerVerification" : true
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Magento dataset.
To copy data from Magento, set the type property of the dataset to MagentoObject. The following properties are
supported:
Example
{
"name": "MagentoDataset",
"properties": {
"type": "MagentoObject",
"linkedServiceName": {
"referenceName": "<Magento linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Customers". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromMagento",
"type": "Copy",
"inputs": [
{
"referenceName": "<Magento input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MagentoSource",
"query": "SELECT * FROM Customers where Id > XXX"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from MariaDB using Azure Data Factory
2/1/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from MariaDB. It builds on
the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from MariaDB to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
This connector currently supports MariaDB versions 10.0 to 10.2.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MariaDB connector.
Example:
{
"name": "MariaDBLinkedService",
"properties": {
"type": "MariaDB",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<host>;Port=<port>;Database=<database>;UID=<user name>;PWD=<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "MariaDBLinkedService",
"properties": {
"type": "MariaDB",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<host>;Port=<port>;Database=<database>;UID=<user name>;"
},
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by MariaDB dataset.
To copy data from MariaDB, set the type property of the dataset to MariaDBTable. There is no additional type-
specific property in this type of dataset.
Example
{
"name": "MariaDBDataset",
"properties": {
"type": "MariaDBTable",
"linkedServiceName": {
"referenceName": "<MariaDB linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromMariaDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<MariaDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MariaDBSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Marketo using Azure Data Factory
(Preview)
4/22/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Marketo. It builds on
the copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Marketo to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
NOTE
This Marketo connector is built on top of the Marketo REST API. Be aware that Marketo has a concurrent request limit on the service side. If you hit errors such as "Error while attempting to use REST API: Max rate limit '100' exceeded with in '20' secs (606)" or "Error while attempting to use REST API: Concurrent access limit '10' reached (615)", consider reducing the number of concurrent copy activity runs to reduce the number of requests to the service.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Marketo connector.
Example:
{
"name": "MarketoLinkedService",
"properties": {
"type": "Marketo",
"typeProperties": {
"endpoint" : "123-ABC-321.mktorest.com",
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Marketo dataset.
To copy data from Marketo, set the type property of the dataset to MarketoObject. The following properties are
supported:
Example
{
"name": "MarketoDataset",
"properties": {
"type": "MarketoObject",
"linkedServiceName": {
"referenceName": "<Marketo linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Activitiy_Types". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromMarketo",
"type": "Copy",
"inputs": [
{
"referenceName": "<Marketo input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MarketoSource",
"query": "SELECT top 1000 * FROM Activitiy_Types"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to ODBC data stores using
Azure Data Factory
3/18/2019 • 8 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from and to an ODBC data
store. It builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from ODBC source to any supported sink data store, or copy from any supported source data
store to ODBC sink. For a list of data stores that are supported as sources/sinks by the copy activity, see the
Supported data stores table.
Specifically, this ODBC connector supports copying data from/to any ODBC-compatible data stores using Basic or Anonymous authentication.
Prerequisites
To use this ODBC connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the ODBC driver for the data store on the Integration Runtime machine.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-step
instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
ODBC connector.
{
"name": "ODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "<connection string>"
},
"authenticationType": "Anonymous",
"credential": {
"type": "SecureString",
"value": "RefreshToken=<secret refresh token>;"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section provides
a list of properties supported by ODBC dataset.
To copy data from/to ODBC -compatible data store, set the type property of the dataset to RelationalTable. The
following properties are supported:
tableName Name of the table in the ODBC data No for source (if "query" in activity
store. source is specified);
Yes for sink
Example
{
"name": "ODBCDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<ODBC linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<table name>"
}
}
}
query Use the custom SQL query to read data. No (if "tableName" in dataset is
For example: specified)
"SELECT * FROM MyTable" .
Example:
"activities":[
{
"name": "CopyFromODBC",
"type": "Copy",
"inputs": [
{
"referenceName": "<ODBC input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
ODBC as sink
To copy data to ODBC -compatible data store, set the sink type in the copy activity to OdbcSink. The following
properties are supported in the copy activity sink section:
writeBatchSize Inserts data into the SQL table when No (default is 0 - auto detected)
the buffer size reaches writeBatchSize.
Allowed values are: integer (number of
rows).
NOTE
For "writeBatchSize", if it's not set (auto-detected), copy activity first detects whether the driver supports batch operations,
and set it to 10000 if it does, or set it to 1 if it doesn’t. If you explicitly set the value other than 0, copy activity honors the
value and fails at runtime if the driver doesn’t support batch operations.
Example:
"activities":[
{
"name": "CopyToODBC",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<ODBC output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "OdbcSink",
"writeBatchSize": 100000
}
}
}
]
Read the article from the beginning for a detailed overview of using ODBC data stores as source/sink data stores in
a copy operation.
Read the article from the beginning for a detailed overview of using ODBC data stores as source/sink data stores in
a copy operation.
You can copy data to SAP HANA database using the generic ODBC connector.
Set up a Self-hosted Integration Runtime on a machine with access to your data store. The Integration Runtime
uses the ODBC driver for SAP HANA to connect to the data store. Therefore, install the driver if it is not already
installed on the same machine. See Prerequisites section for details.
Before you use the SAP HANA sink in a Data Factory solution, verify whether the Integration Runtime can connect
to the data store using instructions in Troubleshoot connectivity issues section.
Create an ODBC linked service to link a SAP HANA data store to an Azure data factory as shown in the following
example:
{
"name": "SAPHANAViaODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Driver={HDBODBC};servernode=<HANA server>.clouddatahub-int.net:30015"
},
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Read the article from the beginning for a detailed overview of using ODBC data stores as source/sink data stores in
a copy operation.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from MongoDB using Azure Data Factory
2/1/2019 • 4 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a MongoDB database. It
builds on the copy activity overview article that presents a general overview of copy activity.
IMPORTANT
ADF has released this new version of the MongoDB connector, which provides better native MongoDB support. If you are using the previous MongoDB connector in your solution, it remains supported as-is for backward compatibility; refer to the MongoDB connector (legacy) article.
Supported capabilities
You can copy data from MongoDB database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this MongoDB connector supports versions up to 3.4.
Prerequisites
To copy data from a MongoDB database that is not publicly accessible, you need to set up a Self-hosted
Integration Runtime. See Self-hosted Integration Runtime article to learn details.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MongoDB connector.
Example:
{
"name": "MongoDBLinkedService",
"properties": {
"type": "MongoDbV2",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "mongodb://[username:password@]host[:port][/[database][?options]]"
},
"database": "myDatabase"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following properties are supported for MongoDB dataset:
Example:
{
"name": "MongoDbDataset",
"properties": {
"type": "MongoDbV2Collection",
"linkedServiceName": {
"referenceName": "<MongoDB linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"collectionName": "<Collection name>"
}
}
}
TIP
ADF supports consuming BSON documents in Strict mode. Make sure your filter query is in Strict mode instead of Shell mode. More details can be found in the MongoDB manual.
Example:
"activities":[
{
"name": "CopyFromMongoDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<MongoDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MongoDbV2Source",
"filter": "{datetimeData: {$gte: ISODate(\"2018-12-11T00:00:00.000Z\"),$lt: ISODate(\"2018-12-
12T00:00:00.000Z\")}, _id: ObjectId(\"5acd7c3d0000000000000000\") }",
"cursorMethods": {
"project": "{ _id : 1, name : 1, age: 1, datetimeData: 1 }",
"sort": "{ age : 1 }",
"skip": 3,
"limit": 3
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from MongoDB using Azure Data Factory
1/15/2019 • 6 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a MongoDB database. It
builds on the copy activity overview article that presents a general overview of copy activity.
IMPORTANT
ADF has released a new MongoDB connector that provides better native MongoDB support compared to this ODBC-based implementation; refer to the MongoDB connector article for details. This legacy MongoDB connector remains supported as-is for backward compatibility, but for any new workload, use the new connector.
Supported capabilities
You can copy data from MongoDB database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this MongoDB connector supports:
MongoDB versions 2.4, 2.6, 3.0, 3.2, 3.4 and 3.6.
Copying data using Basic or Anonymous authentication.
Prerequisites
To copy data from a MongoDB database that is not publicly accessible, you need to set up a Self-hosted
Integration Runtime. See Self-hosted Integration Runtime article to learn details. The Integration Runtime
provides a built-in MongoDB driver, therefore you don't need to manually install any driver when copying data
from MongoDB.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MongoDB connector.
username: User account to access MongoDB. Required: Yes (if basic authentication is used).
password: Password for the user. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes (if basic authentication is used).
authSource: Name of the MongoDB database that you want to use to check your credentials for authentication. Required: No. For basic authentication, the default is to use the admin account and the database specified using the databaseName property.
Example:
{
"name": "MongoDBLinkedService",
"properties": {
"type": "MongoDb",
"typeProperties": {
"server": "<server name>",
"databaseName": "<database name>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following properties are supported for MongoDB dataset:
Example:
{
"name": "MongoDbDataset",
"properties": {
"type": "MongoDbCollection",
"linkedServiceName": {
"referenceName": "<MongoDB linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"collectionName": "<Collection name>"
}
}
}
query: Use the custom SQL-92 query to read data. For example: select * from MyTable. Required: No (if "collectionName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromMongoDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<MongoDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MongoDbSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
TIP
When specifying the SQL query, pay attention to the DateTime format. For example:
SELECT * FROM Account WHERE LastModifiedDate >= '2018-06-01' AND LastModifiedDate < '2018-06-02'
or, to use a parameter:
SELECT * FROM Account WHERE LastModifiedDate >= '@{formatDateTime(pipeline().parameters.StartTime,'yyyy-MM-dd HH:mm:ss')}' AND LastModifiedDate < '@{formatDateTime(pipeline().parameters.EndTime,'yyyy-MM-dd HH:mm:ss')}'
MONGODB DATA TYPE DATA FACTORY INTERIM DATA TYPE
Binary Byte[]
Boolean Boolean
Date DateTime
NumberDouble Double
NumberInt Int32
NumberLong Int64
ObjectID String
String String
UUID Guid
NOTE
To learn about support for arrays using virtual tables, refer to Support for complex types using virtual tables section.
Currently, the following MongoDB data types are not supported: DBPointer, JavaScript, Max/Min key, Regular Expression,
Symbol, Timestamp, Undefined.
The driver would generate multiple virtual tables to represent this single table. The first virtual table is the base
table named “ExampleTable", shown in the example. The base table contains all the data of the original table, but
the data from the arrays has been omitted and is expanded in the virtual tables.
The following tables show the virtual tables that represent the original arrays in the example. These tables contain
the following:
A reference back to the original primary key column corresponding to the row of the original array (via the _id
column)
An indication of the position of the data within the original array
The expanded data for each element within the array
Table "ExampleTable_Invoices":
_ID EXAMPLETABLE_INVOICES_DIM1_IDX INVOICE_ID ITEM PRICE DISCOUNT
Table "ExampleTable_Ratings":
_ID EXAMPLETABLE_RATINGS_DIM1_IDX EXAMPLETABLE_RATINGS
1111 0 5
1111 1 6
2222 0 1
2222 1 2
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from MySQL using Azure Data Factory
3/15/2019 • 5 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a MySQL database. It
builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from MySQL database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this MySQL connector supports MySQL version 5.6 and 5.7.
Prerequisites
If your MySQL database is not publicly accessible, you need to set up a Self-hosted Integration Runtime. To learn about Self-hosted integration runtimes, see the Self-hosted Integration Runtime article. The Integration Runtime provides a built-in MySQL driver starting from version 3.7, so you don't need to manually install any driver. For Self-hosted IR versions lower than 3.7, you need to install MySQL Connector/Net for Microsoft Windows (version between 6.6.5 and 6.10.7) on the Integration Runtime machine. This 32-bit driver is compatible with the 64-bit IR.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MySQL connector.
Example:
{
"name": "MySQLLinkedService",
"properties": {
"type": "MySql",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Port=<port>;Database=<database>;UID=<username>;PWD=<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example: store password in Azure Key Vault
{
"name": "MySQLLinkedService",
"properties": {
"type": "MySql",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Port=<port>;Database=<database>;UID=<username>;"
},
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
If you were using a MySQL linked service with the following payload, it is still supported as-is, but you are encouraged to use the new one going forward.
Previous payload:
{
"name": "MySQLLinkedService",
"properties": {
"type": "MySql",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by MySQL dataset.
To copy data from MySQL, set the type property of the dataset to RelationalTable. The following properties are
supported:
tableName: Name of the table in the MySQL database. Required: No (if "query" in activity source is specified).
Example
{
"name": "MySQLDataset",
"properties":
{
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<MySQL linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromMySQL",
"type": "Copy",
"inputs": [
{
"referenceName": "<MySQL input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
MYSQL DATA TYPE DATA FACTORY INTERIM DATA TYPE
bigint Int64
bit(1) Boolean
blob Byte[]
bool Int16
char String
date Datetime
datetime Datetime
double Double
enum String
float Single
int Int32
integer Int32
longblob Byte[]
longtext String
mediumblob Byte[]
mediumint Int32
mediumtext String
numeric Decimal
real Double
set String
smallint Int16
text String
time TimeSpan
timestamp Datetime
tinyblob Byte[]
tinyint Int16
tinytext String
varchar String
year Int
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Netezza by using Azure Data
Factory
2/1/2019 • 3 minutes to read • Edit Online
This article outlines how to use Copy Activity in Azure Data Factory to copy data from Netezza. The article builds
on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.
Supported capabilities
You can copy data from Netezza to any supported sink data store. For a list of data stores that Copy Activity
supports as sources and sinks, see Supported data stores and formats.
Azure Data Factory provides a built-in driver to enable connectivity. You don't need to manually install any driver
to use this connector.
Get started
You can create a pipeline that uses a copy activity by using the .NET SDK, the Python SDK, Azure PowerShell, the
REST API, or an Azure Resource Manager template. See the Copy Activity tutorial for step-by-step instructions on
how to create a pipeline that has a copy activity.
The following sections provide details about properties you can use to define Data Factory entities that are specific
to the Netezza connector.
CaCertFile: The full path to the SSL certificate that's used by the server. Example: CaCertFile=<cert path>; Required: Yes, if SSL is enabled.
Example
{
"name": "NetezzaLinkedService",
"properties": {
"type": "Netezza",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;PWD=<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
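If SSL is enabled on your Netezza server, the CaCertFile setting described above is supplied through the connection string. The following is a sketch only; it assumes CaCertFile is the only SSL-related setting you need, and all values are placeholders.
{
    "name": "NetezzaLinkedService",
    "properties": {
        "type": "Netezza",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;PWD=<password>;CaCertFile=<full path to the certificate file>;"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}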
Dataset properties
This section provides a list of properties that the Netezza dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets.
To copy data from Netezza, set the type property of the dataset to NetezzaTable. The following properties are
supported:
Example
{
"name": "NetezzaDataset",
"properties": {
"type": "NetezzaTable",
"linkedServiceName": {
"referenceName": "<Netezza linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. Example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromNetezza",
"type": "Copy",
"inputs": [
{
"referenceName": "<Netezza input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "NetezzaSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from an OData source by using Azure
Data Factory
3/6/2019 • 5 minutes to read • Edit Online
This article outlines how to use Copy Activity in Azure Data Factory to copy data from an OData source. The
article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.
Supported capabilities
You can copy data from an OData source to any supported sink data store. For a list of data stores that Copy
Activity supports as sources and sinks, see Supported data stores and formats.
Specifically, this OData connector supports:
OData version 3.0 and 4.0.
Copying data by using one of the following authentications: Anonymous, Basic, Windows, AAD service
principal, and managed identities for Azure resources.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are specific
to an OData connector.
{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "https://services.odata.org/OData/OData.svc",
"authenticationType": "Anonymous"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "<endpoint of OData source>",
"authenticationType": "Basic",
"userName": "<user name>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "<endpoint of OData source>",
"authenticationType": "AadServicePrincipal",
"servicePrincipalId": "<service principal id>",
"aadServicePrincipalCredentialType": "ServicePrincipalKey",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"aadResourceId": "<AAD resource URL>"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
Dataset properties
This section provides a list of properties that the OData dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
To copy data from OData, set the type property of the dataset to ODataResource. The following properties are
supported:
Example
{
"name": "ODataDataset",
"properties":
{
"type": "ODataResource",
"linkedServiceName": {
"referenceName": "<OData linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties":
{
"path": "Products"
}
}
}
Copy Activity properties
This section provides a list of properties that the OData source supports.
For a full list of sections and properties that are available for defining activities, see Pipelines.
OData as source
To copy data from OData, set the source type in Copy Activity to RelationalSource. The following properties are
supported in the Copy Activity source section:
Example
"activities":[
{
"name": "CopyFromOData",
"type": "Copy",
"inputs": [
{
"referenceName": "<OData input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "?$select=Name,Description&$top=5"
},
"sink": {
"type": "<sink type>"
}
}
}
]
ODATA DATA TYPE DATA FACTORY INTERIM DATA TYPE
Edm.Binary Byte[]
Edm.Boolean Bool
Edm.Byte Byte[]
Edm.DateTime DateTime
Edm.Decimal Decimal
Edm.Double Double
Edm.Single Single
Edm.Guid Guid
Edm.Int16 Int16
Edm.Int32 Int32
Edm.Int64 Int64
Edm.SByte Int16
Edm.String String
Edm.Time TimeSpan
Edm.DateTimeOffset DateTimeOffset
NOTE
OData complex data types (such as Object) aren't supported.
Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from and to ODBC data stores using
Azure Data Factory
3/18/2019 • 8 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from and to an ODBC data
store. It builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from ODBC source to any supported sink data store, or copy from any supported source data
store to ODBC sink. For a list of data stores that are supported as sources/sinks by the copy activity, see the
Supported data stores table.
Specifically, this ODBC connector supports copying data from/to any ODBC-compatible data stores using Basic or Anonymous authentication.
Prerequisites
To use this ODBC connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the ODBC driver for the data store on the Integration Runtime machine.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
ODBC connector.
{
"name": "ODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "<connection string>"
},
"authenticationType": "Anonymous",
"credential": {
"type": "SecureString",
"value": "RefreshToken=<secret refresh token>;"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
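The preceding example uses Anonymous authentication. For Basic authentication, the linked service would look similar to the following sketch; the userName and password properties are added alongside the driver-specific connection string, and all values are placeholders.
{
    "name": "ODBCLinkedService",
    "properties": {
        "type": "Odbc",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "<connection string>"
            },
            "authenticationType": "Basic",
            "userName": "<user name>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}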
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by ODBC dataset.
To copy data from/to an ODBC-compatible data store, set the type property of the dataset to RelationalTable. The following properties are supported:
tableName: Name of the table in the ODBC data store. Required: No for source (if "query" in activity source is specified); Yes for sink.
Example
{
"name": "ODBCDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<ODBC linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<table name>"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromODBC",
"type": "Copy",
"inputs": [
{
"referenceName": "<ODBC input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
ODBC as sink
To copy data to an ODBC-compatible data store, set the sink type in the copy activity to OdbcSink. The following properties are supported in the copy activity sink section:
writeBatchSize: Inserts data into the SQL table when the buffer size reaches writeBatchSize. Allowed values are: integer (number of rows). Required: No (default is 0 - auto detected).
NOTE
For "writeBatchSize", if it's not set (auto-detected), copy activity first detects whether the driver supports batch operations,
and set it to 10000 if it does, or set it to 1 if it doesn’t. If you explicitly set the value other than 0, copy activity honors the
value and fails at runtime if the driver doesn’t support batch operations.
Example:
"activities":[
{
"name": "CopyToODBC",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<ODBC output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "OdbcSink",
"writeBatchSize": 100000
}
}
}
]
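Conversely, if you know your ODBC driver doesn't support batch operations, you can set writeBatchSize to 1 explicitly rather than relying on auto-detection. Only the sink section changes relative to the example above, so this sketch shows the sink section alone; the rest of the activity stays the same.
"sink": {
    "type": "OdbcSink",
    "writeBatchSize": 1
}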
Read the article from the beginning for a detailed overview of using ODBC data stores as source/sink data stores
in a copy operation.
You can copy data to SAP HANA database using the generic ODBC connector.
Set up a Self-hosted Integration Runtime on a machine with access to your data store. The Integration Runtime
uses the ODBC driver for SAP HANA to connect to the data store. Therefore, install the driver if it is not already
installed on the same machine. See Prerequisites section for details.
Before you use the SAP HANA sink in a Data Factory solution, verify whether the Integration Runtime can
connect to the data store using instructions in Troubleshoot connectivity issues section.
Create an ODBC linked service to link a SAP HANA data store to an Azure data factory as shown in the following
example:
{
"name": "SAPHANAViaODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Driver={HDBODBC};servernode=<HANA server>.clouddatahub-int.net:30015"
},
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Read the article from the beginning for a detailed overview of using ODBC data stores as source/sink data stores
in a copy operation.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Office 365 into Azure using Azure
Data Factory
6/3/2019 • 11 minutes to read • Edit Online
Azure Data Factory integrates with Microsoft Graph data connect, allowing you to bring the rich organizational
data in your Office 365 tenant into Azure in a scalable way and build analytics applications and extract insights
based on these valuable data assets. Integration with Privileged Access Management provides secured access
control for the valuable curated data in Office 365. Please refer to this link for an overview on Microsoft Graph
data connect and refer to this link for licensing information.
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Office 365. It builds on
the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
The ADF Office 365 connector and Microsoft Graph data connect enable at-scale ingestion of different types of datasets from Exchange email-enabled mailboxes, including address book contacts, calendar events, email messages, user information, mailbox settings, and so on. Refer here to see the complete list of datasets available.
For now, within a single copy activity you can only copy data from Office 365 into Azure Blob Storage, Azure
Data Lake Storage Gen1, and Azure Data Lake Storage Gen2 in JSON format (type setOfObjects). If you
want to load Office 365 into other types of data stores or in other formats, you can chain the first copy activity
with a subsequent copy activity to further load data into any of the supported ADF destination stores (refer to
"supported as a sink" column in the "Supported data stores and formats" table).
IMPORTANT
The Azure subscription containing the data factory and the sink data store must be under the same Azure Active Directory (Azure AD) tenant as the Office 365 tenant.
Ensure that the Azure Integration Runtime region used for the copy activity, as well as the destination, is in the same region where the Office 365 tenant users' mailbox is located. Refer here to understand how the Azure IR location is determined. Refer to the table here for the list of supported Office regions and corresponding Azure regions.
Service Principal authentication is the only authentication mechanism supported for Azure Blob Storage, Azure Data Lake
Storage Gen1, and Azure Data Lake Storage Gen2 as destination stores.
Prerequisites
To copy data from Office 365 into Azure, you need to complete the following prerequisite steps:
Your Office 365 tenant admin must complete on-boarding actions as described here.
Create and configure an Azure AD web application in Azure Active Directory. For instructions, see Create an
Azure AD application.
Make note of the following values, which you will use to define the linked service for Office 365:
Tenant ID. For instructions, see Get tenant ID.
Application ID and Application key. For instructions, see Get application ID and authentication key.
Add the user identity who will be making the data access request as the owner of the Azure AD web
application (from the Azure AD web application > Settings > Owners > Add owner).
The user identity must be in the Office 365 organization you are getting data from and must not be a
Guest user.
Policy validation
If ADF is created as part of a managed app and Azure Policy assignments are made on resources within the management resource group, then for every copy activity run, ADF checks to make sure the policy assignments are enforced. Refer here for a list of supported policies.
Getting started
TIP
For a walkthrough of using Office 365 connector, see Load data from Office 365 article.
You can create a pipeline with the copy activity by using one of the following tools or SDKs. Select a link to go to a
tutorial with step-by-step instructions to create a pipeline with a copy activity.
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template.
The following sections provide details about properties that are used to define Data Factory entities specific to
Office 365 connector.
NOTE
The difference between office365TenantId and servicePrincipalTenantId and the corresponding value to provide:
If you are an enterprise developer developing an application against Office 365 data for your own organization's usage,
then you should supply the same tenant ID for both properties, which is your organization's AAD tenant ID.
If you are an ISV developer developing an application for your customers, then office365TenantId will be your customer’s
(application installer) AAD tenant ID and servicePrincipalTenantId will be your company’s AAD tenant ID.
Example:
{
"name": "Office365LinkedService",
"properties": {
"type": "Office365",
"typeProperties": {
"office365TenantId": "<Office 365 tenant id>",
"servicePrincipalTenantId": "<AAD app service principal tenant id>",
"servicePrincipalId": "<AAD app service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<AAD app service principal key>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Office 365 dataset.
To copy data from Office 365, the following properties are supported:
dateFilterColumn: Name of the DateTime filter column. Use this property to limit the time range for which Office 365 data is extracted. Required: Yes if the dataset has one or more DateTime columns. Refer here for the list of datasets that require this DateTime filter.
Example
{
"name": "DS_May2019_O365_Message",
"properties": {
"type": "Office365Table",
"linkedServiceName": {
"referenceName": "<Office 365 linked service name>",
"type": "LinkedServiceReference"
},
"structure": [
{
"name": "Id",
"type": "String",
"description": "The unique identifier of the event."
},
{
"name": "CreatedDateTime",
"type": "DateTime",
"description": "The date and time that the event was created."
},
{
"name": "LastModifiedDateTime",
"type": "DateTime",
"description": "The date and time that the event was last modified."
},
{
"name": "ChangeKey",
"type": "String",
"description": "Identifies the version of the event object. Every time the event is changed,
ChangeKey changes as well. This allows Exchange to apply changes to the correct version of the object."
},
{
"name": "Categories",
"type": "String",
"description": "The categories associated with the event. Format: ARRAY<STRING>"
},
{
"name": "OriginalStartTimeZone",
"type": "String",
"description": "The start time zone that was set when the event was created. See
DateTimeTimeZone for a list of valid time zones."
},
{
"name": "OriginalEndTimeZone",
"type": "String",
"description": "The end time zone that was set when the event was created. See
DateTimeTimeZone for a list of valid time zones."
},
{
"name": "ResponseStatus",
"type": "String",
"description": "Indicates the type of response sent in response to an event message. Format:
STRUCT<Response: STRING, Time: STRING>"
},
{
"name": "iCalUId",
"type": "String",
"description": "A unique identifier that is shared by all instances of an event across
different calendars."
},
{
"name": "ReminderMinutesBeforeStart",
"type": "Int32",
"description": "The number of minutes before the event start time that the reminder alert
occurs."
},
{
"name": "IsReminderOn",
"type": "Boolean",
"description": "Set to true if an alert is set to remind the user of the event."
},
{
"name": "HasAttachments",
"type": "Boolean",
"description": "Set to true if the event has attachments."
},
{
"name": "Subject",
"type": "String",
"description": "The text of the event's subject line."
},
{
"name": "Body",
"type": "String",
"description": "The body of the message associated with the event.Format: STRUCT<ContentType:
STRING, Content: STRING>"
},
{
"name": "Importance",
"type": "String",
"description": "The importance of the event: Low, Normal, High."
},
{
"name": "Sensitivity",
"type": "String",
"description": "Indicates the level of privacy for the event: Normal, Personal, Private,
Confidential."
},
{
"name": "Start",
"type": "String",
"description": "The start time of the event. Format: STRUCT<DateTime: STRING, TimeZone:
STRING>"
STRING>"
},
{
"name": "End",
"type": "String",
"description": "The date and time that the event ends. Format: STRUCT<DateTime: STRING,
TimeZone: STRING>"
},
{
"name": "Location",
"type": "String",
"description": "Location information of the event. Format: STRUCT<DisplayName: STRING,
Address: STRUCT<Street: STRING, City: STRING, State: STRING, CountryOrRegion: STRING, PostalCode: STRING>,
Coordinates: STRUCT<Altitude: DOUBLE, Latitude: DOUBLE, Longitude: DOUBLE, Accuracy: DOUBLE, AltitudeAccuracy:
DOUBLE>>"
},
{
"name": "IsAllDay",
"type": "Boolean",
"description": "Set to true if the event lasts all day. Adjusting this property requires
adjusting the Start and End properties of the event as well."
},
{
"name": "IsCancelled",
"type": "Boolean",
"description": "Set to true if the event has been canceled."
},
{
"name": "IsOrganizer",
"type": "Boolean",
"description": "Set to true if the message sender is also the organizer."
},
{
"name": "Recurrence",
"type": "String",
"description": "The recurrence pattern for the event. Format: STRUCT<Pattern: STRUCT<Type:
STRING, `Interval`: INT, Month: INT, DayOfMonth: INT, DaysOfWeek: ARRAY<STRING>, FirstDayOfWeek: STRING,
Index: STRING>, `Range`: STRUCT<Type: STRING, StartDate: STRING, EndDate: STRING, RecurrenceTimeZone: STRING,
NumberOfOccurrences: INT>>"
},
{
"name": "ResponseRequested",
"type": "Boolean",
"description": "Set to true if the sender would like a response when the event is accepted or
declined."
},
{
"name": "ShowAs",
"type": "String",
"description": "The status to show: Free, Tentative, Busy, Oof, WorkingElsewhere, Unknown."
},
{
"name": "Type",
"type": "String",
"description": "The event type: SingleInstance, Occurrence, Exception, SeriesMaster."
},
{
"name": "Attendees",
"type": "String",
"description": "The collection of attendees for the event. Format: ARRAY<STRUCT<EmailAddress:
STRUCT<Name: STRING, Address: STRING>, Status: STRUCT<Response: STRING, Time: STRING>, Type: STRING>>"
},
{
"name": "Organizer",
"type": "String",
"description": "The organizer of the event. Format: STRUCT<EmailAddress: STRUCT<Name: STRING,
Address: STRING>>"
},
{
"name": "WebLink",
"name": "WebLink",
"type": "String",
"description": "The URL to open the event in Outlook Web App."
},
{
"name": "Attachments",
"type": "String",
"description": "The FileAttachment and ItemAttachment attachments for the message. Navigation
property. Format: ARRAY<STRUCT<LastModifiedDateTime: STRING, Name: STRING, ContentType: STRING, Size: INT,
IsInline: BOOLEAN, Id: STRING>>"
},
{
"name": "BodyPreview",
"type": "String",
"description": "The preview of the message associated with the event. It is in text format."
},
{
"name": "Locations",
"type": "String",
"description": "The locations where the event is held or attended from. The location and
locations properties always correspond with each other. Format: ARRAY<STRUCT<DisplayName: STRING, Address:
STRUCT<Street: STRING, City: STRING, State: STRING, CountryOrRegion: STRING, PostalCode: STRING>, Coordinates:
STRUCT<Altitude: DOUBLE, Latitude: DOUBLE, Longitude: DOUBLE, Accuracy: DOUBLE, AltitudeAccuracy: DOUBLE>,
LocationEmailAddress: STRING, LocationUri: STRING, LocationType: STRING, UniqueId: STRING, UniqueIdType:
STRING>>"
},
{
"name": "OnlineMeetingUrl",
"type": "String",
"description": "A URL for an online meeting. The property is set only when an organizer
specifies an event as an online meeting such as a Skype meeting"
},
{
"name": "OriginalStart",
"type": "DateTime",
"description": "The start time that was set when the event was created in UTC time."
},
{
"name": "SeriesMasterId",
"type": "String",
"description": "The ID for the recurring series master item, if this event is part of a
recurring series."
}
],
"typeProperties": {
"tableName": "BasicDataSet_v0.Event_v1",
"dateFilterColumn": "CreatedDateTime",
"startTime": "2019-04-28T16:00:00.000Z",
"endTime": "2019-05-05T16:00:00.000Z",
"userScopeFilterUri": "https://graph.microsoft.com/v1.0/users?$filter=Department eq 'Finance'"
}
}
}
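A copy activity that reads from this dataset follows the same pattern as the other connectors in this article. The following is a sketch only: the source type name Office365Source is assumed from the naming convention used by the other examples, and the dataset and sink names are placeholders (remember that the sink must be Azure Blob Storage, Azure Data Lake Storage Gen1, or Gen2 in JSON format).
"activities": [
    {
        "name": "CopyFromOffice365",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "DS_May2019_O365_Message",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "Office365Source"
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]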
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to Oracle by using Azure Data
Factory
3/5/2019 • 7 minutes to read • Edit Online
This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to an Oracle database.
It builds on the Copy Activity overview article that presents a general overview of the copy activity.
Supported capabilities
You can copy data from an Oracle database to any supported sink data store. You also can copy data from any
supported source data store to an Oracle database. For a list of data stores that are supported as sources or sinks
by the copy activity, see the Supported data stores table.
Specifically, this Oracle connector supports the following versions of an Oracle database. It also supports Basic or
OID authentications:
Oracle 12c R1 (12.1)
Oracle 11g R1, R2 (11.1, 11.2)
Oracle 10g R1, R2 (10.1, 10.2)
Oracle 9i R1, R2 (9.0.1, 9.2)
Oracle 8i R3 (8.1.7)
NOTE
Oracle proxy server is not supported.
Prerequisites
To copy data from and to an Oracle database that isn't publicly accessible, you need to set up a Self-hosted
Integration Runtime. For more information about integration runtime, see Self-hosted Integration Runtime. The
integration runtime provides a built-in Oracle driver. Therefore, you don't need to manually install a driver when
you copy data from and to Oracle.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to the
Oracle connector.
Linked service properties
The following properties are supported for the Oracle linked service.
TIP
If you hit error saying "ORA-01025: UPI parameter out of range" and your Oracle is of version 8i, add WireProtocolMode=1
to your connection string and try again.
Example: extract cert info from DERcert.cer, and then save the output to cert.txt.
b. Build the keystore or truststore. The following command creates the truststore file, with or without a password, in PKCS-12 format:
openssl pkcs12 -in [Path to the file created in the previous step] -out [Path and name of TrustStore] -passout pass:[Keystore PWD] -nokeys -export
For example:
openssl pkcs12 -in cert.txt -out MyTrustStoreFile -passout pass:ThePWD -nokeys -export
{
"name": "OracleLinkedService",
"properties": {
"type": "Oracle",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=<password>;"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
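As a concrete instance of the tip above, for an Oracle 8i source you would append WireProtocolMode=1 to the same connection string shown in the example, as in this sketch (all other values remain placeholders):
{
    "name": "OracleLinkedService",
    "properties": {
        "type": "Oracle",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=<password>;WireProtocolMode=1"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}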
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Oracle dataset.
To copy data from and to Oracle, set the type property of the dataset to OracleTable. The following properties are
supported.
Example:
{
"name": "OracleDataset",
"properties":
{
"type": "OracleTable",
"linkedServiceName": {
"referenceName": "<Oracle linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "MyTable"
}
}
}
Copy activity properties
For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Oracle source and sink.
Oracle as a source type
To copy data from Oracle, set the source type in the copy activity to OracleSource. The following properties are
supported in the copy activity source section.
If you don't specify "oracleReaderQuery", the columns defined in the "structure" section of the dataset are used to
construct a query ( select column1, column2 from mytable ) to run against the Oracle database. If the dataset
definition doesn't have "structure", all columns are selected from the table.
Example:
"activities":[
{
"name": "CopyFromOracle",
"type": "Copy",
"inputs": [
{
"referenceName": "<Oracle input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "OracleSource",
"oracleReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
writeBatchSize: Inserts data into the SQL table when the buffer size reaches writeBatchSize. Allowed values are: integer (number of rows). Required: No (default is 10,000).
Example:
"activities":[
{
"name": "CopyToOracle",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Oracle output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "OracleSink"
}
}
}
]
ORACLE DATA TYPE DATA FACTORY INTERIM DATA TYPE
BFILE Byte[]
BLOB Byte[]
(only supported on Oracle 10g and higher)
CHAR String
CLOB String
DATE DateTime
LONG String
NCHAR String
NCLOB String
NVARCHAR2 String
RAW Byte[]
ROWID String
TIMESTAMP DateTime
VARCHAR2 String
XML String
NOTE
The data types INTERVAL YEAR TO MONTH and INTERVAL DAY TO SECOND aren't supported.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Oracle Eloqua using Azure Data
Factory (Preview)
1/3/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Oracle Eloqua. It builds
on the copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and provide feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Oracle Eloqua to any supported sink data store. For a list of data stores that are supported
as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install any driver to use this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Oracle Eloqua connector.
Example:
{
"name": "EloquaLinkedService",
"properties": {
"type": "Eloqua",
"typeProperties": {
"endpoint" : "<base URL e.g. xxx.xxx.eloqua.com>",
"username" : "<site name>\\<user name e.g. Eloqua\\Alice>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Oracle Eloqua dataset.
To copy data from Oracle Eloqua, set the type property of the dataset to EloquaObject. The following properties
are supported:
tableName: Name of the Eloqua object, for example "Accounts". Required: No (if "query" in activity source is specified).
Example
{
"name": "EloquaDataset",
"properties": {
"type": "EloquaObject",
"linkedServiceName": {
"referenceName": "<Eloqua linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Accounts". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromEloqua",
"type": "Copy",
"inputs": [
{
"referenceName": "<Eloqua input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "EloquaSource",
"query": "SELECT * FROM Accounts"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of supported data stored by Azure Data Factory, see supported data stores.
Copy data from Oracle Responsys using Azure Data
Factory (Preview)
1/16/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Oracle Responsys. It
builds on the copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Oracle Responsys to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install any driver to use this connector.
Getting started
You can create a pipeline with copy activity using .NET SDK, Python SDK, Azure PowerShell, REST API, or Azure
Resource Manager template. See Copy activity tutorial for step-by-step instructions to create a pipeline with a
copy activity.
The following sections provide details about properties that are used to define Data Factory entities specific to
Oracle Responsys connector.
Example:
{
"name": "OracleResponsysLinkedService",
"properties": {
"type": "Responsys",
"typeProperties": {
"endpoint" : "<endpoint>",
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"useEncryptedEndpoints" : true,
"useHostVerification" : true,
"usePeerVerification" : true
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Oracle Responsys dataset.
To copy data from Oracle Responsys, set the type property of the dataset to ResponsysObject. The following
properties are supported:
Example
{
"name": "OracleResponsysDataset",
"properties": {
"type": "ResponsysObject",
"linkedServiceName": {
"referenceName": "<Oracle Responsys linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromOracleResponsys",
"type": "Copy",
"inputs": [
{
"referenceName": "<Oracle Responsys input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ResponsysSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Oracle Service Cloud using Azure
Data Factory (Preview)
1/16/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Oracle Service Cloud. It
builds on the copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and provide feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Oracle Service Cloud to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install any driver to use this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Oracle Service Cloud connector.
Example:
{
"name": "OracleServiceCloudLinkedService",
"properties": {
"type": "OracleServiceCloud",
"typeProperties": {
"host" : "<host>",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"useEncryptedEndpoints" : true,
"useHostVerification" : true,
"usePeerVerification" : true,
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Oracle Service Cloud dataset.
To copy data from Oracle Service Cloud, set the type property of the dataset to OracleServiceCloudObject. The
following properties are supported:
tableName: Name of the table. Required: No (if "query" in activity source is specified).
Example
{
"name": "OracleServiceCloudDataset",
"properties": {
"type": "OracleServiceCloudObject",
"linkedServiceName": {
"referenceName": "<OracleServiceCloud linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromOracleServiceCloud",
"type": "Copy",
"inputs": [
{
"referenceName": "<OracleServiceCloud input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "OracleServiceCloudSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Parquet format in Azure Data Factory
5/6/2019 • 3 minutes to read • Edit Online
Follow this article when you want to parse the Parquet files or write the data into Parquet format.
Parquet format is supported for the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage
Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS,
HTTP, and SFTP.
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Parquet dataset.
NOTE
White space in column names is not supported for Parquet files.
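For illustration, the following is a minimal sketch of a Parquet dataset that reads from Azure Blob storage. The linked service name, container, and folder path are placeholders, and the optional compressionCodec value shown here is an assumption; adjust them to your store and compression needs.
{
    "name": "ParquetDataset",
    "properties": {
        "type": "Parquet",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "<container name>",
                "folderPath": "<folder path>"
            },
            "compressionCodec": "snappy"
        }
    }
}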
Parquet as sink
The following properties are supported in the copy activity sink section.
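Because the full sink property table is not reproduced here, the following is only a minimal sketch of the copy activity typeProperties for writing Parquet output; the source type is a placeholder and store-specific write settings are omitted.
"typeProperties": {
    "source": {
        "type": "<source type>"
    },
    "sink": {
        "type": "ParquetSink"
    }
}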
For copy activities running on a Self-hosted IR with Parquet file serialization/deserialization, ADF locates the Java
runtime by first checking the registry
(SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for JRE; if JRE is not found, it then
checks the system variable JAVA_HOME for OpenJDK.
To use JRE: The 64-bit IR requires a 64-bit JRE. You can find it here.
To use OpenJDK: It's supported since IR version 3.13. Package the jvm.dll with all other required
assemblies of OpenJDK onto the Self-hosted IR machine, and set the system environment variable JAVA_HOME
accordingly.
TIP
If you copy data to/from Parquet format by using the Self-hosted Integration Runtime and hit an error saying "An error
occurred when invoking java, message: java.lang.OutOfMemoryError:Java heap space", you can add an
environment variable _JAVA_OPTIONS on the machine that hosts the Self-hosted IR to adjust the min/max heap size
for the JVM, and then rerun the pipeline.
Example: set the variable _JAVA_OPTIONS with the value -Xms256m -Xmx16g . The flag Xms specifies the initial
memory allocation pool for a Java Virtual Machine (JVM), while Xmx specifies the maximum memory
allocation pool. This means that the JVM starts with Xms amount of memory and can use at most Xmx amount of
memory. By default, ADF uses a minimum of 64 MB and a maximum of 1 GB.
Next steps
Copy activity overview
Mapping data flow
Lookup activity
GetMetadata activity
Copy data from PayPal using Azure Data Factory
(Preview)
1/3/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from PayPal. It builds on the
copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from PayPal to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
PayPal connector.
Example:
{
"name": "PayPalLinkedService",
"properties": {
"type": "PayPal",
"typeProperties": {
"host" : "api.sandbox.paypal.com",
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by PayPal dataset.
To copy data from PayPal, set the type property of the dataset to PayPalObject. The following properties are
supported:
{
"name": "PayPalDataset",
"properties": {
"type": "PayPalObject",
"linkedServiceName": {
"referenceName": "<PayPal linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Payment_Experience". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromPayPal",
"type": "Copy",
"inputs": [
{
"referenceName": "<PayPal input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "PayPalSource",
"query": "SELECT * FROM Payment_Experience"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Phoenix using Azure Data Factory
1/3/2019 • 4 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Phoenix. It builds on the
copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from Phoenix to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Phoenix connector.
Example:
{
"name": "PhoenixLinkedService",
"properties": {
"type": "Phoenix",
"typeProperties": {
"host" : "<cluster>.azurehdinsight.net",
"port" : "443",
"httpPath" : "/hbasephoenix0",
"authenticationType" : "WindowsAzureHDInsightService",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Phoenix dataset.
To copy data from Phoenix, set the type property of the dataset to PhoenixObject. The following properties are
supported:
Example
{
"name": "PhoenixDataset",
"properties": {
"type": "PhoenixObject",
"linkedServiceName": {
"referenceName": "<Phoenix linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromPhoenix",
"type": "Copy",
"inputs": [
{
"referenceName": "<Phoenix input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "PhoenixSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from PostgreSQL by using Azure Data
Factory
3/15/2019 • 4 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a PostgreSQL database.
It builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from PostgreSQL database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this PostgreSQL connector supports PostgreSQL version 7.4 and above.
Prerequisites
If your PostgreSQL database is not publicly accessible, you need to set up a Self-hosted Integration Runtime. To
learn about Self-hosted integration runtimes, see the Self-hosted Integration Runtime article. The Integration
Runtime provides a built-in PostgreSQL driver starting from version 3.7, so you don't need to manually
install any driver.
For Self-hosted IR versions lower than 3.7, you need to install the Npgsql data provider for PostgreSQL, with a
version between 2.0.12 and 3.1.9, on the Integration Runtime machine.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
PostgreSQL connector.
EncryptionMethod (EM): The method the driver uses to encrypt data sent between the driver and the database server. For example, EncryptionMethod=<0/1/6>;. Allowed values: 0 (No Encryption) (default), 1 (SSL), 6 (RequestSSL). Required: No.
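For instance, to require SSL you could append this option to the connection string in the linked service. The following sketch shows only the connectionString property, with placeholder values:
"connectionString": {
    "type": "SecureString",
    "value": "Server=<server>;Database=<database>;Port=<port>;UID=<username>;Password=<Password>;EncryptionMethod=1;"
}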
Example:
{
"name": "PostgreSqlLinkedService",
"properties": {
"type": "PostgreSql",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Database=<database>;Port=<port>;UID=<username>;Password=<Password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example: store password in Azure Key Vault
{
"name": "PostgreSqlLinkedService",
"properties": {
"type": "PostgreSql",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Database=<database>;Port=<port>;UID=<username>;"
},
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
If you were using a PostgreSQL linked service with the following payload, it is still supported as-is, but we suggest
that you use the new format going forward.
Previous payload:
{
"name": "PostgreSqlLinkedService",
"properties": {
"type": "PostgreSql",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by PostgreSQL dataset.
To copy data from PostgreSQL, set the type property of the dataset to RelationalTable. The following properties
are supported:
PROPERTY DESCRIPTION REQUIRED
tableName: Name of the table in the PostgreSQL database. Required: No (if "query" in activity source is specified).
Example
{
"name": "PostgreSQLDataset",
"properties":
{
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<PostgreSQL linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "query": "SELECT * FROM \"MySchema\".\"MyTable\"". Required: No (if "tableName" in dataset is specified).
NOTE
Schema and table names are case-sensitive. Enclose them in "" (double quotes) in the query.
Example:
"activities":[
{
"name": "CopyFromPostgreSQL",
"type": "Copy",
"inputs": [
{
"referenceName": "<PostgreSQL input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT * FROM \"MySchema\".\"MyTable\""
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Presto using Azure Data Factory
(Preview)
1/3/2019 • 4 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Presto. It builds on the
copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Presto to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Presto connector.
Example:
{
"name": "PrestoLinkedService",
"properties": {
"type": "Presto",
"typeProperties": {
"host" : "<host>",
"serverVersion" : "0.148-t",
"catalog" : "<catalog>",
"port" : "<port>",
"authenticationType" : "LDAP",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"timeZoneID" : "Europe/Berlin"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Presto dataset.
To copy data from Presto, set the type property of the dataset to PrestoObject. The following properties are
supported:
Example
{
"name": "PrestoDataset",
"properties": {
"type": "PrestoObject",
"linkedServiceName": {
"referenceName": "<Presto linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromPresto",
"type": "Copy",
"inputs": [
{
"referenceName": "<Presto input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "PrestoSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from QuickBooks Online using Azure Data
Factory (Preview)
3/14/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from QuickBooks Online. It
builds on the copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from QuickBooks Online to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
Currently, this connector supports only OAuth 1.0a, which means you need to have a developer account with apps created
before July 17, 2017.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
QuickBooks connector.
Example:
{
"name": "QuickBooksLinkedService",
"properties": {
"type": "QuickBooks",
"typeProperties": {
"endpoint" : "quickbooks.api.intuit.com",
"companyId" : "<companyId>",
"consumerKey": "<consumerKey>",
"consumerSecret": {
"type": "SecureString",
"value": "<consumerSecret>"
},
"accessToken": {
"type": "SecureString",
"value": "<accessToken>"
},
"accessTokenSecret": {
"type": "SecureString",
"value": "<accessTokenSecret>"
},
"useEncryptedEndpoints" : true
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by QuickBooks dataset.
To copy data from QuickBooks Online, set the type property of the dataset to QuickBooksObject. The following
properties are supported:
Example
{
"name": "QuickBooksDataset",
"properties": {
"type": "QuickBooksObject",
"linkedServiceName": {
"referenceName": "<QuickBooks linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM "Bill" WHERE Id = '123'". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromQuickBooks",
"type": "Copy",
"inputs": [
{
"referenceName": "<QuickBooks input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "QuickBooksSource",
"query": "SELECT * FROM \"Bill\" WHERE Id = '123' "
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from a REST endpoint by using Azure
Data Factory
4/1/2019 • 8 minutes to read • Edit Online
This article outlines how to use Copy Activity in Azure Data Factory to copy data from a REST endpoint. The
article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.
The differences between this REST connector, the HTTP connector, and the Web table connector are:
REST connector specifically supports copying data from RESTful APIs.
HTTP connector is generic for retrieving data from any HTTP endpoint, for example, to download a file. Before this REST
connector became available, you might have used the HTTP connector to copy data from a RESTful API, which is
supported but less functional compared with the REST connector.
Web table connector extracts table content from an HTML webpage.
Supported capabilities
You can copy data from a REST source to any supported sink data store. For a list of data stores that Copy Activity
supports as sources and sinks, see Supported data stores and formats.
Specifically, this generic REST connector supports:
Retrieving data from a REST endpoint by using the GET or POST methods.
Retrieving data by using one of the following authentications: Anonymous, Basic, AAD service principal,
and managed identities for Azure resources.
Pagination in the REST APIs.
Copying the REST JSON response as-is or parsing it by using schema mapping. Only response payloads in
JSON are supported.
TIP
To test a request for data retrieval before you configure the REST connector in Data Factory, learn about the API
specification for header and body requirements. You can use tools like Postman or a web browser to validate.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are specific
to the REST connector.
Linked service properties
The following properties are supported for the REST linked service:
Example
{
"name": "RESTLinkedService",
"properties": {
"type": "RestService",
"typeProperties": {
"authenticationType": "Basic",
"url" : "<REST endpoint>",
"userName": "<user name>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example
{
"name": "RESTLinkedService",
"properties": {
"type": "RestService",
"typeProperties": {
"url": "<REST endpoint e.g. https://www.example.com/>",
"authenticationType": "AadServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"value": "<service principal key>",
"type": "SecureString"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"aadResourceId": "<AAD resource URL e.g. https://management.core.windows.net>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example
{
"name": "RESTLinkedService",
"properties": {
"type": "RestService",
"typeProperties": {
"url": "<REST endpoint e.g. https://www.example.com/>",
"authenticationType": "ManagedServiceIdentity",
"aadResourceId": "<AAD resource URL e.g. https://management.core.windows.net>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
This section provides a list of properties that the REST dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
To copy data from REST, the following properties are supported:
PROPERTY DESCRIPTION REQUIRED
{
"name": "RESTDataset",
"properties": {
"type": "RestResource",
"linkedServiceName": {
"referenceName": "<REST linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"relativeUrl": "<relative url>",
"additionalHeaders": {
"x-user-defined": "helloworld"
},
"paginationRules": {
"AbsoluteUrl": "$.paging.next"
}
}
}
}
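If the endpoint expects a POST request, you can also set the request method and body on the dataset. The following is only a sketch; the relative URL and request body are placeholders you would replace with values your API expects:
{
    "name": "RESTDataset",
    "properties": {
        "type": "RestResource",
        "linkedServiceName": {
            "referenceName": "<REST linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "relativeUrl": "<relative url>",
            "requestMethod": "Post",
            "requestBody": "<body for POST REST request>"
        }
    }
}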
Example
"activities":[
{
"name": "CopyFromREST",
"type": "Copy",
"inputs": [
{
"referenceName": "<REST input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RestSource",
"httpRequestTimeout": "00:01:00"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Pagination support
Normally, a REST API limits the response payload size of a single request to a reasonable number; to return a
large amount of data, it splits the result into multiple pages and requires callers to send consecutive requests to
get the next page of the result. Usually, the request for one page is dynamic and composed from information
returned in the response of the previous page.
This generic REST connector supports the following pagination patterns:
Next request’s absolute or relative URL = property value in current response body
Next request’s absolute or relative URL = header value in current response headers
Next request’s query parameter = property value in current response body
Next request’s query parameter = header value in current response headers
Next request’s header = property value in current response body
Next request’s header = header value in current response headers
Pagination rules are defined as a dictionary in the dataset, which contains one or more case-sensitive key-value pairs.
The configuration is used to generate requests starting from the second page. The connector stops
iterating when it gets HTTP status code 204 (No Content), or when any of the JSONPath expressions in
"paginationRules" returns null.
Supported keys in pagination rules:
KEY DESCRIPTION
AbsoluteUrl: Indicates the URL to issue the next request. It can be either an absolute URL or a relative URL.
VALUE DESCRIPTION
A JSONPath expression starting with "$" (representing the root of the response body): The response body should contain only one JSON object. The JSONPath expression should return a single primitive value, which will be used to issue the next request.
Example:
The Facebook Graph API returns a response in the following structure, in which case the next page's URL is represented in
paging.next:
{
"data": [
{
"created_time": "2017-12-12T14:12:20+0000",
"name": "album1",
"id": "1809938745705498_1809939942372045"
},
{
"created_time": "2017-12-12T14:14:03+0000",
"name": "album2",
"id": "1809938745705498_1809941802371859"
},
{
"created_time": "2017-12-12T14:14:11+0000",
"name": "album3",
"id": "1809938745705498_1809941879038518"
}
],
"paging": {
"cursors": {
"after": "MTAxNTExOTQ1MjAwNzI5NDE=",
"before": "NDMyNzQyODI3OTQw"
},
"previous": "https://graph.facebook.com/me/albums?limit=25&before=NDMyNzQyODI3OTQw",
"next": "https://graph.facebook.com/me/albums?limit=25&after=MTAxNTExOTQ1MjAwNzI5NDE="
}
}
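To follow this structure, the pagination rule in the dataset points the AbsoluteUrl key at the paging.next property, as in the earlier dataset example:
"paginationRules": {
    "AbsoluteUrl": "$.paging.next"
}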
Schema mapping
To copy data from REST endpoint to tabular sink, refer to schema mapping.
Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from and to Salesforce by using Azure
Data Factory
4/19/2019 • 9 minutes to read • Edit Online
This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Salesforce. It builds
on the Copy Activity overview article that presents a general overview of the copy activity.
Supported capabilities
You can copy data from Salesforce to any supported sink data store. You also can copy data from any supported
source data store to Salesforce. For a list of data stores that are supported as sources or sinks by the Copy activity,
see the Supported data stores table.
Specifically, this Salesforce connector supports:
Salesforce Developer, Professional, Enterprise, or Unlimited editions.
Copying data from and to Salesforce production, sandbox, and custom domain.
The Salesforce connector is built on top of the Salesforce REST/Bulk API: version 45 for copying data from Salesforce and
version 40 for copying data to Salesforce.
Prerequisites
API permission must be enabled in Salesforce. For more information, see Enable API access in Salesforce by
permission set
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-step
instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to the
Salesforce connector.
connectVia: The integration runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime. Required: No for source, Yes for sink if the source linked service doesn't have an integration runtime.
IMPORTANT
When you copy data into Salesforce, the default Azure Integration Runtime can't be used to execute copy. In other words, if
your source linked service doesn't have a specified integration runtime, explicitly create an Azure Integration Runtime with a
location near your Salesforce instance. Associate the Salesforce linked service as in the following example.
{
"name": "SalesforceLinkedService",
"properties": {
"type": "Salesforce",
"typeProperties": {
"username": "<username>",
"password": {
"type": "AzureKeyVaultSecret",
"secretName": "<secret name of password in AKV>",
"store":{
"referenceName": "<Azure Key Vault linked service>",
"type": "LinkedServiceReference"
}
},
"securityToken": {
"type": "AzureKeyVaultSecret",
"secretName": "<secret name of security token in AKV>",
"store":{
"referenceName": "<Azure Key Vault linked service>",
"type": "LinkedServiceReference"
}
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
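As a sketch of the alternative where credentials are stored inline in Data Factory rather than in Key Vault, the following shows the same linked service with SecureString values; the optional environmentUrl shown here is an assumption for pointing at a sandbox or custom domain instance:
{
    "name": "SalesforceLinkedService",
    "properties": {
        "type": "Salesforce",
        "typeProperties": {
            "environmentUrl": "<environment URL, e.g. https://test.salesforce.com>",
            "username": "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            },
            "securityToken": {
                "type": "SecureString",
                "value": "<security token>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}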
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Salesforce dataset.
To copy data from and to Salesforce, set the type property of the dataset to SalesforceObject. The following
properties are supported.
PROPERTY DESCRIPTION REQUIRED
objectApiName: The Salesforce object name to retrieve data from. Required: No for source, Yes for sink.
IMPORTANT
The "__c" part of API Name is needed for any custom object.
Example:
{
"name": "SalesforceDataset",
"properties": {
"type": "SalesforceObject",
"linkedServiceName": {
"referenceName": "<Salesforce linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"objectApiName": "MyTable__c"
}
}
}
NOTE
For backward compatibility: When you copy data from Salesforce, if you use the previous "RelationalTable" type dataset, it
keeps working while you see a suggestion to switch to the new "SalesforceObject" type.
tableName: Name of the table in Salesforce. Required: No (if "query" in the activity source is specified).
query: Use the custom query to read data. You can use a Salesforce Object Query Language (SOQL) query or a SQL-92 query. See more tips in the query tips section. If the query is not specified, all the data of the Salesforce object specified in "objectApiName" in the dataset will be retrieved. Required: No (if "objectApiName" in the dataset is specified).
IMPORTANT
The "__c" part of API Name is needed for any custom object.
Example:
"activities":[
{
"name": "CopyFromSalesforce",
"type": "Copy",
"inputs": [
{
"referenceName": "<Salesforce input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SalesforceSource",
"query": "SELECT Col_Currency__c, Col_Date__c, Col_Email__c FROM AllDataType__c"
},
"sink": {
"type": "<sink type>"
}
}
}
]
NOTE
For backward compatibility: When you copy data from Salesforce, if you use the previous "RelationalSource" type copy, the
source keeps working while you see a suggestion to switch to the new "SalesforceSource" type.
externalIdFieldName: The name of the external ID field for the upsert operation. The specified field must be defined as "External Id Field" in the Salesforce object. It can't have NULL values in the corresponding input data. Required: Yes for "Upsert".
"activities":[
{
"name": "CopyToSalesforce",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Salesforce output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SalesforceSink",
"writeBehavior": "Upsert",
"externalIdFieldName": "CustomerId__c",
"writeBatchSize": 10000,
"ignoreNullValues": true
}
}
}
]
Query tips
Retrieve data from a Salesforce report
You can retrieve data from Salesforce reports by specifying a query as {call "<report name>"} . An example is
"query": "{call \"TestReport\"}" .
Quotation marks: In SOQL mode, field/object names cannot be quoted. In SQL mode, field/object names can be quoted, for example, SELECT "id" FROM "Account".
Datetime format: Refer to details here and samples in the next section for both SOQL and SQL modes.
Error of MALFORMED_QUERY: Truncated
If you hit the error "MALFORMED_QUERY: Truncated", it's normally because you have a JunctionIdList type column in
the data and Salesforce has a limitation on supporting such data with a large number of rows. To mitigate, try to exclude
the JunctionIdList column or limit the number of rows to copy (you can partition the work into multiple copy activity runs).
SALESFORCE DATA TYPE DATA FACTORY INTERIM DATA TYPE
Checkbox: Boolean
Currency: Decimal
Date: DateTime
Date/Time: DateTime
Email: String
Id: String
Number: Decimal
Percent: Decimal
Phone: String
Picklist: String
Text: String
URL: String
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Salesforce Marketing Cloud using
Azure Data Factory (Preview)
1/16/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Salesforce Marketing
Cloud. It builds on the copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Salesforce Marketing Cloud to any supported sink data store. For a list of data stores that
are supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
NOTE
This connector doesn't support retrieving custom objects or custom data extensions.
Getting started
You can create a pipeline with copy activity using .NET SDK, Python SDK, Azure PowerShell, REST API, or Azure
Resource Manager template. See Copy activity tutorial for step-by-step instructions to create a pipeline with a
copy activity.
The following sections provide details about properties that are used to define Data Factory entities specific to
Salesforce Marketing Cloud connector.
Example:
{
"name": "SalesforceMarketingCloudLinkedService",
"properties": {
"type": "SalesforceMarketingCloud",
"typeProperties": {
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"useEncryptedEndpoints" : true,
"useHostVerification" : true,
"usePeerVerification" : true
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Salesforce Marketing Cloud dataset.
To copy data from Salesforce Marketing Cloud, set the type property of the dataset to
SalesforceMarketingCloudObject. The following properties are supported:
Example
{
"name": "SalesforceMarketingCloudDataset",
"properties": {
"type": "SalesforceMarketingCloudObject",
"linkedServiceName": {
"referenceName": "<SalesforceMarketingCloud linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromSalesforceMarketingCloud",
"type": "Copy",
"inputs": [
{
"referenceName": "<SalesforceMarketingCloud input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SalesforceMarketingCloudSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP Business Warehouse via Open
Hub using Azure Data Factory
5/28/2019 • 7 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from an SAP Business
Warehouse (BW ) via Open Hub. It builds on the copy activity overview article that presents a general overview of
copy activity.
Supported capabilities
You can copy data from SAP Business Warehouse via Open Hub to any supported sink data store. For a list of
data stores that are supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP Business Warehouse Open Hub connector supports:
SAP Business Warehouse version 7.01 or higher (in a recent SAP Support Package Stack released after
the year 2015).
Copying data via an Open Hub Destination local table, which underneath can be a DSO, InfoCube, MultiProvider,
DataSource, and so on.
Copying data by using basic authentication.
Connecting to an Application Server.
In the first step, a DTP is executed. Each execution creates a new SAP request ID. The request ID is stored in the
Open Hub table and is then used by the ADF connector to identify the delta. The two steps run asynchronously:
the DTP is triggered by SAP, and the ADF data copy is triggered through ADF.
By default, ADF does not read the latest delta from the Open Hub table (the option "exclude last request" is true).
As a result, the data in ADF is not 100% up-to-date with the data in the Open Hub table (the last delta is missing). In
return, this procedure ensures that no rows are lost because of the asynchronous extraction. It works fine even
when ADF is reading the Open Hub table while the DTP is still writing into the same table.
You typically store the maximum request ID copied in the last ADF run in a staging data store (such as Azure Blob in the
above diagram), so that the same request is not read a second time by ADF in the subsequent run. Note that
the data is not automatically deleted from the Open Hub table.
For proper delta handling, request IDs from different DTPs must not appear in the same Open Hub table.
Therefore, don't create more than one DTP for each Open Hub Destination (OHD). When you need both full
and delta extraction from the same InfoProvider, create two OHDs for the same InfoProvider.
Prerequisites
To use this SAP Business Warehouse Open Hub connector, you need to:
Set up a Self-hosted Integration Runtime with version 3.13 or above. See Self-hosted Integration Runtime
article for details.
Download the 64-bit SAP .NET Connector 3.0 from SAP's website, and install it on the Self-hosted IR
machine. When installing, in the optional setup steps window, make sure you select the Install
Assemblies to GAC option as shown in the following image.
The SAP user used in the Data Factory BW connector needs the following permissions:
Authorization for RFC and SAP BW.
Permissions to the "Execute" activity of authorization object "S_SDSAUTH".
Create an SAP Open Hub Destination of type Database Table with the "Technical Key" option checked. It is also
recommended to leave Deleting Data from Table unchecked, although it is not required. Use the
DTP (directly executed or integrated into an existing process chain) to land data from the source object (such as a
cube) you have chosen into the Open Hub Destination table.
Getting started
TIP
For a walkthrough of using SAP BW Open Hub connector, see Load data from SAP Business Warehouse (BW) by using
Azure Data Factory.
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP Business Warehouse Open Hub connector.
language: Language that the SAP system uses. Required: No (default value is EN).
Example:
{
"name": "SapBwOpenHubLinkedService",
"properties": {
"type": "SapOpenHub",
"typeProperties": {
"server": "<server name>",
"systemNumber": "<system number>",
"clientId": "<client id>",
"userName": "<SAP user>",
"password": {
"type": "SecureString",
"value": "<Password for SAP user>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the SAP BW Open Hub dataset.
To copy data from and to SAP BW Open Hub, set the type property of the dataset to SapOpenHubTable. The
following properties are supported.
TIP
If your Open Hub table only contains the data generated by a single request ID (for example, you always do a full load and
overwrite the existing data in the table, or you only run the DTP once for testing), remember to uncheck the
"excludeLastRequest" option in order to copy the data out.
Example:
{
"name": "SAPBWOpenHubDataset",
"properties": {
"type": "SapOpenHubTable",
"linkedServiceName": {
"referenceName": "<SAP BW Open Hub linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"openHubDestinationName": "<open hub destination name>"
}
}
}
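For instance, if you want to copy the latest request as well, as described in the preceding tip, a sketch of the dataset could set the option explicitly; excludeLastRequest is shown here as an assumed optional property that defaults to true:
{
    "name": "SAPBWOpenHubDataset",
    "properties": {
        "type": "SapOpenHubTable",
        "linkedServiceName": {
            "referenceName": "<SAP BW Open Hub linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "openHubDestinationName": "<open hub destination name>",
            "excludeLastRequest": false
        }
    }
}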
"activities":[
{
"name": "CopyFromSAPBWOpenHub",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP BW Open Hub input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapOpenHubSource"
},
"sink": {
"type": "<sink type>"
}
}
}
]
The following data type mappings are used when copying data from SAP BW Open Hub (SAP BW data type: Data Factory interim data type):
C (String): String
I (Integer): Int32
F (Float): Double
D (Date): String
T (Time): String
N (Numc): String
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP Business Warehouse by using
Azure Data Factory
5/22/2019 • 10 minutes to read • Edit Online
This article shows how to use Azure Data Factory to copy data from SAP Business Warehouse (BW ) via Open Hub
to Azure Data Lake Storage Gen2. You can use a similar process to copy data to other supported sink data stores.
TIP
For general information about copying data from SAP BW, including SAP BW Open Hub integration and delta extraction flow,
see Copy data from SAP Business Warehouse via Open Hub by using Azure Data Factory.
Prerequisites
Azure Data Factory: If you don't have one, follow the steps to create a data factory.
SAP BW Open Hub Destination (OHD ) with destination type "Database Table": To create an OHD or
to check that your OHD is configured correctly for Data Factory integration, see the SAP BW Open Hub
Destination configurations section of this article.
The SAP BW user needs the following permissions:
Authorization for Remote Function Calls (RFC ) and SAP BW.
Permissions to the “Execute” activity of the S_SDSAUTH authorization object.
A self-hosted integration runtime (IR) with SAP .NET connector 3.0. Follow these setup steps:
1. Install and register the self-hosted integration runtime, version 3.13 or later. (This process is described
later in this article.)
2. Download the 64-bit SAP Connector for Microsoft .NET 3.0 from SAP's website, and install it on the
same computer as the self-hosted IR. During installation, make sure that you select Install
Assemblies to GAC in the Optional setup steps dialog box, as the following image shows:
Do a full copy from SAP BW Open Hub
In the Azure portal, go to your data factory. Select Author & Monitor to open the Data Factory UI in a separate
tab.
1. On the Let's get started page, select Copy Data to open the Copy Data tool.
2. On the Properties page, specify a Task name, and then select Next.
3. On the Source data store page, select +Create new connection. Select SAP BW Open Hub from the
connector gallery, and then select Continue. To filter the connectors, you can type SAP in the search box.
4. On the Specify SAP BW Open Hub connection page, follow these steps to create a new connection.
a. From the Connect via integration runtime list, select an existing self-hosted IR. Or, choose to
create one if you don't have one yet.
To create a new self-hosted IR, select +New, and then select Self-hosted. Enter a Name, and then
select Next. Select Express setup to install on the current computer, or follow the Manual setup
steps that are provided.
As mentioned in Prerequisites, make sure that you have SAP Connector for Microsoft .NET 3.0
installed on the same computer where the self-hosted IR is running.
b. Fill in the SAP BW Server name, System number, Client ID, Language (if other than EN ), User
name, and Password.
c. Select Test connection to validate the settings, and then select Finish.
d. A new connection is created. Select Next.
5. On the Select Open Hub Destinations page, browse the Open Hub Destinations that are available in your
SAP BW. Select the OHD to copy data from, and then select Next.
6. Specify a filter, if you need one. If your OHD only contains data from a single data-transfer process (DTP )
execution with a single request ID, or you're sure that your DTP is finished and you want to copy the data,
clear the Exclude Last Request check box.
Learn more about these settings in the SAP BW Open Hub Destination configurations section of this article.
Select Validate to double-check what data will be returned. Then select Next.
7. On the Destination data store page, select +Create new connection > Azure Data Lake Storage Gen2
> Continue.
8. On the Specify Azure Data Lake Storage connection page, follow these steps to create a connection.
a. Select your Data Lake Storage Gen2-capable account from the Name drop-down list.
b. Select Finish to create the connection. Then select Next.
9. On the Choose the output file or folder page, enter copyfromopenhub as the output folder name. Then
select Next.
10. On the File format setting page, select Next to use the default settings.
11. On the Settings page, expand Performance settings. Enter a value for Degree of copy parallelism such
as 5 to load from SAP BW in parallel. Then select Next.
12. On the Summary page, review the settings. Then select Next.
13. On the Deployment page, select Monitor to monitor the pipeline.
14. Notice that the Monitor tab on the left side of the page is automatically selected. The Actions column
includes links to view activity-run details and to rerun the pipeline.
15. To view activity runs that are associated with the pipeline run, select View Activity Runs in the Actions
column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To switch back to
the pipeline-runs view, select the Pipelines link at the top. Select Refresh to refresh the list.
16. To monitor the execution details for each copy activity, select the Details link, which is an eyeglasses icon
below Actions in the activity-monitoring view. Available details include the data volume copied from the
source to the sink, data throughput, execution steps and duration, and configurations used.
17. To view the maximum Request ID, go back to the activity-monitoring view and select Output under
Actions.
On the data factory Let's get started page, select Create pipeline from template to use the built-in template.
1. Search for SAP BW to find and select the Incremental copy from SAP BW to Azure Data Lake Storage
Gen2 template. This template copies data into Azure Data Lake Storage Gen2. You can use a similar
workflow to copy to other sink types.
2. On the template's main page, select or create the following three connections, and then select Use this
template in the lower-right corner of the window.
Azure Blob storage: In this walkthrough, we use Azure Blob storage to store the high watermark, which
is the max copied request ID.
SAP BW Open Hub: This is the source to copy data from. Refer to the previous full-copy walkthrough
for detailed configuration.
Azure Data Lake Storage Gen2: This is the sink to copy data to. Refer to the previous full-copy
walkthrough for detailed configuration.
3. This template generates a pipeline with the following three activities and chains them to run on success:
Lookup, Copy Data, and Web.
Go to the pipeline Parameters tab. You see all the configurations that you need to provide.
SAPOpenHubDestinationName: Specify the Open Hub table name to copy data from.
ADLSGen2SinkPath: Specify the destination Azure Data Lake Storage Gen2 path to copy data to. If
the path doesn't exist, the Data Factory copy activity creates a path during execution.
HighWatermarkBlobPath: Specify the path to store the high-watermark value, such as
container/path .
HighWatermarkBlobName: Specify the blob name to store the high watermark value, such as
requestIdCache.txt . In Blob storage, go to the corresponding path of
HighWatermarkBlobPath+HighWatermarkBlobName, such as container/path/requestIdCache.txt.
Create a blob with content 0.
LogicAppURL: In this template, we use WebActivity to call Azure Logic Apps to set the high-
watermark value in Blob storage. Or, you can use Azure SQL Database to store it. Use a stored
procedure activity to update the value.
You must first create a logic app, as the following image shows. Then, paste in the HTTP POST URL.
a. Go to the Azure portal. Select a new Logic Apps service. Select +Blank Logic App to go to
Logic Apps Designer.
b. Create a trigger of When an HTTP request is received. Specify the HTTP request body as
follows:
{
"properties": {
"sapOpenHubMaxRequestId": {
"type": "string"
}
},
"type": "object"
}
c. Add a Create blob action. For Folder path and Blob name, use the same values that you
configured previously in HighWatermarkBlobPath and HighWatermarkBlobName.
d. Select Save. Then, copy the value of HTTP POST URL to use in the Data Factory pipeline.
4. After you provide the Data Factory pipeline parameters, select Debug > Finish to invoke a run to validate
the configuration. Or, select Publish All to publish the changes, and then select Trigger to execute a run.
You might increase the number of parallel running SAP work processes for the DTP:
For a full load OHD, choose different options than for delta extraction:
In OHD: Set the Extraction option to Delete Data and Insert Records. Otherwise, data will be extracted
many times when you repeat the DTP in a BW process chain.
In the DTP: Set Extraction Mode to Full. You must change the automatically created DTP from Delta to
Full immediately after the OHD is created, as this image shows:
In the BW Open Hub connector of Data Factory: Turn off Exclude last request. Otherwise, nothing will be
extracted.
You typically run the full DTP manually. Or, you can create a process chain for the full DTP. It's typically a separate
chain that's independent of your existing process chains. In either case, make sure that the DTP is finished before
you start the extraction by using Data Factory copy. Otherwise, only partial data will be copied.
Run delta extraction the first time
The first delta extraction is technically a full extraction. By default, the SAP BW Open Hub connector excludes the
last request when it copies data. For the first delta extraction, no data is extracted by the Data Factory copy activity
until a subsequent DTP generates delta data in the table with a separate request ID. There are two ways to avoid
this scenario:
Turn off the Exclude last request option for the first delta extraction. Make sure that the first delta DTP is
finished before you start the delta extraction the first time.
Use the procedure for resyncing the delta extraction, as described in the next section.
Resync delta extraction
The following scenarios change the data in SAP BW cubes but are not considered by the delta DTP:
SAP BW selective deletion (of rows by using any filter condition)
SAP BW request deletion (of faulty requests)
An SAP Open Hub Destination isn't a data-mart-controlled data target (in all SAP BW support packages since
2015). So, you can delete data from a cube without changing the data in the OHD. You must then resync the data of
the cube with Data Factory:
1. Run a full extraction in Data Factory (by using a full DTP in SAP ).
2. Delete all rows in the Open Hub table for the delta DTP.
3. Set the status of the delta DTP to Fetched.
After this, all subsequent delta DTPs and Data Factory delta extractions work as expected.
To set the status of the delta DTP to Fetched, you can use the following option to run the delta DTP manually:
Next steps
Learn about SAP BW Open Hub connector support:
SAP Business Warehouse Open Hub connector
Copy data from SAP Business Warehouse using
Azure Data Factory
1/3/2019 • 4 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from an SAP Business
Warehouse (BW ). It builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from SAP Business Warehouse to any supported sink data store. For a list of data stores that
are supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP Business Warehouse connector supports:
SAP Business Warehouse version 7.x.
Copying data from InfoCubes and QueryCubes (including BEx queries) using MDX queries.
Copying data using basic authentication.
Prerequisites
To use this SAP Business Warehouse connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the SAP NetWeaver library on the Integration Runtime machine. You can get the SAP NetWeaver
library from your SAP administrator, or directly from the SAP Software Download Center. Search for SAP
Note #1025361 to get the download location for the most recent version. Make sure that you pick the 64-bit
SAP NetWeaver library that matches your Integration Runtime installation. Then install all files included in
the SAP NetWeaver RFC SDK according to the SAP Note. The SAP NetWeaver library is also included in the
SAP Client Tools installation.
TIP
To troubleshoot connectivity issues to SAP BW, make sure:
All dependency libraries extracted from the NetWeaver RFC SDK are in place in the %windir%\system32 folder. Usually these include icudt34.dll, icuin34.dll, icuuc34.dll, libicudecnumber.dll, librfc32.dll, libsapucum.dll, sapcrypto.dll, sapcryto_old.dll, and sapnwrfc.dll.
The ports needed to connect to the SAP server are enabled on the Self-hosted IR machine, which are usually ports 3300 and 3201.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP Business Warehouse connector.
Example:
{
"name": "SapBwLinkedService",
"properties": {
"type": "SapBw",
"typeProperties": {
"server": "<server name>",
"systemNumber": "<system number>",
"clientId": "<client id>",
"userName": "<SAP user>",
"password": {
"type": "SecureString",
"value": "<Password for SAP user>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by SAP BW dataset.
To copy data from SAP BW, set the type property of the dataset to RelationalTable. No type-specific properties are supported for the SAP BW dataset of type RelationalTable.
Example:
{
"name": "SAPBWDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<SAP BW linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
Example:
"activities":[
{
"name": "CopyFromSAPBW",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP BW input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "<MDX query for SAP BW>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
SAP BW DATA TYPE DATA FACTORY INTERIM DATA TYPE
ACCP Int
CHAR String
CLNT String
CURR Decimal
CUKY String
DEC Decimal
FLTP Double
INT1 Byte
INT2 Int16
INT4 Int
LANG String
LCHR String
LRAW Byte[]
PREC Int16
QUAN Decimal
RAW Byte[]
RAWSTRING Byte[]
STRING String
UNIT String
DATS String
NUMC String
TIMS String
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP Cloud for Customer (C4C)
using Azure Data Factory
1/16/2019 • 4 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from/to SAP Cloud for
Customer (C4C ). It builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from SAP Cloud for Customer to any supported sink data store, or copy data from any
supported source data store to SAP Cloud for Customer. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this connector enables Azure Data Factory to copy data from/to SAP Cloud for Customer including
the SAP Cloud for Sales, SAP Cloud for Service, and SAP Cloud for Social Engagement solutions.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP Cloud for Customer connector.
connectVia: The Integration Runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime. Required: No for source, Yes for sink.
IMPORTANT
To copy data into SAP Cloud for Customer, explicitly create an Azure IR with a location near your SAP Cloud for Customer instance, and associate it in the linked service, as in the following example:
Example:
{
"name": "SAPC4CLinkedService",
"properties": {
"type": "SapCloudForCustomer",
"typeProperties": {
"url": "https://<tenantname>.crm.ondemand.com/sap/c4c/odata/v1/c4codata/" ,
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by SAP Cloud for Customer dataset.
To copy data from SAP Cloud for Customer, set the type property of the dataset to
SapCloudForCustomerResource. The following properties are supported:
Example:
{
"name": "SAPC4CDataset",
"properties": {
"type": "SapCloudForCustomerResource",
"typeProperties": {
"path": "<path e.g. LeadCollection>"
},
"linkedServiceName": {
"referenceName": "<SAP C4C linked service>",
"type": "LinkedServiceReference"
}
}
}
Example:
"activities":[
{
"name": "CopyFromSAPC4C",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP C4C input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapCloudForCustomerSource",
"query": "<custom query e.g. $top=10>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
writeBatchSize: The batch size of the write operation. The batch size that yields the best performance may differ by table or server. Required: No (default is 10).
Example:
"activities":[
{
"name": "CopyToSapC4c",
"type": "Copy",
"inputs": [{
"type": "DatasetReference",
"referenceName": "<dataset type>"
}],
"outputs": [{
"type": "DatasetReference",
"referenceName": "SapC4cDataset"
}],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SapCloudForCustomerSink",
"writeBehavior": "Insert",
"writeBatchSize": 30
},
"parallelCopies": 10,
"dataIntegrationUnits": 4,
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": {
"referenceName": "ErrorLogBlobLinkedService",
"type": "LinkedServiceReference"
},
"path": "incompatiblerows"
}
}
}
]
SAP C4C ODATA DATA TYPE DATA FACTORY INTERIM DATA TYPE
Edm.Binary Byte[]
Edm.Boolean Bool
Edm.Byte Byte[]
Edm.DateTime DateTime
Edm.Decimal Decimal
Edm.Double Double
Edm.Single Single
Edm.Guid Guid
Edm.Int16 Int16
Edm.Int32 Int32
Edm.Int64 Int64
Edm.SByte Int16
Edm.String String
Edm.Time TimeSpan
Edm.DateTimeOffset DateTimeOffset
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP ECC using Azure Data Factory
5/24/2019 • 4 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from SAP ECC (SAP
Enterprise Central Component). It builds on the copy activity overview article that presents a general overview of
copy activity.
Supported capabilities
You can copy data from SAP ECC to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP ECC connector supports:
Copying data from SAP ECC on SAP NetWeaver version 7.0 and above.
Copying data from any objects exposed by SAP ECC OData services (for example, SAP tables or views, BAPI, data extractors, and so on), or data/IDOCs sent to SAP PI that can be received as OData via the relevant adapters.
Copying data using basic authentication.
TIP
To copy data from SAP ECC via an SAP table or view, you can use the SAP Table connector, which is more performant and scalable.
Prerequisites
Generally, SAP ECC exposes entities via OData services through SAP Gateway. To use this SAP ECC connector,
you need to:
Set up SAP Gateway. For servers with SAP NetWeaver versions higher than 7.4, SAP Gateway is
already installed. Otherwise, you need to install the embedded Gateway or a Gateway hub before exposing SAP
ECC data through OData services. Learn how to set up SAP Gateway from the installation guide.
Activate and configure the SAP OData service. You can activate the OData service through TCODE SICF
in seconds. You can also configure which objects need to be exposed. Here is a sample step-by-step
guide.
Getting started
You can create a pipeline with copy activity using .NET SDK, Python SDK, Azure PowerShell, REST API, or Azure
Resource Manager template. See Copy activity tutorial for step-by-step instructions to create a pipeline with a
copy activity.
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP ECC connector.
Example:
{
"name": "SapECCLinkedService",
"properties": {
"type": "SapEcc",
"typeProperties": {
"url": "<SAP ECC OData url e.g. http://eccsvrname:8000/sap/opu/odata/sap/zgw100_dd02l_so_srv/>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by SAP ECC dataset.
To copy data from SAP ECC, set the type property of the dataset to SapEccResource. The following properties
are supported:
Example
{
"name": "SapEccDataset",
"properties": {
"type": "SapEccResource",
"typeProperties": {
"path": "<entity path e.g. dd04tentitySet>"
},
"linkedServiceName": {
"referenceName": "<SAP ECC linked service name>",
"type": "LinkedServiceReference"
}
}
}
Example:
"activities":[
{
"name": "CopyFromSAPECC",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP ECC input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapEccSource",
"query": "$top=10"
},
"sink": {
"type": "<sink type>"
}
}
}
]
ODATA DATA TYPE DATA FACTORY INTERIM DATA TYPE
Edm.Binary String
Edm.Boolean Bool
Edm.Byte String
Edm.DateTime DateTime
Edm.Decimal Decimal
Edm.Double Double
Edm.Single Single
Edm.Guid String
Edm.Int16 Int16
Edm.Int32 Int32
Edm.Int64 Int64
Edm.SByte Int16
Edm.String String
Edm.Time TimeSpan
Edm.DateTimeOffset DateTimeOffset
NOTE
Complex data types are not currently supported.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP HANA using Azure Data
Factory
1/3/2019 • 4 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from an SAP HANA
database. It builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from SAP HANA database to any supported sink data store. For a list of data stores supported
as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP HANA connector supports:
Copying data from any version of SAP HANA database.
Copying data from HANA information models (such as Analytic and Calculation views) and Row/Column
tables using SQL queries.
Copying data using Basic or Windows authentication.
NOTE
To copy data into an SAP HANA data store, use the generic ODBC connector. See the SAP HANA sink section for details. Note that the linked services for the SAP HANA connector and the ODBC connector have different types and therefore cannot be reused.
Prerequisites
To use this SAP HANA connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the SAP HANA ODBC driver on the Integration Runtime machine. You can download the SAP HANA
ODBC driver from the SAP Software Download Center. Search with the keyword SAP HANA CLIENT for
Windows.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP HANA connector.
Linked service properties
The following properties are supported for SAP HANA linked service:
Example:
{
"name": "SapHanaLinkedService",
"properties": {
"type": "SapHana",
"typeProperties": {
"server": "<server>:<port (optional)>",
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by SAP HANA dataset.
To copy data from SAP HANA, set the type property of the dataset to RelationalTable. No type-specific properties are supported for the SAP HANA dataset of type RelationalTable.
Example:
{
"name": "SAPHANADataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<SAP HANA linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
Example:
"activities":[
{
"name": "CopyFromSAPHANA",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP HANA input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "<SQL query for SAP HANA>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
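As a sketch of what the SQL query placeholder above might contain, the following queries read a plain column table and an activated information model. The schema, package, and view names here are illustrative assumptions, not values from this article:
-- Read a row/column table directly (schema and table names are illustrative).
SELECT * FROM "MYSCHEMA"."MYTABLE";
-- Read an activated information model (analytic or calculation view).
-- Activated views are exposed as column views under the _SYS_BIC schema.
SELECT * FROM "_SYS_BIC"."mypackage/CV_SALES";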
SAP HANA DATA TYPE DATA FACTORY INTERIM DATA TYPE
ALPHANUM String
BIGINT Int64
BLOB Byte[]
BOOLEAN Byte
CLOB Byte[]
DATE DateTime
DECIMAL Decimal
DOUBLE Single
INT Int32
NVARCHAR String
REAL Single
SECONDDATE DateTime
SMALLINT Int16
TIME TimeSpan
TIMESTAMP DateTime
TINYINT Byte
VARCHAR String
Known limitations
There are a few known limitations when copying data from SAP HANA:
NVARCHAR strings are truncated to a maximum length of 4,000 Unicode characters.
SMALLDECIMAL is not supported.
VARBINARY is not supported.
Valid dates are between 1899/12/30 and 9999/12/31.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP Table using Azure Data Factory
5/24/2019 • 7 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from an SAP Table. It builds
on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from SAP Table to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP Table connector supports:
Copying data from SAP Table in SAP Business Suite with version 7.01 or higher (in a recent SAP Support
Package Stack released after the year 2015) or S/4HANA.
Copying data from both SAP Transparent Table and View.
Copying data using basic authentication or SNC (Secure Network Communications) if SNC is configured.
Connecting to Application Server or Message Server.
Prerequisites
To use this SAP Table connector, you need to:
Set up a Self-hosted Integration Runtime with version 3.17 or above. See Self-hosted Integration Runtime
article for details.
Download the 64-bit SAP .NET Connector 3.0 from SAP's website, and install it on the Self-hosted IR
machine. When installing, in the optional setup steps window, make sure you select the Install Assemblies
to GAC option.
The SAP user that the Data Factory SAP Table connector uses must have the following permissions:
Authorization for RFC.
Permissions to the "Execute" Activity of Authorization Object "S_SDSAUTH".
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP Table connector.
language: Language that the SAP system uses. Required: No (default value is EN).
{
"name": "SapTableLinkedService",
"properties": {
"type": "SapTable",
"typeProperties": {
"messageServer": "<message server name>",
"messageServerService": "<service name or port>",
"systemId": "<system id>",
"logonGroup": "<logon group>",
"clientId": "<client id>",
"userName": "<SAP user>",
"password": {
"type": "SecureString",
"value": "<Password for SAP user>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the SAP Table dataset.
To copy data from an SAP table, set the type property of the dataset to SapTableResource, and set tableName to the name of the SAP table or view to copy data from.
Example:
{
"name": "SAPTableDataset",
"properties": {
"type": "SapTableResource",
"linkedServiceName": {
"referenceName": "<SAP Table linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<SAP table name>"
}
}
}
Example:
"activities":[
{
"name": "CopyFromSAPTable",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP Table input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapTableSource",
"partitionOption": "PartitionOnInt",
"partitionSettings": {
"partitionColumnName": "<partition column name>",
"partitionUpperBound": "2000",
"partitionLowerBound": "1",
"maxPartitionsNumber": 500
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
SAP ABAP TYPE DATA FACTORY INTERIM DATA TYPE
C (String) String
I (Integer) Int32
F (Float) Double
D (Date) String
T (Time) String
N (Numeric) String
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from ServiceNow using Azure Data
Factory
1/16/2019 • 4 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from ServiceNow. It builds
on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from ServiceNow to any supported sink data store. For a list of data stores that are supported
as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any driver to use this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
ServiceNow connector.
Example:
{
"name": "ServiceNowLinkedService",
"properties": {
"type": "ServiceNow",
"typeProperties": {
"endpoint" : "http://<instance>.service-now.com",
"authenticationType" : "Basic",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by ServiceNow dataset.
To copy data from ServiceNow, set the type property of the dataset to ServiceNowObject. The following
properties are supported:
Example
{
"name": "ServiceNowDataset",
"properties": {
"type": "ServiceNowObject",
"linkedServiceName": {
"referenceName": "<ServiceNow linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Actual.alm_asset". Required: No (if "tableName" in dataset is specified).
Note the following when specifying the schema and column for ServiceNow in the query, and refer to the Performance tips section for the copy performance implications.
Schema: specify the schema as Actual or Display in the ServiceNow query. You can think of this as the sysparm_display_value parameter (true or false) when calling the ServiceNow REST APIs.
Column: the column name for the actual value under the Actual schema is [column name]_value, while the column name for the display value under the Display schema is [column name]_display_value. Note that the column name needs to map to the schema being used in the query.
Sample query: SELECT col_value FROM Actual.alm_asset or SELECT col_display_value FROM Display.alm_asset
Example:
"activities":[
{
"name": "CopyFromServiceNow",
"type": "Copy",
"inputs": [
{
"referenceName": "<ServiceNow input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ServiceNowSource",
"query": "SELECT * FROM Actual.alm_asset"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Performance tips
Schema to use
ServiceNow has two different schemas: "Actual", which returns actual data, and "Display", which returns the display values of data.
If you have a filter in your query, use the "Actual" schema, which has better copy performance. When you query against the "Actual" schema, ServiceNow natively supports the filter when fetching the data and returns only the filtered result set. When you query the "Display" schema, ADF retrieves all the data and applies the filter internally.
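For example, a query like the following sketch pushes the filter down to ServiceNow when the Actual schema is used; the same predicate against the Display schema would be applied by ADF only after all rows are fetched. The alm_asset table comes from the earlier sample, while the cost column and the filter value are illustrative assumptions:
-- Filter evaluated by ServiceNow itself when the Actual schema is used.
-- Column names under the Actual schema carry the "_value" suffix; "cost" is an assumed column.
SELECT sys_id_value, cost_value
FROM Actual.alm_asset
WHERE cost_value = '1000'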
Index
A ServiceNow table index can help improve query performance. For more information, see Create a table index.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SFTP server using Azure Data
Factory
5/6/2019 • 12 minutes to read • Edit Online
This article outlines how to copy data from SFTP server. To learn about Azure Data Factory, read the introductory
article.
Supported capabilities
This SFTP connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Specifically, this SFTP connector supports:
Copying files using Basic or SshPublicKey authentication.
Copying files as-is or parsing files with the supported file formats and compression codecs.
Get started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SFTP.
hostKeyFingerprint: Specify the fingerprint of the host key. Required: Yes if "skipHostKeyValidation" is set to false.
Example:
{
"name": "SftpLinkedService",
"type": "linkedservices",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "<sftp server>",
"port": 22,
"skipHostKeyValidation": false,
"hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00",
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
privateKeyPath: Specify the absolute path to the private key file that Integration Runtime can access. Applies only when the Self-hosted type of Integration Runtime is specified in "connectVia". Required: Specify either privateKeyPath or privateKeyContent.
privateKeyContent: Base64 encoded SSH private key content. The SSH private key should be in OpenSSH format. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Specify either privateKeyPath or privateKeyContent.
passPhrase: Specify the pass phrase/password to decrypt the private key if the key file is protected by a pass phrase. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes if the private key file is protected by a pass phrase.
NOTE
The SFTP connector supports RSA/DSA OpenSSH keys. Make sure that your key file content starts with "-----BEGIN [RSA/DSA]
PRIVATE KEY-----". If the private key file is in PPK format, use the PuTTY tool to convert it from .ppk to OpenSSH
format.
{
"name": "SftpLinkedService",
"type": "Linkedservices",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "<sftp server>",
"port": 22,
"skipHostKeyValidation": true,
"authenticationType": "SshPublicKey",
"userName": "<username>",
"privateKeyContent": {
"type": "SecureString",
"value": "<base64 string of the private key content>"
},
"passPhrase": {
"type": "SecureString",
"value": "<pass phrase>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
For Parquet and delimited text format, refer to Parquet and delimited text format dataset section.
For other formats like ORC/Avro/JSON/Binary format, refer to Other format dataset section.
Parquet and delimited text format dataset
To copy data from SFTP in Parquet or delimited text format, refer to Parquet format and Delimited text
format article on format-based dataset and supported settings. The following properties are supported for SFTP
under location settings in format-based dataset:
NOTE
The FileShare type dataset with Parquet/Text format mentioned in the next section is still supported as-is for the
Copy/Lookup/GetMetadata activities for backward compatibility. We recommend that you use the new model going forward;
the ADF authoring UI has switched to generating these new types.
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<SFTP linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "SftpLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
format: If you want to copy files as-is between file-based stores (binary copy), skip the format section in both the input and output dataset definitions. Required: No (only for binary copy scenario).
NOTE
If you were using "fileFilter" property for file filter, it is still supported as-is, while you are suggested to use the new filter
capability added to "fileName" going forward.
Example:
{
"name": "SFTPDataset",
"type": "Datasets",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<SFTP linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
wildcardFileName: The file name with wildcard characters under the given folderPath/wildcardFolderPath to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual file name has a wildcard or this escape char inside. See more examples in Folder and file filter examples. Required: Yes if fileName is not specified in the dataset.
NOTE
For Parquet/delimited text format, the FileSystemSource type copy activity source mentioned in the next section is still supported
as-is for backward compatibility. We recommend that you use the new model going forward; the ADF authoring UI has
switched to generating these new types.
Example:
"activities":[
{
"name": "CopyFromSFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSetting",
"skipLineCount": 10
},
"storeSettings":{
"type": "SftpReadSetting",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example:
"activities":[
{
"name": "CopyFromSFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<SFTP input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Shopify using Azure Data Factory
(Preview)
1/3/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Shopify. It builds on the
copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Shopify to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any driver to use this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Shopify connector.
Example:
{
"name": "ShopifyLinkedService",
"properties": {
"type": "Shopify",
"typeProperties": {
"host" : "mystore.myshopify.com",
"accessToken": {
"type": "SecureString",
"value": "<accessToken>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Shopify dataset.
To copy data from Shopify, set the type property of the dataset to ShopifyObject. The following properties are
supported:
Example
{
"name": "ShopifyDataset",
"properties": {
"type": "ShopifyObject",
"linkedServiceName": {
"referenceName": "<Shopify linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM "Products" WHERE Product_Id = '123'". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromShopify",
"type": "Copy",
"inputs": [
{
"referenceName": "<Shopify input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ShopifySource",
"query": "SELECT * FROM \"Products\" WHERE Product_Id = '123'"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Spark using Azure Data Factory
1/3/2019 • 4 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Spark. It builds on the
copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from Spark to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any driver to use this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Spark connector.
{
"name": "SparkLinkedService",
"properties": {
"type": "Spark",
"typeProperties": {
"host" : "<cluster>.azurehdinsight.net",
"port" : "<port>",
"authenticationType" : "WindowsAzureHDInsightService",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Spark dataset.
To copy data from Spark, set the type property of the dataset to SparkObject. The following properties are
supported:
Example
{
"name": "SparkDataset",
"properties": {
"type": "SparkObject",
"linkedServiceName": {
"referenceName": "<Spark linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromSpark",
"type": "Copy",
"inputs": [
{
"referenceName": "<Spark input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SparkSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data to and from SQL Server using Azure Data
Factory
5/6/2019 • 12 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from and to an SQL Server
database. It builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from/to SQL Server database to any supported sink data store, or copy data from any
supported source data store to SQL Server database. For a list of data stores that are supported as sources/sinks
by the copy activity, see the Supported data stores table.
Specifically, this SQL Server connector supports:
SQL Server version 2016, 2014, 2012, 2008 R2, 2008, and 2005
Copying data using SQL or Windows authentication.
As source, retrieving data using SQL query or stored procedure.
As sink, appending data to destination table or invoking a stored procedure with custom logic during copy.
SQL Server Always Encrypted is not currently supported.
Prerequisites
To copy data from a SQL Server database that is not publicly accessible, you need to set up a Self-hosted
Integration Runtime. See Self-hosted Integration Runtime article for details. The Integration Runtime provides a
built-in SQL Server database driver, therefore you don't need to manually install any driver when copying data
from/to SQL Server database.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SQL Server database connector.
TIP
If you hit an error with error code "UserErrorFailedToConnectToSqlServer" and a message like "The session limit for the
database is XXX and has been reached.", add Pooling=false to your connection string and try again.
{
"name": "SqlServerLinkedService",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Data Source=<servername>\\<instance name if using named instance>;Initial Catalog=
<databasename>;Integrated Security=False;User ID=<username>;"
},
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by SQL Server dataset.
To copy data from/to SQL Server database, the following properties are supported:
tableName: Name of the table or view in the SQL Server database instance that the linked service refers to. Required: No for source, Yes for sink.
Example:
{
"name": "SQLServerDataset",
"properties":
{
"type": "SqlServerTable",
"linkedServiceName": {
"referenceName": "<SQL Server linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"tableName": "MyTable"
}
}
}
Points to note:
If the sqlReaderQuery is specified for the SqlSource, the Copy Activity runs this query against the SQL
Server source to get the data. Alternatively, you can specify a stored procedure by specifying the
sqlReaderStoredProcedureName and storedProcedureParameters (if the stored procedure takes
parameters).
If you do not specify either "sqlReaderQuery" or "sqlReaderStoredProcedureName", the columns defined in
the "structure" section of the dataset JSON are used to construct a query (
select column1, column2 from mytable ) to run against the SQL Server. If the dataset definition does not have
the "structure", all columns are selected from the table.
Example: using SQL query
"activities":[
{
"name": "CopyFromSQLServer",
"type": "Copy",
"inputs": [
{
"referenceName": "<SQL Server input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
"activities":[
{
"name": "CopyFromSQLServer",
"type": "Copy",
"inputs": [
{
"referenceName": "<SQL Server input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type": "Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
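The preceding activity assumes a source stored procedure named CopyTestSrcStoredProcedureWithParameters that accepts the two parameters shown. A minimal sketch of such a procedure follows; the source table and its columns are illustrative assumptions:
CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters
(
    @stringData varchar(20),
    @identifier int
)
AS
SET NOCOUNT ON;
BEGIN
    -- Return the rows that the copy activity should read.
    -- dbo.SourceTbl and its columns are illustrative.
    SELECT *
    FROM dbo.SourceTbl
    WHERE stringData = @stringData
      AND identifier = @identifier;
END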
TIP
When copying data to SQL Server, the copy activity appends data to the sink table by default. To perform an UPSERT or
additional business logic, use the stored procedure in SqlSink. Learn more details from Invoking stored procedure for SQL
Sink.
"activities":[
{
"name": "CopyToSQLServer",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<SQL Server output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 100000
}
}
}
]
Destination table:
{
"name": "SampleTarget",
"properties": {
"structure": [
{ "name": "name" },
{ "name": "age" }
],
"type": "SqlServerTable",
"linkedServiceName": {
"referenceName": "TestIdentitySQL",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "TargetTbl"
}
}
}
Notice that your source and target tables have different schemas: the target has an additional identity column. In this scenario, you need to specify the structure property in the target dataset definition, and it must not include the identity column.
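As a sketch, the target table in this scenario might look like the following. The identity column name (id) is a hypothetical assumption; only name and age appear in the dataset structure above:
-- Target table with an identity column that the copy activity must not write to.
-- "id" is a hypothetical identity column name.
CREATE TABLE dbo.TargetTbl
(
    id int IDENTITY(1,1) NOT NULL,
    name varchar(100) NULL,
    age int NULL
);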
"sink": {
"type": "SqlSink",
"SqlWriterTableType": "MarketingType",
"SqlWriterStoredProcedureName": "spOverwriteMarketing",
"storedProcedureParameters": {
"category": {
"value": "ProductA"
}
}
}
In your database, define the stored procedure with the same name as SqlWriterStoredProcedureName. It handles input data from your specified source and merges it into the output table. The parameter name of the table type in the stored procedure should be the same as the tableName defined in the dataset.
In your database, define the table type with the same name as sqlWriterTableType. Note that the schema of the table type should be the same as the schema returned by your input data.
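A minimal sketch of such a stored procedure and table type follows. The procedure and type names come from the sink definition above; the Marketing output table, its columns, and the table-valued parameter name are illustrative assumptions:
-- Table type whose schema must match the data returned by the copy activity source.
-- The column list is an illustrative assumption.
CREATE TYPE [dbo].[MarketingType] AS TABLE
(
    ProductID int NULL,
    ProductName varchar(256) NULL,
    MarketingBudget decimal(18, 2) NULL
);
GO
-- Stored procedure invoked by the SqlSink. The table-valued parameter name (here assumed
-- to be Marketing) should match the tableName defined in the dataset.
CREATE PROCEDURE [dbo].[spOverwriteMarketing]
    @Marketing [dbo].[MarketingType] READONLY,
    @category varchar(256)
AS
BEGIN
    MERGE [dbo].[Marketing] AS target
    USING @Marketing AS source
    ON target.ProductID = source.ProductID AND target.Category = @category
    WHEN MATCHED THEN
        UPDATE SET target.ProductName = source.ProductName,
                   target.MarketingBudget = source.MarketingBudget
    WHEN NOT MATCHED THEN
        INSERT (ProductID, ProductName, MarketingBudget, Category)
        VALUES (source.ProductID, source.ProductName, source.MarketingBudget, @category);
END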
SQL SERVER DATA TYPE DATA FACTORY INTERIM DATA TYPE
bigint Int64
binary Byte[]
bit Boolean
date DateTime
Datetime DateTime
datetime2 DateTime
Datetimeoffset DateTimeOffset
Decimal Decimal
Float Double
image Byte[]
int Int32
money Decimal
numeric Decimal
real Single
rowversion Byte[]
smalldatetime DateTime
smallint Int16
smallmoney Decimal
sql_variant Object
time TimeSpan
timestamp Byte[]
tinyint Int16
uniqueidentifier Guid
varbinary Byte[]
xml Xml
NOTE
For data types that map to the Decimal interim type, ADF currently supports precision up to 28. If you have data with precision
larger than 28, consider converting it to a string in the SQL query.
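For example, a source query can cast the high-precision column to a string before the copy; the table and column names are illustrative:
-- Convert a decimal column whose precision exceeds 28 to a string on the source side.
SELECT CAST(HighPrecisionCol AS varchar(50)) AS HighPrecisionCol
FROM MyTable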
See Configure the remote access Server Configuration Option for detailed steps.
2. Launch SQL Server Configuration Manager. Expand SQL Server Network Configuration for the
instance you want, and select Protocols for MSSQLSERVER. You should see protocols in the right-pane.
Enable TCP/IP by right-clicking TCP/IP and clicking Enable.
See Enable or Disable a Server Network Protocol for details and alternate ways of enabling TCP/IP
protocol.
3. In the same window, double-click TCP/IP to launch TCP/IP Properties window.
4. Switch to the IP Addresses tab. Scroll down to see IPAll section. Note down the TCP Port (default is
1433).
5. Create a rule for the Windows Firewall on the machine to allow incoming traffic through this port.
6. Verify connection: To connect to the SQL Server using fully qualified name, use SQL Server
Management Studio from a different machine. For example: "<machine>.<domain>.corp.<company>.com,1433" .
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Square using Azure Data Factory
(Preview)
3/21/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Square. It builds on the
copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Square to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any driver to use this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Square connector.
Example:
{
"name": "SquareLinkedService",
"properties": {
"type": "Square",
"typeProperties": {
"host" : "mystore.mysquare.com",
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"redirectUri" : "http://localhost:2500"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Square dataset.
To copy data from Square, set the type property of the dataset to SquareObject. The following properties are
supported:
Example
{
"name": "SquareDataset",
"properties": {
"type": "SquareObject",
"linkedServiceName": {
"referenceName": "<Square linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Business". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromSquare",
"type": "Copy",
"inputs": [
{
"referenceName": "<Square input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SquareSource",
"query": "SELECT * FROM Business"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Sybase using Azure Data Factory
1/3/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a Sybase database. It
builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from Sybase database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Sybase connector supports:
SAP Sybase SQL Anywhere (ASA) version 16 and above; IQ and ASE are not supported.
Copying data using Basic or Windows authentication.
Prerequisites
To use this Sybase connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the data provider for Sybase iAnywhere.Data.SQLAnywhere 16 or above on the Integration Runtime
machine.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Sybase connector.
Example:
{
"name": "SybaseLinkedService",
"properties": {
"type": "Sybase",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Sybase dataset.
To copy data from Sybase, set the type property of the dataset to RelationalTable. The following properties are
supported:
tableName: Name of the table in the Sybase database. Required: No (if "query" in activity source is specified).
Example
{
"name": "SybaseDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<Sybase linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromSybase",
"type": "Copy",
"inputs": [
{
"referenceName": "<Sybase input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Teradata using Azure Data Factory
1/3/2019 • 4 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a Teradata database. It
builds on the copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from Teradata database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Teradata connector supports:
Teradata version 12 and above.
Copying data using Basic or Windows authentication.
Prerequisites
To use this Teradata connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the .NET Data Provider for Teradata version 14 or above on the Integration Runtime machine.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Teradata connector.
Example:
{
"name": "TeradataLinkedService",
"properties": {
"type": "Teradata",
"typeProperties": {
"server": "<server>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Teradata dataset.
To copy data from Teradata, set the type property of the dataset to RelationalTable. The following properties are
supported:
tableName: Name of the table in the Teradata database. Required: No (if "query" in activity source is specified).
Example:
{
"name": "TeradataDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<Teradata linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromTeradata",
"type": "Copy",
"inputs": [
{
"referenceName": "<Teradata input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
TERADATA DATA TYPE DATA FACTORY INTERIM DATA TYPE
BigInt Int64
Blob Byte[]
Byte Byte[]
ByteInt Int16
Char String
Clob String
Date DateTime
Decimal Decimal
Double Double
Graphic String
Integer Int32
Number Double
Period(Date) String
Period(Time) String
Period(Timestamp) String
SmallInt Int16
Time TimeSpan
Timestamp DateTime
VarByte Byte[]
VarChar String
VarGraphic String
Xml String
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Vertica using Azure Data Factory
2/1/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Vertica. It builds on the
copy activity overview article that presents a general overview of copy activity.
Supported capabilities
You can copy data from Vertica to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any driver to use this connector.
Getting started
You can create a pipeline with copy activity using .NET SDK, Python SDK, Azure PowerShell, REST API, or Azure
Resource Manager template. See Copy activity tutorial for step-by-step instructions to create a pipeline with a
copy activity.
The following sections provide details about properties that are used to define Data Factory entities specific to
Vertica connector.
Example:
{
"name": "VerticaLinkedService",
"properties": {
"type": "Vertica",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;PWD=<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example: store the password in Azure Key Vault
{
"name": "VerticaLinkedService",
"properties": {
"type": "Vertica",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;"
},
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section provides a list of properties supported by the Vertica dataset.
To copy data from Vertica, set the type property of the dataset to VerticaTable. The following property is supported:
tableName — Name of the table. Required: No (if "query" in the activity source is specified).
Example
{
"name": "VerticaDataset",
"properties": {
"type": "VerticaTable",
"linkedServiceName": {
"referenceName": "<Vertica linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
To copy data from Vertica, set the source type in the copy activity to VerticaSource. The following property is supported in the copy activity source section:
query — Use the custom SQL query to read data, for example: "SELECT * FROM MyTable". Required: No (if "tableName" in the dataset is specified).
Example:
"activities":[
{
"name": "CopyFromVertica",
"type": "Copy",
"inputs": [
{
"referenceName": "<Vertica input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "VerticaSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Web table by using Azure Data
Factory
1/3/2019 • 4 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a Web table. It builds on the copy activity overview article that presents a general overview of copy activity.
The differences between this Web table connector, the REST connector, and the HTTP connector are:
The Web table connector extracts table content from an HTML webpage.
The REST connector specifically supports copying data from RESTful APIs.
The HTTP connector is generic; it retrieves data from any HTTP endpoint, for example, to download a file.
Supported capabilities
You can copy data from a Web table to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Web table connector supports extracting table content from an HTML page.
Prerequisites
To use this Web table connector, you need to set up a Self-hosted Integration Runtime. See Self-hosted
Integration Runtime article for details.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Web table connector.
Example:
{
"name": "WebLinkedService",
"properties": {
"type": "Web",
"typeProperties": {
"url" : "https://en.wikipedia.org/wiki/",
"authenticationType": "Anonymous"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Web table dataset.
To copy data from Web table, set the type property of the dataset to WebTable. The following properties are
supported:
index — The index of the table in the resource. Required: Yes.
path — A relative URL to the resource that contains the table. Required: No. When path is not specified, only the URL specified in the linked service definition is used.
Example:
{
"name": "WebTableInput",
"properties": {
"type": "WebTable",
"linkedServiceName": {
"referenceName": "<Web linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"index": 1,
"path": "AFI's_100_Years...100_Movies"
}
}
}
"activities":[
{
"name": "CopyFromWebTable",
"type": "Copy",
"inputs": [
{
"referenceName": "<Web table input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "WebSource"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Get index of a table in an HTML page
To get the index of a table, load the webpage with an Excel 2016 web query. In the Query Editor window, click the Advanced Editor button on the toolbar; in the Advanced Editor dialog box, the number next to "Source" is the index.
If you are using Excel 2013, use Microsoft Power Query for Excel to get the index. See the Connect to a web page article for details. The steps are similar if you are using Microsoft Power BI Desktop.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Xero using Azure Data Factory
(Preview)
1/3/2019 • 4 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Xero. It builds on the
copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and provide feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Xero to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Xero connector supports:
Xero private applications but not public applications.
All Xero tables (API endpoints) except "Reports".
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install any driver to use this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Xero connector.
privateKey — The private key from the .pem file that was generated for your Xero private application (see Create a public/private key pair). Generate the privatekey.pem with numbits of 512 using openssl genrsa -out privatekey.pem 512 ; 1024 is not supported. Include all the text from the .pem file, including the Unix line endings (\n); see the sample after the linked service example below. Required: Yes.
Example:
{
"name": "XeroLinkedService",
"properties": {
"type": "Xero",
"typeProperties": {
"host" : "api.xero.com",
"consumerKey": {
"type": "SecureString",
"value": "<consumerKey>"
},
"privateKey": {
"type": "SecureString",
"value": "<privateKey>"
}
}
}
}
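The privateKey value is the full text of the .pem file with the Unix line endings encoded as \n. A minimal sketch of what that value looks like inside the linked service (the key material here is a placeholder, not a real key):
"privateKey": {
    "type": "SecureString",
    "value": "-----BEGIN RSA PRIVATE KEY-----\n<first line of key text>\n<more lines of key text>\n-----END RSA PRIVATE KEY-----\n"
}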
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section provides a list of properties supported by the Xero dataset.
To copy data from Xero, set the type property of the dataset to XeroObject. The following property is supported:
tableName — Name of the table. Required: No (if "query" in the activity source is specified).
Example
{
"name": "XeroDataset",
"properties": {
"type": "XeroObject",
"linkedServiceName": {
"referenceName": "<Xero linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
To copy data from Xero, set the source type in the copy activity to XeroSource. The following property is supported in the copy activity source section:
query — Use the custom SQL query to read data, for example: "SELECT * FROM Contacts". Required: No (if "tableName" in the dataset is specified).
Example:
"activities":[
{
"name": "CopyFromXero",
"type": "Copy",
"inputs": [
{
"referenceName": "<Xero input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "XeroSource",
"query": "SELECT * FROM Contacts"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Note the following when specifying the Xero query:
Tables with complex items are split into multiple tables. For example, Bank transactions has a complex data structure "LineItems", so the data for a bank transaction is mapped to the tables Bank_Transaction and Bank_Transaction_Line_Items, with Bank_Transaction_ID as the foreign key that links them together.
Xero data is available through two schemas: Minimal (default) and Complete. The Complete schema contains prerequisite call tables, which require additional data (for example, an ID column) before you can make the desired query. A query example against the Complete schema follows the table lists below.
The following tables have the same information in the Minimal and Complete schemas. To reduce the number of API calls, use the Minimal schema (default).
Bank_Transactions
Contact_Groups
Contacts
Contacts_Sales_Tracking_Categories
Contacts_Phones
Contacts_Addresses
Contacts_Purchases_Tracking_Categories
Credit_Notes
Credit_Notes_Allocations
Expense_Claims
Expense_Claim_Validation_Errors
Invoices
Invoices_Credit_Notes
Invoices_Prepayments
Invoices_Overpayments
Manual_Journals
Overpayments
Overpayments_Allocations
Prepayments
Prepayments_Allocations
Receipts
Receipt_Validation_Errors
Tracking_Categories
The following tables can only be queried with complete schema:
Complete.Bank_Transaction_Line_Items
Complete.Bank_Transaction_Line_Item_Tracking
Complete.Contact_Group_Contacts
Complete.Contacts_Contact_Persons
Complete.Credit_Note_Line_Items
Complete.Credit_Notes_Line_Items_Tracking
Complete.Expense_Claim_Payments
Complete.Expense_Claim_Receipts
Complete.Invoice_Line_Items
Complete.Invoices_Line_Items_Tracking
Complete.Manual_Journal_Lines
Complete.Manual_Journal_Line_Tracking
Complete.Overpayment_Line_Items
Complete.Overpayment_Line_Items_Tracking
Complete.Prepayment_Line_Items
Complete.Prepayment_Line_Item_Tracking
Complete.Receipt_Line_Items
Complete.Receipt_Line_Item_Tracking
Complete.Tracking_Category_Options
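As referenced above, querying one of these Complete-only tables means qualifying the table name with the Complete schema in the copy activity source. A minimal sketch that reuses the source shape from the earlier activity example (the table chosen here is just an illustration):
"source": {
    "type": "XeroSource",
    "query": "SELECT * FROM Complete.Invoice_Line_Items"
}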
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported data stores.
Copy data from Zoho using Azure Data Factory
(Preview)
1/3/2019 • 3 minutes to read • Edit Online
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Zoho. It builds on the
copy activity overview article that presents a general overview of copy activity.
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, please contact Azure support.
Supported capabilities
You can copy data from Zoho to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install any driver to use this connector.
Getting started
You can use one of the following tools or SDKs to use the copy activity with a pipeline. Select a link for step-by-
step instructions:
Copy Data tool
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Zoho connector.
Example:
{
"name": "ZohoLinkedService",
"properties": {
"type": "Zoho",
"typeProperties": {
"endpoint" : "crm.zoho.com/crm/private",
"accessToken": {
"type": "SecureString",
"value": "<accessToken>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section provides a list of properties supported by the Zoho dataset.
To copy data from Zoho, set the type property of the dataset to ZohoObject. The following property is supported:
tableName — Name of the table. Required: No (if "query" in the activity source is specified).
Example
{
"name": "ZohoDataset",
"properties": {
"type": "ZohoObject",
"linkedServiceName": {
"referenceName": "<Zoho linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
To copy data from Zoho, set the source type in the copy activity to ZohoSource. The following property is supported in the copy activity source section:
query — Use the custom SQL query to read data, for example: "SELECT * FROM Accounts". Required: No (if "tableName" in the dataset is specified).
Example:
"activities":[
{
"name": "CopyFromZoho",
"type": "Copy",
"inputs": [
{
"referenceName": "<Zoho input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ZohoSource",
"query": "SELECT * FROM Accounts"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy Activity in Azure Data Factory
5/6/2019 • 12 minutes to read • Edit Online
Overview
In Azure Data Factory, you can use Copy Activity to copy data among data stores
located on-premises and in the cloud. After the data is copied, it can be further
transformed and analyzed. You can also use Copy Activity to publish transformation
and analysis results for business intelligence (BI) and application consumption.
Supported data stores and formats
Copy Activity supports the data stores listed below. For each data store, the check marks cover: supported as a source, supported as a sink, supported by the Azure integration runtime (IR), and supported by a self-hosted IR; see the individual connector articles for the exact capability matrix.
Azure:
Azure Cosmos DB (SQL API) ✓ ✓ ✓ ✓
Azure Cosmos DB's API for MongoDB ✓ ✓ ✓ ✓
Azure Data Explorer ✓ ✓ ✓ ✓
Azure Data Lake Storage Gen1 ✓ ✓ ✓ ✓
Azure Data Lake Storage Gen2 ✓ ✓ ✓ ✓
Azure Database for MariaDB ✓ ✓ ✓
Azure Database for MySQL ✓ ✓ ✓
Azure Database for PostgreSQL ✓ ✓ ✓
Azure File Storage ✓ ✓ ✓ ✓
Azure SQL Database ✓ ✓ ✓ ✓
Azure SQL Database Managed Instance ✓ ✓ ✓
Azure SQL Data Warehouse ✓ ✓ ✓ ✓
Azure Search Index ✓ ✓ ✓
Azure Table Storage ✓ ✓ ✓ ✓
Database:
Amazon Redshift ✓ ✓ ✓
DB2 ✓ ✓ ✓
Drill (Preview) ✓ ✓ ✓
Google BigQuery ✓ ✓ ✓
Greenplum ✓ ✓ ✓
HBase ✓ ✓ ✓
Hive ✓ ✓ ✓
Apache Impala (Preview) ✓ ✓ ✓
Informix ✓ ✓
MariaDB ✓ ✓ ✓
Microsoft Access ✓ ✓
MySQL ✓ ✓ ✓
Netezza ✓ ✓ ✓
Oracle ✓ ✓ ✓ ✓
Phoenix ✓ ✓ ✓
PostgreSQL ✓ ✓ ✓
Presto (Preview) ✓ ✓ ✓
SAP Business Warehouse Open Hub ✓ ✓
SAP Business Warehouse via MDX ✓ ✓
SAP HANA ✓ ✓ ✓
SAP Table ✓ ✓ ✓
Spark ✓ ✓ ✓
SQL Server ✓ ✓ ✓ ✓
Sybase ✓ ✓
Teradata ✓ ✓
Vertica ✓ ✓ ✓
NoSQL:
Cassandra ✓ ✓ ✓
Couchbase (Preview) ✓ ✓ ✓
MongoDB ✓ ✓ ✓
File:
Amazon S3 ✓ ✓ ✓
File System ✓ ✓ ✓ ✓
FTP ✓ ✓ ✓
Google Cloud Storage ✓ ✓ ✓
HDFS ✓ ✓ ✓
SFTP ✓ ✓ ✓
Generic protocol:
Generic HTTP ✓ ✓ ✓
Generic OData ✓ ✓ ✓
Generic ODBC ✓ ✓ ✓
Generic REST ✓ ✓ ✓
Services and apps:
Amazon Marketplace Web Service (Preview) ✓ ✓ ✓
Common Data Service for Apps ✓ ✓ ✓ ✓
Concur (Preview) ✓ ✓ ✓
Dynamics 365 ✓ ✓ ✓ ✓
Dynamics AX (Preview) ✓ ✓ ✓
Dynamics CRM ✓ ✓ ✓ ✓
Google AdWords (Preview) ✓ ✓ ✓
HubSpot (Preview) ✓ ✓ ✓
Jira (Preview) ✓ ✓ ✓
Magento (Preview) ✓ ✓ ✓
Marketo (Preview) ✓ ✓ ✓
Office 365 ✓ ✓ ✓
Oracle Eloqua (Preview) ✓ ✓ ✓
Oracle Responsys (Preview) ✓ ✓ ✓
Oracle Service Cloud (Preview) ✓ ✓ ✓
Paypal (Preview) ✓ ✓ ✓
QuickBooks (Preview) ✓ ✓ ✓
Salesforce ✓ ✓ ✓ ✓
Salesforce Service Cloud ✓ ✓ ✓ ✓
Salesforce Marketing Cloud (Preview) ✓ ✓ ✓
SAP Cloud for Customer (C4C) ✓ ✓ ✓ ✓
SAP ECC ✓ ✓ ✓
ServiceNow ✓ ✓ ✓
Shopify (Preview) ✓ ✓ ✓
Square (Preview) ✓ ✓ ✓
Web Table (HTML table) ✓ ✓
Xero (Preview) ✓ ✓ ✓
Zoho (Preview) ✓ ✓ ✓
NOTE
Any connector marked as Preview means that you can try it out and give us feedback. If
you want to take a dependency on preview connectors in your solution, please contact
Azure support.
Supported file formats
You can use Copy Activity to copy files as-is between two file-based data stores, in
which case the data is copied efficiently without any serialization/deserialization.
Copy Activity also supports reading from and writing to files in specified formats:
Text, JSON, Avro, ORC, and Parquet, and compressing and decompressing files
with the following codecs: GZip, Deflate, BZip2, and ZipDeflate. See Supported file and compression formats for details.
For example, you can do the following copy activities:
Copy data from an on-premises SQL Server database and write it to Azure Data Lake Storage Gen2 in Parquet format.
Copy files in text (CSV) format from an on-premises file system and write them to Azure Blob storage in Avro format.
Copy zipped files from an on-premises file system, decompress them on the fly, and write the extracted files to Azure Data Lake Storage Gen2.
Copy data in GZip-compressed text (CSV) format from Azure Blob storage and write it to Azure SQL Database (a dataset sketch for this scenario follows this list).
And many more scenarios that require serialization/deserialization or compression/decompression.
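As a rough sketch of the GZip-compressed CSV scenario above, a Blob dataset can declare both a text format and a compression codec; the dataset name, linked service reference, and folder path below are placeholders, while TextFormat and GZip are the format and codec type names used in Data Factory dataset definitions:
{
    "name": "GzipCsvBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "<Azure Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "mycontainer/incoming",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "firstRowAsHeader": true
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        }
    }
}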
Supported regions
The service that powers Copy Activity is available globally in the regions and
geographies listed in Azure Integration Runtime locations. The globally available
topology ensures efficient data movement that usually avoids cross-region hops. See
Services by region for availability of Data Factory and Data Movement in a region.
Configuration
To use copy activity in Azure Data Factory, you need to:
1. Create linked services for the source data store and the sink data store. Refer to the connector article's "Linked service properties" section for how to configure them and for the supported properties. You can find the supported connector list in the Supported data stores and formats section.
2. Create datasets for the source and sink. Refer to the source and sink connector articles' "Dataset properties" sections for how to configure them and for the supported properties.
3. Create a pipeline with the copy activity. The next section provides an example.
Syntax
The following template of a copy activity contains an exhaustive list of supported
properties. Specify the ones that fit your scenario.
"activities":[
{
"name": "CopyActivityTemplate",
"type": "Copy",
"inputs": [
{
"referenceName": "<source dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<sink dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>",
<properties>
},
"sink": {
"type": "<sink type>"
<properties>
},
"translator":
{
"type": "TabularTranslator",
"columnMappings": "<column mapping>"
},
"dataIntegrationUnits": <number>,
"parallelCopies": <number>,
"enableStaging": true/false,
"stagingSettings": {
<properties>
},
"enableSkipIncompatibleRow": true/false,
"redirectIncompatibleRowSettings": {
<properties>
}
}
}
]
Monitoring
You can monitor the copy activity run in the Azure Data Factory "Author & Monitor" UI or programmatically. You can then compare the performance and configuration of your scenario to Copy Activity's performance reference from in-house testing.
Monitor visually
To visually monitor the copy activity run, go to your data factory, select Author & Monitor, and then open the Monitor tab. You see a list of pipeline runs with a "View Activity Runs" link in the Actions column.
Click the link to see the list of activities in this pipeline run. In the Actions column, you have links to the copy activity input, output, errors (if the copy activity run fails), and details.
Click the "Details" link under Actions to see the copy activity's execution details and performance characteristics. The details include the volume/rows/files of data copied from source to sink, the throughput, the steps the copy goes through with their corresponding durations, and the configurations used for your copy scenario.
TIP
For some scenarios, you will also see "Performance tuning tips" at the top of the copy monitoring page. These tips identify the bottleneck and guide you on what to change to boost copy throughput; see the example with details here.
Example: copy from Azure SQL Database to Azure SQL Data Warehouse
using staged copy
Monitor programmatically
Copy activity execution details and performance characteristics are also returned in
Copy Activity run result -> Output section. Below is an exhaustive list; only the properties applicable to your copy scenario show up. Learn how to monitor an activity run in the quickstart's monitoring section.
"output": {
"dataRead": 107280845500,
"dataWritten": 107280845500,
"filesRead": 10,
"filesWritten": 10,
"copyDuration": 224,
"throughput": 467707.344,
"errors": [],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US 2)",
"usedDataIntegrationUnits": 32,
"usedParallelCopies": 8,
"executionDetails": [
{
"source": {
"type": "AmazonS3"
},
"sink": {
"type": "AzureDataLakeStore"
},
"status": "Succeeded",
"start": "2018-01-17T15:13:00.3515165Z",
"duration": 221,
"usedDataIntegrationUnits": 32,
"usedParallelCopies": 8,
"detailedDurations": {
"queuingDuration": 2,
"transferDuration": 219
}
}
]
}
Incremental copy
Data Factory supports scenarios for incrementally copying delta data from a source
data store to a destination data store. See Tutorial: incrementally copy data.
Next steps
See the following quickstarts, tutorials, and samples:
Copy data from one location to another location in the same Azure Blob Storage
Copy data from Azure Blob Storage to Azure SQL Database
Copy data from on-premises SQL Server to Azure
Delete Activity in Azure Data Factory
4/2/2019 • 7 minutes to read • Edit Online
You can use the Delete Activity in Azure Data Factory to delete files or folders from on-premises storage stores or
cloud storage stores. Use this activity to clean up or archive files when they are no longer needed.
WARNING
Deleted files or folders cannot be restored. Be cautious when using the Delete activity to delete files or folders.
Best practices
Here are some recommendations for using the Delete activity:
Back up your files before deleting them with the Delete activity in case you need to restore them in the
future.
Make sure that Data Factory has write permissions to delete folders or files from the storage store.
Make sure you are not deleting files that are being written at the same time.
If you want to delete files or folders from an on-premises system, make sure you are using a self-hosted integration runtime with a version greater than 3.14.
Syntax
{
"name": "DeleteActivity",
"type": "Delete",
"typeProperties": {
"dataset": {
"referenceName": "<dataset name>",
"type": "DatasetReference"
},
"recursive": true/false,
"maxConcurrentConnections": <number>,
"enableLogging": true/false,
"logStorageSettings": {
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference"
},
"path": "<path to save log file>"
}
}
}
Type properties
recursive — Indicates whether the files are deleted recursively from the subfolders or only from the specified folder. Required: No. The default is false.
Monitoring
There are two places where you can see and monitor the results of the Delete activity:
From the output of the Delete activity.
From the log file.
Sample output of the Delete activity
{
"datasetName": "AmazonS3",
"type": "AmazonS3Object",
"prefix": "test",
"bucketName": "adf",
"recursive": true,
"isWildcardUsed": false,
"maxConcurrentConnections": 2,
"filesDeleted": 4,
"logPath": "https://sample.blob.core.windows.net/mycontainer/5c698705-a6e2-40bf-911e-e0a927de3f07",
"effectiveIntegrationRuntime": "MyAzureIR (West Central US)",
"executionDuration": 650
}
Sample dataset
{
"name": "PartitionedFolder",
"properties": {
"linkedServiceName": {
"referenceName": "BloblinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"TriggerTime": {
"type": "String"
}
},
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@concat('mycontainer/',dataset().TriggerTime)",
"type": "Expression"
}
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
Sample trigger
{
"name": "DailyTrigger",
"properties": {
"runtimeState": "Started",
"pipelines": [
{
"pipelineReference": {
"referenceName": "cleanup_time_partitioned_folder",
"type": "PipelineReference"
},
"parameters": {
"TriggerTime": "@trigger().scheduledTime"
}
}
],
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Day",
"interval": 1,
"startTime": "2018-12-13T00:00:00.000Z",
"timeZone": "UTC",
"schedule": {
"minutes": [
59
],
"hours": [
23
]
}
}
}
}
}
Clean up expired files that were last modified before 2018-01-01
You can create a pipeline that cleans up old or expired files by using the file attribute filter “LastModified” in the dataset.
Sample pipeline
{
"name": "CleanupExpiredFiles",
"properties": {
"activities": [
{
"name": "DeleteFilebyLastModified",
"type": "Delete",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"dataset": {
"referenceName": "BlobFilesLastModifiedBefore201811",
"type": "DatasetReference"
},
"recursive": true,
"logStorageSettings": {
"linkedServiceName": {
"referenceName": "BloblinkedService",
"type": "LinkedServiceReference"
},
"path": "mycontainer/log"
},
"enableLogging": true
}
}
]
}
}
Sample dataset
{
"name": "BlobFilesLastModifiedBefore201811",
"properties": {
"linkedServiceName": {
"referenceName": "BloblinkedService",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"fileName": "*",
"folderPath": "mycontainer",
"modifiedDatetimeEnd": "2018-01-01T00:00:00.000Z"
}
}
}
Move files by chaining the Copy activity and the Delete activity
You can move a file by using a copy activity to copy a file and then a delete activity to delete a file in a pipeline.
When you want to move multiple files, you can use the GetMetadata activity + Filter activity + ForEach activity + Copy activity + Delete activity, as in the following sample:
NOTE
Be careful if you want to move an entire folder by defining a dataset that contains only a folder path and then using a copy activity and a Delete activity that reference the same dataset representing the folder. You have to make sure that no new files arrive in the folder between the copy operation and the delete operation. If new files arrive in the folder after the copy activity has completed the copy job but before the Delete activity starts, the Delete activity may delete, along with the entire folder, newly arrived files that have NOT yet been copied to the destination.
Sample pipeline
{
"name": "MoveFiles",
"properties": {
"activities": [
{
"name": "GetFileList",
"type": "GetMetadata",
"typeProperties": {
"dataset": {
"referenceName": "OneSourceFolder",
"type": "DatasetReference"
},
"fieldList": [
"childItems"
]
}
},
{
"name": "FilterFiles",
"type": "Filter",
"dependsOn": [
{
"activity": "GetFileList",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "@activity('GetFileList').output.childItems",
"type": "Expression"
},
"condition": {
"value": "@equals(item().type, 'File')",
"type": "Expression"
}
}
},
{
"name": "ForEachFile",
"type": "ForEach",
"dependsOn": [
{
"activity": "FilterFiles",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "@activity('FilterFiles').output.value",
"type": "Expression"
},
"batchCount": 20,
"activities": [
{
"name": "CopyAFile",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": false
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "OneSourceFile",
"type": "DatasetReference",
"parameters": {
"path": "myFolder",
"filename": {
"value": "@item().name",
"type": "Expression"
}
}
}
],
"outputs": [
{
"referenceName": "OneDestinationFile",
"type": "DatasetReference",
"parameters": {
"DestinationFileName": {
"value": "@item().name",
"type": "Expression"
}
}
}
]
},
{
"name": "DeleteAFile",
"type": "Delete",
"dependsOn": [
{
"activity": "CopyAFile",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"dataset": {
"referenceName": "OneSourceFile",
"type": "DatasetReference",
"parameters": {
"path": "myFolder",
"filename": {
"value": "@item().name",
"type": "Expression"
}
}
},
"logStorageSettings": {
"linkedServiceName": {
"referenceName": "BloblinkedService",
"type": "LinkedServiceReference"
},
"path": "Container/log"
},
"enableLogging": true
}
}
]
}
}
]
}
}
Sample datasets
Dataset used by GetMetadata activity to enumerate the file list.
{
"name": "OneSourceFolder",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"fileName": "",
"folderPath": "myFolder"
}
}
}
Dataset for data source used by copy activity and the Delete activity.
{
"name": "OneSourceFile",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
},
"filename": {
"type": "String"
}
},
"type": "AzureBlob",
"typeProperties": {
"fileName": {
"value": "@dataset().filename",
"type": "Expression"
},
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
}
}
}
Dataset for the data destination, used by the copy activity.
{
"name": "OneDestinationFile",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"DestinationFileName": {
"type": "String"
}
},
"type": "AzureBlob",
"typeProperties": {
"fileName": {
"value": "@dataset().DestinationFileName",
"type": "Expression"
},
"folderPath": "mycontainer/dest"
}
}
}
Known limitation
The Delete activity does not support deleting a list of folders described by a wildcard.
When using the file attribute filters modifiedDatetimeStart and modifiedDatetimeEnd to select files to be deleted, make sure to set "fileName": "*" in the dataset. See the sketch below.
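For instance, a dataset that selects only the blobs modified within a specific window for deletion might look like the following sketch, which combines "fileName": "*" with both filters (the dataset name and time window are placeholders):
{
    "name": "BlobFilesModifiedInWindow",
    "properties": {
        "linkedServiceName": {
            "referenceName": "BloblinkedService",
            "type": "LinkedServiceReference"
        },
        "type": "AzureBlob",
        "typeProperties": {
            "fileName": "*",
            "folderPath": "mycontainer",
            "modifiedDatetimeStart": "2018-01-01T00:00:00.000Z",
            "modifiedDatetimeEnd": "2018-02-01T00:00:00.000Z"
        }
    }
}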
Next steps
Learn more about moving files in Azure Data Factory.
Copy Data tool in Azure Data Factory
Copy Data tool in Azure Data Factory
3/15/2019 • 5 minutes to read • Edit Online
The Azure Data Factory Copy Data tool eases and optimizes the process of ingesting data into a data lake, which is
usually a first step in an end-to-end data integration scenario. It saves time, especially when you use Azure Data
Factory to ingest data from a data source for the first time. Some of the benefits of using this tool are:
When you use the Azure Data Factory Copy Data tool, you do not need to understand Data Factory definitions for linked services, datasets, pipelines, activities, and triggers.
The flow of Copy Data tool is intuitive for loading data into a data lake. The tool automatically creates all the
necessary Data Factory resources to copy data from the selected source data store to the selected
destination/sink data store.
The Copy Data tool helps you validate the data that is being ingested at the time of authoring, which helps you avoid potential errors early on.
If you need to implement complex business logic to load data into a data lake, you can still edit the Data Factory
resources created by the Copy Data tool by using the per-activity authoring in Data Factory UI.
The following table provides guidance on when to use the Copy Data tool vs. per-activity authoring in Data
Factory UI:
Use the Copy Data tool when:
You want to easily build a data loading task without learning about Azure Data Factory entities (linked services, datasets, pipelines, etc.).
You want to quickly load a large number of data artifacts into a data lake.
Use per-activity authoring when:
You want to implement complex and flexible logic for loading data into the lake.
You want to chain the Copy activity with subsequent activities for cleansing or processing data.
To start the Copy Data tool, click the Copy Data tile on the home page of your data factory.
Filter data
You can filter source data to select only the data that needs to be copied to the sink data store. Filtering reduces the
volume of the data to be copied to the sink data store and therefore enhances the throughput of the copy
operation. The Copy Data tool provides a flexible way to filter data in a relational database by using the SQL query language, or to filter files in an Azure blob folder.
Filter data in a database
The following screenshot shows a SQL query to filter the data.
Click the Browse button for File or folder, browse to one of these folders (for example, 2016->03->01->02), and
click Choose. You should see 2016/03/01/02 in the text box.
Then, replace 2016 with {year}, 03 with {month}, 01 with {day}, and 02 with {hour}, and press the Tab key. You
should see drop-down lists to select the format for these four variables:
The Copy Data tool generates parameters with expressions, functions, and system variables that can be used to
represent {year}, {month}, {day}, {hour}, and {minute} when creating pipeline. For more information, see the How to
read or write partitioned data article.
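As a rough sketch of what such a parameterized folder path can look like in the generated dataset (the WindowStart parameter name is hypothetical; concat and formatDateTime are Data Factory expression functions):
"typeProperties": {
    "folderPath": {
        "value": "@concat('mycontainer/', formatDateTime(dataset().WindowStart, 'yyyy'), '/', formatDateTime(dataset().WindowStart, 'MM'), '/', formatDateTime(dataset().WindowStart, 'dd'), '/', formatDateTime(dataset().WindowStart, 'HH'))",
        "type": "Expression"
    }
}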
Scheduling options
You can run the copy operation once or on a schedule (hourly, daily, and so on). These options can be used for the
connectors across different environments, including on-premises, cloud, and local desktop.
A one-time copy operation enables data movement from a source to a destination only once. It applies to data of
any size and any supported format. The scheduled copy allows you to copy data on a recurrence that you specify.
You can use rich settings (like retry, timeout, and alerts) to configure the scheduled copy.
Next steps
Try these tutorials that use the Copy Data tool:
Quickstart: create a data factory using the Copy Data tool
Tutorial: copy data in Azure using the Copy Data tool
Tutorial: copy on-premises data to Azure using the Copy Data tool
Load data into Azure Data Lake Storage Gen2 with
Azure Data Factory
5/13/2019 • 4 minutes to read • Edit Online
Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built into Azure Blob storage.
It allows you to interface with your data using both file system and object storage paradigms.
Azure Data Factory (ADF ) is a fully managed cloud-based data integration service. You can use the service to
populate the lake with data from a rich set of on-premises and cloud-based data stores and save time when
building your analytics solutions. For a detailed list of supported connectors, see the table of Supported data
stores.
Azure Data Factory offers a scale-out, managed data movement solution. Due to the scale-out architecture of ADF,
it can ingest data at a high throughput. For details, see Copy activity performance.
This article shows you how to use the Data Factory Copy Data tool to load data from Amazon Web Services S3
service into Azure Data Lake Storage Gen2. You can follow similar steps to copy data from other types of data
stores.
TIP
For copying data from Azure Data Lake Storage Gen1 into Gen2, refer to this specific walkthrough.
Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure Storage account with Data Lake Storage Gen2 enabled: If you don't have a Storage account, create an
account.
AWS account with an S3 bucket that contains data: This article shows how to copy data from Amazon S3. You
can use other data stores by following similar steps.
Name: Enter a globally unique name for your Azure data factory. If you receive the error "Data factory
name "LoadADLSDemo" is not available," enter a different name for the data factory. For example, you
could use the name yournameADFTutorialDataFactory. Try creating the data factory again. For the
naming rules for Data Factory artifacts, see Data Factory naming rules.
Subscription: Select your Azure subscription in which to create the data factory.
Resource Group: Select an existing resource group from the drop-down list, or select the Create new
option and enter the name of a resource group. To learn about resource groups, see Using resource
groups to manage your Azure resources.
Version: Select V2.
Location: Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by data factory can be in other locations and regions.
3. Select Create.
4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the
following image:
Select the Author & Monitor tile to launch the Data Integration Application in a separate tab.
6. Specify the copy behavior by checking the Copy files recursively and Binary copy options. Select Next:
7. In the Destination data store page, click + Create new connection, and then select Azure Data Lake
Storage Gen2, and select Continue:
8. In the Specify Azure Data Lake Storage connection page, do the following steps:
a. Select your Data Lake Storage Gen2 capable account from the "Storage account name" drop down list.
b. Select Finish to create the connection. Then select Next.
9. In the Choose the output file or folder page, enter copyfroms3 as the output folder name, and select Next. ADF creates the corresponding ADLS Gen2 file system and sub-folders during the copy if they don't exist.
10. In the Settings page, select Next to use the default settings:
11. In the Summary page, review the settings, and select Next:
14. To view activity runs that are associated with the pipeline run, select the View Activity Runs link in the
Actions column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To switch
back to the pipeline runs view, select the Pipelines link at the top. Select Refresh to refresh the list.
15. To monitor the execution details for each copy activity, select the Details link (eyeglasses image) under
Actions in the activity monitoring view. You can monitor details like the volume of data copied from the
source to the sink, data throughput, execution steps with corresponding duration, and used configurations:
16. Verify that the data is copied into your Data Lake Storage Gen2 account.
Next steps
Copy activity overview
Azure Data Lake Storage Gen2 connector
Copy data from Azure Data Lake Storage Gen1 to
Gen2 with Azure Data Factory
5/13/2019 • 7 minutes to read • Edit Online
Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built into Azure Blob storage.
It allows you to interface with your data using both file system and object storage paradigms.
If you are currently using Azure Data Lake Storage Gen1, you can evaluate the Gen2 new capability by copying
data from Data Lake Storage Gen1 to Gen2 using Azure Data Factory.
Azure Data Factory is a fully managed cloud-based data integration service. You can use the service to populate
the lake with data from a rich set of on-premises and cloud-based data stores and save time when building your
analytics solutions. For a detailed list of supported connectors, see the table of Supported data stores.
Azure Data Factory offers a scale-out, managed data movement solution. Due to the scale-out architecture of ADF,
it can ingest data at a high throughput. For details, see Copy activity performance.
This article shows you how to use the Data Factory Copy Data tool to copy data from Azure Data Lake Storage
Gen1 into Azure Data Lake Storage Gen2. You can follow similar steps to copy data from other types of data
stores.
Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure Data Lake Storage Gen1 account with data in it.
Azure Storage account with Data Lake Storage Gen2 enabled: If you don't have a Storage account, create an
account.
Name: Enter a globally unique name for your Azure data factory. If you receive the error "Data factory
name "LoadADLSDemo" is not available," enter a different name for the data factory. For example, you
could use the name yournameADFTutorialDataFactory. Try creating the data factory again. For the
naming rules for Data Factory artifacts, see Data Factory naming rules.
Subscription: Select your Azure subscription in which to create the data factory.
Resource Group: Select an existing resource group from the drop-down list, or select the Create new
option and enter the name of a resource group. To learn about resource groups, see Using resource
groups to manage your Azure resources.
Version: Select V2.
Location: Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by data factory can be in other locations and regions.
3. Select Create.
4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the
following image:
Select the Author & Monitor tile to launch the Data Integration Application in a separate tab.
4. In the Specify Azure Data Lake Storage Gen1 connection page, do the following steps:
a. Select your Data Lake Storage Gen1 for the account name, and specify or validate the Tenant.
b. Click Test connection to validate the settings, then select Finish.
c. You will see that a new connection is created. Select Next.
IMPORTANT
In this walkthrough, you use a managed identity for Azure resources to authenticate your Data Lake Storage Gen1.
Be sure to grant the MSI the proper permissions in Azure Data Lake Storage Gen1 by following these instructions.
5. In the Choose the input file or folder page, browse to the folder and file that you want to copy over.
Select the folder/file, select Choose:
6. Specify the copy behavior by checking the Copy files recursively and Binary copy options. Select Next:
7. In the Destination data store page, click + Create new connection, and then select Azure Data Lake
Storage Gen2, and select Continue:
8. In the Specify Azure Data Lake Storage Gen2 connection page, do the following steps:
a. Select your Data Lake Storage Gen2 capable account from the "Storage account name" drop down list.
b. Select Finish to create the connection. Then select Next.
9. In the Choose the output file or folder page, enter copyfromadlsgen1 as the output folder name, and select Next. ADF creates the corresponding ADLS Gen2 file system and sub-folders during the copy if they don't exist.
10. In the Settings page, select Next to use the default settings.
11. In the Summary page, review the settings, and select Next:
12. In the Deployment page, select Monitor to monitor the pipeline:
13. Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to view
activity run details and to rerun the pipeline:
14. To view activity runs that are associated with the pipeline run, select the View Activity Runs link in the
Actions column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To switch
back to the pipeline runs view, select the Pipelines link at the top. Select Refresh to refresh the list.
15. To monitor the execution details for each copy activity, select the Details link (eyeglasses image) under
Actions in the activity monitoring view. You can monitor details like the volume of data copied from the
source to the sink, data throughput, execution steps with corresponding duration, and used configurations:
16. Verify that the data is copied into your Data Lake Storage Gen2 account.
Best practices
To assess upgrading from Azure Data Lake Storage (ADLS ) Gen1 to Gen2 in general, refer to Upgrade your big
data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2. The following
sections introduce best practices for using ADF for the data upgrade from Gen1 to Gen2.
Data partition for historical data copy
If your total data size in ADLS Gen1 is less than 30 TB and the number of files is less than 1 million, you can copy all the data in a single Copy activity run.
If you have a larger amount of data to copy, or you want the flexibility to manage the data migration in batches and complete each batch within a specific time window, partition the data. Partitioning also reduces the risk of any unexpected issue.
A PoC (proof of concept) is highly recommended to verify the end-to-end solution and test the copy throughput in your environment. Major steps of the PoC:
1. Create one ADF pipeline with single copy activity to copy several TBs of data from ADLS Gen1 to ADLS
Gen2 to get a copy performance baseline, starting with Data Integration Units (DIUs) as 128 .
2. Based on the copy throughput you get in step #1, calculate the estimated time required for the entire data
migration.
3. (Optional) Create a control table and define the file filter to partition the files to be migrated. Ways to partition the files:
Partition by folder name or folder name with a wildcard filter (suggested)
Partition by the file's last modified time
Network bandwidth and storage I/O
You can control the concurrency of ADF copy jobs that read data from ADLS Gen1 and write data to ADLS Gen2, so that you can manage the storage I/O usage and avoid impacting normal business workloads on ADLS Gen1 during the migration.
Permissions
In Data Factory, the ADLS Gen1 connector supports service principal and managed identity authentication for Azure resources; the ADLS Gen2 connector supports account key, service principal, and managed identity authentication for Azure resources. To make sure Data Factory can navigate and copy all the files and ACLs that you need, grant the account you provide permissions that are high enough to access, read, and write all files and to set ACLs (if you choose to copy them). We suggest granting it the super-user/owner role during the migration period.
Preserve ACLs from Data Lake Storage Gen1
If you want to replicate the ACLs along with data files when upgrading from Data Lake Storage Gen1 to Gen2,
refer to Preserve ACLs from Data Lake Storage Gen1.
Incremental copy
Several approaches can be used to load only the new or updated files from ADLS Gen1:
Load new or updated files by time partitioned folder or file name, e.g. /2019/05/13/*;
Load new or updated files by LastModifiedDate;
Identify new or updated files by any third-party tool or solution, then pass the file or folder name to the ADF pipeline via a parameter or a table/file.
The proper frequency for incremental loads depends on the total number of files in ADLS Gen1 and the volume of new or updated files to be loaded each time.
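A minimal sketch of the LastModifiedDate approach, assuming the ADLS Gen1 dataset (type AzureDataLakeStoreFile) accepts the same modifiedDatetimeStart/modifiedDatetimeEnd filters shown for the Blob dataset earlier in this document (the dataset name and time window are placeholders):
{
    "name": "AdlsGen1FilesModifiedInWindow",
    "properties": {
        "linkedServiceName": {
            "referenceName": "<ADLS Gen1 linked service name>",
            "type": "LinkedServiceReference"
        },
        "type": "AzureDataLakeStoreFile",
        "typeProperties": {
            "folderPath": "sourcefolder",
            "fileName": "*",
            "modifiedDatetimeStart": "2019-05-13T00:00:00.000Z",
            "modifiedDatetimeEnd": "2019-05-14T00:00:00.000Z"
        }
    }
}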
Next steps
Copy activity overview
Azure Data Lake Storage Gen1 connector
Azure Data Lake Storage Gen2 connector
Load data into Azure SQL Data Warehouse by using
Azure Data Factory
3/26/2019 • 6 minutes to read • Edit Online
Azure SQL Data Warehouse is a cloud-based, scale-out database that's capable of processing massive volumes of
data, both relational and non-relational. SQL Data Warehouse is built on the massively parallel processing (MPP )
architecture that's optimized for enterprise data warehouse workloads. It offers cloud elasticity with the flexibility to
scale storage and compute independently.
Getting started with Azure SQL Data Warehouse is now easier than ever when you use Azure Data Factory. Azure
Data Factory is a fully managed cloud-based data integration service. You can use the service to populate a SQL
Data Warehouse with data from your existing system and save time when building your analytics solutions.
Azure Data Factory offers the following benefits for loading data into Azure SQL Data Warehouse:
Easy to set up: An intuitive 5-step wizard with no scripting required.
Rich data store support: Built-in support for a rich set of on-premises and cloud-based data stores. For a
detailed list, see the table of Supported data stores.
Secure and compliant: Data is transferred over HTTPS or ExpressRoute. The global service presence ensures
that your data never leaves the geographical boundary.
Unparalleled performance by using PolyBase: Polybase is the most efficient way to move data into Azure
SQL Data Warehouse. Use the staging blob feature to achieve high load speeds from all types of data stores,
including Azure Blob storage and Data Lake Store. (Polybase supports Azure Blob storage and Azure Data Lake
Store by default.) For details, see Copy activity performance.
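As a rough sketch (not the exact JSON that the wizard generates), a copy activity that loads Azure SQL Data Warehouse through PolyBase with a staging blob combines a SqlDWSink that allows PolyBase with staging settings; the linked service reference and staging path below are placeholders:
"typeProperties": {
    "source": {
        "type": "SqlSource"
    },
    "sink": {
        "type": "SqlDWSink",
        "allowPolyBase": true
    },
    "enableStaging": true,
    "stagingSettings": {
        "linkedServiceName": {
            "referenceName": "<Azure Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "path": "stagingcontainer/stagingpath"
    }
}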
This article shows you how to use the Data Factory Copy Data tool to load data from Azure SQL Database into
Azure SQL Data Warehouse. You can follow similar steps to copy data from other types of data stores.
NOTE
For more information, see Copy data to or from Azure SQL Data Warehouse by using Azure Data Factory.
Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure SQL Data Warehouse: The data warehouse holds the data that's copied over from the SQL database. If
you don't have an Azure SQL Data Warehouse, see the instructions in Create a SQL Data Warehouse.
Azure SQL Database: This tutorial copies data from an Azure SQL database with Adventure Works LT sample
data. You can create a SQL database by following the instructions in Create an Azure SQL database.
Azure storage account: Azure Storage is used as the staging blob in the bulk copy operation. If you don't have
an Azure storage account, see the instructions in Create a storage account.
Name: Enter a globally unique name for your Azure data factory. If you receive the error "Data factory
name "LoadSQLDWDemo" is not available," enter a different name for the data factory. For example,
you could use the name yournameADFTutorialDataFactory. Try creating the data factory again. For
the naming rules for Data Factory artifacts, see Data Factory naming rules.
Subscription: Select your Azure subscription in which to create the data factory.
Resource Group: Select an existing resource group from the drop-down list, or select the Create new
option and enter the name of a resource group. To learn about resource groups, see Using resource
groups to manage your Azure resources.
Version: Select V2.
Location: Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by data factory can be in other locations and regions. These data
stores include Azure Data Lake Store, Azure Storage, Azure SQL Database, and so on.
3. Select Create.
4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the
following image:
Select the Author & Monitor tile to launch the Data Integration Application in a separate tab.
c. In the New Linked Service page, select your server name and DB name from the dropdown list, and
specify the username and password. Click Test connection to validate the settings, then select Finish.
d. Select the newly created linked service as source, then click Next.
4. In the Select tables from which to copy the data or use a custom query page, enter SalesLT to filter
the tables. Choose the (Select all) box to use all of the tables for the copy, and then select Next:
5. In the Destination data store page, complete the following steps:
a. Click + Create new connection to add a connection
b. Select Azure SQL Data Warehouse from the gallery, and select Next.
c. In the New Linked Service page, select your server name and DB name from the dropdown list, and
specify the username and password. Click Test connection to validate the settings, then select Finish.
d. Select the newly created linked service as sink, then click Next.
6. In the Table mapping page, review the content, and select Next. An intelligent table mapping displays. The
source tables are mapped to the destination tables based on the table names. If a source table doesn't exist
in the destination, Azure Data Factory creates a destination table with the same name by default. You can
also map a source table to an existing destination table.
NOTE
Automatic table creation for the SQL Data Warehouse sink applies when SQL Server or Azure SQL Database is the
source. If you copy data from another source data store, you need to pre-create the schema in the sink Azure SQL
Data Warehouse before executing the data copy.
7. In the Schema mapping page, review the content, and select Next. The intelligent table mapping is based
on the column name. If you let Data Factory automatically create the tables, data type conversion can occur
when there are incompatibilities between the source and destination stores. If there's an unsupported data
type conversion between the source and destination column, you see an error message next to the
corresponding table.
c. In the Advanced settings section, deselect the Use type default option, then select Next.
11. Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to view
activity run details and to rerun the pipeline:
12. To view activity runs that are associated with the pipeline run, select the View Activity Runs link in the
Actions column. To switch back to the pipeline runs view, select the Pipelines link at the top. Select
Refresh to refresh the list.
13. To monitor the execution details for each copy activity, select the Details link under Actions in the activity
monitoring view. You can monitor details like the volume of data copied from the source to the sink, data
throughput, execution steps with corresponding duration, and used configurations:
Next steps
Advance to the following article to learn about Azure SQL Data Warehouse support:
Azure SQL Data Warehouse connector
Load data into Azure Data Lake Storage Gen1 by
using Azure Data Factory
3/26/2019 • 4 minutes to read • Edit Online
Azure Data Lake Storage Gen1 (previously known as Azure Data Lake Store) is an enterprise-wide hyper-scale
repository for big data analytic workloads. Data Lake Storage Gen1 lets you capture data of any size, type, and
ingestion speed. The data is captured in a single place for operational and exploratory analytics.
Azure Data Factory is a fully managed cloud-based data integration service. You can use the service to populate
the lake with data from your existing system and save time when building your analytics solutions.
Azure Data Factory offers the following benefits for loading data into Data Lake Storage Gen1:
Easy to set up: An intuitive 5-step wizard with no scripting required.
Rich data store support: Built-in support for a rich set of on-premises and cloud-based data stores. For a
detailed list, see the table of Supported data stores.
Secure and compliant: Data is transferred over HTTPS or ExpressRoute. The global service presence ensures
that your data never leaves the geographical boundary.
High performance: Up to 1-GB/s data loading speed into Data Lake Storage Gen1. For details, see Copy
activity performance.
This article shows you how to use the Data Factory Copy Data tool to load data from Amazon S3 into Data Lake
Storage Gen1. You can follow similar steps to copy data from other types of data stores.
NOTE
For more information, see Copy data to or from Data Lake Storage Gen1 by using Azure Data Factory.
Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Data Lake Storage Gen1 account: If you don't have a Data Lake Storage Gen1 account, see the instructions in
Create a Data Lake Storage Gen1 account.
Amazon S3: This article shows how to copy data from Amazon S3. You can use other data stores by following
similar steps.
Name: Enter a globally unique name for your Azure data factory. If you receive the error "Data factory
name "LoadADLSG1Demo" is not available," enter a different name for the data factory. For example,
you could use the name yournameADFTutorialDataFactory. Try creating the data factory again. For
the naming rules for Data Factory artifacts, see Data Factory naming rules.
Subscription: Select your Azure subscription in which to create the data factory.
Resource Group: Select an existing resource group from the drop-down list, or select the Create new
option and enter the name of a resource group. To learn about resource groups, see Using resource
groups to manage your Azure resources.
Version: Select V2.
Location: Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by data factory can be in other locations and regions. These data
stores include Azure Data Lake Storage Gen1, Azure Storage, Azure SQL Database, and so on.
3. Select Create.
4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the
following image:
Select the Author & Monitor tile to launch the Data Integration Application in a separate tab.
2. In the Properties page, specify CopyFromAmazonS3ToADLS for the Task name field, and select Next:
3. In the Source data store page, click + Create new connection:
6. Choose the copy behavior by selecting the Copy files recursively and Binary copy (copy files as-is)
options. Select Next:
7. In the Destination data store page, click + Create new connection, and then select Azure Data Lake
Storage Gen1, and select Continue:
8. In the New Linked Service (Azure Data Lake Storage Gen1) page, do the following steps:
a. Select your Data Lake Storage Gen1 account for the Data Lake Store account name.
b. Specify the Tenant, and select Finish.
c. Select Next.
IMPORTANT
In this walkthrough, you use a managed identity for Azure resources to authenticate your Data Lake Storage Gen1
account. Be sure to grant the MSI the proper permissions in Data Lake Storage Gen1 by following these instructions.
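For reference, the linked service that these steps create corresponds roughly to the following JSON definition when managed identity authentication is used. This is a minimal sketch: the account, tenant, subscription, and resource group values are placeholders, and the exact property set can vary by service version.
{
    "name": "AzureDataLakeStorageGen1LinkedService",
    "properties": {
        "type": "AzureDataLakeStore",
        "typeProperties": {
            "dataLakeStoreUri": "https://<your ADLS Gen1 account>.azuredatalakestore.net/webhdfs/v1",
            "tenant": "<your tenant ID>",
            "subscriptionId": "<your subscription ID>",
            "resourceGroupName": "<your resource group>"
        }
    }
}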
9. In the Choose the output file or folder page, enter copyfroms3 as the output folder name, and select
Next:
14. To view activity runs that are associated with the pipeline run, select the View Activity Runs link in the
Actions column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To switch
back to the pipeline runs view, select the Pipelines link at the top. Select Refresh to refresh the list.
15. To monitor the execution details for each copy activity, select the Details link under Actions in the activity
monitoring view. You can monitor details like the volume of data copied from the source to the sink, data
throughput, execution steps with corresponding duration, and used configurations:
16. Verify that the data is copied into your Data Lake Storage Gen1 account:
Next steps
Advance to the following article to learn about Data Lake Storage Gen1 support:
Azure Data Lake Storage Gen1 connector
Copy data from SAP Business Warehouse by using
Azure Data Factory
5/22/2019 • 10 minutes to read
This article shows how to use Azure Data Factory to copy data from SAP Business Warehouse (BW) via Open Hub
to Azure Data Lake Storage Gen2. You can use a similar process to copy data to other supported sink data stores.
TIP
For general information about copying data from SAP BW, including SAP BW Open Hub integration and delta extraction flow,
see Copy data from SAP Business Warehouse via Open Hub by using Azure Data Factory.
Prerequisites
Azure Data Factory: If you don't have one, follow the steps to create a data factory.
SAP BW Open Hub Destination (OHD) with destination type "Database Table": To create an OHD
or to check that your OHD is configured correctly for Data Factory integration, see the SAP BW Open Hub
Destination configurations section of this article.
The SAP BW user needs the following permissions:
Authorization for Remote Function Calls (RFC) and SAP BW.
Permissions to the “Execute” activity of the S_SDSAUTH authorization object.
A self-hosted integration runtime (IR) with SAP .NET connector 3.0. Follow these setup steps:
1. Install and register the self-hosted integration runtime, version 3.13 or later. (This process is
described later in this article.)
2. Download the 64-bit SAP Connector for Microsoft .NET 3.0 from SAP's website, and install it on the
same computer as the self-hosted IR. During installation, make sure that you select Install
Assemblies to GAC in the Optional setup steps dialog box, as the following image shows:
Do a full copy from SAP BW Open Hub
In the Azure portal, go to your data factory. Select Author & Monitor to open the Data Factory UI in a separate
tab.
1. On the Let's get started page, select Copy Data to open the Copy Data tool.
2. On the Properties page, specify a Task name, and then select Next.
3. On the Source data store page, select +Create new connection. Select SAP BW Open Hub from the
connector gallery, and then select Continue. To filter the connectors, you can type SAP in the search box.
4. On the Specify SAP BW Open Hub connection page, follow these steps to create a new connection.
a. From the Connect via integration runtime list, select an existing self-hosted IR. Or, choose to
create one if you don't have one yet.
To create a new self-hosted IR, select +New, and then select Self-hosted. Enter a Name, and then
select Next. Select Express setup to install on the current computer, or follow the Manual setup
steps that are provided.
As mentioned in Prerequisites, make sure that you have SAP Connector for Microsoft .NET 3.0
installed on the same computer where the self-hosted IR is running.
b. Fill in the SAP BW Server name, System number, Client ID, Language (if other than EN ), User
name, and Password.
c. Select Test connection to validate the settings, and then select Finish.
d. A new connection is created. Select Next.
5. On the Select Open Hub Destinations page, browse the Open Hub Destinations that are available in
your SAP BW. Select the OHD to copy data from, and then select Next.
6. Specify a filter, if you need one. If your OHD only contains data from a single data-transfer process (DTP)
execution with a single request ID, or you're sure that your DTP is finished and you want to copy the data,
clear the Exclude Last Request check box.
Learn more about these settings in the SAP BW Open Hub Destination configurations section of this
article. Select Validate to double-check what data will be returned. Then select Next.
7. On the Destination data store page, select +Create new connection > Azure Data Lake Storage
Gen2 > Continue.
8. On the Specify Azure Data Lake Storage connection page, follow these steps to create a connection.
a. Select your Data Lake Storage Gen2-capable account from the Name drop-down list.
b. Select Finish to create the connection. Then select Next.
9. On the Choose the output file or folder page, enter copyfromopenhub as the output folder name. Then
select Next.
10. On the File format setting page, select Next to use the default settings.
11. On the Settings page, expand Performance settings. Enter a value for Degree of copy parallelism such
as 5 to load from SAP BW in parallel. Then select Next.
12. On the Summary page, review the settings. Then select Next.
13. On the Deployment page, select Monitor to monitor the pipeline.
14. Notice that the Monitor tab on the left side of the page is automatically selected. The Actions column
includes links to view activity-run details and to rerun the pipeline.
15. To view activity runs that are associated with the pipeline run, select View Activity Runs in the Actions
column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To switch back to
the pipeline-runs view, select the Pipelines link at the top. Select Refresh to refresh the list.
16. To monitor the execution details for each copy activity, select the Details link, which is an eyeglasses icon
below Actions in the activity-monitoring view. Available details include the data volume copied from the
source to the sink, data throughput, execution steps and duration, and configurations used.
17. To view the maximum Request ID, go back to the activity-monitoring view and select Output under
Actions.
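For reference, the SAP BW Open Hub connection configured in step 4 corresponds roughly to a linked service definition like the sketch below. The type and property names reflect our understanding of the SAP BW Open Hub connector, and every value is a placeholder:
{
    "name": "SapBwOpenHubLinkedService",
    "properties": {
        "type": "SapOpenHub",
        "typeProperties": {
            "server": "<SAP BW server name>",
            "systemNumber": "<system number>",
            "clientId": "<client ID>",
            "language": "EN",
            "userName": "<SAP user name>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<your self-hosted IR name>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
Similarly, the Degree of copy parallelism value entered in step 11 corresponds to the parallelCopies property described in the Copy Activity performance and tuning guide later in this document.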
On the data factory Let's get started page, select Create pipeline from template to use the built-in template.
1. Search for SAP BW to find and select the Incremental copy from SAP BW to Azure Data Lake
Storage Gen2 template. This template copies data into Azure Data Lake Storage Gen2. You can use a
similar workflow to copy to other sink types.
2. On the template's main page, select or create the following three connections, and then select Use this
template in the lower-right corner of the window.
Azure Blob storage: In this walkthrough, we use Azure Blob storage to store the high watermark,
which is the max copied request ID.
SAP BW Open Hub: This is the source to copy data from. Refer to the previous full-copy walkthrough
for detailed configuration.
Azure Data Lake Storage Gen2: This is the sink to copy data to. Refer to the previous full-copy
walkthrough for detailed configuration.
3. This template generates a pipeline with the following three activities and chains them on success:
Lookup, Copy Data, and Web.
Go to the pipeline Parameters tab. You see all the configurations that you need to provide.
SAPOpenHubDestinationName: Specify the Open Hub table name to copy data from.
ADLSGen2SinkPath: Specify the destination Azure Data Lake Storage Gen2 path to copy data to. If
the path doesn't exist, the Data Factory copy activity creates a path during execution.
HighWatermarkBlobPath: Specify the path to store the high-watermark value, such as
container/path .
HighWatermarkBlobName: Specify the blob name to store the high watermark value, such as
requestIdCache.txt . In Blob storage, go to the corresponding path of
HighWatermarkBlobPath+HighWatermarkBlobName, such as container/path/requestIdCache.txt.
Create a blob with content 0.
LogicAppURL: In this template, we use WebActivity to call Azure Logic Apps to set the high-
watermark value in Blob storage. Or, you can use Azure SQL Database to store it. Use a stored
procedure activity to update the value.
You must first create a logic app, as the following image shows. Then, paste in the HTTP POST URL.
a. Go to the Azure portal. Select a new Logic Apps service. Select +Blank Logic App to go to
Logic Apps Designer.
b. Create a trigger of When an HTTP request is received. Specify the HTTP request body as
follows:
{
    "properties": {
        "sapOpenHubMaxRequestId": {
            "type": "string"
        }
    },
    "type": "object"
}
c. Add a Create blob action. For Folder path and Blob name, use the same values that you
configured previously in HighWatermarkBlobPath and HighWatermarkBlobName.
d. Select Save. Then, copy the value of HTTP POST URL to use in the Data Factory pipeline.
4. After you provide the Data Factory pipeline parameters, select Debug > Finish to invoke a run to validate
the configuration. Or, select Publish All to publish the changes, and then select Trigger to execute a run.
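When you debug or trigger the pipeline, you supply values for the parameters listed above. A sketch of what the parameter values might look like follows; every value here is an illustrative placeholder:
{
    "SAPOpenHubDestinationName": "<your Open Hub table name>",
    "ADLSGen2SinkPath": "incrementalcopy/sapbw/",
    "HighWatermarkBlobPath": "container/path",
    "HighWatermarkBlobName": "requestIdCache.txt",
    "LogicAppURL": "<HTTP POST URL copied from the logic app>"
}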
You might increase the number of parallel running SAP work processes for the DTP:
For a full load OHD, choose different options than for delta extraction:
In OHD: Set the Extraction option to Delete Data and Insert Records. Otherwise, data will be extracted
many times when you repeat the DTP in a BW process chain.
In the DTP: Set Extraction Mode to Full. You must change the automatically created DTP from Delta to
Full immediately after the OHD is created, as this image shows:
In the BW Open Hub connector of Data Factory: Turn off Exclude last request. Otherwise, nothing will be
extracted.
You typically run the full DTP manually. Or, you can create a process chain for the full DTP. It's typically a separate
chain that's independent of your existing process chains. In either case, make sure that the DTP is finished before
you start the extraction by using Data Factory copy. Otherwise, only partial data will be copied.
Run delta extraction the first time
The first delta extraction is technically a full extraction. By default, the SAP BW Open Hub connector excludes the
last request when it copies data. For the first delta extraction, no data is extracted by the Data Factory copy activity
until a subsequent DTP generates delta data in the table with a separate request ID. There are two ways to avoid
this scenario:
Turn off the Exclude last request option for the first delta extraction. Make sure that the first delta DTP is
finished before you start the delta extraction the first time.
Use the procedure for resyncing the delta extraction, as described in the next section.
Resync delta extraction
The following scenarios change the data in SAP BW cubes but are not considered by the delta DTP:
SAP BW selective deletion (of rows by using any filter condition)
SAP BW request deletion (of faulty requests)
An SAP Open Hub Destination isn't a data-mart-controlled data target (in all SAP BW support packages since
2015). So, you can delete data from a cube without changing the data in the OHD. You must then resync the data
of the cube with Data Factory:
1. Run a full extraction in Data Factory (by using a full DTP in SAP ).
2. Delete all rows in the Open Hub table for the delta DTP.
3. Set the status of the delta DTP to Fetched.
After this, all subsequent delta DTPs and Data Factory delta extractions work as expected.
To set the status of the delta DTP to Fetched, you can use the following option to run the delta DTP manually:
Next steps
Learn about SAP BW Open Hub connector support:
SAP Business Warehouse Open Hub connector
Load data from Office 365 by using Azure Data
Factory
3/26/2019 • 5 minutes to read
This article shows you how to use Data Factory to load data from Office 365 into Azure Blob storage. You can
follow similar steps to copy data to Azure Data Lake Storage Gen1 or Gen2. Refer to the Office 365 connector
article for general information about copying data from Office 365.
2. In the New data factory page, provide values for the fields that are shown in the following image:
Name: Enter a globally unique name for your Azure data factory. If you receive the error "Data factory
name "LoadFromOffice365Demo" is not available," enter a different name for the data factory. For
example, you could use the name yournameLoadFromOffice365Demo. Try creating the data factory
again. For the naming rules for Data Factory artifacts, see Data Factory naming rules.
Subscription: Select your Azure subscription in which to create the data factory.
Resource Group: Select an existing resource group from the drop-down list, or select the Create new
option and enter the name of a resource group. To learn about resource groups, see Using resource
groups to manage your Azure resources.
Version: Select V2.
Location: Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by data factory can be in other locations and regions. These data
stores include Azure Data Lake Store, Azure Storage, Azure SQL Database, and so on.
3. Select Create.
4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the
following image:
5. Select the Author & Monitor tile to launch the Data Integration Application in a separate tab.
Create a pipeline
1. On the "Let's get started" page, select Create pipeline.
2. In the General tab for the pipeline, enter "CopyPipeline" for Name of the pipeline.
3. In the Activities tool box > Move & Transform category > drag and drop the Copy activity from the tool
box to the pipeline designer surface. Specify "CopyFromOffice365ToBlob" as activity name.
Configure source
1. Go to the pipeline > Source tab, click + New to create a source dataset.
2. In the New Dataset window, select Office 365, and then select Finish.
3. You see a new tab opened for Office 365 dataset. On the General tab at the bottom of the Properties
window, enter "SourceOffice365Dataset" for Name.
4. Go to the Connection tab of the Properties window. Next to the Linked service text box, click + New.
5. In the New Linked Service window, enter "Office365LinkedService" as name, enter the service principal ID
and service principal key, then select Save to deploy the linked service.
6. After the linked service is created, you are back in the dataset settings. Next to "Table", choose the down-
arrow to expand the list of available Office 365 datasets, and choose "BasicDataSet_v0.Contact_v0" from the
drop-down list:
7. Go to the Schema tab of the Properties window and select Import Schema. Notice that the schema and
sample values for the Contact dataset are displayed.
8. Now, go back to the pipeline > Source tab, and confirm that SourceOffice365Dataset is selected.
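For reference, the source dataset configured above corresponds roughly to the JSON sketch below. The Office365Table type name and the tableName property reflect our understanding of the Office 365 connector; the referenced linked service is the one that holds the service principal ID and key:
{
    "name": "SourceOffice365Dataset",
    "properties": {
        "type": "Office365Table",
        "linkedServiceName": {
            "referenceName": "Office365LinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "tableName": "BasicDataSet_v0.Contact_v0"
        }
    }
}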
Configure sink
1. Go to the pipeline > Sink tab, and select + New to create a sink dataset.
2. In the New Dataset window, notice that only the supported destinations are shown when copying from
Office 365. Select Azure Blob Storage, and then select Finish. In this tutorial, you copy Office 365 data
into Azure Blob storage.
3. On the General tab of the Properties window, in Name, enter "OutputBlobDataset".
4. Go to the Connection tab of the Properties window. Next to the Linked service text box, select + New.
5. In the New Linked Service window, enter "AzureStorageLinkedService" as name, select "Service Principal"
from the drop-down list of authentication methods, fill in the Service Endpoint, Tenant, Service principal ID,
and Service principal key, then select Save to deploy the linked service. Refer here for how to set up service
principal authentication for Azure Blob Storage.
6. After the linked service is created, you are back in the dataset settings. Next to File path, select Browse to
choose the output folder where the Office 365 data will be extracted to. Under "File Format Settings", next
to File Format, choose "JSON format", and next to File Pattern, choose "Set of objects".
7. Go back to the pipeline > Sink tab, confirm that OutputBlobDataset is selected.
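For reference, the sink dataset corresponds roughly to the sketch below. The folder path is an illustrative placeholder, and the JsonFormat settings mirror the "JSON format" and "Set of objects" choices made above:
{
    "name": "OutputBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "office365output/contacts/",
            "format": {
                "type": "JsonFormat",
                "filePattern": "setOfObjects"
            }
        }
    }
}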
If this is the first time you are requesting data for this context (a combination of which data table is being accessed,
which destination account the data is being loaded into, and which user identity is making the data access request),
you will see the copy activity status as "In Progress", and only when you click the Details link under Actions will
you see the status as "RequestingConsent". A member of the data access approver group needs to approve the
request in Privileged Access Management before the data extraction can proceed.
Status as requesting consent:
Now go to the destination Azure Blob Storage and verify that Office 365 data has been extracted in JSON format.
Next steps
Advance to the following article to learn about Office 365 support:
Office 365 connector
How to read or write partitioned data in Azure Data
Factory
1/3/2019 • 2 minutes to read
In Azure Data Factory version 1, you could read or write partitioned data by using the SliceStart, SliceEnd,
WindowStart, and WindowEnd system variables. In the current version of Data Factory, you can achieve this
behavior by using a pipeline parameter and a trigger's start time or scheduled time as a value of the parameter.
"folderPath": "adfcustomerprofilingsample/logs/marketingcampaigneffectiveness/{Year}/{Month}/{Day}/",
"partitionedBy": [
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "%M" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "%d" } }
],
For more information about the partitionedBy property, see Copy data to or from Azure Blob storage by using
Azure Data Factory.
To achieve this behavior in the current version of Data Factory:
1. Define a pipeline parameter of type string. In the following example, the name of the pipeline parameter is
windowStartTime.
2. Set folderPath in the dataset definition to reference the value of the pipeline parameter.
3. Pass the actual value for the parameter when you invoke the pipeline on demand. You can also pass a trigger's
start time or scheduled time dynamically at runtime.
"folderPath": {
"value":
"adfcustomerprofilingsample/logs/marketingcampaigneffectiveness/@{formatDateTime(pipeline().parameters.windowS
tartTime, 'yyyy/MM/dd')}/",
"type": "Expression"
},
Example
Here is a sample dataset definition:
{
"name": "SampleBlobDataset",
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value":
"adfcustomerprofilingsample/logs/marketingcampaigneffectiveness/@{formatDateTime(pipeline().parameters.windowS
tartTime, 'yyyy/MM/dd')}/",
"type": "Expression"
},
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"structure": [
{ "name": "ProfileID", "type": "String" },
{ "name": "SessionStart", "type": "String" },
{ "name": "Duration", "type": "Int32" },
{ "name": "State", "type": "String" },
{ "name": "SrcIPAddress", "type": "String" },
{ "name": "GameType", "type": "String" },
{ "name": "Multiplayer", "type": "String" },
{ "name": "EndRank", "type": "String" },
{ "name": "WeaponsUsed", "type": "Int32" },
{ "name": "UsersInteractedWith", "type": "String" },
{ "name": "Impressions", "type": "String" }
],
"linkedServiceName": {
"referenceName": "churnStorageLinkedService",
"type": "LinkedServiceReference"
}
}
Pipeline definition:
{
"properties": {
"activities": [{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": {
"value": "@concat(pipeline().parameters.blobContainer, '/scripts/',
pipeline().parameters.partitionHiveScriptFile)",
"type": "Expression"
},
"scriptLinkedService": {
"referenceName": "churnStorageLinkedService",
"type": "LinkedServiceReference"
},
"defines": {
"RAWINPUT": {
"value": "@concat('wasb://', pipeline().parameters.blobContainer, '@',
pipeline().parameters.blobStorageAccount, '.blob.core.windows.net/logs/',
pipeline().parameters.inputRawLogsFolder, '/')",
"type": "Expression"
},
"Year": {
"value": "@formatDateTime(pipeline().parameters.windowStartTime, 'yyyy')",
"type": "Expression"
},
"Month": {
"value": "@formatDateTime(pipeline().parameters.windowStartTime, 'MM')",
"type": "Expression"
},
"Day": {
"value": "@formatDateTime(pipeline().parameters.windowStartTime, 'dd')",
"type": "Expression"
}
}
},
"linkedServiceName": {
"referenceName": "HdiLinkedService",
"type": "LinkedServiceReference"
},
"name": "HivePartitionGameLogs"
}],
"parameters": {
"windowStartTime": {
"type": "String"
},
"blobStorageAccount": {
"type": "String"
},
"blobContainer": {
"type": "String"
},
"inputRawLogsFolder": {
"type": "String"
}
}
}
}
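To pass a trigger's scheduled time as the value of windowStartTime at runtime (step 3 above), you can reference the trigger's system variable from the trigger definition. The following is a minimal sketch of a schedule trigger that does this; the trigger name, pipeline name, and the other parameter values are placeholders:
{
    "name": "DailyWindowTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2019-01-01T00:00:00Z"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "SamplePipeline",
                    "type": "PipelineReference"
                },
                "parameters": {
                    "windowStartTime": "@trigger().scheduledTime",
                    "blobStorageAccount": "<your storage account name>",
                    "blobContainer": "<your container>",
                    "inputRawLogsFolder": "<your input folder>"
                }
            }
        ]
    }
}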
Next steps
For a complete walkthrough of how to create a data factory that has a pipeline, see Quickstart: Create a data
factory.
Supported file formats and compression codecs in
Azure Data Factory
5/22/2019 • 17 minutes to read
This article applies to the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen1, Azure
Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, HTTP, and SFTP.
If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input
and output dataset definitions. If you want to parse or generate files with a specific format, Azure Data
Factory supports the following file format types:
Text format
JSON format
Parquet format
ORC format
Avro format
TIP
Learn how copy activity maps your source data to sink from Schema mapping in copy activity.
Text format
NOTE
Data Factory introduced a new delimited text format dataset; see the Delimited text format article for details. The following
configurations on file-based data store datasets are still supported as-is for backward compatibility. We suggest that you
use the new model going forward.
If you want to read from a text file or write to a text file, set the type property in the format section of the
dataset to TextFormat. You can also specify the following optional properties in the format section. See
TextFormat example section on how to configure.
TextFormat example
In the following JSON definition for a dataset, some of the optional properties are specified.
"typeProperties":
{
"folderPath": "mycontainer/myfolder",
"fileName": "myblobname",
"format":
{
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": ";",
"quoteChar": "\"",
"NullValue": "NaN",
"firstRowAsHeader": true,
"skipLineCount": 0,
"treatEmptyAsNull": true
}
},
To use an escapeChar instead of quoteChar, replace the line with quoteChar with the following escapeChar:
"escapeChar": "$",
NOTE
For the case of cross-apply data in array into multiple rows (case 1 -> sample 2 in JsonFormat examples), you can only
choose to expand single array using property jsonNodeReference .
{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
}
{"time":"2015-04-
29T07:12:20.9100000Z","callingimsi":"466920403025604","callingnum1":"678948008","callingnum2":
"567834760","switch1":"China","switch2":"Germany"}
{"time":"2015-04-
29T07:13:21.0220000Z","callingimsi":"466922202613463","callingnum1":"123436380","callingnum2":
"789037573","switch1":"US","switch2":"UK"}
{"time":"2015-04-
29T07:13:21.4370000Z","callingimsi":"466923101048691","callingnum1":"678901578","callingnum2":
"345626404","switch1":"Germany","switch2":"UK"}
{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
}
{
"time": "2015-04-29T07:13:21.0220000Z",
"callingimsi": "466922202613463",
"callingnum1": "123436380",
"callingnum2": "789037573",
"switch1": "US",
"switch2": "UK"
}
{
"time": "2015-04-29T07:13:21.4370000Z",
"callingimsi": "466923101048691",
"callingnum1": "678901578",
"callingnum2": "345626404",
"switch1": "Germany",
"switch2": "UK"
}
JsonFormat example
Case 1: Copying data from JSON files
Sample 1: extract data from object and array
In this sample, you expect one root JSON object to map to a single record in the tabular result. If you have a JSON file
with the following content:
{
"id": "ed0e4960-d9c5-11e6-85dc-d7996816aad3",
"context": {
"device": {
"type": "PC"
},
"custom": {
"dimensions": [
{
"TargetResourceType": "Microsoft.Compute/virtualMachines"
},
{
"ResourceManagementProcessRunId": "827f8aaa-ab72-437c-ba48-d8917a7336a3"
},
{
"OccurrenceTime": "1/13/2017 11:24:37 AM"
}
]
}
}
}
and you want to copy it into an Azure SQL table in the following format, by extracting data from both objects
and array:
ID    DEVICETYPE    TARGETRESOURCETYPE    RESOURCEMANAGEMENTPROCESSRUNID    OCCURRENCETIME
The input dataset with JsonFormat type is defined as follows: (partial definition with only the relevant parts).
More specifically:
structure section defines the customized column names and the corresponding data type while converting
to tabular data. This section is optional unless you need to do column mapping. For more information, see
Map source dataset columns to destination dataset columns.
jsonPathDefinition specifies the JSON path for each column indicating where to extract the data from. To
copy data from array, you can use array[x].property to extract value of the given property from the xth
object, or you can use array[*].property to find the value from any object containing such property.
"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "deviceType",
"type": "String"
},
{
"name": "targetResourceType",
"type": "String"
},
{
"name": "resourceManagementProcessRunId",
"type": "String"
},
{
"name": "occurrenceTime",
"type": "DateTime"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects",
"jsonPathDefinition": {"id": "$.id", "deviceType": "$.context.device.type",
"targetResourceType": "$.context.custom.dimensions[0].TargetResourceType",
"resourceManagementProcessRunId": "$.context.custom.dimensions[1].ResourceManagementProcessRunId",
"occurrenceTime": " $.context.custom.dimensions[2].OccurrenceTime"}
}
}
}
Sample 2: cross apply multiple objects with the same pattern from array
In this sample, you expect to transform one root JSON object into multiple records in the tabular result. If you have
a JSON file with the following content:
{
"ordernumber": "01",
"orderdate": "20170122",
"orderlines": [
{
"prod": "p1",
"price": 23
},
{
"prod": "p2",
"price": 13
},
{
"prod": "p3",
"price": 231
}
],
"city": [ { "sanmateo": "No 1" } ]
}
and you want to copy it into an Azure SQL table in the following format, by flattening the data inside the array
and cross joining it with the common root info:
ORDERNUMBER    ORDERDATE    ORDER_PD    ORDER_PRICE    CITY
01    20170122    P1    23    [{"sanmateo":"No 1"}]
01    20170122    P2    13    [{"sanmateo":"No 1"}]
The input dataset with JsonFormat type is defined as follows: (partial definition with only the relevant parts).
More specifically:
structure section defines the customized column names and the corresponding data type while converting
to tabular data. This section is optional unless you need to do column mapping. For more information, see
Map source dataset columns to destination dataset columns.
jsonNodeReference indicates to iterate and extract data from the objects with the same pattern under array
orderlines .
jsonPathDefinition specifies the JSON path for each column indicating where to extract the data from. In
this example, ordernumber , orderdate , and city are under root object with JSON path starting with $. ,
while order_pd and order_price are defined with path derived from the array element without $. .
"properties": {
"structure": [
{
"name": "ordernumber",
"type": "String"
},
{
"name": "orderdate",
"type": "String"
},
{
"name": "order_pd",
"type": "String"
},
{
"name": "order_price",
"type": "Int64"
},
{
"name": "city",
"type": "String"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects",
"jsonNodeReference": "$.orderlines",
"jsonPathDefinition": {"ordernumber": "$.ordernumber", "orderdate": "$.orderdate", "order_pd":
"prod", "order_price": "price", "city": " $.city"}
}
}
}
Case 2: Copying data to JSON files
For each source record, you expect to write a JSON object in the following format:
{
"id": "1",
"order": {
"date": "20170119",
"price": 2000,
"customer": "David"
}
}
The output dataset with JsonFormat type is defined as follows (partial definition with only the relevant parts).
More specifically, the structure section defines the customized property names in the destination file, and
nestingSeparator (default is ".") is used to identify the nesting layer from the name. This section is optional
unless you want to change the property name compared with the source column name, or nest some of the
properties.
"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "order.date",
"type": "String"
},
{
"name": "order.price",
"type": "Int64"
},
{
"name": "order.customer",
"type": "String"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat"
}
}
}
Parquet format
NOTE
Data Factory introduced a new Parquet format dataset; see the Parquet format article for details. The following configurations
on file-based data store datasets are still supported as-is for backward compatibility. We suggest that you use the new
model going forward.
If you want to parse the Parquet files or write the data in Parquet format, set the format type property to
ParquetFormat. You do not need to specify any properties in the Format section within the typeProperties
section. Example:
"format":
{
"type": "ParquetFormat"
}
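For context, here is a sketch of a complete Azure Blob dataset that uses ParquetFormat, modeled on the compression example later in this article; the linked service reference and folder path are placeholders:
{
    "name": "ParquetBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "StorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "mycontainer/parquetfolder/",
            "format": {
                "type": "ParquetFormat"
            }
        }
    }
}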
IMPORTANT
For copies run by the Self-hosted Integration Runtime, for example between on-premises and cloud data stores, if you are not
copying Parquet files as-is, you need to install the 64-bit JRE 8 (Java Runtime Environment) or OpenJDK on your IR
machine. See the following paragraph for more details.
For copies running on the Self-hosted IR with Parquet file serialization/deserialization, ADF locates the Java runtime
by first checking the registry (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for
JRE and, if it's not found, by checking the JAVA_HOME system variable for OpenJDK.
To use JRE: The 64-bit IR requires the 64-bit JRE. You can find it here.
To use OpenJDK: It's supported since IR version 3.13. Package jvm.dll with all other required
assemblies of OpenJDK onto the Self-hosted IR machine, and set the JAVA_HOME system environment variable
accordingly.
TIP
If you copy data to or from Parquet format by using the Self-hosted Integration Runtime and hit an error saying "An error occurred
when invoking java, message: java.lang.OutOfMemoryError:Java heap space", you can add an environment variable
_JAVA_OPTIONS on the machine that hosts the Self-hosted IR to adjust the min/max heap size for the JVM to empower such a
copy, and then rerun the pipeline.
Example: set the variable _JAVA_OPTIONS with the value -Xms256m -Xmx16g. The flag Xms specifies the initial memory
allocation pool for a Java Virtual Machine (JVM), while Xmx specifies the maximum memory allocation pool.
This means that the JVM is started with Xms amount of memory and can use a maximum of Xmx
amount of memory. By default, ADF uses a minimum of 64 MB and a maximum of 1 GB.
Data type mapping for Parquet files
DATA FACTORY INTERIM DATA TYPE    PARQUET PRIMITIVE TYPE    PARQUET ORIGINAL TYPE (DESERIALIZE)    PARQUET ORIGINAL TYPE (SERIALIZE)
ORC format
If you want to parse the ORC files or write the data in ORC format, set the format type property to
OrcFormat. You do not need to specify any properties in the Format section within the typeProperties section.
Example:
"format":
{
"type": "OrcFormat"
}
IMPORTANT
For copies run by the Self-hosted Integration Runtime, for example between on-premises and cloud data stores, if you are not
copying ORC files as-is, you need to install the 64-bit JRE 8 (Java Runtime Environment) or OpenJDK on your IR
machine. See the following paragraph for more details.
For copies running on the Self-hosted IR with ORC file serialization/deserialization, ADF locates the Java runtime by
first checking the registry (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for JRE and, if
it's not found, by checking the JAVA_HOME system variable for OpenJDK.
To use JRE: The 64-bit IR requires the 64-bit JRE. You can find it here.
To use OpenJDK: It's supported since IR version 3.13. Package jvm.dll with all other required
assemblies of OpenJDK onto the Self-hosted IR machine, and set the JAVA_HOME system environment variable
accordingly.
Data type mapping for ORC files
DATA FACTORY INTERIM DATA TYPE ORC TYPES
Boolean Boolean
SByte Byte
Byte Short
Int16 Short
UInt16 Int
Int32 Int
UInt32 Long
Int64 Long
UInt64 String
Single Float
Double Double
Decimal Decimal
String String
DateTime Timestamp
DateTimeOffset Timestamp
TimeSpan Timestamp
ByteArray Binary
Guid String
Char Char(1)
AVRO format
If you want to parse the Avro files or write the data in Avro format, set the format type property to
AvroFormat. You do not need to specify any properties in the Format section within the typeProperties section.
Example:
"format":
{
"type": "AvroFormat",
}
To use Avro format in a Hive table, you can refer to Apache Hive’s tutorial.
Note the following points:
Complex data types are not supported (records, enums, arrays, maps, unions, and fixed).
Compression support
Azure Data Factory supports compressing and decompressing data during copy. When you specify the compression property
in an input dataset, the copy activity reads the compressed data from the source and decompresses it; when
you specify the property in an output dataset, the copy activity compresses the data and then writes it to the sink. Here are a
few sample scenarios:
Read GZIP compressed data from an Azure blob, decompress it, and write result data to an Azure SQL
database. You define the input Azure Blob dataset with the compression type property as GZIP.
Read data from a plain-text file from on-premises File System, compress it using GZip format, and write the
compressed data to an Azure blob. You define an output Azure Blob dataset with the compression type
property as GZip.
Read .zip file from FTP server, decompress it to get the files inside, and land those files in Azure Data Lake
Store. You define an input FTP dataset with the compression type property as ZipDeflate.
Read GZIP-compressed data from an Azure blob, decompress it, compress it using BZIP2, and write the result
data to an Azure blob. You define the input Azure Blob dataset with compression type set to GZIP and the
output dataset with compression type set to BZIP2.
To specify compression for a dataset, use the compression property in the dataset JSON as in the following
example:
{
"name": "AzureBlobDataSet",
"properties": {
"type": "AzureBlob",
"linkedServiceName": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"fileName": "pagecounts.csv.gz",
"folderPath": "compression/file/",
"format": {
"type": "TextFormat"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
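For the fourth scenario above (reading GZIP-compressed data and writing BZIP2-compressed data), the output dataset carries its own compression section. A minimal sketch follows; the linked service reference, file name, and folder path are placeholders:
{
    "name": "AzureBlobOutputDataSet",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "StorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "fileName": "pagecounts.csv.bz2",
            "folderPath": "compression/output/",
            "format": {
                "type": "TextFormat"
            },
            "compression": {
                "type": "BZip2"
            }
        }
    }
}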
NOTE
Compression settings are not supported for data in the AvroFormat, OrcFormat, or ParquetFormat. When reading
files in these formats, Data Factory detects and uses the compression codec in the metadata. When writing to files in
these formats, Data Factory chooses the default compression codec for that format. For example, ZLIB for OrcFormat and
SNAPPY for ParquetFormat.
Next steps
See the following articles for file-based data stores supported by Azure Data Factory:
Azure Blob Storage connector
Azure Data Lake Store connector
Amazon S3 connector
File System connector
FTP connector
SFTP connector
HDFS connector
HTTP connector
Schema mapping in copy activity
4/29/2019 • 6 minutes to read
This article describes how Azure Data Factory copy activity does schema mapping and data type mapping
from source data to sink data when it executes the data copy.
Schema mapping
Column mapping applies when copying data from source to sink. By default, the copy activity maps source
data to the sink by column names. You can specify an explicit mapping to customize the column mapping
based on your needs. More specifically, the copy activity:
1. Reads the data from the source and determines the source schema.
2. Uses default column mapping to map columns by name, or applies the explicit column mapping if specified.
3. Writes the data to the sink.
Explicit mapping
You can specify the columns to map in the copy activity -> translator -> mappings property. The following
example defines a copy activity in a pipeline to copy data from delimited text to Azure SQL Database.
{
"name": "CopyActivity",
"type": "Copy",
"inputs": [{
"referenceName": "DelimitedTextInput",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "AzureSqlOutput",
"type": "DatasetReference"
}],
"typeProperties": {
"source": { "type": "DelimitedTextSource" },
"sink": { "type": "SqlSink" },
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"name": "UserId",
"type": "Guid"
},
"sink": {
"name": "MyUserId"
}
},
{
"source": {
"name": "Name",
"type": "String"
},
"sink": {
"name": "MyName"
}
},
{
"source": {
"name": "Group",
"type": "String"
},
"sink": {
"name": "MyGroup"
}
}
]
}
}
}
The following properties are supported under translator -> mappings -> object with source and sink :
The following properties are supported under translator -> mappings in addition to object with source
and sink :
In this sample, the output dataset has a structure and it points to a table in Salesforce.
{
"name": "SalesforceDataset",
"properties": {
"structure":
[
{ "name": "MyUserId"},
{ "name": "MyName" },
{ "name": "MyGroup"}
],
"type": "SalesforceObject",
"linkedServiceName": {
"referenceName": "SalesforceLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "SinkTable"
}
}
}
The following JSON defines a copy activity in a pipeline. The columns from the source are mapped to columns in
the sink by using the translator -> columnMappings property.
{
"name": "CopyActivity",
"type": "Copy",
"inputs": [
{
"referenceName": "OracleDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SalesforceDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": { "type": "OracleSource" },
"sink": { "type": "SalesforceSink" },
"translator":
{
"type": "TabularTranslator",
"columnMappings":
{
"UserId": "MyUserId",
"Group": "MyGroup",
"Name": "MyName"
}
}
}
}
If you are using the syntax of "columnMappings": "UserId: MyUserId, Group: MyGroup, Name: MyName" to
specify column mapping, it is still supported as-is.
Alternative schema mapping
You can specify copy activity -> translator -> schemaMapping to map between hierarchical-shaped data
and tabular-shaped data, for example, to copy from MongoDB/REST to a text file or to copy from Oracle to Azure
Cosmos DB's API for MongoDB. The following properties are supported in the copy activity translator
section. For example, suppose you have a source document with the following content:
{
"id": {
"$oid": "592e07800000000000000000"
},
"number": "01",
"date": "20170122",
"orders": [
{
"prod": "p1",
"price": 23
},
{
"prod": "p2",
"price": 13
},
{
"prod": "p3",
"price": 231
}
],
"city": [ { "name": "Seattle" } ]
}
and you want to copy it into an Azure SQL table in the following format, by flattening the data inside the
array (order_pd and order_price) and cross joining it with the common root info (number, date, and city):
01 20170122 P1 23 Seattle
01 20170122 P2 13 Seattle
Configure the schema-mapping rule as in the following copy activity JSON sample:
{
"name": "CopyFromMongoDBToOracle",
"type": "Copy",
"typeProperties": {
"source": {
"type": "MongoDbV2Source"
},
"sink": {
"type": "OracleSink"
},
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"orderNumber": "$.number",
"orderDate": "$.date",
"order_pd": "prod",
"order_price": "price",
"city": " $.city[0].name"
},
"collectionReference": "$.orders"
}
}
}
Next steps
See the other Copy Activity articles:
Copy activity overview
Fault tolerance of copy activity in Azure Data Factory
4/8/2019 • 3 minutes to read
The copy activity in Azure Data Factory offers you two ways to handle incompatible rows when copying data
between source and sink data stores:
You can abort and fail the copy activity when incompatible data is encountered (default behavior).
You can continue to copy all of the data by adding fault tolerance and skipping incompatible data rows. In
addition, you can log the incompatible rows in Azure Blob storage or Azure Data Lake Store. You can then
examine the log to learn the cause for the failure, fix the data on the data source, and retry the copy activity.
Supported scenarios
Copy Activity supports three scenarios for detecting, skipping, and logging incompatible data:
Incompatibility between the source data type and the sink native type.
For example: Copy data from a CSV file in Blob storage to a SQL database with a schema definition that
contains three INT type columns. The CSV file rows that contain numeric data, such as 123,456,789 are
copied successfully to the sink store. However, the rows that contain non-numeric values, such as 123,456,
abc are detected as incompatible and are skipped.
Mismatch in the number of columns between the source and the sink.
For example: Copy data from a CSV file in Blob storage to a SQL database with a schema definition that
contains six columns. The CSV file rows that contain six columns are copied successfully to the sink store.
The CSV file rows that contain more or fewer than six columns are detected as incompatible and are
skipped.
Primary key violation when writing to SQL Server/Azure SQL Database/Azure Cosmos DB.
For example: Copy data from a SQL server to a SQL database. A primary key is defined in the sink SQL
database, but no such primary key is defined in the source SQL server. The duplicated rows that exist in the
source cannot be copied to the sink. Copy Activity copies only the first row of the source data into the sink.
The subsequent source rows that contain the duplicated primary key value are detected as incompatible
and are skipped.
NOTE
For loading data into SQL Data Warehouse using PolyBase, configure PolyBase's native fault tolerance settings by
specifying reject policies via "polyBaseSettings" in copy activity. You can still enable redirecting PolyBase incompatible
rows to Blob or ADLS as normal as shown below.
This feature doesn't apply when copy activity is configured to invoke Amazon Redshift Unload.
Configuration
The following example provides a JSON definition to configure skipping the incompatible rows in Copy Activity:
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
},
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": {
"referenceName": "<Azure Storage or Data Lake Store linked service>",
"type": "LinkedServiceReference"
},
"path": "redirectcontainer/erroroutput"
}
}
path    The path of the log file that contains the skipped rows. Specify the path that you want to use to log the incompatible data. If you do not provide a path, the service creates a container for you.    Required: No
"output": {
"dataRead": 95,
"dataWritten": 186,
"rowsCopied": 9,
"rowsSkipped": 2,
"copyDuration": 16,
"throughput": 0.01,
"redirectRowPath": "https://myblobstorage.blob.core.windows.net//myfolder/a84bf8d4-233f-4216-8cb5-
45962831cd1b/",
"errors": []
},
If you configure logging of the incompatible rows, you can find the log file at this path:
https://[your-blob-account].blob.core.windows.net/[path-if-configured]/[copy-activity-run-id]/[auto-generated-GUID].csv
The log files can only be CSV files. The original data that is skipped is logged with a comma as the column
delimiter, if needed. In addition to the original source data, the log file includes two more columns, "ErrorCode" and
"ErrorMessage", where you can see the root cause of the incompatibility. The ErrorCode and ErrorMessage
values are quoted with double quotes.
An example of the log file content is as follows:
data1, data2, data3, "UserErrorInvalidDataValue", "Column 'Prop_2' contains an invalid value 'data3'. Cannot
convert 'data3' to type 'DateTime'."
data4, data5, data6, "2627", "Violation of PRIMARY KEY constraint 'PK_tblintstrdatetimewithpk'. Cannot insert
duplicate key in object 'dbo.tblintstrdatetimewithpk'. The duplicate key value is (data4)."
Next steps
See the other Copy Activity articles:
Copy activity overview
Copy activity performance
Copy Activity performance and tuning guide
5/31/2019 • 24 minutes to read
Azure Data Factory Copy Activity delivers a first-class secure, reliable, and high-performance data loading
solution. It enables you to copy tens of terabytes of data every day across a rich variety of cloud and on-
premises data stores. Blazing-fast data loading performance is key to ensure you can focus on the core “big
data” problem: building advanced analytics solutions and getting deep insights from all that data.
Azure provides a set of enterprise-grade data storage and data warehouse solutions, and Copy Activity offers a
highly optimized data loading experience that is easy to configure and set up. With just a single copy activity,
you can achieve:
Loading data into Azure SQL Data Warehouse at 1.2 GBps.
Loading data into Azure Blob storage at 1.0 GBps
Loading data into Azure Data Lake Store at 1.0 GBps
This article describes:
Performance reference numbers for supported source and sink data stores to help you plan your project;
Features that can boost the copy throughput in different scenarios, including data integration units, parallel
copy, and staged Copy;
Performance tuning guidance on how to tune the performance and the key factors that can impact copy
performance.
NOTE
If you are not familiar with Copy Activity in general, see Copy Activity Overview before reading this article.
Performance reference
As a reference, the following table shows the copy throughput numbers in MBps for the given source and sink pairs in
a single copy activity run, based on in-house testing. For comparison, it also demonstrates how different
settings of Data Integration Units or Self-hosted Integration Runtime scalability (multiple nodes) can improve
copy performance.
IMPORTANT
When copy activity is executed on an Azure Integration Runtime, the minimal allowed Data Integration Units (formerly
known as Data Movement Units) is two. If not specified, see default Data Integration Units being used in Data Integration
Units.
Points to note:
Throughput is calculated by using the following formula: [size of data read from source]/[Copy Activity run
duration]. For example, if 10,240 MB is read from the source and the copy activity run takes 512 seconds, the
throughput is 10,240 MB / 512 s = 20 MBps.
The performance reference numbers in the table were measured using TPC -H dataset in a single copy
activity run. Test files for file-based stores are multiple files with 10GB in size.
In Azure data stores, the source and sink are in the same Azure region.
For hybrid copy between on-premises and cloud data stores, each Self-hosted Integration Runtime node
was running on a machine that was separate from the data store with below specification. When a single
activity was running, the copy operation consumed only a small portion of the test machine's CPU, memory,
or network bandwidth.
Memory 128 GB
Copy data between file-based stores: Between 4 and 32, depending on the number and size of the files.
To override this default, specify a value for the dataIntegrationUnits property as follows. The allowed values
for the dataIntegrationUnits property range up to 256. The actual number of DIUs that the copy operation
uses at run time is equal to or less than the configured value, depending on your data pattern. For information
about the level of performance gain you might get when you configure more units for a specific copy source
and sink, see the performance reference.
You can see the Data Integration Units actually used for each copy run in the copy activity output when you
monitor an activity run. For details, see Copy activity monitoring.
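For illustration, a copy activity run's output might look something like the snippet below. The field names, such as usedDataIntegrationUnits and usedParallelCopies, reflect our understanding of the service output and can differ by service version; the values are made up:
"output": {
    "dataRead": 10737418240,
    "dataWritten": 10737418240,
    "filesRead": 10,
    "filesWritten": 10,
    "copyDuration": 512,
    "usedDataIntegrationUnits": 32,
    "usedParallelCopies": 8
}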
NOTE
Setting of DIUs larger than 4 currently applies only when you copy multiple files from Azure Storage/Data Lake
Storage/Amazon S3/Google Cloud Storage/cloud FTP/cloud SFTP to any other cloud data stores.
Example:
"activities":[
{
"name": "Sample copy activity",
"type": "Copy",
"inputs": [...],
"outputs": [...],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "AzureDataLakeStoreSink"
},
"dataIntegrationUnits": 32
}
}
]
Data Integration Units billing impact
It's important to remember that you are charged based on the total time of the copy operation. The total
duration you are billed for data movement is the sum of duration across DIUs. If a copy job used to take one
hour with two cloud units and now it takes 15 minutes with eight cloud units, the overall bill remains almost the
same.
Parallel Copy
You can use the parallelCopies property to indicate the parallelism that you want Copy Activity to use. You can
think of this property as the maximum number of threads within Copy Activity that can read from your source
or write to your sink data stores in parallel.
For each Copy Activity run, Data Factory determines the number of parallel copies to use to copy data from the
source data store and to the destination data store. The default number of parallel copies that it uses depends
on the type of source and sink that you are using:
Copy data between file-based stores: Depends on the size of the files and the number of Data Integration Units (DIUs) used to copy data between two cloud data stores, or the physical configuration of the Self-hosted Integration Runtime machine.
TIP
When copying data between file-based stores, the default behavior (auto determined) usually gives you the best
throughput.
To control the load on machines that host your data stores, or to tune copy performance, you may choose to
override the default value and specify a value for the parallelCopies property. The value must be an integer
greater than or equal to 1. At run time, for the best performance, Copy Activity uses a value that is less than or
equal to the value that you set.
"activities":[
{
"name": "Sample copy activity",
"type": "Copy",
"inputs": [...],
"outputs": [...],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "AzureDataLakeStoreSink"
},
"parallelCopies": 32
}
}
]
Points to note:
When you copy data between file-based stores, the parallelCopies setting determines the parallelism at the file
level. The chunking within a single file happens underneath automatically and transparently, and it's
designed to use the best-suited chunk size for a given source data store type to load data in parallel,
orthogonal to parallelCopies. The actual number of parallel copies the data movement service uses for the
copy operation at run time is no more than the number of files you have. If the copy behavior is mergeFile,
Copy Activity cannot take advantage of file-level parallelism.
When you specify a value for the parallelCopies property, consider the load increase on your source and
sink data stores, and on the Self-hosted Integration Runtime if the copy activity is empowered by it (for example,
for hybrid copy). This happens especially when you have multiple activities or concurrent runs of the same
activities that run against the same data store. If you notice that either the data store or Self-hosted
Integration Runtime is overwhelmed with the load, decrease the parallelCopies value to relieve the load.
When you copy data from stores that are not file-based to stores that are file-based, the data movement
service ignores the parallelCopies property. Even if parallelism is specified, it's not applied in this case.
parallelCopies is orthogonal to dataIntegrationUnits. The former is counted across all the Data
Integration Units.
Staged copy
When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an
interim staging store. Staging is especially useful in the following cases:
You want to ingest data from various data stores into SQL Data Warehouse via PolyBase. SQL Data
Warehouse uses PolyBase as a high-throughput mechanism to load a large amount of data into SQL Data
Warehouse. However, the source data must be in Blob storage or Azure Data Lake Store, and it must meet
additional criteria. When you load data from a data store other than Blob storage or Azure Data Lake Store,
you can activate data copying via interim staging Blob storage. In that case, Data Factory performs the
required data transformations to ensure that it meets the requirements of PolyBase. Then it uses PolyBase to
load data into SQL Data Warehouse efficiently. For more information, see Use PolyBase to load data into
Azure SQL Data Warehouse.
Sometimes it takes a while to perform a hybrid data movement (that is, to copy from an on-
premises data store to a cloud data store) over a slow network connection. To improve performance,
you can use staged copy to compress the data on-premises so that it takes less time to move data to the
staging data store in the cloud, and then decompress the data in the staging store before loading it into the
destination data store.
You don't want to open ports other than port 80 and port 443 in your firewall, because of corporate
IT policies. For example, when you copy data from an on-premises data store to an Azure SQL Database
sink or an Azure SQL Data Warehouse sink, you need to activate outbound TCP communication on port
1433 for both the Windows firewall and your corporate firewall. In this scenario, staged copy can take
advantage of the Self-hosted Integration Runtime to first copy data to a Blob storage staging instance over
HTTP or HTTPS on port 443, then load the data into SQL Database or SQL Data Warehouse from Blob
storage staging. In this flow, you don't need to enable port 1433.
How staged copy works
When you activate the staging feature, first the data is copied from the source data store to the staging Blob
storage (bring your own). Next, the data is copied from the staging data store to the sink data store. Data
Factory automatically manages the two-stage flow for you. Data Factory also cleans up temporary data from the
staging storage after the data movement is complete.
When you activate data movement by using a staging store, you can specify whether you want the data to be
compressed before moving data from the source data store to an interim or staging data store, and then
decompressed before moving data from an interim or staging data store to the sink data store.
Currently, you can't copy data between two on-premises data stores by using a staging store.
Configuration
Configure the enableStaging setting in Copy Activity to specify whether you want the data to be staged in
Blob storage before you load it into a destination data store. When you set enableStaging to TRUE, specify the
additional properties listed in the next table. You also need to create an Azure Storage linked service or a Storage
shared access signature linked service for staging if you don't have one.
Here's a sample definition of Copy Activity with the properties that are described in the preceding table:
"activities":[
{
"name": "Sample copy activity",
"type": "Copy",
"inputs": [...],
"outputs": [...],
"typeProperties": {
"source": {
"type": "SqlSource",
},
"sink": {
"type": "SqlSink"
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "MyStagingBlob",
"type": "LinkedServiceReference"
},
"path": "stagingcontainer/path",
"enableCompression": true
}
}
}
]
In addition, the following are some common considerations. A full description of performance diagnosis
is beyond the scope of this article.
Performance features:
Parallel copy
Data integration units
Staged copy
Self-hosted Integration Runtime scalability
Self-hosted Integration Runtime
Source
Sink
Serialization and deserialization
Compression
Column mapping
Other considerations
3. Expand the configuration to your entire data set. When you're satisfied with the execution results
and performance, you can expand the definition and pipeline to cover your entire data set.
Other considerations
If the size of data you want to copy is large, you can adjust your business logic to further partition the data and
schedule Copy Activity to run more frequently to reduce the data size for each Copy Activity run (see the sketch
after these considerations).
Be cautious about the number of data sets and copy activities that require Data Factory to connect to the same
data store at the same time. Many concurrent copy jobs might throttle a data store and lead to degraded
performance, copy job internal retries, and in some cases, execution failures.
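As mentioned in the first consideration above, one way to partition a large copy is to filter the source query by a
time window. The following fragment is a minimal sketch; the dbo.Orders table and the WindowStart/WindowEnd
pipeline parameters are hypothetical names used only for illustration:
"source": {
    "type": "SqlSource",
    "sqlReaderQuery": "SELECT * FROM dbo.Orders WHERE OrderDate >= '@{pipeline().parameters.WindowStart}' AND OrderDate < '@{pipeline().parameters.WindowEnd}'"
}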
Sample scenario: Copy from an on-premises SQL Server to Blob
storage
Scenario: A pipeline is built to copy data from an on-premises SQL Server to Blob storage in CSV format. To
make the copy job faster, the CSV files should be compressed into bzip2 format.
Test and analysis: The throughput of Copy Activity is less than 2 MBps, which is much slower than the
performance benchmark.
Performance analysis and tuning: To troubleshoot the performance issue, let’s look at how the data is
processed and moved.
1. Read data: Integration runtime opens a connection to SQL Server and sends the query. SQL Server
responds by sending the data stream to integration runtime via the intranet.
2. Serialize and compress data: Integration runtime serializes the data stream to CSV format, and
compresses the data to a bzip2 stream.
3. Write data: Integration runtime uploads the bzip2 stream to Blob storage via the Internet.
As you can see, the data is being processed and moved in a streaming sequential manner: SQL Server > LAN >
Integration runtime > WAN > Blob storage. The overall performance is gated by the minimum
throughput across the pipeline.
One or more of the following factors might cause the performance bottleneck:
Source: SQL Server itself has low throughput because of heavy loads.
Self-hosted Integration Runtime:
LAN: Integration runtime is located far from the SQL Server machine and has a low-bandwidth
connection.
Integration runtime: Integration runtime has reached its load limits while performing the following
operations:
Serialization: Serializing the data stream to CSV format has slow throughput.
Compression: You chose a slow compression codec (for example, bzip2, which is 2.8 MBps
with Core i7).
WAN: The bandwidth between the corporate network and your Azure services is low (for example, T1
= 1,544 kbps; T2 = 6,312 kbps).
Sink: Blob storage has low throughput. (This scenario is unlikely because its SLA guarantees a minimum of
60 MBps.)
In this case, bzip2 data compression might be slowing down the entire pipeline. Switching to a gzip
compression codec might ease this bottleneck.
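For example, the following is a minimal sketch of an Azure Blob output dataset that requests gzip instead of
bzip2; the dataset name, folder path, and linked service reference are placeholders:
{
    "name": "OutputBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "<Azure Storage linked service>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "<container>/<folder>",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            },
            "compression": {
                "type": "GZip",
                "level": "Fastest"
            }
        }
    }
}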
Reference
Here are performance monitoring and tuning references for some of the supported data stores:
Azure Storage (including Blob storage and Table storage): Azure Storage scalability targets and Azure
Storage performance and scalability checklist
Azure SQL Database: You can monitor the performance and check the database transaction unit (DTU)
percentage
Azure SQL Data Warehouse: Its capability is measured in data warehouse units (DWUs); see Manage
compute power in Azure SQL Data Warehouse (Overview)
Azure Cosmos DB: Performance levels in Azure Cosmos DB
On-premises SQL Server: Monitor and tune for performance
On-premises file server: Performance tuning for file servers
Next steps
See the other Copy Activity articles:
Copy activity overview
Copy Activity schema mapping
Copy activity fault tolerance
Transform data in Azure Data Factory
3/7/2019 • 4 minutes to read • Edit Online
Overview
This article explains data transformation activities in Azure Data Factory that you can use to transform and
process your raw data into predictions and insights. A transformation activity executes in a computing
environment such as an Azure HDInsight cluster or Azure Batch. This article also provides links to articles with
detailed information on each transformation activity.
Data Factory supports the following data transformation activities that can be added to pipelines either
individually or chained with another activity.
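For example, chaining is expressed with the dependsOn property on the downstream activity. The following
fragment is a minimal sketch in which a hypothetical Spark activity runs only after a Hive activity succeeds; the
activity names, linked service names, and paths are placeholders:
"activities": [
    {
        "name": "PrepareDataWithHive",
        "type": "HDInsightHive",
        "linkedServiceName": {
            "referenceName": "MyHDInsightLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "scriptLinkedService": {
                "referenceName": "MyAzureStorageLinkedService",
                "type": "LinkedServiceReference"
            },
            "scriptPath": "<container>/HiveScripts/prepare.hql"
        }
    },
    {
        "name": "ScoreDataWithSpark",
        "type": "HDInsightSpark",
        "dependsOn": [
            {
                "activity": "PrepareDataWithHive",
                "dependencyConditions": [ "Succeeded" ]
            }
        ],
        "linkedServiceName": {
            "referenceName": "MyHDInsightLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "sparkJobLinkedService": {
                "referenceName": "MyAzureStorageLinkedService",
                "type": "LinkedServiceReference"
            },
            "rootPath": "<container>/pyFiles",
            "entryFilePath": "score.py"
        }
    }
]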
Custom activity
If you need to transform data in a way that is not supported by Data Factory, you can create a custom activity
with your own data processing logic and use the activity in the pipeline. You can configure the custom .NET
activity to run using either an Azure Batch service or an Azure HDInsight cluster. See Use custom activities
article for details.
You can create a custom activity to run R scripts on your HDInsight cluster with R installed. See Run R Script
using Azure Data Factory.
Compute environments
You create a linked service for the compute environment and then use the linked service when defining a
transformation activity. There are two types of compute environments supported by Data Factory.
On-Demand: In this case, the computing environment is fully managed by Data Factory. It is automatically
created by the Data Factory service before a job is submitted to process data and removed when the job is
completed. You can configure and control granular settings of the on-demand compute environment for job
execution, cluster management, and bootstrapping actions.
Bring Your Own: In this case, you can register your own computing environment (for example HDInsight
cluster) as a linked service in Data Factory. The computing environment is managed by you and the Data
Factory service uses it to execute the activities.
See Compute Linked Services article to learn about compute services supported by Data Factory.
Next steps
See the following tutorial for an example of using a transformation activity: Tutorial: transform data using Spark
Transform data using Hadoop Hive activity in Azure
Data Factory
3/7/2019 • 2 minutes to read • Edit Online
The HDInsight Hive activity in a Data Factory pipeline executes Hive queries on your own or on-demand
HDInsight cluster. This article builds on the data transformation activities article, which presents a general
overview of data transformation and the supported transformation activities.
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the Tutorial:
transform data before reading this article.
Syntax
{
"name": "Hive Activity",
"description": "description",
"type": "HDInsightHive",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scriptLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"scriptPath": "MyAzureStorage\\HiveScripts\\MyHiveSript.hql",
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
],
"defines": {
"param1": "param1Value"
}
}
}
Syntax details
PROPERTY DESCRIPTION REQUIRED
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Machine Learning Batch Execution activity
Stored procedure activity
Transform data using Hadoop Pig activity in Azure
Data Factory
3/7/2019 • 2 minutes to read • Edit Online
The HDInsight Pig activity in a Data Factory pipeline executes Pig queries on your own or on-demand HDInsight
cluster. This article builds on the data transformation activities article, which presents a general overview of data
transformation and the supported transformation activities.
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the Tutorial:
transform data before reading this article.
Syntax
{
"name": "Pig Activity",
"description": "description",
"type": "HDInsightPig",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scriptLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"scriptPath": "MyAzureStorage\\PigScripts\\MyPigSript.pig",
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
],
"defines": {
"param1": "param1Value"
}
}
}
Syntax details
PROPERTY DESCRIPTION REQUIRED
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Machine Learning Batch Execution activity
Stored procedure activity
Transform data using Hadoop MapReduce activity in
Azure Data Factory
3/7/2019 • 2 minutes to read • Edit Online
The HDInsight MapReduce activity in a Data Factory pipeline invokes a MapReduce program on your own or on-
demand HDInsight cluster. This article builds on the data transformation activities article, which presents a
general overview of data transformation and the supported transformation activities.
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the tutorial:
Tutorial: transform data before reading this article.
See Pig and Hive for details about running Pig/Hive scripts on an HDInsight cluster from a pipeline by using
HDInsight Pig and Hive activities.
Syntax
{
"name": "Map Reduce Activity",
"description": "Description",
"type": "HDInsightMapReduce",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"className": "org.myorg.SampleClass",
"jarLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"jarFilePath": "MyAzureStorage/jars/sample.jar",
"getDebugInfo": "Failure",
"arguments": [
"-SampleHadoopJobArgument1"
],
"defines": {
"param1": "param1Value"
}
}
}
Syntax details
PROPERTY DESCRIPTION REQUIRED
Example
You can use the HDInsight MapReduce Activity to run any MapReduce jar file on an HDInsight cluster. In the
following sample JSON definition of a pipeline, the HDInsight Activity is configured to run a Mahout JAR file.
{
"name": "MapReduce Activity for Mahout",
"description": "Custom MapReduce to generate Mahout result",
"type": "HDInsightMapReduce",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"className": "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
"jarLinkedService": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
},
"jarFilePath": "adfsamples/Mahout/jars/mahout-examples-0.9.0.2.2.7.1-34.jar",
"arguments": [
"-s",
"SIMILARITY_LOGLIKELIHOOD",
"--input",
"wasb://adfsamples@spestore.blob.core.windows.net/Mahout/input",
"--output",
"wasb://adfsamples@spestore.blob.core.windows.net/Mahout/output/",
"--maxSimilaritiesPerItem",
"500",
"--tempDir",
"wasb://adfsamples@spestore.blob.core.windows.net/Mahout/temp/mahout"
]
}
}
You can specify any arguments for the MapReduce program in the arguments section. At runtime, you see a
few extra arguments (for example, mapreduce.job.tags) from the MapReduce framework. To differentiate your
arguments from the MapReduce framework arguments, consider using both the option and its value as separate
arguments, as shown in the preceding example (-s, --input, --output, and so on, are options immediately followed by their values).
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Machine Learning Batch Execution activity
Stored procedure activity
Transform data using Hadoop Streaming activity in
Azure Data Factory
3/7/2019 • 2 minutes to read • Edit Online
The HDInsight Streaming Activity in a Data Factory pipeline executes Hadoop Streaming programs on your own
or on-demand HDInsight cluster. This article builds on the data transformation activities article, which presents a
general overview of data transformation and the supported transformation activities.
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the Tutorial:
transform data before reading this article.
JSON sample
{
"name": "Streaming Activity",
"description": "Description",
"type": "HDInsightStreaming",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"mapper": "MyMapper.exe",
"reducer": "MyReducer.exe",
"combiner": "MyCombiner.exe",
"fileLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"filePaths": [
"<containername>/example/apps/MyMapper.exe",
"<containername>/example/apps/MyReducer.exe",
"<containername>/example/apps/MyCombiner.exe"
],
"input": "wasb://<containername>@<accountname>.blob.core.windows.net/example/input/MapperInput.txt",
"output":
"wasb://<containername>@<accountname>.blob.core.windows.net/example/output/ReducerOutput.txt",
"commandEnvironment": [
"CmdEnvVarName=CmdEnvVarValue"
],
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
],
"defines": {
"param1": "param1Value"
}
}
}
Syntax details
PROPERTY DESCRIPTION REQUIRED
Transform data using Spark activity in Azure Data Factory
The Spark activity in a Data Factory pipeline executes a Spark program on your own or on-demand HDInsight
cluster. This article builds on the data transformation activities article, which presents a general overview of
data transformation and the supported transformation activities. When you use an on-demand Spark linked
service, Data Factory automatically creates a Spark cluster for you just-in-time to process the data and then
deletes the cluster once the processing is complete.
IMPORTANT
Spark Activity does not support HDInsight Spark clusters that use an Azure Data Lake Store as primary storage.
{
"name": "Spark Activity",
"description": "Description",
"type": "HDInsightSpark",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"sparkJobLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"rootPath": "adfspark\\pyFiles",
"entryFilePath": "test.py",
"sparkConfig": {
"ConfigItem1": "Value"
},
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
]
}
}
The following table describes the JSON properties used in the JSON definition:
Folder structure
Spark jobs are more extensible than Pig/Hive jobs. For Spark jobs, you can provide multiple dependencies
such as jar packages (placed in the java CLASSPATH), python files (placed on the PYTHONPATH), and any
other files.
Create the following folder structure in the Azure Blob storage referenced by the HDInsight linked service.
Then, upload dependent files to the appropriate sub folders in the root folder represented by entryFilePath.
For example, upload python files to the pyFiles subfolder and jar files to the jars subfolder of the root folder. At
runtime, Data Factory service expects the following folder structure in the Azure Blob storage:
Here is an example of a storage account containing two Spark jobs in the Azure Blob Storage referenced by the
HDInsight linked service.
SparkJob1
main.jar
files
input1.txt
input2.txt
jars
package1.jar
package2.jar
logs
SparkJob2
main.py
pyFiles
script1.py
script2.py
logs
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Machine Learning Batch Execution activity
Stored procedure activity
Create predictive pipelines using Azure Machine
Learning and Azure Data Factory
3/12/2019 • 7 minutes to read • Edit Online
Azure Machine Learning enables you to build, test, and deploy predictive analytics solutions. From a high-level
point of view, it is done in three steps:
1. Create a training experiment. You do this step by using the Azure Machine Learning studio. Azure
Machine Learning studio is a collaborative visual development environment that you use to train and test a
predictive analytics model using training data.
2. Convert it to a predictive experiment. Once your model has been trained with existing data and you are
ready to use it to score new data, you prepare and streamline your experiment for scoring.
3. Deploy it as a web service. You can publish your scoring experiment as an Azure web service. You can
send data to your model via this web service end point and receive result predictions from the model.
Data Factory and Machine Learning together
Azure Data Factory enables you to easily create pipelines that use a published Azure Machine Learning web
service for predictive analytics. Using the Batch Execution Activity in an Azure Data Factory pipeline, you can
invoke an Azure Machine Learning studio web service to make predictions on the data in batch.
Over time, the predictive models in the Azure Machine Learning studio scoring experiments need to be
retrained using new input datasets. You can retrain a model from a Data Factory pipeline by doing the following
steps:
1. Publish the training experiment (not predictive experiment) as a web service. You do this step in the Azure
Machine Learning studio as you did to expose predictive experiment as a web service in the previous
scenario.
2. Use the Azure Machine Learning studio Batch Execution Activity to invoke the web service for the training
experiment. Basically, you can use the Azure Machine Learning studio Batch Execution activity to invoke
both training web service and scoring web service.
After you are done with retraining, update the scoring web service (predictive experiment exposed as a web
service) with the newly trained model by using the Azure Machine Learning studio Update Resource
Activity. See Updating models using Update Resource Activity article for details.
See Compute linked services article for descriptions about properties in the JSON definition.
Azure Machine Learning supports both Classic Web Services and New Web Services for your predictive
experiment. You can choose the right one to use from Data Factory. To get the information required to create
the Azure Machine Learning linked service, go to https://services.azureml.net, where all your (new) Web
Services and Classic Web Services are listed. Click the Web Service you would like to access, and then click the
Consume page. Copy the Primary Key for the apiKey property, and the Batch Requests URL for the mlEndpoint property.
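As a quick orientation, here is a minimal sketch of where those two values land in the linked service definition;
both values below are placeholders:
{
    "name": "AzureMLLinkedService",
    "properties": {
        "type": "AzureML",
        "typeProperties": {
            "mlEndpoint": "<Batch Requests URL from the Consume page>",
            "apiKey": {
                "type": "SecureString",
                "value": "<Primary Key from the Consume page>"
            }
        }
    }
}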
Scenario 1: Experiments using Web service inputs/outputs that refer to data in Azure Blob Storage
In this scenario, the Azure Machine Learning Web service makes predictions using data from a file in Azure
Blob storage and stores the prediction results in the blob storage. The following JSON defines a Data Factory
pipeline with an AzureMLBatchExecution activity. The input and output data in Azure Blob Storage is referenced
by using a LinkedServiceName and FilePath pair. In the sample, the Linked Services of the inputs and outputs are
different; you can use a different Linked Service for each of your inputs/outputs so that Data Factory can pick up
the right files and send them to the Azure Machine Learning studio Web Service.
IMPORTANT
In your Azure Machine Learning studio experiment, web service input and output ports, and global parameters have
default names ("input1", "input2") that you can customize. The names you use for webServiceInputs, webServiceOutputs,
and globalParameters settings must exactly match the names in the experiments. You can view the sample request
payload on the Batch Execution Help page for your Azure Machine Learning studio endpoint to verify the expected
mapping.
{
"name": "AzureMLExecutionActivityTemplate",
"description": "description",
"type": "AzureMLBatchExecution",
"linkedServiceName": {
"referenceName": "AzureMLLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"webServiceInputs": {
"input1": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService1",
"type": "LinkedServiceReference"
},
"FilePath":"amltest/input/in1.csv"
},
"input2": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService1",
"type": "LinkedServiceReference"
},
"FilePath":"amltest/input/in2.csv"
}
},
"webServiceOutputs": {
"outputName1": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService2",
"type": "LinkedServiceReference"
},
"FilePath":"amltest2/output/out1.csv"
},
"outputName2": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService2",
"type": "LinkedServiceReference"
},
"FilePath":"amltest2/output/out2.csv"
}
}
}
}
Let's look at a scenario for using Web service parameters. You have a deployed Azure Machine Learning web
service that uses a reader module to read data from one of the data sources supported by Azure Machine
Learning (for example: Azure SQL Database). After the batch execution is performed, the results are written
using a Writer module (Azure SQL Database). No web service inputs and outputs are defined in the
experiments. In this case, we recommend that you configure relevant web service parameters for the reader and
writer modules. This configuration allows the reader/writer modules to be configured when using the
AzureMLBatchExecution activity. You specify Web service parameters in the globalParameters section in the
activity JSON as follows.
"typeProperties": {
"globalParameters": {
"Database server name": "<myserver>.database.windows.net",
"Database name": "<database>",
"Server user account name": "<user name>",
"Server user account password": "<password>"
}
}
NOTE
The Web service parameters are case-sensitive, so ensure that the names you specify in the activity JSON match the ones
exposed by the Web service.
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Stored procedure activity
Update Azure Machine Learning models by using
Update Resource activity
3/18/2019 • 6 minutes to read • Edit Online
This article complements the main Azure Data Factory - Azure Machine Learning integration article: Create
predictive pipelines using Azure Machine Learning and Azure Data Factory. If you haven't already done so, review
the main article before reading through this article.
Overview
As part of the process of operationalizing Azure Machine Learning models, your model is trained and saved. You
then use it to create a predictive Web service. The Web service can then be consumed in web sites, dashboards,
and mobile apps.
Models you create using Machine Learning are typically not static. As new data becomes available, or when the
consumer of the API has their own data, the model needs to be retrained. Refer to Retrain a Machine Learning
Model for details about how you can retrain a model in Azure Machine Learning.
Retraining may occur frequently. With the Batch Execution activity and the Update Resource activity, you can
use Data Factory to operationalize retraining the Azure Machine Learning model and updating the predictive
Web Service.
The following picture depicts the relationship between training and predictive Web Services.
End-to-end workflow
The entire process of operationalizing the retraining of a model and updating the predictive Web Service involves the
following steps:
Invoke the training Web Service by using the Batch Execution activity. Invoking a training Web Service is
the same as invoking a predictive Web Service described in Create predictive pipelines using Azure Machine
Learning and Data Factory Batch Execution activity. The output of the training Web Service is an iLearner file
that you can use to update the predictive Web Service.
Invoke the update resource endpoint of the predictive Web Service by using the Update Resource
activity to update the Web Service with the newly trained model.
Azure Machine Learning linked service
For the above-mentioned end-to-end workflow to work, you need to create two Azure Machine Learning linked
services:
1. An Azure Machine Learning linked service to the training web service. This linked service is used by the Batch
Execution activity in the same way as described in Create predictive pipelines using Azure Machine
Learning and Data Factory Batch Execution activity. The difference is that the output of the training web service is an
iLearner file, which is then used by the Update Resource activity to update the predictive web service.
2. An Azure Machine Learning linked service to the update resource endpoint of the predictive web service. This
linked service is used by the Update Resource activity to update the predictive web service by using the iLearner file
returned from the preceding step.
For the second Azure Machine Learning linked service, the configuration differs depending on whether your Azure
Machine Learning Web Service is a classic Web Service or a new Web Service. The differences are discussed separately in
the following sections.
https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resource-group-
name}/providers/Microsoft.MachineLearning/webServices/{web-service-name}?api-version=2016-05-01-preview
You can get values for the placeholders in the URL when querying the web service on the Azure Machine Learning
Web Services Portal.
The new type of update resource endpoint requires service principal authentication. To use service principal
authentication, register an application entity in Azure Active Directory (Azure AD ) and grant it the Contributor or
Owner role of the subscription or the resource group that the web service belongs to. See How to create a
service principal and assign permissions to manage Azure resources. Make note of the following values, which you
use to define the linked service:
Application ID
Application key
Tenant ID
Here is a sample linked service definition:
{
"name": "AzureMLLinkedService",
"properties": {
"type": "AzureML",
"description": "The linked service for AML web service.",
"typeProperties": {
"mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/0000000000000000
000000000000000000000/services/0000000000000000000000000000000000000/jobs?api-version=2.0",
"apiKey": {
"type": "SecureString",
"value": "APIKeyOfEndpoint1"
},
"updateResourceEndpoint":
"https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resource-group-
name}/providers/Microsoft.MachineLearning/webServices/{web-service-name}?api-version=2016-05-01-preview",
"servicePrincipalId": "000000000-0000-0000-0000-0000000000000",
"servicePrincipalKey": {
"type": "SecureString",
"value": "servicePrincipalKey"
},
"tenant": "mycompany.com"
}
}
}
The following scenario provides more details. It has an example for retraining and updating Azure Machine
Learning studio models from an Azure Data Factory pipeline.
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=name;AccountKey=key"
}
}
}
In Azure Machine Learning studio, do the following to get values for mlEndpoint and apiKey:
1. Click WEB SERVICES on the left menu.
2. Click the training web service in the list of web services.
3. Click copy next to the API key text box. Paste the copied key into the Data Factory JSON editor.
4. In the Azure Machine Learning studio, click the BATCH EXECUTION link.
5. Copy the Request URI from the Request section and paste it into the Data Factory JSON editor.
Linked service for Azure Machine Learning studio updatable scoring endpoint:
The following JSON snippet defines an Azure Machine Learning linked service that points to updatable endpoint
of the scoring web service.
{
"name": "updatableScoringEndpoint2",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint":
"https://ussouthcentral.services.azureml.net/workspaces/00000000eb0abe4d6bbb1d7886062747d7/services/0000000002
6734a5889e02fbb1f65cefd/jobs?api-version=2.0",
"apiKey":
"sooooooooooh3WvG1hBfKS2BNNcfwSO7hhY6dY98noLfOdqQydYDIXyf2KoIaN3JpALu/AKtflHWMOCuicm/Q==",
"updateResourceEndpoint": "https://management.azure.com/subscriptions/00000000-0000-0000-0000-
000000000000/resourceGroups/Default-MachineLearning-
SouthCentralUS/providers/Microsoft.MachineLearning/webServices/myWebService?api-version=2016-05-01-preview",
"servicePrincipalId": "fe200044-c008-4008-a005-94000000731",
"servicePrincipalKey": "zWa0000000000Tp6FjtZOspK/WMA2tQ08c8U+gZRBlw=",
"tenant": "mycompany.com"
}
}
}
Pipeline
The pipeline has two activities: AzureMLBatchExecution and AzureMLUpdateResource. The Batch Execution
activity takes the training data as input and produces an iLearner file as an output. The Update Resource activity
then takes this iLearner file and uses it to update the predictive web service.
{
"name": "LookupPipelineDemo",
"properties": {
"activities": [
{
"name": "amlBEGetilearner",
"description": "Use AML BES to get the ileaner file from training web service",
"type": "AzureMLBatchExecution",
"linkedServiceName": {
"referenceName": "trainingEndpoint",
"type": "LinkedServiceReference"
},
"typeProperties": {
"webServiceInputs": {
"input1": {
"LinkedServiceName":{
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"FilePath":"azuremltesting/input"
},
"input2": {
"LinkedServiceName":{
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"FilePath":"azuremltesting/input"
}
},
"webServiceOutputs": {
"output1": {
"LinkedServiceName":{
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"FilePath":"azuremltesting/output"
}
}
}
},
{
"name": "amlUpdateResource",
"type": "AzureMLUpdateResource",
"description": "Use AML Update Resource to update the predict web service",
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "updatableScoringEndpoint2"
},
"typeProperties": {
"trainedModelName": "ADFV2Sample Model [trained model]",
"trainedModelLinkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "StorageLinkedService"
},
"trainedModelFilePath": "azuremltesting/output/newModelForArm.ilearner"
},
"dependsOn": [
{
"activity": "amlbeGetilearner",
"dependencyConditions": [ "Succeeded" ]
}
]
}
]
}
}
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Stored procedure activity
Transform data by using the SQL Server Stored
Procedure activity in Azure Data Factory
3/7/2019 • 3 minutes to read • Edit Online
You use data transformation activities in a Data Factory pipeline to transform and process raw data into
predictions and insights. The Stored Procedure Activity is one of the transformation activities that Data Factory
supports. This article builds on the transform data article, which presents a general overview of data
transformation and the supported transformation activities in Data Factory.
NOTE
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the tutorial: Tutorial:
transform data before reading this article.
You can use the Stored Procedure Activity to invoke a stored procedure in one of the following data stores in
your enterprise or on an Azure virtual machine (VM ):
Azure SQL Database
Azure SQL Data Warehouse
SQL Server Database. If you are using SQL Server, install Self-hosted integration runtime on the same
machine that hosts the database or on a separate machine that has access to the database. Self-Hosted
integration runtime is a component that connects data sources on-premises/on Azure VM with cloud
services in a secure and managed way. See Self-hosted integration runtime article for details.
IMPORTANT
When copying data into Azure SQL Database or SQL Server, you can configure the SqlSink in copy activity to invoke a
stored procedure by using the sqlWriterStoredProcedureName property. For details about the property, see following
connector articles: Azure SQL Database, SQL Server. Invoking a stored procedure while copying data into an Azure SQL
Data Warehouse by using a copy activity is not supported. But, you can use the stored procedure activity to invoke a
stored procedure in a SQL Data Warehouse.
When copying data from Azure SQL Database, SQL Server, or Azure SQL Data Warehouse, you can configure the
SqlSource in the copy activity to invoke a stored procedure to read data from the source database by using the
sqlReaderStoredProcedureName property. For more information, see the following connector articles: Azure SQL
Database, SQL Server, Azure SQL Data Warehouse.
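As a minimal sketch of the copy-time option mentioned above, the following SqlSink fragment invokes a stored
procedure while loading; the procedure name, table type, and parameter are hypothetical placeholders:
"sink": {
    "type": "SqlSink",
    "sqlWriterStoredProcedureName": "spUpsertMarketing",
    "sqlWriterTableType": "MarketingType",
    "storedProcedureParameters": {
        "category": { "value": "ProductA" }
    }
}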
Syntax details
Here is the JSON format for defining a Stored Procedure Activity:
{
"name": "Stored Procedure Activity",
"description":"Description",
"type": "SqlServerStoredProcedure",
"linkedServiceName": {
"referenceName": "AzureSqlLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"storedProcedureName": "usp_sample",
"storedProcedureParameters": {
"identifier": { "value": "1", "type": "Int" },
"stringData": { "value": "str1" }
}
}
}
Error info
When a stored procedure fails and returns error details, you can't capture the error info directly in the activity
output. However, Data Factory pumps all of its activity run events, including the error details, to Azure Monitor.
You can, for example, set up email alerts from those events. For more info, see Alert and Monitor data factories
using Azure Monitor.
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL Activity
Hive Activity
Pig Activity
MapReduce Activity
Hadoop Streaming Activity
Spark Activity
.NET custom activity
Machine Learning Batch Execution Activity
Stored procedure activity
Transform data by running U-SQL scripts on Azure
Data Lake Analytics
3/11/2019 • 5 minutes to read • Edit Online
A pipeline in an Azure data factory processes data in linked storage services by using linked compute services.
It contains a sequence of activities where each activity performs a specific processing operation. This article
describes the Data Lake Analytics U-SQL Activity that runs a U-SQL script on an Azure Data Lake
Analytics compute linked service.
Create an Azure Data Lake Analytics account before creating a pipeline with a Data Lake Analytics U-SQL
Activity. To learn about Azure Data Lake Analytics, see Get started with Azure Data Lake Analytics.
{
"name": "AzureDataLakeAnalyticsLinkedService",
"properties": {
"type": "AzureDataLakeAnalytics",
"typeProperties": {
"accountName": "<account name>",
"dataLakeAnalyticsUri": "<azure data lake analytics URI>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"value": "<service principal key>",
"type": "SecureString"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"subscriptionId": "<optional, subscription id of ADLA>",
"resourceGroupName": "<optional, resource group name of ADLA>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
To learn more about the linked service, see Compute linked services.
The following table describes names and descriptions of properties that are specific to this activity.
@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "en-gb";
@rs1 =
SELECT Start, Region, Duration
FROM @rs1
WHERE Start <= DateTime.Parse("2012/02/19");
OUTPUT @rs1
TO @out
USING Outputters.Tsv(quoting:false, dateTimeFormat:null);
In the above script example, the input and output of the script are defined in the @in and @out parameters. The
values for the @in and @out parameters in the U-SQL script are passed dynamically by Data Factory using the
'parameters' section.
You can specify other properties such as degreeOfParallelism and priority as well in your pipeline definition for
the jobs that run on the Azure Data Lake Analytics service.
Dynamic parameters
In the sample pipeline definition, the in and out parameters are assigned hard-coded values.
"parameters": {
"in": "/datalake/input/SearchLog.tsv",
"out": "/datalake/output/Result.tsv"
}
"parameters": {
"in": "/datalake/input/@{formatDateTime(pipeline().parameters.WindowStart,'yyyy/MM/dd')}/data.tsv",
"out": "/datalake/output/@{formatDateTime(pipeline().parameters.WindowStart,'yyyy/MM/dd')}/result.tsv"
}
In this case, input files are still picked up from the /datalake/input folder and output files are generated in the
/datalake/output folder. The file names are dynamic, based on the window start time being passed in when the
pipeline gets triggered.
Next steps
See the following articles that explain how to transform data in other ways:
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Machine Learning Batch Execution activity
Stored procedure activity
Transform data by running a Databricks notebook
3/7/2019 • 2 minutes to read • Edit Online
The Azure Databricks Notebook Activity in a Data Factory pipeline runs a Databricks notebook in your Azure
Databricks workspace. This article builds on the data transformation activities article, which presents a general
overview of data transformation and the supported transformation activities. Azure Databricks is a managed
platform for running Apache Spark.
{
"activity": {
"name": "MyActivity",
"description": "MyActivity description",
"type": "DatabricksNotebook",
"linkedServiceName": {
"referenceName": "MyDatabricksLinkedservice",
"type": "LinkedServiceReference"
},
"typeProperties": {
"notebookPath": "/Users/user@example.com/ScalaExampleNotebook",
"baseParameters": {
"inputpath": "input/folder1/",
"outputpath": "output/"
},
"libraries": [
{
"jar": "dbfs:/docs/library.jar"
}
]
}
}
}
{
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"egg": "dbfs:/mnt/libraries/library.egg"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2",
"exclusions": [ "slf4j:slf4j" ]
}
},
{
"pypi": {
"package": "simplejson",
"repo": "http://my-pypi-mirror.com"
}
},
{
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
}
}
]
}
For more details, see the Databricks documentation for library types.
The Azure Databricks Jar Activity in a Data Factory pipeline runs a Spark Jar in your Azure Databricks cluster. This
article builds on the data transformation activities article, which presents a general overview of data
transformation and the supported transformation activities. Azure Databricks is a managed platform for running
Apache Spark.
For an eleven-minute introduction and demonstration of this feature, watch the following video:
{
"name": "SparkJarActivity",
"type": "DatabricksSparkJar",
"linkedServiceName": {
"referenceName": "AzureDatabricks",
"type": "LinkedServiceReference"
},
"typeProperties": {
"mainClassName": "org.apache.spark.examples.SparkPi",
"parameters": [ "10" ],
"libraries": [
{
"jar": "dbfs:/docs/sparkpi.jar"
}
]
}
}
libraries: A list of libraries to be installed on the cluster that will execute the job. It can be an array of
<string, object>. Required: yes (at least one containing the mainClassName method).
{
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"egg": "dbfs:/mnt/libraries/library.egg"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2",
"exclusions": [ "slf4j:slf4j" ]
}
},
{
"pypi": {
"package": "simplejson",
"repo": "http://my-pypi-mirror.com"
}
},
{
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
}
}
]
}
The Azure Databricks Python Activity in a Data Factory pipeline runs a Python file in your Azure Databricks cluster.
This article builds on the data transformation activities article, which presents a general overview of data
transformation and the supported transformation activities. Azure Databricks is a managed platform for running
Apache Spark.
For an eleven-minute introduction and demonstration of this feature, watch the following video:
{
"activity": {
"name": "MyActivity",
"description": "MyActivity description",
"type": "DatabricksSparkPython",
"linkedServiceName": {
"referenceName": "MyDatabricksLinkedservice",
"type": "LinkedServiceReference"
},
"typeProperties": {
"pythonFile": "dbfs:/docs/pi.py",
"parameters": [
"10"
],
"libraries": [
{
"pypi": {
"package": "tensorflow"
}
}
]
}
}
}
{
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"egg": "dbfs:/mnt/libraries/library.egg"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2",
"exclusions": [ "slf4j:slf4j" ]
}
},
{
"pypi": {
"package": "simplejson",
"repo": "http://my-pypi-mirror.com"
}
},
{
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
}
}
]
}
There are two types of activities that you can use in an Azure Data Factory pipeline.
Data movement activities to move data between supported source and sink data stores.
Data transformation activities to transform data using compute services such as Azure HDInsight, Azure
Batch, and Azure Machine Learning.
To move data to/from a data store that Data Factory does not support, or to transform/process data in a way
that isn't supported by Data Factory, you can create a Custom activity with your own data movement or
transformation logic and use the activity in a pipeline. The custom activity runs your customized code logic on
an Azure Batch pool of virtual machines.
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which
will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.
To learn more about Azure Batch linked service, see Compute linked services article.
Custom activity
The following JSON snippet defines a pipeline with a simple Custom Activity. The activity definition has a
reference to the Azure Batch linked service.
{
"name": "MyCustomActivityPipeline",
"properties": {
"description": "Custom activity sample",
"activities": [{
"type": "Custom",
"name": "MyCustomActivity",
"linkedServiceName": {
"referenceName": "AzureBatchLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"command": "helloworld.exe",
"folderPath": "customactv2/helloworld",
"resourceLinkedService": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
}
}
}]
}
}
In this sample, the helloworld.exe is a custom application stored in the customactv2/helloworld folder of the
Azure Storage account used in the resourceLinkedService. The Custom activity submits this custom application
to be executed on Azure Batch. You can replace the command with any preferred application that can be executed
on the target operating system of the Azure Batch pool nodes.
The following table describes names and descriptions of properties that are specific to this activity.
* The properties resourceLinkedService and folderPath must either both be specified or both be omitted.
NOTE
If you are passing linked services as referenceObjects in Custom Activity, it is a good security practice to pass an Azure
Key Vault enabled linked service (since it does not contain any secure strings) and fetch the credentials using secret name
directly from Key Vault from the code. You can find an example here that references AKV enabled linked service, retrieves
the credentials from Key Vault, and then accesses the storage in the code.
Executing commands
You can directly execute a command using Custom Activity. The following example runs the "echo hello world"
command on the target Azure Batch Pool nodes and prints the output to stdout.
{
"name": "MyCustomActivity",
"properties": {
"description": "Custom activity sample",
"activities": [{
"type": "Custom",
"name": "MyCustomActivity",
"linkedServiceName": {
"referenceName": "AzureBatchLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"command": "cmd /c echo hello world"
}
}]
}
}
When the activity is executed, referenceObjects and extendedProperties are stored in the following files that are
deployed to the same execution folder as SampleApp.exe:
activity.json
linkedServices.json
datasets.json
using System;
using System.IO;
using Newtonsoft.Json;

namespace SampleApp
{
class Program
{
static void Main(string[] args)
{
//From Extend Properties
dynamic activity = JsonConvert.DeserializeObject(File.ReadAllText("activity.json"));
Console.WriteLine(activity.typeProperties.extendedProperties.connectionString.value);
// From LinkedServices
dynamic linkedServices =
JsonConvert.DeserializeObject(File.ReadAllText("linkedServices.json"));
Console.WriteLine(linkedServices[0].properties.typeProperties.accountName);
}
}
}
When the pipeline is running, you can check the execution output using the following commands:
while ($True) {
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName
$resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore
(Get-Date).AddMinutes(30)
if(!$result) {
Write-Host "Waiting for pipeline to start..." -foregroundcolor "Yellow"
}
elseif (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
}
else {
Write-Host "Pipeline '"$pipelineName"' run finished. Result:" -foregroundcolor "Yellow"
$result
break
}
($result | Format-List | Out-String)
Start-Sleep -Seconds 15
}
The stdout and stderr of your custom application are saved to the adfjobs container in the Azure Storage
Linked Service that you defined when creating the Azure Batch Linked Service, under a folder named with the
GUID of the task. You can get the detailed path from the Activity Run output as shown in the following snippet:
Pipeline ' MyCustomActivity' run finished. Result:
ResourceGroupName : resourcegroupname
DataFactoryName : datafactoryname
ActivityName : MyCustomActivity
PipelineRunId : xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
PipelineName : MyCustomActivity
Input : {command}
Output : {exitcode, outputs, effectiveIntegrationRuntime}
LinkedServiceName :
ActivityRunStart : 10/5/2017 3:33:06 PM
ActivityRunEnd : 10/5/2017 3:33:28 PM
DurationInMs : 21203
Status : Succeeded
Error : {errorCode, message, failureType, target}
If you would like to consume the content of stdout.txt in downstream activities, you can get the path to the
stdout.txt file in expression "@activity('MyCustomActivity').output.outputs[0]".
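For example, the following is a minimal sketch of a downstream Set Variable activity that captures that path; the
StdoutPath variable is hypothetical and must be declared on the pipeline:
{
    "name": "CaptureStdoutPath",
    "type": "SetVariable",
    "dependsOn": [
        {
            "activity": "MyCustomActivity",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "variableName": "StdoutPath",
        "value": "@activity('MyCustomActivity').output.outputs[0]"
    }
}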
IMPORTANT
The activity.json, linkedServices.json, and datasets.json are stored in the runtime folder of the Batch task. For this
example, the activity.json, linkedServices.json, and datasets.json are stored in
"https://adfv2storage.blob.core.windows.net/adfjobs/<GUID>/runtime/" path. If needed, you need to clean them up
separately.
For Linked Services that use the Self-Hosted Integration Runtime, sensitive information like keys or passwords is
encrypted by the Self-Hosted Integration Runtime to ensure the credentials stay in the customer-defined private network
environment. Some sensitive fields could be missing when referenced by your custom application code in this way. Use a
SecureString in extendedProperties instead of a Linked Service reference if needed.
This serialization is not truly secure, and is not intended to be secure. The intent is to hint to Data Factory to
mask the value in the Monitoring tab.
To access properties of type SecureString from a custom activity, read the activity.json file, which is placed in
the same folder as your .EXE, deserialize the JSON, and then access the JSON property (extendedProperties
=> [propertyName] => value).
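A minimal sketch of passing such a value follows; the connection string key mirrors the SampleApp code shown
earlier, while the folder path and the value itself are placeholders:
"typeProperties": {
    "command": "SampleApp.exe",
    "folderPath": "customactv2/SampleApp",
    "resourceLinkedService": {
        "referenceName": "StorageLinkedService",
        "type": "LinkedServiceReference"
    },
    "extendedProperties": {
        "connectionString": {
            "type": "SecureString",
            "value": "<connection string kept out of plain text>"
        }
    }
}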
Execution environment of the custom logic: Windows or Linux for the Custom Activity, versus Windows (.NET
Framework 4.5.2) for the version 1 (Custom) DotNet Activity.
Executing scripts: The Custom Activity supports executing scripts directly (for example, "cmd /c echo hello world" on a
Windows VM), whereas the version 1 activity requires an implementation in the .NET DLL.
Retrieve information in custom logic: The Custom Activity parses activity.json, linkedServices.json, and datasets.json
stored in the same folder as the executable, whereas the version 1 activity retrieves it through the .NET SDK (.NET
Framework 4.5.2).
If you have existing .NET code written for a version 1 (Custom) DotNet Activity, you need to modify your code
for it to work with the current version of the Custom Activity. Update your code by following these high-level
guidelines:
Change the project from a .NET Class Library to a Console App.
Start your application with the Main method. The Execute method of the IDotNetActivity interface is no
longer required.
Read and parse the Linked Services, Datasets and Activity with a JSON serializer, and not as strongly-typed
objects. Pass the values of required properties to your main custom code logic. Refer to the preceding
SampleApp.exe code as an example.
The Logger object is no longer supported. Output from your executable can be printed to the console and is
saved to stdout.txt.
The Microsoft.Azure.Management.DataFactories NuGet package is no longer required.
Compile your code, upload the executable and its dependencies to Azure Storage, and define the path in the
folderPath property.
For a complete sample of how the end-to-end DLL and pipeline sample described in the Data Factory version 1
article Use custom activities in an Azure Data Factory pipeline can be rewritten as a Data Factory Custom
Activity, see Data Factory Custom Activity sample.
startingNumberOfVMs = 1;
maxNumberofVMs = 25;
pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180
* TimeInterval_Second));
$TargetDedicated=min(maxNumberofVMs,pendingTaskSamples);
See Automatically scale compute nodes in an Azure Batch pool for details.
If the pool is using the default autoScaleEvaluationInterval, the Batch service could take 15-30 minutes to
prepare the VM before running the custom activity. If the pool is using a different autoScaleEvaluationInterval,
the Batch service could take autoScaleEvaluationInterval + 10 minutes.
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
Machine Learning Batch Execution activity
Stored procedure activity
Compute environments supported by Azure Data
Factory
6/3/2019 • 19 minutes to read • Edit Online
This article explains different compute environments that you can use to process or transform data. It also
provides details about different configurations (on-demand vs. bring your own) supported by Data Factory
when configuring linked services linking these compute environments to an Azure data factory.
The following table provides a list of compute environments supported by Data Factory and the activities that
can run on them.
On-demand HDInsight cluster or your own HDInsight cluster: Hive, Pig, Spark, MapReduce, Hadoop Streaming
Azure Machine Learning: Machine Learning activities (Batch Execution and Update Resource)
Azure SQL, Azure SQL Data Warehouse, SQL Server: Stored Procedure
NOTE
The on-demand configuration is currently supported only for Azure HDInsight clusters. Azure Databricks also supports
on-demand jobs using job clusters; refer to the Azure Databricks linked service for more details.
IMPORTANT
It typically takes 20 minutes or more to provision an Azure HDInsight cluster on demand.
Example
The following JSON defines a Linux-based on-demand HDInsight linked service. The Data Factory service
automatically creates a Linux-based HDInsight cluster to process the required activity.
{
"name": "HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"clusterType": "hadoop",
"clusterSize": 1,
"timeToLive": "00:15:00",
"hostSubscriptionId": "<subscription ID>",
"servicePrincipalId": "<service principal ID>",
"servicePrincipalKey": {
"value": "<service principal key>",
"type": "SecureString"
},
"tenant": "<tenent id>",
"clusterResourceGroup": "<resource group name>",
"version": "3.6",
"osType": "Linux",
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
IMPORTANT
The HDInsight cluster creates a default container in the blob storage you specified in the JSON
(linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is by design.
With the on-demand HDInsight linked service, an HDInsight cluster is created every time a slice needs to be processed
unless there is an existing live cluster (timeToLive), and the cluster is deleted when the processing is done.
As more activity runs, you see many containers in your Azure blob storage. If you do not need them for
troubleshooting of the jobs, you may want to delete them to reduce the storage cost. The names of these containers
follow a pattern: adf<yourdatafactoryname>-<linkedservicename>-<datetimestamp>. Use tools such as Microsoft
Storage Explorer to delete containers in your Azure blob storage.
Properties
PROPERTY DESCRIPTION REQUIRED
IMPORTANT
Currently, the HDInsight linked service does not support HBase, Interactive Query (Hive LLAP), or Storm.
"additionalLinkedServiceNames": [{
"referenceName": "MyStorageLinkedService2",
"type": "LinkedServiceReference"
}]
Advanced Properties
You can also specify the following properties for the granular configuration of the on-demand HDInsight
cluster.
Node sizes
You can specify the sizes of head, data, and zookeeper nodes using the following properties:
"headNodeSize": "Standard_D4",
"dataNodeSize": "Standard_D4",
If you specify a wrong value for these properties, you may receive the following error: Failed to create cluster.
Exception: Unable to complete the cluster create operation. Operation failed with code '400'. Cluster left
behind state: 'Error'. Message: 'PreClusterCreationValidationFailure'. When you receive this error, ensure that
you are using the CMDLET & APIS name from the table in the Sizes of Virtual Machines article.
Properties
PROPERTY DESCRIPTION REQUIRED
IMPORTANT
HDInsight supports multiple Hadoop cluster versions that can be deployed. Each version choice creates a specific
version of the Hortonworks Data Platform (HDP) distribution and a set of components that are contained within that
distribution. The list of supported HDInsight versions keeps being updated to provide the latest Hadoop ecosystem
components and fixes. Make sure you always refer to the latest information about the supported HDInsight versions and
OS types to ensure you are using a supported version of HDInsight.
IMPORTANT
Currently, the HDInsight linked service does not support HBase, Interactive Query (Hive LLAP), or Storm.
You can create an Azure Batch linked service to register a Batch pool of virtual machines (VMs) to a data
factory. You can run Custom activity using Azure Batch.
See the following topics if you are new to the Azure Batch service:
Azure Batch basics for an overview of the Azure Batch service.
New-AzBatchAccount cmdlet to create an Azure Batch account (or) Azure portal to create the Azure Batch
account using Azure portal. See Using PowerShell to manage Azure Batch Account topic for detailed
instructions on using the cmdlet.
New-AzBatchPool cmdlet to create an Azure Batch pool.
Example
{
"name": "AzureBatchLinkedService",
"properties": {
"type": "AzureBatch",
"typeProperties": {
"accountName": "batchaccount",
"accessKey": {
"type": "SecureString",
"value": "access key"
},
"batchUri": "https://batchaccount.region.batch.azure.com",
"poolName": "poolname",
"linkedServiceName": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Properties
PROPERTY DESCRIPTION REQUIRED
{
"name": "AzureMLLinkedService",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://[batch scoring endpoint]/jobs",
"apiKey": {
"type": "SecureString",
"value": "access key"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Properties
PROPERTY DESCRIPTION REQUIRED
{
"name": "AzureDataLakeAnalyticsLinkedService",
"properties": {
"type": "AzureDataLakeAnalytics",
"typeProperties": {
"accountName": "adftestaccount",
"dataLakeAnalyticsUri": "azuredatalakeanalytics URI",
"servicePrincipalId": "service principal id",
"servicePrincipalKey": {
"value": "service principal key",
"type": "SecureString"
},
"tenant": "tenant ID",
"subscriptionId": "<optional, subscription id of ADLA>",
"resourceGroupName": "<optional, resource group name of ADLA>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Properties
PROPERTY DESCRIPTION REQUIRED
{
"name": "AzureDatabricks_LS",
"properties": {
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://eastus.azuredatabricks.net",
"newClusterNodeType": "Standard_D3_v2",
"newClusterNumOfWorker": "1:10",
"newClusterVersion": "4.0.x-scala2.11",
"accessToken": {
"type": "SecureString",
"value": "dapif33c9c721144c3a790b35000b57f7124f"
}
}
}
}
{
"name": "AzureDatabricksLinkedService",
"properties": {
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://westeurope.azuredatabricks.net",
"accessToken": {
"type": "SecureString",
"value": "dapif33c9c72344c3a790b35000b57f7124f"
},
"existingClusterId": "{clusterId}"
}
}
}
Properties
PROPERTY DESCRIPTION REQUIRED
Next steps
For a list of the transformation activities supported by Azure Data Factory, see Transform data.
Append Variable Activity in Azure Data Factory
3/7/2019 • 2 minutes to read • Edit Online
Use the Append Variable activity to add a value to an existing array variable defined in a Data Factory pipeline.
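As a minimal sketch (the activity name, variable name, and appended value are illustrative), an Append Variable activity looks like this:
{
  "name": "AppendValueToArray",
  "type": "AppendVariable",
  "typeProperties": {
    "variableName": "myArrayVariable",
    "value": "@pipeline().parameters.newItem"
  }
}
The variable being appended to must already be declared in the pipeline as an Array variable.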
Type properties
PROPERTY DESCRIPTION REQUIRED
Next steps
Learn about a related control flow activity supported by Data Factory:
Set Variable Activity
Azure Function activity in Azure Data Factory
4/24/2019 • 2 minutes to read • Edit Online
The Azure Function activity allows you to run Azure Functions in a Data Factory pipeline. To run an Azure Function,
you need to create a linked service connection and an activity that specifies the Azure Function that you plan to
execute.
For an eight-minute introduction and demonstration of this feature, watch the following video:
PROPERTY | DESCRIPTION | REQUIRED
function app url | URL for the Azure Function App. The format is https://<accountname>.azurewebsites.net. This URL is the value under the URL section when viewing your Function App in the Azure portal. | yes
function key | Access key for the Azure Function. Click on the Manage section for the respective function, and copy either the Function Key or the Host key. Find out more here: Azure Functions HTTP triggers and bindings. | yes
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
linked service | The Azure Function linked service for the corresponding Azure Function App | Linked service reference | yes
method | REST API method for the function call | String. Supported types: "GET", "POST", "PUT" | yes
header | Headers that are sent to the request. For example, to set the language and type on a request: "headers": { "Accept-Language": "en-us", "Content-Type": "application/json" } | String (or expression with resultType of string) | No
body | Body that is sent along with the request to the function API method. | String (or expression with resultType of string) or object. | Required for PUT/POST methods
See the schema of the request payload in Request payload schema section.
The Azure Function Activity also supports queries. A query has to be included as part of the functionName . For
example, when the function name is HttpTriggerCSharp and the query that you want to include is name=hello , then
you can construct the functionName in the Azure Function Activity as HttpTriggerCSharp?name=hello . This function
can be parameterized so the value can be determined at runtime.
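Putting these properties together, a minimal Azure Function activity definition might look like the following sketch; the activity and linked service names are illustrative, and the functionName carries the query from the example above:
{
  "name": "MyAzureFunctionActivity",
  "type": "AzureFunctionActivity",
  "linkedServiceName": {
    "referenceName": "AzureFunctionLinkedService",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "functionName": "HttpTriggerCSharp?name=hello",
    "method": "GET"
  }
}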
Next steps
Learn more about activities in Data Factory in Pipelines and activities in Azure Data Factory.
Execute data flow activity in Azure Data Factory
5/23/2019 • 2 minutes to read • Edit Online
Use the execute data flow activity to run your ADF data flow in pipeline debug (sandbox) runs and in pipeline
triggered runs.
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Syntax
{
"name": "MyDataFlowActivity",
"type": "ExecuteDataFlow",
"typeProperties": {
"dataflow": {
"referenceName": "dataflow1",
"type": "DataFlowReference"
},
"compute": {
"computeType": "General",
"coreCount": 8,
}
}
Type properties
dataflow is the name of the data flow entity that you wish to execute
compute describes the Spark execution environment
coreCount is the number of cores to assign to this activity execution of your data flow
Debugging pipelines with data flows
Use the Data Flow Debug to utilize a warmed cluster for testing your data flows interactively in a pipeline debug
run. Use the Pipeline Debug option to test your data flows inside a pipeline.
Run on
This is a required field that defines which Integration Runtime to use for your Data Flow activity execution. By
default, Data Factory will use the default auto-resolve Azure Integration runtime. However, you can create your
own Azure Integration Runtimes that define specific regions, compute type, core counts, and TTL for your data
flow activity execution.
The default setting for Data Flow executions is 8 cores of general compute with a TTL of 60 minutes.
Choose the compute environment for this execution of your data flow. The default is the Azure Auto-Resolve
Default Integration Runtime. This choice will execute the data flow on the Spark environment in the same region
as your data factory. The compute type will be a job cluster, which means the compute environment will take
several minutes to start-up.
You have control over the Spark execution environment for your Data Flow activities. In the Azure integration
runtime are settings to set the compute type (general purpose, memory optimized, and compute optimized),
number of worker cores, and time-to-live to match the execution engine with your Data Flow compute
requirements. Also, setting TTL will allow you to maintain a warm cluster that is immediately available for job
executions.
NOTE
The Integration Runtime selection in the Data Flow activity only applies to triggered executions of your pipeline. Debugging
your pipeline with Data Flows with Debug will execute against the 8-core default Spark cluster.
Staging area
If you are sinking your data into Azure SQL Data Warehouse, you must choose a staging location for your PolyBase
batch load.
Parameterized datasets
If you are using parameterized datasets, be sure to set the parameter values.
Debugging parameterized data flows
You can only debug data flows with parameterized datasets from the Pipeline Debug run using the execute data
flow activity. Currently, interactive debug sessions in ADF Data Flow do not work with parameterized data sets.
Pipeline executions and debug runs will work with parameters.
A good practice is to build your data flow with a static dataset so that you have full metadata column propagation
available at design-time. Then replace the static dataset with a dynamic parameterized dataset when you
operationalize your data flow pipeline.
Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
Execute Pipeline activity in Azure Data Factory
3/14/2019 • 2 minutes to read • Edit Online
The Execute Pipeline activity allows a Data Factory pipeline to invoke another pipeline.
Syntax
{
"name": "MyPipeline",
"properties": {
"activities": [
{
"name": "ExecutePipelineActivity",
"type": "ExecutePipeline",
"typeProperties": {
"parameters": {
"mySourceDatasetFolderPath": {
"value": "@pipeline().parameters.mySourceDatasetFolderPath",
"type": "Expression"
}
},
"pipeline": {
"referenceName": "<InvokedPipelineName>",
"type": "PipelineReference"
},
"waitOnCompletion": true
}
}
],
"parameters": [
{
"mySourceDatasetFolderPath": {
"type": "String"
}
}
]
}
}
Type properties
PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED
Sample
This scenario has two pipelines:
Master pipeline - This pipeline has one Execute Pipeline activity that calls the invoked pipeline. The master
pipeline takes two parameters: masterSourceBlobContainer , masterSinkBlobContainer .
Invoked pipeline - This pipeline has one Copy activity that copies data from an Azure Blob source to Azure
Blob sink. The invoked pipeline takes two parameters: sourceBlobContainer , sinkBlobContainer .
Master pipeline definition
{
"name": "masterPipeline",
"properties": {
"activities": [
{
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "invokedPipeline",
"type": "PipelineReference"
},
"parameters": {
"sourceBlobContainer": {
"value": "@pipeline().parameters.masterSourceBlobContainer",
"type": "Expression"
},
"sinkBlobContainer": {
"value": "@pipeline().parameters.masterSinkBlobContainer",
"type": "Expression"
}
},
"waitOnCompletion": true
},
"name": "MyExecutePipelineActivity"
}
],
"parameters": {
"masterSourceBlobContainer": {
"type": "String"
},
"masterSinkBlobContainer": {
"type": "String"
}
}
}
}
Linked service
{
"name": "BlobStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=*****",
"type": "SecureString"
}
}
}
}
Source dataset
{
"name": "SourceBlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@pipeline().parameters.sourceBlobContainer",
"type": "Expression"
},
"fileName": "salesforce.txt"
},
"linkedServiceName": {
"referenceName": "BlobStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
Sink dataset
{
"name": "sinkBlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@pipeline().parameters.sinkBlobContainer",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "BlobStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
The following parameter values are supplied to the master pipeline:
{
"masterSourceBlobContainer": "executetest",
"masterSinkBlobContainer": "executesink"
}
The master pipeline forwards these values to the invoked pipeline as shown in the following example:
{
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "invokedPipeline",
"type": "PipelineReference"
},
"parameters": {
"sourceBlobContainer": {
"value": "@pipeline().parameters.masterSourceBlobContainer",
"type": "Expression"
},
"sinkBlobContainer": {
"value": "@pipeline().parameters.masterSinkBlobContainer",
"type": "Expression"
}
},
....
}
Next steps
See other control flow activities supported by Data Factory:
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Filter activity in Azure Data Factory
1/3/2019 • 2 minutes to read • Edit Online
You can use a Filter activity in a pipeline to apply a filter expression to an input array.
Syntax
{
"name": "MyFilterActivity",
"type": "filter",
"typeProperties": {
"condition": "<condition>",
"items": "<input array>"
}
}
Type properties
PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED
Example
In this example, the pipeline has two activities: Filter and ForEach. The Filter activity is configured to filter the
input array for items with a value greater than 3. The ForEach activity then iterates over the filtered values and
waits for the number of seconds specified by the current value.
{
"name": "PipelineName",
"properties": {
"activities": [{
"name": "MyFilterActivity",
"type": "filter",
"typeProperties": {
"condition": "@greater(item(),3)",
"items": "@pipeline().parameters.inputs"
}
},
{
"name": "MyForEach",
"type": "ForEach",
"typeProperties": {
"isSequential": "false",
"batchCount": 1,
"items": "@activity('MyFilterActivity').output.value",
"activities": [{
"type": "Wait",
"typeProperties": {
"waitTimeInSeconds": "@item()"
},
"name": "MyWaitActivity"
}]
},
"dependsOn": [{
"activity": "MyFilterActivity",
"dependencyConditions": ["Succeeded"]
}]
}
],
"parameters": {
"inputs": {
"type": "Array",
"defaultValue": [1, 2, 3, 4, 5, 6]
}
}
}
}
Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
ForEach activity in Azure Data Factory
3/15/2019 • 5 minutes to read • Edit Online
The ForEach Activity defines a repeating control flow in your pipeline. This activity is used to iterate over a
collection and execute the specified activities in a loop. The loop implementation of this activity is similar to the
Foreach looping structure in programming languages.
Syntax
The properties are described later in this article. The items property is the collection and each item in the
collection is referred to by using the @item() as shown in the following syntax:
{
"name":"MyForEachActivityName",
"type":"ForEach",
"typeProperties":{
"isSequential":"true",
"items": {
"value": "@pipeline().parameters.mySinkDatasetFolderPathCollection",
"type": "Expression"
},
"activities":[
{
"name":"MyCopyActivity",
"type":"Copy",
"typeProperties":{
...
},
"inputs":[
{
"referenceName":"MyDataset",
"type":"DatasetReference",
"parameters":{
"MyFolderPath":"@pipeline().parameters.mySourceDatasetFolderPath"
}
}
],
"outputs":[
{
"referenceName":"MyDataset",
"type":"DatasetReference",
"parameters":{
"MyFolderPath":"@item()"
}
}
]
}
]
}
}
Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
isSequential | If "isSequential" is set to False, ensure that there is a correct configuration to run multiple executables. Otherwise, this property should be used with caution to avoid incurring write conflicts. For more information, see the Parallel execution section.
batchCount | Batch count to be used for controlling the number of parallel executions (when isSequential is set to false). | Integer (maximum 50) | No. Default is 20.
Parallel execution
If isSequential is set to false, the activity iterates in parallel with a maximum of 20 concurrent iterations. This
setting should be used with caution. If the concurrent iterations are writing to the same folder but to different
files, this approach is fine. If the concurrent iterations are writing concurrently to the exact same file, this
approach most likely causes an error.
{
"mySourceDatasetFolderPath": "input/",
"mySinkDatasetFolderPath": [ "outputs/file1", "outputs/file2" ]
}
Example
Scenario: Iterate over an inner pipeline within a ForEach activity by using the Execute Pipeline activity. The inner
pipeline copies data with schema definitions parameterized.
Master Pipeline definition
{
"name": "masterPipeline",
"properties": {
"activities": [
{
"type": "ForEach",
"name": "MyForEachActivity",
"typeProperties": {
"isSequential": true,
"items": {
"value": "@pipeline().parameters.inputtables",
"type": "Expression"
},
"activities": [
{
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "InnerCopyPipeline",
"type": "PipelineReference"
},
"parameters": {
"sourceTableName": {
"value": "@item().SourceTable",
"type": "Expression"
},
"sourceTableStructure": {
"value": "@item().SourceTableStructure",
"type": "Expression"
},
"sinkTableName": {
"value": "@item().DestTable",
"type": "Expression"
},
"sinkTableStructure": {
"value": "@item().DestTableStructure",
"type": "Expression"
}
},
"waitOnCompletion": true
},
"name": "ExecuteCopyPipeline"
}
]
}
}
],
"parameters": {
"inputtables": {
"type": "Array"
}
}
}
}
Inner pipeline definition
{
"name": "InnerCopyPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"type": "SqlSource",
}
},
"sink": {
"type": "SqlSink"
}
},
"name": "CopyActivity",
"inputs": [
{
"referenceName": "sqlSourceDataset",
"parameters": {
"SqlTableName": {
"value": "@pipeline().parameters.sourceTableName",
"type": "Expression"
},
"SqlTableStructure": {
"value": "@pipeline().parameters.sourceTableStructure",
"type": "Expression"
}
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "sqlSinkDataset",
"parameters": {
"SqlTableName": {
"value": "@pipeline().parameters.sinkTableName",
"type": "Expression"
},
"SqlTableStructure": {
"value": "@pipeline().parameters.sinkTableStructure",
"type": "Expression"
}
},
"type": "DatasetReference"
}
]
}
],
"parameters": {
"sourceTableName": {
"type": "String"
},
"sourceTableStructure": {
"type": "String"
},
"sinkTableName": {
"type": "String"
},
"sinkTableStructure": {
"type": "String"
}
}
}
}
{
"name": "sqlSinkDataSet",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": {
"value": "@dataset().SqlTableName",
"type": "Expression"
}
},
"structure": {
"value": "@dataset().SqlTableStructure",
"type": "Expression"
},
"linkedServiceName": {
"referenceName": "azureSqlLS",
"type": "LinkedServiceReference"
},
"parameters": {
"SqlTableName": {
"type": "String"
},
"SqlTableStructure": {
"type": "String"
}
}
}
}
Aggregating outputs
To aggregate outputs of the ForEach activity, use Variables and the Append Variable activity.
First, declare an array variable in the pipeline. Then, invoke the Append Variable activity inside each ForEach loop.
Subsequently, you can retrieve the aggregation from your array.
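A minimal sketch of this pattern follows; the pipeline, variable, parameter, and activity names are illustrative assumptions:
{
  "name": "AggregateForeachOutputs",
  "properties": {
    "variables": {
      "aggregatedValues": {
        "type": "Array"
      }
    },
    "parameters": {
      "inputs": {
        "type": "Array"
      }
    },
    "activities": [
      {
        "name": "MyForEachActivity",
        "type": "ForEach",
        "typeProperties": {
          "isSequential": true,
          "items": {
            "value": "@pipeline().parameters.inputs",
            "type": "Expression"
          },
          "activities": [
            {
              "name": "AppendToAggregate",
              "type": "AppendVariable",
              "typeProperties": {
                "variableName": "aggregatedValues",
                "value": "@item()"
              }
            }
          ]
        }
      }
    ]
  }
}
Running the loop sequentially keeps the appends from racing against one another; after the ForEach completes, the aggregated array is available in the aggregatedValues variable.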
LIMITATION | WORKAROUND
You can't nest a ForEach loop inside another ForEach loop (or an Until loop). | Design a two-level pipeline where the outer pipeline with the outer ForEach loop iterates over an inner pipeline with the nested loop.
The ForEach activity has a maximum batchCount of 50 for parallel processing, and a maximum of 100,000 items. | Design a two-level pipeline where the outer pipeline with the ForEach activity iterates over an inner pipeline.
Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline Activity
Get Metadata Activity
Lookup Activity
Web Activity
Get metadata activity in Azure Data Factory
3/11/2019 • 4 minutes to read • Edit Online
GetMetadata activity can be used to retrieve metadata of any data in Azure Data Factory. This activity can be
used in the following scenarios:
Validate the metadata information of any data
Trigger a pipeline when data is ready/available
The following functionality is available in the control flow:
The output from GetMetadata Activity can be used in conditional expressions to perform validation.
A pipeline can be triggered when a condition is satisfied via Do-Until looping
Supported capabilities
The GetMetadata Activity takes a dataset as a required input, and outputs metadata information available as
activity output. Currently, the following connectors with corresponding retrievable metadata are supported,
and the maximum supported metadata size is up to 1MB.
NOTE
If you run GetMetadata activity on a Self-hosted Integration Runtime, the latest capability is supported on version 3.6
or above.
Supported connectors
File storage:
CONNECTOR/METADATA | ITEMNAME (FILE/FOLDER) | ITEMTYPE (FILE/FOLDER) | SIZE (FILE) | CREATED (FILE/FOLDER) | LASTMODIFIED (FILE/FOLDER) | CHILDITEMS (FOLDER) | CONTENTMD5 (FILE) | STRUCTURE (FILE) | COLUMNCOUNT (FILE) | EXISTS (FILE/FOLDER)
For Amazon S3 and Google Cloud Storage, the lastModified applies to the bucket and the key but not to the virtual
folder, and the exists applies to the bucket and the key but not to the prefix or virtual folder.
For Azure Blob, the lastModified applies to the container and the blob but not to the virtual folder.
Relational database:
SQL Server √ √ √
Metadata options
The following metadata types can be specified in the GetMetadata activity field list to retrieve:
TIP
When you want to validate if a file/folder/table exists or not, specify exists in the GetMetadata activity field list, then
you can check the exists: true/false result from the activity output. If exists is not configured in the field list,
the GetMetadata activity will fail when the object is not found.
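For example, combining the exists field with an Until loop gives the "trigger a pipeline when data is ready" pattern mentioned earlier. The following is a sketch only; the activity names, dataset reference, and one-hour timeout are illustrative assumptions:
{
  "name": "WaitForFile",
  "type": "Until",
  "typeProperties": {
    "expression": {
      "value": "@activity('CheckFileExists').output.exists",
      "type": "Expression"
    },
    "timeout": "01:00:00",
    "activities": [
      {
        "name": "CheckFileExists",
        "type": "GetMetadata",
        "typeProperties": {
          "fieldList": [ "exists" ],
          "dataset": {
            "referenceName": "MyDataset",
            "type": "DatasetReference"
          }
        }
      }
    ]
  }
}
In practice you would typically add a Wait activity inside the loop so that the metadata check is not issued back to back.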
Syntax
GetMetadata activity:
{
"name": "MyActivity",
"type": "GetMetadata",
"typeProperties": {
"fieldList" : ["size", "lastModified", "structure"],
"dataset": {
"referenceName": "MyDataset",
"type": "DatasetReference"
}
}
}
Dataset:
{
"name": "MyDataset",
"properties": {
"type": "AzureBlob",
"linkedService": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath":"container/folder",
"filename": "file.json",
"format":{
"type":"JsonFormat"
}
}
}
}
Type properties
Currently GetMetadata activity can fetch the following types of metadata information.
Sample output
The GetMetadata result is shown in the activity output. The following is a sample with exhaustive metadata options
selected in the field list, for reference. To use the result in a subsequent activity, use the pattern of
@{activity('MyGetMetadataActivity').output.itemName} .
{
"exists": true,
"itemName": "testFolder",
"itemType": "Folder",
"lastModified": "2017-02-23T06:17:09Z",
"created": "2017-02-23T06:17:09Z",
"childItems": [
{
"name": "test.avro",
"type": "File"
},
{
"name": "folder hello",
"type": "Folder"
}
]
}
Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline Activity
For Each Activity
Lookup Activity
Web Activity
If Condition activity in Azure Data Factory
3/5/2019 • 3 minutes to read • Edit Online
The If Condition activity provides the same functionality that an if statement provides in programming languages.
It executes a set of activities when the condition evaluates to true and another set of activities when the
condition evaluates to false .
Syntax
{
"name": "<Name of the activity>",
"type": "IfCondition",
"typeProperties": {
"expression": {
"value": "<expression that evaluates to true or false>",
"type": "Expression"
},
"ifTrueActivities": [
{
"<Activity 1 definition>"
},
{
"<Activity 2 definition>"
},
{
"<Activity N definition>"
}
],
"ifFalseActivities": [
{
"<Activity 1 definition>"
},
{
"<Activity 2 definition>"
},
{
"<Activity N definition>"
}
]
}
}
Type properties
PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED
Example
The pipeline in this example copies data from an input folder to an output folder. The output folder is determined
by the value of pipeline parameter: routeSelection. If the value of routeSelection is true, the data is copied to
outputPath1. And, if the value of routeSelection is false, the data is copied to outputPath2.
NOTE
This section provides JSON definitions and sample PowerShell commands to run the pipeline. For a walkthrough with step-
by-step instructions to create a Data Factory pipeline by using Azure PowerShell and JSON definitions, see tutorial: create a
data factory by using Azure PowerShell.
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "MyIfCondition",
"type": "IfCondition",
"typeProperties": {
"expression": {
"value": "@bool(pipeline().parameters.routeSelection)",
"type": "Expression"
},
"ifTrueActivities": [
{
"name": "CopyFromBlobToBlob1",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath1"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
],
"ifFalseActivities": [
{
"name": "CopyFromBlobToBlob2",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath2"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
]
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath1": {
"type": "String"
},
"outputPath2": {
"type": "String"
},
"routeSelection": {
"type": "String"
}
}
}
}
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<Azure Storage account name>;AccountKey=
<Azure Storage account key>",
"type": "SecureString"
}
}
}
}
{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}
{
"inputPath": "adftutorial/input",
"outputPath1": "adftutorial/outputIf",
"outputPath2": "adftutorial/outputElse",
"routeSelection": "false"
}
PowerShell commands
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.
These commands assume that you have saved the JSON files into the folder: C:\ADF.
Connect-AzAccount
Select-AzSubscription "<Your subscription name>"
while ($True) {
$run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineRunId $runId
if ($run) {
if ($run.Status -ne 'InProgress') {
Write-Host "Pipeline run finished. The status is: " $run.Status -foregroundcolor "Yellow"
$run
break
}
Write-Host "Pipeline is running...status: InProgress" -foregroundcolor "Yellow"
}
Start-Sleep -Seconds 30
}
Write-Host "Activity run details:" -foregroundcolor "Yellow"
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
$result
Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Lookup activity in Azure Data Factory
3/15/2019 • 6 minutes to read • Edit Online
Lookup activity can retrieve a dataset from any of the Azure Data Factory-supported data sources. Use it in
the following scenario:
Dynamically determine which objects to operate on in a subsequent activity, instead of hard coding the
object name. Some object examples are files and tables.
Lookup activity reads and returns the content of a configuration file or table. It also returns the result of
executing a query or stored procedure. The output from Lookup activity can be used in a subsequent copy
or transformation activity if it's a singleton value. The output can be used in a ForEach activity if it's an array
of attributes.
Supported capabilities
The following data sources are supported for Lookup activity. The largest number of rows that can be
returned by Lookup activity is 5,000, up to 2 MB in size. Currently, the longest duration for Lookup activity
before timeout is one hour.
Azure Files
DB2
Drill (Preview)
Google BigQuery
Greenplum
HBase
Hive
Informix
MariaDB
Microsoft Access
MySQL
Netezza
Oracle
Phoenix
PostgreSQL
Presto (Preview)
SAP HANA
SAP Table
Spark
SQL Server
Sybase
Teradata
Vertica
CATEGORY DATA STORE
NoSQL Cassandra
Couchbase (Preview)
File Amazon S3
File System
FTP
HDFS
SFTP
Generic OData
Generic ODBC
Concur (Preview)
Dynamics 365
Dynamics AX (Preview)
Dynamics CRM
HubSpot (Preview)
Jira (Preview)
Magento (Preview)
Marketo (Preview)
Paypal (Preview)
QuickBooks (Preview)
Salesforce
SAP ECC
ServiceNow
Shopify (Preview)
Square (Preview)
Xero (Preview)
Zoho (Preview)
NOTE
Any connector marked as Preview means that you can try it out and give us feedback. If you want to take a
dependency on preview connectors in your solution, please contact Azure support.
Syntax
{
"name": "LookupActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "<source type>"
<additional source specific properties (optional)>
},
"dataset": {
"referenceName": "<source dataset name>",
"type": "DatasetReference"
},
"firstRowOnly": false
}
}
Type properties
NAME DESCRIPTION TYPE REQUIRED?
NOTE
Source columns with ByteArray type aren't supported.
Structure isn't supported in dataset definitions. For text-format files, use the header row to provide the column
name.
If your lookup source is a JSON file, the jsonPathDefinition setting for reshaping the JSON object isn't
supported. The entire object is retrieved.
When firstRowOnly is set to true (the default), the output format is as shown in the following code:
{
"firstRow":
{
"Id": "1",
"TableName" : "Table1"
}
}
When firstRowOnly is set to false , the output format is as shown in the following code. A count
field indicates how many records are returned. Detailed values are displayed under a fixed value
array. In such a case, the Lookup activity is followed by a Foreach activity. You pass the value array
to the ForEach activity items field by using the pattern of
@activity('MyLookupActivity').output.value . To access elements in the value array, use the following
syntax: @{activity('lookupActivity').output.value[zero based index].propertyname} . An example is
@{activity('lookupActivity').output.value[0].tablename} .
{
"count": "2",
"value": [
{
"Id": "1",
"TableName" : "Table1"
},
{
"Id": "2",
"TableName" : "Table2"
}
]
}
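To consume an output like the one above, a ForEach activity can take the value array directly. The following sketch (with illustrative activity names and a placeholder Wait activity in the loop body) shows the wiring:
{
  "name": "ForEachOverLookupOutput",
  "type": "ForEach",
  "dependsOn": [
    {
      "activity": "MyLookupActivity",
      "dependencyConditions": [ "Succeeded" ]
    }
  ],
  "typeProperties": {
    "items": {
      "value": "@activity('MyLookupActivity').output.value",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "ProcessOneRow",
        "type": "Wait",
        "typeProperties": {
          "waitTimeInSeconds": 1
        }
      }
    ]
  }
}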
Lookup dataset
The lookup dataset is the sourcetable.json file in the Azure Storage lookup folder specified by the
AzureStorageLinkedService type.
{
"name": "LookupDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "lookup",
"fileName": "sourcetable.json",
"format": {
"type": "JsonFormat",
"filePattern": "SetOfObjects"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
{
"name": "SourceDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties":{
"tableName": "@{activity('LookupActivity').output.firstRow.tableName}"
},
"linkedServiceName": {
"referenceName": "AzureSqlLinkedService",
"type": "LinkedServiceReference"
}
}
}
{
"name": "SinkDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "csv",
"fileName": "filebylookup.csv",
"format": {
"type": "TextFormat"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
Azure Storage linked service
This storage account contains the JSON file with the names of the SQL tables.
{
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<StorageAccountName>;AccountKey=
<StorageAccountKey>",
"type": "SecureString"
}
}
},
"name": "AzureStorageLinkedService"
}
{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"description": "",
"typeProperties": {
"connectionString": {
"value": "Server=<server>;Initial Catalog=<database>;User ID=<user>;Password=<password>;",
"type": "SecureString"
}
}
}
}
sourcetable.json
Set of objects
{
"Id": "1",
"tableName": "Table1"
}
{
"Id": "2",
"tableName": "Table2"
}
Array of objects
[
{
"Id": "1",
"tableName": "Table1"
},
{
"Id": "2",
"tableName": "Table2"
}
]
Limitations and workarounds
Here are some limitations of the Lookup activity and suggested workarounds.
LIMITATION | WORKAROUND
The Lookup activity has a maximum of 5,000 rows, and a maximum size of 2 MB. | Design a two-level pipeline where the outer pipeline iterates over an inner pipeline, which retrieves data that doesn't exceed the maximum rows or size.
Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline activity
ForEach activity
GetMetadata activity
Web activity
Set Variable Activity in Azure Data Factory
3/7/2019 • 2 minutes to read • Edit Online
Use the Set Variable activity to set the value of an existing variable of type String, Bool, or Array defined in a Data
Factory pipeline.
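A minimal sketch of a Set Variable activity follows; the variable name and value expression are illustrative:
{
  "name": "SetMyVariable",
  "type": "SetVariable",
  "typeProperties": {
    "variableName": "myStringVariable",
    "value": "@pipeline().parameters.inputValue"
  }
}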
Type properties
PROPERTY DESCRIPTION REQUIRED
Next steps
Learn about a related control flow activity supported by Data Factory:
Append Variable Activity
Until activity in Azure Data Factory
3/5/2019 • 4 minutes to read • Edit Online
The Until activity provides the same functionality that a do-until looping structure provides in programming
languages. It executes a set of activities in a loop until the condition associated with the activity evaluates to true.
You can specify a timeout value for the until activity in Data Factory.
Syntax
{
"type": "Until",
"typeProperties": {
"expression": {
"value": "<expression that evaluates to true or false>",
"type": "Expression"
},
"timeout": "<time out for the loop. for example: 00:01:00 (1 minute)>",
"activities": [
{
"<Activity 1 definition>"
},
{
"<Activity 2 definition>"
},
{
"<Activity N definition>"
}
]
},
"name": "MyUntilActivity"
}
Type properties
PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED
{
"name": "DoUntilPipeline",
"properties": {
"activities": [
{
"type": "Until",
"typeProperties": {
"expression": {
"value": "@equals('Failed', coalesce(body('MyUnauthenticatedActivity')?.status,
actions('MyUnauthenticatedActivity')?.status, 'null'))",
"type": "Expression"
},
"timeout": "00:00:01",
"activities": [
{
"name": "MyUnauthenticatedActivity",
"type": "WebActivity",
"typeProperties": {
"method": "get",
"url": "https://www.fake.com/",
"headers": {
"Content-Type": "application/json"
}
},
"dependsOn": [
{
"activity": "MyWaitActivity",
"dependencyConditions": [ "Succeeded" ]
}
]
},
{
"type": "Wait",
"typeProperties": {
"waitTimeInSeconds": 1
},
"name": "MyWaitActivity"
}
]
},
"name": "MyUntilActivity"
}
]
}
}
Example 2
The pipeline in this sample copies data from an input folder to an output folder in a loop. The loop terminates
when the value for the repeat parameter is set to false or it times out after one minute.
Pipeline with Until activity (Adfv2QuickStartPipeline.json)
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"type": "Until",
"typeProperties": {
"expression": {
"value": "@equals('false', pipeline().parameters.repeat)",
"type": "Expression"
},
"timeout": "00:01:00",
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"policy": {
"retry": 1,
"timeout": "00:10:00",
"retryIntervalInSeconds": 60
}
}
]
},
"name": "MyUntilActivity"
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
},
"repeat": {
"type": "String"
}
}
}
}
}
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<Azure Storage account name>;AccountKey=
<Azure Storage account key>",
"type": "SecureString"
}
}
}
}
{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}
{
"inputPath": "adftutorial/input",
"outputPath": "adftutorial/outputUntil",
"repeat": "true"
}
PowerShell commands
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
These commands assume that you have saved the JSON files into the folder: C:\ADF.
Connect-AzAccount
Select-AzSubscription "<Your subscription name>"
while ($True) {
$run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $DataFactoryName -PipelineRunId $runId
if ($run) {
if ($run.Status -ne 'InProgress') {
Write-Host "Pipeline run finished. The status is: " $run.Status -foregroundcolor "Yellow"
$run
break
}
Write-Host "Pipeline is running...status: InProgress" -foregroundcolor "Yellow"
Write-Host "Activity run details:" -foregroundcolor "Yellow"
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
$result
}
Start-Sleep -Seconds 15
}
Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Validation activity in Azure Data Factory
3/27/2019 • 2 minutes to read • Edit Online
You can use a Validation activity in a pipeline to ensure that the pipeline only continues execution once it has
validated that the attached dataset reference exists, that it meets the specified criteria, or that the timeout has been reached.
Syntax
{
"name": "Validation_Activity",
"type": "Validation",
"typeProperties": {
"dataset": {
"referenceName": "Storage_File",
"type": "DatasetReference"
},
"timeout": "7.00:00:00",
"sleep": 10,
"minimumSize": 20
}
},
{
"name": "Validation_Activity_Folder",
"type": "Validation",
"typeProperties": {
"dataset": {
"referenceName": "Storage_Folder",
"type": "DatasetReference"
},
"timeout": "7.00:00:00",
"sleep": 10,
"childItems": true
}
}
Type properties
PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED
Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
Execute wait activity in Azure Data Factory
2/25/2019 • 2 minutes to read • Edit Online
When you use a Wait activity in a pipeline, the pipeline waits for the specified period of time before continuing
with execution of subsequent activities.
Syntax
{
"name": "MyWaitActivity",
"type": "Wait",
"typeProperties": {
"waitTimeInSeconds": 1
}
}
Type properties
PROPERTY DESCRIPTION ALLOWED VALUES REQUIRED
Example
NOTE
This section provides JSON definitions and sample PowerShell commands to run the pipeline. For a walkthrough with step-
by-step instructions to create a Data Factory pipeline by using Azure PowerShell and JSON definitions, see tutorial: create a
data factory by using Azure PowerShell.
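As a minimal sketch (the pipeline name and wait duration are illustrative), a pipeline containing a single Wait activity can be defined as follows:
{
  "name": "WaitPipeline",
  "properties": {
    "activities": [
      {
        "name": "MyWaitActivity",
        "type": "Wait",
        "typeProperties": {
          "waitTimeInSeconds": 10
        }
      }
    ]
  }
}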
Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
Web activity in Azure Data Factory
1/10/2019 • 3 minutes to read • Edit Online
Web Activity can be used to call a custom REST endpoint from a Data Factory pipeline. You can pass datasets
and linked services to be consumed and accessed by the activity.
Syntax
{
"name":"MyWebActivity",
"type":"WebActivity",
"typeProperties":{
"method":"Post",
"url":"<URLEndpoint>",
"headers":{
"Content-Type":"application/json"
},
"authentication":{
"type":"ClientCertificate",
"pfx":"****",
"password":"****"
},
"datasets":[
{
"referenceName":"<ConsumedDatasetName>",
"type":"DatasetReference",
"parameters":{
...
}
}
],
"linkedServices":[
{
"referenceName":"<ConsumedLinkedServiceName>",
"type":"LinkedServiceReference"
}
]
}
}
Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
url | Target endpoint and path. | String (or expression with resultType of string). The activity will time out at 1 minute with an error if it does not receive a response from the endpoint. | Yes
headers | Headers that are sent to the request. For example, to set the language and type on a request: "headers" : { "Accept-Language": "en-us", "Content-Type": "application/json" }. | String (or expression with resultType of string) | Yes, the Content-Type header is required: "headers":{ "Content-Type":"application/json" }
body | Represents the payload that is sent to the endpoint. See the schema of the request payload in the Request payload schema section. | String (or expression with resultType of string). | Required for POST/PUT methods.
NOTE
REST endpoints that the web activity invokes must return a response of type JSON. The activity will timeout at 1 minute
with an error if it does not receive a response from the endpoint.
Authentication
None
If authentication is not required, do not include the "authentication" property.
Basic
Specify user name and password to use with the basic authentication.
"authentication":{
"type":"Basic",
"username":"****",
"password":"****"
}
Client certificate
Specify base64-encoded contents of a PFX file and the password.
"authentication":{
"type":"ClientCertificate",
"pfx":"****",
"password":"****"
}
Managed Identity
Specify the resource URI for which the access token will be requested using the managed identity for the data
factory. To call the Azure Resource Management API, use https://management.azure.com/ . For more
information about how managed identities work, see the managed identities for Azure resources overview
page.
"authentication": {
"type": "MSI",
"resource": "https://management.azure.com/"
}
Example
In this example, the web activity in the pipeline calls a REST endpoint. It passes an Azure SQL linked service
and an Azure SQL dataset to the endpoint. The REST endpoint uses the Azure SQL connection string to
connect to the Azure SQL server and returns the name of the SQL server instance.
Pipeline definition
{
"name": "<MyWebActivityPipeline>",
"properties": {
"activities": [
{
"name": "<MyWebActivity>",
"type": "WebActivity",
"typeProperties": {
"method": "Post",
"url": "@pipeline().parameters.url",
"headers": {
"Content-Type": "application/json"
},
"authentication": {
"type": "ClientCertificate",
"pfx": "*****",
"password": "*****"
},
"datasets": [
{
"referenceName": "MySQLDataset",
"type": "DatasetReference",
"parameters": {
"SqlTableName": "@pipeline().parameters.sqlTableName"
}
}
],
"linkedServices": [
{
"referenceName": "SqlLinkedService",
"type": "LinkedServiceReference"
}
]
}
}
],
"parameters": {
"sqlTableName": {
"type": "String"
},
"url": {
"type": "String"
}
}
}
}
{
"sqlTableName": "department",
"url": "https://adftes.azurewebsites.net/api/execute/running"
}
result.Add("sinkServer", sqlConn.DataSource);
Trace.TraceInformation("Stop Execute");
Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Webhook activity in Azure Data Factory
4/10/2019 • 2 minutes to read • Edit Online
You can use a Webhook activity to control the execution of pipelines through your custom code. Using the
Webhook activity, customers can call an endpoint and pass a callback URL. The pipeline run waits for the callback to
be invoked before proceeding to the next activity.
Syntax
{
"name": "MyWebHookActivity",
"type": "WebHook",
"typeProperties": {
"method": "POST",
"url": "<URLEndpoint>",
"headers": {
"Content-Type": "application/json"
},
"body": {
"key": "value"
},
"timeout": "00:03:00",
"authentication": {
"type": "ClientCertificate",
"pfx": "****",
"password": "****"
}
}
}
Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
method | REST API method for the target endpoint. | String. Supported types: 'POST' | Yes
url | Target endpoint and path. | String (or expression with resultType of string). | Yes
headers | Headers that are sent to the request. For example, to set the language and type on a request: "headers" : { "Accept-Language": "en-us", "Content-Type": "application/json" }. | String (or expression with resultType of string) | Yes, the Content-Type header is required: "headers":{ "Content-Type":"application/json" }
body | Represents the payload that is sent to the endpoint. | The body passed back to the callback URI should be a valid JSON. See the schema of the request payload in the Request payload schema section. | Yes
Additional notes
Azure Data Factory will pass an additional property "callBackUri" in the body to the URL endpoint, and will expect
this URI to be invoked before the specified timeout value. If the URI is not invoked, the activity will fail with the status
'TimedOut'.
The Webhook activity itself fails only when the call to the custom endpoint fails. Any error message can be added
into the body of the callback and used in a subsequent activity.
Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
Azure Data Factory Mapping Data Flow Aggregate
Transformation
2/22/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
The Aggregate transformation is where you'll define aggregations of columns in your data streams. In the
Expression Builder, you can define different types of aggregations (for example, SUM, MIN, MAX, and COUNT) and create a
new field in your output that includes these aggregations with optional group-by fields.
Group By
(Optional) Choose a Group-by clause for your aggregation and use either the name of an existing column or a new
name. Use "Add Column" to add more group-by clauses, and click on the text box next to the column name to launch
the Expression Builder to select either an existing column, a combination of columns, or expressions for your
grouping.
Use the Alter Row transformation to set insert, delete, update, and upsert policies on rows. You can add one-to-
many conditions as expressions. Each of those conditions can result in a row (or rows) being inserted, updated,
deleted, or upserted. Alter Row can produce both DDL and DML actions against your database.
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
NOTE
Alter Row transformations will only operate on database sinks in your data flow. The actions that you assign to rows (insert,
update, delete, upsert) will not occur during debug sessions. You must add an Execute Data Flow task to a pipeline and use
pipeline debug or triggers to enact the alter row policies on your database tables.
View policies
Switch the Data Flow Debug mode to on and then view the results of your alter row policies in the Data Preview
pane. Executing an alter row in Data Flow Debug mode will not produce DDL or DML actions against your target.
In order to have those actions occur, execute the data flow inside an Execute Data Flow activity inside a pipeline.
This will allow you to verify and view the state of each row based on your conditions. There are icons that represent
each insert, update, delete, and upsert action that will occur in your data flow, indicating which action will take place
when you execute the data flow inside a pipeline.
Sink settings
You must have a database sink type for Alter Row to work. In the sink Settings, you must set each action to be
allowed.
The default behavior in ADF Data Flow with database sinks is to insert rows. If you want to allow updates, upserts,
and deletes as well, you must also check these boxes in the sink to allow the actions.
NOTE
If your inserts, updates, or upserts modify the schema of the target table in the sink, your data flow will fail. In order to
modify the target schema in your database, you must choose the "Recreate table" option in the sink. This will drop and
recreate your table with the new schema definition.
Next steps
After the Alter Row transformation, you may want to sink your data into a destination data store.
Mapping data flow conditional split transformation
5/15/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
The Conditional Split transformation can route data rows to different streams depending on the content of the
data. The implementation of the Conditional Split transformation is similar to a CASE decision structure in a
programming language. The transformation evaluates expressions, and based on the results, directs the data row
to the specified stream. This transformation also provides a default output, so that if a row matches no expression it
is directed to the default output.
Multiple paths
To add additional conditions, select "Add Stream" in the bottom configuration pane and click in the Expression
Builder text box to build your expression.
Next steps
Common data flow transformations used with conditional split: Join transformation, Lookup transformation, Select
transformation
Mapping data flow derived column transformation
4/29/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Use the Derived Column transformation to generate new columns in your data flow or to modify existing fields.
You can perform multiple Derived Column actions in a single Derived Column transformation. Click "Add
Column" to transform more than one column in a single transformation step.
In the Column field, either select an existing column to overwrite with a new derived value, or click "Create New
Column" to generate a new column with the newly derived value.
The Expression text box will open the Expression Builder where you can build the expression for the derived
columns using expression functions.
Column patterns
If your column names are variable from your sources, you may wish to build transformations inside of the Derived
Column using Column Patterns instead of using named columns. See the Schema Drift article for more details.
Next steps
Learn more about the Data Factory expression language for transformations and the Expression Builder
Mapping data flow exists transformation
5/6/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
The Exists transformation is a row filtering transformation that stops or allows rows in your data to flow through.
The Exists Transform is similar to SQL WHERE EXISTS and SQL WHERE NOT EXISTS . After the Exists Transformation, the
resulting rows from your data stream will either include all rows where column values from source 1 exist in
source 2 or do not exist in source 2.
Choose the second source for your Exists so that Data Flow can compare values from Stream 1 against Stream 2.
Select the column from Source 1 and from Source 2 whose values you wish to check against for Exists or Not
Exists.
Custom expression
You can click "Custom Expression" to instead create a free-form expression as your exists or not-exists condition.
Checking this box will allow you to type in your own expression as a condition.
Next steps
Similar transformations are Lookup and Join.
Azure data factory filter transformation
5/24/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
The Filter transformation provides row filtering. Build an expression that defines the filter condition. Click in the text
box to launch the Expression Builder. Inside the expression builder, build a filter expression to control which rows
from current data stream are allowed to pass through (filter) to the next transformation. Think of the Filter
transformation as the WHERE clause of a SQL statement.
Next steps
Try a column filtering transformation, the Select transformation
Mapping Data Flow Join Transformation
3/27/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Use Join to combine data from two tables in your Data Flow. Click on the transformation that will be the left
relationship and add a Join transformation from the toolbox. Inside the Join transform, you will select another
data stream from your data flow to be the right relationship.
Join types
Selecting Join Type is required for the Join transformation.
Inner Join
Inner join will pass through only rows that match the column conditions from both tables.
Left Outer
All rows from the left stream not meeting the join condition are passed through, and output columns from the
other table are set to NULL in addition to all rows returned by the inner join.
Right Outer
All rows from the right stream not meeting the join condition are passed through, and output columns that
correspond to the other table are set to NULL, in addition to all rows returned by the inner join.
Full Outer
Full Outer produces all columns and rows from both sides with NULL values for columns that are not present in
the other table.
Cross Join
Specify the cross product of the two streams with an expression. You can use this to create custom join conditions.
If your dataset can fit into the Databricks worker node memory, we can optimize your Join performance. You can
also specify partitioning of your data on the Join operation to create sets of data that can fit better into memory
per worker.
Self-Join
You can achieve self-join conditions in ADF Data Flow by using the Select transformation to alias an existing
stream. First, create a "New Branch" from a stream, then add a Select to alias the entire original stream.
In the above diagram, the Select transform is at the top. All it's doing is aliasing the original stream to
"OrigSourceBatting". In the highlighted Join transform below it you can see that we use this Select alias stream as
the right-hand join, allowing us to reference the same key in both the Left & Right side of the Inner Join.
Next steps
After joining data, you can then create new columns and sink your data to a destination data store.
Azure Data Factory Mapping Data Flow Lookup
Transformation
4/28/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Use Lookup to add reference data from another source to your Data Flow. The Lookup transform requires a
defined source that points to your reference table and matches on key fields.
Select the key fields that you wish to match on between the incoming stream fields and the fields from the
reference source. You must first have created a new source on the Data Flow design canvas to use as the right-side
for the lookup.
When matches are found, the resulting rows and columns from the reference source will be added to your data
flow. You can choose which fields of interest that you wish to include in your Sink at the end of your Data Flow.
Match / No match
After your Lookup transformation, you can use subsequent transformations to inspect the results of each match
row by using the expression function isMatch() to make further choices in your logic based on whether or not the
Lookup resulted in a row match or not.
Optimizations
In Data Factory, Data Flows execute in scaled-out Spark environments. If your dataset can fit into worker node
memory space, we can optimize your Lookup performance.
Broadcast join
Select Left and/or Right side broadcast join to request ADF to push the entire dataset from either side of the
Lookup relationship into memory.
Data partitioning
You can also specify partitioning of your data by selecting "Set Partitioning" on the Optimize tab of the Lookup
transformation to create sets of data that can fit better into memory per worker.
Next steps
Join and Exists transformations perform similar tasks in ADF Mapping Data Flows. Take a look at those
transformations next.
Azure Data Factory Mapping Data Flow New Branch
Transformation
2/22/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Branching will take the current data stream in your data flow and replicate it to another stream. Use New Branch to
perform multiple sets of operations and transformations against the same data stream.
Example: Your data flow has a Source Transform with a selected set of columns and data type conversions. You
then place a Derived Column immediately following that Source. In the Derived Column, you've created a new field
that combines first name and last name to make a new "full name" field.
You can treat that new stream with a set of transformations and a sink on one row and use New Branch to create a
copy of that stream where you can transform that same data with a different set of transformations. By
transforming that copied data in a separate branch, you can subsequently sink that data to a separate location.
NOTE
"New Branch" will only show as an action on the + Transformation menu when there is a subsequent transformation
following the current location where you are attempting to branch. i.e. You will not see a "New Branch" option at the end here
until you add another transformation after the Select
Azure data factory pivot transformation
4/10/2019 • 3 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Use Pivot in ADF Data Flow as an aggregation where one or more grouping columns has its distinct row values
transformed into individual columns. Essentially, you can Pivot row values into new columns (turn data into
metadata).
Group by
First, set the columns that you wish to group by for your pivot aggregation. You can set more than one column here with the + sign next to the column list.
Pivot key
The Pivot Key is the column that ADF will pivot from row to column. By default, each unique value in the dataset
for this field will pivot to a column. However, you can optionally enter the values from the dataset that you wish to
pivot to column values. This is the column that will determine the new columns that will be created.
Pivoted columns
Lastly, you will choose the aggregation that you wish to use for the pivoted values and how you would like the
columns to be displayed in the new output projection from the transformation.
(Optional) You can set a naming pattern with a prefix, middle, and suffix to be added to each new column name
from the row values.
For instance, pivoting "Sales" by "Region" would result in new column names taken from each sales value, for example "25", "50", "1000", and so on. However, if you set a prefix value of "Sales-", each of those generated column names has "Sales-" added to the beginning.
Setting the Column Arrangement to "Normal" will group together all of the pivoted columns with their aggregated
values. Setting the columns arrangement to "Lateral" will alternate between column and value.
Aggregation
To set the aggregation you wish to use for the pivot values, click on the field at the bottom of the Pivoted Columns
pane. You will enter into the ADF Data Flow expression builder where you can build an aggregation expression and
provide a descriptive alias name for your new aggregated values.
Use the ADF Data Flow Expression Language to describe the pivoted column transformations in the Expression
Builder: https://aka.ms/dataflowexpressions.
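As an illustrative sketch, an aggregation expression for the pivoted column might look like the following, assuming the incoming stream has a Sales column (the toInteger conversion is only needed if the field arrives as a string):
sum(toInteger(Sales))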
Pivot metadata
The Pivot transformation will produce new column names that are dynamic based on your incoming data. The
Pivot Key produces the values for each new column name. If you do not specify individual values and wish to
create dynamic column names for each unique value in your Pivot Key, then the UI will not display the metadata in
Inspect and there will be no column propagation to the Sink transformation. If you set values for the Pivot Key,
then ADF can determine the new column names and those column names will be available to you in the Inspect
and Sink mapping.
Landing new columns in Sink
Even with dynamic column names in Pivot, you can still sink your new column names and values into your
destination store. Just set "Allow Schema Drift" to on in your Sink settings. You will not see the new dynamic
names in your column metadata, but the schema drift option will allow you to land the data.
View metadata in design mode
If you wish to view the new column names as metadata in Inspect and you wish to see the columns propagate
explicitly to the Sink transformation, then set explicit Values in the Pivot Key tab.
How to rejoin original fields
The Pivot transformation will only project the columns used in the aggregation, grouping, and pivot action. If you
wish to include the other columns from the previous step in your flow, use a New Branch from the previous step
and use the self-join pattern to connect the flow with the original metadata.
Next steps
Try the unpivot transformation to turn column values into row values.
Azure Data Factory Mapping Data Flow Select
Transformation
2/22/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Use this transformation for column selectivity (reducing number of columns) or to alias columns and stream
names.
The Select transform allows you to alias an entire stream, or columns in that stream, by assigning different names (aliases) that you can then reference later in your data flow. This transform is useful for self-join scenarios. The way to implement a self-join in ADF Data Flow is to take a stream, branch it with "New Branch", and then immediately afterward add a "Select" transform. That stream will now have a new name that you can use to join back to the original stream, creating a self-join:
In the above diagram, the Select transform is at the top. This is aliasing the original stream to "OrigSourceBatting". In the highlighted Join transform below it, you can see that we use this Select alias stream as the right-hand join, allowing us to reference the same key on both the Left and Right sides of the Inner Join.
Select can also be used as a way to deselect columns from your data flow. For example, if you have six columns defined in your source, but you only wish to pick a specific three to transform and then flow to the sink, you can select just those three by using the Select transform.
NOTE
You must switch off "Select All" to pick only specific columns
Options
The default setting for "Select" is to include all incoming columns and keep those original names. You can alias the
stream by setting the name of the Select transform.
To alias individual columns, deselect "Select All" and use the column mapping at the bottom.
Sink transformation for a data flow
5/13/2019 • 3 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
After you transform your data flow, you can sink the data into a destination dataset. In the sink transformation,
choose a dataset definition for the destination output data. You can have as many sink transformations as your
data flow requires.
To account for schema drift and changes in incoming data, sink the output data to a folder without a defined
schema in the output dataset. You can also account for column changes in your sources by selecting Allow
schema drift in the source. Then automap all fields in the sink.
To sink all incoming fields, turn on Auto Map. To choose the fields to sink to the destination, or to change the
names of the fields at the destination, turn off Auto Map. Then open the Mapping tab to map output fields.
Output
For Azure Blob storage or Data Lake Storage sink types, output the transformed data into a folder. Spark
generates partitioned output data files based on the partitioning scheme that the sink transformation uses.
You can set the partitioning scheme from the Optimize tab. If you want Data Factory to merge your output into
a single file, select Single partition.
Field mapping
On the Mapping tab of your sink transformation, you can map the incoming columns on the left to the
destinations on the right. When you sink data flows to files, Data Factory will always write new files to a folder.
When you map to a database dataset, you can generate a new table that uses this schema by setting Save
Policy to Overwrite. Or insert new rows in an existing table and then map the fields to the existing schema.
In the mapping table, you can multiselect to link multiple columns, delink multiple columns, or map multiple
rows to the same column name.
To always map the incoming set of fields to a target as they are and to fully accept flexible schema definitions,
select Allow schema drift.
To reset your column mappings, select Re-map.
Database options
Choose database settings:
Update method: The default is to allow inserts. Clear Allow insert if you want to stop inserting new rows
from your source. To update, upsert, or delete rows, first add an alter-row transformation to tag rows for
those actions.
Recreate table: Drop or create your target table before the data flow finishes.
Truncate table: Remove all rows from your target table before the data flow finishes.
Batch size: Enter a number to bucket writes into chunks. Use this option for large data loads.
Enable staging: Use PolyBase when you load Azure Data Warehouse as your sink dataset.
NOTE
In Data Flow, you can direct Data Factory to create a new table definition in your target database. To create the table
definition, set a dataset in the sink transformation that has a new table name. In the SQL dataset, below the table name,
select Edit and enter a new table name. Then, in the sink transformation, turn on Allow schema drift. Set Import
schema to None.
NOTE
When you update or delete rows in your database sink, you must set the key column. This setting allows the alter-row transformation to determine the unique row for the DML (data manipulation language) operation.
Next steps
Now that you've created your data flow, add a Data Flow activity to your pipeline.
Azure Data Factory Data Flow Sort Transformations
3/13/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
The Sort transformation allows you to sort the incoming rows on the current data stream. The outgoing rows from the Sort transformation will subsequently follow the ordering rules that you set. You can choose individual columns and sort them ascending (ASC) or descending (DESC), using the arrow indicator next to each field. If you need to modify the column before applying the sort, click on "Computed Columns" to launch the expression editor. This provides an opportunity to build an expression for the sort operation instead of simply applying a column for the sort.
Case insensitive
You can turn on "Case insensitive" if you wish to ignore case when sorting string or text fields.
"Sort Only Within Partitions" leverages Spark data partitioning. By sorting incoming data only within each
partition, Data Flows can sort partitioned data instead of sorting entire data stream.
Each of the sort conditions in the Sort Transformation can be rearranged. So if you need to move a column higher
in the sort precedence, grab that row with your mouse and move it higher or lower in the sorting list.
Partitioning effects on Sort
ADF Data Flow is executed on big data Spark clusters with data distributed across multiple nodes and partitions. It
is important to keep this in mind when architecting your data flow if you are depending on the Sort transform to
keep data in that same order. If you choose to repartition your data in a subsequent transformation, you may lose
your sorting due to that reshuffling of data.
Next steps
After sorting, you may want to use the Aggregate Transformation
Source transformation for Mapping Data Flow
5/24/2019 • 4 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
A source transformation configures your data source for the data flow. A data flow can include more than one
source transformation. When designing data flows, always begin with a source transformation.
Every data flow requires at least one source transformation. Add as many sources as necessary to complete your
data transformations. You can join those sources together with a join transformation or a union transformation.
NOTE
When you debug your data flow, data is read from the source by using the sampling setting or the debug source limits. To
write data to a sink, you must run your data flow from a pipeline Data Flow activity.
Associate your Data Flow source transformation with exactly one Data Factory dataset. The dataset defines the
shape and location of the data you want to write to or read from. You can use wildcards and file lists in your
source to work with more than one file at a time.
Options
Choose schema and sampling options for your data.
Allow schema drift
Select Allow schema drift if the source columns will change often. This setting allows all incoming source fields
to flow through the transformations to the sink.
Validate schema
If the incoming version of the source data doesn't match the defined schema, the data flow will fail to run.
Define schema
When your source files aren't strongly typed (for example, flat files rather than Parquet files), define the data
types for each field here in the source transformation.
You can later change the column names in a select transformation. Use a derived-column transformation to
change the data types. For strongly typed sources, you can modify the data types in a later select transformation.
Optimize the source transformation
On the Optimize tab for the source transformation, you might see a Source partition type. This option is
available only when your source is Azure SQL Database. This is because Data Factory tries to make connections
parallel to run large queries against your SQL Database source.
You don't have to partition data on your SQL Database source, but partitions are useful for large queries. You
can base your partition on a column or a query.
Use a column to partition data
From your source table, select a column to partition on. Also set the maximum number of connections.
Use a query to partition data
You can choose to partition the connections based on a query. Simply enter the contents of a WHERE predicate.
For example, enter year > 1980.
NOTE
File operations run only when you start the data flow from a pipeline run (a pipeline debug or execution run) that uses the
Execute Data Flow activity in a pipeline. File operations do not run in Data Flow debug mode.
Projection
Like schemas in datasets, the projection in a source defines the data columns, types, and formats from the source
data.
If your text file has no defined schema, select Detect data type so that Data Factory will sample and infer the
data types. Select Define default format to autodetect the default data formats.
You can modify the column data types in a later derived-column transformation. Use a select transformation to
modify the column names.
Next steps
Begin building a derived-column transformation and a select transformation.
Mapping Data Flow Surrogate Key Transformation
4/17/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Use the Surrogate Key transformation to add an incrementing, non-business arbitrary key value to your data flow rowset. This is useful when designing dimension tables in a star schema analytical data model, where each member in your dimension tables needs a unique non-business key, as in the Kimball DW methodology.
"Key Column" is the name that you will give to your new surrogate key column.
"Start Value" is the beginning point of the incremental value.
File sources
If your previous max value is in a file, you can use your Source transformation together with an Aggregate
transformation and use the MAX() expression function to get the previous max value:
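As an illustrative sketch, the aggregate expression on that reference source could be the following, assuming the existing key column in the file is named SurrogateKey (the toInteger conversion is only needed if the file stores it as a string):
max(toInteger(SurrogateKey))
After the Join, a Derived Column can add that previous maximum to the newly generated key so the sequence continues from the prior high-water mark.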
In both cases, you must Join your incoming new data together with your source that contains the previous max
value:
Next steps
These examples use the Join and Derived Column transformations.
Mapping data flow union transformation
3/12/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Union will combine multiple data streams into one, with the SQL Union of those streams as the new output from
the Union transformation. All of the schema from each input stream will be combined inside of your data flow,
without needing to have a join key.
You can combine n-number of streams in the settings table by selecting the "+" icon next to each configured row,
including both source data as well as streams from existing transformations in your data flow.
In this case, you can combine disparate metadata from multiple sources (in this example, three different source
files) and combine them into a single stream:
To achieve this, add additional rows in the Union Settings by including all sources you wish to add. There is no need for a common lookup or join key:
If you set a Select transformation after your Union, you will be able to rename overlapping fields or fields that were not named from headerless sources. Click on "Inspect" to see the combined metadata, with 132 total columns in this example from three different sources:
Mapping data flow unpivot transformation
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Use Unpivot in ADF Mapping Data Flow as a way to turn an unnormalized dataset into a more normalized version
by expanding values from multiple columns in a single record into multiple records with the same values in a
single column.
Ungroup By
First, set the columns that you wish to ungroup by for your unpivot aggregation. Set one or more columns for ungrouping with the + sign next to the column list.
Unpivot Key
The Unpivot Key is the column that ADF will pivot from column to row. By default, each unique value in the dataset for this field will pivot to a row. However, you can optionally enter the values from the dataset that you wish to pivot to row values.
Unpivoted Columns
Lastly, choose the aggregation that you wish to use for the unpivoted values and how you would like the columns to be displayed in the new output projection from the transformation.
(Optional) You can set a naming pattern with a prefix, middle, and suffix to be added to each new column name
from the row values.
For instance, pivoting "Sales" by "Region" would simply give you new column values from each sales value. For
example: "25", "50", "1000", ... However, if you set a prefix value of "Sales", then "Sales" will be prefixed to the
values.
Setting the Column Arrangement to "Normal" will group together all of the pivoted columns with their aggregated
values. Setting the columns arrangement to "Lateral" will alternate between column and value.
The final unpivoted data result set shows the column totals now unpivoted into separate row values.
Next steps
Use the Pivot transformation to pivot rows to columns.
Azure Data Factory Window Transformation
3/13/2019 • 2 minutes to read • Edit Online
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
The Window transformation is where you will define window-based aggregations of columns in your data streams. In the Expression Builder, you can define different types of aggregations that are based on data or time windows (the SQL OVER clause), such as LEAD, LAG, NTILE, CUMEDIST, and RANK. A new field will be generated in your output that includes these aggregations. You can also include optional group-by fields.
Over
Set the partitioning of column data for your window transformation. The SQL equivalent is the Partition By in the Over clause in SQL. If you wish to create a calculation or an expression to use for the partitioning, you can do that by hovering over the column name and selecting "computed column".
Sort
Another part of the Over clause is setting the Order By. This will set the data sort ordering. You can also create an expression for a calculated value in this column field for sorting.
Range By
Next, set the window frame as Unbounded or Bounded. To set an unbounded window frame, set the slider to
Unbounded on both ends. If you choose a setting between Unbounded and Current Row, then you must set the
Offset start and end values. Both values will be positive integers. You can use either relative numbers or values
from your data.
The window slider has two values to set: the values before the current row and the values after the current row. The
Start and End offset matches the two selectors on the slider.
Window columns
Lastly, use the Expression Builder to define the aggregations you wish to use with the data windows such as RANK,
COUNT, MIN, MAX, DENSE RANK, LEAD, LAG, etc.
The full list of aggregation and analytical functions available for you to use in the ADF Data Flow Expression Language via the Expression Builder is listed here: https://aka.ms/dataflowexpressions.
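As an illustrative sketch, window column expressions built in the Expression Builder might look like the following, assuming a Sales column in the incoming stream:
denseRank()
lead(Sales, 1)
The first ranks the rows within each window partition according to the sort order you set; the second returns the Sales value from the next row in that order.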
Next steps
If you are looking for a simple group-by aggregation, use the Aggregate transformation
Parameterize linked services in Azure Data Factory
3/7/2019 • 2 minutes to read • Edit Online
You can now parameterize a linked service and pass dynamic values at run time. For example, if you want to
connect to different databases on the same Azure SQL Database server, you can now parameterize the database
name in the linked service definition. This prevents you from having to create a linked service for each database on
the Azure SQL database server. You can parameterize other properties in the linked service definition as well - for
example, User name.
You can use the Data Factory UI in the Azure Portal or a programming interface to parameterize linked services.
TIP
We recommend not to parameterize passwords or secrets. Store all connection strings in Azure Key Vault instead, and
parameterize the Secret Name.
For a seven-minute introduction and demonstration of this feature, watch the following video:
Data Factory UI
JSON
{
"name": "AzureSqlDatabase",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"value": "Server=tcp:myserver.database.windows.net,1433;Database=@{linkedService().DBName};User
ID=user;Password=fake;Trusted_Connection=False;Encrypt=True;Connection Timeout=30",
"type": "SecureString"
}
},
"connectVia": null,
"parameters": {
"DBName": {
"type": "String"
}
}
}
}
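A dataset that references this linked service then supplies a value for DBName. The following is an illustrative sketch; the dataset name, table name, and database value are assumptions:
{
    "name": "AzureSqlTableDataset",
    "properties": {
        "type": "AzureSqlTable",
        "typeProperties": {
            "tableName": "MyTable"
        },
        "linkedServiceName": {
            "referenceName": "AzureSqlDatabase",
            "type": "LinkedServiceReference",
            "parameters": {
                "DBName": "mydatabase1"
            }
        }
    }
}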
Expressions and functions in Azure Data Factory
4/26/2019 • 22 minutes to read • Edit Online
This article provides details about expressions and functions supported by Azure Data Factory.
Introduction
JSON values in the definition can be literal or expressions that are evaluated at runtime. For example:
"name": "value"
(or)
"name": "@pipeline().parameters.password"
Expressions
Expressions can appear anywhere in a JSON string value and always result in another JSON value. If a JSON
value is an expression, the body of the expression is extracted by removing the at-sign (@). If a literal string is
needed that starts with @, it must be escaped by using @@. The following examples show how expressions are
evaluated.
Expressions can also appear inside strings, using a feature called string interpolation where expressions are
wrapped in @{ ... } . For example:
"name" : "First Name: @{pipeline().parameters.firstName} Last Name: @{pipeline().parameters.lastName}"
Using string interpolation, the result is always a string. For example, suppose myNumber is defined as 42 and myString is defined as foo .
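As an illustrative sketch of how expressions resolve under those definitions:
"@pipeline().parameters.myNumber" returns 42 as a number.
"@{pipeline().parameters.myNumber}" returns 42 as a string.
"Answer is: @{pipeline().parameters.myNumber}" returns the string Answer is: 42 .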
Examples
A dataset with a parameter
In the following example, the BlobDataset takes a parameter named path. Its value is used to set a value for the
folderPath property by using the expression: dataset().path .
{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "@dataset().path"
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}
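A pipeline activity that uses this dataset then supplies the path value at run time. The following is an illustrative sketch of such a copy activity; the activity name, the folder value, and the SinkDataset output dataset are assumptions:
{
    "name": "CopyFromBlob",
    "type": "Copy",
    "inputs": [
        {
            "referenceName": "BlobDataset",
            "type": "DatasetReference",
            "parameters": {
                "path": "input/2019/05"
            }
        }
    ],
    "outputs": [
        {
            "referenceName": "SinkDataset",
            "type": "DatasetReference"
        }
    ],
    "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": { "type": "BlobSink" }
    }
}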
Functions
You can call functions within expressions. The following sections provide information about the functions that can
be used in an expression.
String functions
The following functions only apply to strings. You can also use a number of the collection functions on strings.
substring Returns a subset of characters from a string. For example, the following expression returns foo :
substring('somevalue-foo-somevalue',10,3)
guid Generates a globally unique identifier as a string, for example:
guid()
indexof Find the index of a value within a string case insensitively. For example, the following expression returns 7 :
indexof('hello, world.', 'world')
lastindexof Find the last index of a value within a string case insensitively. For example, the following expression returns 3 :
lastindexof('foofoo', 'foo')
startswith Checks if the string starts with a value case insensitively. For example, the following expression returns true :
startswith('hello, world', 'hello')
endswith Checks if the string ends with a value case insensitively. For example, the following expression returns true :
endswith('hello, world', 'world')
split Splits the string using a separator. For example, the following expression returns ["a", "b", "c"] :
split('a;b;c',';')
Collection functions
These functions operate over collections such as arrays, strings, and sometimes dictionaries.
empty Returns true if the collection passed in is empty. For example, the following expression returns true :
empty('')
union Returns a single array or object with all of the elements that are in either array or object passed to it, for example [1, 2, 3, 10, 101] for two overlapping integer arrays.
first Returns the first element in the array or string passed in. For example, this function returns 0 :
first([0,2,3])
last Returns the last element in the array or string passed in. For example, this function returns 3 :
last('0123')
take Returns the first Count elements from the array or string passed in. For example, this function returns [1, 2] :
take([1, 2, 3, 4], 2)
skip Returns the elements in the array starting at index Count. For example, this function returns [3, 4] :
skip([1, 2, 3, 4], 2)
Logical functions
These functions are useful inside conditions. They can be used to evaluate any type of logic.
less Returns true if the first argument is less than the second. Note, values can only be of type integer, float, or string. For example, the following expression returns true :
less(10,100)
lessOrEquals Returns true if the first argument is less than or equal to the second. Note, values can only be of type integer, float, or string. For example, the following expression returns true :
lessOrEquals(10,10)
greater Returns true if the first argument is greater than the second. Note, values can only be of type integer, float, or string. For example, the following expression returns false :
greater(10,10)
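The logical functions can be composed with one another inside a single expression. The following is an illustrative sketch (the parameter name rowCount is an assumption, not from this article); it returns large only when both comparisons hold:
@if(and(greater(pipeline().parameters.rowCount, 1000), less(pipeline().parameters.rowCount, 100000)), 'large', 'other')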
Conversion functions
These functions are used to convert between each of the native types in the language:
string
integer
float
boolean
arrays
dictionaries
json Casts the string that's passed in to native JSON. For example, json('[1,2,3]') returns the array [1,2,3] , and json('{"bar":"baz"}') returns the object { "bar" : "baz" } . The json function supports xml input as well.
coalesce Returns the first non-null object in the arguments passed in. Note: an empty string is not null. For example, if parameters 1 and 2 are not defined, this returns fallback :
coalesce(pipeline().parameters.parameter1, pipeline().parameters.parameter2, 'fallback')
encodeUriComponent URL-escapes the string that's passed in. For example, the following expression returns You+Are%3ACool%2FAwesome :
encodeUriComponent('You Are:Cool/Awesome')
decodeUriComponent Un-URL-escapes the string that's passed in. For example, the following expression returns You Are:Cool/Awesome :
decodeUriComponent('You+Are%3ACool%2FAwesome')
Example 1
1. This code:
xpath(xml(pipeline().parameters.p1), '/lab/robot/name')
would return
[ <name>R1</name>, <name>R2</name> ]
whereas
2. This code:
xpath(xml(pipeline().parameters.p1), 'sum(/lab/robot/parts)')
would return
13
Example 2
1. This code:
@xpath(xml(body('Http')), '/*[name()=\"File\"]/*[name()=\"Location\"]')
or
2. This code:
@xpath(xml(body('Http')), '/*[local-name()=\"File\" and namespace-uri()=\"http://foo.com\"]/*[local-name()=\"Location\" and namespace-uri()=\"\"]')
returns
<Location xmlns="http://foo.com">bar</Location>
and
3. This code:
@xpath(xml(body('Http')), 'string(/*[name()=\"File\"]/*[name()=\"Location\"])')
returns
bar
The xpath function takes two parameters: the Xml value to search and the XPath expression to evaluate against it.
Math functions
These functions can be used for both types of numbers: integers and floats.
add Returns the result of the addition of the two numbers. For example, this function returns 20.333 :
add(10,10.333)
sub Returns the result of the subtraction of the two numbers. For example, this function returns -0.333 :
sub(10,10.333)
mul Returns the result of the multiplication of the two numbers. For example, this function returns 103.33 :
mul(10,10.333)
div Returns the result of the division of the two numbers. For example, the following returns 1.0333 :
div(10.333,10)
mod Returns the result of the remainder after the division of the two numbers (modulo). For example, the following expression returns 2 :
mod(10,4)
min There are two different patterns for calling this function: min([0,1,2]) Here min takes an array. This expression returns 0 . Alternatively, this function can take a comma-separated list of values: min(0,1,2) This function also returns 0 . Note, all values must be numbers, so if the parameter is an array it has to only have numbers in it.
max Works like min but returns the largest value. There are two different patterns for calling this function: max([0,1,2]) or max(0,1,2) . Both expressions return 2 .
range Generates an array of integers starting at the first parameter, with the length given by the Count parameter. For example, range(3,4) returns [3, 4, 5, 6] .
rand Returns a random integer from the range defined by the Minimum and Maximum parameters, for example:
rand(-1000,1000)
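The math functions can be nested inside a single expression. As an illustrative sketch, the following computes (2 x 3) + (10 mod 4) and returns the string 8 :
@string(add(mul(2,3), mod(10,4)))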
Date functions
utcnow Returns the current timestamp (UTC) as a string, with an optional Format parameter, for example:
utcnow()
addseconds Adds the given number of seconds to the timestamp passed in, with an optional Format parameter, for example:
addseconds('2015-03-15T13:27:36Z', -36)
addminutes Adds the given number of minutes to the timestamp passed in, with an optional Format parameter, for example:
addminutes('2015-03-15T13:27:36Z', 33)
addhours Adds the given number of hours to the timestamp passed in, with an optional Format parameter, for example:
addhours('2015-03-15T13:27:36Z', 12)
adddays Adds the given number of days to the timestamp passed in, with an optional Format parameter, for example:
adddays('2015-03-15T13:27:36Z', -20)
formatDateTime Formats the date passed in using the given Format string, for example:
formatDateTime('2015-03-15T13:27:36Z', 'o')
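Date functions are often combined with system variables to build time-based values. The following is an illustrative sketch (the output/ folder prefix is an assumption); for a trigger time on 2019-05-24, it returns output/2019-05-23 :
@{concat('output/', formatDateTime(adddays(pipeline().TriggerTime, -1), 'yyyy-MM-dd'))}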
Next steps
For a list of system variables you can use in expressions, see System variables.
System variables supported by Azure Data Factory
5/6/2019 • 2 minutes to read • Edit Online
This article describes system variables supported by Azure Data Factory. You can use these variables in
expressions when defining Data Factory entities.
Pipeline scope
These system variables can be referenced anywhere in the pipeline JSON.
@pipeline().DataFactory Name of the data factory the pipeline run is running within
@pipeline().TriggerTime Time when the trigger invoked the pipeline. The trigger time is the actual fired time, not the scheduled time. For example, 13:20:08.0149599Z is returned instead of 13:20:00.00Z
@trigger().scheduledTime Time when the trigger was scheduled to invoke the pipeline
run. For example, for a trigger that fires every 5 min, this
variable would return 2017-06-01T22:20:00Z ,
2017-06-01T22:25:00Z , 2017-06-01T22:29:00Z
respectively.
@trigger().startTime Time when the trigger actually fired to invoke the pipeline
run. For example, for a trigger that fires every 5 min, this
variable might return something like this
2017-06-01T22:20:00.4061448Z ,
2017-06-01T22:25:00.7958577Z ,
2017-06-01T22:29:00.9935483Z respectively. (Note: The
timestamp is by default in ISO 8601 format)
Tumbling Window Trigger scope
These system variables can be referenced anywhere in the trigger JSON if the trigger is of type:
"TumblingWindowTrigger." (Note: The timestamp is by default in ISO 8601 format)
@trigger().outputs.windowStartTime Start of the window when the trigger was scheduled to invoke
the pipeline run. If the tumbling window trigger has a
frequency of "hourly" this would be the time at the beginning
of the hour.
@trigger().outputs.windowEndTime End of the window when the trigger was scheduled to invoke
the pipeline run. If the tumbling window trigger has a
frequency of "hourly" this would be the time at the end of the
hour.
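As an illustrative sketch of how these variables are typically consumed (the pipeline name MyCopyPipeline and the parameter names windowStart and windowEnd are assumptions), the pipeline section of a tumbling window trigger definition can pass the window boundaries to pipeline parameters:
{
    "pipeline": {
        "pipelineReference": {
            "type": "PipelineReference",
            "referenceName": "MyCopyPipeline"
        },
        "parameters": {
            "windowStart": "@trigger().outputs.windowStartTime",
            "windowEnd": "@trigger().outputs.windowEndTime"
        }
    }
}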
Next steps
For information about how these variables are used in expressions, see Expression language & functions.
Security considerations for data movement in Azure
Data Factory
4/19/2019 • 10 minutes to read • Edit Online
This article describes basic security infrastructure that data movement services in Azure Data Factory use to help
secure your data. Data Factory management resources are built on Azure security infrastructure and use all
possible security measures offered by Azure.
In a Data Factory solution, you create one or more data pipelines. A pipeline is a logical grouping of activities that
together perform a task. These pipelines reside in the region where the data factory was created.
Even though Data Factory is only available in a few regions, the data movement service is available globally to ensure data compliance, efficiency, and reduced network egress costs.
Azure Data Factory does not store any data except for linked service credentials for cloud data stores, which are
encrypted by using certificates. With Data Factory, you create data-driven workflows to orchestrate movement of
data between supported data stores, and processing of data by using compute services in other regions or in an
on-premises environment. You can also monitor and manage workflows by using SDKs and Azure Monitor.
Data Factory has been certified for:
ISO 20000-1:2011
ISO 22301:2012
ISO 27001:2013
ISO 27017:2015
ISO 27018:2014
ISO 9001:2015
SOC 1, 2, 3
HIPAA BAA
If you're interested in Azure compliance and how Azure secures its own infrastructure, visit the Microsoft Trust
Center. For the latest list of all Azure Compliance offerings check - https://aka.ms/AzureCompliance.
In this article, we review security considerations in the following two data movement scenarios:
Cloud scenario: In this scenario, both your source and your destination are publicly accessible through the
internet. These include managed cloud storage services such as Azure Storage, Azure SQL Data Warehouse,
Azure SQL Database, Azure Data Lake Store, Amazon S3, Amazon Redshift, SaaS services such as Salesforce,
and web protocols such as FTP and OData. Find a complete list of supported data sources in Supported data
stores and formats.
Hybrid scenario: In this scenario, either your source or your destination is behind a firewall or inside an on-
premises corporate network. Or, the data store is in a private network or virtual network (most often the
source) and is not publicly accessible. Database servers hosted on virtual machines also fall under this scenario.
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Cloud scenarios
Securing data store credentials
Store encrypted credentials in an Azure Data Factory managed store. Data Factory helps protect your
data store credentials by encrypting them with certificates managed by Microsoft. These certificates are rotated
every two years (which includes certificate renewal and the migration of credentials). The encrypted credentials
are securely stored in an Azure storage account managed by Azure Data Factory management services. For
more information about Azure Storage security, see Azure Storage security overview.
Store credentials in Azure Key Vault. You can also store the data store's credential in Azure Key Vault. Data
Factory retrieves the credential during the execution of an activity. For more information, see Store credential in
Azure Key Vault.
Data encryption in transit
If the cloud data store supports HTTPS or TLS, all data transfers between data movement services in Data Factory and the cloud data store are over a secure channel (HTTPS or TLS).
NOTE
All connections to Azure SQL Database and Azure SQL Data Warehouse require encryption (SSL/TLS) while data is in transit
to and from the database. When you're authoring a pipeline by using JSON, add the encryption property and set it to true in
the connection string. For Azure Storage, you can use HTTPS in the connection string.
NOTE
To enable encryption in transit while moving data from Oracle, follow one of the options below:
1. On the Oracle server, go to Oracle Advanced Security (OAS) and configure the encryption settings, which support Triple-DES Encryption (3DES) and Advanced Encryption Standard (AES); refer here for details. ADF automatically negotiates the encryption method to use the one you configure in OAS when establishing the connection to Oracle.
2. In ADF, you can add EncryptionMethod=1 in the connection string (in the linked service). This will use SSL/TLS as the encryption method. To use this, you need to disable non-SSL encryption settings in OAS on the Oracle server side to avoid encryption conflicts.
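A hedged sketch of option 2, mirroring the SQL Server linked service shape shown later in this article (the host, SID, user, password, and integration runtime name are placeholders, and the exact connection string format can vary by connector version):
{
    "name": "OracleLinkedService",
    "properties": {
        "type": "Oracle",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "Host=<host>;Port=1521;Sid=<sid>;User Id=<username>;Password=<password>;EncryptionMethod=1"
            }
        },
        "connectVia": {
            "type": "integrationRuntimeReference",
            "referenceName": "<integration runtime name>"
        }
    }
}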
NOTE
The TLS version used is 1.2.
Hybrid scenarios
Hybrid scenarios require self-hosted integration runtime to be installed in an on-premises network, inside a virtual
network (Azure), or inside a virtual private cloud (Amazon). The self-hosted integration runtime must be able to
access the local data stores. For more information about self-hosted integration runtime, see How to create and
configure self-hosted integration runtime.
The command channel allows communication between data movement services in Data Factory and self-hosted
integration runtime. The communication contains information related to the activity. The data channel is used for
transferring data between on-premises data stores and cloud data stores.
On-premises data store credentials
The credentials for your on-premises data stores are always encrypted and stored. They can be either stored locally
on the self-hosted integration runtime machine, or stored in Azure Data Factory managed storage (just like cloud
store credentials).
Store credentials locally. If you want to encrypt and store credentials locally on the self-hosted integration
runtime, follow the steps in Encrypt credentials for on-premises data stores in Azure Data Factory. All
connectors support this option. The self-hosted integration runtime uses Windows DPAPI to encrypt the
sensitive data and credential information.
Use the New-AzDataFactoryV2LinkedServiceEncryptedCredential cmdlet to encrypt linked service
credentials and sensitive details in the linked service. You can then use the JSON returned (with the
EncryptedCredential element in the connection string) to create a linked service by using the Set-
AzDataFactoryV2LinkedService cmdlet.
Store in Azure Data Factory managed storage. If you directly use the Set-
AzDataFactoryV2LinkedService cmdlet with the connection strings and credentials inline in the JSON,
the linked service is encrypted and stored in Azure Data Factory managed storage. The sensitive
information is still encrypted by certificate, and Microsoft manages these certificates.
Ports used when encrypting linked service on self-hosted integration runtime
By default, PowerShell uses port 8050 on the machine with self-hosted integration runtime for secure
communication. If necessary, this port can be changed.
Encryption in transit
All data transfers occur over a secure channel (HTTPS and TLS over TCP) to prevent man-in-the-middle attacks during communication with Azure services.
You can also use IPSec VPN or Azure ExpressRoute to further secure the communication channel between your
on-premises network and Azure.
Azure Virtual Network is a logical representation of your network in the cloud. You can connect an on-premises
network to your virtual network by setting up IPSec VPN (site-to-site) or ExpressRoute (private peering).
The following summarizes the network and self-hosted integration runtime configuration recommendations based on different combinations of source and destination locations for hybrid data movement:
Source on-premises, destination virtual machines and cloud services deployed in virtual networks, connected over IPSec VPN (point-to-site or site-to-site): the self-hosted integration runtime should be installed on an Azure virtual machine in the virtual network.
Source on-premises, destination virtual machines and cloud services deployed in virtual networks, connected over ExpressRoute (private peering): the self-hosted integration runtime should be installed on an Azure virtual machine in the virtual network.
The following images show the use of self-hosted integration runtime for moving data between an on-premises
database and Azure services by using ExpressRoute and IPSec VPN (with Azure Virtual Network):
ExpressRoute
IPSec VPN
Firewall configurations and whitelisting IP addresses
Firewall requirements for on-premises/private network
In an enterprise, a corporate firewall runs on the central router of the organization. Windows Firewall runs as a daemon on the local machine on which the self-hosted integration runtime is installed.
The following table provides outbound port and domain requirements for corporate firewalls:
The following table provides inbound port requirements for Windows Firewall:
Next steps
For information about Azure Data Factory Copy Activity performance, see Copy Activity performance and tuning
guide.
Store credential in Azure Key Vault
3/13/2019 • 2 minutes to read • Edit Online
You can store credentials for data stores and computes in an Azure Key Vault. Azure
Data Factory retrieves the credentials when executing an activity that uses the data
store/compute.
Currently, all activity types except custom activity support this feature. For connector
configuration specifically, check the "linked service properties" section in each
connector topic for details.
Prerequisites
This feature relies on the data factory managed identity. Learn how it works from
Managed identity for Data Factory, and make sure your data factory has an
associated one.
Steps
To reference a credential stored in Azure Key Vault, you need to:
1. Retrieve data factory managed identity by copying the value of "SERVICE
IDENTITY APPLICATION ID" generated along with your factory. If you use ADF
authoring UI, the managed identity application ID will be shown on the Azure
Key Vault linked service creation window; you can also retrieve it from Azure
portal, refer to Retrieve data factory managed identity.
2. Grant the managed identity access to your Azure Key Vault. In your key
vault -> Access policies -> Add new -> search for this managed identity application
ID and grant it Get permission in the Secret permissions dropdown. This allows the
designated factory to access secrets in the key vault.
3. Create a linked service pointing to your Azure Key Vault. Refer to Azure
Key Vault linked service.
4. Create the data store linked service, and inside it reference the
corresponding secret stored in the key vault. Refer to reference secret stored in
key vault.
JSON example:
{
"name": "AzureKeyVaultLinkedService",
"properties": {
"type": "AzureKeyVault",
"typeProperties": {
"baseUrl": "https://<azureKeyVaultName>.vault.azure.net"
}
}
}
TIP
For connectors that use a connection string in the linked service, such as SQL Server and Blob storage,
you can choose either to store only the secret field (for example, the password) in AKV, or to store the
entire connection string in AKV. You can find both options on the UI.
JSON example: (see the "password" section)
{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "<>",
"organizationName": "<>",
"authenticationType": "<>",
"username": "<>",
"password": {
"type": "AzureKeyVaultSecret",
"secretName": "<secret name in AKV>",
"store":{
"referenceName": "<Azure Key Vault linked service>",
"type": "LinkedServiceReference"
}
}
}
}
}
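For the other option mentioned in the TIP, storing the entire connection string in AKV, the following is a hedged sketch; the linked service and secret names are placeholders:
{
    "name": "SqlServerLinkedService",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "secretName": "<secret name that holds the full connection string>",
                "store": {
                    "referenceName": "<Azure Key Vault linked service>",
                    "type": "LinkedServiceReference"
                }
            }
        }
    }
}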
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure
Data Factory, see supported data stores.
Encrypt credentials for on-premises data stores in
Azure Data Factory
4/3/2019 • 2 minutes to read • Edit Online
You can encrypt and store credentials for your on-premises data stores (linked services with sensitive information)
on a machine with self-hosted integration runtime.
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<servername>;Database=<databasename>;User ID=<username>;Password=<password>;Timeout=60"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
},
"name": "SqlServerLinkedService"
}
}
Encrypt credentials
To encrypt the sensitive data from the JSON payload on an on-premises self-hosted integration runtime, run
New-AzDataFactoryV2LinkedServiceEncryptedCredential, and pass on the JSON payload. This cmdlet
ensures the credentials are encrypted using DPAPI and stored on the self-hosted integration runtime node locally.
The output payload containing the encrypted reference to the credential can be redirected to another JSON file (in
this case 'encryptedLinkedService.json').
New-AzDataFactoryV2LinkedServiceEncryptedCredential -DataFactoryName $dataFactoryName -ResourceGroupName
$ResourceGroupName -Name "SqlServerLinkedService" -DefinitionFile ".\SQLServerLinkedService.json" >
encryptedSQLServerLinkedService.json
Next steps
For information about security considerations for data movement, see Data movement security considerations.
Managed identity for Data Factory
4/8/2019 • 4 minutes to read • Edit Online
This article helps you understand what managed identity for Data Factory is (formerly known as Managed
Service Identity/MSI) and how it works.
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.
Overview
When creating a data factory, a managed identity can be created along with factory creation. The managed
identity is a managed application registered to Azure Active Directory, and it represents this specific data factory.
Managed identity for Data Factory is used by the following features:
Store credential in Azure Key Vault, in which case data factory managed identity is used for Azure Key Vault
authentication.
Connectors including Azure Blob storage, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2,
Azure SQL Database, and Azure SQL Data Warehouse.
Web activity.
DataFactoryName : ADFV2DemoFactory
DataFactoryId :
/subscriptions/<subsID>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/factories/ADFV2Dem
oFactory
ResourceGroupName : <resourceGroupName>
Location : East US
Tags : {}
Identity : Microsoft.Azure.Management.DataFactory.Models.FactoryIdentity
ProvisioningState : Succeeded
PATCH
https://management.azure.com/subscriptions/<subsID>/resourceGroups/<resourceGroupName>/providers/Microsoft.Da
taFactory/factories/<data factory name>?api-version=2018-06-01
{
"name": "<dataFactoryName>",
"location": "<region>",
"properties": {},
"identity": {
"type": "SystemAssigned"
}
}
Response: managed identity is created automatically, and "identity" section is populated accordingly.
{
"name": "<dataFactoryName>",
"tags": {},
"properties": {
"provisioningState": "Succeeded",
"loggingStorageAccountKey": "**********",
"createTime": "2017-09-26T04:10:01.1135678Z",
"version": "2018-06-01"
},
"identity": {
"type": "SystemAssigned",
"principalId": "765ad4ab-XXXX-XXXX-XXXX-51ed985819dc",
"tenantId": "72f988bf-XXXX-XXXX-XXXX-2d7cd011db47"
},
"id":
"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/factories
/ADFV2DemoFactory",
"type": "Microsoft.DataFactory/factories",
"location": "<region>"
}
{
"contentVersion": "1.0.0.0",
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"resources": [{
"name": "<dataFactoryName>",
"apiVersion": "2018-06-01",
"type": "Microsoft.DataFactory/factories",
"location": "<region>",
"identity": {
"type": "SystemAssigned"
}
}]
}
TIP
If you don't see the managed identity, generate managed identity by updating your factory.
PrincipalId TenantId
----------- --------
765ad4ab-XXXX-XXXX-XXXX-51ed985819dc 72f988bf-XXXX-XXXX-XXXX-2d7cd011db47
Copy the principal ID, then run the Azure Active Directory command below with the principal ID as a parameter to get the
ApplicationId, which you use to grant access:
ServicePrincipalNames : {76f668b3-XXXX-XXXX-XXXX-1b3348c75e02,
https://identity.azure.net/P86P8g6nt1QxfPJx22om8MOooMf/Ag0Qf/nnREppHkU=}
ApplicationId : 76f668b3-XXXX-XXXX-XXXX-1b3348c75e02
DisplayName : ADFV2DemoFactory
Id : 765ad4ab-XXXX-XXXX-XXXX-51ed985819dc
Type : ServicePrincipal
Next steps
See the following topics which introduce when and how to use data factory managed identity:
Store credential in Azure Key Vault
Copy data from/to Azure Data Lake Store using managed identities for Azure resources authentication
See Managed Identities for Azure Resources Overview for more background on managed identities for Azure
resources, which data factory managed identity is based upon.
Visually monitor Azure data factories
1/18/2019 • 4 minutes to read • Edit Online
Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the
cloud for orchestrating and automating data movement and data transformation. Using Azure Data Factory, you
can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores,
process/transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake
Analytics, and Azure Machine Learning, and publish output data to data stores such as Azure SQL Data Warehouse
for business intelligence (BI) applications to consume.
In this quickstart, you learn how to visually monitor Data Factory pipelines without writing a single line of code.
If you don't have an Azure subscription, create a free account before you begin.
IMPORTANT
You need to click the 'Refresh' icon at the top to refresh the list of pipeline and activity runs. Auto-refresh is currently not
supported.
Select a data factory to monitor
Hover over the Data Factory icon at the top left. Click the 'Arrow' icon to see a list of Azure subscriptions and
data factories that you can monitor.
Pipeline Name: Name of the pipeline. You can filter the list of runs with quick filters for 'Last 24 hours', 'Last week', and 'Last 30 days', or select a custom date time.
NOTE
You can only promote up to 5 pipeline activity properties as user properties.
After you create the user properties, you can then monitor them in the monitoring list views. If the source for the
Copy activity is a table name, you can monitor the source table name as a column in the activity runs list view.
Rerun activities inside a pipeline
You can now rerun activities inside a pipeline. Click View activity runs and select the activity in your pipeline from
which you want to rerun your pipeline.
Guided Tours
Click on the 'Information Icon' in lower left and click 'Guided Tours' to get step-by-step instructions on how to
monitor your pipeline and activity runs.
Feedback
Click on the 'Feedback' icon to give us feedback on various features or any issues that you might be facing.
Alerts
You can raise alerts on supported metrics in Data Factory. Select Monitor -> Alerts & Metrics on the Data
Factory Monitor page to get started.
For a seven-minute introduction and demonstration of this feature, watch the following video:
Create Alerts
1. Click New Alert rule to create a new alert.
5. Configure Email/SMS/Push/Voice notifications for the alert. Create or choose an existing Action Group
for the alert notifications.
6. Create the alert rule.
Next steps
See Monitor and manage pipelines programmatically article to learn about monitoring and managing pipelines.
Alert and Monitor data factories using Azure Monitor
3/15/2019 • 10 minutes to read • Edit Online
Cloud applications are complex with many moving parts. Monitoring provides data to ensure that your application stays up and running in a healthy state. It also helps you to
stave off potential problems or troubleshoot past ones. In addition, you can use monitoring data to gain deep insights about your application. This knowledge can help you to
improve application performance or maintainability, or automate actions that would otherwise require manual intervention.
Azure Monitor provides base level infrastructure metrics and logs for most services in Microsoft Azure. For details, see monitoring overview. Azure Diagnostic logs are logs
emitted by a resource that provide rich, frequent data about the operation of that resource. Data Factory outputs diagnostic logs in Azure Monitor.
Diagnostic logs
Save them to a Storage Account for auditing or manual inspection. You can specify the retention time (in days) using the diagnostic settings.
Stream them to Event Hubs for ingestion by a third-party service or custom analytics solution such as Power BI.
Analyze them with Log Analytics
You can use a storage account or event hub namespace that is not in the same subscription as the resource that is emitting logs. The user who configures the setting must
have the appropriate role-based access control (RBAC) access to both subscriptions.
PUT
https://management.azure.com/{resource-id}/providers/microsoft.insights/diagnosticSettings/service?api-version={api-version}
Headers
Replace {api-version} with 2016-09-01 .
Replace {resource-id} with the resource ID of the resource for which you would like to edit diagnostic settings. For more information, see Using Resource groups to manage
your Azure resources.
Set the Content-Type header to application/json .
Set the authorization header to a JSON web token that you obtain from Azure Active Directory. For more information, see Authenticating requests.
Body
{
"properties": {
"storageAccountId": "/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.Storage/storageAccounts/<storageAccountName>",
"serviceBusRuleId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.EventHub/namespaces/<eventHubName>/authorizationrules/RootManageSharedAccessKey",
"workspaceId": "/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.OperationalInsights/workspaces/<LogAnalyticsName>",
"metrics": [
],
"logs": [
{
"category": "PipelineRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "TriggerRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "ActivityRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
}
]
},
"location": ""
}
PROPERTY  TYPE  DESCRIPTION
storageAccountId  String  The resource ID of the storage account to which you would like to send Diagnostic Logs.
serviceBusRuleId  String  The service bus rule ID of the service bus namespace in which you would like to have Event Hubs created for streaming Diagnostic Logs. The rule ID is of the format: "{service bus resource ID}/authorizationrules/{key name}".
workspaceId  String  The resource ID of the Log Analytics workspace to which you would like to send Diagnostic Logs.
metrics  Complex Type  Array of metric time grains and their retention policies. Currently, this property is empty.
logs  Complex Type  Name of a Diagnostic Log category for a resource type. To obtain the list of Diagnostic Log categories for a resource, first perform a GET diagnostic settings operation.
timeGrain  String  The granularity of metrics that are captured in ISO 8601 duration format. Must be PT1M (one minute).
retentionPolicy  Complex Type  Describes the retention policy for a metric or log category. Used for the storage account option only.
Response
200 OK
{
"id": "/subscriptions/<subID>/resourcegroups/adf/providers/microsoft.datafactory/factories/shloadobetest2/providers/microsoft.insights/diagnosticSettings/service",
"type": null,
"name": "service",
"location": null,
"kind": null,
"tags": null,
"properties": {
"storageAccountId": "/subscriptions/<subID>/resourceGroups/<resourceGroupName>//providers/Microsoft.Storage/storageAccounts/<storageAccountName>",
"serviceBusRuleId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>//providers/Microsoft.EventHub/namespaces/<eventHubName>/authorizationrules/RootManageSharedAccessKey",
"workspaceId": "/subscriptions/<subID>/resourceGroups/<resourceGroupName>//providers/Microsoft.OperationalInsights/workspaces/<LogAnalyticsName>",
"eventHubAuthorizationRuleId": null,
"eventHubName": null,
"metrics": [],
"logs": [
{
"category": "PipelineRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "TriggerRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "ActivityRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
}
]
},
"identity": null
}
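As an alternative to calling the REST API directly, the same setting can be configured with the Az.Monitor PowerShell cmdlet Set-AzDiagnosticSetting. A minimal sketch, with placeholder resource IDs:
# A sketch: enable the three Data Factory log categories and route them to a storage account and a Log Analytics workspace (placeholder IDs)
$resourceId = "/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/factories/<dataFactoryName>"
$storageId = "/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.Storage/storageAccounts/<storageAccountName>"
$workspaceId = "/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.OperationalInsights/workspaces/<LogAnalyticsName>"
Set-AzDiagnosticSetting -ResourceId $resourceId -StorageAccountId $storageId -WorkspaceId $workspaceId `
    -Enabled $true -Category PipelineRuns,ActivityRuns,TriggerRuns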
GET
https://management.azure.com/{resource-id}/providers/microsoft.insights/diagnosticSettings/service?api-version={api-version}
Headers
Replace {api-version} with 2016-09-01 .
Replace {resource-id} with the resource ID of the resource for which you would like to edit diagnostic settings. For more information, see Using Resource groups to manage
your Azure resources.
Set the Content-Type header to application/json .
Set the authorization header to a JSON Web Token that you obtain from Azure Active Directory. For more information, see Authenticating requests.
Response
200 OK
{
"id": "/subscriptions/<subID>/resourcegroups/adf/providers/microsoft.datafactory/factories/shloadobetest2/providers/microsoft.insights/diagnosticSettings/service",
"type": null,
"name": "service",
"location": null,
"kind": null,
"tags": null,
"properties": {
"storageAccountId": "/subscriptions/<subID>/resourceGroups/shloprivate/providers/Microsoft.Storage/storageAccounts/azmonlogs",
"serviceBusRuleId":
"/subscriptions/<subID>/resourceGroups/shloprivate/providers/Microsoft.EventHub/namespaces/shloeventhub/authorizationrules/RootManageSharedAccessKey",
"workspaceId": "/subscriptions/<subID>/resourceGroups/ADF/providers/Microsoft.OperationalInsights/workspaces/mihaipie",
"eventHubAuthorizationRuleId": null,
"eventHubName": null,
"metrics": [],
"logs": [
{
"category": "PipelineRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "TriggerRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "ActivityRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
}
]
},
"identity": null
}
{
"Level": "",
"correlationId":"",
"time":"",
"activityRunId":"",
"pipelineRunId":"",
"resourceId":"",
"category":"ActivityRuns",
"level":"Informational",
"operationName":"",
"pipelineName":"",
"activityName":"",
"start":"",
"end":"",
"properties":
{
"Input": "{
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}",
"Output": "{"dataRead":121,"dataWritten":121,"copyDuration":5,
"throughput":0.0236328132,"errors":[]}",
"Error": "{
"errorCode": "null",
"message": "null",
"failureType": "null",
"target": "CopyBlobtoBlob"
}
}
}
PROPERTY  TYPE  DESCRIPTION  EXAMPLE
level  String  Level of the diagnostic logs. Set this property to "Informational".  Informational
operationName  String  Name of the activity with status. If the status is the start heartbeat, it is "MyActivity -". If the status is the end heartbeat, it is "MyActivity - Succeeded" with the final status.  MyActivity - Succeeded
{
"Level": "",
"correlationId":"",
"time":"",
"runId":"",
"resourceId":"",
"category":"PipelineRuns",
"level":"Informational",
"operationName":"",
"pipelineName":"",
"start":"",
"end":"",
"status":"",
"properties":
{
"Parameters": {
"<parameter1Name>": "<parameter1Value>"
},
"SystemParameters": {
"ExecutionStart": "",
"TriggerId": "",
"SubscriptionId": ""
}
}
}
PROPERTY  TYPE  DESCRIPTION  EXAMPLE
level  String  Level of the diagnostic logs. Set this property to "Informational".  Informational
operationName  String  Name of the pipeline with status. "Pipeline - Succeeded" with the final status when the pipeline run is completed.  MyPipeline - Succeeded
{
"Level": "",
"correlationId":"",
"time":"",
"triggerId":"",
"resourceId":"",
"category":"TriggerRuns",
"level":"Informational",
"operationName":"",
"triggerName":"",
"triggerType":"",
"triggerEvent":"",
"start":"",
"status":"",
"properties":
{
"Parameters": {
"TriggerTime": "",
"ScheduleTime": ""
},
"SystemParameters": {}
}
}
PROPERTY  TYPE  DESCRIPTION  EXAMPLE
level  String  Level of the diagnostic logs. Set this property to "Informational".  Informational
operationName  String  Name of the trigger with the final status indicating whether it successfully fired. "MyTrigger - Succeeded" if the heartbeat was successful.  MyTrigger - Succeeded
Metrics
Azure Monitor enables you to consume telemetry to gain visibility into the performance and health of your workloads on Azure. The most important type of Azure telemetry
data is the metrics (also called performance counters) emitted by most Azure resources. Azure Monitor provides several ways to configure and consume these metrics for
monitoring and troubleshooting.
ADFV2 emits the following metrics:
METRIC  METRIC DISPLAY NAME  UNIT  AGGREGATION TYPE  DESCRIPTION
PipelineSucceededRuns  Succeeded pipeline runs metrics  Count  Total  Total pipeline runs succeeded within a minute window
PipelineFailedRuns  Failed pipeline runs metrics  Count  Total  Total pipeline runs failed within a minute window
ActivitySucceededRuns  Succeeded activity runs metrics  Count  Total  Total activity runs succeeded within a minute window
ActivityFailedRuns  Failed activity runs metrics  Count  Total  Total activity runs failed within a minute window
TriggerSucceededRuns  Succeeded trigger runs metrics  Count  Total  Total trigger runs succeeded within a minute window
TriggerFailedRuns  Failed trigger runs metrics  Count  Total  Total trigger runs failed within a minute window
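These metrics can also be read programmatically. A minimal sketch with the Az.Monitor Get-AzMetric cmdlet, using a placeholder data factory resource ID:
# A sketch: read the failed pipeline run count for the last hour at one-minute granularity (placeholder ID)
$dataFactoryId = "/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/factories/<dataFactoryName>"
Get-AzMetric -ResourceId $dataFactoryId -MetricName "PipelineFailedRuns" `
    -StartTime (Get-Date).AddHours(-1) -EndTime (Get-Date) `
    -TimeGrain ([TimeSpan]::FromMinutes(1)) -AggregationType Total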
Alerts
Log in to the Azure portal and click Monitor -> Alerts to create alerts.
Create Alerts
1. Click + New Alert rule to create a new alert.
This article describes how to monitor a pipeline in a data factory by using different software development kits
(SDKs).
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Data range
Data Factory only stores pipeline run data for 45 days. When you query programmatically for data about Data
Factory pipeline runs - for example, with the PowerShell command Get-AzDataFactoryV2PipelineRun - there are no
maximum dates for the optional LastUpdatedAfter and LastUpdatedBefore parameters. But if you query for data
for the past year, for example, the query does not return an error, but only returns pipeline run data from the last
45 days.
If you want to persist pipeline run data for more than 45 days, set up your own diagnostic logging with Azure
Monitor.
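For example, a query that stays inside the 45-day window might look like the following sketch; the resource group and factory names are placeholders:
# A sketch: list pipeline runs updated within the 45-day retention window (placeholder names)
$runs = Get-AzDataFactoryV2PipelineRun -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" `
    -LastUpdatedAfter (Get-Date).AddDays(-45) -LastUpdatedBefore (Get-Date)
$runs | Select-Object PipelineName, RunId, Status, RunStart, RunEnd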
.NET
For a complete walkthrough of creating and monitoring a pipeline using .NET SDK, see Create a data factory and
pipeline using .NET.
1. Add the following code to continuously check the status of the pipeline run until it finishes copying the data.
2. Add the following code that retrieves copy activity run details, for example, the size of the data read/written.
// Check the copy activity run details
Console.WriteLine("Checking copy activity run details...");
For complete documentation on .NET SDK, see Data Factory .NET SDK reference.
Python
For a complete walkthrough of creating and monitoring a pipeline using Python SDK, see Create a data factory
and pipeline using Python.
To monitor the pipeline run, add the following code:
For complete documentation on Python SDK, see Data Factory Python SDK reference.
REST API
For a complete walkthrough of creating and monitoring a pipeline using REST API, see Create a data factory and
pipeline using REST API.
1. Run the following script to continuously check the pipeline run status until it finishes copying the data.
$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelineruns/${runId}?api-version=${apiVersion}"
while ($True) {
    $response = Invoke-RestMethod -Method GET -Uri $request -Header $authHeader
    Write-Host "Pipeline run status: " $response.Status -foregroundcolor "Yellow"
    # Exit the loop once the pipeline run is no longer in progress
    if ($response.Status -ne "InProgress") { break }
    Start-Sleep -Seconds 30
}
2. Run the following script to retrieve copy activity run details, for example, size of the data read/written.
$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelineruns/${runId}/activityruns?api-version=${apiVersion}&startTime="+(Get-Date).ToString('yyyy-MM-dd')+"&endTime="+(Get-Date).AddDays(1).ToString('yyyy-MM-dd')+"&pipelineName=Adfv2QuickStartPipeline"
$response = Invoke-RestMethod -Method GET -Uri $request -Header $authHeader
$response | ConvertTo-Json
For complete documentation on REST API, see Data Factory REST API reference.
PowerShell
For a complete walkthrough of creating and monitoring a pipeline using PowerShell, see Create a data factory and
pipeline using PowerShell.
1. Run the following script to continuously check the pipeline run status until it finishes copying the data.
while ($True) {
$run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $DataFactoryName -PipelineRunId $runId
if ($run) {
if ($run.Status -ne 'InProgress') {
Write-Host "Pipeline run finished. The status is: " $run.Status -foregroundcolor "Yellow"
$run
break
}
Write-Host "Pipeline is running...status: InProgress" -foregroundcolor "Yellow"
}
Start-Sleep -Seconds 30
}
2. Run the following script to retrieve copy activity run details, for example, size of the data read/written.
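A minimal sketch of such a query, using the Get-AzDataFactoryV2ActivityRun cmdlet and the variables assumed from the previous step:
# A sketch: retrieve activity run details (such as the size of the data read/written) for the pipeline run
$result = Get-AzDataFactoryV2ActivityRun -ResourceGroupName $resourceGroupName -DataFactoryName $DataFactoryName `
    -PipelineRunId $runId -RunStartedAfter (Get-Date).AddDays(-1) -RunStartedBefore (Get-Date).AddDays(1)
$result | Format-List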
For complete documentation on PowerShell cmdlets, see Data Factory PowerShell cmdlet reference.
Next steps
See Monitor pipelines using Azure Monitor article to learn about using Azure Monitor to monitor Data Factory
pipelines.
Monitor an integration runtime in Azure Data Factory
3/7/2019 • 9 minutes to read • Edit Online
Integration runtime is the compute infrastructure used by Azure Data Factory to provide various data integration
capabilities across different network environments. There are three types of integration runtimes offered by Data
Factory:
Azure integration runtime
Self-hosted integration runtime
Azure-SSIS integration runtime
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
To get the status of an instance of the integration runtime (IR), run the Get-AzDataFactoryV2IntegrationRuntime PowerShell cmdlet:
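A minimal sketch of that command, with placeholder resource group, factory, and integration runtime names:
# A sketch: get the detailed status of an integration runtime (placeholder names)
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" `
    -Name "<integrationRuntimeName>" -Status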
The cmdlet returns different information for different types of integration runtime. This article explains the
properties and statuses for each type of integration runtime.
PROPERTY DESCRIPTION
DataFactoryName Name of the data factory that the Azure integration runtime
belongs to.
ResourceGroupName Name of the resource group that the data factory belongs to.
Status
The following table provides possible statuses of an Azure integration runtime:
STATUS COMMENTS/SCENARIOS
NOTE
The returned properties and status contain information about overall self-hosted integration runtime and each node in the
runtime.
Properties
The following table provides descriptions of monitoring Properties for each node:
PROPERTY DESCRIPTION
Concurrent Jobs (Running/ Limit) Running. Number of jobs or tasks running on each node. This
value is a near real-time snapshot.
Some settings of the properties make more sense when there are two or more nodes in the self-hosted integration
runtime (that is, in a scale out scenario).
Concurrent jobs limit
The default value of the concurrent jobs limit is set based on the machine size. The factors used to calculate this
value depend on the amount of RAM and the number of CPU cores of the machine. So the more cores and the
more memory, the higher the default limit of concurrent jobs.
You scale out by increasing the number of nodes. When you increase the number of nodes, the concurrent jobs
limit is the sum of the concurrent job limit values of all the available nodes. For example, if one node lets you run a
maximum of twelve concurrent jobs, then adding three more similar nodes lets you run a maximum of 48
concurrent jobs (that is, 4 x 12). We recommend that you increase the concurrent jobs limit only when you see low
resource usage with the default values on each node.
You can override the calculated default value in the Azure portal. Select Author > Connections > Integration
Runtimes > Edit > Nodes > Modify concurrent job value per node. You can also use the
Update-AzDataFactoryV2IntegrationRuntimeNode PowerShell cmdlet, as sketched below.
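A minimal sketch of overriding the limit for one node with that cmdlet; all names are placeholders:
# A sketch: raise the concurrent jobs limit on a specific self-hosted IR node (placeholder names)
Update-AzDataFactoryV2IntegrationRuntimeNode -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" `
    -IntegrationRuntimeName "<selfHostedIrName>" -Name "<nodeName>" -ConcurrentJobsLimit 16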
Status (per node)
The following table provides possible statuses of a self-hosted integration runtime node:
STATUS DESCRIPTION
Use the Get-AzDataFactoryV2IntegrationRuntimeMetric cmdlet to fetch the JSON payload containing the
detailed self-hosted integration runtime properties, and their snapshot values during the time of execution of the
cmdlet.
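For example (placeholder names):
# A sketch: fetch node-level metrics for a self-hosted IR and show them as JSON (placeholder names)
Get-AzDataFactoryV2IntegrationRuntimeMetric -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" `
    -Name "<selfHostedIrName>" | ConvertTo-Json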
Sample output (assumes that there are two nodes associated with this self-hosted integration runtime):
{
"IntegrationRuntimeName": "<Name of your integration runtime>",
"ResourceGroupName": "<Resource Group Name>",
"DataFactoryName": "<Data Factory Name>",
"Nodes": [
{
"NodeName": "<Node Name>",
"AvailableMemoryInMB": <Value>,
"CpuUtilization": <Value>,
"ConcurrentJobsLimit": <Value>,
"ConcurrentJobsRunning": <Value>,
"MaxConcurrentJobs": <Value>,
"SentBytes": <Value>,
"ReceivedBytes": <Value>
},
{
"NodeName": "<Node Name>",
"AvailableMemoryInMB": <Value>,
"CpuUtilization": <Value>,
"ConcurrentJobsLimit": <Value>,
"ConcurrentJobsRunning": <Value>,
"MaxConcurrentJobs": <Value>,
"SentBytes": <Value>,
"ReceivedBytes": <Value>
}
]
}
CreateTime The UTC time when your Azure-SSIS integration runtime was
created.
CatalogPricingTier The pricing tier for SSISDB hosted by your existing Azure SQL
Database server. Not applicable to Azure SQL Database
Managed Instance hosting SSISDB.
ResourceGroupName The name of your Azure Resource Group, in which your data
factory and Azure-SSIS integration runtime were created.
Next steps
See the following articles for monitoring pipelines in different ways:
Quickstart: create a data factory.
Use Azure Monitor to monitor Data Factory pipelines
Reconfigure the Azure-SSIS integration runtime
3/5/2019 • 3 minutes to read • Edit Online
This article describes how to reconfigure an existing Azure-SSIS integration runtime. To create an Azure-SSIS
integration runtime (IR ) in Azure Data Factory, see Create an Azure-SSIS integration runtime.
Data Factory UI
You can use Data Factory UI to stop, edit/reconfigure, or delete an Azure-SSIS IR.
1. In the Data Factory UI, switch to the Edit tab. To launch Data Factory UI, click Author & Monitor on the
home page of your data factory.
2. In the left pane, click Connections.
3. In the right pane, switch to the Integration Runtimes tab.
4. You can use buttons in the Actions column to stop, edit, or delete the integration runtime. The Code
button in the Actions column lets you view the JSON definition associated with the integration runtime.
5. Edit/reconfigure the IR by clicking the Edit button in the Actions column. In the Integration Runtime Setup
window, change the settings (for example, the size of the node, the number of nodes, or the maximum parallel executions
per node).
6. To restart the IR, click the Start button in the Actions column.
Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.
After you provision and start an instance of Azure-SSIS integration runtime, you can reconfigure it by running a
sequence of Stop - Set - Start PowerShell cmdlets consecutively. For example, the following PowerShell
script changes the number of nodes allocated for the Azure-SSIS integration runtime instance to five.
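A minimal sketch of that Stop - Set - Start sequence; the variable names ($ResourceGroupName, $DataFactoryName, $AzureSSISName) are assumed placeholders matching those used in the provisioning scripts later in this document:
# A sketch: stop the Azure-SSIS IR, change its node count to five, and start it again (placeholder variables)
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $AzureSSISName -Force
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $AzureSSISName -NodeCount 5
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $AzureSSISName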
Reconfigure an Azure-SSIS IR
1. First, stop the Azure-SSIS integration runtime by using the Stop-AzDataFactoryV2IntegrationRuntime
cmdlet. This command releases all of its nodes and stops billing.
2. Next, reconfigure the integration runtime by using the Set-AzDataFactoryV2IntegrationRuntime cmdlet (for example, to change the node count).
3. Finally, start the integration runtime again by using the Start-AzDataFactoryV2IntegrationRuntime cmdlet.
Delete an Azure-SSIS IR
1. First, stop all existing Azure-SSIS IRs in your data factory.
2. Next, remove all existing Azure-SSIS IRs in your data factory one by one.
3. If you had created a new resource group, remove the resource group.
Next steps
For more information about Azure-SSIS runtime, see the following topics:
Azure-SSIS Integration Runtime. This article provides conceptual information about integration runtimes in
general including the Azure-SSIS IR.
Tutorial: deploy SSIS packages to Azure. This article provides step-by-step instructions to create an Azure-
SSIS IR and uses an Azure SQL database to host the SSIS catalog.
How to: Create an Azure-SSIS integration runtime. This article expands on the tutorial and provides
instructions on using Azure SQL Database Managed Instance and joining the IR to a virtual network.
Join an Azure-SSIS IR to a virtual network. This article provides conceptual information about joining an
Azure-SSIS IR to an Azure virtual network. It also provides steps to use Azure portal to configure virtual
network so that Azure-SSIS IR can join the virtual network.
Monitor an Azure-SSIS IR. This article shows you how to retrieve information about an Azure-SSIS IR and
descriptions of statuses in the returned information.
Copy or clone a data factory in Azure Data Factory
3/7/2019 • 2 minutes to read • Edit Online
This article describes how to copy or clone a data factory in Azure Data Factory.
Next steps
Review the guidance for creating a data factory in the Azure portal in Create a data factory by using the Azure Data
Factory UI.
How to create and configure Azure Integration
Runtime
3/7/2019 • 2 minutes to read • Edit Online
The Integration Runtime (IR ) is the compute infrastructure used by Azure Data Factory to provide data integration
capabilities across different network environments. For more information about IR, see Integration runtime.
Azure IR provides a fully managed compute to natively perform data movement and dispatch data transformation
activities to compute services like HDInsight. It is hosted in the Azure environment and supports connecting to
resources in a public network environment with publicly accessible endpoints.
This document describes how to create and configure an Azure Integration Runtime.
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Default Azure IR
By default, each data factory has an Azure IR in the back end that supports operations on cloud data stores and
compute services in public networks. The location of that Azure IR is auto-resolve. If the connectVia property is not
specified in the linked service definition, the default Azure IR is used. You only need to explicitly create an Azure IR
when you would like to explicitly define the location of the IR, or if you would like to virtually group the activity
executions on different IRs for management purposes.
Create Azure IR
An integration runtime can be created by using the Set-AzDataFactoryV2IntegrationRuntime PowerShell cmdlet. To
create an Azure IR, you specify the name, location, and type to the command. Here is a sample command to create
an Azure IR with the location set to "West Europe":
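A minimal sketch of such a command; the resource group and factory names are placeholders, and the IR name matches the "MySampleAzureIR" reference used in the linked service sample later in this section:
# A sketch: create an Azure IR in West Europe (placeholder resource group and factory names)
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" `
    -Name "MySampleAzureIR" -Type Managed -Location "West Europe"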
For Azure IR, the type must be set to Managed. You do not need to specify compute details because it is fully
managed elastically in cloud. Specify compute details like node size and node count when you would like to create
Azure-SSIS IR. For more information, see Create and Configure Azure-SSIS IR.
You can configure an existing Azure IR to change its location using the Set-AzDataFactoryV2IntegrationRuntime
PowerShell cmdlet. For more information about the location of an Azure IR, see Introduction to integration
runtime.
Use Azure IR
Once an Azure IR is created, you can reference it in your Linked Service definition. Below is a sample of how you
can reference the Azure Integration Runtime created above from an Azure Storage Linked Service:
{
"name": "MyStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=myaccountname;AccountKey=...",
"type": "SecureString"
}
},
"connectVia": {
"referenceName": "MySampleAzureIR",
"type": "IntegrationRuntimeReference"
}
}
}
Next steps
See the following articles on how to create other types of integration runtimes:
Create self-hosted integration runtime
Create Azure-SSIS integration runtime
Create and configure a self-hosted integration
runtime
5/21/2019 • 19 minutes to read • Edit Online
The integration runtime (IR ) is the compute infrastructure that Azure Data Factory uses to provide data-
integration capabilities across different network environments. For details about IR, see Integration runtime
overview.
A self-hosted integration runtime can run copy activities between a cloud data store and a data store in a
private network, and it can dispatch transform activities against compute resources in an on-premises
network or an Azure virtual network. The installation of a self-hosted integration runtime needs to be on an
on-premises machine or a virtual machine (VM) inside a private network.
This document describes how you can create and configure a self-hosted IR.
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module,
which will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and
AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions,
see Install Azure PowerShell.
1. The data developer creates a self-hosted integration runtime within an Azure data factory by using a
PowerShell cmdlet (see the sketch after this list). Currently, the Azure portal does not support this feature.
2. The data developer creates a linked service for an on-premises data store by specifying the self-hosted
integration runtime instance that it should use to connect to data stores.
3. The self-hosted integration runtime node encrypts the credentials by using Windows Data Protection
Application Programming Interface (DPAPI) and saves the credentials locally. If multiple nodes are set for
high availability, the credentials are further synchronized across other nodes. Each node encrypts the
credentials by using DPAPI and stores them locally. Credential synchronization is transparent to the data
developer and is handled by the self-hosted IR.
4. The Data Factory service communicates with the self-hosted integration runtime for scheduling and
management of jobs via a control channel that uses a shared Azure Service Bus Relay. When an activity
job needs to be run, Data Factory queues the request along with any credential information (in case
credentials are not already stored on the self-hosted integration runtime). The self-hosted integration
runtime kicks off the job after polling the queue.
5. The self-hosted integration runtime copies data from an on-premises store to a cloud storage, or vice
versa depending on how the copy activity is configured in the data pipeline. For this step, the self-hosted
integration runtime directly communicates with cloud-based storage services such as Azure Blob storage
over a secure (HTTPS ) channel.
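As mentioned in step 1, the self-hosted IR itself is created with a PowerShell cmdlet. A minimal sketch, including retrieval of the authentication key used later during node registration (all names are placeholders):
# A sketch: create a self-hosted IR and fetch its authentication keys (placeholder names)
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" `
    -Name "<selfHostedIrName>" -Type SelfHosted -Description "self-hosted IR"
Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" `
    -Name "<selfHostedIrName>"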
Considerations for using a self-hosted IR
A single self-hosted integration runtime can be used for multiple on-premises data sources. A single self-
hosted integration runtime can be shared with another data factory within the same Azure Active
Directory tenant. For more information, see Sharing a self-hosted integration runtime.
You can have only one instance of a self-hosted integration runtime installed on a single machine. If you
have two data factories that need to access on-premises data sources, either use the self-hosted IR
sharing feature to share the self-hosted integration runtime, or install the self-hosted integration runtime
on two on-premises computers, one for each data factory.
The self-hosted integration runtime does not need to be on the same machine as the data source.
However, having the self-hosted integration runtime closer to the data source reduces the time for the
self-hosted integration runtime to connect to the data source. We recommend that you install the self-
hosted integration runtime on a machine that is different from the one that hosts on-premises data
source. When the self-hosted integration runtime and data source are on different machines, the self-
hosted integration runtime does not compete for resources with the data source.
You can have multiple self-hosted integration runtimes on different machines that connect to the same
on-premises data source. For example, you might have two self-hosted integration runtimes that serve
two data factories, but the same on-premises data source is registered with both the data factories.
If you already have a gateway installed on your computer to serve a Power BI scenario, install a separate
self-hosted integration runtime for Azure Data Factory on another machine.
The self-hosted integration runtime must be used for supporting data integration within an Azure virtual
network.
Treat your data source as an on-premises data source that is behind a firewall, even when you use Azure
ExpressRoute. Use the self-hosted integration runtime to establish connectivity between the service and
the data source.
You must use the self-hosted integration runtime even if the data store is in the cloud on an Azure IaaS
virtual machine.
Tasks might fail in a self-hosted integration runtime that's installed on a Windows server on which FIPS-
compliant encryption is enabled. To work around this problem, disable FIPS-compliant encryption on the
server. To disable FIPS-compliant encryption, change the following registry value from 1 (enabled) to 0
(disabled): HKLM\System\CurrentControlSet\Control\Lsa\FIPSAlgorithmPolicy\Enabled (see the sketch after this list).
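A minimal sketch of that registry change in PowerShell, run in an elevated session; the path is the one given above:
# A sketch: disable FIPS-compliant encryption (requires an elevated PowerShell session)
Set-ItemProperty -Path "HKLM:\System\CurrentControlSet\Control\Lsa\FIPSAlgorithmPolicy" -Name "Enabled" -Value 0
# A restart of the machine, or at least of the integration runtime Host Service, may be needed for the change to take effect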
Prerequisites
The supported operating system versions are Windows 7 Service Pack 1, Windows 8.1, Windows 10,
Windows Server 2008 R2 SP1, Windows Server 2012, Windows Server 2012 R2, and Windows Server
2016. Installation of the self-hosted integration runtime on a domain controller is not supported.
.NET Framework 4.6.1 or later is required. If you're installing the self-hosted integration runtime on a
Windows 7 machine, install .NET Framework 4.6.1 or later. See .NET Framework System Requirements
for details.
The recommended configuration for the self-hosted integration runtime machine is at least 2 GHz, four
cores, 8 GB of RAM, and an 80-GB disk.
If the host machine hibernates, the self-hosted integration runtime does not respond to data requests.
Configure an appropriate power plan on the computer before you install the self-hosted integration
runtime. If the machine is configured to hibernate, the self-hosted integration runtime installation
prompts you with a message.
You must be an administrator on the machine to install and configure the self-hosted integration runtime
successfully.
Copy activity runs happen on a specific frequency. Resource usage (CPU, memory) on the machine
follows the same pattern with peak and idle times. Resource utilization also depends heavily on the
amount of data being moved. When multiple copy jobs are in progress, you see resource usage go up
during peak times.
10. On the Register Integration Runtime (Self-hosted) page of Microsoft Integration Runtime
Configuration Manager running on your machine, take the following steps:
a. Paste the authentication key in the text area.
b. Optionally, select Show authentication key to see the key text.
c. Select Register.
NOTE
You don't need to create a new self-hosted integration runtime to associate each node. You can install the self-hosted
integration runtime on another machine and register it by using the same authentication key.
NOTE
Before you add another node for high availability and scalability, ensure that the Remote access to intranet option is
enabled on the first node (Microsoft Integration Runtime Configuration Manager > Settings > Remote access
to intranet).
Scale considerations
Scale out
When the available memory on the self-hosted IR is low and the CPU usage is high, adding a new node
helps scale out the load across machines. If activities are failing because they're timing out or because the
self-hosted IR node is offline, it helps if you add a node to the gateway.
Scale up
When the available memory and CPU are not utilized well, but the execution of concurrent jobs is reaching
the limit, you should scale up by increasing the number of concurrent jobs that can run on a node. You might
also want to scale up when activities are timing out because the self-hosted IR is overloaded. As shown in
the following image, you can increase the maximum capacity for a node:
NOTE
This certificate is used to encrypt ports on self-hosted IR node, used for node-to-node communication (for state
synchronization which includes linked services' credentials synchronization across nodes) and while using PowerShell
cmdlet for linked service credential setting from within local network. We suggest using this certificate if your
private network environment is not secure or if you would like to secure the communication between nodes within
your private network as well. Data movement in transit from the self-hosted IR to other data stores always happens over an
encrypted channel, regardless of whether this certificate is set.
Terminology
Shared IR: The original self-hosted IR that's running on a physical infrastructure.
Linked IR: The IR that references another shared IR. This is a logical IR and uses the infrastructure of
another self-hosted IR (shared).
High-level steps for creating a linked self-hosted IR
1. In the self-hosted IR to be shared, grant permission to the data factory in which you want to create
the linked IR.
2. Note the resource ID of the self-hosted IR to be shared.
3. In the data factory to which the permissions were granted, create a new self-hosted IR (linked) and
enter the resource ID.
Monitoring
Shared IR
Linked IR
NOTE
This feature is available only in Azure Data Factory V2.
At the corporate firewall level, you need to configure the following domains and outbound ports:
At the Windows firewall level (machine level), these outbound ports are normally enabled. If not, you can
configure the domains and ports accordingly on a self-hosted integration runtime machine.
NOTE
Based on your source and sinks, you might have to whitelist additional domains and outbound ports in your
corporate firewall or Windows firewall.
For some cloud databases (for example, Azure SQL Database and Azure Data Lake), you might need to whitelist IP
addresses of self-hosted integration runtime machines on their firewall configuration.
NOTE
If your firewall does not allow outbound port 1433, the self-hosted integration runtime can't access the Azure SQL
database directly. In this case, you can use a staged copy to Azure SQL Database and Azure SQL Data Warehouse. In
this scenario, you would require only HTTPS (port 443) for the data movement.
NOTE
If you set up a proxy server with NTLM authentication, the integration runtime Host Service runs under the domain
account. If you change the password for the domain account later, remember to update the configuration settings for
the service and restart it accordingly. Due to this requirement, we suggest that you use a dedicated domain account
to access the proxy server that does not require you to update the password frequently.
You can then add proxy server details as shown in the following example:
<system.net>
<defaultProxy enabled="true">
<proxy bypassonlocal="true" proxyaddress="http://proxy.domain.org:8888/" />
</defaultProxy>
</system.net>
Additional properties are allowed inside the proxy tag to specify the required settings like
scriptLocation . See proxy Element ( Network Settings) for syntax.
3. Save the configuration file in the original location. Then restart the self-hosted integration runtime
Host Service, which picks up the changes.
To restart the service, use the services applet from the control panel. Or from Integration Runtime
Configuration Manager, select the Stop Service button, and then select Start Service.
If the service does not start, it's likely that an incorrect XML tag syntax was added in the application
configuration file that was edited.
IMPORTANT
Don't forget to update both diahost.exe.config and diawp.exe.config.
You also need to make sure that Microsoft Azure is in your company’s whitelist. You can download the list of
valid Microsoft Azure IP addresses from the Microsoft Download Center.
Possible symptoms for firewall and proxy server-related issues
If you encounter errors similar to the following ones, it's likely due to improper configuration of the firewall
or proxy server, which blocks the self-hosted integration runtime from connecting to Data Factory to
authenticate itself. To ensure that your firewall and proxy server are properly configured, refer to the
previous section.
When you try to register the self-hosted integration runtime, you receive the following error: "Failed
to register this Integration Runtime node! Confirm that the Authentication key is valid and the
integration service Host Service is running on this machine."
When you open Integration Runtime Configuration Manager, you see a status of Disconnected or
Connecting. When you're viewing Windows event logs, under Event Viewer > Application and
Services Logs > Microsoft Integration Runtime, you see error messages like this one:
If you choose not to open port 8060 on the self-hosted integration runtime machine, use mechanisms other
than the Setting Credentials application to configure data store credentials. For example, you can use the
New-AzDataFactoryV2LinkedServiceEncryptCredential PowerShell cmdlet.
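A minimal sketch of such a call; the parameter set shown here is an assumption, and the file path and names are placeholders:
# A sketch (assumed parameter set): encrypt credentials in a linked service definition file against a self-hosted IR
New-AzDataFactoryV2LinkedServiceEncryptCredential -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" `
    -IntegrationRuntimeName "<selfHostedIrName>" -DefinitionFile ".\onPremLinkedService.json"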
Next steps
See the following tutorial for step-by-step instructions: Tutorial: Copy on-premises data to cloud.
Create Azure-SSIS Integration Runtime in Azure
Data Factory
4/9/2019 • 23 minutes to read • Edit Online
This article provides steps for provisioning Azure-SSIS Integration Runtime (IR ) in Azure Data Factory (ADF ).
Then, you can use SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS ) to deploy and
run SQL Server Integration Services (SSIS ) packages on this integration runtime in Azure.
The Tutorial: Deploy SSIS packages to Azure shows you how to create Azure-SSIS IR by using Azure SQL
Database server to host SSIS catalog database (SSISDB ). This article expands on the tutorial and shows you
how to do the following things:
Optionally use Azure SQL Database server with virtual network service endpoints/Managed Instance to
host SSISDB. For guidance in choosing the type of database server to host SSISDB, see Compare Azure
SQL Database single databases/elastic pools and Managed Instance. As a prerequisite, you need to join
your Azure-SSIS IR to a virtual network and configure virtual network permissions/settings as necessary.
See Join Azure-SSIS IR to a virtual network.
Optionally use Azure Active Directory (AAD ) authentication with the managed identity for your ADF to
connect to the database server. As a prerequisite, you will need to add the managed identity for your ADF
as a contained database user capable of creating SSISDB in your Azure SQL Database server/Managed
Instance, see Enable AAD authentication for Azure-SSIS IR.
Overview
This article shows different ways of provisioning Azure-SSIS IR:
Azure portal
Azure PowerShell
Azure Resource Manager template
When you create Azure-SSIS IR, ADF service connects to your Azure SQL Database server/Managed Instance
to prepare SSISDB. It also configures permissions/settings for your virtual network, if specified, and joins your
Azure-SSIS IR to the virtual network.
When you provision Azure-SSIS IR, Azure Feature Pack for SSIS and Access Redistributable are also installed.
These components provide connectivity to Excel/Access files and various Azure data sources, in addition to the
data sources supported by built-in components. You can also install additional components. For more info, see
Custom setup for the Azure-SSIS integration runtime.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which
will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.
Azure subscription. If you do not already have a subscription, you can create a free trial account.
Azure SQL Database server or Managed Instance. If you do not already have a database server, you
can create one in Azure portal before you get started. This server will host SSISDB. We recommend that
you create the database server in the same Azure region as your integration runtime. This configuration
lets your integration runtime write execution logs to SSISDB without crossing Azure regions. Based on
the selected database server, SSISDB can be created on your behalf as a single database, part of an
elastic pool, or in your Managed Instance and accessible in public network or by joining a virtual network.
For a list of supported pricing tiers for Azure SQL Database, see SQL Database resource limits.
Make sure that your Azure SQL Database server/Managed Instance does not already have an SSISDB.
The provisioning of Azure-SSIS IR does not support using an existing SSISDB.
Azure Resource Manager virtual network (optional). You must have an Azure Resource Manager
virtual network if at least one of the following conditions is true:
You are hosting SSISDB in Azure SQL Database server with virtual network service endpoints or in
Managed Instance that is inside a virtual network.
You want to connect to on-premises data stores from SSIS packages running on your Azure-SSIS IR.
Azure PowerShell. Follow the instructions on How to install and configure Azure PowerShell, if you
want to run a PowerShell script to provision Azure-SSIS IR.
Region support
For a list of Azure regions, in which ADF and Azure-SSIS IR are currently available, see ADF + SSIS IR
availability by region.
Compare SQL Database single database/elastic pool and SQL Database Managed Instance
The following table compares certain features of an Azure SQL Database server and a Managed Instance as they relate to the Azure-SSIS IR:
Scheduling
Single database/elastic pool: SQL Server Agent is not available.
Managed Instance: Managed Instance Agent is available.
Authentication
Single database/elastic pool: You can create SSISDB with a contained database user representing any AAD group with the managed identity of your ADF as a member in the db_owner role. See Enable Azure AD authentication to create SSISDB in Azure SQL Database server.
Managed Instance: You can create SSISDB with a contained database user representing the managed identity of your ADF. See Enable Azure AD authentication to create SSISDB in Azure SQL Database Managed Instance.
Service tier
Single database/elastic pool: When you create an Azure-SSIS IR with your Azure SQL Database server, you can select the service tier for SSISDB. There are multiple service tiers.
Managed Instance: When you create an Azure-SSIS IR with your Managed Instance, you cannot select the service tier for SSISDB. All databases in your Managed Instance share the same resources allocated to that instance.
Virtual network
Single database/elastic pool: Supports only Azure Resource Manager virtual networks for your Azure-SSIS IR to join if you use an Azure SQL Database server with virtual network service endpoints or require access to on-premises data stores.
Managed Instance: Supports only Azure Resource Manager virtual networks for your Azure-SSIS IR to join. The virtual network is always required. If you join your Azure-SSIS IR to the same virtual network as your Managed Instance, make sure that your Azure-SSIS IR is in a different subnet than your Managed Instance. If you join your Azure-SSIS IR to a different virtual network than your Managed Instance, we recommend either a virtual network peering or a virtual network to virtual network connection. See Connect your application to Azure SQL Database Managed Instance.
Azure portal
In this section, you use Azure portal, specifically ADF User Interface (UI)/app, to create Azure-SSIS IR.
Create a data factory
1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only
in Microsoft Edge and Google Chrome web browsers.
2. Sign in to the Azure portal.
3. Click New on the left menu, click Data + Analytics, and click Data Factory.
4. In the New data factory page, enter MyAzureSsisDataFactory for the name.
The name of the Azure data factory must be globally unique. If you receive the following error, change
the name of the data factory (for example, yournameMyAzureSsisDataFactory) and try creating again.
See Data Factory - Naming Rules article for naming rules for Data Factory artifacts.
Data factory name “MyAzureSsisDataFactory” is not available
5. Select your Azure subscription in which you want to create the data factory.
6. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. Select V2 for the version.
8. Select the location for the data factory. Only locations that are supported for creation of data factories
are shown in the list.
9. Select Pin to dashboard.
10. Click Create.
11. On the dashboard, you see the following tile with status: Deploying data factory.
12. After the creation is complete, you see the Data Factory page as shown in the image.
13. Click Author & Monitor to launch the Data Factory User Interface (UI) in a separate tab.
Provision an Azure SSIS integration runtime
1. In the get started page, click Configure SSIS Integration Runtime tile.
2. On the General Settings page of Integration Runtime Setup, complete the following steps:
a. For Name, enter the name of your integration runtime.
b. For Description, enter the description of your integration runtime.
c. For Location, select the location of your integration runtime. Only supported locations are displayed.
We recommend that you select the same location of your database server to host SSISDB.
d. For Node Size, select the size of the node in your integration runtime cluster. Only supported node sizes
are displayed. Select a large node size (scale up) if you want to run many compute/memory-intensive
packages.
e. For Node Number, select the number of nodes in your integration runtime cluster. Only supported
node numbers are displayed. Select a large cluster with many nodes (scale out), if you want to run many
packages in parallel.
f. For Edition/License, select SQL Server edition/license for your integration runtime: Standard or
Enterprise. Select Enterprise, if you want to use advanced/premium features on your integration runtime.
g. For Save Money, select Azure Hybrid Benefit (AHB ) option for your integration runtime: Yes or No.
Select Yes, if you want to bring your own SQL Server license with Software Assurance to benefit from
cost savings with hybrid use.
h. Click Next.
3. On the SQL Settings page, complete the following steps:
a. For Subscription, select the Azure subscription that has your database server to host SSISDB.
b. For Location, select the location of your database server to host SSISDB. We recommend that you
select the same location of your integration runtime.
c. For Catalog Database Server Endpoint, select the endpoint of your database server to host SSISDB.
Based on the selected database server, SSISDB can be created on your behalf as a single database, part
of an elastic pool, or in a Managed Instance and accessible in public network or by joining a virtual
network.
d. On Use AAD authentication... checkbox, select the authentication method for your database server
to host SSISDB: SQL or Azure Active Directory (AAD ) with the managed identity for your Azure Data
Factory. If you check it, you need to add the managed identity for your ADF into an AAD group with
access permissions to the database server, see Enable AAD authentication for Azure-SSIS IR.
e. For Admin Username, enter SQL authentication username for your database server to host SSISDB.
f. For Admin Password, enter SQL authentication password for your database server to host SSISDB.
g. For Catalog Database Service Tier, select the service tier for your database server to host SSISDB:
Basic/Standard/Premium tier or elastic pool name.
h. Click Test Connection and if successful, click Next.
4. On the Advanced Settings page, complete the following steps:
a. For Maximum Parallel Executions Per Node, select the maximum number of packages to execute
concurrently per node in your integration runtime cluster. Only supported package numbers are
displayed. Select a low number, if you want to use more than one core to run a single large/heavy-weight
package that is compute/memory -intensive. Select a high number, if you want to run one or more
small/light-weight packages in a single core.
b. For Custom Setup Container SAS URI, optionally enter Shared Access Signature (SAS ) Uniform
Resource Identifier (URI) of your Azure Storage Blob container where your setup script and its associated
files are stored, see Custom setup for Azure-SSIS IR.
5. On Select a virtual network... checkbox, select whether you want to join your integration runtime to a
virtual network. Check it if you use Azure SQL Database with virtual network service
endpoints/Managed Instance to host SSISDB or require access to on-premises data; that is, you have on-
premises data sources/destinations in your SSIS packages, see Join Azure-SSIS IR to a virtual network. If
you check it, complete the following steps:
a. For Subscription, select the Azure subscription that has your virtual network.
b. For Location, the same location of your integration runtime is selected.
c. For Type, select the type of your virtual network: Classic or Azure Resource Manager. We recommend
that you select Azure Resource Manager virtual network, since Classic virtual network will be deprecated
soon.
d. For VNet Name, select the name of your virtual network. This virtual network should be the same
virtual network used for the Azure SQL Database with virtual network service endpoints/Managed Instance
to host SSISDB and/or the one connected to your on-premises network.
e. For Subnet Name, select the name of subnet for your virtual network. This should be a different
subnet than the one used for Managed Instance to host SSISDB.
6. Click VNet Validation and if successful, click Finish to start the creation of your Azure-SSIS integration
runtime.
IMPORTANT
This process takes approximately 20 to 30 minutes to complete
The Data Factory service connects to your Azure SQL Database to prepare the SSIS Catalog database (SSISDB).
It also configures permissions and settings for your virtual network, if specified, and joins the new instance of
Azure-SSIS integration runtime to the virtual network.
7. In the Connections window, switch to Integration Runtimes if needed. Click Refresh to refresh the
status.
8. Use the links under Actions column to stop/start, edit, or delete the integration runtime. Use the last link
to view JSON code for the integration runtime. The edit and delete buttons are enabled only when the IR
is stopped.
5. See the Provision an Azure SSIS integration runtime section for the remaining steps to set up an Azure-
SSIS IR.
Azure PowerShell
In this section, you use the Azure PowerShell to create an Azure-SSIS IR.
Create variables
Define variables for use in the script in this tutorial:
### Azure Data Factory information
# If your input contains a PSH special character, e.g. "$", precede it with the escape character "`" like "`$"
$SubscriptionName = "[your Azure subscription name]"
$ResourceGroupName = "[your Azure resource group name]"
$DataFactoryName = "[your data factory name]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$DataFactoryLocation = "EastUS"
### Azure-SSIS integration runtime information - This is a Data Factory compute resource for running SSIS
packages
$AzureSSISName = "[specify a name for your Azure-SSIS IR]"
$AzureSSISDescription = "[specify a description for your Azure-SSIS IR]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium
features on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your
own on-premises SQL Server license with Software Assurance to earn cost savings from Azure Hybrid Benefit
(AHB) option
# For a Standard_D1_v2 node, up to 4 parallel executions per node are supported, but for other nodes, up to
(2 x number of cores) are currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info
$SetupScriptContainerSasUri = "" # OPTIONAL to provide SAS URI of blob container where your custom setup
script and its associated files are stored
# Virtual network info: Classic or Azure Resource Manager
$VnetId = "[your virtual network resource ID or leave it empty]" # REQUIRED if you use Azure SQL Database
with virtual network service endpoints/Managed Instance/on-premises data, Azure Resource Manager virtual
network is recommended, Classic virtual network will be deprecated soon
$SubnetName = "[your subnet name or leave it empty]" # WARNING: Please use the same subnet as the one used
with your Azure SQL Database with virtual network service endpoints or a different subnet than the one used
for your Managed Instance
Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName
# Make sure to run this script against the subscription to which the virtual network belongs
if(![string]::IsNullOrEmpty($VnetId) -and ![string]::IsNullOrEmpty($SubnetName))
{
# Register to the Azure Batch resource provider
$BatchApplicationId = "ddbf3205-c6bd-46ae-8127-60eb93363864"
$BatchObjectId = (Get-AzADServicePrincipal -ServicePrincipalName $BatchApplicationId).Id
Register-AzResourceProvider -ProviderNamespace Microsoft.Batch
while(!(Get-AzResourceProvider -ProviderNamespace
"Microsoft.Batch").RegistrationState.Contains("Registered"))
{
Start-Sleep -s 10
}
if($VnetId -match "/providers/Microsoft.ClassicNetwork/")
{
# Assign the VM contributor role to Microsoft.Batch
New-AzRoleAssignment -ObjectId $BatchObjectId -RoleDefinitionName "Classic Virtual Machine
Contributor" -Scope $VnetId
}
}
2. To deploy the Azure Resource Manager template, run the New-AzResourceGroupDeployment command as
shown in the following example, where ADFTutorialResourceGroup is the name of your resource group
and ADFTutorialARM.json is the file that contains the JSON definition of your data factory and Azure-SSIS
IR.
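One possible form of this command (a minimal sketch, assuming the template file is in the current folder):
New-AzResourceGroupDeployment -ResourceGroupName "ADFTutorialResourceGroup" -TemplateFile ".\ADFTutorialARM.json"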
This command creates your data factory and Azure-SSIS IR in it, but it does not start the IR.
3. To start your Azure-SSIS IR, run the Start-AzDataFactoryV2IntegrationRuntime command:
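A minimal sketch of this command, using the variables defined earlier in this section:
# Starting the IR can take 20 to 30 minutes
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $AzureSSISName -Force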
Next steps
See the other Azure-SSIS IR topics in this documentation:
Azure-SSIS Integration Runtime. This article provides conceptual information about integration runtimes in
general including the Azure-SSIS IR.
Tutorial: deploy SSIS packages to Azure. This article provides step-by-step instructions to create an Azure-
SSIS IR and uses an Azure SQL database to host the SSIS catalog.
Monitor an Azure-SSIS IR. This article shows you how to retrieve information about an Azure-SSIS IR and
descriptions of statuses in the returned information.
Manage an Azure-SSIS IR. This article shows you how to stop, start, or remove an Azure-SSIS IR. It also
shows you how to scale out your Azure-SSIS IR by adding more nodes to the IR.
Join an Azure-SSIS IR to a virtual network. This article provides conceptual information about joining your
Azure-SSIS IR to an Azure virtual network. It also provides steps to use Azure portal to configure virtual
network so that Azure-SSIS IR can join the virtual network.
Create a shared self-hosted integration runtime in
Azure Data Factory with PowerShell
3/26/2019 • 4 minutes to read • Edit Online
This step-by-step guide shows you how to create a shared self-hosted integration runtime in Azure Data Factory
by using Azure PowerShell. Then you can use the shared self-hosted integration runtime in another data factory. In
this tutorial, you take the following steps:
1. Create a data factory.
2. Create a self-hosted integration runtime.
3. Share the self-hosted integration runtime with other data factories.
4. Create a linked integration runtime.
5. Revoke the sharing.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
Azure PowerShell. Follow the instructions in Install Azure PowerShell on Windows with PowerShellGet.
You use PowerShell to run a script to create a self-hosted integration runtime that can be shared with other
data factories.
NOTE
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on Products
available by region.
# Shared Self-hosted integration runtime information. This is a Data Factory compute resource for
running any activities
# Data factory name. Must be globally unique
$SharedDataFactoryName = "[Shared Data factory name]"
$SharedIntegrationRuntimeName = "[Shared Integration Runtime Name]"
$SharedIntegrationRuntimeDescription = "[Description for Shared Integration Runtime]"
# Linked integration runtime information. This is a Data Factory compute resource for running any
activities
# Data factory name. Must be globally unique
$LinkedDataFactoryName = "[Linked Data factory name]"
$LinkedIntegrationRuntimeName = "[Linked Integration Runtime Name]"
$LinkedIntegrationRuntimeDescription = "[Description for Linked Integration Runtime]"
3. Sign in and select a subscription. Add the following code to the script to sign in and select your Azure
subscription:
Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName
NOTE
This step is optional. If you already have a data factory, skip this step.
Create an Azure resource group by using the New-AzResourceGroup command. A resource group is a
logical container into which Azure resources are deployed and managed as a group. The following example
creates a resource group named myResourceGroup in the WestEurope location:
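A minimal sketch of the command that matches this example:
New-AzResourceGroup -Name "myResourceGroup" -Location "WestEurope"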
Get-AzDataFactoryV2IntegrationRuntimeKey `
-ResourceGroupName $ResourceGroupName `
-DataFactoryName $SharedDataFactoryName `
-Name $SharedIntegrationRuntimeName
The response contains the authentication key for this self-hosted integration runtime. You use this key when you
register the integration runtime node.
Install and register the self-hosted integration runtime
1. Download the self-hosted integration runtime installer from Azure Data Factory Integration Runtime.
2. Run the installer to install the self-hosted integration runtime on a local computer.
3. Register the new self-hosted integration runtime with the authentication key that you retrieved in a previous step.
NOTE
This step is optional. If you already have the data factory that you want to share with, skip this step.
Grant permission
Grant permission to the data factory that needs to access the self-hosted integration runtime you created and
registered.
IMPORTANT
Do not skip this step!
# $factory.Identity.PrincipalId is the managed identity (MSI) of the data factory with which the IR needs to be shared
# 'b24988ac-6180-42a0-ab88-20f7382dd24c' is the ID of the built-in Contributor role
New-AzRoleAssignment `
    -ObjectId $factory.Identity.PrincipalId `
    -RoleDefinitionId 'b24988ac-6180-42a0-ab88-20f7382dd24c' `
    -Scope $SharedIR.Id
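The linked integration runtime (step 4 in the list at the beginning of this article) can then be created in the other data factory by pointing it at the shared IR's resource ID. A minimal sketch, assuming the Set-AzDataFactoryV2IntegrationRuntime cmdlet's SharedIntegrationRuntimeResourceId parameter and the variables defined earlier:
# Create a linked self-hosted IR in the linked data factory that reuses the shared IR
Set-AzDataFactoryV2IntegrationRuntime `
    -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $LinkedDataFactoryName `
    -Name $LinkedIntegrationRuntimeName `
    -Type SelfHosted `
    -SharedIntegrationRuntimeResourceId $SharedIR.Id `
    -Description $LinkedIntegrationRuntimeDescription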
Now you can use this linked integration runtime in any linked service. The linked integration runtime uses the
shared integration runtime to run activities.
Remove-AzRoleAssignment `
-ObjectId $factory.Identity.PrincipalId `
-RoleDefinitionId 'b24988ac-6180-42a0-ab88-20f7382dd24c' `
-Scope $SharedIR.Id
To remove the existing linked integration runtime, run the following command against the shared integration
runtime:
Remove-AzDataFactoryV2IntegrationRuntime `
-ResourceGroupName $ResourceGroupName `
-DataFactoryName $SharedDataFactoryName `
-Name $SharedIntegrationRuntimeName `
-Links `
-LinkedDataFactoryName $LinkedDataFactoryName
Next steps
Review integration runtime concepts in Azure Data Factory.
Learn how to create a self-hosted integration runtime in the Azure portal.
Run an SSIS package with the Execute SSIS Package
activity in Azure Data Factory
3/20/2019 • 9 minutes to read • Edit Online
This article describes how to run an SSIS package in an Azure Data Factory (ADF) pipeline by using the Execute SSIS
Package activity.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Create an Azure-SSIS Integration Runtime (IR ) if you do not have one already by following the step-by-step
instructions in the Tutorial: Deploy SSIS packages to Azure.
2. In the Activities toolbox, expand General, then drag & drop an Execute SSIS Package activity to the
pipeline designer surface.
3. On the General tab for Execute SSIS Package activity, provide a name and description for the activity. Set
optional timeout and retry values.
4. On the Settings tab for Execute SSIS Package activity, select your Azure-SSIS IR that is associated with
SSISDB database where the package is deployed. If your package uses Windows authentication to access
data stores, e.g. SQL Servers/file shares on premises, Azure Files, etc., check the Windows authentication
checkbox and enter the domain/username/password for your package execution. If your package needs 32-
bit runtime to run, check the 32-Bit runtime checkbox. For Logging level, select a predefined scope of
logging for your package execution. Check the Customized checkbox, if you want to enter your customized
logging name instead. When your Azure-SSIS IR is running and the Manual entries checkbox is
unchecked, you can browse and select your existing folders/projects/packages/environments from SSISDB.
Click the Refresh button to fetch your newly added folders/projects/packages/environments from SSISDB,
so they are available for browsing and selection.
When your Azure-SSIS IR is not running or the Manual entries checkbox is checked, you can enter your
package and environment paths from SSISDB directly in the following formats:
<folder name>/<project name>/<package name>.dtsx and <folder name>/<environment name> .
5. On the SSIS Parameters tab for Execute SSIS Package activity, when your Azure-SSIS IR is running and
the Manual entries checkbox on Settings tab is unchecked, the existing SSIS parameters in your selected
project/package from SSISDB will be displayed for you to assign values to them. Otherwise, you can enter
them one by one to assign values to them manually – Please ensure that they exist and are correctly entered
for your package execution to succeed. You can add dynamic content to their values using expressions,
functions, ADF system variables, and ADF pipeline parameters/variables. Alternatively, you can use secrets
stored in your Azure Key Vault (AKV ) as their values. To do so, click on the AZURE KEY VAULT checkbox
next to the relevant parameter, select/edit your existing AKV linked service or create a new one, and then
select the secret name/version for your parameter value. When you create/edit your AKV linked service, you
can select/edit your existing AKV or create a new one, but please grant ADF managed identity access to
your AKV if you have not done so already. You can also enter your secrets directly in the following format:
<AKV linked service name>/<secret name>/<secret version> .
6. On the Connection Managers tab for Execute SSIS Package activity, when your Azure-SSIS IR is running
and the Manual entries checkbox on Settings tab is unchecked, the existing connection managers in your
selected project/package from SSISDB will be displayed for you to assign values to their properties.
Otherwise, you can enter them one by one to assign values to their properties manually – Please ensure
that they exist and are correctly entered for your package execution to succeed. You can add dynamic
content to their property values using expressions, functions, ADF system variables, and ADF pipeline
parameters/variables. Alternatively, you can use secrets stored in your Azure Key Vault (AKV ) as their
property values. To do so, click on the AZURE KEY VAULT checkbox next to the relevant property,
select/edit your existing AKV linked service or create a new one, and then select the secret name/version for
your property value. When you create/edit your AKV linked service, you can select/edit your existing AKV
or create a new one, but please grant ADF managed identity access to your AKV if you have not done so
already. You can also enter your secrets directly in the following format:
<AKV linked service name>/<secret name>/<secret version> .
7. On the Property Overrides tab for Execute SSIS Package activity, you can enter the paths of existing
properties in your selected package from SSISDB one by one to assign values to them manually – Please
ensure that they exist and are correctly entered for your package execution to succeed, e.g. to override the
value of your user variable, enter its path in the following format:
\Package.Variables[User::YourVariableName].Value . You can also add dynamic content to their values using
expressions, functions, ADF system variables, and ADF pipeline parameters/variables.
8. To validate the pipeline configuration, click Validate on the toolbar. To close the Pipeline Validation
Report, click >>.
9. Publish the pipeline to ADF by clicking Publish All button.
Run the pipeline
In this step, you trigger a pipeline run.
1. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger now.
3. You can run the following query against the SSISDB database in your Azure SQL server to verify that the
package executed.
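One way to run such a check from PowerShell, a sketch assuming the SqlServer module's Invoke-Sqlcmd cmdlet and placeholder server name and credentials:
# Placeholder server name and credentials; replace with your own values
Invoke-Sqlcmd -ServerInstance "<your server>.database.windows.net" -Database "SSISDB" `
    -Username "<user name>" -Password "<password>" `
    -Query "SELECT TOP 10 execution_id, folder_name, project_name, package_name, status FROM catalog.executions ORDER BY execution_id DESC"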
4. You can also get the SSISDB execution ID from the output of the pipeline activity run, and use the ID to
check more comprehensive execution logs and error messages in SSMS.
Schedule the pipeline with a trigger
You can also create a scheduled trigger for your pipeline so that the pipeline runs on a schedule (hourly, daily, etc.).
For an example, see Create a data factory - Data Factory UI.
IMPORTANT
Replace object names, descriptions, and paths, property and parameter values, passwords, and other variable values
before saving the file.
{
"name": "RunSSISPackagePipeline",
"properties": {
"activities": [{
"name": "mySSISActivity",
"description": "My SSIS package/activity description",
"type": "ExecuteSSISPackage",
"typeProperties": {
"connectVia": {
"referenceName": "myAzureSSISIR",
"type": "IntegrationRuntimeReference"
},
"executionCredential": {
"domain": "MyDomain",
"userName": "MyUsername",
"password": {
"type": "SecureString",
"value": "**********"
}
},
"runtime": "x64",
"loggingLevel": "Basic",
"packageLocation": {
"packagePath": "FolderName/ProjectName/PackageName.dtsx"
},
"environmentPath": "FolderName/EnvironmentName",
"projectParameters": {
"project_param_1": {
"value": "123"
},
"project_param_2": {
"value": {
"value": "@pipeline().parameters.MyPipelineParameter",
"type": "Expression"
}
}
},
"packageParameters": {
"package_param_1": {
"value": "345"
},
"package_param_2": {
"value": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "myAKV",
"type": "LinkedServiceReference"
},
"secretName": "MySecret"
}
}
},
"projectConnectionManagers": {
"MyAdonetCM": {
"userName": {
"value": "sa"
},
"passWord": {
"value": {
"type": "SecureString",
"value": "abc"
}
}
}
},
"packageConnectionManagers": {
"MyOledbCM": {
"userName": {
"value": {
"value": "@pipeline().parameters.MyUsername",
"type": "Expression"
}
},
"passWord": {
"value": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "myAKV",
"type": "LinkedServiceReference"
},
"secretName": "MyPassword",
"secretVersion": "3a1b74e361bf4ef4a00e47053b872149"
}
}
}
},
"propertyOverrides": {
"\\Package.MaxConcurrentExecutables": {
"value": 8,
"isSensitive": false
}
}
},
"policy": {
"timeout": "0.01:00:00",
"retry": 0,
"retryIntervalInSeconds": 30
}
}]
}
}
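One way to deploy this pipeline definition from PowerShell, a minimal sketch assuming the JSON is saved as RunSSISPackagePipeline.json in the C:\ADF\RunSSISPackage folder and that $ResGrp and $DataFactory were set in earlier steps:
Set-Location "C:\ADF\RunSSISPackage"
$DFPipeLine = Set-AzDataFactoryV2Pipeline -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -Name "RunSSISPackagePipeline" `
    -DefinitionFile ".\RunSSISPackagePipeline.json"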
PipelineName : Adfv2QuickStartPipeline
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {CopyFromBlobToBlob}
Parameters : {[inputPath, Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification],
[outputPath, Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}
Run the pipeline
Use the Invoke-AzDataFactoryV2Pipeline cmdlet to run the pipeline. The cmdlet returns the pipeline run ID for
future monitoring.
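A minimal sketch of this call, capturing the run ID used by the monitoring loop below:
$RunId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -PipelineName "RunSSISPackagePipeline"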
while ($True) {
$Run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $ResGrp.ResourceGroupName `
-DataFactoryName $DataFactory.DataFactoryName `
-PipelineRunId $RunId
if ($Run) {
if ($run.Status -ne 'InProgress') {
Write-Output ("Pipeline run finished. The status is: " + $Run.Status)
$Run
break
}
Write-Output "Pipeline is running...status: InProgress"
}
Start-Sleep -Seconds 10
}
You can also monitor the pipeline using the Azure portal. For step-by-step instructions, see Monitor the pipeline.
Schedule the pipeline with a trigger
In the previous step, you ran the pipeline on-demand. You can also create a schedule trigger to run the pipeline on
a schedule (hourly, daily, etc.).
1. Create a JSON file named MyTrigger.json in C:\ADF\RunSSISPackage folder with the following content:
{
"properties": {
"name": "MyTrigger",
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2017-12-07T00:00:00-08:00",
"endTime": "2017-12-08T00:00:00-08:00"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "RunSSISPackagePipeline"
},
"parameters": {}
}]
}
}
4. By default, the trigger is in stopped state. Start the trigger by running the Start-AzDataFactoryV2Trigger
cmdlet.
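A minimal sketch of deploying and then starting the trigger, assuming the trigger definition was saved as MyTrigger.json and that $ResGrp and $DataFactory were set in earlier steps:
# Deploy the trigger definition, then start it
Set-AzDataFactoryV2Trigger -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -Name "MyTrigger" -DefinitionFile ".\MyTrigger.json"
Start-AzDataFactoryV2Trigger -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -Name "MyTrigger"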
6. Run the following command after the next hour. For example, if the current time is 3:25 PM UTC, run the
command at 4 PM UTC.
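A sketch of checking the trigger runs, assuming the Get-AzDataFactoryV2TriggerRun cmdlet and the time window defined in MyTrigger.json:
Get-AzDataFactoryV2TriggerRun -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -TriggerName "MyTrigger" `
    -TriggerRunStartedAfter "2017-12-07" `
    -TriggerRunStartedBefore "2017-12-08"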
You can run the following query against the SSISDB database in your Azure SQL server to verify that the
package executed.
Next steps
See the following blog post:
Modernize and extend your ETL/ELT workflows with SSIS activities in ADF pipelines
Run an SSIS package with the Stored Procedure
activity in Azure Data Factory
4/8/2019 • 10 minutes to read • Edit Online
This article describes how to run an SSIS package in an Azure Data Factory pipeline by using a Stored Procedure
activity.
Prerequisites
Azure SQL Database
The walkthrough in this article uses an Azure SQL database that hosts the SSIS catalog. You can also use an Azure
SQL Database Managed Instance.
5. Select your Azure subscription in which you want to create the data factory.
6. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. Select V2 for the version.
8. Select the location for the data factory. Only locations that are supported by Data Factory are shown in the
drop-down list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.)
used by data factory can be in other locations.
9. Select Pin to dashboard.
10. Click Create.
11. On the dashboard, you see the following tile with status: Deploying data factory.
12. After the creation is complete, you see the Data Factory page as shown in the image.
13. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) application in a separate
tab.
Create a pipeline with stored procedure activity
In this step, you use the Data Factory UI to create a pipeline. You add a stored procedure activity to the pipeline and
configure it to run the SSIS package by using the sp_executesql stored procedure.
1. In the get started page, click Create pipeline:
2. In the Activities toolbox, expand General, and drag-drop Stored Procedure activity to the pipeline
designer surface.
3. In the properties window for the stored procedure activity, switch to the SQL Account tab, and click +
New. You create a connection to the Azure SQL database that hosts the SSIS Catalog (SSISDB database).
4. In the New Linked Service window, do the following steps:
a. Select Azure SQL Database for Type.
b. Select the Default Azure Integration Runtime to connect to the Azure SQL Database that hosts the
SSISDB database.
c. For the Server name field, select the Azure SQL Database server that hosts the SSISDB database.
d. Select SSISDB for Database name.
e. For User name, enter the name of user who has access to the database.
f. For Password, enter the password of the user.
g. Test the connection to the database by clicking Test connection button.
h. Save the linked service by clicking the Save button.
5. In the properties window, switch to the Stored Procedure tab from the SQL Account tab, and do the
following steps:
a. Select Edit.
b. For the Stored procedure name field, enter sp_executesql .
c. Click + New in the Stored procedure parameters section.
d. For name of the parameter, enter stmt.
e. For type of the parameter, enter String.
f. For value of the parameter, enter the following SQL query:
In the SQL query, specify the right values for the folder_name, project_name, and package_name
parameters.
DECLARE @return_value INT, @exe_id BIGINT, @err_msg NVARCHAR(150)

EXEC @return_value=[SSISDB].[catalog].[create_execution]
    @folder_name=N'<FOLDER name in SSIS Catalog>',
    @project_name=N'<PROJECT name in SSIS Catalog>',
    @package_name=N'<PACKAGE name>.dtsx',
    @use32bitruntime=0, @runinscaleout=1, @useanyworker=1,
    @execution_id=@exe_id OUTPUT

EXEC [SSISDB].[catalog].[set_execution_parameter_value] @exe_id, @object_type=50, @parameter_name=N'SYNCHRONIZED', @parameter_value=1

EXEC [SSISDB].[catalog].[start_execution] @execution_id=@exe_id, @retry_count=0

IF(SELECT [status] FROM [SSISDB].[catalog].[executions] WHERE execution_id=@exe_id)<>7
BEGIN
    SET @err_msg=N'Your package execution did not succeed for execution ID: ' + CAST(@exe_id AS NVARCHAR(20))
    RAISERROR(@err_msg,15,1)
END
6. To validate the pipeline configuration, click Validate on the toolbar. To close the Pipeline Validation
Report, click >>.
7. Publish the pipeline to Data Factory by clicking Publish All button.
4. Click View Activity Runs link in the Actions column. You see only one activity run as the pipeline has only
one activity (stored procedure activity).
5. You can run the following query against the SSISDB database in your Azure SQL server to verify that the
package executed.
NOTE
You can also create a scheduled trigger for your pipeline so that the pipeline runs on a schedule (hourly, daily, etc.). For an
example, see Create a data factory - Data Factory UI.
Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
In this section, you use Azure PowerShell to create a Data Factory pipeline with a stored procedure activity that
invokes an SSIS package.
Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.
Create a data factory
You can either use the same data factory that has the Azure-SSIS IR or create a separate data factory. The
following procedure provides steps to create a data factory. You create a pipeline with a stored procedure activity in
this data factory. The stored procedure activity executes a stored procedure in the SSISDB database to run your
SSIS package.
1. Define a variable for the resource group name that you use in PowerShell commands later. Copy the
following command text to PowerShell, specify a name for the Azure resource group in double quotes, and
then run the command. For example: "adfrg" .
$resourceGroupName = "ADFTutorialResourceGroup";
If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$ResourceGroupName variable and run the command again.
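A minimal sketch of creating the resource group and capturing it in the $ResGrp variable referenced later; the EastUS location below is an assumption:
# Location is an assumption; pick any region supported by Data Factory
$ResGrp = New-AzResourceGroup -Name $resourceGroupName -Location "EastUS"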
IMPORTANT
Update the data factory name to be globally unique.
$DataFactoryName = "ADFTutorialFactory";
4. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet, using the Location and
ResourceGroupName property from the $ResGrp variable:
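A minimal sketch of this cmdlet call:
$DataFactory = Set-AzDataFactoryV2 -ResourceGroupName $ResGrp.ResourceGroupName `
    -Location $ResGrp.Location `
    -Name $DataFactoryName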
The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory names must
be globally unique.
To create Data Factory instances, the user account you use to log in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on
the following page, and then expand Analytics to locate Data Factory: Products available by region. The
data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data factory
can be in other regions.
Create an Azure SQL Database linked service
Create a linked service to link your Azure SQL database that hosts the SSIS catalog to your data factory. Data
Factory uses information in this linked service to connect to SSISDB database, and executes a stored procedure to
run an SSIS package.
1. Create a JSON file named AzureSqlDatabaseLinkedService.json in C:\ADF\RunSSISPackage folder
with the following content:
IMPORTANT
Replace <servername>, <username>, and <password> with values of your Azure SQL Database before saving the
file.
{
"name": "AzureSqlDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=SSISDB;User ID=
<username>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
}
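One way to deploy this linked service definition from PowerShell, a minimal sketch assuming the JSON file created in step 1:
Set-Location "C:\ADF\RunSSISPackage"
Set-AzDataFactoryV2LinkedService -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -Name "AzureSqlDatabaseLinkedService" `
    -DefinitionFile ".\AzureSqlDatabaseLinkedService.json"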
{
"name": "RunSSISPackagePipeline",
"properties": {
"activities": [
{
"name": "My SProc Activity",
"description":"Runs an SSIS package",
"type": "SqlServerStoredProcedure",
"linkedServiceName": {
"referenceName": "AzureSqlDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"storedProcedureName": "sp_executesql",
"storedProcedureParameters": {
"stmt": {
"value": "DECLARE @return_value INT, @exe_id BIGINT, @err_msg NVARCHAR(150)
EXEC @return_value=[SSISDB].[catalog].[create_execution] @folder_name=N'<FOLDER NAME>',
@project_name=N'<PROJECT NAME>', @package_name=N'<PACKAGE NAME>', @use32bitruntime=0, @runinscaleout=1,
@useanyworker=1, @execution_id=@exe_id OUTPUT EXEC [SSISDB].[catalog].[set_execution_parameter_value]
@exe_id, @object_type=50, @parameter_name=N'SYNCHRONIZED', @parameter_value=1 EXEC [SSISDB].
[catalog].[start_execution] @execution_id=@exe_id, @retry_count=0 IF(SELECT [status] FROM [SSISDB].
[catalog].[executions] WHERE execution_id=@exe_id)<>7 BEGIN SET @err_msg=N'Your package execution did
not succeed for execution ID: ' + CAST(@exe_id AS NVARCHAR(20)) RAISERROR(@err_msg,15,1) END"
}
}
}
}
]
}
}
PipelineName : Adfv2QuickStartPipeline
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {CopyFromBlobToBlob}
Parameters : {[inputPath, Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification],
[outputPath, Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}
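A minimal sketch of starting the pipeline run and capturing the $RunId used by the monitoring loop below:
$RunId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -PipelineName "RunSSISPackagePipeline"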
while ($True) {
$Run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $ResGrp.ResourceGroupName -DataFactoryName
$DataFactory.DataFactoryName -PipelineRunId $RunId
if ($Run) {
if ($run.Status -ne 'InProgress') {
Write-Output ("Pipeline run finished. The status is: " + $Run.Status)
$Run
break
}
Write-Output "Pipeline is running...status: InProgress"
}
Start-Sleep -Seconds 10
}
Create a trigger
In the previous step, you invoked the pipeline on-demand. You can also create a schedule trigger to run the
pipeline on a schedule (hourly, daily, etc.).
1. Create a JSON file named MyTrigger.json in C:\ADF\RunSSISPackage folder with the following content:
{
"properties": {
"name": "MyTrigger",
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2017-12-07T00:00:00-08:00",
"endTime": "2017-12-08T00:00:00-08:00"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "RunSSISPackagePipeline"
},
"parameters": {}
}
]
}
}
4. By default, the trigger is in stopped state. Start the trigger by running the Start-AzDataFactoryV2Trigger
cmdlet.
Start-AzDataFactoryV2Trigger -ResourceGroupName $ResGrp.ResourceGroupName -DataFactoryName
$DataFactory.DataFactoryName -Name "MyTrigger"
6. Run the following command after the next hour. For example, if the current time is 3:25 PM UTC, run the
command at 4 PM UTC.
You can run the following query against the SSISDB database in your Azure SQL server to verify that the
package executed.
Next steps
You can also monitor the pipeline using the Azure portal. For step-by-step instructions, see Monitor the pipeline.
How to start and stop Azure-SSIS Integration Runtime on a schedule
3/28/2019 • 13 minutes to read • Edit Online
This article describes how to schedule the starting and stopping of Azure-SSIS Integration Runtime (IR) by using Azure Data Factory (ADF). Azure-SSIS IR is the ADF compute
resource dedicated to running SQL Server Integration Services (SSIS) packages. Running an Azure-SSIS IR has a cost associated with it. Therefore, you typically want to run
your IR only when you need to execute SSIS packages in Azure and stop it when you do not need it anymore. You can use the ADF User Interface (UI)/app or Azure
PowerShell to manually start or stop your IR.
Alternatively, you can create Web activities in ADF pipelines to start/stop your IR on schedule, e.g. starting it in the morning before executing your daily ETL workloads and
stopping it in the afternoon after they are done. You can also chain an Execute SSIS Package activity between two Web activities that start and stop your IR, so your IR will
start/stop on demand, just in time before/after your package execution. For more info about Execute SSIS Package activity, see Run an SSIS package using Execute SSIS
Package activity in ADF pipeline article.
IMPORTANT
Using this Azure feature from PowerShell requires the AzureRM module installed. This is an older module only available for Windows PowerShell 5.1 that no longer receives new features. The
Az and AzureRM modules are not compatible when installed for the same versions of PowerShell. If you need both versions:
Prerequisites
If you have not provisioned your Azure-SSIS IR already, provision it by following instructions in the tutorial.
Create and schedule ADF pipelines that start and/or stop Azure-SSIS IR
This section shows you how to use Web activities in ADF pipelines to start/stop your Azure-SSIS IR on schedule or start & stop it on demand. We will guide you to create
three pipelines:
1. The first pipeline contains a Web activity that starts your Azure-SSIS IR.
2. The second pipeline contains a Web activity that stops your Azure-SSIS IR.
3. The third pipeline contains an Execute SSIS Package activity chained between two Web activities that start/stop your Azure-SSIS IR.
After you create and test those pipelines, you can create a schedule trigger and associate it with any pipeline. The schedule trigger defines a schedule for running the
associated pipeline.
For example, you can create two triggers: the first is scheduled to run daily at 6 AM and associated with the first pipeline, while the second is scheduled to run daily at
6 PM and associated with the second pipeline. In this way, you have a period between 6 AM and 6 PM every day when your IR is running, ready to execute your daily ETL
workloads.
If you create a third trigger that is scheduled to run daily at midnight and associated with the third pipeline, that pipeline will run at midnight every day, starting your IR just
before package execution, subsequently executing your package, and immediately stopping your IR just after package execution, so your IR will not be running idly.
Create your ADF
1. Sign in to Azure portal.
2. Click New on the left menu, click Data + Analytics, and click Data Factory.
3. In the New data factory page, enter MyAzureSsisDataFactory for Name.
The name of your ADF must be globally unique. If you receive the following error, change the name of your ADF (e.g. yournameMyAzureSsisDataFactory) and try
creating it again. See Data Factory - Naming Rules article to learn about naming rules for ADF artifacts.
Data factory name MyAzureSsisDataFactory is not available
4. Select your Azure Subscription under which you want to create your ADF.
5. For Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of your new resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources article.
6. For Version, select V2 .
7. For Location, select one of the locations supported for ADF creation from the drop-down list.
8. Select Pin to dashboard.
9. Click Create.
10. On Azure dashboard, you will see the following tile with status: Deploying Data Factory.
11. After the creation is complete, you can see your ADF page as shown below.
12. Click Author & Monitor to launch ADF UI/app in a separate tab.
Create your pipelines
1. In Let's get started page, select Create pipeline.
2. In Activities toolbox, expand General menu, and drag & drop a Web activity onto the pipeline designer surface. In General tab of the activity properties window,
change the activity name to startMyIR. Switch to Settings tab, and do the following actions.
a. For URL, enter the following URL for the REST API that starts the Azure-SSIS IR, replacing {subscriptionId}, {resourceGroupName}, {factoryName}, and
{integrationRuntimeName} with the actual values for your IR:
https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/integrationRuntimes/{integrationRuntimeName}/start?api-version=2018-06-01
Alternatively, you can also copy & paste the resource ID of your IR from its monitoring page on the ADF UI/app to replace the following part of the above URL:
/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/integrationRuntimes/{integrationRuntimeName}
b. For Method, select POST.
c. For Body, enter {"message":"Start my IR"} .
d. For Authentication, select MSI to use the managed identity for your ADF. See the Managed identity for Data Factory article for more info.
e. For Resource, enter https://management.azure.com/ .
3. Clone the first pipeline to create a second one, changing the activity name to stopMyIR and replacing the following properties.
a. For URL, enter the following URL for the REST API that stops the Azure-SSIS IR, replacing {subscriptionId}, {resourceGroupName}, {factoryName}, and
{integrationRuntimeName} with the actual values for your IR:
https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/integrationRuntimes/{integrationRuntimeName}/stop?api-version=2018-06-01
6. Validate your ADF and all pipeline settings by clicking Validate all/Validate on the factory/pipeline toolbar. Close Factory/Pipeline Validation Output by clicking
>> button.
4. In Trigger Run Parameters page, review any warning, and select Finish.
5. Publish the whole ADF settings by selecting Publish All in the factory toolbar.
3. To view the trigger runs, select Trigger Runs from the drop-down list under Pipeline Runs at the top.
5. You will see the deployment status of your Azure Automation account in Azure dashboard and notifications.
6. You will see the homepage of your Azure Automation account after it is created successfully.
2. If you do not have AzureRM.DataFactoryV2, go to the PowerShell Gallery for AzureRM.DataFactoryV2 module, select Deploy to Azure Automation, select your
Azure Automation account, and then select OK. Go back to view Modules in SHARED RESOURCES section on the left menu and wait until you see STATUS of
AzureRM.DataFactoryV2 module changed to Available.
3. If you do not have AzureRM.Profile, go to the PowerShell Gallery for AzureRM.Profile module, select Deploy to Azure Automation, select your Azure Automation
account, and then select OK. Go back to view Modules in SHARED RESOURCES section on the left menu and wait until you see STATUS of the AzureRM.Profile
module changed to Available.
3. Copy & paste the following PowerShell script to your runbook script window. Save and then publish your runbook by using Save and Publish buttons on the toolbar.
Param
(
    [Parameter (Mandatory = $true)]
    [String] $ResourceGroupName,

    [Parameter (Mandatory = $true)]
    [String] $DataFactoryName,

    [Parameter (Mandatory = $true)]
    [String] $AzureSSISName,

    [Parameter (Mandatory = $true)]
    [String] $OPERATION
)

$connectionName = "AzureRunAsConnection"
try
{
# Get the connection "AzureRunAsConnection "
$servicePrincipalConnection=Get-AutomationConnection -Name $connectionName
"Logging in to Azure..."
Connect-AzAccount `
-ServicePrincipal `
-TenantId $servicePrincipalConnection.TenantId `
-ApplicationId $servicePrincipalConnection.ApplicationId `
-CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint
}
catch {
if (!$servicePrincipalConnection)
{
$ErrorMessage = "Connection $connectionName not found."
throw $ErrorMessage
} else{
Write-Error -Message $_.Exception
throw $_.Exception
}
}
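# A minimal sketch of the start/stop logic that follows the sign-in block, assuming the
# $ResourceGroupName, $DataFactoryName, $AzureSSISName, and $OPERATION parameters declared above.
# The cmdlets below are the Az module equivalents of the AzureRM.DataFactoryV2 cmdlets imported earlier.
"Getting Azure-SSIS integration runtime $AzureSSISName..."
if($OPERATION -eq "START")
{
    "##### Starting #####"
    # Starting the Azure-SSIS IR can take 20 to 30 minutes
    Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $AzureSSISName -Force
}
elseif($OPERATION -eq "STOP")
{
    "##### Stopping #####"
    # Stop the Azure-SSIS IR so that it no longer incurs charges
    Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $AzureSSISName -Force
}
"##### Completed #####"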
7. Repeat the previous two steps using STOP as the value for OPERATION. Start your runbook again by selecting Start button on the toolbar. Enter your resource
group, ADF, and Azure-SSIS IR names. For OPERATION, enter STOP. In the output window, wait for the message ##### Completed ##### after you see #####
Stopping #####. Stopping Azure-SSIS IR does not take as long as starting it. Close Job window and get back to Runbook window.
4. Repeat the previous two steps to create a schedule named Stop IR daily. Enter a time that is at least 30 minutes after the time you specified for Start IR daily
schedule. For OPERATION, enter STOP and select OK. Select OK again to see the schedule on Schedules page of your runbook.
5. In Runbook window, select Jobs on the left menu. You should see the jobs created by your schedules at the specified times and their statuses. You can see the job
details, such as its output, similar to what you have seen after you tested your runbook.
6. After you are done testing, disable your schedules by editing them. Select Schedules on the left menu, select Start IR daily/Stop IR daily, and select No for
Enabled.
Next steps
See the following blog post:
Modernize and extend your ETL/ELT workflows with SSIS activities in ADF pipelines
See the following articles from SSIS documentation:
Deploy, run, and monitor an SSIS package on Azure
Connect to SSIS catalog on Azure
Schedule package execution on Azure
Connect to on-premises data sources with Windows authentication
Join an Azure-SSIS integration runtime to a virtual
network
4/16/2019 • 17 minutes to read • Edit Online
Join your Azure-SSIS integration runtime (IR ) to an Azure virtual network in the following scenarios:
You want to connect to on-premises data stores from SSIS packages running on an Azure-SSIS
integration runtime.
You are hosting the SQL Server Integration Services (SSIS ) catalog database in Azure SQL Database with
virtual network service endpoints/Managed Instance.
Azure Data Factory lets you join your Azure-SSIS integration runtime to a virtual network created
through the classic deployment model or the Azure Resource Manager deployment model.
IMPORTANT
The classic virtual network is currently being deprecated, so please use the Azure Resource Manager virtual network
instead. If you already use the classic virtual network, please switch to use the Azure Resource Manager virtual network as
soon as possible.
Host the SSIS Catalog database in Azure SQL Database with virtual
network service endpoints/Managed Instance
If the SSIS catalog is hosted in Azure SQL Database with virtual network service endpoints, or Managed
Instance, you can join your Azure-SSIS IR to:
The same virtual network
A different virtual network that has a network-to-network connection with the one that is used for the
Managed Instance
If you host your SSIS catalog in Azure SQL Database with virtual network service endpoints, make sure that you
join your Azure-SSIS IR to the same virtual network and subnet.
If you join your Azure-SSIS IR to the same virtual network as the Managed Instance, make sure that the Azure-
SSIS IR is in a different subnet than the Managed Instance. If you join your Azure-SSIS IR to a different virtual
network than the Managed Instance, we recommend either virtual network peering (which is limited to the same
region) or a virtual network to virtual network connection. See Connect your application to Azure SQL Database
Managed Instance.
In all cases, the virtual network can only be deployed through the Azure Resource Manager deployment model.
The following sections provide more details.
Make sure you have the required permissions. See Required permissions.
Select the proper subnet to host the Azure-SSIS IR. See Select the subnet.
If you are using your own Domain Name Services (DNS ) server on the virtual network, see Domain
Name Services server.
If you are using a Network Security Group (NSG ) on the subnet, see Network security group
If you are using Azure Express Route or configuring User Defined Route (UDR ), see Use Azure
ExpressRoute or User Defined Route.
Make sure the Resource Group of the virtual network can create and delete certain Azure Network
resources. See Requirements for Resource Group.
Here is a diagram showing the required connections for your Azure-SSIS IR:
Required permissions
The user who creates the Azure-SSIS Integration Runtime must have the following permissions:
If you're joining your SSIS IR to an Azure Resource Manager virtual network, you have two options:
Use the built-in Network Contributor role. This role comes with the Microsoft.Network/*
permission, which has a much larger scope than necessary.
Create a custom role that includes only the necessary
Microsoft.Network/virtualNetworks/*/join/action permission.
If you're joining your SSIS IR to a classic virtual network, we recommend that you use the built-in Classic
Virtual Machine Contributor role. Otherwise you have to define a custom role that includes the
permission to join the virtual network.
Select the subnet
Do not select the GatewaySubnet for deploying an Azure-SSIS Integration Runtime, because it is
dedicated for virtual network gateways.
Ensure that the subnet you select has sufficient available address space for Azure-SSIS IR to use. Leave at
least 2 * IR node number in available IP addresses. Azure reserves some IP addresses within each subnet,
and these addresses can't be used. The first and last IP addresses of the subnets are reserved for protocol
conformance, along with three more addresses used for Azure services. For more information, see Are
there any restrictions on using IP addresses within these subnets?.
Don’t use a subnet which is exclusively occupied by other Azure services (for example, SQL Database
Managed Instance, App Service, etc.).
Domain Name Services server
If you need to use your own Domain Name Services (DNS ) server in a virtual network joined by your Azure-
SSIS integration runtime, make sure it can resolve public Azure host names (for example, an Azure Storage blob
name, <your storage account>.blob.core.windows.net ).
The following steps are recommended:
Configure Custom DNS to forward requests to Azure DNS. You can forward unresolved DNS records to
the IP address of Azure's recursive resolvers (168.63.129.16) on your own DNS server.
Set up the Custom DNS as primary and Azure DNS as secondary for the virtual network. Register the IP
address of Azure's recursive resolvers (168.63.129.16) as a secondary DNS server in case your own DNS
server is unavailable.
For more info, see Name resolution that uses your own DNS server.
Network security group
If you need to implement a network security group (NSG ) for the subnet used by your Azure-SSIS integration
runtime, allow inbound/outbound traffic through the following ports:
If you're concerned about losing the ability to inspect outbound Internet traffic from that subnet, you can also add
an NSG rule on the subnet to restrict outbound destinations to Azure data center IP addresses.
See this PowerShell script for an example. You have to run the script weekly to keep the Azure data center IP
address list up-to-date.
Requirements for Resource Group
The Azure-SSIS IR needs to create certain network resources under the same resource group as the
virtual network. These resources include the following:
An Azure load balancer, with the name <Guid>-azurebatch-cloudserviceloadbalancer.
An Azure public IP address, with the name <Guid>-azurebatch-cloudservicepublicip.
A network security group, with the name <Guid>-azurebatch-cloudservicenetworksecuritygroup.
Make sure that you don't have any resource lock on the Resource Group or Subscription to which the
virtual network belongs. If you configure either a read-only lock or a delete lock, starting and stopping the
IR may fail or stop responding.
Make sure that you don't have an Azure policy which prevents the following resources from being created
under the Resource Group or Subscription to which the virtual network belongs:
Microsoft.Network/LoadBalancers
Microsoft.Network/NetworkSecurityGroups
Microsoft.Network/PublicIPAddresses
If you don't see Microsoft.Batch in the list, to register it, create an empty Azure Batch account in your
subscription. You can delete it later.
Use the portal to configure a classic virtual network
You need to configure a virtual network before you can join an Azure-SSIS IR to it.
1. Start Microsoft Edge or Google Chrome. Currently, the Data Factory UI is supported only in these web
browsers.
2. Sign in to the Azure portal.
3. Select More services. Filter for and select Virtual networks (classic).
4. Filter for and select your virtual network in the list.
5. On the Virtual network (classic) page, select Properties.
6. Select the copy button for RESOURCE ID to copy the resource ID for the classic network to the clipboard.
Save the ID from the clipboard in OneNote or a file.
7. Select Subnets on the left menu. Ensure that the number of available addresses is greater than the
nodes in your Azure-SSIS integration runtime.
8. Join MicrosoftAzureBatch to the Classic Virtual Machine Contributor role for the virtual network.
a. Select Access control (IAM ) on the left menu, and select the Role assignments tab.
e. Confirm that you see Microsoft Azure Batch in the list of contributors.
9. Verify that the Azure Batch provider is registered in the Azure subscription that has the virtual network.
Or, register the Azure Batch provider. If you already have an Azure Batch account in your subscription,
then your subscription is registered for Azure Batch. (If you create the Azure-SSIS IR in the Data Factory
portal, the Azure Batch provider is automatically registered for you.)
a. In Azure portal, select Subscriptions on the left menu.
b. Select your subscription.
c. Select Resource providers on the left, and confirm that Microsoft.Batch is a registered provider.
If you don't see Microsoft.Batch in the list, to register it, create an empty Azure Batch account in your
subscription. You can delete it later.
Join the Azure -SSIS IR to a virtual network
1. Start Microsoft Edge or Google Chrome. Currently, the Data Factory UI is supported only in those web
browsers.
2. In the Azure portal, select Data factories on the left menu. If you don't see Data factories on the menu,
select More services, and then select Data factories in the INTELLIGENCE + ANALYTICS section.
3. Select your data factory with the Azure-SSIS integration runtime in the list. You see the home page for
your data factory. Select the Author & Monitor tile. You see the Data Factory UI on a separate tab.
4. In the Data Factory UI, switch to the Edit tab, select Connections, and switch to the Integration
Runtimes tab.
5. If your Azure-SSIS IR is running, in the integration runtime list, select the Stop button in the Actions
column for your Azure-SSIS IR. You cannot edit an IR until you stop it.
6. In the integration runtime list, select the Edit button in the Actions column for your Azure-SSIS IR.
7. On the General Settings page of the Integration Runtime Setup window, select Next.
8. On the SQL Settings page, enter the administrator password, and select Next.
9. On the Advanced Settings page, do the following actions:
a. Select the check box for Select a VNet for your Azure-SSIS Integration Runtime to join and
allow Azure services to configure VNet permissions/settings.
b. For Type, select whether the virtual network is a classic virtual network or an Azure Resource Manager
virtual network.
c. For VNet Name, select your virtual network.
d. For Subnet Name, select your subnet in the virtual network.
e. Click VNet Validation and if successful, click Update.
10. Now, you can start the IR by using the Start button in the Actions column for your Azure-SSIS IR. It
takes approximately 20 to 30 minutes to start an Azure-SSIS IR.
Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which
will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.
# Make sure to run this script against the subscription to which the virtual network belongs.
if(![string]::IsNullOrEmpty($VnetId) -and ![string]::IsNullOrEmpty($SubnetName))
{
# Register to the Azure Batch resource provider
$BatchApplicationId = "ddbf3205-c6bd-46ae-8127-60eb93363864"
$BatchObjectId = (Get-AzADServicePrincipal -ServicePrincipalName $BatchApplicationId).Id
Register-AzResourceProvider -ProviderNamespace Microsoft.Batch
while(!(Get-AzResourceProvider -ProviderNamespace
"Microsoft.Batch").RegistrationState.Contains("Registered"))
{
Start-Sleep -s 10
}
if($VnetId -match "/providers/Microsoft.ClassicNetwork/")
{
# Assign the VM contributor role to Microsoft.Batch
New-AzRoleAssignment -ObjectId $BatchObjectId -RoleDefinitionName "Classic Virtual Machine
Contributor" -Scope $VnetId
}
}
Next steps
For more information about the Azure-SSIS runtime, see the following topics:
Azure-SSIS integration runtime. This article provides conceptual information about integration runtimes in
general, including the Azure-SSIS IR.
Tutorial: deploy SSIS packages to Azure. This article provides step-by-step instructions to create an Azure-
SSIS IR. It uses Azure SQL Database to host the SSIS catalog.
Create an Azure-SSIS integration runtime. This article expands on the tutorial and provides instructions on
using Azure SQL Database with virtual network service endpoints/Managed Instance to host the SSIS
catalog and joining the IR to a virtual network.
Monitor an Azure-SSIS IR. This article shows you how to retrieve information about an Azure-SSIS IR and
descriptions of statuses in the returned information.
Manage an Azure-SSIS IR. This article shows you how to stop, start, or remove an Azure-SSIS IR. It also
shows you how to scale out your Azure-SSIS IR by adding nodes.
Enable Azure Active Directory authentication for
Azure-SSIS Integration Runtime
5/13/2019 • 7 minutes to read • Edit Online
This article shows you how to enable Azure Active Directory (Azure AD ) authentication with the managed identity
for your Azure Data Factory (ADF ) and use it instead of SQL authentication to create an Azure-SSIS Integration
Runtime (IR ) that will in turn provision SSIS catalog database (SSISDB ) in Azure SQL Database server/Managed
Instance on your behalf.
For more info about the managed identity for your ADF, see Managed identity for Data Factory.
NOTE
In this scenario, Azure AD authentication with the managed identity for your ADF is only used in the creation and
subsequent starting operations of your SSIS IR that will in turn provision and connect to SSISDB. For SSIS package
executions, your SSIS IR will still connect to SSISDB using SQL authentication with fully managed accounts that are created
during SSISDB provisioning.
If you have already created your SSIS IR using SQL authentication, you cannot reconfigure it to use Azure AD
authentication via PowerShell at this time, but you can do so via the Azure portal/ADF app.
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
The result looks like the following example, which also displays the variable value:
$Group
3. Add the managed identity for your ADF to the group. You can follow the article Managed identity for Data
Factory to get the principal Managed Identity Object ID (e.g. 765ad4ab-XXXX-XXXX-XXXX-51ed985819dc,
but do not use the Managed Identity Application ID for this purpose).
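One way to do this from PowerShell, a minimal sketch assuming the Az.Resources Add-AzADGroupMember cmdlet and the $Group variable created earlier; the Object ID below is a placeholder:
# Placeholder Object ID; replace with the Managed Identity Object ID of your ADF
$AdfManagedIdentityObjectId = "765ad4ab-XXXX-XXXX-XXXX-51ed985819dc"
Add-AzADGroupMember -TargetGroupObjectId $Group.Id -MemberObjectId $AdfManagedIdentityObjectId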
The command should complete successfully, creating a contained user to represent the group.
9. Clear the query window, enter the following T-SQL command, and select Execute on the toolbar.
The command should complete successfully, granting the contained user the ability to create a database
(SSISDB ).
10. If your SSISDB was created using SQL authentication and you want to switch to use Azure AD
authentication for your Azure-SSIS IR to access it, right-click on SSISDB database and select New query.
11. In the query window, enter the following T-SQL command, and select Execute on the toolbar.
The command should complete successfully, creating a contained user to represent the group.
12. Clear the query window, enter the following T-SQL command, and select Execute on the toolbar.
The command should complete successfully, granting the contained user the ability to access SSISDB.
The command should complete successfully, displaying the managed identity for your ADF as binary.
7. Clear the query window and execute the following T-SQL script to add the managed identity for your ADF
as a user:
CREATE LOGIN [{a name for the managed identity}] WITH SID = {your Managed Identity Application ID as binary}, TYPE = E
ALTER SERVER ROLE [dbcreator] ADD MEMBER [{the managed identity name}]
ALTER SERVER ROLE [securityadmin] ADD MEMBER [{the managed identity name}]
The command should complete successfully, granting the managed identity for your ADF the ability to
create a database (SSISDB ).
8. If your SSISDB was created using SQL authentication and you want to switch to use Azure AD
authentication for your Azure-SSIS IR to access it, right-click on SSISDB database and select New query.
9. In the query window, enter the following T-SQL command, and select Execute on the toolbar.
CREATE USER [{the managed identity name}] FOR LOGIN [{the managed identity name}] WITH DEFAULT_SCHEMA =
dbo
ALTER ROLE db_owner ADD MEMBER [{the managed identity name}]
The command should complete successfully, granting the managed identity for your ADF the ability to
access SSISDB.
The Enterprise Edition of the Azure-SSIS Integration Runtime lets you use the following advanced and premium
features:
Change Data Capture (CDC ) components
Oracle, Teradata, and SAP BW connectors
SQL Server Analysis Services (SSAS ) and Azure Analysis Services (AAS ) connectors and transformations
Fuzzy Grouping and Fuzzy Lookup transformations
Term Extraction and Term Lookup transformations
Some of these features require you to install additional components to customize the Azure-SSIS IR. For more
info about how to install additional components, see Custom setup for the Azure-SSIS integration runtime.
Enterprise features
ENTERPRISE FEATURES | DESCRIPTIONS
CDC components | The CDC Source, Control Task, and Splitter Transformation are preinstalled on the Azure-SSIS IR Enterprise Edition. To connect to Oracle, you also need to install the CDC Designer and Service on another computer.
Oracle connectors | The Oracle Connection Manager, Source, and Destination are preinstalled on the Azure-SSIS IR Enterprise Edition. You also need to install the Oracle Call Interface (OCI) driver, and if necessary configure the Oracle Transport Network Substrate (TNS), on the Azure-SSIS IR. For more info, see Custom setup for the Azure-SSIS integration runtime.
Teradata connectors | You need to install the Teradata Connection Manager, Source, and Destination, as well as the Teradata Parallel Transporter (TPT) API and Teradata ODBC driver, on the Azure-SSIS IR Enterprise Edition. For more info, see Custom setup for the Azure-SSIS integration runtime.
Analysis Services components | The Data Mining Model Training Destination, the Dimension Processing Destination, and the Partition Processing Destination, as well as the Data Mining Query Transformation, are preinstalled on the Azure-SSIS IR Enterprise Edition. All these components support SQL Server Analysis Services (SSAS), but only the Partition Processing Destination supports Azure Analysis Services (AAS). To connect to SSAS, you also need to configure Windows Authentication credentials in SSISDB. In addition to these components, the Analysis Services Execute DDL Task, the Analysis Services Processing Task, and the Data Mining Query Task are also preinstalled on the Azure-SSIS IR Standard/Enterprise Edition.
Fuzzy Grouping and Fuzzy Lookup transformations | The Fuzzy Grouping and Fuzzy Lookup transformations are preinstalled on the Azure-SSIS IR Enterprise Edition. These components support both SQL Server and Azure SQL Database for storing reference data.
Term Extraction and Term Lookup transformations | The Term Extraction and Term Lookup transformations are preinstalled on the Azure-SSIS IR Enterprise Edition. These components support both SQL Server and Azure SQL Database for storing reference data.
Instructions
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
$MyAzureSsisIrEdition = "Enterprise"
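The edition is passed when you provision or reconfigure the IR with the Set-AzDataFactoryV2IntegrationRuntime cmdlet. The following is a minimal sketch, assuming the $ResourceGroupName, $DataFactoryName, and $AzureSSISName variables from the full provisioning script and an IR that is stopped before being reconfigured:
# Stop the IR before changing its edition, then start it again afterwards
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                       -DataFactoryName $DataFactoryName `
                                       -Name $AzureSSISName `
                                       -Force
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
                                      -Name $AzureSSISName `
                                      -Edition $MyAzureSsisIrEdition
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                        -DataFactoryName $DataFactoryName `
                                        -Name $AzureSSISName `
                                        -Force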
Next steps
Custom setup for the Azure-SSIS integration runtime
How to develop paid or licensed custom components for the Azure-SSIS integration runtime
Customize setup for the Azure-SSIS integration
runtime
4/24/2019 • 9 minutes to read • Edit Online
Custom setup for the Azure-SSIS Integration Runtime provides an interface for adding your own setup
steps during the provisioning or reconfiguration of your Azure-SSIS IR. Custom setup lets you alter the default
operating configuration or environment (for example, to start additional Windows services or persist access
credentials for file shares) or install additional components (for example, assemblies, drivers, or extensions) on
each node of your Azure-SSIS IR.
You configure your custom setup by preparing a script and its associated files, and uploading them into a blob
container in your Azure Storage account. You provide a Shared Access Signature (SAS ) Uniform Resource
Identifier (URI) for your container when you provision or reconfigure your Azure-SSIS IR. Each node of your
Azure-SSIS IR then downloads the script and its associated files from your container and runs your custom
setup with elevated privileges. When custom setup is finished, each node uploads the standard output of
execution and other logs into your container.
You can install both free (unlicensed) components and paid (licensed) components. If you're an ISV, see How
to develop paid or licensed components for the Azure-SSIS IR.
IMPORTANT
The v2-series nodes of Azure-SSIS IR are not suitable for custom setup, so please use the v3-series nodes instead. If you
already use the v2-series nodes, please switch to use the v3-series nodes as soon as possible.
Current limitations
If you want to use gacutil.exe to install assemblies in the Global Assembly Cache (GAC ), you need to
provide gacutil.exe as part of your custom setup, or use the copy provided in the Public Preview
container.
If you want to reference a subfolder in your script, msiexec.exe does not support the .\ notation to
reference the root folder. Use a command like msiexec /i "MySubfolder\MyInstallerx64.msi" ... instead
of msiexec /i ".\MySubfolder\MyInstallerx64.msi" ... .
If you need to join your Azure-SSIS IR with custom setup to a virtual network, only Azure Resource
Manager virtual network is supported. Classic virtual network is not supported.
Administrative share is currently not supported on the Azure-SSIS IR.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which
will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.
To customize your Azure-SSIS IR, you need the following things:
Azure subscription
An Azure SQL Database server or Managed Instance to host the SSIS catalog database (SSISDB)
Provision your Azure-SSIS IR
An Azure Storage account. For custom setup, you upload and store your custom setup script and its
associated files in a blob container. The custom setup process also uploads its execution logs to the same
blob container.
Instructions
1. Download and install Azure PowerShell.
2. Prepare your custom setup script and its associated files (for example, .bat, .cmd, .exe, .dll, .msi, or .ps1
files).
a. You must have a script file named main.cmd , which is the entry point of your custom setup.
b. If you want additional logs generated by other tools (for example, msiexec.exe ) to be uploaded
into your container, specify the predefined environment variable, CUSTOM_SETUP_SCRIPT_LOG_DIR as
the log folder in your scripts (for example,
msiexec /i xxx.msi /quiet /lv %CUSTOM_SETUP_SCRIPT_LOG_DIR%\install.log ).
b. Select Use a storage account name and key and select Next.
c. Enter your Azure Storage account name and key, select Next, and then select Connect.
d. Under your connected Azure Storage account, right-click on Blob Containers, select Create Blob
Container, and name the new container.
e. Select the new container and upload your custom setup script and its associated files. Make sure
that you upload main.cmd at the top level of your container, not in any folder. Also ensure that your
container contains only the necessary custom setup files, so that downloading them onto your
Azure-SSIS IR later doesn't take a long time. The maximum period for custom setup is currently
45 minutes before it times out; this period includes the time to download all files from your
container and install them on the Azure-SSIS IR. If you need a longer period, raise a
support ticket.
IMPORTANT
Please ensure that the SAS URI does not expire and custom setup resources are always available during the
whole lifecycle of your Azure-SSIS IR, from creation to deletion, especially if you regularly stop and start
your Azure-SSIS IR during this period.
j. After your custom setup finishes and your Azure-SSIS IR starts, you can find the standard output
of main.cmd and other execution logs in the main.cmd.log folder of your storage container.
4. To see other custom setup examples, connect to the Public Preview container with Azure Storage
Explorer.
a. Under (Local and Attached), right-click Storage Accounts, select Connect to Azure storage, select
Use a connection string or a shared access signature URI, and then select Next.
b. Select Use a SAS URI and enter the following SAS URI for the Public Preview container. Select Next,
and then select Connect.
https://ssisazurefileshare.blob.core.windows.net/publicpreview?sp=rl&st=2018-04-
08T14%3A10%3A00Z&se=2020-04-10T14%3A10%3A00Z&sv=2017-04-
17&sig=mFxBSnaYoIlMmWfxu9iMlgKIvydn85moOnOch6%2F%2BheE%3D&sr=c
c. Select the connected Public Preview container and double-click the CustomSetupScript folder. In this
folder are the following items:
a. A Sample folder, which contains a custom setup to install a basic task on each node of your Azure-
SSIS IR. The task does nothing but sleep for a few seconds. The folder also contains a gacutil
folder, the whole content of which ( gacutil.exe , gacutil.exe.config , and 1033\gacutlrc.dll ) can
be copied as is into your container. Additionally, main.cmd contains comments to persist access
credentials for file shares.
b. A UserScenarios folder, which contains several custom setups for real user scenarios.
d. Double-click the UserScenarios folder. In this folder are the following items:
a. A .NET FRAMEWORK 3.5 folder, which contains a custom setup to install an earlier version of the .NET
Framework that might be required for custom components on each node of your Azure-SSIS IR.
b. A BCP folder, which contains a custom setup to install SQL Server command-line utilities (
MsSqlCmdLnUtils.msi ), including the bulk copy program ( bcp ), on each node of your Azure-SSIS
IR.
c. An EXCEL folder, which contains a custom setup to install open-source assemblies (
DocumentFormat.OpenXml.dll , ExcelDataReader.DataSet.dll , and ExcelDataReader.dll ) on each
node of your Azure-SSIS IR.
d. An ORACLE ENTERPRISE folder, which contains a custom setup script ( main.cmd ) and silent install
config file ( client.rsp ) to install the Oracle connectors and OCI driver on each node of your
Azure-SSIS IR Enterprise Edition. This setup lets you use the Oracle Connection Manager, Source,
and Destination. First, download Microsoft Connectors v5.0 for Oracle (
AttunitySSISOraAdaptersSetup.msi and AttunitySSISOraAdaptersSetup64.msi ) from Microsoft
Download Center and the latest Oracle client - for example, winx64_12102_client.zip - from
Oracle, then upload them all together with main.cmd and client.rsp into your container. If you
use TNS to connect to Oracle, you also need to download tnsnames.ora , edit it, and upload it into
your container, so it can be copied into the Oracle installation folder during setup.
e. An ORACLE STANDARD ADO.NET folder, which contains a custom setup script ( main.cmd ) to install the
Oracle ODP.NET driver on each node of your Azure-SSIS IR. This setup lets you use the
ADO.NET Connection Manager, Source, and Destination. First, download the latest Oracle
ODP.NET driver - for example, ODP.NET_Managed_ODAC122cR1.zip - from Oracle, and then upload it
together with main.cmd into your container.
f. An ORACLE STANDARD ODBC folder, which contains a custom setup script ( main.cmd ) to install the
Oracle ODBC driver and configure DSN on each node of your Azure-SSIS IR. This setup lets you
use the ODBC Connection Manager/Source/Destination or Power Query Connection
Manager/Source with ODBC data source kind to connect to Oracle server. First, download the
latest Oracle Instant Client (Basic Package or Basic Lite Package) and ODBC Package - for
example, the 64-bit packages from here (Basic Package:
instantclient-basic-windows.x64-18.3.0.0.0dbru.zip , Basic Lite Package:
instantclient-basiclite-windows.x64-18.3.0.0.0dbru.zip , ODBC Package:
instantclient-odbc-windows.x64-18.3.0.0.0dbru.zip ) or the 32-bit packages from here (Basic
Package: instantclient-basic-nt-18.3.0.0.0dbru.zip , Basic Lite Package:
instantclient-basiclite-nt-18.3.0.0.0dbru.zip , ODBC Package:
instantclient-odbc-nt-18.3.0.0.0dbru.zip ), and then upload them together with main.cmd into
your container.
g. An SAP BW folder, which contains a custom setup script ( main.cmd ) to install the SAP .NET
connector assembly ( librfc32.dll ) on each node of your Azure-SSIS IR Enterprise Edition. This
setup lets you use the SAP BW Connection Manager, Source, and Destination. First, upload the
64-bit or the 32-bit version of librfc32.dll from the SAP installation folder into your container,
together with main.cmd . The script then copies the SAP assembly into the %windir%\SysWow64 or
%windir%\System32 folder during setup.
h. A STORAGE folder, which contains a custom setup to install Azure PowerShell on each node of your
Azure-SSIS IR. This setup lets you deploy and run SSIS packages that run PowerShell scripts to
manipulate your Azure Storage account. Copy main.cmd , a sample AzurePowerShell.msi (or install
the latest version), and storage.ps1 to your container. Use PowerShell.dtsx as a template for your
packages. The package template combines an Azure Blob Download Task, which downloads
storage.ps1 as a modifiable PowerShell script, and an Execute Process Task that executes the
script on each node.
i. A TERADATA folder, which contains a custom setup script ( main.cmd ), its associated file (
install.cmd ), and installer packages ( .msi ). These files install Teradata connectors, the TPT API,
and the ODBC driver on each node of your Azure-SSIS IR Enterprise Edition. This setup lets you
use the Teradata Connection Manager, Source, and Destination. First, download the Teradata Tools
and Utilities (TTU ) 15.x zip file (for example,
TeradataToolsAndUtilitiesBase__windows_indep.15.10.22.00.zip ) from Teradata, and then upload it
together with the above .cmd and .msi files into your container.
e. To try these custom setup samples, copy and paste the content from the selected folder into your
container. When you provision or reconfigure your Azure-SSIS IR with PowerShell, run the
Set-AzDataFactoryV2IntegrationRuntime cmdlet with the SAS URI of your container as the value for the new
SetupScriptContainerSasUri parameter.
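The following is a minimal sketch of that call, assuming the Az.Storage and Az.DataFactory modules, placeholder storage account and container names, and an Azure-SSIS IR that has been stopped for reconfiguration:
# Generate a long-lived SAS token with read/write/list permissions on the custom setup container
$StorageContext = New-AzStorageContext -StorageAccountName "mystorageaccount" -StorageAccountKey "<your storage account key>"
$SasToken = New-AzStorageContainerSASToken -Name "customsetup" -Permission rwl -ExpiryTime (Get-Date).AddYears(5) -Context $StorageContext
$SetupScriptContainerSasUri = "https://mystorageaccount.blob.core.windows.net/customsetup" + $SasToken
# Point the Azure-SSIS IR at the container that holds main.cmd and its associated files
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
                                      -Name $AzureSSISName `
                                      -SetupScriptContainerSasUri $SetupScriptContainerSasUri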
Next steps
Enterprise Edition of the Azure-SSIS Integration Runtime
How to develop paid or licensed custom components for the Azure-SSIS integration runtime
Install paid or licensed custom components for the
Azure-SSIS integration runtime
1/3/2019 • 3 minutes to read • Edit Online
This article describes how an ISV can develop and install paid or licensed custom components for SQL Server
Integration Services (SSIS ) packages that run in Azure in the Azure-SSIS integration runtime.
The problem
The nature of the Azure-SSIS integration runtime presents several challenges, which make the typical licensing
methods used for the on-premises installation of custom components inadequate. As a result, the Azure-SSIS IR
requires a different approach.
The nodes of the Azure-SSIS IR are volatile and can be allocated or released at any time. For example, you
can start or stop nodes to manage the cost, or scale up and down through various node sizes. As a result,
binding a third-party component license to a particular node by using machine-specific info such as MAC
address or CPU ID is no longer viable.
You can also scale the Azure-SSIS IR in or out, so that the number of nodes can shrink or expand at any
time.
The solution
As a result of the limitations of traditional licensing methods described in the previous section, the Azure-SSIS IR
provides a new solution. This solution uses Windows environment variables and SSIS system variables for the
license binding and validation of third-party components. ISVs can use these variables to obtain unique and
persistent info for an Azure-SSIS IR, such as Cluster ID and Cluster Node Count. With this info, ISVs can then
bind the license for their component to an Azure-SSIS IR as a cluster. This binding uses an ID that doesn't change
when customers start or stop, scale up or down, scale in or out, or reconfigure the Azure-SSIS IR in any way.
The following diagram shows the typical installation, activation and license binding, and validation flows for third-
party components that use these new variables:
Instructions
1. ISVs can offer their licensed components in various SKUs or tiers (for example, single node, up to 5 nodes,
up to 10 nodes, and so forth). The ISV provides the corresponding Product Key when customers purchase a
product. The ISV can also provide an Azure Storage blob container that contains an ISV Setup script and
associated files. Customers can copy these files into their own storage container and modify them with their
own Product Key (for example, by running IsvSetup.exe -pid xxxx-xxxx-xxxx ). Customers can then
provision or reconfigure the Azure-SSIS IR with the SAS URI of their container as a parameter. For more
info, see Custom setup for the Azure-SSIS integration runtime.
2. When the Azure-SSIS IR is provisioned or reconfigured, ISV Setup runs on each node to query the
Windows environment variables, SSIS_CLUSTERID and SSIS_CLUSTERNODECOUNT . Then the Azure-SSIS IR
submits its Cluster ID and the Product Key for the licensed product to the ISV Activation Server to generate
an Activation Key.
3. After receiving the Activation Key, ISV Setup can store the key locally on each node (for example, in the
Registry).
4. When customers run a package that uses the ISV's licensed component on a node of the Azure-SSIS IR,
the package reads the locally stored Activation Key and validates it against the node's Cluster ID. The
package can also optionally report the Cluster Node Count to the ISV activation server.
Here is an example of code that validates the activation key and reports the cluster node count:
Variables vars = null;
// Lock the SSIS system variables that hold the cluster info for reading
variableDispenser.LockForRead("System::ClusterID");
variableDispenser.LockForRead("System::ClusterNodeCount");
variableDispenser.GetVariables(ref vars);
// Read the Cluster ID and Cluster Node Count, validate the locally stored Activation Key against the
// Cluster ID, and optionally report the Cluster Node Count to the ISV activation server
string clusterId = vars["System::ClusterID"].Value.ToString();
int clusterNodeCount = Convert.ToInt32(vars["System::ClusterNodeCount"].Value);
vars.Unlock();
ISV partners
You can find a list of ISV partners who have adapted their components and extensions for the Azure-SSIS IR at
the end of this blog post - Enterprise Edition, Custom Setup, and 3rd Party Extensibility for SSIS in ADF.
Next steps
Custom setup for the Azure-SSIS integration runtime
Enterprise Edition of the Azure-SSIS Integration Runtime
Configure the Azure-SSIS Integration Runtime for
high performance
5/7/2019 • 8 minutes to read • Edit Online
This article describes how to configure an Azure-SSIS Integration Runtime (IR ) for high performance. The Azure-
SSIS IR allows you to deploy and run SQL Server Integration Services (SSIS ) packages in Azure. For more
information about the Azure-SSIS IR, see the Integration runtime article. For information about deploying and running
SSIS packages on Azure, see Lift and shift SQL Server Integration Services workloads to the cloud.
IMPORTANT
This article contains performance results and observations from in-house testing done by members of the SSIS development
team. Your results may vary. Do your own testing before you finalize your configuration settings, which affect both cost and
performance.
Properties to configure
The following portion of a configuration script shows the properties that you can configure when you create an
Azure-SSIS Integration Runtime. For the complete PowerShell script and description, see Deploy SQL Server
Integration Services packages to Azure.
# If your input contains a PSH special character, e.g. "$", precede it with the escape character "`" like "`$"
$SubscriptionName = "[your Azure subscription name]"
$ResourceGroupName = "[your Azure resource group name]"
$DataFactoryName = "[your data factory name]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$DataFactoryLocation = "EastUS"
### Azure-SSIS integration runtime information - This is a Data Factory compute resource for running SSIS
packages
$AzureSSISName = "[specify a name for your Azure-SSIS IR]"
$AzureSSISDescription = "[specify a description for your Azure-SSIS IR]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium features
on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your own
on-premises SQL Server license with Software Assurance to earn cost savings from Azure Hybrid Benefit (AHB)
option
# For a Standard_D1_v2 node, up to 4 parallel executions per node are supported, but for other nodes, up to
max(2 x number of cores, 8) are currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info
$SetupScriptContainerSasUri = "" # OPTIONAL to provide SAS URI of blob container where your custom setup script
and its associated files are stored
# Virtual network info: Classic or Azure Resource Manager
$VnetId = "[your virtual network resource ID or leave it empty]" # REQUIRED if you use Azure SQL Database with
virtual network service endpoints/Managed Instance/on-premises data, Azure Resource Manager virtual network is
recommended, Classic virtual network will be deprecated soon
$SubnetName = "[your subnet name or leave it empty]" # WARNING: Please use the same subnet as the one used with
your Azure SQL Database with virtual network service endpoints or a different subnet than the one used for your
Managed Instance
AzureSSISLocation
AzureSSISLocation is the location for the integration runtime worker node. The worker node maintains a
constant connection to the SSIS Catalog database (SSISDB ) on an Azure SQL database. Set the
AzureSSISLocation to the same location as the SQL Database server that hosts SSISDB, which lets the
integration runtime work as efficiently as possible.
AzureSSISNodeSize
Data Factory, including the Azure-SSIS IR, supports the following options:
Standard_A4_v2
Standard_A8_v2
Standard_D1_v2
Standard_D2_v2
Standard_D3_v2
Standard_D4_v2
Standard_D2_v3
Standard_D4_v3
Standard_D8_v3
Standard_D16_v3
Standard_D32_v3
Standard_D64_v3
Standard_E2_v3
Standard_E4_v3
Standard_E8_v3
Standard_E16_v3
Standard_E32_v3
Standard_E64_v3
In the unofficial in-house testing by the SSIS engineering team, the D series appears to be more suitable for SSIS
package execution than the A series.
The performance/price ratio of the D series is higher than the A series and the performance/price ratio of the v3
series is higher than the v2 series.
The throughput for the D series is higher than the A series at the same price and the throughput for the v3
series is higher than the v2 series at the same price.
The v2 series nodes of Azure-SSIS IR are not suitable for custom setup, so please use the v3 series nodes
instead. If you already use the v2 series nodes, please switch to use the v3 series nodes as soon as possible.
The E series consists of memory-optimized VM sizes that provide a higher memory-to-CPU ratio than the
other series. If your packages require a lot of memory, consider choosing an E-series VM.
Configure for execution speed
If you don't have many packages to run, and you want packages to run quickly, use the information in the following
chart to choose a virtual machine type suitable for your scenario.
This data represents a single package execution on a single worker node. The package loads 3 million records with
first name and last name columns from Azure Blob Storage, generates a full name column, and writes the records
whose full name is longer than 20 characters to Azure Blob Storage.
Configure for overall throughput
If you have lots of packages to run, and you care most about the overall throughput, use the information in the
following chart to choose a virtual machine type suitable for your scenario.
AzureSSISNodeNumber
AzureSSISNodeNumber adjusts the scalability of the integration runtime. The throughput of the integration
runtime is proportional to the AzureSSISNodeNumber. Set the AzureSSISNodeNumber to a small value at
first, monitor the throughput of the integration runtime, then adjust the value for your scenario. To reconfigure the
worker node count, see Manage an Azure-SSIS integration runtime.
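As a minimal sketch, assuming the variables from the configuration script earlier in this article and an IR that has already been stopped, you can change the node count with the Set-AzDataFactoryV2IntegrationRuntime cmdlet and then start the IR again:
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
                                      -Name $AzureSSISName `
                                      -NodeCount 5
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                        -DataFactoryName $DataFactoryName `
                                        -Name $AzureSSISName `
                                        -Force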
AzureSSISMaxParallelExecutionsPerNode
When you're already using a powerful worker node to run packages, increasing
AzureSSISMaxParallelExecutionsPerNode may increase the overall throughput of the integration runtime. For
Standard_D1_v2 nodes, 1-4 parallel executions per node are supported. For all other types of nodes, 1-max(2 x
number of cores, 8) parallel executions per node are supported. If you need a value for
AzureSSISMaxParallelExecutionsPerNode beyond the supported maximum, you can open a support ticket to have
the maximum increased; after that, use Azure PowerShell to update
AzureSSISMaxParallelExecutionsPerNode. You can estimate the appropriate value based on the cost of your
package and the following configurations for the worker nodes. For more information, see General-purpose virtual
machine sizes.
SIZE | VCPU | MEMORY: GIB | TEMP STORAGE (SSD) GIB | MAX TEMP STORAGE THROUGHPUT: IOPS / READ MBPS / WRITE MBPS | MAX DATA DISKS / THROUGHPUT: IOPS | MAX NICS / EXPECTED NETWORK PERFORMANCE (MBPS)
Here are the guidelines for setting the right value for the AzureSSISMaxParallelExecutionsPerNode property:
1. Set it to a small value at first.
2. Increase it by a small amount to check whether the overall throughput is improved.
3. Stop increasing the value when the overall throughput reaches the maximum value.
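As a minimal sketch, assuming the variables from the configuration script earlier in this article and an IR that has been stopped for reconfiguration, the property can be adjusted with Azure PowerShell as follows:
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
                                      -Name $AzureSSISName `
                                      -MaxParallelExecutionsPerNode 16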
SSISDBPricingTier
SSISDBPricingTier is the pricing tier for the SSIS Catalog database (SSISDB ) on an Azure SQL database. This
setting affects the maximum number of workers in the IR instance, the speed to queue a package execution, and
the speed to load the execution log.
If you don't care about the speed to queue package execution and to load the execution log, you can choose
the lowest database pricing tier. Azure SQL Database with Basic pricing supports 8 workers in an
integration runtime instance.
Choose a more powerful database than Basic if the worker count is more than 8, or the core count is more
than 50. Otherwise the database becomes the bottleneck of the integration runtime instance and the overall
performance is negatively impacted.
Choose a more powerful database such as S3 if the logging level is set to verbose. According to our unofficial
in-house testing, the S3 pricing tier can support SSIS package execution with 2 nodes, a parallel count of 128,
and the verbose logging level.
You can also adjust the database pricing tier based on database transaction unit (DTU ) usage information available
on the Azure portal.
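For example, you can scale SSISDB to another pricing tier with the Az.Sql module. The following is a minimal sketch, assuming a placeholder name for the logical server that hosts SSISDB:
Set-AzSqlDatabase -ResourceGroupName $ResourceGroupName `
                  -ServerName "myssisdbserver" `
                  -DatabaseName "SSISDB" `
                  -RequestedServiceObjectiveName "S3"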
This article describes how to configure the Azure-SSIS Integration Runtime with Azure SQL Database geo-
replication for the SSISDB database. When a failover occurs, you can ensure that the Azure-SSIS IR keeps working
with the secondary database.
For more info about geo-replication and failover for SQL Database, see Overview: Active geo-replication and auto-
failover groups.
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
For more info about this PowerShell command, see Create the Azure-SSIS integration runtime in Azure
Data Factory
3. Start the IR again.
2. Create a new data factory named <new_data_factory_name> in the new region. For more info, see Create
a data factory.
For more info about this PowerShell command, see Create an Azure data factory using PowerShell
3. Create a new Azure-SSIS IR named <new_integration_runtime_name> in the new region using Azure
PowerShell.
For more info about this PowerShell command, see Create the Azure-SSIS integration runtime in Azure
Data Factory
4. Start the IR again.
Next steps
Consider these other configuration options for the Azure-SSIS IR:
Configure the Azure-SSIS Integration Runtime for high performance
Customize setup for the Azure-SSIS integration runtime
Provision Enterprise Edition for the Azure-SSIS Integration Runtime
Clean up SSISDB logs with Azure Elastic Database
Jobs
3/5/2019 • 8 minutes to read • Edit Online
This article describes how to use Azure Elastic Database Jobs to trigger the stored procedure that cleans up logs
for the SQL Server Integration Services catalog database, SSISDB .
Elastic Database Jobs is an Azure service that makes it easy to automate and run jobs against a database or a
group of databases. You can schedule, run, and monitor these jobs by using the Azure portal, Transact-SQL,
PowerShell, or REST APIs. Use the Elastic Database Job to trigger the stored procedure for log cleanup one time or
on a schedule. You can choose the schedule interval based on SSISDB resource usage to avoid heavy database
load.
For more info, see Manage groups of databases with Elastic Database Jobs.
The following sections describe how to trigger the stored procedure
[internal].[cleanup_server_retention_window_exclusive] , which removes SSISDB logs that are outside the
retention window set by the administrator.
The following sample PowerShell scripts create a new Elastic Job to trigger the stored procedure for SSISDB log
cleanup. For more info, see Create an Elastic Job agent using PowerShell.
Create parameters
# Parameters needed to create the Job Database
param(
$ResourceGroupName = $(Read-Host "Please enter an existing resource group name"),
$AgentServerName = $(Read-Host "Please enter the name of an existing Azure SQL server(for example, yhxserver)
to hold the SSISDBLogCleanup job database"),
$SSISDBLogCleanupJobDB = $(Read-Host "Please enter a name for the Job Database to be created in the given SQL
Server"),
# The Job Database should be a clean, empty database at the S0 or higher service tier. We set S0 as the default.
$PricingTier = "S0",
# Parameters needed to create the job credential in the Job Database to connect to SSISDB
$PasswordForSSISDBCleanupUser = $(Read-Host "Please provide a new password for SSISDBLogCleanup job user to
connect to SSISDB database for log cleanup"),
# Parameters needed to create a login and a user in the SSISDB of the target server
$SSISDBServerEndpoint = $(Read-Host "Please enter the name of the target Azure SQL server which contains SSISDB
you need to clean up, for example, myserver") + '.database.windows.net',
$SSISDBServerAdminUserName = $(Read-Host "Please enter the target server admin username for SQL
authentication"),
$SSISDBServerAdminPassword = $(Read-Host "Please enter the target server admin password for SQL
authentication"),
$SSISDBName = "SSISDB",
# Parameters needed to set job scheduling to trigger execution of cleanup stored procedure
$RunJobOrNot = $(Read-Host "Please indicate whether you want to run the job to clean up SSISDB logs outside the
log retention window immediately (Y/N). Make sure the retention window is set appropriately before running the
following PowerShell scripts. Removed SSISDB logs cannot be recovered"),
$IntervalType = $(Read-Host "Please enter the interval type for the execution schedule of SSISDB log cleanup
stored procedure. For the interval type, Year, Month, Day, Hour, Minute, and Second are supported."),
$IntervalCount = $(Read-Host "Please enter the detailed interval value in the given interval type for the
execution schedule of SSISDB log cleanup stored procedure"),
# StartTime of the execution schedule is set as the current time as default.
$StartTime = (Get-Date)
# Install the latest PackageManagement PowerShell package, which PowerShellGet v1.6.5 depends on
Find-Package PackageManagement -RequiredVersion 1.1.7.2 | Install-Package -Force
# You may need to restart the PowerShell session
# Install the latest PowerShellGet module, which adds the -AllowPrerelease flag to Install-Module
Find-Package PowerShellGet -RequiredVersion 1.6.5 | Install-Package -Force
# Place AzureRM.Sql preview cmdlets side by side with existing AzureRM.Sql version
Install-Module -Name AzureRM.Sql -AllowPrerelease -Force
# Create a Job Database which is used for defining jobs of triggering SSISDB log cleanup stored procedure and
tracking cleanup history of jobs
Write-Output "Creating a blank SQL database to be used as the SSISDBLogCleanup Job Database ..."
$JobDatabase = New-AzureRmSqlDatabase -ResourceGroupName $ResourceGroupName -ServerName $AgentServerName -
DatabaseName $SSISDBLogCleanupJobDB -RequestedServiceObjectiveName $PricingTier
$JobDatabase
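# The steps below reference $JobAgent. As a minimal sketch (the cmdlet and agent name here are assumptions
# based on the preview AzureRM.Sql Elastic Jobs cmdlets), create the Elastic Job agent on the Job Database:
Write-Output "Creating the Elastic Job agent on the Job Database..."
$JobAgent = New-AzureRmSqlElasticJobAgent -ResourceGroupName $ResourceGroupName -ServerName $AgentServerName -DatabaseName $SSISDBLogCleanupJobDB -Name "SSISDBLogCleanupAgent"
$JobAgent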
# Create the job credential in the Job Database to connect to SSISDB database in the target server for log
cleanup
Write-Output "Creating job credential to connect to SSISDB database..."
$JobCredSecure = ConvertTo-SecureString -String $PasswordForSSISDBCleanupUser -AsPlainText -Force
$JobCred = New-Object -TypeName "System.Management.Automation.PSCredential" -ArgumentList
"SSISDBLogCleanupUser", $JobCredSecure
$JobCred = $JobAgent | New-AzureRmSqlElasticJobCredential -Name "SSISDBLogCleanupUser" -Credential $JobCred
# In the master database of the target SQL server that contains the SSISDB to clean up
# - Create the job user login
Write-Output "Grant permissions on the master database of the target server..."
$Params = @{
'Database' = 'master'
'ServerInstance' = $SSISDBServerEndpoint
'Username' = $SSISDBServerAdminUserName
'Password' = $SSISDBServerAdminPassword
'OutputSqlErrors' = $true
'Query' = "CREATE LOGIN SSISDBLogCleanupUser WITH PASSWORD = '" + $PasswordForSSISDBCleanupUser + "'"
}
Invoke-SqlCmd @Params
$TargetDatabase | % {
$Params.Database = $_
$Params.Query = $CreateJobUser
Invoke-SqlCmd @Params
$Params.Query = $GrantStoredProcedureExecution
Invoke-SqlCmd @Params
}
# Create the job to trigger execution of SSISDB log cleanup stored procedure
Write-Output "Creating a new job to trigger execution of the stored procedure for SSISDB log cleanup"
$JobName = "CleanupSSISDBLog"
$Job = $JobAgent | New-AzureRmSqlElasticJob -Name $JobName -RunOnce
$Job
# Run the job immediately to start the cleanup stored procedure execution once
IF(($RunJobOrNot -eq "Y") -Or ($RunJobOrNot -eq "y"))
{
Write-Output "Start a new execution of the stored procedure for SSISDB log cleanup immediately..."
$JobExecution = $Job | Start-AzureRmSqlElasticJob
$JobExecution
}
# Schedule the job to trigger stored procedure execution on a recurring schedule, removing SSISDB logs outside
the retention window
Write-Output "Start the execution schedule of the stored procedure for SSISDB log cleanup..."
$Job | Set-AzureRmSqlElasticJob -IntervalType $IntervalType -IntervalCount $IntervalCount -StartTime $StartTime
-Enable
Clean up logs with Transact-SQL
The following sample Transact-SQL scripts create a new Elastic Job to trigger the stored procedure for SSISDB log
cleanup. For more info, see Use Transact-SQL (T-SQL ) to create and manage Elastic Database Jobs.
1. Create or identify an empty S0 or higher Azure SQL Database to be the SSISDBCleanup Job Database.
Then create an Elastic Job Agent in the Azure portal.
2. In the Job Database, create a credential for the SSISDB log cleanup job. This credential is used to connect to
your SSISDB database to clean up the logs.
-- Connect to the job database specified when creating the job agent
-- Create a database master key if one does not already exist, using your own password.
CREATE MASTER KEY ENCRYPTION BY PASSWORD= '<EnterStrongPasswordHere>';
3. Define the target group that includes the SSISDB database for which you want to run the cleanup stored
procedure.
--View the recently created target group and target group members
SELECT * FROM jobs.target_groups WHERE target_group_name = 'SSISDBTargetGroup';
SELECT * FROM jobs.target_group_members WHERE target_group_name = 'SSISDBTargetGroup';
4. Grant appropriate permissions for the SSISDB database. The SSISDB catalog must have proper
permissions for the stored procedure to run SSISDB log cleanup successfully. For detailed guidance, see
Manage logins.
5. Create the job and add a job step to trigger the execution of the stored procedure for SSISDB log cleanup.
--Connect to the job database
--Add the job for the execution of SSISDB log cleanup stored procedure.
EXEC jobs.sp_add_job @job_name='CleanupSSISDBLog', @description='Remove SSISDB logs which are outside
the retention window'
6. Before you continue, make sure the retention window has been set appropriately. SSISDB logs outside the
window are deleted and can't be recovered.
Then you can run the job immediately to begin SSISDB log cleanup.
7. Optionally, schedule job executions to remove SSISDB logs outside the retention window on a schedule.
Use a similar statement to update the job parameters.
Next steps
For management and monitoring tasks related to the Azure-SSIS Integration Runtime, see the following articles.
The Azure-SSIS IR is the runtime engine for SSIS packages stored in SSISDB in Azure SQL Database.
Reconfigure the Azure-SSIS integration runtime
Monitor the Azure-SSIS integration runtime.
Create a trigger that runs a pipeline in response to an
event
3/7/2019 • 3 minutes to read • Edit Online
This article describes the event-based triggers that you can create in your Data Factory pipelines.
Event-driven architecture (EDA) is a common data integration pattern that involves production, detection,
consumption, and reaction to events. Data integration scenarios often require Data Factory customers to trigger
pipelines based on events. Data Factory is now integrated with Azure Event Grid, which lets you trigger pipelines
on an event.
For a ten-minute introduction and demonstration of this feature, watch the following video:
NOTE
The integration described in this article depends on Azure Event Grid. Make sure that your subscription is registered with the
Event Grid resource provider. For more info, see Resource providers and types.
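The following is a minimal sketch of checking and registering the provider with the Az.Resources module:
# Check the registration state of the Event Grid resource provider
Get-AzResourceProvider -ProviderNamespace Microsoft.EventGrid | Select-Object ProviderNamespace, RegistrationState
# Register the provider if it isn't registered yet
Register-AzResourceProvider -ProviderNamespace Microsoft.EventGrid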
Data Factory UI
Create a new event trigger
A typical event is the arrival of a file, or the deletion of a file, in your Azure Storage account. You can create a
trigger that responds to this event in your Data Factory pipeline.
NOTE
This integration supports only version 2 Storage accounts (General purpose).
Configure the event trigger
With the Blob path begins with and Blob path ends with properties, you can specify the containers, folders,
and blob names for which you want to receive events. You can use a variety of patterns for both Blob path begins
with and Blob path ends with properties, as shown in the examples later in this article. At least one of these
properties is required.
For example, in the preceding screenshot, the trigger is configured to fire when a blob path ending in .csv is
created in the Storage Account. As a result, when a blob with the .csv extension is created anywhere in the
Storage Account, the folderPath and fileName properties capture the location of the new blob. For example,
@triggerBody().folderPath has a value like /containername/foldername/nestedfoldername and
@triggerBody().fileName has a value like filename.csv . These values are mapped in the example to the pipeline
parameters sourceFolder and sourceFile . You can use them throughout the pipeline as
@pipeline().parameters.sourceFolder and @pipeline().parameters.sourceFile respectively.
JSON schema
The following table provides an overview of the schema elements that are related to event-based triggers:
IMPORTANT
You have to include the /blobs/ segment of the path, as shown in the following examples, whenever you specify container
and folder, container and file, or container, folder, and file.
PROPERTY | EXAMPLE | DESCRIPTION
Blob path begins with | /containername/ | Receives events for any blob in the container.
Blob path begins with | /containername/blobs/foldername/ | Receives events for any blobs in the containername container and foldername folder.
Blob path ends with | file.txt | Receives events for a blob named file.txt in any path.
Blob path ends with | /containername/blobs/file.txt | Receives events for a blob named file.txt under container containername.
Blob path ends with | foldername/file.txt | Receives events for a blob named file.txt in the foldername folder under any container.
Next steps
For detailed information about triggers, see Pipeline execution and triggers.
Create a trigger that runs a pipeline on a schedule
5/6/2019 • 17 minutes to read • Edit Online
This article provides information about the schedule trigger and the steps to create, start, and monitor a schedule
trigger. For other types of triggers, see Pipeline execution and triggers.
When creating a schedule trigger, you specify a schedule (start date, recurrence, end date, and so on) for the trigger, and
associate it with a pipeline. Pipelines and triggers have a many-to-many relationship. Multiple triggers can kick off a
single pipeline. A single trigger can kick off multiple pipelines.
The following sections provide steps to create a schedule trigger in different ways.
Data Factory UI
You can create a schedule trigger to schedule a pipeline to run periodically (hourly, daily, etc.).
NOTE
For a complete walkthrough of creating a pipeline and a schedule trigger, associating the trigger with the pipeline, and
running and monitoring the pipeline, see Quickstart: create a data factory using Data Factory UI.
7. Click Publish to publish changes to Data Factory. Until you publish changes to Data Factory, the trigger
does not start triggering the pipeline runs.
8. Switch to the Monitor tab on the left. Click Refresh to refresh the list. You see the pipeline runs triggered
by the scheduled trigger. Notice the values in the Triggered By column. If you use the Trigger Now option,
you see the manual trigger run in the list.
9. Click the down-arrow next to Pipeline Runs to switch to the Trigger Runs view.
Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
This section shows you how to use Azure PowerShell to create, start, and monitor a schedule trigger. To see this
sample working, first go through the Quickstart: Create a data factory by using Azure PowerShell. Then, add the
following code to the main method, which creates and starts a schedule trigger that runs every 15 minutes. The
trigger is associated with a pipeline named Adfv2QuickStartPipeline that you create as part of the Quickstart.
1. Create a JSON file named MyTrigger.json in the C:\ADFv2QuickStartPSH\ folder with the following
content:
IMPORTANT
Before you save the JSON file, set the value of the startTime element to the current UTC time. Set the value of the
endTime element to one hour past the current UTC time.
{
"properties": {
"name": "MyTrigger",
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Minute",
"interval": 15,
"startTime": "2017-12-08T00:00:00",
"endTime": "2017-12-08T01:00:00"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "Adfv2QuickStartPipeline"
},
"parameters": {
"inputPath": "adftutorial/input",
"outputPath": "adftutorial/output"
}
}
]
}
}
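2. Create the trigger by using the Set-AzDataFactoryV2Trigger cmdlet. The following is a minimal sketch,
assuming the $ResourceGroupName and $DataFactoryName variables from the Quickstart and the JSON file from step 1:
Set-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger" -DefinitionFile "C:\ADFv2QuickStartPSH\MyTrigger.json"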
3. Confirm that the status of the trigger is Stopped by using the Get-AzDataFactoryV2Trigger cmdlet:
Get-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -
Name "MyTrigger"
5. Confirm that the status of the trigger is Started by using the Get-AzDataFactoryV2Trigger cmdlet:
6. Get the trigger runs in Azure PowerShell by using the Get-AzDataFactoryV2TriggerRun cmdlet. To get
the information about the trigger runs, execute the following command periodically. Update the
TriggerRunStartedAfter and TriggerRunStartedBefore values to match the values in your trigger
definition:
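A minimal sketch of that command, assuming the sample start and end times from the trigger definition above:
Get-AzDataFactoryV2TriggerRun -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -TriggerName "MyTrigger" -TriggerRunStartedAfter "2017-12-08T00:00:00" -TriggerRunStartedBefore "2017-12-08T01:00:00"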
To monitor the trigger runs and pipeline runs in the Azure portal, see Monitor pipeline runs.
.NET SDK
This section shows you how to use the .NET SDK to create, start, and monitor a trigger. To see this sample
working, first go through the Quickstart: Create a data factory by using the .NET SDK. Then, add the following
code to the main method, which creates and starts a schedule trigger that runs every 15 minutes. The trigger is
associated with a pipeline named Adfv2QuickStartPipeline that you create as part of the Quickstart.
To create and start a schedule trigger that runs every 15 minutes, add the following code to the main method:
// Create the trigger
Console.WriteLine("Creating the trigger");
To monitor a trigger run, add the following code before the last Console.WriteLine statement in the sample:
// Check that the trigger runs every 15 minutes
Console.WriteLine("Trigger runs. You see the output every 15 minutes");
To monitor the trigger runs and pipeline runs in the Azure portal, see Monitor pipeline runs.
Python SDK
This section shows you how to use the Python SDK to create, start, and monitor a trigger. To see this sample
working, first go through the Quickstart: Create a data factory by using the Python SDK. Then, add the following
code block after the "monitor the pipeline run" code block in the Python script. This code creates a schedule trigger
that runs every 15 minutes between the specified start and end times. Update the start_time variable to the
current UTC time, and the end_time variable to one hour past the current UTC time.
# Create a trigger
tr_name = 'mytrigger'
scheduler_recurrence = ScheduleTriggerRecurrence(frequency='Minute', interval='15',start_time='2017-12-
12T04:00:00', end_time='2017-12-12T05:00:00', time_zone='UTC')
pipeline_parameters = {'inputPath':'adftutorial/input', 'outputPath':'adftutorial/output'}
pipelines_to_run = []
pipeline_reference = PipelineReference('copyPipeline')
pipelines_to_run.append(TriggerPipelineReference(pipeline_reference, pipeline_parameters))
tr_properties = ScheduleTrigger(description='My scheduler trigger', pipelines = pipelines_to_run,
recurrence=scheduler_recurrence)
adf_client.triggers.create_or_update(rg_name, df_name, tr_name, tr_properties)
To monitor the trigger runs and pipeline runs in the Azure portal, see Monitor pipeline runs.
"parameters": {
"scheduledRunTime": "@trigger().scheduledTime"
}
For more information, see the instructions in How to read or write partitioned data.
JSON schema
The following JSON definition shows you how to create a schedule trigger with scheduling and recurrence:
{
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": <<Minute, Hour, Day, Week, Month>>,
"interval": <<int>>, // Optional, specifies how often to fire (default to 1)
"startTime": <<datetime>>,
"endTime": <<datetime - optional>>,
"timeZone": "UTC"
"schedule": { // Optional (advanced scheduling specifics)
"hours": [<<0-23>>],
"weekDays": [<<Monday-Sunday>>],
"minutes": [<<0-59>>],
"monthDays": [<<1-31>>],
"monthlyOccurrences": [
{
"day": <<Monday-Sunday>>,
"occurrence": <<1-5>>
}
]
}
}
},
"pipelines": [
{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "<Name of your pipeline>"
},
"parameters": {
"<parameter 1 Name>": {
"type": "Expression",
"value": "<parameter 1 Value>"
},
"<parameter 2 Name>" : "<parameter 2 Value>"
}
}
]
}
}
IMPORTANT
The parameters property is a mandatory property of the pipelines element. If your pipeline doesn't take any parameters,
you must include an empty JSON definition for the parameters property.
Schema overview
The following table provides a high-level overview of the major schema elements that are related to recurrence
and scheduling of a trigger:
endTime | The end date and time for the trigger. The trigger doesn't execute after the specified end date and time. The value for the property can't be in the past. This property is optional.
timeZone | The time zone. Currently, only the UTC time zone is supported.
recurrence | A recurrence object that specifies the recurrence rules for the trigger. The recurrence object supports the frequency, interval, endTime, count, and schedule elements. When a recurrence object is defined, the frequency element is required. The other elements of the recurrence object are optional.
interval | A positive integer that denotes the interval for the frequency value, which determines how often the trigger runs. For example, if the interval is 3 and the frequency is "week," the trigger recurs every 3 weeks.
startTime property
The following table shows you how the startTime property controls a trigger run:
Start time in past
Without a schedule: Calculates the first future execution time after the start time and runs at that time. Runs subsequent executions based on calculating from the last execution time. See the example that follows this table.
With a schedule: The trigger starts no sooner than the specified start time. The first occurrence is based on the schedule that's calculated from the start time. Runs subsequent executions based on the recurrence schedule.
Start time in future or at present
Without a schedule: Runs once at the specified start time. Runs subsequent executions based on calculating from the last execution time.
With a schedule: The trigger starts no sooner than the specified start time. The first occurrence is based on the schedule that's calculated from the start time.
Let's see an example of what happens when the start time is in the past, with a recurrence, but no schedule.
Assume that the current time is 2017-04-08 13:00 , the start time is 2017-04-07 14:00 , and the recurrence is every
two days. (The recurrence value is defined by setting the frequency property to "day" and the interval property
to 2.) Notice that the startTime value is in the past and occurs before the current time.
Under these conditions, the first execution is at 2017-04-09 at 14:00 . The Scheduler engine calculates execution
occurrences from the start time. Any instances in the past are discarded. The engine uses the next instance that
occurs in the future. In this scenario, the start time is 2017-04-07 at 2:00pm , so the next instance is two days from
that time, which is 2017-04-09 at 2:00pm .
The first execution time is the same even if the startTime value is 2017-04-05 14:00 or 2017-04-01 14:00 . After
the first execution, subsequent executions are calculated by using the schedule. Therefore, the subsequent
executions are at 2017-04-11 at 2:00pm , then 2017-04-13 at 2:00pm , then 2017-04-15 at 2:00pm , and so on.
Finally, when the hours or minutes aren’t set in the schedule for a trigger, the hours or minutes of the first
execution are used as the defaults.
schedule property
On one hand, the use of a schedule can limit the number of trigger executions. For example, if a trigger with a
monthly frequency is scheduled to run only on day 31, the trigger runs only in those months that have a 31st day.
On the other hand, a schedule can also expand the number of trigger executions. For example, a trigger with a monthly
frequency that's scheduled to run on month days 1 and 2, runs on the 1st and 2nd days of the month, rather than
once a month.
If multiple schedule elements are specified, the order of evaluation is from the largest to the smallest schedule
setting. The evaluation starts with week number, and then month day, weekday, hour, and finally, minute.
The following table describes the schedule elements in detail:
weekDays | Days of the week on which the trigger runs. The value can be specified with a weekly frequency only. | Allowed values: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday; an array of day values (maximum array size is 7); day values are not case-sensitive.
monthDays | Day of the month on which the trigger runs. The value can be specified with a monthly frequency only. | Allowed values: any value <= -1 and >= -31; any value >= 1 and <= 31; an array of values.
EXAMPLE DESCRIPTION
{"minutes":[15,45], "hours":[5,17]}
    Run at 5:15 AM, 5:45 AM, 5:15 PM, and 5:45 PM every day.
{"hours":[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]}
    Run every hour. The minutes are controlled by the startTime value, when a value is specified. If a value is not specified, the minutes are controlled by the creation time. For example, if the start time or creation time (whichever applies) is 12:25 PM, the trigger runs at 00:25, 01:25, 02:25, ..., and 23:25.
{"minutes":[0]}
    Run every hour on the hour. This trigger runs every hour on the hour starting at 12:00 AM, 1:00 AM, 2:00 AM, and so on.
{"minutes":[15]}
    Run at 15 minutes past every hour. This trigger runs every hour at 15 minutes past the hour starting at 12:15 AM, 1:15 AM, 2:15 AM, and so on, and ending at 11:15 PM.
{"hours":[17], "weekDays":["monday", "wednesday", "friday"]}
    Run at 5:00 PM on Monday, Wednesday, and Friday every week.
{"minutes":[15,45], "hours":[17], "weekDays":["monday", "wednesday", "friday"]}
    Run at 5:15 PM and 5:45 PM on Monday, Wednesday, and Friday every week.
{"minutes":[0,15,30,45], "hours":[9, 10, 11, 12, 13, 14, 15, 16], "weekDays":["monday", "tuesday", "wednesday", "thursday", "friday"]}
    Run every 15 minutes on weekdays between 9:00 AM and 4:45 PM.
{"weekDays":["tuesday", "thursday"]}
    Run on Tuesdays and Thursdays at the specified start time.
{"minutes":[0], "hours":[6], "monthDays":[28]}
    Run at 6:00 AM on the 28th day of every month (assuming a frequency value of "month").
{"minutes":[0], "hours":[6], "monthDays":[-1]}
    Run at 6:00 AM on the last day of the month. To run a trigger on the last day of a month, use -1 instead of day 28, 29, 30, or 31.
{"minutes":[0], "hours":[6], "monthDays":[1,-1]}
    Run at 6:00 AM on the first and last day of every month.
{"monthDays":[1,14]}
    Run on the first and 14th day of every month at the specified start time.
{"minutes":[0], "hours":[5], "monthlyOccurrences":[{"day":"friday", "occurrence":1}]}
    Run on the first Friday of every month at 5:00 AM.
{"monthlyOccurrences":[{"day":"friday", "occurrence":1}]}
    Run on the first Friday of every month at the specified start time.
{"monthlyOccurrences":[{"day":"friday", "occurrence":-3}]}
    Run on the third Friday from the end of the month, every month, at the specified start time.
{"minutes":[15], "hours":[5], "monthlyOccurrences":[{"day":"friday", "occurrence":1},{"day":"friday", "occurrence":-1}]}
    Run on the first and last Friday of every month at 5:15 AM.
{"monthlyOccurrences":[{"day":"friday", "occurrence":1},{"day":"friday", "occurrence":-1}]}
    Run on the first and last Friday of every month at the specified start time.
{"monthlyOccurrences":[{"day":"friday", "occurrence":5}]}
    Run on the fifth Friday of every month at the specified start time. When there's no fifth Friday in a month, the pipeline doesn't run, since it's scheduled to run only on fifth Fridays. To run the trigger on the last occurring Friday of the month, consider using -1 instead of 5 for the occurrence value.
{"minutes":[0,15,30,45], "monthlyOccurrences":[{"day":"friday", "occurrence":-1}]}
    Run every 15 minutes on the last Friday of the month.
{"minutes":[15,45], "hours":[5,17], "monthlyOccurrences":[{"day":"wednesday", "occurrence":3}]}
    Run at 5:15 AM, 5:45 AM, 5:15 PM, and 5:45 PM on the third Wednesday of every month.
Next steps
For detailed information about triggers, see Pipeline execution and triggers.
Create a trigger that runs a pipeline on a tumbling
window
This article provides steps to create, start, and monitor a tumbling window trigger. For general information about
triggers and the supported types, see Pipeline execution and triggers.
Tumbling window triggers are a type of trigger that fires at a periodic time interval from a specified start time,
while retaining state. Tumbling windows are a series of fixed-sized, non-overlapping, and contiguous time
intervals. A tumbling window trigger has a one-to-one relationship with a pipeline and can reference only a
single pipeline.
Data Factory UI
To create a tumbling window trigger in the Azure portal, select Trigger > Tumbling window > Next, and then
configure the properties that define the tumbling window.
The following JSON definition shows the major elements that are related to recurrence and scheduling of a
tumbling window trigger:
{
"name": "MyTriggerName",
"properties": {
"type": "TumblingWindowTrigger",
...
"pipeline": {
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "MyPipelineName"
},
"parameters": {
"MyWindowStart": {
"type": "Expression",
"value": "@{concat('output',formatDateTime(trigger().outputs.windowStartTime,'-dd-MM-yyyy-
HH-mm-ss-ffff'))}"
},
"MyWindowEnd": {
"type": "Expression",
"value": "@{concat('output',formatDateTime(trigger().outputs.windowEndTime,'-dd-MM-yyyy-
HH-mm-ss-ffff'))}"
}
}
}
}
}
To use the WindowStart and WindowEnd system variable values in the pipeline definition, use your
"MyWindowStart" and "MyWindowEnd" parameters, accordingly.
Execution order of windows in a backfill scenario
When there are multiple windows up for execution (especially in a backfill scenario), the order of execution for
windows is deterministic, from oldest to newest intervals. Currently, this behavior can't be modified.
Existing TriggerResource elements
The following points apply to existing TriggerResource elements:
If the value for the frequency element (or window size) of the trigger changes, the state of the windows that
are already processed is not reset. The trigger continues to fire for the windows from the last window that it
executed by using the new window size.
If the value for the endTime element of the trigger changes (added or updated), the state of the windows that
are already processed is not reset. The trigger honors the new endTime value. If the new endTime value is
before the windows that are already executed, the trigger stops. Otherwise, the trigger stops when the new
endTime value is encountered.
This section shows you how to use Azure PowerShell to create, start, and monitor a trigger.
1. Create a JSON file named MyTrigger.json in the C:\ADFv2QuickStartPSH\ folder with the following
content:
IMPORTANT
Before you save the JSON file, set the value of the startTime element to the current UTC time. Set the value of the
endTime element to one hour past the current UTC time.
{
"name": "PerfTWTrigger",
"properties": {
"type": "TumblingWindowTrigger",
"typeProperties": {
"frequency": "Minute",
"interval": "15",
"startTime": "2017-09-08T05:30:00Z",
"delay": "00:00:01",
"retryPolicy": {
"count": 2,
"intervalInSeconds": 30
},
"maxConcurrency": 50
},
"pipeline": {
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "DynamicsToBlobPerfPipeline"
},
"parameters": {
"windowStart": "@trigger().outputs.windowStartTime",
"windowEnd": "@trigger().outputs.windowEndTime"
}
},
"runtimeState": "Started"
}
}
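The next step in the quickstart deploys this definition with the Set-AzDataFactoryV2Trigger cmdlet. The following is a minimal sketch, assuming the Az.DataFactory module is installed and that $resourceGroupName and $dataFactoryName are placeholders for an existing resource group and data factory:
# Deploy the trigger definition from the JSON file created above.
Set-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -Name "PerfTWTrigger" `
    -DefinitionFile "C:\ADFv2QuickStartPSH\MyTrigger.json"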
3. Confirm that the status of the trigger is Stopped by using the Get-AzDataFactoryV2Trigger cmdlet:
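A minimal sketch of that check, using the same placeholder variables as above:
# RuntimeState should report Stopped at this point.
Get-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -Name "PerfTWTrigger"
The quickstart then starts the trigger so that it begins firing for each tumbling window; again a sketch with placeholder values:
# Start the trigger.
Start-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -Name "PerfTWTrigger"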
5. Confirm that the status of the trigger is Started by using the Get-AzDataFactoryV2Trigger cmdlet:
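The same call as before works here; RuntimeState should now read Started (placeholder names as above):
Get-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -Name "PerfTWTrigger"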
To monitor trigger runs and pipeline runs in the Azure portal, see Monitor pipeline runs.
Next steps
For detailed information about triggers, see Pipeline execution and triggers.
Templates
Templates are predefined Azure Data Factory pipelines that allow you to get started quickly with Data Factory.
They are especially useful when you're new to the service, and they reduce the development time for building
data integration projects, thereby improving developer productivity.
2. On the Author tab in Resource Explorer, select +, then Pipeline from template to open the template
gallery.
Template Gallery
You can view pipelines saved as templates in the My Templates section of the Template Gallery. You can also see
them in the Templates section in the Resource Explorer.
NOTE
To use the My Templates feature, you have to enable GIT integration. Both Azure DevOps GIT and GitHub are supported.
Copy files from multiple containers with Azure Data
Factory
This article describes a solution template that you can use to copy files from multiple containers between file
stores. For example, you can use it to migrate your data lake from AWS S3 to Azure Data Lake Store. Or, you
could use the template to replicate everything from one Azure Blob storage account to another.
NOTE
If you want to copy files from a single container, it's more efficient to use the Copy Data Tool to create a pipeline with a single
copy activity. The template in this article is more than you need for that simple scenario.
Next steps
Bulk copy from a database by using a control table with Azure Data Factory
Copy files from multiple containers with Azure Data Factory
Copy new and changed files by LastModifiedDate
with Azure Data Factory
This article describes a solution template that you can use to copy new and changed files only by
LastModifiedDate from a file-based store to a destination store.
5. You will see the pipeline available in the panel, as shown in the following example:
6. Select Debug, enter the values for the Parameters, and select Finish. In the following example, the
parameters are set as follows:
FolderPath_Source = /source/
FolderPath_Destination = /destination/
LastModified_From = 2019-02-01T00:00:00Z
LastModified_To = 2019-03-01T00:00:00Z
This example indicates that files last modified between 2019-02-01T00:00:00Z and 2019-03-01T00:00:00Z
will be copied from the /source/ folder to the /destination/ folder. You can replace these values with your
own parameters.
7. Review the result. You will see that only the files last modified within the configured timespan have been
copied to the destination store.
8. Now you can add a tumbling window trigger to automate this pipeline, so that it periodically copies only
new and changed files by LastModifiedDate. Select Add trigger, and select New/Edit.
10. Select Tumbling Window as the trigger type, set Every 15 minute(s) as the recurrence (you can change it
to any interval), and then select Next.
11. Enter the values for the Trigger Run Parameters as follows (a JSON sketch of how these values land in the trigger definition appears after this list), and then select Finish.
FolderPath_Source = /source/. You can replace this with the folder in your source data store.
FolderPath_Destination = /destination/. You can replace this with the folder in your destination data store.
LastModified_From = @trigger().outputs.windowStartTime. This system variable from the trigger captures the
start of the current window, which is the time when the pipeline was last triggered.
LastModified_To = @trigger().outputs.windowEndTime. This system variable from the trigger captures the
end of the current window, which is the time when the pipeline is triggered this time.
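The sketch below shows roughly how these Trigger Run Parameters are bound in the trigger's pipeline reference, mirroring the trigger JSON earlier in this article; this is an illustration, and the pipeline name is a placeholder:
"pipeline": {
    "pipelineReference": {
        "type": "PipelineReference",
        "referenceName": "CopyNewFilesByLastModifiedDate"
    },
    "parameters": {
        "FolderPath_Source": "/source/",
        "FolderPath_Destination": "/destination/",
        "LastModified_From": "@trigger().outputs.windowStartTime",
        "LastModified_To": "@trigger().outputs.windowEndTime"
    }
}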
12. Select Publish All.
13. Create new files in the source folder of your source data store. Then wait for the pipeline to be triggered
automatically; only the new files will be copied to the destination store.
14. Select the Monitoring tab in the left navigation panel, and wait for about 15 minutes if the trigger
recurrence has been set to every 15 minutes.
15. Review the result. You will see that the pipeline is triggered automatically every 15 minutes, and that only
the new or changed files from the source store are copied to the destination store in each pipeline run.
Next steps
Introduction to Azure Data Factory
Bulk copy from a database with a control table
To copy data from a data warehouse in Oracle Server, Netezza, Teradata, or SQL Server to Azure SQL Data
Warehouse, you have to load huge amounts of data from multiple tables. Usually, the data has to be partitioned in
each table so that you can load rows with multiple threads in parallel from a single table. This article describes a
template to use in these scenarios.
NOTE
If you want to copy data from a small number of tables with relatively small data volume to SQL Data
Warehouse, it's more efficient to use the Azure Data Factory Copy Data tool. The template that's described in
this article is more than you need for that scenario.
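Before the template steps below, you need an external control table that lists the source tables (and, for large tables, partition filters) so that the pipeline can fan out copy activities; step 1, referenced below, creates it. The following SQL is only an assumption-level sketch: the table and column names are illustrative, and the exact schema comes from the template itself.
CREATE TABLE dbo.BulkCopyControlTable
(
    Id INT,                          -- partition or batch identifier (illustrative)
    SourceTableName VARCHAR(255),    -- table to copy from the source database
    FilterQuery VARCHAR(1000)        -- optional predicate used to partition large tables
);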
2. Go to the Bulk Copy from Database template. Create a New connection to the external control table that
you created in step 1.
3. Create a New connection to the source database that you're copying data from.
4. Create a New connection to the destination data store that you're copying the data to.
9. (Optional) If you chose SQL Data Warehouse as the data destination, you must enter a connection to Azure
Blob storage for staging, as required by SQL Data Warehouse Polybase. Make sure that the container in
Blob storage has already been created.
Next steps
Introduction to Azure Data Factory
Delta copy from a database with a control table
This article describes a template that's available to incrementally load new or updated rows from a database table
to Azure by using an external control table that stores a high-watermark value.
This template requires that the schema of the source database contains a timestamp column or incrementing key
to identify new or updated rows.
NOTE
If you have a timestamp column in your source database to identify new or updated rows but you don't want to create an
external control table to use for delta copy, you can instead use the Azure Data Factory Copy Data tool to get a pipeline. That
tool uses a trigger-scheduled time as a variable to read new rows from the source database.
2. Create a control table in SQL Server or Azure SQL Database to store the high-watermark value for delta
data loading. In the following example, the name of the control table is watermarktable. In this table,
WatermarkValue is the column that stores the high-watermark value, and its type is datetime.
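A minimal sketch of that control table, using the names given above (the exact script provided with the template may differ):
CREATE TABLE watermarktable
(
    WatermarkValue DATETIME    -- stores the high-watermark value for delta loading
);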
3. Create a stored procedure in the same SQL Server or Azure SQL Database instance that you used to create
the control table. The stored procedure is used to write the new high-watermark value to the external
control table for delta data loading next time.
-- Procedure name is illustrative; the parameter carries the new high-watermark value.
CREATE PROCEDURE update_watermark @LastModifiedtime DATETIME
AS
BEGIN
    UPDATE watermarktable
    SET [WatermarkValue] = @LastModifiedtime
END
4. Go to the Delta copy from Database template. Create a New connection to the source database that you
want to copy data from.
5. Create a New connection to the destination data store that you want to copy the data to.
6. Create a New connection to the external control table and stored procedure that you created in steps 2 and
3.
7. Select Use this template.
14. To run the pipeline again, select Debug, enter the Parameters, and then select Finish.
You see that only new rows were copied to the destination.
15. (Optional) If you selected SQL Data Warehouse as the data destination, you must also provide a connection
to Azure Blob storage for staging, which is required by SQL Data Warehouse Polybase. Make sure that the
container has already been created in Blob storage.
Next steps
Bulk copy from a database by using a control table with Azure Data Factory
Copy files from multiple containers with Azure Data Factory
Transform data by using Databricks in Azure Data
Factory
In this tutorial, you create an end-to-end pipeline containing Lookup, Copy, and Databricks notebook activities
in Data Factory.
Lookup or GetMetadata activity is used to ensure the source dataset is ready for downstream consumption,
before triggering the copy and analytics job.
Copy activity copies the source file/dataset to the sink storage. The sink storage is mounted as DBFS in
the Databricks notebook so that the dataset can be directly consumed by Spark.
Databricks notebook activity triggers the Databricks notebook that transforms the dataset, and adds it to
a processed folder or to SQL DW.
To keep this template simple, the template doesn't create a scheduled trigger. You can add that if necessary.
Prerequisites
1. Create a blob storage account and a container called sinkdata to be used as sink. Keep a note of the
storage account name, container name, and access key, since they are referenced later in the template.
2. Ensure you have an Azure Databricks workspace or create a new one.
3. Import the notebook for ETL. Import the Transform notebook into your Databricks workspace. (It does
not have to be in the location shown below, but remember the path that you choose for later.) Import the
notebook by entering the following URL in the URL field, and then select Import:
https://adflabstaging1.blob.core.windows.net/share/Transformations.html
4. Now update the Transformation notebook with your storage connection information (name and access
key). Go to command 5 in the imported notebook, and replace it with the following code snippet after
replacing the highlighted values. Ensure that this account is the same storage account created earlier and
that it contains the sinkdata container.
try:
  dbutils.fs.mount(
    source = "wasbs://sinkdata@" + storageName + ".blob.core.windows.net/",
    mount_point = "/mnt/Data Factorydata",
    extra_configs = {"fs.azure.account.key." + storageName + ".blob.core.windows.net": accessKey})
except Exception as e:
  # The error message has a long stack trace. This code tries to print just the relevant line indicating what failed.
  import re
  result = re.findall(r"^\s*Caused by:\s*\S+:\s*(.*)$", str(e), flags=re.MULTILINE)
  if result:
    print(result[-1])  # Print only the relevant error message
  else:
    print(e)  # Otherwise print the whole stack trace
5. Generate a Databricks access token for Data Factory to access Databricks. Save the access token for
later use in creating a Databricks linked service. The access token looks something like
'dapi32db32cbb4w6eee18b7d87e45exxxxxx'.
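For orientation, a Databricks linked service that uses this token looks roughly like the sketch below. This is an assumption-level example rather than the template's exact JSON: the linked service name, domain URL, and cluster-sizing values are placeholders, and in practice the token is better stored in Azure Key Vault than inline.
{
    "name": "AzureDatabricksLinkedService",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://eastus.azuredatabricks.net",
            "accessToken": {
                "type": "SecureString",
                "value": "<your Databricks access token>"
            },
            "newClusterNodeType": "Standard_D3_v2",
            "newClusterNumOfWorker": "2",
            "newClusterVersion": "5.5.x-scala2.11"
        }
    }
}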
2. Create a Copy activity named 'file-to-blob' to copy the dataset from the source to the sink. In this case, the
data is a binary file. Refer to the following screenshots for the source and sink configurations in the copy activity.
3. Define pipeline parameters
4. Create a Databricks activity
Select the linked service created in a previous step.
Configure the settings. Create Base Parameters as shown in the screenshot; these are the parameters
passed to the Databricks notebook from Data Factory. Browse to and select the notebook path that you
uploaded in prerequisite 3.
5. Run the pipeline. You can find a link to the Databricks logs for more detailed Spark logs.
You can also verify the data file by using Storage Explorer. (To correlate with Data Factory pipeline runs, this
example appends the pipeline run ID from the data factory to the output folder. This way you can trace back
the files generated by each run.)
Next steps
Introduction to Azure Data Factory
Azure Data Factory FAQ
This article provides answers to frequently asked questions about Azure Data Factory.
Next steps
For step-by-step instructions to create a data factory, see the following tutorials:
Quickstart: Create a data factory
Tutorial: Copy data in the cloud
Azure Data Factory whitepapers
Whitepapers allow you to explore Azure Data Factory at a deeper level. This article provides you with a list of
available whitepapers for Azure Data Factory.
WHITEPAPER DESCRIPTION
Azure Data Factory: Data Integration in the Cloud
    This paper describes how Azure Data Factory can enable you to build a modern data warehouse, enable
    advanced analytics to drive intelligent SaaS applications, and lift your SQL Server Integration Services
    packages to Azure.
Data Migration from on-premises relational Data Warehouse to Azure using Azure Data Factory
    This paper addresses the complexity of migrating tens of TB of data from an existing on-premises relational
    data warehouse (for example, Netezza, Oracle, Teradata, SQL Server) to Azure (for example, Blob Storage or
    Azure Data Lake Storage) by using Azure Data Factory. The challenges and best practices are illustrated
    around resilience, performance, scalability, management, and security for the big data ingestion journey to
    Azure by Azure Data Factory.
Azure Data Factory: SSIS in the Cloud
    This paper goes over why you would want to migrate your existing SSIS workloads to Azure Data Factory and
    addresses common considerations and concerns. It then walks you through the technical details of creating an
    Azure-SSIS IR and shows you how to upload, execute, and monitor your packages through Azure Data Factory
    by using tools you are probably familiar with, such as SQL Server Management Studio (SSMS).