Azure Data Factory
Course Contents
• Introduction to Azure
• Introduction to Azure Data Factory
• Data Factory components
• Differences between v1 and v2
• Triggers
• Control Flow
• SSIS in ADFv2
• Demo
Introduction to Azure
• Azure is Microsoft's cloud computing platform. It provides cloud services that give you the freedom to build, manage, and
deploy applications on a massive global network using your favorite tools and frameworks.
A quick explanation of how Azure works
• Cloud computing is the delivery of computing services over the Internet using a pay-as-you-go pricing model. In
other words, it's a way to rent compute power and storage from someone else's data center.
• Microsoft categorizes Azure cloud services into the following product types:
• Compute
• Storage
• Networking
• Web
• Databases
• Analytics and IoT
• Artificial Intelligence
• DevOps
Introduction to Azure
Introduction to Azure Data Factory
• Azure Data Factory is a cloud-based data integration service that lets you compose data storage, movement, and
processing services into automated data pipelines.
• It combines data processing, storage, and movement services to create and manage analytics pipelines, and it
also provides orchestration, data movement, and monitoring services.
• In the world of big data, raw, unorganized data is often stored in relational, non-relational, and other storage
systems. Big data requires a service that can orchestrate and operationalize processes to refine these
enormous stores of raw data into actionable business insights.
• Azure Data Factory is a managed cloud service that's built for these complex hybrid extract-transform-load
(ETL), extract-load-transform (ELT), and data integration projects.
• Azure Data Factory is a data ingestion and transformation service that allows you to load raw data from over
70 different on-premises or cloud sources. The ingested data can be cleaned, transformed, restructured, and
loaded back into a data warehouse.
• Currently, there are two versions of the service: version 1 (V1) and version 2 (V2).
Introduction to Azure Data Factory
• The pipelines (data-driven workflows) in Azure Data Factory typically perform the following four steps:
• Connect and collect: The first step in building an information production system is to connect to all the
required sources of data and processing, such as software-as-a-service (SaaS) services, databases, file shares,
and FTP web services. The next step is to move the data as needed to a centralized location for subsequent
processing.
• Transform and enrich: After data is present in a centralized data store in the cloud, process or transform the
collected data by using compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine
Learning.
• Publish: After the raw data has been refined into a business-ready consumable form, load the data into Azure
SQL Data Warehouse, Azure SQL Database, Azure Cosmos DB, or whichever analytics engine your business users
can point to from their business intelligence tools.
• Monitor: After you have successfully built and deployed your data integration pipeline, providing business
value from refined data, monitor the scheduled activities and pipelines for success and failure rates.
Data Factory Components
• Azure Data Factory is composed of four key components. These components work together to provide the
platform on which you can compose data-driven workflows with steps to move and transform data.
• Pipeline: A data factory might have one or more pipelines. A pipeline is a logical grouping of activities that
performs a unit of work. For example, a pipeline can contain a group of activities that ingests data from an
Azure blob, and then runs a Hive query on an HDInsight cluster to partition the data.
• Activity: Activities represent a processing step in a pipeline. For example, you might use a copy activity to
copy data from one data store to another data store. Data Factory supports three types of activities: data
movement activities, data transformation activities, and control activities.
• Datasets: Datasets represent data structures within the data stores, which simply point to or reference the
data you want to use in your activities as inputs or outputs.
• Linked services: Linked services are much like connection strings, which define the connection information
that's needed for Data Factory to connect to external resources. For example, an Azure Storage linked
service specifies a connection string to connect to the Azure Storage account.
• Linked services are used for two purposes in Data Factory:
• To represent a data store that includes, but isn't limited to, an on-premises SQL Server database, Oracle database, file
share, or Azure blob storage account.
• To represent a compute resource that can host the execution of an activity. For example, the HDInsight Hive activity
runs on an HDInsight Hadoop cluster.
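To make these four components concrete, here is a minimal sketch using the Azure Data Factory v2 .NET SDK (the Microsoft.Azure.Management.DataFactory NuGet package). It assumes an already-authenticated DataFactoryManagementClient; all resource names and the connection string are placeholders, not values from this deck.

using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;

static class AdfComponentsSketch
{
    // 'client' is an authenticated DataFactoryManagementClient; 'rg' and 'df'
    // are the resource group and data factory names.
    public static void Compose(DataFactoryManagementClient client, string rg, string df)
    {
        // Linked service: the connection information for an Azure Storage account.
        client.LinkedServices.CreateOrUpdate(rg, df, "MyStorageLinkedService",
            new LinkedServiceResource(new AzureStorageLinkedService
            {
                ConnectionString = new SecureString("<storage connection string>")
            }));

        // Datasets: blob folders referenced through the linked service above.
        client.Datasets.CreateOrUpdate(rg, df, "InputBlobDataset",
            new DatasetResource(new AzureBlobDataset
            {
                LinkedServiceName = new LinkedServiceReference { ReferenceName = "MyStorageLinkedService" },
                FolderPath = "adf-demo/input/"
            }));
        client.Datasets.CreateOrUpdate(rg, df, "OutputBlobDataset",
            new DatasetResource(new AzureBlobDataset
            {
                LinkedServiceName = new LinkedServiceReference { ReferenceName = "MyStorageLinkedService" },
                FolderPath = "adf-demo/output/"
            }));

        // Pipeline: a single copy activity (a data movement activity) that moves
        // data from the input dataset to the output dataset.
        client.Pipelines.CreateOrUpdate(rg, df, "CopyPipeline", new PipelineResource
        {
            Activities = new List<Activity>
            {
                new CopyActivity
                {
                    Name = "CopyBlobToBlob",
                    Inputs = new List<DatasetReference> { new DatasetReference { ReferenceName = "InputBlobDataset" } },
                    Outputs = new List<DatasetReference> { new DatasetReference { ReferenceName = "OutputBlobDataset" } },
                    Source = new BlobSource(),
                    Sink = new BlobSink()
                }
            }
        });
    }
}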
Data Factory Components
• Overview of Data Factory flow (diagram: pipelines such as "My Pipeline1" and "My Pipeline2" chain parameterized activities, including a For Each container and an "OnError" path, and are started by Event, Wall Clock, or On Demand triggers)
Data Factory Components
• Other components of Data Factory.
• Triggers: Triggers represent the unit of processing that determines when a pipeline execution needs to be
kicked off. There are different types of triggers for different types of events.
• Pipeline runs: A pipeline run is an instance of the pipeline execution. Pipeline runs are typically instantiated
by passing the arguments to the parameters that are defined in pipelines. The arguments can be passed
manually or within the trigger definition.
• Parameters: Parameters are key-value pairs of read-only configuration. Parameters are defined in the
pipeline. Activities within the pipeline consume the parameter values.
• Control flow: Control flow is an orchestration of pipeline activities that includes chaining activities in a
sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the
pipeline on-demand or from a trigger. It also includes custom-state passing and looping containers, that is,
For-each iterators.
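As a small illustration of defining parameters at the pipeline level, the sketch below (same .NET SDK and assumptions as the earlier sketch, with placeholder names) declares a String parameter; activities inside the pipeline would read it with the expression @pipeline().parameters.inputPath, and the value is supplied as an argument when a run is created.

using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;

static class PipelineParameterSketch
{
    public static void Define(DataFactoryManagementClient client, string rg, string df)
    {
        client.Pipelines.CreateOrUpdate(rg, df, "ParamPipeline", new PipelineResource
        {
            // Key-value, read-only configuration consumed by the activities below.
            Parameters = new Dictionary<string, ParameterSpecification>
            {
                { "inputPath", new ParameterSpecification { Type = ParameterType.String } }
            },
            // Activities would reference the value as @pipeline().parameters.inputPath.
            Activities = new List<Activity>()
        });
    }
}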
Differences between v1 and v2
Feature: Datasets
• Version 1: A named view of data that references the data; it can be used in activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents.
• Version 2: Datasets are the same in the current version. However, you do not need to define availability schedules for datasets.
3. .NET:
client.Pipelines.CreateRunWithHttpMessagesAsync(+ parameters) (a sketch follows below)
4. Azure Portal:
(Data Factory -> Author & Monitor -> Pipeline runs)
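Expanding on the .NET option above, here is a hedged sketch of creating and then monitoring a pipeline run with the .NET SDK; the pipeline name, parameter name, and polling interval are illustrative.

using System;
using System.Collections.Generic;
using System.Threading;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;

static class PipelineRunSketch
{
    public static void RunAndMonitor(DataFactoryManagementClient client, string rg, string df)
    {
        // Arguments are passed to the parameters defined in the pipeline.
        var arguments = new Dictionary<string, object> { { "inputPath", "adf-demo/input/" } };

        CreateRunResponse runResponse = client.Pipelines
            .CreateRunWithHttpMessagesAsync(rg, df, "ParamPipeline", parameters: arguments)
            .Result.Body;

        // Poll the pipeline run until it leaves the InProgress state.
        PipelineRun run;
        do
        {
            Thread.Sleep(TimeSpan.FromSeconds(15));
            run = client.PipelineRuns.Get(rg, df, runResponse.RunId);
            Console.WriteLine($"Run {runResponse.RunId}: {run.Status}");
        } while (run.Status == "InProgress");
    }
}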
Triggers
Tumbling Window
Tumbling window triggers are a type of trigger that fires at a
periodic time interval from a specified start time, while
retaining state. Tumbling windows are a series of fixed-sized,
non-overlapping, and contiguous time intervals.
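A hedged sketch of defining such a trigger with the same .NET SDK; the pipeline name, window size, and start time are assumptions for illustration.

using System;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;

static class TumblingWindowSketch
{
    public static void Create(DataFactoryManagementClient client, string rg, string df)
    {
        var trigger = new TumblingWindowTrigger
        {
            Pipeline = new TriggerPipelineReference
            {
                PipelineReference = new PipelineReference("ParamPipeline")
            },
            Frequency = TumblingWindowFrequency.Hour,   // fixed-size, contiguous windows
            Interval = 1,                               // one window per hour
            StartTime = DateTime.UtcNow.AddHours(-24),  // also backfills past windows
            MaxConcurrency = 4                          // how many windows may run in parallel
        };

        client.Triggers.CreateOrUpdate(rg, df, "HourlyTumblingTrigger", new TriggerResource(trigger));
        client.Triggers.StartAsync(rg, df, "HourlyTumblingTrigger").Wait();   // triggers must be started explicitly
    }
}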
Control Flow
Activities commonly used for orchestration:
• Web Activity: call a custom REST endpoint and pass datasets and linked services.
• Lookup Activity: look up a record / table name / value from any external source, to be referenced by succeeding activities. Can be used for incremental loads!
• Get Metadata Activity: retrieve metadata of any data in Azure Data Factory, e.g. did another pipeline finish.
• Do Until Activity: similar to a Do-Until looping structure in programming languages.
• If Condition Activity: do something based on a condition that evaluates to true or false.
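A hedged sketch of two of these activities working together, following the incremental-load idea above: a Lookup counts new rows, and an If Condition branches on the result. The dataset, query, and activity names are placeholders.

using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactory.Models;

static class LookupIfSketch
{
    // Returns the activity list for a pipeline; deploy it with client.Pipelines.CreateOrUpdate.
    public static IList<Activity> BuildActivities()
    {
        var lookup = new LookupActivity
        {
            Name = "CountNewRows",
            Dataset = new DatasetReference { ReferenceName = "SourceSqlDataset" },
            Source = new AzureSqlSource
            {
                SqlReaderQuery = "SELECT COUNT(*) AS NewRows FROM dbo.Source WHERE Modified > '2019-01-01'"
            },
            FirstRowOnly = true
        };

        var branch = new IfConditionActivity
        {
            Name = "HasNewRows",
            // Evaluated at run time against the Lookup output.
            Expression = new Expression("@greater(activity('CountNewRows').output.firstRow.NewRows, 0)"),
            IfTrueActivities = new List<Activity>(),   // e.g. a copy activity for the new rows
            IfFalseActivities = new List<Activity>(),
            DependsOn = new List<ActivityDependency>
            {
                new ActivityDependency { Activity = "CountNewRows", DependencyConditions = new List<string> { "Succeeded" } }
            }
        };

        return new List<Activity> { lookup, branch };
    }
}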
Control Flow
New! Control flow activities in v2
• Append Variable activity: add a value to an existing array variable defined in a Data Factory pipeline.
• Filter activity: apply a filter expression to an input array.
• Set Variable activity: set the value of an existing variable of type String, Bool, or Array defined in a Data Factory pipeline.
• Validation activity: ensure the pipeline only continues execution once it has validated that the attached dataset reference exists.
• Wait activity: the pipeline waits for the specified period of time before continuing with execution of subsequent activities.
• Webhook activity: control the execution of pipelines through your custom code.
• Data Flow activity: run your ADF data flow in pipeline debug (sandbox) runs and in triggered pipeline runs. (This activity is in public preview.)
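A hedged sketch combining two of the activities above, Set Variable and Wait, in a single pipeline; the variable name, label expression, and wait time are illustrative.

using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactory.Models;

static class SetVariableWaitSketch
{
    public static PipelineResource Build()
    {
        return new PipelineResource
        {
            // A String variable that the Set Variable activity writes into.
            Variables = new Dictionary<string, VariableSpecification>
            {
                { "runLabel", new VariableSpecification { Type = VariableType.String } }
            },
            Activities = new List<Activity>
            {
                new SetVariableActivity
                {
                    Name = "StampRun",
                    VariableName = "runLabel",
                    Value = "@concat('run-', pipeline().RunId)"
                },
                new WaitActivity
                {
                    Name = "CoolDown",
                    WaitTimeInSeconds = 30,   // pause before any downstream activities
                    DependsOn = new List<ActivityDependency>
                    {
                        new ActivityDependency { Activity = "StampRun", DependencyConditions = new List<string> { "Succeeded" } }
                    }
                }
            }
        };
    }
}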
SSIS in ADFv2
Azure-SSIS Integration Runtime (diagram): run your SSIS projects in a managed cloud environment.
• Managed cloud environment: pick the number of nodes and node size; resizable.
• SQL Standard Edition supported; Enterprise coming soon.
• Compatible: same SSIS runtime across Windows, Linux, and Azure cloud.
• Get started quickly: hourly pricing (no SQL Server license required).
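A hedged sketch of provisioning the runtime described above with the same .NET SDK; the location, node size and count, and SSISDB catalog values are placeholders.

using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;

static class SsisIrProvisionSketch
{
    public static void Provision(DataFactoryManagementClient client, string rg, string df)
    {
        var ssisIr = new ManagedIntegrationRuntime
        {
            ComputeProperties = new IntegrationRuntimeComputeProperties
            {
                Location = "WestEurope",
                NodeSize = "Standard_D2_v3",        // node size (scale up)
                NumberOfNodes = 2,                  // number of nodes (scale out)
                MaxParallelExecutionsPerNode = 4    // parallel package executions per node
            },
            SsisProperties = new IntegrationRuntimeSsisProperties
            {
                // SSISDB is created on an existing Azure SQL Database server.
                CatalogInfo = new IntegrationRuntimeSsisCatalogInfo
                {
                    CatalogServerEndpoint = "<server>.database.windows.net",
                    CatalogAdminUserName = "<admin>",
                    CatalogAdminPassword = new SecureString("<password>"),
                    CatalogPricingTier = "S1"
                }
            }
        };

        client.IntegrationRuntimes.CreateOrUpdate(rg, df, "MyAzureSsisIR", new IntegrationRuntimeResource(ssisIr));
        // Starting the Azure-SSIS IR is a long-running operation.
        client.IntegrationRuntimes.StartAsync(rg, df, "MyAzureSsisIR").Wait();
    }
}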
SSIS in ADFv2
Integration runtime - different capabilities
1. Data Movement
Move data between data stores, with built-in connectors, format conversion, column mapping, and performant and scalable data transfer.
2. Activity Dispatch
Dispatch and monitor transformation activities (e.g. a Stored Procedure on SQL Server, Hive on HDInsight).
https://docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime
SSIS in ADFv2
Integration runtimes
1. Azure Integration Runtime
2. Self-hosted Integration Runtime
3. Azure-SSIS Integration Runtime
https://docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime
SSIS in ADFv2
Sample ADF and IR locations
SSIS in ADFv2
Scalable Integration Services
How to scale up/out using 3 settings on the Azure-SSIS IR (number of nodes, node size, max parallel executions per node)
3. SSIS packages can be executed via custom code/PowerShell using the SSIS MOM .NET
SDK/API (a hedged sketch follows after this list)
› Microsoft.SqlServer.Management.IntegrationServices.dll is installed in the .NET GAC
with a SQL Server/SSMS installation
4. SSIS packages can be executed via T-SQL scripts executing SSISDB sprocs
› Execute the SSISDB sprocs [catalog].[create_execution] +
[catalog].[set_execution_parameter_value] + [catalog].[start_execution]
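A hedged sketch of option 3: running a deployed package through the management object model in Microsoft.SqlServer.Management.IntegrationServices.dll. The folder, project, and package names are placeholders.

using System.Data.SqlClient;
using Microsoft.SqlServer.Management.IntegrationServices;

static class SsisMomSketch
{
    public static long RunPackage(string ssisdbConnectionString)
    {
        // Connect to the server hosting the SSISDB catalog.
        var ssis = new IntegrationServices(new SqlConnection(ssisdbConnectionString));

        PackageInfo package = ssis.Catalogs["SSISDB"]
            .Folders["MyFolder"]
            .Projects["MyProject"]
            .Packages["Package.dtsx"];

        // Execute(use32RuntimeOn64, environmentReference); returns the SSISDB execution id,
        // which can be tracked in [catalog].[executions] or via the sprocs in option 4.
        return package.Execute(false, null);
    }
}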
SSIS in ADFv2
Scheduling Methods
1. SSIS package executions can be directly/explicitly scheduled via the ADFv2 app (work in progress)
› For now, SSIS package executions can be indirectly/implicitly scheduled via an ADFv1/v2 Stored Procedure (sproc) Activity
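A hedged sketch of that indirect scheduling approach: an ADF Stored Procedure activity that invokes an SSISDB sproc. In practice the usual pattern is to wrap [catalog].[create_execution], [catalog].[set_execution_parameter_value], and [catalog].[start_execution] in one wrapper sproc and call that; the linked service, sproc, and parameter names below are placeholders.

using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactory.Models;

static class SsisScheduleSketch
{
    // Returns an activity to place in a scheduled (triggered) ADF pipeline.
    public static Activity Build()
    {
        return new SqlServerStoredProcedureActivity
        {
            Name = "StartSsisPackage",
            // Linked service pointing at the Azure SQL Database that hosts SSISDB.
            LinkedServiceName = new LinkedServiceReference { ReferenceName = "SsisdbLinkedService" },
            StoredProcedureName = "[dbo].[sp_RunMySsisPackage]",   // wrapper around the catalog sprocs
            StoredProcedureParameters = new Dictionary<string, StoredProcedureParameter>
            {
                { "PackageName", new StoredProcedureParameter { Value = "Package.dtsx", Type = StoredProcedureParameterType.String } }
            }
        };
    }
}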