Azure Data Factory tutorial

Azure Data Factory (ADF) is a cloud-based hybrid data integration service that simplifies ETL processes through a code-free, drag-and-drop interface and offers over 80 connectors for data movement and transformation. ADF supports both data flows for simple transformations and can integrate with services like Databricks for more complex tasks. It features various components including pipelines, activities, and integration runtimes, allowing for efficient orchestration and automation of data workflows.

Uploaded by

shilpisonixxx
Copyright
© All Rights Reserved

Azure Data Factory

Cloud version of SSIS


Copy Data
More than 80 connectors to different services are available

Transform Data
With the newly added Data Flow feature, Data Factory is now a complete cloud-based ETL tool.
Azure Data Factory
Definition:
Azure Data Factory (ADF) is a hybrid data integration service that enables you to quickly and efficiently create automated data pipelines, without having to write any code!
➢ Hybrid Data Integration Service
➢ Simplifies ETL at scale
➢ Enables modern data integration
➢ Drag and drop interface
➢ Over 80 connectors available
➢ Move, transform and save data
➢ Managed Service
Azure Data Factory
➢ Create data-driven workflows
➢ Orchestrate and automate data movement
➢ Transform and store data
➢ Operationalize the process
➢ ETL or ELT scenarios
Data Factory on Azure Ecosystem

01 Migration?
Data Factory excels at periodic data loads and transformations instead.

02 Streaming?
ADF can orchestrate, but there are other dedicated services for streaming.

03 Transformations?
Data flows for simple ones, but you can use Databricks or HDInsight for more complex transforms.
SSIS vs Data Factory

SSIS                                 | Data Factory
More code-free transformations       | Much higher scalability
On-premises connectors (e.g. Excel)  | Cloud and SaaS connectors
                                     | Event-based triggers
                                     | Can use SSIS packages
Data Factory considerations
➢ Two versions: ADF V2 is the current and improved version
➢ Build options: PowerShell, .NET, Python, REST, ARM
➢ Highly integrated: DevOps, Key Vault, Monitor, Automation
➢ No data storage: Need to persist data by the end
➢ Security standards: HTTPS/TLS whenever possible
Azure Data Factory Components
[Figure: furniture delivery analogy. Labels: Delivery Manager, Delivery man, Shop, House, Address & Keys (shop side and house side), Cabinet (shop side and house side), Disassembly Info, Delivery Details, Assembly Info.]
Data Factory Pipeline
[Figure: a pipeline in which a Copy Activity runs on an Integration Runtime, with Blob Storage (Order.csv) and an Order Table as the data stores.]
Data Factory vs SSIS

Azure Data Factory | SSIS
Pipeline           | Package
Linked Service     | Connection manager
Source             | Source
Sink               | Destination
Activity           | Control flow task
Data Flow          | Data flow
Data Factory Pipeline
➢ Data Factories can contain one or more pipelines
➢ Logical group of Activities
➢ Manage Activities as a set
➢ One Pipeline can have one or more activities
Azure Data Factory Activities
• Represents a processing step in the pipeline
• Actions to perform on data
• Ingest data
• Transform data
• Store data
• Can be linked
• Execute sequentially or
• Run in parallel
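
To illustrate how linked activities end up running sequentially or in parallel, here is a minimal Python sketch. It is not ADF code: the pipeline and activity names are hypothetical, and `dependsOn` is simplified to a plain list of activity names (a real pipeline definition carries more detail per dependency). Activities with no unmet dependencies are grouped into the same stage, which means they could run in parallel.

```python
# Hypothetical pipeline; names are made up for illustration only.
# "dependsOn" is simplified here to a list of activity names.
pipeline = {
    "name": "LoadOrdersPipeline",
    "activities": [
        {"name": "CopyOrders", "type": "Copy", "dependsOn": []},
        {"name": "CopyCustomers", "type": "Copy", "dependsOn": []},
        {"name": "TransformSales", "type": "ExecuteDataFlow",
         "dependsOn": ["CopyOrders", "CopyCustomers"]},
    ],
}


def execution_stages(pipeline):
    """Group activities into ordered stages.

    Activities inside one stage have all dependencies satisfied by
    earlier stages, so they could run in parallel.
    """
    remaining = {a["name"]: set(a["dependsOn"]) for a in pipeline["activities"]}
    done, stages = set(), []
    while remaining:
        ready = sorted(n for n, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("circular dependency between activities")
        stages.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return stages


stages = execution_stages(pipeline)
for number, names in enumerate(stages, start=1):
    print(f"stage {number}: {', '.join(names)}")
```

In this sketch the two copy activities share the first stage, and the transformation waits for both of them.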
Activity types

01 Data movement activities
Copy data among data stores located on-premises and in the cloud
Data stores: Blob storage, Cosmos DB, Amazon Redshift, Google BigQuery, Hive, MariaDB, etc.

02 Data transformation activities
Transform and enrich data
e.g. Hive, Pig, MapReduce, Spark or Databricks

03 Control activities
Control pipeline flow
e.g. ForEach, Web
Data Flows
• Data Flow is a new feature of Azure Data Factory (ADF) that allows you to develop graphical data transformation logic that can be executed as activities within ADF pipelines.
• Two types:
• Mapping
• Wrangling
Dataset
➢ Simply points to or references the data
➢ References data used in an Activity
➢ Files
➢ Folders
➢ Documents
➢ Tables
Linked service
➢ Similar to a connection string
➢ Represents the connection information needed to connect to external resources
➢ Data stores like Azure SQL Server
➢ Compute resources, e.g. Spark Cluster
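
As a rough sketch of how a dataset and a linked service relate, the Python dictionaries below mimic the general shape of their JSON definitions. The names, the container, and the placeholder connection string are all assumptions made for illustration, and the exact property names should be checked against the official documentation before use. The point is only that the dataset describes the data and refers to the linked service by name, while the linked service holds the connection information.

```python
import json

# Linked service: connection information, similar to a connection string.
# The connection string is a placeholder, not a real value.
linked_service = {
    "name": "BlobStorageLinkedService",  # hypothetical name
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {"connectionString": "<your-connection-string>"},
    },
}

# Dataset: points at the data (here one file) through the linked service.
dataset = {
    "name": "OrderCsvDataset",  # hypothetical name
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": linked_service["name"],
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "orders",  # assumed container name
                "fileName": "Order.csv",
            }
        },
    },
}


def resolve_linked_service(dataset, linked_services):
    """Look up the linked service that a dataset refers to by name."""
    name = dataset["properties"]["linkedServiceName"]["referenceName"]
    return linked_services[name]


registry = {linked_service["name"]: linked_service}
resolved = resolve_linked_service(dataset, registry)
print(json.dumps(dataset, indent=2))
```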


ADF Components
Integration Runtimes
➢ Provides fully managed, serverless compute infrastructure
➢ You don't have to worry about infrastructure provisioning, software installation, patching, or capacity scaling.
➢ Pay only for duration of actual use
➢ Bridges between the activity and linked service
➢ Activity defines the action
➢ Linked service defines the location

Integration Runtimes
➢ Data Integration Capabilities
➢ Data Flow
➢ Data Movement
➢ Format conversion, column mapping, serialization/deserialization etc.
➢ Provides the native compute to move data between cloud data stores in a secure, reliable, and high-performance manner.
➢ Activity dispatch (e.g. Databricks Notebook, HDInsight Hive, Pig, Spark activity, Stored Procedure (SP), ADL Analytics U-SQL activity)
➢ SSIS Package execution

Integration Runtimes
Specify the infrastructure to run activities

Azure Integration Runtime
Works on public networks
Responsible for data flows, data movement, and activity dispatch

Self-hosted Integration Runtime
Works on public and private networks
Provides data movement and activity dispatch capabilities
Needs to be installed on an on-premises machine or a virtual machine inside a private network

SSIS Integration Runtime
Supports SSIS package execution
Works on public and private networks
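
The choice between the three runtime types can be summarized as a small decision rule. The function below is only a sketch of the rules stated above; its name and parameters are hypothetical, and real deployments weigh more factors (region, cost, which features each runtime supports).

```python
def choose_integration_runtime(private_network: bool, runs_ssis_package: bool) -> str:
    """Pick an integration runtime type from the rules listed above.

    Simplified illustration only; not an official selection procedure.
    """
    if runs_ssis_package:
        # Only the SSIS integration runtime executes SSIS packages.
        return "SSIS"
    if private_network:
        # The self-hosted runtime can reach data stores on private networks.
        return "Self-hosted"
    # The Azure runtime covers public networks.
    return "Azure"


print(choose_integration_runtime(private_network=True, runs_ssis_package=False))
```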
Integration Runtimes
➢ Default IR – AutoResolveIntegrationRuntime
➢ Create Azure IR
➢ When you want to explicitly define the location of the IR
➢ Virtually group the activity executions on different IRs for management purposes
Triggers
➢ Execute pipeline
➢ Many-to-many relationship between pipeline and trigger
➢ Three types of Trigger
➢ Schedule Trigger – Invokes pipeline on a wall-clock schedule
➢ Tumbling Window Trigger – Operates on a periodic interval and also retains state
  ➢ One-to-one relationship with a pipeline
  ➢ Advanced configuration options - Dependencies, delay, retry, concurrency
  ➢ Properties - trigger().outputs.WindowStartTime/WindowEndTime
➢ Event-based Trigger – Triggers pipeline in response to an event
  ➢ e.g. Arrival/deletion of file in Blob storage
  ➢ Event trigger with Azure Event Grid Service
  ➢ Properties – triggerBody().folderPath/fileName
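
To make the tumbling window idea concrete, the Python sketch below generates fixed-size, contiguous, non-overlapping time windows. It does not call ADF; the function name and the sample dates are assumptions for illustration. Each pair it yields plays the role of the window start and window end values that a tumbling window trigger passes to a pipeline run.

```python
from datetime import datetime, timedelta


def tumbling_windows(start, end, interval):
    """Yield (window_start, window_end) pairs covering [start, end).

    Windows are contiguous and never overlap; a trailing partial
    window is not emitted.
    """
    window_start = start
    while window_start + interval <= end:
        yield window_start, window_start + interval
        window_start += interval


# Assumed sample range: three hourly windows on one day.
windows = list(
    tumbling_windows(
        datetime(2024, 1, 1, 0, 0),
        datetime(2024, 1, 1, 3, 0),
        timedelta(hours=1),
    )
)
for window_start, window_end in windows:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Because each window has a fixed start and end, a failed run can be retried for exactly the same slice of data, which is one reason a trigger of this kind keeps state.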
Demo: Copy Activity
Summary
Data Flows
Allows you to develop graphical data transformation logic
Example of the SSIS Control Flow tab for loading our data mart tables:

Example of the ADF Pipeline for loading our data mart tables:

Example of SSIS Data Flow tab for loading the FactInternetSales table:

Example of ADF Mapping Data Flows for loading the FactInternetSales table:
Data Flow
Mapping Data flow – Transform data (known data and schema)
Wrangling Data flow – Prepare and explore data using Power Query (known or unknown datasets)
Mapping Data Flows
Mapping Data Flow Actions

Multiple Inputs/outputs | Schema Modifiers | Row Modifiers
Join                    | Derived Columns  | Filter
Conditional Split       | Select           | Sort
Exists                  | Aggregate        | Alter Row
Union                   | Surrogate key    |
Lookup                  | Pivot            |
                        | Unpivot          |
                        | Window           |
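
To give a feel for what some of these actions do, the plain Python sketch below applies a Filter, a Derived Column, and an Aggregate step to a few made-up rows. This is not how Mapping Data Flows are authored (they are built graphically); the sample rows and column names are assumptions used only to show the effect of each action.

```python
from collections import defaultdict

# Made-up sample rows for illustration.
rows = [
    {"product": "A", "quantity": 2, "unit_price": 10.0},
    {"product": "B", "quantity": 0, "unit_price": 5.0},
    {"product": "A", "quantity": 1, "unit_price": 10.0},
]

# Filter (row modifier): keep only rows with a positive quantity.
filtered = [row for row in rows if row["quantity"] > 0]

# Derived Column (schema modifier): add a computed "amount" column.
derived = [{**row, "amount": row["quantity"] * row["unit_price"]} for row in filtered]

# Aggregate (schema modifier): total amount per product.
totals = defaultdict(float)
for row in derived:
    totals[row["product"]] += row["amount"]
totals = dict(totals)

print(totals)
```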
Wrangling Data Flows
Data flows behind the scenes

Behind the scenes, Data Flow executes on Azure Databricks using Spark

ADF internally handles all the code translation, Spark optimization and execution of transformations
