
Azure Data Factory

1. Key Concepts of Azure Data Factory


a. Data Pipelines
A data pipeline in ADF is a logical grouping of activities that together perform a
task. The pipeline allows you to manage the orchestration and execution of
workflows like ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform).
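For a concrete picture, here is a minimal sketch of a pipeline definition in JSON (all names are illustrative); it contains a single Wait activity that pauses for 60 seconds:

{
  "name": "ExamplePipeline",
  "properties": {
    "activities": [
      {
        "name": "WaitOneMinute",
        "type": "Wait",
        "typeProperties": { "waitTimeInSeconds": 60 }
      }
    ]
  }
}

Real pipelines place one or more of the activity types described next inside the activities array.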
b. Activity
An activity represents a single unit of work in ADF. There are multiple types of
activities:
• Data movement: Copies data from a source to a sink (the Copy activity).
• Data transformation: Transforms data (e.g., using Azure Databricks, Data Flow, or HDInsight activities).
• Control flow: Directs the flow of execution (e.g., If Condition, ForEach, Wait).
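Whatever its type, every activity shares the same JSON anatomy: a name, a type, and type-specific properties. As a sketch, here is a Wait activity that runs only after a hypothetical upstream activity named CopyRawData succeeds:

{
  "name": "PauseBeforeLoad",
  "type": "Wait",
  "dependsOn": [
    {
      "activity": "CopyRawData",
      "dependencyConditions": [ "Succeeded" ]
    }
  ],
  "typeProperties": { "waitTimeInSeconds": 30 }
}

The dependsOn block is what chains individual activities into an ordered workflow.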
c. Dataset
A dataset represents the structure of the data that an activity consumes or produces, whether that data lives in tables, files, or other formats.
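For example, here is a sketch of a dataset describing a CSV file in Blob Storage (all names are illustrative; the linked service it references is the subject of the next subsection):

{
  "name": "BlobInputDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "AzureBlobStorageLS",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "input",
        "fileName": "sales.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}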
d. Linked Service
A linked service defines the connection information for a data store or a compute resource on which pipeline activities run. It serves as the configuration that connects ADF to data sources such as Azure SQL Database, Blob Storage, or on-premises databases.
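A sketch of a linked service for Azure Blob Storage, with angle-bracket placeholders standing in for real account values:

{
  "name": "AzureBlobStorageLS",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage-account>;AccountKey=<account-key>"
    }
  }
}

In practice, secrets such as account keys are normally kept in Azure Key Vault rather than stored inline.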
e. Triggers
A trigger determines when a pipeline runs, whether on demand or at scheduled intervals (e.g., daily, weekly). Triggers can be:
• Schedule-based: Runs pipelines at specific times or intervals.
• Event-based: Runs pipelines based on events (e.g., file arrival in storage).
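As an illustration, a sketch of a schedule-based trigger that runs the hypothetical ExamplePipeline from above once a day:

{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2025-01-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "ExamplePipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}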
f. Integration Runtime (IR)
Integration Runtime (IR) is the compute infrastructure used by Azure Data
Factory to perform data movement and transformation activities. There are
three types of IR:
• Azure IR: For cloud-based data movement and transformation.
• Self-hosted IR: For on-premises data sources.
• Azure-SSIS IR: To run SQL Server Integration Services (SSIS) packages.
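A linked service selects its runtime through a connectVia reference. A sketch for an on-premises SQL Server reached through a self-hosted IR registered under the hypothetical name OnPremIR (the connection string is illustrative):

{
  "name": "OnPremSqlServerLS",
  "properties": {
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "Data Source=<server>;Initial Catalog=<database>;Integrated Security=True"
    },
    "connectVia": {
      "referenceName": "OnPremIR",
      "type": "IntegrationRuntimeReference"
    }
  }
}

Without a connectVia reference, activities default to the Azure IR.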

2. Azure Data Factory Architecture


Azure Data Factory provides a hybrid data integration service, allowing you to
move data between cloud-based and on-premises environments. Here’s how
its architecture breaks down:
a. Authoring
You can author pipelines using the following methods:
• ADF Studio: A graphical user interface (GUI) in the Azure portal.
• JSON-based definition: ADF resources (pipelines, datasets, etc.) are defined using JSON.
• Azure SDK or APIs: You can use Azure SDKs, PowerShell, or REST APIs to create and manage ADF resources programmatically.
b. Pipeline Orchestration
Azure Data Factory orchestrates workflows in a serverless manner, meaning
you don't have to manage the infrastructure. You create and schedule data-
driven workflows (pipelines), and Azure automatically scales resources as
needed.
c. Data Integration
Azure Data Factory can integrate with many data sources, including:
• Azure services (e.g., Azure Blob Storage, Azure Data Lake, Azure SQL Database).
• On-premises databases (SQL Server, Oracle, etc.).
• Other cloud services (Amazon S3, Google BigQuery, etc.).
• File-based data (CSV, JSON, Parquet, etc.).

3. Core Components of Azure Data Factory


a. Pipelines
A collection of activities grouped logically that perform a specific task (e.g.,
moving or transforming data).
b. Activities
Azure Data Factory supports the following types of activities:
• Copy Activity: Moves data from a source to a destination.
• Data Flow Activity: Executes transformations using the Mapping Data Flow service.
• Transformation Activities: These include calling external services like:
o Azure Databricks
o HDInsight
o Azure Functions
o Stored Procedures, and more.
• Control Activities (see the sketch after this list): These include:
o If Condition for conditional logic.
o Wait for pausing pipelines.
o ForEach for iterating through collections.
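To make the control activities concrete, here is a sketch of a ForEach activity that loops over a hypothetical pipeline parameter named fileNames and runs a short Wait for each item in parallel:

{
  "name": "ProcessEachFile",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@pipeline().parameters.fileNames",
      "type": "Expression"
    },
    "isSequential": false,
    "activities": [
      {
        "name": "WaitPerItem",
        "type": "Wait",
        "typeProperties": { "waitTimeInSeconds": 5 }
      }
    ]
  }
}

In a real pipeline, the inner Wait would typically be replaced by a Copy or transformation activity parameterized with @item().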
c. Data Flows
Data flows in Azure Data Factory allow for data transformations without writing
code. They are visually designed, providing a low-code/no-code way to perform
complex data operations (joins, aggregations, filtering, etc.).
d. Linked Services
Linked services store connection information for external data stores and compute resources. For example:
• Azure Blob Storage
• Azure SQL Database
• Amazon S3
• On-premises databases via the self-hosted IR.
e. Integration Runtime
The compute infrastructure for Azure Data Factory that performs data
movement and transformation. You can choose from:
• Azure IR (default for cloud-based activities).
• Self-hosted IR (for on-premises or hybrid scenarios).
• Azure-SSIS IR (for running SSIS packages in ADF).

4. Azure Data Factory Use Cases


a. ETL/ELT Data Pipelines
ADF can automate ETL (Extract, Transform, Load) or ELT (Extract, Load,
Transform) processes. It can pull data from multiple sources, transform it using
Azure Data Flow or Databricks, and load it into a target database or data lake.
b. Hybrid Data Integration
ADF supports both on-premises and cloud-based data sources. For on-premises data, it uses the self-hosted IR to connect securely.
c. Data Transformation
ADF allows you to create transformations through Mapping Data Flows or by
calling Azure Databricks or HDInsight for more complex transformations.
d. Big Data Integration
You can integrate big data services like Azure Data Lake and Azure Synapse
Analytics with ADF to handle large datasets and perform advanced analytics.
e. Orchestration of Data Workflows
ADF orchestrates the entire data pipeline, including scheduling, retry logic,
branching, parallelism, and event-based triggers.
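Retry logic, for example, is configured per activity through its policy block. A sketch with illustrative values, placed alongside an execution activity's name and type:

"policy": {
  "timeout": "0.01:00:00",
  "retry": 2,
  "retryIntervalInSeconds": 120
}

Here the activity times out after one hour and is retried up to twice, waiting two minutes between attempts.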

5. Azure Data Factory Monitoring


You can monitor your pipelines and activities using:
• ADF Studio: Provides a graphical overview of pipeline runs, activity runs, and trigger executions.
• Azure Monitor: Allows logging, alerting, and real-time monitoring of data factory operations.
• Alerts: You can set up alerts for failed pipelines and errors using Azure Monitor.

6. Pricing Model
Azure Data Factory pricing is based on:
• Data movement: the volume of data moved and the compute used to move it.
• Data transformation: the execution of data transformation activities (e.g., Data Flow runs).
• Pipeline orchestration: the number of pipeline runs and activity runs.

7. Example: Creating a Simple Pipeline


Here’s a simple walkthrough of creating a pipeline in Azure Data Factory that
copies data from an Azure Blob Storage container to an Azure SQL Database.
1. Create a Linked Service for Azure Blob Storage and Azure SQL Database.
2. Create Datasets for the source (blob storage) and sink (SQL Database).
3. Create a Copy Activity in the pipeline to copy data from the source
dataset to the sink dataset.
4. Trigger the pipeline to execute immediately or schedule it.
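Putting the pieces together, here is a sketch of the pipeline JSON for step 3, assuming the source and sink datasets were created as BlobInputDataset and SqlOutputDataset (following the dataset and linked service sketches in section 1):

{
  "name": "CopyBlobToSqlPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyBlobToSql",
        "type": "Copy",
        "inputs": [
          { "referenceName": "BlobInputDataset", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "SqlOutputDataset", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      }
    ]
  }
}

For step 4, the pipeline can be run on demand from ADF Studio or attached to a trigger such as the schedule trigger sketched in section 1e.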

8. Hands-On Labs and Tutorials


• Microsoft Azure Data Factory Documentation
• Azure Data Factory - Hands-On Labs
Together, the sections above and these resources provide a comprehensive overview of Azure Data Factory and a starting point for hands-on practice.
