Azure Data Factory
(Visual guide image: data-factory-visual-guide.png, from the Azure Data Factory introduction page on Microsoft Learn)
###################################################################################
Azure Data Factory (ADF) Pipeline
│
├── **1. Definition & Purpose**
│ └── Orchestrates and automates data movement and data transformation workflows.
│ └── A logical grouping of activities that together perform a unit of work.
│
├── **2. Core Components** (Activities, Parameters, Variables, Datasets, Linked Services, Triggers, Integration Runtimes, Monitoring & Management)
│ │
│ │ ├── **2.1. Activities** (The building blocks of a pipeline): Data Movement, Data Transformation, and Control Flow Activities
│ │ │
│ │ ├── **2.1.1. Data Movement Activities**
│ │ │ └── Copy Data Activity (Source -> Sink)
│ │ │
│ │ ├── **2.1.2. Data Transformation Activities**
│ │ │ ├── Mapping Data Flow (Visual, code-free transformation)
│ │ │ ├── Wrangling Data Flow (Power Query based, interactive prep)
│ │ │ ├── Azure Function Activity
│ │ │ ├── Databricks Notebook/Jar/Python Activity
│ │ │ ├── Azure Synapse Notebook/Spark Job Definition
│ │ │ ├── Stored Procedure Activity (SQL DB, Synapse, etc.)
│ │ │ ├── U-SQL Activity (Azure Data Lake Analytics)
│ │ │ ├── HDInsight Activities (Hive, Pig, MapReduce, Spark, Streaming)
│ │ │ └── Custom Activity (using Azure Batch or Azure Machine Learning)
│ │ │
│ │ └── **2.1.3. Control Flow Activities**
│ │ ├── Execute Pipeline Activity (Invokes another ADF pipeline)
│ │ ├── ForEach Activity (Iterates over a collection)
│ │ ├── If Condition Activity
│ │ ├── Switch Activity
│ │ ├── Until Activity (Loops until a condition is met)
│ │ ├── Wait Activity
│ │ ├── Web Activity (Calls a custom REST endpoint)
│ │ ├── Webhook Activity (Waits for an external callback)
│ │ ├── Lookup Activity (Retrieves data from a dataset)
│ │ ├── Get Metadata Activity (Retrieves metadata about a dataset)
│ │ ├── Set Variable Activity
│ │ ├── Append Variable Activity
│ │ ├── Fail Activity (Intentionally fails the pipeline)
│ │ └── Validation Activity (Validates dataset/file existence, etc.)
│ │
│ ├── **2.2. Parameters**
│ │ └── Make pipelines dynamic and reusable.
│ │ └── Values passed at trigger time or by an Execute Pipeline activity.
│ │
│ ├── **2.3. Variables**
│ │ └── Temporary values used within a single pipeline run for internal logic.
│ │ └── Scoped to the pipeline run.
│ │
│ ├── **2.4. Datasets**
│ │ └── Named views of data that point to or reference the data you want to
use.
│ │ └── Represents data in data stores (e.g., tables, files, folders).
│ │ └── Used as inputs/outputs for activities.
│ │
│ └── **2.5. Linked Services**
│ └── Connection strings or information needed to connect to external
resources.
│ └── Defines connection to data stores (e.g., Azure Blob, SQL DB) and
compute (e.g., Databricks, HDInsight).
│
├── **3. Triggers** (How pipelines are initiated)
│ │
│ ├── **3.1. Schedule Trigger** (Runs on a wall-clock schedule)
│ ├── **3.2. Tumbling Window Trigger** (Runs on a periodic interval, maintaining
state)
│ ├── **3.3. Storage Event Trigger** (Runs when a blob is created/deleted in
Azure Storage)
│ ├── **3.4. Custom Event Trigger** (Runs in response to an Event Grid topic
event)
│ └── **3.5. Manual Trigger** (Debug Run / Trigger Now)
│
├── **4. Integration Runtimes (IR)** (Compute infrastructure)
│ │
│ ├── **4.1. Azure IR** (Default, serverless, fully managed by Azure)
│ │ └── For data movement between cloud data stores and data transformation in
cloud.
│ │
│ ├── **4.2. Self-Hosted IR** (Managed by you, on-premises or in a VNet)
│ │ └── For data movement between cloud and private network/on-prem.
│ │ └── For dispatching transform activities against resources in a private
network.
│ │
│ └── **4.3. Azure-SSIS IR** (For lifting and shifting SSIS packages to Azure)
│ └── Dedicated Azure VMs for running SSIS packages.
│
├── **5. Monitoring & Management**
│ │
│ ├── **5.1. Pipeline Runs** (Status of each pipeline execution)
│ ├── **5.2. Activity Runs** (Status of each activity within a pipeline run)
│ ├── **5.3. Trigger Runs** (Status of trigger executions)
│ ├── **5.4. Alerts & Metrics** (Via Azure Monitor)
│ ├── **5.5. Logging** (Diagnostic logs, activity logs)
│ └── **5.6. ADF Studio (UI)** (Visual authoring, monitoring, management)
│
├── **6. Development & Deployment**
│ │
│ ├── **6.1. ADF Studio** (UI for visual authoring)
│ ├── **6.2. ARM Templates** (Infrastructure as Code)
│ ├── **6.3. CI/CD Integration** (Azure DevOps, GitHub Actions)
│ ├── **6.4. PowerShell / CLI / SDKs** (Programmatic management)
│ └── **6.5. Source Control Integration** (Git: Azure DevOps Git, GitHub)
│
├── **7. Security**
│ │
│ ├── **7.1. Managed Identity** (For authenticating to Azure services)
│ ├── **7.2. Azure Key Vault Integration** (For storing secrets)
│ ├── **7.3. RBAC** (Role-Based Access Control)
│ ├── **7.4. Private Endpoints / Managed VNet** (Network isolation)
│ └── **7.5. Data Encryption** (At rest and in transit)
│
└── **8. Key Concepts & Best Practices**
│
├── **8.1. Modularity** (Reusable child pipelines, templates)
├── **8.2. Parameterization** (For dynamic pipelines)
├── **8.3. Error Handling & Retries** (Activity-level configuration)
├── **8.4. Idempotency** (Designing pipelines to be re-runnable without side
effects)
├── **8.5. Performance Tuning** (DIU, parallelism, data flow optimization)
└── **8.6. Cost Optimization** (Choosing right IR, activity types, scheduling)
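
Everything in the outline above (linked services, datasets, pipelines, triggers, monitoring) can also be created and managed programmatically. The sketches that follow use the Python management SDK (`azure-mgmt-datafactory` plus `azure-identity`), which the later "Development & Deployment" sections also mention. This is only a minimal setup sketch; the subscription, resource group, and factory names are hypothetical placeholders.

```python
# Minimal, hedged setup reused by the later sketches. Assumes the
# azure-identity and azure-mgmt-datafactory packages are installed and that
# your identity has rights on the subscription.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<your-subscription-id>"   # placeholder
RESOURCE_GROUP = "rg-data-platform"          # hypothetical resource group
FACTORY_NAME = "adf-demo-factory"            # hypothetical factory name

# DefaultAzureCredential tries managed identity, environment variables,
# Azure CLI login, etc., in order.
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, SUBSCRIPTION_ID)
```
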
#################################################################################
Steps to Create an Azure Data Factory (ADF) Data Pipeline (Source -> Linked Service -> Dataset -> Activity (Copy / Data Flow) -> Linked Service -> Sink)
│https://www.productiveedge.com/hubfs/Imported_Blog_Media/Snip20200727_47-
1024x482.png (https://www.productiveedge.com/blog/azure-data-factory-capabilities)
├── **1. Plan & Design**
│ ├── **1.1. Define Requirements**
│ │ ├── What is the data source?
│ │ ├── What is the data destination (sink)?
│ │ ├── What transformations are needed?
│ │ ├── What is the desired frequency/schedule?
│ │ ├── What are the error handling requirements?
│ │ └── What are the security considerations?
│ ├── **1.2. Choose Integration Runtime (IR)**
│ │ ├── Azure IR (for cloud-to-cloud)
│ │ ├── Self-Hosted IR (for on-prem/VNet access)
│ │ └── Azure-SSIS IR (for SSIS package execution)
│ └── **1.3. Sketch the Data Flow**
│ └── High-level diagram of sources, transformations, sinks.
│
├── **2. Setup Azure Data Factory Instance**
│ ├── **2.1. Create/Select Azure Subscription & Resource Group**
│ ├── **2.2. Create Data Factory Instance**
│ │ ├── Name, Region, Version (V2)
│ │ └── Configure Git integration (optional, but recommended: Azure DevOps
Git / GitHub)
│ └── **2.3. Configure Integration Runtimes (if not default Azure IR)**
│ ├── Install & Register Self-Hosted IR (if needed)
│ └── Provision Azure-SSIS IR (if needed)
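
A hedged sketch of step 2.2 using the management client from the setup sketch above; the region is an assumption and the names are the earlier placeholders.

```python
from azure.mgmt.datafactory.models import Factory

# Create (or update) the Data Factory instance itself (ADF V2).
factory = adf_client.factories.create_or_update(
    RESOURCE_GROUP,
    FACTORY_NAME,
    Factory(location="eastus"),   # region is a placeholder choice
)
print(factory.name, factory.provisioning_state)
```
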
│
├── **3. Create Linked Services (Connections)**
│ ├── **3.1. For Source(s)**
│ │ ├── e.g., Azure Blob Storage, SQL Database, REST API
│ │ └── Configure connection details & credentials (Key, MSI, Service
Principal, Key Vault)
│ ├── **3.2. For Sink(s)**
│ │ ├── e.g., Azure Data Lake Storage, Azure Synapse Analytics, SQL Database
│ │ └── Configure connection details & credentials
│ ├── **3.3. For Compute (if applicable)**
│ │ ├── e.g., Azure Databricks, HDInsight, Azure Functions, Azure Batch
│ │ └── Configure connection details
│ └── **3.4. Test Connections**
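
A minimal sketch of step 3 for one source connection (Azure Blob Storage). The connection string is shown inline only for brevity; in practice it would come from Azure Key Vault or a managed identity, as covered in the Security sections later.

```python
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLinkedService,
    LinkedServiceResource,
    SecureString,
)

# Hypothetical source connection to a Blob Storage account.
blob_ls = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(
        connection_string=SecureString(value="<blob-connection-string>")
    )
)
adf_client.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "ls_source_blob", blob_ls
)
```
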
│
├── **4. Create Datasets (Data Pointers)**
│ ├── **4.1. For Source(s)**
│ │ ├── Select Linked Service
│ │ ├── Specify data structure (e.g., file path, table name, API endpoint)
│ │ ├── Define schema (import or manually define)
│ │ └── Parameterize (if needed for dynamic paths/names)
│ ├── **4.2. For Sink(s)**
│ │ ├── Select Linked Service
│ │ ├── Specify data structure
│ │ ├── Define schema
│ │ └── Parameterize (if needed)
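
A sketch of step 4: a source dataset pointing at a CSV blob through the linked service created above. The container, folder, and file names are hypothetical, and a sink dataset would be defined the same way against the sink linked service.

```python
from azure.mgmt.datafactory.models import (
    AzureBlobDataset,
    DatasetResource,
    LinkedServiceReference,
)

ls_ref = LinkedServiceReference(
    reference_name="ls_source_blob", type="LinkedServiceReference"
)

# Source dataset: a single CSV file in a hypothetical container/folder.
source_ds = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=ls_ref,
        folder_path="raw-container/input",
        file_name="data.csv",
    )
)
adf_client.datasets.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "ds_source_csv", source_ds
)
# A sink dataset named "ds_sink_csv" would be created analogously.
```
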
│
├── **5. Create the Pipeline** (Add New Pipeline -> Add Activities -> Link to source/sink Datasets -> Configure & Connect; see the sketch after this step)
│ ├── **5.1. Add New Pipeline**
│ │ └── Provide a name and description.
│ ├── **5.2. Add Activities to the Canvas**
│ │ ├── **Data Movement:** Copy Data Activity
│ │ ├── **Data Transformation:** Mapping Data Flow, Stored Procedure, Azure
Function, Databricks Notebook, etc.
│ │ ├── **Control Flow:** ForEach, If Condition, Execute Pipeline, Wait, Web,
Lookup, etc.
│ ├── **5.3. Configure Each Activity**
│ │ ├── Link to source/sink Datasets (for Copy Data)
│ │ ├── Configure settings specific to the activity type (e.g., query, script,
mapping)
│ │ ├── Set timeouts, retry policies
│ │ └── Map inputs/outputs
│ ├── **5.4. Connect Activities (Define Dependencies)**
│ │ ├── On Success (Green arrow)
│ │ ├── On Failure (Red arrow)
│ │ ├── On Completion (Blue arrow)
│ │ └── On Skipped (Grey arrow)
│ ├── **5.5. Add Pipeline Parameters (for reusability)**
│ │ └── Define parameters that can be passed at runtime.
│ └── **5.6. Add Pipeline Variables (for internal state)**
│ └── Define variables for temporary storage within the pipeline run.
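
A sketch of step 5: a pipeline with a Copy activity wired to the source/sink datasets, a runtime parameter, and a dependent activity illustrating the "On Success" dependency from 5.4. Dataset and activity names are the hypothetical ones from the earlier sketches.

```python
from azure.mgmt.datafactory.models import (
    ActivityDependency,
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    ParameterSpecification,
    PipelineResource,
    WaitActivity,
)

copy_step = CopyActivity(
    name="CopyRawToCurated",
    inputs=[DatasetReference(reference_name="ds_source_csv", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="ds_sink_csv", type="DatasetReference")],
    source=BlobSource(),
    sink=BlobSink(),
)

# A trivial follow-up step that only runs if the copy succeeds ("green arrow").
post_step = WaitActivity(
    name="PostCopyWait",
    wait_time_in_seconds=5,
    depends_on=[ActivityDependency(activity="CopyRawToCurated",
                                   dependency_conditions=["Succeeded"])],
)

pipeline = PipelineResource(
    activities=[copy_step, post_step],
    parameters={"run_date": ParameterSpecification(type="String")},
)
adf_client.pipelines.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "pl_copy_raw", pipeline
)
```
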
│
├── **6. Test & Debug the Pipeline**
│ ├── **6.1. Validate Pipeline**
│ │ └── Check for configuration errors.
│ ├── **6.2. Debug Run**
│ │ ├── Execute the pipeline manually with specific parameter values.
│ │ ├── Inspect activity inputs, outputs, and errors.
│ │ └── Use breakpoints in Data Flows.
│ ├── **6.3. Iterate & Refine**
│ └── Fix issues, optimize performance.
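
A sketch of step 6 in SDK terms. Note that this is closer to "Trigger Now" against the published factory than to an ADF Studio debug run (which executes against the authoring session); the pipeline name and parameter value are the hypothetical ones used above.

```python
import time

# Kick off an on-demand run, passing a value for the pipeline parameter.
run = adf_client.pipelines.create_run(
    RESOURCE_GROUP, FACTORY_NAME, "pl_copy_raw",
    parameters={"run_date": "2024-01-01"},
)

# Poll until the run reaches a terminal state.
while True:
    pipeline_run = adf_client.pipeline_runs.get(
        RESOURCE_GROUP, FACTORY_NAME, run.run_id
    )
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)

print(pipeline_run.status, pipeline_run.message)
```
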
│
├── **7. Create Triggers (Scheduling & Automation)**
│ ├── **7.1. Choose Trigger Type**
│ │ ├── Schedule Trigger (wall-clock)
│ │ ├── Tumbling Window Trigger (periodic, maintains state)
│ │ ├── Storage Event Trigger (blob created/deleted)
│ │ └── Custom Event Trigger (Event Grid)
│ ├── **7.2. Configure Trigger**
│ │ ├── Set schedule, recurrence, event details.
│ │ └── Map pipeline parameters to trigger parameters (if any).
│ ├── **7.3. Associate Trigger with Pipeline**
│ └── **7.4. Activate Trigger**
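
A sketch of step 7 for a daily Schedule Trigger attached to the pipeline above. Depending on the SDK version, the start call is `triggers.begin_start` or `triggers.start`; times and names are placeholders.

```python
from datetime import datetime, timezone
from azure.mgmt.datafactory.models import (
    PipelineReference,
    ScheduleTrigger,
    ScheduleTriggerRecurrence,
    TriggerPipelineReference,
    TriggerResource,
)

recurrence = ScheduleTriggerRecurrence(
    frequency="Day",
    interval=1,
    start_time=datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
    time_zone="UTC",
)

trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=recurrence,
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                reference_name="pl_copy_raw", type="PipelineReference"
            ),
            # Map a trigger system variable onto the pipeline parameter.
            parameters={"run_date": "@trigger().scheduledTime"},
        )],
    )
)
adf_client.triggers.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "tr_daily_2am", trigger
)
# Triggers are created stopped; start one to activate it.
adf_client.triggers.begin_start(RESOURCE_GROUP, FACTORY_NAME, "tr_daily_2am").result()
```
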
│
├── **8. Publish Changes**
│ ├── **8.1. Save all ADF artifacts (if not using Git auto-save)**
│ ├── **8.2. (If using Git) Commit & Push changes to your repository**
│ ├── **8.3. Publish from ADF Studio**
│ └── This deploys your configuration from the collaboration branch (e.g., `main`) to the live Data Factory service and generates ARM templates into the publish branch (e.g., `adf_publish`).
│ └── **8.4. (Optional) Use CI/CD Pipelines**
│ └── Azure DevOps / GitHub Actions for automated deployment of ARM
templates.
│
└── **9. Monitor & Manage**
├── **9.1. Monitor Pipeline Runs**
│ └── Check status (Succeeded, Failed, In Progress).
├── **9.2. Monitor Activity Runs**
│ └── Drill down into individual activity details.
├── **9.3. Monitor Trigger Runs**
├── **9.4. Set up Alerts (via Azure Monitor)**
│ └── For failures or long-running pipelines.
└── **9.5. Review Logs**
└── For troubleshooting and auditing.
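
A sketch of step 9: querying the last 24 hours of pipeline runs programmatically, which returns the same information ADF Studio's Monitor tab shows (client and names are the hypothetical ones from earlier).

```python
from datetime import datetime, timedelta, timezone
from azure.mgmt.datafactory.models import RunFilterParameters

window = RunFilterParameters(
    last_updated_after=datetime.now(timezone.utc) - timedelta(days=1),
    last_updated_before=datetime.now(timezone.utc),
)
runs = adf_client.pipeline_runs.query_by_factory(
    RESOURCE_GROUP, FACTORY_NAME, window
)
for r in runs.value:
    print(r.pipeline_name, r.status, r.run_start, r.run_end)
```
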
#################################################################################
Azure Data Factory (ADF) - Cloud-based ETL & Data Integration Service
└─── 1. Core Concepts & Components
│
├─── 1.1. Pipeline
│ │ └── Definition: Logical grouping of activities that perform a unit of
work.
│ │ └── Properties: Parameters, Variables, Annotations, Concurrency, Run
Dimensions
│ │ └── Orchestration: Defines sequence, parallelism, conditional
execution.
│
├─── 1.2. Activity
│ │ └── Definition: Individual processing step in a pipeline.
│ │ └── Types:
│ │ ├── 1.2.1. Data Movement Activities
│ │ │ └── Copy Data Activity
│ │ │ ├── Sources (Connectors: ~100+, e.g., Blob, SQL DB, On-
prem, SaaS)
│ │ │ ├── Sinks (Connectors: ~100+, e.g., Data Lake, Synapse, SQL
DB)
│ │ │ ├── Data Consistency Verification
│ │ │ ├── Fault Tolerance (Skip incompatible rows)
│ │ │ ├── Performance Tuning (DIUs, Parallel Copies)
│ │ │ ├── Staging (e.g., PolyBase, Interim Blob)
│ │ │ └── Schema Mapping (Explicit, Dynamic)
│ │ │
│ │ ├── 1.2.2. Data Transformation Activities
│ │ │ ├── Mapping Data Flows
│ │ │ │ ├── Visual Data Transformation (Code-free / Low-code)
│ │ │ │ ├── Scalable execution on Spark clusters (managed by ADF)
│ │ │ │ ├── Sources & Sinks (within Data Flow)
│ │ │ │ ├── Transformations (Join, Union, Aggregate, Derived
Column, Lookup, Filter, Sort, etc.)
│ │ │ │ ├── Debug Mode (Interactive data preview & debugging)
│ │ │ │ ├── Parameterization
│ │ │ │ └── Data Flow Script (underlying JSON definition)
│ │ │ │
│ │ │ ├── External / Code-Based Activities
│ │ │ │ ├── Azure Function Activity
│ │ │ │ ├── Azure Batch Activity (Custom .NET)
│ │ │ │ ├── Azure Databricks Activity (Notebook, Jar, Python)
│ │ │ │ ├── Azure Synapse Notebook / Spark Job Definition
│ │ │ │ ├── HDInsight Activity (Hive, Pig, MapReduce, Spark,
Streaming)
│ │ │ │ ├── Stored Procedure Activity (SQL Server, Azure SQL,
Synapse)
│ │ │ │ ├── U-SQL Activity (Azure Data Lake Analytics)
│ │ │ │ └── Machine Learning Activities (Execute Batch, Update
Resource)
│ │ │
│ │ └── 1.2.3. Control Flow Activities
│ │ ├── Execute Pipeline Activity (Invoke another ADF pipeline)
│ │ ├── ForEach Activity (Looping)
│ │ ├── If Condition Activity
│ │ ├── Switch Activity
│ │ ├── Until Activity (Do-while loop)
│ │ ├── Wait Activity
│ │ ├── Web Activity (Call REST API)
│ │ ├── Webhook Activity (Wait for external callback)
│ │ ├── Lookup Activity (Retrieve data for decisions)
│ │ ├── Get Metadata Activity (File/table properties, exists)
│ │ ├── Set Variable Activity
│ │ ├── Append Variable Activity
│ │ └── Fail Activity (Intentionally fail pipeline)
│
├─── 1.3. Datasets
│ │ └── Definition: Named view of data; points to or references the data
you want to use.
│ │ └── Properties: Schema, Parameters, Connection (via Linked Service)
│ │ └── Types: File-based (CSV, Parquet, JSON), Database tables, NoSQL
collections, etc.
│
├─── 1.4. Linked Services
│ │ └── Definition: Connection strings to external resources.
│ │ └── Stores connection info: Credentials, Server names, Database names,
etc.
│ │ └── Manages connectivity to data stores & compute services.
│ │ └── Parameterization (for dynamic connections)
│
├─── 1.5. Integration Runtimes (IR)
│ │ └── Definition: Compute infrastructure used by ADF.
│ │ └── Types:
│ │ ├── 1.5.1. Azure IR
│ │ │ └── Fully managed, serverless compute in Azure.
│ │ │ └── Used for cloud-to-cloud data movement & data flow
execution.
│ │ │ └── Region specific or Auto-resolve.
│ │ │ └── Managed Virtual Network (for secure data access)
│ │ │
│ │ ├── 1.5.2. Self-Hosted IR (SHIR)
│ │ │ └── Installed on-premises or in a private virtual network.
│ │ │ └── Enables access to private network resources.
│ │ │ └── Used for data movement to/from private networks.
│ │ │ └── Can dispatch activities to on-prem compute (e.g., SQL
Server).
│ │ │ └── High Availability & Scalability (multi-node)
│ │ │ └── Credential Management (local or Key Vault)
│ │ │
│ │ └── 1.5.3. Azure-SSIS IR
│ │ └── Dedicated Azure compute for lifting & shifting SSIS
packages.
│ │ └── Provisions Azure VMs that run the SSIS engine; SSISDB itself is hosted in Azure SQL DB/MI.
│ │ └── VNet Integration (for private network access).
│ │ └── Custom Setups (e.g., installing additional components).
│ │ └── Express Custom Setup
│
├─── 1.6. Triggers
│ │ └── Definition: Unit of processing that determines when a pipeline run
needs to be kicked off.
│ │ └── Types:
│ │ ├── 1.6.1. Schedule Trigger (Wall-clock schedule)
│ │ ├── 1.6.2. Tumbling Window Trigger (Fixed-size, non-overlapping,
contiguous time intervals)
│ │ │ └── Supports dependencies on other Tumbling Window Triggers.
│ │ │ └── Backfilling / Reruns for specific windows.
│ │ ├── 1.6.3. Storage Event Trigger (Reacts to Blob created/deleted in
Azure Storage)
│ │ ├── 1.6.4. Custom Event Trigger (Integrates with Azure Event Grid
for broader event sources)
│ │ └── 1.6.5. Manual Trigger (On-demand execution)
│
├─── 1.7. Parameters & Variables
│ │ └── Parameters: Defined at Pipeline, Dataset, Linked Service level for
reusability & dynamic behavior. Passed in at execution time.
│ │ └── Variables: Used within pipelines for temporary value storage during
execution (Set Variable, Append Variable).
│
└─── 1.8. Expressions & Functions
│ └── Used for dynamic content, conditions, data manipulation within
pipelines.
│ └── System variables, user-defined functions (in Data Flows), string,
collection, logical, conversion, math, date functions.
└─── Global Parameters
└─── Constants defined at Data Factory level, usable across all
pipelines.
└─── 4. Security
│
├─── 4.1. Authentication & Authorization
│ │ ├── Managed Identities for Azure Resources (System-assigned, User-
assigned)
│ │ ├── Service Principals
│ │ ├── Azure Key Vault Integration (for storing secrets, connection
strings)
│ │ └── Role-Based Access Control (RBAC) (Data Factory Contributor, Reader,
User Access Admin)
│
├─── 4.2. Network Security
│ │ ├── Managed Virtual Network & Managed Private Endpoints (for Azure PaaS
sources with Azure IR)
│ │ ├── Private Link for Azure Data Factory (Access ADF Studio/service
privately)
│ │ ├── Self-Hosted IR in VNet/On-prem (for private data sources)
│ │ └── Firewall rules on data stores
│
├─── 4.3. Data Encryption
│ │ ├── In Transit (HTTPS/TLS)
│ │ ├── At Rest (Azure Storage Service Encryption, Transparent Data
Encryption for SQL)
│ │ └── Customer-Managed Keys (CMK) for encrypting ADF environment.
│
└─── 4.4. Data Masking / Obfuscation (within Transformations)
###################################################################################
Azure Data Factory (ADF) - Cloud-based ETL & Data Integration Service
Detailed Description: Azure Data Factory is a fully managed, serverless data
integration service provided by Microsoft Azure. It's designed to compose,
orchestrate, schedule, and monitor data-driven workflows (pipelines) that can
ingest data from disparate sources, transform it, and publish it to various
destinations. ADF is primarily used for building Extract, Transform, Load (ETL),
Extract, Load, Transform (ELT), and data integration solutions at scale.
1. Core Concepts & Components
│
├─── **1.1. Pipeline**
│ │ └── **Definition:** A pipeline is a logical grouping of activities that
together perform a unit of work. It's the top-level orchestrator in ADF. For
example, a pipeline might ingest data from a Blob storage, transform it using a
Data Flow, and then load it into Azure Synapse Analytics.
│ │ └── **Properties:**
│ │ ├── **Parameters:** Input values passed to a pipeline at runtime,
making pipelines reusable and dynamic.
│ │ ├── **Variables:** Temporary values used within a single pipeline run
to store and manipulate data during execution.
│ │ ├── **Annotations:** User-defined tags or notes for better organization
and understanding.
│ │ ├── **Concurrency:** Controls how many simultaneous runs of the same
pipeline are allowed.
│ │ └── **Run Dimensions:** User-defined key-value pairs to categorize and
filter pipeline runs.
│ │ └── **Orchestration:** Pipelines define the control flow for activities,
including sequential execution, parallel execution (activities running
simultaneously), and conditional execution (activities running based on the outcome
of previous ones or specific conditions).
│
├─── **1.2. Activity**
│ │ └── **Definition:** An activity represents a single processing step within
a pipeline. Each activity performs a specific action.
│ │ └── **Types:**
│ │ ├── **1.2.1. Data Movement Activities**
│ │ │ └── **Copy Data Activity:**
│ │ │ ├── **Description:** The primary activity for data movement. It
copies data from a source data store to a sink data store.
│ │ │ ├── **Sources & Sinks:** Supports a vast array of connectors
(~100+) for both sources and sinks, including on-premises systems (via Self-Hosted
IR), Azure services (Blob, SQL DB, DataLake, Synapse), and third-party SaaS
applications.
│ │ │ ├── **Data Consistency Verification:** Options to ensure data
integrity during transfer, like checking MD5 hashes.
│ │ │ ├── **Fault Tolerance:** Ability to skip incompatible rows
during copy and log them, allowing the main job to continue.
│ │ │ ├── **Performance Tuning:** Scalability through Data
Integration Units (DIUs) and parallel copy settings.
│ │ │ ├── **Staging:** Can use an interim storage (like Blob) for
optimized loading into certain sinks (e.g., PolyBase for Synapse, or when direct
copy isn't optimal).
│ │ │ └── **Schema Mapping:** Allows explicit mapping of source
columns to sink columns, or dynamic schema handling.
│ │ │
│ │ ├── **1.2.2. Data Transformation Activities**
│ │ │ ├── **Mapping Data Flows:**
│ │ │ │ ├── **Description:** Visually design and build code-free or
low-code data transformations.
│ │ │ │ ├── **Scalable execution:** Transformations are executed on
managed, auto-scaling Apache Spark clusters provisioned by ADF.
│ │ │ │ ├── **Sources & Sinks (within Data Flow):** Specific connectors
optimized for data flow processing.
│ │ │ │ ├── **Transformations:** Rich set of transformations like Join,
Union, Aggregate, Derived Column, Lookup, Filter, Sort, Pivot, Unpivot, Window,
etc.
│ │ │ │ ├── **Debug Mode:** Interactive environment to build, test, and
preview data transformations with live data samples.
│ │ │ │ ├── **Parameterization:** Allows dynamic behavior in data flows
(e.g., dynamic source/sink paths, filter conditions).
│ │ │ │ └── **Data Flow Script:** The underlying JSON definition of the
visual data flow, useful for advanced editing or version control.
│ │ │ │
│ │ │ ├── **External / Code-Based Activities:** Activities that execute
transformations on external compute services.
│ │ │ │ ├── **Azure Function Activity:** Executes an Azure Function
(serverless code).
│ │ │ │ ├── **Azure Batch Activity:** Runs custom .NET code on Azure
Batch pools for large-scale parallel processing.
│ │ │ │ ├── **Azure Databricks Activity:** Executes Databricks
notebooks, JARs, or Python scripts on Databricks clusters.
│ │ │ │ ├── **Azure Synapse Notebook / Spark Job Definition:** Executes
Synapse notebooks or Spark jobs within an Azure Synapse Analytics workspace.
│ │ │ │ ├── **HDInsight Activity:** Executes Hive, Pig, MapReduce, or
Spark jobs on HDInsight clusters.
│ │ │ │ ├── **Stored Procedure Activity:** Executes stored procedures
in SQL Server, Azure SQL Database, or Azure Synapse Analytics.
│ │ │ │ ├── **U-SQL Activity (Azure Data Lake Analytics):** Executes U-
SQL scripts on Azure Data Lake Analytics (legacy service).
│ │ │ │ └── **Machine Learning Activities:**
│ │ │ │ └── **Azure ML Execute Batch:** Runs Azure Machine Learning
batch scoring pipelines.
│ │ │ │ └── **Azure ML Update Resource:** Updates Azure Machine
Learning model resources.
│ │ │
│ │ └── **1.2.3. Control Flow Activities:** Activities that manage the
execution flow of a pipeline.
│ │ ├── **Execute Pipeline Activity:** Invokes another ADF pipeline,
enabling modular design.
│ │ ├── **ForEach Activity:** Iterates over a collection of items and
executes specified activities for each item (in sequence or parallel).
│ │ ├── **If Condition Activity:** Executes one set of activities if a
condition is true, and optionally another set if false.
│ │ ├── **Switch Activity:** Similar to a switch-case statement in
programming; executes a specific set of activities based on matching an
expression's value.
│ │ ├── **Until Activity:** Executes a set of activities repeatedly
until a specified condition becomes true (like a do-while loop).
│ │ ├── **Wait Activity:** Pauses pipeline execution for a specified
period.
│ │ ├── **Web Activity:** Calls a custom REST API endpoint.
│ │ ├── **Webhook Activity:** Pauses pipeline execution and waits for
an external service to call back a provided URL before resuming.
│ │ ├── **Lookup Activity:** Retrieves data (a single row or a small
set of rows) from any ADF-supported data source, often used to make decisions in
subsequent activities.
│ │ ├── **Get Metadata Activity:** Retrieves metadata about a data
source (e.g., file existence, size, last modified date, table schema).
│ │ ├── **Set Variable Activity:** Assigns a value to a pipeline
variable.
│ │ ├── **Append Variable Activity:** Appends a value to an array-type
pipeline variable.
│ │ └── **Fail Activity:** Intentionally causes a pipeline run to fail,
often used in error handling paths.
│
├─── **1.3. Datasets**
│ │ └── **Definition:** A dataset is a named view of data that simply points to
or references the data you want to use as an input or output in your activities. It
doesn't store the actual data but defines the structure and location.
│ │ └── **Properties:** Defines the data's **schema** (structure), can be
**parameterized** for dynamic referencing (e.g., dynamic file paths), and specifies
the **connection** to the data store via a Linked Service.
│ │ └── **Types:** Represents various data structures like files (CSV, Parquet,
JSON, Avro), database tables, NoSQL collections, API responses, etc.
│
├─── **1.4. Linked Services**
│ │ └── **Definition:** Linked services are like connection strings. They
define the connection information needed for Data Factory to connect to external
resources (data stores or compute services).
│ │ └── **Stores connection info:** Includes credentials (which can be stored
in Azure Key Vault), server names, database names, file paths, API keys, etc.
│ │ └── **Manages connectivity:** A single linked service can be used by
multiple datasets or activities if they connect to the same resource.
│ │ └── **Parameterization:** Allows for dynamic connection strings, useful for
different environments (Dev, Test, Prod).
│
├─── **1.5. Integration Runtimes (IR)**
│ │ └── **Definition:** The Integration Runtime (IR) is the compute
infrastructure used by Azure Data Factory to provide data integration capabilities
across different network environments. It bridges the gap between
activities/datasets and linked services.
│ │ └── **Types:**
│ │ ├── **1.5.1. Azure IR**
│ │ │ └── **Description:** A fully managed, serverless compute in Azure.
ADF automatically manages, scales, and patches this IR.
│ │ │ └── **Usage:** Primarily used for cloud-to-cloud data movement
(Copy activity) and for executing Mapping Data Flows.
│ │ │ └── **Configuration:** Can be region-specific or set to "Auto-
resolve" to use the region of the data sink.
│ │ │ └── **Managed Virtual Network (VNet):** Can be provisioned within
an ADF-managed VNet for secure data access to Azure PaaS services using private
endpoints.
│ │ │
│ │ ├── **1.5.2. Self-Hosted IR (SHIR)**
│ │ │ └── **Description:** An agent that you install and manage on your
on-premises machines or in a private virtual network (Azure VNet or other cloud
provider's VNet).
│ │ │ └── **Usage:** Enables ADF to access data sources and dispatch
activities within a private network (e.g., on-premises SQL Server, file shares).
│ │ │ └── **Data Movement:** Facilitates data copy to/from private
networks.
│ │ │ └── **Activity Dispatch:** Can dispatch external activities (like
Stored Procedure) to compute resources within the private network.
│ │ │ └── **High Availability & Scalability:** Can be scaled out by
adding multiple nodes to the SHIR logical group.
│ │ │ └── **Credential Management:** Credentials for on-premises data
stores can be stored locally on the SHIR machine or referenced from Azure Key
Vault.
│ │ │
│ │ └── **1.5.3. Azure-SSIS IR**
│ │ └── **Description:** A dedicated Azure compute infrastructure
specifically designed to lift and shift existing SQL Server Integration Services
(SSIS) packages to the cloud.
│ │ └── **Functionality:** Provisions Azure VMs to host an SSIS engine.
You need an Azure SQL Database or Managed Instance to host the SSIS catalog
(SSISDB).
│ │ └── **VNet Integration:** Can be joined to an Azure Virtual
Network, allowing it to access on-premises data sources via VPN or ExpressRoute, or
other Azure resources within the VNet.
│ │ └── **Custom Setups:** Allows installation of additional
components, drivers, or custom tasks required by SSIS packages.
│ │ └── **Express Custom Setup:** A simplified way to run common custom
setup scripts.
│
├─── **1.6. Triggers**
│ │ └── **Definition:** A trigger is a unit of processing that determines when
a pipeline run needs to be initiated. It defines the "when" for pipeline execution.
│ │ └── **Types:**
│ │ ├── **1.6.1. Schedule Trigger:** Initiates pipeline runs based on a
recurring wall-clock schedule (e.g., daily at 2 AM, every 15 minutes).
│ │ ├── **1.6.2. Tumbling Window Trigger:** Operates on fixed-size, non-
overlapping, contiguous time intervals (windows). Useful for processing time-sliced
data (e.g., hourly data chunks).
│ │ │ └── **Dependencies:** Can depend on other Tumbling Window Triggers,
ensuring data processing order.
│ │ │ └── **Backfilling/Reruns:** Supports reprocessing data for specific
past time windows.
│ │ ├── **1.6.3. Storage Event Trigger:** Initiates pipeline runs in
response to storage events, specifically when a blob is created or deleted in Azure
Blob Storage or Azure Data Lake Storage Gen2.
│ │ ├── **1.6.4. Custom Event Trigger:** Integrates with Azure Event Grid,
allowing pipelines to be triggered by a wide variety of custom events from Azure
services or custom applications.
│ │ └── **1.6.5. Manual Trigger (Debug/Trigger Now):** Allows for on-demand
execution of pipelines directly from the ADF Studio or via SDK/API.
│
├─── **1.7. Parameters & Variables**
│ │ └── **Parameters:** Defined at the Pipeline, Dataset, Linked Service, or
Data Flow level. They are strongly typed values that are passed into these
artifacts when they are invoked or used, promoting reusability and dynamic
behavior. For example, a pipeline parameter could specify a source folder path.
│ │ └── **Variables:** Defined at the pipeline level. They are used to store
temporary values during a single pipeline run. Their values can be set and modified
using the "Set Variable" and "Append Variable" activities. Unlike parameters,
variables are not passed in from outside but are managed internally within the
pipeline run.
│
└─── **1.8. Expressions & Functions**
│ └── **Description:** ADF provides a rich expression language with built-in
functions. These are used to create dynamic content, evaluate conditions, and
manipulate data within pipeline definitions (e.g., in activity properties,
conditional paths, variable assignments).
│ └── **Examples:** Includes system variables (e.g., `@pipeline().RunId`),
user-defined functions (within Mapping Data Flows), and functions for string
manipulation, collection handling, logical operations, data type conversions,
mathematical calculations, and date/time operations.
└─── **Global Parameters**
└─── **Description:** Constants defined at the Data Factory level (in the
"Manage" hub). These values can be referenced in any pipeline expression within
that Data Factory instance, making it easy to manage common values (like
environment tags, shared paths) across multiple pipelines without parameterizing
each one individually.
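
To make 1.7 and 1.8 concrete, here is a hedged sketch of a pipeline that combines a parameter, a variable, and an ADF expression. The `@concat(...)` string is ADF's own expression language passed as dynamic content; all names are hypothetical and the client is the one from the earlier setup sketch.

```python
from azure.mgmt.datafactory.models import (
    ParameterSpecification,
    PipelineResource,
    SetVariableActivity,
    VariableSpecification,
)

pipeline = PipelineResource(
    parameters={"base_folder": ParameterSpecification(type="String",
                                                      default_value="curated")},
    variables={"output_path": VariableSpecification(type="String")},
    activities=[SetVariableActivity(
        name="BuildOutputPath",
        variable_name="output_path",
        # Dynamic content: parameter + date functions, evaluated at run time.
        value={
            "value": "@concat(pipeline().parameters.base_folder, '/', "
                     "formatDateTime(utcnow(), 'yyyy/MM/dd'))",
            "type": "Expression",
        },
    )],
)
adf_client.pipelines.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "pl_expression_demo", pipeline
)
```
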
2. Development & Authoring
│
├─── **2.1. Azure Data Factory Studio (Web UI)**
│ │ └── **Description:** The primary web-based graphical user interface for
developing, managing, and monitoring ADF resources.
│ │ ├── **Visual authoring:** Provides a drag-and-drop interface for building
pipelines and mapping data flows.
│ │ ├── **JSON editing:** Allows direct editing of the underlying JSON
definitions for all ADF artifacts.
│ │ ├── **Pipeline canvas:** The workspace where activities are arranged and
connected to define a pipeline.
│ │ ├── **Data Flow designer:** A visual tool for creating complex data
transformations without writing code.
│ │ ├── **Monitoring views:** Integrated dashboards to track pipeline,
activity, and trigger runs.
│ │ └── **Management hub:** Area for managing linked services, integration
runtimes, source control, global parameters, and security settings.
│
├─── **2.2. Azure PowerShell / CLI**
│ │ └── **Description:** Command-line tools for automating the creation,
deployment, management, and monitoring of ADF resources. Ideal for scripting
repetitive tasks and integrating ADF management into broader automation workflows.
│
├─── **2.3. SDKs (.NET, Python, Java, JavaScript)**
│ │ └── **Description:** Software Development Kits that allow developers to
interact with ADF programmatically. Useful for building custom applications that
manage or trigger ADF pipelines, or for embedding ADF functionalities within other
systems.
│
├─── **2.4. ARM Templates (Azure Resource Manager Templates)**
│ │ └── **Description:** JSON files that define the infrastructure and
configuration for your Azure resources, including ADF.
│ │ └── **Usage:** Enables Infrastructure as Code (IaC) practices for
consistent and repeatable deployment of ADF instances and their components across
different environments (e.g., Dev, Test, Prod). ADF Studio provides an option to
export the entire factory or individual resources as an ARM template.
│
└─── **2.5. Source Control Integration**
│ └── **Description:** ADF integrates directly with Git-based source control
repositories.
│ ├── **Supported Repositories:** Azure DevOps Git and GitHub.
│ └── **Workflow:** Developers typically work on feature branches and merge changes into a collaboration branch (e.g., `main` or `develop`); publishing from the collaboration branch generates ARM templates into a designated publish branch (e.g., `adf_publish`) that is then used for deployment. This enables version control, collaboration, and CI/CD.
└─── **CI/CD (Continuous Integration / Continuous Deployment)**
└─── **Description:** Practices for automating the build, test, and
deployment of ADF solutions.
└─── **Tools:** Typically implemented using Azure Pipelines (part of Azure
DevOps) or GitHub Actions.
└─── **Process:** Involves taking the ARM templates generated from the
publish branch and deploying them to target ADF instances, often with environment-
specific parameter overrides.
3. Monitoring & Management
│
├─── **3.1. ADF Studio Monitoring View**
│ │ └── **Description:** The built-in monitoring interface within ADF Studio
providing a user-friendly way to track the status and details of various ADF
operations.
│ │ ├── **Pipeline Runs:** Lists all pipeline executions with their status
(Succeeded, Failed, In Progress), duration, trigger type, parameters used, and
links to view activity runs.
│ │ ├── **Activity Runs:** Shows details for each activity executed within a
pipeline run, including input, output, errors (if any), and duration.
│ │ ├── **Data Flow Debug Sessions:** Monitors active data flow debug sessions,
showing cluster status and data previews.
│ │ ├── **Trigger Runs:** Lists all trigger executions and their status.
│ │ └── **Alerts & Metrics:** Provides a summary and links to Azure Monitor for
more detailed metrics and configured alerts.
│
├─── **3.2. Azure Monitor Integration**
│ │ └── **Description:** ADF integrates deeply with Azure Monitor, Azure's
centralized monitoring service, for comprehensive logging, metrics, and alerting.
│ │ ├── **Metrics:** Collects various performance and operational metrics
(e.g., successful/failed activity runs, pipeline run counts, IR CPU/memory
utilization, Data Flow execution times).
│ │ ├── **Diagnostic Logs:** Captures detailed logs for pipeline, activity, and
trigger runs, as well as IR health and operations.
│ │ │ └── **Destinations:** Logs can be sent to Log Analytics (for querying
with KQL), an Azure Storage Account (for archiving), or Azure Event Hubs (for real-
time streaming to other systems).
│ │ ├── **Alerts:** Allows configuration of alerts based on metric thresholds
or log query results (e.g., alert if a pipeline fails, or if IR CPU is too high).
│ │ └── **Workbooks:** Enables creation of custom, interactive dashboards in
Azure Monitor using ADF metrics and logs for tailored visualizations.
│
├─── **3.3. Programmatic Monitoring**
│ │ └── **Description:** ADF resources and run statuses can be monitored
programmatically using the Azure SDKs (e.g., .NET, Python), REST APIs, or Azure
PowerShell/CLI cmdlets. This is useful for custom monitoring solutions or
integrating ADF status into external dashboards.
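
A short sketch of programmatic monitoring: drilling into the activity runs of a single pipeline run. The run ID would come from an earlier `create_run` call or a pipeline-runs query, and the client is the hypothetical one from the setup sketch near the top of these notes.

```python
from datetime import datetime, timedelta, timezone
from azure.mgmt.datafactory.models import RunFilterParameters

run_id = "<pipeline-run-id>"   # placeholder: taken from create_run or a runs query

window = RunFilterParameters(
    last_updated_after=datetime.now(timezone.utc) - timedelta(hours=6),
    last_updated_before=datetime.now(timezone.utc),
)
activity_runs = adf_client.activity_runs.query_by_pipeline_run(
    RESOURCE_GROUP, FACTORY_NAME, run_id, window
)
for ar in activity_runs.value:
    print(ar.activity_name, ar.status, ar.duration_in_ms, ar.error)
```
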
│
└─── **3.4. Lineage**
│ └── **Description:** Refers to tracking the origin, movement,
transformation, and destination of data.
│ └── **Integration:** Azure Data Factory integrates with **Azure Purview**
(Azure's unified data governance service). When ADF pipelines run, they can report
lineage information to Purview, allowing users to visualize data flow from source
to destination across various systems and transformations, enhancing data discovery
and impact analysis.
4. Security
│
├─── **4.1. Authentication & Authorization**
│ │ ├── **Managed Identities for Azure Resources:** Recommended way for ADF to
authenticate to other Azure services that support Azure AD authentication (e.g.,
Azure Blob Storage, Azure SQL DB, Azure Key Vault). Can be System-assigned (tied to
the ADF instance's lifecycle) or User-assigned (standalone Azure resource
assignable to multiple services).
│ │ ├── **Service Principals:** An Azure AD application identity that can be
granted permissions to resources. Used when Managed Identities are not applicable
or for specific scenarios.
│ │ ├── **Azure Key Vault Integration:** ADF can retrieve secrets (like
connection strings, passwords, keys) stored securely in Azure Key Vault at runtime,
instead of storing them directly in ADF linked service definitions.
│ │ └── **Role-Based Access Control (RBAC):** Azure RBAC is used to control who
has what permissions to the ADF service itself (e.g., Data Factory Contributor to
create/edit, Reader to view). This also applies to data stores ADF accesses, where
ADF's identity (Managed Identity/Service Principal) needs appropriate permissions
(e.g., Storage Blob Data Contributor).
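
A hedged sketch of 4.1 in practice: a Key Vault linked service (authenticated with the factory's managed identity, which needs secret-read access on the vault) plus an Azure SQL linked service whose connection string is resolved from Key Vault at runtime. The vault URL and secret name are hypothetical.

```python
from azure.mgmt.datafactory.models import (
    AzureKeyVaultLinkedService,
    AzureKeyVaultSecretReference,
    AzureSqlDatabaseLinkedService,
    LinkedServiceReference,
    LinkedServiceResource,
)

# 1) Linked service to Key Vault itself (uses ADF's managed identity by default).
kv_ls = LinkedServiceResource(
    properties=AzureKeyVaultLinkedService(base_url="https://my-kv.vault.azure.net/")
)
adf_client.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "ls_keyvault", kv_ls
)

# 2) SQL linked service whose secret lives in Key Vault, not in ADF.
sql_ls = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(
        connection_string=AzureKeyVaultSecretReference(
            store=LinkedServiceReference(reference_name="ls_keyvault",
                                         type="LinkedServiceReference"),
            secret_name="sql-connection-string",
        )
    )
)
adf_client.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "ls_sql_db", sql_ls
)
```
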
│
├─── **4.2. Network Security**
│ │ ├── **Managed Virtual Network & Managed Private Endpoints:** When using
Azure IR, ADF can provision it within a managed VNet. ADF can then create managed
private endpoints within this VNet to connect securely to Azure PaaS data stores
(e.g., Azure SQL, Storage) over a private link, avoiding public internet exposure.
│ │ ├── **Private Link for Azure Data Factory:** Allows you to access the ADF
service itself (Studio, API endpoints) via a private endpoint in your own VNet,
ensuring that all traffic to and from ADF stays within the Azure backbone or your
private network.
│ │ ├── **Self-Hosted IR in VNet/On-prem:** The SHIR is installed within your
private network (on-premises or cloud VNet), allowing ADF to securely access data
sources that are not publicly accessible.
│ │ └── **Firewall rules on data stores:** Configuring firewall rules on data
stores (e.g., Azure SQL, Storage) to allow access only from specific IP addresses
or VNet service endpoints associated with ADF's IRs.
│
├─── **4.3. Data Encryption**
│ │ ├── **In Transit:** All connections from ADF to external data stores and
compute services are typically encrypted using HTTPS/TLS by default.
│ │ ├── **At Rest:** Data stored by ADF (like pipeline definitions) is
encrypted by Azure. Data in target data stores is usually encrypted by the
respective service's mechanisms (e.g., Azure Storage Service Encryption,
Transparent Data Encryption for SQL).
│ │ └── **Customer-Managed Keys (CMK):** For enhanced control, you can encrypt
the Data Factory environment (entities like pipelines, datasets, linked services)
using your own encryption key stored in Azure Key Vault.
│
└─── **4.4. Data Masking / Obfuscation**
│ └── **Description:** While not a dedicated data masking service, ADF's
Mapping Data Flows can be used to implement data masking or obfuscation logic
within transformations (e.g., using Derived Column to replace sensitive characters
or hash values) before writing data to a sink. For more advanced masking, dedicated
database features or other tools might be used.
5. Performance & Optimization
│
├─── **5.1. Copy Activity**
│ │ ├── **Data Integration Units (DIUs):** A measure of power (CPU, memory,
network resource allocation) for the Copy activity on Azure IR. More DIUs (2-256)
provide more throughput. Default is often 'Auto' (4 DIUs).
│ │ ├── **Parallel Copies:**
│ │ └── **`parallelCopies` property:** Controls the number of parallel
threads within a single Copy activity run that read from the source or write to the
sink.
│ │ └── **Multiple activities:** Running multiple Copy activities in
parallel within a ForEach loop or as separate branches in a pipeline.
│ │ ├── **Staging Copy:** Using an interim storage (like Blob) to optimize data
loading into certain sinks (e.g., using PolyBase or COPY statement for Azure
Synapse Analytics, or when direct copy has limitations).
│ │ └── **Compression:** Using compression (e.g., GZip, BZip2) for supported
file formats can reduce data volume transferred and improve I/O performance, at the
cost of some CPU for compression/decompression.
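
A sketch of how the Copy activity knobs in 5.1 surface in the SDK: explicit DIUs, parallel copies, fault tolerance, and a staged copy through an interim blob account. Dataset and linked-service names are hypothetical.

```python
from azure.mgmt.datafactory.models import (
    BlobSource,
    CopyActivity,
    DatasetReference,
    LinkedServiceReference,
    SqlSink,
    StagingSettings,
)

tuned_copy = CopyActivity(
    name="TunedCopy",
    inputs=[DatasetReference(reference_name="ds_source_csv", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="ds_sink_table", type="DatasetReference")],
    source=BlobSource(),
    sink=SqlSink(),
    data_integration_units=16,          # explicit DIUs instead of "Auto"
    parallel_copies=8,                  # parallel threads within this one copy
    enable_skip_incompatible_row=True,  # fault tolerance: skip bad rows
    enable_staging=True,                # stage via interim blob storage
    staging_settings=StagingSettings(
        linked_service_name=LinkedServiceReference(
            reference_name="ls_staging_blob", type="LinkedServiceReference"
        ),
        path="staging-container",
    ),
)
```
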
│
├─── **5.2. Mapping Data Flows**
│ │ ├── **Compute Type:** Choose the Azure IR configuration for Data Flow Spark
clusters: General Purpose, Memory Optimized (for memory-intensive operations like
joins, lookups), or Compute Optimized (for CPU-intensive transformations).
│ │ ├── **Core Count:** Number of cores for the Spark cluster driver and worker
nodes (e.g., 4, 8, 16+ cores per node). More cores generally mean faster
processing.
│ │ ├── **Time To Live (TTL):** Configures how long the Spark cluster remains
active after the last Data Flow run, allowing subsequent runs to reuse the warm
cluster, reducing startup time.
│ │ ├── **Partitioning (Source, Sink, Transformation level):** Optimizing data
partitioning in Spark can significantly improve performance by distributing data
processing evenly across worker nodes and minimizing data shuffling. Can be
configured at source, sink, and within certain transformations.
│ │ └── **Optimize tab in transformations:** Provides options like 'Broadcast'
for join optimization (sending smaller datasets to all nodes), and settings to
handle data skew.
│
├─── **5.3. Pipeline Design**
│ │ ├── **Parallel execution of activities:** Design pipelines to run
independent activities in parallel rather than sequentially where possible.
│ │ ├── **Batching:** When processing many small files or items, group them
into larger batches to reduce overhead per item.
│ │ └── **Efficient use of Lookup and Get Metadata:** Minimize frequent calls
if data doesn't change often; cache results in variables if appropriate for the
pipeline run's scope.
│
└─── **5.4. Integration Runtime Sizing**
│ ├── **SHIR:** For Self-Hosted IR, performance depends on the CPU, memory,
and network bandwidth of the machine(s) it's installed on. Scaling out by adding
more nodes to the SHIR group improves throughput and high availability.
│ └── **Azure-SSIS IR:** Choose appropriate VM Node size (affecting CPU,
memory) and Node count (for scale-out) based on the complexity and volume of SSIS
packages being executed.
6. Pricing Model
* Detailed Description: ADF pricing is generally pay-as-you-go and based on
consumption of different components.
│
├─── 6.1. Pipeline Orchestration & Execution: Charged per activity run, trigger
execution, and for pipeline orchestration (e.g., evaluating conditions, loops).
├─── 6.2. Data Flow Execution & Debugging: Charged based on vCore-hours of the
Spark cluster used. The rate depends on the compute type (General Purpose, Memory
Optimized, Compute Optimized) and core count. Debugging sessions are also charged.
├─── 6.3. Data Movement: For the Copy activity on Azure IR, charged per Data
Integration Unit-hour (DIU-hour). For SHIR, you pay for activity runs, but the SHIR
machine cost is yours.
├─── 6.4. SSIS Integration Runtime: Charged per hour based on the VM node size and
number of nodes. Also incurs costs for the Azure SQL DB/Managed Instance used to
host SSISDB, and potentially SSIS licensing (Azure Hybrid Benefit can apply).
├─── 6.5. Read/Write Operations: Charges for interactions with the ADF service,
such as creating/reading/updating/deleting ADF entities (pipelines, datasets,
etc.). These are typically minor compared to execution costs.
├─── 6.6. Monitoring: Costs associated with Azure Monitor (Log Analytics data
ingestion/retention, alert rules).
└─── 6.7. Software-Defined Network (Managed VNet, Private Endpoints): Costs may
apply for using managed VNet capabilities and private endpoints.
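
A back-of-the-envelope illustration of how these meters combine into a monthly estimate. All rates below are hypothetical placeholders, not real prices; use the current Azure pricing page and pricing calculator for actual figures.

```python
# Hypothetical unit rates (NOT real prices) to show how the meters add up.
rate_per_1000_activity_runs = 1.00   # orchestration / activity runs
rate_per_diu_hour = 0.25             # Copy activity data movement on Azure IR
rate_per_vcore_hour = 0.27           # Mapping Data Flow (General Purpose)

# Hypothetical monthly workload: an hourly pipeline whose single Copy activity
# runs 15 minutes at 4 DIUs, plus a 30-minute daily Data Flow on 8 vCores.
activity_runs = 30 * 24
diu_hours = 30 * 24 * 0.25 * 4
vcore_hours = 30 * 0.5 * 8

monthly_estimate = (
    activity_runs / 1000 * rate_per_1000_activity_runs
    + diu_hours * rate_per_diu_hour
    + vcore_hours * rate_per_vcore_hour
)
print(f"Rough monthly estimate: ${monthly_estimate:,.2f}")
```
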
7. Use Cases
* Detailed Description: ADF is versatile and can be applied to a wide range of data
integration scenarios.
│
├─── 7.1. ETL/ELT Processes: Classic data warehousing scenarios: Extracting data
from various sources, Transforming it (e.g., cleaning, aggregating, joining using
Data Flows or other compute), and Loading it into a data warehouse or data mart.
ELT involves loading raw data first and then transforming it in the target system.
├─── 7.2. Data Warehousing: Specifically, ingesting data into modern data warehouse
solutions like Azure Synapse Analytics, Snowflake, Google BigQuery, or traditional
SQL Server warehouses.
├─── 7.3. Big Data Processing: Orchestrating data pipelines that involve big data
technologies like Azure Databricks (Spark), Azure HDInsight (Hadoop, Spark, Hive),
or Azure Data Lake Storage.
├─── 7.4. Data Migration: Moving data from on-premises systems to Azure cloud
services, or migrating data between different cloud services or storage types.
├─── 7.5. Operational Data Integration: Integrating data between operational
systems, often requiring more frequent or near real-time updates.
├─── 7.6. SaaS Data Integration: Ingesting data from Software-as-a-Service (SaaS)
applications like Salesforce, Dynamics 365, SAP, etc., using ADF's connectors.
└─── 7.7. Real-time/Near Real-time (with Event Triggers): Building event-driven
data pipelines that react to events like new file arrivals in blob storage or
custom events via Event Grid.
8. Advanced Topics & Integrations
* Detailed Description: ADF works in conjunction with other Azure services to
provide a more comprehensive data platform.
│
├─── 8.1. Azure Purview (Data Governance & Cataloging): ADF integrates with Azure
Purview to automatically scan ADF instances and capture lineage information from
pipeline executions. This helps in data discovery, understanding data flow, and
impact analysis.
├─── 8.2. Power BI (Consuming data processed by ADF): Power BI can connect to data
stores (like Azure Synapse, Data Lake, SQL DB) that are populated or curated by ADF
pipelines, enabling business intelligence and reporting on the processed data.
├─── 8.3. Azure Logic Apps (Broader workflow orchestration): While ADF focuses on
data orchestration, Azure Logic Apps provides broader workflow automation. Logic
Apps can trigger ADF pipelines as part of a larger business process, or ADF can
call Logic Apps (via Web activity) for specific tasks.
├─── 8.4. Azure Functions (Serverless compute for custom steps): ADF can invoke
Azure Functions to execute custom code snippets for lightweight, event-driven
processing tasks that might not fit standard ADF activities.
└─── 8.5. Delta Lake Support (in Data Flows for ACID transactions on data lakes):
Mapping Data Flows can read from and write to Delta Lake format in Azure Data Lake
Storage. This enables ACID (Atomicity, Consistency, Isolation, Durability)
transactions, time travel (data versioning), and schema enforcement on data lakes,
bringing data warehouse-like reliability to data lakes.
This detailed breakdown should provide a much deeper understanding of each Azure
Data Factory concept and component.