Azure Data Factory
(Visual guide image: data-factory-visual-guide.png, from the Azure Data Factory introduction page on Microsoft Learn)
###################################################################################
Azure Data Factory (ADF) Pipeline
│
├── **1. Definition & Purpose**
│ └── Orchestrates and automates data movement and data transformation workflows.
│ └── A logical grouping of activities that together perform a unit of work.
│
├── **2. Core Components** (Activities, Parameters, Variables, Datasets, Linked Services, Triggers, Integration Runtimes, Monitoring & Management)
│ │
│ │ ├── **2.1. Activities** (The building blocks of a pipeline): Data Movement, Data Transformation, and Control Flow Activities
│ │ │
│ │ ├── **2.1.1. Data Movement Activities**
│ │ │ └── Copy Data Activity (Source -> Sink)
│ │ │
│ │ ├── **2.1.2. Data Transformation Activities**
│ │ │ ├── Mapping Data Flow (Visual, code-free transformation)
│ │ │ ├── Wrangling Data Flow (Power Query based, interactive prep)
│ │ │ ├── Azure Function Activity
│ │ │ ├── Databricks Notebook/Jar/Python Activity
│ │ │ ├── Azure Synapse Notebook/Spark Job Definition
│ │ │ ├── Stored Procedure Activity (SQL DB, Synapse, etc.)
│ │ │ ├── U-SQL Activity (Azure Data Lake Analytics)
│ │ │ ├── HDInsight Activities (Hive, Pig, MapReduce, Spark, Streaming)
│ │ │ └── Custom Activity (using Azure Batch or Azure Machine Learning)
│ │ │
│ │ └── **2.1.3. Control Flow Activities**
│ │ ├── Execute Pipeline Activity (Invokes another ADF pipeline)
│ │ ├── ForEach Activity (Iterates over a collection)
│ │ ├── If Condition Activity
│ │ ├── Switch Activity
│ │ ├── Until Activity (Loops until a condition is met)
│ │ ├── Wait Activity
│ │ ├── Web Activity (Calls a custom REST endpoint)
│ │ ├── Webhook Activity (Waits for an external callback)
│ │ ├── Lookup Activity (Retrieves data from a dataset)
│ │ ├── Get Metadata Activity (Retrieves metadata about a dataset)
│ │ ├── Set Variable Activity
│ │ ├── Append Variable Activity
│ │ ├── Fail Activity (Intentionally fails the pipeline)
│ │ └── Validation Activity (Validates dataset/file existence, etc.)
│ │
│ ├── **2.2. Parameters**
│ │ └── Make pipelines dynamic and reusable.
│ │ └── Values passed at trigger time or by an Execute Pipeline activity.
│ │
│ ├── **2.3. Variables**
│ │ └── Temporary values used within a single pipeline run for internal logic.
│ │ └── Scoped to the pipeline run.
│ │
│ ├── **2.4. Datasets**
│ │ └── Named views of data that point to or reference the data you want to
use.
│ │ └── Represents data in data stores (e.g., tables, files, folders).
│ │ └── Used as inputs/outputs for activities.
│ │
│ └── **2.5. Linked Services**
│ └── Connection strings or information needed to connect to external
resources.
│ └── Defines connection to data stores (e.g., Azure Blob, SQL DB) and
compute (e.g., Databricks, HDInsight).
│
├── **3. Triggers** (How pipelines are initiated)
│ │
│ ├── **3.1. Schedule Trigger** (Runs on a wall-clock schedule)
│ ├── **3.2. Tumbling Window Trigger** (Runs on a periodic interval, maintaining
state)
│ ├── **3.3. Storage Event Trigger** (Runs when a blob is created/deleted in
Azure Storage)
│ ├── **3.4. Custom Event Trigger** (Runs in response to an Event Grid topic
event)
│ └── **3.5. Manual Trigger** (Debug Run / Trigger Now)
│
├── **4. Integration Runtimes (IR)** (Compute infrastructure)
│ │
│ ├── **4.1. Azure IR** (Default, serverless, fully managed by Azure)
│ │ └── For data movement between cloud data stores and data transformation in
cloud.
│ │
│ ├── **4.2. Self-Hosted IR** (Managed by you, on-premises or in a VNet)
│ │ └── For data movement between cloud and private network/on-prem.
│ │ └── For dispatching transform activities against resources in a private
network.
│ │
│ └── **4.3. Azure-SSIS IR** (For lifting and shifting SSIS packages to Azure)
│ └── Dedicated Azure VMs for running SSIS packages.
│
├── **5. Monitoring & Management**
│ │
│ ├── **5.1. Pipeline Runs** (Status of each pipeline execution)
│ ├── **5.2. Activity Runs** (Status of each activity within a pipeline run)
│ ├── **5.3. Trigger Runs** (Status of trigger executions)
│ ├── **5.4. Alerts & Metrics** (Via Azure Monitor)
│ ├── **5.5. Logging** (Diagnostic logs, activity logs)
│ └── **5.6. ADF Studio (UI)** (Visual authoring, monitoring, management)
│
├── **6. Development & Deployment**
│ │
│ ├── **6.1. ADF Studio** (UI for visual authoring)
│ ├── **6.2. ARM Templates** (Infrastructure as Code)
│ ├── **6.3. CI/CD Integration** (Azure DevOps, GitHub Actions)
│ ├── **6.4. PowerShell / CLI / SDKs** (Programmatic management)
│ └── **6.5. Source Control Integration** (Git: Azure DevOps Git, GitHub)
│
├── **7. Security**
│ │
│ ├── **7.1. Managed Identity** (For authenticating to Azure services)
│ ├── **7.2. Azure Key Vault Integration** (For storing secrets)
│ ├── **7.3. RBAC** (Role-Based Access Control)
│ ├── **7.4. Private Endpoints / Managed VNet** (Network isolation)
│ └── **7.5. Data Encryption** (At rest and in transit)
│
└── **8. Key Concepts & Best Practices**
│
├── **8.1. Modularity** (Reusable child pipelines, templates)
├── **8.2. Parameterization** (For dynamic pipelines)
├── **8.3. Error Handling & Retries** (Activity-level configuration)
├── **8.4. Idempotency** (Designing pipelines to be re-runnable without side
effects)
├── **8.5. Performance Tuning** (DIU, parallelism, data flow optimization)
└── **8.6. Cost Optimization** (Choosing right IR, activity types, scheduling)
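
Everything in the outline above (linked services, datasets, pipelines, triggers, monitoring) can also be created and managed programmatically. The sketches that follow use the Python management SDK (`azure-mgmt-datafactory` plus `azure-identity`), which the later "Development & Deployment" sections also mention. This is only a minimal setup sketch; the subscription, resource group, and factory names are hypothetical placeholders.

```python
# Minimal, hedged setup reused by the later sketches. Assumes the
# azure-identity and azure-mgmt-datafactory packages are installed and that
# your identity has rights on the subscription.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<your-subscription-id>"   # placeholder
RESOURCE_GROUP = "rg-data-platform"          # hypothetical resource group
FACTORY_NAME = "adf-demo-factory"            # hypothetical factory name

# DefaultAzureCredential tries managed identity, environment variables,
# Azure CLI login, etc., in order.
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, SUBSCRIPTION_ID)
```
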
#################################################################################
Steps to Create an Azure Data Factory (ADF) Data Pipeline (Source -> Linked Service -> Dataset -> Activity (Copy / Data Flow) -> Linked Service -> Sink)
│https://www.productiveedge.com/hubfs/Imported_Blog_Media/Snip20200727_47-
1024x482.png (https://www.productiveedge.com/blog/azure-data-factory-capabilities)
├── **1. Plan & Design**
│ ├── **1.1. Define Requirements**
│ │ ├── What is the data source?
│ │ ├── What is the data destination (sink)?
│ │ ├── What transformations are needed?
│ │ ├── What is the desired frequency/schedule?
│ │ ├── What are the error handling requirements?
│ │ └── What are the security considerations?
│ ├── **1.2. Choose Integration Runtime (IR)**
│ │ ├── Azure IR (for cloud-to-cloud)
│ │ ├── Self-Hosted IR (for on-prem/VNet access)
│ │ └── Azure-SSIS IR (for SSIS package execution)
│ └── **1.3. Sketch the Data Flow**
│ └── High-level diagram of sources, transformations, sinks.
│
├── **2. Setup Azure Data Factory Instance**
│ ├── **2.1. Create/Select Azure Subscription & Resource Group**
│ ├── **2.2. Create Data Factory Instance**
│ │ ├── Name, Region, Version (V2)
│ │ └── Configure Git integration (optional, but recommended: Azure DevOps
Git / GitHub)
│ └── **2.3. Configure Integration Runtimes (if not default Azure IR)**
│ ├── Install & Register Self-Hosted IR (if needed)
│ └── Provision Azure-SSIS IR (if needed)
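
A hedged sketch of step 2.2 using the management client from the setup sketch above; the region is an assumption and the names are the earlier placeholders.

```python
from azure.mgmt.datafactory.models import Factory

# Create (or update) the Data Factory instance itself (ADF V2).
factory = adf_client.factories.create_or_update(
    RESOURCE_GROUP,
    FACTORY_NAME,
    Factory(location="eastus"),   # region is a placeholder choice
)
print(factory.name, factory.provisioning_state)
```
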
│
├── **3. Create Linked Services (Connections)**
│ ├── **3.1. For Source(s)**
│ │ ├── e.g., Azure Blob Storage, SQL Database, REST API
│ │ └── Configure connection details & credentials (Key, MSI, Service
Principal, Key Vault)
│ ├── **3.2. For Sink(s)**
│ │ ├── e.g., Azure Data Lake Storage, Azure Synapse Analytics, SQL Database
│ │ └── Configure connection details & credentials
│ ├── **3.3. For Compute (if applicable)**
│ │ ├── e.g., Azure Databricks, HDInsight, Azure Functions, Azure Batch
│ │ └── Configure connection details
│ └── **3.4. Test Connections**
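
A minimal sketch of step 3 for one source connection (Azure Blob Storage). The connection string is shown inline only for brevity; in practice it would come from Azure Key Vault or a managed identity, as covered in the Security sections later.

```python
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLinkedService,
    LinkedServiceResource,
    SecureString,
)

# Hypothetical source connection to a Blob Storage account.
blob_ls = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(
        connection_string=SecureString(value="<blob-connection-string>")
    )
)
adf_client.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "ls_source_blob", blob_ls
)
```
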
│
├── **4. Create Datasets (Data Pointers)**
│ ├── **4.1. For Source(s)**
│ │ ├── Select Linked Service
│ │ ├── Specify data structure (e.g., file path, table name, API endpoint)
│ │ ├── Define schema (import or manually define)
│ │ └── Parameterize (if needed for dynamic paths/names)
│ ├── **4.2. For Sink(s)**
│ │ ├── Select Linked Service
│ │ ├── Specify data structure
│ │ ├── Define schema
│ │ └── Parameterize (if needed)
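
A sketch of step 4: a source dataset pointing at a CSV blob through the linked service created above. The container, folder, and file names are hypothetical, and a sink dataset would be defined the same way against the sink linked service.

```python
from azure.mgmt.datafactory.models import (
    AzureBlobDataset,
    DatasetResource,
    LinkedServiceReference,
)

ls_ref = LinkedServiceReference(
    reference_name="ls_source_blob", type="LinkedServiceReference"
)

# Source dataset: a single CSV file in a hypothetical container/folder.
source_ds = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=ls_ref,
        folder_path="raw-container/input",
        file_name="data.csv",
    )
)
adf_client.datasets.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "ds_source_csv", source_ds
)
# A sink dataset named "ds_sink_csv" would be created analogously.
```
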
│
├── **5. Create the Pipeline** (Add New Pipeline -> Add Activities -> Link to source/sink Datasets -> Configure & Connect; see the sketch after this step)
│ ├── **5.1. Add New Pipeline**
│ │ └── Provide a name and description.
│ ├── **5.2. Add Activities to the Canvas**
│ │ ├── **Data Movement:** Copy Data Activity
│ │ ├── **Data Transformation:** Mapping Data Flow, Stored Procedure, Azure
Function, Databricks Notebook, etc.
│ │ ├── **Control Flow:** ForEach, If Condition, Execute Pipeline, Wait, Web,
Lookup, etc.
│ ├── **5.3. Configure Each Activity**
│ │ ├── Link to source/sink Datasets (for Copy Data)
│ │ ├── Configure settings specific to the activity type (e.g., query, script,
mapping)
│ │ ├── Set timeouts, retry policies
│ │ └── Map inputs/outputs
│ ├── **5.4. Connect Activities (Define Dependencies)**
│ │ ├── On Success (Green arrow)
│ │ ├── On Failure (Red arrow)
│ │ ├── On Completion (Blue arrow)
│ │ └── On Skipped (Grey arrow)
│ ├── **5.5. Add Pipeline Parameters (for reusability)**
│ │ └── Define parameters that can be passed at runtime.
│ └── **5.6. Add Pipeline Variables (for internal state)**
│ └── Define variables for temporary storage within the pipeline run.
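
A sketch of step 5: a pipeline with a Copy activity wired to the source/sink datasets, a runtime parameter, and a dependent activity illustrating the "On Success" dependency from 5.4. Dataset and activity names are the hypothetical ones from the earlier sketches.

```python
from azure.mgmt.datafactory.models import (
    ActivityDependency,
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    ParameterSpecification,
    PipelineResource,
    WaitActivity,
)

copy_step = CopyActivity(
    name="CopyRawToCurated",
    inputs=[DatasetReference(reference_name="ds_source_csv", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="ds_sink_csv", type="DatasetReference")],
    source=BlobSource(),
    sink=BlobSink(),
)

# A trivial follow-up step that only runs if the copy succeeds ("green arrow").
post_step = WaitActivity(
    name="PostCopyWait",
    wait_time_in_seconds=5,
    depends_on=[ActivityDependency(activity="CopyRawToCurated",
                                   dependency_conditions=["Succeeded"])],
)

pipeline = PipelineResource(
    activities=[copy_step, post_step],
    parameters={"run_date": ParameterSpecification(type="String")},
)
adf_client.pipelines.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "pl_copy_raw", pipeline
)
```
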
│
├── **6. Test & Debug the Pipeline**
│ ├── **6.1. Validate Pipeline**
│ │ └── Check for configuration errors.
│ ├── **6.2. Debug Run**
│ │ ├── Execute the pipeline manually with specific parameter values.
│ │ ├── Inspect activity inputs, outputs, and errors.
│ │ └── Use breakpoints in Data Flows.
│ ├── **6.3. Iterate & Refine**
│ └── Fix issues, optimize performance.
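
A sketch of step 6 in SDK terms. Note that this is closer to "Trigger Now" against the published factory than to an ADF Studio debug run (which executes against the authoring session); the pipeline name and parameter value are the hypothetical ones used above.

```python
import time

# Kick off an on-demand run, passing a value for the pipeline parameter.
run = adf_client.pipelines.create_run(
    RESOURCE_GROUP, FACTORY_NAME, "pl_copy_raw",
    parameters={"run_date": "2024-01-01"},
)

# Poll until the run reaches a terminal state.
while True:
    pipeline_run = adf_client.pipeline_runs.get(
        RESOURCE_GROUP, FACTORY_NAME, run.run_id
    )
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)

print(pipeline_run.status, pipeline_run.message)
```
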
│
├── **7. Create Triggers (Scheduling & Automation)**
│ ├── **7.1. Choose Trigger Type**
│ │ ├── Schedule Trigger (wall-clock)
│ │ ├── Tumbling Window Trigger (periodic, maintains state)
│ │ ├── Storage Event Trigger (blob created/deleted)
│ │ └── Custom Event Trigger (Event Grid)
│ ├── **7.2. Configure Trigger**
│ │ ├── Set schedule, recurrence, event details.
│ │ └── Map pipeline parameters to trigger parameters (if any).
│ ├── **7.3. Associate Trigger with Pipeline**
│ └── **7.4. Activate Trigger**
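
A sketch of step 7 for a daily Schedule Trigger attached to the pipeline above. Depending on the SDK version, the start call is `triggers.begin_start` or `triggers.start`; times and names are placeholders.

```python
from datetime import datetime, timezone
from azure.mgmt.datafactory.models import (
    PipelineReference,
    ScheduleTrigger,
    ScheduleTriggerRecurrence,
    TriggerPipelineReference,
    TriggerResource,
)

recurrence = ScheduleTriggerRecurrence(
    frequency="Day",
    interval=1,
    start_time=datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
    time_zone="UTC",
)

trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=recurrence,
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                reference_name="pl_copy_raw", type="PipelineReference"
            ),
            # Map a trigger system variable onto the pipeline parameter.
            parameters={"run_date": "@trigger().scheduledTime"},
        )],
    )
)
adf_client.triggers.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "tr_daily_2am", trigger
)
# Triggers are created stopped; start one to activate it.
adf_client.triggers.begin_start(RESOURCE_GROUP, FACTORY_NAME, "tr_daily_2am").result()
```
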
│
├── **8. Publish Changes**
│ ├── **8.1. Save all ADF artifacts (if not using Git auto-save)**
│ ├── **8.2. (If using Git) Commit & Push changes to your repository**
│ ├── **8.3. Publish from ADF Studio**
│ └── This deploys your configuration from the collaboration branch (e.g., `main`) to the live Data Factory service and generates ARM templates into the publish branch (e.g., `adf_publish`).
│ └── **8.4. (Optional) Use CI/CD Pipelines**
│ └── Azure DevOps / GitHub Actions for automated deployment of ARM
templates.
│
└── **9. Monitor & Manage**
├── **9.1. Monitor Pipeline Runs**
│ └── Check status (Succeeded, Failed, In Progress).
├── **9.2. Monitor Activity Runs**
│ └── Drill down into individual activity details.
├── **9.3. Monitor Trigger Runs**
├── **9.4. Set up Alerts (via Azure Monitor)**
│ └── For failures or long-running pipelines.
└── **9.5. Review Logs**
└── For troubleshooting and auditing.
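
A sketch of step 9: querying the last 24 hours of pipeline runs programmatically, which returns the same information ADF Studio's Monitor tab shows (client and names are the hypothetical ones from earlier).

```python
from datetime import datetime, timedelta, timezone
from azure.mgmt.datafactory.models import RunFilterParameters

window = RunFilterParameters(
    last_updated_after=datetime.now(timezone.utc) - timedelta(days=1),
    last_updated_before=datetime.now(timezone.utc),
)
runs = adf_client.pipeline_runs.query_by_factory(
    RESOURCE_GROUP, FACTORY_NAME, window
)
for r in runs.value:
    print(r.pipeline_name, r.status, r.run_start, r.run_end)
```
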
#################################################################################
Azure Data Factory (ADF) - Cloud-based ETL & Data Integration Service
└─── 1. Core Concepts & Components
│
├─── 1.1. Pipeline
│ │ └── Definition: Logical grouping of activities that perform a unit of
work.
│ │ └── Properties: Parameters, Variables, Annotations, Concurrency, Run
Dimensions
│ │ └── Orchestration: Defines sequence, parallelism, conditional
execution.
│
├─── 1.2. Activity
│ │ └── Definition: Individual processing step in a pipeline.
│ │ └── Types:
│ │ ├── 1.2.1. Data Movement Activities
│ │ │ └── Copy Data Activity
│ │ │ ├── Sources (Connectors: ~100+, e.g., Blob, SQL DB, On-
prem, SaaS)
│ │ │ ├── Sinks (Connectors: ~100+, e.g., Data Lake, Synapse, SQL
DB)
│ │ │ ├── Data Consistency Verification
│ │ │ ├── Fault Tolerance (Skip incompatible rows)
│ │ │ ├── Performance Tuning (DIUs, Parallel Copies)
│ │ │ ├── Staging (e.g., PolyBase, Interim Blob)
│ │ │ └── Schema Mapping (Explicit, Dynamic)
│ │ │
│ │ ├── 1.2.2. Data Transformation Activities
│ │ │ ├── Mapping Data Flows
│ │ │ │ ├── Visual Data Transformation (Code-free / Low-code)
│ │ │ │ ├── Scalable execution on Spark clusters (managed by ADF)
│ │ │ │ ├── Sources & Sinks (within Data Flow)
│ │ │ │ ├── Transformations (Join, Union, Aggregate, Derived
Column, Lookup, Filter, Sort, etc.)
│ │ │ │ ├── Debug Mode (Interactive data preview & debugging)
│ │ │ │ ├── Parameterization
│ │ │ │ └── Data Flow Script (underlying JSON definition)
│ │ │ │
│ │ │ ├── External / Code-Based Activities
│ │ │ │ ├── Azure Function Activity
│ │ │ │ ├── Azure Batch Activity (Custom .NET)
│ │ │ │ ├── Azure Databricks Activity (Notebook, Jar, Python)
│ │ │ │ ├── Azure Synapse Notebook / Spark Job Definition
│ │ │ │ ├── HDInsight Activity (Hive, Pig, MapReduce, Spark,
Streaming)
│ │ │ │ ├── Stored Procedure Activity (SQL Server, Azure SQL,
Synapse)
│ │ │ │ ├── U-SQL Activity (Azure Data Lake Analytics)
│ │ │ │ └── Machine Learning Activities (Execute Batch, Update
Resource)
│ │ │
│ │ └── 1.2.3. Control Flow Activities
│ │ ├── Execute Pipeline Activity (Invoke another ADF pipeline)
│ │ ├── ForEach Activity (Looping)
│ │ ├── If Condition Activity
│ │ ├── Switch Activity
│ │ ├── Until Activity (Do-while loop)
│ │ ├── Wait Activity
│ │ ├── Web Activity (Call REST API)
│ │ ├── Webhook Activity (Wait for external callback)
│ │ ├── Lookup Activity (Retrieve data for decisions)
│ │ ├── Get Metadata Activity (File/table properties, exists)
│ │ ├── Set Variable Activity
│ │ ├── Append Variable Activity
│ │ └── Fail Activity (Intentionally fail pipeline)
│
├─── 1.3. Datasets
│ │ └── Definition: Named view of data; points to or references the data
you want to use.
│ │ └── Properties: Schema, Parameters, Connection (via Linked Service)
│ │ └── Types: File-based (CSV, Parquet, JSON), Database tables, NoSQL
collections, etc.
│
├─── 1.4. Linked Services
│ │ └── Definition: Connection strings to external resources.
│ │ └── Stores connection info: Credentials, Server names, Database names,
etc.
│ │ └── Manages connectivity to data stores & compute services.
│ │ └── Parameterization (for dynamic connections)
│
├─── 1.5. Integration Runtimes (IR)
│ │ └── Definition: Compute infrastructure used by ADF.
│ │ └── Types:
│ │ ├── 1.5.1. Azure IR
│ │ │ └── Fully managed, serverless compute in Azure.
│ │ │ └── Used for cloud-to-cloud data movement & data flow
execution.
│ │ │ └── Region specific or Auto-resolve.
│ │ │ └── Managed Virtual Network (for secure data access)
│ │ │
│ │ ├── 1.5.2. Self-Hosted IR (SHIR)
│ │ │ └── Installed on-premises or in a private virtual network.
│ │ │ └── Enables access to private network resources.
│ │ │ └── Used for data movement to/from private networks.
│ │ │ └── Can dispatch activities to on-prem compute (e.g., SQL
Server).
│ │ │ └── High Availability & Scalability (multi-node)
│ │ │ └── Credential Management (local or Key Vault)
│ │ │
│ │ └── 1.5.3. Azure-SSIS IR
│ │ └── Dedicated Azure compute for lifting & shifting SSIS
packages.
│ │ └── Provisions Azure VMs that run the SSIS engine; SSISDB itself is hosted in Azure SQL DB/MI.
│ │ └── VNet Integration (for private network access).
│ │ └── Custom Setups (e.g., installing additional components).
│ │ └── Express Custom Setup
│
├─── 1.6. Triggers
│ │ └── Definition: Unit of processing that determines when a pipeline run
needs to be kicked off.
│ │ └── Types:
│ │ ├── 1.6.1. Schedule Trigger (Wall-clock schedule)
│ │ ├── 1.6.2. Tumbling Window Trigger (Fixed-size, non-overlapping,
contiguous time intervals)
│ │ │ └── Supports dependencies on other Tumbling Window Triggers.
│ │ │ └── Backfilling / Reruns for specific windows.
│ │ ├── 1.6.3. Storage Event Trigger (Reacts to Blob created/deleted in
Azure Storage)
│ │ ├── 1.6.4. Custom Event Trigger (Integrates with Azure Event Grid
for broader event sources)
│ │ └── 1.6.5. Manual Trigger (On-demand execution)
│
├─── 1.7. Parameters & Variables
│ │ └── Parameters: Defined at Pipeline, Dataset, Linked Service level for
reusability & dynamic behavior. Passed in at execution time.
│ │ └── Variables: Used within pipelines for temporary value storage during
execution (Set Variable, Append Variable).
│
└─── 1.8. Expressions & Functions
│ └── Used for dynamic content, conditions, data manipulation within
pipelines.
│ └── System variables, user-defined functions (in Data Flows), string,
collection, logical, conversion, math, date functions.
└─── Global Parameters
└─── Constants defined at Data Factory level, usable across all
pipelines.
└─── 4. Security
│
├─── 4.1. Authentication & Authorization
│ │ ├── Managed Identities for Azure Resources (System-assigned, User-
assigned)
│ │ ├── Service Principals
│ │ ├── Azure Key Vault Integration (for storing secrets, connection
strings)
│ │ └── Role-Based Access Control (RBAC) (Data Factory Contributor, Reader,
User Access Admin)
│
├─── 4.2. Network Security
│ │ ├── Managed Virtual Network & Managed Private Endpoints (for Azure PaaS
sources with Azure IR)
│ │ ├── Private Link for Azure Data Factory (Access ADF Studio/service
privately)
│ │ ├── Self-Hosted IR in VNet/On-prem (for private data sources)
│ │ └── Firewall rules on data stores
│
├─── 4.3. Data Encryption
│ │ ├── In Transit (HTTPS/TLS)
│ │ ├── At Rest (Azure Storage Service Encryption, Transparent Data
Encryption for SQL)
│ │ └── Customer-Managed Keys (CMK) for encrypting ADF environment.
│
└─── 4.4. Data Masking / Obfuscation (within Transformations)
###################################################################################
Azure Data Factory (ADF) - Cloud-based ETL & Data Integration Service
Detailed Description: Azure Data Factory is a fully managed, serverless data
integration service provided by Microsoft Azure. It's designed to compose,
orchestrate, schedule, and monitor data-driven workflows (pipelines) that can
ingest data from disparate sources, transform it, and publish it to various
destinations. ADF is primarily used for building Extract, Transform, Load (ETL),
Extract, Load, Transform (ELT), and data integration solutions at scale.
1. Core Concepts & Components
│
├─── **1.1. Pipeline**
│ │ └── **Definition:** A pipeline is a logical grouping of activities that
together perform a unit of work. It's the top-level orchestrator in ADF. For
example, a pipeline might ingest data from a Blob storage, transform it using a
Data Flow, and then load it into Azure Synapse Analytics.
│ │ └── **Properties:**
│ │ ├── **Parameters:** Input values passed to a pipeline at runtime,
making pipelines reusable and dynamic.
│ │ ├── **Variables:** Temporary values used within a single pipeline run
to store and manipulate data during execution.
│ │ ├── **Annotations:** User-defined tags or notes for better organization
and understanding.
│ │ ├── **Concurrency:** Controls how many simultaneous runs of the same
pipeline are allowed.
│ │ └── **Run Dimensions:** User-defined key-value pairs to categorize and
filter pipeline runs.
│ │ └── **Orchestration:** Pipelines define the control flow for activities,
including sequential execution, parallel execution (activities running
simultaneously), and conditional execution (activities running based on the outcome
of previous ones or specific conditions).
│
├─── **1.2. Activity**
│ │ └── **Definition:** An activity represents a single processing step within
a pipeline. Each activity performs a specific action.
│ │ └── **Types:**
│ │ ├── **1.2.1. Data Movement Activities**
│ │ │ └── **Copy Data Activity:**
│ │ │ ├── **Description:** The primary activity for data movement. It
copies data from a source data store to a sink data store.
│ │ │ ├── **Sources & Sinks:** Supports a vast array of connectors
(~100+) for both sources and sinks, including on-premises systems (via Self-Hosted
IR), Azure services (Blob, SQL DB, DataLake, Synapse), and third-party SaaS
applications.
│ │ │ ├── **Data Consistency Verification:** Options to ensure data
integrity during transfer, like checking MD5 hashes.
│ │ │ ├── **Fault Tolerance:** Ability to skip incompatible rows
during copy and log them, allowing the main job to continue.
│ │ │ ├── **Performance Tuning:** Scalability through Data
Integration Units (DIUs) and parallel copy settings.
│ │ │ ├── **Staging:** Can use an interim storage (like Blob) for
optimized loading into certain sinks (e.g., PolyBase for Synapse, or when direct
copy isn't optimal).
│ │ │ └── **Schema Mapping:** Allows explicit mapping of source
columns to sink columns, or dynamic schema handling.
│ │ │
│ │ ├── **1.2.2. Data Transformation Activities**
│ │ │ ├── **Mapping Data Flows:**
│ │ │ │ ├── **Description:** Visually design and build code-free or
low-code data transformations.
│ │ │ │ ├── **Scalable execution:** Transformations are executed on
managed, auto-scaling Apache Spark clusters provisioned by ADF.
│ │ │ │ ├── **Sources & Sinks (within Data Flow):** Specific connectors
optimized for data flow processing.
│ │ │ │ ├── **Transformations:** Rich set of transformations like Join,
Union, Aggregate, Derived Column, Lookup, Filter, Sort, Pivot, Unpivot, Window,
etc.
│ │ │ │ ├── **Debug Mode:** Interactive environment to build, test, and
preview data transformations with live data samples.
│ │ │ │ ├── **Parameterization:** Allows dynamic behavior in data flows
(e.g., dynamic source/sink paths, filter conditions).
│ │ │ │ └── **Data Flow Script:** The underlying JSON definition of the
visual data flow, useful for advanced editing or version control.
│ │ │ │
│ │ │ ├── **External / Code-Based Activities:** Activities that execute
transformations on external compute services.
│ │ │ │ ├── **Azure Function Activity:** Executes an Azure Function
(serverless code).
│ │ │ │ ├── **Azure Batch Activity:** Runs custom .NET code on Azure
Batch pools for large-scale parallel processing.
│ │ │ │ ├── **Azure Databricks Activity:** Executes Databricks
notebooks, JARs, or Python scripts on Databricks clusters.
│ │ │ │ ├── **Azure Synapse Notebook / Spark Job Definition:** Executes
Synapse notebooks or Spark jobs within an Azure Synapse Analytics workspace.
│ │ │ │ ├── **HDInsight Activity:** Executes Hive, Pig, MapReduce, or
Spark jobs on HDInsight clusters.
│ │ │ │ ├── **Stored Procedure Activity:** Executes stored procedures
in SQL Server, Azure SQL Database, or Azure Synapse Analytics.
│ │ │ │ ├── **U-SQL Activity (Azure Data Lake Analytics):** Executes U-
SQL scripts on Azure Data Lake Analytics (legacy service).
│ │ │ │ └── **Machine Learning Activities:**
│ │ │ │ └── **Azure ML Execute Batch:** Runs Azure Machine Learning
batch scoring pipelines.
│ │ │ │ └── **Azure ML Update Resource:** Updates Azure Machine
Learning model resources.
│ │ │
│ │ └── **1.2.3. Control Flow Activities:** Activities that manage the
execution flow of a pipeline.
│ │ ├── **Execute Pipeline Activity:** Invokes another ADF pipeline,
enabling modular design.
│ │ ├── **ForEach Activity:** Iterates over a collection of items and
executes specified activities for each item (in sequence or parallel).
│ │ ├── **If Condition Activity:** Executes one set of activities if a
condition is true, and optionally another set if false.
│ │ ├── **Switch Activity:** Similar to a switch-case statement in
programming; executes a specific set of activities based on matching an
expression's value.
│ │ ├── **Until Activity:** Executes a set of activities repeatedly
until a specified condition becomes true (like a do-while loop).
│ │ ├── **Wait Activity:** Pauses pipeline execution for a specified
period.
│ │ ├── **Web Activity:** Calls a custom REST API endpoint.
│ │ ├── **Webhook Activity:** Pauses pipeline execution and waits for
an external service to call back a provided URL before resuming.
│ │ ├── **Lookup Activity:** Retrieves data (a single row or a small
set of rows) from any ADF-supported data source, often used to make decisions in
subsequent activities.
│ │ ├── **Get Metadata Activity:** Retrieves metadata about a data
source (e.g., file existence, size, last modified date, table schema).
│ │ ├── **Set Variable Activity:** Assigns a value to a pipeline
variable.
│ │ ├── **Append Variable Activity:** Appends a value to an array-type
pipeline variable.
│ │ └── **Fail Activity:** Intentionally causes a pipeline run to fail,
often used in error handling paths.
│
├─── **1.3. Datasets**
│ │ └── **Definition:** A dataset is a named view of data that simply points to
or references the data you want to use as an input or output in your activities. It
doesn't store the actual data but defines the structure and location.
│ │ └── **Properties:** Defines the data's **schema** (structure), can be
**parameterized** for dynamic referencing (e.g., dynamic file paths), and specifies
the **connection** to the data store via a Linked Service.
│ │ └── **Types:** Represents various data structures like files (CSV, Parquet,
JSON, Avro), database tables, NoSQL collections, API responses, etc.
│
├─── **1.4. Linked Services**
│ │ └── **Definition:** Linked services are like connection strings. They
define the connection information needed for Data Factory to connect to external
resources (data stores or compute services).
│ │ └── **Stores connection info:** Includes credentials (which can be stored
in Azure Key Vault), server names, database names, file paths, API keys, etc.
│ │ └── **Manages connectivity:** A single linked service can be used by
multiple datasets or activities if they connect to the same resource.
│ │ └── **Parameterization:** Allows for dynamic connection strings, useful for
different environments (Dev, Test, Prod).
│
├─── **1.5. Integration Runtimes (IR)**
│ │ └── **Definition:** The Integration Runtime (IR) is the compute
infrastructure used by Azure Data Factory to provide data integration capabilities
across different network environments. It bridges the gap between
activities/datasets and linked services.
│ │ └── **Types:**
│ │ ├── **1.5.1. Azure IR**
│ │ │ └── **Description:** A fully managed, serverless compute in Azure.
ADF automatically manages, scales, and patches this IR.
│ │ │ └── **Usage:** Primarily used for cloud-to-cloud data movement
(Copy activity) and for executing Mapping Data Flows.
│ │ │ └── **Configuration:** Can be region-specific or set to "Auto-
resolve" to use the region of the data sink.
│ │ │ └── **Managed Virtual Network (VNet):** Can be provisioned within
an ADF-managed VNet for secure data access to Azure PaaS services using private
endpoints.
│ │ │
│ │ ├── **1.5.2. Self-Hosted IR (SHIR)**
│ │ │ └── **Description:** An agent that you install and manage on your
on-premises machines or in a private virtual network (Azure VNet or other cloud
provider's VNet).
│ │ │ └── **Usage:** Enables ADF to access data sources and dispatch
activities within a private network (e.g., on-premises SQL Server, file shares).
│ │ │ └── **Data Movement:** Facilitates data copy to/from private
networks.
│ │ │ └── **Activity Dispatch:** Can dispatch external activities (like
Stored Procedure) to compute resources within the private network.
│ │ │ └── **High Availability & Scalability:** Can be scaled out by
adding multiple nodes to the SHIR logical group.
│ │ │ └── **Credential Management:** Credentials for on-premises data
stores can be stored locally on the SHIR machine or referenced from Azure Key
Vault.
│ │ │
│ │ └── **1.5.3. Azure-SSIS IR**
│ │ └── **Description:** A dedicated Azure compute infrastructure
specifically designed to lift and shift existing SQL Server Integration Services
(SSIS) packages to the cloud.
│ │ └── **Functionality:** Provisions Azure VMs to host an SSIS engine.
You need an Azure SQL Database or Managed Instance to host the SSIS catalog
(SSISDB).
│ │ └── **VNet Integration:** Can be joined to an Azure Virtual
Network, allowing it to access on-premises data sources via VPN or ExpressRoute, or
other Azure resources within the VNet.
│ │ └── **Custom Setups:** Allows installation of additional
components, drivers, or custom tasks required by SSIS packages.
│ │ └── **Express Custom Setup:** A simplified way to run common custom
setup scripts.
│
├─── **1.6. Triggers**
│ │ └── **Definition:** A trigger is a unit of processing that determines when
a pipeline run needs to be initiated. It defines the "when" for pipeline execution.
│ │ └── **Types:**
│ │ ├── **1.6.1. Schedule Trigger:** Initiates pipeline runs based on a
recurring wall-clock schedule (e.g., daily at 2 AM, every 15 minutes).
│ │ ├── **1.6.2. Tumbling Window Trigger:** Operates on fixed-size, non-
overlapping, contiguous time intervals (windows). Useful for processing time-sliced
data (e.g., hourly data chunks).
│ │ │ └── **Dependencies:** Can depend on other Tumbling Window Triggers,
ensuring data processing order.
│ │ │ └── **Backfilling/Reruns:** Supports reprocessing data for specific
past time windows.
│ │ ├── **1.6.3. Storage Event Trigger:** Initiates pipeline runs in
response to storage events, specifically when a blob is created or deleted in Azure
Blob Storage or Azure Data Lake Storage Gen2.
│ │ ├── **1.6.4. Custom Event Trigger:** Integrates with Azure Event Grid,
allowing pipelines to be triggered by a wide variety of custom events from Azure
services or custom applications.
│ │ └── **1.6.5. Manual Trigger (Debug/Trigger Now):** Allows for on-demand
execution of pipelines directly from the ADF Studio or via SDK/API.
│
├─── **1.7. Parameters & Variables**
│ │ └── **Parameters:** Defined at the Pipeline, Dataset, Linked Service, or
Data Flow level. They are strongly typed values that are passed into these
artifacts when they are invoked or used, promoting reusability and dynamic
behavior. For example, a pipeline parameter could specify a source folder path.
│ │ └── **Variables:** Defined at the pipeline level. They are used to store
temporary values during a single pipeline run. Their values can be set and modified
using the "Set Variable" and "Append Variable" activities. Unlike parameters,
variables are not passed in from outside but are managed internally within the
pipeline run.
│
└─── **1.8. Expressions & Functions**
│ └── **Description:** ADF provides a rich expression language with built-in
functions. These are used to create dynamic content, evaluate conditions, and
manipulate data within pipeline definitions (e.g., in activity properties,
conditional paths, variable assignments).
│ └── **Examples:** Includes system variables (e.g., `@pipeline().RunId`),
user-defined functions (within Mapping Data Flows), and functions for string
manipulation, collection handling, logical operations, data type conversions,
mathematical calculations, and date/time operations.
└─── **Global Parameters**
└─── **Description:** Constants defined at the Data Factory level (in the
"Manage" hub). These values can be referenced in any pipeline expression within
that Data Factory instance, making it easy to manage common values (like
environment tags, shared paths) across multiple pipelines without parameterizing
each one individually.
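
To make 1.7 and 1.8 concrete, here is a hedged sketch of a pipeline that combines a parameter, a variable, and an ADF expression. The `@concat(...)` string is ADF's own expression language passed as dynamic content; all names are hypothetical and the client is the one from the earlier setup sketch.

```python
from azure.mgmt.datafactory.models import (
    ParameterSpecification,
    PipelineResource,
    SetVariableActivity,
    VariableSpecification,
)

pipeline = PipelineResource(
    parameters={"base_folder": ParameterSpecification(type="String",
                                                      default_value="curated")},
    variables={"output_path": VariableSpecification(type="String")},
    activities=[SetVariableActivity(
        name="BuildOutputPath",
        variable_name="output_path",
        # Dynamic content: parameter + date functions, evaluated at run time.
        value={
            "value": "@concat(pipeline().parameters.base_folder, '/', "
                     "formatDateTime(utcnow(), 'yyyy/MM/dd'))",
            "type": "Expression",
        },
    )],
)
adf_client.pipelines.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "pl_expression_demo", pipeline
)
```
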
2. Development & Authoring
│
├─── **2.1. Azure Data Factory Studio (Web UI)**
│ │ └── **Description:** The primary web-based graphical user interface for
developing, managing, and monitoring ADF resources.
│ │ ├── **Visual authoring:** Provides a drag-and-drop interface for building
pipelines and mapping data flows.
│ │ ├── **JSON editing:** Allows direct editing of the underlying JSON
definitions for all ADF artifacts.
│ │ ├── **Pipeline canvas:** The workspace where activities are arranged and
connected to define a pipeline.
│ │ ├── **Data Flow designer:** A visual tool for creating complex data
transformations without writing code.
│ │ ├── **Monitoring views:** Integrated dashboards to track pipeline,
activity, and trigger runs.
│ │ └── **Management hub:** Area for managing linked services, integration
runtimes, source control, global parameters, and security settings.
│
├─── **2.2. Azure PowerShell / CLI**
│ │ └── **Description:** Command-line tools for automating the creation,
deployment, management, and monitoring of ADF resources. Ideal for scripting
repetitive tasks and integrating ADF management into broader automation workflows.
│
├─── **2.3. SDKs (.NET, Python, Java, JavaScript)**
│ │ └── **Description:** Software Development Kits that allow developers to
interact with ADF programmatically. Useful for building custom applications that
manage or trigger ADF pipelines, or for embedding ADF functionalities within other
systems.
│
├─── **2.4. ARM Templates (Azure Resource Manager Templates)**
│ │ └── **Description:** JSON files that define the infrastructure and
configuration for your Azure resources, including ADF.
│ │ └── **Usage:** Enables Infrastructure as Code (IaC) practices for
consistent and repeatable deployment of ADF instances and their components across
different environments (e.g., Dev, Test, Prod). ADF Studio provides an option to
export the entire factory or individual resources as an ARM template.
│
└─── **2.5. Source Control Integration**
│ └── **Description:** ADF integrates directly with Git-based source control
repositories.
│ ├── **Supported Repositories:** Azure DevOps Git and GitHub.
│ └── **Workflow:** Developers typically work on feature branches and merge changes into a collaboration branch (e.g., `main` or `develop`); publishing from the collaboration branch generates ARM templates into a designated publish branch (e.g., `adf_publish`) that is then used for deployment. This enables version control, collaboration, and CI/CD.
└─── **CI/CD (Continuous Integration / Continuous Deployment)**
└─── **Description:** Practices for automating the build, test, and
deployment of ADF solutions.
└─── **Tools:** Typically implemented using Azure Pipelines (part of Azure
DevOps) or GitHub Actions.
└─── **Process:** Involves taking the ARM templates generated from the
publish branch and deploying them to target ADF instances, often with environment-
specific parameter overrides.
3. Monitoring & Management
│
├─── **3.1. ADF Studio Monitoring View**
│ │ └── **Description:** The built-in monitoring interface within ADF Studio
providing a user-friendly way to track the status and details of various ADF
operations.
│ │ ├── **Pipeline Runs:** Lists all pipeline executions with their status
(Succeeded, Failed, In Progress), duration, trigger type, parameters used, and
links to view activity runs.
│ │ ├── **Activity Runs:** Shows details for each activity executed within a
pipeline run, including input, output, errors (if any), and duration.
│ │ ├── **Data Flow Debug Sessions:** Monitors active data flow debug sessions,
showing cluster status and data previews.
│ │ ├── **Trigger Runs:** Lists all trigger executions and their status.
│ │ └── **Alerts & Metrics:** Provides a summary and links to Azure Monitor for
more detailed metrics and configured alerts.
│
├─── **3.2. Azure Monitor Integration**
│ │ └── **Description:** ADF integrates deeply with Azure Monitor, Azure's
centralized monitoring service, for comprehensive logging, metrics, and alerting.
│ │ ├── **Metrics:** Collects various performance and operational metrics
(e.g., successful/failed activity runs, pipeline run counts, IR CPU/memory
utilization, Data Flow execution times).
│ │ ├── **Diagnostic Logs:** Captures detailed logs for pipeline, activity, and
trigger runs, as well as IR health and operations.
│ │ │ └── **Destinations:** Logs can be sent to Log Analytics (for querying
with KQL), an Azure Storage Account (for archiving), or Azure Event Hubs (for real-
time streaming to other systems).
│ │ ├── **Alerts:** Allows configuration of alerts based on metric thresholds
or log query results (e.g., alert if a pipeline fails, or if IR CPU is too high).
│ │ └── **Workbooks:** Enables creation of custom, interactive dashboards in
Azure Monitor using ADF metrics and logs for tailored visualizations.
│
├─── **3.3. Programmatic Monitoring**
│ │ └── **Description:** ADF resources and run statuses can be monitored
programmatically using the Azure SDKs (e.g., .NET, Python), REST APIs, or Azure
PowerShell/CLI cmdlets. This is useful for custom monitoring solutions or
integrating ADF status into external dashboards.
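
A short sketch of programmatic monitoring: drilling into the activity runs of a single pipeline run. The run ID would come from an earlier `create_run` call or a pipeline-runs query, and the client is the hypothetical one from the setup sketch near the top of these notes.

```python
from datetime import datetime, timedelta, timezone
from azure.mgmt.datafactory.models import RunFilterParameters

run_id = "<pipeline-run-id>"   # placeholder: taken from create_run or a runs query

window = RunFilterParameters(
    last_updated_after=datetime.now(timezone.utc) - timedelta(hours=6),
    last_updated_before=datetime.now(timezone.utc),
)
activity_runs = adf_client.activity_runs.query_by_pipeline_run(
    RESOURCE_GROUP, FACTORY_NAME, run_id, window
)
for ar in activity_runs.value:
    print(ar.activity_name, ar.status, ar.duration_in_ms, ar.error)
```
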
│
└─── **3.4. Lineage**
│ └── **Description:** Refers to tracking the origin, movement,
transformation, and destination of data.
│ └── **Integration:** Azure Data Factory integrates with **Azure Purview**
(Azure's unified data governance service). When ADF pipelines run, they can report
lineage information to Purview, allowing users to visualize data flow from source
to destination across various systems and transformations, enhancing data discovery
and impact analysis.
4. Security
│
├─── **4.1. Authentication & Authorization**
│ │ ├── **Managed Identities for Azure Resources:** Recommended way for ADF to
authenticate to other Azure services that support Azure AD authentication (e.g.,
Azure Blob Storage, Azure SQL DB, Azure Key Vault). Can be System-assigned (tied to
the ADF instance's lifecycle) or User-assigned (standalone Azure resource
assignable to multiple services).
│ │ ├── **Service Principals:** An Azure AD application identity that can be
granted permissions to resources. Used when Managed Identities are not applicable
or for specific scenarios.
│ │ ├── **Azure Key Vault Integration:** ADF can retrieve secrets (like
connection strings, passwords, keys) stored securely in Azure Key Vault at runtime,
instead of storing them directly in ADF linked service definitions.
│ │ └── **Role-Based Access Control (RBAC):** Azure RBAC is used to control who
has what permissions to the ADF service itself (e.g., Data Factory Contributor to
create/edit, Reader to view). This also applies to data stores ADF accesses, where
ADF's identity (Managed Identity/Service Principal) needs appropriate permissions
(e.g., Storage Blob Data Contributor).
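
A hedged sketch of 4.1 in practice: a Key Vault linked service (authenticated with the factory's managed identity, which needs secret-read access on the vault) plus an Azure SQL linked service whose connection string is resolved from Key Vault at runtime. The vault URL and secret name are hypothetical.

```python
from azure.mgmt.datafactory.models import (
    AzureKeyVaultLinkedService,
    AzureKeyVaultSecretReference,
    AzureSqlDatabaseLinkedService,
    LinkedServiceReference,
    LinkedServiceResource,
)

# 1) Linked service to Key Vault itself (uses ADF's managed identity by default).
kv_ls = LinkedServiceResource(
    properties=AzureKeyVaultLinkedService(base_url="https://my-kv.vault.azure.net/")
)
adf_client.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "ls_keyvault", kv_ls
)

# 2) SQL linked service whose secret lives in Key Vault, not in ADF.
sql_ls = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(
        connection_string=AzureKeyVaultSecretReference(
            store=LinkedServiceReference(reference_name="ls_keyvault",
                                         type="LinkedServiceReference"),
            secret_name="sql-connection-string",
        )
    )
)
adf_client.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "ls_sql_db", sql_ls
)
```
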
│
├─── **4.2. Network Security**
│ │ ├── **Managed Virtual Network & Managed Private Endpoints:** When using
Azure IR, ADF can provision it within a managed VNet. ADF can then create managed
private endpoints within this VNet to connect securely to Azure PaaS data stores
(e.g., Azure SQL, Storage) over a private link, avoiding public internet exposure.
│ │ ├── **Private Link for Azure Data Factory:** Allows you to access the ADF
service itself (Studio, API endpoints) via a private endpoint in your own VNet,
ensuring that all traffic to and from ADF stays within the Azure backbone or your
private network.
│ │ ├── **Self-Hosted IR in VNet/On-prem:** The SHIR is installed within your
private network (on-premises or cloud VNet), allowing ADF to securely access data
sources that are not publicly accessible.
│ │ └── **Firewall rules on data stores:** Configuring firewall rules on data
stores (e.g., Azure SQL, Storage) to allow access only from specific IP addresses
or VNet service endpoints associated with ADF's IRs.
│
├─── **4.3. Data Encryption**
│ │ ├── **In Transit:** All connections from ADF to external data stores and
compute services are typically encrypted using HTTPS/TLS by default.
│ │ ├── **At Rest:** Data stored by ADF (like pipeline definitions) is
encrypted by Azure. Data in target data stores is usually encrypted by the
respective service's mechanisms (e.g., Azure Storage Service Encryption,
Transparent Data Encryption for SQL).
│ │ └── **Customer-Managed Keys (CMK):** For enhanced control, you can encrypt
the Data Factory environment (entities like pipelines, datasets, linked services)
using your own encryption key stored in Azure Key Vault.
│
└─── **4.4. Data Masking / Obfuscation**
│ └── **Description:** While not a dedicated data masking service, ADF's
Mapping Data Flows can be used to implement data masking or obfuscation logic
within transformations (e.g., using Derived Column to replace sensitive characters
or hash values) before writing data to a sink. For more advanced masking, dedicated
database features or other tools might be used.
5. Performance & Optimization
│
├─── **5.1. Copy Activity**
│ │ ├── **Data Integration Units (DIUs):** A measure of power (CPU, memory,
network resource allocation) for the Copy activity on Azure IR. More DIUs (2-256)
provide more throughput. Default is often 'Auto' (4 DIUs).
│ │ ├── **Parallel Copies:**
│ │ └── **`parallelCopies` property:** Controls the number of parallel
threads within a single Copy activity run that read from the source or write to the
sink.
│ │ └── **Multiple activities:** Running multiple Copy activities in
parallel within a ForEach loop or as separate branches in a pipeline.
│ │ ├── **Staging Copy:** Using an interim storage (like Blob) to optimize data
loading into certain sinks (e.g., using PolyBase or COPY statement for Azure
Synapse Analytics, or when direct copy has limitations).
│ │ └── **Compression:** Using compression (e.g., GZip, BZip2) for supported
file formats can reduce data volume transferred and improve I/O performance, at the
cost of some CPU for compression/decompression.
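
A sketch of how the Copy activity knobs in 5.1 surface in the SDK: explicit DIUs, parallel copies, fault tolerance, and a staged copy through an interim blob account. Dataset and linked-service names are hypothetical.

```python
from azure.mgmt.datafactory.models import (
    BlobSource,
    CopyActivity,
    DatasetReference,
    LinkedServiceReference,
    SqlSink,
    StagingSettings,
)

tuned_copy = CopyActivity(
    name="TunedCopy",
    inputs=[DatasetReference(reference_name="ds_source_csv", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="ds_sink_table", type="DatasetReference")],
    source=BlobSource(),
    sink=SqlSink(),
    data_integration_units=16,          # explicit DIUs instead of "Auto"
    parallel_copies=8,                  # parallel threads within this one copy
    enable_skip_incompatible_row=True,  # fault tolerance: skip bad rows
    enable_staging=True,                # stage via interim blob storage
    staging_settings=StagingSettings(
        linked_service_name=LinkedServiceReference(
            reference_name="ls_staging_blob", type="LinkedServiceReference"
        ),
        path="staging-container",
    ),
)
```
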
│
├─── **5.2. Mapping Data Flows**
│ │ ├── **Compute Type:** Choose the Azure IR configuration for Data Flow Spark
clusters: General Purpose, Memory Optimized (for memory-intensive operations like
joins, lookups), or Compute Optimized (for CPU-intensive transformations).
│ │ ├── **Core Count:** Number of cores for the Spark cluster driver and worker
nodes (e.g., 4, 8, 16+ cores per node). More cores generally mean faster
processing.
│ │ ├── **Time To Live (TTL):** Configures how long the Spark cluster remains
active after the last Data Flow run, allowing subsequent runs to reuse the warm
cluster, reducing startup time.
│ │ ├── **Partitioning (Source, Sink, Transformation level):** Optimizing data
partitioning in Spark can significantly improve performance by distributing data
processing evenly across worker nodes and minimizing data shuffling. Can be
configured at source, sink, and within certain transformations.
│ │ └── **Optimize tab in transformations:** Provides options like 'Broadcast'
for join optimization (sending smaller datasets to all nodes), and settings to
handle data skew.
│
├─── **5.3. Pipeline Design**
│ │ ├── **Parallel execution of activities:** Design pipelines to run
independent activities in parallel rather than sequentially where possible.
│ │ ├── **Batching:** When processing many small files or items, group them
into larger batches to reduce overhead per item.
│ │ └── **Efficient use of Lookup and Get Metadata:** Minimize frequent calls
if data doesn't change often; cache results in variables if appropriate for the
pipeline run's scope.
│
└─── **5.4. Integration Runtime Sizing**
│ ├── **SHIR:** For Self-Hosted IR, performance depends on the CPU, memory,
and network bandwidth of the machine(s) it's installed on. Scaling out by adding
more nodes to the SHIR group improves throughput and high availability.
│ └── **Azure-SSIS IR:** Choose appropriate VM Node size (affecting CPU,
memory) and Node count (for scale-out) based on the complexity and volume of SSIS
packages being executed.
6. Pricing Model
* Detailed Description: ADF pricing is generally pay-as-you-go and based on
consumption of different components.
│
├─── 6.1. Pipeline Orchestration & Execution: Charged per activity run, trigger
execution, and for pipeline orchestration (e.g., evaluating conditions, loops).
├─── 6.2. Data Flow Execution & Debugging: Charged based on vCore-hours of the
Spark cluster used. The rate depends on the compute type (General Purpose, Memory
Optimized, Compute Optimized) and core count. Debugging sessions are also charged.
├─── 6.3. Data Movement: For the Copy activity on Azure IR, charged per Data
Integration Unit-hour (DIU-hour). For SHIR, you pay for activity runs, but the SHIR
machine cost is yours.
├─── 6.4. SSIS Integration Runtime: Charged per hour based on the VM node size and
number of nodes. Also incurs costs for the Azure SQL DB/Managed Instance used to
host SSISDB, and potentially SSIS licensing (Azure Hybrid Benefit can apply).
├─── 6.5. Read/Write Operations: Charges for interactions with the ADF service,
such as creating/reading/updating/deleting ADF entities (pipelines, datasets,
etc.). These are typically minor compared to execution costs.
├─── 6.6. Monitoring: Costs associated with Azure Monitor (Log Analytics data
ingestion/retention, alert rules).
└─── 6.7. Software-Defined Network (Managed VNet, Private Endpoints): Costs may
apply for using managed VNet capabilities and private endpoints.
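
A back-of-the-envelope illustration of how these meters combine into a monthly estimate. All rates below are hypothetical placeholders, not real prices; use the current Azure pricing page and pricing calculator for actual figures.

```python
# Hypothetical unit rates (NOT real prices) to show how the meters add up.
rate_per_1000_activity_runs = 1.00   # orchestration / activity runs
rate_per_diu_hour = 0.25             # Copy activity data movement on Azure IR
rate_per_vcore_hour = 0.27           # Mapping Data Flow (General Purpose)

# Hypothetical monthly workload: an hourly pipeline whose single Copy activity
# runs 15 minutes at 4 DIUs, plus a 30-minute daily Data Flow on 8 vCores.
activity_runs = 30 * 24
diu_hours = 30 * 24 * 0.25 * 4
vcore_hours = 30 * 0.5 * 8

monthly_estimate = (
    activity_runs / 1000 * rate_per_1000_activity_runs
    + diu_hours * rate_per_diu_hour
    + vcore_hours * rate_per_vcore_hour
)
print(f"Rough monthly estimate: ${monthly_estimate:,.2f}")
```
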
7. Use Cases
* Detailed Description: ADF is versatile and can be applied to a wide range of data
integration scenarios.
│
├─── 7.1. ETL/ELT Processes: Classic data warehousing scenarios: Extracting data
from various sources, Transforming it (e.g., cleaning, aggregating, joining using
Data Flows or other compute), and Loading it into a data warehouse or data mart.
ELT involves loading raw data first and then transforming it in the target system.
├─── 7.2. Data Warehousing: Specifically, ingesting data into modern data warehouse
solutions like Azure Synapse Analytics, Snowflake, Google BigQuery, or traditional
SQL Server warehouses.
├─── 7.3. Big Data Processing: Orchestrating data pipelines that involve big data
technologies like Azure Databricks (Spark), Azure HDInsight (Hadoop, Spark, Hive),
or Azure Data Lake Storage.
├─── 7.4. Data Migration: Moving data from on-premises systems to Azure cloud
services, or migrating data between different cloud services or storage types.
├─── 7.5. Operational Data Integration: Integrating data between operational
systems, often requiring more frequent or near real-time updates.
├─── 7.6. SaaS Data Integration: Ingesting data from Software-as-a-Service (SaaS)
applications like Salesforce, Dynamics 365, SAP, etc., using ADF's connectors.
└─── 7.7. Real-time/Near Real-time (with Event Triggers): Building event-driven
data pipelines that react to events like new file arrivals in blob storage or
custom events via Event Grid.
8. Advanced Topics & Integrations
* Detailed Description: ADF works in conjunction with other Azure services to
provide a more comprehensive data platform.
│
├─── 8.1. Azure Purview (Data Governance & Cataloging): ADF integrates with Azure
Purview to automatically scan ADF instances and capture lineage information from
pipeline executions. This helps in data discovery, understanding data flow, and
impact analysis.
├─── 8.2. Power BI (Consuming data processed by ADF): Power BI can connect to data
stores (like Azure Synapse, Data Lake, SQL DB) that are populated or curated by ADF
pipelines, enabling business intelligence and reporting on the processed data.
├─── 8.3. Azure Logic Apps (Broader workflow orchestration): While ADF focuses on
data orchestration, Azure Logic Apps provides broader workflow automation. Logic
Apps can trigger ADF pipelines as part of a larger business process, or ADF can
call Logic Apps (via Web activity) for specific tasks.
├─── 8.4. Azure Functions (Serverless compute for custom steps): ADF can invoke
Azure Functions to execute custom code snippets for lightweight, event-driven
processing tasks that might not fit standard ADF activities.
└─── 8.5. Delta Lake Support (in Data Flows for ACID transactions on data lakes):
Mapping Data Flows can read from and write to Delta Lake format in Azure Data Lake
Storage. This enables ACID (Atomicity, Consistency, Isolation, Durability)
transactions, time travel (data versioning), and schema enforcement on data lakes,
bringing data warehouse-like reliability to data lakes.
This detailed breakdown should provide a much deeper understanding of each Azure
Data Factory concept and component.