Data Factory
The raw data doesn't have the proper context or meaning to provide meaningful insights to
analysts, data scientists, or business decision makers
Big data requires a service that can orchestrate and operationalize processes to refine these
enormous stores of raw data into actionable business insights
Azure Data Factory is a managed cloud service that's built for these complex hybrid extract-
transform-load (ETL), extract-load-transform (ELT), and data integration projects
The pipeline allows you to manage the activities as a set instead of each one individually
You deploy and schedule the pipeline instead of the activities independently
For example, use a Copy activity to copy data from SQL Server to Azure Blob Storage.
Then, use a data flow activity or a Databricks Notebook activity to process and transform
data from the blob storage to an Azure Synapse Analytics pool
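A minimal sketch of such a pipeline, written as a Python dict that mirrors the JSON authoring format (the dataset, linked service, and notebook names are hypothetical placeholders):

```python
import json

# Sketch of a pipeline definition: a Copy activity stages data in Blob Storage,
# then a Databricks Notebook activity transforms it toward Synapse.
pipeline = {
    "name": "CopyAndTransformPipeline",
    "properties": {
        "activities": [
            {
                # Copy activity: SQL Server -> Azure Blob Storage staging area
                "name": "CopySqlToBlob",
                "type": "Copy",
                "inputs": [{"referenceName": "SqlServerSalesTable", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "BlobStagingDataset", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "SqlServerSource"},
                    "sink": {"type": "BlobSink"},
                },
            },
            {
                # Databricks Notebook activity: transform the staged data and load it
                # into an Azure Synapse Analytics pool; runs only after the copy succeeds
                "name": "TransformWithDatabricks",
                "type": "DatabricksNotebook",
                "dependsOn": [{"activity": "CopySqlToBlob", "dependencyConditions": ["Succeeded"]}],
                "linkedServiceName": {"referenceName": "DatabricksLS", "type": "LinkedServiceReference"},
                "typeProperties": {"notebookPath": "/transform/sales_to_synapse"},
            },
        ]
    },
}

print(json.dumps(pipeline, indent=2))  # the JSON you would deploy to the factory
```

The dependsOn entry makes the transform step wait for the copy to succeed, which is what lets the two activities be deployed, scheduled, and monitored as one pipeline rather than individually.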
An activity can take zero or more input datasets and produce one or more output datasets
Linked services define the connection information the service needs to connect to external resources
These resources can be on-premises or in the cloud, and they can include data stores, compute resources, and other Azure services
For example, an Azure Storage linked service links a storage account to the service
(Diagram: Azure Data Factory linked to data stores such as Azure SQL Database and Amazon S3 through connection strings)
Datasets identify data within different data stores, such as tables, files, folders, and documents
For example, an Azure Blob dataset specifies the blob container and folder in Blob Storage from which the
activity should read the data
Before you create a dataset, you must create a linked service to link your data store to the service
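As a rough illustration of that ordering, here is a sketch of a Blob Storage linked service and a delimited-text dataset that references it, again as Python dicts in the JSON authoring shape (the names, container, and folder path are made up):

```python
import json

# The linked service holds the connection information for the data store...
linked_service = {
    "name": "AzureBlobStorageLS",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {"connectionString": "<storage account connection string>"},
    },
}

# ...and the dataset points at data inside that store (container + folder),
# referencing the linked service created first.
dataset = {
    "name": "BlobStagingDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {"referenceName": "AzureBlobStorageLS", "type": "LinkedServiceReference"},
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "raw",
                "folderPath": "sales/incoming",
            }
        },
    },
}

print(json.dumps({"linkedService": linked_service, "dataset": dataset}, indent=2))
```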
Tumbling window triggers are typically used to run pipelines on a periodic interval, while also
retaining state
Tumbling window trigger: Running a pipeline to process streaming data from a Kafka topic in
1-hour batches
Tumbling window trigger with dependency: Running a pipeline to process data from a
database in 1-hour batches, with the pipeline run depending on a successful run of a previous
pipeline that processes data from a different database.
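A sketch of what a tumbling window trigger definition might look like, assuming a 1-hour window and a hypothetical ProcessHourlyBatch pipeline; a dependency on another tumbling window trigger could be added under typeProperties, but that part is omitted here:

```python
import json

# Sketch of a tumbling window trigger: fixed, non-overlapping 1-hour windows,
# with the window boundaries passed into the pipeline as parameters.
trigger = {
    "name": "HourlyTumblingTrigger",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 1,
            "startTime": "2024-01-01T00:00:00Z",
            "maxConcurrency": 1,
        },
        "pipeline": {
            "pipelineReference": {"referenceName": "ProcessHourlyBatch", "type": "PipelineReference"},
            "parameters": {
                # window boundaries retained as trigger state, usable for filtering the batch
                "windowStart": "@trigger().outputs.windowStartTime",
                "windowEnd": "@trigger().outputs.windowEndTime",
            },
        },
    },
}

print(json.dumps(trigger, indent=2))
```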
Custom event trigger: Starting a pipeline when a custom event is published from another Azure service, such as Azure Logic Apps or Azure Functions
Configure the trigger to listen for the custom events that should start the pipeline
Once the pipeline is published, it will start whenever a custom event is published to the Event Grid topic that the
trigger is listening for
You can publish events to a topic from any source, and you can subscribe to events from any
topic by creating an event subscription
Starting a pipeline in Azure Data Factory when a new file is uploaded to a storage account
Triggering a workflow in Azure Logic Apps when a new row is inserted into a database table
Event Grid topics are highly scalable and reliable, and they can be used to distribute events to any number of event handlers
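A sketch of a custom event trigger definition, assuming the trigger listens on an Event Grid topic identified by its resource ID; the topic, subject filter, event type, and pipeline name are placeholders:

```python
import json

# Sketch of a custom event trigger bound to an Event Grid topic.
trigger = {
    "name": "OrderEventTrigger",
    "properties": {
        "type": "CustomEventsTrigger",
        "typeProperties": {
            # resource ID of the Event Grid topic the trigger listens to
            "scope": "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.EventGrid/topics/<topic>",
            "subjectBeginsWith": "orders/",
            "events": ["OrderCreated"],  # custom event types that should start the pipeline
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "ProcessOrderPipeline", "type": "PipelineReference"}}
        ],
    },
}

print(json.dumps(trigger, indent=2))
```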
An integration runtime provides the bridge between activities and linked services
Its capabilities include Data Flow, Data movement, and Activity dispatch
This allows the activity to be performed in the closest possible region to the target data store
You should choose the type that best serves your data integration capabilities and network
environment requirements
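As a sketch of how that choice surfaces in a definition, a linked service can point at a specific integration runtime through its connectVia property (the self-hosted runtime name and connection string below are hypothetical):

```python
import json

# Sketch: an on-premises data store bound to a self-hosted integration runtime.
# Activities that use this linked service are dispatched to that runtime.
linked_service = {
    "name": "OnPremSqlServerLS",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {"connectionString": "<on-premises SQL Server connection string>"},
        "connectVia": {
            "referenceName": "SelfHostedIR",  # name of the self-hosted integration runtime
            "type": "IntegrationRuntimeReference",
        },
    },
}

print(json.dumps(linked_service, indent=2))
```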
Pipeline variables are values that can be set and modified during a pipeline run
Unlike pipeline parameters, which are defined at the pipeline level & cannot be changed during a pipeline run, pipeline variables can be set and modified within a pipeline using a Set Variable activity
Pipeline variables can be used to store and manipulate data during a pipeline run, such as storing the results of a computation
System variables can be used to capture commonly used pipeline-related information & pass it dynamically anywhere within the pipeline
This allows you to generate unique file names and paths for each pipeline run
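A combined sketch of both ideas: a pipeline variable declared at the pipeline level, set at run time by a Set Variable activity, using the @pipeline().RunId system variable to build a run-unique output path (all names are made up):

```python
import json

# Sketch of a pipeline that declares a variable and sets it at run time.
pipeline = {
    "name": "VariableDemoPipeline",
    "properties": {
        "variables": {"outputPath": {"type": "String"}},  # declared at pipeline level
        "activities": [
            {
                "name": "SetOutputPath",
                "type": "SetVariable",
                "typeProperties": {
                    "variableName": "outputPath",
                    # pipeline().RunId is a system variable, so every run
                    # produces a distinct file path
                    "value": "@concat('output/', pipeline().RunId, '.csv')",
                },
            }
        ],
    },
}

print(json.dumps(pipeline, indent=2))
```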
Activity outputs can be used to check the status of a previous activity before running the next activity
This allows you to reuse data from one activity in another activity
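A short sketch of that pattern: a dependency condition gates the second activity on the first succeeding, and an expression reads a value from the first activity's output (activity names are hypothetical, the copy activity is abbreviated, and a String variable named rowsCopied is assumed to exist on the pipeline):

```python
import json

# Sketch: the second activity runs only if the copy succeeded, and it reads
# the rowsCopied value from the copy activity's output.
activities = [
    {
        "name": "CopySqlToBlob",
        "type": "Copy",
        "typeProperties": {"source": {"type": "SqlServerSource"}, "sink": {"type": "BlobSink"}},
    },
    {
        "name": "RecordRowCount",
        "type": "SetVariable",
        # dependency condition: only run after CopySqlToBlob succeeds
        "dependsOn": [{"activity": "CopySqlToBlob", "dependencyConditions": ["Succeeded"]}],
        "typeProperties": {
            "variableName": "rowsCopied",
            # reuse data from the previous activity via its output object
            "value": "@string(activity('CopySqlToBlob').output.rowsCopied)",
        },
    },
]

print(json.dumps(activities, indent=2))
```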
Connectors are components that allow you to connect to & interact with external data sources
Data Factory provides connectors for on-premises and cloud data sources, SaaS applications, and other Azure services
Ingesting data: from a variety of sources, such as on-premises databases, cloud storage, and SaaS applications
Loading data: into a variety of destinations, such as Azure Data Lake Storage, Azure Synapse Analytics, and Azure SQL Database
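A sketch of two such connectors expressed as linked services, one for a cloud source (Amazon S3) and one for an Azure destination (Data Lake Storage Gen2); the property shapes follow my reading of the connector documentation and all credentials are placeholders:

```python
import json

# Sketch of two connectors as linked services: a cloud source to ingest from
# and an Azure destination to load into.
s3_source = {
    "name": "AmazonS3LS",
    "properties": {
        "type": "AmazonS3",
        "typeProperties": {
            "accessKeyId": "<access key id>",
            "secretAccessKey": {"type": "SecureString", "value": "<secret access key>"},
        },
    },
}

adls_sink = {
    "name": "DataLakeGen2LS",
    "properties": {
        "type": "AzureBlobFS",  # connector type for Data Lake Storage Gen2
        "typeProperties": {
            "url": "https://<account>.dfs.core.windows.net",
            "accountKey": {"type": "SecureString", "value": "<account key>"},
        },
    },
}

print(json.dumps([s3_source, adls_sink], indent=2))
```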