
Azure Data Factory

Azure Data Factory (ADF) is a cloud-based data integration service provided by Microsoft Azure. It allows you to create, schedule, and orchestrate data workflows and pipelines to move and transform data from various sources to destinations. Essentially, it's a tool for data integration and ETL (Extract, Transform, Load) processes. Here's a breakdown of its key features:

➢ Data Ingestion: We can connect to a wide range of data sources,
including on-premises databases, cloud storage, and SaaS
applications. ADF supports both structured and unstructured
data.
➢ Data Transformation: With ADF, we can transform data using a
variety of methods, such as data flows (visual data
transformation) and using external compute services like Azure
Databricks or HDInsight.
➢ Data Movement: It facilitates the movement of data between
different data stores, which can be within Azure or other cloud
environments.
➢ Orchestration: We can build workflows to manage complex data
processing tasks, including scheduling, error handling, and retry
mechanisms. This helps in automating data pipelines.
➢ Monitoring: ADF provides monitoring capabilities to track the
execution of data pipelines and to ensure that data processes are
running smoothly (a small code sketch of triggering and monitoring
a run follows this list).
➢ Integration: It integrates with various Azure services and
third-party tools, enhancing its capabilities for comprehensive
data management.
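As a small illustration of the orchestration and monitoring features above, here is a minimal Python sketch using the azure-mgmt-datafactory SDK to trigger a pipeline run and poll its status. The pipeline name and resource identifiers are placeholders (assumptions, not values from this document), and the SDK calls should be verified against your installed version.

# Minimal sketch: trigger an ADF pipeline run and poll its status.
# Assumes azure-identity and azure-mgmt-datafactory are installed and that a
# pipeline named "SuperstorePipeline" (hypothetical) already exists in the factory.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<subscription-id>"   # placeholder
resource_group = "<resource-group>"     # placeholder
factory_name = "<data-factory-name>"    # placeholder

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Orchestration: start a run of an existing pipeline.
run = client.pipelines.create_run(resource_group, factory_name, "SuperstorePipeline")

# Monitoring: check the run status (e.g. Queued / InProgress / Succeeded / Failed).
status = client.pipeline_runs.get(resource_group, factory_name, run.run_id).status
print(f"Pipeline run {run.run_id}: {status}")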
How to create an Azure Data Factory resource and consume it:

Step 1: Go to the Azure portal and search for Azure Data Factory.

Step 2: Create the resource.

Choose Data Factory and create the resource.

Step 3: Configure the resource group, subscription, name, etc.

Step 4: If we don't want to configure any other options, just create it.

Step 5: Once deployed, go to the resource and launch Data Factory Studio.

Step 6: Explore Data Factory Studio.
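If you prefer code over the portal, the same Data Factory resource can be created with the azure-mgmt-datafactory Python SDK. This is a minimal sketch under the assumption that the resource group already exists; the names and region are placeholders, not values from this document.

# Minimal sketch: create a Data Factory resource programmatically.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<subscription-id>"   # placeholder
resource_group = "<resource-group>"     # placeholder; must already exist
factory_name = "<data-factory-name>"    # placeholder; must be globally unique

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) the factory in the chosen region.
factory = client.factories.create_or_update(
    resource_group, factory_name, Factory(location="eastus")
)
print(factory.provisioning_state)  # "Succeeded" once the deployment completes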


Components of ADF:

These are some of the core components that ADF provides:

● ADF simply takes data from a source, transforms it, or moves the
data between multiple data stores, whether on-premises to cloud or
cloud to cloud.
● Because it picks up data from multiple sources, ADF needs to
establish a connection with each of them, so there is a component
in ADF known as Linked Services that we need to configure.

A linked service is a crucial component that defines the connection
information needed to connect to a data source or a data sink
(destination). Essentially, it acts as a bridge between ADF and your data
storage or compute resources.

Configure the linked service name, integration runtime, authentication
method, and any parameters (discussed later on), and test the connection.

As shown in this screenshot, we are configuring ADLS Gen2 as a
source/destination data store within our Azure domain.
● Be sure to test the connection.

Once the connection is successful, create the resource; you can find
your linked service under the Manage tab.
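For reference, here is a hedged Python sketch of defining an equivalent ADLS Gen2 linked service with the azure-mgmt-datafactory SDK. The linked service name, storage account URL, and key are assumptions for illustration; in practice the key should come from Azure Key Vault or a managed identity, and the model names should be checked against your SDK version.

# Hedged sketch: register an ADLS Gen2 linked service in the factory.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import LinkedServiceResource, AzureBlobFSLinkedService

subscription_id = "<subscription-id>"   # placeholder
resource_group = "<resource-group>"     # placeholder
factory_name = "<data-factory-name>"    # placeholder

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

adls_ls = LinkedServiceResource(
    properties=AzureBlobFSLinkedService(
        url="https://<storage-account>.dfs.core.windows.net",  # placeholder account
        account_key="<storage-account-key>",  # illustrative only; prefer Key Vault / managed identity
    )
)

# The name "AdlsGen2LinkedService" is hypothetical and is reused in later sketches.
client.linked_services.create_or_update(
    resource_group, factory_name, "AdlsGen2LinkedService", adls_ls
)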
Now that the connection to the data source has been established, we
need to configure the data files. For this, ADF provides a component
named Datasets.

A dataset is a crucial component that represents the structure of the data
you want to work with. It defines the data you want to interact with and
provides the necessary information for data processing activities.
Datasets act as references to the data in your data sources or sinks and
are used in conjunction with linked services to access the data.

We will access a CSV file from the ADLS container by creating a dataset
reference.

Choose the file format in ADLS.

Choose CSV and then click Continue.

Configure the name, linked service, and file path, and enable "First row
as header" if the data has one.
● Configure the connection properties, schema, or parameters as
per your use case.

Publish it.

Now our dataset is ready to be processed.
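The same dataset can also be defined in code. Below is a minimal sketch that registers a delimited-text dataset named SuperstoreCSV on top of the (assumed) AdlsGen2LinkedService from the previous sketch; the container and file path are hypothetical.

# Minimal sketch: a CSV (delimited-text) dataset over ADLS Gen2.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource, DelimitedTextDataset, AzureBlobFSLocation, LinkedServiceReference
)

subscription_id = "<subscription-id>"   # placeholder
resource_group = "<resource-group>"     # placeholder
factory_name = "<data-factory-name>"    # placeholder

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

superstore_csv = DatasetResource(
    properties=DelimitedTextDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AdlsGen2LinkedService"  # assumed name
        ),
        location=AzureBlobFSLocation(
            file_system="<container>",      # placeholder container
            folder_path="raw",              # hypothetical folder
            file_name="superstore.csv",     # hypothetical file
        ),
        column_delimiter=",",
        first_row_as_header=True,           # matches "First row as header" above
    )
)
client.datasets.create_or_update(resource_group, factory_name, "SuperstoreCSV", superstore_csv)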


The next important component in ADF is the Data Flow.

Data Flows are a feature that allows you to design, build, and manage
data transformation processes visually within a pipeline. Data Flows
provide a way to perform data transformation and manipulation without
having to write code manually.

So, in this data flow we will fetch the Superstore data from the dataset
(SuperstoreCSV) that we created earlier.

1. Configure the source for the data flow from the SuperstoreCSV dataset.

From the Projection tab you can manage the data types of the columns.

There are multiple operations inside the data flow to operate on the
source data:

The Select operation is used to keep only the product-related columns.
Select the columns you do not want and delete them; here we remove
every column except Product ID, Category, Sub-Category, and Product Name.

To look at the data you get, turn on Data Flow Debug (which spins up a
cluster) and go to the Data Preview option.

“ Data Flow Debug is a feature that allows you to interactively test and
troubleshoot data flows. It provides real-time feedback on data
transformations, helps identify and fix errors, and allows you to preview
data at various stages. This helps ensure that your data flows work
correctly before deploying them.”
This is how the data looks:

The sink is the destination component in a data flow where the transformed
data is written or stored. It represents the endpoint where the data is
ultimately delivered after processing through the various
transformations in the data flow.

Configure a different dataset for the sink where the output is stored in
the desired format (I used Parquet for it).
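To mirror that step in code, here is a hedged sketch of a Parquet dataset that could serve as the sink, reusing the same (assumed) ADLS Gen2 linked service; the folder path and dataset name are hypothetical.

# Hedged sketch: a Parquet dataset used as the data flow sink.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource, ParquetDataset, AzureBlobFSLocation, LinkedServiceReference
)

subscription_id = "<subscription-id>"   # placeholder
resource_group = "<resource-group>"     # placeholder
factory_name = "<data-factory-name>"    # placeholder

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

products_parquet = DatasetResource(
    properties=ParquetDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AdlsGen2LinkedService"  # assumed name
        ),
        location=AzureBlobFSLocation(
            file_system="<container>",          # placeholder container
            folder_path="curated/products",     # hypothetical output folder
        ),
    )
)
client.datasets.create_or_update(resource_group, factory_name, "ProductsParquet", products_parquet)

Once published, running the data flow inside a pipeline writes the transformed product columns to this Parquet sink.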
