Azure Data Factory
Step 1: Go to the Azure portal and search for Azure Data Factory.
Step 2: Choose Data Factory and create the resource.
Step 3: Configure the resource group, subscription, name, etc.
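The same portal steps can also be scripted. Below is a minimal sketch using the azure-mgmt-datafactory Python SDK; the subscription ID, resource group, region, and factory name are placeholders I made up for illustration, not values from this walkthrough:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<your-subscription-id>"  # placeholder
resource_group = "my-rg"                    # placeholder
factory_name = "my-adf-demo"                # placeholder

# Authenticate with whatever credential is available in your environment
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) the Data Factory resource in the chosen region
factory = adf_client.factories.create_or_update(
    resource_group, factory_name, Factory(location="eastus")
)
print(factory.provisioning_state)  # "Succeeded" once the factory is ready
```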
As you can see in this screenshot, we are configuring ADLS Gen2 as the
source data store within our Azure environment.
● Make sure to test the connection.
Once the connection test is successful, create the resource; you can find
your linked service under the Manage tab.
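If you prefer to define the linked service in code, here is a hedged sketch with the same SDK, reusing adf_client from the sketch above. ADLS Gen2 linked services use the AzureBlobFS type; the storage URL, key, and linked-service name are placeholders, and account-key auth is only one option (managed identity or a service principal also work):

```python
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureBlobFSLinkedService, SecureString
)

# Linked service pointing at an ADLS Gen2 account (placeholders below)
ls = LinkedServiceResource(
    properties=AzureBlobFSLinkedService(
        url="https://<storage-account>.dfs.core.windows.net",
        account_key=SecureString(value="<storage-account-key>"),
    )
)
adf_client.linked_services.create_or_update(
    resource_group, factory_name, "AdlsGen2LinkedService", ls
)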
Now that the connection to the data source has been established, we
need to configure the data files. For this, ADF has a component named
Datasets.
We will access a CSV file from the ADLS container by creating a dataset
that references it.
Choose the file format of the data in ADLS.
Configure the name, linked service, and file path, and enable "First row
as header" if the data has one.
● Configure the connection properties, schema, or parameters as
per your use case (see the sketch after this step).
Publish it.
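For reference, a sketch of an equivalent dataset definition via the same SDK; the container, folder, and file names are placeholders, and only the dataset name SuperstoreCSV comes from this walkthrough:

```python
from azure.mgmt.datafactory.models import (
    DatasetResource, DelimitedTextDataset, AzureBlobFSLocation,
    LinkedServiceReference
)

# A DelimitedText (CSV) dataset pointing at a file in the ADLS container
csv_dataset = DatasetResource(
    properties=DelimitedTextDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference",
            reference_name="AdlsGen2LinkedService",
        ),
        location=AzureBlobFSLocation(
            file_system="raw",            # container name (placeholder)
            folder_path="superstore",     # folder (placeholder)
            file_name="superstore.csv",   # file (placeholder)
        ),
        column_delimiter=",",
        first_row_as_header=True,  # matches the "First row as header" toggle
    )
)
adf_client.datasets.create_or_update(
    resource_group, factory_name, "SuperstoreCSV", csv_dataset
)
```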
Data Flows are a feature that allows you to design, build, and manage
data transformation processes visually within a pipeline. Data Flows
provide a way to perform data transformation and manipulation without
having to write the transformation code manually.
In this data flow, we will fetch the Superstore data from the dataset
(SuperstoreCSV) we created earlier.
There are multiple operations inside the data flow that operate on the
source data:
The Select operation is used to keep only the product-related columns.
In the Select settings, mark all the columns except the ones you want to
keep and delete them. Here we selected every column except Product ID,
Category, Sub-Category, and Product Name and removed them, so only those
four remain (see the sketch below).
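Under the hood, the visual designer generates a data flow script. Here is a hedged sketch of what an equivalent definition could look like through the SDK; the transformation names are illustrative, and the SuperstoreParquet sink dataset is the one created in the sink step further below:

```python
from azure.mgmt.datafactory.models import (
    DataFlowResource, MappingDataFlow, DataFlowSource, DataFlowSink,
    DatasetReference, Transformation
)

# Data flow script: column names containing spaces are wrapped in {braces};
# mapColumn() in the select keeps exactly the listed columns and drops the rest
script = """
source(allowSchemaDrift: true, validateSchema: false) ~> SuperstoreSource
SuperstoreSource select(mapColumn(
    {Product ID},
    Category,
    {Sub-Category},
    {Product Name}
)) ~> SelectProductColumns
SelectProductColumns sink(allowSchemaDrift: true, validateSchema: false) ~> ParquetSink
"""

data_flow = DataFlowResource(
    properties=MappingDataFlow(
        sources=[DataFlowSource(
            name="SuperstoreSource",
            dataset=DatasetReference(type="DatasetReference",
                                     reference_name="SuperstoreCSV"),
        )],
        sinks=[DataFlowSink(
            name="ParquetSink",
            dataset=DatasetReference(type="DatasetReference",
                                     reference_name="SuperstoreParquet"),
        )],
        transformations=[Transformation(name="SelectProductColumns")],
        script=script,
    )
)
adf_client.data_flows.create_or_update(
    resource_group, factory_name, "SuperstoreDataFlow", data_flow
)
```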
To look at the data you are getting, turn on Data Flow Debug (which
spins up a cluster) and use the Data Preview option.
“Data Flow Debug is a feature that allows you to interactively test and
troubleshoot data flows. It provides real-time feedback on data
transformations, helps identify and fix errors, and allows you to preview
data at various stages. This helps ensure that your data flows work
correctly before deploying them.”
This is how the data looks:
Configure a different dataset for the sink, where the output is stored in
your desired format (I used Parquet).
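A sketch of the Parquet sink dataset via the SDK, mirroring the CSV dataset above; the output container and folder are placeholders, and only the name SuperstoreParquet ties it to the data flow sketch earlier:

```python
from azure.mgmt.datafactory.models import (
    DatasetResource, ParquetDataset, AzureBlobFSLocation, LinkedServiceReference
)

# Sink dataset in Parquet format (output container/folder are placeholders)
parquet_dataset = DatasetResource(
    properties=ParquetDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference",
            reference_name="AdlsGen2LinkedService",
        ),
        location=AzureBlobFSLocation(
            file_system="curated",         # output container (placeholder)
            folder_path="superstore-out",  # output folder (placeholder)
        ),
    )
)
adf_client.datasets.create_or_update(
    resource_group, factory_name, "SuperstoreParquet", parquet_dataset
)
```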