Azure DE Project
Before creating any of the resources, we will first create the resource group in which all the required resources will be created.
Go to Azure Portal and log in with your Azure account. In the left-hand menu, select
"Resource groups". If you don't see it, use the search bar at the top of the page and search
for "Resource groups". Click the "+ Create" button or "Add" at the top of the Resource
groups page.
Select the subscription under which the resource group will be created. Enter a unique name
for your resource group. Choose a location (region) where your resources will reside (e.g.,
East US, West Europe).
Search for "Storage accounts" and click "+ Create". Select subscription, resource group,
region, and enter a unique storage account name as shown below.
Enable ADLS Gen2: Go to the Advanced tab and enable Hierarchical namespace.
Then, in the configs container, create the directory “emr” and upload the file “load_config.csv” into it.
Steps to create Azure SQL database:
In the search bar, type "SQL Database" and select "SQL Database" from the results.
Choose your Subscription and Resource Group. Enter a Database Name. Also
create a SQL Server.
We will create the server as shown below. Choose a Compute + Storage tier (e.g.,
Basic, General Purpose).
After this, go to Networking => In Network connectivity, select the “Public endpoint” option. Also set Yes for the options “Allow Azure services and resources to access this server” and “Add current client IP address”.
Note: Please note down this username and password for future reference.
Click Review + Create, validate the details, and then click Create as shown below.
Note: While creating the database, if you are not able to allow public access and add the client IP address, you can follow the below steps:
After creating the database, go to Networking => For Public access, select the option “Selected networks” and save.
Also, while using the query editor, if you face the below error, click on “Allowing IP for current IP address” as shown below.
Similarly, we will create another database, trendytech-hospital-b (we will use the same server, i.e. trendytech-sqlserver, that we created while creating the trendytech-hospital-a database). Thus we have created 2 databases, as shown below.
Then we will create the tables in these databases. To create the tables, use the below scripts, which are available in the GitHub account:
For trendytech-hospital-a => Trendytech_hospital_A_table_creation_commands
In the search bar, type "Data Factory" and select "Data Factory" from the results.
Click the "Create" button on the Data Factory page. Provide a globally unique name
for your Data Factory instance. Choose V2 (Data Factory Version 2) for the latest
features.
Click "Review + Create" to validate the details. If validation passes, click "Create" to
deploy the Data Factory.
In the ADF interface, go to the Manage section on the left-hand panel. Under the
Connections section, select Linked Services. Click on New to create a new Linked
Service.
1. Azure SQL DB
Note down the server name of the SQL database that we have created.
In the fully qualified domain name, mention the server name, provide the username and password for the SQL server, and define the parameter db_name; using this parameter we will pass the database name, as shown below.
Then click on Create to create the linked service.
2. ADLS GEN2
Select Azure Data Lake Storage Gen2 as the data store. Provide the following details: the name of your storage account and the authentication details. Then click Test Connection to verify, and save the Linked Service.
To get the URL for Azure Data Lake Storage, go to the ADLS Gen2 storage account that we have created => Settings => Endpoints => and copy the URL as shown below.
Also copy the access key. Using these details, create the linked service as shown below.
Dataset creation:
In the ADF interface, click on the Author section (left-hand panel).
Expand the Datasets option. Click on the “…” next to Datasets in
order to create the dataset.
1. Azure SQL DB
We will select the linked service that we have created for the SQL database.
To create the datasets for the tables in the SQL database in a parameterized way, we will create the parameters db_name, schema_name, and table_name.
In order to store data in ADLS Gen2 in Parquet format, we will need a dataset.
While creating this dataset, we will select ADLS Gen2 as the source, Parquet as the file format, and we will create the parameters file_name, file_path, and container.
4. Azure Databricks Delta Lake
We will select Azure Databricks Delta Lake as the source. For this, we will create the parameters schema_name and table_name.
Once all the datasets and linked services are created, publish them all in order to save them.
Creation of Pipelines:
We will use the same linked service “hosa_sql_ls” that we have created
earlier for the database.
2. Creation of Datasets:
Select ADLS Gen2 storage as the source, then for the file format select JSON, as our lookup file is a JSON file, as shown below.
Steps to Configure the Pipeline:
This pipeline will copy the data from the file into the tables in the SQL database. On successfully running the pipeline, we will get the below output.
Pipeline to copy data from Azure SQL DB to the Landing Folder in ADLS Gen2
ForEach activity => Items:
@activity('lkp_EMR_configs').output.value
a. We will use the Get Metadata activity in order to check whether the file exists in the Bronze container:
This will check if the file exists in the Bronze container. Based on the file's presence or absence, we will use an If Condition activity to determine the subsequent processing steps.
Condition 1: File Exists (True) => Move the file to the Archive folder.
condition:
@and(equals(activity('fileExists').output.exists, true), equals(item().is_active, '1'))
Source: Container: Bronze, Path: hosa, File: encounters
@equals(item().loadtype, 'Full')
If the “If condition” holds true => Full Load => Copy all data from the database table => Enter log details in the audit table:
Folder and File Structure
Bronze Container:
Source Path: bronze/hosa
Target Path for Data Loads: bronze/<target-path>
If the condition is false => Incremental Load => Fetch the incremental data using the last fetched date (using a Lookup activity) => Incremental load using a Copy activity => Enter log details in the audit table:
Lookup:
Incremental load:
Source Path: bronze/hosa
Target Path for Data Loads: bronze/<target-path>
Before running the pipeline, select the “Sequential” option for the ForEach activity, as shown below.
But the limitation of this pipeline is that it runs sequentially, which we will resolve in Part 2.
Part 2:
In this section, we will focus on improving our data pipeline and governance.
Transition from a local Hive Metastore to Databricks Unity Catalog for centralized
metadata management and improved data governance.
To configure the metastore for Unity Catalog, follow the below steps:
Then go to your storage account and click on Access Control (IAM) => click on Add => Role assignment => Storage Blob Data Contributor => in Managed identity, select the Databricks connector that we have created.
Click on Catalog => Create metastore => the region should be the same as where you have the storage account and the workspaces => storage account path:
Ex: <container-name>@<storage-account-name>.dfs.core.windows.net/
Go to Catalog => select the recently created metastore => click on Assign to workspace => select the workspace that you want to assign.
We will also assign the “Create catalog” permission to the user for this metastore if you are not getting the option to create a new catalog.
Also, to organize the notebooks, we will create the folders as shown below.
1. Set up:
2. API extracts
3. Silver
4. Gold
Note: After creating the Databricks workspace, enable DBFS.
Note: Before proceeding ahead, we will create a Key Vault in Azure and store the key for our ADLS Gen2 storage in it. Refer to the below steps for more clarification.
Steps to Store Key in Key Vault and Create Secret Scope in Databricks:
In Access configuration, select the Vault access policy option and define the permissions for your username.
First, get the storage account key. To get the storage account access key, follow the below steps:
=> Go to your Azure Storage Account => Security & Networking => Access Keys.
=> Copy the key under the "Key1" or "Key2" section.
Go to the Key Vault => Navigate to your Key Vault in the Azure Portal => Click on Access Policies => + Add Access Policy => Grant Permissions.
You can find the Managed Identity for the Databricks resource in the Azure Portal, under Managed Identity.
Now select this managed resource group, check the managed identity, and define the policy for it.
Save the Policy:
Once the secret is stored in Key Vault, create a secret scope in Databricks.
While creating the secret scope, edit the workspace URL: keep it up to “.net” and append “#secrets/createScope”.
Similarly, we first store the credentials for the SQL database (username and password) and for Databricks as well (in the case of Databricks, we store the access token).
To get the access token in Databricks, go to Settings => Developer => Access tokens => Generate access token.
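Once the secret scope is created, the notebooks can read these secrets with dbutils.secrets.get. Below is a minimal sketch; the scope name and secret names used here are only placeholders, so substitute the names you chose while creating the Key Vault secrets and the secret scope.

# Minimal sketch: reading the stored secrets from a Databricks notebook.
# The scope name "tt-hc-kv-scope" and the secret names are placeholders, not the actual names used in the project.
adls_access_key = dbutils.secrets.get(scope="tt-hc-kv-scope", key="adls-access-key")
sql_username = dbutils.secrets.get(scope="tt-hc-kv-scope", key="sql-username")
sql_password = dbutils.secrets.get(scope="tt-hc-kv-scope", key="sql-password")
databricks_token = dbutils.secrets.get(scope="tt-hc-kv-scope", key="databricks-access-token")
# Secret values are redacted if printed, but they can be passed to Spark configs, JDBC options, etc.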
Now, for ADF and ADLS Gen2, we will create an application (app registration) and provide the permissions (you can provide all the permissions) as shown by Sumit Sir in the video.
To perform role assignments in Azure Databricks and Azure Data Factory (ADF),
follow the steps outlined below.
To assign the role: Go to the resource => IAM => Add => Role Assignment.
Define the Contributor role for the application that was created for Databricks, as shown below.
Note: In the Conditions step, select “Allow user to assign all roles (highly privileged)”.
Also, define the Owner role for your username if it is not already present, as shown below.
For ADF:
To assign the role: Go to the resource => IAM => Add => Role Assignment.
Define the Contributor role for the same application that was created for Databricks, as shown below.
We will use the code in the adls_mount notebook to mount the containers in
our storage account.
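The actual mount code is in the adls_mount notebook; the sketch below only illustrates the pattern, assuming the application (service principal) credentials are stored in the Key Vault-backed secret scope. The scope, secret, tenant, storage account, and container names are placeholders.

# Sketch of mounting an ADLS Gen2 container with the service principal (app registration).
# Everything in angle brackets and the scope/secret names are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("tt-hc-kv-scope", "app-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("tt-hc-kv-scope", "app-client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Repeat for each container (landing, bronze, silver, gold, configs).
dbutils.fs.mount(
    source="abfss://bronze@<storage-account-name>.dfs.core.windows.net/",
    mount_point="/mnt/bronze",
    extra_configs=configs,
)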
Now we will create a linked service for key vault and Databricks as shown
below.
Before creating the linked services, grant access permissions to the Data Factory service principal under the Access policies in Key Vault, and provide the Key and Secret access permissions.
After this, you will be able to access the secrets while creating the linked services.
Note: While creating Linked service for Databricks select the existing cluster
option as shown below.
We will update the linked services for ADLS Gen2 and the SQL database: instead of providing the storage account key directly, we will now select the Key Vault secret for the respective ADLS Gen2 storage, SQL DB, etc., as shown by Sumit Sir in the video.
However, when you move the copy activity to a child pipeline (triggered using
the Execute Pipeline activity) and place the If condition in the parent pipeline’s
For Each loop, the child pipeline no longer has direct access to item(). Instead,
you need to pass the loop data (e.g., loadtype) as a parameter to the child
pipeline.
So, in this pipeline, update the expression as shown below; we also have to add the parameter, as shown below.
@equals(pipeline().parameters.Load_Type, 'Full')
Parameter:
We will update the values for the source and sink variables for both the full
load and incremental load copy activities. Since the variable values are the
same for both, we will use the same variable, as demonstrated below.
Source:
db_name - @pipeline().parameters.database
schema_name - @split(pipeline().parameters.tablename, '.')[0]
table_name - @split(pipeline().parameters.tablename, '.')[1]
Sink:
Also, we have to update the queries for the Full load, as shown below.
Full Load:
@concat('select *,''',pipeline().parameters.datasource,''' as datasource from ',pipeline().parameters.tablename)
Additionally, we will update the lookup query to insert logs into the audit table
for Full load as shown below.
@concat('insert into audit.load_logs(data_source,tablename,numberofrowscopied,watermarkcolumnname,loaddate) values (''',pipeline().parameters.datasource,''', ''',pipeline().parameters.tablename,''',''',activity('Incremental_Load_CP').output.rowscopied,''',''',pipeline().parameters.watermark,''',''',utcNow(),''')')
Incremental load:
For the “Fetch_logs” activity, we have to update the query to use pipeline parameters instead of item():
@concat('select coalesce(cast(max(loaddate) as date),''','1900-01-01',''') as last_fetched_date from audit.load_logs where',' data_source=''',pipeline().parameters.datasource,''' and tablename=''',pipeline().parameters.tablename,'''')
Also, we have to update the queries for the Incremental load, as shown below.
@concat('select *,''',pipeline().parameters.datasource,''' as datasource from ',pipeline().parameters.tablename,' where ',pipeline().parameters.watermark,' >= ''',activity('Fetch_logs').output.firstRow.last_fetched_date,'''')
Additionally, we will update the lookup query to insert logs into the audit table
for Incremental load as shown below.
@concat('insert into audit.load_logs(data_source,tablename,numberofrowscopied,watermarkcolumnname,loaddate) values (''',pipeline().parameters.datasource,''', ''',pipeline().parameters.tablename,''',''',activity('Incremental_Load_CP').output.rowscopied,''',''',pipeline().parameters.watermark,''',''',utcNow(),''')')
Also, for this Execute Pipeline activity, we will pass the parameter values as shown below.
Steps to move NPI and ICD codes from the API to the bronze layer:
For this, we will use the provided code notebooks and implement the logic in them.
2. API extracts
=> ICD Code API extract
Note: Run these code notebooks in order to fetch the data before going ahead.
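For reference, the general shape of such an extract is sketched below; the endpoint, query parameters, and target path are placeholders, and the provided notebooks contain the real ones.

# Sketch of an API extract: pull JSON from the codes API and land it in the bronze layer.
import requests

response = requests.get("https://<icd-code-api-endpoint>", params={"terms": "A00"})  # placeholder endpoint and parameters
response.raise_for_status()
records = response.json()  # assumed to be a list of dicts, one per code

# Convert the records to a Spark DataFrame and write them to the bronze layer as Parquet.
df = spark.createDataFrame(records)
df.write.mode("overwrite").parquet("/mnt/bronze/icd_codes/")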
Steps to move Claims and CPT data from landing to bronze layer:
As part of the background activity, we will manually place the claims and CPT
data files into their respective folders within the landing folder.
Claims data:
Now, using the logic present in the code notebooks, we will move this data to the bronze layer in Parquet format.
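A minimal sketch of this landing-to-bronze step is shown below; the mount points, folder names, and CSV options are placeholders based on the folder structure described above.

# Sketch: read the claims flat files from the landing folder and write them to bronze as Parquet.
claims_df = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("/mnt/landing/claims/")  # placeholder landing path
)

claims_df.write.mode("overwrite").parquet("/mnt/bronze/claims/")  # placeholder bronze path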
3. Silver:
=> Claims
=> CPT codes
Moving data from the Bronze layer to Silver:
=> Standardize the data format to align with the Common Data Model for consistency and compatibility across systems.
=> Implement SCD Type 2 logic to maintain historical changes in the data, enabling tracking of changes over time (a sketch of this pattern appears below).
Delta Table:
=> Store the transformed data in Delta tables to support ACID transactions, incremental loads, and versioned data.
This transformation ensures that data from the Bronze layer is cleansed,
standardized, and enriched before moving to the Silver layer for further
analytics or downstream processing.
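The full logic lives in the Silver notebooks; the sketch below only illustrates the SCD Type 2 pattern on a hypothetical patients table, so the table name, key column, hash column, and flag columns are assumptions rather than the project's actual schema.

# Sketch of the SCD Type 2 pattern used when moving data from bronze to silver.
# Table, key, hash, and flag column names are illustrative placeholders only.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Read the bronze data and compute a hash used to detect changed records.
src = (spark.read.parquet("/mnt/bronze/hosa/patients/")
            .withColumn("row_hash", F.sha2(F.concat_ws("|", "first_name", "last_name", "address"), 256)))

tgt = DeltaTable.forName(spark, "tt_hc_adb_ws.silver.patients")
current = tgt.toDF().filter("is_current = true")

# Records that are new, or whose attributes changed compared to the current silver version.
changed = (src.alias("s")
              .join(current.alias("t"), F.col("s.patient_key") == F.col("t.patient_key"), "left")
              .filter(F.col("t.patient_key").isNull() | (F.col("t.row_hash") != F.col("s.row_hash")))
              .select("s.*"))

# Step 1: close out the current version of the records that changed.
(tgt.alias("t")
    .merge(changed.alias("s"), "t.patient_key = s.patient_key AND t.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false", "end_date": "current_timestamp()"})
    .execute())

# Step 2: append the changed/new records as the new current version.
(changed.withColumn("is_current", F.lit(True))
        .withColumn("start_date", F.current_timestamp())
        .withColumn("end_date", F.lit(None).cast("timestamp"))
        .write.format("delta").mode("append").saveAsTable("tt_hc_adb_ws.silver.patients"))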
We have documented the logic for this transformation in the notebook located in the Silver folder within our Databricks workspace. We have shared the code notebook on GitHub, to which you can refer.
Note: For both claims and CPT codes, we have used the same notebook. This
notebook contains the code for moving data from the landing layer to the
bronze layer, followed by the necessary transformations to clean and move the
data from bronze to the silver layer.
We have created notebooks to implement the required logic, and all these notebooks are located in the "Gold" folder. We have shared the code notebooks on GitHub, to which you can refer.
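As an illustration of what a gold notebook typically does (the table and column names below are hypothetical, not the project's actual schema), a gold step usually selects the current, cleansed silver records into a dimension table:

# Illustrative only: build a gold dimension from the current silver records.
# The catalog name tt_hc_adb_ws comes from this project; the schema, table, and column names are placeholders.
spark.sql("""
    CREATE OR REPLACE TABLE tt_hc_adb_ws.gold.dim_patient AS
    SELECT patient_key, first_name, last_name, address
    FROM tt_hc_adb_ws.silver.patients
    WHERE is_current = true
""")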
Pipeline to Move data from Silver to Gold layer:
We will create a pipeline “pl_silver_to_gold” to move the data from the Silver layer to
the Gold layer.
First, create the pipeline “pl_silver_to_gold” and add a Databricks Notebook activity, as shown below.
Ex: Here, we have added a notebook activity and named it slv_transaction. Using the
Browse Path option, we selected the Transactions notebook located in the Silver
folder. Refer below screenshot for more information.
Similarly, we will add the remaining notebooks as demonstrated by Sumit Sir in the
video. And our complete pipeline will be as shown below.
Once this pipeline is created, publish these changes.
Final Pipeline:
=> The first execution pipeline (exec_pl_emr_src_to_landing) will move data from
the SQL database to the landing folder.
=> The second execution pipeline (exec_pl_silver_to_gold) will move the cleaned
and transformed data from the Silver layer to the Gold layer.
Before running these pipelines, make sure to create the silver and gold schemas in the catalog tt_hc_adb_ws using the below commands.
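If the exact commands are not handy, a minimal equivalent, run from a Databricks notebook attached to the Unity Catalog-enabled workspace, is:

# Create the silver and gold schemas in the tt_hc_adb_ws catalog before running the pipelines.
spark.sql("CREATE SCHEMA IF NOT EXISTS tt_hc_adb_ws.silver")
spark.sql("CREATE SCHEMA IF NOT EXISTS tt_hc_adb_ws.gold")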
Upon successful execution of the pipeline, the results will be displayed as shown
below.
To link a project on GitHub, follow the below steps.
In Azure Databricks: Get the username and token for your Databricks workspace.
In GitHub:
Now go to Databricks => Settings => Linked accounts => Git provider (GitHub), then select personal access token => then provide the username and the token for GitHub.
Go to Databricks => Workspace => Repos => select the option Create Git folder.
Now you can clone existing folders and files into this git folder.
Right click on file/folder => clone => then using the browse option select this git
folder.
To use Git in Databricks, click on the three dots next to the Git-linked folder and
select the Git option. This allows you to pull and push changes, as well as create
branches, directly from the Databricks interface.