
Azure Databricks

Task: ETL Pipeline Using the Azure Databricks Service

Step 1: Create an Azure Databricks Workspace
1. Log in to the Azure Portal:
○ Go to Azure Portal.

2. Create a New Resource:


○ In the left sidebar, click Create a resource.
○ Search for Azure Databricks and select it.
○ Click Create.

3. Configure Databricks Workspace Basics:


○ Subscription: Choose your Azure subscription.
○ Resource Group: Create a new resource group or select an existing one.
○ Workspace Name: Enter a unique name for the Databricks workspace.
○ Region: Choose a region close to your data and resources.
○ Pricing Tier: Select a pricing tier (Standard or Premium).
4. Review and Create:
○ Click Review + create to validate the workspace settings.
○ Click Create to provision the workspace. This may take a few minutes.

5. Access the Databricks Workspace:


○ Once the workspace is created, go to the resource and click Launch Workspace. This opens the Databricks web interface.
Step 2: Set Up a Databricks Cluster
1. Navigate to Clusters in Databricks:
○ In the Databricks workspace, go to the Compute tab on the left and click Create Compute.

2. Create a New Cluster:


○ Cluster Name: Enter a name for the cluster.
○ Cluster Mode: Select Standard for a single-user cluster or High
Concurrency for shared clusters.
○ Databricks Runtime Version: Choose a runtime version (e.g., Databricks
Runtime 10.4 LTS).
○ Autoscaling: Enable autoscaling if you want the cluster to scale
automatically.
○ Node Type and Worker Count: Choose the VM type and configure the
number of workers.
○ Termination: Set an auto-termination time to shut down the cluster after a
period of inactivity (optional).

3. Start the Cluster:


○ Click Create Compute to start the cluster. The newly created cluster will be listed under Compute.

Step 3: Create a Notebook in Databricks


1. Go to the Workspace Tab:
○ In Databricks, click on the Workspace tab.
2. Create a New Notebook:
○ Click Create > Notebook.
○ Name the notebook, choose Python as the language, and select the
cluster you created.

Step 4: Add ETL Transformation Code in the Notebook


1. Configure the Azure Blob Storage in Databricks

First, ensure that your Databricks environment has access to Azure Blob
Storage. You’ll need the following:

● Storage account name


● Access key (or use Azure Active Directory for more secure access)

You can mount the Azure Blob Storage to Databricks as follows:


dbutils.fs.mount(
    source=f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net",
    mount_point="/mnt/blob_storage",
    extra_configs={
        f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net": storage_account_access_key
    }
)

dbutils.fs.mount(
    # Replace <container-name> with your Blob Storage container
    source="wasbs://<container-name>@samdatabricks.blob.core.windows.net",
    mount_point="/mnt/trans",
    extra_configs={
        "fs.azure.account.key.samdatabricks.blob.core.windows.net": "5kxx17N7TGNiXAkD7qDXQCGiEmumpM7yC+tS6/5Er7SK7Nio1fGHFUfP+goMQ4cy+dcnFlGu6cAN+AStqpNqKA=="
    }
)

2. Check for correct mounting
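The mount can be verified from the notebook before reading any data. A minimal sketch, assuming the /mnt/trans mount point created above:

# List all current mounts and confirm /mnt/trans is among them
display(dbutils.fs.mounts())

# List the files visible under the mount point
display(dbutils.fs.ls("/mnt/trans"))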

3. Extract Data from Azure Blob Storage

df = spark.read.format("csv") \
    .option("header", "true") \
    .load("dbfs:/mnt/trans/covid_vaccine_statewise.csv")
4. Cast the Column Types and Display the Data
from pyspark.sql.functions import col, lit
from pyspark.sql.types import DateType, DoubleType, IntegerType, StringType

df = df.withColumn("Updated On", col("Updated On").cast(DateType()))
df = df.withColumn("Total Doses Administered", col("Total Doses Administered").cast(DoubleType()))
df = df.withColumn("Sessions", col("Sessions").cast(IntegerType()))

display(df)
5. Perform the Main Transformation Required
from pyspark.sql.functions import max

# Maximum Total Doses Administered per state
transformed_df = df.groupBy("State") \
    .agg(max("Total Doses Administered").alias("Total Doses Administered"))

6. Display the Transformed Result:

display(transformed_df)
7. Load Data Back to Azure Blob Storage

dbutils.fs.mount(
    # Replace <container-name> with your output container
    source="wasbs://<container-name>@samdatabricks.blob.core.windows.net",
    mount_point="/mnt/transaction2",
    extra_configs={
        "fs.azure.account.key.samdatabricks.blob.core.windows.net": "5kxx17N7TGNiXAkD7qDXQCGiEmumpM7yC+tS6/5Er7SK7Nio1fGHFUfP+goMQ4cy+dcnFlGu6cAN+AStqpNqKA=="
    }
)

def df2file(df, destPath, df_name, extension='.csv', header='true'):
    '''
    destPath should look like
    'abfss://{containername}@{storagename}.dfs.core.windows.net/path/filename'.
    The extension is passed as a parameter so you can write the file with any
    kind of extension. PySpark first writes the output as a partitioned CSV
    directory; this function removes those extra artifacts and keeps just
    your file.
    '''
    destPath += 'result_' + df_name

    # Write the Databricks DataFrame as a single CSV part file
    df.coalesce(1)\
        .write\
        .format('csv')\
        .mode('overwrite')\
        .option('header', header)\
        .save(destPath)

    # Find the CSV part file inside the output directory
    filtered_files = [file.path for file in dbutils.fs.ls(destPath) if file.path.endswith(".csv")]

    # With coalesce(1) only one file is generated, at some cost to performance
    dbutils.fs.cp(filtered_files[0], destPath + extension, recurse=False)
    dbutils.fs.rm(destPath, recurse=True)

output_path = "/mnt/transaction2/"

df2file(transformed_df, output_path, 'df_statewise_doses_administrated')

8. Check for the Data Transfer in the Blob Storage
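The transfer can also be confirmed from the notebook. A minimal sketch, assuming the /mnt/transaction2 mount point and the output file name produced by df2file above:

# List the output container and confirm the result file is present
display(dbutils.fs.ls("/mnt/transaction2/"))

# Optionally read the written file back and preview it
check_df = spark.read.format("csv") \
    .option("header", "true") \
    .load("/mnt/transaction2/result_df_statewise_doses_administrated.csv")
display(check_df)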

9. Store the Data Directly in the Azure SQL Database Using the JDBC Connector
# Step 1: Define JDBC connection details
jdbcHostname = "sam-sql-server123.database.windows.net"
jdbcPort = 1433
jdbcDatabase = "databricksdatabase"
jdbcUrl = f"jdbc:sqlserver://{jdbcHostname}:{jdbcPort};database={jdbcDatabase}"

connectionProperties = {
    "user": "samadhan",
    "password": "@Sam1Sam",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# Step 2: Write processed data to Azure SQL Database
transformed_df.write.jdbc(
    url=jdbcUrl,
    table="statewise_doses_administrated",
    mode="overwrite",
    properties=connectionProperties
)

10. Check for the Data Transfer in the SQL Database
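The load can also be verified from the notebook by reading the table back over the same JDBC connection. A minimal sketch reusing the jdbcUrl and connectionProperties defined above:

# Read the target table back from Azure SQL Database and preview it
check_sql_df = spark.read.jdbc(
    url=jdbcUrl,
    table="statewise_doses_administrated",
    properties=connectionProperties
)
display(check_sql_df)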


Step 5: Configure Linked Services in Azure Data Factory

1. Azure Blob Storage Linked Service:


○ Set up a linked service in ADF for Azure Blob Storage to access your input
and output data containers.
2. Azure Databricks Linked Service:
○ Copy the access token from Databricks: in the Databricks workspace, click Profile -> Settings -> Developer -> Access tokens -> Manage -> Generate new token, then copy the token.
○ Set up a linked service in ADF for Azure Databricks using your workspace
URL and authentication token.
3. Azure SQL Database Linked Service:
○ Set up a linked service in ADF for Azure SQL Database:
■ In ADF, go to Manage > Linked services > New.
■ Select Azure SQL Database as the service type.
■ Configure the SQL Server name, database name, authentication
type, and credentials (username and password).
■ Test the connection and save.
Step 6: Design the Pipeline in ADF

1. Create a New Pipeline

● In ADF, go to the Author tab and create a new pipeline.

2. Add Copy Data Activity to Extract Data from Blob Storage

1. Drag a Copy Data Activity onto the canvas.


2. Configure the Source as Azure Blob Storage:
○ Set the Blob Storage linked service and specify the path and format of the
source data (e.g., CSV).
3. Configure the Sink to Databricks DBFS:
○ Set the Databricks DBFS path or a staging area for storing intermediate
data in DBFS.

3. Add a Databricks Notebook Activity for Transformation

1. Add a Databricks Notebook Activity to the pipeline.


2. Configure the Databricks Linked Service.
3. Specify the Notebook Path:
○ Select the Databricks notebook you created for data transformation; an optional parameter-passing sketch follows below.
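The Notebook Activity can also pass values to the notebook at run time through its Base parameters setting, which the notebook reads as widgets. A minimal, hypothetical sketch; the parameter name input_path and its default value are illustrative assumptions, not part of the original notebook:

# Hypothetical: declare a widget so an ADF Base parameter named "input_path"
# can override the default source path at run time (the name is illustrative)
dbutils.widgets.text("input_path", "dbfs:/mnt/trans/covid_vaccine_statewise.csv")
input_path = dbutils.widgets.get("input_path")

df = spark.read.format("csv").option("header", "true").load(input_path)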

4. Add Copy Data Activity to Load Transformed Data Back to Blob Storage

1. Drag Another Copy Data Activity onto the pipeline.


2. Configure the Source as the Databricks DBFS output location (from the
notebook).
3. Configure the Sink as Azure Blob Storage:
○ Choose the Blob Storage linked service and set the path to store the
transformed data.

5. Add Copy Data Activity to Load Transformed Data into Azure SQL Database

1. Add Another Copy Data Activity to load data into SQL Database.
2. Configure the Source as Databricks DBFS or Blob Storage:
○ If Databricks, use the DBFS path for the transformed data.
○ If Blob Storage, use the output container path for transformed data.
3. Configure the Sink as Azure SQL Database:
○ Choose the Azure SQL Database linked service.
○ Specify the table name in SQL Database where the transformed data
should be stored.

Step 7: Add a Trigger and Monitor the Pipeline

1. Add a Trigger to schedule the pipeline (e.g., daily or hourly).


2. Publish the Pipeline.
3. Monitor the pipeline runs in the Monitor tab to check the status, troubleshoot
issues, and verify that data is stored in both Blob Storage and Azure SQL
Database.
