
Azure Databricks

Task: ETL Pipeline Using the Azure Databricks Service

Step 1: Create an Azure Databricks Workspace
1. Log in to the Azure Portal:
○ Go to Azure Portal.

2. Create a New Resource:


○ In the left sidebar, click Create a resource.
○ Search for Azure Databricks and select it.
○ Click Create.

3. Configure Databricks Workspace Basics:


○ Subscription: Choose your Azure subscription.
○ Resource Group: Create a new resource group or select an existing one.
○ Workspace Name: Enter a unique name for the Databricks workspace.
○ Region: Choose a region close to your data and resources.
○ Pricing Tier: Select a pricing tier (Standard or Premium).
4. Review and Create:
○ Click Review + create to validate the workspace settings.
○ Click Create to provision the workspace. This may take a few minutes.

5. Access the Databricks Workspace:


○ Once the workspace is created, go to the resource and click Launch Workspace. This opens the Databricks web interface.
Step 2: Set Up a Databricks Cluster
1. Navigate to Clusters in Databricks:
○ In the Databricks workspace, go to the Compute tab on the left and click Create Compute.

2. Create a New Cluster:


○ Cluster Name: Enter a name for the cluster.
○ Cluster Mode: Select Standard for a single-user cluster or High
Concurrency for shared clusters.
○ Databricks Runtime Version: Choose a runtime version (e.g., Databricks
Runtime 10.4 LTS).
○ Autoscaling: Enable autoscaling if you want the cluster to scale
automatically.
○ Node Type and Worker Count: Choose the VM type and configure the
number of workers.
○ Termination: Set an auto-termination time to shut down the cluster after a
period of inactivity (optional).

3. Start the Cluster:


○ Click Create Compute to start the cluster. The newly created cluster will be listed under Compute.

Step 3: Create a Notebook in Databricks


1. Go to the Workspace Tab:
○ In Databricks, click on the Workspace tab.
2. Create a New Notebook:
○ Click Create > Notebook.
○ Name the notebook, choose Python as the language, and select the
cluster you created.

Step 4: Add ETL Transformation Code in the Notebook


1. Configure the Azure Blob Storage in Databricks

First, ensure that your Databricks environment has access to Azure Blob
Storage. You’ll need the following:

● Storage account name


● Access key (or use Azure Active Directory for more secure access)

You can mount the Azure Blob Storage to Databricks as follows:


dbutils.fs.mount(
    source=f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net",
    mount_point="/mnt/blob_storage",
    extra_configs={
        f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net": storage_account_access_key
    }
)

dbutils.fs.mount(
    # Replace <container-name> with your Blob Storage container
    source="wasbs://<container-name>@samdatabricks.blob.core.windows.net",
    mount_point="/mnt/trans",
    extra_configs={
        "fs.azure.account.key.samdatabricks.blob.core.windows.net": "5kxx17N7TGNiXAkD7qDXQCGiEmumpM7yC+tS6/5Er7SK7Nio1fGHFUfP+goMQ4cy+dcnFlGu6cAN+AStqpNqKA=="
    }
)

2. Check for correct mounting
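The mount can be verified from the notebook before reading any data. A minimal sketch, assuming the /mnt/trans mount point created above:

# List all current mounts and confirm /mnt/trans is among them
display(dbutils.fs.mounts())

# List the files visible under the mount point
display(dbutils.fs.ls("/mnt/trans"))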

3. Extract Data from Azure Blob Storage

df = spark.read.format("csv") \
    .option("header", "true") \
    .load("dbfs:/mnt/trans/covid_vaccine_statewise.csv")
4. Cast the Column Types and Display the Data
from pyspark.sql.functions import col, lit
from pyspark.sql.types import DateType, DoubleType, IntegerType, StringType

df = df.withColumn("Updated On", col("Updated On").cast(DateType()))
df = df.withColumn("Total Doses Administered", col("Total Doses Administered").cast(DoubleType()))
df = df.withColumn("Sessions", col("Sessions").cast(IntegerType()))

display(df)
5. Perform the Main Transformation Required
from pyspark.sql.functions import max

# Maximum Total Doses Administered per state
transformed_df = df.groupBy("State") \
    .agg(max("Total Doses Administered").alias("Total Doses Administered"))

6. Display the Transformed Result:

display(transformed_df)
7. Load Data Back to Azure Blob Storage

dbutils.fs.mount(
    # Replace <container-name> with your output container
    source="wasbs://<container-name>@samdatabricks.blob.core.windows.net",
    mount_point="/mnt/transaction2",
    extra_configs={
        "fs.azure.account.key.samdatabricks.blob.core.windows.net": "5kxx17N7TGNiXAkD7qDXQCGiEmumpM7yC+tS6/5Er7SK7Nio1fGHFUfP+goMQ4cy+dcnFlGu6cAN+AStqpNqKA=="
    }
)

def df2file(df, destPath, df_name, extension='.csv', header='true'):
    '''
    destPath should look like
    'abfss://{containername}@{storagename}.dfs.core.windows.net/path/filename'.
    The extension is passed as a parameter so you can write the file with any
    kind of extension. PySpark first writes the output as a partitioned CSV
    directory; this function removes those extra artifacts and keeps just
    your file.
    '''
    destPath += 'result_' + df_name

    # Write the Databricks DataFrame as a single CSV part file
    df.coalesce(1)\
        .write\
        .format('csv')\
        .mode('overwrite')\
        .option('header', header)\
        .save(destPath)

    # Find the CSV part file inside the output directory
    filtered_files = [file.path for file in dbutils.fs.ls(destPath) if file.path.endswith(".csv")]

    # With coalesce(1) only one file is generated, at some cost to performance
    dbutils.fs.cp(filtered_files[0], destPath + extension, recurse=False)
    dbutils.fs.rm(destPath, recurse=True)

output_path = "/mnt/transaction2/"

df2file(transformed_df, output_path, 'df_statewise_doses_administrated')

8. Check for the Data Transfer in the Blob Storage
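The transfer can also be confirmed from the notebook. A minimal sketch, assuming the /mnt/transaction2 mount point and the output file name produced by df2file above:

# List the output container and confirm the result file is present
display(dbutils.fs.ls("/mnt/transaction2/"))

# Optionally read the written file back and preview it
check_df = spark.read.format("csv") \
    .option("header", "true") \
    .load("/mnt/transaction2/result_df_statewise_doses_administrated.csv")
display(check_df)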

9. Store the Data Directly in the Azure SQL Database Using the JDBC Connector
# Step 1: Define JDBC connection details
jdbcHostname = "sam-sql-server123.database.windows.net"
jdbcPort = 1433
jdbcDatabase = "databricksdatabase"
jdbcUrl = f"jdbc:sqlserver://{jdbcHostname}:{jdbcPort};database={jdbcDatabase}"

connectionProperties = {
    "user": "samadhan",
    "password": "@Sam1Sam",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# Step 2: Write processed data to Azure SQL Database
transformed_df.write.jdbc(
    url=jdbcUrl,
    table="statewise_doses_administrated",
    mode="overwrite",
    properties=connectionProperties
)

10. Check for the Data Transfer in the SQL Database
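The load can also be verified from the notebook by reading the table back over the same JDBC connection. A minimal sketch reusing the jdbcUrl and connectionProperties defined above:

# Read the target table back from Azure SQL Database and preview it
check_sql_df = spark.read.jdbc(
    url=jdbcUrl,
    table="statewise_doses_administrated",
    properties=connectionProperties
)
display(check_sql_df)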


Step 5: Configure Linked Services in Azure Data Factory

1. Azure Blob Storage Linked Service:


○ Set up a linked service in ADF for Azure Blob Storage to access your input
and output data containers.
2. Azure Databricks Linked Service:
○ Copy the access token from Databricks: in the Databricks workspace, click Profile -> Settings -> Developer -> Access tokens -> Manage -> Generate new token, then copy the token.
○ Set up a linked service in ADF for Azure Databricks using your workspace
URL and authentication token.
3. Azure SQL Database Linked Service:
○ Set up a linked service in ADF for Azure SQL Database:
■ In ADF, go to Manage > Linked services > New.
■ Select Azure SQL Database as the service type.
■ Configure the SQL Server name, database name, authentication
type, and credentials (username and password).
■ Test the connection and save.
Step 6: Design the Pipeline in ADF

1. Create a New Pipeline

● In ADF, go to the Author tab and create a new pipeline.

2. Add Copy Data Activity to Extract Data from Blob Storage

1. Drag a Copy Data Activity onto the canvas.


2. Configure the Source as Azure Blob Storage:
○ Set the Blob Storage linked service and specify the path and format of the
source data (e.g., CSV).
3. Configure the Sink to Databricks DBFS:
○ Set the Databricks DBFS path or a staging area for storing intermediate
data in DBFS.

3. Add a Databricks Notebook Activity for Transformation

1. Add a Databricks Notebook Activity to the pipeline.


2. Configure the Databricks Linked Service.
3. Specify the Notebook Path:
○ Select the Databricks notebook you created for data transformation; an optional parameter-passing sketch follows below.
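The Notebook Activity can also pass values to the notebook at run time through its Base parameters setting, which the notebook reads as widgets. A minimal, hypothetical sketch; the parameter name input_path and its default value are illustrative assumptions, not part of the original notebook:

# Hypothetical: declare a widget so an ADF Base parameter named "input_path"
# can override the default source path at run time (the name is illustrative)
dbutils.widgets.text("input_path", "dbfs:/mnt/trans/covid_vaccine_statewise.csv")
input_path = dbutils.widgets.get("input_path")

df = spark.read.format("csv").option("header", "true").load(input_path)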

4. Add Copy Data Activity to Load Transformed Data Back to Blob Storage

1. Drag Another Copy Data Activity onto the pipeline.


2. Configure the Source as the Databricks DBFS output location (from the
notebook).
3. Configure the Sink as Azure Blob Storage:
○ Choose the Blob Storage linked service and set the path to store the
transformed data.

5. Add Copy Data Activity to Load Transformed Data into Azure SQL Database

1. Add Another Copy Data Activity to load data into SQL Database.
2. Configure the Source as Databricks DBFS or Blob Storage:
○ If Databricks, use the DBFS path for the transformed data.
○ If Blob Storage, use the output container path for transformed data.
3. Configure the Sink as Azure SQL Database:
○ Choose the Azure SQL Database linked service.
○ Specify the table name in SQL Database where the transformed data
should be stored.

Step 7: Add a Trigger and Monitor the Pipeline

1. Add a Trigger to schedule the pipeline (e.g., daily or hourly).


2. Publish the Pipeline.
3. Monitor the pipeline runs in the Monitor tab to check the status, troubleshoot
issues, and verify that data is stored in both Blob Storage and Azure SQL
Database.
