
To handle multiple tables with different srcdb and tgtdb names, we can modify the script to take a JSON or CSV file as input, where each row (or JSON entry) represents a table with its corresponding srcdb, tgtdb, and partition details.

---

Approach

1. Use a JSON or CSV file as input, containing multiple srcdb, tgtdb, and table partition
details.

2. Modify the script to iterate over multiple tables.

3. Pass the file as a parameter to the Databricks notebook.

---

Updated Implementation

1. Input File Format (JSON)

We use a JSON file (tables.json) with multiple database-table mappings:

[
  {
    "srcdb": "source_db1",
    "tgtdb": "target_db1",
    "csv_file": "dbfs:/mnt/input/table1.csv"
  },
  {
    "srcdb": "source_db2",
    "tgtdb": "target_db2",
    "csv_file": "dbfs:/mnt/input/table2.csv"
  }
]

Each entry represents one table's operation with a corresponding CSV file.
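
The script below expects each CSV to contain at least a partition column (a path of the form /<table>/<column>=<value>) and a file_size column (e.g. 500mb, 2gb), since those are the only fields it reads. A hypothetical table1.csv, consistent with the expected output further down, might look like:

partition,file_size
/hud_assc/call_dt=23/10/2010,500mb
/hud_assc/call_dt=24/10/2010,2gb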

---

2. Databricks Notebook Script


import pandas as pd
import json
import re
import uuid

# Constants
SIZE_MULTIPLIER = {"kb": 1 / (1024 * 1024), "mb": 1 / 1024, "gb": 1}
CHUNK_SIZE_GB = 1 # 1GB chunk size

# Widgets for Parameters


dbutils.widgets.text("json_file", "dbfs:/mnt/input/tables.json")
dbutils.widgets.text("requestid", "request123")

# Read the Parameters


json_file = dbutils.widgets.get("json_file")
requestid = dbutils.widgets.get("requestid")

# Load JSON Configuration


# Map the dbfs:/ URI to the local /dbfs mount so the driver can open it
with open(json_file.replace("dbfs:", "/dbfs"), "r") as f:
    tables_config = json.load(f)

def parse_size(size_str):
    """Convert a file size string (e.g., '500mb', '2gb') to GB."""
    match = re.match(r"([\d.]+)(kb|mb|gb)", size_str.lower())
    if match:
        size, unit = match.groups()
        return float(size) * SIZE_MULTIPLIER[unit]
    return 0  # Default if no match

def extract_partition_info(partition_str):
    """Extract table name, cluster-by column, and value from a partition string."""
    match = re.search(r"/([^/]+)/([^=]+)=([\d/]+)", partition_str)
    if match:
        table_name, clusterby_column, clusterby_value = match.groups()
        return table_name, clusterby_column, clusterby_value
    return None, None, None

def generate_insert_queries(csv_file, srcdb, tgtdb, requestid):
    """Generate SQL insert queries for one table and return them as a list."""
    # pandas cannot read dbfs:/ URIs directly, so map to the /dbfs mount
    df = pd.read_csv(csv_file.replace("dbfs:", "/dbfs"))

    # Convert file_size to GB
    df["file_size_gb"] = df["file_size"].apply(parse_size)

    queries = []
    unique_request_id = str(uuid.uuid4())[:8]  # Unique base request ID

    for index, row in df.iterrows():
        table_name, clusterby_column, clusterby_value = extract_partition_info(row["partition"])

        if not table_name or not clusterby_column or not clusterby_value:
            continue  # Skip invalid rows

        total_size = row["file_size_gb"]
        num_chunks = max(1, int(total_size // CHUNK_SIZE_GB) + (1 if total_size % CHUNK_SIZE_GB > 0 else 0))

        for chunk in range(num_chunks):
            chunk_start = clusterby_value  # Assuming single-value partitioning
            chunk_end = clusterby_value    # Adjust logic if range-based
            request_id_chunk = f"{requestid}_{unique_request_id}_{chunk + 1}"

            query = f"""
INSERT INTO {tgtdb}.{table_name} (srcdb, tgtdb, table, requestid, condition)
VALUES ('{srcdb}', '{tgtdb}', '{table_name}', '{request_id_chunk}',
'{clusterby_column} >= "{chunk_start}" AND {clusterby_column} <= "{chunk_end}"');
"""
            queries.append(query.strip())

    return queries

# Process Multiple Tables


all_queries = []
for entry in tables_config:
    srcdb = entry["srcdb"]
    tgtdb = entry["tgtdb"]
    csv_file = entry["csv_file"]

    queries = generate_insert_queries(csv_file, srcdb, tgtdb, requestid)
    all_queries.extend(queries)

# Write SQL queries to DBFS


output_path = "dbfs:/mnt/output/insert_queries.sql"
with open(output_path.replace("dbfs:", "/dbfs"), "w") as f:
    f.write("\n".join(all_queries))

print(f"SQL queries written to {output_path}")

---

3. Run the Notebook

Option 1: Manually Set Widgets in Databricks


Go to Widgets > Edit

Set:

json_file → dbfs:/mnt/input/tables.json

requestid → request123

Run the notebook.

Option 2: Pass Parameters When Running as a Job

You can trigger this notebook via Databricks Jobs:

databricks jobs run-now --job-id <your-job-id> --notebook-params '{
  "json_file": "dbfs:/mnt/input/tables.json",
  "requestid": "request123"
}'
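
Alternatively, if you prefer to orchestrate from another notebook instead of the CLI, dbutils.notebook.run can pass the same parameters; the notebook path and timeout below are placeholders:

# Call this notebook from another notebook (path and 600-second timeout are illustrative)
result = dbutils.notebook.run(
    "/Workspace/path/to/this_notebook",  # hypothetical notebook path
    600,                                 # timeout in seconds
    {"json_file": "dbfs:/mnt/input/tables.json", "requestid": "request123"},
)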

---

4. Expected Output

The SQL file (insert_queries.sql) will contain queries like:

INSERT INTO target_db1.hud_assc (srcdb, tgtdb, table, requestid, condition)
VALUES ('source_db1', 'target_db1', 'hud_assc', 'request123_ab12cd34_1',
'call_dt >= "23/10/2010" AND call_dt <= "23/10/2010"');

INSERT INTO target_db2.hud_sale (srcdb, tgtdb, table, requestid, condition)
VALUES ('source_db2', 'target_db2', 'hud_sale', 'request123_ab12cd34_1',
'call_dt >= "01/01/2022" AND call_dt <= "01/01/2022"');

INSERT INTO target_db2.hud_sale (srcdb, tgtdb, table, requestid, condition)
VALUES ('source_db2', 'target_db2', 'hud_sale', 'request123_ab12cd34_2',
'call_dt >= "01/01/2022" AND call_dt <= "01/01/2022"');
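
If you also want to execute these statements from the same notebook, a simple sketch (not part of the original flow, and assuming each statement ends with a semicolon as generated above) is to read the file back and run each statement with spark.sql:

# Optional: execute the generated statements one by one
with open("/dbfs/mnt/output/insert_queries.sql", "r") as f:
    sql_text = f.read()

for statement in sql_text.split(";"):
    statement = statement.strip()
    if statement:  # skip empty fragments left by the split
        spark.sql(statement)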

---

5. Download the SQL Output (Optional)

To download the SQL file from Databricks:


databricks fs cp dbfs:/mnt/output/insert_queries.sql .

Or use the DBFS File Browser in the Databricks UI.

---

Final Thoughts

Scalability: You can process multiple tables with different databases dynamically.

Flexibility: Just update the tables.json file for new tables.

Databricks Integration: Uses widgets for easy job execution.

Let me know if you need further refinements!
