
To handle multiple tables with different srcdb and tgtdb names, we can modify the script to take a JSON or CSV file as input, where each row (or JSON entry) represents a table with its corresponding srcdb, tgtdb, and partition details.

---

Approach

1. Use a JSON or CSV file as input, containing multiple srcdb, tgtdb, and table partition details.

2. Modify the script to iterate over multiple tables.

3. Pass the file as a parameter to the Databricks notebook.

---

Updated Implementation

1. Input File Format (JSON)

We use a JSON file (tables.json) with multiple database-table mappings:

[
  {
    "srcdb": "source_db1",
    "tgtdb": "target_db1",
    "csv_file": "dbfs:/mnt/input/table1.csv"
  },
  {
    "srcdb": "source_db2",
    "tgtdb": "target_db2",
    "csv_file": "dbfs:/mnt/input/table2.csv"
  }
]

Each entry represents one table's operation with a corresponding CSV file.
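
The per-table CSV files themselves are not shown here; the script below assumes each one has at least a partition column (in the form /<table>/<column>=<value>) and a file_size column (e.g., 500mb, 2gb). A minimal sketch of that layout, with illustrative values only:

partition,file_size
/hud_assc/call_dt=23/10/2010,500mb
/hud_assc/call_dt=24/10/2010,2gb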

---

2. Databricks Notebook Script


import pandas as pd
import json
import re
import uuid

# Constants
SIZE_MULTIPLIER = {"kb": 1 / (1024 * 1024), "mb": 1 / 1024, "gb": 1}
CHUNK_SIZE_GB = 1  # 1 GB chunk size

# Widgets for Parameters
dbutils.widgets.text("json_file", "dbfs:/mnt/input/tables.json")
dbutils.widgets.text("requestid", "request123")

# Read the Parameters
json_file = dbutils.widgets.get("json_file")
requestid = dbutils.widgets.get("requestid")

# Load JSON Configuration (map the dbfs:/ URI to its /dbfs/ local mount path)
with open(json_file.replace("dbfs:", "/dbfs"), "r") as f:
    tables_config = json.load(f)

def parse_size(size_str):
    """Convert a file size string (e.g., '500mb', '2gb') to GB."""
    match = re.match(r"([\d.]+)(kb|mb|gb)", size_str.lower())
    if match:
        size, unit = match.groups()
        return float(size) * SIZE_MULTIPLIER[unit]
    return 0  # Default if no match

def extract_partition_info(partition_str):
    """Extract table name, cluster-by column, and value from a partition string."""
    match = re.search(r"\/([^\/]+)\/([^=]+)=([\d\/]+)", partition_str)
    if match:
        table_name, clusterby_column, clusterby_value = match.groups()
        return table_name, clusterby_column, clusterby_value
    return None, None, None

def generate_insert_queries(csv_file, srcdb, tgtdb, requestid):
    """Generate SQL insert queries for one table's CSV file."""
    df = pd.read_csv(csv_file.replace("dbfs:", "/dbfs"))

    # Convert file_size to GB
    df["file_size_gb"] = df["file_size"].apply(parse_size)

    queries = []
    unique_request_id = str(uuid.uuid4())[:8]  # Unique base request ID

    for index, row in df.iterrows():
        table_name, clusterby_column, clusterby_value = extract_partition_info(row["partition"])

        if not table_name or not clusterby_column or not clusterby_value:
            continue  # Skip invalid rows

        total_size = row["file_size_gb"]
        num_chunks = max(1, int(total_size // CHUNK_SIZE_GB) + (1 if total_size % CHUNK_SIZE_GB > 0 else 0))

        for chunk in range(num_chunks):
            chunk_start = clusterby_value  # Assuming single-value partitioning
            chunk_end = clusterby_value    # Adjust logic if range-based
            request_id_chunk = f"{requestid}_{unique_request_id}_{chunk + 1}"

            query = f"""
            INSERT INTO {tgtdb}.{table_name} (srcdb, tgtdb, table, requestid, condition)
            VALUES ('{srcdb}', '{tgtdb}', '{table_name}', '{request_id_chunk}',
                    '{clusterby_column} >= "{chunk_start}" AND {clusterby_column} <= "{chunk_end}"');
            """
            queries.append(query.strip())

    return queries

# Process Multiple Tables
all_queries = []
for entry in tables_config:
    srcdb = entry["srcdb"]
    tgtdb = entry["tgtdb"]
    csv_file = entry["csv_file"]

    queries = generate_insert_queries(csv_file, srcdb, tgtdb, requestid)
    all_queries.extend(queries)

# Write SQL queries to DBFS
output_path = "dbfs:/mnt/output/insert_queries.sql"
with open(output_path.replace("dbfs:", "/dbfs"), "w") as f:
    f.write("\n".join(all_queries))

print(f"SQL queries written to {output_path}")

---

3. Run the Notebook

Option 1: Manually Set Widgets in Databricks

Go to Widgets > Edit.

Set:

json_file → dbfs:/mnt/input/tables.json
requestid → request123

Run the notebook.

Option 2: Pass Parameters When Running as a Job

You can trigger this notebook via Databricks Jobs:

databricks jobs run-now --job-id <your-job-id> --notebook-params '{
  "json_file": "dbfs:/mnt/input/tables.json",
  "requestid": "request123"
}'
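
If you prefer to orchestrate it from another notebook rather than the CLI, dbutils.notebook.run accepts the same parameters. A minimal sketch, where the notebook path is an assumption you would replace with wherever this notebook actually lives:

# Run the generator notebook with a 1-hour timeout, passing the same widget values
dbutils.notebook.run(
    "/Repos/migration/generate_insert_queries",  # assumed notebook path
    3600,
    {"json_file": "dbfs:/mnt/input/tables.json", "requestid": "request123"}
)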

---

4. Expected Output

The SQL file (insert_queries.sql) will contain queries like:

INSERT INTO target_db1.hud_assc (srcdb, tgtdb, table, requestid, condition)
VALUES ('source_db1', 'target_db1', 'hud_assc', 'request123_ab12cd34_1',
        'call_dt >= "23/10/2010" AND call_dt <= "23/10/2010"');

INSERT INTO target_db2.hud_sale (srcdb, tgtdb, table, requestid, condition)
VALUES ('source_db2', 'target_db2', 'hud_sale', 'request123_ab12cd34_1',
        'call_dt >= "01/01/2022" AND call_dt <= "01/01/2022"');

INSERT INTO target_db2.hud_sale (srcdb, tgtdb, table, requestid, condition)
VALUES ('source_db2', 'target_db2', 'hud_sale', 'request123_ab12cd34_2',
        'call_dt >= "01/01/2022" AND call_dt <= "01/01/2022"');

---

5. Download the SQL Output (Optional)

To download the SQL file from Databricks:


databricks fs cp dbfs:/mnt/output/insert_queries.sql .

Or use the DBFS File Browser in the Databricks UI.
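
You can also sanity-check the file without leaving the workspace; dbutils.fs.head returns the first bytes of a DBFS file (the 1024 * 100 preview size below is just an arbitrary choice):

# Preview roughly the first 100 KB of the generated SQL file
print(dbutils.fs.head("dbfs:/mnt/output/insert_queries.sql", 1024 * 100))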

---

Final Thoughts

Scalability: You can process multiple tables with different databases dynamically.

Flexibility: Just update the tables.json file for new tables.

Databricks Integration: Uses widgets for easy job execution.

Let me know if you need further refinements!
