
To handle multiple tables with different srcdb and tgtdb names, we can modify the script to take a JSON or CSV file as input, where each row (or JSON entry) represents a table with its corresponding srcdb, tgtdb, and partition details.

---

Approach

1. Use a JSON or CSV file as input, containing multiple srcdb, tgtdb, and table partition
details.

2. Modify the script to iterate over multiple tables.

3. Pass the file as a parameter to the Databricks notebook.

---

Updated Implementation

1. Input File Format (JSON)

We use a JSON file (tables.json) with multiple database-table mappings:

[
  {
    "srcdb": "source_db1",
    "tgtdb": "target_db1",
    "csv_file": "dbfs:/mnt/input/table1.csv"
  },
  {
    "srcdb": "source_db2",
    "tgtdb": "target_db2",
    "csv_file": "dbfs:/mnt/input/table2.csv"
  }
]

Each entry represents one table's operation with a corresponding CSV file.
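
The script below expects each CSV to contain at least a partition column (a path of the form /<table>/<column>=<value>) and a file_size column (e.g. 500mb, 2gb), since those are the only fields it reads. A hypothetical table1.csv, consistent with the expected output further down, might look like:

partition,file_size
/hud_assc/call_dt=23/10/2010,500mb
/hud_assc/call_dt=24/10/2010,2gb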

---

2. Databricks Notebook Script


import pandas as pd
import json
import re
import uuid

# Constants
SIZE_MULTIPLIER = {"kb": 1 / (1024 * 1024), "mb": 1 / 1024, "gb": 1}
CHUNK_SIZE_GB = 1 # 1GB chunk size

# Widgets for Parameters


dbutils.widgets.text("json_file", "dbfs:/mnt/input/tables.json")
dbutils.widgets.text("requestid", "request123")

# Read the Parameters


json_file = dbutils.widgets.get("json_file")
requestid = dbutils.widgets.get("requestid")

# Load JSON Configuration


# Map the dbfs:/ URI to the local /dbfs mount so the driver can open it
with open(json_file.replace("dbfs:", "/dbfs"), "r") as f:
    tables_config = json.load(f)

def parse_size(size_str):
    """Convert a file size string (e.g., '500mb', '2gb') to GB."""
    match = re.match(r"([\d.]+)(kb|mb|gb)", size_str.lower())
    if match:
        size, unit = match.groups()
        return float(size) * SIZE_MULTIPLIER[unit]
    return 0  # Default if no match

def extract_partition_info(partition_str):
    """Extract table name, cluster-by column, and value from a partition string."""
    match = re.search(r"/([^/]+)/([^=]+)=([\d/]+)", partition_str)
    if match:
        table_name, clusterby_column, clusterby_value = match.groups()
        return table_name, clusterby_column, clusterby_value
    return None, None, None

def generate_insert_queries(csv_file, srcdb, tgtdb, requestid):
    """Generate SQL insert queries for one table and return them as a list."""
    # pandas cannot read dbfs:/ URIs directly, so map to the /dbfs mount
    df = pd.read_csv(csv_file.replace("dbfs:", "/dbfs"))

    # Convert file_size to GB
    df["file_size_gb"] = df["file_size"].apply(parse_size)

    queries = []
    unique_request_id = str(uuid.uuid4())[:8]  # Unique base request ID

    for index, row in df.iterrows():
        table_name, clusterby_column, clusterby_value = extract_partition_info(row["partition"])

        if not table_name or not clusterby_column or not clusterby_value:
            continue  # Skip invalid rows

        total_size = row["file_size_gb"]
        num_chunks = max(1, int(total_size // CHUNK_SIZE_GB) + (1 if total_size % CHUNK_SIZE_GB > 0 else 0))

        for chunk in range(num_chunks):
            chunk_start = clusterby_value  # Assuming single-value partitioning
            chunk_end = clusterby_value    # Adjust logic if range-based
            request_id_chunk = f"{requestid}_{unique_request_id}_{chunk + 1}"

            query = f"""
INSERT INTO {tgtdb}.{table_name} (srcdb, tgtdb, table, requestid, condition)
VALUES ('{srcdb}', '{tgtdb}', '{table_name}', '{request_id_chunk}',
'{clusterby_column} >= "{chunk_start}" AND {clusterby_column} <= "{chunk_end}"');
"""
            queries.append(query.strip())

    return queries

# Process Multiple Tables


all_queries = []
for entry in tables_config:
    srcdb = entry["srcdb"]
    tgtdb = entry["tgtdb"]
    csv_file = entry["csv_file"]

    queries = generate_insert_queries(csv_file, srcdb, tgtdb, requestid)
    all_queries.extend(queries)

# Write SQL queries to DBFS


output_path = "dbfs:/mnt/output/insert_queries.sql"
with open(output_path.replace("dbfs:", "/dbfs"), "w") as f:
    f.write("\n".join(all_queries))

print(f"SQL queries written to {output_path}")

---

3. Run the Notebook

Option 1: Manually Set Widgets in Databricks


Go to Widgets > Edit

Set:

json_file → dbfs:/mnt/input/tables.json

requestid → request123

Run the notebook.

Option 2: Pass Parameters When Running as a Job

You can trigger this notebook via Databricks Jobs:

databricks jobs run-now --job-id <your-job-id> --notebook-params '{
  "json_file": "dbfs:/mnt/input/tables.json",
  "requestid": "request123"
}'
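
Alternatively, if you prefer to orchestrate from another notebook instead of the CLI, dbutils.notebook.run can pass the same parameters; the notebook path and timeout below are placeholders:

# Call this notebook from another notebook (path and 600-second timeout are illustrative)
result = dbutils.notebook.run(
    "/Workspace/path/to/this_notebook",  # hypothetical notebook path
    600,                                 # timeout in seconds
    {"json_file": "dbfs:/mnt/input/tables.json", "requestid": "request123"},
)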

---

4. Expected Output

The SQL file (insert_queries.sql) will contain queries like:

INSERT INTO target_db1.hud_assc (srcdb, tgtdb, table, requestid, condition)
VALUES ('source_db1', 'target_db1', 'hud_assc', 'request123_ab12cd34_1',
'call_dt >= "23/10/2010" AND call_dt <= "23/10/2010"');

INSERT INTO target_db2.hud_sale (srcdb, tgtdb, table, requestid, condition)
VALUES ('source_db2', 'target_db2', 'hud_sale', 'request123_ab12cd34_1',
'call_dt >= "01/01/2022" AND call_dt <= "01/01/2022"');

INSERT INTO target_db2.hud_sale (srcdb, tgtdb, table, requestid, condition)
VALUES ('source_db2', 'target_db2', 'hud_sale', 'request123_ab12cd34_2',
'call_dt >= "01/01/2022" AND call_dt <= "01/01/2022"');
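
If you also want to execute these statements from the same notebook, a simple sketch (not part of the original flow, and assuming each statement ends with a semicolon as generated above) is to read the file back and run each statement with spark.sql:

# Optional: execute the generated statements one by one
with open("/dbfs/mnt/output/insert_queries.sql", "r") as f:
    sql_text = f.read()

for statement in sql_text.split(";"):
    statement = statement.strip()
    if statement:  # skip empty fragments left by the split
        spark.sql(statement)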

---

5. Download the SQL Output (Optional)

To download the SQL file from Databricks:


databricks fs cp dbfs:/mnt/output/insert_queries.sql .

Or use the DBFS File Browser in the Databricks UI.

---

Final Thoughts

Scalability: You can process multiple tables with different databases dynamically.

Flexibility: Just update the tables.json file for new tables.

Databricks Integration: Uses widgets for easy job execution.

Let me know if you need further refinements!
