Databricks Optimization Technique
Introduction
Optimizing Databricks workloads is crucial for enhancing performance, reducing costs, and
minimizing errors caused by inefficient resource utilization. Based on my experience, performance
degradation typically occurs in the following key phases. While these points address significant
optimization aspects, they do not cover every scenario; I will continue to refine and expand
this list as I encounter new challenges.
1. Data Shuffling Phase – Optimizing how data is redistributed across partitions to minimize
shuffle overhead.
2. Data Transformation Phase – Enhancing data processing efficiency by optimizing operations
such as joins, aggregations, and filtering.
3. Cluster Configuration Phase – Adjusting compute resources to balance cost and
performance, including instance types, autoscaling, and parallelism.
4. Storage & Query Performance Phase – Improving data storage structures and query
execution plans to accelerate processing.
5. Rewrite Phase – Managing data modifications, including updates, deletions, and merges, to
ensure efficient reprocessing and storage performance.
Each section provides insights into common performance bottlenecks, their underlying causes, and
step-by-step solutions to optimize workloads effectively.
1. Data Shuffling Phase
Shuffle occurs when data is redistributed across multiple nodes or partitions during query execution.
This is often triggered by operations such as GROUP BY, JOIN, ORDER BY, and DISTINCT, where data
needs to be grouped or sorted based on different keys.
GROUP BY: Data needs to be grouped based on a column that may not be pre-partitioned.
JOIN: When two tables are joined on a non-partitioned column, data from both tables needs
to be shuffled to bring matching keys together.
ORDER BY: Sorting requires merging and redistributing data across multiple nodes.
DISTINCT: Similar to GROUP BY, data needs to be shuffled to ensure unique values are
identified globally.
We analyze the following SQL query using EXPLAIN FORMATTED to check for data shuffling.
EXPLAIN FORMATTED
SELECT customer_id, COUNT(*) AS txn_count  -- COUNT(*) used as a representative aggregate
FROM transactions
GROUP BY customer_id;
EXPLAIN FORMATTED provides a detailed execution plan, showing how the query is
processed.
We look for EXCHANGE operators in the execution plan, which indicate data shuffling.
(Your actual output may differ based on the database engine, but it will have similar components.)
== Physical Plan ==
* HashAggregate (2)
  +- Exchange (1)
     +- HashAggregate (0)
        +- Scan [transactions]
Breaking it down:
(0) HashAggregate: The first stage of aggregation occurs at the local node level.
(1) Exchange: Data is shuffled across different partitions to group all records of the same
customer_id together.
(2) HashAggregate: The final aggregation is performed after shuffling.
Optimize Partitioning
If your data is stored in a partitioned table, ensure that the partition key matches the
GROUP BY column.
Example: If transactions is partitioned by customer_id, then GROUP BY customer_id will
have minimal shuffle.
CREATE TABLE transactions_partitioned
USING PARQUET
PARTITIONED BY (customer_id)
AS SELECT * FROM transactions;
Broadcast Join
Broadcasting small tables (<10MB) prevents unnecessary shuffling by sending the table to all worker
nodes.
A Broadcast Join is a join optimization technique used in distributed computing frameworks like
Apache Spark and Databricks. It works by replicating (broadcasting) a smaller dataset to all nodes
where the larger dataset is stored, reducing data shuffling and improving query performance.
By forcing a Broadcast Join using a query hint, we explicitly instruct the SQL engine to treat the
specified table as a broadcasted table, even if it might not choose to do so automatically.
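A minimal sketch of such a hint, reusing the transactions and customer_data tables discussed below (the join key customer_id is assumed):

SELECT /*+ BROADCAST(customer_data) */ transactions.*
FROM transactions
JOIN customer_data
  ON transactions.customer_id = customer_data.customer_id;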
Query Explanation: forcing the broadcast of customer_data makes sense when the following hold:
1. Smaller Table: The table being broadcasted (customer_data) should be small enough to fit
into memory on each node.
2. Avoiding Data Shuffle: Helps reduce network transfer costs and execution time in a
distributed system.
3. Improving Performance: Useful when joining a large table with a much smaller table, as it
eliminates the need for a full shuffle.
Forcing a broadcast can backfire in the following cases:
1. Large Table as Broadcast: If customer_data is too large, it can cause memory overflow or
performance degradation.
2. Cluster Constraints: If memory is limited, broadcasting can slow down execution instead of
speeding it up.
3. Default Optimizer Decision is Better: If the query optimizer automatically selects a more
efficient join strategy, forcing a broadcast might not be necessary.
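Whether Spark broadcasts automatically is governed by a size threshold (about 10 MB by default). A hedged sketch of adjusting it per session when the optimizer's default decision needs to be overridden:

SET spark.sql.autoBroadcastJoinThreshold = 52428800;  -- raise the threshold to 50 MB when workers have memory headroom
SET spark.sql.autoBroadcastJoinThreshold = -1;        -- or disable automatic broadcasting entirely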
To verify that the broadcast was applied, open the Spark UI → SQL tab and check the physical plan for
BroadcastHashJoin and BroadcastExchange nodes.
What is AQE?
Adaptive Query Execution (AQE) is a feature in Apache Spark 3.0+ that dynamically optimizes query
execution at runtime based on actual data statistics. Unlike traditional static query plans, AQE allows
Spark to adjust join strategies, optimize shuffle partitions, and change execution plans dynamically.
1. Dynamically Switch Join Strategies → Changes between Sort-Merge Join (SMJ) and Broadcast
Hash Join (BHJ) based on data size.
2. Optimize Shuffle Partitions → Adjusts the number of partitions dynamically to reduce shuffle
size and improve performance.
3. Handle Skewed Joins → Detects data skew and applies optimized join strategies to avoid slow
execution.
4. Eliminate Unnecessary Shuffles → Removes redundant shuffle operations.
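These behaviors are controlled by Spark configuration settings. A minimal sketch of enabling them in SQL (most are already on by default in recent Spark/Databricks runtimes):

SET spark.sql.adaptive.enabled = true;                     -- master switch for AQE
SET spark.sql.adaptive.coalescePartitions.enabled = true;  -- merge small shuffle partitions at runtime
SET spark.sql.adaptive.skewJoin.enabled = true;            -- split heavily skewed partitions during joins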
Rerun the query and check the shuffle size reduction in Spark UI.
In Spark UI → SQL Tab → Query Execution Details, compare the shuffle read/write sizes and look for
adaptive operators (e.g., AQEShuffleRead) in the final plan.
1.4 Shuffle Hash Join vs. Sort Merge Join in Apache Spark
Sort Merge Join (SMJ) is the default join strategy in Spark when two large tables are joined.
It first sorts both datasets on the join key and then merges them efficiently.
Best suited for large datasets where one table does not fit in memory.
Downside: Sorting is expensive, and it requires shuffling before joining.
Shuffle Hash Join (SHJ) is faster than Sort Merge Join when one of the datasets can fit into
memory.
Instead of sorting, it creates a hash table for the smaller dataset and performs lookups while
scanning the larger dataset.
Best suited when at least one table is small enough to fit in memory.
Downside: If the table is too large to fit in memory, it may cause out-of-memory errors.
How a Shuffle Hash Join executes:
1. Shuffle occurs to ensure that matching keys are in the same partition.
2. The smaller table is loaded into memory as a hash table.
3. The larger table is scanned, and the join is performed using hash lookups instead of sorting.
Join Type | Best Use Case | Execution Overhead | Spark UI Indicators
Sort Merge Join | Large datasets that do not fit in memory | Sorting and shuffling | "SortMergeJoin" in Query Plan
Shuffle Hash Join | When one dataset fits in memory | Shuffle only (No Sorting) | "ShuffledHashJoin" in Query Plan
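If the optimizer picks the wrong strategy, join hints (available in Spark 3.0+) can steer it explicitly. A sketch reusing the transactions and customer_data tables (join key customer_id assumed):

-- Prefer a shuffle hash join, building the hash table on customer_data:
SELECT /*+ SHUFFLE_HASH(customer_data) */ transactions.*
FROM transactions
JOIN customer_data
  ON transactions.customer_id = customer_data.customer_id;

-- Or force a sort merge join:
SELECT /*+ MERGE(customer_data) */ transactions.*
FROM transactions
JOIN customer_data
  ON transactions.customer_id = customer_data.customer_id;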
Cost-Based Optimizer (CBO)
Benefits of CBO:
Allows Spark to analyze table statistics and select an optimal execution plan.
Helps choose the best join order in multi-table joins.
By default, CBO is disabled in Spark. Enable it using the following SQL command:
SET spark.sql.cbo.enabled = true;

Step | Command | Purpose
Compute Column Statistics | ANALYZE TABLE transactions COMPUTE STATISTICS FOR COLUMNS customer_id; | Improves filter pushdown & partition pruning
Enable Join Reordering | SET spark.sql.cbo.joinReorder.enabled = true; | Ensures Spark processes smaller tables first
Check Execution Plan in UI | Spark UI → SQL Tab → Query Plan | Verify if optimizations are applied
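To confirm that statistics were actually collected (assuming the transactions table used above), table- and column-level stats can be inspected:

DESCRIBE EXTENDED transactions;              -- table-level statistics (row count, size in bytes)
DESCRIBE EXTENDED transactions customer_id;  -- column-level statistics (min, max, distinct count, null count)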
Small Files Problem
Definition:
Small files problem occurs when a large number of tiny files are written to storage, leading
to inefficient queries, high metadata overhead, and increased costs.
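For a Delta table, these numbers can be surfaced with a single command (a sketch; table_name is a placeholder, and per-file minimum/maximum sizes come from the table's file listing rather than this output):

DESCRIBE DETAIL table_name;  -- returns numFiles, sizeInBytes, and other table metadata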
Example Output:
1. numFiles (5000):
o The total number of Parquet files in the Delta table.
o 5000 files indicate a high number of small files, which may impact query
performance and storage costs.
2. sizeInBytes (1TB):
o The total size of the table in bytes (converted here to 1TB).
o This means 5000 files together occupy 1TB of storage.
3. minFileSize (50KB):
o The smallest file in the table is 50KB.
o This is too small for efficient querying, leading to high metadata overhead.
4. maxFileSize (150MB):
o The largest file in the table is 150MB.
o Ideally, Delta tables should have file sizes close to 128MB–256MB for
optimal performance.
Recommended Action:
OPTIMIZE table_name;

ALTER TABLE table_name SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = true,
  'delta.autoOptimize.autoCompact' = true
);
This will help reduce the number of files, improve query speed, and lower storage
costs.
Cluster Autoscaling
1. Enable autoscaling on the cluster
o Set Min Workers and Max Workers to allow Databricks to scale up/down as
needed.
2. Monitor utilization in Spark UI
o If CPU usage is low, reduce the max worker count to save costs.
o If jobs are slow, increase the min worker count.
ANALYZE TABLE
Definition
ANALYZE TABLE collects metadata and statistics about a table to help the query optimizer generate
efficient execution plans.
Steps to Optimize
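The commands themselves are short; a minimal sketch, with table_name as a placeholder:

ANALYZE TABLE table_name COMPUTE STATISTICS;                  -- table-level statistics (row count, size)
ANALYZE TABLE table_name COMPUTE STATISTICS FOR ALL COLUMNS;  -- column-level statistics used by the optimizer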
Delta Lake File Compaction (OPTIMIZE)
Definition
Delta Lake optimizes file storage by automatically managing file sizes and compacting small files into
larger ones, reducing fragmentation and read latencies.
Steps to Optimize
1. Enable Auto Optimize on the table (see the TBLPROPERTIES example in the previous section).
o Automatically adjusts file sizes during compaction to balance performance and cost.
2. Run OPTIMIZE table_name;
o Merges small files into larger Parquet files, improving query performance.
3. Run OPTIMIZE table_name ZORDER BY (column_name);
o Reorders data layout to improve performance when filtering by the given column.
Caching
Definition
Caching stores frequently used data in memory, reducing the need for repeated computations and
disk reads, significantly improving query performance.
Steps to Optimize
1. Cache frequently queried tables in SQL:
CACHE TABLE table_name;
2. Use the persist() method for selective caching in PySpark (if needed):
df = spark.read.table("table_name").persist()
df.show()
3. Clear cached data when it is no longer needed:
CLEAR CACHE;
o Removes all cached tables, preventing memory overflow.
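If only one table should be released rather than the whole cache, a narrower option (table_name as a placeholder):

UNCACHE TABLE table_name;  -- drops only this table's cached data, leaving other cached tables intact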
5. Rewrite Phase
The Rewrite Phase in Databricks refers to the process of modifying and reprocessing data in storage
due to updates, deletions, or merges. Since Parquet files (used in Delta Lake) are immutable, any
updates or modifications require rewriting the affected data files instead of modifying records in
place. This can lead to high resource consumption, memory spills, and increased execution time if
not optimized properly.
Why rewrites happen:
Delta Lake stores data in Parquet files, which cannot be updated directly.
When a record is updated, the entire affected Parquet file(s) must be rewritten.
If a small number of records require modification, but they belong to large files, the
entire file must be rewritten, leading to write amplification (rewriting more data
than necessary).
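To make write amplification concrete, a hedged sketch (the status and transaction_id columns are illustrative):

-- Even though only one row changes, every Parquet file containing a matching
-- record must be rewritten (illustrative column names):
UPDATE transactions
SET status = 'REFUNDED'
WHERE transaction_id = 12345;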
Common problems caused by rewrites:
High Memory Spillage: Large-scale rewrites can cause shuffle spills and increased
memory consumption.
Write Amplification: Updating a few rows may lead to rewriting millions or billions
of rows, significantly increasing execution time.
Concurrency Issues: Running OPTIMIZE and ZORDER on actively used partitions can
cause conflicts with streaming jobs.
A Deletion Vector is a logical deletion mechanism in Delta Lake that allows records to be
marked as deleted without rewriting entire Parquet files. Instead of physically removing
records and rewriting files, deletion vectors maintain a lightweight metadata index,
significantly improving performance and reducing write amplification.
To take advantage of deletion vectors, enable them at the table level using:
ALTER TABLE table_name SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);
This setting ensures that instead of rewriting Parquet files during DELETE or MERGE
operations, Delta Lake marks the rows as deleted and avoids unnecessary rewrites.
Instead of physically deleting rows and triggering a full rewrite, Delta Lake marks them as
deleted using deletion vectors.
With deletion vectors enabled, only metadata is updated, and affected rows are excluded
from queries.
No Parquet files are rewritten, reducing resource consumption.
Since deleted rows are logically removed, they are automatically excluded from queries.
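A short sketch of the effect (the predicate is illustrative); with deletion vectors enabled, the DELETE below only touches metadata, and a later REORG can physically purge the soft-deleted rows during a maintenance window:

-- Rows are marked deleted via a deletion vector; no Parquet files are rewritten:
DELETE FROM transactions WHERE customer_id = 42;

-- Optional follow-up: rewrite files to physically remove soft-deleted rows:
REORG TABLE transactions APPLY (PURGE);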