Databricks Optimization Technique
Introduction
Optimizing Databricks workloads is crucial for enhancing performance, reducing costs, and
minimizing errors caused by inefficient resource utilization. Based on my experience, performance
degradation typically occurs in the following key phases. While these points address significant
optimization aspects, they do not cover every scenario; I will continue to refine and expand
this list as I encounter new challenges.
1. Data Shuffling Phase – Optimizing how data is redistributed across partitions to minimize
shuffle overhead.
2. Data Transformation Phase – Enhancing data processing efficiency by optimizing operations
such as joins, aggregations, and filtering.
3. Cluster Configuration Phase – Adjusting compute resources to balance cost and
performance, including instance types, autoscaling, and parallelism.
4. Storage & Query Performance Phase – Improving data storage structures and query
execution plans to accelerate processing.
5. Rewrite Phase – Managing data modifications, including updates, deletions, and merges, to
ensure efficient reprocessing and storage performance.
Each section provides insights into common performance bottlenecks, their underlying causes, and
step-by-step solutions to optimize workloads effectively.
1. Data Shuffling Phase
Shuffle occurs when data is redistributed across multiple nodes or partitions during query execution.
This is often triggered by operations such as GROUP BY, JOIN, ORDER BY, and DISTINCT, where data
needs to be grouped or sorted based on different keys.
GROUP BY: Data needs to be grouped based on a column that may not be pre-partitioned.
JOIN: When two tables are joined on a non-partitioned column, data from both tables needs
to be shuffled to bring matching keys together.
ORDER BY: Sorting requires merging and redistributing data across multiple nodes.
DISTINCT: Similar to GROUP BY, data needs to be shuffled to ensure unique values are
identified globally.
We analyze the following SQL query using EXPLAIN FORMATTED to check for data shuffling.
EXPLAIN FORMATTED
SELECT customer_id, COUNT(*) AS txn_count  -- COUNT(*) used as a representative aggregate
FROM transactions
GROUP BY customer_id;
EXPLAIN FORMATTED provides a detailed execution plan, showing how the query is
processed.
We look for EXCHANGE operators in the execution plan, which indicate data shuffling.
(Your actual output may differ based on the database engine, but it will have similar components.)
== Physical Plan ==
* HashAggregate (2)
  +- Exchange (1)
     +- HashAggregate (0)
        +- Scan [transactions]
Breaking it down:
(0) HashAggregate: The first stage of aggregation occurs at the local node level.
(1) Exchange: Data is shuffled across different partitions to group all records of the same
customer_id together.
(2) HashAggregate: The final aggregation is performed after shuffling.
Optimize Partitioning
If your data is stored in a partitioned table, ensure that the partition key matches the
GROUP BY column.
Example: If transactions is partitioned by customer_id, then GROUP BY customer_id will
have minimal shuffle.
CREATE TABLE transactions_partitioned
USING PARQUET
PARTITIONED BY (customer_id)
AS SELECT * FROM transactions;
Broadcast Join
Broadcasting small tables (<10MB) prevents unnecessary shuffling by sending the table to all worker
nodes.
A Broadcast Join is a join optimization technique used in distributed computing frameworks like
Apache Spark and Databricks. It works by replicating (broadcasting) a smaller dataset to all nodes
where the larger dataset is stored, reducing data shuffling and improving query performance.
By forcing a Broadcast Join using a query hint, we explicitly instruct the SQL engine to treat the
specified table as a broadcasted table, even if it might not choose to do so automatically.
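A minimal sketch of such a hint, reusing the transactions and customer_data tables discussed below (the join key customer_id is assumed):

SELECT /*+ BROADCAST(customer_data) */ transactions.*
FROM transactions
JOIN customer_data
  ON transactions.customer_id = customer_data.customer_id;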
Query Explanation: forcing the broadcast of customer_data makes sense when the following hold:
1. Smaller Table: The table being broadcasted (customer_data) should be small enough to fit
into memory on each node.
2. Avoiding Data Shuffle: Helps reduce network transfer costs and execution time in a
distributed system.
3. Improving Performance: Useful when joining a large table with a much smaller table, as it
eliminates the need for a full shuffle.
Forcing a broadcast can backfire in the following cases:
1. Large Table as Broadcast: If customer_data is too large, it can cause memory overflow or
performance degradation.
2. Cluster Constraints: If memory is limited, broadcasting can slow down execution instead of
speeding it up.
3. Default Optimizer Decision is Better: If the query optimizer automatically selects a more
efficient join strategy, forcing a broadcast might not be necessary.
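Whether Spark broadcasts automatically is governed by a size threshold (about 10 MB by default). A hedged sketch of adjusting it per session when the optimizer's default decision needs to be overridden:

SET spark.sql.autoBroadcastJoinThreshold = 52428800;  -- raise the threshold to 50 MB when workers have memory headroom
SET spark.sql.autoBroadcastJoinThreshold = -1;        -- or disable automatic broadcasting entirely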
To verify that the broadcast was applied, open the Spark UI → SQL tab and check the physical plan for
BroadcastHashJoin and BroadcastExchange nodes.
What is AQE?
Adaptive Query Execution (AQE) is a feature in Apache Spark 3.0+ that dynamically optimizes query
execution at runtime based on actual data statistics. Unlike traditional static query plans, AQE allows
Spark to adjust join strategies, optimize shuffle partitions, and change execution plans dynamically.
1. Dynamically Switch Join Strategies → Changes between Sort-Merge Join (SMJ) and Broadcast
Hash Join (BHJ) based on data size.
2. Optimize Shuffle Partitions → Adjusts the number of partitions dynamically to reduce shuffle
size and improve performance.
3. Handle Skewed Joins → Detects data skew and applies optimized join strategies to avoid slow
execution.
4. Eliminate Unnecessary Shuffles → Removes redundant shuffle operations.
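These behaviors are controlled by Spark configuration settings. A minimal sketch of enabling them in SQL (most are already on by default in recent Spark/Databricks runtimes):

SET spark.sql.adaptive.enabled = true;                     -- master switch for AQE
SET spark.sql.adaptive.coalescePartitions.enabled = true;  -- merge small shuffle partitions at runtime
SET spark.sql.adaptive.skewJoin.enabled = true;            -- split heavily skewed partitions during joins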
Rerun the query and check the shuffle size reduction in Spark UI.
In Spark UI → SQL Tab → Query Execution Details, compare the shuffle read/write sizes and look for
adaptive operators (e.g., AQEShuffleRead) in the final plan.
1.4 Shuffle Hash Join vs. Sort Merge Join in Apache Spark
Sort Merge Join (SMJ) is the default join strategy in Spark when two large tables are joined.
It first sorts both datasets on the join key and then merges them efficiently.
Best suited for large datasets where one table does not fit in memory.
Downside: Sorting is expensive, and it requires shuffling before joining.
Shuffle Hash Join (SHJ) is faster than Sort Merge Join when one of the datasets can fit into
memory.
Instead of sorting, it creates a hash table for the smaller dataset and performs lookups while
scanning the larger dataset.
Best suited when at least one table is small enough to fit in memory.
Downside: If the table is too large to fit in memory, it may cause out-of-memory errors.
How a Shuffle Hash Join executes:
1. Shuffle occurs to ensure that matching keys are in the same partition.
2. The smaller table is loaded into memory as a hash table.
3. The larger table is scanned, and the join is performed using hash lookups instead of sorting.
Join Type | Best Use Case | Execution Overhead | Spark UI Indicators
Sort Merge Join | Large datasets that do not fit in memory | Sorting and shuffling | "SortMergeJoin" in Query Plan
Shuffle Hash Join | When one dataset fits in memory | Shuffle only (No Sorting) | "ShuffledHashJoin" in Query Plan
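If the optimizer picks the wrong strategy, join hints (available in Spark 3.0+) can steer it explicitly. A sketch reusing the transactions and customer_data tables (join key customer_id assumed):

-- Prefer a shuffle hash join, building the hash table on customer_data:
SELECT /*+ SHUFFLE_HASH(customer_data) */ transactions.*
FROM transactions
JOIN customer_data
  ON transactions.customer_id = customer_data.customer_id;

-- Or force a sort merge join:
SELECT /*+ MERGE(customer_data) */ transactions.*
FROM transactions
JOIN customer_data
  ON transactions.customer_id = customer_data.customer_id;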
Cost-Based Optimizer (CBO)
Benefits of CBO:
Allows Spark to analyze table statistics and select an optimal execution plan.
Helps choose the best join order in multi-table joins.
By default, CBO is disabled in Spark. Enable it using the following SQL command:
SET spark.sql.cbo.enabled = true;

Step | Command | Purpose
Compute Column Statistics | ANALYZE TABLE transactions COMPUTE STATISTICS FOR COLUMNS customer_id; | Improves filter pushdown & partition pruning
Enable Join Reordering | SET spark.sql.cbo.joinReorder.enabled = true; | Ensures Spark processes smaller tables first
Check Execution Plan in UI | Spark UI → SQL Tab → Query Plan | Verify if optimizations are applied
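To confirm that statistics were actually collected (assuming the transactions table used above), table- and column-level stats can be inspected:

DESCRIBE EXTENDED transactions;              -- table-level statistics (row count, size in bytes)
DESCRIBE EXTENDED transactions customer_id;  -- column-level statistics (min, max, distinct count, null count)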
Small Files Problem
Definition:
Small files problem occurs when a large number of tiny files are written to storage, leading
to inefficient queries, high metadata overhead, and increased costs.
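For a Delta table, these numbers can be surfaced with a single command (a sketch; table_name is a placeholder, and per-file minimum/maximum sizes come from the table's file listing rather than this output):

DESCRIBE DETAIL table_name;  -- returns numFiles, sizeInBytes, and other table metadata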
Example Output:
1. numFiles (5000):
o The total number of Parquet files in the Delta table.
o 5000 files indicate a high number of small files, which may impact query
performance and storage costs.
2. sizeInBytes (1TB):
o The total size of the table in bytes (converted here to 1TB).
o This means 5000 files together occupy 1TB of storage.
3. minFileSize (50KB):
o The smallest file in the table is 50KB.
o This is too small for efficient querying, leading to high metadata overhead.
4. maxFileSize (150MB):
o The largest file in the table is 150MB.
o Ideally, Delta tables should have file sizes close to 128MB–256MB for
optimal performance.
Recommended Action:
OPTIMIZE table_name;

ALTER TABLE table_name SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = true,
  'delta.autoOptimize.autoCompact' = true
);
This will help reduce the number of files, improve query speed, and lower storage
costs.
Cluster Autoscaling
1. Enable autoscaling on the cluster
o Set Min Workers and Max Workers to allow Databricks to scale up/down as
needed.
2. Monitor utilization in Spark UI
o If CPU usage is low, reduce the max worker count to save costs.
o If jobs are slow, increase the min worker count.
ANALYZE TABLE
Definition
ANALYZE TABLE collects metadata and statistics about a table to help the query optimizer generate
efficient execution plans.
Steps to Optimize
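The commands themselves are short; a minimal sketch, with table_name as a placeholder:

ANALYZE TABLE table_name COMPUTE STATISTICS;                  -- table-level statistics (row count, size)
ANALYZE TABLE table_name COMPUTE STATISTICS FOR ALL COLUMNS;  -- column-level statistics used by the optimizer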
Delta Lake File Compaction (OPTIMIZE)
Definition
Delta Lake optimizes file storage by automatically managing file sizes and compacting small files into
larger ones, reducing fragmentation and read latencies.
Steps to Optimize
1. Enable Auto Optimize on the table (see the TBLPROPERTIES example in the previous section).
o Automatically adjusts file sizes during compaction to balance performance and cost.
2. Run OPTIMIZE table_name;
o Merges small files into larger Parquet files, improving query performance.
3. Run OPTIMIZE table_name ZORDER BY (column_name);
o Reorders data layout to improve performance when filtering by the given column.
Caching
Definition
Caching stores frequently used data in memory, reducing the need for repeated computations and
disk reads, significantly improving query performance.
Steps to Optimize
1. Cache frequently queried tables in SQL:
CACHE TABLE table_name;
2. Use the persist() method for selective caching in PySpark (if needed):
df = spark.read.table("table_name").persist()
df.show()
3. Clear cached data when it is no longer needed:
CLEAR CACHE;
o Removes all cached tables, preventing memory overflow.
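If only one table should be released rather than the whole cache, a narrower option (table_name as a placeholder):

UNCACHE TABLE table_name;  -- drops only this table's cached data, leaving other cached tables intact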
5. Rewrite Phase
The Rewrite Phase in Databricks refers to the process of modifying and reprocessing data in storage
due to updates, deletions, or merges. Since Parquet files (used in Delta Lake) are immutable, any
updates or modifications require rewriting the affected data files instead of modifying records in
place. This can lead to high resource consumption, memory spills, and increased execution time if
not optimized properly.
Why rewrites happen:
Delta Lake stores data in Parquet files, which cannot be updated directly.
When a record is updated, the entire affected Parquet file(s) must be rewritten.
If a small number of records require modification, but they belong to large files, the
entire file must be rewritten, leading to write amplification (rewriting more data
than necessary).
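To make write amplification concrete, a hedged sketch (the status and transaction_id columns are illustrative):

-- Even though only one row changes, every Parquet file containing a matching
-- record must be rewritten (illustrative column names):
UPDATE transactions
SET status = 'REFUNDED'
WHERE transaction_id = 12345;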
Common problems caused by rewrites:
High Memory Spillage: Large-scale rewrites can cause shuffle spills and increased
memory consumption.
Write Amplification: Updating a few rows may lead to rewriting millions or billions
of rows, significantly increasing execution time.
Concurrency Issues: Running OPTIMIZE and ZORDER on actively used partitions can
cause conflicts with streaming jobs.
A Deletion Vector is a logical deletion mechanism in Delta Lake that allows records to be
marked as deleted without rewriting entire Parquet files. Instead of physically removing
records and rewriting files, deletion vectors maintain a lightweight metadata index,
significantly improving performance and reducing write amplification.
To take advantage of deletion vectors, enable them at the table level using:
ALTER TABLE table_name SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);
This setting ensures that instead of rewriting Parquet files during DELETE or MERGE
operations, Delta Lake marks the rows as deleted and avoids unnecessary rewrites.
Instead of physically deleting rows and triggering a full rewrite, Delta Lake marks them as
deleted using deletion vectors.
With deletion vectors enabled, only metadata is updated, and affected rows are excluded
from queries.
No Parquet files are rewritten, reducing resource consumption.
Since deleted rows are logically removed, they are automatically excluded from queries.
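A short sketch of the effect (the predicate is illustrative); with deletion vectors enabled, the DELETE below only touches metadata, and a later REORG can physically purge the soft-deleted rows during a maintenance window:

-- Rows are marked deleted via a deletion vector; no Parquet files are rewritten:
DELETE FROM transactions WHERE customer_id = 42;

-- Optional follow-up: rewrite files to physically remove soft-deleted rows:
REORG TABLE transactions APPLY (PURGE);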