
OPTIMIZING

DATABRICKS
Shwetank Singh
GritSetGrow - GSGLearn.com
Optimize & Z-order for Delta Tables
OPTIMIZE compacts multiple small files together into files of up to 1 GB (the target size is configurable). It is similar to defragmentation in databases, while Z-ordering is similar to indexing.
OPTIMIZE table_name [WHERE predicate] [ZORDER BY (col_name1 [, ...])]

Z-Ordering is a technique to co-locate related information in the same set of files.


Given a column that you want to Z-order by, say OrderColumn, Delta:
Takes the existing Parquet files within a partition.
Maps the rows within those Parquet files according to OrderColumn using the Z-order curve algorithm (with a single column, this mapping reduces to a linear sort).
Rewrites the sorted data into new Parquet files.
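
A minimal sketch in PySpark (the table name events, its date partition column, and the event_type column are assumptions):

# Compact recent partitions of a Delta table assumed to be partitioned by `date`,
# and co-locate rows with the same event_type in the same files
spark.sql("OPTIMIZE events WHERE date >= '2024-01-01' ZORDER BY (event_type)")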

Auto Optimize in Delta Table
Automatically compacts small files during individual writes to a Delta table.
Optimize Write: use when you are not running OPTIMIZE manually.
It dynamically optimizes Apache Spark partition sizes based on the actual data and attempts to write out 128 MB files for each table partition. This happens inside the same Spark job.

Auto Compact: after the Spark job completes, Auto Compact launches a new job to check whether it can further compact files toward a 128 MB target file size.

-- In Spark session conf (defaults for all new tables)
set spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true
set spark.databricks.delta.properties.defaults.autoOptimize.autoCompact = true

-- Table properties (per table)
delta.autoOptimize.optimizeWrite = true
delta.autoOptimize.autoCompact = true
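
A minimal sketch of enabling both properties on an existing table (the table name events is an assumption):

# Enable Optimize Write and Auto Compact on an existing Delta table
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")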

Partitioning
Partitioning can speed up your queries if you filter, join, aggregate, or merge on the partition column(s), as it helps Spark skip a lot of unnecessary data partitions (i.e., subfolders) at scan time.
PARTITIONED BY ( { partition_column [ column_type ] } [, ...] )
Partitioning is useful when the table size is 1 TB or more.
For tables smaller than 1 TB, let Ingestion Time Clustering do its work: data arriving in the same batch is stored together in its own partition.
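
A minimal sketch, assuming a table named events with an event_date partition column:

# Create a Delta table partitioned by event_date (all names are assumptions)
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_id   BIGINT,
        event_type STRING,
        event_date DATE
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")

# A filter on the partition column lets Spark skip all other subfolders
df = spark.table("events").where("event_date = '2024-01-01'")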

Broadcast hash join
To entirely avoid data shuffling, broadcast the smaller of the two tables or DataFrames being joined. The table is broadcast by the driver, which copies it to all worker nodes.

set spark.sql.autoBroadcastJoinThreshold = <size in bytes>


The default threshold is 10 MB.
From Spark 3.0 onwards, AQE converts a sort-merge join into a broadcast join at runtime if the runtime statistics of either table are smaller than 30 MB.

set spark.databricks.adaptive.autoBroadcastJoinThreshold = <size in bytes>


This 30 MB default can be changed with the setting above.
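
A minimal PySpark sketch of an explicit broadcast (orders, countries, and country_code are assumed names):

from pyspark.sql.functions import broadcast

orders = spark.table("orders")          # assumed large fact table
countries = spark.table("countries")    # assumed small dimension table

# broadcast() ships the small table to every executor, so `orders` is not shuffled
joined = orders.join(broadcast(countries), on="country_code", how="left")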

Leverage cost-based optimizer
Spark SQL can use a Cost-based optimizer (CBO) to improve query plans. For it to work, it is critical to
collect table and column statistics and keep them up to date. Based on the stats, CBO chooses the most
economical join strategy.

ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2, ...;
Adaptive Query Execution (AQE) also leverages these statistics.

The ANALYZE TABLE command needs to be executed regularly (preferably once per day, or when more than 10% of the data has changed, whichever happens first).
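
A minimal sketch, assuming a table named orders and two columns that appear in joins and filters:

# Table-level statistics (row count, size in bytes)
spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS")

# Column-level statistics for the columns the CBO needs when choosing join strategies
spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS customer_id, order_date")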

Shuffle hash join over sort-merge join
In most cases Spark chooses a sort-merge join (SMJ) when it can't broadcast tables. Sort-merge joins are the most expensive join strategy. A shuffle-hash join (SHJ) has been found to be faster than sort-merge in some circumstances (but not all), since it does not require the extra sorting step that SMJ does.

set spark.sql.join.preferSortMergeJoin = false


With this setting, Spark will try to use SHJ instead of SMJ wherever possible.
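
A minimal sketch (table and column names are assumptions):

# Prefer shuffle-hash join over sort-merge join for this session
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

# Alternatively, hint the join to use shuffle-hash (Spark typically builds the hash table on the hinted side)
orders = spark.table("orders")
customers = spark.table("customers")
joined = orders.join(customers.hint("shuffle_hash"), "customer_id")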

AQE auto-tuning
Spark AQE has a feature called autoOptimizeShuffle (AOS), which can automatically find the right
number of shuffle partitions

set spark.sql.shuffle.partitions=auto

The default number of Spark SQL shuffle partitions (i.e., the number of partitions produced by wide transformations such as joins and aggregations) is 200, which isn't always the best value.

When that number is too low for the data volume, each Spark task (or CPU core) is given a large amount of data to process, and if the memory available to each core is insufficient to fit all of that data, some of it is spilled to disk.

Spilling to disk is a costly operation, as it involves data serialization, de-serialization, reading and writing to disk, etc.
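
A minimal PySpark sketch of turning the auto-tuning on (the table name events is an assumption):

# Let Databricks AQE pick the shuffle partition count automatically
spark.conf.set("spark.sql.shuffle.partitions", "auto")

# Wide transformations such as this aggregation then get an auto-tuned partition count
daily_counts = spark.table("events").groupBy("event_date").count()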

Manually fine tune Partitions
As a rule of thumb, we need to make sure that after tuning the number of shuffle partitions, each task should
approximately be processing 128MB to 200MB of data.
Let’s assume that:
Total number of worker cores in the cluster = T
Total amount of data being shuffled in shuffle stage (in megabytes) = B
Optimal size of data to be processed per task (in megabytes) = 128
Hence the multiplication factor (M): M = ceiling(B / 128 / T)
And the number of shuffle partitions (N): N = M x T
Note that we have used the ceiling function here to ensure that all the cluster cores are fully engaged till the very last execution cycle.
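
For example, a quick worked sketch of this formula (the cluster size and shuffle volume are assumptions):

import math

T = 64                   # assumed total worker cores in the cluster
B = 300 * 1024           # assumed ~300 GB shuffled in the stage, expressed in MB
target_mb = 128          # optimal data per task

M = math.ceil(B / target_mb / T)   # multiplication factor -> 38
N = M * T                          # number of shuffle partitions -> 2432

spark.conf.set("spark.sql.shuffle.partitions", N)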

Alternatively, a simpler rule of thumb is to use twice the total worker cores:
-- in SQL
set spark.sql.shuffle.partitions = 2*<number of total worker cores in cluster>
-- in PySpark
spark.conf.set("spark.sql.shuffle.partitions", 2*<number of total worker cores in cluster>)
-- or
spark.conf.set("spark.sql.shuffle.partitions", 2*sc.defaultParallelism)

Fixing Skewed Data
Filter Skewed data
If it’s possible to filter out the values around which there is a skew, then that will easily solve the issue. If you join using a column
with a lot of null values, for example, you’ll have data skew. In this scenario, filtering out the null values will resolve the issue.

Skew Hints
In the case where you are able to identify the table, the column, and preferably also the values that are causing data skew, then
you can explicitly tell Spark about it using skew hints so that Spark can try to resolve it for you.
SELECT /*+ SKEW('table', 'column_name', (value1, value2)) */ * FROM table
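
A minimal sketch of the null-filtering case (table and column names are assumptions):

from pyspark.sql.functions import col

facts = spark.table("facts")   # assumed skewed fact table
dims = spark.table("dims")     # assumed dimension table

# NULL join keys can never match in an inner join, so exclude them before joining
joined = facts.filter(col("dim_id").isNotNull()).join(dims, "dim_id")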

Fixing Skewed Data Continued
AQE skew optimization
By default, any partition that holds at least 256 MB of data and is at least 5 times larger than the average partition size is considered skewed by AQE.

set spark.sql.adaptive.skewJoin.enabled = true
AQE skew-join handling is enabled by default; set it to false only if you want to disable it.

Salting
It’s a strategy for breaking a large skewed partition into smaller partitions by appending random integers as suffixes to skewed
column values.
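
A minimal salting sketch (the table names, the join key dim_id, and the salt count are assumptions):

from pyspark.sql import functions as F

facts = spark.table("facts")   # assumed large, skewed side
dims = spark.table("dims")     # assumed small side

N = 16  # number of salt values; tune to the degree of skew

# Append a random salt to the skewed side, splitting each hot key into N sub-keys
facts_salted = facts.withColumn("salt", (F.rand() * N).cast("int"))

# Replicate the other side once per salt value so every sub-key finds a match
dims_salted = dims.withColumn("salt", F.explode(F.array([F.lit(i) for i in range(N)])))

joined = facts_salted.join(dims_salted, ["dim_id", "salt"]).drop("salt")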

Delta data skipping
Delta data skipping automatically collects the stats (min, max, etc.) for the first 32 columns for each
underlying Parquet file when you write data into a Delta table. Databricks takes advantage of this
information (minimum and maximum values) at query time to skip unnecessary files in order to speed
up the queries.
delta.dataSkippingNumIndexedCols = <value>
Set this table property to change the default of 32.

ALTER TABLE table_name CHANGE [COLUMN] col_name col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name]
Use this to move large columns (for example long strings) after the last column for which Databricks collects statistics.
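
A minimal sketch (the table events, the payload column, and the column count are assumptions):

# Collect file-level statistics only for the first 5 columns of the table
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '5')")

# Move a long free-text column after the last indexed column so it is excluded from stats collection
spark.sql("ALTER TABLE events CHANGE COLUMN payload payload STRING AFTER event_date")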

Data Skipping and Pruning
Column pruning
Select only those columns that are truly part of the workload computation and are needed by downstream queries.
-- SQL
SELECT col1, col2, ..., coln FROM table
-- PySpark
dataframe = spark.table("table").select("col1", "col2", ..., "coln")

Predicate pushdown
Pushing the filtering down to the "bare metal", i.e., the data source engine. Predicate pushdown is data-source dependent: it works for sources like Parquet, Delta, Cassandra, JDBC, etc., but it will not work for sources like text, JSON, XML, etc.
-- SQL
SELECT col1, col2, ..., coln FROM table WHERE col1 = <value>
-- PySpark
dataframe = spark.table("table").select("col1", "col2", ..., "coln").filter(col("col1") == <value>)

Data Skipping and Pruning
Partition pruning
Partition elimination optimizes reads from the underlying file system so that only the files in the relevant partitions are read.

To leverage partition pruning, all you have to do is provide a filter on the column(s) being used as table partition(s).

Dynamic partition pruning and Dynamic file pruning


From Spark 3.0 onwards, these prune the partitions/files a join reads from the fact table by identifying the partitions that survive the filters applied to the dimension tables.
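
A minimal sketch, assuming a fact table events partitioned by event_date and a small dates dimension table:

# Static partition pruning: only the event_date='2024-01-01' subfolder is scanned
spark.sql("SELECT * FROM events WHERE event_date = '2024-01-01'")

# Dynamic partition pruning: the filter on the dimension table prunes fact partitions at runtime
spark.sql("""
    SELECT f.*
    FROM events f
    JOIN dates d ON f.event_date = d.event_date
    WHERE d.is_holiday = true
""")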

Data Caching
Delta Caching or Disk Caching
The Delta cache accelerates data reads by creating copies of remote files in nodes’ local storage (SSD drives) using a fast
intermediate data format.

set spark.databricks.io.cache.enabled = true

Spark cache
Using the cache() and persist() methods, Spark provides an optimization mechanism to cache the intermediate computation of a Spark DataFrame so it can be reused in subsequent actions. Similarly, you can cache a table using the CACHE TABLE command.
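
A minimal sketch of both caches (the table name and filter are assumptions):

# Disk (Delta) cache: enable it, then the first read of remote Parquet/Delta files populates it
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Spark cache: keep an intermediate DataFrame around for several downstream actions
df = spark.table("events").where("event_date >= '2024-01-01'")
df.cache()        # or df.persist(...) with an explicit storage level
df.count()        # first action materializes the cached blocks
# ... further actions on df reuse the cache ...
df.unpersist()    # release the memory when done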

Disk Cache vs Spark Cache
Feature       | Disk cache                                                                                      | Apache Spark cache
Stored as     | Local files on a worker node.                                                                   | In-memory blocks (depends on the storage level).
Applied to    | Any Parquet table stored on ABFS and other file systems.                                       | Any DataFrame or RDD.
Triggered     | Automatically, on the first read (if the cache is enabled).                                    | Manually; requires code changes.
Evaluated     | Lazily.                                                                                         | Lazily.
Availability  | Can be enabled or disabled with configuration flags; enabled by default on certain node types. | Always available.
Evicted       | Automatically in LRU fashion or on any file change; manually when restarting a cluster.        | Automatically in LRU fashion; manually with unpersist.
