Databricks Best Practices
You should call count() or write() immediately after calling cache() so that the entire
DataFrame is processed and cached in memory. If you only cache part of the DataFrame,
the entire DataFrame may be recomputed when a subsequent action is performed on the
DataFrame.
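A minimal PySpark sketch of this pattern (the table name is hypothetical):
df = spark.table("sales").filter("country_code = 'US'")   # "sales" is an example table
df.cache()    # marks the DataFrame for caching (lazy)
df.count()    # full action: materializes and caches the entire DataFrame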
2. Create partitions on every table; for fact tables, partition on a key join column such as country_code, city, or market_code.
Delta tables in ADB support partitioning, which enhances performance. You can partition by
a column if you expect data in that partition to be at least 1 GB. If column cardinality is high,
do not use that column for partitioning. For example, if you partition by user ID and there
are 1M distinct user IDs, partitioning would increase table load time. Syntax example:
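A minimal sketch of the partitioning syntax, using a hypothetical sales table partitioned by country_code:
spark.sql("""
    CREATE TABLE sales (
        id BIGINT,
        amount DOUBLE,
        country_code STRING
    )
    USING DELTA
    PARTITIONED BY (country_code)   -- low-cardinality column expected to hold at least 1 GB per partition
""")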
Avoid high listing costs on large directories, such as deeply nested hierarchical folder structures.
Z-Ordering is a technique to colocate related information in the same set of files. This co-
locality is automatically used by Delta Lake on Databricks data-skipping algorithms to
dramatically reduce the amount of data that needs to be read. To Z-Order data, you specify
the columns to order on in the ZORDER BY clause:
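For example (table and column names are hypothetical):
spark.sql("OPTIMIZE sales ZORDER BY (customer_id)")   # colocates rows with similar customer_id values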
Auto Optimize consists of two complementary features: Optimized Writes and Auto Compaction. You must explicitly enable them using one of the following methods:
CREATE TABLE student (id INT, name STRING, age INT) TBLPROPERTIES
(delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true)
NOTE:
Databricks does not support Z-Ordering with Auto Compaction as Z-Ordering is
significantly more expensive than just compaction.
Auto Compaction generates smaller files (128 MB) than OPTIMIZE (1 GB).
Auto Compaction greedily chooses a limited set of partitions that would best leverage
compaction. The number of partitions selected will vary depending on the size of cluster
it is launched on. If your cluster has more CPUs, more partitions can be optimized.
To control the output file size, set the Spark
configuration spark.databricks.delta.autoCompact.maxFileSize. The default value
is 134217728, which sets the size to 128 MB. Specifying the value 104857600 sets the
file size to 100MB.
spark.sql("set spark.databricks.delta.autoCompact.enabled = true")
6. Decide on the partition (block) size; the default is 128 MB. Based on that size, Spark determines the number of files created for the table.
Join hints
Join hints allow you to suggest the join strategy that Databricks Runtime should use. When
different join strategy hints are specified on both sides of a join, Databricks Runtime
prioritizes hints in the following order: BROADCAST over MERGE over SHUFFLE_HASH over SHUFFLE_REPLICATE_NL. When both sides are specified with the BROADCAST hint or the SHUFFLE_HASH hint,
Databricks Runtime picks the build side based on the join type and the sizes of the
relations. Since a given strategy may not support all join types, Databricks Runtime is not
guaranteed to use the join strategy suggested by the hint.
Join hint types
BROADCAST
Use broadcast join. The join side with the hint is broadcast regardless
of autoBroadcastJoinThreshold. If both sides of the join have the broadcast hints, the
one with the smaller size (based on stats) is broadcast. The aliases
for BROADCAST are BROADCASTJOIN and MAPJOIN.
MERGE
Use shuffle sort merge join. The aliases for MERGE are SHUFFLE_MERGE and MERGEJOIN.
SHUFFLE_HASH
Use shuffle hash join. If both sides have the shuffle hash hints, Databricks Runtime
chooses the smaller side (based on stats) as the build side.
SHUFFLE_REPLICATE_NL
Use shuffle-and-replicate nested loop join (a Cartesian product join).
COALESCE
Reduce the number of partitions to the specified number of partitions. It takes a partition
number as a parameter.
REPARTITION
Repartition to the specified number of partitions using the specified partitioning
expressions. It takes a partition number, column names, or both as parameters.
REPARTITION_BY_RANGE
Repartition to the specified number of partitions using range partitioning on the specified columns. It takes column names and an optional partition number as parameters.
REBALANCE
The REBALANCE hint can be used to rebalance the query result output partitions, so that every partition is of a reasonable size (not too small and not too big). It can take column names as parameters and tries its best to partition the query result by these columns. This is best-effort: if there are skews, Spark will split the skewed partitions to keep them from becoming too big. This hint is useful when you need to write the result of the query to a table and want to avoid files that are too small or too big. This hint is ignored if AQE is not enabled.
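A short sketch of how these hints appear in a query (the tables are hypothetical):
spark.sql("""
    SELECT /*+ BROADCAST(c) */ o.order_id, c.country_code
    FROM orders o
    JOIN customers c
      ON o.customer_id = c.customer_id
""")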
Delete temporary tables that were created as intermediate tables during notebook
execution. Deleting tables saves storage, especially if the notebook is scheduled daily.
ADB can retain table metadata and the underlying files even after you run DROP statements. Before recreating temporary tables, use dbutils.fs.rm() to permanently delete the leftover files. If you skip this step, an error message will appear stating that the table already exists. To avoid this error in daily refreshes, use dbutils.fs.rm() first.
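A minimal sketch (the path is hypothetical):
dbutils.fs.rm("/mnt/datalake/tmp/stage_sales", True)   # True = recursive; clears leftovers before the table is recreated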
11. Use Lower() or Upper() when comparing strings or in common filter conditions to avoid losing data
String comparisons in ADB are case-sensitive, so values with different casing will not match. To avoid losing data, apply case conversion with the Lower() or Upper() functions on both sides of the comparison. Example:
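A minimal sketch, using hypothetical tables and columns:
spark.sql("""
    SELECT o.*
    FROM orders o
    JOIN customers c
      ON lower(o.country_code) = lower(c.country_code)   -- case-insensitive join
    WHERE upper(c.market_code) = 'EMEA'                   -- case-insensitive filter
""")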
If your calculation requires multiple steps, you can save time by creating a one-step custom function. ADB offers a variety of built-in SQL functions; however, to create custom functions, known as user-defined functions (UDFs), use Scala. Once you have a custom function, you can call it every time you need to perform that specific calculation.
In ADB, Hive tables do not support UPDATE and MERGE statements or NOT NULL and CHECK constraints. Delta tables do support these commands; however, running them over large amounts of data decreases query performance. So as not to degrade performance, store table versions.
If you need to create intermediate tables, use views instead to minimize storage usage and save costs. Views are session-scoped and are cleaned up automatically after query execution, so nothing persists in storage. For optimal query performance, do not use joins or subqueries in views.
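For example, a session-scoped temporary view in place of a persisted intermediate table (names are hypothetical):
daily_sales = spark.table("sales").filter("sale_date = current_date()")
daily_sales.createOrReplaceTempView("daily_sales_vw")   # exists only for this session; uses no storage
spark.sql("SELECT country_code, SUM(amount) FROM daily_sales_vw GROUP BY country_code")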
AQE improves large query performance. By default, AQE is disabled in ADB. To enable it,
use: set spark.sql.adaptive.enabled = true;
Enabling AQE
AQE can be enabled by setting SQL config spark.sql.adaptive.enabled to true (default false
in Spark 3.0), and applies if the query meets the following criteria:
It contains at least one exchange (usually when there is a join, aggregate, or window operator) or one subquery.
Key AQE capabilities and query-plan concepts include:
1. Optimizing Shuffles
2. Choosing Join Strategies
3. Handling Skew Joins
4. Understand AQE Query Plans
5. The AdaptiveSparkPlan Node
6. The CustomShuffleReader Node
7. Detecting Join Strategy Change
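A sketch of enabling AQE and the related features above from a notebook (standard Spark 3.x settings):
spark.conf.set("spark.sql.adaptive.enabled", "true")                      # turn on AQE
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")   # optimize shuffles
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")             # handle skew joins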
When creating mount points to Azure Data Lake Storage (ADLS), use a service principal client ID and client secret stored in Azure Key Vault (accessed through a Databricks secret scope) to enhance security.
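A sketch of such a mount, assuming a service principal whose credentials sit in a Key Vault-backed secret scope (the scope, secret names, and placeholders are hypothetical):
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="kv-scope", key="sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="kv-scope", key="sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)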
17. Query directly on parquet files from ADLS
If you need to use the data from parquet files, do not extract into ADB in intermediate table
format. Instead, directly query on the parquet file to save time and storage. Example:
SELECT ColumnName FROM parquet.`Location of the file`
18. Choose the cluster mode based on whether jobs run individually or together.
For a group of jobs, or multiple jobs that load dependent tables in parallel or sequentially, choose High Concurrency mode.
1. Deploy a shared cluster instead of letting each user create their own cluster.
2. Create the shared cluster in High Concurrency mode instead of Standard mode.
3. Configure security on the shared High Concurrency cluster, using one of the following
options:
o Turn on AAD Credential Passthrough if you’re using ADLS
o Turn on Table Access Control for all other stores
1. Minimizing Cost: By forcing users to share an autoscaling cluster you have configured with a maximum node count, rather than, say, asking them to create a new one each time they log in, you can control the total cost easily. The maximum cost of a shared cluster can be calculated by assuming it runs X hours at maximum size with the chosen VMs. This is difficult to achieve if each user is given free rein to create clusters of arbitrary size and VM type.
2. Optimizing for Latency: Only High Concurrency clusters have features that allow queries from different users to share cluster resources in a fair, secure manner. HC clusters come with Query Watchdog, a process that keeps disruptive queries in check by automatically pre-empting rogue queries, limiting the maximum size of output rows returned, etc.
3. Security: The Table Access Control feature is only available in High Concurrency mode and needs to be turned on so that users can limit access to their database objects (tables, views, functions, etc.) created on the shared cluster. In the case of ADLS, we recommend restricting access using the AAD Credential Passthrough feature instead of Table Access Control.
It is impossible to predict the correct cluster size without developing the application because
Spark and Azure Databricks use numerous techniques to improve cluster utilization. The broad
approach you should follow for sizing is:
1. Develop on a medium sized cluster of 2-8 nodes, with VMs matched to workload class
as explained earlier.
2. After meeting functional requirements, run an end-to-end test on larger representative data while measuring the CPU, memory, and I/O used by the cluster at an aggregate level.
3. Optimize the cluster to remove the bottlenecks found in step 2:
o CPU bound: add more cores by adding more nodes
o Network bound: use fewer, bigger SSD-backed machines to reduce network traffic and improve remote read performance
o Disk I/O bound: if jobs are spilling to disk, use VMs with more memory.
Repeat steps 2 and 3 by adding nodes and/or evaluating different VMs until all obvious
bottlenecks have been addressed.
Performing these steps will help you arrive at a baseline cluster size that can meet the SLA on a subset of data. In theory, Spark jobs, like jobs on other data-intensive frameworks (e.g., Hadoop), exhibit linear scaling. For example, if it takes 5 nodes to meet the SLA on a 100 TB dataset, and the production data is around 1 PB, then the production cluster is likely to be around 50 nodes in size. You can use this back-of-the-envelope calculation as a first guess for capacity planning.
However, there are scenarios where Spark jobs don’t scale linearly. In some cases this is due
to large amounts of shuffle adding an exponential synchronization cost (explained next), but
there could be other reasons as well. Hence, to refine the first estimate and arrive at a more
accurate node count we recommend repeating this process 3-4 times on increasingly larger
data set sizes, say 5%, 10%, 15%, 30%, etc. The overall accuracy of the process depends on
how closely the test data matches the live workload both in type and size.
20. Specify distribution when publishing data to Azure Data Warehouse (ADW)
Use hash distribution for fact tables or large tables, round robin for dimensional tables,
replicated for small dimensional tables. Example:
(df.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://")   # JDBC URL of the target ADW/Synapse instance (elided in the source)
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "my_table_in_dw_copy")
    .option("tempDir", "wasbs://<container>@<storage-account>.blob.core.windows.net/tmp")   # staging area required by the connector (placeholder path)
    .option("tableOptions", "table_options")   # e.g. "DISTRIBUTION = HASH(join_key)" for a fact table (hypothetical column)
    .save())
VM pricing
Example 1: If you run a Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 instances as All-purpose Compute, the bill is the VM charge (10 instances x 100 hours at the DS13v2 rate) plus the DBU charge (the DBUs those instances consume over 100 hours, billed at the Premium All-purpose Compute DBU price).
Example 2: If you run the same Premium tier cluster for 100 hours with 10 DS13v2 instances as a Jobs Compute workload, the VM charge is identical, but the DBUs are billed at the lower Premium Jobs Compute DBU price.
In addition to VM and DBU charges, there will be additional charges for managed disks,
public IP address, bandwidth, or any other resource such as Azure Storage, Azure Cosmos
DB depending on your application.
ADB offers cluster autoscaling, which is disabled by default. Enable this feature to enhance
job performance. Instead of providing a fixed number of worker nodes during cluster
creation, you should provide a minimum and maximum. ADB then automatically reallocates
the worker nodes based on job characteristics.
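For illustration, a cluster definition sent to the Clusters API can request a worker range instead of a fixed count; the field values below are placeholders:
cluster_spec = {
    "cluster_name": "shared-autoscaling",
    "spark_version": "<runtime-version>",
    "node_type_id": "<vm-size>",
    "autoscale": {"min_workers": 2, "max_workers": 8},   # ADB adds or removes workers within this range
}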
24. Use Azure Data Factory (ADF) to run ADB notebook jobs
If you run numerous notebooks daily, the ADB job scheduler will not be efficient. The ADB job scheduler cannot set notebook dependencies, so you would have to chain all notebooks in one master notebook, which is difficult to debug. Instead, schedule jobs through Azure Data Factory, which enables you to set dependencies and debug easily if anything fails.
Processing notebooks in ADB through ADF can overload the cluster, causing notebooks to fail. If a failure occurs, the entire job should not stop. To continue work from the point of failure, set ADF to retry two to three times with five-minute intervals. As a result, processing should resume after the retry interval, saving you time and effort.
With ADB, you can dump data into multiple resources like ADW or ADLS. Publishing
numerous tables to another resource takes time. If publishing fails, do not restart the entire
process. Implement checkpoints, so that you can restart from the point of failure.
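A simple sketch of the idea, with a hypothetical checkpoint file, table list, and publish helper:
checkpoint_path = "/mnt/datalake/checkpoints/published_tables.txt"   # hypothetical plain-text progress log
tables_to_publish = ["dim_customer", "dim_product", "fact_sales"]
try:
    done = set(dbutils.fs.head(checkpoint_path, 1024 * 1024).splitlines())
except Exception:
    done = set()   # no checkpoint yet: first run
for table in tables_to_publish:
    if table in done:
        continue                  # already published by an earlier (failed) run
    publish_to_adw(table)         # hypothetical helper that writes the table to ADW
    done.add(table)
    dbutils.fs.put(checkpoint_path, "\n".join(sorted(done)), True)   # record progress after each table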
Your business’s data has never been more valuable. Additional security is a worthwhile
investment. ADB Premium includes 5-level access control.