Azure Databricks: Job Performance Monitoring, Troubleshooting and Optimization
What to monitor
Using Cluster UI
Using Spark UI
Spark Logs
A Databricks cluster has a few components worth knowing before you start monitoring:

• Driver node: schedules jobs, loads libraries, and runs the Spark REPL.
• Worker node: runs Spark executors, which run tasks and exchange/shuffle data.
• Spark master: the standalone master that coordinates the cluster.

You can monitor at two levels, with these tools:

• Cluster: Cluster UI, Ganglia, Grafana, Spark UI, and Spark Logs (driver log, executor log, event log) cover CPU, memory, network, disk, and IO.
• JVM: low-level debugging beyond what Spark UI and Spark Logs can provide.
Using Cluster UI
The Databricks Cluster UI provides a comprehensive interface for managing and monitoring clusters within the Databricks environment. When reviewing a job run there, pay attention to:
• Sequence of jobs
• Delay in between jobs
• Timeline
• Metrics
Using Spark UI

• The Spark UI provides a high-level view and information that logs do not, while logs enable precise root-cause analysis. Combining both gives a complete view of the issue.
https://docs.databricks.com/en/compute/debugging-spark-ui.html
The Spark UI is organized into the following tabs:
• Jobs
• Stages
• Storage
• Environment
• Executors
• SQL / DataFrame
• JDBC/ODBC Server
• Structured Streaming
• Connect
On the Stages tab you can view information such as stage ID, description, number of tasks, input/output size, and duration for each stage.
Monitoring the task count helps you understand the parallelism and distribution of work within each stage.
Monitoring the shuffle read/write data size helps identify stages with excessive data shuffling, which may indicate inefficient join operations or a skewed data distribution.
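One quick way to confirm skew, as a minimal sketch (the input path and DataFrame are hypothetical), is to count the records in each partition and look for outliers:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val df = spark.read.parquet("/mnt/data/events") // hypothetical path

// Count rows per partition; a few oversized partitions suggest skew
df.rdd
  .mapPartitionsWithIndex((i, rows) => Iterator((i, rows.size)))
  .collect()
  .sortBy { case (_, n) => -n }
  .take(10)
  .foreach { case (i, n) => println(s"partition $i: $n rows") }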
Stages of each job: you can drill into each job and inspect its properties through the DAG diagram.
DAG visualization: a visual representation of the directed acyclic graph of the job, where vertices represent the RDDs or DataFrames and the edges represent an operation to be applied on the RDD.
You can read more about the DAG here: https://spark.apache.org/docs/3.1.2/web-ui.html
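If you want the same plan information inside a notebook, a small sketch (the path and column name are assumptions) is to call explain() on a DataFrame; the physical plan it prints corresponds to what the DAG visualizes:

// spark is the SparkSession a Databricks notebook already provides
val df = spark.read.parquet("/mnt/data/events") // hypothetical path
val counts = df.groupBy("country").count()      // "country" column is an assumption
counts.explain(true) // true also prints the parsed, analyzed, and optimized logical plans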
Summary metrics for completed tasks: summary metrics for all tasks are represented in a table and in a timeline.
Executors Tab
The Executors tab displays summary information about the executors
that were created for the application, including memory and disk usage
and task and shuffle information. The Storage Memory column shows
the amount of memory used and reserved for caching data.
Pay particular attention to:
• GC time
• Shuffle
You can always view additional metrics such as “On Heap Memory” and “Off Heap Memory”; the Executors tab exposes the full list of options.
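As an illustrative sketch (the path is hypothetical), caching a DataFrame and materializing it with an action makes the usage show up in the Storage Memory column:

// spark is the SparkSession a Databricks notebook already provides
val df = spark.read.parquet("/mnt/data/events") // hypothetical path
df.cache()  // mark the DataFrame for in-memory caching
df.count()  // an action materializes the cache; usage then appears under Storage Memory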
Spark Logs:

Here is a summary of the key aspects of Databricks and its Spark logs:

Driver log: provides essential information about the Spark driver, including stack traces of exceptions.

Batch initialization: ensures that each batch starts with the necessary context and environment.

Task scheduling: the driver log also records how the scheduler assigns tasks to executors.
GC Logs

In the GC logs you can find heap-memory-related information, including:

• GC time in the Spark UI: GC time is also visible in the Spark UI, on the Executors tab.
• Full GC and pauses: full GC events can lead to pauses, causing delays in job execution.
• OldGen accumulation: an increase in OldGen over time indicates object accumulation. Restarting the driver or executor can help clean up heap space.
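To capture these GC logs in the first place, one option (a sketch using standard JVM flags, not anything Databricks-specific) is to add verbose GC options for the driver and executor JVMs in the cluster's Spark config:

spark.driver.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps

Note that the Print* flags apply to Java 8 era JVMs; newer runtimes use -Xlog:gc* instead.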
Other common causes of slowness:

• Slow VM node: slow VMs skipped during cluster startup can result in fewer initial worker nodes than configured.
• Throttling
• Network latency
• Concurrent workload
• Disk issues: a driver or executor running out of disk space can cause jobs to hang.
Slow aggregations and slow joins:

• Total memory across all executors determines how much data can be stored in memory.
• Shuffle operations perform better on a cluster with large memory and fewer workers; broadcasting a small table can avoid the shuffle entirely (sketched below).
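When a join is slow because of shuffle, one common mitigation is to broadcast the smaller side so the large side is joined in place. A minimal sketch, with hypothetical paths and join key:

import org.apache.spark.sql.functions.broadcast

val largeDf = spark.read.parquet("/mnt/data/facts")      // hypothetical paths
val smallDf = spark.read.parquet("/mnt/data/dimensions")

// Broadcasting the small side ships a copy to every executor, so the
// large side is never shuffled across the network
val joined = largeDf.join(broadcast(smallDf), Seq("id")) // "id" join key is an assumption
joined.show(5)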
• Cluster mode
• Use the latest DBR version for all-purpose clusters to pick up the latest optimizations.
• Enable the Query Watchdog to stop queries whose output grows disproportionately to their input:

spark.conf.set("spark.databricks.queryWatchdog.enabled", true)
spark.conf.set("spark.databricks.queryWatchdog.outputRatioThreshold", 1000L)
spark.conf.set("spark.databricks.queryWatchdog.minTimeSecs", 10L)
spark.conf.set("spark.databricks.queryWatchdog.minOutputRows", 100000L)
• Use sortWithinPartitions when ordering within each partition is enough; it avoids a global shuffle (see the sketch after this list).
• Set spark.serializer = org.apache.spark.serializer.KryoSerializer.
• Ensure only long-lived cached datasets are stored in the Old generation.
• GC tuning:
  • Full GC occurring multiple times before a task completes => decrease the memory used for caching.
  • Too many minor collections but not many major collections => allocate more memory for Eden.
• Python UDFs and RDDs are a black box to Spark: it cannot apply the code optimizations available on the structured APIs.
• Serialization of objects to and from Python for UDFs and RDDs is very expensive.
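A brief sketch of two of the tips above (the path and column name are hypothetical): sortWithinPartitions sorts each partition locally without a global shuffle, and Kryo is enabled through the Spark config before the session starts:

import org.apache.spark.sql.SparkSession

// On Databricks, set spark.serializer in the cluster's Spark config;
// it cannot be changed at runtime once the JVM is up
val spark = SparkSession.builder
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val df = spark.read.parquet("/mnt/data/events")   // hypothetical path

// Sorts rows inside each partition only; unlike orderBy, no exchange is triggered
val locallySorted = df.sortWithinPartitions("ts") // "ts" column is an assumption
locallySorted.write.mode("overwrite").parquet("/mnt/data/events_sorted")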
Finally, you can get support from Microsoft; make sure to provide the following information when opening a case with them:
• Workspace ID
• Cluster ID
Written by Prashanth Kumar