Understanding Query Plans
and Spark UIs
Xiao Li @ gatorsmile
Spark + AI Summit @ SF | April 2019
About Me
• Engineering Manager at Databricks
• Apache Spark Committer and PMC Member
• Previously, IBM Master Inventor
• Spark, Database Replication, Information Integration
• Ph.D. from the University of Florida
• Github: gatorsmile
Databricks Customers Across Industries
Financial Services Healthcare & Pharma Media & Entertainment Technology
Public Sector Retail & CPG Consumer Services Energy & Industrial IoT Marketing & AdTech
Data & Analytics Services
DATABRICKS WORKSPACE
Databricks Delta ML Frameworks
DATABRICKS CLOUD SERVICE
DATABRICKS RUNTIME
Reliable & Scalable Simple & Integrated
Databricks Unified Analytics Platform
APIs
Jobs
Models
Notebooks
Dashboards End to end ML lifecycle
Apache Spark 3.x
Catalyst Optimization & Tungsten Execution
SparkSession / DataFrame / DataSet APIs
SQL
Spark ML
Spark
Streaming
Spark
Graph
3rd-party
Libraries
Spark Core
Data Source Connectors
From declarative queries to RDDs
Cypher
Maximize Performance
Read Plan.
Interpret Plan.
Tune Plan.
Track Execution.
Read Plans from
SQL Tab in either
Spark UI or Spark
History Server
Spark 3.0: Show the actual SQL statement? [SPARK-27045]
Page: In Details for SQL Query
Parsed Plan -> Analyzed Plan -> Optimized Plan -> Physical Plan
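The four plans can be read side by side from an extended explain, for example the text printed by df.explain(extended=True) or returned by an EXPLAIN EXTENDED statement. The sketch here only splits such text into its sections so each stage can be inspected separately. The SAMPLE text is hand-abbreviated and hypothetical, and split_plan_sections is an illustrative helper, not a Spark API.

```python
import re

# Section headers that Spark prints in an extended explain.
SECTION_HEADER = re.compile(r"^== (.+?) ==$")


def split_plan_sections(explain_text):
    """Return {section title: body text}, in the order the sections appear."""
    sections = {}
    current = None
    for line in explain_text.splitlines():
        match = SECTION_HEADER.match(line.strip())
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}


# Hypothetical, hand-abbreviated output; real plans are much longer.
SAMPLE = """\
== Parsed Logical Plan ==
'Project [*]
== Analyzed Logical Plan ==
id: bigint
Project [id#0L]
== Optimized Logical Plan ==
Range (0, 10, step=1, splits=None)
== Physical Plan ==
*(1) Range (0, 10, step=1, splits=8)
"""

sections = split_plan_sections(SAMPLE)
for title in sections:
    print(title)
```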
Understand and Tune Plans
Different Results!!!
Read the analyzed plan to check the implicit type casting.
Tip: Explicitly cast the types in the queries.
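Why an implicit cast can change results is easy to see outside Spark. This is a plain-Python analogy only: it does not reproduce Spark's coercion rules, which show up as cast(...) expressions in the analyzed plan. It shows that the same filter gives different answers depending on the type in which the comparison is made.

```python
# Plain-Python analogy; not Spark's type coercion rules.
values = ["9", "10", "100"]  # numbers that happen to be stored as strings

# Compared as strings, "10" and "100" sort before "9" character by character.
as_strings = [v for v in values if v > "9"]

# Compared as integers, 10 and 100 are larger than 9.
as_integers = [v for v in values if int(v) > 9]

print(as_strings)   # []
print(as_integers)  # ['10', '100']
```

Writing the cast explicitly in the query makes the intended comparison visible instead of leaving the choice to the analyzer.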
Create Hive Tables
Syntax to create a Hive Serde table
Hive serde reader
Read Tables
filter pushdown
Native
reader/writer
performs faster
than Hive serde
reader/writer
Create Native Tables
Syntax to create a Spark native ORC table
Tip:
Create native
data source
tables for better
performance
and stability.
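The two kinds of table differ in one clause of the DDL: STORED AS creates a Hive serde table, while USING creates a Spark native data source table. The slide's original statements are not in this transcript, so this is a minimal sketch with a hypothetical table name and columns. create_table_ddl is an illustrative helper whose output would be passed to spark.sql(...).

```python
def create_table_ddl(table, columns, file_format="ORC", native=True):
    """Build a CREATE TABLE statement for a native or a Hive serde table.

    columns is a list of (name, type) pairs.
    """
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    # USING <format>     -> Spark native data source table
    # STORED AS <format> -> Hive serde table
    clause = f"USING {file_format}" if native else f"STORED AS {file_format}"
    return f"CREATE TABLE {table} ({column_list}) {clause}"


# Hypothetical table and columns, for illustration only.
columns = [("id", "BIGINT"), ("name", "STRING")]
native_ddl = create_table_ddl("events_native", columns, native=True)
hive_ddl = create_table_ddl("events_hive", columns, native=False)

print(native_ddl)
print(hive_ddl)
```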
Push Down + Implicit Type Casting
Not pushed down???
Tip:
Cast is needed?
Update the
constants?
Nested Schema Pruning
Not pruned???
Collapse Projects
Call UDF three times!!!
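The effect can be pictured without Spark. When two projections are collapsed into one, each reference to an aliased column is replaced by the expression that produced it, so a UDF referenced three times is evaluated three times. This is a plain-Python analogy, not the optimizer's code, and CountingUdf is a hypothetical stand-in for an expensive UDF.

```python
class CountingUdf:
    """Stand-in for an expensive UDF that records how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return x * 2


def collapsed(udf, x):
    # After collapsing, every use of the alias is the UDF expression itself.
    return udf(x) + udf(x) + udf(x)


def materialized(udf, x):
    # Without collapsing, the UDF result is computed once and reused.
    y = udf(x)
    return y + y + y


udf_a = CountingUdf()
udf_b = CountingUdf()
print(collapsed(udf_a, 5), udf_a.calls)     # 30 3
print(materialized(udf_b, 5), udf_b.calls)  # 30 1
```

Both forms return the same value; only the number of UDF evaluations differs, which is why the physical plan is worth checking when a UDF is costly.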
Cross-session SQL Cache
• If a query is cached in one session, new
queries in all sessions might be affected.
• Check your query plan!
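One way to check is to look for the cache operators in the plan text: a query served from the SQL cache shows InMemoryRelation in its logical plan and InMemoryTableScan in its physical plan. This is a minimal sketch in which reads_from_cache is an illustrative helper and both sample plans are hand-abbreviated and hypothetical.

```python
# Operator names that indicate the query reads cached data.
CACHE_OPERATORS = ("InMemoryRelation", "InMemoryTableScan")


def reads_from_cache(plan_text):
    """Return True if the plan text mentions a cache operator."""
    return any(operator in plan_text for operator in CACHE_OPERATORS)


# Hypothetical, abbreviated physical plans.
cached_plan = """\
== Physical Plan ==
InMemoryTableScan [id#0L]
   +- InMemoryRelation [id#0L], StorageLevel(disk, memory, deserialized, 1 replicas)
"""
uncached_plan = """\
== Physical Plan ==
*(1) Range (0, 10, step=1, splits=8)
"""

print(reads_from_cache(cached_plan))    # True
print(reads_from_cache(uncached_plan))  # False
```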
Join Hints in Spark 3.0
• BROADCAST
• Broadcast Hash/Nested-loop Join
• MERGE
• Shuffle Sort Merge Join
• SHUFFLE_HASH
• Shuffle Hash Join
• SHUFFLE_REPLICATE_NL
• Shuffle-and-Replicate Nested Loop Join
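In SQL, a join hint is written as a /*+ ... */ comment right after SELECT. This sketch builds such a statement and rejects hint names outside the four listed on this slide. select_with_join_hint is an illustrative helper, the table names and aliases are hypothetical, and the resulting string would be passed to spark.sql(...).

```python
# The four join strategy hints listed for Spark 3.0.
JOIN_HINTS = {"BROADCAST", "MERGE", "SHUFFLE_HASH", "SHUFFLE_REPLICATE_NL"}


def select_with_join_hint(hint, relation, select_body):
    """Return 'SELECT /*+ HINT(relation) */ <select_body>'."""
    hint = hint.upper()
    if hint not in JOIN_HINTS:
        raise ValueError(f"unknown join hint: {hint}")
    return f"SELECT /*+ {hint}({relation}) */ {select_body}"


# Hypothetical tables 'facts' and 'dims'.
query = select_with_join_hint(
    "broadcast", "d", "* FROM facts f JOIN dims d ON f.key = d.key"
)
print(query)
```

A hint is a request rather than a guarantee, so the physical plan is still the place to confirm which join was chosen.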
Track Execution
From
SQL query
to
Spark Jobs
• A SQL query => multiple Spark jobs.
  - For example, broadcast exchange, shuffle exchange, scalar subquery.
  - External data sources: Delta Lake.
  - New adaptive query execution.
• A Spark job => a DAG
  - A chain of RDD dependencies organized in a directed acyclic graph (DAG)
The higher-level SQL physical operators.
[Diagram labels: Optimized Logical Plan, Planner, Physical Plans, Cost Model, Selected Physical Plan, Query Execution, DAGs]
The low-level Spark RDD primitives.
Job Tab in Spark UI
The amount of time for each job.
Any stage/task failure?
Job Tab
The amount of time for each stage.
• Jobs
• Stages
• Tasks
Stages Tab
• How is the time spent?
• Any outlier in task execution?
• Straggler tasks?
• Skew in data size, compute time?
• Too many/few tasks (partitions)?
• Load balanced? Locality?
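The outlier question can be made concrete with the task durations shown in the Stages tab. This minimal sketch flags tasks that run much longer than the median. The threshold factor of 3 and the sample durations are arbitrary assumptions, and find_stragglers is an illustrative helper, not part of the Spark UI.

```python
from statistics import median


def find_stragglers(task_durations, factor=3.0):
    """Return the ids of tasks slower than factor times the median duration.

    task_durations maps a task id to its duration in seconds.
    """
    typical = median(task_durations.values())
    return sorted(
        task_id
        for task_id, duration in task_durations.items()
        if duration > factor * typical
    )


# Hypothetical durations (seconds) for the five tasks of one stage.
durations = {0: 10, 1: 11, 2: 9, 3: 12, 4: 95}
print(find_stragglers(durations))  # [4]
```

A single slow task on otherwise even durations usually points to data skew or a slow executor; many slow tasks point elsewhere, for example to too few partitions.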
Tasks specific info
Balanced? Skew?
Killed?
Which executor's log should we read?
Executors Tab
size of data transferred
between stages
used/available memory
All the problematic executors in the same node?
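Answering that question amounts to grouping the problematic executors by host. This is a minimal sketch with hypothetical executor ids and host names, of the kind that could be copied from the Executors tab. problem_hosts is an illustrative helper.

```python
from collections import Counter


def problem_hosts(executor_hosts, problematic_ids):
    """Count problematic executors per host.

    executor_hosts maps an executor id to the host it runs on.
    """
    counts = Counter(executor_hosts[executor_id] for executor_id in problematic_ids)
    return dict(counts)


# Hypothetical executors and hosts.
executor_hosts = {"1": "node-a", "2": "node-b", "3": "node-a", "4": "node-c"}
print(problem_hosts(executor_hosts, ["1", "3"]))  # {'node-a': 2}
```

If every problematic executor maps to one host, the node itself (disk, network, noisy neighbours) is a more likely suspect than the query.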
- Interacting with Hive metastore?
- Slow query planning?
- Slow file listing?
Insert
Partitioned
Hive
Table OR “STORED AS PARQUET”
5000 partitions took
almost 8 minutes!!!
Insert
Partitioned
Native
Table
Reduced from almost 8 minutes
to less than 1 minute !!!
Insert
Partitioned
Delta
Table
Reduced from almost 8 minutes
to 27 seconds!!!
Typical Spark Performance Issues
The table has thousands of partitions
• Hive metastore overhead
This table can have 100s of thousands to millions of files
• File system overhead - listing takes forever!
New data is not immediately visible
• Need to run the “REFRESH TABLE” command in the SQL
engine being used
The above issues can add 10s of minutes to the response time!
Delta Lake + Spark
Scalable metadata handling @ Delta Lake
Store metadata in transaction log file instead of metastore
The table has thousands of partitions
• Zero Hive Metastore overhead
The table can have 100s of thousands to millions of files
• No file listing
New data is not immediately visible
• Delta table state is computed on read
How do I use Delta?
format("parquet") -> format("delta")
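In DataFrame code, the switch is the single argument passed to format(...) on the writer. This is a minimal sketch, assuming a DataFrame df already exists and the Delta Lake package is available to the Spark session. save_table, the path, and the "overwrite" mode are illustrative choices, not taken from the talk.

```python
def save_table(df, path, table_format="delta"):
    """Write df to path in the given format.

    Passing table_format="parquet" gives the earlier behaviour; the rest of
    the call chain stays the same.
    """
    df.write.format(table_format).mode("overwrite").save(path)


# Usage, assuming an existing DataFrame `df` and a hypothetical path:
#   save_table(df, "/tmp/events")             # Delta table
#   save_table(df, "/tmp/events", "parquet")  # plain Parquet files
```

Reading works the same way, with spark.read.format("delta").load(path) in place of format("parquet").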
Delta Lake + Spark
• Full ACID transactions
• Schema management
• Data versioning and time travel
• Unified batch/streaming support
• Scalable metadata handling
• Record update and deletion
• Data expectation
Delta Lake: https://delta.io/
For details, refer to the blog https://tinyurl.com/yxhbe2lg
Delta Usage Statistics
More than 1 exabyte (10^18 bytes) processed monthly
Manufacturing Public Sector Technology Other
Healthcare and Life Sciences Financial Services Media and Entertainment Retail, CPG, and eCommerce
Additional Resources
• Apache Spark document: https://spark.apache.org/docs/latest/sql-programming-guide.html
• Blog: https://databricks.com/blog/category/engineering/spark
• Previous summit: https://databricks.com/sparkaisummit/north-america/sessions
• Delta Lake document: https://docs.delta.io
• Databricks document: https://docs.databricks.com/
• Books: https://www.amazon.com/s?k=apache+spark
• Databricks academy: https://academy.databricks.com
• Databricks ebooks: https://databricks.com/resources/type/ebooks
Thank you
Xiao Li
(lixiao@databricks.com)
