Introduction to DataFusion
An Embeddable Query Engine
Written in Rust
CC BY-SA
Today: IOx Team at InfluxData
Past life 1: Query Optimizer @ Vertica; also worked on the Oracle DB server
Past life 2: Chief Architect + VP Engineering roles at some ML startups
Talk Outline
What is a Query Engine
Introduction to DataFusion / Apache Arrow
DataFusion Architectural Overview
Motivation
Data is stored somewhere, and users want to access that data without writing a program.
UIs (visual and textual) sit between those users and the data.
Between the UIs and the stored data sits a Query Engine, and SQL is the common interface.
DataFusion Use Cases
1. Data engineering / ETL:
a. Construct fast and efficient data pipelines (~ Spark)
2. Data Science
a. Prepare data for ML / other tasks (~ Pandas)
3. Database Systems:
a. E.g. IOx, Ballista, Cloudfuse Buzz, various internal systems
Why DataFusion?
High Performance: memory efficiency (no GC) and speed, leveraging Rust/Arrow
Easy to Connect: Interoperability with other tools via Arrow, Parquet and Flight
Easy to Embed: Can extend data sources, functions, operators
First Class Rust: High quality Query / SQL Engine entirely in Rust
High Quality: Extensive tests and integration tests with Arrow ecosystems
My goal: DataFusion to be *the* choice for any SQL support in Rust
DBMS vs Query Engine
Database Management Systems (DBMS) are full featured systems
● Storage system (stores actual data)
● Catalog (store metadata about what is in the storage system)
● Query Engine (query, and retrieve requested data)
● Access Control and Authorization (users, groups, permissions)
● Resource Management (divide resources between uses)
● Administration utilities (monitor resource usage, set policies, etc)
● Clients for Network connectivity (e.g. implement JDBC, ODBC, etc)
● Multi-node coordination and management
DataFusion is just the Query Engine piece.
What is DataFusion?
“DataFusion is an in-memory query engine
that uses Apache Arrow as the memory
model” - crates.io
● In Apache Arrow github repo
● Apache licensed
● Not part of the Arrow spec, uses Arrow
● Initially implemented and donated by
Andy Grove; design based on How
Query Engines Work
DataFusion + Arrow + Parquet
Related crates in the Apache Arrow repository: arrow, datafusion, parquet, arrow-flight
DataFusion Extensibility 🧰
● User Defined Functions
● User Defined Aggregates
● User Defined Optimizer passes
● User Defined LogicalPlan nodes
● User Defined ExecutionPlan nodes
● User Defined TableProvider for tables
* Built-in data persistence using Parquet and CSV files
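For example, registering a user defined function might look like the following sketch. It is written against the DataFusion API of roughly this era; pow2 is a hypothetical UDF, and exact import paths and signatures vary between versions:

use std::sync::Arc;
use datafusion::arrow::array::{ArrayRef, Float64Array};
use datafusion::arrow::datatypes::DataType;
use datafusion::physical_plan::functions::make_scalar_function;
use datafusion::prelude::*;

fn main() -> datafusion::error::Result<()> {
    let mut ctx = ExecutionContext::new();

    // The UDF body is vectorized: it receives whole Arrow arrays
    let pow2 = |args: &[ArrayRef]| {
        let x = args[0].as_any().downcast_ref::<Float64Array>().unwrap();
        let result: Float64Array = x.iter().map(|v| v.map(|v| v * v)).collect();
        Ok(Arc::new(result) as ArrayRef)
    };

    let pow2 = create_udf(
        "pow2",                      // name used in SQL / expressions
        vec![DataType::Float64],     // argument types
        Arc::new(DataType::Float64), // return type
        make_scalar_function(pow2),
    );
    ctx.register_udf(pow2);
    // Now callable from SQL: SELECT pow2(value) FROM t;
    Ok(())
}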
What is a Query Engine?
1. Frontend
a. Query Language + Parser
2. Intermediate Query Representation
a. Expression / Type system
b. Query Plan w/ Relational Operators (Data Flow Graph)
c. Rewrites / Optimizations on that graph
3. Concrete Execution Operators
a. Allocate resources (CPU, Memory, etc)
b. Push bytes around, vectorized calculations, etc
DataFusion is a Query Engine!
1. Frontend: SQLStatement (a Rust struct)
2. Intermediate Query Representation: LogicalPlan + Expr
3. Concrete Execution Operators: ExecutionPlan, producing RecordBatches
DataFusion Input / Output Diagram
Input is either a SQL Query:
SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
OR a DataFrame:
ctx.read_table("http")?
.filter(...)?
.aggregate(..)?;
plus Catalog information (tables, schemas, etc). The output, either way, is RecordBatches.
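Put together, a minimal end-to-end use might look like this. This is a sketch against the DataFusion API of this era, where ctx.sql() was synchronous; the file name matches the CLI example on the next slide:

use datafusion::arrow::record_batch::RecordBatch;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let mut ctx = ExecutionContext::new();
    // Register catalog information: a table backed by a parquet file
    ctx.register_parquet("http_api_requests_total", "http_api_requests_total.parquet")?;
    let df = ctx.sql(
        "SELECT status, COUNT(1) FROM http_api_requests_total \
         WHERE path = '/api/v2/write' GROUP BY status",
    )?;
    // Output is Arrow RecordBatches
    let results: Vec<RecordBatch> = df.collect().await?;
    println!("{} batches", results.len());
    Ok(())
}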
DataFusion in Action
DataFusion CLI
> CREATE EXTERNAL TABLE http_api_requests_total
STORED AS PARQUET
LOCATION 'http_api_requests_total.parquet';

> SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
+--------+-----------------+
| status | COUNT(UInt8(1)) |
+--------+-----------------+
| 4XX    | 73621           |
| 2XX    | 338304          |
+--------+-----------------+
EXPLAIN Plan
Gets a textual representation of the LogicalPlan:
> EXPLAIN SELECT status, COUNT(1) FROM http_api_requests_total
WHERE path = '/api/v2/write' GROUP BY status;
+--------------+----------------------------------------------------------+
| plan_type    | plan                                                     |
+--------------+----------------------------------------------------------+
| logical_plan | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] |
|              | Selection: #path Eq Utf8("/api/v2/write")                |
|              | TableScan: http_api_requests_total projection=None       |
+--------------+----------------------------------------------------------+
Plans as DataFlow graphs
TableScan: http_api_requests_total projection=None
↓
Filter: #path Eq Utf8("/api/v2/write")
↓
Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]
Step 1: Parquet file is read (TableScan)
Step 2: Predicate is applied (Filter)
Step 3: Data is aggregated (Aggregate)
Data flows up from the leaves to the root of the tree.
More than initially meets the eye
Use EXPLAIN VERBOSE to see optimizations applied
> EXPLAIN VERBOSE SELECT status, COUNT(1) FROM http_api_requests_total
WHERE path = '/api/v2/write' GROUP BY status;
+----------------------+------------------------------------------------------------+
| plan_type            | plan                                                       |
+----------------------+------------------------------------------------------------+
| logical_plan         | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]   |
|                      | Selection: #path Eq Utf8("/api/v2/write")                  |
|                      | TableScan: http_api_requests_total projection=None         |
| projection_push_down | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]   |
|                      | Selection: #path Eq Utf8("/api/v2/write")                  |
|                      | TableScan: http_api_requests_total projection=Some([6, 8]) |
| type_coercion        | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]   |
|                      | Selection: #path Eq Utf8("/api/v2/write")                  |
|                      | TableScan: http_api_requests_total projection=Some([6, 8]) |
...
+----------------------+------------------------------------------------------------+
The optimizer "pushed down" the projection, so only the status and path columns were read from the parquet file.
Data Representation
Array + Record Batches + Schema
A table of query results:
+--------+--------+
| status | COUNT  |
+--------+--------+
| 4XX    | 73621  |
| 2XX    | 338304 |
| 5XX    | 42     |
| 1XX    | 3      |
+--------+--------+
is represented as a stream of RecordBatches. Each RecordBatch carries:
schema: fields[0]: "status", Utf8; fields[1]: "COUNT()", UInt64
cols: one Arrow array per field, here a StringArray (4XX, 2XX, 5XX) plus a UInt64Array (73621, 338304, 42) in the first batch, and a StringArray (1XX) plus a UInt64Array (3) in the second.
* StringArray representation is somewhat misleading, as it actually has a fixed-length portion and the character data in different locations.
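For reference, building such a RecordBatch by hand with the arrow crate looks like the following sketch; the COUNT column name follows the CLI output above:

use std::sync::Arc;
use arrow::array::{StringArray, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("status", DataType::Utf8, false),
        Field::new("COUNT(UInt8(1))", DataType::UInt64, false),
    ]));
    // One Arrow array per column, all of equal length
    let status = StringArray::from(vec!["4XX", "2XX", "5XX", "1XX"]);
    let counts = UInt64Array::from(vec![73621_u64, 338304, 42, 3]);
    let batch = RecordBatch::try_new(schema, vec![Arc::new(status), Arc::new(counts)])?;
    assert_eq!(batch.num_rows(), 4);
    Ok(())
}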
Query Planning
DataFusion Planning Flow
SQL Query:
SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
→ Parsing/Planning → LogicalPlan
→ Optimization → (optimized) LogicalPlan → ExecutionPlan
→ Execution → RecordBatches
The LogicalPlan is also called a "Query Plan" (PG: "Query Tree"); the ExecutionPlan is also called an "Access Plan" or "Operator Tree" (PG: "Plan Tree").
DataFusion Logical Plan Creation
● Declarative: Describe WHAT you want; system figures out HOW
○ Input: “SQL” text (postgres dialect)
● Procedural: Describe HOW directly
○ Input is a program to build up the plan
○ Two options:
■ Use a LogicalPlanBuilder, a Rust-style builder
■ DataFrame - model popularized by Pandas and Spark
SQL → LogicalPlan
SQL Query:
SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
→ SQL Parser → Parsed Statement:
Query {
ctes: [],
body: Select(
Select {
distinct: false,
top: None,
projection: [
UnnamedExpr(
Identifier(
Ident {
value: "status",
quote_style: None,
},
),
),
...
→ Planner → LogicalPlan
“DataFrame” → Logical Plan
Rust Code
let df = ctx
.read_table("http_api_requests_total")?
.filter(col("path").eq(lit("/api/v2/write")))?
.aggregate(vec![col("status")], vec![count(lit(1))])?;
DataFrame (Builder) → LogicalPlan
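A complete, runnable version of that snippet might look like the following sketch; ctx.table() is used here in place of read_table, and exact method availability varies by DataFusion version:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let mut ctx = ExecutionContext::new();
    ctx.register_parquet("http_api_requests_total", "http_api_requests_total.parquet")?;

    // Same query as the SQL version, built procedurally
    let df = ctx
        .table("http_api_requests_total")?
        .filter(col("path").eq(lit("/api/v2/write")))?
        .aggregate(vec![col("status")], vec![count(lit(1))])?;

    // Executing the DataFrame yields Arrow RecordBatches
    let results = df.collect().await?;
    println!("{} batches", results.len());
    Ok(())
}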
Supported Logical Plan operators (source link)
Projection
Filter
Aggregate
Sort
Join
Repartition
TableScan
EmptyRelation
Limit
CreateExternalTable
Explain
Extension
Query Optimization Overview
Compute the same (correct) result, only faster.
LogicalPlan (input) → Optimizer Pass 1 → LogicalPlan (intermediate) → Optimizer Pass 2 → … Other Passes … → LogicalPlan (output)
The sequence of passes, taken together, is the "Optimizer".
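To make the pass pipeline concrete, here is a minimal, hypothetical sketch: the names below (OptimizerPass, run_optimizer, the placeholder LogicalPlan) are illustrative only, and mirror the shape of DataFusion's OptimizerRule trait rather than its exact API:

// Conceptual sketch only; the real trait lives in datafusion::optimizer
#[derive(Debug, Clone)]
struct LogicalPlan; // stand-in for the real plan tree

trait OptimizerPass {
    fn name(&self) -> &str;
    fn optimize(&self, plan: LogicalPlan) -> Result<LogicalPlan, String>;
}

fn run_optimizer(
    mut plan: LogicalPlan,
    passes: &[Box<dyn OptimizerPass>],
) -> Result<LogicalPlan, String> {
    // Each pass consumes the previous pass's output, so rewrites compose
    for pass in passes {
        plan = pass.optimize(plan)?;
    }
    Ok(plan)
}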
Built in DataFusion Optimizer Passes (source link)
ProjectionPushDown: Minimize the number of columns passed from node to node
to minimize intermediate result size (number of columns)
FilterPushdown (“predicate pushdown”): Push filters as close to scans as possible
to minimize intermediate result size
HashBuildProbeOrder (“join reordering”): Order joins to minimize the intermediate
result size and hash table sizes
ConstantFolding: Partially evaluates expressions at plan time, e.g. ColA && true → ColA
Expression Evaluation
Expression Evaluation
Arrow Compute Kernels typically operate on 1 or 2 arrays and/or scalars.
Partial list of included comparison kernels:
eq Perform left == right operation on two arrays.
eq_scalar Perform left == right operation on an array and a scalar value.
eq_utf8 Perform left == right operation on StringArray / LargeStringArray.
eq_utf8_scalar Perform left == right operation on StringArray / LargeStringArray and a scalar.
and Performs AND operation on two arrays. If either left or right value is null then the result is also null.
is_not_null Returns a non-null BooleanArray with whether each value of the array is not null.
or Performs OR operation on two arrays. If either left or right value is null then the result is also null.
...
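For example, evaluating path = '/api/v2/write' OR path IS NULL with these kernels directly might look like this sketch, using arrow's compute module (kernel module paths have moved between arrow versions):

use arrow::array::{BooleanArray, StringArray};
use arrow::compute::kernels::boolean::{is_null, or};
use arrow::compute::kernels::comparison::eq_utf8_scalar;

fn main() -> arrow::error::Result<()> {
    let path = StringArray::from(vec!["/api/v2/write", "/api/v1/write", "/foo/bar"]);

    // path = '/api/v2/write'
    let eq: BooleanArray = eq_utf8_scalar(&path, "/api/v2/write")?;
    // path IS NULL
    let null: BooleanArray = is_null(&path)?;
    // path = '/api/v2/write' OR path IS NULL
    let mask: BooleanArray = or(&eq, &null)?;

    assert_eq!(mask.value(0), true);
    assert_eq!(mask.value(2), false);
    Ok(())
}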
Exprs for evaluating arbitrary expressions
path = '/api/v2/write' OR path IS NULL
The expression is represented as a tree of Exprs:
BinaryExpr (op: Or)
  left: BinaryExpr (op: Eq)
    left: Column(path)
    right: Literal(ScalarValue::Utf8 '/api/v2/write')
  right: IsNull(Column(path))
Expression Builder API:
col("path")
.eq(lit("/api/v2/write"))
.or(col("path").is_null())
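The same expression as a self-contained snippet (a sketch; col, lit, and Expr live in datafusion::logical_plan in this era):

use datafusion::logical_plan::{col, lit, Expr};

fn main() {
    // path = '/api/v2/write' OR path IS NULL
    let expr: Expr = col("path")
        .eq(lit("/api/v2/write"))
        .or(col("path").is_null());
    println!("{:?}", expr);
}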
Expr Vectorized Evaluation
The Expr tree is evaluated bottom-up; each node consumes and produces whole Arrow arrays:
1. Column(path) evaluates to the input StringArray: /api/v2/write, /api/v1/write, /api/v2/read, /api/v2/write, …, /api/v2/write, /foo/bar.
2. Literal evaluates to ScalarValue::Utf8(Some("/api/v2/write")).
3. BinaryExpr (op: Eq), given a StringArray and a scalar, calls the eq_utf8_scalar kernel and produces a BooleanArray: True, False, False, True, …, True, False.
4. IsNull evaluates its Column(path) input (the same StringArray) and produces a BooleanArray: False, False, False, False, …, False, False (no path values are null).
5. BinaryExpr (op: Or) combines the two BooleanArrays and produces the final BooleanArray: True, False, False, True, …, True, False.
Type Coercion
sqrt(col)
sqrt(col) → sqrt(CAST col AS Float32)
col is Int8, but sqrt is implemented for Float32 or Float64
⇒ Type Coercion adds a typecast so the implementation can be called
Note: Coercion is lossless; if col were Float64, it would not coerce to Float32
Source Code: coercion.rs
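The inserted cast is conceptually just Arrow's cast kernel, as in this sketch:

use std::sync::Arc;
use arrow::array::{ArrayRef, Int8Array};
use arrow::compute::kernels::cast::cast;
use arrow::datatypes::DataType;

fn main() -> arrow::error::Result<()> {
    // col is Int8; coercion inserts the equivalent of CAST(col AS Float32)
    let col: ArrayRef = Arc::new(Int8Array::from(vec![1, 4, 9]));
    let coerced = cast(&col, &DataType::Float32)?;
    assert_eq!(coerced.data_type(), &DataType::Float32);
    Ok(())
}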
Execution Plans
Plan Execution Overview
Typically called the “execution engine” in database systems
DataFusion features:
● Async: Mostly avoids blocking I/O
● Vectorized: Process RecordBatch at a time, configurable batch size
● Eager Pull: Data is produced using a pull model, natural backpressure
● Partitioned: each operator produces partitions, in parallel
● Multi-Core*
* Uses async tasks; still some uncertainty about whether a separate thread pool is needed
Plan Execution
LogicalPlan → create_physical_plan → ExecutionPlan → execute → SendableRecordBatchStream (one per partition) → collect → RecordBatches
ExecutionPlan nodes allocate resources (buffers, hash tables, files, etc).
execute produces an iterator-style object that yields Arrow RecordBatches for each partition.
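In code, the steps look roughly like this sketch against this era's API (DataFrame::collect performs the same steps internally):

use datafusion::physical_plan::collect;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let mut ctx = ExecutionContext::new();
    ctx.register_parquet("http_api_requests_total", "http_api_requests_total.parquet")?;
    let df = ctx.sql(
        "SELECT status, COUNT(1) FROM http_api_requests_total \
         WHERE path = '/api/v2/write' GROUP BY status",
    )?;

    let logical_plan = ctx.optimize(&df.to_logical_plan())?;      // optimizer passes
    let physical_plan = ctx.create_physical_plan(&logical_plan)?; // ExecutionPlan tree
    let batches = collect(physical_plan).await?;                  // execute all partitions
    println!("{} output batches", batches.len());
    Ok(())
}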
create_physical_plan: LogicalPlan → ExecutionPlan
The logical plan:
Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]
  Filter: #path Eq Utf8("/api/v2/write")
    TableScan: http_api_requests_total projection=None
becomes the physical plan:
HashAggregateExec (1 partition): AggregateMode::Final, SUM(1), GROUP BY status
  MergeExec (1 partition)
    HashAggregateExec (2 partitions): AggregateMode::Partial, COUNT(1), GROUP BY status
      FilterExec (2 partitions): path = "/api/v2/write"
        ParquetExec (2 partitions): files = file1, file2
execute: ExecutionPlan → SendableRecordBatchStream
Calling execute(partition) on each ExecutionPlan node produces one stream per partition:
HashAggregateExec (Final) → execute(0) → one GroupHash AggregateStream
MergeExec → execute(0) → one MergeStream
HashAggregateExec (Partial) → execute(0), execute(1) → two GroupHash AggregateStreams
FilterExec → execute(0), execute(1) → two FilterExecStreams
ParquetExec → execute(0), execute(1) → one "ParquetStream"* for file1 and one for file2
* this is actually a channel getting results from a different thread, as the parquet reader is not yet async
next(): pulling RecordBatches through one pipeline
A Rust Stream is an async iterator that produces RecordBatches. Execution of the GroupHash AggregateStream starts eagerly, before next() is called on it:
Step 0: a new task is spawned and starts computing its input immediately
Step 1: data is read from parquet ("ParquetStream") and returned via next().await
Step 2: data is filtered (FilterExecStream)
Step 3: each RecordBatch is fed into a hash table (GroupHash AggregateStream)
Step 4: hash done, output produced
Step 5: output is requested via next() on the SendableRecordBatchStream
Step 6: the RecordBatch is returned to the caller. Ready to produce values! 😅
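As a sketch of what consuming one of these streams looks like, here is a hypothetical helper written against the async execute() signature of this era of DataFusion (later versions changed the signature):

use std::sync::Arc;
use datafusion::error::Result;
use datafusion::physical_plan::ExecutionPlan;
use futures::StreamExt;

// Hypothetical helper: drain one partition of an ExecutionPlan
async fn drain_partition(plan: Arc<dyn ExecutionPlan>, partition: usize) -> Result<usize> {
    // execute() is async here; it returns a SendableRecordBatchStream
    let mut stream = plan.execute(partition).await?;
    let mut rows = 0;
    // Each item in the stream is a Result<RecordBatch>
    while let Some(batch) = stream.next().await {
        rows += batch?.num_rows();
    }
    Ok(rows)
}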
next(): pulling through MergeExec
MergeStream eagerly starts on its own task; backpressure comes via bounded channels:
Step 0: new tasks are spawned for the MergeStream and for each partial GroupHash AggregateStream, and each starts computing its input immediately
Step 1: output is requested (next().await on the final GroupHash AggregateStream)
Step 2: eventually a RecordBatch is produced downstream and returned
Step 3: Merge passes on the RecordBatch
Step 4: data is fed into the final hash table
Step 5: hash done, output produced
Step 6: the RecordBatch is returned to the caller
Get Involved
Check out the Apache Arrow project
Join the mailing list (links on project page)
Test out Arrow (crates.io) and DataFusion (crates.io) in your projects
Help out with the docs/code/tickets on GitHub
Thank You!!!!