Spark 4.0
Upcoming Apache Spark 4.0 Release?
Wenchen Fan (cloud-fan)
Xiao Li (gatorsmile)
[Feature overview grid: Spark Connect, Spark ANSI Mode, Arrow-optimized Python UDF, pandas 2 API parity, Structured Logging, Variant Data Types [WIP], Python Data Source APIs, SQL UDF/UDTF [WIP], applyInArrow, DF.toArrow, Polymorphic Python UDTF, PySpark UDF Unified Profiling [WIP], Collation Support [WIP], Stored Procedures, Execute Immediate, View Evolution, UDF-level Dependency Control [WIP], Arbitrary Stateful Processing V2, State Data Source Reader, Streaming Python Data Sources, New Streaming Doc, Error Class Enhancements, XML Connectors, Spark K8S operator, Java 21, and more.]
New Functionalities
Spark Connect, ANSI Mode, Arbitrary Stateful Processing V2,
Collation Support, Variant Data Types, pandas 2.x Support
Extensions
Python Data Source APIs, XML/Databricks Connectors
and DSV2 Extension, Delta 4.0
Usability
Structured Logging Framework, Error Class Framework,
Behavior Change Process
Spark Connect
[Architecture diagram: thin clients (IDEs / Notebooks and Applications, with no JVM interop) connect to Spark's Driver, which runs the Analyzer, Optimizer, and Scheduler. The same client protocol enables Scala 3 support and Databricks Connect.]
• pip install "pyspark>=3.4.0"
• Check out Databricks Connect in your favorite IDE!
• Get started with our GitHub example!
• Use & contribute the Go client!
The lightweight Spark Connect Package
• pip install pyspark-connect
• Pure Python library, no JVM.
• Preferred if your application has fully migrated to the Spark Connect API.
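A minimal connection sketch (the sc:// URL is an assumption; point it at your own Spark Connect endpoint):

from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server instead of starting a local JVM driver
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(5).show()
Python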
Spark ANSI Mode: avoid silent data corruption!
Variant Data Type

CREATE TABLE variant_tbl (event_data VARIANT);

INSERT INTO variant_tbl
VALUES
  (
    PARSE_JSON(
      '{"level": "warning",
        "message": "invalid request",
        "user_agent": "Mozilla/5.0 ..."}'
    )
  );

SELECT
  *
FROM
  variant_tbl
WHERE
  event_data:user_agent ilike '%Mozilla%';
SQL
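A possible follow-up extracting typed values with variant_get (a sketch; assumes the variant_tbl table created above):

spark.sql("""
    SELECT variant_get(event_data, '$.level', 'string')   AS level,
           variant_get(event_data, '$.message', 'string') AS message
    FROM variant_tbl
""").show(truncate=False)
Python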
Collation Support
[Example result tables: with a case-insensitive collation the query matches both 'anthony' and 'Anthony'; with an accent-insensitive collation it also matches 'Ānthōnī'.]
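A minimal sketch of case-insensitive matching with the UTF8_LCASE collation (inline data; the column name is illustrative):

# 'anthony' = 'Anthony' under the case-insensitive UTF8_LCASE collation;
# accent-insensitive matching would additionally need an AI collation variant.
spark.sql("""
    SELECT name
    FROM VALUES ('anthony'), ('Anthony'), ('Ānthōnī') AS t(name)
    WHERE name = 'anthony' COLLATE UTF8_LCASE
""").show()
Python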
Arbitrary Stateful Processing V2
A layered, flexible, and extensible state API that exposes both a high-level API and a granular API, with Python support.
pandas 2.x Support (SPARK-44101)
• Backwards-incompatible API changes in pandas 2.0.0 and 2.1.0
• See the Spark migration guide
Extensions
Python Data Source APIs, XML/Databricks Connectors
and DSV2 Extension, Delta 4.0
Streaming and Batching Python Data Sources

Step 1: Create a Data Source

class MySource(DataSource):
    …

Step 2: Register the Data Source
Register the data source in the current Spark session using the Python data source class:

spark.dataSource.register(MySource)

Step 3: Read from or write to the data source

spark.read
    .format("my-source")
    .load(...)

df.write
    .format("my-source")
    .mode("append")
    .save(...)
Python
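A fuller sketch of the three steps (MySource, MyReader, and the schema below are illustrative; the base classes come from pyspark.sql.datasource):

from pyspark.sql.datasource import DataSource, DataSourceReader

class MyReader(DataSourceReader):
    def read(self, partition):
        # Yield rows as plain tuples matching the schema declared by the source
        yield ("Alice", 23)
        yield ("Bob", 32)

class MySource(DataSource):
    @classmethod
    def name(cls):
        return "my-source"

    def schema(self):
        return "name string, age int"

    def reader(self, schema):
        return MyReader()

spark.dataSource.register(MySource)
spark.read.format("my-source").load().show()
Python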
DataFrame.toArrow
GroupedData.applyInArrow
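A short sketch of both Arrow-facing APIs (the sample data and grouping are assumptions):

import pyarrow as pa

df = spark.createDataFrame([("Alice", 23), ("Bob", 32)], ["name", "age"])

# DataFrame.toArrow materializes the result as a pyarrow.Table on the driver
table = df.toArrow()

# GroupedData.applyInArrow passes each group to the function as a pyarrow.Table
def count_rows(t: pa.Table) -> pa.Table:
    return pa.table({"name": [t.column("name")[0].as_py()], "n": [t.num_rows]})

df.groupBy("name").applyInArrow(count_rows, schema="name string, n long").show()
Python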
spark.read.xml("/path/to/my/file.xml").show()
+-------+-----+
| name  | age |
+-------+-----+
| Alice |  23 |
| Bob   |  32 |
+-------+-----+
Python
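The built-in XML source typically needs a rowTag option naming the repeated record element; "person" here is an assumption about the file layout:

spark.read.option("rowTag", "person").xml("/path/to/my/file.xml").show()
Python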
Delta 4.0
• Delta Connect: Spark Connect support
• Collations: flexible sort and comparison
• Identity columns: pain-free primary and foreign keys
• Type widening: data types expand with your data
• MERGE improvements, table cloning, auto-compaction, CDF
• Parquet data interoperability

Liquid Clustering Usage Walkthrough
Create a new Delta table with liquid clustering
CREATE [EXTERNAL] TABLE tbl (id INT, name STRING) CLUSTER BY (id);
Change Liquid Clustering keys on existing clustered table:
ALTER TABLE tbl CLUSTER BY (name);
Clustering data in a Delta table with liquid clustering:
OPTIMIZE tbl;
What you don’t need to worry about:
● Optimal file sizes
● Whether a column can be used as a clustering key
● Order of clustering keys
Public doc: https://docs.databricks.com/delta/clustering.html
Liquid Clustering GA
Easy to use
Up to 7x faster writes
Highly flexible
Python UDTF
● Key Benefits
○ Enhances data serialization and deserialization speed
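A minimal Python UDTF sketch (SquareNumbers and its arguments are hypothetical, not from the deck; useArrow=True opts into Arrow-optimized execution):

from pyspark.sql.functions import lit, udtf

@udtf(returnType="num: int, squared: int", useArrow=True)
class SquareNumbers:
    def eval(self, start: int, end: int):
        # Emit one output row per yielded tuple
        for num in range(start, end + 1):
            yield (num, num * num)

SquareNumbers(lit(1), lit(3)).show()
Python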
Performance
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

sdf.select(
    udf(lambda v: v + 1, DoubleType(), useArrow=True)("v"),
    udf(lambda v: v - 1, DoubleType(), useArrow=True)("v"),
    udf(lambda v: v * v, DoubleType(), useArrow=True)("v")
)
Python
Performance
Pickled Python UDF
>>> df.select(udf(lambda x: x, 'string')('value').alias('date_in_string')).show()
+-----------------------------------------------------------------------+
| date_in_string |
+-----------------------------------------------------------------------+
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..|
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..|
+-----------------------------------------------------------------------+
Python
SQL UDTF

> SELECT
    day_of_week,
    day
  FROM
    weekdays(DATE '2024-01-01',
             DATE '2024-01-14');
 Mon 2024-01-01
 …
 Fri 2024-01-05
 Mon 2024-01-08
SQL

> -- Return weekdays for date ranges originating from a LATERAL correlation
> SELECT
    weekdays.*
  FROM
    VALUES
      (DATE '2020-01-01'),
      (DATE '2021-01-01') AS starts(start),
    LATERAL weekdays(start, start + INTERVAL '7' DAYS);
 Wed 2020-01-01
 Thu 2020-01-02
 Fri 2020-01-03
 …
SQL
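A possible definition of the weekdays() table function used above; the deck does not show its body, so this is only a sketch of one way to write it as a SQL UDTF:

spark.sql("""
    CREATE FUNCTION weekdays(start_date DATE, end_date DATE)
    RETURNS TABLE (day_of_week STRING, day DATE)
    RETURN
      SELECT date_format(day, 'EEE') AS day_of_week, day
      FROM (SELECT explode(sequence(start_date, end_date, INTERVAL '1' DAY)) AS day)
      WHERE dayofweek(day) BETWEEN 2 AND 6  -- Monday through Friday
""")
Python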
SQL Scripting

BEGIN
  DECLARE c INT = 10;
  WHILE c > 0 DO
    INSERT INTO t VALUES (c);
    SET VAR c = c - 1;
  END WHILE;
END
SQL
Unified Profiling

from pyspark.sql.functions import pandas_udf

df = spark.range(10)

@pandas_udf("long")
def add1(x):
    return x + 1

added = df.select(add1("id"))

# Enable the performance profiler for Python UDFs, then trigger execution
spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
added.show()

# Switch to the memory profiler and trigger execution again
spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
added.show()
Python
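To render what was collected, the session-level profiler API can be used (a sketch; assumes the unified profiling API described for Spark 4.0):

spark.profile.show(type="perf")    # performance results
spark.profile.show(type="memory")  # memory results
Python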
Usability
Structured Logging Framework, Error Class Framework,
Behavior Change Process
Structured Logging Framework
{
"ts": "2023-03-12T12:02:46.661-0700",
"level": "ERROR",
"msg": "Fail to know the executor 289 is alive or not",
"context": {
"executor_id": "289"
},
"exception": {
"class": "org.apache.spark.SparkException",
"msg": "Exception thrown in awaitResult",
"stackTrace": "..."
},
"source": "BlockManagerMasterEndpoint"
}
Json
logs = spark.read.json("/var/spark/logs.json")
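Because the logs are structured, they can be queried like any other DataFrame (column names follow the JSON example above):

from pyspark.sql.functions import col

errors = logs.filter(
    (col("level") == "ERROR") & (col("context.executor_id") == "289")
)
errors.select("ts", "source", "msg").show(truncate=False)
Python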
Behavior Change Process in Spark 4.0
• Error Messages:
  • Clarity and actionability: ensure that all error messages are clear and direct, informing the user precisely what went wrong.
  • Workarounds: wherever possible, provide actionable advice within the error message, including configuration changes that can revert to previous behaviors or other immediate solutions.
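For example, with ANSI mode on by default, errors surface a named error class, a suggested workaround, and a SQLSTATE (the exact message text below is illustrative):

try:
    spark.sql("SELECT 1/0").collect()
except Exception as e:
    # e.g. [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0,
    # or set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
    print(e)
Python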
Documentation Enhancements
• Environment Setup
• Quickstart
• Type System
• PySpark Functions
• Catalyst Optimizer
• Adaptive Execution