
What’s Next for the Upcoming Apache Spark 4.0 Release?

Wenchen Fan (cloud-fan)
Xiao Li (gatorsmile)

Data + AI Summit 2024

©2024 Databricks Inc. — All rights reserved


Apache Spark 4.0
GA Major Features

Spark Connect, ANSI Mode, Arrow-optimized Python UDF, pandas 2 API parity, Structured Logging, Variant Data Types [WIP], Python Data Source APIs, SQL UDF/UDTF [WIP], applyInArrow, DF.toArrow, Polymorphic Python UDTF, PySpark UDF Unified Profiling, Collation Support [WIP], Stored Procedures [WIP], Execute Immediate, View Evolution, UDF-level Dependency Control [WIP], Arbitrary Stateful Processing V2, State Data Source Reader, Streaming Python Data Sources, New Streaming Doc, Error Class Enhancements, XML Connectors, Spark K8S Operator, Java 21, and more

©2024 Databricks Inc. — All rights reserved
Agenda

New Functionalities
Spark Connect, ANSI Mode, Arbitrary Stateful Processing V2, Collation Support, Variant Data Types, pandas 2.x Support

Extensions
Python Data Source APIs, XML/Databricks Connectors and DSV2 Extension, Delta 4.0

Custom Functions and Procedures
SQL UDFs, SQL Scripting, Python UDTF, Arrow-optimized Python UDF, PySpark UDF Unified Profiler

Usability
Structured Logging Framework, Error Class Framework, Behavior Change Process
Spark Connect

©2024 Databricks Inc. — All rights reserved


How to embed Spark in applications?
Up until Spark 3.4: hard to support today’s developer experience requirements.

[Diagram: modern data applications, IDEs/notebooks, and programming languages/SDKs all have to run inside Spark’s monolithic driver, which couples application logic with the analyzer, optimizer, scheduler, and distributed execution engine. Pain points: SQL only, close to the REPL, no JVM interop.]

©2024 Databricks Inc. — All rights reserved


Connect to Spark from Any App
Thin client, with the full power of Apache Spark

[Diagram: modern data applications, IDEs/notebooks, and programming languages/SDKs use the Spark Connect client API, which talks through an application gateway to Spark’s driver (analyzer, optimizer, scheduler, distributed execution engine).]

©2024 Databricks Inc. — All rights reserved


Spark Connect GA in Apache Spark 4.0

● Interactively develop & debug from your IDE: pip install pyspark>=3.4.0 and use it in your favorite IDE.
● New connectors and SDKs in any language (Scala 3, Go): check out, use, and contribute to the Go client.
● Build interactive data applications: check out Databricks Connect and get started with our GitHub example.

©2024 Databricks Inc. — All rights reserved
The lightweight Spark Connect Package
• pip install pyspark-connect
• Pure Python library, no JVM.

• Pure Spark Connect client, not entire PySpark

• Only 1.5MB (PySpark 355 MB)

• Preferred if your application has fully migrated to the Spark Connect API.
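For illustration, a minimal sketch of connecting through Spark Connect from Python; the endpoint URL (sc://localhost:15002, the default Connect port) and a server already running there are assumptions.

from pyspark.sql import SparkSession

# Build a session against a remote Spark Connect server instead of a local JVM.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(10).filter("id % 2 = 0")
print(df.count())  # planned and executed on the remote Spark cluster
Python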

©2024 Databricks Inc. — All rights reserved


ANSI MODE
ON by default in 4.0

©2024 Databricks Inc. — All rights reserved


Migration to ANSI ON

Action: turn on ANSI mode now to surface and fix the operations that silently corrupt your data.

©2024 Databricks Inc. — All rights reserved
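To make the migration concrete, here is a minimal sketch of what ANSI mode changes; the invalid cast is a made-up example and the exact error text depends on the Spark version.

# ANSI mode is on by default in Spark 4.0; set explicitly here for clarity.
spark.conf.set("spark.sql.ansi.enabled", "true")

try:
    spark.sql("SELECT CAST('123abc' AS INT)").show()
except Exception as e:
    # Under ANSI mode the invalid cast raises an error (e.g. CAST_INVALID_INPUT)
    # instead of silently returning NULL as in the legacy behavior.
    print(e)

# If the old "return NULL" behavior is what you actually want, ask for it explicitly:
spark.sql("SELECT TRY_CAST('123abc' AS INT)").show()
Python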


Without ANSI mode

Data Corruption!

©2024 Databricks Inc. — All rights reserved


With ANSI mode (3.5)

Error callsite is captured

©2024 Databricks Inc. — All rights reserved


With ANSI mode (4.0)

Error callsite is highlighted

©2024 Databricks Inc. — All rights reserved


With ANSI mode (4.0)

DataFrame queries with error callsite

Culprit operation Line number

©2024 Databricks Inc. — All rights reserved


DataFrame queries with error callsite
• PySpark support is on the way.
• Spark Connect support is on the way.
• Native notebook integration is on the way (so that you can see the highlighted callsite).

©2024 Databricks Inc. — All rights reserved


Variant Data Type for Semi-Structured Data

©2024 Databricks Inc. — All rights reserved


Motivation
Data Engineer’s Dilemma: only pick 2 out of 3 (Flexible, Fast, Open)…

What if you could ingest JSON, maintain flexibility, boost performance, and use an open standard?

©2024 Databricks Inc. — All rights reserved


Variant is flexible

INSERT INTO variant_tbl (event_data)
VALUES (
  PARSE_JSON(
    '{"level": "warning",
      "message": "invalid request",
      "user_agent": "Mozilla/5.0 ..."}'
  )
);

SELECT *
FROM variant_tbl
WHERE event_data:user_agent ilike '%Mozilla%';
SQL
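As a sketch of reading typed values back out, assuming the variant_get/try_variant_get helper functions are available in your build (the paths and types below are made up):

spark.sql("""
    SELECT
      variant_get(event_data, '$.level', 'string')   AS level,
      try_variant_get(event_data, '$.status', 'int') AS status  -- NULL if missing or not castable
    FROM variant_tbl
""").show()
Python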
©2024 Databricks Inc. — All rights reserved
Performance

©2024 Databricks Inc. — All rights reserved


String Collation Support

©2024 Databricks Inc. — All rights reserved


ANSI SQL COLLATE
Sorting and comparing strings according to locale
● Associate columns, fields, array elements with a collation of
choice
○ Case insensitive
○ Accent insensitive
○ Locale aware
● Supported by many string functions such as
○ lower()/upper()
○ substr()
○ locate()
○ like
● GROUP BY, ORDER BY, comparisons, …
● Supported by Delta and Photon
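As a sketch of declaring a collation once at the column level (the table name is made up, and the exact DDL accepted by your catalog may differ):

spark.sql("CREATE TABLE names_ci (name STRING COLLATE UNICODE_CI)")
spark.sql("INSERT INTO names_ci VALUES ('Anthony'), ('anthony'), ('Bertha')")

# Comparisons on the column are case-insensitive without a per-query COLLATE clause.
spark.sql("SELECT count(*) FROM names_ci WHERE name = 'ANTHONY'").show()
Python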
©2024 Databricks Inc. — All rights reserved
A look at the default collation
A<Z<a<z<Ā

> SELECT name FROM names ORDER BY name;


name
Anthony
Bertha
anthony
bertha
Ānthōnī SQL

Is this really what we want here?


©2024 Databricks Inc. — All rights reserved
COLLATE UNICODE
One size, fits most

> SELECT name


FROM names
ORDER BY name COLLATE unicode;
name
Ānthōnī
anthony
Anthony
bertha
Bertha SQL

Root collation with decent sort order for most locales


©2024 Databricks Inc. — All rights reserved
COLLATE UNICODE_CI
Case insensitive comparisons have entered the chat

> SELECT name


FROM names
WHERE startswith(name COLLATE unicode_ci, 'a')
ORDER BY name COLLATE unicode_ci;

name
anthony
Anthony
SQL

Case insensitive is not accent insensitive: We lost Ānthōnī


©2024 Databricks Inc. — All rights reserved
COLLATE UNICODE_CI_AI
Equality from a to Ź

> SELECT name


FROM names
WHERE startswith(name COLLATE unicode_ci_ai, 'a')
ORDER BY name COLLATE unicode_ci_ai;

name
Ānthōnī
anthony
Anthony
SQL

100s of supported predefined collations across many locales


©2024 Databricks Inc. — All rights reserved
Streaming State Data Source

©2024 Databricks Inc. — All rights reserved


Stateful Stream Processing

©2024 Databricks Inc. — All rights reserved


Streaming State data source
• Allows you to inspect the internal state of streaming applications, for debugging, profiling, testing, troubleshooting, etc.
• Allows you to manipulate the internal state as a quick workaround to recover from urgent issues.
• All with your familiar data source APIs.

©2024 Databricks Inc. — All rights reserved


State Reader API: state-metadata

High-level API

©2024 Databricks Inc. — All rights reserved


State Reader API: statestore

Granular API

©2024 Databricks Inc. — All rights reserved
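A minimal sketch of the two reader formats named above; the checkpoint path is a placeholder and the option shown is an assumption about the available options.

checkpoint = "/path/to/streaming/checkpoint"

# High-level view: which stateful operators exist in the checkpoint.
spark.read.format("state-metadata").load(checkpoint).show()

# Granular view: the actual key/value state rows of an operator.
(spark.read.format("statestore")
    .option("batchId", 10)   # assumed option: inspect state as of a given batch
    .load(checkpoint)
    .show())
Python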


Arbitrary Stateful Processing V2

©2024 Databricks Inc. — All rights reserved


(flat)MapGroupsWithState: current V1 version

• Supports a single user-defined state object per grouping key.
• The state object can be updated while evaluating the current group, and the updated value is available in the next trigger.

val ds = spark.readStream.json(path)
  .as[CreditCardTransaction]

ds.groupByKey(_.cardId)
  .flatMapGroupsWithState[
    CreditCardTransactionState,
    CreditCardTransaction
  ](
    OutputMode.Append(),
    GroupStateTimeout.NoTimeout())(
    (_, txns, groupState) => {
      // read state, compute new average, and save to state
      ...
    }
  )
Scala
©2024 Databricks Inc. — All rights reserved


Limitations of (flat)MapGroupsWithState
Existing API

• Lack of data modelling flexibility: prevents users from splitting state (for a grouping key) into multiple logical instances that can be read/updated independently.
• Lack of composite types: values stored in GroupState are single types and cannot support data structures like List, Map, etc. efficiently; the current approach requires users to read/update the entire data structure.
• Lack of state eviction support: no support for eventual state cleanup using TTL.
• Lack of state schema evolution: does not support changes to the state schema once the streaming query has started.

©2024 Databricks Inc. — All rights reserved


transformWithState: the V2 version

Layered, Flexible,
Extensible State API

©2024 Databricks Inc. — All rights reserved


pandas 2 support/API parity

©2024 Databricks Inc. — All rights reserved


Pandas API on Spark (Koalas)

pandas:

import pandas as pd

df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
Python

Pandas API on Spark:

import pyspark.pandas as ps

df = ps.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
Python

©2024 Databricks Inc. — All rights reserved


Pandas 2.x Support
API parity with pandas 2.2.2

Backwards-incompatible API changes:
• pandas 2.0.0
• pandas 2.1.0
See SPARK-44101 and the Spark migration guide.

©2024 Databricks Inc. — All rights reserved
Agenda

New Functionalities
Spark Connect, ANSI Mode, Arbitrary Stateful Processing V2, Collation Support, Variant Data Types, pandas 2.x Support

Extensions
Python Data Source APIs, XML/Databricks Connectors and DSV2 Extension, Delta 4.0

Custom Functions and Procedures
SQL UDFs, SQL Scripting, Python UDTF, Arrow-optimized Python UDF, PySpark UDF Unified Profiler

Usability
Structured Logging Framework, Error Class Framework, Behavior Change Process

Streaming and Batching Python Data Sources

©2024 Databricks Inc. — All rights reserved


Why Python Data Source?
• People like writing Python!
• pip install is so convenient.
• Simplified API without complicated performance features in Data
Source V2.


©2024 Databricks Inc. — All rights reserved


Python Data Source APIs

• SPIP: Python Data Source API (SPARK-44076)


• Available in Spark 4.0 preview version and Databricks Runtime 15.2
• Support both batch and streaming, read and write

©2024 Databricks Inc. — All rights reserved


Python Data Source Overview
Three easy steps to create and use your custom data sources

Step 1: Create a data source

class MySource(DataSource):
    …

Step 2: Register the data source in the current Spark session using the Python data source class:

spark.dataSource.register(MySource)

Step 3: Read from or write to the data source

spark.read
    .format("my-source")
    .load(...)

df.write
    .format("my-source")
    .mode("append")
    .save(...)

©2024 Databricks Inc. — All rights reserved
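Putting the three steps together, a minimal sketch of a batch-only source, assuming the Spark 4.0 pyspark.sql.datasource API; the source name and the fake rows are made up.

from pyspark.sql.datasource import DataSource, DataSourceReader

class FakeEventsSource(DataSource):
    @classmethod
    def name(cls):
        return "fake-events"

    def schema(self):
        return "id INT, level STRING"

    def reader(self, schema):
        return FakeEventsReader()

class FakeEventsReader(DataSourceReader):
    def read(self, partition):
        # Yield plain tuples matching the declared schema.
        yield (1, "info")
        yield (2, "warning")

spark.dataSource.register(FakeEventsSource)
spark.read.format("fake-events").load().show()
Python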
DataFrame.toArrow
GroupedData.applyInArrow

©2024 Databricks Inc. — All rights reserved


DataFrame.toArrow

• A simple API to convert a PySpark DataFrame to a PyArrow Table.
• Makes it easier to integrate with the Arrow ecosystem.
• Note: all the data is loaded into the driver’s memory, which may cause out-of-memory errors for large datasets.
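A minimal sketch (the small range DataFrame is just for illustration):

df = spark.range(5)

arrow_table = df.toArrow()   # a pyarrow.Table, collected to the driver
print(arrow_table.num_rows, arrow_table.schema)

# From here the Arrow ecosystem takes over, e.g. converting to pandas.
pdf = arrow_table.to_pandas()
Python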

©2024 Databricks Inc. — All rights reserved


GroupedData.applyInArrow

• Utilizes Apache Arrow to map functions over DataFrame groups.
• Returns the result as a DataFrame.
• Supports functions taking a pyarrow.Table, or a tuple of grouping keys and a pyarrow.Table.
• Note: the function requires a full shuffle and may cause out-of-memory errors for large data groups.

>>> import pyarrow
>>> import pyarrow.compute as pc
>>> df = spark.createDataFrame(
...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
...     ("id", "v"))
>>> def normalize(table):
...     v = table.column("v")
...     norm = pc.divide(pc.subtract(v, pc.mean(v)),
...                      pc.stddev(v, ddof=1))
...     return table.set_column(1, "v", norm)
>>> df.groupby("id").applyInArrow(
...     normalize, schema="id long, v double").show()
+---+-------------------+
| id|                  v|
+---+-------------------+
|  1|-0.7071067811865475|
|  1| 0.7071067811865475|
|  2|-0.8320502943378437|
|  2|-0.2773500981126146|
|  2| 1.1094003924504583|
+---+-------------------+
Python
©2024 Databricks Inc. — All rights reserved


XML Connectors

©2024 Databricks Inc. — All rights reserved


Reading XML files out of the box

spark.read.xml("/path/to/my/file.xml").show()

+-------+-----+
| name | age |
+-------+-----+
| Alice | 23 |
| Bob | 32 | Python
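A sketch of the options you typically need; "rowTag" (the repeating element that maps to one row) and the sample path are assumptions about your data.

(spark.read
    .option("rowTag", "person")
    .xml("/path/to/people.xml")
    .printSchema())
Python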

©2024 Databricks Inc. — All rights reserved


More Than Just a Simple Port

©2024 Databricks Inc. — All rights reserved


Databricks JDBC Dialect

©2024 Databricks Inc. — All rights reserved


Reading Databricks SQL out of the box
spark.read.jdbc(
    "jdbc:databricks://…",
    "my_table",
    properties=properties
).show()

+-------+-----+
| name | age |
+-------+-----+
| Alice | 23 |
| Bob | 32 | Python

©2024 Databricks Inc. — All rights reserved


Delta Lake 4.0

©2024 Databricks Inc. — All rights reserved


The biggest Delta release yet

Delta 3.x: Deletion Vectors, Liquid clustering, Optimized Writes, Delta Kernel, Table features, Row IDs, Incremental checkpoints, Log compactions, MERGE improvements, Auto-compaction, Table cloning, CDF

Delta 4.0:
● VARIANT: lightning-fast semi-structured data
● UniForm GA: write once, read as all formats
● Liquid GA: easy migration from partitioned tables
● Multi-cluster writes: cross cloud, cross engines
● Delta Connect: Spark Connect support
● Collations: flexible sort and comparison
● Identity columns: pain-free primary and foreign keys
● Type widening: data types expand with your data

©2024 Databricks Inc. — All rights reserved


UniForm GA
Towards full lakehouse format interoperability

[Diagram: with UniForm, a Delta Lake table keeps one copy of the Parquet data while maintaining metadata for multiple formats.]
Liquid Clustering Usage Walkthrough

Create a new Delta table with liquid clustering:
CREATE [EXTERNAL] TABLE tbl (id INT, name STRING) CLUSTER BY (id);

Change liquid clustering keys on an existing clustered table:
ALTER TABLE tbl CLUSTER BY (name);

Cluster data in a Delta table with liquid clustering:
OPTIMIZE tbl;

What you don’t need to worry about:
● Optimal file sizes
● Whether a column can be used as a clustering key
● Order of clustering keys

Public doc: https://docs.databricks.com/delta/clustering.html
©2024 Databricks Inc. — All rights reserved
Liquid Clustering GA
Easy to use

Up to 7x faster writes

Up to 12x faster reads

Highly flexible

©2024 Databricks Inc. — All rights reserved


Agenda

New Functionalities
Spark Connect, ANSI Mode, Arbitrary Stateful Processing V2, Collation Support, Variant Data Types, pandas 2.x Support

Extensions
Python Data Source APIs, XML/Databricks Connectors and DSV2 Extension, Delta 4.0

Custom Functions and Procedures
SQL UDFs, SQL Scripting, Python UDTF, Arrow-optimized Python UDF, PySpark UDF Unified Profiler

Usability
Structured Logging Framework, Error Class Framework, Behavior Change Process
Python UDTF

©2024 Databricks Inc. — All rights reserved


Apache Spark 3.5

Python User Defined Table Functions

A new kind of function that returns an entire table as output instead of a single scalar result value.
○ Once registered, they can appear in the FROM clause of a SQL query
○ Or they can be called through the DataFrame API

from pyspark.sql.functions import udtf

@udtf(returnType="num: int, squared: int")
class SquareNumbers:
    def eval(self, start: int, end: int):
        for num in range(start, end + 1):
            yield (num, num * num)
Python
©2024 Databricks Inc. — All rights reserved


Apache Spark 3.5

Python User Defined Table Functions

SQL:

SELECT * FROM SquareNumbers(start => 1, end => 3);

PySpark DataFrame API:

SquareNumbers(lit(1), lit(3)).show()

Both produce:

+-----+--------+
| num | squared|
+-----+--------+
|   1 |      1 |
|   2 |      4 |
|   3 |      9 |
+-----+--------+
©2024 Databricks Inc. — All rights reserved


Apache Spark 4.0

Python UDTF: Polymorphic Analysis and Input Table Partitioning

Polymorphic Analysis
Compute the output schema for each call depending on the arguments, using analyze:

class ReadFromConfigFile:
    @staticmethod
    def analyze(filename: AnalyzeArgument):
        with open(os.path.join(
                SparkFiles.getRootDirectory(),
                filename.value), "r") as f:
            # Compute the UDTF output schema
            # based on the contents of the file.
            return AnalyzeResult(from_file(f.read()))
    ...

ReadFromConfigFile(lit("config.txt")).show()
+------------+-------------+
| start_date | other_field |
+------------+-------------+
| 2024-04-02 |           1 |
+------------+-------------+
Python

Input Table Partitioning
Split input rows among class instances: eval runs once per row, then terminate runs last:

class CountAndMax:
    def __init__(self):
        self._count = 0
        self._max = 0
    def eval(self, row: Row):
        self._count += 1
        self._max = max(self._max, row[0])
    def terminate(self):
        yield self._count, self._max

WITH t AS (SELECT id FROM RANGE(0, 100))
SELECT * FROM CountAndMax(
  TABLE(t) PARTITION BY id / 10 ORDER BY id);

+-------+-----+
| count | max |
+-------+-----+
|    10 |   0 |
|    10 |   1 |
...
Python
©2024 Databricks Inc. — All rights reserved
Apache Spark 4.0

Python UDTF: Variable Keyword Arguments and Custom Initialization

Variable Keyword Arguments
The analyze and eval methods may accept *args or **kwargs:

class VarArgs:
    @staticmethod
    def analyze(**kwargs: AnalyzeArgument):
        return AnalyzeResult(StructType(
            [StructField(key, arg.dataType)
             for key, arg in sorted(
                 kwargs.items())]))
    def eval(self, **kwargs):
        yield tuple(value for _, value
                    in sorted(kwargs.items()))

SELECT * FROM VarArgs(a => 10, b => 'x');

+----+-----+
| a  |  b  |
+----+-----+
| 10 | "x" |
+----+-----+

Custom Initialization
Create a subclass of AnalyzeResult and consume it in each subsequent __init__:

class SplitWords:
    @dataclass
    class MyAnalyzeResult(AnalyzeResult):
        numWords: int
        numArticles: int

    @staticmethod
    def analyze(text: str):
        words = text.split(" ")
        return MyAnalyzeResult(
            schema=StructType()
                .add("word", StringType())
                .add("total", IntegerType()),
            withSinglePartition=True,
            numWords=len(words),
            numArticles=len([
                word for word in words
                if word in ("a", "an", "the")]))

    def __init__(self, r: MyAnalyzeResult):
        ...
Python

©2024 Databricks Inc. — All rights reserved
Arrow Optimized Python UDF

©2024 Databricks Inc. — All rights reserved


Enhancing Python UDFs with Apache Arrow

● Introduction to Arrow and Its Role in UDF Optimization:


○ Utilizes Apache Arrow

○ Supported since Spark 3.5 and ON by default since Spark 4.0

● Key Benefits
○ Enhances data serialization and deserialization speed

○ Provides standardized type coercion

©2024 Databricks Inc. — All rights reserved


Enabling Arrow Optimization

Local activation in a UDF
• Activates Arrow optimization for a specific UDF, improving performance

# An Arrow Python UDF
@udf(returnType='int', useArrow=True)
def arrow_slen(s):
    return len(s)
Python

Global activation in the Spark session
• Activates Arrow optimization for all Python UDFs in the Spark session

spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", True)

# An Arrow Python UDF
@udf(returnType='int')
def arrow_slen(s):
    return len(s)
Python

©2024 Databricks Inc. — All rights reserved


@udf(returnType='int', useArrow=True)
def arrow_slen(s):
return len(s)
Python

Performance
©2024 Databricks Inc. — All rights reserved
sdf.select(
udf(lambda v: v + 1, DoubleType(), useArrow=True)("v"),
udf(lambda v: v - 1, DoubleType(), useArrow=True)("v"),
udf(lambda v: v * v, DoubleType(), useArrow=True)("v")
)
Python

Performance
©2024 Databricks Inc. — All rights reserved
Pickled Python UDF
>>> df.select(udf(lambda x: x, 'string')('value').alias('date_in_string')).show()
+-----------------------------------------------------------------------+
| date_in_string |
+-----------------------------------------------------------------------+
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..|
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..|
+-----------------------------------------------------------------------+ Python

Arrow-optimized Python UDF

>>> df.select(udf(lambda x: x, 'string')('value').alias('date_in_string')).show()
+--------------+
|date_in_string|
+--------------+
|    1970-01-01|
|    1970-01-02|
+--------------+
Python

Comparing Pickled and Arrow-optimized Python UDFs on type coercion [Link]

©2024 Databricks Inc. — All rights reserved


SQL UDF / UDTF

©2024 Databricks Inc. — All rights reserved


Easily extend SQL function library

● SQL User Defined Scalar Functions


○ Persisted SQL Expressions

● SQL User Defined Table Functions


○ Persisted Parameterized Views

● Support named parameter invocation and defaulting


● Table functions with lateral correlation

©2024 Databricks Inc. — All rights reserved


SQL User Defined Scalar Functions

● Encapsulate (complex) expressions, including subqueries

● May contain subqueries

● Return a scalar value

● Can be used in most places where builtin functions go

©2024 Databricks Inc. — All rights reserved


SQL User Defined Scalar Functions
Persists complex expression patterns
> CREATE FUNCTION roll_dice(
    num_dice INT DEFAULT 1 COMMENT 'number of dice to roll (Default: 1)',
    num_sides INT DEFAULT 6 COMMENT 'number of sides per die (Default: 6)'
  ) COMMENT 'Roll a number of n-sided dice'
  RETURN aggregate(
    sequence(1, roll_dice.num_dice, 1),
    0,
    (acc, x) -> acc + (rand() * roll_dice.num_sides) :: INT,
    acc -> acc + roll_dice.num_dice
  );

-- Roll a single 6-sided die
> SELECT roll_dice();
3

-- Roll 3 6-sided dice
> SELECT roll_dice(3);
15

-- Roll 3 10-sided dice
> SELECT roll_dice(3, 10);
21
SQL
©2024 Databricks Inc. — All rights reserved
SQL User Defined Table Functions

● Encapsulate (complex) correlated subqueries aka a parameterized view

● Can be used in the FROM clause

©2024 Databricks Inc. — All rights reserved


SQL User Defined Table Functions
Persist complex parameterized queries

CREATE FUNCTION weekdays(start DATE, end DATE)
RETURNS TABLE(day_of_week STRING, day DATE)
RETURN SELECT
  to_char(day, 'E'),
  day
FROM
  (
    SELECT sequence(weekdays.start, weekdays.end)
  ) AS t(days),
  LATERAL(explode(days)) AS dates(day)
WHERE
  extract(DAYOFWEEK_ISO FROM day) BETWEEN 1 AND 5;
SQL

©2024 Databricks Inc. — All rights reserved


SQL User Defined Table Functions
Persist complex parameterized queries

> SELECT
    day_of_week,
    day
  FROM
    weekdays(DATE '2024-01-01', DATE '2024-01-14');

Mon 2024-01-01
…
Fri 2024-01-05
Mon 2024-01-08
SQL

> -- Return weekdays for date ranges originating from a LATERAL correlation
> SELECT
    weekdays.*
  FROM
    VALUES
      (DATE '2020-01-01'),
      (DATE '2021-01-01') AS starts(start),
    LATERAL weekdays(start, start + INTERVAL '7' DAYS);

Wed 2020-01-01
Thu 2020-01-02
Fri 2020-01-03
SQL

©2024 Databricks Inc. — All rights reserved


Named parameter invocation
Self documenting and safer SQL UDF invocation

> DESCRIBE FUNCTION roll_dice;


Function: default.roll_dice
Type: SCALAR
Input: num_dice INT
num_sides INT
Returns: INT

> -- Roll 1 10-sided dice - skip dice count


> SELECT roll_dice(num_sides => 10)
7

> -- Roll 3 10-sided dice - reversed order


> SELECT roll_dice(num_sides => 10, num_dice => 3)
21
SQL

©2024 Databricks Inc. — All rights reserved


Stored Procedure

©2024 Databricks Inc. — All rights reserved


External Stored Procedures

©2024 Databricks Inc. — All rights reserved


External Iceberg Stored Procedure

©2024 Databricks Inc. — All rights reserved
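A minimal sketch of invoking an external stored procedure through the SQL CALL statement; the catalog name and the Iceberg procedure arguments below are assumptions based on Iceberg's documented procedures.

spark.sql("""
    CALL my_catalog.system.rollback_to_snapshot(
        table => 'db.events',
        snapshot_id => 123456789
    )
""").show()
Python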


SQL Scripting
It’s SQL, but with control flow!

• Support for control flow, iterators & error handling


Natively in SQL
• Control flow → IF/ELSE, CASE
• Looping → WHILE, REPEAT, ITERATE
• Resultset iterator → FOR
• Exception handling → CONTINUE/EXIT
• Parameterized queries → EXECUTE IMMEDIATE

• Following the SQL/PSM standard

©2024 Databricks Inc. — All rights reserved


SQL Scripting

BEGIN
DECLARE c INT = 10;
WHILE c > 0 DO
INSERT INTO t VALUES (c);
SET VAR c = c - 1;
END WHILE;
END
SQL

©2024 Databricks Inc. — All rights reserved


SQL Scripting

-- parameters
DECLARE oldColName = 'ColoUr';
DECLARE newColName = 'color';

BEGIN
  DECLARE tableArray ARRAY<STRING>;
  DECLARE tableType STRING;
  DECLARE i INT = 0;
  DECLARE alterQuery STRING;
  SET tableArray = (
    SELECT array_agg(table_name)
    FROM INFORMATION_SCHEMA.columns
    WHERE column_name COLLATE UNICODE_CI = oldColName
  );
  WHILE i < array_size(tableArray) DO
    SET tableType = (
      SELECT table_type
      FROM INFORMATION_SCHEMA.tables
      WHERE table_name = tableArray[i]
    );
    IF tableType != 'VIEW' COLLATE UNICODE_CI THEN
      SET alterQuery = 'ALTER TABLE ' || tableArray[i] ||
        ' RENAME COLUMN ' || oldColName || ' TO ' || newColName;
      EXECUTE IMMEDIATE alterQuery;
    END IF;
    SET i = i + 1;
  END WHILE;
END;
SQL

©2024 Databricks Inc. — All rights reserved


PySpark UDF Unified Profiling

©2024 Databricks Inc. — All rights reserved


Overview of Unified Profiling
• Key Components: Performance and memory profiling

• Benefits: Tracks function calls, execution time, memory usage

• Replacement for Legacy Profiling


• Drawbacks of Legacy Profiling

• Advantages of New Unified Profiling

• Session-based, works with Spark Connect, runtime toggling

©2024 Databricks Inc. — All rights reserved


Overview of Unified Profiling
• How to Enable:
• Performance Profiler: spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
• Memory Profiler: spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")

• API Features: "show", "dump", and "clear" commands


• Show results:
  • Performance: spark.profile.show(type="perf")
  • Memory: spark.profile.show(type="memory")
• Dump results: spark.profile.dump("/your_path/...")
• Clear results: spark.profile.clear()

©2024 Databricks Inc. — All rights reserved


PySpark Performance Profiler
from pyspark.sql.functions import pandas_udf

df = spark.range(10)
@pandas_udf("long")
def add1(x):
return x + 1

added = df.select(add1("id"))

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
added.show() Python

©2024 Databricks Inc. — All rights reserved


PySpark Performance Profiler

©2024 Databricks Inc. — All rights reserved


PySpark Memory Profiler
from pyspark.sql.functions import pandas_udf

df = spark.range(10)
@pandas_udf("long")
def add1(x):
return x + 1

added = df.select(add1("id"))

spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
added.show() Python

©2024 Databricks Inc. — All rights reserved


PySpark Memory Profiler

©2024 Databricks Inc. — All rights reserved


Agenda

New Functionalities
Spark Connect, ANSI Mode, Arbitrary Stateful Processing V2, Collation Support, Variant Data Types, pandas 2.x Support

Extensions
Python Data Source APIs, XML/Databricks Connectors and DSV2 Extension, Delta 4.0

Custom Functions and Procedures
SQL UDFs, SQL Scripting, Python UDTF, Arrow-optimized Python UDF, PySpark UDF Unified Profiler

Usability
Structured Logging Framework, Error Class Framework, Behavior Change Process
Structured Logging Framework

©2024 Databricks Inc. — All rights reserved


What are we going to build to improve this?

• Transition to Structured Logging in Apache Spark


• Introducing Spark System Log Directories

©2024 Databricks Inc. — All rights reserved


Analyzing Spark logs is challenging due to their
unstructured nature

©2024 Databricks Inc. — All rights reserved


Structured Spark Logging
Starting from Spark 4.0, the default log format is JSON lines, making it
easier to parse and analyze.

{
"ts": "2023-03-12T12:02:46.661-0700",
"level": "ERROR",
"msg": "Fail to know the executor 289 is alive or not",
"context": {
"executor_id": "289"
},
"exception": {
"class": "org.apache.spark.SparkException",
"msg": "Exception thrown in awaitResult",
"stackTrace": "..."
},
"source": "BlockManagerMasterEndpoint"
} Json

©2024 Databricks Inc. — All rights reserved


Use Spark to Analyze Spark Logs

©2024 Databricks Inc. — All rights reserved


System Log Directories

from pyspark.sql.functions import col

logs = spark.read.json("/var/spark/logs.json")

# Get all the errors on host 100.116.29.4
errors_host_logs = logs.filter(
    (col("context.host") == "100.116.29.4") & (col("level") == "ERROR"))

# Get all the exceptions thrown from Spark code
spark_exceptions_logs = logs.filter(
    col("exception.class").startswith("org.apache.spark"))
Python

©2024 Databricks Inc. — All rights reserved


System Log Directories

logs = spark.read.json("/var/spark/logs.json")

# To get all the executor loss logs


executor_lost_logs = logs.filter(
col("msg").contains("Lost executor"))
# To get all the distributed logs about executor 289
executor_289_logs = logs.filter(
col("context.executor_id") == 289) Python

©2024 Databricks Inc. — All rights reserved


Error Conditions and Messages

©2024 Databricks Inc. — All rights reserved


Error Conditions
800+ top-frequency error conditions from the server.

©2024 Databricks Inc. — All rights reserved


Error Conditions in PySpark
All 200+ error conditions issued by PySpark client
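A minimal sketch of handling these conditions programmatically; the deliberately broken query is made up and the exact condition name can vary by version.

from pyspark.errors import PySparkException

try:
    spark.sql("SELECT * FROM this_table_does_not_exist")
except PySparkException as e:
    print(e.getErrorClass())          # e.g. "TABLE_OR_VIEW_NOT_FOUND"
    print(e.getMessageParameters())   # structured parameters of the error condition
Python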

©2024 Databricks Inc. — All rights reserved


Quality PySpark Errors

• Clear and specific error classes
• Improved documentation
• Consistency and standardization
• Enhanced debugging and maintenance

©2024 Databricks Inc. — All rights reserved


Spark 3.5 vs. Spark 4.0 [comparison screenshots]

©2024 Databricks Inc. — All rights reserved


Behavior Changes

©2024 Databricks Inc. — All rights reserved


Overview of Apache Spark Versioning Policy
• Semantic Versioning Structure: [MAJOR].[FEATURE].[MAINTENANCE]
• MAJOR: Long-term API stability

• FEATURE: New features and improvements

• MAINTENANCE: Frequent, urgent patches

• API Compatibility Commitments: Maintain compatibility across


feature releases, reducing the need for users to refactor code.

©2024 Databricks Inc. — All rights reserved


Categories of Impactful Behavior Changes

• Query Results Impact: Changes affecting the accuracy and outcome of


data queries.
• Schema and Configuration Changes: Adjustments to the database or
application schema and Spark configuration settings.
• API Modifications: Alterations to the public and developer APIs across
multiple programming languages.
• Error Handling Adjustments: Modifications in how errors are classified
and handled within the system.
• Deployment and Management Revisions: Changes in the methods and
tools used for deploying and managing Spark environments.
©2024 Databricks Inc. — All rights reserved
Best Practices for API Changes
• General approach:
  • Avoid API changes whenever possible.
  • Prefer deprecating features over direct modifications to ensure smoother transitions.
  • Implement legacy flags to allow users to opt into previous behaviors temporarily, easing the transition to new versions.
  • Communicate changes clearly through deprecation warnings and documentation updates.
©2024 Databricks Inc. — All rights reserved


Best Practices for API Changes
• User-facing documentation
• Migration Guide Updates: Regularly update the migration guide with detailed
information on changes affecting user operations.
• Legacy Configs: Include information on legacy configurations that might help users
transition smoothly between versions.

• Error Messages:
• Clarity and Actionability: Ensure that all error messages are clear and direct, informing
the user precisely what went wrong.
• Workarounds: Wherever possible, provide actionable advice within the error message,
including configuration changes that can revert to previous behaviors or other
immediate solutions.

©2024 Databricks Inc. — All rights reserved


Best Practices for API Changes
• PR Descriptions
• Detail: Provide comprehensive explanations of the changes, highlighting the
modifications and their implications.
• Transparency: Explain clearly how the new behavior differs from the old, and the
reasons for these changes.

©2024 Databricks Inc. — All rights reserved


Documentation

©2024 Databricks Inc. — All rights reserved


PySpark Doc – Dark Mode

©2024 Databricks Inc. — All rights reserved


PySpark Doc
• SPARK-44728
• More examples

• Environment Setup

• Quickstart

• Type System

©2024 Databricks Inc. — All rights reserved


Versionless Spark Programming Guide

©2024 Databricks Inc. — All rights reserved


[Proposal] Versionless Spark Programming Guide
• Motivation: Allow for real-time
updates and rapid content iteration
without release schedule
constraints.
• Small changes pose less SEO
deranking risk than major updates.
• Transition 9 existing programming
guides to Spark website repo.

©2024 Databricks Inc. — All rights reserved


Python: The Number One Choice

©2024 Databricks Inc. — All rights reserved


Birth of PySpark: 2013

©2024 Databricks Inc. — All rights reserved


>330 million PyPI downloads per year

©2024 Databricks Inc. — All rights reserved


[Map: countries and regions with PyPI downloads of PySpark in the last 12 months]
PySpark’s View of the World

[Diagram: SQL, DataFrame APIs, and pandas APIs sit on top of PySpark, which builds on the Catalyst Optimizer, Adaptive Execution, and the Spark ecosystem (connectors, 3rd-party libraries).]

©2024 Databricks Inc. — All rights reserved


Key Focus of PySpark

Functionality Parity
• Complete feature availability
• Python native APIs

Ease of Use
• No JVM knowledge required
• Pythonic APIs

Performance Parity
• Matched performance with Scala
• Optimized query compiler and engine

Ecosystem Integration
• Integrated with the Python ecosystem
• Spark ecosystem growth

©2024 Databricks Inc. — All rights reserved


Key Enhancements in PySpark 3.5 and 4.0

Functionality Parity
• 180 new built-in functions in Spark 3.5+
• Python native data source APIs
• Stateful streaming processing V2
• Python UDTFs

Ease of Use
• Spark Connect
• Error Framework
• Enriched Documentation
• Unified UDF profiling

Performance
• Spark Connect
• Arrow-optimized Python UDF
• Variant type for semi-structured processing
• Advanced optimizer and adaptive execution

Ecosystem Integration
• Compatible with pandas 2 (pandas API)
• Delta Lake 4.0
• [ML] Distributed training with TorchDistributor
• Arrow integration: toArrow API
©2024 Databricks Inc. — All rights reserved


PySpark feature overview: Python UDF Memory/CPU Profiler, pandas 2 compatibility, pandas API on Spark coverage, NumPy inputs, UDF-level dependency control [WIP], applyInArrow, 180+ new Python functions, Spark Connect, mapInArrow, Deepspeed Distributor, visualization/plotting, richer connectors, Pythonic error handling, DF/UDF debuggability [WIP], Arrow-optimized Python UDFs, DF.toArrow(), Python arbitrary stateful processing, type annotations, Python UDTF, PySpark Testing API

©2024 Databricks Inc. — All rights reserved


Thank you for your
contributions!
©2024 Databricks Inc. — All rights reserved
