
Day 19

This document discusses additional functions and user-defined functions in Spark SQL. It covers: 1. Applying built-in functions like collect_set, explode, and lit to generate new columns and handle null values. 2. Defining a user-defined function (UDF) to transform data, which incurs overhead because row data must be serialized between Spark's native format and the UDF. 3. Registering UDFs with spark.udf.register to make them available in SQL queries, and using Python decorators for vectorized UDFs.


Additional Functions

Objectives:
1. Apply built-in functions to generate data for new columns
2. Apply DataFrame NA functions to handle null values
3. Join DataFrames

Methods:

• DataFrameNaFunctions: fill

Built-In Functions:

• Aggregate: collect_set
• Collection: explode
• Non-aggregate and miscellaneous: col, lit
1-A: Get emails of converted users from transactions
1. Select the email column in UserDF and remove duplicates
2. Add a new column converted with the value True for all rows
3. Save the result as convertedUsersDF

User-Defined Functions
Objectives:
1. Define a function
2. Create and apply a UDF
3. Register the UDF to use in SQL
4. Create and register a UDF with Python decorator syntax
5. Create and apply a Pandas (vectorized) UDF

Methods:
• UDF Registration (spark.udf): register
• Built-In Functions: udf
• Python UDF Decorator: @udf
• Pandas UDF Decorator: @pandas_udf
User-Defined Function (UDF)
A custom column transformation function:

• Cannot be optimized by the Catalyst Optimizer
• The function is serialized and sent to executors
• Row data is deserialized from Spark's native binary format to pass to the UDF, and the results are serialized back into Spark's native format
• For Python UDFs, there is additional interprocess communication overhead between the executor and a Python interpreter running on each worker node

Define a function
Define a function (on the driver) to get the first letter of a string from the email field.
Register UDF to use in SQL
Register the UDF using spark.udf.register to also make it available for use in the SQL namespace.

Note:

For more hands-on practice with UDFs, see the notebook ASP 2.5 – UDFs by Sandip Sir.
