PySpark Basics
PySpark is a powerful tool for processing and analyzing big data. It is aimed at data
engineers, data scientists, and ML practitioners who work with large-scale datasets in
distributed environments, with the goal of turning raw data into insights.
What is PySpark
• Apache Spark: An open-source distributed computing system.
• PySpark: The Python API for Spark, allowing parallel computation on large datasets.
• It’s suitable for:
o Batch processing
o Real-time streaming
o Machine learning (ML)
o SQL-based analytics
• Industries benefiting from PySpark include finance, healthcare, and e-commerce
due to its speed and scalability.
When to Use PySpark
PySpark shines when handling big data that exceeds single-machine capacity:
• Distributed data processing using Spark's in-memory architecture.
• ML on large datasets using Spark’s MLlib.
• ETL/ELT pipelines for structured transformation of raw data.
• Works well with formats like CSV, Parquet, and others.
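As a small, hedged sketch of the file-format point (the file paths here are hypothetical, and a SparkSession named spark is assumed, as created in the SparkSession section below):
# Read a CSV file, letting Spark infer column types
events_df = spark.read.csv("events.csv", header=True, inferSchema=True)
# Write the same data back out as Parquet, a columnar format that Spark
# can usually read back more efficiently than CSV
events_df.write.mode("overwrite").parquet("events_parquet")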
Spark Cluster
A Spark cluster consists of:
• Master node: Manages task distribution and resources.
• Worker nodes: Execute the computing tasks.
This setup allows parallel, distributed processing of large datasets, which is a core part
of PySpark’s power.
SparkSession
• SparkSession is the entry point for using PySpark.
• Enables access to Spark features like SQL, streaming, ML, and data handling.
• To create:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()
• .appName() names the application, which is useful when managing multiple applications.
• .getOrCreate() ensures you don’t accidentally start multiple sessions by returning the
existing session if one is already running. This helps avoid creating duplicate sessions,
which can waste memory or even crash the app.
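A minimal hedged sketch of this behavior: calling the builder a second time returns the same session object rather than starting a new one.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()
# A second call to getOrCreate() hands back the existing session
spark_again = SparkSession.builder.getOrCreate()
print(spark is spark_again)  # True: both names refer to the same session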
PySpark DataFrames
• Similar to Pandas DataFrames, but distributed for large-scale processing.
• Created using:
df = spark.read.csv("file.csv", header=True, inferSchema=True)
• df.show() displays contents.
• Syntax is intuitive for those familiar with Pandas, but underlying processing is
optimized for distributed environments.
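As a hedged illustration of the Pandas-like syntax (the column names "name" and "salary" are hypothetical and depend on the CSV that was loaded):
# Inspect the schema that inferSchema produced
df.printSchema()
# Select columns and filter rows, much as you would in Pandas
df.select("name", "salary").filter(df["salary"] > 50000).show()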
Introduction to PySpark DataFrames
• PySpark DataFrames vs Pandas:
o Pandas: Operates on a single machine — not ideal for huge datasets.
o PySpark: Designed for distributed computing — data is spread across a
cluster, enabling faster and scalable processing.
• Why PySpark DataFrames matter:
o Efficient for big data handling.
o Support common operations like filtering, grouping, and aggregating.
o Provide SQL-like syntax, but execute operations differently behind the scenes.
• Although syntax might feel familiar to those with Pandas experience, the
performance model is different due to distributed execution.
(Table: DataFrame function, its use case, and its equivalent in SQL.)
Takeaway:
PySpark DataFrames are essential for scalable, high-performance data processing. They:
• Resemble Pandas in syntax but work across clusters.
• Support SQL-like operations.
• Allow flexible, powerful analytics on massive datasets.
# Load the CSV file into a DataFrame
salaries_df = spark.read.csv("salaries.csv", header=True, inferSchema=True)
# Count the total number of rows
row_count = salaries_df.count()
print(f"Total rows: {row_count}")
# Group by company size and calculate the average of salaries
salaries_df.groupBy("company_size").agg({"salary_in_usd": "avg"}).show()
salaries_df.show()
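Since the notes mention SQL-like syntax, here is a hedged sketch of the same aggregation expressed through Spark SQL by registering the DataFrame as a temporary view:
# Register the DataFrame as a temporary view so it can be queried with SQL
salaries_df.createOrReplaceTempView("salaries")
# Same group-by average as above, written as a SQL query
spark.sql("""
    SELECT company_size, AVG(salary_in_usd) AS avg_salary
    FROM salaries
    GROUP BY company_size
""").show()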
# Average salary for entry-level roles at companies located in Canada ("CA")
CA_jobs = salaries_df.filter(salaries_df['company_location'] == "CA") \
    .filter(salaries_df['experience_level'] == "EN") \
    .groupBy().avg("salary_in_usd")
# Show the result
CA_jobs.show()
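The two chained filters above can also be combined into a single condition; a hedged, equivalent alternative using pyspark.sql.functions:
from pyspark.sql import functions as F
# Combine both conditions in one filter and give the aggregate a clear name
CA_entry_avg = salaries_df.filter(
    (F.col("company_location") == "CA") & (F.col("experience_level") == "EN")
).agg(F.avg("salary_in_usd").alias("avg_entry_salary_ca"))
CA_entry_avg.show()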
Union Operation
The union operation in PySpark combines or “stacks” two DataFrames with the same
schema (same number of columns, with matching types and order). This operation is crucial
when datasets are split across different sources or time periods and need to be consolidated
into one.
• Conditions for Union:
o The DataFrames must have the same number of columns and matching data
types for the operation to succeed. If they don’t match, PySpark will throw an
error.
o The union operation is typically used to combine data from separate files, such
as monthly sales data into a unified yearly dataset.
df_union = df1.union(df2)
o This stacks df2 below df1, creating a single DataFrame that consolidates rows
from both DataFrames.
• Schema Alignment: It is critical to ensure that the schemas of the two DataFrames
align correctly. Mismatched schemas or column types will prevent the union from
working.
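A minimal, hedged sketch of the monthly-sales scenario described above, using two small in-memory DataFrames (the column names and values are hypothetical):
# Two DataFrames with the same schema: (month, sales)
jan_df = spark.createDataFrame([("Jan", 1000), ("Jan", 1500)], ["month", "sales"])
feb_df = spark.createDataFrame([("Feb", 1200), ("Feb", 900)], ["month", "sales"])
# Stack feb_df below jan_df; both schemas match, so the union succeeds
sales_df = jan_df.union(feb_df)
sales_df.show()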
Joining DataFrames
• Examine the airports DataFrame. Note which key column will let you join airports to
the flights table.
• Join the flights with the airports DataFrame on the "dest" column. Save the result
as flights_with_airports.
• Examine flights_with_airports again. Note the new information that has been added.
# Examine the data
airports.show()

# .withColumnRenamed() renames the "faa" column to "dest"
airports = airports.withColumnRenamed("faa", "dest")

# Join the DataFrames on the shared "dest" column, keeping every flight
flights_with_airports = flights.join(airports, on='dest', how='leftouter')

# Examine the joined data to see the added airport information
flights_with_airports.show()
1. Introduction to UDFs
o UDF stands for User-Defined Function in PySpark.
o UDFs are custom functions that you define and use within PySpark DataFrames
to manipulate data in ways that built-in PySpark functions may not support.
o There are two main types of UDFs in PySpark: PySpark UDFs and pandas
UDFs.
2. UDFs for Repeatable Tasks
o UDFs are designed to be reusable and repeatable. This means that once
created and registered with the SparkSession, they can be applied multiple times
across different DataFrames.
o PySpark UDFs are suitable for handling smaller datasets.
o pandas UDFs are designed for larger datasets and typically offer better
performance in those cases.
o Both types of UDFs are used in much the same way, but they differ in how they
process data: PySpark UDFs work row by row, while pandas UDFs operate on
batches of rows, which is where their performance advantage comes from.
3. Defining and Registering a UDF
o To create a UDF, you first define a standard Python function. For example, a
function like to_upper_case() could be used to convert text data in a column to
uppercase.
o Once you define the function, you need to register it as a UDF using the .udf()
function in PySpark and specify the correct data type (e.g., StringType() for
string operations).
o Registration is important because it makes the UDF accessible across all
worker nodes in the Spark cluster through the SparkSession.
Example code:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Define a standard Python function
def to_upper_case(input_string):
    return input_string.upper()

# Register it as a UDF with its return type so it can be used on DataFrames
to_upper_case_udf = udf(to_upper_case, StringType())
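A hedged usage sketch, assuming a DataFrame df with a string column called "name" (both hypothetical):
# Apply the registered UDF to build an uppercase version of the column
df = df.withColumn("name_upper", to_upper_case_udf(df["name"]))
df.show()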
4. pandas UDF
o pandas UDFs are designed to work with larger datasets and provide better
performance.
o You define a pandas UDF using the @pandas_udf() decorator and specify the
return type (e.g., @pandas_udf("float")).
o Unlike PySpark UDFs, pandas UDFs do not need to be explicitly registered
with the SparkSession, and they are more efficient because they handle data in
bulk rather than row by row.
Example code:
from pyspark.sql.functions import pandas_udf
import pandas as pd
# Define a pandas UDF
@pandas_udf("float")
def calculate_square(s: pd.Series) -> pd.Series:
    return s * s
# Apply the pandas UDF to a DataFrame
df = df.withColumn("squared_column", calculate_square(df["numeric_column"]))
df.show()
5. PySpark UDFs vs. pandas UDFs
o PySpark UDFs are ideal for small datasets and simple transformations. They
operate at the column level and require registration with the SparkSession.
o pandas UDFs are better for large datasets because they leverage the power of
pandas, making them more efficient for processing large volumes of data.
o The decision to use one over the other depends on several factors, including:
▪ Data size: Small datasets work better with PySpark UDFs, while large
datasets benefit from pandas UDFs.
▪ Complexity: Simple transformations may be fine with PySpark UDFs,
but for more complex operations on large data, pandas UDFs offer
performance benefits.
▪ Performance considerations: pandas UDFs typically offer better
performance, but they may also require a more complex setup and
handling of data types.
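To make the comparison concrete, here is a hedged sketch of the same uppercase transformation written both ways, reusing the hypothetical df with a "name" column from the earlier examples:
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import StringType
import pandas as pd

# Row-by-row PySpark UDF
upper_udf = udf(lambda s: s.upper(), StringType())

# Vectorized pandas UDF that receives a whole pd.Series per batch
@pandas_udf(StringType())
def upper_pandas_udf(s: pd.Series) -> pd.Series:
    return s.str.upper()

df.withColumn("name_upper", upper_udf(df["name"])).show()
df.withColumn("name_upper", upper_pandas_udf(df["name"])).show()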