PySpark Notes
Section 1: Introduction
• In-Memory Processing:
o Retains intermediate data in memory, avoiding disk I/O (10–100x faster than Hadoop MapReduce).
• Lazy Evaluation:
o Transformations (e.g., filter, join) are queued until an action (e.g., count, show) is called.
o Example:
# No execution happens here
filtered_df = df.filter(df["age"] > 30)
transformed_df = filtered_df.withColumn("age_plus_10", df["age"] + 10)
# Execution triggered by action
transformed_df.show()
o Spark optimizes the DAG before execution (e.g., predicate pushdown).
• Fault Tolerance:
o Rebuilds lost data partitions using lineage (log of transformations).
• Partitioning:
o Data split into partitions for parallel processing.
o Example:
df.rdd.getNumPartitions() # Returns partition count
df = df.repartition(8) # Explicitly increase parallelism (returns a new DataFrame)
Best Practices:
• Avoid collect(): It retrieves all data to the driver, risking OutOfMemory errors.
• Use .cache() judiciously: Only for DataFrames reused across multiple actions.
Common Pitfalls:
• Skewed Partitions:
# Bad: Causes uneven data distribution
df.groupBy("country").count()
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, MapType
schema = StructType([
StructField("item_id", StringType(), nullable=False), # Non-nullable
StructField("price", DoubleType(), nullable=True),
StructField("attributes", MapType(StringType(), StringType())) # Nested type
])
df = spark.read.schema(schema).json("/path.json")
• Pros: Supports nested data (arrays/maps), enforces nullability.
Schema Override Example
# Force "item_weight" to string despite numeric values
schema = StructType([StructField("item_weight", StringType())])
df = spark.read.schema(schema).csv("bigmart_sales.csv")
2.4 Data Inspection
show() vs display()
• df.show(n=20, truncate=True, vertical=False):
o PySpark-native.
o truncate=False shows full column content.
• display(df):
o Databricks-specific with UI enhancements (charts, filters).
o Not available in vanilla Spark.
Schema Verification
df.printSchema() # Output:
# root
# |-- column1: integer (nullable = true)
# |-- column2: string (nullable = true)
Best Practices
1. Avoid inferSchema in Production:
o Predefine schemas for reliability and performance.
2. Use Multi-line JSON Sparingly:
o Not splittable → single-task reads → memory bottlenecks.
3. Handle Corrupt Records:
spark.read.option("columnNameOfCorruptRecord", "_malformed") \
.json("data.json")
Common Pitfalls
• Schema Mismatches:
o Error: AnalysisException: Cannot up cast column from StringType to IntegerType.
o Fix: Use .schema() or cast() during transformations.
• Date Format Ambiguity:
spark.read.option("dateFormat", "MM/dd/yyyy").csv("dates.csv")
Best Practices
• Use aliases in aggregations:
df.groupBy("category").agg(
avg("price").alias("avg_price")
)
• Aliases are immutable: Original column remains unless explicitly overwritten.
3.3 Filter Transformation
Syntax Deep Dive
# Equivalent to SQL WHERE clause
df.filter(col("price") > 100)
df.where(col("price") > 100) # Alias for filter()
# Complex logic
df.filter(
(col("city") == "NYC") &
(col("sales").isNotNull()) &
(~col("product").isin("A", "B"))
)
Optimization Tips
• Push filters upstream: Apply filters before joins/aggregations to reduce data volume.
• Use isin() instead of multiple OR conditions for readability.
# Conditional logic
df = df.withColumn(
"discount_tier",
when(col("sales") > 1000, "Gold")
.when(col("sales") > 500, "Silver")
.otherwise("Bronze")
)
# Type-safe operations
from pyspark.sql.types import DateType
df = df.withColumn(
"delivery_date",
col("order_date").cast(DateType())
)
Performance Note
Chaining multiple withColumn calls creates a single optimized plan:
# Date conversions
df.withColumn(
"order_date",
to_date(col("date_string"), "MM/dd/yyyy")
)
Pitfalls
• Silent nulls: Invalid casts return null (use try_cast for error tracking).
• Locale issues: Date formats vary (e.g., dd/MM/yyyy vs MM/dd/yyyy).
df.withColumn("rank", rank().over(window))
.filter(col("rank") <= 3) # Get top 3 per category
Best Practices
1. Column Selection:
o Prefer explicit column lists over select("*").
2. Filter Early:
o Apply filters before resource-intensive operations.
3. Alias Conflicts:
o Alias or rename duplicate column names before joins to avoid ambiguity.
4. Broadcast Hints:
o Use df.hint("broadcast") when joining large + small datasets.
Common Errors
• AnalysisException: Cannot resolve column name:
o Caused by typos or unaliased duplicate columns.
• IllegalArgumentException: requirement failed:
o Invalid type casting (e.g., string → integer with non-numeric values).
Real-World Use Case
An e-commerce platform uses these transformations to:
• Clean raw JSON clickstream data (filter, withColumn).
• Enrich user profiles (select, alias).
• Generate leaderboards (sort, window functions).
from pyspark.sql.functions import approx_count_distinct, collect_list
# Advanced aggregations
df.groupBy("category") \
.agg(
approx_count_distinct("user_id", 0.05).alias("unique_users"), # 5% error tolerance
collect_list("product_id").alias("products_purchased")
)
Best Practices
• Skew Mitigation: Use salting for skewed keys:
df.withColumn("salted_key", concat(col("key"), lit("_"), (rand() * 10).cast("int")))
• Pre-Aggregation: Reduce data size with combineByKey before shuffling.
Pitfalls
• OOM Errors: Very large groups (e.g., 100M users) can exhaust executor memory → increase shuffle parallelism so each task handles less data:
spark.conf.set("spark.sql.shuffle.partitions", 2000)
4.2 Joins
Join Types Explained
Join Type Behavior Use Case
inner Intersection of keys Fact-dimension linking
left_anti Rows in left NOT in right Data validation
left_semi Rows in left that exist in right Filtering
Optimization Techniques
• Broadcast Join:
# Auto-broadcast if table < spark.sql.autoBroadcastJoinThreshold (default 10MB)
df_large.join(broadcast(df_small), "key")
• Sort-Merge Join: For large tables joined on the same keys:
df1.hint("merge").join(df2, "key")
Skew Handling
# Salt the hot key into sub-keys (apply the same salting to the other side of the join)
df_skew = df.withColumn("key",
    when(col("key") == "hot_key", concat(col("key"), lit("_"), (rand() * 10).cast("int")))
    .otherwise(col("key")))
df.withColumn("cumulative_spend", _sum("amount").over(window_spec))
Performance Considerations
• Range Frames: Use .rangeBetween for time-series to avoid row-by-row processing.
• Partition Sizing: Too many partitions → Task overhead; too few → Skew.
from pyspark.ml.feature import Imputer
imputer = Imputer(
inputCols=["price"],
outputCols=["price_imputed"],
strategy="median" # "mean", "mode"
).fit(df)
imputer.transform(df)
Null Handling in Joins
# Standard equality joins drop null keys; use a null-safe equality to match them
df1.join(df2, df1["key"].eqNullSafe(df2["key"]))
@pandas_udf("double")
def pandas_normalize(s: pd.Series) -> pd.Series:
return (s - s.mean()) / s.std()
df.withColumn("normalized", pandas_normalize(col("value")))
Best Practices
1. Join Strategy: Use explain() to verify join type (broadcast/sort-merge).
2. Pivot Limits: Never pivot columns with >10K distinct values.
3. Window Tuning: Pair .orderBy() with .rowsBetween() for deterministic results.
Common Errors
• Cannot broadcast table larger than 8GB: 8GB is a hard Spark limit; remove the broadcast hint (or lower autoBroadcastJoinThreshold) so the table is not broadcast.
• Specified window frame appears more than once: Reuse window specs.
Storage Levels
Level Description Use Case
MEMORY_ONLY Deserialized Java objects in memory Fast access for small datasets
MEMORY_AND_DISK_SER Serialized data (spills to disk if OOM) Large datasets with limited RAM
DISK_ONLY Data stored only on disk Rarely accessed DataFrames
Example
# Cache for iterative ML training
train_data = spark.read.parquet("train_data").cache()
for i in range(10):
model = train_model(train_data) # Reused without re-reading
train_data.unpersist()
Best Practices
• Avoid Over-Caching: Cache only DataFrames used ≥3 times.
• Monitor Usage:
o Inspect cached DataFrames and executor memory in the Spark UI (Storage and Executors tabs).
Pitfalls
• Serialization Overhead: MEMORY_ONLY_SER saves space but adds CPU cost.
• Stale Data: Cached DataFrames don’t auto-update if source data changes.
• Optimal Size:
# Calculate partitions from the table size (DESCRIBE DETAIL works on Delta tables; path is illustrative)
data_size = spark.sql("DESCRIBE DETAIL delta.`/path/to/table`").select("sizeInBytes").first()[0]
num_partitions = max(1, data_size // (128 * 1024 * 1024)) # 128MB partitions
Coalesce
# Merge partitions without shuffle (e.g., after filtering)
df = df.filter(col("year") == 2023).coalesce(50)
Skew Handling
# Salting skewed keys
df = df.withColumn("salted_key", concat(col("key"), lit("_"), (rand() * 100).cast("int")))
df.groupBy("salted_key").count()
Configuration
# Adjust auto-broadcast threshold (default 10MB)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024) # 100MB
Example
# Force broadcast for a medium-sized table
dim_df = spark.read.parquet("dim_table")
fact_df.join(broadcast(dim_df), "id")
Pitfalls
• OOM Errors: Broadcasting tables >1/4 of executor memory.
• Network Overhead: Large broadcasts delay task scheduling.
BI Integration
# Expose via JDBC
./bin/beeline -u jdbc:hive2://localhost:10000
Optimization
• Predicate Pushdown: Ensure filters are applied at scan stage.
• Column Pruning: Read only necessary columns.
Configuration
# Enable all AQE optimizations
spark.sql.adaptive.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.adaptive.localShuffleReader.enabled=true
Real-World Impact
A telecom company reduced ETL job time by 40% using AQE to handle skewed customer usage data.
Best Practices
1. Caching: Prefer MEMORY_AND_DISK_SER for DataFrames >1GB.
2. Partitioning: Align partition keys with common filter/join columns.
3. Broadcast: Use for star schema joins (fact ↔ dimensions).
4. Execution Plans: Check for unnecessary shuffles using explain().
Common Errors
• OutOfMemoryError: Caused by caching oversized DataFrames or over-partitioning.
• Broadcast timeout: Increase spark.sql.broadcastTimeout (default 300s).
Section 6: PySpark in Production
• Idempotence:
# Overwrite specific partitions safely
df.write.partitionBy("date") \
.mode("overwrite") \
.option("replaceWhere", "date = '2024-01-01'") # Delta Lake only
• Data Quality:
# Use PySpark Expectations (open-source)
from pyspark_expectations import expect
df = df.withColumn("valid_amount", expect.greater_than(col("amount"), 0))
invalid = df.filter(col("valid_amount") == False)
invalid.write.mode("append").parquet("/error_logs")
Pitfalls
• Schema Drift: Use Delta Lake schema enforcement to block invalid writes.
• Partial Failures: Write to temporary directories first, then atomically commit.
stream.writeStream.foreachBatch(write_to_cassandra).start()
Backpressure Handling
spark.conf.set("spark.streaming.backpressure.enabled", "true") # Legacy DStream API only; for Structured Streaming + Kafka, cap input with maxOffsetsPerTrigger
# Load in production
from pyspark.ml import PipelineModel
prod_model = PipelineModel.load("s3://models/rf_v1")
Real-Time Inference
• Structured Streaming:
stream_df = spark.readStream.format("kafka")...
predictions = prod_model.transform(stream_df)
• Serving via REST: Export to ONNX and serve via Triton Inference Server.
Cost-Efficient Tuning
# Use TrainValidationSplit instead of CrossValidator for large datasets
from pyspark.ml.tuning import TrainValidationSplit
tvs = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=param_grid, evaluator=evaluator, trainRatio=0.8) # evaluator required for fitting, e.g., a RegressionEvaluator
Dynamic Allocation
spark.conf.set("spark.dynamicAllocation.maxExecutors", 100) # Prevent cost overruns
Monitoring
• Spark UI: Track task skew via Task Time histograms.
• Prometheus Integration:
spark.conf.set("spark.metrics.conf.*.sink.prometheus.class",
"org.apache.spark.metrics.sink.PrometheusSink")
6.5 Delta Lake Integration
Schema Evolution
• Add Columns: autoMerge allows new nullable columns.
• Breaking Changes (e.g., rename column):
ALTER TABLE delta.`/path` SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name', 'delta.minReaderVersion' = '2', 'delta.minWriterVersion' = '5') -- column mapping enables RENAME/DROP COLUMN
Time Travel
# Compare versions for debugging
df_v1 = spark.read.format("delta").option("versionAsOf", 1).load("/path")
df_v2 = spark.read.format("delta").option("versionAsOf", 2).load("/path")
diff = df_v2.exceptAll(df_v1)
Optimizations
• Z-Ordering:
OPTIMIZE delta.`/path` ZORDER BY (user_id, timestamp) -- Speed up user-centric queries
Comprehensive Validation
try:
df = spark.read.format("parquet").load("s3://data/")
except Exception as e:
if "Path does not exist" in str(e):
initialize_storage()
elif "AccessDenied" in str(e):
refresh_aws_credentials()
else:
raise
Best Practices
1. Circuit Breakers: Halt pipelines after consecutive failures.
2. Dead Letter Queues: Route invalid records to dedicated storage for analysis.
Key Differences:
• current_date() returns DateType (e.g., 2024-05-27).
• current_timestamp() returns TimestampType (e.g., 2024-05-27 14:30:45.123456).
2. Date Arithmetic
Purpose: Manipulate dates by adding/subtracting days or calculating intervals.
a. Date Addition/Subtraction
from pyspark.sql.functions import date_add, date_sub
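Example (a minimal sketch; the order_date column name is assumed for illustration):
from pyspark.sql.functions import col
df = df.withColumn("ship_by", date_add(col("order_date"), 7)) # 7 days after the order
df = df.withColumn("reorder_reminder", date_sub(col("order_date"), 3)) # 3 days before the order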
b. Date Difference
from pyspark.sql.functions import datediff, months_between
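Example (sketch; delivery_date and signup_date are assumed column names):
from pyspark.sql.functions import col
df = df.withColumn("days_to_deliver", datediff(col("delivery_date"), col("order_date"))) # whole days
df = df.withColumn("account_age_months", months_between(col("order_date"), col("signup_date"))) # fractional months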
df = df.withColumn("order_year", year(col("order_date")))
df = df.withColumn("order_month", month(col("order_date")))
df = df.withColumn("order_day_of_week", dayofweek(col("order_date"))) # Sunday=1, Saturday=7
Common Pitfalls
1. Incorrect Format Strings:
# Fails silently if format mismatches
df.withColumn("date", to_date(col("date_str"), "yyyy-MM-dd")) # Returns null for "27/05/2024"
Fix: Validate data or use try_to_date (Spark 3.0+).
2. Leap Year/Invalid Dates:
o to_date("2023-02-30", "yyyy-MM-dd") → null.
Mitigation: Use data validation pipelines.
3. Time Zone Ignorance: Mixing timestamps without converting to UTC.
Advanced Example
# Shift each order date forward to the next business day (skipping weekends)
from pyspark.sql.functions import when, dayofweek, date_add, col
df = df.withColumn("next_business_day",
    when(dayofweek(col("order_date")).isin(1, 7), # Sunday=1, Saturday=7
         date_add(col("order_date"), 2)) # Skip over the weekend
    .otherwise(date_add(col("order_date"), 1))
)
Section 8: Handling Nulls in PySpark
8.1 Handling Nulls in PySpark
1. Dropping Null Values
Purpose: Remove rows/columns containing nulls to clean datasets.
Syntax & Parameters
# Drop rows with ANY nulls in specified columns
df_clean = df.dropna(how="any", subset=["order_date", "customer_id"])
Use Cases
• Critical columns (e.g., order_id, timestamp) where nulls indicate data corruption.
• Aggregation prep: Ensure key grouping columns are non-null.
# Column-specific replacements
df_filled = df.fillna(
{"outlet_size": "Unknown", "item_weight": df.agg(avg("item_weight")).first()[0]}
)
5. Best Practices
1. Avoid Over-Dropping:
# Only drop nulls in critical columns, retain others for analysis
df.dropna(subset=["order_id"])
2. Documentation:
o Log null counts before/after handling.
o Track replacement logic (e.g., "Filled outlet_size nulls with 'Unknown'").
3. Validation:
# Ensure no nulls remain in key columns
assert df.filter(col("order_id").isNull()).count() == 0
6. Real-World Scenario
Problem:
• Outlet_Size has 15% nulls (fill with "Unknown").
• Order_Date has 0.1% nulls (drop these rows).
Solution:
df = (df
.fillna("Unknown", subset=["Outlet_Size"])
.dropna(subset=["Order_Date"])
)
7. Common Pitfalls
1. Silent Data Loss:
o Using how="any" without subset drops rows with nulls in unimportant columns.
2. Inappropriate Fill Values:
o Filling numeric nulls with 0 might skew aggregations (use median instead).
3. Cascade Effects:
o Aggregations like groupBy automatically exclude nulls, leading to undercounts.
8. Advanced Techniques
Custom Null Markers
# Treat "N/A" and "?" as nulls during read
df = spark.read.csv("data.csv", nullValue=["N/A", "?"])
2. Array Indexing
Purpose: Access specific elements without exploding the entire array.
getItem() Method
# Extract first and second array elements
df = df.withColumn("main_type", col("type_components").getItem(0)) \
.withColumn("sub_type", col("type_components").getItem(1))
Performance Note:
• 10-100x faster than explode() when only specific elements are needed.
Negative Indexing
# getItem() does not support negative indices; use element_at() (1-based, -1 = last element)
from pyspark.sql.functions import element_at
df = df.withColumn("last_tag", element_at(col("tags"), -1))
3. Explode Function
Purpose: Transform array elements into individual rows (denormalization).
Basic Explode
from pyspark.sql.functions import explode
c. Tag Analysis
# Count occurrences of each tag
df_tags = df.withColumn("tag", explode(col("tags"))) \
.groupBy("tag").count()
5. Performance Considerations
Explosion Factor
• Exploding an array with average length N increases row count by Nx.
• Mitigation:
# Prune array elements before exploding (uses pyspark.sql.functions.filter, Spark 3.1+)
from pyspark.sql.functions import filter as array_filter, length
df = df.withColumn("large_tags", array_filter(col("tags"), lambda x: length(x) > 3))
Alternatives to Explode
• Use array_contains() for existence checks:
df.filter(array_contains(col("tags"), "urgent"))
6. Best Practices
1. Schema Control:
# Ensure exploded columns are typed correctly
df.withColumn("size", explode(col("sizes")).cast(StringType()))
2. Error Handling:
# Keep rows whose array is null: explode() drops them, explode_outer() emits a null row instead
df.withColumn("tag", explode_outer(col("tags")))
7. Common Pitfalls
• Cartesian Explosion: Exploding multiple array columns without caution:
# DANGER: Rows multiply by len(tags)*len(sizes)
df.withColumn("tag", explode("tags")) \
.withColumn("size", explode("sizes"))
• Null Propagation: Exploding null arrays yields zero rows (silent data loss).
Section 10: Array Operations (array_contains, collect_list)
1. array_contains()
Purpose: Check for the existence of a value in an array column.
Syntax & Usage
from pyspark.sql.functions import array_contains
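Example producing the output below (sketch; split_col is assumed to come from an earlier split() of Outlet_Type):
from pyspark.sql.functions import col
df = df.withColumn("is_type1", array_contains(col("split_col"), "Type1"))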
Output:
split_col is_type1
["Supermarket", "Type1"] true
["Grocery", "Type2"] false
Use Cases:
• Filtering records with specific tags:
df.filter(array_contains(col("tags"), "urgent")).show()
• Feature engineering for ML (e.g., flagging products with certain attributes).
2. collect_list()
Purpose: Aggregate values into an array with duplicates, preserving order.
Syntax & Example
from pyspark.sql.functions import collect_list
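Example (sketch; user_id/product_id are illustrative column names):
df_history = df.groupBy("user_id").agg(collect_list("product_id").alias("purchase_history")) # duplicates and input order retained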
Key Features:
• Retains duplicates and order (e.g., chronological user activity logs).
• Returns ArrayType matching the input column’s data type.
3. collect_set()
Purpose: Aggregate values into an array without duplicates.
Syntax & Example
from pyspark.sql.functions import collect_set
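Example matching the output below (column names assumed):
df_unique = df.groupBy("user_id").agg(collect_set("product_id").alias("unique_products"))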
Output:
user_id unique_products
1001 ["P123", "P456"]
Performance Note:
• Deduplication requires additional computation (hash + sort).
• Avoid on high-cardinality columns (e.g., timestamps).
b. Size Analysis
from pyspark.sql.functions import size
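Example (sketch; tags is an assumed array column):
from pyspark.sql.functions import col
df = df.withColumn("num_tags", size(col("tags"))) # null arrays may yield -1 depending on spark.sql.legacy.sizeOfNull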
c. Explode + Collect
# Normalize and re-aggregate (e.g., cleaning tags)
df_exploded = df.withColumn("tag", explode(col("tags"))) \
.filter(length(col("tag")) > 3)
df_clean = df_exploded.groupBy("id").agg(collect_set("tag"))
6. Best Practices
1. Use collect_list for Ordered Data:
o User session streams, time-series event sequences.
2. Use collect_set for Uniqueness:
o Distinct product views, unique IP addresses.
3. Limit Array Sizes:
# Truncate arrays to avoid OOM errors
df.withColumn("trimmed_actions", slice(col("action_sequence"), 1, 100))
7. Common Pitfalls
• Memory Overload: Collecting large arrays (e.g., 1M elements) → Driver OOM.
o Fix: Use agg(collect_list(...).alias(...)) only if necessary for downstream processing.
• Order Non-Guarantee in collect_set:
# Use collect_list + array_distinct for order preservation
df.groupBy("user_id") \
.agg(array_distinct(collect_list("product_id")).alias("unique_ordered"))
8. Real-World Applications
a. Recommendation Systems
# Collect co-viewed products
df_co_view = df.groupBy("session_id") \
.agg(collect_set("product_id").alias("related_products"))
aggregated_df = df.groupBy("Category") \
.agg(sum("Sales").alias("total_sales"))
Example:
# Sample DataFrame
data = [("Electronics", 100), ("Clothing", 200), ("Electronics", 300)]
df = spark.createDataFrame(data, ["Category", "Sales"])
# GroupBy + Sum
result = df.groupBy("Category").agg(sum("Sales").alias("total_sales"))
result.show()
Category total_sales
Electronics 400
Clothing 200
Key Notes:
• Use groupBy() to specify grouping columns.
• agg() accepts aggregate expressions (e.g., sum(), avg()).
df.groupBy("Category") \
.agg(avg("Sales").alias("avg_sales"),
collect_list("Sales").alias("all_sales")) \
.show()
3. Multiple Aggregations
Concept: Apply multiple aggregate functions in a single groupBy operation.
Example:
from pyspark.sql.functions import sum, avg, count
df.groupBy("Department") \
.agg(sum("Sales").alias("total_sales"),
avg("Sales").alias("avg_sales"),
count("*").alias("transactions")) \
.show()
Best Practices:
• Use * in count("*") to include rows with null values.
• Always alias columns for readability (e.g., alias("total_sales")).
4. Pivot Operations
Concept: Reshape data by converting unique values of a column into multiple columns.
Syntax:
pivoted_df = df.groupBy("Year").pivot("Quarter").sum("Revenue")
Example:
# Sample Data
data = [("2023", "Q1", 500), ("2023", "Q2", 600), ("2024", "Q1", 700)]
df = spark.createDataFrame(data, ["Year", "Quarter", "Revenue"])
# Pivot by Quarter
pivoted = df.groupBy("Year").pivot("Quarter").sum("Revenue")
pivoted.show()
Year Q1 Q2
2023 500 600
2024 700 null
5. Performance Considerations
1. Use Approximate Counts:
o Replace countDistinct() with approx_count_distinct() for large datasets (faster but
slightly inaccurate):
from pyspark.sql.functions import approx_count_distinct
df.agg(approx_count_distinct("CustomerID", rsd=0.05))
2. Filter Early:
o Reduce data size before grouping:
df.where(col("Sales") > 100).groupBy("Category").sum()
6. Real-World Example
Scenario: Analyze sales data for a retail chain.
Code:
from pyspark.sql.functions import sum, avg, countDistinct
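A plausible version of the code (column names Category, StoreID, Revenue, Price, CustomerID are assumptions chosen to match the output columns described below):
sales_summary = (df.groupBy("Category", "StoreID")
    .agg(sum("Revenue").alias("total_sales"),
         avg("Price").alias("avg_price"),
         countDistinct("CustomerID").alias("unique_customers")))
sales_summary.show()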
Output Columns:
• total_sales: Total revenue per category per store.
• avg_price: Average product price.
• unique_customers: Distinct customers (using countDistinct).
7. Common Pitfalls & Solutions
1. Unaliased Columns:
o Issue: Aggregation columns get ugly names like sum(Sales).
o Fix: Always use .alias("readable_name").
2. Data Skew:
o Issue: Grouping by high-cardinality columns (e.g., user_id) causes uneven partitions.
o Fix: Use salting or pre-aggregation.
3. Null Handling:
o Issue: avg() and count("col") both skip nulls, while count("*") counts every row, which can give inconsistent denominators.
o Fix: Clean data with fillna() or coalesce() before aggregation:
df.fillna(0, subset=["Sales"])
Summary
• GroupBy + Agg is essential for summarizing data (e.g., sales reports, KPIs).
• Use pivot to reshape data for visualization or reporting.
• Optimize with filtering, approximate functions, and caching for large datasets.
1. Core Concept
What Are Window Functions?
Window functions perform calculations across a set of rows related to the current row without
collapsing the dataset. Unlike groupBy, they retain all original data while adding computed columns
(e.g., rankings, running totals).
Key Characteristics:
• Preserve row-level details.
• Define a "window" of rows using partitions, ordering, and frame boundaries.
• Commonly used for time-series analysis, rankings, and cumulative metrics.
# Sample DataFrame
data = [("Sales", 1000), ("Sales", 1500), ("HR", 800), ("Sales", 1200)]
df = spark.createDataFrame(data, ["department", "salary"])
result.show()
HR 800 1 1 1 null
3. Window Specification
Syntax:
from pyspark.sql.window import Window
window_spec = (Window
    .partitionBy("partition_col")   # Split data into groups
    .orderBy("order_col")           # Sort rows within partitions
    .rowsBetween(start, end))       # Define frame boundaries
Frame Boundaries:
• rowsBetween(start, end): Physical row offsets
(e.g., Window.unboundedPreceding, 1, Window.currentRow).
• rangeBetween(start, end): Logical value offsets (e.g., for dates or numeric ranges).
Examples:
1. Cumulative Sum (Entire Partition):
Window.partitionBy("department").orderBy("date").rowsBetween(Window.unboundedPreceding,
Window.currentRow)
4. Practical Examples
a) Running Totals
from pyspark.sql.functions import sum
window_spec = Window.partitionBy("department").orderBy("date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("running_total", sum("sales").over(window_spec)).show()
HR Jan 1 50 50
window_spec = Window.partitionBy("department").orderBy(col("salary").desc())
ranked_df = df.withColumn("rank", row_number().over(window_spec))
ranked_df.filter(col("rank") <= 3).show()
c) Month-over-Month Growth
from pyspark.sql.functions import lag, when
window_spec = Window.partitionBy("department").orderBy("month")
df.withColumn("prev_month_sales", lag("sales", 1).over(window_spec)) \
.withColumn("growth", when(col("prev_month_sales").isNull(), 0)
.otherwise((col("sales") - col("prev_month_sales")) / col("prev_month_sales")))
5. Performance Optimization
1. Narrow Partitions:
Avoid partitioning by high-cardinality columns (e.g., user_id). Use partitionBy on columns with
limited unique values.
2. Frame Selection:
o Use rangeBetween for numeric/temporal data (e.g., time-series intervals).
o Use rowsBetween for fixed-size windows (e.g., last 7 rows).
3. Caching:
Cache DataFrames if reusing the same window spec multiple times:
df.cache().groupBy(...) # Reduces re-computation overhead
1. Join Types
Concept: Combine data from two DataFrames based on a common key.
Types:
Join Type      SQL Equivalent       Behavior                                                                  Use Case
Inner          INNER JOIN           Returns only rows with matching keys in both DataFrames.                  Intersection of data.
Left           LEFT OUTER JOIN      Returns all rows from the left DataFrame + matched rows from the right.   Enriching data without losing left rows.
Right          RIGHT OUTER JOIN     Returns all rows from the right DataFrame + matched rows from the left.   Rarely used (prefer left join).
Full (Outer)   FULL OUTER JOIN      Returns all rows from both DataFrames.                                     Merging datasets with partial overlaps.
Anti           NOT IN / NOT EXISTS  Returns rows from the left DataFrame that do not exist in the right.       Finding missing records (e.g., no orders).
Semi           EXISTS               Returns left rows that exist in the right (no columns from the right).     Filtering without merging data.
Example:
# Sample DataFrames
employees = spark.createDataFrame([(1, "Alice", "IT"), (2, "Bob", "HR")], ["id", "name", "dept"])
departments = spark.createDataFrame([("IT", "Infra"), ("HR", "People")], ["dept", "team"])
# Inner Join
employees.join(departments, "dept", "inner").show()
dept id name team
IT 1 Alice Infra
HR 2 Bob People
2. Join Syntax
Basic Join:
df1.join(df2, df1.key == df2.key, "inner")
Multiple Conditions:
df1.join(df2,
(df1.key1 == df2.key1) &
(df1.key2 == df2.key2),
"inner")
3. Performance Considerations
a) Broadcast Join:
Use when one DataFrame is small (fits in executor memory). PySpark automatically uses broadcast
for tables <10MB (configurable).
c) Optimization Tips:
1. Filter Early: Reduce data size before joining.
df1.filter(df1.date > "2023-01-01").join(df2, "key")
2. Avoid Cartesian Products: Use crossJoin cautiously (computationally expensive).
3. Monitor Skew: Use .explain() to check join plan and partition distribution.
4. Practical Examples
a) Enriching Customer Data:
# Left join to retain all customers, even without orders
customers.join(orders, "customer_id", "left") \
.select("customer_id", "name", "order_amount")
Summary
• Join Types: Choose based on whether you need unmatched rows (left/anti) or full merges
(inner/full).
• Performance: Use broadcast for small tables, filter early, and monitor skew.
• Best Practice: Always alias/rename conflicting columns before joining.
Use Cases:
• Financial reporting (quarterly/yearly summaries).
• Survey data analysis (questions as columns).
• Time-series reshaping (hourly/daily metrics).
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Sample DataFrame
data = [
("2023", "Q1", 1500000),
("2023", "Q2", 1800000),
("2024", "Q1", 1600000),
("2023", "Q3", 2200000)
]
df = spark.createDataFrame(data, ["Year", "Quarter", "Revenue"])
Notes:
• Quarters missing for a given year (e.g., Q2/Q3 for 2024) become null by default.
• Use .na.fill(0) to replace nulls with zeros.
pivoted = df.groupBy("Region").pivot("Product").sum("Sales")
pivoted.na.fill(0).show()
# After fill: nulls (missing Region/Product combinations) become 0, e.g.:
# East 1000 0
Summary
• Pivoting reshapes data for summaries and reports.
• Static Pivots outperform dynamic ones by avoiding value detection.
• Optimize with filtering, predefined values, and null handling.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def reverse_string(s): # Plain Python function to wrap as a UDF
    return s[::-1] if s is not None else None
reverse_udf = udf(reverse_string, StringType()) # Return type must match the function's output
# Output:
hello olleh
world dlrow
3. Performance Considerations
Key Limitations:
• Serialization Overhead: Data must be serialized between JVM (Spark) and Python, adding
latency.
• No Catalyst Optimization: Spark’s query optimizer cannot optimize UDF logic.
• Speed: UDFs are 2–10x slower than native Spark functions.
When to Avoid UDFs:
• Simple operations achievable with pyspark.sql.functions (e.g., upper(), substring()).
• Large-scale data transformations (use Spark SQL or Scala UDFs instead).
import pandas as pd
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import IntegerType
@pandas_udf(IntegerType())
def vectorized_square(s: pd.Series) -> pd.Series:
return s ** 2 # Operate on entire Series instead of row-by-row
df.withColumn("squared", vectorized_square(col("value")))
Why Faster?:
• Processes data in chunks (Arrow-based serialization).
• Leverages Pandas’ vectorized operations.
from pyspark.sql.types import StructType, StructField, BooleanType, StringType
from pyspark.sql.functions import udf, col
schema = StructType([
StructField("is_valid", BooleanType()),
StructField("normalized", StringType())
])
@udf(schema)
def validate_email(email: str):
import re
pattern = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
is_valid = bool(email) and bool(re.match(pattern, email)) # Guard against null emails
normalized = email.strip().lower() if is_valid else None
return (is_valid, normalized)
df.withColumn("email_status", validate_email(col("email"))) \
.select("email", "email_status.*") \
.show()
5. Optimization Strategies
1. Prefer Built-in Functions:
Use native Spark operations (e.g., regexp_extract() instead of custom regex UDFs).
4. Tune Partitions:
Balance workload across executors:
spark.conf.set("spark.sql.shuffle.partitions", 200)
6. Real-World Examples
a) Address Standardization:
@udf(StringType())
def clean_address(address):
import re
return re.sub(r'\s+', ' ', address).strip().upper()
df.withColumn("clean_addr", clean_address(col("raw_address")))
df.withColumn("gc_count", count_gc_content(col("dna_sequence")))
7. Debugging Tips
1. Test Python Code Locally:
Validate logic outside Spark:
assert reverse_string("hello") == "olleh"
8. Production Considerations
Scala UDFs:
For mission-critical pipelines, rewrite UDFs in Scala:
spark.udf.register("scala_udf", (s: String) => s.reverse)
Benefits:
• Avoid Python-JVM serialization.
• Full Catalyst optimization.
Summary
• UDFs enable custom logic but incur performance costs.
• Pandas UDFs bridge the gap with vectorized operations.
• Avoid UDFs for simple tasks—always prioritize native Spark functions.
1. Output Modes
Concept: Control how data is written to storage when the target path/table already exists.
Mode           Behavior                                                        Use Case
append         Adds new rows to existing data.                                 Streaming pipelines, incremental updates (e.g., daily logs).
overwrite      Deletes existing data and replaces it entirely.                 Batch full-refresh (e.g., nightly ETL).
ignore         No operation if data/table already exists.                      Conditional writes (e.g., first-time setup).
errorIfExists  Throws an error if the target path/table exists (default).      Preventing accidental data loss.
Example:
# Append to existing Parquet dataset
df.write.mode("append").parquet("/data/sales")
2. File Formats
a) Columnar Formats (Optimized for Analytics):
• Parquet (Default):
df.write.parquet("/data/output/parquet") # Snappy compression by default
• Delta Lake:
df.write.format("delta").save("/data/output/delta") # Requires Delta Lake library
b) Row-Based Formats:
• CSV:
df.write.option("header", True).csv("/data/output/csv") # Add delimiter, escape chars
o Use Case: Interoperability with Excel/legacy systems.
• JSON:
df.write.json("/data/output/json") # Line-delimited JSON
o Use Case: Semi-structured data pipelines.
Performance Comparison:
Format Compression Schema Handling Read Speed Write Speed
Parquet Excellent Embedded Fastest Fast
Delta Excellent Embedded + History Fast Moderate
ORC Good Embedded Fast Fast
CSV None External (manual) Slowest Slow
JSON None Inferred on read Slow Moderate
3. Partitioning Strategies
Concept: Organize data into directories for faster query performance.
Basic Partitioning:
df.write.partitionBy("year", "month").parquet("/data/partitioned")
• Directory Structure:
Copy
/data/partitioned/year=2023/month=01/...
/data/partitioned/year=2023/month=02/...
Advanced Partitioning:
# Bucketing + Sorting (Hive Metastore required)
(df.write
    .bucketBy(50, "customer_id") # 50 buckets per partition
    .sortBy("transaction_date") # Sort within buckets
    .saveAsTable("sales_data")) # Managed table
Guidelines:
1. Partition Size: Aim for 100MB–1GB per partition.
2. Avoid Over-Partitioning: >10K partitions can degrade metadata performance.
3. Pre-Filter: Use .filter() before writing to skip unnecessary data.
4. Table Management
a) Managed Tables:
• Spark controls both data and metadata.
• Dropping the table deletes the data.
df.write.saveAsTable("managed_sales") # Stores data in Spark SQL warehouse
b) External Tables:
• Spark manages metadata only; data resides in external storage.
• Dropping the table only removes metadata.
df.write \
.option("path", "/external/data/sales").saveAsTable("external_sales")
5. Cloud-Specific Optimizations
AWS S3:
# IAM role-based authentication
spark.conf.set("spark.hadoop.fs.s3a.aws.credentials.provider",
"com.amazonaws.auth.InstanceProfileCredentialsProvider")
# Schema validation
written_schema = spark.read.parquet("/data/output").schema
assert written_schema == df.schema, "Schema mismatch!"
c) Multi-Cloud Write:
# Write to both S3 and GCS in parallel
df.write.parquet("s3a://bucket/data")
df.write.parquet("gs://bucket/data")
Summary
• Output Modes: Choose based on update strategy (append vs overwrite).
• Formats: Prefer Parquet/Delta for analytics; use CSV/JSON for compatibility.
• Partitioning: Balance file size and metadata overhead.
• Validation: Always verify writes in production pipelines.
Section 17: Spark SQL Integration in PySpark
1. SQL Interoperability
Concept: Bridge PySpark DataFrames and SQL for seamless querying.
a) Temporary Views:
Create session-scoped tables to query DataFrames using SQL.
# Create a DataFrame
data = [("Alice", 34), ("Bob", 45)]
df = spark.createDataFrame(data, ["name", "age"])
Name Age
Bob 45
Best Practice:
• Use temp views for ad-hoc analysis; avoid them in production pipelines (they're session-scoped).
• Prefer createOrReplaceGlobalTempView() for cross-session sharing in notebooks.
2. Hybrid Approach
Concept: Combine DataFrame API and SQL for readability and flexibility.
Example:
# Start with DataFrame
df = spark.read.parquet("/data/sales")
filtered = df.filter(col("region") == "West")
b) Window Functions:
Leverage SQL for ranking, running totals, and time-series analysis.
SELECT
  product,
  month,
  SUM(sales) OVER (PARTITION BY product ORDER BY month) AS running_total
FROM sales_data
Equivalent DataFrame Code:
from pyspark.sql.window import Window
from pyspark.sql.functions import sum as _sum
window_spec = Window.partitionBy("product").orderBy("month")
df.withColumn("running_total", _sum("sales").over(window_spec))
4. Catalog Integration
Concept: Programmatically manage databases, tables, and metadata.
a) Database Management:
# List databases
spark.catalog.listDatabases()
# [Database(name='default', description='default database', ...)]
b) Table Caching:
# Cache a table for faster access
spark.catalog.cacheTable("customers")
# Check if cached
spark.catalog.isCached("customers") # Returns True/False
c) Table Inspection:
# List columns of a table
for column in spark.catalog.listColumns("sales"):
print(f"Column: {column.name}, Type: {column.dataType}")
# Output:
# Column: order_id, Type: bigint
# Column: amount, Type: double
5. Performance Optimization
a) Adaptive Query Execution (AQE):
Automatically optimize shuffle partitions and joins at runtime.
# Enable AQE (Spark 3.0+)
spark.conf.set("spark.sql.adaptive.enabled", True)
b) Join Hints:
Force Spark to use specific join strategies.
# Broadcast join hint
spark.sql("SELECT /*+ BROADCAST(c) */ *
FROM orders o
INNER JOIN customers c
ON o.cust_id = c.id")
c) Materialized Views (Databricks):
Precompute aggregates for faster queries (a Databricks SQL/Delta Live Tables feature, not available in open-source Spark):
spark.sql("""
CREATE MATERIALIZED VIEW mv_product_sales
AS SELECT product_id, SUM(amount)
FROM sales
GROUP BY product_id
""")
6. Security Integration
a) Row-Level Security:
Restrict access using views or Delta Lake predicates.
# Example: Filter rows by user role
spark.sql("""
CREATE VIEW user_sales AS
SELECT * FROM sales
WHERE region = current_user()
""")
b) Column Masking:
Obfuscate sensitive data (supported in some distributions like Databricks):
spark.sql("""
CREATE MASK ssn_mask ON TABLE employees
FOR COLUMN ssn RETURN CONCAT('***-**-', RIGHT(ssn, 4))
""")
# Output includes:
# == Physical Plan ==
# *(1) Filter (isnotnull(amount) AND (amount > 1000.0))
# +- *(1) ColumnarToRow ...
Pro Tips:
• Use EXPLAIN CODEGEN to inspect generated Java code.
• Monitor SQL queries via Spark UI’s SQL tab.
8. Common Pitfalls
Issue               Solution
Temp View Lifetime  Use createGlobalTempView() for cross-session access.
SQL Injection       Sanitize inputs or use parameterized queries with spark.sql(); format strings carefully.
Metadata Overload   Avoid excessive temporary views; use spark.catalog.dropTempView() post-use.
Summary
• SQL + DataFrames: Combine for readability and flexibility.
• Catalog API: Manage databases/tables programmatically.
• Optimization: Use AQE, hints, and materialized views for performance.
• Security: Implement row/column-level protections via views or extensions.
Example Configuration:
configs = {
"spark.sql.shuffle.partitions": "200", # Default, adjust based on data size
"spark.executor.memory": "8g", # For 12G node
"spark.driver.memory": "4g", # Driver-side operations (e.g., collect())
"spark.dynamicAllocation.enabled": "true", # Scale executors as needed
"spark.sql.adaptive.coalescePartitions.enabled": "true" # Merge small shuffle partitions
}
builder = SparkSession.builder
for key, value in configs.items(): # .config(conf=...) expects a SparkConf, so set keys individually
    builder = builder.config(key, value)
spark = builder.getOrCreate()
2. Partition Optimization
a) Ideal Partition Size:
• Goal: Balance parallelism and overhead.
• Rule of Thumb: 128MB–1GB per partition.
• Calculate Partitions:
df_size_gb = df.rdd.map(lambda x: len(str(x))).sum() / (1024**3) # Approx. size
desired_partitions = max(1, int(df_size_gb / 0.5)) # 500MB partitions
df = df.repartition(desired_partitions)
3. Caching Strategies
When to Cache:
• Datasets reused in multiple actions (e.g., loops, iterative ML).
• Small lookup tables for frequent joins.
Storage Levels:
Level Description Use Case
MEMORY_ONLY Store in memory (no serialization). Fast access for small datasets.
MEMORY_AND_DISK Spill to disk if memory full. Large datasets (>60% of memory).
DISK_ONLY Store only on disk. Rarely used; avoid unless necessary.
Example:
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK) # Choose an explicit storage level
# Verify caching
if spark.catalog.isCached("table_name"):
print("Cached!")
# Release memory
df.unpersist()
4. Join Optimization Techniques
a) Broadcast Join:
• Use Case: Small table (≤10MB) + large table.
• Implementation:
from pyspark.sql.functions import broadcast
df_large.join(broadcast(df_small), "key")
b) Sort-Merge Join:
• Use Case: Large tables sorted on join keys.
• Implementation:
df1.hint("merge").join(df2, "key")
c) Bucket Join:
• Use Case: Frequent joins on the same column.
• Steps:
1. Pre-bucket tables:
df.write.bucketBy(50, "key").sortBy("key").saveAsTable("bucketed_table")
5. Debugging Tools
a) Execution Plans:
• Logical Plan: High-level transformations.
• Physical Plan: Actual execution steps (e.g., scan, filter, exchange).
df.explain(mode="extended") # Show all plans (parsed, analyzed, optimized, physical)
b) Spark UI:
• Access at http://<driver-ip>:4040.
• Key Features:
o Jobs Tab: Identify slow stages.
o Stages Tab: View shuffle read/write sizes.
o SQL Tab: Analyze query execution DAG.
6. File-Level Optimizations
a) File Size Control:
df.write.option("maxRecordsPerFile", 100000) # ~1GB at 1KB/record
.parquet("path")
b) Parquet Tuning:
df.write.option("parquet.block.size", 256 * 1024 * 1024) # 256MB row groups
.parquet("path")
c) Statistics Collection:
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS product_id, revenue")
• Enables better query planning (e.g., join order).
8. Common Pitfalls
• Too Many Partitions: Increases task scheduling overhead.
• Over-Caching: Wastes memory on rarely used data.
• Ignoring AQE: Failing to enable Adaptive Query Execution in Spark 3+.
Section 19: Delta Lake Integration in PySpark
2. Critical Operations
a) Creating a Delta Table:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("DeltaDemo") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
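With the session configured, writing a DataFrame in Delta format creates the table (path reused from the examples below):
df.write.format("delta").mode("overwrite").save("/data/delta/sales")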
b) Schema Evolution:
# Allow new columns to be added automatically
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", True)
3. Performance Optimizations
a) Z-Ordering:
Optimize data layout to speed up queries on specific columns.
• Benefit: Co-locates related data (e.g., all records for a customer) in the same files.
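Example (same table path as the compaction example below; the ZORDER column is illustrative):
spark.sql("OPTIMIZE delta.`/data/delta/sales` ZORDER BY (customer_id)")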
b) Compaction (Bin-Packing):
Merge small files to improve read efficiency.
spark.sql("OPTIMIZE delta.`/data/delta/sales`")
spark.sql("""
MERGE INTO delta.`/data/delta/sales` AS target
USING updates AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN
UPDATE SET target.amount = source.amount, target.status = source.status
WHEN NOT MATCHED THEN
INSERT (order_id, amount, status) VALUES (source.order_id, source.amount, source.status)
""")
# Output:
# +-------+-------------------+---------+----------------------+
# |version| timestamp|operation| operationParameters|
# +-------+-------------------+---------+----------------------+
# | 2|2024-03-20 12:30:45| MERGE |{predicate -> "..."} |
# +-------+-------------------+---------+----------------------+
b) Rollback:
# Restore to version 8
spark.sql("RESTORE delta.`/data/delta/sales` TO VERSION AS OF 8")
c) Retention Policies:
# Set defaults for new tables
spark.conf.set("spark.databricks.delta.properties.defaults.logRetentionDuration", "30 days")
spark.conf.set("spark.databricks.delta.properties.defaults.deletedFileRetentionDuration", "15 days")
8. Common Pitfalls
Issue Solution
Unmanaged Retention Always set logRetentionDuration to avoid unbounded log growth.
Schema Drift Use mergeSchema cautiously; prefer explicit schema evolution.
Small Files Schedule daily OPTIMIZE jobs.
Summary
• Delta Lake adds enterprise-grade reliability to data lakes.
• Key operations: Time travel, schema evolution, and MERGE for CDC.
• Optimize with Z-ordering, compaction, and CDF for streaming.
Section 20: Real-World Project Walkthrough
(spark.readStream
.schema(bronze_schema) # Block invalid data upfront
.option("header", "true")
.csv("s3://raw-bucket/")
.withColumn("_ingest_time", current_timestamp()) # Track ingestion time
.writeStream
.format("delta")
.option("checkpointLocation", "/checkpoints/bronze") # Exactly-once guarantee
.outputMode("append")
.start("s3://bronze-layer/sales"))
Key Features:
• Schema Enforcement: Rejects malformed records during ingestion.
• Checkpointing: Tracks progress and ensures fault tolerance.
• Audit Column: _ingest_time helps debug data freshness issues.
2. Silver Layer (Cleaned Data)
Goal: Clean, validate, and explode nested data for analysis.
Optimizations (see the sketch after this list):
• Filter Early: Remove invalid data before transformations.
• Explode Nested Data: Prepare for aggregation in the Gold layer.
• Schema Evolution: Automatically handle new fields (e.g., promo codes).
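A minimal silver-layer sketch under assumed names (bronze path from above; store_id and an items array are illustrative):
from pyspark.sql.functions import col, explode
(spark.readStream
    .format("delta")
    .load("s3://bronze-layer/sales")
    .filter(col("store_id").isNotNull()) # Filter early: drop invalid rows
    .withColumn("item", explode(col("items"))) # Explode nested line items
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/silver")
    .option("mergeSchema", "true") # Allow new fields (schema evolution)
    .outputMode("append")
    .start("s3://silver-layer/sales"))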
3. Gold Layer (Aggregated Metrics)
Goal: Compute daily store metrics with incremental updates.
# MERGE INTO for efficient upserts (existing dates get updated)
spark.sql("""
MERGE INTO gold.daily_store_metrics target
USING (
SELECT
store_id,
date(timestamp) as date,
SUM(item.price * item.quantity) as revenue,
COUNT(DISTINCT invoice_id) as transactions
FROM silver.sales
WHERE timestamp > current_date() - INTERVAL 1 DAY -- Incremental processing
GROUP BY store_id, date
) source
ON target.store_id = source.store_id AND target.date = source.date
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
Key Features:
• Incremental Processing: Only process the latest day’s data.
• Z-Ordering: Co-locates data by date for faster time-range queries.
• ACID Compliance: MERGE INTO ensures updates are atomic.
job_spec = {
"name": "nightly-sales-pipeline",
"tasks": [{
"task_key": "silver-layer",
"notebook_path": "/Production/Silver_Processing",
"timeout_seconds": 3600,
"retry_on_timeout": True # Auto-retry transient errors
}]
}
c) Post-Deployment Checks:
1. Validate Outputs:
spark.sql("ANALYZE TABLE gold.daily_store_metrics COMPUTE STATISTICS")
2. Retention Policies:
spark.sql("VACUUM gold.daily_store_metrics RETAIN 90 DAYS") # Comply with GDPR
3. Cost Control:
spark.sql("SET spark.databricks.delta.optimize.maxFileSize = 134217728") # 128MB files
5. Debugging Toolkit
a) Data Lineage:
# Track table evolution
display(spark.sql("DESCRIBE HISTORY gold.daily_store_metrics"))
b) Query History:
# Audit job performance
display(spark.sql("SELECT * FROM system.operational_metrics.jobs"))
c) Data Sampling:
# Debug without full scans (samplingRatio only affects schema inference, so sample after loading)
spark.read.format("delta").load("s3://gold-layer").sample(0.01)