Apache Spark - Optimization Techniques
What are Spark optimization techniques with real-time use cases and examples?
1. **Data Partitioning:**
- Use Case: Distributing data based on a specific column to optimize joins and filters on that column.
• Scenario: Suppose you have a large dataset with customer information, and
you often filter data based on customer IDs. Partition the data by customer
ID to improve query performance.
• Technique: Use the repartition transformation to redistribute data in memory
by a specific column, or the writer's partitionBy option to lay the files out
by that column on disk (a write-time sketch follows the example below).
```python
partitioned_data = sales_data.repartition("customer_id")
```
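The prose above also mentions partitionBy, which applies at write time rather than in memory. A minimal sketch, using an illustrative stand-in for sales_data and a hypothetical output path:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()

# Illustrative stand-in for the sales_data DataFrame used above.
sales_data = spark.createDataFrame(
    [(1, 100.0), (2, 250.0), (1, 75.0)],
    ["customer_id", "amount"],
)

# repartition("customer_id") redistributes rows in memory; partitionBy at write
# time lays the files out by customer_id so later reads can prune partitions.
sales_data.write.mode("overwrite").partitionBy("customer_id").parquet("/tmp/sales_by_customer")
```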
2. **Broadcasting:**
- Use Case: Efficiently joining a small lookup table with a larger fact table.
• Technique: Use the broadcast function on the smaller DataFrame before joining to avoid
unnecessary shuffling.
- Example: Broadcasting a small product lookup table to improve performance when
joining with a large sales transaction table.
```python
from pyspark.sql.functions import broadcast

# product_lookup is the small lookup table from the example above; broadcasting it
# ships a copy to every executor so the large sales table is not shuffled.
# (The join key "product_id" is illustrative.)
enriched_sales = sales_data.join(broadcast(product_lookup), "product_id")
```
3. **Caching:**
- Use Case: Reusing an intermediate DataFrame across several actions without recomputing it each time.
```python
cached_data = data_to_cache.cache()
```
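If the default storage level used by cache() is not a good fit, persist() lets you pick one explicitly, and unpersist() releases it once the repeated scans are done. A minimal sketch with an illustrative DataFrame:
```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-sketch").getOrCreate()
data_to_cache = spark.range(1_000_000)  # stand-in for a DataFrame scanned repeatedly

# Keep it in memory, spilling to disk if it does not fit.
data_to_cache.persist(StorageLevel.MEMORY_AND_DISK)

data_to_cache.count()  # first action materializes the cached data
data_to_cache.count()  # later scans read from the cache

# Free the storage when the repeated scans are finished.
data_to_cache.unpersist()
```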
4. **Predicate Pushdown:**
- Use Case: Pushing filters and column selection down to the Parquet reader to reduce I/O.
• Scenario: While reading data from a source like Parquet, apply your filters and
read only the columns necessary for your analysis so Spark can push them down
to the file scan.
• Technique: Use the filter and select transformations directly after the read so
the Parquet reader skips non-matching row groups and unused columns.
- Example: Selectively reading only the relevant columns and matching rows from a Parquet file.
```python
# The column selection and the filter are both pushed down to the Parquet scan.
selected_columns_data = (spark.read.parquet("data.parquet")
                         .select("col1", "col2")
                         .filter("col1 > 100"))  # illustrative predicate
```
5. **Coalescing Partitions:**
- Use Case: Reducing the number of partitions, for example before writing output, without triggering a full shuffle.
```python
coalesced_data = large_data.coalesce(4)
```
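For contrast with repartition, a minimal sketch of the two partition-count operations; the DataFrame and partition counts here are illustrative:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-sketch").getOrCreate()
large_data = spark.range(1_000_000)  # stand-in for the large DataFrame above

# coalesce(4) merges existing partitions without a full shuffle (shrink only).
fewer = large_data.coalesce(4)

# repartition(200) performs a full shuffle and can grow or shrink the count.
more = large_data.repartition(200)

print(fewer.rdd.getNumPartitions(), more.rdd.getNumPartitions())
```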
6. **Avoiding Shuffling:**
- Use Case: Minimizing data shuffling during aggregations.
• Scenario: When performing joins or aggregations, minimize data shuffling
by using transformations like reduceByKey instead of groupByKey.
• Technique: Use transformations that perform local aggregations before
shuffling data.
```python
# Assumes columns "key" and "value"; reduceByKey combines values within each partition before shuffling.
aggregated_data = (data.rdd.map(lambda r: (r["key"], r["value"]))
                   .reduceByKey(lambda x, y: x + y).toDF(["key", "value"]))
```
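On the DataFrame API, groupBy().agg() already performs a partial (map-side) aggregation before the shuffle, so dropping to the RDD API is often unnecessary. A minimal sketch with illustrative key/value columns:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.appName("agg-sketch").getOrCreate()
data = spark.createDataFrame([("a", 1), ("a", 2), ("b", 5)], ["key", "value"])

# Catalyst plans this as a partial HashAggregate per partition followed by a
# final aggregate after the shuffle, which is the same pattern reduceByKey uses.
aggregated_df = data.groupBy("key").agg(sum_("value").alias("total"))
aggregated_df.show()
```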
7. **Column Pruning:**
- Use Case: Selecting only the columns you need as early as possible so less data is scanned, shuffled, and cached.
```python
selected_columns_data = full_data.select("col1", "col2")
```
8. **UDFs Optimization:**
- Use Case: Choosing built-in functions over UDFs for efficiency.
• Scenario: While using User-Defined Functions (UDFs), prefer built-in Spark
functions whenever possible, as they are optimized for distributed
processing.
• Technique: Use Spark's built-in functions such as when, concat, etc. (applied with
withColumn) instead of UDFs for common operations.
```python
from pyspark.sql.functions import col, when

# Built-in when/otherwise instead of an equivalent Python UDF (column names are illustrative).
labeled_data = data.withColumn("price_band", when(col("price") > 100, "high").otherwise("low"))
```
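When a UDF is genuinely unavoidable, a vectorized pandas UDF (Spark 3.x, requires pyarrow) is usually much cheaper than a row-at-a-time Python UDF. A minimal sketch; the column name and tax rate are illustrative:
```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf

spark = SparkSession.builder.appName("pandas-udf-sketch").getOrCreate()
data = spark.createDataFrame([(1, 120.0), (2, 80.0)], ["id", "price"])

@pandas_udf("double")
def with_tax(price: pd.Series) -> pd.Series:
    # Operates on whole Arrow batches instead of one Python object per row.
    return price * 1.08

data.withColumn("price_with_tax", with_tax(col("price"))).show()
```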
9. **Dynamic Allocation:**
- Use Case: Dynamically adjusting the number of executors based on workload.
• Scenario: In a dynamic workload scenario, allow Spark to automatically
adjust the number of executors based on resource availability.
• Technique: Enable dynamic allocation in the Spark configuration.
```python
from pyspark import SparkConf

conf = SparkConf().set("spark.dynamicAllocation.enabled", "true")
```
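Dynamic allocation is usually paired with executor bounds and a way to keep shuffle data available when executors are removed. A sketch of the related settings; the values are illustrative:
```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "2")
        .set("spark.dynamicAllocation.maxExecutors", "20")
        # Tracks shuffle files so executors can be released safely without an
        # external shuffle service (Spark 3.0+).
        .set("spark.dynamicAllocation.shuffleTracking.enabled", "true"))

spark = SparkSession.builder.config(conf=conf).getOrCreate()
```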
10. **Bucketing and Sorting:**
- Use Case: Pre-organizing data on a frequently joined or filtered column for efficient storage and retrieval.
• Scenario: When writing data to Hive or Parquet, use bucketing and sorting
for efficient data storage and retrieval.
• Technique: Use the bucketBy and sortBy options while writing data (bucketBy
requires saving as a table).
```python
(bucketed_sorted_data.write
    .bucketBy(10, "column_name")
    .sortBy("column_name")
    .format("parquet")
    .saveAsTable("bucketed_sorted_table"))
```
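Reading the bucketed table back lets Spark avoid a shuffle when aggregating or joining on the bucket column. A small sketch, assuming the table name used above:
```python
bucketed = spark.table("bucketed_sorted_table")

# Grouping on the bucket column can skip the exchange (shuffle) step;
# check the physical plan to confirm.
bucketed.groupBy("column_name").count().explain()
```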
Each technique has its own use case and benefits, but the actual effectiveness depends
on your specific data and workload. Always profile and monitor your Spark jobs to
identify bottlenecks and apply the appropriate optimization strategies.
Summary:
- Cost per core per hour
- Establish a baseline utilization threshold per workload for the cluster (dev/prod), e.g. 70%
- Data skipping
- Hive partitions
- Bucketing
- spark.sql.files.maxPartitionBytes
- coalesce(n) to shrink the number of partitions
- df.write.option("maxRecordsPerFile", N) (see the configuration sketch below)
- Increase parallelism
- UDFs
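A minimal sketch of the two size knobs from the list above; the values and output path are illustrative assumptions:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("size-knobs-sketch").getOrCreate()

# Upper bound on how much file data is packed into one input partition at read time.
spark.conf.set("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)  # 128 MB

# Upper bound on rows per output file at write time.
df = spark.range(1_000_000)  # stand-in DataFrame
df.write.mode("overwrite").option("maxRecordsPerFile", 100_000).parquet("/tmp/size_knobs_demo")
```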
Advanced Optimizations:
- Persistence (memory, disk) when you have repetitive data scans; unpersist after the job
- Join Optimizations (see the join-hint sketch below)
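One concrete join optimization is giving the planner an explicit hint. A minimal sketch with made-up DataFrames and join key:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-hint-sketch").getOrCreate()
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "x"])
df2 = spark.createDataFrame([(1, "u"), (2, "v")], ["key", "y"])

# Force a broadcast hash join when one side is known to be small.
joined_broadcast = df1.join(broadcast(df2), "key")

# Or ask for a sort-merge join explicitly (hint name available in Spark 3.0+).
joined_merge = df1.join(df2.hint("merge"), "key")
```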
Risks:
- Collecting a result larger than spark.driver.maxResultSize back to the driver
- Skew joins: salt the hot keys so a single partition does not receive most of the data (see the salting sketch below), e.g.
  df.withColumn('salt', lit(salt_val)).groupBy('city', 'state', 'salt').agg(avg('sales')).drop('salt').orderBy(col('city').desc())
- Point-in-interval range join: the predicate specifies a value in one relation that is
between two values from the other relation
- If distincts are required, put them in the right place: use dropDuplicates() before the join operation
and before groupBy
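A slightly fuller sketch of the salting idea from the skew-join note above; the column names, data, and salt range are illustrative assumptions:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, floor, rand, sum as sum_

spark = SparkSession.builder.appName("salting-sketch").getOrCreate()

# Assumed shape: one row per sale with city, state, and sales amount.
df = spark.createDataFrame(
    [("NYC", "NY", 10.0), ("NYC", "NY", 20.0), ("LA", "CA", 5.0)],
    ["city", "state", "sales"],
)

n_salts = 8  # number of salt buckets; tune to the degree of skew

# 1) Spread each hot (city, state) key across n_salts partial groups.
partial = (df.withColumn("salt", floor(rand() * n_salts))
             .groupBy("city", "state", "salt")
             .agg(sum_("sales").alias("sum_sales"), count("*").alias("cnt")))

# 2) Re-aggregate without the salt to get the final average per key.
result = (partial.groupBy("city", "state")
                 .agg((sum_("sum_sales") / sum_("cnt")).alias("avg_sales"))
                 .orderBy(col("avg_sales").desc()))
result.show()
```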
UDF penalties:
- Prefer built-in functions from org.apache.spark.sql.functions (pyspark.sql.functions in PySpark) over UDFs