Apache Spark - Optimization Techniques
What are Spark optimization techniques with real-time use cases and examples?
1. **Data Partitioning:**
- Use Case: Distributing data based on a specific column to optimize joins and filters on that column.
• Scenario: Suppose you have a large dataset with customer information, and
you often filter data based on customer IDs. Partition the data by customer
ID to improve query performance.
• Technique: Use the repartition transformation to redistribute data in memory
by a specific column, or the writer's partitionBy option to lay the files out
by that column on disk (a write-time sketch follows the example below).
```python
partitioned_data = sales_data.repartition("customer_id")
```
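The prose above also mentions partitionBy, which applies at write time rather than in memory. A minimal sketch, using an illustrative stand-in for sales_data and a hypothetical output path:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()

# Illustrative stand-in for the sales_data DataFrame used above.
sales_data = spark.createDataFrame(
    [(1, 100.0), (2, 250.0), (1, 75.0)],
    ["customer_id", "amount"],
)

# repartition("customer_id") redistributes rows in memory; partitionBy at write
# time lays the files out by customer_id so later reads can prune partitions.
sales_data.write.mode("overwrite").partitionBy("customer_id").parquet("/tmp/sales_by_customer")
```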
2. **Broadcasting:**
- Use Case: Efficiently joining a small lookup table with a larger fact table.
• Technique: Use the broadcast function on the smaller DataFrame before joining to avoid
unnecessary shuffling.
- Example: Broadcasting a small product lookup table to improve performance when
joining with a large sales transaction table.
```python
from pyspark.sql.functions import broadcast

# product_lookup is the small lookup table from the example above; broadcasting it
# ships a copy to every executor so the large sales table is not shuffled.
# (The join key "product_id" is illustrative.)
enriched_sales = sales_data.join(broadcast(product_lookup), "product_id")
```
3. **Caching:**
- Use Case: Reusing an intermediate DataFrame across several actions without recomputing it each time.
```python
cached_data = data_to_cache.cache()
```
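If the default storage level used by cache() is not a good fit, persist() lets you pick one explicitly, and unpersist() releases it once the repeated scans are done. A minimal sketch with an illustrative DataFrame:
```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-sketch").getOrCreate()
data_to_cache = spark.range(1_000_000)  # stand-in for a DataFrame scanned repeatedly

# Keep it in memory, spilling to disk if it does not fit.
data_to_cache.persist(StorageLevel.MEMORY_AND_DISK)

data_to_cache.count()  # first action materializes the cached data
data_to_cache.count()  # later scans read from the cache

# Free the storage when the repeated scans are finished.
data_to_cache.unpersist()
```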
4. **Predicate Pushdown:**
- Use Case: Pushing filters and column selection down to the Parquet reader to reduce I/O.
• Scenario: While reading data from a source like Parquet, apply your filters and
read only the columns necessary for your analysis so Spark can push them down
to the file scan.
• Technique: Use the filter and select transformations directly after the read so
the Parquet reader skips non-matching row groups and unused columns.
- Example: Selectively reading only the relevant columns and matching rows from a Parquet file.
```python
# The column selection and the filter are both pushed down to the Parquet scan.
selected_columns_data = (spark.read.parquet("data.parquet")
                         .select("col1", "col2")
                         .filter("col1 > 100"))  # illustrative predicate
```
5. **Coalescing Partitions:**
- Use Case: Reducing the number of partitions, for example before writing output, without triggering a full shuffle.
```python
coalesced_data = large_data.coalesce(4)
```
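For contrast with repartition, a minimal sketch of the two partition-count operations; the DataFrame and partition counts here are illustrative:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-sketch").getOrCreate()
large_data = spark.range(1_000_000)  # stand-in for the large DataFrame above

# coalesce(4) merges existing partitions without a full shuffle (shrink only).
fewer = large_data.coalesce(4)

# repartition(200) performs a full shuffle and can grow or shrink the count.
more = large_data.repartition(200)

print(fewer.rdd.getNumPartitions(), more.rdd.getNumPartitions())
```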
6. **Avoiding Shuffling:**
- Use Case: Minimizing data shuffling during aggregations.
• Scenario: When performing joins or aggregations, minimize data shuffling
by using transformations like reduceByKey instead of groupByKey.
• Technique: Use transformations that perform local aggregations before
shuffling data.
```python
# Assumes columns "key" and "value"; reduceByKey combines values within each partition before shuffling.
aggregated_data = (data.rdd.map(lambda r: (r["key"], r["value"]))
                   .reduceByKey(lambda x, y: x + y).toDF(["key", "value"]))
```
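On the DataFrame API, groupBy().agg() already performs a partial (map-side) aggregation before the shuffle, so dropping to the RDD API is often unnecessary. A minimal sketch with illustrative key/value columns:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.appName("agg-sketch").getOrCreate()
data = spark.createDataFrame([("a", 1), ("a", 2), ("b", 5)], ["key", "value"])

# Catalyst plans this as a partial HashAggregate per partition followed by a
# final aggregate after the shuffle, which is the same pattern reduceByKey uses.
aggregated_df = data.groupBy("key").agg(sum_("value").alias("total"))
aggregated_df.show()
```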
7. **Column Pruning:**
- Use Case: Selecting only the columns you need as early as possible so less data is scanned, shuffled, and cached.
```python
selected_columns_data = full_data.select("col1", "col2")
```
8. **UDFs Optimization:**
- Use Case: Choosing built-in functions over UDFs for efficiency.
• Scenario: While using User-Defined Functions (UDFs), prefer built-in Spark
functions whenever possible, as they are optimized for distributed
processing.
• Technique: Use Spark's built-in functions such as when, concat, etc. (applied with
withColumn) instead of UDFs for common operations.
```python
from pyspark.sql.functions import col, when

# Built-in when/otherwise instead of an equivalent Python UDF (column names are illustrative).
labeled_data = data.withColumn("price_band", when(col("price") > 100, "high").otherwise("low"))
```
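When a UDF is genuinely unavoidable, a vectorized pandas UDF (Spark 3.x, requires pyarrow) is usually much cheaper than a row-at-a-time Python UDF. A minimal sketch; the column name and tax rate are illustrative:
```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf

spark = SparkSession.builder.appName("pandas-udf-sketch").getOrCreate()
data = spark.createDataFrame([(1, 120.0), (2, 80.0)], ["id", "price"])

@pandas_udf("double")
def with_tax(price: pd.Series) -> pd.Series:
    # Operates on whole Arrow batches instead of one Python object per row.
    return price * 1.08

data.withColumn("price_with_tax", with_tax(col("price"))).show()
```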
9. **Dynamic Allocation:**
- Use Case: Dynamically adjusting the number of executors based on workload.
• Scenario: In a dynamic workload scenario, allow Spark to automatically
adjust the number of executors based on resource availability.
• Technique: Enable dynamic allocation in the Spark configuration.
```python
from pyspark import SparkConf

conf = SparkConf().set("spark.dynamicAllocation.enabled", "true")
```
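Dynamic allocation is usually paired with executor bounds and a way to keep shuffle data available when executors are removed. A sketch of the related settings; the values are illustrative:
```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "2")
        .set("spark.dynamicAllocation.maxExecutors", "20")
        # Tracks shuffle files so executors can be released safely without an
        # external shuffle service (Spark 3.0+).
        .set("spark.dynamicAllocation.shuffleTracking.enabled", "true"))

spark = SparkSession.builder.config(conf=conf).getOrCreate()
```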
10. **Bucketing and Sorting:**
- Use Case: Pre-organizing data on a frequently joined or filtered column for efficient storage and retrieval.
• Scenario: When writing data to Hive or Parquet, use bucketing and sorting
for efficient data storage and retrieval.
• Technique: Use the bucketBy and sortBy options while writing data (bucketBy
requires saving as a table).
```python
(bucketed_sorted_data.write
    .bucketBy(10, "column_name")
    .sortBy("column_name")
    .format("parquet")
    .saveAsTable("bucketed_sorted_table"))
```
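Reading the bucketed table back lets Spark avoid a shuffle when aggregating or joining on the bucket column. A small sketch, assuming the table name used above:
```python
bucketed = spark.table("bucketed_sorted_table")

# Grouping on the bucket column can skip the exchange (shuffle) step;
# check the physical plan to confirm.
bucketed.groupBy("column_name").count().explain()
```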
Each technique has its own use case and benefits, but the actual effectiveness depends
on your specific data and workload. Always profile and monitor your Spark jobs to
identify bottlenecks and apply the appropriate optimization strategies.
Summary:
- Cost per core per hour
- Establish a baseline utilization threshold per workload for the cluster (dev/prod), e.g. 70%
- Data skipping
- Hive partitions
- Bucketing
- spark.sql.files.maxPartitionBytes
- coalesce(n) to shrink the number of partitions
- df.write.option("maxRecordsPerFile", N) (see the configuration sketch below)
- Increase parallelism
- UDFs
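A minimal sketch of the two size knobs from the list above; the values and output path are illustrative assumptions:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("size-knobs-sketch").getOrCreate()

# Upper bound on how much file data is packed into one input partition at read time.
spark.conf.set("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)  # 128 MB

# Upper bound on rows per output file at write time.
df = spark.range(1_000_000)  # stand-in DataFrame
df.write.mode("overwrite").option("maxRecordsPerFile", 100_000).parquet("/tmp/size_knobs_demo")
```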
Advanced Optimizations:
- Persistence (memory, disk) when you have repetitive data scans; unpersist after the job
- Join Optimizations (see the join-hint sketch below)
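One concrete join optimization is giving the planner an explicit hint. A minimal sketch with made-up DataFrames and join key:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-hint-sketch").getOrCreate()
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "x"])
df2 = spark.createDataFrame([(1, "u"), (2, "v")], ["key", "y"])

# Force a broadcast hash join when one side is known to be small.
joined_broadcast = df1.join(broadcast(df2), "key")

# Or ask for a sort-merge join explicitly (hint name available in Spark 3.0+).
joined_merge = df1.join(df2.hint("merge"), "key")
```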
Risks:
- Collecting a result larger than spark.driver.maxResultSize back to the driver
- Skew joins: salt the hot keys so a single partition does not receive most of the data (see the salting sketch below), e.g.
  df.withColumn('salt', lit(salt_val)).groupBy('city', 'state', 'salt').agg(avg('sales')).drop('salt').orderBy(col('city').desc())
- Point-in-interval range join: the predicate specifies a value in one relation that is
between two values from the other relation
- If distincts are required, put them in the right place: use dropDuplicates() before the join operation
and before groupBy
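A slightly fuller sketch of the salting idea from the skew-join note above; the column names, data, and salt range are illustrative assumptions:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, floor, rand, sum as sum_

spark = SparkSession.builder.appName("salting-sketch").getOrCreate()

# Assumed shape: one row per sale with city, state, and sales amount.
df = spark.createDataFrame(
    [("NYC", "NY", 10.0), ("NYC", "NY", 20.0), ("LA", "CA", 5.0)],
    ["city", "state", "sales"],
)

n_salts = 8  # number of salt buckets; tune to the degree of skew

# 1) Spread each hot (city, state) key across n_salts partial groups.
partial = (df.withColumn("salt", floor(rand() * n_salts))
             .groupBy("city", "state", "salt")
             .agg(sum_("sales").alias("sum_sales"), count("*").alias("cnt")))

# 2) Re-aggregate without the salt to get the final average per key.
result = (partial.groupBy("city", "state")
                 .agg((sum_("sum_sales") / sum_("cnt")).alias("avg_sales"))
                 .orderBy(col("avg_sales").desc()))
result.show()
```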
UDF penalties:
- Prefer built-in functions from org.apache.spark.sql.functions (pyspark.sql.functions in PySpark) over UDFs