PySpark Optimization Scenarios - Wipro

PySpark optimization scenarios, interview questions
www.prominentacademy.in
+91 98604 38743


Scenario
You're using Parquet from ADLS Gen2, but Spark reads
are slow. What could be wrong?

✅ Fixes:
Ensure hierarchical namespace (HNS) is enabled on the storage account
Optimize the read (see the sketch after the code block below) by:
Filtering on the partition column
Avoiding select * and projecting only the columns you need
Using .inputFiles() to verify what is actually being read
Enable ABFS fast directory listing:
python

spark.conf.set("fs.azure.enable.fastpath", "true")

Scenario
Why does your job randomly fail after long stages?
✅ Root Cause:
Long GC pauses, which can cause executor heartbeats to be missed
Driver or executor OOM
✅ Fixes:
Increase spark.network.timeout:
python

spark.conf.set("spark.network.timeout", "800s")

Break the job into smaller stages using .checkpoint() or .persist() (see the sketch below)
Scale executors horizontally (more, smaller executors)
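A rough sketch of breaking a long lineage with .persist() and .checkpoint(); heavy_transform and the checkpoint directory are placeholders, not part of the original scenario:
python

from pyspark import StorageLevel

# Placeholder checkpoint directory; use reliable storage such as ADLS/HDFS
spark.sparkContext.setCheckpointDir("abfss://tmp@account.dfs.core.windows.net/checkpoints")

stage1 = heavy_transform(df).persist(StorageLevel.MEMORY_AND_DISK)  # reuse without recomputing
stage1.count()                # materialize before the next long stage
stage2 = stage1.checkpoint()  # truncate lineage so retries do not recompute everything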

Scenario
How do you optimize joins on multiple keys when data is
huge?

✅ Tactics:
Use ZORDER BY (col1, col2, ...) for Delta Lake tables
Co-partition both DataFrames on the same columns
Broadcast one side if it is small enough to fit in memory
Repartition by the composite key:
python

df.repartition("col1", "col2")
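A sketch of the co-partition and broadcast tactics; df_left, df_right, and the small df_dim are placeholder DataFrames:
python

from pyspark.sql.functions import broadcast

# Co-partition both sides on the same composite key before joining
left = df_left.repartition("col1", "col2")
right = df_right.repartition("col1", "col2")
joined = left.join(right, ["col1", "col2"])

# If one side is small enough to fit in executor memory, broadcast it instead
joined_small = df_left.join(broadcast(df_dim), ["col1", "col2"])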

Scenario
Using explode() on nested arrays causes memory pressure.
How do you fix it?

✅ Solution:
Avoid explode() when it is not needed; use posexplode_outer() selectively
Use rdd.mapPartitions() (or flatMap() at the RDD level) for memory-safe, partition-at-a-time transformations
Apply filtering before the explode to reduce the output size (see the sketch below)
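A minimal sketch of filtering before exploding and using posexplode_outer(); order_id and the array column items are placeholder names:
python

from pyspark.sql.functions import col, posexplode_outer, size

# Filter first so only non-empty arrays are exploded, keeping the output small
trimmed = (
    df.filter(size(col("items")) > 0)
      .select("order_id", posexplode_outer(col("items")).alias("pos", "item"))
)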

Scenario
You're writing streaming data into a Delta table and sometimes get commit errors. How do you make it robust?
✅ Tuning Tips:
Use checkpointing so the stream can restart from the last successful write
Enable the ignoreChanges / mergeSchema options if the schema evolves
Use trigger(once=True) for micro-batch-style pipelines
Separate streaming writes by partition to reduce commit/file contention (see the sketch below)
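A sketch of a more robust streaming write; events_df, the paths, and the event_date partition column are assumptions for illustration:
python

query = (
    events_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events")  # restart from the last successful commit
    .option("mergeSchema", "true")                            # tolerate additive schema changes
    .partitionBy("event_date")                                # separate writes by partition
    .trigger(once=True)                                       # micro-batch style run
    .start("/mnt/delta/events")
)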

Scenario
When should you use repartition() vs coalesce()? What are the performance implications?
✅ Guidelines:
repartition(n) → full shuffle (expensive); use it to increase the number of partitions or rebalance data
coalesce(n) → narrow transformation; efficient for decreasing the number of partitions
Use repartition() before joins to balance the data
Use coalesce() before writes to control the output file count (see the sketch below)
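A short sketch of both guidelines together; fact_df, dim_df, customer_id, and output_path are placeholders:
python

# Full shuffle to rebalance on the join key before a large join
balanced = fact_df.repartition(200, "customer_id")
joined = balanced.join(dim_df, "customer_id")

# Narrow transformation to reduce the number of output files before writing
joined.coalesce(10).write.mode("overwrite").parquet(output_path)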

Scenario
Your job spends most of its time in Java garbage collection (for example, "GC overhead limit exceeded" errors). What does this mean and how do you fix it?
✅ Explanation:
Too much time is being spent on Java garbage collection
This indicates memory pressure or a memory leak in the job

✅ Fix:
Increase executor memory
Reduce the number of cached DataFrames
Avoid UDFs that generate large object graphs
Use persist(StorageLevel.DISK_ONLY) if memory is tight (see the sketch below)
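A small sketch of the disk-only persistence option, using a placeholder intermediate DataFrame:
python

from pyspark import StorageLevel

# Keep reused data on disk so it does not add to heap and GC pressure
intermediate = intermediate.persist(StorageLevel.DISK_ONLY)
intermediate.count()      # materialize the persisted data
# ... downstream steps reuse `intermediate` ...
intermediate.unpersist()  # release it once it is no longer needed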

Scenario
Spark UI shows that most executors are idle while a few are overloaded. What's going wrong and how can you fix it?
✅ Root Cause:
Data skew: one or a few partitions hold a disproportionate amount of the data
✅ Solutions:
Repartition using repartition(n) to rebalance
Use salting for skewed keys (see the sketch after the code below)
Enable adaptive query execution (AQE):
python

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

Scenario
You don’t know the optimal number of partitions in
advance. How do you dynamically set them based on data
size?

✅ Dynamic Partitioning Logic:
python

# Rough size estimate from row string lengths (a heuristic, not the exact file size)
file_size_gb = spark.read.parquet(path).rdd.map(lambda x: len(str(x))).sum() / (1024 ** 3)
num_partitions = max(1, int(file_size_gb * 4))  # ~256 MB per partition
df = df.repartition(num_partitions)

Tune further based on the memory profile and executor cores.

Scenario
Built-in partitioning is not optimal. How do you implement a custom partitioner in PySpark?
✅ Solution:
Use rdd.partitionBy(num_partitions, custom_partitioner) with a key-value RDD
Or define a custom hash partitioner inline:
python

rdd = rdd.partitionBy(n, lambda key: hash(key) % n)
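A slightly fuller sketch with a key-value RDD; the customer_id key is a placeholder:
python

n = 16  # target number of partitions, illustrative

# Build a pair RDD keyed by the field you want to partition on
pairs = df.rdd.map(lambda row: (row["customer_id"], row))
partitioned = pairs.partitionBy(n, lambda key: hash(key) % n)

# Inspect how evenly the keys landed across partitions
print(partitioned.glom().map(len).collect())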

Scenario
Your cluster has 64 GB per executor, but Spark still spills
to disk. What could be the issue?
✅ Possible Causes:
Insufficient memory for shuffle buffers
Large skewed joins or aggregations
Unbalanced partitioning
✅ Fixes:
Tune shuffle spill compression and in-flight fetch size:
python

spark.conf.set("spark.shuffle.spill.compress", "true")
spark.conf.set("spark.reducer.maxSizeInFlight", "48m")

Tune the memory fraction:
python

spark.conf.set("spark.memory.fraction", "0.8")

Scenario
You call .cache() on a DataFrame, but subsequent actions
are still slow. Why?
✅ Gotchas:
.cache() is lazy: with no action, nothing is actually cached
Insufficient executor memory → cached partitions get evicted
You cached after the transformations instead of before them
✅ Fix:

python

df.cache()
df.count() # Trigger materialization

#AzureSynapse #DataEngineering
#InterviewPreparation #JobReady
#MockInterviews #Deloitte #CareerSuccess
#ProminentAcademy

❌ Think your skills are enough?
Think again: these data engineer scenario-based questions could cost you your data engineering job.
In recent interviews at big MNCs, our students faced scenario-based data engineering questions, and many candidates struggled to answer them correctly. These questions are designed to test your real-world knowledge and your ability to solve complex data engineering problems.

Unfortunately, many students failed to answer these questions confidently. The truth is, preparation is key, and that's where Prominent Academy comes in!

We specialize in preparing you for Spark and data engineering interviews by:

Offering scenario-based mock interviews
Providing hands-on training on data engineering features
Optimizing your resume & LinkedIn profile
Giving personalized interview coaching to ensure you're job-ready

Don't leave your future to chance!

📞 Call us at +91 98604 38743 and get the interview prep you need to succeed.
