PySpark Optimization Scenarios - Wipro
Optimization scenarios, interview questions
www.prominentacademy.in
✅ Fixes:
Ensure hierarchical namespace (HNS) is enabled
Optimize by:
Filtering with partition column
Minimizing select *
Leveraging .inputFiles() to verify what's being read
Enable ABFS fast directory listing:
python
spark.conf.set("fs.azure.enable.fastpath", "true")
Scenario
Why does your job randomly fail after long stages?
✅ Root Cause:
Long GC pauses or executor heartbeats missed
Driver or executor OOM
✅ Fixes:
Increase spark.network.timeout:
python
spark.conf.set("spark.network.timeout", "800s")
Scenario
How do you optimize joins on multiple keys when data is
huge?
✅ Tactics:
Use ZORDER BY (col1, col2, ...) for Delta Lake
Co-partition both DataFrames on the same columns
Broadcast one side if it's small and fits memory
Repartition by composite key:
python
df.repartition("col1", "col2")
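For the broadcast tactic, a minimal sketch; large_df, small_df and the keys col1, col2 are placeholders, and broadcasting only makes sense if the small side comfortably fits in memory:
python
from pyspark.sql.functions import broadcast

# Ship the small side to every executor so the multi-key join avoids shuffling the large side
joined = large_df.join(broadcast(small_df), on=["col1", "col2"], how="inner")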
Scenario
Using explode() on nested arrays causes memory pressure.
How do you fix it?
✅ Solution:
Avoid explode() when it is not needed; use posexplode_outer() selectively
Drop to the RDD API (flatMap() or mapPartitions()) for memory-safe, partition-at-a-time processing
Apply filtering before the explode to shrink the output (see the sketch below)
Scenario
Spark reports that most of your task time is being spent in garbage collection. What does this mean and how do you fix it?
✅ Explanation:
Too much time spent on Java Garbage Collection
Indicates memory pressure or memory leak in job
✅ Fix:
Increase executor memory
Reduce number of cached DataFrames
Avoid UDFs that generate large object graphs
Use persist(StorageLevel.DISK_ONLY) if memory is tight (see the sketch below).
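A minimal sketch of the disk-only option (the DataFrame name is a placeholder):
python
from pyspark import StorageLevel

# Keep cached blocks entirely out of executor heap memory
df.persist(StorageLevel.DISK_ONLY)
df.count()  # action to materialize the persisted data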
Scenario
A few join tasks run far longer than the rest because one key is heavily skewed. How do you handle it?
✅ Fix:
Enable Adaptive Query Execution with skew-join handling:
python
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
Scenario
You don’t know the optimal number of partitions in
advance. How do you dynamically set them based on data
size?
✅ Approach:
Derive num_partitions from the estimated input size and a target partition size (see the sketch below):
python
df = df.repartition(num_partitions)
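A minimal sketch, assuming the DataFrame is file-backed so its size can be read from the underlying files; the ~128 MB target per partition is an assumed rule of thumb, not a value from the original:
python
TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # assumed target of ~128 MB per partition

# Sum the sizes of the files behind the DataFrame via the JVM Hadoop FileSystem API
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

total_bytes = 0
for file_path in df.inputFiles():
    path = jvm.org.apache.hadoop.fs.Path(file_path)
    fs = path.getFileSystem(hadoop_conf)
    total_bytes += fs.getFileStatus(path).getLen()

num_partitions = max(1, int(total_bytes // TARGET_PARTITION_BYTES))
df = df.repartition(num_partitions)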
Scenario
Your cluster has 64 GB per executor, but Spark still spills
to disk. What could be the issue?
✅ Possible Causes:
Insufficient memory for shuffle buffers
Large skewed joins or aggregations
Unbalanced partitioning
✅ Fixes:
Tune shuffle and memory settings:
python
spark.conf.set("spark.shuffle.spill.compress", "true")
spark.conf.set("spark.reducer.maxSizeInFlight", "96m")  # default is 48m
spark.conf.set("spark.memory.fraction", "0.8")  # default is 0.6
Scenario
You call .cache() on a DataFrame, but subsequent actions
are still slow. Why?
✅ Gotchas:
.cache() is lazy — no action = nothing cached
Insufficient executor memory → cache evicted
You cached the wrong point in the lineage, so later actions recompute instead of hitting the cache
✅ Fix:
python
df.cache()
df.count() # Trigger materialization
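A quick way to confirm the cache is actually in effect (df is a placeholder):
python
print(df.storageLevel)  # useMemory=True once the DataFrame is marked as cached

# If executor memory is tight, cached blocks can still be evicted;
# the Storage tab in the Spark UI shows the fraction actually cached.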
Your next opportunity is closer than you think. Let’s get you there!
📞 Don’t wait—call us at +91 98604 38743 today
#AzureSynapse #DataEngineering
#InterviewPreparation #JobReady
#MockInterviews #Deloitte #CareerSuccess
#ProminentAcademy
We help you crack data engineering interviews by:
✅ Offering scenario-based mock interviews
✅ Providing hands-on training with data engineering features
✅ Optimizing your resume & LinkedIn profile
✅ Giving personalized interview coaching to ensure you’re job-ready
Don’t leave your future to chance!