PySpark Optimization Scenarios - Wipro

PySpark optimization scenarios, interview questions
www.prominentacademy.in
+91 98604 38743


Scenario
You're using Parquet from ADLS Gen2, but Spark reads
are slow. What could be wrong?

✅ Fixes:
Ensure hierarchical namespace (HNS) is enabled on the storage account
Optimize the read (see the sketch after the code block below) by:
Filtering on the partition column
Avoiding select * and projecting only the columns you need
Using .inputFiles() to verify what is actually being read
Enable ABFS fast directory listing:
python

spark.conf.set("fs.azure.enable.fastpath", "true")

Scenario
Why does your job randomly fail after long stages?
✅ Root Cause:
Long GC pauses, which can cause executor heartbeats to be missed
Driver or executor OOM
✅ Fixes:
Increase spark.network.timeout:
python

spark.conf.set("spark.network.timeout", "800s")

Break the job into smaller stages using .checkpoint() or .persist() (see the sketch below)
Scale executors horizontally (more, smaller executors)
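A rough sketch of breaking a long lineage with .persist() and .checkpoint(); heavy_transform and the checkpoint directory are placeholders, not part of the original scenario:
python

from pyspark import StorageLevel

# Placeholder checkpoint directory; use reliable storage such as ADLS/HDFS
spark.sparkContext.setCheckpointDir("abfss://tmp@account.dfs.core.windows.net/checkpoints")

stage1 = heavy_transform(df).persist(StorageLevel.MEMORY_AND_DISK)  # reuse without recomputing
stage1.count()                # materialize before the next long stage
stage2 = stage1.checkpoint()  # truncate lineage so retries do not recompute everything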

Scenario
How do you optimize joins on multiple keys when data is
huge?

✅ Tactics:
Use ZORDER BY (col1, col2, ...) for Delta Lake tables
Co-partition both DataFrames on the same columns
Broadcast one side if it is small enough to fit in memory
Repartition by the composite key:
python

df.repartition("col1", "col2")
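A sketch of the co-partition and broadcast tactics; df_left, df_right, and the small df_dim are placeholder DataFrames:
python

from pyspark.sql.functions import broadcast

# Co-partition both sides on the same composite key before joining
left = df_left.repartition("col1", "col2")
right = df_right.repartition("col1", "col2")
joined = left.join(right, ["col1", "col2"])

# If one side is small enough to fit in executor memory, broadcast it instead
joined_small = df_left.join(broadcast(df_dim), ["col1", "col2"])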

Scenario
Using explode() on nested arrays causes memory pressure.
How do you fix it?

✅ Solution:
Avoid explode() when it is not needed; use posexplode_outer() selectively
Use rdd.mapPartitions() (or flatMap() at the RDD level) for memory-safe, partition-at-a-time transformations
Apply filtering before the explode to reduce the output size (see the sketch below)
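A minimal sketch of filtering before exploding and using posexplode_outer(); order_id and the array column items are placeholder names:
python

from pyspark.sql.functions import col, posexplode_outer, size

# Filter first so only non-empty arrays are exploded, keeping the output small
trimmed = (
    df.filter(size(col("items")) > 0)
      .select("order_id", posexplode_outer(col("items")).alias("pos", "item"))
)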

Scenario
You're writing streaming data into a Delta table and sometimes get commit errors. How do you make it robust?
✅ Tuning Tips:
Use checkpointing so the stream can restart from the last successful write
Enable the ignoreChanges / mergeSchema options if the schema evolves
Use trigger(once=True) for micro-batch-style pipelines
Separate streaming writes by partition to reduce commit/file contention (see the sketch below)
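A sketch of a more robust streaming write; events_df, the paths, and the event_date partition column are assumptions for illustration:
python

query = (
    events_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events")  # restart from the last successful commit
    .option("mergeSchema", "true")                            # tolerate additive schema changes
    .partitionBy("event_date")                                # separate writes by partition
    .trigger(once=True)                                       # micro-batch style run
    .start("/mnt/delta/events")
)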

Scenario
When should you use repartition() vs coalesce()? What are the performance implications?
✅ Guidelines:
repartition(n) → full shuffle (expensive); use it to increase the number of partitions or rebalance data
coalesce(n) → narrow transformation; efficient for decreasing the number of partitions
Use repartition() before joins to balance the data
Use coalesce() before writes to control the output file count (see the sketch below)
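A short sketch of both guidelines together; fact_df, dim_df, customer_id, and output_path are placeholders:
python

# Full shuffle to rebalance on the join key before a large join
balanced = fact_df.repartition(200, "customer_id")
joined = balanced.join(dim_df, "customer_id")

# Narrow transformation to reduce the number of output files before writing
joined.coalesce(10).write.mode("overwrite").parquet(output_path)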

Scenario
Your job spends most of its time in Java garbage collection (for example, "GC overhead limit exceeded" errors). What does this mean and how do you fix it?
✅ Explanation:
Too much time is being spent on Java garbage collection
This indicates memory pressure or a memory leak in the job

✅ Fix:
Increase executor memory
Reduce the number of cached DataFrames
Avoid UDFs that generate large object graphs
Use persist(StorageLevel.DISK_ONLY) if memory is tight (see the sketch below)
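A small sketch of the disk-only persistence option, using a placeholder intermediate DataFrame:
python

from pyspark import StorageLevel

# Keep reused data on disk so it does not add to heap and GC pressure
intermediate = intermediate.persist(StorageLevel.DISK_ONLY)
intermediate.count()      # materialize the persisted data
# ... downstream steps reuse `intermediate` ...
intermediate.unpersist()  # release it once it is no longer needed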

Scenario
Spark UI shows that most executors are idle while a few are overloaded. What's going wrong and how can you fix it?
✅ Root Cause:
Data skew: one or a few partitions hold a disproportionate amount of the data
✅ Solutions:
Repartition using repartition(n) to rebalance
Use salting for skewed keys (see the sketch after the code below)
Enable adaptive query execution (AQE):
python

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

Scenario
You don’t know the optimal number of partitions in
advance. How do you dynamically set them based on data
size?

✅ Dynamic Partitioning Logic:
python

# Rough size estimate from row string lengths (a heuristic, not the exact file size)
file_size_gb = spark.read.parquet(path).rdd.map(lambda x: len(str(x))).sum() / (1024 ** 3)
num_partitions = max(1, int(file_size_gb * 4))  # ~256 MB per partition
df = df.repartition(num_partitions)

Tune further based on the memory profile and executor cores.

Scenario
Built-in partitioning is not optimal. How do you implement a custom partitioner in PySpark?
✅ Solution:
Use rdd.partitionBy(num_partitions, custom_partitioner) with a key-value RDD
Or define a custom hash partitioner inline:
python

rdd = rdd.partitionBy(n, lambda key: hash(key) % n)
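A slightly fuller sketch with a key-value RDD; the customer_id key is a placeholder:
python

n = 16  # target number of partitions, illustrative

# Build a pair RDD keyed by the field you want to partition on
pairs = df.rdd.map(lambda row: (row["customer_id"], row))
partitioned = pairs.partitionBy(n, lambda key: hash(key) % n)

# Inspect how evenly the keys landed across partitions
print(partitioned.glom().map(len).collect())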

Scenario
Your cluster has 64 GB per executor, but Spark still spills
to disk. What could be the issue?
✅ Possible Causes:
Insufficient memory for shuffle buffers
Large skewed joins or aggregations
Unbalanced partitioning
✅ Fixes:
Tune shuffle spill compression and in-flight fetch size:
python

spark.conf.set("spark.shuffle.spill.compress", "true")
spark.conf.set("spark.reducer.maxSizeInFlight", "48m")

Tune the memory fraction:
python

spark.conf.set("spark.memory.fraction", "0.8")

Scenario
You call .cache() on a DataFrame, but subsequent actions
are still slow. Why?
✅ Gotchas:
.cache() is lazy: with no action, nothing is actually cached
Insufficient executor memory → cached partitions get evicted
You cached after the transformations instead of before them
✅ Fix:

python

df.cache()
df.count() # Trigger materialization

#AzureSynapse #DataEngineering
#InterviewPreparation #JobReady
#MockInterviews #Deloitte #CareerSuccess
#ProminentAcademy

❌ Think your skills are enough?
Think again: these data engineer scenario-based questions could cost you your data engineering job.
In recent interviews at big MNCs, our students faced scenario-based data engineering questions, and many candidates struggled to answer them correctly. These questions are designed to test your real-world knowledge and your ability to solve complex data engineering problems.

Unfortunately, many students failed to answer these questions confidently. The truth is, preparation is key, and that's where Prominent Academy comes in!

We specialize in preparing you for Spark and data engineering interviews by:

Offering scenario-based mock interviews
Providing hands-on training on data engineering features
Optimizing your resume & LinkedIn profile
Giving personalized interview coaching to ensure you're job-ready

Don't leave your future to chance!

📞 Call us at +91 98604 38743 and get the interview prep you need to succeed.
