PySpark Scenario-Based Questions
2. Data Transformation
How would you split a column with concatenated values into multiple columns?
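One way, sketched with a hypothetical full_name column holding comma-delimited values; split() returns an array column that you can index with getItem():

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.appName("split-example").getOrCreate()

# Hypothetical data: one column holding "first,last" pairs
df = spark.createDataFrame([("John,Doe",), ("Jane,Smith",)], ["full_name"])

# split() yields an array column; getItem() pulls out each element
parts = split(col("full_name"), ",")
df = df.withColumn("first_name", parts.getItem(0)) \
       .withColumn("last_name", parts.getItem(1))
df.show()
```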
Describe the process of exploding a column that contains arrays into separate rows.
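A minimal sketch, assuming a hypothetical orders DataFrame whose items column is an array; explode() produces one output row per array element:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("explode-example").getOrCreate()
df = spark.createDataFrame(
    [(1, ["apple", "banana"]), (2, ["cherry"])],
    ["order_id", "items"],
)

# explode() emits one row per array element; explode_outer() would
# also keep rows whose array is null or empty
exploded = df.withColumn("item", explode(col("items"))).drop("items")
exploded.show()
```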
Write a PySpark code snippet to remove special characters from a string column.
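One possible snippet; the regex below treats anything other than letters, digits, and spaces as "special", which is an assumption you would adjust to the requirement, and the name column is hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace

spark = SparkSession.builder.appName("clean-example").getOrCreate()
df = spark.createDataFrame([("Al*ce!",), ("B@b #42",)], ["name"])

# Strip everything except letters, digits, and spaces
cleaned = df.withColumn(
    "name_clean", regexp_replace(col("name"), r"[^A-Za-z0-9 ]", "")
)
cleaned.show()
```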
How do you normalize numerical data in a DataFrame?
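A min-max scaling sketch using plain DataFrame aggregates (pyspark.ml.feature.MinMaxScaler is an alternative when working with ML feature vectors); the price column and values are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max as max_, min as min_

spark = SparkSession.builder.appName("normalize-example").getOrCreate()
df = spark.createDataFrame([(10.0,), (20.0,), (30.0,)], ["price"])

# Collect the column's min/max once, then rescale every row to [0, 1]
stats = df.agg(min_("price").alias("lo"), max_("price").alias("hi")).first()
normalized = df.withColumn(
    "price_scaled",
    (col("price") - stats["lo"]) / (stats["hi"] - stats["lo"]),
)
normalized.show()
```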
Explain how to convert a column of timestamps into a different time zone.
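One approach, assuming the stored timestamps are naive US/Eastern values (both zone names here are illustrative): to_utc_timestamp() interprets a timestamp in a source zone, and from_utc_timestamp() renders it in a target zone.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_utc_timestamp, to_utc_timestamp

spark = SparkSession.builder.appName("tz-example").getOrCreate()
df = spark.createDataFrame([("2024-01-15 09:30:00",)], ["ts"]) \
          .withColumn("ts", col("ts").cast("timestamp"))

# Treat the naive value as US/Eastern, then display it in Asia/Kolkata
converted = df.withColumn(
    "ts_utc", to_utc_timestamp(col("ts"), "US/Eastern")
).withColumn(
    "ts_kolkata", from_utc_timestamp(col("ts_utc"), "Asia/Kolkata")
)
converted.show(truncate=False)
```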
4. Window Functions
How do you use window functions to rank items within each partition of a DataFrame?
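A sketch ranking hypothetical products by revenue within each category; rank(), dense_rank(), and row_number() differ only in how they handle ties:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, rank
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("rank-example").getOrCreate()
df = spark.createDataFrame(
    [("electronics", "tv", 500), ("electronics", "phone", 800),
     ("books", "novel", 20)],
    ["category", "product", "revenue"],
)

# One rank sequence per category, highest revenue first
w = Window.partitionBy("category").orderBy(col("revenue").desc())
df.withColumn("rank", rank().over(w)).show()
```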
Write a PySpark example to calculate the moving average of sales over a 30-day window.
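One possible sketch; rangeBetween() operates on the ORDER BY value, so ordering by epoch seconds lets the frame span the previous 30 calendar days (the sales data is hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, to_date
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("moving-avg").getOrCreate()
df = spark.createDataFrame(
    [("2024-01-01", 100.0), ("2024-01-15", 200.0), ("2024-02-10", 150.0)],
    ["sale_date", "amount"],
).withColumn("sale_date", to_date("sale_date"))

# Order by epoch seconds so the range frame can span 30 calendar days
seconds_per_day = 86400
w = (
    Window.orderBy(col("sale_date").cast("timestamp").cast("long"))
    .rangeBetween(-30 * seconds_per_day, 0)
)
df.withColumn("moving_avg_30d", avg("amount").over(w)).show()
```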
Explain how to use window functions to compute cumulative sums or averages.
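A sketch with hypothetical data; a frame from unboundedPreceding to currentRow turns ordinary aggregates into running ones:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, sum as sum_
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("cumulative").getOrCreate()
df = spark.createDataFrame(
    [("a", 1, 10), ("a", 2, 20), ("a", 3, 30)],
    ["grp", "step", "value"],
)

# Frame from the partition start to the current row => running totals
w = (
    Window.partitionBy("grp")
    .orderBy("step")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
df.withColumn("running_sum", sum_("value").over(w)) \
  .withColumn("running_avg", avg("value").over(w)) \
  .show()
```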
How would you partition data by date and compute the last value in each partition using window functions?
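One way, with hypothetical trade data. The key detail: the frame must extend to unboundedFollowing, otherwise last() only sees rows up to the current one (the default frame):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import last
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("last-value").getOrCreate()
df = spark.createDataFrame(
    [("2024-01-01", "09:00", 10.0), ("2024-01-01", "17:00", 15.0),
     ("2024-01-02", "09:00", 12.0)],
    ["trade_date", "trade_time", "price"],
)

# Extend the frame to unboundedFollowing: with the default frame,
# last() would only see rows up to the current one
w = (
    Window.partitionBy("trade_date")
    .orderBy("trade_time")
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)
df.withColumn("closing_price", last("price").over(w)).show()
```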
5. Performance Optimization
What strategies would you use to optimize the performance of a Spark job?
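Typical answers include broadcast joins, adaptive query execution, sensible partitioning, caching reused intermediates, and columnar file formats. A sketch of a few commonly tuned session settings (the values are illustrative, not prescriptive):

```python
from pyspark.sql import SparkSession

# A few commonly tuned settings; the values here are illustrative
spark = (
    SparkSession.builder.appName("tuning-example")
    # Adaptive Query Execution re-optimizes plans at runtime (Spark 3.x)
    .config("spark.sql.adaptive.enabled", "true")
    # Auto-broadcast tables smaller than this many bytes in joins
    .config("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)
    # Default shuffle partition count; size to cores and data volume
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)
```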
How do you handle data skew in a join operation to improve performance?
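One common remedy is key salting, sketched below with hypothetical tables: spread a hot key across N synthetic keys on the skewed side and replicate the other side to match. (If the other side is small, broadcast() avoids the shuffle entirely.)

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, col, concat_ws, explode, floor, lit, rand

spark = SparkSession.builder.appName("salted-join").getOrCreate()
num_salts = 8  # illustrative fan-out factor

# Skewed side (hypothetical): append a random salt 0..7 to the hot key
facts = spark.range(1000).withColumn("key", lit("hot"))
facts_salted = facts.withColumn(
    "salted_key",
    concat_ws("_", col("key"), floor(rand() * num_salts).cast("string")),
)

# Other side (hypothetical): replicate each row once per salt value
dims = spark.createDataFrame([("hot", "metadata")], ["key", "info"])
dims_salted = dims.withColumn(
    "salt", explode(array(*[lit(i) for i in range(num_salts)]))
).withColumn(
    "salted_key", concat_ws("_", col("key"), col("salt").cast("string"))
)

# The hot key is now spread across num_salts shuffle partitions
joined = facts_salted.join(dims_salted.select("salted_key", "info"), "salted_key")
joined.show(5)
```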
Write code to repartition a DataFrame to optimize parallel processing.
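A minimal sketch; the partition counts are illustrative and would normally be sized to cluster cores and data volume:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-example").getOrCreate()
df = spark.range(1_000_000)

# Full shuffle into 64 partitions; repartition("some_col") would
# instead co-locate rows that share a key
evenly = df.repartition(64)

# coalesce() merges partitions without a full shuffle, so it is the
# cheaper choice when only *reducing* the count, e.g. before a write
fewer = evenly.coalesce(8)

print(evenly.rdd.getNumPartitions(), fewer.rdd.getNumPartitions())
```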
Describe how to cache intermediate results to speed up iterative computations.
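A sketch with a hypothetical "expensive" intermediate DataFrame: persist it before the repeated actions and release it afterwards.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("cache-example").getOrCreate()

# Hypothetical expensive intermediate result reused by several actions
expensive = spark.range(10_000_000).filter(col("id") % 7 == 0)

# For DataFrames, cache() is shorthand for persist(MEMORY_AND_DISK)
expensive.persist(StorageLevel.MEMORY_AND_DISK)

print(expensive.count())  # first action computes and populates the cache
print(expensive.count())  # later actions read the cached data instead

expensive.unpersist()     # release the storage once the reuse is done
```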
Explain why optimizing shuffle operations matters and how to achieve it.
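Shuffles move data across the network and are often the dominant cost in a job, so the goal is to shrink or avoid them: enable AQE, broadcast small join sides, and pre-partition on frequently joined keys. A sketch with hypothetical tables:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder.appName("shuffle-tuning")
    # AQE coalesces tiny shuffle partitions and splits skewed ones at runtime
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)

large = spark.range(1_000_000).withColumnRenamed("id", "key")
small = spark.range(100).withColumnRenamed("id", "key")

# Broadcasting the small side turns a shuffle join into a map-side join
joined = large.join(broadcast(small), "key")
joined.explain()  # plan should show BroadcastHashJoin, not SortMergeJoin
```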