PySpark and Python preparation notes
Interview questions on the ability to write efficient code, handle edge cases, and think
about improvements or alternative methods.
Warm-up task: find the paint color with the lowest price from a dictionary
of paint colors and prices.
This task tests the candidate's knowledge of dictionary operations, function definitions, and the
use of the min() function with a key argument in Python.
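One way the task might be solved (the function and variable names below are illustrative, not from the original notes):

```python
def cheapest_paint(prices):
    """Return the (color, price) pair with the lowest price."""
    if not prices:  # edge case: empty dictionary
        return None
    # min() with a key compares dictionary items by their price
    # (the second element of each (color, price) tuple)
    return min(prices.items(), key=lambda item: item[1])

paint_prices = {"red": 12.99, "blue": 9.49, "green": 10.75}
print(cheapest_paint(paint_prices))  # → ('blue', 9.49)
```

Returning None for an empty dictionary is one defensible edge-case choice; raising a ValueError (which min() does by default on an empty sequence) is another, and discussing that trade-off is part of what the question probes.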
1. What is PySpark, and how does it differ from traditional Python data processing?
2. Can you explain the difference between an RDD and a DataFrame in PySpark?
3. Why do we need to define a schema in PySpark DataFrames?
4. How would you create a DataFrame from a list of tuples in PySpark?
5. What are some common use cases for Spark in a data engineering context?
6. How would you filter rows in this DataFrame where value is greater than 1?
7. What is lazy evaluation in Spark, and how does it apply to transformations in this
DataFrame?
8. Can you explain how Spark handles data across partitions, and why this is beneficial
for big data?
9. How would you perform a groupBy operation on this DataFrame, for example,
grouping by value?
10. What are some performance optimization techniques you could apply in PySpark?
11. How would you add a new column to the DataFrame with a transformed version of
value, for example, doubling each value?
12. If we need to save this DataFrame to a file or a database, how would we do that in
PySpark?
13. Can you explain what happens when you use df.show() versus df.collect()?