PySpark Interview Questions

The document is a comprehensive list of interview questions on PySpark, covering fundamental concepts such as RDDs, DataFrames, and Spark architecture. It includes questions about data manipulation techniques, join types, handling missing values, and performance optimization in Spark, as well as advanced topics such as UDFs, lazy evaluation, and the differences between the various Spark components.

1. What is PySpark?
2. How does PySpark differ from Pandas?
3. What is RDD in PySpark?
4. What is the difference between RDD and DataFrame?
5. How do you create a DataFrame in PySpark?
6. What are the different ways to read data into a DataFrame?
7. What is the difference between select() and selectExpr()?
8. How do you filter data in PySpark?
9. What is the difference between filter() and where()?
10. How do you add a new column to a DataFrame?
11. How do you drop a column from a DataFrame?
12. How do you rename a column in PySpark?
13. What are the different join types in PySpark?
14. How do you perform an inner join in PySpark?
15. What is the difference between join() and crossJoin()?
16. What is the use of groupBy() in PySpark?
17. How do you apply aggregate functions in PySpark?
18. How do you handle missing/null values in PySpark?
19. How do you replace null values in PySpark?
20. What is the difference between dropna(), fillna(), and replace()?
21. How do you remove duplicate rows in PySpark?
22. What is the difference between distinct() and dropDuplicates()?
23. How do you sort data in PySpark?
24. What is the difference between orderBy() and sort()?
25. What is a UDF (User Defined Function) in PySpark?
26. How do you register and use a UDF in PySpark?
27. What is the difference between map() and flatMap() in PySpark?
28. What is lazy evaluation in PySpark?
29. What are actions and transformations in PySpark?
30. What is the difference between collect() and show()?
31. What is Apache Spark?
32. Explain the architecture of Apache Spark.
33. What is the difference between repartition() and coalesce() in Spark?
34. What is the difference between SparkContext and SparkSession?
35. What are narrow transformations in Spark?
36. What are wide transformations in Spark?
37. What is Adaptive Query Execution (AQE) in Spark?
38. What are some optimization techniques in Spark?
39. What is the Catalyst Optimizer in Spark?
40. What is serialization in Spark?
41. What is the difference between cache() and persist() in Spark?
42. What are the different file formats that Spark can process?
43. What is the difference between RDD, DataFrame, and Dataset in Spark?
44. What are advanced join techniques in Spark?
45. What is lineage in Spark?
46. What is DAG (Directed Acyclic Graph) in Spark?
47. What is a Spark job?
48. What is a Spark stage?
49. What is a Spark task?
50. How does Spark divide a job into stages?
51. What factors determine the number of tasks in a Spark stage?
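
The hedged sketches below illustrate several of the questions above. All DataFrame names, column names, file paths, and sample values are illustrative assumptions, not part of the original document, and later sketches reuse the spark session and DataFrames defined in earlier ones. For questions 5 and 6, a minimal sketch of creating a DataFrame from local data and of the common read paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sketch").getOrCreate()

# Create a DataFrame from an in-memory list of tuples
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Common ways to read data into a DataFrame (hypothetical paths)
csv_df = spark.read.csv("data/people.csv", header=True, inferSchema=True)
json_df = spark.read.json("data/people.json")
parquet_df = spark.read.parquet("data/people.parquet")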
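
For question 7: select() takes column names or Column objects, while selectExpr() parses SQL expression strings, so it can compute derived columns inline. A sketch reusing the hypothetical df above:

plain = df.select("id", "name")
with_expr = df.selectExpr("id", "upper(name) AS name_upper")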
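
For questions 8 and 9: filter() and where() are aliases of each other; both accept a Column condition or a SQL expression string:

big_ids = df.filter(df.id > 1)
same_result = df.where("id > 1")  # identical to the line above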
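
For questions 10 through 12, one sketch covering adding, dropping, and renaming columns:

from pyspark.sql.functions import col

df2 = df.withColumn("id_doubled", col("id") * 2)  # add a derived column
df3 = df2.drop("id_doubled")                      # drop a column
df4 = df3.withColumnRenamed("name", "full_name")  # rename a column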
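
For questions 13 through 15: join() accepts a how argument such as "inner", "left", "right", "full", "left_semi", or "left_anti", while crossJoin() takes no key and pairs every row of one DataFrame with every row of the other. A sketch with a second hypothetical DataFrame:

dept = spark.createDataFrame([(1, "eng"), (3, "ops")], ["id", "dept"])

inner = df.join(dept, on="id", how="inner")  # question 14: keeps only matching ids
left = df.join(dept, on="id", how="left")    # keeps all rows of df
cartesian = df.crossJoin(dept)               # no key: 2 x 2 = 4 rows here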
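
For questions 16 and 17, a sketch of groupBy() with several aggregate functions on a hypothetical sales DataFrame:

from pyspark.sql import functions as F

sales = spark.createDataFrame(
    [("east", 100), ("east", 50), ("west", 30)], ["region", "amount"])

summary = sales.groupBy("region").agg(
    F.count("*").alias("n_rows"),
    F.sum("amount").alias("total"),
    F.avg("amount").alias("avg_amount"))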
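
For questions 18 through 20: dropna() removes rows containing nulls, fillna() substitutes values for nulls, and replace() swaps non-null values for other values. A sketch on hypothetical data:

raw = spark.createDataFrame(
    [(1, "a"), (2, None), (None, "N/A")], ["id", "label"])

no_null_ids = raw.dropna(subset=["id"])                    # drop rows with a null id
filled = raw.fillna({"id": -1, "label": "unknown"})        # per-column fill values
swapped = raw.replace("N/A", "missing", subset=["label"])  # value-for-value swap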
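
For questions 21 and 22: distinct() deduplicates on all columns, while dropDuplicates() can deduplicate on a chosen subset of columns:

unique_rows = raw.distinct()                   # exact duplicates across all columns
unique_labels = raw.dropDuplicates(["label"])  # duplicates judged on label only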
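
For questions 23 and 24: sort() is an alias of orderBy(); both accept column names or Column expressions:

ascending = sales.orderBy("amount")               # ascending by default
descending = sales.sort(F.col("amount").desc())   # same API, explicit direction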
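
For questions 25 and 26, a sketch of defining a UDF with an explicit return type and registering one for use from SQL (the UDF names here are hypothetical):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Wrap a plain Python function as a UDF, guarding against nulls
shout = udf(lambda s: s.upper() if s is not None else None, StringType())
upper_df = df.withColumn("name_upper", shout("name"))

# Register a UDF under a name so it can be called in spark.sql() queries
spark.udf.register("shout_sql", lambda s: s.upper() if s else None, StringType())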
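
For question 27: map() produces exactly one output element per input element, while flatMap() flattens each returned sequence, as the commented results show:

rdd = spark.sparkContext.parallelize(["a b", "c d e"])

rdd.map(lambda line: line.split(" ")).collect()      # [["a", "b"], ["c", "d", "e"]]
rdd.flatMap(lambda line: line.split(" ")).collect()  # ["a", "b", "c", "d", "e"]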
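
For questions 28 through 30: transformations such as filter() only build an execution plan; nothing runs until an action is called. show() prints a handful of rows to the driver console, while collect() pulls every matching row into the driver process:

pending = sales.filter(F.col("amount") > 40)  # transformation: nothing executes yet

pending.show(5)           # action: triggers the job, prints up to 5 rows
rows = pending.collect()  # action: brings ALL matching rows to the driver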
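
For question 33: repartition() performs a full shuffle and can raise or lower the partition count, while coalesce() merges existing partitions without a full shuffle and can only lower it:

wide = sales.repartition(8)    # full shuffle; can increase or decrease partitions
narrow = wide.coalesce(2)      # narrow dependency; decrease only, cheaper
narrow.rdd.getNumPartitions()  # returns 2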
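
For question 41: on DataFrames, cache() is shorthand for persist() with the default storage level (memory and disk), while persist() lets you choose a storage level explicitly:

from pyspark import StorageLevel

cached = sales.cache()  # default storage level for DataFrames
cached.count()          # an action materializes the cache
cached.unpersist()      # release it before changing the level

on_disk = sales.persist(StorageLevel.DISK_ONLY)  # explicit storage level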
