The document lists 50 questions about how to perform common data wrangling and analysis tasks using PySpark DataFrames. These include questions about selecting, filtering, aggregating, joining, handling nulls and outliers, performing string operations, date/time handling, and more. It covers a wide range of fundamental techniques for working with data in PySpark.
PySpark DataFrame Questions
1. How do you select all columns from a DataFrame?
2. How do you select specific columns from a DataFrame?
3. How do you filter rows based on a condition?
4. How do you count the number of rows in a DataFrame?
5. How do you find the distinct values in a column?
6. How do you rename a column in a DataFrame?
7. How do you drop a column from a DataFrame?
8. How do you drop duplicate rows from a DataFrame?
9. How do you sort the DataFrame based on one or more columns?
10. How do you perform a group by operation in PySpark?
11. How do you perform aggregation functions like sum, max, min, and avg?
12. How do you join two DataFrames in PySpark?
13. How do you perform inner join, left join, right join, and outer join operations?
14. How do you handle null values in PySpark DataFrames?
15. How do you create a new column based on existing columns in a DataFrame?
16. How do you apply user-defined functions (UDFs) to a DataFrame?
17. How do you convert a DataFrame to a Pandas DataFrame?
18. How do you convert a Pandas DataFrame to a PySpark DataFrame?
19. How do you read data from a CSV file into a PySpark DataFrame?
20. How do you write data from a PySpark DataFrame to a CSV file?
21. How do you read data from a JSON file into a PySpark DataFrame?
22. How do you write data from a PySpark DataFrame to a JSON file?
23. How do you read data from a Parquet file into a PySpark DataFrame?
24. How do you write data from a PySpark DataFrame to a Parquet file?
25. How do you handle date and timestamp data in PySpark DataFrames?
26. How do you extract year, month, day, hour, minute, or second from a timestamp column?
27. How do you convert a string to a timestamp in PySpark?
28. How do you convert a timestamp to a string in PySpark?
29. How do you handle timezone conversions in PySpark?
30. How do you concatenate strings in PySpark?
31. How do you perform case-insensitive string operations in PySpark?
32. How do you perform a wildcard search in PySpark?
33. How do you perform a substring search in PySpark?
34. How do you convert a column to lowercase or uppercase in PySpark?
35. How do you check if a column contains a specific substring in PySpark?
36. How do you filter rows based on a list of values in PySpark?
37. How do you compute row-wise operations in PySpark?
38. How do you filter rows based on a regex pattern in PySpark?
39. How do you handle outliers in PySpark DataFrames?
40. How do you compute the correlation between two columns in PySpark?
41. How do you compute the covariance between two columns in PySpark?
42. How do you pivot a DataFrame in PySpark?
43. How do you unpivot a DataFrame in PySpark?
44. How do you handle missing or null values in PySpark DataFrames?
45. How do you impute missing values in PySpark?
46. How do you calculate the cumulative sum or running total in PySpark?
47. How do you perform window functions in PySpark?
48. How do you rank rows based on a specific column in PySpark?
49. How do you compute lead and lag values in PySpark?
50. How do you handle skewed data in PySpark when performing joins or aggregations?