PySpark Interview Cheatsheet: 30 Core Topics Explained with Code
1. Creating a SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("InterviewCheatsheet") \
    .getOrCreate()
📌 Explanation:
A SparkSession is the entry point for working with
Spark. It manages the cluster connection and creates
DataFrames.
2. Creating a DataFrame
data = [("Shivani", 25), ("Amit", 30), ("Raj", 28)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()
📌 Explanation:
A DataFrame is similar to a table in SQL or a Pandas
DataFrame.
https://www.seekhobigdata.com/
3. Reading a CSV File
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show()
📌 Explanation:
header=True considers the first row as a header, and
inferSchema=True automatically detects data types.
4. Writing a CSV File
df.write.csv("output.csv", header=True, mode="overwrite")
📌 Explanation:
Writes DataFrame to a CSV file with a header.
"overwrite" replaces existing data.
5. Schema Definition
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
📌 Explanation:
Defines a structured schema for better control over
data types.
6. Filtering Data
df.filter(df.Age > 26).show()
📌 Explanation:
Filters rows where Age > 26; filter() and where() are interchangeable.
7. Selecting Columns
df.select("Name").show()
📌 Explanation:
Returns only the Name column.
8. Adding a New Column
df = df.withColumn("Country", lit("India"))
df.show()
📌 Explanation:
lit("India") adds a new column with a constant value.
9. Renaming a Column
df = df.withColumnRenamed("Age", "Years")
df.show()
📌 Explanation:
Renames column Age to Years.
10. Dropping a Column
df = df.drop("Country")
df.show()
📌 Explanation:
Drops the Country column.
11. Handling NULL Values
df = df.na.fill({"Age": 0})
df.show()
📌 Explanation:
Fills NULL values in the Age column with 0.
12. GroupBy and Aggregation
df.groupBy("Name").agg({"Age": "max"}).show()
📌 Explanation:
Finds the maximum age for each name.
13. Sorting Data
df.orderBy(df.Age.desc()).show()
📌 Explanation:
Sorts data in descending order of Age.
14. Joining Two DataFrames
df.join(df2, "Name").show()
📌 Explanation:
Performs an inner join on the Name column.
15. Using UDF (User Defined Function)
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def greet(name):
    return "Hello, " + name

greet_udf = udf(greet, StringType())
df = df.withColumn("Greeting", greet_udf(df.Name))
df.show()
📌 Explanation:
Defines a custom function and applies it to a DataFrame
column.
16. Exploding Arrays
📌 Explanation:
Converts array values into separate rows.
17. Window Functions
from pyspark.sql.functions import rank
from pyspark.sql.window import Window

window_spec = Window.partitionBy("City").orderBy(df.Age.desc())
df.withColumn("Rank", rank().over(window_spec)).show()
📌 Explanation:
Ranks records within each city.
df.groupBy("Name").pivot("City").sum("Age").
show()
📌 Explanation:
Converts row values into column headers.
19. Handling Duplicates
df.dropDuplicates(["Name"]).show()
📌 Explanation:
Removes duplicate names.
20. Caching Data
df.cache()
df.show()
📌 Explanation:
Marks the DataFrame for in-memory storage; the cache is filled lazily, on the first action.
21. Repartitioning
df = df.repartition(2)
df.show()
📌 Explanation:
Redistributes data into 2 partitions via a full shuffle; coalesce() reduces partitions without one.
22. Writing to Parquet
df.write.parquet("output.parquet")
📌 Explanation:
Parquet format is optimized for big data processing.
23. Broadcast Joins
from pyspark.sql.functions import broadcast

df = df.join(broadcast(df2), "Name")
df.show()
📌 Explanation:
Optimizes joins when one DataFrame is small.
df = df.withColumn("Date",
current_date()).withColumn("Timestamp",
current_timestamp())
df.show()
📌 Explanation:
Adds current date and timestamp.
25. Converting DataFrame to Pandas
pdf = df.toPandas()
print(pdf)
📌 Explanation:
Converts a Spark DataFrame to a Pandas DataFrame. This collects all rows onto the driver, so use it only on small results.
26. Reading JSON
df = spark.read.json("data.json")
df.show()
📌 Explanation:
Reads JSON data.
27. Using Explode with JSON
df = df.withColumn("data",
explode(df.json_column))
df.show()
📌 Explanation:
Expands an array column parsed from JSON into one row per element.
28. Writing JSON
df.write.json("output.json")
📌 Explanation:
Saves the DataFrame as a JSON file.
29. Using SQL Queries in PySpark
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE Age >
25").show()
📌 Explanation:
Runs SQL queries on DataFrames.
30. Stopping the SparkSession
spark.stop()
📌 Explanation:
Stops the SparkSession to free up resources.
If you find this helpful, like and share.
+91 99894 54737