PySpark Scenario-Based Interview Questions & Answers
http://www.nityacloudtech.com/ @nityacloudtech
Nitya CloudTech Pvt Ltd.
Candidate: Sure! To find the average age per name, I’d use the
groupBy() function on the name column, followed by avg() on the age
column. To handle missing values in the age column, I’d use fillna()
to replace null values with a specified default, or dropna() to remove
rows with null values in the age column, depending on the
requirements.
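A minimal sketch of that approach, assuming the DataFrame has name and age columns:
Example:
from pyspark.sql import functions as F

# Drop rows with a null age (alternatively, fillna({"age": default_value}) could replace them)
clean_df = df.dropna(subset=["age"])
avg_age_df = clean_df.groupBy("name").agg(F.avg("age").alias("average_age"))
avg_age_df.show()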
Example:
# Removing duplicates based on specified columns
df = df.dropDuplicates(["column1", "column2"])
Candidate: I’d use the Window function and partition the data by
region. Then, I’d apply an order by operation, if needed, and use
F.sum("sales").over(window_spec) to calculate the cumulative sum.
Candidate: I’d use the pivot() function on the year column and
apply sum() or first() to aggregate the population values.
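A minimal sketch of the pivot, assuming the DataFrame also has a country column alongside year and population:
Example:
from pyspark.sql import functions as F

# Turn each distinct year into its own column, aggregating population per country
pivot_df = df.groupBy("country").pivot("year").agg(F.sum("population"))
pivot_df.show()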
Candidate: I’d use the split() function from PySpark to separate the full_name column by a space and extract the parts into first_name and last_name columns.
Example:
# Splitting full_name into first_name and last_name
df = df.withColumn("first_name", F.split(F.col("full_name"), "
")[0])
df = df.withColumn("last_name", F.split(F.col("full_name"), "
")[1])
df.show()
Candidate: For large DataFrames, I’d first filter out rows with null
values in the join column on both sides, using dropna(). If keeping
nulls is essential, I’d consider filling them with unique placeholder
values to avoid erroneous joins.
Example:
# Dropping rows with null join keys
df1 = df1.dropna(subset=["join_column"])
df2 = df2.dropna(subset=["join_column"])
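For the alternative the candidate mentions, a minimal sketch of filling null join keys with unique placeholder values (the prefix-plus-generated-id scheme below is an illustrative assumption):
Example:
from pyspark.sql import functions as F

# Give each null join key a unique placeholder so it cannot accidentally match the other DataFrame
df1 = df1.withColumn(
    "join_column",
    F.when(
        F.col("join_column").isNull(),
        F.concat(F.lit("missing_df1_"), F.monotonically_increasing_id().cast("string"))
    ).otherwise(F.col("join_column"))
)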
Example:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import functions as F
Example:
# Splitting and exploding the column
df = df.withColumn("split_column", F.split(F.col("comma_separated_column"), ","))
df = df.select(F.col("id"), F.explode(F.col("split_column")).alias("separated_value"))
df.show()
Interviewer: How would you filter out rows with null values in more
than two specific columns in a PySpark DataFrame?
Example:
# Filtering rows with at least two non-null values
df = df.filter(
    (F.col("col1").isNotNull() & F.col("col2").isNotNull()) |
    (F.col("col1").isNotNull() & F.col("col3").isNotNull()) |
    (F.col("col2").isNotNull() & F.col("col3").isNotNull())
)
df.show()
Example:
from pyspark.sql import Window
from pyspark.sql import functions as F
# Calculate the 7-day rolling average for a specific metric
# window_spec was not defined here; an assumed definition (the id and date columns are assumptions,
# with one row per day), covering the current row and the 6 preceding rows
window_spec = Window.partitionBy("id").orderBy("date").rowsBetween(-6, 0)
df = df.withColumn("rolling_avg", F.avg("metric_column").over(window_spec))
df.show()
Candidate: I’d use groupBy() on both city and product columns and
then apply avg() to calculate the average sales for each combination.
Example:
# Group by city and product, and calculate average sales
df = df.groupBy("city", "product").agg(F.avg("sales").alias("average_sales"))
df.show()
Example:
from pyspark.sql import Window
from pyspark.sql import functions as F
Interviewer: Given a sales dataset, explain how to identify the top 5 products with the highest revenue per region.
Example:
from pyspark.sql import Window
from pyspark.sql import functions as F
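# A minimal sketch of the ranking approach, assuming columns named region, product, and revenue
revenue_df = df.groupBy("region", "product").agg(F.sum("revenue").alias("total_revenue"))
window_spec = Window.partitionBy("region").orderBy(F.col("total_revenue").desc())
top5_df = revenue_df.withColumn("rank", F.row_number().over(window_spec)).filter(F.col("rank") <= 5)
top5_df.show()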
Example:
from pyspark.sql import Window
from pyspark.sql import functions as F
# Calculate time difference between purchases
# window_spec was not defined here; an assumed definition per customer, ordered by purchase date
window_spec = Window.partitionBy("customer_id").orderBy("purchase_date")
df = df.withColumn("previous_purchase_date", F.lag("purchase_date").over(window_spec))
df = df.withColumn("days_between_purchases", F.datediff("purchase_date", "previous_purchase_date"))
Candidate: I’d use the agg() function with max(), min(), and avg()
for each numeric column to create a summary report.
Example:
# Creating a summary report
summary_df = df.agg(
F.max("column1").alias("max_column1"),
F.min("column1").alias("min_column1"),
F.avg("column1").alias("avg_column1"),
F.max("column2").alias("max_column2"),
F.min("column2").alias("min_column2"),
F.avg("column2").alias("avg_column2")
)
summary_df.show()
Candidate: I’d filter the data to include only the last month’s records
based on purchase_date, then group by product_category, and use
countDistinct() on customer_id to get the unique count of
customers.
Example:
from pyspark.sql import functions as F
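# A minimal sketch, assuming columns named purchase_date, product_category, and customer_id,
# and treating "last month" as the 30 days before the current date
recent_df = df.filter(F.col("purchase_date") >= F.date_sub(F.current_date(), 30))
unique_customers_df = recent_df.groupBy("product_category").agg(
    F.countDistinct("customer_id").alias("unique_customers")
)
unique_customers_df.show()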
Candidate: I’d extract the month and year from the purchase_date
column, group by product and the extracted month, and then use
sum() on the revenue column.
Example:
# Extracting month and year, grouping by product and month
monthly_revenue_df = df.withColumn("month", F.date_format("purchase_date", "yyyy-MM")) \
    .groupBy("product", "month") \
    .agg(F.sum("revenue").alias("monthly_revenue"))
monthly_revenue_df.show()
Candidate: I’d calculate the total sales per region for each of the last
three quarters. Then, I’d use a window function to create a column
showing whether each quarter’s sales are greater than the previous
one. Finally, I’d filter regions where sales consistently increase.
Example:
from pyspark.sql import Window
from pyspark.sql import functions as F
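# A minimal sketch of the approach described above, assuming columns named region, sales,
# and a sortable quarter column (e.g. "2024Q1"), with df already limited to the last three quarters
quarterly_df = df.groupBy("region", "quarter").agg(F.sum("sales").alias("total_sales"))

# Flag whether each quarter's total beats the previous quarter within the same region
window_spec = Window.partitionBy("region").orderBy("quarter")
quarterly_df = quarterly_df.withColumn(
    "increased",
    (F.col("total_sales") > F.lag("total_sales").over(window_spec)).cast("int")
)

# A region qualifies if every quarter after the first shows an increase
consistent_df = (
    quarterly_df.groupBy("region")
    .agg(F.sum("increased").alias("num_increases"), F.count("*").alias("num_quarters"))
    .filter(F.col("num_increases") == F.col("num_quarters") - 1)
)
consistent_df.show()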
Candidate: I’d use when() and otherwise() to define the ranges for
each grade and create a new column for the assigned grade.
Example:
from pyspark.sql import functions as F
# Assigning grades by score range; the opening "A" condition (score >= 90) is assumed from the ranges below
df = df.withColumn("grade",
    F.when(F.col("score") >= 90, "A")
    .when((F.col("score") >= 80) & (F.col("score") < 90), "B")
    .when((F.col("score") >= 70) & (F.col("score") < 80), "C")
    .otherwise("D")
)
df.show()