PySpark Scenario-Based
Interview Questions &
Answers
http://www.nityacloudtech.com/ @nityacloudtech
Nitya CloudTech Pvt Ltd.
Interviewer: You need to read data from multiple JSON files stored
in a directory, each with slight variations in structure. How would you
handle schema mismatches in PySpark?
Candidate: I’d rely on schema merging when reading the JSON files. Spark’s JSON reader infers the schema by sampling the files in the directory and unions the fields it finds, so a field that is missing from some files simply becomes a nullable column in the final schema, preventing mismatches. Setting the mergeSchema option makes that intent explicit (it is the same option used for columnar sources such as Parquet).
Example:
df = spark.read.option("mergeSchema", "true").json("/path/to/json/files")
Interviewer: You need to load data from a PostgreSQL table, but only the rows that match certain criteria. How would you read it efficiently in PySpark?
Candidate: I’d use the JDBC data source to read from PostgreSQL, passing the filter criteria as part of a subquery in the dbtable option so the filtering happens at the source. This minimizes the amount of data transferred, improving performance.
Example:
url = "jdbc:postgresql://host:port/database"
properties = {"user": "username", "password": "password", "driver": "org.postgresql.Driver"}
# push the filter down to PostgreSQL as a subquery (table and filter are placeholders)
query = "(SELECT * FROM source_table WHERE event_date >= '2024-01-01') AS filtered"
df = spark.read.jdbc(url=url, table=query, properties=properties)
Interviewer:
You have a DataFrame with raw timestamp data in different formats.
How would you standardize these timestamps into a uniform format?
Candidate: I’d parse the column with to_timestamp using each of the known format patterns and use coalesce to keep the first pattern that parses successfully.
Example:
from pyspark.sql.functions import to_timestamp, coalesce

df = df.withColumn("standard_timestamp", coalesce(
    to_timestamp("timestamp_column", "yyyy-MM-dd HH:mm:ss"),
    to_timestamp("timestamp_column", "MM/dd/yyyy HH:mm:ss")
))
Example:
df.write.mode("overwrite").partitionBy("year", "month").saveAsTable("database.table_name")
Interviewer: Log files are continuously written to a directory and you need to process new records as they arrive. How would you handle this in PySpark?
Candidate: I’d use Spark Structured Streaming to read the directory as a stream and apply transformations on the streaming DataFrame. The readStream API can treat a file directory as a streaming source, picking up new files as they appear.
Example:
# the text source exposes each log line in a single "value" column
df = spark.readStream.format("text").option("path", "/path/to/log/files").load()
transformed_df = df.withColumn("processed_column",
    some_transformation_function(df["value"]))
If the logs arrive through a message queue such as Kafka instead, the kafka source works the same way; the payload comes in a binary value column that needs to be cast before parsing.
Example:
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "server:port") \
    .option("subscribe", "topic_name") \
    .load()
messages = df.selectExpr("CAST(value AS STRING) AS message")
Interviewer: You receive data as XML files. How would you parse them into a DataFrame in PySpark?
Candidate: I’d use the spark-xml library to parse the XML file into a DataFrame, specifying the row tag so the reader knows which element represents a record (the root tag is mainly needed when writing XML back out).
Example:
# requires the spark-xml package on the classpath (e.g. com.databricks:spark-xml via --packages)
df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "yourRowTag") \
    .load("path/to/file.xml")
Interviewer:
Describe how you would integrate data from a NoSQL database, such
as MongoDB, into PySpark for analysis.
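Candidate: I’d use the MongoDB Spark Connector, which exposes MongoDB collections as a DataFrame source, and then analyze the data with regular DataFrame or Spark SQL operations. A minimal sketch, assuming the 10.x connector is installed on the cluster and that the URI, database, and collection names below are placeholders (older connector versions use the "mongo" format and a single uri option):
Example:
df = (spark.read.format("mongodb")
      .option("connection.uri", "mongodb://host:27017")
      .option("database", "database_name")
      .option("collection", "collection_name")
      .load())
# register the collection for SQL-based analysis
df.createOrReplaceTempView("mongo_data")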
Interviewer: How would you write a DataFrame to cloud storage in a compressed format to save space?
Candidate: I’d use the write method with the format set to parquet or csv, and specify a compression codec like gzip or snappy.
Example:
df.write.format("parquet").option("compression", "gzip").save("s3://bucket_name/path")
Optimization Techniques
Interviewer: Explain how you would handle data skew if you observe that certain keys appear very frequently in a join operation.
Candidate: I’d use techniques like salting or skew join hints. Salting involves adding a random value to the skewed key to spread it across multiple partitions.
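A minimal sketch of salting, assuming large_df is skewed on join_key, dim_df is the smaller side, and the salt range of 10 is illustrative:
Example:
from pyspark.sql import functions as F

NUM_SALTS = 10

# add a random salt to the skewed side so hot keys spread across several partitions
salted_large = large_df.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# replicate the smaller side once per salt value so every salted key finds a match
salts = spark.range(NUM_SALTS).select(F.col("id").cast("int").alias("salt"))
salted_dim = dim_df.crossJoin(salts)

joined = salted_large.join(salted_dim, ["join_key", "salt"]).drop("salt")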
Spark 3’s Adaptive Query Execution can also detect and split skewed partitions at runtime, which helps with these query optimizations, particularly with deeply nested or SQL-heavy transformations.
Interviewer: Describe how you would monitor and adjust the number
of shuffle partitions in a PySpark job.
Candidate: I’d use the Spark UI to monitor the shuffle read/write sizes of each stage and adjust spark.sql.shuffle.partitions based on those sizes and the available resources to optimize performance.
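For example, after checking the shuffle sizes of the heavy stages, the setting can be tuned per job; the value and the aggregation below are illustrative:
Example:
# default is 200; lower it for small shuffles, raise it for very large ones
spark.conf.set("spark.sql.shuffle.partitions", 64)
result = df.groupBy("key").count()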
Interviewer: You need to run several aggregations on a large DataFrame, all grouped by the same key. How would you keep the shuffle cost down?
Candidate: I’d repartition the data by the grouping key first using repartition, so rows with the same key land in the same partition; aggregations on that key can then reuse the existing partitioning instead of shuffling the full dataset again.
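A short sketch, where df, user_id, and amount are placeholder names:
Example:
from pyspark.sql import functions as F

# one shuffle up front; the aggregations below reuse the same partitioning
partitioned = df.repartition("user_id").cache()

totals = partitioned.groupBy("user_id").agg(F.sum("amount").alias("total_amount"))
counts = partitioned.groupBy("user_id").agg(F.count("*").alias("events"))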
Interviewer: An expensive UDF column is being recomputed every time the DataFrame is used in an action. How would you avoid that?
Candidate: I’d precompute the result of the UDF if it’s deterministic, add it as a column once, and persist the DataFrame so the result is not recalculated for each subsequent action.
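A rough sketch, assuming expensive_fn is the deterministic Python function in question and the column names are placeholders:
Example:
from pyspark import StorageLevel
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

expensive_udf = F.udf(expensive_fn, StringType())

# compute the UDF once, then persist so later actions reuse the cached column
enriched = df.withColumn("derived", expensive_udf(F.col("raw_value")))
enriched.persist(StorageLevel.MEMORY_AND_DISK)

enriched.count()  # first action materializes the cache
enriched.write.mode("overwrite").parquet("/path/to/output")  # reuses the cached result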