Spark
Data Engineering
Interview Scenario
Interviewer:
Spark UI Analysis:
Review task execution times in the "Stages"
tab; tasks on skewed partitions will run
significantly longer.
Check partition sizes in shuffle read/write
metrics—skewed partitions will show
disproportionately larger sizes.
Profiling Keys:
Count records per key, for example with
df.groupBy("key").count().orderBy(desc("count")), to identify
keys with an unusually large number of records (see the sketch
below).
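A minimal profiling sketch in PySpark, assuming a SparkSession named spark, an input path /data/events, and a candidate skew column user_id (all hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-profile").getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical input

# Count records per key and inspect the heaviest keys.
key_counts = df.groupBy("user_id").count().orderBy(F.desc("count"))
key_counts.show(20, truncate=False)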
Question 3: What are the advantages
of using repartition() versus
coalesce()?
Candidate:
1. repartition():
Performs a full shuffle, redistributing data evenly across all
partitions.
Suitable for increasing the number of partitions or rebalancing
skewed ones.
2. coalesce():
Merges existing partitions without a full shuffle, so it is faster
but can leave data less evenly distributed.
Suitable for reducing the number of partitions (a short sketch
follows below).
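A short sketch of the two calls; the partition counts and column name are illustrative only:

# repartition(): full shuffle, spreads data evenly; can raise the partition count.
evenly_spread = df.repartition(200)
by_key = df.repartition(200, "user_id")  # shuffle by a column as well

# coalesce(): merges existing partitions without a full shuffle;
# typically used to reduce the partition count, e.g. before writing output.
fewer_files = df.coalesce(20)

print(evenly_spread.rdd.getNumPartitions(), fewer_files.rdd.getNumPartitions())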
Question 4: If repartitioning doesn’t
resolve the issue, what alternative
strategies would you consider?
Candidate:
Data Salting:
Add random prefixes to skewed keys to
split them across multiple partitions.
Custom Partitioners:
Write a partitioner that redistributes data
intelligently based on key distribution.
Pre-Aggregation:
Reduce data size by aggregating records
before operations like joins or shuffles.
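A hedged sketch of data salting for a skewed join, assuming a large DataFrame events skewed on user_id and a smaller DataFrame users (names and bucket count are assumptions):

from pyspark.sql import functions as F

SALT_BUCKETS = 10  # tuning knob; choose based on the observed skew

# Add a random salt on the skewed side so one hot key becomes many sub-keys.
events_salted = events.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Replicate the other side across all salt values so every sub-key still matches.
users_salted = users.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

joined = events_salted.join(users_salted, on=["user_id", "salt"], how="inner").drop("salt")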
Question 5: What are the
implications of too many partitions
in a Spark job?
Candidate:
Task Overhead:
Excessive partitions lead to an increased
number of tasks, causing task scheduling
overhead.
Network Costs:
More partitions increase shuffle
operations, leading to higher network I/O.
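As a rough illustration, the shuffle partition count can be tuned so tasks are neither tiny nor huge; the numbers and path below are assumptions, not recommendations:

# Default is 200 shuffle partitions; small jobs waste time scheduling
# near-empty tasks, while very large jobs may need more partitions instead.
spark.conf.set("spark.sql.shuffle.partitions", "64")

# Reducing output partitions before a write also avoids many tiny files.
df.coalesce(16).write.mode("overwrite").parquet("/tmp/output")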
Question 6: How does Adaptive
Query Execution (AQE) mitigate
partition imbalance?
Candidate:
Runtime Rebalancing:
Using shuffle statistics gathered at runtime, AQE coalesces small
post-shuffle partitions and splits oversized (skewed) ones, so
partition sizes even out without manual tuning.
Fine-Tuning:
Enabled with spark.sql.adaptive.enabled=true.
Configure parameters like
spark.sql.adaptive.shuffle.targetPostShuffleInputSize (superseded
by spark.sql.adaptive.advisoryPartitionSizeInBytes in Spark 3.x)
to control target partition sizes.
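A minimal configuration sketch using Spark 3.x property names (the 64m target size is an arbitrary example):

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")   # merge tiny shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")             # split oversized join partitions
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")  # target post-shuffle partition size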
Question 7: When would you prefer
salting over repartitioning?
Candidate:
Hot Keys:
Prefer salting when skew is driven by a few very frequent keys,
because repartitioning alone cannot split rows that share the
same key across partitions.
Custom Partitioning:
A related option is a custom partitioner, which allows control
over how data is distributed, reducing shuffle data size and
improving performance (a sketch follows below).
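A sketch of a custom partitioner on the RDD API, assuming the hot keys are known in advance (key names and partition count are hypothetical):

HOT_KEYS = sorted({"user_42", "user_99"})  # keys known to dominate the data
NUM_PARTITIONS = 32

def skew_aware_partitioner(key):
    # Give each hot key its own partition; hash the rest across the remainder.
    if key in HOT_KEYS:
        return HOT_KEYS.index(key)
    return len(HOT_KEYS) + (hash(key) % (NUM_PARTITIONS - len(HOT_KEYS)))

pair_rdd = events.rdd.map(lambda row: (row["user_id"], row))
repartitioned = pair_rdd.partitionBy(NUM_PARTITIONS, skew_aware_partitioner)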
Question 9: What tools or metrics
would you monitor to verify your
fixes?
Candidate:
1. Spark UI:
Check task execution times and shuffle
read/write sizes to confirm balanced
partition sizes.
2. Logs:
Look for reduced shuffle spill warnings
and fewer memory-related errors.
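One quick programmatic check, assuming fixed_df is the DataFrame after the fix:

# Records per partition; a roughly flat distribution confirms the skew is gone.
sizes = fixed_df.rdd.glom().map(len).collect()
print("partitions:", len(sizes), "min:", min(sizes), "max:", max(sizes))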
Non-Columnar Formats:
Row-based formats like CSV and JSON carry no column statistics
and cannot be read column-by-column, leading to less efficient
I/O operations.
Partition Pruning:
Using partition-aware formats allows
Spark to skip unnecessary partitions,
optimizing performance.
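A small partition-pruning sketch, reusing the hypothetical df, spark, and column names from the earlier sketches:

from pyspark.sql import functions as F

# Write the data partitioned by a low-cardinality column.
df.write.partitionBy("event_date").parquet("/data/events_partitioned")

# A filter on the partition column lets Spark read only the matching
# directories and skip the rest (partition pruning).
pruned = spark.read.parquet("/data/events_partitioned") \
    .filter(F.col("event_date") == "2024-01-01")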
Question 11: Tell me more about
Parquet and ORC. How do they
differ, and how are they beneficial in
a Spark workflow?
Candidate:
Parquet
1. Features:
Columnar Storage: Data is stored in
columns, enabling efficient compression
and query performance for analytics
workloads.
Schema Evolution: Supports schema
evolution (adding/removing columns) with
backward and forward compatibility.
Compression: Default compression uses
Snappy, but other algorithms (e.g., Gzip,
Brotli) are also supported.
Splittable: Parquet files can be split for
parallel processing, making it ideal for
distributed systems.
2. Advantages in Spark:
Supports partition pruning for optimized
reads.
Works seamlessly with Spark’s DataFrame
API.
Efficient for aggregation queries, as
columnar storage minimizes unnecessary
reads.
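A brief sketch of Parquet schema evolution via schema merging, with hypothetical DataFrames df_v1 and df_v2 whose schemas differ by one column (paths are assumptions):

df_v1.write.parquet("/data/tbl/version=1")
df_v2.write.parquet("/data/tbl/version=2")  # e.g. contains one extra column

# mergeSchema reconciles the two file schemas into a single, wider schema.
merged = spark.read.option("mergeSchema", "true").parquet("/data/tbl")
merged.printSchema()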
ORC (Optimized Row Columnar)
1. Features:
Columnar Storage: Similar to Parquet, but
optimized for Hadoop-based ecosystems
like Hive.
Advanced Indexing: Includes built-in
indexes like min/max and bloom filters
for faster data scans.
Compression: Default compression uses
Zlib, which often results in smaller file
sizes than Parquet.
Splittable: Supports split processing for
distributed systems.
2. Advantages in Spark:
Great for heavy read-intensive workloads
with its indexing features.
Efficient for operations involving filtering
or range queries due to its min/max
index.
Supports ACID operations in Hive
environments.
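A short ORC read/write sketch; the compression codec is set explicitly for clarity, and the path and filter column are assumptions:

# Write ORC with zlib compression (ORC's usual default codec).
df.write.option("compression", "zlib").orc("/data/events_orc")

# Filters are pushed down to the ORC reader, so min/max indexes and
# bloom filters (where present) let it skip stripes that cannot match.
recent = spark.read.orc("/data/events_orc").filter("event_date >= '2024-01-01'")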
FOR CAREER GUIDANCE,
CHECK OUT OUR PAGE
www.nityacloudtech.com