Before Spark Interview
Task:
1. You have been given a task to develop a new reporting pipeline that
consumes data from various sources, including dimension and fact tables.
2. This pipeline has to run outside your SLA hours, so it must be
scheduled separately (one possible scheduling sketch follows this list).
3. After writing the logic, it's time to test the pipeline end-to-end and
validate that data flows through it correctly.
4. You have to define the cluster configuration for this new pipeline. How
do you go about it?
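
The scheduling requirement in point 2 can be met in several ways; below is a minimal sketch using Airflow with a cron schedule, purely as an illustration. Airflow itself, the 02:00 run time, the DAG name, and the script path are assumptions, not part of the original task.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG: runs the reporting pipeline daily at 02:00,
# assumed to fall after SLA hours, on its own separate schedule.
with DAG(
    dag_id="reporting_pipeline",            # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",          # 02:00 daily; adjust to your SLA window
    catchup=False,
) as dag:
    run_pipeline = BashOperator(
        task_id="spark_submit_reporting_job",
        bash_command=(
            "spark-submit --deploy-mode cluster "
            "/opt/jobs/reporting_pipeline.py"   # hypothetical script path
        ),
    )

Any other scheduler (cron, Oozie, Databricks jobs) would work the same way; the key point is that the job gets its own schedule outside the SLA window.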
Cluster configuration:
Steps to answer:
Set the executor and driver memory size appropriately to ensure sufficient memory
for data processing and shuffle operations.
Allocate an optimal number of executor cores considering the CPU resources
required for parallel processing.
Set the memory overhead per executor to accommodate JVM overheads, off-heap
storage, and other system-related memory requirements.
(Calculate these parameters below; a worked example under assumed hardware follows the list.)
spark.executor.memory:
spark.executor.cores:
spark.executor.memoryOverhead:
spark.driver.memory:
spark.driver.memoryOverhead:
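
A worked example, assuming hypothetical 16-core / 64 GB worker nodes (these numbers are not from the original task): reserve 1 core and 1 GB per node for the OS and daemons, leaving 15 cores and 63 GB; with 5 cores per executor that gives 3 executors per node at roughly 21 GB each, split as about 19 GB of heap plus 2 GB of overhead (roughly 10% of executor memory). A minimal PySpark sketch of those settings:

from pyspark.sql import SparkSession

# Sizing sketch for hypothetical 16-core / 64 GB worker nodes:
# 15 usable cores / 5 cores per executor = 3 executors per node,
# 63 GB / 3 = ~21 GB per executor, split ~19 GB heap + ~2 GB overhead.
spark = (
    SparkSession.builder
    .appName("reporting_pipeline")               # hypothetical app name
    .config("spark.executor.memory", "19g")
    .config("spark.executor.cores", "5")
    .config("spark.executor.memoryOverhead", "2g")
    .config("spark.driver.memory", "8g")
    .config("spark.driver.memoryOverhead", "1g")
    .getOrCreate()
)

The same values could equally be passed as --conf flags to spark-submit; the point is the sizing arithmetic, not the exact numbers.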
Broadcast Join Threshold:
Tune the broadcast join threshold, which sets the maximum table size Spark will
automatically broadcast during join operations.
Use broadcast joins for smaller tables to reduce shuffle data and minimize network
traffic.
spark.sql.autoBroadcastJoinThreshold: 10MB
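
A minimal sketch of setting the threshold and adding an explicit broadcast hint; the DataFrame, table, and column names (sales_fact, product_dim, product_id) are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast_join_example").getOrCreate()

# Tables below this size (10 MB here) are broadcast automatically in joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

# Hypothetical fact and dimension tables; names are illustrative only.
fact_df = spark.table("sales_fact")
dim_df = spark.table("product_dim")

# Explicit broadcast hint on the small dimension table: the large fact table
# is not shuffled across the network for this join.
result_df = fact_df.join(broadcast(dim_df), on="product_id", how="inner")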
What are the checks done after data ingestion?
1. EPIC (Initiative)
2. SLA to be met
3. Business value
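
Beyond the project-level checks above, a technical data-quality pass is commonly run after ingestion as well (this is an assumption, not stated in the original): row counts, null checks on keys, and duplicate checks. A minimal PySpark sketch with hypothetical table and column names:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("post_ingestion_checks").getOrCreate()

# Hypothetical target table and key column; names are illustrative only.
target_df = spark.table("reporting.daily_sales")

# 1. Row count: the table should not be empty after the load.
row_count = target_df.count()
assert row_count > 0, "No rows ingested"

# 2. Null check on the join/business key.
null_keys = target_df.filter(F.col("order_id").isNull()).count()
assert null_keys == 0, f"{null_keys} rows have a null order_id"

# 3. Duplicate check on the primary key.
duplicates = (
    target_df.groupBy("order_id")
    .count()
    .filter(F.col("count") > 1)
    .count()
)
assert duplicates == 0, f"{duplicates} duplicate order_id values found"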