Spark Interview Questions Answers
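RDD (Resilient Distributed Dataset) is a low-level API that gives fine-grained control over data, but Spark cannot
optimize it automatically because the functions passed to it are opaque to the engine.
DataFrame is a higher-level API that organizes data into named columns, supports SQL-like queries, and is optimized
by the Catalyst optimizer.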
Example:
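The difference can be sketched in plain Python. This is a toy model, not Spark code: the names GreaterThan and run_query are hypothetical and only illustrate why a declarative predicate can be analysed by an optimizer while an arbitrary function cannot.

```python
from dataclasses import dataclass

rows = [
    {"name": "Ann", "age": 34, "city": "Oslo"},
    {"name": "Bob", "age": 19, "city": "Rome"},
    {"name": "Cy", "age": 52, "city": "Oslo"},
]

# RDD style: the engine only sees a black-box callable and must run it as written.
def older_than_30(row):
    return row["age"] > 30

rdd_result = [row["name"] for row in rows if older_than_30(row)]

# DataFrame style: the predicate is a data structure that can be inspected.
@dataclass(frozen=True)
class GreaterThan:  # hypothetical expression node
    column: str
    value: int

    def evaluate(self, row):
        return row[self.column] > self.value

def run_query(rows, select, where):
    # Toy "optimizer": it knows which columns the query touches,
    # so it drops every other column before evaluating the filter.
    needed = {select, where.column}
    pruned = [{c: row[c] for c in needed} for row in rows]
    return [row[select] for row in pruned if where.evaluate(row)], needed

df_result, columns_read = run_query(rows, select="name", where=GreaterThan("age", 30))
print(rdd_result, df_result, sorted(columns_read))
```

Both styles return the same names, but only the second one lets the engine skip the unused city column.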
- Data Skewness
Example:
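A common remedy for skew is key salting. The sketch below is a plain-Python model, not Spark code: partition_of is a deliberately simple stand-in for a hash partitioner, and the key names and sizes are made up for illustration.

```python
from collections import Counter

def partition_of(key, num_partitions):
    # Toy, deterministic stand-in for a hash partitioner (sum of the key's bytes).
    return sum(key.encode("utf-8")) % num_partitions

def partition_sizes(keys, num_partitions):
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

NUM_PARTITIONS = 4
NUM_SALTS = 8

# 1000 records share one hot key; 100 records use distinct keys.
keys = ["hot"] * 1000 + [f"user{i}" for i in range(100)]

# Salting: append a rotating suffix so one logical key maps to several physical keys.
salted_keys = [f"{key}_{i % NUM_SALTS}" for i, key in enumerate(keys)]

plain_sizes = partition_sizes(keys, NUM_PARTITIONS)
salted_sizes = partition_sizes(salted_keys, NUM_PARTITIONS)
print("without salt:", plain_sizes)
print("with salt:   ", salted_sizes)
```

Without salting, every "hot" record lands in one partition. In a real join the other side must be replicated once per salt value so that matching rows still meet.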
4. Persist vs Cache
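cache() is shorthand for persist() with the default storage level: MEMORY_ONLY for RDDs and MEMORY_AND_DISK for
DataFrames. persist() accepts an explicit storage level, so data can be kept in memory, on disk, or both.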
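Why either call helps can be shown with a small model of lazy evaluation. This is a sketch in plain Python, not Spark code: LazyDataset and expensive_lineage are hypothetical names, and storage levels are not modelled.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset; not a Spark class."""

    def __init__(self, compute):
        self._compute = compute       # rebuilds the data from its lineage
        self._stored = None           # materialized copy, once cached
        self._cache_enabled = False
        self.compute_calls = 0        # how often the lineage was re-run

    def cache(self):
        # Only marks the dataset; nothing is computed until the first action.
        self._cache_enabled = True
        return self

    def _materialize(self):
        if self._stored is not None:
            return self._stored
        self.compute_calls += 1
        data = self._compute()
        if self._cache_enabled:
            self._stored = data
        return data

    # Two "actions" that each need the full data.
    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())

def expensive_lineage():
    return [x * x for x in range(1000)]

uncached = LazyDataset(expensive_lineage)
uncached.count()
uncached.total()

cached = LazyDataset(expensive_lineage).cache()
cached.count()
cached.total()

print(uncached.compute_calls, cached.compute_calls)
```

The uncached dataset re-runs its lineage for every action, while the cached one computes once and reuses the result.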
Parquet is a columnar storage format that supports compression and predicate pushdown, improving query
performance.
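Predicate pushdown relies on the min/max statistics Parquet keeps for each row group. The sketch below models the idea in plain Python; RowGroup and scan_greater_than are hypothetical names, and a real reader works on encoded column chunks rather than Python lists.

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    """Toy row group holding one column plus its min/max statistics."""
    values: list

    @property
    def min_value(self):
        return min(self.values)

    @property
    def max_value(self):
        return max(self.values)

def scan_greater_than(row_groups, threshold):
    matches, groups_read = [], 0
    for group in row_groups:
        if group.max_value <= threshold:
            continue  # statistics prove that no row in this group can match
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read

row_groups = [RowGroup([1, 5, 9]), RowGroup([12, 18, 20]), RowGroup([25, 31, 40])]
matches, groups_read = scan_greater_than(row_groups, threshold=19)
print(matches, groups_read)
```

Only two of the three row groups are read, because the first one is ruled out by its maximum value alone.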
Broadcast join is an efficient join strategy when one dataset is small enough to fit in memory. The smaller dataset is
copied to every executor, so the larger dataset is joined locally without being shuffled.
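The mechanics can be sketched as a hash join in plain Python. This is a model, not Spark code: broadcast_hash_join and the sample tables are hypothetical, and the small table is assumed to have unique keys. In PySpark the corresponding hint is pyspark.sql.functions.broadcast.

```python
def broadcast_hash_join(large_partitions, small_table):
    # "Broadcast": build one lookup table that every partition can use.
    lookup = dict(small_table)
    joined = []
    for partition in large_partitions:
        # Each partition joins locally; records of the large side never move.
        for key, value in partition:
            if key in lookup:  # inner join: unmatched keys are dropped
                joined.append((key, value, lookup[key]))
    return joined

# Large side, already split into two partitions.
orders = [
    [("US", 100), ("DE", 40)],
    [("US", 75), ("FR", 20)],
]
# Small side that fits in memory.
countries = [("US", "United States"), ("DE", "Germany")]

result = broadcast_hash_join(orders, countries)
print(result)
```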
7. Coalesce vs Repartition
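coalesce() reduces the number of partitions by merging existing ones, which avoids a full shuffle but can leave
partitions uneven. repartition() performs a full shuffle and can either increase or decrease the number of
partitions, producing roughly equal sizes.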
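The trade-off can be seen in a small plain-Python model. These two functions are illustrative stand-ins, not Spark's implementation, and the toy coalesce assumes the target count is not larger than the current one.

```python
def coalesce(partitions, n):
    # Merge whole partitions into n groups; no partition is ever split.
    merged = [[] for _ in range(n)]
    for index, partition in enumerate(partitions):
        merged[index % n].extend(partition)
    return merged

def repartition(partitions, n):
    # Full shuffle: every record is redistributed round-robin.
    shuffled = [[] for _ in range(n)]
    records = [record for partition in partitions for record in partition]
    for index, record in enumerate(records):
        shuffled[index % n].append(record)
    return shuffled

partitions = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]

coalesced = coalesce(partitions, 2)
rebalanced = repartition(partitions, 2)

print([len(p) for p in coalesced])   # cheap, but the sizes stay uneven
print([len(p) for p in rebalanced])  # more data movement, but balanced
```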
The driver converts the code into a DAG, schedules tasks on executors, and monitors job execution.
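One part of that work, cutting the DAG into stages at shuffle boundaries, can be sketched as follows. This is a simplified model in plain Python: the job is a linear list of operation names rather than a real graph, the WIDE set is an illustrative assumption, and the whole wide operation is placed in the new stage even though Spark runs its map side in the previous one.

```python
# Operations assumed here to require a shuffle (wide dependencies).
WIDE = {"groupByKey", "reduceByKey", "join", "repartition"}

def split_into_stages(operations):
    stages, current = [], []
    for op in operations:
        if op in WIDE and current:
            # A shuffle boundary closes the current stage.
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

job = ["textFile", "flatMap", "map", "reduceByKey", "filter", "collect"]
stages = split_into_stages(job)
for number, stage in enumerate(stages):
    print(f"stage {number}: {stage}")
```

Operations inside one stage can be pipelined into a single task per partition; a new stage starts only where data has to be exchanged between executors.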
- Predicate pushdown
- Broadcast joins
- Partition pruning
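Partition pruning, the last item above, can be sketched in plain Python. The directory names follow the Hive-style key=value layout; the paths and the prune_partitions helper are hypothetical and only model the idea of skipping directories before any file is opened.

```python
def prune_partitions(partition_dirs, column, wanted):
    """Keep only directories whose `column=value` segment equals `wanted`."""
    kept = []
    for path in partition_dirs:
        # Parse segments such as "date=2024-01-02" into a dict.
        segments = dict(s.split("=", 1) for s in path.split("/") if "=" in s)
        if segments.get(column) == wanted:
            kept.append(path)
    return kept

partition_dirs = [
    "events/date=2024-01-01/country=US",
    "events/date=2024-01-01/country=DE",
    "events/date=2024-01-02/country=US",
    "events/date=2024-01-03/country=US",
]

to_scan = prune_partitions(partition_dirs, "date", "2024-01-02")
print(to_scan)
```

A filter on the partition column narrows the scan to one directory out of four, so the other three are never read.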