
Apache Spark Interview Questions and Answers

1. Difference between RDDs & DataFrames

RDD (Resilient Distributed Dataset) is a low-level API that gives fine-grained control over the data but receives no automatic query optimization.

DataFrame is a higher-level API that supports SQL-like queries and optimizations through the Catalyst optimizer.

Example:

# Build an RDD of tuples, then convert it to a DataFrame with named columns
rdd = spark.sparkContext.parallelize([('Alice', 23), ('Bob', 30)])
df = spark.createDataFrame(rdd, ['Name', 'Age'])

2. What are the challenges you face in Spark?

- Data Skewness

- Memory Management (OOM issues)

- Small file handling

- Slow shuffle operations

3. Difference between reduceByKey & groupByKey

reduceByKey combines values for each key on the map side before the shuffle, so less data moves across the network.

groupByKey shuffles every value to the reducers before grouping, which can cause out-of-memory errors on large or skewed keys.

Example:

rdd = spark.sparkContext.parallelize([('a', 1), ('b', 1), ('a', 2)])

rdd.reduceByKey(lambda x, y: x + y).collect() # [('a', 3), ('b', 1)]
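
For contrast, the same aggregation with groupByKey ships every individual value across the shuffle before summing:

rdd.groupByKey().mapValues(sum).collect()  # [('a', 3), ('b', 1)]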

4. Persist vs Cache

cache() stores an RDD in memory only (it is shorthand for persist(StorageLevel.MEMORY_ONLY)), while persist() lets you choose the storage level: memory, disk, or both.
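
Example (a minimal sketch, assuming an active SparkSession named spark):

from pyspark import StorageLevel

rdd = spark.sparkContext.parallelize(range(100))
rdd.cache()                                # same as persist(StorageLevel.MEMORY_ONLY)
rdd.unpersist()                            # drop the old level before re-persisting
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk when memory is full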

5. Advantage of Parquet File



Parquet is a columnar storage format that supports compression and predicate pushdown, improving query performance.
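
Example (a minimal sketch; the path is illustrative, and df is the DataFrame from question 1):

df.write.mode('overwrite').parquet('/tmp/people.parquet')
# Only the Name column is read from disk, and the Age filter is pushed down to the scan
spark.read.parquet('/tmp/people.parquet').filter('Age > 25').select('Name').show()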

6. What is a Broadcast Join?

Broadcast join is an efficient join strategy when one dataset is small enough to fit in memory. The smaller dataset is broadcast to all worker nodes, so the larger dataset is never shuffled.
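
Example (a sketch using the broadcast hint; large_df, small_df, and the join key 'id' are illustrative names):

from pyspark.sql.functions import broadcast

# Hint Spark to ship small_df to every executor instead of shuffling large_df
result = large_df.join(broadcast(small_df), on='id', how='inner')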

7. Coalesce vs Repartition

coalesce() reduces the number of partitions without a full shuffle by merging existing partitions, while repartition() performs a full shuffle and can either increase or decrease the partition count.
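
Example (a minimal sketch, assuming df is an existing DataFrame):

df10 = df.repartition(10)          # full shuffle; count can go up or down
df2 = df10.coalesce(2)             # merges partitions locally; decrease only
print(df2.rdd.getNumPartitions())  # 2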

8. Role of Driver in Spark Architecture

The driver converts the code into a DAG, schedules tasks on executors, and monitors job execution.

9. What is Data Skewness and How to Handle It?

Data skewness occurs when data is unevenly distributed across partitions.

Solutions: salting the skewed key (a sketch follows below) and increasing the number of shuffle partitions.
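
Example (a minimal salting sketch; the key names and the salt range of 10 are illustrative assumptions):

import random

rdd = spark.sparkContext.parallelize([('hot', 1)] * 1000 + [('cold', 1)])

# Step 1: spread the hot key across 10 salted keys and pre-aggregate
salted = rdd.map(lambda kv: ((kv[0], random.randint(0, 9)), kv[1])) \
            .reduceByKey(lambda x, y: x + y)

# Step 2: strip the salt and combine the partial sums per original key
totals = salted.map(lambda kv: (kv[0][0], kv[1])).reduceByKey(lambda x, y: x + y)
totals.collect()  # e.g. [('hot', 1000), ('cold', 1)]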

10. Spark Optimization Techniques

- Predicate pushdown

- Caching intermediate results

- Broadcast joins

- Partition pruning
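
Example (a sketch combining a few of these; the path and the country/age columns are illustrative assumptions):

# Writing partitioned by country lets later filters skip whole directories
df.write.partitionBy('country').parquet('/tmp/users')

users = spark.read.parquet('/tmp/users')
# country = 'US' prunes partitions; age > 21 is pushed down to the Parquet scan
active = users.filter("country = 'US' AND age > 21")
active.cache()  # cache the intermediate result if it is reused downstream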
