
Apache Spark Interview Questions and Answers

What is Apache Spark? Explain its key features.


Apache Spark is an open-source, distributed computing system designed for fast data
processing, analytics, and machine learning tasks. Key features include in-memory
processing, real-time data streaming capabilities, support for a wide range of workloads
(batch, streaming, interactive queries, and machine learning), fault-tolerance, and
integration with a variety of data sources.
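
For illustration, a minimal local PySpark session (the app name and the `local[*]` master are placeholder choices, not part of the original answer):

```python
# Minimal local Spark session; runs batch-style work in memory on local cores.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-overview")      # placeholder app name
         .master("local[*]")             # use all local cores
         .getOrCreate())

# A tiny DataFrame processed in memory.
df = spark.createDataFrame([(1, "batch"), (2, "streaming")], ["id", "workload"])
df.filter(df.id > 1).show()

spark.stop()
```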

What are the main components of the Spark ecosystem?


The main components include:
- Spark Core: The foundation providing essential functions like task scheduling.
- Spark SQL: For structured data processing using SQL and DataFrames.
- Spark Streaming: For real-time data processing.
- MLlib: A machine learning library.
- GraphX: For graph processing and analysis.

Compare Spark with Hadoop MapReduce. What are the key differences?
Spark provides in-memory data processing which makes it much faster than Hadoop
MapReduce, which relies on disk-based processing. Spark also supports interactive queries,
streaming data, and iterative algorithms better than Hadoop's batch-processing model.

What is an RDD (Resilient Distributed Dataset) in Spark? Explain its key properties.
An RDD is a fundamental data structure in Spark representing an immutable distributed
collection of objects that can be processed in parallel. Key properties include fault tolerance
(using lineage information), in-memory computation, lazy evaluation, and partitioning.
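
A small sketch of those properties in PySpark (the partition count of 4 is arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# An immutable, partitioned collection distributed over 4 partitions.
rdd = sc.parallelize(range(10), numSlices=4)

doubled = rdd.map(lambda x: x * 2)    # lazy: builds lineage, nothing runs yet
print(rdd.getNumPartitions())         # partitioning -> 4
print(doubled.collect())              # action triggers the parallel computation
```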

What is the difference between RDD, DataFrame, and Dataset?


RDDs offer low-level operations and control but lack optimizations. DataFrames provide
higher-level abstraction with query optimizations using Spark SQL's Catalyst Optimizer.
Datasets, introduced in Spark 1.6, provide type-safety, combine features of RDDs and
DataFrames, and support object-oriented programming.
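
A sketch contrasting the two APIs available from PySpark (the typed Dataset API exists only in Scala/Java, so it is only noted in a comment):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("abstractions").getOrCreate()

# RDD: low-level control, no Catalyst optimization.
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])
adults_rdd = rdd.filter(lambda row: row[1] > 30)

# DataFrame: named columns, queries optimized by Catalyst.
df = spark.createDataFrame(rdd, ["name", "age"])
adults_df = df.where("age > 30")

# Dataset: in Scala/Java a DataFrame is Dataset[Row]; typed Datasets add
# compile-time type safety on top of the same optimizations (not exposed in Python).
print(adults_rdd.collect())
adults_df.show()
```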

Describe the Spark execution model. How does it process data in parallel?
Spark follows a driver-executor (master-worker) architecture: the driver builds the job as a DAG of stages and schedules tasks, and executors on worker nodes run those tasks in parallel, one task per data partition. This distributed approach allows data to be split and processed concurrently.
What is a Spark Driver? Explain its role in a Spark application.
The Spark Driver is responsible for orchestrating the execution of a Spark application. It translates user code into tasks, distributes tasks among executors, and monitors their progress. Depending on the deploy mode, it runs either on the client machine (client mode) or on a node inside the cluster (cluster mode).

What is a Spark Executor? How does it work?


Spark Executors are distributed agents responsible for executing tasks and storing data on
worker nodes. Each application gets its own set of executors, which run tasks and keep data
in memory/disk for the duration of the application.

Explain the concept of lineage in Spark. Why is it important?


Lineage tracks the transformations applied to RDDs to reconstruct lost data in case of
failures. It is key to fault-tolerance, as Spark can recompute lost partitions using the lineage
graph.
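
The lineage graph can be inspected directly; a quick sketch (in PySpark, `toDebugString()` returns bytes, hence the decode):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

base = sc.parallelize(range(100), 4)
derived = base.map(lambda x: (x % 10, x)).reduceByKey(lambda a, b: a + b)

# Prints the chain of transformations Spark would replay to rebuild lost partitions.
print(derived.toDebugString().decode("utf-8"))
```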

What is a DAG (Directed Acyclic Graph) in Spark? How does it help optimize
operations?
A DAG represents a sequence of computations as a graph of stages and tasks in Spark. By
breaking tasks into stages, Spark optimizes execution through stage-level scheduling,
avoiding unnecessary data shuffling and recomputation.

What are transformations and actions in Spark? Give examples.


Transformations create new RDDs from existing ones, e.g., `map()` and `filter()`. They are
lazy, meaning Spark only evaluates them when an action is performed. Actions trigger
computations, e.g., `collect()` and `reduce()`.
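
A short sketch showing the laziness: no job runs until the actions at the end.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])
evens = nums.filter(lambda x: x % 2 == 0)    # transformation: nothing executes
squares = evens.map(lambda x: x * x)         # transformation: still nothing

print(squares.collect())                     # action -> [4, 16]
print(squares.reduce(lambda a, b: a + b))    # action -> 20
```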

How does Spark handle failures?


Spark handles failures by recomputing lost or failed tasks using lineage information,
rerunning only the affected operations, thereby maintaining fault-tolerance.

Explain the difference between a `map()` and a `flatMap()` transformation in Spark.
`map()` transforms each element of an RDD, resulting in one output element per input
element. `flatMap()` can produce zero, one, or multiple output elements for each input
element, flattening the results into a single collection.
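
For example, splitting lines into words:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("map-vs-flatmap").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["hello world", "spark"])

print(lines.map(lambda s: s.split(" ")).collect())
# [['hello', 'world'], ['spark']]  -> exactly one output (a list) per input line

print(lines.flatMap(lambda s: s.split(" ")).collect())
# ['hello', 'world', 'spark']      -> outputs flattened into individual words
```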

How can you achieve partitioning in Spark? Why is it important?


Partitioning allows data to be distributed across nodes. Custom partitioners can be applied
to achieve better performance, especially during shuffling. Proper partitioning reduces data
transfer across nodes and speeds up transformations like joins.
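
A small sketch of explicit repartitioning (the column name and the partition count of 8 are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("partitioning-demo").getOrCreate()

df = spark.range(1_000_000).withColumnRenamed("id", "user_id")

# Hash-partition by the join key so matching rows land in the same partition,
# reducing the data shuffled by a later join or aggregation on user_id.
repartitioned = df.repartition(8, "user_id")
print(repartitioned.rdd.getNumPartitions())   # 8
```
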
What is the Catalyst Optimizer in Spark?
The Catalyst Optimizer is a query optimization framework in Spark SQL that analyzes,
optimizes, and transforms logical query plans to improve performance. It supports rule-
based and cost-based optimization.
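
The optimized plans can be inspected with `explain()`; a quick sketch:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("catalyst-demo").getOrCreate()

df = spark.range(100).withColumn("even", (F.col("id") % 2) == 0)
query = df.select("id").where(F.col("id") > 50)

# Prints the parsed, analyzed, Catalyst-optimized, and physical plans.
query.explain(extended=True)
```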

What is Tungsten in Spark?


Tungsten is an execution optimization initiative in Spark that focuses on off-heap binary memory management and runtime code generation (whole-stage codegen), leading to better CPU and memory usage, faster execution, and reduced garbage collection.

How does Spark integrate with Hive?


Spark integrates with Hive by supporting Hive metastore, allowing Spark SQL queries to
access Hive tables and metadata. Users can run SQL queries on Hive tables using Spark SQL
without requiring a separate Hive installation.
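
A hedged sketch of enabling Hive support (the table name `sales` and an already-configured metastore are assumptions about the environment):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()    # connect to the Hive metastore
         .getOrCreate())

# Query an existing Hive table through the shared metastore.
spark.sql("SELECT COUNT(*) FROM sales").show()
```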

What is Spark Streaming, and how does it work?


Spark Streaming is a component of Apache Spark for processing real-time data streams. It ingests data from sources like Kafka or TCP sockets, divides the stream into small micro-batches, and processes each batch with Spark's batch engine.

Explain the concept of DStreams in Spark Streaming.


DStreams (Discretized Streams) represent continuous streams of data divided into smaller
batches. Each batch is treated as an RDD, allowing transformations and actions on
streaming data using Spark's APIs.
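
A sketch of a DStream word count over a socket source (the `localhost:9999` source, e.g. fed by `nc -lk 9999`, and the 5-second batch interval are assumptions):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-demo")
ssc = StreamingContext(sc, batchDuration=5)      # each 5-second batch becomes an RDD

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda l: l.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```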

What is Structured Streaming in Spark? How does it differ from Spark Streaming?
Structured Streaming is a newer streaming API built on top of the Spark SQL engine. It models a live stream as an unbounded table that is queried with the same DataFrame/Dataset operations as batch data, and it provides end-to-end exactly-once guarantees and event-time processing, making it more declarative and fault-tolerant than the older DStream-based Spark Streaming.
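
The same word count expressed in Structured Streaming, where the stream is just an unbounded DataFrame (host and port are again assumptions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("structured-demo").getOrCreate()

lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")   # emit the full updated counts on each trigger
         .format("console")
         .start())
query.awaitTermination()
```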

What is MLlib in Spark? What types of algorithms does it support?


MLlib is a distributed machine learning library in Spark supporting a variety of algorithms
such as classification, regression, clustering, collaborative filtering, and dimensionality
reduction.
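
A minimal sketch using the DataFrame-based `spark.ml` API (the tiny dataset and feature values are made up purely for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.master("local[*]").appName("mllib-demo").getOrCreate()

data = spark.createDataFrame(
    [(0.0, 1.0, 0.1), (1.0, 5.0, 2.3), (0.0, 0.5, 0.4), (1.0, 4.2, 1.9)],
    ["label", "f1", "f2"],
)
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(data)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
model.transform(features).select("label", "prediction").show()
```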

How do you handle skewed data in Spark?


Handling skewed data involves techniques like repartitioning to achieve a more even distribution of data across nodes, salting hot keys so their records spread over multiple partitions, and reducing expensive shuffle operations.
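
A hedged sketch of key salting for a skewed aggregation (`SALT_BUCKETS`, the column names, and the toy data are assumptions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("skew-demo").getOrCreate()
SALT_BUCKETS = 8

skewed = spark.createDataFrame(
    [("hot_key", i) for i in range(1000)] + [("rare_key", 1)], ["key", "value"]
)

# Add a random salt, then aggregate in two steps (per key+salt, then per key)
# so no single task has to process every "hot_key" record.
salted = skewed.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
partial = salted.groupBy("key", "salt").agg(F.sum("value").alias("partial_sum"))
result = partial.groupBy("key").agg(F.sum("partial_sum").alias("total"))
result.show()
```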
