Spark Databricks

The document outlines key topics for preparing for an Apache Spark technical interview, including Spark's core components, job and task management, and performance optimization techniques. It covers essential concepts such as transformations, shuffling, the Catalyst Optimizer, and Delta Lake features. Additionally, it highlights the importance of understanding Spark's execution engine and tools for managing data access and CI/CD in Databricks.

A Winning Apache Spark Interview

Here are the top topics to prepare for a technical interview.
@FEDI_WANNES
Apache Spark Ecosystem

Spark Core API: The heart of Spark, responsible for task scheduling, memory management, fault recovery, and I/O.
Spark SQL & DataFrames: Enables SQL-like queries on structured data.
Streaming: Processes real-time data streams for applications like fraud detection or log monitoring.
MLlib: Built-in machine learning library for scalable ML algorithms.
GraphX: Library for graph computation and analytics (e.g., PageRank, shortest paths).
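
For illustration (not from the original slides), a tiny PySpark sketch touching the two most commonly tested components, DataFrames and Spark SQL; the data and view name are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ecosystem-demo").getOrCreate()

# Spark SQL & DataFrames: structured data queried with SQL-like operations.
df = spark.createDataFrame([(1, "login"), (2, "purchase"), (1, "purchase")], ["user_id", "event"])
df.createOrReplaceTempView("events")
spark.sql("SELECT event, COUNT(*) AS n FROM events GROUP BY event").show()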
What happens after spark-submit is executed?

spark-submit launches a driver program, which initializes the SparkContext. The context connects to the cluster manager (such as YARN or Kubernetes), requests resources (executors), and coordinates task scheduling across the cluster. The driver manages the execution lifecycle and collects the results.
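
As a sketch (not from the slides), a minimal driver script that spark-submit might launch; the application name, file name, and submit options are illustrative assumptions:

# Submitted with, for example: spark-submit --master yarn --deploy-mode cluster example_app.py
from pyspark.sql import SparkSession

# spark-submit runs this file on the driver, which creates the SparkSession
# (and the underlying SparkContext) and connects to the cluster manager.
spark = (
    SparkSession.builder
    .appName("example-app")   # placeholder application name
    .getOrCreate()
)

df = spark.range(1_000_000)   # distributed dataset defined on the driver
print(df.count())             # action: the driver schedules tasks on executors and collects the result

spark.stop()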

Terminology of Spark applications

Application: Based on a Spark session (Spark context). Identified by a unique ID such as <application_XXX>.

Jobs: Based on the actions called on an RDD. A job consists of one or more stages.

Stages: Based on the shuffles created for an RDD. A stage consists of one or more tasks. The shuffle is Spark's mechanism for redistributing data so that it's grouped differently across RDD partitions. Certain transformations, such as join(), require a shuffle.

Tasks: A task is the minimum unit of processing scheduled by Spark. Tasks are created for each RDD partition, and the number of tasks is the maximum number of simultaneous executions in the stage.
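
For illustration (not from the slides), a small PySpark sketch where a single action produces one job with two stages separated by a shuffle; the data and column names are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("terminology-demo").getOrCreate()

df = spark.range(0, 1_000, numPartitions=4)                   # 4 partitions -> 4 tasks in the first stage
grouped = df.groupBy((df.id % 10).alias("bucket")).count()    # groupBy introduces a shuffle

# show() is the action: it triggers one job, split into stages at the shuffle boundary,
# with one task per partition inside each stage.
grouped.show()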
What is the difference between wide and narrow transformations?

Narrow transformations (e.g., map, filter) don't require data movement: each output partition depends on a single input partition.

Wide transformations (e.g., groupByKey, join) require a shuffle, where data moves across the cluster. These are more expensive and create stage boundaries.
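
An illustrative sketch (assuming a local session and synthetic data) contrasting the two kinds of transformation with the operations named above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000), 4)

# Narrow: map and filter work partition-by-partition, with no data movement.
narrow = rdd.map(lambda x: (x % 10, x)).filter(lambda kv: kv[1] % 2 == 0)

# Wide: groupByKey must co-locate all values for a key, so it shuffles data
# across the cluster and starts a new stage.
wide = narrow.groupByKey().mapValues(len)

print(wide.take(5))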
What is a DAG in Spark?

A DAG (Directed Acyclic Graph) represents the sequence of transformations on RDDs. It’s built by the scheduler to understand the lineage of operations, breaking it into stages and tasks and determining the execution flow. A shuffle triggers a stage split in the DAG.
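
One way to inspect the lineage Spark records (an illustrative sketch, not from the slides; the data is synthetic):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

rdd = spark.sparkContext.parallelize(range(100), 4)
pairs = rdd.map(lambda x: (x % 5, x)).reduceByKey(lambda a, b: a + b)

# toDebugString (returned as bytes in PySpark) prints the RDD lineage,
# with indentation marking shuffle (stage) boundaries.
print(pairs.toDebugString().decode("utf-8"))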
Spark’s Task Scheduler

When you trigger an action (like .count()), Spark breaks down your computation into stages and tasks, using the DAG Scheduler and the Task Scheduler.

Stage division (via the DAG Scheduler):
Spark builds a DAG (Directed Acyclic Graph) of transformations.
It identifies shuffle boundaries to divide the DAG into stages.
Stage = a set of transformations that can be executed without shuffling.
Each shuffle = a boundary between stages.
Stages are executed sequentially, one after the other.

Tasks and the Task Scheduler:
Within each stage, Spark creates multiple tasks, one per data partition.
Tasks are the actual units of work sent to executors.
The Task Scheduler assigns these tasks to available executor cores.
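
A small illustrative check (local session, synthetic data) that the number of tasks in a stage tracks the number of partitions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scheduler-demo").getOrCreate()

df = spark.range(0, 10_000, numPartitions=8)

# Each partition becomes one task in the stage that scans this data.
print("partitions (and therefore tasks in the scan stage):", df.rdd.getNumPartitions())

# The action below makes the DAG Scheduler cut stages at the shuffle boundary
# and the Task Scheduler ship tasks to executor cores; inspect it in the Spark UI.
df.groupBy((df.id % 4).alias("k")).count().collect()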
What is Shuffling in Apache Spark?

Shuffling in Spark is the process of redistributing data across partitions, often across different nodes in a cluster, so that operations like joins, aggregations, or groupings can be performed correctly.

Why is shuffling expensive?
Network I/O: Data is sent across executors or nodes.
Disk spill: If memory isn't enough, data is written to disk.
Serialization: Data is serialized/deserialized for movement.
CPU cost: Sorting and hash computations increase.
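
An illustrative sketch (column names and the partition count are assumptions) of an aggregation that forces a shuffle, plus the standard setting that controls how many partitions the shuffle produces:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# Number of partitions produced by a shuffle (Spark's default is 200); value shown is illustrative.
spark.conf.set("spark.sql.shuffle.partitions", "64")

orders = spark.range(1_000_000).withColumn("customer_id", F.col("id") % 1000)
totals = orders.groupBy("customer_id").count()   # aggregation -> shuffle by customer_id

totals.show(5)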
What is coalesce vs repartition?

coalesce(n): Reduces the number of partitions without a shuffle (efficient, but can leave partitions uneven).
repartition(n): Increases or decreases the number of partitions with a full shuffle (balanced but expensive).

Use coalesce after filtering or before writing out many small files; use repartition before wide shuffles.
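
A hedged sketch of the two calls side by side (partition counts and the filter are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("coalesce-vs-repartition").getOrCreate()

df = spark.range(10_000_000, numPartitions=200)

# A heavy filter leaves many near-empty partitions.
filtered = df.filter(F.col("id") % 100 == 0)

# coalesce: merge down to 8 partitions with no shuffle (cheap, possibly uneven).
small = filtered.coalesce(8)

# repartition: full shuffle into 64 evenly sized partitions (more expensive, balanced).
balanced = df.repartition(64)

print(small.rdd.getNumPartitions(), balanced.rdd.getNumPartitions())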

What is the Catalyst Optimizer?

Catalyst is Spark’s query optimizer. It transforms unresolved logical plans into optimized physical plans using rule-based and cost-based optimization. It handles predicate pushdown, constant folding, join selection, and more, all done automatically for Spark SQL and DataFrames.
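
To see Catalyst's output, one illustrative option (synthetic data, made-up column name) is the extended explain, which prints the parsed, analyzed, optimized logical, and physical plans:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("flag", F.col("id") % 2 == 0)
query = df.filter("flag = true").select("id")

# Prints each plan stage Catalyst produces, including pushed-down filters.
query.explain(extended=True)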
What is the Tungsten Execution Engine?

Tungsten is Spark’s low-level execution engine that improves performance by managing memory and CPU more efficiently. It includes code generation, off-heap memory management, and whole-stage codegen to reduce overhead and speed up query execution.
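
One illustrative way to look at whole-stage code generation (assuming Spark 3.x, synthetic data):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tungsten-demo").getOrCreate()

df = spark.range(1_000_000).filter(F.col("id") % 3 == 0).selectExpr("id * 2 AS doubled")

# Shows the generated code that whole-stage codegen produces for this query.
df.explain(mode="codegen")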

What’s AQE (Adaptive Query Execution) and why use it?

AQE optimizes query plans at runtime based on actual data statistics. It can switch join types (e.g., from sort-merge to broadcast), coalesce shuffle partitions, and optimize skewed joins. It’s enabled via configuration, for example:
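
An illustrative sketch; these are standard Spark SQL configuration keys, and AQE is enabled by default from Spark 3.2 onward:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-demo").getOrCreate()

# Turn on Adaptive Query Execution and its main sub-features.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions in joins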
What causes a memory spill and how do you handle it?

A spill happens when Spark exceeds the memory available for shuffles or aggregations. Spark writes data to disk, degrading performance.

Fixes: Tune memory configs (spark.memory.fraction), use persist() smartly, avoid skew, and monitor with the Spark UI.
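
An illustrative sketch of those fixes (the values are assumptions, not recommendations; spark.memory.fraction is set when the application starts, not at runtime):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

# spark.memory.fraction controls how much of the heap is shared by execution and storage.
spark = (
    SparkSession.builder
    .appName("spill-tuning-demo")
    .config("spark.memory.fraction", "0.7")         # illustrative value
    .config("spark.sql.shuffle.partitions", "400")  # more partitions -> smaller shuffle blocks
    .getOrCreate()
)

df = spark.range(10_000_000)

# Persist with a disk fallback instead of recomputing an expensive lineage repeatedly.
cached = df.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()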
What is salting and when do you use it?

Salting solves data skew during joins or groupBy by adding a random prefix to skewed keys, spreading them across partitions.

Use it when:
A key is over-represented (like country = USA).
You notice one task taking far longer than others in the Spark UI.
You get executor memory issues or disk spills on joins/groupBys.
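
A hedged sketch of salting a skewed aggregation (column names, the skew pattern, and the salt range of 16 are all illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()

# Skewed data: most rows share the same key.
events = spark.range(5_000_000).withColumn(
    "country", F.when(F.col("id") % 100 < 90, F.lit("USA")).otherwise(F.lit("FR"))
)

# 1) Add a random salt so the hot key is split across many partitions.
salted = events.withColumn("salt", (F.rand() * 16).cast("int"))

# 2) Aggregate on (key, salt) first: the work for "USA" is spread over ~16 tasks.
partial = salted.groupBy("country", "salt").count()

# 3) Aggregate again on the key alone to get the final result.
result = partial.groupBy("country").agg(F.sum("count").alias("count"))
result.show()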

What is Liquid Clustering?

Liquid Clustering automatically optimizes the physical layout of Delta tables based on workload patterns. It eliminates the need for manual OPTIMIZE ZORDER by dynamically clustering on frequently accessed columns, improving query performance and reducing maintenance.
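
A hedged sketch, assuming a Databricks runtime that supports liquid clustering; the table and column names are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("liquid-clustering-demo").getOrCreate()

# Declare clustering columns at table creation instead of running OPTIMIZE ... ZORDER BY later.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events_clustered (
        event_time TIMESTAMP,
        user_id    BIGINT,
        amount     DOUBLE
    )
    CLUSTER BY (event_time, user_id)
""")

# OPTIMIZE incrementally clusters newly written data according to the declared columns.
spark.sql("OPTIMIZE events_clustered")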
What is Delta Lake and what are its key features?

Delta Lake is a storage layer on top of Parquet that adds ACID transactions, schema enforcement, time travel, and merge support.

Key features:
MERGE INTO support
VERSION AS OF time travel
Scalable metadata
File compaction
Z-order indexing
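
An illustrative sketch of MERGE INTO and time travel; the table names, version number, and path are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Upsert new records into an existing Delta table (tables "customers" and "updates" are hypothetical).
spark.sql("""
    MERGE INTO customers AS t
    USING updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel: read the table as it looked at an earlier version.
old_snapshot = spark.sql("SELECT * FROM customers VERSION AS OF 3")

# Equivalent DataFrame reader option (path is a placeholder).
old_snapshot_df = spark.read.format("delta").option("versionAsOf", 3).load("/delta/customers")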

What is the Unity Catalog in Databricks?

Unity Catalog is a centralized governance layer for managing data access, metadata, and security across all Databricks workspaces. It enables row/column-level access, data lineage, and integration with external locations (e.g., ADLS, S3), all natively within Databricks.
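
A hedged sketch of the three-level namespace and a grant, assuming a Unity Catalog-enabled workspace; the catalog, schema, table, and group names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unity-catalog-demo").getOrCreate()

# Unity Catalog objects live in a three-level namespace: catalog.schema.table.
spark.sql("USE CATALOG main")
spark.sql("CREATE SCHEMA IF NOT EXISTS analytics")

# Grant fine-grained access to a group (principal and table are placeholders).
spark.sql("GRANT SELECT ON TABLE main.analytics.orders TO `data-analysts`")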
What is an asset bundle in Databricks CI/CD?

An Asset Bundle is a YAML-based packaging format introduced with Databricks CLI v2. It lets you define jobs, notebooks, clusters, dependencies, and environments, all in one version-controlled structure.

Infrastructure as Code: Define jobs, clusters, and workflows declaratively.
CI/CD ready: Works with GitHub Actions, Azure DevOps, etc.
Multi-environment support: Switch between dev/stage/prod using separate YAML configs.
Reliable & repeatable: Ensures consistency across deployments.

Typical workflow:
databricks bundle deploy --target dev
databricks bundle run <job-name>
Ready to Land That Big Data Role?

Passing Spark & Databricks interviews is about mastering optimization, cost-efficiency, and scalable architecture.

@FEDI_WANNES
