Apache Spark & Databricks Interview
@FEDI_WANNES
Apache Spark Ecosystem
Spark Core API: The heart of Spark, responsible for task scheduling, memory management, fault recovery, and I/O.
Spark SQL & DataFrames: Enables SQL-like queries on structured data.
Spark Streaming: Processes real-time data streams for applications like fraud detection or log monitoring.
MLlib: Built-in machine learning library for scalable ML algorithms.
GraphX: Library for graph computation and analytics (e.g., PageRank, shortest paths).
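As a quick illustration of the Spark SQL & DataFrames layer, here is a minimal PySpark sketch (the data and names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ecosystem-demo").getOrCreate()

# Build a small DataFrame and query it with SQL
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()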
What happens after spark-submit is executed?
spark-submit launches a driver program,
which initializes the SparkContext.
The context connects to the cluster manager
(like YARN or Kubernetes), requests resources
(executors), and coordinates task scheduling
across the cluster.
The driver manages the execution lifecycle
and collects the results.
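A typical invocation looks like this (the master, resource sizes, and application file are placeholders):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  my_app.py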
Terminology of Spark Applications
Application: A user program built on a Spark session (Spark context), identified by a unique ID such as <application_XXX>.
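The ID can be read programmatically; a minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("terminology-demo").getOrCreate()
# Unique ID assigned by the cluster manager, e.g. application_1700000000000_0001
print(spark.sparkContext.applicationId)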
What is the Catalyst Optimizer?
Catalyst is Spark's query optimizer. It transforms unresolved logical plans into optimized physical plans using rule-based and cost-based optimization. It handles predicate pushdown, constant folding, join selection, and more, all done automatically for Spark SQL and DataFrames.
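You can watch Catalyst at work with explain(); a minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()
df = spark.range(100).filter("id > 10").selectExpr("id * 2 AS doubled")
# Prints the parsed and analyzed logical plans, the Catalyst-optimized
# logical plan, and the final physical plan
df.explain(True)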
What is the Tungsten Execution Engine?
Tungsten is Spark's low-level execution engine that improves performance by managing memory and CPU more efficiently. It includes code generation, off-heap memory management, and whole-stage codegen to reduce overhead and speed up query execution.
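Whole-stage codegen is visible in physical plans, where fused operators carry a * prefix; a minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tungsten-demo").getOrCreate()
print(spark.conf.get("spark.sql.codegen.wholeStage"))  # 'true' by default
# Operators compiled into one generated function appear as *(1), *(2), ...
spark.range(1000).selectExpr("sum(id)").explain()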
What's AQE (Adaptive Query Execution) and why use it?
AQE optimizes query plans at runtime based on actual data statistics. It can switch join types (e.g., from sort-merge to broadcast), coalesce shuffle partitions, and optimize skewed joins. It's enabled via:
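spark.conf.set("spark.sql.adaptive.enabled", "true")
# Related settings (on by default in recent Spark versions):
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")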
What causes a memory spill and how do you handle it?
A spill happens when Spark exceeds the memory available for shuffles or aggregations. Spark writes data to disk, degrading performance.
Common fixes include giving executors more memory, tuning shuffle partitions, and salting skewed keys (see the sketch below). Reach for them when:
A key is over-represented (like country = USA).
You notice one task taking far longer than others in the Spark UI.
You get executor memory issues or disk spills on joins/groupBys.
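A minimal salting sketch for a skewed join (the tables, column names, and bucket count are made up for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()
N = 8  # number of salt buckets (tuning assumption)

# Hypothetical skewed fact table and small dimension table
fact = spark.createDataFrame([("USA", i) for i in range(1000)] + [("FR", 1)], ["country", "amount"])
dim = spark.createDataFrame([("USA", "United States"), ("FR", "France")], ["country", "name"])

# Spread the hot key over N buckets; replicate the dimension to match
fact_salted = fact.withColumn("salt", (F.rand() * N).cast("long"))
dim_salted = dim.crossJoin(spark.range(N).withColumnRenamed("id", "salt"))

joined = fact_salted.join(dim_salted, ["country", "salt"]).drop("salt")
joined.groupBy("country").count().show()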
What is Liquid Clustering?
Liquid Clustering automatically optimizes the physical layout of Delta tables based on workload patterns. It eliminates the need for manual OPTIMIZE ZORDER BY runs, dynamically clustering on frequently accessed columns to improve query performance and reduce maintenance.
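In Databricks SQL, clustering keys are declared on the table; a minimal sketch on a Delta-enabled runtime (table and column names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("liquid-demo").getOrCreate()
spark.sql("""
    CREATE TABLE sales (order_id BIGINT, customer_id BIGINT, order_date DATE)
    CLUSTER BY (customer_id, order_date)
""")
# Clustering keys can be changed later without rewriting the table
spark.sql("ALTER TABLE sales CLUSTER BY (order_date)")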
What is Delta Lake and what are its key features?
Delta Lake is a storage layer on top of Parquet that adds ACID transactions, schema enforcement, time travel, and merge support.
Key features:
MERGE INTO support
VERSION AS OF time travel
Scalable metadata
File compaction
Z-order indexing
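A minimal sketch of two of these features, assuming Delta tables named target and updates already exist:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()
# Upsert: update matching rows, insert new ones
spark.sql("""
    MERGE INTO target t
    USING updates u
    ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
# Time travel: read the table as it was at version 3
spark.sql("SELECT * FROM target VERSION AS OF 3").show()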
What is the Unity Catalog in Databricks?
Unity Catalog is a centralized governance layer for managing data access, metadata, and security across all Databricks workspaces. It enables row/column-level access, data lineage, and integration with external locations (e.g., ADLS, S3), all natively within Databricks.
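Access is managed with SQL grants on the three-level namespace (catalog.schema.table); the names below are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uc-demo").getOrCreate()
# Grant a group read access to one table
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")
spark.sql("SELECT * FROM main.sales.orders").show()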
What is an asset bundle in Databricks CI/CD?
An Asset Bundle is a YAML-based packaging format introduced with Databricks CLI v2. It lets you define jobs, notebooks, clusters, dependencies, and environments, all in one version-controlled structure.
Typical workflow:
databricks bundle deploy --target dev
databricks bundle run <job-name>
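A minimal databricks.yml sketch (the bundle, job, and path names are placeholders):

bundle:
  name: my_bundle

targets:
  dev:
    workspace:
      host: https://<your-workspace-url>

resources:
  jobs:
    my_job:
      name: my-job
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./notebooks/main.py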
Ready to Land That Big Data Role?
Good luck passing your Spark & Databricks interviews!
@FEDI_WANNES