Recap Spark
Spark Architecture
11. Can you explain the high-level architecture of Apache Spark?
12. How does Spark achieve fault tolerance?
13. What are the main differences between Spark's physical and logical plans?
14. Explain the role of executors in Spark.
15. What are broadcast variables, and how do they optimize Spark jobs?
DAG Scheduler and Rescheduling
30. What is the DAG Scheduler, and when is it invoked to reschedule work?
31. How does the DAG Scheduler recover from a failed task?
32. What is the role of speculative execution in Spark?
Optimization Tips:
• Increase the number of cores per executor to process more tasks simultaneously.
• Allocate more RAM to avoid spilling data to disk.
• Ensure your cluster has enough executors to efficiently process the data.
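The tips above map onto a few standard Spark configuration properties. Below is a minimal PySpark sketch of how they might be set when building a session; the values are purely illustrative assumptions, not recommendations for any particular cluster.

```python
from pyspark.sql import SparkSession

# Illustrative values only -- size these for your own cluster and workload.
spark = (
    SparkSession.builder
    .appName("resource-sizing-sketch")
    .config("spark.executor.instances", "4")  # enough executors to cover the data volume
    .config("spark.executor.cores", "4")      # more cores per executor -> more tasks run simultaneously
    .config("spark.executor.memory", "8g")    # more RAM per executor -> less spilling to disk
    .getOrCreate()
)
```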
Summary:
• Cluster: A team of machines working together.
• Core: The CPU unit on each machine that processes tasks.
• Executor: The worker responsible for completing tasks on the data.
• RAM: High-speed memory for quick data access.
• Disk: Slower storage used when RAM is full.
Understanding how these components work together in Spark helps you optimize your big
data jobs for faster and more efficient processing.
Master Spark Concepts Zero to Big Data Hero:
Detailed Notes on Worker Node, Executor, Task, Stages, On-Heap
Memory, Off-Heap Memory, and Garbage Collection in Apache Spark
What is JVM?
The Java Virtual Machine (JVM) is an abstraction layer that allows Java (or other JVM
languages like Scala) applications to run on any machine, regardless of hardware or
operating system. It provides:
• Memory Management: JVM manages memory allocation, including heap and stack
memory.
• Execution: JVM converts bytecode (compiled from Java/Scala code) into machine
code.
• Garbage Collection: JVM automatically handles the cleanup of unused memory
through garbage collection.
JVM's Role in Spark
Apache Spark heavily relies on JVM for executing its core components:
• Driver Program: The driver, which coordinates the execution of jobs, runs inside a
JVM instance on the driver node.
• Executors: Executors, which run tasks on worker nodes, are JVM processes that are
responsible for task execution and data storage.
The Spark driver and executors both run as separate JVM instances, each managing its own
memory and resources.
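As a small, hedged sketch of that separation, each setting below sizes a different JVM heap. Note that spark.driver.memory only takes effect if it is set before the driver JVM starts (for example via spark-submit), so placing it in code here is for illustration only; the sizes are example values.

```python
from pyspark.sql import SparkSession

# Two separate JVM heaps, sized independently (example values only):
#   spark.driver.memory   -> heap of the driver JVM (must be set at launch time,
#                            e.g. via spark-submit, to actually take effect)
#   spark.executor.memory -> heap of every executor JVM on the worker nodes
spark = (
    SparkSession.builder
    .appName("driver-vs-executor-jvm")
    .config("spark.driver.memory", "2g")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
```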
1. Worker Node
• Role: A worker node is a machine within a Spark cluster that performs the execution
of tasks and handles data storage.
• Purpose: The worker node runs executors that are responsible for running tasks on
data, managing intermediate results, and sending the final output back to the driver.
• Components:
o Executors: Each worker node can have multiple executors, each handling a
portion of the job.
o Data Storage: Worker nodes store data either in memory or on disk during
execution.
Key Points:
• Worker nodes are essentially the physical or virtual machines that process data in a
distributed manner.
• They communicate with the driver program to receive tasks and return results.
2. Executor
• Role: An executor is a JVM process that runs on worker nodes and is responsible for
executing tasks and storing data.
• Lifecycle: Executors are launched when the Spark application starts and typically run for its entire lifetime, unless they fail or the application completes (with dynamic allocation enabled, idle executors can also be released earlier).
• Task Execution: Executors run tasks in parallel and return results to the driver.
• Data Management: Executors store data in-memory (or on disk if necessary) during
task execution and shuffle data between nodes if required.
Key Points:
• Executors perform computations and store intermediate data for tasks.
• Executors handle two main responsibilities:
(1) executing the tasks sent by the driver and
(2) providing in-memory storage for data.
• Executors are removed when the application completes (or earlier, if dynamic allocation releases idle executors).
3. Task
• Role: A task is the smallest unit of work in Spark, representing a single operation (like
a map or filter) on a partition of the data.
• Assignment: Tasks are assigned to executors, and each executor can run multiple
tasks concurrently.
• Execution: Tasks are generated from stages in the Directed Acyclic Graph (DAG),
which defines the order of operations in a job.
Key Points:
• Tasks operate on a single partition of the data and are distributed across multiple
executors for parallel processing.
• Tasks are responsible for applying transformations or actions on the data partitions.
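A minimal PySpark sketch of that partition-to-task mapping, using made-up numbers: the stage produced by the map() below would run one task per partition.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tasks-per-partition").getOrCreate()

# Hypothetical dataset spread over 8 partitions.
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)
print(rdd.getNumPartitions())        # 8

# map() is a narrow transformation: it keeps the same partitioning, so the
# stage that eventually runs it consists of 8 parallel tasks, one per partition.
squared = rdd.map(lambda x: x * x)
print(squared.count())               # the action launches those tasks
```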
4. Stages
• Role: Stages represent a logical unit of execution in Spark. Each Spark job is divided
into stages by the DAG Scheduler, based on the data shuffle boundaries.
• Types: A stage groups narrow transformations (which need no reshuffling of data between partitions); wide transformations (which require a shuffle) mark the boundary where a new stage begins.
• Creation: When an action (like count() or collect()) is triggered, Spark creates stages
that represent the transformation chain.
Key Points:
• Each stage contains a set of tasks that can be executed in parallel.
• Stages are created based on the shuffle dependencies in the job.
5. On-Heap Memory
• Role: In Spark, on-heap memory refers to the memory space allocated within the
JVM heap for Spark's computations.
• Usage: Spark’s operations on RDDs or DataFrames store intermediate data in the JVM
heap space.
• Garbage Collection: On-heap memory is subject to JVM garbage collection, which
can slow down performance if frequent or large collections occur.
Key Points:
• On-heap memory is prone to the inefficiencies of JVM garbage collection.
• The default memory management in Spark is on-heap.
• Inefficient garbage collection can lead to OutOfMemoryError failures in both executors and the driver.
• Spark provides configurations (spark.executor.memory, spark.memory.offHeap.enabled) to optimize memory usage and reduce the impact of GC.
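A minimal sketch of those settings, assuming the session is being created fresh; the sizes are placeholders, not tuning advice.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-tuning-sketch")
    .config("spark.executor.memory", "4g")            # on-heap (JVM) memory per executor
    .config("spark.memory.offHeap.enabled", "true")   # opt in to off-heap memory
    .config("spark.memory.offHeap.size", "2g")        # required once off-heap is enabled
    .getOrCreate()
)
# Data placed off-heap lives outside the JVM heap, so it is not scanned
# by the garbage collector, reducing GC pressure on large executors.
```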
Summary:
• Worker Node: The physical or virtual machine that performs the task execution and
manages storage in the Spark cluster.
• Executor: The process that runs on a worker node, handling task execution and data
storage.
• Task: The smallest unit of work in Spark, operating on data partitions.
• Stages: Logical units of execution that divide a Spark job, categorized as narrow or
wide depending on shuffle dependencies.
• On-Heap Memory: JVM-managed memory for storing Spark data, subject to garbage
collection.
• Off-Heap Memory: Memory managed outside the JVM heap to avoid garbage
collection delays and memory issues.
• Garbage Collection: A JVM process to reclaim memory, but if not optimized, it can
negatively affect performance.
Master Spark Concepts Zero to Big Data Hero:
Detailed Notes on Spark Architecture and Execution Flow
Apache Spark is a powerful open-source distributed computing system that enables fast and
efficient data processing. Here’s a quick overview of its architecture to help you understand
how it works:
Conclusion
Apache Spark’s architecture is designed to handle large-scale data processing efficiently and
effectively. Understanding its components and workflow can help you leverage its full
potential for your big data projects.
1. What is a DAG?
• Definition:
A DAG represents the series of operations (transformations) Spark performs on data.
• Components:
o Nodes: Represent transformation steps (e.g., filter, map).
o Edges: Show the flow of data between transformations.
2. How DAG Works in Spark
a. Job Submission
• When a Spark job is submitted, Spark creates a DAG that maps out the sequence of
steps to execute the job.
• This DAG provides a visual and logical representation of how data is processed (see the sketch below).
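As a small illustrative sketch (the input path is hypothetical): the transformations below only describe the computation, and the resulting lineage, Spark's RDD-level view of the DAG, can be printed before any action runs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-sketch").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy: they only add nodes and edges to the DAG.
words  = sc.textFile("hdfs:///tmp/input.txt")   # hypothetical input path
pairs  = words.flatMap(lambda line: line.split()).map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Inspect the lineage (the RDD view of the DAG) before anything executes.
print(counts.toDebugString().decode("utf-8"))

# Only an action submits a job, at which point the DAG is split into stages.
print(counts.count())
```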
b. Stages
• Spark divides the DAG into multiple stages based on data shuffling requirements.
• Stage Classification:
o Narrow Transformations: Operations like map and filter that don’t require
shuffling data between partitions.
▪ Grouped within the same stage.
o Wide Transformations: Operations like reduceByKey and join that require data
shuffling across nodes.
▪ Define boundaries between stages (illustrated in the sketch below).
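A minimal sketch of the distinction, using tiny in-line data: mapValues stays within the existing partitioning, while reduceByKey forces a shuffle and therefore a new stage.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], numSlices=2)

# Narrow: each output partition depends on exactly one input partition,
# so this transformation is pipelined inside the current stage.
doubled = pairs.mapValues(lambda v: v * 2)

# Wide: records with the same key must be shuffled to the same partition,
# so this marks a stage boundary.
totals = doubled.reduceByKey(lambda a, b: a + b)

# The action triggers one job with two stages: one before the shuffle, one after.
print(totals.collect())
```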
c. Task Execution
• Each stage is further broken into tasks, which are distributed across nodes (executors)
to execute in parallel.
• Tasks ensure that the workload is balanced for efficient execution.
d. Handling Failures
• If a task or stage fails, the DAG allows Spark to:
o Identify the failed components.
o Re-execute only the affected tasks or stages, saving computation time.
3. Why is DAG Important?
a. Efficiency
• DAG enables Spark to optimize task execution by:
o Minimizing data shuffling.
o Combining transformations to reduce redundant computations.
b. Recovery
• The DAG maintains lineage of operations, allowing Spark to:
o Recompute only the necessary parts in case of a failure.
c. Speed
• By enabling parallel task execution and scheduling, DAG ensures faster data
processing.
4. Key Advantages of Spark DAG
1. Task Optimization: Ensures efficient resource usage by structuring transformations.
2. Parallelism: Breaks jobs into tasks that can run in parallel across nodes.
3. Error Handling: Facilitates partial recomputation for failures, reducing recovery time.
4. Transparency: Provides a clear structure of operations, aiding debugging and analysis.
5. Key Terms to Remember
• Transformations: Operations performed on data (e.g., map, filter, reduceByKey).
• Stages: Logical segments of a DAG determined by transformations and shuffling.
• Tasks: Units of work derived from a stage that are executed by executors.
Master Spark Concepts Zero to Big Data Hero:
How does DAG scheduler work?
The DAG Scheduler in Spark is responsible for job execution planning. It transforms a
logical execution plan into a physical execution plan by dividing the work into stages and
tasks. It ensures efficient task distribution, fault tolerance, and optimized execution.
Summary
• Job: Represents the entire execution triggered by an action.
• Stages: Their count grows with each wide transformation, since every shuffle boundary starts a new stage.
• Tasks: Execution units within a stage, one per partition.
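To tie the three terms together, here is a hedged DataFrame sketch (names and sizes are illustrative): explain() prints the physical plan the planner produced, and the Exchange operator in that plan is the shuffle that the DAG Scheduler turns into a stage boundary when the action runs.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("job-stage-task-sketch").getOrCreate()

df = spark.range(1_000)                                        # small illustrative input
agg = df.withColumn("bucket", F.col("id") % 10).groupBy("bucket").count()

# The physical plan: the Exchange operator marks the shuffle boundary.
agg.explain()

# One action -> one job; the DAG Scheduler splits it into stages at the
# Exchange, and each stage runs one task per partition (visible in the Spark UI).
agg.collect()
```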