Recap Spark
Spark Architecture
11. Can you explain the high-level architecture of Apache Spark?
12. How does Spark achieve fault tolerance?
13. What are the main differences between Spark's physical and logical plans?
14. Explain the role of executors in Spark.
15. What are broadcast variables, and how do they optimize Spark jobs?
DAG Scheduler and Rescheduling
30. What is the DAG Scheduler, and when is it invoked to reschedule work?
31. How does the DAG Scheduler recover from a failed task?
32. What is the role of speculative execution in Spark?
Optimization Tips:
• Increase the number of cores per executor to process more tasks simultaneously.
• Allocate more RAM to avoid spilling data to disk.
• Ensure your cluster has enough executors to efficiently process the data.
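The tips above map onto a few standard Spark configuration properties. Below is a minimal PySpark sketch of how they might be set when building a session; the values are purely illustrative assumptions, not recommendations for any particular cluster.

```python
from pyspark.sql import SparkSession

# Illustrative values only -- size these for your own cluster and workload.
spark = (
    SparkSession.builder
    .appName("resource-sizing-sketch")
    .config("spark.executor.instances", "4")  # enough executors to cover the data volume
    .config("spark.executor.cores", "4")      # more cores per executor -> more tasks run simultaneously
    .config("spark.executor.memory", "8g")    # more RAM per executor -> less spilling to disk
    .getOrCreate()
)
```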
Summary:
• Cluster: A team of machines working together.
• Core: The CPU unit on each machine that processes tasks.
• Executor: The worker responsible for completing tasks on the data.
• RAM: High-speed memory for quick data access.
• Disk: Slower storage used when RAM is full.
Understanding how these components work together in Spark helps you optimize your big
data jobs for faster and more efficient processing.
Master Spark Concepts Zero to Big Data Hero:
Detailed Notes on Worker Node, Executor, Task, Stages, On-Heap
Memory, Off-Heap Memory, and Garbage Collection in Apache Spark
What is JVM?
The Java Virtual Machine (JVM) is an abstraction layer that allows Java (or other JVM
languages like Scala) applications to run on any machine, regardless of hardware or
operating system. It provides:
• Memory Management: JVM manages memory allocation, including heap and stack
memory.
• Execution: JVM converts bytecode (compiled from Java/Scala code) into machine
code.
• Garbage Collection: JVM automatically handles the cleanup of unused memory
through garbage collection.
JVM's Role in Spark
Apache Spark heavily relies on JVM for executing its core components:
• Driver Program: The driver, which coordinates the execution of jobs, runs inside a
JVM instance on the driver node.
• Executors: Executors, which run tasks on worker nodes, are JVM processes that are
responsible for task execution and data storage.
The Spark driver and executors both run as separate JVM instances, each managing its own
memory and resources.
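As a small, hedged sketch of that separation, each setting below sizes a different JVM heap. Note that spark.driver.memory only takes effect if it is set before the driver JVM starts (for example via spark-submit), so placing it in code here is for illustration only; the sizes are example values.

```python
from pyspark.sql import SparkSession

# Two separate JVM heaps, sized independently (example values only):
#   spark.driver.memory   -> heap of the driver JVM (must be set at launch time,
#                            e.g. via spark-submit, to actually take effect)
#   spark.executor.memory -> heap of every executor JVM on the worker nodes
spark = (
    SparkSession.builder
    .appName("driver-vs-executor-jvm")
    .config("spark.driver.memory", "2g")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
```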
1. Worker Node
• Role: A worker node is a machine within a Spark cluster that performs the execution
of tasks and handles data storage.
• Purpose: The worker node runs executors that are responsible for running tasks on
data, managing intermediate results, and sending the final output back to the driver.
• Components:
o Executors: Each worker node can have multiple executors, each handling a
portion of the job.
o Data Storage: Worker nodes store data either in memory or on disk during
execution.
Key Points:
• Worker nodes are essentially the physical or virtual machines that process data in a
distributed manner.
• They communicate with the driver program to receive tasks and return results.
2. Executor
• Role: An executor is a JVM process that runs on worker nodes and is responsible for
executing tasks and storing data.
• Lifecycle: Executors are launched when the Spark application starts and typically run for its entire lifetime, unless they fail or the application completes (with dynamic allocation enabled, idle executors can also be released earlier).
• Task Execution: Executors run tasks in parallel and return results to the driver.
• Data Management: Executors store data in-memory (or on disk if necessary) during
task execution and shuffle data between nodes if required.
Key Points:
• Executors perform computations and store intermediate data for tasks.
• Executors handle two main responsibilities:
(1) executing the tasks sent by the driver and
(2) providing in-memory storage for data.
• Executors are removed when the application completes (or earlier, if dynamic allocation releases idle executors).
3. Task
• Role: A task is the smallest unit of work in Spark, representing a single operation (like
a map or filter) on a partition of the data.
• Assignment: Tasks are assigned to executors, and each executor can run multiple
tasks concurrently.
• Execution: Tasks are generated from stages in the Directed Acyclic Graph (DAG),
which defines the order of operations in a job.
Key Points:
• Tasks operate on a single partition of the data and are distributed across multiple
executors for parallel processing.
• Tasks are responsible for applying transformations or actions on the data partitions.
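A minimal PySpark sketch of that partition-to-task mapping, using made-up numbers: the stage produced by the map() below would run one task per partition.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tasks-per-partition").getOrCreate()

# Hypothetical dataset spread over 8 partitions.
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)
print(rdd.getNumPartitions())        # 8

# map() is a narrow transformation: it keeps the same partitioning, so the
# stage that eventually runs it consists of 8 parallel tasks, one per partition.
squared = rdd.map(lambda x: x * x)
print(squared.count())               # the action launches those tasks
```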
4. Stages
• Role: Stages represent a logical unit of execution in Spark. Each Spark job is divided
into stages by the DAG Scheduler, based on the data shuffle boundaries.
• Types: A stage groups narrow transformations (which need no reshuffling of data between partitions); wide transformations (which require a shuffle) mark the boundary where a new stage begins.
• Creation: When an action (like count() or collect()) is triggered, Spark creates stages
that represent the transformation chain.
Key Points:
• Each stage contains a set of tasks that can be executed in parallel.
• Stages are created based on the shuffle dependencies in the job.
5. On-Heap Memory
• Role: In Spark, on-heap memory refers to the memory space allocated within the
JVM heap for Spark's computations.
• Usage: Spark’s operations on RDDs or DataFrames store intermediate data in the JVM
heap space.
• Garbage Collection: On-heap memory is subject to JVM garbage collection, which
can slow down performance if frequent or large collections occur.
Key Points:
• On-heap memory is prone to the inefficiencies of JVM garbage collection.
• The default memory management in Spark is on-heap.
• Inefficient garbage collection can lead to OutOfMemoryError failures in both executors and the driver.
• Spark provides configurations (spark.executor.memory, spark.memory.offHeap.enabled) to optimize memory usage and reduce the impact of GC.
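A minimal sketch of those settings, assuming the session is being created fresh; the sizes are placeholders, not tuning advice.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-tuning-sketch")
    .config("spark.executor.memory", "4g")            # on-heap (JVM) memory per executor
    .config("spark.memory.offHeap.enabled", "true")   # opt in to off-heap memory
    .config("spark.memory.offHeap.size", "2g")        # required once off-heap is enabled
    .getOrCreate()
)
# Data placed off-heap lives outside the JVM heap, so it is not scanned
# by the garbage collector, reducing GC pressure on large executors.
```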
Summary:
• Worker Node: The physical or virtual machine that performs the task execution and
manages storage in the Spark cluster.
• Executor: The process that runs on a worker node, handling task execution and data
storage.
• Task: The smallest unit of work in Spark, operating on data partitions.
• Stages: Logical units of execution that divide a Spark job, categorized as narrow or
wide depending on shuffle dependencies.
• On-Heap Memory: JVM-managed memory for storing Spark data, subject to garbage
collection.
• Off-Heap Memory: Memory managed outside the JVM heap to avoid garbage
collection delays and memory issues.
• Garbage Collection: A JVM process to reclaim memory, but if not optimized, it can
negatively affect performance.
Master Spark Concepts Zero to Big Data Hero:
Detailed Notes on Spark Architecture and Execution Flow
Apache Spark is a powerful open-source distributed computing system that enables fast and
efficient data processing. Here’s a quick overview of its architecture to help you understand
how it works:
Conclusion
Apache Spark’s architecture is designed to handle large-scale data processing efficiently and
effectively. Understanding its components and workflow can help you leverage its full
potential for your big data projects.
1. What is a DAG?
• Definition:
A DAG represents the series of operations (transformations) Spark performs on data.
• Components:
o Nodes: Represent transformation steps (e.g., filter, map).
o Edges: Show the flow of data between transformations.
2. How DAG Works in Spark
a. Job Submission
• When a Spark job is submitted, Spark creates a DAG that maps out the sequence of
steps to execute the job.
• This DAG provides a visual and logical representation of how data is processed (see the sketch below).
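As a small illustrative sketch (the input path is hypothetical): the transformations below only describe the computation, and the resulting lineage, Spark's RDD-level view of the DAG, can be printed before any action runs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-sketch").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy: they only add nodes and edges to the DAG.
words  = sc.textFile("hdfs:///tmp/input.txt")   # hypothetical input path
pairs  = words.flatMap(lambda line: line.split()).map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Inspect the lineage (the RDD view of the DAG) before anything executes.
print(counts.toDebugString().decode("utf-8"))

# Only an action submits a job, at which point the DAG is split into stages.
print(counts.count())
```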
b. Stages
• Spark divides the DAG into multiple stages based on data shuffling requirements.
• Stage Classification:
o Narrow Transformations: Operations like map and filter that don’t require
shuffling data between partitions.
▪ Grouped within the same stage.
o Wide Transformations: Operations like reduceByKey and join that require data
shuffling across nodes.
▪ Define boundaries between stages (illustrated in the sketch below).
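A minimal sketch of the distinction, using tiny in-line data: mapValues stays within the existing partitioning, while reduceByKey forces a shuffle and therefore a new stage.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], numSlices=2)

# Narrow: each output partition depends on exactly one input partition,
# so this transformation is pipelined inside the current stage.
doubled = pairs.mapValues(lambda v: v * 2)

# Wide: records with the same key must be shuffled to the same partition,
# so this marks a stage boundary.
totals = doubled.reduceByKey(lambda a, b: a + b)

# The action triggers one job with two stages: one before the shuffle, one after.
print(totals.collect())
```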
c. Task Execution
• Each stage is further broken into tasks, which are distributed across nodes (executors)
to execute in parallel.
• Tasks ensure that the workload is balanced for efficient execution.
d. Handling Failures
• If a task or stage fails, the DAG allows Spark to:
o Identify the failed components.
o Re-execute only the affected tasks or stages, saving computation time.
3. Why is DAG Important?
a. Efficiency
• DAG enables Spark to optimize task execution by:
o Minimizing data shuffling.
o Combining transformations to reduce redundant computations.
b. Recovery
• The DAG maintains lineage of operations, allowing Spark to:
o Recompute only the necessary parts in case of a failure.
c. Speed
• By enabling parallel task execution and scheduling, DAG ensures faster data
processing.
4. Key Advantages of Spark DAG
1. Task Optimization: Ensures efficient resource usage by structuring transformations.
2. Parallelism: Breaks jobs into tasks that can run in parallel across nodes.
3. Error Handling: Facilitates partial recomputation for failures, reducing recovery time.
4. Transparency: Provides a clear structure of operations, aiding debugging and analysis.
5. Key Terms to Remember
• Transformations: Operations performed on data (e.g., map, filter, reduceByKey).
• Stages: Logical segments of a DAG determined by transformations and shuffling.
• Tasks: Units of work derived from a stage that are executed by executors.
Master Spark Concepts Zero to Big Data Hero:
How does DAG scheduler work?
The DAG Scheduler in Spark is responsible for job execution planning. It transforms a
logical execution plan into a physical execution plan by dividing the work into stages and
tasks. It ensures efficient task distribution, fault tolerance, and optimized execution.
Summary
• Job: Represents the entire execution triggered by an action.
• Stages: Their count grows with each wide transformation, since every shuffle boundary starts a new stage.
• Tasks: Execution units within a stage, one per partition.
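To tie the three terms together, here is a hedged DataFrame sketch (names and sizes are illustrative): explain() prints the physical plan the planner produced, and the Exchange operator in that plan is the shuffle that the DAG Scheduler turns into a stage boundary when the action runs.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("job-stage-task-sketch").getOrCreate()

df = spark.range(1_000)                                        # small illustrative input
agg = df.withColumn("bucket", F.col("id") % 10).groupBy("bucket").count()

# The physical plan: the Exchange operator marks the shuffle boundary.
agg.explain()

# One action -> one job; the DAG Scheduler splits it into stages at the
# Exchange, and each stage runs one task per partition (visible in the Spark UI).
agg.collect()
```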