
Apache Spark

Why We Need Another Infrastructure

• Hadoop is widely used by many applications, each with diverse requirements and needs

Why We Need Another Infrastructure: Specialized Systems

Specialized Systems: Downside

One application may need them all

Vision: A Generic, Efficient Infrastructure

Motivation: Workloads
• Complex multi-pass algorithms
• Interactive ad-hoc queries
• Real-time stream processing

All need efficient data sharing and transfer

Motivation: Workloads
From This … To This …

Motivation: From the Hardware Side
• RAM is getting much cheaper

• Commodity machines with GBs of RAM

• Large distributed RAM in the cluster

A lot of processing, storage, and data transfer should use RAM

Motivation: Summary

• Better support for real-time processing
• Exploit RAM as much as possible
• Large-scale distribution
• Do not reinvent the wheel

Spark Architecture
• Master-Slave architecture

Spark Communication Model

Example [Link]

Spark Memory Management

• Memory utilization is essential in Spark (caching)

• A Spark process is a JVM process

Spark Programming Model

• High-level coding to build a workflow (Scala)

• Code compiles to distributed parallel operations

• Two abstraction units (illustrated in the sketch below):
  • RDDs: Resilient Distributed Datasets
  • Parallel operations
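
A minimal sketch of this model, assuming a local Spark setup (the application name and sample values are illustrative, not from the slides):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local configuration, for illustration only
val conf = new SparkConf().setAppName("Sketch").setMaster("local[*]")
val sc = new SparkContext(conf)

// Abstraction 1: an RDD, a distributed collection of records
val numbers = sc.parallelize(1 to 1000)

// Abstraction 2: parallel operations that compile into distributed tasks
val evenSum = numbers.filter(_ % 2 == 0).reduce(_ + _)
println(evenSum)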

Scala

• A general-purpose programming language
• Combines object-oriented and functional programming
• Compiles to Java bytecode
• Runs on the JVM

Spark RDDs

RDD: Concept
Resilient Distributed Datasets

• A collection of objects (records) that act as one unit

• Stored in main memory or on disk

• Parallel operations are built on top of them

• Fault tolerant without replication (via lineage)

RDD: Concept
• An RDD is read-only

• Distributed either in main memory or on disk (automatically decided)

RDD: Fault Tolerance
• RDDs do not have to be replicated. Instead, they maintain the lineage (provenance) describing how to re-create them, starting from data in reliable storage

[Figure: an RDD's lineage traced back to a dataset stored on disk]

RDD: User Control
• Persistence and partitioning strategies (see the sketch below)

• Users indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage)
• Users can ask that an RDD be partitioned across machines; this is useful for placement optimizations
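
A sketch of both controls, assuming an existing SparkContext sc (the input path and key layout are illustrative):

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val pairs = sc.textFile("hdfs://...").map(line => (line.split("\t")(0), 1))

// Partitioning strategy: hash-partition by key into 10 partitions,
// which helps placement optimizations such as joins on the same key
val partitioned = pairs.partitionBy(new HashPartitioner(10))

// Persistence strategy: keep the reused, partitioned RDD in memory
val cached = partitioned.persist(StorageLevel.MEMORY_ONLY)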

RDD: Advantage

• MapReduce accesses the computational power of the cluster, but not its distributed memory
• Sharing data between jobs through stable storage is time consuming and slow
• RDDs allow in-memory storage and transfer of data

RDD vs. Traditional Shared Memory

Creating RDDs

• 1. Loading from an external dataset (file)
• 2. Creating from another RDD (transformation)
• 3. Parallelizing a centralized collection

Creating RDDs

• 1. Loading from an external dataset
• Support for HDFS, HBase, Amazon S3, …
• #partitions of the RDD = # of HDFS blocks
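
For example, loading from external storage, assuming a SparkContext sc (the paths are placeholders):

// One RDD partition is created per HDFS block of the input file
val fromHdfs = sc.textFile("hdfs://...")
val fromS3 = sc.textFile("s3a://bucket/path/data.txt")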


Creating RDDs

• 2. Creating from another RDD via a transformation, which produces a new RDD

Creating RDDs

• 3. Parallelizing a centralized collection

val data = Array(1, 2, 3, 4, 5, 100, 8, 7, ….)
val distData = sc.parallelize(data)

val data = Array(1, 2, 3, 4, 5, 100, 8, 7, ….)
val distData = sc.parallelize(data, 10) // create 10 partitions

Operations on RDDs

• Transformation Ops. & Action Ops.

Transformation Ops.                  Action Ops.
Create a new RDD                     Return a value to the caller
No execution is triggered            Execution is triggered
Similar to the map-side of Hadoop    Similar to the reduce-side of Hadoop

Transformation Ops

• Operate on one RDD and generate a new RDD

• Lazy evaluation

• The input RDD is left intact

• Examples: map, filter, join
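
A small sketch of these properties, assuming a SparkContext sc (the sample data is illustrative):

val lines = sc.parallelize(Seq("ERROR disk full", "INFO ok", "ERROR timeout"))

val errors = lines.filter(_.startsWith("ERROR")) // new RDD; `lines` is left intact
val lengths = lines.map(_.length)                // element-wise transformation
// Nothing has executed yet: both new RDDs are lazily evaluated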

Transformation Ops: Example I

Transformation Ops: Example II

• It is up to Spark to keep an RDD in memory or re-compute it when needed
• The user can ask Spark to keep a specific RDD in memory (see the sketch below)
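
A sketch of asking Spark to keep an RDD in memory, assuming a SparkContext sc (the path is a placeholder):

val errors = sc.textFile("hdfs://...").filter(_.startsWith("ERROR"))

// Without cache(), Spark may re-compute `errors` for each action;
// cache() asks Spark to keep it in memory after the first computation
errors.cache()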

Action Ops
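
A sketch of common actions, assuming a SparkContext sc (the sample data is illustrative):

val nums = sc.parallelize(Array(1, 2, 3, 4, 5))

val n = nums.count()           // number of elements; triggers execution
val total = nums.reduce(_ + _) // aggregate computed on the cluster
val local = nums.collect()     // bring all elements back to the caller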

Action Ops: Example I

Action Ops: Example II [Link]

Transformations vs. Actions

Lazy Evaluation

• Transformation ops on RDDs follow lazy evaluation

• Results are not physically computed right away

• Metadata regarding the transformations is recorded

• Transformations are executed only when an action is invoked

Example

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.count() // execution is triggered here

RDD Fault Tolerance
• In-memory RDDs are not replicated
• RAM is still limited in size (a scarce resource)

• Lineage graph (see the sketch below)
  • A directed acyclic graph (DAG)
  • Maintains dependencies between RDDs
  • On failure, go back to the closest disk-based RDD
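
The lineage of an RDD can be inspected with toDebugString; a sketch assuming a SparkContext sc (the path is a placeholder):

val words = sc.textFile("hdfs://...").flatMap(_.split(" "))
val counts = words.map((_, 1)).reduceByKey(_ + _)

// Prints the chain of parent RDDs (the lineage DAG) for this RDD
println(counts.toDebugString)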

Lineage Graph

• Stores not the data itself, but how it is generated (the processing steps)

Representation of RDDs

• Each RDD is described by:
  • Its multiple partitions
  • Its dependencies on parent RDD(s)

• Two types of dependencies:
  • Narrow
  • Wide

Narrow Dependency

• A 1-to-1 relationship between child and parent partitions
• Example ops: filter & map
• A relatively cheap process

Wide Dependency

• An M-to-1 or M-to-M relationship between child and parent partitions
• Example ops: join & grouping
• More expensive, since data must be shuffled
Interfaces on RDDs

Scheduling & Memory Management

Scheduling
• Execution is triggered when an “Action” op is invoked

• The scheduler examines the lineage graph to decide what to execute

Spark Memory Management

• Memory utilization is essential in Spark (Caching)

Spark Memory Management

• A Spark process is a JVM process
• Default memory is 512 MB
• Parameters control the usage of the memory segments

[Figure: JVM memory layout; one segment caches RDDs, another handles deserialization before data transfer]

Replacement Policy
• An LRU eviction policy at the level of RDD partitions is used

• When a new RDD partition is created:
  • If there is space in memory → cache it
  • If not → evict one or more partitions from the least recently used RDD

• A “persistence priority” can be used to prevent eviction of important RDDs
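
The “persistence priority” itself is internal to Spark; the user-facing control is persist/unpersist. A sketch assuming a SparkContext sc (the paths are placeholders):

import org.apache.spark.storage.StorageLevel

val hot = sc.textFile("hdfs://...").persist(StorageLevel.MEMORY_ONLY)
val cold = sc.textFile("hdfs://...") // not persisted; recomputed on demand

// Explicitly releasing a cached RDD frees its partitions
// before the LRU policy has to evict anything
hot.unpersist()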

RDD Recovery
• In case of failure, an RDD partition may be lost

[Figure: lineage graph with the lost partitions marked “This partition is lost”]

RDD Recovery
• Recovery can be time consuming for RDDs with long lineage chains
• Use a checkpoint mechanism to make some RDDs persistent (see the sketch below); checkpoints can be:
  • User-defined, OR
  • System-controlled, OR
  • More intelligent, e.g., workload-driven
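
A sketch of a user-defined checkpoint, assuming a SparkContext sc (the directory is a placeholder):

sc.setCheckpointDir("hdfs://...") // reliable storage for checkpoint files

val chained = sc.parallelize(1 to 100).map(_ + 1) // imagine a long lineage chain
chained.checkpoint() // mark for checkpointing; truncates the lineage
chained.count()      // the next action materializes the checkpoint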

