
Apache Spark

Why We Need Another Infrastructure

• Hadoop is widely used by many applications, each with diverse requirements and needs

Why We Need Another Infrastructure: Specialized Systems

Specialized Systems: Downside

One application may need them all

Vision: A Generic, Efficient Infrastructure

Motivation: Workloads
• Complex multi-pass algorithms
• Interactive ad-hoc queries
• Real-time stream processing

All need efficient data sharing and transfer

Motivation: Workloads
From This … To This …

Motivation: From the Hardware Side
• RAM is getting much cheaper

• Commodity machines with GBs of RAM

• Large distributed RAM in the cluster

A lot of processing, storage, and data transfer should use RAM

Motivation: Summary

• Better support for real-time processing
• Exploit RAM as much as possible
• Large-scale distribution
• Do not reinvent the wheel

Spark Architecture
• Master-Slave architecture

Spark Communication Model

Example [Link]

Spark Memory Management

• Memory utilization is essential in Spark (caching)

• A Spark process is a JVM process

Spark Programming Model

• High-level coding to build a workflow (Scala)

• Code compiles to distributed parallel operations

• Two abstraction units (illustrated in the sketch below):
  • RDDs: Resilient Distributed Datasets
  • Parallel operations
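
A minimal sketch of this model, assuming a local Spark setup (the application name and sample values are illustrative, not from the slides):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local configuration, for illustration only
val conf = new SparkConf().setAppName("Sketch").setMaster("local[*]")
val sc = new SparkContext(conf)

// Abstraction 1: an RDD, a distributed collection of records
val numbers = sc.parallelize(1 to 1000)

// Abstraction 2: parallel operations that compile into distributed tasks
val evenSum = numbers.filter(_ % 2 == 0).reduce(_ + _)
println(evenSum)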

Scala

• A general-purpose programming language
• Combines object-oriented and functional programming
• Compiles to Java bytecode
• Runs on the JVM

Spark RDDs

RDD: Concept
Resilient Distributed Datasets

• A collection of objects (records) that act as one unit

• Stored in main memory or on disk

• Parallel operations are built on top of them

• Fault tolerant without replication (via lineage)

RDD: Concept
• An RDD is read-only

• Distributed either in main memory or on disk (automatically decided)

RDD: Fault Tolerance
• RDDs do not have to be replicated. Instead, they maintain the lineage (provenance) describing how to re-create them, starting from data in reliable storage

[Figure: an RDD's lineage traced back to a dataset stored on disk]

RDD: User Control
• Persistence and partitioning strategies (see the sketch below)

• Users indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage)
• Users can ask that an RDD be partitioned across machines; this is useful for placement optimizations
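
A sketch of both controls, assuming an existing SparkContext sc (the input path and key layout are illustrative):

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val pairs = sc.textFile("hdfs://...").map(line => (line.split("\t")(0), 1))

// Partitioning strategy: hash-partition by key into 10 partitions,
// which helps placement optimizations such as joins on the same key
val partitioned = pairs.partitionBy(new HashPartitioner(10))

// Persistence strategy: keep the reused, partitioned RDD in memory
val cached = partitioned.persist(StorageLevel.MEMORY_ONLY)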

RDD: Advantage

• MapReduce accesses the computational power of the cluster, but not its distributed memory
• Sharing data between jobs through stable storage is time consuming and slow
• RDDs allow in-memory storage and transfer of data

RDD vs. Traditional Shared Memory

Creating RDDs

• 1. Loading from an external dataset (file)
• 2. Creating from another RDD (transformation)
• 3. Parallelizing a centralized collection

Creating RDDs

• 1. Loading from an external dataset
• Support for HDFS, HBase, Amazon S3, …
• #partitions of the RDD = # of HDFS blocks
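
For example, loading from external storage, assuming a SparkContext sc (the paths are placeholders):

// One RDD partition is created per HDFS block of the input file
val fromHdfs = sc.textFile("hdfs://...")
val fromS3 = sc.textFile("s3a://bucket/path/data.txt")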


Creating RDDs

• 2. Creating from another RDD via a transformation, which produces a new RDD

Creating RDDs

• 3. Parallelizing a centralized collection

val data = Array(1, 2, 3, 4, 5, 100, 8, 7, ….)
val distData = sc.parallelize(data)

val data = Array(1, 2, 3, 4, 5, 100, 8, 7, ….)
val distData = sc.parallelize(data, 10) // create 10 partitions

Operations on RDDs

• Transformation Ops. & Action Ops.

Transformation Ops.                  Action Ops.
Create a new RDD                     Return a value to the caller
No execution is triggered            Execution is triggered
Similar to the map-side of Hadoop    Similar to the reduce-side of Hadoop

Transformation Ops

• Operate on one RDD and generate a new RDD

• Lazy evaluation

• The input RDD is left intact

• Examples: map, filter, join
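
A small sketch of these properties, assuming a SparkContext sc (the sample data is illustrative):

val lines = sc.parallelize(Seq("ERROR disk full", "INFO ok", "ERROR timeout"))

val errors = lines.filter(_.startsWith("ERROR")) // new RDD; `lines` is left intact
val lengths = lines.map(_.length)                // element-wise transformation
// Nothing has executed yet: both new RDDs are lazily evaluated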

Transformation Ops: Example I

Transformation Ops: Example II

• It is up to Spark to keep an RDD in memory or re-compute it when needed
• The user can ask Spark to keep a specific RDD in memory (see the sketch below)
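
A sketch of asking Spark to keep an RDD in memory, assuming a SparkContext sc (the path is a placeholder):

val errors = sc.textFile("hdfs://...").filter(_.startsWith("ERROR"))

// Without cache(), Spark may re-compute `errors` for each action;
// cache() asks Spark to keep it in memory after the first computation
errors.cache()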

Action Ops
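
A sketch of common actions, assuming a SparkContext sc (the sample data is illustrative):

val nums = sc.parallelize(Array(1, 2, 3, 4, 5))

val n = nums.count()           // number of elements; triggers execution
val total = nums.reduce(_ + _) // aggregate computed on the cluster
val local = nums.collect()     // bring all elements back to the caller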

Action Ops: Example I

Action Ops: Example II [Link]

Transformations vs. Actions

Lazy Evaluation

• Transformation ops on RDDs follow lazy evaluation

• Results are not physically computed right away

• Metadata regarding the transformations is recorded

• Transformations are executed only when an action is invoked

Example

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.count() // execution is triggered here

RDD Fault Tolerance
• In-memory RDDs are not replicated
• RAM is still limited in size (a scarce resource)

• Lineage graph (see the sketch below)
  • A directed acyclic graph (DAG)
  • Maintains dependencies between RDDs
  • On failure, go back to the closest disk-based RDD
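
The lineage of an RDD can be inspected with toDebugString; a sketch assuming a SparkContext sc (the path is a placeholder):

val words = sc.textFile("hdfs://...").flatMap(_.split(" "))
val counts = words.map((_, 1)).reduceByKey(_ + _)

// Prints the chain of parent RDDs (the lineage DAG) for this RDD
println(counts.toDebugString)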

Lineage Graph

• Stores not the data itself, but how it is generated (the processing steps)

Representation of RDDs

• Each RDD is described by:
  • Its multiple partitions
  • Its dependencies on parent RDD(s)

• Two types of dependencies:
  • Narrow
  • Wide

Narrow Dependency

• A 1-to-1 relationship between child and parent partitions
• Example ops: filter & map
• A relatively cheap process

Wide Dependency

• An M-to-1 or M-to-M relationship between child and parent partitions
• Example ops: join & grouping
• More expensive, since data must be shuffled
Interfaces on RDDs

Scheduling & Memory Management

Scheduling
• Execution is triggered when an “Action” op is invoked

• The scheduler examines the lineage graph to decide what to execute

Spark Memory Management

• Memory utilization is essential in Spark (Caching)

Spark Memory Management

• A Spark process is a JVM process
• Default memory is 512 MB
• Parameters control the usage of the memory segments

[Figure: JVM memory layout; one segment caches RDDs, another handles deserialization before data transfer]

Replacement Policy
• An LRU eviction policy at the level of RDD partitions is used

• When a new RDD partition is created:
  • If there is space in memory → cache it
  • If not → evict one or more partitions from the least recently used RDD

• A “persistence priority” can be used to prevent eviction of important RDDs
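
The “persistence priority” itself is internal to Spark; the user-facing control is persist/unpersist. A sketch assuming a SparkContext sc (the paths are placeholders):

import org.apache.spark.storage.StorageLevel

val hot = sc.textFile("hdfs://...").persist(StorageLevel.MEMORY_ONLY)
val cold = sc.textFile("hdfs://...") // not persisted; recomputed on demand

// Explicitly releasing a cached RDD frees its partitions
// before the LRU policy has to evict anything
hot.unpersist()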

RDD Recovery
• In case of failure, an RDD partition may be lost

[Figure: lineage graph with the lost partitions marked “This partition is lost”]

RDD Recovery
• Recovery can be time consuming for RDDs with long lineage chains
• Use a checkpoint mechanism to make some RDDs persistent (see the sketch below); checkpoints can be:
  • User-defined, OR
  • System-controlled, OR
  • More intelligent, e.g., workload-driven
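
A sketch of a user-defined checkpoint, assuming a SparkContext sc (the directory is a placeholder):

sc.setCheckpointDir("hdfs://...") // reliable storage for checkpoint files

val chained = sc.parallelize(1 to 100).map(_ + 1) // imagine a long lineage chain
chained.checkpoint() // mark for checkpointing; truncates the lineage
chained.count()      // the next action materializes the checkpoint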

