COMP9313 Big Data Management: Introduction to MapReduce and Spark
[Figure: a job split into tasks t1, t2, t3, each processed by a separate worker; the partial results r1, r2, r3 are aggregated into the final Result]
There are some problems…
•Data reliability
•Splitting the data evenly
•Worker delays
•Worker failures
•Aggregating the results
MapReduce
•MapReduce is a programming framework that
• allows us to perform distributed, parallel processing on large data sets
• frees us from worrying about issues such as reliability and fault tolerance
• offers the flexibility to write the code logic without caring about the design issues of the underlying system
MapReduce
•MapReduce consists of two phases: Map and Reduce
•Map
• Reads a block of data
• Produces key-value pairs as intermediate outputs
•Reduce
• Receives key-value pairs from multiple map jobs
• Aggregates the intermediate data tuples into the final output
A Simple MapReduce Example
Pseudo Code of Word Count
Map(D):
    for each w in D:
        emit(w, 1)

Reduce(w, counts):
    emit(w, sum(counts))
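To make the data flow concrete, here is a minimal, self-contained Python sketch of the same job: a map phase emits (word, 1) pairs, a shuffle groups the pairs by key (this is what the framework does between the two phases), and a reduce phase sums the counts. All names are illustrative; this simulates the model in one process rather than running on Hadoop.

from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the document block.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Group intermediate pairs by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # Aggregate all intermediate values for one key into the final count.
    return (word, sum(counts))

docs = ["big data big ideas", "big data management"]
pairs = [p for d in docs for p in map_phase(d)]
result = [reduce_phase(w, c) for w, c in shuffle(pairs).items()]
print(result)  # e.g. [('big', 3), ('data', 2), ('ideas', 1), ('management', 1)]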
Advantages of MapReduce
•Parallel processing
• Jobs are divided among multiple nodes
• Nodes work simultaneously
• Processing time is reduced
•Data locality
• Moves the processing to the data
• The opposite of the traditional approach, which moves data to the processing
We will discuss MapReduce in more detail, but not now…
Motivation of Spark
•MapReduce greatly simplified big data analysis on large, unreliable clusters. It is great at one-pass computation.
•But as soon as it became popular, users wanted more:
• more complex, multi-pass analytics (e.g. ML, graph)
• more interactive ad-hoc queries
• more real-time stream processing
Limitations of MapReduce
•As a general programming model:
• more suitable for one-pass computation on a large dataset
• hard to compose and nest multiple operations
• no means of expressing iterative operations
•As implemented in Hadoop:
• all datasets are read from disk, then stored back onto disk
• all data is (usually) triple-replicated for reliability
Data Sharing in Hadoop MapReduce
•Intermediate results are written to and re-read from disk (HDFS) between jobs, which makes multi-pass and interactive workloads slow
Spark Features
•Polyglot
•Speed
•Multiple formats
•Lazy evaluation (see the sketch after this list)
•Real-time computation
•Hadoop integration
•Machine learning
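As a hedged illustration of lazy evaluation in PySpark: building a transformation returns immediately, and work only happens once an action is invoked. The local master URL, app name, and dataset size are assumptions chosen to make the demo runnable.

from pyspark import SparkContext
import time

sc = SparkContext("local[2]", "LazyEvalDemo")

t0 = time.time()
doubled = sc.parallelize(range(1000000)).map(lambda x: x * 2)
print("after transformation: %.3fs" % (time.time() - t0))  # near 0: nothing has run yet

total = doubled.sum()  # the action triggers the actual computation
print("after action: %.3fs, sum = %d" % (time.time() - t0, total))

sc.stop()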
Spark Ecosystem
•Spark Core, with libraries built on top: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing)
Spark Architecture
•Master Node
• takes care of the job execution within the cluster
•Cluster Manager
• allocates resources across applications
•Worker Node
• executes the tasks
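To show how an application attaches to these components, here is a minimal PySpark sketch: the driver program creates a SparkContext, whose master URL tells it which cluster manager to contact. The "local[2]" master and app name below are assumptions for local testing, not a cluster deployment.

from pyspark import SparkContext

# The driver creates a SparkContext; the master URL names the
# cluster manager to contact. "local[2]" runs everything in-process
# with 2 worker threads, chosen here only for a runnable demo.
sc = SparkContext(master="local[2]", appName="ArchitectureDemo")

# Work submitted through the context is split into tasks and
# executed on the worker nodes (here: local threads).
print(sc.parallelize(range(10)).sum())  # 45

sc.stop()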
Resilient Distributed Dataset (RDD)
•RDD is where the data stays
•RDD is the fundamental data structure of Apache Spark
•It is a collection of elements
• Dataset
•that can be operated on in parallel
• Distributed
•and is fault tolerant
• Resilient
Features of Spark RDD
•In-memory computation
•Partitioning
•Fault tolerance
•Immutability
•Persistence
•Coarse-grained operations
•Location-stickiness
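A small PySpark sketch of two of these features, persistence and in-memory computation. The dataset and the explicit storage level are illustrative; cache() would use MEMORY_ONLY by default.

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "PersistenceDemo")

squares = sc.parallelize(range(1, 1001)).map(lambda x: x * x)

# Persist the RDD so that later actions reuse the in-memory copy
# instead of recomputing the map.
squares.persist(StorageLevel.MEMORY_ONLY)

print(squares.count())  # first action: computes and caches
print(squares.sum())    # second action: served from memory

sc.stop()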
Create RDDs
•Parallelizing an existing collection in your driver program
• Normally, Spark tries to set the number of partitions automatically based on your cluster
•Referencing a dataset in an external storage system
• HDFS, HBase, or any data source offering a Hadoop InputFormat
• By default, Spark creates one partition for each block of the file
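Both creation routes in PySpark, as a minimal sketch; the HDFS path is a placeholder, so that line is left commented.

from pyspark import SparkContext

sc = SparkContext("local[*]", "CreateRDDs")

# 1. Parallelize an existing driver-side collection; the second
#    argument optionally overrides the automatic partition count.
nums = sc.parallelize([1, 2, 3, 4, 5], 2)
print(nums.getNumPartitions())  # 2

# 2. Reference a dataset in external storage (HDFS, local file, or
#    any Hadoop InputFormat source). Placeholder path:
# lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

sc.stop()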
RDD Operations
•Transformations
• functions that take an RDD as the input and produce one or more RDDs as the output
• Narrow Transformation
• Wide Transformation
•Actions
• RDD operations that produce non-RDD values
• return the final result of RDD computations
Narrow and Wide Transformations
Narrow transformation (involves no data shuffling):
• map
• flatMap
• filter
• sample

Wide transformation (involves data shuffling):
• sortByKey
• reduceByKey
• groupByKey
• join
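A brief PySpark sketch contrasting the two kinds: map and filter transform each partition independently (no shuffle), while reduceByKey must shuffle records so that equal keys meet on the same partition. The data is illustrative.

from pyspark import SparkContext

sc = SparkContext("local[2]", "NarrowVsWide")

words = sc.parallelize(["a", "b", "a", "c", "b", "a"])

# Narrow: each output partition depends on one input partition.
pairs = words.map(lambda w: (w, 1))            # no shuffle
non_c = pairs.filter(lambda kv: kv[0] != "c")  # no shuffle

# Wide: all values for a key must be co-located, forcing a shuffle.
counts = non_c.reduceByKey(lambda x, y: x + y)

print(sorted(counts.collect()))  # [('a', 3), ('b', 2)]

sc.stop()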
Action
•Actions are operations applied on an RDD that instruct Apache Spark to perform the computation and pass the result back to the driver
• collect
• take
• reduce
• foreach
• count
• save
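A few of these actions in PySpark; each one triggers computation and returns a plain Python value to the driver (or writes output). The input list is illustrative, and the save call is left commented because its output path is a placeholder.

from pyspark import SparkContext

sc = SparkContext("local[*]", "ActionsDemo")

rdd = sc.parallelize([3, 1, 4, 1, 5, 9])

print(rdd.collect())                   # all elements -> [3, 1, 4, 1, 5, 9]
print(rdd.take(2))                     # first 2 elements -> [3, 1]
print(rdd.reduce(lambda a, b: a + b))  # aggregate -> 23
print(rdd.count())                     # number of elements -> 6
rdd.foreach(lambda x: None)            # runs a function on each element, on the workers
# rdd.saveAsTextFile("out/")           # 'save': writes one text file per partition

sc.stop()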
Lineage
•RDD lineage is the graph of all the ancestor RDDs of an RDD
• Also called the RDD operator graph or RDD dependency graph
•Nodes: RDDs
•Edges: dependencies between RDDs
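Spark can print this graph for any RDD via toDebugString; a minimal PySpark sketch (the exact output is version-dependent, with indentation marking shuffle boundaries):

from pyspark import SparkContext

sc = SparkContext("local[2]", "LineageDemo")

counts = (sc.parallelize(["a b", "b c"])
            .flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda x, y: x + y))

# Prints the RDD's ancestor graph (bytes in PySpark, hence decode()).
print(counts.toDebugString().decode())

sc.stop()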
Fault tolerance of RDD
•All RDDs generated from fault-tolerant data are fault tolerant
•If a worker fails and any partition of an RDD is lost
• the partition can be re-computed from the original fault-tolerant dataset using the lineage
• the task will be assigned to another worker
DAG in Spark
•A DAG is a directed graph with no cycles
• Nodes: RDDs, results
• Edges: operations to be applied on RDDs
•When an action is called, the created DAG is submitted to the DAG Scheduler, which further splits the graph into stages of tasks
•DAG operations enable better global optimization than systems like MapReduce
DAG, Stages and Tasks
•The DAG Scheduler splits the graph into multiple stages
•Stages are created based on transformations
• Narrow transformations are grouped together into a single stage
• Wide transformations define the boundary between two stages
•The DAG Scheduler then submits the stages to the task scheduler
• The number of tasks depends on the number of partitions
• Stages that are not interdependent may be submitted to the cluster for execution in parallel
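Tracing the word-count pipeline through the scheduler, as a hedged sketch: the narrow flatMap and map are pipelined into one stage, and reduceByKey starts a new stage at the shuffle boundary. The data and partition count are illustrative.

from pyspark import SparkContext

sc = SparkContext("local[2]", "StagesDemo")

# Stage 1: narrow transformations, pipelined within each partition.
pairs = (sc.parallelize(["a b a", "b c"], 2)
           .flatMap(lambda line: line.split())
           .map(lambda w: (w, 1)))

# Stage boundary: reduceByKey shuffles data between partitions.
counts = pairs.reduceByKey(lambda x, y: x + y)

# Only this action makes the DAG Scheduler build and run the stages.
print(sorted(counts.collect()))  # [('a', 2), ('b', 2), ('c', 1)]

sc.stop()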
Lineage vs. DAG
•Lineage records only the dependencies between RDDs; the DAG additionally records the operations, and is what the scheduler splits into stages
Data Sharing in Spark Using RDD
•RDDs keep intermediate results in memory, so iterative and interactive jobs can share data without the repeated disk I/O of Hadoop MapReduce