Lecture 5

The document discusses MapReduce and Spark frameworks for distributed and parallel processing of large datasets. Some key points: 1) MapReduce addresses issues like data reliability, fault tolerance, and aggregation in distributed systems through its Map and Reduce functions. Spark improves on MapReduce by allowing more complex analytics through efficient data sharing and RDDs. 2) Spark uses RDDs to distribute data across clusters and provides primitives for data parallelism and fault tolerance. RDDs can be operated on through transformations and actions to perform distributed computations. 3) Spark features include speed, multiple data sources, and support for machine learning. It uses DAGs and stages to optimize distributed execution and lineage to recover lost data through recomputation.


There are some problems…

• Data reliability
• Equal split of data
• Delay of worker
• Failure of worker
• Aggregating the results

• We need to handle all of these ourselves in the traditional way of parallel and distributed processing.

1
MapReduce
• MapReduce is a programming framework that
• allows us to perform distributed and parallel processing on large data sets in a distributed environment
• frees us from worrying about issues such as reliability and fault tolerance
• offers the flexibility to write code logic without caring about the design issues of the system

2
MapReduce
• MapReduce consists of Map and Reduce
• Map
• Reads a block of data
• Produces key-value pairs as intermediate outputs
• Reduce
• Receives key-value pairs from multiple map jobs
• Aggregates the intermediate data tuples into the final output (a sketch follows below)
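To make this concrete, here is a minimal word-count sketch in Python in the style of Hadoop Streaming; the function names and the line-oriented input are illustrative assumptions, not part of the lecture.

import sys
from itertools import groupby

def mapper(lines):
    # Map: read a block of text, emit (word, 1) pairs as intermediate output
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reducer(pairs):
    # Reduce: receive pairs grouped by key and aggregate them into the final output
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    intermediate = mapper(sys.stdin)  # on a real cluster, a shuffle/sort phase sits between these
    for word, total in reducer(intermediate):
        print(word, total)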

3
Advantages of MapReduce
• Parallel processing
• Jobs are divided among multiple nodes
• Nodes work simultaneously
• Processing time is reduced

• Data locality
• Moving the processing to the data
• The opposite of the traditional way (moving data to the processing)

4
Motivation of Spark
• MapReduce greatly simplified big data
analysis on large, unreliable clusters. It is
great at one-pass computation.
• But as soon as it got popular, users wanted
more:
• more complex, multi-pass analytics (e.g. ML,
graph)
• more interactive ad-hoc queries
• more real-time stream processing

5
Limitations of MapReduce
• As a general programming model:
• more suitable for one-pass computation on a large
dataset
• hard to compose and nest multiple operations
• no means of expressing iterative operations
• As implemented in Hadoop:
• all datasets are read from disk, then stored back onto disk
• all data is (usually) triple-replicated for
reliability

6
Data Sharing in Hadoop MapReduce

• Slow due to replication, serialization, and disk IO


• Complex apps, streaming, and interactive queries all
need one thing that MapReduce lacks:
• Efficient primitives for data sharing
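For contrast, a minimal PySpark sketch of efficient data sharing: the filtered RDD is cached in memory once and reused by several queries instead of being re-read from disk each time (the file path and SparkContext setup are illustrative assumptions).

from pyspark import SparkContext

sc = SparkContext("local[*]", "data-sharing-demo")

# Read once, then keep the filtered data in cluster memory
logs = sc.textFile("hdfs:///data/logs.txt")  # hypothetical path
errors = logs.filter(lambda line: "ERROR" in line).cache()

# Both queries reuse the in-memory copy; no second pass over the disk
print(errors.count())
print(errors.filter(lambda line: "timeout" in line).count())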
7
What is Spark?
• Apache Spark is an open-source cluster
computing framework for real-time
processing.
• Spark provides an interface for programming
entire clusters with
• implicit data parallelism
• fault-tolerance
• Built on top of Hadoop MapReduce
• extends the MapReduce model to efficiently use
more types of computations

8
Spark Features
• Polyglot
• Speed
• Multiple formats
• Lazy evaluation
• Real time computation
• Hadoop integration
• Machine learning

9
Spark Eco-System

10
Spark Architecture
• Master Node
• takes care of the job execution within the cluster
• Cluster Manager
• allocates resources across applications
• Worker Node
• executes the tasks

11
Resilient Distributed Dataset (RDD)
• RDD is where the data stays
• RDD is the fundamental data structure of Apache Spark
• Dataset: a collection of elements
• Distributed: can be operated on in parallel
• Resilient: fault tolerant

12
Features of Spark RDD
• In-memory computation
• Partitioning
• Fault tolerance
• Immutability
• Persistence
• Coarse-grained operations
• Location-stickiness

13
Create RDDs
• Parallelizing an existing collection in your
driver program
• Normally, Spark tries to set the number of
partitions automatically based on your cluster
• Referencing a dataset in an external storage
system
• HDFS, HBase, or any data source offering a
Hadoop InputFormat
• By default, Spark creates one partition for each
block of the file
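A minimal sketch of both creation paths (the path and sample data are illustrative assumptions).

from pyspark import SparkContext

sc = SparkContext("local[*]", "create-rdds")

# 1) Parallelize an existing collection in the driver program;
#    Spark picks the partition count from the cluster unless numSlices is given
numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# 2) Reference a dataset in external storage (HDFS, local files, ...);
#    by default, one partition is created per block of the file
lines = sc.textFile("hdfs:///data/input.txt")  # hypothetical path

print(numbers.getNumPartitions())  # 2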

14
RDD Operations
• Transformations
• functions that take an RDD as the input and
produce one or many RDDs as the output
• Narrow Transformation
• Wide Transformation
• Actions
• RDD operations that produce non-RDD values
• return the final result of RDD computations to the driver
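A short sketch of both kinds of operations; transformations build new RDDs lazily, and nothing executes until an action is called (sc is a SparkContext as in the earlier sketches).

rdd = sc.parallelize(range(10))

# Transformations: take an RDD, produce a new RDD; nothing is computed yet
squares = rdd.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Action: triggers the distributed computation and returns a non-RDD value
result = evens.collect()  # [0, 4, 16, 36, 64]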

15
Narrow and Wide Transformations
Narrow transformation (involves no data shuffling):
• map
• flatMap
• filter
• sample

Wide transformation (involves data shuffling):
• sortByKey
• reduceByKey
• groupByKey
• join

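A sketch contrasting the two, continuing with the same sc.

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow: each output partition depends on a single input partition; no shuffle
doubled = pairs.map(lambda kv: (kv[0], kv[1] * 2))

# Wide: values sharing a key must be brought together, so data is shuffled
# across partitions
totals = pairs.reduceByKey(lambda a, b: a + b)

print(totals.collect())  # [('a', 4), ('b', 2)] (order may vary)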
16
Action
• Actions are operations applied to an RDD that instruct Apache Spark to perform the computation and pass the result back to the driver
• collect
• take
• reduce
• foreach
• count
• save
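A few of these actions in use (same sc; the commented results follow from the sample data).

rdd = sc.parallelize([3, 1, 4, 1, 5])

print(rdd.collect())                   # [3, 1, 4, 1, 5]: all elements to the driver
print(rdd.take(2))                     # [3, 1]: the first two elements
print(rdd.reduce(lambda a, b: a + b))  # 14: aggregate with a function
print(rdd.count())                     # 5: number of elements
rdd.saveAsTextFile("out-dir")          # writes partitions to a hypothetical output directory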

17
Lineage
• RDD lineage is the graph of all the ancestor
RDDs of an RDD
• Also called RDD operator graph or RDD
dependency graph
• Nodes: RDDs
• Edges: dependencies between RDDs
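The lineage graph can be inspected with the RDD's toDebugString method; the output sketched in the comment is indicative, not exact.

rdd = sc.parallelize(["a b", "b c"])
counts = (rdd.flatMap(lambda line: line.split())
             .map(lambda w: (w, 1))
             .reduceByKey(lambda a, b: a + b))

# Prints the chain of ancestor RDDs and the dependencies between them, e.g.
# (2) PythonRDD ... <- ShuffledRDD ... <- ParallelCollectionRDD
print(counts.toDebugString().decode())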

18
Fault tolerance of RDD
• All the RDDs generated from fault-tolerant data are fault-tolerant
• If a worker fails, and any partition of an RDD
is lost
• the partition can be re-computed from the original
fault-tolerant dataset using the lineage
• the task will be assigned to another worker

19
DAG in Spark
• A DAG is a directed graph with no cycles
• Node: RDDs, results
• Edge: Operations to be applied on RDD
• When an action is called, the resulting DAG is submitted to the DAG Scheduler, which splits the graph into stages of tasks
• DAG-based execution can do better global optimization than systems like MapReduce
20
DAG, Stages and Tasks
• DAG Scheduler splits the graph into multiple
stages
• Stages are created based on transformations
• The narrow transformations will be grouped together
into a single stage
• Wide transformations define the boundary between two stages
• The DAG scheduler then submits the stages to the task scheduler
• Number of tasks depends on the number of
partitions
• The stages that are not interdependent may be
submitted to the cluster for execution in parallel
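A sketch of how a small job splits into stages (same sc; the shuffle at the wide transformation marks the stage boundary).

lines = sc.parallelize(["a b a", "b c"])

# Stage 1: narrow transformations, grouped (pipelined) into one stage
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))

# Wide transformation: the shuffle here is the boundary between stage 1 and stage 2
counts = pairs.reduceByKey(lambda a, b: a + b)

# Action: the DAG is submitted and each stage runs as parallel tasks,
# one task per partition
print(counts.collect())  # [('a', 2), ('b', 2), ('c', 1)] (order may vary)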
21
