Lecture 5

The document discusses MapReduce and Spark frameworks for distributed and parallel processing of large datasets. Some key points: 1) MapReduce addresses issues like data reliability, fault tolerance, and aggregation in distributed systems through its Map and Reduce functions. Spark improves on MapReduce by allowing more complex analytics through efficient data sharing and RDDs. 2) Spark uses RDDs to distribute data across clusters and provides primitives for data parallelism and fault tolerance. RDDs can be operated on through transformations and actions to perform distributed computations. 3) Spark features include speed, multiple data sources, and support for machine learning. It uses DAGs and stages to optimize distributed execution and lineage to recover lost data through recomputation.


There are some problems…

• Data reliability
• Equal split of data
• Delay of worker
• Failure of worker
• Aggregating the results

• We need to handle all of these ourselves in the traditional way of parallel and distributed processing.

1
MapReduce
• MapReduce is a programming framework that
• allows us to perform distributed and parallel processing on large data sets in a distributed environment
• frees us from worrying about issues such as reliability and fault tolerance
• offers the flexibility to write code logic without caring about the design issues of the system

2
MapReduce
• MapReduce consists of Map and Reduce
• Map
• Reads a block of data
• Produces key-value pairs as intermediate outputs
• Reduce
• Receives key-value pairs from multiple map jobs
• Aggregates the intermediate data tuples into the final output (a sketch follows below)
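To make this concrete, here is a minimal word-count sketch in Python in the style of Hadoop Streaming; the function names and the line-oriented input are illustrative assumptions, not part of the lecture.

import sys
from itertools import groupby

def mapper(lines):
    # Map: read a block of text, emit (word, 1) pairs as intermediate output
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reducer(pairs):
    # Reduce: receive pairs grouped by key and aggregate them into the final output
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    intermediate = mapper(sys.stdin)  # on a real cluster, a shuffle/sort phase sits between these
    for word, total in reducer(intermediate):
        print(word, total)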

3
Advantages of MapReduce
• Parallel processing
• Jobs are divided among multiple nodes
• Nodes work simultaneously
• Processing time is reduced

• Data locality
• Moving the processing to the data
• The opposite of the traditional way (moving data to the processing)

4
Motivation of Spark
• MapReduce greatly simplified big data
analysis on large, unreliable clusters. It is
great at one-pass computation.
• But as soon as it got popular, users wanted
more:
• more complex, multi-pass analytics (e.g. ML,
graph)
• more interactive ad-hoc queries
• more real-time stream processing

5
Limitations of MapReduce
• As a general programming model:
• more suitable for one-pass computation on a large
dataset
• hard to compose and nest multiple operations
• no means of expressing iterative operations
• As implemented in Hadoop:
• all datasets are read from disk, then stored back onto disk
• all data is (usually) triple-replicated for
reliability

6
Data Sharing in Hadoop MapReduce

• Slow due to replication, serialization, and disk IO


• Complex apps, streaming, and interactive queries all
need one thing that MapReduce lacks:
• Efficient primitives for data sharing
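For contrast, a minimal PySpark sketch of efficient data sharing: the filtered RDD is cached in memory once and reused by several queries instead of being re-read from disk each time (the file path and SparkContext setup are illustrative assumptions).

from pyspark import SparkContext

sc = SparkContext("local[*]", "data-sharing-demo")

# Read once, then keep the filtered data in cluster memory
logs = sc.textFile("hdfs:///data/logs.txt")  # hypothetical path
errors = logs.filter(lambda line: "ERROR" in line).cache()

# Both queries reuse the in-memory copy; no second pass over the disk
print(errors.count())
print(errors.filter(lambda line: "timeout" in line).count())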
7
What is Spark?
• Apache Spark is an open-source cluster
computing framework for real-time
processing.
• Spark provides an interface for programming
entire clusters with
• implicit data parallelism
• fault-tolerance
• Built on top of Hadoop MapReduce
• extends the MapReduce model to efficiently use
more types of computations

8
Spark Features
• Polyglot
• Speed
• Multiple formats
• Lazy evaluation
• Real time computation
• Hadoop integration
• Machine learning

9
Spark Eco-System

10
Spark Architecture
• Master Node
• takes care of the job execution within the cluster
• Cluster Manager
• allocates resources across applications
• Worker Node
• executes the tasks

11
Resilient Distributed Dataset (RDD)
• RDD is where the data stays
• RDD is the fundamental data structure of Apache Spark
• Dataset: a collection of elements
• Distributed: can be operated on in parallel
• Resilient: fault tolerant

12
Features of Spark RDD
• In-memory computation
• Partitioning
• Fault tolerance
• Immutability
• Persistence
• Coarse-grained operations
• Location-stickiness

13
Create RDDs
• Parallelizing an existing collection in your
driver program
• Normally, Spark tries to set the number of
partitions automatically based on your cluster
• Referencing a dataset in an external storage
system
• HDFS, HBase, or any data source offering a
Hadoop InputFormat
• By default, Spark creates one partition for each
block of the file
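A minimal sketch of both creation paths (the path and sample data are illustrative assumptions).

from pyspark import SparkContext

sc = SparkContext("local[*]", "create-rdds")

# 1) Parallelize an existing collection in the driver program;
#    Spark picks the partition count from the cluster unless numSlices is given
numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# 2) Reference a dataset in external storage (HDFS, local files, ...);
#    by default, one partition is created per block of the file
lines = sc.textFile("hdfs:///data/input.txt")  # hypothetical path

print(numbers.getNumPartitions())  # 2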

14
RDD Operations
• Transformations
• functions that take an RDD as the input and
produce one or many RDDs as the output
• Narrow Transformation
• Wide Transformation
• Actions
• RDD operations that produce non-RDD values
• return the final result of RDD computations to the driver
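A short sketch of both kinds of operations; transformations build new RDDs lazily, and nothing executes until an action is called (sc is a SparkContext as in the earlier sketches).

rdd = sc.parallelize(range(10))

# Transformations: take an RDD, produce a new RDD; nothing is computed yet
squares = rdd.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Action: triggers the distributed computation and returns a non-RDD value
result = evens.collect()  # [0, 4, 16, 36, 64]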

15
Narrow and Wide Transformations
Narrow transformation (involves no data shuffling):
• map
• flatMap
• filter
• sample

Wide transformation (involves data shuffling):
• sortByKey
• reduceByKey
• groupByKey
• join

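A sketch contrasting the two, continuing with the same sc.

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow: each output partition depends on a single input partition; no shuffle
doubled = pairs.map(lambda kv: (kv[0], kv[1] * 2))

# Wide: values sharing a key must be brought together, so data is shuffled
# across partitions
totals = pairs.reduceByKey(lambda a, b: a + b)

print(totals.collect())  # [('a', 4), ('b', 2)] (order may vary)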
16
Action
• Actions are operations applied to an RDD that instruct Apache Spark to perform the computation and pass the result back to the driver
• collect
• take
• reduce
• foreach
• count
• save
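A few of these actions in use (same sc; the commented results follow from the sample data).

rdd = sc.parallelize([3, 1, 4, 1, 5])

print(rdd.collect())                   # [3, 1, 4, 1, 5]: all elements to the driver
print(rdd.take(2))                     # [3, 1]: the first two elements
print(rdd.reduce(lambda a, b: a + b))  # 14: aggregate with a function
print(rdd.count())                     # 5: number of elements
rdd.saveAsTextFile("out-dir")          # writes partitions to a hypothetical output directory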

17
Lineage
• RDD lineage is the graph of all the ancestor
RDDs of an RDD
• Also called RDD operator graph or RDD
dependency graph
• Nodes: RDDs
• Edges: dependencies between RDDs
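The lineage graph can be inspected with the RDD's toDebugString method; the output sketched in the comment is indicative, not exact.

rdd = sc.parallelize(["a b", "b c"])
counts = (rdd.flatMap(lambda line: line.split())
             .map(lambda w: (w, 1))
             .reduceByKey(lambda a, b: a + b))

# Prints the chain of ancestor RDDs and the dependencies between them, e.g.
# (2) PythonRDD ... <- ShuffledRDD ... <- ParallelCollectionRDD
print(counts.toDebugString().decode())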

18
Fault tolerance of RDD
• All the RDDs generated from fault-tolerant data are fault-tolerant
• If a worker fails, and any partition of an RDD
is lost
• the partition can be re-computed from the original
fault-tolerant dataset using the lineage
• the task will be assigned to another worker

19
DAG in Spark
• A DAG is a directed graph with no cycles
• Node: RDDs, results
• Edge: Operations to be applied on RDD
• When an action is called, the resulting DAG is submitted to the DAG Scheduler, which splits the graph into stages of tasks
• DAG-based execution can do better global optimization than systems like MapReduce
20
DAG, Stages and Tasks
• DAG Scheduler splits the graph into multiple
stages
• Stages are created based on transformations
• The narrow transformations will be grouped together
into a single stage
• Wide transformations define the boundary between two stages
• The DAG scheduler then submits the stages to the task scheduler
• Number of tasks depends on the number of
partitions
• The stages that are not interdependent may be
submitted to the cluster for execution in parallel
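A sketch of how a small job splits into stages (same sc; the shuffle at the wide transformation marks the stage boundary).

lines = sc.parallelize(["a b a", "b c"])

# Stage 1: narrow transformations, grouped (pipelined) into one stage
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))

# Wide transformation: the shuffle here is the boundary between stage 1 and stage 2
counts = pairs.reduceByKey(lambda a, b: a + b)

# Action: the DAG is submitted and each stage runs as parallel tasks,
# one task per partition
print(counts.collect())  # [('a', 2), ('b', 2), ('c', 1)] (order may vary)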
21
