COMP9313 Big Data Management: Introduction to MapReduce and Spark
[Figure: a job split into tasks t1, t2, t3, each processed by a separate worker; the partial results r1, r2, r3 are aggregated into the final Result]
There are some problems…
•Data reliability
•Splitting the data evenly
•Worker delays
•Worker failures
•Aggregating the results
MapReduce
•MapReduce is a programming framework that
• allows us to perform distributed, parallel processing on large data sets
• frees us from worrying about issues such as reliability and fault tolerance
• offers the flexibility to write the code logic without caring about the design issues of the underlying system
MapReduce
•MapReduce consists of two phases: Map and Reduce
•Map
• Reads a block of data
• Produces key-value pairs as intermediate outputs
•Reduce
• Receives key-value pairs from multiple map jobs
• Aggregates the intermediate data tuples into the final output
A Simple MapReduce Example
Pseudo Code of Word Count
Map(D):
    for each w in D:
        emit(w, 1)

Reduce(w, counts):
    emit(w, sum(counts))
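To make the data flow concrete, here is a minimal, self-contained Python sketch of the same job: a map phase emits (word, 1) pairs, a shuffle groups the pairs by key (this is what the framework does between the two phases), and a reduce phase sums the counts. All names are illustrative; this simulates the model in one process rather than running on Hadoop.

from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the document block.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Group intermediate pairs by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # Aggregate all intermediate values for one key into the final count.
    return (word, sum(counts))

docs = ["big data big ideas", "big data management"]
pairs = [p for d in docs for p in map_phase(d)]
result = [reduce_phase(w, c) for w, c in shuffle(pairs).items()]
print(result)  # e.g. [('big', 3), ('data', 2), ('ideas', 1), ('management', 1)]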
Advantages of MapReduce
•Parallel processing
• Jobs are divided among multiple nodes
• Nodes work simultaneously
• Processing time is reduced
•Data locality
• Moves the processing to the data
• The opposite of the traditional approach, which moves data to the processing
We will discuss MapReduce in more detail, but not now…
Motivation of Spark
•MapReduce greatly simplified big data analysis on large, unreliable clusters. It is great at one-pass computation.
•But as soon as it became popular, users wanted more:
• more complex, multi-pass analytics (e.g. ML, graph)
• more interactive ad-hoc queries
• more real-time stream processing
Limitations of MapReduce
•As a general programming model:
• more suitable for one-pass computation on a large dataset
• hard to compose and nest multiple operations
• no means of expressing iterative operations
•As implemented in Hadoop:
• all datasets are read from disk, then stored back onto disk
• all data is (usually) triple-replicated for reliability
Data Sharing in Hadoop MapReduce
•Intermediate results are written to and re-read from disk (HDFS) between jobs, which makes multi-pass and interactive workloads slow
Spark Features
•Polyglot
•Speed
•Multiple formats
•Lazy evaluation (see the sketch after this list)
•Real-time computation
•Hadoop integration
•Machine learning
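As a hedged illustration of lazy evaluation in PySpark: building a transformation returns immediately, and work only happens once an action is invoked. The local master URL, app name, and dataset size are assumptions chosen to make the demo runnable.

from pyspark import SparkContext
import time

sc = SparkContext("local[2]", "LazyEvalDemo")

t0 = time.time()
doubled = sc.parallelize(range(1000000)).map(lambda x: x * 2)
print("after transformation: %.3fs" % (time.time() - t0))  # near 0: nothing has run yet

total = doubled.sum()  # the action triggers the actual computation
print("after action: %.3fs, sum = %d" % (time.time() - t0, total))

sc.stop()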
Spark Ecosystem
•Spark Core, with libraries built on top: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing)
Spark Architecture
•Master Node
• takes care of the job execution within the cluster
•Cluster Manager
• allocates resources across applications
•Worker Node
• executes the tasks
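To show how an application attaches to these components, here is a minimal PySpark sketch: the driver program creates a SparkContext, whose master URL tells it which cluster manager to contact. The "local[2]" master and app name below are assumptions for local testing, not a cluster deployment.

from pyspark import SparkContext

# The driver creates a SparkContext; the master URL names the
# cluster manager to contact. "local[2]" runs everything in-process
# with 2 worker threads, chosen here only for a runnable demo.
sc = SparkContext(master="local[2]", appName="ArchitectureDemo")

# Work submitted through the context is split into tasks and
# executed on the worker nodes (here: local threads).
print(sc.parallelize(range(10)).sum())  # 45

sc.stop()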
Resilient Distributed Dataset (RDD)
•RDD is where the data stays
•RDD is the fundamental data structure of Apache Spark
•It is a collection of elements
• Dataset
•that can be operated on in parallel
• Distributed
•and is fault tolerant
• Resilient
Features of Spark RDD
•In-memory computation
•Partitioning
•Fault tolerance
•Immutability
•Persistence
•Coarse-grained operations
•Location-stickiness
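A small PySpark sketch of two of these features, persistence and in-memory computation. The dataset and the explicit storage level are illustrative; cache() would use MEMORY_ONLY by default.

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "PersistenceDemo")

squares = sc.parallelize(range(1, 1001)).map(lambda x: x * x)

# Persist the RDD so that later actions reuse the in-memory copy
# instead of recomputing the map.
squares.persist(StorageLevel.MEMORY_ONLY)

print(squares.count())  # first action: computes and caches
print(squares.sum())    # second action: served from memory

sc.stop()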
Create RDDs
•Parallelizing an existing collection in your driver program
• Normally, Spark tries to set the number of partitions automatically based on your cluster
•Referencing a dataset in an external storage system
• HDFS, HBase, or any data source offering a Hadoop InputFormat
• By default, Spark creates one partition for each block of the file
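Both creation routes in PySpark, as a minimal sketch; the HDFS path is a placeholder, so that line is left commented.

from pyspark import SparkContext

sc = SparkContext("local[*]", "CreateRDDs")

# 1. Parallelize an existing driver-side collection; the second
#    argument optionally overrides the automatic partition count.
nums = sc.parallelize([1, 2, 3, 4, 5], 2)
print(nums.getNumPartitions())  # 2

# 2. Reference a dataset in external storage (HDFS, local file, or
#    any Hadoop InputFormat source). Placeholder path:
# lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

sc.stop()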
RDD Operations
•Transformations
• functions that take an RDD as the input and produce one or more RDDs as the output
• Narrow Transformation
• Wide Transformation
•Actions
• RDD operations that produce non-RDD values
• return the final result of RDD computations
Narrow and Wide Transformations
Narrow transformation (involves no data shuffling):
• map
• flatMap
• filter
• sample

Wide transformation (involves data shuffling):
• sortByKey
• reduceByKey
• groupByKey
• join
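A brief PySpark sketch contrasting the two kinds: map and filter transform each partition independently (no shuffle), while reduceByKey must shuffle records so that equal keys meet on the same partition. The data is illustrative.

from pyspark import SparkContext

sc = SparkContext("local[2]", "NarrowVsWide")

words = sc.parallelize(["a", "b", "a", "c", "b", "a"])

# Narrow: each output partition depends on one input partition.
pairs = words.map(lambda w: (w, 1))            # no shuffle
non_c = pairs.filter(lambda kv: kv[0] != "c")  # no shuffle

# Wide: all values for a key must be co-located, forcing a shuffle.
counts = non_c.reduceByKey(lambda x, y: x + y)

print(sorted(counts.collect()))  # [('a', 3), ('b', 2)]

sc.stop()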
Action
•Actions are operations applied on an RDD that instruct Apache Spark to perform the computation and pass the result back to the driver
• collect
• take
• reduce
• foreach
• count
• save
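A few of these actions in PySpark; each one triggers computation and returns a plain Python value to the driver (or writes output). The input list is illustrative, and the save call is left commented because its output path is a placeholder.

from pyspark import SparkContext

sc = SparkContext("local[*]", "ActionsDemo")

rdd = sc.parallelize([3, 1, 4, 1, 5, 9])

print(rdd.collect())                   # all elements -> [3, 1, 4, 1, 5, 9]
print(rdd.take(2))                     # first 2 elements -> [3, 1]
print(rdd.reduce(lambda a, b: a + b))  # aggregate -> 23
print(rdd.count())                     # number of elements -> 6
rdd.foreach(lambda x: None)            # runs a function on each element, on the workers
# rdd.saveAsTextFile("out/")           # 'save': writes one text file per partition

sc.stop()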
Lineage
•RDD lineage is the graph of all the ancestor RDDs of an RDD
• Also called the RDD operator graph or RDD dependency graph
•Nodes: RDDs
•Edges: dependencies between RDDs
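Spark can print this graph for any RDD via toDebugString; a minimal PySpark sketch (the exact output is version-dependent, with indentation marking shuffle boundaries):

from pyspark import SparkContext

sc = SparkContext("local[2]", "LineageDemo")

counts = (sc.parallelize(["a b", "b c"])
            .flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda x, y: x + y))

# Prints the RDD's ancestor graph (bytes in PySpark, hence decode()).
print(counts.toDebugString().decode())

sc.stop()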
Fault tolerance of RDD
•All RDDs generated from fault-tolerant data are fault tolerant
•If a worker fails and any partition of an RDD is lost
• the partition can be re-computed from the original fault-tolerant dataset using the lineage
• the task will be assigned to another worker
DAG in Spark
•A DAG is a directed graph with no cycles
• Nodes: RDDs, results
• Edges: operations to be applied on RDDs
•When an action is called, the created DAG is submitted to the DAG Scheduler, which further splits the graph into stages of tasks
•DAG operations enable better global optimization than systems like MapReduce
DAG, Stages and Tasks
•The DAG Scheduler splits the graph into multiple stages
•Stages are created based on transformations
• Narrow transformations are grouped together into a single stage
• Wide transformations define the boundary between two stages
•The DAG Scheduler then submits the stages to the task scheduler
• The number of tasks depends on the number of partitions
• Stages that are not interdependent may be submitted to the cluster for execution in parallel
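Tracing the word-count pipeline through the scheduler, as a hedged sketch: the narrow flatMap and map are pipelined into one stage, and reduceByKey starts a new stage at the shuffle boundary. The data and partition count are illustrative.

from pyspark import SparkContext

sc = SparkContext("local[2]", "StagesDemo")

# Stage 1: narrow transformations, pipelined within each partition.
pairs = (sc.parallelize(["a b a", "b c"], 2)
           .flatMap(lambda line: line.split())
           .map(lambda w: (w, 1)))

# Stage boundary: reduceByKey shuffles data between partitions.
counts = pairs.reduceByKey(lambda x, y: x + y)

# Only this action makes the DAG Scheduler build and run the stages.
print(sorted(counts.collect()))  # [('a', 2), ('b', 2), ('c', 1)]

sc.stop()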
Lineage vs. DAG
•Lineage records only the dependencies between RDDs; the DAG additionally records the operations, and is what the scheduler splits into stages
Data Sharing in Spark Using RDD
•RDDs keep intermediate results in memory, so iterative and interactive jobs can share data without the repeated disk I/O of Hadoop MapReduce