Lecture 3: MapReduce and Spark

The document announces that Professor Jung's office hours have changed from Thursdays 1-2pm to Thursdays 3-4pm in room N303B. It also notes that homework 2 will involve writing and running Spark applications on AWS EC2 instances. The rest of the document provides a high-level overview of MapReduce, including how it splits work across machines, runs map and reduce tasks in parallel, and handles failures.


Announcement

● Jung - I have switched my office hours
  from Thursdays 1pm - 2pm
  to Thursdays 3pm - 4pm
  at the same location, N303B.

● In HW2, you will be writing Spark applications and
  running them on AWS EC2 instances.
Google MapReduce
CompSci 516
Junghoon Kang
Big Data

● it cannot be stored on one hard disk drive,
  so we need to split it across multiple machines
● it cannot be processed by one CPU,
  so we need to parallelize computation on multiple machines
From Google File System to MapReduce
Where does Google use MapReduce?

Input (fed into MapReduce):
● crawled documents
● web request logs

Output (produced by MapReduce):
● inverted indices
● graph structure of web documents
● summaries of the number of pages crawled per host
● the set of most frequent queries in a day
What is MapReduce?

It is a programming model
that processes large data by:

● applying a function to each logical record in the input (map)

● categorizing and combining the intermediate results
  into summary values (reduce)
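As a concrete, single-machine illustration of this model, here is a minimal word-count sketch in Python; the names map_fn, reduce_fn, and run are hypothetical and are not part of Google's actual API.

from collections import defaultdict

def map_fn(doc_id, text):
    # map: apply a function to each logical record, emitting (word, 1) pairs
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # reduce: combine all intermediate values for one key into a summary value
    yield (word, sum(counts))

def run(records):
    # tiny driver that mimics the framework on a single machine
    intermediate = defaultdict(list)
    for doc_id, text in records:
        for key, value in map_fn(doc_id, text):
            intermediate[key].append(value)
    output = []
    for key in sorted(intermediate):
        output.extend(reduce_fn(key, intermediate[key]))
    return output

print(run([(1, "cloud data computer"), (2, "data map map")]))
# [('cloud', 1), ('computer', 1), ('data', 2), ('map', 2)]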
Understanding MapReduce
(example by Yongho Ha)

I am a class president

An English teacher asks you:

“Could you count the number of occurrences of
each word in this book?”

Um... Ok...
Let’s divide the workload among classmates.

map

Each classmate emits a (word, 1) pair for every word in their part of the book:

(cloud, 1) (parallel, 1) (map, 1) (computer, 1) (reduce, 1)
(data, 1) (data, 1) (cloud, 1) (map, 1) (map, 1)
(computer, 1) (parallel, 1) (scientist, 1) (scientist, 1)
And let a few of them combine the intermediate results.

reduce

Three classmates split the key range alphabetically:
"I will collect A ~ G", "I will collect H ~ Q", "I will collect R ~ Z".

From the pairs above they produce:
A ~ G: (cloud, 2) (computer, 2) (data, 2)
H ~ Q: (map, 3) (parallel, 2)
R ~ Z: (reduce, 1) (scientist, 2)
Why did MapReduce
become so popular?

Is it because Google uses it?

Distributed Computation Before MapReduce

Things to consider:
● how to divide the workload among multiple machines?
● how to distribute data and the program to other machines?
● how to schedule tasks?
● what happens if a task fails while running?
● … and … and …
Distributed Computation After MapReduce

Things to consider:
● how to write the Map function?
● how to write the Reduce function?
MapReduce has made distributed computation
an easy thing to do!

[Figure: the many developers needed before MapReduce vs. the few needed after MapReduce]
Given the brief intro to MapReduce,
let's begin our journey into the real
implementation details of MapReduce!
Key Players in MapReduce
One Master
● coordinates many workers.
● assigns a task* to each worker.
(* task = partition of data + computation)

Multiple Workers
● follow whatever the Master asks them to do.
Execution Overview
1. The MapReduce library in the user program first splits
the input file into M pieces.

gfs://path/input_file

partition_1 partition_2 partition_3 partition_4 … partition_M


2. The MapReduce library in the user program then
starts up many copies of the program on a cluster of
machines: one master and multiple workers.

master

worker 1 worker 2 worker 3


There are M map tasks and R reduce tasks to assign.
(The figure below depicts task = data + computation.)

Map Task:    Data = partition_#,  Computation = map function
Reduce Task: Data = partition_#,  Computation = reduce function
3. The master picks idle workers and assigns each one
a map task.

[Figure: over time, the master assigns map tasks to worker 1, worker 2, and worker 3]
4. Map Phase (each mapper node)
1) Read in a corresponding input partition.
2) Apply the user-defined map function to each key/value pair
in the partition.
3) Partition the result produced by the map function into R
regions using the partitioning function.
4) Write the result to its local disk (not GFS).
5) Notify the master of the locations of the partitioned
intermediate results.
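The steps above can be pictured with a rough Python sketch of one map task. The helper names, paths, and the value of R are hypothetical (the real implementation is a C++ library inside Google), and map_fn is the user function from the earlier word-count sketch.

import hashlib

R = 4  # number of reduce tasks (hypothetical value)

def partition(key):
    # partitioning function: assign an intermediate key to one of R regions
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % R

def run_map_task(task_id, records, map_fn):
    # 1) read the input partition (here: the 'records' iterable)
    # 2) apply the user-defined map function to each key/value pair
    # 3) split the output into R regions with the partitioning function
    regions = [[] for _ in range(R)]
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            regions[partition(out_key)].append((out_key, out_value))
    # 4) write each region to local disk (not GFS)
    locations = []
    for r, pairs in enumerate(regions):
        path = f"/local/tmp/temp_{task_id}_{r}"
        with open(path, "w") as f:
            for k, v in pairs:
                f.write(f"{k}\t{v}\n")
        locations.append(path)
    # 5) the locations are reported back to the master
    return locations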
Map Phase

[Figure: the interaction for the kth map task]
1. The master assigns a map task to an idle worker (the mapper).
2. The mapper asks GFS where its input partition (partition_k) is.
3. GFS serves the input partition to the mapper.
   Inside the mapper, the map function processes the partition and a
   partitioning function, hash(key) mod R, splits the output into
   temp_k1, temp_k2, …, temp_kR on the mapper's local disk.
4. The mapper reports the locations of the partitioned intermediate
   results back to the master.


5. After all the map tasks are done, the master picks idle
workers and assigns each one a reduce task.

[Figure: over time, the master assigns reduce tasks to worker 1, worker 2, and worker 3]
6. Reduce Phase (each reducer node)
1) Read in all the corresponding intermediate result
partitions from mapper nodes.
2) Sort the intermediate results by the intermediate keys.
3) Apply the user-defined reduce function on each
intermediate key and the corresponding set of
intermediate values.
4) Create one output file.
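A matching Python sketch of one reduce task, under the same hypothetical assumptions as the map-task sketch (reduce_fn is a user function that yields (key, value) pairs, as in the word-count example):

from itertools import groupby
from operator import itemgetter

def run_reduce_task(task_id, intermediate_files, reduce_fn):
    # 1) read all corresponding intermediate result partitions from mapper nodes
    pairs = []
    for path in intermediate_files:
        with open(path) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                pairs.append((key, int(value)))
    # 2) sort the intermediate results by the intermediate keys
    pairs.sort(key=itemgetter(0))
    # 3) apply the user-defined reduce function to each key and its values
    # 4) create one output file (the real system stores it in GFS)
    with open(f"output_{task_id}", "w") as out:
        for key, group in groupby(pairs, key=itemgetter(0)):
            values = (v for _, v in group)
            for out_key, out_value in reduce_fn(key, values):
                out.write(f"{out_key}\t{out_value}\n")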
Reduce Phase

[Figure: the interaction for the kth reduce task]
1. The master assigns a reduce task to an idle worker (the reducer).
2. The master asks the mappers to send their intermediate results to
   this reducer.
3. The mappers deliver the intermediate result partitions
   temp_1k, temp_2k, …, temp_Mk to the reducer.
   Inside the reducer, the results are sorted and the reduce function
   is applied, producing output_k.
4. The reducer stores the output file output_k into GFS.
   (The reduce phase generates a total of R output files.)
Fault Tolerance

Although the probability of any single machine failing is low,
among thousands of machines, failures are common.
How does MapReduce
handle machine failures?
Worker Failure
● The master periodically sends a heartbeat to each worker node.
● If a worker node fails, the master reschedules the tasks
handled by the worker.
Master Failure
● The whole MapReduce job gets restarted through a
different master.
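The master's failure-detection loop can be pictured roughly as below. This is a hypothetical Python sketch: ping, reschedule, the worker object, and the interval are made-up names and values, not something the paper prescribes.

import time

def monitor_workers(workers, ping, reschedule, interval=10):
    # periodically ping every worker; if one stops responding, mark it
    # failed and put its tasks back on the queue for other idle workers
    while True:
        for worker in list(workers):
            if not ping(worker):
                workers.remove(worker)
                for task in worker.assigned_tasks:
                    reschedule(task)
        time.sleep(interval)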
Locality

● The input data is managed by GFS.
● Choose the MapReduce machines such that they already contain the
  input data on their local disks.
● This way we can conserve network bandwidth.
Task Granularity

● It is preferable for the number of tasks to be a multiple of the
  number of worker nodes.
● The smaller the partition size, the faster the failover and the finer
  the load balancing.
● But more tasks also incur more overhead, so a balance is needed.
Backup Tasks

● In order to cope with a straggler, the master schedules backup
  executions of the remaining in-progress tasks.
MapReduce Pros and Cons
● MapReduce is good for off-line batch jobs on large
data sets.
● MapReduce is not good for iterative jobs due to high
I/O overhead as each iteration needs to read/write
data from/to GFS.
● MapReduce is bad for jobs on small datasets and
jobs that require low-latency response.
Apache Hadoop
Apache Hadoop is an open-source version of
GFS and Google MapReduce.

                          Google              Apache Hadoop
File system               GFS                 HDFS
Data processing engine    Google MapReduce    Hadoop MapReduce
References
● MapReduce: Simplified Data Processing on Large Clusters -
  Jeffrey Dean, et al. - 2004
● The Google File System - Sanjay Ghemawat, et al. - 2003
● http://www.slideshare.net/yongho/2011-h3
Apache Spark
CompSci 516
Junghoon Kang
About Spark
● Spark is a distributed large-scale data processing engine that
exploits in-memory computation and other optimizations.

● It is one of the most popular data processing engines in the industry
  these days; many large companies like Netflix, Yahoo, and
  eBay use Spark at massive scale.
More about Spark
● It started as a research project at UC
Berkeley.

● The Resilient Distributed Datasets (RDD)
  paper was published at NSDI 2012.

● It won the Best Paper award that year.


Motivation
Hadoop MapReduce indeed made analyzing large
datasets easy.
However, MapReduce was still not good for:
● iterative jobs, such as machine learning and graph
computation
● interactive and ad-hoc queries
Can we do better?
MapReduce is not good for iterative jobs because of the high I/O
overhead: each iteration needs to read/write data from/to HDFS.

So, what if we use RAM between each iteration?


[Figure: with Hadoop MapReduce, every step goes through HDFS
 (Input → HDFS read → Iter. 1 → HDFS write → HDFS read → Iter. 2 → …);
 keeping data in RAM removes the HDFS round trips
 (Input → Iter. 1 → Iter. 2 → …)]

Instead of storing intermediate outputs in HDFS,
using RAM would be faster.
The same idea applies to interactive queries.

[Figure: reading the input from HDFS for every query
 (HDFS read → Query 1, HDFS read → Query 2, HDFS read → Query 3)
 vs. one HDFS read that brings the input into RAM, followed by
 Query 1, Query 2, and Query 3 against the in-memory data]

Instead of reading the input from HDFS every time you run a query,
bring the input into RAM first, then run multiple queries.
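In Spark, this "read once into RAM, query many times" pattern looks roughly like the PySpark sketch below; the HDFS path, app name, and query predicates are hypothetical.

from pyspark import SparkContext

sc = SparkContext(appName="cache-example")

# read the input from HDFS once and mark it to be kept in memory
logs = sc.textFile("hdfs:///data/weblogs")
logs.cache()

# Query 1 (the first action also materializes the in-memory copy)
errors = logs.filter(lambda line: "ERROR" in line).count()

# Queries 2 and 3 reuse the cached data instead of re-reading HDFS
warnings = logs.filter(lambda line: "WARN" in line).count()
total = logs.count()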
Challenge
But RAM is volatile storage…

What happens if a machine fails?

Although the probability of any single machine failing is low,
among thousands of machines, failures are common.

In other words, how can we create an efficient,
fault-tolerant, and distributed RAM storage?
Some Approaches
Some data processing frameworks, such as RAMCloud
or Piccolo, also used RAM to improve performance.

And they supported fine-grained updates of data in RAM.

But it is hard to achieve fault tolerance with fine-grained
updates while maintaining good performance.
Spark’s Approach
What if we use RAM as read-only storage?

This idea is RDD, Resilient Distributed Datasets,
which is both the title of the Spark paper and the core idea
behind Spark!
Resilient Distributed Datasets
What are the properties of RDD?

● read-only, partitioned collections of records
  [Figure: an RDD of names (ben, dzimitry, hutomo, ivan, kevin, pierre,
  quan, randolf, …) split into partitions of records]

● you can only create an RDD from input files in storage or from
  another RDD
  [Figure: storage → RDD, or RDD → RDD]
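For example, both ways of creating an RDD look roughly like this in PySpark; the paths, names, and app name are hypothetical.

from pyspark import SparkContext

sc = SparkContext(appName="rdd-creation")

# an RDD created from an input file in stable storage
names = sc.textFile("hdfs:///data/names.txt")

# an RDD created from another RDD via a transformation;
# the parent RDD itself is never modified (read-only)
upper = names.map(lambda name: name.upper())

# an RDD can also be created from a local collection,
# partitioned across the cluster
small = sc.parallelize(["ben", "ivan", "kevin", "quan"], numSlices=2)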


RDD (cont.)
What’s good about RDD again?

● RDD is read-only (immutable).
  Thus, it cannot have changed since it was created.

● That means
if we just record how the RDD got created
from its parent RDD (lineage),
it becomes fault-tolerant!
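In PySpark you can inspect the recorded lineage with toDebugString. A small sketch, reusing the sc from the earlier example; the file path is hypothetical.

# each transformation records how the new RDD derives from its parents,
# so a lost partition can be recomputed instead of being replicated
words = (sc.textFile("hdfs:///data/book.txt")
           .flatMap(lambda line: line.split())
           .map(lambda w: (w, 1)))

# print the recorded lineage (returned as bytes in recent PySpark versions)
print(words.toDebugString())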
RDD (cont.)
But how do you code in Spark using RDD?

● Coding in Spark is
creating a lineage of RDDs
in a directed acyclic graph (DAG) form.

[Figure: an example DAG of RDDs, with several data sources flowing
through Map, Match, Union, Group, and Reduce operators into a data sink]
RDD Operators
Transformations & Actions
Lazy Execution

[Figure: a transformation turns an RDD into another RDD;
an action turns an RDD into a value]

● Transformation functions simply create a lineage of RDDs.


● An action function that gets called in the end triggers the
computation of the whole lineage of transformation functions
and outputs the final value.
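A small PySpark example of this lazy behavior; the app name and the particular transformations are hypothetical choices.

from pyspark import SparkContext

sc = SparkContext(appName="lazy-execution")

nums = sc.parallelize(range(1_000_000))

# transformations only extend the lineage; nothing has run yet
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# the action triggers the whole lineage and returns a value to the driver
print(evens.count())  # 500000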
Two Types of Dependencies
Narrow Dependency
● The task can be done in one node.
● No need to send data over the network to complete the task.
● Fast.

Wide Dependency
● The task needs a shuffle.
● Need to pull data from other nodes over the network.
● Slow.
● Use wide dependencies wisely.
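For instance, in PySpark (reusing sc from the earlier sketches), map and filter create narrow dependencies, while reduceByKey introduces a wide one; the sample data is hypothetical.

pairs = sc.parallelize([("cloud", 1), ("data", 1), ("cloud", 1)])

# narrow dependencies: each output partition depends on a single input
# partition, so no data has to move between nodes
doubled = pairs.map(lambda kv: (kv[0], kv[1] * 2))
positive = doubled.filter(lambda kv: kv[1] > 0)

# wide dependency: aggregating by key needs a shuffle that pulls data
# over the network; reduceByKey at least combines values on each node
# before shuffling, so prefer it over groupByKey
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('cloud', 2), ('data', 1)]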
Job Scheduling
● One job contains one action
function and possibly many
transformation functions.
● A job is represented by the DAG
of RDDs.
● Compute the job following the
DAG.
● A new stage gets created if an RDD
requires a shuffle from its input
RDD.
Task Distribution
● Similar to MapReduce.
● One master, multiple workers.
● One RDD is divided into multiple partitions.
How fast is Spark?

[Figure: per-iteration running times for Hadoop, HadoopBM, and Spark]

● Skip the first iteration, since it's just text parsing.
● In later iterations, Spark is much faster (black bar).
● HadoopBM writes intermediate data to memory, not HDFS.
What if the number of nodes
increases?
Apache Spark Ecosystem
References
● Resilient Distributed Datasets: A Fault-Tolerant Abstraction
for In-Memory Cluster Computing - Matei Zaharia, et al. -
2012
● https://databricks.com/spark/about
● http://www.slideshare.net/yongho/rdd-paper-review
● https://www.youtube.com/watch?v=dmL0N3qfSc8
● http://www.tothenew.com/blog/spark-1o3-spark-internals/
● https://trongkhoanguyenblog.wordpress.com/2014/11/27/understand-rdd-operations-transformations-and-actions/
