
Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

Map-Reduce and the New Software Stack
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University

http://www.mmds.org

MapReduce

Much of the course will be devoted to large-scale computing for data mining.
Challenges:
How to distribute computation?
Distributed/parallel programming is hard.
Map-Reduce addresses all of the above:
Google's computational/data manipulation model
Elegant way to work with big data

Single Node Architecture

[Diagram: a single machine with CPU, memory, and disk. Machine learning and statistics work with data in memory; classical data mining works with data on disk.]

Motivation: Google Example

20+ billion web pages x 20KB = 400+ TB
1 computer reads 30-35 MB/sec from disk
~4 months to read the web
~1,000 hard drives to store the web
Takes even more to do something useful with the data!
Today, a standard architecture for such problems is emerging:
Cluster of commodity Linux nodes
Commodity network (Ethernet) to connect them
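
A quick back-of-the-envelope check of the figures above (all rates are the slide's rough estimates):

pages = 20e9                          # 20+ billion web pages
bytes_per_page = 20e3                 # ~20 KB each
total_bytes = pages * bytes_per_page  # 4e14 bytes = 400 TB

read_rate = 35e6                      # ~30-35 MB/sec from one disk
days = total_bytes / read_rate / 86400
print(round(days))                    # ~132 days, i.e. roughly 4 months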

Cluster Architecture

2-10 Gbps backbone between racks
1 Gbps between any pair of nodes in a rack
Each rack contains 16-64 nodes

[Diagram: racks of commodity nodes (CPU, memory, disk), connected by a switch within each rack and a backbone switch between racks.]

In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/Shh0RO

Large-scale Computing

Large-scale computing for data mining problems on commodity hardware
Challenges:
How do you distribute computation?
How can we make it easy to write distributed programs?
Machines fail:
One server may stay up 3 years (1,000 days)
If you have 1,000 servers, expect to lose 1/day
People estimated Google had ~1M machines in 2011
1,000 machines fail every day!

Idea and Solution

Issue: copying data over a network takes time
Idea:
Bring computation close to the data
Store files multiple times for reliability
Map-reduce addresses these problems:
Google's computational/data manipulation model
Elegant way to work with big data
Storage infrastructure: file system
Google: GFS. Hadoop: HDFS
Programming model: Map-Reduce

Storage Infrastructure

Problem: if nodes fail, how to store data persistently?
Answer: Distributed File System
Provides global file namespace
Google GFS; Hadoop HDFS
Typical usage pattern:
Huge files (100s of GB to TB)
Data is rarely updated in place
Reads and appends are common

Distributed File System

Chunk servers:
File is split into contiguous chunks
Typically each chunk is 16-64 MB
Each chunk replicated (usually 2x or 3x)
Try to keep replicas in different racks
Master node:
a.k.a. Name Node in Hadoop's HDFS
Stores metadata about where files are stored
Might be replicated
Client library for file access:
Talks to master to find chunk servers
Connects directly to chunk servers to access data

Distributed File System

Reliable distributed file system:
Data kept in chunks spread across machines
Each chunk replicated on different machines
Seamless recovery from disk or machine failure

[Diagram: file chunks and their replicas spread across chunk servers 1 through N.]

Bring computation directly to the data!
Chunk servers also serve as compute servers

Programming Model: MapReduce

Warm-up task:
We have a huge text document
Count the number of times each distinct word appears in the file

Sample application:
Analyze web server logs to find popular URLs

Task: Word Count

Case 1:
File too large for memory, but all <word, count> pairs fit in memory

Case 2:
Count occurrences of words:
words(doc.txt) | sort | uniq -c
where words takes a file and outputs the words in it, one per line

Case 2 captures the essence of MapReduce
Great thing is that it is naturally parallelizable
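
As an aside, words above could be as simple as the following sketch (hypothetical script name words.py; whitespace tokenization is an assumption), making the pipeline runnable as python words.py doc.txt | sort | uniq -c:

import sys

# Print every whitespace-separated word in the file, one per line
with open(sys.argv[1]) as f:
    for line in f:
        for word in line.split():
            print(word)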

MapReduce: Overview

Sequentially read a lot of data
Map: extract something you care about
Group by key: sort and shuffle
Reduce: aggregate, summarize, filter or transform
Write the result

Outline stays the same; Map and Reduce change to fit the problem

MapReduce: The Map Step

[Diagram: each input key-value pair is fed to a map call, producing intermediate key-value pairs.]

MapReduce: The Reduce Step

[Diagram: intermediate key-value pairs are grouped by key into key-value groups; each group is fed to a reduce call, producing output key-value pairs.]

More Specifically

Input: a set of key-value pairs
Programmer specifies two methods:

Map(k, v) → <k', v'>*
Takes a key-value pair and outputs a set of key-value pairs
E.g., key is the filename, value is a single line in the file
There is one Map call for every (k, v) pair

Reduce(k', <v'>*) → <k'', v''>*
All values v' with the same key k' are reduced together and processed in v' order
There is one Reduce function call per unique key k'
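
A minimal, single-machine sketch of this contract (the names map_fn, reduce_fn, and map_reduce are illustrative, not part of any framework; there is no distribution or fault tolerance here):

from collections import defaultdict

def map_reduce(map_fn, reduce_fn, inputs):
    # Map: one call per input (k, v) pair, each emitting intermediate pairs
    intermediate = []
    for k, v in inputs:
        intermediate.extend(map_fn(k, v))

    # Group by key: collect all values sharing an intermediate key k'
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce: one call per unique intermediate key
    output = []
    for k2, values in groups.items():
        output.extend(reduce_fn(k2, values))
    return output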

MapReduce: Word Counting

Big document (sequentially read the data):
"The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. 'The work we're doing now -- the robotics we're doing -- is what we're going to need ..."

MAP (provided by the programmer): read input and produce a set of (key, value) pairs:
(The, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), ...

Group by key: collect all pairs with the same key:
(crew, 1), (crew, 1), (space, 1), (the, 1), (the, 1), (the, 1), (shuttle, 1), (recently, 1), ...

Reduce (provided by the programmer): collect all values belonging to the key and output:
(crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), ...

Only sequential reads.

Word Count Using MapReduce

map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
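
For a concrete illustration, the same logic written against the hypothetical map_reduce driver sketched earlier (single-machine, not a framework API):

def wc_map(doc_name, text):
    # One (word, 1) pair per occurrence, mirroring the pseudocode above
    return [(w, 1) for w in text.split()]

def wc_reduce(word, counts):
    # Sum all the 1s emitted for this word
    return [(word, sum(counts))]

print(map_reduce(wc_map, wc_reduce, [("doc.txt", "the crew of the space shuttle")]))
# [('the', 2), ('crew', 1), ('of', 1), ('space', 1), ('shuttle', 1)]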

Map-Reduce: Environment

The Map-Reduce environment takes care of:
Partitioning the input data
Scheduling the program's execution across a set of machines
Performing the group-by-key step
Handling machine failures
Managing required inter-machine communication

Map-Reduce: A Diagram

[Diagram: a big document flows through MAP (read input and produce a set of key-value pairs), then Group by key (hash merge, shuffle, sort, partition), then Reduce (collect all values belonging to the key and output).]

Map-Reduce: In Parallel

[Diagram: all phases are distributed, with many map and reduce tasks running in parallel.]

Map-Reduce

Programmer specifies: Map and Reduce and input files

Workflow:
Read inputs as a set of key-value pairs
Map transforms input kv-pairs into a new set of k'v'-pairs
Sort & shuffle the k'v'-pairs to output nodes
All k'v'-pairs with a given k' are sent to the same Reduce
Reduce processes all k'v'-pairs grouped by key into new k''v''-pairs
Write the resulting pairs to files

All phases are distributed, with many tasks doing the work.

[Diagram: Input 0..2 -> Map 0..2 -> Shuffle -> Reduce 0..1 -> Out 0..1]

Data Flow

Input and final output are stored on a distributed file system (DFS):
Scheduler tries to schedule map tasks close to physical storage location of input data

Intermediate results are stored on local FS of Map and Reduce workers

Output is often input to another MapReduce task

Coordination: Master

Master node takes care of coordination:
Task status: (idle, in-progress, completed)
Idle tasks get scheduled as workers become available
When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
Master pushes this info to reducers

Master pings workers periodically to detect failures

Dealing with Failures

Map worker failure:
Map tasks completed or in-progress at worker are reset to idle
Reduce workers are notified when task is rescheduled on another worker

Reduce worker failure:
Only in-progress tasks are reset to idle
Reduce task is restarted

Master failure:
MapReduce task is aborted and client is notified

How many Map and Reduce jobs?

M map tasks, R reduce tasks
Rule of thumb:
Make M much larger than the number of nodes in the cluster
One DFS chunk per map is common
Improves dynamic load balancing and speeds up recovery from worker failures

Usually R is smaller than M
Because output is spread across R files

Task Granularity & Pipelining

Fine-granularity tasks: map tasks >> machines
Minimizes time for fault recovery
Can do pipeline shuffling with map execution
Better dynamic load balancing

Refinements: Backup Tasks

Problem:
Slow workers significantly lengthen the job completion time:
Other jobs on the machine
Bad disks
Weird things

Solution:
Near end of phase, spawn backup copies of tasks
Whichever one finishes first wins

Effect:
Dramatically shortens job completion time

Refinement: Combiners

Often a Map task will produce many pairs of the form (k, v1), (k, v2), ... for the same key k
E.g., popular words in the word count example

Can save network time by pre-aggregating values in the mapper:
combine(k, list(v1)) → v2
Combiner is usually the same as the reduce function

Works only if reduce function is commutative and associative
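
As a minimal illustration (reusing the hypothetical word-count functions from earlier): because addition is commutative and associative, the word-count reducer can be reused unchanged as a combiner on each mapper.

def wc_combine(word, partial_counts):
    # Pre-aggregate on the mapper: emit one (word, partial_sum) pair
    # instead of many (word, 1) pairs, so far less data is shuffled.
    return [(word, sum(partial_counts))]

# Summing partial sums at the reducer gives the same total as summing
# the raw 1s, which is exactly why this combiner is safe to apply.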

Refinement: Combiners

Back to our word counting example:
Combiner combines the values of all keys of a single mapper (single machine)
Much less data needs to be copied and shuffled!

Refinement: Partition Function

Want to control how keys get partitioned:
Inputs to map tasks are created by contiguous splits of input file
Reduce needs to ensure that records with the same intermediate key end up at the same worker

System uses a default partition function:
hash(key) mod R

Sometimes useful to override the hash function:
E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file
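
A sketch of the two partitioning choices (R, the function names, and the use of MD5 as a stable hash are illustrative assumptions, not a framework API):

import hashlib
from urllib.parse import urlparse

R = 8  # number of reduce tasks (illustrative)

def stable_hash(s):
    # Deterministic hash: Python's built-in hash() is salted per process,
    # which would scatter keys differently on different machines
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

def default_partition(key):
    # Default behavior: hash(key) mod R
    return stable_hash(key) % R

def host_partition(url):
    # Override: hash(hostname(URL)) mod R, so all URLs from the same
    # host land in the same reduce task and hence the same output file
    return stable_hash(urlparse(url).netloc) % R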

Problems Suited for Map-Reduce

Example: Host size
Suppose we have a large web corpus
Look at the metadata file
Lines of the form: (URL, size, date, ...)
For each host, find the total number of bytes
That is, the sum of the page sizes for all URLs from that particular host

Other examples:
Link analysis and graph processing
Machine learning algorithms
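
A hedged sketch of Map and Reduce for the host-size example, assuming metadata lines of the form "URL, size, date, ..." (function names are illustrative; the map_reduce driver sketched earlier could run them):

from urllib.parse import urlparse

def host_size_map(_, line):
    # line: "URL, size, date, ..." from the metadata file
    url, size, *_rest = (field.strip() for field in line.split(","))
    return [(urlparse(url).netloc, int(size))]

def host_size_reduce(host, sizes):
    # Total bytes for this host = sum of its page sizes
    return [(host, sum(sizes))]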

Example: Language Model

Statistical machine translation:
Need to count the number of times every 5-word sequence occurs in a large corpus of documents

Very easy with MapReduce:
Map: extract (5-word sequence, count) from document
Reduce: combine the counts
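
A minimal sketch of such a Map and Reduce pair (illustrative names; whitespace tokenization is an assumption):

def five_gram_map(_, text):
    words = text.split()
    # Emit every 5-word sequence in the document with a count of 1
    return [(" ".join(words[i:i + 5]), 1) for i in range(len(words) - 4)]

def five_gram_reduce(gram, counts):
    # Combine the counts for this 5-word sequence across all documents
    return [(gram, sum(counts))]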

Example: Join By MapReduce

Compute the natural join R(A,B) ⋈ S(B,C)
R and S are each stored in files
Tuples are pairs (a,b) or (b,c)

R(A, B): (a1, b1), (a2, b1), (a3, b2), (a4, b3)
S(B, C): (b2, c1), (b2, c2), (b3, c3)
R ⋈ S:  (a3, b2, c1), (a3, b2, c2), (a4, b3, c3)

Map-Reduce Join

Use a hash function h from B-values to 1...k

A Map process turns:
Each input tuple R(a,b) into key-value pair (b, (a, R))
Each input tuple S(b,c) into (b, (c, S))

Map processes send each key-value pair with key b to Reduce process h(b)
Hadoop does this automatically; just tell it what k is.
Each Reduce process then matches the R-tuples and S-tuples that share the same b and outputs the joined tuples (a, b, c).
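
An illustrative single-machine sketch of this join, reusing the hypothetical map_reduce driver from earlier. The relation tag ("R" or "S") travels in the value, and grouping by b plays the role of sending every pair with key b to Reduce process h(b):

def join_map(_, tagged_tuple):
    # tagged_tuple is ("R", (a, b)) or ("S", (b, c))
    rel, tup = tagged_tuple
    if rel == "R":
        a, b = tup
        return [(b, ("R", a))]
    b, c = tup
    return [(b, ("S", c))]

def join_reduce(b, values):
    # Match every R-tuple with every S-tuple that shares this b
    a_values = [x for rel, x in values if rel == "R"]
    c_values = [x for rel, x in values if rel == "S"]
    return [(a, b, c) for a in a_values for c in c_values]

inputs = [(None, ("R", t)) for t in [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]] + \
         [(None, ("S", t)) for t in [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]]
print(map_reduce(join_map, join_reduce, inputs))
# [('a3', 'b2', 'c1'), ('a3', 'b2', 'c2'), ('a4', 'b3', 'c3')]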

Cost Measures for Algorithms

In MapReduce we quantify the cost of an algorithm using:
1. Communication cost = total I/O of all processes
2. Elapsed communication cost = max of I/O along any path
3. (Elapsed) computation cost: analogous, but count only running time of processes

Note that here the big-O notation is not the most useful (adding more machines is always an option).

Example: Cost Measures

For a map-reduce algorithm:
Communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes
Elapsed communication cost is the sum of the largest input + output for any map process, plus the same for any reduce process

What Cost Measures Mean

Either the I/O (communication) or processing (computation) cost dominates
Ignore one or the other

Total cost tells what you pay in rent from your friendly neighborhood cloud

Elapsed cost is wall-clock time using parallelism

Cost of Map-Reduce Join

Total communication cost = O(|R| + |S| + |R ⋈ S|)
Elapsed communication cost = O(s)
We're going to pick k and the number of Map processes so that the I/O limit s is respected
We put a limit s on the amount of input or output that any one process can have. s could be:
What fits in main memory
What fits on local disk

With proper indexes, computation cost is linear in the input + output size
So computation cost is like communication cost

Pointers and Further Reading

Implementations:

Google
Not available outside Google

Hadoop
An open-source implementation in Java
Uses HDFS for stable storage
Download: http://lucene.apache.org/hadoop/

Aster Data
Cluster-optimized SQL database that also implements MapReduce

Cloud Computing

Ability to rent computing by the hour
Additional services, e.g., persistent storage

Amazon's Elastic Compute Cloud (EC2)

Aster Data and Hadoop can both be run on EC2

For CS341 (offered next quarter)

Reading

Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
http://labs.google.com/papers/mapreduce.html

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File System
http://labs.google.com/papers/gfs.html

Resources

Hadoop Wiki
Introduction
http://wiki.apache.org/lucene-hadoop/
Getting Started
http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
Map/Reduce Overview
http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses
Eclipse Environment
http://wiki.apache.org/lucene-hadoop/EclipseEnvironment

Resources

Releases from Apache download mirrors
http://www.apache.org/dyn/closer.cgi/lucene/hadoop/

Nightly builds of source
http://people.apache.org/dist/lucene/hadoop/nightly/

Source code from subversion
http://lucene.apache.org/hadoop/version_control.html

Further Reading

Programming model inspired by functional language primitives
Partitioning/shuffling similar to many large-scale sorting systems
NOW-Sort ['97]
Re-execution for fault tolerance
BAD-FS ['04] and TACC ['97]
Locality optimization has parallels with Active Disks/Diamond work
Active Disks ['01], Diamond ['04]
Backup tasks similar to Eager Scheduling in the Charlotte system
Charlotte ['96]
Dynamic load balancing solves a similar problem as River's distributed queues
River ['99]
