MapReduce - 1
◾ Organizations would like to conserve the amount of floor space dedicated to their computer infrastructure
◾ Data center: a facility used to house computer systems and associated components, such as networking and storage systems, cooling, uninterruptible power supplies (UPS), and air filters
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 8
◾ A framework for distributed processing of large data sets across clusters of commodity computers using simple programming models
◾ Problem: if nodes fail, how can data be stored persistently?
◾ Answer: a distributed file system
  Provides a global file namespace
  Examples: Google GFS; Hadoop HDFS
◾ Typical usage pattern
  Huge files (100s of GB to TB)
  Data is rarely updated in place
  Reads and appends are common
◾ Chunk servers
  Each file is split into contiguous chunks
  Chunks are typically 16–64 MB (sizes have grown; HDFS's default block size is now 128 MB)
  Each chunk is replicated, usually 2x or 3x (configured in hdfs-site.xml)
  Replicas are kept in different racks when possible
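The replication factor mentioned above is set per cluster (or per file) through the `dfs.replication` property in hdfs-site.xml. A minimal fragment requesting 3x replication might look like:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```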
◾ Master node
  a.k.a. Name Node in Hadoop's HDFS
  Stores metadata about where files are stored
  Might be replicated
◾ Client library for file access
  Talks to the master to find chunk servers
  Connects directly to chunk servers to access data
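The read path described above (client asks the master only for metadata, then streams data directly from chunk servers) can be sketched as a toy in-memory model. This is an illustration, not the GFS/HDFS API; the names `Master`, `ChunkServer`, and `read_file` are hypothetical.

```python
class ChunkServer:
    """Holds the actual chunk bytes."""
    def __init__(self):
        self.chunks = {}            # chunk id -> bytes

    def read(self, chunk_id):
        return self.chunks[chunk_id]


class Master:
    """Holds only metadata: which servers store each chunk of each file."""
    def __init__(self):
        self.locations = {}         # (file, chunk index) -> [ChunkServer, ...]

    def lookup(self, file, index):
        return self.locations[(file, index)]


def read_file(master, file, num_chunks):
    # Client: ask the master where each chunk lives, then fetch the
    # bytes directly from the first available replica. Data never
    # flows through the master.
    data = b""
    for i in range(num_chunks):
        replicas = master.lookup(file, i)
        data += replicas[0].read((file, i))
    return data


# Usage: a 2-chunk file "f", each chunk replicated on two servers.
s1, s2 = ChunkServer(), ChunkServer()
s1.chunks[("f", 0)] = s2.chunks[("f", 0)] = b"hello "
s1.chunks[("f", 1)] = s2.chunks[("f", 1)] = b"world"
m = Master()
m.locations[("f", 0)] = [s1, s2]
m.locations[("f", 1)] = [s2, s1]
content = read_file(m, "f", 2)      # b"hello world"
```

Keeping the master metadata-only is what lets a single master coordinate a cluster: it answers small lookups while the heavy data traffic goes server-to-client.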
[Figure: chunks C0–C5 and D0, D1 distributed and replicated across Chunk servers 1, 2, 3, …, N, with replicas of each chunk placed on different servers]
[Figure: several map tasks each read input records and emit intermediate (k, v) pairs, which are then grouped by key]
◾ Example: counting word occurrences in a big document

Big document (input data):
"The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. 'The work we're doing now -- the robotics we're doing -- is what we're going to need …'"

MAP reads the input and produces a set of (key, value) pairs:
(The, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), …

Group by key (only sequential reads): collect all pairs with the same key:
(crew, [1, 1]), (space, [1, 1]), (the, [1, 1, 1]), (recently, [1]), …

REDUCE collects the values belonging to each key and combines them:
(crew, 2), (space, 2), (the, 3), (recently, 1), …
map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
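The pseudocode above can be run end-to-end on a single machine with a small Python sketch. This only simulates the MapReduce data flow (map, group by key, reduce); the names `map_fn`, `reduce_fn`, and `word_count` are illustrative, not a Hadoop API.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: text of the document
    for w in value.split():
        yield (w, 1)

def reduce_fn(key, values):
    # key: a word; values: an iterable of counts
    return (key, sum(values))

def word_count(documents):
    # Map phase, then group intermediate pairs by key (the "shuffle").
    groups = defaultdict(list)
    for name, text in documents.items():
        for k, v in map_fn(name, text):
            groups[k].append(v)
    # Reduce phase: one reduce_fn call per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = word_count({"doc1": "the crew of the space shuttle"})
# counts == {"the": 2, "crew": 1, "of": 1, "space": 1, "shuttle": 1}
```

In a real cluster the grouping step is the distributed shuffle: map outputs are partitioned by key so that all counts for a given word arrive at the same reduce task.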