W7 Lecture Notes

MapReduce

Rajiv Misra
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
[email protected]
Introduction

• MapReduce is a programming model and an associated implementation for processing and generating large data sets.
• Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
• Many real-world tasks are expressible in this model.
Contd…

• Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.
• The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication.
• This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Contd…

• A typical MapReduce computation processes many terabytes of data on thousands of machines.
• Hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
Single-node architecture

[Diagram: a single node with CPU, memory and disk, the setting for machine learning, statistics and “classical” data mining on data that fits on one machine.]
Commodity Clusters
• Web data sets can be very large
  • Tens to hundreds of terabytes
• Cannot mine on a single server (why?)
• Standard architecture emerging:
  • Cluster of commodity Linux nodes
  • Gigabit Ethernet interconnect
• How to organize computations on this architecture?
  • Mask issues such as hardware failure
Cluster Architecture
[Diagram: racks of nodes, each node with CPU, memory and disk, connected through a switch per rack; 1 Gbps between any pair of nodes in a rack, 2-10 Gbps backbone between racks. Each rack contains 16-64 nodes.]
Stable storage
• First-order problem: if nodes can fail, how can we store data persistently?
• Answer: Distributed File System
  • Provides global file namespace
  • Google GFS; Hadoop HDFS; Kosmix KFS
• Typical usage pattern
  • Huge files (100s of GB to TB)
  • Data is rarely updated in place
  • Reads and appends are common
Distributed File System
• Chunk servers
  • File is split into contiguous chunks
  • Typically each chunk is 16-64 MB
  • Each chunk replicated (usually 2x or 3x)
  • Try to keep replicas in different racks
• Master node
  • a.k.a. NameNode in HDFS
  • Stores metadata
  • Might be replicated
• Client library for file access
  • Talks to master to find chunk servers
  • Connects directly to chunk servers to access data
Motivation for Map Reduce (Why)
• Large-Scale Data Processing
  • Want to use 1000s of CPUs
  • But don't want the hassle of managing things
• MapReduce architecture provides
  • Automatic parallelization & distribution
  • Fault tolerance
  • I/O scheduling
  • Monitoring & status updates
Programming Model

• The computation takes a set of input key/value pairs, and produces a set of output key/value pairs.
• The user of the MapReduce library expresses the computation as two functions:
  (i) The Map
  (ii) The Reduce
(i) Map Abstraction

• Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs.
• The MapReduce library groups together all intermediate values associated with the same intermediate key ‘I’ and passes them to the Reduce function.
(ii) Reduce Abstraction
• The Reduce function, also written by the user, accepts an intermediate key ‘I’ and a set of values for that key.
• It merges together these values to form a possibly smaller set of values.
• Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator.
• This allows us to handle lists of values that are too large to fit in memory.
Map-Reduce Functions for Word Count
map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
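The same logic can be sketched as a small, runnable Scala program on in-memory collections, which makes the map, shuffle and reduce phases concrete (LocalWordCount, mapFn and reduceFn are illustrative names, not part of any MapReduce API):

object LocalWordCount {
  // "map": emit a (word, 1) pair for every word in a document
  def mapFn(docName: String, text: String): Seq[(String, Int)] =
    text.split("\\s+").toSeq.filter(_.nonEmpty).map(w => (w, 1))

  // "reduce": sum all counts emitted for one word
  def reduceFn(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val docs = Seq("doc1" -> "see bob run", "doc2" -> "see spot throw")
    val intermediate = docs.flatMap { case (name, text) => mapFn(name, text) }
    // the framework's shuffle step: group intermediate values by key
    val grouped = intermediate.groupBy(_._1)
    val output  = grouped.map { case (w, pairs) => reduceFn(w, pairs.map(_._2)) }
    output.foreach(println)   // (see,2), (bob,1), (run,1), (spot,1), (throw,1)
  }
}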
Map-Reduce Functions

• Input: a set of key/value pairs
• User supplies two functions:
  map(k, v) => list(k1, v1)
  reduce(k1, list(v1)) => v2
• (k1, v1) is an intermediate key/value pair
• Output is the set of (k1, v2) pairs
Applications
• Here are a few simple applications of interesting programs that can be easily expressed as MapReduce computations.
• Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
• Count of URL Access Frequency: The map function processes logs of web page requests and outputs (URL; 1). The reduce function adds together all values for the same URL and emits a (URL; total count) pair.
• Reverse Web-Link Graph: The map function outputs (target; source) pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair (target; list(source)).
Contd…
• Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word; frequency) pairs.
• The map function emits a (hostname; term vector) pair for each input document (where the hostname is extracted from the URL of the document).
• The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final (hostname; term vector) pair.
Contd…
• Inverted Index: The map function parses each document, and emits a sequence of (word; document ID) pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a (word; list(document ID)) pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
• Distributed Sort: The map function extracts the key from each record, and emits a (key; record) pair. The reduce function emits all pairs unchanged.
Implementation Overview
Implementation Overview
• Many different implementations of the MapReduce interface are possible. The right choice depends on the environment.
• For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines.
• Here we describe an implementation targeted to the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet.
Contd…
(1) Machines are typically dual-processor x86 machines running Linux, with 2-4 GB of memory per machine.
(2) Commodity networking hardware is used: typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.
(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.
(4) Storage is provided by inexpensive IDE disks attached directly to individual machines.
(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.
Distributed Execution Overview
• The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits.
• The input splits can be processed in parallel by different machines.
• Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R).
• The number of partitions (R) and the partitioning function are specified by the user.
• Figure 1 shows the overall flow of a MapReduce operation.
Distributed Execution Overview

[Figure 1: Overall flow of a MapReduce operation. (1) The user program forks the master and worker processes; (2) the master assigns map and reduce tasks; (3) map workers read their input splits; (4) intermediate results are written to local disk; (5) reduce workers remotely read and sort the intermediate data; (6) reduce workers write the output files.]
Sequence of Actions
When the user program calls the MapReduce function, the following sequence of actions occurs:
1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.
2. One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
Contd…
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function.
• The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.
• The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.
Contd…
6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function.
• The output of the Reduce function is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program.
• At this point, the MapReduce call in the user program returns back to the user code.
Contd…
• After successful completion, the output of the MapReduce execution is available in the R output files (one per reduce task, with file names as specified by the user).
• Typically, users do not need to combine these R output files into one file; they often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.
Master Data Structures
• The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).
• The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task.
• Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.
Fault Tolerance
• Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.
• Map worker failure
  • Map tasks completed or in-progress at the worker are reset to idle
  • Reduce workers are notified when a task is rescheduled on another worker
• Reduce worker failure
  • Only in-progress tasks are reset to idle
• Master failure
  • MapReduce task is aborted and client is notified
Locality
• Network bandwidth is a relatively scarce resource in the computing environment.
• We can conserve network bandwidth by taking advantage of the fact that the input data (managed by GFS) is stored on the local disks of the machines that make up our cluster.
• GFS divides each file into 64 MB blocks, and stores several copies of each block (typically 3 copies) on different machines.
Contd…
• The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data.
• Failing that, it attempts to schedule a map task near a replica of that task's input data (e.g., on a worker machine that is on the same network switch as the machine containing the data).
• When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.
Task Granularity
• The Map phase is subdivided into M pieces and the Reduce phase into R pieces.
• Ideally, M and R should be much larger than the number of worker machines.
• Having each worker perform many different tasks improves dynamic load balancing, and also speeds up recovery when a worker fails: the many map tasks it has completed can be spread out across all the other worker machines.
• There are practical bounds on how large M and R can be, since the master must make O(M + R) scheduling decisions and keeps O(M * R) state in memory.
• Furthermore, R is often constrained by users because the output of each reduce task ends up in a separate output file.
Partition Function

• Inputs to map tasks are created by contiguous splits of the input file.
• For reduce, we need to ensure that records with the same intermediate key end up at the same worker.
• System uses a default partition function, e.g., hash(key) mod R.
• Sometimes useful to override:
  • E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file (see the sketch below).
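As a rough sketch (plain functions, not the actual Google/Hadoop partitioner API), the default and the overridden partition functions could look like this, where R is the number of reduce tasks:

import java.net.URI

// Default: hash(key) mod R
def defaultPartition(key: String, R: Int): Int =
  Math.floorMod(key.hashCode, R)

// Override: all URLs from the same host go to the same reducer,
// and therefore into the same output file.
def hostPartition(url: String, R: Int): Int = {
  val host = Option(new URI(url).getHost).getOrElse(url)   // hostname(URL)
  Math.floorMod(host.hashCode, R)
}

// Example: both pages of the same host land in the same partition:
// hostPartition("http://example.com/a", 10) == hostPartition("http://example.com/b", 10)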
Ordering Guarantees
• We guarantee that within a given partition, the intermediate key/value pairs are processed in increasing key order.
• This ordering guarantee makes it easy to generate a sorted output file per partition, which is useful when the output file format needs to support efficient random access lookups by key, or when users of the output find it convenient to have the data sorted.
Combiners Function (1)
• In some cases, there is significant repetition in the intermediate keys produced by each map task, and the user-specified Reduce function is commutative and associative.
• A good example of this is the word counting example. Since word frequencies tend to follow a Zipf distribution, each map task will produce hundreds or thousands of records of the form <the, 1>.
• All of these counts will be sent over the network to a single reduce task and then added together by the Reduce function to produce one number. We allow the user to specify an optional Combiner function that does partial merging of this data before it is sent over the network.
Combiners Function (2)
• The Combiner function is executed on each machine that performs a map task.
• Typically the same code is used to implement both the combiner and the reduce functions.
• The only difference between a reduce function and a combiner function is how the MapReduce library handles the output of the function.
• The output of a reduce function is written to the final output file. The output of a combiner function is written to an intermediate file that will be sent to a reduce task.
• Partial combining significantly speeds up certain classes of MapReduce operations.
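A minimal Scala sketch (illustrative only, not a framework API) shows what a combiner does for word count: it partially sums the counts on the map side so far fewer records cross the network, and because the reduce function (sum) is commutative and associative the final result is unchanged.

// Partially merge one map task's output before it is shuffled.
def combine(mapOutput: Seq[(String, Int)]): Seq[(String, Int)] =
  mapOutput
    .groupBy(_._1)
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
    .toSeq

// One map task emitting <the, 1> many times ...
val emitted = Seq.fill(3)(("the", 1)) ++ Seq(("spark", 1), ("the", 1))
// ... sends only one record per word after combining:
println(combine(emitted))   // List((the,4), (spark,1)) in some order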
Examples
Example 1: Word Count using MapReduce

map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
Count Illustrated
map(key=url, val=contents):
  For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
  Sum all “1”s in values list
  Emit result “(word, sum)”

Input:               Map output:      Reduce output:
  see bob run          (see, 1)         (bob, 1)
  see spot throw       (bob, 1)         (run, 1)
                       (run, 1)         (see, 2)
                       (see, 1)         (spot, 1)
                       (spot, 1)        (throw, 1)
                       (throw, 1)
Example 2: Counting words of different lengths

• The map function takes a value and outputs key:value pairs.
• For instance, if we define a map function that takes a string and outputs the length of the word as the key and the word itself as the value, then
  • map(steve) would return 5:steve and
  • map(savannah) would return 8:savannah.
• This allows us to run the map function against values in parallel and provides a huge advantage.
Example 2: Counting words of different lengths
Before we get to the reduce function, the MapReduce framework groups all of the values together by key. If the map functions output the following key:value pairs:

  3 : the
  3 : and
  3 : you
  4 : then
  4 : what
  4 : when
  5 : steve
  5 : where
  8 : savannah
  8 : research

they get grouped as:

  3 : [the, and, you]
  4 : [then, what, when]
  5 : [steve, where]
  8 : [savannah, research]
Example 2: Counting words of different lengths
• Each of these lines would then be passed as an argument to the reduce function, which accepts a key and a list of values.
• In this instance, we might be trying to figure out how many words of certain lengths exist, so our reduce function will just count the number of items in the list and output the key with the size of the list, like:

  3 : 3
  4 : 3
  5 : 2
  8 : 2
Example 2: Counting words of different lengths

• The reductions can also be done in parallel, again providing a huge advantage. We can then look at these final results and see that there were only two words of length 5 in the corpus, etc. (a small runnable sketch follows below).
• The most common example of MapReduce is counting the number of times words occur in a corpus.
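The whole example can be sketched with plain Scala collections (illustrative only, not tied to any MapReduce framework):

val words = Seq("the", "and", "you", "then", "what", "when",
                "steve", "where", "savannah", "research")

// map: word -> (length, word)
val mapped = words.map(w => (w.length, w))

// shuffle: group the values by key
val grouped = mapped.groupBy(_._1)           // 3 -> (the, and, you), ...

// reduce: (length, list of words) -> (length, count)
val counts = grouped.map { case (len, pairs) => (len, pairs.size) }

println(counts.toSeq.sortBy(_._1))           // List((3,3), (4,3), (5,2), (8,2))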
Example 3: Finding Friends
• Facebook has a list of friends (note that friends are a bi-directional thing on Facebook: if I'm your friend, you're mine).
• They also have lots of disk space and they serve hundreds of millions of requests every day. They've decided to pre-compute calculations when they can to reduce the processing time of requests. One common processing request is the "You and Joe have 230 friends in common" feature.
• When you visit someone's profile, you see a list of friends that you have in common. This list doesn't change frequently, so it'd be wasteful to recalculate it every time you visited the profile (sure, you could use a decent caching strategy, but then we wouldn't be able to continue writing about MapReduce for this problem).
• We're going to use MapReduce so that we can calculate everyone's common friends once a day and store those results. Later on it's just a quick lookup. We've got lots of disk; it's cheap.
Example 3: Finding Friends
• Assume the friends are stored as Person -> [List of Friends]; our friends list is then:

  A -> B C D
  B -> A C D E
  C -> A B D E
  D -> A B C E
  E -> B C D
Example 3: Finding Friends
For map(A -> B C D):
  (A B) -> B C D
  (A C) -> B C D
  (A D) -> B C D

For map(B -> A C D E): (Note that A comes before B in the key)
  (A B) -> A C D E
  (B C) -> A C D E
  (B D) -> A C D E
  (B E) -> A C D E
Example 3: Finding Friends
For map(C -> A B D E):
  (A C) -> A B D E
  (B C) -> A B D E
  (C D) -> A B D E
  (C E) -> A B D E

For map(D -> A B C E):
  (A D) -> A B C E
  (B D) -> A B C E
  (C D) -> A B C E
  (D E) -> A B C E

And finally for map(E -> B C D):
  (B E) -> B C D
  (C E) -> B C D
  (D E) -> B C D
Example 3: Finding Friends
• Before we send these key-value pairs to the reducers, we group them by their keys and get:

  (A B) -> (A C D E) (B C D)
  (A C) -> (A B D E) (B C D)
  (A D) -> (A B C E) (B C D)
  (B C) -> (A B D E) (A C D E)
  (B D) -> (A B C E) (A C D E)
  (B E) -> (A C D E) (B C D)
  (C D) -> (A B C E) (A B D E)
  (C E) -> (A B D E) (B C D)
  (D E) -> (A B C E) (B C D)
Example 3: Finding Friends
• Each line will be passed as an argument to a reducer.
• The reduce function will simply intersect the lists of values and output the same key with the result of the intersection.
• For example, reduce((A B) -> (A C D E) (B C D)) will output (A B) : (C D), which means that friends A and B have C and D as common friends.
Example 3: Finding Friends
• The result after reduction is:

  (A B) -> (C D)
  (A C) -> (B D)
  (A D) -> (B C)
  (B C) -> (A D E)
  (B D) -> (A C E)
  (B E) -> (C D)
  (C D) -> (A B E)
  (C E) -> (B D)
  (D E) -> (B C)

• Now when D visits B's profile, we can quickly look up (B D) and see that they have three friends in common: (A C E).
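The same computation can be sketched locally in Scala (illustrative only; in production this would run as MapReduce jobs over the real friend lists):

val friends = Map(
  "A" -> Set("B", "C", "D"),
  "B" -> Set("A", "C", "D", "E"),
  "C" -> Set("A", "B", "D", "E"),
  "D" -> Set("A", "B", "C", "E"),
  "E" -> Set("B", "C", "D"))

// map: for person p with friend list fs, emit ((p, f) sorted, fs) for each friend f
val mapped = friends.toSeq.flatMap { case (p, fs) =>
  fs.toSeq.map { f =>
    val key = if (p < f) (p, f) else (f, p)   // A comes before B in the key
    (key, fs)
  }
}

// shuffle: group the two friend lists emitted for each pair
val grouped = mapped.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

// reduce: intersect the lists to get the pair's common friends
val common = grouped.map { case (pair, lists) => (pair, lists.reduce(_ intersect _)) }

println(common(("B", "D")))   // Set(A, C, E)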
Reading
Jeffrey Dean and Sanjay Ghemawat,
“MapReduce: Simplified Data Processing on Large Clusters”
http://labs.google.com/papers/mapreduce.html
Conclusion

• The MapReduce programming model has been successfully used at Google for many different purposes.
• The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault-tolerance, locality optimization, and load balancing.
• A large variety of problems are easily expressible as MapReduce computations.
• For example, MapReduce is used for the generation of data for Google's production web search service, for sorting, for data mining, for machine learning, and many other systems.
HDFS and Spark

Rajiv Misra
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
[email protected]
The Hadoop Distributed File System (HDFS)
Introduction
• Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm.
• An important characteristic of Hadoop is the partitioning of data and computation across many (thousands) of hosts, and executing application computations in parallel close to their data.
• A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers. Hadoop clusters at Yahoo! span 25,000 servers and store 25 petabytes of application data, with the largest cluster being 3500 servers. One hundred other organizations worldwide report using Hadoop.
Contd…
• Hadoop is an Apache project; all components are available via the Apache open source license.
• Yahoo! has developed and contributed to 80% of the core of Hadoop (HDFS and MapReduce).
• HBase was originally developed at Powerset, now a department at Microsoft.
• Hive was originated and developed at Facebook.
• Pig, ZooKeeper, and Chukwa were originated and developed at Yahoo!
• Avro was originated at Yahoo! and is being co-developed with Cloudera.
Hadoop Project Components

HDFS: Distributed file system
MapReduce: Distributed computation framework
HBase: Column-oriented table service
Pig: Dataflow language and parallel execution framework
Hive: Data warehouse infrastructure
ZooKeeper: Distributed coordination service
Chukwa: System for collecting management data
Avro: Data serialization system

Table 1. Hadoop project components


Contd…
• HDFS is the file system component of Hadoop. While the interface to HDFS is patterned after the UNIX file system, faithfulness to standards was sacrificed in favour of improved performance for the applications at hand.
• HDFS stores file system metadata and application data separately.
• As in other distributed file systems, like PVFS, Lustre and GFS, HDFS stores metadata on a dedicated server, called the NameNode.
• Application data are stored on other servers called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols.
HDFS Design Assumptions

• Single machines tend to fail
  • Hard disk, power supply, …
• More machines = increased failure probability
• Data doesn't fit on a single node
• Desired:
  • Commodity hardware
  • Built-in backup and failover
Architecture
Namenode and Datanodes
• Namenode (Master)
  • Metadata:
    • Where file blocks are stored (namespace image)
    • Edit (operation) log
  • Secondary namenode (shadow master)
• Datanode (Chunkserver)
  • Stores and retrieves blocks, by client or namenode
  • Reports to namenode with the list of blocks they are storing
Noticeable Differences from GFS
• Only single-writers per file.
• No record append operation.
• Open source
  • Provides many interfaces and libraries for different file systems (S3, KFS, etc.)
  • Thrift (C++, Python, …), libhdfs (C), FUSE
A) Namenode
• The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes, which record attributes like permissions, modification and access times, namespace and disk space quotas.
• The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file) and each block of the file is independently replicated at multiple DataNodes (typically three, but user selectable file-by-file). The NameNode maintains the namespace tree and the mapping of file blocks to DataNodes (the physical location of file data).
• An HDFS client wanting to read a file first contacts the NameNode for the locations of data blocks comprising the file and then reads block contents from the DataNode closest to the client.
Contd…

• When writing data, the client requests the NameNode to nominate a suite of three DataNodes to host the block replicas.
• The client then writes data to the DataNodes in a pipeline fashion.
• The current design has a single NameNode for each cluster. The cluster can have thousands of DataNodes and tens of thousands of HDFS clients per cluster, as each DataNode may execute multiple application tasks concurrently.
Contd…
• HDFS keeps the entire namespace in RAM. The inode data and the list of blocks belonging to each file comprise the metadata of the name system, called the image. The persistent record of the image stored in the local host's native file system is called a checkpoint.
• The NameNode also stores the modification log of the image, called the journal, in the local host's native file system. For improved durability, redundant copies of the checkpoint and journal can be made at other servers.
• During restarts the NameNode restores the namespace by reading the namespace and replaying the journal. The locations of block replicas may change over time and are not part of the persistent checkpoint.
B) Datanode
• Each block replica on a DataNode is represented by two files in the local host's native file system. The first file contains the data itself and the second file is the block's metadata, including checksums for the block data and the block's generation stamp.
• The size of the data file equals the actual length of the block and does not require extra space to round it up to the nominal block size as in traditional file systems. Thus, if a block is half full it needs only half of the space of the full block on the local drive.
• During startup each DataNode connects to the NameNode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the DataNode. If either does not match that of the NameNode the DataNode automatically shuts down.
Contd…
• The namespace ID is assigned to the file system instance when it is formatted. The namespace ID is persistently stored on all nodes of the cluster. Nodes with a different namespace ID will not be able to join the cluster, thus preserving the integrity of the file system.
• The consistency of software versions is important because incompatible versions may cause data corruption or loss, and on large clusters of thousands of machines it is easy to overlook nodes that did not shut down properly prior to the software upgrade or were not available during the upgrade.
• A DataNode that is newly initialized and without any namespace ID is permitted to join the cluster and receive the cluster's namespace ID.
Contd…
• After the handshake the DataNode registers with the NameNode. DataNodes persistently store their unique storage IDs. The storage ID is an internal identifier of the DataNode, which makes it recognizable even if it is restarted with a different IP address or port. The storage ID is assigned to the DataNode when it registers with the NameNode for the first time and never changes after that.
• A DataNode identifies block replicas in its possession to the NameNode by sending a block report. A block report contains the block ID, the generation stamp and the length for each block replica the server hosts. The first block report is sent immediately after the DataNode registration. Subsequent block reports are sent every hour and provide the NameNode with an up-to-date view of where block replicas are located on the cluster.
Contd…

• During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available. The default heartbeat interval is three seconds.
• If the NameNode does not receive a heartbeat from a DataNode in ten minutes, the NameNode considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable.
• The NameNode then schedules creation of new replicas of those blocks on other DataNodes.
Contd…
• Heartbeats from a DataNode also carry information about total storage capacity, fraction of storage in use, and the number of data transfers currently in progress. These statistics are used for the NameNode's space allocation and load balancing decisions.
• The NameNode does not directly call DataNodes. It uses replies to heartbeats to send instructions to the DataNodes. The instructions include commands to:
  • Replicate blocks to other nodes;
  • Remove local block replicas;
  • Re-register or shut down the node;
  • Send an immediate block report.
• These commands are important for maintaining the overall system integrity and therefore it is critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per second without affecting other NameNode operations.
C) HDFS Client
• User applications access the file system using the HDFS client, a code library that exports the HDFS file system interface.
• Similar to most conventional file systems, HDFS supports operations to read, write and delete files, and operations to create and delete directories.
• The user references files and directories by paths in the namespace.
• The user application generally does not need to know that file system metadata and storage are on different servers, or that blocks have multiple replicas.
Contd…
• When an application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the blocks of the file. It then contacts a DataNode directly and requests the transfer of the desired block.
• When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline from node to node and sends the data.
• When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the client sends the further bytes of the file. Each choice of DataNodes is likely to be different.
• The interactions among the client, the NameNode and the DataNodes are illustrated in Figure 1.
[Figure 1: An HDFS client creates a new file by giving its path to the NameNode. For each block of the file, the NameNode returns a list of DataNodes to host its replicas. The client then pipelines data to the chosen DataNodes, which eventually confirm the creation of the block replicas to the NameNode.]
Contd…

• Unlike conventional file systems, HDFS provides an API that exposes the locations of a file's blocks.
• This allows applications like the MapReduce framework to schedule a task to where the data are located, thus improving the read performance.
• It also allows an application to set the replication factor of a file. By default a file's replication factor is three. For critical files or files which are accessed very often, having a higher replication factor improves their tolerance against faults and increases their read bandwidth.
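For a rough illustration of how an application drives the HDFS client, a short Scala sketch using the standard Hadoop FileSystem API is shown below (the file path and the replication value of 5 are made up for the example):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()            // picks up core-site.xml / hdfs-site.xml
val fs   = FileSystem.get(conf)

// Write: the client streams data; HDFS pipelines it to the chosen DataNodes.
val out = fs.create(new Path("/user/demo/notes.txt"))
out.write("hello hdfs\n".getBytes("UTF-8"))
out.close()

// Read: the NameNode supplies block locations; the data comes from a DataNode.
val in  = fs.open(new Path("/user/demo/notes.txt"))
val buf = new Array[Byte](1024)
val n   = in.read(buf)
in.close()

// Raise the replication factor of a frequently read file (the default is 3).
fs.setReplication(new Path("/user/demo/notes.txt"), 5.toShort)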
Anatomy of a File Read

[Figure: HDFS read path. The client obtains the block locations from the NameNode, then reads block contents directly from the closest DataNodes.]
Anatomy of a File Write

[Figure: HDFS write path. The client asks the NameNode to nominate DataNodes for each block, then pipelines the block data from node to node.]
Additional Topics
• Replica placements:
  • Different node, rack, and data center
• Coherency model:
  • Describes data visibility
  • The current block being written may not be visible to other readers
Spark
Motivation

• MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters.

[Diagram: acyclic MapReduce data flow, with the input split across Map tasks whose results are shuffled to Reduce tasks that produce the output.]
Contd…

• However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications.
• In this part of the lecture, we will focus on one such class of applications: those that reuse a working set of data across multiple parallel operations.
• This includes many iterative machine learning algorithms, as well as interactive data analysis tools.
• A new framework called Spark supports such applications while retaining the scalability and fault tolerance of MapReduce.
Contd…

• To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs).
• An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
• Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Difference Between Hadoop MapReduce and Apache Spark

Hadoop MapReduce           Apache Spark
Fast                       100x faster than MapReduce
Batch processing           Real-time processing
Stores data on disk        Stores data in memory
Written in Java            Written in Scala
Introduction
• A new model of cluster computing has become widely popular, in which data-parallel computations are executed on clusters of unreliable machines by systems that automatically provide locality-aware scheduling, fault tolerance, and load balancing.
• MapReduce pioneered this model, while systems like Dryad and Map-Reduce-Merge generalized the types of data flows supported.
• These systems achieve their scalability and fault tolerance by providing a programming model where the user creates acyclic data flow graphs to pass input data through a set of operators.
• This allows the underlying system to manage scheduling and to react to faults without user intervention.
Contd…
• While this data flow programming model is useful for a large class of applications, there are applications that cannot be expressed efficiently as acyclic data flows.
• Here, we focus on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes two use cases where Hadoop users have reported that MapReduce is deficient:
(i) Iterative jobs: Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty.
Contd…
(ii) Interactive analytics: Hadoop is often used to run ad-hoc exploratory queries on large datasets, through SQL interfaces such as Pig and Hive.
• Ideally, a user would be able to load a dataset of interest into memory across a number of machines and query it repeatedly.
• However, with Hadoop, each query incurs significant latency (tens of seconds) because it runs as a separate MapReduce job and reads data from disk.
• A new cluster computing framework called Spark supports applications with working sets while providing similar scalability and fault tolerance properties to MapReduce.
Contd…
• The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
• Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations. RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.
• Although RDDs are not a general shared memory abstraction, they represent a sweet spot between expressivity on the one hand and scalability and reliability on the other hand, and they have been found well-suited for a variety of applications.
Contd…
• Spark is implemented in Scala, a statically typed high-level programming language for the Java VM, and exposes a functional programming interface similar to DryadLINQ.
• In addition, Spark can be used interactively from a modified version of the Scala interpreter, which allows the user to define RDDs, functions, variables and classes and use them in parallel operations on a cluster.
• Spark is the first system to allow an efficient, general-purpose programming language to be used interactively to process large datasets on a cluster.
• Spark can outperform Hadoop by 10x in iterative machine learning workloads and can be used interactively to scan a 39 GB dataset with sub-second latency.
Programming Model

• To use Spark, developers write a driver program that implements the high-level control flow of their application and launches various operations in parallel.
• Spark provides two main abstractions for parallel programming: resilient distributed datasets and parallel operations on these datasets (invoked by passing a function to apply on a dataset).
• In addition, Spark supports two restricted types of shared variables that can be used in functions running on the cluster.
[Figure: Spark runtime. The user's driver program launches multiple workers, which read data blocks from a distributed file system and can persist computed RDD partitions in memory.]
Resilient Distributed Datasets (RDDs)
• A resilient distributed dataset (RDD) is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
• The elements of an RDD need not exist in physical storage; instead, a handle to an RDD contains enough information to compute the RDD starting from data in reliable storage.
• This means that RDDs can always be reconstructed if nodes fail.
Contd…
• In Spark, each RDD is represented by a Scala object. Spark lets programmers construct RDDs in four ways:
• From a file in a shared file system, such as the Hadoop Distributed File System (HDFS).
• By “parallelizing” a Scala collection (e.g., an array) in the driver program, which means dividing it into a number of slices that will be sent to multiple nodes.
• By transforming an existing RDD. A dataset with elements of type A can be transformed into a dataset with elements of type B using an operation called flatMap, which passes each element through a user-provided function of type A => List[B]. Other transformations can be expressed using flatMap, including map (pass elements through a function of type A => B) and filter (pick elements matching a predicate).
Contd…
• By changing the persistence of an existing RDD. By default, RDDs are lazy and ephemeral.
• That is, partitions of a dataset are materialized on demand when they are used in a parallel operation (e.g., by passing a block of a file through a map function), and are discarded from memory after use. However, a user can alter the persistence of an RDD through two actions:
• The cache action leaves the dataset lazy, but hints that it should be kept in memory after the first time it is computed, because it will be reused.
• The save action evaluates the dataset and writes it to a distributed filesystem such as HDFS. The saved version is used in future operations on it.
Contd…
• If there is not enough memory in the cluster to cache all partitions of a dataset, Spark will recompute them when they are used.
• Spark programs keep working (at reduced performance) if nodes fail or if a dataset is too big. This idea is loosely analogous to virtual memory.
• Spark can also be extended to support other levels of persistence (e.g., in-memory replication across multiple nodes).
• The goal is to let users trade off between the cost of storing an RDD, the speed of accessing it, the probability of losing part of it, and the cost of recomputing it.
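A short sketch of the four construction methods, assuming a SparkContext named sc (the HDFS paths are made up; in current Spark APIs the paper's save action corresponds to calls such as saveAsTextFile):

// 1. From a file in a shared file system such as HDFS:
val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")

// 2. By "parallelizing" a Scala collection in the driver program:
val nums = sc.parallelize(1 to 1000, numSlices = 10)

// 3. By transforming an existing RDD (flatMap, map, filter, ...):
val words     = lines.flatMap(line => line.split(" "))
val longWords = words.filter(w => w.length > 5)

// 4. By changing the persistence of an existing RDD:
val cached = longWords.cache()   // keep in memory after the first computation
longWords.saveAsTextFile("hdfs://namenode:9000/data/long_words")   // "save"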
Parallel Operations
• Several parallel operations can be performed on RDDs:
• Reduce: Combines dataset elements using an associative function to produce a result at the driver program.
• Collect: Sends all elements of the dataset to the driver program. For example, an easy way to update an array in parallel is to parallelize, map and collect the array.
• Foreach: Passes each element through a user-provided function. This is only done for the side effects of the function.
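A minimal sketch of these three operations, again assuming a SparkContext sc:

val nums = sc.parallelize(1 to 100)

// reduce: combine elements with an associative function; the result comes back to the driver
val sum = nums.reduce(_ + _)                     // 5050

// collect: bring all elements of a (small) dataset back to the driver
val doubled: Array[Int] = nums.map(_ * 2).collect()

// foreach: run a function on every element purely for its side effects
nums.foreach(x => println(s"saw $x"))            // printed on the workers' stdout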
Shared Variables

• Programmers invoke operations like map, filter and reduce by passing closures (functions) to Spark.
• As is typical in functional programming, these closures can refer to variables in the scope where they are created.
• Normally, when Spark runs a closure on a worker node, these variables are copied to the worker. However, Spark also lets programmers create two restricted types of shared variables to support two simple but common usage patterns: broadcast variables and accumulators.
Broadcast variables
• Broadcast variables: If a large read-only piece of data (e.g., a lookup table) is used in multiple parallel operations, it is preferable to distribute it to the workers only once instead of packaging it with every closure.
• Spark lets the programmer create a “broadcast variable” object that wraps the value and ensures that it is only copied to each worker once.
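A short sketch of a broadcast variable (the lookup table is illustrative; sc is an existing SparkContext):

val lookup  = Map("us" -> "United States", "in" -> "India", "fr" -> "France")
val bLookup = sc.broadcast(lookup)          // shipped to each worker only once

val codes = sc.parallelize(Seq("in", "us", "in", "fr"))
val names = codes.map(c => bLookup.value.getOrElse(c, "unknown")).collect()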
Accumulators
• Accumulators: These are variables that workers can only “add” to using an associative operation, and that only the driver can read.
• They can be used to implement counters as in MapReduce and to provide a more imperative syntax for parallel sums.
• Accumulators can be defined for any type that has an “add” operation and a “zero” value.
• Due to their “add-only” semantics, they are easy to make fault-tolerant.
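A sketch of an accumulator used as a counter, assuming a SparkContext sc (older Spark APIs used sc.accumulator(0); recent versions provide sc.longAccumulator, used here):

val badRecords = sc.longAccumulator("badRecords")

val records = sc.parallelize(Seq("3", "7", "oops", "11"))
val parsed = records.flatMap { r =>
  try Some(r.toInt)
  catch { case _: NumberFormatException => badRecords.add(1); None }
}
parsed.count()                      // forces evaluation of the dataset
println(badRecords.value)           // only the driver reads the accumulator: 1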
RDD Operations

Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
Parallel operations / Actions (return a result to the driver): reduce, collect, count, save, lookupKey, …

Transformations
map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
sample(withReplacement, fraction, seed): Sample a fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.
Contd…
groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs.
reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function.
sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples (also called groupWith).
cartesian(otherDataset): When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
Actions
reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one); func should be commutative and associative so that it can be computed correctly in parallel.
collect(): Return all the elements of the dataset as an array at the driver program. Usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count(): Return the number of elements in the dataset.
first(): Return the first element of the dataset (similar to take(1)).
take(n): Return an array with the first n elements of the dataset. Currently not executed in parallel; instead the driver program computes all the elements.
takeSample(withReplacement, fraction, seed): Return an array with a random sample of num elements of the dataset, with or without replacement, using the given random number generator seed.
Contd…
saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path): Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. Only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc.).
countByKey(): Only available on RDDs of type (K, V). Returns a ‘Map’ of (K, Int) pairs with the count of each key.
foreach(func): Run a function func on each element of the dataset. Usually done for side effects such as updating an accumulator variable or interacting with external storage systems.
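A short sketch combining several of the transformations and actions above (assumes a SparkContext sc; the data is made up):

val sales  = sc.parallelize(Seq(("apple", 3), ("banana", 2), ("apple", 5)))
val prices = sc.parallelize(Seq(("apple", 1.0), ("banana", 0.5)))

val totals  = sales.reduceByKey(_ + _)              // (apple,8), (banana,2)
val joined  = totals.join(prices)                   // (apple,(8,1.0)), (banana,(2,0.5))
val revenue = joined.mapValues { case (qty, price) => qty * price }

println(revenue.sortByKey().collect().toList)       // List((apple,8.0), (banana,1.0))
println(sales.countByKey())                         // Map(apple -> 2, banana -> 1)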
Spark Community
• Most active open source community in big data
• 200+ developers, 50+ companies contributing
Built-in Libraries
Standard Library for Big Data

• Big data apps lack libraries of common algorithms
• Spark's generality + support for multiple languages make it suitable to offer this
• Much of future activity will be in these libraries
A General Platform

[Diagram: Spark as a general platform, with libraries such as Spark SQL, Spark Streaming, MLlib and GraphX built on the Spark core.]
Machine Learning Library (MLlib)
MLlib algorithms:
(i) Classification: logistic regression, linear SVM, naïve Bayes, classification tree
(ii) Regression: generalized linear models (GLMs), regression tree
(iii) Collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
(iv) Clustering: k-means
(v) Decomposition: SVD, PCA
(vi) Optimization: stochastic gradient descent, L-BFGS
GraphX

• General graph processing library
• Build graphs using RDDs of nodes and edges
• Large library of graph algorithms with composable steps
GraphX Algorithms
(i) Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization
(ii) Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
(iii) Semi-supervised ML: Graph SSL, CoEM
(iv) Community Detection: Triangle-Counting, K-core Decomposition, K-Truss
(v) Graph Analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring
(vi) Classification: Neural Networks
Spark Streaming

• Large-scale streaming computation
• Ensures exactly-once semantics
• Integrated with Spark: unifies batch, interactive, and streaming computations
Spark SQL

Enables loading & querying structured data in Spark

From Hive:

c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()

From JSON:

c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")
Examples
Example 1: PageRank

• Give pages ranks (scores) based on links to them
  • Links from many pages → high rank
  • Links from a high-rank page → high rank
Algorithm
Step 1: Start each page at a rank of 1.
Step 2: On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors.
Step 3: Set each page's rank to 0.15 + 0.85 x contributions.
Spark Program

val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
PageRank Performance

Example 2: Logistic Regression
• Goal: find best line separating two sets of points

[Figure: two classes of points; a random initial line is iteratively adjusted until it reaches the target separating line.]
Logistic Regression Code

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
Logistic Regression Performance

[Chart: Hadoop takes about 127 s per iteration, while Spark takes 174 s for the first iteration and about 6 s for further iterations.]
Example 3: MapReduce

• MapReduce data flow can be expressed using RDD transformations:

res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map((key, vals) => myReduceFunc(key, vals))

Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map((key, val) => myReduceFunc(key, val))
Example 4: WordCount
Definition: Count how often each word appears in a collection of text documents.

This simple program provides a good test case for parallel processing, since it:
• Requires a minimal amount of code
• Demonstrates use of both symbolic and numeric values
• Isn't many steps away from search indexing
• Serves as a “Hello World” for Big Data apps

A distributed computing framework that can run WordCount efficiently in parallel at scale can likely handle much larger and more interesting compute problems.
WordCount Program
Scala:

val f = sc.textFile("README.md")
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc.saveAsTextFile("wc_out")

Python:

from operator import add
f = sc.textFile("README.md")
wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
wc.saveAsTextFile("wc_out")
Other Spark Applications

i. Twitter spam classification
ii. EM algorithm for traffic prediction
iii. K-means clustering
iv. Alternating Least Squares matrix factorization
v. In-memory OLAP aggregation on Hive data
vi. SQL on Spark
Reading Material
• Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica, “Spark: Cluster Computing with Working Sets”
• Matei Zaharia, Mosharaf Chowdhury et al., “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”
• Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, “The Hadoop Distributed File System”
Conclusion
• Spark provides three simple data abstractions for programming clusters: resilient distributed datasets (RDDs) and two restricted types of shared variables, broadcast variables and accumulators. While these abstractions are limited, they are powerful enough to express several applications that pose challenges for existing cluster computing frameworks, including iterative and interactive computations.
• Furthermore, the core idea behind RDDs, of a dataset handle that has enough information to (re)construct the dataset from data available in reliable storage, may prove useful in developing other abstractions for programming clusters.
• In this lecture, we have discussed the HDFS components, its architecture, and the Spark framework with its applications.
HDFS and Spark

EL
PT Rajiv Misra
N
Dept. of Computer Science &
Engineering
Indian Institute of Technology Patna
[email protected]
EL
The Hadoop Distributed
File System (HDFS)
PT
N
Introduction
• Hadoop provides a distributed file system and a framework for
the analysis and transformation of very large data sets using the
MapReduce paradigm.

EL
• An important characteristic of Hadoop is the partitioning of
data and computation across many (thousands) of hosts, and
executing application computations in parallel close to their


data.
PT
A Hadoop cluster scales computation capacity, storage capacity
N
and IO bandwidth by simply adding commodity servers. Hadoop
clusters at Yahoo! span 25,000 servers, and store 25 petabytes
of application data, with the largest cluster being 3500 servers.
One hundred other organizations worldwide report using
Hadoop.
Contd…
• Hadoop is an Apache project; all components are available
via the Apache open source license.
• Yahoo! has developed and contributed to 80% of the core of

EL
Hadoop (HDFS and MapReduce).
• HBase was originally developed at Powerset, now a

• PT
department at Microsoft.
Hive was originated and developed at Facebook.
N
• Pig, ZooKeeper, and Chukwa were originated and developed
at Yahoo!
• Avro was originated at Yahoo! and is being co-developed with
Cloudera.
Hadoop Project Components

HDFS Distributed file system


MapReduce Distributed computation framework

EL
HBase Column-oriented table service
Dataflow language and parallel execution
Pig

Hive
ZooKeeper
PT
framework
Data warehouse infrastructure
Distributed coordination service
N
Chukwa System for collecting management data
Avro Data serialization system

Table 1. Hadoop project components


Contd…
• HDFS is the file system component of Hadoop. While the
interface to HDFS is patterned after the UNIX file system,
faithfulness to standards was sacrificed in favour of improved
performance for the applications at hand.

EL
• HDFS stores file system metadata and application data
separately.
• PT
As in other distributed file systems, like PVFS, Lustre and,
HDFS stores metadata on a dedicated server, called the
N
NameNode.
• Application data are stored on other servers called
DataNodes. All servers are fully connected and communicate
with each other using TCP-based protocols.
HDFS Design Assumptions

• Single machines tend to fail

• Hard disk, power supply,

• More machines = increased failure probability

• Data doesn’t fit on a single node

• Desired:
• Commodity hardware
• Built-in backup and failover
Architecture
Namenode and Datanodes
• Namenode (Master)

• Metadata:

• Where file blocks are stored (namespace image)
• Edit (Operation) log
• Secondary namenode (Shadow master)
• Datanode (Chunkserver)
• Stores and retrieves blocks
• by client or namenode.
• Reports to namenode with list of blocks they are storing
Noticeable Differences from GFS
• Only a single writer per file.
• No record append operation.
• Open source
• Provides many interfaces and libraries for different file
systems.
• S3, KFS, etc.
• Thrift (C++, Python, …), libhdfs (C), FUSE
A) Namenode
• The HDFS namespace is a hierarchy of files and directories. Files
and directories are represented on the NameNode by inodes,
which record attributes like permissions, modification and access
times, namespace and disk space quotas.

• The file content is split into large blocks (typically 128 megabytes,
but user selectable file-by-file) and each block of the file is

independently replicated at multiple DataNodes (typically three,
but user selectable file-by-file). The NameNode maintains the
namespace tree and the mapping of file blocks to DataNodes (the
N
physical location of file data).
• An HDFS client wanting to read a file first contacts the NameNode
for the locations of data blocks comprising the file and then reads
block contents from the DataNode closest to the client.
Contd…

• When writing data, the client requests the NameNode to


nominate a suite of three DataNodes to host the block
replicas.

• The client then writes data to the DataNodes in a pipeline
fashion.

• The current design has a single NameNode for each cluster.
The cluster can have thousands of DataNodes and tens of
N
thousands of HDFS clients per cluster, as each DataNode may
execute multiple application tasks concurrently.
Contd…
• HDFS keeps the entire namespace in RAM. The inode data and
the list of blocks belonging to each file comprise the metadata
of the name system called the image. The persistent record of
the image stored in the local host’s native file system is called a
checkpoint.
• The NameNode also stores the modification log of the image
called the journal in the local host’s native file system. For
improved durability, redundant copies of the checkpoint and
journal can be made at other servers.
• During restarts the NameNode restores the namespace by
reading the namespace and replaying the journal. The locations
of block replicas may change over time and are not part of the
persistent checkpoint.
B) Datanode
• Each block replica on a DataNode is represented by two files in
the local host’s native file system. The first file contains the
data itself and the second file is block’s metadata including
checksums for the block data and the block’s generation stamp.

• The size of the data file equals the actual length of the block
and does not require extra space to round it up to the nominal
block size as in traditional file systems. Thus, if a block is half full
it needs only half of the space of the full block on the local drive.
• During startup each DataNode connects to the NameNode and
performs a handshake. The purpose of the handshake is to
verify the namespace ID and the software version of the
DataNode. If either does not match that of the NameNode the
DataNode automatically shuts down.
Contd…
• The namespace ID is assigned to the file system instance when
it is formatted. The namespace ID is persistently stored on all
nodes of the cluster. Nodes with a different namespace ID will
not be able to join the cluster, thus preserving the integrity of

the file system.
• The consistency of software versions is important because
an incompatible version may cause data corruption or loss, and on
large clusters of thousands of machines it is easy to overlook
nodes that did not shut down properly prior to the software
N
upgrade or were not available during the upgrade.
• A DataNode that is newly initialized and without any namespace
ID is permitted to join the cluster and receive the cluster’s
namespace ID.
Contd…
• After the handshake the DataNode registers with the
NameNode. DataNodes persistently store their unique storage
IDs. The storage ID is an internal identifier of the DataNode,
which makes it recognizable even if it is restarted with a different

IP address or port. The storage ID is assigned to the DataNode
when it registers with the NameNode for the first time and never
changes after that.
• A DataNode identifies block replicas in its possession to the
NameNode by sending a block report. A block report contains the
N
block id, the generation stamp and the length for each block
replica the server hosts. The first block report is sent immediately
after the DataNode registration. Subsequent block reports are
sent every hour and provide the NameNode with an up-to-date
view of where block replicas are located on the cluster.
Contd…

• During normal operation DataNodes send heartbeats to the


NameNode to confirm that the DataNode is operating and the
block replicas it hosts are available. The default heartbeat
interval is three seconds.

• If the NameNode does not receive a heartbeat from a DataNode
in ten minutes the NameNode considers the DataNode to be out
of service and the block replicas hosted by that DataNode to be
unavailable.
N
• The NameNode then schedules creation of new replicas of those
blocks on other DataNodes.
Contd…
• Heartbeats from a DataNode also carry information about total
storage capacity, fraction of storage in use, and the number of data
transfers currently in progress. These statistics are used for the
NameNode’s space allocation and load balancing decisions.
• The NameNode does not directly call DataNodes. It uses replies to

heartbeats to send instructions to the DataNodes. The instructions
include commands to:
• Replicate blocks to other nodes;
• Remove local block replicas;
• Re-register or to shut down the node;
• Send an immediate block report.

• These commands are important for maintaining the overall system


integrity and therefore it is critical to keep heartbeats frequent even
on big clusters. The NameNode can process thousands of
heartbeats per second without affecting other NameNode
operations.
C) HDFS Client
• User applications access the file system using the HDFS client,
a code library that exports the HDFS file system interface.
• Similar to most conventional file systems, HDFS supports

operations to read, write and delete files, and operations to
create and delete directories.
• The user references files and directories by paths in the
namespace.
• The user application generally does not need to know that file
system metadata and storage are on different servers, or that
blocks have multiple replicas.
Contd…
• When an application reads a file, the HDFS client first asks the
NameNode for the list of DataNodes that host replicas of the blocks
of the file. It then contacts a DataNode directly and requests the
transfer of the desired block.

EL
• When a client writes, it first asks the NameNode to choose
DataNodes to host replicas of the first block of the file. The client
organizes a pipeline from node-to-node and sends the data.
• PT
When the first block is filled, the client requests new DataNodes to
be chosen to host replicas of the next block. A new pipeline is
N
organized, and the client sends the further bytes of the file. Each
choice of DataNodes is likely to be different.
• The interactions among the client, the NameNode and the
DataNodes are illustrated in Figure 1.
EL
PT
N
Figure 1. An HDFS client creates a new file by giving its path to the
NameNode. For each block of the file, the NameNode returns a list of
DataNodes to host its replicas. The client then pipelines data to the chosen
DataNodes, which eventually confirm the creation of the block replicas to
the NameNode.
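The same client interactions can be exercised programmatically. The following is a minimal sketch that uses Hadoop's standard FileSystem API from Scala; the NameNode address and the file path are illustrative assumptions, not values from this lecture.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsClientSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Assumed NameNode address; replace with your cluster's fs.defaultFS.
    conf.set("fs.defaultFS", "hdfs://namenode:8020")
    val fs = FileSystem.get(conf)

    // Write: the client asks the NameNode for DataNodes and pipelines the data to them.
    val out = fs.create(new Path("/user/demo/hello.txt"))
    out.write("hello hdfs".getBytes("UTF-8"))
    out.close()

    // Read: block locations come from the NameNode, bytes from a nearby DataNode.
    val in = fs.open(new Path("/user/demo/hello.txt"))
    val buf = new Array[Byte](64)
    val n = in.read(buf)
    println(new String(buf, 0, n, "UTF-8"))
    in.close()
  }
}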
Contd…

• Unlike conventional file systems, HDFS provides an API that


exposes the locations of a file’s blocks.
• This allows applications like the MapReduce framework to
schedule a task to where the data are located, thus improving
the read performance.
• It also allows an application to set the replication factor of a file.
By default a file’s replication factor is three.
• For critical files or files which are accessed very often, having a
higher replication factor improves their tolerance against faults
and increases their read bandwidth.
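For illustration, raising a file's replication factor can be done through the same FileSystem API; the path and factor below are hypothetical.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: increase replication for a frequently read file (path and factor are illustrative).
val fs = FileSystem.get(new Configuration())
val accepted = fs.setReplication(new Path("/user/demo/hot-file.txt"), 5.toShort)
println(s"replication change accepted: $accepted")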
Anatomy of a File Read

Anatomy of a File Write
Additional Topics
• Replica placements:

• Different node, rack, and center

• Coherency model:

• Describes data visibility
• Current block being written may not be visible to other
readers
Spark
Motivation

• MapReduce and its variants have been highly successful in


implementing large-scale data-intensive applications on
commodity clusters.

(Figure: the acyclic MapReduce data flow: Input → Map tasks → Reduce tasks → Output)
Contd…

• However, most of these systems are built around an acyclic


data flow model that is not suitable for other popular
applications.

• In this part of the lecture, we will focus on one such class of
applications, that reuse a working set of data across multiple
parallel operations.
• This includes many iterative machine learning algorithms, as
well as interactive data analysis tools.
• A new framework called Spark supports such applications
while retaining the scalability and fault tolerance of
MapReduce.
Contd…

• To achieve these goals, Spark introduces an abstraction


called resilient distributed datasets (RDDs).

• An RDD is a read-only collection of objects partitioned
across a set of machines that can be rebuilt if a partition is
lost.
• Spark can outperform Hadoop by 10x in iterative machine
learning jobs, and can be used to interactively query a 39
GB dataset with sub-second response time.
Difference Between Hadoop MapReduce
and Apache Spark

Hadoop MapReduce          Apache Spark
Fast                      100x faster than MapReduce
Batch Processing          Real-time Processing
Stores Data on Disk       Stores Data in Memory
Written in Java           Written in Scala
Introduction
• A new model of cluster computing has become widely popular,
in which data-parallel computations are executed on clusters of
unreliable machines by systems that automatically provide
locality-aware scheduling, fault tolerance, and load balancing.

• MapReduce pioneered this model, while systems like Dryad and
Map-Reduce-Merge generalized the types of data flows
supported.
• These systems achieve their scalability and fault tolerance by
providing a programming model where the user creates acyclic
data flow graphs to pass input data through a set of operators.
• This allows the underlying system to manage scheduling and to
react to faults without user intervention.
Contd…
• While this data flow programming model is useful for a large
class of applications, there are applications that cannot be
expressed efficiently as acyclic data flows.

• Here, we focus on one such class of applications, that reuse a
working set of data across multiple parallel operations. This
includes two use cases where we have seen Hadoop users report
that MapReduce is deficient:
(i) Iterative jobs: Many common machine learning algorithms
N
apply a function repeatedly to the same dataset to optimize a
parameter (e.g., through gradient descent). While each iteration
can be expressed as a MapReduce/Dryad job, each job must
reload the data from disk, incurring a significant performance
penalty.
Contd…
(ii) Interactive analytics: Hadoop is often used to run ad-hoc
exploratory queries on large datasets, through SQL interfaces
such as Pig and Hive.

• Ideally, a user would be able to load a dataset of interest into
memory across a number of machines and query it repeatedly.
• However, with Hadoop, each query incurs significant latency
(tens of seconds) because it runs as a separate MapReduce job
and reads data from disk.
• A new cluster computing framework called Spark supports
applications with working sets while providing similar scalability
and fault tolerance properties to MapReduce.
Contd…
• The main abstraction in Spark is that of a resilient distributed
dataset (RDD), which represents a read-only collection of objects
partitioned across a set of machines that can be rebuilt if a
partition is lost.

• Users can explicitly cache an RDD in memory across machines
and reuse it in multiple MapReduce-like parallel operations.
RDDs achieve fault tolerance through a notion of lineage: if a
partition of an RDD is lost, the RDD has enough information
about how it was derived from other RDDs to be able to rebuild
just that partition.
• Although RDDs are not a general shared memory abstraction,
they represent a sweet-spot between expressivity on the one
hand and scalability and reliability on the other hand, and are found
to be well-suited for a variety of applications.
Contd…
• Spark is implemented in Scala, a statically typed high-level
programming language for the Java VM, and exposes a
functional programming interface similar to DryadLINQ.

• In addition, Spark can be used interactively from a modified
version of the Scala interpreter, which allows the user to define
RDDs, functions, variables and classes and use them in parallel
operations on a cluster.
• Spark is the first system to allow an efficient, general-purpose
programming language to be used interactively to process large
datasets on a cluster.
• Spark can outperform Hadoop by 10x in iterative machine
learning workloads and can be used interactively to scan a 39 GB
dataset with sub-second latency.
Programming Model

• To use Spark, developers write a driver program that


implements the high-level control flow of their application and
launches various operations in parallel.

• Spark provides two main abstractions for parallel
programming:
• Resilient distributed datasets and parallel operations on
these datasets (invoked by passing a function to apply on a
N
dataset).
• In addition, Spark supports two restricted types of shared
variables that can be used in functions running on the cluster.
Figure: Spark runtime. The user’s driver program launches
multiple workers, which read data blocks from a distributed file
system and can persist computed RDD partitions in memory.
Resilient Distributed Datasets (RDDs)
• A resilient distributed dataset (RDD) is a read-only collection of
objects partitioned across a set of machines that can be rebuilt if
a partition is lost.

• The elements of an RDD need not exist in physical storage;
instead, a handle to an RDD contains enough information to
compute the RDD starting from data in reliable storage.
• This means that RDDs can always be reconstructed if nodes fail.
Contd…
• In Spark, each RDD is represented by a Scala object. Spark lets
programmers construct RDDs in four ways:
• From a file in a shared file system, such as the Hadoop
Distributed File System (HDFS).

• By “parallelizing” a Scala collection (e.g., an array) in the driver
program, which means dividing it into a number of slices that will
be sent to multiple nodes.
• By transforming an existing RDD. A dataset with elements of
type A can be transformed into a dataset with elements of type B
using an operation called flatMap, which passes each element
through a user-provided function of type A => List[B].
• Other transformations can be expressed using flatMap,
including map (pass each element through a function of type A => B)
and filter (pick elements matching a predicate).
Contd…
• By changing the persistence of an existing RDD. By default, RDDs
are lazy and ephemeral.
• That is, partitions of a dataset are materialized on demand when
they are used in a parallel operation (e.g., by passing a block of a

file through a map function), and are discarded from memory
after use. However, a user can alter the persistence of an RDD
through two actions:
• The cache action leaves the dataset lazy, but hints that it should
be kept in memory after the first time it is computed, because it
will be reused.
• The save action evaluates the dataset and writes it to a
distributed filesystem such as HDFS. The saved version is used in
future operations on it.
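A minimal sketch of the four construction options and the two persistence actions in Spark's Scala API is given below; the application name, file paths and data are illustrative, and saveAsTextFile is used here as the concrete form of the save action.

import org.apache.spark.{SparkConf, SparkContext}

object RddConstructionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-construction"))

    // 1. From a file in a shared file system such as HDFS (path is an assumption).
    val lines = sc.textFile("hdfs://namenode:8020/user/demo/input.txt")

    // 2. By parallelizing a Scala collection in the driver program.
    val numbers = sc.parallelize(1 to 1000)

    // 3. By transforming an existing RDD (flatMap, map, filter).
    val words     = lines.flatMap(line => line.split(" "))    // A => List[B]
    val lengths   = words.map(word => word.length)            // A => B
    val longWords = words.filter(word => word.length > 5)     // predicate

    // 4. By changing persistence: cache keeps the dataset in memory after first use;
    //    saving evaluates it and writes it to a distributed file system.
    val cached = longWords.cache()
    cached.saveAsTextFile("hdfs://namenode:8020/user/demo/long_words")

    sc.stop()
  }
}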
Contd…
• If there is not enough memory in the cluster to cache all
partitions of a dataset, Spark will recompute them when they are
used.

• Spark programs keep working (at reduced performance) if nodes
fail or if a dataset is too big. This idea is loosely analogous to
virtual memory.
• Spark can also be extended to support other levels of persistence
(e.g., in-memory replication across multiple nodes).
• The goal is to let users trade off between the cost of storing an
RDD, the speed of accessing it, the probability of losing part of
it, and the cost of recomputing it.
Parallel Operations
• Several parallel operations can be performed on RDDs:

• Reduce: Combines dataset elements using an associative

function to produce a result at the driver program.

• Collect: Sends all elements of the dataset to the driver program.
For example, an easy way to update an array in parallel is to
parallelize, map and collect the array.
• Foreach: Passes each element through a user provided function.
This is only done for the side effects of the function.
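A short sketch of these three operations, assuming sc is an existing SparkContext and the data is made up:

val nums = sc.parallelize(1 to 10)

val total   = nums.reduce(_ + _)           // combine elements with an associative function at the driver
val doubled = nums.map(_ * 2).collect()    // bring all elements of the dataset back to the driver
nums.foreach(x => println(s"saw $x"))      // run a function only for its side effects (output appears on the workers)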
Shared Variables

• Programmers invoke operations like map, filter and reduce by


passing closures (functions) to Spark.

• As is typical in functional programming, these closures can
refer to variables in the scope where they are created.
• Normally, when Spark runs a closure on a worker node, these
variables are copied to the worker.
• However, Spark also lets programmers create two restricted
types of shared variables to support two simple but common
usage patterns such as Broadcast variables and Accumulators.
Broadcast variables
• Broadcast variables: If a large read-only piece of data (e.g., a
lookup table) is used in multiple parallel operations, it is
preferable to distribute it to the workers only once instead of

packaging it with every closure.
• Spark lets the programmer create a “broadcast variable”
object that wraps the value and ensures that it is only copied
to each worker once.
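For illustration, a small broadcast-variable sketch (the lookup table and keys are made up, and sc is assumed to be an existing SparkContext):

// Ship a read-only lookup table to every worker once instead of with every closure.
val lookup = Map("a" -> 1, "b" -> 2, "c" -> 3)
val bc = sc.broadcast(lookup)

val keys     = sc.parallelize(Seq("a", "b", "c", "a"))
val resolved = keys.map(k => bc.value.getOrElse(k, -1)).collect()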
N
Accumulators
• Accumulators: These are variables that workers can only
“add” to using an associative operation, and that only the
driver can read.

• They can be used to implement counters as in MapReduce and
to provide a more imperative syntax for parallel sums.
• Accumulators can be defined for any type that has an “add”
operation and a “zero” value.
• Due to their “add-only” semantics, they are easy to make
fault-tolerant.
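A sketch of a counter-style accumulator, using the LongAccumulator of recent Spark releases (the original paper's API used sc.accumulator); the input records are made up.

// Count malformed records while mapping over a dataset.
val badRecords = sc.longAccumulator("badRecords")

val parsed = sc.parallelize(Seq("1", "2", "oops", "4")).map { s =>
  try { s.toInt } catch { case _: NumberFormatException => badRecords.add(1L); 0 }
}
parsed.count()                                 // force evaluation so the accumulator is updated
println(s"bad records: ${badRecords.value}")   // only the driver reads the value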
RDD Operations

Transformations (define a new RDD)      Parallel operations / Actions (return a result to driver)
map                                     reduce
filter                                  collect
sample                                  count
union                                   save
groupByKey                              lookupKey
reduceByKey                             …
join
cache

Transformations
Transformation              Description
map(func)                   Return a new distributed dataset formed by passing
                            each element of the source through a function func
filter(func)                Return a new dataset formed by selecting those
                            elements of the source on which func returns true
flatMap(func)               Similar to map, but each input item can be mapped to 0
                            or more output items (so func should return a Seq
                            rather than a single item)
sample(withReplacement,     Sample a fraction of the data, with or without
fraction, seed)             replacement, using a given random number
                            generator seed
union(otherDataset)         Return a new dataset that contains the union of the
                            elements in the source dataset and the argument
distinct([numTasks])        Return a new dataset that contains the distinct
                            elements of the source dataset
Contd…
Transformation Description
groupByKey([numTasks])         When called on a dataset of (K, V) pairs, returns a dataset of
                               (K, Seq[V]) pairs
reduceByKey(func, [numTasks])  When called on a dataset of (K, V) pairs, returns a dataset of
                               (K, V) pairs where the values for each key are aggregated using
                               the given reduce function
sortByKey([ascending],         When called on a dataset of (K, V) pairs where K implements
[numTasks])                    Ordered, returns a dataset of (K, V) pairs sorted by keys in
                               ascending or descending order as specified in the boolean
                               ascending argument
join(otherDataset,             When called on datasets of type (K, V) and (K, W), returns a
[numTasks])                    dataset of (K, (V, W)) pairs with all pairs of elements for each
                               key
cogroup(otherDataset,          When called on datasets of type (K, V) and (K, W), returns a
[numTasks])                    dataset of (K, Seq[V], Seq[W]) tuples, also called groupWith
cartesian(otherDataset)        When called on datasets of types T and U, returns a dataset of
                               (T, U) pairs (all pairs of elements)
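A quick illustration of two of the key-based transformations from the table above, with made-up data and sc assumed to be an existing SparkContext:

val sales  = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))
val totals = sales.reduceByKey(_ + _)              // ("apples", 8), ("pears", 2)
val prices = sc.parallelize(Seq(("apples", 0.5), ("pears", 0.75)))
val joined = totals.join(prices)                   // ("apples", (8, 0.5)), ("pears", (2, 0.75))
joined.collect().foreach(println)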
Actions
Action                       Description
reduce(func)                 Aggregate the elements of the dataset using a function func
                             (which takes two arguments and returns one); func should
                             also be commutative and associative so that it can be
                             computed correctly in parallel
collect()                    Return all the elements of the dataset as an array at the
                             driver program, usually useful after a filter or other operation
                             that returns a sufficiently small subset of the data
count()                      Return the number of elements in the dataset
first()                      Return the first element of the dataset, similar to take(1)
take(n)                      Return an array with the first n elements of the dataset;
                             currently not executed in parallel, instead the driver program
                             computes all the elements
takeSample(withReplacement,  Return an array with a random sample of num elements of
fraction, seed)              the dataset, with or without replacement, using the given
                             random number generator seed
Contd…
Action                       Description
saveAsTextFile(path)         Write the elements of the dataset as a text file (or set of text
                             files) in a given directory in the local filesystem, HDFS or any
                             other Hadoop-supported file system. Spark will call toString
                             on each element to convert it to a line of text in the file
saveAsSequenceFile(path)     Write the elements of the dataset as a Hadoop SequenceFile
                             in a given path in the local filesystem, HDFS or any other
                             Hadoop-supported file system. Only available on RDDs of
                             key-value pairs that either implement Hadoop’s Writable
                             interface or are implicitly convertible to Writable (Spark
                             includes conversions for basic types like Int, Double, String, etc.)
countByKey()                 Only available on RDDs of type (K, V). Returns a ‘Map’ of
                             (K, Int) pairs with the count of each key
foreach(func)                Run a function func on each element of the dataset, usually
                             done for side effects such as updating an accumulator
                             variable or interacting with external storage systems
Spark Community
• Most active open source community in big data

• 200+ developers, 50+ companies contributing

Built-in Libraries
Standard Library for Big Data

• Big data apps lack libraries
of common algorithms
• Spark’s generality + support
for multiple languages make
it suitable to offer this
• Much of future activity will
be in these libraries
A General Platform

Machine Learning Library (MLlib)
MLlib algorithms:
(i) Classification: logistic regression, linear SVM, naïve
Bayes, classification tree
(ii) Regression: generalized linear models (GLMs),
regression tree
(iii) Collaborative filtering: alternating least squares (ALS),
non-negative matrix factorization (NMF)
(iv) Clustering: k-means
(v) Decomposition: SVD, PCA
(vi) Optimization: stochastic gradient descent, L-BFGS
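As an illustration, a minimal k-means job with the RDD-based MLlib API might look as follows; the input path and the cluster/iteration counts are assumptions, and sc is an existing SparkContext.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated numeric features into MLlib vectors (input path is illustrative).
val data = sc.textFile("hdfs://namenode:8020/user/demo/points.txt")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()

val numClusters   = 3
val numIterations = 20
val model = KMeans.train(data, numClusters, numIterations)

model.clusterCenters.foreach(println)   // inspect the learned centroids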
GraphX

GraphX

•General graph processing library

•Build graph using RDDs of nodes and edges

• Large library of graph algorithms with composable steps
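A small sketch of building a graph from RDDs of vertices and edges and running the built-in PageRank; the vertex IDs, attributes and tolerance are made up, and sc is an existing SparkContext.

import org.apache.spark.graphx.{Edge, Graph}

// Build a tiny graph from RDDs of vertices and edges.
val vertices = sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph    = Graph(vertices, edges)

// Run PageRank until it converges to the given tolerance.
val ranks = graph.pageRank(0.0001).vertices
ranks.collect().foreach { case (id, rank) => println(s"$id -> $rank") }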
GraphX Algorithms
(i) Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization
(ii) Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
(iii) Semi-supervised ML: Graph SSL, CoEM
(iv) Community Detection: Triangle-Counting, K-core Decomposition, K-Truss
(v) Graph Analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring
(vi) Classification: Neural Networks
Spark Streaming

• Large scale streaming computation
• Ensures exactly-once semantics
• Integration with Spark unifies batch, interactive, and streaming
computations!
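For illustration, a minimal streaming word count over a socket source; the host, port and batch interval are assumptions, and sc is an existing SparkContext.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc    = new StreamingContext(sc, Seconds(1))
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()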
Spark SQL

Enables loading & querying structured data in Spark

From Hive:

c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()
From JSON:

c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")
Examples
Example 1: PageRank

• Give pages ranks (scores) based on links to them

• Links from many pages → high rank

• Links from a high-rank page → high rank
Algorithm
Step-1: Start each page at a rank of 1
Step-2: On each iteration, have page p contribute rank_p / |neighbors_p|
to its neighbors
Step-3: Set each page’s rank to 0.15 + 0.85 × contributions
Spark Program

val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile(...)
PageRank Performance

Example 2: Logistic Regression
• Goal: find best line separating two sets of points

(Figure: a random initial line converging to the target separating line)
Logistic Regression Code

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
Logistic Regression Performance

(Figure: Hadoop takes 127 s per iteration; with Spark the first iteration takes 174 s and further iterations take 6 s)
Example 3: MapReduce

• MapReduce data flow can be expressed using RDD


transformations

res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map((key, vals) => myReduceFunc(key, vals))
Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map((key, val) => myReduceFunc(key, val))
Example 4: WordCount
Definition: Count how often each word appears in a collection
of text documents

This simple program provides a good test case for parallel

processing, since it:
• Requires a minimal amount of code
• Demonstrates use of both symbolic and numeric values
• Isn’t many steps away from search indexing
• Serves as a “Hello World” for Big Data apps

A distributed computing framework that can run WordCount


efficiently in parallel at scale can likely handle much larger and
more interesting compute problems.
WordCount Program
Scala:

val f = sc.textFile("README.md")
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc.saveAsTextFile("wc_out")

Python:

from operator import add

f = sc.textFile("README.md")
wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
wc.saveAsTextFile("wc_out")
Other Spark Applications

i. Twitter spam classification


ii. EM algorithm for traffic prediction

iii. K-means clustering
iv. Alternating Least Squares matrix factorization
v. In-memory OLAP aggregation on Hive data
vi. SQL on Spark
Reading Material
• Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin,
Scott Shenker, Ion Stoica
“Spark: Cluster Computing with Working Sets”

• Matei Zaharia, Mosharaf Chowdhury et al.
“Resilient Distributed Datasets: A Fault-Tolerant
Abstraction for In-Memory Cluster Computing”
• Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert
Chansler
“The Hadoop Distributed File System”
Conclusion
• Spark provides three simple data abstractions for programming
clusters: resilient distributed datasets (RDDs), and two restricted
types of shared variables: broadcast variables and accumulators.
While these abstractions are limited, it is found that they are
powerful enough to express several applications that pose
challenges for existing cluster computing frameworks, including
iterative and interactive computations.
• Furthermore, it is believed that the core idea behind RDDs, of a
dataset handle that has enough information to (re)construct the
N
dataset from data available in reliable storage, may prove useful
in developing other abstractions for programming clusters.
• In this lecture, we have discussed the HDFS components and
architecture, and the Spark framework with its applications.
Distributed Algorithms for
Sensor Networks

“Connected Domination in Multihop Ad Hoc Wireless Networks”

Rajiv Misra
Dept. of Computer Science &
Engineering
Indian Institute of Technology Patna
[email protected]
Introduction
• Theidea of virtual backbone routing for ad hoc wireless
networks is to operate routing protocols over a virtual
backbone.
• One purpose of virtual backbone routing is to alleviate the

serious broadcast storm problem suffered by many
existing on-demand routing protocols for route detection.
• Thus constructing a virtual backbone is very important.
• In this lecture we study how the virtual backbone is
approximated by a minimum connected dominating set
(MCDS) in a unit-disk graph. This is an NP-hard problem.
•A distributed approximation algorithm with performance
ratio at most 8 will be covered.
Sensor Network as Adhoc network
• Adhoc wireless and Sensor network has applications in
emergency search-and-rescue operations, decision
making in the battlefield, data acquisition operations in

inhospitable terrain, etc.
• It is featured by dynamic topology (infrastructureless),
multihop communication, limited resources (bandwidth,
CPU, battery, etc.) and limited security.
• These characteristics put special challenges in routing
protocol design. Inspired by the physical backbone in a
wired network, many researchers proposed the concept
of virtual backbone for unicast, multicast/broadcast in
ad hoc wireless networks .
Contd…
• The virtual backbone is mainly used to collect
topology information for route detection. It also
works as a backup when route is unavailable
temporarily.

• An effective approach based on overlaying a virtual
infrastructure (termed core) on an ad hoc network is
popular.
• Routing protocols are operated over the core.
a (small) subset of non-core nodes.
• No broadcast is involved in core path detection.
Classification of Routing Protocols
• Existing routing protocols can be classified into two categories:
(i) proactive and (ii) reactive.
(i) Proactive routing protocols ask each host (or many hosts) to
maintain global topology information, thus a route can be

provided immediately when requested.
• But a large amount of control messages is required to keep
each host updated for the newest topology changes.
(ii) Reactive routing protocols have the feature on-demand.
N
Each host computes route for a specific destination only when
necessary.
• Topology changes which do not influence active routes do not
trigger any route maintenance function, thus communication
overhead is lower compared to proactive routing protocol.
On-demand Routing Protocols
• On-demand routing protocols attract much attention due
to their better scalability and lower protocol overhead.
• But most of them use flooding for route discovery.

Flooding suffers from the broadcast storm problem.
• The broadcast storm problem refers to the fact that flooding
may result in excessive redundancy, contention, and
collision. This causes high protocol overhead and
interference to other ongoing communication sessions.
• On the other hand, the unreliability of broadcast may
obstruct the detection of the shortest path, or simply
can’t detect any path at all, even though there exists one.
Problem of efficiently constructing virtual
backbone for ad hoc wireless networks
• In this lecture we will study the “problem of efficiently
constructing virtual backbone” for ad hoc wireless
networks.
• The number of hosts forming the virtual backbone must
be as small as possible to decrease protocol overhead.
• The algorithm must be time/message efficient due to
resource scarcity.
• We use a connected dominating set (CDS) to approximate
the virtual backbone.
Assumptions (1)

• We assume a given ad hoc network instance contains


n hosts.
• Each host is on the ground and is mounted with an omni-
directional antenna.
• Thus the transmission range of a host is a disk.
• We further assume that each transceiver has the same
communication range R.
• Thus the footprint of an ad hoc wireless network is a unit-
disk graph.
Assumptions (2)
• Ingraph-theoretic terminology, the network topology we
are interested in is a graph G=(V,E) where V contains all
hosts and E is the set of links.

• A link between u and v exists if their distance is at most R.
In a real-world ad hoc wireless network, sometimes even
when v is located in u’s transmission range, v is not
reachable from u due to hidden/exposed terminal problems.
• Here, we only consider bidirectional links.
• From now on, we use host and node interchangeably to
represent a wireless mobile.
Existing Distributed Algorithms for MCDS
                 B. Das et al.    B. Das et al.      J. Wu et al.   K.M. Alzoubi   Mihaela Cardei
                 [1997]-I         [1997]-II          [1999]         [2001]         et al.
Cardinality      ≤(2ln∆+3)·opt    ≤(2ln∆+2)·opt      N/A            ≤8·opt+1       ≤8·opt
Message          O(n|C|)          O(n|C|+m+nlogn)    O(n∆)          O(nlogn)       O(n)
Time             O((n+|C|)∆)      O((|C|+|C|)∆)      O(∆²)          O(n∆)          O(n∆)
Message Length   O(∆)             O(∆)               O(∆)           O(∆)           O(∆)
Information      2-hop            2-hop              2-hop          1-hop          1-hop

Table 1: Performance comparison of the algorithms. Here opt is the size of an optimal
CDS of the given instance; ∆ is the maximum degree; |C| is the size of the
generated connected dominating set; m is the number of edges; n is the
number of hosts.
Preliminaries (1)
• Given graph G =(V,E), two vertices are independent if they are
not neighbors. For any vertex v, the set of independent
neighbors of v is a subset of v’s neighbors such that any two
vertices in this subset are independent.

• An independent set (IS) S of G is a subset of V such that for all
u,v ∈ S, (u,v) ∉ E. S is maximal if any vertex not in S has a
neighbor in S (denoted by MIS).
A dominating set (DS) D of G is a subset of V such that any
node not in D has at least one neighbor in D. If the induced
N
subgraph of D is connected, then D is a connected dominating
set (CDS).
• Among all CDSs of graph G, the one with minimum cardinality
is called a minimum connected dominating set (MCDS)
Preliminaries (2)
• Computing an MCDS in a unit-disk graph is NP-hard. Note that
the problem of finding an MCDS in a graph is equivalent
to the problem of finding a spanning tree (ST) with
maximum number of leaves. All non-leaf nodes in the
spanning tree form the MCDS. An MIS is also a DS.
• For a graph G, if e = (u,v) ∈ E iff length(e) ≤ 1, then G is
called a unit-disk graph.
Preliminaries (3)
• This lemma relates the size of any MIS of unit-disk graph G
to the size of its optimal CDS
Lemma 2.1 The size of any MIS of G is at most 4·opt + 1,
where opt is the size of any optimal CDS of G.
• For a minimization problem P, the performance ratio of an
approximation algorithm A is defined to be
ρ = sup_{i ∈ I} (A_i / opt_i),
• where I is the set of instances of P, A_i is the output from A
for instance i and opt_i is the optimal solution for instance
i. In other words, ρ is the supremum of A_i / opt_i among all
instances of P.
An 8-approximate algorithm to compute CDS
• This algorithm contains two phases:

• Phase-1: First, a maximal independent set (MIS) is
computed;
• Phase-2: Then a Steiner tree is used to connect all
vertices in the MIS.
• This algorithm has performance ratio at most 8 and is
message and time efficient.
Algorithm description
• Initially each host is colored white.
•A dominator is colored black, while a dominatee is
colored gray.

• We assume that each vertex knows its distance-one
neighbors and their effective degrees d*.
• This information can be collected by periodic or
event-driven hello messages.
• The effective degree of a vertex is the total number of
white neighbors.
Contd...
• Here a host is designated as the leader. This is a
realistic assumption.
• For example, the leader can be the commander’s
mobile for a platoon of soldiers in a mission.
• If it is impossible to designate any leader, a
distributed leader-election algorithm can be applied
to find out a leader. This adds message and time
complexity.
• The best leader-election algorithm takes time O(n)
and message O(nlogn) and these are the best-
achievable results. Assume host s is the leader
Phase 1:
• Host s first colors itself black and broadcasts message
DOMINATOR.
• Any white host u receiving a DOMINATOR message for the first
time from v colors itself gray and broadcasts message
DOMINATEE. u selects v as its dominator.
• A white host receiving at least one DOMINATEE message
becomes active.
•An active white host with highest (d*, id) among all of its
active white neighbors will color itself black and broadcast
message DOMINATOR.
Contd...
•A white host decreases its effective degree by 1 and
broadcasts message DEGREE whenever it receives a
DOMINATEE message.

• Message DEGREE contains the sender’s current effective
degree. A white vertex receiving a DEGREE message will
update its neighborhood information accordingly.
• Each gray vertex will broadcast message
NUMOFBLACKNEIGHBORS when it detects that none of its
neighbors is white.

• Phase 1 terminates when no white vertex is left.
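The following is a centralized sketch of the Phase-1 coloring rule only, not the distributed message-passing protocol: it assumes a connected undirected graph given as an adjacency map, with the graph shape and identifiers purely illustrative.

object Phase1Sketch {
  sealed trait Color
  case object White extends Color
  case object Gray  extends Color
  case object Black extends Color

  // adj: adjacency map of a connected undirected graph; leader: the designated host s.
  def phase1(adj: Map[Int, Set[Int]], leader: Int): Map[Int, Color] = {
    var color: Map[Int, Color] = adj.keys.map(v => v -> (White: Color)).toMap

    def effDeg(v: Int): Int = adj(v).count(u => color(u) == White)   // d*: number of white neighbours
    def beats(v: Int, u: Int): Boolean = {
      val (dv, du) = (effDeg(v), effDeg(u))
      dv > du || (dv == du && v > u)                                 // compare (d*, id)
    }

    color = color.updated(leader, Black)
    adj(leader).foreach(n => color = color.updated(n, Gray))         // leader's neighbours become dominatees

    while (color.valuesIterator.contains(White)) {
      // "active" white hosts have a gray neighbour, i.e. have received a DOMINATEE message
      val active = color.collect { case (v, White) if adj(v).exists(color(_) == Gray) => v }.toSet
      // a host colors itself black if it beats every active white neighbour on (d*, id)
      val winners = active.filter(v => adj(v).intersect(active).forall(u => beats(v, u)))
      for (v <- winners) {
        color = color.updated(v, Black)
        adj(v).foreach(n => if (color(n) == White) color = color.updated(n, Gray))
      }
    }
    color                                                            // black hosts form a maximal independent set
  }
}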


Phase 2:
•When s receives message NUMOFBLACKNEIGHBORS from
all of its gray neighbors, it starts phase 2 by broadcasting
message M.

• A host is “ready” to be explored if it has no white
neighbors.
• A Steiner tree is used to connect all black hosts generated
in Phase 1.
• The idea is to pick those gray vertices which connect to
many black neighbors.
Contd...
•The classical distributed depth first search spanning tree
algorithm will be modified to compute the Steiner tree.

•A black vertex without any dominator is active.

• Initially no black vertex has a dominator and all hosts are
unexplored.
• Message M contains a field next which specifies the next
host to be explored.
• A gray vertex with at least one active black neighbor is
effective.
Contd...
• If M is built by a black vertex, its next field contains the id
of the unexplored gray neighbor which connects to the
maximum number of active black hosts.
• If M is built by a gray vertex, its next field contains the id
of any unexplored black neighbor.
• Any black host u receiving an M message for the first time
from a gray host v sets its dominator to v by broadcasting
message PARENT.
Contd...
• When a host u receives message M from v that specifies u
to be explored next, if none of u’s neighbors is white, u
then colors itself black, sets its dominator to v and
broadcasts its own M message; otherwise, u defers its
operation until none of its neighbors is white.
• Any gray vertex receiving message PARENT from a black
neighbor will broadcast message
NUMOFBLACKNEIGHBORS, which contains the number of
active black neighbors.

• A black vertex becomes inactive after its dominator is set.


Contd...
•A gray vertex becomes ineffective if none of its black
neighbors is active.

• A gray vertex without an active black neighbor, or a black
vertex without an effective gray neighbor, will send message
DONE to the host which activates its exploration or to its
dominator.
• When s gets message DONE and it has no effective gray
neighbors, the algorithm terminates.
Complexity
• Note that phase 1 sets the dominators for all gray
vertices. Phase 2 may modify the dominator of some gray
vertex.
• The main job for phase 2 is to set a dominator for each
black vertex. All black vertices form a CDS.
• In Phase 1, each host broadcasts each of the messages
DOMINATOR and DOMINATEE at most once.
• The message complexity is dominated by message
DEGREE, since it may be broadcasted ∆ times by a host,
where ∆ is the maximum degree.
• Thus the message complexity of Phase 1 is O(n∆). The
time complexity of Phase 1 is O(n).
time complexity of Phase 1 is O(n).
Contd...
• In phase 2, vertices are explored one by one.

• The total number of vertices explored is the size of the
output CDS. Thus the time complexity is at most O(n).
• The message complexity is dominated by message
NUMOFBLACKNEIGHBORS, which is broadcasted at most
5 times by each gray vertex because a gray vertex has at
most 5 black neighbors in a unit-disk graph.
• Thus the message complexity is also O(n).

• Thus the message complexity is also O(n).


Theorem
• Theorem 3.1: The distributed algorithm has time
complexity O(n) and message complexity O(n·∆).
• Note that in phase 1 if we use (id) instead of (d*, id)
as the parameter to select a white vertex to color it
black, the message complexity will be O(n) because
no DEGREE messages will be broadcasted.
• O(n·∆) is the best result we can achieve if effective
degree is taken into consideration.
Performance Analysis
• Lemma 3.2 Phase 1 computes an MIS which contains all
black nodes.
• Proof. A node is colored black only from white. No two
white neighbors can be colored black at the same time

since they must have different (d*, id).
• When a node is colored black, all of its neighbors are
colored gray.
• Once a node is colored gray, it remains in color gray during
Phase 1.
• From the proof of Lemma 3.2, it is clear that if (id) instead
of (d*, id) is used, we still get an MIS. Intuitively, this result
will have a larger size.
Contd...
• Lemma 3.3 In phase 2, at least one gray vertex which
connects to maximum number of black vertices will
be selected.
• Proof. Let u be a gray vertex with maximum number
of black neighbors.
• At some step in phase 2, one of u’s black neighbors, v,
will be explored.
• In the following step, u will be explored. This
exploration is triggered by v.
Contd...
• Lemma 3.4 If there are c black hosts after phase 1,
then at most c-1 gray hosts will be colored black in
phase 2

• Proof. In phase 2, the first gray vertex selected will
connect to at least 2 black vertices.
• In the following steps, any newly selected gray vertex
will connect to at least one new black vertex.
Contd...
• Lemma 3.5 If there exists a gray vertex which connects
to at least 3 black vertices, then the number of gray
vertices which are colored black in phase 2 will be at
most c-2, where c is the number of black vertices after

phase 1.
• Proof. From Lemma 3.3, at least one gray vertex with
maximum black neighbors will be colored black in phase
2. Denote this vertex by u. If u is colored black, then all
of its black neighbors will choose u as their dominator.
Thus, the selection of u causes more than 1 black hosts
to be connected.
Contd...
• Theorem 3.6 This algorithm has performance ratio at most 8.

• Proof. From Lemma 3.2, phase 1 computes a MIS. We will


consider two cases here.

• If there exists a gray vertex which has at least 3 black
neighbors after phase 1, from Lemma 2.1, the size of the MIS
is at most 4·opt + 1.
• From Lemma 3.5, we know the total number of black vertices
after phase 2 is at most (4·opt + 1) + ((4·opt + 1) − 2) = 8·opt.
Contd...

• If the maximum number of black neighbors a gray vertex
has is at most 2, then the size of the MIS computed in
phase 1 is at most 2·opt, since any vertex in opt connects
to at most 2 vertices in the MIS.
• Thus from Lemma 3.4, the total number of black hosts will
be 2·opt + 2·opt − 1 < 4·opt.
• Note that from the proof of Theorem 3.6, if (id) instead
of (d*, id) is used in phase 1, this algorithm still has
performance ratio at most 8.
More References

• Rajiv Misra et al., “Minimum Connected Dominating Set
Using a Collaborative Cover Heuristic for Ad Hoc Sensor
Networks.” IEEE Trans. Parallel Distrib. Syst. 21(3): 292-302 (2010)
Conclusion
• In this lecture, we have discussed a distributed
algorithm which computes a connected dominating set
of small size.
• We have discussed how to find a maximal independent
set, and then how to use a Steiner tree to connect all
vertices in the set.
• This algorithm gives performance ratio at most 8.
• The future scope of this algorithm is to study the
problem of maintaining the connected dominating set
in a mobility environment.
