W7 Lecture Notes

MapReduce

Rajiv Misra
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
[email protected]
Introduction

• MapReduce is a programming model and an associated implementation for processing and generating large data sets.
• Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
• Many real-world tasks are expressible in this model.
Contd…

• Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.
• The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication.
• This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Contd…

• A typical MapReduce computation processes many terabytes of data on thousands of machines.
• Hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
Single-node architecture

[Diagram: a single node with CPU, memory and disk, the setting for machine learning, statistics and “classical” data mining on data that fits on one machine.]
Commodity Clusters
• Web data sets can be very large
  • Tens to hundreds of terabytes
• Cannot mine on a single server (why?)
• Standard architecture emerging:
  • Cluster of commodity Linux nodes
  • Gigabit Ethernet interconnect
• How to organize computations on this architecture?
  • Mask issues such as hardware failure
Cluster Architecture
[Diagram: racks of nodes, each node with CPU, memory and disk, connected through a switch per rack; 1 Gbps between any pair of nodes in a rack, 2-10 Gbps backbone between racks. Each rack contains 16-64 nodes.]
Stable storage
• First-order problem: if nodes can fail, how can we store data persistently?
• Answer: Distributed File System
  • Provides global file namespace
  • Google GFS; Hadoop HDFS; Kosmix KFS
• Typical usage pattern
  • Huge files (100s of GB to TB)
  • Data is rarely updated in place
  • Reads and appends are common
Distributed File System
• Chunk servers
  • File is split into contiguous chunks
  • Typically each chunk is 16-64 MB
  • Each chunk replicated (usually 2x or 3x)
  • Try to keep replicas in different racks
• Master node
  • a.k.a. NameNode in HDFS
  • Stores metadata
  • Might be replicated
• Client library for file access
  • Talks to master to find chunk servers
  • Connects directly to chunk servers to access data
Motivation for Map Reduce (Why)
• Large-Scale Data Processing
  • Want to use 1000s of CPUs
  • But don't want the hassle of managing things
• MapReduce architecture provides
  • Automatic parallelization & distribution
  • Fault tolerance
  • I/O scheduling
  • Monitoring & status updates
Programming Model

• The computation takes a set of input key/value pairs, and produces a set of output key/value pairs.
• The user of the MapReduce library expresses the computation as two functions:
  (i) The Map
  (ii) The Reduce
(i) Map Abstraction

• Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs.
• The MapReduce library groups together all intermediate values associated with the same intermediate key ‘I’ and passes them to the Reduce function.
(ii) Reduce Abstraction
• The Reduce function, also written by the user, accepts an intermediate key ‘I’ and a set of values for that key.
• It merges together these values to form a possibly smaller set of values.
• Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator.
• This allows us to handle lists of values that are too large to fit in memory.
Map-Reduce Functions for Word Count
map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
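The same logic can be sketched as a small, runnable Scala program on in-memory collections, which makes the map, shuffle and reduce phases concrete (LocalWordCount, mapFn and reduceFn are illustrative names, not part of any MapReduce API):

object LocalWordCount {
  // "map": emit a (word, 1) pair for every word in a document
  def mapFn(docName: String, text: String): Seq[(String, Int)] =
    text.split("\\s+").toSeq.filter(_.nonEmpty).map(w => (w, 1))

  // "reduce": sum all counts emitted for one word
  def reduceFn(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val docs = Seq("doc1" -> "see bob run", "doc2" -> "see spot throw")
    val intermediate = docs.flatMap { case (name, text) => mapFn(name, text) }
    // the framework's shuffle step: group intermediate values by key
    val grouped = intermediate.groupBy(_._1)
    val output  = grouped.map { case (w, pairs) => reduceFn(w, pairs.map(_._2)) }
    output.foreach(println)   // (see,2), (bob,1), (run,1), (spot,1), (throw,1)
  }
}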
Map-Reduce Functions

• Input: a set of key/value pairs
• User supplies two functions:
  map(k, v) => list(k1, v1)
  reduce(k1, list(v1)) => v2
• (k1, v1) is an intermediate key/value pair
• Output is the set of (k1, v2) pairs
Applications
• Here are a few simple applications of interesting programs that can be easily expressed as MapReduce computations.
• Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
• Count of URL Access Frequency: The map function processes logs of web page requests and outputs (URL; 1). The reduce function adds together all values for the same URL and emits a (URL; total count) pair.
• Reverse Web-Link Graph: The map function outputs (target; source) pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair (target; list(source)).
Contd…
• Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word; frequency) pairs.
• The map function emits a (hostname; term vector) pair for each input document (where the hostname is extracted from the URL of the document).
• The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final (hostname; term vector) pair.
Contd…
• Inverted Index: The map function parses each document, and emits a sequence of (word; document ID) pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a (word; list(document ID)) pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
• Distributed Sort: The map function extracts the key from each record, and emits a (key; record) pair. The reduce function emits all pairs unchanged.
Implementation Overview
Implementation Overview
• Many different implementations of the MapReduce interface are possible. The right choice depends on the environment.
• For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines.
• Here we describe an implementation targeted to the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet.
Contd…
(1) Machines are typically dual-processor x86 machines running Linux, with 2-4 GB of memory per machine.
(2) Commodity networking hardware is used: typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.
(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.
(4) Storage is provided by inexpensive IDE disks attached directly to individual machines.
(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.
Distributed Execution Overview
• The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits.
• The input splits can be processed in parallel by different machines.
• Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R).
• The number of partitions (R) and the partitioning function are specified by the user.
• Figure 1 shows the overall flow of a MapReduce operation.
Distributed Execution Overview

[Figure 1: Overall flow of a MapReduce operation. (1) The user program forks the master and worker processes; (2) the master assigns map and reduce tasks; (3) map workers read their input splits; (4) intermediate results are written to local disk; (5) reduce workers remotely read and sort the intermediate data; (6) reduce workers write the output files.]
Sequence of Actions
When the user program calls the MapReduce function, the following sequence of actions occurs:
1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.
2. One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
Contd…
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function.
• The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.
• The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.
Contd…
6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function.
• The output of the Reduce function is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program.
• At this point, the MapReduce call in the user program returns back to the user code.
Contd…
• After successful completion, the output of the MapReduce execution is available in the R output files (one per reduce task, with file names as specified by the user).
• Typically, users do not need to combine these R output files into one file; they often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.
Master Data Structures
• The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).
• The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task.
• Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.
Fault Tolerance
• Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.
• Map worker failure
  • Map tasks completed or in-progress at the worker are reset to idle
  • Reduce workers are notified when a task is rescheduled on another worker
• Reduce worker failure
  • Only in-progress tasks are reset to idle
• Master failure
  • MapReduce task is aborted and client is notified
Locality
• Network bandwidth is a relatively scarce resource in the computing environment.
• We can conserve network bandwidth by taking advantage of the fact that the input data (managed by GFS) is stored on the local disks of the machines that make up our cluster.
• GFS divides each file into 64 MB blocks, and stores several copies of each block (typically 3 copies) on different machines.
Contd…
• The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data.
• Failing that, it attempts to schedule a map task near a replica of that task's input data (e.g., on a worker machine that is on the same network switch as the machine containing the data).
• When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.
Task Granularity
• The Map phase is subdivided into M pieces and the Reduce phase into R pieces.
• Ideally, M and R should be much larger than the number of worker machines.
• Having each worker perform many different tasks improves dynamic load balancing, and also speeds up recovery when a worker fails: the many map tasks it has completed can be spread out across all the other worker machines.
• There are practical bounds on how large M and R can be, since the master must make O(M + R) scheduling decisions and keeps O(M * R) state in memory.
• Furthermore, R is often constrained by users because the output of each reduce task ends up in a separate output file.
Partition Function

• Inputs to map tasks are created by contiguous splits of the input file.
• For reduce, we need to ensure that records with the same intermediate key end up at the same worker.
• System uses a default partition function, e.g., hash(key) mod R.
• Sometimes useful to override:
  • E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file (see the sketch below).
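As a rough sketch (plain functions, not the actual Google/Hadoop partitioner API), the default and the overridden partition functions could look like this, where R is the number of reduce tasks:

import java.net.URI

// Default: hash(key) mod R
def defaultPartition(key: String, R: Int): Int =
  Math.floorMod(key.hashCode, R)

// Override: all URLs from the same host go to the same reducer,
// and therefore into the same output file.
def hostPartition(url: String, R: Int): Int = {
  val host = Option(new URI(url).getHost).getOrElse(url)   // hostname(URL)
  Math.floorMod(host.hashCode, R)
}

// Example: both pages of the same host land in the same partition:
// hostPartition("http://example.com/a", 10) == hostPartition("http://example.com/b", 10)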
Ordering Guarantees
• We guarantee that within a given partition, the intermediate key/value pairs are processed in increasing key order.
• This ordering guarantee makes it easy to generate a sorted output file per partition, which is useful when the output file format needs to support efficient random access lookups by key, or when users of the output find it convenient to have the data sorted.
Combiners Function (1)
• In some cases, there is significant repetition in the intermediate keys produced by each map task, and the user-specified Reduce function is commutative and associative.
• A good example of this is the word counting example. Since word frequencies tend to follow a Zipf distribution, each map task will produce hundreds or thousands of records of the form <the, 1>.
• All of these counts will be sent over the network to a single reduce task and then added together by the Reduce function to produce one number. We allow the user to specify an optional Combiner function that does partial merging of this data before it is sent over the network.
Combiners Function (2)
• The Combiner function is executed on each machine that performs a map task.
• Typically the same code is used to implement both the combiner and the reduce functions.
• The only difference between a reduce function and a combiner function is how the MapReduce library handles the output of the function.
• The output of a reduce function is written to the final output file. The output of a combiner function is written to an intermediate file that will be sent to a reduce task.
• Partial combining significantly speeds up certain classes of MapReduce operations.
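A minimal Scala sketch (illustrative only, not a framework API) shows what a combiner does for word count: it partially sums the counts on the map side so far fewer records cross the network, and because the reduce function (sum) is commutative and associative the final result is unchanged.

// Partially merge one map task's output before it is shuffled.
def combine(mapOutput: Seq[(String, Int)]): Seq[(String, Int)] =
  mapOutput
    .groupBy(_._1)
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
    .toSeq

// One map task emitting <the, 1> many times ...
val emitted = Seq.fill(3)(("the", 1)) ++ Seq(("spark", 1), ("the", 1))
// ... sends only one record per word after combining:
println(combine(emitted))   // List((the,4), (spark,1)) in some order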
Examples
Example 1: Word Count using MapReduce

map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
Count Illustrated
map(key=url, val=contents):
  For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
  Sum all “1”s in values list
  Emit result “(word, sum)”

Input:               Map output:      Reduce output:
  see bob run          (see, 1)         (bob, 1)
  see spot throw       (bob, 1)         (run, 1)
                       (run, 1)         (see, 2)
                       (see, 1)         (spot, 1)
                       (spot, 1)        (throw, 1)
                       (throw, 1)
Example 2: Counting words of different lengths

• The map function takes a value and outputs key:value pairs.
• For instance, if we define a map function that takes a string and outputs the length of the word as the key and the word itself as the value, then
  • map(steve) would return 5:steve and
  • map(savannah) would return 8:savannah.
• This allows us to run the map function against values in parallel and provides a huge advantage.
Example 2: Counting words of different lengths
Before we get to the reduce function, the MapReduce framework groups all of the values together by key. If the map functions output the following key:value pairs:

  3 : the
  3 : and
  3 : you
  4 : then
  4 : what
  4 : when
  5 : steve
  5 : where
  8 : savannah
  8 : research

they get grouped as:

  3 : [the, and, you]
  4 : [then, what, when]
  5 : [steve, where]
  8 : [savannah, research]
Example 2: Counting words of different lengths
• Each of these lines would then be passed as an argument to the reduce function, which accepts a key and a list of values.
• In this instance, we might be trying to figure out how many words of certain lengths exist, so our reduce function will just count the number of items in the list and output the key with the size of the list, like:

  3 : 3
  4 : 3
  5 : 2
  8 : 2
Example 2: Counting words of different lengths

• The reductions can also be done in parallel, again providing a huge advantage. We can then look at these final results and see that there were only two words of length 5 in the corpus, etc. (a small runnable sketch follows below).
• The most common example of MapReduce is counting the number of times words occur in a corpus.
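The whole example can be sketched with plain Scala collections (illustrative only, not tied to any MapReduce framework):

val words = Seq("the", "and", "you", "then", "what", "when",
                "steve", "where", "savannah", "research")

// map: word -> (length, word)
val mapped = words.map(w => (w.length, w))

// shuffle: group the values by key
val grouped = mapped.groupBy(_._1)           // 3 -> (the, and, you), ...

// reduce: (length, list of words) -> (length, count)
val counts = grouped.map { case (len, pairs) => (len, pairs.size) }

println(counts.toSeq.sortBy(_._1))           // List((3,3), (4,3), (5,2), (8,2))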
Example 3: Finding Friends
• Facebook has a list of friends (note that friends are a bi-directional thing on Facebook: if I'm your friend, you're mine).
• They also have lots of disk space and they serve hundreds of millions of requests every day. They've decided to pre-compute calculations when they can to reduce the processing time of requests. One common processing request is the "You and Joe have 230 friends in common" feature.
• When you visit someone's profile, you see a list of friends that you have in common. This list doesn't change frequently, so it'd be wasteful to recalculate it every time you visited the profile (sure, you could use a decent caching strategy, but then we wouldn't be able to continue writing about MapReduce for this problem).
• We're going to use MapReduce so that we can calculate everyone's common friends once a day and store those results. Later on it's just a quick lookup. We've got lots of disk; it's cheap.
Example 3: Finding Friends
• Assume the friends are stored as Person -> [List of Friends]; our friends list is then:

  A -> B C D
  B -> A C D E
  C -> A B D E
  D -> A B C E
  E -> B C D
Example 3: Finding Friends
For map(A -> B C D):
  (A B) -> B C D
  (A C) -> B C D
  (A D) -> B C D

For map(B -> A C D E): (Note that A comes before B in the key)
  (A B) -> A C D E
  (B C) -> A C D E
  (B D) -> A C D E
  (B E) -> A C D E
Example 3: Finding Friends
For map(C -> A B D E):
  (A C) -> A B D E
  (B C) -> A B D E
  (C D) -> A B D E
  (C E) -> A B D E

For map(D -> A B C E):
  (A D) -> A B C E
  (B D) -> A B C E
  (C D) -> A B C E
  (D E) -> A B C E

And finally for map(E -> B C D):
  (B E) -> B C D
  (C E) -> B C D
  (D E) -> B C D
Example 3: Finding Friends
• Before we send these key-value pairs to the reducers, we group them by their keys and get:

  (A B) -> (A C D E) (B C D)
  (A C) -> (A B D E) (B C D)
  (A D) -> (A B C E) (B C D)
  (B C) -> (A B D E) (A C D E)
  (B D) -> (A B C E) (A C D E)
  (B E) -> (A C D E) (B C D)
  (C D) -> (A B C E) (A B D E)
  (C E) -> (A B D E) (B C D)
  (D E) -> (A B C E) (B C D)
Example 3: Finding Friends
• Each line will be passed as an argument to a reducer.
• The reduce function will simply intersect the lists of values and output the same key with the result of the intersection.
• For example, reduce((A B) -> (A C D E) (B C D)) will output (A B) : (C D), which means that friends A and B have C and D as common friends.
Example 3: Finding Friends
• The result after reduction is:

  (A B) -> (C D)
  (A C) -> (B D)
  (A D) -> (B C)
  (B C) -> (A D E)
  (B D) -> (A C E)
  (B E) -> (C D)
  (C D) -> (A B E)
  (C E) -> (B D)
  (D E) -> (B C)

• Now when D visits B's profile, we can quickly look up (B D) and see that they have three friends in common: (A C E).
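The same computation can be sketched locally in Scala (illustrative only; in production this would run as MapReduce jobs over the real friend lists):

val friends = Map(
  "A" -> Set("B", "C", "D"),
  "B" -> Set("A", "C", "D", "E"),
  "C" -> Set("A", "B", "D", "E"),
  "D" -> Set("A", "B", "C", "E"),
  "E" -> Set("B", "C", "D"))

// map: for person p with friend list fs, emit ((p, f) sorted, fs) for each friend f
val mapped = friends.toSeq.flatMap { case (p, fs) =>
  fs.toSeq.map { f =>
    val key = if (p < f) (p, f) else (f, p)   // A comes before B in the key
    (key, fs)
  }
}

// shuffle: group the two friend lists emitted for each pair
val grouped = mapped.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

// reduce: intersect the lists to get the pair's common friends
val common = grouped.map { case (pair, lists) => (pair, lists.reduce(_ intersect _)) }

println(common(("B", "D")))   // Set(A, C, E)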
Reading
Jeffrey Dean and Sanjay Ghemawat,
“MapReduce: Simplified Data Processing on Large Clusters”
http://labs.google.com/papers/mapreduce.html
Conclusion

• The MapReduce programming model has been successfully used at Google for many different purposes.
• The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault-tolerance, locality optimization, and load balancing.
• A large variety of problems are easily expressible as MapReduce computations.
• For example, MapReduce is used for the generation of data for Google's production web search service, for sorting, for data mining, for machine learning, and many other systems.
HDFS and Spark

Rajiv Misra
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
[email protected]
The Hadoop Distributed File System (HDFS)
Introduction
• Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm.
• An important characteristic of Hadoop is the partitioning of data and computation across many (thousands) of hosts, and executing application computations in parallel close to their data.
• A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers. Hadoop clusters at Yahoo! span 25,000 servers and store 25 petabytes of application data, with the largest cluster being 3500 servers. One hundred other organizations worldwide report using Hadoop.
Contd…
• Hadoop is an Apache project; all components are available via the Apache open source license.
• Yahoo! has developed and contributed to 80% of the core of Hadoop (HDFS and MapReduce).
• HBase was originally developed at Powerset, now a department at Microsoft.
• Hive was originated and developed at Facebook.
• Pig, ZooKeeper, and Chukwa were originated and developed at Yahoo!
• Avro was originated at Yahoo! and is being co-developed with Cloudera.
Hadoop Project Components

HDFS: Distributed file system
MapReduce: Distributed computation framework
HBase: Column-oriented table service
Pig: Dataflow language and parallel execution framework
Hive: Data warehouse infrastructure
ZooKeeper: Distributed coordination service
Chukwa: System for collecting management data
Avro: Data serialization system

Table 1. Hadoop project components


Contd…
• HDFS is the file system component of Hadoop. While the interface to HDFS is patterned after the UNIX file system, faithfulness to standards was sacrificed in favour of improved performance for the applications at hand.
• HDFS stores file system metadata and application data separately.
• As in other distributed file systems, like PVFS, Lustre and GFS, HDFS stores metadata on a dedicated server, called the NameNode.
• Application data are stored on other servers called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols.
HDFS Design Assumptions

• Single machines tend to fail
  • Hard disk, power supply, …
• More machines = increased failure probability
• Data doesn't fit on a single node
• Desired:
  • Commodity hardware
  • Built-in backup and failover
Architecture
Namenode and Datanodes
• Namenode (Master)
  • Metadata:
    • Where file blocks are stored (namespace image)
    • Edit (operation) log
  • Secondary namenode (shadow master)
• Datanode (Chunkserver)
  • Stores and retrieves blocks, by client or namenode
  • Reports to namenode with the list of blocks they are storing
Noticeable Differences from GFS
• Only single-writers per file.
• No record append operation.
• Open source
  • Provides many interfaces and libraries for different file systems (S3, KFS, etc.)
  • Thrift (C++, Python, …), libhdfs (C), FUSE
A) Namenode
• The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes, which record attributes like permissions, modification and access times, namespace and disk space quotas.
• The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file) and each block of the file is independently replicated at multiple DataNodes (typically three, but user selectable file-by-file). The NameNode maintains the namespace tree and the mapping of file blocks to DataNodes (the physical location of file data).
• An HDFS client wanting to read a file first contacts the NameNode for the locations of data blocks comprising the file and then reads block contents from the DataNode closest to the client.
Contd…

• When writing data, the client requests the NameNode to nominate a suite of three DataNodes to host the block replicas.
• The client then writes data to the DataNodes in a pipeline fashion.
• The current design has a single NameNode for each cluster. The cluster can have thousands of DataNodes and tens of thousands of HDFS clients per cluster, as each DataNode may execute multiple application tasks concurrently.
Contd…
• HDFS keeps the entire namespace in RAM. The inode data and the list of blocks belonging to each file comprise the metadata of the name system, called the image. The persistent record of the image stored in the local host's native file system is called a checkpoint.
• The NameNode also stores the modification log of the image, called the journal, in the local host's native file system. For improved durability, redundant copies of the checkpoint and journal can be made at other servers.
• During restarts the NameNode restores the namespace by reading the namespace and replaying the journal. The locations of block replicas may change over time and are not part of the persistent checkpoint.
B) Datanode
• Each block replica on a DataNode is represented by two files in the local host's native file system. The first file contains the data itself and the second file is the block's metadata, including checksums for the block data and the block's generation stamp.
• The size of the data file equals the actual length of the block and does not require extra space to round it up to the nominal block size as in traditional file systems. Thus, if a block is half full it needs only half of the space of the full block on the local drive.
• During startup each DataNode connects to the NameNode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the DataNode. If either does not match that of the NameNode the DataNode automatically shuts down.
Contd…
• The namespace ID is assigned to the file system instance when it is formatted. The namespace ID is persistently stored on all nodes of the cluster. Nodes with a different namespace ID will not be able to join the cluster, thus preserving the integrity of the file system.
• The consistency of software versions is important because incompatible versions may cause data corruption or loss, and on large clusters of thousands of machines it is easy to overlook nodes that did not shut down properly prior to the software upgrade or were not available during the upgrade.
• A DataNode that is newly initialized and without any namespace ID is permitted to join the cluster and receive the cluster's namespace ID.
Contd…
• After the handshake the DataNode registers with the NameNode. DataNodes persistently store their unique storage IDs. The storage ID is an internal identifier of the DataNode, which makes it recognizable even if it is restarted with a different IP address or port. The storage ID is assigned to the DataNode when it registers with the NameNode for the first time and never changes after that.
• A DataNode identifies block replicas in its possession to the NameNode by sending a block report. A block report contains the block ID, the generation stamp and the length for each block replica the server hosts. The first block report is sent immediately after the DataNode registration. Subsequent block reports are sent every hour and provide the NameNode with an up-to-date view of where block replicas are located on the cluster.
Contd…

• During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available. The default heartbeat interval is three seconds.
• If the NameNode does not receive a heartbeat from a DataNode in ten minutes, the NameNode considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable.
• The NameNode then schedules creation of new replicas of those blocks on other DataNodes.
Contd…
• Heartbeats from a DataNode also carry information about total storage capacity, fraction of storage in use, and the number of data transfers currently in progress. These statistics are used for the NameNode's space allocation and load balancing decisions.
• The NameNode does not directly call DataNodes. It uses replies to heartbeats to send instructions to the DataNodes. The instructions include commands to:
  • Replicate blocks to other nodes;
  • Remove local block replicas;
  • Re-register or shut down the node;
  • Send an immediate block report.
• These commands are important for maintaining the overall system integrity and therefore it is critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per second without affecting other NameNode operations.
C) HDFS Client
• User applications access the file system using the HDFS client, a code library that exports the HDFS file system interface.
• Similar to most conventional file systems, HDFS supports operations to read, write and delete files, and operations to create and delete directories.
• The user references files and directories by paths in the namespace.
• The user application generally does not need to know that file system metadata and storage are on different servers, or that blocks have multiple replicas.
Contd…
• When an application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the blocks of the file. It then contacts a DataNode directly and requests the transfer of the desired block.
• When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline from node to node and sends the data.
• When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the client sends the further bytes of the file. Each choice of DataNodes is likely to be different.
• The interactions among the client, the NameNode and the DataNodes are illustrated in Figure 1.
[Figure 1: An HDFS client creates a new file by giving its path to the NameNode. For each block of the file, the NameNode returns a list of DataNodes to host its replicas. The client then pipelines data to the chosen DataNodes, which eventually confirm the creation of the block replicas to the NameNode.]
Contd…

• Unlike conventional file systems, HDFS provides an API that exposes the locations of a file's blocks.
• This allows applications like the MapReduce framework to schedule a task to where the data are located, thus improving the read performance.
• It also allows an application to set the replication factor of a file. By default a file's replication factor is three. For critical files or files which are accessed very often, having a higher replication factor improves their tolerance against faults and increases their read bandwidth.
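For a rough illustration of how an application drives the HDFS client, a short Scala sketch using the standard Hadoop FileSystem API is shown below (the file path and the replication value of 5 are made up for the example):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()            // picks up core-site.xml / hdfs-site.xml
val fs   = FileSystem.get(conf)

// Write: the client streams data; HDFS pipelines it to the chosen DataNodes.
val out = fs.create(new Path("/user/demo/notes.txt"))
out.write("hello hdfs\n".getBytes("UTF-8"))
out.close()

// Read: the NameNode supplies block locations; the data comes from a DataNode.
val in  = fs.open(new Path("/user/demo/notes.txt"))
val buf = new Array[Byte](1024)
val n   = in.read(buf)
in.close()

// Raise the replication factor of a frequently read file (the default is 3).
fs.setReplication(new Path("/user/demo/notes.txt"), 5.toShort)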
Anatomy of a File Read

[Figure: HDFS read path. The client obtains the block locations from the NameNode, then reads block contents directly from the closest DataNodes.]
Anatomy of a File Write

[Figure: HDFS write path. The client asks the NameNode to nominate DataNodes for each block, then pipelines the block data from node to node.]
Additional Topics
• Replica placements:
  • Different node, rack, and data center
• Coherency model:
  • Describes data visibility
  • The current block being written may not be visible to other readers
Spark
Motivation

• MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters.

[Diagram: acyclic MapReduce data flow, with the input split across Map tasks whose results are shuffled to Reduce tasks that produce the output.]
Contd…

• However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications.
• In this part of the lecture, we will focus on one such class of applications: those that reuse a working set of data across multiple parallel operations.
• This includes many iterative machine learning algorithms, as well as interactive data analysis tools.
• A new framework called Spark supports such applications while retaining the scalability and fault tolerance of MapReduce.
Contd…

• To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs).
• An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
• Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Difference Between Hadoop MapReduce and Apache Spark

Hadoop MapReduce           Apache Spark
Fast                       100x faster than MapReduce
Batch processing           Real-time processing
Stores data on disk        Stores data in memory
Written in Java            Written in Scala
Introduction
• A new model of cluster computing has become widely popular, in which data-parallel computations are executed on clusters of unreliable machines by systems that automatically provide locality-aware scheduling, fault tolerance, and load balancing.
• MapReduce pioneered this model, while systems like Dryad and Map-Reduce-Merge generalized the types of data flows supported.
• These systems achieve their scalability and fault tolerance by providing a programming model where the user creates acyclic data flow graphs to pass input data through a set of operators.
• This allows the underlying system to manage scheduling and to react to faults without user intervention.
Contd…
• While this data flow programming model is useful for a large class of applications, there are applications that cannot be expressed efficiently as acyclic data flows.
• Here, we focus on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes two use cases where Hadoop users have reported that MapReduce is deficient:
(i) Iterative jobs: Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty.
Contd…
(ii) Interactive analytics: Hadoop is often used to run ad-hoc exploratory queries on large datasets, through SQL interfaces such as Pig and Hive.
• Ideally, a user would be able to load a dataset of interest into memory across a number of machines and query it repeatedly.
• However, with Hadoop, each query incurs significant latency (tens of seconds) because it runs as a separate MapReduce job and reads data from disk.
• A new cluster computing framework called Spark supports applications with working sets while providing similar scalability and fault tolerance properties to MapReduce.
Contd…
• The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
• Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations. RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.
• Although RDDs are not a general shared memory abstraction, they represent a sweet spot between expressivity on the one hand and scalability and reliability on the other hand, and they have been found well-suited for a variety of applications.
Contd…
• Spark is implemented in Scala, a statically typed high-level programming language for the Java VM, and exposes a functional programming interface similar to DryadLINQ.
• In addition, Spark can be used interactively from a modified version of the Scala interpreter, which allows the user to define RDDs, functions, variables and classes and use them in parallel operations on a cluster.
• Spark is the first system to allow an efficient, general-purpose programming language to be used interactively to process large datasets on a cluster.
• Spark can outperform Hadoop by 10x in iterative machine learning workloads and can be used interactively to scan a 39 GB dataset with sub-second latency.
Programming Model

• To use Spark, developers write a driver program that implements the high-level control flow of their application and launches various operations in parallel.
• Spark provides two main abstractions for parallel programming: resilient distributed datasets and parallel operations on these datasets (invoked by passing a function to apply on a dataset).
• In addition, Spark supports two restricted types of shared variables that can be used in functions running on the cluster.
[Figure: Spark runtime. The user's driver program launches multiple workers, which read data blocks from a distributed file system and can persist computed RDD partitions in memory.]
Resilient Distributed Datasets (RDDs)
• A resilient distributed dataset (RDD) is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
• The elements of an RDD need not exist in physical storage; instead, a handle to an RDD contains enough information to compute the RDD starting from data in reliable storage.
• This means that RDDs can always be reconstructed if nodes fail.
Contd…
• In Spark, each RDD is represented by a Scala object. Spark lets programmers construct RDDs in four ways:
• From a file in a shared file system, such as the Hadoop Distributed File System (HDFS).
• By “parallelizing” a Scala collection (e.g., an array) in the driver program, which means dividing it into a number of slices that will be sent to multiple nodes.
• By transforming an existing RDD. A dataset with elements of type A can be transformed into a dataset with elements of type B using an operation called flatMap, which passes each element through a user-provided function of type A => List[B]. Other transformations can be expressed using flatMap, including map (pass elements through a function of type A => B) and filter (pick elements matching a predicate).
Contd…
• By changing the persistence of an existing RDD. By default, RDDs are lazy and ephemeral.
• That is, partitions of a dataset are materialized on demand when they are used in a parallel operation (e.g., by passing a block of a file through a map function), and are discarded from memory after use. However, a user can alter the persistence of an RDD through two actions:
• The cache action leaves the dataset lazy, but hints that it should be kept in memory after the first time it is computed, because it will be reused.
• The save action evaluates the dataset and writes it to a distributed filesystem such as HDFS. The saved version is used in future operations on it.
Contd…
• If there is not enough memory in the cluster to cache all partitions of a dataset, Spark will recompute them when they are used.
• Spark programs keep working (at reduced performance) if nodes fail or if a dataset is too big. This idea is loosely analogous to virtual memory.
• Spark can also be extended to support other levels of persistence (e.g., in-memory replication across multiple nodes).
• The goal is to let users trade off between the cost of storing an RDD, the speed of accessing it, the probability of losing part of it, and the cost of recomputing it.
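A short sketch of the four construction methods, assuming a SparkContext named sc (the HDFS paths are made up; in current Spark APIs the paper's save action corresponds to calls such as saveAsTextFile):

// 1. From a file in a shared file system such as HDFS:
val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")

// 2. By "parallelizing" a Scala collection in the driver program:
val nums = sc.parallelize(1 to 1000, numSlices = 10)

// 3. By transforming an existing RDD (flatMap, map, filter, ...):
val words     = lines.flatMap(line => line.split(" "))
val longWords = words.filter(w => w.length > 5)

// 4. By changing the persistence of an existing RDD:
val cached = longWords.cache()   // keep in memory after the first computation
longWords.saveAsTextFile("hdfs://namenode:9000/data/long_words")   // "save"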
Parallel Operations
• Several parallel operations can be performed on RDDs:
• Reduce: Combines dataset elements using an associative function to produce a result at the driver program.
• Collect: Sends all elements of the dataset to the driver program. For example, an easy way to update an array in parallel is to parallelize, map and collect the array.
• Foreach: Passes each element through a user-provided function. This is only done for the side effects of the function.
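A minimal sketch of these three operations, again assuming a SparkContext sc:

val nums = sc.parallelize(1 to 100)

// reduce: combine elements with an associative function; the result comes back to the driver
val sum = nums.reduce(_ + _)                     // 5050

// collect: bring all elements of a (small) dataset back to the driver
val doubled: Array[Int] = nums.map(_ * 2).collect()

// foreach: run a function on every element purely for its side effects
nums.foreach(x => println(s"saw $x"))            // printed on the workers' stdout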
Shared Variables

• Programmers invoke operations like map, filter and reduce by passing closures (functions) to Spark.
• As is typical in functional programming, these closures can refer to variables in the scope where they are created.
• Normally, when Spark runs a closure on a worker node, these variables are copied to the worker. However, Spark also lets programmers create two restricted types of shared variables to support two simple but common usage patterns: broadcast variables and accumulators.
Broadcast variables
• Broadcast variables: If a large read-only piece of data (e.g., a lookup table) is used in multiple parallel operations, it is preferable to distribute it to the workers only once instead of packaging it with every closure.
• Spark lets the programmer create a “broadcast variable” object that wraps the value and ensures that it is only copied to each worker once.
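A short sketch of a broadcast variable (the lookup table is illustrative; sc is an existing SparkContext):

val lookup  = Map("us" -> "United States", "in" -> "India", "fr" -> "France")
val bLookup = sc.broadcast(lookup)          // shipped to each worker only once

val codes = sc.parallelize(Seq("in", "us", "in", "fr"))
val names = codes.map(c => bLookup.value.getOrElse(c, "unknown")).collect()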
Accumulators
• Accumulators: These are variables that workers can only “add” to using an associative operation, and that only the driver can read.
• They can be used to implement counters as in MapReduce and to provide a more imperative syntax for parallel sums.
• Accumulators can be defined for any type that has an “add” operation and a “zero” value.
• Due to their “add-only” semantics, they are easy to make fault-tolerant.
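A sketch of an accumulator used as a counter, assuming a SparkContext sc (older Spark APIs used sc.accumulator(0); recent versions provide sc.longAccumulator, used here):

val badRecords = sc.longAccumulator("badRecords")

val records = sc.parallelize(Seq("3", "7", "oops", "11"))
val parsed = records.flatMap { r =>
  try Some(r.toInt)
  catch { case _: NumberFormatException => badRecords.add(1); None }
}
parsed.count()                      // forces evaluation of the dataset
println(badRecords.value)           // only the driver reads the accumulator: 1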
RDD Operations

Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
Parallel operations / Actions (return a result to the driver): reduce, collect, count, save, lookupKey, …

Transformations
map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
sample(withReplacement, fraction, seed): Sample a fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.
Contd…
groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs.
reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function.
sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples (also called groupWith).
cartesian(otherDataset): When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
Actions
reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one); func should be commutative and associative so that it can be computed correctly in parallel.
collect(): Return all the elements of the dataset as an array at the driver program. Usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count(): Return the number of elements in the dataset.
first(): Return the first element of the dataset (similar to take(1)).
take(n): Return an array with the first n elements of the dataset. Currently not executed in parallel; instead the driver program computes all the elements.
takeSample(withReplacement, fraction, seed): Return an array with a random sample of num elements of the dataset, with or without replacement, using the given random number generator seed.
Contd…
saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path): Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. Only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc.).
countByKey(): Only available on RDDs of type (K, V). Returns a ‘Map’ of (K, Int) pairs with the count of each key.
foreach(func): Run a function func on each element of the dataset. Usually done for side effects such as updating an accumulator variable or interacting with external storage systems.
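A short sketch combining several of the transformations and actions above (assumes a SparkContext sc; the data is made up):

val sales  = sc.parallelize(Seq(("apple", 3), ("banana", 2), ("apple", 5)))
val prices = sc.parallelize(Seq(("apple", 1.0), ("banana", 0.5)))

val totals  = sales.reduceByKey(_ + _)              // (apple,8), (banana,2)
val joined  = totals.join(prices)                   // (apple,(8,1.0)), (banana,(2,0.5))
val revenue = joined.mapValues { case (qty, price) => qty * price }

println(revenue.sortByKey().collect().toList)       // List((apple,8.0), (banana,1.0))
println(sales.countByKey())                         // Map(apple -> 2, banana -> 1)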
Spark Community
• Most active open source community in big data
• 200+ developers, 50+ companies contributing
Built-in Libraries
Standard Library for Big Data

• Big data apps lack libraries of common algorithms
• Spark's generality + support for multiple languages make it suitable to offer this
• Much of future activity will be in these libraries
A General Platform

[Diagram: Spark as a general platform, with libraries such as Spark SQL, Spark Streaming, MLlib and GraphX built on the Spark core.]
Machine Learning Library (MLlib)
MLlib algorithms:
(i) Classification: logistic regression, linear SVM, naïve Bayes, classification tree
(ii) Regression: generalized linear models (GLMs), regression tree
(iii) Collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
(iv) Clustering: k-means
(v) Decomposition: SVD, PCA
(vi) Optimization: stochastic gradient descent, L-BFGS
GraphX

• General graph processing library
• Build graphs using RDDs of nodes and edges
• Large library of graph algorithms with composable steps
GraphX Algorithms
(i) Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization
(ii) Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
(iii) Semi-supervised ML: Graph SSL, CoEM
(iv) Community Detection: Triangle-Counting, K-core Decomposition, K-Truss
(v) Graph Analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring
(vi) Classification: Neural Networks
Spark Streaming

• Large-scale streaming computation
• Ensures exactly-once semantics
• Integrated with Spark: unifies batch, interactive, and streaming computations
Spark SQL

Enables loading & querying structured data in Spark

From Hive:

c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()

From JSON:

c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")
Examples
Example 1: PageRank

• Give pages ranks (scores) based on links to them
  • Links from many pages → high rank
  • Links from a high-rank page → high rank
Algorithm
Step 1: Start each page at a rank of 1.
Step 2: On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors.
Step 3: Set each page's rank to 0.15 + 0.85 x contributions.
Spark Program

val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
PageRank Performance

Example 2: Logistic Regression
• Goal: find best line separating two sets of points

[Figure: two classes of points; a random initial line is iteratively adjusted until it reaches the target separating line.]
Logistic Regression Code

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
Logistic Regression Performance

[Chart: Hadoop takes about 127 s per iteration, while Spark takes 174 s for the first iteration and about 6 s for further iterations.]
Example 3: MapReduce

• MapReduce data flow can be expressed using RDD transformations:

res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map((key, vals) => myReduceFunc(key, vals))

Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map((key, val) => myReduceFunc(key, val))
Example 4: WordCount
Definition: Count how often each word appears in a collection of text documents.

This simple program provides a good test case for parallel processing, since it:
• Requires a minimal amount of code
• Demonstrates use of both symbolic and numeric values
• Isn't many steps away from search indexing
• Serves as a “Hello World” for Big Data apps

A distributed computing framework that can run WordCount efficiently in parallel at scale can likely handle much larger and more interesting compute problems.
WordCount Program
Scala:

val f = sc.textFile("README.md")
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc.saveAsTextFile("wc_out")

Python:

from operator import add
f = sc.textFile("README.md")
wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
wc.saveAsTextFile("wc_out")
Other Spark Applications

i. Twitter spam classification
ii. EM algorithm for traffic prediction
iii. K-means clustering
iv. Alternating Least Squares matrix factorization
v. In-memory OLAP aggregation on Hive data
vi. SQL on Spark
Reading Material
• Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica, “Spark: Cluster Computing with Working Sets”
• Matei Zaharia, Mosharaf Chowdhury et al., “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”
• Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, “The Hadoop Distributed File System”
Conclusion
• Spark provides three simple data abstractions for programming clusters: resilient distributed datasets (RDDs) and two restricted types of shared variables, broadcast variables and accumulators. While these abstractions are limited, they are powerful enough to express several applications that pose challenges for existing cluster computing frameworks, including iterative and interactive computations.
• Furthermore, the core idea behind RDDs, of a dataset handle that has enough information to (re)construct the dataset from data available in reliable storage, may prove useful in developing other abstractions for programming clusters.
• In this lecture, we have discussed the HDFS components, its architecture, and the Spark framework with its applications.
HDFS and Spark

EL
PT Rajiv Misra
N
Dept. of Computer Science &
Engineering
Indian Institute of Technology Patna
[email protected]
EL
The Hadoop Distributed
File System (HDFS)
PT
N
Introduction
• Hadoop provides a distributed file system and a framework for
the analysis and transformation of very large data sets using the
MapReduce paradigm.

EL
• An important characteristic of Hadoop is the partitioning of
data and computation across many (thousands) of hosts, and
executing application computations in parallel close to their


data.
PT
A Hadoop cluster scales computation capacity, storage capacity
N
and IO bandwidth by simply adding commodity servers. Hadoop
clusters at Yahoo! span 25,000 servers, and store 25 petabytes
of application data, with the largest cluster being 3500 servers.
One hundred other organizations worldwide report using
Hadoop.
Contd…
• Hadoop is an Apache project; all components are available
via the Apache open source license.
• Yahoo! has developed and contributed to 80% of the core of

EL
Hadoop (HDFS and MapReduce).
• HBase was originally developed at Powerset, now a

• PT
department at Microsoft.
Hive was originated and developed at Facebook.
N
• Pig, ZooKeeper, and Chukwa were originated and developed
at Yahoo!
• Avro was originated at Yahoo! and is being co-developed with
Cloudera.
Hadoop Project Components

HDFS Distributed file system


MapReduce Distributed computation framework

EL
HBase Column-oriented table service
Dataflow language and parallel execution
Pig

Hive
ZooKeeper
PT
framework
Data warehouse infrastructure
Distributed coordination service
N
Chukwa System for collecting management data
Avro Data serialization system

Table 1. Hadoop project components


Contd…
• HDFS is the file system component of Hadoop. While the
interface to HDFS is patterned after the UNIX file system,
faithfulness to standards was sacrificed in favour of improved
performance for the applications at hand.

EL
• HDFS stores file system metadata and application data
separately.
• PT
As in other distributed file systems, like PVFS, Lustre and,
HDFS stores metadata on a dedicated server, called the
N
NameNode.
• Application data are stored on other servers called
DataNodes. All servers are fully connected and communicate
with each other using TCP-based protocols.
HDFS Design Assumptions

• Single machines tend to fail

• Hard disk, power supply,

• More machines = increased failure probability

• Data doesn’t fit on a single node

• Desired:
• Commodity hardware
• Built-in backup and failover
Architecture
Namenode and Datanodes
• Namenode (Master)

• Metadata:

• Where file blocks are stored (namespace image)
• Edit (Operation) log
• Secondary namenode (Shadow master)
• Datanode (Chunkserver)
• Stores and retrieves blocks
• by client or namenode.
• Reports to namenode with list of blocks they are storing
Noticeable Differences from GFS
• Only a single writer per file.
• No record append operation.
• Open source
• Provides many interfaces and libraries for different file
systems.
• S3, KFS, etc.
• Thrift (C++, Python, …), libhdfs (C), FUSE
A) Namenode
• The HDFS namespace is a hierarchy of files and directories. Files
and directories are represented on the NameNode by inodes,
which record attributes like permissions, modification and access
times, namespace and disk space quotas.

• The file content is split into large blocks (typically 128 megabytes,
but user selectable file-by-file) and each block of the file is

independently replicated at multiple DataNodes (typically three,
but user selectable file-by-file). The NameNode maintains the
namespace tree and the mapping of file blocks to DataNodes (the
N
physical location of file data).
• An HDFS client wanting to read a file first contacts the NameNode
for the locations of data blocks comprising the file and then reads
block contents from the DataNode closest to the client.
Contd…

• When writing data, the client requests the NameNode to


nominate a suite of three DataNodes to host the block
replicas.

• The client then writes data to the DataNodes in a pipeline
fashion.

• The current design has a single NameNode for each cluster.
The cluster can have thousands of DataNodes and tens of
N
thousands of HDFS clients per cluster, as each DataNode may
execute multiple application tasks concurrently.
Contd…
• HDFS keeps the entire namespace in RAM. The inode data and
the list of blocks belonging to each file comprise the metadata
of the name system called the image. The persistent record of
the image stored in the local host’s native file system is called a
checkpoint.
• The NameNode also stores the modification log of the image
called the journal in the local host’s native file system. For
improved durability, redundant copies of the checkpoint and
journal can be made at other servers.
• During restarts the NameNode restores the namespace by
reading the namespace and replaying the journal. The locations
of block replicas may change over time and are not part of the
persistent checkpoint.
B) Datanode
• Each block replica on a DataNode is represented by two files in
the local host’s native file system. The first file contains the
data itself and the second file is block’s metadata including
checksums for the block data and the block’s generation stamp.

• The size of the data file equals the actual length of the block
and does not require extra space to round it up to the nominal
block size as in traditional file systems. Thus, if a block is half full
it needs only half of the space of the full block on the local drive.
• During startup each DataNode connects to the NameNode and
performs a handshake. The purpose of the handshake is to
verify the namespace ID and the software version of the
DataNode. If either does not match that of the NameNode the
DataNode automatically shuts down.
Contd…
• The namespace ID is assigned to the file system instance when
it is formatted. The namespace ID is persistently stored on all
nodes of the cluster. Nodes with a different namespace ID will
not be able to join the cluster, thus preserving the integrity of

the file system.
• The consistency of software versions is important because
an incompatible version may cause data corruption or loss, and on
large clusters of thousands of machines it is easy to overlook
nodes that did not shut down properly prior to the software
N
upgrade or were not available during the upgrade.
• A DataNode that is newly initialized and without any namespace
ID is permitted to join the cluster and receive the cluster’s
namespace ID.
Contd…
• After the handshake the DataNode registers with the
NameNode. DataNodes persistently store their unique storage
IDs. The storage ID is an internal identifier of the DataNode,
which makes it recognizable even if it is restarted with a different

IP address or port. The storage ID is assigned to the DataNode
when it registers with the NameNode for the first time and never
changes after that.
• A DataNode identifies block replicas in its possession to the
NameNode by sending a block report. A block report contains the
N
block id, the generation stamp and the length for each block
replica the server hosts. The first block report is sent immediately
after the DataNode registration. Subsequent block reports are
sent every hour and provide the NameNode with an up-to-date
view of where block replicas are located on the cluster.
Contd…

• During normal operation DataNodes send heartbeats to the


NameNode to confirm that the DataNode is operating and the
block replicas it hosts are available. The default heartbeat
interval is three seconds.

• If the NameNode does not receive a heartbeat from a DataNode
in ten minutes the NameNode considers the DataNode to be out
of service and the block replicas hosted by that DataNode to be
unavailable.
N
• The NameNode then schedules creation of new replicas of those
blocks on other DataNodes.
Contd…
• Heartbeats from a DataNode also carry information about total
storage capacity, fraction of storage in use, and the number of data
transfers currently in progress. These statistics are used for the
NameNode’s space allocation and load balancing decisions.
• The NameNode does not directly call DataNodes. It uses replies to

heartbeats to send instructions to the DataNodes. The instructions
include commands to:
• Replicate blocks to other nodes;
• Remove local block replicas;
• Re-register or to shut down the node;
• Send an immediate block report.

• These commands are important for maintaining the overall system


integrity and therefore it is critical to keep heartbeats frequent even
on big clusters. The NameNode can process thousands of
heartbeats per second without affecting other NameNode
operations.
C) HDFS Client
• User applications access the file system using the HDFS client,
a code library that exports the HDFS file system interface.
• Similar to most conventional file systems, HDFS supports

operations to read, write and delete files, and operations to
create and delete directories.
• The user references files and directories by paths in the
namespace.
• The user application generally does not need to know that file
system metadata and storage are on different servers, or that
blocks have multiple replicas.
Contd…
• When an application reads a file, the HDFS client first asks the
NameNode for the list of DataNodes that host replicas of the blocks
of the file. It then contacts a DataNode directly and requests the
transfer of the desired block.

EL
• When a client writes, it first asks the NameNode to choose
DataNodes to host replicas of the first block of the file. The client
organizes a pipeline from node-to-node and sends the data.
• PT
When the first block is filled, the client requests new DataNodes to
be chosen to host replicas of the next block. A new pipeline is
N
organized, and the client sends the further bytes of the file. Each
choice of DataNodes is likely to be different.
• The interactions among the client, the NameNode and the
DataNodes are illustrated in Figure 1.
EL
PT
N
Figure 1. An HDFS client creates a new file by giving its path to the
NameNode. For each block of the file, the NameNode returns a list of
DataNodes to host its replicas. The client then pipelines data to the chosen
DataNodes, which eventually confirm the creation of the block replicas to
the NameNode.
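The same client interactions can be exercised programmatically. The following is a minimal sketch that uses Hadoop's standard FileSystem API from Scala; the NameNode address and the file path are illustrative assumptions, not values from this lecture.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsClientSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Assumed NameNode address; replace with your cluster's fs.defaultFS.
    conf.set("fs.defaultFS", "hdfs://namenode:8020")
    val fs = FileSystem.get(conf)

    // Write: the client asks the NameNode for DataNodes and pipelines the data to them.
    val out = fs.create(new Path("/user/demo/hello.txt"))
    out.write("hello hdfs".getBytes("UTF-8"))
    out.close()

    // Read: block locations come from the NameNode, bytes from a nearby DataNode.
    val in = fs.open(new Path("/user/demo/hello.txt"))
    val buf = new Array[Byte](64)
    val n = in.read(buf)
    println(new String(buf, 0, n, "UTF-8"))
    in.close()
  }
}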
Contd…

• Unlike conventional file systems, HDFS provides an API that


exposes the locations of a file’s blocks.
• This allows applications like the MapReduce framework to
schedule a task to where the data are located, thus improving
the read performance.
• It also allows an application to set the replication factor of a file.
By default a file’s replication factor is three.
• For critical files or files which are accessed very often, having a
higher replication factor improves their tolerance against faults
and increases their read bandwidth.
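For illustration, raising a file's replication factor can be done through the same FileSystem API; the path and factor below are hypothetical.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: increase replication for a frequently read file (path and factor are illustrative).
val fs = FileSystem.get(new Configuration())
val accepted = fs.setReplication(new Path("/user/demo/hot-file.txt"), 5.toShort)
println(s"replication change accepted: $accepted")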
Anatomy of a File Read

Anatomy of a File Write
Additional Topics
• Replica placements:

• Different node, rack, and center

• Coherency model:

• Describes data visibility
• Current block being written may not be visible to other
readers
Spark
Motivation

• MapReduce and its variants have been highly successful in


implementing large-scale data-intensive applications on
commodity clusters.

(Figure: the acyclic MapReduce data flow: Input → Map tasks → Reduce tasks → Output)
Contd…

• However, most of these systems are built around an acyclic


data flow model that is not suitable for other popular
applications.

• In this part of the lecture, we will focus on one such class of
applications, that reuse a working set of data across multiple
parallel operations.
• This includes many iterative machine learning algorithms, as
well as interactive data analysis tools.
• A new framework called Spark supports such applications
while retaining the scalability and fault tolerance of
MapReduce.
Contd…

• To achieve these goals, Spark introduces an abstraction


called resilient distributed datasets (RDDs).

• An RDD is a read-only collection of objects partitioned
across a set of machines that can be rebuilt if a partition is
lost.
• Spark can outperform Hadoop by 10x in iterative machine
learning jobs, and can be used to interactively query a 39
GB dataset with sub-second response time.
Difference Between Hadoop MapReduce
and Apache Spark

Hadoop MapReduce          Apache Spark
Fast                      100x faster than MapReduce
Batch Processing          Real-time Processing
Stores Data on Disk       Stores Data in Memory
Written in Java           Written in Scala
Introduction
• A new model of cluster computing has become widely popular,
in which data-parallel computations are executed on clusters of
unreliable machines by systems that automatically provide
locality-aware scheduling, fault tolerance, and load balancing.

• MapReduce pioneered this model, while systems like Dryad and
Map-Reduce-Merge generalized the types of data flows
supported.
• These systems achieve their scalability and fault tolerance by
providing a programming model where the user creates acyclic
data flow graphs to pass input data through a set of operators.
• This allows the underlying system to manage scheduling and to
react to faults without user intervention.
Contd…
• While this data flow programming model is useful for a large
class of applications, there are applications that cannot be
expressed efficiently as acyclic data flows.

• Here, we focus on one such class of applications, that reuse a
working set of data across multiple parallel operations. This
includes two use cases where we have seen Hadoop users report
that MapReduce is deficient:
(i) Iterative jobs: Many common machine learning algorithms
N
apply a function repeatedly to the same dataset to optimize a
parameter (e.g., through gradient descent). While each iteration
can be expressed as a MapReduce/Dryad job, each job must
reload the data from disk, incurring a significant performance
penalty.
Contd…
(ii) Interactive analytics: Hadoop is often used to run ad-hoc
exploratory queries on large datasets, through SQL interfaces
such as Pig and Hive.

• Ideally, a user would be able to load a dataset of interest into
memory across a number of machines and query it repeatedly.
• However, with Hadoop, each query incurs significant latency
(tens of seconds) because it runs as a separate MapReduce job
and reads data from disk.
• A new cluster computing framework called Spark supports
applications with working sets while providing similar scalability
and fault tolerance properties to MapReduce.
Contd…
• The main abstraction in Spark is that of a resilient distributed
dataset (RDD), which represents a read-only collection of objects
partitioned across a set of machines that can be rebuilt if a
partition is lost.

• Users can explicitly cache an RDD in memory across machines
and reuse it in multiple MapReduce-like parallel operations.
RDDs achieve fault tolerance through a notion of lineage: if a
partition of an RDD is lost, the RDD has enough information
about how it was derived from other RDDs to be able to rebuild
just that partition.
• Although RDDs are not a general shared memory abstraction,
they represent a sweet-spot between expressivity on the one
hand and scalability and reliability on the other hand, and are found
to be well-suited for a variety of applications.
Contd…
• Spark is implemented in Scala, a statically typed high-level
programming language for the Java VM, and exposes a
functional programming interface similar to DryadLINQ.

• In addition, Spark can be used interactively from a modified
version of the Scala interpreter, which allows the user to define
RDDs, functions, variables and classes and use them in parallel
operations on a cluster.
• Spark is the first system to allow an efficient, general-purpose
programming language to be used interactively to process large
datasets on a cluster.
• Spark can outperform Hadoop by 10x in iterative machine
learning workloads and can be used interactively to scan a 39 GB
dataset with sub-second latency.
Programming Model

• To use Spark, developers write a driver program that


implements the high-level control flow of their application and
launches various operations in parallel.

• Spark provides two main abstractions for parallel
programming:
• Resilient distributed datasets and parallel operations on
these datasets (invoked by passing a function to apply on a
N
dataset).
• In addition, Spark supports two restricted types of shared
variables that can be used in functions running on the cluster.
Figure: Spark runtime. The user’s driver program launches
multiple workers, which read data blocks from a distributed file
system and can persist computed RDD partitions in memory.
Resilient Distributed Datasets (RDDs)
• A resilient distributed dataset (RDD) is a read-only collection of
objects partitioned across a set of machines that can be rebuilt if
a partition is lost.

• The elements of an RDD need not exist in physical storage;
instead, a handle to an RDD contains enough information to
compute the RDD starting from data in reliable storage.
• This means that RDDs can always be reconstructed if nodes fail.
Contd…
• In Spark, each RDD is represented by a Scala object. Spark lets
programmers construct RDDs in four ways:
• From a file in a shared file system, such as the Hadoop
Distributed File System (HDFS).

• By “parallelizing” a Scala collection (e.g., an array) in the driver
program, which means dividing it into a number of slices that will
be sent to multiple nodes.
• By transforming an existing RDD. A dataset with elements of
type A can be transformed into a dataset with elements of type B
using an operation called flatMap, which passes each element
through a user-provided function of type A => List[B].
• Other transformations can be expressed using flatMap,
including map (pass each element through a function of type A => B)
and filter (pick elements matching a predicate).
Contd…
• By changing the persistence of an existing RDD. By default, RDDs
are lazy and ephemeral.
• That is, partitions of a dataset are materialized on demand when
they are used in a parallel operation (e.g., by passing a block of a

file through a map function), and are discarded from memory
after use. However, a user can alter the persistence of an RDD
through two actions:
• The cache action leaves the dataset lazy, but hints that it should
be kept in memory after the first time it is computed, because it
will be reused.
• The save action evaluates the dataset and writes it to a
distributed filesystem such as HDFS. The saved version is used in
future operations on it.
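A minimal sketch of the four construction options and the two persistence actions in Spark's Scala API is given below; the application name, file paths and data are illustrative, and saveAsTextFile is used here as the concrete form of the save action.

import org.apache.spark.{SparkConf, SparkContext}

object RddConstructionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-construction"))

    // 1. From a file in a shared file system such as HDFS (path is an assumption).
    val lines = sc.textFile("hdfs://namenode:8020/user/demo/input.txt")

    // 2. By parallelizing a Scala collection in the driver program.
    val numbers = sc.parallelize(1 to 1000)

    // 3. By transforming an existing RDD (flatMap, map, filter).
    val words     = lines.flatMap(line => line.split(" "))    // A => List[B]
    val lengths   = words.map(word => word.length)            // A => B
    val longWords = words.filter(word => word.length > 5)     // predicate

    // 4. By changing persistence: cache keeps the dataset in memory after first use;
    //    saving evaluates it and writes it to a distributed file system.
    val cached = longWords.cache()
    cached.saveAsTextFile("hdfs://namenode:8020/user/demo/long_words")

    sc.stop()
  }
}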
Contd…
• If there is not enough memory in the cluster to cache all
partitions of a dataset, Spark will recompute them when they are
used.

• Spark programs keep working (at reduced performance) if nodes
fail or if a dataset is too big. This idea is loosely analogous to
virtual memory.
• Spark can also be extended to support other levels of persistence
(e.g., in-memory replication across multiple nodes).
• The goal is to let users trade off between the cost of storing an
RDD, the speed of accessing it, the probability of losing part of
it, and the cost of recomputing it.
Parallel Operations
• Several parallel operations can be performed on RDDs:

• Reduce: Combines dataset elements using an associative

function to produce a result at the driver program.

• Collect: Sends all elements of the dataset to the driver program.
For example, an easy way to update an array in parallel is to
parallelize, map and collect the array.
• Foreach: Passes each element through a user provided function.
This is only done for the side effects of the function.
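A short sketch of these three operations, assuming sc is an existing SparkContext and the data is made up:

val nums = sc.parallelize(1 to 10)

val total   = nums.reduce(_ + _)           // combine elements with an associative function at the driver
val doubled = nums.map(_ * 2).collect()    // bring all elements of the dataset back to the driver
nums.foreach(x => println(s"saw $x"))      // run a function only for its side effects (output appears on the workers)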
Shared Variables

• Programmers invoke operations like map, filter and reduce by


passing closures (functions) to Spark.

• As is typical in functional programming, these closures can
refer to variables in the scope where they are created.
• Normally, when Spark runs a closure on a worker node, these
variables are copied to the worker.
• However, Spark also lets programmers create two restricted
types of shared variables to support two simple but common
usage patterns such as Broadcast variables and Accumulators.
Broadcast variables
• Broadcast variables: If a large read-only piece of data (e.g., a
lookup table) is used in multiple parallel operations, it is
preferable to distribute it to the workers only once instead of

packaging it with every closure.
• Spark lets the programmer create a “broadcast variable”
object that wraps the value and ensures that it is only copied
to each worker once.
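For illustration, a small broadcast-variable sketch (the lookup table and keys are made up, and sc is assumed to be an existing SparkContext):

// Ship a read-only lookup table to every worker once instead of with every closure.
val lookup = Map("a" -> 1, "b" -> 2, "c" -> 3)
val bc = sc.broadcast(lookup)

val keys     = sc.parallelize(Seq("a", "b", "c", "a"))
val resolved = keys.map(k => bc.value.getOrElse(k, -1)).collect()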
N
Accumulators
• Accumulators: These are variables that workers can only
“add” to using an associative operation, and that only the
driver can read.

• They can be used to implement counters as in MapReduce and
to provide a more imperative syntax for parallel sums.
• Accumulators can be defined for any type that has an “add”
operation and a “zero” value.
• Due to their “add-only” semantics, they are easy to make
fault-tolerant.
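A sketch of a counter-style accumulator, using the LongAccumulator of recent Spark releases (the original paper's API used sc.accumulator); the input records are made up.

// Count malformed records while mapping over a dataset.
val badRecords = sc.longAccumulator("badRecords")

val parsed = sc.parallelize(Seq("1", "2", "oops", "4")).map { s =>
  try { s.toInt } catch { case _: NumberFormatException => badRecords.add(1L); 0 }
}
parsed.count()                                 // force evaluation so the accumulator is updated
println(s"bad records: ${badRecords.value}")   // only the driver reads the value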
RDD Operations

Transformations (define a new RDD)      Parallel operations / Actions (return a result to driver)
map                                     reduce
filter                                  collect
sample                                  count
union                                   save
groupByKey                              lookupKey
reduceByKey                             …
join
cache

Transformations
Transformation              Description
map(func)                   Return a new distributed dataset formed by passing
                            each element of the source through a function func
filter(func)                Return a new dataset formed by selecting those
                            elements of the source on which func returns true
flatMap(func)               Similar to map, but each input item can be mapped to 0
                            or more output items (so func should return a Seq
                            rather than a single item)
sample(withReplacement,     Sample a fraction of the data, with or without
fraction, seed)             replacement, using a given random number
                            generator seed
union(otherDataset)         Return a new dataset that contains the union of the
                            elements in the source dataset and the argument
distinct([numTasks])        Return a new dataset that contains the distinct
                            elements of the source dataset
Contd…
Transformation Description
groupByKey([numTasks])         When called on a dataset of (K, V) pairs, returns a dataset of
                               (K, Seq[V]) pairs
reduceByKey(func, [numTasks])  When called on a dataset of (K, V) pairs, returns a dataset of
                               (K, V) pairs where the values for each key are aggregated using
                               the given reduce function
sortByKey([ascending],         When called on a dataset of (K, V) pairs where K implements
[numTasks])                    Ordered, returns a dataset of (K, V) pairs sorted by keys in
                               ascending or descending order as specified in the boolean
                               ascending argument
join(otherDataset,             When called on datasets of type (K, V) and (K, W), returns a
[numTasks])                    dataset of (K, (V, W)) pairs with all pairs of elements for each
                               key
cogroup(otherDataset,          When called on datasets of type (K, V) and (K, W), returns a
[numTasks])                    dataset of (K, Seq[V], Seq[W]) tuples, also called groupWith
cartesian(otherDataset)        When called on datasets of types T and U, returns a dataset of
                               (T, U) pairs (all pairs of elements)
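A quick illustration of two of the key-based transformations from the table above, with made-up data and sc assumed to be an existing SparkContext:

val sales  = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))
val totals = sales.reduceByKey(_ + _)              // ("apples", 8), ("pears", 2)
val prices = sc.parallelize(Seq(("apples", 0.5), ("pears", 0.75)))
val joined = totals.join(prices)                   // ("apples", (8, 0.5)), ("pears", (2, 0.75))
joined.collect().foreach(println)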
Actions
Action                       Description
reduce(func)                 Aggregate the elements of the dataset using a function func
                             (which takes two arguments and returns one); func should
                             also be commutative and associative so that it can be
                             computed correctly in parallel
collect()                    Return all the elements of the dataset as an array at the
                             driver program, usually useful after a filter or other operation
                             that returns a sufficiently small subset of the data
count()                      Return the number of elements in the dataset
first()                      Return the first element of the dataset, similar to take(1)
take(n)                      Return an array with the first n elements of the dataset;
                             currently not executed in parallel, instead the driver program
                             computes all the elements
takeSample(withReplacement,  Return an array with a random sample of num elements of
fraction, seed)              the dataset, with or without replacement, using the given
                             random number generator seed
Contd…
Action                       Description
saveAsTextFile(path)         Write the elements of the dataset as a text file (or set of text
                             files) in a given directory in the local filesystem, HDFS or any
                             other Hadoop-supported file system. Spark will call toString
                             on each element to convert it to a line of text in the file
saveAsSequenceFile(path)     Write the elements of the dataset as a Hadoop SequenceFile
                             in a given path in the local filesystem, HDFS or any other
                             Hadoop-supported file system. Only available on RDDs of
                             key-value pairs that either implement Hadoop’s Writable
                             interface or are implicitly convertible to Writable (Spark
                             includes conversions for basic types like Int, Double, String, etc.)
countByKey()                 Only available on RDDs of type (K, V). Returns a ‘Map’ of
                             (K, Int) pairs with the count of each key
foreach(func)                Run a function func on each element of the dataset, usually
                             done for side effects such as updating an accumulator
                             variable or interacting with external storage systems
Spark Community
• Most active open source community in big data

• 200+ developers, 50+ companies contributing

Built-in Libraries
Standard Library for Big Data

• Big data apps lack libraries
of common algorithms
• Spark’s generality + support
for multiple languages make
it suitable to offer this
• Much of future activity will
be in these libraries
A General Platform

Machine Learning Library (MLlib)
MLlib algorithms:
(i) Classification: logistic regression, linear SVM, naïve
Bayes, classification tree
(ii) Regression: generalized linear models (GLMs),
regression tree
(iii) Collaborative filtering: alternating least squares (ALS),
non-negative matrix factorization (NMF)
(iv) Clustering: k-means
(v) Decomposition: SVD, PCA
(vi) Optimization: stochastic gradient descent, L-BFGS
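As an illustration, a minimal k-means job with the RDD-based MLlib API might look as follows; the input path and the cluster/iteration counts are assumptions, and sc is an existing SparkContext.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated numeric features into MLlib vectors (input path is illustrative).
val data = sc.textFile("hdfs://namenode:8020/user/demo/points.txt")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()

val numClusters   = 3
val numIterations = 20
val model = KMeans.train(data, numClusters, numIterations)

model.clusterCenters.foreach(println)   // inspect the learned centroids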
GraphX

GraphX

•General graph processing library

•Build graph using RDDs of nodes and edges

• Large library of graph algorithms with composable steps
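A small sketch of building a graph from RDDs of vertices and edges and running the built-in PageRank; the vertex IDs, attributes and tolerance are made up, and sc is an existing SparkContext.

import org.apache.spark.graphx.{Edge, Graph}

// Build a tiny graph from RDDs of vertices and edges.
val vertices = sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph    = Graph(vertices, edges)

// Run PageRank until it converges to the given tolerance.
val ranks = graph.pageRank(0.0001).vertices
ranks.collect().foreach { case (id, rank) => println(s"$id -> $rank") }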
GraphX Algorithms
(i) Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization
(ii) Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
(iii) Semi-supervised ML: Graph SSL, CoEM
(iv) Community Detection: Triangle-Counting, K-core Decomposition, K-Truss
(v) Graph Analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring
(vi) Classification: Neural Networks
Spark Streaming

• Large scale streaming computation
• Ensures exactly-once semantics
• Integration with Spark unifies batch, interactive, and streaming
computations!
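For illustration, a minimal streaming word count over a socket source; the host, port and batch interval are assumptions, and sc is an existing SparkContext.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc    = new StreamingContext(sc, Seconds(1))
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()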
Spark SQL

Enables loading & querying structured data in Spark

From Hive:

c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()
From JSON:

c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")
Examples
Example 1: PageRank

• Give pages ranks (scores) based on links to them

• Links from many pages → high rank

• Links from a high-rank page → high rank
Algorithm
Step-1: Start each page at a rank of 1
Step-2: On each iteration, have page p contribute rank_p / |neighbors_p|
to its neighbors
Step-3: Set each page’s rank to 0.15 + 0.85 × contributions
Spark Program

val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile(...)
PageRank Performance

Example 2: Logistic Regression
• Goal: find best line separating two sets of points

(Figure: a random initial line converging to the target separating line)
Logistic Regression Code

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
Logistic Regression Performance

(Figure: Hadoop takes 127 s per iteration; with Spark the first iteration takes 174 s and further iterations take 6 s)
Example 3: MapReduce

• MapReduce data flow can be expressed using RDD


transformations

res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map((key, vals) => myReduceFunc(key, vals))
Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map((key, val) => myReduceFunc(key, val))
Example 4: WordCount
Definition: Count how often each word appears in a collection
of text documents

This simple program provides a good test case for parallel

processing, since it:
• Requires a minimal amount of code
• Demonstrates use of both symbolic and numeric values
• Isn’t many steps away from search indexing
• Serves as a “Hello World” for Big Data apps

A distributed computing framework that can run WordCount


efficiently in parallel at scale can likely handle much larger and
more interesting compute problems.
WordCount Program
Scala:

val f = sc.textFile("README.md")
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc.saveAsTextFile("wc_out")

Python:

from operator import add

f = sc.textFile("README.md")
wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
wc.saveAsTextFile("wc_out")
Other Spark Applications

i. Twitter spam classification


ii. EM algorithm for traffic prediction

iii. K-means clustering
iv. Alternating Least Squares matrix factorization
v. In-memory OLAP aggregation on Hive data
vi. SQL on Spark
Reading Material
• Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin,
Scott Shenker, Ion Stoica
“Spark: Cluster Computing with Working Sets”

• Matei Zaharia, Mosharaf Chowdhury et al.
“Resilient Distributed Datasets: A Fault-Tolerant
Abstraction for In-Memory Cluster Computing”
• Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert
Chansler
“The Hadoop Distributed File System”
Conclusion
• Spark provides three simple data abstractions for programming
clusters: resilient distributed datasets (RDDs), and two restricted
types of shared variables: broadcast variables and accumulators.
While these abstractions are limited, it is found that they are
powerful enough to express several applications that pose
challenges for existing cluster computing frameworks, including
iterative and interactive computations.
• Furthermore, it is believed that the core idea behind RDDs, of a
dataset handle that has enough information to (re)construct the
N
dataset from data available in reliable storage, may prove useful
in developing other abstractions for programming clusters.
• In this lecture, we have discussed the HDFS components and
architecture, and the Spark framework with its applications.
Distributed Algorithms for
Sensor Networks

“Connected Domination in Multihop Ad Hoc Wireless Networks”

Rajiv Misra
Dept. of Computer Science &
Engineering
Indian Institute of Technology Patna
[email protected]
Introduction
• Theidea of virtual backbone routing for ad hoc wireless
networks is to operate routing protocols over a virtual
backbone.
• One purpose of virtual backbone routing is to alleviate the

serious broadcast storm problem suffered by many
existing on-demand routing protocols for route detection.
• Thus constructing a virtual backbone is very important.
• In this lecture we study how the virtual backbone is
approximated by a minimum connected dominating set
(MCDS) in a unit-disk graph. This is an NP-hard problem.
•A distributed approximation algorithm with performance
ratio at most 8 will be covered.
Sensor Network as Adhoc network
• Adhoc wireless and Sensor network has applications in
emergency search-and-rescue operations, decision
making in the battlefield, data acquisition operations in

inhospitable terrain, etc.
• It is featured by dynamic topology (infrastructureless),
multihop communication, limited resources (bandwidth,
CPU, battery, etc.) and limited security.
• These characteristics put special challenges in routing
protocol design. Inspired by the physical backbone in a
wired network, many researchers proposed the concept
of virtual backbone for unicast, multicast/broadcast in
ad hoc wireless networks .
Contd…
• The virtual backbone is mainly used to collect
topology information for route detection. It also
works as a backup when route is unavailable
temporarily.

• An effective approach based on overlaying a virtual
infrastructure (termed core) on an ad hoc network is
popular.
• Routing protocols are operated over the core.
a (small) subset of non-core nodes.
• No broadcast is involved in core path detection.
Classification of Routing Protocols
• Existing routing protocols can be classified into two categories:
(i) proactive and (ii) reactive.
(i) Proactive routing protocols ask each host (or many hosts) to
maintain global topology information, thus a route can be

provided immediately when requested.
• But a large amount of control messages is required to keep
each host updated for the newest topology changes.
(ii) Reactive routing protocols have the feature on-demand.
N
Each host computes route for a specific destination only when
necessary.
• Topology changes which do not influence active routes do not
trigger any route maintenance function, thus communication
overhead is lower compared to proactive routing protocol.
On-demand Routing Protocols
• On-demand routing protocols attract much attention due
to their better scalability and lower protocol overhead.
• But most of them use flooding for route discovery.

Flooding suffers from the broadcast storm problem.
• The broadcast storm problem refers to the fact that flooding
may result in excessive redundancy, contention, and
collision. This causes high protocol overhead and
interference to other ongoing communication sessions.
• On the other hand, the unreliability of broadcast may
obstruct the detection of the shortest path, or simply
can’t detect any path at all, even though there exists one.
Problem of efficiently constructing virtual
backbone for ad hoc wireless networks
• In this lecture we will study the “problem of efficiently
constructing virtual backbone” for ad hoc wireless
networks.
• The number of hosts forming the virtual backbone must
be as small as possible to decrease protocol overhead.
• The algorithm must be time/message efficient due to
resource scarcity.
• We use a connected dominating set (CDS) to approximate
the virtual backbone.
Assumptions (1)

• We assume a given ad hoc network instance contains


n hosts.
• Each host is on the ground and is mounted with an omni-
directional antenna.
• Thus the transmission range of a host is a disk.
• We further assume that each transceiver has the same
communication range R.
• Thus the footprint of an ad hoc wireless network is a unit-
disk graph.
Assumptions (2)
• Ingraph-theoretic terminology, the network topology we
are interested in is a graph G=(V,E) where V contains all
hosts and E is the set of links.

• A link between u and v exists if their distance is at most R.
In a real-world ad hoc wireless network, sometimes even
when v is located in u’s transmission range, v is not
reachable from u due to hidden/exposed terminal problems.
• Here, we only consider bidirectional links.
• From now on, we use host and node interchangeably to
represent a wireless mobile.
Existing Distributed Algorithms for MCDS
                 B. Das et al.    B. Das et al.      J. Wu et al.   K.M. Alzoubi   Mihaela Cardei
                 [1997]-I         [1997]-II          [1999]         [2001]         et al.
Cardinality      ≤(2ln∆+3)·opt    ≤(2ln∆+2)·opt      N/A            ≤8·opt+1       ≤8·opt
Message          O(n|C|)          O(n|C|+m+nlogn)    O(n∆)          O(nlogn)       O(n)
Time             O((n+|C|)∆)      O((|C|+|C|)∆)      O(∆²)          O(n∆)          O(n∆)
Message Length   O(∆)             O(∆)               O(∆)           O(∆)           O(∆)
Information      2-hop            2-hop              2-hop          1-hop          1-hop

Table 1: Performance comparison of the algorithms. Here opt is the size of an optimal
CDS of the given instance; ∆ is the maximum degree; |C| is the size of the
generated connected dominating set; m is the number of edges; n is the
number of hosts.
Preliminaries (1)
• Given graph G =(V,E), two vertices are independent if they are
not neighbors. For any vertex v, the set of independent
neighbors of v is a subset of v’s neighbors such that any two
vertices in this subset are independent.

• An independent set (IS) S of G is a subset of V such that for all
u,v ∈ S, (u,v) ∉ E. S is maximal if any vertex not in S has a
neighbor in S (denoted by MIS).
A dominating set (DS) D of G is a subset of V such that any
node not in D has at least one neighbor in D. If the induced
N
subgraph of D is connected, then D is a connected dominating
set (CDS).
• Among all CDSs of graph G, the one with minimum cardinality
is called a minimum connected dominating set (MCDS)
Preliminaries (2)
• Computing an MCDS in a unit-disk graph is NP-hard. Note that
the problem of finding an MCDS in a graph is equivalent
to the problem of finding a spanning tree (ST) with
maximum number of leaves. All non-leaf nodes in the
spanning tree form the MCDS. An MIS is also a DS.
• For a graph G, if e = (u,v) ∈ E iff length(e) ≤ 1, then G is
called a unit-disk graph.
Preliminaries (3)
• This lemma relates the size of any MIS of unit-disk graph G
to the size of its optimal CDS
Lemma 2.1 The size of any MIS of G is at most 4·opt + 1,
where opt is the size of any optimal CDS of G.
• For a minimization problem P, the performance ratio of an
approximation algorithm A is defined to be
ρ = sup_{i ∈ I} (A_i / opt_i),
• where I is the set of instances of P, A_i is the output from A
for instance i and opt_i is the optimal solution for instance
i. In other words, ρ is the supremum of A_i / opt_i among all
instances of P.
An 8-approximate algorithm to compute CDS
• This algorithm contains two phases:

• Phase-1: First, a maximal independent set (MIS) is
computed;
• Phase-2: Then a Steiner tree is used to connect all
vertices in the MIS.
• This algorithm has performance ratio at most 8 and is
message and time efficient.
Algorithm description
• Initially each host is colored white.
•A dominator is colored black, while a dominatee is
colored gray.

• We assume that each vertex knows its distance-one
neighbors and their effective degrees d*.
• This information can be collected by periodic or
event-driven hello messages.
• The effective degree of a vertex is the total number of
white neighbors.
Contd...
• Here a host is designated as the leader. This is a
realistic assumption.
• For example, the leader can be the commander’s
mobile for a platoon of soldiers in a mission.
• If it is impossible to designate any leader, a
distributed leader-election algorithm can be applied
to find out a leader. This adds message and time
complexity.
• The best leader-election algorithm takes time O(n)
and message O(nlogn) and these are the best-
achievable results. Assume host s is the leader
Phase 1:
• Host s first colors itself black and broadcasts message
DOMINATOR.
• Any white host u receiving a DOMINATOR message for the first
time from v colors itself gray and broadcasts message
DOMINATEE. u selects v as its dominator.
• A white host receiving at least one DOMINATEE message
becomes active.
•An active white host with highest (d*, id) among all of its
active white neighbors will color itself black and broadcast
message DOMINATOR.
Contd...
•A white host decreases its effective degree by 1 and
broadcasts message DEGREE whenever it receives a
DOMINATEE message.

• Message DEGREE contains the sender’s current effective
degree. A white vertex receiving a DEGREE message will
update its neighborhood information accordingly.
• Each gray vertex will broadcast message
NUMOFBLACKNEIGHBORS when it detects that none of its
neighbors is white.

• Phase 1 terminates when no white vertex is left.
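The following is a centralized sketch of the Phase-1 coloring rule only, not the distributed message-passing protocol: it assumes a connected undirected graph given as an adjacency map, with the graph shape and identifiers purely illustrative.

object Phase1Sketch {
  sealed trait Color
  case object White extends Color
  case object Gray  extends Color
  case object Black extends Color

  // adj: adjacency map of a connected undirected graph; leader: the designated host s.
  def phase1(adj: Map[Int, Set[Int]], leader: Int): Map[Int, Color] = {
    var color: Map[Int, Color] = adj.keys.map(v => v -> (White: Color)).toMap

    def effDeg(v: Int): Int = adj(v).count(u => color(u) == White)   // d*: number of white neighbours
    def beats(v: Int, u: Int): Boolean = {
      val (dv, du) = (effDeg(v), effDeg(u))
      dv > du || (dv == du && v > u)                                 // compare (d*, id)
    }

    color = color.updated(leader, Black)
    adj(leader).foreach(n => color = color.updated(n, Gray))         // leader's neighbours become dominatees

    while (color.valuesIterator.contains(White)) {
      // "active" white hosts have a gray neighbour, i.e. have received a DOMINATEE message
      val active = color.collect { case (v, White) if adj(v).exists(color(_) == Gray) => v }.toSet
      // a host colors itself black if it beats every active white neighbour on (d*, id)
      val winners = active.filter(v => adj(v).intersect(active).forall(u => beats(v, u)))
      for (v <- winners) {
        color = color.updated(v, Black)
        adj(v).foreach(n => if (color(n) == White) color = color.updated(n, Gray))
      }
    }
    color                                                            // black hosts form a maximal independent set
  }
}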


Phase 2:
•When s receives message NUMOFBLACKNEIGHBORS from
all of its gray neighbors, it starts phase 2 by broadcasting
message M.

• A host is “ready” to be explored if it has no white
neighbors.
• A Steiner tree is used to connect all black hosts generated
in Phase 1.
• The idea is to pick those gray vertices which connect to
many black neighbors.
Contd...
•The classical distributed depth first search spanning tree
algorithm will be modified to compute the Steiner tree.

•A black vertex without any dominator is active.

• Initially no black vertex has a dominator and all hosts are
unexplored.
• Message M contains a field next which specifies the next
host to be explored.
• A gray vertex with at least one active black neighbor is
effective.
Contd...
• If M is built by a black vertex, its next field contains the id
of the unexplored gray neighbor which connects to the
maximum number of active black hosts.
• If M is built by a gray vertex, its next field contains the id
of any unexplored black neighbor.
• Any black host u receiving an M message for the first time
from a gray host v sets its dominator to v by broadcasting
message PARENT.
Contd...
• When a host u receives message M from v that specifies u
to be explored next, if none of u’s neighbors is white, u
then colors itself black, sets its dominator to v and
broadcasts its own M message; otherwise, u defers its
operation until none of its neighbors is white.
• Any gray vertex receiving message PARENT from a black
neighbor will broadcast message
NUMOFBLACKNEIGHBORS, which contains the number of
active black neighbors.

• A black vertex becomes inactive after its dominator is set.


Contd...
•A gray vertex becomes ineffective if none of its black
neighbors is active.

• A gray vertex without an active black neighbor, or a black
vertex without an effective gray neighbor, will send message
DONE to the host which activates its exploration or to its
dominator.
• When s gets message DONE and it has no effective gray
neighbors, the algorithm terminates.
Complexity
• Note that phase 1 sets the dominators for all gray
vertices. Phase 2 may modify the dominator of some gray
vertex.
• The main job for phase 2 is to set a dominator for each
black vertex. All black vertices form a CDS.
• In Phase 1, each host broadcasts each of the messages
DOMINATOR and DOMINATEE at most once.
• The message complexity is dominated by message
DEGREE, since it may be broadcasted ∆ times by a host,
where ∆ is the maximum degree.
• Thus the message complexity of Phase 1 is O(n∆). The
time complexity of Phase 1 is O(n).
time complexity of Phase 1 is O(n).
Contd...
• In phase 2, vertices are explored one by one.

• The total number of vertices explored is the size of the
output CDS. Thus the time complexity is at most O(n).
• The message complexity is dominated by message
NUMOFBLACKNEIGHBORS, which is broadcasted at most
5 times by each gray vertex because a gray vertex has at
most 5 black neighbors in a unit-disk graph.
• Thus the message complexity is also O(n).

• Thus the message complexity is also O(n).


Theorem
• Theorem 3.1: The distributed algorithm has time
complexity O(n) and message complexity O(n·∆).
• Note that in phase 1 if we use (id) instead of (d*, id)
as the parameter to select a white vertex to color it
black, the message complexity will be O(n) because
no DEGREE messages will be broadcasted.
• O(n·∆) is the best result we can achieve if effective
degree is taken into consideration.
Performance Analysis
• Lemma 3.2 Phase 1 computes an MIS which contains all
black nodes.
• Proof. A node is colored black only from white. No two
white neighbors can be colored black at the same time

since they must have different (d*, id).
• When a node is colored black, all of its neighbors are
colored gray.
• Once a node is colored gray, it remains in color gray during
Phase 1.
• From the proof of Lemma 3.2, it is clear that if (id) instead
of (d*, id) is used, we still get an MIS. Intuitively, this result
will have a larger size.
Contd...
• Lemma 3.3 In phase 2, at least one gray vertex which
connects to maximum number of black vertices will
be selected.
• Proof. Let u be a gray vertex with maximum number
of black neighbors.
• At some step in phase 2, one of u’s black neighbors, v,
will be explored.
• In the following step, u will be explored. This
exploration is triggered by v.
Contd...
• Lemma 3.4 If there are c black hosts after phase 1,
then at most c-1 gray hosts will be colored black in
phase 2

• Proof. In phase 2, the first gray vertex selected will
connect to at least 2 black vertices.
• In the following steps, any newly selected gray vertex
will connect to at least one new black vertex.
Contd...
• Lemma 3.5 If there exists a gray vertex which connects
to at least 3 black vertices, then the number of gray
vertices which are colored black in phase 2 will be at
most c-2, where c is the number of black vertices after

phase 1.
• Proof. From Lemma 3.3, at least one gray vertex with
maximum black neighbors will be colored black in phase
2. Denote this vertex by u. If u is colored black, then all
of its black neighbors will choose u as their dominator.
Thus, the selection of u causes more than 1 black hosts
to be connected.
Contd...
• Theorem 3.6 This algorithm has performance ratio at most 8.

• Proof. From Lemma 3.2, phase 1 computes a MIS. We will


consider two cases here.

• If there exists a gray vertex which has at least 3 black
neighbors after phase 1, from Lemma 2.1, the size of the MIS
is at most 4·opt + 1.
• From Lemma 3.5, we know the total number of black vertices
after phase 2 is at most (4·opt + 1) + ((4·opt + 1) − 2) = 8·opt.
Contd...

• If the maximum number of black neighbors a gray vertex
has is at most 2, then the size of the MIS computed in
phase 1 is at most 2·opt, since any vertex in opt connects
to at most 2 vertices in the MIS.
• Thus from Lemma 3.4, the total number of black hosts will
be 2·opt + 2·opt − 1 < 4·opt.
• Note that from the proof of Theorem 3.6, if (id) instead
of (d*, id) is used in phase 1, this algorithm still has
performance ratio at most 8.
More References

• Rajiv Misra et al., “Minimum Connected Dominating Set
Using a Collaborative Cover Heuristic for Ad Hoc Sensor
Networks.” IEEE Trans. Parallel Distrib. Syst. 21(3): 292-302 (2010)
Conclusion
• In this lecture, we have discussed a distributed
algorithm which computes a connected dominating set
of small size.
• We have discussed how to find a maximal independent
set, and then how to use a Steiner tree to connect all
vertices in the set.
• This algorithm gives performance ratio at most 8.
• The future scope of this algorithm is to study the
problem of maintaining the connected dominating set
in a mobility environment.
