W7 Lecture Notes
Rajiv Misra
Dept. of Computer Science &
Engineering
Indian Institute of Technology Patna
[email protected]
Introduction
• MapReduce is a programming model and an associated implementation for processing and generating large data sets.
• Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
• Many real world tasks are expressible in this model.
Contd…
• The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication.
• This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Contd…
• Hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
Single-node architecture
[Figure: a single machine (CPU, memory, disk) suffices for machine learning, statistics, and "classical" data mining on modest data sizes.]
Commodity Clusters
• Web data sets can be very large
• Tens to hundreds of terabytes
• Cannot mine on a single server (why?)
[Figure: cluster architecture. Commodity nodes (CPU + memory) are grouped in racks; each rack has a switch, and racks are connected by a higher-level switch.]
• Answer: Distributed File System
• Provides global file namespace
• Google GFS; Hadoop HDFS; Kosmix KFS
• Chunk servers: each chunk replicated (usually 2x or 3x); try to keep replicas in different racks
• Master node (a.k.a. Name Node in HDFS): stores metadata; might be replicated
• Client library for file access: talks to master to find chunk servers, then connects directly to chunk servers to access data
Motivation for Map Reduce (Why)
• Large-Scale Data Processing
• But don’t want the hassle of managing things
• MapReduce architecture provides:
• Automatic parallelization and distribution
• I/O scheduling
• Fault tolerance, status, and monitoring
• The user of the MapReduce library expresses the computation as two functions:
• Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs.
• Reduce, also written by the user, accepts an intermediate key and a set of values for that key. It merges together these values to form a possibly smaller set of values.
• Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator.
map(key, value):
// key: document name; value: text of document
for each word w in value:
emit(w, 1)
reduce(key, values):
// key: a word; values: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)
Map-Reduce Functions
map(k, v) → list(k1, v1)
reduce(k1, list(v1)) → v2
• (k1, v1) is an intermediate key/value pair
• Output is the set of (k1,v2) pairs
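• As a minimal illustration of these signatures (a single-process Python sketch, not the distributed implementation), the following applies a map function, groups values by intermediate key, and applies a reduce function; the word-count functions and sample input are only for demonstration:

from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: map_fn(k, v) -> list of (k1, v1) intermediate pairs.
    intermediate = defaultdict(list)
    for k, v in inputs:
        for k1, v1 in map_fn(k, v):
            intermediate[k1].append(v1)      # group values by intermediate key
    # Reduce phase: reduce_fn(k1, list(v1)) -> v2, one call per distinct key.
    return {k1: reduce_fn(k1, vs) for k1, vs in intermediate.items()}

def word_map(doc_name, text):
    return [(w, 1) for w in text.split()]

def word_reduce(word, counts):
    return sum(counts)

docs = [("d1", "see bob run"), ("d2", "see spot throw")]
print(run_mapreduce(docs, word_map, word_reduce))
# {'see': 2, 'bob': 1, 'run': 1, 'spot': 1, 'throw': 1}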
Applications
• Here are a few simple applications of interesting programs that can
be easily expressed as MapReduce computations.
• Distributed Grep: The map function emits a line if it matches a
supplied pattern. The reduce function is an identity function that
just copies the supplied intermediate data to the output.
• Count of URL Access Frequency: The map function processes logs of web page requests and outputs (URL; 1). The reduce function adds together all values for the same URL and emits a (URL; total count) pair.
• Reverse Web-Link Graph: The map function outputs (target;
source) pairs for each link to a target URL found in a page named
source. The reduce function concatenates the list of all source
URLs associated with a given target URL and emits the pair: (target;
list(source))
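• As a concrete sketch, the Reverse Web-Link Graph can be written as two plain Python functions plus an explicit grouping step (the page data below is made up for illustration):

from collections import defaultdict

def map_links(source_url, links_on_page):
    # Emit (target, source) for every outgoing link found on the page.
    return [(target, source_url) for target in links_on_page]

def reduce_links(target, sources):
    # Concatenate all sources pointing at this target.
    return (target, sorted(sources))

pages = {"a.com": ["b.com", "c.com"], "b.com": ["c.com"]}

grouped = defaultdict(list)
for src, links in pages.items():
    for target, source in map_links(src, links):
        grouped[target].append(source)

print([reduce_links(t, srcs) for t, srcs in grouped.items()])
# [('b.com', ['a.com']), ('c.com', ['a.com', 'b.com'])]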
Contd…
• Term-Vector
per Host: A term vector summarizes the most
important words that occur in a document or a set of
documents as a list of (word; frequency) pairs.
• The map function emits a (hostname; term vector) pair for each input document (where the hostname is extracted from the URL of the document).
• The reduce function is passed all per-document term vectors for a given host. It adds these term vectors
together, throwing away infrequent terms, and then emits
a final (hostname; term vector) pair
Contd…
• InvertedIndex: The map function parses each document,
and emits a sequence of (word; document ID) pairs. The
reduce function accepts all pairs for a given word, sorts the
corresponding document IDs and emits a (word;
list(document ID)) pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
• Distributed Sort: The map function extracts the key from
each record, and emits a (key; record) pair. The reduce
function emits all pairs unchanged.
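• A small in-memory Python sketch of the Inverted Index functions (the document contents are illustrative):

from collections import defaultdict

def map_index(doc_id, text):
    # Emit (word, doc_id) for every word occurrence in the document.
    return [(word, doc_id) for word in text.split()]

def reduce_index(word, doc_ids):
    # Sort and de-duplicate the document IDs for this word.
    return (word, sorted(set(doc_ids)))

docs = {"doc1": "to be or not to be", "doc2": "to do is to be"}

grouped = defaultdict(list)
for doc_id, text in docs.items():
    for word, d in map_index(doc_id, text):
        grouped[word].append(d)

index = [reduce_index(w, ids) for w, ids in sorted(grouped.items())]
print(index)   # [('be', ['doc1', 'doc2']), ('do', ['doc2']), ...]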
Implementation Overview
Implementation Overview
• Many different implementations of the MapReduce
interface are possible. The right choice depends on the
environment.
• For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines.
• Here we describe an implementation targeted to the
computing environment in wide use at Google: large
clusters of commodity PCs connected together with
switched Ethernet.
Contd…
(1) Machines are typically dual-processor x86 machines running Linux, with 2-4 GB of memory per machine.
(2) Commodity networking hardware is used. Typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.
(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.
(4) Storage is provided by inexpensive IDE disks attached directly to individual machines.
(5) Users submit jobs to a scheduling system. Each job consists of
a set of tasks, and is mapped by the scheduler to a set of
available machines within a cluster.
Distributed Execution Overview
• The Map invocations are distributed across multiple machines
by automatically partitioning the input data into a set of M
splits.
• The input splits can be processed in parallel by different
machines.
• Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R).
• The number of partitions (R) and the partitioning function are
specified by the user.
• Figure 1 shows the overall flow of a MapReduce operation.
Distributed Execution Overview
[Figure 1: Overall flow of a MapReduce operation. The user program starts a master and workers; (2) the master assigns map and reduce tasks to workers; (3) map workers read their input splits; (4) map output is written to local disk; (5) reduce workers perform remote reads and sort intermediate data; (6) reduce workers write the output files. Stages: Input Files, Map phase, Intermediate Files on Disk, Reduce phase, Output Files.]
Sequence of Actions
When the user program calls the MapReduce function, the following
sequence of actions occurs:
1. The MapReduce library in the user program first splits the input
files into M pieces of typically 16 megabytes to 64 megabytes (MB)
per piece. It then starts up many copies of the program on a cluster
of machines.
2. One of the copies of the program is special- the master. The rest are
workers that are assigned work by the master. There are M map
tasks and R reduce tasks to assign. The master picks idle workers and
assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the
corresponding input split. It parses key/value pairs out of the input
data and passes each pair to the user-defined Map function. The
intermediate key/value pairs produced by the Map function are
buffered in memory.
Contd…
4. Periodically, the buffered pairs are written to local disk,
partitioned into R regions by the partitioning function.
• The locations of these buffered pairs on the local disk are passed
back to the master, who is responsible for forwarding these
locations to the reduce workers.
5. When a reduce worker is notified by the master about these
locations, it uses remote procedure calls to read the buffered data
from the local disks of the map workers. When a reduce worker has
read all intermediate data, it sorts it by the intermediate keys so
that all occurrences of the same key are grouped together.
• The sorting is needed because typically many different keys map
to the same reduce task. If the amount of intermediate data is too
large to fit in memory, an external sort is used.
Contd…
6. The reduce worker iterates over the sorted intermediate data
and for each unique intermediate key encountered, it passes the
key and the corresponding set of intermediate values to the
user's Reduce function.
• The output of the Reduce function is appended to a final output
file for this reduce partition.
7. When all map tasks and reduce tasks have been completed,
the master wakes up the user program.
• At this point, the MapReduce call in the user program returns
back to the user code.
Contd…
• After successful completion, the output of the mapreduce
execution is available in the R output files (one per reduce task,
with file names as specified by the user).
• Typically, users do not need to combine these R output files into
one file- they often pass these files as input to another
MapReduce call, or use them from another distributed
application that is able to deal with input that is partitioned into
multiple files.
Master Data Structures
• The master keeps several data structures. For each map task
and reduce task, it stores the state (idle, in-progress, or
completed), and the identity of the worker machine (for non-
idle tasks).
• The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task.
• Updates to this location and size information are received as
map tasks are completed. The information is pushed
incrementally to workers that have in-progress reduce tasks.
Fault Tolerance
• Since the MapReduce library is designed to help process very large
amounts of data using hundreds or thousands of machines, the
library must tolerate machine failures gracefully.
• Map worker failure
• Map tasks completed or in-progress at worker are reset to idle
• Reduce workers are notified when task is rescheduled on another worker
• Reduce worker failure
• Only in-progress tasks are reset to idle
• Master failure
• MapReduce task is aborted and client is notified
Locality
• Network bandwidth is a relatively scarce resource in the
computing environment.
• We can conserve network bandwidth by taking advantage of the
fact that the input data (managed by GFS) is stored on the local
disks of the machines that make up our cluster.
• GFS divides each file into 64 MB blocks, and stores several copies
of each block (typically 3 copies) on different machines.
Contd…
• The MapReduce master takes the location information of the
input files into account and attempts to schedule a map task on a
machine that contains a replica of the corresponding input data.
• Failing that, it attempts to schedule a map task near a replica of
that task's input data (e.g., on a worker machine that is on the
same network switch as the machine containing the data).
• When running large MapReduce operations on a significant
fraction of the workers in a cluster, most input data is read locally
and consumes no network bandwidth.
Task Granularity
• The Map phase is subdivided into M pieces and the reduce
phase into R pieces.
• Ideally, M and R should be much larger than the number of
worker machines.
• Having each worker perform many different tasks improves
dynamic load balancing, and also speeds up recovery when a
worker fails: the many map tasks it has completed can be spread
out across all the other worker machines.
• There are practical bounds on how large M and R can be, since
the master must make O(M + R) scheduling decisions and keeps
O(M * R) state in memory.
• Furthermore, R is often constrained by users because the output
of each reduce task ends up in a separate output file.
Partition Function
• For reduce, we need to ensure that records with the
same intermediate key end up at the same worker
• System uses a default partition function e.g., hash(key)
mod R
• Sometimes useful to override
• E.g., hash(hostname(URL)) mod R ensures URLs from
a host end up in the same output file
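• A Python sketch of the default and overridden partition functions (CRC32 stands in for whatever deterministic hash a real implementation would use; R and the URLs are illustrative):

from urllib.parse import urlparse
import zlib

R = 4  # number of reduce tasks / output files

def default_partition(key: str) -> int:
    # Default: spread keys roughly evenly across the R reduce tasks.
    return zlib.crc32(key.encode()) % R

def host_partition(url: str) -> int:
    # Override: all URLs from the same host go to the same reduce task,
    # so they end up in the same output file.
    return zlib.crc32(urlparse(url).netloc.encode()) % R

print(default_partition("http://a.com/x"), default_partition("http://a.com/y"))
print(host_partition("http://a.com/x"), host_partition("http://a.com/y"))  # equal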
Ordering Guarantees
• We guarantee that within a given partition, the
intermediate key/value pairs are processed in increasing key
order.
• This ordering guarantee makes it easy to generate a sorted
output file per partition, which is useful when the output
file format needs to support efficient random access lookups by key, or when users of the output find it convenient to have the data sorted.
Combiners Function (1)
• In some cases, there is significant repetition in the
intermediate keys produced by each map task, and the user
specified Reduce function is commutative and associative.
• A good example of this is the word counting example. Since
word frequencies tend to follow a Zipf distribution, each map
task will produce hundreds or thousands of records of the
form <the, 1>.
• All of these counts will be sent over the network to a single
reduce task and then added together by the Reduce function
to produce one number. We allow the user to specify an
optional Combiner function that does partial merging of this
data before it is sent over the network.
Combiners Function (2)
• The Combiner function is executed on each machine that
performs a map task.
• Typically the same code is used to implement both the
combiner and the reduce functions.
• The only difference between a reduce function and a
combiner function is how the MapReduce library handles the
output of the function.
• The output of a reduce function is written to the final output
file. The output of a combiner function is written to an intermediate file that will be sent to a reduce task.
• Partial combining significantly speeds up certain classes of
MapReduce operations.
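• A single-process Python sketch of the effect of a combiner: map output is partially aggregated before it would be sent over the network (the word list is illustrative):

from collections import Counter

def map_task(words):
    # Without a combiner: one (word, 1) record per occurrence.
    return [(w, 1) for w in words]

def combine(pairs):
    # Combiner: partial, map-side aggregation using the same logic as reduce.
    partial = Counter()
    for w, c in pairs:
        partial[w] += c
    return list(partial.items())

words = ["the", "cat", "the", "the", "dog"]
raw = map_task(words)
combined = combine(raw)
print(len(raw), "records without combiner ->", len(combined), "records with combiner")
# 5 records without combiner -> 3 records with combiner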
Examples
Example: 1 Word Count using MapReduce
map(key, value):
// key: document name; value: text of document
for each word w in value:
emit(w, 1)
reduce(key, values):
// key: a word; values: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)
Count Illustrated
map(key=url, val=contents):
For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
Sum all “1”s in values list
  Emit result "(word, sum)"

Input:            Map output:      Reduce output:
see bob run       see 1            bob 1
see spot throw    bob 1            run 1
                  run 1            see 2
                  see 1            spot 1
                  spot 1           throw 1
                  throw 1
Example 2: Counting words of different lengths
• Suppose the map function takes a word and outputs the length of the word as the key and the word itself as the value; then
• map(steve) would return 5:steve and map(savannah) would return 8:savannah
• The map outputs:          They get grouped as:
  3 : the                   3 : [the, and, you]
  3 : and                   4 : [then, what, when]
  3 : you                   5 : [steve, where]
  4 : then                  8 : [savannah, research]
  4 : what
  4 : when
  5 : steve
  5 : where
  8 : savannah
  8 : research
Example 2: Counting words of different lengths
• Each of these lines would then be passed as an argument to the
reduce function, which accepts a key and a list of values.
• In this instance, we might be trying to figure out how many
words of certain lengths exist, so our reduce function will just
count the number of items in the list and output the key with
the size of the list, like:
3:3
4:3
5:2
8:2
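• The same computation as a short Python sketch (the word list is taken from the grouping above):

from collections import defaultdict

words = ["the", "and", "you", "then", "what", "when",
         "steve", "where", "savannah", "research"]

# Map: emit (length, word); group by key.
grouped = defaultdict(list)
for w in words:
    grouped[len(w)].append(w)

# Reduce: count the items in each list.
counts = {length: len(ws) for length, ws in sorted(grouped.items())}
print(counts)   # {3: 3, 4: 3, 5: 2, 8: 2}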
Example 2: Counting words of different lengths
• The most common example of MapReduce is counting the number of times words occur in a corpus.
Example 3: Finding Friends
• Facebook has a list of friends (note that friends are a bi-directional
thing on Facebook. If I'm your friend, you're mine).
• They also have lots of disk space and they serve hundreds of millions
of requests everyday. They've decided to pre-compute calculations
when they can to reduce the processing time of requests. One
common processing request is the "You and Joe have 230 friends in
common" feature.
• When you visit someone's profile, you see a list of friends that you
have in common. This list doesn't change frequently so it'd be
wasteful to recalculate it every time you visited the profile (sure you
could use a decent caching strategy, but then we wouldn't be able to
continue writing about mapreduce for this problem).
• We're going to use mapreduce so that we can calculate everyone's
common friends once a day and store those results. Later on it's just a
quick lookup. We've got lots of disk, it's cheap.
Example 3: Finding Friends
• Assume the friends are stored as Person->[List of Friends], our
friends list is then:
• A -> B C D
• B -> A C D E
• C -> A B D E
• D -> A B C E
• E -> B C D
Example 3: Finding Friends
For map(A -> B C D) :
(A B) -> B C D
(A C) -> B C D
(A D) -> B C D
For map(B -> A C D E) : (Note that A comes before B in the key)
(A B) -> A C D E
(B C) -> A C D E
(B D) -> A C D E
(B E) -> A C D E
Example 3: Finding Friends
For map(C -> A B D E) :
(A C) -> A B D E
(B C) -> A B D E
(C D) -> A B D E
(C E) -> A B D E

For map(D -> A B C E) :
(A D) -> A B C E
(B D) -> A B C E
(C D) -> A B C E
(D E) -> A B C E

And finally for map(E -> B C D):
(B E) -> B C D
(C E) -> B C D
(D E) -> B C D
Example 3: Finding Friends
• Before we send these key-value pairs to the reducers, we group
them by their keys and get:
(A B) -> (A C D E) (B C D)
(A C) -> (A B D E) (B C D)
(A D) -> (A B C E) (B C D)
(B C) -> (A B D E) (A C D E)
(B D) -> (A B C E) (A C D E)
(B E) -> (A C D E) (B C D)
(C D) -> (A B C E) (A B D E)
(C E) -> (A B D E) (B C D)
(D E) -> (A B C E) (B C D)
Example 3: Finding Friends
• Each line will be passed as an argument to a reducer.
• The reduce function will simply intersect the lists of values and
output the same key with the result of the intersection.
• For example, reduce((A B) -> (A C D E) (B C D))
will output (A B) : (C D)
• and means that friends A and B have C and D as common
friends.
Example 3: Finding Friends
• The result after reduction is:
• (A B) -> (C D)
• (A C) -> (B D)
• (A D) -> (B C)
• (B C) -> (A D E)
• (B D) -> (A C E)
• (B E) -> (C D)
• (C D) -> (A B E)
• (C E) -> (B D)
• (D E) -> (B C)
Now when D visits B's profile, we can quickly look up (B D) and see that they have three friends in common, (A C E).
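• A compact Python sketch of the whole job, using the same five-person friend list; sorted pair tuples stand in for the MapReduce intermediate keys:

from collections import defaultdict

friends = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D", "E"},
    "C": {"A", "B", "D", "E"},
    "D": {"A", "B", "C", "E"},
    "E": {"B", "C", "D"},
}

# Map: for person p and each friend, emit ((p, friend), p's friend list).
grouped = defaultdict(list)
for person, flist in friends.items():
    for friend in flist:
        key = tuple(sorted((person, friend)))
        grouped[key].append(flist)

# Reduce: intersect the two lists emitted for each pair.
common = {pair: sorted(set.intersection(*lists)) for pair, lists in grouped.items()}
print(common[("B", "D")])   # ['A', 'C', 'E']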
Reading
Jeffrey Dean and Sanjay Ghemawat,
“MapReduce: Simplified Data Processing on Large
Clusters”
http://labs.google.com/papers/mapreduce.html
Conclusion
• The MapReduce programming model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault-tolerance, locality optimization, and load balancing.
• A large variety of problems are easily expressible as MapReduce computations.
• For example, MapReduce is used for the generation of data for
Google's production web search service, for sorting, for data
mining, for machine learning, and many other systems.
HDFS and Spark
Rajiv Misra
Dept. of Computer Science &
Engineering
Indian Institute of Technology Patna
[email protected]
The Hadoop Distributed
File System (HDFS)
Introduction
• Hadoop provides a distributed file system and a framework for
the analysis and transformation of very large data sets using the
MapReduce paradigm.
• An important characteristic of Hadoop is the partitioning of
data and computation across many (thousands) of hosts, and
executing application computations in parallel close to their
data.
• A Hadoop cluster scales computation capacity, storage capacity
and IO bandwidth by simply adding commodity servers. Hadoop
clusters at Yahoo! span 25,000 servers, and store 25 petabytes
of application data, with the largest cluster being 3500 servers.
One hundred other organizations worldwide report using
Hadoop.
Contd…
• Hadoop is an Apache project; all components are available
via the Apache open source license.
• Yahoo! has developed and contributed to 80% of the core of
Hadoop (HDFS and MapReduce).
• HBase was originally developed at Powerset, now a
department at Microsoft.
• Hive was originated and developed at Facebook.
• Pig, ZooKeeper, and Chukwa were originated and developed
at Yahoo!
• Avro was originated at Yahoo! and is being co-developed with
Cloudera.
Hadoop Project Components
HDFS: Distributed file system
MapReduce: Distributed computation framework
HBase: Column-oriented table service
Pig: Dataflow language and parallel execution framework
Hive: Data warehouse infrastructure
ZooKeeper: Distributed coordination service
Chukwa: System for collecting management data
Avro: Data serialization system
• HDFS stores file system metadata and application data separately.
• As in other distributed file systems, like PVFS, Lustre and GFS, HDFS stores metadata on a dedicated server, called the NameNode.
• Application data are stored on other servers called
DataNodes. All servers are fully connected and communicate
with each other using TCP-based protocols.
HDFS Design Assumptions
• More machines = increased failure probability
• Desired:
• Commodity hardware
• Built-in backup and failover
Architecture
Namenode and Datanodes
• Namenode (Master)
• Metadata:
• Where file blocks are stored (namespace image)
• Edit (Operation) log
• Secondary namenode (Shadow master)
• Datanode (Chunkserver)
• Stores and retrieves blocks, when told to by client or namenode
• Reports to namenode with list of blocks they are storing
Noticeable Differences from GFS
• Only single-writers per file.
• No record append operation.
• Open source
• Provides many interfaces and libraries for different file
systems.
• S3, KFS, etc.
• Thrift (C++, Python, …), libhdfs (C), FUSE
A) Namenode
• The HDFS namespace is a hierarchy of files and directories. Files
and directories are represented on the NameNode by inodes,
which record attributes like permissions, modification and access
times, namespace and disk space quotas.
• The file content is split into large blocks (typically 128 megabytes,
but user selectable file-by-file) and each block of the file is
independently replicated at multiple DataNodes (typically three,
but user selectable file-by-file). The NameNode maintains the
namespace tree and the mapping of file blocks to DataNodes (the
physical location of file data).
• An HDFS client wanting to read a file first contacts the NameNode
for the locations of data blocks comprising the file and then reads
block contents from the DataNode closest to the client.
Contd…
• The client then writes data to the DataNodes in a pipeline
fashion.
• The current design has a single NameNode for each cluster.
• The cluster can have thousands of DataNodes and tens of
thousands of HDFS clients per cluster, as each DataNode may
execute multiple application tasks concurrently.
Contd…
• HDFS keeps the entire namespace in RAM. The inode data and
the list of blocks belonging to each file comprise the metadata
of the name system called the image. The persistent record of
the image stored in the local host’s native files system is called a
checkpoint.
• The NameNode also stores the modification log of the image
called the journal in the local host’s native file system. For
improved durability, redundant copies of the checkpoint and
journal can be made at other servers.
• During restarts the NameNode restores the namespace by
reading the namespace and replaying the journal. The locations
of block replicas may change over time and are not part of the
persistent checkpoint.
B) Datanode
• Each block replica on a DataNode is represented by two files in
the local host’s native file system. The first file contains the
data itself and the second file is block’s metadata including
checksums for the block data and the block’s generation stamp.
• The size of the data file equals the actual length of the block
and does not require extra space to round it up to the nominal
block size as in traditional file systems. Thus, if a block is half full
it needs only half of the space of the full block on the local drive.
• During startup each DataNode connects to the NameNode and
performs a handshake. The purpose of the handshake is to
verify the namespace ID and the software version of the
DataNode. If either does not match that of the NameNode the
DataNode automatically shuts down.
Contd…
• The namespace ID is assigned to the file system instance when
it is formatted. The namespace ID is persistently stored on all
nodes of the cluster. Nodes with a different namespace ID will
not be able to join the cluster, thus preserving the integrity of
the file system.
• The consistency of software versions is important because
incompatible version may cause data corruption or loss, and on
large clusters of thousands of machines it is easy to overlook
nodes that did not shut down properly prior to the software
upgrade or were not available during the upgrade.
• A DataNode that is newly initialized and without any namespace
ID is permitted to join the cluster and receive the cluster’s
namespace ID.
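• A conceptual sketch of the handshake check described above (the class and field names are hypothetical, not Hadoop's actual code):

from dataclasses import dataclass

@dataclass
class NodeInfo:
    namespace_id: str
    software_version: str

def handshake(datanode: NodeInfo, namenode: NodeInfo) -> bool:
    # A newly initialized DataNode (no namespace ID yet) may join and adopt the cluster's ID.
    if datanode.namespace_id == "":
        datanode.namespace_id = namenode.namespace_id
    # Otherwise both the namespace ID and the software version must match,
    # or the DataNode shuts itself down.
    if datanode.namespace_id != namenode.namespace_id:
        return False
    if datanode.software_version != namenode.software_version:
        return False
    return True

nn = NodeInfo(namespace_id="ns-42", software_version="0.21")
print(handshake(NodeInfo("", "0.21"), nn))       # True: new node adopts ns-42
print(handshake(NodeInfo("ns-7", "0.21"), nn))   # False: wrong namespace ID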
Contd…
• After the handshake the DataNode registers with the
NameNode. DataNodes persistently store their unique storage
IDs. The storage ID is an internal identifier of the DataNode,
which makes it recognizable even if it is restarted with a different
IP address or port. The storage ID is assigned to the DataNode
when it registers with the NameNode for the first time and never
changes after that.
• A DataNode identifies block replicas in its possession to the
NameNode by sending a block report. A block report contains the
block id, the generation stamp and the length for each block
replica the server hosts. The first block report is sent immediately
after the DataNode registration. Subsequent block reports are
sent every hour and provide the NameNode with an up-to-date
view of where block replicas are located on the cluster.
Contd…
• During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available. The default heartbeat interval is three seconds.
• If the NameNode does not receive a heartbeat from a DataNode
in ten minutes the NameNode considers the DataNode to be out
of service and the block replicas hosted by that DataNode to be
unavailable.
• The NameNode then schedules creation of new replicas of those
blocks on other DataNodes.
Contd…
• Heartbeats from a DataNode also carry information about total
storage capacity, fraction of storage in use, and the number of data
transfers currently in progress. These statistics are used for the
NameNode’s space allocation and load balancing decisions.
• The NameNode does not directly call DataNodes. It uses replies to
heartbeats to send instructions to the DataNodes. The instructions
include commands to:
• Replicate blocks to other nodes;
• Remove local block replicas;
• Re-register or to shut down the node;
• Send an immediate block report.
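• A toy sketch of the heartbeat bookkeeping described above, with the ten-minute timeout and instructions piggybacked on heartbeat replies (all names and structures are hypothetical, not Hadoop's):

import time

HEARTBEAT_TIMEOUT_S = 10 * 60   # DataNode considered out of service after 10 minutes

class NameNodeMonitor:
    def __init__(self):
        self.last_heartbeat = {}       # datanode id -> timestamp of last heartbeat
        self.pending_commands = {}     # datanode id -> instructions to piggyback

    def on_heartbeat(self, node_id, stats):
        # Record liveness; any pending instructions are returned as the reply.
        self.last_heartbeat[node_id] = time.time()
        return self.pending_commands.pop(node_id, [])

    def dead_nodes(self):
        now = time.time()
        return [n for n, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT_S]

    def schedule_rereplication(self, node_id):
        # Blocks hosted by a dead node must be re-replicated on other DataNodes
        # (real block tracking omitted in this sketch).
        print(f"re-replicating blocks previously hosted on {node_id}")

monitor = NameNodeMonitor()
monitor.pending_commands["dn-3"] = ["send immediate block report"]
print(monitor.on_heartbeat("dn-3", stats={"capacity": "1TB", "used": 0.42}))
print(monitor.dead_nodes())   # [] right after a heartbeat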
C) HDFS Client
• User applications access the file system using the HDFS client, a library that exports the HDFS file system interface. HDFS supports operations to read, write and delete files, and operations to create and delete directories.
• The user references files and directories by paths in the namespace.
• The user application generally does not need to know that file
system metadata and storage are on different servers, or that
blocks have multiple replicas.
Contd…
• When an application reads a file, the HDFS client first asks the
NameNode for the list of DataNodes that host replicas of the blocks
of the file. It then contacts a DataNode directly and requests the
transfer of the desired block.
• When a client writes, it first asks the NameNode to choose
DataNodes to host replicas of the first block of the file. The client
organizes a pipeline from node-to-node and sends the data.
• When the first block is filled, the client requests new DataNodes to
be chosen to host replicas of the next block. A new pipeline is
organized, and the client sends the further bytes of the file. Each
choice of DataNodes is likely to be different.
• The interactions among the client, the NameNode and the
DataNodes are illustrated in Figure 1.
Figure 1. An HDFS client creates a new file by giving its path to the
NameNode. For each block of the file, the NameNode returns a list of
DataNodes to host its replicas. The client then pipelines data to the chosen
DataNodes, which eventually confirm the creation of the block replicas to
the NameNode.
Contd…
• Unlike conventional file systems, HDFS provides an API that exposes the locations of a file's blocks. This allows applications like the MapReduce framework to schedule a task to where the data are located, thus improving the read performance.
• It also allows an application to set the replication factor of a file. By default a file’s replication factor is three.
For critical files or files which are accessed very often, having a
higher replication factor improves their tolerance against faults and increases their read bandwidth.
Anatomy of a File Read
[Figure: a client opens a file via the NameNode, then reads block data directly from DataNodes, choosing the replica closest to the client.]
Anatomy of a File Write
[Figure: a client asks the NameNode for DataNodes to host each block, then writes the data through a node-to-node DataNode pipeline.]
Additional Topics
• Replica placements:
• Coherency model:
[Figure: acyclic data flow in MapReduce. Input files are processed by several map tasks, whose outputs feed reduce tasks that write the output; data flows from stable storage to stable storage.]
Contd…
• In this part of the lecture, we will focus on one such class of
applications, that reuse a working set of data across multiple
parallel operations.
• This includes many iterative machine learning algorithms, as well as interactive data analysis tools.
• A new framework called Spark supports such applications
while retaining the scalability and fault tolerance of
MapReduce.
Contd…
• An RDD is a read-only collection of objects partitioned
across a set of machines that can be rebuilt if a partition is
lost.
• Spark can outperform Hadoop by 10x in iterative machine
learning jobs, and can be used to interactively query a 39
GB dataset with sub-second response time.
Difference Between Hadoop MapReduce and Apache Spark

Hadoop MapReduce        Apache Spark
Fast                    100x faster than MapReduce
Batch processing        Real-time processing
Stores data on disk     Stores data in memory
Written in Java         Written in Scala
Introduction
• A new model of cluster computing has become widely popular,
in which data-parallel computations are executed on clusters of
unreliable machines by systems that automatically provide
locality-aware scheduling, fault tolerance, and load balancing.
• MapReduce pioneered this model, while systems like Dryad and
Map-Reduce-Merge generalized the types of data flows
supported.
• These systems achieve their scalability and fault tolerance by
providing a programming model where the user creates acyclic
data flow graphs to pass input data through a set of operators.
• This allows the underlying system to manage scheduling and to
react to faults without user intervention.
Contd…
• While this data flow programming model is useful for a large
class of applications, there are applications that cannot be
expressed efficiently as acyclic data flows.
• Here, we focus on one such class of applications, that reuse a
working set of data across multiple parallel operations. This
includes two use cases where we have seen Hadoop users report
that MapReduce is deficient:
(i) Iterative jobs: Many common machine learning algorithms
apply a function repeatedly to the same dataset to optimize a
parameter (e.g., through gradient descent). While each iteration
can be expressed as a MapReduce/Dryad job, each job must
reload the data from disk, incurring a significant performance
penalty.
Contd…
(ii) Interactive analytics: Hadoop is often used to run ad-hoc
exploratory queries on large datasets, through SQL interfaces
such as Pig and Hive.
• Ideally, a user would be able to load a dataset of interest into
memory across a number of machines and query it repeatedly.
• However, with Hadoop, each query incurs significant latency
(tens of seconds) because it runs as a separate MapReduce job
and reads data from disk.
• A new cluster computing framework called Spark supports
applications with working sets while providing similar scalability
and fault tolerance properties to MapReduce.
Contd…
• The main abstraction in Spark is that of a resilient distributed
dataset (RDD), which represents a read-only collection of objects
partitioned across a set of machines that can be rebuilt if a
partition is lost.
• Users can explicitly cache an RDD in memory across machines
and reuse it in multiple MapReduce-like parallel operations.
RDDs achieve fault tolerance through a notion of lineage: if a
partition of an RDD is lost, the RDD has enough information
about how it was derived from other RDDs to be able to rebuild
just that partition.
• Although RDDs are not a general shared memory abstraction,
they represent a sweet-spot between expressivity on the one
hand and scalability and reliability on the other hand, and found
well-suited for a variety of applications.
Contd…
• Spark is implemented in Scala, a statically typed high-level
programming language for the Java VM, and exposes a
functional programming interface similar to DryadLINQ.
• In addition, Spark can be used interactively from a modified
version of the Scala interpreter, which allows the user to define
RDDs, functions, variables and classes and use them in parallel
operations on a cluster.
• Spark is the first system to allow an efficient, general-purpose
programming language to be used interactively to process large
datasets on a cluster.
• Spark can outperform Hadoop by 10x in iterative machine
learning workloads and can be used interactively to scan a 39 GB
dataset with sub-second latency.
Programming Model
• Spark provides two main abstractions for parallel
programming:
• Resilient distributed datasets and parallel operations on these datasets (invoked by passing a function to apply on a
dataset).
• In addition, Spark supports two restricted types of shared
variables that can be used in functions running on the cluster.
Figure: Spark runtime. The user’s driver program launches
multiple workers, which read data blocks from a distributed file
system and can persist computed RDD partitions in memory.
Resilient Distributed Datasets (RDDs)
• A resilient distributed dataset (RDD) is a read-only collection of
objects partitioned across a set of machines that can be rebuilt if
a partition is lost.
• The elements of an RDD need not exist in physical storage;
instead, a handle to an RDD contains enough information to
compute the RDD starting from data in reliable storage.
• This means that RDDs can always be reconstructed if nodes fail.
Contd…
• In Spark, each RDD is represented by a Scala object. Spark lets
programmers construct RDDs in four ways:
• From a file in a shared file system, such as the Hadoop
Distributed File System (HDFS).
• By “parallelizing” a Scala collection (e.g., an array) in the driver
program, which means dividing it into a number of slices that will
be sent to multiple nodes.
• By transforming an existing RDD. A dataset with elements of
type A can be transformed into a dataset with elements of type B
using an operation called flatMap, which passes each element
through a user-provided function of type A → List[B].
• Other transformations can be expressed using flatMap,
including map (pass elements through a function of type A → B) and filter (pick elements matching a predicate).
Contd…
• By changing the persistence of an existing RDD. By default, RDDs
are lazy and ephemeral.
• That is, partitions of a dataset are materialized on demand when
they are used in a parallel operation (e.g., by passing a block of a
file through a map function), and are discarded from memory
after use. However, a user can alter the persistence of an RDD
through two actions:
• The cache action leaves the dataset lazy, but hints that it should
be kept in memory after the first time it is computed, because it
will be reused.
• The save action evaluates the dataset and writes it to a
distributed filesystem such as HDFS. The saved version is used in
future operations on it.
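• A short PySpark sketch of these construction and persistence options (the file paths are placeholders, and the paper's save action corresponds to saveAsTextFile here):

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-construction-demo")

# 1. From a file in a shared file system (the HDFS path is a placeholder).
lines = sc.textFile("hdfs://namenode:9000/data/input.txt")

# 2. By "parallelizing" a driver-side collection into slices.
nums = sc.parallelize(range(1, 1001), 8)

# 3. By transforming an existing RDD with flatMap, map, filter, ...
words = lines.flatMap(lambda line: line.split())
long_words = words.filter(lambda w: len(w) > 5)

# 4. By changing persistence: cache() keeps the result in memory after the
#    first computation; saveAsTextFile() writes it to a distributed file system.
long_words.cache()
print(long_words.count())      # first action materializes (and caches) the RDD
long_words.saveAsTextFile("hdfs://namenode:9000/data/long_words")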
Contd…
• If there is not enough memory in the cluster to cache all
partitions of a dataset, Spark will recompute them when they are
used.
• Spark programs keep working (at reduced performance) if nodes
fail or if a dataset is too big. This idea is loosely analogous to
virtual memory.
• Spark can also be extended to support other levels of persistence
(e.g., in-memory replication across multiple nodes).
• The goal is to let users trade off between the cost of storing an
RDD, the speed of accessing it, the probability of losing part of
it, and the cost of recomputing it.
Parallel Operations
• Several parallel operations can be performed on RDDs:
• Reduce: Combines dataset elements using an associative function to produce a result at the driver program.
• Collect: Sends all elements of the dataset to the driver program.
For example, an easy way to update an array in parallel is to
parallelize, map and collect the array.
• Foreach: Passes each element through a user provided function.
This is only done for the side effects of the function.
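• The three operations in PySpark (the accumulator is only there to give foreach a visible side effect; the data is made up):

from pyspark import SparkContext

sc = SparkContext("local[*]", "ops-demo")
nums = sc.parallelize([1, 2, 3, 4, 5])

# reduce: combine elements with an associative function at the driver.
total = nums.reduce(lambda a, b: a + b)        # 15

# collect: bring all elements back to the driver program.
doubled = nums.map(lambda x: 2 * x).collect()  # [2, 4, 6, 8, 10]

# foreach: run a function on each element purely for its side effects.
seen = sc.accumulator(0)
nums.foreach(lambda x: seen.add(1))

print(total, doubled, seen.value)              # 15 [2, 4, 6, 8, 10] 5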
Shared Variables
• Programmers invoke operations like map, filter and reduce by passing closures (functions) to Spark.
• As is typical in functional programming, these closures can
refer to variables in the scope where they are created.
• Normally, when Spark runs a closure on a worker node, these variables are copied to the worker.
• However, Spark also lets programmers create two restricted
types of shared variables to support two simple but common
usage patterns such as Broadcast variables and Accumulators.
Broadcast variables
• Broadcast variables: If a large read-only piece of data (e.g., a
lookup table) is used in multiple parallel operations, it is
preferable to distribute it to the workers only once instead of
packaging it with every closure.
• Spark lets the programmer create a “broadcast variable”
object that wraps the value and ensures that it is only copied
to each worker once.
Accumulators
• Accumulators: These are variables that workers can only
“add” to using an associative operation, and that only the
driver can read.
• They can be used to implement counters as in MapReduce and
to provide a more imperative syntax for parallel sums.
• Accumulators can be defined for any type that has an “add”
operation and a “zero” value.
• Due to their “add-only” semantics, they are easy to make
fault-tolerant.
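• A minimal PySpark illustration of both kinds of shared variables (the lookup table and input codes are made up):

from pyspark import SparkContext

sc = SparkContext("local[*]", "shared-vars")

# Broadcast variable: ship the lookup table to each worker only once.
country_codes = sc.broadcast({"IN": "India", "US": "United States"})

# Accumulator: workers can only add to it; only the driver reads it.
unknown = sc.accumulator(0)

def resolve(code):
    table = country_codes.value
    if code not in table:
        unknown.add(1)
        return "unknown"
    return table[code]

codes = sc.parallelize(["IN", "US", "FR", "IN"])
print(codes.map(resolve).collect())   # ['India', 'United States', 'unknown', 'India']
print(unknown.value)                  # 1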
RDD Operations
Transformations: map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
Actions: reduce, collect, count, save, lookupKey, …
Transformations
map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
sample(withReplacement, fraction, seed): Sample a fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.
Contd…
groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs.
reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function.
sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order as specified in the boolean ascending argument.
join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples; also called groupWith.
Actions
reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one); the function should be commutative and associative so that it can be computed correctly in parallel.
collect(): Return all the elements of the dataset as an array at the driver program; usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count(): Return the number of elements in the dataset.
first(): Return the first element of the dataset; similar to take(1).
take(n): Return an array with the first n elements of the dataset; currently not executed in parallel, instead the driver program computes all the elements.
takeSample(withReplacement, num, seed): Return an array with a random sample of num elements of the dataset, with or without replacement, using the given random number generator seed.
Contd…
saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path): Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. Only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc.).
countByKey(): Only available on RDDs of type (K, V). Returns a 'Map' of (K, Int) pairs with the count of each key.
foreach(func): Run a function func on each element of the dataset; usually done for side effects such as updating an accumulator variable or interacting with external storage systems.
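• A small PySpark session exercising several of the transformations and actions above (the data is made up; ordering of collect() results may vary):

from pyspark import SparkContext

sc = SparkContext("local[*]", "table-demo")

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
other = sc.parallelize([("a", "x"), ("b", "y")])

print(pairs.reduceByKey(lambda a, b: a + b).collect())   # sums per key, e.g. [('a', 4), ('b', 2), ('c', 4)]
print(pairs.groupByKey().mapValues(list).collect())      # grouped values per key
print(pairs.join(other).collect())                       # (K, (V, W)) pairs joined on key
print(pairs.sortByKey().first())                         # smallest key, e.g. ('a', 1)
print(dict(pairs.countByKey()))                          # {'a': 2, 'b': 1, 'c': 1}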
Spark Community
• Most active open source community in big data
Built-in Libraries
Standard Library for Big Data
• Spark's generality and support for multiple languages make it suitable to offer a standard library for big data
• Much of future activity will be in these libraries
A General Platform
[Figure: the Spark core engine with standard libraries (Spark SQL, Spark Streaming, MLlib, GraphX) built on top.]
Machine Learning Library (MLlib)
MLlib algorithms:
(i) Classification: logistic regression, linear SVM, naïve Bayes, classification tree
(ii) Regression: generalized linear models (GLMs),
regression tree
(iii) Collaborative filtering: alternating least squares (ALS),
non-negative matrix factorization (NMF)
(iv) Clustering: k-means
(v) Decomposition: SVD, PCA
(vi) Optimization: stochastic gradient descent, L-BFGS
GraphX
• Large library of graph algorithms with composable steps
GraphX Algorithms
(i) Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization
(ii) Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
(iii) Semi-supervised ML: Graph SSL, CoEM
(iv) Community Detection: Triangle-Counting, K-core Decomposition, K-Truss
(v) Graph Analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring
(vi) Classification: Neural Networks
Spark Streaming
• Ensure exactly-once semantics
Spark SQL
From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()
From JSON:
c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")
Examples
Example 1: PageRank
• Links from a high-rank page → high rank
Algorithm
Step-1 Start each page at a rank of 1
Step-2 On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
Step-3 Set each page’s rank to 0.15 + 0.85 × contributions
Spark Program
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
PageRank Performance
Example 2: Logistic Regression
• Goal: find best line separating two sets of points
[Figure: two sets of points in the plane; a random initial line is iteratively adjusted toward the target separating line.]
Logistic Regression Code
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
Logistic Regression Performance
[Figure: Hadoop takes 127 s per iteration; Spark takes 174 s for the first iteration and 6 s for each further iteration once the data is cached in memory.]
Example 3: MapReduce
res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map((key, vals) => myReduceFunc(key, vals))
Or with combiners:
res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map((key, val) => myReduceFunc(key, val))
Word Count
• Word Count provides a good test case for parallel processing, since it:
• Requires a minimal amount of code
• Demonstrates use of both symbolic and numeric values
• Isn’t many steps away from search indexing
• Serves as a “Hello World” for Big Data apps
wc.saveAsTextFile("wc_out")
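• A complete word count in PySpark, ending with the same saveAsTextFile step (the input path is a placeholder):

from operator import add
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")
f = sc.textFile("hdfs://namenode:9000/data/input.txt")    # placeholder path
wc = (f.flatMap(lambda line: line.split())                # split lines into words
       .map(lambda word: (word, 1))                       # emit (word, 1) pairs
       .reduceByKey(add))                                 # sum the counts per word
wc.saveAsTextFile("wc_out")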
iii. K-means clustering
iv. Alternating Least Squares matrix factorization
v. In-memory OLAP aggregation on Hive data
vi. SQL on Spark
Reading Material
• Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin,
Scott Shenker, Ion Stoica
“Spark: Cluster Computing with Working Sets”
• Matei Zaharia, Mosharaf Chowdhury et al.
Conclusion
• Resilient distributed datasets (RDDs) are powerful enough to express several applications that pose challenges for existing cluster computing frameworks, including iterative and interactive computations.
• Furthermore, it is believed that the core idea behind RDDs, of a
dataset handle that has enough information to (re)construct the
N
dataset from data available in reliable storage, may prove useful
in developing other abstractions for programming clusters.
• In this lecture, we have discussed the HDFS components, its
architecture and framework of spark with its applications.
HDFS and Spark
EL
PT Rajiv Misra
N
Dept. of Computer Science &
Engineering
Indian Institute of Technology Patna
[email protected]
EL
The Hadoop Distributed
File System (HDFS)
PT
N
Introduction
• Hadoop provides a distributed file system and a framework for
the analysis and transformation of very large data sets using the
MapReduce paradigm.
EL
• An important characteristic of Hadoop is the partitioning of
data and computation across many (thousands) of hosts, and
executing application computations in parallel close to their
•
data.
PT
A Hadoop cluster scales computation capacity, storage capacity
N
and IO bandwidth by simply adding commodity servers. Hadoop
clusters at Yahoo! span 25,000 servers, and store 25 petabytes
of application data, with the largest cluster being 3500 servers.
One hundred other organizations worldwide report using
Hadoop.
Contd…
• Hadoop is an Apache project; all components are available
via the Apache open source license.
• Yahoo! has developed and contributed to 80% of the core of
EL
Hadoop (HDFS and MapReduce).
• HBase was originally developed at Powerset, now a
• PT
department at Microsoft.
Hive was originated and developed at Facebook.
N
• Pig, ZooKeeper, and Chukwa were originated and developed
at Yahoo!
• Avro was originated at Yahoo! and is being co-developed with
Cloudera.
Hadoop Project Components
EL
HBase Column-oriented table service
Dataflow language and parallel execution
Pig
Hive
ZooKeeper
PT
framework
Data warehouse infrastructure
Distributed coordination service
N
Chukwa System for collecting management data
Avro Data serialization system
EL
• HDFS stores file system metadata and application data
separately.
• PT
As in other distributed file systems, like PVFS, Lustre and,
HDFS stores metadata on a dedicated server, called the
N
NameNode.
• Application data are stored on other servers called
DataNodes. All servers are fully connected and communicate
with each other using TCP-based protocols.
HDFS Design Assumptions
EL
• More machines = increased failure probability
• Desired:
PT
N
• Commodity hardware
• Built-in backup and failover
EL
Architecture
PT
N
Namenode and Datanodes
• Namenode (Master)
• Metadata:
EL
• Where file blocks are stored (namespace image)
• Edit (Operation) log
•
PT
Secondary namenode (Shadow master)
• Datanode (Chunkserver)
N
• Stores and retrieves blocks
• by client or namenode.
• Reports to namenode with list of blocks they are storing
Noticeable Differences from GFS
• Only single-writers per file.
• No record append operation.
EL
• Open source
• Provides many interfaces and libraries for different file
systems.
•
•
S3, KFS, etc. PT
Thrift (C++, Python, …), libhdfs (C), FUSE
N
A) Namenode
• The HDFS namespace is a hierarchy of files and directories. Files
and directories are represented on the NameNode by inodes,
which record attributes like permissions, modification and access
times, namespace and disk space quotas.
EL
• The file content is split into large blocks (typically 128 megabytes,
but user selectable file-by-file) and each block of the file is
PT
independently replicated at multiple DataNodes (typically three,
but user selectable file-by-file). The NameNode maintains the
namespace tree and the mapping of file blocks to DataNodes (the
N
physical location of file data).
• An HDFS client wanting to read a file first contacts the NameNode
for the locations of data blocks comprising the file and then reads
block contents from the DataNode closest to the client.
Contd…
EL
• The client then writes data to the DataNodes in a pipeline
fashion.
•
•
PT
The current design has a single NameNode for each cluster.
The cluster can have thousands of DataNodes and tens of
N
thousands of HDFS clients per cluster, as each DataNode may
execute multiple application tasks concurrently.
Contd…
• HDFS keeps the entire namespace in RAM. The inode data and
the list of blocks belonging to each file comprise the metadata
of the name system called the image. The persistent record of
the image stored in the local host’s native files system is called a
EL
checkpoint.
• The NameNode also stores the modification log of the image
PT
called the journal in the local host’s native file system. For
improved durability, redundant copies of the checkpoint and
journal can be made at other servers.
N
• During restarts the NameNode restores the namespace by
reading the namespace and replaying the journal. The locations
of block replicas may change over time and are not part of the
persistent checkpoint.
B) Datanode
• Each block replica on a DataNode is represented by two files in
the local host’s native file system. The first file contains the
data itself and the second file is block’s metadata including
checksums for the block data and the block’s generation stamp.
EL
• The size of the data file equals the actual length of the block
and does not require extra space to round it up to the nominal
PT
block size as in traditional file systems. Thus, if a block is half full
it needs only half of the space of the full block on the local drive.
N
• During startup each DataNode connects to the NameNode and
performs a handshake. The purpose of the handshake is to
verify the namespace ID and the software version of the
DataNode. If either does not match that of the NameNode the
DataNode automatically shuts down.
Contd…
• The namespace ID is assigned to the file system instance when
it is formatted. The namespace ID is persistently stored on all
nodes of the cluster. Nodes with a different namespace ID will
not be able to join the cluster, thus preserving the integrity of
EL
the file system.
• The consistency of software versions is important because
PT
incompatible version may cause data corruption or loss, and on
large clusters of thousands of machines it is easy to overlook
nodes that did not shut down properly prior to the software
N
upgrade or were not available during the upgrade.
• A DataNode that is newly initialized and without any namespace
ID is permitted to join the cluster and receive the cluster’s
namespace ID.
Contd…
• After the handshake the DataNode registers with the
NameNode. DataNodes persistently store their unique storage
IDs. The storage ID is an internal identifier of the DataNode,
which makes it recognizable even if it is restarted with a different
EL
IP address or port. The storage ID is assigned to the DataNode
when it registers with the NameNode for the first time and never
changes after that.
•
PT
A DataNode identifies block replicas in its possession to the
NameNode by sending a block report. A block report contains the
N
block id, the generation stamp and the length for each block
replica the server hosts. The first block report is sent immediately
after the DataNode registration. Subsequent block reports are
sent every hour and provide the NameNode with an up-to-date
view of where block replicas are located on the cluster.
Contd…
EL
• If the NameNode does not receive a heartbeat from a DataNode
in ten minutes the NameNode considers the DataNode to be out
PT
of service and the block replicas hosted by that DataNode to be
unavailable.
N
• The NameNode then schedules creation of new replicas of those
blocks on other DataNodes.
Contd…
• Heartbeats from a DataNode also carry information about total
storage capacity, fraction of storage in use, and the number of data
transfers currently in progress. These statistics are used for the
NameNode’s space allocation and load balancing decisions.
• The NameNode does not directly call DataNodes. It uses replies to
EL
heartbeats to send instructions to the DataNodes. The instructions
include commands to:
PT
• Replicate blocks to other nodes;
• Remove local block replicas;
• Re-register or to shut down the node;
N
• Send an immediate block report.
EL
operations to read, write and delete files, and operations to
create and delete directories.
•
namespace. PT
The user references files and directories by paths in the
N
• The user application generally does not need to know that file
system metadata and storage are on different servers, or that
blocks have multiple replicas.
Contd…
• When an application reads a file, the HDFS client first asks the
NameNode for the list of DataNodes that host replicas of the blocks
of the file. It then contacts a DataNode directly and requests the
transfer of the desired block.
EL
• When a client writes, it first asks the NameNode to choose
DataNodes to host replicas of the first block of the file. The client
organizes a pipeline from node-to-node and sends the data.
• PT
When the first block is filled, the client requests new DataNodes to
be chosen to host replicas of the next block. A new pipeline is
N
organized, and the client sends the further bytes of the file. Each
choice of DataNodes is likely to be different.
• The interactions among the client, the NameNode and the
DataNodes are illustrated in Figure 1.
EL
PT
N
Figure 1. An HDFS client creates a new file by giving its path to the
NameNode. For each block of the file, the NameNode returns a list of
DataNodes to host its replicas. The client then pipelines data to the chosen
DataNodes, which eventually confirm the creation of the block replicas to
the NameNode.
Contd…
EL
schedule a task to where the data are located, thus improving
the read performance.
•
•
PT
It also allows an application to set the replication factor of a file.
By default a file’s replication factor is three.
For critical files or files which are accessed very often, having a
N
higher replication factor improves their tolerance against faults
and increase their read bandwidth.
Anatomy of a File Read
EL
PT
N
Anatomy of a File Write
EL
PT
N
Additional Topics
• Replica placements:
EL
• Coherency model:
EL
Input
PT
Map
Reduce
Output
N
Map
Reduce
Map
Contd…
EL
• In this part of the lecture, we will focus on one such class of
applications, that reuse a working set of data across multiple
•
PT
parallel operations.
This includes many iterative machine learning algorithms, as
N
well as interactive data analysis tools.
• A new framework called Spark supports such applications
while retaining the scalability and fault tolerance of
MapReduce.
Contd…
EL
• An RDD is a read-only collection of objects partitioned
across a set of machines that can be rebuilt if a partition is
•
lost.
PT
Spark can outperform Hadoop by 10x in iterative machine
N
learning jobs, and can be used to interactively query a 39
GB dataset with sub-second response time.
Difference Between Hadoop MapReduce vs.
Apache Spark
EL
Hadoop MapReduce Apache Spark
Fast 100x faster than MapReduce
Batch Processing Real-time Processing
PT
Stores Data on Disk
Written in Java
Stores Data in Memory
Written in Scala
N
N
PT
EL
Introduction
• A new model of cluster computing has become widely popular,
in which data-parallel computations are executed on clusters of
unreliable machines by systems that automatically provide
locality-aware scheduling, fault tolerance, and load balancing.
• MapReduce pioneered this model, while systems like Dryad and
Map-Reduce-Merge generalized the types of data flows
supported.
• These systems achieve their scalability and fault tolerance by
providing a programming model where the user creates acyclic
data flow graphs to pass input data through a set of operators.
• This allows the underlying system to manage scheduling and to
react to faults without user intervention.
Contd…
• While this data flow programming model is useful for a large
class of applications, there are applications that cannot be
expressed efficiently as acyclic data flows.
• Here, we focus on one such class of applications: those that reuse a
working set of data across multiple parallel operations. This
includes two use cases where we have seen Hadoop users report
that MapReduce is deficient:
(i) Iterative jobs: Many common machine learning algorithms
apply a function repeatedly to the same dataset to optimize a
parameter (e.g., through gradient descent). While each iteration
can be expressed as a MapReduce/Dryad job, each job must
reload the data from disk, incurring a significant performance
penalty.
Contd…
(ii) Interactive analytics: Hadoop is often used to run ad-hoc
exploratory queries on large datasets, through SQL interfaces
such as Pig and Hive.
• Ideally, a user would be able to load a dataset of interest into
memory across a number of machines and query it repeatedly.
• However, with Hadoop, each query incurs significant latency
(tens of seconds) because it runs as a separate MapReduce job
and reads data from disk.
• A new cluster computing framework called Spark supports
applications with working sets while providing similar scalability
and fault tolerance properties to MapReduce.
Contd…
• The main abstraction in Spark is that of a resilient distributed
dataset (RDD), which represents a read-only collection of objects
partitioned across a set of machines that can be rebuilt if a
partition is lost.
• Users can explicitly cache an RDD in memory across machines
and reuse it in multiple MapReduce-like parallel operations.
RDDs achieve fault tolerance through a notion of lineage: if a
partition of an RDD is lost, the RDD has enough information
about how it was derived from other RDDs to be able to rebuild
just that partition.
• Although RDDs are not a general shared memory abstraction,
they represent a sweet spot between expressivity on the one
hand and scalability and reliability on the other, and have been
found well-suited for a variety of applications.
Contd…
• Spark is implemented in Scala, a statically typed high-level
programming language for the Java VM, and exposes a
functional programming interface similar to DryadLINQ.
• In addition, Spark can be used interactively from a modified
version of the Scala interpreter, which allows the user to define
RDDs, functions, variables and classes and use them in parallel
operations on a cluster.
• Spark is the first system to allow an efficient, general-purpose
programming language to be used interactively to process large
datasets on a cluster.
• Spark can outperform Hadoop by 10x in iterative machine
learning workloads and can be used interactively to scan a 39 GB
dataset with sub-second latency.
Programming Model
• Spark provides two main abstractions for parallel
programming:
• Resilient distributed datasets and parallel operations on
these datasets (invoked by passing a function to apply on a
dataset).
• In addition, Spark supports two restricted types of shared
variables that can be used in functions running on the cluster.
Figure: Spark runtime. The user’s driver program launches
multiple workers, which read data blocks from a distributed file
system and can persist computed RDD partitions in memory.
Resilient Distributed Datasets (RDDs)
• A resilient distributed dataset (RDD) is a read-only collection of
objects partitioned across a set of machines that can be rebuilt if
a partition is lost.
• The elements of an RDD need not exist in physical storage;
instead, a handle to an RDD contains enough information to
compute the RDD starting from data in reliable storage.
• This means that RDDs can always be reconstructed if nodes fail.
Contd…
• In Spark, each RDD is represented by a Scala object. Spark lets
programmers construct RDDs in four ways:
• From a file in a shared file system, such as the Hadoop
Distributed File System (HDFS).
• By “parallelizing” a Scala collection (e.g., an array) in the driver
program, which means dividing it into a number of slices that will
be sent to multiple nodes.
• By transforming an existing RDD. A dataset with elements of
type A can be transformed into a dataset with elements of type B
using an operation called flatMap, which passes each element
through a user-provided function of type A => List[B].
• Other transformations can be expressed using flatMap,
including map (pass elements through a function of type A => B)
and filter (pick elements matching a predicate).
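• To make this concrete, here is a minimal REPL-style Scala sketch of the first three construction methods; the SparkContext named sc and the HDFS path are assumptions for illustration, not part of the lecture's example:
val lines = sc.textFile("hdfs://namenode/input.txt")       // (1) from a file in a shared file system
val nums  = sc.parallelize(Array(1, 2, 3, 4), 2)           // (2) parallelizing a driver-side collection into 2 slices
val words = lines.flatMap(line => line.split(" ").toList)  // (3) transforming an RDD via a function of type A => List[B]
val long  = words.filter(w => w.length > 3)                // filter picks elements matching a predicate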
Contd…
• By changing the persistence of an existing RDD. By default, RDDs
are lazy and ephemeral.
• That is, partitions of a dataset are materialized on demand when
they are used in a parallel operation (e.g., by passing a block of a
file through a map function), and are discarded from memory
after use. However, a user can alter the persistence of an RDD
through two actions:
• The cache action leaves the dataset lazy, but hints that it should
be kept in memory after the first time it is computed, because it
will be reused.
• The save action evaluates the dataset and writes it to a
distributed filesystem such as HDFS. The saved version is used in
future operations on it.
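• As a hedged illustration of the two persistence actions, continuing the sketch above (the output path is invented, and the saveAsTextFile action of the released API stands in for the save action described here):
val cached = words.cache()                          // still lazy; kept in memory after it is first computed
cached.count()                                      // the first action materializes and caches the partitions
cached.saveAsTextFile("hdfs://namenode/out/words")  // evaluates the dataset and writes it to HDFS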
Contd…
• If there is not enough memory in the cluster to cache all
partitions of a dataset, Spark will recompute them when they are
used.
• Spark programs keep working (at reduced performance) if nodes
fail or if a dataset is too big. This idea is loosely analogous to
virtual memory.
• Spark can also be extended to support other levels of persistence
(e.g., in-memory replication across multiple nodes).
• The goal is to let users trade off between the cost of storing an
RDD, the speed of accessing it, the probability of losing part of
it, and the cost of recomputing it.
Parallel Operations
• Several parallel operations can be performed on RDDs:
• Reduce: Combines dataset elements using an associative
function to produce a result at the driver program.
• Collect: Sends all elements of the dataset to the driver program.
For example, an easy way to update an array in parallel is to
parallelize, map and collect the array.
• Foreach: Passes each element through a user provided function.
This is only done for the side effects of the function.
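• A small illustrative sketch of these operations (names and data are arbitrary):
val nums    = sc.parallelize(1 to 100)
val total   = nums.reduce(_ + _)                // reduce: combine elements with an associative function at the driver
val doubled = nums.map(_ * 2).collect()         // collect: ship all elements back to the driver as an array
nums.foreach(n => println(n))                   // foreach: run a function on the workers purely for its side effects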
Shared Variables
• Programmers invoke operations like map, filter and reduce by
passing closures (functions) to Spark.
• As is typical in functional programming, these closures can
refer to variables in the scope where they are created.
• Normally, when Spark runs a closure on a worker node, these
variables are copied to the worker.
• However, Spark also lets programmers create two restricted
types of shared variables to support two simple but common
usage patterns: broadcast variables and accumulators.
Broadcast variables
• Broadcast variables: If a large read-only piece of data (e.g., a
lookup table) is used in multiple parallel operations, it is
preferable to distribute it to the workers only once instead of
packaging it with every closure.
• Spark lets the programmer create a “broadcast variable”
object that wraps the value and ensures that it is only copied
to each worker once.
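• For example, a short sketch with an invented lookup table:
val table  = Map("in" -> "India", "us" -> "United States")            // read-only data built at the driver
val bTable = sc.broadcast(table)                                      // copied to each worker only once
val codes  = sc.parallelize(Seq("in", "us", "in"))
val names  = codes.map(c => bTable.value.getOrElse(c, "?")).collect() // workers read bTable.value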
Accumulators
• Accumulators: These are variables that workers can only
“add” to using an associative operation, and that only the
driver can read.
• They can be used to implement counters as in MapReduce and
to provide a more imperative syntax for parallel sums.
• Accumulators can be defined for any type that has an “add”
operation and a “zero” value.
• Due to their “add-only” semantics, they are easy to make
fault-tolerant.
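• A sketch of a simple counter (the variable name and input path are illustrative):
val blanks = sc.accumulator(0)                     // “zero” value 0; workers add with +=
sc.textFile("hdfs://namenode/input.txt").foreach { line =>
  if (line.isEmpty) blanks += 1                    // workers can only add to it
}
println("blank lines = " + blanks.value)           // only the driver reads the final value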
RDD Operations
Transformations (define new RDDs): map, filter, sample, union,
groupByKey, reduceByKey, join, cache, …
Actions (return results to the driver): reduce, collect, count,
save, lookupKey, …
Transformations
Transformation: Description
map(func): Return a new distributed dataset formed by passing each
element of the source through a function func.
filter(func): Return a new dataset formed by selecting those elements
of the source on which func returns true.
flatMap(func): Similar to map, but each input item can be mapped to 0
or more output items (so func should return a Seq rather than a single
item).
sample(withReplacement, fraction, seed): Sample a fraction of the data,
with or without replacement, using a given random number generator
seed.
union(otherDataset): Return a new dataset that contains the union of
the elements in the source dataset and the argument.
distinct([numTasks]): Return a new dataset that contains the distinct
elements of the source dataset.
Contd…
Transformation: Description
groupByKey([numTasks]): When called on a dataset of (K, V) pairs,
returns a dataset of (K, Seq[V]) pairs.
reduceByKey(func, [numTasks]): When called on a dataset of (K, V)
pairs, returns a dataset of (K, V) pairs where the values for each key
are aggregated using the given reduce function.
sortByKey([ascending], [numTasks]): When called on a dataset of (K, V)
pairs where K implements Ordered, returns a dataset of (K, V) pairs
sorted by keys in ascending or descending order as specified in the
boolean ascending argument.
join(otherDataset, [numTasks]): When called on datasets of type (K, V)
and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of
elements for each key.
cogroup(otherDataset, [numTasks]): When called on datasets of type
(K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples; also
called groupWith.
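• A brief sketch of these pair-RDD transformations on invented data:
val sales  = sc.parallelize(Seq(("apple", 2), ("pear", 1), ("apple", 3)))
val byKey  = sales.groupByKey()                 // ("apple", [2, 3]), ("pear", [1])
val totals = sales.reduceByKey(_ + _)           // ("apple", 5), ("pear", 1)
val sorted = totals.sortByKey()                 // keys in ascending order
val prices = sc.parallelize(Seq(("apple", 0.5), ("pear", 0.8)))
val joined = totals.join(prices)                // ("apple", (5, 0.5)), ("pear", (1, 0.8))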
Action: Description
reduce(func): Aggregate the elements of the dataset using a function
func, which should be commutative and associative so that it can be
computed correctly in parallel.
collect(): Return all the elements of the dataset as an array at the
driver program; usually useful after a filter or other operation that
returns a sufficiently small subset of the data.
count(): Return the number of elements in the dataset.
first(): Return the first element of the dataset; similar to take(1).
take(n): Return an array with the first n elements of the dataset;
currently not executed in parallel, instead the driver program
computes all the elements.
takeSample(withReplacement, num, seed): Return an array with a
random sample of num elements of the dataset, with or without
replacement, using the given random number generator seed.
Contd…
Action: Description
saveAsTextFile(path): Write the elements of the dataset as a text file
(or set of text files) in a given directory in the local filesystem, HDFS
or any other Hadoop-supported file system. Spark will call toString on
each element to convert it to a line of text in the file.
saveAsSequenceFile(path): Write the elements of the dataset as a
Hadoop SequenceFile in a given path in the local filesystem, HDFS or
any other Hadoop-supported file system. Only available on RDDs of
key-value pairs that either implement Hadoop's Writable interface or
are implicitly convertible to Writable (Spark includes conversions for
basic types like Int, Double, String, etc.).
countByKey(): Only available on RDDs of type (K, V). Returns a ‘Map’ of
(K, Int) pairs with the count of each key.
foreach(func): Run a function func on each element of the dataset;
usually done for side effects such as updating an accumulator variable
or interacting with external storage systems.
Spark Community
• Most active open source community in big data
Built-in Libraries
Standard Library for Big Data
• Spark's generality and support for multiple languages make
it suitable to offer this.
• Much of future activity will be in these libraries.
A General Platform
Machine Learning Library (MLlib)
MLlib algorithms:
(i) Classification: logistic regression, linear SVM, naïve
Bayes, classification tree
(ii) Regression: generalized linear models (GLMs),
regression tree
(iii) Collaborative filtering: alternating least squares (ALS),
non-negative matrix factorization (NMF)
(iv) Clustering: k-means
(v) Decomposition: SVD, PCA
(vi) Optimization: stochastic gradient descent, L-BFGS
GraphX
•Large library of graph algorithms with composable steps
GraphX Algorithms
(i) Collaborative Filtering: Alternating Least Squares, Stochastic
Gradient Descent, Tensor Factorization
(ii) Structured Prediction: Loopy Belief Propagation, Max-Product
Linear Programs, Gibbs Sampling
(iii) Semi-supervised ML: Graph SSL, CoEM
(iv) Community Detection: Triangle-Counting, K-core Decomposition,
K-Truss
(v) Graph Analytics: PageRank, Personalized PageRank, Shortest Path,
Graph Coloring
(vi) Classification: Neural Networks
Spark Streaming
• Ensures exactly-once semantics
From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()
From JSON:
c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")
Examples
Example 1: PageRank
• Links from a high-rank page → high rank
Algorithm
Step-1: Start each page at a rank of 1
Step-2: On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
Step-3: Set each page's rank to 0.15 + 0.85 × contributions
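For example (with made-up numbers): if page p currently has rank 1 and two out-links, it contributes 0.5 to each neighbor; a page that receives contributions totalling 0.75 is then assigned a rank of 0.15 + 0.85 × 0.75 ≈ 0.79.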
Spark Program
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
    .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
PageRank Performance
Example 2: Logistic Regression
• Goal: find best line separating two sets of points
(Figure: two sets of points in the plane, with a random initial line that is iteratively adjusted toward the target separating line.)
Logistic Regression Code
val data = sc.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
Logistic Regression Performance
Hadoop: 127 s / iteration
Spark: first iteration 174 s, further iterations 6 s
Example 3: MapReduce
res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map((key, vals) => myReduceFunc(key, vals))
Or with combiners:
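A plausible sketch of this variant, assuming myCombiner is an associative function that merges map outputs for the same key:
res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map((key, v) => myReduceFunc(key, v))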
• The word count program is widely used as an example for big data
processing, since it:
• Requires a minimal amount of code
• Demonstrates use of both symbolic and numeric values
• Isn’t many steps away from search indexing
N
• Serves as a “Hello World” for Big Data apps
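• The whole program fits in a few lines. A minimal Scala sketch that builds the wc dataset saved below (the input path is illustrative):
val f  = sc.textFile("hdfs://namenode/input.txt")
val wc = f.flatMap(l => l.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)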
wc.saveAsTextFile("wc_out")
iii. K-means clustering
iv. Alternating Least Squares matrix factorization
v. In-memory OLAP aggregation on Hive data
vi. SQL on Spark
Reading Material
• Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin,
Scott Shenker, Ion Stoica
“Spark: Cluster Computing with Working Sets”
• Matei Zaharia, Mosharaf Chowdhury et al.
"Resilient Distributed Datasets: A Fault-Tolerant Abstraction for
In-Memory Cluster Computing"
Conclusion
• Spark's RDD abstraction has proven
powerful enough to express several applications that pose
challenges for existing cluster computing frameworks, including
iterative and interactive computations.
• Furthermore, it is believed that the core idea behind RDDs, of a
dataset handle that has enough information to (re)construct the
dataset from data available in reliable storage, may prove useful
in developing other abstractions for programming clusters.
• In this lecture, we have discussed the HDFS components, its
architecture, and the Spark framework with its applications.
Distributed Algorithms for
Sensor Networks
”Connected Domination in Multihop Ad Hoc Wireless Networks”
Rajiv Misra
Dept. of Computer Science &
Engineering
Indian Institute of Technology Patna
[email protected]
Introduction
• The idea of virtual backbone routing for ad hoc wireless
networks is to operate routing protocols over a virtual
backbone.
• One purpose of virtual backbone routing is to alleviate the
serious broadcast storm problem suffered by many
existing on-demand routing protocols for route detection.
• Thus constructing a virtual backbone is very important.
• In this lecture we study how the virtual backbone is
approximated by a minimum connected dominating set
(MCDS) in a unit-disk graph. This is an NP-hard problem.
•A distributed approximation algorithm with performance
ratio at most 8 will be covered.
Sensor Network as Adhoc network
• Ad hoc wireless and sensor networks have applications in
emergency search-and-rescue operations, decision
making in the battlefield, data acquisition operations in
inhospitable terrain, etc.
• It is featured by dynamic topology (infrastructureless),
multihop communication, limited resources (bandwidth,
CPU, battery, etc.) and limited security.
• These characteristics pose special challenges in routing
protocol design. Inspired by the physical backbone in a
wired network, many researchers have proposed the concept
of a virtual backbone for unicast and multicast/broadcast in
ad hoc wireless networks.
Contd…
• The virtual backbone is mainly used to collect
topology information for route detection. It also
works as a backup when route is unavailable
temporarily.
• An effective approach based on overlaying a virtual
infrastructure (termed core) on an ad hoc network is
popular.
• Routing protocols are operated over the core.
N
• Route request packets are unicasted to core nodes and
a (small) subset of non-core nodes.
• No broadcast is involved in core path detection.
Classification of Routing Protocols
• Existing routing protocols can be classified into two categories:
(i) proactive and (ii) reactive.
(i) Proactive routing protocols ask each host (or many hosts) to
maintain global topology information, thus a route can be
provided immediately when requested.
• But large amount of control messages are required to keep
each host updated for the newest topology changes.
(ii) Reactive routing protocols are on-demand in nature.
Each host computes a route for a specific destination only when
necessary.
• Topology changes which do not influence active routes do not
trigger any route maintenance function, thus communication
overhead is lower compared to proactive routing protocols.
On-demand Routing Protocols
• On-demand routing protocols attract much attention due
to their better scalability and lower protocol overhead.
• But most of them use flooding for route discovery.
Flooding suffers from broadcast storm problem.
• Broadcast storm problem refers to the fact that flooding
may result in excessive redundancy, contention, and
collision. This causes high protocol overhead and
interference to other ongoing communication sessions.
• On the other hand, the unreliability of broadcast may
prevent the detection of the shortest path, or of any path
at all, even though one exists.
Problem of efficiently constructing virtual
backbone for ad hoc wireless networks
• In this lecture we will study the "problem of efficiently
constructing virtual backbone" for ad hoc wireless
networks.
• The number of hosts forming the virtual backbone must
be as small as possible to decrease protocol overhead.
• The algorithm must be time/message efficient due to
resource scarcity.
• We use a connected dominating set (CDS) to approximate
the virtual backbone.
Assumptions (1)
• We assume that each host is equipped with an omni-
directional antenna.
• Thus the transmission range of a host is a disk.
• We further assume that each transceiver has the same
communication range R.
• Thus the footprint of an ad hoc wireless network is a unit-
disk graph.
Assumptions (2)
• In graph-theoretic terminology, the network topology we
are interested in is a graph G=(V,E) where V contains all
hosts and E is the set of links.
•A link between u and v exists if their distance is at most R.
In a real-world ad hoc wireless network, sometimes even
when v is located in u's transmission range, v is not
reachable from u due to hidden/exposed terminal problems.
• Here, we only consider bidirectional links.
• From now on, we use host and node interchangeably to
represent a wireless mobile.
Existing Distributed Algorithms for MCDS
Algorithm                 Cardinality      Message           Time            Msg. length   Information
B. Das et al. [1997]-I    ≤ (2ln∆+3)·opt   O(n|C|+m+nlogn)   O((n+|C|)∆)     O(∆)          2-hop
B. Das et al. [1997]-II   ≤ (2ln∆+2)·opt   O(n|C|)           O((|C|+|C|)∆)   O(∆)          2-hop
J. Wu et al. [1999]       N/A              O(n∆)             O(∆²)           O(∆)          2-hop
K.M. Alzoubi [2001]       ≤ 8·opt+1        O(nlogn)          O(n∆)           O(∆)          1-hop
Mihaela Cardei et al.     ≤ 8·opt          O(n)              O(n∆)           O(∆)          1-hop
Preliminaries (1)
• An independent set (IS) S of G is a subset of V such that for all
u, v ∈ S, (u,v) ∉ E. S is maximal if any vertex not in S has a
neighbor in S (denoted by MIS).
• A dominating set (DS) D of G is a subset of V such that any
node not in D has at least one neighbor in D. If the induced
subgraph of D is connected, then D is a connected dominating
set (CDS).
• Among all CDSs of graph G, the one with minimum cardinality
is called a minimum connected dominating set (MCDS)
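• To make these definitions concrete, a small non-distributed Scala sketch that checks them on an adjacency map (the helper names are mine, not from the lecture):
def isIndependent(g: Map[Int, Set[Int]], s: Set[Int]): Boolean =
  s.forall(u => (g(u) & s).isEmpty)                    // no two members of s are adjacent
def isDominating(g: Map[Int, Set[Int]], d: Set[Int]): Boolean =
  (g.keySet -- d).forall(v => (g(v) & d).nonEmpty)     // every node outside d has a neighbor in d
def isMaximalIS(g: Map[Int, Set[Int]], s: Set[Int]): Boolean =
  isIndependent(g, s) && isDominating(g, s)            // a maximal IS is an independent dominating set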
Preliminaries (2)
• Computing an MCDS in a unit graph is NP-hard. Note that
the problem of finding an MCDS in a graph is equivalent
to the problem of finding a spanning tree (ST) with
maximum number of leaves. All non-leaf nodes in the
spanning tree form the MCDS. An MIS is also a DS.
• For a graph G, if e = (u,v) ∈ E iff length(e) ≤ 1, then G is
called a unit-disk graph.
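• As an illustration of the unit-disk model (with R normalized to 1), a short sketch that builds the graph from host coordinates; Host and unitDiskGraph are names invented for this example:
case class Host(id: Int, x: Double, y: Double)
def unitDiskGraph(hosts: Seq[Host]): Map[Int, Set[Int]] = {
  def near(a: Host, b: Host) = math.hypot(a.x - b.x, a.y - b.y) <= 1.0   // distance at most R = 1
  hosts.map(h => h.id -> hosts.filter(o => o.id != h.id && near(h, o)).map(_.id).toSet).toMap
}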
Preliminaries (3)
• This lemma relates the size of any MIS of unit-disk graph G
to the size of its optimal CDS
Lemma 2.1: The size of any MIS of G is at most 4·opt + 1,
where opt is the size of any optimal CDS of G.
• For a minimization problem P, the performance ratio of an
approximation algorithm A is defined to be
ρ_A = sup{ A_i / opt_i : i ∈ I },
where I is the set of instances of P, A_i is the output from A
for instance i and opt_i is the optimal solution for instance
i. In other words, ρ_A is the supremum of A_i / opt_i over all
instances of P.
An 8-approximate algorithm to compute CDS
• This algorithm contains two phases:
• Phase-1: First, a maximal independent set (MIS) is
computed;
• Phase-2: Then a Steiner tree is used to connect all
vertices in the MIS.
• This algorithm has performance ratio at most 8 and is
message and time efficient.
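• The algorithm itself is distributed and message-driven; purely as an aid to intuition, here is a centralized Scala sketch of the same two-phase idea under my own simplifying assumptions (a connected unit-disk graph given as adjacency lists, and a designated leader s). It is an illustration, not the lecture's algorithm:
object CdsSketch {
  def connectedDominatingSet(g: Map[Int, Set[Int]], s: Int): Set[Int] = {
    // Phase 1: leader-seeded greedy MIS, preferring the highest (effective degree d*, id).
    var black = Vector(s)
    var gray  = g(s)
    var white = g.keySet - s -- gray
    while (white.nonEmpty) {
      val active = white.filter(u => (g(u) & gray).nonEmpty)   // white hosts with a gray neighbor
      val v = active.maxBy(u => ((g(u) & white).size, u))      // highest (d*, id)
      black :+= v
      white -= v
      val newGray = g(v) & white
      gray ++= newGray
      white --= newGray
    }
    // Phase 2: repeatedly add the gray vertex that connects the most not-yet-reached
    // black hosts (a crude stand-in for the distributed DFS / Steiner-tree step).
    val blackSet = black.toSet
    var reached = Set(black.head)
    var connectors = Set.empty[Int]
    while (reached != blackSet) {
      val candidates = gray.filter(u => (g(u) & reached).nonEmpty)
      val u = candidates.maxBy(c => (g(c) & (blackSet -- reached)).size)
      connectors += u
      reached ++= g(u) & blackSet
    }
    blackSet ++ connectors
  }

  def main(args: Array[String]): Unit = {
    val g = Map(1 -> Set(2, 3, 4), 2 -> Set(1), 3 -> Set(1), 4 -> Set(1, 5), 5 -> Set(4))
    println(connectedDominatingSet(g, s = 1))   // for this graph: the black hosts 1 and 5 plus connector 4
  }
}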
Algorithm description
• Initially each host is colored white.
•A dominator is colored black, while a dominatee is
colored gray.
• We assume that each vertex knows its distance-one
neighbors and their effective degrees d*.
• This information can be collected by periodic or
event-driven hello messages.
• The effective degree of a vertex is the total number of
white neighbors.
Contd...
• Here, a host is designated as the leader. This is a
realistic assumption.
• For example, the leader can be the commander’s
mobile for a platoon of soldiers in a mission.
• If it is impossible to designate any leader, a
distributed leader-election algorithm can be applied
to find out a leader. This adds message and time
complexity.
• The best leader-election algorithm takes time O(n)
and message O(nlogn) and these are the best-
achievable results. Assume host s is the leader
Phase 1:
• Host s first colors itself black and broadcasts message
DOMINATOR.
• A white host u receiving message DOMINATOR the first
time from v colors itself gray and broadcasts message
DOMINATEE. u selects v as its dominator.
• A white host receiving at least one DOMINATEE message
becomes active.
•An active white host with highest (d*, id) among all of its
active white neighbors will color itself black and broadcast
message DOMINATOR.
Contd...
•A white host decreases its effective degree by 1 and
broadcasts message DEGREE whenever it receives a
DOMINATEE message.
•Message DEGREE contains the sender’s current effective
degree. A white vertex receiving a DEGREE message will
update its neighborhood information accordingly.
Phase 2:
•A host is “ready” to be explored if it has no white
neighbors.
• A Steiner tree is used to connect all black hosts generated
in Phase 1.
•Theidea is to pick those gray vertices which connect to
many black neighbors.
Contd...
•The classical distributed depth first search spanning tree
algorithm will be modified to compute the Steiner tree.
• Initially no black vertex has a dominator and all hosts are
unexplored.
• Message M contains a field next which specifies the next
host to be explored.
• If M is built by a gray vertex, its next field contains the id
of any unexplored black neighbor.
• Any black host u receiving an M message the first time
from a gray host v sets its dominator to v by broadcasting
message PARENT.
Contd...
• When a host u receives message M from v that specifies u
to be explored next, if none of u’s neighbors is white, u
then colors itself black, sets its dominator to v and
broadcasts its own M message; otherwise, u defers its
operation until none of its neighbors is white.
• Any gray vertex receiving message PARENT from a black
neighbor will broadcast message
NUMOFBLACKNEIGHBORS, which contains the number of
active black neighbors.
•A gray vertex without active black neighbor, or a black
vertex without effective gray neighbor, will send message
DONE to the host which activates its exploration or to its
dominator.
• When s gets message DONE and it has no effective gray
neighbors, the algorithm terminates.
Complexity
• Note that phase 1 sets the dominators for all gray
vertices. Phase 2 may modify the dominator of some gray
vertex.
• The main job for phase 2 is to set a dominator for each
black vertex. All black vertices form a CDS.
• In Phase 1, each host broadcasts each of the messages
DOMINATOR and DOMINATEE at most once.
• The message complexity is dominated by message
DEGREE, since it may be broadcasted ∆ times by a host,
where ∆ is the maximum degree.
• Thus the message complexity of Phase 1 is O(n∆). The
time complexity of Phase 1 is O(n).
Contd...
• In phase 2, vertices are explored one by one.
• The message complexity is dominated by message
NUMOFBLACKNEIGHBORS, which is broadcasted at most
5 times by each gray vertex because a gray vertex has at
most 5 black neighbors in a unit-disk graph.
• Note that in phase 1 if we use (id) instead of (d*, id)
as the parameter to select a white vertex to color it
black, the message complexity will be O(n) because
no DEGREE messages will be broadcasted.
• O(n∆) is the best result we can achieve if effective
degree is taken into consideration.
Performance Analysis
• Lemma 3.2 Phase 1 computes an MIS which contains all
black nodes.
• Proof. A node is colored black only from white. No two
white neighbors can be colored black at the same time
since they must have different (d*, id) .
• When a node is colored black, all of its neighbors are
colored gray.
• Once a node is colored gray, it remains in color gray during
Phase 1.
• From the proof of Lemma 3.2, it is clear that if (id) instead
of (d*, id) is used, we still get an MIS. Intuitively, the resulting
MIS will have a larger size.
Contd...
• Lemma 3.3 In phase 2, at least one gray vertex which
connects to maximum number of black vertices will
be selected.
• Proof. Let u be a gray vertex with maximum number
of black neighbors.
• At some step in phase 2, one of u's black neighbors, v,
will be explored.
• In the following step, u will be explored. This
exploration is triggered by v.
Contd...
• Lemma 3.4 If there are c black hosts after phase 1,
then at most c-1 gray hosts will be colored black in
phase 2.
• Proof. In phase 2, the first gray vertex selected will
connect to at least 2 black vertices.
• In the following steps, any newly selected gray vertex
will connect to at least one new black vertex.
Contd...
• Lemma 3.5 If there exists a gray vertex which connects
to at least 3 black vertices, then the number of gray
vertices which are colored black in phase 2 will be at
most c-2, where c is the number of black vertices after
phase 1.
• Proof. From Lemma 3.3, at least one gray vertex with
maximum black neighbors will be colored black in phase
2. Denote this vertex by u. If u is colored black, then all
of its black neighbors will choose u as their dominator.
Thus, the selection of u causes more than one black host
to be connected.
Contd...
• Theorem 3.6 This algorithm has performance ratio at most 8.
• If there exists a gray vertex which has at least 3 black
neighbors after phase 1, from Lemma 2.1, the size of the MIS
is at most 4·opt + 1.
• From Lemma 3.5, we know the total number of black vertices
after phase 2 is at most (4·opt + 1) + ((4·opt + 1) - 2) = 8·opt.
Contd...
• If
the maximum number of black neighbors a gray vertex
has is at most 2, then the size of the MIS computed in
phase 1 is at most 2.opt since any vertex in opt connects
to at most 2 vertices in the MIS.
• Thus from Lemma 3.4, the total number of black hosts will
be 2·opt + 2·opt - 1 < 4·opt.
• Note that from the proof of Theorem 3.6, if (id) instead
of (d*, id) is used in phase 1, this algorithm still has
performance ratio at most 8.
performance ratio at most 8
More References
EL
Using a Collaborative Cover Heuristic for Ad Hoc Sensor
Networks.” IEEE Trans. Parallel Distrib. Syst. 21(3): 292-
Conclusion
• In this lecture, we have discussed a distributed
algorithm which computes a connected dominating set
of small size.
• We have discussed how to find a maximal independent
set, and then how to use a Steiner tree to connect all
vertices in the set.
• This algorithm gives performance ratio at most 8.
• The future scope of this algorithm is to study the
problem of maintaining the connected dominating set
in a mobile environment.