Breadth-First Graph Search Using An Iterative Map-Reduce Algorithm
import java.util.*;

public class Graph {

    private Map<Integer, Node> nodes;

    public Graph() {
        this.nodes = new HashMap<Integer, Node>();
    }

    // standard sequential breadth-first search from the given source node id
    public void breadthFirstSearch(int source) {
        // mark the source node as the starting point of the search
        Node snode = nodes.get(source);
        snode.setColor(Node.Color.GRAY);
        snode.setDistance(0);

        Queue<Integer> q = new LinkedList<Integer>();
        q.add(source);
        while (!q.isEmpty()) {
            Node unode = nodes.get(q.poll());
            // examine every node adjacent to the one we just dequeued
            for (int v : unode.getEdges()) {
                Node vnode = nodes.get(v);
                if (vnode.getColor() == Node.Color.WHITE) {
                    vnode.setColor(Node.Color.GRAY);
                    vnode.setDistance(unode.getDistance() + 1);
                    vnode.setParent(unode.getId());
                    q.add(v);
                }
            }
            // we're done with this node
            unode.setColor(Node.Color.BLACK);
        }
    }
}

// ... and from main():
graph.breadthFirstSearch(1);
graph.print();
parallel breadth-first search using hadoop
okay, that's nifty and all, but what if your graph is really, really big? this algorithm marches down the tree one level at a time, and once you're past the first level there will be many, many nodes whose edges need to be examined - and in the code above this happens sequentially. how can we modify this to work for a huge graph and run the algorithm in parallel? enter hadoop and map-reduce!
start here for a decent introduction to graph algorithms on map-reduce. once again though, this resource gives some tips, but no actual code.
let's talk through how we go about this, and actually write a little code! basically, the idea is this - every Map iteration "makes a mess" and every Reduce iteration "cleans up the mess". let's say we start by representing a node in the following text format

ID    EDGES|DISTANCE_FROM_SOURCE|COLOR|
where EDGES is a comma delimited list
of the ids of the nodes that are
connected to this node. in the beginning,
we do not know the distance and will use
Integer.MAX_VALUE for marking
"unknown". the color tells us whether or
not we've seen the node before, so this
starts off as white.
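just to make that format concrete, here's a rough, self-contained sketch of how one of these lines could be parsed. (the NodeLine class, its field names, and the handling of the literal string "Integer.MAX_VALUE" are mine - the real string handling lives in the Node.java you can download below.)

import java.util.*;

class NodeLine {
    int id;                                            // the node's unique id
    List<Integer> edges = new ArrayList<Integer>();    // ids of adjacent nodes
    int distance = Integer.MAX_VALUE;                  // distance from the source
    String color = "WHITE";                            // WHITE, GRAY or BLACK

    // parses a line like "2\t1,3,4,5|Integer.MAX_VALUE|WHITE|"
    static NodeLine parse(String line) {
        String[] keyAndValue = line.split("\t");
        String[] fields = keyAndValue[1].split("\\|");
        NodeLine n = new NodeLine();
        n.id = Integer.parseInt(keyAndValue[0]);
        if (fields[0].length() > 0 && !fields[0].equals("NULL")) {
            for (String e : fields[0].split(",")) {
                n.edges.add(Integer.parseInt(e));
            }
        }
        n.distance = fields[1].equals("Integer.MAX_VALUE")
                ? Integer.MAX_VALUE : Integer.parseInt(fields[1]);
        n.color = fields[2];
        return n;
    }
}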
suppose we start with the following input
graph, in which we've stated that node
#1 is the source (starting point) for the
search, and as such have marked this
one special node with distance 0 and
color GRAY.
1    2,5|0|GRAY|
2    1,3,4,5|Integer.MAX_VALUE|WHITE|
3    2,4|Integer.MAX_VALUE|WHITE|
4    2,3,5|Integer.MAX_VALUE|WHITE|
5    1,2,4|Integer.MAX_VALUE|WHITE|
the mappers are responsible for "exploding" all gray nodes - i.e. all nodes that live at our current depth in the tree. for each gray node, the mappers emit a new gray node for each of its edges, with distance = distance + 1. they also then emit the input gray node, but colored black. (once a node has been exploded, we're done with it.) mappers also emit all non-gray nodes, with no change. so, the output of the first map iteration would be
1    2,5|0|BLACK|
2    NULL|1|GRAY|
5    NULL|1|GRAY|
2    1,3,4,5|Integer.MAX_VALUE|WHITE|
3    2,4|Integer.MAX_VALUE|WHITE|
4    2,3,5|Integer.MAX_VALUE|WHITE|
5    1,2,4|Integer.MAX_VALUE|WHITE|
note that when the mappers "explode"
the gray nodes and create a new node
for each edge, they do not know what to
write for the edges of this new node - so
they leave it blank.
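stripped of all the hadoop machinery, the "explode" step for a single gray node is just this (a plain java sketch - emit() and join() are hypothetical stand-ins for the output collector and string formatting in the real mapper further down):

// explode one gray node: emit a gray child for every edge (with unknown
// edges), then emit the node itself, recolored black
void explode(int id, List<Integer> edges, int distance) {
    for (int v : edges) {
        emit(v, "NULL|" + (distance + 1) + "|GRAY|");    // edges unknown, so NULL
    }
    emit(id, join(edges) + "|" + distance + "|BLACK|");  // done with this node
}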
the reducers, of course, receive all data
for a given key - in this case it means
that they receive the data for all "copies"
of each node. for example, the reducer
that receives the data for key = 2 gets
the following list of values :
2    NULL|1|GRAY|
2    1,3,4,5|Integer.MAX_VALUE|WHITE|
the reducer's job is to take all this data and construct a new node using
• the non-null list of edges
• the minimum distance
• the darkest color
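in plain java, ignoring the hadoop plumbing for a moment, that merge rule looks roughly like the sketch below. (Copy and darker() are stand-ins of mine, not part of the real code - darker() just treats BLACK > GRAY > WHITE.)

// merge all "copies" of a single node id back into one node
Copy merge(List<Copy> copies) {
    Copy result = new Copy();
    result.distance = Integer.MAX_VALUE;
    result.color = "WHITE";
    for (Copy c : copies) {
        if (c.edges != null) {
            result.edges = c.edges;              // the non-null list of edges
        }
        if (c.distance < result.distance) {
            result.distance = c.distance;        // the minimum distance
        }
        if (darker(c.color, result.color)) {
            result.color = c.color;              // the darkest color
        }
    }
    return result;
}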
using this logic the output from our first
iteration will be :
1 2,5,|0|BLACK
2 1,3,4,5,|1|GRAY
3 2,4,|Integer.MAX_VALUE|WHITE
4 2,3,5,|Integer.MAX_VALUE|WHITE
5 1,2,4,|1|GRAY
the second iteration uses this as the
input and outputs :
1 2,5,|0|BLACK
2 1,3,4,5,|1|BLACK
3 2,4,|2|GRAY
4 2,3,5,|2|GRAY
5 1,2,4,|1|BLACK
and the third iteration outputs:
1 2,5,|0|BLACK
2 1,3,4,5,|1|BLACK
3 2,4,|2|BLACK
4 2,3,5,|2|BLACK
5 1,2,4,|1|BLACK
subsequent iterations will continue to
print out the same output.
how do you know when you're done?
you are done when there are no output
nodes that are colored gray. note - if not
all nodes in your input are actually
connected to your source, you may have
final output nodes that are still white.
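if you wanted to check that mechanically, scanning one iteration's reducer output for any remaining gray node is enough - something along these lines (a sketch, assuming the driver can read the iteration's output lines):

// returns true if any output line still describes a GRAY node,
// i.e. another iteration is needed
boolean hasGrayNodes(Iterable<String> outputLines) {
    for (String line : outputLines) {
        if (line.contains("GRAY")) {   // GRAY only ever appears in the color field
            return true;
        }
    }
    return false;
}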
here's an actual implementation in
hadoop. once again, Node is little more
than a simple bean, plus some ugly
string processing nonsense. you can
download Node.java here and
GraphSearch.java here.
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
/**
 * This is an example Hadoop Map/Reduce application.
 *
 * It inputs a graph in adjacency list format, and performs a breadth-first search.
 * The input format is
 *     ID   EDGES|DISTANCE|COLOR
 * where
 *     ID       = the unique identifier for a node (assumed to be an int here)
 *     EDGES    = the list of edges emanating from the node (e.g. 3,8,9,12)
 *     DISTANCE = the to-be-determined distance of the node from the source
 *     COLOR    = a simple status tracking field to keep track of when we're
 *                finished with a node
 *
 * It assumes that the source node (the node from which to start the search) has
 * been marked with distance 0 and color GRAY in the original input. All other
 * nodes will have input distance Integer.MAX_VALUE and color WHITE.
 */
public class GraphSearch extends Configured implements Tool {

  /**
   * Nodes that are Color.WHITE or Color.BLACK are emitted, as is. For every
   * edge of a Color.GRAY node, we emit a new Node with distance incremented by
   * one. The Color.GRAY node is then colored black and is also emitted.
   */
  public static class MapClass extends MapReduceBase implements
      Mapper<LongWritable, Text, IntWritable, Text> {

    public void map(LongWritable key, Text value,
        OutputCollector<IntWritable, Text> output,
        Reporter reporter) throws IOException {

      // parse the input line into the Node bean described above
      Node node = new Node(value.toString());

      // for each GRAY node, emit each of its edges as a new GRAY node
      if (node.getColor() == Node.Color.GRAY) {
        for (int v : node.getEdges()) {
          Node vnode = new Node(v);
          vnode.setDistance(node.getDistance() + 1);
          vnode.setColor(Node.Color.GRAY);
          output.collect(new IntWritable(vnode.getId()), vnode.getLine());
        }
        // We're done with this node now, color it BLACK
        node.setColor(Node.Color.BLACK);
      }

      // no matter what, we emit the input node; if it came in GRAY it goes out BLACK
      output.collect(new IntWritable(node.getId()), node.getLine());
    }
  }
  /**
   * A reducer class that combines all of the "copies" of a node into a single
   * output node.
   */
  public static class Reduce extends MapReduceBase implements
      Reducer<IntWritable, Text, IntWritable, Text> {

    /**
     * Make a new node which combines all information for this single node id.
     * The new node should have
     * - The full list of edges
     * - The minimum distance
     * - The darkest Color
     */
    public void reduce(IntWritable key, Iterator<Text> values,
        OutputCollector<IntWritable, Text> output, Reporter reporter)
        throws IOException {

      List<Integer> edges = null;
      int distance = Integer.MAX_VALUE;
      // assumes Node.Color is declared in the order WHITE, GRAY, BLACK
      Node.Color color = Node.Color.WHITE;

      while (values.hasNext()) {
        Text value = values.next();
        Node u = new Node(key.get() + "\t" + value.toString());

        // one (and only one) copy of the node carries the full list of edges
        if (u.getEdges() != null && u.getEdges().size() > 0) {
          edges = u.getEdges();
        }
        // keep the minimum distance seen so far
        if (u.getDistance() < distance) {
          distance = u.getDistance();
        }
        // keep the darkest color seen so far
        if (u.getColor().ordinal() > color.ordinal()) {
          color = u.getColor();
        }
      }

      Node n = new Node(key.get());
      n.setEdges(edges);
      n.setDistance(distance);
      n.setColor(color);
      output.collect(key, n.getLine());
    }
  }
  static int printUsage() {
    System.out.println("graphsearch [-m <num mappers>] [-r <num reducers>]");
    ToolRunner.printGenericCommandUsage(System.out);
    return -1;
  }

  private JobConf getJobConf(String[] args) {
    JobConf conf = new JobConf(getConf(), GraphSearch.class);
    conf.setJobName("graphsearch");

    // the keys are the node ids
    conf.setOutputKeyClass(IntWritable.class);
    // the values are the string representation of a Node
    conf.setOutputValueClass(Text.class);

    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);

    for (int i = 0; i < args.length; ++i) {
      if ("-m".equals(args[i])) {
        conf.setNumMapTasks(Integer.parseInt(args[++i]));
      } else if ("-r".equals(args[i])) {
        conf.setNumReduceTasks(Integer.parseInt(args[++i]));
      }
    }

    return conf;
  }
  /**
   * The main driver for the graph search map/reduce program. Invoke this
   * method to submit the map/reduce job.
   *
   * @throws IOException
   *           When there are communication problems with the job tracker.
   */
  public int run(String[] args) throws Exception {

    int iterationCount = 0;

    while (keepGoing(iterationCount)) {

      // the input of each iteration is the output of the previous one
      String input;
      if (iterationCount == 0)
        input = "input-graph";
      else
        input = "output-graph-" + iterationCount;

      String output = "output-graph-" + (iterationCount + 1);

      JobConf conf = getJobConf(args);
      FileInputFormat.setInputPaths(conf, new Path(input));
      FileOutputFormat.setOutputPath(conf, new Path(output));

      RunningJob job = JobClient.runJob(conf);
      iterationCount++;
    }

    return 0;
  }

  private boolean keepGoing(int iterationCount) {
    // see the note below - rather than checking for remaining gray nodes,
    // we simply stop after the 4th iteration
    if (iterationCount >= 4) {
      return false;
    }
    return true;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new GraphSearch(), args);
    System.exit(res);
  }

}
the remotely astute reader may notice that the keepGoing method is not actually checking to see if there are any remaining gray nodes, but rather just stops after the 4th iteration. this is because there is no easy way to communicate this information in hadoop.
what we want to do is the following :
1. at the beginning of each iteration, set
a global flag keepGoing = false
2. as the reducer outputs each node, if it is outputting a gray node, set keepGoing = true
unfortunately, hadoop provides no
framework for setting such a global
variable. this would need to be managed
using an external semaphore server of
some sort. this is left as an exercise for
the lucky reader. ; )