Chapter 9 - Processing Big Data With MapReduce
MapReduce
What is MapReduce?
MapReduce is a programming model and software
framework used for processing and generating large
datasets. It is designed to handle big data and is
widely used in distributed computing environments.
[Figure: a client interacts with the HDFS name node; the data is spread over data nodes, each with its own local disk.]
MapReduce design goals
Scalable to large data volumes: can run your applications in parallel over 1,000s of machines and 10,000s of disks.
Elastic: compute nodes can be added to or removed from the framework easily.
Cost-efficient: uses commodity machines and networks (cheap, but unreliable), provides automatic fault tolerance (fewer administrators), and is easy to program (fewer programmers).
Hadoop
The Apache Hadoop software library is a framework
that allows for the distributed processing of large
data sets across clusters of computers using a simple
programming model.
Hadoop Distributed File System (HDFS)
Stores the data of your application across a cluster of machines.
Map function
[Figure: the map function transforms input (key, value) pairs into intermediate (key, value) pairs.]
Example 1 - WordCount
Motivating WordCount application: given a set of documents, count the number of occurrences of each word in the documents.
The Mapper function:
Input: (key, value) = (fileID, fileContent) pairs.
Output: (key, value) = (word, 1) pairs, e.g., (how, 1), (are, 1).
Example 1 - WordCount: Map phase
[Figure: each HDFS datanode runs a mapper over its local files. Mapper A reads File 1 and File 2 and emits (how, 1), (are, 1), (you, 1), (how, 1), (many, 1), ...; Mapper B reads File 3 and emits (are, 1), (you, 1), (happy, 1), (how, 1), (to, 1), (do, 1). The shuffle-and-sort phase then groups the values by key: (how, <1,1,1>), (many, <1>), (happy, <1>) go to Reducer A, while (are, <1,1>), (you, <1,1>), (to, <1>), (do, <1>) go to Reducer B.]
Example 1 - WordCount: Reduce phase
The reducers run in parallel, and the number of reducers can be different from the number of mappers.
[Figure: Reducer A turns (how, <1,1,1>) into (how, 3), (many, <1>) into (many, 1), (happy, <1>) into (happy, 1); Reducer B turns (are, <1,1>) into (are, 2), (you, <1,1>) into (you, 2), (to, <1>) into (to, 1), (do, <1>) into (do, 1).]
Example 1 - WordCount
Map function
Input: <fileID, fileContent> pairs.
Output: emit <"how", 1>, <"are", 1>, etc. as intermediate <key, value> pairs.
Reduce function
Input: <"how", list(1,1,1)>.
Output: emit <"how", 3>.
Hadoop users only need to program the Map and Reduce functions! Hadoop manages the entire cluster to finish the parallel execution. Nice ☺
The framework also accomplishes synchronization: the reduce computation cannot start until all mappers have finished emitting (key, value) pairs and all intermediate (key, value) pairs have been shuffled and sorted.
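To make this contract concrete, here is a minimal sketch in plain Python of the two user-written functions plus a local stand-in for the shuffle-and-sort phase. Hadoop itself runs Mapper and Reducer classes (typically in Java) over a cluster; `run_job` and all names here are illustrative only.

```python
from collections import defaultdict

def map_wordcount(file_id, file_content):
    """Map: emit an intermediate (word, 1) pair per word occurrence."""
    for word in file_content.lower().split():
        yield (word, 1)

def reduce_wordcount(word, counts):
    """Reduce: sum the list of 1s collected for one word."""
    yield (word, sum(counts))

def run_job(records, mapper, reducer):
    """Single-process stand-in for one MapReduce job (illustrative)."""
    groups = defaultdict(list)
    for key, value in records.items():
        for k, v in mapper(key, value):
            groups[k].append(v)          # shuffle: same key -> same reducer
    results = {}
    for k in sorted(groups):             # sort phase
        for out_k, out_v in reducer(k, groups[k]):
            results[out_k] = out_v
    return results

docs = {"file1": "how are you", "file2": "how many how"}
print(run_job(docs, map_wordcount, reduce_wordcount))
# {'are': 1, 'how': 3, 'many': 1, 'you': 1}
```

The later sketches in this chapter reuse this `run_job` simulator.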
3. Reduce phase
A reducer in MapReduce receives all values associated with the same key at once. Programmers can specify the number of reducers.
Fault tolerance: if a node crashes...
Re-launch its current tasks on other nodes.
Re-run any map tasks the crashed node previously ran, because their outputs were lost along with the crashed node.
Fault Tolerance
If a task is going slowly...
Speculative execution: launch a second copy of the task on another node, take the output of whichever copy finishes first, and terminate the other copy.
This is a very important strategy in large clusters. Stragglers occur frequently due to failing hardware, bugs, etc., and a single straggler may noticeably slow down overall performance. E.g., a single straggler map task can delay the start of the shuffle-and-sort phase.
Example 2 – Inverted Indexing
Inverted indexing: nearly all search engines today rely on a data structure called an inverted index, which, given a keyword, provides access to the list of documents that contain the keyword.

Keyword     List of document IDs
Computer    {1.html, 3.html, ...}
Apple       {2.html, 3.html, ...}

Why is an inverted index useful? E.g., to find which document(s) contain the keywords "Apple Computer", we simply join the two inverted lists to find the result (i.e., 3.html).
It is called "inverted" because data usually exist in one direction (each document lists the keywords it contains), while the index maps the opposite direction (each keyword lists the documents containing it).
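A hedged sketch of constructing such an index with MapReduce, reusing the `run_job` simulator from the WordCount example (`map_invert` and `reduce_invert` are illustrative names): the mapper emits one (keyword, docID) pair per distinct keyword in a document, and the reducer assembles the inverted list.

```python
def map_invert(doc_id, content):
    """Map: emit each distinct keyword once per document."""
    for word in set(content.lower().split()):
        yield (word, doc_id)

def reduce_invert(word, doc_ids):
    """Reduce: the inverted list = all documents containing the keyword."""
    yield (word, sorted(set(doc_ids)))

docs = {"1.html": "computer sale", "2.html": "apple pie",
        "3.html": "apple computer"}
index = run_job(docs, map_invert, reduce_invert)
# Conjunctive query "Apple Computer" = join (intersect) the two lists.
print(sorted(set(index["apple"]) & set(index["computer"])))  # ['3.html']
```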
Combiner
Local aggregation: combine some key-value pairs in each mapper to reduce the number of key-value pairs that pass into the shuffle-and-sort phase.
E.g., in the WordCount example, instead of emitting <"how", 1> twice when a document has two "how"s, we emit <"how", 2>.
This sounds like a trivial optimization; why is it important, and what aspects do we have to consider? Again, we are handling massive data, so every shuffled pair counts. On the other hand, if we adopt local aggregation, buffer space has to be allocated to store the intermediate word counts, which may create memory problems if the key space is very large!
Combiner
[Figure: input (key, value) = (docID, docContent). Without a combiner, each mapper emits one (word, 1) pair per occurrence, e.g., a 1, a 1, b 1, c 1, c 1, b 1. With local aggregation each mapper pre-combines its own pairs (e.g., to a 2, b 2, c 2) before the shuffle; the reducers still produce the same final counts a 4, b 5, c 3.]
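One common realization of local aggregation is in-mapper combining: the mapper buffers counts in an in-memory dictionary (the buffer-space concern above) and emits one pair per distinct word. A sketch, reusing `run_job` and `reduce_wordcount` from the WordCount example:

```python
from collections import Counter

def map_wordcount_combined(file_id, file_content):
    """Map with in-mapper combining: emit (word, local_count) once per
    distinct word instead of (word, 1) once per occurrence."""
    buffer = Counter(file_content.lower().split())  # the in-memory buffer
    for word, local_count in buffer.items():
        yield (word, local_count)

docs = {"file1": "how are you how", "file2": "how many"}
print(run_job(docs, map_wordcount_combined, reduce_wordcount))
# {'are': 1, 'how': 3, 'many': 1, 'you': 1} -- 5 shuffled pairs, not 6
```

Note that the reducer is unchanged: it simply sums whatever values arrive, whether they are raw 1s or pre-combined counts.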
Partitioner
The simplest partitioner computes the hash value of the key and then takes it modulo the number of reducers. This assigns approximately the same number of keys to each reducer. (We want to balance the reducers' workload.)
Problem: the partitioner only considers the key and ignores the value.
Why is this a problem? Consider the WordCount example. Some words occur more frequently than others, so even if we assign the same number of keys to each reducer, the reducers' workload may still be imbalanced.
Partitioner
The partitioner allows the programmer to specify which reducer is responsible for processing a particular key.
[Figure: two shuffle-and-sort phases compared. The default partitioner assigns keys to reducers by k mod 2, giving one reducer the loads a 3 and c 5 and the other only b 3; a custom partitioner instead assigns "c" to one reducer (load c 5) and "a" and "b" to the other (loads a 3, b 3), balancing the work.]
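A minimal sketch of the two strategies in the figure. Hadoop's default HashPartitioner implements the first (in Java); `custom_partition` is an illustrative hand-written rule.

```python
def hash_partition(key, num_reducers):
    """Default strategy: reducer = hash(key) mod R. A stable byte-sum hash
    is used here because Python salts built-in str hashes per process."""
    return sum(key.encode()) % num_reducers

def custom_partition(key, num_reducers):
    """Custom strategy: send the known-heavy key "c" to its own reducer
    and everything else to the other one."""
    return 0 if key == "c" else 1

key_loads = {"a": 3, "b": 3, "c": 5}   # values per key, as in the figure
for partition in (hash_partition, custom_partition):
    loads = [0, 0]
    for key, volume in key_loads.items():
        loads[partition(key, 2)] += volume
    print(partition.__name__, loads)    # hash: [3, 8], custom: [5, 6]
```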
We are going to learn…
Section 2. TF/IDF and MapReduce
tf.idf: term frequency – inverse document frequency.
Acknowledgement – Matei Zaharia, Electrical Engineering and Computer Sciences, University of California, Berkeley
tf.idf(t,d)
tf.idf(t,d), short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a term t (keyword) is to a document d in a collection or corpus.
Example documents:
Doc 1: "Apple pie ingredient: 6 Apple, …"
Doc 2: "Apple iphone 6 and ipad …"
Doc 3: "An apple a day keeps the doctor away …"
Doc 4: "Apple Daily is a Hong Kong-based tabloid-style …"
Looking only at the occurrence frequency of a term within a document: "apple" occurs many times in Doc 1, so "apple" should be very important in Doc 1; "ingredient" occurs only once in Doc 1, so "ingredient" should not be important in Doc 1.
But considering how commonly a term appears across documents: wait! "apple" is a common word in this set of documents, so it is not an informative keyword for identifying a document. On the other hand, "ingredient" occurs in Doc 1 only, so it should be a specific keyword describing the content of Doc 1.
Preprocessing 1. Tokenization
Challenges:
Compound words: hostname, host-name and host name. Break into two tokens or regroup them as one token? In any case, lexicon and linguistic analysis are needed!
In some languages (Chinese, Japanese), words are not separated by whitespace.
Reference: Web Data Management and Distribution by Serge Abiteboul et al. http://webdam.inria.fr/textbook
Preprocessing 2. Stemming
Merge different forms of the same word, or of closely related words, into a single stem.
Morphological stemming: remove bound morphemes from words, such as final -s, -'s, -ed, -ing, -er, -est.
Before stemming:
d1: the jaguar is a new world mammal of the felidae family
d2: jaguar has designed four new engines
d3: for jaguar atari was keen to use a 68k family device
d4: the jacksonville jaguars are a professional us football team
d5: mac os x jaguar is available at a price of us $99 for apple's new family pack
d6: one such ruling family to incorporate the jaguar into their name is jaguar paw
d7: it is a big cat
After stemming:
d1: the jaguar be a new world mammal of the felidae family
d2: jaguar have design four new engine
d3: for jaguar atari be keen to use a 68k family device
d4: the jacksonville jaguar be a professional us football team
d5: mac os x jaguar be available at a price of us $99 for apple new family pack
d6: one such rule family to incorporate the jaguar into their name be jaguar paw
d7: it be a big cat
Reference: Web Data Management and Distribution by Serge Abiteboul et al. http://webdam.inria.fr/textbook
Preprocessing 3. Stop word removal
Remove uninformative words from documents, in particular to lower the cost of storing the index. Applied to the stemmed documents above:
d1: jaguar new world mammal felidae family
d2: jaguar design four new engine
d3: jaguar atari keen 68k family device
d4: jacksonville jaguar professional us football team
d5: mac os x jaguar available price us $99 apple new family pack
d6: one such rule family incorporate jaguar their name jaguar paw
d7: big cat
Reference: Web Data Management and Distribution by Serge Abiteboul et al. http://webdam.inria.fr/textbook
Inverted index
After all preprocessing, construction of an inverted index: an index of all terms, with the list of documents where each term occurs.
Small scale: disk storage, with memory-mapping (cf. mmap) techniques; a secondary index gives the offset of each term in the main index.
Large scale: distributed over a cluster of machines; hashing assigns each term to the machine responsible for it.
Updating the index is costly, so only batch operations are used (not one-by-one addition of term occurrences).
Reference: Web Data Management and Distribution by Serge Abiteboul et al. http://webdam.inria.fr/textbook
Inverted index
Phrase queries and the NEAR operator need positional information to be kept in the index. Each posting stores the word positions at which the term occurs in the preprocessed documents d1-d7 above:
family:   d1/11, d3/10, d5/16, d6/4
football: d4/8
jaguar:   d1/2, d2/1, d3/2, d4/3, d5/4, d6/8,13
new:      d1/5, d2/5, d5/15
rule:     d6/3
us:       d4/7, d5/11
world:    d1/6
Example query: find the documents that contain the keyword "jaguar" followed by "family" within 8 word positions.
Documents that contain both keywords: d1 (2,11), d3 (2,10), d5 (4,16), d6 (8,4), d6 (13,4).
With "jaguar" followed by "family": d1 (2,11), d3 (2,10), d5 (4,16).
Within 8 word positions: d3 (2,10).
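A small sketch of this query over the positional postings above, with the index held as a plain nested dict (term → document → positions); the function name and window check are illustrative.

```python
positional_index = {
    "jaguar": {"d1": [2], "d2": [1], "d3": [2], "d4": [3],
               "d5": [4], "d6": [8, 13]},
    "family": {"d1": [11], "d3": [10], "d5": [16], "d6": [4]},
}

def followed_within(index, first, second, window):
    """Documents where `first` occurs before `second`, at most
    `window` word positions apart."""
    hits = []
    docs = set(index[first]) & set(index[second])   # both terms present
    for d in sorted(docs):
        for p1 in index[first][d]:
            for p2 in index[second][d]:
                if 0 < p2 - p1 <= window:           # "followed by" in window
                    hits.append((d, p1, p2))
    return hits

print(followed_within(positional_index, "jaguar", "family", 8))
# [('d3', 2, 10)] -- d1 (2,11) and d5 (4,16) are too far apart
```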
Term frequency tf(t,d)
Insight: terms occurring frequently in a given document are more relevant.

tf(t,d) = n_{t,d} / Σ_{t'} n_{t',d}

The term frequency tf(t,d) is the number of occurrences of a term t in a document d (denoted by n_{t,d}), divided by the total number of terms in d (denoted by Σ_{t'} n_{t',d}). E.g., over the preprocessed documents d1-d7 above, tf("family", d1) = 1/6.
tf.idf(t,d) and inverted indices
[Figure: the tf.idf weights are stored in the inverted lists alongside each document entry.]
tf.idf(t,d) and MapReduce
Can we use MapReduce to compute tf.idf(t,d) for all terms t in parallel and update the inverted lists?

tf.idf(t,d) = tf(t,d) · idf(t),  where tf(t,d) = n_{t,d} / Σ_{t'} n_{t',d}  and  idf(t) = lg( |D| / |{d' ∈ D | n_{t,d'} > 0}| )

Information we need:
n_{t,d}: the number of occurrences of term t in document d.
Σ_{t'} n_{t',d}: the total number of terms in document d.
|{d ∈ D | n_{t,d} > 0}|: the number of documents in which term t appears.
|D|: the total number of documents (global metadata).
Job 1. Compute n_{t,d} and Σ_{t'} n_{t',d}
The reduce task receives the <(t,d), Σ_{t'} n_{t',d}> pairs with the same (t,d): one pair per occurrence of t in d, with the document length as the value. It simply computes n_{t,d} by counting the number of key-value pairs received.
In a later job, the reduce task receives the <t, (d, Σ_{t'} n_{t',d})> pairs with the same t. It simply computes |{d ∈ D | n_{t,d} > 0}| by counting the number of key-value pairs received for key t.
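A hedged sketch of Job 1 under that reading, reusing `run_job` from the WordCount example: the mapper emits one ((t, d), document length) pair per occurrence of t in d, so the reducer recovers n_{t,d} by counting its pairs and Σ_{t'} n_{t',d} by reading any one value.

```python
def map_job1(doc_id, content):
    terms = content.lower().split()
    for t in terms:
        yield ((t, doc_id), len(terms))   # value carries sum_t' n_t',d

def reduce_job1(key, values):
    n_td = len(values)      # occurrences of t in d = number of pairs
    doc_len = values[0]     # the same document length rides in every pair
    yield (key, (n_td, doc_len))          # enough to form tf(t,d)

print(run_job({"d2": "jaguar design four new engine"},
              map_job1, reduce_job1))
# {('design', 'd2'): (1, 5), ('engine', 'd2'): (1, 5), ...}
```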
Top-k evaluation with inverted lists (termination condition)
[Figure: scanning the inverted lists for the query in decreasing-weight order.]
The bound is updated from 0.18 to 0.07 + 0.10 = 0.17 (tighter). The bound means the maximum possible score of any other document is at most 0.17.
The next entry in the inverted list of "family" is d5; d5 is already in Result, so score(q, d5) need not be recomputed.
[Termination condition] Result: d1/0.33, d2/0.24, d5/0.17. When we look at the entries in Result (i.e., the 3rd entry "d5/0.17" is no smaller than the bound 0.17), we can confirm that they are the top-3 results.
Section 3. OLAP on Search Logs
References: OLAP on Search Logs: An Infrastructure Supporting Data-Driven Applications in Search Engines by Bin Zhou et al.
Query suggestion
Auto-complete (after typing "the university of").
Query suggestion (after issuing the first query "University of Hong Kong").
Search logs and sessions
Conceptually, a search log is a sequence of queries and click events. Since a search log often contains the information from multiple users over a long period, we can divide a search log into sessions.

Date time            Session ID (or IP)   Query   ...
2023-08-03T00:12:13  132.165.2.2          Honda   ...
2023-08-03T00:12:14  217.25.64.32         Apple iphone6  ...
...                  ...                  ...     ...
2023-08-03T00:13:21  132.165.2.2          Ford    ...

Session extraction (see the sketch after the steps):
Step 1. For each user, extract the queries by the user from the search log as a stream.
Step 2. Then segment each user's stream into sessions based on a widely adopted rule: two queries are split into two sessions if the time interval between them exceeds 30 minutes.
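A sketch of the two steps, assuming log records are (timestamp, session ID, query) tuples; the third record below is invented purely to trigger a session split.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)   # the widely adopted 30-minute rule

def extract_sessions(log):
    # Step 1: group each user's queries into a time-ordered stream.
    streams = {}
    for ts, user, query in sorted(log):
        streams.setdefault(user, []).append((ts, query))
    # Step 2: split a stream whenever the gap between queries > 30 min.
    sessions = []
    for user, stream in streams.items():
        current = [stream[0][1]]
        for (prev_ts, _), (ts, query) in zip(stream, stream[1:]):
            if ts - prev_ts > SESSION_GAP:
                sessions.append(current)
                current = []
            current.append(query)
        sessions.append(current)
    return sessions

log = [
    (datetime(2023, 8, 3, 0, 12, 13), "132.165.2.2", "Honda"),
    (datetime(2023, 8, 3, 0, 13, 21), "132.165.2.2", "Ford"),
    (datetime(2023, 8, 3, 1, 5, 0), "132.165.2.2", "Toyota"),  # hypothetical
]
print(extract_sessions(log))   # [['Honda', 'Ford'], ['Toyota']]
```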
Data model
A search log is a sequence of queries and click events. E.g., if q1 = "HKU" and q2 = "CUHK", and a user session s searches for q1 first and q2 next, then s = <q1 q2>.

Sequence ID   Query sequence
s1            <q1q2q3q4>
s2            <q1q2q4q5>
s3            <q6q1q2q5>
s4            <q1q2q3q4>
s5            <q6q1q2q5>
s6            <q1q2q3q5>
s7            <q1q2q3q6>
s8            <q6q1q2q5>

Session frequency(s) is the number of query sessions that are exactly the same as s:
s = q1q2: session frequency(s) = 0.
s = q1q2q3q4: session frequency(s) = 2.
Frequency(s) is the number of query sessions with s as a substring:
s = q1q2: frequency(s) = 8.
1. Forward search
In a set of sessions, given a query sequence s and a search result size k, the forward search finds k sequences s1, …, sk such that <s, si> (1 ≤ i ≤ k) is among the top-k most frequent sequences that have s as a prefix.
Input: s = <"Frozen">, k = 2.
Output: the top-2 frequent sequences that contain "Frozen":
<"Disney movie", "Frozen">
<"Frozen", "Let it go">
System framework
[Figure: system framework of the search-log OLAP infrastructure.]
References: OLAP on Search Logs: An Infrastructure Supporting Data-Driven Applications in Search Engines by Bin Zhou et al.
Index structure
The sessions are indexed with a suffix tree: for each session, every suffix is inserted into the tree, and the count stored at each node along the insertion path is incremented. (The tree figures from the slides are summarized below; the sequence table is the one from the data model above.)
Suffixes of s1 = <q1q2q3q4>: <q1q2q3q4>, <q2q3q4>, <q3q4>, <q4>.
Suffixes of s2 = <q1q2q4q5>: <q1q2q4q5>, <q2q4q5>, <q4q5>, <q5>.
Suffixes of s3 = <q6q1q2q5>: <q6q1q2q5>, <q1q2q5>, <q2q5>, <q5>.
Suffixes of s4 = <q1q2q3q4>: <q1q2q3q4>, <q2q3q4>, <q3q4>, <q4>.
Suffixes of s5 = <q6q1q2q5>: <q6q1q2q5>, <q1q2q5>, <q2q5>, <q5>.
Suffixes of s6 = <q1q2q3q5>: <q1q2q3q5>, <q2q3q5>, <q3q5>, <q5>.
Suffixes of s7 = <q1q2q3q6>: <q1q2q3q6>, <q2q3q6>, <q3q6>, <q6>.
Suffixes of s8 = <q6q1q2q5>: <q6q1q2q5>, <q1q2q5>, <q2q5>, <q5>.
After all eight sessions are inserted, the count at a node gives the number of sequences containing the pattern spelled by the path from the root to that node.
What is the number of sequences that contain the pattern <q1>? There are 8 sequences that contain <q1>.
How about the pattern <q1q2q3q4>? There are 2 sequences that contain <q1q2q3q4>.
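A minimal sketch of this count-annotated suffix tree in Python (class and function names are illustrative):

```python
class Node:
    def __init__(self):
        self.count = 0
        self.children = {}

def build_index(sessions):
    """Insert every suffix of every session, incrementing the counter of
    each node on the insertion path."""
    root = Node()
    for s in sessions:
        for i in range(len(s)):                 # every suffix of s
            node = root
            for q in s[i:]:
                node = node.children.setdefault(q, Node())
                node.count += 1
    return root

def frequency(root, pattern):
    """Count of sessions containing `pattern` as a contiguous subsequence.
    (A session containing the pattern twice is counted twice -- exactly
    the issue raised by s12 = <q1q2q1q2> in the follow-up topics.)"""
    node = root
    for q in pattern:
        if q not in node.children:
            return 0
        node = node.children[q]
    return node.count

sessions = [("q1","q2","q3","q4"), ("q1","q2","q4","q5"),
            ("q6","q1","q2","q5"), ("q1","q2","q3","q4"),
            ("q6","q1","q2","q5"), ("q1","q2","q3","q5"),
            ("q1","q2","q3","q6"), ("q6","q1","q2","q5")]
root = build_index(sessions)
print(frequency(root, ("q1",)))                   # 8
print(frequency(root, ("q1", "q2", "q3", "q4")))  # 2
```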
Forward search
Input: s = <q1q2>, k = 2.
Question: how can we use the suffix tree to answer the forward search with s = <q1q2> and k = 2?
Answer: we keep a priority queue of tree nodes to explore. Starting from the node reached by the prefix <q1q2>, its children give the initial candidates and their frequencies. Let's illustrate the idea of the algorithm step by step.
Candidates: q3 (frequency 4), q5 (frequency 3), q4 (frequency 1).
Forward search (continued)
Observation: a descendant node cannot have a frequency higher than any of its ancestor nodes. So q3 is the top-1 answer: any candidate with prefix q5 will have a frequency of at most 3, which is smaller than q3's frequency of 4, and we can guarantee that no other node has a frequency larger than q3's.
Candidates: q3 (4), q5 (3), q4 (1).
Top-k answer so far: q3 (frequency 4).
Forward search (continued)
Explore the children of q3: since the frequency of q3's child nodes can be up to 4, q5 is not yet guaranteed to be the top-2 result. We therefore expand q3 and retrieve the frequencies of q3q4, q3q5 and q3q6.
Candidates: q5 (3), q3q4 (2), q4 (1), q3q5 (1), q3q6 (1).
Top-k answer so far: q3 (frequency 4).
Forward search (continued)
q5 is now guaranteed to be the top-2: since q5 is still at the front of the priority queue, no other node with q1q2 as a prefix can have a frequency larger than q5's, so q5 is guaranteed to be the top-2 result.
Candidates: q5 (3), q3q4 (2), q4 (1), q3q5 (1), q3q6 (1).
Top-k answer: q3 (4), q5 (3).
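The whole procedure can be sketched with a max-heap over tree nodes, continuing from the `build_index` sketch above. Correctness rests on the observation that a child's count never exceeds its parent's, so the heap's maximum always dominates everything not yet explored.

```python
import heapq

def forward_search(root, s, k):
    """Top-k continuations of prefix s, by best-first expansion."""
    node = root
    for q in s:                     # walk down to the node for prefix s
        node = node.children[q]
    heap = [(-c.count, (q,), c) for q, c in node.children.items()]
    heapq.heapify(heap)
    results = []
    while heap and len(results) < k:
        neg_count, path, n = heapq.heappop(heap)
        results.append((path, -neg_count))   # guaranteed next-best answer
        for q, child in n.children.items():  # expand before deciding more
            heapq.heappush(heap, (-child.count, path + (q,), child))
    return results

print(forward_search(root, ("q1", "q2"), 2))
# [(('q3',), 4), (('q5',), 3)]
```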
Distributed index structure
A search log may contain billions of query sessions. The resulting suffix tree would be gigantic and could not be held in the main memory, or even on the disk, of one machine. A distributed suffix tree construction method under the MapReduce programming model is therefore used:
Step 1. Suffix counting
Step 2. Suffix partitioning
Step 3. Suffix subtree construction
Step 1. Suffix counting
Compute all suffixes and the corresponding frequencies. At the beginning, the whole data set is stored distributively in the cluster; each compute node possesses a subset of the data.
Map: for each query session s, the computer emits an intermediate key-value pair (s', 1) for every suffix s' of s, where the value 1 is the contribution of s to the frequency of suffix s'.
Reduce: all intermediate key-value pairs having suffix s' as the key are processed on the same computer. The computer simply outputs a final pair (s', freq(s')), where freq(s') is the number of intermediate pairs carrying key s'.
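In the `run_job` simulator from earlier, this step can be sketched as (illustrative names):

```python
def map_suffixes(session_id, session):
    """Map: one (suffix, 1) pair per suffix of the session."""
    for i in range(len(session)):
        yield (tuple(session[i:]), 1)

def reduce_suffixes(suffix, ones):
    """Reduce: freq(s') = number of pairs carrying key s'."""
    yield (suffix, len(ones))

out = run_job({"s2": ["q1", "q2", "q4", "q5"]},
              map_suffixes, reduce_suffixes)
print(out)
# {('q1','q2','q4','q5'): 1, ('q2','q4','q5'): 1, ('q4','q5'): 1, ('q5',): 1}
```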
Step 1. Suffix counting (example)
Input query sequences, distributed over 4 mapper nodes:
Mapper 1: s1 <q1q2q3q4>, s2 <q1q2q4q5>
Mapper 2: s3 <q6q1q2q5>, s4 <q1q2q3q4>
Mapper 3: s5 <q6q1q2q5>, s6 <q1q2q3q5>
Mapper 4: s7 <q1q2q3q6>, s8 <q6q1q2q5>
The mappers process their input and emit <suffix, 1> pairs in parallel. E.g., Mapper 1 emits <q1q2q3q4> 1, <q2q3q4> 1, <q3q4> 1, <q4> 1, <q1q2q4q5> 1, <q2q4q5> 1, <q4q5> 1, <q5> 1; the other mappers emit the suffixes of their own sessions analogously.
[Figure: the counted suffixes are merged into the global suffix tree, which is partitioned into subtrees.]
Then a new set of suffixes (from newly arrived sessions) is assigned to the index servers according to the subtrees that should host them, e.g.:
Server 1: <q1q3q2q4> 1, <q1q3> 1, <q1q3q2q5> 1, <q4> 1, <q4q3q2q5> 1
Server 2: <q2q4> 1, <q2q5> 2, <q5> 2
Server 3: <q3> 1, <q3q2q4> 1, <q3q2q5> 2
Incremental update
[Figure: the suffix tree is distributed over three index servers; server 1 hosts the subtrees rooted at q1 and q4, server 2 those at q2 and q5, server 3 those at q3 and q6. After merging the new suffix counts, the node counters are updated, e.g., the count at q1 grows to 11.]
A forward search is then answered by querying the servers and combining their partial results.
Query input: <q1>, k = 5.
Result from server 1a: q2 (8), q2q3 (4), q2q5 (3), q2q3q4 (2), q2q4 (1).
Result from server 1b: q3 (3), q3q2 (2), q3q2q4 (1), q3q2q5 (1).
(The other servers report no match.)
Combined top-5: q2 (8), q2q3 (4), q3 (3), q2q5 (3), q3q2 (2).
Follow up topics
How can we support backward search and session retrieval operations?
The construction of a suffix tree for a string S can take time and space linear in the length of S. Describe how to update the suffix tree construction above to optimize the construction time complexity and storage space complexity.
Try to go through the 3 steps of distributed suffix tree construction with s12 = <q1q2q1q2>. Explain the problem that arises and describe a technique to resolve it.
Section 4. Breadth-first search (MR)
A standard algorithm on graphs.
Acknowledgement – Matei Zaharia, Electrical Engineering and Computer Sciences, University of California, Berkeley
Shortest path
A common graph search application is finding the shortest path from a start node to one or more target nodes. This is commonly done on a single machine with Dijkstra's algorithm. Can we use BFS to find shortest paths in parallel over a cluster?
[Figure: example directed graph with nodes n1-n5 and edge weights n1→n2: 10, n1→n3: 5, n2→n3: 2, n2→n4: 1, n3→n2: 3, n3→n4: 9, n3→n5: 2, n4→n5: 4, n5→n1: 7, n5→n4: 6.]
Illustration of Dijkstra's algorithm
The algorithm starts by exploring the source node. In each iteration, it extends the search by exploring the one unvisited node that is currently closest to the source node. It terminates when all the nodes are explored (i.e., Q is empty).

Iteration       Q                d(n1)  d(n2)  d(n3)  d(n4)  d(n5)   u
initialization  n1 n2 n3 n4 n5   0      ∞      ∞      ∞      ∞       -
1               n2 n3 n4 n5      0      10     5      ∞      ∞       n1
2               n2 n4 n5         0      8      5      14     7       n3
3               n2 n4            0      8      5      13     7       n5
4               n4               0      8      5      9      7       n2
5               (empty)          0      8      5      9      7       n4
Dijkstra's algorithm
Sequential processing: based on maintaining a global priority queue Q of nodes, with priorities equal to their distances from the source node.
If we have a cluster of machines, can we find the shortest path in parallel using MapReduce?
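For reference, a compact sketch of the sequential algorithm over the example graph, with the adjacency dict carrying the edge weights from the illustration:

```python
import heapq

graph = {
    "n1": {"n2": 10, "n3": 5},
    "n2": {"n3": 2, "n4": 1},
    "n3": {"n2": 3, "n4": 9, "n5": 2},
    "n4": {"n5": 4},
    "n5": {"n1": 7, "n4": 6},
}

def dijkstra(graph, source):
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    pq = [(0, source)]                 # the global priority queue Q
    while pq:
        d, u = heapq.heappop(pq)       # closest unexplored node
        if d > dist[u]:
            continue                   # stale queue entry, skip
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(pq, (dist[v], v))
    return dist

print(dijkstra(graph, "n1"))
# {'n1': 0, 'n2': 8, 'n3': 5, 'n4': 9, 'n5': 7}
```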
Parallel BFS
Idea: while Dijkstra's algorithm only explores the graph by extending one node at each step, can we try to explore multiple paths in parallel in MapReduce?
[Figure: each MapReduce round expands the search frontier by one more hop from the source n1, eventually covering n2-n10: the 1st MapReduce round reaches n1's neighbors, the 2nd round their neighbors, and so on.]
Parallel BFS: data representation
Each node is represented by a record (nodeID, adjacency list, current shortest distance d, frontier flag f). We assume that the adjacency list also contains the edge weights (not shown here):
n1: [n2, n3], d = 0, frontier
n2: [n3, n4], d = ∞
n3: [n2, n4, n5], d = ∞
n4: [n5], d = ∞
n5: [n1, n4], d = ∞
1st iteration
In each MapReduce round, we process the frontier node(s). The search frontier is n1 in the 1st iteration. The map task for n1:
1. We found a path from n1 to n2 with cost 10, so we emit (n2, 10) to tell n2 that there is such a path.
2. We found a path from n1 to n3 with cost 5, so we emit (n3, 5) to tell n3 that there is such a path.
1st iteration: a problem
Problem: we lose the adjacency lists and distance counters after one MapReduce iteration. How can we make the outputs of the reduce tasks be the inputs of the mappers in the next iteration?
1st iteration: the solution
Solution: each map task also emits the node's own record (adjacency list and distance counter), so the graph structure is carried along. All nodes are processed in parallel: every map task re-emits its node's record, and the frontier node n1 additionally emits (n2, 10) and (n3, 5). The shuffle phase routes each node's record together with any newly discovered distances to the same reduce task.
1st iteration: reduce
Each reduce task receives the node record plus all the path costs discovered for a particular node. It simply takes the minimum and updates the counter:
n1: [n2, n3], d = 0
n2: [n3, n4], d = 10
n3: [n2, n4, n5], d = 5
n4: [n5], d = ∞
n5: [n1, n4], d = ∞
1st iteration: the new frontier
A node becomes the search frontier in the next iteration if its shortest distance changed (e.g., from ∞ to a value):
n1: d = 0 (-), n2: d = 10 (f), n3: d = 5 (f), n4: d = ∞, n5: d = ∞
2nd iteration
The outputs of the 1st iteration now become the inputs of the 2nd iteration. The frontier nodes n2 (d = 10) and n3 (d = 5) are expanded: n2 emits (n3, 12) and (n4, 11); n3 emits (n2, 8), (n4, 14) and (n5, 7).
2nd iteration: reduce
We mark a node's state as "f" if the cost of its shortest path changed in this round (it joins the next frontier), and "-" if it is unchanged. After the reduce tasks take the minima:
n1: d = 0 (-), n2: d = 8 (f), n3: d = 5 (-), n4: d = 11 (f), n5: d = 7 (f)
3rd iteration
The frontier nodes n2 (d = 8), n4 (d = 11) and n5 (d = 7) are expanded: n2 emits (n3, 10) and (n4, 9); n4 emits (n5, 15); n5 emits (n1, 14) and (n4, 13). After the reduce phase:
n1: d = 0 (-), n2: d = 8 (-), n3: d = 5 (-), n4: d = 9 (f), n5: d = 7 (-)
4th iteration
The frontier node n4 (d = 9) emits (n5, 13), which does not improve n5's distance. The algorithm terminates when all nodes are visited and there are no more changes to the shortest distances:
n1: d = 0 (-), n2: d = 8 (-), n3: d = 5 (-), n4: d = 9 (-), n5: d = 7 (-)
MapReduce version of BFS
In the pseudocode (the figure did not survive; a sketch follows), d is the current shortest distance to node n.
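Here is a hedged sketch of one iteration in the `run_job` simulator from earlier. Each record is (node, (adjacency-with-weights, d, frontier flag)): the mapper re-emits the graph structure and expands frontier nodes, and the reducer keeps the minimum distance.

```python
INF = float("inf")

def map_bfs(node, record):
    adjacency, d, frontier = record
    yield (node, ("structure", adjacency, d))      # preserve the graph
    if frontier:                                   # expand frontier nodes
        for neighbor, w in adjacency.items():
            yield (neighbor, ("distance", d + w))  # newly discovered path

def reduce_bfs(node, values):
    adjacency, d_old, best = {}, INF, INF
    for v in values:
        if v[0] == "structure":
            _, adjacency, d_old = v
        else:
            best = min(best, v[1])
    d_new = min(d_old, best)
    frontier = d_new < d_old          # distance improved this round
    yield (node, (adjacency, d_new, frontier))

state = {
    "n1": ({"n2": 10, "n3": 5}, 0, True),
    "n2": ({"n3": 2, "n4": 1}, INF, False),
    "n3": ({"n2": 3, "n4": 9, "n5": 2}, INF, False),
    "n4": ({"n5": 4}, INF, False),
    "n5": ({"n1": 7, "n4": 6}, INF, False),
}
for _ in range(4):                    # 4 rounds suffice for this graph
    state = run_job(state, map_bfs, reduce_bfs)
print({n: rec[1] for n, rec in state.items()})
# {'n1': 0, 'n2': 8, 'n3': 5, 'n4': 9, 'n5': 7}
```

A real driver would repeat the job until no frontier flag remains (e.g., checked via a Hadoop counter) rather than running a fixed number of rounds.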
Section 5
PageRank
Utilizing the structural information of the web graph to rank pages
PageRank
A page is more popular if...
More other pages have a hyperlink pointing to it.
It is linked by another popular page: a link from a popular page (e.g., www.cnn.com) should have a higher impact than a link from an unpopular page.
[Figure: links from popular pages make the linked page more popular.]
PageRank
If a user starts at a random web page and surfs by randomly entering new URLs and clicking links, what is the probability that s/he will arrive at a given page? The PageRank of a page captures this notion: more "popular" or "worthwhile" pages get a higher rank.
PageRank of a page n:

pr_n = α · (1/N) + (1 − α) · Σ_{m ∈ L(n)} pr_m / C(m)

The first term models randomly entering new URLs; the second models clicking links.
PageRank
Randomly entering new URLs: the user types a URL in the browser and jumps to a page directly. Assuming a uniform distribution, the probability that a user jumps to a particular page among N pages is 1/N.
PageRank
[Figure: a page m with several out-links, one of which points to page n.] In the formula, L(n) is the set of pages m that link to n, and C(m) is the number of out-links of page m: a user at m follows each of its C(m) links with equal probability, so m contributes pr_m / C(m) to each page it links to.
PageRank
The random jump factor α models the probability that a user arrives at a page n by typing a URL in the browser; alternatively, (1 − α) is referred to as the "damping" factor. A common value of α is 0.15, meaning a 15% chance that a visitor arrives at a page by typing a URL in a browser, and 85% by clicking a hyperlink.
Illustration (α = 0, 1st iteration)
To compute PageRank, we first allocate equal PageRank to each node, so that the sum of the PageRanks over all nodes equals 1: n1-n5 each start with 0.2.
Illustration (α = 0, 1st iteration)
As we need pr_m / C(m) in the PageRank formula, each node m calculates pr_m / C(m) and sends it to its neighboring nodes. E.g., for n1, pr_{n1} / C(n1) = 0.2 / 2 = 0.1.
Illustration (α = 0, 1st iteration)
Since the random jump factor is 0, we only need to sum up all the received pr_m / C(m) values to get pr_n. E.g., n2 receives pr_{n1}/C(n1) = 0.1 and pr_{n5}/C(n5) = 0.066, so pr_{n2} = 0.1 + 0.066 = 0.166.
After the 1st iteration: n1 (0.066), n2 (0.166), n3 (0.166), n4 (0.3), n5 (0.3).
After the 2nd iteration: n1 (0.1), n2 (0.133), n3 (0.183), n4 (0.199), n5 (0.383).
PageRank
PageRank is a global score, independent of the search query. PageRank can be used to raise the weight of important pages:

w_{t,d} = tf.idf(t,d) · pr_d

This can be directly incorporated in the inverted index and in the evaluation of top-k keyword search queries.
MapReduce PageRank
[Figure: one PageRank iteration in MapReduce. Each map task reads a node's record, e.g., n1 with adjacency list [n2, n4] and P(n1) = 0.2, and emits (n2, 0.1) and (n4, 0.1) together with the adjacency list itself. Shuffle and sort groups the contributions by destination node. Each reduce task sums the contributions it receives: n5 receives 0.1 and 0.2 and outputs P(n5) = 0.3; n1 receives 0.066 and outputs P(n1) = 0.066.]
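A sketch of one such iteration with α = 0, again in the `run_job` simulator; records are (node, (out-links, rank)).

```python
def map_pagerank(node, record):
    out_links, rank = record
    yield (node, ("links", out_links))        # preserve the graph structure
    for m in out_links:
        yield (m, ("mass", rank / len(out_links)))  # pr_m / C(m) per link

def reduce_pagerank(node, values):
    out_links, rank = [], 0.0
    for v in values:
        if v[0] == "links":
            out_links = v[1]
        else:
            rank += v[1]       # sum of incoming pr/C contributions
    # With alpha > 0 this would be alpha/N + (1 - alpha) * rank.
    yield (node, (out_links, rank))

graph = {
    "n1": (["n2", "n4"], 0.2), "n2": (["n3", "n5"], 0.2),
    "n3": (["n4"], 0.2), "n4": (["n5"], 0.2),
    "n5": (["n1", "n2", "n3"], 0.2),
}
state = run_job(graph, map_pagerank, reduce_pagerank)
print({n: round(r, 3) for n, (_, r) in state.items()})
# {'n1': 0.067, 'n2': 0.167, 'n3': 0.167, 'n4': 0.3, 'n5': 0.3}
# (the slides truncate 0.0667 to 0.066 and 0.1667 to 0.166)
```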
Follow up topics
What is the dangling node problem in PageRank?
Section 6. HITS: Hubs and Authorities
Hubs and Authorities
PageRank takes a simplistic view of a network: a single query-independent score per page.
Why not just apply HITS on the whole webgraph? Applying HITS on the whole webgraph would be too costly. Instead, HITS is run on a query-specific subgraph: a root set R of pages matching the query, expanded to a set S by following links. The set S is a much more refined set of relevant pages and contains most (or many) of the strongest authorities.
Iterative algorithm
Initialization step: all nodes have 1 as their initial hub and authority values.
Then, in each iteration (the pseudocode figure is omitted; see the sketch below): update every node's authority score to the sum of the hub scores of the nodes pointing to it; update every node's hub score to the sum of the new authority scores of the nodes it points to; then normalize the hub vector and the authority vector. Repeat until the scores converge (end while).
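A sketch of that loop on the 5-node example; the edge list is hypothetical, reconstructed to be consistent with the scores in the illustrations that follow.

```python
import math

edges = [("n1","n3"), ("n2","n1"), ("n2","n3"), ("n2","n5"),
         ("n3","n4"), ("n4","n1"), ("n5","n3")]   # reconstructed, hypothetical
nodes = ["n1", "n2", "n3", "n4", "n5"]

def hits(edges, nodes, iterations):
    hub = {n: 1.0 for n in nodes}     # initialization: all scores 1
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages pointing to n.
        auth = {n: sum(hub[q] for q, p in edges if p == n) for n in nodes}
        # Hub update: sum of (new) authority scores of pages n points to.
        hub = {n: sum(auth[p] for q, p in edges if q == n) for n in nodes}
        # Normalize both vectors to unit length.
        for vec in (hub, auth):
            norm = math.sqrt(sum(v * v for v in vec.values()))
            for n in vec:
                vec[n] /= norm
    return hub, auth

hub, auth = hits(edges, nodes, iterations=1)
print({n: round(hub[n], 3) for n in nodes})   # n2 is the best hub (0.781)
print({n: round(auth[n], 3) for n in nodes})  # n3 is the best authority (0.775)
```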
Illustration (1st iteration)
All nodes start with hub = 1 and authority = 1. The 1st step updates the authority scores (e.g., n1 gets A:2 and n3 gets A:3); the 2nd step updates the hub scores. Although n1 only points to one other node, that node (n3) is a good authority node, making n1 a reasonably good hub (with a relatively high hub score):

              n1   n2   n3   n4   n5    sqrt(sum of squares)
Hubs          3    6    1    2    3     sqrt(3²+6²+1²+2²+3²) = sqrt(59)
Authorities   2    0    3    1    1     sqrt(2²+0²+3²+1²+1²) = sqrt(15)
Illustration (1st iteration)
3rd step: normalization. To prepare the hub and authority scores for the next iteration, we normalize the hub vector and the authority vector to unit length:

                            n1      n2      n3      n4      n5
Hubs (unit length)          0.391   0.781   0.130   0.260   0.391
Authorities (unit length)   0.516   0       0.775   0.258   0.258
Exercise
Exercise: try to simulate the computation and see if you get the hub and authority scores correctly.
2nd iteration:  n1 H:0.374 A:0.511;  n2 H:0.811 A:0;  n3 H:0.031 A:0.767;  n4 H:0.249 A:0.064;  n5 H:0.374 A:0.383
3rd iteration:  n1 H:0.370 A:0.517;  n2 H:0.814 A:0;  n3 H:0.007 A:0.760;  n4 H:0.252 A:0.015;  n5 H:0.370 A:0.395
Follow up topics
How would you compare HITS and PageRank?
A Comparative Study of HITS vs PageRank Algorithms for Twitter Users Analysis by Ong Kok Chien et al.
HITS on the Web: How does it Compare? by Marc Najork et al.
Authority Rankings from HITS, PageRank, and SALSA: Existence, Uniqueness, and Effect of Initialization by Ayman Farahat et al.