
Chapter 9
Processing big data with MapReduce

COMP3278 Introduction to Database Management Systems
Department of Computer Science, The University of Hong Kong
Slides prepared by Dr. Chui Chun Kit for students in COMP3278.
For other uses, please email: [email protected]
Section 1

MapReduce
What is MapReduce?
MapReduce is a programming model and software
framework used for processing and generating large
datasets. It is designed to handle big data and is
widely used in distributed computing environments.
[Diagram: a client interacts with the HDFS NameNode, which manages a row of DataNodes, each storing data on its own local disk.]
MapReduce design goals
Scalable to large data volumes.
Can run your applications in parallel over 1000s of machines, 10,000s of disks.
Elastic - compute nodes can be added to or removed from the framework easily.
Cost-efficiency.
Uses commodity machines and networks (cheap, but unreliable).
Automatic fault tolerance (fewer administrators).
Easy to program (fewer programmers).
Hadoop
The Apache Hadoop software library is a framework
that allows for the distributed processing of large
data sets across clusters of computers using a simple
programming model.
Hadoop Distributed File System (HDFS)
For storing data of your application over a cluster of
machines.

Hadoop MapReduce framework (Hadoop MR)
For executing application programs designed using "map" and "reduce" functions.
HDFS
Data is stored in DataNodes.
A single NameNode manages the DataNodes. (E.g., the NameNode knows where a file is, and in which DataNode.)
A file can be broken into blocks and replicated over a number of DataNodes to increase reliability and efficiency.
[Diagram: a file split into blocks 1-4, with each block replicated across several DataNodes.]
Hadoop MR
Data type: (key, value) pairs.

Map function
(K_in, V_in) ➔ list(K_intermediate, V_intermediate)

Reduce function
(K_intermediate, list(V_intermediate)) ➔ list(K_out, V_out)
Example 1 - WordCount
Motivating wordCount application: Given a set of
documents, count the number of occurrences of
each word in the documents.

It is easy☺! I can write a simple program that scans and counts the words in the input documents. I can run the program on my computer!

Remember, in MapReduce applications we are handling a massive amount of data over a cluster of machines. What if the data size is in the petabyte scale?
Example 1 - WordCount
[Diagram: an HDFS datanode holds File 1 ("how are you …"); Mapper A reads it and emits (how, 1), (are, 1), (you, 1).]
The Mapper function
The input (key, value) pair is (fileID, fileContent).
Each mapper runs the mapper function on its inputs.
The mapper function simply scans the file once and emits a (word, 1) pair for each word encountered.
Input (key, value) = (fileID, fileContent); Output (key, value) = (word, 1).
Example 1 - WordCount
[Diagram continued: File 2 ("how many …") is read by another mapper, which emits (how, 1), (many, 1).]
Example 1 - WordCount
Map phase
[Diagram: Files 1-4 on HDFS datanodes feed Mappers A, B and C, which emit (how, 1), (are, 1), (you, 1), (how, 1), (many, 1), (are, 1), (you, 1), (happy, 1), (how, 1), (to, 1), (do, 1).]
All mapper nodes run the mapper function on their own input in parallel.
Example 1 - WordCount
Shuffle & sort phase
[Diagram: the (how, 1) pairs from all mappers are grouped into (how, <1,1,1>) and sent to Reducer A.]
The (key, value) pairs with the same keys are grouped, and their values become a list.
This is done automatically by the MR framework.
Example 1 - WordCount
[Diagram: after the shuffle & sort phase, the grouped pairs (how, <1,1,1>), (many, <1>), (happy, <1>) go to Reducer A, while (are, <1,1>), (you, <1,1>), (to, <1>), (do, <1>) go to Reducer B.]
Example 1 - WordCount
Reduce phase
The Reducer function
The (key, value) pairs emitted by the mappers with the same "key" will be processed by the same reducer.
Input (key, value) = (word, list(1,1,…,1)); Output (key, value) = (word, sum).
[Diagram: Reducer A turns (how, <1,1,1>) into (how, 3), (many, <1>) into (many, 1), and (happy, <1>) into (happy, 1).]
Example 1 - WordCount
The reducers run the reducer function in parallel.
The number of reducers can be different from the number of mappers.
[Diagram: Reducer A outputs (how, 3), (many, 1), (happy, 1); Reducer B outputs (are, 2), (you, 2), (to, 1), (do, 1).]
Example 1 - WordCount
Map function
Input - <fileID, fileContent> pairs.
Output - Emit <"how", 1>, <"are", 1> …etc. as intermediate <key, value> pairs.
Reduce function
Input - <"how", list(1,1,1)>.
Output - Emit <"how", 3>.
Hadoop users only need to program the Map and Reduce functions! Everything else is handled by Hadoop - it manages the entire cluster to finish the parallel execution! Nice ☺
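As a sketch only (not the actual Hadoop Java API): the whole WordCount pipeline can be simulated in a single Python process, where an in-memory dict stands in for the HDFS files and the shuffle & sort phase, and `run_mapreduce` is a hypothetical helper name.

```python
from collections import defaultdict

def map_wordcount(file_id, content):
    # Map: emit (word, 1) for every word in the file.
    for word in content.split():
        yield (word, 1)

def reduce_wordcount(word, counts):
    # Reduce: sum the list of 1s grouped under each word.
    yield (word, sum(counts))

def run_mapreduce(inputs, mapper, reducer):
    # Shuffle & sort: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in mapper(key, value):
            groups[k].append(v)
    # Reduce phase: one reduce call per intermediate key.
    out = {}
    for k in sorted(groups):
        for rk, rv in reducer(k, groups[k]):
            out[rk] = rv
    return out

files = {"file1": "how are you", "file2": "how many"}
print(run_mapreduce(files.items(), map_wordcount, reduce_wordcount))
# → {'are': 1, 'how': 2, 'many': 1, 'you': 1}
```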
MapReduce features
By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
Automatic division of a MapReduce job into a number of map tasks and reduce tasks.
Automatic placement of computation near data (data locality).
Automatic load balancing (distribute the load evenly over the nodes).
Automatic recovery from failures and stragglers.
The user focuses on the application, not on the complexities of distributed computing.
1. Map phase
A single master node controls job execution on multiple slave nodes.
Data locality - process local data to reduce transmitting data through the network.
Hadoop has a scheduler that starts a task on the node that holds a particular block of data (i.e., on its local drive) needed by the task.
Mappers save their outputs to local disk.
This allows recovery if a reducer crashes.
2. Shuffle and sort phase
Intermediate key-value pairs must be grouped by key.
Accomplished by a large distributed sort involving all the
nodes that executed the map tasks and all the nodes that
will execute the reduce tasks.

Accomplish synchronization.
The reduce computation cannot start until all mappers
have finished emitting (key, value) pairs and all
intermediate (key, value) pairs have been shuffled and
sorted.

3. Reduce phase
A reducer in MapReduce receives all values associated
with the same key at once.
Programmers can specify the number of reducers.

Reduce method is called once per intermediate key.


A reducer may execute more than one reduce method.
E.g., Reducer A processes the intermediate keys “How”,
“many” and “happy” (3 reduce methods called) in our
running example.
Reducers also write their outputs to disk.
Fault Tolerance
If a Map/Reduce task crashes…
Retry on another node, is it possible?
OK for a map task because it has no dependencies.
OK for a reduce task because map tasks output intermediate
key-value pairs on disk.

If a node crashes…
Re-launch its current tasks on other nodes.
Re-run any map tasks the crashed node previously ran because
the outputs were lost along with the crashed node.
Fault Tolerance
If a task is going slowly…
Speculative execution - Launch second copy of task on
another node.
Take the output of whichever copy finishes first, and
terminate the other copy.
Speculative execution is a very important strategy in large clusters.
Stragglers occur frequently due to failing hardware, bugs, etc.
A single straggler may noticeably slow down the overall performance. E.g., a single straggler map task can delay the start of the shuffle and sort phase.
Example 2 – Inverted Indexing
Inverted indexing: Nearly all search engines today rely on a data structure called an inverted index, which, given a keyword, provides access to the list of documents that contain the keyword.
Inverted index:
Keyword - List of document IDs
Computer - {1.html, 3.html, …etc.}
Apple - {2.html, 3.html, …etc.}
Why is an inverted index useful? E.g., to find which document(s) contain the keywords "Apple Computer", we simply join the two inverted lists to find the result (i.e., 3.html).
It is called "inverted" because data usually exists in a "one document, many keywords" format. An inverted list is "one keyword, many document IDs", and therefore it is called "inverted".
Example 2 – Inverted Indexing
Problem: Massive input data - e.g., the indexed web contained at least 3.93 billion web pages on 18 April 2023. (http://www.worldwidewebsize.com/)
Solution: It is impossible to store the documents and build the inverted index using one computer. We need a cluster of machines to store the documents and compute the inverted index in parallel. MapReduce!
We only need to provide the Mapper function and the Reducer function if we use the MapReduce framework.
What are (1) the input/output key-value pairs and (2) the logic in the two functions?
Example 2 – Inverted Indexing
Mapper
Input - <docID, docContent> pairs.
Output – Emit <keyword, docID> as intermediate key-value pairs, emitting only once for each keyword appearing in the document docID.
Reducer
Input - <keyword, list(docID1, docID2, docID3, …etc )> pairs.
Output - <keyword, sorted list of docIDs> pairs.
Think about it …
How will you change the above Mapper and Reducer functions if we want to
include in the inverted index the occurrence frequency of each keyword in
each document?
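A minimal single-process sketch of this Mapper/Reducer pair (assuming whitespace tokenization and a hand-rolled in-memory shuffle; the helper names are made up for illustration):

```python
from collections import defaultdict

def map_invert(doc_id, content):
    # Emit (keyword, docID) once per distinct keyword in the document.
    for word in set(content.split()):
        yield (word, doc_id)

def reduce_invert(word, doc_ids):
    # The grouped value list holds every docID containing the keyword.
    yield (word, sorted(doc_ids))

docs = {"1.html": "apple computer", "2.html": "apple", "3.html": "apple computer"}
groups = defaultdict(list)          # stands in for shuffle & sort
for d, text in docs.items():
    for k, v in map_invert(d, text):
        groups[k].append(v)
index = dict(r for k in groups for r in reduce_invert(k, groups[k]))
print(index["computer"])  # → ['1.html', '3.html']
```

For the "think about it" question, the mapper could emit <keyword, (docID, count)> instead, so the reducer keeps the per-document frequency alongside each docID.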
Combiner and Partitioner
Besides the Map and Reduce functions, MapReduce allows programmers to define a Combiner and a Partitioner for optimization purposes.
Combiner - allows mappers to perform local aggregation (after the mapper) before the shuffle and sort phase.
Partitioner - allows programmers to divide up the intermediate key space and assign (key, value) pairs to reducers.
Combiner
Local aggregation - combine some key-value pairs in each mapper to reduce the number of key-value pairs that pass into the shuffle and sort phase.
E.g., in the wordCount example, instead of emitting <"how", 1> twice if a document has two "how"s, we emit <"how", 2>.
This sounds like a trivial optimization. Why is it important, and what aspects do we have to be concerned about?
Again, we are handling massive data. If we want to adopt local aggregation, buffer space has to be allocated to store the intermediate word counts. This may create memory problems if the key space is very large!
Combiner
[Diagram: input (docID, docContent) pairs flow through four map tasks, each emitting (a, 1), (b, 1), (c, 1) pairs; a combine step after each map aggregates them locally (e.g., two (a, 1)s become (a, 2)); the shuffle & sort phase then groups values by key into (a, <2,2>), (b, <1,1,1,2>), (c, <2,1>); the reduce phase outputs (a, 4), (b, 5), (c, 3).]
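The combine step can be sketched as mapper-side aggregation in Python (a simplified stand-in for Hadoop's Combiner; the `Counter` here is the in-memory buffer whose size grows with the key space, which is the concern raised above):

```python
from collections import Counter

def map_with_combiner(doc_id, content):
    # Combiner behaviour: aggregate counts locally before emitting, so a
    # document with two "how"s emits ("how", 2) instead of ("how", 1) twice.
    local = Counter(content.split())   # buffer; grows with the key space
    for word, count in local.items():
        yield (word, count)

pairs = list(map_with_combiner("d1", "how are you how"))
print(sorted(pairs))  # → [('are', 1), ('how', 2), ('you', 1)]
```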
Partitioner
The simplest partitioner computes the hash value of the key and then takes the mod of that value with the number of reducers.
This assigns approximately the same number of keys to each reducer. (We want to balance the reducers' workload.)
Problem: the partitioner only considers the key and ignores the value.
Why is this a problem? Consider the wordCount example. Some words may occur more frequently than others. Therefore, even if we assign the same number of keys to each reducer, the reducers' workloads may still be imbalanced.
Partitioner
The partitioner allows the programmer to specify which reducer will be responsible for processing a particular key.
[Diagram: with the default partitioner (k mod 2), the reducer that receives the frequent keys is slower than the other; with a custom partitioner that assigns the heavy key "c" to one reducer and "a" & "b" to another, the workload is balanced and the job finishes faster. Final outputs: (a, 3), (b, 3), (c, 5).]
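Both strategies can be sketched as plain functions (the default hash-mod rule plus a hypothetical skew-aware rule that isolates the heavy key "c"; the function names are illustrative, not Hadoop's `Partitioner` API):

```python
def hash_partition(key, num_reducers):
    # Default rule: hash the key, then mod by the number of reducers.
    return hash(key) % num_reducers

def custom_partition(key, num_reducers):
    # Skew-aware rule for this example: send the frequent key "c"
    # to its own reducer, everything else to the other one.
    return 0 if key == "c" else 1

print({k: custom_partition(k, 2) for k in ["a", "b", "c"]})
# → {'a': 1, 'b': 1, 'c': 0}
```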
We are going to learn…
Section 2. TF/IDF and MapReduce
Section 3. Search log analysis with MapReduce
Section 4. Breadth-first search and MapReduce
For interested students only:
Section 5. PageRank and MapReduce
Section 6. Hubs and Authorities (HITS) and MapReduce
Section 2

tf.idf (MR)
Term frequency – inverse document frequency

Acknowledgement – Matei Zaharia, Electrical Engineering and Computer Sciences University of California, Berkeley
tf.idf(t,d)
tf.idf(t,d), short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a term t (keyword) is to a document d in a collection or corpus.
[Example corpus: Doc 1 "Apple pie ingredient: Apple, …"; Doc 2 "Apple iphone 6 and ipad …"; Doc 3 "An apple a day keeps the doctor away …"; Doc 4 "Apple Daily is a Hong Kong-based tabloid-style …".]
Looking only at the occurrence frequency of a term in a document: "apple" occurs many times in Doc 1, so "apple" should be very important in Doc 1; "ingredient" occurs only once in Doc 1, so "ingredient" should not be important in Doc 1.
Considering how commonly a term appears across the documents: Wait! "apple" is a common word in the set of documents, so it is not an informative keyword for identifying a document! On the other hand, "ingredient" occurs in Doc 1 only, so it should be a specific keyword describing the content of Doc 1!
tf.idf(t,d)
The tf.idf(t,d) value increases proportionally with the number of times a term t appears in the document d, but is offset by the frequency of the term t in the collection. This helps to adjust for the fact that some words appearing more frequently in general may not be very informative for describing a document d.
Preprocessing 1. Tokenization
Separate text into tokens (words)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
d1 the jaguar is a new world mammal of the felidae family
d2 jaguar has designed four new engines
d3 for jaguar atari was keen to use a 68k family device
d4 the jacksonville jaguars are a professional us football team
d5 mac os x jaguar is available at a price of us $99 for apple’s new family pack
d6 one such ruling family to incorporate the jaguar into their name is jaguar paw
d7 it is a big cat

Challenges
Compound words: hostname, host-name and host name. Break them into two tokens or regroup them as one token? In any case, a lexicon and linguistic analysis are needed!
In some languages (Chinese, Japanese), words are not separated by whitespace.
Reference: Web Data Management and Distribution by Serge Abiteboul et al. http://webdam.inria.fr/textbook
Preprocessing 2. Stemming
Merge different forms of the same word, or of
closely related words, into a single stem.
Morphological stemming - Remove bound morphemes from
words, such as remove final -s, -’s, -ed, -ing, -er, -est.

Lexical stemming - Merge lexically related terms of various


parts of speech, such as policy, politics, political or politician .

Phonetic stemming - Merge phonetically related words: search


despite spelling errors, such as happiness and happyness
(e.g., The Soundex indexing system.
http://www.archives.gov/research/census/soundex.html)

Preprocessing 2. Stemming
Merge different forms of the same word, or of
closely related words, into a single stem.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
d1 the jaguar is a new world mammal of the felidae family
d2 jaguar has designed four new engines
d3 for jaguar atari was keen to use a 68k family device
d4 the jacksonville jaguars are a professional us football team
d5 mac os x jaguar is available at a price of us $99 for apple’s new family pack
d6 one such ruling family to incorporate the jaguar into their name is jaguar paw
d7 it is a big cat

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
d1 the jaguar be a new world mammal of the felidae family
d2 jaguar have design four new engine
d3 for jaguar atari be keen to use a 68k family device
d4 the jacksonville jaguar be a professional us football team
d5 mac os x jaguar be available at a price of us $99 for apple new family pack
d6 one such rule family to incorporate the jaguar into their name be jaguar paw
d7 it be a big cat

Preprocessing 3. Stop word removal
Remove uninformative words from documents, in
particular to lower the cost of storing the index
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
d1 jaguar new world mammal felidae family
d2 jaguar design four new engine
d3 jaguar atari keen 68k family device
d4 jacksonville jaguar professional us football team
d5 mac os x jaguar available price us $99 apple new family pack
d6 one such rule family incorporate jaguar their name jaguar paw
d7 big cat

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
d1 the jaguar be a new world mammal of the felidae family
d2 jaguar have design four new engine
d3 for jaguar atari be keen to use a 68k family device
d4 the jacksonville jaguar be a professional us football team
d5 mac os x jaguar be available at a price of us $99 for apple new family pack
d6 one such rule family to incorporate the jaguar into their name be jaguar paw
d7 it be a big cat

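The three preprocessing steps can be sketched in Python (a toy pipeline: whitespace tokenization only, stemming omitted since it needs a lexicon, and a hypothetical stop-word list chosen to match the slide's example):

```python
# Hypothetical stop-word list; real systems use curated per-language lists.
STOP_WORDS = {"the", "is", "a", "of", "for", "to", "it", "be", "at",
              "has", "was", "are", "into", "their"}

def preprocess(text):
    # 1. Tokenization: split on whitespace (no compound-word handling).
    tokens = text.lower().split()
    # 2. Stemming: omitted here (needs a morphological/lexical stemmer).
    # 3. Stop-word removal.
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The jaguar is a new world mammal of the felidae family"))
# → ['jaguar', 'new', 'world', 'mammal', 'felidae', 'family']
```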
Inverted index
After all preprocessing, construction of an inverted
index.
Index of all terms, with the list of documents where this term
occurs.
Small scale: disk storage, with memory mapping (cf. mmap)
techniques; secondary index for offset of each term in main
index.
Large scale: distributed over a cluster of machines; hashing assigns each term to the machine responsible for it.
Updating the index is costly, so only batch operations are performed (not one-by-one addition of term occurrences).
Inverted index
Phrase queries, NEAR operator: need to keep
positional information in the index.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
d1 jaguar new world mammal felidae family
d2 jaguar design four new engine
d3 jaguar atari keen 68k family device
d4 jacksonville jaguar professional us football team
d5 mac os x jaguar available price us $99 apple new family pack
d6 one such rule family incorporate jaguar their name jaguar paw
d7 big cat

Positional inverted lists (docID/position):
family: d1/11, d3/10, d5/16, d6/4
football: d4/8
jaguar: d1/2, d2/1, d3/2, d4/3, d5/4, d6/8,13
new: d1/5, d2/5, d5/15
rule: d6/3
us: d4/7, d5/11
world: d1/6
Query: find the documents that contain the keyword "jaguar" followed by "family" within 8 word spaces.
Documents that contain both keywords, with (jaguar, family) positions: d1 (2,11), d3 (2,10), d5 (4,16), d6 (8,4), d6 (13,4).
Keeping only the pairs where "jaguar" is followed by "family": d1 (2,11), d3 (2,10), d5 (4,16).
Within 8 word spaces: d3 (2,10).
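The positional query above can be sketched as a join over the positional lists (an in-memory toy index holding only the terms needed for the example; `near` is an illustrative helper, not a standard API):

```python
# Positional inverted index from the slide: term -> {docID: [positions]}.
index = {
    "jaguar": {"d1": [2], "d3": [2], "d5": [4], "d6": [8, 13]},
    "family": {"d1": [11], "d3": [10], "d5": [16], "d6": [4]},
}

def near(index, t1, t2, k):
    # Documents where some occurrence of t2 follows t1 within k positions.
    hits = []
    for d in index[t1].keys() & index[t2].keys():
        if any(0 < p2 - p1 <= k
               for p1 in index[t1][d] for p2 in index[t2][d]):
            hits.append(d)
    return sorted(hits)

print(near(index, "jaguar", "family", 8))  # → ['d3']
```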
Term frequency tf(t,d)
Insight: terms occurring frequently in a given document are more relevant.
tf(t,d) = n_{t,d} / Σ_{t'} n_{t',d}
The term frequency tf(t,d) is the number of occurrences of a term t in a document d (denoted by n_{t,d}), divided by the total number of terms in d (denoted by Σ_{t'} n_{t',d}).
E.g., tf("jaguar",d1) = 1/6 ; tf("jaguar",d6) = 2/10 ; tf("cat",d7) = 1/2.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
d1 jaguar new world mammal felidae family
d2 jaguar design four new engine
d3 jaguar atari keen 68k family device
d4 jacksonville jaguar professional us football team
d5 mac os x jaguar available price us $99 apple new family pack
d6 one such rule family incorporate jaguar their name jaguar paw
d7 big cat
Inverse document frequency idf(t)
How informative is a term for a document?
Insight: terms occurring rarely in the document collection as a whole are more informative.
idf(t) = lg ( |D| / |{d' ∈ D | n_{t,d'} > 0}| )
The inverse document frequency idf(t) of a term t is obtained by dividing the total number of documents |D| by the number of documents in which t occurs (denoted by |{d' ∈ D | n_{t,d'} > 0}|), and taking the logarithm.
idf("jaguar") = lg (7/6) = 0.22 ; idf("jacksonville") = lg (7/1) = 2.8.
The term "jacksonville" is more informative for identifying a matched document (say d4) than "jaguar", because "jaguar" is more common while "jacksonville" rarely appears (it is more specific).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
d1 jaguar new world mammal felidae family
d2 jaguar design four new engine
d3 jaguar atari keen 68k family device
d4 jacksonville jaguar professional us football team
d5 mac os x jaguar available price us $99 apple new family pack
d6 one such rule family incorporate jaguar their name jaguar paw
d7 big cat
tf.idf(t,d) and inverted indices
The inverted lists are extended by adding tf.idf(t,d) to each entry, sorted in descending order of tf.idf(t,d).
tf.idf(t,d) = tf(t,d) × idf(t) = ( n_{t,d} / Σ_{t'} n_{t',d} ) × lg ( |D| / |{d' ∈ D | n_{t,d'} > 0}| )
n_{t,d} is the number of occurrences of t in d ; D is the set of all documents.
tf("family", d1) = 1/6 ; idf("family") = lg (7/4) ; tf.idf("family", d1) = 1/6 × lg (7/4) = 0.13.
Extended inverted lists (docID/position/tf.idf):
family: d1/11/0.13, d3/10/0.13, d6/4/0.08, d5/16/0.07
football: d4/8/0.47
jaguar: d1/2/0.04, d2/1/0.04, d3/2/0.04, d4/3/0.04, d6/8,13/0.04, d5/4/0.02
new: d2/5/0.24, d1/5/0.20, d5/15/0.10
rule: d6/3/0.28
us: d4/7/0.30, d5/11/0.15
world: d1/6/0.47
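The values above can be checked numerically on the slide corpus (a direct, non-distributed Python sketch; "lg" in the slides is taken to be log base 2, which reproduces the numbers shown):

```python
from math import log2

# The preprocessed corpus from the slides (after stemming/stop-word removal).
docs = {
    "d1": "jaguar new world mammal felidae family".split(),
    "d2": "jaguar design four new engine".split(),
    "d3": "jaguar atari keen 68k family device".split(),
    "d4": "the jacksonville jaguar professional us football team".split()[1:],
    "d5": "mac os x jaguar available price us $99 apple new family pack".split(),
    "d6": "one such rule family incorporate jaguar their name jaguar paw".split(),
    "d7": "big cat".split(),
}

def tf(t, d):
    # Occurrences of t in d over the total number of terms in d.
    return docs[d].count(t) / len(docs[d])

def idf(t):
    # log2 of |D| over the number of documents containing t.
    df = sum(1 for words in docs.values() if t in words)
    return log2(len(docs) / df)

def tf_idf(t, d):
    return tf(t, d) * idf(t)

print(round(tf_idf("family", "d1"), 2))  # → 0.13
```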
tf.idf(t,d) and MapReduce
Can we use MapReduce to compute tf.idf(t,d) for all terms t in parallel and update the inverted lists?
tf.idf(t,d) = tf(t,d) × idf(t) = ( n_{t,d} / Σ_{t'} n_{t',d} ) × lg ( |D| / |{d' ∈ D | n_{t,d'} > 0}| )
Information we need:
n_{t,d} - the number of occurrences of term t in document d.
Σ_{t'} n_{t',d} - the total number of terms in document d.
|{d ∈ D | n_{t,d} > 0}| - the number of documents in which term t appears.
|D| - the total number of documents (global metadata).
Job 1. Compute n_{t,d} and Σ_{t'} n_{t',d}
Target: compute n_{t,d} for each term t - the number of occurrences of term t in document d.
Note that Σ_{t'} n_{t',d} is the number of words in d. This can be computed by scanning d once with a single word-counter variable. If d is too large to be scanned by one mapper, we can split d into parts and process them in different mappers in parallel; in that case we need one more round of MR tasks to aggregate the partial word counts and compute the final Σ_{t'} n_{t',d}.
Map tasks
Input: <d, content of d>
Output: < (t, d), (1, Σ_{t'} n_{t',d}) > for each occurrence of term t in d.
Reduce tasks
Input: < (t, d), [(1, Σ_{t'} n_{t',d}), …] >
n_{t,d} = count of the pairs received with key (t, d).
Output: < (t, d), (n_{t,d}, Σ_{t'} n_{t',d}) >
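Job 1 can be sketched in-process (a `defaultdict` plays the role of the shuffle; the whitespace splitting and helper names are assumptions for illustration):

```python
from collections import defaultdict

def job1_map(doc_id, words):
    # Emit one pair per word occurrence; the document's total word count
    # (computed in one scan) rides along with every pair.
    total = len(words)
    for t in words:
        yield ((t, doc_id), total)

def job1_reduce(key, totals):
    # n_td = number of pairs received for key (t, d).
    yield (key, (len(totals), totals[0]))

groups = defaultdict(list)          # stands in for shuffle & sort
d6 = "one such rule family incorporate jaguar their name jaguar paw".split()
for k, v in job1_map("d6", d6):
    groups[k].append(v)
out = dict(r for k in groups for r in job1_reduce(k, groups[k]))
print(out[("jaguar", "d6")])  # → (2, 10)
```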
Job 1. Compute nt,d and ∑t’ nt’,d
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
d1 jaguar new world mammal felidae family
d2 jaguar design four new engine
d3 jaguar atari keen 68k family device
d4 jacksonville jaguar professional us football team
d5 mac os x jaguar available price us $99 apple new family pack
d6 one such rule family incorporate jaguar their name jaguar paw
d7 big cat
Mapper Mapper Mapper
key Value key value key value key value
jaguar, d1 1,6 jaguar, d2 1,5 one, d6 1,10 jaguar, d6 1,10
new, d1 1,6 design, d2 1,5 … such, d6 1,10 their, d6 1,10 …
world , d1 1,6 four, d2 1,5 rule, d6 1,10 name, d6 1,10
mammal, d1 1,6 new, d2 1,5 family, d6 1,10 jaguar, d6 1,10
felidae, d1 1,6 engine, d2 1,5 incorporate, d6 1,10 paw, d6 1,10
family,d1 1,6
Shuffle and sort

key Value key value key value key value


family, d1 1,6 family, d3 1,6 family, d5 1,12 family, d6 1,10
rule, d6 1,10 football, d4 1,6 us, d4 1,6 us, d5 1,12 …
world,d1 1,6 jaguar, d1 1,6 jaguar, d2 1,5 jaguar, d3 1,6
jaguar, d4 1,6 jaguar, d5 1,12 jaguar, d6 1,10 new, d1 1,6
new, d2 1,5 new, d5 1,5 jaguar, d6 1,10 … …
… … … … … …
Job 1. Compute nt,d and ∑t’ nt’,d
key Value key Value key Value key Value
family, d1 1,6 family, d3 1,6 family, d5 1,12 family, d6 1,10
rule, d6 1,10 football, d4 1,6 us, d4 1,6 us, d5 1,12
world,d1 1,6 jaguar, d1 1,6 jaguar, d2 1,5 jaguar, d3 1,6
jaguar, d4 1,6 jaguar, d5 1,12 jaguar, d6 2,10 new, d1 1,6
new, d2 1,5 new, d5 1,5 … … … …
… … … …

The reduce tasks receives the <(t,d), ∑t’ nt’,d > pairs with the same (t,d).
It simply compute nt,d by counting the number of key-value pairs received.

Reducer Reducer Reducer Reducer

key Value key value key value key value


family, d1 1,6 family, d3 1,6 family, d5 1,12 family, d6 1,10
rule, d6 1,10 football, d4 1,6 us, d4 1,6 us, d5 1,12 …
world,d1 1,6 jaguar, d1 1,6 jaguar, d2 1,5 jaguar, d3 1,6 …
jaguar, d4 1,6 jaguar, d5 1,12 jaguar, d6 1,10 new, d1 1,6
new, d2 1,5 new, d5 1,5 jaguar, d6 1,10 … …
… … … … … …
Job 2. Compute |{d' ∈ D | n_{t,d'} > 0}|
Target: compute |{d ∈ D | n_{t,d} > 0}| for each term t - the number of documents in which term t appears.
Map tasks
Input: < (t, d), (n_{t,d}, Σ_{t'} n_{t',d}) >
Output: < t, (d, n_{t,d}, Σ_{t'} n_{t',d}) >
We make term t the key so that the term's occurrences in every document d (i.e., n_{t,d}) are grouped together and processed by the same reducer.
Reduce tasks
Input: < t, [(d1, n_{t,d1}, Σ_{t'} n_{t',d1}), (d2, n_{t,d2}, Σ_{t'} n_{t',d2}), …] >
|{d ∈ D | n_{t,d} > 0}| = count of the pairs received with key t.
Output: < (t, d), (n_{t,d}, Σ_{t'} n_{t',d}, |{d ∈ D | n_{t,d} > 0}|) >
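Job 2 can be sketched the same way (re-keying Job 1's output by term; the four "family" records below are taken from the running example, and the helper names are illustrative):

```python
from collections import defaultdict

def job2_map(key, value):
    # Re-key by term t so all documents containing t meet at one reducer.
    (t, d), (n_td, total) = key, value
    yield (t, (d, n_td, total))

def job2_reduce(t, entries):
    # Document frequency = number of (d, n_td, total) entries received.
    df = len(entries)
    for d, n_td, total in entries:
        yield ((t, d), (n_td, total, df))

job1_out = {("family", "d1"): (1, 6), ("family", "d3"): (1, 6),
            ("family", "d5"): (1, 12), ("family", "d6"): (1, 10)}
groups = defaultdict(list)          # stands in for shuffle & sort
for k, v in job1_out.items():
    for t, e in job2_map(k, v):
        groups[t].append(e)
out = dict(r for t in groups for r in job2_reduce(t, groups[t]))
print(out[("family", "d1")])  # → (1, 6, 4)
```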
Job 2. Compute |{ d’ ∈ D | nt,d’ > 0 } |
key Value key Value key Value key Value
family, d1 1,6 family, d3 1,6 family, d5 1,12 family, d6 1,10
rule, d6 1,10 football, d4 1,6 us, d4 1,6 us, d5 1,12
world,d1 1,6 jaguar, d1 1,6 jaguar, d2 1,5 jaguar, d3 1,6
jaguar, d4 1,6 jaguar, d5 1,12 jaguar, d6 2,10 new, d1 1,6
new, d2 1,5 new, d5 1,5 … … … …
… … … …
Mapper Mapper Mapper Mapper
key Value key Value key Value key Value
family d1,1,6 family d3, 1,6 family d5, 1,12 family d6, 1,10
rule d6, 1,10 football d4, 1,6 us d4, 1,6 us d5, 1,12

world d1, 1,6 jaguar d1, 1,6 jaguar d2, 1,5 jaguar d3, 1,6
jaguar d4, 1,6 jaguar d5, 1,12 jaguar d6, 2,10 new d1, 1,6
new d2, 1,5 new d5, 1,5 … … … …
… … … …
Shuffle and sort

key Value key value key value key value


family d1,1,6 football d4, 1,6 jaguar d1, 1,6 new d2, 1,5
family d3, 1,6 us d5, 1,12 jaguar d5, 1,12 new d5, 1,5
family d5, 1,12 us d4, 1,6 jaguar d2, 1,5 new d1, 1,6
family d6, 1,10 … … jaguar d6, 2,10 … …
rule d6, 1,10 jaguar d4, 1,6
… … jaguar d3, 1,6
world d1, 1,6
… …
Job 2. Compute |{ d’ ∈ D | nt,d’ > 0 } |
key Value key value key value key value
family, d1 1,6,4 football, d4 1,6,1 jaguar, d1 1,6,6 new, d2 1,5,3
family, d3 1,6,4 us, d5 1,12,2 jaguar, d5 1,12,6 new, d5 1,5,3
family, d5 1,12,4 us, d4 1,6,2 jaguar, d2 1,5,6 new, d1 1,6,3
family, d6 1,10,4 … … jaguar, d6 2,10,6 … …
rule, d6 1,10,1 jaguar, d4 1,6,6
… … jaguar, d3 1,6,6
world, d1 1,6,1
… …

The reduce tasks receive the <t, (d, n_{t,d}, Σ_{t'} n_{t',d})> pairs with the same t.
Each simply computes |{d ∈ D | n_{t,d} > 0}| by counting the number of key-value pairs received for key t.
Reducer Reducer Reducer Reducer

key Value key value key value key value


family d1,1,6 football d4, 1,6 jaguar d1, 1,6 new d2, 1,5
family d3, 1,6 us d5, 1,12 jaguar d5, 1,12 new d5, 1,5
family d5, 1,12 us d4, 1,6 jaguar d2, 1,5 new d1, 1,6
family d6, 1,10 … … jaguar d6, 2,10 … …
rule d6, 1,10 jaguar d4, 1,6
… … jaguar d3, 1,6
world d1, 1,6
… …
Job 3. Compute tf.idf(t,d)
Target: calculate the tf.idf(t,d) values.
Map tasks
Input: < (t, d), (n_{t,d}, Σ_{t'} n_{t',d}, |{d ∈ D | n_{t,d} > 0}|) >
Compute tf.idf(t,d), assuming that |D| is available as global metadata:
tf.idf(t,d) = tf(t,d) × idf(t) = ( n_{t,d} / Σ_{t'} n_{t',d} ) × lg ( |D| / |{d' ∈ D | n_{t,d'} > 0}| )
Output: < (t, d), tf.idf(t,d) >
Reduce tasks - none; this is a map-only job.
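Job 3 is then a map-only pass (a sketch assuming |D| = 7 is distributed to every mapper as global metadata and that lg means log base 2):

```python
from math import log2

D = 7  # |D|: total number of documents (global metadata)

def job3_map(key, value):
    # key = (t, d); value = (n_td, total_terms_in_d, doc_freq_of_t).
    (t, d), (n_td, total, df) = key, value
    tf_idf = (n_td / total) * log2(D / df)
    yield ((t, d), round(tf_idf, 2))

print(next(job3_map(("family", "d1"), (1, 6, 4))))
# → (('family', 'd1'), 0.13)
```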
Job 3. Compute tf.idf (t,d)
key Value key value key value key value
family, d1 1,6,4 football, d4 1,6,1 jaguar, d1 1,6,6 new, d2 1,5,3
family, d3 1,6,4 us, d5 1,12,2 jaguar, d5 1,12,6 new, d5 1,5,3
family, d5 1,12,4 us, d4 1,6,2 jaguar, d2 1,5,6 new, d1 1,6,3
family, d6 1,10,4 … … jaguar, d6 2,10,6 … …
rule, d6 1,10,1 jaguar, d4 1,6,6
… … jaguar, d3 1,6,6
world, d1 1,6,1
… …

Mapper Mapper Mapper Mapper


key Value key Value key Value key Value
family, d1 0.13 football, d4 0.47 jaguar, d1 0.04 new, d2 0.24
family, d3 0.13 us, d5 0.15 jaguar, d5 0.02 new, d5 0.10

family, d5 0.07 us, d4 0.30 jaguar, d2 0.04 new, d1 0.20
family, d6 0.08 … … jaguar, d6 0.04 … …
rule, d6 0.28 jaguar, d4 0.04
… … jaguar, d3 0.04
world, d1 0.47
… …

We may sort the entries within an inverted list by the tf.idf value. We may also update Jobs 1, 2 and 3 if we would like to store the positional information.
Answering query
To answer a boolean multi-keyword query such as ("jaguar" AND "new" AND NOT "family") OR "cat":
Retrieve the inverted lists of all keywords and apply set operations: AND (intersection) ; OR (union) ; NOT (set difference).
But there may still be many matched result documents, so we need to define a score for a document d given a query q:
score(q, d) = Σ_{t ∈ q} tf.idf(t, d)
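The scoring rule can be sketched with a few tf.idf entries from the running example (an in-memory dict stands in for the inverted lists; terms missing from a document contribute 0):

```python
# tf.idf values for the running example (from the extended inverted lists).
tf_idf = {
    ("family", "d1"): 0.13, ("new", "d1"): 0.20,
    ("new", "d2"): 0.24,
    ("family", "d3"): 0.13,
}

def score(query_terms, d):
    # score(q, d) = sum of tf.idf(t, d) over all terms t in q;
    # a missing entry means t does not occur in d, contributing 0.
    return sum(tf_idf.get((t, d), 0.0) for t in query_terms)

print(round(score(["family", "new"], "d1"), 2))  # → 0.33
```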
Top-k query
Top-k query: Given a query q and an integer k, find the top-k documents d with the highest score(q,d).
1. Let Result be the empty sorted list, and Bound = +∞.
2. For each 1 ≤ i ≤ n, where n is the number of terms in query q:
In the inverted list of term t_i, retrieve the next largest entry's tf.idf(t_i, d_x), and let d_current(i) be the doc ID that the entry refers to.
Compute the score of the document d_current(i) (i.e., score(q, d_current(i))). This requires searching the inverted lists of the other terms t_j, where 1 ≤ j ≤ n and j ≠ i, and retrieving the tf.idf values of the doc d_current(i) in each list.
If Result contains fewer than k documents, or if score(q, d_current(i)) is greater than that of the k-th document in Result, add d_current(i) to Result.
3. Update the pruning bound: Bound = Σ_{1 ≤ i ≤ n} tf.idf(t_i, d_current(i)).
4. If Result contains at least k documents, and the score of the k-th document in Result is greater than or equal to Bound, return Result.
5. Otherwise, redo step 2.
Threshold algorithm
Query: "family" OR "new"
family: d1/11/0.13, d3/10/0.13, d6/4/0.08, d5/16/0.07, d9/12/0.04, d7/14/0.02, d11/32/0.02, d10/29/0.01
new: d2/5/0.24, d1/5/0.20, d5/15/0.10, d9/16/0.09, d7/2/0.05, d11/6/0.04, d8/9/0.03, d13/24/0.03
First, initialize the bound to positive infinity, so no entry in Result will be larger than the bound (i.e., the algorithm will NOT return yet). Bound = +∞.
Result holds up to k entries (here k = 3) and keeps the entries sorted by score(q, d).
The first entry in the inverted list of "family" is d1, with tf.idf("family", d1) = 0.13.
We then look for tf.idf("new", d1) in the inverted list of "new", which is 0.20.
score(q, d1) = 0.13 + 0.20 = 0.33. Result: d1/0.33.
Threshold algorithm
The first entry in the inverted list of "new" is d2, with tf.idf("new", d2) = 0.24.
We then look for tf.idf("family", d2) in the inverted list of "family", which is 0.
Since q is "family" OR "new", score(q, d2) = 0.24. Result: d1/0.33, d2/0.24.
Note that if the query were "family" AND "new", then d2 would not be in the Result, as d2 does not contain both keywords.
Threshold algorithm
Then we update the bound to the sum of the maximum tf.idf values currently discovered in the inverted lists of "family" and "new" respectively: Bound = 0.13 + 0.24 = 0.37.
Since the bound is still larger than the top-k scores, we need to keep exploring. Result: d1/0.33, d2/0.24.
[Termination condition] The meaning of the bound is that we will not discover any other document d with score(q,d) larger than the bound. If the k-th entry in Result is already larger than the bound, we have already discovered the top-k documents and can return the Result.
The next entry in the inverted list of "family" is d3 with tf.idf("family", d3) = 0.13.
We then look for tf.idf("new", d3) in the inverted list of "new", which is 0.
Since q is "family" OR "new", score(q, d3) = 0.13.
The bound is updated to 0.13 + 0.24 = 0.37 (no change).

Bound: 0.37      Result: d1/0.33, d2/0.24, d3/0.13
The next entry in the inverted list of "new" is d1. d1 is already in Result, so score(q, d1) need not be recomputed.
The bound is updated to 0.13 + 0.20 = 0.33 (tighter).

Bound: 0.33      Result: d1/0.33, d2/0.24, d3/0.13
The next entry in the inverted list of "family" is d6 with tf.idf("family", d6) = 0.08.
The inverted list of "new" doesn't contain an entry for d6.
Since q is "family" OR "new", score(q, d6) = 0.08.
However, since the score of the 3rd entry in Result is 0.13, which is greater than 0.08, d6 will not be in the top-k answer.
The bound is updated to 0.08 + 0.20 = 0.28 (tighter).

Bound: 0.28      Result: d1/0.33, d2/0.24, d3/0.13 (d6/0.08 does not qualify)
The next entry in the inverted list of "new" is d5 with tf.idf("new", d5) = 0.10.
We then look for tf.idf("family", d5) in the inverted list of "family", which is 0.07.
Since q is "family" OR "new", score(q, d5) = 0.10 + 0.07 = 0.17, which is greater than score(q, d3) = 0.13, so we insert d5 into Result.
The bound is updated to 0.08 + 0.10 = 0.18 (tighter).

Bound: 0.18      Result: d1/0.33, d2/0.24, d5/0.17
The next entry in the inverted list of "family" is d5. d5 is already in Result and score(q, d5) need not be recomputed.
The bound is updated to 0.07 + 0.10 = 0.17 (tighter).

Bound: 0.17      Result: d1/0.33, d2/0.24, d5/0.17

[Terminate condition]
The bound means the maximum possible score of any other document is 0.17. Looking at the entries in Result (the 3rd entry is "d5/0.17"), we can confirm that they are the top-3 result.
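The walkthrough above can be sketched in Python. This is a minimal single-machine sketch of the threshold algorithm under OR semantics; the function name `threshold_topk` is mine, and the round-robin sorted-access order differs slightly from the slide's one-entry-at-a-time trace, but it produces the same top-3 on the running example.

```python
def threshold_topk(inverted_lists, k):
    # inverted_lists: term -> list of (doc, tf.idf), sorted by descending tf.idf.
    # Random-access view for looking a document up in the other lists:
    lookup = {t: dict(lst) for t, lst in inverted_lists.items()}
    terms = list(inverted_lists)
    pos = {t: 0 for t in terms}          # next sorted-access position per list
    scores = {}                          # doc -> score(q, doc) under OR semantics
    while True:
        bound = 0.0
        for t in terms:                  # one sorted access per list per round
            if pos[t] < len(inverted_lists[t]):
                doc, w = inverted_lists[t][pos[t]]
                pos[t] += 1
                bound += w               # bound = sum of last-seen tf.idf values
                if doc not in scores:    # random access into every other list
                    scores[doc] = sum(lookup[u].get(doc, 0.0) for u in terms)
        topk = sorted(scores.items(), key=lambda kv: -kv[1])[:k]
        exhausted = all(pos[t] >= len(inverted_lists[t]) for t in terms)
        if exhausted or (len(topk) == k and topk[-1][1] >= bound):
            return topk

# The running example from the slides, as (doc, tf.idf) per term:
lists = {
    "family": [("d1", 0.13), ("d3", 0.13), ("d6", 0.08), ("d5", 0.07),
               ("d9", 0.04), ("d7", 0.02), ("d11", 0.02), ("d10", 0.01)],
    "new":    [("d2", 0.24), ("d1", 0.20), ("d5", 0.10), ("d9", 0.09),
               ("d7", 0.05), ("d11", 0.04), ("d8", 0.03), ("d13", 0.03)],
}
top3 = threshold_topk(lists, 3)
```

Because the bound only ever tightens as we move down the lists, the algorithm can stop without scanning the lists to the end.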
Section 3

OLAP on
Search logs

References : OLAP on Search Logs: An Infrastructure Supporting Data-Driven Applications in Search Engines, by Bin Zhou et al.
Query suggestion

Auto complete (after typing "the university of"):
I know how to construct the interface of the auto-complete feature (with AJAX technology). But I don't know how to generate the list of options. How can Google generate meaningful options effectively and efficiently?

Query suggestion (after issuing the first query as "University of Hong Kong"):
Idea: By examining the queries frequently asked by users after / before the query "University of Hong Kong", a search engine can suggest queries such as "Chinese University of Hong Kong", which may improve users' search experience.
Query suggestion
In the query suggestion application, the search engine provides suggestions each time the user raises a query.
For example, when a user raises a query "Honda", the search
engine can apply a forward search function on s1 = <"Honda">
and get the top-k queries as the candidates for query suggestion.
Suppose the user raises a second query "Ford". The search engine can drill down using the forward search function to find out the top-k queries following the sequence s2 = <"Honda", "Ford"> as the candidates for query suggestion.
Search logs and sessions
Conceptually, a search log is a sequence of queries
and click events.
Since a search log often contains the information from multiple
users over a long period, we can divide a search log into
sessions.

  Date time            Session ID (or IP)   Query           …
  2023-08-03T00:12:13  132.165.2.2          Honda           …
  2023-08-03T00:12:14  217.25.64.32         Apple iphone6   …
  2023-08-03T00:13:21  132.165.2.2          Ford            …

Sessions extraction
Step 1. For each user, extract the queries by the user from the search log as a stream.
Step 2. Then segment each user's stream into sessions based on a widely adopted rule: two queries are split into two sessions if the time interval between them exceeds 30 minutes.
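The two extraction steps can be sketched as follows; `extract_sessions` and the sample timestamps are illustrative names and data, not part of the paper.

```python
from datetime import datetime, timedelta
from collections import defaultdict

def extract_sessions(log, gap=timedelta(minutes=30)):
    """Step 1: build one query stream per user; Step 2: split a stream into a
    new session whenever two consecutive queries are more than 30 minutes apart."""
    streams = defaultdict(list)
    for ts, user, query in sorted(log):          # one time-ordered stream per user
        streams[user].append((ts, query))
    sessions = []
    for user, stream in streams.items():
        current = [stream[0][1]]
        for (prev_ts, _), (ts, q) in zip(stream, stream[1:]):
            if ts - prev_ts > gap:               # the 30-minute rule
                sessions.append(current)
                current = []
            current.append(q)
        sessions.append(current)
    return sessions

t = datetime(2023, 8, 3, 0, 12, 13)
log = [(t, "132.165.2.2", "Honda"),
       (t + timedelta(seconds=1), "217.25.64.32", "Apple iphone6"),
       (t + timedelta(minutes=1, seconds=8), "132.165.2.2", "Ford"),
       (t + timedelta(minutes=45), "132.165.2.2", "Toyota")]
sessions = extract_sessions(log)
```

With the sample log above, "Honda" and "Ford" stay in one session while "Toyota", arriving more than 30 minutes later, starts a new one.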
Data model
A search log is a sequence of queries and click events.
E.g., q1=“HKU”, q2=“CUHK”, then if a user session s searches for
q1 first and q2 next, then s = < q1, q2 >
Session frequency(s) is the number of query sessions that are exactly the same as s.
  s = <q1q2>, session frequency(s) = 0
  s = <q1q2q3q4>, session frequency(s) = 2

Frequency(s) is the number of query sessions with s being a substring.
  s = <q1q2>, frequency(s) = 8

Sequence ID   Query sequences
s1            <q1q2q3q4>
s2            <q1q2q4q5>
s3            <q6q1q2q5>
s4            <q1q2q3q4>
s5            <q6q1q2q5>
s6            <q1q2q3q5>
s7            <q1q2q3q6>
s8            <q6q1q2q5>
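The two measures can be computed directly from the session table; this brute-force sketch (function names are mine) reproduces the counts above.

```python
def session_frequency(s, sessions):
    # Number of sessions exactly equal to s.
    return sum(1 for t in sessions if t == s)

def frequency(s, sessions):
    # Number of sessions that contain s as a contiguous substring.
    def contains(t, s):
        return any(t[i:i + len(s)] == s for i in range(len(t) - len(s) + 1))
    return sum(1 for t in sessions if contains(t, s))

sessions = [("q1","q2","q3","q4"), ("q1","q2","q4","q5"), ("q6","q1","q2","q5"),
            ("q1","q2","q3","q4"), ("q6","q1","q2","q5"), ("q1","q2","q3","q5"),
            ("q1","q2","q3","q6"), ("q6","q1","q2","q5")]
```

The suffix tree introduced next makes these substring counts available without scanning every session.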
1. Forward search
In a set of sessions, given a query sequence s and a
search result size k:
The forward search finds k sequences s1, … , sk such that < s, si >
( 1 ≤ i ≤ k) is among the top-k most frequent sequences that
have s as the prefix.
Input: s = <"Apple", "iPhone">, k = 3
Output: Top 3 frequent sequences of the form <"Apple", "iPhone", <any>>
  <"Apple", "iPhone", "iCloud">
  <"Apple", "iPhone", "Apple support">
  <"Apple", "iPhone", "Samsung Galaxy", "iPad air">
2. Backward search
In a set of sessions, given a query sequence s and a
search result size k:
The backward search finds k sequences s1, … , sk such that
<si , s> ( 1 ≤ i ≤ k) is among the top-k most frequent sequences
that have s as the suffix.
Input: s = <"The University of Hong Kong">, k = 3
Output: Top 3 frequent sequences of the form <<any>, "The University of Hong Kong">
  <"QS Ranking", "The University of Hong Kong">
  <"Exchange in Hong Kong", "HKU", "The University of Hong Kong">
  <"Dr. Sun Yat-sen", "The University of Hong Kong">
3. Session retrieval
In a set of sessions, given a query sequence s and a
search result size k:
The session retrieval finds the top-k query sessions s1, …, sk with the highest session frequencies that contain s.

Input: s = <"Frozen">, k = 2
Output: Top 2 frequent sessions that contain "Frozen"
  <"Disney movie", "Frozen">
  <"Frozen", "Let it go">
System framework
(The original slide shows the system architecture figure from the referenced paper: query sessions are indexed into a distributed suffix tree, and a master query server dispatches search requests to the index servers.)
Index structure

A suffix tree is built over the query sessions. Suffixes of s1 = <q1q2q3q4>:
  <q1q2q3q4>, <q2q3q4>, <q3q4>, <q4>
Each suffix is inserted into the tree from the root, and every node on the path keeps a count (all counts are 1 after inserting s1).

A suffix tree is very useful to check whether a sequence of queries is a substring of any sequence in the database.
Note: A query sequence is a substring of a sequence s if it is a prefix of a suffix of s.
The tree is built by inserting the suffixes of each remaining session in turn, incrementing the count of every node on the inserted path:

Suffixes of s2 = <q1q2q4q5>: <q1q2q4q5>, <q2q4q5>, <q4q5>, <q5>
Suffixes of s3 = <q6q1q2q5>: <q6q1q2q5>, <q1q2q5>, <q2q5>, <q5>
Suffixes of s4 = <q1q2q3q4>: <q1q2q3q4>, <q2q3q4>, <q3q4>, <q4>
Suffixes of s5 = <q6q1q2q5>: <q6q1q2q5>, <q1q2q5>, <q2q5>, <q5>
Suffixes of s6 = <q1q2q3q5>: <q1q2q3q5>, <q2q3q5>, <q3q5>, <q5>
Suffixes of s7 = <q1q2q3q6>: <q1q2q3q6>, <q2q3q6>, <q3q6>, <q6>
Suffixes of s8 = <q6q1q2q5>: <q6q1q2q5>, <q1q2q5>, <q2q5>, <q5>

(The tree diagrams on the original slides show the node counts after each insertion; after all eight sessions, the children of the root have counts q1/8, q2/8, q3/4, q4/3, q5/5 and q6/4.)
Question: What is the number of sequences that contain the pattern <q1>? How about the pattern <q1q2q3q4>?

Answer: Follow the pattern from the root and read the count at the node reached. There are 8 sequences that contain <q1>, and 2 sequences that contain <q1q2q3q4>.
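The build-and-count procedure above can be sketched as follows. The class and function names are mine; note that a node's count equals the pattern's frequency here only because no example session contains the same pattern twice (the follow-up slide's <q1q2q1q2> case would need de-duplication per session).

```python
class Node:
    def __init__(self):
        self.count = 0
        self.children = {}

def build_suffix_tree(sessions):
    # Insert every suffix of every session; every node on the inserted path
    # has its count incremented once per suffix.
    root = Node()
    for s in sessions:
        for i in range(len(s)):
            node = root
            for q in s[i:]:
                node = node.children.setdefault(q, Node())
                node.count += 1
    return root

def pattern_count(root, pattern):
    # A query sequence is a substring of s iff it is a prefix of a suffix of s,
    # so its frequency is simply the count at the node reached from the root.
    node = root
    for q in pattern:
        if q not in node.children:
            return 0
        node = node.children[q]
    return node.count

sessions = [("q1","q2","q3","q4"), ("q1","q2","q4","q5"), ("q6","q1","q2","q5"),
            ("q1","q2","q3","q4"), ("q6","q1","q2","q5"), ("q1","q2","q3","q5"),
            ("q1","q2","q3","q6"), ("q6","q1","q2","q5")]
root = build_suffix_tree(sessions)
```

On the example sessions, the counts reproduce the slide's tree: 8 for <q1>, 2 for <q1q2q3q4>.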
Forward search

Input: s = <q1q2>, k = 2

Question: How can we use the suffix tree to answer the forward search with s = <q1q2> and k = 2?

Answer: We keep a priority queue of nodes to explore. Locating the node for <q1q2> (count 8), its children become the initial candidates:

Candidate   q3   q5   q4
Frequency   4    3    1

Let's illustrate the idea of the algorithm in the next few slides.
Observation: a descendant node cannot have a frequency higher than that of any of its ancestor nodes. We can therefore guarantee that q3 is the top-1 answer: any candidate with prefix q5 will have frequency at most 3, which is smaller than q3's frequency of 4, so no other candidate can have frequency larger than q3.

Top-k answer   q3
Frequency      4
Explore the children of q3: since the frequency of a child node of q3 can be up to 4, q5 is not yet guaranteed to be the top-2 result. We therefore expand q3 and retrieve the frequencies of q3q4, q3q5 and q3q6.

Candidate   q5   q3q4   q4   q3q5   q3q6
Frequency   3    2      1    1      1

Top-k answer   q3
Frequency      4
q5 is now guaranteed to be the top-2: since q5 is still at the front of the priority queue, no other node with q1q2 as prefix can have a frequency larger than q5's, therefore q5 is guaranteed to be the top-2 result.

Candidate   q3q4   q4   q3q5   q3q6
Frequency   2      1    1      1

Top-k answer   q3   q5
Frequency      4    3
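The best-first expansion above can be sketched with a max-heap; the names `build_suffix_tree` and `forward_search` are mine. Because a child's count never exceeds its parent's, the candidate popped from the heap is always safe to confirm as the next answer.

```python
import heapq

class Node:
    def __init__(self):
        self.count = 0
        self.children = {}

def build_suffix_tree(sessions):
    root = Node()
    for s in sessions:
        for i in range(len(s)):
            node = root
            for q in s[i:]:
                node = node.children.setdefault(q, Node())
                node.count += 1
    return root

def forward_search(root, s, k):
    # Locate the node for the prefix s.
    node = root
    for q in s:
        if q not in node.children:
            return []
        node = node.children[q]
    # Best-first search over extensions of s, highest count first.
    heap = [(-c.count, (q,), c) for q, c in node.children.items()]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        neg, path, nd = heapq.heappop(heap)
        result.append((path, -neg))         # nothing left can beat this count
        for q, c in nd.children.items():    # its children may still make top-k
            heapq.heappush(heap, (-c.count, path + (q,), c))
    return result

sessions = [("q1","q2","q3","q4"), ("q1","q2","q4","q5"), ("q6","q1","q2","q5"),
            ("q1","q2","q3","q4"), ("q6","q1","q2","q5"), ("q1","q2","q3","q5"),
            ("q1","q2","q3","q6"), ("q6","q1","q2","q5")]
root = build_suffix_tree(sessions)
```

On the slides' input (s = <q1q2>, k = 2) this confirms q3 (frequency 4) and then q5 (frequency 3), expanding only q3 along the way.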
Distributed index structure
A search log may contain billions of query sessions.
The resulting suffix tree would be gigantic and could not be held in the main memory, or even on the disk, of a single machine.
A distributed suffix tree construction method
under the MapReduce programming model is used.
Step 1. Suffixes counting
Step 2. Suffixes partitioning
Step 3. Suffix subtree construction
85
Step 1. Suffixes counting
Compute all suffixes and the corresponding frequencies.
At the beginning, the whole data set is stored
distributively in the cluster.
Each compute node possesses a subset of data.
Map: For each query session s, the computer emits an
intermediate key-value pair (s’, 1) for every suffix s’ of s,
where the value 1 here is the contribution to frequency of
suffix s’ from s.
Reduce: All intermediate key-value pairs having suffix s’ as the key are processed on the same computer. The computer simply outputs a final pair (s’, freq(s’)), where freq(s’) is the number of intermediate pairs carrying key s’.
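The map and reduce functions above can be sketched as follows, with the shuffle simulated by grouping pairs by key in one process (function names are mine).

```python
from collections import defaultdict

def map_session(session):
    # Map: emit (s', 1) for every suffix s' of the session s.
    for i in range(len(session)):
        yield session[i:], 1

def reduce_suffixes(key, values):
    # Reduce: freq(s') = number of intermediate pairs carrying key s'.
    return key, sum(values)

def suffix_counting(sessions):
    groups = defaultdict(list)            # simulated shuffle-and-sort by key
    for s in sessions:
        for key, one in map_session(s):
            groups[key].append(one)
    return dict(reduce_suffixes(k, vs) for k, vs in groups.items())

sessions = [("q1","q2","q3","q4"), ("q1","q2","q4","q5"), ("q6","q1","q2","q5"),
            ("q1","q2","q3","q4"), ("q6","q1","q2","q5"), ("q1","q2","q3","q5"),
            ("q1","q2","q3","q6"), ("q6","q1","q2","q5")]
counts = suffix_counting(sessions)
```

On the eight example sessions this reproduces the reducer outputs on the following slides, e.g. freq(<q5>) = 5 and freq(<q1q2q5>) = 3.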
Step 1. Suffixes counting

Input query sequences, stored distributively in the cluster at the beginning (each compute node possesses a subset of the data):
  Node 1: s1 <q1q2q3q4>, s2 <q1q2q4q5>
  Node 2: s3 <q6q1q2q5>, s4 <q1q2q3q4>
  Node 3: s5 <q6q1q2q5>, s6 <q1q2q3q5>
  Node 4: s7 <q1q2q3q6>, s8 <q6q1q2q5>
Step 1. Suffixes counting
sID sequences sID sequences sID sequences sID sequences
Input query s1 <q1q2q3q4> s3 <q6q1q2q5> s5 <q6q1q2q5> s7 <q1q2q3q6>
sequences s2 <q1q2q4q5> s4 <q1q2q3q4> s6 <q1q2q3q5> s8 <q6q1q2q5>
Mapper Mapper Mapper Mapper
Key value Key value Key value Key value
<q1q2q3q4> 1 <q6q1q2q5> 1 <q6q1q2q5> 1 <q1q2q3q6> 1
Mappers process <q2q3q4> 1 <q1q2q5> 1 <q1q2q5> 1 <q2q3q6> 1
input and emit <q3q4> 1 <q2q5> 1 <q2q5> 1 <q3q6> 1
<q4> 1 <q5> 1 <q5> 1 <q6> 1
<k, v> pairs in <q1q2q4q5> 1 <q1q2q3q4> 1 <q1q2q3q5> 1 <q6q1q2q5> 1
parallel <q2q4q5> 1 <q2q3q4> 1 <q2q3q5> 1 <q1q2q5> 1
<q4q5> 1 <q3q4> 1 <q3q5> 1 <q2q5> 1
(Assume 4 mapper <q5> 1 <q4> 1 <q5> 1 <q5> 1
nodes)
Map: For each query session s, the computer emits an intermediate key-value pair (s’, 1) for every suffix s’ of s, where the value 1 is the contribution to the frequency of suffix s’ from s.
Step 1. Suffixes counting
sID sequences sID sequences sID sequences sID sequences
Input query s1 <q1q2q3q4> s3 <q6q1q2q5> s5 <q6q1q2q5> s7 <q1q2q3q6>
sequences s2 <q1q2q4q5> s4 <q1q2q3q4> s6 <q1q2q3q5> s8 <q6q1q2q5>
Mapper Mapper Mapper Mapper
Key value Key value Key value Key value
<q1q2q3q4> 1 <q6q1q2q5> 1 <q6q1q2q5> 1 <q1q2q3q6> 1
Mappers process <q2q3q4> 1 <q1q2q5> 1 <q1q2q5> 1 <q2q3q6> 1
input and emit <q3q4> 1 <q2q5> 1 <q2q5> 1 <q3q6> 1
<q4> 1 <q5> 1 <q5> 1 <q6> 1
<k, v> pairs in <q1q2q4q5> 1 <q1q2q3q4> 1 <q1q2q3q5> 1 <q6q1q2q5> 1
parallel <q2q4q5> 1 <q2q3q4> 1 <q2q3q5> 1 <q1q2q5> 1
<q4q5> 1 <q3q4> 1 <q3q5> 1 <q2q5> 1
(Assume 4 mapper <q5> 1 <q4> 1 <q5> 1 <q5> 1
nodes) Shuffle and sort (e.g., x mod 4) for the key qx

Key value Key value Key value Key value


<q1q2q3q4> 1 <q2q3q4> 1 <q3q4> 1 <q4> 1
<q1q2q3q4> 1 <q2q3q4> 1 <q3q4> 1 <q4> 1
<q1q2q4q5> 1 <q2q4q5> 1 <q3q5> 1 <q4q5> 1
<q1q2q5> 1 <q2q5> 1 <q3q6> 1
Intermediate <q1q2q5> 1 <q2q5> 1
<q1q2q5> 1 <q2q5> 1
<k,v > pairs <q1q2q3q5> 1 <q2q3q5> 1
grouped by <q1q2q3q6> 1 <q2q3q6> 1
<q5> 1 <q6q1q2q5> 1
key <q5> 1 <q6q1q2q5> 1
<q5> 1 <q6q1q2q5> 1
<q5> 1 <q6> 1
<q5> 1 89
Step 1. Suffixes counting

Reduce: All intermediate key-value pairs having suffix s’ as the key are processed on the same computer. The computer simply outputs a final pair (s’, freq(s’)), where freq(s’) is the number of intermediate pairs carrying key s’.

For example, the first reducer receives the grouped pairs shown above and writes these <k,v> pairs to HDFS (disk):
  <q1q2q3q4>   2
  <q1q2q4q5>   1
  <q1q2q5>     3
  <q1q2q3q5>   1
  <q1q2q3q6>   1
  <q5>         5
Step 1. Suffixes counting
Key value Key value Key value Key value

<k,v> pairs <q1q2q3q4> 2 <q2q3q4> 2 <q3q4> 2 <q4> 2


<q1q2q4q5> 1 <q2q4q5> 1 <q3q5> 1 <q4q5> 1
are written to <q1q2q5> 3 <q2q5> 3 <q3q6> 1
HDFS (disk) <q1q2q3q5> 1 <q2q3q5> 1
<q1q2q3q6> 1 <q2q3q6> 1
<q5> 5 <q6q1q2q5> 3
Reducers <q6> 1
process in
parallel Reducer Reducer Reducer Reducer
(Assume 4
reducer nodes) Key value Key value Key value Key value
<q1q2q3q4> 1 <q2q3q4> 1 <q3q4> 1 <q4> 1
<q1q2q3q4> 1 <q2q3q4> 1 <q3q4> 1 <q4> 1
<q1q2q4q5> 1 <q2q4q5> 1 <q3q5> 1 <q4q5> 1
<q1q2q5> 1 <q2q5> 1 <q3q6> 1
<q1q2q5> 1 <q2q5> 1
Input <q1q2q5> 1 <q2q5> 1
<q1q2q3q5> 1 <q2q3q5> 1
query <q1q2q3q6> 1 <q2q3q6> 1
sequences <q5> 1 <q6q1q2q5> 1
<q5> 1 <q6q1q2q5> 1
<q5> 1 <q6q1q2q5> 1
<q5> 1 <q6> 1
<q5> 1 91
Step 2. Suffixes partitioning
Partition the entire set of suffixes into several parts
such that each part can be held in the main memory
of one index server.
Problem: Given a set of suffix sequences (e.g., q1q2q3 and q1q2q4), how to estimate the size of the subtree created by those suffix sequences?

Solution: As we would like to reserve some space in each index server for incremental updates, it is okay to obtain a loose estimate, say 1 + Σ (|s| − 1) over the set of suffix sequences = 1 + 2 + 2 = 5 nodes. (The slide shows the actual subtree: q1–q2 with children q3 and q4.)
Requirement: The partitioning method in this step guarantees that the local suffix trees are subtrees of the global suffix tree (mutually exclusive of each other).
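The loose estimate can be written as a one-liner; `estimate_subtree_size` is an illustrative name for the slide's 1 + Σ (|s| − 1) formula.

```python
def estimate_subtree_size(suffixes):
    # Loose estimate from the slide: one node for the shared first query,
    # plus (|s| - 1) nodes for each suffix assigned to the partition.
    return 1 + sum(len(s) - 1 for s in suffixes)
```

For the slide's example it gives 5 nodes for {q1q2q3, q1q2q4}, and 2 nodes for the q4 partition {<q4>, <q4q5>}, matching the estimates on the next slide.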
Step 2. Suffixes partitioning
Key value Key value Key value Key value
<q1q2q3q4> 2 <q2q3q4> 2 <q3q4> 2 <q4> 2
Input <q1q2q4q5> 1 <q2q4q5> 1 <q3q5> 1 <q4q5> 1
suffixes <q1q2q5> 3 <q2q5> 3 <q3q6> 1
<q1q2q3q5> 1 <q2q3q5> 1
with count <q1q2q3q6> 1 <q2q3q6> 1
<q5> 5 <q6q1q2q5> 3
<q6> 1
Mappers process
Mapper Mapper Mapper Mapper
input and emit
Key value Key value Key value Key value
<k, v> pairs in q1 <q1q2q3q4>, 2 q2 <q2q3q4>, 2 q3 <q3q4>, 2 q4 <q4>,2
parallel q1 <q1q2q4q5>,1 q2 <q2q4q5>,1 q3 <q3q5>,1 q4 <q4q5>,1
q1 <q1q2q5>,3 q2 <q2q5>,3 q3 <q3q6>,1
(Assume 4 q1 <q1q2q3q5>,1 q2 <q2q3q5>,1
mapper nodes) q1 <q1q2q3q6>,1 q2 <q2q3q6>,1
q5 <q5>, 5 q6 <q6q1q2q5>,3
q6 <q6>,1
Map: For each suffix sequence, the computer emits an intermediate key-value pair whose key is the first query of the suffix; the value retains the suffix pattern and its frequency count.
Step 2. Suffixes partitioning
Key value Key value Key value Key value
<q1q2q3q4> 2 <q2q3q4> 2 <q3q4> 2 <q4> 2
Input <q1q2q4q5> 1 <q2q4q5> 1 <q3q5> 1 <q4q5> 1
suffixes <q1q2q5> 3 <q2q5> 3 <q3q6> 1
<q1q2q3q5> 1 <q2q3q5> 1
with count <q1q2q3q6> 1 <q2q3q6> 1
<q5> 5 <q6q1q2q5> 3
<q6> 1
Mappers process
Mapper Mapper Mapper Mapper
input and emit
Key value Key value Key value Key value
<k, v> pairs in q1 <q1q2q3q4>, 2 q2 <q2q3q4>, 2 q3 <q3q4>, 2 q4 <q4>,2
parallel q1 <q1q2q4q5>,1 q2 <q2q4q5>,1 q3 <q3q5>,1 q4 <q4q5>,1
q1 <q1q2q5>,3 q2 <q2q5>,3 q3 <q3q6>,1
(Assume 4 q1 <q1q2q3q5>,1 q2 <q2q3q5>,1
mapper nodes) q1 <q1q2q3q6>,1 q2 <q2q3q6>,1
q5 <q5>, 5 q6 <q6q1q2q5>,3
q6 <q6>,1
Shuffle and sort (e.g., x mod 3) for the key qx
Key value Key value Key value
q1 <q1q2q3q4>, 2 q2 <q2q3q4>, 2 q3 <q3q4>, 2
<key,value> pairs q1 <q1q2q4q5>,1 q2 <q2q4q5>,1 q3 <q3q5>,1
grouped by key q1 <q1q2q5>,3 q2 <q2q5>,3 q3 <q3q6>,1
(Assume 3 q1 <q1q2q3q5>,1 q2 <q2q3q5>,1 q6 <q6q1q2q5>,3
q1 <q1q2q3q6>,1 q2 <q2q3q6>,1 q6 <q6>,1
reducers) q4 <q4>,2 q5 <q5>, 5
q4 <q4q5>,1 94
Step 2. Suffixes partitioning

Let's first assume that each index server can hold a suffix tree with at most 18 nodes.

Estimated #nodes per key group (per the slide): q1 = 13, q4 = 2, q2 = 10, q5 = 1, q3 = 4, q6 = 4.
  Reducer 1 receives q1 and q4: at most 13 + 2 + 1 nodes.
  Reducer 2 receives q2 and q5: at most 10 + 1 + 1 nodes.
  Reducer 3 receives q3 and q6: at most 4 + 4 + 1 nodes.
All three fit within the 18-node capacity.

What if the estimated size of a partial suffix tree is too large? We can start another round of MapReduce to further divide the full node, adding more index servers (reducers) to alleviate the workload of a compute node.

Reducer 1               Reducer 2              Reducer 3
q1 <q1q2q3q4>, 2        q2 <q2q3q4>, 2         q3 <q3q4>, 2
q1 <q1q2q4q5>, 1        q2 <q2q4q5>, 1         q3 <q3q5>, 1
q1 <q1q2q5>, 3          q2 <q2q5>, 3           q3 <q3q6>, 1
q1 <q1q2q3q5>, 1        q2 <q2q3q5>, 1         q6 <q6q1q2q5>, 3
q1 <q1q2q3q6>, 1        q2 <q2q3q6>, 1         q6 <q6>, 1
q4 <q4>, 2              q5 <q5>, 5
q4 <q4q5>, 1
Step 3. Construct local suffix trees

We can construct the local suffix trees on each index server: the reducer for each partition builds a subtree from its suffixes and frequency counts.
The partitioning method in the second step guarantees that the local suffix trees are subtrees of the global suffix tree.

Index server 1 hosts the subtrees rooted at q1 and q4; index server 2 hosts q2 and q5; index server 3 hosts q3 and q6.
Incremental update

New search log sessions keep arriving incrementally.
To incrementally maintain the suffix tree, when a new batch of query sessions arrives, we only process the new batch using the MapReduce process similar to that in Step 1.
Then, the new set of suffixes are assigned to the index servers according to the subtrees that should host them.

When to add a new index server?
If a subtree in an index server exceeds the memory capacity after the new suffixes are inserted, the subtree is partitioned recursively as in Step 2 and more index servers are used.
Incremental update (Step 1)

New batch of query sessions:
  Node 1: s9 <q1q3q2q4>, s12 <q1q3>
  Node 2: s10 <q1q3q2q5>
  Node 3: s11 <q4q3q2q5>

To incrementally maintain the suffix tree, when a new batch of query sessions arrives, we only process the new batch using the MapReduce process similar to that in Step 1.

Mapper outputs:
  Node 1: <q1q3q2q4>/1, <q3q2q4>/1, <q2q4>/1, <q4>/1, <q1q3>/1, <q3>/1
  Node 2: <q1q3q2q5>/1, <q3q2q5>/1, <q2q5>/1, <q5>/1
  Node 3: <q4q3q2q5>/1, <q3q2q5>/1, <q2q5>/1, <q5>/1

Shuffle and sort (e.g., x mod 3) for the key qx, then reduce:
  Reducer 1: <q1q3q2q4>/1, <q1q3>/1, <q1q3q2q5>/1, <q4>/1, <q4q3q2q5>/1
  Reducer 2: <q2q4>/1, <q2q5>/2, <q5>/2
  Reducer 3: <q3>/1, <q3q2q4>/1, <q3q2q5>/2
Incremental update

The new set of suffixes are assigned to the index servers according to the subtrees that should host them:
  Index server 1 (q1, q4): <q1q3q2q4>/1, <q1q3>/1, <q1q3q2q5>/1, <q4>/1, <q4q3q2q5>/1
  Index server 2 (q2, q5): <q2q4>/1, <q2q5>/2, <q5>/2
  Index server 3 (q3, q6): <q3>/1, <q3q2q4>/1, <q3q2q5>/2
Incremental update

After inserting the new suffixes, the root-level counts become q1/11 and q4/5 on index server 1, q2/11 and q5/7 on index server 2, and q3/8 and q6/4 on index server 3.

If a subtree in an index server exceeds the memory capacity after the new suffixes are inserted, the subtree is partitioned recursively as in Step 2 and more index servers are used.
Incremental update

Here the q1 subtree on index server 1 has grown beyond capacity, so it is split: index server 1a hosts the subtree for prefix q1q2, and index server 1b hosts the subtrees for prefixes q1q3 and q4. Index servers 2 and 3 are unchanged.
Answering top-k query in parallel

The suffix tree is now spread over index server 1a (prefix q1q2), index server 1b (prefixes q1q3 and q4), index server 2 (q2, q5) and index server 3 (q3, q6). When a search involves multiple index servers, each index server looks up its local subtree and returns its local top-k results to the master query server.
Answering top-k query in parallel

Query input: <q1>, k = 5

Result from server 1a:  q2/8, q2q3/4, q2q5/3, q2q3q4/2, q2q4/1
Result from server 1b:  q3/3, q3q2/2, q3q2q4/1, q3q2q5/1
Servers 2 and 3: no match.

Combined:  q2/8, q2q3/4, q3/3, q2q5/3, q3q2/2
Follow-up topics

How can we support the backward search and session retrieval operations?
The construction of a suffix tree for a string S can take time and space linear in the length of S. Describe how to update the suffix tree construction to optimize its time complexity and storage space complexity.
Try to go through the 3 steps of distributed suffix tree construction with s12 = <q1q2q1q2>. Explain the problem that arises and describe a technique to resolve it.
Chapter 9

End

Slides prepared by - Dr. Chui Chun Kit, for students in COMP3278


For other uses, please email : [email protected]
Section 4

Breadth-first
search (MR)
A standard algorithm on graphs

Acknowledgement – Matei Zaharia, Electrical Engineering and Computer Sciences University of California, Berkeley
Shortest path

A common graph search application is finding the shortest path from a start node to one or more target nodes.
This is commonly done on a single machine with Dijkstra's algorithm.
Can we use BFS to find the shortest path with MapReduce?

(The slide shows a directed graph with nodes n1…n5 and weighted edges; it is the running example in the following slides.)
The Dijkstra’s algorithm

Inputs: a graph with edge weights and a source node s.
d – current shortest distance from the source s; an array with n slots, where n is the number of nodes in the graph.
Q – priority queue telling which node to process next; the node that is currently closest to the source node is processed first.

(The slide shows the pseudocode of Dijkstra's algorithm for source node s.)
Illustration

                 Q                  d: n1   n2   n3   n4   n5     u
initialization   n1 n2 n3 n4 n5        0    ∞    ∞    ∞    ∞
1                n2 n3 n4 n5           0    10   5    ∞    ∞      n1
2                n2 n4 n5              0    8    5    14   7      n3
3                n2 n4                 0    8    5    13   7      n5
4                n4                    0    8    5    9    7      n2
5                                      0    8    5    9    7      n4

The algorithm starts by exploring the source node. In each iteration, it extends the search by exploring the one unvisited node that is currently closest to the source node and updating the distances of its neighbours.

Termination: the algorithm ends when all the nodes are explored (i.e., Q is empty).
The Dijkstra’s algorithm

Sequential processing
Based on maintaining a global priority queue Q of nodes, with priorities equal to their distances from the source node.
At each iteration, the algorithm explores ONE node with the shortest distance and updates the distances to all reachable nodes.

If we have a cluster of machines, can we find the shortest path in parallel using MapReduce?
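The sequential algorithm can be sketched with Python's heapq as the priority queue Q; the edge weights below are reconstructed from the slides' walkthrough table (they reproduce its final distances d = 0, 8, 5, 9, 7), and the function name `dijkstra` is mine.

```python
import heapq

# Directed graph reconstructed from the slides' walkthrough (n1 is the source).
graph = {
    "n1": {"n2": 10, "n3": 5},
    "n2": {"n3": 2, "n4": 1},
    "n3": {"n2": 3, "n4": 9, "n5": 2},
    "n4": {"n5": 4},
    "n5": {"n1": 7, "n4": 6},
}

def dijkstra(graph, source):
    d = {v: float("inf") for v in graph}   # current shortest distances
    d[source] = 0
    pq = [(0, source)]                     # priority queue Q
    settled = set()
    while pq:
        dist, u = heapq.heappop(pq)        # closest unexplored node first
        if u in settled:
            continue
        settled.add(u)
        for v, w in graph[u].items():      # update every neighbour of u
            if dist + w < d[v]:
                d[v] = dist + w
                heapq.heappush(pq, (d[v], v))
    return d
```

The global heap is exactly what makes this hard to parallelize: only one node is settled at a time, which motivates the MapReduce formulation that follows.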
Parallel BFS

Idea: While Dijkstra's algorithm explores the graph by extending only one node at each step, can we explore multiple paths in parallel in MapReduce?

(The slide shows a graph in which the 1st MapReduce iteration explores the nodes one hop from n1, the 2nd iteration the nodes two hops away, and the 3rd iteration the nodes three hops away.)
Parallel BFS

Data representation
Graph: adjacency list.
Current distance to the source: each node is associated with a counter representing the currently discovered shortest distance to the source.
Processing state of a node: each node is associated with another marker: "f" = frontier node; "x" = not visited yet; "✓" = shortest distance updated in the previous iteration; "-" = no update of the shortest distance in the previous iteration.

Initial state for source n1 (each record is node | adjacency list | distance | state):
n1 | n2 n3 | 0 | f    n2 | n3 n4 | ∞ | x    n3 | n2 n4 n5 | ∞ | x    n4 | n5 | ∞ | x    n5 | n1 n4 | ∞ | x
1st iteration

We assume that the adjacency list also contains the edge weights (not shown in the records here).

In each MapReduce round, we process the frontier node(s). The search frontier is n1 in the 1st iteration.

Map task on n1 | n2 n3 | 0 | f:
1. We found a path from n1 to n2 with cost 10, so we emit (n2, 10) to tell n2 that such a path exists.
2. We found a path from n1 to n3 with cost 5, so we emit (n3, 5) to tell n3 that such a path exists.
1st iteration

Problem: We lose the adjacency lists and counters after one MR iteration — the map task on n1 emits only (n2, 10) and (n3, 5), so after the shuffle the reduce tasks receive distances but no graph structure. How can we make the outputs of the reduce tasks serve as the inputs of the mappers in the next iteration?
1st iteration

Solution: We also emit the adjacency list and counters of each node. All nodes are processed in parallel.

Map input:   n1 | n2 n3 | 0 | f    n2 | n3 n4 | ∞ | x    n3 | n2 n4 n5 | ∞ | x    n4 | n5 | ∞ | x    n5 | n1 n4 | ∞ | x
Map output:  (n2, 10) and (n3, 5) from the frontier node n1, plus one record per node carrying its adjacency list, distance and state.
Shuffle and sort group everything for the same node at the same reduce task; e.g., the reduce task for n3 receives (n3, 5) together with n3 | n2 n4 n5 | ∞ | x.
1st iteration (reduce)

Each reduce task receives the path distances discovered for a
particular node. It simply takes the minimum and updates the node's
distance counter.

Reduce outputs:
  n3 [n2 n4 n5] 5 | n2 [n3 n4] 10 | n1 [n2 n3] 0 | n4 [n5] ∞ | n5 [n1 n4] ∞
1st iteration (frontier update)

A node becomes part of the search frontier in the next iteration if
its shortest distance changes from ∞ to a finite value.

Reduce outputs (with updated states):
  n3 [n2 n4 n5] 5 f | n2 [n3 n4] 10 f | n1 [n2 n3] 0 - | n4 [n5] ∞ u | n5 [n1 n4] ∞ u
2nd iteration

The outputs of the 1st iteration now become the inputs of the 2nd
iteration. In each iteration, we explore the frontier nodes in
parallel. E.g., for n3, we discover a path from n3 to n2 with cost
5 + 3 = 8; to tell n2 about this path, n3's map task emits (n2, 8).

Map inputs:
  n3 [n2 n4 n5] 5 f | n2 [n3 n4] 10 f | n1 [n2 n3] 0 - | n4 [n5] ∞ u | n5 [n1 n4] ∞ u
Map outputs:
  from n3: (n2, 8), (n4, 14), (n5, 7); from n2: (n3, 12), (n4, 11);
  plus every node's own record.
2nd iteration (reduce)

We mark a node's state as "*" if the cost of its shortest path
changed, and as "-" if the cost of its shortest path is unchanged.

Reduce outputs:
  n2 [n3 n4] 8 * | n3 [n2 n4 n5] 5 - | n4 [n5] 11 f | n5 [n1 n4] 7 f | n1 [n2 n3] 0 -
3rd iteration

Map inputs:
  n2 [n3 n4] 8 * | n3 [n2 n4 n5] 5 - | n4 [n5] 11 f | n5 [n1 n4] 7 f | n1 [n2 n3] 0 -
Map outputs:
  from n2: (n3, 10), (n4, 9); from n4: (n5, 15); from n5: (n1, 14),
  (n4, 13); plus every node's own record.
Reduce outputs:
  n3 [n2 n4 n5] 5 - | n2 [n3 n4] 8 - | n1 [n2 n3] 0 - | n4 [n5] 9 * | n5 [n1 n4] 7 -
4th iteration

The algorithm terminates when all nodes have been visited and there
are no more changes to the shortest distances.

Map inputs:
  n3 [n2 n4 n5] 5 - | n2 [n3 n4] 8 - | n1 [n2 n3] 0 - | n4 [n5] 9 * | n5 [n1 n4] 7 -
Map outputs:
  from n4: (n5, 13); plus every node's own record.
Reduce outputs (no distance changes, so every state becomes "-"):
  n3 [n2 n4 n5] 5 - | n2 [n3 n4] 8 - | n1 [n2 n3] 0 - | n4 [n5] 9 - | n5 [n1 n4] 7 -
MapReduce version of BFS

Map(node n, [adj. list, d, state]):
    // d is the current shortest distance to node n
    emit(n, [adj. list, d, state])     // emits the adjacency list and counters of node n
    if state is "f" or "*":
        for each m in n's adjacency list:
            emit(m, d + w(n, m))       // emits the newly discovered path distance to node m

Reduce(node m, values):
    d' ← minimum over all distances received for m
    set m's state ("f", "*", "-" or "u") by comparing d' with m's previous distance
    emit(m, [adj. list, d', state])    // emits the node's adjacency list and the updated
                                       // counters for the inputs of the map tasks in the
                                       // next iteration
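The iterative dataflow above can be simulated in a single Python process. The sketch below is illustrative only (the graph dictionary, function names, and the ASCII state letters f/u/*/- are our stand-ins for the slides' symbols); it is not Hadoop code.

```python
from collections import defaultdict

INF = float("inf")

# The running example graph: node -> {neighbour: edge weight}.
graph = {
    "n1": {"n2": 10, "n3": 5},
    "n2": {"n3": 2, "n4": 1},
    "n3": {"n2": 3, "n4": 9, "n5": 2},
    "n4": {"n5": 4},
    "n5": {"n1": 7, "n4": 6},
}

def bfs_map(node, dist, state):
    # Always pass the node's own record along; explore only frontier
    # ("f") or just-updated ("*") nodes.
    yield (node, ("record", dist))
    if state in ("f", "*"):
        for m, w in graph[node].items():
            yield (m, ("dist", dist + w))

def bfs_reduce(node, values, old_dist):
    # Keep the minimum discovered distance and derive the next state.
    dists = [v for kind, v in values if kind == "dist"]
    new_dist = min([old_dist] + dists)
    if new_dist == INF:
        state = "u"              # still not reached
    elif old_dist == INF:
        state = "f"              # reached for the first time: frontier
    elif new_dist < old_dist:
        state = "*"              # shorter path found
    else:
        state = "-"              # unchanged
    return (new_dist, state)

def parallel_bfs(source):
    records = {n: (0, "f") if n == source else (INF, "u") for n in graph}
    while True:
        grouped = defaultdict(list)          # "shuffle and sort"
        for n, (d, s) in records.items():
            for key, value in bfs_map(n, d, s):
                grouped[key].append(value)
        records = {n: bfs_reduce(n, vs, records[n][0])
                   for n, vs in grouped.items()}
        if not any(s in ("f", "*") for _, s in records.values()):
            return {n: d for n, (d, s) in records.items()}
```

Running `parallel_bfs("n1")` reproduces the 4th-iteration slide: {n1: 0, n2: 8, n3: 5, n4: 9, n5: 7}.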
Comparison

In the single-node scenario: Dijkstra's algorithm is more efficient,
because at any step it only pursues edges from the minimum-cost node
on the frontier.

In the multiple-node scenario: the MapReduce version explores all
paths in parallel. It performs more work overall, but the
architecture is far more scalable.
Section 5

PageRank
Utilizing the structural information of the web graph to rank pages.
PageRank

A page is more popular if:
- More pages have a hyperlink pointing to it.
- It is linked by another popular page. A link from a popular page
  (e.g., www.cnn.com) should have a higher impact than a link from an
  unpopular page.

[Figure: a link from a popular page makes the linked page more
popular.]
PageRank

If a user starts at a random web page and surfs by randomly entering
new URLs and clicking links, what is the probability that s/he will
arrive at a given page? The PageRank of a page captures this notion:
more "popular" or "worthwhile" pages get a higher rank.

PageRank of a page n:

    pr(n) = α · (1/N) + (1 − α) · Σ_{m ∈ L(n)} pr(m) / C(m)

where the first term models randomly entering new URLs and the second
term models clicking links.
PageRank

Randomly entering new URLs: the user types a URL in the browser and
jumps to a page directly. Assuming a uniform distribution, the
probability that a user jumps to any particular one of the N pages is
1/N.

    pr(n) = α · (1/N) + (1 − α) · Σ_{m ∈ L(n)} pr(m) / C(m)
PageRank

Clicking links: let L(n) be the set of pages linking to page n, and
let m be a page in L(n).

[Figure: a page m with a hyperlink to page n.]

- A random surfer at m will arrive at n with probability 1/C(m),
  where C(m) is the out-degree of page m.
- The probability of arriving at page m is pr(m).
- The probability of arriving at page n by clicking a link from page
  m is therefore pr(m) / C(m).
- We sum the contributions from all pages that link to n.

    pr(n) = α · (1/N) + (1 − α) · Σ_{m ∈ L(n)} pr(m) / C(m)
PageRank

The random jump factor α models the probability that a user arrives
at a page by typing a URL into the browser; (1 − α) is referred to as
the "damping" factor. A common value of α is 0.15, meaning there is a
15% chance that a visitor arrives at a page by typing a URL and an
85% chance of arriving by clicking a hyperlink.

    pr(n) = α · (1/N) + (1 − α) · Σ_{m ∈ L(n)} pr(m) / C(m)
Illustration (α = 0, 1st iteration)

[Figure: example graph with five nodes, each initialized with
PageRank 0.2. Edges: n1→n2, n1→n4, n2→n3, n2→n5, n3→n4, n4→n5,
n5→n1, n5→n2, n5→n3.]

To compute PageRank, we first allocate an equal PageRank to each node,
so that the sum of the PageRanks over all nodes equals 1.
Illustration (α = 0, 1st iteration)

Since the PageRank formula needs pr(m)/C(m), each node m calculates
pr(m)/C(m) and sends it to its out-neighbouring nodes. E.g., for n1,
pr(n1)/C(n1) = 0.2/2 = 0.1, which is sent to n2 and n4.

    pr(n) = α · (1/N) + (1 − α) · Σ_{m ∈ L(n)} pr(m) / C(m)
Illustration (α = 0, 1st iteration)

Since the random jump factor is 0, each node only needs to sum up the
received pr(m)/C(m) values to get its new pr(n). E.g., n2 receives
pr(n1)/C(n1) = 0.1 and pr(n5)/C(n5) = 0.066, so pr(n2) = 0.1 + 0.066
= 0.166.

After the 1st iteration:
  n1 (0.066), n2 (0.166), n3 (0.166), n4 (0.3), n5 (0.3)
Illustration (α = 0, 2nd iteration)

The nodes again send out their pr(m)/C(m) values and sum up what they
receive.

After the 2nd iteration:
  n1 (0.1), n2 (0.133), n3 (0.183), n4 (0.199), n5 (0.383)
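The two illustrated iterations can be reproduced with a short script. This is a sketch for checking the arithmetic (the edge list is read off the figures), not the MapReduce implementation:

```python
# Iterative PageRank on the five-node example graph from the
# illustration (edges read off the figures).
out_links = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}

def pagerank(alpha, iterations):
    nodes = list(out_links)
    pr = {n: 1.0 / len(nodes) for n in nodes}   # equal initial mass, sums to 1
    for _ in range(iterations):
        new_pr = {n: alpha / len(nodes) for n in nodes}  # random jump term
        for m, targets in out_links.items():
            share = pr[m] / len(targets)        # pr(m) / C(m)
            for n in targets:
                new_pr[n] += (1 - alpha) * share
        pr = new_pr
    return pr

pr = pagerank(alpha=0.0, iterations=2)
# pr now matches the 2nd-iteration slide: n1 = 0.1, n2 ≈ 0.133,
# n3 ≈ 0.183, n5 ≈ 0.383; n4 is exactly 0.2 (shown as 0.199 on the
# slide due to rounding of intermediate values).
```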
PageRank

PageRank is a global score, independent of the search query. It can
be used to raise the weight of important pages:

    w(t, d) = tf.idf(t, d) × pr(d)

This can be directly incorporated into the inverted index and into
the evaluation of top-k keyword search queries.
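As a sketch of how this weighting could be computed (the tf.idf variant, the function names, and all numbers below are made up for illustration):

```python
import math

def tf_idf(tf, df, n_docs):
    # A standard tf-idf variant: term frequency times log inverse
    # document frequency.
    return tf * math.log(n_docs / df)

def weight(tf, df, n_docs, pagerank):
    # w(t, d) = tf.idf(t, d) * pr(d): query-dependent relevance scaled
    # by the query-independent importance of the page.
    return tf_idf(tf, df, n_docs) * pagerank

# Two documents with the same tf-idf score for a term: the one with
# the higher PageRank ranks first.
w_a = weight(tf=3, df=10, n_docs=1000, pagerank=0.30)
w_b = weight(tf=3, df=10, n_docs=1000, pagerank=0.05)
```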
MapReduce PageRank (dataflow for the example, α = 0)

Map inputs (node, adjacency list, current PageRank):
  n1 [n2 n4] P(n1) 0.2 | n2 [n3 n5] P(n2) 0.2 | ...
Map outputs:
  each map task emits (m, pr(n)/C(n)) for every out-neighbour m, plus
  the node's own adjacency list; e.g., from n1: (n2, 0.1), (n4, 0.1),
  (n1, [n2 n4]).
Shuffle and sort → Reduce inputs:
  e.g., n5 receives 0.1 (from n2), 0.2 (from n4), and its adjacency
  list [n1 n2 n3]; n1 receives 0.066 (from n5) and [n2 n4].
Reduce outputs:
  each reduce task sums the received contributions; e.g.,
  n5 [n1 n2 n3] P(n5) 0.3 and n1 [n2 n4] P(n1) 0.066.
MapReduce PageRank

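The pseudocode figure for this slide did not survive extraction, so here is a hedged sketch of one MapReduce PageRank iteration consistent with the dataflow on the previous slide (α = 0, no dangling-node handling; the function names are ours):

```python
from collections import defaultdict

def pr_map(node, adj, pr):
    # Pass the graph structure along, and send pr(n)/C(n) to each
    # out-neighbour.
    yield (node, ("adj", adj))
    for m in adj:
        yield (m, ("mass", pr / len(adj)))

def pr_reduce(node, values):
    # Recover the adjacency list and sum the received contributions.
    adj, pr = [], 0.0
    for kind, v in values:
        if kind == "adj":
            adj = v
        else:
            pr += v
    return (node, adj, pr)

def one_iteration(records):
    grouped = defaultdict(list)              # "shuffle and sort"
    for node, adj, pr in records:
        for key, value in pr_map(node, adj, pr):
            grouped[key].append(value)
    return [pr_reduce(n, vs) for n, vs in grouped.items()]

records = [("n1", ["n2", "n4"], 0.2), ("n2", ["n3", "n5"], 0.2),
           ("n3", ["n4"], 0.2), ("n4", ["n5"], 0.2),
           ("n5", ["n1", "n2", "n3"], 0.2)]
records = one_iteration(records)
# n5's PageRank is now 0.1 + 0.2 = 0.3, as in the illustration.
```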
Follow-up topics

What is the dangling node problem in PageRank?
- Explain the problem with an example.
- Discuss how dangling nodes are handled in PageRank.
Section 6

HITS
Hubs and Authorities
Hubs and Authorities

PageRank takes a simplistic view of a network. The network topology
suggests different node types:
- Hub: a page that points to lots of others (e.g., Yahoo, Google).
- Authority: a page that many others refer to; a valuable and
  informative webpage (e.g., Wikipedia).

A good authority is pointed to by many good hubs. A good hub points
to many good authorities.

Reference: https://www.youtube.com/watch?v=5lmZreIyA2A by Dr. Victor Lavrenko
HITS algorithm

Hypertext Induced Topic Search (HITS), developed by Jon Kleinberg,
automatically determines hubs and authorities, using them to define a
recursive relationship between web pages.

In practice:
- Used on a result set (not on the whole web graph, unlike PageRank).
- A variant was used by Ask.com.

Reference: https://www.youtube.com/watch?v=5lmZreIyA2A by Dr. Victor Lavrenko
Constructing the focused subgraph

Start from a set R created by a text-based search engine: for a
specific query q, let the set of documents returned by a standard
search engine be called the root set R.

Construct a set S by extending R: add to S all pages pointed to by
any page in R, and all pages that point to any page in R.

Why not just apply HITS on the whole web graph? Applying HITS on the
whole web graph would be too costly. The set S is a much smaller set
of relevant pages, and it contains most (or many) of the strongest
authorities.
Iterative algorithm

Hub score of a page x: the sum of the authority scores of the pages
it points to. A better hub points to more pages / pages with high
authority scores.

    H(x) = Σ_{x→y} A(y)

Authority score of a page x: the sum of the hub scores of the pages
pointing to it. A better authority is pointed to by more pages /
pages with high hub scores.

    A(x) = Σ_{y→x} H(y)

Normalization: keep the hub and authority score vectors as unit
vectors:

    Σ_x H(x)² = Σ_x A(x)² = 1
Iterative algorithm

Input: the graph S and the number of iterations.
Output: authority and hub score vectors A and H.

Initialization: all nodes have hub and authority scores of 1.

Iterations:
    while there are still more iterations do
        for each node x do: A(x) = Σ_{y→x} H(y)
        for each node x do: H(x) = Σ_{x→y} A(y)
        Normalize(A) and Normalize(H)
    end while

Reference: The PageRank/HITS algorithms by Joni Pajarinen, http://www.cis.hut.fi/Opinnot/T-61.6020/2008/pagerank_hits.pdf
Illustration (initialization)

[Figure: example graph with five nodes. Edges: n1→n3, n2→n1, n2→n3,
n2→n5, n3→n4, n4→n1, n5→n3.]

Initialization step: all nodes have 1 as their initial hub and
authority values.
Illustration (1st iteration)

1st step: update the authority scores. Node n1 has two incoming
edges, so its authority score equals the sum of the hub scores of the
two nodes pointing to it (i.e., 1 + 1 = 2).
Illustration (1st iteration)

Node n3 has three incoming edges, so its authority score equals
1 + 1 + 1 = 3. This node is a pretty good authority node (a page that
many others refer to; it should be a valuable and informative
webpage).
Illustration (1st iteration)

Node n2 has no incoming edges, so its authority score becomes 0.

Authority scores after the 1st step: n1: 2, n2: 0, n3: 3, n4: 1,
n5: 1.
Illustration (1st iteration)

2nd step: update the hub scores. Node n1 has one outgoing edge, so
its hub score equals the sum of the authority scores of the nodes
that n1 points to (i.e., n3, with authority score 3).

Although n1 only points to one other node, that node (n3) is a good
authority node, making n1 a reasonably good hub (with a relatively
high hub score).
Illustration (1st iteration)

Node n2 has three outgoing edges, so its hub score equals
2 + 3 + 1 = 6. Although n2 is not a good authority node, it is a good
hub (it points to lots of good authorities).
Illustration (1st iteration)

3rd step: normalization. To prepare the hub and authority scores for
the next iteration, we normalize the hub vector and the authority
vector to unit length.

                n1  n2  n3  n4  n5   sqrt(sum of squares)
  Hubs           3   6   1   2   3   sqrt(3²+6²+1²+2²+3²) = sqrt(59)
  Authorities    2   0   3   1   1   sqrt(2²+0²+3²+1²+1²) = sqrt(15)
Illustration (1st iteration)

Dividing each score by the corresponding vector length gives the unit
vectors used in the next iteration:

                          n1     n2     n3     n4     n5
  Hubs (unit length)      0.391  0.781  0.130  0.260  0.391
  Authorities (unit len.) 0.516  0      0.775  0.258  0.258
Exercise

Try to simulate the computation and see if you get the hub and
authority scores correctly.

2nd iteration:
  n1 H:0.374 A:0.511 | n2 H:0.811 A:0 | n3 H:0.031 A:0.767
  | n4 H:0.249 A:0.064 | n5 H:0.374 A:0.383

3rd iteration:
  n1 H:0.370 A:0.517 | n2 H:0.814 A:0 | n3 H:0.007 A:0.760
  | n4 H:0.252 A:0.015 | n5 H:0.370 A:0.395
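The exercise can be checked with a short script. Treat the edge list below as an assumption of this sketch: it is read off the illustration figures, not stated explicitly in the slides.

```python
import math

# HITS on the example graph from the illustration (edges read off the
# figures). Scores are normalized to unit length after each iteration.
edges = [("n1", "n3"), ("n2", "n1"), ("n2", "n3"), ("n2", "n5"),
         ("n3", "n4"), ("n4", "n1"), ("n5", "n3")]
nodes = ["n1", "n2", "n3", "n4", "n5"]

def normalize(scores):
    length = math.sqrt(sum(v * v for v in scores.values()))
    return {n: v / length for n, v in scores.items()}

def hits(iterations):
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of pages pointing to x.
        auth = {x: sum(hub[y] for y, z in edges if z == x) for x in nodes}
        # Hub: sum of the (new) authority scores of pages x points to.
        hub = {x: sum(auth[z] for y, z in edges if y == x) for x in nodes}
        auth, hub = normalize(auth), normalize(hub)
    return hub, auth

hub, auth = hits(1)
# After one iteration: hub(n2) = 6/sqrt(59) ≈ 0.781 and
# auth(n3) = 3/sqrt(15) ≈ 0.775, matching the slides; hits(3)
# reproduces the 3rd-iteration scores of the exercise.
```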
Follow-up topics

How would you compare HITS and PageRank?
- A Comparative Study of HITS vs PageRank Algorithms for Twitter
  Users Analysis, by Ong Kok Chien et al.
- HITS on the Web: How does it Compare?, by Marc Najork et al.
- Authority Rankings from HITS, PageRank, and SALSA: Existence,
  Uniqueness, and Effect of Initialization, by Ayman Farahat et al.

Is it possible to make use of MapReduce to compute the HITS
algorithm?