Reverse Top-k Search using Random Walk with Restart

Adams Wei Yu, Nikos Mamoulis, Hao Su

School of Computer Science, Carnegie Mellon University

Department of Computer Science, The University of Hong Kong

Computer Science Department, Stanford University

[email protected], [email protected], [email protected]
ABSTRACT
With the increasing popularity of social networks, large volumes of graph data are becoming available. Large graphs are also derived by structure extraction from relational, text, or scientific data (e.g., relational tuple networks, citation graphs, ontology networks, protein-protein interaction graphs). Node-to-node proximity is the key building block for many graph-based applications that search or analyze the data. Among various proximity measures, random walk with restart (RWR) is widely adopted because of its ability to consider the global structure of the whole network. Although RWR-based similarity search has been well studied before, there is no prior work on reverse top-k proximity search in graphs based on RWR. We discuss the applicability of this query and show that its direct evaluation using existing methods on RWR-based similarity search has very high computational and storage demands. To address this issue, we propose an indexing technique, paired with an on-line reverse top-k search algorithm. Our experiments show that our technique is efficient and has manageable storage requirements even when applied on very large graphs.
1. INTRODUCTION
Graphs are a fundamental model for capturing the structure of data in a wide range of applications. Examples of real-life graphs include social networks, the Web, transportation networks, citation graphs, ontology networks, and protein-protein interaction graphs. In most applications, a key concept is node-to-node proximity, which captures the relevance between two nodes in a graph. A widely adopted proximity measure, due to its ability to capture the global structure of the graph, is random walk with restart (RWR). The RWR proximity from node u to node v is the probability that a random walk starting from u reaches v after infinite time; at any transition there is a chance $\alpha$ ($0 < \alpha < 1$) that the random walk restarts at u. Compared to other measures (like shortest path distance), a significant advantage of RWR is that it takes into account all the possible paths between two nodes. Other merits of RWR are that it can model the multi-faceted relationship between two nodes [13] and that it is stable to small changes in the graph [20]. RWR has been successfully applied in the search engine Google [21] to rank the importance of web pages. In addition, several other measures build upon RWR, including Personalized PageRank [12], ObjectRank [5], Escape Probability [25], and PathSim [23].

Supported by grant HKU 715413E from Hong Kong RGC. Work done while the first author was with HKU.
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing [email protected]. Articles from this volume were invited to present their results at the 40th International Conference on Very Large Data Bases, September 1st - 5th 2014, Hangzhou, China.
Proceedings of the VLDB Endowment, Vol. 7, No. 5
Copyright 2014 VLDB Endowment 2150-8097/14/01.
Many search and analysis tasks rely on proximity computations. These include citation analysis in bibliographical graphs [14], link prediction in social networks [19], graph clustering [2], and making recommendations [16]. The top-k RWR proximity query retrieves the k nodes with the highest proximity from a given query node q in a graph. This problem has been investigated previously and efficient solutions have been proposed for it (e.g., [11, 3, 10]).
In this paper, we study the reverse top-k RWR proximity query: given a node q, find all the nodes that have q in their top-k RWR proximity sets. Reverse top-k queries can be used for detection of spam nodes in a graph. Search engines, such as Google, aggregate the RWR proximities from all other nodes to one node in a single value, known as PageRank. Thus, the proximity from web page u to v can be interpreted as the PageRank contribution that u makes to v. When a node q is suspected to be a spam web page, one could run a reverse top-k search on q and find the pages which give one of their top-k contributions to q. If the answer set contains a large proportion of web pages already labeled as spam, then q is likely to be spam too. As another application, consider an author in a co-authorship network who wishes to find the set of people that regard him as one of their most important direct or indirect collaborators. The reverse top-k result can be used for identifying the likelihood of successful collaborations in the future. The size of an author's reverse top-k list is also an indicator of his popularity in the community. Finally, in a product co-purchase graph, a reverse top-k query of a product q can identify which products influence the buying of q. One can leverage this information to promote q in future transactions.
To the best of our knowledge, there is no previous work on reverse top-k RWR-based search in large graphs. In addition, extending solutions for top-k RWR search to compute reverse top-k queries is not trivial. Specifically, while for a top-k search we only need to find the top-k proximity set of a single node q, the reverse top-k search must compute the top-k sets of all nodes in the graph and check whether q appears in each of them. Therefore, a reverse top-k query is substantially more expensive than top-k RWR search. Figure 1 illustrates a toy graph of 6 nodes and the entire proximity matrix P computed from it; the i-th column $p_i$ of P contains the proximity values from node i to all nodes in the graph (e.g., the proximity from node 1 to node 3 is 0.12). In each column $p_i$, the k = 2 largest entries are shaded; these indicate the results of a top-2 query from node i (e.g., the top-2 query from node 3 returns nodes 2 and 3). Observe that for any given node q, to answer a top-2 query, we only have to compute and access the values of a single column, whereas a reverse top-2 query for a node i requires finding all the shaded entries in the i-th row (e.g., the reverse top-2 query for node 1 returns nodes 1, 2, and 5). To find whether an entry is shaded or not, we have to compute the entire matrix P and rank the values in each column. Computing the whole proximity matrix is both time- and space-consuming, especially for large graphs.

Figure 1: Example graph and its proximity matrix
(graph drawing omitted; entries marked * are the shaded top-2 values of each column)

           p1      p2      p3      p4      p5      p6
node 1    0.32*   0.24*   0.24    0.18    0.20*   0.19
node 2    0.29*   0.38*   0.28*   0.31*   0.33*   0.30*
node 3    0.12    0.17    0.27*   0.13    0.14    0.13
node 4    0.13    0.10    0.10    0.23*   0.09    0.14
node 5    0.06    0.04    0.04    0.10    0.19    0.06
node 6    0.08    0.07    0.07    0.05    0.06    0.20*
We propose a reverse top-k query evaluation framework which alleviates this issue. In a nutshell, our approach computes (at a preprocessing step) from the graph G (having $|V|$ nodes) a graph index, which is based on a $K \times |V|$ matrix, containing in each column v the K largest approximate proximity values from v to any other nodes in G. K is application-dependent and represents the highest value of k in a practical query. At each column v of the index, the approximate values are lower bounds of the K largest proximity values from v to all other nodes, computed after adapting and partially executing Berkhin's Bookmark Coloring Algorithm (BCA) [7]. Given the graph index and a reverse top-k query q ($k \le K$), we prove that the exact proximities from any node v to query q can be efficiently computed by applying the power method. By comparing these with the corresponding lower bounds taken from the k-th row of the graph index, we are able to determine which nodes (i.e., columns of P) are certainly not in the reverse top-k result of q. For some of the remaining nodes, we may also be able to determine that they are certainly in the reverse top-k result, based on derived upper bounds for the k-th largest proximity value from them. Finally, for any candidate that remains, we progressively refine its approximate proximities until, based on its lower or upper bound, we can determine whether it should be in the result. The proximities refined during query processing can be written back into the graph index, making its values progressively more accurate for future queries.
Our contributions can be summarized as follows:
• We study for the first time reverse top-k proximity queries based on RWR in large graphs.
• We propose a dynamically refined, space-efficient index structure, which supports reverse top-k query evaluation. The index is paired with an efficient online query algorithm, which prunes a large number of nodes that are definitely in or not in the reverse top-k result and minimizes the required refinement for the remaining candidates.
• A side contribution of our online algorithm is a proof that we can apply the power method for computing the exact proximities from all nodes to a given node q. This result can serve as a module of any application that needs to compute RWR proximities to a given node.
• We conduct an experimental study demonstrating the efficiency of our framework, as well as the effectiveness of the reverse top-k RWR query in real graph applications.
The remainder of this paper is organized as follows. Section 2 provides a definition for the RWR-based proximity vector, which captures the proximities from a given node to all other nodes, and reviews methods for computing it. The reverse top-k RWR proximity search problem is formalized in Section 3 and the baseline brute force solution is discussed. In Section 4, we present our solution, which is experimentally evaluated in Section 5. In Section 6, we briefly discuss previous work related to reverse top-k RWR proximity search. Finally, Section 7 concludes the paper.

Table 1: Main notations

General
  $A$                              Transition probability matrix
  $P$                              Proximity matrix
  $p_u$                            Proximity vector from node u to other nodes
  $e_u$                            Unit vector having $e_u(u) = 1$ and $e_u(v) = 0$ $\forall v \ne u$
  $p^{kmax}_u$                     The k-th largest entry of $p_u$
BCA starting from node u
  $p^t_u$ ($r^t_u$)                Retained (residue) ink distribution at iteration t
  $s^t_u$ ($w^t_u$)                Ink accumulated at hubs (non-hubs) at iteration t
  $\hat{p}_u$ ($\hat{p}^t_u$)      Descending ranked list of $p_u$ ($p^t_u$)
  $lb^t_u$ ($ub^t_u$)              Lower (upper) bound of $\hat{p}^t_u(k)$
2. PRELIMINARIES
In this section, we first provide definitions for the RWR proximity matrix of a graph and the proximity vectors of nodes in it. Then, we review the Bookmark Coloring Algorithm (BCA) [7] for computing the RWR proximity from a given node to all other nodes, based on which our offline index is built. For a matrix M, $m_j$ or $m_{\cdot,j}$ denotes its j-th column, $m_{i,\cdot}$ denotes its i-th row, and $m_{i,j}$ denotes the element of its i-th row and j-th column. For a vector v, $v(i)$ denotes its i-th entry and $v(1:i)$ denotes its first i entries. Table 1 summarizes the main symbols used in the paper.
2.1 Definitions
Let $G = (V, E)$ be a directed graph, with a set $V = \{1, 2, \ldots, n\}$ of vertices and a set $E \subseteq V \times V$ of edges. Let $n = |V|$, $m = |E|$, $A = [a_1, a_2, \ldots, a_n] \in \mathbb{R}^{n \times n}$ be the column-stochastic transition probability matrix, and $OD(i)$ be the out-degree of node i. We assume that $a_{i,j} = \frac{1}{OD(j)}$ if edge $j \to i$ exists and $a_{i,j} = 0$ otherwise.¹ In other words, the RWR transition probability from node j to any of its out-neighbors i only depends on the out-degree of j (i.e., all out-neighbors are equally likely to be visited). For a given node u, the RWR proximity values from it to other nodes are the solution of the following linear system w.r.t. $p_u$:

$$p_u = (1 - \alpha) A p_u + \alpha e_u, \tag{1}$$

where $p_u \in \mathbb{R}^n$ is the proximity vector of node u, with $p_u(v)$ denoting the proximity from u to v; $e_u \in \mathbb{R}^n$ is a unit vector having $e_u(u) = 1$ and all other values set to 0; and $\alpha \in [0, 1]$ denotes the restart probability in RWR (typically, $\alpha = 0.15$). The analytical solution is

$$p_u = P e_u, \tag{2}$$

where $P = \alpha (I - (1 - \alpha) A)^{-1} = [p_1, p_2, \ldots, p_n]$ is called the proximity matrix. In fact, P can also be used to compute the PageRank (pr) and any Personalized PageRank (ppr) vector as follows:

$$pr = \frac{1}{n} P e, \qquad ppr_v = P v, \tag{3}$$

where $e \in \mathbb{R}^n$ has 1 in all entries and $v \in \mathbb{R}^n$ is any given personalized vector such that $\forall\, 1 \le i \le n$, $v_i \ge 0$ and $\|v\|_1 = \sum_{i=1}^{n} |v_i| = 1$.

¹ In the case where dangling nodes with no outgoing edges exist, we can simply delete them, or add a sink node which links to itself and is pointed to by each dangling node.

Computing P in its entirety or partially is a key problem in different applications. Approaches like the iterative Power Method (PM) [21] and Monte Carlo Simulation (MCS) [9] can be used to compute an approximate value for a single proximity vector $p_u$ and/or the entire matrix P. PM converges to an accurate $p_u$, while MCS is less accurate but faster. Next, we discuss in detail an efficient technique for deriving a lower bound of $p_u$.
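As a rough illustration of Eqs. (1) to (3), the following sketch builds the column-stochastic matrix A from an edge list and obtains P by direct inversion, which is only practical for tiny graphs. The 4-node edge list and the helper names `transition_matrix` and `proximity_matrix` are assumptions made for this example; they do not come from the paper.

```python
import numpy as np

def transition_matrix(edges, n):
    """Column-stochastic A: a[i, j] = 1/OD(j) if edge j -> i exists, else 0."""
    A = np.zeros((n, n))
    for j, i in edges:              # each pair is a directed edge j -> i
        A[i, j] = 1.0
    out_degree = A.sum(axis=0)      # column sums count out-neighbors
    if np.any(out_degree == 0):
        raise ValueError("remove dangling nodes or link them to a sink first")
    return A / out_degree           # divides column j by OD(j)

def proximity_matrix(A, alpha=0.15):
    """P = alpha * (I - (1 - alpha) A)^{-1}; column u is the vector p_u."""
    n = A.shape[0]
    return alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * A)

# Hypothetical 4-node graph with 0-based ids (not the graph of Figure 1).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3), (3, 0)]
A = transition_matrix(EDGES, 4)
P = proximity_matrix(A)
pagerank = P @ np.ones(4) / 4       # Eq. (3) with the all-ones vector e
```

Each column of P sums to 1 because A is column-stochastic, so every $p_u$ can be read as a probability distribution over the nodes.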
2.2 Bookmark Coloring Algorithm (BCA)
Basic model. Berkhin [7] models RWR by a bookmark coloring process, which facilitates the efficient estimation of $p_u$. We begin by injecting a unit amount of colored ink into u, with an $\alpha$ portion retained in u and the rest ($1 - \alpha$) portion evenly distributed to each of u's out-neighbors. Each node which receives ink retains an $\alpha$ portion of the ink and distributes the rest to its out-neighbors. At an intermediate step t (= 0, 1, 2, ...), we can use two vectors $p^t_u, r^t_u \in \mathbb{R}^n$ to capture the ink distribution in the whole graph, where $p^t_u(v)$ is the ink retained at node v, and $r^t_u(v)$ is the residue ink to be distributed from v. When $r^t_u(v)$ reaches 0 for all $v \in V$ (i.e., $\|r^t_u\|_1 = 0$), $p^t_u$ is exactly $p_u$; the proximity vector $p_u$ can be seen as a stable distribution of ink. In fact, BCA can stop early, at a time t where the $r^t_u(v)$ values are small at all nodes v; $p^t_u$ is then a sparse lower-bound approximation of $p_u$ [7].
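The basic ink-propagation process (without hubs) can be sketched as follows. This is a minimal illustration, assuming the largest-residue selection rule and a dense matrix; the function name `bca`, the stopping parameter `eps`, and the 4-node matrix are hypothetical choices for the example.

```python
import numpy as np

# Hypothetical column-stochastic matrix of a 4-node graph (0-based ids).
A = np.array([
    [0.0, 0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
])

def bca(A, u, alpha=0.15, eps=1e-8):
    """Propagate ink from u until the total residue is at most eps."""
    n = A.shape[0]
    p = np.zeros(n)                 # retained ink, a lower bound of p_u
    r = np.zeros(n)
    r[u] = 1.0                      # one unit of ink injected at u
    while r.sum() > eps:
        v = int(np.argmax(r))       # node holding the largest residue
        ink = r[v]
        r[v] = 0.0
        p[v] += alpha * ink                     # v keeps an alpha portion
        r += (1.0 - alpha) * ink * A[:, v]      # rest goes to out-neighbors
    return p, r

p1, r1 = bca(A, u=1)
```

At every step the retained and residue ink add up to one unit, so `p1.sum() + r1.sum()` stays at 1 and the gap between `p1` and the exact vector is bounded by the remaining residue.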
Hub effects. In the process of ink propagation, some of the nodes may have a high probability to receive new ink and distribute part of it again and again. Such nodes are called hubs and their set is denoted by $H = \{h_1, h_2, \ldots, h_{|H|}\}$. Without loss of generality, we assume that the first $|H|$ nodes in V are the hubs. If we knew how hubs distribute their ink across the graph (i.e., if we had precomputed the exact proximity vector $p_h$ for each $h \in H$), we would not need to distribute their residue ink during the process of computing $p_u$ for a node $u \in V \setminus H$. Instead, we could accumulate all the residue ink at hubs, and distribute it in batch at the end by a simple matrix multiplication. In [7], a greedy scheme is adopted to select hubs and implement this idea. It starts by applying BCA on one node and selecting the node with the largest retained ink as a hub. This process is repeated from another starting node to select another hub, until a sufficient number of hubs are chosen. Once the hub nodes are selected, we can use the power method (PM) to calculate the exact vector $p_h$ for each $h \in H$.
BCA using hubs. Assume that we have selected a set of hubs H and have pre-computed $p_h$ for each $h \in H$. To compute $p_u$ for a non-hub node u, BCA [7] (and its revised version [2]) first injects a unit amount of ink to u; then u retains an $\alpha$ portion of the ink and distributes the rest to u's out-neighbors. At each propagation step t, BCA picks a non-hub node $v_t$ and distributes the residue ink $r^{t-1}_u(v_t)$ to its out-neighbors. Two vectors $s^t_u$ and $w^t_u \in \mathbb{R}^n$ are introduced and maintained in this process. $s^t_u$ is used to store the ink accumulated at hubs so far and $w^t_u$ is used to store the ink retained at non-hub nodes. Thus, for a hub node h, $s^t_u(h)$ is the ink accumulated at h by time t; this ink will be distributed to all nodes in batch after the final iteration, with the help of the (pre-computed) $p_h$. For a non-hub node v, $w^t_u(v)$ stores the ink retained so far at v (which will never be distributed). $w^t_u(v)$ ($s^t_u(v)$) is always zero for a hub (non-hub) node v. The following equations show how all vectors are updated at each step:

$$w^t_u = \alpha\, r^{t-1}_u(v_t)\, e_{v_t} + w^{t-1}_u \tag{4}$$

$$r^t_u = (1 - \alpha)\, r^{t-1}_u(v_t)\, a_{v_t} + \left[ r^{t-1}_u - r^{t-1}_u(v_t)\, e_{v_t} \right] \tag{5}$$

$$s^t_u = \sum_{i \in H} r^{t-1}_u(i)\, e_i + s^{t-1}_u \tag{6}$$

According to the first part of Eq. (4), an ink portion of $\alpha\, r^{t-1}_u(v_t)$ is retained at $v_t$. Eq. (5) subtracts the residue ink $r^{t-1}_u(v_t)$ from $v_t$ (second part) and evenly distributes the remaining ($1 - \alpha$) portion to $v_t$'s out-neighbors (first part). Eq. (6) accumulates the ink that arrives at hub nodes. At any step t, BCA can compute $p^t_u$ and use it to approximate $p_u$, as follows:

$$p^t_u = w^t_u + P_H\, s^t_u \tag{7}$$

where $P_H = [p_1, p_2, \ldots, p_{|H|}, 0_{n \times (n - |H|)}] \in \mathbb{R}^{n \times n}$, i.e., $P_H$ is the proximity matrix including only the (precomputed) proximity vectors of hub nodes and having 0's in all proximity entries of non-hub nodes. $p^t_u$ is computed only when all residue values are small; in this case, it is deemed that $p^t_u$ is a good approximation of $p_u$. In order to reach this stage, at each step, a $v_t$ with large residue ink should be selected. In [7], $v_t$ is selected to be the node with the largest residue ink, while in [2] $v_t$ is any node with more residue ink than a propagation threshold $\eta$. BCA terminates when the total residue ink does not exceed a convergence threshold $\epsilon$ or when there is no node with at least $\eta$ residue ink.
3. PROBLEM FORMALIZATION
The reverse top-k RWR query is formally defined as follows:
PROBLEM 1. Given a graph $G(V, E)$, a query node $q \in V$ and a positive integer k, find all nodes $u \in V$ for which $p^{kmax}_u \le p_u(q)$, where $p_u$ is obtained by Eq. (1) and $p^{kmax}_u$ is the k-th largest value in $p_u$.
A brute-force (BF) method for evaluating the query is to (i) compute the proximity vector $p_u$ of every node u in the graph, and (ii) find all nodes u for which $p^{kmax}_u \le p_u(q)$. BF requires the computation of the entire proximity matrix P. No matter which method is used to compute the exact P (e.g., PM or the state-of-the-art K-dash algorithm [10]), examining the top-k values at each column $p_u$ results in an $O(n^3)$ total time complexity for BF (or $O(nm)$ for sparse graphs with n nodes and m edges), which is too high, especially for online queries on large-scale graphs.
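Step (ii) of the brute-force method can be sketched as below, once the full matrix P is available. The matrix is entered with two-decimal values as best read from Figure 1, with 0-based indices (index 0 is node 1 of the paper); the function name `reverse_topk_bruteforce` is an assumption made for this example.

```python
import numpy as np

# Proximity values as best read from Figure 1; column i holds the
# proximities from node i+1 to all nodes (0-based indices here).
P = np.array([
    [0.32, 0.24, 0.24, 0.18, 0.20, 0.19],
    [0.29, 0.38, 0.28, 0.31, 0.33, 0.30],
    [0.12, 0.17, 0.27, 0.13, 0.14, 0.13],
    [0.13, 0.10, 0.10, 0.23, 0.09, 0.14],
    [0.06, 0.04, 0.04, 0.10, 0.19, 0.06],
    [0.08, 0.07, 0.07, 0.05, 0.06, 0.20],
])

def reverse_topk_bruteforce(P, q, k):
    """Return every u whose k-th largest proximity is <= P[q, u]."""
    result = []
    for u in range(P.shape[1]):
        kth_largest = np.sort(P[:, u])[-k]   # p_u^{kmax}
        if P[q, u] >= kth_largest:
            result.append(u)
    return result

answer = reverse_topk_bruteforce(P, q=0, k=2)   # query: node 1 of the paper
```

With these values the call yields indices 0, 1, and 4, i.e., nodes 1, 2, and 5, which agrees with the reverse top-2 example given for node 1 in the introduction. The scan itself is cheap; the cost of BF lies in producing P in the first place.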
Several observations guide the design of an efficient reverse top-k RWR algorithm. First, the expected number of nodes in the answer set of a reverse top-k query is k; thus there is potential for building an index that can prune the majority of nodes which are definitely not in the answer set. Second, as noted in [4] and observed in our experiments, the power law distribution phenomenon applies to each proximity vector: typically, only a few entries have significantly large and meaningful proximities, while the remaining values are tiny. Third, we observe that verifying whether the query node q lies in the top-k proximity set of a certain node u is a far easier problem than computing the exact top-k set of node u; we can efficiently derive upper and lower bounds for the proximities from u to all other nodes and use them for verification. In the next section, we introduce our approach, which achieves significantly better performance than BF.
4. OUR APPROACH
Our method focuses on two aspects: (i) avoiding the computation of unnecessary top-k proximity sets and (ii) terminating the computation of each top-k proximity set as early as possible. The overall framework contains two parts: an offline indexing module (Section 4.1) and an online querying algorithm (Section 4.2).
4.1 Offline Indexing
For our index design, we assume that the maximum k in any query does not exceed a predefined value K, i.e., $k \le K$. For each node v, we compute proximity lower bounds to all other nodes and store the K largest bounds in a compact data structure. The index is relatively efficient to obtain, compared to computing the exact proximity matrix P. Given a query q and a $k \le K$, with the help of the index, we can prune nodes that are guaranteed not to have q in their top-k sets, thus avoiding a large number of unnecessary proximity vector computations. The index is stored in a compact format, so that it can fit in main memory even for large graphs. It also supports dynamic updating after a query has been completed; this way, its performance potentially improves for any future queries.
The lower bounds used in our index are based on the fact that, while running BCA from any node $u \in V$, each entry of $p^t_u$ at iteration t is monotonically increasing w.r.t. t; formally:
PROPOSITION 1. $\forall u, v \in V$, $p^1_u(v) \le p^2_u(v) \le \cdots \le p_u(v)$.
PROOF. See [29].
Thus, after each iteration t of BCA from $u \in V$, we have a lower bound $p^t_u(v)$ of the real proximity value $p_u(v)$ from u to any node $v \in V$. The following proposition shows that the k-th largest value in $p^t_u$ serves as a lower bound for the k-th largest proximity value in $p_u$:
PROPOSITION 2. Let $\hat{p}^t_u(k)$ be the k-th largest value in $p^t_u$ after t iterations of BCA from u. Let $\hat{p}_u(k)$ be the k-th largest value in $p_u$. Then, $\hat{p}^t_u(k) \le \hat{p}_u(k) = p^{kmax}_u$.
PROOF. See [29].
Note that this is a nice property of BCA, which is not present in alternative proximity vector computation techniques (i.e., PM and MCS). Besides, we observe that by running BCA from a node u, the high proximity values stand out after only a few iterations. Thus, to construct the index, we run an adapted version of BCA from each node $u \in V$ that stops after a few iterations t to derive a lower-bound proximity vector $p^t_u$. Only the K largest values of this vector are kept, in descending order, in a $\hat{p}^t_u(1:K)$ vector. Our index consists of all these lower bounds; in Section 4.2, we explain how it can be used for query evaluation. In the remainder of this subsection, we provide details about our hub selection technique (Section 4.1.1), our adaptation of BCA for deriving the lower-bound proximity vectors and constructing the index (Section 4.1.2), and a compression technique that reduces the storage requirements of the index (Section 4.1.3).
4.1.1 Hub Selection
The hub selection method in [7] runs BCA itself to find hubs; its efficiency thus heavily relies on the graph size and the number of selected hubs. We use a simpler approach, which is independent of these factors and hence can be used for large-scale graphs. We claim that nodes with high in-degree or out-degree are already good candidates for hubs. Therefore we define H as the union of the set of high in-degree nodes $H_{in}$ and the set of high out-degree nodes $H_{out}$. $H_{in}$ ($H_{out}$) is the set of B nodes in V with the largest in-degree (out-degree). In Section 5, we investigate choices for parameter B.
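The degree-based rule can be sketched in a few lines. The function name `select_hubs`, the tie-breaking by node id, and the 4-node matrix are assumptions for illustration; the paper does not specify how ties are resolved.

```python
import numpy as np

# Hypothetical column-stochastic matrix of a 4-node graph (0-based ids).
A = np.array([
    [0.0, 0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
])

def select_hubs(A, B):
    """H = union of the B highest in-degree and B highest out-degree nodes."""
    in_degree = (A > 0).sum(axis=1)     # row i counts edges j -> i
    out_degree = (A > 0).sum(axis=0)    # column j counts edges j -> i
    # A stable sort breaks ties by smaller node id (an assumption).
    h_in = np.argsort(-in_degree, kind="stable")[:B]
    h_out = np.argsort(-out_degree, kind="stable")[:B]
    return sorted(set(h_in.tolist()) | set(h_out.tolist()))

hubs_b1 = select_hubs(A, 1)
hubs_b2 = select_hubs(A, 2)
```

Because the two candidate sets may overlap, H contains between B and 2B nodes.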
4.1.2 BCA Adaptation
We propose an improved ink propagation strategy for BCA compared to those suggested by [7] and [2]. Instead of propagating a single node's residue ink at each iteration t, our strategy selects a subset of nodes $L_t$, which includes those having no less residue ink than a given propagation threshold $\eta$; i.e., $L_t = \{v \in V \setminus H \mid r^{t-1}_u(v) \ge \eta\}$. $\eta$ is selected such that only significant residue ink is propagated. The rules for updating $s^t_u$ and $p^t_u$ are the same as shown in Eq. (6) and (7), respectively. However, the updates of $w^t_u$ and $r^t_u$ are performed as follows:

$$w^t_u = \sum_{i \in L_t} \alpha\, r^{t-1}_u(i)\, e_i + w^{t-1}_u \tag{8}$$

$$r^t_u = \sum_{i \in L_t} (1 - \alpha)\, r^{t-1}_u(i)\, a_i + \left[ r^{t-1}_u - \sum_{i \in L_t} r^{t-1}_u(i)\, e_i \right] \tag{9}$$

To understand the advantage of our strategy, note that the main cost of BCA at each iteration consists of two parts. The first is the time spent for selecting nodes to propagate ink from and the second is the time spent on updating vectors $r^t_u$, $s^t_u$, and $w^t_u$.² Our approach reduces both costs. First, selecting a batch of nodes at a time significantly reduces the total remaining residue $\|r^t_u\|_1$ in a single iteration and greatly reduces the overall number of iterations and thus the total number of vector updates. Second, since at each iteration finding either a single node or a set of nodes to propagate ink from requires a linear scan of $r^{t-1}_u$, the total node selection time is also reduced.
Our BCA adaptation ends as soon as the total remaining ink $\|r^t_u\|_1$ is no greater than a residue threshold $\delta$. We observe that $\|r^t_u\|_1$ drops drastically in the first few iterations of BCA and then slowly in the later iterations. Thus, we select $\delta$ such that our BCA adaptation terminates after only a few iterations, deriving a rough approximation of $p_u$ that is already sufficient to prune the majority of nodes during search.
The complete lower bound indexing procedure is described by Algorithm 1. Let $t_u$ be the number of iterations until the termination of BCA from u and $t = [t_1, t_2, \ldots, t_n]$. The index resulting from Algorithm 1 is denoted by $\mathcal{I}^t = (\hat{P}^t, R^t, W^t, S^t, P_H)$, where $\hat{P}^t = [\hat{p}^{t_1}_1(1:K), \ldots, \hat{p}^{t_n}_n(1:K)]$ is the top-K lower bound matrix storing the K largest values of each $p^{t_u}_u$, $R^t = [r^{t_1}_1, \ldots, r^{t_n}_n]$ is the residue ink matrix, $W^t = [w^{t_1}_1, \ldots, w^{t_n}_n]$ is the non-hub retained ink matrix, $S^t = [s^{t_1}_1, \ldots, s^{t_n}_n]$ is the hub accumulated ink matrix and $P_H$ is the hub proximity matrix. Whenever the context is clear, we simply denote $p^{t_u}_u$ by $p^t_u$ and the index by $\mathcal{I} = (\hat{P}, R, W, S, P_H)$.
Algorithm 1 Lower Bound Indexing (LBI)
Input: Matrix A, number K, hubs H, residue threshold $\delta$, propagation threshold $\eta$.
Output: Index $\mathcal{I} = (\hat{P}, R, W, S, P_H)$.
for all $h \in H$ do
    Compute $p_h$ by power method or BCA;
for all nodes $u \in V$ do
    $t_u = 0$; $r^{t_u}_u = e_u$; $s^{t_u}_u = w^{t_u}_u = 0$;
    while $\|r^{t_u}_u\|_1 > \delta$ do
        $t_u = t_u + 1$;
        Update $r^{t_u}_u$, $s^{t_u}_u$, $w^{t_u}_u$ by Eq. (9), (6), (8);
    Compute $p^{t_u}_u$ by Eq. (7);
    $\hat{p}^{t_u}_u$ = top K entries of $p^{t_u}_u$ in descending order;
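A compact sketch of Algorithm 1 is given below, under several simplifying assumptions: dense NumPy arrays instead of sparse storage, a direct linear solve for the hub vectors instead of the power method, and ink that reaches a hub being moved into $s$ right away so that the residue norm only counts ink waiting at non-hubs. The function name `lower_bound_index`, the returned dictionary, and the 4-node matrix are hypothetical.

```python
import numpy as np

# Hypothetical column-stochastic matrix of a 4-node graph (0-based ids).
A = np.array([
    [0.0, 0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
])

def lower_bound_index(A, K, hubs, delta, eta, alpha=0.15):
    """Build the top-K lower-bound index with batch ink propagation."""
    n = A.shape[0]
    is_hub = np.zeros(n, dtype=bool)
    is_hub[list(hubs)] = True
    M = np.eye(n) - (1.0 - alpha) * A

    # Exact proximity vectors of the hubs (direct solve, for brevity).
    PH = np.zeros((n, n))
    for h in hubs:
        e_h = np.zeros(n)
        e_h[h] = 1.0
        PH[:, h] = np.linalg.solve(M, alpha * e_h)

    P_hat = np.zeros((K, n))
    R, W, S = np.zeros((n, n)), np.zeros((n, n)), np.zeros((n, n))
    for u in range(n):
        r = np.zeros(n)
        r[u] = 1.0
        s, w = np.zeros(n), np.zeros(n)
        s[is_hub] += r[is_hub]          # ink injected at a hub is parked in s
        r[is_hub] = 0.0
        while r.sum() > delta:
            batch = ~is_hub & (r >= eta)            # the set L_t
            if not batch.any():
                break                               # nothing above eta is left
            ink = np.where(batch, r, 0.0)
            w += alpha * ink                        # Eq. (8)
            r = r - ink + (1.0 - alpha) * (A @ ink) # Eq. (9)
            s[is_hub] += r[is_hub]                  # Eq. (6)
            r[is_hub] = 0.0
        p = w + PH @ s                              # Eq. (7)
        P_hat[:, u] = np.sort(p)[::-1][:K]          # K largest, descending
        R[:, u], W[:, u], S[:, u] = r, w, s
    return {"P_hat": P_hat, "R": R, "W": W, "S": S, "PH": PH}

index = lower_bound_index(A, K=2, hubs=[0], delta=0.1, eta=1e-4)
```

Since retained, accumulated, and residue ink always add up to one unit, each column of the approximation sums to one minus its remaining residue, and every stored value is a lower bound of the corresponding exact top-K proximity.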
Figure 2 illustrates the result of our indexing approach on the toy graph of Figure 1, for $\alpha = 0.15$. First, by setting B = 1, we select the two nodes with the highest in- and out-degrees to become hubs. These are nodes 1 and 2. For these two nodes the exact proximity vectors $p_1$ and $p_2$ are computed and stored in the hub proximity matrix $P_H = [p_1, p_2, 0, 0, 0, 0]$. For the remaining nodes, we run our BCA adaptation with propagation threshold $\eta = 10^{-4}$ and residue threshold $\delta = 0.8$, which results in the $p^{t_3}_3$ to $p^{t_6}_6$ vectors shown in the figure. Finally, we select from each of $\{p_1, p_2, p^{t_3}_3, \ldots, p^{t_6}_6\}$ the top-K values (for K = 3) and create the lower bound matrix $\hat{P} = [\hat{p}_1, \hat{p}_2, \hat{p}^{t_3}_3, \hat{p}^{t_4}_4, \hat{p}^{t_5}_5, \hat{p}^{t_6}_6]$, as shown in the figure. Note that $\|r^{t_3}_3\|_1 = \|r^{t_5}_5\|_1 = 0$ and $\|r^{t_4}_4\|_1 = \|r^{t_6}_6\|_1 = 0.36$.

² Recall that $p^t_u$ need not be updated at each iteration and is only computed at the end of BCA or when an approximation of $p_u$ should be obtained.

Figure 2: Example of top-3 lower bound index
(graph drawing omitted; nodes 1 and 2 are the hubs)

Proximity vectors (exact for the hubs, lower bounds for the other nodes):

           p1     p2     p3^t3   p4^t4   p5^t5   p6^t6
node 1    0.32   0.24   0.24    0.10    0.20    0.10
node 2    0.29   0.38   0.28    0.17    0.33    0.17
node 3    0.12   0.17   0.27    0.07    0.14    0.07
node 4    0.13   0.10   0.10    0.18    0.09    0.10
node 5    0.06   0.04   0.04    0.09    0.19    0.02
node 6    0.08   0.07   0.07    0.03    0.06    0.19

Top-3 lower bound matrix (each column sorted in descending order):

           p1     p2     p3^t3   p4^t4   p5^t5   p6^t6
1st       0.32   0.38   0.28    0.18    0.33    0.19
2nd       0.29   0.24   0.27    0.17    0.20    0.17
3rd       0.13   0.17   0.24    0.10    0.19    0.10
4.1.3 Compact Storage of the Index
The space complexity for the hub proximity matrix $P_H$ of $\mathcal{I}$ is $O(|H| n)$, where $|H|$ (n) is the number of hub (total) nodes. The matrix may not fit in memory if n and $|H|$ are large. We apply a compression technique for $P_H$, based on the observation that the values of a proximity vector follow a power law distribution; in each vector $p_h \in P_H$, the great majority of values are tiny; only a small percentage of these values are significantly large. Therefore, we perform rounding by zeroing all values lower than a given rounding threshold $\omega$. In our implementation, we choose an $\omega$ that can save much space without losing reverse top-k search precision.
If sufficient hubs are selected, matrices R, W, S are sparse, so the storage cost for the index $\mathcal{I}$ will mainly be due to $\hat{P}$ and the rounded $P_H$. The following theorem gives an estimation for the total index storage requirements after the rounding operation.
THEOREM 1. $\forall h \in H$, given rounding threshold $\omega$, if the values of $p_h$ follow a power law distribution, i.e., the sorted value $\hat{p}_h(i) \propto i^{-\beta}$, where $0 < \beta < 1$ is the exponent parameter, then the space required to store the whole index is $O\!\left(Kn + \left(\frac{1 - \beta}{\omega}\right)^{\frac{1}{\beta}} |H|\, n^{1 - \frac{1}{\beta}}\right)$.
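PROOF. Let $\hat{p}_h(i) = \gamma\, i^{-\beta}$. As

$$1 = \sum_{i=1}^{n} \hat{p}_h(i) = \gamma \sum_{i=1}^{n} i^{-\beta} \approx \gamma\, n^{1 - \beta} \int_0^1 x^{-\beta}\, dx = \frac{\gamma\, n^{1 - \beta}}{1 - \beta},$$

we have $\gamma \approx (1 - \beta)\, n^{\beta - 1}$ and $\hat{p}_h(i) \approx (1 - \beta)\, n^{\beta - 1}\, i^{-\beta}$. Let $\hat{p}_h(l_\omega) \ge \omega$; then we have

$$l_\omega \le \left(\frac{1 - \beta}{\omega}\right)^{\frac{1}{\beta}} n^{1 - \frac{1}{\beta}}.$$

Since fewer than $l_\omega$ entries need to be stored for a single hub node, we need $\left(\frac{1 - \beta}{\omega}\right)^{\frac{1}{\beta}} |H|\, n^{1 - \frac{1}{\beta}}$ space for $P_H$. Adding the top-K lower bound space requirement $O(Kn)$, the total index storage is $O\!\left(Kn + \left(\frac{1 - \beta}{\omega}\right)^{\frac{1}{\beta}} |H|\, n^{1 - \frac{1}{\beta}}\right)$.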
Let $\tilde{p}^{t_u}_u$ be the approximated proximities constructed by Eq. (7) with rounded hub proximities $\tilde{P}_H$. We can trivially show that Propositions 1 and 2 hold for $\tilde{p}^{t_u}_u$. Thus, $\tilde{p}^{t_u}_u$ is still an increasing lower bound of $p_u$ and $\tilde{p}^{t_u}_u$ can replace the $p^{t_u}_u$ in our index. In the following, we give a bound for the error caused by rounding.
PROPOSITION 3. Given rounding threshold $\omega$ and $\hat{p}_h(i) \approx \gamma\, i^{-\beta}$, where $\gamma = (1 - \beta)\, n^{\beta - 1}$, then for $\forall u \in V$,

$$\left\| p^{t_u}_u - \tilde{p}^{t_u}_u \right\|_1 \le 1 - \left(\frac{1 - \beta}{\omega\, n}\right)^{\frac{1}{\beta} - 1}.$$

PROOF. See [29].
We empirically observed (see Section 5) that our rounding approach can save huge amounts of space and that the real storage requirements are much smaller than the theoretical bound given by Theorem 1. Meanwhile, the actual error is much smaller than the theoretical bound given by Proposition 3 and, more importantly, it has minimal effect on the reverse top-k results. To keep the index notation uncluttered, we use $P_H$ to also denote the rounded hub proximities (i.e., $\tilde{P}_H$) and $p^{t_u}_u$ to denote the corresponding rounded proximity vectors $\tilde{p}^{t_u}_u$ computed using $\tilde{P}_H$.
4.2 Online Query Algorithm
This section introduces our online reverse top-k search technique. Given a query node $q \in V$, we perform search in two steps. First, we compute the exact proximity from each $u \in V$ to q using a novel and efficient method (Section 4.2.1). In the second step (Section 4.2.2), for each node u we use the index described in Section 4.1 to prune u or add u to the search result, by deriving a lower and an upper bound (denoted as $lb^t_u$ and $ub^t_u$) of u's k-th largest proximity value $p^{kmax}_u$ to other nodes and comparing it with its proximity to q. For nodes u that cannot be pruned or confirmed as results, we refine $lb^t_u$ and $ub^t_u$ using our index incrementally, until u is pruned or becomes a confirmed result. The refinement is used to update the index for faster future query processing (Section 4.2.3). In this section, we use $p_u$ and $p_{\cdot,u}$ interchangeably to denote the u-th column of the proximity matrix P; also note that $lb^t_u = \hat{p}^t_u(k)$ and $ub^t_u$ is the upper bound of $\hat{p}_u(k)$ (= $p^{kmax}_u$) w.r.t. $p^t_u$.
4.2.1 RWR Proximity to the Query Node
The first step of our method is to compute the exact proximities from all other nodes to the query $q$. Although a lot of previous work has focused on computing the proximities from a given node $u$ to all other nodes (i.e., a column $p_{*,u}$ of the proximity matrix $P$), there have only been a few efforts on how to find the proximities from all nodes to a node $q$ (i.e., a row $p_{q,*}$ of $P$). The authors of the SpamRank algorithm [6] suggest computing approximate proximity vectors $p_{*,u}$ for all $u \in V$, and taking all $p_{q,u}$ to form $p_{q,*}$. However, to get an exact result, which is our target, such a method would require the computation of the entire $P$ to a very high precision, which leads to unacceptably high cost. A heuristic algorithm is proposed in [8], which first selects the nodes with high probability to be the large proximity contributors to the query node, and then computes their proximity vectors. This method requires the computation of several proximity vectors $p_{*,u}$ to find only a subset of entries in $p_{q,*}$. [1] introduces a local search algorithm that examines only a small fraction of nodes, deriving, however, only an approximation of $p_{q,*}$.
Although it seems inevitable to compute the whole matrix $P$ to get the exact proximities from all nodes to $q$, we show that this problem can be solved by the power method and has the same complexity as calculating a single column of $P$. Our result is novel and constitutes an important contribution not only for the reverse top-k search problem that we study in this paper, but also for any problem that includes finding the proximities from all nodes to a given node.
For example, our method could be used as a module in SpamRank [6] to find the PageRank contributions that all nodes make to a given web page $q$ precisely and efficiently.
First of all, we note that $p_{q,*}$ is essentially the q-th row of $P$; hence, $p_{q,*} = e_q^T P = \alpha e_q^T (I - (1-\alpha)A)^{-1}$, or equivalently

$$p_{q,*}^T = (1-\alpha) A^T p_{q,*}^T + \alpha e_q$$

(see Section 2 for the definitions of $A$, $e_q$, and $e$). An interesting observation is that $p_{*,u}$ and $p_{q,*}$ are actually the solutions of the following linear systems, respectively:

$$x_u = (1-\alpha) A x_u + \alpha e_u \qquad (10)$$

$$x_q = (1-\alpha) A^T x_q + \alpha e_q \qquad (11)$$

which share the same structure, except that either $A$ or $A^T$ is used as the coefficient matrix. This similarity motivates us to apply the power method. Just as Eq. (10) can be solved by the iterative power method on matrix $[(1-\alpha)A + \alpha e_u e^T]$:

$$x_u^{i+1} = (1-\alpha) A x_u^i + \alpha e_u = [(1-\alpha)A + \alpha e_u e^T]\, x_u^i, \qquad (12)$$

we hope that the linear system (11) could be solved by the following iterative method:

$$x_q^{i+1} = (1-\alpha) A^T x_q^i + \alpha e_q. \qquad (13)$$
However, showing that the sequence generated by Eq. (13) converges to the solution of Eq. (11) is not trivial, as the proof of the convergence of Eq. (12) does not apply to Eq. (13). The main difference between the two is as follows. In Eq. (12), if $\|x_u^0\|_1 = 1$, then $\|x_u^i\|_1 = e^T x_u^i = 1$ for $i = 1, 2, \ldots$. Hence we can use the r.h.s. of Eq. (12) to show that $\{x_u^i\}_i$ is a power method series and thus converges. Conversely, the sequence $\{x_q^i\}_i$ is not non-expansive in the general case and we may have $\|x_q^{i+1}\|_1 > \|x_q^i\|_1$. In other words, we cannot transform Eq. (13) to the form of the r.h.s. of Eq. (12) to show that $\{x_q^i\}_i$ is a power method series, so there is no obvious guarantee that it will converge. We therefore have to prove that Eq. (13) converges to a unique vector, which is the solution of Eq. (11). Fortunately, using techniques very different from the original convergence proof of Eq. (12), we show that Eq. (13) indeed converges to a unique solution, from an arbitrary initialization.
Let us lift $x_q \in \mathbb{R}^n$ to the space $\mathbb{R}^{n+1}$ by introducing $z_q = \begin{bmatrix} x_q \\ 1 \end{bmatrix}$. The affine Equation (11) is now equivalent to

$$z_q = D_q z_q \qquad (14)$$

where $D_q = \begin{bmatrix} (1-\alpha)A^T & \alpha e_q \\ 0_{1 \times n} & 1 \end{bmatrix} \in \mathbb{R}^{(n+1) \times (n+1)}$. Then the first $n$ rows of (14) are exactly (11). Note that $z_q$ is an eigenvector of $D_q$ corresponding to eigenvalue 1. We will prove that $z_q$ is in fact the dominant eigenvector; therefore System (14) can be solved by the power method.
THEOREM 2. Let $\lambda_1$ and $\lambda_2$ be the two largest eigenvalues of $D_q$. Let $z_q^0 = [(x_q^0)^T, 1]^T$ and $z_q^* = [p_{q,*}, 1]^T \in \mathbb{R}^{n+1}$, where $x_q^0$ is any vector in $\mathbb{R}^n$, and let

$$z_q^{i+1} = D_q z_q^i = D_q^{i+1} z_q^0 \qquad (15)$$

then the following conclusions hold:
(a) $\lambda_1 = 1$ with multiplicity 1, $\lim_{i \to \infty} z_q^i = z_q^*$, and $\lim_{i \to \infty} x_q^i = p_{q,*}$.
(b) $\lambda_2 = 1 - \alpha$; the convergence rate of (15) and (13) is $1 - \alpha$.
(c) For convergence tolerance $\epsilon$, if $i > \log(\epsilon/\alpha) / \log(1-\alpha)$, then $\|z_q^{i+1} - z_q^i\|_1 = \|x_q^{i+1} - x_q^i\|_1 < \epsilon$.
PROOF. (a) Note that the row sum of $D_q$ cannot exceed 1. In fact, for $\alpha > 0$, it is obvious that the q-th row and the last row have row sum 1 and all other rows have row sum $1 - \alpha < 1$. So the spectral radius $\rho(D_q) \le \max_i \sum_j (D_q)_{ij} \le 1$. On the other hand, $z_q^* \ne 0$ satisfies Eq. (14), which implies that $z_q^*$ is an eigenvector of $D_q$ with eigenvalue 1. Thus, $\lambda_1 = \rho(D_q) = 1$.

Note that any eigenvector of eigenvalue 1 must be a fixed point of Eq. (14). Therefore, if we can show that the sequence $\{z_q^i\}_i$ converges to a nonzero point, it must be the unique eigenvector, and then the multiplicity of $\lambda_1$ is 1. In the following, we will prove that this statement is true. It is easy to verify that

$$D_q^i = \begin{bmatrix} (1-\alpha)^i (A^T)^i & \alpha \sum_{j=0}^{i-1} (1-\alpha)^j (A^T)^j e_q \\ 0_{1 \times n} & 1 \end{bmatrix}$$

Since $\|A^T\|_\infty = \rho(A^T) = 1$, it follows that $\|(1-\alpha)^i (A^T)^i\|_\infty \le (1-\alpha)^i \|A^T\|_\infty^i \le (1-\alpha)^i \to 0$, so

$$\lim_{i \to \infty} D_q^i = \begin{bmatrix} 0_{n \times n} & \alpha [I - (1-\alpha)A^T]^{-1} e_q \\ 0_{1 \times n} & 1 \end{bmatrix} = \begin{bmatrix} 0 & P^T e_q \\ 0 & 1 \end{bmatrix},$$

implying that

$$\lim_{i \to \infty} z_q^i = \lim_{i \to \infty} D_q^i z_q^0 = \begin{bmatrix} 0 & P^T e_q \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_q^0 \\ 1 \end{bmatrix} = \begin{bmatrix} p_{q,*}^T \\ 1 \end{bmatrix},$$

where $p_{q,*}^T = P^T e_q$. Hence $\lim_{i \to \infty} z_q^i = z_q^*$ and $\lim_{i \to \infty} x_q^i = p_{q,*}$. This also certifies that there is a unique convergence point of (15), so the multiplicity of $\lambda_1$ is 1.
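(b) Rewrite $D_q = (1-\alpha) \begin{bmatrix} A^T & 0_{n \times 1} \\ 0_{1 \times n} & 1 \end{bmatrix} + \alpha F_q$, where $F_q = \begin{bmatrix} 0_{n \times n} & e_q \\ 0_{1 \times n} & 1 \end{bmatrix}$. Let $\eta = \begin{bmatrix} 0_{n \times 1} \\ 1 \end{bmatrix}$. It is easy to verify that $D_q^T \eta = \eta$. As $\rho(D_q^T) = \rho(D_q) = 1$, $\eta$ is the eigenvector corresponding to the largest eigenvalue of $D_q^T$ and it is unique, since $D_q$ and $D_q^T$ have the same eigenvalue multiplicity. Now we leverage the following lemma to assist the rest of the proof.

LEMMA 1. (From page 4 of [27]) If $\xi_i$ is an eigenvector of $A$ corresponding to the eigenvalue $\lambda_i$, $\eta_j$ is an eigenvector of $A^T$ corresponding to $\lambda_j$, and $\lambda_i \ne \lambda_j$, then $\xi_i^T \eta_j = 0$.

By Lemma 1, the eigenvector $\xi$ of $D_q$ corresponding to the second largest eigenvalue must be orthogonal to $\eta$, i.e., $\xi^T \eta = 0$. By the structure of $\eta$, it must be true that $\xi = \begin{bmatrix} \zeta \\ 0 \end{bmatrix}$, where $\zeta$ is some vector in $\mathbb{R}^n$, which implies $F_q \xi = 0$. Hence, $D_q \xi = (1-\alpha) \begin{bmatrix} A^T & 0 \\ 0 & 1 \end{bmatrix} \xi = (1-\alpha) \begin{bmatrix} A^T \zeta \\ 0 \end{bmatrix}$. As $D_q \xi = \lambda_2 \xi$, we have $A^T \zeta = \frac{\lambda_2}{1-\alpha} \zeta$, indicating that $\zeta$ is an eigenvector of $A^T$. Since $A$ is a transition matrix, $\frac{\lambda_2}{1-\alpha} \le \rho(A) = 1$, so $\lambda_2 \le 1 - \alpha$. It is easy to verify that for $\xi = \begin{bmatrix} e_{n \times 1} \\ 0 \end{bmatrix}$, $D_q \xi = (1-\alpha)\xi$, so $\lambda_2 = 1 - \alpha$. In addition, the convergence rate of (15) is dictated by $|\lambda_2| / |\lambda_1| = 1 - \alpha$.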
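(c) Since $\{z_q^i\}_i$ is the power method series of $D_q$, we have $\|z_q^{i+1} - z_q^i\|_1 \le \left| \left( \frac{|\lambda_2|}{|\lambda_1|} \right)^i \left( 1 - \frac{|\lambda_2|}{|\lambda_1|} \right) \right| = \alpha (1-\alpha)^i$. Hence, $i > \log(\epsilon/\alpha) / \log(1-\alpha)$ leads to $\|z_q^{i+1} - z_q^i\|_1 < \epsilon$.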
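The spectral claims of Theorem 2 can be sanity-checked numerically on a small graph. The following is only an illustrative sketch: the helper names `lifted_matrix` and `top_two_eigenvalue_moduli` are hypothetical, and $A$ is assumed to be a dense column-stochastic NumPy array.

```python
from typing import Tuple

import numpy as np


def lifted_matrix(A: np.ndarray, q: int, alpha: float) -> np.ndarray:
    """Build D_q = [[(1 - alpha) A^T, alpha e_q], [0, 1]] as in Eq. (14)."""
    n = A.shape[0]
    D = np.zeros((n + 1, n + 1))
    D[:n, :n] = (1.0 - alpha) * A.T  # upper-left block
    D[q, n] = alpha                  # alpha * e_q in the last column
    D[n, n] = 1.0                    # bottom-right entry
    return D


def top_two_eigenvalue_moduli(D: np.ndarray) -> Tuple[float, float]:
    """Return the two largest eigenvalue moduli of D, largest first."""
    moduli = np.sort(np.abs(np.linalg.eigvals(D)))[::-1]
    return float(moduli[0]), float(moduli[1])
```

For a column-stochastic $A$ and $\alpha = 0.15$, the two returned values should be close to 1 and 0.85, matching parts (a) and (b) of the theorem.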
Theorem 2 shows that the sequence $\{x_q^i\}_i$ computed by Eq. (13) indeed converges and also gives the estimated number of iterations.
Since it is part of the power method series $\{z_q^i\}_i$, we can call Eq. (13) a power method; Algorithm 2 illustrates how to use it in solving System (11) and deriving $p_{q,*}$. Note that the algorithm terminates as soon as the series converges based on the convergence threshold $\epsilon$ (line 6). As it takes $O(m)$ operations in each iteration (where $m = |E|$ is the number of edges), the time complexity of the algorithm is $O\left( \frac{\log(\epsilon/\alpha)}{\log(1-\alpha)}\, m \right)$.
Algorithm 2 Power Method for Proximity to Node (PMPN)
Input: Matrix $A$, Query $q$, Convergence tolerance $\epsilon$.
Output: Proximities $p_{q,*}$ from all nodes to $q$.
1: Initialize $x_q^0$ as any vector in $\mathbb{R}^n$;
2: $i = 0$;
3: repeat
4:     $x_q^{i+1} = (1-\alpha) A^T x_q^i + \alpha e_q$;
5:     $i = i + 1$;
6: until $\|x_q^i - x_q^{i-1}\| < \epsilon$    // convergence of PMPN
7: $p_{q,*} = (x_q^i)^T$;
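Algorithm 2 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' Matlab implementation: $A$ is taken to be a dense column-stochastic array, the stopping rule uses the L1 norm as in Theorem 2(c), and the `max_iter` guard and the function name `pmpn` are additions made here.

```python
import numpy as np


def pmpn(A: np.ndarray, q: int, alpha: float = 0.15, eps: float = 1e-10,
         max_iter: int = 10_000) -> np.ndarray:
    """Approximate row q of P = alpha * (I - (1 - alpha) A)^-1 via Eq. (13).

    A is assumed column-stochastic: column u holds the transition
    probabilities out of node u, matching Eq. (10).
    """
    n = A.shape[0]
    e_q = np.zeros(n)
    e_q[q] = 1.0
    x = np.zeros(n)  # any starting vector works, by Theorem 2
    for _ in range(max_iter):
        x_next = (1.0 - alpha) * (A.T @ x) + alpha * e_q  # Eq. (13)
        if np.abs(x_next - x).sum() < eps:  # L1 stopping rule (line 6)
            return x_next
        x = x_next
    raise RuntimeError("PMPN did not converge within max_iter iterations")
```

Entry `u` of the returned vector is the proximity from node `u` to `q`. For large sparse graphs one would presumably store $A$ in a sparse format so that each iteration costs $O(m)$, as the complexity analysis assumes.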
4.2.2 Upper Bound for the k-largest Proximity
After having computed $p_{q,*}$, we know for each $u \in V$ the exact proximity $p_u(q)$ ($= p_{q,u}$) from $u$ to $q$. Now, we access the k-th row of the lower bound matrix $\hat{P}$ of the index (see Section 4.1) and prune all nodes $u$ for which $lb_u^t = \hat{p}_u^t(k) > p_u(q)$. Obviously, if the k-th largest lower bound from $u$ to any other node exceeds $p_u(q)$, then it is not possible for $q$ to be in the set of $k$ closest nodes to $u$. For each node $u$ that is not pruned, we compute an upper bound $ub_u^t$ for the k-th largest proximity from $u$ to any other node, using the information that we have about $u$ in the index. If $p_u(q) \ge ub_u^t$, then $u$ is definitely in the answer set of the reverse top-k query. Otherwise, node $u$ needs further processing.
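We now show how to compute $ub_u^t$ for a node $u$. Note that from the index, we have the descending top-K lower bound list $\hat{p}_u^t$ and the residue ink vector $r_u^t$. For $j = 1, 2, \ldots, k-1$, let

$$\Delta_j^t = \hat{p}_u^t(j) - \hat{p}_u^t(j+1) \qquad (16)$$

$$z_0^t = 0, \quad \text{and} \quad z_j^t = z_{j-1}^t + j \Delta_{k-j}^t, \quad 1 \le j \le k-1 \qquad (17)$$

Then,

$$ub_u^t = \begin{cases} \hat{p}_u^t(k-j) - \dfrac{z_j^t - \|r_u^t\|_1}{j}, & \text{if } \exists j \in [1, k-1] \text{ s.t. } z_{j-1}^t < \|r_u^t\|_1 \le z_j^t \\[2ex] \hat{p}_u^t(1) + \dfrac{\|r_u^t\|_1 - z_{k-1}^t}{k}, & \text{if } \|r_u^t\|_1 > z_{k-1}^t \end{cases} \qquad (18)$$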
Figures 3 and 4 illustrate the intuition and the derivation of $ub_u^t$. Assume that $k = 5$ and the first $k$ values of $\hat{p}_u^t$ are as shown on the left of the figures, while the total remaining ink $\|r_u^t\|_1$ is shown on the right of the figures. The best possible case for the k-th value of $p_u$ is when $\|r_u^t\|_1$ is distributed such that (i) only the first $k$ values may receive some ink, while all others receive zero ink, and (ii) the ink is distributed in a way that maximizes the updated k-th value. To achieve (ii), $\hat{p}_u^t$ could be viewed as a staircase, the $k$ highest steps of which fit tightly in a container. If we pour the total residue ink $\|r_u^t\|_1$ into the container, the level of the ink will correspond to the value of $ub_u^t$. $\Delta_j^t$ is the difference between the j-th and the (j+1)-th step of the staircase, while $z_j^t$ is the ink that must be poured for its level in the container to reach the (k-j)-th step. The first line of Eq. (18) corresponds to the case illustrated by Figure 3, where $ub_u^t$ is smaller than $\hat{p}_u^t(1)$, while the example of Figure 4 corresponds to the case of the second line, where the whole staircase is covered by residue ink ($\|r_u^t\|_1 > z_{k-1}^t$).

[Figure 3: Upper bound, $k = 5$, $z_2^t < \|r_u^t\|_1 \le z_3^t$]

[Figure 4: Upper bound, $k = 5$, $\|r_u^t\|_1 > z_4^t$]
The following proposition states that $ub_u^t$ is indeed an upper bound of the real k-th largest value $p_u^{kmax}$ and is monotonically decreasing as $p_u^t$ ($\hat{p}_u^t$) is refined by later iterations.
PROPOSITION 4. $\forall u \in V$, $ub_u^1 \ge ub_u^2 \ge \cdots \ge p_u^{kmax}$.
PROOF. See [29].
Algorithm 3 is a procedure for deriving the upper bound $ub_u^t$, given $\hat{p}_u^t$, $\|r_u^t\|_1$, and $k$. The algorithm simulates pouring $\|r_u^t\|_1$ into the container by gradually computing the $z_j^t$ values for $j = 1, 2, \ldots, k-1$, until $z_{j-1}^t < \|r_u^t\|_1 \le z_j^t$, which indicates that the residue ink $\|r_u^t\|_1$ can level up to $\hat{p}_u^t(k-j)$. If $\|r_u^t\|_1 > z_{k-1}^t$, the whole staircase is covered and the algorithm computes $ub_u^t$ by the second line of Eq. (18). The complexity of Algorithm 3 is $O(k)$, which is quite low compared to other modules.
Algorithm 3 Upper Bound Computation (UBC)
Input: Matrix $A$, Number $k$, Node $u$, Lower bound vector $\hat{p}_u^t$, Residue ink vector $r_u^t$.
Output: Upper bound $ub_u^t$ of the k-th largest proximity from $u$.
1: $z_0^t = 0$;
2: for $j = 1$ to $k-1$ do
3:     Compute $\Delta_{k-j}^t$ by Eq. (16);
4:     Compute $z_j^t$ by Eq. (17);
5:     if $z_{j-1}^t < \|r_u^t\|_1 \le z_j^t$ then
6:         Compute $ub_u^t$ by first line of Eq. (18);
7:         return $ub_u^t$;
8: Compute and return $ub_u^t$ by second line of Eq. (18);
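The staircase computation of Algorithm 3 can be sketched as follows. The function name `upper_bound` and its argument layout are hypothetical; the early return for zero residue is an assumption added here (Eq. (18) has no case for it, and Algorithm 4 handles that situation before calling UBC).

```python
from typing import Sequence


def upper_bound(p_hat: Sequence[float], residue_l1: float, k: int) -> float:
    """Upper bound on the k-th largest proximity, following Eqs. (16)-(18).

    p_hat: lower bounds sorted in descending order (at least k entries).
    residue_l1: total remaining residue ink, i.e. the L1 norm of r.
    """
    if k < 1 or len(p_hat) < k:
        raise ValueError("need 1 <= k <= len(p_hat)")
    if residue_l1 <= 0.0:
        return p_hat[k - 1]  # assumption: no ink left, lower bound is exact
    z_prev = 0.0  # z_0
    for j in range(1, k):
        # Delta_{k-j} = p_hat(k-j) - p_hat(k-j+1) in 1-based indexing, Eq. (16)
        delta = p_hat[k - j - 1] - p_hat[k - j]
        z_j = z_prev + j * delta  # Eq. (17)
        if z_prev < residue_l1 <= z_j:
            # The ink level stops between steps: first line of Eq. (18)
            return p_hat[k - j - 1] - (z_j - residue_l1) / j
        z_prev = z_j
    # The ink covers the whole staircase: second line of Eq. (18)
    return p_hat[0] + (residue_l1 - z_prev) / k
```

For example, with `p_hat = [0.5, 0.4, 0.2, 0.1]` and `k = 3`, a residue of 0.3 first lifts the third step from 0.2 to 0.4 (using 0.2 of ink) and then spreads the remaining 0.1 over two steps, so the bound is approximately 0.45.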
4.2.3 Candidate Refinement and Index Update
When $\hat{p}_u^t(k) \le p_{q,u} < ub_u^t$, we cannot be sure whether $u$ is a reverse top-k result or not and we need to further refine the bounds
$\hat{p}_u^t(k)$ and $ub_u^t$. First, we apply one step of BCA in continuing the computation of $p_u^t$ and update $\hat{p}_u^t$ (lines 6-7 of Algorithm 1). Then, we apply Algorithm 3 to compute a new $ub_u^t$. This step-wise refinement process is repeated while $\hat{p}_u^t(k) \le p_{q,u} < ub_u^t$; it stops once (i) $p_{q,u} < \hat{p}_u^t(k)$, which means that $q$ is not contained in the top-k list of $u$, or (ii) $p_{q,u} \ge ub_u^t$, which means that $u$ definitely has $q$ as one of its top-k nearest nodes. In our empirical study, we observed that for most of the candidates $u$, the process terminates well before the lower and upper bounds approach the exact value $p_{q,u}$. Thus, many computations are saved.
If, due to a reverse top-k search, $p_u^t$ ($\hat{p}_u^t$) has been updated, we dynamically update the index to include this change. In addition, we update the corresponding stored values for $r_u^t$, $s_u^t$, and $w_u^t$. Due to this update, future queries will use tighter lower and upper bounds for $u$.
The complete online query (OQ) method is summarized by Algorithm 4. After computing the exact proximities to $q$ (line 1), the algorithm examines all $u \in V$. While a node $u$ is a candidate based on the lower bound $\hat{p}_u^{t_u}(k)$ (line 4), we first check (line 5) whether the lower bound is the actual proximity (this happens when $\|r_u^{t_u}\|_1 = 0$); in this case, $u$ is added to the result set $C$ and the loop breaks. Otherwise, the upper bound $ub_u^{t_u}$ is computed (line 8) to verify whether $u$ can be confirmed to be a result; if $u$ is not a result (line 13), lines 6-7 of Algorithm 1 are run to refine $\hat{p}_u^{t_u}(k)$; after the update, the lower bound condition is re-checked to see whether $u$ can be pruned or another loop is necessary. Note that the update, besides increasing the value of $\hat{p}_u^{t_u}(k)$ (i.e., increasing the chances for pruning), also reduces $ub_u^{t_u}$; therefore the revised upper bound $ub_u^{t_u}$ may render $u$ a query result.
Algorithm 4 Online Query (OQ)
Input: Matrix $A$, Query $q$, Number $k$, Index $I$.
Output: Reverse top-k Set $C$ of $q$, Updated Index $I$.
1: Compute the exact proximities $p_{q,*}$ by Algorithm 2;
2: Initialize $C = \emptyset$;
3: for all $u \in V$ do
4:     while $p_{q,u} \ge \hat{p}_u^{t_u}(k)$ do
5:         if $\|r_u^{t_u}\|_1 = 0$ then
6:             $C = C \cup \{u\}$;    // $\hat{p}_u^{t_u}(k) = p_u(k)$, so $u$ is a result
7:             break;
8:         Compute $ub_u^{t_u}$ by Algorithm 3;
9:         if $p_{q,u} \ge ub_u^{t_u}$ then
10:            $C = C \cup \{u\}$;    // $u$ becomes a result
11:            break;
12:        else
13:            Update $\hat{p}_u^{t_u}(k)$ by Algorithm 1;
14: Save the updated $\hat{P}$, $R$, $W$, $S$ to $I$;
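The control flow of Algorithm 4 can be sketched independently of how the index is stored. In the sketch below, the four callbacks (`lower_k`, `residue_l1`, `upper_k`, `refine`) are hypothetical stand-ins for index lookups, Algorithm 3, and one refinement step of Algorithm 1; the `max_refinements` guard is also an addition made here, not part of the paper.

```python
from typing import Callable, Sequence, Set


def online_query(p_to_q: Sequence[float],
                 lower_k: Callable[[int], float],
                 residue_l1: Callable[[int], float],
                 upper_k: Callable[[int], float],
                 refine: Callable[[int], None],
                 max_refinements: int = 1000) -> Set[int]:
    """Return the reverse top-k set of q, given proximities p_to_q[u] from u to q.

    lower_k(u):    current lower bound on u's k-th largest proximity.
    residue_l1(u): L1 norm of u's residue ink (0 means the bound is exact).
    upper_k(u):    current upper bound on u's k-th largest proximity.
    refine(u):     one refinement step; must tighten the bounds for u.
    """
    results: Set[int] = set()
    for u in range(len(p_to_q)):
        steps = 0
        while p_to_q[u] >= lower_k(u):      # line 4: u is still a candidate
            if residue_l1(u) == 0.0:        # line 5: lower bound is exact
                results.add(u)
                break
            if p_to_q[u] >= upper_k(u):     # line 9: confirmed by upper bound
                results.add(u)
                break
            if steps >= max_refinements:    # safety guard (assumption)
                raise RuntimeError(f"bounds for node {u} did not separate")
            refine(u)                       # line 13: tighten and re-check
            steps += 1
    return results
```

A node leaves the `while` loop either by being added to the result set or because refinement pushed its lower bound above its proximity to $q$, which corresponds to pruning.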
We now illustrate OQ with our running example. Consider the graph and the constructed index shown in Figure 2. Assume that $q = 1$ (i.e., the query node is node 1 in the graph) and $k = 2$. The first step is to compute $p_{q,*}$ using Algorithm 2; the result is $p_{q,*} = [0.32, 0.24, 0.24, 0.19, 0.20, 0.18]$. Now OQ loops through all nodes $u$ and checks whether $p_{q,u} \ge \hat{p}_u^{t_u}(k)$. For the first node $u = 1$, we have $0.32 > 0.28$ and $\hat{p}_u^{t_u}(k)$ is the actual proximity $p_u(k)$ (recall that node 1 is a hub in our example, whose proximities to other nodes have been computed); thus node 1 is a result. The same holds for $u = 2$ ($0.24 \ge 0.24$ and node 2 is a hub). For $u = 3$, observe that $p_{q,u} < \hat{p}_u^{t_u}(k)$ (i.e., $0.24 < 0.27$); therefore node 3 is safely pruned (i.e., OQ does not enter the while loop for $u = 3$). Node $u = 4$ satisfies $p_{q,u} \ge \hat{p}_u^{t_u}(k)$ ($0.19 > 0.17$) and $\|r_u^{t_u}\|_1 > 0$; therefore the upper bound $ub_u^{t_u} = 0.36$ is computed by Algorithm 3. However, $p_{q,u} < ub_u^{t_u}$, so we are still uncertain whether node 4 is a reverse top-k result. A loop of Algorithm 1 is run to update $\hat{p}_u^{t_u}(k)$ to 0.23 (line 13); now node 4 is pruned because $p_{q,u} < \hat{p}_u^{t_u}(k)$ ($0.19 < 0.23$). Continuing the example, node 5 is immediately added to the result since $p_{q,5} = \hat{p}_5^{t_5}(k)$ and $\|r_5^{t_5}\|_1 = 0$, whereas node 6 is pruned after the refinement of $\hat{p}_6^{t_6}$.
The following theorem shows the time complexity of OQ.
THEOREM 3. The time complexity of OQ in the worst case is

$$O\left( \frac{\log(\epsilon/\alpha) + |Cand| \log(\eta/\delta)}{\log(1-\alpha)}\, m \right)$$

where $\epsilon$ is the convergence threshold of Algorithm 2, $\delta$ is the residue threshold and $\eta$ is the propagation threshold of Algorithm 1, $Cand$ is the set of candidates that could not be pruned immediately by the index, and $m = |E|$ is the number of graph edges.
PROOF. The cost of a query includes the cost of Algorithm 2, which is $O\left( \frac{\log(\epsilon/\alpha)}{\log(1-\alpha)}\, m \right)$, as discussed in Section 4.2.1, and the cost of examining and refining the candidates (lines 2 to 14 of OQ). The worst case is that all nodes in $Cand$ cannot be pruned or confirmed as results until we compute their exact k-th largest proximity values by repeating line 7 of Algorithm 1, i.e., until the maximum residue ink $\max_i \{r_u^{t_u}(i)\}$ at any node drops below $\eta$. Within an iteration, the update of one node $u$ requires at most $O(m)$ operations. Besides, each iteration is expected to shrink $\max_i \{r_u^{t_u}(i)\}$ by a factor around $(1-\alpha)$. Recall that $\max_i \{r_u^{t_u}(i)\} \le \delta$; hence the total number of iterations $\tau$ required to terminate BCA by making it smaller than $\eta$ satisfies $\max_i \{r_u^{t_u}(i)\} (1-\alpha)^\tau \le \eta$, i.e., $\tau \ge \frac{\log\left(\eta / \max_i \{r_u^{t_u}(i)\}\right)}{\log(1-\alpha)}$, and this threshold is at most $\frac{\log(\eta/\delta)}{\log(1-\alpha)}$. Therefore, the total time complexity in the worst case is $O\left( \frac{\log(\epsilon/\alpha) + |Cand| \log(\eta/\delta)}{\log(1-\alpha)}\, m \right)$.
As we show in Section 5.3, in practice, $|Cand|$ is extremely small compared to $n$ and most of the candidates can be pruned or confirmed within significantly fewer than $\tau$ iterations. Hence, the empirical performance of OQ is far better than the worst case.
5. EXPERIMENTAL EVALUATION
This section experimentally evaluates our approach, which is implemented in Matlab 2012b. Our testbed is a cluster of 500 AMD Opteron cores (800MHz per core) with a total of 1TB RAM. Since our indexing algorithm can be fully parallelized (i.e., the approximate proximity vectors of nodes are computed independently), we evenly distributed the workload to 100 cores to implement the indexing task. Each online query was executed on a single core and the memory used at runtime corresponds to the size of our index (i.e., at most a few GBs as reported in Table 2). Hence, our solution can also run on a commodity machine.
5.1 Datasets
We conducted our efficiency experiments on a set of unlabeled graphs. The numbers $n = |V|$ and $m = |E|$ of nodes and edges of each graph are shown in Table 2. Web-stanford-cs$^3$ and Web-stanford$^4$ were crawled from stanford.edu. Each node is a web domain and a directed link stands for a hyperlink between two nodes. Epinions$^4$ is a who-trusts-whom social network from the consumer review site epinions.com; each node is a member of this site, and a directed edge $i \to j$ means that member $i$ trusts $j$. Web-google$^4$ is a web graph collected by Google. Experiments on two additional datasets are included in [29].
$^3$ law.di.unimi.it/datasets.php
$^4$ snap.stanford.edu/data/
5.2 Index Construction
We first evaluate the cost for constructing our index (Section 4.1) and its storage requirements for different graphs and sizes $|H|$ of hub sets. After tuning, we set the index construction parameters (see Section 4.1) as follows: propagation threshold $\eta = 10^{-4}$, residue threshold $\delta = 0.1$, hub vector rounding threshold $\omega = 10^{-6}$ for the first three graphs, and $\omega = 5 \cdot 10^{-6}$ for the largest one. In all cases, $K = 200$, the convergence threshold $\epsilon = 10^{-10}$ and the restart parameter $\alpha = 0.15$. For a study on the effect of the different parameters and the rationale of choosing their default values, see [29].
Table 2 shows the index construction time for different graphs, for various values of the hub selection parameter $B$, which result in different sizes $|H|$ of hub sets. The last column shows the time to compute the entire proximity matrix $P$ and its size on disk, which represents the brute-force (BF) approach of just pre-computing and using $P$ for reverse top-k queries. The value in parentheses in the last column is the minimum possible cost for our index, derived by just storing the top-K lower bound matrix $\hat{P}$ and disregarding the storage of the hub proximities $P_H$ and matrices $R$, $W$, and $S$. The last three rows for each graph show the space that our index would have if we had not applied the compression technique discussed in Section 4.1.3, the actual space of our index, and the predicted space according to our analysis in Section 4.1.3 (i.e., using Theorem 1 with $\beta = 0.76$, as indicated by [4]). The reported times sum up the running time at each core, assuming the worst case of having just one single-core machine. Note that the actual time is roughly the reported time divided by the number of cores (100).
We observe that the best number of hubs to select in terms of both index construction cost and index size on disk depends on the graph sparsity. For Web-stanford-cs, which is a sparse graph, it suffices to select less than 1% of the nodes with the highest in- and out-degrees as hubs, while for the denser Epinions and Web-stanford graphs 1%-2% of the nodes should be selected. The index construction is much faster than the entire $P$ computation, especially for larger and sparser graphs (e.g., for Web-google it takes as little as 1.8% of the time to construct $P$). The time is not affected too much by the number of selected hubs.
The same observation also holds for the size of our index, which is much smaller than the entire $P$ and a few times larger than the baseline storage of the top-K lower bound matrix $\hat{P}$. Although our index also stores the hub matrix $P_H$ and matrices $R$, $W$, and $S$, its space is reasonable; the index can easily be accommodated in the main memory of modern commodity hardware. The predicted space according to our analysis is in most cases an overestimation, due to an under-estimation of the power law effect on the sparsity of proximity matrices. Note that our rounding approach generally achieves significant space savings, especially on large graphs (e.g., Web-google). For each dataset, the index that we are using in subsequent experiments is marked in bold.
5.3 Online Query Performance
We now evaluate the performance of our index and our on-line
reverse top-k algorithm. We run a sequence of 500 queries on the
indexes created in Section 5.2 and report average statistics for them.
Query Efficiency. Figure 5 shows the average runtime cost of reverse top-k queries on different graphs, for different values of $k$ and with different options for using the index. Series "update" denotes that after each query is evaluated, the index is updated to save the changes in the $\hat{P}$, $R$, $W$, and $S$ matrices, while "no-update" means that the original index is used for each of the 500 queries. We separated these cases in order to evaluate whether our index update policy brings benefits to subsequent queries, which are applied on a more refined index. The "update" case also bears the cost of updating the corresponding matrices. In either case, query evaluation is very fast compared to the brute-force approach of computing the entire $P$ for each graph (the time needed for this is already reported in the last column of Table 2). The update policy results in a significant reduction of the average query time in small and dense graphs; however, for larger and sparser graphs the index update has a marginal effect on the query time, because there is a higher chance that subsequent queries are less dependent on the index refinement done by previous ones. Note that the workload includes 500 queries, which is a small number compared to the size of the graphs; we expect that for larger workloads the difference will be amplified on large graphs.

Table 2: Index construction time and space cost

Web-stanford-cs (|V| = 9914, |E| = 36854)
B                    50        100       200       300       BF (entire P)
|H|                  82        175       355       530
time (s)             31.5      31.6      34.2      40.4      365.5
no rounding (MB)     55.2      57.4      65.3      77.9
actual space (MB)    39.6      41.8      49.7      62.4      786 (15.8)
pred. space (MB)     44.7      93.5      188       280

Epinions (|V| = 75879, |E| = 508837)
B                    1000      1500      2000      3000      BF (entire P)
|H|                  1484      2101      2690      3853
time (s)             15827     12285     11565     10792     139860
no rounding (MB)     2778      2309      2284      2721
actual space (MB)    2310      1696      1538      1716      46071 (121)
pred. space (MB)     4220      5924      7551      10763

Web-stanford (|V| = 281903, |E| = 2312497)
B                    1000      1500      2000      3000      BF (entire P)
|H|                  1932      2866      3804      5586
time (s)             85503     89196     97462     111200    3263500
no rounding (MB)     6506      8237      10209     14069
actual space (MB)    1907      1639      1595      1638      635754 (451)
pred. space (MB)     3977      5681      7393      10645

Web-google (|V| = 875713, |E| = 5105039)
B                    5000      10000     20000     50000     BF (entire P)
|H|                  9598      18871     37148     86246
time (s)             1024200   1107400   2206300   2865300   60162000
no rounding (MB)     73362     137113    264315    607615
actual space (MB)    5387      4727      4888      6897      6718720 (1466)
pred. space (MB)     2874      4298      7103      14639
Pruning Power of Bounds. Figure 6 shows, for the same queries and the update case only, the average number (per query) of the candidates that are not immediately filtered using the lower bounds of the index and also the number of nodes from these candidates that are immediately identified as hits (i.e., results) after their upper bound computation. This means that only (candidates - hits) nodes (i.e., columns of $\hat{P}$) need to be refined on average for each query. We also show the average number of actual results for each experimental setting. The plots show that the number of candidates is in the order of $k$ and a significant percentage of them are immediately identified as results (based on their upper bounds) without needing refinement, a fact that explains the efficiency of our approach. In addition, the cost required for the refinement of these candidates is much lower compared to the cost for computing their exact proximity vectors. For example, computing the exact proximity vector $p_u$ for a node $u$ in Web-google takes more than 65 seconds, while our method requires just 0.15 seconds on average to refine a candidate in a reverse top-100 query on the same graph. Another observation is that in some graphs, like Web-stanford-cs and Web-google, the number of hits is very close to the number of results. This suggests that when the accuracy demand is not high, an approximate query algorithm, which only takes the hits as the result and stops further exploration, would save even more time.
[Figure 5: Search performance on different graphs, varying k. Panels: (a) Web-stanford-cs, (b) Epinions, (c) Web-stanford, (d) Web-google. x-axis: k (5, 10, 20, 50, 100); y-axis: query time (s); series: update, no-update.]
[Figure 6: Number of candidates and immediate hits on different graphs, varying k. Panels: (a) Web-stanford-cs, (b) Epinions, (c) Web-stanford, (d) Web-google. x-axis: k (5, 10, 20, 50, 100); y-axis: node number; series: cand, hits, result.]
Effectiveness of Index Refinement. Figure 7 shows the cost of individual reverse top-100 queries in the 500-query workload on the Web-stanford graph, with and without the index update option. Obviously, some queries are harder than others, depending on the number of candidates that should be refined and the refinement cost for them. We observe an increase in the gap between the query costs as the query ID increases, which is due to the fact that as the index gets updated, the next queries in the sequence are likely to take advantage of the update to avoid redundant refinements (which would have to be performed if the index were not updated). For the queries that take advantage of the updates (i.e., the ones toward the end of the sequence), the cost is much lower compared to the case where they are run against a non-updated index. In the following, all experiments refer to the update case, i.e., the index is updated after each query evaluation.
Cumulative Cost. Figure 8 compares the cumulative cost of a workload that includes all nodes from the Web-stanford-cs graph as queries with the cumulative cost of two versions of the BF method on the same workload ($k = 10$). The infeasible BF method (IBF) first constructs the exact $P$ matrix, keeps the exact top-K proximity values for each node $u$, and then evaluates each reverse top-k query $q$ at the minimal cost of accessing the q-th row of $P$ and the k-th proximity value for each $u \in V$. However, since IBF requires materializing the whole $P$ in memory (e.g., 6.7TB for Web-google), it becomes infeasible for large graphs. An alternative, feasible BF (FBF) method computes the entire $P$, but keeps in memory only the exact top-K proximities of each node. Then, at query evaluation, FBF uses our approach in Section 4.2.1 to compute the exact RWR proximities to the query node from each node in the graph and then uses the exact pre-computed proximities to verify the reverse top-k results. As the figure shows, IBF has a high initial cost for computing $P$ and afterward the cost for each query is very low. FBF bears the same overhead as IBF to compute $P$, but requires longer query time. Our approach has a small initial overhead for constructing our index and thereafter a modest cost for evaluating each query and updating the index. From the figure, we can see that the cumulative cost of our method is always lower than that of FBF and lower than that of IBF for the first 60% of the queries. (We emphasize again that IBF is infeasible for large graphs.) Besides, in practice, reverse top-k search is only applied on a small percentage of nodes (e.g., less than 10%); thus, its cumulative cost is low even when compared to that of IBF. In summary, the overhead of computing $P$ in both versions of BF is very high, especially for large graphs, given the fact that not too many reverse top-k queries are issued in practice.
Rounding Effect. We also tested the effect on the query results of using the rounded hub proximity matrix $\tilde{P}_H$ in our index instead of the exact hub proximity matrix $P_H$ (see Section 4.1.3). We used the 500-query workload on the Web-stanford-cs graph and, for each query, we recorded the Jaccard similarity $\frac{|R_1 \cap R_2|}{|R_1 \cup R_2|}$ between the exact query results $R_1$ when using $P_H$ and the results $R_2$ when using $\tilde{P}_H$ (i.e., our compressed index). Figure 9 plots the average similarity between the results of the same query when using $P_H$ or $\tilde{P}_H$, for different values of $k$ and the rounding threshold $\omega$. Observe that for $\omega = 10^{-5}$ or smaller (as adopted in our setting), the results obtained with $\tilde{P}_H$ for different $k$ are exactly the same as those obtained with $P_H$. Even a larger threshold $\omega = 10^{-4}$ achieves an average precision of around 99% for all the tested $k$ values. Thus, the rounding technique (Section 4.1.3) loses almost no accuracy, while saving a lot of space, as indicated by the results of Table 2.
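The similarity measure used in this comparison is straightforward to compute. A minimal sketch follows; the function name `jaccard` and the convention for two empty result sets are assumptions made here.

```python
from typing import Iterable


def jaccard(r1: Iterable[int], r2: Iterable[int]) -> float:
    """Jaccard similarity |R1 & R2| / |R1 | R2| of two reverse top-k result sets."""
    a, b = set(r1), set(r2)
    if not a and not b:
        return 1.0  # assumption: two empty result sets count as identical
    return len(a & b) / len(a | b)
```

For instance, the result sets {1, 2, 3} and {2, 3, 4} share two of four distinct nodes, giving a similarity of 0.5; identical sets give 1.0.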
5.4 Search Effectiveness
The experiments of this section demonstrate the effectiveness of
reverse RWR top-k search in some real graph-based applications.
Spam detection. Webspam
5
is a web host graph containing
11402 web hosts, out of which, 8123 are manually labeled as nor-
mal, 2113 are spam, and the remaining ones are undecided.
There are 730774 directed edges in the graph. We verify the use of
reverse RWR top-k search on spam detection by applying reverse
top-5 search on all the spam and normal nodes, and check what
5
barcelona.research.yahoo.net/webspam/datasets/uk2006/
410
0 100 200 300 400 500
0
50
100
150
200
250
300
Query ID
Q
u
e
r
y

T
i
m
e
(
s
)

update
noupdate
Figure 7: Cost of individual queries
0 2000 4000 6000 8000 10000
0
100
200
300
400
500
600
700
Number of queries
A
c
c
u
m
u
l
a
t
e
d

q
u
e
r
y

t
i
m
e
(
s
)

Infeasible Brute Force (IBF)
Feasible Brute Force (FBF)
Our method
Figure 8: Cumulative cost in a workload
0.98
0.985
0.99
0.995
1
1.005
5 10 20 50 100
k
R
e
s
u
l
t

s
i
m
i
l
a
r
i
t
y

= 10
4
= 10
5
= 10
6
Figure 9: Effect of rounding
author reverse top-5 size # coauthors
Philip S. Yu 2020 231
Jiawei Han 2007 253
Christos Faloutsos 1932 221
Zheng Chen 162 137
Qiang Yang 161 166
Daphne Koller 157 98
C. Lee Giles 155 132
Gerhard Weikum 149 130
Michael I. Jordan 147 125
Bernhard Sch olkopf 140 134
Table 3: Longest reverse top-5 lists of DBLP authors
types of web hosts give their top-5 PageRank contributions to each
query node. Our experimental results show that if a query web host
is classified as spam, on average 96.1% of the web hosts in its reverse
top-5 set are also spam nodes; on the other hand, if the query is a
normal web host, on average 97.4% of the web hosts in its reverse top-5
result are normal. Therefore, reverse top-k results using RWR are a
very strong indicator for the detection of spam web hosts. In a real
scenario, we can apply a reverse top-k RWR search on any suspicious
web host, and make a judgement according to the spam ratio
of the labeled answer set.
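The judgement step described above can be sketched as follows. This is a minimal illustration, not the procedure used in the experiments: the function names, the label strings "spam" and "normal", and the 0.5 decision threshold are all assumptions introduced here.

```python
from typing import Dict, Iterable, Optional


def spam_ratio(answer_set: Iterable[str], labels: Dict[str, str]) -> Optional[float]:
    """Fraction of the labeled hosts in a reverse top-k answer set that are spam.

    Hosts that are unlabeled or labeled as anything other than "spam" or
    "normal" (e.g. undecided) are ignored. Returns None if no labeled host
    is present.
    """
    labeled = [labels[h] for h in answer_set if labels.get(h) in ("spam", "normal")]
    if not labeled:
        return None
    return sum(1 for lab in labeled if lab == "spam") / len(labeled)


def looks_like_spam(answer_set, labels, threshold=0.5):
    """Hypothetical decision rule: flag the query host if the spam ratio of
    its labeled answer set reaches the (assumed) threshold."""
    ratio = spam_ratio(answer_set, labels)
    return ratio is not None and ratio >= threshold
```

For example, with an answer set containing two spam hosts, one normal host, and one undecided host, the ratio is computed over the three labeled hosts only.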
Popularity of authors in a coauthorship network. The size of
a reverse top-k query result can also be an indicator of the popularity of
the query node in the graph. We extracted from DBLP⁶ the publications
in top venues in the fields of databases, data mining, machine
learning, artificial intelligence, computer vision, and information
retrieval. We generated a coauthorship network with 44528 nodes
and 121352 edges, where each node corresponds to an author and
an edge indicates coauthorship. To reflect the different weights in
coauthorships, we changed the RWR transition matrix as follows:
\[
a_{i,j} =
\begin{cases}
\dfrac{w_{i,j}}{w_j} & \text{if edge } j \to i \text{ exists,} \\
0 & \text{otherwise,}
\end{cases}
\]
where w_j is the number of publications of author j and w_{i,j} is the
number of papers that i and j coauthored. We carried out reverse
top-5 search from all the nodes in the graph, and obtained a list of
authors ranked in descending order of the size of their answer set.
The 10 authors with the longest reverse top-5 lists are shown in Table 3.
The table indicates that there are three popular authors with
very long reverse top-5 lists, which stand out.⁷ More importantly,
the reverse top-k lists of these three authors are much longer than
their coauthor lists (third column of Table 3), which indicates that
there are many non-coauthors that have them in their top-k
sets. Therefore, the size of a reverse top-k query result can be a stronger
indicator of popularity than the node's degree.
⁶ dblp.uni-trier.de/xml/
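The weighted transition matrix defined above for the coauthorship network can be sketched as follows. This is only an illustration under assumptions: the input format (one list of author names per paper), the function name, and the dense NumPy representation are choices made here, not details given in the text.

```python
import numpy as np


def coauthor_transition_matrix(papers):
    """Build a[i, j] = w_{i,j} / w_j from a list of papers.

    papers: list of author-name lists, one per publication (assumed format).
    Returns (authors, A), where authors is the sorted list of names and
    A[i, j] is the weight of the edge j -> i (0 if i and j never coauthored).
    """
    authors = sorted({a for p in papers for a in p})
    idx = {a: i for i, a in enumerate(authors)}
    n = len(authors)
    pubs = np.zeros(n)         # w_j: number of publications of author j
    joint = np.zeros((n, n))   # w_{i,j}: number of papers coauthored by i and j
    for p in papers:
        members = sorted(set(p))
        for a in members:
            pubs[idx[a]] += 1
        for a in members:
            for b in members:
                if a != b:
                    joint[idx[a], idx[b]] += 1
    # Divide column j by w_j; every author has at least one paper, so w_j > 0.
    A = joint / pubs
    return authors, A
```

Note that under this definition a column need not sum to one: a single-author paper lowers the column sum, while a paper with three or more authors raises it. The text does not say whether any further normalization was applied, so none is attempted here.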
6. RELATED WORK
6.1 Random Walk with Restart
Random walk with restart has been a widely used node-to-node
proximity measure in graph data, especially after its successful application
by the search engine Google [21] to derive the importance (i.e.,
PageRank) of web pages.
Early works focused on how to efficiently solve the linear system
(1). Although non-iterative methods such as Gaussian elimination
can be applied, their high complexity of O(n^3) makes them unaffordable
in real scenarios. Iterative approaches such as the Power
Method (PM) [21] and the Jacobi algorithm have a lower complexity
of O(Dm), where D (≪ n < m) is the number of iterations.
Later on, faster (but less accurate) methods such as Hub-vector
decomposition [15] were proposed. As this method restricts the
restarting only to a specific node set, it does not compute exactly
the proximity vectors of all nodes in the graph.
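As a rough illustration of the iterative approach mentioned above, the following sketch runs a power-method style iteration for a single query node. It assumes the linear system has the common form p = (1 - restart) A p + restart e_u with a column-stochastic matrix A; the parameter names, the default restart probability, and the stopping rule are assumptions made for this example.

```python
import numpy as np


def rwr_power_method(A, u, restart=0.15, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * A p + restart * e_u until the L1 change
    between successive iterates drops below tol.

    A is assumed column-stochastic (column j holds the transition
    probabilities out of node j); u is the index of the query node.
    """
    n = A.shape[0]
    e = np.zeros(n)
    e[u] = 1.0
    p = e.copy()
    for _ in range(max_iter):
        nxt = (1 - restart) * (A @ p) + restart * e
        if np.abs(nxt - p).sum() < tol:
            return nxt
        p = nxt
    return p
```

Each iteration costs one matrix-vector product, which is O(m) for a sparse matrix with m non-zero entries, in line with the O(Dm) cost quoted above for D iterations.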
To further accelerate the computation of RWR, approximate approaches
have been introduced. [22] leverages the block structure
of the graph and only calculates the RWR similarity within the partition
containing the query node. Later, Monte Carlo (MC) methods
were introduced to simulate the random walk process, such as [9, 3,
18]. The simulation can be stored as a fingerprint for fast online
RWR estimation. Recently, a scheduled approximation strategy was
proposed by [30] to compute RWR proximity. From another viewpoint
of RWR, the Bookmark Coloring Algorithm (BCA) [7] has been
proposed to derive a sparse lower bound approximation of the real
proximity vector (see Section 2 for details). Our offline index is
based on approximations derived by partial execution of BCA and
not on other approaches, such as PM or MC simulation, because the
latter do not guarantee that their approximations are lower bounds
of the exact proximities and therefore do not fit into our framework
of using lower and upper proximity bounds to accelerate search.
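To make the lower-bound property concrete, here is a minimal sketch of a partially executed BCA-style push procedure. It is an illustration under assumptions, not the index construction of this paper: the matrix A is assumed column-stochastic, and the function name, the greedy choice of the node with the most residual ink, and the parameters max_pushes and eps are choices made for this example.

```python
import numpy as np


def bca_lower_bound(A, u, restart=0.15, max_pushes=50, eps=1e-6):
    """Partial bookmark-coloring run from node u.

    One unit of ink starts at u. Each push takes the node with the most
    residual ink, retains a `restart` fraction of it there, and spreads the
    rest to its out-neighbors according to column v of A.

    Returns (retained, residual). Stopping early leaves `retained` as an
    entrywise lower bound of the exact proximity vector, because the ink
    still in `residual` could only add non-negative amounts later.
    """
    n = A.shape[0]
    retained = np.zeros(n)
    residual = np.zeros(n)
    residual[u] = 1.0
    for _ in range(max_pushes):
        v = int(np.argmax(residual))
        ink = residual[v]
        if ink < eps:
            break
        residual[v] = 0.0
        retained[v] += restart * ink
        residual += (1 - restart) * ink * A[:, v]
    return retained, residual
```

In this sketch the total ink is conserved (retained.sum() + residual.sum() stays at 1), so residual.sum() also gives a simple cap on how much any single entry of the lower bound can still grow.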
⁷ By "popular" here we mean authors who are likely to be approachable
by many other authors and intuitively have a higher chance to
collaborate with them in the future. Indeed, there are many other
authors who are very popular (e.g., in terms of visibility) and
do not show up in Table 3, but these authors are likely to work in
smaller groups and do not have as much open collaboration,
compared to those having larger reverse top-k sets.
6.2 Top-k RWR Proximity Search
Bahmani et al. [4] observed that the majority of entries in a
proximity vector are extremely small. Thus, in many cases, it is
unnecessary to compute the exact RWR proximity from the query
node to all remaining nodes, especially to those with extremely
low proximities. Based on this observation, several top-k proximity
search algorithms have been introduced. Based on BCA [7], [11]
proposed the Basic Push Algorithm (BPA). At each iteration, BPA
maintains a set of top-k candidates and estimates an upper bound
for the (k+1)-th largest proximity. BPA stops as soon as the upper
bound is not greater than the current k-th largest proximity. Recently,
another method, K-dash [10], was proposed. In an indexing
stage, K-dash applies LU decomposition on the proximity matrix P
and stores the sparse matrices L^{-1} and U^{-1}. In the query stage, it
builds a BFS tree rooted at the query node and estimates an upper
bound for each visited node. Such estimation can help determine
whether K-dash should continue or terminate.
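The stopping rule described above for push-based top-k search can be illustrated with a small check. This is a sketch under an assumption that is not taken from [11]: the upper bound of every node is its current lower bound plus the total amount of not yet distributed residual, which is a valid but possibly loose bound for BCA-style pushes. The function and parameter names are hypothetical.

```python
import numpy as np


def topk_is_settled(lower, residual_total, k):
    """Return True if the top-k set can no longer change.

    lower: current lower bounds of the proximities from the query node.
    residual_total: total residual still to be distributed (assumed to cap
        how much any single lower bound can grow).
    The search may stop once an upper bound on the (k+1)-th largest
    proximity is not greater than the current k-th largest lower bound.
    """
    order = np.sort(np.asarray(lower, dtype=float))[::-1]
    if len(order) <= k:
        return True
    return bool(order[k] + residual_total <= order[k - 1])
```

The check only certifies the membership of the top-k set; the order among the k candidates may still change as the bounds are refined.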
When the exact order of the top-k list is not important and a few
misplaced elements are acceptable, Monte Carlo methods can be
used to simulate RWR from the query node u. [3] designs two such
algorithms: MC End Point and MC Complete Path. The former
evaluates RWR proximity p_u(v) as the fraction of t random walks
which end at node v, while the latter evaluates p_u(v) as the number
of visits to node v multiplied by (1 − c)/t.
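The two estimators can be sketched together in a few lines. This is an illustration under assumptions rather than the algorithms of [3] as published: c is taken to be the probability of continuing the walk at each step (so a walk stops with probability 1 - c), every node is assumed to have at least one out-neighbor, neighbors are chosen uniformly, and the function name, adjacency-list input, and seed parameter are introduced here.

```python
import random
from collections import Counter


def mc_rwr(out_neighbors, u, c=0.85, t=10000, seed=0):
    """Estimate RWR proximities from u with t simulated walks.

    out_neighbors: dict mapping each node to a non-empty list of successors.
    Returns (end_point, complete_path):
      end_point[v]     = fraction of walks that end at v
      complete_path[v] = (number of visits to v over all walks) * (1 - c) / t
    """
    rng = random.Random(seed)
    end_counts, visit_counts = Counter(), Counter()
    for _ in range(t):
        v = u
        visit_counts[v] += 1
        # Continue with probability c, otherwise the walk stops at v.
        while rng.random() < c:
            v = rng.choice(out_neighbors[v])
            visit_counts[v] += 1
        end_counts[v] += 1
    end_point = {v: n / t for v, n in end_counts.items()}
    complete_path = {v: n * (1 - c) / t for v, n in visit_counts.items()}
    return end_point, complete_path
```

Both estimators use the same walks; the complete-path variant counts every visited node rather than only the last one, so it extracts more information from each walk.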
6.3 Reverse k-NN and Reverse Top-k Search
Reverse k nearest neighbors (RkNN) search aims at finding all
objects in a set T that have a given query object from T in their k-NN
sets. In the Euclidean space, RkNN queries were introduced in
[17]; an efficient geometric solution was proposed in [24]. RkNN
search has also been studied for objects lying on large graphs, albeit
using shortest path as the proximity measure, which makes the
problem much easier [28]. The reverse top-k query is defined by
Vlachou et al. in [26] as follows. Given a set of multi-dimensional
data points and a query point q, the goal is to find all linear preference
functions that define a total ranking in the data space such that
q lies in the top-k result of the functions. Solutions for RkNN and
reverse top-k queries cannot be applied to solve our problem, due to
the special nature of graph data and/or the use of RWR proximity.
7. CONCLUSIONS
In this paper, we have studied for the first time the problem of reverse
top-k proximity search in large graphs, based on the random
walk with restart (RWR) measure. We showed that the naive evaluation
of this problem is too expensive and proposed an index which
keeps track of lower bounds for the top proximity values from each
node. Our online query evaluation technique first computes the exact
RWR proximities from the query node q to all graph nodes and
then compares them with the top-k lower bounds derived from the
index. For nodes that cannot be pruned, we compute upper bounds
for their k-th proximities and use them to test whether they are in
the reverse top-k result. For any remaining candidates, their k-th
proximity lower and upper bounds are progressively refined until
they become results or they are pruned. Our experiments confirm
the efficiency of our approach; in addition, we demonstrate the use
of reverse top-k queries in identifying spam web hosts or popular
authors in co-authorship networks. As future work, we plan to
generalize the problem of reverse top-k search to other proximity
measures such as SimRank [14]. Since the current framework does
not consider the dynamics of the graph, we would also like to extend
our method to do reverse top-k search on evolving graphs. The
key challenge is how to maintain the index incrementally.
8. REFERENCES
[1] R. Andersen, C. Borgs, J. T. Chayes, J. E. Hopcroft, V. S. Mirrokni,
and S.-H. Teng. Local computation of pagerank contributions. In
WAW, 2007.
[2] R. Andersen, F. R. K. Chung, and K. J. Lang. Local graph
partitioning using pagerank vectors. In FOCS, 2006.
[3] K. Avrachenkov, N. Litvak, D. Nemirovsky, E. Smirnova, and
M. Sokol. Quick detection of top-k personalized pagerank lists. In
WAW, 2011.
[4] B. Bahmani, A. Chowdhury, and A. Goel. Fast incremental and
personalized pagerank. PVLDB, 4(3):173-184, 2010.
[5] A. Balmin, V. Hristidis, and Y. Papakonstantinou. Objectrank:
Authority-based keyword search in databases. In VLDB, 2004.
[6] A. A. Benczúr, K. Csalogány, T. Sarlós, and M. Uher. Spamrank -
fully automatic link spam detection. In AIRWeb, 2005.
[7] P. Berkhin. Bookmark-coloring approach to personalized pagerank
computing. Internet Mathematics, 3(1):41-62, 2006.
[8] Y.-Y. Chen, Q. Gan, and T. Suel. Local methods for estimating
pagerank values. In CIKM, 2004.
[9] D. Fogaras, B. Rácz, K. Csalogány, and T. Sarlós. Towards scaling
fully personalized pagerank: Algorithms, lower bounds, and
experiments. Internet Mathematics, 2(3):333-358, 2005.
[10] Y. Fujiwara, M. Nakatsuji, M. Onizuka, and M. Kitsuregawa. Fast
and exact top-k search for random walk with restart. PVLDB,
5(5):442-453, 2012.
[11] M. S. Gupta, A. Pathak, and S. Chakrabarti. Fast algorithms for top-k
personalized pagerank queries. In WWW, 2008.
[12] T. H. Haveliwala. Topic-sensitive pagerank. In WWW, 2002.
[13] J. He, M. Li, H. Zhang, H. Tong, and C. Zhang. Manifold-ranking
based image retrieval. In ACM Multimedia, 2004.
[14] G. Jeh and J. Widom. Simrank: a measure of structural-context
similarity. In KDD, 2002.
[15] G. Jeh and J. Widom. Scaling personalized web search. In WWW,
2003.
[16] I. Konstas, V. Stathopoulos, and J. M. Jose. On social networks and
collaborative recommendation. In SIGIR, 2009.
[17] F. Korn and S. Muthukrishnan. Influence sets based on reverse
nearest neighbor queries. In SIGMOD Conference, 2000.
[18] N. Li, Z. Guan, L. Ren, J. Wu, J. Han, and X. Yan. gIceberg: Towards
iceberg analysis in large graphs. In ICDE, 2013.
[19] D. Liben-Nowell and J. M. Kleinberg. The link prediction problem
for social networks. In CIKM, 2003.
[20] A. Y. Ng, A. X. Zheng, and M. I. Jordan. Link analysis, eigenvectors
and stability. In IJCAI, 2001.
[21] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank
citation ranking: Bringing order to the web. Technical Report
1999-66, Stanford InfoLab, 1999.
[22] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos. Neighborhood
formation and anomaly detection in bipartite graphs. In ICDM, 2005.
[23] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu. Pathsim: Meta
path-based top-k similarity search in heterogeneous information
networks. PVLDB, 4(11), 2011.
[24] Y. Tao, D. Papadias, and X. Lian. Reverse knn search in arbitrary
dimensionality. In VLDB, 2004.
[25] H. Tong, C. Faloutsos, and Y. Koren. Fast direction-aware proximity
for graph mining. In KDD, 2007.
[26] A. Vlachou, C. Doulkeridis, Y. Kotidis, and K. Nørvåg. Reverse
top-k queries. In ICDE, 2010.
[27] J. H. Wilkinson. The algebraic eigenvalue problem, volume 155.
Oxford Univ Press, 1965.
[28] M. L. Yiu, D. Papadias, N. Mamoulis, and Y. Tao. Reverse nearest
neighbors in large graphs. In ICDE, 2005.
[29] A. W. Yu, N. Mamoulis, and H. Su. Reverse top-k search using
random walk with restart. Technical Report TR-2013-08, CS
Department, HKU, September 2013.
[30] F. Zhu, Y. Fang, K. C.-C. Chang, and J. Ying. Incremental and
accuracy-aware personalized pagerank through scheduled
approximation. PVLDB, 6(6), 2013.

Supported by grant HKU 715413E from Hong Kong RGC. Work
