Reverse Top-K Search Using Random Walk With Restart: Adams Wei Yu, Nikos Mamoulis, Hao Su
$\sum_{i=1}^{n} |v_i| = 1$.
Computing $P$ in its entirety or partially is a key problem in different applications. Approaches like the iterative Power Method (PM) [21] and Monte Carlo Simulation (MCS) [9] can be used to compute an approximate value for a single proximity vector $p_u$ and/or the entire matrix $P$. PM converges to an accurate $p_u$, while MCS is less accurate but faster. Next, we discuss in detail an efficient technique for deriving a lower bound of $p_u$.
2.2 Bookmark Coloring Algorithm (BCA)
Basic model. Berkhin [7] models RWR by a bookmark coloring process, which facilitates the efficient estimation of $p_u$. We begin by injecting a unit amount of colored ink into $u$, with an $\alpha$ portion retained in $u$ and the remaining $(1-\alpha)$ portion evenly distributed to each of $u$'s out-neighbors. Each node that receives ink retains an $\alpha$ portion of it and distributes the rest to its out-neighbors. At an intermediate step $t$ $(= 0, 1, 2, \ldots)$, we can use two vectors $p_u^t, r_u^t \in \mathbb{R}^n$ to capture the ink distribution in the whole graph, where $p_u^t(v)$ is the ink retained at node $v$ and $r_u^t(v)$ is the residue ink still to be distributed from $v$. When $r_u^t(v)$ reaches 0 for all $v \in V$ (i.e., $\|r_u^t\|_1 = 0$), $p_u^t$ is exactly $p_u$; the proximity vector $p_u$ can thus be seen as a stable distribution of ink. In fact, BCA can stop early, at a time $t$ where the $r_u^t(v)$ values are small at all nodes $v$; $p_u^t$ is then a sparse lower-bound approximation of $p_u$ [7].
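The ink process can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of [7]: the function name `bca_basic`, the stopping parameter `eps`, the dense NumPy representation, and the 3-node cycle are assumptions made for the example. The matrix `A` is assumed to be column-stochastic, so that column `v` spreads the ink of node `v` evenly over its out-neighbors.

```python
import numpy as np

def bca_basic(A, u, alpha=0.15, eps=1e-8, max_steps=100_000):
    """Bookmark coloring from node u, without hubs (illustrative sketch).

    A is assumed column-stochastic: A[:, v] spreads v's ink over its out-neighbors.
    Returns the retained ink p (a lower bound of p_u) and the residue ink r.
    """
    n = A.shape[0]
    p = np.zeros(n)              # ink retained at each node
    r = np.zeros(n)              # residue ink still to be distributed
    r[u] = 1.0                   # inject one unit of ink at u
    for _ in range(max_steps):
        if r.sum() <= eps:       # stop once the total residue is negligible
            break
        v = int(np.argmax(r))    # pick the node with the largest residue
        ink = r[v]
        r[v] = 0.0
        p[v] += alpha * ink                   # v retains an alpha portion
        r += (1.0 - alpha) * ink * A[:, v]    # the rest goes to v's out-neighbors
    return p, r

# Hypothetical toy graph: directed cycle 0 -> 1 -> 2 -> 0
A = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
p, r = bca_basic(A, u=0)
# Reference value from the closed form p_u = alpha * (I - (1 - alpha) A)^{-1} e_u
exact = 0.15 * np.linalg.solve(np.eye(3) - 0.85 * A, np.eye(3)[:, 0])
```

At every step the retained and residue ink sum to one unit, and `p` never exceeds `exact` in any entry, which is the lower-bound behavior described above.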
Hub effects. In the process of ink propagation, some of the nodes may have a high probability to receive new ink and distribute part of it again and again. Such nodes are called hubs and their set is denoted by $H = \{h_1, h_2, \ldots, h_{|H|}\}$. Without loss of generality, we assume that the first $|H|$ nodes in $V$ are the hubs. If we knew how hubs distribute their ink across the graph (i.e., if we had precomputed the exact proximity vector $p_h$ for each $h \in H$), we would not need to distribute their residue ink during the process of computing $p_u$ for a node $u \in V \setminus H$. Instead, we could accumulate all the residue ink at hubs and distribute it in batch at the end by a simple matrix multiplication. In [7], a greedy scheme is adopted to select hubs and implement this idea. It starts by applying BCA on one node and selecting the node with the largest retained ink as a hub. This process is repeated from another starting node to select another hub, until a sufficient number of hubs are chosen. Once the hub nodes are selected, we can use the power method (PM) to calculate the exact vector $p_h$ for each $h \in H$.
BCA using hubs. Assume that we have selected a set of hubs $H$ and have precomputed $p_h$ for each $h \in H$. To compute $p_u$ for a non-hub node $u$, BCA [7] (and its revised version [2]) first injects a unit amount of ink to $u$; then $u$ retains an $\alpha$ portion of the ink and distributes the rest to $u$'s out-neighbors. At each propagation step $t$, BCA picks a non-hub node $v_t$ and distributes the residue ink $r_u^{t-1}(v_t)$ to its out-neighbors. Two vectors $s_u^t$ and $w_u^t \in \mathbb{R}^n$ are introduced and maintained in this process: $s_u^t$ stores the ink accumulated at hubs so far and $w_u^t$ stores the ink retained at non-hub nodes. Thus, for a hub node $h$, $s_u^t(h)$ is the ink accumulated at $h$ by time $t$; this ink will be distributed to all nodes in batch after the final iteration, with the help of the (precomputed) $p_h$. For a non-hub node $v$, $w_u^t(v)$ stores the ink retained so far at $v$ (which will never be distributed). $w_u^t(v)$ ($s_u^t(v)$) is always zero for a hub (non-hub) node $v$. The following equations show how all vectors are updated at each step:
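$$w_u^t = \alpha \cdot r_u^{t-1}(v_t) \cdot e_{v_t} + w_u^{t-1} \tag{4}$$
$$r_u^t = (1-\alpha) \cdot r_u^{t-1}(v_t) \cdot a_{v_t} + \left[ r_u^{t-1} - r_u^{t-1}(v_t) \cdot e_{v_t} \right] \tag{5}$$
$$s_u^t = \sum_{i \in H} r_u^{t-1}(i) \cdot e_i + s_u^{t-1} \tag{6}$$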
According to the first part of Eq. (4), an ink portion of $\alpha \cdot r_u^{t-1}(v_t)$ is retained at $v_t$. Eq. (5) subtracts the residue ink $r_u^{t-1}(v_t)$ from $v_t$ (second part) and evenly distributes the remaining $(1-\alpha)$ portion to $v_t$'s out-neighbors (first part). Eq. (6) accumulates the ink that arrives at hub nodes. At any step $t$, BCA can compute $p_u^t$ and use it to approximate $p_u$, as follows:
$$p_u^t = w_u^t + P_H \cdot s_u^t \tag{7}$$
where $P_H = [p_1, p_2, \ldots, p_{|H|}, 0_{n \times (n-|H|)}] \in \mathbb{R}^{n \times n}$, i.e., $P_H$ is the proximity matrix including only the (precomputed) proximity vectors of hub nodes and having 0's in all proximity entries of non-hub nodes. $p_u^t$ is computed only when all residue values are small; in this case, $p_u^t$ is deemed a good approximation of $p_u$. In order to reach this stage, at each step, a $v_t$ with large residue ink should be selected. In [7], $v_t$ is selected to be the node with the largest residue ink, while in [2] $v_t$ is any node with more residue ink than a propagation threshold $\eta$. BCA terminates when the total residue ink does not exceed a convergence threshold $\epsilon$ or when there is no node with at least $\eta$ residue ink.
3. PROBLEM FORMALIZATION
The reverse top-$k$ RWR query is formally defined as follows:
PROBLEM 1. Given a graph $G(V, E)$, a query node $q \in V$ and a positive integer $k$, find all nodes $u \in V$ for which $p_u^{kmax} \le p_u(q)$, where $p_u$ is obtained by Eq. (1) and $p_u^{kmax}$ is the $k$-th largest value in $p_u$.
A brute-force (BF) method for evaluating the query is to (i) compute the proximity vector $p_u$ of every node $u$ in the graph, and (ii) find all nodes $u$ for which $p_u^{kmax} \le p_u(q)$. BF requires the computation of the entire proximity matrix $P$. No matter which method is used to compute the exact $P$ (e.g., PM or the state-of-the-art K-dash algorithm [10]), examining the top-$k$ values at each column $p_u$ results in an $O(n^3)$ total time complexity for BF (or $O(nm)$ for sparse graphs with $n$ nodes and $m$ edges), which is too high, especially for online queries on large-scale graphs.
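As a point of reference, the brute-force method can be written down directly for small graphs. The sketch below is only illustrative: the function name `reverse_topk_bruteforce` and the toy graph are assumptions, and the dense inverse is used purely for brevity (it relies on the closed form $P = \alpha (I - (1-\alpha)A)^{-1}$ with a column-stochastic $A$).

```python
import numpy as np

def reverse_topk_bruteforce(A, q, k, alpha=0.15):
    """Return all nodes u whose top-k proximity set contains q (dense sketch)."""
    n = A.shape[0]
    # Column u of P is the proximity vector p_u.
    P = alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * A)
    result = []
    for u in range(n):
        kth_largest = np.sort(P[:, u])[-k]   # p_u^{kmax}
        if P[q, u] >= kth_largest:           # q is among the k closest nodes of u
            result.append(u)
    return result

# Hypothetical toy graph: directed cycle 0 -> 1 -> 2 -> 0 (column-stochastic A)
A = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
answers_k1 = reverse_topk_bruteforce(A, q=0, k=1)
answers_k2 = reverse_topk_bruteforce(A, q=0, k=2)
```

In the cycle, each node is closest to itself and then to its successor, so node 0 is in the top-1 set of node 0 only, and in the top-2 sets of nodes 0 and 2.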
There are several observations that guide us to the design of an efficient reverse top-$k$ RWR algorithm. First, the expected number of nodes in the answer set of a reverse top-$k$ query is $k$; thus there is potential for building an index that can prune the majority of nodes definitely not in the answer set. Second, as noted in [4] and observed in our experiments, the power law distribution phenomenon applies to each proximity vector: typically, only a few entries have significantly large and meaningful proximities, while the remaining values are tiny. Third, we observe that verifying whether the query node $q$ lies in the top-$k$ proximity set of a certain node $u$ is a far easier problem than computing the exact top-$k$ set of node $u$; we can efficiently derive upper and lower bounds for the proximities from $u$ to all other nodes and use them for verification. In the next section, we introduce our approach, which achieves significantly better performance than BF.
4. OUR APPROACH
Our method focuses on two aspects: (i) avoiding the computation of unnecessary top-$k$ proximity sets and (ii) terminating the computation of each top-$k$ proximity set as early as possible. The overall framework contains two parts: an offline indexing module (Section 4.1) and an online querying algorithm (Section 4.2).
4.1 Offline Indexing
For our index design, we assume that the maximum $k$ in any query does not exceed a predefined value $K$, i.e., $k \le K$. For each node $v$, we compute proximity lower bounds to all other nodes and store the $K$ largest bounds in a compact data structure. The index is relatively efficient to obtain, compared to computing the exact proximity matrix $P$. Given a query $q$ and a $k \le K$, with the help of the index, we can prune nodes that are guaranteed not to have $q$ in their top-$k$ sets, thus avoiding a large number of unnecessary proximity vector computations. The index is stored in a compact format, so that it can fit in main memory even for large graphs. It also supports dynamic updating after a query has been completed; this way, its performance potentially improves for future queries.
The lower bounds used in our index are based on the fact that, while running BCA from any node $u \in V$, each entry of $p_u^t$ at iteration $t$ is monotonically increasing w.r.t. $t$; formally:
PROPOSITION 1. $\forall u, v \in V$, $p_u^1(v) \le p_u^2(v) \le \ldots \le p_u(v)$.
PROOF. See [29].
Thus, after each iteration $t$ of BCA from $u \in V$, we have a lower bound $p_u^t(v)$ of the real proximity value $p_u(v)$ from $u$ to any node $v \in V$. The following proposition shows that the $k$-th largest value in $p_u^t$ serves as a lower bound for the $k$-th largest proximity value in $p_u$:
PROPOSITION 2. Let $\hat{p}_u^t(k)$ be the $k$-th largest value in $p_u^t$ after $t$ iterations of BCA from $u$. Let $p_u(k)$ be the $k$-th largest value in $p_u$. Then, $\hat{p}_u^t(k) \le p_u(k) = p_u^{kmax}$.
PROOF. See [29].
Note that this is a nice property of BCA, which is not present in alternative proximity vector computation techniques (i.e., PM and MCS). Besides, we observe that by running BCA from a node $u$, the high proximity values stand out after only a few iterations. Thus, to construct the index, we run an adapted version of BCA from each node $u \in V$ that stops after a few iterations $t$ to derive a lower-bound proximity vector $p_u^t$. Only the $K$ largest values of this vector are kept, in descending order, in a vector $\hat{p}_u^t(1:K)$. Our index consists of all these lower bounds; in Section 4.2, we explain how it can be used for query evaluation. In the remainder of this subsection, we provide details about our hub selection technique (Section 4.1.1), our adaptation of BCA for deriving the lower-bound proximity vectors and constructing the index (Section 4.1.2), and a compression technique that reduces the storage requirements of the index (Section 4.1.3).
4.1.1 Hub Selection
The hub selection method in [7] runs BCA itself to find hubs; its efficiency thus heavily relies on the graph size and the number of selected hubs. We use a simpler approach, which is independent of these factors and hence can be used for large-scale graphs. We claim that nodes with high in-degree or out-degree are already good hub candidates. Therefore we define $H$ as the union of the set of high in-degree nodes $H_{in}$ and the set of high out-degree nodes $H_{out}$. $H_{in}$ ($H_{out}$) is the set of $B$ nodes in $V$ with the largest in-degree (out-degree). In Section 5, we investigate choices for parameter $B$.
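The degree-based rule is straightforward to express in code. The following sketch assumes a dense 0/1 adjacency matrix with `adj[i, j] = 1` for an edge from `i` to `j`; the function name `select_hubs` and the tie-breaking by lower node index are choices made for this example, not details given in the text.

```python
import numpy as np

def select_hubs(adj, B):
    """Union of the B nodes with largest in-degree and the B with largest out-degree."""
    out_deg = adj.sum(axis=1)    # row sums: number of outgoing edges
    in_deg = adj.sum(axis=0)     # column sums: number of incoming edges
    # A stable sort on the negated degrees breaks ties by lower node index.
    top_in = np.argsort(-in_deg, kind="stable")[:B]
    top_out = np.argsort(-out_deg, kind="stable")[:B]
    return sorted(set(top_in.tolist()) | set(top_out.tolist()))

# Hypothetical toy graph with edges 0->1, 0->2, 0->3, 1->2, 3->2
adj = np.zeros((4, 4))
for i, j in [(0, 1), (0, 2), (0, 3), (1, 2), (3, 2)]:
    adj[i, j] = 1
hubs = select_hubs(adj, B=1)
```

Here node 2 has the largest in-degree and node 0 the largest out-degree, so `B = 1` yields two hubs; the hub set has at most `2 * B` members and fewer when the two lists overlap.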
4.1.2 BCA Adaptation
We propose an improved ink propagation strategy for BCA compared to those suggested by [7] and [2]. Instead of propagating a single node's residue ink at each iteration $t$, our strategy selects a subset of nodes $L_t$, which includes those having no less residue ink than a given propagation threshold $\eta$; i.e., $L_t = \{v \in V \setminus H \mid r_u^{t-1}(v) \ge \eta\}$. $\eta$ is selected such that only significant residue ink is propagated. The rules for updating $s_u^t$ and $p_u^t$ are the same as shown in Eq. (6) and (7), respectively. However, the updates of $w_u^t$ and $r_u^t$ are performed as follows:
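$$w_u^t = \sum_{i \in L_t} \alpha \cdot r_u^{t-1}(i) \cdot e_i + w_u^{t-1} \tag{8}$$
$$r_u^t = \sum_{i \in L_t} (1-\alpha) \cdot r_u^{t-1}(i) \cdot a_i + \left[ r_u^{t-1} - \sum_{i \in L_t} r_u^{t-1}(i) \cdot e_i \right] \tag{9}$$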
To understand the advantage of our strategy, note that the main cost of BCA at each iteration consists of two parts. The first is the time spent for selecting nodes to propagate ink from and the second is the time spent on updating vectors $r_u^t$, $s_u^t$, and $w_u^t$ [footnote 2]. Our approach reduces both costs. First, selecting a batch of nodes at a time significantly reduces the total remaining residue $\|r_u^t\|_1$ in a single iteration; this greatly reduces the overall number of iterations and thus the total number of vector updates. Second, since finding either a single node or a set of nodes to propagate ink from requires a linear scan of $r_u^{t-1}$ at each iteration, fewer iterations also mean that the total node selection time is reduced.
Our BCA adaptation ends as soon as the total remaining ink $\|r_u^t\|_1$ is no greater than a residue threshold $\delta$. We observe that $\|r_u^t\|_1$ drops drastically in the first few iterations of BCA and then slowly in the later iterations. Thus, we select $\delta$ such that our BCA adaptation terminates after only a few iterations, deriving a rough approximation of $p_u$ that is already sufficient to prune the majority of nodes during search.
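A compact sketch of the batch propagation rule may help. It follows Eqs. (8), (9), (6) and (7) under some stated assumptions: dense NumPy arrays, a column-stochastic `A`, and ink arriving at a hub being moved out of the residue vector into `s` immediately, so that the threshold test applies to non-hub residue only. That bookkeeping, the function name `bca_batch`, and the toy graph are assumptions of this example rather than details fixed by the text.

```python
import numpy as np

def bca_batch(A, u, hubs, P_H, alpha=0.15, eta=1e-4, delta=0.1, max_iter=1000):
    """Batch BCA from a non-hub node u (illustrative sketch).

    A   : column-stochastic transition matrix (column i spreads node i's ink)
    P_H : n x n matrix with exact hub vectors in hub columns, zeros elsewhere
    eta : propagation threshold; delta : residue threshold
    """
    n = A.shape[0]
    is_hub = np.zeros(n, dtype=bool)
    is_hub[list(hubs)] = True
    r = np.zeros(n)              # residue ink
    r[u] = 1.0
    s = np.zeros(n)              # ink accumulated at hubs
    w = np.zeros(n)              # ink retained at non-hub nodes
    for _ in range(max_iter):
        if r.sum() <= delta:                      # total residue is small enough
            break
        active = (~is_hub) & (r >= eta)           # the set L_t
        if not active.any():
            break
        ink = np.where(active, r, 0.0)
        w += alpha * ink                          # Eq. (8)
        r = r - ink + (1.0 - alpha) * (A @ ink)   # Eq. (9)
        s[is_hub] += r[is_hub]                    # Eq. (6): park ink that reached hubs
        r[is_hub] = 0.0
    p = w + P_H @ s                               # Eq. (7)
    return p, r, w, s

# Hypothetical toy graph: edges 0->1, 0->2, 1->0, 2->0, 2->3, 3->0; node 0 is the hub
adj = np.zeros((4, 4))
for i, j in [(0, 1), (0, 2), (1, 0), (2, 0), (2, 3), (3, 0)]:
    adj[i, j] = 1.0
A = (adj / adj.sum(axis=1, keepdims=True)).T      # column-stochastic
alpha = 0.15
P = alpha * np.linalg.inv(np.eye(4) - (1 - alpha) * A)
P_H = np.zeros((4, 4))
P_H[:, 0] = P[:, 0]
p_fine, r_fine, _, _ = bca_batch(A, u=2, hubs=[0], P_H=P_H, eta=1e-12, delta=0.0)
p_rough, r_rough, w_rough, s_rough = bca_batch(A, u=2, hubs=[0], P_H=P_H, delta=0.5)
```

With `delta = 0.5` the run from node 2 stops after one batch step and leaves 0.425 units of residue, yet `p_rough` is already an entry-wise lower bound of the exact vector; with `delta = 0` all ink ends up retained or parked at the hub and the exact vector is recovered.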
The complete lower bound indexing procedure is described by Algorithm 1. Let $t_u$ be the number of iterations until the termination of BCA from $u$ and $\mathbf{t} = [t_1, t_2, \ldots, t_n]$. The index resulting from Algorithm 1 is denoted by $\mathcal{I}^{\mathbf{t}} = (\hat{P}^{\mathbf{t}}, R^{\mathbf{t}}, W^{\mathbf{t}}, S^{\mathbf{t}}, P_H)$, where $\hat{P}^{\mathbf{t}} = [\hat{p}_1^{t_1}(1:K), \ldots, \hat{p}_n^{t_n}(1:K)]$ is the top-$K$ lower bound matrix storing the $K$ largest values of each $p_u^{t_u}$, $R^{\mathbf{t}} = [r_1^{t_1}, \ldots, r_n^{t_n}]$ is the residue ink matrix, $W^{\mathbf{t}} = [w_1^{t_1}, \ldots, w_n^{t_n}]$ is the non-hub retained ink matrix, $S^{\mathbf{t}} = [s_1^{t_1}, \ldots, s_n^{t_n}]$ is the hub accumulated ink matrix and $P_H$ is the hub proximity matrix. Whenever the context is clear, we simply denote $p_u^{t_u}$ by $p_u^t$ and the index by $\mathcal{I} = (\hat{P}, R, W, S, P_H)$.
Algorithm 1 Lower Bound Indexing (LBI)
Input: Matrix $A$, number $K$, hubs $H$, residue threshold $\delta$, propagation threshold $\eta$.
Output: Index $\mathcal{I} = (\hat{P}, R, W, S, P_H)$.
1: for all $h \in H$ do
2:     Compute $p_h$ by power method or BCA;
3: for all nodes $u \in V$ do
4:     $t_u = 0$; $r_u^{t_u} = e_u$; $s_u^{t_u} = w_u^{t_u} = 0$;
5:     while $\|r_u^{t_u}\|_1 > \delta$ do
6:         $t_u = t_u + 1$;
7:         Update $r_u^{t_u}$, $s_u^{t_u}$, $w_u^{t_u}$ by Eq. (9), (6), (8);
8:     Compute $p_u^{t_u}$ by Eq. (7);
9:     $\hat{p}_u^{t_u}(1:K)$ = top $K$ entries of $p_u^{t_u}$ in descending order;
Figure 2 illustrates the result of our indexing approach on the toy graph of Figure 1, for $\alpha = 0.15$. First, by setting $B = 1$, we select the two nodes with the highest in- and out-degrees to become hubs. These are nodes 1 and 2. For these two nodes the exact proximity vectors $p_1$ and $p_2$ are computed and stored in the hub proximity matrix $P_H = [p_1, p_2, 0, 0, 0, 0]$. For the remaining nodes, we run our BCA adaptation with propagation threshold $\eta = 10^{-4}$ and residue threshold $\delta = 0.8$, which results in the vectors $p_3^{t_3}$ to $p_6^{t_6}$ shown in the figure. Finally, we select from each of $\{p_1, p_2, p_3^{t_3}, \ldots, p_6^{t_6}\}$ the top-$K$ values (for $K = 3$) and create the lower bound matrix $\hat{P} = [\hat{p}_1, \hat{p}_2, \hat{p}_3^{t_3}, \hat{p}_4^{t_4}, \hat{p}_5^{t_5}, \hat{p}_6^{t_6}]$, as shown in the figure. Note that $\|r_3^{t_3}\|_1 = \|r_5^{t_5}\|_1 = 0$ and $\|r_4^{t_4}\|_1 = \|r_6^{t_6}\|_1 = 0.36$.
Figure 2: Example of top-3 lower bound index (vectors $p_1$, $p_2$, $p_3^{t_3}$, $p_4^{t_4}$, $p_5^{t_5}$, $p_6^{t_6}$ and the resulting matrix $\hat{P}$)
Footnote 2: Recall that $p_u^t$ need not be updated at each iteration and is only computed at the end of BCA or when an approximation of $p_u$ should be obtained.
4.1.3 Compact Storage of the Index
The space complexity for the hub proximity matrix $P_H$ of $\mathcal{I}$ is $O(|H| n)$, where $|H|$ ($n$) is the number of hub (total) nodes. The matrix may not fit in memory if $n$ and $|H|$ are large. We apply a compression technique for $P_H$, based on the observation that the values of a proximity vector follow a power law distribution: in each vector $p_h \in P_H$, the great majority of values are tiny and only a small percentage of them are significantly large. Therefore, we perform rounding by zeroing all values lower than a given rounding threshold $\omega$. In our implementation, we choose an $\omega$ that saves much space without losing reverse top-$k$ search precision.
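The rounding step itself is a one-line thresholding operation. The sketch below uses made-up values for a hub vector and a hypothetical threshold `omega`; in practice the surviving entries would be kept in a sparse format, which is omitted here to keep the example dependency-free.

```python
import numpy as np

def round_hub_vector(p_h, omega):
    """Zero every entry of p_h that is lower than the rounding threshold omega."""
    return np.where(p_h >= omega, p_h, 0.0)

# Made-up hub proximity vector with a few large values and a tail of small ones
p_h = np.array([0.40, 0.25, 0.20, 0.10, 0.03, 0.02])
rounded = round_hub_vector(p_h, omega=0.05)
kept = int(np.count_nonzero(rounded))      # entries that still need to be stored
lost = float(p_h.sum() - rounded.sum())    # proximity mass dropped by rounding
```

Because rounding only lowers entries, any vector assembled from the rounded hub vectors remains a lower bound of the exact proximities.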
If sufficient hubs are selected, matrices $R$, $W$, $S$ are sparse, so the storage cost for the index $\mathcal{I}$ will mainly be due to $\hat{P}$ and the rounded $P_H$. The following theorem gives an estimation for the total index storage requirements after the rounding operation.
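THEOREM 1. $\forall h \in H$, given rounding threshold $\omega$, if the values of $p_h$ follow a power law distribution, i.e., the sorted values satisfy $\hat{p}_h(i) \propto i^{-\beta}$ $(0 < \beta < 1)$, then the total storage cost of the index is $O\left(Kn + |H| \left(\frac{1-\beta}{\omega}\right)^{1/\beta} n^{1 - 1/\beta}\right)$.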
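PROOF. Let $\hat{p}_h(i) = \mu \cdot i^{-\beta}$. As
$$1 = \sum_{i=1}^{n} \hat{p}_h(i) = \mu \sum_{i=1}^{n} i^{-\beta} \approx \mu \, n^{1-\beta} \int_0^1 x^{-\beta} \, dx = \frac{\mu \, n^{1-\beta}}{1-\beta},$$
we have $\mu \approx (1-\beta) n^{\beta-1}$ and $\hat{p}_h(i) \approx (1-\beta) n^{\beta-1} i^{-\beta}$. Let $\hat{p}_h(l_\omega) \ge \omega$; then we have
$$l_\omega \le \left(\frac{1-\beta}{\omega}\right)^{1/\beta} n^{1 - 1/\beta}.$$
Hence the rounded $P_H$ stores $O\left(|H| \left(\frac{1-\beta}{\omega}\right)^{1/\beta} n^{1-1/\beta}\right)$ values, and together with the $Kn$ values of $\hat{P}$ the total storage cost is $O\left(Kn + |H| \left(\frac{1-\beta}{\omega}\right)^{1/\beta} n^{1-1/\beta}\right)$.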
Let $\tilde{p}_u^{t_u}$ be the approximated proximities constructed by Eq. (7) with rounded hub proximities $\tilde{P}_H$. We can trivially show that Propositions 1 and 2 hold for $\tilde{p}_u^{t_u}$. Thus, $\tilde{p}_u^{t_u}$ is still an increasing lower bound of $p_u$, and its top-$K$ values can replace $\hat{p}_u^{t_u}$ in our index. In the following, we give a bound for the error caused by rounding.
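PROPOSITION 3. Given rounding threshold $\omega$ and $\hat{p}_h(i) \propto \mu \cdot i^{-\beta}$, where $\mu = (1-\beta) n^{\beta-1}$, then for $\forall u \in V$,
$$\|p_u^{t_u} - \tilde{p}_u^{t_u}\|_1 \le 1 - \left(\frac{1-\beta}{\omega n}\right)^{\frac{1}{\beta} - 1}.$$
PROOF. See [29].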
We empirically observed (see Section 5) that our rounding approach saves a large amount of space and that the real storage requirements are much smaller than the theoretical bound given by Theorem 1. Meanwhile, the actual error is much smaller than the theoretical bound of Proposition 3 and, more importantly, it has minimal effect on the reverse top-$k$ results. To keep the index notation uncluttered, we use $P_H$ to also denote the rounded hub proximities (i.e., $\tilde{P}_H$) and $\hat{p}_u^{t_u}$ to denote the corresponding rounded proximity vectors computed using $\tilde{P}_H$.
4.2 Online Query Algorithm
This section introduces our online reverse top-$k$ search technique. Given a query node $q \in V$, we perform search in two steps. First, we compute the exact proximity from each $u \in V$ to $q$ using a novel and efficient method (Section 4.2.1). In the second step (Section 4.2.2), for each node $u$ we use the index described in Section 4.1 to prune $u$ or add $u$ to the search result, by deriving a lower and an upper bound (denoted as $lb_u^t$ and $ub_u^t$) of $u$'s $k$-th largest proximity value $p_u^{kmax}$ to other nodes and comparing it with its proximity to $q$. For nodes $u$ that cannot be pruned or confirmed as results, we refine $lb_u^t$ and $ub_u^t$ using our index incrementally, until $u$ is pruned or becomes a confirmed result. The refinement is used to update the index for faster future query processing (Section 4.2.3). In this section, we use $p_u$ and $p_{*,u}$ interchangeably to denote the $u$-th column of the proximity matrix $P$; also note that $lb_u^t = \hat{p}_u^t(k)$ and $ub_u^t$ is the upper bound of $p_u(k)$ $(= p_u^{kmax})$ w.r.t. $\hat{p}_u^t$.
4.2.1 RWR Proximity to the Query Node
The first step of our method is to compute the exact proximities from all other nodes to the query $q$. Although a lot of previous work has focused on computing the proximities from a given node $u$ to all other nodes (i.e., a column $p_{*,u}$ of the proximity matrix $P$), there have only been a few efforts on how to find the proximities from all nodes to a node $q$ (i.e., a row $p_{q,*}$ of $P$). The authors of the SpamRank algorithm [6] suggest computing approximate proximity vectors $p_{*,u}$ for all $u \in V$ and taking all $p_{q,u}$ to form $p_{q,*}$. However, to get an exact result, which is our target, such a method would require the computation of the entire $P$ to a very high precision, which leads to unacceptably high cost. A heuristic algorithm is proposed in [8], which first selects the nodes with high probability to be the large proximity contributors to the query node and then computes their proximity vectors. This method requires the computation of several proximity vectors $p_{*,u}$ to find only a subset of entries in $p_{q,*}$. [1] introduces a local search algorithm that examines only a small fraction of nodes, deriving, however, only an approximation of $p_{q,*}$.
Although it seems inevitable to compute the whole matrix $P$ to get the exact proximities from all nodes to $q$, we show that this problem can be solved by the power method and has the same complexity as calculating a single column of $P$. Our result is novel and constitutes an important contribution not only for the reverse top-$k$ search problem that we study in this paper, but also for any problem that includes finding the proximities from all nodes to a given node. For example, our method could be used as a module in SpamRank [6] to find the PageRank contributions that all nodes make to a given web page $q$ precisely and efficiently.
First of all, we note that $p_{q,*}$ is essentially the $q$-th row of $P$; hence, $p_{q,*} = e_q^T P = \alpha e_q^T (I - (1-\alpha)A)^{-1}$, or equivalently
$$p_{q,*}^T = (1-\alpha) A^T p_{q,*}^T + \alpha e_q$$
(see Section 2 for the definitions of $A$, $e_q$, and $e$). An interesting observation is that $p_{*,u}$ and $p_{q,*}$ are actually the solutions of the following linear systems, respectively,
$$x_u = (1-\alpha) A x_u + \alpha e_u \tag{10}$$
$$x_q = (1-\alpha) A^T x_q + \alpha e_q \tag{11}$$
which share the same structure, except that either $A$ or $A^T$ is used as the coefficient matrix. This similarity motivates us to apply the power method. Just as Eq. (10) can be solved by the iterative power method on matrix $[(1-\alpha)A + \alpha e_u e^T]$:
$$x_u^{i+1} = (1-\alpha) A x_u^i + \alpha e_u = [(1-\alpha)A + \alpha e_u e^T] x_u^i, \tag{12}$$
we hope that the linear system (11) could be solved by the following iterative method:
$$x_q^{i+1} = (1-\alpha) A^T x_q^i + \alpha e_q. \tag{13}$$
However, showing that the sequence generated by Eq. (13) converges to the solution of Eq. (11) is not trivial, as the proof of the convergence of Eq. (12) does not apply to Eq. (13). The main difference between the two is as follows. In Eq. (12), if $\|x_u^0\|_1 = 1$, then $\|x_u^i\|_1 = e^T x_u^i = 1$ for $i = 1, 2, \ldots$. Hence the r.h.s. of Eq. (12) shows that $\{x_u^i\}_i$ is a power method series and thus converges. Conversely, the sequence $\{x_q^i\}_i$ is not non-expansive in the general case and we may have $\|x_q^{i+1}\|_1 > \|x_q^i\|_1$. In other words, we cannot transform Eq. (13) to the form of the r.h.s. of Eq. (12) to prove that $\{x_q^i\}_i$ is a power method series, so there is no obvious guarantee that it will converge. We therefore have to prove that Eq. (13) converges to a unique vector, which is the solution of Eq. (11). Fortunately, using techniques very different from the original convergence proof of Eq. (12), we show that Eq. (13) indeed converges to a unique solution from an arbitrary initialization.
Let us lift $x_q \in \mathbb{R}^n$ to the space $\mathbb{R}^{n+1}$ by introducing $z_q = \begin{bmatrix} x_q \\ 1 \end{bmatrix}$. The affine Equation (11) is now equivalent to
$$z_q = D_q z_q \tag{14}$$
where $D_q = \begin{bmatrix} (1-\alpha) A^T & \alpha e_q \\ 0_{1 \times n} & 1 \end{bmatrix} \in \mathbb{R}^{(n+1) \times (n+1)}$. The first $n$ rows of (14) are exactly (11). Note that $z_q$ is an eigenvector of $D_q$ corresponding to eigenvalue 1. We will prove that $z_q$ is in fact the dominant eigenvector; therefore System (14) can be solved by the power method.
THEOREM 2. Let $\lambda_1$ and $\lambda_2$ be the two largest eigenvalues of $D_q$. Let $z_q^0 = [(x_q^0)^T, 1]^T$, $z_q^* = [p_{q,*}, 1]^T \in \mathbb{R}^{n+1}$, where $x_q^0$ is any vector in $\mathbb{R}^n$, and let
$$z_q^{i+1} = D_q z_q^i = D_q^{i+1} z_q^0 \tag{15}$$
then the following conclusions hold:
(a) $\lambda_1 = 1$ with multiplicity 1, and $\lim_{i \to \infty} z_q^i = z_q^*$, $\lim_{i \to \infty} x_q^i = p_{q,*}^T$.
(b) $\lambda_2 = 1 - \alpha$; the convergence rate of (15) and (13) is $1 - \alpha$;
(c) For convergence tolerance $\epsilon$, if $i > \log \epsilon / \log(1-\alpha)$, then $\|z_q^{i+1} - z_q^i\|_1 = \|x_q^{i+1} - x_q^i\|_1 < \epsilon$.
PROOF. (a) Note that the row sums of $D_q$ cannot exceed 1. In fact, for $\alpha > 0$, the $q$-th row and the last row have row sum 1 and all other rows have row sum $1 - \alpha < 1$. So the spectral radius $\rho(D_q) \le \max_i \sum_j (D_q)_{ij} \le 1$. On the other hand, $z_q^* \ne 0$ satisfies Eq. (14), which implies that $z_q^*$ is an eigenvector of $D_q$ with eigenvalue 1. Thus, $\lambda_1 = \rho(D_q) = 1$.
Note that any eigenvector of eigenvalue 1 must be a fixed point of Eq. (14). Therefore, if we can show that the sequence $\{z_q^i\}_i$ converges to a nonzero point, it must be the unique eigenvector, and then the multiplicity of $\lambda_1$ is 1. In the following, we prove that this statement is true. It is easy to verify that
$$D_q^i = \begin{bmatrix} (1-\alpha)^i (A^T)^i & \alpha \sum_{j=0}^{i-1} (1-\alpha)^j (A^T)^j e_q \\ 0_{1 \times n} & 1 \end{bmatrix}.$$
Since $\|A^T\|_\infty = \rho(A^T) = 1$, it follows that $\|(1-\alpha)^i (A^T)^i\|_\infty \le (1-\alpha)^i \|A^T\|_\infty^i \le (1-\alpha)^i \to 0$, so
$$\lim_{i \to \infty} D_q^i = \begin{bmatrix} 0_{n \times n} & \alpha [I - (1-\alpha) A^T]^{-1} e_q \\ 0_{1 \times n} & 1 \end{bmatrix} = \begin{bmatrix} 0 & P^T e_q \\ 0 & 1 \end{bmatrix},$$
implying that
$$\lim_{i \to \infty} z_q^i = \lim_{i \to \infty} D_q^i z_q^0 = \begin{bmatrix} 0 & P^T e_q \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_q^0 \\ 1 \end{bmatrix} = \begin{bmatrix} p_{q,*}^T \\ 1 \end{bmatrix},$$
where $p_{q,*}^T = P^T e_q$. Hence $\lim_{i \to \infty} z_q^i = z_q^*$ and $\lim_{i \to \infty} x_q^i = p_{q,*}^T$. This also certifies that there is a unique convergence point of (15), so the multiplicity of $\lambda_1$ is 1.
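(b) Rewrite $D_q = (1-\alpha) \begin{bmatrix} A^T & 0_{n \times 1} \\ 0_{1 \times n} & 1 \end{bmatrix} + \alpha F_q$, where $F_q = \begin{bmatrix} 0_{n \times n} & e_q \\ 0_{1 \times n} & 1 \end{bmatrix}$. Let $\zeta = \begin{bmatrix} 0_{n \times 1} \\ 1 \end{bmatrix}$. It is easy to verify that $D_q^T \zeta = \zeta$. As $\rho(D_q^T) = \rho(D_q) = 1$, $\zeta$ is the eigenvector corresponding to the largest eigenvalue of $D_q^T$ and it is unique, since $D_q$ and $D_q^T$ have the same eigenvalue multiplicity. Now we leverage the following lemma to assist the rest of the proof.
LEMMA 1. (From page 4 of [27]) If $\xi_i$ is an eigenvector of $A$ corresponding to the eigenvalue $\lambda_i$, $\zeta_j$ is an eigenvector of $A^T$ corresponding to $\lambda_j$ and $\lambda_i \ne \lambda_j$, then $\xi_i^T \zeta_j = 0$.
By Lemma 1, the eigenvector $\xi$ of $D_q$ for the second largest eigenvalue must be orthogonal to $\zeta$, i.e., $\xi^T \zeta = 0$. By the structure of $\zeta$, it must be true that $\xi = \begin{bmatrix} \theta \\ 0 \end{bmatrix}$, where $\theta$ is some vector in $\mathbb{R}^n$, which implies $F_q \xi = 0$. Hence, $D_q \xi = (1-\alpha) \begin{bmatrix} A^T & 0 \\ 0 & 1 \end{bmatrix} \xi = (1-\alpha) \begin{bmatrix} A^T \theta \\ 0 \end{bmatrix}$. As $D_q \xi = \lambda_2 \xi$, we have $A^T \theta = \frac{\lambda_2}{1-\alpha} \theta$, indicating that $\theta$ is an eigenvector of $A^T$. Since $A$ is a transition matrix, $\frac{\lambda_2}{1-\alpha} \le \rho(A) = 1$, so $\lambda_2 \le 1 - \alpha$. It is easy to verify that for $\xi = \begin{bmatrix} e_{n \times 1} \\ 0 \end{bmatrix}$, $D_q \xi = (1-\alpha) \xi$, so $\lambda_2 = 1 - \alpha$. In addition, the convergence rate of (15) is dictated by $|\lambda_2| / |\lambda_1| = 1 - \alpha$.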
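(c) Since $\{z_q^i\}_i$ is the power method series of $D_q$, we have $\|z_q^{i+1} - z_q^i\|_1 \le \left\| \left( \frac{|\lambda_2|}{|\lambda_1|} \right)^i \left( 1 - \frac{|\lambda_2|}{|\lambda_1|} \right) \right\| = \alpha (1-\alpha)^i$. Hence, $i > \frac{\log \epsilon}{\log(1-\alpha)}$ iterations reach tolerance $\epsilon$, for a total cost of $O\left( \frac{\log \epsilon}{\log(1-\alpha)} m \right)$.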
Algorithm 2 Power Method for Proximity to Node (PMPN)
Input: Matrix $A$, query $q$, convergence tolerance $\epsilon$.
Output: Proximities $p_{q,*}$ from all nodes to $q$.
1: Initialize $x_q^0$ as any vector in $\mathbb{R}^n$;
2: $i = 0$;
3: repeat
4:     $x_q^{i+1} = (1-\alpha) A^T x_q^i + \alpha e_q$;
5:     $i = i + 1$;
6: until $\|x_q^i - x_q^{i-1}\|_1 < \epsilon$    ▷ convergence of PMPN
7: $p_{q,*} = (x_q^i)^T$;
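Algorithm 2 translates almost line by line into NumPy. The sketch below is a hedged illustration: the function name `pmpn`, the optional `x0` argument, the dense matrices and the toy graph are assumptions of the example, and `A` is taken to be column-stochastic. The dense inverse is computed only to check the result.

```python
import numpy as np

def pmpn(A, q, alpha=0.15, eps=1e-10, x0=None, max_iter=10_000):
    """Iterate x <- (1 - alpha) A^T x + alpha e_q until the L1 change drops below eps."""
    n = A.shape[0]
    e_q = np.zeros(n)
    e_q[q] = 1.0
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_next = (1.0 - alpha) * (A.T @ x) + alpha * e_q   # Eq. (13)
        if np.abs(x_next - x).sum() < eps:                 # line 6 of Algorithm 2
            return x_next
        x = x_next
    return x

# Hypothetical toy graph: edges 0->1, 0->2, 1->0, 2->0, 2->3, 3->0
adj = np.zeros((4, 4))
for i, j in [(0, 1), (0, 2), (1, 0), (2, 0), (2, 3), (3, 0)]:
    adj[i, j] = 1.0
A = (adj / adj.sum(axis=1, keepdims=True)).T               # column-stochastic
P = 0.15 * np.linalg.inv(np.eye(4) - 0.85 * A)             # reference only
to_q = pmpn(A, q=0)
to_q_other_start = pmpn(A, q=0, x0=5.0 * np.ones(4))
```

Both runs approach row 0 of the reference matrix, which matches the claim that the iteration converges from an arbitrary initialization; each iteration needs only one product with $A^T$.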
4.2.2 Upper Bound for the k-largest Proximity
After having computed $p_{q,*}$, we know for each $u \in V$ the exact proximity $p_u(q)$ $(= p_{q,u})$ from $u$ to $q$. Now, we access the $k$-th row of the lower bound matrix $\hat{P}$ of the index (see Section 4.1) and prune all nodes $u$ for which $lb_u^t = \hat{p}_u^t(k) > p_u(q)$. Obviously, if the $k$-th largest lower bound from $u$ to any other node exceeds $p_u(q)$, then it is not possible for $q$ to be in the set of $k$ closest nodes to $u$. For each node $u$ that is not pruned, we compute an upper bound $ub_u^t$ for the $k$-th largest proximity from $u$ to any other node, using the information that we have about $u$ in the index. If $p_u(q) \ge ub_u^t$, then $u$ is definitely in the answer set of the reverse top-$k$ query. Otherwise, node $u$ needs further processing.
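We now show how to compute $ub_u^t$ for a node $u$. Note that from the index, we have the descending top-$K$ lower bound list $\hat{p}_u^t$ and the residue ink vector $r_u^t$. For $j = 1, 2, \ldots, k-1$, let
$$\Delta_j^t = \hat{p}_u^t(j) - \hat{p}_u^t(j+1) \tag{16}$$
$$z_0^t = 0, \text{ and } z_j^t = z_{j-1}^t + j \cdot \Delta_{k-j}^t, \quad 1 \le j \le k-1 \tag{17}$$
Then,
$$ub_u^t = \begin{cases} \hat{p}_u^t(k-j) - \dfrac{z_j^t - \|r_u^t\|_1}{j}, & \text{if } \exists j \in [1, k-1] \text{ s.t. } z_{j-1}^t < \|r_u^t\|_1 \le z_j^t \\[2ex] \hat{p}_u^t(1) + \dfrac{\|r_u^t\|_1 - z_{k-1}^t}{k}, & \text{if } \|r_u^t\|_1 > z_{k-1}^t \end{cases} \tag{18}$$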
Figures 3 and 4 illustrate the intuition and the derivation of $ub_u^t$. Assume that $k = 5$ and the first $k$ values of $\hat{p}_u^t$ are as shown on the left of the figures, while the total remaining ink $\|r_u^t\|_1$ is shown on the right of the figures. The best possible case for the $k$-th value of $p_u$ is when $\|r_u^t\|_1$ is distributed such that (i) only the first $k$ values may receive some ink, while all others receive zero ink and (ii) the ink is distributed in a way that maximizes the updated $k$-th value. To achieve (ii), $\hat{p}_u^t$ could be viewed as a staircase, the $k$ highest steps of which fit tightly in a container. If we pour the total residue ink $\|r_u^t\|_1$ into the container, the level of the ink will correspond to the value of $ub_u^t$. $\Delta_j^t$ is the difference between the $j$-th and the $(j+1)$-th step of the staircase, while $z_j^t$ is the ink required to pour in order for its level in the container to reach the $(k-j)$-th step. The first line of Eq. (18) corresponds to the case illustrated by Figure 3, where $ub_u^t$ is smaller than $\hat{p}_u^t(1)$, while the example of Figure 4 corresponds to the case of the second line, where the whole staircase is covered by residue ink ($\|r_u^t\|_1 > z_{k-1}^t$).
Figure 3: Upper bound, $k = 5$, $z_2^t < \|r_u^t\|_1 \le z_3^t$
Figure 4: Upper bound, $k = 5$, $\|r_u^t\|_1 > z_4^t$
The following proposition states that $ub_u^t$ is indeed an upper bound of the real $k$-th largest value $p_u^{kmax}$ and is monotonically decreasing as $p_u^t$ ($\hat{p}_u^t$) is refined by later iterations.
PROPOSITION 4. $\forall u \in V$, $ub_u^1 \ge ub_u^2 \ge \ldots \ge p_u^{kmax}$.
PROOF. See [29].
Algorithm 3 is a procedure for deriving the upper bound $ub_u^t$, given $\hat{p}_u^t$, $\|r_u^t\|_1$, and $k$. The algorithm simulates pouring $\|r_u^t\|_1$ into the container by gradually computing the $z_j^t$ values for $j = 1, 2, \ldots, k-1$, until $z_{j-1}^t < \|r_u^t\|_1 \le z_j^t$, which indicates that the residue ink $\|r_u^t\|_1$ can level up to $\hat{p}_u^t(k-j)$. If $\|r_u^t\|_1 > z_{k-1}^t$, the whole staircase is covered and the algorithm computes $ub_u^t$ by the second line of Eq. (18). The complexity of Algorithm 3 is $O(k)$, which is quite low compared to other modules.
Algorithm 3 Upper Bound Computation (UBC)
Input: Matrix $A$, number $k$, node $u$, lower bound vector $\hat{p}_u^t$, residue ink vector $r_u^t$.
Output: Upper bound $ub_u^t$ of the $k$-th largest proximity from $u$.
1: $z_0^t = 0$;
2: for $j = 1$ to $k - 1$ do
3:     Compute $\Delta_{k-j}^t$ by Eq. (16);
4:     Compute $z_j^t$ by Eq. (17);
5:     if $z_{j-1}^t < \|r_u^t\|_1 \le z_j^t$ then
6:         Compute $ub_u^t$ by first line of Eq. (18);
7:         return $ub_u^t$;
8: Compute and return $ub_u^t$ by second line of Eq. (18);
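The pouring procedure can be sketched as a short function. This is an illustration under assumptions: the name `upper_bound`, the plain-list input, and the early return for zero residue (where the lower bound is already exact) are choices of this example; the loop itself follows Eqs. (16) to (18) with 0-based list indexing.

```python
def upper_bound(p_hat, residue, k):
    """Upper bound on the k-th largest proximity.

    p_hat   : lower bounds sorted in descending order (at least k values)
    residue : total remaining ink, i.e. the L1 norm of the residue vector
    """
    if residue <= 0.0:
        return p_hat[k - 1]                      # nothing left to pour
    z_prev = 0.0                                 # z_0
    for j in range(1, k):
        # Delta_{k-j} = p_hat(k-j) - p_hat(k-j+1) in the paper's 1-based indexing
        delta = p_hat[k - j - 1] - p_hat[k - j]
        z = z_prev + j * delta                   # Eq. (17)
        if z_prev < residue <= z:
            return p_hat[k - j - 1] - (z - residue) / j    # first line of Eq. (18)
        z_prev = z
    return p_hat[0] + (residue - z_prev) / k     # second line of Eq. (18)

# Made-up lower bounds for illustration
p_hat = [0.5, 0.3, 0.2, 0.1]
ub_small = upper_bound(p_hat, residue=0.05, k=3)   # ink stays below the 2nd step
ub_mid = upper_bound(p_hat, residue=0.3, k=3)      # ink rises between steps 2 and 1
ub_full = upper_bound(p_hat, residue=0.8, k=3)     # whole staircase is covered
```

For `k = 3` the three calls give levels of 0.25, 0.4 and 0.6 (up to floating-point rounding): 0.05 units raise the third step from 0.2 to 0.25, 0.3 units fill the third step to 0.3 and then lift two steps by 0.1 each, and 0.8 units cover the top step and add 0.1 on all three.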
4.2.3 Candidate Refinement and Index Update
When $\hat{p}_u^t(k) \le p_{q,u} < ub_u^t$, we cannot be sure whether $u$ is a reverse top-$k$ result or not and we need to further refine the bounds $\hat{p}_u^t(k)$ and $ub_u^t$. First, we apply one step of BCA in continuing the computation of $p_u^t$ and update $\hat{p}_u^t$ (lines 6-7 of Algorithm 1). Then, we apply Algorithm 3 to compute a new $ub_u^t$. This step-wise refinement process is repeated while $\hat{p}_u^t(k) \le p_{q,u} < ub_u^t$; it stops once (i) $p_{q,u} < \hat{p}_u^t(k)$, which means that $q$ is not contained in the top-$k$ list of $u$, or (ii) $p_{q,u} \ge ub_u^t$, which means that $u$ definitely has $q$ as one of its top-$k$ nearest nodes. In our empirical study, we observed that for most of the candidates $u$, the process terminates long before the lower and upper bounds approach the exact value $p_{q,u}$. Thus, many computations are saved.
If, due to a reverse top-$k$ search, $p_u^t$ ($\hat{p}_u^t$) has been updated, we dynamically update the index to include this change. In addition, we update the corresponding stored values for $r_u^t$, $s_u^t$, and $w_u^t$. Due to this update, future queries will use tighter lower and upper bounds for $u$.
The complete online query (OQ) method is summarized by Algorithm 4. After computing the exact proximities to $q$ (line 1), the algorithm examines all $u \in V$. While a node $u$ is a candidate based on the lower bound $\hat{p}_u^{t_u}(k)$ (line 4), we first check (line 5) whether the lower bound is the actual proximity (this happens when $\|r_u^{t_u}\|_1 = 0$); in this case, $u$ is added to the result set $C$ and the loop breaks. Otherwise, the upper bound $ub_u^{t_u}$ is computed (line 8) to verify whether $u$ can be confirmed to be a result; if $u$ is not a result (line 13), lines 6-7 of Algorithm 1 are run to refine $\hat{p}_u^{t_u}(k)$; after the update, the lower bound condition is re-checked to see whether $u$ can be pruned or another loop is necessary. Note that the update, besides increasing the value of $\hat{p}_u^{t_u}(k)$ (i.e., increasing the chances for pruning), also reduces $ub_u^{t_u}$; therefore the revised upper bound $ub_u^{t_u}$ may render $u$ a query result.
Algorithm 4 Online Query (OQ)
Input: Matrix $A$, query $q$, number $k$, index $\mathcal{I}$.
Output: Reverse top-$k$ set $C$ of $q$, updated index $\mathcal{I}$.
1: Compute the exact proximities $p_{q,*}$ by Algorithm 2;
2: Initialize $C = \emptyset$;
3: for all $u \in V$ do
4:     while $p_{u,q} \ge \hat{p}_u^{t_u}(k)$ do
5:         if $\|r_u^{t_u}\|_1 = 0$ then
6:             $C = C \cup \{u\}$;    ▷ $\hat{p}_u^{t_u}(k) = p_u(k)$, so $u$ is a result
7:             break;
8:         Compute $ub_u^{t_u}$ by Algorithm 3;
9:         if $p_{u,q} \ge ub_u^{t_u}$ then
10:            $C = C \cup \{u\}$;    ▷ $u$ becomes a result
11:            break;
12:        else
13:            Update $\hat{p}_u^{t_u}(k)$ by Algorithm 1;
14: Save the updated $\hat{P}, R, W, S$ to $\mathcal{I}$;
We now illustrate OQ with our running example. Consider the
graph and the constructed index shown in Figure 2. Assume that
q = 1 (i.e., the query node is node 1 in the graph) and k = 2.
The first step is to compute $p_{q,*}$ using Algorithm 2; the result
is $p_{q,*}$ = [0.32, 0.24, 0.24, 0.19, 0.20, 0.18]. Now OQ loops
through all nodes u and checks whether $p_{u,q} \ge p^{t_u}_u(k)$. For the
first node u = 1, we have 0.32 > 0.28 and $p^{t_u}_u(k)$ is the actual
proximity $p_u(k)$ (recall that node 1 is a hub in our example, whose
proximities to other nodes have been computed), thus 1 is a result.
The same holds for u = 2 ($0.24 \ge 0.24$ and node 2 is a hub). For
u = 3, observe that $p_{u,q} < p^{t_u}_u(k)$ (i.e., 0.24 < 0.27); therefore
node 3 is safely pruned (i.e., OQ does not enter the while loop for
u = 3). Node u = 4 satisfies $p_{u,q} \ge p^{t_u}_u(k)$ (0.19 > 0.17) and
$\|r^{t_u}_u\|_1 > 0$, therefore the upper bound $ub^{t_u}_u = 0.36$ is computed
by Algorithm 3; however, $p_{u,q} < ub^{t_u}_u$, so we are still uncertain
whether node 4 is a reverse top-k result. A loop of Algorithm
1 is run to update $p^{t_u}_u(k)$ to 0.23 (line 13); now node 4 is pruned
because $p_{u,q} < p^{t_u}_u(k)$ (0.19 < 0.23). Continuing the example,
node 5 is immediately added to the result since $p_{5,q} = p^{t_5}_5(k)$ and
$\|r^{t_5}_5\|_1 = 0$, whereas node 6 is pruned after the refinement of $p^{t_6}_6$.
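The control flow of Algorithm 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Algorithms 1 and 3 are replaced by caller-supplied stand-ins (`refine` and `upper_bound`, both hypothetical names), and the toy numbers follow the running example, except for the bounds of node 6 and all residue values, which are made up here.

```python
from typing import Callable, Dict, Set


def online_query(
    prox_to_q: Dict[int, float],          # exact p_{u,q} for every node u
    lower_k: Dict[int, float],            # lower bound on u's k-th largest proximity
    residue_l1: Dict[int, float],         # ||r_u||_1; 0 means the lower bound is exact
    upper_bound: Callable[[int], float],  # stand-in for Algorithm 3
    refine: Callable[[int], None],        # stand-in for one refinement step of Algorithm 1
) -> Set[int]:
    """Control flow of the OQ loop; the bound routines come from the caller."""
    result: Set[int] = set()
    for u, p_uq in prox_to_q.items():
        while p_uq >= lower_k[u]:          # u is still a candidate
            if residue_l1[u] == 0:         # lower bound is exact, so u is a result
                result.add(u)
                break
            if p_uq >= upper_bound(u):     # confirmed by the upper bound
                result.add(u)
                break
            refine(u)                      # tighten the bounds, then re-check
    return result


# Toy data following the running example (q = 1, k = 2).
# The bounds of node 6 and all residue values are invented for illustration.
prox_to_q = {1: 0.32, 2: 0.24, 3: 0.24, 4: 0.19, 5: 0.20, 6: 0.18}
lower_k = {1: 0.28, 2: 0.24, 3: 0.27, 4: 0.17, 5: 0.20, 6: 0.15}
residue_l1 = {1: 0.0, 2: 0.0, 3: 0.1, 4: 0.1, 5: 0.0, 6: 0.1}
upper = {4: 0.36, 6: 0.30}
refined_lower = {4: 0.23, 6: 0.21}


def refine(u: int) -> None:
    # Pretend one refinement step raises the lower bound to a fixed value.
    lower_k[u] = refined_lower[u]


hits = online_query(prox_to_q, lower_k, residue_l1, lambda u: upper[u], refine)
print(sorted(hits))  # [1, 2, 5]
```

The loop terminates only if `refine` makes progress on every call, which the real refinement step is expected to do by reducing the residue; the stand-in above does so trivially.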
The following theorem shows the time complexity of OQ.
THEOREM 3. The time complexity of OQ in the worst case is
$$O\left(\left(\frac{\log \epsilon + |Cand| \log(\eta/\delta)}{\log(1-\alpha)}\right) m\right),$$
where $\epsilon$ is the convergence threshold of Algorithm 2, $\delta$ is the
residue threshold and $\eta$ is the propagation threshold of Algorithm
1, Cand is the set of candidates that could not be pruned immediately
by the index, and m = |E| is the number of graph edges.
PROOF. The cost of a query includes the cost of Algorithm 2,
which is $O\left(\frac{\log \epsilon}{\log(1-\alpha)} m\right)$, as discussed in Section 4.2.1, and the
cost of examining and refining the candidates (lines 2 to 14 of OQ).
The worst case is that all nodes in Cand cannot be pruned or confirmed
as results until we compute their exact k-th largest proximity
values by repeating line 7 of Algorithm 1, i.e., until the maximum
residue ink $\max_i r^{t_u}_u(i)$ at any node drops below $\eta$. Within an
iteration, the update of one node u requires at most O(m) operations.
Besides, each iteration is expected to shrink $\max_i r^{t_u}_u(i)$
by a factor of around $(1-\alpha)$. Recall that $\max_i r^{t_u}_u(i) \le \delta$;
hence the total number of iterations $\tau$ required to terminate BCA by
making it smaller than $\eta$ satisfies $\max_i r^{t_u}_u(i) \, (1-\alpha)^{\tau} \le \eta$, i.e.,
$$\tau = \frac{\log\left(\eta / \max_i \{r^{t_u}_u(i)\}\right)}{\log(1-\alpha)} \le \frac{\log(\eta/\delta)}{\log(1-\alpha)}.$$
Therefore, the total time complexity in the worst case is
$$O\left(\left(\frac{\log \epsilon + |Cand| \log(\eta/\delta)}{\log(1-\alpha)}\right) m\right).$$
As we show in Section 5.3, in practice, |Cand| is extremely
small compared to n and most of the candidates can be pruned or
confirmed within significantly fewer iterations than this worst-case
bound. Hence, the empirical performance of OQ is far better than the
worst case.
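To give a feel for the magnitude of the two iteration counts that appear in the worst-case bound, the short sketch below plugs in the parameter values reported in Section 5.2 (restart parameter 0.15, convergence threshold 10^-10, residue threshold 0.1, propagation threshold 10^-4). It ignores the constants hidden by the O-notation, and the helper name `iterations` is an illustrative choice, not part of the method.

```python
import math

alpha = 0.15   # restart parameter
eps = 1e-10    # convergence threshold of Algorithm 2
delta = 0.1    # residue threshold used when building the index
eta = 1e-4     # propagation threshold of Algorithm 1


def iterations(target_ratio: float, alpha: float) -> int:
    """Smallest integer t such that (1 - alpha)**t <= target_ratio."""
    return math.ceil(math.log(target_ratio) / math.log(1.0 - alpha))


# Iterations for Algorithm 2 to reach the convergence threshold.
pm_iters = iterations(eps, alpha)
# Worst-case refinement iterations per candidate (residue from delta down to eta).
refine_iters = iterations(eta / delta, alpha)

print(pm_iters, refine_iters)  # 142 43
```

Under these assumptions, each candidate needs at most a few dozen refinement passes even in the worst case, which matches the qualitative claim that the bound is rarely approached.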
5. EXPERIMENTAL EVALUATION
This section experimentally evaluates our approach, which is
implemented in Matlab 2012b. Our testbed is a cluster of 500 AMD
Opteron cores (800MHz per core) with a total of 1TB RAM. Since
our indexing algorithm can be fully parallelized (i.e., the approximate
proximity vectors of nodes are computed independently), we
evenly distributed the workload to 100 cores to implement the indexing
task. Each online query was executed on a single core and
the memory used at runtime corresponds to the size of our index
(i.e., at most a few GBs as reported in Table 2). Hence, our solution
can also run on a commodity machine.
5.1 Datasets
We conducted our efficiency experiments on a set of unlabeled
graphs. The number n = |V| of nodes and the number m = |E| of edges
of each graph are shown in Table 2. Web-stanford-cs^3 and Web-stanford^4
were crawled from stanford.edu. Each node is a web domain and a
directed link stands for a hyperlink between two nodes. Epinions^4
is a who-trust-whom social network from the consumer review site
epinions.com; each node is a member of this site, and a directed
edge $i \to j$ means that member i trusts j. Web-google^4 is a web
graph collected by Google. Experiments on two additional datasets
are included in [29].
^3 law.di.unimi.it/datasets.php
^4 snap.stanford.edu/data/
5.2 Index Construction
We first evaluate the cost for constructing our index (Section 4.1)
and its storage requirements for different graphs and sizes |H| of
hub sets. After tuning, we set the index construction parameters
(see Section 4.1) as follows: propagation threshold $\eta = 10^{-4}$,
residue threshold $\delta = 0.1$, hub vector rounding threshold
$\omega = 10^{-6}$ for the first three graphs, and $\omega = 5 \times 10^{-6}$ for the largest
one. In all cases, K = 200, the convergence threshold $\epsilon = 10^{-10}$
and the restart parameter $\alpha = 0.15$. For a study on the effect of
the different parameters and the rationale of choosing their default
values, see [29].
Table 2 shows the index construction time for different graphs,
for various values of the hub selection parameter B, which result in
different sizes |H| of hub sets. The last column shows the time to
compute the entire proximity matrix P and its size on disk, which
represents the brute-force (BF) approach of just pre-computing and
using P for reverse top-k queries. The value in parentheses in the
last column is the minimum possible cost for our index, derived by
just storing the top-K lower bound matrix $\hat{P}$ and disregarding the
storage of the hub proximities $P_H$ and matrices R, W, and S. The
last three rows for each graph show the space that our index would
have if we had not applied the compression technique discussed in
Section 4.1.3, the actual space of our index, and the predicted space
according to our analysis in Section 4.1.3 (i.e., using Theorem 1
with the parameter value 0.76 indicated by [4]). The reported times
sum up the running time at each core, assuming the worst case of
having just one single-core machine. Note that the actual time is
roughly the reported time divided by the number of cores (100).
We observe that the best number of hubs to select in terms of
both index construction cost and index size on disk depends on
the graph sparsity. For Web-stanford-cs, which is a sparse graph, it
suffices to select less than 1% of the nodes with the highest in-
and out-degrees as hubs, while for the denser Epinions and
Web-stanford graphs 1%-2% of the nodes should be selected. The
index construction is much faster than the entire P computation,
especially for larger and sparser graphs (e.g., for Web-google it
takes as little as 1.8% of the time to construct P). The time is not
affected too much by the number of selected hubs.
The same observation also holds for the size of our index, which
is much smaller than the entire P and a few times larger than the
baseline storage of the top-K lower bound matrix $\hat{P}$. Although our
index also stores the hub matrix $P_H$ and matrices R, W, and S,
its space is reasonable; the index can easily be accommodated in
the main memory of modern commodity hardware. The predicted
space according to our analysis is in most cases an overestimation,
due to an under-estimation of the power law effect on the sparsity
of proximity matrices. Note that our rounding approach generally
achieves significant space savings, especially on large graphs (e.g.,
Web-google). For each dataset, the index that we are using in
subsequent experiments is marked in bold.
5.3 Online Query Performance
We now evaluate the performance of our index and our online
reverse top-k algorithm. We run a sequence of 500 queries on the
indexes created in Section 5.2 and report average statistics for them.
Query Efficiency. Figure 5 shows the average runtime cost of
reverse top-k queries on different graphs, for different values of
k and with different options for using the index. Series "update"
denotes that after each query is evaluated, the index is updated to
save the changes in the $\hat{P}$, R, W, and S matrices, while
"no-update" means that the original index is used for each of the 500
queries. We separated these cases in order to evaluate whether our
index update policy brings benefits to subsequent queries, which
are applied on a more refined index. The case of "update" also bears
the cost of updating the corresponding matrices. In either case,
query evaluation is very fast compared to the brute-force approach
of computing the entire P (the time needed for this is already
reported in the last column of Table 2) for each graph. The update
policy results in a significant reduction of the average query time in
small and dense graphs; however, for larger and sparser graphs the
index update has a marginal effect on the query time
because there is a higher chance that subsequent queries are less
dependent on the index refinement done by previous ones. Note
that the workload includes 500 queries, which is a small number
compared to the size of the graphs; we expect that for larger
workloads the difference will be amplified on large graphs.

Table 2: Index construction time and space cost

Web-stanford-cs (|V| = 9914, |E| = 36854)
                      our index                               BF (entire P)
B                     50        100       200       300
|H|                   82        175       355       530
time (s)              31.5      31.6      34.2      40.4      365.5
no rounding (MB)      55.2      57.4      65.3      77.9
actual space (MB)     39.6      41.8      49.7      62.4      786 (15.8)
pred. space (MB)      44.7      93.5      188       280

Epinions (|V| = 75879, |E| = 508837)
                      our index                               BF (entire P)
B                     1000      1500      2000      3000
|H|                   1484      2101      2690      3853
time (s)              15827     12285     11565     10792     139860
no rounding (MB)      2778      2309      2284      2721
actual space (MB)     2310      1696      1538      1716      46071 (121)
pred. space (MB)      4220      5924      7551      10763

Web-stanford (|V| = 281903, |E| = 2312497)
                      our index                               BF (entire P)
B                     1000      1500      2000      3000
|H|                   1932      2866      3804      5586
time (s)              85503     89196     97462     111200    3263500
no rounding (MB)      6506      8237      10209     14069
actual space (MB)     1907      1639      1595      1638      635754 (451)
pred. space (MB)      3977      5681      7393      10645

Web-google (|V| = 875713, |E| = 5105039)
                      our index                               BF (entire P)
B                     5000      10000     20000     50000
|H|                   9598      18871     37148     86246
time (s)              1024200   1107400   2206300   2865300   60162000
no rounding (MB)      73362     137113    264315    607615
actual space (MB)     5387      4727      4888      6897      6718720 (1466)
pred. space (MB)      2874      4298      7103      14639
Pruning Power of Bounds. Figure 6 shows, for the same
queries and the "update" case only, the average number (per query)
of the candidates that are not immediately filtered using the lower
bounds of the index and also the number of nodes from these
candidates that are immediately identified as hits (i.e., results) after their
upper bound computation. This means that only (candidates $-$ hits)
nodes (i.e., columns of $\hat{P}$) need to be refined on average for each
query. We also show the average number of actual results for each
experimental setting. The plots show that the number of candidates
is on the order of k and a significant percentage of them
are immediately identified as results (based on their upper bounds)
without needing refinement, a fact that explains the efficiency of
our approach. In addition, the cost required for the refinement of
these candidates is much lower compared to the cost for computing
their exact proximity vectors. For example, computing the exact
proximity vector $p_u$ for a node u in Web-google takes more than
65 seconds, while our method requires just 0.15 seconds to refine a
candidate in a reverse top-100 query on the same graph, on average.
Another observation is that in some graphs, like Web-stanford-cs
and Web-google, the number of hits is very close to the number of
results. This suggests that when the accuracy demand is not high, an
approximate query algorithm, which only takes the hits as the result
and stops further exploration, would save even more time.
Figure 5: Search performance on different graphs, varying k
[Plots omitted. Panels: (a) Web-stanford-cs, (b) Epinions, (c) Web-stanford, (d) Web-google. x-axis: k (5, 10, 20, 50, 100); y-axis: query time (s); series: update, no-update.]
Figure 6: Number of candidates and immediate hits on different graphs, varying k
[Plots omitted. Panels: (a) Web-stanford-cs, (b) Epinions, (c) Web-stanford, (d) Web-google. x-axis: k (5, 10, 20, 50, 100); y-axis: node number; series: cand, hits, result.]
Effectiveness of Index Refinement. Figure 7 shows the cost of
individual reverse top-100 queries in the 500-query workload on
the Web-stanford graph, with and without the index update option.
Obviously, some queries are harder than others, depending on the
number of candidates that should be refined and the refinement cost
for them. We observe an increase in the gap between the query
costs as the query ID increases, which is due to the fact that as
the index gets updated the next queries in the sequence are likely to
take advantage of the update to avoid redundant refinements (which
would have to be performed if the index were not updated). For these
queries that take advantage of the updates (i.e., the ones toward the
end of the sequence), the cost is much lower compared to the case
where they are run against a non-updated index. In the following,
all experiments refer to the "update" case, i.e., the index is updated
after each query evaluation.
Cumulative Cost. Figure 8 compares the cumulative cost of a
workload that includes all nodes from the Web-stanford-cs graph as
queries with the cumulative cost of two versions of the BF method
on the same workload (k = 10). The infeasible BF method (IBF) first
constructs the exact P matrix, keeps the exact top-K proximity values
for each node u, and then evaluates each reverse top-k query
q at the minimal cost of accessing the q-th row of P and the k-th
proximity value for each $u \in V$. However, since IBF requires
materializing in memory the whole P (e.g., 6.7TB for Web-google),
it becomes infeasible for large graphs. An alternative, feasible BF
(FBF) method computes the entire P, but keeps in memory only
the exact top-K proximities of each node. Then, at query evaluation,
FBF uses our approach in Section 4.2.1 to compute the exact
RWR proximities to the query node from each node in the graph
and then uses the exact pre-computed proximities to verify the
reverse top-k results. As the figure shows, IBF has a high initial cost
for computing P and afterward the cost for each query is very low.
FBF bears the same overhead as IBF to compute P, but requires
longer query time. Our approach has a small initial overhead for
constructing our index and thereafter a modest cost for evaluating each
query and updating the index. From the figure, we can see that the
cumulative cost of our method is always lower than that of FBF and
lower than that of IBF for the first 60% of the queries. (We emphasize
again that IBF is infeasible for large graphs.) Besides, in practice,
reverse top-k search is only applied on a small percentage of nodes
(e.g., less than 10%); thus, its cumulative cost is low even when compared
to that of IBF. In summary, the overhead of computing P in both
versions of BF is very high, especially for large graphs, given that
not many reverse top-k queries are issued in practice.
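As a point of reference for the brute-force baselines, the sketch below evaluates a reverse top-k query directly from a fully materialized proximity matrix. It is a minimal illustration, assuming P is given as a list of rows with `P[v][u]` holding the proximity from u to v (so column u is the proximity vector of u); the function name and the 3-node matrix are made up for the example.

```python
from typing import List, Set


def reverse_topk_bruteforce(P: List[List[float]], q: int, k: int) -> Set[int]:
    """Return all u whose k-th largest proximity does not exceed P[q][u]."""
    n = len(P)
    result: Set[int] = set()
    for u in range(n):
        # Column u of P, sorted in descending order; IBF would precompute this.
        column = sorted((P[v][u] for v in range(n)), reverse=True)
        kth_largest = column[k - 1]
        if P[q][u] >= kth_largest:   # q is among the top-k nodes of u
            result.add(u)
    return result


# Toy 3-node matrix; each column sums to 1.
P = [
    [0.5, 0.2, 0.1],
    [0.3, 0.6, 0.2],
    [0.2, 0.2, 0.7],
]
r1 = reverse_topk_bruteforce(P, q=0, k=1)
r2 = reverse_topk_bruteforce(P, q=0, k=2)
print(sorted(r1), sorted(r2))  # [0] [0, 1]
```

The per-query work is trivial once P exists; the cost of the baselines lies in computing and storing P, which is what the index avoids.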
Rounding Effect. We also tested the effect on the query results of
using the rounded hub proximity matrix $\tilde{P}_H$ in our index instead
of the exact hub proximity matrix $P_H$ (see Section 4.1.3).
We used the 500-query workload on the Web-stanford-cs graph and,
for each query, we recorded the Jaccard similarity
$\frac{|R_1 \cap R_2|}{|R_1 \cup R_2|}$ between
the exact query results $R_1$ when using $P_H$ and the results $R_2$ when
using $\tilde{P}_H$ (i.e., our compressed index). Figure 9 plots the average
similarity between the results of the same query when using
$P_H$ or $\tilde{P}_H$, for different values of k and the rounding threshold $\omega$.
Observe that for $\omega = 10^{-5}$ or smaller (as adopted in our setting),
the results obtained with $\tilde{P}_H$ for different k are exactly the same
as those obtained with $P_H$. Even a larger threshold $\omega = 10^{-4}$
achieves an average precision of around 99% for all the tested k
values. Thus, the rounding technique (Section 4.1.3) loses almost
no accuracy, while saving a lot of space, as indicated by the results
of Table 2.
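For concreteness, the Jaccard similarity used in this comparison can be computed as in the following sketch. The function name and the convention of returning 1.0 for two empty result sets are illustrative assumptions.

```python
from typing import Hashable, Set


def jaccard(r1: Set[Hashable], r2: Set[Hashable]) -> float:
    """Jaccard similarity |R1 & R2| / |R1 | R2| of two result sets."""
    if not r1 and not r2:
        return 1.0  # assumed convention: two empty results count as identical
    return len(r1 & r2) / len(r1 | r2)


exact = {1, 2, 3}      # hypothetical result with the exact hub matrix
rounded = {2, 3, 4}    # hypothetical result with the rounded hub matrix
similarity = jaccard(exact, rounded)
print(similarity)  # 0.5
```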
5.4 Search Effectiveness
The experiments of this section demonstrate the effectiveness of
reverse RWR top-k search in some real graph-based applications.
Spam detection. Webspam^5 is a web host graph containing
11402 web hosts, out of which 8123 are manually labeled as "normal",
2113 are "spam", and the remaining ones are "undecided".
There are 730774 directed edges in the graph. We verify the use of
reverse RWR top-k search on spam detection by applying reverse
top-5 search on all the spam and normal nodes, and check what
types of web hosts give their top-5 PageRank contributions to each
query node. Our experimental results show that if a query web host
is classified as spam, on average 96.1% of the web hosts in its reverse
top-5 set are also spam nodes; on the other hand, if the query is a
normal web host, on average 97.4% of the web hosts in its reverse top-5
result are normal. Therefore, reverse top-k results using RWR are a
very strong indicator toward detection of spam web hosts. In a real
scenario, we can apply a reverse top-k RWR search on any suspicious
web host, and make a judgement according to the spam ratio
of the labeled answer set.
^5 barcelona.research.yahoo.net/webspam/datasets/uk2006/

Figure 7: Cost of individual queries
[Plot omitted. x-axis: query ID (0 to 500); y-axis: query time (s); series: update, no-update.]
Figure 8: Cumulative cost in a workload
[Plot omitted. x-axis: number of queries (0 to 10000); y-axis: accumulated query time (s); series: Infeasible Brute Force (IBF), Feasible Brute Force (FBF), our method.]
Figure 9: Effect of rounding
[Plot omitted. x-axis: k (5, 10, 20, 50, 100); y-axis: result similarity; series: $\omega = 10^{-4}$, $\omega = 10^{-5}$, $\omega = 10^{-6}$.]

author                reverse top-5 size    # coauthors
Philip S. Yu          2020                  231
Jiawei Han            2007                  253
Christos Faloutsos    1932                  221
Zheng Chen            162                   137
Qiang Yang            161                   166
Daphne Koller         157                   98
C. Lee Giles          155                   132
Gerhard Weikum        149                   130
Michael I. Jordan     147                   125
Bernhard Schölkopf    140                   134
Table 3: Longest reverse top-5 lists of DBLP authors
Popularity of authors in a coauthorship network. The size of
a reverse top-k query result can also be an indicator of the popularity
of the query node in the graph. We extracted from DBLP^6 the
publications in top venues in the fields of databases, data mining, machine
learning, artificial intelligence, computer vision, and information
retrieval. We generated a coauthorship network with 44528 nodes
and 121352 edges, where each node corresponds to an author and
an edge indicates coauthorship. To reflect the different weights in
coauthorships, we changed the RWR transition matrix as follows:
$$a_{i,j} = \begin{cases} \frac{w_{i,j}}{w_j} & \text{if edge } j \to i \text{ exists,} \\ 0 & \text{otherwise,} \end{cases}$$
where $w_j$ is the number of publications of author j and $w_{i,j}$ is the
number of papers that i and j coauthored. We carried out reverse
top-5 search from all the nodes in the graph, and obtained a list of
authors ranked in descending order of the size of their answer set.
The 10 authors with the longest reverse top-5 lists are shown in
Table 3. The table indicates that there are three popular authors with
very long reverse top-5 lists, which stand out.^7 More importantly,
the reverse top-k lists of these three authors are much longer than
their coauthor lists (third column of Table 3), which indicates that
there are many non-coauthors having them in their top-k
sets. Therefore, the size of a reverse top-k query result can be a stronger
indicator for popularity, compared to the node's degree.
^6 dblp.uni-trier.de/xml/
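The weighted transition entries defined above can be derived from a list of papers as in the sketch below. The input format (one author list per paper), the function name, and the three toy papers are assumptions made for illustration; they are not the DBLP extraction pipeline used in the experiments.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, Iterable, List, Tuple


def coauthor_transitions(papers: Iterable[List[str]]) -> Dict[Tuple[str, str], float]:
    """Map (i, j) to a_{i,j} = w_{i,j} / w_j for every coauthor pair."""
    pubs: Counter = Counter()    # w_j: number of publications of author j
    joint: Counter = Counter()   # w_{i,j}: number of papers coauthored by i and j
    for author_list in papers:
        authors = sorted(set(author_list))
        pubs.update(authors)
        joint.update(permutations(authors, 2))  # ordered pairs (i, j), i != j
    return {(i, j): w_ij / pubs[j] for (i, j), w_ij in joint.items()}


# Three hypothetical papers: A has 3 publications, B has 2, C has 1.
papers = [["A", "B"], ["A", "B", "C"], ["A"]]
a = coauthor_transitions(papers)
print(a[("A", "B")], a[("C", "A")])
```

Here `a[("B", "A")]` is 2/3 because A coauthored two of its three papers with B. Pairs that never coauthored are simply absent from the dictionary, which corresponds to the zero entries of the matrix.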
6. RELATED WORK
6.1 Random Walk with Restart
Random walk with restart has been a widely used node-to-node
proximity measure in graph data, especially after its successful
application by the search engine Google [21] to derive the importance
(i.e., PageRank) of web pages.
Early works focused on how to efficiently solve the linear system
(1). Although non-iterative methods such as Gaussian elimination
can be applied, their high complexity of $O(n^3)$ makes them
unaffordable in real scenarios. Iterative approaches such as the Power
Method (PM) [21] and the Jacobi algorithm have a lower complexity
of O(Dm), where $D$ ($\ll n < m$) is the number of iterations.
Later on, faster (but less accurate) methods such as Hub-vector
decomposition [15] have been proposed. As this method restricts the
restarting only to a specific node set, it does not compute exactly
the proximity vectors of all nodes in the graph.
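The O(Dm) cost of the iterative approaches comes from sweeping over all edges once per iteration. The sketch below shows one such iteration scheme for a single source node. It is a minimal illustration under explicit assumptions: the walk moves uniformly to out-neighbors, nodes without out-neighbors simply drop their mass, and the function name and stopping rule (L1 change below `eps`) are choices made here rather than details taken from the cited methods.

```python
from typing import Dict, List


def rwr_power_method(
    out_neighbors: Dict[int, List[int]],
    u: int,
    alpha: float = 0.15,
    eps: float = 1e-10,
    max_iter: int = 1000,
) -> Dict[int, float]:
    """Iterate p <- alpha * e_u + (1 - alpha) * A p until the L1 change is below eps."""
    p = {v: 0.0 for v in out_neighbors}
    p[u] = 1.0
    for _ in range(max_iter):
        nxt = {v: 0.0 for v in out_neighbors}
        nxt[u] = alpha                       # restart mass returns to the source
        for v, mass in p.items():            # one pass over all edges: O(m)
            nbrs = out_neighbors[v]
            if mass and nbrs:
                share = (1.0 - alpha) * mass / len(nbrs)
                for w in nbrs:
                    nxt[w] += share
        change = sum(abs(nxt[v] - p[v]) for v in p)
        p = nxt
        if change < eps:
            break
    return p


cycle = {0: [1], 1: [2], 2: [0]}   # toy directed 3-cycle
p = rwr_power_method(cycle, u=0)
print(round(p[0], 4))
```

On the 3-cycle the fixed point can be checked by hand: p(0) = alpha / (1 - (1 - alpha)^3), which the iteration approaches geometrically.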
To further accelerate the computation of RWR, approximate
approaches have been introduced. [22] leverages the block structure
of the graph and only calculates the RWR similarity within the
partition containing the query node. Later, Monte Carlo (MC) methods
were introduced to simulate the random walk process, such as [9, 3,
18]. The simulation can be stored as a fingerprint for fast online
RWR estimation. Recently, a scheduled approximation strategy was
proposed by [30] to compute RWR proximity. From another viewpoint
of RWR, the Bookmark Coloring Algorithm (BCA) [7] has been
proposed to derive a sparse lower bound approximation of the real
proximity vector (see Section 2 for details). Our offline index is
based on approximations derived by partial execution of BCA and
not on other approaches, such as PM or MC simulation, because the
latter do not guarantee that their approximations are lower bounds
of the exact proximities and therefore do not fit into our framework
of using lower and upper proximity bounds to accelerate search.
^7 By "popular" here we mean authors who are likely to be approachable
by many other authors and intuitively have a higher chance to
collaborate with them in the future. Indeed, there are many other
authors who are very popular (e.g., in terms of visibility) and they
do not show up in Table 3, but these authors are likely to work in
smaller groups and do not have so much open collaboration,
compared to those having larger reverse top-k sets.
6.2 Top-k RWR Proximity Search
Bahmani et al. [4] observed that the majority of entries in a
proximity vector are extremely small. Thus, in many cases, it is
unnecessary to compute the exact RWR proximity from the query
node to all remaining nodes, especially to those with extremely
low proximities. Based on this observation, several top-k proximity
search algorithms have been introduced. Based on BCA [7], [11]
proposed the Basic Push Algorithm (BPA). At each iteration, BPA
maintains a set of top-k candidates and estimates an upper bound
for the (k+1)-th largest proximity. BPA stops as soon as the upper
bound is not greater than the current k-th largest proximity.
Recently, another method, K-dash [10], was proposed. In an indexing
stage, K-dash applies LU decomposition on the proximity matrix P
and stores the sparse matrices $L^{-1}$ and $U^{-1}$. In the query stage, it
builds a BFS tree rooted at the query node and estimates an upper
bound for each visited node. Such estimation can help determine
whether K-dash should continue or terminate.
When the exact order of the top-k list is not important and a few
misplaced elements are acceptable, Monte Carlo methods can be
used to simulate RWR from the query node u. [3] designs two such
algorithms: MC End Point and MC Complete Path. The former
evaluates RWR proximity $p_u(v)$ as the fraction of t random walks
which end at node v, while the latter evaluates $p_u(v)$ as the number
of visits to node v multiplied by $(1 - c)/t$.
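The two estimators can be sketched with a single simulation loop, shown below. This is an illustration under assumptions rather than the algorithms of [3] verbatim: c is taken to be the probability of continuing the walk at each step (so 1 - c plays the role of the restart probability), walks that reach a node without out-neighbors are stopped there, and the function name and toy graph are invented for the example.

```python
import random
from typing import Dict, List, Tuple


def mc_estimates(
    out_neighbors: Dict[int, List[int]],
    u: int,
    c: float,
    t: int,
    seed: int = 0,
) -> Tuple[Dict[int, float], Dict[int, float]]:
    """Run t walks from u; return (end-point, complete-path) estimates of p_u."""
    rng = random.Random(seed)
    end_counts: Dict[int, int] = {}
    visit_counts: Dict[int, int] = {}
    for _ in range(t):
        v = u
        while True:
            visit_counts[v] = visit_counts.get(v, 0) + 1
            # Stop with probability 1 - c, or when v has no out-neighbors (assumption).
            if rng.random() >= c or not out_neighbors[v]:
                break
            v = rng.choice(out_neighbors[v])
        end_counts[v] = end_counts.get(v, 0) + 1
    end_point = {v: n / t for v, n in end_counts.items()}
    complete_path = {v: n * (1 - c) / t for v, n in visit_counts.items()}
    return end_point, complete_path


graph = {0: [1, 2], 1: [2], 2: [0]}   # toy graph without dangling nodes
end_point, complete_path = mc_estimates(graph, u=0, c=0.85, t=2000)
print(sorted(end_point, key=end_point.get, reverse=True))
```

The complete-path variant uses every visited node of every walk, so it typically needs fewer walks than the end-point variant for a comparable estimate, at the cost of more bookkeeping per walk.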
6.3 Reverse k-NN and Reverse Top-k Search
Reverse k nearest neighbors (RkNN) search aims at finding all
objects in a set T that have a given query object from T in their
k-NN sets. In the Euclidean space, RkNN queries were introduced in
[17]; an efficient geometric solution was proposed in [24]. RkNN
search has also been studied for objects lying on large graphs,
albeit using shortest path as the proximity measure, which makes the
problem much easier [28]. The reverse top-k query is defined by
Vlachou et al. in [26] as follows. Given a set of multi-dimensional
data points and a query point q, the goal is to find all linear
preference functions that define a total ranking in the data space such that
q lies in the top-k result of the functions. Solutions for RkNN and
reverse top-k queries cannot be applied to solve our problem, due to
the special nature of graph data and/or the use of RWR proximity.
7. CONCLUSIONS
In this paper, we have studied for the first time the problem of
reverse top-k proximity search in large graphs, based on the random
walk with restart (RWR) measure. We showed that the naive evaluation
of this problem is too expensive and proposed an index which
keeps track of lower bounds for the top proximity values from each
node. Our online query evaluation technique first computes the exact
RWR proximities from the query node q to all graph nodes and
then compares them with the top-k lower bounds derived from the
index. For nodes that cannot be pruned, we compute upper bounds
for their k-th proximities and use them to test whether they are in
the reverse top-k result. For any remaining candidates, their k-th
proximity lower and upper bounds are progressively refined until
they become results or they are pruned. Our experiments confirm
the efficiency of our approach; in addition, we demonstrate the use
of reverse top-k queries in identifying spam web hosts or popular
authors in co-authorship networks. As future work, we plan to
generalize the problem of reverse top-k search to other proximity
measures such as SimRank [14]. Since the current framework does
not consider the dynamics of the graph, we would also like to extend
our method to do reverse top-k search on evolving graphs. The
key challenge is how to maintain the index incrementally.
8. REFERENCES
[1] R. Andersen, C. Borgs, J. T. Chayes, J. E. Hopcroft, V. S. Mirrokni,
and S.-H. Teng. Local computation of pagerank contributions. In
WAW, 2007.
[2] R. Andersen, F. R. K. Chung, and K. J. Lang. Local graph
partitioning using pagerank vectors. In FOCS, 2006.
[3] K. Avrachenkov, N. Litvak, D. Nemirovsky, E. Smirnova, and
M. Sokol. Quick detection of top-k personalized pagerank lists. In
WAW, 2011.
[4] B. Bahmani, A. Chowdhury, and A. Goel. Fast incremental and
personalized pagerank. PVLDB, 4(3):173-184, 2010.
[5] A. Balmin, V. Hristidis, and Y. Papakonstantinou. Objectrank:
Authority-based keyword search in databases. In VLDB, 2004.
[6] A. A. Benczúr, K. Csalogány, T. Sarlós, and M. Uher. Spamrank -
fully automatic link spam detection. In AIRWeb, 2005.
[7] P. Berkhin. Bookmark-coloring approach to personalized pagerank
computing. Internet Mathematics, 3(1):41-62, 2006.
[8] Y.-Y. Chen, Q. Gan, and T. Suel. Local methods for estimating
pagerank values. In CIKM, 2004.
[9] D. Fogaras, B. Rácz, K. Csalogány, and T. Sarlós. Towards scaling
fully personalized pagerank: Algorithms, lower bounds, and
experiments. Internet Mathematics, 2(3):333-358, 2005.
[10] Y. Fujiwara, M. Nakatsuji, M. Onizuka, and M. Kitsuregawa. Fast
and exact top-k search for random walk with restart. PVLDB,
5(5):442-453, 2012.
[11] M. S. Gupta, A. Pathak, and S. Chakrabarti. Fast algorithms for top-k
personalized pagerank queries. In WWW, 2008.
[12] T. H. Haveliwala. Topic-sensitive pagerank. In WWW, 2002.
[13] J. He, M. Li, H. Zhang, H. Tong, and C. Zhang. Manifold-ranking
based image retrieval. In ACM Multimedia, 2004.
[14] G. Jeh and J. Widom. Simrank: a measure of structural-context
similarity. In KDD, 2002.
[15] G. Jeh and J. Widom. Scaling personalized web search. In WWW,
2003.
[16] I. Konstas, V. Stathopoulos, and J. M. Jose. On social networks and
collaborative recommendation. In SIGIR, 2009.
[17] F. Korn and S. Muthukrishnan. Influence sets based on reverse
nearest neighbor queries. In SIGMOD Conference, 2000.
[18] N. Li, Z. Guan, L. Ren, J. Wu, J. Han, and X. Yan. giceberg: Towards
iceberg analysis in large graphs. In ICDE, 2013.
[19] D. Liben-Nowell and J. M. Kleinberg. The link prediction problem
for social networks. In CIKM, 2003.
[20] A. Y. Ng, A. X. Zheng, and M. I. Jordan. Link analysis, eigenvectors
and stability. In IJCAI, 2001.
[21] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank
citation ranking: Bringing order to the web. Technical Report
1999-66, Stanford InfoLab, 1999.
[22] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos. Neighborhood
formation and anomaly detection in bipartite graphs. In ICDM, 2005.
[23] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu. Pathsim: Meta
path-based top-k similarity search in heterogeneous information
networks. PVLDB, 4(11), 2011.
[24] Y. Tao, D. Papadias, and X. Lian. Reverse knn search in arbitrary
dimensionality. In VLDB, 2004.
[25] H. Tong, C. Faloutsos, and Y. Koren. Fast direction-aware proximity
for graph mining. In KDD, 2007.
[26] A. Vlachou, C. Doulkeridis, Y. Kotidis, and K. Nørvåg. Reverse
top-k queries. In ICDE, 2010.
[27] J. H. Wilkinson. The algebraic eigenvalue problem, volume 155.
Oxford Univ Press, 1965.
[28] M. L. Yiu, D. Papadias, N. Mamoulis, and Y. Tao. Reverse nearest
neighbors in large graphs. In ICDE, 2005.
[29] A. W. Yu, N. Mamoulis, and H. Su. Reverse top-k search using
random walk with restart. Technical Report TR-2013-08, CS
Department, HKU, September 2013.
[30] F. Zhu, Y. Fang, K. C.-C. Chang, and J. Ying. Incremental and
accuracy-aware personalized pagerank through scheduled
approximation. PVLDB, 6(6), 2013.