
Link Spam Detection Based on Mass Estimation

Technical Report, October 31, 2005

Zoltan Gyongyi  [email protected]
Computer Science Department, Stanford University, Stanford, CA 94305
(Work performed during a summer internship at Yahoo! Inc.)

Pavel Berkhin  [email protected]
Yahoo! Inc., 701 First Avenue, Sunnyvale, CA 94089

Hector Garcia-Molina  [email protected]
Computer Science Department, Stanford University, Stanford, CA 94305

Jan Pedersen  [email protected]
Yahoo! Inc., 701 First Avenue, Sunnyvale, CA 94089
Abstract
Link spamming intends to mislead search engines and trigger an artificially high link-based ranking of specific target web pages. This paper introduces the concept of spam mass, a measure of the impact of link spamming on a page's ranking. We discuss how to estimate spam mass and how the estimates can help identify pages that benefit significantly from link spamming. In our experiments on the host-level Yahoo! web graph we use spam mass estimates to successfully identify tens of thousands of instances of heavy-weight link spamming.
1 Introduction
In an era of search-based web access, many attempt to mischievously influence the page rankings produced by search engines. This phenomenon, called web spamming, represents a major problem to search engines [Singhal, 2004, Henzinger et al., 2002] and has negative economic and social impact on the whole web community. Initially, spammers focused on enriching the contents of spam pages with specific words that would match query terms. With the advent of link-based ranking techniques, such as PageRank [Page et al., 1998], spammers started to construct spam farms, collections of interlinked spam pages. This latter form of spamming is referred to as link spamming, as opposed to the former term spamming.

The plummeting cost of web publishing has produced a boom in link spamming. The size of many spam farms has increased dramatically, and many farms span tens, hundreds, or even thousands of different domain names, rendering naive countermeasures ineffective. Skilled spammers, whose activity remains largely undetected by search engines, often manage to obtain very high rankings for their spam pages.

This paper proposes a novel method for identifying the largest and most sophisticated spam farms, by turning the spammers' ingenuity against themselves. Our focus is on spamming attempts that target PageRank. We introduce the concept of spam mass, a measure of how much PageRank a page accumulates through being linked to by spam pages. The target pages of spam farms, whose PageRank is boosted by many spam pages, are expected to have a large spam mass. At the same time, popular reputable pages, which have high PageRank because other reputable pages point to them, have a small spam mass.

We estimate the spam mass of all web pages by computing and combining two PageRank scores: the regular PageRank of each page and a biased one, in which a large group of known reputable pages receives more weight. Mass estimates can then be used to identify pages that are significant beneficiaries of link spamming with high probability.

The strength of our approach is that we can identify any major case of link spamming, not only farms with regular interconnection structures or cliques, which represent the main focus of previous research (see Section 5). The proposed method also complements our previous work on TrustRank [Gyongyi et al., 2004] in that it detects spam as opposed to detecting reputable pages.

This paper is organized as follows. We start with some background material on PageRank and link spamming. The first part of Section 3 introduces the concept of spam mass through a transition from simple examples to formal definitions. Then, the second part of Section 3 presents an efficient way of estimating spam mass and a practical spam detection algorithm based on mass estimation. In Section 4 we discuss our experiments on the Yahoo! search engine index and offer evidence that spam mass estimation is helpful in identifying heavy-weight link spam. Finally, Section 5 places our results into the larger picture of link spam detection research and PageRank analysis.
2 Preliminaries
2.1 Web Graph Model
Information on the web can be viewed at different levels of granularity. For instance, one could think of the web of individual HTML pages, the web of hosts, or the web of sites. Our discussion will abstract from the actual level of granularity, and see the web as an interconnected structure of nodes, where nodes may be pages, hosts, or sites, respectively.

We adopt the usual graph model of the web, and use G = (V, E) to denote the web graph that consists of a set V of nodes (vertices) and a set E of directed links (edges) that connect nodes. We disregard the exact definition of a link, although it will usually represent one or a group of hyperlinks between corresponding pages, hosts, or sites. We use unweighted links and disallow self-links.
Each node has some incoming links, or inlinks, and some outgoing links, or outlinks. The
number of outlinks of a node x is its outdegree, whereas the number of inlinks is its indegree. The
nodes pointed to by a node x are the out-neighbors of x. Similarly, the nodes pointing to x are its
in-neighbors.
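As a concrete, toy-sized illustration of this model, the following Python sketch (ours, not part of the report) stores such a graph as adjacency lists; the node names are hypothetical.

from collections import defaultdict

# Directed web graph at some level of granularity (pages, hosts, or sites).
# Links are unweighted and self-links are disallowed, as in the model above.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]

out_neighbors = defaultdict(set)
in_neighbors = defaultdict(set)
for x, y in edges:
    if x != y:                          # disallow self-links
        out_neighbors[x].add(y)
        in_neighbors[y].add(x)

def outdegree(x):
    return len(out_neighbors[x])        # number of outlinks of x

def indegree(x):
    return len(in_neighbors[x])         # number of inlinks of x

print(outdegree("a"), indegree("c"))    # prints: 2 3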
2.2 Linear PageRank
A popular discussion and research topic, PageRank as introduced in [Page et al., 1998] is more of a concept that allows for different mathematical formulations, rather than a single clear-cut algorithm. From among the available approaches (for an overview, see [Berkhin, 2005], [Bianchini et al., 2005], and [Eiron et al., 2004]), we adopt the linear system formulation of PageRank, which we introduce next.

At a high level, the PageRank scores assigned to web nodes correspond to the stationary probability distribution of a random walk on the web graph: assume a hypothetical web surfer moving from node to node by following links, ad infinitum. Then, nodes will have PageRank scores proportional to the time the surfer spends at each.
A significant technical problem with PageRank is that on the web as it is, such a hypothetical surfer would often get stuck for some time in nodes without outlinks. Consider the transition matrix T corresponding to the web graph G, defined as

T_{xy} = 1/out(x) if (x, y) ∈ E, and T_{xy} = 0 otherwise,

where out(x) is the outdegree of node x. Note that T is substochastic: the rows corresponding to nodes with outlinks sum up to 1, but the rows corresponding to dangling nodes are all 0. To allow for a true probabilistic interpretation, T has to be transformed into a stochastic transition matrix T′, commonly done as follows. Consider a vector v of positive elements with the norm ||v|| = ||v||_1 = 1, specifying a probability distribution. Then,

T′ = T + d v^T,

where d is a dangling node indicator vector:

d_x = 1 if out(x) = 0, and d_x = 0 otherwise.

This transformation corresponds to adding virtual links from dangling nodes to (all) other nodes on the web, which are then followed according to the probability distribution v.
Even for T′ it is not immediately clear whether a stationary probability distribution, and therefore unique PageRank scores, exist. To guarantee a unique stationary distribution, the Markov chain corresponding to our random walk has to be ergodic, that is, our surfer should be able to navigate from any node to any other node. This property is satisfied if we introduce a random jump (also known as teleportation): at each step, the surfer follows one of the links from T′ with probability c or jumps to some random node (selected based on the probability distribution v) with probability (1 - c). The corresponding augmented transition matrix T′′ is defined as

T′′ = c T′ + (1 - c) 1_n v^T.

PageRank can now be defined rigorously as the stationary distribution p of the random walk on T′′. In fact, p is the dominant eigenvector (corresponding to the eigenvalue 1) of the system

p = (T′′)^T p = [c T^T + c v d^T + (1 - c) v 1_n^T] p,   (1)

which can be solved by using, for instance, the power iterations algorithm.
It turns out, however, that we can reach a solution following a simpler path. We make the
following two observations:
1. 1_n^T p = ||p||;

2. d^T p = ||p|| - ||T^T p||.

Hence, the PageRank equation (1) can be rewritten as the linear system

(I - c T^T) p = k v,   (2)

where k = k(p) = ||p|| - c ||T^T p|| is a scalar. Notice that any particular value of k will result only in a rescaling of p and does not change the relative ordering of nodes. In fact, we can pick any value for k, solve the linear system, and then normalize the solution to p/||p|| to obtain the same result as for (1). In this paper, we simply set k = 1 - c, so that equation (2) becomes

(I - c T^T) p = (1 - c) v.   (3)

We adopt the notation p = PR(v) to indicate that p is the (unique) vector of PageRank scores satisfying (3) for a given v. In general, we will allow for non-uniform random jump distributions. We even allow v to be unnormalized, that is, 0 < ||v|| ≤ 1, and leave the PageRank vector unnormalized as well. The linear system (3) can be solved, for instance, by using the Jacobi method, shown as Algorithm 1.
input : transition matrix T, random jump vector v, damping factor c, error bound ε
output: PageRank score vector p

i ← 0
p[0] ← v
repeat
    i ← i + 1
    p[i] ← c T^T p[i-1] + (1 - c) v
until ||p[i] - p[i-1]|| < ε
p ← p[i]

Algorithm 1: Linear PageRank.
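For concreteness, here is a small Python sketch of this Jacobi iteration. It is our own illustration rather than the implementation used in the experiments, and the graph, function, and variable names are ours.

import numpy as np

def linear_pagerank(out_links, nodes, v, c=0.85, eps=1e-10):
    # Jacobi iteration for (I - c T^T) p = (1 - c) v, as in Algorithm 1.
    # out_links: dict mapping a node to the list of its out-neighbors
    #            (the rows of the substochastic matrix T).
    # v:         dict mapping a node to its random jump weight
    #            (need not be normalized).
    idx = {x: i for i, x in enumerate(nodes)}
    jump = np.array([v.get(x, 0.0) for x in nodes])
    p = jump.copy()                              # p[0] <- v
    while True:
        p_next = (1 - c) * jump
        for x in nodes:                          # add c * T^T p[i-1]
            targets = out_links.get(x, [])
            if targets:                          # dangling nodes simply leak score
                share = c * p[idx[x]] / len(targets)
                for y in targets:
                    p_next[idx[y]] += share
        if np.abs(p_next - p).sum() < eps:
            return dict(zip(nodes, p_next))
        p = p_next

# Example with a uniform random jump distribution v = (1/n, ..., 1/n):
nodes = ["a", "b", "c", "d"]
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
v = {x: 1.0 / len(nodes) for x in nodes}
p = linear_pagerank(out_links, nodes, v)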
A major advantage of the adopted formulation is that the PageRank scores are linear in v: for p = PR(v) and v = v_1 + v_2 we have p = p_1 + p_2, where p_1 = PR(v_1) and p_2 = PR(v_2). Another advantage is that linear systems can be solved using various numerical algorithms, such as the Jacobi or Gauss-Seidel methods, which are regularly faster than the algorithms available for solving eigensystems (for instance, power iterations). Further details on linear PageRank are provided in [Berkhin, 2005].
2.3 Link Spamming
In this paper we focus on link spamming that targets the PageRank algorithm. PageRank is
fairly robust to spamming: a significant increase in score requires a large number of links from
low-PageRank nodes and/or some hard-to-obtain links from popular nodes, such as The New York
Times site www.nytimes.com. Spammers usually try to blend these two strategies, though the
former is more prevalent.
In order to better understand the modus operandi of link spamming, we introduce the model of
a link spam farm, a group of interconnected nodes involved in link spamming. A spam farm has a
single target node, whose ranking the spammer intends to boost by creating the whole structure.
A farm also contains boosting nodes, controlled by the spammer and connected so that they would influence the PageRank of the target. Boosting nodes are owned either by the author of the target, or by some other spammer (financially or otherwise) interested in collaborating with him/her. Commonly, boosting nodes have little value by themselves; they only exist to improve the ranking of the target. Their PageRank tends to be small, so serious spammers employ a large number of boosting nodes (occasionally, thousands of them) to trigger high target ranking.
In addition to the links within the farm, spammers may gather some external links from reputable nodes. While the author of a reputable node y is not voluntarily involved in spamming (according to our model, if he/she were, the page would be part of the farm), stray links may exist for a number of reasons:

- Node y is a blog, message board, or guestbook, and the spammer manages to post a comment that includes a spam link, which then slips under the editorial radar.

- The spammer creates a honey pot, a spam page that offers valuable information, but behind the scenes is still part of the farm. Unassuming users might then point to the honey pot, without realizing that their link is harvested for spamming purposes.

- The spammer purchases domain names that recently expired but had previously been reputable and popular. This way he/she can profit from the old links that are still out there.
Actual link spam structures may contain several target pages, and can be thought of as alliances of simple spam farms [Gyongyi and Garcia-Molina, 2005].

In this paper, we focus on identifying target nodes x that benefit mainly from boosting: spam nodes linking to x increase x's PageRank more than reputable nodes do. Subsequent sections discuss our proposed approach and the supporting experimental results.
3 Spam Mass
3.1 Naive Approach
In order to start formalizing our problem, let us conceptually partition the web into a set of reputable nodes V⁺ and a set of spam nodes V⁻, with V⁺ ∪ V⁻ = V and V⁺ ∩ V⁻ = ∅. (In practice, such perfect knowledge is clearly unavailable. Also, what constitutes spam is often a matter of subjective judgment; hence, the real web includes a voluminous gray area of nodes that some call spam while others argue against that label. Nevertheless, our simple dichotomy will be helpful in constructing the theory of the proposed spam detection method.) Given this partitioning, we wish to detect web nodes x that gain most of their PageRank through spam nodes in V⁻ that link to them. We will conclude that such nodes x are spam farm target nodes.

A very simple approach would be that, given a node x, we look only at its immediate in-neighbors. For the moment, let us assume that it is known whether the in-neighbors of x are reputable, good nodes or spam. (We will remove this unrealistic assumption in Section 3.4.) Now we wish to infer whether x is good or spam, based on the in-neighbor information.

In a first approximation, we can simply look at the number of inlinks. If the majority of x's links comes from spam nodes, x is labeled a spam target node; otherwise it is labeled good. We call this approach our first labeling scheme. It is easy to see that this scheme often mislabels spam. To illustrate, consider the web graph in Figure 1. (Our convention is to show known good nodes filled white, known spam nodes filled black, and to-be-labeled nodes hashed gray.)
Figure 1: A scenario in which the first naive labeling scheme fails, but the second succeeds.
Figure 2: Another scenario in which both naive labeling schemes fail.
As x has two links from good nodes g0 and g1 and a single link from spam node s0, it will be labeled good. However, the PageRank of x is

p_x = (1 + 3c + kc²)(1 - c)/n,

out of which (c + kc²)(1 - c)/n is due to spamming. (It is straightforward to verify that in the absence of spam nodes s0, ..., sk the PageRank of x would decrease by this much.) For c = 0.85, as long as k ≥ ⌈1/c⌉ = 2 the largest part of x's PageRank comes from spam nodes, so it would be reasonable to label x as spam. As our first scheme fails to do so, let us come up with something better.
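For readers who wish to verify this value, a short derivation, assuming the structure implied by the text (g0, g1, and s0 each link only to x, and each of s1, ..., sk links only to s0), follows from equation (3) with the uniform random jump v_x = 1/n:

p_{s_i} = (1-c)/n, \qquad p_{g_0} = p_{g_1} = (1-c)/n, \qquad
p_{s_0} = (1-c)/n + c \sum_{i=1}^{k} p_{s_i} = (1 + kc)(1-c)/n,

p_x = (1-c)/n + c\,(p_{g_0} + p_{g_1} + p_{s_0}) = (1 + 3c + kc^2)(1-c)/n,

and the spam part is exactly the term that arrives through s0, namely c\,p_{s_0} = (c + kc^2)(1-c)/n.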
A natural alternative is to look not only at the number of links, but also at what amount of PageRank each link contributes. The contribution of a link amounts to the change in PageRank induced by the removal of the link. For Figure 1, links from g0 and g1 both contribute c(1 - c)/n, while the link from s0 contributes (c + kc²)(1 - c)/n. As the largest part of x's PageRank comes from a spam node, we correctly label x as spam.
However, there are cases when even our second scheme is not quite good enough. For example, consider the graph in Figure 2. The links from g0 and g2 contribute (2c + 4c²)(1 - c)/n to the PageRank of x, while the link from s0 contributes (c + 4c²)(1 - c)/n only. Hence, the second scheme labels x as good. It is important, however, to realize that spam nodes s5 and s6 influence the PageRank scores of g0 and g2, respectively, and so they also have an indirect influence on the PageRank of x. Overall, the 6 spam nodes of the graph have a stronger influence on x's PageRank than the 4 reputable ones do. Our second scheme fails to recognize this because it never looks beyond the immediate in-neighbors of x.

Therefore, it is appropriate to devise a third scheme that labels node x considering all the PageRank contributions of other nodes that are directly or indirectly connected to x. The next section will show how to compute such contributions, both direct and indirect (e.g., that of s5 to x). Then, in Section 3.3 the contributions of spam nodes will be added to determine what we call the spam mass of nodes.
3.2 PageRank Contribution
In this section we adapt some of the formalism and results introduced for inverse P-distances in [Jeh and Widom, 2003].

The connection between the nodes x and y is captured by the concept of a walk. A walk W from x to y in a directed graph is defined as a finite sequence of nodes x = x_0, x_1, ..., x_k = y, where there is a directed edge (x_i, x_{i+1}) ∈ E between every pair of adjacent nodes x_i and x_{i+1}, i = 0, ..., k - 1. The length |W| of a walk W is the number k ≥ 1 of edges. A walk with x = y is called a circuit.

Acyclic graphs contain a finite number of walks, while cyclic graphs have an infinite number of walks. The (possibly infinite) set of all walks from x to y is denoted by 𝒲_{xy}.
We define the PageRank contribution of x to y over the walk W as

q_y^W = c^k π(W)(1 - c) v_x,

where π(W) is the weight of the walk:

π(W) = ∏_{i=0}^{k-1} 1/out(x_i).

This weight can be interpreted as the probability that a Markov chain of length k starting in x reaches y through the sequence of nodes x_1, ..., x_{k-1}.
In a similar manner, we define the total PageRank contribution of x to y, x ≠ y, over all walks from x to y (or simply: the PageRank contribution of x to y) as

q_y^x = ∑_{W ∈ 𝒲_{xy}} q_y^W.

For a node's contribution to itself, we also consider an additional virtual circuit Z_x that has length zero and weight 1, so that

q_x^x = ∑_{W ∈ 𝒲_{xx}} q_x^W = q_x^{Z_x} + ∑_{V ∈ 𝒲_{xx}, |V| ≥ 1} q_x^V = (1 - c) v_x + ∑_{V ∈ 𝒲_{xx}, |V| ≥ 1} q_x^V.

Note that if a node x does not participate in circuits, x's contribution to itself is q_x^x = (1 - c) v_x, which corresponds to the random jump component.

For convenience, we extend our notion of contribution even to those nodes that are unconnected. If there is no walk from node x to node y then the PageRank contribution q_y^x is zero.
The following theorem reveals the connection between the PageRank contributions and the PageRank scores of nodes. (The proofs of the theorems are provided as appendices.)

Theorem 1 The PageRank score of a node y is the sum of the contributions of all other nodes to y:

p_y = ∑_{x ∈ V} q_y^x.
It is possible to compute the PageRank contribution of a node to all nodes in a convenient way, as stated next.

Theorem 2 Under a given random jump distribution v, the vector q^x of contributions of a node x to all nodes is the solution of the linear PageRank system for the core-based random jump vector v^x:

v_y^x = v_x if x = y, and v_y^x = 0 otherwise,

that is,

q^x = PR(v^x).

Remember that the PageRank equation system is linear in the random jump vector. Hence, we can easily determine the PageRank contribution q^U of any subset of nodes U ⊆ V by computing PageRank using the random jump vector v^U defined as

v_y^U = v_y if y ∈ U, and v_y^U = 0 otherwise.

To verify the correctness of this last statement, note that q^x = PR(v^x) for all x ∈ U and v^U = ∑_{x ∈ U} v^x, therefore q^U = PR(v^U) = ∑_{x ∈ U} q^x.
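To make this last observation concrete, the following small Python sketch (ours; it solves the linear system directly with a dense matrix, which is only feasible for toy graphs) computes the contribution vector q^U of a node set U and checks the linearity property:

import numpy as np

def pagerank_linear(T, v, c=0.85):
    # Exact solution of (I - c T^T) p = (1 - c) v for a small dense graph.
    n = T.shape[0]
    return np.linalg.solve(np.eye(n) - c * T.T, (1 - c) * v)

def contribution(T, v, U, c=0.85):
    # q^U = PR(v^U): zero out the random jump everywhere outside U.
    vU = np.zeros_like(v)
    for x in U:
        vU[x] = v[x]
    return pagerank_linear(T, vU, c)

# Toy graph with 4 nodes; T[x, y] = 1/out(x) if x links to y.
T = np.array([
    [0.0, 0.5, 0.5, 0.0],   # node 0 links to 1 and 2
    [0.0, 0.0, 1.0, 0.0],   # node 1 links to 2
    [1.0, 0.0, 0.0, 0.0],   # node 2 links to 0
    [0.0, 0.0, 1.0, 0.0],   # node 3 links to 2
])
v = np.full(4, 0.25)                      # uniform random jump
p = pagerank_linear(T, v)
q3 = contribution(T, v, U={3})            # contribution of node 3 to every node
assert np.allclose(p, contribution(T, v, U={0, 1, 2}) + q3)   # linearity in v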
3.3 Definition of Spam Mass

Returning to the example in Figure 2, let us check whether PageRank contributions could indeed help in labeling x. We calculate and add the contributions of known good and spam nodes to the PageRank of x:

q_x^{ {g0, ..., g3} } = (2c + 2c²)(1 - c)/n

and

q_x^{ {s0, ..., s6} } = (c + 6c²)(1 - c)/n.

Then, we can decide whether x is spam based on the comparison of q_x^{ {s0, ..., s6} } to q_x^{ {g0, ..., g3} }. For instance, for c = 0.85, q_x^{ {s0, ..., s6} } = 1.65 q_x^{ {g0, ..., g3} }. Therefore, spam nodes have more impact on the PageRank of x than good nodes do, and it might be wise to conclude that x is in fact spam. We formalize our intuition as follows.
formalize our intuition as follows.
For a given partitioning 1
+
, 1

of 1 and for any node x, it is the case that p


x
= q
V
+
x
+ q
V

x
,
that is, xs PageRank is the sum of the contributions of good nodes and that of spam nodes. (The
formula includes xs contribution to itself, as we assume that we are given information about all
nodes.)
Denition 1 The absolute spam mass of x, denoted by M
x
, is the PageRank contribution that x
receives from spam nodes, that is, M
x
= q
V

x
.
Hence, the spam mass is a measure of how much direct or indirect in-neighbor spam nodes
increase the PageRank of a node. Our experimental results indicate that it is suggestive to take a
look at the spam mass of nodes in comparison to their total PageRank:
Denition 2 The relative spam mass of x, denoted by m
x
, is the fraction of xs PageRank due to
contributing spam nodes, that is, m
x
= q
V

x
/p
x
.
8
3.4 Estimating Spam Mass
The assumption that we have accurate a priori knowledge of whether nodes are good (i.e., in V⁺) or spam (i.e., in V⁻) is of course unrealistic. Not only is such information currently unavailable for the actual web, but it would be impractical to produce and would quickly get outdated. In practice, the best we can hope for is some approximation to (a subset of) the good nodes (say Ṽ⁺) or spam nodes (say Ṽ⁻). Accordingly, we expect that search engines have some reliable white-list and/or black-list, comprising a subset of the nodes, compiled manually by editors and/or generated by algorithmic means.

Depending on which of these two sets is available (either or both), the spam mass of nodes can be approximated by estimating good and spam PageRank contributions.

In this paper we assume that only a subset of the good nodes Ṽ⁺ is provided. We call this set Ṽ⁺ the good core. A suitable good core is not very hard to construct, as discussed in Section 4.2. Note that one can expect the good core to be more stable over time than Ṽ⁻, as spam nodes come and go on the web. For instance, spammers frequently abandon their pages once there is some indication that search engines adopted anti-spam measures against them.

Given Ṽ⁺, we compute two sets of PageRank scores:
1. p = PR(v), the PageRank of nodes based on the uniform random jump distribution v = (1/n, ..., 1/n), and

2. p′ = PR(v_{Ṽ⁺}), a core-based PageRank with a random jump distribution v_{Ṽ⁺}, where the x-th component is 1/n if x ∈ Ṽ⁺, and 0 otherwise.

Note that p′ approximates the PageRank contributions that nodes receive from known good nodes.
These PageRank scores can be used to estimate spam mass:

Definition 3 Given PageRank scores p_x and p′_x, the estimated absolute spam mass of node x is

M̃_x = p_x - p′_x

and the estimated relative spam mass of x is

m̃_x = (p_x - p′_x)/p_x = 1 - p′_x/p_x.
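In vector form, Definition 3 is just elementwise arithmetic on the two score vectors. A minimal numpy sketch (ours), assuming p and p_core hold the regular and core-based PageRank scores, with a small guard that we add for nodes whose PageRank is zero:

import numpy as np

def spam_mass_estimates(p, p_core, eps=1e-12):
    # Estimated absolute spam mass: M~ = p - p'
    abs_mass = p - p_core
    # Estimated relative spam mass: m~ = 1 - p'/p (0 for nodes with no PageRank)
    rel_mass = np.where(p > eps, abs_mass / np.maximum(p, eps), 0.0)
    return abs_mass, rel_mass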
As a simple example of how spam mass estimation works, consider again the graph in Figure 2 and assume that the good core is Ṽ⁺ = {g0, g1, g3}. For c = 0.85 and n = 12, the PageRank score, actual absolute mass, estimated absolute mass, and their relative counterparts are shown for each of the nodes in Table 1. Note that here, as well as in the rest of the paper, numeric PageRank scores and absolute mass values are scaled by n/(1 - c) for increased readability. Accordingly, the scaled PageRank score of a node without inlinks is 1.

For instance, the scaled PageRank score of g0 is 2.7. Of that, M_{g0} = 0.85 is contributed by spam pages, in particular by s5. Hence, g0's relative mass is m_{g0} = 0.85/2.7 = 0.31.

The difference between actual and estimated mass can be observed in case of nodes x and g2. Although g2 is a good node, it is not a member of Ṽ⁺. Hence, both its absolute mass and relative mass are overestimated. The mass estimates for node x are also larger than the actual values.
Node          PageRank   core-based    absolute   estimated     relative   estimated
              p          PageRank p′   mass M     abs. mass M̃   mass m     rel. mass m̃
x             9.33       2.295         6.185      7.035         0.66       0.75
g0            2.7        1.85          0.85       0.85          0.31       0.31
g1            1          1             0          0             0          0
g2            2.7        0.85          0.85       1.85          0.31       0.69
g3            1          1             0          0             0          0
s0            4.4        0             4.4        4.4           1          1
s1, ..., s6   1          0             1          1             1          1

Table 1: Various features of nodes in Figure 2.
Note that the absolute and relative mass estimates of most good nodes are small compared to
the estimated mass of spam nodes. While the example in Figure 2 is overly simple, the relative
separation of good and spam indicates that mass estimates could be used for spam detection
purposes.
In the alternate situation that Ṽ⁻ is provided, the absolute spam mass can be estimated by computing the spam contribution PR(v_{Ṽ⁻}) directly. Finally, when both Ṽ⁻ and Ṽ⁺ are known, the spam mass estimates could be derived, for instance, by simply averaging the two estimates. It is also possible to invent more sophisticated combination schemes, e.g., a weighted average where the weights depend on the relative sizes of Ṽ⁻ and Ṽ⁺, with respect to the estimated sizes of V⁻ and V⁺.
3.5 Size of the Good Core
In a final step before devising a practical spam detection algorithm based on mass estimation, we need to consider a technical problem that arises for real web data.

One can expect for the web that our good core Ṽ⁺ will be significantly smaller than the actual set of good nodes V⁺. That is, |Ṽ⁺| ≪ |V⁺| and thus ||v_{Ṽ⁺}|| ≪ ||v||. Note that by the definition of p = PR(v) from (3), ||p|| ≈ ||v||. Similarly, ||p′|| ≈ ||v_{Ṽ⁺}||. It follows that ||p′|| ≪ ||p||, i.e., the total estimated good contribution is much smaller than the total PageRank of nodes. In this case, when estimating spam mass, we will have ||p - p′|| ≈ ||p||, with only a few nodes that have absolute mass estimates differing from their PageRank scores.
A simple remedy to this problem is as follows. We can construct a (small) uniform random sample of nodes and manually label each sample node as spam or good. This way it is possible to roughly approximate the prevalence of spam nodes on the whole web. We introduce γ to denote the fraction of nodes that we estimate (based on our sample) to be good, so γn ≈ |V⁺|. Then, we scale the core-based random jump vector v_{Ṽ⁺} to w, where

w_x = γ/|Ṽ⁺| if x ∈ Ṽ⁺, and w_x = 0 otherwise.

Note that ||w|| = γ ≈ ||v_{V⁺}||, so the two random jump vectors are of the same order of magnitude. Then, we can compute p′ based on w and expect that ||p′|| approximates the total good contribution, so we get a reasonable estimate of it.
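The scaling step itself is a one-liner; the sketch below (ours) builds w for n nodes from a set of good-core indices and an estimated good fraction gamma (0.85 in the experiments of Section 4.3):

import numpy as np

def scaled_core_jump(n, good_core, gamma):
    # w_x = gamma / |good core| for core nodes, 0 elsewhere, so that ||w|| = gamma.
    w = np.zeros(n)
    core = list(good_core)
    w[core] = gamma / len(core)
    return w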
Using w in computing the core-based PageRank leads to an interesting situation. As Ṽ⁺ is small, the good nodes in it will receive an unusually high random jump (γ/|Ṽ⁺| as opposed to 1/n). Therefore, the good PageRank contribution of these known reputable nodes will be overestimated, to the extent that occasionally, for some node y, p′_y will be larger than p_y. Hence, when computing M̃, there will be nodes with negative spam mass. In general, a negative mass indicates that a node is known to be good in advance (is a member of Ṽ⁺) or its PageRank is heavily influenced by the contribution of nodes in the good core.
3.6 Spam Detection Algorithm
Section 3.3 introduced the concept of spam mass, Section 3.4 provided an efficient way of estimating it, while Section 3.5 eliminated some technical obstacles in our way. In this section we put all the pieces together and present our link spam detection algorithm based on mass estimation.

While very similar in nature, our experiments (discussed in Section 4.5) indicate that relative mass estimates are more useful in spam detection than their absolute counterparts. Therefore, we build our algorithm around estimating the relative mass of nodes. Details are presented as Algorithm 2.

The first input of the algorithm is the good core Ṽ⁺. The second input is a threshold τ to which relative mass estimates are compared. If the estimated relative mass of a node is equal to or above this threshold then the node is labeled as a spam candidate.

The third input is a PageRank threshold ρ: we only verify the relative mass estimates of nodes with PageRank scores larger than or equal to ρ. Nodes with PageRank less than ρ are never labeled as spam candidates.
input : good core Ṽ⁺, relative mass threshold τ, PageRank threshold ρ
output: set of spam candidates S

S ← ∅
compute PageRank scores p
construct w based on Ṽ⁺ and compute p′
m̃ ← (p - p′)/p
for each node x so that p_x ≥ ρ do
    if m̃_x ≥ τ then
        S ← S ∪ {x}
    end
end

Algorithm 2: Mass-based spam detection.
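A direct transcription of the filtering and labeling steps into Python might look as follows (our sketch; it assumes the two PageRank vectors have already been computed, for example with the routines sketched earlier):

import numpy as np

def mass_based_spam_candidates(p, p_core, tau, rho):
    # Estimated relative mass m~ = 1 - p'/p (nodes with zero PageRank get 0).
    rel_mass = 1.0 - np.divide(p_core, p, out=np.ones_like(p), where=p > 0)
    # Label as spam candidates the nodes with p >= rho and m~ >= tau.
    return set(np.flatnonzero((p >= rho) & (rel_mass >= tau)))

Applied to the scaled scores of Table 1 with ρ = 1.5 and τ = 0.5, this returns the candidate set {x, s0, g2} discussed in the example below.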
There are at least three reasons to apply a threshold on PageRank. First, remember that we are interested in detecting nodes that profit from significant link spamming. Obviously, a node with a small PageRank is not a beneficiary of considerable boosting, so it is of no interest to us.

Second, focusing on nodes x with large PageRank also means that we have more evidence: a larger number of nodes contributing to the PageRank of x. Therefore, no single node's contribution is critical alone; the decision whether a node is spam or not is based upon data collected from multiple sources.

Finally, for nodes x with low PageRank scores, even the slightest error in approximating M_x by M̃_x could yield huge differences in the corresponding relative mass estimates. The PageRank threshold helps us to avoid the complications caused by this phenomenon.
As an example of how the algorithm operates, consider once more the graph in Figure 2 with node features in Table 1. Let us assume that Ṽ⁺ = {g0, g1, g3}, ρ = 1.5 (once again, we use scaled PageRank scores), τ = 0.5, and w = v_{Ṽ⁺}. Then, the algorithm disregards nodes g1, g3 and s1, ..., s6 because of their low PageRank of 1 < ρ = 1.5. Again, such nodes cannot possibly benefit from significant boosting by link spamming.

Node x has PageRank p_x = 9.33 ≥ ρ = 1.5 and a large estimated relative mass m̃_x = 0.75 ≥ τ = 0.5, hence it is added to the spam candidate set S. Similarly, node s0 is labeled spam as well. A third node, g2, is a false positive: it has a PageRank of p_{g2} = 2.7 and an estimated relative mass of m̃_{g2} = 0.69, so it is labeled spam. This error is due to the fact that our good core Ṽ⁺ is incomplete. Finally, the other good node g0 is correctly excluded from S, because m̃_{g0} = 0.31 < τ.
4 Experimental Results
4.1 Data Set
To evaluate the proposed spam detection method we performed a number of experiments on actual
web data. The data set that we used was based on the web index of the Yahoo! search engine as
of 2004.
From the complete index of several billion web pages we extracted a list consisting of approximately 73.3 million individual web hosts. (Web host names represent the part of the URL between the http:// prefix and the first / character. Host names map to IP addresses through DNS. We did not perform alias detection, so for instance www-cs.stanford.edu and cs.stanford.edu counted as two separate hosts, even though the URLs map to the same web server/IP address.)

The web graph corresponding to hosts contained slightly more than 979 million edges. These edges were obtained by collapsing all hyperlinks between any pair of pages on two different hosts into a single directed edge.
Out of the 73.3 million hosts, 25.6 million (35%) had no inlinks and 48.6 million (66.4%) had
no outlinks. Reasonable explanations for the large number of hosts without outlinks are (1) the
presence of URLs that never got visited by the Yahoo! spider due to the crawling policy and (2)
the presence of URLs that could not be crawled because they were misspelled or the corresponding
host was extinct. Some 18.9 million hosts (25.8%) were completely isolated, that is, had neither
inlinks nor outlinks.
4.2 Good Core
The construction of a good core Ṽ⁺ represented a first step in producing spam mass estimates for the hosts. As we were aiming for a large good core, we felt that the manual selection of its members is unfeasible. Therefore, we devised a way of assembling a substantially large good core with minimal human intervention:

1. We included in Ṽ⁺ all hosts that appear in a small web directory which we consider being virtually void of spam. (We prefer not to disclose which directory this is in order to protect it from infiltration attempts of spammers who might read this paper.) After cleaning the URLs (removing incorrect and broken ones), this group consisted of 16,776 hosts.

2. We included in Ṽ⁺ all US governmental (.gov) hosts (55,320 hosts after URL cleaning). Though it would have been nice to include other countries' governmental hosts, as well as various international organizations, the corresponding lists were not immediately available to us, and we could not devise a straightforward scheme for their automatic generation.

3. Using web databases (e.g., univ.cc) of educational institutions worldwide, we distilled a list of 3,976 schools from more than 150 countries. Based on the list, we identified 434,045 individual hosts that belong to these institutions, and included all these hosts in our good core Ṽ⁺.

With all three sources included, the good core consisted of 504,150 unique hosts.
4.3 Experimental Procedure
First, we computed the regular PageRank vector p for the host graph introduced in Section 4.1. We used an implementation of Algorithm 1 (Section 2.2).

Corroborating earlier research reports, the produced PageRank scores follow a power-law distribution. Accordingly, most hosts have very small PageRank: slightly more than 66.7 out of the 73.3 million (91.1%) have a scaled PageRank less than 2, that is, less than double the minimal PageRank score. At the other end of the spectrum, only about 64,000 hosts have PageRank scores that are at least 100 times larger than the minimal. This means that the set of hosts that we focus on, that is, the set of spam targets with large PageRank, is by definition small compared to the size of the web.

Second, we computed the core-based PageRank vector p′ using the same PageRank algorithm, but a different random jump distribution. Initially we experimented with a random jump of 1/n to each host in Ṽ⁺. However, the resulting absolute mass estimates were virtually identical to the PageRank scores for most hosts, as ||p′|| ≪ ||p||.

To circumvent this problem, we decided to adopt the alternative of scaling the random jump vector to w, as discussed in Section 3.5. In order to construct w, we relied on the conservative estimate that at least 15% of the hosts are spam. (In [Gyongyi et al., 2004] we found that more than 18% of web sites are spam.) Correspondingly, we set up w as a uniform distribution vector over the elements of Ṽ⁺, with ||w|| = 0.85.

Following the methodology introduced in Section 3.4, the vectors p and p′ were used to produce the absolute and relative mass estimates of hosts (M̃ and m̃, respectively). We analyzed these estimates and tested the proposed spam detection algorithm. Our findings are presented in the following two sections.
4.4 Relative Mass
The main results of our experiments concern the performance of Algorithm 2 presented in Section 3.6.

With relative mass values already available, only the filtering and labeling steps of the algorithm were to be performed. First, we proceeded with the PageRank filtering, using the arbitrarily selected scaled PageRank threshold ρ = 10. This step resulted in a set T of 883,328 hosts with scaled PageRank scores greater than or equal to 10. The set T is what we focus on in the rest of this section.

In order to evaluate the effectiveness of Algorithm 2 we constructed and evaluated a sample T′ of T. T′ consisted of 892 hosts, or approximately 0.1% of T, selected uniformly at random.

We performed a careful manual inspection of the sample hosts, searching for traces of spamming in their contents, links, and the contents of their in- and out-neighbors. As a result of the inspection, we were able to categorize the 892 hosts as follows:

- 564 hosts (63.2% of the sample) were reputable, good ones. The authors of the pages on these hosts refrained from using spamming techniques.
Group        1       2      3      4      5      6      7      8      9      10
Smallest m̃   -67.90  -4.21  -2.08  -1.50  -0.98  -0.68  -0.43  -0.27  -0.15  0.00
Largest m̃    -4.47   -2.11  -1.53  -1.00  -0.69  -0.44  -0.28  -0.16  -0.01  0.09
Size         44      45     43     42     43     46     45     45     46     40

Group        11      12     13     14     15     16     17     18     19     20
Smallest m̃   0.10    0.23   0.34   0.45   0.56   0.66   0.76   0.84   0.91   0.98
Largest m̃    0.22    0.33   0.43   0.55   0.65   0.75   0.83   0.90   0.97   1.00
Size         45      48     45     42     47     46     45     47     46     42

Table 2: Relative mass thresholds for sample groups.
- 229 hosts (25.7%) were spam, that is, had some content or links added with the clear intention of manipulating search engine ranking algorithms. The unexpectedly large number of spam sample hosts indicates that the prevalence of spam is considerable among hosts with high PageRank scores. Given that earlier research results (e.g., [Fetterly et al., 2004], [Gyongyi et al., 2004]) reported between 9% and 18% of spam in actual web data, it is possible that we face a growing trend in spamming.

- In case of 54 hosts (6.1%) we could not ascertain whether they were spam or not, and accordingly labeled them as unknown. This group consisted mainly of East Asian hosts, which represented a cultural and linguistic challenge to us. We excluded these hosts from subsequent steps of our experiments.

- 45 hosts (5%) were inexistent, that is, we could not access their web pages. The lack of content made it impossible to accurately determine whether these hosts were spam or not, so we excluded them from the experimental sample as well.
The first question we addressed is how good and spam hosts are distributed over the range of relative mass values. Accordingly, we sorted the sample hosts (discarding inexistent and unknown ones) by their estimated relative mass. Then, we split the list into 20 groups, seeking a compromise between approximately equal group sizes and relevant cutoff values. As shown in Table 2, the relative mass estimates of sample hosts varied between -67.90 and 1.00, and group sizes spanned the interval 40 to 48.
Figure 3 shows the composition of each of the sample groups. The size of each group is shown on the vertical axis and is also indicated on the top of each bar. Vertically stacked bars represent the prevalence of good (white) and spam (black) sample hosts.

We decided to show separately (in gray) a specific group of good hosts that have high relative mass. The relative mass estimates of all these hosts were high because of three very specific, isolated anomalies in our data, particularly in the good core Ṽ⁺:
- Five good hosts in groups 18, 19, and 20 belonged to the Chinese e-commerce site Alibaba, which encompasses a very large number of hosts, all with URLs ending in .alibaba.com. We believe that the reason why Alibaba hosts received high relative mass is that our good core Ṽ⁺ did not provide appropriate coverage of this part of the Chinese web.

- Similarly, the remaining good hosts in the last 2 groups were Brazilian blogs with URLs ending in .blogger.com.br. Again, this is an exceptional case of a large web community that appears to be relatively isolated from our Ṽ⁺.
Figure 3: Sample composition. (Horizontal axis: sample group number; vertical axis: sample group size.)
- Finally, groups 15 through 18 contained a disproportionately large number of good hosts from the Polish web (URLs ending in .pl). It turns out that this is due to the incomprehensiveness of our good core: Ṽ⁺ only contained 12 Polish educational hosts. In comparison, Ṽ⁺ contained 4,020 Czech (.cz) educational hosts, even though the Czech Republic (similar to Poland socially, politically, and in geographical location) has only one quarter of Poland's population.
It is important to emphasize that all hosts in the gray group had high estimated relative mass due only to these three issues. By making appropriate adjustments to the good core, e.g., adding more good Polish web hosts, the anomalies could be eliminated altogether, increasing the prevalence of spam in groups 15-20. In fact, we expect that for relative mass estimates between 0.98 and 1 the prevalence of spam would increase to close to 100%. Accordingly, an appropriate relative mass threshold could render Algorithm 2 an extremely powerful spam detection tool, as discussed next.
We used the sample set T′ to estimate the precision of our algorithm for various threshold values τ. For a given τ, the estimated precision prec(τ) is

prec(τ) = (number of spam sample hosts x with m̃_x ≥ τ) / (total number of sample hosts y with m̃_y ≥ τ).
Clearly, the closer the precision is to 1 the better. We computed the precision both for the case
when we accounted for the anomalous sample hosts as false positives and when we disregarded
them. Figure 4 shows the two corresponding curves for relative mass thresholds between 0.98 and
0.
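This estimate is straightforward to reproduce from the labeled sample; a small sketch (ours), where sample is a list of (estimated relative mass, is_spam) pairs for the manually inspected hosts:

def estimated_precision(sample, thresholds):
    # prec(tau) = (# spam sample hosts with m~ >= tau) / (# sample hosts with m~ >= tau)
    precision = {}
    for tau in thresholds:
        kept = [is_spam for m, is_spam in sample if m >= tau]
        precision[tau] = sum(kept) / len(kept) if kept else None
    return precision

# e.g., sample = [(0.99, True), (0.93, True), (0.87, False), ...]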
The horizontal axis contains the (non-uniformly distributed) threshold values that we derived from the sample group boundaries. The total number of hosts from T above each threshold is also indicated at the top of the figure. Note that because we used uniform random sampling, there is a close connection between the size of a sample group and the total number of hosts within the corresponding relative mass range: each range corresponds to roughly 45,000 hosts in T. For instance, there are 46,635 hosts in T within the relative mass range 0.98 to 1, which corresponds to sample group 20.

The vertical axis stands for the (interpolated) precision. Note that precision never drops below 48%, corresponding to the estimated prevalence of spam among hosts with positive relative mass.
Figure 4: Precision of the mass-based spam detection algorithm for various thresholds. (Two curves: anomalous hosts excluded and anomalous hosts included; horizontal axis: relative mass threshold; vertical axis: estimated precision.)
If we disregard the anomalous hosts, the precision of the algorithm is virtually 100% for a threshold τ = 0.98. Accordingly, we expect that almost all of the top 46,635 hosts with the highest relative mass estimates are spam. The precision at τ = 0.91 is still 94%, with more than 100,000 qualifying hosts. Hence, we argue that our spam detection method can identify with high confidence tens of thousands of hosts that have high PageRank as a result of significant boosting by link spamming. This is a remarkably reassuring result, indicating that mass estimates could become a valuable practical tool in combating link spamming.
Beyond our basic results, we can also make a number of interesting observations about the sample composition:

1. Isolated cliques. Around 10% of the sample hosts with positive mass were good ones belonging to cliques only weakly connected to our good core Ṽ⁺. These good hosts typically were either members of some online gaming community (e.g., Warcraft fans) or belonged to a web design/hosting company. In the latter event, usually it was the case that clients linked to the web design/hosting company, which linked back to them, but very few or no external links pointed to either.

2. Expired domains. Some spam hosts had large negative absolute/relative mass values because of the adopted technique of buying expired domains, already mentioned in Section 2.3. To reiterate, it is often the case that when a web domain d expires, old links from external hosts pointing to hosts in d linger on for some time. Spammers can then buy such expired domains, populate them with spam, and take advantage of the false importance conveyed by the pool of outdated links. Note that because most of the PageRank of such spam hosts is contributed by good hosts, our algorithm is not expected to detect them.

3. Members of the good core. The hosts from our good core received very large negative mass values because of the inherent bias introduced by the scaling of the random jump vector. Correspondingly, the first and second sample groups included 29 educational hosts and 5 governmental hosts from Ṽ⁺.
4.5 Absolute Mass
As mentioned earlier, our experiments with absolute mass were less successful than those with relative mass. Nevertheless, it is instructive to discuss some of our findings.

Mass distribution. As spam mass is a novel feature of web hosts, it is appropriate to check its value distribution. Figure 5 presents this distribution of estimated absolute mass values on a log-log scale. The horizontal axes show the range of mass values.
Figure 5: Distribution of estimated absolute mass values in the host-level web graph. (Horizontal axes: scaled absolute mass, negative and positive; vertical axis: fraction of hosts.)
We scaled absolute mass values by n/(1 - c), just as we did for PageRank scores. Hence, they fell into the interval from -268,099 to 132,332. We were forced to split the whole range of mass estimates into two separate plots, as a single log scale could not properly span both negative and positive values. The vertical axis shows the percentage of hosts with estimated absolute mass equal to a specific value on the horizontal axis.

We can draw two important conclusions from the figure. On one hand, positive absolute mass estimates, along with many other features of web nodes, such as indegree or PageRank, follow a power-law distribution. (For our data, the power-law exponent was -2.31.) On the other hand, the plot for negative estimated mass exhibits a combination of two superimposed curves. The right one is the natural distribution, corresponding to the majority of hosts. The left curve corresponds to the biased score distribution of hosts from Ṽ⁺, plus those hosts that receive a large fraction of their PageRank from the good-core hosts.
Absolute mass in spam detection. A manual inspection of the absolute mass values convinced us that alone they are not appropriate for spam detection purposes. It was not a surprise to find that the host with the lowest absolute mass value was www.adobe.com, as its Adobe Acrobat Reader download page is commonly pointed to by various hosts. It is more intriguing, however, that www.macromedia.com was the host with the 3rd largest spam mass! In general, many hosts with high estimated mass were not spam, but reputable and popular. Such hosts x had an extremely large PageRank score p_x, so even a relatively small difference between p_x and p′_x rendered an absolute mass that was large with respect to the ones computed for other, less significant hosts. Hence, in the list of hosts sorted by absolute mass, good and spam hosts were intermixed without any specific mass value that could be used as an appropriate separation point.
5 Related Work
In a broad sense, our work builds on the theoretical foundations provided by analyses of PageRank (e.g., [Bianchini et al., 2005] and [Langville and Meyer, 2004]). The ways in which link spamming (spam farm construction) influences PageRank are examined in [Baeza-Yates et al., 2005] and [Gyongyi and Garcia-Molina, 2005].
A number of recent publications propose link spam detection methods. For instance, Fetterly et al. [Fetterly et al., 2004] analyze the indegree and outdegree distributions of web pages. Most web pages have in- and outdegrees that follow a power-law distribution. Occasionally, however, search engines encounter substantially more pages with the exact same in- or outdegrees than what is predicted by the distribution formula. The authors find that the vast majority of such outliers are spam pages.
Similarly, Benczúr et al. [Benczúr et al., 2005] verify for each page x whether the distribution of PageRank scores of pages pointing to x conforms to a power law. They claim that a major deviation in PageRank distribution is an indicator of link spamming that benefits x.

These methods are powerful at detecting large, automatically generated link spam structures with unnatural link patterns. However, they fail to recognize more sophisticated forms of spam, when spammers mimic reputable web content.
Another group of work focuses on heavily interlinked groups of pages. Collusion is an efficient way to improve PageRank score, and it is indeed frequently used by spammers. Zhang et al. [Zhang et al., 2004] and Wu and Davison [Wu and Davison, 2005] present efficient algorithms for collusion detection. However, certain reputable pages are colluding as well, so it is expected that the number of false positives returned by the proposed algorithms is large. Therefore, collusion detection is best used for penalizing all suspicious pages during ranking, as opposed to reliably pinpointing spam.
A common characteristic of the previously mentioned body of work is that authors focus exclusively on the link patterns between pages, that is, on how pages are interconnected. In contrast, this paper looks for an answer to the question of with whom pages are interconnected. We investigate the PageRank of web nodes both when computed in the usual way and when determined exclusively by the links from a large pool of known good nodes. Nodes with a large discrepancy between the two scores turn out to be successfully boosted by (possibly sophisticated) link spamming.

In that we combat spam using a priori qualitative information about some nodes, the presented approach superficially resembles TrustRank, introduced in [Gyongyi et al., 2004]. However, there are differences between the two, which make them complementary rather than overlapping. Most importantly, TrustRank helps cleansing top ranking results by identifying reputable nodes. While spam is demoted, it is not detected; this is a gap that we strive to fill in this paper. Also, the scope of TrustRank is broader, demoting all forms of web spam, whereas spam mass estimates are effective in detecting link spamming only.
6 Conclusions
In this paper we introduced a new spam detection method that can identify web nodes with PageRank scores significantly boosted through link spamming. Our approach is built on the idea of estimating the spam mass of nodes, which is a measure of the relative PageRank contribution of connected spam pages. Spam mass estimates are easy to compute using two sets of PageRank scores: a regular one and another one with the random jump biased to some known good nodes. Hence, we argue that the spam detection arsenal of search engines could be easily augmented with our method.

We have shown the effectiveness of mass estimation-based spam detection through a set of experiments conducted on the Yahoo! web graph. With minimal effort we were able to identify several tens of thousands of link spam hosts. While the number of detected spam hosts might seem relatively small with respect to the size of the entire web, it is important to emphasize that these are the most advanced instances of spam, capable of accumulating large PageRank scores and thus making it to the top of web search result lists.

We believe that another strength of our method is that it is robust even in the event that spammers learn about it. While knowledgeable spammers could attempt to collect a large number of links from good nodes, effective tampering with the proposed spam detection method would require non-obvious manipulations of the good graph. Such manipulations are virtually impossible without knowing exactly the actual set of good nodes used as input by a given implementation of the spam detection algorithm.

In comparison to other link spam detection methods, our proposed approach excels in handling irregular link structures. It also differs from our previous work on TrustRank in that we provide an algorithm for spam detection as opposed to spam demotion.

It would be interesting to see how our spam detection method can be further improved by using additional pieces of information. For instance, we conjecture that many false positives could be eliminated by complementary (textual) content analysis. This issue remains to be addressed in future work. Also, we argue that the increasing number of (link) spam detection algorithms calls for a comparative study.
A Proof of Theorem 1
The first thing to realize is that the contribution of the unconnected nodes is zero, therefore

∑_{x ∈ V} q_y^x = ∑_{z: 𝒲_{zy} ≠ ∅} q_y^z,

and so the equality from the theorem becomes

p_y = ∑_{z: 𝒲_{zy} ≠ ∅} q_y^z.

In order to prove that this equality holds we will make use of two lemmas. The first shows that the total PageRank contribution to y is a solution in y to the linear PageRank equation, while the second shows that the solution is unique.

Lemma 1 ∑_{z: 𝒲_{zy} ≠ ∅} q_y^z is a solution to the linear PageRank equation of y.

Proof. The linear PageRank equation for node y has the form

p_y = c ∑_{x: (x,y) ∈ E} p_x/out(x) + (1 - c) v_y.   (4)

Assuming that y has k in-neighbors x_1, x_2, ..., x_k, (4) can be written as

p_y = c ∑_{i=1}^{k} p_{x_i}/out(x_i) + (1 - c) v_y.

Let us now replace p_y and p_{x_i} on both sides by the corresponding contributions:

∑_{z: 𝒲_{zy} ≠ ∅} q_y^z = c ∑_{i=1}^{k} ( ∑_{z: 𝒲_{z x_i} ≠ ∅} q_{x_i}^z ) / out(x_i) + (1 - c) v_y,

and expand the terms q_y^z and q_{x_i}^z:

∑_{z: 𝒲_{zy} ≠ ∅} ∑_{W ∈ 𝒲_{zy}} c^{|W|} π(W)(1 - c) v_z = c ∑_{i=1}^{k} ( ∑_{z: 𝒲_{z x_i} ≠ ∅} ∑_{V ∈ 𝒲_{z x_i}} c^{|V|} π(V)(1 - c) v_z ) / out(x_i) + (1 - c) v_y.   (5)

Note that x_1, ..., x_k are all the in-neighbors of y, so 𝒲_{zy} can be partitioned into

𝒲_{zy} = ( ⋃_{i=1}^{k} { V.y | V ∈ 𝒲_{z x_i} } ) ∪ { Z_y },

where Z_y is the zero-length circuit. (Notation V.y: given some walk V = t_0, t_1, ..., t_m of length m and a node y with (t_m, y) ∈ E, we can append y to V and construct a new walk V.y = t_0, t_1, ..., t_m, y of length m + 1.) Also note that for all V ∈ 𝒲_{z x_i}, |W| = |V| + 1 and π(W) = π(V)/out(x_i). Hence, we can rewrite the left side of (5) to produce the equation

∑_{i=1}^{k} ( ∑_{z: 𝒲_{z x_i} ≠ ∅} ∑_{V ∈ 𝒲_{z x_i}} c^{|V|+1} π(V)(1 - c) v_z / out(x_i) ) + (1 - c) v_y = c ∑_{i=1}^{k} ( ∑_{z: 𝒲_{z x_i} ≠ ∅} ∑_{V ∈ 𝒲_{z x_i}} c^{|V|} π(V)(1 - c) v_z ) / out(x_i) + (1 - c) v_y,

which is an identity. Hence, q_y is a solution to the PageRank equation of y.

Lemma 2 The linear PageRank equation system has a unique solution.

Proof. As self-loops are not allowed in the matrix T, the matrix U = (I - c T^T) will have 1's on the diagonal and values of magnitude smaller than 1 in all non-diagonal positions. Therefore U is diagonally dominant, and hence positive definite. Accordingly, the system U p = k v has a unique solution.
B Proof of Theorem 2
According to Theorem 1, for the core-based random jump vector v^x the PageRank score p_z of an arbitrary node z is the sum of PageRank contributions to z:

p_z = ∑_{y ∈ V} q_z^y = ∑_{y ∈ V} ∑_{W ∈ 𝒲_{yz}} c^{|W|} π(W)(1 - c) v_y^x.

In our special case we have v_y^x = 0 for all y ≠ x, so p_z is in fact

p_z = ∑_{W ∈ 𝒲_{xz}} c^{|W|} π(W)(1 - c) v_x^x = q_z^x.

It follows that we can determine the contributions q_z^x of x to all nodes z by computing the PageRank scores corresponding to the core-based random jump vector v^x.
References
[Baeza-Yates et al., 2005] Baeza-Yates, R., Castillo, C., and Lopez, V. (2005). PageRank increase under different collusion topologies. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb).
[Benczúr et al., 2005] Benczúr, A., Csalogány, K., Sarlós, T., and Uher, M. (2005). SpamRank - fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb).
[Berkhin, 2005] Berkhin, P. (2005). A survey on PageRank computing. Internet Mathematics, 2(1).
[Bianchini et al., 2005] Bianchini, M., Gori, M., and Scarselli, F. (2005). Inside PageRank. ACM
Transactions on Internet Technology, 5(1).
[Eiron et al., 2004] Eiron, N., McCurley, K., and Tomlin, J. (2004). Ranking the web frontier. In
Proceedings of the 13th International Conference on World Wide Web.
[Fetterly et al., 2004] Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and
statistics. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB).
[Gyongyi and Garcia-Molina, 2005] Gyongyi, Z. and Garcia-Molina, H. (2005). Link spam alliances. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB).
[Gyongyi et al., 2004] Gyongyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating web
spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data
Bases (VLDB).
[Henzinger et al., 2002] Henzinger, M., Motwani, R., and Silverstein, C. (2002). Challenges in web
search engines. ACM SIGIR Forum, 36(2).
[Jeh and Widom, 2003] Jeh, G. and Widom, J. (2003). Scaling personalized web search. In Proceedings of the 12th International Conference on World Wide Web.
[Langville and Meyer, 2004] Langville, A. and Meyer, C. (2004). Deeper inside PageRank. Internet
Mathematics, 1(3).
[Page et al., 1998] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank
citation ranking: Bringing order to the web. Technical report, Stanford University, California.
[Singhal, 2004] Singhal, A. (2004). Challenges in running a commercial web search engine. IBM's Second Search and Collaboration Seminar.
[Wu and Davison, 2005] Wu, B. and Davison, B. (2005). Identifying link farm spam pages. In
Proceedings of the 14th International Conference on World Wide Web.
[Zhang et al., 2004] Zhang, H., Goel, A., Govindan, R., Mason, K., and Roy, B. V. (2004). Making
eigenvector-based reputation systems robust to collusion. In Proceedings of the 3rd International
Workshop on Algorithms and Models for the Web-Graph (WAW).