Link Spam Detection Based On Mass Estimation
October 31, 2005 Technical Report
Zoltan Gyongyi
[email protected]
Computer Science Department
Stanford University, Stanford, CA 94305
Pavel Berkhin [email protected]
Yahoo! Inc.
701 First Avenue, Sunnyvale, CA 94089
Hector Garcia-Molina [email protected]
Computer Science Department
Stanford University, Stanford, CA 94305
Jan Pedersen [email protected]
Yahoo! Inc.
701 First Avenue, Sunnyvale, CA 94089
Abstract
Link spamming intends to mislead search engines and trigger an artificially high link-based
ranking of specific target web pages. This paper introduces the concept of spam mass, a
measure of the impact of link spamming on a page's ranking. We discuss how to estimate
spam mass and how the estimates can help identify pages that benefit significantly
from link spamming. In our experiments on the host-level Yahoo! web graph we use spam
mass estimates to successfully identify tens of thousands of instances of heavy-weight link
spamming.
1 Introduction
In an era of search-based web access, many attempt to mischievously influence the page rankings
produced by search engines. This phenomenon, called web spamming, represents a major problem
to search engines [Singhal, 2004, Henzinger et al., 2002] and has a negative economic and social
impact on the whole web community. Initially, spammers focused on enriching the contents of spam
pages with specific words that would match query terms. With the advent of link-based ranking
techniques, such as PageRank [Page et al., 1998], spammers started to construct spam farms,
collections of interlinked spam pages. This latter form of spamming is referred to as link spamming,
as opposed to the former, term spamming.
A first adjustment of the transition matrix T deals with dangling nodes (nodes without outlinks), commonly done as follows. Consider a vector v of positive elements with norm ‖v‖ = ‖v‖_1 = 1, specifying a probability distribution. Then,

T′ = T + d v^T,

where d is a dangling node indicator vector: d_x = 1 if out(x) = 0, and d_x = 0 otherwise.

This transformation corresponds to adding virtual links from dangling nodes to (all) other nodes
on the web, which are then followed according to the probability distribution v.
Even for T′ the random walk might not converge to a unique stationary distribution. Hence, a second adjustment introduces the random jump: at each step, the walker follows an (actual or virtual) outlink with
probability c or jumps to some random node (selected based on the probability distribution v) with
probability (1 - c). The corresponding augmented transition matrix T″ is defined as

T″ = c T′ + (1 - c) 1_n v^T,

where 1_n denotes the n-vector of all ones. PageRank can now be defined rigorously as the stationary distribution p of the random walk on T″.
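To make the construction concrete, the following Python sketch (an illustration of ours, not part of the original report; the function name pagerank and the toy graph are assumptions of this sketch) computes p by power iteration on T″ for a small graph given as an out-link dictionary, redistributing the mass of dangling nodes according to v and applying the random jump with weight 1 - c.

    # Illustrative sketch: power iteration on T''. Nodes are the keys of v;
    # dangling-node mass is redistributed according to v (the virtual links of T'),
    # and the random jump is applied with probability 1 - c.
    def pagerank(out_links, v, c=0.85, iterations=100):
        p = dict(v)                               # start from the jump distribution
        for _ in range(iterations):
            new_p = {x: 0.0 for x in v}
            dangling = 0.0
            for x in v:
                targets = out_links.get(x, [])
                if targets:
                    share = p[x] / len(targets)
                    for y in targets:
                        new_p[y] += share
                else:
                    dangling += p[x]              # no outlinks: follow the virtual links
            for x in v:
                new_p[x] = c * (new_p[x] + dangling * v[x]) + (1 - c) * v[x]
            p = new_p
        return p

    # Example: a three-node graph in which node 'c' is dangling; p sums to 1.
    out_links = {'a': ['b', 'c'], 'b': ['c'], 'c': []}
    v = {x: 1.0 / 3 for x in out_links}
    p = pagerank(out_links, v)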
In our discussion of spam detection we will assume that each node is either reputable (good) or spam. We denote the set of good nodes by V⁺ and the set of spam nodes by V⁻, with V⁺ ∪ V⁻ = V and V⁺ ∩ V⁻ = ∅. Given this
partitioning, we wish to detect web nodes x that gain most of their PageRank through spam nodes
in V⁻ that link to them. We will conclude that such nodes x are spam farm target nodes.
A very simple approach would be that, given a node x, we look only at its immediate in-neighbors.
For the moment, let us assume that it is known whether the in-neighbors of x are reputable,
good nodes or spam. (We will remove this unrealistic assumption in Section 3.4. In practice, such perfect knowledge is clearly unavailable. Also, what constitutes spam is often a matter of
subjective judgment; hence, the real web includes a voluminous gray area of nodes that some call spam while others
argue against that label. Nevertheless, our simple dichotomy will be helpful in constructing the theory of the proposed
spam detection method.) Now we wish to infer whether x is good or spam, based on the in-neighbor information.

In a first approximation, we can simply look at the number of inlinks. If the majority of x's
links comes from spam nodes, x is labeled a spam target node; otherwise it is labeled good. We call
this approach our first labeling scheme. It is easy to see that this scheme often mislabels spam. To
illustrate, consider the web graph in Figure 1. (Our convention is to show known good nodes filled
white, known spam nodes filled black, and to-be-labeled nodes hashed gray.)
Figure 1: A scenario in which the first naive labeling scheme fails, but the second succeeds. (Nodes: target x, good nodes g_0 and g_1, and spam nodes s_0, s_1, ..., s_k.)
Figure 2: Another scenario in which both naive labeling schemes fail. (Nodes: target x, good nodes g_0, ..., g_3, and spam nodes s_0, ..., s_6.)
As x has two links from good nodes g_0 and g_1 and a single link from spam node s_0, it will be labeled good. However,
the PageRank of x is

p_x = (1 + 3c + kc^2)(1 - c)/n,

out of which (c + kc^2)(1 - c)/n is due to spamming. (It is straightforward to verify that in the
absence of spam nodes s_0, ..., s_k the PageRank of x would decrease by this much.) For c = 0.85,
as long as k ≥ ⌈1/c⌉ = 2, the largest part of x's PageRank comes from spam nodes, so it would be
reasonable to label x as spam. As our first scheme fails to do so, let us come up with something
better.
A natural alternative is to look not only at the number of links, but also at what amount of
PageRank each link contributes. The contribution of a link amounts to the change in PageRank
induced by the removal of the link. For Figure 1, links from g_0 and g_1 both contribute c(1 - c)/n,
while the link from s_0 contributes (c + kc^2)(1 - c)/n. As the largest part of x's PageRank comes
from a spam node, we correctly label x as spam.
However, there are cases when even our second scheme is not quite good enough. For example,
consider the graph in Figure 2. The links from g_0 and g_2 contribute (2c + 4c^2)(1 - c)/n to the
PageRank of x, while the link from s_0 contributes (c + 4c^2)(1 - c)/n only. Hence, the second
scheme labels x as good. It is important, however, to realize that spam nodes s_5 and s_6 influence
the PageRank scores of g_0 and g_2, respectively, and so they also have an indirect influence on the
PageRank of x. Overall, the 6 spam nodes of the graph have a stronger influence on x's PageRank
than the 4 reputable ones do. Our second scheme fails to recognize this because it never looks
beyond the immediate in-neighbors of x.

Therefore, it is appropriate to devise a third scheme that labels node x considering all the
PageRank contributions of other nodes that are directly or indirectly connected to x. The next
section will show how to compute such contributions, both direct and indirect (e.g., that of s_5 to
x). Then, in Section 3.3 the contributions of spam nodes will be added to determine what we call
the spam mass of nodes.
3.2 PageRank Contribution
In this section we adapt some of the formalism and results introduced for inverse P-distances
in [Jeh and Widom, 2003].
The connection between the nodes x and y is captured by the concept of a walk. A walk W
from x to y in a directed graph is defined as a finite sequence of nodes x = x_0, x_1, ..., x_k = y,
where there is a directed edge (x_i, x_{i+1}) ∈ E between every pair of adjacent nodes x_i and x_{i+1},
i = 0, ..., k - 1. The length |W| of a walk W is the number k ≥ 1 of edges. A walk with x = y is
called a circuit.

Acyclic graphs contain a finite number of walks, while cyclic graphs have an infinite number of
walks. The (possibly infinite) set of all walks from x to y is denoted by W_xy.
We define the PageRank contribution of x to y over the walk W as

q^W_y = c^k π(W)(1 - c) v_x,

where π(W) is the weight of the walk:

π(W) = ∏_{i=0}^{k-1} 1/out(x_i).

This weight can be interpreted as the probability that a Markov chain of length k starting in x
reaches y through the sequence of nodes x_1, ..., x_{k-1}.
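As an illustration of the walk-based definition (a toy Python sketch of ours, not the authors' implementation), q^x_y can be approximated by enumerating all walks from x to y up to a maximum length and summing their per-walk contributions; walks beyond the cutoff add only geometrically small terms.

    def contribution_over_walks(out_links, v, x, y, c=0.85, max_len=12):
        # Approximates q^x_y (for x != y) by enumerating every walk from x to y of
        # length at most max_len and summing c^|W| * pi(W) * (1 - c) * v[x].
        # The zero-length virtual circuit used for self-contributions is not included.
        # Enumeration is exponential in max_len; intended for toy graphs only.
        total = 0.0

        def extend(node, length, weight):
            nonlocal total
            if length > 0 and node == y:
                total += (c ** length) * weight * (1 - c) * v[x]
            if length == max_len:
                return
            targets = out_links.get(node, [])
            for nxt in targets:                   # pi(W) shrinks by 1/out(node) per step
                extend(nxt, length + 1, weight / len(targets))

        extend(x, 0, 1.0)
        return total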
In a similar manner, we define the total PageRank contribution of x to y, x ≠ y, over all walks
from x to y (or simply: the PageRank contribution of x to y) as

q^x_y = ∑_{W ∈ W_xy} q^W_y.
For a node's contribution to itself, we also consider an additional virtual circuit Z_x that has
length zero and weight 1, so that

q^x_x = ∑_{W ∈ W_xx} q^W_x = q^{Z_x}_x + ∑_{V ∈ W_xx, |V| ≥ 1} q^V_x = (1 - c) v_x + ∑_{V ∈ W_xx, |V| ≥ 1} q^V_x.

Note that if a node x does not participate in circuits, x's contribution to itself is q^x_x = (1 - c) v_x,
which corresponds to the random jump component.
For convenience, we extend our notion of contribution even to those nodes that are unconnected:
if there is no walk from node x to node y, then the PageRank contribution q^x_y is zero.

The following theorem reveals the connection between the PageRank contributions and the
PageRank scores of nodes. (The proofs of the theorems are provided as appendices.)

Theorem 1 The PageRank score of a node y is the sum of the contributions of all other nodes to
y:

p_y = ∑_{x ∈ V} q^x_y.
It is possible to compute the PageRank contribution of a node to all nodes in a convenient way,
as stated next.

Theorem 2 Under a given random jump distribution v, the vector q^x of contributions of a node
x to all nodes is the solution of the linear PageRank system for the core-based random jump vector
v^x, defined by v^x_y = v_x if y = x, and v^x_y = 0 otherwise; that is,

q^x = PR(v^x).
Remember that the PageRank equation system is linear in the random jump vector. Hence, we
can easily determine the PageRank contribution q^U of any subset of nodes U ⊆ V by computing
PageRank using the random jump vector v^U, defined as v^U_y = v_y if y ∈ U, and v^U_y = 0 otherwise.

To verify the correctness of this last statement, note that q^x = PR(v^x) for all x ∈ U and
v^U = ∑_{x ∈ U} v^x; therefore, q^U = PR(v^U) = ∑_{x ∈ U} q^x.
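The following Python sketch (ours) mirrors this construction under the linear PageRank formulation, in which the mass of dangling nodes is simply not redistributed, so that ‖p‖ ≤ ‖v‖ (the property used later in Section 3.5); the helper names are assumptions of this sketch.

    def linear_pagerank(out_links, v, c=0.85, iterations=100):
        # Solves p = c * T^T p + (1 - c) v by simple iteration. Mass reaching
        # dangling nodes is not redistributed, so ||p|| <= ||v||.
        p = dict(v)
        for _ in range(iterations):
            new_p = {x: (1 - c) * v[x] for x in v}
            for x in v:
                targets = out_links.get(x, [])
                if targets:
                    share = c * p[x] / len(targets)
                    for y in targets:
                        new_p[y] += share
            p = new_p
        return p

    def subset_contribution(out_links, v, subset, c=0.85):
        # q^U = PR(v^U): zero the jump weights outside U and solve the same system.
        v_subset = {x: (v[x] if x in subset else 0.0) for x in v}
        return linear_pagerank(out_links, v_subset, c)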
3.3 Definition of Spam Mass
Returning to the example in Figure 2, let us check whether PageRank contributions could indeed
help in labeling x. We calculate and add the contributions of known good and spam nodes to the
PageRank of x:

q^{{g_0,...,g_3}}_x = (2c + 2c^2)(1 - c)/n

and

q^{{s_0,...,s_6}}_x = (c + 6c^2)(1 - c)/n.

Then, we can decide whether x is spam based on the comparison of q^{{s_0,...,s_6}}_x to q^{{g_0,...,g_3}}_x. For
instance, for c = 0.85, q^{{s_0,...,s_6}}_x = 1.65 q^{{g_0,...,g_3}}_x. Therefore, spam nodes have more impact on the
PageRank of x than good nodes do, and it might be wise to conclude that x is in fact spam. We
formalize our intuition as follows.
For a given partitioning V⁺, V⁻ of V and for any node x, we have

p_x = q^{V⁺}_x + q^{V⁻}_x,

that is, x's PageRank is the sum of the contributions of good nodes and that of spam nodes. (The
formula includes x's contribution to itself, as we assume that we are given information about all
nodes.)
Definition 1 The absolute spam mass of x, denoted by M_x, is the PageRank contribution that x
receives from spam nodes, that is, M_x = q^{V⁻}_x.
Hence, the spam mass is a measure of how much direct or indirect in-neighbor spam nodes
increase the PageRank of a node. Our experimental results indicate that it is also informative to
look at the spam mass of nodes in comparison to their total PageRank:

Definition 2 The relative spam mass of x, denoted by m_x, is the fraction of x's PageRank due to
contributing spam nodes, that is, m_x = q^{V⁻}_x / p_x.
3.4 Estimating Spam Mass
The assumption that we have accurate a priori knowledge of whether nodes are good (i.e., in V⁺)
or spam (i.e., in V⁻) is clearly unrealistic. However, we expect that search engines have some reliable white-list and/or
black-list, comprising a subset of the nodes, compiled manually by editors and/or generated by
algorithmic means.

Depending on which of these two sets is available (either or both), the spam mass of nodes can
be approximated by estimating good and spam PageRank contributions.
In this paper we assume that only a subset of the good nodes, Ṽ⁺, is provided. We call this set
Ṽ⁺ the good core. A suitable good core is not very hard to construct, as discussed in Section 4.2.
Note that one can expect the good core to be more stable over time than a corresponding set of
known spam nodes would be.

Given the good core, we compute two sets of PageRank scores: p = PR(v), the regular PageRank
based on the uniform random jump distribution v with v_x = 1/n, and p′ = PR(v^{Ṽ⁺}), a core-based
PageRank with a random jump distribution v^{Ṽ⁺},

v^{Ṽ⁺}_x = 1/n if x ∈ Ṽ⁺, and 0 otherwise.
Note that p′ approximates the PageRank contributions that nodes receive from known good
nodes. These PageRank scores can be used to estimate spam mass:

Definition 3 Given the PageRank scores p_x and p′_x, the estimated absolute spam mass of node x is

M̃_x = p_x - p′_x

and the estimated relative spam mass of x is

m̃_x = (p_x - p′_x)/p_x = 1 - p′_x/p_x.
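Assuming the two PageRank vectors are available as dictionaries, Definition 3 translates directly into the following sketch (the function name is ours):

    def estimate_spam_mass(p, p_core):
        # M~_x = p_x - p'_x and m~_x = 1 - p'_x / p_x (relative mass only where p_x > 0).
        abs_mass = {x: p[x] - p_core.get(x, 0.0) for x in p}
        rel_mass = {x: 1.0 - p_core.get(x, 0.0) / p[x] for x in p if p[x] > 0}
        return abs_mass, rel_mass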
As a simple example of how spam mass estimation works, consider again the graph in Figure 2
and assume that the good core is Ṽ⁺ = {g_0, g_1, g_3}. For c = 0.85 and n = 12, the PageRank scores,
actual absolute masses, estimated absolute masses, and their relative counterparts are shown for each of
the nodes in Table 1. Note that here, as well as in the rest of the paper, numeric PageRank scores
and absolute mass values are scaled by n/(1 - c) for increased readability. Accordingly, the scaled
PageRank score of a node without inlinks is 1.

For instance, the scaled PageRank score of g_0 is 2.7. Of that, M_{g_0} = 0.85 is contributed by
spam pages, in particular by s_5. Hence, g_0's relative mass is m_{g_0} = 0.85/2.7 = 0.31.

The difference between actual and estimated mass can be observed in the case of nodes x and g_2.
Although g_2 is a good node, it is not a member of Ṽ⁺; hence, both its absolute mass and its relative
mass are overestimated. The mass estimates for node x are also larger than the actual values.
              PageRank   core-based    absolute   estimated    relative   estimated
              p          PageRank p′   mass M     abs. mass M̃  mass m     rel. mass m̃
x             9.33       2.295         6.185      7.035        0.66       0.75
g_0           2.7        1.85          0.85       0.85         0.31       0.31
g_1           1          1             0          0            0          0
g_2           2.7        0.85          0.85       1.85         0.31       0.69
g_3           1          1             0          0            0          0
s_0           4.4        0             4.4        4.4          1          1
s_1, ..., s_6 1          0             1          1            1          1

Table 1: Various features of nodes in Figure 2.
Note that the absolute and relative mass estimates of most good nodes are small compared to
the estimated mass of spam nodes. While the example in Figure 2 is overly simple, the relative
separation of good and spam indicates that mass estimates could be used for spam detection
purposes.
In the alternate situation that only a subset Ṽ⁻ of the spam nodes is provided, the absolute spam
mass can be estimated directly as M̂ = PR(v^{Ṽ⁻}). When both Ṽ⁻ and Ṽ⁺ are known, the spam mass estimates could be
derived, for instance, by simply computing the average (M̂ + M̃)/2. It is also possible to invent
more sophisticated combination schemes, e.g., a weighted average where the weights depend on the
relative sizes of Ṽ⁻ and Ṽ⁺ with respect to the estimated sizes of V⁻ and V⁺.
3.5 Size of the Good Core
In a final step before devising a practical spam detection algorithm based on mass estimation, we
need to consider a technical problem that arises for real web data.

One can expect for the web that our good core Ṽ⁺ will be significantly smaller than the actual
set of good nodes V⁺. That is, |Ṽ⁺| ≪ |V⁺| and thus ‖v^{Ṽ⁺}‖ ≪ ‖v‖. Note that by the definition
of p = PR(v) from (3), ‖p‖ ≤ ‖v‖. Similarly, ‖p′‖ ≤ ‖v^{Ṽ⁺}‖. It follows that ‖p′‖ ≪ ‖p^{V⁺}‖,
where p^{V⁺} = PR(v^{V⁺}) would be the PageRank contribution of all good nodes; that is, p′ would
grossly underestimate the total good contribution and inflate the spam mass estimates accordingly.
As a simple compensation, we scale the core-based random jump vector from v^{Ṽ⁺} to w, where

w_x = ‖v^{V⁺}‖ / |Ṽ⁺| if x ∈ Ṽ⁺, and 0 otherwise.

Note that ‖w‖ = ‖v^{V⁺}‖, so the two random jump vectors are of the same order of magnitude.
Then, we can compute p′ = PR(w), for which ‖p′‖ ≈ ‖p^{V⁺}‖, so we get a reasonable
estimate of the total good contribution.

Using w in computing the core-based PageRank leads to an interesting situation. As Ṽ⁺ is
small, the good nodes in it will receive an unusually high random jump (‖v^{V⁺}‖/|Ṽ⁺| as opposed to 1/n).
Therefore, the good PageRank contribution of these known reputable nodes will be overestimated,
to the extent that occasionally, for some node y, p′_y will be larger than p_y. Hence, when computing
M̃, there will be nodes with negative spam mass. In general, a negative mass indicates that a node
is known to be good in advance (it is a member of Ṽ⁺) or that its PageRank is heavily influenced by the
contribution of nodes in the good core.
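A minimal sketch of the construction of w (ours; the parameter good_fraction stands for an assumed estimate of |V⁺|/n, e.g., the 85% implied by the estimate used in Section 4.3):

    def scaled_core_jump(nodes, good_core, good_fraction=0.85):
        # Each good-core node gets an equal share of the assumed total good jump
        # weight |V+|/n (approximated by good_fraction), so ||w|| ~ ||v^{V+}||;
        # all other nodes receive zero jump weight.
        share = good_fraction / len(good_core)
        return {x: (share if x in good_core else 0.0) for x in nodes}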
3.6 Spam Detection Algorithm
Section 3.3 introduced the concept of spam mass, Section 3.4 provided an efficient way of estimating
it, and Section 3.5 eliminated some technical obstacles in our way. In this section we put all the pieces
together and present our link spam detection algorithm based on mass estimation.

While absolute and relative mass are very similar in nature, our experiments (discussed in Section 4.5) indicate that relative
mass estimates are more useful in spam detection than their absolute counterparts. Therefore,
we build our algorithm around estimating the relative mass of nodes. Details are presented as
Algorithm 2.
The first input of the algorithm is the good core Ṽ⁺. The second input is a relative mass threshold τ to which
relative mass estimates are compared: if the estimated relative mass of a node is equal to or above
this threshold, then the node is labeled as a spam candidate.

The third input is a PageRank threshold ρ: we only verify the relative mass estimates of nodes
with PageRank scores larger than or equal to ρ. Nodes with PageRank less than ρ are never labeled
as spam candidates.
input : good core Ṽ⁺, relative mass threshold τ, PageRank threshold ρ
output: set of spam candidates S

S ← ∅
compute PageRank scores p
construct w based on Ṽ⁺ and compute p′ = PR(w)
m̃ ← (p - p′)/p
for each node x so that p_x ≥ ρ do
    if m̃_x ≥ τ then
        S ← S ∪ {x}
    end
end

Algorithm 2: Mass-based spam detection.
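For concreteness, the following Python sketch (ours, not the authors' code) mirrors Algorithm 2; it assumes that the regular PageRank vector p and the core-based vector p_core = PR(w) have already been computed, for instance with the sketches given in earlier sections.

    def mass_based_spam_detection(p, p_core, rho, tau):
        # Returns the spam candidate set S: nodes with PageRank at least rho whose
        # estimated relative mass m~_x = 1 - p'_x / p_x is at least tau.
        candidates = set()
        for x, score in p.items():
            if score >= rho and 1.0 - p_core.get(x, 0.0) / score >= tau:
                candidates.add(x)
        return candidates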
There are at least three reasons to apply a threshold on PageRank. First, remember that we
are interested in detecting nodes that profit from significant link spamming. Obviously, a node
with a small PageRank is not a beneficiary of considerable boosting, so it is of no interest to us.

Second, focusing on nodes x with large PageRank also means that we have more evidence: a
larger number of nodes contributing to the PageRank of x. Therefore, no single node's contribution
is critical alone; the decision whether a node is spam or not is based upon data collected from
multiple sources.

Finally, for nodes x with low PageRank scores, even the slightest error in approximating M_x
by M̃_x could yield huge differences in the corresponding relative mass estimates. The PageRank
threshold helps us to avoid the complications caused by this phenomenon.
As an example of how the algorithm operates, consider once more the graph in Figure 2 with
node features in Table 1. Let us assume that Ṽ⁺ = {g_0, g_1, g_3}, ρ = 1.5 (once again, we use
scaled PageRank scores), τ = 0.5, and w = v^{Ṽ⁺}. Then, the algorithm disregards nodes g_1, g_3 and
s_1, ..., s_6 because of their low PageRank of 1 < ρ = 1.5. Again, such nodes cannot possibly benefit
from significant boosting by link spamming.

Node x has PageRank p_x = 9.33 ≥ ρ = 1.5 and a large estimated relative mass m̃_x = 0.75 ≥ τ =
0.5, hence it is added to the spam candidate set S. Similarly, node s_0 is labeled spam as well. A
third node, g_2, is a false positive: it has a PageRank of p_{g_2} = 2.7 and an estimated relative mass of
m̃_{g_2} = 0.69, so it is labeled spam. This error is due to the fact that our good core Ṽ⁺ is incomplete.
Finally, the other good node g_0 is correctly excluded from S, because m̃_{g_0} = 0.31 < τ.
4 Experimental Results
4.1 Data Set
To evaluate the proposed spam detection method we performed a number of experiments on actual
web data. The data set that we used was based on the web index of the Yahoo! search engine as
of 2004.
From the complete index of several billion web pages we extracted a list consisting of approximately 73.3 million individual web hosts.
The web graph corresponding to hosts contained slightly more than 979 million edges. These
edges were obtained by collapsing all hyperlinks between any pair of pages on two different hosts
into a single directed edge.
Out of the 73.3 million hosts, 25.6 million (35%) had no inlinks and 48.6 million (66.4%) had
no outlinks. Reasonable explanations for the large number of hosts without outlinks are (1) the
presence of URLs that never got visited by the Yahoo! spider due to the crawling policy and (2)
the presence of URLs that could not be crawled because they were misspelled or the corresponding
host was extinct. Some 18.9 million hosts (25.8%) were completely isolated, that is, had neither
inlinks nor outlinks.
4.2 Good Core
The construction of a good core Ṽ⁺ represented a first step in producing spam mass estimates
for the hosts. As we were aiming for a large good core, we felt that the manual selection of its
members was infeasible. Therefore, we devised a way of assembling a substantially large good core
with minimal human intervention:
1. We included in Ṽ⁺ all hosts that appear in a small web directory which we consider to be
virtually void of spam. (We prefer not to disclose which directory this is in order to protect it
from infiltration attempts by spammers who might read this paper.) After cleaning the URLs
(removing incorrect and broken ones), this group consisted of 16,776 hosts.
2. We included in Ṽ⁺ all US governmental (.gov) hosts (55,320 hosts after URL cleaning).
Though it would have been nice to include other countries' governmental hosts, as well as
various international organizations, the corresponding lists were not immediately available to
us, and we could not devise a straightforward scheme for their automatic generation.
3. Using web databases (e.g., univ.cc) of educational institutions worldwide, we distilled a
list of 3,976 schools from more than 150 countries. Based on the list, we identified 434,045
individual hosts that belong to these institutions, and included all these hosts in our good
core Ṽ⁺.

(Web host names represent the part of the URL between the http:// prefix and the first / character. Host names
map to IP addresses through DNS. We did not perform alias detection, so for instance www-cs.stanford.edu and
cs.stanford.edu counted as two separate hosts, even though the URLs map to the same web server/IP address.)
With all three sources included, the good core consisted of 504,150 unique hosts.
4.3 Experimental Procedure
First, we computed the regular PageRank vector p for the host graph introduced in Section 4.1.
We used an implementation of Algorithm 1 (Section 2.2).
In accordance with earlier research reports, the produced PageRank scores follow a power-law
distribution. Accordingly, most hosts have very small PageRank: slightly more than 66.7 million out of the
73.3 million (91.1%) have a scaled PageRank less than 2, that is, less than double the minimal
PageRank score. At the other end of the spectrum, only about 64,000 hosts have PageRank scores
that are at least 100 times larger than the minimal one. This means that the set of hosts that we focus
on, that is, the set of spam targets with large PageRank, is by definition small compared to the
size of the web.
Second, we computed the core-based PageRank vector p′. Using the random jump vector v^{Ṽ⁺}
directly would have yielded ‖p′‖ ≪ ‖p‖. To circumvent this problem, we decided to adopt the alternative of scaling the random jump
vector to w, as discussed in Section 3.5. In order to construct w, we relied on the conservative
estimate that at least 15% of the hosts are spam. (In [Gyongyi et al., 2004] we found that more than 18% of web sites are spam.)

4.4 Relative Mass
Group 1 2 3 4 5 6 7 8 9 10
Smallest m -67.90 -4.21 -2.08 -1.50 -0.98 -0.68 -0.43 -0.27 -0.15 0.00
Largest m -4.47 -2.11 -1.53 -1.00 -0.69 -0.44 -0.28 -0.16 -0.01 0.09
Size 44 45 43 42 43 46 45 45 46 40
Group 11 12 13 14 15 16 17 18 19 20
Smallest m 0.10 0.23 0.34 0.45 0.56 0.66 0.76 0.84 0.91 0.98
Largest m 0.22 0.33 0.43 0.55 0.65 0.75 0.83 0.90 0.97 1.00
Size 45 48 45 42 47 46 45 47 46 42
Table 2: Relative mass thresholds for sample groups.
229 hosts (25.7%) were spam, that is, had some content or links added with the clear intention
of manipulating search engine ranking algorithms. The unexpectedly large number
of spam sample hosts indicates that the prevalence of spam is considerable among hosts
with high PageRank scores. Given that earlier research results (e.g., [Fetterly et al., 2004],
[Gyongyi et al., 2004]) reported between 9% and 18% of spam in actual web data, it is possible
that we face a growing trend in spamming.

In the case of 54 hosts (6.1%) we could not ascertain whether they were spam or not, and
accordingly labeled them as unknown. This group consisted mainly of East Asian hosts, which
represented a cultural and linguistic challenge to us. We excluded these hosts from subsequent
steps of our experiments.

45 hosts (5%) were nonexistent, that is, we could not access their web pages. The lack of
content made it impossible to accurately determine whether these hosts were spam or not, so
we excluded them from the experimental sample as well.
The first question we addressed is how good and spam hosts are distributed over the range of
relative mass values. Accordingly, we sorted the sample hosts (discarding nonexistent and unknown
ones) by their estimated relative mass. Then, we split the list into 20 groups, seeking a compromise
between approximately equal group sizes and relevant cutoff values. As shown in Table 2, the
relative mass estimates of sample hosts varied between -67.90 and 1.00, and group sizes spanned
the interval 40 to 48.
Figure 3 shows the composition of each of the sample groups. The size of each group is shown
on the vertical axis and is also indicated on the top of each bar. Vertically stacked bars represent
the prevalence of good (white) and spam (black) sample hosts.
We decided to show separately (in gray) a specific group of good hosts that have high relative
mass. The relative mass estimates of all these hosts were high because of three very specific, isolated
anomalies in our data, particularly in the good core Ṽ⁺:

Five good hosts in groups 18, 19, and 20 belonged to the Chinese e-commerce site Alibaba,
which encompasses a very large number of hosts, all with URLs ending in .alibaba.com. We
believe that the reason why Alibaba hosts received high relative mass is that our good core
Ṽ⁺ did not provide appropriate coverage of this part of the Chinese web.

Similarly, the remaining good hosts in the last 2 groups were Brazilian blogs with URLs
ending in .blogger.com.br. Again, this is an exceptional case of a large web community
that appears to be relatively isolated from our Ṽ⁺.
Figure 3: Sample composition. (Stacked bars show, for each of the 20 sample groups, the sample group size broken down into good, anomalous good, and spam hosts.)
Finally, groups 15 through 18 contained a disproportionately large number of good hosts from
the Polish web (URLs ending in .pl). It turns out that this is due to the limited coverage
of our good core: Ṽ⁺ contained only 12 Polish educational hosts. In comparison, Ṽ⁺ contained
4020 Czech (.cz) educational hosts, even though the Czech Republic (similar to Poland
socially, politically, and in geographical location) has only one quarter of Poland's population.
It is important to emphasize that all hosts in the gray group had high estimated relative
mass due only to these three issues. By making appropriate adjustments to the good core, e.g.,
adding more good Polish web hosts, the anomalies could be eliminated altogether, increasing the
prevalence of spam in groups 15-20. In fact, we expect that for relative mass estimates between 0.98
and 1 the prevalence of spam would increase to close to 100%. Accordingly, an appropriate relative
mass threshold could render Algorithm 2 an extremely powerful spam detection tool, as discussed
next.
We used the sample set to estimate the precision of our algorithm for various threshold
values t. For a given t, the estimated precision prec(t) is

prec(t) = (number of spam sample hosts x with m̃_x ≥ t) / (total number of sample hosts y with m̃_y ≥ t).
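The precision estimate translates directly into code (a trivial sketch of ours; the sample is assumed to be given as a list of (estimated relative mass, is-spam) pairs):

    def estimated_precision(sample, t):
        # prec(t): fraction of spam among sample hosts with estimated relative mass >= t.
        qualifying = [is_spam for rel_mass, is_spam in sample if rel_mass >= t]
        return sum(qualifying) / len(qualifying) if qualifying else float('nan')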
Clearly, the closer the precision is to 1, the better. We computed the precision both for the case
when we counted the anomalous sample hosts as false positives and for the case when we disregarded
them. Figure 4 shows the two corresponding curves for relative mass thresholds between 0.98 and
0.

The horizontal axis contains the (non-uniformly distributed) threshold values that we derived
from the sample group boundaries. The total number of hosts from T above each threshold is also
indicated at the top of the figure. Note that because we used uniform random sampling, there is
a close connection between the size of a sample group and the total number of hosts within the
corresponding relative mass range: each range corresponds to roughly 45,000 hosts in T. For
instance, there are 46,635 hosts in T within the relative mass range 0.98 to 1, which corresponds
to sample group 20.

The vertical axis stands for the (interpolated) precision. Note that precision never drops below
48%, corresponding to the estimated prevalence of spam among hosts with positive relative mass.

If we disregard the anomalous hosts, the precision of the algorithm is virtually 100% for a
threshold t = 0.98. Accordingly, we expect that almost all of the top 46,635 hosts with the highest relative
mass estimates are spam. The precision at t = 0.91 is still 94%, with more than 100,000 qualifying
hosts.
Figure 4: Precision of the mass-based spam detection algorithm for various thresholds. (Estimated precision is plotted against the relative mass threshold, from 0.98 down to 0; the total number of hosts above each threshold, ranging from 46,635 to 489,133, is indicated along the top. One curve excludes and one includes the anomalous hosts.)
Hence, we argue that our spam detection method can identify with high confidence tens of
thousands of hosts that have high PageRank as a result of significant boosting by link spamming.
This is a remarkably reassuring result, indicating that mass estimates could become a valuable
practical tool in combating link spamming.
Beyond our basic results, we can also make a number of interesting observations about the
sample composition:
1. Isolated cliques. Around 10% of the sample hosts with positive mass were good ones
belonging to cliques only weakly connected to our good core Ṽ⁺. These good hosts typically
were either members of some online gaming community (e.g., Warcraft fans) or belonged to
a web design/hosting company. In the latter event, it was usually the case that clients linked
to the web design/hosting company, which linked back to them, but very few or no external
links pointed to either.

2. Expired domains. Some spam hosts had large negative absolute/relative mass values because
they adopted the technique of buying expired domains, already mentioned in Section 2.3.
To reiterate, it is often the case that when a web domain d expires, old links from external
hosts pointing to hosts in d linger on for some time. Spammers can then buy such expired
domains, populate them with spam, and take advantage of the false importance conveyed by
the pool of outdated links. Note that because most of the PageRank of such spam hosts is
contributed by good hosts, our algorithm is not expected to detect them.

3. Members of the good core. The hosts from our good core received very large negative
mass values because of the inherent bias introduced by the scaling of the random jump
vector. Correspondingly, the first and second sample groups included 29 educational hosts
and 5 governmental hosts from Ṽ⁺.
4.5 Absolute Mass
As mentioned earlier, our experiments with absolute mass were less successful than those with
relative mass. Nevertheless, it is instructive to discuss some of our findings.
Mass distribution. As spam mass is a novel feature of web hosts, it is appropriate to check
its value distribution. Figure 5 presents the distribution of estimated absolute mass values on a
log-log scale.
Figure 5: Distribution of estimated absolute mass values in the host-level web graph. (Fraction of hosts versus scaled absolute mass, plotted separately for negative and positive values.)
The horizontal axes show the range of mass values. We scaled absolute mass values by n/(1 - c), just as we did for PageRank scores. Hence, they fell into the interval from -268,099
to 132,332. We were forced to split the whole range of mass estimates into two separate plots, as a
single log scale could not properly span both negative and positive values. The vertical axis shows
the percentage of hosts with estimated absolute mass equal to a specific value on the horizontal
axis.
We can draw two important conclusions from the figure. On one hand, positive absolute mass
estimates (along with many other features of web nodes, such as indegree or PageRank) follow a
power-law distribution. (For our data, the power-law exponent was -2.31.) On the other hand, the
plot for negative estimated mass exhibits a combination of two superimposed curves. The right one
is the natural distribution, corresponding to the majority of hosts. The left curve corresponds
to the biased score distribution of hosts from Ṽ⁺, plus that of those hosts that receive a large fraction of
their PageRank from the good-core hosts.
Absolute mass in spam detection. A manual inspection of the absolute mass values convinced
us that they alone are not appropriate for spam detection purposes. It was not a surprise to
find that the host with the lowest absolute mass value was www.adobe.com, as its Adobe Acrobat
Reader download page is commonly pointed to by various hosts. It is more intriguing, however,
that www.macromedia.com was the host with the 3rd largest spam mass! In general, many hosts
with high estimated mass were not spam, but reputable and popular. Such hosts x had an extremely
large PageRank score p_x, so even a relatively small difference between p_x and p′_x rendered
an absolute mass that was large with respect to the ones computed for other, less significant hosts.
Hence, in the list of hosts sorted by absolute mass, good and spam hosts were intermixed without
any specific mass value that could be used as an appropriate separation point.
5 Related Work
In a broad sense, our work builds on the theoretical foundations provided by analyses of PageRank
(e.g., [Bianchini et al., 2005] and [Langville and Meyer, 2004]). The ways in which link spamming
(spam farm construction) influences PageRank are examined in [Baeza-Yates et al., 2005] and
[Gyongyi and Garcia-Molina, 2005].
A number of recent publications propose link spam detection methods. For instance, Fetterly
et al. [Fetterly et al., 2004] analyze the indegree and outdegree distributions of web pages. Most
web pages have in- and outdegrees that follow a power-law distribution. Occasionally, however,
search engines encounter substantially more pages with the exact same in- or outdegrees than what
is predicted by the distribution formula. The authors find that the vast majority of such outliers
are spam pages.
Similarly, Benczúr et al. [Benczúr et al., 2005] verify for each page x whether the distribution of
PageRank scores of pages pointing to x conforms to a power law. They claim that a major deviation
in PageRank distribution is an indicator of link spamming that benefits x.

These methods are powerful at detecting large, automatically generated link spam structures
with unnatural link patterns. However, they fail to recognize more sophisticated forms of spam,
when spammers mimic reputable web content.
Another group of work focuses on heavily interlinked groups of pages. Collusion is an efficient
way to improve PageRank scores, and it is indeed frequently used by spammers. Zhang et al.
[Zhang et al., 2004] and Wu and Davison [Wu and Davison, 2005] present efficient algorithms for
collusion detection. However, certain reputable pages collude as well, so it is expected that
the number of false positives returned by the proposed algorithms is large. Therefore, collusion
detection is best used for penalizing all suspicious pages during ranking, as opposed to reliably
pinpointing spam.
A common characteristic of the previously mentioned body of work is that the authors focus
exclusively on the link patterns between pages, that is, on how pages are interconnected. In contrast, this
paper looks for an answer to the question: with whom are pages interconnected? We investigate
the PageRank of web nodes both when computed in the usual way and when determined exclusively
by the links from a large pool of known good nodes. Nodes with a large discrepancy between the
two scores turn out to be successfully boosted by (possibly sophisticated) link spamming.
In that we combat spam using a priori qualitative information about some nodes, the presented
approach superficially resembles TrustRank, introduced in [Gyongyi et al., 2004]. However, there
are differences between the two, which make them complementary rather than overlapping. Most
importantly, TrustRank helps cleanse top ranking results by identifying reputable nodes. While
spam is demoted, it is not detected; this is a gap that we strive to fill in this paper. Also, the
scope of TrustRank is broader, demoting all forms of web spam, whereas spam mass estimates are
effective in detecting link spamming only.
6 Conclusions
In this paper we introduced a new spam detection method that can identify web nodes with PageRank
scores significantly boosted through link spamming. Our approach is built on the idea of
estimating the spam mass of nodes, a measure of the relative PageRank contribution of
connected spam pages. Spam mass estimates are easy to compute using two sets of PageRank
scores: a regular one and another one with the random jump biased toward some known good nodes.
Hence, we argue that the spam detection arsenal of search engines could easily be augmented with
our method.
We have shown the effectiveness of mass estimation-based spam detection through a set of
experiments conducted on the Yahoo! web graph. With minimal effort we were able to identify
several tens of thousands of link spam hosts. While the number of detected spam hosts might seem
relatively small with respect to the size of the entire web, it is important to emphasize that these
are the most advanced instances of spam, capable of accumulating large PageRank scores and thus
making it to the top of web search result lists.
We believe that another strength of our method is that it is robust even in the event that
spammers learn about it. While knowledgeable spammers could attempt to collect a large number
of links from good nodes, effective tampering with the proposed spam detection method would
require non-obvious manipulations of the good graph. Such manipulations are virtually impossible
without knowing exactly the actual set of good nodes used as input by a given implementation of
the spam detection algorithm.
In comparison to other link spam detection methods, our proposed approach excels in handling
irregular link structures. It also differs from our previous work on TrustRank in that we provide
an algorithm for spam detection as opposed to spam demotion.

It would be interesting to see how our spam detection method could be further improved by using
additional pieces of information. For instance, we conjecture that many false positives could be
eliminated by complementary (textual) content analysis. This issue remains to be addressed in
future work. Also, we argue that the increasing number of (link) spam detection algorithms calls
for a comparative study.
A Proof of Theorem 1
The first thing to realize is that the contribution of the unconnected nodes is zero, therefore

∑_{x ∈ V} q^x_y = ∑_{z: W_zy ≠ ∅} q^z_y,

and so the equality from the theorem becomes

p_y = ∑_{z: W_zy ≠ ∅} q^z_y.
In order to prove that this equality holds we will make use of two lemmas. The first shows that
the total PageRank contribution to y is a solution in y to the linear PageRank equation, while the
second shows that the solution is unique.

Lemma 1 ∑_{z: W_zy ≠ ∅} q^z_y is a solution to the linear PageRank equation of y.
Proof. The linear PageRank equation for node y has the form

p_y = c ∑_{x: (x,y) ∈ E} p_x / out(x) + (1 - c) v_y.   (4)

Assuming that y has k in-neighbors x_1, x_2, ..., x_k, (4) can be written as

p_y = c ∑_{i=1}^{k} p_{x_i} / out(x_i) + (1 - c) v_y.

Let us now replace p_y and p_{x_i} on both sides by the corresponding contributions,

∑_{z: W_zy ≠ ∅} q^z_y = c ∑_{i=1}^{k} ( ∑_{z: W_{zx_i} ≠ ∅} q^z_{x_i} ) / out(x_i) + (1 - c) v_y,
and expand the terms q^z_y and q^z_{x_i}:

∑_{z: W_zy ≠ ∅} ( ∑_{W ∈ W_zy} c^{|W|} π(W) (1 - c) v_z ) =
    c ∑_{i=1}^{k} ( ∑_{z: W_{zx_i} ≠ ∅} ∑_{V ∈ W_{zx_i}} c^{|V|} π(V) (1 - c) v_z ) / out(x_i) + (1 - c) v_y.   (5)
Note that x_1, ..., x_k are all the in-neighbors of y, so W_zy can be partitioned into

W_zy = ( ⋃_{i=1}^{k} { V.y | V ∈ W_{zx_i} } ) ∪ { Z_y },

where Z_y is the zero-length circuit. (Notation V.y: given some walk V = t_0, t_1, ..., t_m of length m
and a node y with (t_m, y) ∈ E, we can append y to V and construct a new walk V.y = t_0, t_1, ..., t_m, y
of length m + 1.) Also note that for all V ∈ W_{zx_i}, the walk W = V.y satisfies |W| = |V| + 1 and
π(W) = π(V)/out(x_i). Hence, we can rewrite the left side of (5) to produce the equation
_
_
z:Wzx
i
=
_
_
V Wzx
i
c
|V |+1
(W)(1 c)v
z
/out(x
i
)
_
_
_
_
+ (1 c)v
y
=
c
k
i=1
_
_
z:Wzx
i
=
_
_
V Wzx
i
c
|V |
(V )(1 c)v
z
_
_
_
_
/out(x
i
) + (1 c)v
y
,
which is an identity. Hence, q
y
is a solution to the PageRank equation of y.
Lemma 2 The linear PageRank equation system has a unique solution.

Proof. As self-loops are not allowed in the matrix T, the matrix U = (I - cT^T) has 1s on
the diagonal, while its non-diagonal elements are non-positive and sum to at most c < 1 in absolute
value in each column. Therefore U is strictly diagonally dominant and hence non-singular. Accordingly,
the system Up = (1 - c)v has a unique solution.
B Proof of Theorem 2
According to Theorem 1, for the core-based random jump vector v^x the PageRank score p_z of an
arbitrary node z is the sum of the PageRank contributions to z:

p_z = ∑_{y ∈ V} q^y_z = ∑_{y ∈ V} ∑_{W ∈ W_yz} c^{|W|} π(W) (1 - c) v^x_y.

In our special case we have v^x_y = 0 for all y ≠ x, so p_z is in fact

p_z = ∑_{W ∈ W_xz} c^{|W|} π(W) (1 - c) v^x_x = q^x_z.

It follows that we can determine the contributions q^x_z of x to all nodes z by computing the PageRank
scores corresponding to the core-based random jump vector v^x.
References
[Baeza-Yates et al., 2005] Baeza-Yates, R., Castillo, C., and Lopez, V. (2005). PageRank increase
under different collusion topologies. In Proceedings of the First International Workshop on
Adversarial Information Retrieval on the Web (AIRWeb).
[Benczúr et al., 2005] Benczúr, A., Csalogany, K., Sarlos, T., and Uher, M. (2005). SpamRank -
fully automatic link spam detection. In Proceedings of the First International Workshop on
Adversarial Information Retrieval on the Web (AIRWeb).
[Berkhin, 2005] Berkhin, P. (2005). A survey on PageRank computing. Internet Mathematics, 2(1).
[Bianchini et al., 2005] Bianchini, M., Gori, M., and Scarselli, F. (2005). Inside PageRank. ACM
Transactions on Internet Technology, 5(1).
[Eiron et al., 2004] Eiron, N., McCurley, K., and Tomlin, J. (2004). Ranking the web frontier. In
Proceedings of the 13th International Conference on World Wide Web.
[Fetterly et al., 2004] Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and
statistics. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB).
[Gyongyi and Garcia-Molina, 2005] Gyongyi, Z. and Garcia-Molina, H. (2005). Link spam alliances.
In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB).
[Gyongyi et al., 2004] Gyongyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating web
spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data
Bases (VLDB).
[Henzinger et al., 2002] Henzinger, M., Motwani, R., and Silverstein, C. (2002). Challenges in web
search engines. ACM SIGIR Forum, 36(2).
[Jeh and Widom, 2003] Jeh, G. and Widom, J. (2003). Scaling personalized web search. In
Proceedings of the 12th International Conference on World Wide Web.
[Langville and Meyer, 2004] Langville, A. and Meyer, C. (2004). Deeper inside PageRank. Internet
Mathematics, 1(3).
[Page et al., 1998] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank
citation ranking: Bringing order to the web. Technical report, Stanford University, California.
[Singhal, 2004] Singhal, A. (2004). Challenges in running a commercial web search engine. IBM's
Second Search and Collaboration Seminar.
[Wu and Davison, 2005] Wu, B. and Davison, B. (2005). Identifying link farm spam pages. In
Proceedings of the 14th International Conference on World Wide Web.
[Zhang et al., 2004] Zhang, H., Goel, A., Govindan, R., Mason, K., and Roy, B. V. (2004). Making
eigenvector-based reputation systems robust to collusion. In Proceedings of the 3rd International
Workshop on Algorithms and Models for the Web-Graph (WAW).