0% found this document useful (0 votes)
11 views44 pages

2010 Lu LinkPredictionComplexNetworks

This document discusses link prediction in complex networks. It provides an overview of link prediction problems, evaluation metrics, and different algorithms for similarity-based, maximum likelihood, and probabilistic modeling approaches. It also covers applications in network reconstruction, evolution evaluation, and classification.

Uploaded by

Rafael Escalfoni
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views44 pages

2010 Lu LinkPredictionComplexNetworks

This document discusses link prediction in complex networks. It provides an overview of link prediction problems, evaluation metrics, and different algorithms for similarity-based, maximum likelihood, and probabilistic modeling approaches. It also covers applications in network reconstruction, evolution evaluation, and classification.

Uploaded by

Rafael Escalfoni
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 44

Link Prediction in Complex Networks: A

Survey

Linyuan Lü a,b , Tao Zhou a,c


arXiv:1010.0725v1 [physics.soc-ph] 4 Oct 2010

a Web Sciences Center, University of Electronic Science and Technology of China,


Chengdu 610054, People’s Republic of China
b Department of Physics, University of Fribourg, Chemin du Musée 3, CH-1700
Fribourg, Switzerland
c Department of Modern Physics, University of Science and Technology of China,
Hefei 230026, People’s Republic of China

Abstract

Link prediction in complex networks has attracted increasing attention from both
physical and computer science communities. The algorithms can be used to ex-
tract missing information, identify spurious interactions, evaluate network evolving
mechanisms, and so on. This article summaries recent progress about link predic-
tion algorithms, emphasizing on the contributions from physical perspectives and
approaches, such as the random-walk-based methods and the maximum likelihood
methods. We also introduce three typical applications: reconstruction of networks,
evaluation of network evolving mechanism and classification of partially labelled
networks. Finally, we introduce some applications and outline future challenges of
link prediction algorithms.

Key words: Link Prediction, Complex Networks, Node Similarity, Maximum


Likelihood Methods, Probabilistic Models
PACS: 89.20.Ff, 89.75.Hc, 89.65.-s

1 Introduction

Many social, biological, and information systems can be well described by


networks, where nodes represent individuals, and links denote the relations

Email addresses: [email protected] (Linyuan Lü), [email protected] (Tao


Zhou).

Preprint submitted to Elsevier Science 6 October 2010


or interactions between nodes. The study of complex networks has therefore
become a common focus of many branches of science. Great efforts have been
made to understand the evolution of networks [1,2], the relations between
topologies and functions [3,4], and the network characteristics [5]. An impor-
tant scientific issue relevant to network analysis is the so-called information
retrieval [6,7], which aims at finding material of an unstructured nature that
satisfies an information need from large collections [8]. It can also be viewed
as prediction of relations between words and documents and is now further
extended to stand for a number of problems on link mining, wherein link
prediction is the most fundamental problem that attempts to estimate the
likelihood of the existence of a link between two nodes, based on observed
links and the attributes of nodes [9].

In many biological networks, such as food webs, protein-protein interaction


networks and metabolic networks, whether a link between two nodes exists
must be demonstrated by field and/or laboratorial experiments, which are
usually very costly. Our knowledge of these networks is very limited, for ex-
ample, 80% of the molecular interactions in cells of Yeast [10] and 99.7% of
human [11,12] are still unknown. Instead of blindly checking all possible in-
teractions, to predict based on known interactions and focus on those links
most likely to exist can sharply reduce the experimental costs if the predic-
tions are accurate enough. Social network analysis also comes up against the
missing data problem [13,14,15], where link prediction algorithms may play
a role. In addition, the data in constructing biological and social networks
may contain inaccurate information, resulting in spurious links [16,17]. Link
prediction algorithms can be applied in identifying these spurious links [18].

Besides helping in analyzing networks with missing data, the link prediction
algorithms can be used to predict the links that may appear in the future of
evolving networks. For example, in online social networks, very likely but not
yet existent links can be recommended as promising friendships, which can
help users in finding new friends and thus enhance their loyalties to the web
sites. Similar techniques can be applied to evaluate the evolving mechanism for
given networks. For example, many evolving models for the Internet topology
have been proposed: some more accurately reproduce the degree distribution
and the disassortative mixing pattern [19], some better characterize the k-core
structure [20], and so on. Since there are too many topological features and
it is very hard to put weights on them, we are not easy to judge which model
(i.e., which evolving mechanism) is better than the others. Notice that, each
model in principle corresponds to a link prediction algorithm, and thus we can
use the metrics on prediction accuracy to evaluate the performance of different
models.

Link prediction problem is a long-standing challenge in modern information


science, and a lot of algorithms based on Markov chains and statistical mod-

2
els have been proposed by computer science community. However, their works
have not caught up the current progress of the study of complex networks,
especially, they lack serious consideration of the structural characteristics of
networks, like the hierarchical organization [21] and community structure [22],
which may indeed provide useful information and insights for link prediction.
Recently, some physical approaches, such as random walk processes and max-
imum likelihood methods, have found applications in link prediction. This
article will give detailed discussion on these new development.

This article is organized as follows. In the next section, we will present the
link prediction problem and the standard metrics for performance evaluation.
Our tour of link prediction algorithms starts with the mainstreaming class of
algorithms, the so-called similarity-based algorithms 1 , which are further clas-
sified into three categories according to the information used by the similarity
indices: local indices, global indices and quasi-local indices. In Section 4 and
Section 5, we introduce the maximum likelihood algorithms and probabilistic
models for link prediction. The applications of link prediction algorithms are
presented in Section 6, including the reconstruction of networks, the evalua-
tion of network evolving mechanism and the classification of partially labeled
networks. Finally, we outline some future challenges of link prediction algo-
rithms.

2 Problem Description and Evaluation Metrics

Consider an undirected network G(V, E), where V is the set of nodes and E is
the set of links. Multiple links and self-connections are not allowed. Denote by
U the universal set containing all |V |·(|V2 |−1) possible links, where |V | denotes
the number of elements in set V . Then, the set of nonexistent links is U − E.
We assume there are some missing links (or the links that will appear in the
future) in the set U − E, and the task of link prediction is to find out these
links.

Generally, we do not know which links are the missing or future links, otherwise
we do not need to do prediction. Therefore, to test the algorithm’s accuracy,
the observed links, E, is randomly divided into two parts: the training set, E T ,
is treated as known information, while the probe set (i.e., validation subset),
E P , is used for testing and no information in this set is allowed to be used for
prediction. Clearly, E T ∪ E P = E and E T ∩ E P = ø. The advantage of this
random sub-sampling validation is that the proportion of the training split
is not dependent on the number of iterations. But with this method some
1 The similarity indices between nodes are also called kernels on graphs in some
literature of computer science community [23]

3
links may never be selected in the probe set, whereas others may be selected
more than once, resulting in statistical bias. This limitation can be overcame
by using the K-fold cross-validation, in which the observed links is randomly
partitioned into K subsets. Each time one subset is selected as probe set, the
rest K − 1 constitute the training set. The cross-validation process is then
repeated K times, with each of the K subsets used exactly once as the probe
set. With this method, all links are used for both training and validation,
and each link is used for prediction exactly once. Clearly, a larger K will lead
to smaller statistical bias yet require more computation. Some experimental
evidences suggested that the 10-fold cross-validation is a very good tradeoff
between cost and performance [24,25]. An extreme case called leave-one-out
method (i.e., the |V |-flod cross-validation) will be applied in Section 6.2.

Two standard metrics are used to quantify the accuracy of prediction algo-
rithms: area under the receiver operating characteristic curve (AUC) [26] and
Precision [27,28]. In principle, a link prediction algorithm provides an or-
dered list of all non-observed links (i.e., U − E T ) or equivalently gives each
non-observed link, say (x, y) ∈ U − E T , a score sxy to quantify its existence
likelihood. The AUC evaluates the algorithm’s performance according to the
whole list while the Precision only focuses on the L links with the top ranks or
the highest scores. A detailed introduction of these two metrics is as follows.

(i) AUC.— Provided the rank of all non-observed links, the AUC value can
be interpreted as the probability that a randomly chosen missing link (i.e., a
link in E P ) is given a higher score than a randomly chosen nonexistent link
(i.e., a link in U − E). In the algorithmic implementation, we usually calculate
the score of each non-observed link instead of giving the ordered list since
the latter task is more time-consuming. Then, at each time we randomly pick
a missing link and a nonexistent link to compare their scores, if among n
independent comparisons, there are n′ times the missing link having a higher
score and n′′ times they have the same score, the AUC value is

n′ + 0.5n′′
AUC = . (1)
n

If all the scores are generated from an independent and identical distribu-
tion, the AUC value should be about 0.5. Therefore, the degree to which the
value exceeds 0.5 indicates how much better the algorithm performs than pure
chance.

(ii) Precision.— Given the ranking of the non-observed links, the Precision is
defined as the ratio of relevant items selected to the number of items selected.
That is to say, if we take the top-L links as the predicted ones, among which Lr
links are right (i.e., there are Lr links in the probe set E P ), then the Precision
equals Lr /L. Clearly, higher precision means higher prediction accuracy.

4
1 1

5 2 5 2

4 3 4 3

Whole graph Training graph

Fig. 1. An illustration about the calculation of AUC and Precision.


Figure 1 shows an example of how to calculate the AUC and Precision. In this
simple graph, there are five nodes, seven existent links and three nonexistent
links ((1, 2), (1, 4) and (3, 4)). To test the algorithm’s accuracy, we need to
select some existent links as probe links. For instance, we pick (1, 3) and (4, 5)
as probe links, which are presented by dash lines in the right plot. Then, any
algorithm can only make use of the information contained in the training graph
(presented by solid lines in the right plot). If an algorithm assigns scores of
all non-observed links as s12 = 0.4, s13 = 0.5, s14 = 0.6, s34 = 0.5 and
s45 = 0.6. To calculate AUC, we need to compare the scores of a probe link
and a nonexistent link. There are in total six pairs: s13 > s12 , s13 < s14 ,
s13 = s34 , s45 > s12 , s45 = s14 and s45 > s34 . Hence, the AUC value equals
(3 × 1 + 2 × 0.5)/6 ≈ 0.67. For Precision, if L = 2, the predicted links are (1, 4)
and (4, 5). Clearly, the former is wrong while the latter is right, and thus the
Precision equals 0.5.

3 Similarity-Based Algorithms

The simplest framework of link prediction methods is the similarity-based al-


gorithm, where each pair of nodes, x and y, is assigned a score sxy , which
is directly defined as the similarity (or called proximity in the literature) be-
tween x and y. All non-observed links are ranked according to their scores,
and the links connecting more similar nodes are supposed to be of higher ex-
istence likelihoods. In despite of its simplicity, the study on similarity-based
algorithms is the mainstream issue. In fact, the definition of node similarity is
a nontrivial challenge. Similarity index can be very simple or very complicated
and it may work well for some networks while fails for some others. In addi-
tion, the similarities can be used in a more skilled way, such as being locally
integrated under the collaborative filtering 2 framework [30].
2 Collaborative filtering is the process of filtering for information or patterns using
techniques involving collaboration among multiple agents, viewpoints, data sources,
etc. [29]

5
Node similarity can be defined by using the essential attributes of nodes:
two nodes are considered to be similar if they have many common features
[31]. However, the attributes of nodes are generally hidden, and thus we fo-
cus on another group of similarity indices, called structural similarity, which
is based solely on the network structure. The structural similarity indices
can be classified in various ways, such as local vs. global, parameter-free vs.
parameter-dependent, node-dependent vs. path-dependent, and so on. The
similarity indices can also be sophisticatedly classified as structural equiva-
lence and regular equivalence. The former embodies a latent assumption that
the link itself indicated a similarity between two endpoints (see, for example,
the Leicht-Holme-Newman index [32] and transferring similarity [33]), while
the latter assumes that two nodes are similar if their neighbors are similar.
Readers are encouraged to see Ref. [34] for the mathematical definition of
regular equivalence and Ref. [35] for a recent application on the prediction of
protein functions.

Here we adopt the simplest method, where 20 similarity indices are classified
into three categories: the former 10 are local indices, followed by 7 global
indices, and the last 3 are quasi-local indices, which do not require global
topological information but make use of more information than local indices.

3.1 Local Similarity Indices

(1) Common Neighbours (CN). For a node x, let Γ(x) denote the set of neigh-
bors of x. In common sense, two nodes, x and y, are more likely to have
a link if they have many common neighbors. The simplest measure of this
neighborhood overlap is the directed count, namely

sCN
xy = |Γ(x) ∩ Γ(y)|. (2)

where |Q| is the cardinality of the set Q. It is obvious that sxy = (A2 )xy ,
where A is the adjacency matrix: Axy = 1 if x and y are directly connected
and Axy = 0 otherwise. Note that, (A2 )xy is also the number of different paths
with length 2 connecting x and y. Newman [36] used this quantity in the study
of collaboration networks, showing a positive correlation between the number
of common neighbors and the probability that two scientists will collaborate
in the future. Kossinets and Watts [14] analyzed a large-scale social network,
suggesting that two students having many mutual friends are very probable
to be friend in the future.

6
(2) Salton Index [6]. It is defined as

|Γ(x) ∩ Γ(y)|
sSalton
xy = q , (3)
kx × ky

where kx is the degree of node x. The Salton index is also called the cosine
similarity in the literature.

(3) Jaccard Index [37]. This index was proposed by Jaccard over a hundred
years ago, and is defined as:

|Γ(x) ∩ Γ(y)|
sJaccard
xy = . (4)
|Γ(x) ∪ Γ(y)|

(4) Sørensen Index [38]. This index is used mainly for ecological community
data, and is defined as

2|Γ(x) ∩ Γ(y)|
sSørensen
xy = . (5)
kx + ky

(5) Hub Promoted Index (HPI) [39]. This index is proposed for quantifying
the topological overlap of pairs of substrates in metabolic networks, and is
defined as
|Γ(x) ∩ Γ(y)|
sHP
xy
I
= . (6)
min{kx , ky }

Under this measurement, the links adjacent to hubs are likely to be assigned
high scores since the denominator is determined by the lower degree only.

(6) Hub Depressed Index (HDI). Analogously to the above index, we also
consider a measurement with the opposite effect on hubs, defined as

|Γ(x) ∩ Γ(y)|
sHDI
xy = . (7)
max{kx , ky }

(7) Leicht-Holme-Newman Index (LHN1) [32]. This index assigns high simi-
larity to node pairs that have many common neighbors compared not to the
possible maximum, but to the expected number of such neighbors. It is defined
as

1 |Γ(x) ∩ Γ(y)|
sLHN
xy = , (8)
kx × ky

7
where the denominator, kx × ky , is proportional to the expected number of
common neighbors of nodes x and y in the configuration model [40]. We use
the abbreviation LHN1 to distinguish this index to another index (named as
LHN2 index) also proposed by Leicht, Holme and Newman.

(8) Preferential Attachment Index (PA). The mechanism of preferential at-


tachment can be used to generate evolving scale-free networks, where the
probability that a new link is connected to the node x is proportional to kx
[41]. A similar mechanism can also lead to scale-free networks without growth
[42], where at each time step, an old link is removed and a new link is gener-
ated. The probability that this new link will connect x and y is proportional
to kx × ky . Motivated by this mechanism, the corresponding similarity index
can be defined as

sPxyA = kx × ky , (9)

which has been widely used to quantify the functional significance of links
subject to various network-based dynamics, such as percolation [43], synchro-
nization [44] and transportation [45]. Note that this index does not require
the information of the neighborhood of each node, as a consequence, it has
the least computational complexity.

(9) Adamic-Adar Index (AA) [46]. This index refines the simple counting of
common neighbors by assigning the less-connected neighbors more weight, and
is defined as
1
sAA
X
xy = . (10)
z∈Γ(x)∩Γ(y)
log kz

(10) Resource Allocation Index (RA) [47]. This index is motivated by the
resource allocation dynamics on complex networks [48]. Consider a pair of
nodes, x and y, which are not directly connected. The node x can send some
resource to y, with their common neighbors playing the role of transmitters.
In the simplest case, we assume that each transmitter has a unit of resource,
and will equally distribute it to all its neighbors. The similarity between x and
y can be defined as the amount of resource y received from x, which is:
1
sRA
X
xy = . (11)
k
z∈Γ(x)∩Γ(y) z

Clearly, this measure is symmetric, namely sxy = syx . Note that, although
resulting from different motivations, the AA index and RA index have very
similar form. Indeed, they both depress the contribution of the high-degree
common neighbors. AA index takes the form (log kz )−1 while RA index takes

8
Table 1
Accuracies of different local similarity indices subject to link prediction, measured
by the AUC value. Each number is obtained by averaging over 10 implementations
with independently random partitions of testing set (90%) and probe set (10%).
The entries corresponding to the highest accuracies among these 10 indices are em-
phasized in black. The six real networks for testing are a protein-protein interaction
network (PPI) [16], a co-authorship network of scientists who are themselves pub-
lishing on the topic of network science (NS) [50], an electrical power-grid of the
western US (Grid) [51], a network of the US political blogs (PB) [52], a router-level
Internet collected by Rocketfuel Project (INT) [53], and a network of the US air
transportation system (USAir) [54]. Detailed information about these networks can
be found in Ref. [47].
Indices PPI NS Grid PB INT USAir
CN 0.889 0.933 0.590 0.925 0.559 0.937
Salton 0.869 0.911 0.585 0.874 0.552 0.898
Jaccard 0.888 0.933 0.590 0.882 0.559 0.901
Sørensen 0.888 0.933 0.590 0.881 0.559 0.902
HPI 0.868 0.911 0.585 0.852 0.552 0.857
HDI 0.888 0.933 0.590 0.877 0.559 0.895
LHN1 0.866 0.911 0.585 0.772 0.552 0.758
PA 0.828 0.623 0.446 0.907 0.464 0.886
AA 0.888 0.932 0.590 0.922 0.559 0.925
RA 0.890 0.933 0.590 0.931 0.559 0.955

the form kz−1 . The difference is insignificant when the degree, kz , is small,
while it is considerable when kz is large. In a word, RA index punishes the
high-degree common neighbors more heavily than AA.

Liben-Nowell et al. [49] and Zhou et al. [47] systematically compared a number
of local similarity indices on many real networks: the former [49] focuses on
social collaboration networks and the latter [47] considers disparate networks
including the protein-protein interaction network, electronic grid, Internet,
US airport network, etc. According to extensive experimental results on real
networks (see results in Table 1), the RA index performs best, while AA and
CN indices have the second best overall performance among all the above-
mentioned local indices.

The PA index has the worst overall performance, yet we are interested in
it for it requires the least information. Notice that, PA performs even worst
than pure chance for the Internet at router level and the power grid. In these
two networks, the nodes have well-defined positions and the links are physi-

9
cal lines. Actually, geography plays a significant role and links with very long
geographical distances are rare. As local centers, the high-degree nodes have
longer geographical distances to each other than average, and thus have a
lower probability of directly connecting to each other, which leads to the bad
performance of PA. In contrast, although USAir has well-defined geographical
positions of nodes, its links are not physical. Empirical data has demonstrated
that the number of airline flights is not very sensitive to the geographical
distance within a big range [55,56] (another topological evidence for the rela-
tively good performance of PA on USAir is the so-called rich-club phenomenon
[57,58]). The LHN1 index performs the second worst, however, compared with
all other neighborhood-based indices, it is very good at uncovering the missing
links connecting two small-degree nodes [59].

Recently, Pan et al. [60] have compared all the local indices appeared in Ref.
[47] in a similarity-based community detection algorithm, and their experi-
mental results again indicate that the RA index performs best. Wang et al.
[61] have applied the RA index to estimate the weights between stations in
Chinese railway, which shows better performance than the CN index. In ad-
dition, the RA index for bipartite networks can be applied in personalized
recommendation with higher accuracy than the classical collaborative filter-
ing [62].

3.2 Global Similarity Indices

(11) Katz Index [63]. This index is based on the ensemble of all paths, which
directly sums over the collection of paths and is exponentially damped by
length to give the shorter paths more weights. The mathematical expression
reads

2 2 3 3
sKatz β l · |paths<l>
X
xy = xy | = βAxy + β (A )xy + β (A )xy + · · · , (12)
l=1

where paths<l>
xy is the set of all paths with length l connecting x and y, and
β is a free parameter (i.e., the damping factor) controlling the path weights.
Obviously, a very small β yields a measurement close to CN, because the long
paths contribute very little. The similarity matrix can be written as

S Katz = (I − βA)−1 − I. (13)

Note that, β must be lower than the reciprocal of the largest eigenvalue of
matrix A to ensure the convergence of Eq. 12.

10
(12) Leicht-Holme-Newman Index (LHN2) [32]. This index is a variant of the
Katz index. Based on the concept that two nodes are similar if their imme-
diate neighbors are themselves similar, one obtains a self-consistent matrix
formulation

S = φAS + ψI = ψ(I − φA)−1 = ψ(I + φA + φ2 A2 + · · ·), (14)

where φ and ψ are free parameters controlling the balance between the two
components of the similarity. Setting ψ = 1, it is very similar to the Katz
index. Note that (Al )xy is equal to the number of paths of length l from x
to y. The expected value of (Al )xy , namely E[(Al )xy ], equals (kx ky /2M)λl−1
1 ,
where λ1 is the largest eigenvalue of A and M is the total number of edges
in the network. Replace (Al )xy in Eq. 14 with (Al )xy /E[(Al )xy ], we obtain the
expression:

" # " #
2 2M X 2Mλ1 2Mλ1 φ
sLHN
xy = δxy + φl λ1−l (Al )xy = 1 − δxy + (I − A)−1 , (15)
kx ky l=0 kx ky kx ky λ1 xy

where δxy is the Kronecker’s function. Since the first item is a diagonal matrix,
it can be dropped and thus we arrive to a compact expression

φA −1 −1
S = 2mλ1 D −1 (I − ) D , (16)
λ1

where D is the degree matrix with Dxy = δxy kx and φ (0 < φ < 1) is a
free parameter. The choosing of φ depends on the investigated network, and
smaller φ assigns more weights on shorter paths.

(13) Average Commute Time (ACT). Denote by m(x, y) the average number
of steps required by a random walker starting from node x to reach node y,
the average commute time between x and y is

n(x, y) = m(x, y) + m(y, x), (17)

which can be obtained in terms of the pseudoinverse of the Laplacian matrix,


L+ (L = D − A), as [64,65]:
+ + +
n(x, y) = M(lxx + lyy − 2lxy ), (18)

+
where lxy denotes the corresponding entry in L+ . Assuming two nodes are
more similar if they have a smaller average commute time, then the similarity

11
between the nodes x and y can be defined as the reciprocal of n(x, y), namely
(the constant factor M is removed)

1
sACT
xy = + + +
. (19)
lxx + lyy − 2lxy

(14) Cosine based on L+ . This index is an inner-product-based measure. In the


1
Euclidean space spanned by vx = Λ 2 U T e~x , where U is an orthonormal matrix
made of the eigenvectors of L+ ordered in decreasing order of corresponding
eigenvalue λx , Λ = diag(λx ), e~x is an N × 1 vector with the xth element
equal to 1 and others all equal to 0, and T is the matrix transposition, the
pseudoinverse of the Laplacian matrix are the inner products of the node
+
vectors, lxy = vxT vy . Accordingly, the cosine similarity is defined as the cosine
of the node vectors, namely [65]:
+
+ vxT vy
+ lxy
scos
xy = cos(x, y) = = q . (20)
|vx | · |vy | + · l+
lxx yy

(15) Random Walk with Restart (RWR). This index is a direct application of
the PageRank algorithm [66]. Consider a random walker starting from node x,
who will iteratively moves to a random neighbor with probability c and return
to node x with probability 1 − c. Denote by qxy the probability this random
walker locates at node y in the steady state, we have

q~x = cP T q~x + (1 − c)e~x , (21)

where P is the transition matrix with Pxy = 1/kx if x and y are connected,
and Pxy = 0 otherwise. The solution is straightforward, as

q~x = (1 − c)(I − cP T )−1 e~x . (22)

The RWR index is thus defined as

sRW
xy
R
= qxy + qyx . (23)

A fast algorithm to calculate this index was proposed by Tong et al. [67], and
the application of this index to recommender systems can be found in Ref.
[68].

(16) SimRank [69]. Similar to the LHN2, SimRank is defined in a self-consistent


way, according to the assumption that two nodes are similar if they are con-

12
nected to similar nodes:
sSimRank
P P
z∈Γ(x) z ′ ∈Γ(y) zz ′
sSimRank
xy =C· (24)
kx · ky

where sxx = 1 and C ∈ [0, 1] is the decay factor. The SimRank can also be
interpreted by the random-walk process, that is, sSimRank
xy measures how soon
two random walkers, respectively starting from nodes x and y, are expected
to meet at a certain node.

(17) Matrix Forest Index (MFI) [70]. This index is defined as

S = (I + L)−1 , (25)

where the similarity between x and y can be understood as the ratio of the
number of spanning rooted forests such that nodes x and y belong to the same
tree rooted at x to all spanning rooted forests of the network (see details in
Ref. [70]). A parameter-dependent variant of MFI is

S = (I + αL)−1 , α > 0. (26)

This index has been applied to quantify the similarity between nodes on collab-
orative recommendation task [71]. The results indicate that a simple nearest-
neighbors rule based on similarity measured by MFI performs best.

Comparing with the local similarity indices, the global ones ask for the whole
topological information. Although the global indices can provide much more
accurate prediction than the local ones, they suffer two big disadvantages:
(i) the calculation of a global index is very time-consuming, and is usually
infeasible for large-scale networks; (ii) sometimes, the global topological infor-
mation is not available, especially if we would like to implement the algorithm
in a decentralized manner. As we will show in the next subsection, a promising
tradeoff is the quasi-local indices, which considers more information than local
indices while abandons the superfluous information that makes no contribu-
tion or very little contribution to the prediction accuracy.

3.3 Quasi-Local Indices

(18) Local Path Index (LP) [47,72]. To provide a good tradeoff of accuracy and
computational complexity, we here introduce an index that takes consideration
of local paths, with wider horizon than CN. It is defined as

S LP = A2 + ǫA3 , (27)

13
where ǫ is a free parameter. Clearly, this measure degenerates to CN when
ǫ = 0. And if x and y are not directly connected (this is the case we are
interested in), (A3 )xy is equal to the number of different paths with length 3
connecting x and y. This index can be extended to account for higher-order
paths, as

S LP (n) = A2 + ǫA3 + ǫ2 A4 + · · · + ǫn−2 An , (28)

where n > 2 is the maximal order. With the increasing of n, this index asks
for more information and computation. Especially, when n → ∞, S LP (n) will
be equivalent to the Katz index that takes into account all paths in the net-
work. The computational complexity of this index in an uncorrelated network
is O(Nhkin ), which grows fast with the increasing of n and will exceed the
complexity for calculating the Katz index (approximate to O(N 3 )) for large
n. Experimental results show that the optimal n is positively correlated with
the average shortest distance of the network [72].

The LP index performs remarkably better than the neighborhood-based in-


dices, such as RA, AA and CN [47]. It is because the neighborhood information
is less distinguishable and two node pairs are of high probability to be assigned
the same similarity scores. Taking INT as an example, there are more than 107
node pairs, 99.59% of which are assigned zero score by CN. For all the node
pairs having scores higher than 0, 91.11% are assigned score 1, and 4.48%
are assigned score 2. Using a little bit more information involving the next
nearest neighbors may break the “degeneracy of the states and make the simi-
larity scores more distinguishable. This is the reason why the LP index largely
improves the prediction accuracy.

The comparison of LP index with other two path-dependent global indices, the
Katz and LHN2 indices, is shown in Table 2. Overall speaking, the Katz index
performs best subject to the AUC value, while the LP index is the best for the
Precision. For the network with small average shortest distance (e.g., USAir
and PB), LP index gives the most accurate predictions for both AUC and
Precision. In a word, the LP index provides competitively good predictions
while asks for much lighter computation compared with the global indices.

(19) Local Random Walk (LRW) [73]. To measure the similarity between nodes
x and y, a random walker is initially put on node x and thus the initial density
vector ~πx (0) = e~x . This density vector evolves as ~πx (t+1) = P T ~πx (t) for t ≥ 0.
The LRW index at time step t is thus defined as

sLRW
xy (t) = qx πxy (t) + qy πyx (t). (29)

where q is the initial configuration function. In Ref. [73] Liu and Lü applied a
kx
simple form determined by node degree, namely qx = M .

14
Table 2
Accuracies of the three path-dependent similarity indices, measured by AUC and
precision. Here only the main components of example networks are considered (see
Ref. [72] for detailed information). Each number is obtained by averaging over 10
independent realizations. The entries corresponding to the highest accuracies are
emphasized in black. For LP, Katz and LHN2 indices, the AUC values are corre-
sponding to the optimal parameter which will be used to calculate their correspond-
ing precision where we set L = 100. For USAir, the optimal value of ǫ is negative
(see the explanation in Ref. [47]). LP* denotes the LP index with a fixed parameter
ǫ = 0.01 (for USAir ǫ = −0.01). The very small difference between the optimal case
and the case with ǫ = 0.01 suggests that in the real application, one can directly
set ǫ as a very small number, instead of finding out its optimum that may cost
much time. This again supports our motivation that the essential advantage of the
uasge of the second order neighborhood is to improve the distinguishability of the
similarity scores.
AUC PPI NS Grid PB INT USAir
LP 0.970 0.988 0.697 0.941 0.943 0.960
LP* 0.970 0.988 0.697 0.939 0.941 0.959
Katz 0.972 0.988 0.952 0.936 0.975 0.956
LHN2 0.968 0.986 0.947 0.769 0.959 0.778

Precision PPI NS Grid PB INT USAir


LP 0.734 0.292 0.132 0.519 0.557 0.627
LP* 0.734 0.292 0.132 0.469 0.121 0.627
Katz 0.719 0.290 0.063 0.456 0.368 0.623
LHN2 0 0.060 0.005 0 0 0.005

(20) Superposed Random Walk (SRW) [73]. Similar to the RWR index, Liu and
Lü [73] proposed the SRW index, where the random walker is continuously
released at the starting point, resulting in a higher similarity between the
target node and the nodes nearby. The mathematical expression reads
t t
sSRW sLRW
X X
xy (t) = xy (τ ) = [qx πxy (τ ) + qy πyx (τ )], (30)
τ =1 τ =1

where t denotes the time steps.

Liu and Lü [73] systematically compared these two indices, LRW and SRW,
with five other indices, including three local (or quasi-local) indices, CN, RA
and LP, and two other random-walk-based global indices, ACT and RWR, as
well as the hierarchical structure method (HSM) proposed by Clauset, Moore
and Newman [75] (see Section 4.1 for the detailed introduction of HSM).

15
Table 3
Comparison of algorithms’ accuracy quantified by AUC and Precision. For each
network, the training set contains 90% of the known links. Each number is obtained
by averaging over 1000 implementations with independently random divisions of
training set and probe set. The parameters ε = 10−3 for LP (for USAir, ε = −10−3 )
and c = 0.9 for RWR. The numbers inside the brackets denote the optimal step of
LRW and SRW indices. For example, 0.972(2) means the optimal AUC is obtained
at the second step of LRW. The highest accuracy in each line is emphasized in black.
For HSM, 5000 samples of dendrograms for each implementation are generated.
AUC CN RA LP ACT RWR HSM LRW SRW
USAir 0.954 0.972 0.952 0.901 0.977 0.904 0.972(2) 0.978(3)
NetScience 0.978 0.983 0.986 0.934 0.993 0.930 0.989(4) 0.992(3)
Power 0.626 0.626 0.697 0.895 0.760 0.503 0.953(16) 0.963(16)
Yeast 0.915 0.916 0.970 0.900 0.978 0.672 0.974(7) 0.980(8)
C.elegans 0.849 0.871 0.867 0.747 0.889 0.808 0.899(3) 0.906(3)

Precision CN RA LP ACT RWR HSM LRW SRW


USAir 0.59 0.64 0.61 0.49 0.65 0.28 0.64(3) 0.67(3)
NetScience 0.26 0.54 0.30 0.19 0.55 0.25 0.54(2) 0.54(2)
Power 0.11 0.08 0.13 0.08 0.09 0.00 0.08(2) 0.11(3)
Yeast 0.67 0.49 0.68 0.57 0.52 0.84 0.86(3) 0.73(9)
C.elegans 0.12 0.13 0.14 0.07 0.13 0.08 0.14(3) 0.14(3)

According to the experimental results (see Table 3), LRW and SRW methods
perform better than other indices with their respective optimal walking step
positively correlated with the average shortest distance of the network.

Furthermore, the computational complexity of LRW and SRW is lower than


ACT and RWR whose time complexity in calculating inverse and pseudoin-
verse is approximately O(N 3 ), while the time complexity of n-steps LRW and
SRW are approximately O(Nhkin ), ignoring degree heterogeneity of the net-
work. That is to say, when n is small LRW and SRW run much faster than
other random-walk-based global similarity indices. The advantage of LRW and
SRW for their low calculation complexity is prominent especially in the huge
size (i.e. large N) and sparse (i.e. small hki) networks. For example, LRW or
SRW for power grid is thousands time faster than ACT, cos+ and RWR, even
for n ≃ 10 [73].

With the similar motivation of LRW and SRW, Mantrach et al. recently pro-
posed a bounded normalized random walk with restart algorithm (see Eq. 21
for the definition of RWR), and applied it to address the classification problem

16
[74]. With this method both complexities of time and space can be reduced.

4 Maximum Likelihood Methods

This section will introduce two recently proposed algorithms based on the
maximum likelihood estimation. These algorithms presuppose some organiz-
ing principles of the network structure, with the detailed rules and specific
parameters obtained by maximizing the likelihood of the observed structure.
Then, the likelihood of any non-observed link can be calculated according to
those rules and parameters.

From the viewpoint of practical applications, an obvious drawback of the max-


imum likelihood methods is that it is very time consuming. A well designed
algorithm is able to handle networks with up to a few thousand nodes in a
reasonable time, but will definitely fail to deal with the huge online networks
that often consist of millions of nodes. In addition, the maximum likelihood
methods are probably not among the most accurate ones (see, for example, the
comparison between hierarchical structure model and some typical similarity-
based methods in Table 3). However, the maximum likelihood methods provide
very valuable insights into the network organization, which can not be gained
from the similarity-based algorithms or the probabilistic models.

4.1 Hierarchical Structure Model

Empirical evidence indicates that many real networks are hierarchically orga-
nized, where nodes can be divided into groups, further subdivided into groups
of groups, and so forth over multiple scales [21] (e.g., metabolic networks [39]
and brain networks [76]). As Redner said [77], focusing on the hierarchical
structure inherent in social and biological networks might provide a smart
way to find missing links. Clauset, Moore and Newman [75] proposed a gen-
eral technique to infer the hierarchical organization from network data and
further applied it to predict the missing links.

As shown in Fig. 2, the hierarchical structure of a network can be represented


by a dendrogram with N leaves (corresponding to the nodes of the network)
and N − 1 internal nodes. Clauset, Moore and Newman [75] introduced a
simple model where each internal node r is associated with a probability pr
and the connecting probability of a pair of nodes (leaves) is equal to pr′ where
r ′ is the lowest common ancestor of these two nodes. Given a real network G
and a dendrogram D, let Er be the number of edges in G whose endpoints
have r as their lowest common ancestor in D, and let Lr and Rr , respectively,

17
Fig. 2. Illustration of a dendrogram of a network with 5 nodes. Accordingly, the
connecting probability of nodes 1 and 2 is 0.5, of nodes 1 and 3 is 0.3, of nodes 3
and 4 is 0.4.
be the number of leaves in the left and right subtrees rooted at r. Then the
likelihood of the dendrogram D together with a set of pr is
Lr Rr −Er
pE
Y
L(D, {pr }) = r (1 − pr )
r
. (31)
r

For a fixed D, it is obvious that


Er
p∗r = (32)
Lr Rr

maximizes L(D, {pr }). Therefore, according to the maximum likelihood method
[78], with a fixed D, it is easy to determine {pr } (by Eq. 32) that best fits the
network G. Figure 3 shows an example network and two possible dendrograms,
as well as the corresponding likelihoods. It is in accordance with the common
sense that D2 is more likely. A Markov chain Monte Carlo method is used
to sample dendrograms with probability proportional to their likelihood (see
the Supplementary Information of Ref. [75] and a benchmark book [79] for
details).

The algorithm to predict the missing links contains the following procedures:
(i) Sample a large number of dendrograms with probability proportional to
their likelihood; (ii) For each pair of unconnected nodes i and j, calculate the
mean connecting probability hpij i by averaging the corresponding probability
pij over all sampled dendrograms; (iii) Sort these node pairs in descending
order of hpij i and the highest-ranked ones are those to be predicted. According
to the AUC value, this algorithm outperforms the CN index for the terrorist
association network [80] and the grassland species food web [81], while loses
for the metabolic network of the spirochaete Treponema Pallidum [82].

18
Fig. 3. The likelihood of two possible dendrograms for an example network consisting
of 6 nodes. The interval nodes are labeled with the maximum likelihood probability
obtained by Eq. 32. The likelihoods are L(D1 ) ≈ 0.00165 (left dendrogram) and
L(D1 ) ≈ 0.0433 (right dendrogram). Reprinted figure with permission from [75],
copyright is held by Nature Publishing Group.

The hierarchical structure model provides a smart way to predict missing links,
and, maybe more significantly, it uncovers the hidden hierarchical organization
of networks. However, as mentioned above, a big disadvantage is that this
algorithm runs very slow. Actually, the process to sample dendrograms usually
asks for O(N 2 ) steps of the Markov chain [75], and in the worst case, it takes
exponential time [83]. In comparison, according to the CPU of an advanced
desktop computer, the hierarchical structure model cannot manage a network
of tens of thousands nodes, while the algorithms based on local similarity
indices can deal with networks with tens of millions nodes. Another noticeable
remark is that this model may give poor predictions for those networks without
clear hierarchical structures.

4.2 Stochastic Block Model

Stochastic block model [84,85,86,87] is one of the most general network models,
where nodes are partitioned into groups and the probability that two nodes are
connected depends solely on the groups to which they belong. The stochastic
block model can capture the community structure [22], role-to-role connections
[88] and maybe other factors for the establishing of connections, especially
when the group membership plays the considerable roles in determining how
nodes interact with each other, which usually could not be well described by
the simple assortativity coefficient [89,90] or the degree-degree correlations

19
2 5

1 4 6

3 7

Fig. 4. An illustration about the calculation of likelihood for the stochastic block
model.
[91,92].

Given a partition M where each node belongs to one group and the connecting
probability for two nodes respectively in groups α and β is denoted by Qαβ
(Qαα represents the probability that two nodes within group α are connected),
then the likelihood of the observed network structure is [18]:
l
(1 − Qαβ )rαβ −lαβ ,
Y αβ
L(A|M) = Qαβ (33)
α≤β

where lαβ is the number of edges between nodes in groups α and β and rαβ is
the number of pairs of nodes such that one node is in α and the other is in β.
Similar to Eq. 32, the optimal Qαβ that maximizes the likelihood L(A|M) is:

lαβ
Q∗αβ = . (34)
rαβ

A simple illustration is shown in Fig. 4. Given a partition M = {{1, 2, 3}, {4, 5, 6, 7}},
according to Eq. 34, the Q values corresponding to the maximum likelihood
2
are Q∗11 = 33 = 1, Q∗12 = 12 = 16 , Q∗22 = 65 , and thus the likelihood is
 2  10  5  
1 5 5 1
L=1× × ≈ 3.005 × 10−4 . (35)
6 6 6 6

Denote by Ω the set of all possible partitions, using Bayes’ Theorem [93], the
reliability of an individual link is [18]:

L(Axy = 1|M)L(A|M)p(M)dM
R

Rxy = L(Axy = 1|A) = ′ ′ ′
, (36)
Ω L(A|M )p(M )dM
R

where p(M) is a constant assuming no prior knowledge about the model.


Notice that, the number of different partitions of N elements grows faster
than any finite power of N, and thus even for a small network, to sum over

20
all partitions is not possible in practice. The Metropolis algorithm [94] can be
applied to estimate the link reliability [18]. Even though, the whole process is
very time consuming and this method can only manage networks with up to
a few thousands of nodes.

Reliability describes the likelihood of the existence of a link (i.e., the prob-
ability that the link “truly” exists) given the observed structure [18], which
can be used not only to predict missing links (the nonexistent links in the ob-
served network yet with the highest reliabilities) but also to identify possible
spurious links (the existent links with the lowest reliabilities). Empirical com-
parison on five disparate networks (the social interactions in a karate club [95],
the social network of frequent associations between 62 dolphins [96], the air
transportation network of Eastern Europe [97], the neural network of the ne-
matode Caenorhabditis elegans [98], and the metabolic network of Escherichia
coli [99]) indicated that the overall performance of the maximum likelihood
method based on stochastic block model [18] is better than the one based on
the hierarchical structure model [75] and the similarity-based algorithm for
common neighbors [49].

5 Probabilistic Models

Probabilistic models aim at abstracting the underlying structure from the ob-
served network, and then predicting the missing links by using the learned
model. Given a target network G = (V, E), the probabilistic model will opti-
mize a built target function to establish a model composed of a group of param-
eters Θ, which can best fit the observed data of the target network. Then the
probability of the existence of a nonexistent link (i, j) is estimated by the con-
ditional probability P (Aij = 1|Θ). This section will introduce the three main-
stream methods, respectively called Probabilistic Relational Model (PRM)
[100], Probabilistic Entity Relationship Model (PERM) [101] and Stochastic
Relational Model (SRM) [102]. Notice that, in some literature, the term PRM
only refers to a specific model which is usually called the Relational Bayesian
Networks nowadays, while we adopt the more general usage of PRM in this
review.

5.1 Probabilistic Relational Models

PRMs represent a joint probability distribution over the attributes of a re-


lational dataset. They allow the properties of an object to dependent prob-
abilistically both on other properties of that object and on properties of the
related objects. Different from the traditional graphical models using a single

21
graph to model the relationship among the attributes of homogenous entities,
PRMs contain three graphs [103]: the data graph GD , the model graph GM ,
and the inference graph GI . These correspond to the skeleton, model, and
ground graph as outlined by Heckerman et al. [104].

The data graph GD = (VD , ED ) presents the input network, where nodes
are the objects in the data and edges represent the relationships among the
objects. Each node vi ∈ VD and edge ej ∈ ED are associated with a type
T (vi ) = tvi , T (ej ) = tej . Each item (either object or edge) type t ∈ T has a
number of associated attributes Xt . Consequently, each object vi and link
tv te
ej are associated with a set of attribute values, xvii and xejj , determined
by their types, tv and te , respectively. A PRM represents a joint proba-
bility distribution over the values of all the attributes in the data graph,
tv S te
x = {xvii : vi ∈ VD , T (vi ) = tvi } {xejj : ej ∈ ED , T (ej ) = tej }. For example,
in the student-course selection system, the students and courses are nodes
and the edges represent the select relationship between students and courses.
Clearly, there are two types of nodes, namely student and course. And the
type student has four attributes: grade, age, sex and department, while the
type course has five attributes: category, teacher, year, time and discipline.

The model graph GM = (VM , EM ) represents the dependencies among at-


tributes at the level of item types. Attributes of an item can depend proba-
bilistically on other attributes of the same item, as well as on attributes of
other related objects or links in GD . Each node in VM corresponds to an at-
tribute Xit ∈ Xt where t ∈ T . The attributes with the same type in GD are
tied together. Thus GD is decomposed into multiple examples of each type,
based on which a joint model of dependencies among the type attributes can
be built. GM contains two parts: the dependent structure among all the type
attributes and the conditional probability distribution (CPD) associated with
the nodes in GM .

The inference graph GI = (VI , EI ) represents the probabilistic dependencies


among all the variables in a single test set. It can be instantiated by a roll-
out process of GD and GM . Each item-attribute pair in GD gets a separate,
local copy of the corresponding CPD from GM . The relations in GD determine
the way that GM is rolled out to form GI . Therefore the structure of GI is
determined by both GD and GM .

With respect to different representations of the modeled graph GM and the


corresponding learning and inferring procedures, PRMs can be classified into
three groups: Relational Bayesian Networks (RBNs), Relational Markov Net-
works (RMNs) and Relational Dependency Networks (RDNs).

Relational Bayesian Networks (RBNs) [100,105]— The model graph presented

22
by a RBN is a directed acyclic graph 3 with a set of CPDs, P , to represent
a joint distribution over the attributes of the item types. The set P contains
a conditional probability distribution for each variable given its parents 4 ,
p(x|pax ), where pax denotes the parents of node. Thus the joint probabilistic
distribution can be calculated as

p(xtvi |paxtvi ) p(xtei |paxtei ).


Y Y Y Y
p(x) = (37)
t∈T Xit ∈Xt v:T (v)=t e:T (e)=t

The need to avoid cycles in RBN leads to significant representational and


computational difficulties. Inference is done by creating the complete ground
network, which limits their scalability. RBN requires specifying a complete
conditional model for each attribute of each class, which in large complex
domains can be quite burdensome.

Relational Markov Networks (RMNs) [106,107]— A RMN uses an undirected


graph and a set of potential functions Φ to represent the joint distribution over
the attributes of the item types. Denote by C the set of cliques in the graph
and each clique c ∈ C is associated with a set of variables Xc (∈ Xt ) (i.e. the
nodes in this clique 5 ) and a clique potential Φc (xc ) which is a non-negative
function over the possible values for xc ∈ Xc , then the joint probability over
x is calculated with the formula
1 Y
p(x) = Φc (xc ), (38)
Z c∈C

where Z is a normalizing constant, which sums over all possible instantiations.


RMNs are trained discriminatively, and do not specify a complete joint distri-
bution for the variables in the model. The learning algorithm uses maximum
a posteriori (MAP) estimation with belief propagation for inference, which
leads to a high computational complexity for learning.

Relational Dependency Networks (RDNs) [110,111]— The RDN is a bi-directed


graphical model with a set of conditional probability distributions, which can
be used to represent the cyclic dependencies in a relational setting. RDNs use
pseudo-likelihood learning techniques to estimate an efficient approximation
of the full joint distribution of the attribute values in a relational dataset. The
pseudo-likelihood for data graph GD is computed as a product over the item

3 A directed graph is acyclic if there is no directed path that starts and ends at the
same variable. This constrain indicates that a random variable does not depend,
directly or indirectly, on its own value.
4 A direct link from a to b indicates that a is b’s parent node.
5 Actually a node in a clique corresponds to an attribute in the data graph.

23
types t, the attributes of that type Xt , and the nodes v and edges e of that
type:

p(xtvi |paxtvi ; Θ) p(xtei |paxtei ; Θ).


Y Y Y Y
P L(GD ; Θ) = (39)
t∈T Xit ∈Xt v:T (v)=t e:T (e)=t

The CPDs in the RDN pseudo-likelihood are not required to factor the joint
distribution of GD . More specifically, when consider the variable xtvi , we con-
dition on the values of the parents paxtvi regardless of whether the estimation
of CPDs for variables in paxtvi was conditioned on xtvi . RDN adopts Gibbs
sampling 6 to iteratively relabel each unobserved variable by drawing from its
local conditional distribution, given the current state of the rest of the graph.

5.2 Probabilistic Entity Relationship Models

A specific type of probabilistic entity-relationship model is the directed acyclic


PERM (DAPER for short), which uses directed arcs 7 to describe the rela-
tionship between attributes [101]. DAPER makes relationships the first class
objects in the modeling language, and encourages an explicit representation
of conditional probabilistic distribution. The DAPER model consists of six
classes [113]: (i) Entity classes: specify the classes of objects in real world; (ii)
Relationship classes: represent the interaction among entity classes; (iii) At-
tribute classes: describe properties of entities or relationships. (iv) Arc classes:
represent probabilistic dependencies among corresponding attributes; (v) Lo-
cal distribution classes: construct the local distributions for attributes corre-
sponding to the attribute class; (vi) Constraint classes: specify how to derive
inference graph (i.e., ground graph) from the corresponding DAPER model
over the given instantiated domain. DAPER model assigns relationships the
same importance as the entities.

The DAPER model can be used in the situation where the relational struc-
ture itself is uncertain. And it is more expressive than either PRMs or plate
models 8 . Actually, DAPER combines the features of plate models and PRMs,

6 Heckerman et al. [110] proved that the Gibbs sampling can be used to estimate the
joint distribution of a dependency network. For a basic introduction and summary
of the Gibbs sampling, see Ref. [112].
7 In graph theory, the term “arc” stands for directed link.
8 The standard description of plate models can be found in Refs. [108,109]. Hecker-

man et al. [101] provided a new definition of plate model, which is slightly different
from the traditional one [108,109]. According to this new definition, the plate mod-
els and DAPER models are equivalent, and a plate model can be invertible mapped
to a DAPER model [101].

24
and the relations between DAPER models, PRMs and plate models can be
found in Ref. [101].

5.3 Stochastic Relational Models

The key idea of SRM is to model the stochastic structure of entity relationships
(i.e., links) via a tensor interaction of multiple Gaussian Processes (GPs), each
defined on one type of entities [102].

Assuming that the observable links r are derived as local measurements of a


real-value latent relation function t: U × V → R, and each link rij is solely de-
pendent on its latent value tij , modeled by the likelihood p(rij |tij ). The latent
relation function t is generated via a tensor interaction of two independent
entity-specific GPs, one acting on U and the other on V. Note that U and V
can both encompass an infinite number of entities. The relational processes are
characterized by hyper-parameters Θ = {ΘΣ , ΘΩ }, where ΘΣ and ΘΩ are for
the GP kernel function on U and V respectively. Thus SRM defines a Bayesian
prior p(t|Θ) for the latent variables t. Let I be the index set of entity pairs
having observed links, the marginal likelihood under such a prior is

Z Y
p(RI |Θ) = p(rij |tij )p(t|Θ)dt, (40)
(ij)∈I

where RI = rij , (i, j) ∈ I. The hyper-parameters Θ can be estimated by max-


imizing the marginal likelihood. Then the link for a new pair of entities can
be predicted by marginalizing over the a posteriori p(t|RI , Θ).

This model in fact defines a set of nonparametric priors on infinite dimen-


sional tensor matrices, where each element represents a relationship between
a tuple of entities. By maximizing the marginalized likelihood, information is
exchanged between the participating GPs through the entire relational net-
work, so that the dependency structure of links is messaged to the dependency
of entities, reflected by the adapted GP kernels. Because the training is on a
conditional model of links, this model offers a discriminative approach for link
prediction, namely predicting the existences, strengths, or types of relation-
ships based on the partially observed linkage network as well as the attributes
of entities if given. Yu et al. further upgraded SRM with an edge-wise covari-
ance with which the overall computational complexity can be reduced [114].
For more details one can see Refs. [102,115].

25
6 Applications

The problem of link prediction has attracted much attention from disparate
research communities. This is mainly attributed to its broad applicability. For
some networks, especially biological networks such as protein-protein interac-
tion networks, metabolic networks and food webs, the discovery of links or
interactions is costly in the laboratory or the field. A highly accurate predic-
tion can reduce the experimental costs and speed the pace of uncovering the
truth [75,77]. Link prediction has also been applied in the analysis of social
networks, such as the prediction of being actors in acts [116], the prediction
of the collaborations in co-authorship networks [49], the detection of the un-
derground relationships between terrorists [75], and so on. In addition, the
process of recommending items to users can be considered as a link predic-
tion problem in the user-item bipartite networks [117,118]. Actually, almost
the same techniques as the similarity-based link prediction has been applied
in personalized recommendation [62,119,120]. Accurate recommendation can
be used in e-commerce web sites to enhance the sales [121]. Moreover, the
link prediction approaches can be applied to solve the classification problem
in partially labeled networks, such as the prediction of protein functions [35],
the detection of anomalous email [122], distinguishing the research ares of
scientific publications [123], and finding out the fraud and legit users in cell
phone networks [124]. The following three subsections will introduce typical
applications of link prediction.

6.1 Reconstruction of Networks

Guimerà and Sales-Pardo [18] firstly considered the reconstruction of networks


from the observed networks with missing and spurious links. Although one can
rank the observed and non-observed links according to their reliabilities (see
Eq. 36), it is not easy to reconstruct the “true” network since generally no one
knows how many missing and spurious links there are. Applying the similar
techniques presented in Section 4.2, Guimerà and Sales-Pardo [18] defined the
reliability of a network A as

L(Axy = 1|AO ),
Y Y
R(A) = Rxy = (41)
Axy =1,x<y Axy =1,x<y

where Rxy and L are defined in Eqs. 33 and 36, and the term AO is used
to emphasize that the likelihoods are calculated according to the observed
network.

26
Given AO , a straightforward idea is to find out the network A that maximizes
the reliability defined by Eq. 41. However, the computation is too costly to
be implemented. In practice, Guimerà and Sales-Pardo [18] designed a simple
greedy algorithm. Their algorithm starts by evaluating the link reliabilities
for all pairs of nodes. Then, at each time step it removes the link with the
lowest reliability and adds the link (not yet in the current network) with the
highest reliability. This change is accepted if and only if the network reliability
increases. If it is rejected, the link with the next lowest reliability and the not-
yet-existent link with the next highest reliability will be the next candidate
for swapping. The algorithm stops if it rejects five consecutive attempts to
swap links. The observed network is set as the initialization of the algorithm,
and it will consecutively become another network with higher reliability than
the initial network. Guimerà and Sales-Pardo [18] tested their algorithm by
generating hypothetical observed networks AO from the true networks AT
(the five true networks used for testing are introduced in Section 4.2). Each
observation has a fraction of the true links removed and an identical number
of random links added.

Guimerà and Sales-Pardo [18] compared the global network properties of the
observed networks and these of the reconstructed networks. According to six
metrics, clustering coefficient [51], modularity [125], assortativity [89,90], con-
gestability 9 , synchronizability 10 and spreading threshold 11 , the reconstruc-
tion generally improves the estimates, indicating the validity of the approach.
Notice that, the results from the greedy algorithm may be far different from the
real optimum subject to the maximal reliability, thus we may expect even bet-
ter estimates if one has developed a more effective and/or efficient algorithm.
Readers should be warned that in both the algorithm and the preparation of
observed networks, a latent assumption is that the number of missing links
and the number of spurious links are equal. Since in the real systems, these
two numbers may be very different (it is easy to image that in many net-
works, such as metabolic networks and friendship networks, the missing links
are much more than the spurious links), the effectiveness of the algorithm still
needs further validation.

9 The congestability refers to the maximal betweenness centrality which governs


the transportation throughput of a network [126,127].
10 The synchronizability refers to the ratio between the largest and the smallest non-

zero eigenvalues of the Laplacian matrix of a network, which quantifies the ability
of synchronization under the framework of master stability analysis [128,129].
11 Ignoring the degree-degree correlations and applying the mean-field approxima-

tion, the spreading threshold equals the ratio between the first and the second
moments of the degree distribution [130,131].

27
6.2 Evaluation of Network Evolving Mechanisms

Since the groundbreaking work by Barabási and Albert [41], the evolving
models all the time lie in the center of the complex network study. A fun-
damental difficulty is that for a given network or a target network property,
there are generally many possible mechanisms and it is not easy to judge which
one is the best. Taking the power-law degree distribution as an instance, the
well-known mechanisms include rich gets richer [41], good gets richer [132],
optimal design [133], Hamiltonian dynamics [134], merging and regeneration
[135], stability constraints [136], and so on. Hence we can not easily know
which factor(s) leads to the scale-free property of a real network, and in fact
there can be so many models competing for the final explanation of a given
real network. It is very hard to evaluate different models by comparing their
resulted networks with the target network, since there are too many metrics
for topological features [5]. As mentioned in Section 1, there are many mod-
els about the topology of the Internet, some more accurately reproduce the
degree distribution and the disassortative mixing pattern (e.g., see Ref. [19])
and some better characterize the k-core structure (e.g., see Ref. [20]). To judge
which model (i.e., which evolving mechanism) is better than the others is a
tough task.

Essentially speaking, an algorithm for link prediction makes a guess about the
factors resulting in the existence of links, which is actually what an evolv-
ing model wants to show. In other words, an evolving model in principle can
be mapped to a link prediction algorithm. Therefore, we can quantitatively
compare the accuracies of different evolving models with the help of the perfor-
mance metrics for link prediction (see Section 2). We hope this methodology
could provide a fair platform to compare different evolving models, which may
be significant for the studies of network modeling. Next, we will show a real
application about the Chinese city airline network, where each node represents
a city with airport, and two cities are connected if there exists at least one
direct airline between them [56].

It is well-known that the evolution of a city airline network is affected by not


only the topological factor, but also the geographical factor [143] and exter-
nal factors, such as population and economic level of a city [137]. As shown
by Liben-Nowell et al. [49] and Zhou et al. [47], the common neighbor index
is a good candidate to account for the topological effects. In addition, Cui
et al. [138] developed an evolving model driven completely by the common
neighborhood, which well reproduces not only the global network properties,
but also the local structural features like power-law clique-degree distributions
[139] of social and technological networks. Therefore, we simply use the com-
mon neighbor index S CN (see Eq. 2) to represent the topological ingredient.
Geographical distance is considered to be one of the realistic factors that affect

28
the existence of nodes’ interactions in networks [140]. Especially it plays very
important role in analyzing transportation networks [141,142]. It is known to
be relevant to the existence of an airline, and the number of airlines decays
with the increasing of corresponding distance [55,143]. Accordingly, we use the
inverse of geographical distance between two cities as the similarity index, say
1
sDIS
xy = , (42)
Dg (x, y)

where Dg (x, y) denotes the geographical distance between cities x and y. Based
on a null assumption that people in every city have the same frequency of air
travels, the similar index for populations is defined as

sPxyOP U = P (x) × P (y), (43)

where P (x) is the population of city x. The economic level of a city can
be roughly quantified by its gross domestic product (GDP) 12 , and thus the
corresponding similarity is defined as

sGDP
xy = G(x) × G(y), (44)

where G(x) denotes the GDP of city x. Considering that the airline business
is most tightly related to the service industry, besides the simple GDP, we
user the third sector of GDP, named the tertiary industry 13 to characterize a
city’s potential to build airlines:

sTxyI = T (x) × T (y), (45)

where T (x) is the tertiary industry of city x.

Since the size of the Chinese city airline network [56] is small (|V | = 121,
|E| = 1378), we adopt the leave-one-out method, namely at each time, we
pick only one link for test and all other links constitute the training set. This
procedure repeats for 1378 times with each link being once the testing link.
Table 4 displays the prediction accuracy (AUC values) of the five similarity
12 The GDP is a measure of a city’s overall economic output. It is the market value
of all final goods and services made within the borders of a city in a year. Here we
use the data of the year 2005.
13 The tertiary industry (also called tertiary sector of the economy, service sector

or service industry) consists of the “soft” parts of the economy, namely activities
where people offer their knowledge and time to improve productivity, performance,
potential, and sustainability. The basic characteristic of this sector is the production
of services instead of end products.

29
Table 4
The prediction accuracy of the five similarity indices for the Chinese city airline
network. The training and testing sets are divided according to the leave-one-out
method.
Similarity Indices AUC
S CN 0.898
S DIS 0.699
S P OP U 0.745
S GDP 0.855
ST I 0.881

indices. It indicates that every factor under consideration plays a role, while
the topological factor is most significant. The tertiary industry of a city, as an
external factor, also plays a very important role. Actually, a linear combination
of the common neighbor index and the tertiary industry, as S ′ = λS CN + (1 −
λ)S T I can achieve a very high AUC value, 0.928, at λ ≈ 0.2.

Although the method introduced here is straightforward, it gives insights into


the underlying evolving mechanisms which may not be seriously considered
in the early studies. The validity of this method has been demonstrated by
some recent evidences. For example, by comparing the evolving models driven
respectively by the topological factor, the geographical factor, and the above-
mentioned three external factors, Liu et al. [144] showed that only the one
considering the tertiary industry can reproduce the observed double power-
law degree distribution of the Chinese city airline network. In addition, among
many external factors, the Granger causality test 14 shows that the tertiary
industry is the most significant factor in determining the passenger volume
[146].

6.3 Classification of Partially Labeled Networks

Given a network with partial nodes being labeled, the problem is to predict
the labels of these unlabeled nodes based on the known labels and the network
structure. Two main difficulties in achieving highly accurate classification are
the sparsity of the known labeled nodes and the inconsistency of label infor-
mation. To address these two difficulties, a simple but effective method is to
add artificial connections between every pair of labeled and unlabeled nodes
according to their similarity scores [123,147], with almost the same techniques
used in similarity-based link prediction. An underlying assumption is that two
14 The Granger causality test is a technique for determining whether one time series
is useful in forecasting another. See Ref. [145] for details.

30
Node 2 Node 1

b a

? Node 5

a b
Node 3 Node 4

Fig. 5. An illustration of how to predict the fifth node’s label by adding artificial
links.

nodes are more likely to be categorized into the same class if they are more
similar to each other.

Consider an unweighted undirected network of both labeled and unlabeled


nodes: G(V, E, L), where V is the set of nodes, E is the set of links and
L = {l1 , l2 , · · · , lm } is the set of labels. For the nodes without labels, we label
them by 0. For each pair of nodes, x and y, a similarity index will assign a
score as sxy . For an unlabeled node x, the probability it belongs to li is

sxy
P
{y|y6=x,label(y)=li }
p(li |x) = P , (46)
{y|y6=x,label(y)6=0} sxy

where li ∈ L. The predicted label of node x is determined by the largest p(li |x).
If there are more than one maximum values, we randomly select one.

A simple example is shown in Fig. 5, where there are two kinds of labels (i.e. a
and b) and five nodes, four of which are labeled already. Our task is to predict
the label of the node 5. According to the common neighbors index S CN , we
obtain the similarity between node 5 and the other four labeled nodes: s15 = 1,
s25 = 1, s35 = 2 and s45 = 0. Thus, the probabilities that node 5 belongs to
class a and b are p(a|node5) = 0.75 and p(b|node5) = 0.25 respectively. If
we use RA index, the similarity scores are: s15 = 13 , s25 = 21 , s35 = 13 + 21
and s45 = 0. Therefore, the probabilities change to p(a|node5) = 0.7 and
p(b|node5) = 0.3. According to any of the two indices, the predicted label of
node 5 is a.

31
7 Outlook

In this article, we briefly summarized the progress of studies on link prediction,


emphasizing on the recent contributions by statistical physicists. Although link
prediction is not a new problem in information science, traditional methods
have not caught up the new development of network science, especially the
new perspectives and tools resulted from the studies of complex networks. In
our opinion, the studies of link prediction and complex networks will benefit
each other, because the in-depth understanding of network structure can be
used to design advanced link prediction algorithms (e.g., making use of the
information about hierarchical organization [75] and modular structure [18]
of real networks to better predict missing links) and the performance of a
link prediction algorithm could give evidences about structural features [47]
as well as the algorithms themselves can be used to improve the estimates of
real networks’ properties [18] and to evaluate the evolving network models. In
a word, to statistical physicists, the study of link prediction is just unfolding.

Up to now, the studies of link prediction overwhelmingly emphasize on the


unweighted undirected networks. For directed networks, even the ternary rela-
tions are complicated, and thus the simple common-neighbor-based similarity
indices have to be modified to take into account the local motif structure [148].
Otherwise even we can predict the existence of an arc between two nodes, we
can not determine its direction. In addition, the path-dependent similarity in-
dices should also be extended to take into account the link direction [149]. The
fundamental task of link prediction in weighted networks, namely to predict
the existence of links with the help of not only the observed links but also
their weights, has already been considered by Murata et al. [150] and Lü et al.
[151]. The former [150] suggested that the links with higher weights are more
important in predicting missing links, while the latter [151] indicated a com-
pletely opposite conclusion: the weak links play a more significant role. How
to properly exploit the information of weights to improve the prediction accu-
racy is still an unsolved problem. A harder problem is to predict the weights
of links, which is relevant to the traffic prediction for urban transportation
and air transportation systems [152]. We are expecting some variants of link
prediction algorithms can also contribute to this domain.

A big challenge is the link prediction in multi-dimensional networks, where


links could have different meanings. For example, a social network may consist
of positive and negative links, respectively pointing to friends and foes [153], or
trusted and distrusted peers [154]. Leskovec et al. [155] proposed a method to
predict the signs of links (positive or negative), yet the prediction of both the
existence of a link and its sign has not been well studied. Recent development
of social balance theory may provide useful hints [156,157,158].

32
A more complicated kind of multi-dimensional networks is the ones consisted
of several classes of nodes. For example, an online resource-sharing system,
such as Del.icio.us 15 , can be represented by a network that consists of three
kinds of nodes: users, URLs and tags. Different from the tripartite networks,
nodes in the same class can also be connected, like an arc can be added from
a user to her/his follower who has imposed her/his collections. Ignoring the
connections within a class of nodes, the prediction of links between users and
objects has already been investigated [159]. However, there is still nothing
reported about the link prediction algorithms taking into account both the
links within a class and the links between classes.

Inspired by the success in recommender systems, we think the prediction accu-


racy can be considerably improved by hybrid algorithms [160]. Given a specific
target network, we can implement many individual prediction algorithms, and
then try to select and organize them in a property way. This so-called ensem-
ble learning method can obtain better prediction performance than could be
obtained from any of the individual algorithms [161]. Although the scientific
significance of such a method is not clear to us, building ensemble systems for
link prediction could be of huge practical value.

The algorithms’ performance can be effectively enhanced by considering some


external information, like the attributes of nodes [31]. In common sense, two
people share more tastes and interests (and thus may of higher probability to
be connected in a social network) if they have more common features, such
as age, sex, job, and so on. The attribute information can be used to pre-
dict links without considering the network structures. Thus, when the existed
links themselves are unreliable, attribute-based methods are preferable, which
can to some extent solve the so-called cold start problem – a big challenge
of link prediction [162]. Besides, community structures can also help improv-
ing prediction accuracy [163]. In social networks, since one person may play
different roles in different communities, the prediction in one domain can be
inspired by the information in others [164]. For example, when we predict the
collaborations between authors, we can consider their affiliations to improve
the accuracy.

Most of current approaches take into account a single snapshot of a network


to predict the missing or future links. Extensive experiments show that these
methods well uncover whether a link exists. However, this static graph rep-
resentation is difficult in predicting the repeated link occurrences. For exam-
ple, it is impossible to predict whether and when two authors will collaborate
again in co-authorship network. Addressing this problem, Huang and Lin [165]
proposed a time-series link prediction approach considering the temporal evo-

15Del.icio.us is the largest social bookmarking system where a user is allowed to


collect URLs as well as visit and impose other users’ collections.

33
lutions of link occurrences, which is more appropriate for dealing with the link
prediction problem in evolving networks, such as online social networks. An-
other way to involve time information is inspired by the fact that older events
are less likely to be relevant to future links than recent ones. For example,
author’s interests may change over time and thus old publications might be
less relevant to his currents research area. Tylenda et al. [166] developed a
graph-based link prediction method that incorporate the temporal informa-
tion contained in evolving networks. They found that the performance can be
improved by either time-based weighting of edges (i.e., giving the older events
smaller weights or even neglecting them) or weighting of edges according to
the connecting strength. However, to design effective algorithms and eventu-
ally settle this problem, we need in-depth and comprehensive understanding
of temporal effects on human’s interests, attentions and so on, which asks for
extensive empirical analyses.

Acknowledgements

We acknowledge Ci-Hang Jin, Hong-Kun Liu, Wei-Ping Liu, Ming-Sheng Shang,


Qian-Ming Zhang and Yi-Cheng Zhang for their contributions to the collab-
orated works on link prediction, as well as Renaud Lambiotte, Medo Matuš,
Roger Guimerà, Marco Saerens, Marta Sales-Pardo, Chi Ho Yeung and Zi-Ke
Zhang for their valuable discussions, comments and suggestions. This work
is partially supported by the National Natural Science Foundation of China
under Grant Nos. 11075031 and 10635040, and the Swiss National Science
Foundation under Grant No. 200020-121848.

References

[1] R. Albert, A.-L. Barabási, Statistical mechanics of complex networks, Rev. Mod.
Phys. 74 (2002) 47.

[2] S. N. Dorogovtsev, J. F. F. Mendes, Evolution of networks, Adv. Phys. 51 (2002)


1079.

[3] M. E. J. Newman, The Structure and Function of Complex Networks, SIAM


Rev. 45 (2003) 167.

[4] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, D.-U. Huang, Complex


networks: Structure and dynamics, Phys. Rep. 424 (2006) 175.

[5] L. da F. Costa, F. A. Rodrigues, G. Travieso, P. R. U. Boas, Characterization of


complex networks: A survey of measurements, Adv. Phys. 56 (2007) 167.

34
[6] G. Salton, M. J. McGill, Introduction to Modern Information Retrieval,
MuGraw-Hill, Auckland, 1983.

[7] G. Salton, Automatic Text Processing: The Transformation, Analysis, and


Retrieval of Information by Computer, Addison-Wesley, Boston, 1989.

[8] C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval,


Cambridge University Press, New York, 2008.

[9] L. Getoor, C. P. Diehl, Link Mining: A Survey, ACM SIGKDD Explorations


Newsletter 7 (2005) 3.

[10] H. Yu et al., High-quality binary protein interaction map of the yeast


interactome network, Science 322 (2008) 104.

[11] M. P. H. Stumpf, T. Thorne, E. de Silva, R. Stewart, H. J. An, M. Lappe,


C. Wiuf, Estimating the size of the human interactome, Proc. Natl. Acad. Sci.
U.S.A. 105 (2008) 6959.

[12] L. A. N. Amaral, A truer measure of our ignorance, Proc. Natl. Acad. Sci.
U.S.A. 105 (2008) 6795.

[13] L. Schafer, J. W. Graham, Missing data: Our view of the state of the art,
Psychol. Methods 7 (2002) 147.

[14] G. Kossinets, Effects of missing data in social networks, Social Networks 28


(2006) 247.

[15] J. W. Neal, “Kracking” the Missing Data Problem: Applying Krackhardt’s


Cognitive Social Structures to School-Based Social Networks, Sociol. Educ. 81
(2008) 140.

[16] C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Field, P. Bork,


Comparative assessment of large-scale data sets of protein-protein interactions,
Nature 417 (2002) 399.

[17] C. T. Butts, Network inference, error, and information (in)accuracy: A Bayesian


approach, Social Networks 25 (2003) 103.

[18] R. Guimerà, M. Sales-Pardo, Missing and spurious interactions and the


reconstruction of complex networks, Proc. Natl. Acad. Sci. U.S.A. 106 (2009)
22073.

[19] S. Zhou, R. J. Mondragón, Accurately modeling the internet topology, Phys.


Rev. E 70 (2004) 066108.

[20] S. Carmi, S. Havlin, S. Kirkpatrick, Y. Shavitt, E. Shir, A model of Internet


topology using k-shell decomposition, Proc. Natl. Acad. Sci. U.S.A. 104 (2007)
11150.

[21] M. Sales-Pardo, R. Guimerà, L. A. N. Amaral, Extracting the hierarchical


organization of complex systems, Proc. Natl. Acad. Sci. U.S.A. 104 (2007) 15224.

35
[22] M. Girvan, M. E. J. Newman, Community structure in social and biological
networks, Proc. Natl. Acad. Sci. U.S.A. 99 (2002) 7821.

[23] J. Shawe-Taylor, N. Cristianini, Kernels Methods for Pattern Analysis,


Cambridge University Press, Cambridge, UK, 2004.

[24] L. Breiman, P. Spector, Submodel selection and evaluation in regression: the


x-random case, Int. Stat. Rev. 60 (1992) 291.

[25] R. Kohavi, A Study of Cross-Valisation and Bootstrap for Accuracy Estimation


and Model Selection, In Proceedings of the International Joint Conference on
Artificial Intelligence, Morgan Kaufmann Publisher, Quebec, Canada, 1995, p.
1137-1143.

[26] J. A. Hanely, B. J. McNeil, The meaning and use of the area under a receiver
operating characteristic (ROC) curve, Radiology 143 (1982) 29.

[27] S. Geisser, Predictive inference: An introduction, Chapman and Hall, New York,
1993.

[28] J. L. Herlocker, J. A. Konstann, K. Terveen, J. T. Riedl, Evaluating


collaborative filtering recommender systems, ACM Trans. Inf. Syst. 22 (2004)
5.

[29] X. Su, T. M. Khoshgoftaar, A Survey of Collaborative Filtering Techniques,


Advances in Artificial Intelligence (2009) 421425.

[30] Z. Huang, X. Li, H. Chen, Link prediction approach to collaborative filtering,


In Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries,
ACM Press, New York, 2005.

[31] D. Lin, An information-theoretic definition of similarity, in Proceedings of


the 15th International Conference on Machine Learning, Morgan Kaufman
Publishers, San Francisco, 1998.

[32] E. A. Leicht, P. Holme, M. E. J. Newman, Vertex similarity in networks, Phys.


Rev. E 73 (2006) 026120.

[33] D. Sun, T. Zhou, J.-G. Liu, R.-R. Liu, C.-X. Jia, B.-H. Wang, Information
filtering based on transferring similarity, Phys. Rev. E 80 (2009) 017101.

[34] D. R. White, K. P. Reitz, Graph and semigroup homomorphisms on networks


of relations, Social Networks 5 (1983) 193.

[35] P. Holme, M. Huss, Role-similarity based functional prediction in networked


systems: application to the yeast proteome, J. R. Soc. Interface 2 (2005) 327.

[36] M. E. J. Newman, Clustering and preferential attachment in growing networks,


Phys. Rev. E 64 (2001) 025102.

[37] P. Jaccard, Étude comparative de la distribution florale dans une portion des
Alpes et des Jura, Bulletin de la Societe Vaudoise des Science Naturelles 37
(1901) 547.

36
[38] T. Sørensen, A method of establishing groups of equal amplitude in plant
sociology based on similarity of species content and its application to analyses
of the vegetation on Danish commons, Biol. Skr. 5 (1948) 1.

[39] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, A.-L. Barabási,


Hierarchical organization of modularity in metabolic networks, Science 297
(2002) 1551.

[40] M. Molloy, B. Reed, A critical point for random graphs with a given degree
sequence, Random Structure and Algorithms 6 (1995) 161.

[41] A.-L. Barabási, R. Albert, Emergence of Scaling in Random Networks, Science


286 (1999) 509.

[42] Y.-B. Xie, T. Zhou, B.-H. Wang, Scale-free networks without growth, Physica
A 387 (2008) 1683.

[43] P. Holme, B. J. Kim, C. N. Yoon, S. K. Han, Attack vulnerability of complex


networks, Phys. Rev. E 65 (2002) 056109.

[44] C.-Y. Yin, W.-X. Wang, G.-R. Chen, B.-H. Wang, Decoupling process for better
synchronizability on scale-free networks, Phys. Rev. E 74 (2006) 047102.

[45] G.-Q. Zhang, D. Wang, G.-J. Li, Enhancing the transmission efficiency by edge
deletion in scale-free networks, Phys. Rev. E 76 (2007) 017101.

[46] L. A. Adamic, E. Adar, Friends and neighbors on the Web, Social Networks 25
(2003) 211.

[47] T. Zhou, L. Lü, Y.-C. Zhang, Predicting missing links via local information,
Eur. Phys. J. B 71 (2009) 623.

[48] Q. Ou, Y.-D. Jin, T. Zhou, B.-H. Wang, B.-Q. Yin, Power-law strength-degree
correlation from resource-allocation dynamics on weighted networks, Phys. Rev.
E 75 (2007) 021102.

[49] D. Liben-Nowell, J. Kleinberg, The link-prediction problem for social networks,


J. Am. Soc. Inform. Sci. Technol. 58 (2007) 1019.

[50] M. E. J. Newman, Finding community structure in networks using the


eigenvectors of matrices, Phys. Rev. E 74 (2006) 036104.

[51] D. J. Watts and S. H. Strogatz, Collective dynamics of ’small-world’ networks,


Nature 393 (1998) 440 .

[52] R. Ackland, Mapping the US political blogosphere: Are conservative bloggers


more prominent, Presentation to BlogTalk Downunder, Sydney, 2005, available
at http://incsub.org/blogtalk/images/robertackland.pdf.

[53] N. Spring, R. Mahajan, D. Wetherall and T. Anderson, IEEE/ACM Trans.


Networking 12 (2004) 2.

[54] V. Batageli, A. Mrvar, Pajek Datasets, available at


http://vlado.fmf.uni-lj.si/pub/networks/data/default.htm.

37
[55] M. T. Gastner, M. E. J. Newman, The spatial structure of networks, Eur. Phys.
J. B 49 (2006) 247.
[56] H.-K. Liu, T. Zhou, Empirical study of Chinese city airline network, Acta
Physica Sinica 56 (2007) 106.
[57] S. Zhou, R. J. Mondragón, The rich-club phenomenon in the Internet topology,
IEEE Commun. Lett. 8 (2004) 180.
[58] V. Colizza, A. Flammini, M. A. Serrano, A. Vespignani, Detecting rich-club
ordering in complex networks, Nat. Phys. 2 (2006) 110.
[59] Y.-X. Zhu, L. Lü, T. Zhou, Uncovering missing links with cold ends
(unpublished).
[60] Y. Pan, D.-H. Li, J.-G. Liu, J.-Z. Liang, Detecting community structure in
complex networks via node similarity, Physica A 389 (2010) 2849.
[61] Y.-L. Wang, T. Zhou, J.-J. Shi, J. Wang, D.-R. He, Empirical analysis of
dependence between stations in Chinese railway network, Physica A 388 (2009)
2949.
[62] T. Zhou, J. Ren, M. Medo, Y.-C. Zhang, Bipartite network projection and
personal recommendation, Phys. Rev.E 76 (2007) 046115.
[63] L. Katz, A new status index derived from sociometric analysis, Psychmetrika
18 (1953) 39.
[64] D. J. Klein, M. Randic, Resistance distance, J. Math. Chem. 12 (1993) 81.
[65] F. Fouss, A. Pirotte, J.-M. Renders, M. Saerens, Random-walk computation
of similarities between nodes of a graph with application to collaborative
recommendation, IEEE Trans. Knowl. Data. Eng. 19 (2007) 355.
[66] S. Brin, L. Page, The anatomy of a large-scale hypertextual Web search engine,
Comput. Netw. ISDN Syst. 30 (1998) 107.
[67] H. Tong, C. Faloutsos, J.-Y. Pan, Fast random walk with restart and its
applications, In Proceedings of the 6th International Conference on Data Mining,
IEEE Press, Washington, DC, USA, 2006, p. 613-622 .
[68] M.-S. Shang, L. Lü, T. Zhou, Y.-C. Zhang, Relevance is more significant than
correlation: Information filtering on sparse data, EPL 88 (2009) 68008.
[69] G. Jeh, J. Widom, SimRank: A measure of structural-context similarity, In
Proceedings of the ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, ACM Press, New York, 2002, p. 271-279.
[70] P. Chebotarev, E. V. Shamis, The matrix-forest theorem and measuring
relations in small social groups, Automation and Remote Control 58 (1997) 1505.
[71] F. Fouss, L. Yen, A. Pirotte, M. Saerens, An experimental investigation of
graph kernels on a collaborative recommendation task, In Proceedings of the 6th
International Conference on Data Mining, IEEE Press, Washington, DC, USA,
2006, p. 863-868.

38
[72] L. Lü, C.-H. Jin, T. Zhou, Similarity index based on local paths for link
prediction of complex networks, Phys. Rev. E 80 (2009) 046122.
[73] W. Liu, L. Lü, Link prediction based on local random walk, EPL 89 (2010)
58007.

[74] A. Mantrach, N. van zeebroeck, P. Francq, M. Shimbo, H. Bersini, M. Saerens,


Semi-supervised classification and betweenness computation on large, sparse,
directed, networks (unpublished).

[75] A. Clauset, C. Moore, M. E. J. Newman, Hierarchical structure and the


prediction of missing links in networks, Nature 453 (2008) 98.
[76] C. Zhou, L. Zemanová, G. Zamora, C. C. Hilgetag, J. Kurths, Hierarchical
Organization Unveiled by Functional Connectivity in Complex Brain Networks,
Phys. Rev. Lett. 97 (2006) 238103.
[77] S. Redner, Teasing out the missing links, Nature 453 (2008) 47.

[78] G. Casella, R. L. Berger, Statistical Inference, Duxbury, Belmont, 2001.


[79] M. E. J. Newman, G. T. Barkema, Monte Carlo Methods in Statistical Physics,
Clarendon, Oxford, 1999.
[80] V. Krebs, Mapping networks of terrorist cells, Connections 24 (2002) 43.

[81] H. A. Dawah, B. A. Hawkins, M. F. Claridge, Structure of the parasitoid


communities of grass-feeding chalcid wasps, J. Anim. Ecol. 64 (1995) 708.

[82] M. Huss, P. Holme, Currency and commodity metabolites: Their identification


and relation to the modularity of metabolic networks, IET Syst. Biol. 1 (2007)
280.
[83] E. Mossel, E. Vigoda, Phylogenetic MCMC Are Misleading on Mixtures of
Trees, Science 309 (2005) 2207.
[84] H. C. White, S. A. Boorman, R. L. Breiger, Social structure from multiple
networks I: Blockmodels of roles and positions, Am. J. Sociol. 81 (1976) 730.
[85] P. W. Holland, K. B. Laskey, S. Leinhardt, Stochastic blockmodels: First steps,
Social Networks 5 (1983) 109.

[86] P. Dorelan, V. Batagelj, A. Ferligoj, Generalized Blockmodeling, Cambridge


University Press, Cambridge, UK, 2005.
[87] E. M. Airoldi, D. M. Blei, S. E. Fienberg, X. P. Xing, Mixed-membership
stochastic blockmodels, J. Mach. Learn. Res. 9 (2008) 1981.
[88] R. Guimerà, M. Sales-Pardo, L. A. N. Amaral, Classes of complex networks
defined by role-to-role connectivity profiles, Nat. Phys. 3 (2007) 63.

[89] M. E. J. Newman, Assortative mixing in networks, Phys. Rev. Lett. 89 (2002)


208701.

[90] M. E. J. Newman, Mixing patterns in networks, Phys. Rev. E 67 (2003) 026126.

39
[91] R. Pastor-Satorras, A. Vázquez, A. Vespignani, Dynamical and Correlation
Properties of the Internet, Phys. Rev. Lett. 87 (2001) 258701.

[92] A. Vázquez, R. Pastor-Satorras, A. Vespignani, Large-scale topological and


dynamical properties of the Internet, Phys. Rev. E 65 (2002) 066130.

[93] T. Bayes, An essay towards solving a Problem in the Doctrine of Chances,


Philos. Trans. R. Soc. London 53 (1763) 370.

[94] M. Metropolis, A. W. Rosenbluth, A. H. Teller, E. Teller, Equations of State


Calculation by Fast Computing Machines, J. Chem. Phys. 21 (1953) 1087.

[95] W. Zachary, An information flow model for conflict and fission in small groups,
J. Anthropol. Res. 33 (1977) 452.

[96] D. Lusseau, et al., The bottlenose dolphin community of Doubtful Sound


features a large proportion of long-lasting associations, Behav. Ecol. Sociobiol.
54 (2003) 396.

[97] R. Guimerà, S. Mossa, A. Turtschi, L. A. N. Amaral, The worldwide air


transportation network: Anomalous centrality, community structure, and cities’
global roles, Proc. Natl. Acad. Sci. U.S.A. 102 (2005) 7794.

[98] J. G. White, E. Southgate, J. N. Thomson, S. Brenner, The structure of the


nervous system of the nematode C. elegans, Phil. Trans. R. Soc. B 314 (1986) 1.

[99] J. L. Reed, T. D. Vo, C. H. Schilling, B. Ø. Palsson, An expanded genome-scale


model of Escherichia coli K-12 (iJR904 GSM/GPR), Genome Biol. 4 (2003) R54.

[100] N. Friedman, L. Getoor, D. Koller, A. Pfeffer, Learning Probabilistic relational


models, In Proceedings of the 16th International Joint Conference on Artificial
Intelligence, Stockholm, Sweden, 1999, p. 1300.

[101] D. Heckerman, C. Meek, D. Koller, Probabilistic Entity-Relationship Models,


PRMs, and Plate Models, In Proceedings of the 21st International Conference
on Machine Learning, Banff, Canada, 2004, p. 55.

[102] K. Yu, W. Chu, S. Yu, V. Tresp, Z. Xu, Stochastic Relational Models for
Discriminative Link Prediction, In Proceedings of Neural Information Precessing
Systems, MIT Press, Cambridge MA, 2006, p. 1553.

[103] J. Neville, Statistical models and analysis techniques for learning in relational
data, PhD Thesis, 2006.

[104] D. Heckerman, C. Meek, D. Koller, Probabilistic models for relational data,


Tech. Rep. MSR-TR-2004-30, Microsoft Research, 2004.

[105] D. Heckerman, D. Geiger, D. Chickering, Learning Bayeaian Networks: the


combination of knowledge and statistical data, Machine Learning 20 (1995) 197.

[106] B. Taskar, P. Abbeel, D. Koller, Discriminative probabilistic models in


relational data, In Preceedings of the 18th Conference on Uncertainty in Artificial
Intelligence (UAI02), Edmonton, Canada, 2002, p.485.

40
[107] B. Taskar, M.-F. Wong, P. Abbeel, D. Koller, Link prediction in relational data,
In Proceedings of Neural Information Precessing Systems, MIT Press, Cambridge
MA, 2004, p. 659.

[108] W. Buntine, Operations for learning with graphical models, J. Artif. Intell.
Res. 2 (1994) 159.

[109] D. Spiegelhalter, Bayesian graphical modeling: A case-study in monitoring


health outcomes, Applied Statistics 47 (1998) 115.

[110] D.
Heckerman, D. Chickering, C. Meek, R. Rounthwaite, C. Kadie, Dependency
networks for inference, collaborative filtering, and data visualization, J. Machine
Learning Res. 1 (2000) 49.

[111] J. Neville, D. Jensen, Relational dependency networks, J. Machine Learning


Res. 8 (2007) 653.

[112] G. Casella, E. I. George, Explaining the Gibbs sampler, The American


Statistician 46(3) (1992) 167.

[113] Z. Xu, V. Tresp, K. Yu, S. Yu, H.-P. Kriegel, Dirichlet enhanced relational
learning, In Proceedings of the 22nd internatonal conference on machine learning,
Bonn, Germany, 2005, p. 1004.

[114] K. Yu, W. Chu, Gaussian process models for link analysis and transfer learning,
In Proceedings of Neural Information Precessing Systems, MIT Press, Cambridge
MA, 2007, p. 1657.

[115] W. Chu, V. Sindhwani, Z. Ghahramani, S. S. Keerthi, Relational learning with


gaussian processes, In Proceedings of Neural Information Precessing Systems,
MIT Press, Cambridge MA, 2006, p. 289.

[116] J. O’Madadhain, J. Hutchins, P. Smyth, Prediction and ranking algorithms for


event-based network data, In Proceedings of SIGKDD 2005, ACM Press, New
York, 2005, p. 23.

[117] M.-S. Shang, L. Lü, Y.-C. Zhang, T. Zhou, Empirical analysis of web-based
user-object bipartite networks, EPL 90 (2010) 48006.

[118] J. Kunegis, E. W. De Luca, S. Albayrak, The link predection problem in


bipartite networks, arXiv: 1006.5367.

[119] T. Zhou, Z. Kuscsik, J.-G. Liu, M. Medo, J. R. Wakeling, Y.-C. Zhang, Solving
the apparent diversity-accuracy dilemma of recommender systems, Proc. Natl.
Acad. Sci. U.S.A. 107 (2010) 4511.

[120] W. Zeng, M.-S. Shang, Q.-M. Zhang, L. Lü, T. Zhou, Can dissimilar users
contribute to accuracy and diversity of personalized recommendation, Int. J.
Mod. Phys. C (to be published).

[121] J. Schafer, J. Konstan, J. Riedl, E-commerce recommendation applications,


Data Min. Knowl. Discovery 5 (2001) 115.

41
[122] Z. Huang, D. D. Zeng, A link prediction approach to anomalous email
detection, In Proceedings of 2006 IEEE International Conference on Systems,
Man, and Cybernetics, Taipei, Taiwan, 2006, p. 1131.

[123] B. Gallagher, H. Tong, T. Eliassi-Rad, C. Faloutsos, Using ghost edges for


classification in sparsely labeled networks, In Proceedings of the ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, ACM Press,
New York, 2008, p. 256.

[124] K. Dasgupta, R. Singh, B. Viswanathan, D. Chakraborty, S. Mukherjea, A.


A. Nanavati, A. Joshi, Social ties and their relevance to churn in mobile telecom
networks, In Proceedings of the 11th international conference on Extending
database technology: Advances in database technology, ACM Press, New York,
2008, p. 668.

[125] M. E. J. Newman, M. Girvan, Finding and evaluating community structure in


networks, Phys. Rev. E 69 (2004) 026113.

[126] R. Guimerà, A. Dı́az-Guilera, F. Vega-Redondo, A. Cabrales, A. Arenas,


Optimal network topologies for local search with congestion, Phys. Rev. Lett. 89
(2002) 248701.

[127] G. Yan, T. Zhou, B. Hu. Z.-Q. Fu, B.-H. Wang, Efficient routing on complex
networks, Phys. Rev. E 73 (2006) 046108.

[128] M. Barahona, L. M. Pecora, Synchronization in Small-World Systems, Phys.


Rev. Lett. 89 (2002) 054101.

[129] A. Arenas, A. Dı́za-Guilera, J. Kurths, Y. Moreno, C. Zhou, Phys. Rep. 469


(2008) 93.

[130] R. Pastor-Satorras, A. Vespignani, Epidemics and immunization in scale-free


networks, in S. Bornholdt, H. G. Schuster (eds.) Handbook of Graphs and
Networks, Wiley-VCH, Berlin, 2003.

[131] T. Zhou, Z.-Q. Fu, B.-H. Wang, Epidemic dynamics on complex networks,
Prog. Nat. Sci. 16 (2006) 452.

[132] G. Caldarelli, A. Capocci, P. De Los Rios, M. A. Muñoz, Scale-free networks


from varying vertex intrinsic fitness, Phys. Rev. Lett. 89 (2002) 258702.

[133] S. Valverde, R. F. Cancho, R. V. Solé, Scale-free networks from optimal design,


Europhys. Lett. 60 (2002) 512.

[134] M. Baiesi, S. S. Manna, Scale-free networks from a Hamiltonian dynamics,


Phys. Rev. E 68 (2003) 047103.

[135] B. J. Kim, A. Trusina, P. Minnhagen, K. Sneppen, Self organized scale-free


networks from merging and regeneration, Eur. Phys. J. B 43 (2005) 369.

[136] J. I. Perotti, O. V. Billoni, F. A. Tamarit, D. R. Chialvo, S. A. Cannas,


Emergent self-organized complex network topology out of stability constraints,
Phys. Rev. Lett. 103 (2009) 108701.

42
[137] H.-K. Liu, T. Zhou, Review on the studies of airline networks, Prog. Nat. Sci.
18 (2008) 601.
[138] A.-X. Cui, Y. Fu, M.-S. Shang, D.-B. Chen, T. Zhou, Emergence of
local structures in complex network:common neighborhood drives the network
evolution, Acta Physica Sinica (to be published).
[139] W.-K. Xiao, J. Ren, F. Qi, Z.-W. Song, M.-X. Zhu, H.-F. Yang, H.-Y. Jin,
B.-H. Wang, T. Zhou, Emprical study on clique-degree distribution of networks,
Phys. Rev. E 76 (2007) 037102.
[140] R. Lambiotte, V. D. Blondel, C. de Kerchove, E. Huens, C. Prieur, Z. Smoreda,
P. Van Dooren, Geographical dispersal of mobile communication networks,
Physica A 387 (2008) 5317.
[141] W.-S. Jung, F. Wang, H.E. Stanley, Gravity model in the korean highway,
EPL 81 (2008) 48005.
[142] P. Kaluza, A. Koelzsch, M. T. Gastner, B. Blasius, The complex network of
global cargo ship movements, J. R. Soc. Interface 7 (2010) 1093.
[143] G. Bianconi, P. Pin, M. Marsili, Assessing the relevance of node features for
network structure, Proc. Natl. Acad. Sci. U.S.A. 106 (2009) 11433.
[144] H.-K. Liu, X.-L. Zhang, L. Cao, B.-H. Wang, T. Zhou, Analysis on the
connecting mechanism of Chinese city airline network, Sci. China Ser. G 39
(2009) 935.
[145] C. W. J. Granger, Investigating causal relations by econometric models and
cross-spectral methods, Econometrica 37 (1969) 424.
[146] H.-K. Liu, X.-L. Zhang, T. Zhou, Structure and External Factors of Chinese
City Airline Network, Physics Procedia 3 (2010) 1781.
[147] Q.-M Zhang, M.-S. Shang, L. Lü, Similarity-based classification in partially
labeled networks, Int. J. Mod. Phys. C 21 (2010) 813.
[148] U. Alon, Network motifs: theory and experimental approaches, Nat. Rev. Gene.
8 (2007) 450.
[149] A. Mantrach, L. Yen, J. Callut, K. Françoisse, M. Shimbo, M. Saerens, The
Sum-over-Paths Covariance Kernel: A Novel Covariance Measure between Nodes
of a Directed Graph, IEEE Trans. Pattern Analysis and Machine Intelligence 32
(2010) 1112.
[150] T. Murata, S. Moriyasu, Link prediction of social networks based on weighted
proximity measure, In Proceedings of the IEEE/WIC/ACM International
Conference on Web Intelligence, ACM Press, New York, 2007.
[151] L Lü, T. Zhou, Link prediction in weighted networks: The role of weak ties,
EPL 89 (2010) 18001.
[152] H. Yin, S. C. Wong, J. Xu, C. K. Wong, Urban traffic flow prediction using a
fuzzy-neural approach, Transportation Res. C 10 (2002) 85.

43
[153] J. Kunegis, A. Lommatzsch, C. Bauckhage, The Slashdot Zoo: Mining a social
network with negative edges, In Proceedings of WWW’2009, ACM Press, New
York, 2009.

[154] R. V. Guha, R. Kumar, P. Raghavan, A. Tomkins, Propagation of trust and


distrust, In Proceedings of WWW’2004, ACM Press, New York, 2004.

[155] J. Leskovec, D. Huttenlocher, J. Kleinberg, Predicting Positive and Negative


Links in Online Social Networks, In Proceedings of WWW’2010, ACM Press,
New York, 2010.

[156] V. A. Traag, J. Bruggeman, Community detection in networks with positive


and negative links, Phys. Rev. E 80 (2009) 036115.

[157] S. A. Marvel, S. H. Strogatz, J. M. Kleinberg, Energy Landscape of Social


Balance, Phys. Rev. Lett. 103 (2009) 198701.

[158] M. Szell, R. Lambiotte, S. Thurner, Multirelational organization of large-scale


social networks in an online world, Proc. Natl. Acad. Sci. U.S.A. 107 (2010)
13636.

[159] Z.-K. Zhang, T. Zhou, Y.-C. Zhang, Personalized recommendation via


integrated diffusion on user-item-tag tripartite graphs, Physica A 389 (2010)
179.

[160] R. Burke, Hybrid recommender systems: Survey and experiments, User Model.
User-Adap. Interact. 12 (2002) 331.

[161] R. Polikar, Ensemble based systems in decision making, IEEE Circuits and
Systems Magazine 6(3) (2006) 21.

[162] V. Leroy, B. B. Cambazoglu, F. Bonchi, Cold start link prediction, In


Proceedings of the 16th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, ACM Press, New York, 2010, p. 393.

[163] E. Zheleva, L. Getoor, J. Golbeck, Ugur Kuter, Using friendship ties and family
circles for link prediction, In Proceedings of the 2nd Workshop on Social Network
Mining and Analysis, ACM Press, New York, 2008.

[164] B. Cao, N. N. Liu, Q. Yang, Transfer learning for collective link prediction
in multiple heterogenous domains, In Proceedings of the 27th International
Conference on Machine Learning, Haifa, Israel, 2010.

[165] Z. Huang, D. K. J. Lin, The time-series link prediction problem with


applications in communication surveillance, INFORMS J. Comput. 21 (2009)
286.

[166] T. Tylenda, R. Angelova, S. Bedathur, Towards time-aware link prediction in


evolving social networks, In Proceedings of the 3rd Workshop on Social Network
Mining and Analysis, ACM Press, New York, 2009.

44

You might also like