2010 Lu LinkPredictionComplexNetworks
Survey
Abstract
Link prediction in complex networks has attracted increasing attention from both
physical and computer science communities. The algorithms can be used to ex-
tract missing information, identify spurious interactions, evaluate network evolving
mechanisms, and so on. This article summarizes recent progress on link prediction
algorithms, emphasizing the contributions from physical perspectives and
approaches, such as the random-walk-based methods and the maximum likelihood
methods. We also introduce three typical applications: reconstruction of networks,
evaluation of network evolving mechanism and classification of partially labelled
networks. Finally, we outline future challenges of
link prediction algorithms.
1 Introduction
Besides helping in analyzing networks with missing data, the link prediction
algorithms can be used to predict the links that may appear in the future of
evolving networks. For example, in online social networks, very likely but not
yet existent links can be recommended as promising friendships, which can
help users find new friends and thus enhance their loyalty to the
websites. Similar techniques can be applied to evaluate the evolving mechanism for
given networks. For example, many evolving models for the Internet topology
have been proposed: some more accurately reproduce the degree distribution
and the disassortative mixing pattern [19], some better characterize the k-core
structure [20], and so on. Since there are too many topological features and
it is very hard to put weights on them, it is not easy to judge which model
(i.e., which evolving mechanism) is better than the others. Notice that each
model in principle corresponds to a link prediction algorithm, and thus we can
use the metrics on prediction accuracy to evaluate the performance of different
models.
els have been proposed by computer science community. However, their works
have not caught up with the current progress of the study of complex networks;
in particular, they lack serious consideration of the structural characteristics of
networks, like the hierarchical organization [21] and community structure [22],
which may indeed provide useful information and insights for link prediction.
Recently, some physical approaches, such as random walk processes and max-
imum likelihood methods, have found applications in link prediction. This
article gives a detailed discussion of these new developments.
This article is organized as follows. In the next section, we will present the
link prediction problem and the standard metrics for performance evaluation.
Our tour of link prediction algorithms starts with the mainstream class of
algorithms, the so-called similarity-based algorithms 1 , which are further clas-
sified into three categories according to the information used by the similarity
indices: local indices, global indices and quasi-local indices. In Section 4 and
Section 5, we introduce the maximum likelihood algorithms and probabilistic
models for link prediction. The applications of link prediction algorithms are
presented in Section 6, including the reconstruction of networks, the evalua-
tion of network evolving mechanism and the classification of partially labeled
networks. Finally, we outline some future challenges of link prediction algo-
rithms.
Consider an undirected network G(V, E), where V is the set of nodes and E is
the set of links. Multiple links and self-connections are not allowed. Denote by
U the universal set containing all $|V|(|V|-1)/2$ possible links, where $|V|$ denotes
the number of elements in set V. Then, the set of nonexistent links is U − E.
We assume there are some missing links (or the links that will appear in the
future) in the set U − E, and the task of link prediction is to find out these
links.
Generally, we do not know which links are the missing or future links; otherwise
we would not need to do prediction. Therefore, to test an algorithm's accuracy,
the set of observed links, E, is randomly divided into two parts: the training set, $E^T$,
is treated as known information, while the probe set (i.e., validation subset),
$E^P$, is used for testing, and no information in this set is allowed to be used for
prediction. Clearly, $E^T \cup E^P = E$ and $E^T \cap E^P = \emptyset$. The advantage of this
random sub-sampling validation is that the proportion of the training split
is not dependent on the number of iterations. But with this method some
1 The similarity indices between nodes are also called kernels on graphs in some
literature of computer science community [23]
links may never be selected in the probe set, whereas others may be selected
more than once, resulting in statistical bias. This limitation can be overcome
by using K-fold cross-validation, in which the observed links are randomly
partitioned into K subsets. Each time, one subset is selected as the probe set and
the remaining K − 1 constitute the training set. The cross-validation process is then
repeated K times, with each of the K subsets used exactly once as the probe
set. With this method, all links are used for both training and validation,
and each link is used for prediction exactly once. Clearly, a larger K will lead
to smaller statistical bias yet require more computation. Some experimental
evidence suggests that 10-fold cross-validation is a very good tradeoff
between cost and performance [24,25]. An extreme case, called the leave-one-out
method (i.e., $|V|$-fold cross-validation), will be applied in Section 6.2.
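The two validation schemes described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `random_split` and `k_fold_splits`, the `seed` argument and the toy edge list are assumptions made for the example, not part of the original text.

```python
import random

def random_split(links, probe_fraction=0.1, seed=0):
    """Randomly divide the observed links E into (E^T, E^P)."""
    rng = random.Random(seed)
    shuffled = list(links)
    rng.shuffle(shuffled)
    n_probe = max(1, round(probe_fraction * len(shuffled)))
    return shuffled[n_probe:], shuffled[:n_probe]

def k_fold_splits(links, k=10, seed=0):
    """Yield K pairs (E^T, E^P); every link is a probe link exactly once."""
    rng = random.Random(seed)
    shuffled = list(links)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # K disjoint subsets of E
    for i, probe in enumerate(folds):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, probe

# Hypothetical toy network with 10 undirected links.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),
         (3, 4), (4, 5), (3, 5), (0, 5), (2, 4)]
train, probe = random_split(edges, probe_fraction=0.1)
folds = list(k_fold_splits(edges, k=5))
```

With random sub-sampling, repeated calls with different seeds may pick the same link more than once or never; the K-fold generator avoids this by construction.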
Two standard metrics are used to quantify the accuracy of prediction algo-
rithms: area under the receiver operating characteristic curve (AUC) [26] and
Precision [27,28]. In principle, a link prediction algorithm provides an ordered
list of all non-observed links (i.e., $U - E^T$) or equivalently gives each
non-observed link, say $(x, y) \in U - E^T$, a score $s_{xy}$ to quantify its existence
likelihood. The AUC evaluates the algorithm's performance according to the
whole list while the Precision only focuses on the L links with the top ranks or
the highest scores. A detailed introduction of these two metrics is as follows.
(i) AUC. Provided the rank of all non-observed links, the AUC value can
be interpreted as the probability that a randomly chosen missing link (i.e., a
link in $E^P$) is given a higher score than a randomly chosen nonexistent link
(i.e., a link in $U - E$). In the algorithmic implementation, we usually calculate
the score of each non-observed link instead of giving the ordered list, since
the latter task is more time-consuming. Then, each time we randomly pick
a missing link and a nonexistent link and compare their scores. If, among n
independent comparisons, the missing link has a higher score n′ times and the
two links have the same score n′′ times, the AUC value is
$$\mathrm{AUC} = \frac{n' + 0.5\,n''}{n}. \tag{1}$$
If all the scores are generated from an independent and identical distribu-
tion, the AUC value should be about 0.5. Therefore, the degree to which the
value exceeds 0.5 indicates how much better the algorithm performs than pure
chance.
(ii) Precision. Given the ranking of the non-observed links, the Precision is
defined as the ratio of relevant items selected to the number of items selected.
That is to say, if we take the top-L links as the predicted ones, among which $L_r$
links are right (i.e., there are $L_r$ links in the probe set $E^P$), then the Precision
equals $L_r/L$. Clearly, higher precision means higher prediction accuracy.
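Both metrics are easy to compute once every non-observed link has a score. The sketch below is illustrative only: it assumes the scores are stored in a dictionary keyed by node pairs, and the names `auc`, `precision`, `scores`, `probe` and `nonexistent` are hypothetical.

```python
import random

def auc(scores, probe, nonexistent, n=10000, seed=0):
    """Sampled AUC of Eq. 1: (n' + 0.5 n'') / n over n random comparisons."""
    rng = random.Random(seed)
    probe, nonexistent = list(probe), list(nonexistent)
    higher = equal = 0
    for _ in range(n):
        s_missing = scores[rng.choice(probe)]        # link in E^P
        s_absent = scores[rng.choice(nonexistent)]   # link in U - E
        if s_missing > s_absent:
            higher += 1
        elif s_missing == s_absent:
            equal += 1
    return (higher + 0.5 * equal) / n

def precision(scores, probe, L):
    """Fraction L_r / L of the top-L ranked links that lie in the probe set."""
    top = sorted(scores, key=scores.get, reverse=True)[:L]
    probe = set(probe)
    return sum(1 for link in top if link in probe) / L

# Hypothetical scores for four non-observed links.
scores = {(0, 3): 0.9, (1, 4): 0.7, (2, 5): 0.2, (0, 4): 0.1}
probe = [(0, 3), (1, 4)]
nonexistent = [(2, 5), (0, 4)]
```

In this toy case every probe link outranks every nonexistent link, so both metrics reach their maximum; if all scores were identical, the sampled AUC would be exactly 0.5.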
3 Similarity-Based Algorithms
Node similarity can be defined by using the essential attributes of nodes:
two nodes are considered to be similar if they have many common features
[31]. However, the attributes of nodes are generally hidden, and thus we fo-
cus on another group of similarity indices, called structural similarity, which
is based solely on the network structure. The structural similarity indices
can be classified in various ways, such as local vs. global, parameter-free vs.
parameter-dependent, node-dependent vs. path-dependent, and so on. The
similarity indices can also be classified, in a more sophisticated way, by structural
equivalence and regular equivalence. The former embodies a latent assumption that
the link itself indicates a similarity between two endpoints (see, for example,
the Leicht-Holme-Newman index [32] and transferring similarity [33]), while
the latter assumes that two nodes are similar if their neighbors are similar.
Readers are encouraged to see Ref. [34] for the mathematical definition of
regular equivalence and Ref. [35] for a recent application on the prediction of
protein functions.
Here we adopt the simplest method, where 20 similarity indices are classified
into three categories: the first 10 are local indices, followed by 7 global
indices, and the last 3 are quasi-local indices, which do not require global
topological information but make use of more information than local indices.
(1) Common Neighbours (CN). For a node x, let Γ(x) denote the set of neighbors
of x. Intuitively, two nodes, x and y, are more likely to have
a link if they have many common neighbors. The simplest measure of this
neighborhood overlap is the direct count, namely
$$s^{CN}_{xy} = |\Gamma(x) \cap \Gamma(y)|, \tag{2}$$
where |Q| is the cardinality of the set Q. It is obvious that $s_{xy} = (A^2)_{xy}$,
where A is the adjacency matrix: $A_{xy} = 1$ if x and y are directly connected
and $A_{xy} = 0$ otherwise. Note that $(A^2)_{xy}$ is also the number of different paths
with length 2 connecting x and y. Newman [36] used this quantity in the study
of collaboration networks, showing a positive correlation between the number
of common neighbors and the probability that two scientists will collaborate
in the future. Kossinets and Watts [14] analyzed a large-scale social network,
suggesting that two students having many mutual friends are very likely
to become friends in the future.
(2) Salton Index [6]. It is defined as
$$s^{Salton}_{xy} = \frac{|\Gamma(x) \cap \Gamma(y)|}{\sqrt{k_x \times k_y}}, \tag{3}$$
where $k_x$ is the degree of node x. The Salton index is also called the cosine
similarity in the literature.
(3) Jaccard Index [37]. This index was proposed by Jaccard over a hundred
years ago, and is defined as:
$$s^{Jaccard}_{xy} = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}. \tag{4}$$
(4) Sørensen Index [38]. This index is used mainly for ecological community
data, and is defined as
$$s^{Sørensen}_{xy} = \frac{2|\Gamma(x) \cap \Gamma(y)|}{k_x + k_y}. \tag{5}$$
(5) Hub Promoted Index (HPI) [39]. This index is proposed for quantifying
the topological overlap of pairs of substrates in metabolic networks, and is
defined as
$$s^{HPI}_{xy} = \frac{|\Gamma(x) \cap \Gamma(y)|}{\min\{k_x, k_y\}}. \tag{6}$$
Under this measurement, the links adjacent to hubs are likely to be assigned
high scores since the denominator is determined by the lower degree only.
(6) Hub Depressed Index (HDI). Analogously to the above index, we also
consider a measurement with the opposite effect on hubs, defined as
$$s^{HDI}_{xy} = \frac{|\Gamma(x) \cap \Gamma(y)|}{\max\{k_x, k_y\}}. \tag{7}$$
(7) Leicht-Holme-Newman Index (LHN1) [32]. This index assigns high similarity
to node pairs that have many common neighbors compared not to the
possible maximum, but to the expected number of such neighbors. It is defined
as
$$s^{LHN1}_{xy} = \frac{|\Gamma(x) \cap \Gamma(y)|}{k_x \times k_y}, \tag{8}$$
where the denominator, $k_x \times k_y$, is proportional to the expected number of
common neighbors of nodes x and y in the configuration model [40]. We use
the abbreviation LHN1 to distinguish this index from another index (named the
LHN2 index) also proposed by Leicht, Holme and Newman.
$$s^{PA}_{xy} = k_x \times k_y, \tag{9}$$
which has been widely used to quantify the functional significance of links
subject to various network-based dynamics, such as percolation [43], synchronization
[44] and transportation [45]. Note that this index does not require
information about the neighborhood of each node; as a consequence, it has
the lowest computational complexity.
(9) Adamic-Adar Index (AA) [46]. This index refines the simple counting of
common neighbors by assigning the less-connected neighbors more weight, and
is defined as
$$s^{AA}_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log k_z}. \tag{10}$$
(10) Resource Allocation Index (RA) [47]. This index is motivated by the
resource allocation dynamics on complex networks [48]. Consider a pair of
nodes, x and y, which are not directly connected. The node x can send some
resource to y, with their common neighbors playing the role of transmitters.
In the simplest case, we assume that each transmitter has a unit of resource,
and will equally distribute it to all its neighbors. The similarity between x and
y can be defined as the amount of resource y received from x, which is:
$$s^{RA}_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z}. \tag{11}$$
Clearly, this measure is symmetric, namely $s_{xy} = s_{yx}$. Note that, although
resulting from different motivations, the AA index and RA index have very
similar forms. Indeed, they both depress the contribution of the high-degree
common neighbors. The AA index takes the form $(\log k_z)^{-1}$ while the RA index takes
the form $k_z^{-1}$. The difference is insignificant when the degree, $k_z$, is small,
while it is considerable when $k_z$ is large. In a word, the RA index punishes the
high-degree common neighbors more heavily than AA.

Table 1
Accuracies of different local similarity indices subject to link prediction, measured
by the AUC value. Each number is obtained by averaging over 10 implementations
with independently random partitions of training set (90%) and probe set (10%).
The entries corresponding to the highest accuracies among these 10 indices are
emphasized in black. The six real networks for testing are a protein-protein interaction
network (PPI) [16], a co-authorship network of scientists who are themselves
publishing on the topic of network science (NS) [50], an electrical power-grid of the
western US (Grid) [51], a network of the US political blogs (PB) [52], a router-level
Internet collected by Rocketfuel Project (INT) [53], and a network of the US air
transportation system (USAir) [54]. Detailed information about these networks can
be found in Ref. [47].

Indices    PPI    NS     Grid   PB     INT    USAir
CN         0.889  0.933  0.590  0.925  0.559  0.937
Salton     0.869  0.911  0.585  0.874  0.552  0.898
Jaccard    0.888  0.933  0.590  0.882  0.559  0.901
Sørensen   0.888  0.933  0.590  0.881  0.559  0.902
HPI        0.868  0.911  0.585  0.852  0.552  0.857
HDI        0.888  0.933  0.590  0.877  0.559  0.895
LHN1       0.866  0.911  0.585  0.772  0.552  0.758
PA         0.828  0.623  0.446  0.907  0.464  0.886
AA         0.888  0.932  0.590  0.922  0.559  0.925
RA         0.890  0.933  0.590  0.931  0.559  0.955
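As a rough illustration, the ten local indices can all be derived from the adjacency matrix and the degree sequence. The sketch below uses NumPy and makes a few assumptions of its own: the function name `local_indices`, the convention that an undefined ratio (zero denominator) is scored 0, and the small 4-node example graph are all hypothetical choices.

```python
import numpy as np

def local_indices(A):
    """Score matrices for the ten local indices of a 0/1 symmetric matrix A."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                    # degrees k_x
    cn = A @ A                           # (A^2)_xy = number of common neighbors
    kk = np.outer(k, k)                  # k_x * k_y
    ksum = k[:, None] + k[None, :]       # k_x + k_y
    union = ksum - cn                    # |Γ(x) ∪ Γ(y)|

    def ratio(num, den):
        # Element-wise num / den, with 0 where the denominator vanishes.
        out = np.zeros_like(num)
        np.divide(num, den, out=out, where=den > 0)
        return out

    # Weights of a common neighbor z: 1/log(k_z) for AA and 1/k_z for RA.
    # A degree-1 node cannot be shared by two distinct nodes, so its AA weight is 0.
    w_aa = np.zeros_like(k)
    w_aa[k > 1] = 1.0 / np.log(k[k > 1])
    w_ra = ratio(np.ones_like(k), k)

    return {
        "CN": cn,
        "Salton": ratio(cn, np.sqrt(kk)),
        "Jaccard": ratio(cn, union),
        "Sorensen": ratio(2 * cn, ksum),
        "HPI": ratio(cn, np.minimum.outer(k, k)),
        "HDI": ratio(cn, np.maximum.outer(k, k)),
        "LHN1": ratio(cn, kk),
        "PA": kk,
        "AA": A @ np.diag(w_aa) @ A,
        "RA": A @ np.diag(w_ra) @ A,
    }

# Hypothetical example: a triangle 0-1-2 with a pendant node 3 attached to 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
S = local_indices(A)
```

For the unconnected pair (0, 3), the only common neighbor is node 2 with degree 3, so AA gives 1/log 3 while RA gives 1/3, which shows how RA penalizes the higher-degree neighbor more strongly. Dense matrices are used for brevity; a sparse representation would be preferable for large networks.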
Liben-Nowell et al. [49] and Zhou et al. [47] systematically compared a number
of local similarity indices on many real networks: the former [49] focuses on
social collaboration networks and the latter [47] considers disparate networks
including the protein-protein interaction network, the power grid, the Internet,
US airport network, etc. According to extensive experimental results on real
networks (see results in Table 1), the RA index performs best, while AA and
CN indices have the second best overall performance among all the above-
mentioned local indices.
The PA index has the worst overall performance, yet we are interested in
it because it requires the least information. Notice that PA performs even worse
than pure chance for the Internet at router level and the power grid. In these
two networks, the nodes have well-defined positions and the links are physical
lines. Actually, geography plays a significant role and links with very long
geographical distances are rare. As local centers, the high-degree nodes have
longer geographical distances to each other than average, and thus have a
lower probability of directly connecting to each other, which leads to the bad
performance of PA. In contrast, although USAir has well-defined geographical
positions of nodes, its links are not physical. Empirical data has demonstrated
that the number of airline flights is not very sensitive to the geographical
distance within a big range [55,56] (another topological evidence for the rela-
tively good performance of PA on USAir is the so-called rich-club phenomenon
[57,58]). The LHN1 index performs the second worst; however, compared with
all other neighborhood-based indices, it is very good at uncovering the missing
links connecting two small-degree nodes [59].
Recently, Pan et al. [60] compared all the local indices that appear in Ref.
[47] in a similarity-based community detection algorithm, and their experimental
results again indicate that the RA index performs best. Wang et al.
[61] applied the RA index to estimate the weights between stations in
the Chinese railway network, which shows better performance than the CN index.
In addition, the RA index for bipartite networks can be applied in personalized
recommendation with higher accuracy than classical collaborative filtering [62].
(11) Katz Index [63]. This index is based on the ensemble of all paths: it
directly sums over the collection of paths and is exponentially damped by
length to give the shorter paths more weight. The mathematical expression
reads
$$s^{Katz}_{xy} = \sum_{l=1}^{\infty} \beta^l \cdot |paths^{<l>}_{xy}| = \beta A_{xy} + \beta^2 (A^2)_{xy} + \beta^3 (A^3)_{xy} + \cdots, \tag{12}$$
where $paths^{<l>}_{xy}$ is the set of all paths with length l connecting x and y, and
β is a free parameter (i.e., the damping factor) controlling the path weights.
Obviously, a very small β yields a measurement close to CN, because the long
paths contribute very little. The similarity matrix can be written as
$$S^{Katz} = (I - \beta A)^{-1} - I. \tag{13}$$
Note that β must be lower than the reciprocal of the largest eigenvalue of
matrix A to ensure the convergence of Eq. 12.
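A minimal NumPy sketch of the Katz index follows. It relies on the closed form $(I - \beta A)^{-1} - I$ obtained by summing the geometric series of Eq. 12, and it checks the convergence condition on β explicitly. The function name `katz` and the example graph are assumptions made for illustration.

```python
import numpy as np

def katz(A, beta):
    """Katz similarity matrix for a symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    lam_max = np.max(np.abs(np.linalg.eigvalsh(A)))   # largest eigenvalue of A
    if beta >= 1.0 / lam_max:
        raise ValueError("beta must be below 1/lambda_max for Eq. 12 to converge")
    I = np.eye(A.shape[0])
    return np.linalg.inv(I - beta * A) - I

# Hypothetical example graph: triangle 0-1-2 plus pendant node 3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
S_katz = katz(A, beta=0.1)

# Cross-check against a truncated version of the series in Eq. 12.
series = sum(0.1 ** l * np.linalg.matrix_power(A, l) for l in range(1, 60))
```

The matrix inversion is what makes this index expensive for large networks; the truncated series is shown only as a consistency check.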
(12) Leicht-Holme-Newman Index (LHN2) [32]. This index is a variant of the
Katz index. Based on the concept that two nodes are similar if their immediate
neighbors are themselves similar, one obtains a self-consistent matrix
formulation
$$S = \phi A S + \psi I = \psi (I - \phi A)^{-1} = \psi (I + \phi A + \phi^2 A^2 + \cdots), \tag{14}$$
where φ and ψ are free parameters controlling the balance between the two
components of the similarity. Setting ψ = 1, it is very similar to the Katz
index. Note that $(A^l)_{xy}$ is equal to the number of paths of length l from x
to y. The expected value of $(A^l)_{xy}$, namely $E[(A^l)_{xy}]$, equals $(k_x k_y / 2M)\lambda_1^{l-1}$,
where $\lambda_1$ is the largest eigenvalue of A and M is the total number of edges
in the network. Replacing $(A^l)_{xy}$ in Eq. 14 with $(A^l)_{xy} / E[(A^l)_{xy}]$, we obtain the
expression:
$$s^{LHN2}_{xy} = \delta_{xy} + \frac{2M}{k_x k_y} \sum_{l=1}^{\infty} \phi^l \lambda_1^{1-l} (A^l)_{xy} = \left[1 - \frac{2M\lambda_1}{k_x k_y}\right] \delta_{xy} + \frac{2M\lambda_1}{k_x k_y} \left[\left(I - \frac{\phi}{\lambda_1} A\right)^{-1}\right]_{xy}, \tag{15}$$
where $\delta_{xy}$ is the Kronecker delta. Since the first term is a diagonal matrix,
it can be dropped, and thus we arrive at a compact expression
$$S = 2M\lambda_1 D^{-1} \left(I - \frac{\phi A}{\lambda_1}\right)^{-1} D^{-1}, \tag{16}$$
where D is the degree matrix with $D_{xy} = \delta_{xy} k_x$ and φ (0 < φ < 1) is a
free parameter. The choice of φ depends on the investigated network, and
a smaller φ assigns more weight to shorter paths.
(13) Average Commute Time (ACT). Denote by m(x, y) the average number
of steps required by a random walker starting from node x to reach node y.
The average commute time between x and y is
$$n(x, y) = m(x, y) + m(y, x), \tag{17}$$
which can be obtained in terms of the pseudoinverse of the Laplacian matrix,
$L^+$ (with $L = D - A$), as
$$n(x, y) = M (l^+_{xx} + l^+_{yy} - 2 l^+_{xy}), \tag{18}$$
where $l^+_{xy}$ denotes the corresponding entry in $L^+$. Assuming two nodes are
more similar if they have a smaller average commute time, then the similarity
between the nodes x and y can be defined as the reciprocal of n(x, y), namely
(the constant factor M is removed)
$$s^{ACT}_{xy} = \frac{1}{l^+_{xx} + l^+_{yy} - 2 l^+_{xy}}. \tag{19}$$
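The ACT index of Eq. 19 can be sketched with NumPy's pseudoinverse. The code assumes that $L^+$ is the Moore-Penrose pseudoinverse of the graph Laplacian L = D − A and that the network is connected; the function name `act_similarity` and the 3-node path used as an example are hypothetical.

```python
import numpy as np

def act_similarity(A):
    """s_xy = 1 / (l+_xx + l+_yy - 2 l+_xy), with 0 on the diagonal."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A        # Laplacian L = D - A (assumed definition)
    Lp = np.linalg.pinv(L)                # pseudoinverse L^+
    d = np.diag(Lp)
    denom = d[:, None] + d[None, :] - 2 * Lp
    S = np.zeros_like(denom)
    np.divide(1.0, denom, out=S, where=denom > 1e-12)   # skip x = y
    return S

# Hypothetical example: the path 0 - 1 - 2.
A_path = np.array([[0, 1, 0],
                   [1, 0, 1],
                   [0, 1, 0]])
S_act = act_similarity(A_path)
```

On this path the two end nodes are twice as far apart (in commute time) as adjacent nodes, so their similarity is half as large.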
(15) Random Walk with Restart (RWR). This index is a direct application of
the PageRank algorithm [66]. Consider a random walker starting from node x,
who iteratively moves to a random neighbor with probability c and returns
to node x with probability 1 − c. Denote by $q_{xy}$ the probability that this random
walker is located at node y in the steady state; we have
$$\vec{q}_x = c P^T \vec{q}_x + (1 - c) \vec{e}_x, \tag{21}$$
where P is the transition matrix with $P_{xy} = 1/k_x$ if x and y are connected,
and $P_{xy} = 0$ otherwise, and $\vec{e}_x$ is the vector whose x-th element is 1 and
all others are 0. The solution is straightforward, as
$$\vec{q}_x = (1 - c)(I - c P^T)^{-1} \vec{e}_x. \tag{22}$$
The RWR index is then defined as
$$s^{RWR}_{xy} = q_{xy} + q_{yx}. \tag{23}$$
A fast algorithm to calculate this index was proposed by Tong et al. [67], and
the application of this index to recommender systems can be found in Ref.
[68].
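A direct (not fast) way to evaluate the RWR index is to solve the steady-state condition for all starting nodes at once by matrix inversion. The sketch below assumes the steady state satisfies $\vec{q}_x = cP^T\vec{q}_x + (1-c)\vec{e}_x$, where $\vec{e}_x$ is the indicator vector of node x, and that no node is isolated. The names `rwr_steady_states` and `rwr_similarity` are hypothetical, and this is not the fast algorithm of Ref. [67].

```python
import numpy as np

def rwr_steady_states(A, c=0.9):
    """Matrix Q whose column x is the steady-state vector q_x of a walker from x."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)
    P = A / k[:, None]                      # P_xy = 1/k_x if x and y are connected
    N = A.shape[0]
    return (1 - c) * np.linalg.inv(np.eye(N) - c * P.T)

def rwr_similarity(A, c=0.9):
    """s_xy = q_xy + q_yx."""
    Q = rwr_steady_states(A, c)
    return Q + Q.T

# Hypothetical example graph: triangle 0-1-2 plus pendant node 3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
Q = rwr_steady_states(A, c=0.9)
S_rwr = rwr_similarity(A, c=0.9)
```

Each column of `Q` is a probability distribution over nodes, which is a convenient sanity check on an implementation.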
nected to similar nodes:
$$s^{SimRank}_{xy} = C \cdot \frac{\sum_{z \in \Gamma(x)} \sum_{z' \in \Gamma(y)} s^{SimRank}_{zz'}}{k_x \cdot k_y}, \tag{24}$$
where $s_{xx} = 1$ and $C \in [0, 1]$ is the decay factor. SimRank can also be
interpreted in terms of a random-walk process, that is, $s^{SimRank}_{xy}$ measures how soon
two random walkers, respectively starting from nodes x and y, are expected
to meet at a certain node.
$$S = (I + L)^{-1}, \tag{25}$$
where the similarity between x and y can be understood as the ratio of the
number of spanning rooted forests such that nodes x and y belong to the same
tree rooted at x to all spanning rooted forests of the network (see details in
Ref. [70]). A parameter-dependent variant of MFI is
$$S = (I + \alpha L)^{-1}, \quad \alpha > 0. \tag{26}$$
This index has been applied to quantify the similarity between nodes in a
collaborative recommendation task [71]. The results indicate that a simple
nearest-neighbors rule based on similarity measured by MFI performs best.
Compared with the local similarity indices, the global ones require the whole
topological information. Although the global indices can provide much more
accurate prediction than the local ones, they suffer from two major disadvantages:
(i) the calculation of a global index is very time-consuming, and is usually
infeasible for large-scale networks; (ii) sometimes, the global topological
information is not available, especially if we would like to implement the algorithm
in a decentralized manner. As we will show in the next subsection, a promising
tradeoff is the quasi-local indices, which consider more information than local
indices while abandoning the superfluous information that makes little or no
contribution to the prediction accuracy.
(18) Local Path Index (LP) [47,72]. To provide a good tradeoff between accuracy and
computational complexity, we here introduce an index that takes local paths
into consideration, with a wider horizon than CN. It is defined as
$$S^{LP} = A^2 + \epsilon A^3, \tag{27}$$
where ε is a free parameter. Clearly, this measure degenerates to CN when
ε = 0. If x and y are not directly connected (this is the case we are
interested in), $(A^3)_{xy}$ is equal to the number of different paths with length 3
connecting x and y. This index can be extended to account for higher-order
paths, as
$$S^{LP(n)} = A^2 + \epsilon A^3 + \epsilon^2 A^4 + \cdots + \epsilon^{n-2} A^n, \tag{28}$$
where n > 2 is the maximal order. As n increases, this index asks
for more information and computation. In particular, when n → ∞, $S^{LP(n)}$ will
be equivalent to the Katz index, which takes into account all paths in the network.
The computational complexity of this index in an uncorrelated network
is $O(N\langle k \rangle^n)$, which grows fast with increasing n and will exceed the
complexity of calculating the Katz index (approximately $O(N^3)$) for large
n. Experimental results show that the optimal n is positively correlated with
the average shortest distance of the network [72].
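The LP index and its higher-order extension amount to a short loop over matrix powers. In this sketch the function name `local_path`, its default arguments and the example graph are assumptions; the higher-order form implemented is $A^2 + \epsilon A^3 + \cdots + \epsilon^{n-2} A^n$.

```python
import numpy as np

def local_path(A, eps=0.01, n=3):
    """LP index of maximal order n (n = 3 gives A^2 + eps * A^3)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    A = np.asarray(A, dtype=float)
    power = A @ A                      # A^2
    S = power.copy()
    for order in range(3, n + 1):
        power = power @ A              # A^order
        S += eps ** (order - 2) * power
    return S

# Hypothetical example graph: triangle 0-1-2 plus pendant node 3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
S_lp = local_path(A, eps=0.01, n=3)
```

For the pair (0, 3) there is one path of length 2 (via node 2) and one of length 3 (0-1-2-3), so the score is 1 + ε. The small ε term mainly breaks ties among pairs that have the same number of common neighbors.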
The comparison of the LP index with the other two path-dependent global indices, the
Katz and LHN2 indices, is shown in Table 2. Overall, the Katz index
performs best subject to the AUC value, while the LP index is the best for the
Precision. For networks with a small average shortest distance (e.g., USAir
and PB), the LP index gives the most accurate predictions for both AUC and
Precision. In a word, the LP index provides competitively good predictions
while requiring much lighter computation than the global indices.
(19) Local Random Walk (LRW) [73]. To measure the similarity between nodes
x and y, a random walker is initially put on node x, and thus the initial density
vector is $\vec{\pi}_x(0) = \vec{e}_x$. This density vector evolves as $\vec{\pi}_x(t+1) = P^T \vec{\pi}_x(t)$ for $t \ge 0$.
The LRW index at time step t is thus defined as
$$s^{LRW}_{xy}(t) = q_x \pi_{xy}(t) + q_y \pi_{yx}(t), \tag{29}$$
where q is the initial configuration function. In Ref. [73], Liu and Lü applied a
simple form determined by node degree, namely $q_x = k_x / M$.
Table 2
Accuracies of the three path-dependent similarity indices, measured by AUC and
precision. Here only the main components of example networks are considered (see
Ref. [72] for detailed information). Each number is obtained by averaging over 10
independent realizations. The entries corresponding to the highest accuracies are
emphasized in black. For LP, Katz and LHN2 indices, the AUC values correspond
to the optimal parameter, which is also used to calculate the corresponding
precision, where we set L = 100. For USAir, the optimal value of ε is negative
(see the explanation in Ref. [47]). LP* denotes the LP index with a fixed parameter
ε = 0.01 (for USAir ε = −0.01). The very small difference between the optimal case
and the case with ε = 0.01 suggests that in real applications, one can directly
set ε to a very small number, instead of searching for its optimum, which may cost
much time. This again supports our motivation that the essential advantage of the
usage of the second order neighborhood is to improve the distinguishability of the
similarity scores.
AUC PPI NS Grid PB INT USAir
LP 0.970 0.988 0.697 0.941 0.943 0.960
LP* 0.970 0.988 0.697 0.939 0.941 0.959
Katz 0.972 0.988 0.952 0.936 0.975 0.956
LHN2 0.968 0.986 0.947 0.769 0.959 0.778
(20) Superposed Random Walk (SRW) [73]. Similar to the RWR index, Liu and
Lü [73] proposed the SRW index, where the random walker is continuously
released at the starting point, resulting in a higher similarity between the
target node and the nodes nearby. The mathematical expression reads
$$s^{SRW}_{xy}(t) = \sum_{\tau=1}^{t} s^{LRW}_{xy}(\tau) = \sum_{\tau=1}^{t} [q_x \pi_{xy}(\tau) + q_y \pi_{yx}(\tau)]. \tag{30}$$
Liu and Lü [73] systematically compared these two indices, LRW and SRW,
with five other indices, including three local (or quasi-local) indices, CN, RA
and LP, and two other random-walk-based global indices, ACT and RWR, as
well as the hierarchical structure method (HSM) proposed by Clauset, Moore
and Newman [75] (see Section 4.1 for the detailed introduction of HSM).
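The LRW and SRW indices can be computed together by evolving the density vectors of all starting nodes in parallel. This is a sketch under stated assumptions: the function name `lrw_srw` is hypothetical, the network is assumed to have no isolated nodes, and $q_x = k_x/M$ is taken with M equal to the number of links.

```python
import numpy as np

def lrw_srw(A, t):
    """Return (LRW at step t, SRW up to step t) as N x N matrices, for t >= 1."""
    if t < 1:
        raise ValueError("t must be at least 1")
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)
    M = A.sum() / 2.0                  # number of links (assumed meaning of M)
    P = A / k[:, None]                 # transition matrix P_xy = A_xy / k_x
    q = k / M                          # initial configuration q_x = k_x / M
    Pi = np.eye(A.shape[0])            # column x holds pi_x(0) = e_x
    srw = np.zeros_like(A)
    for _ in range(t):
        Pi = P.T @ Pi                  # pi_x(tau + 1) = P^T pi_x(tau), all x at once
        W = q[None, :] * Pi            # W[y, x] = q_x * pi_xy(tau)
        lrw = W + W.T                  # q_x pi_xy(tau) + q_y pi_yx(tau)
        srw += lrw                     # superpose the steps tau = 1, ..., t
    return lrw, srw

# Hypothetical example graph: triangle 0-1-2 plus pendant node 3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
lrw1, srw1 = lrw_srw(A, 1)
lrw2, srw2 = lrw_srw(A, 2)
lrw3, srw3 = lrw_srw(A, 3)
```

After a single step the walker can only reach direct neighbors, so the LRW score is nonzero only for connected pairs; more steps are needed before unconnected pairs receive a score, which is why the step count acts as a tunable parameter.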
Table 3
Comparison of algorithms’ accuracy quantified by AUC and Precision. For each
network, the training set contains 90% of the known links. Each number is obtained
by averaging over 1000 implementations with independently random divisions of
training set and probe set. The parameters are $\varepsilon = 10^{-3}$ for LP (for USAir, $\varepsilon = -10^{-3}$)
and c = 0.9 for RWR. The numbers inside the brackets denote the optimal step of
LRW and SRW indices. For example, 0.972(2) means the optimal AUC is obtained
at the second step of LRW. The highest accuracy in each line is emphasized in black.
For HSM, 5000 samples of dendrograms for each implementation are generated.
AUC CN RA LP ACT RWR HSM LRW SRW
USAir 0.954 0.972 0.952 0.901 0.977 0.904 0.972(2) 0.978(3)
NetScience 0.978 0.983 0.986 0.934 0.993 0.930 0.989(4) 0.992(3)
Power 0.626 0.626 0.697 0.895 0.760 0.503 0.953(16) 0.963(16)
Yeast 0.915 0.916 0.970 0.900 0.978 0.672 0.974(7) 0.980(8)
C.elegans 0.849 0.871 0.867 0.747 0.889 0.808 0.899(3) 0.906(3)
According to the experimental results (see Table 3), LRW and SRW methods
perform better than other indices with their respective optimal walking step
positively correlated with the average shortest distance of the network.
With a motivation similar to that of LRW and SRW, Mantrach et al. recently
proposed a bounded normalized random walk with restart algorithm (see Eq. 21
for the definition of RWR), and applied it to address the classification problem
[74]. With this method, both time and space complexity can be reduced.
This section will introduce two recently proposed algorithms based on the
maximum likelihood estimation. These algorithms presuppose some organiz-
ing principles of the network structure, with the detailed rules and specific
parameters obtained by maximizing the likelihood of the observed structure.
Then, the likelihood of any non-observed link can be calculated according to
those rules and parameters.
Empirical evidence indicates that many real networks are hierarchically orga-
nized, where nodes can be divided into groups, further subdivided into groups
of groups, and so forth over multiple scales [21] (e.g., metabolic networks [39]
and brain networks [76]). As Redner said [77], focusing on the hierarchical
structure inherent in social and biological networks might provide a smart
way to find missing links. Clauset, Moore and Newman [75] proposed a gen-
eral technique to infer the hierarchical organization from network data and
further applied it to predict the missing links.
Fig. 2. Illustration of a dendrogram of a network with 5 nodes. Accordingly, the
connecting probability of nodes 1 and 2 is 0.5, of nodes 1 and 3 is 0.3, of nodes 3
and 4 is 0.4.
be the number of leaves in the left and right subtrees rooted at r. Then the
likelihood of the dendrogram D together with a set of pr is
$$\mathcal{L}(D, \{p_r\}) = \prod_r p_r^{E_r} (1 - p_r)^{L_r R_r - E_r}. \tag{31}$$
maximizes L(D, {pr }). Therefore, according to the maximum likelihood method
[78], with a fixed D, it is easy to determine {pr } (by Eq. 32) that best fits the
network G. Figure 3 shows an example network and two possible dendrograms,
as well as the corresponding likelihoods. It is in accordance with the common
sense that D2 is more likely. A Markov chain Monte Carlo method is used
to sample dendrograms with probability proportional to their likelihood (see
the Supplementary Information of Ref. [75] and a benchmark book [79] for
details).
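For a fixed dendrogram, the likelihood of Eq. 31 is a product over internal nodes and can be evaluated directly. The sketch below assumes that Eq. 32 has the usual maximum-likelihood form $p_r = E_r / (L_r R_r)$, with $E_r$ the number of links between the left and right subtrees of r; the function name `hsm_likelihood` and the example dendrogram are hypothetical.

```python
def hsm_likelihood(internal_nodes):
    """Likelihood of Eq. 31 for (E_r, L_r, R_r) triples, one per internal node."""
    likelihood = 1.0
    for E_r, L_r, R_r in internal_nodes:
        pairs = L_r * R_r
        p_r = E_r / pairs                          # assumed form of Eq. 32
        # Python evaluates 0.0 ** 0 as 1.0, so nodes with p_r = 0 or 1 contribute 1.
        likelihood *= p_r ** E_r * (1 - p_r) ** (pairs - E_r)
    return likelihood

# Hypothetical 6-node network: two triangles joined by a single link, with a
# dendrogram that first groups each triangle and then joins them at the root.
dendrogram = [
    (1, 1, 1),   # two nodes of the first triangle
    (2, 1, 2),   # third node of the first triangle joins them
    (1, 1, 1),   # two nodes of the second triangle
    (2, 1, 2),   # third node of the second triangle joins them
    (1, 3, 3),   # root: one link among the 9 inter-triangle pairs
]
L_example = hsm_likelihood(dendrogram)
```

Only the root contributes a factor different from 1 here, giving $(1/9)(8/9)^8 \approx 0.0433$. For larger networks one would work with the log-likelihood to avoid numerical underflow.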
The algorithm to predict the missing links consists of the following steps:
(i) Sample a large number of dendrograms with probability proportional to
their likelihood; (ii) For each pair of unconnected nodes i and j, calculate the
mean connecting probability $\langle p_{ij} \rangle$ by averaging the corresponding probability
$p_{ij}$ over all sampled dendrograms; (iii) Sort these node pairs in descending
order of $\langle p_{ij} \rangle$; the highest-ranked ones are those to be predicted. According
to the AUC value, this algorithm outperforms the CN index for the terrorist
association network [80] and the grassland species food web [81], but underperforms
it for the metabolic network of the spirochaete Treponema pallidum [82].
Fig. 3. The likelihood of two possible dendrograms for an example network consisting
of 6 nodes. The internal nodes are labeled with the maximum likelihood probability
obtained by Eq. 32. The likelihoods are L(D1) ≈ 0.00165 (left dendrogram) and
L(D2) ≈ 0.0433 (right dendrogram). Reprinted figure with permission from [75];
copyright is held by Nature Publishing Group.
The hierarchical structure model provides a smart way to predict missing links,
and, maybe more significantly, it uncovers the hidden hierarchical organization
of networks. However, as mentioned above, a big disadvantage is that this
algorithm runs very slowly. Actually, the process of sampling dendrograms usually
requires $O(N^2)$ steps of the Markov chain [75], and in the worst case, it takes
exponential time [83]. In comparison, on an advanced
desktop computer, the hierarchical structure model cannot manage a network
of tens of thousands of nodes, while the algorithms based on local similarity
indices can deal with networks with tens of millions of nodes. Another noteworthy
point is that this model may give poor predictions for networks without
clear hierarchical structures.
The stochastic block model [84,85,86,87] is one of the most general network models,
where nodes are partitioned into groups and the probability that two nodes are
connected depends solely on the groups to which they belong. The stochastic
block model can capture the community structure [22], role-to-role connections
[88] and maybe other factors in the establishment of connections, especially
when the group membership plays a considerable role in determining how
nodes interact with each other, which usually cannot be well described by
the simple assortativity coefficient [89,90] or the degree-degree correlations
[91,92].

Fig. 4. An illustration of the calculation of the likelihood for the stochastic block
model.
Given a partition M where each node belongs to one group and the connecting
probability for two nodes respectively in groups α and β is denoted by $Q_{\alpha\beta}$
($Q_{\alpha\alpha}$ represents the probability that two nodes within group α are connected),
then the likelihood of the observed network structure is [18]:
$$L(A|M) = \prod_{\alpha \le \beta} Q_{\alpha\beta}^{l_{\alpha\beta}} (1 - Q_{\alpha\beta})^{r_{\alpha\beta} - l_{\alpha\beta}}, \tag{33}$$
where $l_{\alpha\beta}$ is the number of edges between nodes in groups α and β and $r_{\alpha\beta}$ is
the number of pairs of nodes such that one node is in α and the other is in β.
Similar to Eq. 32, the optimal $Q_{\alpha\beta}$ that maximizes the likelihood L(A|M) is:
$$Q^{*}_{\alpha\beta} = \frac{l_{\alpha\beta}}{r_{\alpha\beta}}. \tag{34}$$
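The step from Eq. 33 to Eq. 34 can be made explicit. The short derivation below
is a sketch that assumes 0 < l_αβ < r_αβ, so that the stationary point lies inside
the interval (0, 1); it maximizes the logarithm of the likelihood separately for
each pair of groups:

```latex
\begin{align*}
\ln L(A|M) &= \sum_{\alpha \le \beta} \left[ l_{\alpha\beta} \ln Q_{\alpha\beta}
  + (r_{\alpha\beta} - l_{\alpha\beta}) \ln (1 - Q_{\alpha\beta}) \right], \\
\frac{\partial \ln L(A|M)}{\partial Q_{\alpha\beta}}
  &= \frac{l_{\alpha\beta}}{Q_{\alpha\beta}}
   - \frac{r_{\alpha\beta} - l_{\alpha\beta}}{1 - Q_{\alpha\beta}} = 0
  \;\Longrightarrow\;
  Q^{*}_{\alpha\beta} = \frac{l_{\alpha\beta}}{r_{\alpha\beta}}.
\end{align*}
```

In the boundary cases l_αβ = 0 and l_αβ = r_αβ the corresponding factor of the
likelihood is monotonic in Q_αβ, which gives Q* = 0 and Q* = 1 respectively, in
agreement with the same ratio.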
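A simple illustration is shown in Fig. 4. Given a partition M = {{1, 2, 3}, {4, 5, 6, 7}},
according to Eq. 34, the Q values corresponding to the maximum likelihood
are $Q^{*}_{11} = \frac{3}{3} = 1$, $Q^{*}_{12} = \frac{2}{12} = \frac{1}{6}$, $Q^{*}_{22} = \frac{5}{6}$, and thus the likelihood is
$$L = 1 \times \left(\frac{1}{6}\right)^{2} \left(\frac{5}{6}\right)^{10} \times \left(\frac{5}{6}\right)^{5} \frac{1}{6} \approx 3.005 \times 10^{-4}. \tag{35}$$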
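The value in Eq. 35 can be checked numerically. The following is a minimal
sketch, not code from Ref. [18]: the function name sbm_max_likelihood and the
dictionary-based inputs are assumptions made for illustration, and the link and
pair counts are the ones implied by the example of Fig. 4 (l11 = 3, r11 = 3,
l12 = 2, r12 = 12, l22 = 5, r22 = 6).

```python
def sbm_max_likelihood(link_counts, pair_counts):
    """Return (Q*, L) for a fixed partition, following Eqs. 33 and 34.

    link_counts[(a, b)] is l_ab and pair_counts[(a, b)] is r_ab, with a <= b.
    Both dictionaries are hypothetical input formats chosen for this sketch.
    """
    q_star = {}
    likelihood = 1.0
    for groups, r in pair_counts.items():
        l = link_counts.get(groups, 0)
        q = l / r                      # Eq. 34: Q* = l / r
        q_star[groups] = q
        # Eq. 33 factor; Python evaluates 0.0 ** 0 as 1.0, so Q* = 0 or 1 is safe.
        likelihood *= q ** l * (1 - q) ** (r - l)
    return q_star, likelihood


# Counts for the partition {{1, 2, 3}, {4, 5, 6, 7}} of the example.
pair_counts = {(1, 1): 3, (1, 2): 12, (2, 2): 6}
link_counts = {(1, 1): 3, (1, 2): 2, (2, 2): 5}
q_star, likelihood = sbm_max_likelihood(link_counts, pair_counts)
print(q_star, likelihood)
```

For larger networks the product underflows quickly, so a practical version would
more likely accumulate the logarithm of the likelihood instead.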
Denote by Ω the set of all possible partitions. Using Bayes’ Theorem [93], the
reliability of an individual link is [18]:
$$R_{xy} = L(A_{xy} = 1|A) = \frac{\int_{\Omega} L(A_{xy} = 1|M) L(A|M) p(M) dM}{\int_{\Omega} L(A|M') p(M') dM'}, \tag{36}$$
all partitions is not possible in practice. The Metropolis algorithm [94] can be
applied to estimate the link reliability [18]. Even so, the whole process is
very time-consuming and this method can only handle networks with up to
a few thousand nodes.
Reliability describes the likelihood of the existence of a link (i.e., the prob-
ability that the link “truly” exists) given the observed structure [18], which
can be used not only to predict missing links (the nonexistent links in the ob-
served network yet with the highest reliabilities) but also to identify possible
spurious links (the existent links with the lowest reliabilities). Empirical com-
parison on five disparate networks (the social interactions in a karate club [95],
the social network of frequent associations between 62 dolphins [96], the air
transportation network of Eastern Europe [97], the neural network of the ne-
matode Caenorhabditis elegans [98], and the metabolic network of Escherichia
coli [99]) indicated that the overall performance of the maximum likelihood
method based on stochastic block model [18] is better than the one based on
the hierarchical structure model [75] and the similarity-based algorithm for
common neighbors [49].
5 Probabilistic Models
Probabilistic models aim at abstracting the underlying structure from the
observed network, and then predicting the missing links by using the learned
model. Given a target network G = (V, E), the probabilistic model optimizes
a predefined objective function to establish a model composed of a group of
parameters Θ that best fits the observed data of the target network. Then the
probability of the existence of a nonexistent link (i, j) is estimated by the
conditional probability P(Aij = 1|Θ). This section introduces three mainstream
methods, called the Probabilistic Relational Model (PRM) [100], the
Probabilistic Entity Relationship Model (PERM) [101] and the Stochastic
Relational Model (SRM) [102]. Note that in some literature the term PRM
refers only to a specific model, nowadays usually called the Relational Bayesian
Network, while we adopt the more general usage of PRM in this review.
graph to model the relationship among the attributes of homogeneous entities,
PRMs contain three graphs [103]: the data graph GD, the model graph GM,
and the inference graph GI . These correspond to the skeleton, model, and
ground graph as outlined by Heckerman et al. [104].
The data graph $G_D = (V_D, E_D)$ represents the input network, where nodes
are the objects in the data and edges represent the relationships among the
objects. Each node $v_i \in V_D$ and each edge $e_j \in E_D$ is associated with a type,
$T(v_i) = t_{v_i}$ and $T(e_j) = t_{e_j}$. Each item type $t \in T$ (either object or edge)
has a number of associated attributes $X_t$. Consequently, each object $v_i$ and
link $e_j$ is associated with a set of attribute values, $x_{v_i}^{t_{v_i}}$ and $x_{e_j}^{t_{e_j}}$,
determined by their types, $t_{v_i}$ and $t_{e_j}$, respectively. A PRM represents a joint
probability distribution over the values of all the attributes in the data graph,
$x = \{x_{v_i}^{t_{v_i}} : v_i \in V_D, T(v_i) = t_{v_i}\} \cup \{x_{e_j}^{t_{e_j}} : e_j \in E_D, T(e_j) = t_{e_j}\}$.
For example, in a student-course selection system, the students and courses are
nodes and the edges represent the selection relationship between students and
courses. Clearly, there are two types of nodes, namely student and course. The
type student has four attributes: grade, age, sex and department, while the
type course has five attributes: category, teacher, year, time and discipline.
by an RBN is a directed acyclic graph 3 with a set of CPDs, P, to represent
a joint distribution over the attributes of the item types. The set P contains
a conditional probability distribution for each variable given its parents 4,
$p(x|pa_x)$, where $pa_x$ denotes the parents of node x. Thus the joint probability
distribution can be calculated as
3 A directed graph is acyclic if there is no directed path that starts and ends at the
same variable. This constraint indicates that a random variable does not depend,
directly or indirectly, on its own value.
4 A directed link from a to b indicates that a is b’s parent node.
5 Actually a node in a clique corresponds to an attribute in the data graph.
types t, the attributes of that type Xt , and the nodes v and edges e of that
type:
The CPDs in the RDN pseudo-likelihood are not required to factor the joint
distribution of $G_D$. More specifically, when considering the variable $x_{v_i}^{t}$, we
condition on the values of the parents $pa_{x_{v_i}^{t}}$ regardless of whether the estimation
of CPDs for variables in $pa_{x_{v_i}^{t}}$ was conditioned on $x_{v_i}^{t}$. RDN adopts Gibbs
sampling 6 to iteratively relabel each unobserved variable by drawing from its
local conditional distribution, given the current state of the rest of the graph.
The DAPER model can be used in situations where the relational structure
itself is uncertain, and it is more expressive than either PRMs or plate
models 8. Actually, DAPER combines the features of plate models and PRMs,
and the relations between DAPER models, PRMs and plate models can be
found in Ref. [101].
6 Heckerman et al. [110] proved that the Gibbs sampling can be used to estimate the
joint distribution of a dependency network. For a basic introduction and summary
of the Gibbs sampling, see Ref. [112].
7 In graph theory, the term “arc” stands for directed link.
8 The standard description of plate models can be found in Refs. [108,109].
Heckerman et al. [101] provided a new definition of plate model, which is slightly
different from the traditional one [108,109]. According to this new definition, plate
models and DAPER models are equivalent, and a plate model can be invertibly
mapped to a DAPER model [101].
The key idea of SRM is to model the stochastic structure of entity relationships
(i.e., links) via a tensor interaction of multiple Gaussian Processes (GPs), each
defined on one type of entities [102].
$$p(R_I|\Theta) = \int \prod_{(ij) \in I} p(r_{ij}|t_{ij}) p(t|\Theta) dt, \tag{40}$$
6 Applications
The problem of link prediction has attracted much attention from disparate
research communities, mainly because of its broad applicability. For
some networks, especially biological networks such as protein-protein
interaction networks, metabolic networks and food webs, the discovery of links or
interactions is costly in the laboratory or the field. A highly accurate
prediction can reduce the experimental costs and speed up the pace of uncovering
the truth [75,77]. Link prediction has also been applied in the analysis of social
networks, such as the prediction of being actors in acts [116], the prediction
of collaborations in co-authorship networks [49], the detection of
underground relationships between terrorists [75], and so on. In addition, the
process of recommending items to users can be considered as a link prediction
problem in user-item bipartite networks [117,118]. Actually, almost
the same techniques as those of similarity-based link prediction have been applied
in personalized recommendation [62,119,120]. Accurate recommendation can
be used in e-commerce web sites to enhance sales [121]. Moreover,
link prediction approaches can be applied to solve the classification problem
in partially labeled networks, such as the prediction of protein functions [35],
the detection of anomalous email [122], distinguishing the research areas of
scientific publications [123], and identifying fraudulent and legitimate users in
cell phone networks [124]. The following three subsections will introduce typical
applications of link prediction.
$$R(A) = \prod_{A_{xy}=1, x<y} R_{xy} = \prod_{A_{xy}=1, x<y} L(A_{xy} = 1|A^O), \tag{41}$$
where $R_{xy}$ and L are defined in Eqs. 33 and 36, and the term $A^O$ is used
to emphasize that the likelihoods are calculated according to the observed
network.
Given A^O, a straightforward idea is to find the network A that maximizes
the reliability defined by Eq. 41. However, this computation is too costly to
be implemented. In practice, Guimerà and Sales-Pardo [18] designed a simple
greedy algorithm. Their algorithm starts by evaluating the link reliabilities
for all pairs of nodes. Then, at each time step it removes the link with the
lowest reliability and adds the link (not yet in the current network) with the
highest reliability. This change is accepted if and only if the network reliability
increases. If it is rejected, the link with the next lowest reliability and the
not-yet-existent link with the next highest reliability become the next candidates
for swapping. The algorithm stops if it rejects five consecutive attempts to
swap links. The algorithm is initialized with the observed network, which is
successively transformed into networks with higher reliability than
the initial one. Guimerà and Sales-Pardo [18] tested their algorithm by
generating hypothetical observed networks A^O from the true networks A^T
(the five true networks used for testing are introduced in Section 4.2). Each
observation has a fraction of the true links removed and an identical number
of random links added.
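The swapping procedure described above can be sketched as follows. This is a
simplified illustration rather than the implementation of Ref. [18]: it assumes
that the link reliabilities have been precomputed from the observed network and
are held fixed, that they are strictly positive, and that the network reliability
is the product of Eq. 41 (evaluated here in logarithmic form). The names
greedy_reconstruct and network_log_reliability are hypothetical.

```python
import math


def network_log_reliability(edges, reliability):
    """Logarithm of Eq. 41: sum of log R_xy over the links of a network."""
    return sum(math.log(reliability[e]) for e in edges)


def greedy_reconstruct(observed_edges, reliability, max_rejections=5):
    """Greedy link swapping, starting from the observed network.

    reliability maps every node pair (x, y) with x < y to R_xy in (0, 1];
    it is assumed to be precomputed and is not updated after a swap.
    """
    current = set(observed_edges)
    rejections = 0
    rank = 0  # position of the next candidate swap in the two ranked lists
    while rejections < max_rejections:
        present = sorted(current, key=lambda e: reliability[e])
        absent = sorted((e for e in reliability if e not in current),
                        key=lambda e: reliability[e], reverse=True)
        if rank >= min(len(present), len(absent)):
            break  # no candidate pairs left to try
        candidate = (current - {present[rank]}) | {absent[rank]}
        if (network_log_reliability(candidate, reliability)
                > network_log_reliability(current, reliability)):
            current = candidate        # accept: restart from the extremes
            rejections, rank = 0, 0
        else:
            rejections += 1            # reject: try the next pair in rank order
            rank += 1
    return current


# Toy reliabilities for the six node pairs of a four-node network (made-up values).
toy_reliability = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.7,
                   (0, 3): 0.1, (1, 3): 0.6, (2, 3): 0.2}
toy_observed = {(0, 1), (0, 2), (0, 3), (2, 3)}
reconstructed = greedy_reconstruct(toy_observed, toy_reliability)
print(sorted(reconstructed))
```

Because every accepted step swaps exactly one link for another, the number of
links never changes, which is the latent assumption of equal numbers of missing
and spurious links.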
Guimerà and Sales-Pardo [18] compared the global network properties of the
observed networks with those of the reconstructed networks. According to six
metrics, clustering coefficient [51], modularity [125], assortativity [89,90],
congestability 9, synchronizability 10 and spreading threshold 11, the
reconstruction generally improves the estimates, indicating the validity of the
approach. Note that the results of the greedy algorithm may be far from the
true optimum with maximal reliability, so even better estimates may be expected
from a more effective and/or efficient algorithm.
Readers should be warned that in both the algorithm and the preparation of
observed networks, a latent assumption is that the number of missing links
and the number of spurious links are equal. Since in real systems these
two numbers may be very different (it is easy to imagine that in many
networks, such as metabolic networks and friendship networks, the missing links
far outnumber the spurious links), the effectiveness of the algorithm still
needs further validation.
zero eigenvalues of the Laplacian matrix of a network, which quantifies the ability
of synchronization under the framework of master stability analysis [128,129].
11 Ignoring the degree-degree correlations and applying the mean-field approxima-
tion, the spreading threshold equals the ratio between the first and the second
moments of the degree distribution [130,131].
6.2 Evaluation of Network Evolving Mechanisms
Since the groundbreaking work by Barabási and Albert [41], evolving
models have remained at the center of complex network studies. A
fundamental difficulty is that for a given network or a target network property,
there are generally many possible mechanisms and it is not easy to judge which
one is the best. Taking the power-law degree distribution as an example, the
well-known mechanisms include rich gets richer [41], good gets richer [132],
optimal design [133], Hamiltonian dynamics [134], merging and regeneration
[135], stability constraints [136], and so on. Hence we cannot easily know
which factor(s) lead to the scale-free property of a real network, and in fact
many models may compete for the final explanation of a given
real network. It is very hard to evaluate different models by comparing their
resulting networks with the target network, since there are too many metrics
for topological features [5]. As mentioned in Section 1, there are many
models of the topology of the Internet: some more accurately reproduce the
degree distribution and the disassortative mixing pattern (e.g., see Ref. [19])
and some better characterize the k-core structure (e.g., see Ref. [20]). Judging
which model (i.e., which evolving mechanism) is better than the others is a
tough task.
In essence, an algorithm for link prediction makes a guess about the
factors resulting in the existence of links, which is exactly what an
evolving model aims to show. In other words, an evolving model can in principle
be mapped to a link prediction algorithm. Therefore, we can quantitatively
compare the accuracies of different evolving models with the help of the
performance metrics for link prediction (see Section 2). We hope this methodology
can provide a fair platform for comparing different evolving models, which may
be significant for studies of network modeling. Next, we will show a real
application to the Chinese city airline network, where each node represents
a city with an airport, and two cities are connected if there exists at least one
direct airline between them [56].
the existence of nodes’ interactions in networks [140]. In particular, it plays a very
important role in analyzing transportation networks [141,142]. It is known to
be relevant to the existence of an airline, and the number of airlines decays
with increasing distance [55,143]. Accordingly, we use the
inverse of the geographical distance between two cities as the similarity index,
namely
$$s^{DIS}_{xy} = \frac{1}{D_g(x, y)}, \tag{42}$$
where $D_g(x, y)$ denotes the geographical distance between cities x and y. Based
on a null assumption that people in every city have the same frequency of air
travel, the similarity index for populations is defined as
where P(x) is the population of city x. The economic level of a city can
be roughly quantified by its gross domestic product (GDP) 12, and thus the
corresponding similarity is defined as
$$s^{GDP}_{xy} = G(x) \times G(y), \tag{44}$$
where G(x) denotes the GDP of city x. Considering that the airline business
is most tightly related to the service industry, besides the simple GDP, we
use the third sector of GDP, named the tertiary industry 13, to characterize a
city’s potential to build airlines:
Since the size of the Chinese city airline network [56] is small (|V| = 121,
|E| = 1378), we adopt the leave-one-out method: each time, we
pick only one link for testing and all other links constitute the training set. This
procedure is repeated 1378 times, with each link serving once as the testing link.
Table 4 displays the prediction accuracy (AUC values) of the five similarity
indices. It indicates that every factor under consideration plays a role, while
the topological factor is the most significant. The tertiary industry of a city, as an
external factor, also plays a very important role. Actually, a linear combination
of the common neighbor index and the tertiary industry,
$S' = \lambda S^{CN} + (1 - \lambda) S^{TI}$, can achieve a very high AUC value, 0.928, at λ ≈ 0.2.
Table 4
The prediction accuracy of the five similarity indices for the Chinese city airline
network. The training and testing sets are divided according to the leave-one-out
method.
Similarity Index    AUC
S^CN                0.898
S^DIS               0.699
S^POPU              0.745
S^GDP               0.855
S^TI                0.881
12 The GDP is a measure of a city’s overall economic output. It is the market value
of all final goods and services made within the borders of a city in a year. Here we
use the data of the year 2005.
13 The tertiary industry (also called tertiary sector of the economy, service sector
or service industry) consists of the “soft” parts of the economy, namely activities
where people offer their knowledge and time to improve productivity, performance,
potential, and sustainability. The basic characteristic of this sector is the production
of services instead of end products.
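The leave-one-out evaluation can be sketched as follows. This is an illustrative
sketch under stated assumptions, not the code used for Table 4: the AUC is
evaluated exactly as the probability that the held-out link scores higher than
a nonexistent link, with ties counted as one half; the network must have at
least one nonexistent link; and the function names, the toy network and the use
of the common neighbors index as the default score are choices made only for
this example.

```python
from itertools import combinations


def common_neighbors(adj, x, y):
    """Common neighbors index: number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])


def leave_one_out_auc(nodes, edges, score=common_neighbors):
    """Average, over all links, of the AUC obtained when that link is held out."""
    edges = {tuple(sorted(e)) for e in edges}
    nonexistent = [p for p in combinations(sorted(nodes), 2) if p not in edges]
    total = 0.0
    for probe in edges:
        # Training network: every observed link except the probe link.
        adj = {v: set() for v in nodes}
        for x, y in edges - {probe}:
            adj[x].add(y)
            adj[y].add(x)
        s_probe = score(adj, *probe)
        others = [score(adj, *p) for p in nonexistent]
        higher = sum(s_probe > s for s in others)
        ties = sum(s_probe == s for s in others)
        total += (higher + 0.5 * ties) / len(nonexistent)
    return total / len(edges)


# Toy network: a triangle 0-1-2 with a pendant node 3 attached to node 2.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
auc = leave_one_out_auc(range(4), toy_edges)
print(auc)
```

Any other index, including a weighted sum of two indices, can be evaluated in the
same way by passing a different score function with the same signature.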
Given a network in which only some of the nodes are labeled, the problem is to
predict the labels of the unlabeled nodes based on the known labels and the network
structure. Two main difficulties in achieving highly accurate classification are
the sparsity of the known labeled nodes and the inconsistency of label
information. To address these two difficulties, a simple but effective method is to
add artificial connections between every pair of labeled and unlabeled nodes
according to their similarity scores [123,147], with almost the same techniques
used in similarity-based link prediction. An underlying assumption is that two
nodes are more likely to be categorized into the same class if they are more
similar to each other.
14 The Granger causality test is a technique for determining whether one time series
is useful in forecasting another. See Ref. [145] for details.
Fig. 5. An illustration of how to predict the fifth node’s label by adding artificial
links. Nodes 1 and 3 are labeled a, nodes 2 and 4 are labeled b, and node 5 is
unlabeled.
$$p(l_i|x) = \frac{\sum_{\{y | y \ne x, \mathrm{label}(y) = l_i\}} s_{xy}}{\sum_{\{y | y \ne x, \mathrm{label}(y) \ne 0\}} s_{xy}}, \tag{46}$$
where $l_i \in L$. The predicted label of node x is determined by the largest $p(l_i|x)$.
If there is more than one maximum value, we randomly select one.
A simple example is shown in Fig. 5, where there are two kinds of labels (i.e., a
and b) and five nodes, four of which are already labeled. Our task is to predict
the label of node 5. According to the common neighbors index $S^{CN}$, we
obtain the similarities between node 5 and the four labeled nodes: $s_{15} = 1$,
$s_{25} = 1$, $s_{35} = 2$ and $s_{45} = 0$. Thus, the probabilities that node 5 belongs to
classes a and b are p(a|node 5) = 0.75 and p(b|node 5) = 0.25, respectively. If
we use the RA index, the similarity scores are $s_{15} = \frac{1}{3}$, $s_{25} = \frac{1}{2}$, $s_{35} = \frac{1}{3} + \frac{1}{2}$
and $s_{45} = 0$. Therefore, the probabilities change to p(a|node 5) = 0.7 and
p(b|node 5) = 0.3. According to either of the two indices, the predicted label of
node 5 is a.
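The two calculations of this example can be reproduced with a few lines of code.
The sketch below is only an illustration of Eq. 46: the function name
label_probabilities and the dictionary inputs are assumptions, the labels are
those read from Fig. 5 (nodes 1 and 3 labeled a, nodes 2 and 4 labeled b), and
exact fractions are used for the RA scores so that the result is free of rounding.

```python
from fractions import Fraction


def label_probabilities(similarity, labels):
    """Eq. 46: share of the total similarity to labeled nodes held by each label.

    similarity[y] is s_xy for the unlabeled node x, and labels[y] is the known
    label of node y. Assumes at least one labeled node has nonzero similarity.
    """
    total = sum(similarity[y] for y in labels)
    probs = {}
    for y, lab in labels.items():
        probs[lab] = probs.get(lab, 0) + similarity[y] / total
    return probs


labels = {1: "a", 2: "b", 3: "a", 4: "b"}

# Similarities between node 5 and the labeled nodes, as given in the example.
s_cn = {1: 1, 2: 1, 3: 2, 4: 0}
s_ra = {1: Fraction(1, 3), 2: Fraction(1, 2),
        3: Fraction(1, 3) + Fraction(1, 2), 4: Fraction(0)}

p_cn = label_probabilities(s_cn, labels)
p_ra = label_probabilities(s_ra, labels)

# Ties are not broken at random here; max() simply returns the first maximum.
predicted = max(p_ra, key=p_ra.get)
print(p_cn, p_ra, predicted)
```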
7 Outlook
A more complicated kind of multi-dimensional network consists
of several classes of nodes. For example, an online resource-sharing system,
such as Del.icio.us 15, can be represented by a network that consists of three
kinds of nodes: users, URLs and tags. Unlike in tripartite networks,
nodes in the same class can also be connected; for example, an arc can be added
from a user to her/his follower who has imposed her/his collections. Ignoring the
connections within a class of nodes, the prediction of links between users and
objects has already been investigated [159]. However, nothing has yet been
reported about link prediction algorithms that take into account both the
links within a class and the links between classes.
lutions of link occurrences, which is more appropriate for dealing with the link
prediction problem in evolving networks, such as online social networks.
Another way to involve time information is inspired by the fact that older events
are less likely to be relevant to future links than recent ones. For example,
an author’s interests may change over time and thus old publications might be
less relevant to the author’s current research area. Tylenda et al. [166] developed
a graph-based link prediction method that incorporates the temporal
information contained in evolving networks. They found that the performance can be
improved by either time-based weighting of edges (i.e., giving older events
smaller weights or even neglecting them) or weighting of edges according to
the connecting strength. However, to design effective algorithms and
eventually settle this problem, we need an in-depth and comprehensive understanding
of temporal effects on human interests, attention and so on, which calls for
extensive empirical analyses.
Acknowledgements
References
[1] R. Albert, A.-L. Barabási, Statistical mechanics of complex networks, Rev. Mod.
Phys. 74 (2002) 47.
[6] G. Salton, M. J. McGill, Introduction to Modern Information Retrieval,
McGraw-Hill, Auckland, 1983.
[12] L. A. N. Amaral, A truer measure of our ignorance, Proc. Natl. Acad. Sci.
U.S.A. 105 (2008) 6795.
[13] L. Schafer, J. W. Graham, Missing data: Our view of the state of the art,
Psychol. Methods 7 (2002) 147.
[22] M. Girvan, M. E. J. Newman, Community structure in social and biological
networks, Proc. Natl. Acad. Sci. U.S.A. 99 (2002) 7821.
[26] J. A. Hanley, B. J. McNeil, The meaning and use of the area under a receiver
operating characteristic (ROC) curve, Radiology 143 (1982) 29.
[27] S. Geisser, Predictive inference: An introduction, Chapman and Hall, New York,
1993.
[33] D. Sun, T. Zhou, J.-G. Liu, R.-R. Liu, C.-X. Jia, B.-H. Wang, Information
filtering based on transferring similarity, Phys. Rev. E 80 (2009) 017101.
[37] P. Jaccard, Étude comparative de la distribution florale dans une portion des
Alpes et des Jura, Bulletin de la Societe Vaudoise des Science Naturelles 37
(1901) 547.
[38] T. Sørensen, A method of establishing groups of equal amplitude in plant
sociology based on similarity of species content and its application to analyses
of the vegetation on Danish commons, Biol. Skr. 5 (1948) 1.
[40] M. Molloy, B. Reed, A critical point for random graphs with a given degree
sequence, Random Structures and Algorithms 6 (1995) 161.
[42] Y.-B. Xie, T. Zhou, B.-H. Wang, Scale-free networks without growth, Physica
A 387 (2008) 1683.
[44] C.-Y. Yin, W.-X. Wang, G.-R. Chen, B.-H. Wang, Decoupling process for better
synchronizability on scale-free networks, Phys. Rev. E 74 (2006) 047102.
[45] G.-Q. Zhang, D. Wang, G.-J. Li, Enhancing the transmission efficiency by edge
deletion in scale-free networks, Phys. Rev. E 76 (2007) 017101.
[46] L. A. Adamic, E. Adar, Friends and neighbors on the Web, Social Networks 25
(2003) 211.
[47] T. Zhou, L. Lü, Y.-C. Zhang, Predicting missing links via local information,
Eur. Phys. J. B 71 (2009) 623.
[48] Q. Ou, Y.-D. Jin, T. Zhou, B.-H. Wang, B.-Q. Yin, Power-law strength-degree
correlation from resource-allocation dynamics on weighted networks, Phys. Rev.
E 75 (2007) 021102.
[55] M. T. Gastner, M. E. J. Newman, The spatial structure of networks, Eur. Phys.
J. B 49 (2006) 247.
[56] H.-K. Liu, T. Zhou, Empirical study of Chinese city airline network, Acta
Physica Sinica 56 (2007) 106.
[57] S. Zhou, R. J. Mondragón, The rich-club phenomenon in the Internet topology,
IEEE Commun. Lett. 8 (2004) 180.
[58] V. Colizza, A. Flammini, M. A. Serrano, A. Vespignani, Detecting rich-club
ordering in complex networks, Nat. Phys. 2 (2006) 110.
[59] Y.-X. Zhu, L. Lü, T. Zhou, Uncovering missing links with cold ends
(unpublished).
[60] Y. Pan, D.-H. Li, J.-G. Liu, J.-Z. Liang, Detecting community structure in
complex networks via node similarity, Physica A 389 (2010) 2849.
[61] Y.-L. Wang, T. Zhou, J.-J. Shi, J. Wang, D.-R. He, Empirical analysis of
dependence between stations in Chinese railway network, Physica A 388 (2009)
2949.
[62] T. Zhou, J. Ren, M. Medo, Y.-C. Zhang, Bipartite network projection and
personal recommendation, Phys. Rev. E 76 (2007) 046115.
[63] L. Katz, A new status index derived from sociometric analysis, Psychometrika
18 (1953) 39.
[64] D. J. Klein, M. Randic, Resistance distance, J. Math. Chem. 12 (1993) 81.
[65] F. Fouss, A. Pirotte, J.-M. Renders, M. Saerens, Random-walk computation
of similarities between nodes of a graph with application to collaborative
recommendation, IEEE Trans. Knowl. Data. Eng. 19 (2007) 355.
[66] S. Brin, L. Page, The anatomy of a large-scale hypertextual Web search engine,
Comput. Netw. ISDN Syst. 30 (1998) 107.
[67] H. Tong, C. Faloutsos, J.-Y. Pan, Fast random walk with restart and its
applications, In Proceedings of the 6th International Conference on Data Mining,
IEEE Press, Washington, DC, USA, 2006, p. 613-622.
[68] M.-S. Shang, L. Lü, T. Zhou, Y.-C. Zhang, Relevance is more significant than
correlation: Information filtering on sparse data, EPL 88 (2009) 68008.
[69] G. Jeh, J. Widom, SimRank: A measure of structural-context similarity, In
Proceedings of the ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, ACM Press, New York, 2002, p. 271-279.
[70] P. Chebotarev, E. V. Shamis, The matrix-forest theorem and measuring
relations in small social groups, Automation and Remote Control 58 (1997) 1505.
[71] F. Fouss, L. Yen, A. Pirotte, M. Saerens, An experimental investigation of
graph kernels on a collaborative recommendation task, In Proceedings of the 6th
International Conference on Data Mining, IEEE Press, Washington, DC, USA,
2006, p. 863-868.
[72] L. Lü, C.-H. Jin, T. Zhou, Similarity index based on local paths for link
prediction of complex networks, Phys. Rev. E 80 (2009) 046122.
[73] W. Liu, L. Lü, Link prediction based on local random walk, EPL 89 (2010)
58007.
[91] R. Pastor-Satorras, A. Vázquez, A. Vespignani, Dynamical and Correlation
Properties of the Internet, Phys. Rev. Lett. 87 (2001) 258701.
[95] W. Zachary, An information flow model for conflict and fission in small groups,
J. Anthropol. Res. 33 (1977) 452.
[102] K. Yu, W. Chu, S. Yu, V. Tresp, Z. Xu, Stochastic Relational Models for
Discriminative Link Prediction, In Proceedings of Neural Information Processing
Systems, MIT Press, Cambridge MA, 2006, p. 1553.
[103] J. Neville, Statistical models and analysis techniques for learning in relational
data, PhD Thesis, 2006.
[107] B. Taskar, M.-F. Wong, P. Abbeel, D. Koller, Link prediction in relational data,
In Proceedings of Neural Information Processing Systems, MIT Press, Cambridge
MA, 2004, p. 659.
[108] W. Buntine, Operations for learning with graphical models, J. Artif. Intell.
Res. 2 (1994) 159.
[110] D. Heckerman, D. Chickering, C. Meek, R. Rounthwaite, C. Kadie, Dependency
networks for inference, collaborative filtering, and data visualization, J. Machine
Learning Res. 1 (2000) 49.
[113] Z. Xu, V. Tresp, K. Yu, S. Yu, H.-P. Kriegel, Dirichlet enhanced relational
learning, In Proceedings of the 22nd international conference on machine learning,
Bonn, Germany, 2005, p. 1004.
[114] K. Yu, W. Chu, Gaussian process models for link analysis and transfer learning,
In Proceedings of Neural Information Processing Systems, MIT Press, Cambridge
MA, 2007, p. 1657.
[117] M.-S. Shang, L. Lü, Y.-C. Zhang, T. Zhou, Empirical analysis of web-based
user-object bipartite networks, EPL 90 (2010) 48006.
[119] T. Zhou, Z. Kuscsik, J.-G. Liu, M. Medo, J. R. Wakeling, Y.-C. Zhang, Solving
the apparent diversity-accuracy dilemma of recommender systems, Proc. Natl.
Acad. Sci. U.S.A. 107 (2010) 4511.
[120] W. Zeng, M.-S. Shang, Q.-M. Zhang, L. Lü, T. Zhou, Can dissimilar users
contribute to accuracy and diversity of personalized recommendation, Int. J.
Mod. Phys. C (to be published).
[122] Z. Huang, D. D. Zeng, A link prediction approach to anomalous email
detection, In Proceedings of 2006 IEEE International Conference on Systems,
Man, and Cybernetics, Taipei, Taiwan, 2006, p. 1131.
[127] G. Yan, T. Zhou, B. Hu, Z.-Q. Fu, B.-H. Wang, Efficient routing on complex
networks, Phys. Rev. E 73 (2006) 046108.
[131] T. Zhou, Z.-Q. Fu, B.-H. Wang, Epidemic dynamics on complex networks,
Prog. Nat. Sci. 16 (2006) 452.
[137] H.-K. Liu, T. Zhou, Review on the studies of airline networks, Prog. Nat. Sci.
18 (2008) 601.
[138] A.-X. Cui, Y. Fu, M.-S. Shang, D.-B. Chen, T. Zhou, Emergence of
local structures in complex network: common neighborhood drives the network
evolution, Acta Physica Sinica (to be published).
[139] W.-K. Xiao, J. Ren, F. Qi, Z.-W. Song, M.-X. Zhu, H.-F. Yang, H.-Y. Jin,
B.-H. Wang, T. Zhou, Empirical study on clique-degree distribution of networks,
Phys. Rev. E 76 (2007) 037102.
[140] R. Lambiotte, V. D. Blondel, C. de Kerchove, E. Huens, C. Prieur, Z. Smoreda,
P. Van Dooren, Geographical dispersal of mobile communication networks,
Physica A 387 (2008) 5317.
[141] W.-S. Jung, F. Wang, H. E. Stanley, Gravity model in the Korean highway,
EPL 81 (2008) 48005.
[142] P. Kaluza, A. Koelzsch, M. T. Gastner, B. Blasius, The complex network of
global cargo ship movements, J. R. Soc. Interface 7 (2010) 1093.
[143] G. Bianconi, P. Pin, M. Marsili, Assessing the relevance of node features for
network structure, Proc. Natl. Acad. Sci. U.S.A. 106 (2009) 11433.
[144] H.-K. Liu, X.-L. Zhang, L. Cao, B.-H. Wang, T. Zhou, Analysis on the
connecting mechanism of Chinese city airline network, Sci. China Ser. G 39
(2009) 935.
[145] C. W. J. Granger, Investigating causal relations by econometric models and
cross-spectral methods, Econometrica 37 (1969) 424.
[146] H.-K. Liu, X.-L. Zhang, T. Zhou, Structure and External Factors of Chinese
City Airline Network, Physics Procedia 3 (2010) 1781.
[147] Q.-M. Zhang, M.-S. Shang, L. Lü, Similarity-based classification in partially
labeled networks, Int. J. Mod. Phys. C 21 (2010) 813.
[148] U. Alon, Network motifs: theory and experimental approaches, Nat. Rev. Genet.
8 (2007) 450.
[149] A. Mantrach, L. Yen, J. Callut, K. Françoisse, M. Shimbo, M. Saerens, The
Sum-over-Paths Covariance Kernel: A Novel Covariance Measure between Nodes
of a Directed Graph, IEEE Trans. Pattern Analysis and Machine Intelligence 32
(2010) 1112.
[150] T. Murata, S. Moriyasu, Link prediction of social networks based on weighted
proximity measure, In Proceedings of the IEEE/WIC/ACM International
Conference on Web Intelligence, ACM Press, New York, 2007.
[151] L. Lü, T. Zhou, Link prediction in weighted networks: The role of weak ties,
EPL 89 (2010) 18001.
[152] H. Yin, S. C. Wong, J. Xu, C. K. Wong, Urban traffic flow prediction using a
fuzzy-neural approach, Transportation Res. C 10 (2002) 85.
[153] J. Kunegis, A. Lommatzsch, C. Bauckhage, The Slashdot Zoo: Mining a social
network with negative edges, In Proceedings of WWW’2009, ACM Press, New
York, 2009.
[160] R. Burke, Hybrid recommender systems: Survey and experiments, User Model.
User-Adap. Interact. 12 (2002) 331.
[161] R. Polikar, Ensemble based systems in decision making, IEEE Circuits and
Systems Magazine 6(3) (2006) 21.
[163] E. Zheleva, L. Getoor, J. Golbeck, Ugur Kuter, Using friendship ties and family
circles for link prediction, In Proceedings of the 2nd Workshop on Social Network
Mining and Analysis, ACM Press, New York, 2008.
[164] B. Cao, N. N. Liu, Q. Yang, Transfer learning for collective link prediction
in multiple heterogenous domains, In Proceedings of the 27th International
Conference on Machine Learning, Haifa, Israel, 2010.