Ranking influential spreaders is an ill-defined problem

Jain Gu; Sungmin Lee; Jari Saramäki; Petter Holme

doi:10.1209/0295-5075/118/68002

Introduction

Rumors, opinions, ideas and infectious disease all spread on networks. To maximize the impact of word-of-mouth marketing or to contain infectious disease outbreaks, it is essential to identify important spreaders —i.e. people that acquire the spreading agent easily and are expected to pass it on to many others. The importance of an individual depends on many factors —the details of disease transmission (we focus on infectious disease spreading from now on), the in-host disease dynamics, and the network structure of the contact patterns, among others. This is the motivation behind the emerging field of network epidemiology [1–3].

Many studies —e.g., refs. [4–12]— have devised methods to rank nodes according to their importance. As there are many ways to vary the underlying assumptions about the structure of contact patterns, the objective function (i.e. how to measure the severity of a disease outbreak), the disease dynamics, and the information available to exploit these structures, these methods are becoming a very rich and diverse theory [13]. Typically, it is implicitly assumed that for vaccination or quarantine, the nodes of a network can be ranked with respect to the objective function: if n nodes are to be vaccinated or quarantined, the optimal choice is to take the top n nodes of the ranking. This assumption can fail. In fig. 1, we show a simple example of how the n most influential nodes —in the sense that deleting them would reduce the largest connected component as much as possible —are not among the optimal $n'$ to delete for any $n'\neq n$ .

**Fig. 1:** (Color online) An example of an infinite network where one cannot rank influential spreaders with respect to the reduction of the size of the largest connected component. Let D_n be the set of n nodes that maximizes the number of elements in $\Lambda(X)$ , the set of nodes no longer in the largest connected component after the set X is deleted. In this example, $D_n \cap D_{n'} = \emptyset$ for any $n'\neq n$ .

Picking the top n nodes of a ranking as the n most important nodes is a greedy heuristic. It can be motivated by the fact that optimizing vaccination is NP-hard (or at least the closely related influence maximization problem is [14]). How badly this assumption fails, and why, has yet to be addressed in the literature. The issue has received more attention for influence maximization [14], the problem of identifying the nodes that, as infection seeds, would maximize the expected outbreak size (a problem known to be fundamentally different from the vaccination problem [15]).

In this work, we show that situations where ranking influential spreaders is ill-defined arise in many networks, besides this extreme and contrived example case. To this end, we explore how common deviations are from situations where the optimal sets D_n of n nodes to delete are fully nested (i.e. $|D_{n+1}\cap D_n|=n$ ). We first derive a quantity (ill-definedness) that measures the extent of such deviations, then show that ill-definedness is common in simple model networks, and finally make the point that the issue persists even in real-world networks. We also address the issue of degeneracy of the optimal sets.

Preliminaries

We start by introducing some notation. We define a well-defined scenario to be one where one can rank vertices according to influence and where the n most influential nodes are always the first n nodes of that ranking. Given a measure of the severity of a disease outbreak (such as the number of nodes that eventually get the disease), let D_n be an optimal set of n nodes to delete with respect to the severity measure, and let

$$Y_n=\{D_n^i\}_{i=1}^\nu \tag{1}$$

be the set of all optimal sets of n nodes, with ${|Y_n|= \nu}$ . An optimal set comprises the n nodes that, if deleted (vaccinated), reduce the severity of the disease as much as possible. The degeneracy ν of the optimal set is introduced for situations where there is more than one optimal set; this degeneracy depends on n. Now let

$$a_n(i)=\min_j|D_{n+1}^j\setminus D_n^i|-1 \tag{2}$$

and let $\alpha(n)$ be the average value of a_n(i) over all optimal sets in Y_n. If the influence ranking problem is perfectly well defined for a network, then $a_n(i) = 0\ \forall \ i,n$ and subsequently $\alpha(n)=0 ~\forall ~n$ , that is to say, the optimal sets of n + 1 nodes totally include the optimal sets of n nodes for any value of n. Conversely, the less well defined the ranking, the larger the value of α. We call α the ill-definedness of a network. One can interpret α as the average number of nodes that deviate from the well-defined case.
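As a minimal sketch of eq. (2), assume the optimal sets for n and n + 1 are already known and stored as Python frozensets. The function name `ill_definedness` and the toy sets are illustrative assumptions, not part of the original analysis.

```python
def ill_definedness(Y_n, Y_next):
    """Return alpha(n) given all optimal sets of size n (Y_n) and n + 1 (Y_next).

    Hypothetical helper: both arguments are lists of frozensets of node labels.
    """
    a_values = []
    for D in Y_n:
        # Eq. (2): the fewest nodes any optimal (n + 1)-set adds to D, minus one.
        a_values.append(min(len(D_next - D) for D_next in Y_next) - 1)
    # alpha(n) is the average of a_n(i) over all optimal sets in Y_n.
    return sum(a_values) / len(a_values)


# Nested case: the optimal pair contains the optimal single node.
nested = ill_definedness([frozenset({"a"})], [frozenset({"a", "b"})])
# Disjoint case: the optimal pair shares nothing with the optimal single node.
disjoint = ill_definedness([frozenset({"a"})], [frozenset({"b", "c"})])
print(nested, disjoint)  # prints 0.0 1.0
```

In the nested case the ranking is well defined and the sketch returns zero. In the disjoint case one node "deviates", which matches the interpretation of α as the average number of deviating nodes.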

Results

To understand how common ill-defined rankings are and how the ill-definedness depends on the network, we first study $\alpha(n)$ on small model networks that allow exhaustive treatment. As the severity measure, we use the size S of the largest connected component. The network models we use are $N=L\times L$ square grids and Erdős-Rényi random graphs [16]. For the square grids, we use open boundary conditions: node (x, y) of the grid is connected to $(x+1,y)$ unless x = L (and, similarly, (x, y) and $(x,y+1)$ are connected for $1\leq y<L$ ). For the random graphs, we start with N isolated nodes, go through all pairs of nodes and add a link between each pair with probability p. To calculate $\alpha(n)$ , we perform an exhaustive search for optimal sets D_n over the entire range $n\in [1,N]$ . As this is computationally very heavy, we restrict ourselves to very small networks, which nevertheless clearly illustrate the issue. Larger graphs should be possible to study using linear programming [17]. In the first analysis, we use N = 9 (i.e. L = 3 square grids).
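The two network models and the severity measure S can be sketched in a few lines of plain Python. This is a sketch under stated assumptions rather than the code used for the study: the adjacency-dictionary representation, the function names, and the labelling of grid nodes from (1, 1) to (L, L) are all choices made here for illustration.

```python
import itertools
import random


def square_grid(L):
    """L x L grid with open boundaries; nodes are (x, y) with 1 <= x, y <= L."""
    adj = {(x, y): set() for x in range(1, L + 1) for y in range(1, L + 1)}
    for (x, y) in adj:
        # Link to the right and upper neighbours when they exist (no wrap-around).
        for nb in ((x + 1, y), (x, y + 1)):
            if nb in adj:
                adj[(x, y)].add(nb)
                adj[nb].add((x, y))
    return adj


def erdos_renyi(N, p, rng):
    """Start from N isolated nodes and link every pair with probability p."""
    adj = {v: set() for v in range(N)}
    for u, v in itertools.combinations(range(N), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def largest_component(adj, deleted=frozenset()):
    """Size S of the largest connected component after removing `deleted`."""
    seen = set(deleted)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first traversal of one component
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


grid = square_grid(3)
er = erdos_renyi(10, 0.5, random.Random(0))  # seed 0 is arbitrary
print(largest_component(grid), largest_component(er))
```

An exhaustive search then amounts to evaluating `largest_component` for every n-node subset passed as `deleted`, which is why only very small N are feasible.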

For the square grid, the ill-definedness α has its maximum at n = 3, dropping down to $\alpha=0$ as n reaches 4 (see fig. 2(b)). Node importance rankings are thus ill-defined for n < 4. The $3\times 3$ square grid is simple and symmetric enough to understand in some detail (fig. 2(a)). In this case, D₃ consists of four sets of nodes —the two diagonals {(1, 1), (2, 2), (3, 3)} and {(3, 1), (2, 2), (1, 3)}, and the middle row {(2, 1), (2, 2), (2, 3)} and column {(1, 2), (2, 2), (3, 2)}. The degeneracy is thus $\nu=4$ . Deleting any of these sets reduces S from nine to three. However, for n = 4 there is only one optimal set consisting of the center nodes of each side —(1, 2), (2, 1), (2, 3) and (3, 2)— and so $\nu=1$ . When these nodes are deleted, all other nodes are isolated. Thus for n > 4, deleting these four nodes and any other node in addition would also disconnect the entire network. This also means that, for n > 4, any $D_n^i$ is one node added to $D_{n-1}^i$ . Therefore, for n > 4, $a_n(i)=0$ for any i and subsequently $\alpha(n)=0$ .
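The $3\times 3$ case is small enough to check by brute force. The following sketch enumerates every set of n nodes, keeps those that minimize S, and evaluates eq. (2). The helper names and the adjacency-dictionary representation are assumptions made for this illustration.

```python
from itertools import combinations


def square_grid(L):
    """L x L grid with open boundaries; nodes are (x, y) with 1 <= x, y <= L."""
    adj = {(x, y): set() for x in range(1, L + 1) for y in range(1, L + 1)}
    for (x, y) in adj:
        for nb in ((x + 1, y), (x, y + 1)):
            if nb in adj:
                adj[(x, y)].add(nb)
                adj[nb].add((x, y))
    return adj


def largest_component(adj, deleted):
    """Size S of the largest connected component after removing `deleted`."""
    seen, best = set(deleted), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def optimal_sets(adj, n):
    """All n-node sets whose deletion minimizes S, found by exhaustive search."""
    best, sets = None, []
    for X in map(frozenset, combinations(sorted(adj), n)):
        s = largest_component(adj, X)
        if best is None or s < best:
            best, sets = s, [X]
        elif s == best:
            sets.append(X)
    return sets


grid = square_grid(3)
N = len(grid)
Y = {n: optimal_sets(grid, n) for n in range(1, N + 1)}
nu = {n: len(Y[n]) for n in Y}  # degeneracy nu(n)
alpha = {}
for n in range(1, N):
    # Eq. (2) for every optimal set of size n, then averaged to give alpha(n).
    a = [min(len(D_next - D) for D_next in Y[n + 1]) - 1 for D in Y[n]]
    alpha[n] = sum(a) / len(a)

for n in range(1, N):
    print(n, nu[n], round(alpha[n], 3))
```

Run as written, the search recovers the degeneracies quoted above ( $\nu=4$ at n = 3 and $\nu=1$ at n = 4) and gives $\alpha(n)=0$ from n = 4 onwards.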

**Fig. 2:** (Color online) Panel (a) shows the optimal nodes to vaccinate in a $3\times 3$ square lattice, for $1\leq n\leq 5$ . The other panels show the ill-definedness α and degeneracy ν as functions of n on a square lattice (b) and on random networks with $p=0.1$ (c), $p=0.5$ (d) and $p=0.9$ (e). The number of nodes, N, is 9 for the square grid and 10 for the random networks.

The ill-definedness values for random networks (N = 10, averaged over 1000 networks) are shown in fig. 2(c), (d) and (e). For the lowest network density $(p=0.1)$ , $\alpha(n)$ follows a peaked shape at low n similar to that of the square grid, although its maximum value is smaller. As the networks get denser, the peak shifts towards larger values of n. This reflects the fact that it takes more node deletions to disconnect a denser network. For the dense networks of fig. 2(e) with $p=0.9$ , there are no sets of one or two nodes whose deletion would fragment the network, and thus any set of one or two vertices is optimal and $\alpha(n)=0$ for n < 3.

The degeneracy has a peculiar dependence on the density of the networks. $\nu(n)$ has one peak for the sparsest (fig. 2(c)) and densest networks (fig. 2(e)) and two peaks for the networks of intermediate density (fig. 2(d)). To understand this, note that $\nu(n)=\binom{N}{n}$ if all nodes are equivalent, which is true for the limiting cases of a network without links and a fully-connected network. Let $n'$ be the value of n above which the network is typically fragmented into O(1) subnetworks. For $n>n'$ , the $\nu(n)$ curve is peaked for the same reason that the $\nu(n)$ of a network of isolated nodes is peaked (indeed in the same way as discussed for the square grid): it represents the number of sets that fragment the network plus the number of ways to delete the isolates. The intermediate minimum tells us that when n becomes just large enough for fragmenting the network completely to be possible, the sets of vertices that achieve this are few.

There are two effects that explain the first peak, i.e. why the degeneracy ν grows with n for small n. First, the number of combinations of n elements out of N increases with n. Thus, for homogeneous networks (like the square grid and, to some extent, also the random networks) this leads to an increase of ν. Second, for heterogeneous networks, where the degree distribution is very skewed, the top influencer would be very obvious. One would need to continue to higher n before any degeneracy would be at all likely.

Ill-definedness is not limited to the small networks discussed above; rather, it seems to persist as the network size increases. In fig. 3, we show how the average $\alpha_{\text{avg}}$ of $\alpha(n)$ over all n and the maximum $\alpha_{\max}=\max_n \alpha(n)$ depend on the network size N for ER networks with $p=0.5$ . Both quantities increase with N. Finally, we have investigated other model networks of varying size, all showing single-peaked α curves and ν curves with one or two peaks.

**Fig. 3:** (Color online) The average and maximum α over all n as a function of network size N. The underlying networks are ER model networks with $p=0.5$ . The curves are averaged over > 10³ networks. Error bars would be smaller than the symbol size and are not shown.

The real networks over which diseases spread are believed to have a much more complex structure: heterogeneous degree distributions, community structure, abundant triangles, etc. [18–21]. We have therefore also investigated some empirical contact networks from the network epidemiology literature. Due to computational constraints we have not been able to scan the full range of n, but rather study $\alpha(n)$ for the very lowest values of n only. In fig. 4, we show results for a network of sexual contacts from the article that first argued that HIV is a sexually transmitted infection [22]. It is a small network of only N = 40 nodes, yet more heterogeneous than the random networks studied above. We see a generally growing trend of α, with a sudden dip to zero at n = 7. Some specific $D_n^i$ sets are shown in fig. 4(b), (c) and (d) (for n = 1, 3 and 5, respectively). In fig. 4(d) we can see a typical reason for large degeneracy ν: the node highlighted by an arrow could be replaced by any other node in the (colored) largest component that it is attached to. The values of α that we observe in fig. 4 are larger than for the model networks of fig. 2, even though we have only investigated α for very small n, so the largest α is likely larger still. Our preliminary results suggest that, in general, the average and maximum $\alpha(n)$ values increase with network size N. However, computational constraints prevent a comprehensive study of the N dependence of α.

**Fig. 4:** (Color online) Panel (a) shows the ill-definedness α and degeneracy ν as functions of n on an empirical network of sexual contacts. Panels (b), (c) and (d) show optimal sets of nodes (black) to delete for n = 1, 3 and 5, respectively. The colored areas mark nodes that are members of a largest connected component. The highlighted node in (d) is an example of an optional node in the optimal set: it could be replaced by any other node of the largest connected component it is connected to.

In our final numerical study, we investigate a more realistic severity measure than the size of the largest connected component, namely the expected outbreak size Ω in disease simulations on the network. Since the disease simulations make the analysis yet more computationally demanding, we only show an example network where the ranking according to which to vaccinate nodes is ill-defined. For disease simulations, we use the SIR (susceptible-infected-recovered) model. This is a standard model of diseases that give the infected person immunity upon recovery [23]. It starts from a situation where all nodes are susceptible to the disease except one randomly chosen seed node, which is infected. Infected nodes have a chance λ to recover at any unit of time (we set $\lambda=1$ ). When an infected node is a neighbor of a susceptible node, the susceptible one can become infected with a probability β. We scanned several Erdős-Rényi random graphs (as above) with N = 10 and $p=0.5$ . We ran 10⁶ outbreaks for every set of n nodes to delete and a range of β values.

One challenge in analyzing this severity measure is that one cannot identify degenerate optimal sets (i.e. when $\nu>1$ ): in the simulations, Ω can differ for these sets because of stochastic fluctuations, even though they should in theory be equal. Instead of actually measuring α, we therefore just show that α can be larger than zero (i.e. the ranking problem is ill-defined). This is illustrated in fig. 5, where we show an example of a graph where the optimal sets for n = 1 and n = 2 do not overlap (we can say with $>99{\%}$ confidence that these sets are not degenerate). Interestingly, these optimal sets depend on β. For $\beta=1$ and n = 2, the optimal set is the one that fragments the network the most (as in the study above with S as the severity measure). For $\beta=0.066$ , on the other hand, the optimal set for n = 2 consists of the nodes whose removal decreases the number of links the most (even though after deleting them, the network is still connected). We can understand this since a sparser network gives fewer chances for contagion to occur, and thus a higher chance of the outbreak dying out early (which then decreases the average outbreak size).
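The SIR dynamics described above can be sketched as a discrete-time simulation. Several details here are assumptions, since the text does not specify them: updates are synchronous, the seed is drawn uniformly from the nodes that have not been deleted, and the outbreak size counts every node that was ever infected. The function names and the toy path graph are illustrative only.

```python
import random


def sir_outbreak_size(adj, deleted, beta, rng, lam=1.0):
    """Run one SIR outbreak and return the number of nodes ever infected."""
    candidates = [v for v in adj if v not in deleted]
    infected = {rng.choice(candidates)}  # assumption: seed among undeleted nodes
    recovered = set()
    while infected:
        newly = set()
        for u in infected:
            for v in adj[u]:
                if v in deleted or v in infected or v in recovered or v in newly:
                    continue
                if rng.random() < beta:  # transmission along one link
                    newly.add(v)
        still_infected = set()
        for u in infected:
            if rng.random() < lam:  # with lam = 1 recovery takes one time step
                recovered.add(u)
            else:
                still_infected.add(u)
        infected = still_infected | newly
    return len(recovered)


def expected_outbreak_size(adj, deleted, beta, runs, rng):
    """Monte Carlo estimate of Omega for a given set of deleted nodes."""
    total = sum(sir_outbreak_size(adj, deleted, beta, rng) for _ in range(runs))
    return total / runs


# Toy example (not from the paper): a path of five nodes, 0-1-2-3-4.
path = {i: {j for j in (i - 1, i + 1) if 0 <= j <= 4} for i in range(5)}
rng = random.Random(1)
omega_none = expected_outbreak_size(path, frozenset(), 1.0, 1000, rng)
omega_mid = expected_outbreak_size(path, frozenset({2}), 1.0, 1000, rng)
print(omega_none, omega_mid)  # prints 5.0 2.0
```

With $\beta=1$ the outbreak always fills the seed's connected component, so Ω is governed by fragmentation alone. For smaller β the estimate fluctuates between runs, which is the reason degenerate optimal sets cannot be identified exactly from simulations.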

**Fig. 5:** (Color online) Results for SIR disease spreading on a small example network (drawn from the Erdős-Rényi random graph ensemble with N = 10 and $p=0.5$ ). The optimal sets to vaccinate are shown for two values of β: 1 and 0.05.

Conclusions

We have investigated the problem of ranking influential spreaders on networks, using disease spreading as a working example. Instead of finding a quick heuristic to rank vertices in order of how influential they are, we use exhaustive search of every set of n nodes to delete to find the sets that decrease the severity of the spreading the most. We find that the optimal set of n nodes to delete does not in general correspond to the optimal set of n − 1 nodes to delete augmented by just one extra node. Indeed, in practice this ill-definedness of the ranking problem can be rather severe (up to half of the optimal sets would not carry over to the next value of n for the empirical network of fig. 4). Our study does not necessarily disqualify papers proposing rankings of influential nodes. Indeed, for heterogeneous networks —which most real-world networks are— picking the top n nodes of a ranking is probably rather close to the optimal. On the other hand, to properly evaluate ranking methods, one needs to take this issue into consideration.

The most interesting question we leave open is how these results extend to larger networks (the exhaustive search used here limits us to very small networks and small values of n). However, nothing suggests that the observed effect would vanish in larger networks. Disease spreading in metapopulations essentially follows the same model, and in such settings the ill-definedness and degeneracy of optimal sets could be relevant even for very small networks (networks of farms connected by transport of livestock being one example [24]). For future studies, it would be interesting to investigate the size scaling further. To that end, it would be valuable to find fast heuristic methods for identifying optimal sets for arbitrary n. This should be possible, given the progress in constructing such algorithms for the standard formulation of the vaccination problem.

Acknowledgments

SL was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2016R1A6A3A11932833).
