Abstract
Finding influential spreaders of information and disease in networks is an important theoretical problem, and one of considerable recent interest. It has been almost exclusively formulated as a node-ranking problem —methods for identifying influential spreaders output a ranking of the nodes. In this work, we show that such a greedy heuristic does not necessarily work: the set of most influential nodes depends on the number of nodes in the set. Therefore, the set of n most important nodes to vaccinate does not need to have any node in common with the set of n + 1 most important nodes. We propose a method for quantifying the extent and impact of this phenomenon. By this method, we show that it is a common phenomenon in both empirical and model networks.
Introduction
Rumors, opinions, ideas and infectious disease all spread on networks. To maximize the impact of word-of-mouth marketing or to contain infectious disease outbreaks, it is essential to identify important spreaders —i.e. people that acquire the spreading agent easily and are expected to pass it on to many others. The importance of an individual depends on many factors —the details of disease transmission (we focus on infectious disease spreading from now on), the in-host disease dynamics, and the network structure of the contact patterns, among others. This is the motivation behind the emerging field of network epidemiology [1–3].
Many studies —e.g., refs. [4–12]— have devised methods to rank nodes according to their importance. As there are many ways to vary the underlying assumptions about the structure of contact patterns, the objective function (i.e. how to measure the severity of a disease outbreak), the disease dynamics, and the information available to exploit these structures, these methods are becoming a very rich and diverse theory [13]. Typically, it is implicitly assumed that for vaccination or quarantine, the nodes of a network can be ranked with respect to the objective function: if n nodes are to be vaccinated or quarantined, the optimal choice is to take the top n nodes of the ranking. This assumption can fail. In fig. 1, we show a simple example of how the n most influential nodes —in the sense that deleting them would reduce the largest connected component as much as possible —are not among the optimal to delete for any .
Picking the top n vertices of a ranking as the n most important nodes is a greedy heuristic. It can be motivated by the vaccination optimization problem being NP-hard (or at least by the NP-hardness of the closely related influence maximization problem [14]). How much this assumption fails, and why, are questions yet to be addressed in the literature. The issue has been discussed more for the related problem of influence maximization [14], i.e. the problem of identifying nodes that, as infection seeds, would maximize the expected outbreak (which is known to be fundamentally different from the vaccination problem [15]).
In this work, we show that situations where ranking influential spreaders is ill-defined arise in many networks, not only in this extreme and contrived example. To this end, we explore how common deviations are from situations where the optimal sets $D_n$ of n nodes to delete are fully nested (i.e. $D_n \subset D_{n+1}$ for every n). We first derive a quantity (ill-definedness) that measures the extent of such deviations, then show that ill-definedness is common in simple model networks, and finally make the point that the issue persists even in real-world networks. We also address the issue of degeneracy of the optimal sets.
Preliminaries
We start by introducing some notation. We define a well-defined scenario to be one where one can rank vertices according to influence and where the n most influential nodes are always the first n nodes of that ranking. Given a measure of the severity of a disease outbreak (such as the number of nodes that eventually get the disease), let $D_n$ be an optimal set of n nodes to delete with respect to the severity measure, and let
$$Y_n = \{D_n^1, \dots, D_n^{\nu}\}$$
be the set of all optimal sets of n nodes, with $\nu = |Y_n|$. An optimal set comprises the n nodes that, if deleted (vaccinated), reduce the severity of the disease as much as possible. The degeneracy ν of the optimal set is introduced for situations where there is more than one optimal set; this degeneracy depends on n. Now let
$$a_n(i) = \min_{D \in Y_{n+1}} \left| D_n^i \setminus D \right|,$$
where $D_n^i$ is the i-th optimal set in $Y_n$, and let α be the average value of $a_n(i)$ over all optimal sets in $Y_n$. If the influence ranking problem is perfectly well defined for a network, then $a_n(i) = 0$ for all i and subsequently $\alpha = 0$, that is to say, the optimal sets of n + 1 nodes totally include the optimal sets of n nodes for any value of n. Conversely, the less well defined the ranking, the larger the value of α. We call α the ill-definedness of a network. One can interpret α as the average number of nodes that deviate from the well-defined case.
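As a rough illustration of the quantity just described, the sketch below computes α from two families of optimal sets. It assumes that the per-set deviation is the smallest number of nodes of an optimal n-set that are missing from any optimal (n + 1)-set; this reading, and the function name `ill_definedness`, are assumptions made for the example rather than the authors' code.

```python
from typing import FrozenSet, Hashable, List

NodeSet = FrozenSet[Hashable]


def ill_definedness(y_n: List[NodeSet], y_next: List[NodeSet]) -> float:
    """Average deviation of the optimal n-sets from the optimal (n+1)-sets.

    Assumption: the deviation of one optimal n-set is the minimum, over all
    optimal (n+1)-sets, of the number of its nodes that are not carried over.
    """
    if not y_n or not y_next:
        raise ValueError("both families of optimal sets must be non-empty")
    deviations = [min(len(d - d_next) for d_next in y_next) for d in y_n]
    return sum(deviations) / len(deviations)


# Toy families of optimal sets (hypothetical node labels).
nested = ill_definedness([frozenset({"a"})], [frozenset({"a", "b"})])
disjoint = ill_definedness([frozenset({"a"})], [frozenset({"b", "c"})])
print(nested, disjoint)
```

In the nested case no node is lost when going from n to n + 1, so the value is zero; in the disjoint case the single top node is dropped entirely.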
Results
To understand how common ill-defined rankings are and how the ill-definedness depends on the network, we first study small model networks that allow exhaustive treatment. As the severity measure, we use the size S of the largest connected component. The network models we use are square grids and Erdős-Rényi random graphs [16]. For the square grids, we use open boundary conditions: node (x, y) of the grid is connected to (x + 1, y) unless x = L (and, similarly, (x, y) and (x, y + 1) are connected for y < L). For the random graphs, we start with N isolated nodes, go through all pairs of nodes and add a link between each pair with probability p. To calculate α, we perform an exhaustive search for optimal sets $D_n$ over the entire range of n. As this is computationally very heavy, we have to restrict ourselves to very small networks that nevertheless clearly illustrate the issue. Using linear programming, it should be possible to study larger graphs [17]. In the first analysis, we use N = 9 (i.e. L = 3 square grids).
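The two network models can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function names `square_grid` and `er_graph`, the adjacency-dictionary representation, and the 1-based grid coordinates are choices made for this example.

```python
import itertools
import random
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]


def square_grid(L: int) -> Graph:
    """L x L grid with open boundaries; nodes are (x, y) with 1 <= x, y <= L."""
    adj: Graph = {(x, y): set() for x in range(1, L + 1) for y in range(1, L + 1)}
    for (x, y) in list(adj):
        # Link to the right and upper neighbor when they exist (open boundary).
        for nb in ((x + 1, y), (x, y + 1)):
            if nb in adj:
                adj[(x, y)].add(nb)
                adj[nb].add((x, y))
    return adj


def er_graph(N: int, p: float, rng: random.Random) -> Graph:
    """Start from N isolated nodes and link every pair with probability p."""
    adj: Graph = {i: set() for i in range(N)}
    for i, j in itertools.combinations(range(N), 2):
        if rng.random() < p:
            adj[i].add(j)
            adj[j].add(i)
    return adj


grid = square_grid(3)
num_grid_links = sum(len(nbrs) for nbrs in grid.values()) // 2
random_graph = er_graph(10, 0.3, random.Random(1))  # p = 0.3 is an arbitrary example value
print(len(grid), num_grid_links, len(random_graph))
```

Passing an explicit `random.Random` instance keeps the random graphs reproducible, which helps when averaging over many realizations.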
For the square grid, the ill-definedness α has its maximum at n = 3, dropping to zero as n reaches 4 (see fig. 2(b)). Node importance rankings are thus ill-defined for n < 4. The square grid is simple and symmetric enough to understand in some detail (fig. 2(a)). For n = 3, there are four optimal sets of nodes: the two diagonals {(1, 1), (2, 2), (3, 3)} and {(3, 1), (2, 2), (1, 3)}, and the middle row {(2, 1), (2, 2), (2, 3)} and column {(1, 2), (2, 2), (3, 2)}. The degeneracy is thus ν = 4. Deleting any of these sets reduces S from nine to three. However, for n = 4 there is only one optimal set, consisting of the center nodes of each side, (1, 2), (2, 1), (2, 3) and (3, 2), and so ν = 1. When these nodes are deleted, all other nodes are isolated. Thus for n > 4, deleting these four nodes and any other node in addition would also disconnect the entire network. This also means that, for n > 4, adding any one node to an optimal set of n nodes yields an optimal set of n + 1 nodes. Therefore, for n > 4, $a_n(i) = 0$ for any i and subsequently α = 0.
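The grid example can be checked by brute force. The sketch below enumerates every set of n nodes on the L = 3 grid and keeps those that minimize the largest connected component S. The helper names (`grid_adjacency`, `largest_component`, `optimal_sets`) are hypothetical and the code is only meant to illustrate the exhaustive search, not to reproduce the authors' program.

```python
import itertools


def grid_adjacency(L):
    """Open-boundary L x L grid as a dict: node (x, y) -> set of neighbors."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {
        (x, y): {
            (x + dx, y + dy)
            for dx, dy in steps
            if 1 <= x + dx <= L and 1 <= y + dy <= L
        }
        for x in range(1, L + 1)
        for y in range(1, L + 1)
    }


def largest_component(adj, deleted):
    """Size S of the largest connected component after deleting some nodes."""
    seen = set(deleted)  # deleted nodes are never visited
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first traversal of one component
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best


def optimal_sets(adj, n):
    """Return (minimal S, list of all n-node sets that achieve it)."""
    best_S, best = None, []
    for combo in itertools.combinations(sorted(adj), n):
        S = largest_component(adj, combo)
        if best_S is None or S < best_S:
            best_S, best = S, [frozenset(combo)]
        elif S == best_S:
            best.append(frozenset(combo))
    return best_S, best


grid = grid_adjacency(3)
for n in range(1, 5):
    S, sets = optimal_sets(grid, n)
    print(f"n={n}: S={S}, degeneracy={len(sets)}")
```

For n = 3 and n = 4 the search recovers the four optimal sets with S = 3 and the single optimal set with S = 1 described above. The number of candidate sets grows combinatorially with N, which is why this approach is limited to very small networks.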
The ill-definedness values for random networks (N = 10, averaged over 1000 networks) are shown in fig. 2(c), (d) and (e). For the lowest network density, α is seen to follow a similar peaked shape for low n as for the square grid, even though its maximum value is smaller. As the networks get denser, the peak shifts towards larger values of n. This reflects the fact that it takes more node deletions to disconnect a denser network. For the dense networks of fig. 2(e), there are no sets of one or two nodes whose deletion would fragment the network, and thus any set of one or two vertices is optimal and α = 0 for n < 3.
The degeneracy has a peculiar dependence on the density of the networks: ν has one peak for the sparsest (fig. 2(c)) and densest networks (fig. 2(e)) and two peaks for the networks of intermediate density (fig. 2(d)). To understand this, note that $\nu = \binom{N}{n}$ if all nodes are equivalent, which is true for the limiting cases of a network without links and a fully-connected network. Consider the value of n above which the network is typically fragmented into O(1) subnetworks. Beyond that value, the ν curve would be peaked for the same reason that the ν of a network of isolated nodes is peaked (indeed in the same way as discussed for the square grid): it represents the number of sets to fragment the network plus the number of ways to delete the isolates. The intermediate minimum tells us that when n becomes just large enough that fragmenting the network completely is possible, the sets of vertices whose deletion achieves this are few. There are two effects that explain the first peak, i.e. why for small n the degeneracy ν grows with n. First, the number of combinations of n elements out of N increases with n for small n. Thus, for homogeneous networks (like the square grid and, to some extent, also the random networks) this leads to an increase of ν. Second, for heterogeneous networks, where the degree distribution is very skewed, the top influencer would be very obvious. One would need to continue to higher n before any degeneracy would be at all likely.
Ill-definedness is not limited to the small networks discussed above; rather, it seems to persist as the network size increases. Fig. 3 shows the dependence on the network size N for ER networks: both quantities plotted there increase with N. Finally, we have investigated other model networks of varying size, all showing single-peaked α curves and ν curves with one or two peaks.
The real networks over which diseases spread are believed to have a much more complex structure: heterogeneous degree distribution, community structure, abundant triangles, etc. [18–21]. We have also investigated some empirical contact networks from the network epidemiology literature. Due to computational constraints we have not been able to scan the full range of n, but rather study only the very lowest values of n. In fig. 4, we show results for a network of sexual contacts from the article first arguing that HIV is a sexually transmitted infection [22]. It is a small network of only N = 40 nodes, yet it is more heterogeneous than the random networks studied above. We see a general growing trend of α, with a sudden dip to zero at n = 7. Some specific optimal sets are shown in fig. 4(b), (c) and (d) (for n = 1, 3 and 5, respectively). In fig. 4(d) we can see a typical reason for large degeneracy ν. The node highlighted by an arrow could be replaced by any other node in the (colored) largest component that it is attached to. The actual values of α that we observe in fig. 4 are larger than for the model networks of fig. 2, even though we have only investigated α for very small n, so the largest α is likely larger still. Our preliminary results suggest that, in general, the average and maximum values of α increase with network size N. However, computational constraints prevent a comprehensive study of the N dependence of α.
In our final numerical study, we investigate a more realistic severity measure than the size of the largest connected component, namely the expected outbreak size Ω in disease simulations on the network. Since the disease simulations make the analysis yet more computationally demanding, we only show an example network where the ranking of nodes to vaccinate is ill-defined. For disease simulations, we use the SIR (susceptible-infected-recovered) model. This is a standard model of diseases that give the infected person immunity upon recovery [23]. It starts from a situation where all nodes are susceptible to the disease except one randomly chosen seed node, which is infected. Infected nodes have a chance λ to recover at any unit of time (λ is held fixed). When an infected node is a neighbor of a susceptible node, the susceptible one can become infected with a probability β. We scanned several Erdős-Rényi random graphs (as above) with N = 10. We ran $10^6$ outbreaks for every set of n nodes to delete and a range of β values. One challenge in analyzing this severity measure is that one cannot identify degenerate optimal sets (i.e. when ν > 1): in the simulations, Ω can differ for these sets because of stochastic fluctuations, even though they should in theory be equal. Instead of actually measuring α, we just show that α can be larger than zero (i.e. the ranking problem is ill-defined). This is illustrated in fig. 5, where we show an example of a graph where the optimal sets for n = 1 and n = 2 are not overlapping (we can say with confidence that these sets are not degenerate). Interestingly, these optimal sets depend on β. For some values of β and n = 2, the optimal set is the one that fragments the network the most (as in the study above with S as the severity measure). For other values of β, on the other hand, the optimal set for n = 2 consists of the nodes whose removal would decrease the number of links most (even though after deleting them, the network is still connected).
We can understand this since a sparser network gives fewer chances for contagion to occur, and thus a higher chance of the outbreak dying out early (which then decreases the average outbreak size).
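To make the severity measure Ω concrete, here is a minimal sketch of how the expected outbreak size could be estimated by simulation. Several details are assumptions made for this example and are not specified in the text: time is discrete with synchronous updates, infection attempts happen before recovery within a time step, the seed is drawn uniformly from the non-deleted nodes, and the outbreak size counts every node that was ever infected. The function names are hypothetical.

```python
import random


def sir_outbreak_size(adj, deleted, beta, lam, rng):
    """Run one discrete-time SIR outbreak and return the number of nodes ever infected.

    adj: dict mapping node -> set of neighbors; deleted: vaccinated nodes.
    beta: per-step infection probability per infected-susceptible link.
    lam: per-step recovery probability (must be > 0 for the loop to end).
    """
    alive = [v for v in adj if v not in deleted]
    if not alive:
        return 0
    infected = {rng.choice(alive)}  # one random seed; all others susceptible
    recovered = set()
    while infected:
        newly = set()
        for v in infected:
            for w in adj[v]:
                susceptible = (
                    w not in deleted and w not in infected and w not in recovered
                )
                if susceptible and rng.random() < beta:
                    newly.add(w)
        healed = {v for v in infected if rng.random() < lam}
        recovered |= healed
        infected = (infected - healed) | newly
    return len(recovered)


def expected_outbreak_size(adj, deleted, beta, lam, runs, rng):
    """Monte Carlo estimate of the expected outbreak size over `runs` outbreaks."""
    deleted = set(deleted)
    total = sum(sir_outbreak_size(adj, deleted, beta, lam, rng) for _ in range(runs))
    return total / runs


# Hypothetical toy network: a path 0 - 1 - 2 - 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
rng = random.Random(42)
omega_none = expected_outbreak_size(path, set(), beta=0.5, lam=0.5, runs=2000, rng=rng)
omega_cut = expected_outbreak_size(path, {1}, beta=0.5, lam=0.5, runs=2000, rng=rng)
print(omega_none, omega_cut)
```

Because the estimate is stochastic, two sets with identical true Ω will generally give slightly different values, which is the difficulty with identifying degenerate optimal sets mentioned above.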
Conclusions
We have investigated the problem of ranking influential spreaders on networks, using disease spreading as a working example. Instead of finding a quick heuristic to rank vertices in order of how influential they are, we use exhaustive search of every set of n nodes to delete to find the sets that decrease the severity of the spreading the most. We find that the optimal set of n nodes to delete does not in general correspond to the optimal set of n − 1 nodes to delete augmented by just one extra node. Indeed, in practice this ill-definedness of the ranking problem can be rather severe (up to half of the optimal sets would not carry over to the next value of n for the empirical network of fig. 4). Our study does not necessarily disqualify papers proposing rankings of influential nodes. Indeed, for heterogeneous networks —which most real-world networks are— picking the top n nodes of a ranking is probably rather close to the optimal. On the other hand, to properly evaluate ranking methods, one needs to take this issue into consideration.
The most interesting question we leave open is how these results extend to larger networks (the exhaustive search used here limits us to very small networks and values of n). However, nothing suggests that the observed effect would vanish in larger networks. Disease spreading in metapopulations essentially follows the same model, and in such settings the ill-definedness and degeneracy of optimal sets could be relevant even for very small networks (networks of farms connected by transport of livestock being one example [24]). For future studies, it would be interesting to investigate the size scaling further. This would require fast heuristic methods for finding optimal sets for arbitrary n. Such methods should be possible, given the progress in constructing such algorithms for the standard formulation of the vaccination problem.
Acknowledgments
SL was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2016R1A6A3A11932833).