Development of Social Networking Application Journals
Physics Letters A
www.elsevier.com/locate/pla
ARTICLE INFO

Article history:
Received 25 June 2020
Received in revised form 28 August 2020
Accepted 6 September 2020
Available online 15 September 2020
Communicated by M. Perc

Keywords:
Social network
Influential nodes
Invulnerability
Global efficiency
Local efficiency

ABSTRACT

Detecting influential nodes is still a popular issue in social networks, and many excellent detection methods have been put forward. However, most of them aim to improve the accuracy and efficiency of the algorithm but ignore the invulnerability of networks. Based on essential factors of influence propagation (such as the location and neighborhood of the source node and the propagation rate) and on network invulnerability, we propose a novel strategy to search for influential nodes in terms of the local topology and the global location. Two important indicators are node diffusion degree and node cohesion degree, which are used to increase the probability of influence diffusion and reduce the feasibility of network collapse. More specifically, the loss of global efficiency and the loss of local efficiency are applied to evaluate the impact of the algorithm from the perspective of network invulnerability. The experimental results on real networks show that our method achieves an excellent balance between detection accuracy and network invulnerability. The detected influential nodes are the ones that have great influence and can resist certain damage and disturbance of the networks.
© 2020 Elsevier B.V. All rights reserved.
* Corresponding author at: College of Mathematics and Informatics, Fujian Normal University, Fuzhou, Fujian 350117, PR China.
E-mail address: [email protected] (S. Zhou).
https://doi.org/10.1016/j.physleta.2020.126879
0375-9601/© 2020 Elsevier B.V. All rights reserved.

1. Introduction

With the rapid development of microblogging, mobile communication, epidemiology and biology, the social network remains a hot research topic in complex networks. A social network is a typical complex network formed by different individuals in the real world through mutual connection so that information can spread among the crowd [1]. Among the hot topics of social network analysis, the detection of influential nodes remains a popular one. The influential nodes in a social network are those nodes that can spread information widely and quickly. For example, in a social network such as a microblog, if some key nodes are mastered, messages can be spread through the network quickly [2]; in an epidemic network, if influential nodes are controlled, the diffusion of the epidemic is reduced [3].

An enormous amount of pioneering research on detecting influential nodes has been carried out. Typical methods are based on network topology, such as Degree centrality [4,5], Betweenness centrality [6], Closeness centrality [7], k-shell centrality [8], and so on. Due to the high efficiency of the k-shell algorithm, many improved strategies [9–11] have been proposed subsequently. Another kind of method is based on random walk. The most famous ones are PageRank [12] and LeaderRank [13]. Lv et al. [14] proposed a new measure for ranking nodes of temporal networks based on PageRank. In addition, some researchers use node deletion or contraction to distinguish the importance of nodes. They firmly believe that “destructiveness is the importance”, which is closely related to the fault-tolerance or invulnerability of networks. Arulselvan et al. [15] emphasized the number of connected branches of the network after node deletion, while Addis et al. [16] considered the minimum connectivity after node deletion. Li et al. [17] adopted the change of path loss caused by deleting nodes as the degree of the network damage, while Chen et al. [18] used the number of spanning trees to evaluate the importance of nodes when deleting nodes. Unlike node deletion, contraction merges a node and all its neighbors into a new node, that is, it replaces the node and all its neighbors with a new one. Tan et al. [19] designed a search strategy to fix important nodes based on contraction in the network. Yang et al. [20] proposed two novel degree discount or reward heuristic decomposition strategies to identify and rank the influential invaders in dynamically evolving cooperative communities based on statistical measures, which will be useful in general intelligent ecosystems. Latora et al. [21] proposed the network efficiency, which was later used to measure the influence of nodes [22].

All of the methods above have their own advantages and disadvantages. The past two decades have witnessed the fusion of various methods to rank influential nodes [23–27]. Sheng et al.
2 G. Chen et al. / Physics Letters A 384 (2020) 126879
[28] suggested a method combining the structural characteristics of the network and the influence of the nodes to determine influential nodes. Zhao et al. [29] proposed a quantitative approach to measure the global influence of all nodes, which is particularly effective in identifying actually important nodes that seem less important. Wang et al. [30] proposed an improved algorithm based on the k-shell and node information entropy to identify influential nodes. Generally, the influence of nodes in the network is related to the network size, the location of nodes, the number of the node’s neighbors and the speed of influence dissemination. It is a reasonable idea to measure the network location from the perspective of network invulnerability, which is very consistent with the actual natural situation. In most situations, we use the epidemic network model [31] as the standard to rank the influence of nodes. In the epidemic network, the more important a node is, the more nodes are in the infected state when the network reaches the stable status. To achieve this goal, some methods rely on the location and the number of neighbors of the source node in the network, while others depend on the ability of a node to infect other nodes, i.e., the probability of epidemic transmission. To reduce the spread scale of an epidemic, the possibly infected nodes should be controlled and isolated. Thus, deleting or contracting these nodes from the network can control the spread of the epidemic effectively. Therefore, to deal with the problem in the reverse direction, it is fair and sensible to measure the position of nodes from the perspective of network invulnerability. The impact of node deletion or contraction on the network [32,33] is worthy of application and promotion.

Inspired by the idea above, we propose an influential node detection strategy based on network invulnerability, taking into account factors such as the neighborhood of nodes, the location of nodes and the propagation probability of the node influence. The main contributions of the paper are as follows.

(1) A novel method is proposed to detect the influential nodes, which not only considers the local topological characteristics, but also combines the network size and the relative global position of nodes, as well as network invulnerability;
(2) As for the impact of hops, the reasonable number of hops is selected through experiments, and no more external parameters are essential. Moreover, with regard to the attenuation, neighbors farther away from the source have a weaker propagation probability than those closer to the source;
(3) The loss of global efficiency and the loss of local efficiency are used to evaluate the effect of influence ranking. Experiments show that our method has a better balance between the accuracy of detecting key nodes and network invulnerability.

The rest of this paper is organized as follows. In Section 2, we introduce the relevant knowledge and notations. The main idea and the suggested algorithm are proposed in Section 3. Section 4 theoretically analyzes the experimental results, and Section 5 concludes the paper.

2. Related work

In this section, we first introduce some related knowledge. A social network is constructed as a graph, which is represented by G = (V, E), where each vertex i ∈ V denotes a node and each edge (i, j) ∈ E denotes a relation between nodes i and j. The neighborhood N(i) of node i in the network G is defined as the set of all nodes which are adjacent to node i, i.e., N(i) = {j ∈ V | (i, j) ∈ E}. For further information on graphs and networks, we refer the reader to the monograph [34].

In this paper, the influence propagation rate of nodes mainly focuses on the spread process from the source node to its neighbors. Therefore, the methods related to node diffusivity are introduced first.

2.1. Methods based on diffusion

The Degree method is a typical diffusion method. Degree centrality, also known as local centrality, refers to the number of nodes directly associated with a node i, denoted by d(i). Degree centrality is one-sided because it only considers direct neighbors and ignores the influence of indirect neighbors. The semi-local centrality, proposed by Chen et al. [4], considers neighbors beyond one hop. The semi-local centrality is defined by the equation

$$ C(i) = \sum_{j \in N(i)} \sum_{k \in N(j)} d(k). $$

Although the semi-local centrality extends the scope of diffusion and applies more indirect neighbors, it simply adds the degrees of neighbors in each layer. Here the implicit problem is that neighbors at different hops have different influence on the source node, which is related to the levels they are located at. A probability-based multi-hop diffusion method, which distinguishes the influence of nodes among different levels of neighbors, was proposed by Nguyen et al. [35]. They suggested that if only one-hop neighbors are considered, the diffusion value of node i, denoted by dp(i), is calculated by the equation

$$ dp(i) = d(i) + \sum_{j \in N(i)} p_{ij}\, d(j), \qquad (1) $$

where p_{ij} is the diffusion probability from node i to node j. Furthermore, if two-hop neighbors are also considered, the diffusion value is calculated by the equation

$$ dp(i) = d(i) + \sum_{j \in N(i)} p_{ij}\, d(j) + \sum_{k \in N(j)} p_{ik}\, d(k), \qquad (2) $$

where p_{ik} is the diffusion probability from node i to node k, subject to k being a two-hop neighbor of node i. This probability is the product of the diffusion probability from node i to node j and that from node j to node k, i.e., p_{ik} = p_{ij} p_{jk}.

Clearly, the diffusion method is a local method with respect to degree. If it is used to measure the influence, the usage of the diffusion probability from one node to its neighbors within a few hops is more appropriate. In this paper, we name the strategy that uses Eq. (2) to calculate the diffusion value of each node the diffusion degree, which represents the direct diffusion speed of node i to one-hop neighbors and two-hop neighbors.

2.2. Method based on node contraction

Tan et al. [19] proposed a novel idea to rank the importance of nodes with the aid of node contraction instead of node deletion. In fact, if there are a lot of nodes with degree one in the network, the traditional calculation of the connected branches of the remaining network cannot accurately distinguish their importance when deleting these terminal nodes. The node contraction method is a reasonable alternative to node deletion. When a node contracts, it will not lead to great damage to the connectivity of the network. However, in light of the change of important characteristics of the network, it will reflect the difference of node importance.

The concept of cohesion was suggested by Tan and Deng [19]. Intuitively, the more people are connected with each other in a social network, the shorter the average path length is, and the higher the cohesion is; the fewer the number of people in the network,
the smaller the proportion of individuals, and the higher the cohesion is. Thus, the network cohesion ∂[G] is defined by the equation

$$ \partial[G] = \frac{1}{n l} = \frac{1}{n \cdot \dfrac{\sum_{i \neq j \in V} d_{ij}}{n(n-1)}} = \frac{n-1}{\sum_{i \neq j \in V} d_{ij}}, \qquad (3) $$

where n is the network size, l is the average path length, and d_{ij} is the shortest distance between nodes i and j. Furthermore, the cohesion of node i, denoted by coh(i), is reflected by the change of network cohesion when node i is contracted, which is represented by the equation

$$ \mathrm{coh}(i) = 1 - \frac{\partial[G]}{\partial[G \circ i]}, \qquad (4) $$

where G ∘ i is obtained by contracting node i in G.

In fact, when a node is damaged or maliciously attacked, it can be regarded as deleted or contracted from the network. Different from other methods, the method based on node contraction can measure network invulnerability efficiently, which is a novel point of view and of great significance.

3. Our method

3.1. The proposed method

The cohesion is not only an index to characterize network invulnerability, but also a global reflection. The cohesion is measured by the global shortest distance and the network size. The combination of a global method and a local method can integrate their advantages and greatly improve performance.

In addition, most of the criteria for detecting the influential nodes are based on some models, such as the IC model [36], the LT model [37] and the SIR model [31]. In the SIR model, the epidemic starts from a node or a set of nodes, and spreads the infection state to the neighbors. At the next step, each infected node transforms to a recovery state with a certain probability, and this process is iterated until the state of all nodes is stable. In the final state, the total number of nodes in the infected and recovery states is the indicator used to measure the influence of the source node. Achieving wider dissemination is related to several factors: the first is the network size, the second is the location of the source node and its neighbors, and the third is the ability of the epidemic to spread from one node to the surrounding nodes, which can be simply abstracted into the diffusion speed. In view of the discussion in Section 2, the cohesion can measure the relative position of nodes and reflect the scale of the network. Although the probabilistic diffusion model is not a time-dependent model, it is related to the number of neighbors’ hops. When the influence of nodes spreads, it can only spread to the neighbors in a time interval. Therefore, the diffusion probability can be considered to be positively related to the diffusion speed. Thus, from the perspective of network invulnerability, we propose an influential node detection method based on the diffusion degree and the cohesion with the consideration of network size, node location and diffusion probability. Additionally, in order to differentiate the influence of neighbors in different hops, we introduce a parameter μ, the number of hops. Referring to Eq. (2), the diffusion degree, denoted by dpd(i)_μ, is expressed by

$$ \mathrm{dpd}(i)_{\mu} = d(i_0) + \sum_{\mu=1}^{\mu} \sum_{i_{\mu} \in N(i_{\mu-1})} \left( p^{\mu}\, d(i_{\mu}) \right), \qquad (5) $$

where i_0 represents the target node i, i_μ represents the μth hop neighbor of node i, and d(i_μ) represents the degree of the μth hop neighbor of node i. Different from Eq. (2), when calculating the diffusion degree, we use the same diffusion probability p, because the propagation probability in the SIR model is a constant. It should be noted that the propagation probability is usually very small (usually around 0.1), so it is of little significance to distinguish the difference between the values of p_{ij} and p_{jk} in Eq. (2). Therefore, all diffusion probabilities are unified as the propagation probability in the SIR model. Then, when considering the attenuation of propagation ability, we use the power value to replace the product of different probabilities.

On the other hand, when the importance of nodes is reflected by the cohesion value, we introduce a parameter μ related to the number of hops, in keeping with the multi-hop neighbors of Eq. (5). When contracting a node and its direct neighbors, μ = 1; when contracting a node and its neighbors within two hops, μ = 2, and so on. More specifically,

(1) when μ = 1, $\mathrm{coh}(i)_1 = 1 - \dfrac{\partial[G]}{\partial[G \circ i]}$;
(2) when μ = 2, $\mathrm{coh}(i)_2 = 1 - \dfrac{\partial[G_1]}{\partial[G_1 \circ i_1]}$.

Therefore, we get the following equation

$$ \mathrm{coh}(i)_{\mu} = 1 - \frac{\partial[G_{\mu-1}]}{\partial[G_{\mu-1} \circ i_{\mu-1}]}, \qquad (6) $$

where G_μ represents the new graph obtained by contracting node i_{μ−1} and its neighbors from G_{μ−1}, and i_μ represents the new node after node i_{μ−1} is contracted in graph G_{μ−1} (where G_0 refers to G, and i_0 refers to i).

The fusion of these two methods not only employs the network invulnerability, but also takes into account the network size, the location of nodes and the propagation probability. At the same time, we combine the advantages of global cohesion and local diffusion. In the specific method, if the parameters are added directly for linear fusion, there will be two more adjustment parameters, and the different values of parameters will bring more uncertainty to the method. Here, we compare the relationship among the node’s actual influence (derived from the SIR model) and the results based on the diffusion degree and cohesion value (hereinafter referred to as cohesion degree) through experiments in the actual network.

The comparison result among cohesion degree, diffusion degree and expected influence value is shown in Fig. 1. It is easy to see from Fig. 1 that when the function in Eq. (7) is used for fitting, the effect is relatively good because, except for a few nodes with different colors, most of the node colors are closely matched. Therefore, we finally define the node influence by the equation

$$ \mathrm{IMP}(i) = \mathrm{dpd}(i)_{\mu_1} \ast \alpha^{\mathrm{coh}(i)_{\mu_2}}, \qquad (7) $$

where μ1 and μ2 represent the hops of neighbors that need to be considered in the diffusion degree and the cohesion degree, respectively. Due to the complexity of cohesion degree, if a unified μ is used, the cost may be slightly higher. In Section 4, the choice of μ1 and μ2 is discussed. In Eq. (7), α is the base of the exponential function; different base numbers will not affect the function greatly. Here we take α = 10 for the simplicity of calculation.

3.2. Algorithm and complexity analysis

The proposed method is called the dpc method for short, as shown in Algorithm 1. The complexity of the algorithm is mainly composed of Eq. (5) and Eq. (6), while Eq. (7) will not bring extra burden. In the calculation of our algorithm, μ1 = 2 and μ2 = 1; the reason for this choice will be explained in Section 4. In the algorithm, steps 7 to 9 only need to cycle through all nodes once, and the complexity is O(n). The complexity of steps 1 to 3 relies on the process of traversing all nodes once, and the complexity is also O(n). The highest complexity is from step 4 to 6. The complexity of each
Fig. 1. The relationship among node influence, diffusion degree and cohesion degree in (a) Karate network and (b) Infectious network. The x-axis represents the cohesion
degree computed by Eq. (6) with μ = 1, while the y-axis refers to the diffusion degree derived from Eq. (5) with μ = 2. The node color varies with the influence value from
SIR model in left panels and the proposed method by Eq. (7) in the right panels, respectively. (For interpretation of the colors in the figure(s), the reader is referred to the
web version of this article.)
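
The two axes and the color scale of Fig. 1 can be illustrated with a minimal Python sketch of Eq. (5) and Eq. (7). This is not the authors' implementation. It assumes an undirected graph stored as an adjacency dictionary, and it reads the nested sum in Eq. (5) literally, so a node reached through several walks (including the source itself at two hops) is counted once per walk. The function names `diffusion_degree` and `imp` are hypothetical, and the cohesion value passed to `imp` is assumed to have been computed separately by Eq. (6).

```python
def diffusion_degree(adj, i, p, hops=2):
    """Sketch of Eq. (5): d(i) plus p**h * d(node) summed over walks of length h <= hops.

    Assumption: the inner sum runs over N(i_{h-1}) for every hop-(h-1) endpoint,
    so repeated endpoints are counted once per walk.
    """
    total = float(len(adj[i]))          # d(i_0)
    frontier = [i]                      # walk endpoints at the current hop
    for h in range(1, hops + 1):
        frontier = [nbr for node in frontier for nbr in adj[node]]
        total += sum((p ** h) * len(adj[nbr]) for nbr in frontier)
    return total


def imp(dpd_value, coh_value, alpha=10.0):
    """Sketch of Eq. (7): IMP(i) = dpd(i) * alpha ** coh(i)."""
    return dpd_value * alpha ** coh_value


# Toy path graph 0 - 1 - 2 (hypothetical data, not one of the paper's networks).
path = {0: [1], 1: [0, 2], 2: [1]}
scores = {v: diffusion_degree(path, v, p=0.1, hops=2) for v in path}
```

With μ1 = 2 as chosen in the text, `hops=2` corresponds to the y-axis of Fig. 1; the base `alpha=10` follows the value stated after Eq. (7).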
G. When sorting, all nodes in the network need to be contracted once; the time complexity of the contraction method is O(k^2 n), and the complexity of the worst case is O(n^3). In summary, the total complexity of the dpc method is O(k^2 n), and the worst case is O(n^3).

4. Experiment and analysis

In this section, we use the actual networks to compare and analyze the performance of the method. We not only verify the effectiveness through different views, but also use a quantitative

The experimental data sets used are listed as follows, and the relevant topological properties are shown in Table 1.

• Karate network [38]: This network is a well-known data set, which contains the friendships between the 34 members of a karate club at a US university.
• Infectious network [39,40]: This network describes the face-to-face behavior of people during the exhibition INFECTIOUS: STAY AWAY. Nodes represent exhibition visitors; edges represent face-to-face contacts that were active for at least 20 seconds.
Fig. 2. The visualization influence results in Karate network with different methods, where the color of nodes represents different influence values. These methods are (a)
Degree centrality, (b) Closeness centrality, (c) Betweenness centrality, and (d) the dpc method in Karate network, respectively.
Fig. 6. The Kendall coefficient τ of the dp, coh, and dpc methods as a function of the propagation probability β in the network, where β ∈ [0.01, 0.20] and τ ∈ [0, 1]. The results are averaged over 100 independent runs.
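
As an illustration of the rank correlation reported in Fig. 6, the following Python sketch computes a Kendall coefficient between two score lists, for example a method's scores and the SIR-derived influence of the same nodes. It is an assumption-laden sketch: it uses the simple tau-a form (concordant minus discordant pairs over all pairs) without the tie corrections discussed in [46], and the exact variant used in the paper is not stated in this excerpt. The function name `kendall_tau` is hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equally long score lists (ties count as neither)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two items")
    concordant = discordant = 0
    for a, b in combinations(range(len(x)), 2):
        sign = (x[a] - x[b]) * (y[a] - y[b])
        if sign > 0:
            concordant += 1      # pair ordered the same way in both lists
        elif sign < 0:
            discordant += 1      # pair ordered in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A value near 1 means the two rankings agree on almost every pair of nodes; a value near 0 means they are essentially unrelated.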
Table 2
The rank of the top-10 influential nodes (in descending order) is obtained by three methods, i.e.,
diffusion degree (dp), cohesion degree (coh), and the dpc method.
$$ E_{loc} = \frac{1}{n} \sum_{i \in G} E(G_i), \qquad (9) $$

where d_{ij} represents the shortest distance between node i and node j, n represents the number of nodes, and E(G_i) is the global efficiency of the subgraph induced by node i and its neighbors.

We select three different nodes from the top-10 nodes detected by three different methods, and compare the changes of network efficiency after they are deleted, respectively (see Fig. 10).

(1) Karate network: We chose three nodes, 23, 19 and 30, because node 23 in the top-10 of the dp method does not appear in the top-10 of the other two methods, node 19 in the top-10 of the coh method does not appear in the top-10 of the other two methods, and node 30 in the top-10 of the dpc method also does not appear in the top-10 of the other two methods (see Table 2). In Fig. 10(a), after node 23 is deleted, the global efficiency increases instead of decreasing, which indicates that this node has a blocking effect on the propagation of the original network influence. However, the global efficiency is reduced after deleting node 19 or node 30, and it is reduced more sharply for node 19, which shows that node 19 selected by the coh method has a greater impact on the propagation of network influence than node 30 selected by the dpc method, both of which are better than node 23 selected by the dp method. In Fig. 10(a), after the three nodes are deleted, the local efficiency does not decrease but increases. The less the reduction, the less the negative benefit brought by the node deletion, that is, the more important the node is.

(2) Infectious network: According to the same selection criteria, we choose three nodes, 293, 49 and 187, in the Infectious network. Among them, node 293 only appears in the top-10 of the dp method, node 49 only appears in the top-10 of the coh method, and node 187 only appears in the top-10 of the dpc method. In Fig. 10(b), after deleting the corresponding nodes, the global efficiency and local efficiency are reduced. The local efficiency decreases the most for node 49, while the global efficiency decreases the most for node 187, which shows that the nodes found by the coh method and the dpc method are more invulnerable than those found by the dp method.

(3) Netscience network: The top-10 nodes obtained by the dp and dpc methods largely coincide, containing the same nodes in a different order. So we choose node 14 in the dp method and node 50 in the dpc method, which are in the 5th place in each method. In the coh method, node 168 is selected, which does not appear in the top-10 of the dp method and the dpc method. As can be seen in Fig. 10(c), both the global efficiency loss and the local efficiency loss of nodes 168, 50 and 14 decrease, which indicates that the network invulnerability of coh, dpc and dp decreases.
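
The efficiency comparison described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' code. Since the equation for global efficiency is not included in this excerpt, the sketch assumes the standard definition from [21], E(G) = (1 / (n(n−1))) Σ_{i≠j} 1/d_{ij}, with unreachable pairs contributing zero. Local efficiency follows Eq. (9) with G_i taken as the subgraph induced by node i and its neighbors, as worded in the text. All function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Unweighted shortest-path distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def global_efficiency(adj):
    """Assumed standard definition: mean of 1/d_ij over ordered pairs i != j."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for u in adj:
        dist = bfs_distances(adj, u)
        total += sum(1.0 / d for v, d in dist.items() if v != u)
    return total / (n * (n - 1))


def induced_subgraph(adj, nodes):
    keep = set(nodes)
    return {u: [v for v in adj[u] if v in keep] for u in keep}


def local_efficiency(adj):
    """Eq. (9): average global efficiency of the subgraphs induced by i and N(i)."""
    n = len(adj)
    if n == 0:
        return 0.0
    parts = (global_efficiency(induced_subgraph(adj, [i] + list(adj[i]))) for i in adj)
    return sum(parts) / n


def remove_node(adj, node):
    return {u: [v for v in nbrs if v != node] for u, nbrs in adj.items() if u != node}


def efficiency_loss(adj, node):
    """Change (before minus after) in global and local efficiency when node is deleted."""
    reduced = remove_node(adj, node)
    return (global_efficiency(adj) - global_efficiency(reduced),
            local_efficiency(adj) - local_efficiency(reduced))
```

A negative loss under this sign convention corresponds to the case noted for the Karate network, where efficiency increases after a node is deleted.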
Fig. 7. The total number of infected and recovered nodes F (t ) changes with time t when the top-10 influential nodes in the network are used as the source nodes in the SIR
model, where t ranges from 0 to 20 or 30. The selection of 30 as the maximum value of t in Protein network is due to the larger number of nodes. Therefore, it takes more
time to reach the stable state. The results are averaged over 100 independent runs.
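
The quantity F(t) plotted in Fig. 7 can be illustrated with a small discrete-time SIR simulation in Python. The sketch below makes several assumptions not confirmed by this excerpt: updates are synchronous, every infected node tries to infect each susceptible neighbor with probability `beta`, and every infected node then recovers with probability `gamma` (a hypothetical parameter name, defaulting to 1). The function names are also hypothetical.

```python
import random


def sir_spread(adj, sources, beta, gamma=1.0, steps=20, rng=None):
    """Return F(t) for t = 0..steps: the number of infected plus recovered nodes."""
    if rng is None:
        rng = random.Random()
    infected = set(sources)
    recovered = set()
    history = [len(infected)]
    for _ in range(steps):
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                # only susceptible neighbors can be infected
                if v not in infected and v not in recovered and rng.random() < beta:
                    newly_infected.add(v)
        newly_recovered = {u for u in infected if rng.random() < gamma}
        infected = (infected - newly_recovered) | newly_infected
        recovered |= newly_recovered
        history.append(len(infected) + len(recovered))
    return history


def average_spread(adj, sources, beta, runs=100, seed=0, **kwargs):
    """Average F(t) over independent runs, mirroring the 100-run averaging in the caption."""
    rng = random.Random(seed)
    curves = [sir_spread(adj, sources, beta, rng=rng, **kwargs) for _ in range(runs)]
    return [sum(curve[t] for curve in curves) / runs for t in range(len(curves[0]))]
```

Passing the top-10 nodes of a ranking as `sources` and comparing the resulting curves across methods reproduces the kind of comparison the figure describes, under the assumptions listed above.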
Fig. 8. The comparison between the contraction of one-hop neighbors and two-hop neighbors by the coh method in Karate network. The x-axis β represents the propagation probability, while the y-axis τ is the Kendall coefficient.
Fig. 9. The comparison between the contraction of one-hop neighbors and two-hop neighbors by the dpc method in Karate network, where the parameter μ2 in Eq. (7) represents the number of hops to contract.
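
The one-hop and two-hop contraction compared in Figs. 8 and 9 can be sketched in Python from Eqs. (3), (4) and (6). This is a hedged illustration, not the authors' Algorithm 1. It assumes a connected, undirected graph given as an adjacency dictionary, sums d_{ij} over ordered pairs in Eq. (3), and assigns cohesion 1 to a single-node graph (an assumption needed when a contraction absorbs the whole network). The names `network_cohesion`, `contract` and `cohesion_degree` are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def network_cohesion(adj):
    """Eq. (3): (n - 1) / sum of shortest distances over ordered pairs i != j."""
    n = len(adj)
    if n == 1:
        return 1.0                       # assumption: a single node is maximally cohesive
    total = 0
    for u in adj:
        dist = bfs_distances(adj, u)
        if len(dist) != n:
            raise ValueError("graph must be connected")
        total += sum(dist.values())
    return (n - 1) / total


def contract(adj, node):
    """Merge node and all its neighbors into one node that keeps the label `node`."""
    merged = {node} | set(adj[node])
    new_adj = {}
    for u, nbrs in adj.items():
        if u == node:
            outside = set()
            for m in merged:
                outside.update(w for w in adj[m] if w not in merged)
            new_adj[node] = list(outside)
        elif u not in merged:
            new_adj[u] = list({node if w in merged else w for w in nbrs})
    return new_adj


def cohesion_degree(adj, node, mu=1):
    """Eq. (6): 1 - cohesion(G_{mu-1}) / cohesion(G_{mu-1} contracted at i_{mu-1})."""
    graph = adj
    for _ in range(mu - 1):
        graph = contract(graph, node)    # build G_1, G_2, ... step by step
    return 1.0 - network_cohesion(graph) / network_cohesion(contract(graph, node))
```

On a five-node path, for instance, contracting the middle node shrinks the network more than contracting an end node, so the middle node receives the larger cohesion degree under this sketch.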
Fig. 10. The comparison between the global efficiency loss and local efficiency loss, when three nodes in different top-10 sequences by different methods are deleted. Pink
bar represents the change of global efficiency, while purple bar is the change of local efficiency.
is also better than two other methods. That is to say, the impact of the influential nodes that we detected on the whole network is relatively large once these nodes fail or are isolated. Therefore, the proposed method achieves a good balance between the ranking accuracy and the network invulnerability. Subsequently, for many networks with edge attributes and node attributes, our method can be extended to the weighted networks.

CRediT authorship contribution statement

Gaolin Chen: Conceptualization, Methodology, Writing - original draft preparation, Software. Shuming Zhou: Supervision, Validation. Jiafei Liu: Writing - review & editing. Min Li: Software, Visualization. Qianru Zhou: Investigation.

Declaration of competing interest

The authors declare that there is no conflict of interest regarding the publication of this paper. Neither the entire paper nor any part of its content has been published or has been accepted elsewhere. It is not being submitted to any other journal.

Acknowledgement

This work was partly supported by the National Natural Science Foundation of China (Nos. 61977016 and 61572010), Natural Science Foundation of Fujian Province (No. 2017J01738) and Education and Scientific Research Project for Young and Middle-aged Teachers of Fujian Province (No. JT180077).

References

[1] X.-D. Wu, Y. Li, L. Li, Influence analysis of online social networks, Chinese J. Comput. 37 (4) (2014) 735–752.
[2] J. Lu, W. Wan, Identification of key nodes in microblog networks, ETRI J. 38 (1) (2016) 52–61.
[3] F.D. Malliaros, M.-E.G. Rossi, M. Vazirgiannis, Locating influential nodes in complex networks, Sci. Rep. 6 (2016) 19307.
[4] D.-B. Chen, L.-Y. Lü, M.-S. Shang, Y.-C. Zhang, T. Zhou, Identifying influential nodes in complex networks, Phys. A, Stat. Mech. Appl. 391 (2012) 1777–1787.
[5] G. Lawyer, Understanding the influence of all nodes in a network, Sci. Rep. 5 (1) (2015) 1–9.
[6] J. Bae, S. Kim, Identifying and ranking influential spreaders in complex networks by neighborhood coreness, Phys. A, Stat. Mech. Appl. 395 (2014) 549–559.
[7] L.C. Freeman, Centrality in social networks conceptual clarification, Soc. Netw. 1 (3) (1978) 215–239.
[8] L.-Y. Lü, D.-B. Chen, X.-L. Ren, Q.-M. Zhang, Y.-C. Zhang, T. Zhou, Vital nodes identification in complex networks, Phys. Rep. 650 (2016) 1–63.
[9] G. Maji, A. Namtirtha, A. Dutta, M.C. Malta, Influential spreaders identification in complex networks with improved k-shell hybrid method, Expert Syst. Appl. 144 (2020) 113092.
[10] G. Maji, Influential spreaders identification in complex networks with potential edge weight based k-shell degree neighborhood method, J. Comput. Sci. 39 (2020) 101055.
[11] A. Zeng, C.-J. Zhang, Ranking spreaders by decomposing complex networks, Phys. Lett. A 377 (14) (2013) 1031–1035.
[12] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, Comput. Netw. ISDN Syst. 30 (1–7) (1998) 107–117.
[13] L. Lü, Y.-C. Zhang, C.H. Yeung, T. Zhou, Leaders in social networks, the delicious case, PLoS ONE 6 (6) (2011) e21202.
[14] L. Lv, K. Zhang, T. Zhang, D. Bardou, J. Zhang, Y. Cai, Pagerank centrality for temporal networks, Phys. Lett. A 383 (12) (2019) 1215–1222.
[15] A. Arulselvan, C.W. Commander, L. Elefteriadou, P.M. Pardalos, Detecting critical nodes in sparse graphs, Comput. Oper. Res. 36 (7) (2009) 2193–2200.
[16] B. Addis, M. Di Summa, A. Grosso, Identifying critical nodes in undirected graphs: complexity results and polynomial algorithms for the case of bounded treewidth, Discrete Appl. Math. 161 (16–17) (2013) 2349–2360.
[17] P.-X. Li, Y.-Q. Ren, Y.-M. Xi, An importance measure of actors (set) within a network, Syst. Eng. 22 (4) (2004) 13–20.
[18] Y. Chen, A.-Q. Hu, J. Hu, L. Chen, A method for finding the most vital node in communication networks, High Technol. Lett. 1 (2) (2004) 573–575.
[19] Y.-J. Tan, J. Wu, H.-Z. Deng, Evaluation method for node importance based on node contraction in complex networks, Syst. Eng. Theory Pract. 11 (11) (2006) 79–83.
[20] G.-L. Yang, T.P. Benko, M. Cavaliere, J.-C. Huang, M. Perc, Identification of influential invaders in evolutionary populations, Sci. Rep. 9 (1) (2019) 1–12.
[21] V. Latora, M. Marchiori, Efficient behavior of small-world networks, Phys. Rev. Lett. 87 (19) (2001) 198701.
[22] Y.-C. Wang, S.-S. Wang, Y. Deng, A modified efficiency centrality to identify influential nodes in weighted networks, Pramana 92 (4) (2019) 68.
[23] A. Zareie, A. Sheikhahmadi, M. Jalili, Influential node ranking in social networks based on neighborhood diversity, Future Gener. Comput. Syst. 94 (2019) 120–129.
[24] L.-F. Zhong, Q.-H. Liu, W. Wang, S.-M. Cai, Comprehensive influence of local and global characteristics on identifying the influential nodes, Phys. A, Stat. Mech. Appl. 511 (2018) 78–84.
[25] C. Salavati, A. Abdollahpouri, Z. Manbari, Ranking nodes in complex networks based on local structure and improving closeness centrality, Neurocomputing 336 (2019) 36–45.
[26] Y.-Z. Yang, L. Yu, X. Wang, Z.-L. Zhou, Y. Chen, T. Kou, A novel method to evaluate node importance in complex networks, Phys. A, Stat. Mech. Appl. 526 (2019) 121118.
[27] C. Gao, L. Zhong, X.-H. Li, Z.-L. Zhang, N. Shi, Combination methods for identifying influential nodes in networks, Int. J. Mod. Phys. C 26 (06) (2015) 1550067.
[28] J.-F. Sheng, J.-Y. Dai, B. Wang, G.-H. Duan, J. Long, J.-K. Zhang, K.-R. Guan, S. Hu, L. Chen, W.-H. Guan, Identifying influential nodes in complex networks based on global and local structure, Phys. A, Stat. Mech. Appl. 541 (2020) 123262.
[29] J. Zhao, Y.-C. Wang, Y. Deng, Identifying influential nodes in complex networks from global perspective, Chaos Solitons Fractals 133 (2020) 109637.
[30] M. Wang, W.-C. Li, Y.-N. Guo, X.-Y. Peng, Y.-X. Li, Identifying influential spreaders in complex networks based on improved k-shell method, Phys. A, Stat. Mech. Appl. (2020) 124229.
[31] R. Pastor-Satorras, A. Vespignani, Epidemic dynamics and endemic states in complex networks, Phys. Rev. E 63 (6) (2001) 066117.
[32] C.M. Schneider, T. Mihaljev, H.J. Herrmann, Inverse targeting - an effective immunization strategy, Europhys. Lett. 98 (4) (2012) 46002.
[33] Q.-H. Hu, A research of identifying vital nodes algorithm based on graph partition, Master’s thesis, University of Electronic Science and Technology of China, 2019.
[34] J.-M. Xu, Combinatorial Theory in Networks, Science Press, Beijing/China, 2013.
[35] D.-L. Nguyen, T.-H. Nguyen, T.-H. Do, M. Yoo, Probability-based multi-hop diffusion method for influence maximization in social networks, Wirel. Pers. Commun. 93 (4) (2017) 903–916.
[36] J. Goldenberg, B. Libai, E. Muller, Using complex systems analysis to advance marketing theory development: modeling heterogeneity effects on new product growth through stochastic cellular automata, Acad. Mark. Sci. Rev. 9 (3) (2001) 1–18.
[37] M. Granovetter, Threshold models of collective behavior, Am. J. Sociol. 83 (6) (1978) 1420–1443.
[38] Zachary karate club network dataset – KONECT, http://konect.uni-koblenz.de/networks/ucidata-zachary, Apr. 2017.
[39] Infectious network dataset – KONECT, http://konect.uni-koblenz.de/networks/sociopatterns-infectious, Apr. 2017.
[40] L. Isella, J. Stehlé, A. Barrat, C. Cattuto, J.-F. Pinton, W. Van den Broeck, What’s in a crowd? Analysis of face-to-face behavioral networks, J. Theor. Biol. 271 (1) (2011) 166–180.
[41] M.E. Newman, Finding community structure in networks using the eigenvectors of matrices, Phys. Rev. E 74 (3) (2006) 036104.
[42] Protein network dataset – KONECT, http://konect.uni-koblenz.de/networks/moreno_propro, Apr. 2017.
[43] S. Coulomb, M. Bauer, D. Bernard, M.-C. Marsolier-Kergoat, Gene essentiality and the topology of protein interaction networks, Proc. R. Soc. B, Biol. Sci. 272 (1573) (2005) 1721–1725.
[44] C. Castellano, R. Pastor-Satorras, Thresholds for epidemic spreading in networks, Phys. Rev. Lett. 105 (21) (2010) 218701.
[45] W. Hoeffding, A non-parametric test of independence, Ann. Math. Stat. (1948) 546–557.
[46] M.G. Kendall, The treatment of ties in ranking problems, Biometrika 33 (3) (1945) 239–251.
Social Network Analysis and Mining (2020) 10:2
https://doi.org/10.1007/s13278-019-0616-4
ORIGINAL ARTICLE
Abstract
The information on the web is mixed with rumors and unverified claims. Social networks, as a large and distinctive
part of the web, have even greater potential for creating and spreading misinformation or unverified information.
Given the significance of this issue, this paper investigates information verification in social networks with the
aim of improving verification performance. Several features and conditions appear to affect rumor detection.
Among the possible features and properties, we consider two main sources for information verification in social
networks: user feedback and news agencies. User feedback, the first source, is represented as a user conversational
tree, from which patterns can be extracted. News agencies, the second source, are used to verify information
through textual entailment methods. Finally, these two types of features are aggregated to classify the information
into one of three classes: true, false, or unverified. The method is evaluated in experiments on public datasets.
The results show that the proposed hybrid method outperforms the state-of-the-art methods in information
verification.
Keywords Information verification · Textual entailment recognition · User conversational tree · Subtree pattern extraction ·
Fake news
Zimmermann and Jucks 2018; Hajli 2018; Alrubaian et al. 2017; Gahirwal et al. 2018; Conroy et al. 2015; Rubin
et al. 2015; Shu et al. 2017; Rubin et al. 2016; Oshikawa et al. 2018; Vedova et al. 2018; Schifferes et al. 2014;
Pérez-Rosas et al. 2017; Long et al. 2017; Ma et al. 2018; Zhao et al. 2017; Hanselowski et al. 2018; Paul et al.
2018; Basak et al. 2018; Liu et al. 2018; Ren et al. 2017; Zanoli and Colombo 2017; Guo et al. 2017). Most of these
methods are based on feature selection from the main text (Rubin et al. 2016; Long et al. 2017; Basak et al. 2018).
In addition, methods that study the structure of graphs in social networks are very popular because of the long
history of graph research (Androutsopoulos and Prodromos 2010). Recently, deep learning methods have also been
applied to this task. For example, recurrent neural networks (RNNs) are used for detecting rumors from microblogs
(Foroozani and Ebrahimi 2019). However, among the current approaches, the effect of sources on information
verification has not been studied. Therefore, this paper studies the effect of sources in order to enhance the
performance of information verification.
The proposed method uses the two main sources for assessing the veracity of information: user feedback and news
sources. First, the effects of user feedback and of news sources on the information verification task are studied
separately. Then, the results of these methods are combined to decide on the veracity of the information.
In studying user feedback, we consider the structure of tagged nodes in the user conversational tree (UCT), with
four kinds of tags: deny, comment, support, and query. The UCT is studied with three approaches: (1) pattern
extraction, (2) statistical sequential models, and (3) edit distance. These methods are briefly described below.

• For pattern extraction, different special patterns in the UCT are counted, taking into account the level of each
pattern in the tree. The tag of each node in a pattern is also specified.
• For statistical sequential models, two kinds of models are considered: the hidden Markov model (HMM) and
conditional random fields (CRF).
• Edit distance is employed as a similarity measure for the K-nearest neighbor method, which classifies the UCT
for the information verification task.

The second source for information verification is news sources, which are used with textual entailment methods.
From this point of view, the text whose veracity is questioned (called the main text) is checked to see whether it
can be entailed from a news source. If it can be entailed, the main text is considered true. If the news source
contradicts the main text, the main text is considered false. Otherwise, the main text is considered unverified.
The suggested methods for information verification are evaluated on public datasets and compared with the
state-of-the-art methods. The results of this comparison show that the proposed method works better than the other
available methods reported on the dataset we used.
The structure of the paper is as follows. Section 2 summarizes research related to information verification.
Section 3 then describes three methods based on user feedback, news sources, and their combination. Experiments
and their discussion are presented in Sect. 4. Finally, Sect. 5 concludes the paper and presents future work.

2 Related works

The task of information verification is a wide area that can cover the verification of every sort of information.
Among the kinds of information studied in this area, news is the most important. Therefore, in this paper the
related work is separated into two subsections: information verification and news verification. Then, as the study
of news sources is typically based on textual entailment methods, related work on textual entailment is explained
in the last subsection.

2.1 Information verification

Several research works (Gerhart et al. 2017; Louni and Subbalakshmi 2018; Castillo et al. 2011, 2013; Boididou
et al. 2018; Westerman et al. 2012; Yin et al. 2018a; Zimmermann and Jucks 2018; Hajli 2018; Alrubaian et al.
2017) investigated information verification. Gerhart et al. (2017) investigated the effect of user behaviors on
information verification. They state that the social–cultural differences between people in a social network,
knowledge about past rumors, trust in the network, and the tendency of users to share information are the most
influential parameters in information verification.
Louni and Subbalakshmi (2018) studied methods for information verification in large-scale social networks. Their
proposed method is based on the polarity of nodes in the network. They also introduced a new element, called a
sensor, that is activated when sensitive information enters or leaves a node. Their method performs well in
detecting the nodes that spread invalid information.
Alrubaian et al. (2017) investigated the effect of user popularity on information verification in social networks.
Their results show that if the popularity of a user is computed
in a proper way, then user popularity can be comparable with the other methods used for information verification.

2.2 News verification

Some research works studied news verification (Gahirwal et al. 2018; Conroy et al. 2015; Rubin et al. 2015; Shu
et al. 2017; Rubin et al. 2016; Oshikawa et al. 2018; Vedova et al. 2018; Schifferes et al. 2014; Pérez-Rosas et al.
2017; Long et al. 2017). Zhang and Ghorbani reviewed fake news detection methods. Many of these rely on
identifying features of the users, content, and context that indicate misinformation. They also list the datasets
that have been used for classifying fake news (Zhang and Ghorbani 2019). In the following, we review some news
verification methods.
Gahirwal et al. (2018) proposed a method for news verification using natural language processing and machine
learning methods. They investigated news at four levels of verification. They also used regression methods to
compute the degree of validity of news.
Zhang et al. (2019) proposed a text analytics-driven approach to fake news detection for reducing the risks posed
by fake news consumption.
Conroy et al. (2015) proposed language- and network-based methods for news verification. They state that their
method would show better results if linked data were used for news verification.
Rubin et al. (2016) studied the effect of misleading content on the spread of fake news. They gathered the
misleading features that are effective in fake news classification.
Long et al. (2017) proposed a method for fake news detection using a neural network with attention. Their
contribution is studying the effect of user profile features on the operation of the neural network. Their
experimental results show that these features have a positive effect.
Textual entailment is a semantic interpretation of the text and measures natural language understanding of the
text. Textual entailment is the task of deciding, given two text fragments, whether the meaning of one text is
entailed (can be inferred) from another text. So far, several approaches have been proposed for textual entailment,
such as word embedding, logical models, rule-based models, graphical models, contextual focusing, and machine
learning (Androutsopoulos and Prodromos 2010). Practical or large-scale solutions avoid these complex methods and
instead use only surface syntax or lexical relationships, but are correspondingly less accurate (Dagan and Glickman
2004). Textual entailment is investigated in Yin et al. (2018b), Ma et al. (2018), Zhao et al. (2017), Hanselowski
et al. (2018), Paul et al. (2018), Basak et al. (2018), Liu et al. (2018), Ren et al. (2017), Zanoli and Colombo
(2017), and Guo et al. (2017). The methods proposed by Yin et al. (2018b), Ma et al. (2018), Zhao et al. (2017),
Hanselowski et al. (2018), Liu et al. (2018), and Guo et al. (2017) are based on neural networks. Other methods are
more interpretable and mostly based on natural language processing.
Paul et al. (2018) proposed a method that contains three main parts: information retrieval, intuition gathering,
and classification. They state that this method has potential to be used for fraud detection in text. In addition,
this method is appropriate for inter-domain applications. Both capabilities are attributed mainly to the simplicity
of the model.
Basak et al. (2018) proposed a hybrid approach that uses feature extraction based on special rules. For this
purpose, the structure of language and machine learning methods are utilized.
Zanoli and Colombo (2017) used entailment methods based on information transformation. To apply information
transformation, they used three actions: delete, add, and swap. They also defined a new action named matching.
This method needs a dataset with a special structure, but when such a dataset is provided, the method gives good
results.

3 Proposed method

The explanation of the proposed method begins with an overall description of the method in the first subsection.
Then, the two main parts of the method, the study of user feedback and the study of news sources, are explained.
In addition, the different parts of the user feedback study are explained.

The overall architecture of the proposed method for the information verification task is illustrated in Fig. 1. As
shown in this figure, a message from a social network, which is the input text for information verification, is
taken as the input. This message is first passed to the textual preprocessing step. The preprocessing step includes
the following four operations:

• transforming abbreviations in the input text into full words
• removing annotations from the input text
• preprocessing of words (e.g., transforming plurals into singular form), and
• preprocessing of verbs (e.g., transforming verbs into their stems and lemmas).
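The four preprocessing operations listed above can be illustrated with a minimal sketch. The paper does not name the resources it uses, so everything here is an assumption made for illustration: the lookup tables `ABBREVIATIONS` and `IRREGULAR_VERBS` are hypothetical, "annotations" are taken to mean URLs, @-mentions, and #-hashtags, and the suffix rules are a deliberately naive stand-in for a real stemmer or lemmatizer.

```python
import re

# Hypothetical lookup tables; the paper does not list its actual resources.
ABBREVIATIONS = {"u": "you", "pls": "please", "govt": "government"}
IRREGULAR_VERBS = {"was": "be", "were": "be", "said": "say", "went": "go"}

# ASSUMPTION: "annotations" means URLs, @-mentions and #-hashtags.
ANNOTATION_RE = re.compile(r"https?://\S+|[@#]\w+")


def expand_abbreviations(tokens):
    """Operation 1: replace known abbreviations with full words."""
    return [ABBREVIATIONS.get(token, token) for token in tokens]


def lemmatize_verb(token):
    """Operation 4 (naive): map a verb form to a stem or lemma."""
    if token in IRREGULAR_VERBS:
        return IRREGULAR_VERBS[token]
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def singularize(token):
    """Operation 3 (naive): turn a plural form into a singular form."""
    if len(token) > 4 and token.endswith("ies"):
        return token[:-3] + "y"
    if len(token) > 4 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Apply the four operations in sequence and return a token list."""
    text = ANNOTATION_RE.sub(" ", text.lower())   # operation 2
    tokens = re.findall(r"[a-z']+", text)
    tokens = expand_abbreviations(tokens)
    tokens = [lemmatize_verb(token) for token in tokens]
    return [singularize(token) for token in tokens]


print(preprocess("Govt confirmed 2 attacks, pls RT @bbc #news http://t.co/x"))
```

The suffix rules will mis-handle many words (for example, "always" loses its final letter), so a practical system would substitute a proper lemmatizer; the sketch only shows where each operation sits in the pipeline.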
Fig. 1 Overall architecture of the proposed information verification method
[Figure panels recoverable from the extraction: Classification (extreme learning machine, Bayes classifier, support
vector machine, multilayer perceptron); Final Class (true, false, unverified); Evaluation (Score, Confidence RMSE,
Final Score).]

After preprocessing, the input text is ready for verification. The verification task is performed in two main
separate steps: considering user feedback and considering news sources. These steps are described in the next
subsections.

3.2 User feedback study

The user feedback study is based on the structure of the UCT. A UCT is a tree that is created when a message in a
social network is discussed or replied to. The root of this tree is the main message whose veracity is questioned.
Every reply to this message is a child of the root. Formally, every reply to a node in the tree belongs to the next
level of the UCT.
Each node of the UCT considered in this paper is tagged with one of the labels Query, Deny, Comment, or Support.
Each of these labels is described in the following:

• Support: the node with the support label agrees with its direct parent.
• Deny: the node with the deny label disagrees with its direct parent.

3.2.1 Pattern extraction

Patterns extracted from the UCT are considered as features. These patterns are all possible subtrees with a maximum
height of three. The occurrence of each pattern is counted with a weight, which is computed as the ratio of the
level of the subtree's root to the height of the UCT.
Extreme learning machine (ELM) (Huang 2014) is used as the classifier for building a model based on the features
extracted from the UCT. Naive Bayes, support vector machine (SVM), and multilayer perceptron are the other
classifiers applied for classification of messages.

3.2.2 Statistical sequential models

CRFs are a probabilistic framework for labeling and segmenting structured data, such as sequences, trees, and
lattices. The underlying idea is defining a conditional probability distribution over label sequences given a
particular observation sequence, rather than a joint distribution over both label and observation sequences
(Lafferty et al. 2001). HMMs take a generative approach for labeling a sequence. Contrary to HMM, CRF does not
require the observations to be independent (conditional probability). The HMM is a special case of CRF where the
probabilities used in the state transition are constant. HMM and CRF have been used in several natural language
processing tasks (Shen et al. 2007; Borrajo et al. 2015). It seems that these methods are good candidates for
information verification from text messages. HMM and CRF are used in our proposed methods. The final label is the
path label that wins the majority vote, with priority given in the order false, true, and unverified.

3.2.3 Edit distance

The edit distance of each path in the UCT is computed by the dynamic time warping method, which is a dynamic
programming method. Afterward, the edit distance is considered as a similarity measure and by using the
K-nearest neighbor method, the UCT is classified as false, true, or unverified.

3.3 Study news sources: recognizing textual entailment

In this case, the news source must be given. If the main text, which is the message whose veracity is questioned,
is entailed by the news source, the main text is true. If the news source contradicts the main text, then the main
text is false. Otherwise, the main text is unverified.
The following textual entailment methods are used in this paper (Padó et al. 2015; Magnini et al. 2014; Noh et al.
2015).

• Edit distance comp(Fixed Weight Lemma/RES Word Net).
• MaxEnt TreeSkeleton RES(Verb Ocean Tree Pattern/Word Net Tree Pattern/Word Net Verb Ocean Tree Pattern).
• P1EDA RES: Paraphrase Table.

simulate the experiments as was done in SemEval-2017 Task 8. The evaluation measures are the same as in this task,
too. The evaluation measures are Score, Confidence root mean square error (RMSE), and Final Score. Score is computed
as the popular accuracy measure. Confidence RMSE is the RMSE of the classification confidence. Final Score is
computed as Eq. (1).

Final Score = Score × (1 − Confidence RMSE)    (1)

4.2 User feedback experiments

The results of the experiments on the user feedback study in three cases are shown in Tables 1, 2, and 3, which
report the results of pattern extraction, statistical sequential models, and K-nearest neighbor with dynamic time
warping, respectively.
In Table 1, ELM is used as the classifier for building a model based on the features extracted from the UCT. The
results from Table 1 show that in user feedback study,
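As a worked illustration of Eq. (1), the following sketch computes Score, Confidence RMSE, and Final Score for a toy set of predictions. The text only says that Confidence RMSE is the RMSE of the classification confidence; the reference value used here (1.0 for a correct prediction, 0.0 for an incorrect one) is an assumption made for this example and may differ from the official SemEval-2017 Task 8 scorer. The function names and toy data are likewise hypothetical.

```python
import math


def score(y_true, y_pred):
    """Score: plain accuracy over the predicted veracity labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confidence_rmse(y_true, y_pred, confidence):
    """RMSE of the classification confidence.

    ASSUMPTION: the reference confidence is 1.0 when the prediction is
    correct and 0.0 otherwise; the paper does not spell this out.
    """
    squared = [
        (c - (1.0 if t == p else 0.0)) ** 2
        for t, p, c in zip(y_true, y_pred, confidence)
    ]
    return math.sqrt(sum(squared) / len(squared))


def final_score(score_value, rmse_value):
    """Eq. (1): Final Score = Score x (1 - Confidence RMSE)."""
    return score_value * (1.0 - rmse_value)


# Toy data: three of four messages are classified correctly, and the
# classifier is fully confident about every prediction.
y_true = ["true", "false", "unverified", "false"]
y_pred = ["true", "false", "false", "false"]
confidence = [1.0, 1.0, 1.0, 1.0]

s = score(y_true, y_pred)
r = confidence_rmse(y_true, y_pred, confidence)
print(s, r, final_score(s, r))
```

Under this assumption, an over-confident wrong prediction raises the RMSE and therefore lowers the Final Score, which is the behavior Eq. (1) is meant to reward against.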
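The two-class measures reported in Table 7 can be recomputed from the confusion matrix in Table 6. The paper does not say how the three classes are mapped onto ‘rumor’ and ‘other’; the sketch below assumes a one-vs-rest reading with the "True" class as the positive class, because that reading reproduces the published values. The mapping is therefore an inference, not something the source states, and the function names are hypothetical.

```python
# Table 6: rows are predicted classes, columns are actual classes.
LABELS = ["true", "false", "unverified"]
CONFUSION = [
    [4, 1, 3],   # predicted true
    [2, 6, 0],   # predicted false
    [1, 2, 9],   # predicted unverified
]


def one_vs_rest(confusion, k):
    """Collapse a multi-class confusion matrix to TP, FP, FN, TN for class k."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fp = sum(confusion[k]) - tp                  # predicted k, actually another class
    fn = sum(row[k] for row in confusion) - tp   # actually k, predicted another class
    tn = total - tp - fp - fn
    return tp, fp, fn, tn


def binary_measures(tp, fp, fn, tn):
    """Standard binary measures as listed in Table 7."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# ASSUMPTION: the positive class is "true" (index 0 in LABELS).
measures = binary_measures(*one_vs_rest(CONFUSION, LABELS.index("true")))
for name, value in measures.items():
    print(f"{name}: {value:.2%}")
```

With this mapping the counts are TP = 4, FP = 4, FN = 3, and TN = 17, which give 75% accuracy, 57.14% sensitivity, 80.95% specificity, 50% precision, and 53.33% F1, matching Table 7.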
Table 6 Confusion matrix for the best-achieved result for information verification

                          Actual
Predicted        True     False    Unverified
True             4        1        3
False            2        6        0
Unverified       1        2        9

Table 7 Evaluation of the proposed method when two classes of ‘rumor’ and ‘other’ are considered

Measure    Accuracy    Sensitivity    Specificity    Precision    F1 Score
Value      75%         57.14%         80.95%         50%          53.33%

Table 6 shows the confusion matrix of the best model, which is ELM (sigmoidal activation function) plus entailment.
Table 7 displays the evaluation of the proposed method when the two classes rumor and other are considered.
Existing news sources are essential for the proposed method to perform well. Also, in some social networks, users’
feedback is not recorded, and the proposed method is then not a good choice for information verification. In
addition, the proposed method is only applicable to text messages.

5 Conclusion and future works

In this paper, we studied the problem of information verification, which aims to detect false and unverified
information. This problem is important because the rate of information generation over social networks is so high
that manual verification is very hard. In addition, the web today contains a high volume of false information. For
these reasons, information verification is a hot topic.
The proposed information verification method is a hybrid method that utilizes two main sources for information
verification: user feedback and news agencies. The first source is studied through user feedback, and the other
source is considered through textual entailment methods. Then, the results of the two sources are combined in order
to judge the veracity of the information.
The experiments with the suggested method on a public dataset illustrated that the proposed method outperformed the
state-of-the-art methods in information verification. In the future, the proposed method will be developed with
more advanced natural language processing-based methods to study user feedback and news agencies in more detail.

Acknowledgements This research was in part supported by a Grant from IPM (No. CS1397-4-98).

References
Alrubaian M, Al-Qurishi M, Al-Rakhami M, Hassan MM, Alamri A (2017) Reputation-based credibility analysis of Twitter social network users. Concurr Comput Pract Exp 29(7):e3873
Androutsopoulos I, Prodromos M (2010) A survey of paraphrasing and textual entailment methods. J Artif Intell Res 38:135–187. https://doi.org/10.1613/jair.2985
Basak R, Naskar SK, Gelbukh A (2018) A simple hybrid approach to recognizing textual entailment. J Intell Fuzzy Syst 34(4):1–13
Boididou C, Middleton SE, Jin Z, Papadopoulos S, Dang-Nguyen D-T, Boato G, Kompatsiaris Y (2018) Verifying information with multimedia content on twitter. Multimed Tools Appl 77(12):15545–15571
Borrajo L, Seara Vieira A, Iglesias EL (2015) TCBR-HMM: an HMM-based text classifier with a CBR system. Appl Soft Comput 26:463–473
Castillo C, Mendoza M, Poblete B (2011) Information credibility on twitter. In: Proceedings of the 20th international conference on World wide web. ACM, pp 675–684
Castillo C, Mendoza M, Poblete B (2013) Predicting information credibility in time-sensitive social media. Internet Res 23(5):560–588
Conroy NJ, Rubin VL, Chen Y (2015) Automatic deception detection: methods for finding fake news. In: Proceedings of the 78th ASIS&T annual meeting: information science with impact: research in and for the community. American Society for Information Science, p 82
Dagan I, Glickman O (2004) Probabilistic textual entailment: generic applied modeling of language variability. In: PASCAL workshop on learning methods for text understanding and mining. Grenoble
Derczynski L, Bontcheva K, Liakata M, Procter R, Hoi GW, Zubiaga A (2017) SemEval-2017 Task 8: RumourEval: determining rumour veracity and support for rumours. arXiv:1704.05972v1 [cs.CL] 20 Apr 2017
Foroozani A, Ebrahimi M (2019) Anomalous information diffusion in social networks: Twitter and Digg. Expert Syst Appl 134:249–266
Gahirwal M, Moghe S, Kulkarni T, Khakhar D, Bhatia J (2018) Fake news detection. Int J Adv Res Ideas Innov Technol 4(1):817–819
Gerhart N, Torres R, Negahban A (2017) Combatting fake news: an investigation of individuals’ information verification behaviors on social networking sites. In: Proceedings of the 51st Hawaii international conference on system sciences
Girard J, Allison M (2008) Information anxiety: fact, fable or fallacy. Electron J Knowl Manag 6(2):111–124
Guo M, Zhang Y, Zhao D, Liu T (2017) Generating textual entailment using residual LSTMs. In: Chinese computational linguistics and natural language processing based on naturally annotated big data. Springer, Cham, pp 263–272
Hajli N (2018) Ethical environment in the online communities by information credibility: a social media perspective. J Bus Ethics 149(4):799–810
Hanselowski A, Zhang H, Li Z, Sorokin D, Schiller B, Schulz C, Gurevych I (2018) Multi-sentence textual entailment for claim verification. In: Proceedings of the first workshop on fact extraction and verification (FEVER), pp 103–108
Huang G-B (2014) An insight into extreme learning machines: random neurons, random features and kernels. Cognit Comput 6(3):376–390
Kumar S, Shah N (2018) False information on web and social media: a survey. arXiv:1804.08559v1 [cs.SI] 23 Apr 2018
Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the eighteenth international conference on machine learning (ICML-2001)
Liu L, Huo H, Liu X, Palade V, Peng D, Chen Q (2018) Recognizing textual entailment with attentive reading and writing operations. In: International conference on database systems for advanced applications. Springer, Cham, pp 847–860
Long Y, Lu Q, Xiang R, Li M, Huang C-R (2017) Fake news detection through multi-perspective speaker profiles. In: Proceedings of the eighth international joint conference on natural language processing, volume 2 short papers, pp 252–256
Louni A, Subbalakshmi KP (2018) Method and apparatus to identify the source of information or misinformation in large-scale social media networks. U.S. Patent 9,959,365, issued May 1, 2018
Ma T, Wu C, Xiao C, Sun J (2018) AWE: asymmetric word embedding for textual entailment. arXiv preprint arXiv:1809.04047
Magnini B, Zanoli R, Dagan I, Eichler K, Neumann G, Noh T-G, Padó S, Stern A, Levy O (2014) The excitement open platform for textual inferences. In: Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pp 43–48
Noh T-G, Padó S, Shwartz V, Dagan I, Nastase V, Eichler K, Kotlerman L, Adler M (2015) Multi-level alignments as an extensible representation basis for textual entailment algorithms. In: Proceedings of the fourth joint conference on lexical and computational semantics, pp 193–198
Oshikawa R, Qian J, Wang VY (2018) A survey on natural language processing for fake news detection. arXiv preprint arXiv:1811.00770
Padó S, Noh T-G, Stern A, Wang R, Zanoli R (2015) Design and realization of a modular architecture for textual entailment. Nat Lang Eng 21(2):167–200
Paul M, Sharp R, Surdeanu M (2018) A mostly unlexicalized model for recognizing textual entailment. In: Proceedings of the first workshop on fact extraction and verification (FEVER), pp 166–171
Pérez-Rosas V, Kleinberg B, Lefevre A, Mihalcea R (2017) Automatic detection of fake news. arXiv preprint arXiv:1708.07104
Ren H, Li X, Feng W, Wan J (2017) Recognizing textual entailment using inference phenomenon. In: Workshop on Chinese lexical semantics. Springer, Cham, pp 293–302
Rubin VL, Chen Y, Conroy NJ (2015) Deception detection for news: three types of fakes. In: Proceedings of the 78th ASIS&T annual meeting: information science with impact: research in and for the community. American Society for Information Science, p 83
Rubin V, Conroy N, Chen Y, Cornwell S (2016) Fake news or truth? using satirical cues to detect potentially misleading news. In: Proceedings of the second workshop on computational approaches to deception detection, pp 7–17
Schifferes S, Newman N, Thurman N, Corney D, Göker A, Martin C (2014) Identifying and verifying news through social media: developing a user-centred tool for professional journalists. Digit J 2(3):406–418
Shen D, Sun J-T, Li H, Yang Q, Chen Z (2007) Document summarization using conditional random fields, IJCAI-07, pp 2862–2867
Shu K, Sliva A, Wang S, Tang J, Liu H (2017) Fake news detection on social media: a data mining perspective. ACM SIGKDD Explor Newsl 19(1):22–36
Vedova MD, Tacchini E, Moret S, Ballarin G, DiPierro M, de Alfaro L (2018) Automatic online fake news detection combining content and social signals. In: Proceedings of the 22st conference of open innovations association FRUCT. FRUCT Oy, p 38
Westerman D, Spence P, Van Der Heide B (2012) A social network as information: the effect of system generated reports of connectedness on credibility on Twitter. Comput Hum Behav 28(1):199–206
Wurman RS (1989) Information anxiety. Doubleday, New York
Yin C, Sun Y, Fang Y, Lim K (2018a) Exploring the dual-role of cognitive heuristics and the moderating effect of gender in microblog information credibility evaluation. Inf Technol People 31(3):741–769
Yin W, Roth D, Schütze H (2018b) End-task oriented textual entailment via deep exploring inter-sentence interactions. arXiv preprint arXiv:1804.08813
Zanoli R, Colombo S (2017) A transformation-driven approach for recognizing textual entailment. Nat Lang Eng 23(4):507–534
Zhang X, Ghorbani AA (2019) An overview of online fake news: Characterization, detection, and discussion. Inf Process Manag. https://doi.org/10.1016/j.ipm.2019.03.004
Zhang C, Gupta A, Kauten C, Deokar AV, Qin X (2019) Detecting fake news for reducing misinformation risks using analytics approaches. Eur J Oper Res 279(3):1036–1052
Zhao K, Huang L, Ma M (2017) Textual entailment with structured attentions and composition. arXiv preprint arXiv:1701.01126
Zimmermann M, Jucks R (2018) How experts’ use of medical technical jargon in different types of online health forums affects perceived information credibility: randomized experiment with laypersons. J Med Internet Res 20(1):e30
Zubiaga A, Aker A, Bontcheva K, Liakata M, Procter R (2017) Detection and resolution of rumours in social media: a survey. arXiv preprint arXiv:1704.00656

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Zhang et al. Hum. Cent. Comput. Inf. Sci. (2020) 10:23
https://doi.org/10.1186/s13673-020-00230-0
*Correspondence: [email protected]
1 Information Engineering College, Henan University of Science and Technology and Henan International Joint
Laboratory of Cyberspace Security Applications, Luoyang 471023, People’s Republic of China
Full list of author information is available at the end of the article
Abstract
Crowdsourcing and crowd computing are a trend that is likely to be increasingly popular, and there remain a number
of research and operational challenges that need to be addressed. The human-centric computational abstraction
called situation may be used to cope with these difficulties. In this paper, we focus on one such challenge, which
is how to assign crowd assessment tasks about security and privacy in online social networks to the most
appropriate users efficiently, effectively and accurately. Specifically, we propose a novel task assignment method
to facilitate crowd assessment, which improves the security and trustworthiness of social networking platforms, as
well as a task assignment algorithm based on SocialSitu, which is a social-domain-focused situational analytics.
Findings from our crowd assessment experiments on a real-world social network, Shareteches, show that the precision
and recall of the proposed method and algorithm are 0.491 and 0.538 higher than those of a random algorithm, and
0.336 and 0.366 higher than those of a user theme-aware algorithm, respectively. Moreover, these results further
suggest that the proposed approach enhances the security and privacy of online social networks.
Keywords: Online social networks, Security and privacy, Crowd computing, Human-
centric computing, SocialSitu, Task assignment, Artificial intelligence
Introduction
Mobile devices are often used by online social network (OSN) users to access, share,
and exchange information, whenever and wherever possible [1–3]. One relatively recent
trend is crowdsourcing to obtain information, share resources, and/or get certain tasks
done, as evidenced by the increasing number and variety of crowdsourcing applications,
such as Amazon Mechanical Turk, chats (e.g., Firechat), knowledge sharing (e.g., Fig. 1),
ridesharing (e.g., Uber) and accommodation sharing (e.g., Airbnb).
Crowdsourcing [4] often refers to a solution pattern which outsources the tasks
that were previously performed by full-time employees to non-specific solution providers
via public online platforms. Such crowdsourcing can be voluntary (unpaid) or
paid. Crowd computing [5] is a computing model, which aims to integrate numerous
users who may know each other (i.e., the crowd) and computing resources (i.e.,
the computer cluster) on the Internet, to handle complex tasks that are difficult to
accomplish with existing computing technologies. In addition, the human-centric
computational abstraction called situation can be used in Human-Embedded
Computing or End-User Embedded Computing [6].
© The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material
in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material
is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/.
To maximize human individual or collective intelligence in crowd computing, researchers have studied task assignment, which assigns real-time crowd tasks to an appropriate user crowd for execution [7–9]. The majority of existing task assignment approaches in the crowd computing literature consider user behaviors as homogeneous [10], thereby completely ignoring the influence of such behaviors on the collected results. Existing approaches also often assume static scenes [11], although tasks occur dynamically in practical applications and results should be returned promptly within a specific time. Therefore, such approaches are generally not practicable in real-world applications.
In this paper, we focus on the task itself. This is because in crowd computing, user crowds are often passive, and users’ multi-dimensional information, functional experience, and integrity are minimally considered. The SocialSitu theory [12], for example, analyzes the SocialSitu(t) sequence of users’ access behaviors and draws out the patterns of such behaviors under different intentions (i.e., frequent functional experience). Hence, in this paper we propose a task assignment algorithm to determine user task suitability, based on SocialSitu. After finding user crowds who frequently use a certain function, calculating task suitability and assigning the related crowd tasks based on the user crowds’ multi-dimensional information becomes considerably more accurate.
The contributions of this paper are twofold. (1) We design a task assignment
method that can be used to perform crowd assessment for the security and privacy
of online social networks. (2) Moreover, we propose a task assignment algorithm to
determine user task suitability based on SocialSitu, in order to achieve efficient and
accurate assignment of crowd tasks about security and privacy.
The remainder of this paper is structured as follows. “Related work” section briefly
introduces the extant literature. “Task assign method of crowd assessment for online
social networks” and “Task assignment algorithm of user task suitability based on
SocialSitu” sections present our proposed task assignment method and algorithm,
respectively. In “Experimental setup and analysis” section, we describe the crowd
assessment experiments and discuss the findings which show that the proposed
approach facilitates efficient task assignment and improves the task precision and
recall. The last section concludes this paper.
Related work
Context-Aware (CA) [13–15] was first proposed by Schilit et al. (1994), in which a scene is defined as a location, a collection of people and objects in the vicinity, and changes in these objects. Chang et al. [16, 17] described situation analysis theory and the significance and influence of the Situ architecture in software engineering. They provided a detailed description of the Situ architecture, which updates services in real time by identifying users’ new intentions in software engineering, thereby providing users with truly personalized services. To discover users’ intentions in social media in a timely manner and provide additional personalized services to users, Zhang et al. [12] developed the SocialSitu theory based on the Situ theory of Chang et al. [16, 17], and designed a discovery method for user behavior patterns in multimedia social networks. This method analyzes users’ SocialSitu(t) sequences and draws out their behavior patterns under different intentions. Consequently, users’ current intentions can be predicted by comparing their current behavior sequences with those in the database.
In crowd computing, existing task assignment structures remain relatively simple, and in some cases tasks are assigned to user crowds at random. Therefore, most tasks cannot be handled by suitably qualified individuals, and the advantages of human-computer collaboration in crowd computing cannot be fully realized. An et al. [18] proposed an algorithm for the discovery and selection of service nodes based on the analytic hierarchy process, which takes into consideration users’ mobility, complexity, real-time performance and other characteristics. In [19], Zhang proposed a simple and effective framework, MacroWiz, to manage the wisdom of crowds on mobile social networks. MacroWiz encourages online users to contribute their own knowledge or opinions through an incentive mechanism. This framework also assists task requesters in collecting answers, choosing reliable answers, and making final decisions. For spatial crowdsourcing, Song et al. [20] presented the Online-Exact and Online-Greedy algorithms to address multi-skill aware task assignment. For the uncertain mobile crowdsourcing scenario, Guo et al. [21] found that the results of mobile crowdsourcing largely depend on the quality of location-related users. Subsequently, Sun et al. [22] formulated mobile crowdsourcing task allocation as an optimization problem using the trustworthiness of workers and movement distance costs. They then presented a Markov decision process based on the mobile crowdsourcing model to solve the problem of dynamic trust-aware task allocation. In [23], Mao et al. proposed an optimal user crowd algorithm. This algorithm assigns tasks to the smallest number of users based on users’ historical task completion status, thereby minimizing the total cost while ensuring the completion of a task. Zhang et al. [24] designed a task assignment algorithm with user theme awareness. The algorithm is designed to overcome the blind randomness of the random task assignment algorithm by obtaining users’ themes (i.e., their fields of specialization) through analysis of their historical task information. In this manner, the completeness of the overall tasks is substantially improved. However, users’ themes are constantly changing. Hence, themes obtained solely from historical data cannot stay current; such themes are inaccurate and not real-time. In [25], Kim et al. presented a multi-layered information analysis approach based on crowdsourcing theory that effectively uses topic analysis to track scientific issues.
Regarding privacy preservation and trust in social networks, Yuan et al. [26] investigated privacy protection in spatial crowdsourcing and presented a privacy-preserving framework. Specifically, they proposed a grid-based location protection method to protect the locations of workers and tasks. Moreover, Li et al. [27] exploited a secure grid-based index method to solve the problem of privacy-aware spatial crowdsourcing. This method not only protects workers’ location privacy but also improves the spatial task processing time. In [28], Sharma et al. presented a novel trust relaying and privacy preservation architecture, which includes a distributed query system for the social Internet of Things based on edge-crowdsourcing techniques. Ma et al. [29] drew on advanced blockchain technologies to study security, privacy, and trust in crowdsourcing services. Wang et al. [30] combined an incentive mechanism with location privacy-preserving techniques to enhance the validity of mobile crowdsourcing systems. In [31], Chi et al. developed an effective location privacy protection method that addresses the trade-off between location privacy protection and service quality. To better assess the reliability of workers for task allocation, Jiang et al. [32] developed a context-aware, reliable and efficient crowdsourcing technique for simple social networks and multiple social networks. In [33], Huang et al. put forward an efficient reputation evaluation technique for crowdsourcing participants, which is based on machine learning methods and a multidimensional evaluation index mechanism.
In summary, most current assignment methods are unable to find suitable crowds for certain tasks, consequently leading to low completeness, precision, and recall. To solve these problems, this study defines the decision factors that affect task assignment and the suitability of user tasks in online social networks. The current research fully considers users’ functional experience and historical information, which are used as the basis for designing a task assignment method of crowd assessment for the security of online social networks. This study also proposes a task assignment algorithm based on SocialSitu user task suitability, which can assign crowd tasks efficiently and accurately in real time.
φ, θ, γ). Task publishers publish task t1 (t1 ∈ T) to system P. Users log into the system to query and receive task t1. The system finds the user crowd u1 (u1 ∈ U) suitable for t1 by analyzing tasks and users. Moreover, id is the unique identifier of a task or a user, and ω is the description of t1, i.e., its general basic information. φ refers to the category of t1, and θ and γ are the number of users required for t1 and its deadline, respectively. ο denotes the devices of user crowd u1, including mobile devices m and fixed devices w. χ is the users’ field of specialization, δ is the historical information, such as degree of completion and degree of correlation, and λ is the users’ situation information (i.e., the SocialSitu(t) sequence).
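As a rough illustration of the task and user records described above, the following Python sketch maps each symbol to a field. The class and field names are hypothetical, as are the choice of hours for the deadline γ and the list-based representations of δ and λ; only the symbols themselves come from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """A crowd task; field names are illustrative, symbols follow the text."""
    id: str                # unique identifier
    description: str       # omega: general basic information
    category: str          # phi: task category
    users_required: int    # theta: number of users required
    deadline_hours: float  # gamma: deadline (unit is an assumption)


@dataclass
class User:
    """A crowd user; field names are illustrative, symbols follow the text."""
    id: str                # unique identifier
    devices: List[str]     # omicron: "m" (mobile) and/or "w" (fixed)
    specialization: str    # chi: field of specialization
    history: List[dict] = field(default_factory=list)   # delta: historical information
    situation: List[str] = field(default_factory=list)  # lambda: SocialSitu(t) sequence


task = Task("t1", "Assess privacy settings", "Security and privacy", 2, 24.0)
user = User("u1", ["m"], "Security and privacy")
```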
The purpose of a crowd assessment system is to assign appropriate assessment tasks to the user crowd. Accordingly, the system architecture must meet the following three basic requirements: (1) the user crowd best suited to a given task can be found; (2) the server can independently design, publish, analyze, and process the result data; and (3) an interface for users to perform tasks is provided. Therefore, a hierarchical architecture is designed to meet these requirements. Figure 2 shows the crowd assessment system architecture. The specific layered architecture is designed as follows.
$$S_{succ}(U_i) = \frac{N_{succ}}{N_{total}} \tag{1}$$
Definition 3 Average delay Sdel is the time interval between a system’s task assignment and the acceptance of the task by users, as given in Eq. (2), where d(hj) is the duration of historical task hj (hj ∈ H), S(Ui) is a task’s assigning time, and e(Ui) is the time when user Ui accepted a task.
$$S_{del}(U_i) = \frac{\sum_{h_j \in H} d(h_j)}{\left| S(U_i) - e(U_i) \right|} \tag{2}$$
Definition 4 Degree of correlation Srel is the degree of correlation for users’ skilled fields, as given in Eq. (3), where s(hj) is the users’ degree of correlation for historical task hj (hj ∈ H) and ρ (Eq. (4)) is the attenuation factor, an amount that dynamically attenuates over time. That is, the longer the interval, the smaller the influence of users’ degree of correlation on service decisions.
$$S_{rel}(U_i) = \sum_{h_j \in H} \rho(h_j)\, s(h_j) \tag{3}$$
$$\rho(h_j) = \begin{cases} 1, & h_j = H \\ \rho(h_j - 1) = \rho(h_j) - 1/H, & 1 \le h_j \le H \end{cases} \tag{4}$$
Operation behavior
Definition 5 Operation behaviors are users’ operation processes in social networks. This study uses the user-intentional serialization algorithm based on situational analysis proposed in [12] to analyze the SocialSitu(t) sequence of users’ operation behaviors, which draws out their behavior patterns under different intentions (i.e., frequent functional experience).
Coincidence degree
Definition 6 Coincidence degree aims to evaluate the consistency between the results returned by an individual user and those returned by users overall, as shown in Eq. (5), where Rit is the rating of task t by user i and R̄t is the average rating of task t by all users. Therefore, the higher the coincidence degree is, the more consistent the rating of the user is with that of the overall users.
$$C(U_{it}) = R_{it} - \bar{R}_t \tag{5}$$
User-task-suitability
Definition 7 User-task-suitability is how well a user performs a task. Suitability is calculated from the attributes, and their weights, of the three task assignment factors; the attributes are degree of completion, service response time, degree of correlation, operation behaviors, and coincidence. Equation (6) shows that the greater the suitability, the more suitable the user is for this task. Accordingly, $w_{ij}^{t}$ represents the jth attribute of user i, $\alpha_{ij}^{t}$ represents the weight of the jth attribute of user i for task t, and p is the number of attributes. The specific weights of attributes are detailed in the next section.
$$S_{i}^{t} = \sum_{1 \le j \le p} \alpha_{ij}^{t}\, w_{ij}^{t} \tag{6}$$
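Equation (6) is a weighted sum, which the short Python sketch below illustrates. The function name and the sample weights and attribute values are invented for illustration; in the paper the weights come from the FAHP procedure described in the next section.

```python
from typing import Sequence


def suitability(weights: Sequence[float], attributes: Sequence[float]) -> float:
    """Eq. (6): sum over j of (weight of attribute j) * (value of attribute j)."""
    if len(weights) != len(attributes):
        raise ValueError("weights and attributes must have the same length p")
    return sum(a * w for a, w in zip(weights, attributes))


# Illustrative values only: five attributes with weights that sum to 1.
example_weights = [0.25, 0.25, 0.25, 0.125, 0.125]
example_attributes = [1.0, 0.5, 0.5, 1.0, 0.0]
score = suitability(example_weights, example_attributes)
```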
Constructing judgment matrix at each layer Before using FAHP to determine the weight of each attribute, the importance of each layer’s attributes is expressed by fuzzy triangular numbers ãij = (lij, mij, uij), and a fuzzy reciprocal judgment matrix can be constructed where ãji = 1/ãij. To facilitate operation, 1–9 and their reciprocals are used as scales to determine the values of lij, mij and uij, where lij, mij and uij are the lower bound, the mean and the upper bound of a triple (lij, mij, uij), respectively. Hence, a 17-level relative importance scale is introduced for comparing the importance of each attribute using FAHP (Table 1).

Table 1
| Scale value | Meaning |
|---|---|
| ãij = 1 | Elements i and j are of the same importance to the previous factor |
| ãij = 3 | Element i is slightly more important than element j |
| ãij = 5 | Element i is more important than element j |
| ãij = 7 | Element i is considerably more important than element j |
| ãij = 9 | Element i is extremely more important than element j |
| ãij = 2n, n = 1, 2, 3, 4 | Importance of elements i and j is between ãij = 2n − 1 and ãij = 2n + 1 |
| ãji = 1/ãij | Importance of element j relative to element i is the reciprocal of ãij |
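The reciprocal construction ãji = 1/ãij can be sketched in Python as follows. The sketch assumes the common convention that the reciprocal of a triangular fuzzy number (l, m, u) is (1/u, 1/m, 1/l); the text itself states only ãji = 1/ãij. The function names and the sample judgments are illustrative.

```python
from typing import Dict, List, Tuple

Tri = Tuple[float, float, float]  # triangular fuzzy number (l, m, u)


def reciprocal(a: Tri) -> Tri:
    """Reciprocal of (l, m, u), taken here as (1/u, 1/m, 1/l) by assumption."""
    l, m, u = a
    return (1.0 / u, 1.0 / m, 1.0 / l)


def build_judgment_matrix(n: int, upper: Dict[Tuple[int, int], Tri]) -> List[List[Tri]]:
    """Fill an n x n fuzzy reciprocal matrix from its upper-right judgments."""
    one: Tri = (1.0, 1.0, 1.0)
    matrix = [[one for _ in range(n)] for _ in range(n)]
    for (i, j), a in upper.items():
        if not 0 <= i < j < n:
            raise ValueError("judgments must be given for i < j only")
        l, m, u = a
        if not 0 < l <= m <= u:
            raise ValueError("a triangular number needs 0 < l <= m <= u")
        matrix[i][j] = a
        matrix[j][i] = reciprocal(a)
    return matrix


# Illustrative judgments for three attributes (values are made up).
judgments = {(0, 1): (2.0, 3.0, 4.0), (0, 2): (4.0, 5.0, 6.0), (1, 2): (1.0, 2.0, 3.0)}
fuzzy_matrix = build_judgment_matrix(3, judgments)
```

Only the upper-right judgments need to be supplied, which matches the later remark that the weight estimation relies only on that part of the matrix.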
Determination of weights and consistency check Given that the fuzzy judgment matrix A–B is a reciprocal one, the non-linear programming modification of the Fuzzy Preference Programming (FPP) method, which relies only on the elements of the upper-right part of matrix A–B, is used to estimate the weights (Formula (7)). In formula (7), the maximized quantity can be regarded as the value of the consistency index, and each wi is the weight of an attribute.
maximize
subject to
1. The optimal solution of the non-linear programming problem in formula (7), which includes one equality and six inequality constraints, is obtained; it consists of the optimal consistency index value and the weight vector w∗.
2. The ratios of the acquired weights are calculated. If all solution ratios $w_i^*/w_j^*$ roughly satisfy the double-sided inequalities, i.e. $l_{ij} \,\tilde{\le}\, w_i^*/w_j^* \,\tilde{\le}\, u_{ij}$, then the initial fuzzy judgements are
1. Input task collection T and user collection U. Traverse the unaccomplished task list T = {t1, t2, …, ti} and the user list U = {u1, u2, …, uj}, and obtain task ti and the information of user uj.
2. The situation information λ of uj is analyzed and processed, and the behavior sequence patterns of user uj are obtained through the user behavior pattern discovery algorithm SituBehaviorAnalytics(DS, Min_Support, G).
3. If the behavior sequence patterns of user uj match the category φ of uncompleted task ti, then the id of user uj is stored in user collection L1 of the unallocated tasks.
4. The suitability of each user in collection L1 for task ti (denoted $S_{L_m}^{t_i}$) is calculated, and the users are sorted by suitability from high to low.
5. The top θ users are selected and stored in user collection L2.
6. Tasks are assigned to the users in collection L2.
7. Steps 1 to 6 are repeated until there are no new unassigned tasks.
The detailed algorithm process is shown in Algorithm 1 and Fig. 4.
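Steps 3 to 5 above (filter by category, rank by suitability, keep the top θ users) can be sketched in Python as follows. This is a simplified illustration, not the paper's Algorithm 1: the function name is hypothetical, the behavior pattern discovery of step 2 is replaced by a precomputed mapping from user id to the set of task categories the user's frequent behaviors match, and suitability is supplied as a callable.

```python
from typing import Callable, Dict, List, Set


def assign_task(
    task_category: str,
    users_required: int,
    user_patterns: Dict[str, Set[str]],
    suitability_of: Callable[[str], float],
) -> List[str]:
    """Return the ids of the users selected for one task."""
    # Step 3: keep users whose behavior patterns match the task category (L1).
    candidates = [uid for uid, cats in user_patterns.items() if task_category in cats]
    # Step 4: sort the candidates by suitability, from high to low.
    ranked = sorted(candidates, key=suitability_of, reverse=True)
    # Step 5: the top theta users form the collection L2.
    return ranked[:users_required]


# Illustrative data: pattern sets and suitability scores are made up.
patterns = {
    "u1": {"security"},
    "u2": {"quality"},
    "u3": {"security", "quality"},
    "u4": {"security"},
}
scores = {"u1": 0.4, "u2": 0.9, "u3": 0.8, "u4": 0.6}
selected = assign_task("security", 2, patterns, lambda uid: scores[uid])
```

Here u2 has the highest score but is filtered out in step 3 because its patterns do not match the task category, which is the main difference from ranking by score alone.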
Table 3
| No. | Category | Assessment task |
|---|---|---|
| 1 | Security and privacy | Whether platforms regularly repair security vulnerabilities and prompt them or not |
| 2 | Security and privacy | Whether platforms allow users that publish content to choose who has the authority to view or not |
| 3 | Security and privacy | Whether platforms provide and implement hierarchical digital copyright protection architecture or not |
| 4 | Security and privacy | Whether the content provided by the platform is confused with fake information or not |
| 5 | Quality assurance | Whether platforms have reliable feedback/reporting structure for offensive contents or not |
| 6 | Quality assurance | Whether platforms allow users to evaluate for the veracity of information that exists in the platform or not |
| 7 | Quality assurance | Whether platforms afford final state of successful or failed submission tips or not |
is extensively used. The task assignment algorithm based on user theme awareness detects users’ themes by analyzing their historical data in order to assign tasks. However, both algorithms suffer from inaccurate assignment.
Experimental environment
To analyze the correctness and effectiveness of the proposed algorithm, the self-developed technology social platform Shareteches (formerly CyVOD) [35] (http://www.shareteches.com) and its mobile applications are used as experimental platforms to conduct experiments and data analysis. The web server is used as the task publisher and the mobile application is used as the client.
Experimental design
Social networks have gained widespread attention and have been extensively applied as platforms for people to spread information on the Internet and conduct social exchange activities.
At present, security and privacy issues of social network platforms highlight the urgent need for social network users to assess platform functionality, security precautions, privacy protection, and other features [36]. Based on the prevailing security and privacy issues in current social networks [37–39], this study designs seven assessment tasks for social networking platforms, which evaluate security trust, functionality, and other aspects of these platforms. The details of the crowd assessment information are shown in Table 3. In the course of the experiment, all participants come from Shareteches and consist of ordinary users and expert users. Moreover, these users have experience of using this social network platform and are therefore better able to perform these assessment tasks.
$$\mathrm{Recall} = \frac{\sum_{u} |R_u \cap T_u|}{\sum_{u} |T_u|} \tag{9}$$
[Figure: three panels plotting Precision, Recall and F-measure against the number of users (50 to 300) for task1 to task7]
$$F\text{-}measure = \frac{2 \cdot precision \cdot recall}{precision + recall} \tag{10}$$
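The evaluation metrics can be computed as in the following Python sketch. Recall follows Eq. (9) and the F-measure follows Eq. (10). Because Eq. (8) is not reproduced here, precision is assumed to be the analogous ratio with the total size of the Ru sets in the denominator. Reading Ru as the set of items returned for user u and Tu as the reference set for user u is also an assumption, as are the function name and the sample data.

```python
from typing import Dict, Set, Tuple


def precision_recall_f(
    returned: Dict[str, Set[str]],
    relevant: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall (Eq. 9) and F-measure (Eq. 10) over users."""
    users = set(returned) | set(relevant)
    hits = sum(len(returned.get(u, set()) & relevant.get(u, set())) for u in users)
    n_returned = sum(len(items) for items in returned.values())
    n_relevant = sum(len(items) for items in relevant.values())
    precision = hits / n_returned if n_returned else 0.0
    recall = hits / n_relevant if n_relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Illustrative data only.
R = {"u1": {"t1", "t2"}, "u2": {"t3"}}
T = {"u1": {"t1"}, "u2": {"t3", "t4"}}
p, r, f = precision_recall_f(R, T)
```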
Fig. 6 Comparative summary of the performance between the proposed algorithm and the two other algorithms. Recoverable panel titles: c F-measure comparison; d Degree of completion comparison. The panels compare the Random Algorithm, the Theme-aware Algorithm and the Based on SocialSitu Algorithm on Precision, Recall and F-measure against the number of users (50 to 300), and on Completeness against time (12 to 48 h).
Conclusion
To accurately and efficiently assign assessment tasks about the security and trustworthiness of online social networks to social users, this study defined task assignment factors, their attributes, and user task suitability by analyzing users’ situational information and historical records. Accordingly, this research designed a task assignment method of crowd assessment for online social networks and proposed a task assignment algorithm based on the human-centric computational abstraction of SocialSitu theory. Crowd assessment experiments were conducted on a real-world social network, Shareteches. The experimental results showed that the proposed method not only achieves both validity and effectiveness, but also further improves the security and trustworthiness of online social networks. In the future, we will first mine more effective task allocation factors based on users’ social behavior characteristics and content characteristics. Then, we will further combine machine learning methods with crowdsourcing theory to complete the security and trustworthiness assessment of social network platforms.
Acknowledgements
We are grateful to all members who have contributed to the research and development of Shareteches (formerly CyVOD) in the Henan International Joint Laboratory of Cyberspace Security Applications, and we would also like to thank the reviewers and editor for their valuable comments, questions and suggestions.
Authors’ contributions
ZZ put forward the initial ideas and arguments. ZZ, JJ and WX contributed to writing the manuscript. ZZ, WX, KRC and
BBG designed the methods and prepared figures. ZZ and JJ discussed and analyzed the results. All authors read and
approved the final manuscript.
Authors’ information
Zhiyong Zhang received his Master and Ph.D. degrees in Computer Science from Dalian University of Technology and Xidian University, P. R. China, respectively. He was previously a post-doctoral fellow at the School of Management, Xi’an Jiaotong University, China. He is currently a full-time Henan Province Distinguished Professor and Dean with the Department of Computer Science, College of Information Engineering, Henan University of Science & Technology, China. He is also a visiting professor in the Computer Science Department of Iowa State University. His research interests include cyber security and computing, social big data, multimedia content security and Digital Rights Management. In recent years, he has published over 120 scientific papers and edited 6 books in the above research fields, and he also holds 12 authorized patents. He is Chair of IEEE MMTC DRMIG, IEEE Systems, Man, Cybernetics Society Technical Committee on Soft Computing, World Federation on Soft Computing Young Researchers Committee, Committeeman of China National Audio, Video, Multimedia System and Device Standardization Technologies Committee. He is also an editorial board member and associate editor of Multimedia Tools and Applications (Springer), Human-centric Computing and Information Sciences (Springer), IEEE Access (IEEE), Neural Network World, EURASIP Journal on Information Security (Springer), and leading guest editor or co-guest editor of Applied Soft Computing (Elsevier), Computer Journal (Oxford) and Future Generation Computer Systems (Elsevier). In addition, he is Chair/Co-Chair and TPC member for numerous international conferences and workshops on digital rights management and cloud computing security.
Junchang Jing received the B.E. degree and the M.S. degree from the College of Mathematics and Information Science,
Henan Normal University, Xinxiang, China, in 2015 and 2018, respectively. He is currently pursuing the Ph.D. degree with
College of Information Engineering, Henan University of Science & Technology, Luoyang, China. His research interests
include social network security and computing, machine learning, and social big data.
Xiaoxue Wang received the B.E. degree and the M.S. degree from College of Information Engineering, Henan University
of Science & Technology, China, in 2015 and 2018, respectively. Her research interests include crowd assessment, social
network security and trustworthiness.
Kim-Kwang Raymond Choo received the Ph.D. in Information Security in 2006 from Queensland University of Technology, Australia. He currently holds the Cloud Technology Endowed Professorship at The University of Texas at San Antonio.
In 2016, he was named the Cybersecurity Educator of the Year—APAC (Cybersecurity Excellence Awards are produced
in cooperation with the Information Security Community on LinkedIn), and in 2015 he and his team won the Digital
Forensics Research Challenge organized by Germany’s University of Erlangen-Nuremberg. He is the recipient of the 2018
UTSA College of Business Col. Jean Piccione and Lt. Col. Philip Piccione Endowed Research Award for Tenured Faculty,
IEEE TrustCom 2018 Best Paper Award, ESORICS 2015 Best Paper Award, 2014 Highly Commended Award by the Australia
New Zealand Policing Advisory Agency, Fulbright Scholarship in 2009, 2008 Australia Day Achievement Medallion, and
British Computer Society’s Wilkes Award in 2008. He is a Fellow of the Australian Computer Society.
Brij B. Gupta received his PhD degree from the Indian Institute of Technology Roorkee, India, in the area of Information and Cyber Security. In 2009, he was selected for the Canadian Commonwealth Scholarship awarded by the Government of Canada. He has published more than 175 research papers in international journals and conferences of high repute, including IEEE, Elsevier, ACM, Springer, Wiley, Taylor & Francis, Inderscience, etc. He has visited several countries, i.e. Canada, Japan, Malaysia, Australia, China, Hong-Kong, Italy, Spain, etc., to present his research work. His biography was selected and published in the 30th Edition of Marquis Who’s Who in the World, 2012. Dr. Gupta also received the Young Faculty research fellowship award from the Ministry of Electronics and Information Technology, Government of India, in 2017. He is also working as principal investigator of various R&D projects. He is serving as associate editor of IEEE Access and IEEE TII, and as Executive editor of IJITCA, Inderscience. He also serves as a reviewer for journals of IEEE, Springer, Wiley, Taylor & Francis, etc., and as guest editor of various reputed journals. He was also a visiting researcher/professor with the University of Murcia (UMU), Spain, Deakin University, Australia, and Yamaguchi University, Japan, in 2018, 2017 and 2015, respectively, and with many other universities. At present, Dr. Gupta is working as Assistant Professor in the Department of Computer Engineering, National Institute of Technology Kurukshetra, India. His research interests include information security, cyber security, cloud computing, web security, intrusion detection and phishing.
Funding
The work was sponsored by the National Natural Science Foundation of China (Grant Nos. 61972133 and 61772174), the Plan For Scientific Innovation Talent of Henan Province (Grant No. 174200510011), and the Project of Leading Talents in Science and Technology Innovation for Thousands of People Plan in Henan Province (Grant No. 204200510021).
Competing interests
The authors declare that they have no competing interests.
Author details
1 Information Engineering College, Henan University of Science and Technology and Henan International Joint Laboratory of Cyberspace Security Applications, Luoyang 471023, People’s Republic of China.
2 Department of Information Systems and Cyber Security, University of Texas at San Antonio, San Antonio, TX 78249, USA.
3 National Institute of Technology Kurukshetra, Kurukshetra 136119, India.
References
1. Persia F, D’Auria D (2017) A survey of online social networks: challenges and opportunities. In: IEEE international conference on information reuse and integration, pp 614–620
2. Yan R, Li D, Wu W, Du D, Wang Y (2019) Minimizing influence of rumors by blockers on social networks: algorithms and
analysis. IEEE Trans Netw Sci Eng. https://doi.org/10.1109/TNSE.2019.2903272
3. Roy D, Lotan G, Zeng W (2015) The attention automaton: sensing collective user interests in social network communities. IEEE Trans Netw Sci Eng 2(1):40–52
4. Howe J (2006) The rise of crowdsourcing. Wired Mag 14(14):1–5
5. Parshotam K (2013) Crowd computing: a literature review and definition. In: South African Institute for Computer Scientists & Information Technologists Conference, pp 121–130
6. Chang CK (2018) Situation analytics-at the dawn of a new software engineering paradigm. Sci China Inform Sci
61(5):050101:1–050101:14
7. Xing Y, Wang L, Li Z (2019) Multi-Attribute crowdsourcing task assignment with stability and satisfactory. IEEE Access
7:133351–133361
8. Tran L, To H, Fan L (2018) A real-time framework for task assignment in hyperlocal spatial crowdsourcing. ACM Trans
Intell Syst Technol 9(3):37:1–37:26
9. Alireza S, Shafigheh H, Masoud RA (2018) Personality classification based on profiles of social networks’ users and the
five-factor model of personality. Hum-Cent Comput Inform Sci 8(1):24–38
10. Karger DR, Oh S, Shah D (2011) Iterative learning for reliable crowdsourcing systems. In: International conference on neural information processing systems, pp 1953–1961
11. Zhang X, Yang Z, Wu C, Sun W, Liu Y, Liu K (2014) Robust trajectory estimation for crowdsourcing-based mobile applications. IEEE Trans Parallel Distrib Syst 25(7):1876–1885
12. Zhang Z, Sun R, Wang X, Zhao C (2019) A situational analytic method for user behavior pattern in multimedia social
networks. IEEE Trans Big Data 5(4):520–528
13. Srinivasan K, Agrawal P, Arya R, Akhtar N, Pengoria D, Gonsalves TA (2012) Context-aware, QoE-driven adaptation of multimedia services. In: 5th international conference on mobile wireless middleware, operating systems, and applications, pp 236–249
14. Tekin C, Van Der Schaar M (2015) Contextual online learning for multimedia content aggregation. IEEE Trans Multimed
17(4):549–561
15. Bulterman D, Cesar P, Guimaraes R (2013) Socially-aware multimedia authoring: past, present, and future. ACM Trans
Multimed Comput Commun Appl 9(1):35:1–35:23
16. Chang C, Jiang H, Ming H, Oyama K (2009) Situ: a situation-theoretic approach to context-aware service evolution. IEEE
Trans Serv Comput 2(3):261–275
17. Chang C (2016) Situation analytics—a foundation for a new software engineering paradigm. Computer 49(1):24–33
18. An J, Gui L, He Q, Wu B (2015) Crowdsourcing assignment mechanism based on AHP in mobile crowd sensing. J Beijing
Univ Posts Telecommun 38(5):37–41
19. Zhang X, Shang L, Yuan Y (2017) A crowd wisdom management framework for crowdsourcing systems. IEEE Access
4:9764–9774
20. Song T, Xu K, Li J et al (2019) Multi-skill aware task assignment in real-time spatial crowdsourcing. GeoInformatica
24:153–173
21. Guo B, Liu Y, Wang L (2018) Task allocation in spatial crowdsourcing: current state and future directions. IEEE Internet
Things J 5(3):1749–1764
22. Sun Y, Tan W (2019) A trust-aware task allocation method using deep q-learning for uncertain mobile crowdsourcing.
Hum-Cent Comput Inform Sci 9(1):1–27
23. Mao H (2016) OCC: Opportunistic crowd computing in mobile social networks. Database systems for advanced applications. Springer, Berlin
24. Zhang X, Li G, Feng J (2015) Theme-aware task assignment in crowd computing on big data. J Comput Res Dev 52(2):309–317
25. Kim M, Gupta BB, Rho S (2018) Crowdsourcing based scientific issue tracking with topic analysis. Appl Soft Comput.
https://doi.org/10.1016/j.asoc.2017.09.028
26. Yuan D, Li Q, Li G et al (2020) PriRadar: a privacy-preserving framework for spatial crowdsourcing. IEEE Trans Inform
Forensics Secur 15:299–314
27. Li Y, Yi G, Shin B (2019) Spatial task management method for location privacy aware crowdsourcing. Cluster Comput
22:1797–1803
28. Sharma V, You I, Jayakody DNK et al (2019) Cooperative trust relaying and privacy preservation via edge-crowdsourcing
in social Internet of Things. Fut Gener Comput Syst 92:758–776
29. Ma Y, Sun Y, Lei Y et al (2019) A survey of blockchain technology on security, privacy, and trust in crowdsourcing services.
World Wide Web. https://doi.org/10.1007/s11280-019-00735-4
30. Wang Y, Cai Z, Tong X et al (2018) Truthful incentive mechanism with location privacy-preserving for mobile crowdsourc‑
ing systems. Comput Netw 135:32–43
31. Chi Z, Wang Y, Huang Y, Tong X (2018) The novel location privacy-preserving CKD for mobile crowdsourcing systems.
IEEE Access 6:5678–5687
32. Jiang J, An B, Jiang Y, Lin D (2020) Context-aware reliable crowdsourcing in social networks. IEEE Trans Syst Man Cybern
Syst 50(2):617–632
33. Huang Y, Chen M (2019) Improve reputation evaluation of crowdsourcing participants using multidimensional index
and machine learning techniques. IEEE Access 7:118055–118067
34. Mikhailov L, Tsvetinov P (2004) Evaluation of services using a fuzzy analytic hierarchy process. Appl Soft Comput
5(1):23–33
35. Zhang Z, Sun R, Zhao C, Wang J, Chang C, Gupta B (2017) CyVOD: a novel trinity multimedia social network scheme.
Multimed Tools Appl 76(18):18513–18529
36. Zhang Z, Wen J, Wang X, Zhao C (2018) A novel crowd evaluation method for security and trustworthiness of online
social networks platforms based on signaling theory. J Comput Sci 26:468–477
37. Zhang Z, Gupta B (2018) Social media security and trustworthiness: overview and new direction. Fut Gener Comput Syst
86:914–925
38. Oukemeni S, Rifa-Pous H (2019) Privacy analysis on microblogging online social networks: a survey. ACM Comput Surv
52(3):1–20
39. Souri A, Hosseini R (2018) A state-of-the-art survey of malware detection approaches using data mining techniques.
Hum-Cent Comput Inform Sci 8(1):3–25
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Social Network Analysis and Mining (2020) 10:59
https://doi.org/10.1007/s13278-020-00675-2
ORIGINAL ARTICLE
Abstract
A social networking site (SNS) is a platform for building social networks or social relations among people who share similar
profiles, careers, interests, activities, backgrounds, etc. Social networking sites serve the communication needs of specific
interest groups, but they do not offer a search option for finding individuals or groups by their distinguishing features.
A user can be found on a social network only if the name of the person is known. In this paper, we present a search
method that uses natural language processing together with query optimization and classification to search for
people by name or by any distinct characteristic. Together, these concepts can be used to organize and process social network
information so that searching for users by name or by some distinguishing characteristic is feasible. This paper addresses
some of the present limitations of accessing social network data by means of natural language processing. We have used
PHP for programming and algorithm implementation and MySQL for storing sample data.
• All professors of IIT, Patna, in the Department of Computer Science
• All students of St. Louis working as Directors in Malaysia
• Students of DPS, working in USA
• Employee of Infosys from India
• Students of Ph.D. from BIT Mesra Ranchi doing Research in web semantics

Enhancing SNS data access is an important step toward a solution to this restriction. In this paper, we present a mechanism to represent the user profile and to convert the natural language query of a user into an SQL query. After getting the SQL query, it is improved to give the best possible results.
The paper has been organized as follows:
In the first segment, we discuss social networking sites and natural language processing. In the second segment, we discuss the related works using PHP as programming language and MySQL as a database. In the third segment, we discuss the mechanism used in our work. In the fourth segment, we discuss the methods used in query classification and optimization for quick results. In the fifth segment, we discuss the performance analysis. In the sixth segment, we discuss the implementation part. In the seventh segment, we compare our work with similar works done in other research. Finally, in the eighth segment we conclude our paper.

2 Related works

In this section, we discuss a few related works by other researchers which are similar to our work. In the end, we compare our work with some of these works.
In the work by Lee et al. (2019), they have developed a building construction traceability system where builders can directly input the construction inspection records into the system during the building construction process from their handheld device on site personal computers. In the previous works, the builders had to input data from their personal computers, which required more job load for site engineers because they needed to first keep the records on paper in the field and then input the data into the system using the computer in the office. In their work, they have designed a more user-friendly interface so that it is easier for the engineer to directly input the inspection data using their handheld devices on site. To achieve this, they have adopted the Responsive Web Design and Bootstrap Framework to develop the new interface. Also, they have added feedback and suggestions from current users. The data collected will be stored in some database and can be used to search and trace building constructions going on in a city, especially in Taiwan. Though their work is good, it is limited to searching in the building construction domain only.
In the work by Zheng et al. (2019), they have discussed a symmetric searchable encryption (SSE) scheme for privacy and security of cloud data and fast search, and have also effectively implemented phrase searching. In their work, they have proved that the scheme satisfies IND-CKA2 under the selected keywords through security analysis. Their work is similar to our work as we convert user queries to SQL queries and then use them in our searching model.
In the work by Srivastava and Patnaik (2019), they have presented a streamlined algorithm, perceptual image resemblance based on neighborhoods, termed Description Proximity Cover (DPC). They have also done a comparative analysis of DPC with other existing metric-based algorithms using a large dataset of images. They have also discussed the resemblance of facial expression of images carried out using DPC. From their result it is seen that the retrieval performance may be significantly improved using Description Proximity Cover. Image similarity search is very crucial and has many uses in different search-based systems. In our search in social networks, it can be used to determine similarities among users by images, as the image is a key point of user resemblance. However, some changes are needed so that images can be used in searching similar profiles.
In the work by Nandy et al. (2020), they have proposed an intense activity recognition framework that combines features from a smartphone accelerometer and from a wearable heart rate sensor. In the modern era, human activity recognition (HAR) can be of great help, especially in the field of health monitoring and rehabilitation. There are many similar works using one or more specific devices (with embedded sensors), including smartphones, for activity recognition. In most of these works, the detected activities are coarse grained, like sit or walk. Their work is suitable for applications where a person’s health and, more importantly, physical exertions for performing those activities are important, as in the health sector as well as the insurance sector. Our work is also similar in the way that we recognize people in the social network by their activity.
In the work by Chiu and Lee (2018), named ‘Selfie Mirror,’ they have designed an application to provide people with their dress styles. They have also taken user input to improve their interface in future. In recent years, ‘selfies’ have become popular around the world. This has resulted in many makeup apps which can detect users’ faces and then place makeup on part of the face or add some effects. Some apparel marketing companies have provided apps for consumers to give some advice on the products they can wear. The work by Chiu et al. is to develop an application that provides help with taking pictures and saving images in a library to show consumers’ dressing styles. This work is very similar to our work. In our work we have built an application to
search social network users based on certain characteristics. Dressing style too is one of the important characteristics of a user. Therefore, their work is similar to our work in this respect.
In the work by Chiu et al. (2019), they have explored the design and production of mobile device-based materials for cultural heritage tourism and tourist learning outcomes. Mobile applications have recently been adopted to help volunteer tour guides-in-training and visitors to understand the relations and knowledge associated with Taiwan. They divide the participants into control and experimental groups to examine their learning outcomes in terms of their cultural and historical knowledge before and after embarking on the tour. They then found that guided tours assisted by digital technologies lead to better learning outcomes than narrative and interpretive tours which are delivered by guides alone. However, the method of dividing participants into experimental groups can be used in searching users of similar profile by forming groups of similar users.
In the work by Wei (2019), they have developed the Emotional Competencies Scale for Young Children (ECSYC), an emotional lexicon-based application. The purpose of their study is to establish the criterion-related validity. In their work, they have followed certain steps. In the first step, they developed 40 scenarios based on ECSYC. In the second step, they developed the five-level criteria. In the third step, they implemented observer training and calculated inter-rater consistency reliability. In the final stage, they categorized children’s replies into different levels. In their work, they ranked the sequence of frequency of each level and completed the emotional lexicon. The work is similar to our work in the manner that we have to search users in the network based on certain features; emotional lexicon is one of them.
In the work by Eklaspur and Pashupatimath (2015), they have created a web application using HTML and PHP, and the database used for storing and retrieving data is MySQL. In their work, they have proposed a framework that recommends friends using an efficient algorithm. Based on the activities of the users of Facebook, they assign some values and compute the score of each individual. This score is used to analyze and compute the percentage of similarity of life styles between users and to recommend friends based on similarity. They have tried to develop a web application which is linked with the Facebook login page, through which the users can give permissions. The user data can be retrieved through the access tokens specified for each user. In this approach, they have presented the design and implementation of a semantic-based friend recommendation system for social networks. However, such work is not suitable to search for a particular user, as the user being searched may have a lot of dissimilarities with the user who is searching.
In the work by Wanjala and Kahonge (2016), they have developed an application using Linux Apache MySQL PHP and Python. The application uses a Python page ranking algorithm to perform web crawling, and the data are stored in a MySQL database. The data are collected from the available web forums and stored in the MySQL database after indexing. Their algorithm does not use the records of a particular search engine; instead, it retrieves data mostly from blogs based on the sentiments of the posts. Their work can be modified to be used for searching social networking sites and may prove to be a good searching algorithm too.
In the work by Radhi and Majeed (2018), they have implemented a system using the PHP programming language and MySQL as a database server to be more compatible with web applications. Their system extracts data from Facebook via Graph API 2.7, and then the data are restructured to be compatible with the SNA system database, which should be capable of dealing with Arabic data. Finally, these data became a seed for the MySQL database of their proposed system. In their work, they have tried to tackle the challenge faced by Arabian communities and the Arabic language in social network analysis, especially the people of Iraq. Their work can be modified to be used globally, as social networking sites are not limited to countries, communities and the languages they use. Also, their work can be useful in algorithms which search social networking sites for reaching a particular user.
The work by Srivastav and Chauhan (2017) retrieves data from social networking sites such as Facebook by using semantic technology. Further, to get insight and share some knowledge, they have also analyzed the data later by using Rtool. But their data are in the CSV format, which is converted to OWL, the Web Ontology Language. There are certain limitations in CSV data, and it is not suitable for data manipulation and concurrent access.
In the work by Kou and Du (2018), they have taken into consideration photographs, video and audio, etc. However, for searching social network users, basic user information like college name, school name, city, state, etc., can be more useful.
Carminati et al. (2011) have presented a semi-decentralized discretionary access control model. They have also discussed the related enforcement mechanism for controlled sharing of information in OSNs. The model allows the specification of access rules for online resources, where authorized users are denoted in terms of the relationship type, depth and trust level existing between nodes in the network. Their work is better suited to accessing online resources. It can be modified to be used for searching social networking sites.
Compared to the existing approaches, we use natural language processing to understand user queries and convert them to SQL queries so that we can search people on the social network sites with certain characteristics or features.
Our query is then optimized and classified for better and faster data retrieval.
Ultimately, we see that these methods have not used natural language processing to understand user requirements. In our work, we have used NLP along with query optimization and classification for better, more accurate and faster data retrieval.
1. The query is divided into tokens, and prepositions, pronouns, conjunctions and determiners are excluded as they are not needed in the search query.
2. The words that are left after performing step 1 are matched with column names in the table. If a word matches, it is used in the where clause/condition in the select statement.
3. Match each word in the list of cities. If it is found, we include one condition city=‘word’ in the select statement. Since there are only 4037 cities with population more than 100,000, the sorted list can be searched in O(n) time (http://brilliantmaps.com/4037-100000-person-cities/01/Jan/2017).
5. We then match the left-out words in the list of designations. If a word is found in the list of designations, we include one condition designation=‘word’ in the select statement. As this table is two-dimensional, containing identical designations in each row, the complexity will be O(n^2). Since the list is small, for small n we can consider it as O(n).
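The token-matching steps listed here can be illustrated with a minimal Python sketch. It is not the authors' PHP implementation: the stop-word set, the city and designation lists, the table name `user_profile`, the naive plural stripping and the `%s` placeholder style are all assumptions made for illustration. It covers tokenizing and the city and designation lookups only; column-name matching is left out.

```python
import re

# Hypothetical lookup data; the paper's actual lists are not reproduced here.
STOP_WORDS = {"a", "all", "an", "and", "as", "from", "in", "of", "the", "who", "with"}
CITIES = {"patna", "ranchi"}
DESIGNATIONS = {"professor", "director", "employee", "student"}


def query_to_sql(text, table="user_profile"):
    """Build a parameterized SELECT from a short natural-language query."""
    # Tokenize and drop function words that carry no search meaning.
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

    conditions, params = [], []
    for word in tokens:
        # Naive plural handling (assumption): "professors" -> "professor".
        singular = word[:-1] if word.endswith("s") else word
        if word in CITIES:
            conditions.append("city = %s")
            params.append(word)
        elif singular in DESIGNATIONS:
            conditions.append("designation = %s")
            params.append(singular)

    sql = f"SELECT * FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params


sql, params = query_to_sql("All professors of IIT in Patna")
print(sql)     # SELECT * FROM user_profile WHERE designation = %s AND city = %s
print(params)  # ['professor', 'patna']
```

Sets are used for the lookups so that each membership test is constant time on average, and the values are passed as parameters rather than spliced into the SQL string, which avoids injection through the user's query.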
5 Performance analysis
actual value is the value given in the query by the user. Let us
understand each term:
True positive (TP): the predicted value of the class is Yes and the actual value in the query is Yes; for example, the predicted value for ‘City’ is ‘Patna’ and the actual value of ‘City’ in the query is ‘Patna’.
True negative (TN): the predicted value of the class is No and the actual value is No; for example, the predicted value for ‘City’ is ‘Not Patna,’ i.e., other than Patna, and the actual value of ‘City’ in the query is also ‘Not Patna’.
False positive (FP): the predicted value of the class is Yes and the actual value in the query is No; for example, the predicted value for ‘City’ is ‘Patna’ and the actual value of ‘City’ in the query is ‘Not Patna’.
False negative (FN): the predicted value of the class is No and the actual value is Yes; for example, the predicted value for ‘City’ is ‘Not Patna’ and the actual value of ‘City’ in the query is ‘Patna’.
Fig. 5 User interface used to take user query as input
Table 1 Structure of user profile in MySQL database
Five hundred test queries are used to evaluate classification accuracy. Of these, 420 are
classified correctly (TP = 300 and TN = 120) and 80 are classified incorrectly
(FP = 50 and FN = 30).
Accuracy—accuracy is an important performance meas-
ure, and it is simply a ratio of correctly predicted observation
to the total observations.
Accuracy = (TP + TN)∕(TP + FP + FN + TN) = 420∕500 = 0.84
For our system, this value is 0.84 which means our system
is approx. 84% accurate.
Precision—The ratio of correctly predicted positive val-
ues to the total predicted positive values.
Precision = TP∕(TP + FP) = 300∕(300 + 50) = 300∕350 ≈ 0.857
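As a quick check of the arithmetic, the two measures can be recomputed from the reported confusion-matrix counts. The following is a minimal Python sketch; the function names are illustrative and not taken from the paper.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all queries that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Share of positive predictions that were actually positive."""
    return tp / (tp + fp)


# Counts reported for the 500 test queries.
tp, tn, fp, fn = 300, 120, 50, 30

acc = accuracy(tp, tn, fp, fn)
prec = precision(tp, fp)
print(f"accuracy={acc:.3f} precision={prec:.3f}")  # accuracy=0.840 precision=0.857
```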
these conditions may have multiple words. For this, the query is classified, and by knowledge learning from corpus of query, we get the actual result which is to be found.

Acknowledgements The authors would like to thank all those who were directly or indirectly involved in this research work. First of all, we would like to thank the faculty members and students of Maulana Azad College of Engineering and Technology, Patna, and BIT Mesra Ranchi who have shown exemplary interest and supported us in this research. Special thanks are due to the Staff of S. S. Systems Pvt. Ltd for their help in implementation of the software and database.

References

Bhan M, Kumar TVS, Rajanikanth K (2013) Materialized view size estimation using sampling. In: IEEE international conference on computational intelligence and computing research
Carminati B, Ferrari E, Heatherly R, Kantarcioglu M (2011) Semantic web based social network access control. Elsevier, Amsterdam
Chiu C, Lee L (2018) Empirical study of the usability and interactivity of an augmented-reality dressing mirror. Microsyst Technol 24:4399–4413. https://doi.org/10.1007/s00542-018-3879-1
Chiu C, Wei W, Lee L et al (2019) Augmented reality system for tourism using image-based recognition. Microsyst Technol. https://doi.org/10.1007/s00542-019-04600-2
Deshpande A, Hellerstein L (2008) Flow algorithms for parallel query optimization. IEEE
Eklaspur NM, Pashupatimath AS (2015) A friend recommender system for social networks by life style extraction using probabilistic method - “Friendtome”. Int J Comput Sci Trends Technol IJCST 3(3)
Fadoua H, Amel TG (2018) Smart query optimization approach in distributed environment. Proc Comput Sci 126:355–362. In: 22nd international conference on knowledge-based and intelligent information and engineering systems (Elsevier)
Jogekar RN, Mohod A (2013) Design and implementation of algorithms for materialized view selection and maintenance in data warehousing environment. Int J Emerg Technol Adv Eng 3(9):134–140
Khin NTW, Yee NN (2018) Query classification based information retrieval system. In: 2018 international conference on intelligent informatics and biomedical sciences (ICIIBMS), vol 3, pp 151–156
Kou F, Du J-P (2018) Hashtag recommendation based on multi-features of microblogs. J Comput Sci Technol 33:711–726
Lee M, Wang Y, Huang C (2019) Design and development of a friendly user interface for building construction traceability system. Microsyst Technol. https://doi.org/10.1007/s00542-019-04547-4.pdf
Li Y, Wang H, Li Y (2017) Research on query analysis and optimization based on spark. In: 2017 6th international conference on computer science and network technology (ICCSNT). IEEE
List of cities in the world. http://brilliantmaps.com/4037-100000-person-cities/01/Jan/2017
List of colleges and universities in the world. http://www.4icu.org/reviews/index0001.html/01/Jan/2017
List of countries in the world. http://www.infoplease.com/ipa/A0932875.html/01/Jan/2017
Maynard D, Bontcheva K, Augenstein I (2016) Natural language processing for the semantic web
Mehta S, Kaur P, Lodhi P, Mishra O (2018) Empirical evidence of heuristic and cost based query optimizations in relational databases. In: 2018 eleventh international conference on contemporary computing (IC3). IEEE, pp 1–3
Mika P (2005) Social network and the semantic web. http://www.springer.com/in/book/9780387710006
Nandy A, Saha J, Chowdhury C (2020) Novel features for intensive human activity recognition based on wearable and smartphone sensors. Microsyst Technol 26:1889–1903. https://doi.org/10.1007/s00542-019-04738-z
Patel D, Patel P (2015) Article: a review paper on different approaches for query optimization using schema object base view. Int J Comput Appl 114(4):16–18
Radhi AM, Majeed GA (2018) A novel technique in data mining via investing social analysis tools. Int J Eng Technol IJET 10(5)
Srivastav A, Chauhan A (2017) Social network data retrieval using semantic technology. Asian J Pharm Clin Res 10:31–35
Srivastava P, Patnaik KS (2019) Valuation of facial image likeness under different posture. Microsyst Technol 25:4625–4635. https://doi.org/10.1007/s00542-019-04441-z
Thangam AR, Peter SJ (2016) An extensive survey on various query optimization techniques. IJCSMC 5(8):148–154
Thannaing M, Hlaing A (2014) Improving information retrieval based on query classification algorithm. Mach Learn Appl Int J (MLAIJ) 1(1)
Wanjala GW, Kahonge AM (2016) Social media forensics for hate speech opinion mining. Int J Comput Appl 155(1):39–47
Wei W (2019) Development and evaluation of an emotional lexicon system for young children. Microsyst Technol. https://doi.org/10.1007/s00542-019-04425-z
Zarate MJA, Pazos RRA, Gelbukh A, Perez OJ (2007) Improving the customization of natural language interface to databases using an ontology
Zheng J, Zhang J, Zhang X et al (2019) Symmetric searchable encryption scheme that supports phrase search. Microsyst Technol. https://doi.org/10.1007/s00542-019-04515-y
Zhou S, Cheng K, Men L (2017) The survey of large-scale query classification. In: AIP conference proceedings 1834(1)
Albladi and Weir Cybersecurity (2020) 3:7
https://doi.org/10.1186/s42400-020-00047-5
Cybersecurity
Abstract
The popularity of social networking sites has attracted billions of users to engage and share their
information on these networks. The vast amount of circulating data and information expose these networks
to several security risks. Social engineering is one of the most common types of threat that may face social
network users. Training and increasing users’ awareness of such threats is essential for maintaining
continuous and safe use of social networking services. Identifying the most vulnerable users in order to
target them for these training programs is desirable for increasing the effectiveness of such programs. Few
studies have investigated the effect of individuals’ characteristics on predicting their vulnerability to social
engineering in the context of social networks. To address this gap, the present study developed a novel
model to predict user vulnerability based on several perspectives of user characteristics. The proposed
model includes interactions between different social network-oriented factors such as level of involvement
in the network, motivation to use the network, and competence in dealing with threats on the network.
The results of this research indicate that most of the considered user characteristics are factors that
influence user vulnerability either directly or indirectly. Furthermore, the present study provides evidence
that individuals’ characteristics can identify vulnerable users so that these risks can be considered when
designing training and awareness programs.
Keywords: Deception, Information security, Phishing, Social engineering, Social network, Vulnerability
awareness-raising and target training sessions for those individuals, with the aim of reducing their likely victimisation.
With such objectives in mind, the present research developed a conceptual model that reflects the extent to which the user-related factors and dimensions are integrated as a means to predict users’ vulnerability to social engineering-based attacks. This study used a scenario-based experiment to examine the relationships between the behavioural constructs in the conceptual model and the model’s ability to predict user vulnerability to SE victimisation.
The organisation of this paper is as follows: Theoretical background section briefly analyses the related literature that was considered in developing the proposed model. The methods used to evaluate this model are described in Methods section. Following this, the results of the analysis are summarised in Results section. Discussion section provides a discussion of the findings while Theoretical and practical implications section presents the theory and practical implications. An outline approach to a semi-automated advisory system is proposed in A semi-automated security advisory system section. Finally, Conclusion section draws conclusions from this work.

Theoretical background
People’s vulnerability to cyber-attacks, and particularly to social engineering-based attacks, is not a newly emerging problem. Social engineering issues have been studied in email environments (Alseadoon et al. 2015; Halevi et al. 2013; Vishwanath et al. 2016), organisational environments (Flores et al. 2014, 2015), and recently in social network environments (Algarni et al. 2017; Saridakis et al. 2016; Vishwanath 2015). Yet, the present research argues that the context of these exploits affects peoples’ ability to detect them, and that the influences create new characteristics and elements which warrant further investigation.
The present study investigated user characteristics in social networks, particularly Facebook, from different angles such as peoples’ behaviour, perceptions, and socio-emotions, in an attempt to identify the factors that could predict individuals’ vulnerability to SE threats. People’s vulnerability level will be identified based on their response to a variety of social engineering scenarios. The following sub-sections will address in detail the relationship between each factor of the three perspectives and user susceptibility to SE victimisation.

Habitual perspective
Due to the importance of understanding the impact of peoples’ habitual factors on their susceptibility to SE in SNs, this study aims to measure the effect of level of involvement, number of SN connections, percentage of known friends among the network’s connections, and SN experience on predicting user susceptibility to SE in the conceptual model.

Level of involvement
This construct is intended to measure the extent to which a user engages in Facebook activities. When people are highly involved with a communication service, they tend to be relaxed and ignore any cues associated with such service that warn of deception risk (Vishwanath et al. 2016). User involvement in a social network can be measured by the number of minutes spent on the network every day and the frequency of commenting on other people’s status updates or pictures (Vishwanath 2015). Time spent on Facebook is positively associated with disclosing highly sensitive information (Chang and Heo 2014). Furthermore, people who are more involved in the network are believed to be more exposed to social engineering victimisation (Saridakis et al. 2016; Vishwanath 2015).
Conversely, highly involved users are supposed to have more experience with the different types of threat that could occur online. Yet, it has been observed that active Facebook users are less concerned about sharing their private information as they usually have less restrictive privacy settings (Halevi et al. 2013). Users’ tendency to share private information could relate to the fact that individuals who spend a lot of time using the network usually exhibit high trust in the network (Sherchan et al. 2013). Therefore, the following hypotheses have been proposed.

Ha1. Users with a higher level of involvement will be more susceptible to social engineering attacks (i.e., there will be a positive relationship).
◦ Hb1. The user’s level of involvement positively influences the user’s experience with cybercrime.
◦ Hb2. The user’s level of involvement positively influences the user’s trust.

Number of connections
Despite the fact that having a large number of SN connections could increase people’s life satisfaction if they are motivated to engage in the network to maintain friendships (Rae and Lonborg 2015), this high number of contacts in the network is claimed to increase vulnerability to online risks (Buglass et al. 2016; Vishwanath 2015). Risky behaviour such as disclosing personal information in Facebook is closely associated with users’ desire to maintain and
increase the number of existing friends (Chang and Heo 2014; Cheung et al. 2015). Users with a high number of social network connections are motivated to be more involved in the network by spending more time sharing information and maintaining their profiles (Madden et al. 2013).
Furthermore, a high number of connections might suggest that users are not only connected with their friends but also with strangers. Vishwanath (2015) has claimed that connecting with strangers on Facebook can be considered as the first level of cyber-attack victimisation, as those individuals are usually less suspicious of the possible threats that can result from connecting with strangers in the network. Furthermore, Alqarni et al. (2016) have adopted this view to test the relationship between severity and vulnerability of phishing attacks and connection with strangers (as assumed to present the basis for phishing attacks). Their study indicated a negative relationship between the number of strangers that the user is already connected to and the user’s perception of the severity and their vulnerability to phishing attacks in Facebook. Therefore, if users are connected mostly with known friends on Facebook, this could be seen as a mark of less vulnerable individuals. With all of these points in mind, the following hypotheses are generated.

Ha2: Users with a higher number of connections will be more susceptible to social engineering attacks (i.e., there will be a positive relationship).
◦ Hb3: The user’s number of connections positively influences the user’s level of involvement.
Ha3: Users with higher connections with known friends will be less susceptible to social engineering attacks (i.e., there will be a negative relationship).

more experienced are the users with SNs, the less vulnerable they are to SE victimisation.
Additionally, in the context of the social network, Internet experience has been found to predict precautionary behaviour, and further causes greater sensitivity to associated risks in using Facebook (Van Schaik et al. 2018). Thus, years of experience in using the network could increase the individual’s awareness of the risk associated with connecting with strangers. Accordingly, the present study postulates that more experienced users would have a high percentage of connections with known friends in the network.

Ha4: Users with a higher level of experience with social network will be less susceptible to social engineering attacks (i.e., there will be a negative relationship).
◦ Hb4: The user’s social network experience positively influences the user’s connections with known friends.

Perceptual perspective
People’s risk perception, competence, and cybercrime experience are the three perceptual factors that are believed to influence their susceptibility to social engineering attacks. The strength and direction of these factors’ impact will be discussed as follows.

Risk perception
Facebook users have a different level of risk perception that might affect their decision in times of risk. Vishwanath et al. (2016) have described risk perception as the bridge between user’s previous knowledge about the expected risk and their competence to deal
with that risk. Many studies have considered perceiv-
Social network experience ing the risk associated with engaging in online activ-
People’s experience in using information communica- ities as having a direct influence on avoiding using
tion technologies makes them more competent to online services (Riek et al. 2016) and more import-
detect online deception in SNs (Tsikerdekis and antly as decreasing their vulnerability to online threats
Zeadally 2014). For instance, it has been found that (Vishwanath et al. 2016). Facebook users’ perceived
the more time elapsed since joining Facebook makes risk of privacy and security threats significantly pre-
the user more capable of detecting SE attacks dict their strict privacy and security settings (Van
(Algarni et al. 2017). Furthermore, despite the fact Schaik et al. 2018). Thus, if online users are aware of
that some researchers argue that computer experience the potential risks and their consequences that might
has no significant impact on their phishing suscepti- be encountered on Facebook, they will probably avoid
bility (Halevi et al. 2013; Saridakis et al. 2016), other clicking on malicious links and communicating with
research on email phishing found positive impact strangers on the network. This indicates that risk per-
from number of years of using the Internet and num- ception contributes to the user’s competence in deal-
ber of years of using email on people’s detection abil- ing with online threats and should lead to a decrease
ity with email phishing (Alseadoon 2014; Sheng et al. in susceptibility to SE. Therefore, the following rela-
2010). Therefore, the present study suggests that the tionships have been proposed.
Ha5: Users with a higher level of risk perception will be less susceptible to social engineering
attacks (i.e., there will be a negative relationship).
◦ Hb5: The user’s perceived risk positively influences the user’s competence.

Competence
User competence has been considered an essential determinant of end-user capability to accomplish
tasks in many different fields. In the realm of information systems, user competence can be defined as
the individual’s knowledge of the intended technology and ability to use it effectively (Munro et al.
1997). To gain insight into user competence in detecting security threats in the context of online
social networks, investigating the multidimensional space that determines this user competence level
is fundamental (Albladi and Weir 2017). The role of user competence and its dimensions in facilitating
the detection of online threats is still a controversial topic in the information security field. The
dimensions used in the present study to measure the concept are security awareness, privacy awareness,
and self-efficacy. The scales used to measure these factors can determine the level of user competence
in evaluating risks associated with social network usage.

User competence in dealing with risky situations in a social network setting is a major predictor of
the user’s response to online threats. When individuals feel competent to control their information in
social networks, they are found to be less vulnerable to victimisation (Saridakis et al. 2016).
Furthermore, self-efficacy, which is one of the user’s competence dimensions, has been found to play a
critical role in users’ safe and protective behaviour online (Milne et al. 2009). People who have
confidence in their ability to protect themselves online as well as having high security awareness can
be perceived as highly competent users when facing cyber-attacks (Wright and Marett 2010). This study
hypothesised that highly competent users are less susceptible to SE victimisation.

(Cao and Lin 2015). Furthermore, previous email phishing victimisation is claimed to raise user
awareness and vigilance and thus prevent them from being victimised again (Workman 2007). Yet, recent
studies found this claim to be not significant (Iuga et al. 2016; Wang et al. 2017), as experience
with cybercrime could also be a determinant of people’s weakness in protecting themselves from such
threats.

Experience with cybercrime has been found to increase people’s perceived risk of social network
services (Riek et al. 2016). Those who are knowledgeable and have previous experience with online
threats could be assumed to have high risk perception (Vishwanath et al. 2016). However, unlike the
context of email phishing, little is known about the role of prior knowledge and experiences with
cybercrime in preventing people from being vulnerable to social engineering attacks in the context of
social networks. Thus, this study proposes that past experience could raise the user’s risk perception
but also could be used as a predictor of the user’s risk of being victimised again. To this extent,
the following hypotheses have been assumed.

Ha7: Users with previous experience with cybercrime will be more susceptible to social engineering
attacks (i.e., there will be a positive relationship).
◦ Hb6: The user’s experience with cybercrime positively influences the user’s perceived risk.

Socio-emotional perspective
Little is known regarding the impact that this perspective has on SE victimisation in a SN context.
However, previous research has highlighted the positive effect of people’s general trust or belief on
their victimisation in the email phishing context (Alseadoon et al. 2015), which encourages the present
study to investigate more socio-emotional factors such as the dimensions of user trust and motivation,
in order to consider their possible impact on users’ risky behaviour.
disclosing personal information among social network users (Beldad and Hegner 2017; Chang and Heo
2014). With all of this in mind, the present study hypothesised that trusting the social network
provider as well as other members may cause higher susceptibility to cyber-attacks.

Ha8: Users with a higher level of trust will be more susceptible to social engineering attacks (i.e.,
there will be a positive relationship).

Motivation
According to the uses and gratification theory, people use the communication technologies that fulfil
their needs (Joinson 2008). Users’ motivation to use communication technologies must be taken into
consideration in order to understand online user behaviour. This construct has been acknowledged by
researchers in many fields, such as marketing (Chiu et al. 2014) and mobile technology (Kim et al.
2013), in order to understand their target users. However, information security research has rarely
adopted this view toward understanding online users’ risky behaviour. Users can be motivated by
different stimuli to engage in social networks, such as entertainment or information seeking (Basak
and Calisir 2015). Additionally, people use Facebook for social reasons such as maintaining existing
relationships and making new friends (Rae and Lonborg 2015). With regard to SE victimisation, these
motivations can shed light on the user’s behaviour at times of risk. For example, hedonically
motivated users who usually seek enjoyment are assumed to be persuaded to click on links that provide
new games or apps. Socially motivated users, meanwhile, are generally looking to meet new people
online, which makes them more likely to connect with strangers. Such connections with strangers are
considered risky behaviour nowadays (Alqarni et al. 2016). Therefore, this study predicts that users’
vulnerability to social engineering-based attacks will differ based on their motives to access the
social network.

Users’ differing motivations to use social networking sites can explain their attitudes online, such
as a tendency to disclose personal information in social networks (Chang and Heo 2014). Additionally,
people’s perceived benefit of network engagement has a positive impact on their willingness to share
their photos online (Beldad and Hegner 2017). Thus, the present study assumes that motivated users are
more vulnerable to SE victimisation than others. Additionally, motivated users could be inclined to be
more trusting when using technology (Baabdullah 2018). This motivation could lead the individual to
spend more time and show higher involvement in the network (Ross et al. 2009). This involvement could
ultimately lead motivated individuals to experience or at least be familiar with different types of
cybercrime that could happen in the network. Hence, the following hypotheses have been postulated.

Ha9: Users with a higher level of motivation will be more susceptible to social engineering attacks
(i.e., there will be a positive relationship).
◦ Hb7: The user’s motivation positively influences the user’s trust.
◦ Hb8: The user’s motivation positively influences the user’s level of involvement.
◦ Hb9: The user’s motivation positively influences the user’s experience with cybercrime.

The previous sub-sections explain the nature and directions of the relationships among the constructs
in the present study. Based on these 18 proposed hypotheses, a novel conceptual model has been
developed and presented in Fig. 1. This conceptual model relies on three different perspectives which
are believed to predict user behaviour toward SE victimisation on Facebook. Developing and validating
such a holistic model gives a clear indication of the contribution of the present study.

Methods
To evaluate the hypotheses of the conceptual model, an online questionnaire was designed using the
Qualtrics online survey tool. The questionnaire incorporated three main parts, starting with questions
about participants’ demographics, followed by questions that measure the constructs of the proposed
model, and finally, a scenario-based experiment. An invitation email was sent to a number of faculty
staff in two universities, asking them to distribute the online questionnaire among their students and
staff.

Sample
Hair et al. (2017) suggested using a sophisticated guideline that relies on Cohen’s (1988)
recommendations to calculate the required sample size by using power estimates. In this case, for 9
predictors (which is the number of independent variables in the conceptual model) with an estimated
medium effect size of 0.15, the target sample size should be at least 113 to achieve a power level of
0.80 with a significance level of 0.05 (Soper 2012). In this study, 316 participants completed the
questionnaire (after the primary data screening). The descriptive analysis of participants’
demographics in Table 1 revealed a variety of profiles in terms of gender (39% male, 61% female),
education level, and education major. The
majority of participants in the study were younger adults (age 18–24), representing 76% of the total
participants. However, this was expected as the survey was undertaken in two universities where
students are considered vital members of the higher education environment.

Measurement scales
The proposed conceptual model includes five reflective factors and four second-order formative
constructs which are risk, competence, trust, and motivation. A repeated indicator approach was used
to measure the formative constructs’ values. This method recommends using the same number of items on
all the first-order factors in order to guarantee that all first-order factors have the same weight on
the second-order factors and to ensure that no weight bias exists (Ringle et al. 2012).

The scales used to measure user habits in SNs were adopted from Fogel and Nehmad (2009). To measure
the risk perception dimensions, scales were adapted from Milne et al. (2009), with some modification
and changes to fit the present study context. The scales used to measure the three dimensions of user
competence were adopted from Albladi and Weir (2017). Motivation dimension items were adopted from
previous literature (Al Omoush et al. 2012; Basak and Calisir 2015; Orchard et al. 2014; Yang and Lin
2014). The scale used to measure users’ trust was adopted with some modification from the Fogel and
Nehmad (2009) and Chiu et al. (2006) studies. Appendix 1 presents a summary of the measurement items.

A scenario-based experiment has been chosen as an empirical approach to examining users’
susceptibility to SE victimisation. In such scenario-based experiments, participants are recruited to
review a set of scripted information, which can be in the form of text or images, and are then asked
to react or respond to this predetermined information (Rungtusanatham et al. 2011). This method is
considered suitable and realistic for many social engineering studies
(e.g., Algarni et al. 2017; Iuga et al. 2016) due to the ethical concerns associated with conducting
real attacks. Our scenario-based experiment includes 6 images of Facebook posts (4 high-risk
scenarios, and 2 low-risk scenarios). Each post contains a type of cyber-attack which has been chosen
from the most prominent cyber-attacks that occur in social networks (Gao et al. 2011).

In the study model, only high-risk scenarios (which include phishing, clickjacking with an executable
file, malware, and phishing scam) have been considered to measure user susceptibility to SE attacks.
However, comparing individuals’ response to the high-risk attacks and their response to the low-risk
attacks aims to examine if users rely on their characteristics when judging the different scenarios
and not on other influencing factors such as visual message triggers (Wang et al. 2012). Participants
were asked to indicate their response to these Facebook posts, as if they had encountered them in
their real accounts, by rating a number of statements such as “I would click on this button to read
the file” using a 5-point Likert scale from 1 “strongly disagree” to 5 “strongly agree”. Appendix 2
includes a summary of the scenarios used in this study.

Table 1 Participants’ demographics
Demographic                         Frequency   Percent   Cumulative Percent
Gender
  Male                                    123      38.9                 38.9
  Female                                  193      61.1                100.0
  Total                                   316     100.0
Age
  18–24                                   240      75.9                 75.9
  25–34                                    57      18.0                 94.0
  35–44                                    14       4.4                 98.4
  45–55                                     5       1.6                100.0
  Total                                   316     100.0
Education Level
  High school                             187      59.2                 59.2
  Bachelor’s degree                       112      35.4                 94.6
  Master’s degree                          14       4.4                 99.1
  Other, please specify                     3       0.9                100.0
  Total                                   316     100.0
Major
  Computer Science/IT                     124      39.2                 39.2
  Engineering                              32      10.1                 49.4
  Business/Administrative Sciences         38      12.0                 61.4
  Medical Sciences                          5       1.6                 63.0
  Science                                  15       4.7                 67.7
  Humanities and Arts                       6       1.9                 69.6
  Other, please specify                    96      30.4                100.0
  Total                                   316     100.0

Analysis approach
To evaluate the proposed model, partial least squares structural equation modelling (PLS-SEM) has been
used due to its suitability in dealing with complex predictive models that consist of a combination of
formative and reflective constructs (Götz et al. 2010), even with some limitations regarding data
normality and sample size (Hair et al. 2012). The SmartPLS v3 software package (Ringle et al. 2015) was
used to analyse the model and its associated hypotheses.

To evaluate the study model, three different procedures have been conducted. First, the PLS algorithm
was used to provide standard model estimations such as the path coefficients, the coefficient of
determination (R² values), effect size, and collinearity statistics. Secondly, a bootstrapping
approach was used to test the significance of the structural model relationships. In this approach,
the collected data sample is treated as the population, and the algorithm samples from it with
replacement to generate a large number of random bootstrap samples (5000 is the recommended setting),
each with the same number of cases as the original sample (Henseler et al. 2009). The present study
conducted the bootstrapping procedure with 5000 bootstrap samples, two-tailed testing, and an assumed
significance level of 5%.

Finally, a blindfolding procedure was also used to evaluate the predictive relevance (Q²) of the
structural model. In this approach, some of the data points are omitted and considered missing from
the constructs’ indicators, and the parameters are estimated using the remaining data points (Hair et
al. 2017). These estimations are then used to predict the missing data points, which are later
compared with the real omitted data to measure the Q² value. Blindfolding is considered a sample reuse
approach which is only applied to endogenous constructs (Henseler et al. 2009). Endogenous constructs
are the variables that are affected by other variables in the study model (Götz et al. 2010), such as
user susceptibility, involvement, and trust.

Results
The part of the conceptual model that includes the relations between the measurement items and their
associated factors is called the measurement model, while the hypothesised relationships among the
different factors are called the structural model (Tabachnick and Fidel 2013). The present study’s
measurement model, which includes all the constructs along with their indicators’ outer loadings, can
be found in Appendix 3. The result of the measurement model analysis in Table 2 reveals that the
Cronbach’s alpha and the composite reliability were acceptable for all constructs as they were above
the threshold of 0.70. Additionally, since the average variance extracted (AVE) for all constructs was
above the threshold of 0.5 (Hair et al. 2017), the convergent validity of the model’s reflective
constructs was confirmed.

However, in order to assess the model’s predictive ability and to examine the significance of
relationships between the model’s constructs, the structural model should be tested. The assessment of
the structural model involves the following testing steps.

Assessing collinearity
This step is vital to determine if there are any collinearity issues among the predictors of each
endogenous construct. Failing to do so could lead to a biased path coefficient estimation if a
critical collinearity issue exists among the construct predictors (Hair et al. 2017). Table 3 presents
all the endogenous constructs (represented by the columns) which indicate that the variance inflation
factor (VIF) values for all predictors of each endogenous construct (represented by the rows) are
below the threshold of 5. Thus, no collinearity issues exist in the structural model.

Assessing path coefficients (hypotheses testing)
The path coefficient was calculated using the bootstrap re-sampling procedure (Hair et al. 2017). This
procedure provides estimates of the direct impact that each construct has on user susceptibility to
cyber-attack. The result of the direct effect test in Table 4 shows that trust (t = 5.202, p < 0.01)
is the strongest predictor of the user’s susceptibility to SE victimisation, followed by user’s
involvement (t = 5.002, p < 0.01), cybercrime experience (t = 3.736, p < 0.01), social network
experience (t = −3.015, p < 0.01), and percentage of known friends among Facebook connections
(t = −2.735, p < 0.01). The direct effects of user competence to deal with threats (t = −2.474,
p < 0.05) and the number of connections (t = −2.428, p < 0.05)
were relatively small, yet still statistically significant in explaining the target variable. However,
the impact of the number of connections on users’ susceptibility was negative, which contradicts
hypothesis Ha2, which posits that this relationship is positive.

Most importantly, the result indicated that perceived risk and motivation have no direct effect on
user’s vulnerability (p > 0.05). This could be caused by the fact that both factors are second-order
formative variables, while their first-order factors have effects in different directions on user’s
susceptibility. As can be seen from the result of the regression analysis in Table 5, perceived risk
is the second-order factor of two dimensions: perceived severity of threat, which has a significant
negative effect on the user’s susceptibility, and perceived likelihood of threat, which has a positive
impact on the user’s susceptibility. Their joint effect is therefore, logically, not significant,
because the opposite effects of the two dimensions of perceived risk cancel each other out. Thus, Ha5
could be considered as partially supported.

The situation with motivation is similar as it is also a second-order formative factor and its
first-order factors (hedonic and social) have opposite effects on users’ susceptibility. Table 5
presents the result of the regression analysis of first-order factors for the motivation construct.
The result provides evidence that hedonic motivation is negatively related to the user’s
susceptibility while social motivation is positively associated with user’s susceptibility. However,
when the two dimensions of motivation were aggregated to create one index to measure the total effect
of user’s motivation (both direct and indirect), as illustrated in Table 6, motivation proved to be a
significant predictor of users’ susceptibility (t = 3.854, p < 0.01). Thus, the direct effect of
motivation on user susceptibility is statistically rejected, while the total effect of motivation on
users’ susceptibility is statistically significant and considered one of the strongest predictors in
the study model.

Table 5 Regression analysis of perceived risk and motivation dimensions
Factors          Dimensions   Std. Beta        t    Sig.
Perceived Risk   Severity         −.146   −2.446    .015
                 Likelihood        .117    1.958    .051
Motivation       Hedonic          −.080   −1.423    .156
                 Social            .319    5.680    .000
Dependent Variable: Susceptibility

Evaluating the total effect of a particular construct on user susceptibility is considered useful,
especially if the goal of the study is to explore the impact of the relationships between different
drivers to predict one latent construct (Hair et al. 2017). The total impact includes both the
construct’s direct effect and indirect effects through mediating constructs in the model. The total
effect analysis in Table 6 revealed that most of the constructs have a significant overall impact on
user susceptibility (p < 0.05). Although the number of connections has been proven to have a
significant negative direct effect on user susceptibility, its total effect when considering all the
direct and indirect relationships seems to be very low and not significant (t = −0.837, p > 0.05).
Furthermore, both the direct and total effects of perceived risk were found not to be significant
(t = −1.559, p > 0.05).

The rest of the hypotheses (group b) aim to examine the relationships between the independent
constructs of the study model, which will be tested according to estimates of the path coefficient
between the related constructs. Table 7 shows that all nine hypotheses are statistically significant
(p < 0.05). This also shows that the most substantial relationship was between social network
experience and the percentage of known friends among Facebook connections (t = 6.091, p < 0.01),
followed by the favourable impact that motivation and level of involvement have on increasing users’
trust (with t-value = 4.821, and t-value = 3.914, respectively).
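As a rough illustration of how bootstrap t-values of the kind quoted in this section are obtained, the following Python sketch resamples cases with replacement, re-estimates a coefficient on each resample, and divides the original estimate by the standard deviation of the bootstrap estimates. It is only a sketch under stated assumptions and not the SmartPLS procedure used in the study: an ordinary least-squares slope for a single predictor stands in for a PLS path coefficient, and the synthetic data, the seeds, and the function names (`ols_slope`, `bootstrap_t`) are hypothetical.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (stand-in for a path coefficient)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def bootstrap_t(xs, ys, n_boot=5000, seed=0):
    """Return (estimate, bootstrap standard error, t-value) for the slope."""
    rng = random.Random(seed)
    n = len(xs)
    estimate = ols_slope(xs, ys)
    boot_estimates = []
    while len(boot_estimates) < n_boot:
        # Draw n case indices with replacement: same size as the original sample.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample, slope undefined; draw again
        by = [ys[i] for i in idx]
        boot_estimates.append(ols_slope(bx, by))
    std_error = statistics.stdev(boot_estimates)
    return estimate, std_error, estimate / std_error


# Hypothetical synthetic data: 100 cases with a true slope of 1.
data_rng = random.Random(42)
x = [data_rng.gauss(0, 1) for _ in range(100)]
y = [xi + data_rng.gauss(0, 0.5) for xi in x]

estimate, std_error, t_value = bootstrap_t(x, y)
# Two-tailed test at the 5% level: |t| > 1.96 is treated as significant.
print(f"estimate={estimate:.3f} se={std_error:.3f} t={t_value:.2f} "
      f"significant={abs(t_value) > 1.96}")
```

The 5000 resamples and the two-tailed 5% criterion mirror the settings reported in the analysis approach; everything else is illustrative.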
Furthermore, motivation (t = 3.640, p < 0.01) and the number of connections (t = 3.106, p < 0.01) are
two factors found to increase users’ level of involvement in the network. Level of involvement also
plays a notable role in raising people’s previous experience with cybercrime (t = 2.532, p < 0.05),
while past cybercrime experience significantly increases people’s perceived risk associated with using
Facebook (t = 2.968, p < 0.01). Nevertheless, the contribution of perceived risk in raising user
competence level to deal with online threats was not very strong, although considered statistically
significant (t = 2.241, p < 0.05).

Finally, there was no significant difference with regard to the user characteristics that affect
people’s susceptibility or resistance to the high-risk scenarios and low-risk scenarios. This means
that participants rely on their perceptions and experience to judge those scenarios.

The coefficient of determination - R²
The coefficient of determination is a traditional criterion that is used to evaluate the structural
model’s predictive power. In this study, this coefficient measure will represent the joint effect of
all the model variables in explaining the variance in people’s susceptibility to SE attacks. According
to Hair et al. (2017), the acceptable R² value is hard to determine as it might vary depending on the
study discipline and the model complexity. Cohen (1988) has suggested a rule of thumb to assess the R²
values for models with several independent variables: 0.26, 0.13, and 0.02 are considered substantial,
moderate, and weak respectively. Table 8 illustrates the coefficient of determination for the
endogenous variables in the study model. The R² values indicate that the nine predictor variables
together have substantial predictive power and explain 33.5% of the variation in users’ susceptibility
to SE attacks. Furthermore, the combined effect of users’ involvement and motivation on users’ trust
is considered moderate as it explains 13.2% of the variation in users’ trust.

Predictive relevance Q²
To measure the model’s predictive capabilities, a blindfolding procedure has been used to obtain the
model’s predictive relevance (Q² value). Stone-Geisser’s Q² value, which is a measure to assess how
well a model predicts the data of omitted cases, should be higher than zero in order to indicate that
the path model has a cross-validated predictive relevance (Hair et al. 2017). Table 8 presents results
of the predictive relevance test and shows that all of the endogenous constructs in the research model
have predictive relevance greater than zero, which means that the model has appropriate predictive
ability.
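The two criteria above can be sketched in a few lines of Python. The block below is a simplified illustration, not the SmartPLS computation: a one-predictor linear regression stands in for the structural model, the blindfolding loop omits every d-th case rather than indicator data points of PLS constructs, and the omission distance of 7, the toy data, and all function names are assumptions introduced here. Q² is computed as 1 − SSE/SSO, where SSE sums the squared errors of the model’s predictions for the omitted points and SSO sums the squared errors of a mean-only prediction.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(observed, predicted):
    """Share of the variance in `observed` explained by `predicted`."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - sse / sst


def classify_r2(r2):
    """Cohen's (1988) rule of thumb as quoted in the text."""
    if r2 >= 0.26:
        return "substantial"
    if r2 >= 0.13:
        return "moderate"
    if r2 >= 0.02:
        return "weak"
    return "negligible"


def blindfold_q2(xs, ys, omission_distance=7):
    """Simplified blindfolding: omit every d-th case, predict it from the rest."""
    n = len(xs)
    sse = 0.0  # squared errors of the model's predictions for omitted cases
    sso = 0.0  # squared errors of the mean-only prediction for omitted cases
    for start in range(omission_distance):
        omitted = set(range(start, n, omission_distance))
        kept_x = [x for i, x in enumerate(xs) if i not in omitted]
        kept_y = [y for i, y in enumerate(ys) if i not in omitted]
        intercept, slope = fit_line(kept_x, kept_y)
        mean_kept = sum(kept_y) / len(kept_y)
        for i in omitted:
            sse += (ys[i] - (intercept + slope * xs[i])) ** 2
            sso += (ys[i] - mean_kept) ** 2
    return 1 - sse / sso


# Hypothetical toy data: a strong linear relationship with small alternating noise.
xs = [float(i) for i in range(1, 22)]
ys = [2 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]

intercept, slope = fit_line(xs, ys)
r2 = r_squared(ys, [intercept + slope * x for x in xs])
q2 = blindfold_q2(xs, ys)
print(classify_r2(r2), q2 > 0)
print(classify_r2(0.335), classify_r2(0.132))  # the two R² values reported above
```

Applied to the reported values, the rule of thumb labels 0.335 as substantial and 0.132 as moderate, matching the interpretation given in the text; a Q² above zero is read as evidence of predictive relevance.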
had been assumed to be positive in order to concur with previous claims that large network size makes
individuals more vulnerable to SN risks (Buglass et al. 2016; Vishwanath 2015). Facebook users seem to
accept friend requests from strangers to expand their friendship network. Around 48% of the
participants in this study stated that they know less than 10% of their Facebook network personally.
Connecting with strangers on the network has previously been seen as the first step in falling prey to
social engineering attacks (Vishwanath 2015), while also being regarded as a measure of risky
behaviour on social networks (Alqarni et al. 2016). A high percentage of strangers with whom the user
is connected can be seen as a determinant of the user’s low level of suspicion.

Furthermore, social network experience has been found to significantly predict people’s susceptibility
to social engineering in the present study. People’s ability to detect social network deception has
been said to depend on information communication technology literacy (Tsikerdekis and Zeadally 2014).
Thus, experienced users are more familiar with cyber-attacks such as phishing and clickjacking, and
easily detect them. This is further supported by Algarni et al. (2017), who pointed out that the less
time that has elapsed since the user joined Facebook, the more susceptible he or she is to social
engineering. Yet, their research treated user experience with social networks as a demographic
variable and did not examine whether this factor might affect other aspects of user behaviour. For
instance, results from the present study reveal that users who are considered more experienced in
social networks have fewer connections with strangers (t = 6.091, p < 0.01), which further explains
why they are less susceptible than novice users.

Perception of risk has no direct influence on people’s vulnerability, but the present study found
perceived risk to significantly increase people’s level of competence to deal with social engineering
attacks. This also accords with the Van Schaik et al. (2018) study, which found that Facebook users
with high risk perception adopt precautionary behaviours such as restrictive privacy and
security-related settings. Most importantly, perceived cybercrime risk has also been indicated as
influencing people to take precautions and avoid using online social networks (Riek et al. 2016).

Measuring user competence levels would contribute to our understanding of the reasons behind user
weakness in detecting online security or privacy threats. In the present study, the measure of an
individual’s competence level in dealing with cybercrime was based upon three dimensions: security
awareness, privacy awareness, and self-efficacy. The empirical results show that this competence
measure can significantly predict the individual’s ability to detect SE attacks on Facebook.
Individuals’ perception of their self-ability to control the content shared on social network websites
has been previously considered a predictor of their ability to detect social network threats
(Saridakis et al. 2016), as individuals who have this confidence in their self-ability as well as in
their security knowledge seem to be competent in dealing with cyber threats (Flores et al. 2015;
Wright and Marett 2010).

Furthermore, our results accord with the finding of Riek et al. (2016) that previous cybercrime
experience has a positive and substantial impact on users’ perceived risk. Yet, this high risk
perception did not decrease users’ vulnerability in the present study. This could be because
experience and knowledge of the existence of threats are not necessarily reflected in people’s
behaviour. For example, individuals who had previously undertaken security awareness training still
underestimated the importance of some security practices, such as frequent change of passwords (Kim
2013).

The present research found that people’s trust in the social network’s provider and members was the
strongest determinant of their vulnerability to social engineering attacks (t = 5.202, p < 0.01).
Previous email phishing research (e.g., Alseadoon et al. 2015; Workman 2008) has also stressed that
people’s disposition to trust has a significant impact on their weakness in detecting phishing emails.
Yet, little was known about the impact of trust in providers and other members of social networks on
people’s vulnerability to cyber-attacks. These two types of
Albladi and Weir Cybersecurity (2020) 3:7 Page 13 of 19
trust have been found to decrease users’ perception of the risks associated with disclosing private information on SNs (Cheung et al. 2015). Similarly, trusting social network providers to protect members’ private information has caused Facebook users (especially females) to be more willing to share their photos in the network (Beldad and Hegner 2017). These findings draw attention to the huge responsibility that social network providers have to protect their users. In parallel, users should be encouraged to be cautious about their privacy and security.
People’s motivation to use social networks has no direct influence on their vulnerability to SE victimisation, as evidenced by the results of this study. Yet, this motivation significantly affects different essential aspects of user behaviour and perception such as user involvement, trust, and previous experience with cybercrime, which in turn substantially predict user vulnerability. This result accords with the claim that people’s motivation for using SNs increases their disclosure of private information (Beldad and Hegner 2017; Chang and Heo 2014).
Theoretical and practical implications
Most of the proposed measures to mitigate SE threats in the literature (e.g. Fu et al. 2018; Gupta et al. 2018) are focused on technical solutions. Despite the importance and effectiveness of these proposed technical solutions, social engineers try to exploit human vulnerabilities; hence we require solutions that understand and guard against human weaknesses. Given the limited number of studies that investigate the impact of human characteristics on predicting vulnerability to social network security threats, the present study can be considered useful, having critical practical implications that should be acknowledged in this section.
The developed conceptual model shows an acceptable prediction ability of people’s vulnerability to social engineering in social networks as revealed by the results of this study. The proposed model could be used by information security researchers (or researchers from different fields) to predict responses to different security-oriented risks. For instance, decision-making research could benefit from the proposed framework and model as they indicate new perspectives on user-related characteristics that could affect decision-making abilities in times of risk.
Protecting users’ personal information is an essential element in promoting sustainable use of social networks (Kayes and Iamnitchi 2017). SN providers should provide better privacy rules and policies and develop more effective security and privacy settings. A live chat threat report must be an essential part of SN channels in order to reduce the number of potential victims of specific threatening posts or accounts. Providing security and privacy-related tools could also help increase users’ satisfaction with social networks.
Despite the importance of online awareness campaigns as well as the rich training programs that organisations adopt, problems persist because humans are still the weakest link (Aldawood and Skinner 2018). Changing beliefs and behaviour is a complex procedure that needs more research. However, the present study offers clear insight into specific individual characteristics that make people more vulnerable to cybercrimes. Using these characteristics to design training programs is a sensible approach to the tuning of security awareness messages. Similarly, our results will be helpful in conducting more successful training programs that incorporate the identified essential attributes from the proposed perspectives, as educational elements to increase people’s awareness. While these identified factors might reflect a user’s weak points, the factors could also be targeted by enforcing behavioural security strategies in order to mitigate social engineering threats.
The developed conceptual model could be used in the assessment process for an organisation’s employees, especially those working in sensitive positions. Also, the model and associated scales could be of help in employment evaluation tests, particularly in security-critical institutions, since the proposed model may predict those weak aspects of an individual that could increase his/her vulnerability to social engineering.
A semi-automated security advisory system
One practical use of the proposed prediction model can be demonstrated by integrating it in a semi-automated advisory system (Fig. 2). Based on the idea of user profiling, this research has established a practical solution which can semi-automatically predict users’ vulnerability to various types of social engineering attacks. The designed semi-automated advisory system could be used as an approach with which to classify social network users according to their vulnerability type and level after completing an assessment survey. The local administrator can determine the threshold and the priority for each type of attack based on their knowledge. Then, the network provider could send awareness posts to each segment that target the particular group’s needs. Assessing social network users and segmenting them based on their behaviour and vulnerabilities is essential in order to design relevant advice that meets users’ needs. Yet, since social engineering techniques are rapidly changing and improving, the attack scenarios that are used in the assessment step could be updated from time to time. The registered users in the semi-automated advisory system also need to be reassessed regularly in order to observe any changes in their vulnerability.
Significant outcomes were noted with practical implications for how social network users could be assessed and segmented based on their characteristics, behaviour, and vulnerabilities, in turn facilitating their protection from such threats by targeting them with relevant advice
and education that meets users’ needs. This system is considered cost and time effective, as integrating individuals’ needs with the administrator’s knowledge of existing threats could avoid the overhead and inconvenience of sending blanket advice to all users.
Conclusion
The study develops a conceptual model to test the factors that influence social network users’ judgement of social engineering-based attacks in order to identify the weakest points of users’ detection behaviour, which also helps to predict vulnerable individuals. Proposing such a novel conceptual model helped in bridging the gap between theory and practice by providing a better understanding of how to predict vulnerable users. The findings of this research indicate that most of the considered user characteristics influence users’ vulnerability either directly or indirectly. This research also contributes to the existing knowledge of social engineering in social networks, particularly augmenting the research area of predicting user behaviour toward security threats by proposing a new influencing perspective, the socio-emotional, which has not been satisfactorily reported in the literature before, as a dimension affecting user vulnerability. This new perspective could also be incorporated to investigate user behaviour in several other contexts.
Using a scenario-based experiment instead of conducting a real attack study is one of the main limitations of the present study but was considered unavoidable due to ethical considerations. However, the selected attack scenarios were designed carefully to match recent and real social engineering-based attacks on Facebook. Additionally, the present study was undertaken in full consciousness of the fact that when measuring people’s previous experience with cybercrime, some participants might be unaware of their previous victimisation and so might respond inaccurately. In order to mitigate this limitation, different types of SE attacks have been considered in the scale that measures previous experience with cybercrime, such as phishing, identity theft, harassment, and fraud.
Furthermore, this research has focused only on academic communities as all the participants in this study were students, academic, and administrative staff of two universities. This could be seen as a limitation as the result may not reflect the behaviour of the general public. The university context is important, however, and cyber-criminals have targeted universities recently due to their importance in providing online resources to their students and community (Öğütçü et al. 2016). Additionally, while several steps have been taken to ensure the inclusion of all influential factors in the model, it is not feasible to guarantee that all possibly influencing attributes are included in this study. Further efforts are needed in this sphere, as predicting human behaviour is a complex task.
The conceptual study model could be used to test user vulnerability to different types of privacy or security hazards associated with the use of social networks: for instance, by measuring users’ response to the risk related to loose privacy restrictions, or to sharing private information on the network. Furthermore, investigating whether social network users have different levels of vulnerability to privacy and security associated risks is another area of potential future research. The proposed model’s prediction efficiency could be compared to different types of security and privacy threats. This comparison would offer a reasonable future direction for researchers to consider. Future research could focus more on improving the proposed model by giving perceived trust greater attention, as this factor was the highest behaviour predictor in the present model. The novel conceptualisation of users’ competence in the conceptual model has proved to have a profound influence on their behaviour toward social engineering victimisation, a finding which can offer additional new insight for future investigations.
Appendix 1
Table 11 Measurement items
(Columns: Construct, Dimensions, Questions, Measurement items)

Construct: Perceived Risk
  Dimension: Severity of threat
    Question: Please choose the best answer in each statement that indicates the extent to which a statement is true for you: (from Strongly agree to Strongly disagree)
    Measurement items:
    • I believe that losing my data privacy while using Facebook would be a severe problem for me (ST2)
    • I believe that having my messages and chats being seen or listened to in Facebook would be a severe problem for me (ST3)
    • I believe that losing my financial information while using Facebook would be harmful for me (ST4)
  Dimension: Likelihood of threat
    Question: Answer the following questions according to your beliefs, attitudes, and experiences: (from Extremely Likely to Extremely Unlikely)
    Measurement items:
    • How likely is it for your financial information to be stolen in Facebook? (LT1)
    • How likely is it that your identity can be stolen in Facebook? (LT2)
    • How likely is it for your privacy to be invaded without your knowledge while using Facebook? (LT3)

Construct: Competence
  Dimension: Security
    Question: Please choose the best answer in each statement that indicates the extent to which a statement is true for you: (from Strongly agree to Strongly disagree)
    Measurement items:
    • I use password for my Facebook account different from the passwords I use to access other sites (SA2)
    • I use a specific new email for my Facebook account different from my personal or work email (SA4)
  Dimension: Privacy
    Question: Please choose the best answer in each statement that indicates the extent to which a statement is true for you: (from Strongly agree to Strongly disagree)
    Measurement items:
    • I don’t share personal information on Facebook such as birthdate, phone number, workplace or address (PA3)
    • I don’t share my current or future location on Facebook for example, images for my current vacation, or plans for future vacation (PA4)
  Dimension: Self-efficacy
    Question: Please choose the best answer in each statement that indicates the extent to which a statement is true for you: (from Strongly agree to Strongly disagree)
    Measurement items:
    • I have the knowledge and the ability to secure my Facebook account by adjusting the account settings (SEF3)
    • I have the ability to protect myself from any online threats while using Facebook (SEF4)

Construct: Past experience with cybercrime
    Question: How often have you experienced or been a victim of the following incidents? (Rate each statement from always to never)
    Measurement items:
    • Identity theft (somebody stealing your personal data and impersonating you, e.g. Open SN account with your name, or Shopping under your name) (PE1_It)
    • Phishing (Received emails fraudulently asking for money or personal details, including banking or payment information) (PE2_Ph)
    • Online fraud where goods purchased were not delivered, counterfeit or not as advertised (PE3_OF)
    • Harassment, cyber-bullying (Received Harassing messages, inappropriate comments, or other persistent behaviours that endangers your safety) (PE4_Har)

Construct: Trust
  Dimension: Trust Provider
    Question: Please choose the best answer that indicates how much you agree with the following statements: (from Strongly agree to Strongly disagree)
    Measurement items:
    • Facebook is a trustworthy social network (TP1)
    • I can count on Facebook to protect my privacy (TP2)
    • I can count on Facebook to protect my personal information from unauthorized use (TP3)
  Dimension: Trust Members
    Question: Please choose the best answer that indicates how much you agree with the following statements: (from Strongly agree to Strongly disagree)
    Measurement items:
    • Facebook Members will not take advantage of others even when the opportunity arises (TM1)
    • Facebook Members are truthful in dealing with one another (TM3)
    • Facebook Members will always keep the promises they make to one another (TM4)

Construct: Motivation
  Dimension: Hedonic
    Question: What are your main reasons of using social networks? (Rate each statement from Strongly agree to Strongly disagree)
    Measurement items:
    • To pass the time (HM1)
    • Using social networks are enjoyable and entertaining (HM2)
  Dimension: Social
    Question: What are your main reasons of using social networks? (Rate each statement from Strongly agree to Strongly disagree)
    Measurement items:
    • To keep in touch with friends and family (SM1)
    • To maintain my popularity and prestige among peers (SM3)
Appendix 2
Table 12 A Summary of the social engineering scenarios
(Columns: Type of Trick, Message, Risk-level)

1. Type of trick: Phishing – requesting sensitive information such as the user’s email and real name in order to win an iPhone 7 or £100 voucher.
   Message: Winner picked tonight
            Like = free iphone7
            Comment = £100 voucher
            To contact you if you win,
            Enter your email and name here http://bit.ly/2gno8tj
   Risk-level: High

2. Type of trick: Clickjacking with an executable file- a post about a shocking and a very important document that is shown in the post as a pdf file with the mouse pointer positioned on the link and the actual URL in the status bar indicates that the document is an executable file.
   Message: I don’t want to believe. I just read this document. You must read it. it is very important for all public. Please someone tell me that is a lie.
   Risk-level: High

3. Type of trick: Clickjacking- a post that includes a video that direct the user to an ambiguous link. However, this type of link is a low-risk since the link could be either a malicious link or a safe link; it is not clear and not safe to risk and clicks in such links.
   Message: Video: The most shocking viedo you will every watch!!
   Risk-level: Low

4. Type of trick: Malware- offering an application that allows users to call and message their friends free of charge if they ignore the warning message and give permission to the application to access their profile and contact information.
   Message: Download this app. It’s works perfect for calling out or messaging. All you need is Wi-Fi.
   Risk-level: High

5. Type of trick: Phishing scam- a threatening message pretended to be from Facebook support team asking the user to re-confirm his/her account or blocking the account. The link in the message is the original Facebook site, but the actual URL displayed in the status bar is http://cut.uk/Facebookconfirm-login, which is apparently a phishing site.
   Message: Your account is at risk!
            Please re-confirm your account to avoid plocking, if you are the original owner of this account.
            Please re-confirm you account by following this link here: https://www.facebook.com/xsrn
            if you don’t confirm our system will automatically block your account and will not be able to use it again.
   Risk-level: High

6. Type of trick: Click on a safe link- YouTube video that shows recent news, the link appears in the bottom status bar shows a YouTube short link. Such short URLs could be either malicious links or safe links.
   Message: OMG..Tsunami hitting again
   Risk-level: Low
Appendix 3
Acknowledgements
We are sincerely grateful to the many individuals who voluntarily participated in this research.
Authors’ contributions
SMA conducted the study, analysed the collected data, and drafted the manuscript. GRSW participated in drafting the manuscript. Both authors read and approved the final manuscript.
Funding
This work is supported by the University of Jeddah, Kingdom of Saudi Arabia as part of the first author’s research conducted at the University of Strathclyde in Glasgow, UK.
Availability of data and materials
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Mahuteau S, Zhu R (2016) Crime victimisation and subjective well-being: panel evidence from Australia. Health Econ 25(11):1448–1463. https://doi.org/10.1002/hec.3230
Milne GR, Labrecque LI, Cromer C (2009) Toward an understanding of the online consumer’s risky behavior and protection practices. J Consum Aff 43(3):449–473. https://doi.org/10.1111/j.1745-6606.2009.01148.x
Mitnick KD, Simon WL (2003) The art of deception: controlling the human element in security. Wiley. https://books.google.com.sa/books?hl=ar&lr=&id=rmvDDwAAQBAJ&oi=fnd&pg=PR7&dq=Mitnick+KD,+Simon+WL+(2003)+The+art+of+deception:+controlling+the+human+1217+element+in+security.+Wiley&ots=_eyXWB11Wd&sig=9QEMsNUp8X2oiGmAnh7S800L160&redir_esc=y#v=onepage&q&f=false.
Munro MC, Huff SL, Marcolin BL, Compeau DR (1997) Understanding and measuring user competence. Inf Manag 33(1):45–57. https://doi.org/10.1016/S0378-7206(97)00035-9
Öğütçü G, Testik ÖM, Chouseinoglou O (2016) Analysis of personal information security behavior and awareness. Comput Secur 56:83–93. https://doi.org/10.1016/j.cose.2015.10.002
Orchard LJ, Fullwood C, Galbraith N, Morris N (2014) Individual differences as predictors of social networking. J Comput-Mediat Commun 19(3):388–402. https://doi.org/10.1111/jcc4.12068
Proofpoint (2018) The human factor 2018 report. Retrieved from https://www.proofpoint.com/sites/default/files/pfpt-us-wp-human-factor-report-2018-180425.pdf
Rae JR, Lonborg SD (2015) Do motivations for using Facebook moderate the association between Facebook use and psychological well-being? Front Psychol 6:771. https://doi.org/10.3389/fpsyg.2015.00771
Riek M, Bohme R, Moore T (2016) Measuring the influence of perceived cybercrime risk on online service avoidance. IEEE Trans Dependable Secure Comput 13(2):261–273. https://doi.org/10.1109/TDSC.2015.2410795
Ringle CM, Sarstedt M, Straub D (2012) A critical look at the use of PLS-SEM in MIS quarterly. MIS Q 36(1) Retrieved from https://ssrn.com/abstract=2176426
Ringle CM, Wende S, Becker J-M (2015) SmartPLS 3. SmartPLS, Bönningstedt Retrieved from http://www.smartpls.com
Ross C, Orr ES, Sisic M, Arseneault JM, Simmering MG, Orr RR (2009) Personality and motivations associated with Facebook use. Comput Hum Behav 25(2):578–586. https://doi.org/10.1016/j.chb.2008.12.024
Rungtusanatham M, Wallin C, Eckerd S (2011) The vignette in a scenario-based role-playing experiment. J Supply Chain Manag 47(3):9–16. https://doi.org/10.1111/j.1745-493X.2011.03232.x
Saridakis G, Benson V, Ezingeard J-N, Tennakoon H (2016) Individual information security, user behaviour and cyber victimisation: an empirical study of social networking users. Technol Forecast Soc Chang 102:320–330. https://doi.org/10.1016/j.techfore.2015.08.012
Sheng S, Holbrook M, Kumaraguru P, Cranor LF, Downs J (2010) Who falls for phish? In: Proceedings of the 28th international conference on human factors in computing systems - CHI ‘10. ACM Press, New York, pp 373–382. https://doi.org/10.1145/1753326.1753383
Sherchan W, Nepal S, Paris C (2013) A survey of trust in social networks. ACM Comput Surv 45(4):1–33. https://doi.org/10.1145/2501654.2501661
Soper D (2012) A-priori sample size calculator. Retrieved from https://www.danielsoper.com/statcalc/calculator.aspx?id=1
Tabachnick BG, Fidel LS (2013) Using multivariate statistics, 6th edn. Pearson, Boston
Tsikerdekis M, Zeadally S (2014) Online deception in social media. Commun ACM 57(9):72–80. https://doi.org/10.1145/2629612
Van Schaik P, Jansen J, Onibokun J, Camp J, Kusev P (2018) Security and privacy in online social networking: risk perceptions and precautionary behaviour. Comput Hum Behav 78:283–297. https://doi.org/10.1016/j.chb.2017.10.007
Vishwanath A (2015) Habitual Facebook use and its impact on getting deceived on social media. J Comput-Mediat Commun 20(1):83–98. https://doi.org/10.1111/jcc4.12100
Vishwanath A, Harrison B, Ng YJ (2016) Suspicion, cognition, and automaticity model of phishing susceptibility. Commun Res. https://doi.org/10.1177/0093650215627483
Wang J, Herath T, Chen R, Vishwanath A, Rao HR (2012) Research article phishing susceptibility: an investigation into the processing of a targeted spear phishing email. IEEE Trans Prof Commun 55(4):345–362. https://doi.org/10.1109/TPC.2012.2208392
Wang J, Li Y, Rao HR (2017) Coping responses in phishing detection: an investigation of antecedents and consequences. Inf Syst Res 28(2):378–396. https://doi.org/10.1287/isre.2016.0680
Workman M (2007) Gaining access with social engineering: an empirical study of the threat. Inf Syst Secur 16(6):315–331. https://doi.org/10.1080/10658980701788165
Workman M (2008) A test of interventions for security threats from social engineering. Inf Manag Comput Secur 16(5):463–483. https://doi.org/10.1108/09685220810920549
Wright RT, Marett K (2010) The influence of experiential and dispositional factors in phishing: an empirical investigation of the deceived. J Manag Inf Syst 27(1):273–303. https://doi.org/10.2753/MIS0742-1222270111
Yang H-L, Lin C-L (2014) Why do people stick to Facebook web site? A value theory-based view. Inf Technol People 27(1):21–37. https://doi.org/10.1108/ITP-11-2012-0130
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Education Tech Research Dev
https://doi.org/10.1007/s11423-020-09843-9
RESEARCH ARTICLE
Abstract
In the literature on self-regulated learning and social networking, studies that explore the
impact of social networks on learning in terms of connection sizes and relationship-establishing
factors are rarely set in the context of social networking among strangers. This descriptive
study addresses the gap by exploring data from 468 Chinese junior high school graduates on an
online learning resource platform with an integrated social network. The data were digitally
generated when the graduates engaged in online self-regulated learning activities for an
average of 36 days without any facilitation. The data analysis
explores the connection sizes and types of follow links, types of self-regulated learners,
and their relationship with lesson completion. The study reveals that social networks trig-
ger different levels of learning engagement. Specifically, the graduates with bidirectional
follow links and the optimal connection size of five complete more lessons than other
graduates. The study also finds that academic factors (similar learning goals and achieve-
ment gaps) are more important than social factors (common identity) in establishing social
connections to support self-regulated learning activities. These findings have direct impli-
cations for the design of social networking that facilitates self-regulated learning, and
enhances students’ self-regulated learning efficacy in online learning environments.
* Xiaohua Yu
[email protected]
Charles Xiaoxue Wang
[email protected]
J. Michael Spector
[email protected]
1 Faculty of Education, Department of Education Information Technology, East China Normal University, 3663 North Zhongshan Road, Shanghai, China
2 College of Education, Florida Gulf Coast University, 10501 FGCU Boulevard South, Fort Myers, Florida 33965‑6565, USA
3 Department of Learning Technologies, College of Information, University of North Texas, 3940 N. Elm St., Suite G 150, Denton, TX 76207, USA
X. Yu et al.
Introduction
In this digital age, the explosion of information and advancement of technology require
people to learn continuously and perform effectively as active learners. Self-regulated
learning is regarded as a good method to stay informed and current with an ever-chang-
ing world. Self-regulated learning refers to personal control and self-regulation over one’s
own learning activities, such as planning for a learning task and monitoring actions to achieve
objectives. However, feeling isolated because of a lack of social interactions has always
been a formidable barrier for online learning (Muilenburg and Berge 2005). To address
this barrier and facilitate online learning, social networking has been widely adopted as
a support to educational communications and collaborations (Roblyer et al. 2010). Social
networking refers to the phenomenon where relationships are initiated, established, main-
tained, expanded, or closed, often through specific social media, such as Facebook and
WeChat. Social networking might be an enabler of self-regulated learning by offering
social supports in an online learning environment (Rennie and Morrison 2013). Although
self-regulated learning emphasizes autonomy and personal control over one’s knowledge
construction and skill practice, interactive processes and influences from others can impact
self-regulation processes (Zimmerman and Schunk 1989, 2001). The effectiveness of social
networking in promoting learning interest, motivation, and self-regulation has been veri-
fied by many studies (Fisher and Baird 2005; Tower et al. 2014; Yu et al. 2010; Wang and
Wu 2008).
Most of these studies were done in social networks within the class where students were
familiar with each other. The research on the impact of social networking among unac-
quainted self-regulated learners is limited, especially regarding its connection sizes and
relationship-establishing factors, such as similarity of learning goals and attitudes toward
peer helping. This study explores these influential factors related to social networking in the
context of an online learning resource platform with 468 Chinese junior high school stu-
dents engaged in self-regulated learning activities during their summer break. There were
no classes and no teachers to organize these students’ learning activities, and there were
no requirements to use the integrated social networking on this learning resource platform
either. Understanding these factors in this setting can help in designing social networking
that more effectively facilitates students’ self-regulated learning in an open online learning
context.
Background
The concept of self-regulated learning has its roots in the theory of self-efficacy and social
cognitive theory (Nilson 2013). Self-regulated learning emphasizes learning autonomy,
mainly in three aspects: self-generated goals, self-adjusted actions and self-evaluated
results (Yu 2012). But self-regulated learning is conducted through reciprocal interactions
among environmental factors, personal processes, and behaviors from a social-cognition
perspective (Bandura 1986; Zimmerman 1989). Consistent with this view, peers’ presence
and online work may change students’ aims and attitudes, and trigger students to change
their online learning activities, such as online course participation and communication with
Factors that impact social networking in online self-regulated…
others (Dixson 2015; Lin et al. 2016). The process of self-regulation is “dynamic and con-
textually bound” (Duncan and McKeachie 2005). For example, when the number of new
friends increases, students’ social participation generally increases too. When the number
reaches a certain level, the degree of social participation might decline (Tong et al. 2008).
Students with high self-efficacy may try to establish relationships of a useful and optimal
size and seek out help from peers to ensure achievement of their desired goals.
Recently, an increasing number of researchers have begun to pay attention to social networking
in self-regulated learning. Many researchers agree that social networking can
enhance students’ sense of community (Hung and Yuen 2010), promote student support
and peer-interaction (Rennie and Morrison 2013; Wang and Wu 2008), and directly and
positively influence students’ learning experience, progress, and outcome (Fisher and Baird
2005; Gewerc et al. 2016). Yu et al. (2010) discovered that social acceptance and accultur-
ation bridged students’ online social networking engagement and positively impacted their
self-esteem development, satisfaction with university life, and performance proficiency.
Lin et al. (2014) applied a self-regulation framework of appraisal, emotional reaction, and
coping response to reveal how users’ experience affected their continuance of social net-
working. These studies investigated how social networking enhances self-regulated learn-
ing in many aspects but not its connection sizes and specific relationship-establishing fac-
tors. One probable reason might be that most previous studies were conducted in a similar
research context where students were acquainted with each other. They came from the
same study group, class, or school in an offline reality before engaging in their online learn-
ing activities for research with requirements from their teachers. For these students, there
was no need to establish relationships because of their pre-existing relationships among
themselves. This study, different from the previous ones, explores the factors related to a
social network established among strangers. It involves 468 students from many different
locations across China, most of whom did not know each other. There were no classes and no
teachers on the learning resource platform. Students chose to use social networking on this
learning resource platform for their online self-regulated learning activities by themselves.
In the age of the Internet, people often build, maintain, or expand their social relation-
ships through specific online services, platforms, or websites, such as Facebook, Twitter,
LinkedIn and WeChat. A distinctive feature of social networking is its follow mechanism
(Boyd and Ellison 2007; Ellison and Boyd 2013). Gelley and John (2015) described the
follow mechanism as “the public articulation of links between users” (p. 1751). Bakshy
et al. (2011) stated that users could influence each other through these followed links.
Users of this mechanism can take the initiative to send requests to develop a connection.
Only users with bidirectional follow links become friends, while those who have unidirec-
tional ties can only be labeled as fans or followers. Altermatt and Pomerantz (2003) exam-
ined whether the beliefs of friends could predict children’s achievement-related beliefs.
They found that influences were more significant among bilateral friends than those uni-
lateral ones. Although online relationships are different from real-life social networks, it
X. Yu et al.
is speculated that online relationships have similar impacts in the context of self-regulated
learning in this study.
Connection size
In most cases, the majority of links on social networking websites reflect pre-existing real-
life social networks of the users (Nielsen.com 2011), maintaining and transferring their
offline relationships online. Elements that affect the behavior of relationship connections
generally are common backgrounds, similar values, mutual activities, shared interests, or
friends-of-friends (Boyd and Ellison 2007; Gelley and John 2015). They are mainly based
on the needs of social interaction. When social networking is applied to education, factors
related to academic needs are also considered in relationship establishment. For example,
modes of online communication and language proficiency were learners’ connecting
preferences in the context of massive open online courses (MOOCs) (Zhang, Peck et al.
2016). In this study within an online learning environment that has no interventions from
teachers, self-regulated learners depend on themselves in achieving learning goals. Under
this circumstance, it is speculated, learners would establish their relationship with their
peers primarily because of academic needs rather than social needs (such as common identity).
Bandura’s social learning theory (1977, 1986) explains the importance of peer interaction
and support in learning, including the attraction of a role model to the individual and the
positive impact of more experienced peers on an individual’s learning. Three
possible academic relationship-establishing factors identified and investigated in this study
were “similar learning goals”, “peer help”, and “achievement gap”.
A social network is an important source of support that influences online learning. Most
previous studies were conducted with students constrained to prescribed interventions or
controlled groups with pre-setup requirements by researchers. In addition, students in these
studies were acquainted with each other prior to the establishment of social networks. In reality,
with many open online learning platforms like MOOCs, learners rarely know each
Factors that impact social networking in online self-regulated…
other and they are left on their own to establish social networks among unfamiliar
peers to support their self-regulated learning. Studies of social network connection sizes
and of the factors behind establishing relationships with strangers are rare in this context,
and their impacts on students engaged in online self-regulated learning
activities are generally unclear. With this rationale, this descriptive study explores
answers to the following research questions to fill these gaps:
(1) Does social networking established among strangers affect individuals’ self-regu-
lated learning activities and lesson completion? If it does, how could social networking
affect self-regulated learning?
(2) What is the optimal size of social networks?
(3) Are there any differences in the impact on self-regulated learning produced by dif-
ferent relationship-establishing factors of social network?
By answering the questions above, this study tries to discover insights about self-regu-
lation and the roles that social networking plays and find useful suggestions for the design
of social networking aimed at improving students’ self-regulated learning activities in an
open context.
Methods
This is a descriptive study from a learning analytics perspective based on data automati-
cally recorded in a learning resource platform. Compared to internal personal factors that
are always gleaned by questionnaires or surveys, external behavioral factors are more
observable and recordable in a platform (Bannert et al. 2014; Azar et al. 2010). Accordingly,
this study focuses on behavioral factors (such as login information, lesson comple-
tion, and performance in learning activities) to examine how students regulate their learn-
ing with social networking through the data recorded on the platform. Information about
the platform and the conceptual research model based on the data analyzed are described
below.
Study setting
outcomes, and a learning dashboard for monitoring study activities and progress. While
watching video lessons, students can also engage in various activities with these tools,
including completing exams, discussing, noting, sharing, and reflecting. All their interac-
tive behaviors are tracked and recorded in the platform. This study used this data to explore
and describe students’ learning performance.
ZhiLiao also offers user-friendly features for its social networking. One is the Learning Talents
List, where students are ranked by the number of lessons completed. This feature is
based on role model influence (Bricheno and Thornton 2007). Students use it to build
connections with those who are performing better than they are. The other feature is the Learning Peers
List, where students who are taking or have completed the current lesson are listed (see
the right part of Fig. 1). This feature is based on the psychological effect of identity (Howard
2000), and students use it to choose learning peers who share their learning goals.
Once a unidirectional follow link is set up, a student follows,
or becomes a fan of, the other. When bidirectional follow links are set up, it confirms
the establishment of a friendship and means that two students can communicate with each
other and exchange their ideas and learning experience.
Figure 2 shows the conceptual model of this descriptive research, and Table 1 lists
the data indicators for the variables in the model. With no mandatory tests, the learning
outcome of online self-regulated learning activities is measured by Lesson Completion
(LC), the number of lessons completed. Social networking is indicated by two variables:
Students Followed (SF) for the number of students followed, which is also termed as con-
nection size, and Following Status (FS) for different types of connections. The model con-
siders Learning Login (LL), Activity Participation (AP), and Learning Persistence (LP) as
indicators of learning engagement (Dixson 2015). Learning Login and Activity Participa-
tion reflect students’ daily learning engagement on the platform, and Learning Persistence
reflects students’ learning completion within one single login-session. The study assumes
that different sizes and types of social networks may trigger different levels of learning
engagement, thereby leading to differences in lesson completion and learning activities.
Informed by Bandura’s social learning theory (1977, 1986), the model considers four
factors to be related to the establishment of online relationships:
Common Identity (CI), Similar Learning Goals (SLG), Peer Helping (PH), and Achievement
Gap (AG). Common Identity means that students come from the same school and is
regarded as a social factor. Similar Learning Goals means that students have selected the same
lessons on the platform and indicates a similar learning context. This may induce a closeness
that prompts students to establish a social connection to help each other. Peer Helping
means that one student has answered the other student’s questions or has responded to the
other student’s thoughts on the platform. Achievement Gap refers to the difference in the
number of lessons completed by students. Peer Helping and Achievement Gap might prompt
students to follow and learn from, or to compete against, better-performing peers owing to their
achievement differences. In this study, Similar Learning Goals, Peer Helping, and Achievement
Gap are treated as academic factors. The study hypothesizes that different types of
self-regulated learners might place emphasis on different factors when building social networks,
which may influence the level of lesson completion and learning activities.
Fig. 2 Research model
Table 1 Model variables and their data indicators
Model variable Data indicator Data type Question
Because the level of lesson completion is positively correlated with the time a student spent
learning on the platform, Total Learning Days (TLD) on the platform from initial registration
to last login date and the difference in total learning days (TLD Diff) are considered as covari-
ates (Tong et al. 2008).
Data collection
A total of 468 junior high school students (male = 51.3%, female = 36.8%, gender not dis-
closed = 12.0%) from 194 schools (max = 40, min = 1, SD = 2, ICC = 0.032) from 22 provinces
were involved in this study from late June to early September with an average of 36 days stud-
ying (max = 86, min = 1, SD = 21.28) on the ZhiLiao learning resources platform. During this
study time, students could log in and out of ZhiLiao freely, with no special requirements. They
regulated their own online learning and engaged in social networking at their own will, while
following the prescribed order of lessons. Almost all of the students ended their participation
on the platform prior to the starting of senior high school in early September with an aver-
age lesson completion rate of 24.61% (max = 97%, min = 0, SD = 27.64%) for all the lessons
offered in ZhiLiao. At that time, all the data were collected from ZhiLiao for analysis (as dis-
played in Table 1). The personal information of students was removed from the data analysis
to protect the privacy of users.
SPSS V22.0 was employed for data analysis. To explore the impact of social networks on
self-regulated learning among strangers and the correlations between social network sizes and
learning engagement, the 468 students were divided into groups by connection type
and connection size according to their follow links. Next, univariate ANCOVA and correlation
analysis were conducted to examine the differences in students’ online learning activities
across social networks of different sizes. To satisfy ANCOVA’s requirements for data structure,
a simple log transformation (McDonald 2014) was applied to the variable Lesson Completion.
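As a minimal sketch of the transformation step, the snippet below applies a base-10 logarithm to lesson counts. The `+ 1` offset is an assumption added here so that students with zero completed lessons remain defined; the study only states that a simple log transformation was applied. The function name and the sample counts are hypothetical.

```python
import math


def log_transform(lesson_counts, offset=1.0):
    """Return log10(count + offset) for each lesson count.

    The offset is an illustrative assumption, not a detail reported in the study.
    """
    return [math.log10(count + offset) for count in lesson_counts]


# Hypothetical, right-skewed lesson-completion counts
counts = [0, 3, 9, 99, 999]
transformed = log_transform(counts)
print([round(value, 3) for value in transformed])
```

The transformation compresses the long right tail of the counts, which is the usual motivation for applying it before an analysis that assumes roughly normal residuals.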
To investigate how different relationship-establishing factors affect students establishing
social networks, the cross-relations among the 468 student samples were constructed to obtain
a general picture of social networking. This process created 218,556 records, indicating all
possible connections among those students. According to the data of follow links recorded in
the platform, there were 4813 records with the variable following status valued as 1, which
means only 4813 out of 218,556 possible connections were established. This study assumed
that the capability of seeking out all potential learning elements and supports is essential for
students in online self-regulated learning activities. The choices made by self-regulated learn-
ers with a better lesson completion are valuable to notice as they indicate the positive rela-
tionship-establishing factors. Cluster analysis and correlation analysis were used to identify
subgroups of students with differences in their reasons for establishing social networks.
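The record count reported above is consistent with treating every ordered pair of distinct students as one possible follow link, since 468 × 467 = 218,556. The short sketch below reproduces that count and the share of possible links that were actually established; the variable names are hypothetical, and the ordered-pair reading is an inference from the numbers rather than something the text states explicitly.

```python
from itertools import permutations

N_STUDENTS = 468          # students in the study
ESTABLISHED_LINKS = 4813  # records with following status valued as 1

# Each ordered pair (follower, followed) of distinct students is one record.
possible_links = N_STUDENTS * (N_STUDENTS - 1)

# Sanity check of the counting rule on a tiny hypothetical group of 4 students.
small_group = ["a", "b", "c", "d"]
assert len(list(permutations(small_group, 2))) == 4 * 3

# Share of possible follow links that were actually established.
density = ESTABLISHED_LINKS / possible_links
print(possible_links)
print(f"{density:.2%}")
```

Under this reading, only about 2% of the possible follow links were established, which illustrates how sparse the network among strangers was.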
Results
According to connection types, 468 student samples were divided into three groups.
No-follow-link group (Ngp) refers to students with no follow links established,
The maximum number of bidirectional follow links was 83, and so we used Dunbar’s serial
numbers (5, 15, and 35) as cutoffs (Hill and Dunbar 2003). The bidirectional-follow-link
group was divided into four subgroups (Bgp1 ≤ 5; Bgp2 > 5 & ≤ 15; Bgp3 > 15 & ≤ 35; and
Bgp4 > 35) to examine the impact divergences of different sizes of bidirectional follow
links. Levene’s test indicates equal variances among the data [F (3,172) = .749, p = .524].
Using univariate ANCOVA as described above, tests of between-subjects effects imply a
significant main effect for Lesson Completion [F (3,461) = 9.48, p < .001, ηp2 = .143] and
no effect for Total Learning Days [F (1,461) = .759, p = .385, ηp2 = .004]. Post hoc pairwise
comparisons suggest that Bgp1 differs significantly from the other groups (see Table 3).
Furthermore, using Total Learning Days as a covariate, partial correlation analysis shows
that there are significantly positive relationships between Lesson Completion and Learning
Persistence in all four groups [Bgp1: r (92) = .547, Bgp2: r (43) = .630, Bgp3: r (21) = .790,
Bgp4: r (20) = .822], positive relationships between Lesson Completion and Learning
Login in Bgp1 (r (92) = .415), Bgp2 (r (43) = .367), and Bgp3 (r (21) = .625), and a rela-
tively lower positive relationship between Lesson Completion and Activity Participation in
Bgp1 (r (92) = .306) (see Table 3).
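The partial correlations above control for Total Learning Days. As a minimal sketch of what such a coefficient measures, the snippet below computes a first-order partial correlation from the three pairwise Pearson correlations. The study itself used SPSS; the function names and the sample data here are hypothetical and serve only as an illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of covariate z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical values for six students
lesson_completion = [12, 30, 45, 20, 60, 52]
learning_persistence = [0.4, 0.9, 1.3, 0.5, 1.8, 1.2]
total_learning_days = [10, 25, 40, 30, 50, 35]

r = partial_corr(lesson_completion, learning_persistence, total_learning_days)
print(round(r, 3))
```

When the covariate is uncorrelated with both variables, the partial correlation reduces to the ordinary Pearson correlation, which is a convenient sanity check for the formula.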
Table 2 Descriptive statistics of various groups
Group Descriptive statistics Group Descriptive statistics
n SF SF2 LC n SF SF2 LC
M SD M SD M SD M SD M SD M SD
Ngp 250 .00 .00 34.6 46.1 Bgp4 20 52.3 12.0 150.8 67.7
Ugp 42 1.2 .66 28.0 31.2 ESRLers 49 43.9 69.9 20.1 23.5 190.9 16.6
Bgp 176 27.1 52.8 81.0 65.0 PSRLers 65 17.7 42.9 8.0 12.8 100.1 21.7
Bgp1 92 2.3 1.4 56.2 50.2 LSRLers 168 7.9 28.8 3.3 8.3 36.2 17.8
Bgp2 43 9.2 2.7 87.2 60.3 NSRLers 186 .95 2.8 .46 1.3 11.5 10.0
Ngp refers to students with no follow links established, Ugp refers to students with only unidirectional follow links established, Bgp refers to students with at least one bidirectional follow link established, Bgp1 refers to students with 5 or fewer bidirectional follow links established, Bgp2 refers to students with between 6 and 15 bidirectional follow links established, Bgp3 refers to students with between 16 and 35 bidirectional follow links established, Bgp4 refers to students with 36 or more bidirectional follow links established
LC Lesson Completion, SF Students Followed (including unidirectional and bidirectional follow links), SF2 Students Followed (only including bidirectional follow links), ESRLers Efficient Self-Regulated Learners, PSRLers Positive Self-Regulated Learners, LSRLers Low-efficient Self-Regulated Learners, NSRLers Negative Self-Regulated Learners
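The group definitions used in this table can be expressed as a small classification rule. The sketch below follows the cutoffs of 5, 15, and 35 given in the Results text, so Bgp3 covers 16 to 35 bidirectional follow links; the function name and its arguments are hypothetical.

```python
def connection_group(n_bidirectional, n_unidirectional):
    """Assign a student to a connection group from their follow-link counts."""
    if n_bidirectional == 0:
        # No mutual links: either only one-way links or no links at all.
        return "Ugp" if n_unidirectional > 0 else "Ngp"
    if n_bidirectional <= 5:
        return "Bgp1"
    if n_bidirectional <= 15:
        return "Bgp2"
    if n_bidirectional <= 35:
        return "Bgp3"
    return "Bgp4"


# Hypothetical students: (bidirectional links, unidirectional links)
for links in [(0, 0), (0, 3), (5, 1), (6, 0), (35, 2), (83, 0)]:
    print(links, connection_group(*links))
```

Any student with at least one bidirectional follow link belongs to Bgp as a whole, so the four subgroups partition Bgp by connection size.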
Table 3 Divergence among different groups of different connection types and connection sizes
Group n Multiple Comparisons Correlation analysis
P value
Ngp Ugp Bgp Bgp1 Bgp2 Bgp3 Bgp4 SF LL AP LP
Ngp 250 – LC –
Ugp 42 .635 – .174
Bgp 176 .000 .000 – .290**
Bgp1 92 – – .415** .306** .547**
Bgp2 43 .002 – – .367* .148 .630**
Bgp3 21 .002 .481 – – .625** .383 .790**
Bgp4 20 .000 .024 .163 – – .105 .209 .822**
Ngp refers to students with no follow links established, Ugp refers to students with only unidirectional follow
links established, Bgp refers to students with at least one bidirectional follow link established, Bgp1
refers to students with 5 or fewer bidirectional follow links established, Bgp2 refers to students with
between 6 and 15 bidirectional follow links established, Bgp3 refers to students with between 16 and 35
bidirectional follow links established, Bgp4 refers to students with 36 or more bidirectional follow links
established
LC Lesson Completion, SF Students Followed, LL Learning Login, AP Activity Participation, LP Learning
Persistence
**p < .01, *p < .05
ESRLers Efficient Self-Regulated Learners, PSRLers Positive Self-Regulated Learners, LSRLers Low-
efficient Self-Regulated Learners, NSRLers Negative Self-Regulated Learners, FS Following Status, SF2
Students Followed, LC Lesson Completion, TLD Total Learning Days, CI Common Identity, SLG Similar
Learning Goals, PH Peer Helping, AG Achievement Gap
**p < .01, *p < .05
Using the difference in total learning days as a covariate, partial correlation analysis
shows that Following Status and Achievement Gap are slightly correlated in all clusters
except NSRLers [ESRLers: r (22,880) = .226, PSRLers: r (30,352) = .170, LSRLers:
r (78,453) = .108, NSRLers: r (86,859) = .046]. The data also show a slight correlation
between Following Status and Similar Learning Goals for ESRLers [r (22,880) = .266] and
PSRLers [r (30,352) = .158], and a slight correlation between Following Status
and Common Identity for ESRLers [r (22,880) = .125]. Moreover, the statistics do not show
any substantial correlation between Following Status and Peer Helping in any cluster.
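The values in parentheses above are consistent with degrees of freedom for pair-level records. The sketch below checks this reading under two assumptions that the text does not state explicitly: each cluster contributes one record per ordered pair originating from its members (cluster size × 467), and a partial correlation with one covariate has N − 3 degrees of freedom. Cluster sizes are taken from Table 2; the names in the code are hypothetical.

```python
N_STUDENTS = 468
N_COVARIATES = 1  # difference in total learning days

# Cluster sizes as listed in Table 2
cluster_sizes = {"ESRLers": 49, "PSRLers": 65, "LSRLers": 168, "NSRLers": 186}


def partial_corr_df(n_records, n_covariates):
    """Degrees of freedom of a partial correlation: N - 2 - number of covariates."""
    return n_records - 2 - n_covariates


degrees_of_freedom = {}
for cluster, size in cluster_sizes.items():
    # Assumed: one record per ordered pair starting from a cluster member.
    n_records = size * (N_STUDENTS - 1)
    degrees_of_freedom[cluster] = partial_corr_df(n_records, N_COVARIATES)
    print(cluster, n_records, degrees_of_freedom[cluster])
```

With samples of this size, even very small coefficients reach statistical significance, which is why the text describes the correlations as slight rather than emphasizing their p values.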
The distributions of connection types and connection sizes in each type of self-regulated
learner group are shown in Table 5. Bgp accounts for more than 50% of
ESRLers and PSRLers, and Bgp1 is the largest of the bidirectional-follow-link
subgroups in all clusters.
Discussion
Different sizes of social networks can trigger different levels of learning engagement and
lead to differences in lesson completion. This study found the number of bidirectional links
Table 5 Distributions of groups of different connection types and connection sizes in different self-regu-
lated learner groups
Cluster n Group
Ngp Ugp Bgp Bgp1 Bgp2 Bgp3 Bgp4
Ngp refers to students with no follow links established, Ugp refers to students with only unidirectional follow
links established, Bgp refers to students with at least one bidirectional follow link established, Bgp1
refers to students with 5 or fewer bidirectional follow links established, Bgp2 refers to students with
between 6 and 15 bidirectional follow links established, Bgp3 refers to students with between 16 and 35
bidirectional follow links established, Bgp4 refers to students with 36 or more bidirectional follow links
established
ESRLers Efficient Self-Regulated Learners, PSRLers Positive Self-Regulated Learners, LSRLers Low-effi-
cient Self-Regulated Learners, NSRLers Negative Self-Regulated Learners
is positively correlated with the number of lessons completed. This finding suggests that
students with more bidirectional follow links were likely to receive more assistance and
encouragement from peers, which resulted in a higher level of learning persistence to complete
more lessons. However, this does not mean that more bidirectional links are always better for
online learning. Further exploration of the contribution of each bidirectional link towards
lesson completion revealed that the subgroup with five or fewer bidirectional follow links
(Bgp1) has an optimal connection size: it outperformed the other three subgroups as measured
by the average lesson completion per bidirectional link. In other words, having a maximum
of five bidirectional follow links in online self-regulated learning activities works
best among a group of strangers. A social network of this size can encourage students
not only to attempt more lessons, but also to log in more often and participate in more learning
activities. Since Bgp1 (92 students) makes up the largest share of Bgp (176 students) in all four
types of self-regulated learners, according to our data, this finding of an optimal connection
size of five for lesson completion holds for most learners. In addition, the results
also indicate that an overly large social network is not necessarily beneficial to staying
focused on meaningful interaction to complete learning tasks.
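The comparison of lesson completion per bidirectional link can be approximated from the group means in Table 2. The sketch below divides each subgroup's mean Lesson Completion by its mean number of bidirectional follow links. This ratio of means is only an approximation of the per-student average described above, the assignment of table columns is an assumption based on the group definitions, and Bgp3 is omitted because its row does not appear in Table 2 as reproduced here.

```python
# (mean bidirectional follow links, mean lessons completed), read from Table 2
table2_means = {
    "Bgp1": (2.3, 56.2),
    "Bgp2": (9.2, 87.2),
    "Bgp4": (52.3, 150.8),
}

# Ratio of group means: lessons completed per bidirectional follow link.
lessons_per_link = {
    group: lessons / links for group, (links, lessons) in table2_means.items()
}

for group, ratio in lessons_per_link.items():
    print(group, round(ratio, 1))
```

Total lesson completion rises with connection size, but the return per link falls steeply, which matches the argument that a small number of bidirectional links is the most efficient.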
Conclusions
Implications
This study presents some implications for current education practices and future research.
First, it provides a new understanding of the value of social networking and shows its
impacts on online self-regulated learning activities from aspects of the connection size and
relationship-establishing factors. Scale of social networking has been discussed in previ-
ous studies (Tong et al. 2008; Kim and Lee 2011) but few within the context of academics.
Although Dunbar’s number reveals different sizes of circles of intimacy, it is believed there
must be learning intimacy existing in an academic context similar to digital intimacy in
a social context. With the proper level of learning intimacy, social networking can con-
tribute to online learning and learning outcomes. As for relationship-establishing factors,
most research was carried out in the context with pre-existing social networks and rarely
included establishing new relationships. Because online networks mostly originate from
offline real-life relationships, there is generally no need to consider why people choose to
follow others. Nowadays, there are more open online learning platforms supporting people,
who are strangers to each other, when they engage in autonomous learning as they do in
MOOCs. Social networking can be utilized to offer social and academic support to self-
regulated learning among strangers. It is urgent to investigate how to optimize the value
of social networking in this context. This research casts light on a new perspective and
presents several interesting findings. The study found that self-regulated learners, who are
learning online with strangers, consider more academic factors than social factors when
using social networking to support their learning.
Second, the method for data collection and analysis used in this study to reveal the
relationship between social networking and self-regulated learning is different from most
prior research, which used questionnaires or surveys as the main method to collect their
data (Lin et al. 2014; Matzat and Vrieling 2016). Data from surveys can be biased because
of the survey design or respondents’ attitudes (Gonyea 2005). Compared to data obtained
through questionnaires or surveys, digital data are generally more objective and authentic
(EDUCAUSE 2011). This provides insightful information about underlying problems and
contexts within the learning process (Knight et al. 2014; Tanes et al. 2011). Amid the
proliferation of online learning and teaching, learning analytics is gaining increased traction.
The analytic methods used in this study also provide a reference for future research on
related topics.
This study offers several practical implications for effective use of social networking.
This study found that social networking could positively affect self-regulated learning even
among strangers. Specifically, it should involve bidirectional follow links, and the optimal
connection size is five. To build beneficial social networks, academic factors (similar
learning goals and achievement gap) are more valuable than social factors (common identity).
Learning platforms can be designed with these findings in mind. In this study,
there are two deliberately designed places that help students initiate connections: the Learning
Talents List and the Learning Peers List. Both provide a hint to students that they
could catch up with peers and learn from more advanced peers. These findings are useful
for designing social networking that facilitates students’ self-regulated learning
activities and for improving the mechanism of peer recommendations in an online learning
platform aimed at enhancing students’ self-regulated learning efficacy.
This study also has theoretical significance. The application of social networking to edu-
cation has triggered heated debates (Aydin 2012; Hamid et al. 2015; Roblyer et al. 2010;
Veletsianos et al. 2013; Zaidieh 2012). Despite mostly optimistic attitudes and visions,
there are still quite a few negative reports (Junco 2012; Karpinski et al. 2013). The contra-
dictory arguments might be caused by different concerns on how to use social networking
pedagogically. The findings on bidirectional follow links and the optimal size may help explain
those contradictory arguments about the use of social networks to support
learning.
This study was about junior high school graduates who engaged in self-regulated learn-
ing activities with unfamiliar peers in an open online learning resource platform. Their
social networking behaviors were very different from those in common school contexts with
teaching presences and prescribed interventions. The results reveal instinctive demands and
features of Chinese teenagers’ social networking abilities. Most research studies (Fisher
and Baird 2005; Lin et al. 2014; Yu et al. 2010) were done with or on college students who
might have already accumulated relatively rich online experience. This study, with
adolescents, may act as a precursor for the future cultivation of self-regulated learning ability
with the support of social networking (Alloway et al. 2013).
Limitations
This study is a descriptive study from a learning analytics perspective based on learning
data captured from an online learning resource platform. Without observations, surveys,
and interviews to support our findings, readers should be cautious about the extent to which the
study results can be generalized.
Although the learning data recorded on the platform were of high quality, with
rich information in both type and quantity, they cannot fully reflect students’ intentions
and behaviors. For example, the data can indicate that a lesson was completed, but they do
not tell how many times students logged in to the lesson, why they temporarily abandoned
the lesson before completing it, or why they persisted in learning to get it completed.
In other words, lesson completion and follow links data only provide one snapshot pic-
ture that interprets the complex relationship between social networking and self-regulated
learning. Learning outcome is a very complex subject. Other factors, such as the influence
of students’ appreciation and interest in lesson content, students’ self-regulated learning
skills, or the impact of students’ willingness and ability to socially interact through estab-
lishing online connections are also important to consider when establishing the research
model. Finally, this research is only one case study that focuses on one single online learn-
ing resource platform.
Further studies
Based on the findings of this study, future research may include: (1) obtaining corresponding
datasets from other learning platforms and performing similar analyses to
enhance the interpretive power of the hypothesized model; (2) verifying the research
findings of the current study through specifically designed experimental research on the
same platform, for example, requiring each group to have a certain number of friends in order
to verify the optimal number of bidirectional follow links; and (3) adding qualitative
approaches (e.g. interviews and questionnaires) to enhance the research from the aspect of
emotion and attitude, such as learning interest, social preference, and online trust, which
are missing in the current study. Although the learning data recorded on the platform may
reflect students’ intentions and attitudes to a certain extent, such data can hardly
reveal the complex constructs of learners. For example, peer helping should play a role in
establishing social networks according to many prior studies (Carrell et al. 2009; Nichols
and White 2001), but this study does not support their findings. It is critical to combine
subjective survey data with objectively recorded learning data to fully understand students’
learning behaviors in online self-regulated learning.
Acknowledgements This study was funded by the Peak Discipline Construction Project of Education and
the Advantage Discipline Innovation Platform Principal Investigator Construction Project of Teacher Education
at East China Normal University. The authors would like to thank all the experts and editors who
have participated in the review of the paper, especially reviewer #9.
References
Alloway, T. P., Horton, J., Alloway, R. G., & Dawson, C. (2013). Social networking sites and cognitive abili-
ties: do they make you smarter? Computers & Education, 63, 10–16. https://doi.org/10.1016/j.compe
du.2012.10.030.
Altermatt, E. R., & Pomerantz, E. M. (2003). The development of competence-related and motivational
beliefs: an investigation of similarity and influence among friends. Journal of Educational Psychology,
95(1), 111–123. https://doi.org/10.1037/0022-0663.95.1.111.
Aydin, S. (2012). A review of research on Facebook as an educational environment. Educational Technol-
ogy Research and Development, 60(6), 1093–1106. https://doi.org/10.1007/s11423-012-9260-7.
Azar, H. K., Lavasani, M. G., Malahmadi, E., & Amani, J. (2010). The role of self-efficacy, task value, and
achievement goals in predicting learning approaches and mathematics achievement. Procedia Social
and Behavioral Sciences, 5, 942–947. https://doi.org/10.1016/j.sbspro.2010.07.214.
Bakshy, E., Hofman, J. M., Mason, W. A., & Watts, D. J. (2011). Everyone’s an influencer: quantify-
ing influence on twitter. In Proceedings of the fourth ACM international conference on Web search
and data mining (pp. 65–74). Washington, DC: Association for Computing Machinery (ACM).
Doi:10.1145/1935826.1935845
Bandura, A. (1977). Self-efficacy: toward a unifying theory of behavioral change. Psychological Review,
84(2), 191–215. https://doi.org/10.1037/0033-295X.84.2.191.
Bandura, A. (1986). Social foundations of thought and action: a social cognitive theory. Englewood Cliffs,
NJ: Prentice-Hall Inc.
Bannert, M., Reimann, P., & Sonnenberg, C. (2014). Process mining techniques for analysing patterns and
strategies in students’ self-regulated learning. Metacognition and Learning, 9(2), 161–185. https://doi.
org/10.1007/s11409-013-9107-6.
Bliss, C. A., Kloumann, I. M., Harris, K. D., Danforth, C. M., & Dodds, P. S. (2012). Twitter reciprocal
reply networks exhibit assortativity with respect to happiness. Journal of Computational Science, 3(5),
388–397. https://doi.org/10.1016/j.jocs.2012.05.001.
Bohn, A., Buchta, C., Hornik, K., & Mair, P. (2014). Making friends and communicating on Facebook:
implications for the access to social capital. Social Networks, 37, 29–41. https://doi.org/10.1016/j.
socnet.2013.11.003.
Boyd, D. M., & Ellison, N. B. (2007). Social network sites: definition, history, and scholarship. Journal of
Computer-Mediated Communication, 13(1), 210–230. https://doi.org/10.1111/j.1083-6101.2007.00393
.x.
Bricheno, P., & Thornton, M. (2007). Role model, hero or champion? children’s views concerning role mod-
els. Educational research, 49(4), 383–396. https://doi.org/10.1080/00131880701717230.
Carrell, S. E., Fullerton, R. L., & West, J. E. (2009). Does your cohort matter? measuring peer effects in
college achievement. Journal of Labor Economics, 27(3), 439–464. https://doi.org/10.1086/600143.
Dixson, M. D. (2015). Measuring student engagement in the online course: the online student engagement
scale (OSE). Online Learning, 19(4), n4. https://doi.org/10.24059/olj.v19i4.561.
Duncan, T. G., & McKeachie, W. J. (2005). The making of the motivated strategies for learning question-
naire. Educational Psychologist, 40(2), 117–128. https://doi.org/10.1207/s15326985ep4002_6.
EDUCAUSE Learning Initiative (2011). 7 things you should know about first-generation learning analytics.
Louisville, CO: EDUCAUSE.
https://library.educause.edu/resources/2011/12/7-things-you-should-know-about-firstgeneration-learning-analytics.
Accessed 16 December 2018.
Ellison, N. B., & Boyd, D. (2013). Sociality through social network sites. In W. H. Dutton (Ed.), The Oxford
handbook of Internet studies (pp. 151–172). Oxford: Oxford University Press.
Fisher, M., & Baird, D. E. (2005). Online learning design that fosters student support, self-regulation, and
retention. Campus-Wide Information Systems, 22(2), 88–107.
https://doi.org/10.1108/10650740510587100.
Gelley, B., & John, A. (2015). Do I need to follow you?: Examining the utility of the pinterest follow
mechanism. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work
& Social Computing (pp. 1751–1762). Washington, DC: Association for Computing Machinery.
https://doi.org/10.1145/2675133.2675209.
Gewerc, A., Rodríguez-Groba, A., & Martínez-Piñeiro, E. (2016). Academic social networks and learning
analytics to explore self-regulated learning: a case study. IEEE Revista Iberoamericana de Tecnologias
del Aprendizaje, 11(3), 159–166. https://doi.org/10.1109/RITA.2016.2589483.
Gonyea, R. M. (2005). Self-reported data in institutional research: review and recommendations. New
Directions for Institutional Research, 2005(127), 73–89. https://doi.org/10.1002/ir.156.
Grabowicz, P. A., Ramasco, J. J., Moro, E., Pujol, J. M., & Eguiluz, V. M. (2012). Social features of online
networks: the strength of intermediary ties in online social media. PLoS ONE, 7(1), e29358.
https://doi.org/10.1371/journal.pone.0029358.
Hamid, S., Waycott, J., Kurnia, S., & Chang, S. (2015). Understanding students’ perceptions of the benefits
of online social networking use for teaching and learning. The Internet and Higher Education, 26, 1–9.
https://doi.org/10.1016/j.iheduc.2015.02.004.
Hill, R. A., & Dunbar, R. I. (2003). Social network size in humans. Human Nature, 14(1), 53–72.
https://doi.org/10.1007/s12110-003-1016-y.
Howard, J. A. (2000). Social psychology of identities. Annual Review of Sociology, 26(1), 367–393.
https://doi.org/10.1146/annurev.soc.26.1.367.
Hung, H. T., & Yuen, S. C. Y. (2010). Educational use of social networking technology in higher education.
Teaching in Higher Education, 15(6), 703–714. https://doi.org/10.1080/13562517.2010.507307.
Junco, R. (2012). The relationship between frequency of Facebook use, participation in Facebook activities,
and student engagement. Computers & Education, 58(1), 162–171.
https://doi.org/10.1016/j.compedu.2011.08.004.
Karpinski, A. C., Kirschner, P. A., Ozer, I., Mellott, J. A., & Ochwo, P. (2013). An exploration of social
networking site use, multitasking, and academic performance among United States and European
university students. Computers in Human Behavior, 29(3), 1182–1192.
https://doi.org/10.1016/j.chb.2012.10.011.
Kim, J., & Lee, J. E. R. (2011). The Facebook paths to happiness: effects of the number of Facebook friends
and self-presentation on subjective well-being. CyberPsychology, Behavior, and Social Networking,
14(6), 359–364. https://doi.org/10.1089/cyber.2010.0374.
Knight, S., Shum, S. B., & Littleton, K. (2014). Epistemology, assessment, pedagogy: Where learning meets
analytics in the middle space. Journal of Learning Analytics, 1(2), 23–47.
https://doi.org/10.18608/jla.2014.12.3.
Lin, H., Fan, W., & Chau, P. Y. (2014). Determinants of users’ continuance of social networking sites: a
self-regulation perspective. Information & Management, 51(5), 595–603.
https://doi.org/10.1016/j.im.2014.03.010.
Lin, J. W., Lai, Y. C., Lai, Y. C., & Chang, L. C. (2016). Fostering self-regulated learning in a blended
environment using group awareness and peer assistance as external scaffolds. Journal of Computer
Assisted Learning, 32(1), 77–93. https://doi.org/10.1111/jcal.12120.
Livingstone, S. (2008). Taking risky opportunities in youthful content creation: teenagers’ use of social
networking sites for intimacy, privacy and self-expression. New Media & Society, 10(3), 393–411.
https://doi.org/10.1177/1461444808089415.
Matzat, U., & Vrieling, E. M. (2016). Self-regulated learning and social media—a ‘natural alliance’? evi-
dence on students’ self-regulation of learning, social media use, and student-teacher relationship.
Learning, Media and Technology, 41(1), 73–99. https://doi.org/10.1080/17439884.2015.1064953.
McDonald, J. H. (2014). Handbook of biological statistics (3rd ed.). Baltimore, MD: Sparky House
Publishing.
Muilenburg, L. Y., & Berge, Z. L. (2005). Student barriers to online learning: a factor analytic study. Dis-
tance Education, 26(1), 29–48. https://doi.org/10.1080/01587910500081269.
Nichols, J. D., & White, J. (2001). Impact of peer networks on achievement of high school algebra students.
The Journal of Educational Research, 94(5), 267–273. https://doi.org/10.1080/00220670109598762.
Nielsen.com (2011). Friends & frenemies: Why we add and remove Facebook friends.
https://www.nielsen.com/us/en/insights/news/2011/friends-frenemies-why-we-add-and-remove-facebook-friends.html.
Accessed 16 December 2018.
Nilson, L. B. (2013). Creating self-regulated learners: strategies to strengthen students’ self-awareness and
learning skills. Sterling, VA: Stylus Publishing LLC.
Rennie, F., & Morrison, T. M. (2013). E-learning and social networking handbook: resources for higher
education (2nd ed.). New York: Routledge.
Roblyer, M. D., McDaniel, M., Webb, M., Herman, J., & Witty, J. V. (2010). Findings on Facebook in higher
education: a comparison of college faculty and student uses and perceptions of social networking sites.
The Internet and Higher Education, 13(3), 134–140. https://doi.org/10.1016/j.iheduc.2010.03.002.
Tanes, Z., Arnold, K. E., King, A. S., & Remnet, M. A. (2011). Using Signals for appropriate feedback:
perceptions and practices. Computers & Education, 57(4), 2414–2422.
https://doi.org/10.1016/j.compedu.2011.05.016.
Tong, S. T., Van Der Heide, B., Langwell, L., & Walther, J. B. (2008). Too much of a good thing? the rela-
tionship between number of friends and interpersonal impressions on Facebook. Journal of Computer-
Mediated Communication, 13(3), 531–549. https://doi.org/10.1111/j.1083-6101.2008.00409.x.
Tower, M., Latimer, S., & Hewitt, J. (2014). Social networking as a learning tool: nursing students’ percep-
tion of efficacy. Nurse Education Today, 34(6), 1012–1017. https://doi.org/10.1016/j.nedt.2013.11.006.
Veletsianos, G., Kimmons, R., & French, K. D. (2013). Instructor experiences with a social networking
site in a higher education setting: expectations, frustrations, appropriation, and compartmentalization.
Educational Technology Research and Development, 61(2), 255–278.
https://doi.org/10.1007/s11423-012-9284-z.
Wang, S. L., & Wu, P. Y. (2008). The role of feedback and self-efficacy on web-based learning: the social
cognitive perspective. Computers & Education, 51(4), 1589–1598.
https://doi.org/10.1016/j.compedu.2008.03.004.
Yu, A. Y., Tian, S. W., Vogel, D., & Kwok, R. C. W. (2010). Can learning be virtually boosted? an
investigation of online social networking impacts. Computers & Education, 55(4), 1494–1503.
https://doi.org/10.1016/j.compedu.2010.06.015.
Yu, X. H. (2012). Modeling and Implementation of Self-Regulated Learning from the Perspective of
Personal Learning Environment Design, Ph.D. Dissertation. Shanghai, PRC: East China Normal
University.
Zaidieh, A. J. Y. (2012). The use of social networking in education: challenges and opportunities. World of
Computer Science and Information Technology Journal (WCSIT), 2(1), 18–21.
Zhang, Q., Peck, K. L., Hristova, A., Jablokow, K. W., Hoffman, V., Park, E., et al. (2016). Exploring the
communication preferences of MOOC learners and the value of preference-based groups: Is grouping
enough? Educational Technology Research and Development, 64(4), 809–837.
https://doi.org/10.1007/s11423-016-9439-4.
Zimmerman, B. J. (1989). A social cognitive view of self-regulated academic learning. Journal of Educa-
tional Psychology, 81(3), 329–339. https://doi.org/10.1037/0022-0663.81.3.329.
Zimmerman, B. J., & Schunk, D. H. (Eds.). (1989). Self-regulated learning and academic achievement:
theory, research, and practice. New York: Springer.
Zimmerman, B. J., & Schunk, D. H. (Eds.). (2001). Self-regulated learning and academic achievement:
theoretical perspectives. New York: Routledge.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Xiaohua Yu is an Associate Professor of education information technology in the Faculty of Education at
East China Normal University, China. Her recent research interests include technology-enhanced learning,
computational thinking, learning analytics, and online learning environment design.
Charles Xiaoxue Wang is a Professor of educational technology in College of Education and a Lucas Faculty
Fellow (2020–2021) at the Florida Gulf Coast University, USA. His recent research interests include design
and development of online learning environments, Multi-User Virtual Environments (MUVEs), synchro-
nous technology for distance learning and collaboration, and technology integration at K-12 schools.
Neural Computing and Applications
https://doi.org/10.1007/s00521-018-03967-z
WSOM 2017
Yves-Alexandre de Montjoye3,4
Abstract
Social networks are known to be assortative with respect to many attributes, such as age, weight, wealth, level of education,
ethnicity and gender: Similar people according to these attributes tend to be more connected. This can be explained by
influences and homophily. Independently of its origin, this assortativity gives us information about each node given its
neighbors. Assortativity can thus be used to improve individual predictions in a broad range of situations, when data are
missing or inaccurate. This paper presents a general framework based on probabilistic graphical models to exploit social
network structures for improving individual predictions of node attributes. Using this framework, we quantify the
assortativity range leading to an accuracy gain in several situations, with various individual prediction profiles. We finally
show how specific characteristics of the network can enhance performances further. For instance, the gender assortativity in
real-world mobile phone data drastically changes according to some communication attributes. In this case, using the
network topology indeed improves local predictions of node labels and moreover enables inferring missing node labels
based on a subset of known vertices. In both cases, the performances of the proposed method are statistically significantly
superior to the ones achieved by state-of-the-art label propagation and feature extraction schemes in most settings.
Keywords Loopy belief propagation · Assortativity · Homophily · Social networks · Mobile phone metadata
1 Introduction
to a graph topology information. These graphs present specific structures carrying many different
characteristics, such as small-worldness or heterogeneous degree distribution [32]. The assortativity
of social networks, defined as the nodes’ tendency to be linked to others which are similar in some
sense [2], with respect to various demographics of their individuals such as gender, age, weight,
income level, education, race and religion, is well documented in the literature [22, 24, 31, 42, 48].
This property has been theorized to come from either influences or homophily or a combination of
both [2]. For instance, Rosenquist et al. [36] showed that social influence can enhance the spreading
of alcohol consumption and Madan et al. [22] found that weight changes in an individual can be
influenced by exposure to overweight peers with unhealthy habits or inactive lifestyles. On the other
hand, the concept of homophily is easily understood as the saying goes: ‘birds of a feather flock
together,’ which means that people sharing some characteristics tend to communicate more. For
instance, we observe more connections between people of the same age and gender [24].

Independently of its cause, this assortativity can be used for individual prediction purposes when
some labels are missing or uncertain, e.g., for demographics prediction in large networks. The task of
predicting missing node labels in networks, known as node classification, makes use of the known
labels and the graph structure [14], which embeds some properties such as its assortativity. Different
methods of node classification, based either on feature extraction or on random walks [5], were
recently developed. On the one hand, some feature extraction-based approaches aim at exploiting
network assortativity [1, 16]. The general idea is, for each node, to build a feature vector
summarizing information from its neighborhood. A machine learning algorithm can then be employed to
predict the unknown labels based on these extracted features. In this setting, the neighborhood
definition is highly important and can be carried out in different ways [15]. To define feature
vectors describing each node’s neighborhood, graph embedding techniques can be considered [14]. For
instance, Grover and Leskovec automated the feature extraction to preserve neighborhoods reflecting
the local structures and/or the communities [15]. This approach is well suited for classification
tasks as it can account for diverse node neighborhoods which can be related to the node labels.
However, the feature extraction-based studies do not take the global network structure into account
and could hence further benefit from its properties. Indeed, the feature extraction is constrained by
the subsequent classification algorithm that is used: The fixed number of features and their ordering
cannot faithfully reflect complex relationships, observed in social networks for instance, with
diverse kinds of network substructures related to the users’ labels [44]. Also, this kind of approach
is not intended to directly exploit uncertain label predictions with confidence levels (i.e., class
probabilities) [40].

On the other hand, random walk-based approaches allow to account for the whole network structure by
propagating the labels through iterative updates [50, 51]. Several variants and adaptations of this
principle were proposed to solve diverse labeling tasks, such as video suggestions [3] or
demographics prediction in networks [38]. Although these methods aim to model the network structure
as a whole, they are based on an implicit model of the joint probability distribution of all the node
labels [5]. As an alternative, inference approaches using probabilistic graphical models (PGMs) were
developed as the PGM modeling explicitly fully describes the interactions between the nodes [10].

Nevertheless, none of the current approaches investigates the improvement of uncertain predictions,
which can be obtained by a classical machine learning algorithm predicting the labels based on
individual profiles, while modeling the network structure as a whole. Instead, the current studies
focus on the propagation of known labels through a network. In addition, to the best of our
knowledge, no research quantifies how the performances of label predictions in a network evolve as a
function of the assortativity strength.

In this work, we propose a general framework based on probabilistic graphical models (PGMs) to
exploit the social network structure to improve uncertain individual predictions and infer missing
labels. The method can be applied while only knowing the labels of a limited number of pairs of
connected users in order to evaluate the assortativity. Then, the inference process is based on class
probability estimates for each user. These initial class probabilities may be obtained (1) by
considering a subset of labeled users or (2) from a machine learning algorithm applied on the
node-level individual features. A loopy belief propagation algorithm is afterward applied on a Markov
random field modeling the network to improve the accuracy of the class probability estimates. The
model is able to benefit from the strength of the links, quantified for example by the number of
contacts. The estimation of the network assortativity allows the model parameters to be tuned
optimally, by defining synthetic graphs. The latter simulations permit (1) to prevent overfitting a
given (real) network structure, (2) to perform the parameter tuning off-line and (3) to avoid
requiring the labeled users to form a connected graph. These simulations also allow to quantify the
assortativity range leading to an accuracy gain over an approach ignoring the network structure. The
methodology is validated on real-world mobile phone data to predict gender. As the assortativity
required to significantly improve the quality of the prior class probabilities might not always be
reached in practice, we show that the assortativity significantly changes according to some
communication attributes, which can in turn be exploited to improve the predictions by appropriately
adapting the model parameters in different parts of the network. Experiments on a real-world mobile
phone network suggest the statistically significant superiority of our methodology over
state-of-the-art algorithms, namely the reaction–diffusion label propagation method [38] and three
machine learning classifiers relying on features extracted by the Node2vec graph embedding
technique [15].

The paper is organized as follows. Section 2 introduces the framework of probabilistic graphical
models (PGMs) and the notations. The readers who are already familiar with this field can safely skip
this part. Then, the general methodology to improve attribute predictions in a network is detailed in
Sect. 3. Its key parameters are highlighted, and their tuning based on simulated assortative networks
is detailed. In Sect. 4, we introduce the real-world data sets which are studied, analyze their
underlying gender homophily and assess the performances of our method compared to state-of-the-art
algorithms, based on feature extraction using Node2vec [15] and on label propagation using the
reaction–diffusion algorithm [38]. Section 5 then discusses the results and describes the related
work more extensively. Conclusions are drawn in Sect. 6.

2 Background and notations

probability distributions over $X$ which, importantly, admit a particular factorization according to
the graph structure. Graphical models aim to represent compactly distributions over interacting
variables, allowing to decrease the complexity of inference processes. There exist mainly three kinds
of PGMs: undirected graphical models (also called Markov random fields (MRFs) or Markov networks),
directed acyclic graphical models (DAGs or Bayesian networks) and factor graphs [20]. MRFs are
employed in this work and are defined as follows.

Definition 1 (Markov random field (MRF)) An undirected graphical model, or MRF, represents a family
of probability distributions over $X$ using an undirected graph $G_M$. The implied variables satisfy
the graph separation property: For any three sets of nodes $H$, $B$ and $D \subseteq S$ in the PGM
and their associated vectors of random variables $X_H$, $X_B$ and $X_D$, $X_H$ is independent from
$X_B$ conditionally to $X_D$ ($X_H \perp\!\!\!\perp X_B \mid X_D$) when any path in the graph from
one node in $H$ to one node in $B$ contains a node in $D$.

The Hammersley–Clifford theorem relates this definition to the factorization of the joint
distribution induced by the graph:

Theorem 1 (Hammersley–Clifford) A strictly positive distribution $p_X$ satisfies the graph separation
property if and only if it can be factored as

$$p_X(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_{X_C}(x_C), \tag{1}$$
2.2 Inference on PGMs

Inference aims to compute marginal probabilities or modes of a joint distribution [47]. Assuming that
discrete random variables are considered (otherwise, sums can be replaced by integrals), computing a
marginal such as $p_{X_1}(x_1)$ consists in summing over all the remaining variables:
$p_{X_1}(x_1) = \sum_{x_2} \cdots \sum_{x_N} p_X(x)$. When the joint distribution admits a
factorization such as (1), it can be used to reduce the computational cost of the inference. Indeed,
the sum over each variable can be performed on factors defined on subsets of nodes. The book of
Koller and Friedman provides some concrete examples [20]. This reasoning leads to the loopy belief
propagation (LBP) algorithm which can be applied on any kind of PGM. If the graph is a tree, LBP
converges to the correct marginals in a limited number of iterations [29]. Otherwise, the estimated
marginals are optimal in the Bethe–Kikuchi sense [47].

3 Method

Given an arbitrary social network $G$, the goal is to exploit its assortativity to infer, for each
user $i$, an individual scalar attribute (or class) $Y_i$ taking values in a finite alphabet
$\mathcal{Y}$. This class can be, for instance, the age or gender of each individual. The graph $G$
is defined as a pair $(V, E)$, where $V$ and $E$ are, respectively, the sets of nodes (one for each
user) and edges (connecting each pair of individuals who are in contact), with $|V| = N$. The
available individual information about user $i$ is denoted by the random vector $X_i$. For example,
in the case of Twitter, $x_i$ could consist in the tweets generated by user $i$ and possibly in
public profile details (e.g., the user’s name). It is assumed that estimates
$\hat{p}_{Y_i|X_i}(y_i|x_i)$ of the class membership probabilities $p_{Y_i|X_i}(y_i|x_i)$ are
provided. These can be seen as ‘initial predictions’ for each user $i \in V$, which can encode
deterministic information (known labels) or which can be output by a machine learning algorithm
applied on the individual features $x_i$ to predict the class $y_i$. If such information is missing
for some users, uniform class probabilities of $1/|\mathcal{Y}|$ are used. In what follows, $Y$
(resp. $X$) denotes the concatenation of all the $Y_i$’s (resp. $X_i$’s).

The rest of this section is structured as follows. Our inference model is built in Sect. 3.1 based on
the social network, and the employed message-passing algorithm is detailed in Sect. 3.2. Next, in
Sect. 3.3, by simulating individual predictions $\hat{p}_{Y_i|X_i}(y_i|x_i)$ and synthetic networks,
we assess how the performance enhancement is related to the network assortativity and to the quality
of the initial set of predictions, in terms of both accuracy and distribution. This procedure permits
to determine the best model parameters.

3.1 Probabilistic graphical model

In order to improve the initial predictions $\hat{p}_{Y_i|X_i}(y_i|x_i)$, the joint probability
distribution $p(Y, X)$ is modeled through an undirected PGM $G_M$ (also called Markov random field,
MRF). The MRF has one node (resp. one edge) for each user (resp. link) in the social network. The
random variables $Y_i$ that we want to infer are assigned to the nodes of the network; each link
represents a conditional dependency between two of them. As indicated in Fig. 1, the graphical model
$G_M$ contains $N$ additional nodes associated with the $X_i$’s, each one being linked to its
corresponding $Y_i$ (as in [49] for instance). The relationships between the individual data $X_i$
and the label $Y_i$ of each user $i$ are hence captured, as well as the direct mutual influence of
adjacent users. We choose an undirected graphical model to characterize the statistical dependencies
between the considered random variables, since there is no causal link between the labels in the
social network which could be represented with a directed PGM. Also, the joint distribution
$p(Y, X)$ does not admit a natural factorization through conditional probabilities [47]. Instead, our
MRF represents conditional independencies. As a result, the graph separation property [20] indicates
that the joint probability distribution $p(Y, X)$ modeled by the PGM admits the factorization

$$p_{Y,X}(y, x) = p_Y(y) \prod_i p_{X_i|Y_i}(x_i|y_i). \tag{2}$$

The assumption underlying the graphical representation is that $X_i$ given $Y_i$ is conditionally
independent from $Y_j$ and $X_j$, for all $j \neq i$. Namely, the generative probabilities of the
features given the class of each node are assumed to be conditionally independent.

Fig. 1 Toy example of the Markov random field. There are two nodes per user $i$ in the graph, $Y_i$
being her class and $X_i$ her individual data

Let us use the notations
$$\psi(y_i, x_i) := p_{X_i|Y_i}(x_i|y_i) \propto \frac{p_{Y_i|X_i}(y_i|x_i)}{p_{Y_i}(y_i)}. \tag{3}$$

As $\psi$ is defined over the random variables associated with each node of the social network $G$,
it is called the node potential. It corresponds to the likelihood of the $i$th user’s individual data
knowing her class. Besides, factorization (2) entails the joint distribution $p_Y(y)$ of the labels.
The Hammersley–Clifford theorem indicates that the latter can be factored as a product of nonnegative
functions defined over cliques in $G_M$. We choose to represent pairwise interactions for $p_Y(y)$,
i.e., we define a clique potential over each edge binding two $y_i$’s. According to our PGM, as
illustrated in Fig. 1, the factorization hence develops as

$$p_{Y,X}(y, x) = \frac{1}{Z} \prod_i \psi(y_i, x_i) \prod_{(j,k) \in E} \Psi(y_j, y_k), \tag{4}$$

where $Z$ is the partition function (a normalization constant) and $\psi$ and $\Psi$, respectively,
denote the node and edge potentials. The $i$th node potential $\psi(y_i, x_i)$ can be estimated using
the first predicted class probability $\hat{p}_{Y_i|X_i}(y_i|x_i)$ and an estimated class prior
$\hat{p}_{Y_i}(y_i)$, which can be defined as the proportion of users initially predicted as $y_i$.
Besides, in order to reflect either the label assortativity or disassortativity of each link, the
edge potential $\Psi(y_j, y_k)$ for each pair of adjacent users $j$ and $k$ can be defined as

$$\Psi(y_j, y_k) = \begin{cases} s_{jk}, & \text{if } y_j = y_k \\ 1 - s_{jk}, & \text{if } y_j \neq y_k \end{cases} \tag{5}$$

with $s_{jk} \in [0, 1]$ and $y_j, y_k \in \mathcal{Y}$. It is noteworthy that if $s_{jk}$ is
greater than 0.5, $\Psi(y_j, y_k)$ will encourage users $j$ and $k$ to share the same class. At the
opposite, an $s_{jk}$ value smaller than 0.5 will favor neighboring users $j$ and $k$ to have
different labels (anti-homophilic contacts). This parameter of label compatibility over the edges can
hence be interpreted as the probability for edge $(j, k)$ to be homophilic.

Depending on the application, one may have access to some edge weights, which can be used to model
these $s_{jk}$. Section 4.4 provides an example of such a refinement in the context of a real-world
application. Another option is to employ a constant $s_{jk}$ value for all the edges.

3.2 Inference algorithm

Along with factorization (4) of the joint probability distribution, the defined PGM structure enables
efficiently inferring the posterior probabilities $p_{Y_i|X}$, from which enhanced predictions of the
users’ label are derived. Exact inference on the loopy MRF is intractable, as it would require using
the junction tree algorithm [20] which, even if all the maximal cliques in $G$ were identified, has
an exponential complexity in the size of the largest one. This motivates relying on factorization
(4), with pairwise potentials only, and leads to the loopy belief propagation (LBP) algorithm [20].
As further detailed hereunder, LBP provides estimates of the posterior probabilities
$\hat{p}_{Y_i|X}(y_i|x)$ for each node $i$ in the graph and for all $y_i \in \mathcal{Y}$. These
estimates approximate the true posterior probabilities $p_{Y_i|X}(y_i|x)$ in the Bethe–Kikuchi
sense [47]. The predicted class for user $i$ is then given by
$\arg\max_{y_i \in \mathcal{Y}} \hat{p}_{Y_i|X}(y_i|x)$.

Computing the conditional probability of a random variable $Y_i$, given the observed variables,
consists in marginalizing over the remaining unobserved variables. A normalization step at the end
ensures that we have a valid conditional distribution. The intuition behind belief propagation
algorithms is to perform these marginalizations efficiently, by avoiding to repeatedly compute the
same intermediate sums. As a result, LBP is an iterative algorithm in which, at each iteration $t$,
every node $j$ sends a message $m_{jk}^{t}$ to each of its neighboring nodes $k$ defined as

$$\frac{m_{jk}^{t}(y_k)}{k_1} = \sum_{y_j \in \mathcal{Y}} \left( \psi(y_j, x_j)\, \Psi(y_j, y_k) \prod_{u \in \mathcal{N}(j) \setminus k} m_{uj}^{t-1}(y_j) \right), \tag{6}$$

for $y_k \in \mathcal{Y}$ and where $\mathcal{N}(j)$ is the set of neighbors of user $j$. The
normalization constant $k_1$ is chosen such that the messages on each edge and direction sum to 1:
$\sum_{y_k \in \mathcal{Y}} m_{jk}^{t}(y_k) = 1$. The initial messages $m_{jk}^{0}$ are set to
$1/|\mathcal{Y}|$. The summation over the values of the random variable $Y_j$ consists in
marginalizing this variable. The message $m_{jk}^{t}(y_k)$ can be interpreted as all the information
the sender (node $j$) can provide to the receiver (node $k$) on the probability for node $k$ to lie
in state $y_k$. After the convergence of the $2N$ messages after $t$ iterations and a normalization
step, estimates of the posterior probabilities $p_{Y_i|X}(y_i|x)$, termed as beliefs and denoted by
$b(y_i)$, can be computed for each node $i$ in the graph and for all $y_i \in \mathcal{Y}$ as
follows:

$$p_{Y_i|X}(y_i|x) \approx b(y_i) := k_2\, \psi(y_i, x_i) \prod_{u \in \mathcal{N}(i)} m_{ui}^{t}(y_i), \tag{7}$$

where $k_2$ is a normalization constant such that $\sum_{y_i \in \mathcal{Y}} b(y_i) = 1$. The
predicted class for user $i$ is the one maximizing the estimated posterior probability:

$$\hat{y}_i = \arg\max_{y_i \in \mathcal{Y}} b(y_i). \tag{8}$$

This procedure enables handling large graphs as the complexity of a single message-update iteration
is $O(|E|\,|\mathcal{Y}|^2)$. In comparison, a brute-force marginalization has a complexity of
$O(N\,|\mathcal{Y}|^N)$. It can also be noted that (6) highlights the influence of the edge potential
$\Psi$: An $s_{jk}$ larger than 0.5 on a given edge encourages neighboring users to share the same
class.
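The message-passing scheme of Eqs. (3) and (5)–(8) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation evaluated in the paper: the function name `loopy_belief_propagation`, its arguments, the synchronous update schedule, the stopping tolerance and the toy three-user path graph are all hypothetical. A single constant $s$ is used for every edge (one of the options mentioned in Sect. 3.1), and the class prior is assumed to be strictly positive.

```python
def loopy_belief_propagation(edges, initial_probs, class_prior, s, n_iter=50, tol=1e-9):
    """Return the beliefs b(y_i) of Eq. (7) for every node (illustrative sketch)."""
    classes = range(len(class_prior))
    # Node potentials, Eq. (3): psi(y_i, x_i) proportional to p(y_i | x_i) / p(y_i).
    psi = {i: [p[y] / class_prior[y] for y in classes] for i, p in initial_probs.items()}

    def edge_potential(yj, yk):
        # Eq. (5) with a constant s on all edges (assumption of this sketch).
        return s if yj == yk else 1.0 - s

    neighbors = {i: set() for i in initial_probs}
    for j, k in edges:
        neighbors[j].add(k)
        neighbors[k].add(j)

    # One message per edge and direction, initialised to 1 / |Y|.
    uniform = 1.0 / len(class_prior)
    messages = {(j, k): [uniform for _ in classes] for j in neighbors for k in neighbors[j]}

    for _ in range(n_iter):
        updated = {}
        for j, k in messages:
            out = []
            for yk in classes:
                total = 0.0
                for yj in classes:  # marginalise over the sender's label, Eq. (6)
                    term = psi[j][yj] * edge_potential(yj, yk)
                    for u in neighbors[j]:
                        if u != k:
                            term *= messages[(u, j)][yj]
                    total += term
                out.append(total)
            norm = sum(out)  # normalisation so that each message sums to 1
            updated[(j, k)] = [v / norm for v in out]
        change = max(
            (abs(a - b) for key in messages for a, b in zip(updated[key], messages[key])),
            default=0.0,
        )
        messages = updated
        if change < tol:
            break

    beliefs = {}
    for i in neighbors:
        b = list(psi[i])
        for u in neighbors[i]:
            b = [b[y] * messages[(u, i)][y] for y in classes]  # Eq. (7)
        norm = sum(b)
        beliefs[i] = [v / norm for v in b]
    return beliefs


# Toy path graph 0 - 1 - 2: only user 0 has an informative initial prediction.
edges = [(0, 1), (1, 2)]
initial_probs = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.5, 0.5]}
beliefs = loopy_belief_propagation(edges, initial_probs, class_prior=[0.5, 0.5], s=0.8)
# Eq. (8): predicted class = index of the largest belief.
predicted = {i: max(range(len(b)), key=b.__getitem__) for i, b in beliefs.items()}
print(beliefs, predicted)
```

Because the toy graph is a tree, the beliefs coincide with the exact marginals: with $s = 0.8$, the confident prediction of user 0 pulls its neighbors toward the same class, more weakly as the distance grows.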
3.3 Parameter tuning

The $s_{jk}$ values of the edge potential (5) need to be determined. As these parameters reflect the
confidence in the (dis-)assortative character of the edges, their tuning should be related to the
network assortativity. The latter quantity hence has to be quantified, which is detailed in
Sect. 3.3.1. Then, after defining synthetic networks with adjustable assortativity in Sect. 3.3.2,
Sects. 3.3.3 and 3.3.4 study the influence of the assortativity on the model parameters and on the
performances. This is done both by simulating individual predictions and by assuming that a subset of
labels are known.

3.3.1 Assortativity coefficient

To quantify the assortativity of a network for a given node attribute, Newman introduced the
assortativity coefficient, denoted by $r$ [31]. It assesses the correlation between the attributes of
adjacent nodes, which can be categorical such as the gender or the political affiliation. For scalar,
discrete or continuous, attributes such as the user’s age or the node degree, a numeric assortativity
coefficient is defined. In the following, we focus on the assortativity coefficient defined for
categorical attributes on an undirected graph.

The assortativity coefficient can be derived thanks to the symmetric mixing matrix
$M = (m_{ij})_{i,j=1}^{L}$, where $m_{ij}$ is half the fraction (resp. the fraction) of edges
connecting a vertex of class $i$ to a vertex of class $j$ when $i \neq j$ (resp. when $i = j$), and
$L = |\mathcal{Y}|$ is the total number of classes of the attribute of interest. Each of the row sums
of the mixing matrix, denoted by $m_i := \sum_j m_{ij}$, gives the proportion of ends of edges from
class $i$. It corresponds to the sum of degrees of the nodes from class $i$ divided by the number of
ends of edges (i.e., twice the number of edges). For a discrete attribute on an undirected graph, the
assortativity coefficient expresses as

$$r = \frac{\sum_i m_{ii} - \sum_i m_i^2}{1 - \sum_i m_i^2} \in [-1, 1]. \tag{9}$$

If all the edges lie between pairs of people of the same class, the network is perfectly assortative
and it is straightforward to derive that $r = 1$. At the opposite, in a

et al. [24], the latter attribute is among the most homophilic ones.

In the special case of a binary attribute, the mixing matrix becomes

$$M = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix}, \tag{10}$$

and the assortativity coefficient is defined as

$$r = \frac{m_{11} + m_{22} - m_1^2 - m_2^2}{1 - m_1^2 - m_2^2}. \tag{11}$$

In this particular setting, the assortativity of a perfectly disassortative network reaches $-1$.
Indeed, there can only be as many nodes from each of the two classes at the ends of the edges of such
a network, since each edge is between two users from distinct classes. Hence, $m_1 = m_2 = 0.5$ and
$r$ is equal to $-1$.

As it is most of the time unknown, $r$ should be reliably estimated in a real setting. An efficient
possibility consists in edge sampling, as described in Sect. 4.4 in the case of gender prediction in
a mobile phone network. We hence assume in the following that an accurate estimate of $r$ is
provided.

For a given network, the model parameters $s_{jk}$ of the edge potential $\Psi$ can be optimized
according to our confidence in the (dis-)assortativity of each link $(j, k)$. If our sole knowledge
about assortativity is $r$, a constant $s_{jk}$ value (denoted by $s$) can be used for all the edges.
This $s$ characterizes the confidence in the network information, which is proportional to $|r|$: as
indicated by (4), large $|0.5 - s|$ values dilute the initial predictions contained in the node
potential $\psi$ and give a heavy weight to the network, while at the opposite an $s$ value close to
0.5 will not change the initial predictions by much, since $\Psi$ will remain roughly constant when
its arguments (i.e., the class labels) are either equal or different. Synthetic networks, defined in
the next section, with assortativity coefficients close to a given $r$, enable us to find an optimal
$s$. To this aim, a grid search is performed: LBP is applied on the MRF with each $s$ value from the
grid, and the one achieving the highest average performances on different synthetic networks is kept
as optimal. Employing a grid search is convenient as it yields robust results. Its usage is
affordable thanks to the efficiency of LBP and since a single parameter needs to be
perfectly disassortative network, r will range in ½1; 0½, as
optimized. Alternative optimization schemes will be con-
detailed in [31]. In the intermediate case, a random mixing
sidered in future works and may only be beneficial for the
occurs when the classes of two connected users are inde-
performances of our approach.
pendent. Hence, mii ¼ m2i which implies that r ¼ 0. Many
To get a clear picture of our approach and to identify
studies show that social networks tend to be more assor-
more easily the meaning of the following sections, the
tative than other ones (e.g., technological or biologi-
different steps are summarized in Fig. 2.
cal) [8], with positive assortativity coefficients ranging up
to 0.6 [32] for attributes like race of partners in a bipartite
graph of sexual partnerships. According to McPherson
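As an illustration of Eq. (9), the following minimal sketch computes the categorical assortativity coefficient from an undirected edge list. The function name `assortativity`, the edge-list representation and the `labels` mapping are assumptions made for this example; they are not taken from the paper.

```python
from collections import defaultdict


def assortativity(edges, labels):
    """Categorical assortativity r of Eq. (9) for an undirected edge list.

    edges  : list of (u, v) pairs (hypothetical input format)
    labels : mapping node -> class
    Note: r is undefined (division by zero) if all edge ends share one class.
    """
    n_edges = len(edges)
    m = defaultdict(float)                   # mixing matrix entries m[(a, b)]
    for u, v in edges:
        a, b = labels[u], labels[v]
        if a == b:
            m[(a, a)] += 1.0 / n_edges       # fraction of edges on the diagonal
        else:
            m[(a, b)] += 0.5 / n_edges       # half the fraction in each
            m[(b, a)] += 0.5 / n_edges       # off-diagonal cell
    classes = {labels[u] for edge in edges for u in edge}
    row = {a: sum(m[(a, b)] for b in classes) for a in classes}   # m_i
    trace = sum(m[(a, a)] for a in classes)                       # sum_i m_ii
    sq = sum(x * x for x in row.values())                         # sum_i m_i^2
    return (trace - sq) / (1.0 - sq)
```

On a toy graph with two same-class edges the function returns 1, and with two cross-class edges it returns -1, which matches the limiting cases discussed for the binary setting.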
Neural Computing and Applications
[Figure 2: flowchart of the method. Box A: estimating the assortativity r (edge sampling, Sect. 4.4.3; social theories, e.g. [44]). Box B: build synthetic networks with assortativity r (Sect. 3.3.2). Box C: fit the edge potential parameters $s_{jk}$ of Ψ by grid search on synthetic networks for constant s (Sect. 3.3.3). Box D: individual predictions or subset of known labels, providing the estimate $\hat{p}_{Y_i|X_i}(y_i|x_i)$. Box E: compute the node potential $\psi(y_i, x_i)$, Eq. (3). Box F: apply LBP with node and edge potentials (Sect. 3.2). Box G: predicted class probabilities $\hat{p}_{Y_i|X}(y_i|x) = \hat{p}_{Y_i|X}(y_i|\{x_i\}_{i=1}^{N})$.]

Fig. 2 Summary of the proposed method. To apply the LBP algorithm to infer the node labels using the label assortativity, the edge potential (box C) and node potential (box E) have to be defined. The edge potential can be deduced from the network assortativity r (box A), which can be estimated thanks to social theories or edge sampling, as detailed in Sect. 4.4.3. The estimated r can then be employed to generate synthetic graphs on which the edge potential parameters are tuned (box B). On the other hand, the initial individual predictions (box D) are used to define the node potentials. The LBP algorithm (box F), using the node and edge potential, yields predicted class membership probabilities $\hat{p}_{Y_i|X}(y_i|x)$ (box G). It is noteworthy that $X = \{X_i\}_{i=1}^{N}$ concatenates all the $X_i$'s. Therefore, the predictions $\hat{p}_{Y_i|X}(y_i|x)$ (box G) rely on the whole network structure, whereas the initial predictions $\hat{p}_{Y_i|X_i}(y_i|x_i)$ (box D) are only based on the individual node-level information
3.3.2 Synthetic networks

The construction of the synthetic networks relies on the same principle as the Watts–Strogatz small-world graphs [30]. It first starts with a regular circular lattice $G^R = (V^R, E^R)$, each of the n nodes being linked to its k closest neighbors in a ring topology, where k is even. The attribute values $y_i$ that need to be inferred are randomly assigned to each node i by sampling a given distribution. Some edges are then rewired in the graph until the obtained assortativity coefficient is sufficiently close to the targeted one, denoted by r. This last step is detailed by the following procedure, illustrated in Fig. 3:

    r^R ← assortativity of G^R
    while |r^R − r| > tolerance do
        if r^R < r then
            Randomly select an edge (i, j) ∈ E^R which is not a bridge and such that y_i ≠ y_j
            E^R ← E^R \ {(i, j)}
            Add a random edge (i, l) in G^R such that y_i = y_l
        else
            Randomly select an edge (i, j) ∈ E^R which is not a bridge and such that y_i = y_j
            E^R ← E^R \ {(i, j)}
            Add a random edge (i, l) in G^R such that y_i ≠ y_l
        end if
        r^R ← assortativity of G^R
    end while

It can be noted that if one makes additional assumptions on the graph structure, different steps in the generation of the synthetic networks could also be considered. For instance, the LFR model allows controlling the community structure (the community size distribution and the proportion of within-community edges) and the degree distribution to obtain more realistic graphs [33]. Besides, if the mixing matrix was constrained, it could be used to refine the network simulations [31]. Our simulated networks are chosen here to only control the assortativity with respect to the node label, without additional constraint on the graph properties.

It remains to endow the synthetic network nodes with prior class probability estimates $\hat{p}_{Y_i|X_i}(y_i|x_i)$. In practice, these probabilities can either be obtained from a machine learning algorithm applied on the individual features $x_i$ of each user, or from a subset of labeled users. Both of these situations can be handled in the context of the synthetic networks, as detailed in the next two sections.

3.3.3 Individual predictions

In a given application, a machine learning algorithm predicting the classes $y_i$ from the individual features $x_i$ gives access to prior information for all the users of the real network. Sampling the distribution of these individual predictions $\hat{p}_{Y_i|X_i}(y_i|x_i)$ enables assigning prior class probability estimates to the nodes of the synthetic graphs, which may afterward be employed to determine the optimal model parameters. Nevertheless, in order to analyze the behavior of our method when it is confronted with different uncertainty patterns, we here generate these prior probabilities for a binary label according to three synthetic distributions: linear, exponential and bi-uniform, as depicted in Fig. 4. The proportion of correct initial predictions, i.e., the initial accuracy, has to be controlled as it will influence the performance of the subsequent algorithms employed to refine these predictions using the network information. The initial classification rule amounts to predicting $y_i^I = \arg\max_{y_i} \hat{p}_{Y_i|X_i}(y_i|x_i)$. Therefore, the initial accuracy, denoted by b, corresponds to the fraction of users i for whom $\hat{p}_{Y_i|X_i}(c_i|x_i) \geq 0.5$ when the label is binary, where $c_i$ is the true class of user i. The distributions of the class probabilities cover three situations with different levels of difficulty for the subsequent classification task, depending on whether the amplitude of $\hat{p}_{Y_i|X_i}(y_i|x_i)$ is more or less
related to the prediction correctness. With the linear and exponential distributions, the probability for a prediction to be actually correct increases with the available confidence level, whereas with the bi-uniform distribution, when $\hat{p}_{Y_i|X_i}(y_i|x_i) \geq 0.5$, the proportion of correct predictions does not increase with the confidence level $\hat{p}_{Y_i|X_i}(y_i|x_i)$. All three distributions are sampled by inverse transform sampling [9].

The results of the parameter tuning procedure are depicted in Fig. 5 for an arbitrary binary attribute, such as the gender. The best s value and the corresponding mean accuracy gain, with respect to the accuracy b of the initial individual predictions, are provided as a function of the assortativity and the accuracy b. For each pair of b and r, the optimal s value is selected as the one maximizing the average accuracy over 30 random networks with 200 vertices containing as many nodes from each one of the two classes. The randomness covers the edge rewiring in the networks, the attribute assignments and the sampling of the prior probabilities. From the top to the bottom row of figures, the prior probabilities are simulated using the three distributions illustrated in Fig. 4.

It can be observed that the optimal s values are almost independent of b, and hence, the parameterization mainly depends on the assortativity coefficient, using any of the three distributions of initial predictions. Also, the chosen s evolves in a consistent way as a function of the assortativity r, increasing from smaller values for disassortative networks to higher values for assortative ones. Using these optimal s values when prior probabilities are linearly or exponentially distributed, Fig. 5b, d shows that the accuracy gain is almost always positive, except for some particular pairs of r and b, especially when the assortativity is within the range $[-0.1, 0.1]$. This observation is consistent as our PGM is designed to exploit the assortativity, which is absent if $r = 0$, corresponding to a randomly mixed network according to the considered attribute. The results obtained using initial predictions drawn from the linear and exponential distributions are very similar. For the bi-uniform individual predictions however, much lower accuracy gains are observed. These results can be explained as, in this case, only the sign of $\hat{p}_{Y_i|X_i}(c_i|x_i) - 0.5$ brings information on the true class probabilities, where $c_i$ is the true class of user i. On the other hand, its amplitude also matters for the linear and exponential distributions. It would also most probably be the case in real settings: if an ML algorithm outputs a high confidence level about an individual prediction, the probability for this prediction to be indeed correct should be higher than for another prediction with a lower confidence level. In this respect, the bi-uniform distribution may not be very realistic and could correspond to a worst-case scenario. The extension to the case of nonbinary attributes is straightforward, possibly by employing the numeric assortativity coefficient, e.g., in the case of the age attribute.

3.3.4 Labeled data

Prior information about the users' class can also consist in a subset of labeled users. In this case, the class of a fraction b of all the network users is known, whereas no prior clue is provided about the class of the remaining fraction $1 - b$ of users. The symbol b is again used in this section, by analogy with the accuracy of the initial individual predictions of Sect. 3.3.3. Figure 6 shows the optimal s parameter computed and the accuracy of the predictions obtained on the unlabeled nodes in synthetic networks, as a function of the fraction b of known labels and the assortativity
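The rewiring procedure of Sect. 3.3.2, which produces the synthetic networks used for the tuning above, can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: the labels are binary and balanced, the bridge test is done by a breadth-first connectivity check, and the names `synthetic_network`, `is_connected`, `tol`, `seed` and `max_iter` are hypothetical, not from the paper.

```python
import random
from collections import deque


def assortativity(edges, y):
    """Binary-label assortativity coefficient, as in Eq. (11)."""
    n_edges = len(edges)
    same = sum(1 for u, v in edges if y[u] == y[v]) / n_edges    # m11 + m22
    m1 = sum((y[u] == 1) + (y[v] == 1) for u, v in edges) / (2 * n_edges)
    sq = m1 ** 2 + (1 - m1) ** 2                                 # m1^2 + m2^2
    return (same - sq) / (1 - sq)


def is_connected(n, edges):
    """Breadth-first check that all n nodes are reachable from node 0."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, queue = {0}, deque([0])
    while queue:
        for w in adj[queue.popleft()]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == n


def synthetic_network(n, k, target_r, tol=0.02, seed=0, max_iter=100_000):
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]           # assumption: two balanced classes
    rng.shuffle(y)
    # Regular ring lattice: each node is linked to its k closest neighbors.
    edges = {tuple(sorted((i, (i + d) % n)))
             for i in range(n) for d in range(1, k // 2 + 1)}
    for _ in range(max_iter):
        r = assortativity(edges, y)
        if abs(r - target_r) <= tol:
            break
        raise_r = r < target_r
        # To raise r, drop a heterophilous edge; to lower it, a homophilous one.
        removable = sorted(e for e in edges if (y[e[0]] == y[e[1]]) != raise_r)
        if not removable:
            break
        i, j = rng.choice(removable)
        if not is_connected(n, edges - {(i, j)}):
            continue                         # (i, j) is a bridge: pick again
        # Reconnect i to a node of the same class (raise r) or the other class.
        targets = [l for l in range(n)
                   if l != i and (y[l] == y[i]) == raise_r
                   and tuple(sorted((i, l))) not in edges]
        if not targets:
            continue
        l = rng.choice(targets)
        edges.remove((i, j))
        edges.add(tuple(sorted((i, l))))
    return edges, y
```

For instance, `synthetic_network(200, 8, 0.3)` mirrors the setting of 200 nodes with mean degree 8 reported for the tuning experiments. The number of edges stays constant because each removed edge is replaced by exactly one new edge.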
Fig. 6 Parameter tuning on synthetic networks when a fraction of node labels are known, defining a training set. a Optimal s parameter of the edge potential (5), with a constant $s_{jk}$ for all the edges of the networks and b mean accuracy (in %) on the unlabeled nodes over 30 random synthetic networks. The s parameter is tuned by considering a grid with a 0.05 step. Each network has 200 nodes and a mean degree $k = 8$. The results are given as a function of the fraction of known node labels (in %) and the assortativity coefficient when considering a binary attribute
developing countries. A shortcoming to their use however is that they often lack even the most basic information about their carrier, such as the gender, age or socioeconomic status. Indeed, most of the mobile phone connections worldwide are prepaid, in developing as well as in developed countries. Although these connections provide fine-grained information about the mobile phone usage, they do not give any access to basic demographics.

The data set is first introduced in Sect. 4.1, detailing some of its features suggesting significant gender homophily which can be exploited for the inference process. Section 4.2 quantifies the gender assortativity as well as its dependence on some mobile communication attributes. Section 4.3 builds on the conclusions of Sect. 4.2 to refine the model parameter tuning, accounting for the communication patterns for the edge potential definition. Section 4.4 discusses the performance of the proposed methodology, by comparing it with the results of state-of-the-art classification methods based on label propagation and feature extraction.

In the following, $X_i$ denotes the individual metadata of user i and $Y_i$ is the random variable for her gender, defined on the alphabet $\mathcal{Y} = \{F, M\}$ with F and M, respectively, for a female and male.

4.1 Data description

Two undirected and weighted mobile phone networks, denoted by $G_S$ and $G_L$, are used in this section: the data analysis of this work is only conducted on $G_L$, while the performance assessment is performed on $G_S$. This avoids overfitting the particular network $G_L$. In both $G_L$ and $G_S$, each node refers to one individual and an undirected edge binds any pair of users who exchanged at least one phone call or text during a fixed time period. The gender is known for the majority of the users. The communication attributes, extracted from the Call Detail Records (CDRs), of any edge e are the number of texts (sms), the number of calls (calls) and the total duration of the calls (call dur). Different functions of these edge attributes can be defined. For example, the sum of sms and calls is denoted by s and c and counts the number of contacts between two given persons, which is well suited to characterize the strength of a social link [34].

Table 1 provides general features of both networks, as well as the mean values of the three attributes of the edges between persons of both the same and different genders. As indicated, the average communication patterns differ between hetero- and homogeneous (M–M and F–F) contacts. This reflects the stronger relationships occurring within the couples. Indeed, for instance in $G_L$, there are on average 6.4 and 9.7 contacts (calls and texts), respectively, between any homo- and heterogeneous pairs during the observation period. The same behavior is observed for the number of texts or calls distinctly. However, as shown in Fig. 7, there is no obvious dichotomy between the distributions of each attribute on the homo- and heterogeneous edges.

It can be mentioned that the mobile phone use of each individual according to her gender is not analyzed in this study, since this kind of information is typically exploited to provide the individual predictions. Finally, as the gender is binary, its assortativity coefficient is defined by (11). In $G_S$ and $G_L$, a moderate gender assortative mixing is observed.

4.2 Observational analysis

Since the strength of the heterogeneous communications, in terms of number of texts and calls exchanged, tends to exceed that of the homogeneous contacts, the weights of an edge might give clues on its likelihood to be rather hetero- or homogeneous. The subset of the strongest edges may hence have a completely different assortativity
Table 1 Some features of the networks

                      Edge type   Net. G_L       Net. G_S
Covered time period               15 days        3 months
Number of nodes                   160,818        19,779
Number of edges                   390,778        78,441
r (for gender)                    0.3            0.26
Homo. edges (%)                   66.47          63.5
Male nodes (%)                    56.38          53.44
Mean sms              homo.       3.58           15
                      hetero.     5.74           25.8
Mean calls            homo.       2.84           5.3
                      hetero.     3.96           7.9
Mean call dur         homo.       13 min 40 s    16 min
                      hetero.     15 min 20 s    19 min 20 s

If 'edge type' is omitted, the characteristic concerns the whole network. 'Homogeneous' (homo.) and 'heterogeneous' (hetero.) refer to the gender of the persons linked by the edges

than the whole network. As the performance of our approach increases with the assortativity amplitude, identifying stronger (anti-)homophilic subgroups is of great interest. This section shows that the assortativity r can indeed significantly change when considering subsets of the edges with specific weights. This kind of information can afterward be used to refine the edge potential, as indicated in box A of Fig. 2.

We analyze the evolution of the assortativity coefficient when subgraphs are constructed by only considering the edges with a scalar combination of their attributes above a threshold, the latter being progressively increased. For a given threshold and attribute combination, the strongest edges, according to the considered combination, constitute the strong part of the graph, while the weaker part refers to the rest. Several attribute combinations have been considered, including the attributes themselves. The most significant evolution of the assortativity r is obtained using s and c as a measure of link strength and is depicted in Fig. 8. The assortativity coefficient in the strong part (i.e., with edges such that s and c is higher than the threshold on the x-axis) is denoted by $r_{\mathrm{strong}}$, while $r_{\mathrm{weak}}$ is the one of the weak part. The number of edges in the strong part is denoted by $n_{\mathrm{strong}}$. The dotted lines indicate the threshold and the corresponding $r_{\mathrm{weak}}$, $r_{\mathrm{strong}}$ and $n_{\mathrm{strong}}$ values such that there are 1% of the edges in the strong part of $G_L$. Using this partition, $r_{\mathrm{weak}}$ is still equal to about 0.3, but $r_{\mathrm{strong}}$ reaches $-0.3$, meaning that the strong part is rather anti-homophilic, as suggested by Table 1. From a more general point of view, as the threshold on s and c increases, $r_{\mathrm{strong}}$ decreases toward disassortative values. Meanwhile, $r_{\mathrm{weak}}$ remains quite stable since most edges have small s and c, as indicated by the evolution of $n_{\mathrm{strong}}$ in logarithmic scale.

A refinement of the previous analysis consists in combining two thresholds on two different edge attributes in order to study how $r_{\mathrm{strong}}$ behaves. Figure 9 depicts such an evolution using the sms and calls attributes. The evolution of $r_{\mathrm{weak}}$ as a function of the two thresholds is negligible: it stays around 0.3, as in Fig. 8. Again, this figure highlights that the strongest edges are more disassortative. However, the strong part cannot be very large and have a significantly negative r at the same time, as most of the edges have low sms and calls values.

4.3 Refining the parameter tuning

In Sect. 3.3, we show how to select a constant $s_{jk}$ parameter of the edge potential for all the edges of a network with a given r (boxes B and C in Fig. 2). On the other hand, the analysis of Sect. 4.2 suggests that the assortativity significantly varies in distinct parts of a mobile phone network, decreasing as the strength of the links increases. This can be interpreted as a social theory (box A of Fig. 2). This information can be exploited by defining different s values in the strong and weak parts of the network, respectively, denoted by $s_{\mathrm{strong}}$ and $s_{\mathrm{weak}}$, defined from the tuning on synthetic networks (box C of Fig. 2). However, modeling $s_{jk}$ as a step function is questionable. Indeed $s_{jk}$ is the posterior probability for the edge (j, k) to be homophilic given its weights. Since this posterior probability is unlikely to abruptly change for some weight value, a smooth function should model it, with upper and lower plateaus corresponding to $s_{\mathrm{weak}}$ and $s_{\mathrm{strong}}$, respectively. Determining whether the edge (j, k) is hetero- or homophilic can moreover be seen as a binary classification problem, with the edge weights as features. Thus, inspired by logistic regression, we model $s_{jk}$ as a sigmoid function parameterized by a fixed linear combination s and c of the edge weights,

$$s_{jk}(\text{s and c}) = \frac{s_{\mathrm{weak}} - s_{\mathrm{strong}}}{1 + e^{G(\text{s and c} - x_0)}} + s_{\mathrm{strong}}, \qquad (12)$$

where G and $x_0$ are two parameters to determine. Following the observations of Sect. 4.2, the strong part of the network is defined as the set of the 1% strongest edges in terms of number of contacts. The plateaus $s_{\mathrm{weak}}$ and $s_{\mathrm{strong}}$ are tuned using the synthetic networks with constant $s_{jk}$ values, according to $r_{\mathrm{weak}}$ and $r_{\mathrm{strong}}$. Let us further denote by $x_U$ and $x_L$ the x-values at which the sigmoid reaches $s_{\mathrm{strong}} + 0.99\,(s_{\mathrm{weak}} - s_{\mathrm{strong}})$ and $s_{\mathrm{strong}} + 0.01\,(s_{\mathrm{weak}} - s_{\mathrm{strong}})$. The parameters G and $x_0$ are fixed such that there are approximately 1% of the edges with a number of contacts lower (resp. higher) than $x_U$ (resp. $x_L$). Figure 10
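To make the role of G and $x_0$ in the sigmoid of Eq. (12) concrete, the following sketch solves the two plateau conditions in closed form: requiring the logistic factor to equal 0.99 at $x_U$ and 0.01 at $x_L$ gives $G(x_U - x_0) = -\ln 99$ and $G(x_L - x_0) = \ln 99$, hence $x_0 = (x_U + x_L)/2$ and $G = 2\ln 99 / (x_L - x_U)$. The function names and the numeric values used below ($x_U = 1$, $x_L = 50$, $s_{\mathrm{weak}} = 0.7$, $s_{\mathrm{strong}} = 0.4$) are hypothetical and chosen only for illustration.

```python
import math


def fit_sigmoid(x_upper, x_lower):
    """Return (G, x0) such that the logistic factor of Eq. (12)
    equals 0.99 at x_upper and 0.01 at x_lower (x_upper < x_lower)."""
    x0 = 0.5 * (x_upper + x_lower)
    G = 2.0 * math.log(99.0) / (x_lower - x_upper)
    return G, x0


def s_jk(contacts, s_weak, s_strong, G, x0):
    """Edge potential parameter of Eq. (12) for a given number of contacts."""
    z = min(G * (contacts - x0), 700.0)      # cap to avoid overflow in exp
    return (s_weak - s_strong) / (1.0 + math.exp(z)) + s_strong


# Hypothetical example values, not taken from the paper.
G, x0 = fit_sigmoid(1.0, 50.0)
print(s_jk(1.0, 0.7, 0.4, G, x0), s_jk(50.0, 0.7, 0.4, G, x0))
```

With these illustrative values, an edge with a single contact gets a value close to the upper plateau, and an edge with 50 contacts a value close to the lower one.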
Fig. 10 Sigmoid function defining the $s_{jk}$ values of the edge potential used for $G_S$. The threshold on the number of contacts defining the edges as strong, indicated by the vertical line, is determined to induce 1% of strong edges. The top histogram gives the distribution of the number of contacts (s and c) in $G_S$ in logarithmic scale

4.4.1 Reaction–diffusion algorithm

The reaction–diffusion (RD) algorithm iteratively updates the predicted gender probability of each user by computing a weighted sum of its initial gender probability and the current one of her direct neighbors. It is hence based on initial prediction probabilities for each user, as in our approach. The notation $p_i^t := \hat{p}_{X_i}(M)$ denotes the estimated probability for user i to be a male at iteration t. These probability estimates are updated at each iteration for each user $i \in V$ as

$$p_i^{t+1} = \frac{1}{2}\left(p_i^0 + \left(\frac{1}{|\mathcal{N}(i)|}\sum_{j \in \mathcal{N}(i)} p_j^t\right)\right) \quad \forall i \in V, \qquad (13)$$

until convergence, where $\mathcal{N}(i)$ is the set of neighbors of user i and $p_i^0 := \hat{p}_{Y_i|X_i}(y_i = M|x_i)$ is the initial male probability for user i.

The RD method is a variant of the previously introduced and largely studied consistency method [50], with a regularization parameter fixed to 0.5. Indeed, let us note $A \in \{0, 1\}^{N \times N}$ the adjacency matrix of the graph, with $A_{i,j} = 1$ if there is an edge between nodes i and j and 0 otherwise. We also define the diagonal matrix of degrees $D \in \mathbb{R}^{N \times N}$ where $D_{i,i} = \sum_{j \in V} A_{i,j}$. We can express (13) in a matrix form as

$$p^{t+1} = \frac{1}{2}\left(p^0 + D^{-1} A\, p^t\right), \qquad (14)$$

where the column vector $p^t := [p_i^t]_{i=1}^{N}$. It follows that RD is the first variant of the consistency method introduced in [50]. The only difference between the original consistency method and this variant is that the random walk normalized Laplacian $W := D^{-1}(D - A)$ used in RD is

Node2vec is a graph embedding technique which automatically defines node features describing each node neighborhood. These neighborhoods are defined based on second-order random walks which are biased to allow favoring, to a controlled extent, the preservation of the node structural properties and/or of the community co-memberships (node homophily) [15]. The chosen bias, controlled by the return and in–out parameters p and q, determines the sampling strategy S defining the neighborhood $N_S(i) \subset V$ of each node $i \in V$. A small value of p (< 1) increases the probability for a random walker to come back to the source node, while a small value of q (< 1) encourages the walk to move further away. Therefore, small p and q, respectively, favor breadth-first searches (BFS) and depth-first searches (DFS) through the network when defining the neighborhoods. Decreasing their values hence tends to define graph embeddings, respectively, preserving the node structures and the community co-memberships. Let us denote by d the dimension of the feature space defined by Node2vec and by $f : V \to \mathbb{R}^d$ the function assigning the features to each node. This function is determined by Node2vec by maximizing the log probability of observing the neighborhood of each node i given its features:

$$\max_f \sum_{i \in V} \log\left(P(N_S(i) \mid f(i))\right). \qquad (15)$$

The idea in defining f is to describe nodes with similar neighborhoods with close features in the embedding space. Further details are provided in the paper of Grover and Leskovec [15]. Once the node features are computed, a classification algorithm can be used to predict the labels of test nodes. In this work, we consider three machine learning algorithms for this task: logistic regression with L2 regularization (logReg), Gaussian naive Bayes (GNB) and k-nearest neighbors (kNN). Logistic regression (with L2 regularization) and Gaussian naive Bayes were successfully employed in previous works [14, 15]. On the other hand, kNN provides further baseline comparison. Although using support vector machines (SVMs) is another
appealing alternative for our two-class problem, it induced unaffordable computation times during the experiments with our network. The respective hyper-parameters (HPs) of each algorithm are selected through stratified tenfold cross-validation (CV) protocols on the labeled subsets of nodes. It can be noted that we employ Node2vec as a baseline among the feature extraction-based approaches, as it has been shown to outperform several other embedding techniques for classification tasks in complex networks [14, 15]. The bias weights p and q are each learned in the grid $\{0.25, 0.5, 1, 2, 4\}$ within the CV, i.e., they are considered as HPs to tune in addition to the HPs of the classification algorithms. These p and q hyper-parameters are hence chosen among 25 possible combinations, which is as many as in the paper defining the method [15] and more than in a recent review of graph embedding techniques [14]. The best combination of these parameters is individually chosen for each of our 50 simulations associated with each considered proportion of known labels. Similar values for the remaining Node2vec hyper-parameters are employed as in previous studies [14, 15]: $d = 32$, a context size of 10, walk length of 80 and number of walks of 10. The bias weights are not assigned to constant values as they control the nature of the considered node neighborhoods, which in turn determine the closeness of the nodes in the embedding space. The class labels to predict could indeed be related to a structural equivalence between the nodes or to a community membership or to a combination of both. Finally, we also considered two variants of the Node2vec feature extraction: using the edge weights or not. Following the results of the data analysis in Sect. 4.2, the number of contacts between each pair of users (s and c attribute) is used as weight.

4.4.3 Estimating the assortativity

The best edge potential for a given assortativity r can be estimated using the synthetic networks, as detailed in Sects. 3.3.3 and 3.3.4. However, the assortativity of a given real network still needs to be estimated. To this end, we propose to collect the gender of an a priori fixed number of pairs of adjacent users in the considered graph G, for example by carrying out a mobile phone survey, and then to use these edges to compute an estimate of r in G. This procedure has been tested on $G_L$, since it is larger than $G_S$, which allows considering more independent edge samplings. Figure 11 presents the results. The assortativity estimates are roughly unbiased, while the variance of the estimator decreases toward 0.029, 0.022 and 0.014 when the gender of, respectively, 1000, 2000 and 5000 pairs of adjacent users is known. Hence, knowing the gender of about 1k pairs of neighbors is sufficient to reliably estimate r, as it yields an error with an order of magnitude smaller than the actual assortativity value. Furthermore, an error of 0.05 on the estimation of r induces at worst a small 0.05 error on the s value, as indicated in Figs. 5 and 6.

Fig. 11 Estimated assortativity r as a function of the number of randomly selected pairs of adjacent users with known gender in $G_L$. For each number in abscissa, the edge selection is performed 50 times. The vertical distance between each mean estimated r (red squares) and the green lines gives the standard deviation of the estimation. The horizontal blue line indicates the true assortativity r in $G_L$, equal to 0.3 (color figure online)

It is noteworthy that using distinct edge potential parameters $s_{\mathrm{strong}}$ and $s_{\mathrm{weak}}$ in the strong and weak parts of the network requires estimating r within these two parts. As the strong part tends to be significantly smaller, the estimation of $r_{\mathrm{strong}}$ in a real setting should be carefully performed. Meanwhile, the users linked by the edges selected to estimate r may be, for instance, used as a training set to provide initial individual gender predictions.

4.4.4 Performances

This section presents the experimental results of the proposed method and of the baseline approaches (reaction–diffusion algorithm and classifiers based on Node2vec features), both on simulated initial individual predictions and on a growing subset of network users with known labels. For all the comparisons, statistical tests are conducted using Welch's t test with a significance level of 0.05. Whenever multiple hypotheses are tested simultaneously, Holm–Bonferroni correction is employed to bound by 0.05 the probability to consider as significant at least one nonsignificant difference [41].

Individual predictions Figure 12 shows the accuracy and recall gains over simulated initial predictions on $G_S$, both for our approach based on LBP and for the baseline RD. The different distributions of the initial individual predictions introduced in Sect. 3.3 are used, and the performances are given for varying initial accuracies b. As indicated by the stars at the bottom of each plot, the accuracies obtained with LBP are always statistically significantly higher than the ones of RD, except when the initial accuracy is 50%, in which case LBP and RD are not
[Figure 12: panels (a) LBP with linear $\hat{p}_{Y_i|X_i}(y_i|x_i)$, (b) LBP with exponential $\hat{p}_{Y_i|X_i}(y_i|x_i)$, (c) LBP with bi-uniform $\hat{p}_{Y_i|X_i}(y_i|x_i)$.]

Fig. 12 Accuracy and recall gains on $G_S$ of our LBP-based approach and of the RD method over the initial accuracies and recalls. The performances are provided as a function of the initial accuracy b and are averaged over 50 random simulations of the initial individual predictions generated using the linear (a, d), exponential (b, e) and bi-uniform (c, f) distributions. The filled areas delimit intervals of one standard deviation around the mean gains. A star (resp. a gray square) for an initial accuracy b drawn under one curve indicates that the accuracy of the corresponding method is higher (resp. not statistically significantly smaller) than the one of the other method for the same b and distribution of initial predictions
significantly different. This last point was expected since the case $b = 50\%$ corresponds to a random guessing for the first predictions. Furthermore, the well-balanced recalls obtained with LBP indicate that the weighting by the class prior in the node potential ψ is effective, avoiding favoring the dominant class (M) at the expense of the other one. It can, however, be noted that neither the baseline RD nor LBP improves the predictions when the bi-uniform distribution of the individual predictions is used. These poor performances confirm the observations of Sect. 3.3.3 on the synthetic networks. They can be explained as, in this bi-uniform case, only the sign and not the amplitude of $\hat{p}_{Y_i|X_i}(M|x_i) - 0.5$ brings information on the true gender that needs to be inferred, which is very unlikely in practice.

Although optimal s values are quite independent of the initial accuracy b, the performances are not, with highest accuracy gains in the range [70, 85]%. This range covers the accuracies reached by state-of-the-art techniques aiming to predict gender using individual-level features [11, 12, 17, 37]. Likewise, for an assortativity coefficient similar to the one of $G_S$ ($\approx 0.25$), the accuracy gains on synthetic networks are significant when $b \in [0.62, 0.92]$. This result is intuitive, as near-perfect initial accuracies leave little room to improve the predictions, while almost random ones induce too rough node potentials. It is interesting to observe that, depending on the distribution of the initial class probabilities employed, the profiles of the accuracy gains as a function of b are similar for RD and LBP, suggesting that the considered distribution shape highly determines the achievable performances.

Table 2 gives the average accuracy and recall gains of both RD and LBP in $G_S$ over the initial predictions with an initial accuracy $b = 0.75$. LBP increases the accuracy by more than 3 and 2.5% when the linear and exponential distributions are, respectively, chosen, outperforming the RD algorithm. On the other hand, as observed above, both RD and LBP deteriorate the individual predictions when the bi-uniform distribution is used.

Labeled data The results of LBP, RD and the classifiers using Node2vec features as a function of the percentage b of known labels are presented in Fig. 13. Table 3 further
details the mean accuracy and recalls obtained by all methods when 50% of the nodes are labeled and the performances are computed on the 50% remaining ones. We observe that LBP statistically significantly outperforms all the other schemes when the fraction of known labels is higher than b = 25%, while RD is superior for smaller percentages of labeled users. For all the explored percentages of labeled nodes, all methods lead to more than 50% accuracy on the unlabeled users. We can note, however, that LBP tends to provide highly unbalanced male and female recalls for small fractions b of known labels. The dominant class (male) is always favored, even though the data set is hardly unbalanced. Further works will aim at overcoming this behavior. This last observation is in contrast to the results obtained for the initial individual predictions in Fig. 12, suggesting that the probabilistic framework of our approach is especially suited when prior, possibly noisy, class probabilities are available for the network users. Graphical models have indeed already proven to be particularly relevant for the sake of denoising local node information by accounting for the global network structure. Common applications include the largely studied hidden Markov models (HMM) in the field of speech recognition, error correcting codes or diverse biological networks [18].

Table 2 Mean performances on GS of the baseline update (RD) scheme (13) and of LBP, for 50 different assignations of the first predictions

Initial distribution     LBP              RD
Linear
  ΔAccuracy              3.2 (0.2)        2.01 (0.16)
  ΔRecallM               3.39 (0.66)      2.63 (0.35)
  ΔRecallF               2.98 (0.76)      1.3 (0.33)
Exponential
  ΔAccuracy              2.6 (0.17)       1.52 (0.17)
  ΔRecallM               2.88 (0.46)      2.07 (0.29)
  ΔRecallF               2.25 (0.59)      0.9 (0.29)
Bi-uniform
  ΔAccuracy              -0.46 (0.25)     -1.43 (0.19)
  ΔRecallM               -0.94 (0.88)     -0.98 (0.29)
  ΔRecallF               0.1 (0.95)       -1.93 (0.34)

The three defined distributions of the first predictions are considered with an initial accuracy b = 75%. The notation Δ refers to the gains over the accuracy of the initial predictions. The best result per row is highlighted in bold. A result is in italic values when it is not statistically significantly worse than the best one of the same row, based on Welch’s t test. The standard deviations are indicated in brackets

Besides, Fig. 13 shows that the performances of the feature extraction-based methods tend to less improve when the training set size increases (especially concerning GNB and LogReg), whereas LBP and RD seem to benefit more from additional data. This suggests that learning the whole network structure allows to build richer models enabling to enhance the classification performances.

Regarding the algorithms based on Node2vec features, Fig. 13 and Table 3 show that all their performances are inferior to LBP and RD, even though this kind of feature extraction has proven to be a powerful graph embedding technique for node classification. To analyze the sorts of neighborhoods which were preserved in the extracted features, Fig. 14 shows the bias weights selected in the CV for the 50 different samplings of b = 50% labeled nodes. These parameters are the ones that were selected to obtain the results in Table 3. We can observe that from one run to the other (i.e., for different subsets of observed labels), the selected parameters are not always the same. Nevertheless, we can note some trends:

• In the weighted case (in Fig. 14b), lower q values tend to be favored, especially with logistic regression. This is in accordance with a previous study which has reported that low values of the in–out parameter q allow to improve subsequent classification based on the extracted features [14]. The embeddings therefore mostly preserve the community co-memberships of the nodes (highly interconnected nodes are embedded closely together) [15].
• In the unweighted case (in Fig. 14a), p and q are mostly selected close to 1 except with kNN where almost all combinations of values are chosen from one run to the other. Moderate p and q values seem coherent since, without the edge weights, a random walker is only guided by the presence of the edges and is as likely to move in any direction starting from the source node. If DFS was favored (by setting a small q) as in the weighted case, it would be likely that, without the edge weights, the genders among the neighbors sampled further away from the source node will not be related to the source user’s gender and hence that the extracted features will not be helpful for the classification task.

Surprisingly, it appears that using the edge weights for Node2vec deteriorates the reached performances in all tested cases. This confirms the observation that there is no straightforward link between the users’ communication patterns and their labels. In addition, the overall weaker performances of the Node2vec-based classifiers can at least partly be explained by the moderate gender assortativity as well as its nonuniformity across the social network.
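The role of the Node2vec parameters p and q discussed in this section can be made concrete with a short sketch of the second-order search bias defined by Grover and Leskovec [15]: a step back to the previous node is weighted by 1/p, a step to a common neighbor of the previous node by 1, and a step further away by 1/q. The function names, the adjacency format (a dict of neighbor sets) and the `weights` dictionary are illustrative assumptions, not the implementation used in the experiments.

```python
def node2vec_bias(prev, cand, adj, p=1.0, q=1.0):
    """Unnormalized search bias for stepping to `cand`,
    given that the walk reached the current node from `prev`."""
    if cand == prev:           # distance 0 from prev: return step
        return 1.0 / p
    if cand in adj[prev]:      # distance 1 from prev: stay in the neighborhood
        return 1.0
    return 1.0 / q             # distance 2 from prev: move outward (DFS-like)

def transition_probs(prev, cur, adj, p=1.0, q=1.0, weights=None):
    """Normalized probabilities of the next step from `cur`.
    `weights` maps (u, v) to an edge weight (assumed format); None means unweighted."""
    scores = {}
    for nxt in adj[cur]:
        w = 1.0 if weights is None else weights[(cur, nxt)]
        scores[nxt] = w * node2vec_bias(prev, nxt, adj, p, q)
    total = sum(scores.values())
    return {node: s / total for node, s in scores.items()}

# Toy graph with edges 0-1, 0-3, 1-2, 1-3; the walk has just moved from 0 to 1.
adj = {0: {1, 3}, 1: {0, 2, 3}, 2: {1}, 3: {0, 1}}
uniform = transition_probs(0, 1, adj, p=1.0, q=1.0)
outward = transition_probs(0, 1, adj, p=1.0, q=0.5)
```

With p = q = 1 and no edge weights, the walker moves to each neighbor with equal probability, which corresponds to the unweighted behavior described above. Lowering q shifts probability mass toward nodes further away from the previous node.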
[Figure 13 panels: (c) GNB using Node2vec features without the edge weights; (d) LogReg using Node2vec features without the edge weights; (e) kNN using Node2vec features without the edge weights; (f) GNB using Node2vec features with the edge weights; (g) LogReg using Node2vec features with the edge weights; (h) kNN using Node2vec features with the edge weights]

Fig. 13 Accuracy and recalls obtained on the unlabeled nodes of GS when varying the training percentage b (i.e., the fraction of the nodes with known labels) using our LBP-based approach (a), the RD method (b) and classifiers using features extracted by Node2vec (c–h). The performances are averaged over 50 random selections of the known labels and the filled areas delimit intervals of one standard deviation around the mean scores. A star for a training percentage drawn under one curve indicates that the accuracy of the corresponding method on the remaining unlabeled nodes is the highest among all 8 considered methods. A gray square indicates that the corresponding accuracy is not statistically significantly smaller than the best one for the same b according to Welch’s t test with 5% significance level, with Holm–Bonferroni correction as multiple hypotheses are tested
Table 3 Mean performances on GS of our LBP-based method, of the RD update scheme (13) and of three classifiers using features extracted by Node2vec for 50 different samplings of b = 50% labeled nodes

                                          Algorithms using Node2vec features
                                          Without edge weights                          With edge weights
           LBP            RD              GNB            LogReg         kNN             GNB            LogReg         kNN
Accuracy   69.07 (0.45)   68.18 (0.34)    57.16 (1.15)   59.37 (0.36)   61.38 (0.48)    55.01 (0.45)   57.31 (0.33)   58.79 (0.48)
RecallM    75.53 (0.83)   72.46 (0.82)    52.23 (5.39)   69.85 (1.1)    68.92 (5.07)    66.07 (0.82)   76.1 (0.73)    68.02 (4.34)
RecallF    61.67 (1.09)   63.27 (0.96)    62.81 (4.39)   47.34 (1.12)   52.72 (5.66)    42.31 (0.93)   35.75 (0.83)   48.18 (5.33)

The best performances per row are depicted in bold. A result is in italic values when it is not statistically significantly worse than the best one of the same row according to Welch’s t test with Holm–Bonferroni correction. The standard deviations are indicated in brackets
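The Holm–Bonferroni correction mentioned in the table notes can be sketched as follows. The sketch assumes that one p-value per comparison with the best method is already available, for instance from Welch's t test (in SciPy, `scipy.stats.ttest_ind` with `equal_var=False`); the function name and the example p-values are made up for illustration and are not results from the paper.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure.
    Returns a list of booleans, True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p-value first
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared with alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one hypothesis is kept, all larger p-values are kept too
    return reject

# Hypothetical p-values for four comparisons against the best method.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005], alpha=0.05)
```

A method whose hypothesis is not rejected would be reported as not statistically significantly worse than the best one.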
of weaker ties, influencing their classes in a different way [34]. In addition, this kind of approach is not intended to directly exploit uncertain label predictions with confidence levels (i.e., initial class probabilities), which can only, for instance, be incorporated in the definition of the features [40].

Besides, random walk-based approaches allow to account for the whole network structure by propagating the labels through iterative updates [50, 51]. Several variants and adaptations of this principle are proposed to solve diverse labeling tasks, such as video suggestions [3] or demographics prediction in networks [38]. The latter work adopts a two-step approach, first computing uncertain individual predictions using the individual part of the data and then improving them using the reaction–diffusion (RD) method exploiting the network structure [37].

Although the aforementioned random walk-based methods aim to model the network structure as a whole, they are based on an implicit model of the joint probability distribution of all the node labels [5]. As a refinement, inference approaches using probabilistic graphical models (PGMs) are proposed as this framework allows to make the models explicit and fully describes the interactions between the nodes [10]. Dong et al. [10] introduced a double dependent-variable factor graph model in order to jointly predict the users’ age and gender by benefiting from the links between these two demographic attributes in a network. Knowing 50% of the labels, the remaining unknown genders are predicted with up to 80% accuracy. However, as they do not quantify the assortativity of their network, these performances cannot be easily compared to our study. Our results may nevertheless qualitatively partly explain the success of their approach. Combining age and gender implicitly delineates in an automated manner some rather (anti-)homophilic subgraphs, as illustrated by their data analysis. As highlighted by the present work, this definition of strong and weak network parts with accentuated (anti-)homophily improves the inference performances. The latter observation is essential, as several studies mention that gender assortativity is generally rather weak [1, 24] and thus not sufficient by itself to infer the gender. For instance, the RD algorithm introduced by Sarraute et al. [37] is used to infer the age group of some users, but not their gender. Their network indeed bears a strong age homophily. When 70% of the known age labels are propagated through the network to infer the 30% remaining ones, the age group among four categories is predicted with 43.4% accuracy.

However, these recent studies focus on the propagation of known labels through a network and do not consider the improvement of uncertain predictions, which can be obtained by a classical machine learning algorithm predicting the labels based on individual information. In addition, to the best of our knowledge, no research quantifies the relation between the assortativity strength and the performances of label prediction in a network.

In this setting, we introduce a general framework based on PGMs to exploit the global social network topology for the improvement of uncertain predictions and to infer missing labels. Our study makes use of an objective measure of the assortativity to provide guarantees about the performances generalization. This quantitative measure of the network homophily is typically not provided by graphical representations [10]. It enables us to describe to which extent the sole network information improves individual demographics prediction, as a function of the assortativity. The proposed methodology easily permits to take advantage of some known labels, as well as first individual predictions obtained using individual data. Finally, the model can benefit from assortativity variations in different subgraphs. By modeling the statistical dependencies between adjacent labels, it can favor heterogeneous as well as homogeneous contacts depending on the edge weights. The experiments of Sect. 4.4.4 first show the superiority, in most settings, of our approach over the reaction–diffusion algorithm and three classifiers using Node2vec features, especially to improve uncertain predictions. Second, in the studied application, the methods exploiting the entire network structure either through label propagation (RD) or using PGM are superior to feature extraction-based approaches. Third, although it can create embeddings of nodes based on diverse types of neighborhood, the Node2vec feature extraction technique is probably not best suited when the assortativity is moderate and/or nonuniform across a network. In such cases, there is no unique relationship between a set of extracted features representing the node neighborhoods and the associated node labels.

6 Conclusion

This work presents how assortativity can be exploited to infer individual demographics in social networks. To this aim, a general approach is introduced, using a probabilistic graphical model. It can both improve noisy initial predictions performed at an individual level and propagate a subset of known labels to predict the remaining unknown ones. The achieved performances are studied on simulated networks as a function of the assortativity and the quality of the provided initial information, both in terms of accuracy and distribution in the initial individual predictions case, and in terms of the fraction of users with known labels otherwise. Indeed, the relevance of the network information compared to individual features depends on (1) the assortativity amplitude and (2) the quality of the prior
information: In the initial individual predictions context, poor prior information is misleading, while excellent one does not leave much room for improvement. Also, the distribution of the initial class probabilities highly influences the achievable performances, as highlighted by the results of both our approach and the reaction–diffusion method obtained with different distributions of these first probabilities. The graph simulations allow tuning the model parameters. Our method is further validated on a real-world mobile phone network, and the model is refined to predict gender, exploiting both weak, homophilic and strong, anti-homophilic links. In this context, our approach statistically significantly overcomes, in most settings, the performances of the reaction–diffusion label propagation technique and of machine learning classifiers based on features extracted by the Node2vec graph embedding method. In particular, the approach allows individual-based gender predictions to be improved by up to 3%. On the other hand, when the gender of 60% of the users is known and no information is provided for the remaining users, the proposed approach can infer the missing labels with 70% accuracy, solely based on the network assortativity.

The analysis performed on synthetic networks illustrates that a strong assortativity can be easily exploited through our methodology. Moreover, an almost randomly mixed network may still be composed of several parts which are, if considered in isolation, assortative and disassortative. Thus even in the latter configuration, the network topology may still be useful. As a further work, the generalization of the proposed methodology to multivariate predictions would be of great interest. The model could then benefit from the relationships between the target variables and automatically make use of sub-networks presenting more pronounced homophily.

Acknowledgements DM and CdB are Research Fellows of the Fonds de la Recherche Scientifique - FNRS. The authors gratefully acknowledge Pål Roe Sundsøy for his help with the data.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

References

1. Al Zamal F, Liu W, Ruths D (2012) Homophily and latent attribute inference: inferring latent attributes of twitter users from neighbors. In: ICWSM, vol. 270
2. Aral S, Muchnik L, Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proc Natl Acad Sci 106(51):21544–21549
3. Baluja S, Seth R, Sivakumar D, Jing Y, Yagnik J, Kumar S, Ravichandran D, Aly M (2008) Video suggestion and discovery for youtube: taking random walks through the view graph. In: Proceedings of the 17th international conference on World Wide Web. ACM, London, pp 895–904
4. Bengtsson L, Lu X, Thorson A, Garfield R, Von Schreeb J (2011) Improved response to disasters and outbreaks by tracking population movements with mobile phone network data: a post-earthquake geospatial study in Haiti. PLoS Med 8(8):e1001083
5. Bhagat S, Cormode G, Muthukrishnan S (2011) Node classification in social networks. In: Aggarwal C (ed) Social network data analytics. Springer, Boston, pp 115–148
6. Blondel VD, Decuyper A, Krings G (2015) A survey of results on mobile phone datasets analysis. EPJ Data Sci 4(1):10
7. Blumenstock J, Cadamuro G, On R (2015) Predicting poverty and wealth from mobile phone metadata. Science 350(6264):1073–1076
8. Castellano C, Fortunato S, Loreto V (2009) Statistical physics of social dynamics. Rev Mod Phys 81(2):591
9. Devroye L (1996) Random variate generation in one line of code. In: Simulation conference, 1996. Proceedings, Winter. IEEE, Washington, pp 265–272
10. Dong Y, Yang Y, Tang J, Yang Y, Chawla NV (2014) Inferring user demographics and social strategies in mobile social networks. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, London, pp 15–24
11. Felbo B, Sundsøy P, Lehmann S, de Montjoye YA et al (2017) Modeling the temporal nature of human behavior for demographics prediction. In: Joint European conference on machine learning and knowledge discovery in databases. Springer, Berlin, pp 140–152
12. Frias-Martinez V, Frias-Martinez E, Oliver N (2010) A gender-centric analysis of calling behavior in a developing economy using call detail records. In: AAAI spring symposium: artificial intelligence for development
13. Ghahramani Z (2002) Graphical models: parameter learning. Handb Brain Theory Neural Netw 2:486–490
14. Goyal P, Ferrara E (2018) Graph embedding techniques, applications, and performance: A survey. Knowl Based Syst 151:78–94
15. Grover A, Leskovec J (2016) node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, London, pp 855–864
16. Herrera-Yagüe C, Zufiria PJ (2012) Prediction of telephone user attributes based on network neighborhood information. In: International workshop on machine learning and data mining in pattern recognition. Springer, Berlin, pp 645–659
17. Jahani E, Sundsøy P, Bjelland J, Bengtsson L, de Montjoye YA et al (2017) Improving official statistics in emerging markets using machine learning and mobile phone data. EPJ Data Sci 6(1):3
18. Jordan MI et al (2004) Graphical models. Stat Sci 19(1):140–155
19. Kokkos A, Tzouramanis T (2014) A robust gender inference model for online social networks and its application to Linkedin and Twitter. First Monday 19(9):8
20. Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. MIT Press, Cambridge
21. Liu W, Ruths D (2013) What’s in a name? using first names as features for gender inference in twitter. In: AAAI spring symposium: analyzing microtext, vol 13, p 01
22. Madan A, Moturu ST, Lazer D, Pentland AS (2010) Social sensing: obesity, unhealthy eating and exercise in face-to-face networks. In: Wireless health 2010. ACM, London, pp 104–110
23. Magno G, Weber I (2014) International gender differences and gaps in online social networks. In: International conference on social informatics. Springer, Berlin, pp 121–138
24. McPherson M, Smith-Lovin L, Cook JM (2001) Birds of a feather: homophily in social networks. Annu Rev Sociol 27:415–444
25. de Montjoye YA, Kendall J, Kerry CF (2014) Enabling humanitarian use of mobile phone data. Brookings Center for Technology and Innovation, Washington
26. de Montjoye YA, Quoidbach J, Robic F, Pentland AS (2013) Predicting personality using novel mobile phone-based metrics. In: Greenberg AM, Kennedy WG, Bos ND (eds) Social computing, behavioral-cultural modeling and prediction. Springer, Berlin, pp 48–55
27. de Montjoye YA, Rocher L, Pentland AS (2016) Bandicoot: a python toolbox for mobile phone metadata. J Mach Learn Res 17(175):1–5
28. Montoliu R, Gatica-Perez D (2010) Discovering human places of interest from multimodal mobile phone data. In: Proceedings of the 9th international conference on mobile and ubiquitous multimedia. ACM, London, p 12
29. Murphy KP, Weiss Y, Jordan MI (1999) Loopy belief propagation for approximate inference: An empirical study. In: Proceedings of the fifteenth conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., pp 467–475
30. Newman ME (2000) Models of the small world. J Stat Phys 101(3–4):819–841
31. Newman ME (2003) Mixing patterns in networks. Phys Rev E 67(2):026126
32. Newman ME (2003) The structure and function of complex networks. SIAM Rev 45(2):167–256
33. Orman GK, Labatut V (2009) A comparison of community detection algorithms on artificial networks. In: International conference on discovery science. Springer, Berlin, pp 242–256
34. Palchykov V, Kaski K, Kertész J, Barabási AL, Dunbar RI (2012) Sex differences in intimate relationships. Sci Rep 2:370
35. Peersman C, Daelemans W, Van Vaerenbergh L (2011) Predicting age and gender in online social networks. In: Proceedings of the 3rd international workshop on search and mining user-generated contents. ACM, London, pp 37–44
36. Rosenquist JN, Murabito J, Fowler JH, Christakis NA (2010) The spread of alcohol consumption behavior in a large social network. Ann Intern Med 152(7):426–433
37. Sarraute C, Blanc P, Burroni J (2014) A study of age and gender seen through mobile phone usage patterns in Mexico. In: 2014 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM). IEEE, Washington, pp 836–843
38. Sarraute C, Brea J, Burroni J, Blanc P (2015) Inference of demographic attributes based on mobile phone usage patterns and social network topology. Soc Netw Anal Min 5(1):39
39. Šćepanović S, Mishkovski I, Hui P, Nurminen JK, Ylä-Jääski A (2015) Mobile phone call data as a regional socio-economic proxy indicator. PLoS ONE 10(4):e0124160
40. Sen P, Namata G, Bilgic M, Getoor L, Galligher B, Eliassi-Rad T (2008) Collective classification in network data. AI Mag 29(3):93
41. Shaffer JP (1995) Multiple hypothesis testing. Annu Rev Psychol 46(1):561–584
42. Smith JA, McPherson M, Smith-Lovin L (2014) Social distance in the united states: Sex, race, religion, age, and education homophily among confidants, 1985 to 2004. Am Sociol Rev 79(3):432–456
43. Sundsøy P, Bjelland J, Reme B, Iqbal A, Jahani E (2016) Deep learning applied to mobile phone data for individual income classification. In: ICAITA 2016 international conference on artificial intelligence and applications
44. Tang J, Lou T, Kleinberg J (2012) Inferring social ties across heterogenous networks. In: Proceedings of the fifth ACM international conference on web search and data mining. ACM, London, pp 743–752
45. Tatem AJ, Qiu Y, Smith DL, Sabot O, Ali AS, Moonen B et al (2009) The use of mobile phone data for the estimation of the travel patterns and imported plasmodium falciparum rates among Zanzibar residents. Malar J 8(1):10–1186
46. Traud AL, Mucha PJ, Porter MA (2012) Social structure of Facebook networks. Phys A Stat Mech Appl 391(16):4165–4180
47. Wainwright MJ, Jordan MI (2008) Graphical models, exponential families, and variational inference. Found Trends Mach Learn 1(1–2):1–305. https://doi.org/10.1561/2200000001
48. Wang Y, Zang H, Faloutsos M (2013) Inferring cellular user demographic information using homophily on call graphs. In: INFOCOM, 2013 Proceedings IEEE. IEEE, Washington, pp 3363–3368
49. Weiss Y, Freeman WT (2001) On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs. IEEE Trans Inf Theory 47(2):736–744
50. Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B (2003) Learning with local and global consistency. NIPS 16:321–328
51. Zhu X, Ghahramani Z, Lafferty JD (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the 20th international conference on machine learning (ICML-03), pp 912–919

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
TELKOMNIKA Telecommunication, Computing, Electronics and Control
Vol. 18, No. 6, December 2020, pp. 3331~3338
ISSN: 1693-6930, accredited First Grade by Kemenristekdikti, Decree No: 21/E/KPT/2018
DOI: 10.12928/TELKOMNIKA.v18i6.16300
Corresponding Author:
Seifedine Kadry,
Department of Mathematics and Computer Science, Faculty of Science,
Beirut Arab University,
Beirut, Lebanon.
Email: [email protected]
1. INTRODUCTION
Information diffusion in homogeneous and heterogeneous networks is a dynamic process of keen
interest to researchers. This concept refers to how information, such as news of events or outbreaks, spreads from
a set of origin nodes to other nodes across the network [1-3]. Information diffusion has been studied in many
fields ranging from health care [4-6] to social networks [7, 8]. One of the most important tasks of network-
based systems is to understand, model, and predict rapidly developing events within the network. After
discovering the structure of a network, it is possible to predict the patterns of events including their shape, size
and development, which can be described as information diffusion [9]. Over the years, researchers have tried
several methods to model information diffusion in homogeneous and heterogeneous networks [10-13].
Formally, a data network is represented by a graph G = (V, E) where V is the set of vertices and E is
the set of edges. This graph is called homogeneous if the vertices and their edges are of the same type, and is
called heterogeneous otherwise. Homogeneous networks have been the subject of many studies with the focus
being on semantic analysis, communicable disease control [14-17], and link prediction [18-20]. Recently, more
attention has been paid to heterogeneous networks, as they could provide a more realistic representation of
real-world phenomena [21, 22]. Watt [23] investigated the role of threshold values and
network structure in information diffusion in these networks. In [10], information diffusion in heterogeneous
networks through hyperpaths was studied. This study proposed a method called MLTM-R for analyzing
information diffusion power in different hyperpaths. In this method, predictions were made using the PathSim
algorithm to weight the links between each pair of nodes [24, 25].
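As an illustration of how PathSim can weight the link between two nodes, the following sketch computes PathSim scores from the commuting matrix M of a symmetric meta-path, using the standard definition s(x, y) = 2 M[x, y] / (M[x, x] + M[y, y]). The toy author-venue count matrix and the Author-Venue-Author meta-path are assumptions chosen for the example; they are not data or settings from the cited studies.

```python
import numpy as np

def pathsim(commuting):
    """PathSim scores for all node pairs, given the commuting matrix M
    of a symmetric meta-path (M[x, y] = number of path instances from x to y)."""
    M = np.asarray(commuting, dtype=float)
    diag = np.diag(M)
    denom = diag[:, None] + diag[None, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        # Pairs whose nodes have no path instances at all get similarity 0.
        scores = np.where(denom > 0, 2.0 * M / denom, 0.0)
    return scores

# Hypothetical counts: W[a, v] = number of papers of author a in venue v.
W = np.array([[2, 0],
              [1, 1],
              [0, 3]])
M = W @ W.T          # commuting matrix of the Author-Venue-Author meta-path
S = pathsim(M)
```

Each node has similarity 1 with itself, and two authors who never publish in a common venue have similarity 0 under this meta-path.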
Recent years have seen a growing interest in the use of deep learning in heterogeneous
networks [26, 27]. In [28], a core deep learning (CDL) framework was used to solve the problem of
heterogeneous visual versus near-infrared (VIS-NIR) image matching through topic diffusion in networks.
Tang et al. [29] proposed a LINE algorithm for embedding learning that traverses all edge types and samples
one edge at a time for each edge type. Chang et al. [30] developed a deep architecture for information diffusion
prediction through information encoding in heterogeneous networks. In [31], a new algorithm called
Metapath2vec was presented for information encoding in heterogeneous networks, where concepts and patterns
are mapped by the use of hyperpaths. The review of previous works reveals some strengths and weaknesses in
the current approach to information diffusion in heterogeneous networks. The use of deep learning in the study
of information diffusion processes such as topic diffusion and information cascades can help avoid the
problems of more traditional methods. The major disadvantage of the previous works is that most topic
diffusion methods use local similarity and encoding based on neighboring nodes. For large heterogeneous
networks, it is time-consuming and difficult to perform local similarity calculations for every pair of corresponding
nodes. As a result, there is a need for a more comprehensive yet less complex automatic method for measuring
the similarity of nodes and finding diffusion paths in heterogeneous networks.
In this paper, the problem of predicting the path of information diffusion in a network is mapped to
a deep learning problem. Since predicting the new users who will be in the path of information flow is
a recognition process, this problem can be solved by machine learning algorithms. As noted above,
deep learning algorithms have recently been widely used in this field, and researchers have
developed deep learning algorithms that can use graph data in the learning process. This paper presents
a machine learning method based on graph neural networks, which selects the inactive node to be
activated based on its neighboring active nodes in each scientific topic. In other words,
information diffusion paths are predicted through the activation of inactive nodes by active nodes. To evaluate
the proposed method, it is tested on three heterogeneous scientific databases: the Digital Bibliography and
Library Project (DBLP), Pubmed, and Cora. The method seeks to answer the question of who will
publish the next article in a particular field of science. Compared with other
methods, the proposed method shows 10% and 5% improvements in precision on the DBLP and Pubmed datasets, respectively.
In summary, the most important innovations of the present work are as follows:
− Presenting a deep learning model where the information of a heterogeneous network is encoded in
the form of a deep learning graph, which can model the information diffusion path.
− Providing a feature extraction mechanism to find the degree of correlation of neighboring vertices in
different graph hyperpaths.
− Testing the method on the heterogeneous datasets DBLP, Pubmed, and Cora, which have real-world
applications, in order to demonstrate the applicability of the proposed method.
The remaining sections of the article are organized as follows. Section 2 describes the different components of
the proposed method. Section 3 describes the testing procedure and analyzes the results. Section 4 presents
the conclusions.
2. PROPOSED METHOD
Information diffusion is a widely discussed dynamic network process with potential applications in
various fields of science. This term refers to the spreading of information or similar concepts such as news,
innovation, viruses or malware from a set of vertices to other vertices across the network. There is a rich body of
literature on information diffusion in complex networks, where different models and their interactions with
network topology have been analyzed [1]. The previous studies have been mostly focused on heterogeneous
networks. An information network G = (V, E), where V is the set of vertices and E is the set of edges, is
homogeneous if the edges and vertices are all of the same type. Conversely, networks with more than one type
of node or edge are called heterogeneous [8-10]. For example, in DBLP, which is an important computer
science bibliography database, the vertices could be authors, articles, and venues (journals/conferences), and
the edges could include author-author relationships, in the sense that two authors have worked in the same area and attended
the same conferences.
TELKOMNIKA Telecommun Comput El Control, Vol. 18, No. 6, December 2020: 3331 - 3338
Here, we model information diffusion and specifically topic diffusion in heterogeneous information
networks. To this end, we use a concept called the meta-path. A meta-path p on the network schema $T_G = (A, R)$,
where A and R represent vertex types and relationships, is defined as follows:
$$A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1} \qquad (1)$$
Here, l is the length of the meta-path, i.e. the number of relationships it chains together. The composite
relationship between the vertex types $A_1$ and $A_{l+1}$ is given by:
$$R = R_1 \circ R_2 \circ \cdots \circ R_l \qquad (2)$$
where $\circ$ is the composition operator. In DBLP, for example, each author-author or author-conference-author
relationship is considered a single meta-path.
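The composition in (2) can be illustrated with relation matrices. The following is a minimal Python sketch under stated assumptions: the toy author-paper matrix and the helper names `transpose` and `matmul` are hypothetical and are not taken from the original method. Composing the author-to-paper relation with its inverse gives the author-paper-author (APA) meta-path, and composing APA with itself gives APAPA.

```python
# Hypothetical toy data: rows are authors a1..a3, columns are papers p1..p2.
author_paper = [
    [1, 0],  # a1 wrote p1
    [1, 1],  # a2 wrote p1 and p2
    [0, 1],  # a3 wrote p2
]


def transpose(matrix):
    """Return the transpose, i.e. the inverse relation (paper -> author)."""
    return [list(column) for column in zip(*matrix)]


def matmul(left, right):
    """Plain-Python matrix product; plays the role of the composition R1 ∘ R2."""
    right_t = transpose(right)
    return [[sum(a * b for a, b in zip(row, col)) for col in right_t] for row in left]


# APA = (author -> paper) ∘ (paper -> author)
apa = matmul(author_paper, transpose(author_paper))
# APAPA = APA ∘ APA
apapa = matmul(apa, apa)
```

In this sketch, `apa[i][j]` counts the papers shared by authors i and j, so a1 and a3 are unrelated under APA but become connected under the longer APAPA meta-path through a2.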
Figure 1 shows an example of the diffusion of the topic of “data mining” in DBLP, where authors can
be linked through different meta-paths. This paper provides a machine learning method based on graph neural
networks in which an inactive node is activated by its active neighbors in a particular scientific topic. Given
that the prediction of new users who will be in the path of information flow is a recognition process, this
problem can be solved by machine learning algorithms.
The general framework of the method consists of two main phases: 1) designing a machine learning
scheme (learning machine) for the prediction process, and 2) evaluating the accuracy of the designed scheme
(machine) in predicting the flow of information in the dataset of interest. The first phase involves training
a learning machine, where the input is the data collected from the information network graph and the output is
the tag “Yes” or “No”, showing whether or not the node specified in the input will be the next node on the path
of information diffusion. The purpose of this machine is to learn a function that optimally maps
input data to output tags. In the second phase, a test dataset, which is taken from the collected data,
is used to test the designed machine. In the testing and accuracy evaluation phase, the classification process is
done once randomly and once with the designed machine. In the end, the quality of the vertices obtained
from these two methods is compared.
$$G = (V, A) \qquad (3)$$
where $V \in \mathbb{R}^{N \times f}$ is the vertex signal matrix describing N vertices each with f features, $A \in \mathbb{R}^{N \times N}$ is the adjacency
matrix which encodes the edge information as described in section 2, and each element $a_{ij}$ of A is defined as follows:
$$a_{ij} = \begin{cases} w_{ij}, & \text{if there is an edge between } i \text{ and } j \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$
An example graph and its vertex matrix V and adjacency matrix A are shown in Figure 2.
New prediction method for data spreading in social networks based on… (Maytham N. Meqdad)
3334 ISSN: 1693-6930
Adjacency matrix A of the example graph (rows and columns ordered P, Q, R, S):
    P  Q  R  S
P   0  1  1  0
Q   1  1  0  0
R   1  0  0  1
S   0  0  1  0
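The definition in (4) can be sketched in a few lines of Python. This is an illustration only: the edge list below is read off the 4×4 matrix printed above, the graph is assumed to be undirected with unit weights, and the function name `build_adjacency` is hypothetical.

```python
nodes = ["P", "Q", "R", "S"]

# (i, j, w_ij) triples; each undirected edge is listed once.
# The Q-Q entry is the self-loop visible on the diagonal of the matrix.
weighted_edges = [("P", "Q", 1), ("P", "R", 1), ("Q", "Q", 1), ("R", "S", 1)]


def build_adjacency(node_names, edges):
    """Return A with a_ij = w_ij where an edge exists and 0 otherwise."""
    index = {name: position for position, name in enumerate(node_names)}
    size = len(node_names)
    adjacency = [[0] * size for _ in range(size)]
    for source, target, weight in edges:
        i, j = index[source], index[target]
        adjacency[i][j] = weight
        adjacency[j][i] = weight  # assumption: the graph is undirected
    return adjacency


A = build_adjacency(nodes, weighted_edges)
```

For a directed or weighted network, the symmetric assignment would be dropped and the weights replaced by the actual $w_{ij}$ values.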
This filter is defined as the kth-degree polynomial of the adjacency matrix. The exponent of each term in this
polynomial encodes the number of steps from the vertex of interest, and each term is multiplied by its assumed filter
factor. The scalar factor $h_i$ determines how much each neighbor of a vertex contributes to the convolution
operation. Therefore, the filter matrix is obtained as $H \in \mathbb{R}^{N \times N}$. The convolution of the vertices $V$ with
the filter $H$ is defined as the following matrix multiplication, where $V_{out}, V_{in} \in \mathbb{R}^{N}$.
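The equations for this filter are not reproduced in the text here, so the following Python sketch assumes the usual form suggested by the description: $H = \sum_{i=0}^{k} h_i A^i$ followed by $V_{out} = H V_{in}$. The three-vertex path graph, the coefficient values, and the function names are hypothetical choices made for illustration.

```python
def identity(size):
    return [[1 if i == j else 0 for j in range(size)] for i in range(size)]


def matmul(left, right):
    columns = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in left]


def polynomial_filter(adjacency, coefficients):
    """Build H = h_0 * I + h_1 * A + ... + h_k * A^k (assumed form of the filter)."""
    size = len(adjacency)
    power = identity(size)  # A^0
    filter_matrix = [[0.0] * size for _ in range(size)]
    for h_i in coefficients:
        for r in range(size):
            for c in range(size):
                filter_matrix[r][c] += h_i * power[r][c]
        power = matmul(power, adjacency)  # next power of A
    return filter_matrix


def apply_filter(filter_matrix, signal):
    """V_out = H V_in for a single vertex feature."""
    return [sum(h * v for h, v in zip(row, signal)) for row in filter_matrix]


# Path graph 0 - 1 - 2 with a unit signal on vertex 0.
path_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H = polynomial_filter(path_graph, [1, 0.5, 0.25])
v_out = apply_filter(H, [1, 0, 0])
```

With a second-degree polynomial, the signal placed on vertex 0 reaches vertex 2, which is two steps away; a first-degree filter would leave that vertex at zero.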
This model can be adjusted in three ways. The first way is to avoid exponentiating A and simplify
the adjacency polynomial in (2) into the linear form given in (6). The reason behind this approach is that, as
shown by VGGNet, a cascade of small filters can effectively approximate the receptive field of a large filter.
The next step is to create the adjacency tensor $\mathcal{A}$. This tensor consists of multiple adjacency matrices $A_e$, which are
the slices of this tensor, each encoding a specific edge feature. Therefore, the linear filter matrix in (6) is defined as
a convex combination of adjacency matrices as given in (7). This equation can be simplified into (8).
$$H \approx \sum_{e=0}^{L} h_e A_e \qquad (9)$$
Multiple edge features are encoded by multiple adjacency matrices, each of which encodes a single
feature. Also, as shown in Figure 3, the edges are subdivided into multiple matrices. Figure 3 shows the default
linear GCN filter in an image application. A filter factor is isotropically applied to all vertices at a given
distance. In this case, $h_0$ is applied to the vertex of the 0th step and $h_1$ is applied to all adjacent vertices. If this
figure is enclosed in another set of pixels, each pixel in that set will be multiplied by the filter factor $h_2$.
As shown in Figure 3, to create the adjacency tensor, the adjacency matrix can be subdivided into 9
adjacency matrices. Each of these adjacency matrices shows a different relative link (edge feature) to a given
vertex. The next step is to apply a unique filter to each adjacency matrix to perform convolution followed by
aggregation. This gives a direction to the GCN filter. This is equivalent to a 3×3 FIR filter in traditional CNNs.
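The equivalence with a 3×3 filter can be checked numerically. The sketch below is an illustration under assumptions, not the authors' implementation: the image is treated as a grid graph, one adjacency slice is built per relative pixel offset, zero padding is assumed at the border, and all function names and the example kernel are hypothetical.

```python
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def directional_adjacency(height, width, dy, dx):
    """Slice A_e: entry [i][j] is 1 when pixel j sits at offset (dy, dx) from pixel i."""
    size = height * width
    adjacency = [[0] * size for _ in range(size)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                adjacency[y * width + x][ny * width + nx] = 1
    return adjacency


def graph_filter(image, kernel):
    """Compute (sum_e h_e A_e) V_in, accumulating one adjacency slice at a time."""
    height, width = len(image), len(image[0])
    flat = [pixel for row in image for pixel in row]
    out = [0] * len(flat)
    for dy, dx in OFFSETS:
        h_e = kernel[dy + 1][dx + 1]
        a_e = directional_adjacency(height, width, dy, dx)
        for i, row in enumerate(a_e):
            out[i] += h_e * sum(a * v for a, v in zip(row, flat))
    return [out[r * width:(r + 1) * width] for r in range(height)]


def fir_filter(image, kernel):
    """Direct 3x3 sliding-window filter with zero padding at the border."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for dy, dx in OFFSETS:
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width:
                    out[y][x] += kernel[dy + 1][dx + 1] * image[ny][nx]
    return out


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
same_result = graph_filter(image, kernel) == fir_filter(image, kernel)
```

Because each of the nine slices carries its own factor, the graph filter is no longer isotropic, which is the directional behavior described above.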
All the above-described filters are for a single vertex feature. For the extension to multiple vertex
features, each $h_l$ must be in $\mathbb{R}^{C}$ so that $H \in \mathbb{R}^{N \times N \times C}$. Therefore, each vertex feature has a filter
matrix of size N×N. Accordingly, (9) can be rewritten as in (10), where $H^{(c)}$ is an N×N slice of H and $h^{(c)}$
is a scalar related to an input feature and a slice of $A_l$.
$$V_{out} = \sum_{c=1}^{C} H^{(c)} V_{in}^{(c)} + b \qquad (11)$$
$$X = (x_1, x_2, \cdots, x_n) \qquad (12)$$
$$C = (c_1, c_2) \qquad (13)$$
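The multi-feature convolution in (11) can be sketched as a sum of per-feature matrix-vector products plus a bias. The Python below is a minimal illustration; the two 2×2 filter slices, the feature columns, and the function names are hypothetical values chosen so the arithmetic is easy to follow.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def multi_feature_convolution(filters, features, bias):
    """filters[c] is the N x N slice H^(c); features[c] is the length-N column V_in^(c)."""
    size = len(features[0])
    out = [bias] * size
    for h_c, v_c in zip(filters, features):
        out = [acc + term for acc, term in zip(out, matvec(h_c, v_c))]
    return out


# Two vertices, two input features (C = 2).
filters = [
    [[1, 0], [0, 1]],  # H^(1): keep each vertex's own first feature
    [[0, 1], [1, 0]],  # H^(2): take the neighbor's second feature
]
features = [[1, 2], [10, 20]]
v_out = multi_feature_convolution(filters, features, bias=0.5)
```

Each output entry mixes the vertex's own first feature with its neighbor's second feature, which is the role of the per-feature slices $H^{(c)}$.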
The algorithm presented in Figure 4 shows how the learning machine is built and tested. In
the proposed algorithm, each machine is first trained using the extracted data. As can be seen, each learning
machine is trained separately for each dataset extracted in the algorithm. This is done using the “generateML”
function in line 7. Then, the folding algorithm is used to test each machine. Therefore, each machine is
repeatedly subdivided, trained, and tested on different data subsets. More details on the mechanism of the folding
algorithm and measurement of the accuracy of the learning machine
are provided in the next section.
1  Classification_Algorithm = {'GCN'}
2  datasets = {'DBLP', 'Pubmed', 'Cora'}
3  for each dataset do
4      for each fold do
5          for each feature do
6              (Xtrain, Ytrain, Xtest, Ytest) ← makefold(dataset)
7              ML ← generateML(Classification_Algorithm, Xtrain, Ytrain)
8              accuracy(fold) ← calculateAccuracy(ML, Xtest, Ytest)
9          end
10     end
11 end
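The train-and-test loop above can be sketched as runnable Python. This is a hedged illustration, not the authors' code: the function names mirror the pseudocode, but `generate_ml` is a stand-in that predicts the majority training label rather than a GCN, the inner loop over features is omitted, and the toy dataset is invented for the example.

```python
import random


def make_folds(samples, labels, k, seed=0):
    """Yield (x_train, y_train, x_test, y_test) once for each of the k folds."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        yield ([samples[i] for i in train_idx], [labels[i] for i in train_idx],
               [samples[i] for i in test_idx], [labels[i] for i in test_idx])


def generate_ml(x_train, y_train):
    """Stand-in learner (NOT a GCN): always predicts the majority training label."""
    majority = max(sorted(set(y_train)), key=y_train.count)
    return lambda _sample: majority


def calculate_accuracy(model, x_test, y_test):
    hits = sum(model(x) == y for x, y in zip(x_test, y_test))
    return hits / len(y_test)


def cross_validate(datasets, k=5):
    """Return the mean fold accuracy for every dataset."""
    report = {}
    for name, (samples, labels) in datasets.items():
        scores = [calculate_accuracy(generate_ml(x_tr, y_tr), x_te, y_te)
                  for x_tr, y_tr, x_te, y_te in make_folds(samples, labels, k)]
        report[name] = sum(scores) / len(scores)
    return report


# Hypothetical toy dataset: ten nodes, eight labeled "No" and two labeled "Yes".
toy = (list(range(10)), ["No"] * 8 + ["Yes"] * 2)
report = cross_validate({"toy": toy}, k=5)
```

Replacing `generate_ml` with a real graph classifier would leave the surrounding loop unchanged, which is the point of separating fold construction from model training.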
3. TESTS
3.1. Test preparation
The proposed method was tested on three real datasets, namely DBLP, Pubmed, and Cora, which have
been used in numerous empirical studies. DBLP: this is a computer science bibliography database containing
the name of major authors, conferences, and publications. In the network used for DBLP, objects represent
authors. The meta-paths considered in this network are author-paper-author (APA), author-paper-author-
paper-author (APAPA), author-conference-author (ACA), and author-conference-author- conference-author
(ACACA). This dataset is typically used to extract different topics and examine the diffusion of information
about a specific topic. Information contained in this dataset pertains to the period between 1954 and 2016.
Pubmed: this is a bibliography dataset for the field of medical sciences, which includes authors,
conferences, and publications. In this network used for Pubmed, authors are represented by objects and
the considered meta-paths are APAPA and APA. Information of this dataset is for the period between 1994
and 2003. Cora: this is another computer science bibliography database. The meta-paths used for this dataset
are APAPA and APA. This dataset contains information from 1990 to 2012.
In the evaluation process, the diffusion process was modeled for several topics contained in these
datasets, which include data mining, machine learning, social networks, health care, DNA and infectious disease.
These particular topics were selected because of their high frequency in the dataset and the considerable amount
of data available for comparison and conclusion.
Training and testing operations were performed using the K-Fold method as described earlier in
the paper. In this method, data is partitioned into K subsets. Each time, one of these K subsets is used for testing
and the other K-1 are used for training. This procedure is repeated K times so that each sample is used exactly once
for testing and K-1 times for training. In the end, the average result of these K tests is reported as a final estimate. In
the K-Fold method, the ratio of classes in each subset should match this ratio in the main set.
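The requirement that each subset preserve the class ratio is usually called stratification. A minimal Python sketch follows, assuming a simple round-robin assignment within each class; the function name `stratified_folds` and the example labels are hypothetical, and the paper does not state how its folds were actually built.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign every sample index to one of k folds, class by class, round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        for position, index in enumerate(by_class[label]):
            folds[position % k].append(index)
    return folds


# Hypothetical labels: 4 active and 8 inactive nodes (ratio 1:2).
labels = ["active"] * 4 + ["inactive"] * 8
folds = stratified_folds(labels, k=4)
active_per_fold = [sum(labels[i] == "active" for i in fold) for fold in folds]
```

When a class size is not a multiple of K, this scheme leaves the folds slightly unequal, so the ratio is matched only approximately.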
Finally, the performance of the method in predicting topic diffusion was evaluated in terms of
the criterion known as Precision. This criterion was calculated using the following definitions:
− TP: if an active node is correctly labeled as active
− TN: if an inactive node is correctly labeled as inactive
− FP: if an inactive node is incorrectly labeled as active
− FN: if an active node is incorrectly labeled as inactive
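The formula itself is not reproduced in the text here; the standard definition is Precision = TP / (TP + FP). The Python sketch below applies it under the usual convention that a false positive is an inactive node labeled as active and a false negative is an active node labeled as inactive. The function names and the five example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="active"):
    """Count TP, TN, FP and FN with `positive` as the class of interest."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts


def precision(counts):
    """TP / (TP + FP); defined here as 0.0 when nothing was predicted positive."""
    predicted_positive = counts["TP"] + counts["FP"]
    return counts["TP"] / predicted_positive if predicted_positive else 0.0


y_true = ["active", "active", "inactive", "inactive", "active"]
y_pred = ["active", "inactive", "active", "inactive", "active"]
counts = confusion_counts(y_true, y_pred)
score = precision(counts)
```

Precision therefore answers the question: of the nodes predicted to become active, what fraction actually did?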
Table 1 presents the parameters of the GCN algorithm illustrated in Figure 4. All tests of this study
were performed with these parameter settings. In this table, hidden1 and hidden2 are the number of nodes in
the two convolution layers. Also, early_stopping refers to the early termination condition of the algorithm,
which is convergence in less than 10 iterations.
scientific topics. In this section, the test results for several topics on DBLP and Pubmed datasets are reported
based on precision and recall criteria. Table 2 presents the results of this evaluation in terms of precision.
Table 2. Results of comparison of the proposed method with other methods for different topics on DBLP
using the precision criterion
                        Precision (%)
Subject                 MTLM-R   HPM-LT   HPM-IC   GCN
Data Mining             56       55       60       75
Machine Learning        37       32       48       50
Social Network          39       38       40       50
Medical Care            56       55       62       75
DNA                     11       12       14       15
Infectious Disease      22       20       21       25
Software Engineering    10       20       25       30
Big Data                14       21       22       25
Network                 16       19       21       25
Genetic                 33       40       50       75
Biology                 10       11       37       50
Neural Network          20       21       22       25
As these results indicate, the proposed method achieved the highest precision for every topic in Table 2; its
margin over the strongest competing method in each row ranges from 1 percentage point (DNA) to 25 percentage
points (Genetic). In the DBLP dataset, for example, the proposed method has a 10% higher
precision than other methods. The reason for this improvement could be that other methods are purely based on
probability functions and calculation of probability between neighboring vertices. This means that these methods
have no such thing as feature learning or intelligent processes and only repeat a constant set of calculations.
In contrast, as explained in the description of the architecture, the proposed method uses different learning
operations for each segment. For example, the convolution function is designed to determine the relation of each
node to its neighbors through a learning process. These operations are learned intelligently during the evolution
of the GCN algorithm. It should also be noted that in learning algorithms, the entire problem space can be easily
explored, whereas, in probability function-based methods, only a part of the problem space can be searched.
4. CONCLUSION
This paper presented a machine learning method based on the graph neural network algorithm, which
involves the selection of inactive vertices based on their neighboring active vertices in each scientific topic.
Basically, in this method, information diffusion paths are predicted through the activation of inactive vertices
by active vertices. Since predicting the new users who will be in the path of information flow is a recognition
process, this problem can be solved by machine learning algorithms. The proposed method was tested on three
real datasets, DBLP, Pubmed, and Cora, which are extensively used in the empirical studies. The evaluation
process involved modeling the diffusion process for several topics contained in these datasets, including data
mining, machine learning, social networks, health care, DNA and infectious disease.
Test results showed that the proposed method outperforms other methods in this area. As a potential
idea for future studies, the proposed system can be implemented in a parallel platform or with the extraction
and combination of other features to reach a stronger system. The use of more robust machine learning concepts
may also enhance the quality of the method. The methods with possible benefits in this area include feature
reduction and feature learning. The feature reduction method is particularly useful for reducing the overall
complexity of the recognition method. Feature learning is a process involving the transfer of the data processing
from the original feature space to a new space with higher feature resolution.
REFERENCES
[1] E. Bakshy, I. Rosenn, C. Marlow, L. Adamic, “The role of social networks in information diffusion,” in: Proceedings
of the 21st international conference on World Wide Web, ACM, pp. 519, 2012.
[2] M. S. Granovetter, “The strength of weak ties,” In Social networks, Academic Press. pp. 347-367, 1977.
[3] Y. Hu, R. J. Song, M. Chen, “Modeling for Information Diffusion in Online Social Networks via Hydrodynamics,”
IEEE Access, vol. 5, 2017.
[4] K. Ikeda, et al., “Multi-agent information diffusion model for twitter,” In Proceedings of the 2014 IEEE/WIC/ACM
International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)- IEEE Computer
Society vol. 1, pp. 21-26, 2014.
[5] T. N. Kipf, M. Welling, “Semi-Supervised Classification with Graph Convolutional Networks,” arXiv
preprint arXiv:1609.02907, 2016.
[6] Y. Moreno, R. Pastor-Satorras, A. Vespignani, “Epidemic outbreaks in complex heterogeneous networks,”
The European Physical Journal B, vol. 26, no. 4, 2002.
[7] R. Yang, B.-H. Wang, J. Ren, W. J. Bai, Z. W. Shi, W. X. Wang, T. Zhou, “Epidemic spreading on heterogeneous
networks with identical infectivity,” Physics Letters A, vol. 364, pp. 3-4, 2007.
[8] M. Salehi, R. Sharma, M. Marzolla, M. Magnani, P. Siyari, D. Montesi, “Spreading processes in multilayer
networks,” IEEE Transactions on Network Science and Engineering, vol. 2, no. 2, 2015.
[9] L. Wang, G. Z. Dai, “Global stability of virus spreading in complex heterogeneous networks,” Siam Journal on
Applied Mathematics, vol. 68, no. 5, 2008.
[10] M. Nadini, K. Sun, E. Ubaldi, M. Starnini, A. Rizzo, N. Perra, “Epidemic spreading in modular time-varying
networks,” arXiv preprint arXiv:1710.01355, 2017.
[11] G. Demirel, E. Barter, T. Gross, “Dynamics of epidemic diseases on a growing adaptive network,” Scientific Reports, vol. 7, 2017.
[12] P. Sermpezis, T. Spyropoulos, “Information diffusion in heterogeneous networks: The configuration model approach,”
in: Proceedings-IEEE INFOCOM, pp. 3261, 2013.
[13] Y. Zhou, L. Liu, “Social influence based clustering of heterogeneous information networks,” in: Proceedings of
the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp. 338, 2013.
[14] S. Molaei, S. Babaei, M. Salehi, M. Jalili, “Information spread and topic diffusion in heterogeneous information
networks,” Scientific Reports, vol. 8, no. 1, 2018.
[15] S. Molaei, et al., “Information Spread and Topic Diffusion in Heterogeneous Information Networks,” Sci. Rep., vol. 8, 2018.
[16] H. Gui, Y. Sun, J. Han, G. Brova, “Modeling Topic Diffusion in Multi-Relational Bibliographic Information
Networks,” In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge
Management - CIKM ‘14, pp. 649-658, 2014.
[17] T. Mikolov, K. Chen, G. Corrado, J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint
arXiv:1301.3781, 2013.
[18] W. Cheng, C. Greaves, M. Warren, “From n-gram to skipgram to concgram,” International journal of corpus
linguistics, vol. 11, no. 4, 2006.
[19] B. Perozzi, R. Al-Rfou, S. Skiena, “Deepwalk: Online learning of social representations,” in: Proceedings of the 20th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 70, 2014.
[20] S. Hochreiter, J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, 1997.
[21] Q. Cao, H. Shen, K. Cen, W. Ouyang, X. Cheng, “Deephawkes: Bridging the gap between prediction and
understanding of information cascades,” in: Proceedings of the 2017 ACM on Conference on Information and
Knowledge Management, ACM, 2017.
[22] Y. LeCun, Y. Bengio, et al., “Convolutional networks for images, speech, and time series,” The handbook of brain
theory and neural networks, vol. 3361, no. 10, 1995.
[23] F. J. Ordonez, D. Roggen, “Deep convolutional and lstm recurrent neural networks for multimodal wearable activity
recognition,” Sensors, vol. 16, no. 1, 2016.
[24] Citation Network Dataset, Available [online], URL http://konect.uni-koblenz.de/networks/subelj_cora
[25] P. Kumaran, S. Chitrakala, “Community formation based influence node selection for information diffusion in online
social network,” In 2016 International Conference on Computing Technologies and Intelligent Data Engineering
(ICCTIDE, 2016), pp. 1-6, 2016.
[26] M. Lahiri, M. Cebrian, “The Genetic Algorithm as a General Diffusion Model for Social Networks,” Proc. of the
24th AAAI Conference on Artificial Intelligence, 2010.
[27] L. Li, S. Li, X. Chen, “A new genetics-based diffusion model for social networks,” In 2011 International Conference
on Computational Aspects of Social Networks (CASoN, 2011), pp. 76-81, 2011.
[28] H. Zhu, C. Huang, H. Li, “Information diffusion model based on privacy setting in online social networking services,”
The Computer Journal, vol. 58, no. 4, pp. 536-548, 2014.
[29] L. Liu, et al., “Modelling of information diffusion on social networks with applications to WeChat,” Physica A:
Statistical Mechanics and its Applications, vol. 496, pp. 318-329, 2018.
[30] A. Madi, O. K. Zein, S. Kadry, “On the improvement of cyclomatic complexity metric,”
International Journal of Software Engineering and Its Applications, vol. 7, no. 2, pp. 67-82, 2013.
[31] S. Kadry, R. Younès, “Etude Probabiliste d’un Systeme Mecanique a Parametres Incertains par une
Technique Basee sur la Methode de transformation,” Proceeding of CanCam, Canada, 2005.
Artificial Intelligence Review
https://doi.org/10.1007/s10462-020-09839-0
Mahdi Hashemi1 · Margeret Hall2
Abstract
This study aims at automatic processing and knowledge extraction from large amounts
of oncology-related content from online social networks (OSN). In this context, a large
number of OSN textual posts concerning major cancer types are automatically scraped and
structured using natural language processing techniques. Machines are trained to assign
multiple labels to these posts based on the type of knowledge enclosed, if any. Trained
machines are used to automatically classify large-scale textual posts. Statistical inferences
are made based on these predictions to extract general concepts and abstract knowledge.
Different approaches for constructing document feature vectors showed no tangible effect
on the classification accuracy. Among different classifiers, logistic regression achieved the
highest overall accuracy (96.4%) and F1 (73.4) in a 13-way multi-label classification of
textual posts. The most common topic was seeking or providing moral support for cancer
patients, followed by providing technical information about cancer causes and treatments.
The most common causes and treatments of different types of cancer on OSN are also
automatically detected in this study. Seeking or providing moral support for cancer patients
shared the largest overlap with other topics, i.e. moral support tends to be present even in
OSN posts which focus on other topics. On the other hand, providing technical information
about cancer diagnosis or prevention were the most isolated topics, where OSN posts tend
not to allude to other topics. OSN posts which seek financial support only overlap with the
moral support topic, if any. Our methodology and results provide public health professionals
with an opportunity to monitor which topics are being discussed on OSN and to what extent,
what specific information and knowledge are being disseminated over OSN, and to
assess their veracity in close to real time. This helps them to develop policies that encourage,
discourage, or modify the consumption of viral oncology-related information on OSN.
* Mahdi Hashemi
[email protected]
1 Department of Information Sciences and Technology, George Mason University, 4400 University
Dr, Fairfax, VA 22030, USA
2 College of Information Science and Technology, University of Nebraska at Omaha, 1110 S 67th
St, Omaha, NE 68182, USA
M. Hashemi, M. Hall
1 Introduction
Multi-label classification and knowledge extraction from…
OSN have also been viewed as crowdsourcing platforms where clinicians and patients can
collaborate to manage disease outbreak and predict pandemics (Ritterman et al. 2009).
So far it has only been shown that OSN are massive repositories of health- and cancer-
related information. Nevertheless, automatic oncology-related knowledge extraction from
OSN has yet to be studied and is the focus of this work.
2 Related work
OSN have been viewed as a space where cancer patients can collectively identify treatment
options through finding information about clinical trials (Chretien et al. 2011; Rehman
et al. 2018) as well as a closed group to meet the informational needs of patients and survivors
of cancer. For instance, most participants in the family/caregiver group in the cancer
prevention education study by Jiménez et al. (2018) identified OSN, specifically
Facebook and Twitter, as a viable and useful tool for sharing information about
cancer prevention education. As a result, OSN have been reported as potential sources
of oncology-related knowledge dissemination (Attai et al. 2015), cancer prevention and
screening (Lyles et al. 2013; Falzone et al. 2017), interventions for cancer care (Strekalova
and Krieger 2017; Jiang 2017), and anxiety relief for cancer patients (Attai et al. 2015).
Murthy et al. (2011) argued that OSN (specifically Twitter) has changed the relationship
between health institutions (including individual doctors) and the public in that previously
monologic health dictums and warnings can now be interrogated, individually situated, or
affirmed through an interaction with the institution or person tweeting that information.
The authors argue that Twitter has had a recent impact on the ways in which health information
and resources are shared and it has the potential to foster better health outcomes.
For instance, an individual may decide to schedule a colonoscopy or mammogram after
receiving Tweets discussing the successful cases of patients who beat cancer that was discovered
at an early stage. Individuals diagnosed with cancer, caregivers, and family members
have been using Twitter, not only to gather information on particular cancer treatment
options and clinical trials, but also to ask questions about their specific cases to leading
oncologists in the field.
Recent work in machine learning and natural language processing has studied the
health content of Tweets and demonstrated the potential for extracting useful public-
health information from their aggregation (Dredze 2012). Disease self-management can
be hard to study, as it does not involve observation by trained clinicians and patients
might be reluctant to share unapproved practices with health officials. Monitoring medication
usage on Twitter can discover new trends in self-medication otherwise unreported
by patients. Paul and Dredze (2011) studied medication usage from Tweets by creating
medication usage profiles based on ailment groupings. For pain relievers, for example,
they found that Tylenol and Advil have broad profiles (headache, cold relief, and so
on) while Vicodin is targeted at dental problems and injuries. For allergy medication,
Claritin and Zyrtec were almost exclusively used to treat allergies, while off-label uses
of Benadryl included insomnia. Twitter can support other public-health tasks, such as
health risk assessments. For example, Paul and Dredze (2011) uncovered interesting
correlations, such as a positive correlation between states with high smoking rates and
those with high Twitter message rates about cancer (r = 0.648), a negative correlation
between exercise and obesity messages (r = − 0.201), and a negative correlation between
good healthcare coverage and messages about ailments in general (r = − 0.253), where
r stands for the correlation coefficient. Through an online survey of 498 females who followed
the purple ribbon Twitter campaign (@pprb), a cervical cancer prevention campaign,
Yoo et al. (2018) showed that there is a positive binary relationship between the
information people receive from OSN and the actions they take.
Paul and Dredze (2011) showed that there are many Tweets discussing the details
of a single ailment, including general words about the disease/illness, its symptoms,
and treatment. For example, the message: fever + headache = flu, home sick with Tylenol
discusses influenza, where fever and headache are symptoms, Tylenol a treatment, and
flu a general word associated with the ailment. Other examples are: took some Tylenol
for my flu and stuck home with flu and 102 fever. Although their focus was on general
diseases (not specifically cancer) and they did not offer any automatic approaches for
identifying and dissecting such Tweets, they established the fact that Tweets contain such
information.
As outlined in this section, many researchers have shown that pieces of oncology-related
information (medically correct or incorrect) are unsystematically spread among
OSN content (Chretien et al. 2011; Rehman et al. 2018; Jiménez et al. 2018; Attai et al.
2015). However, existing research does not go further than supporting this claim by manually
processing small amounts of text (Paul and Dredze 2011). The reason for the manual
processing of OSN content is that it is among the most complicated and difficult
data collections to comprehend (Hashemi 2019). This study aims at automatic processing
and knowledge extraction from large amounts of oncology-related OSN content. It
studies what types of oncology-related information disseminate over OSN and how they
can be inferred statistically.
This is no ordinary task as OSN’s textual content is extremely unstructured and filled
with acronyms, special characters, and typos (Hashemi and Hall 2019). OSN content
does not go through the same rigorous editing and review process that books and articles
do, and people often improvise ways to drop letters from words and shorten
their sentences due to character limits on some OSN (Jaidka et al. 2019). This is a major
handicap, considering that even well-written documents are considered unstructured
data (Hashemi 2019). Therefore, collecting OSN posts and structuring them automatically
will be lossy, i.e. many terms would not be recognized by the machine and sentences
will lose their original cohesion. These losses are inevitable in order to automatically
structure large numbers of OSN posts. However, the tremendous number of posts
and sophisticated natural language processing (NLP) techniques would partially make
up for these losses in our study. In other words, despite those losses, the machine will be
able to pick up major patterns and information among posts.
The second major barrier is how useful pieces of knowledge can be extracted from
textual posts. The key question is how a machine can automatically learn anything
about cancer from thousands of posts. This is not a straightforward regression or classification
problem (Hashemi and Karimi 2018), as the task itself is not well-defined and there
are no training data, or even well-defined features or output classes. Our methodology
attempts to solve this problem by first multi-labeling OSN posts based on the type of
knowledge contained, if any, and then providing analytics in each class.
The last concern, not addressed in this study, is the veracity of the extracted knowledge.
For instance, a specific cause or treatment for cancer might become viral through
OSN posts, while it is medically proven not to be a legitimate cause or effective treatment
(Yoo et al. 2018). Although the veracity of every piece of extracted knowledge
needs to be verified, this task falls to those with profound medical expertise in
the field.
3 Dataset
Twitter was chosen for our study because it is a prominent example of OSN (Chou et al.
2009), with currently over 330 million monthly active users (Twitter 2019), it is a public
forum where everyone’s posts are publicly available, and it has had an exceptional impact on
oncology-related information sharing and seeking (Murthy et al. 2011). Twitter is a microblogging
(i.e. short message) service that enables its users to send and read short Tweets.
Rather than employing the taxonomy of friends, Twitter has followers and followees. A
follower is someone who considers the followee interesting for any reason. The relationships
are often asymmetric, consisting of unidirectional arcs because a user might not necessarily
follow his/her followers. A post on Twitter, called a Tweet, may contain multiple
mentions and hash tags. A mention is another user’s username and a hash tag plays the role
of a keyword which determines a Tweet’s topic. A Tweet can be re-tweeted or replied to by
followers. A re-Tweet is when a follower reposts a Tweet from his/her followee and a
reply is when a follower comments under a Tweet.
Tweets, written in English, containing at least one of the following 14 word strings (or,
cancer types), were collected from Dec 7th 2018 until Feb 11th 2019: breast cancer, lung
cancer, prostate cancer, colon cancer, melanoma, bladder cancer, non-hodgkin lymphoma,
kidney cancer, endometrial cancer, leukemia, pancreatic cancer, thyroid cancer, rectal cancer,
and liver cancer. These are cancer types with an estimated annual incidence of at least
40,000 in the United States for 2019 (American Cancer Society 2019). The most common
type of cancer on this list is breast cancer, with 271,270 new cases expected in the
United States in 2019. The next most common cancers are lung cancer and prostate cancer,
with 228,150 and 174,650 new cases, respectively, expected in the United States in 2019. Re-Tweets are
excluded from the data collection process, because they are duplicates of original Tweets.
This search resulted in 209,155 Tweets.
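The collection criteria above can be sketched as a simple filter. This is a minimal illustration, not the collection code used in the study: the function names, the `is_retweet` flag, and the case-insensitive substring matching are assumptions, since the text does not state how matching was implemented. Language filtering is omitted.

```python
# The 14 search strings listed in the text.
CANCER_KEYWORDS = [
    "breast cancer", "lung cancer", "prostate cancer", "colon cancer",
    "melanoma", "bladder cancer", "non-hodgkin lymphoma", "kidney cancer",
    "endometrial cancer", "leukemia", "pancreatic cancer", "thyroid cancer",
    "rectal cancer", "liver cancer",
]


def matched_keywords(text):
    """Return the search strings found in a Tweet (case-insensitive; an assumption)."""
    lowered = text.lower()
    return [keyword for keyword in CANCER_KEYWORDS if keyword in lowered]


def keep_tweet(text, is_retweet):
    """Keep original Tweets (no re-Tweets) that contain at least one search string."""
    return (not is_retweet) and bool(matched_keywords(text))
```

Plain substring matching is the simplest choice here; a production filter would likely need word-boundary handling, which is outside what the text specifies.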
4 Methodology
From Dec 7th 2018 until Feb 11th 2019, a total of 209,155 Tweets were collected. Two
undergraduate students were employed from January 2019 to May 2019 to manually classify
Tweets. Due to the subject’s nuanced nature, each Tweet was classified twice by different
people. Tweets with disagreements were classified by a jury of four, including two
faculty-level domain experts. This process resulted in 10,618 classified Tweets (5% of all the
collected Tweets). These Tweets provide the training samples for the machine. Out of all
the manually classified Tweets (i.e. training data), 8% were Tweets with disagreements,
necessitating a discussion among the jury to decide on their label.
Tweets that contain no information about any aspect of cancer that could be useful to
the public or in any collective statistical analysis were placed in a separate category, which
we refer to as irrelevant. For instance: “Eventually leukemia was ruled out and we took
him to a different vet after that”. The vast majority of these Tweets contain one of our key-
words but in a metaphoric or sarcastic way, for instance: “and you’re a rectal cancer that
just won’t go away”. In other words, they are not semantically related to cancer. Tweets
that are related to cancer are further classified into 12 classes, defined below, based on
M. Hashemi, M. Hall
their content. Tweets that contain aspects of multiple classes are placed in all those classes
(multi-label classification). These classes were determined through discussions among the
students who classified the Tweets, computer scientists, and bioinformatics experts. In the
beginning, we had to define new classes to cover Tweets that did not fit in any of
the existing classes; half-way through manually classifying the training Tweets, the classes
had stabilized and no new classes were needed.
Through Tweets in this class, awareness campaigns, similar groups, and sometimes individuals
raise awareness about cancer, for instance by urging at-risk people to get screened for
cancer. Representative Tweets from this class follow:
• My Xmas present will be getting my sons agree to have a prostate check-up. Too many
don’t bother or are embarrassed. Don’t be. Cancer is more than an embarrassment.
• US Preventive Services Task Force makes strong recommendation for all adults aged
50–75 to get screened for colon cancer https://t.co/1qBY8Y9TxL, https://t.co/gxC7W
BXuw7.
• With 1 in 8 women getting breast cancer, now is the time to invest in our long term
health. #generationgood https://t.co/8iQlooyB5J.
• Excluding skin cancers, colorectal cancer is the third most common cancer diagnosed
in both men and women in the United States. Talk to your doctor about getting screened
because early signs of colon cancer aren’t always evident. https://t.co/KCmb97iOUu.
Tweets in this class advertise conferences, journals, articles, books, or other published
material on the topic of cancer, cancer research, or cancer support. Representative Tweets
from this class follow:
4.1.3 Fundraising (F)
This class covers Tweets through which individuals or organizations attempt to raise money for
cancer patients. Tweets that refer to donations to cancer patients also belong to this class.
Representative Tweets from this class follow:
• Help […] Leukemia Fund every time you shop: https://t.co/VNMneCw6WW #iGive-
DoYou.
• If […] helps you win your season, consider donating to the Leukemia and Lym-
phoma Society in his name https://t.co/GkAujShP6B.
• @TheEllenShow Ellen, I am trying to raise money for […]. He has been battling
Leukemia. He’s not able to attend school because of chemo treatments. I’ve started
a Go Fund Me Acct for […] under my name […]. So far I have had no luck. Please
spread the word for […].
This class defines Tweets that contain technical information about what causes (or does
not cause) cancer or what raises (or reduces) the risk of contracting it. Representative
Tweets from this class follow:
This class defines Tweets that contain technical information about the detection and
diagnosis of cancer. This includes specific medical tests that determine whether or not
an individual has cancer, gauge their risk of getting cancer, or screen a cancer patient’s
response to specific treatment plans or different courses of action. If a Tweet contains
information about screening guidelines or change thereof, it will also be classified here.
To qualify for this class, a Tweet must specify the type of the screening test. Represent-
ative Tweets from this class follow:
• Getting a breast implant may make it harder for ur mammogram to pick up cancer.
• Consultant at Capgemini, and a participant from the second batch of PGDM In
#BusinessAnalytics presenting his #capstone project on ‘Breast Cancer Prediction
from Finite Needl Aspiration Data using #MachineLearning Algorithms’ capstone
project. #Analytics #breastcancer.
• Pennell Liquid Biopsies for EGFR T790M NonSmall Cell Lung Cancer NSCLC
[720p] https://t.co/2MCrp3EaVb, https://t.co/ykwear5uUh.
• Optical coherence tomography for diagnosing skin cancer in adults https://t.co/Jkkhs
ASSpM Insufficient data are available on the use of OCT for the detection of mela-
noma or cSCC. @CochraneLibrary #melanoma @FabioGo38238336 https://t.co/
PiNWIqDEiy.
This class defines Tweets that contain technical information about what prevents (or does
not prevent) cancer or its development. It also includes information about what prevents
cancer recurrence in cancer survivors. Examples are food, drinks, drugs, and behaviors.
Representative Tweets from this class follow:
This class defines Tweets that contain technical information about what are (or are not)
symptoms of cancer, for instance pain in a certain region of the body. Representative Tweets
from this class follow:
This class defines Tweets that contain technical information about therapies and drugs that
cure (or do not cure) cancer. Examples are surgery, immunotherapy, chemotherapy, and
radiotherapy. To qualify for this class, a Tweet must specify the type of therapy or drug.
Representative Tweets from this class follow:
This class defines Tweets where an individual references themselves or others suffering
from cancer and seeks or provides moral support. It also includes Tweets containing infor-
mation about events in support groups or merchandise promotion, such as pink or special
edition clothing. Representative Tweets from this class follow:
Through Tweets in this class, people share the news that either themselves or someone else
has been cured of cancer. Representative Tweets from this class follow:
Through Tweets in this class, people share the news that either themselves or someone else
has been diagnosed with cancer. Representative Tweets from this class follow:
• This week has been a week. My partner’s mum has been diagnosed with leukemia
which is heavy. She has been admitted to the Beatson which seems like a good place for
her to be but it’s looking like she will still be there on Christmas.
[Fig. 1: bar chart of the number of training samples in each class; x-axis: A, C, F, KC, KD, KP, KS, KT, M, NC, ND, NP]
• Please pray for my dear friends, […] and […]! He has just been diagnosed with stage 4
pancreatic cancer and will begin treatment right away! Thank you to all who pray!
• Today I found out my tumor is malignant. This will probably be the second worst day
in my entire life. Sunday will be the worst. That is when we are telling our kids I have
breast cancer.
Through Tweets in this class, people share the news that someone has passed away due to
cancer or cancer complications. Representative Tweets from this class follow:
• Please pray for my close friend, his dad died yesterday after his third battle with leuke-
mia and he is doing rough.
• Former ‘America’s next top model’ contestant Jael Strauss has died of breast cancer
https://t.co/t2SgunQleO.
The information contained in Tweets is not screened for medical accuracy or validity.
Therefore, the knowledge contained in Tweets could be factually incorrect. Figure 1 shows
the number of training samples from each class. It is important to mention that this is a
multi-label classification, where more than one label may be assigned to each sample. In
other words, there is no restriction that a sample must exclusively belong to one class. This
approach was applied because there were Tweets containing information or knowledge
from more than one class. For example, the following first Tweet belongs to three classes,
M, F, and KT, the second Tweet belongs to three classes, KP, KC, and NP, the third Tweet
belongs to two classes, KT and NC, and the fourth Tweet belongs to two classes, KT and
C:
• I just joined because I want to reach out and help […], who is battling Leukemia. He’s
not able to attend school, in Forest, Ms. right now because of taking chemo treatments.
I’ve started a Go Fund Me Acct for him for Christmas. It’s under […]. Please help.
• … That’s amazing! I’m so happy for you. I lost my sister in 2014 to pancreatic cancer.
I’d be curious about your diet & supplements during this time. Sugar is said to feed
cancer & a healthy immune system prior to chemo is supposed to help, as chemo kills
good cells also.
• My sister survived Stage 4 rectal and colon cancer that had spread to her liver. Weed
gave her an appetite post-surgery that she didn’t have on her own and solved her nausea
issue. She is alive because of the weedification of America. It is a powerful tool that
Big Pharma hates.
• A novel way to treat breast cancer: blocking metastasis doorways to “prevent” metasta-
sis. If you want to know more, stop by our poster NOW P6-18-22 #SABCS18 #breast-
cancer https://t.co/hiTVkazyDc.
4.2 Automatic classification
The input to a machine or automatic classifier is a feature vector. In other words, machines
classify feature vectors. Therefore, each Tweet needs to be converted to a feature vector.
Natural language processing techniques address that need. The following steps are taken, in
the same order, to construct a feature vector out of each Tweet: tokenization (Jurafsky and
Martin 2014), non-alphanumeric character removal (Uysal and Gunal 2014), stop-word
removal (Rajaraman and Ullman 2011; Uysal and Gunal 2014; see footnote 1), lowercase conversion
(Uysal and Gunal 2014), lemmatizing (Paul and Dredze 2011) (instead of stemming; Porter
1980), lexicon creation, and feature vector construction.
Tokenization is the process of transforming a stream of characters into a stream of
processing units called tokens, e.g. syllables, words, phrases, or sentences (Jurafsky and
Martin 2014). Our tokenizer splits each text into words, i.e. words are tokens. Stop-word
removal entails removing non-informative words such as prepositions, conjunctions, and
certain high-frequency words.
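The first preprocessing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, tokens are split on whitespace, only ASCII letters and digits are kept, the stop-word set is a small subset of the list in footnote 1, and stop words are matched case-insensitively. Lemmatization is left out because it requires a vocabulary and morphological analysis.

```python
import re

# A small subset of the stop-word list given in footnote 1.
STOP_WORDS = {"i", "is", "a", "the", "of", "to", "and", "for", "with", "or",
              "my", "has", "been"}


def tokenize(text):
    """Split a Tweet into word tokens on whitespace."""
    return text.split()


def strip_non_alphanumeric(tokens):
    """Remove non-alphanumeric characters and drop tokens that become empty."""
    cleaned = (re.sub(r"[^A-Za-z0-9]", "", token) for token in tokens)
    return [token for token in cleaned if token]


def remove_stop_words(tokens):
    """Drop stop words; the comparison ignores case (an assumption)."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


def preprocess(text):
    """Apply the steps in the order given in the text, up to lowercasing."""
    tokens = tokenize(text)
    tokens = strip_non_alphanumeric(tokens)
    tokens = remove_stop_words(tokens)
    return [token.lower() for token in tokens]
```

For example, `preprocess("Cancer is a group of diseases!")` yields `["cancer", "group", "diseases"]`; a lemmatizer would then map `diseases` to `disease`.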
The goal of both stemming and lemmatization is to reduce inflectional forms and deri-
vationally related forms of a word to a common base form. For instance: car, cars, car’s,
and cars’ will become car after either stemming or lemmatizing. Stemming [e.g. Porter’s
stemming algorithm (Porter 1980)] is a crude heuristic process that chops off the ends of
words and removes derivational affixes. Lemmatization, however, uses a vocabulary and
analyzes words morphologically, to remove inflectional endings only and to return the base
or dictionary form of a word, which is known as the lemma. Stemming collapses deri-
vationally related words, whereas lemmatization only collapses the different inflectional
forms of a lemma. Although stemmers use language-specific rules, they require less knowledge
than a lemmatizer. A lemmatizer requires a complete vocabulary and morphological
analysis to accurately identify the lemma for each word, which produces modest benefits
for retrieval. The following example compares the behavior of a stemmer and a
lemmatizer. Lemmatization is chosen over stemming in this work.
Footnote 1 (stop-word list):
I, you’d, below, so, who, is, mightn’t, did, for, the, any, each, hers, more, own, mustn, about, o, wouldn’t,
between, off, s, doesn, ve, it’s, as, just, be, won’t, they, your, yourselves, isn’t, from, where, y, d, ourselves,
she’s, at, our, why, him, you, can, himself, such, haven, to, most, you’ve, above, myself, than, now, here,
only, it, through, aren, while, has, am, aren’t, but, down, too, hadn, he, other, there, having, not, itself,
shouldn’t, up, until, on, didn, how, been, both, her, wouldn, shouldn, nor, being, shan’t, further, themselves,
or, herself, all, theirs, during, no, out, after, needn’t, ain, should’ve, which, under, couldn’t, whom, doesn’t,
their, ma, yours, you’re, if, these, my, again, wasn, weren’t, you’ll, wasn’t, when, don, because, hadn’t,
that’ll, once, over, will, some, isn, does, shan, its, had, what, didn’t, were, an, re, and, are, against, into,
have, mustn’t, this, do, in, before, yourself, t, same, was, doing, mightn, we, weren, haven’t, that, needn,
few, hasn’t, me, she, ours, of, with, don’t, m, a, couldn, by, hasn, won, then, should, them, those, very, his,
ll.
Fig. 2 Histogram of frequencies vs. number of lemmas in the lexicon with that frequency among Tweets (x-axis: frequency of a lemma in the lexicon, from 1 to 20 and >20)
Sample text:
Cancer is a group of diseases involving abnormal cell growth with the
potential to invade or spread to other parts of the body
Stemmer: Cancer is a group of diseas involv abnorm cell growth with the potenti to
invad or spread to other part of the bodi
Lemmatizer: Cancer is a group of disease involving abnormal cell growth with the
potential to invade or spread to other part of the body
We need to assign a feature vector to each sample or Tweet. A popular approach for
this purpose is the bag-of-words representation which discards any information about
the order or structure of terms in the sample document. The bag-of-words representa-
tion creates a feature vector for each sample. The feature vector’s length is equal to
the lexicon’s size and each element represents the frequency of a lexicon term in that
sample. Besides the frequency of a lexicon term, it is also common to use one and zero to
indicate its presence or absence. An alternative to the term-frequency and term-presence
feature vectors is the term-TFIDF weight feature vector (Salton and Buckley 1988). TFIDF
weights, calculated from Eqs. 1 and 2, intend to give higher weights to terms which appear
in fewer documents and lower weights to terms occurring in many documents. This is
achieved by multiplying a term’s frequency by an inverse document frequency (IDF)
factor. In Eqs. 1 and 2, w_ij is the weight of the i-th term (t_i) in the j-th document (d_j), TF(t_i, d_j)
is the frequency of term t_i in document d_j, d is the number of documents, and DF(t_i) is
the number of documents containing the term t_i.
$$w_{ij} = \mathrm{TF}(t_i, d_j) \times \mathrm{IDF}(t_i) \tag{1}$$

$$\mathrm{IDF}(t_i) = \log\left(\frac{d}{\mathrm{DF}(t_i)}\right) \tag{2}$$
Term-TFIDF weight feature vectors are usually used for clustering purposes, and their
application to classification problems involves a subtle point. Let’s assume a classifier
is trained using the term-TFIDF weight feature vectors of a corpus of training documents. If a
new sample needs to be classified, the frequencies of its terms first need to be multiplied
by their corresponding IDF(t_i), which were calculated based on the training corpus of
documents. In our experiments, we will apply all three approaches to create feature vectors,
i.e. term-frequency, term-presence, and term-TFIDF, and compare their results.
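The three feature-vector variants, and the point about reusing training-corpus IDF values for a new sample, can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names and the toy corpus are hypothetical, the natural logarithm is assumed in Eq. 2 (the text does not state the base), and terms of a new sample that are missing from the lexicon are assumed to be ignored.

```python
import math
from collections import Counter


def build_lexicon(documents):
    """Sorted list of all distinct terms in the tokenized training documents."""
    return sorted({term for doc in documents for term in doc})


def term_frequency_vector(doc, lexicon):
    """One element per lexicon term: how often the term occurs in the sample."""
    counts = Counter(doc)
    return [counts[term] for term in lexicon]


def term_presence_vector(doc, lexicon):
    """One element per lexicon term: 1 if the term is present, else 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in lexicon]


def fit_idf(documents, lexicon):
    """IDF(t_i) = log(d / DF(t_i)) (Eq. 2), computed on the training corpus only."""
    d = len(documents)
    document_frequency = Counter(term for doc in documents for term in set(doc))
    return [math.log(d / document_frequency[term]) for term in lexicon]


def tfidf_vector(doc, lexicon, idf):
    """w_ij = TF(t_i, d_j) * IDF(t_i) (Eq. 1), using previously fitted IDF values."""
    frequencies = term_frequency_vector(doc, lexicon)
    return [tf * weight for tf, weight in zip(frequencies, idf)]


# Toy training corpus (hypothetical, already tokenized and lemmatized).
train = [
    ["chemo", "kills", "cancer", "cells"],
    ["cancer", "screening", "saves", "lives"],
    ["chemo", "chemo", "treatment"],
]
lexicon = build_lexicon(train)
idf = fit_idf(train, lexicon)

# A new sample reuses the IDF values fitted on the training corpus.
new_doc = ["chemo", "cancer", "cancer", "unseen"]
new_vector = tfidf_vector(new_doc, lexicon, idf)
```

Fitting the IDF values once and passing them to `tfidf_vector` keeps the weights of unseen samples consistent with those the classifier was trained on.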
Since our samples might have more than one label, a multi-label classification methodology
is needed. To make it possible to assign more than one label to each sample, we
transformed our multi-label classification into multiple binary classification problems
(Read et al. 2011, 2014). In this approach, one binary classifier is independently trained
for each label. Given an unseen sample, each binary classifier assigns its label to the
sample if it predicts a positive result.
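The transformation into independent binary problems can be sketched as follows. This is a minimal illustration, not the authors' code: `KeywordClassifier` is a hypothetical toy stand-in for the classifiers evaluated later, and the function names and toy samples are assumptions.

```python
class KeywordClassifier:
    """Toy binary classifier (hypothetical): predicts positive if the sample
    contains a term that was seen only in positive training samples."""

    def fit(self, samples, targets):
        positive_terms, negative_terms = set(), set()
        for tokens, is_positive in zip(samples, targets):
            (positive_terms if is_positive else negative_terms).update(tokens)
        self.indicative_terms = positive_terms - negative_terms
        return self

    def predict(self, tokens):
        return any(token in self.indicative_terms for token in tokens)


def train_binary_relevance(samples, label_sets, labels, make_classifier):
    """Train one independent binary classifier per label."""
    models = {}
    for label in labels:
        targets = [label in label_set for label_set in label_sets]
        models[label] = make_classifier().fit(samples, targets)
    return models


def predict_labels(models, tokens):
    """A label is assigned whenever its own binary classifier predicts positive."""
    return {label for label, model in models.items() if model.predict(tokens)}


# Toy multi-label training data using two of the class codes (F and KT).
samples = [
    ["donate", "leukemia", "fund"],
    ["chemo", "treatment", "leukemia"],
    ["donate", "chemo", "help"],
    ["weather", "today"],
]
label_sets = [{"F"}, {"KT"}, {"F", "KT"}, set()]
models = train_binary_relevance(samples, label_sets, ["F", "KT"], KeywordClassifier)
```

Because each classifier decides on its own, a sample can receive several labels, one label, or none at all.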
In an attempt to exhaustively deploy all major classifiers, 11 different classifiers
were evaluated for this purpose: four linear classifiers, five nonlinear classifiers, and
two combined or hybrid classifiers. Overall accuracy, recall, precision, and F1 score are
measured through ten-fold cross-validation and used to evaluate each classifier’s per-
formance. Each classifier’s hyperparameters are set by a separate ten-fold cross-vali-
dation after excluding the 10% of data that will be used to measure the evaluation met-
rics (overall accuracy, recall, precision, and F1 score). The overall accuracy represents
the percentage of correctly classified samples. The F1 score (also known as F-score or
F-measure) is the harmonic mean of the precision and recall and is calculated as fol-
lows. An F1 score reaches its best value at one (perfect precision and recall) and worst
value at zero.
$$F_1 = \frac{2}{\frac{1}{\mathrm{Precision}} + \frac{1}{\mathrm{Recall}}} \tag{3}$$
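The four evaluation metrics can be computed from the counts of true and false positives and negatives of one binary task. The sketch below is illustrative; the function names are hypothetical, and returning zero when a denominator is zero is an assumption that the text does not address.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for one binary task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return tp, fp, fn, tn


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (Eq. 3); 0 if either is 0."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)


def evaluate(y_true, y_pred):
    """Overall accuracy, precision, recall, and F1 as fractions in [0, 1]."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }
```

As a check against Table 1, the logistic regression values for class A (precision 87.50%, recall 70.22%) give `f1_score(0.8750, 0.7022)` of about 0.779, in line with the reported F1 of 77.91%.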
5 Results and discussion
5.1 Automatic classification
Three approaches were discussed in the previous section for constructing feature vectors:
term-frequency, term-presence, and term-TFIDF. This factor’s effect on the classifica-
tion accuracy remained less than 0.03%. In view of this negligible effect, term-frequency
feature vectors are chosen for constructing feature vectors. Next, we measure the ten-fold
cross-validation accuracy of different classifiers using our training data. Table 1 reports the
overall accuracy, precision, recall, and F1 for each classifier.
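The evaluation protocol described in Sect. 4.2, in which the hyperparameters are chosen by a separate ten-fold cross-validation on the 90% of data not used for measuring the metrics, corresponds to a nested cross-validation. The sketch below illustrates the idea only. All names are hypothetical, folds are formed by interleaving indices without shuffling or stratification (the text does not say how folds were drawn), and the toy model and data are made up.

```python
def k_fold_indices(n_samples, k=10):
    """Partition range(n_samples) into k folds (interleaved, no shuffling)."""
    return [list(range(start, n_samples, k)) for start in range(k)]


def split(data, held_out):
    """Return (training part, held-out part) for one fold."""
    held = set(held_out)
    train = [item for i, item in enumerate(data) if i not in held]
    test = [data[i] for i in held_out]
    return train, test


def cross_validate(fit, score, data, k=10):
    """Mean score of `fit` over k train/test splits of `data`."""
    scores = []
    for held_out in k_fold_indices(len(data), k):
        train, test = split(data, held_out)
        scores.append(score(fit(train), test))
    return sum(scores) / len(scores)


def nested_cross_validate(make_fit, score, data, candidates, k=10):
    """Outer folds measure the metric; inner folds pick the hyperparameter."""
    outer_scores = []
    for held_out in k_fold_indices(len(data), k):
        train, test = split(data, held_out)
        # The held-out 10% is never seen while choosing the hyperparameter.
        best = max(candidates,
                   key=lambda h: cross_validate(make_fit(h), score, train, k))
        outer_scores.append(score(make_fit(best)(train), test))
    return sum(outer_scores) / len(outer_scores)


# Toy usage (hypothetical data and model): a fixed-threshold classifier.
data = [(x, x >= 50) for x in range(100)]


def make_fit(threshold):
    return lambda train: (lambda x: x >= threshold)


def accuracy(model, test):
    return sum(model(x) == y for x, y in test) / len(test)


estimate = nested_cross_validate(make_fit, accuracy, data, [10, 50, 90])
```

Keeping the hyperparameter search inside each outer training split prevents the measured metrics from being tuned on the samples that are used to report them.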
A much higher precision than recall means that the classifier’s mistakes are mostly due
to missing positive samples, i.e. misclassifying many positive samples as negative. In other
words, the samples classified as positive are much more likely to be correct than those
classified as negative. Precision is more important than recall for our study because we
will only use the positively classified samples for knowledge extraction. Therefore, it is
more important to have no misclassified samples among the positives (high precision) than
to miss no positive samples (high recall). Fortunately, all classifiers’ precision is
much higher than their recall according to Table 1, except for the naïve Bayesian. The main
reason for the much higher precision than recall in Table 1 is the class imbalance,
where negative samples significantly outnumber positives.
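To see how strongly the class imbalance shapes the overall accuracy, each task can be compared with a trivial classifier that labels every sample negative. The sketch below is an illustration added for this comparison and is not part of the reported experiments; the positive shares are copied from Table 1 and the function name is hypothetical.

```python
# Share of positive samples per class, in percent, copied from Table 1.
POSITIVE_SHARE = {
    "A": 6.01, "C": 9.52, "F": 5.19, "KC": 18.36, "KD": 5.25, "KP": 4.25,
    "KS": 5.10, "KT": 16.75, "M": 16.11, "NC": 3.53, "ND": 2.84, "NP": 3.77,
}


def all_negative_metrics(positive_share):
    """Overall accuracy and recall (in %) of a classifier that says 'negative' every time."""
    accuracy = 100.0 - positive_share  # every negative is right, every positive is wrong
    recall = 0.0                       # no positive sample is ever found
    return accuracy, recall


baseline_accuracy = {
    label: all_negative_metrics(share)[0] for label, share in POSITIVE_SHARE.items()
}
```

For class A this trivial baseline already reaches 93.99% overall accuracy with zero recall, which suggests reading the overall accuracies in Table 1 alongside precision, recall, and F1.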
Each classifier performs 12 different binary classification tasks, listed in Table 1. The
histogram in Fig. 3 visualizes the average accuracy of each classifier over these 12 different
tasks. In this histogram, classifiers are ordered based on their F1 score and overall accu-
racy, from the most accurate on the left to the least accurate on the right.
With the exception of the naïve Bayesian, nonlinear classifiers have higher precision than linear
ones. On the other hand, linear classifiers have higher recall. Combining precision and
recall by their harmonic mean, i.e. the F1 score, leaves linear classifiers with a slight
advantage over nonlinear ones. Linear classifiers also achieve slightly higher overall accuracies
than nonlinear ones. It means that low variance (high bias) classifiers work better on this
dataset than high variance (low bias) ones. This is a difficult classification task. Trying to
capture all nuances of the training samples would weaken the classifier in the face of unseen
samples. A classifier’s best bet is to settle for grasping the big picture and to ignore the training
samples that cross the linear border of their class. kNN is the weakest in doing so and is thus
penalized with one of the lowest accuracies.
Logistic regression has the highest overall accuracy and F1 score. Its precision comes
fourth and its recall comes fifth. Linear SVM achieves the highest recall and kNN the
Table 1 Ten-fold cross-validation accuracy of 11 different classifiers, each on 12 different binary classification tasks (class names follow the convention in Sect. 4.1)

Binary classes (relative size %)   Overall accuracy %   Recall %   Precision %   F1 %

Classifier: Logistic regression
Hyperparameters: Regularization term: L2 norm; Regularization strength: 0.5
Note: The inverse of the regularization strength, usually represented with C, is set to 2, resulting in a regularization strength of 0.5
A: 6.01       Not A: 93.99         97.61                70.22      87.50         77.91
C: 9.52       Not C: 90.48         94.43                58.36      77.63         66.63
F: 5.19       Not F: 94.81         98.40                76.36      91.30         83.17
KC: 18.36     Not KC: 81.64        94.67                80.13      89.71         84.65
KD: 5.25      Not KD: 94.75        97.64                65.89      85.95         74.59
KP: 4.25      Not KP: 95.75        98.17                65.41      88.59         75.26
KS: 5.10      Not KS: 94.90        97.12                53.52      84.01         65.38
KT: 16.75     Not KT: 83.25        93.52                74.65      84.83         79.41
M: 16.11      Not M: 83.89         90.57                62.14      75.05         67.99

Classifier: Linear SVM
Hyperparameters: Smoothing parameter (C): 1
Note: The optimum value for C turns out to be small for linear SVM as opposed to nonlinear SVM. This is an interesting observation stemmed from the fact that nonlinear SVM is actually a linear SVM in a higher dimensional space. It is easier to linearly separate the two classes in a higher dimensional space and therefore the classifier can be more restrictive in not allowing samples to enter the classifier’s margin or go beyond. Thus, a large value for C is optimal in case of nonlinear SVM. On the other hand, it is more difficult to linearly separate the two classes in the original, lower dimensional space. Many samples fall inside the classifier’s margin or go beyond. In this space, it is not possible to restrict this much from happening. Therefore, a small value for C turns out to be more optimal for linear SVM
A: 6.01       Not A: 93.99         97.14                71.79      78.83         75.14
C: 9.52       Not C: 90.48         93.27                61.62      65.65         63.57
F: 5.19       Not F: 94.81         98.34                79.45      87.40         83.24
KC: 18.36     Not KC: 81.64        93.82                81.78      84.11         82.93
KD: 5.25      Not KD: 94.75        97.28                68.40      77.13         72.50
KP: 4.25      Not KP: 95.75        97.86                69.18      78.00         73.33
KS: 5.10      Not KS: 94.90        96.48                56.67      68.76         62.13
KT: 16.75     Not KT: 83.25        92.70                76.39      79.21         77.77
M: 16.11      Not M: 83.89         89.49                65.30      68.13         66.69
NC: 3.53      Not NC: 96.47        97.35                53.60      65.26         58.86
ND: 2.84      Not ND: 97.16        97.99                59.93      66.30         62.96
NP: 3.77      Not NP: 96.23        98.46                75.50      82.29         78.75

Classifier: Perceptron
Hyperparameters: Learning rate: 0.001; Stopping criterion: minimum decrease in loss is less than 0.00001; Maximum number of iterations: 5000
A: 6.01       Not A: 93.99         96.95                71.63      76.17         73.83
C: 9.52       Not C: 90.48         93.37                60.53      66.81         63.52
F: 5.19       Not F: 94.81         98.31                80.00      86.44         83.10
KC: 18.36     Not KC: 81.64        93.69                81.21      83.88         82.52
KD: 5.25      Not KD: 94.75        97.08                66.25      75.15         70.42
KP: 4.25      Not KP: 95.75        97.77                70.29      75.48         72.79
KS: 5.10      Not KS: 94.90        96.05                55.37      62.68         58.80

Classifier: Linear ridge regression
Hyperparameters: Regularization strength: 0.5
Note: Least squares, linear ridge regression, and linear lasso regression are in the same family of linear regressions with different regularizers. Regularization attempts to shrink the regression coefficients. Least squares and linear lasso regression were also tried but their accuracy was less than linear ridge regression and not reported here
A: 6.01       Not A: 93.99         96.88                70.22      76.06         73.02
C: 9.52       Not C: 90.48         92.39                58.06      60.52         59.26
F: 5.19       Not F: 94.81         97.97                76.73      82.91         79.70
KC: 18.36     Not KC: 81.64        93.01                79.77      81.70         80.73
KD: 5.25      Not KD: 94.75        97.12                68.76      74.37         71.46
KP: 4.25      Not KP: 95.75        97.65                68.07      74.51         71.15
KS: 5.10      Not KS: 94.90        96.04                56.11      62.35         59.06
KT: 16.75     Not KT: 83.25        91.61                75.32      74.73         75.03
M: 16.11      Not M: 83.89         88.19                63.14      63.40         63.27
NC: 3.53      Not NC: 96.47        97.13                49.87      61.72         55.16
ND: 2.84      Not ND: 97.16        97.96                59.27      65.81         62.37
NP: 3.77      Not NP: 96.23        98.52                79.00      81.23         80.10

Classifier: Combining linear classifiers: If at least three of the four aforementioned linear classifiers cast a positive vote, the sample is classified as positive.
Hyperparameters: The threshold, 3, has been optimized as a hyperparameter (see below)
Note: Due to the overall higher accuracy of linear classifiers than nonlinear ones, we developed a combined classifier of only linear classifiers: logistic regression, linear SVM, Perceptron, and linear ridge regression
A: 6.01       Not A: 93.99         97.59                70.22      87.16         77.78
C: 9.52       Not C: 90.48         94.32                57.76      76.84         65.95
F: 5.19       Not F: 94.81         98.34                75.64      90.83         82.54
KC: 18.36     Not KC: 81.64        94.55                79.88      89.32         84.34
KD: 5.25      Not KD: 94.75        97.65                65.89      86.15         74.67
KP: 4.25      Not KP: 95.75        98.16                65.85      87.87         75.29
KS: 5.10      Not KS: 94.90        96.99                53.33      81.13         64.36

Classifier: Nonlinear SVM
Hyperparameters: Smoothing parameter (C): 10; Kernel: Gaussian radial basis function, defined as $\exp\left(-\frac{\lVert X_i - X_j \rVert^2}{\nu}\right)$, with $\nu$: number of features × variance across all features
Note: C controls how far samples could go inside the classifier’s margin and to the wrong side of the classifier. The larger the C, the less permissive the classifier. This kernel is also represented as $\exp\left(-\gamma \lVert X_i - X_j \rVert^2\right)$ or $\exp\left(-\frac{\lVert X_i - X_j \rVert^2}{2\sigma^2}\right)$ in the literature and different software
A: 6.01       Not A: 93.99         97.70                68.18      91.39         78.10
C: 9.52       Not C: 90.48         93.96                40.75      90.75         56.25
F: 5.19       Not F: 94.81         98.21                71.27      92.45         80.49
KC: 18.36     Not KC: 81.64        94.42                75.10      93.18         83.17
KD: 5.25      Not KD: 94.75        97.52                56.55      93.75         70.55
KP: 4.25      Not KP: 95.75        97.95                55.43      93.63         69.64
KS: 5.10      Not KS: 94.90        97.12                45.19      96.06         61.46
KT: 16.75     Not KT: 83.25        92.85                67.94      86.39         76.06
M: 16.11      Not M: 83.89         90.56                49.68      85.67         62.89
NC: 3.53      Not NC: 96.47        97.26                25.87      88.18         40.00
ND: 2.84      Not ND: 97.16        98.63                75.17      76.43         75.79
NP: 3.77      Not NP: 96.23        98.80                81.75      85.83         83.74

Classifier: MLP
Hyperparameters: Number of hidden layers: 3; Number of nodes in each hidden layer: 256; Activation function: ReLU; Activation function in the output layer: softmax; Cost function: cross entropy; Optimization algorithm: Adam; Batch size: 200; Number of training epochs: 50
A: 6.01       Not A: 93.99         97.40                65.36      88.35         75.14
C: 9.52       Not C: 90.48         94.27                56.38      77.34         65.22
F: 5.19       Not F: 94.81         98.05                68.55      91.73         78.46
KC: 18.36     Not KC: 81.64        94.25                77.21      90.06         83.14
KD: 5.25      Not KD: 94.75        97.60                64.09      86.65         73.68
KP: 4.25      Not KP: 95.75        98.14                61.20      92.62         73.70
KS: 5.10      Not KS: 94.90        97.07                50.37      86.35         63.63

Classifier: Decision tree
Hyperparameters: Criterion to find the best split question at a node: maximum entropy decrease; Minimum number of training samples at a leaf: 7
A: 6.01       Not A: 93.99         97.46                67.71      87.10         76.19
C: 9.52       Not C: 90.48         93.53                48.66      74.66         58.92
F: 5.19       Not F: 94.81         98.09                77.64      84.22         80.79
KC: 18.36     Not KC: 81.64        93.65                72.84      90.73         80.81
KD: 5.25      Not KD: 94.75        97.46                63.02      84.78         72.30
KP: 4.25      Not KP: 95.75        97.84                60.98      83.84         70.60
KS: 5.10      Not KS: 94.90        96.85                45.93      85.52         59.76
KT: 16.75     Not KT: 83.25        90.77                64.90      76.39         70.18
M: 16.11      Not M: 83.89         88.36                49.15      69.65         57.63
NC: 3.53      Not NC: 96.47        88.36                49.15      69.65         57.63
ND: 2.84      Not ND: 97.16        98.42                64.90      75.97         70.00
NP: 3.77      Not NP: 96.23        98.41                74.75      81.47         77.97

Classifier: kNN
Hyperparameters: Number of neighbors (k): 5; Weights: inverse of Euclidean distance
A: 6.01       Not A: 93.99         97.48                62.07      94.06         74.79
C: 9.52       Not C: 90.48         93.68                41.44      84.14         55.53
F: 5.19       Not F: 94.81         96.87                42.18      94.31         58.29
KC: 18.36     Not KC: 81.64        92.71                64.32      94.14         76.43
KD: 5.25      Not KD: 94.75        97.46                57.09      91.38         70.28
KP: 4.25      Not KP: 95.75        97.89                53.88      93.82         68.45
KS: 5.10      Not KS: 94.90        97.09                46.30      92.94         61.80

Classifier: Naïve Bayesian
Hyperparameters: Priors: adjusted based on relative class frequencies
A: 6.01       Not A: 93.99         92.72                62.38      42.80         50.77
C: 9.52       Not C: 90.48         86.02                56.48      35.38         43.50
F: 5.19       Not F: 94.81         94.49                51.27      47.08         49.09
KC: 18.36     Not KC: 81.64        84.34                70.84      55.78         62.42
KD: 5.25      Not KD: 94.75        93.85                62.30      43.92         51.52
KP: 4.25      Not KP: 95.75        93.00                56.76      31.84         40.80
KS: 5.10      Not KS: 94.90        91.93                49.63      31.46         38.51
KT: 16.75     Not KT: 83.25        81.97                71.77      47.43         57.12
M: 16.11      Not M: 83.89         80.67                60.85      42.94         50.35
NC: 3.53      Not NC: 96.47        94.49                32.27      26.77         29.26
ND: 2.84      Not ND: 97.16        94.18                24.17      15.84         19.13
NP: 3.77      Not NP: 96.23        94.11                31.25      26.32         28.57

Classifier: Combining all classifiers: If at least three of the nine aforementioned classifiers cast a positive vote, the sample is classified as positive
Hyperparameters: The threshold, 3, has been optimized as a hyperparameter (see below)
A: 6.01       Not A: 93.99         97.55                73.04      84.12         78.19
C: 9.52       Not C: 90.48         94.13                62.51      72.15         66.98
F: 5.19       Not F: 94.81         98.57                82.36      89.17         85.63
KC: 18.36     Not KC: 81.64        94.45                83.21      86.09         84.63
KD: 5.25      Not KD: 94.75        97.51                68.22      81.37         74.22
KP: 4.25      Not KP: 95.75        98.24                69.40      86.46         77.00
KS: 5.10      Not KS: 94.90        96.99                55.74      79.00         65.36
[Fig. 3: histogram of the average accuracy of each classifier over the 12 binary classification tasks; classifier labels: Naïve Bayesian, Decision Tree, Logistic Regression, Linear SVM, Perceptron, Nonlinear SVM, MLP, KNN, Combined all classifier, Linear Ridge Regression]
[Fig. 4: accuracy of each binary classification task by logistic regression; tasks from left to right: KT, A, NP, NC, KC, KD, M, KP, F, ND, KS, C]
highest precision. Logistic regression forms the linear classifier by forcing the logarithm of
the posterior ratios of the two classes to be linear, without making any assumptions about
the probability distribution function (PDF) of each class. This seems to work well, since
making the Gaussian PDF assumption in naïve Bayesian results in the lowest accuracy.
That linear SVM reaches the highest recall is expected because it relies on support vectors
instead of all training samples. Selecting support vectors from each class and applying them
to design the classifier mitigates the class imbalance problem and makes the classifier less
biased toward the larger class. Since the larger class in our case is the negative one, SVM
boosts the recall. On the other hand, kNN reaching the highest precision is explained by its
having the highest complexity of all the classifiers. This high complexity allows it to capture
the smallest nuances of the training dataset in the feature space. This behavior, along with
the much larger size of the negative class, makes it very reluctant to assign any sample to
the positive class. This reluctance results in a higher precision, i.e., fewer negative samples
mistakenly classified as positive.
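The trade-off between recall and precision discussed above can be made concrete with a short Python sketch. The confusion counts below are hypothetical and chosen only to mimic an imbalanced task; the last line recomputes one F1 value of Table 1 from its reported recall and precision.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, recall, precision and F1 from the counts of a binary confusion matrix."""
    total = tp + fp + fn + tn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


def f1_from_rates(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (same units as the inputs)."""
    return 2 * recall * precision / (recall + precision)


# Hypothetical task with 60 positive and 940 negative samples.
# A conservative classifier assigns few samples to the positive class.
conservative = binary_metrics(tp=20, fp=5, fn=40, tn=935)
# A more permissive classifier finds more positives but raises more false alarms.
permissive = binary_metrics(tp=45, fp=60, fn=15, tn=880)

# Naive Bayesian, class A in Table 1: recall 62.38 %, precision 42.80 %.
table_f1 = round(f1_from_rates(62.38, 42.80), 2)
```

In this toy setting the conservative classifier has the higher precision and the permissive one the higher recall, while both keep a high overall accuracy because the negative class dominates.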
Figure 4 shows the accuracy of each binary classification task by logistic regression.
Figure 5 shows the accuracy of each binary classification task, averaged over different
classifiers. In both figures, binary classification tasks are ordered from left to right based on
their F1 score. For logistic regression, it is easiest to detect Tweets that contain information
about therapies and drugs that cure cancer (KT) and most difficult to detect Tweets
advertising conferences and published material on the topic of cancer (C). The ease of
the former is associated with the common and more formal terms that are used to describe
cancer treatments, while the difficulty of the latter originates from the high variance and
informality of its terms.
[Fig. 5 chart not recoverable; x-axis class order: KC, F, KT, A, KD, NP, KP, M, C, KS, ND, NC; y-axis ticks 25, 50, 75]
Fig. 5 Average accuracy for each binary classification task over 11 classifiers listed in Table 1
[Fig. 6 chart not recoverable; x-axis: number of labels (0 to 5); y-axis: number of Tweets having that many labels at the same time; recovered bar labels: 14735, 1096, 84, 3]
The most accurate classifier, logistic regression, is used to classify the remaining 198,537
Tweets. This results in a total of 209,155 labeled Tweets, 10,618 of which are manually
classified and 198,537 of which are automatically classified. These Tweets contain at
least one of the 14 most common cancer types and are collected from Dec 7th 2018 until
Feb 11th 2019. Of all the Tweets, 48% are classified as semantically irrelevant to cancer,
44% belong to one of the 12 cancer-related classes exclusively, 7% belong to two classes
at the same time, and the remaining 1% belong to more than two, up to five classes, at the
same time. Figure 6 indicates how many Tweets belong to how many classes at the same
time.
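The distribution of Tweets over the number of simultaneous labels can be computed as in the following minimal Python sketch. The label sets are hypothetical toy data, not the study's Tweets, and the function name `label_count_histogram` is introduced here only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def label_count_histogram(label_sets: Iterable[Set[str]]) -> Counter:
    """Map each number of labels to the number of Tweets carrying that many labels."""
    return Counter(len(labels) for labels in label_sets)


def percentage_shares(histogram: Counter) -> Dict[int, float]:
    """Convert absolute counts into percentages of all Tweets."""
    total = sum(histogram.values())
    return {k: 100.0 * v / total for k, v in sorted(histogram.items())}


# Hypothetical predicted label sets; an empty set means "irrelevant to cancer".
tweets = [
    set(),
    {"KT"},
    {"F", "M"},
    set(),
    {"A", "KC", "KP", "KS", "NP"},
    {"KC"},
]

histogram = label_count_histogram(tweets)
shares = percentage_shares(histogram)
```

Applied to the full set of labeled Tweets, the same counting would yield the kind of breakdown reported above (no label, exactly one label, two labels, and more than two).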
Table 2 shows the three Tweets having five labels at the same time, along with their
labels. The first Tweet advises the audience to get screened, so it is correctly classified
as A. It introduces age as a factor influencing cancer risk and is thus correctly classified as
KC. Its automatic classification as KP is hard to justify, as it does not contain technical
information about cancer prevention. Its automatic classification as KS is justified because
it mentions that the colon cancer showed no symptoms. Finally, it announces the deaths
of two people due to colon cancer and is therefore correctly classified as NP.
Table 2 Tweets with five labels at the same time
Tweet 1 (labels: A, KC, KP, KS, NP): @jaketapper 40 if you have family history sir. My mother and uncle both passed from colon cancer. I’m 39 and the doc is having me screened this month. They said she had it for years. She never showed any signs. 5 weeks from diagnosis to passing. Get screened. Get screened. Get screened
Tweet 2 (labels: KD, KS, KT, M, ND): @jaketapper Have a co-worker age 60 didn’t have colonoscopy. Had mine at age 52, persuaded her to have it done. Finally age 61 she had blockage-colon cancer. Did chemo for 6 mo. CT scan 1 year later found stage 4 liver cancer. Just finished 2nd round chemo. Today getting CT scan results
Tweet 3 (labels: KS, KT, M, NC, ND): @jaketapper I was diagnosed at age 41 with stage 4 colon cancer that spread to my liver also. I had the right lobe of my liver removed, colon resection & have been undergoing chemo since January of 2015. It has been a rough road, cancer showed up in my lung & abdomen in 2017, but remission now
The second Tweet mentions different types of medical tests for detecting cancer and
is thus classified as KD. Its classification as KS is not justifiable. Its mention of
chemotherapy explains its classification as KT. Moral support for the coworker is not
noticeable, so its classification as M does not seem justified. Announcing that the coworker
has colon cancer explains its classification as ND.
Classification of the third Tweet as KS does not seem right. Its classification as KT
is supported by its mention of different medical methods for treating cancer. It is classified
as M since it seeks moral support by stating that the treatment process has been a rough
road. The Tweet announces both a diagnosis of colon cancer and a remission, which
explains its classification as ND and NC, respectively.
Figure 7 shows 12 proximity graphs. Each graph places one class at the center and
visualizes its relative (not absolute) distance to the other 11 classes. The distance between
two classes is proportional to the inverse of the number of Tweets that carry the labels of
both classes: the more Tweets two classes share, the shorter their distance in these graphs.
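The distance used in the proximity graphs can be sketched in Python as follows. The label sets are hypothetical, the function names are introduced only for this illustration, and two choices are assumptions not stated in the text: classes that share no Tweet are placed at infinite distance, and distances are rescaled so that the farthest finite one equals 1.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Set


def shared_counts(label_sets: Iterable[Set[str]]) -> Counter:
    """Count, for every pair of classes, the Tweets labeled with both of them."""
    counts: Counter = Counter()
    for labels in label_sets:
        for a, b in combinations(sorted(labels), 2):
            counts[(a, b)] += 1
    return counts


def relative_distances(center: str, classes: List[str], counts: Counter) -> Dict[str, float]:
    """Distance from `center` to every other class, proportional to 1 / shared Tweets."""
    raw = {}
    for other in classes:
        if other == center:
            continue
        a, b = sorted((center, other))
        shared = counts.get((a, b), 0)
        # Assumption: no shared Tweets means an infinite distance.
        raw[other] = 1.0 / shared if shared else float("inf")
    finite = [d for d in raw.values() if d != float("inf")]
    scale = max(finite) if finite else 1.0
    # Assumption: rescale so that the farthest finite distance is 1 (relative, not absolute).
    return {c: d / scale for c, d in raw.items()}


# Hypothetical multi-label Tweets.
tweets = [{"F", "M"}, {"F", "M"}, {"F", "M"}, {"F", "KD"}, {"KT", "M"}, {"KT", "M"}]
counts = shared_counts(tweets)
distances_from_f = relative_distances("F", ["F", "M", "KD", "KT"], counts)
```

In this toy example class M ends up closest to class F because the two share the most Tweets, mirroring how the graphs are read.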
The closest two classes are F and M with 2360 Tweets shared between them. This
underscores that Tweets seeking financial support are often elevated to an emotional
level by providing or seeking moral support. The farthest two classes are F and KD with
only two shared Tweets. It is not surprising that Tweets seeking financial support do
not provide much technical information about cancer diagnosis. In fact, Tweets seeking
financial support do not usually overlap with any other class but M. This is clear from
the proximity graph of class F, where its close proximity with class M dominates its
proximity with any other class. Class M is the closest class to other classes on average,
followed by KT, F, KC, ND, NP, NC, KS, A, C, KP, and KD. This shows that, regard-
less of what a Tweet is focused on, providing or seeking moral support is often present.
On the other hand, it shows that Tweets that provide technical information about cancer
diagnosis (KD) or prevention (KP) tend not to allude to other topics and stay focused.
Figure 8 indicates the number of Tweets in each class. Interestingly, 14% of Tweets
seek or provide moral support or contain information about events in support groups or
merchandise promotion (class M). This is the largest class. 11% of Tweets talk about
what causes (or does not cause) cancer or what raises (or reduces) the risk of contracting
it (class KC). The same percentage of Tweets talk about therapies and drugs that
cure (or do not cure) cancer (class KT). These two classes are the largest after class M.
This suggests that people on Twitter are mostly preoccupied with what causes cancer
(probably those who have not contracted it yet) and how to treat it (probably those who
have been diagnosed with cancer or know someone who has been). Every other
class accounts for less than 5% of all Tweets.
We will focus on the Tweets in the three largest classes. Figures 9, 10, and 11 show
the frequency of some of the most common terms in each of these classes. Figures 12
and 13 show the frequency of some of the most common co-occurring terms in class KC
and KT. Such a histogram is not presented for class M because it does not add much to
what can be learned from its term frequency histogram in Fig. 9.
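Term-frequency and co-occurrence histograms of this kind can be produced along the lines of the following Python sketch. The tokenized Tweets are hypothetical, and counting each term (or pair of terms) at most once per Tweet is an assumption made here; the text does not state how repeated terms within a Tweet are counted.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List


def term_frequencies(tweets: Iterable[List[str]]) -> Counter:
    """Number of Tweets in which each term appears (assumption: once per Tweet)."""
    freq: Counter = Counter()
    for tokens in tweets:
        freq.update(set(tokens))
    return freq


def cooccurrence_frequencies(tweets: Iterable[List[str]]) -> Counter:
    """Number of Tweets in which each unordered pair of distinct terms appears together."""
    pairs: Counter = Counter()
    for tokens in tweets:
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical tokenized Tweets from one class.
tweets = [
    ["smoking", "lung", "cancer"],
    ["alcohol", "liver", "cancer"],
    ["smoking", "alcohol", "lung"],
]

freq = term_frequencies(tweets)
pairs = cooccurrence_frequencies(tweets)
top_pair = pairs.most_common(1)[0]
```

Sorting either counter by frequency gives the bars of a histogram; term variants (for example singular and plural forms) would have to be grouped beforehand if they are meant to share one bar.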
Terms for immediate family members form the largest vocabulary in class M. Roman Reigns,
the former World Wrestling Entertainment (WWE) champion, was also a prominent figure in
class M. He announced his diagnosis of leukemia on October 22, 2018, which is
45 days before we started collecting Tweets. The pink ribbon, an international symbol
of breast cancer awareness, is also very popular in this class. Religious vocabulary, such
as prayer, god, blessing, and Christmas, is also quite visible.
[Fig. 7 graphs not recoverable: 12 proximity graphs, each placing one of the classes A, C, F, KC, KD, KP, KS, KT, M, NC, ND, NP at the center and the other 11 classes around it]
[Fig. 8 chart not recoverable: number of Tweets in each class; x-axis class order: A, C, F, KC, KD, KP, KS, KT, M, NC, ND, NP; recovered bar labels: 9695, 7531, 8212, 6242, 4127, 4864, 2886, 2919, 3348]
The vast majority of Tweets in class KC underscore the link between smoking and lung
cancer. Alcohol and air pollution also have been largely associated with lung cancer. The
second most common trend is the link between breast implants and lymphoma. Fat, obe-
sity, childbirth, abortion, genes, alcohol, estrogen (the primary female sex hormone), lead,
menopause, and age have been highlighted as other factors linked to breast cancer in the
Tweets. Alcohol and smoking are considered the most significant players in liver cancer
and diet is considered to play a role in prostate cancer. Other factors affecting cancer risk
can be found in Fig. 10.
The most tweeted treatments for each type of cancer in this class follow:
[Fig. 10 chart not recoverable: frequency of common terms in class KC; x-axis: frequency, 0 to 10,000. Term groups on the vertical axis, in extraction order: *** (smoking, smoke, smoker, nonsmoker, smoked, cigarette, cigs, cigar, tobacco, weed, nicotine, marijuana, cannabis, addiction, vaping, vape); implant; alcoholism, alcohol, drinking, alcl, alcoholic; childbirth, abortion, pregnancy, pregnant, infertility; fat, fatty; gene, genetic, mutation, hereditary; diet, food, eat; meat, protein, vegan, millionspledgedtobeveg; obesity, weight, overweight; diabetes, sugar; radiation, fukushima, nuclear, radioactive; menopausal, postmenopausal, menopause, period; lead; lifestyle, exercise, sleep, bed; hormone, hormonal; sun, uv, sunscreen; toxic, toxin; environmental, environment, climate; pollution; estrogen; milk, dairy; infection; vaccine; dust, powder; asthma; vitamin; soy; virus; coal; breastfeed; testosterone; alzheimer; hypertension; hepatitis; paper; autism; cholesterol; hiv; deodorant; arthritis; breastfeeding; coffee; glyphosate]
[Fig. 11 chart not recoverable: frequency of common treatment terms in class KT; x-axis: frequency, 0 to 4000. Term groups on the vertical axis, in extraction order: chemotherapy, chemo; surgery, surgical; radiation, radiotherapy; *** (oil, cannabis, cannabinoid, marijuana); immunotherapy; adjuvant, neoadjuvant; mastectomy; tea, oolong; exercise; diet, food; ginger; transplant, transplantation; vaccine; trastuzumab, kadcyla; pembrolizumab, keytruda; herb, herbal; tecentriq, atezolizumab; androgen; tamoxifen; castration; nivolumab, opdivo; placebo; gefitinib; antioxidant; ibrutinib; venetoclax; gemcitabine; stereotactic; resection; paclitaxel; thc; durvalumab; ayurvedic; folfirinox; laser; platinum; brachytherapy; sleep; docetaxel; proton; lumpectomy]
Fig. 13 also suggests that exercise is very common among cancer survivors
and is considered to reduce cancer risk. The most common pairs of treatments that are
mentioned together in Tweets include placebo and gefitinib, adjuvant and chemotherapy,
mastectomy and radiotherapy, diet and exercise, and adjuvant and folfirinox. The most
common treatments, regardless of the cancer type, can be found in Fig. 11.
The above figures and insights support this study’s claim that it provides the opportunity
to monitor oncology-related topics online. With respect to assessing the veracity of this
Fig. 12 Some of the most common co-occurring terms and their frequency in class KC
The evolution of online social networks (OSN) has altered the way in which we communicate
in all spheres, including medicine. With the increasing popularity of OSN, patients
and their families and caregivers have bypassed the traditional controls of the healthcare
and life science industries by volunteering private information on OSN. OSN carry an
increasing amount of health-related information, resources, and networks, thus providing
an invaluable knowledge base that has yet to be fully taken advantage of. As a small
step in this direction, this study applied a hierarchy of statistical inference methods to
provide insight into a large number of oncology-related OSN posts. This insight is of a
different nature than the knowledge provided by doctors, textbooks, or interviews with patients.
The reason is that this insight is inferred from OSN posts, which are based on the patients’
and caregivers’ daily experiences and conversations with each other and occasionally with
doctors. Even if an extracted piece of knowledge does not comply with what medical
science stands for, it is valuable because it highlights what is being disseminated on OSN to
people who might take it at face value.
Fig. 13 Some of the most common co-occurring terms and their frequency in class KT
We attempted to do automatically what has been done by humans in the past, extract-
ing oncology-related knowledge from OSN. The difference is that humans are capable of
processing only small amounts of data (though deeply) but machines can sweep and pro-
cess massive datasets swiftly (though superficially compared to humans). In other words,
the comprehensiveness, speed, and thrift of machines come with a tradeoff for precision
and accuracy. Linear classifiers achieved higher accuracies than nonlinear ones, which is
consistent with less complex classifiers generally performing better on small training
datasets. A training dataset larger than the one available to this study could help nonlinear
classifiers overcome the overfitting problem. A future direction is to develop models to
automatically or semi-automatically
verify the veracity of the extracted knowledge. The extracted knowledge might be medi-
cally incorrect due to misinformation spread over OSN. Being aware of the type and verac-
ity of the information shared on OSN, strategies can be developed to identify how this
information may be exploited in shared decision-making models, improving cancer liter-
acy, and countering misinformation.
References
American Cancer Society (2019) Cancer facts and figures. American Cancer Society, Atlanta, GA. https://
www.cancer.org/content/dam/cancer-org/research/cancer-facts-and-statistics/annual-cancer-facts-and-
figures/2019/cancer-facts-and-figures-2019.pdf. Accessed 1 Dec 2018
Antheunis ML, Tates K, Nieboer TE (2013) Patients’ and health professionals’ use of social media in health
care: motives, barriers and expectations. Patient Educ Couns 92(3):426–431
Ashcraft KA, Warner AB, Jones LW, Dewhirst MW (2019) Exercise as adjunct therapy in cancer. Semi
Radiat Oncol 29(1):16–24
Attai DJ, Cowher MS, Al-Hamadani M, Schoger JM, Staley AC, Landercasper J (2015) Twitter social
media is an effective tool for breast cancer patient education and support: patient-reported outcomes by
survey. J Med Internet Res 17(7):e188
Bloom R, Amber KT, Hu S, Kirsner R (2015) Google search trends and skin cancer: evaluating the us
population’s interest in skin cancer and its association with melanoma outcomes. JAMA Dermatol
151(8):903–905
Bosslet GT, Torke AM, Hickman SE, Terry CL, Helft PR (2011) The patient–doctor relationship and online
social networks: results of a national survey. J Gen Intern Med 26(10):1168–1174
Byars T, Theisen E, Bolton DL (2019) Using cannabis to treat cancer-related pain. Semin Oncol Nurs
35(3):300–309
Charani E, Castro-Sánchez E, Moore LS, Holmes A (2014) Do smartphone applications in healthcare
require a governance and legal framework? It depends on the application! BMC Med 12(1):29
Chou W-YS, Hunt YM, Beckjord EB, Moser RP, Hesse BW (2009) Social media use in the United States:
implications for health communication. J Med Internet Res 11(4):e48
Chou W-YS, Hunt Y, Folkers A, Augustson E (2011) Cancer survivorship in the age of YouTube and social
media: a narrative analysis. J Med Internet Res 13(1):e7
Chretien K, Azar J, Kind T (2011) Physicians on twitter. J Am Med Assoc 305(6):566–568
Chung JE (2014) Social networking in online support groups for health: how online social networking ben-
efits patients. J Health Commun 19(6):639–659
Crannell WC, Clark E, Jones C, James TA, Moore J (2016) A pattern-matched Twitter analysis of US can-
cer-patient sentiments. J Surg Res 206(2):536–542
Dredze M (2012) How social media will change public health. IEEE Intell Syst 27(4):81–84
Elkin N (2008) How America searches: health and wellness. Opinion Research Corporation: iCrossing 1–17
Eysenbach G (2008) Medicine 2.0: social networking, collaboration, participation, apomediation, and open-
ness. J Med Internet Res 10(3):e22
Falzone AE, Brindis CD, Chren M-M, Junn A, Pagoto S, Wehner M, Linos E (2017) Teens, tweets,
and tanning beds: rethinking the use of social media for skin cancer prevention. Am J Prev Med
53(3):S86–S94
Gold J, Pedrana AE, Sacks-Davis R, Hellard ME, Chang S, Howard S, Keogh L, Hocking JS, Stoove MA
(2011) A systematic examination of the use of online social networking sites for sexual health promo-
tion. BMC Public Health 11(1):583
Gottlieb BH, Wachala ED (2007) Cancer support groups: a critical review of empirical studies. Psychoon-
cology 16(5):379–400
13
Multi-label classification and knowledge extraction from…
Gough A, Hunter RF, Ajao O, Jurek A, McKeown G, Hong J, Barrett E, Ferguson M, McElwee G,
McCarthy M, Kee F (2017) Tweet for behavior change: using social media for the dissemination of
public health messages. JMIR Public Health Surveill 3(1):e14
Griffis HM, Kilaru AS, Werner RM, Asch DA, Hershey JC, Hill S, Ha YP, Sellers A, Mahoney K, Mer-
chant RM (2014) Use of social media across US hospitals: descriptive analysis of adoption and
utilization. J Med Internet Res 16(11):e264
Harris JK, Snider D, Mueller N (2013) Social media adoption in health departments nationwide: the
state of the states. Front Public Health Serv Syst Res 2(1):5
Hashemi M (2019) Web page classification: a survey of perspectives, gaps, and future directions. Multi-
media Tools Appl. https://doi.org/10.1007/s11042-019-08373-8
Hashemi M, Hall M (2019) Detecting and classifying online dark visual propaganda. Image Vis Comput
89:95–105
Hashemi M, Karimi HA (2018) Weighted machine learning. Stat Optim Inf Comput 6(4):497–525
Häuser W, Welsch P, Klose P, Radbruch L, Fitzcharles M-A (2019) Efficacy, tolerability and safety of
cannabis-based medicines for cancer pain: a systematic review with meta-analysis of randomised
controlled trials. Der Schmerz 33(5):424–436
Hawn C (2009) Take two aspirin and tweet me in the morning: how Twitter, Facebook, and other social
media are reshaping health care. Health Aff 28(2):361–368
Heilferty CM (2009) Toward a theory of online communication in illness: concept analysis of illness
blogs. J Adv Nurs 65(7):1539–1547
Huber J, Muck T, Maatz P, Keck B, Enders P, Maatouk I, Ihrig A (2018) Face-to-face vs. online peer
support groups for prostate cancer: a cross-sectional comparison study. J Cancer Surviv 12(1):1–9
Jaidka K, Zhou A, Lelkes Y (2019) Brevity is the soul of Twitter: the constraint affordance and political
discussion. J Commun 69(4):345–372
Jiang S (2017) The role of social media use in improving cancer survivors’ emotional well-being: a mod-
erated mediation study. J Cancer Surviv 11(3):386–392
Jiménez J, Ramos A, Ramos-Rivera FE, Gwede C, Quinn GP, Vadaparampil S, Brandon T, Simmons V,
Castro E (2018) Community engagement for identifying cancer education needs in Puerto Rico. J
Cancer Educ 33(1):12–20
Jung AY, Behrens S, Schmidt M, Thoene K, Obi N, Hüsing A, Chang-Claude J (2019) Pre-to post-
diagnosis leisure-time physical activity and prognosis in postmenopausal breast cancer survivors.
Breast Cancer Res 21(1):117
Jurafsky D, Martin JH (2014) Speech and language processing. Pearson, London
Kaplan W (2012) Social media and survivorship: building a cancer support network for the 21st century.
Oncol Nurse Advisor 3(2):35
Lapointe L, Ramaprasad J, Vedel I (2014) Creating health awareness: a social media enabled collabora-
tion. Health Technol 4(1):43–57
Lyles CR, López A, Pasick R, Sarkar U (2013) “5 mins of uncomfyness is better than dealing with
cancer 4 a lifetime”: an exploratory qualitative analysis of cervical and breast cancer screening dia-
logue on Twitter. J Cancer Educ 28(1):127–133
Marteau TM, Hollands GJ, Fletcher PC (2012) Changing human behavior to prevent disease: the impor-
tance of targeting automatic processes. Science 337(6101):1492–1495
Murthy D, Gross A, Oliveira D (2011) Understanding cancer-based networks in Twitter using social
network analysis. In: 5th IEEE international conference on semantic computing. IEEE, pp 559–566
Norman C (2011) eHealth literacy 2.0: problems and opportunities with an evolving concept. J Med
Internet Res 13(4):e125
Orsini M (2010) Social media: how home health care agencies can join the chorus of empowered voices.
Home Health Care Manag Pract 22(3):213–217
Paul MJ, Dredze M (2011) You are what you tweet: analyzing twitter for public health. In: Fifth interna-
tional AAAI conference on weblogs and social media. AAAI, pp 265–272
Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137
Rajaraman A, Ullman JD (2011) Data mining. In Mining of massive datasets. Cambridge University
Press, Cambridge, pp 1–17
Randeree E (2009) Exploring technology impacts of Healthcare 2.0 initiatives. Telemed and e-Health
15(3):255–260
Read J, Pfahringer B, Holmes G, Frank E (2011) Classifier chains for multi-label classification. Mach
Learn 85(3):333–359
Read J, Martino L, Luengo D (2014) Efficient monte carlo methods for multi-dimensional learning with
classifier chains. Pattern Recognit 47(3):1535–1546
13
M. Hashemi, M. Hall
Rehman S, Lyons K, McEwen R, Sellen K (2018) Motives for sharing illness experiences on Twitter: con-
versations of parents with children diagnosed with cancer. Inf Commun Soc 21(4):578–593
Ritterman J, Osborne M, Klein E (2009) Using prediction markets and Twitter to predict a swine flu pan-
demic. In: 1st international workshop on mining social media, vol 9, pp 9–17
Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manag
24(5):513–523
Strekalova YA, Krieger JL (2017) A picture really is worth a thousand words: public engagement with the
National Cancer Institute on social media. J Cancer Educ 32(1):155–157
Sugawara Y, Narimatsu H, Hozawa A, Shao L, Otani K, Fukao A (2012) Cancer patients on Twitter: a novel
patient community on social media. BMC Res Notes 5(1):699
Tsuya A, Sugawara Y, Tanaka A, Narimatsu H (2014) Do cancer patients tweet? Examining the twitter use
of cancer patients in Japan. J Med Internet Res 16(5):e137
Twitter (n.d.) https://about.twitter.com/company. Retrieved 1 Feb 2019
Uysal AK, Gunal S (2014) The impact of preprocessing on text classification. Inf Process Manag
50(1):104–112
Vraga EK, Stefanidis A, Lamprianidis G, Croitoru A, Crooks AT, Delamater PL, Pfoser D, Radzikowski
JR, Jacobsen KH (2018) Cancer and social media: a comparison of traffic about breast cancer, prostate
cancer, and other reproductive cancers on Twitter and Instagram. J Health Commun 23(2):181–189
Wicks P, Massagli M, Frost J, Brownstein C, Okun S, Vaughan T, Bradley R, Heywood J (2010) Sharing
health data for better outcomes on PatientsLikeMe. J Med Internet Res 12(2):e19
Wiener L, Crum C, Grady C, Merchant M (2011) To friend or not to friend: the use of social media in clini-
cal oncology. J Oncol Pract 8(2):103–106
Yoo S-W, Kim J, Lee Y (2018) The effect of health beliefs, media perceptions, and communicative behav-
iors on health behavioral intention: an integrated health campaign model on social media. Health Com-
mun 33(1):32–40
Zhou J (2018) Factors influencing people’s personal information disclosure behaviors in online health com-
munities: a pilot study. Asia Pac J Public Health 30(3):286–295
Zucco R, Lavano F, Anfosso R, Bianco A, Pileggi C, Pavia M (2018) Internet and social media use for
antibiotic-related information seeking: findings from a survey among adult population in Italy. Int J
Med Inform 111(1):131–139
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.