Predicting Links in Social Networks Using Text Mining and SNA

Abstract- Lately there has been great progress in business organizations' perception of social aspects. Competitive organizations need to innovate and differentiate themselves in the market. Business interactions help to reach those goals, but finding the effective interactions is a challenge.

We propose a prediction method, based on Social Network Analysis (SNA) and text data mining (TDM), for predicting which nodes in a social network will be linked next. The network used to demonstrate the proposed prediction method is composed of academic co-authors who collaborated on writing articles. Without loss of generality, the academic co-authoring network demonstrates the proposed prediction procedure due to its similarity to other networks, such as business co-operation networks. The results show that the best prediction is achieved by incorporating TDM with SNA.

Keywords- Social network; Prediction; Social network analysis

I. INTRODUCTION

The challenge of predicting changes in a social network is called the link prediction problem [14]. Liben-Nowell and Kleinberg [8] explain it as: given a snapshot of a social network at time t, we seek to accurately predict the edges that will be added to the network during the interval from time t to a given future time t+1.

Social networks are highly dynamic objects which grow and change rapidly over time. The mechanisms by which they evolve are a fundamental question that is still not well understood, and this forms the motivation for this work. For example, two people who are not connected but are close in a friendship network will have friends in common. We address the link prediction problem in an academic co-authoring network by combining SNA methods with TDM methods that evaluate and represent authors' topics of interest through key concepts extracted from the titles of their common articles. Two TDM methods are compared (NLP, Natural Language Processing, and VSM, the vector space model) and the relevance of this knowledge to the prediction accuracy is examined. The goal of this research is to find which measures of the network lead to the most accurate link predictions and to examine whether TDM methods can contribute to the overall prediction accuracy.

II. BACKGROUND AND RESEARCH MODEL

According to Wasserman and Faust [18], "a social network consists of a finite set or sets of actors and the relation or relations defined on them". It is a complex system characterized by a high number of dynamically interconnected entities [3], and it connects entities by any type of link that implies a peer-to-peer relationship. Social networks are typically affiliation networks in which participants collaborate in groups and links are established by common group membership [17]. Examples of relationships between people include friendship, email exchange, etc. [13].

Potgieter et al. [3] indicate that SNA is a research area aimed at understanding social complexity by representing and analyzing social networks using mathematical graphs. A graph G is a structure consisting of a set of vertices V and a set of links E. Social networks can represent many different types of collaboration links, such as marriages, classmates, friendships, the Hollywood-actor collaboration graph [16], [4] and the academic collaboration graph [1], where two people (i.e. vertices) who collaborated together are connected by an edge (i.e. link). In this research we use the academic co-author collaboration network. These networks follow a power-law degree distribution (they are also known as scale-free networks), in which well-connected vertices grow more connections than less-connected ones [1].

Co-authorship networks are an important class of collaboration networks and have been used extensively to determine the academic structure of scientific collaborations and the status of individual researchers [20]. Collaboration networks can be represented by an undirected graph in which an edge exists between two vertices if they collaborate. This representation allows researchers to apply SNA methods [18] to evaluate the properties of the network. In an author network the vertices represent authors and the links are the papers; two authors are considered connected if they have authored at least one paper together.
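The co-author graph construction described above can be sketched in a few lines. This is an illustrative sketch only, with invented author names; it maps each paper's author list to weighted undirected edges, where the weight counts joint papers.

```python
# Hypothetical sketch: build an undirected co-author graph from paper
# author lists. Author names below are invented for illustration.
from itertools import combinations
from collections import defaultdict

def build_coauthor_graph(papers):
    """Return (adjacency, edge weights); parallel papers between the
    same pair of authors raise that edge's weight."""
    weights = defaultdict(int)    # sorted (a, b) pair -> number of joint papers
    adjacency = defaultdict(set)  # author -> set of co-authors
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency, dict(weights)

papers = [
    ["Alice", "Bob"],
    ["Alice", "Bob", "Carol"],
    ["Carol", "Dan"],
]
adj, w = build_coauthor_graph(papers)
```

Two authors are connected exactly when they share at least one paper, matching the representation above; the edge weight is later reused as the link's strength.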
TABLE I- METRIC DEFINITIONS

Metric | Formula | Definition
Preferential attachment [2] | |Γ(v_i)| ⋅ |Γ(v_j)| | The product of the number of edges incident to the two vertices.
Cc1 | C_i = 2|{e_jk}| / (k_i(k_i - 1)), v_j, v_k ∈ V, e_jk ∈ E | Clustering coefficient considering only the 1-neighborhood.
Cc2 | C_i = |{e^1}| / |{e^2}| | Clustering coefficient considering the 2-neighborhood.
Degree | Γ(v_i) | The number of links from vertex v_i to any other vertex.
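The Cc1 formula in Table I can be made concrete with a small sketch. This is an illustrative implementation on a toy adjacency dict (not the paper's data): it counts the edges e_jk among the 1-neighborhood of a vertex and normalizes by k_i(k_i - 1)/2 possible pairs.

```python
# Illustrative sketch of Cc1 from Table I:
#   C_i = 2|{e_jk}| / (k_i (k_i - 1))
# where e_jk ranges over edges between two distinct neighbors of i.
def clustering_coefficient(adjacency, i):
    neighbors = adjacency[i]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to close
    # Each neighbor-neighbor edge is seen from both ends, so halve the count.
    links = sum(1 for j in neighbors for m in adjacency[j] if m in neighbors) // 2
    return 2.0 * links / (k * (k - 1))

# Toy graph: "a" has neighbors b, c, d; only b-c is an edge among them.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
```

For vertex "a" this gives 1 closed pair out of 3 possible, i.e. 1/3, matching the "fraction of pairs of a vertex's collaborators who have also collaborated" reading given below.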
K-Short_Sum, K-Short_Max - the sum and max, respectively, of the k (integer) shortest-distance paths, where k is the number of paths with the same length to the destination vertex. The importance of the feature gradually decreases as k increases. The shortest distance is weighted and each edge has an actual weight; for any pair of nodes, the weight on the edge indicates the number of papers the authors have co-authored.

SumInput - calculated by summing the number of links (papers) that two vertices (authors) published in the training years.

MinInput_Sum - sum of the minimum edge connected to vertex v1 and the minimum edge connected to v2.

MaxInput_Sum - sum of the maximum edge connected to vertex v1 and the maximum edge connected to v2.

CC1 - Clustering Coefficient considering only the 1-neighborhood. It is the fraction of pairs of a vertex's collaborators who have also collaborated with one another [15]. Many initiatives within social network research [9] have indicated the clustering index as an important feature of a node in a social network.

CC2 - Clustering Coefficient considering the 2-neighborhood. A vertex connected to a highly connected vertex has a better chance to cooperate with a distant vertex through the latter, since the number of secondary neighbors in a social network usually grows exponentially.

Norm_input - the number of publications a vertex has, divided by the maximum degree in the network.

Preferential_Max - the maximum of the product of edges incident to the two vertices.

Adamic_Sum - the sum of features shared by two vertices, divided by the log frequency of the features.

Jaccard_Sum - the sum of common neighbors of the examined vertices, divided by the number of vertices that are neighbors of either examined vertex.

CommonNabor_Sum - the sum of the vertices linked to both examined vertices (i.e. mutual friends).
CommonNabor_Max - the maximum number of vertices linked to both examined vertices.
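The neighborhood metrics above can be sketched compactly. The sketch below is illustrative only, on a toy adjacency dict, and interprets Adamic_Sum in the usual Adamic/Adar way (shared neighbors down-weighted by the log of their degree), which is an assumption about the paper's exact definition.

```python
# Minimal sketches of the neighbor-based metrics, assuming the graph is
# an adjacency dict (author -> set of co-authors). Toy data, not the
# paper's network; the Adamic/Adar reading of Adamic_Sum is an assumption.
import math

def common_neighbors(adj, u, v):
    return adj[u] & adj[v]

def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def adamic_adar(adj, u, v):
    # Shared neighbors weighted by the inverse log of their degree.
    return sum(1.0 / math.log(len(adj[z]))
               for z in adj[u] & adj[v] if len(adj[z]) > 1)

def preferential_attachment(adj, u, v):
    return len(adj[u]) * len(adj[v])

adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
```

On this toy graph the pair ("a", "b") shares one neighbor ("c"), giving Jaccard 1/4 and preferential attachment 2 x 3 = 6.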
C. Classification Algorithm
Neural Networks (NN) and Decision Trees (DT) were considered in order to deal with the classification problem of classifying authors into two classes, C3 and C4, where C3 indicates that an author will not be active and C4 indicates that the author will be active (see sub-section 2.1). The NN models differ from each other in their topology. SPSS Clementine was used to implement, build and validate the NN and DT models. In the first stage the data was divided into two groups: the training data set, which was used to build the model, and the testing data set, which was used to evaluate the model results. The standard partitioning is to choose randomly 70% of the data for the training set and 30% of the data for the testing set [5].

The first NN model was built on the training set and contained one hidden layer and cross validation on 30% of the data. The model contained three neurons in the hidden layer and reached 73.6% accuracy over the training set. The second NN model contained two hidden layers, in order to improve the prediction accuracy; the rest of the parameters were not changed. It achieved 74% accuracy over the training set, a slight improvement. The third NN model also had two hidden layers, but the cross-validation ratio was changed to 50%; this model achieved 75.1% accuracy over the training set. Other parameters were tried in order to improve the NN prediction, with no success in improving on the 75.1% accuracy. The most effective features for the NN model are Jaccard, VSM, CC2 and SumInput.

The DT models were built on the same training set as the NN models, trying several parameters of the C5.0 algorithm. The fourth model included 10-times cross validation and generated a DT with 17 levels and an accuracy level of 86% over the training set. In the fifth DT model the parameters were not changed, but the partition of the training and testing sets was changed to 20% of the data for the testing set. The resulting DT had 23 levels and 87.9% accuracy over the training set; other partitions did not achieve better accuracy. The sixth and best DT model was created using the boosting option with 10 iterations. Boosting causes the DT algorithm to iteratively build a new DT while concentrating on the misclassifications of the previous step. Pruning severity was set to 30, controlling how much pruning is done in order to prevent overfitting. The DT created using those parameters has 31 levels and 88.5% prediction accuracy over the training set. The most influential features for classification are Betweenness, MaxInput, K-Short and VSM.

D. Experiment Results

There exist two types of mistakes: false negative (FN) and false positive (FP). FN is much more concerning than FP: any potentially active author predicted by a model, even with a very low probability of being active, is of interest, and the concern is missing authors who will be active.

The first NN model achieved 71.5% accuracy over the testing set. The second, two-layer NN model did not achieve a significant improvement over the testing set, and the third NN model also did not achieve better results over the testing set. The fourth model, a DT, achieved 88% accuracy over the testing set, similar to its accuracy over the training set. The fifth DT model, built with a partition of 80% of the data for training and 20% for testing, achieved 87.6% accuracy over the testing data set, again similar to its accuracy over the training set. The sixth and best DT model achieved the highest accuracy level of 90.99% over the testing set. Fig. 2 visualizes the network created by the training data set: the gray vertices represent authors which the model predicted will be active, and the black vertices are authors which the model predicted to be non-active. Fig. 3 visualizes the same network as Fig. 2, but the size of the vertices is proportional to their input degree.

The experiments showed that prediction based on graph metrics can aid link prediction by finding new active authors. Dyadic common-neighbor-based local metrics can also be used to predict the active authors, and distance-based metrics can be calculated quickly and contribute to the prediction. The sixth DT model achieved an improvement in FN mistakes but makes more FP mistakes compared to the fifth DT model; in this research we are more sensitive to FN mistakes, as discussed earlier. The same models were produced with the same parameters on the same training set and were tested on the same testing data set, but with one difference: we excluded the VSM feature. The results were significantly worse for all the models. The most critical features for predicting active authors are Cos_Sum, Betweenness and the K-Short distance.

Figure 2. Training network
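The evaluation loop described above (a random 70/30 train/test partition plus an accuracy score) can be sketched as follows. This is a hedged sketch: the classifier used here is a trivial stand-in, not the SPSS Clementine NN/DT models, and the row format is invented for illustration.

```python
# Sketch of the 70/30 partition and accuracy scoring described above.
# The data rows and any classifier trained on them are stand-ins.
import random

def split_70_30(rows, seed=0):
    rows = rows[:]                      # don't disturb the caller's list
    random.Random(seed).shuffle(rows)   # seeded for reproducibility
    cut = int(len(rows) * 0.7)
    return rows[:cut], rows[cut:]

def accuracy(predicted, actual):
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)

rows = [(i, i % 2) for i in range(10)]  # (features, label) stand-ins
train, test = split_70_30(rows)
```

Building the model on `train` only and reporting `accuracy` on `test` mirrors the distinction the text draws between training-set and testing-set accuracy.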
When the classification procedure was engaged without the Cos_Sum feature, the prediction results were less accurate, showing only 80.93% accuracy with the best model; when the Cos_Sum feature was engaged, the prediction accuracy reached approximately 91%. Another experiment was carried out in order to improve the prediction accuracy by using an NLP (Natural Language Processing) stemmer, a Java tool, instead of the statistical measures discussed above. This method makes it possible to distinguish words with different grammatical inflections. Linguistics-related text mining is known as Natural Language Processing: NLP is a research field which deals with formal theories about generating and understanding spoken language. Computational linguistics can be considered the automatic processing of natural language, since its main task is the construction of algorithms to process words and texts in natural language. Thus, NLP processes text documents intelligently, using general knowledge about natural language and semantic information [6]. The NLP approach applies the principles of computer-assisted analysis of human languages to the analysis of words, phrases, and the syntax, or structure, of text. A system that incorporates NLP can intelligently extract terms as part of the TDM process. Using a DT model in the same procedure, but with an NLP feature instead of the statistical VSM feature, the results showed a slight improvement, achieving 91.86% accuracy over the testing set.

E. Features for Link Prediction

The second prediction stage is based on the results achieved in the first prediction procedure (described above), which enables us to focus on the "active" authors when trying to predict new links between authors. Instead of looking at every possible pair of authors, which in this case is C(4446, 2), about 10 million couples, we can look at pairs of the 993 active authors, which are only C(993, 2), about 500,000 couples. In this stage we created a feature set more suitable for co-author prediction, detailed in Table I. NN and DT models were considered for the classification problem; this time we tried to classify pairs of authors into two classes: C1 and C2. The first NN model was built on the training set (70% of the data) and contained two hidden layers and cross validation on 50% of the data; the model contained two neurons in each of the first and second hidden layers. The second NN model contained one hidden layer with 3 neurons. It can be deduced that the most effective features are Preferential, VSM, CC2 and MinInput. The DT model was built on the same training set as the NN models; it included 10-times cross validation and generated a DT with 28 levels and an accuracy level of 97.1% over the training set.

Figure 3. Network predicted by the best DT model- gray vertices were predicted as "active"

III. RESULTS AND CONCLUSIONS

This research presented a view into various aspects of the prediction problem in social networks. It described experiments conducted to distinguish between prediction using only SNA methods and prediction using TDM and SNA methods. The aim was to show that prediction in large networks using metrics that include TDM is more accurate than prediction using metrics that contain only SNA. The first NN model achieved 92.23% accuracy over the testing data set, the second NN model achieved 93.24% accuracy, and the DT model achieved 96.99% accuracy over the testing set.

The DT model achieved an improvement in FN and FP mistakes compared to the other models. The same models were produced with exactly the same parameters, but with one difference: we excluded the statistical VSM feature and engaged the NLP feature. The results were slightly improved, achieving 97.73% accuracy over the testing set. These are the best achievable results, and they enable a total accuracy of 89.78% in predicting new links in the co-author social network (the product of the first prediction and the second prediction).

The experiments showed that prediction based on graph metrics can aid link prediction by finding new active authors. Dyadic common-neighbor-based local metrics can also be used to predict the "active authors". The experiment found that there is a difference in prediction accuracy between the two methods. Although all metrics are significantly different, no individual metric alone is sufficient for separating the two classes. The most critical features for classifying, and thus for predicting new co-operations between authors, are VSM, Weight (the link's strength) and Jaccard's coefficient, which indicates the similarity between the groups of friends of two vertices.

The contribution of this work emphasizes the difference between predicting links in a graph using SNA alone and predicting links using a combination of different methods of TDM and SNA. It is shown through empirical testing that the two predictions, with TDM and without
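The VSM feature and the pair-count arithmetic of section E can be made concrete. In this hedged sketch, authors are represented by term-frequency vectors over their article-title words, and the Cos_Sum-style score is taken to be the cosine similarity between two such vectors; the titles are invented, and this reading of the feature is an assumption. The binomial counts below reproduce the figures quoted in the text.

```python
# Illustrative VSM sketch: term-frequency vectors over article titles,
# compared by cosine similarity. Titles are invented; treating the
# text-similarity feature as plain cosine similarity is an assumption.
import math
from collections import Counter

def title_vector(titles):
    return Counter(word for t in titles for word in t.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

a = title_vector(["mining social networks", "social network analysis"])
b = title_vector(["text mining methods"])

# Restricting prediction to the 993 active authors shrinks the candidate space:
pairs_all = math.comb(4446, 2)    # about 10 million couples
pairs_active = math.comb(993, 2)  # about 500,000 couples
```

The exact counts are 9,881,235 and 492,528 pairs, which is the roughly twenty-fold reduction the second prediction stage relies on.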
TDM, have different results in favor of using TDM in the prediction algorithm.

REFERENCES

[1] A.L. Barabasi, H. Jeong, Z. Neda, E. Ravasz, A. Schubert, and T. Vicsek, "Evolution of the social network of scientific collaboration", Physica A, 311, 2002, 590-614
[2] A.L. Barabasi, and R. Albert, "Emergence of scaling in random networks", Science, AAAS, 286, 1999, 509-512
[3] Potgieter, K. A. April, R. J. E. Cooke, and I. O. Osunmakinde, "Temporality in Link Prediction: Understanding Social Complexity", Department of Computer Science, University of Cape Town, Republic of South Africa, 2007
[4] Aleman-Meza, M. Nagarajan, C. Ramakrishnan, A. Sheth, I. Arpinar, L. Ding, P. Kolari, A. Joshi, and T. Finin, "Semantic analytics on social networks: Experiences in addressing the problem of conflict of interest detection", Proceedings of the 15th International Conference on World Wide Web, ACM, New York, 2006, 407-416
[5] C. M. Bishop, "Neural Networks for Pattern Recognition", Clarendon Press, Oxford, 1995
[6] Landau, R. Feldman, Y. Aumann, M. Fresko, Y. Lindell, O. Liphstat, and O. Zamir, "TextVis: An integrated visual environment for text mining", in Principles of Data Mining and Knowledge Discovery, The Pennsylvania State University, 1998, 56-64
[7] D. Liben-Nowell, "An Algorithmic Approach to Social Networks", PhD thesis, MIT Computer Science and Artificial Intelligence Laboratory, 2005
[8] D. Liben-Nowell, and J. Kleinberg, "The link prediction problem for social networks", Proceedings of the Twelfth International Conference on Information and Knowledge Management, 2003, 556-559
[9] D. Liben-Nowell, and J. Kleinberg, "The Link Prediction Problem for Social Networks", Journal of the American Society for Information Science and Technology, John Wiley, 58(7), 2007, 1019
[10] D.L. Lee, H. Chuang, and K. Seamons, "Document ranking and the vector space model", IEEE Software, 14(2), 1997, 67-75
[11] I. Witten, and E. Frank, "Data Mining: Practical Machine Learning Tools and Techniques", 2nd ed., Elsevier, Morgan Kaufmann, San Francisco, USA, 2005, 525
[12] J. Kleinberg, "The Small-World Phenomenon: An Algorithmic Perspective", Proceedings of the 32nd ACM Symposium on Theory of Computing, 2000, 163-170
[13] L.A. Adamic, and E. Adar, "Friends and Neighbors on the Web", Social Networks, Elsevier, 25(3), 2003, 211-230
[14] L. Getoor, and C. Diehl, "Link Mining: A Survey", ACM SIGKDD Explorations Newsletter, 7(2), 2005, 3-12
[15] M. E. J. Newman, "The structure of scientific collaboration networks", Proceedings of the National Academy of Sciences, USA, 2001, 404-409
[16] M. E. J. Newman, "Assortative Mixing in Networks", Physical Review Letters, 89(5), APS, 2002, 1120
[17] M. E. J. Newman, "The structure and function of complex networks", SIAM Review, 45, 2003, 167-256
[18] S. Wasserman, and K. Faust, "Social Network Analysis: Methods and Applications", Structural Analysis in the Social Sciences, 8, Cambridge University Press, Cambridge, England, 1994
[19] V.V. Raghavan, and S.K.M. Wong, "A critical analysis of vector space model for information retrieval", Journal of the American Society for Information Science, 37(5), 1986, 279-287
[20] X. Liu, J. Bollen, M.L. Nelson, and H. Van de Sompel, "Co-Authorship Networks in the Digital Library Research Community", Information Processing and Management, 41(6), 2005, 1462-1480
[21] Z. Huang, X. Li, and H. Chen, "Link Prediction Approach to Collaborative Filtering", Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, IEEE, 2005, 7-11