Facebook Profiles Clustering
Abstract— Internet social networks can be an abundant source of opportunities, giving space to a “parallel world” which can, and in many ways does, surpass reality. People share data about almost every aspect of their lives, from opinions and comments on global problems and events, through tagging friends at locations, to personalized multimedia content. Decentralized mini-campaigns about educational, cultural, political and sports novelties can therefore be conducted. In this paper we apply a clustering algorithm to social network profiles with the aim of obtaining separate groups of people with different political views and party preferences. For the network case, where some centroids are interconnected, we implement edge constraints in the classical 𝑘-means algorithm. This approach enables fast and effective analysis of the present state of affairs and also reveals new tendencies in the observed political sphere. All profile data, friendships, fanpage likes and statuses with interactions were collected by previously developed software for neurolinguistic social network analysis, “Symbols”.
I. INTRODUCTION
In recent years, social media are said to have had an impact on public discourse and social communication. Social networks such as Facebook, Twitter and LinkedIn have become very popular during the last few years. People experience various life events, happy or unfortunate, and these positive and negative impressions are shared online almost immediately, gaining inner peace and friends’ support or offering opinions to others. A great variety of stances can be found online, regardless of the subject of discussion. This permanently enlarges the pool of comments on brands, events, and educational or health systems, which could be used as a baseline for research in quality and service improvement [1]. The potential of social networks is widely recognized. Many companies, schools, public institutions, political parties, popular individuals and groups have already created online profiles for gathering and analyzing data [2]. These data are then useful in numerous areas such as marketing, public relations, and any kind of thorough public opinion research [3].
Apart from web crawlers, which are crucial for forum research, social networks can certainly yield material for sophisticated analysis in the field of marketing and branding [4]. An advantageous approach to grouping people by their interests draws on knowledge of their personal data, such as location, birthday, job and education.
In particular, social media are increasingly used in a political context [5][6]. Potential voters share their impressions daily in the form of statuses about upcoming events and the present state of affairs, their problems, political stances, agreement or disagreement with political activities, plans, and similar everyday subjects. In order to meet citizens’ needs, politicians and spin doctors extract and analyze the information of interest from the available statuses. Twitter is a favorite among politicians and other public figures, and thus seems better suited for collecting and comparing public opinions. Facebook, however, is the most used social network in Serbia, hence we focused our online political study on Facebook. Moreover, Facebook offers a way of entering into direct dialog with citizens and encouraging political discussion, while Twitter streams short flurries of information that are continuously displaced by fresh ones. Two more important differences between Facebook and Twitter are real-life friends vs. connections with strangers, and undirected vs. directed edges between profiles. The undirected edges, which treat nodes as equals, were a further reason for selecting Facebook. The unique possibilities of public opinion research through the Internet, such as real-time data access, knowledge about people’s changing preferences and access to their status messages, provide prospects for innovation in this field, in contrast to classical offline methods.
In this paper, we present a procedure for finding and analyzing valuable information related to specific political parties. Our approach is based on clustering Facebook profiles according to their common friends and interests. Clustering techniques can help us understand relations between profiles, create a global picture of their traits, and eventually conclude how politicians can influence them. For this purpose, we adopted the well-known clustering algorithm “𝑘-means” for dividing social network profiles into separate groups, thus providing room for profiling potential voters. More precisely, the 𝑘-means algorithm is adjusted for graph clustering in order to form several connected components that respect the similarity between nodes. Data collection and filtering are done by previously developed software for neurolinguistic social network analysis, “Symbols” (footnote a), which is described in more detail in Section 3. Other approaches also exist; they focus on analyzing the structure of social networks and the centrality of profiles (see, e.g., [7, 8, 9, 10]).
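To make the idea of adjusting 𝑘-means to graph nodes more concrete, the following is a minimal sketch and not the algorithm of Section 6, whose details and edge constraints are not reproduced here. It assumes that centroids are restricted to graph nodes (a medoid-style variant), that the distance along an edge is the reciprocal of its similarity weight, and that distances between non-adjacent nodes are shortest-path lengths. The function names `shortest_paths` and `graph_kmeans` and the toy data are hypothetical.

```python
import heapq
from collections import defaultdict

INF = float("inf")


def shortest_paths(adj, source):
    """Dijkstra distances from `source`; adj maps node -> {neighbor: distance}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, INF):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def graph_kmeans(weights, initial_centroids, max_iter=20):
    """Cluster graph nodes around centroid nodes.

    weights: {(u, v): similarity > 0} for undirected edges.
    Assumption: edge distance = 1 / similarity (stronger tie, smaller distance).
    Returns {centroid: [member nodes]}.
    """
    adj = defaultdict(dict)
    for (u, v), w in weights.items():
        adj[u][v] = adj[v][u] = 1.0 / w
    nodes = sorted(adj)
    all_dist = {u: shortest_paths(adj, u) for u in nodes}

    centroids = list(initial_centroids)
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: each node joins its nearest centroid.
        # Nodes unreachable from every centroid fall to the first one.
        clusters = {c: [] for c in centroids}
        for u in nodes:
            nearest = min(centroids, key=lambda c: all_dist[c].get(u, INF))
            clusters[nearest].append(u)
        # Update step: the new centroid is the member with the smallest
        # total distance to the other members of its cluster.
        new_centroids = [
            min(members, key=lambda m: sum(all_dist[m].get(x, INF) for x in members))
            for members in clusters.values()
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters


# Toy graph: two tightly knit triangles joined by one weak edge.
toy_weights = {
    ("a", "b"): 5, ("b", "c"): 5, ("a", "c"): 5,
    ("d", "e"): 5, ("e", "f"): 5, ("d", "f"): 5,
    ("c", "d"): 1,
}
clusters = graph_kmeans(toy_weights, initial_centroids=["a", "f"])
print(clusters)
```

On this toy graph the two triangles end up in separate clusters, because crossing the weak edge costs far more than moving within a triangle. Precomputing all shortest paths is only reasonable for small graphs; a larger network would need a cheaper distance or a restricted search.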
a
http://symbolsresearch.com
6th International Conference on Information Society and Technology ICIST 2016
The remainder of the paper is structured as follows. Section 2 gives an overview of the literature. Section 3 presents the details of our software “Symbols”. Recent surveys of Facebook popularity in Serbia are highlighted in Section 4. Section 5 describes our research methodology. Section 6 extends the standard 𝑘-means from vectors to the nodes of a graph. The results are presented in Section 7, while Section 8 concludes the study.
II. RELATED WORK
Much real-world data can be represented as a network (graph): objects are represented as nodes, and the relations among them as edges. Based on Facebook users’ relationships and fanpage likes, we created a network of Facebook profiles. The problem of data clustering with constraints is now overcome with graph-based clustering. In this way, each element to be clustered is represented as a node in a graph, and the distance between two elements is modeled by a weight on the edge linking the nodes [11]. The stronger the relation between objects, the higher the weight (and the smaller the distance), and vice versa. Graph-based clustering is a well-studied topic in the literature, and various approaches have been proposed so far.
In [12], the graph edit distance and the weighted mean of a pair of graphs were used to cluster graph-based data under an extension of self-organizing maps (SOMs). In order to determine cluster representatives, the authors of [13] clustered attributed graphs by means of Function-Described Graphs (FDGs). Later approaches introduced the notion of the set median graph [14], which has been used to represent the center of each cluster. However, a better representation of each cluster is obtained with the generalized median graph concept [14]. Given a set of graphs, the generalized median graph is defined as a graph that has the minimum sum of distances to all graphs in the set. Median graph approaches, however, suffer from exponential computational complexity or are restricted to special types of graphs [15]. The spectral clustering algorithm [16] appears to be a much better solution. This method uses the eigenvectors of the adjacency matrix and other graph matrices to find clusters in data sets represented by graphs. A 𝑘-means clustering algorithm for graphs was introduced in [17], bearing in mind the simplicity and speed of the algorithm. In this paper we propose an extension of the classical 𝑘-means algorithm for Euclidean spaces [18][19], implemented for the case of graphs (see Section 6).
III. “SYMBOLS” DATA COLLECTION
In this section we give a brief overview of the Symbols software and its capabilities. As “glue” between our software and the Facebook API we developed a Facebook application, SSNA (Software for Social Network Analyses). When users start this app, they are asked for permission to access their private data. Upon their agreement, the app calls the Facebook API on behalf of the users, after which a security token valid for the next two months is obtained. The data encompass the following network records:
1) The friendship network: the ego network includes the SSNA app users (egos) as nodes and the friendship relations between them;
2) The communication network:
(a) Like relations: by clicking a “like” button, Facebook users can value another person’s content (posts, photos, videos);
(b) Comment relations: Facebook users can leave comments on another person’s content;
(c) Post relations: Facebook users can post on the “wall” of another person to leave non-private messages.
3) Affinity network: attachments to various fanpages and groups, implying support and agreement within their niche.
This software offers graphical presentation of statistical data for selected political parties based on social network statuses and likes, and much more.
IV. FACEBOOK IN SERBIA
According to the most recent survey by the Ministry of Trade, Tourism and Telecommunications of the Republic of Serbia, 93.4% of Internet users aged 16 to 24 have a profile on social networks (Facebook, Twitter). Our research is based on the Facebook audience, because most of the world’s population is favorably disposed toward this global Internet social network. The Facebook Advertisement service reports a potential reach of 3,600,000 people from Serbia for promotion. If we are to believe the self-reported information in people’s Facebook profiles, about 45% of them are women and 55% are men. Information is only available for people aged 18 and older. The largest age group is currently from 18 to 24, with a total of 1,440,000 users, followed by users aged 25 to 34. People with faculty (college) level education account for about 66%, whilst high school students account for about 32%. At the same time, the shares of single and married relationship statuses are 38% and 42%, respectively.
V. METHODOLOGY
Our research focuses on the prevalence of political parties across the whole territory of the Republic of Serbia. According to our figures, the total number of collected fanpages is 663,925, corresponding to a total of 78,758 profiles. Among these fanpages, 4095 are placed by their creators in the sphere of politics, while 771 pages have more than three likes. Profiles and fanpages are used for graph construction. Profiles represent graph nodes, while fanpages determine a measure of similarity between profiles, i.e. the weight of the edges.
Recent social research shows that people on Internet social networks such as Facebook interact with only a small share of their friends (about 8% of the total), while the remaining ones are “passive”. Members of this minority have similar interests, common friends, and acquaintances from diverse events. This kind of Internet behavior leads us to take into consideration common pages as well as common friends in order to create a graph with strong edges. We considered a limited number of pages for every political party, chosen according to the total number of page likes, because a very large number of fanpages can yield misleading results. Bearing this in mind, we selected the ten fanpages with the most likes for each political party by searching for keywords in the title related to the party’s name, abbreviation and leaders. Let us denote this set of fanpages by 𝑆. We limited our examination to the four most popular political parties at this moment.
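The graph construction described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not give the exact weight formula, so the sketch records the number of common fanpages from 𝑆 and the number of common friends separately and takes their sum as the edge weight. The function name `build_profile_graph`, the input layout and the toy profiles are hypothetical.

```python
from itertools import combinations


def build_profile_graph(likes, friends, selected_pages):
    """Build weighted edges between profiles.

    likes:          {profile: set of liked fanpage ids}
    friends:        {profile: set of friend ids}
    selected_pages: the set S of fanpages kept for the political parties
    Returns {(u, v): (common_pages, common_friends, weight)}.
    Assumption: weight = common_pages + common_friends.
    """
    edges = {}
    for u, v in combinations(sorted(likes), 2):
        common_pages = len(likes[u] & likes[v] & selected_pages)
        common_friends = len(friends.get(u, set()) & friends.get(v, set()))
        if common_pages or common_friends:
            edges[(u, v)] = (common_pages, common_friends,
                             common_pages + common_friends)
    return edges


# Toy data: page "x" is liked but is not part of S, so it is ignored.
S = {"p1", "p2", "p3"}
likes = {
    "ana": {"p1", "p2", "x"},
    "boris": {"p1", "p2"},
    "ceca": {"p3"},
}
friends = {
    "ana": {"boris", "dejan"},
    "boris": {"ana", "dejan"},
    "ceca": {"ana"},
}
edges = build_profile_graph(likes, friends, S)
for pair, (pages, common, weight) in edges.items():
    print(pair, pages, common, weight)
```

Comparing every pair of profiles is quadratic in the number of profiles, which is fine for a toy example but would need an inverted index (fanpage to profiles) at the scale of tens of thousands of profiles.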
Figure 1. Friends (green) with four fanpages and four friends in common.
Figure 2. Facebook profiles network, 428 nodes and 4448 edges.
TABLE II. NUMBER OF THE FANPAGES IN COMMON IS GREATER THAN 3. THE NUMBERS OF NODES AND EDGES ARE 428 AND 4448, RESPECTIVELY
TABLE III.
TABLE IV.
TABLE V.
NUMBER OF THE FANPAGES IN COMMON IS GREATER THAN 4. THE NUMBERS OF NODES AND EDGES ARE 213 AND 1141, RESPECTIVELY
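The captions above report how the network shrinks when the required number of common fanpages is raised from more than 3 (428 nodes, 4448 edges) to more than 4 (213 nodes, 1141 edges). A minimal sketch of such a threshold filter is given below; it assumes that an edge is kept only when its common-fanpage count exceeds the threshold and that nodes left without any edge are dropped. The function name `filter_by_common_pages` and the toy counts are hypothetical and do not reproduce the authors' data.

```python
def filter_by_common_pages(common_pages, threshold):
    """Keep edges whose common-fanpage count is strictly greater than threshold.

    common_pages: {(u, v): number of fanpages u and v have in common}
    Returns (nodes, edges), where nodes are the endpoints of surviving edges.
    """
    kept = {pair: n for pair, n in common_pages.items() if n > threshold}
    nodes = {profile for pair in kept for profile in pair}
    return nodes, kept


# Toy counts of common fanpages between four profiles.
toy_counts = {("a", "b"): 5, ("a", "c"): 4, ("b", "c"): 4, ("c", "d"): 2}

for threshold in (3, 4):
    nodes, kept = filter_by_common_pages(toy_counts, threshold)
    print(threshold, len(nodes), len(kept))
```

A stricter threshold removes weak ties first and then the profiles that were attached only through them, which is the same mechanism behind the drop in node and edge counts reported in the captions.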
“Tvoj stav” (footnote b), and may contain valuable information useful for additional comments, we shall avoid drawing generalized conclusions and will not deal with such clusters. Finally, with these clusters we are able to build a voter’s profile for a political party in a simple way.
VIII. CONCLUSION
People share content about almost every aspect of their lives, from opinions on global problems and comments on events to criticism of political parties and their leaders. These daily online activities encourage the exchange of opinions, thus creating political clusters aimed at inspiring certain political actions and attracting new voters. The goal of this research was to study network ties between profiles according to their common interests. In this paper, we presented a novel graph-based clustering approach which relies on the classical 𝑘-means algorithm. The algorithm was tested on real Facebook data, and we showed that similar conclusions can be obtained faster than with the research conducted by marketing agencies engaged for the same purpose and tasks. We determined three clear clusters for the chosen political parties, so that we could distinguish them. The fourth (mixed) cluster consists of about 50% of all the profiles, and this problem remains unsolved. In the future, our efforts will be directed toward splitting it, because the undecided group of voters seems to hide important information. The 𝑘-means++ algorithm should be a good start [24]. With a small modification, the same algorithm could be tested on Twitter data. Upgrading the application to support Twitter profiles is also planned for future research.
ACKNOWLEDGMENT
This paper was supported by the Ministry of Education, Science and Technological Development of the Republic of Serbia (scientific projects OI174033, III44006, ON174013 and TR35026).
REFERENCES
[1] C. C. Aggarwal, “An introduction to social network data analytics,” in Social Network Data Analytics, C. C. Aggarwal, Ed. Springer US, 2011, pp. 1–15.
[2] S. A. Catanese, P. De Meo, E. Ferrara, G. Fiumara and A. Provetti, “Crawling facebook for social network analysis purposes,” in Proceedings of the international conference on web intelligence, mining and semantics, ACM, pp. 52, 2011.
[3] M. Burke, R. Kraut and C. Marlow, “Social capital on Facebook: Differentiating uses and users,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, pp. 571–580, 2011.
[4] B. Arsić, P. Spalević, L. Bojić and A. Crnišanin, “Social Networks in Logistics System Decision-Making,” in Proceedings of the 2nd Logistics International Conference, pp. 166–171, 2015.
[5] D. Zeng, H. Chen, R. Lusch and S. H. Li, “Social media analytics and intelligence,” Intelligent Systems, IEEE, vol. 25, no. 6, pp. 13–16, 2010.
[6] S. Wattal, D. Schuff, M. Mandviwalla and C. B. Williams, “Web 2.0 and politics: the 2008 US presidential election and an e-politics research agenda,” Mis Quarterly, vol. 34, pp. 669–688, 2010.
[7] S. Catanese, P. De Meo, E. Ferrara, G. Fiumara and A. Provetti, “Extraction and analysis of facebook friendship relations,” in Computational Social Networks, A. Abraham, Ed. Springer London, 2012, pp. 291–324.
[8] M. G. Everett and S. P. Borgatti, “The centrality of groups and classes,” The Journal of mathematical sociology, vol. 23, pp. 181–201, 1999.
[9] J. Scott, Social network analysis. Sage, London, 2012.
[10] J. Sun and J. Tang, “A survey of models and algorithms for social influence analysis,” in Social Network Data Analytics, C. C. Aggarwal, Ed. Springer US, 2011, pp. 177–214.
[11] A. K. Jain, M. N. Murty and P. J. Flynn, “Data clustering: a review,” ACM computing surveys (CSUR), vol. 31, no. 3, pp. 264–323, 1999.
[12] S. Günter and H. Bunke, “Self-organizing map for clustering in the graph domain,” Pattern Recognition Letters, vol. 23, pp. 405–417, 2002.
[13] F. Serratosa, R. Alquézar and A. Sanfeliu, “Synthesis of function-described graphs and clustering of attributed graphs,” International journal of pattern recognition and artificial intelligence, vol. 16, pp. 621–655, 2002.
[14] X. Jiang, A. Münger and H. Bunke, “On median graphs: properties, algorithms, and applications,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 23, pp. 1144–1151, 2001.
[15] H. Bunke, A. Münger and X. Jiang, “Combinatorial search versus genetic algorithms: A case study based on the generalized median graph problem,” Pattern recognition letters, vol. 20, pp. 1271–1277, 1999.
[16] U. von Luxburg, “A tutorial on spectral clustering,” Statistics and computing, vol. 17, pp. 395–416, 2007.
[17] A. Schenker, Graph-theoretic techniques for web content mining, World Scientific, 2005.
[18] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman and A. Y. Wu, “An efficient k-means clustering algorithm: Analysis and implementation,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, pp. 881–892, 2002.
[19] P. Berkhin, “A survey of clustering data mining techniques,” in Grouping multidimensional data, J. Kogan, C. Nicholas and M. Teboulle, Eds. Springer Berlin Heidelberg, 2006, pp. 25–71.
[20] D. Arthur and S. Vassilvitskii, “How Slow is the k-means Method?” in Proceedings of the twenty-second annual symposium on Computational geometry, ACM, pp. 144–153, 2006.
[21] S. Har-Peled and B. Sadri, “How fast is the k-means method?,” Algorithmica, vol. 41, pp. 185–202, 2005.
[22] X. Xu, N. Yuruk, Z. Feng and T. A. Schweiger, “Scan: a structural clustering algorithm for networks,” in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp. 824–833, 2007.
[23] L. C. Freeman, “A set of measures of centrality based upon betweenness,” Sociometry, vol. 40, no. 1, pp. 35–41, 1977.
[24] D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, Society for Industrial and Applied Mathematics, pp. 1027–1035, 2007.
b
http://www.tvojstav.com/page/analysis#analize_mdr