Mining Social Media
Journal of Applied Computer Science & Mathematics, Issue 1/2018, vol.12, No. 25, Suceava
DOI: 10.4316/JACSM.201801002

Abstract–Big data plays a major role in machine learning and data mining, and extracting meaningful and interesting information from it is a challenge. The volume of data on social media and Wikipedia is increasing exponentially. Visualizing such huge data is another aspect of big data, and graphs are becoming important for visualizing and modelling it. Gephi and R are two important visualization and exploration tools in this field. Using a graph, one may calculate modularity, eccentricity, indegree, outdegree, betweenness centrality, etc. In this paper, we used DBpedia, Facebook and Twitter datasets. We used Gephi and R to look inside the structure of these data and to compare different statistics obtained by exploring the graphs.

Keywords: Big Data, DBpedia, Gephi, R, Graph

I. INTRODUCTION

The size and use of data are increasing rapidly in today's world. As data becomes big, sophisticated tools and techniques are needed to extract interesting information from it. The challenges of big data include capturing, analyzing, sharing, searching, querying, storing, visualizing and securing the data [23]. IBM scientists break big data into four dimensions [20]: Volume, Velocity, Variety and Veracity. More V's have been added recently to describe big data: Variability, Visualization and Value. Big data has changed the computing scenario for data miners. It has affected everything from social media to business and from education to the Internet of Things (IoT), and the list goes on [12]. In this paper, we try to extract interesting information from big data repositories. We used DBpedia, Facebook and Twitter data. The visualization tools used for the extracted data are Gephi and RStudio.

DBpedia is the structured knowledge base of Wikipedia data [3]. DBpedia data includes a huge amount of information about places, persons, organizations, creative works, diseases and species. DBpedia describes 4.58 million things in the English version alone, and its localized versions are available in 125 languages. The infoboxes used in Wikipedia inspired the creation of the DBpedia ontology. At present, the ontology has 685 classes with 2795 properties [21]. Useful information may be retrieved from DBpedia using SPARQL queries. SPARQL is a semantic query language for Resource Description Framework (RDF) based databases. We used a SPARQL query to extract a specific list of universities and visualized that data using Gephi.

Gephi is open source software for data visualization and exploration [1]. It runs on Windows, Mac and Linux platforms and can be used for analyzing social, biological and network databases in real time. We used R for extracting Twitter data. R is another open source language widely used for machine learning. Statisticians and data miners use R extensively, and surveys show that its popularity has increased in recent years [27]. R is also useful in social data mining and big data analytics. RStudio is an integrated development environment (IDE) for R. It comes in two flavours: RStudio Desktop and RStudio Server. The graphical user interface of RStudio is built with Qt, while the tool itself is written in C++.

For the next experiment, the Facebook data was collected using the Netvizz application. Netvizz is a Facebook application for data extraction and collection, and the data may be exported in different file formats [2]. Quantitative and qualitative analysis may be carried out for pages, groups and friendship networks of Facebook. The results may be exported in GDF file format and then visualized and explored in Gephi by importing the graph file.

We used modularity as one of the network measures for our graphs. Biological networks show high modularity [8]. Modularity divides the whole network into clusters: the nodes within a cluster are densely connected to each other and sparsely connected to nodes in other clusters. Modularity is used for detecting community structure, but it fails to detect small communities because of its resolution limit, which is its main drawback. The different communities in a network have different properties such as centrality, betweenness, clustering coefficient and node degree. Many real-world problems may be represented in structured form using graphs, for example biological networks, the World Wide Web, food webs, social networks and pathological networks.

The paper is organized into five sections: Section II presents the literature review, Section III the methodology, Section IV the experimental results and Section V the conclusions and future work.

II. LITERATURE REVIEW

In the work of [4], the netscience.gml file was imported into the Gephi tool and various procedures of Clustering,
Ranking and Partitioning were applied to the data. The Atlas algorithm was applied to produce a more readable representation of the text. The algorithm pushes the most connected nodes away from each other while aligning the nodes connected to the hubs in clusters around them, which makes the graph much more readable. The graph had 5262 nodes and 17217 edges, which showed the distribution of the network and the interaction of its nodes through the edges. The (experimental) MCL clustering algorithm was also applied to identify word group clusters in the network. The operations carried out on the network data in that work include partitioning, ranking, clustering and layout using Gephi.

The study of [15] showed how instructors and course coordinators can use Gephi to generate relevant information that would otherwise be difficult to obtain. Analysis of empirical data from a cross-curricular course with 656 students demonstrated the usefulness of Gephi for social learning analytics studies. The work further demonstrated how the tool can provide relevant indicators of student activity and engagement.

The work of [16] investigated how networks of influence are formed among Twitter users and the relative influence of global news media organizations and information providers in the Twitter-sphere during global news events. Gephi was used to build an analysis around a set of tweets collected during the 2012 London Olympics. To understand how different users influence conversations across Twitter, three types of accounts were compared: those belonging to a number of well-known athletes, those belonging to some well-known commentators employed by the BBC, and a number of corporate accounts belonging to the BBC World Service and the official London Twitter account. The data was examined from two perspectives. First, to understand the structure of the social groupings formed among Twitter users, network analysis was used to model social groupings in the Twitter-sphere across time and space. Second, to assess the influence of individual tweets, the ageing factor of tweets, which measures how long users continue to interact with a particular tweet after it is originally posted, was investigated.

The study of [24] applied a multicultural approach to social media marketing analytics to two Facebook brand pages, French (individualistic culture, the home country of the brand) and Saudi Arabian (collectivistic culture, one of its host countries), published by an international beauty and cosmetics firm. The network analysis was carried out using Gephi. The most popular posts and the most influential users within the two brand pages were identified, and the different communities emerging from brand and user interactions were highlighted. The work revealed that these communities seem to be culture oriented when they are constructed around socialization branded posts and product-line oriented when advertising branded posts are concerned.

In the work of [8], sentiment analysis of tweets on the 2017 Barcelona terror attack was carried out using RStudio. The basic motivation was to observe, examine and analyze how people respond to such a situation, either by expressing aggression against the terrorists or by supporting the victims. The sentiment analysis was conducted in R, a powerful statistical tool. The analysis classified people's views into eight categories of emotion (anger, trust, fear, anticipation, disgust, sadness, joy, surprise) and two sentiments (positive and negative) from the emotion lexicon EmoLex.

In the study of [5], tweets about Jio were gathered as a data stream using the Twitter API based on the keyword #jio. The work verified the sentiment orientation of the tweets and also detected changes in the frequencies of tweeted words by applying the ADWIN sliding window algorithm in RStudio.

The association between the use of online social networks for smoking cessation and abstinence was investigated in the work of [11]. Using dynamic social network analysis, it examined how temporal changes in an individual's number of social network ties are prospectively related to abstinence in an online social network for cessation. The prediction was that, in a network where quitting is normative and is the focus of communications among members, an increasing number of ties would be positively associated with abstinence. The analysis was done using RStudio.

RStudio was used in the work of [24] for the analysis and visualization of a WhatsApp group. It found that the group chat had 22 active users, with an equal number of males and females. The majority of female users tended to be more addicted to the WhatsApp group chat than male users, owing to features provided by WhatsApp such as multimedia, smileys and text. The most addicted respondents were in the 20 to 30 age group, representing a young sample. The work concluded that WhatsApp is one of the best communication platforms, whose pros and cons are decided by the users themselves: used positively it is a boon, and if it becomes an addiction it is a bane. The work therefore classified users' level of addiction to the WhatsApp group chat so as to limit the time spent on it and to explore the group whenever necessary.

III. METHODOLOGY

3.1 Gephi: Gephi is open source software for graph and network analysis. It uses a 3D render engine to display large networks in real time and to speed up exploration. Its flexible, multi-task architecture brings new possibilities for working with complex data sets and producing valuable visual results [13]. As open source network exploration and manipulation software, Gephi provides modules that can import, visualize, spatialize, filter, manipulate and export all types of networks. The visualization module
uses a special 3D render engine to render graphs in real time. Like video games, the 3D render engine uses the computer's graphics card and leaves the CPU free for other computing. Gephi can deal with large networks of over 20,000 nodes because it is built on a multi-task model and takes advantage of multi-core processors [18].

3.2 R: R is an open source programming language and software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing [9]. The R language is widely used among data miners and statisticians for developing models and analyzing data [5].

3.3 R Studio: RStudio is an integrated development environment (IDE) for R. It includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management. It is available in open source and commercial editions and runs on Windows, Mac and Linux [5]. RStudio was founded by JJ Allaire, creator of the programming language ColdFusion. Hadley Wickham is the Chief Scientist at RStudio.

3.4 Eccentricity of a graph: The eccentricity ε(v) of a vertex v in a connected graph G is the maximum graph (geodesic) distance between v and any other vertex u of G. For a disconnected graph, all vertices are defined to have infinite eccentricity. The maximum eccentricity of any vertex is the graph diameter (d), and the minimum eccentricity of any vertex is the graph radius (r) [28], [3].

3.5 Modularity of graph: Modularity measures the strength of division of a network into modules (also called groups, clusters or communities) [1]. Networks with high modularity have dense connections between the nodes within modules and sparse connections between nodes in different modules. Formally, modularity is the fraction of the edges that fall within the given groups minus the expected such fraction if edges were distributed at random. In its standard form, modularity is expressed as

\[ Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \]

where \(A_{ij}\) is 1 if nodes i and j are connected and 0 otherwise, \(k_i\) is the degree of node i, \(m\) is the number of edges, and \(\delta(c_i, c_j)\) is 1 when i and j belong to the same community and 0 otherwise.

3.6 DBpedia: DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. It allows sophisticated queries against datasets derived from Wikipedia and lets other datasets on the Web be linked to Wikipedia data [26]. According to [19], DBpedia is a community whose goal is to provide a web-based open source data set of RDF triples based on Wikipedia data. Since 2007, the DBpedia project has been extracting metadata and structured data from Wikipedia and making it publicly available as RDF triples [17]. DBpedia also offers a live synchronized version of the extracted data, DBpedia Live [22]. The English Wikipedia receives hundreds of updates per minute [10], which are processed via the Live framework. Changes in Wikipedia articles are often connected to real-life events, such as news from politics, cultural life or sports. Owing to Wikipedia's large user base, these events are often reflected quickly, in many cases more quickly than in other Web sources [6].

3.7 Indegree of a Graph: The indegree of a vertex V is the number of edges coming into V. It is denoted deg+(V) [10].

3.8 Outdegree of a Graph: The outdegree of a vertex V is the number of edges going out from V. It is denoted deg-(V) [10].

IV. EXPERIMENTAL RESULTS

We performed three experiments, with DBpedia, Facebook and Twitter data.

4.1 Mining Dbpedia data: The first experiment was performed on DBpedia data. A SPARQL query was used to extract information from DBpedia. The query searched for the names of all English-medium universities of the world whose creation date was before 01.01.1900. The query is given as follows:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbp: <http://dbpedia.org/property/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    SELECT ?UniversityName ?CountryName
    WHERE {
      ?uni rdf:type dbo:University ;
           foaf:name ?UniversityName .
      FILTER (lang(?UniversityName) = "en")
      ?uni dbp:established ?creationDate ;
           dbo:country ?country .
      ?country dbo:language dbr:English_language ;
               foaf:name ?CountryName .
      FILTER (lang(?CountryName) = "en")
      FILTER (?creationDate < "1900-01-01"^^xsd:date)
    }
    ORDER BY (?creationDate)

The query returned 4281 universities. We then imported the CSV file into Gephi for analysis. The graph shows 3659 nodes and 4281 edges. The attributes are university name and country name. We performed the modularity test on the data. According to the modularity report, the modularity is 0.651 at resolution 1, with 41 communities. The average path length is close to 1. The average degree is found to be 1.066.
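As a rough illustration of the kind of statistics reported for this experiment, the following Python sketch builds a directed university-to-country graph from a CSV export and computes indegree, outdegree and modularity. The sample rows, the column names and the choice of one community per country are assumptions made for illustration only; they are not the DBpedia result set, and the partition is not the output of Gephi's community detection. Modularity is computed in the per-community form Q = sum over communities c of [L_c/m - (d_c/2m)^2], where L_c is the number of edges inside c, d_c is the total degree of its nodes and m is the number of edges, with every edge treated as undirected.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows standing in for the exported query result.
SAMPLE_CSV = """UniversityName,CountryName
Uni A,Country X
Uni B,Country X
Uni C,Country X
Uni D,Country Y
Uni E,Country Y
"""


def load_edges(text):
    """Read (university, country) pairs as directed edges."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["UniversityName"], row["CountryName"]) for row in reader]


def degrees(edges):
    """Return (indegree, outdegree) dictionaries for a directed edge list."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for source, target in edges:
        outdeg[source] += 1
        indeg[target] += 1
    return indeg, outdeg


def modularity(edges, community):
    """Modularity of a partition, treating every edge as undirected."""
    m = len(edges)
    inside = defaultdict(int)  # edges with both ends in the same community
    total = defaultdict(int)   # sum of node degrees per community
    for source, target in edges:
        total[community[source]] += 1
        total[community[target]] += 1
        if community[source] == community[target]:
            inside[community[source]] += 1
    return sum(inside[c] / m - (total[c] / (2 * m)) ** 2 for c in total)


edges = load_edges(SAMPLE_CSV)
indeg, outdeg = degrees(edges)

# Assumed partition: each country and its universities form one community.
community = {}
for university, country in edges:
    community[university] = country
    community[country] = country

q = modularity(edges, community)
print("indegree of Country X:", indeg["Country X"])
print("outdegree of Uni A:", outdeg["Uni A"])
print("modularity:", round(q, 3))
```

In a graph of this shape every university has outdegree 1 and each country's indegree counts its universities, so the partition by country places every edge inside a community.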
Figure 1: Community size distribution of DBpedia data

Figure 3: Eccentricity distribution of Facebook like network data
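An eccentricity distribution such as the one in Figure 3 counts how many vertices have each eccentricity value. The Python sketch below shows one way to compute it with breadth-first search. The five-node graph is a hypothetical toy example, not the Facebook data, and the code assumes an undirected, unweighted graph stored as an adjacency list.

```python
from collections import Counter, deque

# Hypothetical undirected toy graph: a path a-b-c-d with a branch c-e.
GRAPH = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d", "e"],
    "d": ["c"],
    "e": ["c"],
}


def eccentricity(graph, start):
    """Greatest shortest-path distance from start, found by BFS."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    if len(distance) < len(graph):
        return float("inf")  # disconnected graph: infinite eccentricity
    return max(distance.values())


ecc = {vertex: eccentricity(GRAPH, vertex) for vertex in GRAPH}
diameter = max(ecc.values())          # largest eccentricity
radius = min(ecc.values())            # smallest eccentricity
distribution = Counter(ecc.values())  # eccentricity value -> vertex count

print("eccentricity:", ecc)
print("diameter:", diameter, "radius:", radius)
print("distribution:", dict(distribution))
```

Running one BFS per vertex is adequate for small graphs; for networks with many thousands of nodes a dedicated graph library or Gephi's own statistics module is the more practical choice.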
Computer Science Section
REFERENCES
[3] Basics of Graph theory. Retrieved from www.cse.iitkgp.ac.in/~animeshm/FirstHalfScribe.pdf. Accessed 3rd December 2017.
[4] Bandgar, B. M., Karande, D. N. and Binod K. (2014). An Analysis of Social Network Data, IFRSA International Journal of Data Warehousing & Mining, Vol 4, issue 3, August 2014.
[5] Bastian M., Heymann S., Jacomy M. (2009). Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media.
[6] Bernhard Rieder (2013). "Studying Facebook via data extraction: the Netvizz application", Proceedings of the 5th Annual ACM Web Science Conference, pp 346-355.
[7] Bizer, Christian; Lehmann, Jens; Kobilarov, Georgi; Auer, Soren; Becker, Christian; Cyganiak, Richard; Hellmann, Sebastian (September 2009). "DBpedia - A crystallization point for the Web of Data" (PDF). Web Semantics: Science, Services and Agents on the World Wide Web. 7 (3): 154–165. doi:10.1016/j.websem.2009.07.002. ISSN 1570-8268.
[8] Dedić, N.; Stanier, C. (2017). "Towards Differentiating Business Intelligence, Big Data, Data Analytics and Knowledge Discovery". 285. Berlin; Heidelberg: Springer International Publishing. ISSN 1865-1356. OCLC 909580101.
[9] Fox, J. & Andersen, R. (2005). Using the R Statistical Computing Environment to Teach Social Statistics Courses, Department of Sociology, McMaster University.
[10] Georgios, A. P., Maria, S., Charalampos, N. M., Theodoros, G. S., Sophia, K., Jan Aerts, R. S. and Pantelis, G. B. (2001). Using graph theory to analyze biological networks, BioData Mining 4(10).
[11] Graham, A. L., Zhao, K., Papandonatos, G. D., Erar, B., Wang, X., Amato, M. S., et al. (2017). A prospective examination of online social network dynamics and smoking cessation. PLoS ONE 12(8): e0183655. https://doi.org/10.1371/journal.pone.0183655
[12] Hebeler, John; Fisher, Matthew; Blace, Ryan; Perez-Lopez, Andrew (2009). "Semantic Web Programming". Indianapolis, Indiana: John Wiley & Sons. p. 406. ISBN 978-0-470-41801-7.
[13] Jacomy M, Venturini T, Heymann S, Bastian M (2014). ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software. PLoS ONE 9(6): e98679. doi:10.1371/journal.pone.0098679.
[14] Jeffrey S. R. (2012). R Studio: A Platform-Independent IDE for R and Sweave, Journal of Applied Econometrics, 27: 167–172 (2012).
[15] Kancharla, S. and Sudhakar, N. R. (2017). Sentiment Change Detection in Twitter Data Using R Studio, International Journal for Research in Applied Science & Engineering Technology (IJRASET), 5(5).
[16] Kefi, H., Indra, S. and Abdessalem, T. (2016). "Social Media Marketing Analytics: A Multicultural Approach Applied To The Beauty & Cosmetics Sector" (2016). PACIS 2016 Proceedings, 176.
[17] Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P. N., Hellmann, S., Morsey, M., Van Kleef, P., Auer, S. and Bizer, C. (2015). DBpedia - a large-scale, multilingual knowledgebase extracted from Wikipedia. Semantic Web Journal, 6(2):167.
[18] Mathieu, B. and Sebastien, H. and Mathieu J. (2009). Gephi: An Open Source Software for Exploring and Manipulating Networks, Proceedings of the Third International ICWSM Conference, 361-362 (2009).
[19] Matt, H. (2011). Databases and the Web. Retrieved from www.tinman.cs.gsu.edu/~raj/8711/sp11/presentations/DBPedia.pdf. Accessed 3rd December 2017.
[20] Muhammad Lawan Jibril, Ibrahim Ali Mohammed and Atomsa Yakubu (2017). "Social Media Analytics Driven Counterterrorism Tool to improve Intelligence Gathering towards Combating Terrorism in Nigeria", International Journal of Advanced Science and Technology Vol.107, pp.33-42.
[21] Mohamed Morsey, Jens Lehmann, Sören Auer, Claus Stadler, Sebastian Hellmann (2012). "DBpedia and the live extraction of structured data from Wikipedia", Program: electronic library and information systems, Vol. 46 Iss: 2, pp.157 – 18.
[22] Morsey, J., Lehmann, M., Auer, S., Stadler, C. and Hellmann, S. (2012). DBpedia and the live extraction of structured data from Wikipedia. The program, 46(2):15.
[23] Newman, M. E. J. (2006). "Modularity and community structure in networks". Proceedings of the National Academy of Sciences of the United States of America. 103 (23): 8577–8696. arXiv:physics/0602124. Bibcode:2006PNAS..103.8577N. doi:10.1073/pnas.0601602103. PMC 1482622. PMID 16723398.
[24] Sanchita, P. (2016). WhatsApp Group Data Analysis with R, International Journal of Computer Applications, 154 (4).
[25] Sonal Singh and Shyam S Choudhary (2017). Social Media Data Analysis: Twitter Sentimental Analysis using R Language, Proceedings of IEEEFORUM International Conference, 01st October 2017, Pune, India.
[26] Soren, A., Christian, B., Georgi, K., Jens, L., Richard, C., and Zachary, I. (2007). DBpedia: A Nucleus for a Web of Open Data, Proceedings of the 6th international The semantic web and 2nd Asian conference on Asian semantic web conference, ISWC'07/ASWC'07, pg 722-735.
[27] Tippmann, Sylvia (29 December 2014). "Programming tools: Adventures with R". Nature. 517: 109–110. doi:10.1038/517109a
[28] West, D. B. (2000). Introduction to Graph Theory, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall.
Dr Sadiq Hussain received his PhD from Dibrugarh University in "Interesting Information Retrieval Over High Dimensional Data" and his M.C.A. from Tezpur University, Assam, India. His areas of interest are data mining and Big Data. He is presently working as System Administrator at Dibrugarh University, Assam, India.

L. J. Muhammad is a PhD candidate at Modibbo Adama University of Technology, Yola, Adamawa State, Nigeria. He holds a Diploma in Law from Aminu Kano School of Islamic and Legal Studies, Kano; a Certificate in Computer Appreciation from Bayero University, Kano; an International Diploma and Advanced Diploma in Computing from Informatics Institute Singapore; a B.Sc. (Hons) in Computing from the University of Portsmouth, UK; a Master of Business Administration (MBA) from the University of Wales, UK; and an M.Sc. in Computer Science from Bayero University, Kano, Nigeria. L. J. Muhammad is a Graduate Member of the Nigerian Institute of Management (NIM) as a Chartered Manager, a member of the Nigeria Computer Society, a member of the International Association of Engineers, an Associate Member of the Institute of Engineers and Doctors, and a member of the IAENG Society of Artificial Intelligence, IAENG Society of Computer Science, IAENG Society of Data Mining, IAENG Society of Internet Computing and Web Services, IAENG Society of Scientific Computing, IAENG Society of Software Engineering and IAENG Society of Wireless Networks. He is presently working at Federal University, Kashere, Gombe State, Nigeria as academic staff, and his areas of research interest are data mining, big data, query optimization and soft computing.

Yakubu, Atomsa holds an M.Sc. in Computer Science from the University of Hong Kong, Hong Kong and a B.Sc. Computer from Adamawa State University, Mubi, Nigeria. He is working at Federal University Kashere, Gombe State, Nigeria as academic staff.