A Framework For Social Media Data Analytics Using Elasticsearch and Kibana
https://doi.org/10.1007/s11276-018-01896-2
Abstract
Real-time online data processing is quickly becoming an essential tool in the analysis of social media for political trends, advertising, public health awareness programs and policy making. Traditionally, processes associated with offline analysis are productive and efficient only when the data collection is a one-time process. Currently, cutting-edge research requires real-time data analysis that comes with a set of challenges, particularly the efficiency of continuous data fetching within the context of present NoSQL and relational databases. In this paper, we demonstrate a solution to effectively address the challenges of real-time analysis using a configurable Elasticsearch search engine. We use a distributed database architecture, pre-built indexing and a standardized Elasticsearch framework for large-scale text mining. The results from the query engine are visualized in almost real-time.
1180 Wireless Networks (2022) 28:1179–1187
Tools that support the management of large data sets and real-time data fetching include relational (MySQL, Oracle Database, SQLite), graph (Neo4j, Oracle Spatial) and NoSQL (MongoDB, IBM Domino, Apache CouchDB) databases. A limiting factor common to all of these database types is the lack of support for full-text searches in real-time. While NoSQL is functional for full-text searching, it lacks reliability when compared to relational database models [3]. Traditional databases require that the data is first uploaded and that the administrator then actively decides which data should be indexed, which adds one more layer of processing, making them infeasible for real-time analysis. Elasticsearch provides a solution to these limiting factors [3] by providing a highly efficient data fetching and real-time analysis system that:

• Performs pre-indexing before storing the data to avoid the need to fetch and query specific data in real-time;
• Requires limited resources and computing power in relation to traditional solutions; and
• Provides a system that is distributed and easy to scale.

The capacity for Elasticsearch to contribute to high-efficiency, real-time data analysis is enhanced through a standardized configuration process, shard size management and standardization of the data before upload into Elasticsearch, and is demonstrated through a discussion of the working architecture as well as a real-time visualization of social media data collected between December 2017 and May 2018, a repository of over 1 billion twitter data points.

1.1 Key contributions

• Optimizing and standardizing twitter data for Elasticsearch
• Creating a configuration file and choosing the optimal shard size
• Demonstrating the real-time visualization of a very large scale social media data set

2 Architecture for real-time analysis and storage

2.1 Elasticsearch

Elasticsearch was started in the year 2004 as an open source project called Compass, which was based on Apache Lucene [11]. Elasticsearch is a distributed and scalable full-text search engine written in Java that is stable and platform independent. These features, combined with requirement-specific flexibility and easy expansion options, are helpful for real-time big data analysis [12]. We will discuss some of the general functions of Elasticsearch to provide context for the Elasticsearch configuration, data standardization and shard management procedure resulting from this research.

2.2 Abstract view

Figure 1 illustrates the framework for real-time analysis of very large scale data based on Elasticsearch and Kibana [13]. In the first step, the Twitter API is used for scraping twitter data (approximately 1400 tweets per minute), which is stored in a MongoDB database installed on a Network Attached Storage (NAS) with a capacity of 16 TB. The twitter data is transferred to preprocessing units, which handle the data and transfer it to High Performance Computing (HPC) infrastructure in almost real-time. As traditional databases, including MongoDB, are not efficient enough to handle real-time queries, we transfer the processing and analysis of data to Elasticsearch, which is implemented via HPC lab resources. Before uploading the data, we standardize the twitter object for Elasticsearch and use multithreading to upload the data for better real-time performance and to shorten the gap between receiving and processing data. When a user needs any data, a query is sent to Elasticsearch using the Kibana front-end. Elasticsearch processes that query and sends the query result object (JSON format) to Kibana, which shows the result to the user.

Within the general functioning of the search engine, Elasticsearch uses a running instance called a node, which can take on one or more roles, including master or data node (see Sect. 2.1, Fig. 2). Dataset clusters within Elasticsearch require at least one master and one data node; however, it is possible for a cluster to consist of a single node, since a node may take on multiple roles. The only data storage format compatible with Elasticsearch is JSON, which therefore requires data mapping to produce functional analysis and visualizations, due to the unstructured format of the twitter data. We observed that reliance on the JSON format makes the system more flexible than MySQL and other RDBMS, but less flexible than MongoDB. While a traditional database such as an RDBMS uses tables to store the data, MongoDB uses the BSON (JSON-like) format, and Elasticsearch uses an inverted index via the Apache Lucene architecture to store the data [11]. A typical index in Elasticsearch is a collection of documents with different properties that have been organized through user-defined mapping that outlines document types and fields for different data sources, similar to a table in an SQL database. The index is then split into shards housed in multiple nodes, where a shard is a part of an index distributed on different nodes. Within the Elasticsearch framework, the inverted index allows a more categorical storage of big data sets
within nodes and shards so that real-time search queries are more efficient. Elasticsearch uses a RESTful API to communicate with users; see Table 1 for a basic architecture comparison. Additionally, there are different libraries, such as the Elasticsearch clients in Python [14] and Java [15], for better integration.

Table 1 Comparison between Elasticsearch and RDBMS basic architecture

Elasticsearch    RDBMS
Index            Database
Mapping          Table
Document         Tuple

2.2.1 Backbone

While Elasticsearch is a powerful tool, a model is required to optimize functionality for the purpose of real-time big data analysis specific to social media. The purpose of this research is to provide (1) a specific configuration file to optimize the organization of the data set, (2) an optimized shard size for maximum efficiency in storage and processing, and (3) a standardized structure for data fields present within Twitter to eliminate over-processing of irrelevant information. When the data is stored in Elasticsearch, it stores the data in an index first, and then the index data is stored as an inverted index using an automatic tokenizer. When we search in Elasticsearch, we get a 'snapshot' of the data, which means that Elasticsearch does not require the hosting of actual content but instead links to documents stored within a node to provide a result through
the inverted index. These results are not real data but a representation of the query's linkages to all associated documents stored in each node. As a component of this project, the following configuration file was developed and can be replicated in Elasticsearch on any HPC by editing the config files as per the number of nodes and the capacity of the server. Table 2 describes the basic configuration file for Elasticsearch.

Here, the name of the cluster is dslab, and a cluster name is necessary even if only a single node is present. As Elasticsearch is a distributed database, where one or more nodes work as masters and the others as data nodes, this parameter is used to interconnect all the nodes in the cluster. We can create numerous clusters on the same hardware using different instances of Elasticsearch and different configuration files.

Table 3 is an example of the configuration file features for any Elasticsearch node. In every node of the distributed Elasticsearch we have to configure the same file in each and every instance. When the data is stored, we use the index to store a specific type of data, similar to a dataset in MySQL. The performance of Elasticsearch is based on the mapping of the index and how we size the shards of the data set. The formula to decide the number of shards is given in Eq. 1.

Number of shards = (Size of index in GB) / 50    (1)

The reason behind using 50 GB as the shard size is the architecture of Elasticsearch. The architecture supports a 32 GB index size and 32 GB of cache memory, so ideally the shard's memory should be less than 64 GB, and through experimentation we observed that the best results are achieved at a shard size of 50 GB.

2.3 Kibana: visualization

In addition to Elasticsearch being efficient for real-time analysis, extended plugins such as Kibana [13] and Logstash [16] make it convenient to produce functional representations of big data in real-time. Kibana is part of the Elastic Stack and is freely available under an open source license. It has multiple standard visualizations available by default and simplifies the process of developing visualizations for end users with a drag-and-drop feature. As Kibana is backed by the Elasticsearch architecture, it functions quickly and is efficient enough for real-time analysis. Finally, it provides the opportunity for graphical interaction in the process of building and handling queries, with an accessible visualization of the cluster health and properties within the database.

3 Social media data analysis

3.1 Configuration of the Elasticsearch

Live social media streaming data is stored in elastic clusters. Each elastic cluster contains 6 nodes, with each node having 2 threads and 12 GB of memory. Within these 6 nodes, one node works as a master and the remaining 5 work as data nodes. The architecture of the elastic cluster is shown in Fig. 2.

3.2 Social media dataset

We used Elasticsearch to analyze 250+ million out of 1 billion tweets scraped between December 2017 and May 2018 using the Twitter API. Since the Twitter API response is in JSON format and contains unstructured and inconsistent data, the sequential collection of all data fields within the tweet JSON object is not guaranteed. Standardization of the data and conversion into a structured format is therefore necessary for Elasticsearch mapping, so that each field of data is present when loaded into the index. To optimize Elasticsearch we changed the storage format of the tweet so that all the data is required to
be at depth level one in JSON format. Table 4 depicts a basic example of restructured data in Elasticsearch.

Table 4 Difference between the normal and updated structure

Original tweet structure:

{
  "Tweet": {
    "User": {
      "Id": ...,
      "Name": ...
    }
  },
  ...
}

Updated structure:

{
  "Id": ...,
  "Name": ...,
  ...
}

As we mentioned previously, the data is stored as an inverted index that is optimized for text searches and is therefore very efficient. For example, if we search for the keyword "pizza" within the context of all tweets (250+ million) in Elasticsearch, it takes 4060 ms (4.06 s) to find a total of 192,118 tweets where the "pizza" keyword is present in the tweet text. Table 5 shows an example of the "pizza" text search query response from Elasticsearch.

Table 5 Search query result of the "pizza" keyword across all tweets in the database

{
  "took": 4060,
  "timed_out": false,
  "_shards": {
    "total": 106,
    "successful": 106,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 192118,
    "max_score": 15.110959,
    "hits": [...]
  }
}

Figure 3a shows a pie chart mapping the geographical distribution by nation of "pizza" tweets, where the United States alone is responsible for 47% of total tweets and countries outside the top five account for 30%, together making up 77% of total tweets. Additionally, the visualization shows that the time taken to perform the query is 13 ms (0.013 s). Figure 3b shows the five most used languages in tweet text related to "pizza": English is used in more than 77% of tweets, Spanish in 12%, Portuguese in third place with 6%, French at 3% and Japanese at 2%. In this instance Elasticsearch took 17 ms for query processing. Figure 3c shows the devices used to tweet, with 38% of tweets coming from the iPhone twitter app; the Android twitter app was used for 29%, twitter web clients for only 11%, and Twitter Lite and TweetDeck combined for around 7%. Other sources account for the remaining 15% of tweets. This query took 11 ms to execute, which is quite reasonable given the structure and amount of data.

The above results demonstrate the efficiency of this data analysis system: all three tasks (fetching the data, performing descriptive analysis and creating graphs) were accomplished in less than 15 s on a database of 250+ million tweets. Clearly, this framework has proven to be efficient for real-time analysis at this scale.
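The depth-one restructuring in Table 4 can be sketched as a generic flattening pass, and the "pizza" search in Table 5 as a standard query DSL body. This is a minimal illustration, not the authors' actual code: the underscore-joined key convention and the `Text` field name are our assumptions, since the paper only shows that nested fields such as the user's Id and Name are promoted to the top level.

```python
def flatten(obj, parent_key="", sep="_"):
    """Recursively flatten a nested tweet object so that every field sits
    at depth level one, as Table 4 requires. Nested keys are joined with
    `sep` to keep the flattened names unique (our convention; the paper
    does not specify how name collisions are avoided)."""
    flat = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

# A hypothetical body for the "pizza" keyword search of Table 5, using the
# standard match query of the Elasticsearch query DSL; the field name
# "Text" is an assumption about the flattened schema.
pizza_query = {"query": {"match": {"Text": "pizza"}}}
# Against a live cluster this would be sent with the Python client [14], e.g.:
#   es.search(index="tweets", body=pizza_query)
```

For the nested example in Table 4, `flatten({"Tweet": {"User": {"Id": 1}}})` yields `{"Tweet_User_Id": 1}`, which can then be bulk-uploaded under a flat mapping.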
123
1184 Wireless Networks (2022) 28:1179–1187
3.4 Limitation
4 Related work
Fig. 4 Partial view of the Kibana dashboard for the twitter analysis
Currently there are very few research studies on frameworks for big data analysis in real-time, although several discuss the application of such practices in manufacturing [20] and gene coding [21]. Some researchers have used an Elasticsearch cluster via a Logstash plugin and MySQL databases for a heterogeneous accounting information system [22]. The data is monitored using a MySQL server before being inserted into Elasticsearch. The researchers observed that there might be an issue of duplication of data and storage space, but the architecture ensures flexibility and modularity for monitoring the system. They chose Elasticsearch as a real-time text search engine, which allows them to search historical data. The Mayo Clinic healthcare system developed a big data hybrid system using Hadoop and Elasticsearch technology. In healthcare, real-time results are essential for effective decision making. Previously, they used a traditional RDBMS to store and process data, but it lacked integration between different platforms and the ability to query and ingest healthcare data in real-time or near real-time. In the Mayo Clinic system, Hadoop is used as a distributed file system and, on top of it, Elasticsearch works as a real-time text search engine. When there is a need for raw data Hadoop is used, and for real-time analysis Elasticsearch is used. Their experimentation showed very promising results: for example, searching 25.2 million HL7 records took just 0.21 s [23].

The DesignSafe web portal by Natural Hazards Engineering Research (NHER) analyzes and shares experimental data in real-time with researchers across the world. The users of their system send large amounts of data, which are stored in a distributed NFS. During the preprocessing of the data, which includes string analysis and basic cleaning, they index the data and make it compatible with Elasticsearch. This model allows users in different locations to query, in real-time, the same experimental data computed in different parts of the world. All of these environments need to be correctly configured as per the data and the requirements [24].

5 Conclusion

Elasticsearch provides a functional system to store, pre-index, search and query very large scale data in real-time. In particular, the capability of expanding the cluster size without stopping service, as per the user's requirements, makes it suitable for this application. This research provides insights on how to standardize and configure the processes of Elasticsearch, which results in increased analysis efficiency. To demonstrate the functionality and interactivity for users, the Kibana plugin was used as an interface. In conclusion, a proper configuration of Elasticsearch and Kibana makes real-time analysis of large scale data efficient and can help policy makers see the results instantaneously, in an accessible format that allows for decision making.

Acknowledgements This research is funded by the NSERC Discovery Grant; computing resources are provided by the High Performance Computing (HPC) Lab and Department of Computer Science
at Lakehead University, Canada. The authors are grateful to Gaurav Sharma for initially setting up the data collection stream, Salimur Choudhury for providing insight on the data analysis, and Andrew Heppner for reviewing and editing drafts.

References

1. Cervellini, P., Menezes, A. G., & Mago, V. K. (2016). Finding trendsetters on yelp dataset. In 2016 IEEE symposium series on computational intelligence (SSCI) (pp. 1–7). IEEE.
2. Belyi, E., Giabbanelli, P. J., Patel, I., Balabhadrapathruni, N. H., Abdallah, A. B., Hameed, W., et al. (2016). Combining association rule mining and network analysis for pharmacosurveillance. The Journal of Supercomputing, 72(5), 2014–2034.
3. Kononenko, O., Baysal, O., Holmes, R., & Godfrey, M. W. (2014). Mining modern repositories with Elasticsearch. In Proceedings of the 11th working conference on mining software repositories (pp. 328–331). ACM.
4. Liu, Q., Kumar, S., & Mago, V. (2017). Safernet: Safe transportation routing in the era of internet of vehicles and mobile crowd sensing. In 2017 14th IEEE annual consumer communications and networking conference (CCNC) (pp. 299–304). IEEE.
5. Kim, M. G., & Koh, J. H. (2016). Recent research trends for geospatial information explored by twitter data. Spatial Information Research, 24(2), 65–73.
6. Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A., & Buyya, R. (2015). Big data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing, 79, 3–15.
7. Bösch, C., Hartel, P., Jonker, W., & Peter, A. (2014). A survey of provably secure searchable encryption. ACM Computing Surveys, 47(2), 18:1–18:51. https://doi.org/10.1145/2636328.
8. Kumar, P., Kumar, P., Zaidi, N., & Rathore, V. S. (2018). Analysis and comparative exploration of elastic search, Mongodb and Hadoop big data processing. In Soft computing: Theories and applications (pp. 605–615). New York: Springer.
9. Cea, D., Nin, J., Tous, R., Torres, J., & Ayguadé, E. (2014). Towards the cloudification of the social networks analytics. In Modeling decisions for artificial intelligence (pp. 192–203). New York: Springer.
10. Bai, J. (2013). Feasibility analysis of big log data real time search based on hbase and elasticsearch. In 2013 ninth international conference on natural computation (ICNC) (pp. 1166–1170). IEEE.
11. Elasticsearch-elastic.co. Retrieved April 30, 2018, from https://www.elastic.co/guide/en/elasticsearch/reference/6.2/index.html.
12. Gormley, C., & Tong, Z. (2015). Elasticsearch: The definitive guide: A distributed real-time search and analytics engine. Sebastopol: O'Reilly Media, Inc.
13. Your Window into the Elastic Stack. Retrieved April 30, 2018, from https://www.elastic.co/products/kibana.
14. Python Elasticsearch Client. Retrieved April 30, 2018, from https://elasticsearch-py.readthedocs.io/en/master/.
15. Java Elasticsearch library-Elastic. Retrieved April 30, 2018, from https://www.elastic.co/guide/en/Elasticsearch/client/java-api/6.2/index.html.
16. Getting Started with Logstash. Retrieved April 30, 2018, from https://www.elastic.co/guide/en/logstash/current/getting-started-with-logstash.html.
17. Yang, F., Tschetter, E., Léauté, X., Ray, N., Merlino, G., & Ganguli, D. (2014). Druid: A real-time analytical data store. In Proceedings of the 2014 ACM SIGMOD international conference on management of data (pp. 157–168). ACM.
18. Burkitt, K. J., Dowling, E. G., & Branon, T. R. (2014). System and method for real-time processing, storage, indexing, and delivery of segmented video. US Patent 8,769,576.
19. Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of big data on cloud computing: Review and open research issues. Information Systems, 47, 98–115.
20. Yang, H., Park, M., Cho, M., Song, M., & Kim, S. (2014). A system architecture for manufacturing process analysis based on big data and process mining techniques. In 2014 IEEE international conference on big data (pp. 1024–1029). IEEE.
21. Stelzer, G., Plaschkes, I., Oz-Levi, D., Alkelai, A., Olender, T., Zimmerman, S., et al. (2016). Varelect: The phenotype-based variation prioritizer of the genecards suite. BMC Genomics, 17(2), 444.
22. Bagnasco, S., Berzano, D., Guarise, A., Lusso, S., Masera, M., & Vallero, S. (2015). Monitoring of IAAS and scientific applications on the cloud using the elasticsearch ecosystem. In Journal of physics: Conference series (Vol. 608, p. 012016). Bristol: IOP Publishing.
23. Chen, D., Chen, Y., Brownlow, B. N., Kanjamala, P. P., Arredondo, C. A. G., Radspinner, B. L., et al. (2017). Real-time or near real-time persisting daily healthcare data into hdfs and elasticsearch index inside a big data platform. IEEE Transactions on Industrial Informatics, 13(2), 595–606.
24. Coronel, J. B., & Mock, S. (2017). Designsafe: Using elasticsearch to share and search data on a science web portal. In Proceedings of the practice and experience in advanced research computing 2017 on sustainability, success and impact (p. 25). ACM.

Neel Shah is a graduate student at Lakehead University, Canada. Currently, he is working on analyzing social media data to gain insight into Canadian healthy behaviours. He is an active open source coder and maintains two open-source Python libraries. His core areas of interest are deep learning and data science.

Darryl Willick received the B.Sc. (1988) and M.Sc. (1990) degrees in Computational Science from the University of Saskatchewan, Canada. Throughout his career he has worked in the areas of High Performance Computing, Visualization, System Administration, and Cyber Security. Currently he is a Technology Security Specialist/HPCC Analyst at Lakehead University, Canada.

Vijay Mago is an Associate Professor in the Department of Computer Science at Lakehead University in Ontario, Canada, where he teaches and conducts research in areas including big data analytics, machine learning, natural language processing, artificial intelligence, medical decision making and Bayesian intelligence. He received his Ph.D. in Computer Science from Panjab University, India in 2010. In 2011 he joined the Modelling of Complex Social Systems program at the IRMACS Centre of Simon Fraser University. He has served on the program committees of many international conferences and workshops. In 2017, he joined the Technical Investment Strategy Advisory Committee for Compute Ontario. He has published extensively (more than 50 peer-reviewed articles) on new methodologies based on soft computing and artificial intelligence techniques to tackle complex systemic problems such as homelessness, obesity, and crime. He currently serves as an associate editor for IEEE Access and BMC Medical Informatics and Decision Making and as co-editor for the Journal of Intelligent Systems.