Article About Elasticsearch
Abstract: With the ever-increasing demand for data storage, querying and retrieving data from abundant data sources is a tedious and
time-consuming task. Hence, we require a system for querying data that is highly available, has high capacity and can scale out easily
without the need to add more hardware to a single machine. In this paper, we discuss one such heavy-duty full-text search and analytics
engine called Elasticsearch. Elasticsearch is designed to work with various types of data such as structured, unstructured, geospatial,
graphical and numerical data. It is built on top of Lucene and improves on it with additional features. The power of Elasticsearch is
amplified by a number of technologies that provide a visualization platform, a data processing pipeline, monitoring,
machine learning, data shipping etc. Together with Elasticsearch, they are called the Elastic Stack (ELK Stack). A comparison of
Elasticsearch with other recent search engine technologies such as Solr, Sphinx and Azure Search is provided to help
readers understand which technology to choose. Elasticsearch is used in a number of organizations today as a powerful
search engine and has been preferred over databases like MongoDB for querying stored data, both being JSON document-oriented,
distributed datastores. Elasticsearch, however, provides better search capabilities such as full-text search, unlike MongoDB, which is
preferred mainly for CRUD operations. Elasticsearch is also fast compared to its counterparts and comes with real-time
search capabilities with negligible latency, making it viable to analyze billions of documents within a few seconds.
Besides that, it has high throughput, being able to search through and analyze many documents concurrently within a
limited response time. Elasticsearch also deals with the failure of any node in a cluster, and the loss of the shards on it, by replicating
primary shards into a number of replica shards and distributing them across multiple nodes. This distributed nature of Elasticsearch
makes it highly available and robust.
Journal of Research in Science and Engineering (JRSE)
ISSN: 1656-1996 Volume-4, Issue-11, November 2022
document. It is analogous to the table schema of a relational database. In Elasticsearch, there are two
basic approaches to mapping: dynamic and explicit. With explicit mapping, users define the fields and
their data types themselves. To make Elasticsearch easier to use, dynamic mapping was introduced, which
creates a field mapping automatically when a new field is encountered whose mapping was not specified
explicitly by the user. For example, if a field with a string value is created, Elasticsearch maps that
field to the text data type. Explicit and dynamic mapping can be combined, which makes mapping in
Elasticsearch flexible.
5) Node: A node is an instance of Elasticsearch that is responsible for storing data and indexing it. A
collection of nodes makes up a cluster. All nodes in a cluster have information about every other node,
and they forward requests from the client to the appropriate node. Nodes can take up a number of
different roles, which are either specified by the user or set by default. The roles are: master, data,
client, tribe, ingestion and machine learning. The master nodes are responsible for overseeing the
management of the cluster and configuring it, for example by adding and removing nodes. Data nodes store
data and carry out operations on that data. The client nodes act as mediators that balance the request
load by forwarding cluster-related requests to the master node and data-related requests to the data
nodes. The tribe nodes connect to one or more clusters, making them appear as one big cluster, and
perform read and write operations across all of their nodes. Ingestion nodes are used for preprocessing
documents before indexing them, and the machine learning nodes carry out machine learning tasks.
6) Cluster: A cluster in Elasticsearch is a group of one or more nodes that work together. Elasticsearch
is distributed in nature: nodes can be added and grouped into clusters, thereby reducing the load on a
single node and dispersing it amongst multiple nodes.
7) Shard: An index in Elasticsearch[9] is divided into a number of shards, which are then distributed
across a number of nodes; hence an index is a logical grouping of one or more shards. The documents in an
index are distributed across multiple shards, and these shards are in turn distributed across multiple
nodes. When the load on a particular node grows, Elasticsearch migrates some shards from that node to
other nodes, thereby balancing the data load. There are two kinds of shards: primary and replica. When an
index is created, the user can specify the number of shards it should have, i.e. the number of primary
shards and the number of replica shards for each primary shard. The data of the index is divided amongst
the primary shards, so the primary shards hold the original copy of the data, while the replica shards of
each primary shard hold a copy of that primary shard's data, thereby increasing the redundancy of data
and preventing loss of data due to failure of a node. Every index in Elasticsearch is composed of at
least one primary shard, as it holds the original copy of the data. Figure 1 shows an index in
Elasticsearch which is divided into three shards. The shards are then dispersed across three nodes with
one replica shard each. Even if one of the nodes in the Elasticsearch cluster goes down, at least one
copy of each of the shards remains, thereby making it a highly available technology.

Figure 1: Primary and Replica Shards in Elasticsearch

3. Text Search in Elasticsearch

Analysis, or text analysis, is a process that is applicable to text fields or values[10]. In
Elasticsearch, text has to be processed before being stored; this processing happens in the analysis
phase. Text values are analyzed by an analyzer when documents are indexed, and the result is stored in
data structures that make searching more efficient.

An analyzer consists of three building blocks: character filters, a tokenizer and token filters. A
character filter receives the original text and transforms it by adding, removing or changing characters.
An analyzer may have zero or more character filters, which are applied in the order specified by the
user. An analyzer can contain only one tokenizer, which splits a string into tokens. Some characters,
such as punctuation and white space, may be stripped as a result of tokenization, for example when
splitting a sentence into words. The tokenizer also records the character offsets of each token[11].
Token filters receive the output of the tokenizer as input and add, remove or modify tokens. As with
character filters, an analyzer may contain zero or more token filters, which are applied in the order in
which they are specified; an example is the lowercase filter, which converts all the characters in each
token to lowercase. Elasticsearch comes with a number of built-in analyzers, character filters,
tokenizers and token filters. The user can combine character filters, tokenizers and token filters in
different ways to build a custom analyzer. The analyzer used by Elasticsearch by default is the standard
analyzer, which has no character filter but uses the standard tokenizer, which splits text on white space
and punctuation, together with a lowercase token filter.
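The three analyzer stages described above can be illustrated with a minimal Python sketch. This is not Elasticsearch code and does not use any Elasticsearch API: the names `strip_html`, `word_tokenizer`, `lowercase` and `analyze` are hypothetical helpers chosen for illustration, and the regular-expression tokenizer is only a rough stand-in for the standard tokenizer. The offsets recorded here refer to the text after character filtering, which is a simplification.

```python
import re
from typing import Callable, List, NamedTuple


class Token(NamedTuple):
    text: str
    start: int  # character offset where the token begins
    end: int    # character offset just past the token's last character


def strip_html(text: str) -> str:
    """Character filter (hypothetical): replace each HTML-like tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def word_tokenizer(text: str) -> List[Token]:
    """Tokenizer (hypothetical): emit runs of word characters with their offsets.

    White space and punctuation are dropped because they never match.
    """
    return [Token(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def lowercase(tokens: List[Token]) -> List[Token]:
    """Token filter (hypothetical): lowercase each token, keeping its offsets."""
    return [t._replace(text=t.text.lower()) for t in tokens]


def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[Token]],
    token_filters: List[Callable[[List[Token]], List[Token]]],
) -> List[Token]:
    """Run zero or more character filters, exactly one tokenizer,
    then zero or more token filters, each in the order given."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    for token in analyze("<b>Quick</b> Brown-Fox!", [strip_html], word_tokenizer, [lowercase]):
        print(token)
```

Passing empty lists for the character filters and token filters leaves only the tokenizer, which mirrors the rule that an analyzer has exactly one tokenizer while the other two stages are optional.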
The tokens from the analyzer are stored in data structures, with a different data structure being used
for different fields depending on the field's data type. Using several data structures for storing the
field values instead of one ensures efficient data access. For text fields, these data structures are
inverted indices, which speed up full-text searches. An inverted index is a mapping between terms, i.e.
the tokens from the analyzer, and the documents containing them. Terms in an inverted index are sorted
alphabetically. Inverted indices also contain information used for relevance scoring, which, when
performing a full-text search, helps return documents based on how well they match the search text. An
inverted index is created for each text field in the documents, while fields of data types other than
text use different data structures, such as BKD trees for numerical and date data types.

After indexing, the documents are mapped. Mapping is used to define the structure of documents in
Elasticsearch and configure how they are indexed. Mapping is done by specifying the properties of the
fields of the documents and their data types, i.e. it is equivalent to the schema of a table in a
relational database. Mapping can be done either explicitly by the user or implicitly by Elasticsearch.

4. Features

Elasticsearch exhibits the following features:
1) Highly Scalable - Elasticsearch can scale horizontally up to a few petabytes of structured as well as
unstructured data. The storage capacity can be increased by adding more nodes to the cluster. Though
there is no upper limit on the size, the preferred limit per shard is 50 GB.
2) Highly Secure - All the data stored in Elasticsearch can be password-protected to prevent unauthorized
users from accessing it. Elasticsearch also provides various other security mechanisms such as role-based
access control, attribute-based access control, audit logging, IP filtering and communication encryption.
3) Highly Available - Elasticsearch is based on the concept of clusters, where a cluster is a collection
of one or more nodes (servers) which together hold all of the data and provide combined indexing and
search functionality across all nodes. Clusters in Elasticsearch feature primary and replica shards to
provide failover in case a node goes down. When a node containing a primary shard shuts down or goes
offline due to some problem, a replica shard is promoted to primary, thereby making the whole cluster
highly available.
4) Full-Text Search Engine - Traditional SQL database management systems are not designed for full-text
searches against vast quantities of data. Elasticsearch, in contrast, offers one of the most powerful
full-text search capabilities and can perform and combine various types of searches over structured,
unstructured, geo and metric data.
5) Analytics - Other than being used to build a complex search engine using its text querying
capabilities, Elasticsearch can also be used to query structured data such as numbers and to aggregate
data, and is hence used as an analytics platform. The data can be queried and analyzed visually with the
help of line charts and pie charts.
6) Index Management - Elasticsearch provides a suite of features to monitor and manage indices. Index
State Management provides an automated system for defining custom policies and for optimizing, monitoring
and managing indices. It eliminates the need to rely on external systems to periodically execute these
tasks. The Index State Management plugin from Kibana lets users monitor indices and apply custom
policies, with criteria based on index age, size and number of documents.

5. The Elastic Stack

The Elastic Stack consists of technologies developed and maintained by Elastic NV, the company behind
Elasticsearch. Elasticsearch is the heart of the Elastic Stack, i.e. most of the technologies that are
part of the Elastic Stack interact with Elasticsearch and have a strong synergy between them; hence they
are frequently used together. The products that are part of the Elastic Stack are:
1) Kibana - It is an analytics and visualization platform that lets us easily visualize data from
Elasticsearch and analyze it, which helps us understand it better[12]. It is comparable to a dashboard or
an interface where visualizations of data can be created, for example maps
(coordinate map, region map) and charts (pie, line, area and bar chart)[13].
2) Logstash - It is a free, open source and lightweight server-side data processing pipeline that
consumes data from various sources and sends it to Elasticsearch[14]. The data that Logstash receives is
handled as events, such as log file entries, e-commerce orders, customer details, chat messages etc.
These events are then processed by Logstash and shipped off to one or more destinations such as
Elasticsearch, a Kafka queue, an HTTP endpoint etc. A Logstash pipeline consists of three stages:
i) Inputs ii) Filters iii) Outputs. Each stage makes use of plugins. The input plugins are how Logstash
receives events, the filter plugins are how Logstash processes them, and the output plugins are where the
processed events are sent; these destinations are formally called stashes. Multiple pipelines can be run
under the same Logstash instance, and it is horizontally scalable.
3) X-Pack - It is an extension to Elasticsearch and Kibana that adds extra features such as security,
monitoring, machine learning, alerting and reporting. For security, it provides authentication by
integrating with authentication providers and helps control permissions with fine-grained authorization.
It helps monitor the performance of the Elastic Stack, such as CPU and memory usage, disk space etc., by
providing insight into how it is running. Alerting builds on monitoring and notifies the user if
something erroneous happens, for example if the web server's CPU usage exceeds a certain limit or if the
number of application errors reaches a threshold. Reporting helps export Kibana visualizations and data
to another file format such as PDF or CSV. X-Pack also enables machine learning in Kibana for anomaly
detection and for forecasting future values of data; the functionality is provided by X-Pack, whereas the
interface is provided by Kibana. One of the most significant features is Graph, which helps analyze
relationships in data, uses the relevance feature of Elasticsearch to determine which parts of the data
are related, and provides a plugin for Kibana to visualize data as an interactive graph. Graph also
exposes an API that helps integrate this ability into applications.
4) Beats - Beats is a collection of data shippers. They are lightweight agents that can be installed on
servers and send data to Logstash or Elasticsearch. There are a number of data shippers, such as
Filebeat, which collects log files such as access logs and error logs and sends the log entries to
Logstash or Elasticsearch; Metricbeat, which collects system and service metrics such as memory and CPU
usage; Packetbeat, which collects network data (HTTP requests or database transactions); Auditbeat, which
collects audit data from Linux; Winlogbeat, which collects Windows event logs; etc.

Comparison of Elasticsearch and Other Open Source Search Engines

1) Comparison between Elasticsearch and Solr -
Both Elasticsearch and Solr[15] are open source search engines that are built on top of Lucene, but they
vary in terms of scalability, performance, optimized query execution, cluster management and shard
placement[16]. Shard placement in Solr is static in nature and usually requires manual work for migrating
shards, whereas in Elasticsearch shard placement is dynamic and the migration of shards is automated
based on cluster state. SolrCloud supports splitting of existing shards but, unlike Elasticsearch, not
shrinking them. Cluster coordination in Elasticsearch uses the built-in Zen discovery module, whereas
SolrCloud requires Apache ZooKeeper, an additional service. In case of a shard or node failure,
Elasticsearch rebalances shards itself and rarely requires manual intervention[17]. In SolrCloud,
rebalancing is complex and hard to manage. Both Solr and Elasticsearch support custom document routing.

2) Comparison between Elasticsearch and Sphinx -
Both Elasticsearch and Sphinx are well-known open source search engines, but they differ in features such
as memory usage and scalability. Elasticsearch consumes more memory, hence it is scaled over multiple
nodes, whereas Sphinx consumes less memory than other search engines. Sphinx works more tightly with
structured data associated with relational databases such as MySQL, whereas Elasticsearch can handle
various types of data, from structured and unstructured to graphical data. Sphinx cannot index document
types such as pdf, ppt and doc directly. To handle text documents in various formats, the textual
contents are imported into a database, or into an XML format that Sphinx can understand, and processed
afterwards. Sphinx is written in C++, whereas Elasticsearch is written in Java. The Elasticsearch engine
allows aggregation queries to be executed on search indices. In addition to optimized querying, the
Elasticsearch engine speeds up the generation of layered navigation blocks and of lists of products
filtered by some attributes. The Sphinx search engine, however, does not allow aggregation queries.

3) Comparison between Elasticsearch and Azure Search
Azure Search[18] is a cloud-based service that provides search as a service for mobile and web
application development. Azure Search is powered by Artificial Intelligence (AI) for easy identification,
analysis and exploration of data. It helps reduce the complexity of data ingestion and index creation
using its own storage solutions, and offers index functionality. Azure Search supports a number of
languages, whereas Elasticsearch supports a large number of data types. Elasticsearch has in-memory
capabilities using Memcached and Redis integration, whereas Azure Search does not support in-memory
capabilities. Elasticsearch supports eventual consistency, whereas Azure Search supports immediate
consistency.

6. Conclusion

Searching is one of the key features of Elasticsearch. It can be used to search documents based on
diverse constraints and offers near-real-time search. It uses inverted indices for searching, which makes
the process very fast. Elasticsearch lets users set scoring schemes for text searching so that documents
are returned in order of their relevance scores. Its architecture
is distributed in nature; hence Elasticsearch has a failure recovery mechanism that ensures that data is
not lost even if some node in the cluster fails, which is what makes it highly available.

References

[1] R. Vidhya, G. Vadivu, "Research Document Search Using Elasticsearch", Indian Journal of Science and
Technology, Vol 9(37), DOI: 10.17485/ijst/2016/v9i37/102108, September 2016
[2] Mitra, M. J. (2016), "The Rise of Elastic Stack" (November),
https://doi.org/10.13140/RG.2.2.17596.03203
[3] Cornelia Gyorodi, Robert Gyorodi, George Pecherle, and Andrada Olah, "A comparative study: Mongodb
vs. Mysql", 13th International Conference on Engineering of Modern Electric Systems (EMES), Oradea,
Romania, 11–12 June 2015; pp. 1–6
[4] Clinton Gormley and Zachary Tong, "Elasticsearch: The Definitive Guide: A Distributed Real-Time
Search and Analytics Engine", O'Reilly, January 2015
[5] Sematext Blog, "Elastic Search: Distributed, Lucene-based Search Engine". Available from:
https://sematext.com/blog/2010/05/03/elastic-search-distributed-lucene/
[6] Pankaj Sareen, P.K., "NoSQL Database and its Comparison with SQL Database", Int. J. Comput. Sci.
Commun. Netw., 2015, 5, 293–298
[7] Oleksii Kononenko, Olga Baysal, Reid Holmes, Michael W. Godfrey, "Mining modern repositories with
elasticsearch", ICSE '14: 36th International Conference on Software Engineering, Association for
Computing Machinery, New York, NY, United States
[8] Li, X.-M., & Wang, Y., "Design and Implementation of an Indexing Method Based on Fields for
Elasticsearch", 2015 Fifth International Conference on Instrumentation and Measurement, Computer,
Communication and Control (IMCCC). doi:10.1109/imccc.2015.137
[9] Kalyani, D., & Mehta, D. (2017). Paper on searching and indexing using elasticsearch. Int. J. Eng.
Comput. Sci, 6(6), 21824-21829.
[10] Voit A., Stankus A., Magomedov Sh., Ivanova I., "Big data processing for full-text search and
visualization with elasticsearch", International Journal of Advanced Computer Science and Applications,
2017, Vol. 8, No. 12, pp. 76-83. DOI: 10.14569/IJACSA.2017.081211
[11] Subhani Shaik, Nallamothu Naga Malleswara Rao, "Enhancement of Searching and analyzing the document
using Elastic Search", International Research Journal of Engineering and Technology (IRJET), Volume: 04
Issue: 11 | Nov -2017
[12] Neel Shah, Darryl Willick, Vijay Mago, "A framework for social media data analytics using
Elasticsearch and Kibana", Wireless Networks. doi:10.1007/s11276-018-01896-2, 2018
[13] Shah, N., Willick, D., Mago, V., "A framework for social media data analytics using Elasticsearch
and Kibana", Wireless Networks (2018, in press)
[14] Marcin Bajer, "Building an IoT Data Hub with Elasticsearch, Logstash and Kibana", 5th International
Conference on Future Internet of Things and Cloud Workshops, 2017
[15] Vikash Kumar and P.N. Barwal, "Implementation of Highly Optimized Search Engine Using Solr",
International Journal of Innovative Research in Science, Engineering and Technology, Vol. 5, Issue 3,
March 2016
[16] Nikola Luburić, Dragan Ivanović, "Comparing Apache Solr and Elasticsearch search servers", 6th
International Conference on Information Society and Technology ICIST 2016
[17] M. A. Akca, T. Aydoğan, and M. İlkuçar, "An Analysis on the Comparison of the Performance and
Configuration Features of Big Data Tools Solr and Elasticsearch", IJISAE, pp. 8-12, Dec. 2016.
[18] Pratiksha P. Nikam and Ranjeetsingh S. Suryawanshi, "Microsoft Windows Azure: Developing
Applications for Highly Available Storage of Cloud Service", International Journal of Science and
Research (IJSR), ISSN (Online): 2319-7064, Index Copernicus Value (2013): 6.14 | Impact Factor (2014):
5.611