Elasticsearch Search Engine | An introduction
Last Updated: 10 Feb, 2023
Elasticsearch is a full-text search and analytics engine based on Apache Lucene. It makes it easier to aggregate data from multiple sources and to run unstructured queries, such as fuzzy searches, on the stored data. It stores data as documents, similar to how MongoDB does it, and serializes them as JSON. This gives it a non-relational nature, so it can also be used as a NoSQL database. A typical Elasticsearch document looks like:
{
"first_name": "Divij",
"last_name": "Sehgal",
"email": "[email protected]",
"dob": "04-11-1995",
"city": "Mumbai",
"state": "Maharashtra",
"country": "India",
"occupation": "Software Engineer"
}
Some key characteristics of Elasticsearch:
- It is distributed and horizontally scalable: more Elasticsearch instances can be added to a cluster as the need arises, instead of increasing the capacity of a single machine running one instance.
- It is RESTful and API-centric. Its operations are accessible over HTTP through a REST API, so it can be integrated into almost any application. Client libraries are also available for many programming languages, so most operations can be performed through library calls that handle communication with the engine, without calling the API by hand.
- It supports CRUD operations (Create, Read, Update, Delete) on data in persistent storage. These are similar to the CRUD operations offered by relational databases and are performed through the HTTP interface of the REST API.
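As a rough illustration of the CRUD operations described above, the sketch below builds (but does not send) one HTTP request per operation using only the Python standard library. The base URL assumes a local node on the default port 9200, and the index name `people`, the document id `1`, and the helper `build_request` are hypothetical names chosen for this example. The paths follow the document APIs of recent Elasticsearch versions; older versions may differ.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node on the default HTTP port
INDEX = "people"                    # hypothetical index name


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the Elasticsearch REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


doc = {"first_name": "Divij", "last_name": "Sehgal", "city": "Mumbai"}

crud_requests = {
    # Create: index the document under id 1
    "create": build_request("PUT", f"/{INDEX}/_doc/1", doc),
    # Read: fetch the document by id
    "read": build_request("GET", f"/{INDEX}/_doc/1"),
    # Update: partial update that changes a single field
    "update": build_request("POST", f"/{INDEX}/_update/1", {"doc": {"city": "Pune"}}),
    # Delete: remove the document by id
    "delete": build_request("DELETE", f"/{INDEX}/_doc/1"),
}

for name, req in crud_requests.items():
    print(f"{name:<6} {req.get_method():<6} {req.full_url}")

# To send a request against a running node:
# with urllib.request.urlopen(crud_requests["read"]) as resp:
#     print(json.load(resp))
```

In practice, an official client library would usually replace the hand-built requests; the point here is only that each CRUD operation maps to a plain HTTP call.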
Where do we use Elasticsearch?
Elasticsearch is a good fit for:
- Storing and operating on unstructured or semi-structured data whose structure may change often. Because an index does not need a predefined schema, adding a new field carries none of the overhead of adding a column to a relational table. When incoming documents contain a new field, Elasticsearch adds it to the index and makes it available to further operations.
- Full-text searches: Each document is scored for relevance by correlating the search terms with the document content, so results can be ranked by how well they match the query. Scoring was traditionally based on TF-IDF; since version 5.0 the default similarity is BM25, a related measure. Fuzzy searches additionally match terms that are close to, but not exactly the same as, the search terms.
- Storing and analyzing logs generated by disparate systems. Visualization tools such as Kibana can be used to build aggregations and visualizations in real time from the collected data.
- Time-series analysis, since it can extract metrics from incoming data in near real time.
- Infrastructure monitoring in CI/CD pipelines.
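To make the idea of relevance ranking more concrete, here is a minimal sketch of textbook TF-IDF scoring in plain Python. It is not the formula Lucene or Elasticsearch actually uses, and the sample log lines, the `+ 1.0` smoothing term, and the function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf_rank(query, docs):
    """Return (doc_id, score) pairs sorted by descending textbook TF-IDF score."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for term in tokenize(query):
            tf = counts[term]  # term frequency in this document
            if tf == 0:
                continue
            # document frequency: number of documents containing the term
            df = sum(1 for c in term_counts.values() if term in c)
            # +1.0 keeps terms that occur in every document from scoring zero
            idf = math.log(n_docs / df) + 1.0
            score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical log lines standing in for indexed documents
logs = {
    1: "disk usage high on node one",
    2: "disk failure: disk replaced on node two",
    3: "network latency high",
}

for doc_id, score in tf_idf_rank("disk failure", logs):
    print(doc_id, round(score, 3))
```

Document 2 ranks first because it contains both query terms and mentions "disk" twice; document 3 contains neither term and is left out of the results.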
Elasticsearch Concepts
Elasticsearch works on a concept known as inverted indexing, which comes from the Apache Lucene library mentioned above. An inverted index is similar to the index at the back of a book, which lists the pages on which each important term is discussed. It makes it easier to resolve a query to the documents it could relate to, based on the keywords in the query, and speeds up retrieval by limiting the set of documents that must be considered. Let's take the following four Game of Thrones dialogues:
- "Winter is coming."
- "A mind needs books as a sword needs a whetstone, if it is to keep its edge."
- "Every flight begins with a fall."
- "Words can accomplish what swords cannot."
Consider each of these dialogues as a single document, i.e., each document has a structure like:
{
"dialogue": "....."
}
After some simple text processing (lowercasing the text and removing punctuation), we can construct the "inverted index" as follows:
Term | Frequency | Documents |
---|---|---|
a | 4 | 2, 3 |
accomplish | 1 | 4 |
as | 1 | 2 |
begins | 1 | 3 |
books | 1 | 2 |
can | 1 | 4 |
cannot | 1 | 4 |
coming | 1 | 1 |
edge | 1 | 2 |
every | 1 | 3 |
fall | 1 | 3 |
flight | 1 | 3 |
if | 1 | 2 |
is | 2 | 1, 2 |
it | 1 | 2 |
its | 1 | 2 |
keep | 1 | 2 |
mind | 1 | 2 |
needs | 2 | 2 |
sword | 1 | 2 |
swords | 1 | 4 |
to | 1 | 2 |
what | 1 | 4 |
whetstone | 1 | 2 |
winter | 1 | 1 |
with | 1 | 3 |
words | 1 | 4 |
- The first two columns form what is called the Dictionary. This is where Elasticsearch looks up the search terms to find out which documents could be relevant to the current search.
- The third column is referred to as the Postings. It links each term to the documents in which it appears.
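The construction of the dictionary and postings can be sketched in a few lines of Python. This is a simplified illustration, not how Lucene stores its index: the tokenizer is a plain regular expression, and the function names `build_inverted_index` and `search` are made up for this example.

```python
import re
from collections import defaultdict

dialogues = {
    1: "Winter is coming.",
    2: "A mind needs books as a sword needs a whetstone, if it is to keep its edge.",
    3: "Every flight begins with a fall.",
    4: "Words can accomplish what swords cannot.",
}


def build_inverted_index(docs):
    """Map each term to (total frequency, sorted ids of documents containing it)."""
    frequency = defaultdict(int)
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Lowercase the text and keep only runs of letters (drops punctuation)
        for term in re.findall(r"[a-z]+", text.lower()):
            frequency[term] += 1
            postings[term].add(doc_id)
    return {term: (frequency[term], sorted(postings[term])) for term in sorted(frequency)}


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    result = None
    for term in re.findall(r"[a-z]+", query.lower()):
        docs = set(index.get(term, (0, []))[1])
        result = docs if result is None else result & docs
    return sorted(result or [])


index = build_inverted_index(dialogues)
print(index["a"])                 # (4, [2, 3])
print(index["is"])                # (2, [1, 2])
print(search(index, "a sword"))   # [2]
```

A query only has to look up its terms in the dictionary and intersect the postings, so documents that share no term with the query are never examined.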
A few common terms associated with Elasticsearch are as follows:
- Cluster: A cluster is a group of systems running the Elasticsearch engine that work together to store data and resolve queries. The nodes in a cluster are further classified by the role they play in it.
- Node: A node is a JVM process running an instance of Elasticsearch, independently accessible over the network by other machines or nodes in the cluster.
- Index: An index in Elasticsearch is analogous to a table in a relational database.
- Mapping: Each index has a mapping, which is essentially a schema definition of the data that each document in the index can hold. It can be created manually for each index, or it can be generated automatically when data is pushed to the index.
- Document: A JSON document. In relational terms, this would represent a single row in a table.
- Shard: A shard is a partition of the data in an index. Since the data in a single index may grow very large, say a few hundred GBs or even a few TBs, it can become infeasible to keep growing storage vertically. Instead, the index is logically divided into shards stored on different nodes, each of which operates on the data it holds. This allows for horizontal scaling.
- Replicas: Each shard may be replicated to one or more other nodes in the cluster. This provides a failover backup: if a node goes down or cannot use its resources at the moment, a replica on another node can serve the data. By default, one replica is created for each shard, and the number is configurable. In addition to failover, replicas also increase search performance.
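The sketch below ties several of these terms together: a request body that sets the shard and replica counts and an explicit mapping for an index, and a toy function that assigns documents to shards. The field choices and function names are hypothetical. The routing function uses `zlib.crc32` purely as a stand-in; Elasticsearch itself applies a Murmur3 hash to the routing value (the document id by default) and uses a somewhat more involved formula.

```python
import json
import zlib


def create_index_body(num_shards, num_replicas):
    """Request body for creating an index with explicit settings and mapping."""
    return {
        "settings": {
            "number_of_shards": num_shards,      # primary shards for the index
            "number_of_replicas": num_replicas,  # replica copies per primary shard
        },
        "mappings": {
            "properties": {
                "first_name": {"type": "text"},   # analyzed for full-text search
                "email": {"type": "keyword"},     # matched as an exact value
                "city": {"type": "keyword"},
            }
        },
    }


def route_to_shard(doc_id, num_primary_shards):
    """Toy routing: a stable hash of the document id modulo the shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


body = create_index_body(num_shards=3, num_replicas=1)
print(json.dumps(body["settings"]))

# Every document id deterministically lands on one of the three primary shards
placement = {doc_id: route_to_shard(doc_id, 3) for doc_id in ["1", "2", "3", "4"]}
print(placement)
```

Because the shard is derived from the document id and the number of primary shards, the same id always resolves to the same shard, which is also why the primary shard count is normally fixed when the index is created.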