Elastic Search Presentation
By Rohit Pandey
CONTENT
1. Introduction
2. Why Elasticsearch?
3. Basic Concepts
4. Kibana
5. Query DSL (Domain Specific Language)
6. Aggregations
7. Elastic Search REST APIs
8. "elasticsearch": The python library
9. Summary
10. References
INTRODUCTION
➢ Elasticsearch is a highly scalable open-source full-text search and analytics
engine.
➢ It allows you to store, search, and analyze large volumes of data quickly and in near
real time.
➢ It is generally used as the underlying engine/technology that powers applications
that have complex search features and requirements.
WHY ELASTICSEARCH?
01
Easy To Scale
One server can hold one or more parts of one or more indexes, and whenever new nodes are
introduced to the cluster they simply join it. Every such index, or part of it, is called
a shard, and Elasticsearch shards can be moved around the cluster very easily.
02
RESTful API
Elasticsearch is API driven. Almost any action can be performed using a simple RESTful API
using JSON over HTTP. An API already exists in the language of your choice.
03
You can host multiple indexes on one Elasticsearch installation - node or cluster. Each
index can have multiple "types", which are essentially completely different indexes.
04
Store complex real world entities in Elasticsearch as structured JSON documents.
All fields are indexed by default, and all the indices can be used in a single query, to return
results at breathtaking speed.
01
Clusters
A cluster is a collection of one
or more nodes (servers) that
together hold all of your data
and provide federated
indexing and search
capabilities across all nodes.
02
Node
A node is a single server that is
part of your cluster, stores your
data, and participates in the
cluster’s indexing and search
capabilities.
03
Index
An index is a collection of
documents that have
somewhat similar
characteristics.
04
Type
A type used to be a logical
category/partition of your index
to allow you to store different
types of documents in the same
index.
05
Documents
A document is a basic unit of
information that can be
indexed. For example, you can
have a document for a single
customer, another document
for a single product, and yet
another for a single order.
06
Shards
Elasticsearch provides the
ability to subdivide your
index into multiple pieces
called shards.
07
Replicas
Elasticsearch allows you to
make one or more copies of
your index’s shards into what
are called replica shards, or
replicas for short.
RDBMS VS ELASTIC SEARCH
RDBMS        Elasticsearch
Database     Index
Table        Type
Rows         Documents
Columns      Properties
Kibana
Query DSL
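Elasticsearch provides a full Query DSL (Domain Specific Language) based on JSON to define queries.
Think of the Query DSL as an AST (Abstract Syntax Tree) of queries, consisting of two types of clauses:
leaf query clauses, which look for a particular value in a particular field (for example match, term, or range),
and compound query clauses, which wrap other leaf or compound queries (for example bool).
The simplest query, match_all, matches all documents: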
{
  "query": {
    "match_all": {}
  }
}
Full Text Queries
The high-level full text queries are usually used for running full text queries on full text fields like the
body of an email. They understand how the field being queried is analyzed and will apply each
field’s analyzer (or search_analyzer) to the query string before executing.
{
  "query": {
    "match" : {
      "field_name" : "text to search"
    }
  }
}
Match Phrase Query
The match_phrase query analyzes the text and creates a phrase query out of the analyzed text. For example:
{
  "query": {
    "match_phrase" : {
      "field_name" : "text to search"
    }
  }
}
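As a rough illustration of how the match and match_phrase queries differ, the sketch below mimics them in plain Python. It assumes a simplified analyzer (lowercase, split on non-alphanumeric characters) in place of Elasticsearch's real analysis chain, and the function names are hypothetical. With its default OR operator, match needs only one analyzed term to be present, while match_phrase needs all terms adjacent and in order.

```python
import re

def analyze(text):
    """Simplified stand-in for an analyzer: lowercase and split into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def match(field_text, query_text):
    """Mimic a match query with the default OR operator: any query term may match."""
    doc_terms = set(analyze(field_text))
    return any(term in doc_terms for term in analyze(query_text))

def match_phrase(field_text, query_text):
    """Mimic a match_phrase query: all query terms must appear adjacent and in order."""
    doc_terms = analyze(field_text)
    query_terms = analyze(query_text)
    n = len(query_terms)
    if n == 0:
        return False
    return any(doc_terms[i:i + n] == query_terms
               for i in range(len(doc_terms) - n + 1))

doc = "Search this text quickly"
print(match(doc, "text to search"))         # True: "text" and "search" are present
print(match_phrase(doc, "text to search"))  # False: the terms are not adjacent
print(match_phrase("some text to search here", "Text To Search"))  # True
```

Real relevance scoring, stop words, and slop are left out on purpose; the sketch only shows which documents would be considered a match.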
Multi Match Query
The multi_match query builds on the match query to allow multi-field queries:
{
  "query": {
    "multi_match" : {
      "query": "text to search",
      "fields": [ "field1", "field2" ]
    }
  }
}
Query String Query
A query that uses a query parser in order to parse its content. Here is an example:
{
  "query": {
    "query_string" : {
      "default_field" : "field_name",
      "query" : "this AND that OR thus"
    }
  }
}
Term Level Queries
These queries are usually used for structured data like numbers, dates, and enums, rather than full text
fields. Alternatively, they allow you to craft low-level queries, forgoing the analysis process.
{
  "query": {
    "term": {
      "user": {
        "value": "Kimchy"
      }
    }
  }
}
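Because term-level queries skip analysis, the value in the query has to equal an indexed term exactly. The sketch below illustrates this in plain Python under two assumptions that are not stated in the slides: an analyzed text field whose analyzer lowercases its input, and a keyword field that stores the value unchanged. All function names are hypothetical.

```python
import re

def analyze(text):
    """Simplified stand-in for an analyzer: lowercase and split into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_text_field(value):
    """Terms stored for an analyzed text field (assumed lowercasing analyzer)."""
    return set(analyze(value))

def index_keyword_field(value):
    """A keyword field stores the whole value as a single, unmodified term."""
    return {value}

def term_query(indexed_terms, value):
    """Mimic a term query: the value is compared as-is, with no analysis."""
    return value in indexed_terms

text_terms = index_text_field("Kimchy")
keyword_terms = index_keyword_field("Kimchy")
print(term_query(text_terms, "Kimchy"))     # False: the indexed term is "kimchy"
print(term_query(text_terms, "kimchy"))     # True
print(term_query(keyword_terms, "Kimchy"))  # True: keyword value kept verbatim
```

So whether the "Kimchy" example returns hits depends on how the user field is mapped.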
Terms Query
Returns documents that contain one or more exact terms in a provided field.
The terms query is the same as the term query, except you can search for multiple values.
{
  "query" : {
    "terms" : {
      "user" : ["kimchy", "elasticsearch"]
    }
  }
}
Range Query
Matches documents with fields that have terms within a certain range. The type of the Lucene query depends on the
field type: for string fields it is a TermRangeQuery, while for number/date fields it is a NumericRangeQuery.
The following example returns all documents where age is between 10 and 20:
{
  "query": {
    "range" : {
      "age" : {
        "gte" : 10,
        "lte" : 20,
        "boost" : 2.0
      }
    }
  }
}
Exists Query
Returns documents that contain a value other than null or [] in a provided field.
{
  "query": {
    "exists": {
      "field": "user"
    }
  }
}
Regexp Query
The regexp query allows you to use regular expression term queries. The "term queries" in that first sentence means
that Elasticsearch will apply the regexp to the terms produced by the tokenizer for that field, and not to the original
text of the field.
Note: The performance of a regexp query heavily depends on the regular expression chosen. Match-everything patterns
like .* are very slow, as are lookaround regular expressions. If possible, you should try to use a long prefix
before your regular expression starts. Wildcard matchers like .*?+ will mostly lower performance.
{
  "query": {
    "regexp":{
      "name.first": "s.*y"
    }
  }
}
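To make the "terms, not original text" point concrete, here is a minimal sketch in plain Python. It assumes a simplified analyzer (lowercase, split on non-alphanumeric characters) and uses Python's re module as a stand-in for Lucene's regular expression syntax, which differs in details. The pattern has to match a whole term, which is modelled here with fullmatch. The function names are hypothetical.

```python
import re

def analyze(text):
    """Simplified stand-in for an analyzer: lowercase and split into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def regexp_query(field_text, pattern):
    """Mimic a regexp query: the pattern must match an entire indexed term."""
    compiled = re.compile(pattern)
    return any(compiled.fullmatch(term) is not None for term in analyze(field_text))

print(regexp_query("Shay Banon", "s.*y"))      # True: matches the term "shay"
print(regexp_query("Shay Banon", "s.*n"))      # False: no single term starts with s and ends with n
print(regexp_query("Shay Banon", "shay b.*"))  # False: a pattern cannot span two terms
```

The second call would succeed if the pattern were applied to the original string "shay banon", which is exactly the difference the slide describes.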
Compound Queries
Compound queries wrap other compound or leaf queries, either to combine their results and scores,
to change their behaviour, or to switch from query to filter context.
Occur      Description
must       The clause (query) must appear in matching documents and will contribute to the score.
filter     The clause (query) must appear in matching documents. However, unlike must, the score of the query will be ignored. Filter clauses are executed in filter context, meaning that scoring is ignored and clauses are considered for caching.
should     The clause (query) should appear in the matching document.
must_not   The clause (query) must not appear in the matching documents. Clauses are executed in filter context, meaning that scoring is ignored and clauses are considered for caching. Because scoring is ignored, a score of 0 for all documents is returned.
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "user" : "kimchy" }
      },
      "filter": {
        "term" : { "tag" : "tech" }
      },
      "must_not" : {
        "range" : {
          "age" : { "gte" : 10, "lte" : 20 }
        }
      },
      "should" : [
        { "term" : { "tag" : "wow" } },
        { "term" : { "tag" : "elasticsearch" } }
      ],
      "minimum_should_match" : 1,
      "boost" : 1.0
    }
  }
}
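When bool queries are assembled in application code, a small helper can keep the occurrence types straight. The following is a sketch only: bool_query and its parameter names are hypothetical and not part of any Elasticsearch client library. It produces a plain dictionary that can be serialized to the same JSON shape as the example above.

```python
import json

def bool_query(must=None, filters=None, must_not=None, should=None,
               minimum_should_match=None):
    """Assemble a bool query body, leaving out any clause type that is empty."""
    clauses = {
        "must": must,
        "filter": filters,
        "must_not": must_not,
        "should": should,
    }
    body = {occur: queries for occur, queries in clauses.items() if queries}
    if minimum_should_match is not None:
        body["minimum_should_match"] = minimum_should_match
    return {"query": {"bool": body}}

request_body = bool_query(
    must=[{"term": {"user": "kimchy"}}],
    filters=[{"term": {"tag": "tech"}}],
    must_not=[{"range": {"age": {"gte": 10, "lte": 20}}}],
    should=[{"term": {"tag": "wow"}}, {"term": {"tag": "elasticsearch"}}],
    minimum_should_match=1,
)
print(json.dumps(request_body, indent=2))
```

Clauses are passed as lists here for uniformity; Elasticsearch accepts either a single clause object or an array for each occurrence type.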
Script Query
A query that allows scripts to be used as queries. Script queries are typically used in a filter context, for example:
{
  "query": {
    "bool" : {
      "filter" : {
        "script" : {
          "script" : {
            "source": "doc['num1'].value > 1",
            "lang": "painless"
          }
        }
      }
    }
  }
}
Aggregations
The aggregations framework helps provide aggregated data based on a search query. It is based on
simple building blocks called aggregations, that can be composed in order to build complex summaries
of the data.
There are many different types of aggregations, each with its own purpose and output. To better
understand these types, it is often easier to break them into four main families:
➢ Bucketing:
A family of aggregations that build buckets, where each bucket is associated with a key and a document
criterion. When the aggregation is executed, all the bucket criteria are evaluated on every document in the
context and when a criterion matches, the document is considered to "fall in" the relevant bucket. By the end
of the aggregation process, we'll end up with a list of buckets - each one with a set of documents that "belong"
to it.
➢ Metric:
Aggregations that keep track of and compute metrics over a set of documents.
➢ Matrix:
A family of aggregations that operate on multiple fields and produce a matrix result based on the values
extracted from the requested document fields. Unlike metric and bucket aggregations, this aggregation family
does not yet support scripting.
➢ Pipeline:
Aggregations that aggregate the output of other aggregations and their associated metrics.
Avg Aggregation
A single-value metrics aggregation that computes the average of numeric values that are extracted from the aggregated
documents. These values can be extracted either from specific numeric fields in the documents or be generated by a
provided script.
Assuming the data consists of documents representing students' exam grades (between 0 and 100), we can average
their scores with:
{
  "aggs" : {
    "avg_grade" : { "avg" : { "field" : "grade" } }
  }
}
Sum Aggregation
A single-value metrics aggregation that sums up numeric values that are extracted from the aggregated documents.
These values can be extracted either from specific numeric fields in the documents or be generated by a provided
script.
Assuming the data consists of documents representing sales records we can sum the sale price of all hats with:
{
  "query" : {
    "constant_score" : {
      "filter" : {
        "match" : { "type" : "hat" }
      }
    }
  },
  "aggs" : {
    "hat_prices" : { "sum" : { "field" : "price" } }
  }
}
Max/Min Aggregation
A single-value metrics aggregation that keeps track of and returns the maximum/minimum value among the numeric
values extracted from the aggregated documents. These values can be extracted either from specific numeric fields
in the documents or be generated by a provided script.
{
  "aggs" : {
    "max_price" : { "max" : { "field" : "price" } },
    "min_price" : { "min" : { "field" : "price" } }
  }
}
Range Aggregation
A multi-bucket value source-based aggregation that enables the user to define a set of ranges - each representing a
bucket. During the aggregation process, the values extracted from each document will be checked against each
bucket range and "bucket" the relevant/matching document. Note that this aggregation includes the 'from' value and
excludes the 'to' value for each range.
{
  "aggs" : {
    "price_ranges" : {
      "range" : {
        "field" : "price",
        "ranges" : [
          { "to" : 100.0 },
          { "from" : 100.0, "to" : 200.0 },
          { "from" : 200.0 }
        ]
      }
    }
  }
}
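The inclusive 'from' and exclusive 'to' rule is easy to get wrong at the boundaries, so here is a minimal plain-Python sketch of the bucketing logic. It is an illustration only, not how Elasticsearch implements the aggregation; range_aggregation and the sample prices are hypothetical, and the bucket keys merely imitate the style of the keys in a real response.

```python
def range_aggregation(values, ranges):
    """Count values per range; 'from' is inclusive and 'to' is exclusive."""
    buckets = []
    for r in ranges:
        lower = r.get("from")
        upper = r.get("to")
        count = sum(
            1 for v in values
            if (lower is None or v >= lower) and (upper is None or v < upper)
        )
        key = f"{'*' if lower is None else lower}-{'*' if upper is None else upper}"
        buckets.append({"key": key, "doc_count": count})
    return buckets

prices = [50.0, 100.0, 150.0, 200.0, 250.0]
ranges = [{"to": 100.0}, {"from": 100.0, "to": 200.0}, {"from": 200.0}]
for bucket in range_aggregation(prices, ranges):
    print(bucket)
# {'key': '*-100.0', 'doc_count': 1}
# {'key': '100.0-200.0', 'doc_count': 2}
# {'key': '200.0-*', 'doc_count': 2}
```

A price of exactly 100.0 lands in the second bucket and a price of exactly 200.0 lands in the third, so contiguous ranges never count a document twice.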
Date Range Aggregation
A range aggregation that is dedicated to date values. The main difference between this aggregation and the
normal range aggregation is that the from and to values can be expressed in Date Math expressions, and it is also
possible to specify a date format by which the from and to response fields will be returned. Note that this
aggregation includes the from value and excludes the to value for each range.
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "MM-yyyy",
        "ranges": [
          { "to": "now-10M/M" },
          { "from": "now-10M/M" }
        ]
      }
    }
  }
}
Filter Aggregation
Defines a single bucket of all the documents in the current document set context that match a
specified filter. Often this will be used to narrow down the current aggregation context to a specific set
of documents.
{
  "aggs" : {
    "t_shirts" : {
      "filter" : { "term": { "type": "t-shirt" } },
      "aggs" : {
        "avg_price" : { "avg" : { "field" : "price" } }
      }
    }
  }
}
Terms Aggregation
A multi-bucket value source-based aggregation where buckets are dynamically built - one per unique value.
{
  "aggs" : {
    "genres" : {
      "terms" : { "field" : "genre" }
    }
  }
}
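Conceptually, a terms aggregation is a group-by-and-count over one field. The sketch below shows that idea in plain Python with a hypothetical terms_aggregation helper and made-up documents; the default of ten buckets ordered by descending document count mirrors the aggregation's defaults, while everything else is simplified.

```python
from collections import Counter

def terms_aggregation(docs, field, size=10):
    """Build one bucket per unique field value, largest doc_count first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [{"key": key, "doc_count": n} for key, n in counts.most_common(size)]

docs = [
    {"genre": "rock"},
    {"genre": "jazz"},
    {"genre": "rock"},
    {"title": "untagged"},  # no genre field, so it falls into no bucket
]
print(terms_aggregation(docs, "genre"))
# [{'key': 'rock', 'doc_count': 2}, {'key': 'jazz', 'doc_count': 1}]
```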
Avg Bucket Aggregation
A sibling pipeline aggregation which calculates the (mean) average value of a specified metric in a sibling aggregation.
The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "sales": {
          "sum": {
            "field": "price"
          }
        }
      }
    },
    "avg_monthly_sales": {
      "avg_bucket": {
        "buckets_path": "sales_per_month>sales"
      }
    }
  }
}
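The two-stage nature of a sibling pipeline aggregation (first bucket and sum, then average across the buckets) can be sketched in plain Python as follows. This is an illustration with made-up sales records and hypothetical helper names; it assumes ISO date strings and ignores empty months, which a real date_histogram would handle through its own gap settings.

```python
from collections import defaultdict

def sales_per_month(docs):
    """Mimic a monthly date_histogram with a nested sum of price per bucket."""
    totals = defaultdict(float)
    for doc in docs:
        month = doc["date"][:7]  # "YYYY-MM" prefix of an ISO date string
        totals[month] += doc["price"]
    return dict(sorted(totals.items()))

def avg_bucket(bucket_values):
    """Mimic avg_bucket: the mean of one metric across sibling buckets."""
    values = list(bucket_values)
    return sum(values) / len(values) if values else None

docs = [
    {"date": "2015-01-10", "price": 200.0},
    {"date": "2015-01-20", "price": 350.0},
    {"date": "2015-02-05", "price": 60.0},
    {"date": "2015-03-15", "price": 375.0},
]
monthly = sales_per_month(docs)
print(monthly)  # {'2015-01': 550.0, '2015-02': 60.0, '2015-03': 375.0}
print(avg_bucket(monthly.values()))
```

Note that the average is taken over the three monthly totals, not over the four individual sales, which is what distinguishes avg_bucket from a plain avg aggregation.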
Max/Min Bucket Aggregation
A sibling pipeline aggregation which identifies the bucket(s) with the maximum/minimum value of a specified metric in
a sibling aggregation and outputs both the value and the key(s) of the bucket(s). The specified metric must be
numeric and the sibling aggregation must be a multi-bucket aggregation.
The example uses max_bucket; use min_bucket in its place to find the month with the lowest sales.
{
  "size": 0,
  "aggs" : {
    "sales_per_month" : {
      "date_histogram" : {
        "field" : "date",
        "interval" : "month"
      },
      "aggs": {
        "sales": {
          "sum": {
            "field": "price"
          }
        }
      }
    },
    "max_monthly_sales": {
      "max_bucket": {
        "buckets_path": "sales_per_month>sales"
      }
    }
  }
}
Elastic Search REST APIs
THE REST API
Elasticsearch provides a very comprehensive and powerful REST API that you can use to interact with your cluster. A few of
the things that can be done with the API are as follows:
➢ Check your cluster, node, and index health, status, and statistics
➢ Administer your cluster, node, and index data and metadata
➢ Perform CRUD (Create, Read, Update, and Delete) and search operations against your indexes
➢ Execute advanced search operations such as paging, sorting, filtering, scripting, aggregations, and many others
Request Body Search
The search request can be executed with a search DSL, which includes the Query DSL, within its body.
Parameter                      Definition
timeout                        A search timeout, bounding the search request to be executed within the specified time value and to bail out with the hits accumulated up to that point when it expires. Search requests are canceled after the timeout is reached using the Search Cancellation mechanism. Defaults to no timeout.
size                           The number of hits to return. Defaults to 10. If you do not care about getting some hits back but only about the number of matches and/or aggregations, setting the value to 0 will help performance.
search_type                    The type of the search operation to perform. Can be dfs_query_then_fetch or query_then_fetch. Defaults to query_then_fetch.
request_cache                  Set to true or false to enable or disable the caching of search results for requests where size is 0, i.e. aggregations and suggestions (no top hits returned).
allow_partial_search_results   Set to false to return an overall failure if the request would produce partial results. Defaults to true, which will allow partial results in the case of timeouts or partial failures. This default can be controlled using the cluster-level setting search.default_allow_partial_results.
terminate_after                The maximum number of documents to collect for each shard, upon reaching which the query execution will terminate early. If set, the response will have a boolean field terminated_early to indicate whether the query execution has actually terminated early. Defaults to no terminate_after.
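As a hedged sketch of what a request-body search looks like on the wire, the helper below builds the HTTP request with Python's standard library only. The base URL http://localhost:9200, the index name sales, and both function names are assumptions for illustration. In this sketch search_type, request_cache, and allow_partial_search_results travel as query-string parameters, while timeout, size, and terminate_after go in the JSON body next to the query.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_search_request(index, body, base_url="http://localhost:9200", **params):
    """Build (but do not send) a request-body search against /<index>/_search."""
    url = f"{base_url}/{index}/_search"
    if params:
        url += "?" + urlencode(params)
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def run_search(request):
    """Send the request and decode the JSON response (needs a running cluster)."""
    with urlopen(request) as response:
        return json.load(response)

# size 0: only aggregations or match counts are wanted, no hits
body = {"size": 0, "timeout": "2s", "query": {"match_all": {}}}
request = build_search_request("sales", body, request_cache="true")
print(request.full_url)  # http://localhost:9200/sales/_search?request_cache=true
```

Calling run_search(request) would execute the search; it is kept separate so the request can be inspected or logged without a cluster being available.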
‘elasticsearch’: The python library
SUMMARY
Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a wide variety of problems.
Many companies are switching to it and integrating it into their current backend infrastructure because it is fast and
combines different types of search: structured, unstructured, geo, application search, security analytics, metrics, and
logging.
REFERENCES
• Blusapphire Kibana App
• Query DSL
• Aggregations
• Elastic Search REST APIs
• Elasticsearch: The python library
Thank You