Elasticsearch
-a distributed real-time search
and analytics engine
Dr. Rim Moussa
Outline
● Real-World Users
● Big picture of Elastic Stack
● Information Retrieval Background
● Elasticsearch
● Screenshots
● Elastic Competitors
US National Aeronautics and Space Administration
● NASA Curiosity rover explores the red planet Mars and collects
data,
● Curiosity’s onboard sensors capture temperature on the Martian
surface, atmospheric composition, ...
● The challenge: how to analyze telemetry data from the Curiosity rover,
150 million miles away? The answer: real-time analytics and visualization
with Elastic
● NASA Soil Moisture Active Passive Project
● SMAP is designed to measure soil moisture over a three-year period,
every 2-3 days in the top 5 cm of soil everywhere on Earth’s surface
● SMAP will produce global maps of soil moisture. Scientists will use
these to help improve our understanding of how water and carbon (in
its various forms) circulate
● For more information
● Tom Soderstrom and Dan Isla:
Exploring Space Through Streaming Analytics
● Dan Isla and Ricky Ma:
NASA: Unlocking Interplanetary Datasets with Real-Time Search
Other Stories
● Uber and Lyft
● Uber -Engineering Uber Predictions in Real Time with ELK
● Lyft -Lyft's Wild Ride from Amazon ES to Self-Managed Elasticsearch
● User and drivers register on Uber/Lyft platform
● Real-time location data is registered in ELK
● The key value is finding an available ride within seconds
● Match the user's location with drivers available nearby: k nearest
neighbors (kNN)
● Tinder -using elastic stack to make connections around
● Tinder connects more people in real time than any other mobile application in the
world.
● CISCO: Elasticsearch on Cisco Unified Computing System
● Analyze logs produced by networking equipment
● All the Data That’s Fit to Find: Search @ The New York Times
● Digitization of newspaper content, stored and indexed in the
ELK stack
Other Stories
● eBay -Elasticsearch Performance Tuning Practice at eBay
● Search functionality: search items by attribute and return results fast, before
available quantities run out or auctions end (see video)
● HotelTonight scales to millions of users with elasticsearch as a service
● Best price match for hotel accommodation based on customers' travel dates
(see video)
● GROUPON -Extended Custom Scripting at Groupon
● propose and score various coupons and deals for customers, let the users
search based on their vicinity to these deals
● Wikipedia - Loading Wikipedia's Search Index For Testing
● Wikipedia uses Elasticsearch to provide full-text search with highlighted
search snippets, and search-as-you-type and did-you-mean suggestions.
● Exploring the Stack Overflow dataset with Elastic Graph
● combines full-text search with geolocation queries and uses more-like-this to
find related questions and answers.
● GitHub uses ES to index over 8 million code repositories
Elastic Stack
● A bundle of free and open-source technologies
● Logstash is a data-collection, log-enrichment and parsing engine
● Event processing pipeline
● It collects data or logs from various sources of an IT infrastructure
● It enriches the events or log streams and stores them as JSON documents
in Elasticsearch
● Beats
● Lightweight Data Shippers
● They send data from hundreds or thousands of machines to Logstash
for transformation and parsing or directly to Elasticsearch.
● Elasticsearch is a search engine built on top of Apache Lucene
● Distributed document repository
● It stores and analyzes JSON docs
● It provides distributed and full-text search with a RESTful interface and
schema-free JSON documents.
● Kibana is a rich web-based application which generates real-time
visualizations.
The Stack Big Picture
Process Flow: big picture
Source: https://www.elastic.co/guide/en/logstash/current/deploying-and-scaling.html
Elastic Stack: Strong points
● Scalability
● Supports the 3 V's (volume, velocity, variety) and splits large indices across a cluster of servers
● Elasticity: starts small and scales horizontally by adding nodes
● Reliability
● Detects failed nodes
● All changes are recorded in transaction logs on multiple nodes
● Flexibility
● Elasticsearch offers full-text search through query APIs
which support multilingual search, auto-complete, geolocation, result
snippets and contextual suggestions.
● Logstash connects to multiple data sources such as files and databases. It
unifies all data streams, such as logging data from web servers, IoT
device data or binary files uploaded by users.
● Automated and easy-to-use
● Automatically indexes JSON docs making them searchable
● RESTful API -rich JSON APIs are accessible through a RESTful interface over HTTP
● User friendly
• Create charts, plots, histograms and maps for better insights from data
Information Retrieval
● Main goal: efficiently query free text à la Google
● c.f. Information Retrieval book
• Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze,
Introduction to Information Retrieval, Cambridge University Press. 2008.
● Open-source technologies
● Apache Lucene -Lucene is a mature, open-source, high-performance,
scalable, lightweight yet very powerful library written in Java. It indexes
documents and searches them with its full-text search capabilities
● Terminology
• Collection: set of documents
• Document: contains one or more fields.
• Field: is built of two parts: the name and the value.
• Term: is a unit of search representing a word from a field text.
• Token: is an occurrence of a term from the text of the field. It consists
of term text, start offset, end offset, and a type.
Full Text Queries
● Boolean search
• simply check whether a term occurs in the document.
• For example, Q = cloud AND computing, where AND is a Boolean
operator, matches every document that mentions both “cloud” and
“computing”.
• Q = cloud AND NOT computing, will match every document that
mentions “cloud” but does not mention “computing” anywhere.
● Phrase search
• unlike in just Boolean searching, we need to know not only that the
term occurred in the document, but also where it occurred
• the full-text index stores term positions within documents as well.
● Proximity search
• where the terms occur within a given distance to each other
• Q=cloud /k computing with k=5
● Field based search
• Documents may have more than one field, and programmers frequently
want to limit parts of a search to a given field. A field may also be boosted.
• e.g. Find all email messages from someone named Peter (boost: 2) that
mention MySQL in the subject line (boost: 1)
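A minimal sketch of how a positional inverted index can answer the Boolean, phrase and proximity queries above. The tokenization (lowercase, whitespace split), the function names and the three sample documents are illustrative assumptions, not Lucene's implementation.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} (naive lowercase/whitespace tokens)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def boolean_and(index, t1, t2):
    """Q = t1 AND t2: documents containing both terms."""
    return set(index.get(t1, {})) & set(index.get(t2, {}))

def boolean_and_not(index, t1, t2):
    """Q = t1 AND NOT t2: documents containing t1 but not t2."""
    return set(index.get(t1, {})) - set(index.get(t2, {}))

def proximity(index, t1, t2, k):
    """Q = t1 /k t2: both terms occur within k positions of each other."""
    hits = set()
    for doc_id in boolean_and(index, t1, t2):
        if any(abs(p1 - p2) <= k
               for p1 in index[t1][doc_id]
               for p2 in index[t2][doc_id]):
            hits.add(doc_id)
    return hits

def phrase(index, t1, t2):
    """Phrase "t1 t2": t2 occurs immediately after t1."""
    hits = set()
    for doc_id in boolean_and(index, t1, t2):
        positions2 = set(index[t2][doc_id])
        if any(p + 1 in positions2 for p in index[t1][doc_id]):
            hits.add(doc_id)
    return hits

# Hypothetical sample collection
docs = {
    1: "cloud computing is elastic",
    2: "computing in the cloud",
    3: "a cloud in the sky",
}
index = build_positional_index(docs)
print(boolean_and(index, "cloud", "computing"))
print(phrase(index, "cloud", "computing"))
print(proximity(index, "cloud", "computing", 5))
```

Storing positions is what makes phrase and proximity search possible; a plain term-to-documents index would only support the Boolean queries.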
Full Text Queries (ctnd. 1)
● Exact word query (term query)
• The term query finds documents that contain the exact term
• Q=user:"Joseph"
● Wildcard Query
● Matches documents that have fields matching a wildcard expression
● Q1=user:"Jo*h" ; Q2=user:"Jose?h" where * matches any character
sequence and ? matches any single character
● Fuzzy matching
● The fuzzy query uses similarity based on Levenshtein edit distance.
● d(hot,hat)=1, d(cloud,could)=2, d(cat,dog)=3
● Parameters
• fuzziness -the maximum edit distance. Defaults to AUTO.
• prefix_length -the number of initial characters which will not be
“fuzzified”. Defaults to 0.
• max_expansions -the maximum number of terms that the fuzzy query
will expand to. Defaults to 50.
• transpositions -whether fuzzy transpositions (ab → ba) are
supported. Default is false.
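The edit distances quoted above can be checked with a short dynamic-programming sketch of the Levenshtein distance. The `fuzzy_match` helper is a hypothetical illustration of the fuzziness and prefix_length parameters only; it does not model max_expansions or transpositions and is not the Elasticsearch implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))          # distance from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute or keep
        prev = curr
    return prev[-1]

def fuzzy_match(query, terms, fuzziness=1, prefix_length=0):
    """Terms within `fuzziness` edits of `query` that share its first
    `prefix_length` characters (hypothetical helper)."""
    prefix = query[:prefix_length]
    return [t for t in terms
            if t.startswith(prefix) and levenshtein(query, t) <= fuzziness]

print(levenshtein("hot", "hat"), levenshtein("cloud", "could"), levenshtein("cat", "dog"))
print(fuzzy_match("hot", ["hat", "hot", "shot", "dog"]))
print(fuzzy_match("hot", ["hat", "hot", "shot", "dog"], prefix_length=1))
```

Without transposition support, swapping two adjacent letters (cloud → could) costs two edits, which is why d(cloud,could)=2.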
Full Text Queries (ctnd. 2)
● Phonetic matching Query
● searching for words that sound similar, even if their spelling differs.
● Soundex algorithm, Metaphone algorithm ….
● Query boosting
● The normal clause has the default neutral boost of 1.0.
● The urgent query clause has a boost of 2.0, meaning it is twice as
important as the query clause for normal.
Stemming & Lemmatization to reduce words to their
root forms
● The goal of both stemming and lemmatization is to reduce
inflectional forms and sometimes derivationally related forms of a
word to a common base form
● Tense (pay, paid), gender (waiter, waitress), number (fox, foxes), case (I, my)
● Stemming
● Stemming usually refers to a crude heuristic process that chops off the
ends of words in the hope of achieving this goal correctly most of the
time, and often includes the removal of derivational affixes
● English Stemmers are Lovins stemmer (1968), Porter stemmer (1980),
Paice stemmer (1990).
● Lemmatization
● Lemmatizer -a tool from Natural Language Processing which does full
morphological analysis and determines the word sense to accurately
identify the lemma for each word
● e.g.
● am, are, is → be ; car, cars, car's, cars' → car ; operate, operating,
operates, operation, operative, operatives, operational → oper
Data Structures
● Data structures are very important for efficient free text search
● A text field in a document is a bag of words
● Is term t in doc d?
● Binary incidence matrix: t ∈ d iff M[t,d] = 1
• good for small collections
● Distributed inverted indexes à la Google
• for each t → list of postings (di (offsets of t), dj (offsets of t), …)
● Extended full text search capabilities
● Radix trie or compact prefix-tree
● Suffix trie
● B-trees
● Inverted B-trees
● Permuterm index
● For term hello: hello$, ello$h, llo$he, lo$hel, o$hell, $hello all map
to the term hello in a permuterm index
● K-gram index
● For k=3, $he, hel, ell, llo, lo$ all map to the term hello in a 3-gram index
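A small sketch that generates the permuterm rotations and k-grams listed above. The function names are illustrative, and `wildcard_to_prefix` assumes a pattern with exactly one `*`.

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; each rotation points back to the term."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def kgrams(term, k=3):
    """All k-grams of '$' + term + '$' ('$' marks the word boundaries)."""
    s = "$" + term + "$"
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def wildcard_to_prefix(pattern):
    """Rotate a single-'*' pattern so the wildcard ends up last, and return
    the prefix to look up among the permuterm rotations."""
    head, tail = (pattern + "$").split("*")
    return tail + head

rotations = permuterm_rotations("hello")
print(rotations)
print(kgrams("hello", 3))

# The wildcard query he*o becomes a prefix lookup for "o$he"
prefix = wildcard_to_prefix("he*o")
print(prefix, any(r.startswith(prefix) for r in rotations))
```

The permuterm index turns a wildcard query into a prefix lookup, which a sorted dictionary or a B-tree answers efficiently.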
Answer sets Assessment
● Precision
● The ratio of the number of relevant documents retrieved to the total
number of documents retrieved (both relevant as well as irrelevant).
● Recall
● The ratio of the number of relevant records retrieved to the total
number of relevant records in the database.
● Scoring
● With ranking, large result sets are not an issue.
● Just show the Top N results
● Doesn’t overwhelm the user
● Premise: the ranking algorithm works: More relevant results are
ranked higher than less relevant results.
● ranking function to score and rank matching documents based on
their relevance.
● TF-IDF
● BM25 best matching -Apache Lucene 6.0+
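The precision and recall definitions at the top of this slide can be computed directly from the set of retrieved documents and the set of relevant documents. The document IDs below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one answer set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)              # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 3 relevant documents exist, 2 of them were found
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)
```

Returning more documents can only keep or raise recall, but usually lowers precision, which is why ranked top-N results matter.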
TF-IDF
● TF-IDF is built on a combination of the Vector Space Model (VSM) and
the Boolean model of information retrieval.
● The more times a query term appears in a document relative to the
number of times the term appears in the whole collection, the more
relevant that document is to the query
● TF: term frequency
● TF of a word is the frequency of a word in a document
● e.g. the term “cat” repeats 12 times in document d
● Instead of the raw term frequency, use log-frequency weighting
● The log frequency weight of term t in d is defined as follows
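Assuming the slide follows the cited textbook by Manning, Raghavan and Schütze, the log-frequency weight is usually written as follows, where tf_{t,d} is the raw count of term t in document d:

```latex
w_{t,d} =
\begin{cases}
1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\
0 & \text{otherwise}
\end{cases}
```

For the example above, tf = 12 gives w ≈ 1 + 1.08 = 2.08, so twelve occurrences count for about twice as much as one, not twelve times.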
TF-IDF (ctnd. 1)
● IDF: inverse document frequency
● is the measure of how significant that term is in the whole corpus. The
scoring formula uses this factor to boost documents that contain rare
terms.
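A common textbook definition (again assuming the Manning et al. notation), with N the number of documents in the collection and df_t the number of documents that contain term t:

```latex
\mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t},
\qquad
\text{tf-idf}_{t,d} = \left(1 + \log_{10} \mathrm{tf}_{t,d}\right) \cdot \log_{10} \frac{N}{\mathrm{df}_t}
\quad (\mathrm{tf}_{t,d} > 0)
```

For example, with N = 1,000,000 a term found in 1,000 documents has idf = 3, while a term found in every document has idf = 0.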
VSM & TF-IDF weighted Cosine (ctnd.1)
● Fundamental to many operations
● (query,document) pair scoring
● document classification, document clustering, …
● Represent a set of documents as well as queries as vectors of terms
in a common vector space.
● Calculate the Cosine similarity measure
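A minimal sketch of (query, document) scoring with TF-IDF weighted vectors and cosine similarity. It assumes the log-frequency tf weight and the log10(N/df) idf from the cited textbook; the three documents and the query are made up.

```python
import math
from collections import Counter

def tf_idf_vector(tokens, df, n_docs):
    """Sparse vector {term: (1 + log10 tf) * log10(N / df)}; unknown terms are dropped."""
    counts = Counter(tokens)
    return {t: (1 + math.log10(c)) * math.log10(n_docs / df[t])
            for t, c in counts.items() if df.get(t)}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical collection, already tokenized
docs = {
    "d1": "cloud computing scales cloud services".split(),
    "d2": "the cat sat on the mat".split(),
    "d3": "cloud storage for cat pictures".split(),
}
# Document frequency: number of documents containing each term
df = Counter(t for tokens in docs.values() for t in set(tokens))
vectors = {d: tf_idf_vector(tokens, df, len(docs)) for d, tokens in docs.items()}

query = tf_idf_vector("cloud computing".split(), df, len(docs))
ranking = sorted(docs, key=lambda d: cosine(query, vectors[d]), reverse=True)
print(ranking)
```

Because cosine normalizes by vector length, a long document does not win simply by repeating terms more often.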
Elasticsearch
● Elasticsearch is a real-time distributed search and analytics engine
● Open-source search engine built on top of Apache Lucene
● It uses Lucene for indexing and searching
● Key characteristics
● A distributed real-time document store where every field is indexed
and searchable: all data in every field is indexed by default.
● A distributed search engine with real-time analytics
● Capable of scaling to hundreds of servers and petabytes of
structured and unstructured data
● Elasticsearch uses JavaScript Object Notation (JSON)
as the serialization format for documents
Key Components
● Cluster of nodes -collection of connected nodes
● Identified by a unique name
● Default: cluster.name: elasticsearch in elasticsearch.yml
● A node belongs to exactly one cluster
● Data node stores data and performs data related ops: CRUD,
search, aggregations and participates in indexing process
● Master node -lightweight node responsible for cluster management
● Coordinating node
• Every node is implicitly a coordinating node,
• Requests like search requests or bulk-indexing requests may
involve data held on different data nodes.
• A search request, for example, is executed in two phases which
are coordinated by the node which receives the client request
• In the scatter phase, the coordinating node forwards the request
to the data nodes which hold the data. Each data node executes
the request locally and returns its results to the coordinating
node.
• In the gather phase, the coordinating node reduces each data
node’s results into a single global result set.
Key Components (ctnd 1)
● Data is stored inside an index
● The index is a logical namespace that groups one or more physical primary shards of
data
● Each primary shard can have 0 or more replica shards
● A shard is a low-level worker unit that holds one slice of all the data in
the index; each shard is an instance of the Apache Lucene search engine
● Supports time-series indexing for data such as log entries, signal-processing
data and social network activity, where searches target recent
documents and documents are never updated. One index is created
per year, per month or per day...
● Applications talk to the index and not shards
● Each document (JSON format) is stored in a single primary shard
● By default each index has 5 primary shards (1 since Elasticsearch 7.0)
● The number of primary shards cannot be changed after index creation
● The replication factor can be changed after index creation
● A replica shard is never stored on the same node as its primary
shard
● Document size limit: http.max_content_length defaults to 100MB (Lucene
still has a limit of 2GB)
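Why the number of primary shards is fixed: Elasticsearch picks the shard of a document as shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document ID. The sketch below uses `zlib.crc32` as a stand-in hash (an assumption, not the hash Elasticsearch uses) and made-up document IDs.

```python
import zlib

def route(doc_id: str, number_of_primary_shards: int) -> int:
    """Shard number for a document: hash(routing) % number_of_primary_shards.
    crc32 is only a stand-in for Elasticsearch's internal hash function."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards

ids = [f"doc-{i}" for i in range(100)]        # hypothetical document IDs
placement_5 = {i: route(i, 5) for i in ids}   # index created with 5 primaries
placement_6 = {i: route(i, 6) for i in ids}   # same formula with 6 primaries

moved = [i for i in ids if placement_5[i] != placement_6[i]]
print(f"{len(moved)} of {len(ids)} documents would be looked up on a different shard")
```

If the divisor changed, most existing documents would no longer be found on the shard where they were stored, so growing an index means reindexing into a new one.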
Key Components (ctnd 2)
• Type
• Equivalent to a table on a relational database
• Each type has a list of fields
• Mapping defines how each field is analyzed
• ID
• Unique ID to identify a document (specified by the user or auto
generated by ES)
• The combination of index, type and id must be unique to be able to
identify a document
• Mapping
• Each index has mappings which define each type within an index
• Mapping can be defined explicitly or it will be generated automatically
• Analysis
• Process of converting full text to terms
• Text is split into terms according to the analyzer used
Key Components (ctnd 3)
● Analyzers are special algorithms which determine how text is
transformed into terms in an inverted index
● Analyzers execute a pipeline of transformations on an input string
to output tokens, as follows:
• Zero or more Character Filter(s)
• Strip out unwanted characters
• <H1>Elasticsearch is cool</H1> → Elasticsearch is cool
• A Tokenizer
• Breaks down string into stream of terms or tokens
• Elasticsearch is cool → Elasticsearch, is, cool
• Zero or more Token Filters
• Accept a stream of terms
• Can modify (lowercase), delete (remove stopwords) or add
tokens (synonyms)
• Synonyms extension: english queen = english monarch = british
queen = british monarch
• Elasticsearch, is, cool → “elasticsearch”, “cool”
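The pipeline above can be mimicked in a few lines: character filters, then one tokenizer, then token filters. The regular expressions and the stop-word list are illustrative assumptions, far simpler than the built-in Elasticsearch components.

```python
import re

STOPWORDS = {"is", "the", "a", "an", "and", "to", "by"}  # illustrative list

def html_strip(text):
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text):
    """Tokenizer: split the string into word tokens."""
    return re.findall(r"\w+", text)

def lowercase(tokens):
    """Token filter: modify tokens."""
    return [t.lower() for t in tokens]

def remove_stopwords(tokens):
    """Token filter: delete tokens."""
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text, char_filters=(html_strip,), tokenizer=tokenize,
            token_filters=(lowercase, remove_stopwords)):
    """Run char filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

print(analyze("<H1>Elasticsearch is cool</H1>"))
```

The order of token filters matters: lowercasing runs before stop-word removal so that "Is" and "is" are treated alike.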
Key Components (ctnd 4)
● Zen discovery module: discovers nodes and is responsible for
electing a master node
● Multi-tenancy: Multiple indices on the same cluster
● Availability:
● When a node goes down, the master restores the health of the
cluster by promoting replica shards to primaries and reallocating missing replicas
RDBMS versus Elasticsearch
● An Elasticsearch cluster can contain multiple indices (RDBMS
databases), which in turn contain multiple types (RDBMS tables).
These types hold multiple documents (RDBMS rows), and each
document has multiple fields (RDBMS columns)
Relational DBMS      Elasticsearch
Database        →    Index
Table           →    Type
Row             →    Document
Column          →    Field
Distributed Document Store (ctnd. 1)
● Creating, Indexing, and Deleting a Document
Retrieve a document
Retrieve a part of a document
Retrieve a document without metadata
Search for all employees
Searching for employees who have “Smith” in their
last name: query-string search
Search with Query DSL -Domain Specific Language
Find all employees with a last name of Smith, but we
want only employees who are older than 30
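A sketch of a Query DSL request body for this search, built as a Python dict and printed as JSON. It assumes employee documents with `last_name` and `age` fields (hypothetical field names) and uses a `bool` query with a `match` clause plus a `range` filter; the body would be sent to the `_search` endpoint of the employee index.

```python
import json

def smith_older_than(age: int) -> dict:
    """Request body: last_name matches "smith" AND age > `age`.
    The range clause sits in `filter`, so it restricts results without affecting the score."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"last_name": "smith"}},
                "filter": {"range": {"age": {"gt": age}}},
            }
        }
    }

body = smith_older_than(30)
print(json.dumps(body, indent=2))
```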
Full-Text Search: search for all employees who enjoy
rock climbing
Phrase search: match exact sequences of words
Highlight searches
Analytics: number of employees for each interest keyword
Analytics: interests of Smiths
Find the average age of employees who share a
particular interest
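A sketch of what such a request could look like: a `terms` aggregation on an `interests` field with a nested `avg` aggregation on `age`. The field names, the aggregation names and the sample employees are assumptions; the second function computes the same result locally to show what the aggregation means.

```python
from collections import defaultdict

def avg_age_per_interest_body() -> dict:
    """Request body: bucket employees by interest, then average the age per bucket."""
    return {
        "size": 0,  # return only aggregation results, no hits
        "aggs": {
            "all_interests": {
                "terms": {"field": "interests"},
                "aggs": {"avg_age": {"avg": {"field": "age"}}},
            }
        },
    }

def local_avg_age_per_interest(employees):
    """Same computation in plain Python, for illustration."""
    ages = defaultdict(list)
    for employee in employees:
        for interest in employee["interests"]:
            ages[interest].append(employee["age"])
    return {interest: sum(v) / len(v) for interest, v in ages.items()}

# Hypothetical sample data
employees = [
    {"last_name": "Smith", "age": 25, "interests": ["sports", "music"]},
    {"last_name": "Smith", "age": 32, "interests": ["music"]},
    {"last_name": "Fir", "age": 35, "interests": ["forestry"]},
]
result = local_avg_age_per_interest(employees)
print(result)
```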
Update –documents are immutable!
Retrieving Multiple Documents
Bulk Ops
Cluster
● A node is a running instance of Elasticsearch
● A cluster consists of one or more nodes with the same
cluster.name that are working together to share their data and
workload
● One node in the cluster is elected to be the master node, which handles cluster-wide changes:
• creating or deleting an index,
• adding or removing a node from the cluster
● Every node knows where each document lives and can forward the
client request directly to the nodes that hold the searched data
Adding /blogs index
Cluster Health
Green: All primary and replica shards are active.
Yellow: All primary shards are active, but not all replica shards are active.
Red: Not all primary shards are active.
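The three colors follow directly from the shard states. A minimal sketch, assuming each shard is described by a hypothetical (is_primary, is_active) pair:

```python
def cluster_health(shards):
    """shards: iterable of (is_primary, is_active) tuples."""
    shards = list(shards)
    if not all(active for primary, active in shards if primary):
        return "red"      # at least one primary shard is not active
    if not all(active for primary, active in shards if not primary):
        return "yellow"   # all primaries active, at least one replica is not
    return "green"        # every primary and replica shard is active

# Single-node cluster: the replica cannot be allocated next to its primary
print(cluster_health([(True, True), (False, False)]))
```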
Types' Mapping
Inverted Index (c.f. Information Retrieval lecture)
● Doc 1: The quick brown fox jumped over the lazy dog
● Doc 2: Quick brown foxes leap over lazy dogs in summer
● Q = quick brown
• Doc 1 is more relevant than Doc 2
● Text preprocessing
• Lowercase vs uppercase (quick, Quick)
• Singular vs plural (fox,foxes; dog, dogs)
• Same meaning (jumped, leap)
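The effect of text preprocessing can be seen by building the inverted index for the two documents with and without normalization. `crude_normalize` is a toy stand-in for a real analyzer (lowercasing plus naive plural stripping); it does not handle synonyms such as jumped/leap.

```python
import re
from collections import defaultdict

DOCS = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def build_index(docs, normalize=lambda token: token):
    """Inverted index: term -> set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text):
            index[normalize(token)].add(doc_id)
    return index

def crude_normalize(token):
    """Lowercase and strip a plural 'es'/'s' suffix (toy stemmer, illustration only)."""
    token = token.lower()
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

raw = build_index(DOCS)
normalized = build_index(DOCS, crude_normalize)

print(sorted(raw["quick"]), sorted(normalized["quick"]))
print(sorted(normalized["fox"]), sorted(normalized["dog"]))
```

The same normalization has to be applied to the query string, otherwise the query terms will not line up with the terms stored in the index.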
Inverted Index (ctnd. 1)
● Input example
• "Set the shape to semi-transparent by calling set_trans(5)"
● Elasticsearch standard analyzer
• splits the text on word boundaries,
• removes most punctuation,
• lowercases all terms
• Result: set, the, shape, to, semi, transparent, by, calling,
set_trans, 5
● English Language analyzers
• stem English words
• Removes stop words
• Result: set, shape, semi, transpar, call, set_tran, 5
Inverted Index (ctnd. 2)
● Inverted Indexes are immutable
• every commit point includes a .del file that lists which documents
in which segments have been deleted.
• when a document is updated, the old version of the document is
marked as deleted, and the new version of the document is
indexed in a new segment.
Testing the Analyzer
Refresh API
● By default, every shard is refreshed automatically once every
second
● Elasticsearch has near real-time search: document changes are not
visible to search immediately, but will become visible within 1 second
● Refresh all indices
• POST /_refresh
● Refresh blogs index
• POST /blogs/_refresh
● Refresh every 30 seconds
• PUT /my_logs/_settings
{ "refresh_interval": "30s" }
● Disable automatic refreshes
• PUT /my_logs/_settings
{ "refresh_interval": -1 }
Flush API
● translog -transaction log, records every operation in Elasticsearch
as it happens
• When a document is indexed, it is added to the in-memory
buffer and appended to the translog
• When the translog gets too big, the index is flushed: a
new translog is created, and a full commit is performed
● Flush the blogs index.
• POST /blogs/_flush
● Flush all indices and wait until all flushes have completed before
returning
• POST /_flush?wait_for_ongoing
Competitors
● Apache Solr -search engine
● Elasticsearch ecosystem is richer with Logstash, Beats and Kibana
● ES is a rich and steadily growing ecosystem
https://github.com/elasticsearch
● Commercial products: Splunk, ATTIVIO, IBM Data Explorer, HP
Autonomy
BM25
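A sketch of the textbook Okapi BM25 scoring function, with the usual parameters k1 = 1.2 and b = 0.75 (the Lucene defaults). The corpus and query are made up, and Lucene's actual implementation differs in details such as how document length is encoded.

```python
import math

def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Sum over query terms of idf * tf*(k1+1) / (tf + k1*(1 - b + b*dl/avgdl))."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs   # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)   # documents containing the term
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = doc_tokens.count(term)
        length_norm = 1 - b + b * len(doc_tokens) / avgdl
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score

# Hypothetical corpus, already tokenized
corpus = [
    "cloud computing scales cloud services".split(),
    "the cat sat on the mat".split(),
    "cloud storage for cat pictures".split(),
]
scores = [bm25_score(["cloud", "computing"], doc, corpus) for doc in corpus]
print(scores)
```

Unlike raw TF-IDF, the term-frequency part saturates: each extra occurrence of a term adds less to the score, and b controls how strongly long documents are penalized.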