
Elasticsearch: a distributed real-time search and analytics engine
Dr. Rim Moussa

Outline
● Real-World Users
● Big picture of Elastic Stack
● Information Retrieval Background
● Elasticsearch
● Screenshots
● Elastic Competitors

US National Aeronautics and Space Administration
● NASA's Curiosity rover explores Mars, the red planet, and collects data
● Curiosity's onboard sensors capture the temperature of the Martian surface, atmospheric composition, ...
● The challenge: analyze telemetry data from the Curiosity rover, 150 million miles away, using real-time analytics and visualization with Elastic
● NASA Soil Moisture Active Passive Project
● SMAP is designed to measure soil moisture over a three-year period,
every 2-3 days in the top 5 cm of soil everywhere on Earth’s surface
● SMAP will produce global maps of soil moisture. Scientists will use
these to help improve our understanding of how water and carbon (in
its various forms) circulate
● For more information
● Tom Soderstrom and Dan Isla:
Exploring Space Through Streaming Analytics
● Dan Isla and Ricky Ma:
NASA: Unlocking Interplanetary Datasets with Real-Time Search
Other Stories
● Uber and Lyft
● Uber: Engineering Uber Predictions in Real Time with ELK
● Lyft: Lyft's Wild Ride from Amazon ES to Self-Managed Elasticsearch
● Users and drivers register on the Uber/Lyft platform
● Real-time location data is recorded in ELK
● The key value is finding an available ride within seconds
● Match the user's location with available drivers nearby: k nearest neighbors (kNN)
● Tinder: using the Elastic Stack to make connections around
● Tinder connects more people in real time than any other mobile application in the world.
● Cisco: Elasticsearch on Cisco Unified Computing System
● Analyzes logs produced by networking equipment
● All the Data That’s Fit to Find: Search @ The New York Times
● Digitized newspaper content is stored and indexed in the ELK stack
Other Stories
● eBay: Elasticsearch Performance Tuning Practice at eBay
● Search functionality: find items by their attributes and return results fast, before stock runs out or auctions end (see video)
● HotelTonight scales to millions of users with Elasticsearch as a service
● Best price match for hotel accommodation based on customers' travel dates (see video)
● Groupon: Extended Custom Scripting at Groupon
● Proposes and scores coupons and deals for customers, and lets users search for deals in their vicinity
● Wikipedia: Loading Wikipedia's Search Index For Testing
● Wikipedia uses Elasticsearch to provide full-text search with highlighted search snippets, and search-as-you-type and did-you-mean suggestions.
● Exploring the Stack Overflow dataset with Elastic Graph
● Stack Overflow combines full-text search with geolocation queries and uses more-like-this to find related questions and answers.
● GitHub uses ES to index over 8 million code repositories
Elastic Stack
● A bundle of free and open-source technologies
● Logstash is a data-collection, log-enrichment and parsing engine
● Event processing pipeline
● It collects data or logs from various sources of an IT infrastructure
● It enriches events or log streams and stores them as JSON documents in Elasticsearch
● Beats
● Lightweight Data Shippers
● They send data from hundreds or thousands of machines to Logstash
for transformation and parsing or directly to Elasticsearch.
● Elasticsearch is a search engine built on top of Apache Lucene
● Distributed document repository
● It stores and analyzes JSON docs
● It provides distributed and full-text search with a RESTful interface and
schema-free JSON documents.
● Kibana is a rich web-based application that generates real-time visualizations.
The Stack Big Picture

Process Flow: big picture

Source: https://www.elastic.co/guide/en/logstash/current/deploying-and-scaling.html
Elastic Stack: Strong points
● Scalability
● Supports the 3 V's (volume, velocity, variety) and splits large indices across a cluster of servers
● Elasticity: Starts small and scales horizontally by adding nodes
● Reliability
● Detects failed nodes
● All changes are recorded in transaction logs on multiple nodes
● Flexibility
● Elasticsearch offers full-text search through query APIs that support multilingual search, auto-complete, geolocation, result snippets and contextual suggestions.
● Logstash connects to multiple data sources such as files and databases. It unifies all data streams, for example logging data from web servers, data from IoT devices or binary files uploaded by users.
● Automated and easy-to-use
● Automatically indexes JSON docs, making them searchable
● RESTful API: rich JSON APIs are accessible over HTTP through a RESTful interface
● User Friendly
• Create charts, plots, histograms and maps for better insight into the data
Information Retrieval
● Main goal: efficiently query free text à la Google
● cf. the Information Retrieval book
• Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
● Open-source technologies
● Apache Lucene: a mature, open-source, high-performance, scalable, lightweight yet very powerful library written in Java. It indexes documents and searches them with full-text search capabilities
● Terminology
• Collection: set of documents
• Document: contains one or more fields.
• Field: is built of two parts: the name and the value.
• Term: is a unit of search representing a word from a field text.
• Token: is an occurrence of a term from the text of the field. It consists
of term text, start offset, end offset, and a type.

Full Text Queries
● Boolean search
• Simply checks whether a term occurs in the document.
• For example, Q = cloud AND computing, where AND is a Boolean operator, matches every document that mentions both “cloud” and “computing”.
• Q = cloud AND NOT computing matches every document that mentions “cloud” but does not mention “computing” anywhere.
● Phrase search
• Unlike plain Boolean search, we need to know not only that a term occurs in the document, but also where it occurs
• The full-text index therefore stores term positions within documents as well.
● Proximity search
• Matches documents where the terms occur within a given distance of each other
• Q = cloud /k computing with k = 5
● Field-based search
• Documents may have more than one field, and programmers frequently want to limit parts of a search to a given field. A field can also be boosted.
• e.g. find all email messages from someone named Peter (boost: 2) that mention MySQL in the subject line (boost: 1)
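A minimal Python sketch of how a positional inverted index can answer the Boolean, phrase and proximity queries described above. The four sample documents and all helper names are illustrative assumptions, and tokenization is a plain whitespace split rather than a real analyzer.

```python
from collections import defaultdict

# Hypothetical sample documents (not taken from the slides).
DOCS = {
    1: "cloud computing scales on demand",
    2: "a cloud passed over the city",
    3: "computing in the cloud is elastic",
    4: "the cloud provider sells storage networking analytics and also computing",
}

def build_positional_index(docs):
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def docs_with(index, term):
    """Boolean lookup: the set of documents that contain the term."""
    return set(index.get(term, {}))

def phrase(index, first, second):
    """Phrase search: documents where `second` immediately follows `first`."""
    both = docs_with(index, first) & docs_with(index, second)
    return {d for d in both
            if any(p + 1 in index[second][d] for p in index[first][d])}

def near(index, t1, t2, k):
    """Proximity search: documents where t1 and t2 are at most k positions apart."""
    both = docs_with(index, t1) & docs_with(index, t2)
    return {d for d in both
            if any(abs(p1 - p2) <= k
                   for p1 in index[t1][d] for p2 in index[t2][d])}

INDEX = build_positional_index(DOCS)
# Q = cloud AND computing
cloud_and_computing = docs_with(INDEX, "cloud") & docs_with(INDEX, "computing")
# Q = cloud AND NOT computing
cloud_and_not_computing = docs_with(INDEX, "cloud") - docs_with(INDEX, "computing")
```

Boolean operators reduce to set intersection and difference, while phrase and proximity search need the stored positions, which is why a full-text index keeps them.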
Full Text Queries (ctnd. 1)
● Exact word query (term query)
• The term query finds documents that contain the exact term
• Q = user:"Joseph"
● Wildcard Query
● Matches documents that have fields matching a wildcard expression
● Q1 = user:"Jo*h" ; Q2 = user:"Jose?h", where * matches any character sequence and ? matches any single character
● Fuzzy matching
● The fuzzy query uses similarity based on Levenshtein edit distance.
● d(hot,hat)=1, d(cloud,could)=2, d(cat,dog)=3
● Parameters
• fuzziness: the maximum edit distance. Defaults to AUTO.
• prefix_length: the number of initial characters which will not be “fuzzified”. Defaults to 0.
• max_expansions: the maximum number of terms that the fuzzy query will expand to. Defaults to 50.
• transpositions: whether fuzzy transpositions (ab → ba) are supported. Default is false.
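The wildcard and fuzzy matching rules above can be tried out with a short Python sketch. It uses the standard-library `fnmatchcase` as a stand-in for wildcard matching (its `*` and `?` have the same meaning) and a textbook dynamic-programming Levenshtein distance; neither is how Lucene implements these queries internally.

```python
from fnmatch import fnmatchcase

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

def fuzzy_match(term, candidate, fuzziness):
    """True when candidate is within `fuzziness` edits of term."""
    return levenshtein(term, candidate) <= fuzziness

# Wildcard examples from the slide.
jo_star_h = fnmatchcase("Joseph", "Jo*h")
jose_q_h = fnmatchcase("Joseph", "Jose?h")
```

The distances reproduce the slide's examples: one edit for hot/hat, two for cloud/could and three for cat/dog.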
Full Text Queries (ctnd. 2)
● Phonetic matching Query
● searching for words that sound similar, even if their spelling differs.
● Soundex algorithm, Metaphone algorithm ….
● Query boosting
● The normal clause has the default neutral boost of 1.0.
● The urgent query clause has a boost of 2.0, meaning it is twice as
important as the query clause for normal.
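For the phonetic matching mentioned above, here is a minimal Python sketch of American Soundex, one of the two algorithms the slide names. It is an illustration of the classic rules, not the implementation used by any Elasticsearch plugin, and it assumes a non-empty alphabetic input word.

```python
# Consonant groups of American Soundex; vowels, y, h and w have no code.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}

def soundex(word):
    """Return the 4-character Soundex code (letter + 3 digits) of a word."""
    word = word.lower()
    first = word[0]
    digits = []
    prev = CODES.get(first, "")        # code of the previously seen letter
    for ch in word[1:]:
        if ch in "hw":
            continue                   # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)        # new consonant sound
        prev = code                    # a vowel resets prev to ""
    return (first.upper() + "".join(digits) + "000")[:4]
```

Words that sound alike, such as Robert and Rupert, receive the same code and can therefore be matched even though their spelling differs.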

Stemming & Lemmatization to reduce words to their
root forms
● The goal of both stemming and lemmatization is to reduce
inflectional forms and sometimes derivationally related forms of a
word to a common base form
● Tense (pay, paid), gender (waiter, waitress), number (fox, foxes), case (I, my)
● Stemming
● Stemming usually refers to a crude heuristic process that chops off the
ends of words in the hope of achieving this goal correctly most of the
time, and often includes the removal of derivational affixes
● English stemmers include the Lovins stemmer (1968), the Porter stemmer (1980) and the Paice stemmer (1990).
● Lemmatization
● Lemmatizer: a tool from Natural Language Processing that performs full morphological analysis and determines the word sense to accurately identify the lemma of each word
● e.g.
● am, are, is → be ; car, cars, car's, cars' → car
● operate, operating, operates, operation, operative, operatives, operational → oper (Porter stemmer)
Data Structures
● Data structures are very important for efficient free text search
● A text field in a document is a bag of words
● Is term t in doc d?
● Binary incidence matrix: t ∈ d iff M[t,d] = 1
• good for small collections
● Distributed inverted indexes à la Google
• for each t → list of postings (d_i (offsets of t), d_j (offsets of t), ...)
● Extended full text search capabilities
● Radix trie or compact prefix-tree
● Suffix trie
● B-trees
● Inverted B-trees
● Permuterm index
● For term hello: hello$, ello$h, llo$he, lo$hel, o$hell, $hello all map to the term hello in a permuterm index
● K-gram index
● For k=3: $he, hel, ell, llo, lo$ all map to the term hello in a 3-gram index
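The permuterm and k-gram keys listed above can be generated with a few lines of Python. The `wildcard_lookup` helper is an illustrative assumption showing why the rotations are useful: a query with a single `*` is rotated so that the wildcard falls at the end and becomes a prefix lookup. A real index would store the keys in a sorted structure rather than scan the vocabulary.

```python
def permuterm_keys(term):
    """All rotations of term + '$'; each rotation maps back to the term."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def kgrams(term, k=3):
    """All k-grams of '$' + term + '$'; '$' marks the word boundaries."""
    s = "$" + term + "$"
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def wildcard_lookup(pattern, vocabulary):
    """Resolve a pattern containing exactly one '*' (e.g. 'h*o')
    by prefix-matching the rotated key against permuterm keys."""
    prefix, suffix = pattern.split("*")
    key = suffix + "$" + prefix          # 'h*o' becomes 'o$h'
    return sorted(
        term for term in vocabulary
        if any(rot.startswith(key) for rot in permuterm_keys(term))
    )
```

For example, `wildcard_lookup("h*o", ["hello", "help", "halo", "echo"])` keeps only the terms that start with h and end with o.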
Answer sets Assessment
● Precision
● The ratio of the number of relevant documents retrieved to the total
number of documents retrieved (both relevant as well as irrelevant).
● Recall
● The ratio of the number of relevant records retrieved to the total
number of relevant records in the database.
● Scoring
● With ranking, large result sets are not an issue.
● Just show the Top N results
● Doesn’t overwhelm the user
● Premise: the ranking algorithm works: More relevant results are
ranked higher than less relevant results.
● ranking function to score and rank matching documents based on
their relevance.
● TF-IDF
● BM25 (Best Matching 25): the default similarity since Apache Lucene 6.0
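The precision and recall definitions above can be written as a small Python helper. The document IDs in the usage note are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one answer set.

    retrieved: IDs returned by the search engine
    relevant:  IDs that are actually relevant in the collection
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                 # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

With `retrieved = [1, 2, 3, 4]` and `relevant = [2, 4, 5, 6, 7]`, two of the four retrieved documents are relevant (precision 0.5) and two of the five relevant documents were found (recall 0.4).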

TF-IDF
● TF-IDF scoring builds on a combination of the Vector Space Model (VSM) and the Boolean model of information retrieval.
● the more times a query term appears in a document relative to the
number of times the term appears in the whole collection, the more
relevant that document will be to the query
● TF: term frequency
● TF of a word is the frequency of a word in a document
● e.g. the term “cat” repeats 12 times in document d
● Instead of the raw term frequency: log-frequency weighting
● The log-frequency weight of term t in d is defined as follows
$w_{t,d} = 1 + \log_{10} tf_{t,d}$ if $tf_{t,d} > 0$, and $w_{t,d} = 0$ otherwise
● e.g., $tf_{t,d} \to w_{t,d}$: 0→0, 1→1, 2→1.3, 10→2, 1000→4, ...
TF-IDF (ctnd. 1)
● IDF: inverse document frequency
● Measures how significant a term is in the whole corpus. The scoring formula uses this factor to boost documents that contain rare terms.
$idf_t = \log_{10}(N / df_t)$
● N is the number of documents in the collection
● $df_t$ is the document frequency, the number of documents that t occurs in
● e.g. the term “cat” appears in 300K documents of a 10M-document corpus; $idf_{cat} = \log_{10}(10{,}000{,}000 / 300{,}000) = 1.52$
● The IDF of a term which appears in all 10M docs (the whole collection) is 0
● $\text{tf-idf}_{t,d} = (1 + \log_{10} tf_{t,d}) \times idf_t$
● Best-known weighting scheme in information retrieval
• increases with the number of occurrences within a document (term frequency)
• increases with the rarity of the term in the collection (inverse document frequency)
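A short Python sketch of the weighting scheme, assuming base-10 logarithms as in the slide's numeric examples (log-frequency weights 0→0, 1→1, 2→1.3, 10→2, 1000→4 and IDF 1.52 for “cat”). The function names are illustrative.

```python
import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(n_docs, df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / df)

def tf_idf(tf, n_docs, df):
    """tf-idf weight of a term in a document."""
    return log_tf_weight(tf) * idf(n_docs, df)

# "cat" occurs 12 times in document d and in 300K of 10M documents.
cat_weight = tf_idf(12, 10_000_000, 300_000)
```

A term that occurs in every document gets an IDF of 0, so it contributes nothing to the score however often it appears in a single document.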
VSM & TF-IDF weighted Cosine

VSM & TF-IDF weighted Cosine (ctnd.1)
● Fundamental to many operations
● (query,document) pair scoring
● document classification, document clustering, …
● Represent a set of documents as well as queries as vectors of terms
in a common vector space.
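● Calculate the cosine similarity measure
$\cos(\vec{q}, \vec{d}) = \dfrac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \dfrac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}}$
● $|\vec{q}|$ is the length of the query vector and $|\vec{d}|$ is the length of the document vector
● $q_i$ is the tf-idf weight of term i in the query.
● $d_i$ is the tf-idf weight of term i in the document.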
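A minimal Python sketch of cosine scoring over sparse term vectors, where each vector is a dict mapping a term to its tf-idf weight. The dict representation and the sample weights are assumptions made for illustration.

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse vectors {term: weight}."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    q_len = math.sqrt(sum(w * w for w in q.values()))   # length of query vector
    d_len = math.sqrt(sum(w * w for w in d.values()))   # length of document vector
    if q_len == 0 or d_len == 0:
        return 0.0
    return dot / (q_len * d_len)

query = {"cloud": 1.0, "computing": 1.0}
doc_a = {"cloud": 2.0, "computing": 2.0}    # same direction as the query
doc_b = {"cloud": 1.0}                      # shares one term
doc_c = {"search": 3.0}                     # shares no term
```

Because the vectors are length-normalized, `doc_a` scores 1.0 even though its weights are twice the query's; only the direction of the vector matters.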

Elasticsearch
● Elasticsearch is a real-time distributed search and analytics engine
● Open-source search engine built on top of Apache Lucene,
● It uses Lucene for indexing and searching
● Key characteristics
● A distributed real-time document store where every field is indexed
and searchable: all data in every field is indexed by default.
● A distributed search engine with real-time analytics
● Capable of scaling to hundreds of servers and petabytes of
structured and unstructured data
● Elasticsearch uses JavaScript Object Notation (JSON) as the serialization format for documents
Key Components
● Cluster: a collection of connected nodes
● Identified by a unique name
● Default: cluster.name: elasticsearch in elasticsearch.yml
● A node belongs to exactly one cluster
● Data node: stores data and performs data-related operations (CRUD, search, aggregations) and participates in the indexing process
● Master node: lightweight node responsible for cluster management
● Coordinating node
• Every node is implicitly a coordinating node
• Requests like search requests or bulk-indexing requests may involve data held on different data nodes.
• A search request, for example, is executed in two phases which are coordinated by the node which receives the client request
• In the scatter phase, the coordinating node forwards the request to the data nodes which hold the data. Each data node executes the request locally and returns its results to the coordinating node.
• In the gather phase, the coordinating node reduces each data node’s results into a single global result set.
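The scatter and gather phases can be sketched in Python as follows. This is a toy model under stated assumptions: each shard is an in-memory dict, the relevance score is simply the number of occurrences of the query term, and the function names are hypothetical. Real shards run Lucene queries and communicate over the network.

```python
import heapq

def search_shard(shard_docs, term, size):
    """Runs locally on a data node: return its top `size` (score, doc_id) hits."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term)    # toy relevance score
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(size, hits)

def coordinate(shards, term, size):
    """Runs on the coordinating node."""
    # Scatter: forward the request to every shard that holds data.
    partial_results = [search_shard(shard, term, size) for shard in shards]
    # Gather: reduce the per-shard results into one global top `size`.
    all_hits = (hit for hits in partial_results for hit in hits)
    return heapq.nlargest(size, all_hits)

shards = [
    {"a": "rock rock climbing", "b": "jazz"},
    {"c": "rock", "d": "rock rock rock"},
]
top_two = coordinate(shards, "rock", 2)
```

Each shard only needs to return its own top `size` hits, since no document outside a shard's local top can appear in the global top.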
Key Components (ctnd 1)
● Data is stored inside an index
● The Index is a logical namespace grouping physical primary shards of
data
● It can have 0 or more replica shards
● A Shard is a low-level worker unit that holds one slice of all the data in
the index and an instance of Apache Lucene search engine
● Supports time-based indices for data such as log entries, signal-processing data and social-network activity. Such data is characterized by searches over recent documents, and documents are never updated. One index is created per year, per month or per day...
● Applications talk to the index and not to shards
● Each document (JSON format) is stored in a single primary shard
● By default each index has 5 primary shards (in versions before 7.0; 1 since 7.0)
● The number of primary shards cannot be changed after index creation
● The replication factor can change after index creation
● A replica shard is never stored on the same node as its primary shard
● Document size limit: http.max_content_length is set to 100MB (Lucene still has a limit of 2GB)
Key Components (ctnd 2)
• Type
• Equivalent to a table in a relational database
• Each type has a list of fields
• Mapping defines how each field is analyzed
• ID
• Unique ID to identify a document (specified by the user or auto-generated by ES)
• The combination of index, type and id must be unique to be able to
identify a document
• Mapping
• Each index has mappings which define each type within an index
• Mapping can be defined explicitly or it will be generated automatically
• Analysis
• Process of converting full text to terms
• Text is split into terms according to the analyzer used
Key Components (ctnd 3)
● Analyzers are special algorithms that determine how text is transformed into terms in an inverted index
● Analyzers execute a pipeline of transformations on an input string to output tokens, as follows:
• Zero or more Character Filter(s)
• Strip out unwanted characters
• <H1>Elasticsearch is cool</H1> → Elasticsearch is cool
• A Tokenizer
• Breaks the string down into a stream of terms or tokens
• Elasticsearch is cool → Elasticsearch, is, cool
• Zero or more Token Filter(s)
• Accept a stream of terms
• Can modify (lowercase), delete (remove stopwords) or add tokens (synonyms)
• Synonyms extension: english queen = english monarch = british queen = british monarch
• Elasticsearch, is, cool → “elasticsearch”, “cool”
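The three analyzer stages can be mimicked in plain Python. This is a sketch of the pipeline idea only: the regular expressions and the tiny stopword list are assumptions, not the behavior of any built-in Elasticsearch analyzer.

```python
import re

STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}   # illustrative list

def strip_html(text):
    """Character filter: remove HTML tags before tokenizing."""
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text):
    """Tokenizer: break the string into word tokens."""
    return re.findall(r"\w+", text)

def lowercase(tokens):
    """Token filter: modify tokens."""
    return [t.lower() for t in tokens]

def remove_stopwords(tokens):
    """Token filter: delete tokens."""
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text, char_filters=(strip_html,),
            token_filters=(lowercase, remove_stopwords)):
    """Run character filters, then the tokenizer, then token filters in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenize(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens
```

Calling `analyze("<H1>Elasticsearch is cool</H1>")` walks through the same steps as the slide's example. The order of token filters matters: the stopword filter here expects lowercased input.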
Key Components (ctnd 4)
● Zen discovery module: discovers nodes and is responsible for electing a master node
● Multi-tenancy: multiple indices on the same cluster
● Availability:
● When a node goes down, the master restores the health of the cluster by locating the remaining primary and replica shards and promoting replicas where a primary was lost
RDBMS versus Elasticsearch
● An Elasticsearch cluster can contain multiple indices (RDBMS
databases), which in turn contain multiple types (RDBMS tables).
These types hold multiple documents (RDBMS rows), and each
document has multiple fields (RDBMS columns)
Relational DBMS        | Elasticsearch
-----------------------+----------------------
Database instance      | Index
Table                  | Type
Row                    | Document
Column                 | Field
Schema                 | Mapping
SQL                    | Query DSL
SELECT statement       | GET API
UPDATE statement       | POST API
INSERT statement       | PUT API
Index on demand        | Everything is indexed
Aggregation in SELECT  | Aggregations
Key Features Elasticsearch
● Indexing and searching happen in near real time
● Support for bulk and incremental indexing
● Time-series indexing for time-based streams of data
● Schema-free
● Inverted indexing
● Custom analyzers
● Keyword and metadata based search
● Enrichment: highlight query terms in search results
● Sorting
● Automatic operations happening under the hood
• Partitioning the documents into multiple different shards
• Balancing shards across the cluster
• Replicating shards to prevent data loss in case of HW failure
• Routing requests from any node to the data node that holds the data
• Integrating new nodes as the cluster grows
• Redistributing shards after a recovery
Distributed Document Store
● Routing a document to a shard
• shard = hash(routing) % number_of_primary_shards
• The routing value is an arbitrary string, which defaults to the
document’s _id but can also be set to a custom value.
● Serving queries from Primary or Replica Shards
• Every node is fully capable of serving any request.
• Every node knows the location of every document in the cluster
• Load balancing through round-robin
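The routing formula above can be sketched in Python. Elasticsearch uses its own hash function internally; `zlib.crc32` is only a deterministic stand-in chosen for this illustration, so the shard numbers it produces will not match a real cluster.

```python
import zlib

def route(routing, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards

    `routing` defaults to the document's _id in Elasticsearch.
    crc32 is an assumption standing in for the real hash function.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

def moved_documents(doc_ids, old_shard_count, new_shard_count):
    """IDs whose shard would change if the primary shard count changed."""
    return [d for d in doc_ids
            if route(d, old_shard_count) != route(d, new_shard_count)]
```

The formula also explains why the number of primary shards is fixed at index creation: calling `moved_documents` with two different shard counts shows which existing documents would no longer be found on the shard the formula points to.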

Distributed Document Store (ctnd. 1)
● Creating, Indexing, and Deleting a Document

① The client sends a create, index, or delete request to Node 1.
② Node 1 forwards the request to Node 3, where the primary copy of shard 0 is currently allocated.
③ Node 3 executes the request on the primary shard. Then it forwards the request to the replica shards on Node 1 and Node 2. Once all of the replica shards report success, Node 3 reports success to the requesting node, which reports success to the client.
Distributed Document Store (ctnd. 2)
● Replication
• Synchronous by default
• The primary shard waits for successful responses from the replica shards before returning
• If replication is set to async, success is returned to the client as soon as the request has been executed on the primary shard. The request is still forwarded to the replicas, but you will not know whether the replicas succeeded.
● Consistency
• By default, the primary shard requires a quorum of shard copies
to run a write operation. This is to prevent writing data to the
“wrong side” of a network partition
• quorum = int( (primary + number_of_replicas) / 2 ) + 1
● Timeout
• By default timeout = 1min
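The quorum formula above, transcribed directly into Python so that a few replica settings can be checked by hand. The function name and default argument are illustrative.

```python
def quorum(number_of_replicas, primary=1):
    """quorum = int((primary + number_of_replicas) / 2) + 1

    Number of shard copies (primary included) that must be
    available before a write operation is allowed to proceed.
    """
    return int((primary + number_of_replicas) / 2) + 1

# Required copies for 0 to 4 replicas per primary shard.
required = {replicas: quorum(replicas) for replicas in range(5)}
```

With 2 replicas there are 3 copies in total and 2 of them must be available, so a write can still proceed after losing one copy but not on the minority side of a network partition.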
Insert New Employees

Retrieve a document

Retrieve a part of a document

Retrieve a document without metadata

Search for all employees

Searching for employees who have “Smith” in their last name: query-string search

Search with Query DSL (Domain Specific Language)

Find all employees with a last name of Smith, but only those older than 30
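The screenshot for the last query is not reproduced here. As a hedged sketch, the request body for "last name Smith, older than 30" could be built in Python as shown next. The field names `last_name` and `age` are assumptions about the employee documents, and the `bool` query with a `filter` clause is the form used by Elasticsearch 2.0 and later; the original slide may have used a different syntax.

```python
import json

def smiths_older_than(min_age, last_name="smith"):
    """Build a Query DSL body: full-text match on the last name,
    combined with a range filter on the age (filters do not affect scoring)."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"last_name": last_name}},
                "filter": {"range": {"age": {"gt": min_age}}},
            }
        }
    }

# JSON text that would be sent as the body of a _search request.
body = json.dumps(smiths_older_than(30), indent=2)
```

The query-string search of the previous slide is compact but limited; the Query DSL expresses the same search as a JSON structure that can be combined with filters like this one.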
Full-Text Search: search for all employees who enjoy rock climbing

Elasticsearch sorts matching results by their relevance score

Phrase search: match exact sequences of words

Highlight searches

Analytics: number of employees for each interest keyword

Analytics: interests of Smiths

Find the average age of employees who share a particular interest
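To make the analytics slides concrete, here is a plain-Python sketch of what such an aggregation computes: one bucket per interest keyword, with a document count and an average age per bucket. The three employee records and the field names are hypothetical, and the code runs locally rather than calling Elasticsearch.

```python
from collections import defaultdict

# Hypothetical employee documents.
EMPLOYEES = [
    {"last_name": "Smith", "age": 25, "interests": ["sports", "music"]},
    {"last_name": "Smith", "age": 32, "interests": ["music"]},
    {"last_name": "Fir", "age": 35, "interests": ["forestry"]},
]

def interests_with_avg_age(docs):
    """Bucket documents by interest; report count and average age per bucket."""
    ages_by_interest = defaultdict(list)
    for doc in docs:
        for interest in doc["interests"]:
            ages_by_interest[interest].append(doc["age"])
    return {
        interest: {"doc_count": len(ages), "avg_age": sum(ages) / len(ages)}
        for interest, ages in ages_by_interest.items()
    }

def only_smiths(docs):
    """Restrict the analytics to employees named Smith (query + aggregation)."""
    return [d for d in docs if d["last_name"] == "Smith"]

all_buckets = interests_with_avg_age(EMPLOYEES)
smith_buckets = interests_with_avg_age(only_smiths(EMPLOYEES))
```

Combining a query with an aggregation, as in "interests of Smiths", simply means the buckets are computed over the matching documents only.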
Update: documents are immutable!

Notice that Elasticsearch has marked the old document as deleted, incremented the _version number, and set the created flag to false because a document with the same index, type, and ID already existed

Retrieving Multiple Documents

Bulk Ops
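The bulk screenshot is not reproduced here. As a sketch, the body of a bulk request is newline-delimited JSON: one action line followed by one source line per document, with a final newline. The Python helper below builds such a body for `index` actions; the index name and documents are hypothetical, and the `_type` metadata used by older Elasticsearch versions is left out.

```python
import json

def bulk_index_body(index, docs):
    """Build a newline-delimited JSON body for a bulk request.

    docs: {doc_id: source_dict}. Each document becomes two lines:
    the action/metadata line and the document source line.
    """
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"      # the body must end with a newline

body = bulk_index_body("blogs", {
    "1": {"title": "First post"},
    "2": {"title": "Second post"},
})
```

Sending many operations in one request avoids the per-request network overhead of indexing documents one at a time.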
Cluster
● A node is a running instance of Elasticsearch
● A cluster consists of one or more nodes with the same cluster.name that work together to share their data and workload
● One node in the cluster is elected to be the master node, which
• creates or deletes an index
• adds or removes a node from the cluster
● Every node knows where each document lives and can forward the client request directly to the nodes that hold the requested data
Adding /blogs index

Cluster Health
Green: All primary and replica shards are active.
Yellow: All primary shards are active, but not all replica shards are active.
Red: Not all primary shards are active.

Cluster status is yellow because the replica shards have not been allocated to a node → the cluster is fully functional but at risk of data loss in case of hardware failure.
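The three health colors follow directly from the state of the shards. A minimal Python sketch of that rule, where each shard is described by a hypothetical `(is_primary, is_active)` pair:

```python
def cluster_health(shards):
    """Return 'green', 'yellow' or 'red' from (is_primary, is_active) pairs."""
    shards = list(shards)
    primaries_ok = all(active for is_primary, active in shards if is_primary)
    replicas_ok = all(active for is_primary, active in shards if not is_primary)
    if not primaries_ok:
        return "red"        # some data is unavailable
    if not replicas_ok:
        return "yellow"     # all data available, but not fully replicated
    return "green"

# Single-node cluster: 3 primaries are active, their 3 replicas are unassigned.
single_node = [(True, True)] * 3 + [(False, False)] * 3
```

`cluster_health(single_node)` returns "yellow", the situation described above: a replica cannot be placed on the same node as its primary, so it stays unassigned until a second node joins.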
Add an Index
● An Index is a logical namespace that points to one or more physical
shards
● A shard is a low-level worker unit that holds just a slice of all the
data in the index
● Documents are stored and indexed in shards
● Shards are allocated to nodes in your cluster
● A shard can be either a primary shard or a replica shard
● A primary shard can technically contain up to Integer.MAX_VALUE - 128 documents (2,147,483,519),
● A replica shard is just a copy of a primary shard.
● The number of primary shards in an index is fixed at the time that an
index is created,
● but the number of replica shards can be changed at any time
on a live index

Types' Mapping

Types are indexed differently

Inverted Index (c.f. Information Retrieval lecture)
● Doc 1: The quick brown fox jumped over the lazy dog
● Doc 2: Quick brown foxes leap over lazy dogs in summer
● Q = quick brown
• Doc 1 is more relevant than Doc 2
● Text preprocessing
• Lowercase vs uppercase (quick, Quick)
• Singular vs plural (fox,foxes; dog, dogs)
• Same meaning (jumped, leap)
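A small Python sketch built on the two example documents shows why text preprocessing matters. It builds an inverted index twice, once on raw tokens and once on lowercased tokens, and scores Q = quick brown by the number of query terms each document contains. The scoring rule is a deliberately simple assumption.

```python
from collections import defaultdict

DOCS = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def inverted_index(docs, normalize=lambda token: token):
    """Map each (normalized) term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

def matching_terms(index, query_terms):
    """For each document, count how many query terms it contains."""
    counts = defaultdict(int)
    for term in query_terms:
        for doc_id in index.get(term, ()):
            counts[doc_id] += 1
    return dict(counts)

raw = inverted_index(DOCS)
lowered = inverted_index(DOCS, str.lower)
raw_scores = matching_terms(raw, ["quick", "brown"])
lowered_scores = matching_terms(lowered, ["quick", "brown"])
```

On the raw index Doc 1 matches both terms and Doc 2 only "brown", because "Quick" and "quick" are different terms. After lowercasing both documents match both terms, while "fox" and "foxes" still remain separate until stemming is applied.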

Inverted Index (ctnd. 1)
● Input example
• "Set the shape to semi-transparent by calling set_trans(5)"
● Elasticsearch standard analyzer
• splits the text on word boundaries,
• removes most punctuation,
• lowercases all terms
• Result: set, the, shape, to, semi, transparent, by, calling,
set_trans, 5
● English Language analyzers
• stem English words
• Removes stop words
• Result: set, shape, semi, transpar, call, set_tran, 5

Inverted Index (ctnd. 2)
● Inverted Indexes are immutable
• every commit point includes a .del file that lists which documents
in which segments have been deleted.
• when a document is updated, the old version of the document is
marked as deleted, and the new version of the document is
indexed in a new segment.
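The update behavior described above can be modeled with a toy Python class. It is a conceptual sketch only: segments are plain dicts that are never modified once appended, and a set of `(segment number, document ID)` pairs plays the role of the .del file. Real Lucene segments and commit points are far more involved.

```python
class ToyIndex:
    """Append-only segments plus a deletion list."""

    def __init__(self):
        self.segments = []     # each segment: {doc_id: source}, never modified
        self.deleted = set()   # (segment_no, doc_id) pairs, like a .del file

    def index(self, doc_id, source):
        """Index a new version: mark older live versions deleted,
        then write the new version into a brand-new segment."""
        for seg_no, segment in enumerate(self.segments):
            if doc_id in segment:
                self.deleted.add((seg_no, doc_id))
        self.segments.append({doc_id: dict(source)})

    def get(self, doc_id):
        """Return the live version of a document, or None."""
        for seg_no in range(len(self.segments) - 1, -1, -1):
            segment = self.segments[seg_no]
            if doc_id in segment and (seg_no, doc_id) not in self.deleted:
                return segment[doc_id]
        return None

idx = ToyIndex()
idx.index("1", {"title": "draft"})
idx.index("1", {"title": "final"})    # update = delete mark + new segment
```

After the update the old version still physically exists in the first segment; it is only filtered out at read time through the deletion list.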

Testing the Analyzer

Refresh API
● By default, every shard is refreshed automatically once every
second
● Elasticsearch has near real-time search: document changes are not visible to search immediately, but become visible within 1 second
● Refresh all indices
• POST /_refresh
● Refresh the blogs index
• POST /blogs/_refresh
● Refresh every 30 seconds
• PUT /my_logs/_settings
{ "refresh_interval": "30s" }
● Disable automatic refreshes
• PUT /my_logs/_settings
{ "refresh_interval": -1 }
Flush API
● translog: the transaction log, which records every operation in Elasticsearch as it happens
• When a document is indexed, it is added to the in-memory buffer and appended to the translog
• When the translog gets too big, the index is flushed: a new translog is created and a full commit is performed
● Flush the blogs index
• POST /blogs/_flush
● Flush all indices and wait until all flushes have completed before returning
• POST /_flush?wait_for_ongoing
Competitors
● Apache Solr: search engine
● The Elasticsearch ecosystem is richer, with Logstash, Beats and Kibana
● ES is a rich and steadily growing ecosystem
https://github.com/elasticsearch
● Commercial products: Splunk, ATTIVIO, IBM Data Explorer, HP Autonomy
BM25

● N is the total number of documents available in the dataset.
● $df_t$ is the number of documents that contain the term t.
● k (k1) is the saturation parameter and controls how quickly an increase in term frequency results in term-frequency saturation. The default value is 1.2. Lower values result in quicker saturation, and higher values in slower saturation.
● b is the length parameter which controls how much effect field-length normalization should have. A value of 0 disables normalization completely, and a value of 1 normalizes fully. The default is 0.75.
● l(d) is the number of tokens in the document.
● avgdl is the average document length in the corpus (complete dataset).
● $f_{t,d}$ is the frequency of term t in the document.
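The slide's BM25 formula is not reproduced in this text, so the Python sketch below rests on an assumption: it uses the commonly stated per-term form, idf × f·(k1 + 1) / (f + k1·(1 − b + b·l(d)/avgdl)), with the IDF variant ln(1 + (N − df + 0.5)/(df + 0.5)). The parameter names follow the symbol list above; the sample numbers in the usage note are invented.

```python
import math

def bm25_term_score(f_td, df_t, n_docs, doc_len, avgdl, k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document's score.

    f_td:    frequency of the term in the document
    df_t:    number of documents containing the term
    n_docs:  total number of documents (N)
    doc_len: number of tokens in the document, l(d)
    avgdl:   average document length in the corpus
    """
    idf = math.log(1 + (n_docs - df_t + 0.5) / (df_t + 0.5))
    length_norm = 1 - b + b * doc_len / avgdl
    return idf * f_td * (k1 + 1) / (f_td + k1 * length_norm)

def bm25_score(query_terms, term_stats, n_docs, doc_len, avgdl):
    """Sum the per-term scores; term_stats maps term -> (f_td, df_t)."""
    return sum(
        bm25_term_score(f_td, df_t, n_docs, doc_len, avgdl)
        for term, (f_td, df_t) in term_stats.items()
        if term in query_terms
    )
```

Unlike the log-frequency TF weight, the term-frequency part saturates: going from 1 to 2 occurrences adds more than going from 10 to 11, and the per-term score never exceeds idf × (k1 + 1). With b = 0 the document length has no effect on the score.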
