How We Improved Our Performance Using ElasticSearch Plugins - Part 2 - by Xiaohu Li - Tinder Tech Blog - Medium
Problem
When we query ES to fetch recommendations to serve, we need to send a list
of users to skip. For example, users that you have already seen recently and
users that you are already matched with should not be recommended to you
again. This skip list can be reasonably high for very active users. We use the
terms query on ES for the skip list.
{
  "query": {
    "bool": {
      "must": [...],
      "must_not": [
        {
          "terms": {
            "user_number": [1, 2, 3, ...]
          }
        },
        ...
      ]
    }
  }
}
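To make the problem concrete, here is a small Python sketch (illustrative only; the field names mirror the request shape above) that builds this request for skip lists of different sizes and measures the resulting JSON payload:

```python
import json

def build_query(skip_ids):
    # Build the recommendation query with the skip list expressed as a
    # "terms" clause under must_not, mirroring the request shape above.
    return {
        "query": {
            "bool": {
                "must_not": [
                    {"terms": {"user_number": skip_ids}}
                ]
            }
        }
    }

small = len(json.dumps(build_query(list(range(100)))))
large = len(json.dumps(build_query(list(range(50_000)))))
print(small, large)  # the payload grows linearly with the skip list
```

Every id in the skip list is shipped verbatim in the request and matched on the data nodes, which is why both request size and query cost scale with the list.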
However, we suspected that the “terms” query was inefficient for very large
lists. We conducted a performance test using queries with skip lists of
different sizes using the “terms” filter. From the results below, performance
and skip list size have a clear inverse relationship.
p50 latency comparison with different skip list sizes (terms filter)
Solution
Fundamentally, the solution is to find an alternative to using the terms
query. Our idea was to send a serialized skip list using a compressed data
structure, which could then be deserialized and used on the ES server.
Assuming that the serialization and deserialization overhead is acceptable,
not only would this reduce latency by avoiding a large terms query, but it
could also greatly reduce the size of our query requests.
Now that we are familiar with the usage of the ES plugin, we thought
about how we could leverage it to optimize the skip list. In addition to adding
a new script that could be used by our LoaderPlugin, another possibility was
to add a new custom API using an ActionPlugin (similar to what we did for
observability in Part 1). We will cover implementation details and tradeoffs
below.
Plugin Types
ActionPlugin
To use the serialized skip list through a custom API, which we will call
“_newsearch”, the following steps are required.
1. Serialize the skip list on the client.
2. Send a query to ES using the _newsearch API and pass in the serialized
list.
3. In the ES cluster, the query node sends a search query to the data node
without the skip list. The requested document count is equal to the
requested document count sent by the client plus the size of the skip list
because the skip list will be applied on the query node.
4. Receive the ranked documents in the query node. Deserialize the skip
list. Include documents that are not in the skip list up to the requested
size sent by the client and return to the client.
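The query-node side of steps 3 and 4 can be sketched as follows. This is a minimal Python sketch, not the actual plugin code; `newsearch`, `search_fn`, and the document ids are hypothetical names for illustration:

```python
def newsearch(client_size, skip_list, search_fn):
    # Over-fetch from the data nodes: request enough documents so that
    # client_size results can survive the skip filtering on the query node.
    ranked = search_fn(size=client_size + len(skip_list))
    skip = set(skip_list)
    # Apply the skip list only after ranking, then trim to the client's size.
    return [doc for doc in ranked if doc not in skip][:client_size]

# Hypothetical ranked results returned by the data nodes (best first).
ranked_docs = [5, 9, 2, 7, 1, 8]
print(newsearch(3, [9, 7], lambda size: ranked_docs[:size]))  # → [5, 2, 1]
```

The over-fetch in `search_fn(size=...)` is exactly the "increased load" tradeoff listed below: every skipped document still had to be ranked before being discarded.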
Pros:
- Easy to implement
Cons:
- Unnecessary processing: the skip logic occurs after the sorting phase, so the relevance factor is still calculated for documents in the skip list
- Increased load: potentially needs to rank extra documents, which may be heavy for queries with large skip lists
- Increased load on query nodes, since they need to deserialize and apply the skip list
- Updates require a cluster restart, since this approach does not take advantage of the LoaderPlugin
LoaderPlugin
To use the serialized skip list by leveraging the LoaderPlugin from Part 1, we
will need to add a new script to deserialize and apply the skip list. This new
script uses the following workflow.
1. Serialize the skip list on the client.
2. Send a query to ES through the standard _search API. Send the serialized
skip list through “params” in the request. Specify a script that uses a skip
list deserializer in the “source” field. Add a “min_score” (a standard ES
field) parameter to the query (used in the next step). Here is an example:
{
  "min_score": -100000,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": [
            ...
          ]
        }
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "params": {
                "key1": value1,
                ...
              },
              "source": "my_bitmap_script",
              "lang": "tinder_scripts"
            }
          }
        }
      ]
    }
  }
}
3. On the data node, the skip list will be deserialized. For documents that
should be skipped, the script will return a relevance factor lower than
min_score, so they will be omitted.
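The scoring trick in step 3 can be sketched like this. It is an illustrative Python sketch of the idea; the real script runs inside ES, and the function and variable names here are hypothetical:

```python
MIN_SCORE = -100000

def skip_aware_score(user_number, skip_set, base_score):
    # Documents in the skip list receive a score below min_score,
    # so ES discards them before returning results to the query node.
    if user_number in skip_set:
        return MIN_SCORE - 1
    return base_score

skip = {42, 99}
scores = [skip_aware_score(u, skip, 1.0) for u in (7, 42, 13)]
print(scores)  # → [1.0, -100001, 1.0]
```

Because the filtering happens during scoring on the data nodes, no extra documents need to be fetched or ranked on the query node.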
Pros:
- Reduced load on data nodes, since deserializing a skip list is much faster than evaluating a large terms filter
Cons:
ActionPlugin vs. LoaderPlugin: p50 latency (50k skip list size, bitmap serialization)
While the latency was similar at lower QPS, the difference is quite obvious at
125 QPS. As we originally expected, the LoaderPlugin yielded much better
performance.
Data structures
Now that we had determined which plugin implementation to use, we still
had to decide which data structure to use to serialize the skip list. To help us
make a decision, we conducted more performance tests comparing the
following data structures.
- Hash set
- Bloom filter
- Roaring bitmap
The size of the serialized skip list has a potential impact on ES network
latency since it will be included in the request. Using a skip list of size 10
million, the serialized skip list size of each implementation is shown below.
Since a standard hash set is not designed for compression, it was expected to
be larger than the raw values in the terms list. Conversely, the bloom filter
and roaring bitmap produced serialized skip lists that were much smaller.
Although a smaller size will result in reduced network bandwidth usage, it
may not have any correlation with reduced latency or cluster load.
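The size gap can be illustrated with a toy bitset. This is not an actual Roaring bitmap, which additionally compresses runs and sparse containers per 16-bit chunk, but it shows why bit-packing beats a raw id list for dense id ranges:

```python
import json

def serialize_bitset(ids):
    # One bit per possible user id: dense id ranges pack ~8 ids per byte,
    # versus several bytes per id in a plain JSON terms list.
    bits = 0
    for i in ids:
        bits |= 1 << i
    return bits.to_bytes((bits.bit_length() + 7) // 8, "little")

ids = list(range(0, 100_000, 2))            # 50k ids in a dense range
raw_size = len(json.dumps(ids).encode())    # size as a plain terms list
packed_size = len(serialize_bitset(ids))    # size as a bitset
print(raw_size, packed_size)
```

For sparse id spaces a plain bitset degrades, which is exactly the case Roaring bitmap's hybrid containers are designed to handle.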
Therefore, we implemented each data structure using a LoaderPlugin script
and tested latency using various skip list sizes.
Below are the results when comparing the data structures using a skip list of
size 10k.
It was clear that bloom filter and bitmap were much better than the rest of
the pack. We did more performance testing comparing those two with larger
skip lists. Below are the results when using a skip list of size 40k.
Even though bloom filter is slightly faster than bitmap, it has the issue of
false positives. Since the difference in performance is small, we decided to
use bitmap because it does not impact our business logic.
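For context on the false-positive issue, here is a minimal Bloom filter sketch using only the standard library (illustrative; the class name and parameters are arbitrary choices, not what we ran in production). A Bloom filter can report an id as present when it was never added, so a user who should have been recommended could be wrongly skipped, but it never misses an id that was actually added:

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May return True for items never added (false positive),
        # but never False for items that were added (no false negatives).
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

skip = BloomFilter()
for user in range(1000):
    skip.add(user)
print(all(user in skip for user in range(1000)))  # → True: no false negatives
```

A false-positive rate can be tuned down with more bits and hashes, but never to zero, which is why a structure with exact membership semantics fit our business logic better.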
Final decision
In the end, we decided to adopt RoaringBitmap as the data structure and
implement it as a LoaderPlugin.
Impact
By quickly iterating through our ES plugin development cycle, we were able to
validate the functionality of this plugin in production while keeping the hit size
intact, and roll it out 100% transparently to our valued users.
We saw around a 35% and 50% CPU utilization drop for query nodes and data
nodes, respectively.
Query Nodes
Data Nodes
We are very excited to announce that by releasing this plugin, we are not
only able to provide a better user experience while keeping business logic
intact, but also gain significant headroom in our cluster capacity for future
growth.
Summary
After building out a framework for plugin dynamic loading and iterating, we
pushed our cluster’s performance to the next level by actively identifying our
current bottlenecks, investigating and testing different options, and finally
delivering benefits to our end users. A few key findings we gathered during this
process:
This concludes our latest innovation on how we operate our ES cluster and
make it hyper-scaled, which is just one of many engineering challenges we
are tackling at Tinder. If you are interested in challenging yourself and want
to work with talented teammates, please take a look at our job website for
openings.