
How We Improved Our Performance Using ElasticSearch Plugins: Part 1

Xiaohu Li · Published in Tinder Tech Blog · 12 min read · Sep 5, 2019

Written by: Pierre Poitevin, Senior Software Engineer | Daniel Geng, Software Engineer | Xiaohu Li, Engineering Manager

Problems
The Tinder Eng team has recently been working on integrating machine learning (ML) algorithms into the Tinder recommendation system. This system provides users with recommendations, which they can then like or pass on using the Swipe Right and Swipe Left features. The recommendation system is discussed in the blog post Powering Tinder® — The Method Behind Our Matching.

To start with, we came up with several potential options, but they all relied on many more features (or user characteristics) than the other algorithms we were using at the time. When we tested these ML algorithms, they were not as fast as the non-ML ones: it took much longer for Elasticsearch (ES) to return results on the many features we were querying. Moreover, Painless scripts, which we used for the ML queries, have a hard limit of 16384 characters (this had been changed to a configurable limit by the time of this writing, but was not when we were working on it), which we were closely approaching. We also noticed that Painless had other issues, such as not supporting static variables or methods, which led to a performance penalty because it forced ES to re-instantiate the same objects over and over.

To solve the character limit issue, we tried splitting a large script into multiple smaller scripts, but noticed that query performance got worse when we did. We were aware of an alternative to Painless scripts: ES plugins, which allow us to install new functionality on the ES side. That way, we could put the functions in Java code and install that instead of using Painless scripts. However, we could not afford to use the plugin functionality as is, because each update to a plugin would require a complete cluster restart, which is not only costly, but also reduces the reliability and operability of our systems.

Goals
Our goal in this project was to improve the current recommendation system so that it could support the ML algorithms without a performance penalty. In addition, we wanted to be able to iterate on new algorithms often, and for updates to be painless.

Solution

Main idea
To overcome the character limit and performance issues, our main idea was to leverage the speed of an ES plugin for the recommendation logic. Since we couldn't afford to restart the cluster often, our second idea was to design a system able to add new matching algorithms, and update existing ones, without a mandatory restart.

Architecture
In this section, we provide some background on Java and ES, and how we
leveraged these technologies to build a script management system that can
load matching algorithms at runtime.

Background: Java, a dynamically compiled language

Like C, Java is a compiled language, but unlike C, the Java compiler doesn't transform code into a native binary; it produces bytecode instead. This bytecode is then handled at runtime by the Java Virtual Machine (JVM). The JVM allows new classes to be defined, and new versions of existing classes to be reloaded, at runtime. This is why Java is called dynamically compiled. The main class responsible for loading and defining classes in the JVM is the ClassLoader. This class can be extended to control the loading logic, and it can be called from anywhere in the code to request that a new class be loaded.

In the plugin we implemented, we used this capability to update class definitions at runtime without needing a restart.
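As a rough sketch of this mechanism (class and method names are ours, not from the Tinder plugin), a ClassLoader subclass can define classes from raw bytecode held in memory:

```java
import java.util.Map;

// Minimal sketch: a ClassLoader that defines classes from an in-memory
// map of bytecode, as one might extract from a downloaded jar.
public class ByteMapClassLoader extends ClassLoader {
    private final Map<String, byte[]> classBytes;

    public ByteMapClassLoader(Map<String, byte[]> classBytes, ClassLoader parent) {
        super(parent);
        this.classBytes = classBytes;
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        byte[] bytes = classBytes.get(name);
        if (bytes == null) {
            throw new ClassNotFoundException(name);
        }
        // defineClass turns raw bytecode into a live Class object at runtime.
        return defineClass(name, bytes, 0, bytes.length);
    }

    // Convenience wrapper used here so callers can probe without checked exceptions.
    public Class<?> loadOrNull(String name) {
        try {
            return loadClass(name);
        } catch (ClassNotFoundException e) {
            return null;
        }
    }
}
```

Classes absent from the map fall back to the parent loader, so JDK and plugin classes still resolve normally; constructing a fresh loader instance over a new byte map is what makes reloading possible.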

ES Background

Elasticsearch is the indexing system that stores the user documents we use to search and provide recommendations. ES is open source, and the different code versions can be found on GitHub. There are many ways to query Elasticsearch. One of the simplest is to store scripts, or search algorithms, in Elasticsearch, and then send queries that reference a script. That way, when the query is interpreted by Elasticsearch, it knows which algorithm to use to search and return results.

Searches in Elasticsearch happen roughly in two steps: filter and sort. In the filter step, all documents that don't match the filter criteria are excluded from the results. In the sort step, all documents that fit the filter criteria are assigned a relevance factor, ordered from highest to lowest, and put in the response to the caller.
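The two steps can be modeled as a filter pass followed by a relevance sort; a toy sketch (document shape and scoring are hypothetical):

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;
import java.util.stream.Collectors;

// Toy model of the two-step search: drop documents that fail the filter,
// then assign each survivor a relevance factor and sort highest-first.
public class TwoStepSearch {
    public static <D> List<D> search(
            List<D> docs,
            Predicate<D> filter,
            ToDoubleFunction<D> relevance) {
        return docs.stream()
                .filter(filter)                                          // filter step
                .sorted(Comparator.comparingDouble(relevance).reversed()) // sort step
                .collect(Collectors.toList());
    }
}
```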

In Elasticsearch, plugins are a way to enhance Elasticsearch's basic functionality. There are many different types of plugins; they allow you to add custom types, or to expose new endpoints in the Elasticsearch API.

The type of plugin that interested us most was the script engine, or script plugin. This type of plugin allows us to customize how the relevance assignment is done for the documents.

In the following paragraphs, we go into some detail about script plugins. We used Elasticsearch 6.3; the vocabulary, names, and logic can change from version to version and might not apply to future versions of Elasticsearch.

ScriptEngine Overview

A script plugin is essentially a run() function from the query parameters (params) and a document (lookup) to a relevance factor. For each (params, lookup) pair, ES calls run() to compute relevance with the plugin.
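The essence of that contract can be sketched as a functional interface (the interface name is ours; ES's real SearchScript classes carry more context than a plain map):

```java
import java.util.Map;

// The essential shape of a script plugin: a function from the query
// parameters ("params") and a document view ("lookup") to a relevance score.
@FunctionalInterface
public interface RelevanceScript {
    double run(Map<String, Object> params, Map<String, Object> lookup);
}
```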

To implement a ScriptPlugin, we roughly need to implement or extend each of the classes that are called when running the script for the plugin.

Notes:

- The ScriptEngine.compile() method is only called once per script; ES caches the resulting SearchScript.Factory in memory, and for all subsequent queries using the same script the cache provides the SearchScript.Factory. Since we want the script to change in some cases (for instance, a new version) and we know this method will not run again, we cannot put version-specific code in it.

- SearchScript.Factory handles the query-level search.

- SearchScript.LeafFactory handles the relevance computation for a leaf, or segment, of documents.

To be able to handle different versions of the scripts, we use the delegation pattern. In the newFactory() method of the SearchScript.Factory, we call factoryCache, our in-memory SearchScript.Factory storage system, which dynamically provides the SearchScript.Factory implementation that needs to be used.

@Override
public <T> T compile() {
    …
    scriptAndVersion = fetchScriptAndVersion(scriptSource);
    SearchScript.Factory wrappedFactory = (p, l) -> {
        SearchScript.Factory delegate = factoryCache.getScript(scriptAndVersion);
        return delegate.newFactory(p, l);
    };
    return …
}

(SearchScriptFactory.java)

In the schema we designed, the ScriptEngine is not a simple script provider, but an abstraction layer handling the routing between queries and scripts.

Load new scripts

As we saw in the previous section, we have an in-memory cache that checks the params of the query and returns the appropriate matching algorithm. The cache layer loads new recommendation scripts as they become necessary, driven by queries.

Loading a new script is equivalent to loading a Java class from a jar file that we fetch from a storage system. If the class already exists but we need a new version of it, we overload the class definition with the new one. We needed to write a custom class loader to overload classes in the current JVM with their new definitions.

For instance, let's assume the current JVM has MyScript.class (v1) and A.class (v1) defined from a previous jar. In a new jar, we have MyScript.class (v2), which depends on A.class (v2) and B.class (v2).

When we request MyScript.class from the new jar, the ClassLoader will check the new jar for the definition of MyScript.class.

Then the ClassLoader will overwrite the current definition, do the same for A.class, and add B.class from the new jar. At the end of the operation, the JVM will have MyScript.class (v2), A.class (v2), and B.class (v2).
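The end state of that reload can be sketched as a merge over class definitions, with version strings standing in for real bytecode:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the reload outcome: the new jar's class definitions replace
// existing ones, and classes only present in the new jar are added. In the
// real plugin the values are bytecode handed to a custom ClassLoader; here
// version strings make the end state visible.
public class ClassReloadDemo {
    public static Map<String, String> reload(Map<String, String> current, Map<String, String> newJar) {
        Map<String, String> result = new HashMap<>(current);
        result.putAll(newJar); // new definitions win; untouched classes are kept
        return result;
    }
}
```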

Caching scripts as necessary

Once we are done loading the script class, we store it in a cache, keyed by name and version, using the "source" field of the query to pass the name and the version we want to compute with. We used a simple Guava LoadingCache in the script manager. A cache is needed because loading a script from a jar on disk or in a storage system cannot scale to several thousand QPS. Some scripts might get deprecated, or go unused for a long period of time, and the LoadingCache supports custom eviction logic in the CacheBuilder for this purpose.
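A minimal stand-in for that cache using only the JDK (the real plugin uses Guava's LoadingCache, which adds the eviction hooks mentioned above):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal stand-in for the script cache: scripts are keyed by "name:version"
// and loaded at most once, so the expensive jar load does not happen per query.
public class ScriptCache<S> {
    private final Map<String, S> cache = new ConcurrentHashMap<>();
    private final Function<String, S> loader;

    public ScriptCache(Function<String, S> loader) {
        this.loader = loader;
    }

    public S get(String name, String version) {
        // computeIfAbsent invokes the loader only on a cache miss.
        return cache.computeIfAbsent(name + ":" + version, loader);
    }
}
```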

Here is an overview of the main classes involved in providing the right SearchScript.Factory to the ScriptEngine:

Query scope vs. document scope

In some cases, running the same code for each document wastes resources; we want to run it once per query and reuse the result when computing relevance for each document afterwards.

SearchScript.Factory handles the query scope, while SearchScript.LeafFactory handles the individual document scope. Everything that should be computed at the query level is done in SearchScript.Factory.newFactory(params, lookup) (see the computeForQuery call in the snippet). Everything related to individual document relevance, on the other hand, is done in the SearchScript.LeafFactory.newInstance() method.

private SearchScript.Factory getFactory() {
    return (params, lookup) -> new SearchScript.LeafFactory() {
        // Leaf factory definition
        private Object queryLevelComputed = computeForQuery();
        private Constructor<? extends SearchScript> constructor = getConstructor();

        @Override
        public SearchScript newInstance(LeafReaderContext context)
                throws IOException {
            return constructor.newInstance(params, lookup, context, queryLevelComputed);
        }

        /* ... */
    };
}

(SearchScript.java)

Query changes on the client side

Since params is a map sent in JSON format by the ES client, we can drive any change of behavior by changing the content of params. For instance, if the query contains the param "use_new_algorithm", we can fork and use a different matching algorithm without coupling ES to a dynamic flag system/manager.
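A sketch of that fork (the parameter name comes from the example above; the algorithm names are placeholders):

```java
import java.util.Map;

// Sketch: choose the matching algorithm from a flag in the query's params
// map, so switching behavior only requires a client-side query change.
public class AlgorithmSelector {
    public static String select(Map<String, Object> params) {
        boolean useNew = Boolean.TRUE.equals(params.get("use_new_algorithm"));
        return useNew ? "new_matching_algorithm" : "default_matching_algorithm";
    }
}
```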

Review of the development flow

In this project, one of the goals was to make sure that developing new matching algorithms wasn't too cumbersome. For the developer, the workflow to update scripts is the following:

1. Write the code, package it in a jar, and upload it to the storage system

2. Specify the new version of the script in an ES query

3. The plugin will use the correct script version and sort the results

This workflow is straightforward and fast from both operational and development perspectives. Therefore, we achieved our goal of preserving system maintainability and iteration speed.

Observability
ES is a vital part of the recommendations framework, so it is essential that the plugin is highly observable. Although ES has its own set of system metrics, there is no simple way to add our plugin-specific metrics. We use Prometheus for monitoring our microservices, so for easier operational integration it makes sense to use it for the plugin as well. For microservices, each machine hosts a Prometheus server that exposes a "_metrics" endpoint. An external client, which can access individual machines behind the load balancer, calls this endpoint and aggregates the results. However, we wanted to keep ES decoupled from third-party services such as Prometheus, so we developed a custom solution.

ES already has a set of _cat APIs included for monitoring its system metrics.
For example, if the _cat/nodes API is accessed from any query node, it will
aggregate metrics from all nodes in the cluster using TCP and return the
results. We leveraged this existing pattern by adding our own
_cat/pluginmetrics API using an ActionPlugin, which is used to create
custom APIs on ES. This way, instead of hosting a Prometheus server on each
node and requiring a client to have access to individual nodes, the
Prometheus client can simply use the new pluginmetrics API using the load
balancer endpoint. This API returns a response equivalent to querying each individual machine in the cluster, while maintaining the same format as the Prometheus server, so it was simple for the operations team to set up the monitoring.
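The response only has to match Prometheus's plain-text exposition format; a minimal formatter for counter metrics (metric names are hypothetical) might look like:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: render counters in the Prometheus text exposition format
// ("name value" lines), so a scraper can consume the aggregated response.
public class PrometheusFormatter {
    public static String format(Map<String, Double> counters) {
        StringBuilder sb = new StringBuilder();
        // TreeMap gives a deterministic, sorted line order.
        for (Map.Entry<String, Double> e : new TreeMap<>(counters).entrySet()) {
            sb.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```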

Security
We download jar data from the jar storage system. This jar has access to sensitive data that we store in Elasticsearch. Even if we control the storage, we must assume the jar might have been tampered with by the time we receive it. For security purposes, we implemented three steps that allow us to verify that the code in the jar file is from a reliable source:

- The jar is signed with a private key that is stored in a key vault

- The jar signature is verified within the Elasticsearch plugin

- The repository containing the code has a protected branch with an explicit approver

That way, we control the authenticity of the code that will be loaded at
runtime.
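As a simplified stand-in for that flow (real deployments sign the jar itself, e.g. with jarsigner, against a vault-held key), signing and verifying the jar bytes with java.security looks like:

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

// Sketch of the verification step: the jar bytes are signed with a private
// key, and the plugin checks the signature with the matching public key
// before loading any class from the jar.
public class JarVerifier {
    public static KeyPair newKeyPair() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            return gen.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static byte[] sign(byte[] jarBytes, PrivateKey key) {
        try {
            Signature sig = Signature.getInstance("SHA256withRSA");
            sig.initSign(key);
            sig.update(jarBytes);
            return sig.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static boolean verify(byte[] jarBytes, byte[] signature, PublicKey key) {
        try {
            Signature sig = Signature.getInstance("SHA256withRSA");
            sig.initVerify(key);
            sig.update(jarBytes);
            return sig.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A tampered payload fails verification, so the plugin can refuse to hand the bytes to the class loader.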

Overall architecture

Rolling out
On the very first release of our plugin, we chose to re-implement the same matching script as we had in Painless, so we could get an apples-to-apples comparison. Since the syntax of Painless is quite similar to Java, it was straightforward to convert it to native Java code with minor modifications.

Simply by doing so, we saw a solid improvement in latency, from over 500ms to less than 400ms.

It is pretty obvious that we saved some parsing time by switching from Painless to a native Java plugin; however, the main benefits actually come from the flexibility of using Java directly. For instance, in our matching script we have to perform a binary search against an array that is constant across all our queries. In the world of Painless, since there is no support for static variables, the array needs to be reconstructed for each hit, which means the JVM needs to allocate memory for it and handle GC afterwards. This is a huge waste considering the scale of our recommendation system.
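In native Java, the constant array can live in a static final field and be allocated exactly once; a sketch with hypothetical threshold values:

```java
import java.util.Arrays;

// Sketch: in native Java the constant array is allocated once as a static
// final field; Painless would have rebuilt it for every hit scored.
public class ThresholdScorer {
    private static final int[] THRESHOLDS = {10, 20, 50, 100}; // hypothetical values

    // Map a raw value to a bucket index via binary search on the shared array.
    public static int bucket(int value) {
        int idx = Arrays.binarySearch(THRESHOLDS, value);
        return idx >= 0 ? idx : -idx - 1; // insertion point for non-exact matches
    }
}
```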

Also, by leveraging the Jenkins automation pipeline, pushing a new version of a relevance function is now much more streamlined.

To control the quality of our work, we set up two pipelines, for our staging and production environments respectively, plus one more test ES cluster. Here is how they look:

Each time we would like to push a new sorting script, here is the process:

1. Manual staging test: We use the Jenkins staging pipeline to build the jar, upload it to the file storage system, and deploy our server-side code in the staging environment to invoke the newest version of the script. This step checks for any obvious syntax or loading errors in the new script, and makes sure it can be executed successfully and that the calculated relevance is as expected.

2. Large-scale data validation in a canary environment: Due to historical reasons, there are some field-level schema inconsistencies in our production cluster. For example, some date fields are ISO strings for some documents, but UNIX timestamps for others. That is why, as the next step, we set up a separate canary cluster by replicating our production data and have our server send queries to it, to make sure our scripts can handle all such edge cases.

3. Dark run in production: Once the first two steps are done, we have high confidence in the correctness of our script. However, the runtime performance, especially latency, is still unclear. To avoid running a script with long latency and hurting our user experience, we set up a dark-run step in prod that sends queries from the production server to the production Elasticsearch cluster with the script loaded, in a fire-and-forget manner. By doing this, we are able to collect performance metrics and decide whether we should fully roll it out. We usually keep the dark run going for a few days, because some performance issues (e.g., memory leaks) are more likely to be exposed over a longer run.

4. Full cutover: If all the previous steps look good, we slowly dial up the traffic to use the new query script.

Summary
We built a whole new infrastructure to support continuous development and integration of Elasticsearch plugins, one that is also highly secure and observable. Thanks to it, we are able to apply much more sophisticated matching models at runtime. However, we are not done yet: in part 2 of this blog, we will cover some of the most ingenious ideas our engineers implemented on top of this pipeline, which greatly improved our query performance. Stay tuned.
