Optimize IngestDocument FieldPath allocation #120573

Merged 8 commits into elastic:main on Jan 29, 2025

Conversation

@joegallo (Contributor) commented on Jan 22, 2025

Constructing a FieldPath from a String path requires splitting the path (via String#split) into an array of substrings (we're splitting on dots, so for example "foo.bar.baz" becomes ["foo", "bar", "baz"]).

So that's allocation of an ArrayList to hold the results as we do the scan, allocation of the Strings to hold each individual substring, and finally allocation of the resulting array at the end when the scan is finished. It's not the slowest thing ever, but it's not free. Of course the scan itself has some small CPU cost, too. (For the record, though, we go down the fast path of String#split, so it's not like we're doing regexes here, thank goodness.)
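
For illustration, here's roughly what that per-construction work looks like (the field names are just an example, not taken from the actual code):

// Roughly the work described above, performed on every FieldPath construction:
String path = "foo.bar.baz";
String[] pathElements = path.split("\\."); // an ArrayList, three substrings, and the result array
// pathElements is now {"foo", "bar", "baz"}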

We call it like it's free, however (😬). Consider the happy path of a rename processor:

Object value = document.getFieldValue(path, Object.class);
document.removeField(path);
try {
    document.setFieldValue(target, value);
    // ... (failure handling elided; this is just the happy path)

Let's imagine we loop over 1000 incoming JSON documents and run that processor. For each document we'll turn the source path into a FieldPath twice (once for the getFieldValue and once for the removeField), then we'll turn the target path into a FieldPath once (for the setFieldValue). And we do that for all 1000 documents.

Anyway... that's 3000 FieldPath constructions -- and 3000 arrays of substrings -- for a single rename processor over those 1000 documents.

This PR introduces a local static cache that holds onto previously allocated FieldPath objects and allows us to look them up by the path they represent. Returning references to already allocated FieldPath objects is way faster than allocating new ones.

The same pattern of a map that we just whack when it exceeds its size limit is applied in StringLiteralDeduplicator (introduced in #76405) and DateProcessor (see #92880) -- it works pretty well!
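
For anyone unfamiliar with that pattern, here's a minimal, self-contained sketch of the idea (the class name, size limit, and cached value type are illustrative -- the actual change caches FieldPath objects inside IngestDocument):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a size-limited cache with no eviction policy: when the map grows
// past its limit, we just clear ("whack") the whole thing and start over.
final class PathElementCache {
    private static final int MAX_SIZE = 512; // illustrative limit
    private static final Map<String, String[]> CACHE = new ConcurrentHashMap<>();

    static String[] pathElements(String path) {
        String[] cached = CACHE.get(path);
        if (cached != null) {
            return cached; // cache hit: no split, no new allocations
        }
        if (CACHE.size() > MAX_SIZE) {
            CACHE.clear();
        }
        String[] elements = path.split("\\.");
        CACHE.put(path, elements);
        return elements;
    }
}

The appeal over an LRU is that there's no per-lookup bookkeeping: in steady state the same handful of paths get requested over and over, so the occasional full clear is cheap.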

Like #120571, this doesn't improve any one ingest processor specifically; it just kinda makes them all a bit faster -- reading and writing values is what this speeds up, and most of what every processor does is read a value, do something with it, and then write the result.

This makes all of ingest on my local test benchmark faster by about 20%, but that's not necessarily indicative of normal workloads. It makes an example convert processor faster by 80%, and a remove processor faster by 70%, but a date processor only gets a smidge faster -- the former processors are mostly just shuffling values around, so the effect is outsized there, while the latter has real work to do (parsing date strings) that this doesn't make any faster.

@joegallo added the >enhancement, :Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP), Team:Data Management (Meta label for data/management team), v9.0.0, and v8.18.0 labels on Jan 22, 2025
@joegallo requested a review from nielsbauman on January 22, 2025 04:22
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine (Collaborator)

Hi @joegallo, I've created a changelog YAML for you.

@nielsbauman (Contributor) left a comment

LGTM, thanks! 🚀

@joegallo added the auto-backport (Automatically create backport pull requests when merged) label on Jan 27, 2025
@joegallo merged commit d763805 into elastic:main on Jan 29, 2025
16 checks passed
@joegallo deleted the ingest-document-field-path branch on January 29, 2025 18:38
joegallo added a commit to joegallo/elasticsearch that referenced this pull request Jan 29, 2025
@elasticsearchmachine (Collaborator)

💚 Backport successful

Branch: 8.x

@joegallo (Contributor, Author)

[screenshot: nightly ingest benchmarks, 2025-03-24]

Here's a screenshot from the nightly benchmarks -- the drop in ingest time spent in the set and remove processors really jumps out. There were also some later drops from #125051 and #125232.
