Optimize IngestDocMetadata isAvailable #120753

Merged

Conversation

joegallo
Contributor

@joegallo joegallo commented Jan 23, 2025

A good bit of the diff on the PR is fixing some src and test buglets around the code I'm changing -- it turns out the code and tests only appeared to work, but in reality they were both broken.

The actual work of this PR is that IngestDocMetadata is formalized so that all of its properties start with a leading underscore character (this was already true except in tests, and now it's true in tests, too). Since we know that all metadata properties start with a leading underscore, we can shortcut the isAvailable check (which is a map containsKey call) when the key's leading character is not an underscore and just return false.
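
A minimal sketch of the shortcut described above, not the actual Elasticsearch class. The class name `MetadataSketch`, the `Object` value type, and the sample keys are assumptions made for illustration; only the leading-underscore check followed by `containsKey` comes from the description.

```java
import java.util.Map;

// Hypothetical stand-in for IngestDocMetadata; names and types are illustrative only.
final class MetadataSketch {
    private static final char UNDERSCORE = '_';

    // Invariant assumed here: every property name starts with an underscore.
    private final Map<String, Object> properties;

    MetadataSketch(Map<String, Object> properties) {
        this.properties = Map.copyOf(properties);
    }

    boolean isAvailable(String key) {
        assert key != null && key.isEmpty() == false;
        // Keys without a leading underscore can never be metadata keys,
        // so the map lookup is skipped entirely for them.
        if (key.charAt(0) != UNDERSCORE) {
            return false;
        }
        return properties.containsKey(key);
    }

    public static void main(String[] args) {
        MetadataSketch metadata = new MetadataSketch(Map.of("_index", "idx", "_id", "1"));
        System.out.println(metadata.isAvailable("_index"));   // true: known metadata key
        System.out.println(metadata.isAvailable("message"));  // false: short-circuited, no lookup
        System.out.println(metadata.isAvailable("_unknown")); // false: lookup performed, key absent
    }
}
```

The short-circuit only changes the cost of the check, not its result: a key without a leading underscore would have missed in the map anyway, provided the invariant holds.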

On the benchmark I'm running we call this a few hundred times per document, because we guard everything in isAvailable checks inside CtxMap, so even though this is the micro-est of optimizations it does actually matter in practice. My guess is that it gives better CPU cache locality because we already have the key but the properties map itself might be elsewhere in a worse cache line or whatever -- that is, it's faster this way because we don't have to run off to the heap in the vast majority of cases now (or at least that's the story I'm telling myself, it might not actually be true).

Here's a rename processor profile from the 'before' side; the purple bits are the isAvailable invocations (we really do call this a lot!):

[Screenshot, 2025-01-23 3:04:10 PM: rename processor profile, 'before' side]

You can't actually remove a key from a map while you're in the middle
of iterating through that same map. The only tests of this code
happened to pass because of the order of the keys in the tests (I
swear I am not making this up).
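
A minimal sketch of the failure mode this commit message describes, assuming a plain `java.util.HashMap` instead of the real metadata map. The class and method names are hypothetical. It also shows one way such a loop can appear to work: if the removal happens on the last key the iterator visits, the loop ends before the iterator notices the modification. Whether that is exactly what the original tests hit is not stated in the source.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; it does not use any Elasticsearch code.
final class ClearSketch {

    // Removes keys while iterating the live key set.
    // Returns true if the loop failed with a ConcurrentModificationException.
    static boolean removeWhileIteratingThrows(Map<String, Object> map) {
        try {
            for (String key : map.keySet()) {
                map.remove(key);
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Iterates over a snapshot of the keys, so removals cannot disturb the iteration.
    static void removeViaCopy(Map<String, Object> map) {
        for (String key : new ArrayList<>(map.keySet())) {
            map.remove(key);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> three = new HashMap<>(Map.of("_a", 1, "_b", 2, "_c", 3));
        System.out.println(removeWhileIteratingThrows(three)); // true

        // With a single key the removal happens on the last visited key, so nothing is thrown.
        Map<String, Object> one = new HashMap<>(Map.of("_a", 1));
        System.out.println(removeWhileIteratingThrows(one)); // false

        Map<String, Object> copy = new HashMap<>(Map.of("_a", 1, "_b", 2, "_c", 3));
        removeViaCopy(copy);
        System.out.println(copy.isEmpty()); // true
    }
}
```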
and shortcut the map lookup in the almost-always-expected case of the
key being looked up on a CtxMap *not* having a leading underscore (and
therefore the whole isAvailable thing being an unnecessary expense).
@joegallo joegallo added the >enhancement, :Data Management/Ingest Node, Team:Data Management, v9.0.0, and v8.18.0 labels Jan 23, 2025
@joegallo joegallo requested a review from masseyke January 23, 2025 20:06
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Collaborator

Hi @joegallo, I've created a changelog YAML for you.


    private static Map<String, FieldProperty<?>> validateLeadingUnderscores(final Map<String, FieldProperty<?>> properties) {
        for (String key : properties.keySet()) {
            assert key.charAt(0) == UNDERSCORE;
Contributor Author

@masseyke assert made sense to me before I pulled this out into a method, now I'm thinking it should be an IllegalArgumentException, perhaps. What do you think?

Contributor Author

I switched it in a683c54.
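
For context on the assert versus IllegalArgumentException question: Java `assert` statements are skipped unless the JVM runs with assertions enabled (`-ea`), whereas an explicit exception is always thrown. The sketch below shows what an exception-based validator could look like. The contents of a683c54 are not shown in this thread, so the generic value type, the class name, and the message text are all assumptions.

```java
import java.util.Map;

// Hypothetical sketch; not the code from the commit referenced above.
final class ValidationSketch {
    private static final char UNDERSCORE = '_';

    // Returns the map unchanged if every key starts with an underscore,
    // otherwise throws regardless of whether assertions are enabled.
    static <V> Map<String, V> validateLeadingUnderscores(Map<String, V> properties) {
        for (String key : properties.keySet()) {
            if (key.isEmpty() || key.charAt(0) != UNDERSCORE) {
                throw new IllegalArgumentException("metadata property [" + key + "] must start with an underscore");
            }
        }
        return properties;
    }

    public static void main(String[] args) {
        validateLeadingUnderscores(Map.of("_index", "idx")); // passes
        try {
            validateLeadingUnderscores(Map.of("index", "idx"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```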

Member

It looks like the constructor that lets you pass in arbitrary properties is only used by unit tests. It seems a little odd that we have this dangerous constructor and validation around it just for the sake of unit tests.

Contributor Author

@joegallo joegallo Jan 24, 2025

fe05c56 drops the bad constructor and fusses with a handful of tests that were using the more flexible version of things. I always hated TestIngestCtxMetadata so thanks for pushing on this, because now it's gone.

@@ -150,7 +151,7 @@ public Object remove(Object key) {
     @Override
     public void clear() {
         // AbstractMap uses entrySet().clear(), it should be quicker to run through the validators, then call the wrapped maps clear
-        for (String key : metadata.keySet()) {
+        for (String key : new ArrayList<>(metadata.keySet())) { // copy the key set to get around the ConcurrentModificationException
             metadata.remove(key);
         }
         // TODO: this is just bogus, there isn't any case where metadata won't trip a failure above?
Member

We could remove this TODO now

Contributor Author

Ehhhhhhh, I rewrote the comment in ce216ff, but I didn't remove it. clear() still doesn't work in the general case, for sure, and I'm not sure it works in any particularly practical case.

        assert key != null && key.isEmpty() == false;
        // we can avoid a map lookup on most keys since we know that the only keys that are 'metadata keys' for an ingest document
        // must be keys that start with an underscore
        if (key.charAt(0) != UNDERSCORE) {
Member

I'm amazed that this makes a difference at all (and I'm really curious why), but I've seen the charts!

Member

@masseyke masseyke left a comment

I would not have guessed that this would have made a measurable performance improvement, but it seems to. And it doesn't seem to cause any harm. LGTM

For example, this won't work for IngestCtxMap because the _version
isn't nullable, and so the metadata of an ingest document cannot be
clear()-ed.
@joegallo joegallo requested a review from masseyke January 24, 2025 04:11
@joegallo
Contributor Author

joegallo commented Jan 24, 2025

fe05c56 here is big enough that I'm re-requesting review -- I don't want to sneak anything by anybody.

We no longer allow arbitrary properties, the properties map is
hardcoded. TestIngestCtxMetadata is deleted entirely, and several
tests are rewritten to account for newer slightly-less-flexible (but
way easier to reason about!) IngestDocMetadata.
@joegallo joegallo force-pushed the optimize-ingest-doc-metadata-is-available branch from f15f401 to f18ea3a Compare January 24, 2025 04:46
@joegallo joegallo merged commit 5e662c5 into elastic:main Jan 24, 2025
16 checks passed
@joegallo joegallo deleted the optimize-ingest-doc-metadata-is-available branch January 24, 2025 14:31
joegallo added a commit to joegallo/elasticsearch that referenced this pull request Jan 24, 2025
Labels
:Data Management/Ingest Node, >enhancement, Team:Data Management, v8.18.0, v9.0.0
3 participants