Do not capture ClusterChangedEvent in IndicesStore call to #onClusterStateShardsClosed #120193

Merged

@fcofdez fcofdez commented Jan 15, 2025

#108145 introduced asynchronous closing of IndexShard. Before that change, IndicesStore called deleteShardIfExistElseWhere directly when needed; now it relies on IndicesClusterStateService#onClusterStateShardsClosed to be notified when the shard is closed, and only then goes through the process of deleting the local contents. The listener was a lambda that captured the entire ClusterChangedEvent. In some cases, if the shard was too slow to close, the pending listeners kept capturing new cluster state instances, which could eventually OOM the node. This commit simply avoids capturing the entire ClusterChangedEvent.
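
To illustrate the capture problem described above, here is a minimal, self-contained sketch. It is not the Elasticsearch code: `State`, `ChangeEvent`, `CloseNotifier`, `registerLeaky`, and `registerLean` are hypothetical stand-ins, and the sketch assumes the callback only needs one small value from the event. The point is that a queued lambda keeps everything it captures reachable until it runs, so reading the needed value before creating the lambda keeps the event (and the states it references) out of the closure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

class CaptureSketch {

    /** Hypothetical stand-in for a (potentially very large) cluster state. */
    static final class State {
        final long version;
        State(long version) { this.version = version; }
    }

    /** Hypothetical stand-in for an event that references both the new and the previous state. */
    static final class ChangeEvent {
        final State state;
        final State previousState;
        ChangeEvent(State state, State previousState) {
            this.state = state;
            this.previousState = previousState;
        }
    }

    /** Hypothetical notifier: queues callbacks until a slow shard close completes. */
    static final class CloseNotifier {
        private final List<Runnable> pending = new ArrayList<>();

        void onShardsClosed(Runnable callback) { pending.add(callback); }

        int pendingCount() { return pending.size(); }

        void shardsClosed() {
            List<Runnable> toRun = new ArrayList<>(pending);
            pending.clear();
            toRun.forEach(Runnable::run);
        }
    }

    // Leaky variant: the lambda captures `event`, so the event and both states
    // it references stay reachable for as long as the callback is queued.
    static void registerLeaky(CloseNotifier notifier, ChangeEvent event, LongConsumer action) {
        notifier.onShardsClosed(() -> action.accept(event.state.version));
    }

    // Lean variant: read what the callback needs up front, so the lambda
    // captures only a primitive and the event can be collected right away.
    static void registerLean(CloseNotifier notifier, ChangeEvent event, LongConsumer action) {
        final long version = event.state.version;
        notifier.onShardsClosed(() -> action.accept(version));
    }

    public static void main(String[] args) {
        CloseNotifier notifier = new CloseNotifier();
        List<Long> seen = new ArrayList<>();
        State previous = new State(0);
        for (long v = 1; v <= 3; v++) {
            State next = new State(v);
            registerLean(notifier, new ChangeEvent(next, previous), version -> seen.add(version));
            previous = next;
        }
        notifier.shardsClosed(); // the slow close finally completes
        System.out.println(seen); // prints [1, 2, 3]
    }
}
```

Both variants behave the same when the callbacks run; they differ only in what stays reachable while the callbacks wait in the queue.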

@fcofdez fcofdez added the >bug, Team:Distributed Indexing, and :Distributed Indexing/Store labels Jan 15, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@elasticsearchmachine
Collaborator

Hi @fcofdez, I've created a changelog YAML for you.

@fcofdez
Contributor Author

fcofdez commented Jan 15, 2025

We noticed this after a production node went OOM. Inspecting the heap dump, we could see that the internal RefCountingListener used in IndicesClusterStateService#onClusterStateShardsClosed was keeping references to old cluster states.
Screenshot 2025-01-14 at 10 52 54
I also managed to reproduce this locally by forcing a slow shard close.
Screenshot 2025-01-14 at 16 54 20

There's the question of whether or not we should change how IndicesStore schedules the calls to deleteShardIfExistElseWhere, as it appears that we could end up with thousands of ShardActiveRequest instances if the shards take a while to close.
Screenshot 2025-01-13 at 16 28 45
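
As a rough illustration of the pile-up mentioned above, here is a toy model. It is not how IndicesStore is implemented and not a change proposed in this PR: `PileUpSketch` and both of its methods are hypothetical, and the model assumes that one callback is queued per cluster state update while a shard close is pending, with each callback sending one request once the close completes. The second method shows one conceivable alternative, keeping only the most recent callback, purely for comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class PileUpSketch {

    // Assumed current shape: every update queues its own callback, and all of
    // them fire when the slow close completes.
    static int requestsWithOneCallbackPerUpdate(int updates) {
        List<Runnable> pending = new ArrayList<>();
        AtomicInteger requestsSent = new AtomicInteger();
        for (int i = 0; i < updates; i++) {
            pending.add(requestsSent::incrementAndGet); // stand-in for sending one request
        }
        pending.forEach(Runnable::run); // the close finally completes
        return requestsSent.get();
    }

    // Hypothetical alternative: a newer update replaces the queued callback,
    // so at most one request is sent when the close completes.
    static int requestsWithCoalescedCallback(int updates) {
        Runnable latest = null;
        AtomicInteger requestsSent = new AtomicInteger();
        for (int i = 0; i < updates; i++) {
            latest = requestsSent::incrementAndGet;
        }
        if (latest != null) {
            latest.run();
        }
        return requestsSent.get();
    }

    public static void main(String[] args) {
        System.out.println(requestsWithOneCallbackPerUpdate(1000)); // prints 1000
        System.out.println(requestsWithCoalescedCallback(1000));    // prints 1
    }
}
```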

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

> There's the question of whether or not we should change how IndicesStore schedules the calls to deleteShardIfExistElseWhere

Yes, similar to #74149 this whole area seems rather overcomplicated and inefficient.

@fcofdez fcofdez merged commit 00bc91c into elastic:main Jan 15, 2025
16 checks passed
@fcofdez
Contributor Author

fcofdez commented Jan 15, 2025

Thanks for the review David!

Labels: >bug, :Distributed Indexing/Store, Team:Distributed Indexing, v9.0.0
3 participants