Do not capture ClusterChangedEvent in IndicesStore call to #onClusterStateShardsClosed #120193

fcofdez · 2025-01-15T12:11:15Z

#108145 introduced async close of IndexShard. Before that change, IndicesStore called deleteShardIfExistElseWhere directly when needed, but now it relies on IndicesClusterStateService#onClusterStateShardsClosed to be notified when the shard is closed before going through the process of deleting the local contents. The listener used a lambda that captured the entire ClusterChangedEvent, so if a shard was slow to close we could keep capturing new cluster state instances and eventually OOM the node. This commit avoids capturing the entire ClusterChangedEvent.
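To illustrate the capture problem described above, here is a minimal, self-contained sketch. It is not the Elasticsearch code: `ClusterState`, `ClusterChangedEvent`, and `PendingCloseListeners` are simplified hypothetical stand-ins, and the method names are invented for the example. It only shows the general idea of copying the values a callback needs into locals, so that a queued lambda does not keep the whole event (and the cluster states it references) reachable while a shard close is pending.

```java
import java.util.ArrayList;
import java.util.List;

public class ListenerCaptureSketch {

    /** Hypothetical stand-in for a (potentially large) cluster state snapshot. */
    static final class ClusterState {
        final long version;
        final byte[] payload; // represents the bulk of the state's heap footprint

        ClusterState(long version, byte[] payload) {
            this.version = version;
            this.payload = payload;
        }
    }

    /** Hypothetical stand-in for an event that references both the old and new state. */
    static final class ClusterChangedEvent {
        final ClusterState previousState;
        final ClusterState state;

        ClusterChangedEvent(ClusterState previousState, ClusterState state) {
            this.previousState = previousState;
            this.state = state;
        }
    }

    /** Hypothetical queue of callbacks that wait for a slow shard close to finish. */
    static final class PendingCloseListeners {
        private final List<Runnable> listeners = new ArrayList<>();

        void onShardsClosed(Runnable listener) {
            listeners.add(listener);
        }

        int size() {
            return listeners.size();
        }

        /** Simulates the close completing: runs every queued callback, then drops them. */
        void shardsClosed() {
            List<Runnable> toRun = new ArrayList<>(listeners);
            listeners.clear();
            toRun.forEach(Runnable::run);
        }
    }

    /** Problematic shape: the lambda captures 'event', so both states stay reachable while queued. */
    static void registerCapturingEvent(PendingCloseListeners pending, ClusterChangedEvent event, List<Long> processed) {
        pending.onShardsClosed(() -> processed.add(event.state.version));
    }

    /** Safer shape: extract what the callback needs first, so the lambda captures only a long. */
    static void registerCapturingVersionOnly(PendingCloseListeners pending, ClusterChangedEvent event, List<Long> processed) {
        final long version = event.state.version;
        pending.onShardsClosed(() -> processed.add(version));
    }

    public static void main(String[] args) {
        PendingCloseListeners pending = new PendingCloseListeners();
        List<Long> processed = new ArrayList<>();
        // Several cluster state updates arrive while the shard close is still pending.
        for (long v = 1; v <= 3; v++) {
            ClusterChangedEvent event = new ClusterChangedEvent(
                new ClusterState(v - 1, new byte[1024]),
                new ClusterState(v, new byte[1024])
            );
            registerCapturingVersionOnly(pending, event, processed);
        }
        pending.shardsClosed();
        System.out.println(processed);
    }
}
```

Both registration methods behave identically once the close completes; the difference is only in what remains reachable from the queue while the close is pending.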


elasticsearchmachine · 2025-01-15T12:12:16Z

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

elasticsearchmachine · 2025-01-15T12:12:38Z

Hi @fcofdez, I've created a changelog YAML for you.

fcofdez · 2025-01-15T12:15:23Z

We noticed this after a production node went OOM. On inspecting the heap dump, we could see that the internal RefCountingListener instances used in IndicesClusterStateService#onClusterStateShardsClosed were keeping references to old cluster states.

I also managed to reproduce this locally by forcing a slow shard close.

There's the question of whether or not we should change how IndicesStore schedules the calls to deleteShardIfExistElseWhere, as it appears that we could end up with thousands of ShardActiveRequests if the shards take a while to close.

DaveCTurner

LGTM

There's the question of whether or not we should change how IndicesStore schedules the calls to deleteShardIfExistElseWhere

Yes, similar to #74149 this whole area seems rather overcomplicated and inefficient.

fcofdez · 2025-01-15T14:16:36Z

Thanks for the review David!

fcofdez added the >bug, Team:Distributed Indexing, and :Distributed Indexing/Store labels Jan 15, 2025

elasticsearchmachine added the v9.0.0 label Jan 15, 2025

Update docs/changelog/120193.yaml

b6a23be

fcofdez requested review from DaveCTurner and henningandersen January 15, 2025 12:15

DaveCTurner approved these changes Jan 15, 2025


fcofdez merged commit 00bc91c into elastic:main Jan 15, 2025
16 checks passed
