feat: add a user-configurable timeout parameter to the `_resolve/cluster` API #120542

pawankartik-elastic · 2025-01-21T16:58:53Z

Previously, if a remote cluster was unresponsive, `_resolve/cluster` would wait until Netty stepped in and terminated the connection. We initially addressed this by switching the disconnect strategy. However, that was problematic because it defeated the whole purpose of this API call: re-establishing the connection if and when possible. We now address it by adding a user-configurable `timeout` GET parameter. This PR also reverses the problematic disconnect strategy.

Example:

GET _resolve/cluster/*:*?timeout=4s

We previously changed the disconnect strategy to fail fast if a remote is unresponsive. However, this had unintended side effects and defeated the purpose of the `_resolve/cluster` API. Instead, we now use a listener that takes a timeout value. If the remote does not respond within the specified time, an appropriate response is sent back to the user, marking that remote as unreachable.
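The idea of answering with an "unreachable" entry once a timeout elapses can be sketched with plain JDK classes. This is a minimal illustration, not the PR's implementation: the PR uses an Elasticsearch listener, while this sketch swaps in `CompletableFuture.completeOnTimeout`. The names `ResolveTimeoutSketch`, `RemoteInfo`, and `withTimeout` are hypothetical; the error text is the one reported in this PR's timeout response.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Illustrative only: none of these names come from the Elasticsearch codebase. */
final class ResolveTimeoutSketch {

    /** Hypothetical per-remote result holding two of the fields a user would see. */
    static final class RemoteInfo {
        final boolean connected;
        final String error; // null when the remote answered in time

        RemoteInfo(boolean connected, String error) {
            this.connected = connected;
            this.error = error;
        }
    }

    static final String TIMEOUT_ERROR =
        "Request timed out before receiving a response from the remote cluster";

    /**
     * Completes the pending remote call with an "unreachable" entry if it has
     * not finished within timeoutMillis; otherwise its own result is kept.
     */
    static CompletableFuture<RemoteInfo> withTimeout(
            CompletableFuture<RemoteInfo> remoteCall, long timeoutMillis) {
        return remoteCall.completeOnTimeout(
            new RemoteInfo(false, TIMEOUT_ERROR), timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

With a remote call that never completes, `withTimeout(call, 50).join()` yields an entry with `connected == false` and the timeout error; a call that has already completed keeps its own result.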

elasticsearchmachine · 2025-01-21T17:07:04Z

Hi @pawankartik-elastic, I've created a changelog YAML for you.

elasticsearchmachine · 2025-01-22T14:39:13Z

Pinging @elastic/es-search-foundations (Team:Search Foundations)

quux00

First round feedback.

In addition to specific code comments, two other notes:

I don't see any tests for this. Can we add automated tests?
Did you do any manual testing? In past experience with timeout-based returns, I've seen errors/instability in Elasticsearch when the transport action ends early while an outstanding network call is still pending. For example, let's add a manual "sleep" to remote clusters that lasts longer than the timeout. What does the output look like for those, and do they report errors in the logs or have instability issues when they actually return a response to a coordinator that is no longer listening?

...rc/main/java/org/elasticsearch/action/admin/indices/resolve/ResolveClusterActionRequest.java

.../main/java/org/elasticsearch/action/admin/indices/resolve/TransportResolveClusterAction.java

...rc/main/java/org/elasticsearch/action/admin/indices/resolve/ResolveClusterActionRequest.java

.../main/java/org/elasticsearch/action/admin/indices/resolve/TransportResolveClusterAction.java

...rc/main/java/org/elasticsearch/action/admin/indices/resolve/ResolveClusterActionRequest.java

quux00

Next round review.

...rc/main/java/org/elasticsearch/action/admin/indices/resolve/ResolveClusterActionRequest.java

.../main/java/org/elasticsearch/action/admin/indices/resolve/TransportResolveClusterAction.java

quux00 · 2025-01-24T17:36:39Z

The Description also still references a 9 second timeout (in the error message), so I'm confused about 9s vs 30s internal transport/network timeouts.

Didn't mean to approve, just comment

.../src/internalClusterTest/java/org/elasticsearch/indices/cluster/ResolveClusterTimeoutIT.java

Comments addressed

quux00

Good test improvement. Additional suggestions left.

.../src/internalClusterTest/java/org/elasticsearch/indices/cluster/ResolveClusterTimeoutIT.java

.../main/java/org/elasticsearch/action/admin/indices/resolve/TransportResolveClusterAction.java

.../src/internalClusterTest/java/org/elasticsearch/indices/cluster/ResolveClusterTimeoutIT.java

pawankartik-elastic · 2025-01-28T17:18:27Z

Cleaning up the description and dropping additional info here.

Q: What happens on a timeout?
User sees a response similar to this:

{
    "remote1": {
        "connected": false,
        "skip_unavailable": false,
        "error": "Request timed out before receiving a response from the remote cluster"
    }
}

where the "error" field denotes that the request timed out.

Then:
Scenario 1: If the remote stays unresponsive, the transport layer then logs a warning (since the connection gets cut off due to a handshake exception):

[WARN ][o.e.t.SniffConnectionStrategy] [node-1] fetching nodes from external cluster [remote1] failed
org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:9301] handshake_timeout[SOME_TIMEOUT_VALUE_IN_SECONDS]

Scenario 2: If the remote does respond, the timeout has already fired by then (via a call to onFailure()), which sets the isDone variable to true in the timeout listener. Because isDone is set, the listener is treated as already consumed, and no attempt is made to use the late result. In this case, the transport layer logged nothing during testing, since the remote connection is fine from its perspective.
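The "complete at most once" behaviour described in Scenario 2 can be sketched as follows. This is a simplified stand-in under stated assumptions, not the Elasticsearch timeout listener itself: `OnceListener` and its methods are hypothetical names, and only the `isDone` flag mirrors the description above.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical listener that forwards only the first completion it receives. */
final class OnceListener<T> {
    private final AtomicBoolean isDone = new AtomicBoolean(false);
    private final Consumer<T> responseHandler;
    private final Consumer<Exception> failureHandler;

    OnceListener(Consumer<T> responseHandler, Consumer<Exception> failureHandler) {
        this.responseHandler = responseHandler;
        this.failureHandler = failureHandler;
    }

    /** Delivers the response unless the listener is already done; returns true if delivered. */
    boolean onResponse(T response) {
        if (isDone.compareAndSet(false, true)) {
            responseHandler.accept(response);
            return true;
        }
        return false; // a late response after a timeout is silently dropped
    }

    /** Delivers the failure (for example a timeout) unless the listener is already done. */
    boolean onFailure(Exception e) {
        if (isDone.compareAndSet(false, true)) {
            failureHandler.accept(e);
            return true;
        }
        return false;
    }
}
```

If a timer calls `onFailure(...)` first, a later `onResponse(...)` from the remote returns `false` and the response handler never runs, which matches the observation that nothing further is logged or acted on.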

.../src/internalClusterTest/java/org/elasticsearch/indices/cluster/ResolveClusterTimeoutIT.java

.../main/java/org/elasticsearch/action/admin/indices/resolve/TransportResolveClusterAction.java

.../src/internalClusterTest/java/org/elasticsearch/indices/cluster/ResolveClusterTimeoutIT.java

quux00

Approved.

elasticsearchmachine · 2025-01-28T22:00:19Z

💔 Backport failed

Status	Branch	Result
❌	8.x	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 120542`

pawankartik-elastic · 2025-01-29T09:29:41Z

💚 All backports created successfully

Status	Branch	Result
✅	8.x

Questions ?

Please refer to the Backport tool documentation

…PI (elastic#120542) Previously, should a remote cluster be unresponsive, _resolve/cluster would wait until Netty stepped in and terminated the connection. We initially responded to this issue by switching the disconnect strategy. However, this was problematic because it defeated the whole purpose of this API call—re-establish connection if and when possible. We now attempt to respond to it by adding a user-configurable GET parameter. This PR also reverses the problematic disconnect strategy. Example: ``` GET _resolve/cluster/*:*?timeout=5s ``` (cherry picked from commit d27a8e0)

…PI (#120542) (#121142) Previously, should a remote cluster be unresponsive, _resolve/cluster would wait until Netty stepped in and terminated the connection. We initially responded to this issue by switching the disconnect strategy. However, this was problematic because it defeated the whole purpose of this API call—re-establish connection if and when possible. We now attempt to respond to it by adding a user-configurable GET parameter. This PR also reverses the problematic disconnect strategy. Example: ``` GET _resolve/cluster/*:*?timeout=5s ``` (cherry picked from commit d27a8e0)

pawankartik-elastic added 2 commits January 20, 2025 18:17

Make timeout value user-configurable

0e69b81

elasticsearchmachine added the v9.0.0 label Jan 21, 2025

[CI] Auto commit changes from spotless

fa3d7b8

pawankartik-elastic added Team:Search Foundations, :Search Foundations/Search, >enhancement labels Jan 21, 2025

pawankartik-elastic added 2 commits January 21, 2025 17:07

Update docs/changelog/120542.yaml

0b60164

Merge branch 'main' into pkar/transport-resolve-timeout-listener

a750662

pawankartik-elastic marked this pull request as ready for review January 22, 2025 14:38

pawankartik-elastic requested a review from quux00 January 22, 2025 14:38

pawankartik-elastic added v8.18.0, auto-backport labels Jan 22, 2025

quux00 reviewed Jan 22, 2025

.../main/java/org/elasticsearch/action/admin/indices/resolve/TransportResolveClusterAction.java

pawankartik-elastic added 5 commits January 22, 2025 17:14

Address PR review comments: let netty handle default timeout and do not use `wrapWithTimeout()` if no timeout was specified

afd133d

Merge branch 'main' into pkar/transport-resolve-timeout-listener

5ea23ad

Fix conflict resolve issue

433b5a2

Merge branch 'main' into pkar/transport-resolve-timeout-listener

121049b

Return an error instead of capping the timeout if > 9s

f690ab6

quux00 reviewed Jan 23, 2025

...rc/main/java/org/elasticsearch/action/admin/indices/resolve/ResolveClusterActionRequest.java

quux00 reviewed Jan 23, 2025

...rc/main/java/org/elasticsearch/action/admin/indices/resolve/ResolveClusterActionRequest.java

pawankartik-elastic added 3 commits January 23, 2025 16:32

Fail the request if the timeout param exceeds the value used by distrib code

36ac593

Merge branch 'main' into pkar/transport-resolve-timeout-listener

2a105f2

Mark timeout as transient

5f6f0bc

quux00 previously approved these changes Jan 24, 2025

pawankartik-elastic added 2 commits January 27, 2025 15:49

Fix license

318f7c3

Merge branch 'main' into pkar/transport-resolve-timeout-listener

1ae8f7f

quux00 reviewed Jan 27, 2025

.../src/internalClusterTest/java/org/elasticsearch/indices/cluster/ResolveClusterTimeoutIT.java

DaveCTurner reviewed Jan 28, 2025

.../src/internalClusterTest/java/org/elasticsearch/indices/cluster/ResolveClusterTimeoutIT.java

.../src/internalClusterTest/java/org/elasticsearch/indices/cluster/ResolveClusterTimeoutIT.java

pawankartik-elastic and others added 4 commits January 28, 2025 11:55

Use latch, reduce timeout duration, do not stall remote1, and set error field if request times out

902d17f

Fix comment

afdd04b

[CI] Auto commit changes from spotless

34b561b

Merge branch 'main' into pkar/transport-resolve-timeout-listener

87de883

quux00 reviewed Jan 28, 2025

Address review comments

9a3919d

Merge branch 'main' into pkar/transport-resolve-timeout-listener

60acc5f

quux00 reviewed Jan 28, 2025

.../src/internalClusterTest/java/org/elasticsearch/indices/cluster/ResolveClusterTimeoutIT.java

Fix variable name mismatch due to randomisation

d2c6997

smalyshev reviewed Jan 28, 2025

.../main/java/org/elasticsearch/action/admin/indices/resolve/TransportResolveClusterAction.java

.../src/internalClusterTest/java/org/elasticsearch/indices/cluster/ResolveClusterTimeoutIT.java

Fix comment

2a0b017

quux00 approved these changes Jan 28, 2025

quux00 merged commit d27a8e0 into elastic:main Jan 28, 2025
16 checks passed

elasticsearchmachine added the backport pending label Jan 28, 2025

pawankartik-elastic mentioned this pull request Jan 29, 2025

[8.x] Add a user-configurable timeout parameter to the _resolve/cluster API (#120542) #121142

Merged

quux00 removed the backport pending label Jan 30, 2025

davismcphee mentioned this pull request Feb 5, 2025

[Data Views] Improve has_es_data check to use ES timeout and surface unresponsive clusters elastic/kibana#209697

Open

sabarasaba mentioned this pull request Feb 10, 2025

[Remote Clusters] Add support for index_expressions and timeout parameter elastic/kibana#210424

Open

pawankartik-elastic mentioned this pull request Mar 6, 2025

Revert fail-fast disconnect strategy for _resolve/cluster #124241

Merged

pawankartik-elastic deleted the pkar/transport-resolve-timeout-listener branch June 26, 2025 14:03
