feat: add a user-configurable timeout parameter to the _resolve/cluster API #120542

Merged

Contributor

@pawankartik-elastic pawankartik-elastic commented Jan 21, 2025

Previously, if a remote cluster was unresponsive, _resolve/cluster would wait until Netty stepped in and terminated the connection. We initially addressed this by switching the disconnect strategy. However, that was problematic because it defeated the whole purpose of this API call: re-establishing the connection if and when possible. We now address it by adding a user-configurable `timeout` GET parameter. This PR also reverts the problematic disconnect strategy.

Example:

GET _resolve/cluster/*:*?timeout=4s
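
As a rough client-side illustration, the request above could be assembled as in the following Python sketch. Only the `_resolve/cluster` path and the `timeout` query parameter come from this PR; the helper name `resolve_cluster_url` and the base URL `http://localhost:9200` are assumptions made for the example.

```python
from urllib.parse import quote, urlencode


def resolve_cluster_url(base_url, expression="*:*", timeout=None):
    """Build a _resolve/cluster URL, optionally with the new timeout parameter.

    `resolve_cluster_url` is a hypothetical helper, not part of any client library.
    """
    # Keep wildcard, cluster-separator, and list characters readable in the path.
    path = "/_resolve/cluster/" + quote(expression, safe="*:,")
    query = "?" + urlencode({"timeout": timeout}) if timeout else ""
    return base_url.rstrip("/") + path + query


# Assumed local node address, for illustration only.
print(resolve_cluster_url("http://localhost:9200", "*:*", timeout="4s"))
```

Omitting `timeout` yields the plain URL, which matches how the request looked before this change.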

We previously changed the disconnect strategy to fail fast if a remote
is unresponsive. However, this had unintended side effects and defeated
the purpose of the `_resolve/cluster` API. Instead, we now use a
listener that takes a timeout value. If the remote does not respond
within the specified time, an appropriate response is sent back to the
user, marking that remote as unreachable.

@pawankartik-elastic pawankartik-elastic added the Team:Search Foundations, :Search Foundations/Search, and >enhancement labels Jan 21, 2025
@elasticsearchmachine
Collaborator

Hi @pawankartik-elastic, I've created a changelog YAML for you.

@pawankartik-elastic pawankartik-elastic marked this pull request as ready for review January 22, 2025 14:38
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@pawankartik-elastic pawankartik-elastic added the v8.18.0 and auto-backport labels Jan 22, 2025
Contributor

@quux00 quux00 left a comment

First round feedback.

In addition to specific code comments, two other notes:

  1. I don't see any tests for this. Can we add automated tests?

  2. Did you do any manual testing? In past experience with timeout-based returns, I've seen errors and instability in Elasticsearch when the transport action ends early while a network call is still outstanding. For example, let's add a manual "sleep" to the remote clusters that lasts longer than the timeout. What does the output look like for those, and do they report errors in the logs or have instability issues when they eventually return a response to a coordinator that is no longer listening?

quux00 previously approved these changes Jan 24, 2025
Contributor

@quux00 quux00 left a comment

Next round review.

@quux00
Contributor

quux00 commented Jan 24, 2025

The Description also still references a 9 second timeout (in the error message), so I'm confused about 9s vs 30s internal transport/network timeouts.

@quux00 quux00 dismissed their stale review January 24, 2025 17:37

Didn't mean to approve, just comment

@DaveCTurner DaveCTurner dismissed their stale review January 28, 2025 07:57

Comments addressed

Contributor

@quux00 quux00 left a comment

Good test improvement. Additional suggestions left.

@pawankartik-elastic
Contributor Author

pawankartik-elastic commented Jan 28, 2025

Cleaning up the description and dropping additional info here.

Q: What happens on a timeout?
User sees a response similar to this:

{
    "remote1": {
        "connected": false,
        "skip_unavailable": false,
        "error": "Request timed out before receiving a response from the remote cluster"
    }
}

where the "error" field denotes that the request timed out.
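
A caller could pick out the affected remotes from such a response roughly as follows. This is a minimal Python sketch: the helper name `unreachable_remotes` is hypothetical, and it relies only on the `connected` and `error` fields shown in the sample above.

```python
import json


def unreachable_remotes(body):
    """Map each remote reported as not connected to its error message (or None)."""
    return {
        name: info.get("error")
        for name, info in body.items()
        if not info.get("connected", False)
    }


# Sample body copied from the response shown above.
sample = json.loads("""
{
    "remote1": {
        "connected": false,
        "skip_unavailable": false,
        "error": "Request timed out before receiving a response from the remote cluster"
    }
}
""")

for name, error in unreachable_remotes(sample).items():
    print(f"{name}: {error}")
```

The sketch checks `connected` rather than the exact error text, since the wording of the message may differ between versions.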

Then:
Scenario 1: If the remote stays unresponsive, the transport layer then logs a warning, since the connection is cut off by a handshake exception:

[WARN ][o.e.t.SniffConnectionStrategy] [node-1] fetching nodes from external cluster [remote1] failed
org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:9301] handshake_timeout[SOME_TIMEOUT_VALUE_IN_SECONDS]

Scenario 2: If the remote does eventually respond, the timeout has already fired by then: the timeout listener has called onFailure(), which sets its isDone variable to true. Because isDone is set, the listener is treated as already consumed, and no attempt is made to use the late result. In this case, the transport layer logged nothing during testing, since the remote connection is fine from its perspective.
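
The "first outcome wins" behaviour described in Scenario 2 can be sketched in a few lines. The following is an illustrative Python model, not the Elasticsearch implementation (which is Java): the class name `TimeoutListener`, its constructor arguments, and the use of `threading.Timer` are all assumptions; only the idea of an isDone flag guarding `onFailure()` and the late response comes from the description above.

```python
import threading

TIMEOUT_MESSAGE = "Request timed out before receiving a response from the remote cluster"


class TimeoutListener:
    """Hypothetical wrapper: whichever of response, failure, or timeout comes first wins."""

    def __init__(self, on_response, on_failure, timeout_seconds):
        self._on_response = on_response
        self._on_failure = on_failure
        self._lock = threading.Lock()
        self._is_done = False  # plays the role of the isDone flag
        self._timer = threading.Timer(timeout_seconds, self._on_timeout)
        self._timer.daemon = True
        self._timer.start()

    def _claim(self):
        # Atomically mark the listener as consumed; only the first caller succeeds.
        with self._lock:
            if self._is_done:
                return False
            self._is_done = True
            return True

    def _on_timeout(self):
        if self._claim():
            self._on_failure(TimeoutError(TIMEOUT_MESSAGE))

    def on_response(self, result):
        if self._claim():
            self._timer.cancel()
            self._on_response(result)
        # Otherwise the timeout already fired and the late result is dropped.

    def on_failure(self, error):
        if self._claim():
            self._timer.cancel()
            self._on_failure(error)


# Simulate a remote that answers only after the timeout has elapsed.
events = []
timed_out = threading.Event()


def record_failure(error):
    events.append(("failure", str(error)))
    timed_out.set()


listener = TimeoutListener(lambda r: events.append(("response", r)), record_failure, 0.05)
timed_out.wait(2.0)                        # let the timer fire first
listener.on_response({"connected": True})  # late reply, silently ignored
print(events)
```

Guarding the flag with a lock keeps the check-and-set atomic, so a response arriving at the same instant as the timeout cannot cause both callbacks to run.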

Contributor

@quux00 quux00 left a comment

Approved.

@quux00 quux00 merged commit d27a8e0 into elastic:main Jan 28, 2025
16 checks passed
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 120542

@pawankartik-elastic
Contributor Author

💚 All backports created successfully

Status Branch Result
8.x

Questions ?

Please refer to the Backport tool documentation

pawankartik-elastic added a commit to pawankartik-elastic/elasticsearch that referenced this pull request Jan 29, 2025
…PI (elastic#120542)

Previously, should a remote cluster be unresponsive, _resolve/cluster would wait until Netty stepped in and terminated the connection. We initially responded to this issue by switching the disconnect strategy. However, this was problematic because it defeated the whole purpose of this API call—re-establish connection if and when possible. We now attempt to respond to it by adding a user-configurable GET parameter. This PR also reverses the problematic disconnect strategy.

Example:

```
GET _resolve/cluster/*:*?timeout=5s
```

(cherry picked from commit d27a8e0)
pawankartik-elastic added a commit that referenced this pull request Jan 29, 2025
…PI (#120542) (#121142)

@pawankartik-elastic pawankartik-elastic deleted the pkar/transport-resolve-timeout-listener branch June 26, 2025 14:03
Labels
auto-backport, >enhancement, :Search Foundations/Search, Team:Search Foundations, v8.18.0, v9.0.0
5 participants