feat: add a user-configurable timeout parameter to the _resolve/cluster API #120542
We previously changed the disconnected strategy to fail fast if a remote is unresponsive. However, this had unintended side effects and defeated the purpose of the `_resolve/cluster` API. Instead, we now use a listener that takes a timeout value. If the remote does not respond within the specified time, an appropriate response is sent back to the user, marking that remote as unreachable.
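A minimal sketch of the timeout behaviour described above, not the code in this PR. It swaps in `java.util.concurrent.CompletableFuture` (Java 9+ for `completeOnTimeout`) in place of Elasticsearch's listener machinery, and `ResolveTimeoutSketch`, `RemoteResult`, `resolveWithTimeout` and the error text are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names come from Elasticsearch.
final class ResolveTimeoutSketch {

    /** Per-remote outcome reported back to the user (assumed shape). */
    static final class RemoteResult {
        final String clusterAlias;
        final boolean connected;
        final String error; // null when the remote answered in time

        RemoteResult(String clusterAlias, boolean connected, String error) {
            this.clusterAlias = clusterAlias;
            this.connected = connected;
            this.error = error;
        }
    }

    /**
     * Wraps a pending call to a remote so the caller always gets an answer:
     * the remote's reply if it arrives in time, otherwise an "unreachable" result.
     */
    static CompletableFuture<RemoteResult> resolveWithTimeout(
            String clusterAlias, CompletableFuture<Boolean> remoteCall, long timeoutMillis) {
        return remoteCall
            // Map a reply or a failure of the remote call to a result.
            .handle((connected, failure) -> failure == null
                ? new RemoteResult(clusterAlias, connected, null)
                : new RemoteResult(clusterAlias, false, String.valueOf(failure.getMessage())))
            // If nothing has arrived by the deadline, mark the remote as unreachable.
            .completeOnTimeout(
                new RemoteResult(clusterAlias, false, "request timed out"),
                timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

Calling `resolveWithTimeout("remote1", pendingCall, 5000).join()` would then return either the remote's answer or the timed-out result, whichever comes first.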
Hi @pawankartik-elastic, I've created a changelog YAML for you.
Pinging @elastic/es-search-foundations (Team:Search Foundations)
First round feedback.
In addition to the specific code comments, two other notes:
- I don't see any tests for this. Can we add automated tests?
- Did you do any manual testing? In my past experience with timeout-based returns, I've seen errors and instability in Elasticsearch when the transport action ends early while an outstanding network call is still pending. For example, let's add a manual "sleep" to the remote clusters that lasts longer than the timeout. What does the output look like for those, and do they report errors in the logs or show instability when they eventually return a response to a coordinator that is no longer listening?
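The scenario in the second note can be mimicked outside Elasticsearch with a rough sketch like the one below. It is only an analogy using plain `CompletableFuture` and a scheduler standing in for the slow remote; it says nothing about how the real transport layer logs or handles a late reply. `LateResponseSketch` and `coordinatorView` are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; this is not Elasticsearch transport code.
final class LateResponseSketch {

    /**
     * Simulates a remote that "sleeps" for remoteDelayMillis before answering,
     * and returns what the coordinator holds once the remote has finally replied.
     */
    static String coordinatorView(long remoteDelayMillis, long timeoutMillis) {
        ScheduledExecutorService fakeRemote = Executors.newSingleThreadScheduledExecutor();
        try {
            CompletableFuture<String> remoteReply = new CompletableFuture<>();
            // The artificial sleep: the remote only answers after the delay.
            fakeRemote.schedule(() -> { remoteReply.complete("connected"); },
                remoteDelayMillis, TimeUnit.MILLISECONDS);

            // The coordinator's view is a dependent stage with its own deadline,
            // so timing out does not touch the remote's future.
            CompletableFuture<String> coordinator = remoteReply
                .thenApply(reply -> reply)
                .completeOnTimeout("timed out", timeoutMillis, TimeUnit.MILLISECONDS);

            coordinator.join();  // returns as soon as either side wins
            remoteReply.join();  // wait until the slow remote has actually answered
            return coordinator.join(); // the first completion is the one that sticks
        } finally {
            fakeRemote.shutdownNow();
        }
    }
}
```

In this sketch, a reply that arrives after the deadline is silently dropped because the coordinator's stage is already complete. Whether the real coordinator behaves as quietly is exactly what the manual test would need to confirm.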
use `wrapWithTimeout()` if no timeout was specified
Next round review.
The description also still references a 9-second timeout (in the error message), so I'm confused about 9s vs 30s internal transport/network timeouts.
Good test improvement. Additional suggestions left.
Cleaning up the description and dropping additional info here. Q: What happens on a timeout?
where the "error" field denotes that the request timed out. Then:
Scenario 2: If the remote responds, the timeout has already happened by then (by calling into |
Approved.
💔 Backport failed
You can use sqren/backport to manually backport by running |
💚 All backports created successfully
Questions? Please refer to the Backport tool documentation
…PI (elastic#120542)

Previously, should a remote cluster be unresponsive, _resolve/cluster would wait until Netty stepped in and terminated the connection. We initially responded to this issue by switching the disconnect strategy. However, this was problematic because it defeated the whole purpose of this API call: re-establishing the connection if and when possible. We now address the issue by adding a user-configurable GET parameter. This PR also reverses the problematic disconnect strategy.

Example:

```
GET _resolve/cluster/*:*?timeout=5s
```

(cherry picked from commit d27a8e0)
…PI (#120542) (#121142)

Previously, should a remote cluster be unresponsive, _resolve/cluster would wait until Netty stepped in and terminated the connection. We initially responded to this issue by switching the disconnect strategy. However, this was problematic because it defeated the whole purpose of this API call: re-establishing the connection if and when possible. We now address the issue by adding a user-configurable GET parameter. This PR also reverses the problematic disconnect strategy.

Example:

```
GET _resolve/cluster/*:*?timeout=5s
```

(cherry picked from commit d27a8e0)
Previously, should a remote be unresponsive, `_resolve/cluster` would wait until Netty stepped in and terminated the connection. We initially responded to this issue by switching the disconnect strategy. However, this was problematic because it defeated the whole purpose of this API call: re-establishing the connection if and when possible. We now address the issue by adding a user-configurable `GET` parameter. This PR also reverses the problematic disconnect strategy. Example: `GET _resolve/cluster/*:*?timeout=5s`