-
Notifications
You must be signed in to change notification settings - Fork 25.4k
Prevent data nodes from sending stack traces to coordinator when error_trace=false
#118266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent data nodes from sending stack traces to coordinator when error_trace=false
#118266
Conversation
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
Hi @piergm, I've created a changelog YAML for you. |
…ExecutionException-if-REST_EXCEPTION_SKIP_STACK_TRACE=false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this Matteo, I left some comments
server/src/main/java/org/elasticsearch/rest/action/search/RestMultiSearchAction.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/rest/action/search/RestMultiSearchAction.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/SearchService.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/plugins/ActionPlugin.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/plugins/ActionPlugin.java
Outdated
Show resolved
Hide resolved
...rch/src/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchErrorTraceIT.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/rest/action/search/RestSearchAction.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/SearchService.java
Outdated
Show resolved
Hide resolved
…ExecutionException-if-REST_EXCEPTION_SKIP_STACK_TRACE=false
server/src/main/java/org/elasticsearch/rest/BaseRestHandler.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/ActionModule.java
Outdated
Show resolved
Hide resolved
…nException-if-REST_EXCEPTION_SKIP_STACK_TRACE=false
…nException-if-REST_EXCEPTION_SKIP_STACK_TRACE=false
…nException-if-REST_EXCEPTION_SKIP_STACK_TRACE=false
…nException-if-REST_EXCEPTION_SKIP_STACK_TRACE=false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a couple of comments and questions, LGTM otherwise
server/src/main/java/org/elasticsearch/rest/BaseRestHandler.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/SearchService.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/SearchService.java
Outdated
Show resolved
Hide resolved
...rch/src/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchErrorTraceIT.java
Show resolved
Hide resolved
request.addParameter("error_trace", "true"); | ||
while (responseEntity.get("is_running") instanceof Boolean isRunning && isRunning) { | ||
responseEntity = performRequestAndGetResponseEntity(request); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a sense for how many times get async search ends up being called again here because it's running? Wondering because you do it in a tight loop without any backoff or sleep between calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very few times, if any, but I agree we should have a 1s sleep between calls and lowered "wait_for_completion_timeout" to 0ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not mean to imply that sleep is a good solution, I wonder if the sleep ends up slowing down the test and how much. We can maybe finetune this as a followup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my local tests I saw that with 0ms of wait_for_completion_timeout and 1 second sleep we always get back the response with the first GET /_async_search. So it should not slow down the test too much IMO. But I am open to finetune it 😄
...rch/src/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchErrorTraceIT.java
Show resolved
Hide resolved
…nException-if-REST_EXCEPTION_SKIP_STACK_TRACE=false
💔 Backport failed
You can use sqren/backport to manually backport by running |
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
…or_trace=false` (elastic#118266) * first iterations * added tests * Update docs/changelog/118266.yaml * constant for error_trace and typos * centralized putHeader * moved threadContext to parent class * uses NodeClient.threadpool * updated async tests to retrieve final result * moved test to avoid starting up a node * added transport version to avoid sending useless bytes * more async tests (cherry picked from commit 97bc291) # Conflicts: # server/src/main/java/org/elasticsearch/rest/action/search/RestSearchAction.java
…or_trace=false` (#118266) (#118969) * first iterations * added tests * Update docs/changelog/118266.yaml * constant for error_trace and typos * centralized putHeader * moved threadContext to parent class * uses NodeClient.threadpool * updated async tests to retrieve final result * moved test to avoid starting up a node * added transport version to avoid sending useless bytes * more async tests (cherry picked from commit 97bc291) # Conflicts: # server/src/main/java/org/elasticsearch/rest/action/search/RestSearchAction.java
heya @piergm do you need to backport this or just remove the backport_pending label? |
Just to remove the backport_pending label. Thanks for the ping, removed. |
…125732) We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false.
…lastic#125732) We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (elastic#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false. (cherry picked from commit 9f6eb1d) # Conflicts: # qa/smoke-test-http/src/internalClusterTest/java/org/elasticsearch/http/SearchErrorTraceIT.java # server/src/main/java/org/elasticsearch/search/SearchService.java # test/framework/src/main/java/org/elasticsearch/search/ErrorTraceHelper.java # x-pack/plugin/async-search/src/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchErrorTraceIT.java
…lastic#125732) We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (elastic#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false. (cherry picked from commit 9f6eb1d) # Conflicts: # qa/smoke-test-http/src/internalClusterTest/java/org/elasticsearch/http/SearchErrorTraceIT.java # server/src/main/java/org/elasticsearch/search/SearchService.java # test/framework/src/main/java/org/elasticsearch/search/ErrorTraceHelper.java # x-pack/plugin/async-search/src/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchErrorTraceIT.java
…nsport (#125732) (#126246) * Log stack traces on data nodes before they are cleared for transport (#125732) We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false. (cherry picked from commit 9f6eb1d)
…sport (#125732) (#126245) We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false. (cherry picked from commit 9f6eb1d)
…sport (#125732) (#126243) We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false. (cherry picked from commit 9f6eb1d)
…lastic#125732) We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (elastic#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false.
This PR updates the behavior of data nodes to optimize error handling during search operations. When the query parameter
error_trace=false
is specified in the request (which defaults tofalse
), data nodes will no longer send stack traces to the search coordinator node.With
error_trace=false
, the stack trace is already excluded from the REST response to the client. By extending this to the communication between data nodes and the coordinator node, we further reduce unnecessary data transfer and lower the memory needed to handle search requests in case of errors in the coordinating node.To implement this, the
error_trace
query parameter is passed to data nodes via the transport request headererror_trace
. This ensures consistent handling of theerror_trace
flag throughout the search request lifecycle.After this change the
error_trace
header will be always sent to data nodes. Nodes with an older version will therefore have no way to specify if we want or not stack trace, therefore upon reading the header in the data node, if not specified we default to true, mimicking the current behaviour.closes: #116772