Prevent data nodes from sending stack traces to coordinator when `error_trace=false` #118266

piergm · 2024-12-09T13:51:41Z

This PR changes how data nodes handle errors during search operations. When the request specifies the query parameter error_trace=false (the default), data nodes no longer send stack traces to the search coordinator node.

With error_trace=false, the stack trace is already excluded from the REST response to the client. Extending this to the communication between data nodes and the coordinator node further reduces unnecessary data transfer and lowers the memory the coordinating node needs to handle search requests when errors occur.

To implement this, the error_trace query parameter is passed to data nodes via the transport request header error_trace, so the flag is handled consistently throughout the search request lifecycle.
After this change, the error_trace header is always sent to data nodes. Nodes on an older version do not send the header and therefore have no way to say whether a stack trace is wanted, so when the data node reads the header and finds it missing, it defaults to true, mimicking the current behaviour.
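
The header handling described above can be sketched roughly as follows. This is a minimal illustration using plain Java collections, not the code from this PR: only the header name error_trace comes from the description, while the class `ErrorTraceHeaderSketch` and its methods are hypothetical names. It assumes that stripping a trace means replacing it with an empty array and, for brevity, it only touches the top-level exception.

```java
import java.util.HashMap;
import java.util.Map;

public class ErrorTraceHeaderSketch {
    // Header name from the PR description; all other names are hypothetical.
    static final String ERROR_TRACE_HEADER = "error_trace";

    // Coordinator side: always attach the flag to the outgoing transport headers.
    static Map<String, String> coordinatorHeaders(boolean errorTrace) {
        Map<String, String> headers = new HashMap<>();
        headers.put(ERROR_TRACE_HEADER, Boolean.toString(errorTrace));
        return headers;
    }

    // Data node side: a missing header (older coordinator) defaults to true,
    // which preserves the previous behaviour of sending stack traces.
    static boolean wantsStackTrace(Map<String, String> headers) {
        String value = headers.get(ERROR_TRACE_HEADER);
        return value == null || Boolean.parseBoolean(value);
    }

    // Data node side: drop the stack trace before the failure is sent back,
    // but only when the coordinator explicitly asked for no trace.
    static <E extends Throwable> E prepareForTransport(E failure, Map<String, String> headers) {
        if (!wantsStackTrace(headers)) {
            failure.setStackTrace(new StackTraceElement[0]);
        }
        return failure;
    }
}
```

Defaulting to true when the header is absent keeps mixed-version clusters working, since an older coordinating node still receives the stack traces it expects.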

closes: #116772

elasticsearchmachine · 2024-12-09T13:52:22Z

Pinging @elastic/es-search-foundations (Team:Search Foundations)

elasticsearchmachine · 2024-12-09T13:52:23Z

Hi @piergm, I've created a changelog YAML for you.

javanna

Thanks for working on this Matteo, I left some comments

server/src/main/java/org/elasticsearch/rest/action/search/RestMultiSearchAction.java

server/src/main/java/org/elasticsearch/search/SearchService.java

server/src/main/java/org/elasticsearch/plugins/ActionPlugin.java

...rch/src/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchErrorTraceIT.java

server/src/main/java/org/elasticsearch/rest/action/search/RestSearchAction.java

server/src/main/java/org/elasticsearch/search/SearchService.java

server/src/main/java/org/elasticsearch/rest/BaseRestHandler.java

server/src/main/java/org/elasticsearch/action/ActionModule.java

javanna

Left a couple of comments and questions, LGTM otherwise

server/src/main/java/org/elasticsearch/rest/BaseRestHandler.java

server/src/main/java/org/elasticsearch/search/SearchService.java

...rch/src/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchErrorTraceIT.java

javanna · 2024-12-18T08:59:54Z

...rch/src/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchErrorTraceIT.java

+        request.addParameter("error_trace", "true");
+        while (responseEntity.get("is_running") instanceof Boolean isRunning && isRunning) {
+            responseEntity = performRequestAndGetResponseEntity(request);
+        }


Do you have a sense of how many times get async search ends up being called again here because the search is still running? I ask because you call it in a tight loop without any backoff or sleep between calls.

piergm · 2024-12-18

Very few times, if any, but I agree we should have a 1s sleep between calls and lower "wait_for_completion_timeout" to 0ms.

javanna · 2024-12-18

I did not mean to imply that sleep is a good solution. I wonder whether the sleep ends up slowing down the test, and by how much. We can maybe fine-tune this as a follow-up.

piergm · 2024-12-19

From my local tests, with a wait_for_completion_timeout of 0ms and a 1 second sleep we always get the response back with the first GET /_async_search, so it should not slow down the test too much IMO. But I am open to fine-tuning it 😄
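
For illustration, the kind of bounded polling loop discussed in this thread could look roughly like the sketch below. It is not the test code from the PR: `PollUntilDoneSketch`, `pollUntilDone`, and the `maxAttempts` cap are hypothetical, and the `Supplier` stands in for a helper such as `performRequestAndGetResponseEntity(request)`.

```java
import java.util.Map;
import java.util.function.Supplier;

public class PollUntilDoneSketch {
    // Repeatedly fetches the response until "is_running" is no longer true,
    // sleeping between calls and giving up after maxAttempts fetches.
    static Map<String, Object> pollUntilDone(Supplier<Map<String, Object>> fetch, long sleepMillis, int maxAttempts) {
        Map<String, Object> response = fetch.get();
        int attempts = 1;
        while (Boolean.TRUE.equals(response.get("is_running"))) {
            if (attempts >= maxAttempts) {
                throw new IllegalStateException("still running after " + attempts + " attempts");
            }
            try {
                Thread.sleep(sleepMillis); // pause instead of spinning in a tight loop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while polling", e);
            }
            response = fetch.get();
            attempts++;
        }
        return response;
    }
}
```

The attempt cap is an addition of this sketch; it makes a hung search fail the loop instead of polling forever.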

...rch/src/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchErrorTraceIT.java

elasticsearchmachine · 2024-12-18T14:30:52Z

💔 Backport failed

Status	Branch	Result
❌	8.x	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 118266

piergm · 2024-12-18T14:33:49Z

💚 All backports created successfully

Status	Branch	Result
✅	8.x

Questions ?

Please refer to the Backport tool documentation

…or_trace=false` (elastic#118266)

* first iterations
* added tests
* Update docs/changelog/118266.yaml
* constant for error_trace and typos
* centralized putHeader
* moved threadContext to parent class
* uses NodeClient.threadpool
* updated async tests to retrieve final result
* moved test to avoid starting up a node
* added transport version to avoid sending useless bytes
* more async tests

(cherry picked from commit 97bc291)

# Conflicts:
# server/src/main/java/org/elasticsearch/rest/action/search/RestSearchAction.java

…or_trace=false` (#118266) (#118969)

* first iterations
* added tests
* Update docs/changelog/118266.yaml
* constant for error_trace and typos
* centralized putHeader
* moved threadContext to parent class
* uses NodeClient.threadpool
* updated async tests to retrieve final result
* moved test to avoid starting up a node
* added transport version to avoid sending useless bytes
* more async tests

(cherry picked from commit 97bc291)

# Conflicts:
# server/src/main/java/org/elasticsearch/rest/action/search/RestSearchAction.java

javanna · 2025-02-13T14:25:39Z

heya @piergm do you need to backport this or just remove the backport_pending label?

piergm · 2025-02-13T14:27:55Z

Just to remove the backport_pending label. Thanks for the ping, removed.

…125732) We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false.

…lastic#125732) We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (elastic#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false. (cherry picked from commit 9f6eb1d) # Conflicts: # qa/smoke-test-http/src/internalClusterTest/java/org/elasticsearch/http/SearchErrorTraceIT.java # server/src/main/java/org/elasticsearch/search/SearchService.java # test/framework/src/main/java/org/elasticsearch/search/ErrorTraceHelper.java # x-pack/plugin/async-search/src/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchErrorTraceIT.java

…nsport (#125732) (#126246) * Log stack traces on data nodes before they are cleared for transport (#125732) We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false. (cherry picked from commit 9f6eb1d)

…sport (#125732) (#126245) We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false. (cherry picked from commit 9f6eb1d)

…sport (#125732) (#126243) We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false. (cherry picked from commit 9f6eb1d)

…lastic#125732) We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (elastic#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false.
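
The follow-up described in these commit messages, logging the stack trace on the data node before it is cleared for transport, can be sketched as below. This is an assumption-laden illustration rather than the code from #125732: `LogThenStripSketch` and `logThenStrip` are hypothetical names, the `Consumer<String>` stands in for a real logger, and clearing the traces of the whole cause chain is a choice made for this sketch.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.function.Consumer;

public class LogThenStripSketch {
    // When errorTrace is false, log the full stack trace locally first,
    // then clear it so it is not serialized back to the coordinating node.
    static <E extends Throwable> E logThenStrip(E failure, boolean errorTrace, Consumer<String> log) {
        if (!errorTrace) {
            StringWriter buffer = new StringWriter();
            PrintWriter writer = new PrintWriter(buffer);
            failure.printStackTrace(writer);
            writer.flush();
            log.accept(buffer.toString()); // the trace stays available in the data node's logs

            // Clear the trace on the failure and on each of its causes.
            for (Throwable t = failure; t != null; t = t.getCause()) {
                t.setStackTrace(new StackTraceElement[0]);
            }
        }
        return failure;
    }
}
```

Logging before clearing matters because, once the trace is stripped, neither the coordinating node nor any later handler can recover it.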

piergm added 3 commits December 5, 2024 17:21

first iterations

63ab8f8

added tests

cfd3113

iter

ecdd47c

piergm added the >enhancement, auto-backport, Team:Search Foundations, :Search Foundations/Search, v9.0.0, and v8.18.0 labels Dec 9, 2024

piergm requested review from javanna and original-brownbear December 9, 2024 13:51

piergm self-assigned this Dec 9, 2024

piergm requested a review from a team as a code owner December 9, 2024 13:51

Update docs/changelog/118266.yaml

82e2ce0

piergm added 2 commits December 9, 2024 14:54

merged main, resolved conflicts

262b933

Merge branch 'elastic:main' into datanode-not-to-send-back-EsRejected…

97b07db

…ExecutionException-if-REST_EXCEPTION_SKIP_STACK_TRACE=false

javanna reviewed Dec 9, 2024

piergm added 5 commits December 10, 2024 15:27

constant for error_trace and typos

df74130

centralized putHeader

875c922

moved threadContext to parent class

f4f8f1c

Merge branch 'elastic:main' into datanode-not-to-send-back-EsRejected…

4ac404c

…ExecutionException-if-REST_EXCEPTION_SKIP_STACK_TRACE=false

iter

d105271

javanna reviewed Dec 10, 2024

server/src/main/java/org/elasticsearch/rest/BaseRestHandler.java (outdated, resolved)

piergm added 3 commits December 11, 2024 09:51

updated comment

97d956e

merged main, resolved conflicts

3faef3f

iter

94d197d

rjernst reviewed Dec 11, 2024

server/src/main/java/org/elasticsearch/action/ActionModule.java (outdated, resolved)

uses NodeClient.threadpool

a16f2f4

piergm added 7 commits December 16, 2024 08:42

Merge branch 'main' into datanode-not-to-send-back-EsRejectedExecutio…

a21d5f8

…nException-if-REST_EXCEPTION_SKIP_STACK_TRACE=false

iter

6869654

more async tests

b027930

iter

a86d38a

Merge branch 'main' into datanode-not-to-send-back-EsRejectedExecutio…

a2cea9a

…nException-if-REST_EXCEPTION_SKIP_STACK_TRACE=false

Merge branch 'main' into datanode-not-to-send-back-EsRejectedExecutio…

a7a0a6b

…nException-if-REST_EXCEPTION_SKIP_STACK_TRACE=false

Merge branch 'main' into datanode-not-to-send-back-EsRejectedExecutio…

8e75864

…nException-if-REST_EXCEPTION_SKIP_STACK_TRACE=false

javanna approved these changes Dec 18, 2024

piergm added 2 commits December 18, 2024 14:16

iter

d12dc60

Merge branch 'main' into datanode-not-to-send-back-EsRejectedExecutio…

eba0387

…nException-if-REST_EXCEPTION_SKIP_STACK_TRACE=false

piergm merged commit 97bc291 into elastic:main Dec 18, 2024
16 checks passed

elasticsearchmachine added the backport pending label Dec 18, 2024

piergm mentioned this pull request Dec 18, 2024

[8.x] Prevent data nodes from sending stack traces to coordinator when `error_trace=false` (#118266) #118969

Merged

javanna mentioned this pull request Feb 13, 2025

Limit shard failures accumulated by searches #99220

Closed

piergm removed the backport pending label Feb 13, 2025

javanna mentioned this pull request Mar 17, 2025

Don't generate stacktrace in TaskCancelledException #125002

Merged

benchaplin mentioned this pull request Mar 26, 2025

Log stack traces on data nodes before they are cleared for transport #125732

Merged
