Prevent data nodes from sending stack traces to coordinator when error_trace=false #118266

@piergm (Member) commented Dec 9, 2024

This PR changes how data nodes handle errors during search operations. When a request specifies the query parameter error_trace=false (the default), data nodes no longer send stack traces to the search coordinator node.

With error_trace=false, the stack trace is already excluded from the REST response to the client. Extending this to the communication between data nodes and the coordinator node further reduces unnecessary data transfer and lowers the memory the coordinating node needs to handle search requests that fail.

To implement this, the error_trace query parameter is passed to data nodes via the transport request header error_trace. This ensures consistent handling of the error_trace flag throughout the search request lifecycle.
After this change, the error_trace header is always sent to data nodes. Nodes on an older version do not send the header, so they have no way to specify whether stack traces are wanted. When a data node reads the header and finds it missing, it therefore defaults to true, which mimics the current behaviour.
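
The fallback rule described above can be sketched in plain Java. This is a minimal illustration and not the Elasticsearch implementation: the class `ErrorTraceHeader`, the method `shouldIncludeStackTrace`, and the `Map<String, String>` standing in for transport request headers are all assumptions made for the example. Only the header name `error_trace` and the default-to-true rule come from the description.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of Elasticsearch. */
public class ErrorTraceHeader {
    // Header name taken from the PR description.
    public static final String ERROR_TRACE = "error_trace";

    /**
     * Decides whether a data node should keep stack traces in the failures
     * it sends back. A missing header means the sender predates this change,
     * so the old behaviour (include the trace) is preserved.
     */
    public static boolean shouldIncludeStackTrace(Map<String, String> headers) {
        String value = headers.get(ERROR_TRACE);
        if (value == null) {
            return true; // header absent: mimic the pre-change behaviour
        }
        return Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        System.out.println(shouldIncludeStackTrace(Map.of()));                     // true
        System.out.println(shouldIncludeStackTrace(Map.of(ERROR_TRACE, "false"))); // false
        System.out.println(shouldIncludeStackTrace(Map.of(ERROR_TRACE, "true")));  // true
    }
}
```

Defaulting to true on a missing header keeps mixed-version clusters safe: an older sender never loses a stack trace it might have expected.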

closes: #116772

@piergm added the >enhancement, auto-backport, Team:Search Foundations, :Search Foundations/Search, v9.0.0, and v8.18.0 labels Dec 9, 2024
@piergm self-assigned this Dec 9, 2024
@piergm requested a review from a team as a code owner December 9, 2024 13:51
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine
Collaborator

Hi @piergm, I've created a changelog YAML for you.

@javanna (Member) left a comment

Thanks for working on this Matteo, I left some comments

@javanna (Member) left a comment

Left a couple of comments and questions, LGTM otherwise

request.addParameter("error_trace", "true");
while (responseEntity.get("is_running") instanceof Boolean isRunning && isRunning) {
    responseEntity = performRequestAndGetResponseEntity(request);
}
Member

Do you have a sense for how many times get async search ends up being called again here because it's running? Wondering because you do it in a tight loop without any backoff or sleep between calls.

Member Author

Very few times, if any, but I agree we should have a 1s sleep between calls. I added one and lowered "wait_for_completion_timeout" to 0ms.

Member

I did not mean to imply that sleep is a good solution. I wonder whether the sleep ends up slowing down the test, and by how much. We can maybe fine-tune this as a follow-up.

Member Author

From my local tests I saw that with a wait_for_completion_timeout of 0ms and a 1-second sleep, we always get the response back with the first GET /_async_search. So it should not slow down the test too much IMO. But I am open to fine-tuning it 😄
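
The polling pattern discussed in this thread (re-fetch while `is_running` is true, with a pause between calls) can be sketched as follows. This is a minimal sketch under stated assumptions, not the test code from the PR: the class `AsyncSearchPoller`, the method `pollUntilDone`, the `Supplier` standing in for the GET request, and the `maxAttempts` cap are all hypothetical names introduced for illustration.

```java
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical polling helper; names are illustrative only. */
public class AsyncSearchPoller {
    /**
     * Calls fetch until the response no longer reports is_running=true,
     * sleeping between attempts and giving up after maxAttempts calls.
     */
    public static Map<String, Object> pollUntilDone(
        Supplier<Map<String, Object>> fetch, long sleepMillis, int maxAttempts
    ) {
        Map<String, Object> response = fetch.get();
        int attempts = 1;
        while (Boolean.TRUE.equals(response.get("is_running"))) {
            if (attempts >= maxAttempts) {
                throw new IllegalStateException("still running after " + attempts + " attempts");
            }
            try {
                Thread.sleep(sleepMillis); // pause instead of spinning in a tight loop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the interrupt flag
                throw new IllegalStateException("interrupted while polling", e);
            }
            response = fetch.get();
            attempts++;
        }
        return response;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fake endpoint: reports running for the first two calls, then completes.
        Supplier<Map<String, Object>> fake =
            () -> Map.<String, Object>of("is_running", ++calls[0] < 3);
        Map<String, Object> result = pollUntilDone(fake, 10, 5);
        System.out.println(result.get("is_running") + " after " + calls[0] + " calls"); // false after 3 calls
    }
}
```

The attempt cap is a design choice added here so that a search that never completes fails the caller instead of looping forever.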

@piergm merged commit 97bc291 into elastic:main Dec 18, 2024
16 checks passed
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 118266

@piergm
Member Author

piergm commented Dec 18, 2024

💚 All backports created successfully

Status Branch Result
8.x

Questions ?

Please refer to the Backport tool documentation

piergm added a commit to piergm/elasticsearch that referenced this pull request Dec 18, 2024
…or_trace=false` (elastic#118266)

* first iterations

* added tests

* Update docs/changelog/118266.yaml

* constant for error_trace and typos

* centralized putHeader

* moved threadContext to parent class

* uses NodeClient.threadpool

* updated async tests to retrieve final result

* moved test to avoid starting up a node

* added transport version to avoid sending useless bytes

* more async tests

(cherry picked from commit 97bc291)

# Conflicts:
#	server/src/main/java/org/elasticsearch/rest/action/search/RestSearchAction.java
elasticsearchmachine pushed a commit that referenced this pull request Dec 18, 2024
…or_trace=false` (#118266) (#118969)

@javanna
Member

javanna commented Feb 13, 2025

heya @piergm do you need to backport this or just remove the backport_pending label?

@piergm
Member Author

piergm commented Feb 13, 2025

Just to remove the backport_pending label. Thanks for the ping, removed.

benchaplin added a commit that referenced this pull request Apr 3, 2025
…125732)

We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false.
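
The log-then-clear idea in this commit message can be sketched in plain Java. This is a hedged illustration and not the Elasticsearch code: the class `StackTraceStripper`, the method `prepareForTransport`, the use of `java.util.logging`, and the WARNING log level are all assumptions made for the example. It also ignores suppressed exceptions for brevity.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch; not the Elasticsearch implementation. */
public class StackTraceStripper {
    private static final Logger LOGGER = Logger.getLogger(StackTraceStripper.class.getName());

    /**
     * Prepares a failure for transport. When errorTrace is false, the full
     * trace is logged locally first and then removed from the exception and
     * its causes, so the trace is not carried back to the coordinating node.
     */
    public static Throwable prepareForTransport(Throwable failure, boolean errorTrace) {
        if (errorTrace) {
            return failure; // caller asked for traces: send unchanged
        }
        // Log before clearing; otherwise the trace would be lost entirely.
        LOGGER.log(Level.WARNING, "search failure on data node", failure);
        for (Throwable t = failure; t != null; t = t.getCause()) {
            t.setStackTrace(new StackTraceElement[0]);
        }
        return failure;
    }

    public static void main(String[] args) {
        Throwable e = new IllegalStateException("shard failed", new RuntimeException("cause"));
        prepareForTransport(e, false);
        System.out.println(e.getStackTrace().length + " " + e.getCause().getStackTrace().length); // 0 0
    }
}
```

The ordering matters: logging must happen before `setStackTrace` clears the frames, since nothing downstream can recover them afterwards.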
benchaplin added a commit to benchaplin/elasticsearch that referenced this pull request Apr 3, 2025
…lastic#125732)

(cherry picked from commit 9f6eb1d)

# Conflicts:
#	qa/smoke-test-http/src/internalClusterTest/java/org/elasticsearch/http/SearchErrorTraceIT.java
#	server/src/main/java/org/elasticsearch/search/SearchService.java
#	test/framework/src/main/java/org/elasticsearch/search/ErrorTraceHelper.java
#	x-pack/plugin/async-search/src/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchErrorTraceIT.java
benchaplin added a commit that referenced this pull request Apr 4, 2025
…nsport (#125732) (#126246)

* Log stack traces on data nodes before they are cleared for transport (#125732)

(cherry picked from commit 9f6eb1d)
benchaplin added a commit that referenced this pull request Apr 4, 2025
…sport (#125732) (#126245)

(cherry picked from commit 9f6eb1d)
benchaplin added a commit that referenced this pull request Apr 4, 2025
…sport (#125732) (#126243)

(cherry picked from commit 9f6eb1d)
andreidan pushed a commit to andreidan/elasticsearch that referenced this pull request Apr 9, 2025
…lastic#125732)

Successfully merging this pull request may close these issues.

EsRejectedExecutionException instances consume an unreasonable amount of heap on coordinating nodes