
[ML] Retry on streaming errors #123076


Merged: 7 commits into elastic:main on Mar 4, 2025

Conversation

prwhelan (Member)

We now always retry based on the provider's configured retry logic rather than the HTTP status code. Some providers (e.g. Cohere, Anthropic) return 200 status codes with error bodies, while others (e.g. OpenAI, Azure) return non-200 status codes with non-streaming bodies.

Notes:

  • Refactored HttpResult into StreamingHttpResult: the byte body is now the streaming element, while the HTTP response lives outside the stream.
  • Refactored StreamingHttpResultPublisher so that it only pushes the byte body into a queue.
  • Tests now have to wait for the response to be fully consumed before closing the service; otherwise the close method shuts down the mock web server and Apache throws an error.
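The retry change above can be sketched as follows. All names here (`shouldRetry`, `providerRetryLogic`) are illustrative, not the actual Elasticsearch API; the point is that the retry decision becomes a provider-supplied predicate rather than a status-code check in the sender.

```java
import java.util.function.BiPredicate;

class RetrySketch {
    // A provider-configured check: given the status code and the first body
    // chunk, the provider decides whether the attempt should be retried.
    static boolean shouldRetry(BiPredicate<Integer, String> providerRetryLogic,
                               int statusCode, String firstChunk) {
        return providerRetryLogic.test(statusCode, firstChunk);
    }

    public static void main(String[] args) {
        // A Cohere/Anthropic-style provider: 200 status but an error body.
        BiPredicate<Integer, String> errorBodyStyle =
            (status, chunk) -> chunk.contains("\"error\"");
        // An OpenAI/Azure-style provider: non-200 status with a plain body.
        BiPredicate<Integer, String> statusCodeStyle =
            (status, chunk) -> status >= 500;

        System.out.println(shouldRetry(errorBodyStyle, 200, "{\"error\":\"overloaded\"}")); // true
        System.out.println(shouldRetry(statusCodeStyle, 503, "service unavailable"));       // true
        System.out.println(shouldRetry(statusCodeStyle, 400, "bad request"));               // false
    }
}
```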

@prwhelan prwhelan added >bug :ml Machine learning Team:ML Meta label for the ML team auto-backport Automatically create backport pull requests when merged v9.0.0 v8.18.0 v8.18.1 v8.19.0 v9.1.0 labels Feb 20, 2025
@elasticsearchmachine (Collaborator)

Hi @prwhelan, I've created a changelog YAML for you.

@prwhelan prwhelan marked this pull request as ready for review February 25, 2025 15:46
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

Before:
    class StreamingHttpResultPublisher implements HttpAsyncResponseConsumer<HttpResponse>, Flow.Publisher<HttpResult> {
        private final HttpSettings settings;
        private final ActionListener<Flow.Publisher<HttpResult>> listener;
After:
    class StreamingHttpResultPublisher implements HttpAsyncResponseConsumer<Void> {
prwhelan (Member, Author):

This file is almost completely different; it might be easier to review it as if it were new.

We're now just sending the byte[] as a stream, rather than sending HttpResult(response, byte[]), which simplifies what is being queued.

I separated the class into the main Apache consumer and two subclasses: one to publish the consumed data and one to manage pausing/unpausing Apache. It's hopefully clearer that we're doing three distinct things:

  1. Apache continuously reads the response bytes and stores them in our buffer queue.
  2. Meanwhile, our response-handling code only pulls from the buffer when it is ready to send data to the client (e.g. when the client requests it).
  3. The pause/unpause logic slows Apache down if we've stored too many bytes in memory and are draining the buffer too slowly.
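The three parts above can be sketched in miniature. Everything here is a stand-in (the real code suspends and resumes Apache via IOControl rather than setting a flag), but it shows the shape of the producer/consumer/backpressure split:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicLong;

class BufferSketch {
    static final long MAX_BYTES = 1024;
    final Deque<byte[]> queue = new ArrayDeque<>();
    final AtomicLong bytesInQueue = new AtomicLong();
    volatile boolean paused;

    // Part 1: the Apache-side producer stores bytes and may pause itself.
    void offer(byte[] chunk) {
        queue.addLast(chunk);
        if (bytesInQueue.addAndGet(chunk.length) >= MAX_BYTES) {
            paused = true; // the real code would call ioControl.suspendInput()
        }
    }

    // Parts 2 and 3: the consumer drains on demand, unpausing once the
    // buffered byte count drops back below the limit.
    byte[] poll() {
        byte[] chunk = queue.pollFirst();
        if (chunk != null
            && bytesInQueue.updateAndGet(c -> Math.max(0, c - chunk.length)) < MAX_BYTES) {
            paused = false; // the real code would call ioControl.requestInput()
        }
        return chunk;
    }
}
```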


import static org.elasticsearch.core.Strings.format;

class StreamingResponseHandler implements Flow.Processor<HttpResult, HttpResult> {
prwhelan (Member, Author):

This class only existed to read the response headers and determine if there was an error, but we now do that in RetryingHttpSender directly.

return RestStatus.isSuccessful(response.getStatusLine().getStatusCode());
}

public Flow.Publisher<HttpResult> toHttpResult() {
prwhelan (Member, Author):

toHttpResult is a bit of a stopgap to shorten the PR. I didn't want to refactor every provider just yet, but in theory they should all be able to read byte[] directly.
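A hedged sketch of what a toHttpResult-style adapter amounts to. `FakeHttpResult` and `wrap` are hypothetical stand-ins, not the real types: the idea is to re-attach the out-of-band response metadata to each streamed chunk so legacy provider code can keep consuming the old shape until it is refactored.

```java
import java.util.function.Function;

class ToHttpResultSketch {
    // Stand-in for the legacy HttpResult(response, body) pair.
    record FakeHttpResult(int statusCode, byte[] body) {}

    // Given the status captured once at the start of the stream, map each
    // streamed byte[] element back into the legacy shape.
    static Function<byte[], FakeHttpResult> wrap(int statusCode) {
        return body -> new FakeHttpResult(statusCode, body);
    }
}
```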

jonathan-buttner (Contributor) left a comment:

Just left a few questions.

Does this PR implement retrying on midstream and beginning-of-stream errors? Or does there need to be a follow-up for the providers after this?

}

private void addBytesAndMaybePause(long count, IOControl ioControl) {
if (bytesInQueue.accumulateAndGet(count, Long::sum) >= settings.getMaxResponseSize().getBytes()) {
jonathan-buttner (Contributor):

Can we use addAndGet?

prwhelan (Member, Author):

Yeah idk why I used two different methods in this one file lol
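For reference, the two AtomicLong spellings being discussed are interchangeable here; a minimal sketch:

```java
import java.util.concurrent.atomic.AtomicLong;

class AtomicAddSketch {
    // accumulateAndGet(delta, Long::sum) is the roundabout spelling...
    static long viaAccumulate(long start, long delta) {
        return new AtomicLong(start).accumulateAndGet(delta, Long::sum);
    }

    // ...and addAndGet(delta) is the direct one; both add atomically and
    // return the updated value.
    static long viaAdd(long start, long delta) {
        return new AtomicLong(start).addAndGet(delta);
    }

    public static void main(String[] args) {
        System.out.println(viaAccumulate(10, 5)); // 15
        System.out.println(viaAdd(10, 5));        // 15
    }
}
```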


private void subtractBytesAndMaybeUnpause(long count) {
var currentBytesInQueue = bytesInQueue.updateAndGet(current -> Long.max(0, current - count));
if (savedIoControl != null) {
jonathan-buttner (Contributor):

Do we need to wrap this check in a synchronized block?

prwhelan (Member, Author):

I don't think so. The resumeProducer() call will lock, so if we ever get into a state where two threads are competing to unpause, the worst we do is calculate the multiplication twice. We should never be pausing while we are unpausing (Apache shouldn't be calling us with more data while we are paused), but if we are, locking wouldn't mitigate that either, since we'd be able to unpause and immediately pause.

try {
responseHandler.validateResponse(throttlerManager, logger, request, httpResult);
InferenceServiceResults inferenceResults = responseHandler.parseResult(request, httpResult);
ll.onResponse(inferenceResults);
jonathan-buttner (Contributor):

Just to make sure I understand this flow correctly: we can get a status code that indicates a failure, but if validateResponse doesn't throw an error we'll return an actual result? Or are we calling onResponse here to also handle returning an error object?

prwhelan (Member, Author):

> we can get a status code that indicates a failure but if validateResponse doesn't throw an error we'll return an actual result

I was thinking this might be an option, but otherwise most providers would throw an exception and we'd call listener.onFailure.
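A minimal sketch of the two paths being discussed. Names are illustrative and the real code calls the listener rather than returning strings; the point is that validateResponse either throws (onFailure) or the parsed result, which may itself describe an error, reaches onResponse.

```java
class ValidateFlowSketch {
    static String dispatch(boolean validateThrows) {
        try {
            // Stand-in for responseHandler.validateResponse(...)
            if (validateThrows) {
                throw new IllegalStateException("provider rejected the response");
            }
            // Stand-in for ll.onResponse(parseResult(...)): the parsed
            // result is delivered even if the status code looked bad.
            return "onResponse";
        } catch (IllegalStateException e) {
            // What most providers hit on a bad status: ll.onFailure(e).
            return "onFailure";
        }
    }
}
```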


private void addBytesAndMaybePause(long count, IOControl ioControl) {
if (bytesInQueue.accumulateAndGet(count, Long::sum) >= settings.getMaxResponseSize().getBytes()) {
pauseProducer(ioControl);
jonathan-buttner (Contributor):

Is it possible that addBytesAndMaybePause could be called again after the queue is already full (i.e. such that the if-block would return true)? Would that matter? I assume the most recent IOControl supersedes any that we've set previously?

prwhelan (Member, Author):

Yeah, it shouldn't happen, but it's okay to pause IOControl twice, and any recent one supersedes the other. We only need one resume call to continue.

@prwhelan prwhelan merged commit dfe2adb into elastic:main Mar 4, 2025
17 checks passed
@elasticsearchmachine (Collaborator)

💔 Backport failed

Status Branch Result
9.0 Commit could not be cherrypicked due to conflicts
8.18 Commit could not be cherrypicked due to conflicts
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 123076

prwhelan added a commit to prwhelan/elasticsearch that referenced this pull request Mar 4, 2025
prwhelan added a commit to prwhelan/elasticsearch that referenced this pull request Mar 4, 2025
prwhelan added a commit to prwhelan/elasticsearch that referenced this pull request Mar 4, 2025
elasticsearchmachine pushed a commit that referenced this pull request Mar 4, 2025
* [ML] Retry on streaming errors (#123076)

* [CI] Auto commit changes from spotless


Co-authored-by: elasticsearchmachine <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Mar 4, 2025
* [ML] Retry on streaming errors (#123076)

* Use old isSuccess API
elasticsearchmachine pushed a commit that referenced this pull request Mar 4, 2025
* [ML] Retry on streaming errors (#123076)

* [CI] Auto commit changes from spotless

* Use old isSuccess API


Co-authored-by: elasticsearchmachine <[email protected]>
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Mar 11, 2025