
[ML] Retry on streaming errors #123076


Merged: 7 commits into elastic:main on Mar 4, 2025

Conversation

prwhelan (Member)

We now always retry based on the provider's configured retry logic rather than the HTTP status code. Some providers (e.g. Cohere, Anthropic) return 200 status codes with error bodies, while others (e.g. OpenAI, Azure) return non-200 status codes with non-streaming bodies.

Notes:

  • Refactored HttpResult into StreamingHttpResult: the byte body is now the streaming element, while the HTTP response lives outside the stream.
  • Refactored StreamingHttpResultPublisher so that it only pushes the byte body into a queue.
  • Tests now have to wait for the response to be fully consumed before closing the service; otherwise the close method shuts down the mock web server and Apache throws an error.
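The retry change above can be sketched as follows. All names here (`shouldRetry`, `providerRetryLogic`) are illustrative, not the actual Elasticsearch API; the point is that the retry decision becomes a provider-supplied predicate rather than a status-code check in the sender.

```java
import java.util.function.BiPredicate;

class RetrySketch {
    // A provider-configured check: given the status code and the first body
    // chunk, the provider decides whether the attempt should be retried.
    static boolean shouldRetry(BiPredicate<Integer, String> providerRetryLogic,
                               int statusCode, String firstChunk) {
        return providerRetryLogic.test(statusCode, firstChunk);
    }

    public static void main(String[] args) {
        // A Cohere/Anthropic-style provider: 200 status but an error body.
        BiPredicate<Integer, String> errorBodyStyle =
            (status, chunk) -> chunk.contains("\"error\"");
        // An OpenAI/Azure-style provider: non-200 status with a plain body.
        BiPredicate<Integer, String> statusCodeStyle =
            (status, chunk) -> status >= 500;

        System.out.println(shouldRetry(errorBodyStyle, 200, "{\"error\":\"overloaded\"}")); // true
        System.out.println(shouldRetry(statusCodeStyle, 503, "service unavailable"));       // true
        System.out.println(shouldRetry(statusCodeStyle, 400, "bad request"));               // false
    }
}
```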

@prwhelan prwhelan added >bug :ml Machine learning Team:ML Meta label for the ML team auto-backport Automatically create backport pull requests when merged v9.0.0 v8.18.0 v8.18.1 v8.19.0 v9.1.0 labels Feb 20, 2025
@elasticsearchmachine (Collaborator)

Hi @prwhelan, I've created a changelog YAML for you.

@prwhelan prwhelan marked this pull request as ready for review February 25, 2025 15:46
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

Before:
    class StreamingHttpResultPublisher implements HttpAsyncResponseConsumer<HttpResponse>, Flow.Publisher<HttpResult> {
        private final HttpSettings settings;
        private final ActionListener<Flow.Publisher<HttpResult>> listener;
After:
    class StreamingHttpResultPublisher implements HttpAsyncResponseConsumer<Void> {
prwhelan (Member, Author):

This file is almost completely different; it might be easier to review it as if it were new.

We're now just sending the byte[] as a stream, rather than sending HttpResult(response, byte[]), which simplifies what is being queued.

I separated the class into the main Apache consumer and two subclasses: one to publish the consumed data and one to manage pausing/unpausing Apache. It's hopefully clearer that we're doing three distinct things:

  1. Apache continuously reads the response bytes and stores them in our buffer queue.
  2. Meanwhile, our response-handling code only pulls from the buffer when it is ready to send data to the client (e.g. when the client requests it).
  3. The pause/unpause logic slows Apache down if we've stored too many bytes in memory and are draining the buffer too slowly.
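The three parts above can be sketched in miniature. Everything here is a stand-in (the real code suspends and resumes Apache via IOControl rather than setting a flag), but it shows the shape of the producer/consumer/backpressure split:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicLong;

class BufferSketch {
    static final long MAX_BYTES = 1024;
    final Deque<byte[]> queue = new ArrayDeque<>();
    final AtomicLong bytesInQueue = new AtomicLong();
    volatile boolean paused;

    // Part 1: the Apache-side producer stores bytes and may pause itself.
    void offer(byte[] chunk) {
        queue.addLast(chunk);
        if (bytesInQueue.addAndGet(chunk.length) >= MAX_BYTES) {
            paused = true; // the real code would call ioControl.suspendInput()
        }
    }

    // Parts 2 and 3: the consumer drains on demand, unpausing once the
    // buffered byte count drops back below the limit.
    byte[] poll() {
        byte[] chunk = queue.pollFirst();
        if (chunk != null
            && bytesInQueue.updateAndGet(c -> Math.max(0, c - chunk.length)) < MAX_BYTES) {
            paused = false; // the real code would call ioControl.requestInput()
        }
        return chunk;
    }
}
```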


import static org.elasticsearch.core.Strings.format;

class StreamingResponseHandler implements Flow.Processor<HttpResult, HttpResult> {
prwhelan (Member, Author):

This class only existed to read the response headers and determine if there was an error, but we now do that in RetryingHttpSender directly.

return RestStatus.isSuccessful(response.getStatusLine().getStatusCode());
}

public Flow.Publisher<HttpResult> toHttpResult() {
prwhelan (Member, Author):

toHttpResult is a bit of a stopgap to shorten the PR. I didn't want to refactor every provider just yet, but in theory they should all be able to read byte[] directly.
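A hedged sketch of what a toHttpResult-style adapter amounts to. `FakeHttpResult` and `wrap` are hypothetical stand-ins, not the real types: the idea is to re-attach the out-of-band response metadata to each streamed chunk so legacy provider code can keep consuming the old shape until it is refactored.

```java
import java.util.function.Function;

class ToHttpResultSketch {
    // Stand-in for the legacy HttpResult(response, body) pair.
    record FakeHttpResult(int statusCode, byte[] body) {}

    // Given the status captured once at the start of the stream, map each
    // streamed byte[] element back into the legacy shape.
    static Function<byte[], FakeHttpResult> wrap(int statusCode) {
        return body -> new FakeHttpResult(statusCode, body);
    }
}
```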

jonathan-buttner (Contributor) left a comment:

Just left a few questions.

Does this PR implement retrying on midstream and beginning-of-stream errors? Or does there need to be a follow-up for the providers after this?

}

private void addBytesAndMaybePause(long count, IOControl ioControl) {
if (bytesInQueue.accumulateAndGet(count, Long::sum) >= settings.getMaxResponseSize().getBytes()) {
jonathan-buttner (Contributor):

Can we use addAndGet?

prwhelan (Member, Author):

Yeah idk why I used two different methods in this one file lol
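For reference, the two AtomicLong spellings being discussed are interchangeable here; a minimal sketch:

```java
import java.util.concurrent.atomic.AtomicLong;

class AtomicAddSketch {
    // accumulateAndGet(delta, Long::sum) is the roundabout spelling...
    static long viaAccumulate(long start, long delta) {
        return new AtomicLong(start).accumulateAndGet(delta, Long::sum);
    }

    // ...and addAndGet(delta) is the direct one; both add atomically and
    // return the updated value.
    static long viaAdd(long start, long delta) {
        return new AtomicLong(start).addAndGet(delta);
    }

    public static void main(String[] args) {
        System.out.println(viaAccumulate(10, 5)); // 15
        System.out.println(viaAdd(10, 5));        // 15
    }
}
```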


private void subtractBytesAndMaybeUnpause(long count) {
var currentBytesInQueue = bytesInQueue.updateAndGet(current -> Long.max(0, current - count));
if (savedIoControl != null) {
jonathan-buttner (Contributor):

Do we need to wrap this check in a synchronized block?

prwhelan (Member, Author):

I don't think so. The resumeProducer() call will lock, so if we ever get into a state where two threads are competing to unpause, the worst we do is calculate the multiplication twice. We should never be pausing while we are unpausing (Apache shouldn't be calling us with more data while we are paused), but if we are, locking wouldn't mitigate that either, since we'd be able to unpause and immediately pause.

try {
responseHandler.validateResponse(throttlerManager, logger, request, httpResult);
InferenceServiceResults inferenceResults = responseHandler.parseResult(request, httpResult);
ll.onResponse(inferenceResults);
jonathan-buttner (Contributor):

Just to make sure I understand this flow correctly: we can get a status code that indicates a failure, but if validateResponse doesn't throw an error we'll return an actual result? Or are we calling onResponse here to also handle returning an error object?

prwhelan (Member, Author):

> we can get a status code that indicates a failure but if validateResponse doesn't throw an error we'll return an actual result

I was thinking this might be an option, but otherwise most providers would throw an exception and we'd call listener.onFailure.
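A minimal sketch of the two paths being discussed. Names are illustrative and the real code calls the listener rather than returning strings; the point is that validateResponse either throws (onFailure) or the parsed result, which may itself describe an error, reaches onResponse.

```java
class ValidateFlowSketch {
    static String dispatch(boolean validateThrows) {
        try {
            // Stand-in for responseHandler.validateResponse(...)
            if (validateThrows) {
                throw new IllegalStateException("provider rejected the response");
            }
            // Stand-in for ll.onResponse(parseResult(...)): the parsed
            // result is delivered even if the status code looked bad.
            return "onResponse";
        } catch (IllegalStateException e) {
            // What most providers hit on a bad status: ll.onFailure(e).
            return "onFailure";
        }
    }
}
```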


private void addBytesAndMaybePause(long count, IOControl ioControl) {
if (bytesInQueue.accumulateAndGet(count, Long::sum) >= settings.getMaxResponseSize().getBytes()) {
pauseProducer(ioControl);
jonathan-buttner (Contributor):

Is it possible that addBytesAndMaybePause could be called again after the queue is already full (i.e. such that the if-block would return true)? Would that matter? I assume the most recent IOControl supersedes any that we've set previously?

prwhelan (Member, Author):

Yeah, it shouldn't happen, but it's okay to pause IOControl twice, and any recent one supersedes the other. We only need one resume call to continue.

@prwhelan prwhelan merged commit dfe2adb into elastic:main Mar 4, 2025
17 checks passed
@elasticsearchmachine (Collaborator)

💔 Backport failed

Status Branch Result
9.0 Commit could not be cherrypicked due to conflicts
8.18 Commit could not be cherrypicked due to conflicts
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 123076

prwhelan added a commit to prwhelan/elasticsearch that referenced this pull request Mar 4, 2025
prwhelan added a commit to prwhelan/elasticsearch that referenced this pull request Mar 4, 2025
prwhelan added a commit to prwhelan/elasticsearch that referenced this pull request Mar 4, 2025
elasticsearchmachine pushed a commit that referenced this pull request Mar 4, 2025
* [ML] Retry on streaming errors (#123076)

* [CI] Auto commit changes from spotless


Co-authored-by: elasticsearchmachine <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Mar 4, 2025
* [ML] Retry on streaming errors (#123076)

* Use old isSuccess API
elasticsearchmachine pushed a commit that referenced this pull request Mar 4, 2025
* [ML] Retry on streaming errors (#123076)

* [CI] Auto commit changes from spotless

* Use old isSuccess API


Co-authored-by: elasticsearchmachine <[email protected]>
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Mar 11, 2025