
Retry internally when CAS upload is throttled [GCS] #120250


Merged
merged 9 commits into elastic:main on Jan 20, 2025

Conversation

nicktindall
Contributor

@nicktindall nicktindall commented Jan 16, 2025

Fixes #116546

I've only changed the case where we are throttled while trying to upload the new register contents, because that was the only place where we returned MISSING when throttled. Do we think it'd make more sense to start the whole CAS again in the event that ANY of the requests are throttled?

It looks like by default, GCS is configured with

initial retry delay = 1s
retry delay multiplier = 2
max retry delay = 32s
max attempts = 6

So by adding another layer of retries, this CAS could end up taking some time. By default I allowed two retries, which pushes the maximum time out to 96s.
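
(For illustration only: a minimal sketch of the extra retry layer described above, not the PR's actual code. ThrottledCasRetrySketch, ThrottledException, casUpload() and the linear backoff formula are assumptions invented for this example, loosely mirroring the delay-increment / max-retries / max-delay settings quoted further down in the review.)

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ThrottledCasRetrySketch {

    // Assumed schedule: linearly increasing delays (increment, 2 * increment, ...) capped at maxDelayMillis.
    static Iterator<Long> retryDelaysMillis(long incrementMillis, int maxRetries, long maxDelayMillis) {
        List<Long> delays = new ArrayList<>();
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            delays.add(Math.min(attempt * incrementMillis, maxDelayMillis));
        }
        return delays.iterator();
    }

    static boolean compareAndSetWithRetries() throws InterruptedException {
        // Defaults matching the settings in the diff: 100ms increment, 2 retries, 5s cap.
        Iterator<Long> retries = retryDelaysMillis(100, 2, 5_000);
        while (true) {
            try {
                return casUpload(); // hypothetical: one full CAS attempt against the GCS client
            } catch (ThrottledException e) {
                if (retries.hasNext()) {
                    Thread.sleep(retries.next()); // back off, then restart the CAS from scratch
                } else {
                    throw e; // retries exhausted: let the throttling error propagate
                }
            }
        }
    }

    // Hypothetical stand-ins so the sketch compiles on its own.
    static class ThrottledException extends RuntimeException {}

    static boolean casUpload() {
        return true;
    }
}

Under these assumptions, a persistently throttled CAS waits 100ms and then 200ms between outer attempts before rethrowing, on top of whatever backoff the GCS client already performs internally.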

@nicktindall nicktindall changed the title WIP: Retry internally when CAS upload is throttled Retry internally when CAS upload is throttled Jan 16, 2025
@nicktindall nicktindall added the >test Issues or PRs that are addressing/adding tests label Jan 16, 2025
@nicktindall nicktindall added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Jan 16, 2025
@nicktindall nicktindall marked this pull request as ready for review January 16, 2025 06:58
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Jan 16, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

if (retries.hasNext()) {
    try {
        // noinspection BusyWait
        Thread.sleep(retries.next().millis());
Contributor Author

If we're good with retrying the whole thing from the start in the event of a throttle, we could do this one level up (where it's async) so we don't have to sleep.

Contributor

Sleeping seems ok here to me, if we're being throttled on a CAS then we probably shouldn't be freeing up the thread to do some other blob-store operation.

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

Just checking my understanding tho, the Azure implementation already does what we want right?


@nicktindall
Contributor Author

LGTM

Just checking my understanding tho, the Azure implementation already does what we want right?

This problem was unique due to the way GCP relied on the outer scope to do the retry (on throttling, it simulated a failure to CAS, which would trigger a re-attempt). Azure doesn't do that; instead, if it gets throttled it'll propagate that error out. We don't have any retries beyond those built into the client, but as far as I know we haven't seen the analysis test fail.
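
(Illustration: a rough, self-contained sketch of the contrast described above, not actual repository code. ThrottleHandlingContrast, ThrottledException, doCompareAndExchange() and the Result enum are stand-ins invented here; in Elasticsearch the "missing" return value is OptionalBytesReference.MISSING.)

public class ThrottleHandlingContrast {

    enum Result { OK, MISSING }

    static class ThrottledException extends RuntimeException {}

    // Previous GCS behaviour: a throttle was surfaced as MISSING, i.e. a simulated CAS
    // failure, and the outer CAS loop was relied upon to re-attempt the operation.
    static Result gcsCompareAndExchangeOld() {
        try {
            return doCompareAndExchange();
        } catch (ThrottledException e) {
            return Result.MISSING;
        }
    }

    // Azure behaviour (and GCS after this PR, once its internal retries are exhausted):
    // the throttling error propagates, so callers see a clear exception.
    static Result azureCompareAndExchange() {
        return doCompareAndExchange();
    }

    // Placeholder for the real client call.
    static Result doCompareAndExchange() {
        return Result.OK;
    }
}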

@nicktindall nicktindall removed the request for review from ywangd January 17, 2025 03:16
Comment on lines +65 to +79
static final Setting<TimeValue> RETRY_THROTTLED_CAS_DELAY_INCREMENT = Setting.timeSetting(
    "throttled_cas_retry.delay_increment",
    TimeValue.timeValueMillis(100),
    TimeValue.ZERO
);
static final Setting<Integer> RETRY_THROTTLED_CAS_MAX_NUMBER_OF_RETRIES = Setting.intSetting(
    "throttled_cas_retry.maximum_number_of_retries",
    2,
    0
);
static final Setting<TimeValue> RETRY_THROTTLED_CAS_MAXIMUM_DELAY = Setting.timeSetting(
    "throttled_cas_retry.maximum_delay",
    TimeValue.timeValueSeconds(5),
    TimeValue.ZERO
);
Member

Are these settings registered anywhere?

Contributor Author

No, also I should document them 👍

Member

Scratch that. I forgot repository metadata are not true settings.
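
(Illustration: a minimal sketch, assuming these per-repository values are resolved against the repository's own settings rather than being registered as cluster settings. RetrySettingsReader and maxRetries() are hypothetical names; Setting and RepositoryMetadata are the real Elasticsearch types.)

import org.elasticsearch.cluster.metadata.RepositoryMetadata;
import org.elasticsearch.common.settings.Setting;

class RetrySettingsReader {

    // Same definition as in the diff above: defaults to 2, minimum 0.
    static final Setting<Integer> RETRY_THROTTLED_CAS_MAX_NUMBER_OF_RETRIES = Setting.intSetting(
        "throttled_cas_retry.maximum_number_of_retries",
        2,
        0
    );

    // Resolved against the repository's own settings, so no cluster-level registration is needed.
    static int maxRetries(RepositoryMetadata metadata) {
        return RETRY_THROTTLED_CAS_MAX_NUMBER_OF_RETRIES.get(metadata.settings());
    }
}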

@ywangd
Member

ywangd commented Jan 17, 2025

Should this be labelled as >enhancement instead of >test?

@nicktindall
Contributor Author

Should this be labelled as >enhancement instead of >test?

I guess it should given that it changes actual behaviour. Will update.

@nicktindall nicktindall changed the title Retry internally when CAS upload is throttled Retry internally when CAS upload is throttled [GCS] Jan 17, 2025
@nicktindall nicktindall added >enhancement and removed >test Issues or PRs that are addressing/adding tests labels Jan 17, 2025
@elasticsearchmachine
Collaborator

Hi @nicktindall, I've created a changelog YAML for you.

@ywangd
Member

ywangd commented Jan 17, 2025

Azure doesn't do that, instead if it gets throttled it'll propagate that error out

IIUC, with this PR, the GCP implementation should also do the same once the retries are exhausted, right? I think one main difference from the previous behaviour is that we will get a clear exception, which helps troubleshooting, instead of a return value of OptionalBytesReference[MISSING].

@nicktindall nicktindall merged commit c02292f into elastic:main Jan 20, 2025
15 of 16 checks passed
Labels
:Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs)
>enhancement
Team:Distributed Coordination (Meta label for Distributed Coordination team)
v9.0.0
Development

Successfully merging this pull request may close these issues.

[CI] GCSSnapshotRepoTestKitIT testRepositoryAnalysis failing