Retry internally when CAS upload is throttled [GCS] #120250
Conversation
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
if (retries.hasNext()) {
    try {
        // noinspection BusyWait
        Thread.sleep(retries.next().millis());
If we're good with retrying the whole thing from the start in the event of a throttle, we could do this one level up (where it's async) so we don't have to sleep.
Sleeping seems ok here to me, if we're being throttled on a CAS then we probably shouldn't be freeing up the thread to do some other blob-store operation.
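To make the trade-off concrete, here is a minimal, hypothetical sketch (not code from this PR) of the two shapes being weighed: blocking the current thread between attempts, as the quoted diff does, versus handing the thread back and rescheduling the whole operation. A plain ScheduledExecutorService stands in for Elasticsearch's ThreadPool, and attemptCas() is a placeholder for the compare-and-exchange call.

import java.util.Iterator;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only. Both variants walk an iterator of backoff delays
// (in milliseconds) and give up once it is exhausted.
class ThrottledCasRetrySketch {

    // Option shown in the diff above: block and retry in place.
    static boolean blockingRetry(Iterator<Long> backoffMillis) throws InterruptedException {
        while (true) {
            if (attemptCas()) {
                return true;
            }
            if (backoffMillis.hasNext() == false) {
                return false;
            }
            Thread.sleep(backoffMillis.next()); // occupies the thread while we are throttled
        }
    }

    // The "one level up, async" alternative: free the thread and reschedule.
    static void asyncRetry(ScheduledExecutorService scheduler, Iterator<Long> backoffMillis) {
        if (attemptCas() || backoffMillis.hasNext() == false) {
            return;
        }
        scheduler.schedule(() -> asyncRetry(scheduler, backoffMillis), backoffMillis.next(), TimeUnit.MILLISECONDS);
    }

    // Placeholder: pretend every attempt is throttled so the retry paths are exercised.
    private static boolean attemptCas() {
        return false;
    }
}

As the comment above notes, blocking may be the reasonable choice here: if a CAS is being throttled, freeing the thread to issue more blob-store requests is probably not desirable anyway.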
LGTM
Just checking my understanding though: the Azure implementation already does what we want, right?
This problem was unique to GCP because it relied on the outer scope to do the retry: on throttling, it simulated a failed CAS, which would trigger a re-attempt. Azure doesn't do that; instead, if it gets throttled it propagates that error out. We don't have any retries beyond those built into the client, but as far as I know we haven't seen the analysis test fail.
static final Setting<TimeValue> RETRY_THROTTLED_CAS_DELAY_INCREMENT = Setting.timeSetting(
    "throttled_cas_retry.delay_increment",
    TimeValue.timeValueMillis(100),
    TimeValue.ZERO
);
static final Setting<Integer> RETRY_THROTTLED_CAS_MAX_NUMBER_OF_RETRIES = Setting.intSetting(
    "throttled_cas_retry.maximum_number_of_retries",
    2,
    0
);
static final Setting<TimeValue> RETRY_THROTTLED_CAS_MAXIMUM_DELAY = Setting.timeSetting(
    "throttled_cas_retry.maximum_delay",
    TimeValue.timeValueSeconds(5),
    TimeValue.ZERO
);
Are these settings registered anywhere?
No, also I should document them 👍
Scratch that. I forgot repository metadata are not true settings.
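For context on why there is nothing to register: these are repository-scoped values carried in the repository metadata rather than node-level settings. A hedged sketch of how one of them would typically be resolved, assuming the standard Setting and RepositoryMetadata APIs (the method name is illustrative, not necessarily what this PR uses):

import org.elasticsearch.cluster.metadata.RepositoryMetadata;
import org.elasticsearch.core.TimeValue;

// Sketch only: apply the Setting constants quoted above to the per-repository
// Settings object, falling back to their declared defaults when absent.
static TimeValue throttledCasDelayIncrement(RepositoryMetadata repositoryMetadata) {
    return RETRY_THROTTLED_CAS_DELAY_INCREMENT.get(repositoryMetadata.settings());
}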
Should this be labelled as …?
I guess it should given that it changes actual behaviour. Will update.
Hi @nicktindall, I've created a changelog YAML for you.
IIUC, with this PR the GCP implementation should also do the same after the retries are exhausted, right? I think one main difference from the previous behaviour is that we will get a clear exception, which helps troubleshooting, instead of a return value of MISSING.
Fixes #116546
I've only changed the case where we are throttled trying to upload the new register contents, because currently that was the only place we returned MISSING when we were throttled. Do we think it'd make more sense to start the whole CAS again in the event that ANY of the requests are throttled?

It looks like by default, GCS is configured with:
initial retry delay = 1s
retry delay multiplier = 2
max retry delay = 32s
max attempts = 6
So by adding another layer of retries, this CAS could end up taking some time. By default I allowed two retries, which will take the maximum time out to 96s.
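For a rough sense of how those defaults compound, here is a small, self-contained sketch (an illustration of the arithmetic only, not code from this PR) that prints the client-level backoff schedule implied by the numbers above, assuming every attempt is throttled and the full delay is slept each time:

// Exponential backoff implied by the quoted GCS client defaults.
public class GcsBackoffSchedule {
    public static void main(String[] args) {
        long delayMillis = 1_000;           // initial retry delay = 1s
        final double multiplier = 2.0;      // retry delay multiplier = 2
        final long maxDelayMillis = 32_000; // max retry delay = 32s
        final int maxAttempts = 6;          // max attempts per client-level operation
        long totalMillis = 0;
        for (int attempt = 2; attempt <= maxAttempts; attempt++) {
            System.out.printf("delay before attempt %d: %ds%n", attempt, delayMillis / 1000);
            totalMillis += delayMillis;
            delayMillis = Math.min((long) (delayMillis * multiplier), maxDelayMillis);
        }
        System.out.printf("total backoff per exhausted client-level cycle: ~%ds%n", totalMillis / 1000);
    }
}

Each exhausted client-level cycle therefore spends on the order of 30 seconds sleeping between attempts, and the two additional outer retries repeat that cost (plus their own short delays), which is roughly how the worst case stretches towards the ~96s figure above.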