KEP-4563: EvictionRequest API (fka Evacuation) #4565
Conversation
We will introduce a new term called evacuation. This is a contract between the evacuation instigator, the evacuee, and the evacuator. The contract is enforced by the API and an evacuation controller. We can think of evacuation as a managed and safer alternative to eviction.
Watch out for the risk of confusing end users.
We already have preemption and eviction and people confuse the two. Or three, because there are two kinds of eviction. And there's disruption in the mix.
Do we want to rename Scheduling, Preemption and Eviction to Scheduling, Preemption, Evacuation and Eviction?
Good idea, I added a mention of which kind of eviction I mean here.
Do we want to rename Scheduling, Preemption and Eviction to Scheduling, Preemption, Evacuation and Eviction?
Yes, I think we want to add a new concept there and generally update the docs once we have an alpha.
I'm with Tim here.
Preemption vs Eviction is already quite confusing. And TBH, I couldn't fully understand what the "evacuation" is supposed to solve by reading the summary or motivation.
From Goals:
Changes to the eviction API to support the evacuation process.
If this is already going to be part of the Eviction API, maybe it should be named as a form of eviction. Something like "cooperative eviction" or "eviction with ack" or something along those lines?
I'm all for framing it as another type of eviction; we already have two, so the extra cognitive load for users is not so much a problem.
@alculquicondor I have updated the summary and goals, I hope it makes more sense now.
I think the name should make the most sense to the person creating the Evacuation (Evacuation Instigator ATM). So `CooperativeEviction` or `EvictionWithAck` is a bit misleading IMO, because from that person's perspective there is no additional step required of them. Only the evacuators and the evacuation controller implement the cooperative evacuation process, but this is hidden from the normal user.
My suggestions:
- `GracefulEviction` (might confuse people if it is associated with graceful pod termination, which it is not)
- `SafeEviction` (safer than the API-initiated one for some pods)
- Or just call it `Eviction`? And tell people to use it instead of the Eviction API endpoint? This might be a bit confusing (at least in the beginning).
We can bikeshed names for the API kind; I'd throw a few of my own into the hat:
- EvictionRequest
- PodEvictionRequest
I have renamed the API to `EvictionRequest` to make the term recognizable. A minor disadvantage is that we have to clarify what type of eviction we mean when we say evict (API-initiated eviction, or EvictionRequest).
The rest of the renames are as follows:
Evacuation (noun) -> EvictionRequest / Eviction Process
evacuation (verb) -> request an eviction / terminate / evict / process eviction
Evacuator -> Interceptor
Evacuee -> Pod
Evacuator Class -> Interceptor Class
Evacuation Instigator -> Eviction Requester
Evacuation Controller -> Eviction Request Controller
ActiveEvacuatorClass -> ActiveInterceptorClass
ActiveEvacuatorCompleted -> ActiveInterceptorCompleted
EvacuationProgressTimestamp -> ProgressTimestamp
ExpectedEvacuationFinishTime -> ExpectedInterceptorFinishTime
EvacuationCancellationPolicy -> EvictionRequestCancellationPolicy
FailedEvictionCounter -> FailedAPIEvictionCounter
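To make the mapping concrete, here is a rough Go sketch using the renamed fields; the field placement and types below are assumptions for illustration, not the KEP's final API:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// EvictionRequestStatus is a hypothetical sketch that uses the renamed fields
// listed above; field placement and types are assumptions for illustration,
// not the KEP's authoritative definition.
type EvictionRequestStatus struct {
	// ActiveInterceptorClass identifies the interceptor currently responsible
	// for processing this eviction request.
	ActiveInterceptorClass string `json:"activeInterceptorClass,omitempty"`

	// ActiveInterceptorCompleted is set once the active interceptor has
	// finished its part of the eviction process.
	ActiveInterceptorCompleted bool `json:"activeInterceptorCompleted,omitempty"`

	// ProgressTimestamp records the last time progress was reported for this
	// eviction request.
	ProgressTimestamp *metav1.Time `json:"progressTimestamp,omitempty"`

	// ExpectedInterceptorFinishTime is the active interceptor's estimate of
	// when it expects to finish.
	ExpectedInterceptorFinishTime *metav1.Time `json:"expectedInterceptorFinishTime,omitempty"`

	// EvictionRequestCancellationPolicy controls whether the eviction request
	// may still be cancelled by the eviction requester.
	EvictionRequestCancellationPolicy string `json:"evictionRequestCancellationPolicy,omitempty"`

	// FailedAPIEvictionCounter counts failed API-initiated eviction attempts
	// made by the eviction request controller.
	FailedAPIEvictionCounter int32 `json:"failedAPIEvictionCounter,omitempty"`
}
```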
<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
We have a number of examples of having a SomethingRequest or SomethingClaim API that then causes a something (certificate signing, node provisioning, etc).
Think of TokenRequest (a subresource for a ServiceAccount), or CertificateSigningRequest.
I would confidently and strongly prefer to have an EvictionRequest or PodEvictionRequest API, rather than an Evacuation API kind.
It's easy to teach that we have evictions and that an EvictionRequest is asking for one to happen; it's hard to teach the difference between an eviction and an evacuation.
As a side effect, this makes the feature gate easier to name (e.g. `PodEvictionRequests`).
As you have mentioned, we already have different kinds of eviction. So I think it would be good to use a completely new term to distinguish it from the others.
Also, Evacuation does not always result in eviction (and PDB consultation). It depends on the controller/workload. For some workloads like DaemonSets and static pods, API eviction has never worked before. This could also be very confusing if we name it the same way.
I think Evacuation fits this better because
- The name is shorter. If we go with EvacuationRequest then the evacuation will become just an abstract term and less recognizable.
- It seems it will have quite a lot of functionality included (state synchronization between multiple instigators and multiple evacuators, state of the evacuee and evacuation). TokenRequest and CertificateSigningRequest are simpler and not involved in a complex process.
I suggested EvictionRequest so that we don't have to have a section with the (too long) title: Scheduling, Preemption, Evacuation and Eviction. Not EvacuationRequest.
Adding another term doesn't scale so well: it means helping n people understand the difference between evacuation and eviction. It's a scaling challenge where n is not only large, it probably includes quite a few Kubernetes maintainers.
As for CertificateSigningRequest being simple: I don't buy it. There are three controllers, custom signers, an integration with trust bundles, the horrors of ASN.1 and X.509… trust me, it's complicated enough.
I understand that it will be confusing for people, but that will happen regardless of what term we will use.
My main issue is that evacuation does not directly translate to eviction. Therefore, I think it would be preferable to choose a new term (not necessarily evacuation).
I would like to get additional opinions from people about this. And we will definitely have to come back to this in the API review.
Should be resolved now: #4565 (comment)
Example evacuation triggers:
- Node maintenance controller: node maintenance triggered by an admin.
- Descheduler: descheduling triggered by a descheduling rule.
If the descheduler requests an eviction, what thing is being evacuated?
(the node maintenance and cluster autoscaler examples are easier: you're evacuating an entire node)
A single pod or multiple pods. The descheduler can use it as a new mechanism instead of eviction.
OK, so people typically think of “evacuate” as a near synonym of “drain” - you drain a node, you evacuate a rack or zone full of servers. Saying that you can evacuate a Pod might make people think its containers all get stopped, or just confuse readers. We do need to smooth out how we can teach this.
It seems it can be used in both scenarios https://www.merriam-webster.com/grammar/can-you-evacuate-people.
Evacuation of containers doesn't make sense because they are tied to the pod lifecycle. But, I guess it could be confusing if we do not make it explicitly clear what we are targeting.
Thing is, Kubernetes users typically - before we accept this KEP - use “evacuate” as a synonym for drain.
I'm (still) fine with the API idea, and still concerned about the naming.
Just to +1 the potential confusion of the term "evacuation".
Is it okay to have a glossary of terms for "evacuation", "eviction", and "drain" (or any other potentially confusing terms) added somewhere in this KEP?
I can include it in the KEP. And yes, we are going to change the name to something more suitable.
"To evacuate a person" implies "get them out of trouble, to safety" as opposed to "to empty" (as in vacuum). It's not ENTRIELY wrong in this context, but it's not entirely right either.
I will change the name to EvictionRequest. It was originally chosen to distinguish it from eviction, but there is value in making it familiar to existing concepts.
The API is renamed to EvictionRequest now, see #4565 (comment) for more details
See some practical use cases for this feature:
1. Ability to upscale first before terminating the pods with a Deployment: [Deployment Pod Surge Example](#deployment-pod-surge-example) based on the [EvictionRequest Process](#evictionrequest-process).
2. Ability to upscale first before terminating the pods with HPA: [HorizontalPodAutoscaler Pod Surge Example](#horizontalpodautoscaler-pod-surge-example)
I guess dragging along the PDB to keep minAvailable == HPA's current target is pretty hacky, hence the need for a signal that is not blocked by the PDB.
Yes, there are a bunch of problems here: user friendliness, atomicity of actions, and responding to any descheduling (e.g. node drain).
Is the lack of atomicity covered anywhere further in this document? It was something that came to my mind while reading this too
Good idea, added to the motivation section
Not sure if it's relevant to the conversation; however, I would expect that after this change, user scenario 2 is solved without the user doing anything, as the maxSurge parameter and the fact that it's a rollout deployment are already specified in the deployment spec.
Yes, the implementation of the follow-ups is only outlined here, and there should be additional KEPs for each improvement. The majority of the things proposed should result in an immediate benefit without any user action.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: atiratree. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
We will revisit this KEP as part of the new Node Lifecycle WG: kubernetes/community#8396
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to its standard rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
FYI: this is being reviewed by the Node Lifecycle WG (see Agenda/Recording https://docs.google.com/document/d/1LSSfiJatBYX7dhLTowYygDO6MK0K-NZ_L52bEZfcZqU) and we will continue next Monday.
I'm a little concerned about why the pod side of this API is implemented as a set of annotations, rather than something structured within the pod's spec. There's a lot of complexity in the API, and a lot of complexity around how the annotations are formatted, where, I believe at least, you'd save a bunch of that complexity by just creating a first-class API.
For example, there's the part about the priorities, and not allowing third parties to interleave the priorities of the controller actor. If there was more structure, different groups of actors could be prioritised relative to each other (group priority and actor priority), and then that issue would surely be resolved, not just for the core controller group, but also for other third-party implementations that have multiple controllers reconciling.
Was a first class pod API ever considered?
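As a rough illustration of that suggestion (all names below are hypothetical, not taken from the KEP), a structured pod-side registration could separate group and actor priorities instead of encoding them in annotation keys and values:

```go
package v1alpha1

// PodEvictionInterceptorEntry is a hypothetical, structured pod-side
// registration; none of these names come from the KEP, they only illustrate
// the group/actor priority idea discussed above.
type PodEvictionInterceptorEntry struct {
	// Class identifies the interceptor (for example, a controller class name).
	Class string `json:"class"`
	// GroupPriority orders whole groups of actors (for example, core
	// controllers vs. a third-party project) relative to each other.
	GroupPriority int32 `json:"groupPriority"`
	// ActorPriority orders actors within the same group, so controllers from
	// different projects cannot accidentally interleave.
	ActorPriority int32 `json:"actorPriority"`
}
```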
to edit the PDB to account for the additional pod that needs to be spun up to move the workload from one node to another. This has been discussed in issue [kubernetes/kubernetes#66811](https://github.com/kubernetes/kubernetes/issues/66811) and in issue [kubernetes/kubernetes#114877](https://github.com/kubernetes/kubernetes/issues/114877).
2. Similar to the first point, it is difficult to use PDBs for applications that can have a variable
Would it not make sense to recommend that users of HPA set their minimum to some number greater than 1 when they are using PDBs? That would avoid this issue, no?
Any pod can be the subject of an eviction request. There can be multiple interceptors for a single pod, and they should all advertise which pods they are responsible for. Normally, the
I assume that the various interceptors are expected to have no prior knowledge of each other, and, should expect there to be no ordering/priority between them? We don't want to create dependencies between them right?
Yes, the design should not presume any ordering, as it is difficult to predict all the use cases. This should mostly be left to the ecosystem.
- We do not want to handle the dependencies and would like to leave this to the ecosystem. If one project has knowledge of another and would like to preempt it, it should be possible. So the projects can dynamically set the priority.
- We expect the core controllers (Deployment, HPA, etc.) to be aware of each other and to resolve the priorities/ordering accordingly.
So the projects can dynamically set the priority.
This would be up to the cluster administrator to set and be aware of the ordering requirements no?
It would be if a custom behavior is desired, but I would expect the components/projects to set reasonable defaults.
If an unknowing cluster administrator were to install two systems simultaneously, and they had multiple components, and had default priorities, then they could just interleave accidentally? Feels like that is asking for trouble
Yeah, it is. What are our alternatives though?
Should we hard fail when something tries to add an interceptor to a pod and another component already occupies the priority?
That would increase the need for conflict resolution, and almost everyone would have to solve it. On the other hand, if we allow conflicts, most of the time nothing bad would happen (taking into account that we allow controllers to have their own interval). Interceptors should expect that they can be preempted, e.g. running an A+B interceptor vs a B+A interceptor.
1. It can reject the eviction request and wait for the pod to be intercepted by another interceptor or evicted by the eviction request controller.
Why would an interceptor decide to defer the decision to someone else?
This just lists all the options. It could happen if there was a race between the interceptor annotation removal and the eviction.
- To prevent misuse, we will maintain a list of allowed `*.k8s.io` interceptor classes. And reject any classes outside the main Kubernetes project on admission.
So third parties can't create interceptors?
This was poorly worded. They can, just not with the `.k8s.io` suffix.
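A minimal sketch of how such an admission check could look; the function and allow list below are assumptions for illustration, not the KEP's implementation:

```go
package admission

import (
	"fmt"
	"strings"
)

// validateInterceptorClass is a hypothetical sketch of the admission rule
// above: classes ending in ".k8s.io" are reserved for the main Kubernetes
// project and are rejected unless they appear on the maintained allow list.
func validateInterceptorClass(class string, allowedK8sClasses map[string]bool) error {
	if strings.HasSuffix(class, ".k8s.io") && !allowedK8sClasses[class] {
		return fmt.Errorf("interceptor class %q uses the reserved .k8s.io suffix", class)
	}
	return nil
}
```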
`.spec.interceptors` is only set by the Eviction Requester and during the EvictionRequest object create admission. We do not allow subsequent changes to this field to ensure the predictability of the eviction request process. Also, late registration of the interceptor could go unnoticed and be preempted by the eviction request controller, resulting in the premature eviction of the pod.
Does this mean that the eviction requestor must have prior knowledge of many other interceptors? I would have expected the interceptors that matter would self register (kind of like a PDB does), rather than requiring some pre-existing knowledge
Yes, the interceptors would self-register, but on a pod beforehand. Eviction requesters should have an easy way of creating EvictionRequests without having to track available interceptors.
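A minimal sketch of the immutability rule quoted at the top of this thread; the type and field shapes are assumed for illustration only:

```go
package admission

import (
	"errors"
	"reflect"
)

// These types are hypothetical shapes used only to illustrate the rule that
// .spec.interceptors cannot change after the EvictionRequest is created.
type InterceptorEntry struct {
	Class    string
	Priority int32
}

type EvictionRequestSpec struct {
	Interceptors []InterceptorEntry
}

// validateInterceptorsImmutable rejects any update that modifies
// .spec.interceptors after creation; the field is only accepted on create.
func validateInterceptorsImmutable(oldSpec, newSpec *EvictionRequestSpec) error {
	if !reflect.DeepEqual(oldSpec.Interceptors, newSpec.Interceptors) {
		return errors.New("spec.interceptors is immutable after creation")
	}
	return nil
}
```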
12. Actor A updates the EvictionRequest status and ensures that `.status.evictionRequestCancellationPolicy=Allow`.
13. Actor A deletes the p-1 pod.
14. EvictionRequest is garbage collected once the pods terminate even with the descheduling
You can't remove an object with a finalizer present, so this statement reads oddly right now, what do you actually expect to happen here? Something is removing that finalizer no?
Yes the finalizer should be removed by the Eviction Request Controller GC:
For convenience, we will also remove requester finalizers with
`evictionrequest.coordination.k8s.io/` prefix when the eviction request task is complete (points 2
and 3). Other finalizers will still block deletion.
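A small sketch of that GC step, assuming the prefix quoted above; the helper itself is hypothetical:

```go
package gc

import "strings"

// requesterFinalizerPrefix is the prefix quoted above; finalizers with this
// prefix are removed by the eviction request controller once the eviction
// request task is complete. Other finalizers still block deletion.
const requesterFinalizerPrefix = "evictionrequest.coordination.k8s.io/"

// pruneRequesterFinalizers is a hypothetical helper that returns the
// finalizers that should remain on a completed EvictionRequest.
func pruneRequesterFinalizers(finalizers []string) []string {
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if strings.HasPrefix(f, requesterFinalizerPrefix) {
			continue // requester finalizer, removed on completion
		}
		kept = append(kept, f)
	}
	return kept
}
```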
already exists. It sets the `requester.evictionrequest.coordination.k8s.io/name_descheduling.avalanche.io` finalizer on the EvictionRequest.
4. The eviction request controller designates Actor B as the next interceptor by updating
Who are Actor A and B and how do they relate to the nodemaintenance and descheduler?
please see:
Let's assume there is a single pod p-1 of application P with interceptors A and B:
Yes, I did read that, but I still don't get the relation. Are these (do these/should these) interceptors tied at all to NodeMaintenance/Descheduling, or are they completely independent of those concepts and just actors who are knowledgeable about a particular pod and how it should be removed?
Yes, they are completely independent and not tied to NodeMaintenance/Descheduling. I have updated the intro to reflect this.
5. The deployment controller creates a set of surge pods C to compensate for the future loss of availability of pods B. The new pods are created by temporarily surging the `.spec.replicas` count of the underlying replica sets up to the value of the deployment's `maxSurge`.
6. Pods C are scheduled on a new schedulable node that is not under the node drain.
7. Pods C become available.
8. The deployment controller scales down the surging replica sets back to their original value.
9. The deployment controller sets `ActiveInterceptorCompleted=true` on the eviction requests of pods B that are ready to be deleted.
10. The eviction request controller designates the replica set controller as the next interceptor by updating `.status.activeInterceptorClass`.
11. The replica set controller deletes the pods to which an EvictionRequest object has been assigned, preserving the availability of the application.
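As a toy illustration of the surge bookkeeping in steps 5 and 8 (the arithmetic is an assumption for illustration, not the controller's actual implementation):

```go
package surge

// surgeTarget is a toy helper for steps 5 and 8 above: temporarily raise the
// replica count by the number of pods under eviction, capped at the
// deployment's maxSurge, and later return to the original count. The real
// controller logic is more involved; this only illustrates the bookkeeping.
func surgeTarget(replicas, maxSurge, podsUnderEviction int32) int32 {
	surge := podsUnderEviction
	if surge > maxSurge {
		surge = maxSurge
	}
	return replicas + surge
}
```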
This sounds like it will interfere with the normal operation of deployments/replicasets? Would scaling down the replicasets not just delete/initiate other pods to be removed, and maybe, the pod in question?
It would. However, ReplicaSets controlled by Deployments should only be scaled by the Deployments during normal operation.
Have we consulted with sig-apps about the implications of this? It sounds like a fairly large change to the way deployments work
Yes, I have presented the extensions to Deployment, etc. when discussing the EvictionRequest KEP in sig-apps. However, the discussion was mostly focused on EvictionRequest.
There are a number of open sig-apps issues that could benefit from these solutions for which there are no alternatives.
@JoelSpeed thanks a lot for the thorough review. These are valuable observations, and perhaps we should consider making more of the API first class. I will also try to resolve your remarks regarding the current API (e.g. spec/status).
// ProgressDeadlineSeconds, the eviction request is passed over to the next interceptor with the
// highest priority. If there is none, the pod is evicted using the Eviction API.
//
// The minimum value is 600 (10m) and the maximum value is 21600 (6h).
Is 6 hours enough for all operations? I can think of operations in systems such as Database Orchestration that take far longer than 6 hours. I can also think of times where an EvictionRequest is intentionally deprioritised for even days in large long running systems.
I think we should not set a maximum value here. I appreciate that opens up the range of cases users could create, but I also think it's a case of "if you break it you buy it" when you set it to a very large number or so high it doesn't have an effect.
This is not the total time of the operation. It is only the maximum amount of time allotted to the controller to provide updates on said operation. The maximum is useful to ensure that controllers do not start the operation and forget about it. There must be an active entity.
Please see #4565 (comment) for more details
Also, this value is set by the eviction requester (e.g., node drain) and not the interceptor. The interceptor has to comply with the minimum update period (expected to be 10 minutes).
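For context, a sketch of what the quoted field could look like, assuming it sits on the EvictionRequest spec as this thread implies; the placement and surrounding details are assumptions, not the final API:

```go
package v1alpha1

// EvictionRequestSpec is a hypothetical fragment; only the quoted field is
// shown, and its placement in the spec is an assumption based on this thread.
type EvictionRequestSpec struct {
	// ProgressDeadlineSeconds bounds how long the active interceptor may go
	// without reporting progress before the eviction request is passed over
	// to the next interceptor with the highest priority, or to API-initiated
	// eviction if none remain. It is set by the eviction requester, not by
	// the interceptor.
	// The minimum value is 600 (10m) and the maximum value is 21600 (6h).
	ProgressDeadlineSeconds int32 `json:"progressDeadlineSeconds"`
}
```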
Moved to that thread.