KEP4328: Affinity Based Eviction #4329
AxeZhan commented on Nov 6, 2023:
- One-line PR description: Introducing node affinity `RequiredDuringSchedulingRequiredDuringExecution` (see the illustrative sketch below)
- Issue link: Implement Affinity Based Eviction #4328
- Other comments:
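For context, a minimal Go sketch of where the proposed field would sit in the existing node-affinity API. The `userB=allow` label is taken from the user story later in this thread, and the new field itself appears only as a comment, since it is not part of the current `k8s.io/api/core/v1` types:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Today's API: the rule below is only enforced at scheduling time.
	affinity := &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "userB",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"allow"},
					}},
				}},
			},
			// The KEP proposes a sibling field of the same *NodeSelector type,
			// RequiredDuringSchedulingRequiredDuringExecution, which would also
			// be enforced while the pod is running: if the node's labels change
			// so that the selector no longer matches, the pod gets evicted.
			// (That field does not exist yet, so it is only described in this
			// comment.)
		},
	}
	fmt.Printf("%+v\n", affinity)
}
```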
/cc
I left a few comments.
I think you want to both add a new node affinity type and add a new controller. I really don't know if this should be owned by sig-node. I also wonder if there is a way to reuse an existing controller or use the descheduler for this.
I also wonder if sig-scheduling should be the owner of this feature rather than sig-node. Maybe @kerthcet or @Huang-Wei can weigh in here? Taint eviction is owned by sig-scheduling.
I concur with @kannon92: existing alternatives and why they are not sufficient to cover this use case should be discussed in more detail. Last time I checked, the descheduler seemed a very good fit for this use case.
Came back with an updated user story and alternatives. On taints: I think we can safely rule out using taints for this purpose, because taints and tolerations do not have as many operators as affinity and so cannot cover enough scenarios. On the descheduler: as a user myself, I hope Kubernetes can natively support this feature. To me, this sounds like a basic feature.
Before, my tendency was to put this as a manager under the node-lifecycle-controller (kubernetes/kubernetes#121798), and thus have it owned by sig-node. However, I have some new thoughts about this now. On ownership: since this controller is very similar to taint eviction, which is owned by sig-scheduling, and since I think this feature will mostly be used for scheduling purposes, I now also think sig-scheduling should own this feature. Maybe @alculquicondor and @Huang-Wei can weigh in here?
Since there are different opinions about which SIG should own this KEP, it is difficult to continue working on it.
I prefer the first route, because I think we can implement `requiredDuringSchedulingRequiredDuringExecution` in this new controller not only for `nodeAffinity` but also for `podAffinity` and `podAntiAffinity` in future releases. For this, I need opinions from the sig-scheduling leads.
Yes. To disable it, the user needs to remove the affinity from the pod spec.
This sounds reasonable. I think it's worth adding, along with the aforementioned toleration time, and we could probably even add a check on the pod's priority. However, for alpha I think we should keep the feature as simple as possible; we can add these changes in beta. I also expect this feature to remain in beta for many releases.
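Purely as an illustration of the kind of knobs being discussed here (none of these fields exist, and the KEP has not settled on any names or types), a hypothetical sketch:

```go
package main

// nodeAffinityEvictionOptions is a hypothetical, illustrative type: it only
// sketches the tuning discussed above for a possible beta, and is not part of
// any Kubernetes API.
type nodeAffinityEvictionOptions struct {
	// How long a pod may keep running after its required node affinity stops
	// matching before it is evicted, analogous to tolerationSeconds on
	// NoExecute taints.
	TolerationSeconds *int64

	// If set, only pods with a priority at or below this value are evicted.
	MaxEvictablePriority *int32
}

func main() {}
```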
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
/lifecycle stale
Also, taints are meant to protect nodes, so they don't respect PDBs.
- descheduler
I'm guessing that you could enable the API server feature gate (only) and use the descheduler for the descheduling part, assuming we update that out-of-tree code.
Does this sound right? I think we support doing that, although we don't recommend actually trying it.
Yes, we can do that. This requires some code changes in descheduler though.
- Listening to changes of node labels.
- Iterating over all pods assigned to the node (excluding mirror pods) and checking the NodeAffinity field; if `RequiredDuringSchedulingRequiredDuringExecution` exists, checking whether its `NodeSelector` still matches the updated node.
- If `RequiredDuringSchedulingRequiredDuringExecution` is no longer met, trying to evict the pod (a rough sketch of this check follows below).
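A minimal sketch of that check, assuming a hypothetical `requiredSelector` accessor for the proposed field (which does not exist in today's pod spec) and a deliberately simplified matcher that only handles the `In`, `NotIn`, and `Exists` operators:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// matchesNodeSelector is a simplified stand-in for the real NodeSelector
// matching logic: it only handles In, NotIn, and Exists on MatchExpressions
// and ignores MatchFields, Gt, and Lt.
func matchesNodeSelector(sel *corev1.NodeSelector, nodeLabels map[string]string) bool {
	// Terms are ORed: the selector matches if any single term matches.
	for _, term := range sel.NodeSelectorTerms {
		if termMatches(term, nodeLabels) {
			return true
		}
	}
	return false
}

// termMatches ANDs all expressions within one term.
func termMatches(term corev1.NodeSelectorTerm, nodeLabels map[string]string) bool {
	for _, req := range term.MatchExpressions {
		val, ok := nodeLabels[req.Key]
		switch req.Operator {
		case corev1.NodeSelectorOpExists:
			if !ok {
				return false
			}
		case corev1.NodeSelectorOpIn:
			if !ok || !contains(req.Values, val) {
				return false
			}
		case corev1.NodeSelectorOpNotIn:
			if ok && contains(req.Values, val) {
				return false
			}
		default:
			return false // operator not handled in this sketch
		}
	}
	return true
}

func contains(values []string, v string) bool {
	for _, x := range values {
		if x == v {
			return true
		}
	}
	return false
}

// podsToEvict returns the non-mirror pods on the node whose proposed
// RequiredDuringSchedulingRequiredDuringExecution selector no longer matches
// the node's current labels. requiredSelector is a hypothetical accessor for
// that proposed field, which does not exist in pod.Spec.Affinity today. The
// controller would then evict each returned pod through the Eviction API so
// that PodDisruptionBudgets are respected.
func podsToEvict(node *corev1.Node, pods []corev1.Pod,
	requiredSelector func(*corev1.Pod) *corev1.NodeSelector) []corev1.Pod {
	var evict []corev1.Pod
	for i := range pods {
		pod := pods[i]
		if _, isMirror := pod.Annotations[corev1.MirrorPodAnnotationKey]; isMirror {
			continue // mirror pods are excluded
		}
		sel := requiredSelector(&pod)
		if sel == nil {
			continue // the pod does not use the proposed field
		}
		if !matchesNodeSelector(sel, node.Labels) {
			evict = append(evict, pod)
		}
	}
	return evict
}

func main() {
	// Example: the node lost its "userB=allow" label, and one pod requires it.
	node := &corev1.Node{}
	node.Labels = map[string]string{}
	stub := func(*corev1.Pod) *corev1.NodeSelector {
		return &corev1.NodeSelector{NodeSelectorTerms: []corev1.NodeSelectorTerm{{
			MatchExpressions: []corev1.NodeSelectorRequirement{{
				Key: "userB", Operator: corev1.NodeSelectorOpIn, Values: []string{"allow"},
			}},
		}}}
	}
	fmt.Println(len(podsToEvict(node, []corev1.Pod{{}}, stub))) // prints 1
}
```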
Beyond alpha, do we need to account for a data race, where the node label changes before the Pod is bound to the node, because the kube-scheduler didn't have that particular update?
Kubernetes should aim to be eventually consistent here.
Did you mean:
- the node label changed
- the pod has finished scheduling, but is not yet bound to the node
- we evict the pod?
This seems correct to me.
### taints
Taints can achieve this feature in some scenarios, but node affinity has more operators for users to choose from, covering more scenarios.
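For reference, a small sketch of the difference in expressiveness (the keys and values here are made up for illustration): tolerations only support the `Exists` and `Equal` operators, while `NodeSelectorRequirement` supports `In`, `NotIn`, `Exists`, `DoesNotExist`, `Gt`, and `Lt`:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Tolerations can only check for the presence of a taint key (Exists)
	// or an exact key/value match (Equal).
	toleration := corev1.Toleration{
		Key:      "dedicated",
		Operator: corev1.TolerationOpEqual,
		Value:    "gpu",
		Effect:   corev1.TaintEffectNoExecute,
	}

	// Node-affinity requirements can also express set membership, negation,
	// and numeric comparisons: In, NotIn, Exists, DoesNotExist, Gt, Lt.
	requirement := corev1.NodeSelectorRequirement{
		Key:      "topology.kubernetes.io/zone",
		Operator: corev1.NodeSelectorOpNotIn,
		Values:   []string{"zone-a", "zone-b"},
	}

	fmt.Println(toleration.Operator, requirement.Operator)
}
```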
It would be great to expand on this and enumerate some of the most important scenarios that this feature enables and that cannot be achieved with taints. Adding a new controller is a heavy operation, and we need to consider the tradeoffs. This feature also changes the scope of NodeAffinity from a scheduling-time capability to a scheduling- and execution-time capability, which adds cognitive burden for users.
To make a final decision, we should be able to confidently answer questions like: "Are these scenarios common enough to justify adding a new controller?", "Can't taints be extended to support those scenarios and optionally respect PDB?"
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.
/lifecycle rotten
Without `node-affinity-eviction`, I have to remove the "userB=allow" label from the node and delete the pods manually. Also, I can't use taints because they don't respect PDBs.
With `node-affinity-eviction`, I can simply delete the "userB=allow" label from the existing nodes
to re-schedule all of user B's pods to these new nodes.
I know I'm late to the party, I just noticed this KEP is coming.
So, I'm not sure if this user story is strong enough to support this feature, especially given implementing/maintaining a new controller is a huge cost.
Can we add some others, or strengthen it?
For this current scenario, why can't we just delete the pods (for example, by restarting the deployment(s))? Are you trying to argue that that's too troublesome?
And, can't we use the descheduler? Is there a strong need that this has to be supported within the vanilla kubernetes, not just saying "use the descheduler"? What's the justification why this feature has to be supported in kubernetes, compared to other various descheduling features?
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
/close
@k8s-triage-robot: Closed this PR.