KEP4328: Affinity Based Eviction #4329

Closed
wants to merge 1 commit

Conversation

@AxeZhan (Member) commented Nov 6, 2023

  • One-line PR description: Introducing node affinity RequiredDuringSchedulingRequiredDuringExecution
  • Other comments:
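
For context while reading the discussion below, here is a rough Go sketch of how the proposed field could sit next to the existing NodeAffinity fields. This is illustrative only: the current core/v1 API has only the two *IgnoredDuringExecution fields, and the exact shape of the new field is precisely what this KEP is meant to settle.

```go
// Illustrative only: a sketch of how the proposed field could extend the
// existing core/v1 NodeAffinity type; the real API change is subject to the
// KEP and API review.
package affinityeviction

import v1 "k8s.io/api/core/v1"

type nodeAffinitySketch struct {
	// Existing fields, unchanged from v1.NodeAffinity.
	RequiredDuringSchedulingIgnoredDuringExecution  *v1.NodeSelector
	PreferredDuringSchedulingIgnoredDuringExecution []v1.PreferredSchedulingTerm

	// Proposed: like the Required* field above, but also enforced while the
	// pod is running. If a node's labels change so that this selector no
	// longer matches, the controller discussed in this PR would evict the pod.
	RequiredDuringSchedulingRequiredDuringExecution *v1.NodeSelector
}
```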

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 6, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 6, 2023
@k8s-ci-robot k8s-ci-robot requested a review from mrunalp November 6, 2023 12:13
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Nov 6, 2023
@AxeZhan AxeZhan mentioned this pull request Nov 6, 2023
@AxeZhan AxeZhan force-pushed the nodeAffinityController branch 2 times, most recently from 65175a0 to 6da1d8e Compare November 8, 2023 11:49
@AxeZhan AxeZhan changed the title KEP4328: NodeAffinityController KEP4328: NodeLabelManager Nov 8, 2023
@ffromani (Contributor) commented:
/cc

@kannon92 (Contributor) left a comment:

I left a few comments.

I think you want to add both a node affinity type and a new controller. I really don't know if this should be owned by sig-node. I also wonder if there is a way to reuse an existing controller, or to use the descheduler for this.

I also wonder if sig-scheduling should be the owner of this feature rather than sig-node. Maybe @kerthcet or @Huang-Wei can weigh in here? TaintEviction is owned by sig-scheduling.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 15, 2023
@AxeZhan AxeZhan force-pushed the nodeAffinityController branch from 198b9ad to c16a585 Compare November 15, 2023 16:01
@ffromani (Contributor) left a comment:

I concur with @kannon92: existing alternatives, and why they are not sufficient to cover this use case, should be discussed in more detail. Last time I checked, the descheduler seemed a very nice fit for this use case.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 5, 2023
@AxeZhan (Member, Author) commented Dec 6, 2023

Came back with an updated user story and alternatives.

For taints: I think we can safely rule out the idea of using taints for this purpose, because taints and tolerations do not have as many operators as affinity and so cannot cover enough scenarios.

For the descheduler: as a user myself, I hope Kubernetes can natively support this feature. To me, this sounds like a basic feature, like requiredDuringSchedulingIgnoredDuringExecution, and should be part of Kubernetes.
Also, I don't want to introduce the descheduler into our existing system just because of this feature; that adds extra learning costs for devs and ops.

@AxeZhan (Member, Author) commented Dec 6, 2023

I also wonder if sig-scheduling should be the owner of this feature rather than sig-node. Maybe @kerthcet or @Huang-Wei can weigh in here? TaintEviction is owned by sig-scheduling.

Previously, my tendency was to put this as a manager under the node-lifecycle-controller (kubernetes/kubernetes#121798), and thus have it owned by sig-node. However, I have some new thoughts about this now.

On ownership: since this controller is very similar to taint-eviction, which is owned by sig-scheduling, and since I think this feature will mostly be used for scheduling purposes, I also think sig-scheduling should own this feature.
Although placing this manager under the node-lifecycle-controller could reduce some repetitive code, using a new controller like taint-eviction makes it easier to improve affinity-related scheduling or to build custom implementations of affinity-based eviction. (For example, if we go on to implement requiredDuringSchedulingRequiredDuringExecution for podAffinity/podAntiAffinity, we can reuse this new affinity controller.)

For this, maybe @alculquicondor and @Huang-Wei can weigh in here?

@AxeZhan (Member, Author) commented Dec 7, 2023

Since there are different opinions about which SIG should own this KEP, it is difficult to continue working on it.
I think there are now two possible routes to continue the work on this KEP:

  1. Let sig-scheduling be the owner:
    The implementation will be changed to add a new controller, let's call it AffinityEvictionController.

    • pros: Makes it easier to improve this feature, and easier to implement other affinity-related eviction logic in the future.
    • cons: Adding a new controller will increase the complexity of Kubernetes.
  2. Let sig-node be the owner:
    The implementation will be to add a manager under the node-lifecycle-controller just for nodeAffinity requiredDuringSchedulingRequiredDuringExecution (same as now).

    • pros: Placing a manager under the node-lifecycle-controller can reduce code duplication.
    • cons: If we implement other affinity-related logic (like the same feature for podAffinity), we will still need to add a new controller, and this will become hard to maintain by then.

I prefer the first route, because I think we can implement requiredDuringSchedulingRequiredDuringExecution not only for nodeAffinity but also for podAffinity and podAntiAffinity in this new controller in the next releases.

For this, I need opinions from the sig-scheduling leads.
/cc @alculquicondor @Huang-Wei @ahg-g

@AxeZhan (Member, Author) commented Jul 12, 2024

I assume the new controller will work for every pod in every namespace. And there will be no way to disable it?

Yes. To disable it, a user needs to remove the affinity from the pod spec.

Or, will there be a way to configure "protected namespaces"? E.g. to avoid cases where a malicious user finds a way to remove labels from, say, control-plane nodes, causing some of the control plane components to be evicted without feasible nodes. The descheduler allows performing a "nodefit" check to see whether a pod that's to be evicted has any free node where it can be rescheduled.

This sounds reasonable. I think it's worth adding, along with the aforementioned toleration time, and we could probably even add a check on the pod's priority.

However, I think for alpha we should keep the feature as simple as possible; we can add these changes in beta. Also, I think this feature will remain in the beta stage for many releases.
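
For illustration of the "nodefit"-style pre-check mentioned above, here is a simplified Go sketch (an assumption about a possible beta-stage safeguard, not part of the current proposal): before evicting, verify that at least one other node still satisfies the pod's required selector. The descheduler's real NodeFit check also considers resources, taints, and nodeName; this sketch only checks the selector terms.

```go
// hasFeasibleNode is a simplified "node fit" pre-check: it returns true if any
// node other than the pod's current one still matches the required selector.
package affinityeviction

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/component-helpers/scheduling/corev1/nodeaffinity"
)

func hasFeasibleNode(pod *v1.Pod, required *v1.NodeSelector, nodes []*v1.Node) bool {
	selector, err := nodeaffinity.NewNodeSelector(required)
	if err != nil {
		return false // malformed selector: be conservative and report no fit
	}
	for _, node := range nodes {
		if node.Name == pod.Spec.NodeName {
			continue // skip the node the pod would be evicted from
		}
		if selector.Match(node) {
			return true
		}
	}
	return false
}
```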

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 10, 2024
Also taints are meant to protect nodes, so they don't respect PDB.

- descheduler

@sftim commented Oct 19, 2024:

I'm guessing that you could enable the API server feature gate (only) and use the descheduler for the descheduling part, assuming we update that out-of-tree code.

Does this sound right? I think we support doing that, although we don't recommend actually trying it.

@AxeZhan (Member, Author) commented Oct 20, 2024:

Yes, we can do that. It requires some code changes in the descheduler, though.

Comment on lines +153 to +158
- Listening to changes of node labels
- Iterating over all pods assigned to the node (excluding mirror pods), checking the NodeAffinity field; if `RequiredDuringSchedulingRequiredDuringExecution` exists, checking whether its `NodeSelector` still matches the updated node.
- If `RequiredDuringSchedulingRequiredDuringExecution` is no longer met, trying to evict the pod.
A reviewer commented:

Beyond alpha, do we need to account for a data race, where the node label changes before the Pod is bound to the node, because the kube-scheduler didn't have that particular update?

Kubernetes should aim to be eventually consistent here.

@AxeZhan (Member, Author) replied:

Did you mean:

  1. node label changed
  2. pod has finished scheduling, but is not bound to the node yet
  3. we evict the pod?

This seems correct to me.
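
To make the quoted steps concrete, here is a minimal Go sketch of the per-pod check-and-evict step. It assumes the proposed `RequiredDuringSchedulingRequiredDuringExecution` selector (which does not exist in the current API) has already been read off the pod and is passed in as a plain *v1.NodeSelector, and it reuses the node-selector matching helper from k8s.io/component-helpers. A real controller would add informers, workqueues, rate limiting, and the eventual-consistency handling discussed above.

```go
// evictIfNoLongerMatching re-checks the proposed required-during-execution
// selector against the updated node and evicts the pod if it no longer matches.
package affinityeviction

import (
	"context"

	v1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/component-helpers/scheduling/corev1/nodeaffinity"
)

func evictIfNoLongerMatching(ctx context.Context, client kubernetes.Interface,
	pod *v1.Pod, node *v1.Node, required *v1.NodeSelector) error {
	// Mirror pods are excluded, per the quoted steps.
	if _, isMirror := pod.Annotations[v1.MirrorPodAnnotationKey]; isMirror {
		return nil
	}
	selector, err := nodeaffinity.NewNodeSelector(required)
	if err != nil {
		return err
	}
	if selector.Match(node) {
		return nil // the required terms still match the updated node
	}
	// Use the Eviction API rather than a plain delete so that PDBs are honored.
	return client.PolicyV1().Evictions(pod.Namespace).Evict(ctx, &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
	})
}
```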

@AxeZhan AxeZhan force-pushed the nodeAffinityController branch from e4ab6d4 to 1dc8b81 Compare October 20, 2024 07:41
@AxeZhan AxeZhan force-pushed the nodeAffinityController branch from 1dc8b81 to 2a4994b Compare October 20, 2024 07:46

### taints

Taints can achieve this feature in some scenarios. But nodeAffinity has more operators for users to choose from, which covers more scenarios.
A reviewer (Member) commented:

It would be great to expand on this and enumerate some of the most important scenarios that this feature enables and that cannot be achieved with taints. Adding a new controller is a heavy operation, and we need to consider the tradeoffs. This feature also changes the scope of NodeAffinity from a scheduling-time capability to a scheduling- and execution-time capability, which adds cognitive burden for users.
To make a final decision, we should be able to confidently answer questions like: "Are these scenarios common enough to justify adding a new controller?" and "Can't taints be extended to support those scenarios and optionally respect PDBs?"
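
For reference on the operator gap discussed above: tolerations support only the Equal and Exists operators, while node affinity terms support In, NotIn, Exists, DoesNotExist, Gt, and Lt. Below is a hedged Go sketch of a requirement with no direct taint/toleration equivalent (the gpu-memory label key is made up for illustration).

```go
// exampleTerm expresses "zone not in {zone-a, zone-b} AND gpu-memory-mb > 16000",
// which taints/tolerations cannot express because they lack NotIn/Gt/Lt.
package affinityeviction

import v1 "k8s.io/api/core/v1"

var exampleTerm = v1.NodeSelectorTerm{
	MatchExpressions: []v1.NodeSelectorRequirement{
		{
			Key:      "topology.kubernetes.io/zone",
			Operator: v1.NodeSelectorOpNotIn,
			Values:   []string{"zone-a", "zone-b"},
		},
		{
			// Gt/Lt compare the label value as an integer.
			Key:      "example.com/gpu-memory-mb",
			Operator: v1.NodeSelectorOpGt,
			Values:   []string{"16000"},
		},
	},
}
```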

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 20, 2024
Without `node-affinity-eviction`, I have to remove the "userB=allow" label from the nodes and delete the pods manually. Also, I can't use taints because they don't respect PDBs.
With `node-affinity-eviction`, I can simply delete the "userB=allow" label from the existing nodes
to re-schedule all pods of user B to these new nodes.

@sanposhiho (Member) commented Nov 27, 2024:

I know I'm late to the party; I just noticed this KEP is coming.

So, I'm not sure this user story is strong enough to support this feature, especially given that implementing and maintaining a new controller is a huge cost.
Can we add some others, or strengthen it?

For this current scenario, why can't we just delete the pods (for example, via deployment restarts)? Are you trying to argue that that's too troublesome?
And can't we use the descheduler? Is there a strong need for this to be supported within vanilla Kubernetes, rather than just saying "use the descheduler"? What's the justification for supporting this feature in Kubernetes, compared to the various other descheduling features?
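
For concreteness, the required node affinity that the quoted user story relies on would look roughly like the following Go value (the userB label key is just the story's example; whether removing that node label should trigger eviction is exactly what this KEP proposes):

```go
// userBRequirement: pods of user B require a node labeled userB=allow. Under
// the proposal, deleting that label from a node would cause the controller to
// evict the pod via the Eviction API (so PDBs are honored), letting the
// scheduler place it on a node that still carries the label.
package affinityeviction

import v1 "k8s.io/api/core/v1"

var userBRequirement = &v1.NodeSelector{
	NodeSelectorTerms: []v1.NodeSelectorTerm{{
		MatchExpressions: []v1.NodeSelectorRequirement{{
			Key:      "userB",
			Operator: v1.NodeSelectorOpIn,
			Values:   []string{"allow"},
		}},
	}},
}
```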

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor):

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.