KEP4328: Affinity Based Eviction #4329

Closed
wants to merge 1 commit

Conversation

@AxeZhan (Member) commented Nov 6, 2023

  • One-line PR description: Introducing node affinity RequiredDuringSchedulingRequiredDuringExecution
  • Other comments:
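
For context while reading the discussion below, here is a rough Go sketch of how the proposed field could sit next to the existing NodeAffinity fields. This is illustrative only: the current core/v1 API has only the two *IgnoredDuringExecution fields, and the exact shape of the new field is precisely what this KEP is meant to settle.

```go
// Illustrative only: a sketch of how the proposed field could extend the
// existing core/v1 NodeAffinity type; the real API change is subject to the
// KEP and API review.
package affinityeviction

import v1 "k8s.io/api/core/v1"

type nodeAffinitySketch struct {
	// Existing fields, unchanged from v1.NodeAffinity.
	RequiredDuringSchedulingIgnoredDuringExecution  *v1.NodeSelector
	PreferredDuringSchedulingIgnoredDuringExecution []v1.PreferredSchedulingTerm

	// Proposed: like the Required* field above, but also enforced while the
	// pod is running. If a node's labels change so that this selector no
	// longer matches, the controller discussed in this PR would evict the pod.
	RequiredDuringSchedulingRequiredDuringExecution *v1.NodeSelector
}
```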

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 6, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 6, 2023
@k8s-ci-robot k8s-ci-robot requested a review from mrunalp November 6, 2023 12:13
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Nov 6, 2023
@AxeZhan AxeZhan mentioned this pull request Nov 6, 2023
@AxeZhan AxeZhan force-pushed the nodeAffinityController branch 2 times, most recently from 65175a0 to 6da1d8e Compare November 8, 2023 11:49
@AxeZhan AxeZhan changed the title KEP4328: NodeAffinityController KEP4328: NodeLabelManager Nov 8, 2023
@ffromani (Contributor) commented:
/cc

@kannon92 (Contributor) left a comment:

I left a few comments.

I think you want to add both a node affinity type and a new controller. I really don't know if this should be owned by sig-node. I also wonder if there is a way to reuse an existing controller, or to use the descheduler for this.

I also wonder if sig-scheduling should be the owner of this feature rather than sig-node. Maybe @kerthcet or @Huang-Wei can weigh in here? TaintEviction is owned by sig-scheduling.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 15, 2023
@AxeZhan AxeZhan force-pushed the nodeAffinityController branch from 198b9ad to c16a585 Compare November 15, 2023 16:01
@ffromani (Contributor) left a comment:

I concur with @kannon92: existing alternatives, and why they are not sufficient to cover this use case, should be discussed in more detail. Last time I checked, the descheduler seemed a very nice fit for this use case.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 5, 2023
@AxeZhan (Member, Author) commented Dec 6, 2023

Came back with an updated user story and alternatives.

For taints: I think we can safely rule out the idea of using taints for this purpose, because taints and tolerations do not have as many operators as affinity and so cannot cover enough scenarios.

For the descheduler: as a user myself, I hope Kubernetes can natively support this feature. To me, this sounds like a basic feature, like requiredDuringSchedulingIgnoredDuringExecution, and should be part of Kubernetes.
Also, I don't want to introduce the descheduler into our existing system just because of this feature; that adds extra learning costs for devs and ops.

@AxeZhan (Member, Author) commented Dec 6, 2023

I also wonder if sig-scheduling should be the owner of this feature rather than sig-node. Maybe @kerthcet or @Huang-Wei can weigh in here? TaintEviction is owned by sig-scheduling.

Previously, my tendency was to put this as a manager under the node-lifecycle-controller (kubernetes/kubernetes#121798), and thus have it owned by sig-node. However, I have some new thoughts about this now.

On ownership: since this controller is very similar to taint-eviction, which is owned by sig-scheduling, and since I think this feature will mostly be used for scheduling purposes, I also think sig-scheduling should own this feature.
Although placing this manager under the node-lifecycle-controller could reduce some repetitive code, using a new controller like taint-eviction makes it easier to improve affinity-related scheduling or to build custom implementations of affinity-based eviction. (For example, if we go on to implement requiredDuringSchedulingRequiredDuringExecution for podAffinity/podAntiAffinity, we can reuse this new affinity controller.)

For this, maybe @alculquicondor and @Huang-Wei can weigh in here?

@AxeZhan (Member, Author) commented Dec 7, 2023

Since there are different opinions about which SIG should own this KEP, it is difficult to continue working on it.
I think there are now two possible routes to continue the work on this KEP:

  1. Let sig-scheduling be the owner:
    The implementation will be changed to add a new controller, let's call it AffinityEvictionController.

    • pros: Makes it easier to improve this feature, and easier to implement other affinity-related eviction logic in the future.
    • cons: Adding a new controller will increase the complexity of Kubernetes.
  2. Let sig-node be the owner:
    The implementation will be to add a manager under the node-lifecycle-controller just for nodeAffinity requiredDuringSchedulingRequiredDuringExecution (same as now).

    • pros: Placing a manager under the node-lifecycle-controller can reduce code duplication.
    • cons: If we implement other affinity-related logic (like the same feature for podAffinity), we will still need to add a new controller, and this will become hard to maintain by then.

I prefer the first route, because I think we can implement requiredDuringSchedulingRequiredDuringExecution not only for nodeAffinity but also for podAffinity and podAntiAffinity in this new controller in the next releases.

For this, I need opinions from the sig-scheduling leads.
/cc @alculquicondor @Huang-Wei @ahg-g

@AxeZhan (Member, Author) commented Jul 12, 2024

I assume the new controller will work for every pod in every namespace. And there will be no way to disable it?

Yes. To disable it, a user needs to remove the affinity from the pod spec.

Or, will there be a way to configure "protected namespaces"? E.g. to avoid cases where a malicious user finds a way to remove labels from, say, control-plane nodes, causing some of the control plane components to be evicted without feasible nodes. The descheduler allows performing a "nodefit" check to see whether a pod that's to be evicted has any free node where it can be rescheduled.

This sounds reasonable. I think it's worth adding, along with the aforementioned toleration time, and we could probably even add a check on the pod's priority.

However, I think for alpha we should keep the feature as simple as possible; we can add these changes in beta. Also, I think this feature will remain in the beta stage for many releases.
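
For illustration of the "nodefit"-style pre-check mentioned above, here is a simplified Go sketch (an assumption about a possible beta-stage safeguard, not part of the current proposal): before evicting, verify that at least one other node still satisfies the pod's required selector. The descheduler's real NodeFit check also considers resources, taints, and nodeName; this sketch only checks the selector terms.

```go
// hasFeasibleNode is a simplified "node fit" pre-check: it returns true if any
// node other than the pod's current one still matches the required selector.
package affinityeviction

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/component-helpers/scheduling/corev1/nodeaffinity"
)

func hasFeasibleNode(pod *v1.Pod, required *v1.NodeSelector, nodes []*v1.Node) bool {
	selector, err := nodeaffinity.NewNodeSelector(required)
	if err != nil {
		return false // malformed selector: be conservative and report no fit
	}
	for _, node := range nodes {
		if node.Name == pod.Spec.NodeName {
			continue // skip the node the pod would be evicted from
		}
		if selector.Match(node) {
			return true
		}
	}
	return false
}
```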

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 10, 2024
Also taints are meant to protect nodes, so they don't respect PDB.

- descheduler

@sftim commented Oct 19, 2024:

I'm guessing that you could enable the API server feature gate (only) and use the descheduler for the descheduling part, assuming we update that out-of-tree code.

Does this sound right? I think we support doing that, although we don't recommend actually trying it.

@AxeZhan (Member, Author) commented Oct 20, 2024:

Yes, we can do that. It requires some code changes in the descheduler, though.

Comment on lines +153 to +158
- Listening to changes of node labels
- Iterating over all pods assigned to the node (excluding mirror pods), checking the NodeAffinity field; if `RequiredDuringSchedulingRequiredDuringExecution` exists, checking whether its `NodeSelector` still matches the updated node.
- If `RequiredDuringSchedulingRequiredDuringExecution` is no longer met, trying to evict the pod.
A reviewer commented:

Beyond alpha, do we need to account for a data race, where the node label changes before the Pod is bound to the node, because the kube-scheduler didn't have that particular update?

Kubernetes should aim to be eventually consistent here.

@AxeZhan (Member, Author) replied:

Did you mean:

  1. node label changed
  2. pod has finished scheduling, but is not bound to the node yet
  3. we evict the pod?

This seems correct to me.
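
To make the quoted steps concrete, here is a minimal Go sketch of the per-pod check-and-evict step. It assumes the proposed `RequiredDuringSchedulingRequiredDuringExecution` selector (which does not exist in the current API) has already been read off the pod and is passed in as a plain *v1.NodeSelector, and it reuses the node-selector matching helper from k8s.io/component-helpers. A real controller would add informers, workqueues, rate limiting, and the eventual-consistency handling discussed above.

```go
// evictIfNoLongerMatching re-checks the proposed required-during-execution
// selector against the updated node and evicts the pod if it no longer matches.
package affinityeviction

import (
	"context"

	v1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/component-helpers/scheduling/corev1/nodeaffinity"
)

func evictIfNoLongerMatching(ctx context.Context, client kubernetes.Interface,
	pod *v1.Pod, node *v1.Node, required *v1.NodeSelector) error {
	// Mirror pods are excluded, per the quoted steps.
	if _, isMirror := pod.Annotations[v1.MirrorPodAnnotationKey]; isMirror {
		return nil
	}
	selector, err := nodeaffinity.NewNodeSelector(required)
	if err != nil {
		return err
	}
	if selector.Match(node) {
		return nil // the required terms still match the updated node
	}
	// Use the Eviction API rather than a plain delete so that PDBs are honored.
	return client.PolicyV1().Evictions(pod.Namespace).Evict(ctx, &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
	})
}
```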

@AxeZhan AxeZhan force-pushed the nodeAffinityController branch from e4ab6d4 to 1dc8b81 Compare October 20, 2024 07:41
@AxeZhan AxeZhan force-pushed the nodeAffinityController branch from 1dc8b81 to 2a4994b Compare October 20, 2024 07:46

### taints

Taints can achieve this feature in some scenarios. But nodeAffinity has more operators for users to choose from, which covers more scenarios.
A reviewer (Member) commented:

It would be great to expand on this and enumerate some of the most important scenarios that this feature enables and that cannot be achieved with taints. Adding a new controller is a heavy operation, and we need to consider the tradeoffs. This feature also changes the scope of NodeAffinity from a scheduling-time capability to a scheduling- and execution-time capability, which adds cognitive burden for users.
To make a final decision, we should be able to confidently answer questions like: "Are these scenarios common enough to justify adding a new controller?" and "Can't taints be extended to support those scenarios and optionally respect PDBs?"
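
For reference on the operator gap discussed above: tolerations support only the Equal and Exists operators, while node affinity terms support In, NotIn, Exists, DoesNotExist, Gt, and Lt. Below is a hedged Go sketch of a requirement with no direct taint/toleration equivalent (the gpu-memory label key is made up for illustration).

```go
// exampleTerm expresses "zone not in {zone-a, zone-b} AND gpu-memory-mb > 16000",
// which taints/tolerations cannot express because they lack NotIn/Gt/Lt.
package affinityeviction

import v1 "k8s.io/api/core/v1"

var exampleTerm = v1.NodeSelectorTerm{
	MatchExpressions: []v1.NodeSelectorRequirement{
		{
			Key:      "topology.kubernetes.io/zone",
			Operator: v1.NodeSelectorOpNotIn,
			Values:   []string{"zone-a", "zone-b"},
		},
		{
			// Gt/Lt compare the label value as an integer.
			Key:      "example.com/gpu-memory-mb",
			Operator: v1.NodeSelectorOpGt,
			Values:   []string{"16000"},
		},
	},
}
```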

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 20, 2024
Without `node-affinity-eviction`, I have to remove the "userB=allow" label from the nodes and delete the pods manually. Also, I can't use taints because they don't respect PDBs.
With `node-affinity-eviction`, I can simply delete the "userB=allow" label from the existing nodes
to re-schedule all pods of user B to these new nodes.

@sanposhiho (Member) commented Nov 27, 2024:

I know I'm late to the party; I just noticed this KEP is coming.

So, I'm not sure this user story is strong enough to support this feature, especially given that implementing and maintaining a new controller is a huge cost.
Can we add some others, or strengthen it?

For this current scenario, why can't we just delete the pods (for example, via deployment restarts)? Are you trying to argue that that's too troublesome?
And can't we use the descheduler? Is there a strong need for this to be supported within vanilla Kubernetes, rather than just saying "use the descheduler"? What's the justification for supporting this feature in Kubernetes, compared to the various other descheduling features?
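
For concreteness, the required node affinity that the quoted user story relies on would look roughly like the following Go value (the userB label key is just the story's example; whether removing that node label should trigger eviction is exactly what this KEP proposes):

```go
// userBRequirement: pods of user B require a node labeled userB=allow. Under
// the proposal, deleting that label from a node would cause the controller to
// evict the pod via the Eviction API (so PDBs are honored), letting the
// scheduler place it on a node that still carries the label.
package affinityeviction

import v1 "k8s.io/api/core/v1"

var userBRequirement = &v1.NodeSelector{
	NodeSelectorTerms: []v1.NodeSelectorTerm{{
		MatchExpressions: []v1.NodeSelectorRequirement{{
			Key:      "userB",
			Operator: v1.NodeSelectorOpIn,
			Values:   []string{"allow"},
		}},
	}},
}
```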

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor):

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.