Handle errors when preparing lease for update #119661

Merged
merged 1 commit into from
Aug 21, 2023
Conversation

cartermckinnon
Contributor

@cartermckinnon cartermckinnon commented Jul 28, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes an issue in which kubelet will remove the owner references from its lease (via an update) if its Node does not exist, preventing garbage collection of the Lease.

Which issue(s) this PR fixes:

Fixes #109777

More context in dupe: #119660

Special notes for your reviewer:

Background:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 28, 2023
@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.28 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.28.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Fri Jul 28 22:12:27 UTC 2023.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 28, 2023
@k8s-ci-robot k8s-ci-robot requested review from deads2k and jpbetz July 28, 2023 22:14
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 28, 2023
@cartermckinnon cartermckinnon changed the title Do not silence errors from newLease Handle errors when kubelet prepares lease for update Jul 28, 2023
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 28, 2023
@aojea
Member

aojea commented Jul 28, 2023

this only modifies the test, it seems WIP

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 28, 2023
@cartermckinnon
Contributor Author

/retest

@dims
Member

dims commented Jul 29, 2023

/test pull-kubernetes-e2e-gce

@dims
Member

dims commented Jul 29, 2023

/retest
/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 29, 2023
@dims
Member

dims commented Jul 29, 2023

/priority important-soon
/triage accepted
/assign @mrunalp @SergeyKanzhelev

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 29, 2023
@cartermckinnon
Contributor Author

FYI, I've gone through the history and haven't found discussion on why these errors did not block the lease update. Initially added here: https://github.com/kubernetes/kubernetes/pull/70034/files#diff-53e6af5b2caffee205409a79f563db9456102ad124ec18a3de6d98553194c406R194

@cartermckinnon cartermckinnon changed the title Handle errors when kubelet prepares lease for update Handle errors when preparing lease for update Jul 29, 2023
Comment on lines +68 to 70
// before every time the lease is created/refreshed(updated).
// Note that an error will block the lease operation.
newLeasePostProcessFunc ProcessLeaseFunc
Contributor Author

@cartermckinnon cartermckinnon Jul 30, 2023

Need to make sure changing this assumption is okay.

I only see two usages of this controller, using these newLeasePostProcessFunc-s:

  1. kube-apiserver:
    func labelAPIServerHeartbeatFunc(identity string, peeraddress string) lease.ProcessLeaseFunc {
        return func(lease *coordinationapiv1.Lease) error {
            if lease.Labels == nil {
                lease.Labels = map[string]string{}
            }
            if lease.Annotations == nil {
                lease.Annotations = map[string]string{}
            }
            // This label indicates the identity of the lease object.
            lease.Labels[IdentityLeaseComponentLabelKey] = identity
            hostname, err := os.Hostname()
            if err != nil {
                return err
            }
            // convenience label to easily map a lease object to a specific apiserver
            lease.Labels[apiv1.LabelHostname] = hostname
            // Include apiserver network location <ip_port> used by peers to proxy requests between kube-apiservers
            if utilfeature.DefaultFeatureGate.Enabled(features.UnknownVersionInteroperabilityProxy) {
                if peeraddress != "" {
                    lease.Annotations[apiv1.AnnotationPeerAdvertiseAddress] = peeraddress
                }
            }
            return nil
        }
    }
  2. kubelet:
    // SetNodeOwnerFunc helps construct a newLeasePostProcessFunc which sets
    // a node OwnerReference to the given lease object
    func SetNodeOwnerFunc(c clientset.Interface, nodeName string) func(lease *coordinationv1.Lease) error {
        return func(lease *coordinationv1.Lease) error {
            // Setting owner reference needs node's UID. Note that it is different from
            // kubelet.nodeRef.UID. When lease is initially created, it is possible that
            // the connection between master and node is not ready yet. So try to set
            // owner reference every time when renewing the lease, until successful.
            if len(lease.OwnerReferences) == 0 {
                if node, err := c.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{}); err == nil {
                    lease.OwnerReferences = []metav1.OwnerReference{
                        {
                            APIVersion: corev1.SchemeGroupVersion.WithKind("Node").Version,
                            Kind:       corev1.SchemeGroupVersion.WithKind("Node").Kind,
                            Name:       nodeName,
                            UID:        node.UID,
                        },
                    }
                } else {
                    klog.ErrorS(err, "Failed to get node when trying to set owner ref to the node lease", "node", klog.KRef("", nodeName))
                    return err
                }
            }
            return nil
        }
    }

The change doesn't impact kube-apiserver, so we should be good 👍

Contributor

Member

The current behavior will always introduce buggy behavior, no?

Contributor Author

@cartermckinnon cartermckinnon Aug 3, 2023

@sttts I think we need to look for usages of this controller, not just the API type: https://grep.app/search?q=k8s.io/component-helpers/apimachinery/lease

I only see 1 result (karmada-io/karmada) that isn't k8s or vendored-k8s. It's using this newLeasePostProcessFunc, which has the same behavior as kubelet's (and so would probably encounter the same bug 😄 ): https://github.com/karmada-io/karmada/blob/e5277b6317ac1a4717f5fac4057caf51a5d248fc/pkg/util/clusterlease.go#L16-L37

Contributor Author

This library does have the standard disclaimer that it has no compatibility guarantees, that it is in direct support of Kubernetes, etc.: https://github.com/kubernetes/component-helpers#compatibility

If we're concerned about the subtlety, I could make a breaking cosmetic change in the name or something to flag this at compile time? But I think that'd be more annoying than helpful.

@cartermckinnon
Contributor Author

cartermckinnon commented Jul 31, 2023

There was a previous attempt to fix this that stalled: #110834

I think the approach taken in that PR has issues.

@cartermckinnon
Contributor Author

cartermckinnon commented Jul 31, 2023

/sig node

This is apimachinery for reasons, but it's kubelet-critical functionality

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jul 31, 2023
// before every time the lease is created/refreshed(updated). Note that an error will block
// a lease CREATE, causing the controller to retry next time, but an error won't block a
// lease UPDATE.
// before every time the lease is created/refreshed(updated).
Member

This is an important change in behavior in a public library. Another option would be to create a well-known error type, limit the change to asserting on that specific error type, and modify the kubelet to return that error type.

Contributor Author

@cartermckinnon cartermckinnon Aug 1, 2023

Yeah, putting this code here was contentious in the original PR, because it probably shouldn't be a public library. 😄

I think it's kind of strange for the interface to return an error but have conditions around when the error is treated as an error. I can't exhaustively verify that this change doesn't break any consumers, but I have verified that it doesn't break the Kubernetes components that utilize it. I can't really think of a scenario in which this would break consumers, because the lease post-process func has no way to distinguish an update from a create (or, at least, this is not expressed in the interface).

Member

Fully agree with you, I don't understand it either: if the postProcessFunc generates an error, why should we try to update the Lease nevertheless?

I didn't find an explanation in the original PR either: #95428

@aojea
Member

aojea commented Aug 1, 2023

/assign @wojtek-t @liggitt

@aojea
Member

aojea commented Aug 3, 2023

/lgtm

I agree with the author that it is difficult to understand why, if we fail to create a newLease, we would still want to use it to update the existing one. Assigning people from the original PR, maybe they have more context: https://github.com/kubernetes/kubernetes/pull/119661/files#r1283162922

/assign @wojtek-t @deads2k @sttts

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 3, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 03ea88ffdf34a90b80fa022c32aad9abb66198cd

@cartermckinnon cartermckinnon requested a review from sttts August 4, 2023 20:02
@dims
Member

dims commented Aug 21, 2023

/approve
/lgtm

This one spans across areas. I am happy with the review/engagement it has gotten so far, and it looks good to me.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cartermckinnon, dims

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 21, 2023
@k8s-ci-robot k8s-ci-robot merged commit 92b7905 into kubernetes:master Aug 21, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Aug 21, 2023
@cartermckinnon cartermckinnon deleted the lease-leak-node-not-found branch October 16, 2023 16:38
Development

Successfully merging this pull request may close these issues.

Kubelet creates leases without ownerReference set if node doesn't exist
10 participants