
SIG Node Meeting Notes

Dec [all dates] - meetings canceled, next meeting Jan 2nd 2024

Dec 5, 2023

Recording: https://youtu.be/k7kzjKnmwSI

  • [tzneal] Feedback on OOM cgroup killing change kubernetes/kubernetes#117793 (comment)

    • Let’s add a kubelet option to restore the old behavior. The new behavior stays the default
    • Need to cherry-pick to 1.27
  • [haircommander] kubernetes/kubernetes#114847 follow ups

    • Summary of proposed policy changes:
      • Pods with the pull-never policy must present the same credential to re-use an image that was pulled with a credential.
      • Kubelet needs a switch to disable validation of preloaded images in the cache (for disconnected mode at node / kubelet restart). Otherwise images pulled with the if-not-present policy are subject to revalidation, and pull-never will fail if never authenticated.
      • A pod that successfully pulls an image anonymously from registry A (or the default) is to be considered “unique”: we will not use that anonymous pull as an anonymous success for pods pulling from another registry. This requires an algorithm change in the current feature implementation.
    • Derek: let’s consider other patterns for time slicing the auth and pull time slots, policies for specifying when and possibly for what reason we need to auth.. (align better with disconnected needs, not just performance/multi-tenant)
    • future items: possible integration with registries for discovering (header/artifacts) what the expiration is for an image
  • [SergeyKanzhelev] Planning of 1.30
    Eliminating perma betas. List of “old” feature gates:

    • AppArmor
      • Mostly need to clean up tests
      • Sergey to follow up
    • CustomCPUCFSQuotaPeriod - Peter will take a look
    • GracefulNodeShutdown
      • Issues with some controllers - Ryan add a comment on KEP indicating what the issue is.
    • GracefulNodeShutdownBasedOnPodPriority
    • LocalStorageCapacityIsolationFSQuotaMonitoring
    • MemoryManager
      • Got as far as PRR review. Lack of observability is concerning from PRR review - need to work on this.
      • Francesco will follow up on this.
      • Swati: many issues opened for MemoryManager before GA
        • totally true we need to address them - silver lining is this can be done in parallel with observability improvements
    • MemoryQoS
    • PodAndContainerStatsFromCRI
      • Stalled on CRI implementation of those metrics
      • Working on it in CRI-O
      • Need some help from Containerd side
      • Exit criteria: must test performance that is not regressing
    • RotateKubeletServerCertificate
      • No tests and docs
      • Need volunteer to clean it up
      • Harche will take a look
    • SizeMemoryBackedVolumes
      • Need volunteers

    Deprecations:

    • cgroup v1
    • Mrunal, Dawn:
      • let’s announce deprecation in 1.30.
      • Default to cgroupv2 in tests and have cgroupv1 as an “additional”
      • [Alexander Kanevsky] Collect a list of the distros that people use, their default cgroups, and the EOL of those distros, e.g. CentOS 7 or some Ubuntu LTS.
  • [SergeyKanzhelev] Should we cancel all the rest of the meetings for the month of Dec?

    • Let’s cancel meetings until the end of the year and meet on Jan 2nd
  • [hakman] node-problem-detector maintainers are needed to keep the project alive. I tried to follow the guidelines to step up as a reviewer and later approver, but it seems there is a lack of approvers. If possible, I would like someone from #sig-node to sponsor me. Thanks in advance! kubernetes/node-problem-detector#830

    • AI: SergeyKanzhelev to follow up

Nov 28, 2023 [Canceled]

  • Canceled due to lack of agenda.
  • Please bring 1.30 planning topics for the next meeting.

Nov 21, 2023 [Cancelled]

  • Cancelled due to Thanksgiving week in US and leads availability

Nov 14, 2023 [Cancelled]

Nov 7, 2023 [Cancelled]

  • Canceled due to kubecon

October 31, 2023

Recording: https://youtu.be/RYBb81l1IGw

October 24, 2023

Canceled due to an empty agenda. Review PRs for freeze next week.

October 17, 2023

Recording: https://youtu.be/740kJACH3i8

Total active pull requests: 311

(weekly changes)

| Incoming | Completed |
|---|---|
| Created: 41 | Closed: 10 |
| Updated: 113 | Merged: 38 |

New stats needed:
  • PRs needing other SIG approvals
  • Waiting for approvers
  • Waiting for reviewers
  • Separate cherry-picks and regressions

October 10, 2023

Recording: https://www.youtube.com/watch?v=akrWtsCbJZo

October 3, 2023

Recording: https://www.youtube.com/watch?v=HdIURTQSm7Q

September 26, 2023

Recording: https://www.youtube.com/watch?v=yEOUKJCJXa8

  • [Filip Krepinsky] Declarative Node Maintenance: discuss issues related to node drain and the solutions this KEP proposes
  • [Adrian Reber] Created PR to avoid filling up local disk space with too many checkpoint archives.
    • kubernetes/kubernetes#115888
    • bringing to sig-node for awareness
    • requested by multiple users
    • functionality: if more than a certain number of checkpoints are created per container/pod/namespace (default 10), older checkpoint archives are deleted
  • [klueska] Update CDI for device plugins KEP for beta graduation in 1.29
  • [marquiz] introducing KEP-4112 “Pass down resources to CRI”
    • better visibility pod resources in CRI
    • two goals
      • pass down all resources (of all containers) at sandbox creation
      • pass unmodified resource requests and limits to CRI
  • [Kevin Hannon] PodReadyToStartContainers promotion to beta
  • [Kevin Hannon] Split Image Disk KEP
    • kubernetes/enhancements#4198
    • If you are interested in separate image filesystems or are dealing with existing problems, please ping me or comment on the KEP.
    • We are interested in hearing what we should consider in scope for this
  • [katarzyna, robscott] Kubelet API to return the local pods. Please review
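
The checkpoint retention idea from the [Adrian Reber] item above (keep at most a fixed number of archives, default 10, and delete older ones) can be sketched roughly as follows. This is only an illustration, not the code in kubernetes/kubernetes#115888: the function name `prune_checkpoints`, the `.tar` suffix, the prefix-based grouping, and the use of file modification time to decide which archives are "older" are all assumptions.

```python
from pathlib import Path


def prune_checkpoints(directory: Path, prefix: str, keep: int = 10) -> list[Path]:
    """Delete the oldest archives whose name starts with `prefix`.

    Hypothetical helper: keeps the `keep` most recently modified archives
    and returns the paths that were removed.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Newest first, so everything past index `keep` is surplus.
    archives = sorted(
        directory.glob(f"{prefix}*.tar"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    removed = []
    for old in archives[keep:]:
        old.unlink()
        removed.append(old)
    return removed
```

Grouping by a file name prefix is one simple way to express a per-container, per-pod, or per-namespace limit. The actual grouping key used by the PR is not described in these notes.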

September 19, 2023

Recording: https://www.youtube.com/watch?v=ngboQ3GvX5o

September 12, 2023

Recording: https://www.youtube.com/watch?v=hVuZg2mqNsw

September 5th, 2023

Recording: https://www.youtube.com/watch?v=5iiD9OIeJv8

August 29th, 2023

Recording: https://www.youtube.com/watch?v=tNangR9QLkg

  • [kannon92]: Follow up for PodReadyToStartContainers Beta
  • [karthik-k-n]: As discussed earlier, shall we have a separate meeting to discuss the scope for dynamic node resize?
  • [katarzyna, robscott]
  • [sunnylovestiramisu] Can I use Node -> NodeStatus -> NodePhase(NodePhase is the recently observed lifecycle phase of the node) as evidence that a node is registrationCompleted? If yes, which phase should I use? If not, can I add another status called NodeRegistered to the NodePhase and update it while we set registrationCompleted? - context issue.
  • [weipeng] Need attention on PR Fix: Exclude reserved CPUs from shared pool. Currently the pull-kubernetes-node-kubelet-serial-cpu-manager test lane itself has some issues, how should we proceed? kubernetes/kubernetes#118021

August 22nd, 2023

Recording: https://www.youtube.com/watch?v=Y9btZGnyDK0

  • [rata]: In 1.28 we added support for stateful pods with user namespaces.
    • Do we want to blog about it?
    • [rata]: I’ll create a gdocs and share it here. Once that is finalized, I’ll open a PR to the website.
    • If we can’t find someone from sig-docs to approve it, Sergey can help.
  • [fromani] notification: approvers, PTAL at these backports (all lanes fixed, tests passing, LGTMd): kubernetes/kubernetes#119432 kubernetes/kubernetes#119706 kubernetes/kubernetes#119707
  • [haircommander] Image GC WG time decided–can we add to the calendar?
    • Wednesday 12-12:30 PST (3-3:30 EST)
  • [tzneal] Kubelet detecting a readonly filesystem, what’s the boundary between node-problem-detector responsibilities and kubelet - kubernetes/kubernetes#115746

August 15th, 2023

Recording: https://www.youtube.com/watch?v=wgF8UDgp1sQ

  • [ruiwen] 1.28 KEPs retro (at least 30 minutes. We may not have much time for many other topics)
  • [haircommander] Kubelet image GC conversation
    • [Ruiwen] Pin images
    • [Derek] i am curious if secret pulled images have any unique gc requirements that have surfaced…
      • Tie lifecycle of image to lifecycle of pod?
    • [Sergey] Mirror config into kubelet?
    • Peter to begin a WG in between now and KEP freeze to come up with next steps before bringing to larger group.
  • [SergeyKanzhelev] Sidecar WG: join for the next push in 1.29:

August 8th, 2023

Recording: https://www.youtube.com/watch?v=9BBSMdw8dMA

August 1st, 2023

Recording: https://www.youtube.com/watch?v=V9F8jHgs6R4

July 25th, 2023 [Cancelled due to lack of agenda]

July 18th, 2023

Recording: https://www.youtube.com/watch?v=0Uqq8jNSSDk

  • [ndixita] memory QoS Beta K8s 1.28 might be infeasible https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit#bookmark=id.qaybju6wvb05
    • Requesting kernel experts here for discussion around memory.high memcg controller usage, signals for memory reclaim(pgscan, pgsteal from memory.stat?).
  • [jiaxin] new CPU Manager static policy and in-place VPA improvements (performance, make it work with CPU Manager together), KEP or PR?
    • Problem 1: noisy neighbor issue. We want to spread hyperthreads across physical cores to get better performance.
    • Problem 2: In-place VPA currently doesn’t work with CPU Manager
    • Problem 3: In-place VPA sometimes takes up to a minute to finish scaling, etc. We will finish a doc with the problems and solutions for further discussion.
    • [fromani] most likely a KEP+1, perhaps share a (preliminary) design doc in the community to outline the proposed scope and changes
    • [Dawn] Please start with a doc on the issue / problem statement and the suggested solution.
    • [Alex] Please separate in-place VPA improvements from CPU static policy.

July 11th, 2023

Recording: https://www.youtube.com/watch?v=0ggcapGYwtc

July 4th, 2023 [Canceled due to US holiday]

June 27th, 2023

Recording: https://www.youtube.com/watch?v=KMD17c5EbFU

  • [Wedson] Discuss setting a default runtime handler for CRI image operations if no RuntimeClass is specified. Containerd supports using different snapshotters if pods have the runtime handler annotation specified, but this can cause issues when a pod without the annotation is scheduled after a pod that specifies a runtime handler: kubelet will think the image is already present even though it was fetched with a different snapshotter.

    • [mrunal] This intersects with Ensure image pull secrets. Another intersection with signature verification kubernetes/kubernetes#118652
    • Wedson’s PR: kubernetes/kubernetes#118907
    • [Sergey] How rm works on containerd - does it remove both or just default?
    • [Peter] we can’t rely on CRI to do all of the handling because image pull policy isn’t propagated. thus, we do need the annotation approach for now until 1.29 planning when kubelet image gc undergoes redesign
  • [mahamed/upodroid] Overhauling sig-node node e2e tests. I have been working with dims on introducing EC2 node e2e tests and I want to use this opportunity to complete KEP-2464 and adopt kops' prowjob generator to generate jobs at scale as we need to test various permutations of multiple OS, architectures and CRI implementations.

    Implementation: kubernetes/test-infra#29944

    PTAL at the e2e tests guidance in works:

  • [fromani][discussion if time allows, otherwise PTAL and comment on github!] handling devices assignment on node reboot and kubelet restart: issue kubernetes/kubernetes#118559 and its proposed fix kubernetes/kubernetes#118635

  • [haircommander] cgroup driver implementation discussion kubernetes/kubernetes#118770
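
The image presence problem raised in the [Wedson] item above can be illustrated with a small sketch. This is not kubelet or containerd code: the class `ImagePresence`, its methods, the handler name `kata`, and the convention that an empty string stands for the default handler are all assumptions made for illustration. The point is only that tracking presence per (image, runtime handler) pair avoids treating an image fetched with one snapshotter as present for another.

```python
class ImagePresence:
    """Hypothetical presence tracker keyed by (image, runtime handler)."""

    def __init__(self) -> None:
        self._pulled: set[tuple[str, str]] = set()

    def record_pull(self, image: str, handler: str = "") -> None:
        # "" stands for the default runtime handler (assumption).
        self._pulled.add((image, handler))

    def is_present(self, image: str, handler: str = "") -> bool:
        # An image pulled for one handler does not count for another.
        return (image, handler) in self._pulled

    def is_present_any_handler(self, image: str) -> bool:
        # The behavior described as problematic: the handler is ignored.
        return any(name == image for name, _ in self._pulled)
```

For example, after `record_pull("registry.example/app:v1", "kata")`, `is_present("registry.example/app:v1")` is false for the default handler, while `is_present_any_handler` would report the image as present.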

June 20th, 2023 [Cancelled]

June 13th, 2023

Recording: https://www.youtube.com/watch?v=nF_3dnZJVnA

Enhancements tracking board: https://github.com/orgs/kubernetes/projects/140/views/1?filterQuery=sig%3A%22sig-node%22&sortedBy%5Bdirection%5D=desc&sortedBy%5BcolumnId%5D=Status

May 23rd, 2023 * Need formal approval from SIG Node Tech Leads on the issue

June 6th, 2023

Recording: https://www.youtube.com/watch?v=rR3zOunp6FE

May 30th, 2023

Recording: https://www.youtube.com/watch?v=H9vnLgvTLvo

Agenda

KEPs: https://github.com/kubernetes/enhancements/issues?page=1&q=is%3Aissue+is%3Aopen+label%3Asig%2Fnode+milestone%3Av1.28

  • [harche/mrunalp] Cautiously enabling swap only for Burstable Pods - kubernetes/enhancements#3957
  • [marquiz/haircommander]: KEP 4033: discover kubelet cgroup driver from CRI
    • There are other cases where the CRI may want to tell the Kubelet what the state of the world is
    • focus this KEP on cgroup driver, but have API extendable so those other use cases (runtime class, QOS class, user namespace support) can be easily covered in the future
    • Separate CRI message from RuntimeStatus so Kubelet can request separately.
  • [mimowo] Changed pod phase when containers exit with 0, related issue: kubernetes/kubernetes#118310. Summary:
    • eviction_manager, preemption: 1.26: Failed, 1.27: Succeeded
    • node shutdown 1.26: Failed, 1.27: Succeeded
    • active deadline exceeded 1.26: Failed, 1.27: Failed
  • [astoycos] bpfd Presentation!
    • Slides
    • [SergeyKanzhelev] SIG Node may help in terms of attributing events to pod metadata. When kernel events are received, it would be nice to know which Pod is running the process that sent the event. Please let us know if anything can be improved from the SIG Node side to help with this.
  • [byako] KEP-3542 CRI PullImageWithProgress https://github.com/kubernetes/enhancements/pull/3547/files
  • [adilGhaffarDev] What is the status of this fix: kubernetes/kubernetes#117030 what can we do to escalate it, if possible?
  • [haircommander] KEP 3983: Add support for a drop-in kubelet configuration directory
    • Mostly a review request
  • [SergeyKanzhelev] kubernetes/kubernetes#116429 sidecar PR.

May 23rd, 2023

Recording: https://www.youtube.com/watch?v=shmDtrq55V8

Agenda

  • [intunderflow] Following up from the April 25th meeting about lowering the frequency of Startup / Readiness probe failure events. My preferred approach after digesting feedback:
    • Always emit an event when the result of a probe changes (between Success and Failure, or Failure and Success)
    • When a startup probe fails or a readiness probe fails:
      • We emit the first failure
      • We then emit a failure every 1 hour if still failing
        • Should this event be the same as the first failure, or should it be perhaps something like “Probe still failing since [first failure time]”
    • No changes to liveness probes failing for now:
      • This will still cause mass event emission to hit the rate limit, but I want to tackle this incrementally and follow up on liveness probes
      • Lots of users watch for liveness probe failed events, so it's something to be particularly careful about in my opinion (people of course watch readiness/startup probes too, but I’d assume not as many / that liveness probes are the most populous probe type)
    • Thoughts from the group about this approach? If happy I can put together a KEP
  • [intunderflow] kubernetes/kubernetes#115963 needs approver - I’d like to target this for 1.28 if no objections
  • [ffromani] REQUEST: looking for approvers for (all items already part of 1.28 tracking document)
  • [swsehgal] Proposing NodeResourceTopology API under kubernetes-sigs: kubernetes/org#4224. Previously the API was proposed under staging but that proposal was rejected during API review.
    • +1: Alexander +1: Francesco
  • [astoycos] Super Short Introduction of https://github.com/bpfd-dev/bpfd (propose an actual 15-20 minute presentation for next week?) Also reach out in K8s slack #bpfd and #ebpf
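
The event emission rules proposed by [intunderflow] above for startup and readiness probes (emit on every result change, emit the first failure, then re-emit every hour while still failing) can be sketched as a small state machine. This is an illustration of the proposal as written in these notes, not kubelet code: the class `ProbeEventThrottle`, its fields, and the choice not to emit anything when the very first result is a success are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

REPEAT_INTERVAL_SECONDS = 3600.0  # "every 1 hour" from the proposal


@dataclass
class ProbeEventThrottle:
    """Hypothetical per-probe state deciding whether to emit an event."""

    last_result: Optional[bool] = None  # None until the first probe result
    last_emit: float = 0.0

    def should_emit(self, success: bool, now: float) -> bool:
        # Rule 1: always emit when the result flips.
        changed = self.last_result is not None and success != self.last_result
        # Rule 2: emit the first failure.
        first_failure = self.last_result is None and not success
        # Rule 3: while still failing, re-emit once per interval.
        still_failing = (
            not success
            and not changed
            and not first_failure
            and now - self.last_emit >= REPEAT_INTERVAL_SECONDS
        )
        self.last_result = success
        if changed or first_failure or still_failing:
            self.last_emit = now
            return True
        return False
```

Timestamps are passed in explicitly so the logic can be exercised without waiting. Liveness probes are out of scope here, matching the "no changes to liveness probes for now" point.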

May 16th, 2023

Recording: https://www.youtube.com/watch?v=gnbV1nrXVZc

Agenda:

  • [everpeace] I opened a PR for KEP-3619.
    • PR: kubernetes/kubernetes#117842
    • KEP: KEP-3619: Fine-grained SupplementalGroups control
    • I would like the community to triage this and review it.
    • I would be very glad if someone could mentor me, because this is my first time making a PR that includes API changes.
    • NOTE: I’m sorry that I can’t show up to the community meeting due to timezone gap (2am in my timezone(Tokyo🇯🇵🗼)). I put this agenda to gain visibility and to help 1.28 planning.
  • [tzneal] Discuss using the cgroup aware OOM killer kubernetes/kubernetes#117793
    • KEP needed for the API change?
    • Potential Options
      • No config, just a new default
      • Add API to Container to allow workload specific configuration
      • Add flag to kubelet
    • [Dawn] Let’s just change the default
    • [mrunal] OK with this.
    • Dawn will comment on the PR.
  • [mimowo]
  • Can this refactoring be led by sig-node?
  • Alternatively, can we go with the simple approach of adding the condition whenever the timeout is exceeded, as suggested in the POC PR: kubernetes/kubernetes#117973. Then we could document that the behavior when the timeout is exceeded but the containers aren’t killed (and instead terminate on their own) is subject to change. Proposed KEP updated for review: kubernetes/enhancements#3999
  • [SergeyKanzhelev] Sidecar KEP: https://github.com/kubernetes/enhancements/pull/3968/files and kubernetes/kubernetes#116429
  • [mo] looking for a way to provide dynamic environment variables at runtime without persisting them in the Kube API (because the contents are sensitive)
    • would like to avoid any approach that uses admission to mutate pods
    • [Anish Ramasekar to Everyone (10:43 AM)] This is the subproject: https://github.com/kubernetes-sigs/secrets-store-csi-driver
    • [Sergey] will this help: kubernetes/enhancements#3721?
    • Init container can download and then regular container will use those.
    • [mo] this ^^^ can work. Is this the right way?
    • [kevin] are you familiar with DRA? CDI is the lowest level that makes the abstract notion of a device available to a container. CDI can inject environment variables into the container. There may be a “device” that performs all the vault work and then injects those variables into the container
    • [mo] what is the security model?
    • [kevin] this information will end up being statically stored in a CDI file on the host system
    • [mo] is there a way to observe this from the Kubernetes API?
    • [kevin] DRA is generalization of persistent volumes API. So it will provide some isolation.
    • [Sasha] this will not protect from exec into container. As no env variables would do.
    • [mo] can other containers see it? Non-privileged ones, for example.
    • [mo] what is the interface for DRA? Can it be Daemonset in runtime?
    • [kevin] there is a talk about it at KubeCon. It has all the pieces to build this.
      • [Kevin] Here is my talk on how DRA drivers are structured:
  • [klueska] New Feature for 1.28: Add CDI devices to device plugin API

—- MOVED from 5/2/2023. Move above this line if you plan to show up on the meeting —

  • [kannon92] PRs need approval

May 9th, 2023

Recording: https://www.youtube.com/watch?v=18cRhXTf0Cc

Total active pull requests: 242

| Incoming | Completed |
|---|---|
| Created: 19 | Closed: 9 |
| Updated: 118 | Merged: 17 |

  • [swsehgal] Community discussion on device Manager recovery bugfix backport
  • [karthik-k-n] Community thoughts on Dynamic Node resize proposal
  • [clayton] Discussion of kubelet state improvements for 1.28 - trying to identify which areas to focus on
  • [zmerlynn] Discuss
    • Dawn: Maybe first restart free, don’t punish
    • Clayton: DaemonSet that runs effectively a for loop to anneal policy
      • There are things we don’t account for, like system resources in a crash looping pod - what does it actually cost to restart a container
    • Derek (on chat): I wonder if we need a way to measure a qps generally for the behavior that crashloopbackoff is trying to protect
      • systemd gives StartLimitBurst and then when that is exhausted you go to StartLimitInterval.... feels like we could give a burst
    • Sergey: Maybe we also need “it’s a bad failure, reschedule me”
    • David: Is it up to the admin to define this?
    • Kevin: KEP in question that Sergey mentioned: kubernetes/enhancements#3816
    • Clayton: Full backoff doesn’t make sense for static pod anyways
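
The "first restart free" and "give a burst" ideas from the [zmerlynn] discussion above can be sketched as a delay function: a small number of immediate restarts, followed by exponential backoff. This is a sketch of the idea only, not a proposed kubelet change. The function name `restart_delay` and the `burst` parameter are hypothetical. The 10 second base and 300 second cap mirror the documented CrashLoopBackOff behavior (10s, 20s, 40s, ... capped at five minutes).

```python
def restart_delay(
    restart_count: int,
    burst: int = 1,
    base: float = 10.0,
    cap: float = 300.0,
) -> float:
    """Seconds to wait before restart number `restart_count` (1-based).

    Hypothetical policy: the first `burst` restarts are free, later
    restarts back off exponentially from `base` up to `cap`.
    """
    if restart_count < 1:
        raise ValueError("restart_count is 1-based")
    if restart_count <= burst:
        return 0.0
    exponent = restart_count - burst - 1
    return min(cap, base * (2 ** exponent))
```

With `burst=1` the first restart is immediate and the second waits 10 seconds. A larger `burst` is loosely analogous to systemd's StartLimitBurst, as mentioned by Derek, although systemd's semantics differ in detail.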

—-- End of the meeting. MOVED TO THE NEXT WEEK —--

May 2nd, 2023

Recording: https://www.youtube.com/watch?v=whN6nPOp62g

Total active pull requests: 241

| Incoming | Completed |
|---|---|
| Created: 27 | Closed: 11 |
| Updated: 88 | Merged: 25 |

Apr 25th, 2023

Recording: https://www.youtube.com/watch?v=oQi3gPsODV0

  • [intunderflow] kubernetes/kubernetes#115963 needs approver
  • [intunderflow] Thoughts on startup probe / readiness probe event emission behavior?
    • Currently the readiness probes and startup probes emit ContainerUnhealthy events each time they probe the container and it is Unhealthy.
    • For liveness probes a container going from a healthy state to suddenly unhealthy is important and notable, but for Readiness and Startup probes it's pretty typical for a container to be unhealthy since the point of these probes is to wait until the container is healthy.
    • Emitting these events eats into the rate limit of 25 events per object sent to the API server.
    • Readiness probes and Startup probes failing multiple times is pretty typical of their operation, since their point is to gate the container until it succeeds.
    • It would be nice if Readiness probes and Startup probes didn’t eat events as fast as they do.
    • My thoughts and opinions:
      • We could consider changing the startup and readiness probe to only emit when they probe the container and it is healthy (since that leads to a change in state and action being taken)
      • My PR above (if approved) would still then report if a startup probe or readiness probe fails conclusively against a container
    • [Action Item] Count incrementation on Events? Why not working for failing probes?
    • [Ryan] The event recorder has a max retries of 12
    • https://github.com/kubernetes/client-go/blob/master/tools/record/event.go#L38
    • [Todd] we need events to be re-emitted periodically. Do not discard them universally. Less frequency, but definitely more events we want to know about like flakes of readiness probe.
  • [SergeyKanzhelev] Probes functionality cleanup: https://docs.google.com/document/d/1G5nGH97s3UTANbA5IyQ7nVIHnrLKfgVZssSYnvp_qX4/edit
  • [haircommander/Peter] Kubelet drop-in config support
    • After the conversation about dropping CLI flag support, it became clear that our users (downstream in OpenShift) rely on this feature. This could be a good time to introduce drop-in file support, like in /etc/kubernetes/kubelet.conf.d
    • Peter to make proposal to SIG-Arch to see if other components would like to adopt a similar pattern, as well as open an issue to have an asynchronous conversation.
  • [SergeyKanzhelev] kubernetes/kubernetes#116429, uber issue: kubernetes/kubernetes#115934
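
The drop-in configuration idea from the [haircommander/Peter] item above can be sketched as follows: load a main config file, then apply each file from a drop-in directory on top of it. This is a sketch under several assumptions, not the kubelet implementation: the function names, the `.conf` suffix, lexical ordering of the drop-in files, the recursive merge of nested maps, and the use of JSON (to stay within the standard library) are all choices made for illustration.

```python
import json
from pathlib import Path
from typing import Any


def merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return `base` with `override` applied.

    Nested dicts are merged key by key; any other value is replaced.
    """
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result


def load_kubelet_config(main_file: Path, dropin_dir: Path) -> dict[str, Any]:
    """Load the main config, then apply drop-ins in lexical file name order."""
    config = json.loads(main_file.read_text())
    if dropin_dir.is_dir():
        for dropin in sorted(dropin_dir.glob("*.conf")):
            config = merge(config, json.loads(dropin.read_text()))
    return config
```

Applying files in a predictable order means that a later file such as `20-site.conf` wins over `10-base.conf` for the same key, which is the usual convention for `.d` directories.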

Apr 18th, 2023

No call (kubecon)

Apr 11th, 2023

Recording: https://www.youtube.com/watch?v=R9bml9YmP3k

  • [klueska] Need approval on PR to update DRA KEP with changes merged into v1.27
  • [liggitt/derek] proposal to support node/control-plane skew of n-3 (KEP-3935, draft proposal)
    • What in-progress node feature / cleanup rollouts rely on n-2 skew?
      • might delay default-on of in-place-resize for one release (AI: jordan / vinay sync up); notes from jordan/vinay 2023-05-03:
        • a 1.27+ node with the feature disabled will not modify resources as requested, will mark pods requesting resize as "infeasible"
        • a pre-1.27 node will not modify resources as requested, with no user feedback
        • after 1.27 work, we realized that kubelet perpetually reports pod resize as InProgress when running against a containerd that supports UpdateContainerResources CRI API (containerd ~1.4/~1.5 era) but does not support ContainerStatus CRI API (added to CRI API in k8s 1.25, supported in containerd 1.6.9+), so there's already user feedback improvements to make and possibly delay beta for
        • if we were ready to promote in-place-resize to beta in 1.29, n-3 skew would mean 1.26 kubelets would not give any user feedback about lack of support for the feature, but would otherwise fail safe
    • derek:
      • include alternative considered of supporting in-place minor upgrades, rationale why that approach wasn't chosen
        • OS upgrades, immutable nodes can't use in-place for minor upgrades
        • cost of supporting/testing in-place minor upgrades is significantly higher, impacts development of new features and evolution of existing features
      • make sure it is clear what guidance should be given to people working on new features for what to do for features older kubelets don't support yet
  • [mweston & atanas] Still working on the https://github.com/obiTrinobiIntel/enhancements/tree/atanas/cci-updated/keps/sig-node/3675-resource-plugin-manager KEP. Need help scheduling time with Dawn or another member to get feedback.
  • [mrunal] Canceling next week's meeting for kubecon.

Apr 4th, 2023

Recording: https://www.youtube.com/watch?v=Y_TWnklb0vI

  • [pacoxu] undeprecate kubelet --provider-id flag: what are your plans around graduating kubelet config file/actually deprecating these flags in the future?
  • [iancoolidge] Follow-up on issue kubernetes/kubernetes#115994
  • [rata] Userns KEP 127: add support for stateful pods
    • We don’t need code changes in the kubelet for this (just change the validation)
    • Therefore, we want to just change the scope of the KEP to support stateful pods too
    • We want to deprecate the feature gate “UserNamespacesStatelessPodsSupport” and add “UserNamespacesSupport”
    • This new feature gate will activate userns for all pods (stateful/stateless)
    • If this sounds good, we will do a PoC and propose the KEP changes widening the scope and explaining how the stateful case works too.
      • [mrunal] This may be okay but let’s open a KEP change and get opinions of other reviewers involved.
      • [mrunal] We need to start thinking about how user namespaces will work with pod security policies.
      • [rata]: Mrunal and I will join sig-auth to start the PSS conversation
      • [rata] Maybe they need fields to be GA? But happy to start discussing.

Mar 28th, 2023

Recording: https://www.youtube.com/watch?v=yb_LtE0hGDc

  • [SergeyKanzhelev] Annual Report: #7220

    Let’s edit together: https://docs.google.com/document/d/17Z3LO3pSdv9R-v9yLIMO5a46nwXRQTsaEDg0iN74rhs/edit?usp=sharing

  • [jlpedrosa]

    • memory.oom.group setting to oom the whole cgroup in the container.
      slack convo.
      • [Mrunal] container level makes sense
      • [Sergey] for sidecars we will adjust the OOM score so it’s almost the “whole Pod” being killed
      • [Mrunal] we can start with the issue, may not need a KEP for this
      • [Todd Neal] I think there is a potential for API surface as the new behavior may not be desired in all cases. haproxy was the example brought up in Slack where it may handle OOM correctly on a single process. Most everything else probably doesn't, so you might want a default of turning oom.group on and allowing containers to opt-out.
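
The default-on with opt-out idea from [Todd Neal] above can be sketched against the cgroup v2 interface, where writing `1` to a cgroup's `memory.oom.group` file makes the OOM killer treat the cgroup as a single unit. This is only an illustration: the function names and the `opt_out` parameter are hypothetical, no such container-level API is defined in these notes, and the cgroup directory is passed in rather than discovered.

```python
from pathlib import Path


def oom_group_value(opt_out: bool) -> str:
    """Value for memory.oom.group: group kill by default, "0" when opted out."""
    return "0" if opt_out else "1"


def apply_oom_group(cgroup_dir: Path, opt_out: bool = False) -> str:
    """Write the chosen setting into the container's cgroup v2 directory.

    `cgroup_dir` is assumed to be the container's own cgroup directory.
    """
    value = oom_group_value(opt_out)
    (cgroup_dir / "memory.oom.group").write_text(value)
    return value
```

A workload like the haproxy example, which may handle a single-process OOM kill on its own, would be configured with `opt_out=True`.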

Mar 21st, 2023

Recording: https://www.youtube.com/watch?v=IjxUleYcKgk

Mar 14th, 2023

Recording: https://www.youtube.com/watch?v=e0DA7x4zTs0

Total: 200

| Incoming | Completed |
|---|---|
| Created: 86 | Closed: 35 |
| Updated: 203 | Merged: 103 |

Needs approval: label:lgtm -label:approved 41

https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+label%3Apriority%2Fcritical-urgent++label%3Asig%2Fnode+

https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+label%3Apriority%2Fimportant-soon++label%3Asig%2Fnode+

Jan 31st, 2023

  • [marosset/sig-windows] kubernetes/kubernetes#116546
    • updating perfCounterUpdatePeriod in kubelet to 10 seconds on Windows to address some perf issues when running lots of pods

Mar 7th, 2023

Recording: https://www.youtube.com/watch?v=KgAR613c1Bs

Total PRs: 241

| Incoming | Completed |
|---|---|
| Created: 35 | Closed: 14 |
| Updated: 136 | Merged: 31 |

https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+label%3Apriority%2Fimportant-soon+label%3Asig%2Fnode

  • [SergeyKanzhelev] 19 enhancements tracked, and at the moment 0 were opted-in for Feature Blogs.
  • [KevinHannon@kannon92] Starting work on kubernetes/enhancements#3816 (Pending Pods stuck due to configuration errors)
    • Created a POC PR to see about validation of some of these configuration errors
    • kubernetes/kubernetes#115736
    • Should I consider moving this into a KEP of its own?

Feb 28th, 2023

Recording: https://www.youtube.com/watch?v=IHcI6Jwo5PQ

Total PRs: 248

| Incoming | Completed |
|---|---|
| Created: 36 | Closed: 9 |
| Updated: 83 | Merged: 11 |

  • 01:00 UTC Wednesday 15th March 2023 / 17:00 PDT Tuesday 14th March 2023: Week 10 — Code Freeze
  • [vinaykul] InPlace Pod Vertical Scaling PR - status update
  • [SergeyKanzhelev] Sidecar: kubernetes/kubernetes#115934
  • [mimowo]
    • need reviewer/approver for: kubernetes/kubernetes#116082. Is it an issue upstream in containerd?
      • [mike brown] Let’s chat. I don’t think this is a problem with containerd. It is more an issue of the test expecting the start request to be serial and to respond before the async start kicks off. In other words, add a sleep/yield to the test itself before it OOMs and you should get the started response flowing back through kubelet before the OOM happens. Changing to an ack model for the start request before actually starting would conflict with the start being able to return certain errors.
    • Discuss implementation decisions for kubernetes/kubernetes#115331. Specific questions:
      • Should we restrict the handling to pods with finalizers (to save QPS)?
      • When Kubelet restarts there is a short time window that the phase may flip back to Pending, is this something specific to this scenario, or a general behavior / bug in Kubelet?
      • Should we also make sure that all Running pods with deletionTimestamp end up in terminal phase? This is currently not the case for pods with RestartPolicy=OnFailure or Always.
    • Followup:
      • Clayton’s PR that may fix the failure case: kubernetes/kubernetes#113145
      • E2E added to Clayton’s PR would be helpful to see if the issue is fixed or not

Feb 21st, 2023

Recording: https://www.youtube.com/watch?v=Hod1MGk99lc

Total PRs: 230

From Jan 24th:

| Incoming | Completed |
|---|---|
| Created: 129 | Closed: 48 |
| Updated: 177 | Merged: 76 |

Feb 14th, 2023

Recording: https://www.youtube.com/watch?v=NsV9TVcJw54

Feb 7th, 2023

Recording: https://www.youtube.com/watch?v=cam97qjy8qE

27 KEPs: https://github.com/kubernetes/enhancements/issues?q=is%3Aissue+is%3Aopen+label%3Asig%2Fnode+milestone%3Av1.27+label%3Alead-opted-in+

  • [rata]: Userns KEP PR rework with idmap mounts.
    • I think this should be ready to approve for 1.27. Is anything missing?
  • [klueska] KEP with PodResources extensions for DRA
    • Looking for final approval by and / or @mrunal
    • There’s one small change I’d like to see made, but if we get an /approve with a /hold I can make sure the change gets in before giving a final /lgtm
    • kubernetes/enhancements#3738
  • [klueska] Milestone and tracking for updates to DRA enhancement issue
  • [vinaykul] InPlace Pod Vertical Scaling PR - status update
    • Please review and merge k/enhancements housekeeping PR #3845
      • To catch up to the latest KEP template, the PR adds an integration test section and responds to the node scalability section.
    • I have rebased and updated PR #102884 after LGTM by @thockin
      • Squashed all API commits into single commit + generated files commit
      • Separate commit for scheduler changes
      • I plan to squash various kubelet commits into 2 or 3 commits if that’s ok
      • @thockin awaits Derek’s re-LGTM/approve before approving the PR.
      • I plan to create follow-up PRs to address a few outstanding items:
        • ResizePolicy name restructuring
        • Use PodStatus.QOSClass instead of GetPodQoS across K8s codebase
  • [Atanas] CCI KEP:
    • kubernetes/enhancements#3853
    • Addressing comments as they come in.
    • Brief discussion on anything else outstanding.
    • Reviewers: Swati and Kevin, Approver: Dawn or Derek
  • [qbarrand] Kernel Module Management
  • [mimowo] Ask for final review and approval from sig-node for: Update for second Beta with GA criteria for "KEP-3329: Retriable and non-retriable Pod failures for Jobs"

Jan 31st, 2023

Recording: https://youtu.be/96DTU9ncSLA

[KEPS REVIEW]: 15 minutes

Jan 24th, 2023

Recording: https://youtu.be/NQaTeTfI9UY

| Incoming | Completed |
| --- | --- |
| Created: 29 | Closed: 12 |
| Updated: 90 | Merged: 14 |

Jan 17th, 2023

Recording: https://youtu.be/wirWRKSqY10

Total PRs: 217

| Incoming | Completed |
| --- | --- |
| Created: 30 | Closed: 16 |
| Updated: 103 | Merged: 16 |

Jan 10th, 2023

Recording: https://youtu.be/5V0uRxH4O4k

  • ~~[pacoxu] KEP-3610: namespace-wide global env injection #3612, not sure if this can be an admission controller. (Removed because mutating CEL admission should be the final solution.)~~
  • [ruiwen/pacoxu] KEP-3673: Kubelet limit of Parallel Image Pulls #3713
  • [klueska] Update CRI to include CDI devices (needed by DRA before moving to beta)
  • [QuentinN42] Add FileEnvSource and FileKeySelector to add environment generated on the fly #114674
    • Sourcing from any file from any source may be too big of a scope. Would limiting this to empty dir files be enough?
    • Security: is there a risk that sourcing a secret as an environment variable would expose a file that wasn’t otherwise available?
    • Action: Need to move this to kubernetes/enhancements as a KEP and follow the process. => kubernetes/enhancements#3721
    • [Mike Brown] FYI, not sure if this is the right pattern, but NRI plugins support modifying environment variables for the containers. Might be useful at least for prototyping.
    • [QuentinN42] another question is error conditions depending on the file format
    • [Alexander Kanevsky] my first impression: the env variables are populated in the OCI spec before the container is started. Sourcing from a file inside the container might not be feasible.
    • [Mike Brown] Right, this would require a set for any env change happening in prestart (which could be done by setting a runc hook via NRI or hook schema, or just doing the set on the update response)
  • [vinaykul] InPlace Pod Vertical Scaling PR - status update
    • I won’t be in the Node meeting today due to another 10 am meeting.
    • Please review and merge KEP update PR
      • Updated beta target to v1.29
      • Added details on handling version skew.
    • Tim prefers that we merge PR 102884 in its entirety as opposed to merging API PR 111946 followed by the rest of it a week later.
      • We can re-add periodic CI jobs afterwards and iterate on fixes if there are any issues.
      • I believe this will require both Derek’s & Tim’s lgtm & approve.
      • **AI:** Derek to catch up on Tim’s objections: kubernetes/kubernetes#111946 (review)
  • [derek] sig updates
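
On the FileEnvSource item above, the open question about error conditions depending on the file format can be illustrated with a small sketch. Everything here is an assumption for illustration: the notes do not specify a file format, so this assumes simple `KEY=VALUE` lines with `#` comments, and `parse_env_file` and `EnvFileError` are hypothetical names, not part of any Kubernetes API.

```python
import re

# Assumed rule for valid variable names (illustrative only).
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


class EnvFileError(ValueError):
    """Raised when the assumed KEY=VALUE format is violated."""


def parse_env_file(text: str) -> dict:
    """Parse KEY=VALUE lines into a dict, rejecting malformed input."""
    env = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            raise EnvFileError(f"line {lineno}: missing '='")
        key = key.strip()
        if not _NAME.fullmatch(key):
            raise EnvFileError(f"line {lineno}: invalid name {key!r}")
        if key in env:
            raise EnvFileError(f"line {lineno}: duplicate name {key!r}")
        env[key] = value
    return env
```

Even this small format has three distinct failure modes (missing separator, invalid name, duplicate name), each of which a KEP would need to map to a defined container start behavior.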

Jan 3rd, 2023

Recording: https://www.youtube.com/watch?v=AG3U91-5keo

Total active pull requests: 205

| Incoming | Completed |
| --- | --- |
| Created: 45 | Closed: 22 |
| Updated: 144 | Merged: 18 |

  • [SergeyKanzhelev] kubernetes/kubernetes#114394 CRI API version skew policies. See slides from contributors summit for extra details
  • ~~[SergeyKanzhelev] Reconcile SIG Node teams and OWNERs files: kubernetes/org#3893~~
  • [vinaykul] InPlace Pod Vertical Scaling PR - status update
    • Happy 2023!
    • Please review and merge KEP milestone update PR
    • PR 102884 approved by Derek.
      • @bobbypage fixed containerd/main E2E pull test job, we now have full E2E coverage (verifies values from ContainerStatus CRI response)
      • My recommendation is that we merge API changes PR 111946 at the earliest possible point in 1.27 and watch to make sure nothing bad happens.
        • Can we do it this week?
        • Can we at least merge the feature gate definition to clean up test failures in unrelated PRs?
      • And then merge PR 102884 shortly after (PR 111946 merge + 1 week)
        • We can then re-add periodic CI test jobs.
  • [Seaiii] kubernetes/kubernetes#113883 The second time the pod is deleted, the grace period does not take effect. Please review the updated PR 113883