- Cancelled
- Cancelled
- Cancelled
- 1.32 retro: SIG Node 1.32 retro
- [minna] asking for some PR feedback kubernetes/kubernetes#125918
- [Peter] We should add a feature gate beta + on by default
- [Francesco] + 1 and we should extend
- Maybe wait for critical pods to be ready and not just started before we try to start non critical pods
- [Sergey] Similarly we could extend logic for admission
- [Sergey] It’s possible this PR may switch starting failure to admission failure (if critical pod starts and fails, the pods that rely on them will fail differently)
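The ordering idea raised in this thread (hold non-critical pods until critical pods are ready, not merely started) can be sketched roughly as follows. This is only an illustration: the `Pod` fields and the `pods_allowed_to_start` helper are hypothetical names, not kubelet code and not the behavior of the PR under review.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    # Hypothetical, simplified view of a pod for this sketch only.
    name: str
    critical: bool
    started: bool = False
    ready: bool = False


def pods_allowed_to_start(pods):
    """Return the names of pods that may be started now.

    Critical pods may always start. Non-critical pods are held back
    until every critical pod reports ready (not merely started).
    """
    critical_ready = all(p.ready for p in pods if p.critical)
    allowed = []
    for p in pods:
        if p.started:
            continue  # already started, nothing to decide
        if p.critical or critical_ready:
            allowed.append(p.name)
    return allowed
```

With one started-but-not-ready critical pod, a non-critical pod stays held back; once the critical pod turns ready, the non-critical pod becomes eligible.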
- [Sergey] Add agenda items ASAP, as we will cancel the meeting aggressively in December
- [danwinship/surya] Redesigning Kubelet Probes
- antonio had opened an issue for runtime to do the checks
- when kubelet requests the runtime to do the probe
- launching new pods and containers would be heavy
- can we re-use the container-monitor process here instead of adding new ones?
- tcp/http/grpc types of probes
- would containerd/cri-o be able to do those probes?
- [mrunal] containerd would have to learn the split of daemon
- [dawn] the pod sounds better than what we have today?
- cost to the user though here at the application level usage is unpredictable - this is not worse than what we have today but there is a complexity for the user (with per container case)
- probing pod is part of system overhead
- will this be a new type of probe? replacement of existing probes?
- if it's a pod probe then some features like ensuring the port is open might be lost?
- so maybe we should keep both types of probes and users can
- Performance should not regress
- checking a file in the filesystem and letting users put what they want?
- [tallclair] In-Place Pod Resize: status update
Canceled due to lack of agenda.
- [Kevin Hannon in place of dims] cadvisor for 1.32
- google/cadvisor#3609 ( Reduce the dependencies we drag into cadvisor AND drag into k/k through cadvisor )
- google/cadvisor#3608 ( help the periodic CI job to recover )
- fix for google/cadvisor#3577 as well
- Release may be needed
- https://kubernetes.slack.com/archives/C0BP8PW9G/p1729517493050419
- [Kevin Hannon] Swap Based Eviction
- [Lakshmi] Requesting for review and feedback on PR
- [pehunt] libcontainer + runc + k8s
- two pieces
- runc 1.2.0 just came out, k8s wants to use it (to get PSI stats) but there are concerns about containerd using a different libcontainer version from cadvisor
- https://kubernetes.slack.com/archives/C0BP8PW9G/p1729606639892799
- https://cloud-native.slack.com/archives/CGEQHPYF4/p1729607023643899
- google/cadvisor#3083 (comment)
- Do we need to wait for 1.2.0 in 2.0, or can we backport, or can we run disjoint? we’ve waited a long time for 1.2.0 and I’d like to use it
- libraryfication of libcontainer: currently, we’re vendoring runc libcontainer in k8s, and this means we’re version locked with the runc binary (which doesn’t have k8s as a priority with release cadence)
- Discussions on moving the libcontainer/cgroups library out of runc and into its own repo kubernetes/kubernetes#128157
- Peter Hunt to send an email to sig-node mailing list to notify folks of this plan
- part of this plan kubernetes/kubernetes#128245
Recording: https://www.youtube.com/watch?v=MyOhDhHRRKk
- [Sergey] New Feature Gates emulation mode and features GA: kubernetes/kubernetes#126981 (comment)
- Should we keep removing code in kubelet as before? Or just keep it around the same way as we do for API server to minimize possible errors and simply not test it?
- [Chris] A demo for k8s dynamic batch workloads:
https://github.com/chrishenzie/k8s-dynamic-batch-demo
- [pehunt] (defer to end, only if there’s time) beginnings of swap aware eviction discussion
Recording: https://www.youtube.com/watch?v=_Zexxr4pxr8
- [Sergey] Containerd 2.0 and KEPs: https://groups.google.com/g/kubernetes-sig-architecture/c/kft-wa929_Q
- Are we promoting to beta with a single runtime implementation?
- What is the production test requirement for the feature? (in case of 2.0 - how do we measure exposure of the feature to prod?)
- [fromani] Heads up: KEP 4885 will introduce a new memory manager policy
- Windows and Linux will support different policies
- do we prefer to postpone the memory manager GA graduation?
- [fromani] unblocking kubernetes/kubernetes#70585 with a feature gate?
- [Eddie] Request for KEP review: Mutable CSINode Allocatable Property
- [pehunt] FYI for approvers: two new KEPs have been added to the milestone and don’t have an approver
- kubernetes/enhancements#3619
- kubernetes/enhancements#4753
- [fromani] kubernetes/enhancements#4885 lacks approver also. I’m reviewing and almost LGTM (almost = need a final pass but no outstanding issues after last update)
- [pehunt] according to The KEP board it’s Mrunal Patel
Recording: https://www.youtube.com/watch?v=8YWCql6rLLk
- KEP planning part 2
- [ndixita] Pod Level Resources [Critical Scenarios] Pod Level Resources
- [Lakshmi] When is container garbage collection deprecated? Is there any alternative recommended way to do container garbage collection?
- [tjons] run an initContainer only once per rollout of the deployment, not on every scheduled pod.
Recording: https://www.youtube.com/watch?v=GkPrY56_gB4
- [jsturtevant] Windows KEP updates for Cpu/memory manager: kubernetes/enhancements#4738
- [tallclair] InPlacePodVerticalScaling discussion - part 2 (slides)
- [johnbelamaric] Quick PSA: Unless a strong use case comes forward, we plan to remove “classic DRA” in 1.32. See kubernetes/enhancements#3063 (comment)
- Reach out to [email protected] if you have any questions
- [Lakshmi] Requesting for review and feedback on PR
Recording: https://www.youtube.com/watch?v=iH6KVk9B5DE
- [pehunt] KEP planning
Recording: https://www.youtube.com/watch?v=9AfQA0DYR0E
- [pehunt] KEP planning KEP Board
- [lauralorenz] CrashLoopBackOff KEP for 1.32 (slides 6-10)
- [harche] - Looking for reviews kubernetes/kubernetes#125982
- This especially affects users with a high number of CPUs per node
- [tallclair] InPlacePodVerticalScaling discussion (slides)
- [SergeyKanzhelev] https://github.com/kubernetes/enhancements/issues/3386#issuecomment-2337050862 Do we want to remove this code for now?
- [T-Lakshmi] - Looking for feedback/answers on queries kubernetes/kubernetes#127157 Is container GC policy replaced with any function in evictionHard and evictionSoft policy, or is it completely deprecated? What are the future plans for the container garbage collector policy?
Recording: https://www.youtube.com/watch?v=E8sw-fybnKc
- [ndixita] Pod level resources KEP discussion
[Public] Effective Resources & OOM Kill Behavior
- OOM group -> Pod kill -> in the next iteration of KEP
- [lauralorenz] CrashLoopBackOff KEP for 1.32 (slides 6-10) [bumped to next week but feel free to take a look at slides or discuss x-post in slack]
- [sreeram-venkitesh] Zero values for Sleep Action of PreStop Hook
- KEP-4818: Allow zero value for Sleep Action of PreStop Hook
- Draft PR to discuss changes: kubernetes/kubernetes#127094
- Do we need to do anything particular with rollback of the feature?
- Probably not, at least for the kubelet
- [pranav] Kubelet idle threads issue
- raised this issue in golang upstream
- how to control kubelet threads and memory by go runtime variables, is there any other way to do it?
- [Kevin Hannon] KEP Board
- Open it up for public viewing?
- [pehunt] inspired by release team, we’ve updated the tracking board to have more columns
Recording: https://www.youtube.com/watch?v=wGbkByo_NBI
- [lauralorenz] CrashLoopBackOff KEP (slides)
- updates and changes since 1.31 [5 minutes]
- some discussion on path forward [10-15 mins if I can get it]
- [pehunt] KEP wrangler brainstorm
Recording: https://www.youtube.com/watch?v=KUw2kSFsf2U
- [vinayakankugoyal] https://github.com/kubernetes/enhancements/pull/4760/files#r1699209363
- Break permissions into smaller buckets to allow for users to get access to things like healthz without allowing a user to get a pod to exec
- We currently don’t commit to supporting these endpoints, but they are being used as if we have. Should we group the endpoints by function to be less prescriptive on what a user gets access to, so we have power to change?
- [Peter] Can we break into read-only/read-write?
- some of the “read-only” end points can still be risky to give access to
- [Dawn] There had been talks in the past about deprecating some of the endpoints
- [Tim] We’ve been talking about doing so for so long, maybe we do this now instead of trying to find the perfect APIs
- [Tim] Maybe use healthz as the bucket?
- [Sergey] should documenting the endpoint be part of this KEP?
- Tim Volunteered to review/approve
- KEP to live under SIG-Auth
- [Kevin Hannon] different OCI runtime with NodeConformance
- https://github.com/kubernetes/kubernetes/issues/126639
- Presubmit: kubernetes/test-infra#33298
- Periodic: kubernetes/test-infra#33297
- Maybe we should add these tests in CRI-O upstream instead of k8s–reduce overhead on upstream CI
- [Kevin] If we switch to crun by default in CRI-O, can we switch upstream k8s tests to crun as well?
- [Sergey] as long as the test failures are looked into and addressed quickly
- Run two versions at the same time, and then eventually switch the crun jobs to be the blocking one
- [SergeyKanzhelev] Some org updates:
- New google groups will be used soon:
- New version of GitHub projects:
- [yuanliangzhang] Windows Node graceful shutdown
- KEP enhance draft:
https://github.com/zylxjtu/enhancements/blob/master/keps/sig-node/2000-graceful-node-shutdown/README.md#background-on-windows-shutdown
- POC: shoutdown poc · zylxjtu/kubernetes@854ea4b (github.com)
- Should we have a new KEP or keep within the other KEP
- Needs a reviewer from kubelet side
- [Dawn] Most of the reviewers in SIG-Node focus on linux
- [Dawn] Are there any windows version requirements?
- [Lin] I don’t think so
- [Lin] How far back do we need to support specific versions windows nodes?
- [Mark] probably windows 2019
- [Mark] We didn’t add support before because termination wasn’t working right in windows, that’s fixed now
- [Peter] kubernetes/enhancements#4738 can be used as a baseline for KEP process
- [Sergey] If we tie this to the linux version, we may be blocked on windows to GA graceful shutdown
- [Sergey] Ideally, we GA ASAP
- [Mrunal] Instead of additional KEP, we could add another feature gate
- [Sergey] Feature gate will still block KEP graduation :-(
- endpoints problem: kubernetes/kubernetes#116965
- [sotiris] Triage decision for Minimum CPU request is displayed when only memory request is configured
- [iholder101]
- swap debugability long ongoing discussion - asking to defer to follow-up KEPs: kubernetes/kubernetes#125278 and specifically this comment
- kubernetes/enhancements#4701 - GA plans for swap (KEP-2400)
- Dawn to follow up offline
- [SergeyKanzhelev] https://github.com/orgs/kubernetes/projects/186/views/1
- AI:
- Either move to proposed for consideration
- Or Not for release
- [torredil] Ensure volumes are unmounted during graceful node shutdown: kubernetes/kubernetes#125070
- Dawn/Mrunal to look and hopefully approve
- [Mrunal] Maybe add this to Clayton’s document
MEETING IS CANCELED TODAY due to lack of agenda and vacations
Recording: https://www.youtube.com/watch?v=K-eBDYfHiTM
- [SergeyKanzhelev] Files kubelet uses: https://github.com/kubernetes/website/pull/46359/files#r1600516887 Docs request
- List in comment almost? full
- We should document files, but recommending removal of all of them is overkill
- Mrunal: maybe even have a clean up command that will clean up those files.
- Cleaning up on startup of kubelet - maybe we need a KEP
- Dawn: Kubelet should be responsible for its own files, but other files are created by plugins, which might not clean up properly, and the kubelet has no way to enforce that. In this case, the K8s vendor is responsible for the files, not the kubelet.
- Also end users are not reporting issues back to upstream if they experience issues.
- Peter: if kubelet creates a file it should be responsible for deleting it. If file is owned by plugin, kubelet should be resilient to those.
- rphillips: Ideally, the plugin’s initialization function should handle cleanup
- [SergeyKanzhelev] SergeyKanzhelev for approver: kubernetes/kubernetes#126551
- [pehunt] SIG Chair proposal
- [SergeyKanzhelev] SIG Node responsiveness improvements
- [pacoxu] issues/116799#issuecomment-2249301937
- In kubernetes/system-validators#37, we refer to kernel long term support: https://wiki.linuxfoundation.org/civilinfrastructureplatform/start and https://endoflife.date/linux
- 4.4 & 4.19 are selected as kernel Super Long Term Support (SLTS), and the Civil Infrastructure Platform (CIP) will provide support until at least 2026.
- For cgroup v2, Kubernetes recommends using 5.8 or later; in the runc docs, the minimum version is 4.15 and 5.2+ is recommended.
- 4.5 starts supporting cgroup v2 io, memory & pids (kernel 4.5 announced that cgroup v2 is no longer experimental).
- 4.15 starts supporting cgroup v2 cpu.
- 4.20: PSI support; KEP-4205 is not alpha (only the KEP was merged).
- 5.2 starts supporting cgroup v2 freezer.
- 5.8: the root `cpu.stat` file on cgroup v2 was only added in 5.8.
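The kernel milestones listed above can be captured as a small lookup, which may help when reasoning about what a given node kernel offers. This is a sketch only: the table repeats the versions from these notes, and `parse_kernel_release` and `available_features` are hypothetical helper names, not part of system-validators or the kubelet.

```python
# Minimum kernel versions, copied from the list in the notes above.
CGROUP_V2_MILESTONES = [
    ((4, 5), "cgroup v2 io, memory and pids controllers"),
    ((4, 15), "cgroup v2 cpu controller"),
    ((4, 20), "PSI support"),
    ((5, 2), "cgroup v2 freezer"),
    ((5, 8), "root cpu.stat file on cgroup v2"),
]


def parse_kernel_release(release):
    """Turn a release string such as '5.15.0-91-generic' into (5, 15)."""
    major, minor = release.split(".")[:2]
    # Keep only the leading digits of the minor part (e.g. '4-rc1' -> 4).
    digits = ""
    for ch in minor:
        if not ch.isdigit():
            break
        digits += ch
    return int(major), int(digits)


def available_features(release):
    """List the milestones whose minimum version the given kernel meets."""
    version = parse_kernel_release(release)
    return [name for minimum, name in CGROUP_V2_MILESTONES if version >= minimum]
```

On a Linux node, `platform.release()` from the standard library could supply the release string.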
Recording: https://www.youtube.com/watch?v=JGYTQbs6eJk
- [Peter Hunt] Retrospective from 1.31 release
- SIG Node 1.31 retro
- Previous retrospectives:
- There was no retro for 1.29 or 1.30
- SIG Node 1.28 retro
- SIG Node 1.27 retro
Recording: https://www.youtube.com/watch?v=Wc7yrCLILK8
- [fromani][on behalf of sphrasavath] resuming work on KEP 2621: Enhance CPU manager with L3 cache aware
- pivot from new cpumanager policy to new cpumanager policy option
- revised design doc (comment from the enh issue: https://docs.google.com/document/d/1LpnMjGNsQyHOuVHMktIrjZsdRw9aKZ8djt354nAno6M/edit?usp=sharing )
- [Sunnat] On behalf of Marsik. do not set CPU quota for guaranteed pods
- [pehunt]: ProcMount disabled, or UserNamespaces enabled?
Recording: https://www.youtube.com/watch?v=0iPCt_FZxSk
- [dawnchen] FYI: [PUBLIC] Kubernetes: Disrupted pods should be eagerly removed from endpoints
- Primary concern raised so far by Rob Scott is the risk that someone interprets EndpointSlice terminating as one way
- More discussion of alternatives
Recording: https://www.youtube.com/watch?v=RTEtVbZPB-E
- [case] A group of us were working on a PR around adding node labels to the downward API. KEP-4742
- [harche] - Are we calculating the system reserved cpu shares correctly? kubernetes/kubernetes#72881 (comment)
- Analysis with various CPU cores - System reservation cpu
- [Derek] found relevant node allocatable designs https://github.com/kubernetes/design-proposals-archive/blob/main/node/kubelet-systemd.md and https://github.com/kubernetes/design-proposals-archive/blob/main/node/node-allocatable.md
- [adil] I have a question regarding logging: is there a way to disable all logs from k8s components and only get error logs? I tried setting different verbosities but it didn't help much. If there is no way to do it right now, is this something you would be interested in implementing? The reason we want this is to optimize CPU usage.
- pehunt: For all the kubernetes components, you should be able to set the `-v` flag which sets the verbosity of klog. You need to individually set this flag for each kubernetes component, I don’t think there’s a centralized place you can do this today. If you set `-v=1` you should only get the most urgent messages
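To make the verbosity answer concrete, here is a toy model of leveled logging. It is not klog's API; `LeveledLogger` and its methods are hypothetical names used only to show the gating idea: an info message tagged with a level is kept only when that level does not exceed the configured verbosity, while errors are always kept.

```python
class LeveledLogger:
    """Toy model of verbosity-gated logging (illustrative, not klog)."""

    def __init__(self, verbosity):
        self.verbosity = verbosity  # plays the role of the -v value
        self.lines = []

    def info(self, level, message):
        # Info messages are dropped when their level exceeds the threshold.
        if level <= self.verbosity:
            self.lines.append(f"I {message}")

    def error(self, message):
        # Errors are not gated by verbosity in this model.
        self.lines.append(f"E {message}")
```

In this model, a logger created with verbosity 1 keeps level-1 info messages and all errors, and drops a level-4 debug message.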
- [mimowo] looking for sig-node reviewers for Fix that PodIP field is temporarily removed for a terminal pod; heads up for the Kubelet issue that it may flip phase from Succeeded to Failed link
- [MaRosset] Request for review of Windows memory pressure eviction PR
Recording: https://www.youtube.com/watch?v=ExmOu9Twp3A
- [Filip Krepinsky] Creation of a new WG
- discussed in https://groups.google.com/g/kubernetes-sig-architecture/c/Tb_3oDMAHrg/m/pJjl6v4mAgAJ
- Clarify scope: Node vs group of Node, SIG Node vs k8s level, list of problems/scope
- [Pranav Pandey] Kubelet not releasing idle threads
- discussed here
- I think this issue is due to golang, could we confirm this?
- Could we also confirm if there is a direct way for the kubelet to set the maximum thread number by any parameter or something like that?
- [lubomir] review my small PR that makes a windows/kubelet related change:
- kubernetes/kubernetes#123137
- warn instead of error for unsupported options on Windows
- we don't need to exit the kubelet with an error on Windows just because the user is using a config that works on Linux.
- old PR where we discussed we should not have different defaults on Windows:
Recording: https://www.youtube.com/watch?v=REmtlcXma_M
- [Sergey] KEPs list for 1.31: https://github.com/orgs/kubernetes/projects/183/views/1?filterQuery=sig%3Asig-node&groupedBy%5BcolumnId%5D=Status&sortedBy%5Bdirection%5D=desc&sortedBy%5BcolumnId%5D=Status&sliceBy%5BcolumnId%5D=Status
- [Dixita] Support for KEP exception until Friday, June 21
- kubernetes/enhancements#2837
- KEP needs to address the following suggestions by Tim Hockin
- Default values when one of requests/limits is not set at pod level
- Change language for QoS definitions
- Stating OOM Kill behavior change
- Reasoning
- Feature discussions since March 2020
- The longer we delay this feature, the harder it becomes to support new features being added in every release.
- Low risk: Alpha phase targets only adding the new fields in the PodSpec so that feature development can start.
- Important to unblock AI model use cases
- [mimowo] looking for sig-node reviews for Fix that PodIP field is temporarily removed for a terminal pod
Recording: https://www.youtube.com/watch?v=A1XwOJxBL0c
- [tallclair] #125393 Should we remove soft admission failure, before AppArmor goes GA?
- [Filip Krepinsky] Latest NodeMaintenance discussions
- [Sotiris/esotsal] Static CPU management policy alongside InPlacePodVerticalScaling
- Status / next steps / open questions one slider
- [vaibhav] Eviction manager should check the disk usage of dead containers
- kubernetes/kubernetes#115201
- Default values of Kubelet’s eviction hard parameters
- kubernetes/kubernetes#119985
- [pehunt] kubernetes/kubernetes#124285 need KEP?
- [harche] - kubernetes/kubernetes#125341 - changing the secret fetching strategy while creating the pod.
- [pehunt] sync about kubernetes/enhancements#4693 updates
- kubernetes/enhancements#4693 (comment) How do we feel about the Never handling change?
Recording: https://www.youtube.com/watch?v=3dyVRBR7K7k
- [SergeyKanzhelev]
KEPs for 1.31: https://github.com/kubernetes/enhancements/issues?q=is%3Aissue+is%3Aopen+label%3Asig%2Fnode+milestone%3Av1.31+
Missing lead-opted-in: https://github.com/kubernetes/enhancements/issues?q=is%3Aissue+is%3Aopen+label%3Asig%2Fnode+milestone%3Av1.31+-label%3Alead-opted-in
- [chrismuellner] discuss loose linux capability handling in security context: kubernetes/kubernetes#119569 (comment)
- varying, incomplete implementations for validations
- documentation inaccurate: `CAP_` prefix allowed? upper/lower case?
- [pehunt] feedback on whether to exclude critical pods in swap
- [ndixita] Pod level resource spec KEP: kubernetes/enhancements#4678
- [Filip Krepinsky] update on the Declarative NodeMaintenance and Evacuation API KEPs:
- [lauralorenz] CrashLoopBackoff KEP
- [SergeyKanzhelev] Many flakes reported by release team
Recording: https://www.youtube.com/watch?v=RDWC4rtQOCo
- KEP freeze is coming (schedule: https://www.kubernetes.dev/resources/release/). https://github.com/kubernetes/enhancements/issues?q=is%3Aissue+is%3Aopen+label%3Asig%2Fnode+milestone%3Av1.31+
- [JeffLuoo] Pod full startup latency metrics to record pod from creation to ready: kubernetes/kubernetes#124892
- [dawnchen] Starting DRA driver for GPU in CNCF / K8s repo
- [Ed] https://github.com/kubernetes-sigs/dra-example-driver
- [John] from a distributor's perspective, a driver from the community would be preferable compared to a vendor-managed one.
- We are talking about allowing space for vendors, if they want/prefer.
- Idea is to simplify the life of distributors to have a place to take drivers from.
- [pehunt] kubernetes/kubernetes#125038
- [harche] cgroup v1 maintenance mode KEP - Should we feature gate it or not? kubernetes/enhancements#4572 (comment)
Recording: https://www.youtube.com/watch?v=eSYWzusEZiA
- [iholder101]:
- #123963: Add swap to kubectl describe node's output
- On the one hand we received feedback regarding making it easier to debug and monitor swap. On the other hand there’s a pushback regarding exposing it through API. What’s the right balance here?
- timezone poll results from two weeks ago: https://ibb.co/z8R3nXN.
- SIG-Node leadership: does moving back two hours make sense? What is the process to formalize that change?
- [sallyom]:
- [pehunt]: kubernetes/kubernetes#124333
- compelling case between balancing cluster admin configuration and workloads being punished for them
- [vaibhav] Eviction manager should check the disk usage of dead containers
- kubernetes/kubernetes#115201
- kubernetes/enhancements#4341
-
No agenda, canceling this week.
Recording: https://www.youtube.com/watch?v=_FPa0TVPoY4
- [marquiz/zvonkok] KEP-4112: Pass down resources to CRI follow-up
- [yujuhong] cgroup v2 memory usage – bug or working as intended?
- kubernetes/kubernetes#118916
- and discussion in runc - opencontainers/runc#3933 (comment)
Recording: https://www.youtube.com/watch?v=iuZCxtAeoQ8
- [SergeyKanzhelev] Annual report last call for comments: https://github.com/kubernetes/community/pull/7831/
- [lauralorenz] intro on proposed changes to CrashLoopBackoff (slides), this is from Kubernetes#57291
- [iholder101/Peter Hunt] #124060: Avoid swapping memory-backed volumes with tmpfs’ “noswap” option.
- How to behave if the option is not supported?
- If it is not supported, do we want to fallback to ramfs / BRD / zswap?
- How should it be tested, since the CI runs with an old kernel (5.15 < 6.4)
- Update KEP and issue with the current state
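The "option not supported" question above can be illustrated with a kernel version gate. This is a hedged sketch: the 6.4 threshold is taken from the note about the CI kernel (5.15 < 6.4), the helper names are hypothetical, and the fallback on older kernels (here: simply omit the option) is exactly the open question from the discussion, not a decision.

```python
import re

# Assumption taken from the notes: "noswap" needs a kernel of at least 6.4,
# and the CI kernel (5.15) is older than that.
NOSWAP_MIN_KERNEL = (6, 4)


def kernel_version(release):
    """Extract (major, minor) from a release string like '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def tmpfs_mount_options(release, size=None):
    """Build tmpfs mount options, adding 'noswap' only on new enough kernels."""
    options = []
    if size is not None:
        options.append(f"size={size}")
    if kernel_version(release) >= NOSWAP_MIN_KERNEL:
        options.append("noswap")
    return options
```

A real implementation would more likely probe the mount itself than trust the version string, since distributions backport features.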
- [iholder101/Peter Hunt]: In my time zone this meeting takes place at 20:00. Is it acceptable to reschedule this meeting for an earlier time? This might significantly help people from the EMEA region to join.
- Defer to next week, hope for more consensus
- in the meantime, ask the sig-node mailing list who would be able to make it that previously could not
- [ndixita]
- kubelet archived logs permissions kubernetes/kubernetes#124229
Solution: 1) Config options for users maybe kubernetes/kubernetes#124228 (comment)
Have a feature gate that is removed later.
Sergey: same issue with termination logs. kubernetes/kubernetes#108076
- cadvisor enumerates memory and hugepages separately
Issue: kubernetes/kubernetes#84426
https://github.com/kubernetes/kubernetes/pull/119173/files#r1307246832
- Can we know if this option is planned to be backported, and to which version?
Recommended solution: fix in cadvisor, and assess backward compatibility (probably add a new field)
- Question: What will the behavior be if huge pages are changed dynamically?
- [Peter Hunt] Finish KEP Planning
Recording: https://www.youtube.com/watch?v=-TEdQvF7kUE
- [SergeyKanzhelev] Annual report draft: #7831 Please add your comment and review the list of KEPs (#7777 (comment))
- [anishshah] - v1.30 release report
- github.com/AnishShah/sig-node-flaky-tests/tree/main
- ~10% release blocking tests are flaky
- [jstur] Follow up on UsageNanoCores CRI kubernetes/kubernetes#122092 (comment)
- What is the best approach?
- implemented cri background implementation in containerd/containerd#10010
- Additional questions if cri is responsible:
- costs of having 10s heart beat on CRI side?
- what does it mean to have it 10s behind other stats?
- backwards compat?
- James+Peter+Mike to have a call to sync on this
- [vaibhav] Eviction manager should check the disk usage of dead containers
- kubernetes/kubernetes#115201
- kubernetes/enhancements#4341
-
Recording: https://www.youtube.com/watch?v=vjcRUX_vSbU
- [pehunt] KEPs planning https://docs.google.com/document/d/1U10J0WwgWXkdYrqWGGvO8iH2HKeerQAlygnqgDgWv4E/edit
- [pehunt]: kubernetes/org#4805
- Mostly looking for feedback
- Some questions/replies are here looking for more opinions: kubernetes/org#4805 (comment)
- [iholder101/pehunt]: #123963: Add swap to kubectl describe node's output
- On the one hand we received feedback regarding making it easier to debug and monitor swap. On the other hand there’s a pushback regarding exposing it through API. What’s the right balance here?
- [marquiz/zvonkok] KEP-4112: Pass down resources to CRI follow-up
- [iholder101/pehunt]: timezone poll results from two weeks ago: https://ibb.co/z8R3nXN.
- SIG-Node leadership: does moving back two hours make sense? What is the process to formalize that change?
- [harche] - cgroup v1 support - Deprecation only or Removal as well?
- kubernetes/enhancements#4569
- [klueska] - KEP update for DRA to match 1.30 implementation
- kubernetes/enhancements#4561
- [email protected] to approve
- [anishshah] - v1.30 release report
- github.com/AnishShah/sig-node-flaky-tests/tree/main
- 22/249 sig-node release blocking tests are flaky.
-
Recording: https://www.youtube.com/watch?v=o3AohYi9aQA
- [oshestopalova]: Soft eviction of pods with long grace periods blocks hard evictions when under resource pressure
- [iholder101/pehunt]: timezone poll results from last week: https://ibb.co/z8R3nXN.
- [jkyros] Trying to use InPlacePodVerticalScaling in Vertical Pod Autoscaler
- does anyone remember why limits are required for in-place scaling?
- naively something like this fixes it, but probably has consequences
- [Sotiris Salloumis] Perhaps we can discuss this in https://kubernetes.slack.com/archives/C06FSK01BGU ?
- [pehunt/eddiezane]: kubectl cp improvements
- [Sonemaly]: Start discussion around Addressing Noisy Neighbor/Split L3 Cache Topology
- [kad]: please continue in the mail thread and share the scenarios you have and the corner cases you found that are not solved today. We need to look at how this could be done in a way where other vendors (especially on the ARM side, where assumptions about the presence of L3 might not hold) will not be affected by proposed changes to the static policy. At the moment, cache layout detection is partially buggy in the cAdvisor library, and components like CPU manager are not consuming it at all from MachineInfo.
- [Matt Karrmann] Configure group OOM Kills at the container level instead of the kubelet level
- Follow up with an issue to chat about different cases for pod vs kubelet level configuration
Recording: https://www.youtube.com/watch?v=Ho1kn-1p8Cg
- [Sotiris] InPlacePodVerticalScaling moving forward to beta (todo/needs/planning)
- From Jiaxin Shan:
- We worked on issues https://docs.google.com/document/d/1V3DLh3pH3CD-xhhJvAnOq_oWgPyjO-vj6wY6qdew9H0/edit#heading=h.ybybfdfputt and most of the issues have been solved or have pending PRs. But this is definitely a subset of the working items moving to beta.
- for people need more context.
- kubernetes/kubernetes#109547
- [Dixi] a lot of interest. Maybe we need to meet separately to split tasks?
- [mrunal] there is a slack channel already. Is it ok to coordinate there?
- [Dixi] slack may work.
- [Jiaxin] Let’s work together in that channel.
- [SergeyKanzhelev] Please review API again, Many use cases for the feature EXPECT to use this feature differently than the KEP’s API was designed.
- [Jiaxin] InPlaceVPA performance issue. A few users in the community requested the patch kubernetes/kubernetes#123941. PLEG cycle doesn’t take inplace pod status into consideration and never emit update events.
- [matthyx] (from sidecar WG) postStart hook prevents normal container termination - how to fix that?
- PUBLIC: Trying to diagram pod lifecycle stuff(slide 3)
- https://github.com/kubernetes/kubernetes/blob/ec301a5cc76f48cdadc77bcfbd686cf40b124ecf/pkg/kubelet/kuberuntime/kuberuntime_container.go#L297
- kubernetes/kubernetes#113883 (check for e2e coverage in PR)
- we cover this in our KEP (to be renamed)
- [pranav]: Could we implement a feature in Kubelet to limit the number of threads to the number of CPUs available?
- [SergeyKanzhelev] WG Serving proposal: https://groups.google.com/g/kubernetes-sig-node/c/KGfkpVmNrNc
- [Anish] kubernetes/kubernetes#123782 (ask is for a review).
- Issue: Container status changes to ContainerStatusUnknown when evicted due to exceeding ephemeral storage limit.
- Root Cause: There is a race condition which is removing the container before the container status update.
- Fix: The fix is to check that the pod is finished before cleaning up. added a check to the existing e2e test.
- [iholder101]: In my time zone this meeting takes place at 20:00. Is it acceptable to reschedule this meeting for an earlier time? This might significantly help people from the EMEA region to join.
Recording: https://www.youtube.com/watch?v=5TLp233Bisg
- [tallclair] Deprecate & remove Kubelet RunOnce mode
- Mark deprecated in 1.31 and remove in 1.33
- Add KEP
- [Sergey] cgroupv1 removal/deprecation is moving to 1.31
- Harshal to open a KEP for 1.31
- [kannon92] CAdvisor bug on pid stats
- https://github.com/google/cadvisor/pull/3497/files
- K8s: kubernetes/kubernetes#123914
- [Dawn] Kubecon recap. Slide deck: Sig Node Intro and Deep Dive
- Unconference hw resource model discussion:
[PUBLIC] 2024 KubeCon EU - Contrib Summit Unconference
-
Recording: https://www.youtube.com/watch?v=-435mh2GyGU
- [Kevin Hannon] Upper limit on ImagePullBackOff and fail the pod
- [Kevin Hannon] Flakiness in eviction tests
- kubernetes/kubernetes#123591
- Stats eviction if stats api failure
- PIDStats Fix- kubernetes/kubernetes#123369
- [Krzysztof Wilczyński] Current state and the future of the Graceful Node Shutdown support in kubelet.
- KEP 2000: Graceful Node Shutdown
[Anish] ContainerStatusUnknown after ephemeral storage limit is exceeded- [Hongxiang Jiang] Calculate oom_score_adj in a CPU-agnostic way, taking in consideration Pod Priority too
Recording: https://www.youtube.com/watch?v=yBmVPBO9Y9Y (Kubernetes SIG Node 20240305)
- [Sotiris Salloumis] Static CPU management policy alongside InPlacePodVerticalScaling
- Is KEP needed? (this PR is an attempt to fix KEP 1287 Alpha Feature Code Issue #10)
- PTAL: Inplace VPA + core binding. There’s some discussion about VPA + CPU manager static policy
- [Dixita, Anish] Seeking help for bug prioritization and triage for K8s 1.30 release on Wednesday 10AM PST.
- [pehunt] proc mount PR separate from e2e tests kubernetes/kubernetes#123520
- [Kevin Hannon] CRIO tests failing as of today
- kubernetes/kubernetes#123715
- pehunt opened kubernetes/kubernetes#123726
Recording: https://www.youtube.com/watch?v=3IRepUPQ0CU
- [dwestbrook] Discuss Per Pod Container Updates (i.e. similar to this issue)
- Feature Request – Per Pod Container Updates (request access)
- [chrishenzie] Extending containerd 1.X EOL to align with K8s EOL
- 1.6 and 1.7 have parallel LTS windows
- Will run until next LTS release, which release TBD (could be v2.0, v2.1)
- containerd v2.0 contains migration tools/scripts to assist with users of deprecated features
- containerd -c pathToToml config migrate
- https://github.com/containerd/containerd/blob/main/RELEASES.md#daemon-configuration
- containerd has moved packages around in the 2.0 refactoring; see the move script in containerd/containerd#9365. This should help people who develop containerd plugins or import the various packages.
- [SergeyKanzhelev] Sidecar WG - new time for the meeting: Seattle 2PM, Paris 11PM, Seoul 7AM (6AM) (Wednesday)
Recording: https://www.youtube.com/watch?v=vEbpXkhm73M
- [Kevin Hannon] Discuss configuration for pod logs location
- PR: kubernetes/kubernetes#112957
- issue: kubernetes/kubernetes#98473
- Is KEP needed?
- Security implications of logs locations
- Impact on disk usage
- impact on Kata or similar runtimes?
- [Kevin Hannon] KEP-4191 blocked until we have a cadvisor release
- With freeze coming, is it possible to get a cadvisor release before the freeze?
- [AI: dawnchen@] Identify the new owner to help? - Done!
- [Jeffwan/LingyanYin]
- Need reviewers for this PR - Configure MemoryRequest for InPlace pod resize in cgroupv2 systems kubernetes/kubernetes#121218
- Dixita Narang: drop a comment and doc link explaining why memory.min shouldn't be set yet
- [AdrianReber] Graduate "Forensic Container Checkpointing" from Alpha to Beta PR
- PR: kubernetes/kubernetes#123215
- All changes in the PR are based on the KEP discussions
- kubernetes/enhancements#4288
- Mainly added tests for existing features as discussed during PRR
- Switch from Alpha to Beta feature gate
- Added separate sub-resource permission to better control access to the kubelet checkpoint API endpoint
- Looking for reviewers
- Will probably not be able to make it to the meeting
- [fromani] Looking for approval review: kubernetes/kubernetes#121778 (for memory manager GA graduation, kubelet observability/visibility). Thanks mrunal!
- [jsturtevant] KEP 2371 - CRI container and pod stats - Issue with UsageNanoCores calculated in CRI kubernetes/kubernetes#122092 (comment)
- [kevin hannon] PID Stats issues in both containerd and crio
- kubernetes/kubernetes#115215
- kubernetes/kubernetes#123369
- Not sure why the CRI-O side is failing to read any process stats
Recording: https://www.youtube.com/watch?v=WLm7m-8T82A
- kannon92: self-nominating to be a reviewer in sig-node
- https://github.com/kubernetes/test-infra/pull/31891
- Mrunal and Derek approved the above PRs
- [Vaibhav] Discuss on the eviction manager issue
- KEP kubernetes/enhancements#4341
- [Ritika] Discuss on this issue
- kubernetes/kubernetes#123176
- Pranav: Kubelet Thread Management and Resource Cleanup Post-High Workload
- Discuss a scenario where Kubelet retains idle threads post-high workload, leading to unnecessary memory consumption.
- Is there a way in Kubernetes to set the maximum number of threads? If not, can the k8s community implement a new parameter for it? kubernetes/kubernetes#123275
- Gathering pprofs of the kubelet would be useful to see if there are stuck goroutines
- Try restricting the kubelet process to cpuset 0 in its systemd unit file, to force the Go runtime to allocate fewer threads and kill them more aggressively, then repeat the test. This would help tell Go library behavior apart from kubelet thread leaks.
- pehunt: imageRef discussion round 2
- Problem: the public pod API field `container.ImageID` is constructed from the container status ImageRef field.
- This ImageID is used to compare against the image.ID of the CRI call for garbage collection.
- The `container.ImageID` is considered to be a stable API, but is not compatible with the image.ID field.
- Options to fix:
- return same value as image.ID in container.ImageRef (resolved repoDigest)
- problem: two images tagged with different repos but the same digest would thrash in GC
- add a resolvedImageID or something to ContainerStatus and pod API for doing GC
- both CRI and pod API update
- In GC manager, compare image.RepoDigests in addition to image.ID to find a match
- TODO:
- check exactly what is returned for each field in cri-o and containerd
- investigate if we can put together the needed info in image gc manager without CRI/pod API extension
- extend them if not
- kannon92: (if time) kubernetes/kubernetes#123247
- Discovered reason for flake in eviction
- Summary stats is sometimes failing and the first sort of activePods is ignored
- ndixita: highlight from Sig Node CI triage meeting (every Wednesday 10AM PST) kubernetes/kubernetes#122905
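
The third option from the imageRef discussion above (compare image.RepoDigests in addition to image.ID in the GC manager) can be sketched roughly as follows. This is a simplified Python model and not kubelet code: the `Image` type, the function names, and the digest strings are hypothetical, and the real values returned by cri-o and containerd still need to be checked, as the TODO notes.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Image:
    """Hypothetical stand-in for an image as reported by the runtime."""
    id: str
    repo_digests: List[str] = field(default_factory=list)


def image_in_use(image: Image, container_image_refs: Set[str]) -> bool:
    """An image counts as in use if its ID or any of its repo digests is referenced."""
    if image.id in container_image_refs:
        return True
    return any(digest in container_image_refs for digest in image.repo_digests)


def unused_images(images: Iterable[Image], container_image_refs: Iterable[str]) -> List[Image]:
    """Return the images that no container references, i.e. GC candidates."""
    refs = set(container_image_refs)
    return [image for image in images if not image_in_use(image, refs)]


# Illustrative data: one container references an image by repo digest only.
images = [
    Image("sha256:aaa", ["quay.io/a/app@sha256:111"]),
    Image("sha256:bbb", ["docker.io/b/tool@sha256:222"]),
]
candidates = unused_images(images, ["quay.io/a/app@sha256:111"])
```

With an ID-only comparison, the first image would look unused even though a container references it by repo digest; checking both fields keeps it out of the candidate list without changing the CRI or the pod API.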
Recording: https://www.youtube.com/watch?v=WiYzo_knwfk
- [Filip Krepinsky] Update on Declarative Node Maintenance
- [Derek] Update requested to clarify security posture that would prevent cross node privileges
- [pehunt] kubernetes/enhancements#3858 RRO conversation redux
- [jonathan-innis] Support for `node.kubernetes.io` resource labels for Gt/Lt requirements on pods
- [Jeffwan/LingyanYin] Two things:
- Next steps for KEP 4176 - a new static policy for cpu manager kubernetes/enhancements#4177 (comment)
- Need reviewers for this PR - Configure MemoryRequest for InPlace pod resize in cgroupv2 systems kubernetes/kubernetes#121218
- [Vaibhav] Discuss on the eviction manager issue
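
For context on the Gt/Lt item above: in node affinity, the `Gt` and `Lt` operators take a single value and compare it as an integer against the node's label value. A minimal Python sketch of that comparison follows. The label key `node.kubernetes.io/example-capacity` and the function name are made up for illustration, and the actual scheduler implementation differs.

```python
from typing import Dict, List


def matches_gt_lt(labels: Dict[str, str], key: str, operator: str, values: List[str]) -> bool:
    """Evaluate a Gt/Lt node selector requirement against a node's labels."""
    if operator not in ("Gt", "Lt"):
        raise ValueError(f"unsupported operator: {operator}")
    if len(values) != 1:
        raise ValueError("Gt/Lt requirements take exactly one value")
    threshold = int(values[0])  # a non-integer threshold is a malformed requirement
    if key not in labels:
        return False
    try:
        label_value = int(labels[key])
    except ValueError:
        return False  # a non-integer label value never matches
    return label_value > threshold if operator == "Gt" else label_value < threshold


# Hypothetical node labels, for illustration only.
node_labels = {"node.kubernetes.io/example-capacity": "16"}
```

Because the comparison is integer-only, any resource label used this way would need a plain integer encoding rather than a quantity string such as `16Gi`.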
Recording: https://www.youtube.com/watch?v=LLS3qQgQJ6g
- [pacoxu] "Fix evented pleg mirror pod & use IsEventedPLEGInUse instead of FG status check" needs approval, and we’d better get input from @smarterclayton before merging. The bug it fixes blocks sig-release-master-blocking#gce-cos-master-alpha-features, and 1.30.0-alpha.1 is planned for Jan 30th.
- Is this an alpha release cut blocker?
- Could we ignore `EventedPLEG` in the job? We already disabled it in some presubmit jobs, including `pull-kubernetes-e2e-kind-alpha-features` and `pull-kubernetes-e2e-gce-cos-alpha-features`.
- [anish] KEP-3953: Dynamic node resize - draft at KEP-3953: Support for Resizable Nodes
- Anish, please contact Markus Lehtonen, Francesco Romani, and Alexander Kanevskiy on Slack; we will include you in the discussion thread on that topic.
- [tallclair] Expanding Kubelet configuration API
- Proposal: kubernetes/kubernetes#122916
- Does this need a KEP? (I think no?)
- [pehunt] imageRef usage in the kubelet
- context: cri-o/cri-o#7579 cri-o/cri-o#7143 kubevirt/kubevirt#10747
- Yuju shared kubernetes/kubernetes#46255
- [fromani] want to improve observability of resource managers: better and more kubelet logs, send kube events on admission failures and in the happy path. Raised as memory manager GA blocker and in general poor observability is a PRR concern. Does this work require a KEP or is an issue sufficient?
- update KEPs where feasible
- For GA KEPs (and in general for this work): update the docs
- Keep issues and file the PR when ready
- [marquiz/zvonkok] KEP-4112: Pass down resources to CRI
Recording:
- [Sergey, Mrunal] 1.30 Planning SIG Node - KEP Planning
- [kannon92, AxeZhan] KEP4328 for 1.30
- kubernetes/enhancements#4329
- sig-scheduling plans to implement the nodeAffinity type RequiredDuringSchedulingRequiredDuringExecution by adding a new controller. Because sig-node is a participating SIG, a sig-node approver also needs to review this KEP.
- Thank you to Dawn for agreeing to review from sig-node.
- [jeffwan, LingyanYin] two KEPs for 1.30
- kubernetes/enhancements#4176
- CPU manager: Adding a static policy option to spread hyperthreads across physical CPUs. Addressed all comments, need approvals
- kubernetes/enhancements#4177 (comment) NRI vs. native cpu manager?
- kubernetes/enhancements#4433
- keep inplace VPA KEP alpha for 1.30
- [klueska] Three KEPs for 1.30
- Add CDI devices to device plugin API (promote to GA)
- Add numeric parameters for dynamic resource allocation (new KEP)
- Simplification / generalization of overall DRA proposal
- Context: Dynamic Resource Allocation (DRA)
- kubernetes/enhancements#4384
- Pass down resources to CRI (new KEP)
- Needed to support GPUs in Kata Containers
- kubernetes/enhancements#4113
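
To illustrate the CPU manager item above (a static policy option that spreads hyperthreads across physical CPUs), here is a small Python sketch of the general idea: take one logical CPU from each physical core before using a second hyperthread on any core. This is only an assumption about the intent based on the one-line description; the topology map, the function name, and the selection order are hypothetical and do not reflect the KEP's actual algorithm.

```python
from typing import Dict, List


def spread_across_cores(core_to_cpus: Dict[int, List[int]], count: int) -> List[int]:
    """Pick `count` logical CPUs, using every physical core once before reusing any."""
    available = sum(len(cpus) for cpus in core_to_cpus.values())
    if count > available:
        raise ValueError("not enough logical CPUs available")
    picked: List[int] = []
    sibling_index = 0  # which hyperthread of each core the current pass takes
    while len(picked) < count:
        for core in sorted(core_to_cpus):
            cpus = core_to_cpus[core]
            if sibling_index < len(cpus):
                picked.append(cpus[sibling_index])
                if len(picked) == count:
                    break
        sibling_index += 1
    return picked


# Hypothetical 4-core, 8-thread topology: core id -> logical CPU ids.
topology = {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
two_cpus = spread_across_cores(topology, 2)
```

In this toy topology a request for two CPUs lands on two different physical cores (CPUs 0 and 1), whereas a packing strategy would hand out both hyperthreads of core 0 (CPUs 0 and 4).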
Recording: https://youtu.be/NAIQGQHrlN0
- [pacoxu] EventedPLEG bug of static pods start-up. After reverting it to alpha, sig-release-master-blocking#gce-cos-master-alpha-features keeps failing. #122763 is under review.
- Latest PR - kubernetes/kubernetes#122778
- [kannon92 Kevin] Update on Swap.
- Swap Beta2 Findings
- kubernetes/enhancements#4401
- NoSwap seems good
- UnlimitedSwap and Eviction signal may be needed
- We should add eviction signal for swap for UnlimitedSwap
- Or drop support for UnlimitedSwap
- Kevin to reach out to [email protected] about use cases for swap
- Kevin to find examples for UnlimitedSwap.
- [pehunt] proc mount type direction
- https://docs.google.com/document/d/1rYvnhQyi-d8bDgyOGn5FHZKVMgwpygPjksC8ZSBaEPg/edit?usp=sharing
- Make a KEP update to tie ProcMount behavior to userns (if userns, no masked paths). If there’s pushback, push for ProcMount in Beta
- [AkihiroSuda (unlikely to attend due to the timezone)] Can I get any reaction (an explicit rejection will be highly appreciated more than having no action) to the KEP for Recursive Read-only (RRO) mounts? Has been open for almost a year. If this isn’t going to be accepted I’ll just leave Kubernetes unmodified and change containerd to treat RO as RRO.
- kubernetes/enhancements#3857 kubernetes/enhancements#3858
- [AdrianReber] Open Checkpoint/Restore questions from last week
- Checkpoint/Restore demo from container image based on https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/#restore-checkpointed-container-k8s
# curl -s --insecure --cert /var/run/kubernetes/client-admin.crt --key /var/run/kubernetes/client-admin.key -X POST "https://localhost:10250/checkpoint/default/counters/counter"
# kubectl alpha checkpoint counters
# newcontainer=$(buildah from scratch)
# buildah add $newcontainer /var/lib/kubelet/checkpoints/checkpoint-<pod-name>_<namespace-name>-<container-name>-<timestamp>.tar /
# buildah config --annotation=io.kubernetes.cri-o.annotations.checkpoint.name=counter $newcontainer
# buildah commit $newcontainer checkpoint-image:latest
# buildah rm $newcontainer
- How would checkpoint/restore work with pods:
- Implemented in March 2022 in combination with kubectl drain
- https://github.com/adrianreber/cri-o/commits/checkpoint-restore-support-cri-api/
- Pause pod (using cgroup)
- Loop over all containers in pod and create a checkpoint
- Collect pod metadata
- Recreate pod based on metadata (no checkpoint)
- Restore all containers
- Unpause pod
- Security review: looking into it
- Garbage collection mechanism: not thought about it
- Image-Spec discussion opencontainers/image-spec#962
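
The pod-level checkpoint/restore sequence listed above (pause, checkpoint each container, collect metadata, recreate the pod, restore, unpause) can be sketched as follows. This is a Python model of the ordering only: `RecordingRuntime` and all of its methods are hypothetical stand-ins that record calls, and none of them correspond to a real CRI or cri-o API.

```python
from typing import Dict, List, Tuple


class RecordingRuntime:
    """Hypothetical runtime stand-in that only records the calls it receives."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str]] = []

    def pause_pod(self, pod: str) -> None:
        self.calls.append(("pause", pod))

    def checkpoint_container(self, pod: str, container: str) -> str:
        self.calls.append(("checkpoint", container))
        return f"{pod}-{container}.tar"  # made-up archive name

    def pod_metadata(self, pod: str) -> Dict[str, str]:
        self.calls.append(("metadata", pod))
        return {"name": pod}

    def create_pod(self, metadata: Dict[str, str]) -> str:
        self.calls.append(("create", metadata["name"]))
        return metadata["name"]

    def restore_container(self, pod: str, archive: str) -> None:
        self.calls.append(("restore", archive))

    def unpause_pod(self, pod: str) -> None:
        self.calls.append(("unpause", pod))


def checkpoint_and_restore_pod(runtime: RecordingRuntime, pod: str, containers: List[str]) -> str:
    """Walk through the steps from the notes, in the order they were listed."""
    runtime.pause_pod(pod)  # freeze the pod so all containers are checkpointed consistently
    archives = [runtime.checkpoint_container(pod, c) for c in containers]
    metadata = runtime.pod_metadata(pod)
    new_pod = runtime.create_pod(metadata)  # the pod itself is recreated, not restored
    for archive in archives:
        runtime.restore_container(new_pod, archive)
    runtime.unpause_pod(new_pod)
    return new_pod


runtime = RecordingRuntime()
checkpoint_and_restore_pod(runtime, "counters", ["counter", "sidecar"])
```

The sketch leaves out the open points from the notes, namely the security review and garbage collection of checkpoint archives.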
Recording: https://youtu.be/b5jaZux0qCo
Agenda:
- [ pehunt ] kubernetes/kubernetes#117793 ownership. 1.30??
- tzneal to take on, no KEP needed
- [kannon92] kubernetes/kubernetes#121834 looking for approver
- Can we consider backporting this?
- Agreement
- [rata]: UserNS KEP: beta migration in 1.30?
- Open a PR to migrate to beta and reach out to gather more feedback
- [tallclair]: Kubelet config clean up
- Now that Dynamic Kubelet config is deprecated & removed, can we move the remaining flags into the Kubelet configuration object?
- Derek: look into whether there are any differences in whether the Kubelet needs to be drained on update
- Mrunal: Sync with folks working on conf.d
- [rst0git] Forensic Container Checkpointing:
- Provide details about additional checkpoint/restore use cases kubernetes/enhancements#4305
- Graduate "Forensic Container Checkpointing" to Beta kubernetes/enhancements#4288
- Add 'checkpoint' command to kubectl kubernetes/kubernetes#120898
- Proposal: checkpoint image definition
opencontainers/image-spec#962
- [fromani] proposal to allow the kubelet to trigger the rescheduling of pods (redo from 20240102 because the audience was too small; presented at the batch WG meeting on 20240104). Expected: 5-minute presentation plus time for questions/discussion, maybe 10 minutes at most.
- Include a security section about restricting the node to unbind only its own pods.
- [SergeyKanzhelev, Harche] kubernetes/kubernetes#122224: are the backward-compatibility concerns here valid?
Recording: https://youtu.be/BHGZs2HJMyU
Agenda:
- [marquiz] QoS resources KEP, call for reviews, blockers from sig-node perspective(?)
- [fromani] proposal to allow the kubelet to trigger the rescheduling of pods. Looking for early feedback/possible concerns.
- spinoff from DRA conversations; beneficial to improve UX with kubelet admission failures
- will be presented to batch WG/sig-scheduling mtgs