- What work did the SIG do this year that should be highlighted?
SIG-Node remains a structural piece of the Kubernetes community, and the span of work done in 2024 highlights that. As the community continues to rally behind AI use cases and identify gaps in Kubernetes as a platform for LLM training and serving, SIG-Node made strides in multiple AI-related areas. DRA structured parameters reached beta, making more flexible scheduling and allocation of device resources possible. In 2025 there will be a lot of continued work on DRA, including enabling drivers to report device health and Kubernetes components to react to it, extending DRA to support advanced networking use cases, adding device taints and tolerations, and lots more! Outside of DRA, OCI image volume mounts were added as alpha in 2024, allowing users to mount AI models into containers via a separate image (and one day an artifact) instead of using a model car or embedding the model in the container image. Also, work like in-place pod resize and pod-level resource limits will unlock use cases for AI power users by allowing more flexibility in how pod resource limits are calculated, both at initialization and at runtime.
Plenty of work has been done outside of AI as well! SIG-Node remains the top SIG in KEPs progressing, moving forward on 13, 16, and 17 KEPs in 1.30, 1.31, and 1.32, respectively. Lots of progress has been made in the CPU manager, such as adding support for split uncore cache, adding a policy option for restricting reservedSystemCPUs, and adding a new static policy for optimizing CPU alignment. We have also worked on some long-awaited Linux technologies like user namespaces, swap, AppArmor, ephemeral storage quotas, recursive read-only mounts, and better support for supplemental groups, and we announced a feature freeze on cgroup v1.
All of these features don't even begin to cover the amount of CI stabilization, bug fixes, and other work the SIG is doing. We remain a productive (albeit occasionally overbooked) SIG. To help keep up with all of the work, we've inducted one new approver, Sergey Kanzhelev; reinducted a formerly emeritus approver, Tim Allclair; welcomed a new SIG chair, Peter Hunt; and begun crafting a role to help KEP authors follow the KEP process, currently called the KEP wranglers.
- Are there any areas and/or subprojects that your group needs help with (e.g. fewer than 2 active OWNERS)?
SIG-Node, being so busy, always has a bottleneck of top-level approvers. Any path in the kubelet could use more people who have the expertise and confidence to review changes. Please refer to our contributor ladder to see ways to grow in the SIG!
- Did you have community-wide updates in 2024 (e.g. KubeCon talks)?
- KubeCon EU 2024 maintainers track
- KubeCon NA 2024 maintainers track
- KEP work in 2024 (v1.30, v1.31, v1.32):
- Alpha
- 2837 - Pod level resource limits - v1.32
- 2862 - Fine grained Kubelet API authorization - v1.32
- 3288 - Split Stdout and Stderr Log Stream of Container - v1.32
- 3619 - Fine grained SupplementalGroups control - v1.31
- 4438 - Restarting sidecar containers during Pod termination - v1.32
- 4540 - Add CPUManager policy option to restrict reservedSystemCPUs to system daemons and interrupt processing - v1.32
- 4580 - Deprecate & remove Kubelet RunOnce mode - v1.31
- 4603 - Tune Crashloop Backoff - v1.32
- 4680 - Add Resource Health Status to the Pod Status for Device Plugin and DRA - v1.31
- 4800 - Split UnCoreCache Topology Awareness in CPU Manager - v1.32
- 4815 - DRA Partitionable Devices - v1.32
- 4817 - Resource Claim Status With Possible Standardized Network Interface Data - v1.32
- 4818 - Allow zero value for Sleep Action of PreStop Hook - v1.32
- Beta
- 1029 - Quotas for Ephemeral Storage - v1.31
- 127 - Support User Namespaces - v1.30
- 1287 - In-place Update of Pod Resources - v1.32
- 2008 - Forensic Container Checkpointing - v1.30
- 2400 - Node system swap support - v1.30
- 3857 - Recursive read-only mounts - v1.31
- 3983 - Add support for a kubelet drop-in configuration directory - v1.31
- 4033 - Discover cgroup driver from CRI - v1.31
- 4176 - New CPUManager Static Policy which spread hyperthreads across physical CPUs to better utilize CPU Cache - v1.31
- 4191 - Split Image Filesystem - v1.31
- 4210 - ImageMaximumGCAge in Kubelet - v1.30
- 4216 - Image pull per runtime class - v1.31
- 4265 - Add ProcMount option - v1.31
- 4369 - Allow special characters environment variable - v1.32
- 4381 - DRA Structured Parameters - v1.32
- 4639 - OCI objects as VolumeSource - v1.32
- Stable
- 1769 - Memory Manager - v1.32
- 1967 - Size memory backed volumes - v1.32
- 24 - Add AppArmor Support - v1.31
- 3545 - Improved multi-numa alignment in Topology Manager - v1.32
- 3673 - Kubelet limit of Parallel Image Pulls - v1.32
- 3960 - Pod lifecycle sleep action - v1.32
- 4009 - Add CDI devices to device plugin API - v1.31
- 4188 - New kubelet gRPC API with endpoint returning local pods information - v1.31
- 4569 - Move cgroup v1 in maintenance mode - v1.31
- 4622 - New TopologyManager Policy which configure the value of maxAllowableNUMANodes - v1.32
New in 2024:
- cri-client
Continuing:
- ci-testing
- cri-api
- cri-tools
- kernel-module-management
- kubelet
- node-api
- node-feature-discovery
- node-problem-detector
- resource-management
- security-profiles-operator
New in 2024:
- Device Management
- Serving
Continuing:
- Batch
- Policy
- Structured Logging
Operational tasks in sig-governance.md:
- README.md reviewed for accuracy and updated if needed
- CONTRIBUTING.md reviewed for accuracy and updated if needed
- Other contributing docs (e.g. in devel dir or contributor guide) reviewed for accuracy and updated if needed
- Subprojects list and linked OWNERS files in sigs.yaml reviewed for accuracy and updated if needed
- SIG leaders (chairs, tech leads, and subproject leads) in sigs.yaml are accurate and active, and updated if needed
- Meeting notes and recordings for 2024 are linked from README.md and updated/uploaded if needed