
SIG Node Meeting Notes

Dec 27th [Canceled for holidays]

Dec 20th [Canceled for holidays]

Dec 13th, 2022

Recording: https://www.youtube.com/watch?v=iw_xZZPuXDI

Total: 196 (-22, yay!)

| Incoming | Completed |
| --- | --- |
| Created: 32 | Closed: 37 |
| Updated: 161 | Merged: 22 |

  • [mrunal/sergey/ruiwen] 1.26 retro/1.27 planning
    • 1.26 retro, with tracked KEPs and finished KEPs
    • 1.27 planning with initial KEP candidates:
  • [everpeace] KEP-3169: Fine-grained SupplementalGroups control
    • NOTE: I’m sorry that I can’t attend the regular community meeting due to the timezone gap (3 a.m. in my timezone, Tokyo). I’m adding this agenda item to help with 1.27 planning.
    • This KEP resolves the surprising behavior of the SupplementalGroups field described in k/k#112879: group memberships defined in the container image are kept (see the sketch at the end of this section). I believe many (probably most) cluster admins don’t know about this behavior. Moreover, when a cluster uses hostPath volumes, it can raise security concerns even when cluster admins enforce policy engines in the cluster.
  • [swsehgal] Topology Manager GA graduation: happy to volunteer to drive this work in 1.27 (if we want to move ahead with it?)
    • Lack of multi-NUMA systems in CI is the key blocker
      • [fromani] this is also relevant for memory manager
    • Currently, e2e tests that require multi-NUMA hardware are skipped
    • kubernetes/test-infra#28211
      • Inputs/suggestions on how this can be handled are welcome on the issue
      • Slack discussion with the test-infra group: here (potential use of Equinix nodes is being discussed there)
  • [SergeyKanzhelev] kubernetes/kubernetes#114394 CRI API version skew policies. See slides from contributors summit for extra details
  • [vinaykul] InPlace Pod Vertical Scaling PR - status update
    • vinaykul not joining this meeting (on vacation in India)
    • Please review and merge KEP milestone update PR
    • PR 102884 approved by Derek.
      • @bobbypage fixed containerd/main E2E pull test job, we now have full E2E coverage
      • I recommend that we merge the API changes PR 111946 at the earliest possible point in 1.27 and watch to make sure nothing bad happens.
      • Then merge PR 102884 shortly after (< 1 week) and re-add the periodic CI test jobs.
      • Does the first week of Jan 2023 look realistic for the proposed merge plan above, assuming the plan sounds good?
  • [SergeyKanzhelev] Reconcile SIG Node teams and OWNERs files: kubernetes/org#3893
  • [mweston & atanas] Quick update on issue kubernetes/enhancements#3675
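
For the KEP-3169 item above, a minimal sketch of the field in question using k8s.io/api types (the values are illustrative, not from the KEP): today, even with an explicit supplementalGroups list, group memberships defined for the runAsUser in the container image's /etc/group are merged in as well, which is the surprise described in k/k#112879.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func ptr[T any](v T) *T { return &v }

func main() {
	// An admin might expect the process's groups to be exactly
	// runAsGroup + supplementalGroups; in practice, groups defined for
	// uid 1000 in the image's /etc/group are also applied, which is the
	// behavior KEP-3169 proposes to make controllable.
	sc := &corev1.PodSecurityContext{
		RunAsUser:          ptr(int64(1000)),
		RunAsGroup:         ptr(int64(3000)),
		SupplementalGroups: []int64{4000},
	}
	fmt.Printf("%+v\n", *sc)
}
```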

Dec 6, 2022

Recording: https://www.youtube.com/watch?v=t3PcHj62f0c

Total: 218

| Incoming | Completed |
| --- | --- |
| Created: 11 | Closed: 7 |
| Updated: 63 | Merged: 2 |

  • [mweston & atanas] Ready Dec 6th: filed issue on kubelet plugin model here: kubernetes/enhancements#3675
    • starting KEP and looking for interested parties, discussion partially based around dynamic resource allocation and thoughts on how to incorporate it.
  • [pacoxu] small improvement to the memoryThrottlingFactor proposal (I listed 3 problems in the link), but a behavior change: memory.high = memory.request + (memory.limit - memory.request) * memoryThrottlingFactor. Also, defaulting to 0.8 may cause performance issues for pods that routinely use more than 80% of their memory limit, such as Java applications. We probably need a pod-level setting for it, e.g. softRequest/throttlingLimit, besides limits and requests. (See the worked example at the end of this section.)
  • [msau] (joining at 10:30) Looking for maintainers from multiple sigs to participate in a discussion/roundtable with Data on Kubernetes (stateful end users). If you’re interested, add your name to the list. First roundtable is going to be sometime in January.
  • [claudiubelu] Proposed changes to how kubelet detects updates for registered plugins, because the current implementation doesn’t work on Windows due to timestamp granularity issues kubernetes/kubernetes#114136
  • [SergeyKanzhelev] Sidecar WG: we are reaching a conclusion. Will send a summary soon. Find information here: https://docs.google.com/document/d/1E1guvFJ5KBQIGcjCrQqFywU9_cBQHRtHvjuqcVbCXvU/edit#
  • [SergeyKanzhelev] No perma betas:
    • AppArmor beta since 1.4 (owner: @tallclair)
    • QOSReserved alpha since 1.11 (owner: @sjenning)
      • Mrunal or Ryan will take a look
    • RotateKubeletServerCertificate beta since 1.12 (owner: @mikedanese)
      • Sergey to ping Mike
    • CustomCPUCFSQuotaPeriod alpha since 1.12 (owner: @szuecs)
      • Mrunal to take a look
    • KubeletPodResources beta since 1.15 (owner: @dashpole)
      • [@fromanirh] I volunteer to help graduate this to GA in 1.27 - I'll add it to the 1.27 planning document when we start it
    • TopologyManager beta since 1.18 (owner: @lmdaly)
      • The past decision was to graduate before out-of-process plugins
      • [Dawn] Let’s put together a one-pager explaining the roadmap, short and longer term.
      • [Swati] Device and CPU manager have graduated. Maybe let’s be consistent
    • DownwardAPIHugePages beta since 1.21 (owner: @derekwaynecarr)
    • ProbeTerminationGracePeriod beta since 1.22 (owner: )
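
A small worked example of the proposed memory.high formula from the memoryThrottlingFactor item above (the helper name is ours, not kubelet's):

```go
package main

import "fmt"

// memoryHigh computes the proposed cgroupv2 memory.high value:
// memory.high = request + (limit - request) * memoryThrottlingFactor.
func memoryHigh(request, limit int64, factor float64) int64 {
	return request + int64(float64(limit-request)*factor)
}

func main() {
	const gi = int64(1) << 30
	// With request=1Gi, limit=2Gi and the default factor of 0.8,
	// memory.high lands at ~1.8Gi: a workload that steadily uses
	// more than 80% of its limit (e.g. a JVM) would be throttled.
	fmt.Println(memoryHigh(1*gi, 2*gi, 0.8)) // 1932735283 (~1.8Gi)
}
```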

Nov 29, 2022

  • [aditi] Expose pod cgroup path: kubernetes/kubernetes#113342

    [sig-node] Concerns about exposing cgroup information at the pod API status level. There could be races depending on how the path is used. Better to narrow the issue down to the interaction between the runtime and the CNI plugin at pod bring-up time. Peter (CRI-O), David Porter (bobbypage) / mikebrow (containerd), and Mike Zappa (CNI, @MikeZappa87) to figure out the details of the approach across runtimes.

  • [everpeace] KEP-3169: Fine-grained SupplementalGroups control

    • NOTE: I’m sorry that I can’t attend the regular community meeting due to the timezone gap (3 a.m. in my timezone, Tokyo). I’m adding this agenda item to gain more visibility for my KEP in the sig-node community.
    • This KEP resolves the surprising behavior of the SupplementalGroups field described in k/k#112879. I believe many (probably most) cluster admins don’t know about this behavior. Moreover, when a cluster uses hostPath volumes, it can raise security concerns even when cluster admins enforce PSPs (or other policy engines) in the cluster. So I would like to implement the KEP, hopefully in v1.27.
    • I would very much appreciate it if somebody could help review my KEP.
    • This KEP includes a modification of the CRI, so we probably need to update CRI implementations first, at least the most popular ones (are containerd and CRI-O enough?). I’m not familiar with how to do this. I recognize that we can’t apply a feature gate to the CRI and its implementations. I would also appreciate advice from contributors on this.
  • [klueska] Update to KEP to reflect actual implementation that was merged

  • [swsehgal] Need Derek’s architecture approval on kubernetes/kubernetes#110252. API updates are proposed in a separate PR (ready for review as well).

  • [bobbypage] Update/thoughts on CRI healthz: kubernetes/kubernetes#109653

  • [vinaykul] InPlace Pod Vertical Scaling PR - status update

    • vinaykul may not join due to conflicting appointment
    • PR 102884 approved, missed 1.26, targeting for 1.27
    • Please review and merge KEP milestone update PR
    • Please review test-infra inplace resize test pull job PR

Nov 22, 2022

No meeting due to the Thanksgiving holiday in USA.

Nov 15, 2022

  • [rata, giuseppe] Userns support
    • For stateful pods, shall we create a new KEP or change the scope of the existing one?
    • Will join sig-storage to start the conversation with them about stateful pods too
    • [Derek] - A separate KEP and feature gate are recommended
    • [Sergey] - Should we GA the existing support?
    • [Derek/Mrunal] Yes, we should move it to beta.
    • [Rodrigo] Concerns around validation if we introduce another feature flag.
    • [Rodrigo] ID-mapped mounts could solve issues around correct permissions for files such as SSH keys.
    • [Derek] Can we key off the kernel version to figure out if we have ID-mapped mounts? Any way to implement a fallback?
  • [klueska] Dynamic Resource Allocation (DRA) update
    • Merged on Friday (after an extension request) as an alpha feature for 1.26
    • New staging repo created for k8s.io/dynamic-resource-allocation with helper libraries to build resource drivers against the DRA API
    • Outstanding request to create dra-example-driver repo
    • Request to “associate” DRA with an official sig-node subproject
      • Should we reuse an existing subproject or create a new one?
      • My vote is for a new one (but what to call it?)
  • [Sergey] sidecar WG: https://docs.google.com/document/d/1E1guvFJ5KBQIGcjCrQqFywU9_cBQHRtHvjuqcVbCXvU/edit#

Nov 8, 2022

Recording: https://www.youtube.com/watch?v=mnZWYAuOJ90

  • [bobbypage/eric lin] kubernetes/kubernetes#109653
  • [pacoxu] kubelet: make registry qps/burst limit the parallel pulling count #112242
    • After rethinking, the current qps/burst of image pull makes no sense to users, and the current PR tries to make it limit the number of images being pulled at the same time; the flag and its meaning would then no longer match. So I suggest we just deprecate and then remove the current registryPullQPS and registryBurst flags. Meanwhile, if this is a concern, we should provide a new flag like parallel-image-pull-limit as a new feature. (#112044 is the issue.) At the very least we should add more explanation for the flag. (registryPullQPS: limit registry pull QPS to this value; QPS is requests per second.) See the sketch at the end of this section.
    • [ruiwen-zhao] +1 on adding a node-level limit on parallel pulls. I can help with this effort.
    • [paco] containerd/containerd#7313 I am working on a pull request in containerd to add some image-pull-related metrics. One of them is the count of in-progress image pulls.
    • [mikebrow] needs more declarative hints in the pod/container spec, and more resource information the image manager will not know about from other activities… declarative info: qos / cache policy / confidential meta / lazy snapshots vs. pull-all / does the container runtime optimize for common layers / … As mrunalp says, it’s not just about the image; it’s the connection cost/manifests/layers and, soon, artifacts.
  • [vinaykul] InPlace Pod Vertical Scaling PR - status update
    • Fixed nits and updated code to catch up after rebase.
    • Updated E2E test to run full-spectrum for containerd>=1.6.9. Tested in a local cluster.
    • Investigating failures with the newly added cgroupv1/cgroupv2 in-place resize CI jobs with containerd-main.
    • Requested 4 day exception to investigate/fix issues from rebase and CI job failure.
    • IMHO, it may be safer to merge this early 1.27 rather than late 1.26
  • [iancoolidge] cpuset to kubernetes/utils
    • [time permitting]
    • kubernetes/kubernetes#113744
    • minor controversies: NoSort/Sort, Int64 vs int
    • plan: merge all changes here, then copy into k/utils, then revendor in k/k (usage sketch at the end of this section)
  • [klueska] Need approval from sig-node-leads for feature gate addition in following PR
    • kubernetes/kubernetes#112914
    • I’ve already LGTM’d and APPROVED the kubelet changes, it just needs the feature gate approval now (@liggitt already confirmed to do the API approval)
    • Assigning ~~~~ since he did the KEP approval
  • [klueska] Need sig-node-leads approval for creation of dynamic-resource-allocation staging repo
  • [MaRosset] - Windows hostnetwork alpha #112961
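
On the registry qps/burst item above: a QPS/burst limiter only paces when pulls start, while the proposal is to bound how many pulls are in flight. A minimal sketch of the difference, using golang.org/x/time/rate for the limiter; the cap of 3 stands in for the hypothetical parallel-image-pull-limit flag:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	starts := rate.NewLimiter(5, 10)   // registryPullQPS=5, registryBurst=10: paces starts only
	inFlight := make(chan struct{}, 3) // hypothetical parallel-image-pull-limit=3: bounds concurrency

	for i := 0; i < 6; i++ {
		_ = starts.Wait(context.Background()) // QPS/burst decides when a pull may *start*
		inFlight <- struct{}{}                // semaphore blocks while 3 pulls are already running
		go func(i int) {
			defer func() { <-inFlight }()
			time.Sleep(500 * time.Millisecond) // stand-in for a slow registry pull
			fmt.Println("pulled image", i)
		}(i)
	}
	time.Sleep(2 * time.Second)
}
```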
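
And for the cpuset-to-kubernetes/utils item, a small usage sketch, assuming the post-migration import path k8s.io/utils/cpuset:

```go
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

func main() {
	a, _ := cpuset.Parse("0-3,8") // canonical range notation
	b := cpuset.New(2, 3, 9)
	fmt.Println(a.Union(b))        // 0-3,8-9
	fmt.Println(a.Intersection(b)) // 2-3
	fmt.Println(a.Size())          // 5
}
```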

Nov 1, 2022

Oct 25, 2022 [Cancelled for KubeCon]

Oct 18, 2022

Total active pull requests: 213 (+8 from last week)

| Incoming | Completed |
| --- | --- |
| Created: 24 | Closed: 7 |
| Updated: 66 | Merged: 9 |

Oct 11, 2022

Total active pull requests: 205 (-3 from last week)

| Incoming | Completed |
| --- | --- |
| Created: 17 | Closed: 11 |
| Updated: 69 | Merged: 8 |

  • [matthyx] request a WG creation to work on sidecar containers
    • A summary from Sergey:
    • [Dawn] define exit criteria and a way to report status back to this SIG. Also define the WG’s term.
    • [Sergey] wait for the Doodle to decide scheduling.
  • [vinaykul] InPlace Pod Vertical Scaling PR - status update
    • Please review KubeCon slides 11-16, if possible
    • Cgroupv2 support changes are in review, issues fixed. Mrunal PTAL.
    • Awaiting containerd release in order to enable full-E2E tests.
    • Mothership PR 102884 can merge once we have the next containerd release (1.7 per Ruiwen - sorry, I accidentally deleted Ruiwen’s comment), the CI picks it up, E2E tests are fully enabled (validating PodStatus for resize), and cgroupv2 review issues have been addressed.
    • API changes PR 111946 also on hold for containerd.
  • [mimowo]: Heads up for "Standardization of the OOM kill communication between container runtime and kubelet" (kubernetes/kubernetes#112910)
    • first: standardization of what we have
    • second: add more information - whether it was due to exceeding the limits or memory pressure on the node. This is more involved.
    • [Dawn] the user-space OOM killer in cgroupv2 will also introduce more standardization in this space
    • [Dawn] thought it was already aligned; how much has it diverged?
    • [Sergey] is it required for the KEP?
      • [Michael] no, but may break in future
    • [Sergey] How easy will it be to troubleshoot that it was indeed an OOM kill once people start relying on job retries based on OOM kills?
      • [Michael] Feature: customer can define policies for jobs depending on pod end state. Today pod conditions are used to understand the pod end state. Pod condition will be “resource exhausted”.
  • [Dawn] there will be cases when kubelet just cannot tell that something was oom killed. But it is still good to have everything unified.
  • [David Porter] how, practically, will it be standardized? It is just a string (see the sketch at the end of this section). Are there any conformance tests or something?
  • [Lantao] the logging format is standardized the same way
  • [David] For a container running multiple processes, when a subprocess is OOM-killed the container runtime may not detect it.
  • [Lantao] cgroupv1 behavior is different, isn’t it? IIRC, when there is a cgroup OOM, a random process in the cgroup is killed. In that case, OOMKilled is still set, even if pid 1 is running happily. ([David] Let’s confirm)
  • [Dawn] this was one of the first issues that was fixed.
  • [SergeyKanzhelev] containerd 1.6 is going LTS containerd/containerd#7454
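
To illustrate David's point above that the OOM-kill signal is today just a string: a consumer (e.g. a controller implementing retries on OOM) ends up matching the free-form terminated Reason, as in this sketch using k8s.io/api types:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// oomKilled reports whether any container in the pod status terminated
// with the (currently unstandardized) "OOMKilled" reason string -- the
// fragile string matching the discussion above is about.
func oomKilled(status corev1.PodStatus) bool {
	for _, cs := range status.ContainerStatuses {
		if t := cs.State.Terminated; t != nil && t.Reason == "OOMKilled" {
			return true
		}
	}
	return false
}

func main() {
	s := corev1.PodStatus{ContainerStatuses: []corev1.ContainerStatus{{
		State: corev1.ContainerState{Terminated: &corev1.ContainerStateTerminated{
			ExitCode: 137, Reason: "OOMKilled",
		}},
	}}}
	fmt.Println(oomKilled(s)) // true
}
```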

Oct 4, 2022

Total active pull requests: 208 (-22 since last week)

| Incoming | Completed |
| --- | --- |
| Created: 11 | Closed: 21 |
| Updated: 89 | Merged: 12 |

Sep 27, 2022

Total active pull requests: 230 (+3 since last week)

| Incoming | Completed |
| --- | --- |
| Created: 22 | Closed: 12 |
| Updated: 76 | Merged: 9 |

  • [ruiwen-zhao] Just a quick reminder on 1.26 planning: Please update the status on 1.26 planning, or add to it if you are planning to work on something not tracked there. KEP freeze is 18:00 PDT Thursday 6th October 2022

  • [alexey fomenko]: CRI Image Pulling Progress and Notifications: https://hackmd.io/nyLLTtAkTgOuYwxmnu0sIQ

    • [paco]: recently, I have seen several image-pull-related issues. 1. Image pull time includes waiting time, due to the default serialized image pulling behavior. 2. There are no image-pull-related metrics in kubelet (pr) or containerd (pr). 3. kubelet registry qps/burst does not work as expected: the current registry QPS just starts n image pulls per second, while users want the number of parallel image pulls to be bounded. BTW, enabling parallel image pulling would solve some problems caused by image pulls getting stuck. We may discuss this as a whole.
    • [lantaol]:
      • We hit this issue as well with serial image pull. A bad container image can block all other pods from coming up forever.
      • However, with parallel image pull, there is no good way to control the concurrency. The QPS is not the best way to solve this problem, because each image pull request can take a long time, just controlling the query per second is not sufficient.
      • This is worse with containerd, which doesn’t have an overall image pull timeout, or a progress based timeout like dockershim.
    • [Derek] Do we want image pull status on Pod as well?
      • [Alexey] yes, we want but not looked into details yet
      • [Derek] qps of progress reports may be a concern for a lot of updates
      • [Alexey] maybe just key points like 25%, 50%
      • [Derek] still a lot of information on happy path. Must be careful with it, only needed for debugging
    • [Derek] is serial pull policy still being used? It only exists for very old runtimes
      • [lantaol] parallel pull may have qps issues
      • [Derek] we have up to n images in parallel. Exactly for this reason. There is no reason to not switch to parallel
    • [Wenjun] many customers want an image pulling status.
      • [Derek] maybe metrics instead, or something that will help minimize the traffic
    • [Lantaol] What about image pull timeouts? An overall timeout is impossible to set, so dockershim had a progress-based pull timeout (see the sketch at the end of this section). Does it exist in containerd?
      • [Ruiwen] Containerd pull timeout will be in 1.7: containerd/containerd#6150
      • [Sergey] Do we need it configured from kubelet or runtime?
    • [Alexey] another option is to issue an ETA as an update instead of the progress.
    • [Derek] How it will work with “lazy image pulling” like GKE image streaming?
      • [Mrunal] yes, it likely needs to be accounted for.
      • [Dawn] should work well with it.
    • [lantaol] Checked with mrunal offline: for CRI-O, the “concurrent pulled image layers” configuration lives in CRI-O itself.
      • Containerd only has MaxConcurrentDownloads which limits the concurrent downloads for each image.
      • To do this at the containerd level, we will need the CRI plugin to implement a cap at the daemon level.
      • Or we can consider implementing this at kubelet level to extend serial-image-pull to max-concurrent-image-pull.
    • [mikebrow] nod to lantao’s comments.. we have a need to parallelize pulls, a need to identify resource contention when too many pulls run in parallel (layers/manifest checks/… soon artifacts), and a need to handle slow/no progress due to registry contention/access issues… Because there is a large cost to detecting resource contention and responding “from the back seat”, we will probably be better off passing prioritization information/policies from the kubelet side down to the code (in the container runtimes) that performs the resolve and the pulling of layers..
    • Summary: let’s proceed with CRI part of it
  • [fangyuchen86]: Kubelet Support Custom Prober Protocol

    • [Derek] Where is the probe run?
    • [fangyuchen86] On the node
    • [Derek] who is charged with the execution of these probes? They are not running inside the pod’s cgroup; who is reserving compute and memory for them?
    • [fangyuchen86] A controller will allocate this - it will be a custom pod in a VPC network, on the same VPC as the user workload. kubelet cannot access the Pod’s network. It can start the pod and attach storage, but not reach the network.
    • [Dawn] I understand the requirement. But maybe that third party controller can take even more responsibility and actually do the pod management on cluster level. It may be easier. Introducing a custom prober to the node has some security concerns too.
    • [Wenjun] What prevents creating a proxy?
      • [fangyuchen86] security does not allow this
      • [Derek] this is the most interesting question here. Naively, we believe that kubelet is an admin for all workloads. And why is the networking taken away from the containers? Maybe we can have a session about these requirements?
      • [Dawn] In some environments, the workload cannot access the kubelet network. This is where gVisor helps, for example.
      • [Derek] nobody has pushed back on probes before.
    • [fangyuchen86] we also have the problem of other protocols that need to be covered.
      • [Derek] that is a separate requirement that might be solved differently.
    • Summary: Let’s create a doc that explains the requirements and scenarios
  • [klueska] Small, self-contained TopologyManager update planned for this release:

    • kubernetes/enhancements#3545
    • Already added to
    • Please add the /label lead-opted-in to the issue so it can be tracked
    • [Derek] this is done
  • sig-node meeting recordings: uploading recent ones.

    • [Derek] will do
    • [Dawn] might help with it
  • [Sergey] per container restartPolicy override: https://docs.google.com/document/d/1gX_SOZXCNMcIe9d8CkekiT0GU9htH_my8OB7HSm3fM8/edit

    • [Derek] An alternative is “BindsTo” semantics that can be used for termination of sidecar containers. Maybe kubelet can delegate it to the OS as well, e.g. systemd.

    • [Mrunal] BindsTo needs to be experimented with. But delegating to the OS is also appealing

    • [MikeB] The question is how much we can delegate.

      summary: read the doc and compare with BindsTo.

  • [Sergey] more sidecars: kubernetes/kubernetes#111356 (comment)

    • [Dawn] restartPolicy and QoS (including OOM score) were per-container initially. But after debate it was changed to per-pod and the community was convinced. There were even conversations about scheduling containers into an existing Pod. This is why sidecars feel so unnatural in k8s. Once we open this can of worms, we are starting to do Pod v2.
    • [Mrunal] this OOM adj calculation may be challenging.
    • [Derek] I’d rather move to OOMd than change the present state.

    summary: likely not (closed the issue).

  • [marquiz] update on QoS-class resources KEP

  • [pacoxu] If ResourceQuota for cpu/memory is set, no best effort pod can be created. For other resources like ephemeral-storage, best effort pod can be created.
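
On the image-pull timeout thread earlier in this section (an overall timeout is impossible to set; dockershim had a progress-based one): a minimal sketch of a progress-based timeout, where the deadline resets whenever bytes arrive, so only a stalled pull is cancelled. All names here are illustrative, not from any runtime.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
	"time"
)

// progressReader pushes the stall deadline out every time bytes arrive,
// so only a pull that stops making progress gets cancelled.
type progressReader struct {
	r     io.Reader
	timer *time.Timer
	stall time.Duration
}

func (p *progressReader) Read(b []byte) (int, error) {
	n, err := p.r.Read(b)
	if n > 0 {
		p.timer.Reset(p.stall) // progress made: reset the stall clock
	}
	return n, err
}

// withProgressTimeout wraps a layer-download stream so that a transfer
// with no data for `stall` cancels ctx, while a slow-but-moving pull of
// a huge image is left alone (unlike a single overall timeout).
func withProgressTimeout(ctx context.Context, r io.Reader, stall time.Duration) (io.Reader, context.Context) {
	ctx, cancel := context.WithCancel(ctx)
	t := time.AfterFunc(stall, cancel) // fires only if never reset in time
	return &progressReader{r: r, timer: t, stall: stall}, ctx
}

func main() {
	r, ctx := withProgressTimeout(context.Background(), strings.NewReader("layer data"), 10*time.Second)
	_, _ = io.ReadAll(r)
	fmt.Println("stalled:", ctx.Err() != nil) // false: data kept flowing
}
```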

Sep 20, 2022

  • [ruiwen/mrunal] 1.26 planning
  • [marquiz] QoS-class resources KEP (renamed)
  • [vinaykul] InPlace Pod Vertical Scaling PR - status update
    • Fabian has a PR adding Windows support for in-place resize. Thanks Fabian!
    • JaffWan fixed my missing unit tests and typo for cgroupv2 :) Thanks Jaixin!
    • Jaixin also found the root-cause for issue 112264 and will work on a fix. This should significantly speed up E2E tests.
    • I tried out in-place resize with CRI-O (in local cluster) and it works!
    • Tested resize E2E tests using Ruiwen’s containerd support
      • It works but PodStatus.Resources update takes ~60s. Issue 112264
      • UpdateContainerResources (containerd) applies the resize in < 50 ms
    • API changes PR 111946 ready for review & preferably early-merge.
    • Cgroupv2 support changes are in review.
    • Mothership PR 102884 can merge once we have the next containerd release (1.6.9?), the CI picks it up, E2E tests are fully enabled (validates PodStatus for resize), and cgroupv2 review issues have been addressed.
  • [mimowo] Promote KEP-3329 "Retriable and non-retriable Pod failures for Jobs" for Beta
  • [Sergey] SIG ongoing things update:

Total active pull requests: 227 (+35 since June)

| Incoming | Completed |
| --- | --- |
| Created: 32 | Closed: 10 |
| Updated: 87 | Merged: 15 |

Bugs untriaged: 13 https://github.com/orgs/kubernetes/projects/59

PRs untriaged: 99 https://github.com/orgs/kubernetes/projects/49

CI group meetings are back and we will be triaging issues and getting the tests back on track.

Sep 13, 2022

Sep 6, 2022

  • [ruiwen] 1.25 retro
  • [danielye] CRI Stats Performance Update
  • [qiutongs] Issue awareness: unexpected initial delay of probes
    • kubernetes/kubernetes#96614 (comment)
    • `initialDelaySeconds` doesn’t work as the API spec says (see the sketch at the end of this section).
      • first probe time = container start time + initialDelaySeconds
      • kubelet restart: wait a reasonable amount of time; still respect initialDelaySeconds differences for probes in the same container?
    • Jitter is needed in the case of kubelet restart, to avoid thundering-herd problems.
      • The jitters given to the probes in the same container are different.
      • Since 1.21, the jitter is only added when kubelet recently started/restarted.
        • If the periodSeconds are the same for all probes in a container, the probes will be invoked at the same time; initialDelaySeconds makes no difference.
  • [adrianreber] Checkpoint/Restore next steps
    • main focus: how to secure checkpoints
    • a checkpoint contains all memory pages (maybe secrets, random numbers)
    • possible suggestions for how to secure checkpoint archives
  • [vinaykul] InPlace Pod Vertical Scaling PR status update
    • Attempted “full scope” E2E tests with Ruiwen’s containerd support
      • Tested this by switching containerd binaries on a GKE worker to ones I built from latest master.
      • All tests pass and verify that Ruiwen’s code works correctly.
      • The tests took considerably longer (2988s vs. 869s to run all 34 E2E tests on a GKE 1-master/1-worker cluster)
        • The issue is NOT in containerd.
        • A ContainerStatus() CRI call made within 50 ms of UpdateContainerResources() shows the updated cgroup values.
        • The long delay is in updating apiPodStatus. This needs further investigation.
    • cgroup v2 support for in-place resize - review is in progress.
    • Is there any interest in merging API changes ( PR 111946 ) early?
      • If yes, please add an ok-to-test label.
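
A small sketch of the first-probe timing from the probes item above: the API-spec expectation is container start time + initialDelaySeconds, with (per the notes) a jitter added only when kubelet recently restarted. wait.Jitter is from k8s.io/apimachinery; the base duration and factor below are illustrative, not kubelet's actual values.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// nextProbe sketches the expectation discussed above:
// first probe time = container start time + initialDelaySeconds,
// plus a jitter only when kubelet itself restarted recently.
func nextProbe(containerStart time.Time, initialDelay time.Duration, kubeletRecentlyStarted bool) time.Time {
	next := containerStart.Add(initialDelay)
	if kubeletRecentlyStarted {
		// Spread restarted probes out instead of firing them all at
		// once (the thundering-herd concern in the notes). Base
		// duration and factor are illustrative.
		next = next.Add(wait.Jitter(1*time.Second, 1.0))
	}
	return next
}

func main() {
	start := time.Now()
	fmt.Println(nextProbe(start, 10*time.Second, false))
	fmt.Println(nextProbe(start, 10*time.Second, true))
}
```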

Aug 30, 2022

  • [klueska, pohly] Update on Dynamic Resource Allocation
    • KEP accepted for 1.25
    • Delayed implementation to 1.26
      (Mostly functional Draft PR)
    • Demo with NVIDIA GPUs
  • [dgl] Status of ProcMountType feature gate
    • This has been alpha since 1.12, I’m interested in potentially progressing it
  • [pehunt] inheritable capabilities regression follow up
  • [mgroot]
  • [marquiz] QoS-class resources KEP (renamed)
    • request for comments
    • also for post in k8s developers’ blog (PR)
  • [vinaykul] InPlace Pod Vertical Scaling PR status update
    • CRI (containerd) support has been merged. Thanks Ruiwen!
      • Next step: containerd release, then K8s picks up the new containerd version.
    • PR 111946 (API changes for in-place resize from 102884) needs ok-to-test
    • cgroup v2 support for in-place resize awaiting review.
    • Marion Lobur from the GKE team (Warsaw) is joining this effort.
      • His use case can help get significant test coverage in alpha.
  • [qbarrand]
    • Code push complete - development ongoing
    • KMM admin PR - needs attention from one of the chairs
    • Sponsorship for an additional KMM contributor to become member of the kubernetes org

Aug 23, 2022

Aug 16, 2022

  • [bobbypage/pehunt/Daniel] CRI stats Prometheus endpoint and adding cAdvisor metrics to CRI
    • The KEP as it stands proposes having the CRI implementation emit the Prometheus metrics. The new proposal is to enhance the CRI API so the CRI implementation gives the kubelet the metrics, and the kubelet emits them.
      • Same concerns about performance, but possibly it won’t be as bad as we fear
    • Plan is to have Daniel/David/Peter set up a proof of concept with a kubelet/containerd fork to see if it regresses performance.
    • Aim to have a POC for 1.26 KEP time to be able to decide how to go forward in 1.26 cycle.
  • [ndixita] Looking for a reviewer for External Credential Provider GA PR: kubernetes/kubernetes#111495
  • [vinaykul] InPlace Pod Vertical Scaling PR status update
    • vinaykul not in Node meeting today due to conflict
    • Please review KEP PR that updates milestones for this feature
    • I will spawn a separate PR for API changes later this week.
  • [pehunt] How best to request reviewers to look at a PR not attached to a release cycle

Aug 09, 2022

  • [vinaykul] InPlace Pod Vertical Scaling PR status update
    • Extracted and merged CRI changes in PR 111645
      • Many thanks to Mrunal, Peter, Mike, Ruiwen & Mark for quick reviews!!
      • Huge thanks to wangchen615 for pushing hard on the scheduler code!
      • This now unblocks runtime implementing support for in-place update.
    • If there are no objections, can I squash all the discrete kubelet commits till now into a single commit (easier to rebase), and then add cgroupv2 support?
    • Can we target an early 1.26 merge of API code? (saves me rebase headache)
  • [qbarrand] kubernetes-sigs membership for the KMMO contributors PR
    • feel free to ping @endocrimes
  • [pehunt] repercussions of dropping inheritable capabilities - moby/moby#43420

Aug 02, 2022

Reminder of the coming code freeze on 08/02.

  • [vinaykul] InPlace Pod Vertical Scaling PR status update

    • The pod resize E2E test failed after rebase because GKE switched to cos-97 last week, which defaults to cgroupv2
      • we don’t support it yet (it was planned for beta)
      • disabling the cgroup values check for resize verification unblocks us, but a manual test on pre-cos-97 (cgroupv1) is needed
    • wangchen615 found that the scheduler takes 5 minutes to reevaluate pending pods after resizing down a bound pod.
      • Late-stage fix in commits 563b254 and c6581a8
      • SIG-scheduling feels it is low risk and has signed off unofficially on Slack
    • thockin has LGTM’d the API changes.
    • Open issues tracked here.
    • My sentiment has changed - I am nervous about late-stage changes and missing cgroupv2 support when CI is on cgroupv2
  • [harche] evented PLEG - kubernetes/kubernetes#111384

    • kubernetes/kubernetes#111642
  • [rata]: Asked for exception for userns PR, as talked with Mrunal on slack

    • Mrunal mentioned there are some concerns with phase II to discuss
    • rata to update on PR, KEP and exception:
      • Capture the discussion, we agreed to reduce the scope to stateless pods.
      • On one hand, this buys us more time to figure out the details that some reviewers want about persistent volume support. On the other, it is very valuable to have support for stateless pods, and that is an end in itself.
      • This should also eliminate all concerns on how this can graduate to beta and GA.
      • Should we change the feature gate name?
    • any reason why “it was a mistake that the user must make an explicit request/limit when ResourceQuota has a CPU/memory setting”?
  • [bobbypage] cgroup v2 GA update

    • CI has been running cgroupv2 images (COS/Ubuntu) on node e2e and cluster e2e
    • cgroupv1 specific tests added
    • More feedback has been obtained on cgroupv2 from customers
      • Tencent has been running on cgroups v2 for a while
    • Planned doc updates and blog post

July 26, 2022

July 19, 2022

  • [rata]: Can we have a review for userns k/k PR? Code freeze is coming soon :)
    • Mrunal is reviewing it
    • Ruiwen would review it for Containerd related changes
  • [jstur]: Windows CRI-only pod sandbox stats: kubernetes/kubernetes#110754. Do we keep the structure generic or make it specific to Windows?
  • [Brett]: SRO to kmmo repo rename
    • Brett to open an issue to get the rename going
  • [vinaykul] InPlace Pod Vertical Scaling PR status update
    • Rebased code to resolve latest conflict
    • Added concise guidance for UpdateContainerResources CRI
    • SIG-Scheduling (huang-wei) signaled LGTM for current code.
      • E2E test and optimization can come in follow-up PRs.
        • wangchen615 is working on addressing Danielle’s feedback & E2E test targeting scheduler changes
    • thockin has LGTM’d the API changes.
    • Open issues tracked here.
    • What do we need for Node & CRI LGTM for alpha?
  • [Jing] Local Storage Capacity Isolation Feature
    • Problem:
    • Proposal: Add a kubelet option enableLocalStorageCapacityIsolation (default=true) to the kubelet configuration
      • The default value is true.
      • For systems that cannot support detecting root disk usage, set enableLocalStorageCapacityIsolation=false in the kubelet configuration. In this case, kubelet can continue to start without rootfs disk usage information, so ephemeral-storage allocatable is not set either. If a pod has an ephemeral-storage request/limit set in this case, the pod will fail to be created because allocatable storage is not available.
      • [feedback] Can we detect this automatically, to avoid complicating kubelet?
    • Feedback from sig-storage: seems fine
  • [Peter] CRI stats check-in
  • [Dawn] FYI: PR for SIG Node Contributor ladder was merged: #6725 Thanks Derek!
    • Please send your PR against that ladder if you think you are ready. Thanks!

July 12, 2022

  • [danielfoehrn] KEP proposal: Dynamic Resource Reservations
  • [vinaykul] InPlace Pod Vertical Scaling PR status update
    • Rebased code to resolve latest conflict
    • Guidance for runtime is taking shape - thanks mrunalp & kolyshkin
    • SIG-Scheduling (huang-wei) signaled LGTM for current code.
      • E2E test and optimization can come in follow-up PRs.
        • wangchen615 is working on addressing Danielle’s feedback & E2E test targeting scheduler changes
    • thockin has LGTM’d the API changes.
    • Open issues tracked here.
    • What do we need for Node & CRI LGTM for alpha?
  • [adrianreber] Forensic Container Checkpointing
    • code PR kubernetes/kubernetes#104907
    • LGTM by Ryan, Danielle, Mike(per existing kep) and Mrunal
    • I think only Derek's /approve is now missing

July 5, 2022

June 28th, 2022

  • [vinaykul] InPlace Pod Vertical Scaling PR status update (vinaykul OOO next week)
    • KEP template changes merged. KEP is now tracked for 1.25
    • Fixed CRI & test issues found by Mike and Derek respectively.
    • [derek on 6/14] Reviewed with feedback, need to clarify core behavior expectation
      • Kubelet -> CRI interaction pattern for observing state
        • This assumes the runtime reports values as read from host cgroup
      • [vinaykul] Please review my response on ResizeStatus generation.

      • [vinaykul] Please review my comment on suggested runtime behavior.

  • [bthurber & mrunal] - special-resource-operator repo rename
  • [adrianreber] Forensic Container Checkpointing
    • code PR kubernetes/kubernetes#104907
    • LGTM by Ryan, Danielle, Mike (per existing KEP).
    • Ready to be merged? Derek expressed possible discussion/evaluation needed for new exposed checkpoint service, mrunalp requested to look it over in that context.
  • [swsehgal] Populating Node Resource Topology-api repository

June 21st, 2022

June 14th, 2022

Total active pull requests: 192

| Incoming | Completed |
| --- | --- |
| Created: 17 | Closed: 6 |
| Updated: 57 | Merged: 13 |

Fish out: kubernetes/kubernetes#104140

  • [dawnchen] https://bit.ly/k8s125-enhancements is updated for SIG Node.

    • Total: 19 enhancements
  • [paco] I have worked on kubernetes/enhancements#1029 on and off since 1.22; I fixed a bug and tried to add metrics/logging for this feature, and the promotion PR was updated: kubernetes/enhancements#2697. Can we add this to v1.25 if it meets the beta-promotion bar?

  • [vinaykul] InPlace Pod Vertical Scaling PR status (vinaykul unavailable next two weeks)

    • [derek] Reviewed with feedback, need to clarify core behavior expectation
      • Kubelet -> CRI interaction pattern for observing state
        • This assumes the runtime reports values as read from host cgroup
    • KEP needs a template catch-up update. Please review this PR.
      • Partial/placeholder current unit-test cov info added. A more detailed breakdown will have to wait until after the OSS NA conference.
    • API (Tim Hockin) LGTM. Scheduling (need to add E2E test, then LGTM likely)
    • Issues to fix are being tracked here. Volunteers welcome :)
  • [adrianreber] Forensic Container Checkpointing (not able to join (again))

    • code PR kubernetes/kubernetes#104907

      • reviews done by Mrunal and Danielle (thanks)
      • probably almost finished, waiting for additional reviews/approval
      • Waiting for feedback from Mike about CRI changes
        • Initially we targeted checkpoints archive on the local file system
        • Right now we have successfully implemented checkpoint OCI images (not standardized (yet)) in the local registry in containerd and CRI-O
        • Storing checkpoints as OCI images was a request during early discussions (1.5 years ago)
        • At this point we could completely drop checkpoints written to the local file system and only store checkpoint images in the local containerd/CRI-O registry
        • Please let me know in the PR if it would be preferred to drop the local file system checkpoint archives (I am in favor of it)
    • Concerning the PR discussion this would mean we need to keep the parameter about the checkpoint destination (something like localhost/checkpoint-image:tag) and not remove the destination parameter as suggested by Mike

      [Derek] Fundamental question - the PR allows changing CPU and memory requests and limits. But when will the change manifest? How will kubelet know if this errored? Should we read the value back to see whether the change was applied?

      [MikeB] the error code from the API call will notify us if lowering the limit failed. Swallowed on increase.

      [Mrunal] confirmed this ^^^. As long as we increment properly it should be fine.

      [Derek] Need to make sure kubelet always knows the latest applied values. Also in the case of emptyDir <missed this>

      Also, the PR makes an assumption that a PLEG event is handled properly.

      [Derek] Need to check the behavior on cgroupv2 as well.

  • [ddebroy] SandboxReady pod condition KEP

  • [ruiwen-zhao] Adding GA criteria for KEP-2133 kubelet credential provider

    • kubernetes/enhancements#3379
    • Reviewed/Approved by SergeyKanzhelev and deads2k
    • Looking for a review from Derek (or other sig node approvers)
  • [mikebrow] exec with uid/gid (maybe user) option vs. current root-only.. any interest? (original discussion: 1224)

    • discussion centered on use cases, comparing with ephemeral container support, login with ssh plugin extension…
    • Sergey had a good idea about a flag for disabling root defaulting
      • perhaps we could use container default here..?
  • [ed] Dynamic resource allocation KEP: request for review: kubernetes/enhancements#3064

    • Being reviewed by Tim Hockin
    • Looking for a 2nd review round from Derek
  • [marquiz] Class resources KEP, re-triage, reviewers/approver were missing last week

  • [mckdev] Always set alpha.kubernetes.io/provided-node-ip kubernetes/kubernetes#109794


June 7th, 2022

Total active pull requests: 192

(for the past two weeks):

| Incoming | Completed |
| --- | --- |
| Created: 29 | Closed: 17 |
| Updated: 94 | Merged: 17 |

May 31, 2022

  • [klueska]: Please add the following enhancement to the tracking sheet:
  • [matthyx] (cannot be present): Please remove the Keystone containers KEP from tracking:
    • adisky is on maternity leave
    • the code will likely impact PLEG; we’d prefer to have more tests added by the sig-node reliability project (which I want to participate in) before refactoring
  • [rata]: userns KEP PR open since early April. Anything missing?
    • The sig-node freeze is in a few days
    • AFAIK there is nothing missing. Got LGTM a few hours ago, missing /approve
  • [vinaykul] InPlace Pod Vertical Scaling - status update
    • Merged KEPs 2273 with 1287. Awaiting review.
    • API (Tim Hockin) LGTM. Scheduling (need to add E2E test, then LGTM likely)
    • Awaiting Derek’s review completion.
    • Issues to fix are being tracked here. Volunteers welcome :)
  • [adrianreber] Forensic Container Checkpointing (cannot make it to today's meeting)
    • KEP kubernetes/enhancements#3264
      • Reviewed, now waiting for approval
      • There is a review comment from Mike which is not totally clear to me; I am looking for clarification on what exactly Mike was asking for
    • code PR kubernetes/kubernetes#104907
      • multiple review rounds (thanks Mrunal!)
      • probably almost finished, waiting for additional reviews/approval
  • [bobbypage/ruiwen] issue with terminating pods reporting ready=true
  • [ddebroy] SandboxReady pod condition KEP

May 24, 2022

May 17, 2022

  • Canceled due to Kubecon

May 10, 2022

  • [Derek/Dawn] Update on sig-node reliability kickoff last week
    • Will upload the recording
    • Spent time reaching consensus on what reliability means for node - clarify kubelet vs. runtime vs. operating system
    • Increase test coverage, and then next steps
    • Calling on the community to help
    • matthyx - discuss at KubeCon
    
  • [matthyx] discuss about Keystone containers KEP

    • matthyx will update the document and come back in a few weeks to present the milestone and design
  • [knight42] review KEP: Split stdout and stderr log stream
    • Mrunal will make a pass
  • [rphillips] Evented PLEG initial work [doc] (Points of Contacts: Ryan Phillips, Mrunal Patel, Harshal Patil)

    • Derek - Data on perf? Ryan - We don’t have that yet.
    • Derek - Clarification that we won’t get rid of the list entirely but be able to make the lists less frequent.
    • Mike will review
  • [mikebrow/Paul] KEP: Sub-Second Probes

  • [marquiz] follow-up discussion on Class resources KEP.

    • Action to Markus:
      • blog post for k8s.io blog, description on what is possible now with runtimes and existing annotations
      • Come back with demo/description of how Block I/O will be utilized by user
  • [vinaykul] InPlace Pod Vertical Scaling PR status

    • Merged KEPs 2273 with 1287 per last week's discussion. Please review.
    • API (Tim Hockin) LGTM.
    • Awaiting Derek to complete the review. Can we please prioritize it to avoid another release slip? My time is going to be limited as June rolls around - multiple CFPs were accepted for the LF OSS conference, and I have the additional work of content & demo prep.
    • Issues to fix are tracked here. Volunteers welcome :)
  • [mrunal] kubecon next week - do we keep the meeting?

  • Cancelling next week.

May 3, 2022

April 26, 2022

Done (6):

| Issue | Name | Stage | Status | Assignee |
| --- | --- | --- | --- | --- |
| 281 | DynamicKubeletConfig | Removal | | SergeyKanzhelev |
| 688 | PodOverhead | Graduating | Stable | SergeyKanzhelev |
| 2133 | Kubelet Credential Provider | Graduating | Beta | adisky |
| 2221 | Dockershim removal | Major Change | Stable | SergeyKanzhelev |
| 2712 | PriorityClassValueBasedGracefulShutdown | Graduating | Beta | mrunalp |
| 2727 | gRPC probes | Graduating | Beta | SergeyKanzhelev |

Removed from Milestone (17):

| Issue | Name | Stage | Status | Assignee |
| --- | --- | --- | --- | --- |
| 127 | User Namespaces | Graduating+ | Alpha | rata |
| 1287 | In-place Pod Vertical Scaling | Graduating+ | Alpha | vinaykul |
| 1972 | ExecProbeTimeout | Graduating+ | Stable | jackfrancis |
| 2008 | Container Checkpointing (CRIU) | Graduating+ | Alpha | adrianreber |
| 2043 | List/watch for concrete resource assignments via PodResource API | Graduating+ | Stable | swatisehgal |
| 2254 | Cgroupsv2 | Graduating+ | Stable | giuseppe |
| 2371 | cAdvisor-less, CRI-full stats | Graduating+ | Beta | haircommander |
| 2400 | Swap | Graduating+ | Beta | ehashman |
| 2413 | SeccompByDefault | Graduating | Beta | saschagrunert |
| 2535 | Ensure Secret Pulled Images | Graduating+ | Alpha | mikebrow |
| 2823 | Node-level pod admission handlers | Graduating+ | Alpha | SaranBalaji90 |
| 2837 | Pod level resource limits | Graduating+ | Alpha | n4j |
| 2872 | Keystone Containers | Graduating+ | Alpha | adisky |
| 2902 | New CPU Manager Policy: distribute-across-numa | Graduating+ | Beta | klueska |
| 3063 | Dynamic resource allocation | Graduating+ | Alpha | pohly |
| 3085 | Pod conditions around starting and completion of pod sandbox creation | Graduating+ | Alpha | ddebroy |
| 3162 | Add Deallocate and PostStopContainer to device plugin API | Graduating | Alpha | zvonkok |
  • 1.22 release: 24 KEPs tracked and **13 merged**
  • 1.23 release: 14 tracked and **8 merged**
  • 1.24 release: 23 tracked and **6 merged**
  • 1.23 release retro summary:
    • Good:
      • Planning and tracking is useful
      • Soft freeze helps
      • Early merges are great
    • Can be better:
      • Lack of reviewers and early reviews
      • Lack of approvers’ bandwidth
    • Things that went well - notes:
      • A reviewer found missing tests during the review process
      • We are making progress, even though things move slowly sometimes.
      • For in-place pod vertical scaling, containerd-side changes are done in parallel and ready to go.
      • Collaboration with the runc community is good.
    • Things that didn’t go well - notes:
      • In-place vertical scaling is taking long, but we are practicing caution in the review process.
      • Original author moved forward last minute
      • Keystone Containers design came late
      • In-place scaling scope increased over the review process
      • Unit tests live in a different location than the code
      • Syncing changes between Kubernetes and the container runtime. (One side needs to cut a release first.)
      • (Compared to runc) the containerd community could be more proactive when cutting releases
    • AIs:
      • Investment in testability and reliability next cycle. Leadership to scope and direct this work is needed.
      • Don’t accept changes without test coverage, or changes that lack testability. Reviewers need to hold the bar.
      • Build component tests - volunteers needed
      • Clearer instructions on which folder to add tests to, based on the code. Automated tool?
  • [SergeyKanzhelev] KEP 1.24 retro and KEPs 1.25 planning kick-off

  • [rata]: userns KEP: CRI changes PR open for 19 days now

    • Can we ask for a review, please? :)
    • Do we need review from Windows/VM runtimes maintainers for Alpha phase or for beta?
      • Mark R (what is the GitHub handle?) will take a look. But it doesn’t seem like a blocker for alpha
      • rata: Also, this lives inside the Linux section of the CRI
    • Can you help us reach out to the relevant runtime maintainers to also take a look?
    • Can we aim for alpha in 1.25?
      • It is currently not listed in that section of the doc Sergey shared
      • rata: Added, thanks!
  • [adrianreber] Forensic Container Checkpointing

  • [vinaykul] InPlace Pod Vertical Scaling PR status

    • API (Tim Hockin) LGTM. Derek’s review is in progress.
    • Issues to fix are being tracked here.
    • Can we make an early commit for v1.25?

April 19, 2022

Cancelled due to lack of availability of leads.

April 12, 2022

Total active pull requests: 172 (+10 from last week)

| Incoming | Completed |
| --- | --- |
| Created: 16 | Closed: 5 |
| Updated: 44 | Merged: 2 |

April 5, 2022

Total active pull requests: 162 (-17 from two weeks ago)

| Incoming | Completed |
| --- | --- |
| Created: 55 | Closed: 15 |
| Updated: 122 | Merged: 62 |

March 29, 2022

No agenda, canceling to focus on code freeze.

For any urgent items for code freeze, please ping sig-node slack channel.

March 22, 2022

Total active pull requests: 179 (+5 from last week)

| Incoming | Completed |
| --- | --- |
| Created: 25 | Closed: 5 |
| Updated: 81 | Merged: 15 |

March 15, 2022

Total active pull requests: 174 (+3 from last week)

| Incoming | Completed |
| --- | --- |
| Created: 18 | Closed: 5 |
| Updated: 66 | Merged: 15 |

  • Reminder: Mar. 29 is code freeze
  • [vinaykul] InPlace Pod Vertical Scaling status
    • WIP. Responding to Derek’s comments and addressing identified issues
  • [danielle] as part of the sig annual report (https://docs.google.com/document/d/1JAvi8ptbovvjSqh88378YB9irJUqWcT-PQ5YsYmdD3c/edit#) we have various pieces of documentation to update across the sig that need discussion.
    • We need to document the progression ladder from new contributor -> reviewer -> approver, asking for dawn/derek to finalize the doc they were working on to move into the community repo.
    • CONTRIBUTING.md updates
      • We need to refine our on-ramp for new contributors a little here. What do folks think is important?
        • Today that page is a pile of links, with some helpful intro to building k8s docs from dims at the bottom
  • [ddebroy] Updates to kubernetes/enhancements#3087
    • Addressed comments/concerns from Elana and Derek so far
    • Single SandboxReady condition
  • [dawnchen] Status update on OutOfCpu issue?
    • regression since 1.22.
    • Clayton is working on a fix
    • It’s close, David is testing it. Should merge soon

March 8, 2022

Total active pull requests: 171 (+11 from last week)

| Incoming | Completed |
| --- | --- |
| Created: 23 | Closed: 5 |
| Updated: 59 | Merged: 10 |

  • [SergeyKanzhelev] Review KEPs which are in soft freeze

    The following KEPs are expected to be fully merged (Beta/Deprecations):

  • [Done] 281: DynamicKubeletConfig kubernetes/enhancements#281

  • [soft cut] 2133: Kubelet Credential Provider kubernetes/enhancements#2133

  • [Done] 2221: Dockershim removal kubernetes/enhancements#2221

  • [keep in alpha] 2371: cAdvisor-less, CRI-full stats kubernetes/enhancements#2371
    • david to update the KEP to reflect keeping it in alpha

  • [cut from release] 2400: Swap kubernetes/enhancements#2400

  • [in-progress] 2712: PriorityClassValueBasedGracefulShutdown kubernetes/enhancements#2712

  • [in-progress, let’s keep in release] 2727: gRPC probes kubernetes/enhancements#2727

    The following KEPs should have WIPs up that are ready for review (Alpha/GA):

  • [keep it] 688: PodOverhead kubernetes/enhancements#688

  • [keep] 1287: In-place Pod Vertical Scaling kubernetes/enhancements#1287

  • [review in progress] 2008: Container Checkpointing (CRIU) kubernetes/enhancements#2008

  • [pr in progress, design changes may be needed] 2535: Ensure Secret Pulled Images kubernetes/enhancements#2535
    • qq: is phase 1 valuable by itself?
    • [Derek] there may not be enough benefit in phase 1 alone, especially if, long term, the logic moves to the runtime; some benefit if users wipe imagefs on host reboot.
    • mrunal and mike to decide whether to keep it in the milestone by confirming the value of phase 1.

    29th March 2022: Week 12 — Code Freeze

  • [vinaykul] InPlace Pod Vertical Scaling status

    • My company work has taken priority, but I’ll start addressing Derek’s comments after this coming Friday.
    • Won’t be in today’s meeting due to conflict.
  • [bobbypage] Update on OutOfCPU issue

    • Fix is in progress (kubernetes/kubernetes#108366), but it is a tricky fix and we need to be careful to avoid introducing new regressions
    • Uncovered another related issue with pod lifecycle refactor relating to eviction / graceful node shutdown [Will create a GH issue to track]

Mar 1, 2022

Total active pull requests: 160

| Incoming | Completed |
| --- | --- |
| Created: 37 | Closed: 10 |
| Updated: 76 | Merged: 24 |

  • Announcements
    • Reminder: Soft freeze: Mar. 4
    • Anything that won’t make milestone, feel free to remove it now
  • [derekwaynecarr] sig annual report
    • Danielle offers to help out
  • [derekwaynecarr] got through most of the in-place resizing PR, it’s big!
    • kubernetes/kubernetes#102884
    • some updates are requested of vinay, for when he has time in the interim
    • deferring all things cgroupv2 until this merges, but need to reconcile with memory QoS by beta.
  • [marquiz] Class resources KEP
    • Has been in hibernation but want to take it out of draft and proceed
  • [wenwu449] kubernetes/kubernetes#106884
  • Were autogenerated live captions helpful? We tried that out today for the first time.
    • Sentiment in chat is that they were quite helpful, and could be turned off if they were not

February 22, 2022 [cancelled]

[dawnchen] The meeting is cancelled due to no agenda proposed. Thanks!

February 15, 2022

Total active pull requests: 154

| Incoming | Completed |
| --- | --- |
| Created: 16 | Closed: 21 |
| Updated: 101 | Merged: 17 |

February 8, 2022

Total active pull requests: 175 (+3 from last week)

| Incoming | Completed |
| --- | --- |
| Created: 21 | Closed: 9 |
| Updated: 52 | Merged: 10 |

February 1, 2022

Total active pull requests: 172 (+6 from last week)

| Incoming | Completed |
| --- | --- |
| Created: 22 | Closed: 3 |
| Updated: 49 | Merged: 11 |

Announcements:

January 25, 2022

Total active pull requests: 166 (+10 since last week)

| Incoming | Completed |
| --- | --- |
| Created: 21 | Closed: 3 |
| Updated: 52 | Merged: 11 |

January 18, 2022

Total active pull requests: 156 (-34 since the last meeting)

| Incoming | Completed |
| --- | --- |
| Created: 14 | Closed: 35 |
| Updated: 116 | Merged: 17 |

January 11, 2022

Total active pull requests: 190 (-21 since the last meeting)

| Incoming | Completed |
| --- | --- |
| Created: 13 | Closed: 13 |
| Updated: 176 | Merged: 26 |

  • Announcements

    • 1.24 schedule finalized
    • [ehashman] Proposed date for soft node freeze: Fri. Mar. 4, 2022
      • Applies to beta/deprecations
      • Alpha/GA features must have WIPs up
      • Action: ehashman to send email with announcement
    • kubernetes/kubernetes#104143 Welcome wzshiming@ (Shiming Zhang) as SIG Node reviewer!
  • [derek] conclusion on special resource operator proposal

  • [ehashman] 1.24 KEP prioritization

    • Action: Elana to send email requesting review/feedback and explaining the prioritization goals
  • [swsehgal/fromani][heads up] PodResource API watch support to be postponed to 1.25 release due to capacity constraints. We aim to narrow down the design in the 1.24 timeframe and target the implementation in the 1.25 timeframe.

  • [pacoxu] Quotas for Ephemeral Storage #1029: fixed a bug and am trying to add a metric for this feature. Not sure if it can be promoted to beta in 1.24 or 1.25. (I will update the KEP if it will most likely be promoted to beta in 1.24; if not, I may update it for 1.25 or later.)

    • [Derek] There were some feature gaps that may need to be addressed before moving to beta.
  • [vaibhav2107] Rotated container log file size not counted towards ephemeral-storage’s limit (kubernetes/kubernetes#107447)
    • Discussion on the issue
    • [ehashman] Hasn’t been triaged yet. Will be looked at tomorrow during the Node CI/Triage subproject meeting
  • [jackfrancis] What to do with ExecProbeTimeout during the 1.24 release cycle? (kubernetes/kubernetes#99854)
  • [vinaykul] In-Place Pod Vertical Scaling - plan for early 1.24 merge

    • PR kubernetes/kubernetes#102884
    • Pod resize E2E tests have been “weakened” for alpha - now passing CI
    • Alpha-blocker issues:
      • Review claims code in convertToAPIContainerStatus breaks non-mutating guarantees. - My Nov 10 response needs Elana’s followup
        • It is unclear to me what part of the code updates or mutates any state. Need a response/clarification.
      • Container hash excludes Resources with in-place-resize feature-gate enabled, toggling fg can restart containers - Fixed & Reviewed
        • This fix seems acceptable to both Elana & Lantao but Hash annotation naming needs to be more specific. Working on it.
    • NodeSwap issue: Not an alpha blocker (No CI test failures seen). Asked @ichbinblau or @cathyhongzhang to file tracking issues.
    • Other non-alpha-blocker issues:
      • I’m fixing various issues found in reviews of API, scheduler and kubelet.
      • I’ll file tracking GitHub issues for the remaining (7-10) issues/TODOs and assign them to people who have offered to help. They can be fixed after this PR is merged, most likely within the 1.24 timeframe.
  • [mweston & swsehgal] reminder to review the continued inline conversation on CPU management cases: https://docs.google.com/document/d/1U4jjRR7kw18Rllh-xpAaNTBcPsK5jl48ZAVo7KRqkJk/edit

January 4, 2022

Total active pull requests: 211 (+4 since the last meeting)

| Incoming | Completed |
| --- | --- |
| Created: 33 | Closed: 26 |
| Updated: 108 | Merged: 7 |

  • Announcements
    • Release dates for 1.24
      • Cycle start next week (Jan 10)
      • Tentative release date (Apr 19)
    • Saying hi to James Laverack, Release Lead
  • [mrunal] 1.24 planning
  • [ddebroy] KEP for pod sandbox creation conditions in pod status [kubernetes/enhancements#3087]
  • [vinaykul] In-Place Pod Vertical Scaling - plan for early 1.24 merge
    • PR kubernetes/kubernetes#102884
    • Pod resize E2E tests have been “weakened” for alpha.
      • Resize success verified at cgroup instead of pod status.
      • All 31 tests are passing now.
    • Alpha-blocker issues:
      • Container hash excludes Resources with in-place-resize feature-gate enabled, toggling fg can restart containers - Fixed
        • Please review this incremental change which addresses it.
        • [Lantao] Some customers may rely on the existing label implementation, even though it wasn’t intended for that use. Want to get feedback on this.
        • Alternative: use the same, current hash field but use it to store both hashes.
        • It might be clearer to write down both hashes separately.
        • [Elana] Some concerns about version skew of labels; if one kubelet is on one version and another is on a different one, they need to know how to use the labels correctly and not accidentally break each other.
        • [Derek] Will focus in and review. People should not assume guarantees for kubelet labels.
        • [Dawn] Adding an additional label reduces complexity because we don’t have to worry about internal versioning of the label scheme.
      • Reviewer claims code in convertToAPIContainerStatus breaks non-mutating guarantees. - My Nov 10 response needs Elana’s followup
        • It is unclear what part of the code updates or mutates any state. Need a response/clarification.
      • Multiple reviewers have felt that the NodeSwap issue is a blocking issue. But in the Dec 07 meeting, we felt this may not be an alpha blocker (no CI test failures seen after I weakened the resize E2E tests; all-alpha tests passed). However, we want to be sure. - Need Elana’s input.
        • Can we identify exact reasons why this would (or would not) be alpha blocker?
    • I plan to create issues to track other non-alpha-blocking review items and assign them to folks to fix after PR is merged. A few people have offered to contribute. With help, we should be able to nail most, if not all, of them in the upcoming release.
  • [mweston & swsehgal] Request for reviewers of CPU doc here:
    • [swsehgal] How do we make this more pluggable in the long run? Support more bespoke use cases