
ci-subgroup-notes-2021.md

Kubernetes SIG-Node CI subgroup notes

12/29/2021

Cancelled - year-end holiday break

12/22/2021

  • Device plugin: kubernetes/test-infra#24557
    • Was put on hold until the failure could be reproduced. It can now be reproduced locally, so the tests can be skipped for now
  • Memory (kubelet) down again:
    ![][image12]

12/15/2021

12/09/2021

  • [ehashman] Dockershim removal - cleanup kubernetes/test-infra#24592
    • PR: kubernetes/test-infra#24595
    • There are some tests with special configs (e.g. CPU manager, hugepages) that I don’t want to migrate as part of this PR; they are failing anyway, so there is no point in running them
      • Ideally, since these are serial, would like to see all tests using the same config moved under a single job
    • Presubmits need to be done separately
    • Will file issues for the split-out work (presubmits)
  • [ruiwen-zhao] kubernetes/kubernetes#106895
    • Summary API test flaky on kubelet-gce-e2e-swap-ubuntu since the beginning of test job history
    • Same test passes on -fedora testgrid
  • Agreed: cancelling Thursday alternate time meeting due to lack of attendance.
    • We will only meet Wednesdays from now on.
    • #6285

12/01/2021

11/24/2021

Perf (memory chart) - let’s see if trend continues next week
![][image13]

11/17/2021

11/11/2021

  • kubernetes/kubernetes#106204
    • Release blocker?
  • [danielle] kubernetes/kubernetes#106348
    • Let’s drop the dependency on GPUs and reduce some maintenance for ourselves. Also, apparently the device tests were broken in the same way the GPU tests originally were (but we missed it because of [Flaky])
  • [aditi] kubernetes/kubernetes#106252
    • Thoughts on adding credential provider to node e2e?
  • Status of node serial
    • Let’s make it green by removing the flakes

10/26/2021

10/20/2021

  • [fromani][could be postponed if not enough time] some tests have implicit dependencies on the node state
    • Memory manager tests have implicit dep on memory fragmentation (or lack thereof)
    • ContainerRuntimeRestart - fails for timeout, but we WANT to saturate the node with pods (that’s the whole point of the test!)
    • How do we keep these tests while keeping them reliable (no false negatives)?
    • Separate lanes seem a bit excessive, any other idea?
    • Wait for the PR to reduce amount of allocated hugepages under the test
      • If the PR fixes the flakes, remove the separate lane; otherwise remove the test from the serial lane, with notes on why it needs a separate lane
  • [imran] Lock Contention Tests: Updated the job with `NodeSpecialFeature:LockContention`; all tests are passing and CI is green.
    Updated the test-infra PR to include a skip value with `--skip="\[Flaky\]|\[Serial\]"`
    It just needs to be merged before proceeding with the next set of steps.
  • [mmiranda96] Adding new alpha features to jobs e2e-gce-alpha-features (kubernetes/test-infra#23642)
  • [jlebon] Allow running e2e tests on non-GCE nodes kubernetes/kubernetes#105764
    • Danielle will take a look
      • Fromani will have a look as well
    • What about cloud-init?
    • Cloud ignition is an alternative - needs to be adapted
    • Have a way to prepare all the binaries upfront?
  • [mmiranda96] Updating job ci-kubernetes-node-kubelet-eviction to use swap (for kubernetes/kubernetes#105023 (comment))
  • [ehashman] attendance for alt time
    • Let’s hold another one in November and then, if attendance is poor again, cancel
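
The `--skip` value in the lock contention item above is a regular expression matched against test names. A minimal sketch of how such a pattern filters a list of tests, using Python's `re` module purely for illustration (the actual e2e runner is Go-based, and the test names below are made up):

```python
import re

# Pattern copied from the notes; the test names below are hypothetical.
SKIP = re.compile(r"\[Flaky\]|\[Serial\]")


def selected(test_names, skip=SKIP):
    """Return the test names that do NOT match the skip pattern."""
    return [name for name in test_names if not skip.search(name)]


names = [
    "Lock contention [NodeSpecialFeature:LockContention] should detect a held lock",
    "Eviction [Serial] should evict pods under memory pressure",
    "Summary API [Flaky] should report resource usage",
]

for name in selected(names):
    print(name)
```

The brackets are escaped because an unescaped `[Flaky]` would be read as a character class rather than a literal tag.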

10/14/2021

10/06/2021

  • [bobbypage] Kubetest2 migration plan
    • Amit @amwat will provide a bit of context on kubetest2 migration plans and how it relates to node e2e testing

    • ref: kubernetes/enhancements#2464

      The very first layer of a CI job is the image for the Prow job, which already has many tools on it.

- All of these tools are deprecated and in maintenance mode. They evolved from bash scripts and became unmaintainable.
- kubetest2 is designed to be extensible and will replace them.
- Pluggable on where to test (GCE, AWS, Kind, etc.)
- Pluggable on what to test
Thousands of jobs using old tools. The process of switching all the jobs will be slow. Some of the jobs will be moved to kubetest2. Presubmits and release blocking jobs are the first target.

Mainly: awareness of the project. Feature requests must go to kubetest2 now.

The most significant change is the node tester. kubetest2 will use the Makefile as the source of truth. make test_e2e lets you run tests, but kubetest does not use it; kubetest2 will change this and use only the Makefile. Contributors should then need to deal with test infra mostly only when bugs are encountered.

Danielle: some tests need more features
Amwat: yes, they can be added in the node tester

	Timeline?  

Amwat: scoped to presubmits and release-blocking jobs; 1.24 is the target version. At the least, those jobs will start running by then. No timeline for other jobs.

09/29/2021

09/22/2021

  • [mmiranda96] Created kubernetes/test-infra#23642 for keeping track of alpha feature jobs with non-alpha features.
    • Update list of features manually to make progress on the alpha job cleanup.
    • Work on updating tags for jobs
  • [ehashman] status of kubernetes/k8s.io#956 ?
    • GCP accounts for node contributors - need a list of use cases
    • Dani to drive when she returns from PTO?
  • [Sergey] memory spike kubernetes/kubernetes#105053

09/15/2021

09/09/2021

09/01/2021

08/25/2021 Cancelled due to host unavailability

08/18/2021

08/12/2021

08/04/2021

07/28/2021

07/21/2021

  • [haircommander] Adding presubmit/release-blocking CRI-O jobs

  • [fromani] using device plugins in the e2e suites running on CI

    • The k8s e2e test suite has a fair number of tests which need a device plugin, because they exercise the device manager, directly or indirectly.
      We mostly use SRIOV devices, because they are the cheapest and easiest supported devices to get, which is why we wrote the tests in k8s to consume them.
      But we don't have device plugin support on CI. We do have GPU-enabled machines, but they are a subset and should be used sparingly (e.g. not every PR should use them. Or can we just use GPUs every time? I expect not, for cost reasons, but worth mentioning).
      So today a large number of tests just skip on CI. This is especially evident in the serial lane and in the resource management area.
      At RH we actually have machines with SRIOV devices which run the e2e test suite, but this is of course suboptimal for a number of reasons; a much better state for everyone would be to actually have some device plugins in u/s CI.
      There are some options we can discuss as a community:
      • 1. use the sample plugin
      • 2. fake SRIOV devices (I can elaborate on this if there is interest)
      • 3. just use GPUs?
      • 4. just bump the spec of the CI machines to have SRIOV devices?
  • [mkmir] E2E sysctl test is marked as conformance. However, it does not respect some of the requirements:

    • it tests only GA, non-optional features or APIs (uses feature sysctl)
    • it works for all providers (doesn’t work for Windows and other non-sysctl OS)
    • kubernetes/kubernetes#101190
  • [ehashman] 1.22 burndown (includes some of the topics above)

  • [SergeyKanzhelev] 1.21 vs 1.20 perf degradation: kubernetes/kubernetes#101989

07/14/2021

  • [Sergey] NodeConformance writeup https://docs.google.com/document/d/1ezJPfItuhZvwyP_RtiWTNcjCM3gi94vu1nw6uVNHKgM/edit?usp=sharing
    • NodeConformance historically tried to be two things:
      • A set of e2e tests that you just needed a single node to run
      • Conformance-like test for nodes
    • [ehashman] Suggestion: we get rid of “NodeConformance” because the name is confusing
      • For CRI conformance-like tests, label them CRIValidation -- NodeAgnostic?
      • For the set of e2e tests you just need a single node to run, let’s come up with a new name -- needs a name (anything in test/e2e_node) (SingleNodeTest or KubeletLocal)
      • Note: some tests may overlap between both sets
    • Action: add a plan to cover splitting the use cases for current tests
    • Action: send out NodeConformance plan, soliciting feedback
  • [ehashman] 1.22 burndown
  • [fromani][status update][serial lane] looks like kubernetes/kubernetes#100145 eventually broke, will look into it ASAP

07/08/2021

06/30/2021

  • [fromani] update only, no agenda item:
  • Bug scrub follow-up
    • Added some new bugs to the board after having scrubbed bugs, including some issues for adding test coverage
    • Issues are now in a more manageable state, but we have so many
    • Suggestion: bug board + everything else board will be optimal, need to figure out the columns (triage/waiting/accepted/in progress/done?)
    • Hopefully moving forward we can do regular incoming issue triage as part of these meetings
  • Bot to help with automation for boards?
    • GitHub still doesn’t have support
    • Contribex is working on it
  • NodeConformance status
    • Assigned to Sergey, worked on bug scrub so hasn’t had a chance to look since last meeting
  • NodeFeature status?
    • mmiranda has submitted PR: kubernetes/test-infra#22677
    • Starting with duplicating selectors in test-infra, then we can start making test changes
  • [Sergey] Soak tests
    • kubernetes/kubernetes#64523
    • Is there logic we can reuse to automatically detect this?
    • Not afaik - usually determined by querying debugging endpoint and looking at the memory dumps
  • Resource utilization regressions now being tracked in perf-tests: kubernetes/perf-tests#1789
  • How to find reviewers for various PRs?
    • Action: Swati to add item to next week’s SIG Node meeting to discuss with wider SIG
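
On the soak test question above (whether memory growth could be detected automatically): the notes record that no reusable logic was known at the time. As one hypothetical starting point, and not something the group adopted, a least-squares slope over periodic memory samples can flag a steady upward trend. The function names, units, and threshold below are all assumptions.

```python
def slope(samples):
    """Least-squares slope of (time, value) pairs, in value units per time unit."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den


def looks_like_leak(samples, threshold):
    """Hypothetical heuristic: flag growth faster than `threshold` per time unit."""
    return slope(samples) > threshold


# Made-up samples: (hours since start, memory in MiB).
growing = [(0, 100), (1, 110), (2, 120)]
flat = [(0, 100), (1, 100), (2, 100)]
print(looks_like_leak(growing, threshold=5))
print(looks_like_leak(flat, threshold=5))
```

A trend check like this only says that memory is growing; finding the cause would still require the debugging endpoint and memory dumps mentioned above.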

06/23/2021

Attendees:
![][image14]

06/16/2021

Attendees:
![][image15]

06/10/2021

Attendees:

  • [Sergey] I cannot join this time, but this is a one-off conflict

Agenda:

06/02/2021

Agenda:

05/25/2021

Agenda:

05/19/2021

Attendees:
![][image16]

Agenda:

05/12/2021

Agenda:

05/04/2021 Cancelled for KubeCon

04/28/2021 Short sync up

Attendees:
![][image17]

Agenda:

  • Follow ups - need to move to the next week:
  • [fromani] Managing kubelet state in e2e tests: overriding the system default (/var/lib/kubelet)
    • Shared global state between tests
    • Any objections to move away from it?
    • Storing state of the kubelet
    • Ideally each e2e test has its own state
  • [ehashman] looking for volunteers for kubernetes/perf-tests#1789
    • Currently don’t have any perf/scalability tests for upstream kubelet resource utilization (CPU/memory)
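
For the kubelet state item above (each e2e test having its own state instead of sharing /var/lib/kubelet), one way to picture the idea is a throwaway directory per test that is passed to the kubelet through its `--root-dir` flag. This is a minimal sketch, not how the e2e framework does it; the helper names are hypothetical.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_kubelet_state(test_name):
    """Yield a fresh state directory that is deleted when the test finishes."""
    with tempfile.TemporaryDirectory(prefix=f"kubelet-{test_name}-") as root:
        yield Path(root)


def kubelet_args(root_dir):
    """Hypothetical helper: point the kubelet at a per-test state directory."""
    return [f"--root-dir={root_dir}"]


with isolated_kubelet_state("cpu-manager") as a, \
        isolated_kubelet_state("memory-manager") as b:
    print(kubelet_args(a))
    print(kubelet_args(b))
```

Because each test gets a different directory, state files written by one test cannot leak into the next.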

04/21/2021

Attendees:

  • Sergey
  • Elana
  • Artyom

Agenda:

  • triage

04/14/2021

Attendees:

  • fromani
  • ehashman
  • Sergey Kanzhelev
  • David Porter
  • Amim Knabben
  • Qiutong Song
  • Madhav Jivrajani
  • Jiaming Xu

Agenda:

04/07/2021

Attendees:

  • Elana Hashman
  • Sergey Kanzhelev
  • Alukiano
  • Tao
  • Qiutong

Agenda:

  • Artem will start looking at fake NUMA flag.

03/30/2021 Cancelled

03/24/2021

Agenda

https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+milestone%3Av1.21+label%3Asig%2Fnode

Discussing kubernetes/kubernetes#99336:

  • Overall feeling is that it’s too late for unknown unknowns introduced by this PR.
  • 1.18 cherry-picking is not a Node team problem; it is more of a release team problem. The release team may need an exception.

03/17/2021

Attendees

Agenda

The long-term direction is not to test the whole matrix on presubmits, but to have a good signal with failures that are easy for contributors to investigate. Maybe the PR needs to be replaced with a single job with both runtimes.

Action: determine a long-term plan to merge all node presubmits into one job, using a name that doesn’t reveal the underlying runtimes. (e.g. cleanup ubuntu-containerd* tests)
Timeframe: dependent on a presubmit policy, maybe will happen next release cycle (1.22)

03/10/2021

Attendees

  • ehashman
  • fromanirh
  • swsehgal

Agenda

03/03/2021

Attendees

Agenda

02/24/2021

Attendees (7 on call):
![][image18]

  1. Containerd plan kubernetes/test-infra#18570
  2. Questions about kubernetes/test-infra#20937 and node-kubelet-serial tests (https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-serial)
  3. Announcement: node n-2 version skew tests to be discussed at SIG Arch tomorrow: https://groups.google.com/g/kubernetes-sig-architecture/c/QX-4qq2krm0/m/998T3cJUBQAJ

Product triage: https://github.com/orgs/kubernetes/projects/49

  • Feature PRs missing from board that happen to have sig/testing label
  • Action: needs-rebase isn’t auto-applied, bot needs to be pestered. File issue to proactively apply without resetting stale counter

02/17/2021

Product triage: https://github.com/orgs/kubernetes/projects/49

02/08/2021

[Sergey] New time for the meeting? It looks like 10 AM Mon is very inconvenient. Is Monday 9AM better?

https://doodle.com/poll/ii5vyde6wpp3migm?utm_campaign=poll_update_participant_admin&utm_medium=email&utm_source=poll_transactional&utm_content=gotopoll-cta#table

![][image19]

Triage: https://github.com/orgs/kubernetes/projects/43

  • Suggest MHBauer to approver. Elana to reach out

02/01/2021 [skipping]

01/25/2021

Agenda:
[Aditi]

[Sergey] Do we need this:
kubernetes/k8s.io#956 (comment)?

  • Need for deflaking tests. “Pause” job and SSH access
  • Sometimes it is hard to understand what failed and why PR validation failed

[knabben]

[ehashman]

[Meeting time]

  • Action: Sergey to send doodle for maybe moving this meeting? Mondays frequently conflict.

01/11/2021

Attendees (7 on the call):

Agenda:

  • triage

01/04/2021

Attendees:

Agenda:

[knabben]

[Sergey] Triage

Volunteers to help with this effort

Victor Pickard (Red Hat), nick=vpickard, [email protected]
Jay Pipes (AWS), nick=jaypipes, gh=jaypipes, [email protected]
Balaji (AWS), nick=srisaranbalaji, gh=SaranBalaji90, [email protected]
Morgan Bauer (IBM), slack=mhb, gh=mhbauer, mail=[email protected]
Ning Liao (Google), nick=nliao, mail=[email protected]
David Porter (Google), nick=davidporter; mail=[email protected]
Hanfei Lin (Google), nick=hanfeil; mail=[email protected]
Hugo Huang (Google), nick=tangent; mail=[email protected]
Roy Yang(Google), nick=roy; mail=[email protected]
Aaron Crickenberger (Google), nick=spiffxp, [email protected]
nick=Archer
Ed Bartosh (Intel), slack=Ed, github=bart0sh [email protected]
Daniel Mangum (upbound.io), nick=hasheddan
Chirag Tayal (PayPal) nick=ctayal, [email protected]
Zhi Feng(Airbnb), nick=Zhi, [email protected]
Dims, nick=dims, [email protected]
Jacob Blain Christen (Rancher Labs), nick=dweomer; mail=[email protected]
Artyom Lukianov(Red Hat), nick(github)=cynepco3hahue,nick(slack)=alukiano,mail=[email protected]
Swati Sehgal (Red Hat), slack=swsehgal; mail [email protected]
Jorge Alarcon, nick=alejandrox1, [email protected]
Sascha Grunert (SUSE), nick=sascha, [email protected]
Srini Brahmaroutu(IBM), slack=srbrahma, gh=brahmaroutu, mail=[email protected]
Tim Pepper (VMware), slack=tpepper, gh=tpepper, mail=[email protected]
John Taylor (IBM), mail=[email protected]
Francesco Romani (Red Hat), nick=fromani; mail=[email protected]
Karan Goel (Google), nick=karan; mail=[email protected]
Sergey Kanzhelev (Google), nick=SergeyKanzhelev; mail=[email protected]
Mike Carlise (Salesforce), nick=micarlise, mail=[email protected]
Matt Merkes (AWS), nick=merkes, mail=[email protected]
Amim Knabben (Loadsmart), nick=knabben, mail=[email protected]
Swati Sehgal (Red Hat), nick(slack)=swsehgal, nick(github)= swatisehgal, mail = [email protected]
Harshal Patil (Red Hat), slack=Harshal, gh=harche, mail=[email protected]
Elana Hashman (Red Hat), nick=ehashman, mail=[email protected]
Paco Xu(DaoCloud), nick=paco,mail=[email protected]