CI: Do not fail fast. #13120
Conversation
✅ Deploy Preview for kubernetes-ingress-nginx canceled.
/triage accepted
/cherry-pick release-1.12
@Gacko: once the present PR merges, I will cherry-pick it on top of release-1.12. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/cherry-pick release-1.11
@Gacko: once the present PR merges, I will cherry-pick it on top of release-1.11. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@Gacko the reason for it is that the tests may take a very long time to run even when they fail, and they consume resources that may at some point be exhausted (IIRC there's a limit on GitHub Actions). What you could probably do instead is make fail-fast a flag, and in the GitHub Actions CI allow a label or some other mechanism to set this flag for individual runs. EDIT: I was thinking this is also related to the fail-fast behavior of the e2e Ginkgo tests, but it is still relevant to why we usually fail fast.
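The "make fail-fast a flag" suggestion could be sketched roughly as follows, with a `workflow_dispatch` input toggling the matrix behavior. The input, job name, and matrix values here are hypothetical illustrations, not the actual workflow contents:

```yaml
on:
  workflow_dispatch:
    inputs:
      fail-fast:
        description: Cancel all matrix jobs as soon as one fails
        type: boolean
        default: true

jobs:
  e2e:
    runs-on: ubuntu-latest
    strategy:
      # Honor the dispatch input; for other triggers the input is
      # unset, so the expression falls back to true (fail fast).
      fail-fast: ${{ inputs.fail-fast != false }}
      matrix:
        k8s: [v1.30, v1.31, v1.32]  # illustrative versions
```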
E2E tests have mostly been failing due to flakes in the ExternalName service tests recently. So the chances that 4 out of 5 runs (one per Kubernetes version) complete in the end are high. If we make them fail fast, we always need to re-run all of them, even if 4 out of 5 would have completed.
This is what I was also talking about. |
Sure, I'm aware of that. This is making a single E2E run fail fast, and that absolutely makes sense. But we are spinning up 5 E2E runs per variation at the moment, and sometimes the nip.io backend we are using for ExternalName services seems not to reply in time; at least DNS requests are timing out. Normally one E2E run takes around 45 minutes. If one of them fails at 40 minutes, we kill all 5 (one per Kubernetes version we support), even though the other 4 could have completed successfully. With the current behavior you always need to re-run all 5. With my change you can wait until the other 4 complete successfully and only re-trigger the one that failed. So without my change, 200 minutes of GitHub Actions time are wasted; with my change it's only 40 minutes.
OK, makes sense. I am leaving the lgtm and the hold, and you can unhold as you wish. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Gacko, rikatz. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@Gacko: new pull request created: #13130 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@Gacko: new pull request created: #13131 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What this PR does / why we need it:
Currently E2E tests for all Kubernetes versions get canceled as soon as E2E tests for one of them fails. Therefore one always needs to re-run 5 jobs instead of only one.
I know we should rather fix the flakes themselves, but this change is also particularly useful while doing so.
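For context, GitHub Actions by default cancels all in-progress jobs of a matrix as soon as one job fails. The change described above roughly corresponds to disabling that behavior in the e2e matrix; the job name, matrix values, and steps below are illustrative, not the actual workflow contents:

```yaml
jobs:
  e2e:
    runs-on: ubuntu-latest
    strategy:
      # Default is true: one failing matrix job cancels all others.
      # Setting it to false lets the remaining Kubernetes versions
      # finish, so only the failed job needs to be re-run.
      fail-fast: false
      matrix:
        k8s: [v1.28, v1.29, v1.30, v1.31, v1.32]  # illustrative versions
    steps:
      - uses: actions/checkout@v4
      - run: make e2e-test  # placeholder for the actual e2e invocation
```

With `fail-fast: false`, a flaky failure in one Kubernetes version no longer discards the runtime already spent on the other four.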
Types of changes
Checklist: