conformance-tests: fix flake in ExpectMirroredRequest #3708
Currently the conformance tests for request mirroring are flaky because the mirrored request sometimes cannot be found in the log of the Pod.

```
Test: TestConformance/HTTPRouteRequestMultipleMirrors/1_request_to_'/multi-mirror-and-modify-request-headers'_with_headers_should_go_to_infra-backend-v1
Messages: Couldn't find mirrored request in "gateway-conformance-infra/infra-backend-v3" logs
```

The issue is that the logic that fetches the Pod log (`ExpectMirroredRequest`) uses the current time (`time.Now()`) on every assertion attempt as the `sinceTime` parameter. As a consequence, the assertion can miss messages that the Pods logged between assertion attempts.

This commit fixes the issue by using the start time of the "check" when fetching the logs (the same way it's already handled in `testMirroredRequestsDistribution`).
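To illustrate the race described above, here is a minimal, self-contained sketch. It does not use the real conformance helpers or the Kubernetes client: `logEntry`, `findMirrored`, and `demo` are hypothetical names, and the "Pod log" is just an in-memory slice with a simulated clock. It only shows why recomputing `sinceTime` on each attempt can skip an entry that was logged between two attempts, while a fixed start time cannot.

```go
package main

import (
	"fmt"
	"time"
)

// logEntry is a hypothetical stand-in for one timestamped Pod log line.
type logEntry struct {
	at  time.Time
	msg string
}

// findMirrored simulates a retry loop. At each attempt time it looks for an
// entry that has already been written (at <= now) and that passes the
// sinceTime filter (at >= since). sinceFor chooses sinceTime per attempt.
func findMirrored(log []logEntry, attempts []time.Time, sinceFor func(now time.Time) time.Time) bool {
	for _, now := range attempts {
		since := sinceFor(now)
		for _, e := range log {
			visible := !e.at.After(now)     // already written at this attempt
			inWindow := !e.at.Before(since) // not filtered out by sinceTime
			if visible && inWindow {
				return true
			}
		}
	}
	return false
}

// demo returns the outcome of both strategies for the same simulated log.
func demo() (perAttemptNow, checkStart bool) {
	start := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	// The mirrored request is logged 1.5s after the check starts,
	// i.e. between the attempts at 1s and 2s.
	log := []logEntry{{at: start.Add(1500 * time.Millisecond), msg: "mirrored request"}}
	attempts := []time.Time{
		start.Add(1 * time.Second),
		start.Add(2 * time.Second),
		start.Add(3 * time.Second),
	}

	// Flaky variant: sinceTime is "now" on every attempt.
	perAttemptNow = findMirrored(log, attempts, func(now time.Time) time.Time { return now })
	// Fixed variant: sinceTime is the start of the check.
	checkStart = findMirrored(log, attempts, func(time.Time) time.Time { return start })
	return perAttemptNow, checkStart
}

func main() {
	flaky, fixed := demo()
	fmt.Println("sinceTime = now per attempt:", flaky) // false
	fmt.Println("sinceTime = start of check: ", fixed) // true
}
```

In the flaky variant the entry at 1.5s is not yet written at the 1s attempt and is already older than `sinceTime` at the 2s attempt, so it is never observed.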
Welcome @mhofstetter! |
Hi @mhofstetter. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/hold |
/cc |
Thanks @mhofstetter! This looks like it will indeed fix it.
Did you test it with an implementation, and did it pass conformance?
👋 Hello @LiorLieberman. Yes, conformance tests with this fix pass on Cilium with PR cilium/cilium#38501 -> GitHub action with the GW API conformance tests: https://github.com/cilium/cilium/actions/runs/14069099267 (Note: in addition to pulling in the fixes from this PR, that Cilium PR also temporarily disables the GW API feature |
/ok-to-test Thanks @mhofstetter. LGTM for this PR.
Regarding this - note that it hasn't been lowered, it was merged with this in the first place. Why do we need 15% tolerance? |
/lgtm |
/lgtm
This fixes failures in Istio as well
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: howardjohn, LiorLieberman, mhofstetter The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@LiorLieberman I think the hold can be cancelled? |
yep |
/retest |
FYI we are seeing flakes even with this on 1.3. Not sure if it's our fault though, haven't had a chance to investigate. |
@howardjohn For which feature? This PR should fix the flake that existed in But feature We assume that Envoy's implementation of the mirror percentage is currently not able to meet the tolerance of 5% that is hardcoded in the Gateway API conformance test. It uses the stream id as a random value when calculating whether a "feature" (mirroring for a request) is enabled or not. Analyzed over here in Cilium.
|
is what we see. Full log example here https://storage.googleapis.com/istio-prow/pr-logs/pull/istio_istio/55770/integ-pilot-istiodremote-mc_istio/1907140847116226560/build-log.txt. (We also use envoy) |
Above discussion shows that the 5% tolerance probably can't be met with Envoy's (current) implementation (Cilium, Istio - and probably also other Gateway API implementations).
How should we proceed here? Would it be an option to (temporarily) increase the tolerance to 20%? (Even that might not be guaranteed; I haven't checked for the biggest outlier 🥲) cc @youngnick |
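For readers following the tolerance discussion, here is a rough sketch of what such a check could look like. It assumes the tolerance is an absolute difference in percentage points between the expected and the observed mirror rate; the actual conformance test may compute it differently. `withinTolerance` is a hypothetical helper, and the request counts are made-up numbers for illustration, not measurements from any run.

```go
package main

import (
	"fmt"
	"math"
)

// withinTolerance reports whether the observed mirror rate (mirrored/total,
// in percent) is within `tolerance` percentage points of expectedPercent.
// Assumption: tolerance is absolute, not relative to the expected value.
func withinTolerance(mirrored, total int, expectedPercent, tolerance float64) bool {
	if total == 0 {
		return false // nothing was sent, so nothing can be verified
	}
	observed := float64(mirrored) / float64(total) * 100
	return math.Abs(observed-expectedPercent) <= tolerance
}

func main() {
	// Illustrative only: 130 of 500 requests mirrored (26%) with 20% expected.
	fmt.Println(withinTolerance(130, 500, 20, 5))  // false: off by about 6 points
	fmt.Println(withinTolerance(130, 500, 20, 20)) // true
}
```

With a small number of requests per run, a sampling scheme that is not uniformly random can easily land several points away from the configured percentage, which is why a tight bound is hard to guarantee.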
Here is some more sample info from a single test run btw:
|
@howardjohn does it fail for a specific percentage-mirroring test case? Istio had a problem with in the previous release which should have been fixed. |
Not sure exactly what you are asking but it fails on TestGatewayConformance/HTTPRouteRequestPercentageMirror/0_request_to_'/percent-mirror'_should_go_to_infra-backend-v1 |
Reverts part of kubernetes-sigs#3508 See kubernetes-sigs#3708; Envoy impl cannot meet this fine grain threshold.
What type of PR is this?
/kind test
/kind flake
/area conformance
What this PR does / why we need it:
Currently the Conformance Tests for request mirroring are flaky because the mirrored request can sometimes not be found in the log of the Pods.
The issue is that the logic that fetches the Pod log (`ExpectMirroredRequest`) uses the current time (`time.Now()`) on every assertion-attempt as the `sinceTime` parameter. As a consequence, the assertion can miss messages from the Pods that have been logged between the assertion attempts.
This commit fixes the issue by using the start time of the "check" when fetching the logs (the same way it's already handled in `testMirroredRequestsDistribution`).
Does this PR introduce a user-facing change?: