conformance-tests: fix flake in ExpectMirroredRequest #3708

Conversation

mhofstetter
Contributor

What type of PR is this?
/kind test
/kind flake
/area conformance

What this PR does / why we need it:

Currently the Conformance Tests for request mirroring are flaky because the mirrored request can sometimes not be found in the log of the Pods.

Test:       	TestConformance/HTTPRouteRequestMultipleMirrors/1_request_to_'/multi-mirror-and-modify-request-headers'_with_headers_should_go_to_infra-backend-v1
Messages:   	Couldn't find mirrored request in "gateway-conformance-infra/infra-backend-v3" logs

The issue is that the logic that fetches the Pod log (ExpectMirroredRequest) uses the current time (time.Now()) on every assertion-attempt as the sinceTime parameter.

As a consequence, the assertion can miss messages from the Pods that have been logged between the assertion attempts.

This commit fixes the issue by using the start time of the "check" when fetching the logs (the same way it's already handled in testMirroredRequestsDistribution).

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/test kind/flake Categorizes issue or PR as related to a flaky test. area/conformance-test Issues or PRs related to Conformance tests. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 25, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 25, 2025
@k8s-ci-robot
Contributor

Hi @mhofstetter. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Mar 25, 2025
@LiorLieberman
Member

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 25, 2025
@LiorLieberman
Member

/cc

Member

@LiorLieberman LiorLieberman left a comment

Thanks @mhofstetter! this looks like it is going to indeed fix it.

Did you test it with an implementation, and did it pass conformance?

@mhofstetter
Contributor Author

mhofstetter commented Mar 25, 2025

Thanks @mhofstetter! this looks like it is going to indeed fix it.

Did you test it with an implementation, and did it pass conformance?

👋 Hello @LiorLieberman

Yes, the conformance tests pass with this fix on Cilium, see PR cilium/cilium#38501.

-> GitHub action with the GW API conformance tests: https://github.com/cilium/cilium/actions/runs/14069099267

(Note: in addition to pulling in the fixes from this PR, that Cilium PR also temporarily disables the GW API feature HTTPRouteRequestPercentageMirror. This is related to another GW API conformance test flake that seems to have been introduced after the mirroring percentage tolerance was lowered from 15% to 5% with this commit in the GW API conformance tests. For more information, please see this issue in the Cilium repo. I will probably open another PR that reverts that change, if that is OK from your side.)

@LiorLieberman
Member

/ok-to-test

Thanks @mhofstetter.

LGTM for this PR.

the mirroring percentage tolerance has been lowered from 15% to 5% with this commit in the GW API conformance tests

Regarding this: note that it hasn't been lowered; it was merged with this value in the first place. Why do we need a 15% tolerance?

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 27, 2025
@LiorLieberman
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 28, 2025
Contributor

@howardjohn howardjohn left a comment

/lgtm

This fixes failures in Istio as well

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: howardjohn, LiorLieberman, mhofstetter


@howardjohn
Contributor

@LiorLieberman I think the hold can be cancelled?

@LiorLieberman
Member

yep
/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 28, 2025
@LiorLieberman
Member

/retest

@k8s-ci-robot k8s-ci-robot merged commit d923573 into kubernetes-sigs:main Mar 28, 2025
13 checks passed
@mhofstetter mhofstetter deleted the pr/mhofstetter/fix-expectmirroredrequest branch March 29, 2025 07:19
@howardjohn
Contributor

FYI, we are seeing flakes even with this on 1.3. Not sure if it's our fault though; I haven't had a chance to investigate.

@mhofstetter
Contributor Author

FYI, we are seeing flakes even with this on 1.3. Not sure if it's our fault though; I haven't had a chance to investigate.

@howardjohn For which feature? This PR should fix the flake that existed in HTTPRouteRequestMirror & HTTPRouteRequestMultipleMirrors (with failure message ~ Couldn't find mirrored request in "gateway-conformance-infra/infra-backend-v3" logs). At least we no longer see it in Cilium.

But the feature HTTPRouteRequestPercentageMirror seems to have another flake (with failure message ~ Traffic distribution test failed (5/5): Pod infra-backend-v2 did not meet the mirroring percentage within tolerance. Expected between 95.000000 and 105.000000, but got 94.000000). That's the issue that I briefly mentioned in my comment here. Is this the flake that you still see in Istio?

We assume that Envoy's implementation of the mirror percentage currently cannot meet the 5% tolerance that is hardcoded in the Gateway API conformance test. It uses the stream ID as the random value when deciding whether a "feature" (here, mirroring a request) is enabled. This was analyzed over here in the Cilium repo.

@howardjohn
Contributor

    httproute-request-percentage-mirror.go:195: 2025-04-01T18:47:17.088627044Z: Searching for the mirrored request log
    httproute-request-percentage-mirror.go:196: 2025-04-01T18:47:17.088669449Z: Reading "gateway-conformance-infra/infra-backend-v2" logs
    httproute-request-percentage-mirror.go:225: 2025-04-01T18:47:17.104374279Z: Pod: infra-backend-v2, Expected: 100.000000 (min: 95.000000, max: 105.000000), Actual: 117.000000
    httproute-request-percentage-mirror.go:168: Traffic distribution test failed (5/5): Pod infra-backend-v2 did not meet the mirroring percentage within tolerance. Expected between 95.000000 and 105.000000, but got 117.000000

is what we see. Full log example here: https://storage.googleapis.com/istio-prow/pr-logs/pull/istio_istio/55770/integ-pilot-istiodremote-mc_istio/1907140847116226560/build-log.txt.

(We also use envoy)

@mhofstetter
Contributor Author

@LiorLieberman ^^

the mirroring percentage tolerance has been lowered from 15% to 5% with this commit in the GW API conformance tests

Regarding this: note that it hasn't been lowered; it was merged with this value in the first place. Why do we need a 15% tolerance?

The discussion above shows that the 5% tolerance probably can't be met with Envoy's (current) implementation (this affects Cilium, Istio, and probably other Gateway API implementations as well).

Expected between 95.000000 and 105.000000, but got 117.000000

How should we proceed here? Would it be an option to (temporarily) increase the tolerance to 20%? (Although it isn't certain that even this can be guaranteed; I haven't checked for the biggest outlier 🥲)

cc @youngnick

@howardjohn
Contributor

Here is some more sample info from a single test run, btw:

7728:    httproute-request-percentage-mirror.go:168: Traffic distribution test failed (1/5): Pod infra-backend-v2 did not meet the mirroring percentage within tolerance. Expected between 95.000000 and 105.000000, but got 85.000000
8732:    httproute-request-percentage-mirror.go:168: Traffic distribution test failed (2/5): Pod infra-backend-v2 did not meet the mirroring percentage within tolerance. Expected between 95.000000 and 105.000000, but got 87.000000
9736:    httproute-request-percentage-mirror.go:168: Traffic distribution test failed (3/5): Pod infra-backend-v2 did not meet the mirroring percentage within tolerance. Expected between 95.000000 and 105.000000, but got 110.000000
10740:    httproute-request-percentage-mirror.go:168: Traffic distribution test failed (4/5): Pod infra-backend-v2 did not meet the mirroring percentage within tolerance. Expected between 95.000000 and 105.000000, but got 92.000000
11744:    httproute-request-percentage-mirror.go:168: Traffic distribution test failed (5/5): Pod infra-backend-v2 did not meet the mirroring percentage within tolerance. Expected between 95.000000 and 105.000000, but got 117.000000
12752:    httproute-request-percentage-mirror.go:168: Traffic distribution test failed (1/5): Pod infra-backend-v2 did not meet the mirroring percentage within tolerance. Expected between 237.500000 and 262.500000, but got 266.000000
13756:    httproute-request-percentage-mirror.go:168: Traffic distribution test failed (2/5): Pod infra-backend-v2 did not meet the mirroring percentage within tolerance. Expected between 237.500000 and 262.500000, but got 271.000000
15767:    httproute-request-percentage-mirror.go:168: Traffic distribution test failed (1/5): Pod infra-backend-v2 did not meet the mirroring percentage within tolerance. Expected between 166.250000 and 183.750000, but got 207.000000
16771:    httproute-request-percentage-mirror.go:168: Traffic distribution test failed (2/5): Pod infra-backend-v2 did not meet the mirroring percentage within tolerance. Expected between 166.250000 and 183.750000, but got 161.000000

@LiorLieberman
Member

@howardjohn does it fail for a specific percentage-mirroring test case? Istio had a problem with it in the previous release, which should have been fixed.

@howardjohn
Contributor

Not sure exactly what you are asking but it fails on TestGatewayConformance/HTTPRouteRequestPercentageMirror/0_request_to_'/percent-mirror'_should_go_to_infra-backend-v1

howardjohn added a commit to howardjohn/gateway-api that referenced this pull request Apr 3, 2025
Reverts part of kubernetes-sigs#3508

See kubernetes-sigs#3708; Envoy
impl cannot meet this fine grain threshold.
k8s-ci-robot pushed a commit that referenced this pull request Apr 3, 2025
Reverts part of #3508

See #3708; Envoy
impl cannot meet this fine grain threshold.