critest "runtime should support exec with tty=true and stdin=true" seems flaky on Windows Server 2022 #6652
Comments
I think it's possible to tell critest to skip individual tests, so it might be worth doing that if this issue is annoying developers. (Maybe I'm the only one who is fussy about getting my ✔ from CI?) I feel like this might also be related to the underlying cause of moby/moby#41479 (comment), seen in the Docker test suite while running up Docker on Windows with containerd. At the time, I thought I'd seen some issue in either containerd or hcsshim (or the connection between the two) around stdin/tty but couldn't find it then. So it's possible this isn't actually a containerd issue, but it's definitely affecting containerd.
Definitely sounds like the same case, and afaiu it is related to stdin. With a quick search on hcsshim I found this one and wonder whether those 3 retries are just not enough: microsoft/hcsshim@573c137
Haven't looked at this, just remembering CR/LF back on Windows/DOS... I'd be looking for any possible CR-to-LF conversions, and whether there were two line feeds because of that. Edited to add: or maybe we are just not inserting a needed CR at the end if it's not already there. For example, someone writes "hello" but without a line ending, so you get "hello", then we add LF but not CR LF.
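To make that hypothesis concrete, here's a tiny hedged sketch in Go (not code from containerd or hcsshim, just an illustration of "add the CR only when it isn't already there"):

```go
package main

import (
	"bytes"
	"fmt"
)

// normalizeCRLF illustrates the bare-LF idea above: strip any existing CR
// first so we don't double it up, then turn every lone LF into CRLF.
func normalizeCRLF(in []byte) []byte {
	out := bytes.ReplaceAll(in, []byte("\r\n"), []byte("\n"))
	return bytes.ReplaceAll(out, []byte("\n"), []byte("\r\n"))
}

func main() {
	fmt.Printf("%q\n", normalizeCRLF([]byte("hello\n")))   // "hello\r\n"
	fmt.Printf("%q\n", normalizeCRLF([]byte("hello\r\n"))) // unchanged
}
```

If the real output path only ever appends LF, a console (or test) expecting CRLF would see exactly the kind of mismatch being described.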
@dcantah - Have you seen any issues like this in testing?
We run critest internally as well and haven't seen this test (or any, for that matter) fail. Our tests internally are on AZ machines, so it's possible it may be a setup issue like @TBBle alludes to, but that's just pure guesswork on my part really. I don't think this patch would matter much unless containerd restarted during the test, but even then I don't think we handled stdin reconnect: microsoft/hcsshim@573c137. There were some flaky issues with Hyper-V containers and tty iirc, but I'd need to track down what I'm remembering here 😆
I added a little bit of debugging and I think I have confirmed that the problem is happening somewhere down in hcsshim or below. Logs can be fetched from https://github.com/TBBle/containerd/actions/runs/1964241067

Windows Server 2019 reported:
Windows Server 2022 reported:
The only thing I'm suspicious about is that in the success case, […]

The order of relevant spans in the success case is:
In the failure case, it's
I'm not sure if the ordering variation is a problem, or merely the natural result of whatever's actually causing the issue. Maybe somehow […]

Anyway, from at least that ordering, perhaps it's a race between […]

But yeah, I'm not super-familiar with this code, so I must be misreading it, or the […]

Interestingly, when I used […]

The logs for the section that's different

These are the (almost) full log lines for the dot-points above. I've trimmed timestamps so I could read them more easily.

Windows Server 2019
Windows Server 2022
Okay, with more debugging logs, the 'race condition' idea looks like a red herring. Success case:
Failure case:
So the success case looks like we actually get data and the stdout socket is closed by the process before we start waiting on it, while in the failure case we didn't get any data, and once we were waiting on the socket, nothing came through before it was closed. As noted above, on examination of the log timestamps, the orderings are identical, so the process at the other end of the stdout socket simply sent 0 bytes and closed its FD. So the problem must be lower still.
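To spell out that symptom outside containerd's own I/O plumbing, here's a minimal hedged Go sketch using net.Pipe: a peer that writes before closing and a peer that closes after zero bytes both end the copy cleanly without an error, but only the former leaves anything in the captured "stdout".

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net"
)

// capture simulates reading an exec's stdout from a stream whose peer may
// or may not write anything before closing its end.
func capture(write func(conn net.Conn)) string {
	client, server := net.Pipe()

	go func() {
		write(server)
		server.Close() // peer closes its end of the "stdout" stream
	}()

	var buf bytes.Buffer
	// io.Copy returns once the peer closes; a zero-byte-then-close peer
	// yields an empty buffer with no error, matching the empty output
	// seen in the failing critest runs.
	if _, err := io.Copy(&buf, client); err != nil {
		fmt.Println("copy error:", err)
	}
	return buf.String()
}

func main() {
	ok := capture(func(c net.Conn) { c.Write([]byte("hello\r\n")) })
	bad := capture(func(c net.Conn) {})
	fmt.Printf("success case: %q\nfailure case: %q\n", ok, bad)
}
```

Which is consistent with the conclusion that the missing bytes are being dropped (or never produced) somewhere below the copy loop.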
@jterry75 any possibility to find someone who has access to the Windows source code to investigate this one? It completely prevents Docker from starting to use containerd on Windows, which would eventually allow it to stop using the deprecated HCS v1: moby/moby#41455
@helsaawy This makes me think some of your work may have fixed/will fix this (also confused how we don't seem to be hitting this internally or on the periodic tests). Let me set up some VMs to try this on different builds in the meantime.
LOL. I completely missed your response, Danny. I'm not great at reading...
Okay, possibly totally out of left field, but is there any way we might see issues with this if there were two builds of the shim around, and we sometimes ran the wrong one? I noticed we build the shim twice in the GitHub Actions workflow (once through […]). The Windows Periodic flow also double-builds hcsshim, but the second build goes into a different directory compared to the GitHub Actions run. I put this fix up as #6661, and if that turns out to magically fix this, then... surprise!

Update: one run in three of that branch reproduced this failure, so that wasn't it.
Okay, peering back in after trying this quite extensively locally (over 200 iterations) with no failures. It seems the GitHub Actions machines are the minority here if things go well on a local VM and AZ machines, but what could possibly be different I'm not sure. Thinking back a bit though, I'd added bindings for the pseudo console API in Windows for use with Windows' form of privileged containers, and a test for this functionality was also exceedingly flaky on our GitHub CI (I'd say around 30-40% of the time) but ran fine locally: microsoft/hcsshim#1282. When you ask for a tty for a Windows Server Container, this same API is used. I'll probably need to shift the focus to trying to debug on the Actions machines, as I can't seem to get a repro locally, unless someone else is able to.
I think we'd just re-run the CI if it failed, so I don't have a lot of examples to show, but here's one from after check-in that shows the same symptom Paul had found, i.e. we don't seem to be getting any output: https://github.com/microsoft/hcsshim/runs/4598067706?check_suite_focus=true#step:4:527 This also only manifested on ws2022 afaict.
It might be worth trying to run up the GitHub Actions VM image for Server 2022, and see if the problem can reproduce there. Given that it seems to be tty-related, my guess is either something is odd in the VMs being created, or whatever is actually running the actions is somehow interfering with TTY operation. Thinking about it, my first suspicion is towards […]. Although we aren't seeing this on Windows Server 2019, which has conpty, I just discovered. Having a repro that doesn't involve containers at all is quite interesting, although it also probably doesn't involve bash, so that's another defenestrated idea of mine, if we assume that test is the same underlying issue.
I don't want to throw off the trail just based off that non-container pty test, as it might've been a fault of my own in the test, but it is a bit odd that they're both having issues on ws2022. On RS5 it looks like Windows Server containers didn't make use of the pseudo console API either, so there's another tidbit of info for us to play with. Does anyone think it'd be fruitful to skip this test on the ws2022 runs for now, so folks don't have to think they broke something while we investigate (Paul, it sounds like you were in favor of this, and I'd have to agree also 😆)? We'd taken a similar approach for flaky tests in the past until they were finally resolved.
Yeah, without more clarity on the problem than we have now, skipping the test is better overall. I don't suppose there's an env-var that can detect the GitHub Actions VM/runner specifically? The test is passing on the Windows Periodic run AFAIK, so if we can avoid skipping it there, then really we have lost nothing compared to where we were in February. ^_^ Looking at the docs, […]
Given the periodic tests are in their own yaml, couldn't we get away with just `-ginkgo.skip {regex}` on the run in ci.yml?
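For what that could look like, here's a hedged sketch of a critest invocation with the test skipped by name; the endpoint value is an illustrative assumption, not necessarily the exact one ci.yml uses:

```powershell
critest.exe --runtime-endpoint "npipe:////./pipe/containerd-containerd" `
  -ginkgo.skip "runtime should support exec with tty=true and stdin=true"
```

Since the periodic tests invoke critest from their own yaml, they could simply omit the skip flag and keep running the test there.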
Btw, moby/moby#41479 (comment), which was mentioned here earlier, was possible to repro locally. I tried now with the latest moby code, which uses containerd 1.6.1, and it was no longer clear whether I was hitting the same issue or a different one. Anyway, I tried to re-enable this test again https://github.com/moby/moby/blob/085c6a98d54720e70b28354ccec6da9b1b9e7fcf/integration/container/exec_test.go#L18-L85 and ran the moby CI script https://github.com/moby/moby/blob/master/hack/ci/windows.ps1 locally on my Azure VM, and that still gets stuck, and it happens on every run.
@olljanat I have not tried to repro the moby issue yet, as I was trying to see if I could get the critest test case to barf (which, to my dismay, didn't happen). Is there a simple repro/setup to follow there? They could very well be related. What host build were you getting the moby issue to appear on?
I assume this was referring to microsoft/hcsshim#1296, which I just came across.
Apologies for the belated response: ideally, we would close the upstream stdin for writing (or reading) when the process finishes, but before […]
Looks to be the same case with the Docker + containerd combination. The issue happens here https://github.com/moby/moby/blob/98d8343aa28de0d499464b5529e6b8ccc92e9313/daemon/exec.go#L205-L211 and disappears for that specific test case if I add […]
At least, I would like to understand what the right way is to handle that situation in Go, when the pipe the client is connected to closes.
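Not moby's actual fix, but the usual pattern for that in Go looks roughly like the hedged sketch below (copyStdin and the io.Pipe/io.Discard wiring are invented for illustration): treat the client closing its end as the normal end of the copy, then half-close the process's stdin via CloseWrite where the stream supports it, so the process sees EOF instead of blocking.

```go
package main

import (
	"errors"
	"io"
	"log"
)

// copyStdin pumps data from the client connection (src) into the exec'd
// process's stdin (dst). When the client closes its pipe, io.Copy returns
// (nil after a clean EOF, or something like io.ErrClosedPipe if we lose a
// race with the close); the important part is to propagate the close
// downstream so the process sees EOF on stdin rather than hanging.
func copyStdin(dst io.WriteCloser, src io.Reader) {
	_, err := io.Copy(dst, src)
	if err != nil && !errors.Is(err, io.ErrClosedPipe) {
		log.Printf("stdin copy ended: %v", err)
	}

	// Prefer a half-close so stdout/stderr can still drain; fall back to a
	// full Close if the writer doesn't support CloseWrite.
	type closeWriter interface{ CloseWrite() error }
	if cw, ok := dst.(closeWriter); ok {
		_ = cw.CloseWrite()
	} else {
		_ = dst.Close()
	}
}

// nopCloser adapts io.Discard for the demo; real code would pass the
// process's stdin stream here.
type nopCloser struct{ io.Writer }

func (nopCloser) Close() error { return nil }

func main() {
	r, w := io.Pipe()
	go func() {
		w.Write([]byte("exit\r\n"))
		w.Close() // client goes away
	}()
	copyStdin(nopCloser{io.Discard}, r)
}
```

Whether that is enough for the hung exec above depends on what the shim does with its end of the pipe, which is exactly the part still under investigation here.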
Description
After #6626 enabled critest on GitHub Actions, I've seen the specific test below fail a lot on the main branch, on Windows Server 2022. Looking at the main-branch CI builds, it feels like a one-in-two failure rate, and I felt the same way during development of #6626.

The log for the failing test
This doesn't seem to happen on the Windows Periodic Tests on the main branch, so I fear there's something in the system setup I've overlooked, but I couldn't see anything in the scripts run by the Windows Periodic Tests that leapt out as a difference.
Anyway, apologies to other developers who get undeservedly-❌ CI builds for code that should be ✔.
Steps to reproduce the issue
Describe the results you received and expected
Expected: GitHub Actions CI build passes consistently.
Actual: GitHub Actions CI build fails one-in-two on the Windows Server 2022 in the "cri-tools critest" stage.
What version of containerd are you using?
main
Any other relevant information
No response
Show configuration if it is related to CRI plugin.
No response