critest "runtime should support exec with tty=true and stdin=true" seems flaky on Windows Server 2022 #6652

Open
TBBle opened this issue Mar 9, 2022 · 24 comments

@TBBle
Contributor

TBBle commented Mar 9, 2022

Description

After #6626 enabled critest on GitHub Actions, I've seen the specific test below fail a lot on the main branch on Windows Server 2022. Looking at the main branch CI builds, it feels like a one-in-two failure rate, and I felt the same way during development of #6626.

The log for the failing test
[k8s.io] Streaming runtime should support streaming interfaces 
  runtime should support exec with tty=true and stdin=true [Conformance]
  github.com/kubernetes-sigs/cri-tools/pkg/validate/streaming.go:87
[BeforeEach] [k8s.io] Streaming
  github.com/kubernetes-sigs/cri-tools/pkg/framework/framework.go:50
[BeforeEach] [k8s.io] Streaming
  github.com/kubernetes-sigs/cri-tools/pkg/validate/streaming.go:50
[It] runtime should support exec with tty=true and stdin=true [Conformance]
  github.com/kubernetes-sigs/cri-tools/pkg/validate/streaming.go:87
STEP: create a default container
STEP: Get image status for image: k8s.gcr.io/e2e-test-images/busybox:1.29-2
STEP: Create container.
Mar  8 19:03:47.734: INFO: Created container "1eff23091f6da906f5dc6a8ad8582e4233566cf8e1ed4837daf0eb4df1a34452"

STEP: start container
STEP: Start container for containerID: 1eff23091f6da906f5dc6a8ad8582e4233566cf8e1ed4837daf0eb4df1a34452
Mar  8 19:03:48.944: INFO: Started container "1eff23091f6da906f5dc6a8ad8582e4233566cf8e1ed4837daf0eb4df1a34452"

STEP: exec given command in container: 1eff23091f6da906f5dc6a8ad8582e4233566cf8e1ed4837daf0eb4df1a34452
Mar  8 19:03:48.946: INFO: Get exec url: http://127.0.0.1:61319/exec/e8YUFeAD
STEP: check the output of exec
Mar  8 19:03:48.947: INFO: Parse url "http://127.0.0.1:61319/exec/e8YUFeAD" succeed
[AfterEach] runtime should support streaming interfaces
  github.com/kubernetes-sigs/cri-tools/pkg/validate/streaming.go:59
STEP: stop PodSandbox
STEP: delete PodSandbox
[AfterEach] [k8s.io] Streaming
  github.com/kubernetes-sigs/cri-tools/pkg/framework/framework.go:51

+ Failure [7.094 seconds]
[k8s.io] Streaming
github.com/kubernetes-sigs/cri-tools/pkg/framework/framework.go:72
  runtime should support streaming interfaces
  github.com/kubernetes-sigs/cri-tools/pkg/validate/streaming.go:55
    runtime should support exec with tty=true and stdin=true [Conformance] [It]
    github.com/kubernetes-sigs/cri-tools/pkg/validate/streaming.go:87

    The stdout of exec should contain hello
    Expected
        <string>: 
    to contain substring
        <string>: hello

    github.com/kubernetes-sigs/cri-tools/pkg/validate/streaming.go:203
------------------------------

This doesn't seem to happen on the Windows Periodic Tests on main branch, so I fear there's something in the system setup I've overlooked, but I couldn't see anything in the scripts run by the Windows Periodic Tests that leapt out as a difference.

Anyway, apologies to other developers who get undeservedly-❌ CI builds for code that should be ✔.

Steps to reproduce the issue

  1. Submit a PR

Describe the results you received and expected

Expected: GitHub Actions CI build passes consistently.
Actual: GitHub Actions CI build fails roughly one run in two on Windows Server 2022, in the "cri-tools critest" stage.

What version of containerd are you using?

main

Any other relevant information

No response

Show configuration if it is related to CRI plugin.

No response

@TBBle TBBle added the kind/bug label Mar 9, 2022
@TBBle
Contributor Author

TBBle commented Mar 9, 2022

I think it's possible to tell critest to skip individual tests, so it might be worth doing that if this issue is annoying developers. (Maybe I'm the only one who is fussy about getting my ✔ from CI?)

I feel like this might also be related to the underlying cause of moby/moby#41479 (comment), seen in the Docker test suite while running up Docker on Windows with containerd. At the time, I thought I'd seen some issue in either containerd or hcsshim (or the connection between the two) around stdin/tty but couldn't find it then.

So it's possible this isn't actually a containerd issue, but it's definitely affecting containerd.

@olljanat
Contributor

olljanat commented Mar 9, 2022

I feel like this might also be related to the underlying cause of moby/moby#41479 (comment), seen in the Docker test suite while running up Docker on Windows with containerd. At the time, I thought I'd seen some issue in either containerd or hcsshim (or the connection between the two) around stdin/tty but couldn't find it then.

Definitely sounds like the same case, and AFAIU it is related to stdin. With a quick search on hcsshim I found this one, and I wonder whether those 3 retries are just not enough: microsoft/hcsshim@573c137

@mikebrow
Member

mikebrow commented Mar 9, 2022

Haven't looked at this, just remembering CR LF back on Windows/DOS... I'd be looking for any possible CR conversions to LF, and whether there were two line feeds because of that.

Edited to add: or maybe we are just not inserting a needed CR at the end if it's not already there.

For example, someone writes "hello" but without a newline, so you get hello... then we add LF but not CR LF.

@mikebrow mikebrow closed this as completed Mar 9, 2022
@mikebrow mikebrow reopened this Mar 9, 2022
@jterry75
Contributor

jterry75 commented Mar 9, 2022

@dcantah - Have you seen any issues like this in testing?

@dcantah
Member

dcantah commented Mar 9, 2022

We run critest internally as well and haven't seen this test (or any, for that matter) fail. Our tests internally are on Azure machines, so it's possible it may be a setup issue like @TBBle alludes to, but that's just pure guesswork on my part. I don't think this patch would matter much unless containerd restarted during the test, but even then I don't think we handled stdin reconnect: microsoft/hcsshim@573c137. There WERE some flaky issues with Hyper-V containers and tty IIRC, but I'd need to track down what I'm remembering here 😆

@TBBle
Contributor Author

TBBle commented Mar 10, 2022

I added a little bit of debugging and I think I have confirmed that the problem is happening somewhere down in hcsshim or below.

Logs can be fetched from https://github.com/TBBle/containerd/actions/runs/1964241067

Windows Server 2019 reported

time="2022-03-10T16:29:38.114836400Z" level=warning msg="TBBle: relayIO Cmd stdout, 231 bytes" eid=aa4d95b78867d222773b7b1ce4805364f01b9025106961116ba973eebc971d9f pid=3732 spanID=e52f533fba795f29 tid=db0d54838909efbc89b5821cdd61b29b9866281196dcacfbb985afeafb6def54 traceID=69f954d7cb85b53b2926861fcdee703a

Windows Server 2022 reported:

time="2022-03-10T16:32:00.475954300Z" level=warning msg="TBBle: relayIO Cmd stdout, 0 bytes" eid=5373243d56eb3f401b8f3592b928890386aa91b2433a3a3bdf87c56e348a2b1d pid=6280 spanID=de1db34657828b56 tid=1fc7344a2f92eb221426d6e8c8e4e18697708ce3f9e28af36c91ffc6563499b8 traceID=5caf5e29d9e752a0e8b7f2b88140c3ad

The only thing I'm suspicious about is that in the success case, hcs::Process::CloseStdout span is reported immediately after the TBBle: relayIO Cmd stdout, while in the failure case, hcs::Process::CloseStdout span is reported immediately before that line.

The order of relevant spans in the success case is:

  • msg="TBBle: relayIO Cmd stdout, 231 bytes"
  • name="hcs::Process::CloseStdout"
  • name=HcsGetProcessProperties
  • name="hcs::Process::waitBackground"
  • name=HcsUnregisterProcessCallback
  • name="hcsExec::waitForContainerExit"
  • name=HcsCloseProcess

In the failure case, it's

  • name=HcsGetProcessProperties
  • name="hcs::Process::waitBackground"
  • name="hcsExec::waitForContainerExit"
  • msg="TBBle: relayIO Cmd stdout, 0 bytes"
  • name="hcs::Process::CloseStdout"
  • name=HcsUnregisterProcessCallback
  • name=HcsCloseProcess

I'm not sure if the ordering variation is a problem, or merely the natural result of whatever's actually causing the issue. Maybe somehow waitBackground is losing what's in the stdout pipe before it can be copied out to containerd?

Anyway, from at least that ordering, perhaps it's a race between func (he *hcsExec) waitForExit() and func (c *Cmd) Wait(), which are both waiting on Process.Wait()? Although waitForExit calls (c *Cmd) Wait() after Process.Wait() is complete, and that should block on the code with the relayIO call before then calling io.Close(), which I assume would close all these channels anyway.

But yeah, I'm not super-familiar with this code, so I must be misreading it, or the c.iogrp.Wait() call in (c *Cmd) Wait() is not doing what it looks like it's supposed to do.
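
To check my reading of that ordering, here's a minimal stand-alone Go sketch (a hypothetical fakeCmd type, not the real hcsshim Cmd/hcsExec) of what I understand the structure to be: Wait() blocks on the process exit first and then on the IO wait group, so a relay that is still copying stdout should always finish before Wait() returns.

package main

import (
	"fmt"
	"io"
	"strings"
	"sync"
)

// fakeCmd stands in for the structure described above: an exit signal
// (c.Process.Wait()) plus a wait group covering the IO relay goroutines.
type fakeCmd struct {
	exited chan struct{}
	iogrp  sync.WaitGroup
	stdout strings.Builder
}

func (c *fakeCmd) startRelay(processStdout io.Reader) {
	c.iogrp.Add(1)
	go func() {
		defer c.iogrp.Done()
		n, _ := io.Copy(&c.stdout, processStdout) // the relayIO-style copy
		fmt.Printf("relayed %d bytes of stdout\n", n)
	}()
}

// Wait mirrors my reading of (c *Cmd) Wait(): process exit first, then the IO group.
func (c *fakeCmd) Wait() {
	<-c.exited
	c.iogrp.Wait()
}

func main() {
	r, w := io.Pipe()
	cmd := &fakeCmd{exited: make(chan struct{})}
	cmd.startRelay(r)

	w.Write([]byte("hello\r\n")) // the exec'd process writes its output,
	w.Close()                    // closes its stdout,
	close(cmd.exited)            // and exits

	cmd.Wait()
	fmt.Printf("captured: %q\n", cmd.stdout.String())
}

If the real iogrp.Wait() behaves like this sketch, the relay should always run to completion before io.Close(ctx), so the ordering alone shouldn't be able to lose the output.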

Interestingly, when I used -ginkgo.focus to run only this test, it passed. Although I only tried that once, so it might have been luck. But with my current setup it reproduces 100% of the time.

The logs for the section that's different

These are the (almost) full log lines for the dot points above. I've trimmed the timestamps so I could read them more easily in vimdiff.

Windows Server 2019

level=info msg=Span duration=0s name=containerd.task.v2.Task.State parentSpanID=0000000000000000 spanID=f39fdf1badffa5db traceID=3259e10f067ad8c45472f1b100a239b9
level=warning msg="TBBle: relayIO Cmd stdout, 231 bytes" eid=aa4d95b78867d222773b7b1ce4805364f01b9025106961116ba973eebc971d9f pid=3732 spanID=e52f533fba795f29 tid=db0d54838909efbc89b5821cdd61b29b9866281196dcacfbb985afeafb6def54 traceID=69f954d7cb85b53b2926861fcdee703a
level=info msg=Span cid=db0d54838909efbc89b5821cdd61b29b9866281196dcacfbb985afeafb6def54 duration=0s name="hcs::Process::CloseStdout" parentSpanID=0000000000000000 pid=3732 spanID=4dc56365532c686c traceID=7248a327c43a7812b4bcbafa629ec3c2
level=info msg=Span duration=0s name=HcsGetProcessProperties parentSpanID=997a5bd206d3e34a spanID=fad7258aa642c9b4 traceID=18e4f581f83c4a152537b378b4751e55
level=info msg=Span cid=db0d54838909efbc89b5821cdd61b29b9866281196dcacfbb985afeafb6def54 duration=922.1346ms name="hcs::Process::waitBackground" parentSpanID=0000000000000000 pid=3732 spanID=997a5bd206d3e34a traceID=18e4f581f83c4a152537b378b4751e55
level=info msg=Span duration=0s name=HcsUnregisterProcessCallback parentSpanID=a7eae7aef9582afd spanID=54fda9d34c6f8b45 traceID=6e7aee78a4dcef81c2d622e5cd36e080
level=info msg=Span duration=957.9937ms eid=aa4d95b78867d222773b7b1ce4805364f01b9025106961116ba973eebc971d9f name="hcsExec::waitForContainerExit" parentSpanID=0000000000000000 spanID=7d9a421921c756e6 tid=db0d54838909efbc89b5821cdd61b29b9866281196dcacfbb985afeafb6def54 traceID=a4c18312879b5a2611242de406e44eee
level=info msg=Span duration=0s name=HcsCloseProcess parentSpanID=a7eae7aef9582afd spanID=01106cf89f85ec8d traceID=6e7aee78a4dcef81c2d622e5cd36e080

Windows Server 2022

level=info msg=Span duration=0s name=containerd.task.v2.Task.State parentSpanID=0000000000000000 spanID=48d925b2cd242d89 traceID=796aaad04a4017aaf63cd7c42a49d7b3
level=info msg=Span duration=0s name=HcsGetProcessProperties parentSpanID=7a45aae350994c78 spanID=166da1804ab00d9a traceID=e6bb92089360863f2007915ff0119ca4
level=info msg=Span cid=1fc7344a2f92eb221426d6e8c8e4e18697708ce3f9e28af36c91ffc6563499b8 duration=1.0712412s name="hcs::Process::waitBackground" parentSpanID=0000000000000000 pid=6280 spanID=7a45aae350994c78 traceID=e6bb92089360863f2007915ff0119ca4
level=info msg=Span duration=1.0953695s eid=5373243d56eb3f401b8f3592b928890386aa91b2433a3a3bdf87c56e348a2b1d name="hcsExec::waitForContainerExit" parentSpanID=0000000000000000 spanID=a6cec40c64540913 tid=1fc7344a2f92eb221426d6e8c8e4e18697708ce3f9e28af36c91ffc6563499b8 traceID=198866e0ec7a770de8045975be65ee62
level=warning msg="TBBle: relayIO Cmd stdout, 0 bytes" eid=5373243d56eb3f401b8f3592b928890386aa91b2433a3a3bdf87c56e348a2b1d pid=6280 spanID=de1db34657828b56 tid=1fc7344a2f92eb221426d6e8c8e4e18697708ce3f9e28af36c91ffc6563499b8 traceID=5caf5e29d9e752a0e8b7f2b88140c3ad
level=info msg=Span cid=1fc7344a2f92eb221426d6e8c8e4e18697708ce3f9e28af36c91ffc6563499b8 duration=0s name="hcs::Process::CloseStdout" parentSpanID=0000000000000000 pid=6280 spanID=fd36dfe708f67da2 traceID=cdb9af6b2c1a1035560d107a6c1ab6a3
level=info msg=Span duration=0s name=HcsUnregisterProcessCallback parentSpanID=e4001d4fc73beeaa spanID=cbca5ab685815eb3 traceID=98376f612a54bd49eacc53b8e2cda5f1
level=info msg=Span duration=0s name=HcsCloseProcess parentSpanID=e4001d4fc73beeaa spanID=b294981d44c7cebb traceID=98376f612a54bd49eacc53b8e2cda5f1

@TBBle
Contributor Author

TBBle commented Mar 10, 2022

Okay, with more debugging logs, the 'race condition' idea looks like a red herring.

Success case:

  • TBBle: (he *hcsExec) waitForExit: before he.p.Process.Wait()
  • TBBle: relayIO Cmd stdout, 231 bytes
  • TBBle: (he *hcsExec) waitForExit: after he.p.Process.Wait()
  • TBBle: (he *hcsExec) waitForExit: before he.p.Wait()
  • TBBle: (c *Cmd) Wait(): before c.Process.Wait()
  • TBBle: (c *Cmd) Wait(): after c.Process.Wait()
  • TBBle: (c *Cmd) Wait(): before c.iogrp.Wait()
  • TBBle: (c *Cmd) Wait(): after c.iogrp.Wait()
  • TBBle: (he *hcsExec) waitForExit: after he.p.Wait()
  • TBBle: (he *hcsExec) waitForExit: before he.io.Close(ctx)
  • TBBle: (he *hcsExec) waitForExit: after he.io.Close(ctx)

Failure case:

  • TBBle: (he *hcsExec) waitForExit: before he.p.Process.Wait()
  • TBBle: (he *hcsExec) waitForExit: after he.p.Process.Wait()
  • TBBle: (he *hcsExec) waitForExit: before he.p.Wait()
  • TBBle: (c *Cmd) Wait(): before c.Process.Wait()
  • TBBle: (c *Cmd) Wait(): after c.Process.Wait()
  • TBBle: (c *Cmd) Wait(): before c.iogrp.Wait()
  • TBBle: relayIO Cmd stdout, 0 bytes <== log timestamp puts it at the same time as after he.p.Process.Wait() above.
  • TBBle: (c *Cmd) Wait(): after c.iogrp.Wait()
  • TBBle: (he *hcsExec) waitForExit: after he.p.Wait()
  • TBBle: (he *hcsExec) waitForExit: before he.io.Close(ctx)
  • TBBle: (he *hcsExec) waitForExit: after he.io.Close(ctx)

So the success case looks like we actually get data and the stdout socket is closed by the process before we start waiting on it, while in the failure case we didn't get any data, and when we were waiting on the socket, nothing came through before it was closed.

As noted above, on examination of the log timestamps, the orderings are identical, so the process at the other end of the stdout socket simply sent 0 bytes and closed its FD.

So the problem must be lower still.

@olljanat
Contributor

@jterry75 any chance of finding someone who has access to the Windows source code to investigate this one?

It completely prevents Docker from starting to use containerd on Windows, which would eventually allow it to stop using the deprecated HCS v1: moby/moby#41455

@dcantah
Member

dcantah commented Mar 10, 2022

@helsaawy This makes me think some of your work may have fixed/will fix this (also confused how we don't seem to be hitting this internally or on the periodic tests). Let me setup some vms to try this on different builds in the meantime

@dcantah
Member

dcantah commented Mar 10, 2022

@olljanat @helsaawy and I would likely be the ones on the hook here :)

@jterry75
Contributor

@olljanat - @dcantah / @helsaawy are those people :)

@jterry75
Contributor

LOL. I completely missed your response Danny. I'm not great at reading...

@TBBle
Contributor Author

TBBle commented Mar 11, 2022

Okay, possibly totally out of left field, but is there any way we might see issues with this if there were two builds of the shim around, and we sometimes ran the wrong one? I noticed we build the shim twice in the GitHub Actions workflow (once through make binaries and once output into integration/client for some reason), and while fixing that up, the problem stopped reproducing: https://github.com/TBBle/containerd/runs/5505767991 (Two runs, both passed.)

The Windows Periodic flow also double-builds hcsshim, but the second build goes into a different directory compared to the GitHub Actions run.

I put this fix up as #6661, and if that turns out to magically fix this, then... surprise!

Update: And one run in three of that branch reproduced this failure, so that wasn't it.

@dcantah
Member

dcantah commented Mar 16, 2022

Okay, peering back in after trying this quite extensively locally (over 200 iterations) with no failures... It seems the GitHub Actions machines are the minority here, if things go over well on a local VM and AZ machines, but what could possibly be different I'm not sure.

Thinking back a bit though, I'd added bindings for the pseudo console API in Windows for use with Windows' form of privileged containers, and a test for this functionality was also exceedingly flaky on our GitHub CI (I'd say around 30-40% of the time) but ran fine locally: microsoft/hcsshim#1282. When you ask for a tty for a Windows Server Container this same API is used. I'll probably need to shift the focus to trying to debug on the actions machines as I can't seem to get a repro locally, unless someone else is able to

@dcantah
Member

dcantah commented Mar 16, 2022

I think we'd just re-run the CI if it failed, so I don't have a lot of examples to show, but here's one from after check-in that shows the same symptom as Paul found: we don't seem to be getting any output: https://github.com/microsoft/hcsshim/runs/4598067706?check_suite_focus=true#step:4:527

This also only manifested on ws2022 afaict

@TBBle
Contributor Author

TBBle commented Mar 17, 2022

It might be worth trying to run up the GitHub Actions VM image for Server 2022, and see if the problem can reproduce there. Given that it seems to be tty-related, my guess is either something is odd in the VMs being created, or whatever is actually running the actions is somehow interfering with TTY operation.

Thinking about it, my first suspicion is towards bash, because if that's Cygwin-or-derived, e.g. msys2 or Git for Windows, then it's historically had weirdness, bugs, and just plain interference with conpty, particularly if it's not the latest version.

Although we aren't seeing this on Windows Server 2019, which has conpty, I just discovered.

Having a repro that doesn't involve containers at all is quite interesting, although it also probably doesn't involve bash, so that's another defenestrated idea of mine, if we assume that test is the same underlying issue.

@dcantah
Member

dcantah commented Mar 17, 2022

It might be worth trying to run up the GitHub Actions VM image for Server 2022, and see if the problem can reproduce there. Given that it seems to be tty-related, my guess is either something is odd in the VMs being created, or whatever is actually running the actions is somehow interfering with TTY operation.

Thinking about it, my first suspicion is towards bash, because if that's Cygwin-or-derived, e.g. msys2 or Git for Windows, then it's historically had weirdness, bugs, and just plain interference with conpty, particularly if it's not the latest version.

Although we aren't seeing this on Windows Server 2019, which has conpty, I just discovered.

Having a repro that doesn't involve containers at all is quite interesting, although it also probably doesn't involve bash, so that's another defenestrated idea of mine, if we assume that test is the same underlying issue.

I don't want to throw us off the trail just based on that non-container pty test, as it might've been a fault of my own in the test, but it is a bit odd that they're both having issues on ws2022. On RS5 it looks like Windows Server containers didn't make use of the pseudo console API either, so there's another tidbit of info for us to play with.

Does anyone think it'd be fruitful to skip this test on the ws2022 runs for now, so folks don't have to think they broke something while we investigate (Paul, it sounds like you were in favor of this, and I'd have to agree 😆)? We've taken a similar approach for flaky tests in the past until they were finally resolved.

@TBBle
Contributor Author

TBBle commented Mar 17, 2022

Yeah, without more clarity on the problem than we have now, skipping the test is better overall.

I don't suppose there's an env-var that can detect the GitHub Actions VM/runner specifically? The test is passing on the Windows Periodic run AFAIK, so if we can avoid skipping it there, then really we have lost nothing compared to where we were in February. ^_^

Looking at the docs, GITHUB_ACTIONS==true is probably the right thing to use.
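
For illustration only, this is the sort of guard I have in mind, keyed off that documented variable. In practice the skip would more likely be an argument on the critest invocation, as suggested below, and skipOnGitHubActions is just a made-up helper name:

package integration

import (
	"os"
	"testing"
)

// skipOnGitHubActions is a hypothetical helper: GitHub-hosted runners set
// GITHUB_ACTIONS=true, while the Windows Periodic machines presumably don't,
// so the test would keep running there.
func skipOnGitHubActions(t *testing.T, reason string) {
	t.Helper()
	if os.Getenv("GITHUB_ACTIONS") == "true" {
		t.Skip(reason)
	}
}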

@dcantah
Member

dcantah commented Mar 17, 2022

Given the periodic tests are in their own yaml, couldn't we get away with just -ginkgo.skip {regex} on the run in ci.yml?

@olljanat
Contributor

I'll probably need to shift the focus to trying to debug on the actions machines as I can't seem to get a repro locally, unless someone else is able to

Btw, moby/moby#41479 (comment), which was mentioned here earlier, was possible to repro locally. I tried now with the latest moby code, which uses containerd 1.6.1, and it was no longer clear whether I was hitting the same issue or a different one.

Anyway, I tried to re-enable this one again https://github.com/moby/moby/blob/085c6a98d54720e70b28354ccec6da9b1b9e7fcf/integration/container/exec_test.go#L18-L85 and ran the moby CI script https://github.com/moby/moby/blob/master/hack/ci/windows.ps1 locally on my Azure VM; it still gets stuck, and it happens on every run.

@dcantah
Member

dcantah commented Mar 18, 2022

I'll probably need to shift the focus to trying to debug on the actions machines as I can't seem to get a repro locally, unless someone else is able to

Btw, moby/moby#41479 (comment), which was mentioned here earlier, was possible to repro locally. I tried now with the latest moby code, which uses containerd 1.6.1, and it was no longer clear whether I was hitting the same issue or a different one.

Anyway, I tried to re-enable this one again https://github.com/moby/moby/blob/085c6a98d54720e70b28354ccec6da9b1b9e7fcf/integration/container/exec_test.go#L18-L85 and ran the moby CI script https://github.com/moby/moby/blob/master/hack/ci/windows.ps1 locally on my Azure VM; it still gets stuck, and it happens on every run.

@olljanat I have not tried to repro the moby issue yet, as I was trying to see if I could get the critest test case to barf (which, to my dismay, didn't happen). Is there a simple repro/setup to follow there? They could very well be related. What host build were you getting the moby issue to appear on?

@TBBle
Contributor Author

TBBle commented Apr 28, 2022

@helsaawy This makes me think some of your work may have fixed/will fix this (also confused how we don't seem to be hitting this internally or on the periodic tests). Let me setup some vms to try this on different builds in the meantime

I assume this was referring to microsoft/hcsshim#1296 which I just came across.

@helsaawy
Contributor

helsaawy commented May 2, 2022

@helsaawy This makes me think some of your work may have fixed/will fix this (also confused how we don't seem to be hitting this internally or on the periodic tests). Let me setup some vms to try this on different builds in the meantime

I assume this was referring to microsoft/hcsshim#1296 which I just came across.

Apologies for the belated response:
There is some weirdness and possible race conditions with how Cmd handles stdin, though microsoft/hcsshim#1296 really just attempts to smooth that out by suppressing Close() errors rather than synchronizing things.
The issue there is that we leak a goroutine that's copying from the upstream stdin to the process's stdin.
But that blocks on reading from upstream, which isn't closed until after the Cmd finishes, at which point the goroutine tries to write an EOF to the process's stdin, which has already closed, so it errors out.
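
To illustrate the shape of that failure mode, here's a small self-contained sketch with standard-library pipes standing in for the real upstream and process-stdin plumbing (which in hcsshim/containerd are named pipes, so this is only the pattern, not the actual code):

package main

import (
	"fmt"
	"io"
)

func main() {
	upstreamR, upstreamW := io.Pipe() // the stdin pipe the caller hands us
	procR, procW := io.Pipe()         // the process's stdin

	done := make(chan struct{})
	go func() {
		defer close(done)
		// The leaked copier: it sits in upstreamR.Read until the upstream side
		// is written to or closed, long after the process has exited.
		_, err := io.Copy(procW, upstreamR)
		fmt.Println("stdin relay finished with:", err)
	}()

	// The exec'd process exits without reading stdin; its stdin goes away.
	procR.Close()

	// Only later does the caller touch the upstream side; the copier unblocks,
	// tries to forward to the dead process stdin, and errors out.
	upstreamW.Write([]byte("too late\n"))
	upstreamW.Close()
	<-done
}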

Ideally, we would close the upstream stdin for writing (or reading) when the process finishes but before Cmd exits, but I don't know what containerd or other callers expect, since they are the ones that create the pipes that hcsshim listens to.

@olljanat
Contributor

The issue there is that we leak a goroutine that's copying from the upstream stdin to the process's stdin. But that blocks on reading from upstream, which isn't closed until after the Cmd finishes, at which point the goroutine tries to write an EOF to the process's stdin, which has already closed, so it errors out.

Looks to be the same case with the Docker + containerd combination. The issue happens here: https://github.com/moby/moby/blob/98d8343aa28de0d499464b5529e6b8ccc92e9313/daemon/exec.go#L205-L211 and disappears on that specific test case if I add defer time.Sleep(30 * time.Second) between those other defer lines. The container then stops as expected, whereas without this it gets stuck forever and cannot be killed without restarting Docker.

Ideally, we would close the upstream stdin for writing (or reading) when the process finishes but before Cmd exits, but I don't know what containerd or other callers expect, since they are the ones that create the pipes that hcsshim listens to.

At least, I would like to understand: what is the right way in Go to handle the situation when the pipe the client is connected to closes?
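
For what it's worth, one common Go pattern for that (just a sketch of the idea, with made-up names; not what hcsshim or containerd actually do today) is to close the upstream reader as soon as the process exits, so the relay's pending Read unblocks instead of waiting for the client:

package relay

import "io"

// relayStdin copies caller-provided stdin into the process's stdin. When the
// process exits, the upstream reader is closed to unblock the pending Read,
// so the per-exec goroutine does not outlive the process.
// (Hypothetical helper for illustration only.)
func relayStdin(upstream io.ReadCloser, procStdin io.WriteCloser, procExited <-chan struct{}) {
	go func() {
		<-procExited
		upstream.Close() // unblocks io.Copy's pending Read below
	}()
	_, _ = io.Copy(procStdin, upstream)
	_ = procStdin.Close()
}

Whether closing the caller's pipe like that is acceptable is exactly the open question above, since containerd is the one that creates those pipes.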
