CI: enable parallel testing in CI builds #11510
I love this!
I used my test analysis tool to look at the failures resulting from this PR. Here's a summary.
Test runs that timed out:
I don't think we typically get a lot of cases where tests just hang. However, with parallel tests enabled I've seen it (in CI systems, anyway) a number of times. This is a bit worrisome because it means there may be a race condition in the code somewhere that causes this. In the Azure cases I'm planning on disabling parallel testing altogether (see the next section, below), but the Circle one has been the best in terms of unloaded servers and parallel speedup potential, so if enabling parallel tests causes this, even somewhat frequently, I'm hesitant about enabling it (there or really anywhere). Unfortunately, my test analysis program looks at the level of individual test failures and does not currently surface test runs that hang and are aborted by timeout such as the above, so I had to gather these manually. I should be able to detect this next time.
New test failures in this PR run:
The above sections are really the only test case failures I'm worried about, with possibly a few in the next section. These are failures on tests that have not failed recently and have not shown themselves to be flaky. What's interesting is that they are all on Azure infrastructure, which I've found to be very oversubscribed. For that reason I'm only using 2 parallel tasks there, but I think I'll have to completely disable parallel testing on Azure Linux infrastructure to avoid this flakiness. The following failures don't concern me much at all.
New test failures, but tests are flaky in other builds:
The above are flaky in other builds on these same CI systems, so it's not too surprising to see flakes here. Except that flakes are pretty rare (e.g. 1.4%), so having 3 of them pop up in this one run is pretty suspicious. However, they're all in the Azure cloud, so stopping parallelism there as I propose above will fix this (ahem).
Flaky tests:
The above are already flaky, so it's not too odd to see these failures. Except, as in the previous section, the fact that they're only 1.4% flaky and I see two of them isn't good. That makes 5 flaky tests in a single run that normally only flake out 1.4% of the time, which is pretty suspicious.
Existing failing tests:
All these are permafailing, so they can be ignored. Mostly, they're marked as "run but ignore the results" so you don't see the failures day to day.
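To put a rough number on "pretty suspicious", here is a minimal sketch of the binomial tail probability of seeing at least 5 flakes in one run. The 1.4% rate comes from the comment above; the number of flaky tests per run (`N_FLAKY_TESTS`) is a hypothetical placeholder, and the model assumes flakes are independent, which load-induced failures probably are not.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    # Subtract the probability of seeing fewer than k failures from 1.
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


FLAKE_RATE = 0.014   # per-test flake rate quoted in the comment above
N_FLAKY_TESTS = 50   # hypothetical: number of known-flaky tests in one run

if __name__ == "__main__":
    p = prob_at_least(5, N_FLAKY_TESTS, FLAKE_RATE)
    print(f"P(at least 5 flakes in one run) = {p:.6f}")
```

A very small result under these assumptions would suggest the flakes are correlated, for example by machine load, rather than independent random events.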
The test-ci target now uses 2 processes by default, but the amount of parallelism is tuned for each CI service and build environment based on results of a number of test runs. Some CI services use super-oversubscribed build machines that can barely run the curl tests already with no parallelism without frequently failing with timing-induced failures. These continue to be run without parallelism. Other services provide two fast, unloaded cores and these run with 14 processes, which is a good default for this kind of environment.

Here's a summary of the number of test processes by CI service:

Appveyor - 2 (Windows MSVC), 1 (others)
Azure - 2
Circle CI - 14
Cirrus - 28 (macOS), 14 (Linux), 7 (FreeBSD), 5 (macOS torture), 2 (Windows)
GitHub Actions - 3 (macOS), 2 (Linux)

Some of these are a bit conservative to keep timing-induced flakiness down. The net result is that the first test results should arrive only 3 minutes after a commit submission.

Ref: #10818
Closes #11510
Here are the results of just the test failures that bother me from the second run, which reduced or eliminated parallel testing from a number of builds:
This is an improvement over the first run, but there's also less parallel testing going on. The test run hangs bother me the most, since those are so hard to debug when I can't reproduce them locally.
@vszakats, thanks for the patches, but my commits still didn't apply cleanly. I'll have to look at your changes and adjust accordingly.
Maybe the simplest is if you overwrite your version with this one:
I have it rebased now. Thanks for the help.
@dfandrich: Ready to merge?
Credit-to: Dan Fandrich Cherry-picked from curl#11510
- bump parallel test for Linux jobs.
  Credit-to: Dan Fandrich
  Cherry-picked from #11510
- bump parallel test for macOS jobs.
- drop no longer necessary `-Wno-vla` option.
- fold long lines.
- drop `--enable-maintainer-mode` `./configure` option.
- replace a hard-coded prefix with `brew --prefix`.
- update documentation link.
- move `--enable-debug` in front.
- tidy up quotes.

Closes #14171
Already merged via separate commits:
- 2a7c8b2 #14171
- 7234106
- efce544 #14244
- c6cf411

Ref: #10818
Closes #11510
Thanks Dan, this is merged now! Most of this PR was merged in the recent months piece by piece. What remained was changing the CI default to |
Thanks for the work in getting this in. I'm a bit troubled by the results,
though. While monitoring Test Clutch's PR comments on failed tests feature (not
quite enabled yet), I've noticed that the majority of test failures now are of
the sort that I've previously found are triggered by parallel testing.
Especially those of the form:
There was no content at all in the file log/5/server.input.
Server glitch? Total curl failure? Returned: 28
Most of the recent red at
https://testclutch.curl.se/static/reports/summary.html leads to a result like
this. This is the reason I didn't enable parallel tests before, and it seems
like the underlying issue has not been solved.
While it's nice to get results quicker now, I'm concerned that the increase in
spurious failed tests negates that. There was a brief period after tests on
Appveyor were disabled that we were starting to see failure-free runs, but
it's happening less often now. The overall test failure rate is still pretty
low (0.0041% over the last 3 weeks, to be precise), so maybe it's more Test
Clutch's propensity to show me every single failure that's clouding my vision.
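One way to quantify how many failures match the signature quoted above is to scan test logs for it. The following is a minimal sketch, not part of Test Clutch: the two regular expressions are built from the two log lines quoted in this comment, and the function name and return shape are hypothetical.

```python
import re

# Patterns derived from the failure text quoted in the comment above.
EMPTY_INPUT = re.compile(
    r"There was no content at all in the file log/(\d+)/server\.input\."
)
RETURNED = re.compile(r"Returned: (\d+)")


def find_empty_server_input(log_text: str) -> list:
    """Return one dict per occurrence of the empty server.input signature.

    "log_dir" is the number in the log/<N>/ path; "returned" is the curl
    exit code reported on the following line, or None if that line is absent.
    """
    hits = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        match = EMPTY_INPUT.search(line)
        if not match:
            continue
        code = None
        if i + 1 < len(lines):
            ret = RETURNED.search(lines[i + 1])
            if ret:
                code = int(ret.group(1))
        hits.append({"log_dir": int(match.group(1)), "returned": code})
    return hits


SAMPLE = (
    "There was no content at all in the file log/5/server.input.\n"
    "Server glitch? Total curl failure? Returned: 28\n"
)

if __name__ == "__main__":
    print(find_empty_server_input(SAMPLE))
```

Counting these hits per CI job before and after the parallel default changed would show whether this signature really became more common, rather than just more visible.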
Yeah, it's definitely not perfect. I plan to reduce parallelism for the FreeBSD Intel job just migrated to GHA, and/or put FTP results on ignore. The other notorious issue is the native Windows jobs. Old-mingw, and especially MSVC, where we added a bunch of new jobs, sadly bumped up the failure rate due to hangs. Very annoying. It's not the Windows env itself, because Cygwin and MSYS2 are rock solid (knock on wood) on the same runners with high parallelism. The Azure jobs are also failing often (I haven't looked at it closely), and the old Cirrus FreeBSD jobs were prone to not starting. These don't use parallel tests. The most annoying are the Windows old-mingw and MSVC hangs. I think it'd be fine to drop running tests with 7.3.0 and leave only 9.5.0.
I was wondering about these. Could it be |
They started to show flakiness similar to the GHA ones after enabling parallel tests (`-j2`) by default.

Example flaky run:
https://dev.azure.com/daniel0244/curl/_build/results?buildId=24763&view=results

Ubuntu:
```
FAIL 137: 'FTP download without size in RETR string' FTP, RETR, --data-binary
FAIL 336: 'FTP range download when SIZE doesn't work' FTP, PASV, TYPE A, RETR
FAIL 975: 'HTTP with auth redirected to FTP allowing auth to continue' HTTP, FTP, --location-trusted
FAIL 1378: 'FTP DL, file without Content-Disposition inside, using -o fname' FTP, RETR
```

MSYS2 mingw32:
```
FAIL 1501: 'FTP with multi interface and slow LIST response' FTP, RETR, multi, LIST, DELAY
```

MSYS2 mingw64:
```
FAIL 1501: 'FTP with multi interface and slow LIST response' FTP, RETR, multi, LIST, DELAY
```

Follow-up to 0324d55 #11510
Closes #14593
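The failures listed above all involve FTP. A minimal sketch of how that could be checked mechanically is to parse the `FAIL` lines and count keywords. The line layout is inferred from the excerpts above only, and the function names are hypothetical.

```python
import re
from collections import Counter

# Layout inferred from the excerpts: FAIL <number>: '<description>' <keywords>
# The greedy description group copes with apostrophes such as "doesn't".
FAIL_LINE = re.compile(r"^FAIL (\d+): '(.*)' (.*)$")


def parse_failures(log_text: str) -> list:
    """Return (test_number, description, keywords) for each FAIL line."""
    failures = []
    for line in log_text.splitlines():
        match = FAIL_LINE.match(line.strip())
        if match:
            keywords = [k.strip() for k in match.group(3).split(",")]
            failures.append((int(match.group(1)), match.group(2), keywords))
    return failures


def keyword_counts(failures: list) -> Counter:
    """Count how often each keyword appears across the failed tests."""
    return Counter(k for _, _, keywords in failures for k in keywords)


UBUNTU_LOG = """\
FAIL 137: 'FTP download without size in RETR string' FTP, RETR, --data-binary
FAIL 336: 'FTP range download when SIZE doesn't work' FTP, PASV, TYPE A, RETR
FAIL 975: 'HTTP with auth redirected to FTP allowing auth to continue' HTTP, FTP, --location-trusted
FAIL 1378: 'FTP DL, file without Content-Disposition inside, using -o fname' FTP, RETR
"""

if __name__ == "__main__":
    for keyword, count in keyword_counts(parse_failures(UBUNTU_LOG)).most_common():
        print(keyword, count)
```

A keyword that appears in every failed test, as FTP does here, is a reasonable first place to look for a shared cause.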
The test-ci target now uses 2 processes by default, but the amount of
parallelism is tuned for each CI service and build environment based on
results of a number of test runs. Some CI services use super-
oversubscribed build machines that can barely run the curl tests
already with no parallelism without frequently failing with
timing-induced failures. These continue to be run without parallelism.
Other services provide two fast, unloaded cores and these run with 14
processes, which is a good default for this kind of environment.
Here's a summary of the number of test processes by CI service:
Appveyor - 2 (Windows MSVC), 1 (others)
Azure - 2
Circle CI - 14
Cirrus - 28 (macOS), 14 (Linux), 7 (FreeBSD), 5 (macOS torture), 2 (Windows)
GitHub Actions - 3 (macOS), 2 (Linux)
Some of these are a bit conservative to keep timing-induced flakiness down.
The net result is that the first test results should arrive only
3 minutes after a commit submission.
Changes merged via separate commits:
Ref: #10818
Closes #11510