CI: enable parallel testing in CI builds #11510

dfandrich · 2023-07-24T22:03:04Z

The test-ci target now uses 2 processes by default, but the amount of
parallelism is tuned for each CI service and build environment based on
results of a number of test runs. Some CI services use heavily
oversubscribed build machines that can barely run the curl tests even
without parallelism and frequently hit timing-induced failures.
These continue to be run without parallelism.
Other services provide two fast, unloaded cores and these run with 14
processes, which is a good default for this kind of environment.

Here's a summary of the number of test processes by CI service:

Appveyor - 2 (Windows MSVC), 1 (others)
Azure - 2
Circle CI - 14
Cirrus - 28 (macOS), 14 (Linux), 7 (FreeBSD), 5 (macOS torture), 2 (Windows)
GitHub Actions - 3 (macOS), 2 (Linux)

Some of these are a bit conservative to keep timing-induced flakiness down.
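
As a rough illustration of the tuning above, the per-service counts can be written as a lookup table with the default of 2 as a fallback. This is only a sketch: the counts are copied from the summary, while the key names and the `jobs_for` helper are hypothetical and do not correspond to anything in the curl build files.

```python
# Process counts copied from the summary above. The (service, environment)
# key names and the helper below are hypothetical, for illustration only.
DEFAULT_JOBS = 2  # the test-ci default stated in the description

JOBS = {
    ("appveyor", "windows-msvc"): 2,
    ("appveyor", "other"): 1,
    ("azure", "any"): 2,
    ("circleci", "any"): 14,
    ("cirrus", "macos"): 28,
    ("cirrus", "linux"): 14,
    ("cirrus", "freebsd"): 7,
    ("cirrus", "macos-torture"): 5,
    ("cirrus", "windows"): 2,
    ("gha", "macos"): 3,
    ("gha", "linux"): 2,
}


def jobs_for(service, env="any"):
    """Return the number of test processes for a CI service and environment."""
    if (service, env) in JOBS:
        return JOBS[(service, env)]
    if (service, "any") in JOBS:
        return JOBS[(service, "any")]
    # Unknown combinations fall back to the test-ci default.
    return DEFAULT_JOBS


print(jobs_for("cirrus", "macos"))
print(jobs_for("some-new-service"))
```

Keeping the numbers in one table like this makes it easy to see which environments are treated conservatively and which ones get the full 14 or 28 processes.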

The net result is that the first test results should arrive only
3 minutes after a commit submission.

Changes merged via separate commits:

Ref: #10818
Closes #11510

bagder

I love this!

dfandrich · 2023-07-27T00:17:02Z

I used my test analysis tool to look at the failures resulting from this PR. Here's a summary.

Test runs that timed out:

Azure: msys v2_mingw64_libssh
Azure: msys v2_mingw64_schannel
Circle: cares

I don't think we typically get a lot of cases where tests just hang. However, with parallel tests enabled I've seen it (in CI systems, anyway) a number of times. This is a bit worrisome because it means there may be a race condition in the code somewhere that causes this. In the Azure cases I'm planning on disabling parallel testing altogether (see the next section, below), but the Circle one has been the best in terms of unloaded servers and parallel speedup potential, so if enabling parallel tests causes this, even somewhat frequently, I'm hesitant about enabling it (there or really anywhere).

Unfortunately, my test analysis program looks at the level of individual test failures and does not currently surface test runs that hang and are aborted by timeout such as the above, so I had to gather these manually. I should be able to detect this for the next time.

New test failures in this PR run:

GHA: normal (curl)
1130 (failed in protocol)
1129 (failed in protocol)

Azure: msys v1_mingw64 (curl)
1056 (failed in protocol) THIS ONE IS PREEXISTING
3027 (failed in protocol)

Azure: msys v1_mingw64_schannel (curl)
1056 (failed in protocol) THIS ONE IS PREEXISTING
3027 (failed in protocol)

The above sections are really the only test case failures I'm worried about, with possibly a few in the next section. These are failures on tests that have not failed recently and have not shown themselves to be flaky. What's interesting is that they are all on Azure infrastructure, which I've found to be very oversubscribed. For that reason I'm only using 2 parallel tasks there, but I think I'll have to completely disable parallel testing on Azure Linux infrastructure to avoid this flakiness.

The following failures don't concern me much at all.

New test failures, but tests are flaky in other builds:

GHA: LibreSSL http2 (curl)
2600 (failed in exit)

GHA: SecureTransport http2 (curl)
2600 (failed in exit)

Azure: msys v2_mingw32_openssl (curl)
612 (failed in exit)
1056 (failed in protocol)

The above are flaky in other builds on these same CI systems, so it's not too surprising to see flakes here. Except that flakes are pretty rare (e.g. 1.4%), so having 3 of them pop up in this one run is pretty suspicious. However, they're all in the Azure cloud, so stopping parallelism there as I propose above will fix this (ahem).

Flaky tests:

GHA: debug (curl)
2600 (failed in exit)
2600 fails 1.4% (latest failure: https://github.com/curl/curl/actions/runs/5221838456)

GHA: gcc SecureTransport (curl)
2600 (failed in exit)
2600 fails 1.4% (latest failure: https://github.com/curl/curl/actions/runs/5345853952)

The above are already flaky, so it's not too odd to see these failures. Except, as in the previous section, the fact that they're only 1.4% flaky and I see two of them isn't good. That makes 5 flaky tests in a single run that normally only flake out 1.4% of the time, which is pretty suspicious.
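
To put a rough number on "pretty suspicious", here is a minimal sketch of the binomial tail probability of seeing at least 5 flakes in one run. It rests on assumptions that are not in the data above: every job is treated as flaking independently at the same 1.4% rate, and `N_JOBS = 50` is a hypothetical job count chosen only for illustration.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 - P(X <= k - 1)."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


FLAKE_RATE = 0.014  # the 1.4% rate quoted above
N_JOBS = 50         # hypothetical number of CI jobs in one run (assumption)

p5 = prob_at_least(5, N_JOBS, FLAKE_RATE)
print(f"P(at least 5 flakes in {N_JOBS} jobs) = {p5:.5f}")
```

Under these assumptions the probability comes out well under 1%, which is consistent with the suspicion that something other than the usual background flakiness is going on. A different job count or correlated failures would change the figure.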

Existing failing tests:

Appveyor: CMake, mingw-w64, gcc 8, Debug x64, Schannel, Static, Unicode (curl)
1056 (failed in protocol)

Appveyor: CMake, mingw-w64, gcc 7, Debug x64, Schannel, Static, Unicode (curl)
1056 (failed in protocol)

Appveyor: CMake, mingw-w64, Debug x86, Schannel, Static (curl)
1056 (failed in protocol)

Appveyor: CMake, mingw, Debug x86, no SSL, Static (curl)
1056 (failed in protocol)

Appveyor: autotools, msys2, Debug, no Proxy, no SSL (curl)
1056 (failed in protocol)

Appveyor: autotools, msys2, Debug, no SSL (curl)
1056 (failed in protocol)

Appveyor: autotools, msys2, Release, no SSL (curl)
1056 (failed in protocol)

Appveyor: autotools, cygwin, Debug, no SSL (curl)
1056 (failed in protocol)

Azure: msys v2_mingw64_openssl (curl)
612 (failed in exit)
1056 (failed in protocol)

Azure: msys v1_mingw32_schannel (curl)
1056 (failed in protocol)

Azure: msys v1_mingw (curl)
1056 (failed in protocol)

Azure: msys v1_mingw_schannel (curl)
1056 (failed in protocol)

Azure: msys v1_mingw32 (curl)
1056 (failed in protocol)

Azure: msys v2_mingw32_schannel (curl)
612 (failed in exit)
1056 (failed in protocol)

Azure: msys v2_mingw64_schannel (curl)
612 (failed in exit)
1056 (failed in protocol)

Azure: msys v2_mingw64_libssh (curl)
612 (failed in exit)
614 (failed in postcheck)
1056 (failed in protocol)

All these are permafailing, so can be ignored. Mostly, they're marked as "run but ignore the results" so you don't see the failures day to day.

dfandrich · 2023-07-31T15:27:15Z

Here are the results of just the test failures that bother me from the second run, which reduced or eliminated parallel testing from a number of builds:

GHA: Test macOS / LibreSSL http2 (curl)
1554 (failed in data)

GHA: Test macOS / libssh2 (curl)
1554 (failed in data)

Test CircleCI / openssl-no-proxy (curl)
575 (failed in protocol)

Test CircleCI / openssl-c-ares (curl)
test run timed out

This is an improvement over the first run, but there's also less parallel testing going on. The test run hangs bother me the most, since those are so hard to debug when I can't reproduce them locally.

dfandrich · 2023-10-04T19:26:24Z

@vszakats, thanks for the patches, but my commits still didn't apply cleanly. I'll have to look at your changes and adjust accordingly.

vszakats · 2023-10-04T21:13:19Z

Maybe the simplest is if you overwrite your version with this one:
appveyor-PARALLEL.yml.txt
(also included some TESTING: line moves that I missed above.)

dfandrich · 2023-10-05T08:52:36Z

I have it rebased now. Thanks for the help.

dfandrich · 2023-10-05T17:44:57Z

That last run was the best ever! Only two failures: one due to #12033 and a new one I just opened #12040.

vszakats · 2024-06-02T21:30:49Z

@dfandrich: Ready to merge?

- bump parallel test for Linux jobs.
  Credit-to: Dan Fandrich
  Cherry-picked from #11510
- bump parallel test for macOS jobs.
- drop no longer necessary `-Wno-vla` option.
- fold long lines.
- drop `--enable-maintainer-mode` `./configure` option.
- replace a hard-coded prefix with `brew --prefix`.
- update documentation link.
- move `--enable-debug` in front.
- tidy up quotes.

Closes #14171

Already merged via separate commits:
- 2a7c8b2 #14171
- 7234106
- efce544 #14244
- c6cf411

vszakats · 2024-08-03T14:16:30Z

Thanks Dan, this is merged now!

Most of this PR was merged piece by piece in recent months. What remained was changing the CI default to -j2 (and keeping AppVeyor at -j0, but there is no CI test run over there for now).

dfandrich · 2024-08-03T17:29:26Z

Thanks for the work in getting this in. I'm a bit troubled by the results, though. While monitoring Test Clutch's PR comments on failed tests feature (not quite enabled yet), I've noticed that the majority of test failures now are of the sort that I've previously found are triggered by parallel testing. Especially those of the form:

There was no content at all in the file log/5/server.input.
Server glitch? Total curl failure? Returned: 28

Most of the recent red at https://testclutch.curl.se/static/reports/summary.html leads to a result like this. This is the reason I didn't enable parallel tests before, and it seems like the underlying issue has not been solved. While it's nice to get results quicker now, I'm concerned that the increase in spurious failed tests negates that.

There was a brief period after tests on Appveyor were disabled that we were starting to see failure-free runs, but it's happening less often now. The overall test failure rate is still pretty low (0.0041% over the last 3 weeks, to be precise), so maybe it's more Test Clutch's propensity to show me every single failure that's clouding my vision.
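
One way to track how often this signature shows up is to scan saved CI log text for the message quoted above. The following is a minimal sketch, not part of Test Clutch or the curl test suite: the function name and the sample text are hypothetical, and only the message wording is taken from the comment. For context, curl exit code 28 is the operation-timed-out error, which fits a transfer that never received a response.

```python
import re

# Message wording taken from the comment above. The captured number is the
# log directory as it appears in that message (log/5 in the quoted example).
NO_CONTENT = re.compile(
    r"There was no content at all in the file log/(\d+)/server\.input"
)


def count_empty_server_input(log_text):
    """Map log directory number -> how many times the message appears."""
    counts = {}
    for match in NO_CONTENT.finditer(log_text):
        log_dir = int(match.group(1))
        counts[log_dir] = counts.get(log_dir, 0) + 1
    return counts


# Hypothetical sample built from the two lines quoted in the comment.
sample = (
    "There was no content at all in the file log/5/server.input.\n"
    "Server glitch? Total curl failure? Returned: 28\n"
)
print(count_empty_server_input(sample))
```

Grouping by log directory number is a guess at what might be useful: if the failures cluster in particular directories, that could hint at which parallel runner was affected.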

vszakats · 2024-08-03T17:57:53Z

Yeah, it's definitely not perfect.

I plan to reduce parallelism for the FreeBSD Intel job just migrated to GHA, and/or put FTP results on ignore.

The other notorious issue is the native Windows jobs. Old-mingw, and especially MSVC, where we added a bunch of new jobs, sadly bumped up the failure rate due to hangs. Very annoying. It's not the Windows env itself, because Cygwin and MSYS2 are rock solid (knock on wood) on the same runners and high parallelism.

The Azure jobs are also failing often (I haven't looked at it closely), and also the old Cirrus FreeBSD jobs were prone to not start. These don't use parallel tests.

The most annoying is the Windows old-mingw and MSVC hangs.
Also FTP seems to be a common culprit.

I think it'd be fine to drop running tests with 7.3.0 and leave only 9.5.0.
We might also stop running tests in the wolfSSL and/or LibreSSL MSVC job?

There was no content at all in the file log/5/server.input.

I was wondering about these. Could it be the curl exe crashing? It's very curious and consistently happening in macOS jobs when using the gcc compiler, while the identical clang counterparts run consistently without issues. Mostly but not only with RTSP tests:
https://github.com/curl/curl/actions/runs/10228630550/job/28301388347#step:15:3236

They started to show flakiness similar to the GHA ones after enabling parallel tests (`-j2`) by default.

Example flaky run: https://dev.azure.com/daniel0244/curl/_build/results?buildId=24763&view=results

Ubuntu:
```
FAIL 137: 'FTP download without size in RETR string' FTP, RETR, --data-binary
FAIL 336: 'FTP range download when SIZE doesn't work' FTP, PASV, TYPE A, RETR
FAIL 975: 'HTTP with auth redirected to FTP allowing auth to continue' HTTP, FTP, --location-trusted
FAIL 1378: 'FTP DL, file without Content-Disposition inside, using -o fname' FTP, RETR
```

MSYS2 mingw32:
```
FAIL 1501: 'FTP with multi interface and slow LIST response' FTP, RETR, multi, LIST, DELAY
```

MSYS2 mingw64:
```
FAIL 1501: 'FTP with multi interface and slow LIST response' FTP, RETR, multi, LIST, DELAY
```

Follow-up to 0324d55 #11510

Closes #14593
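
The `FAIL` lines quoted above carry a test number, a quoted test name and a comma-separated keyword list, so they can be tallied by keyword to check the impression that FTP is a common culprit. This is a minimal sketch assuming that line layout. The regular expression and helper names are hypothetical, and the input is the five distinct lines from the message above.

```python
import re
from collections import Counter

# Assumed layout: FAIL <number>: '<test name>' <comma-separated keywords>
# The greedy name group lets apostrophes inside the name (doesn't) through.
FAIL_LINE = re.compile(r"^FAIL (\d+): '(.*)' (.+)$")


def parse_fail(line):
    """Return (test number, test name, keyword list), or None if no match."""
    match = FAIL_LINE.match(line.strip())
    if match is None:
        return None
    number, name, keywords = match.groups()
    return int(number), name, [kw.strip() for kw in keywords.split(",")]


lines = [
    "FAIL 137: 'FTP download without size in RETR string' FTP, RETR, --data-binary",
    "FAIL 336: 'FTP range download when SIZE doesn't work' FTP, PASV, TYPE A, RETR",
    "FAIL 975: 'HTTP with auth redirected to FTP allowing auth to continue' HTTP, FTP, --location-trusted",
    "FAIL 1378: 'FTP DL, file without Content-Disposition inside, using -o fname' FTP, RETR",
    "FAIL 1501: 'FTP with multi interface and slow LIST response' FTP, RETR, multi, LIST, DELAY",
]

parsed = [p for p in map(parse_fail, lines) if p is not None]
tally = Counter(kw for _, _, keywords in parsed for kw in keywords)
print(tally.most_common(3))
```

In this small sample every failing test carries the FTP keyword, which matches the observation earlier in the thread.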

github-actions bot added tests CI Continuous Integration labels Jul 24, 2023

bagder approved these changes Jul 24, 2023

dfandrich force-pushed the dfandrich/parallelci branch from 7cd7046 to 8567b3b Compare July 27, 2023 05:55

dfandrich force-pushed the dfandrich/parallelci branch from 8567b3b to 5191a80 Compare August 7, 2023 04:52

dfandrich force-pushed the dfandrich/parallelci branch from 5191a80 to e14e61b Compare September 28, 2023 07:28

dfandrich force-pushed the dfandrich/parallelci branch from e14e61b to e7d5ed7 Compare September 29, 2023 20:12

dfandrich force-pushed the dfandrich/parallelci branch from e7d5ed7 to 6747a09 Compare September 30, 2023 04:23

dfandrich force-pushed the dfandrich/parallelci branch from b9ff54f to 15cfc95 Compare October 5, 2023 08:49

dfandrich force-pushed the dfandrich/parallelci branch from 15cfc95 to ec70db3 Compare October 8, 2023 18:50

dfandrich mentioned this pull request Feb 12, 2024

GHA: adjust parallel job counts #12927

Closed

vszakats force-pushed the dfandrich/parallelci branch from d9fdaeb to a727c57 Compare May 28, 2024 06:46

vszakats force-pushed the dfandrich/parallelci branch from a727c57 to ce041ce Compare June 2, 2024 22:22

vszakats approved these changes Jun 2, 2024

vszakats force-pushed the dfandrich/parallelci branch from 8d01464 to dc5fc1d Compare June 4, 2024 21:15

vszakats added a commit that referenced this pull request Jul 8, 2024

GHA/macos: bump parallel tests to -j5

7234106

Credit-to: Dan Fandrich Cherry-picked from #11510 #14097

vszakats force-pushed the dfandrich/parallelci branch from fd69739 to 76729d6 Compare July 8, 2024 13:47

vszakats added a commit to vszakats/curl that referenced this pull request Jul 12, 2024

parallel tests linux

1e008cd

Credit-to: Dan Fandrich Cherry-picked from curl#11510

vszakats mentioned this pull request Jul 12, 2024

CI/circleci: config tidy-ups, bump up test parallelism #14171

Closed

vszakats force-pushed the dfandrich/parallelci branch from 76729d6 to 995a6b5 Compare July 19, 2024 22:19

vszakats mentioned this pull request Jul 20, 2024

Linux parallel test tests #14238

Closed

dfandrich and others added 2 commits August 3, 2024 15:56

appveyor.sh: move/rebase patch from appveyor.yml

c7b989e

vszakats force-pushed the dfandrich/parallelci branch from 995a6b5 to c7b989e Compare August 3, 2024 13:57

vszakats closed this in 0324d55 Aug 3, 2024

vszakats mentioned this pull request Aug 4, 2024

GHA/windows: add mbedTLS MSVC job #14203

Closed

vszakats mentioned this pull request Aug 19, 2024

CI/azure: disable parallel tests, allow IDN tests #14593

Closed
