CI: enable parallel testing in CI builds #11510

Closed

dfandrich
Contributor

@dfandrich dfandrich commented Jul 24, 2023

The test-ci target now uses 2 processes by default, but the amount of
parallelism is tuned for each CI service and build environment based on
results of a number of test runs. Some CI services use super-
oversubscribed build machines that can barely run the curl tests
already with no parallelism without frequently failing with
timing-induced failures. These continue to be run without parallelism.
Other services provide two fast, unloaded cores and these run with 14
processes, which is a good default for this kind of environment.

Here's a summary of the number of test processes by CI service:

Appveyor - 2 (Windows MSVC), 1 (others)
Azure - 2
Circle CI - 14
Cirrus - 28 (macOS), 14 (Linux), 7 (FreeBSD), 5 (macOS torture), 2 (Windows)
GitHub Actions - 3 (macOS), 2 (Linux)

Some of these are a bit conservative to keep timing-induced flakiness down.
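
To make the per-service tuning concrete, here is a minimal Python sketch that looks up the process count from the summary above and renders it as a `-jN` style argument. The dictionary layout, the key names, and the `jobs_flag` helper are hypothetical and exist only for illustration; the real values live in each CI service's own configuration.

```python
# Process counts copied from the summary above. The dictionary layout and
# key names are illustrative assumptions, not the PR's actual config format.
TEST_PROCESSES = {
    ("appveyor", "windows-msvc"): 2,
    ("appveyor", "other"): 1,
    ("azure", "any"): 2,
    ("circleci", "any"): 14,
    ("cirrus", "macos"): 28,
    ("cirrus", "linux"): 14,
    ("cirrus", "freebsd"): 7,
    ("cirrus", "macos-torture"): 5,
    ("cirrus", "windows"): 2,
    ("github-actions", "macos"): 3,
    ("github-actions", "linux"): 2,
}

DEFAULT_PROCESSES = 2  # the test-ci default described above


def jobs_flag(service: str, environment: str) -> str:
    """Return a -jN style argument for a service/environment pair.

    Falls back to the service-wide ("any") entry, then to the default.
    """
    count = TEST_PROCESSES.get(
        (service, environment),
        TEST_PROCESSES.get((service, "any"), DEFAULT_PROCESSES),
    )
    return f"-j{count}"


if __name__ == "__main__":
    print(jobs_flag("cirrus", "macos"))
    print(jobs_flag("azure", "linux"))
```

Keeping the counts in one table like this would make it easy to lower a single entry when a service turns out to be too loaded for its current setting.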

The net result is that the first test results should arrive only
3 minutes after a commit submission.

Changes merged via separate commits:

Ref: #10818
Closes #11510

@github-actions github-actions bot added tests CI Continuous Integration labels Jul 24, 2023
Member

@bagder bagder left a comment

I love this!

@dfandrich
Contributor Author

I used my test analysis tool to look at the failures resulting from this PR. Here's a summary.

Test runs that timed out:

Azure: msys v2_mingw64_libssh
Azure: msys v2_mingw64_schannel
Circle: cares

I don't think we typically get a lot of cases where tests just hang. However, with parallel tests enabled I've seen it (in CI systems, anyway) a number of times. This is a bit worrisome because it means there may be a race condition in the code somewhere that causes this. In the Azure cases I'm planning on disabling parallel testing altogether (see the next section, below), but the Circle one has been the best in terms of unloaded servers and parallel speedup potential, so if enabling parallel tests causes this, even somewhat frequently, I'm hesitant about enabling it (there or really anywhere).

Unfortunately, my test analysis program looks at the level of individual test failures and does not currently surface test runs that hang and are aborted by timeout such as the above, so I had to gather these manually. I should be able to detect this for the next time.

New test failures in this PR run:

GHA: normal (curl)
1130 (failed in protocol)
1129 (failed in protocol)

Azure: msys v1_mingw64 (curl)
1056 (failed in protocol) THIS ONE IS PREEXISTING
3027 (failed in protocol)

Azure: msys v1_mingw64_schannel (curl)
1056 (failed in protocol) THIS ONE IS PREEXISTING
3027 (failed in protocol)

The above sections are really the only test case failures I'm worried about, with possibly a few in the next section. These are failures on tests that have not failed recently and have not shown themselves to be flaky. What's interesting is that they are all on Azure infrastructure, which I've found to be very oversubscribed. For that reason I'm only using 2 parallel tasks there, but I think I'll have to completely disable parallel testing on Azure Linux infrastructure to avoid this flakiness.

The following failures don't concern me much at all.

New test failures, but tests are flaky in other builds:

GHA: LibreSSL http2 (curl)
2600 (failed in exit)

GHA: SecureTransport http2 (curl)
2600 (failed in exit)

Azure: msys v2_mingw32_openssl (curl)
612 (failed in exit)
1056 (failed in protocol)

The above are flaky in other builds on these same CI systems, so it's not too surprising to see flakes here. However, flakes are pretty rare (e.g. 1.4%), so having 3 of them pop up in this one run is pretty suspicious. They're all in the Azure cloud, though, so stopping parallelism there as I propose above will fix this (ahem).

Flaky tests:

GHA: debug (curl)
2600 (failed in exit)
2600 fails 1.4% (latest failure: https://github.com/curl/curl/actions/runs/5221838456)

GHA: gcc SecureTransport (curl)
2600 (failed in exit)
2600 fails 1.4% (latest failure: https://github.com/curl/curl/actions/runs/5345853952)

The above are already flaky, so it's not too odd to see these failures. However, as in the previous section, the fact that they're only 1.4% flaky and I see two of them isn't good. That makes 5 flaky tests in a single run that normally only flake out 1.4% of the time, which is pretty suspicious.
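
As a rough way to put a number on "pretty suspicious", the following Python sketch computes the chance of seeing at least 5 flakes in one run. It treats every build as an independent trial with the same 1.4% failure rate, which is a simplification, and `BUILDS_IN_RUN` is a hypothetical figure that is not taken from the analysis above.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), assuming independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


FLAKE_RATE = 0.014    # the 1.4% rate quoted above
BUILDS_IN_RUN = 100   # hypothetical number of builds that run the test

if __name__ == "__main__":
    chance = prob_at_least(5, BUILDS_IN_RUN, FLAKE_RATE)
    print(f"P(at least 5 flakes in one run) = {chance:.4f}")
```

If the result is small for a realistic build count, the cluster of flakes is more likely caused by the change under test (here, parallelism) than by ordinary background flakiness.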

Existing failing tests:

Appveyor: CMake, mingw-w64, gcc 8, Debug x64, Schannel, Static, Unicode (curl)
1056 (failed in protocol)

Appveyor: CMake, mingw-w64, gcc 7, Debug x64, Schannel, Static, Unicode (curl)
1056 (failed in protocol)

Appveyor: CMake, mingw-w64, Debug x86, Schannel, Static (curl)
1056 (failed in protocol)

Appveyor: CMake, mingw, Debug x86, no SSL, Static (curl)
1056 (failed in protocol)

Appveyor: autotools, msys2, Debug, no Proxy, no SSL (curl)
1056 (failed in protocol)

Appveyor: autotools, msys2, Debug, no SSL (curl)
1056 (failed in protocol)

Appveyor: autotools, msys2, Release, no SSL (curl)
1056 (failed in protocol)

Appveyor: autotools, cygwin, Debug, no SSL (curl)
1056 (failed in protocol)

Azure: msys v2_mingw64_openssl (curl)
612 (failed in exit)
1056 (failed in protocol)

Azure: msys v1_mingw32_schannel (curl)
1056 (failed in protocol)

Azure: msys v1_mingw (curl)
1056 (failed in protocol)

Azure: msys v1_mingw_schannel (curl)
1056 (failed in protocol)

Azure: msys v1_mingw32 (curl)
1056 (failed in protocol)

Azure: msys v2_mingw32_schannel (curl)
612 (failed in exit)
1056 (failed in protocol)

Azure: msys v2_mingw64_schannel (curl)
612 (failed in exit)
1056 (failed in protocol)

Azure: msys v2_mingw64_libssh (curl)
612 (failed in exit)
614 (failed in postcheck)
1056 (failed in protocol)

All these are permafailing, so can be ignored. Mostly, they're marked as "run but ignore the results" so you don't see the failures day to day.
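
The four groupings used in this summary can be sketched as a small classifier. This is only an illustration of the reasoning, not the analysis tool's actual logic: the `History` fields, the `PERMAFAIL_THRESHOLD` cutoff, and the category labels are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class History:
    """Recent failure rates for one test (hypothetical fields)."""
    rate_this_build: float    # fraction of recent runs failing in this build
    rate_other_builds: float  # fraction of recent runs failing in other builds


PERMAFAIL_THRESHOLD = 0.9  # assumed cutoff, not taken from the analysis tool


def classify(history: History) -> str:
    """Place a failing test into one of the four groups used above."""
    if history.rate_this_build >= PERMAFAIL_THRESHOLD:
        return "existing failing test"
    if history.rate_this_build > 0:
        return "flaky test"
    if history.rate_other_builds > 0:
        return "new failure, flaky in other builds"
    return "new test failure"


if __name__ == "__main__":
    print(classify(History(rate_this_build=0.014, rate_other_builds=0.014)))
    print(classify(History(rate_this_build=0.0, rate_other_builds=0.0)))
```

Only the last group, failures with no recent history anywhere, points strongly at the change being tested.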

dfandrich added a commit that referenced this pull request Jul 27, 2023
@dfandrich dfandrich force-pushed the dfandrich/parallelci branch from 7cd7046 to 8567b3b Compare July 27, 2023 05:55
@dfandrich
Contributor Author

Here are the results of just the test failures that bother me from the second run, which reduced or eliminated parallel testing from a number of builds:

GHA: Test macOS / LibreSSL http2 (curl)
1554 (failed in data)

GHA: Test macOS / libssh2 (curl)
1554 (failed in data)

Test CircleCI / openssl-no-proxy (curl)
575 (failed in protocol)

Test CircleCI / openssl-c-ares (curl)
test run timed out

This is an improvement over the first run, but there's also less parallel testing going on. The test run hangs bother me the most, since those are so hard to debug when I can't reproduce them locally.

dfandrich added a commit that referenced this pull request Aug 7, 2023
@dfandrich dfandrich force-pushed the dfandrich/parallelci branch from 8567b3b to 5191a80 Compare August 7, 2023 04:52
dfandrich added a commit that referenced this pull request Sep 28, 2023
dfandrich added a commit that referenced this pull request Sep 29, 2023
dfandrich added a commit that referenced this pull request Sep 30, 2023
@vszakats

This comment was marked as outdated.

@vszakats

This comment was marked as outdated.

@dfandrich
Contributor Author

@vszakats, thanks for the patches, but my commits still didn't apply cleanly. I'll have to look at your changes and adjust accordingly.

@vszakats
Member

vszakats commented Oct 4, 2023

Maybe the simplest is if you overwrite your version with this one:
appveyor-PARALLEL.yml.txt
(also included some TESTING: line moves that I missed above.)

@dfandrich dfandrich force-pushed the dfandrich/parallelci branch from b9ff54f to 15cfc95 Compare October 5, 2023 08:49
dfandrich added a commit that referenced this pull request Oct 5, 2023
@dfandrich
Contributor Author

I have it rebased now; thanks for the help.

@dfandrich
Contributor Author

That last run was the best ever! Only two failures: one due to #12033 and a new one I just opened #12040.

dfandrich added a commit that referenced this pull request Oct 8, 2023
@dfandrich dfandrich force-pushed the dfandrich/parallelci branch from 15cfc95 to ec70db3 Compare October 8, 2023 18:50
dfandrich added a commit that referenced this pull request Oct 11, 2023
dfandrich added a commit that referenced this pull request Oct 11, 2023
dfandrich added a commit that referenced this pull request Oct 11, 2023
dfandrich added a commit that referenced this pull request Nov 13, 2023
The test-ci target now uses 2 processes by default, but the amount of
parallelism is tuned for each CI service and build environment based on
results of a number of test runs.  Some CI services use super-
oversubscribed build machines that can barely run the curl tests
already with no parallelism without frequently failing with
timing-induced failures. These continue to be run without parallelism.
Other services provide two fast, unloaded cores and these run with 14
processes, which is a good default for this kind of environment.

Here's a summary of the number of test processes by CI service:

TODO: completely remove the 2 here:  Appveyor - 2 (Windows MSVC), 1 (others)
  Azure - 2
  Circle CI - 14
  Cirrus - 28 (macOS), 14 (Linux), 7 (FreeBSD), 5 (macOS torture), 2 (Windows)
  GitHub Actions - 3 (macOS), 2 (Linux)

Some of these are a bit conservative to keep timing-induced flakiness down.

The net result is that the first test results should arrive only
3 minutes after a commit submission.

Ref: #10818
Closes #11510
@vszakats vszakats force-pushed the dfandrich/parallelci branch from d9fdaeb to a727c57 Compare May 28, 2024 06:46
@vszakats
Member

vszakats commented Jun 2, 2024

@dfandrich: Ready to merge?

@vszakats vszakats force-pushed the dfandrich/parallelci branch from a727c57 to ce041ce Compare June 2, 2024 22:22
vszakats pushed a commit that referenced this pull request Jun 4, 2024
@vszakats vszakats force-pushed the dfandrich/parallelci branch from 8d01464 to dc5fc1d Compare June 4, 2024 21:15
vszakats added a commit that referenced this pull request Jul 8, 2024
Credit-to: Dan Fandrich
Cherry-picked from #11510 #14097
vszakats pushed a commit that referenced this pull request Jul 8, 2024
@vszakats vszakats force-pushed the dfandrich/parallelci branch from fd69739 to 76729d6 Compare July 8, 2024 13:47
vszakats added a commit to vszakats/curl that referenced this pull request Jul 12, 2024
Credit-to: Dan Fandrich
Cherry-picked from curl#11510
vszakats added a commit that referenced this pull request Jul 13, 2024
- bump parallel test for Linux jobs.
  Credit-to: Dan Fandrich
  Cherry-picked from #11510
- bump parallel test for macOS jobs.
- drop no longer necessary `-Wno-vla` option.
- fold long lines.
- drop `--enable-maintainer-mode` `./configure` option.
- replace a hard-coded prefix with `brew --prefix`.
- update documentation link.
- move `--enable-debug` in front.
- tidy up quotes.

Closes #14171
meslubi2021 pushed a commit to Unity-Curl/curl that referenced this pull request Jul 19, 2024
vszakats pushed a commit that referenced this pull request Jul 19, 2024
@vszakats vszakats force-pushed the dfandrich/parallelci branch from 76729d6 to 995a6b5 Compare July 19, 2024 22:19
vszakats pushed a commit to vszakats/curl that referenced this pull request Jul 20, 2024
Already merged via separate commits:
- 2a7c8b2 curl#14171
- 7234106

Ref: curl#10818
Closes curl#11510
dfandrich and others added 2 commits August 3, 2024 15:56
Already merged via separate commits:
- 2a7c8b2 #14171
- 7234106
- efce544 #14244
- c6cf411

Ref: #10818
Closes #11510
@vszakats vszakats force-pushed the dfandrich/parallelci branch from 995a6b5 to c7b989e Compare August 3, 2024 13:57
@vszakats vszakats closed this in 0324d55 Aug 3, 2024
@vszakats
Member

vszakats commented Aug 3, 2024

Thanks Dan, this is merged now!

Most of this PR was merged piece by piece in recent months. What remained was changing the CI default to -j2 (and keeping AppVeyor at -j0, but there is no CI test run over there for now).

@dfandrich
Contributor Author

dfandrich commented Aug 3, 2024 via email

@vszakats
Member

vszakats commented Aug 3, 2024

Yeah, it's definitely not perfect.

I plan to reduce parallelism for the FreeBSD Intel job just migrated to GHA, and/or put FTP results on ignore.

The other notorious issue is the native Windows jobs. Old-mingw, and especially MSVC, where we added a bunch of new jobs, sadly bumped up the failure rate due to hangs. Very annoying. It's not the Windows env itself, because Cygwin and MSYS2 are rock solid (knock on wood) on the same runners and with high parallelism.

The Azure jobs are also failing often (I haven't looked at it closely), and also the old Cirrus FreeBSD jobs were prone to not start. These don't use parallel tests.

The most annoying is the Windows old-mingw and MSVC hangs.
Also FTP seems to be a common culprit.

I think it'd be fine to drop running tests with 7.3.0 and leave only 9.5.0.
We might also stop running tests in the wolfSSL and/or LibreSSL MSVC job?

There was no content at all in the file log/5/server.input.

I was wondering about these. Could it be the curl exe crashing? It's very curious and consistently happens in macOS jobs when using the gcc compiler, while the identical clang counterparts run consistently without issues. Mostly, but not only, with RTSP tests:
https://github.com/curl/curl/actions/runs/10228630550/job/28301388347#step:15:3236

vszakats added a commit that referenced this pull request Aug 19, 2024
They started to show flakiness similar to the GHA ones after enabling
parallel tests (`-j2`) by default.

Example flaky run:
https://dev.azure.com/daniel0244/curl/_build/results?buildId=24763&view=results

Ubuntu:
```
FAIL 137: 'FTP download without size in RETR string' FTP, RETR, --data-binary
FAIL 336: 'FTP range download when SIZE doesn't work' FTP, PASV, TYPE A, RETR
FAIL 975: 'HTTP with auth redirected to FTP allowing auth to continue' HTTP, FTP, --location-trusted
FAIL 1378: 'FTP DL, file without Content-Disposition inside, using -o fname' FTP, RETR
```

MSYS2 mingw32:
```
FAIL 1501: 'FTP with multi interface and slow LIST response' FTP, RETR, multi, LIST, DELAY
```

MSYS2 mingw64:
```
FAIL 1501: 'FTP with multi interface and slow LIST response' FTP, RETR, multi, LIST, DELAY
```

Follow-up to 0324d55 #11510

Closes #14593
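
The `FAIL` lines quoted in the commit message above can be tallied by keyword to check the impression that FTP is a common culprit. The Python sketch below assumes the line format is exactly as shown in those excerpts (`FAIL <number>: '<name>' <keywords>`); the regular expression and the `keyword_counts` helper are illustrative and not part of the curl test suite.

```python
import re
from collections import Counter

# The FAIL lines quoted in the commit message above, copied verbatim.
FAIL_LINES = """\
FAIL 137: 'FTP download without size in RETR string' FTP, RETR, --data-binary
FAIL 336: 'FTP range download when SIZE doesn't work' FTP, PASV, TYPE A, RETR
FAIL 975: 'HTTP with auth redirected to FTP allowing auth to continue' HTTP, FTP, --location-trusted
FAIL 1378: 'FTP DL, file without Content-Disposition inside, using -o fname' FTP, RETR
FAIL 1501: 'FTP with multi interface and slow LIST response' FTP, RETR, multi, LIST, DELAY
FAIL 1501: 'FTP with multi interface and slow LIST response' FTP, RETR, multi, LIST, DELAY
"""

# Greedy match on the quoted name so an apostrophe inside it
# (as in "doesn't") does not end the name early.
FAIL_RE = re.compile(r"^FAIL (\d+): '(.*)' (.+)$")


def keyword_counts(log_text: str) -> Counter:
    """Count how often each keyword appears across FAIL lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAIL_RE.match(line)
        if match:
            counts.update(k.strip() for k in match.group(3).split(","))
    return counts


if __name__ == "__main__":
    for keyword, count in keyword_counts(FAIL_LINES).most_common(3):
        print(keyword, count)
```

In this small sample every failing test carries the FTP keyword, which is consistent with the observation earlier in the thread.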