runtime: performance regression for small Swiss Table map access for non-specialized keys #70849

Closed
thepudds opened this issue Dec 14, 2024 · 7 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@thepudds
Contributor

thepudds commented Dec 14, 2024

Edit: Initially, I thought this was an "optimization opportunity" and wrote it up as such, but later realized it is seemingly a regression from Go 1.23 (comment). Changed title accordingly.


I was reading through the (excellent!) new Swiss Table implementation from #54766 and noticed a potential optimization opportunity for lookups with small maps (<= 8 elems) for the normal case of non-specialized keys.

I suspect lookups in these maps would be faster if, instead of the current linear scan, we build a match bitmap and jump to the candidate match. I mailed https://go.dev/cl/634396 with this change.
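For readers following along, here is a rough sketch of the two lookup strategies on a single 8-slot group. It is not the code from the CL: the group layout, the 0x80 empty marker, the FNV stand-in hash, and all names (group, h2, lookupLinear, lookupBitmap) are assumptions made for illustration. The bitmap version uses a portable byte-wise trick rather than SIMD, and it can flag false-positive candidates, which the key comparison then rejects.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
)

const (
	lsbs      = 0x0101010101010101 // low bit of every byte
	msbs      = 0x8080808080808080 // high bit of every byte
	ctrlEmpty = 0x80               // control byte of an empty slot (assumed encoding)
)

// group is a hypothetical single 8-slot group, not the runtime's layout.
type group struct {
	ctrl uint64 // control byte for slot i lives in byte i
	keys [8]string
	vals [8]int32
}

func newGroup() *group { return &group{ctrl: msbs} } // all slots empty

// h2 derives a 7-bit tag from the key hash (FNV-1a is a stand-in hash).
func h2(key string) uint8 {
	h := fnv.New64a()
	h.Write([]byte(key))
	return uint8(h.Sum64() & 0x7f)
}

// insert stores key/val in the first empty slot and reports false when full.
// For brevity it does not check whether the key is already present.
func (g *group) insert(key string, val int32) bool {
	for i := 0; i < 8; i++ {
		shift := uint(8 * i)
		if byte(g.ctrl>>shift) == ctrlEmpty {
			g.ctrl = g.ctrl&^(0xff<<shift) | uint64(h2(key))<<shift
			g.keys[i], g.vals[i] = key, val
			return true
		}
	}
	return false
}

// lookupLinear compares the key against every full slot in order.
func (g *group) lookupLinear(key string) (int32, bool) {
	for i := 0; i < 8; i++ {
		if byte(g.ctrl>>uint(8*i)) != ctrlEmpty && g.keys[i] == key {
			return g.vals[i], true
		}
	}
	return 0, false
}

// lookupBitmap first builds a bitmap of slots whose tag may equal h2(key),
// then compares keys only for those candidate slots.
func (g *group) lookupBitmap(key string) (int32, bool) {
	v := g.ctrl ^ (lsbs * uint64(h2(key))) // matching bytes become zero
	match := ((v - lsbs) &^ v) & msbs      // high bit set for (possibly) zero bytes
	for match != 0 {
		i := bits.TrailingZeros64(match) / 8
		if g.keys[i] == key {
			return g.vals[i], true
		}
		match &= match - 1 // clear the lowest candidate and continue
	}
	return 0, false
}

func main() {
	g := newGroup()
	for i, k := range []string{"a", "b", "c", "d", "e", "f"} {
		g.insert(k, int32(i+1))
	}
	fmt.Println(g.lookupLinear("e"))
	fmt.Println(g.lookupBitmap("e"))
	fmt.Println(g.lookupBitmap("missing"))
}
```

In the linear version, the number of key comparisons depends on where the key sits in the group, so the loop exit is hard to predict when keys arrive in random order. In the bitmap version, a hit usually needs one key comparison regardless of slot position.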

For predictable keys, it might be a small win or close to a wash, but for unpredictable keys, it might be a larger win.

To better illustrate this (as well as to help with analyzing a couple other experiments I'm trying), I also updated the benchmarks (#70700) to shuffle the keys for the newer benchmarks, and also added a new "Hot" benchmark that repeatedly looks up a single key (Hot=1 below) or a small number of random keys (Hot=3 below).
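As a rough illustration of what "predictable" versus "shuffled" versus "Hot" lookup orders mean here, the sketch below generates the three kinds of index sequences. It is not the benchmark code from #70700; the helper names (lookupOrder, hotOrder), the demoKey type, and the fixed seeds are all hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// demoKey is a hypothetical one-byte key type used only for this sketch.
type demoKey byte

// lookupOrder returns n indices into a key slice of length nKeys.
// With shuffle=false the indices cycle 0,1,2,... (a predictable order).
// With shuffle=true the same indices are randomly permuted.
func lookupOrder(nKeys, n int, shuffle bool, seed int64) []int {
	order := make([]int, n)
	for i := range order {
		order[i] = i % nKeys
	}
	if shuffle {
		rng := rand.New(rand.NewSource(seed))
		rng.Shuffle(n, func(i, j int) { order[i], order[j] = order[j], order[i] })
	}
	return order
}

// hotOrder returns n indices drawn at random from the first hot keys,
// so hot=1 always selects the same key.
func hotOrder(hot, n int, seed int64) []int {
	rng := rand.New(rand.NewSource(seed))
	order := make([]int, n)
	for i := range order {
		order[i] = rng.Intn(hot)
	}
	return order
}

func main() {
	keys := []demoKey{1, 2, 3, 4, 5, 6}
	m := make(map[demoKey]int32, len(keys))
	for i, k := range keys {
		m[k] = int32(i)
	}

	var sink int32
	// Shuffled order over all six keys.
	for _, idx := range lookupOrder(len(keys), 1024, true, 1) {
		sink += m[keys[idx]]
	}
	// "Hot=3"-style order: random picks among the first three keys.
	for _, idx := range hotOrder(3, 1024, 1) {
		sink += m[keys[idx]]
	}
	fmt.Println("sink:", sink)
}
```

With a single key (or hot=1), shuffling cannot change the sequence, which is why those cases stay predictable even when the keys are shuffled.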

The geomean here is -28.69%. These results are for amd64 and use the SIMD optimizations. I did not test on arm or without the SIMD optimizations.

                                                │  no-fix-new-bmarks  │ fix-with-new-bmarks   │
                                                │      sec/op         │  sec/op       vs base │
MapAccessHit/Key=smallType/Elem=int32/len=6-4            24.55n ±  0%   13.63n ±  0%  -44.48% (p=0.000 n=25)
MapAccessMiss/Key=smallType/Elem=int32/len=6-4          12.500n ±  0%   9.517n ± 32%  -23.86% (p=0.007 n=25)
MapAccessHitHot/Key=smallType/Elem=int32/len=6/Hot=1-4   13.73n ±  4%   13.54n ±  0%        ~ (p=0.096 n=25)
MapAccessHitHot/Key=smallType/Elem=int32/len=6/Hot=3-4   21.93n ±  1%   13.57n ±  1%  -38.12% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4       13.07n ±  0%   13.54n ±  0%   +3.60% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4       19.05n ±  0%   13.53n ±  0%  -28.98% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4       21.46n ±  0%   13.52n ±  0%  -37.00% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4       23.06n ±  0%   13.53n ±  0%  -41.33% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-4       23.94n ±  0%   13.53n ±  0%  -43.48% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-4       24.46n ±  0%   13.54n ±  0%  -44.64% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-4       24.91n ±  0%   13.54n ±  0%  -45.64% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-4       25.12n ±  0%   13.56n ±  0%  -46.02% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4     12.480n ±  0%   9.523n ±  0%  -23.69% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4     12.480n ±  0%   9.516n ±  0%  -23.75% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4     12.490n ±  0%   9.516n ±  0%  -23.81% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-4     12.480n ±  0%   9.520n ±  0%  -23.72% (p=0.001 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-4     12.480n ±  0%   9.520n ±  0%  -23.72% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-4     12.490n ± 20%   9.527n ± 32%  -23.72% (p=0.003 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-4      12.48n ±  0%   12.09n ± 21%   -3.12% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-4      12.49n ± 16%   11.77n ± 19%   -5.76% (p=0.000 n=25)
geomean                                                  16.58n         11.82n        -28.69%
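The geomean row and the "vs base" percentages can be sanity-checked with a few lines of Go. This is only a sketch of the arithmetic, with hypothetical helper names (geomean, pctChange). benchstat presumably works from the unrounded samples, so recomputing from the rounded values printed above gives a figure close to, but not necessarily identical to, the reported -28.69%.

```go
package main

import (
	"fmt"
	"math"
)

// geomean returns the geometric mean of xs, computed via logarithms.
// xs must be non-empty and contain only positive values.
func geomean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += math.Log(x)
	}
	return math.Exp(sum / float64(len(xs)))
}

// pctChange returns the relative change from base to cur, in percent.
func pctChange(base, cur float64) float64 {
	return (cur/base - 1) * 100
}

func main() {
	// Rounded geomean sec/op values from the last row of the table above.
	fmt.Printf("geomean change: %.2f%%\n", pctChange(16.58, 11.82))

	// Geometric mean of the first three with-fix values, as a small example.
	fmt.Printf("geomean of three rows: %.3fn\n", geomean([]float64{13.63, 9.517, 13.54}))
}
```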

For the miss benchmarks, I suspect the fact that a miss is happening is predictable in all cases, but in some cases the predictable miss is with a predictable key, vs. a predictable miss on an unpredictable key in other cases. For a table with a single group like these all have, that probably doesn't matter too much.

For the hits and misses, these are the ones that I expect to have predictable keys:

MapAccessHitHot/Key=smallType/Elem=int32/len=6/Hot=1-4   13.73n ±  4%   13.54n ±  0%        ~ (p=0.096 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4       13.07n ±  0%   13.54n ±  0%   +3.60% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4     12.480n ±  0%   9.523n ±  0%  -23.69% (p=0.000 n=25)

(Those three are also listed above in the larger list, but pulling them out here for commentary).

The first has predictable keys because a single key is looked up repeatedly per run (with 6 elements in the small map). The second two have predictable keys because there is only a single element. In other words, the keys are shuffled in all ~20 of these benchmarks, but in the three here, the shuffling is effectively a no-op.

CC @prattmic

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Dec 14, 2024
@thepudds
Contributor Author

If we instead compare master from a week or so ago (8c3e391) vs. just making the map lookup behavior change (but without changing benchmark behavior), we get:

                                                    │ master-8c3e391573 │       just-fix-with-old-bmarks   │
                                                    │      sec/op       │    sec/op     vs base            │
MapAccessHit/Key=smallType/Elem=int32/len=6-4              22.75n ±  0%   21.81n ±  0%   -4.11% (p=0.001 n=20)
MapAccessMiss/Key=smallType/Elem=int32/len=6-4             20.02n ± 24%   17.34n ±  2%  -13.39% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4         20.87n ±  0%   21.76n ±  0%   +4.26% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4         21.22n ±  0%   21.72n ±  0%   +2.33% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4         21.72n ±  0%   21.71n ±  0%        ~ (p=0.522 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4         22.06n ±  0%   21.72n ±  0%   -1.54% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-4         22.42n ±  0%   21.73n ±  0%   -3.06% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-4         22.75n ±  0%   21.74n ±  0%   -4.44% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-4         23.00n ±  0%   21.73n ±  0%   -5.52% (p=0.001 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-4         23.22n ±  0%   21.73n ±  0%   -6.40% (p=0.006 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4        19.95n ±  0%   17.31n ±  0%  -13.21% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4        19.96n ±  0%   17.32n ±  0%  -13.25% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4        19.95n ±  0%   17.31n ±  0%  -13.23% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-4        19.99n ±  0%   17.30n ±  0%  -13.43% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-4        19.96n ±  0%   17.32n ±  0%  -13.23% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-4        19.97n ±  0%   17.33n ±  2%  -13.22% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-4        19.98n ± 20%   17.83n ± 13%  -10.76% (p=0.012 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-4        22.65n ± 12%   17.34n ± 14%  -23.47% (p=0.000 n=20)
geomean                                                    21.21n         19.44n         -8.36%

Note that both columns from this run are generally slower than the first benchmarks presented above. This is likely for a couple of reasons, one being that the benchmarks here still have a somewhat expensive mod operation in the benchmark code itself.
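To make the mod overhead concrete, here is a sketch of two loops that visit the same keys in the same order, one indexing with a modulo on every iteration and one advancing an index and resetting it. The function names and the demoKey type are hypothetical, and the wraparound loop is just one assumed way of avoiding the division, not necessarily what the updated benchmarks do.

```go
package main

import "fmt"

// demoKey is a hypothetical one-byte key type used only for this sketch.
type demoKey byte

// sumWithMod picks the key with i % len(keys) on every iteration,
// so each lookup also pays for an integer division.
func sumWithMod(m map[demoKey]int32, keys []demoKey, n int) int32 {
	var sink int32
	for i := 0; i < n; i++ {
		sink += m[keys[i%len(keys)]]
	}
	return sink
}

// sumWithWrap visits the same keys in the same order, but advances a
// separate index and resets it at the end of the slice instead of dividing.
func sumWithWrap(m map[demoKey]int32, keys []demoKey, n int) int32 {
	var sink int32
	j := 0
	for i := 0; i < n; i++ {
		sink += m[keys[j]]
		j++
		if j == len(keys) {
			j = 0
		}
	}
	return sink
}

func main() {
	keys := []demoKey{1, 2, 3}
	m := map[demoKey]int32{1: 10, 2: 20, 3: 30}
	fmt.Println(sumWithMod(m, keys, 1000) == sumWithWrap(m, keys, 1000))
}
```

When the map lookup itself takes only around 10-20ns, a division in the loop is a noticeable part of each iteration, which inflates the baseline and shrinks the percent change attributed to the lookup.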

The most interesting change here from the top comment might be MapSmallAccessHit/Key=smallType/Elem=int32/len=2, which above is a -28.98% win vs. a +2.33% loss here. (In the older flavor of the benchmarks here, two keys are looked up repeatedly in the exact same order which is presumably predictable, vs. in the run in the top comment above, the two keys are randomly shuffled and not predictable. It seems at least plausible that could explain the difference, with a predictable shorter loop in the old version doing better).


Also, I initially poked at the implementation in master a bit with perf on Linux, but have not yet had a chance to do so after I started making changes... And in general, please read any theorizing in either of these two comments as best guesses based on the results so far. 😅

@thepudds
Contributor Author

Finally, still using the old benchmarks (with predictable keys), if we compare master as of a week or so ago with and without the Swiss Table enabled, we can see that on these benchmarks the default Swiss Table in master does seem slower compared to the old runtime map (geomean of +9.31% worse).

In other words, there is a decent chance that master has a performance regression in these lookup benchmarks compared to Go 1.23, though I did not check Go 1.23 directly.

                                                    │ disable-swissmap │          master-8c3e391573           │
                                                    │      sec/op      │    sec/op     vs base                │
MapAccessHit/Key=smallType/Elem=int32/len=6-4             20.94n ±  0%   22.75n ±  0%   +8.65% (p=0.000 n=20)
MapAccessMiss/Key=smallType/Elem=int32/len=6-4            18.79n ±  2%   20.02n ± 24%   +6.52% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4        19.68n ±  0%   20.87n ±  0%   +6.05% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4        19.92n ±  0%   21.22n ±  0%   +6.55% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4        20.09n ±  0%   21.72n ±  0%   +8.11% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4        20.38n ±  0%   22.06n ±  0%   +8.24% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-4        20.57n ±  0%   22.42n ±  0%   +8.99% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-4        21.04n ±  0%   22.75n ±  0%   +8.13% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-4        21.37n ±  2%   23.00n ±  0%   +7.65% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-4        21.78n ±  0%   23.22n ±  0%   +6.61% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4       16.77n ±  0%   19.95n ±  0%  +18.93% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4       16.99n ±  0%   19.96n ±  0%  +17.51% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4       17.22n ±  0%   19.95n ±  0%  +15.85% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-4       17.79n ±  0%   19.99n ±  0%  +12.37% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-4       18.25n ±  0%   19.96n ±  0%   +9.40% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-4       18.77n ±  1%   19.97n ±  0%   +6.42% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-4       19.36n ± 13%   19.98n ± 20%   +3.20% (p=0.021 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-4       20.63n ±  1%   22.65n ± 12%        ~ (p=0.180 n=20)
geomean                                                   19.40n         21.21n         +9.31%

(Those are still the slower version of the benchmarks, which includes the somewhat expensive mod operation in the benchmark code, so the denominator on the percent change here is larger than in our first set of benchmarks above).

I'll add a caveat that I was juggling different machines and different versions, and apologies in advance if I crossed some wires here.

@thepudds thepudds changed the title runtime: speed up small Swiss Table map access for non-specialized keys runtime: performance regression for small Swiss Table map access for non-specialized keys Dec 15, 2024
@gopherbot
Contributor

Change https://go.dev/cl/634396 mentions this issue: internal/runtime/maps: speed up small map lookups ~1.7x for unpredictable keys

@dr2chase dr2chase added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Dec 17, 2024
@mknyszek mknyszek added this to the Go1.25 milestone Dec 18, 2024
@prattmic
Member

Thanks for the thorough investigation! I can indeed reproduce this using the existing benchmarks as you did in #70849 (comment).

goos: linux
goarch: amd64
pkg: runtime
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
                                                     │     tip     │              matchH2               │             matchH2-v3             │
                                                     │   sec/op    │    sec/op     vs base              │    sec/op     vs base              │
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-12    17.48n ± 1%   18.09n ±  1%  +3.43% (p=0.002 n=6)   17.99n ±  1%  +2.86% (p=0.002 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-12    17.88n ± 1%   18.07n ±  9%       ~ (p=0.065 n=6)   17.79n ±  1%       ~ (p=0.455 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-12    18.21n ± 2%   18.22n ±  3%       ~ (p=0.900 n=6)   17.94n ±  2%  -1.46% (p=0.048 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-12    18.60n ± 2%   18.11n ±  3%  -2.63% (p=0.009 n=6)   18.00n ±  5%  -3.20% (p=0.041 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-12    19.14n ± 3%   18.10n ±  2%  -5.46% (p=0.002 n=6)   17.91n ±  1%  -6.43% (p=0.002 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-12    19.53n ± 3%   18.15n ± 28%       ~ (p=0.058 n=6)   18.07n ±  4%  -7.50% (p=0.002 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-12    19.34n ± 2%   17.96n ± 19%       ~ (p=0.394 n=6)   18.08n ± 21%       ~ (p=0.065 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-12    19.64n ± 3%   18.22n ± 16%       ~ (p=0.065 n=6)   18.29n ± 15%       ~ (p=0.065 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-12   17.61n ± 2%   17.95n ±  3%  +1.96% (p=0.015 n=6)   17.93n ±  4%  +1.79% (p=0.017 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-12   17.82n ± 6%   18.11n ±  3%       ~ (p=0.240 n=6)   18.12n ±  2%       ~ (p=0.310 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-12   18.38n ± 2%   18.34n ±  2%       ~ (p=0.669 n=6)   18.11n ±  2%  -1.47% (p=0.015 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-12   18.73n ± 2%   18.21n ±  2%  -2.80% (p=0.002 n=6)   17.93n ±  1%  -4.30% (p=0.002 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-12   18.86n ± 3%   18.25n ±  2%  -3.24% (p=0.002 n=6)   18.07n ±  2%  -4.19% (p=0.002 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-12   19.05n ± 2%   18.46n ± 27%       ~ (p=0.394 n=6)   18.06n ± 27%       ~ (p=0.065 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-12   19.29n ± 1%   18.07n ±  1%  -6.35% (p=0.002 n=6)   19.54n ± 10%       ~ (p=1.000 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-12   19.59n ± 2%   18.15n ± 15%       ~ (p=0.065 n=6)   18.21n ±  6%  -7.02% (p=0.004 n=6)
geomean                                                18.68n        18.15n        -2.83%                 18.12n        -3.00%

matchH2 is a change equivalent to your CL. matchH2-v3 is the same with GOAMD64=v3 (slightly different SIMD instruction selection).

For reference, the same comparison against the old map implementation:

goos: linux
goarch: amd64
pkg: runtime
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
                                                     │   noswiss   │                tip                │              matchH2               │             matchH2-v3             │
                                                     │   sec/op    │   sec/op     vs base              │    sec/op     vs base              │    sec/op     vs base              │
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-12    16.68n ± 2%   17.48n ± 1%  +4.79% (p=0.002 n=6)   18.09n ±  1%  +8.39% (p=0.002 n=6)   17.99n ±  1%  +7.79% (p=0.002 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-12    17.04n ± 3%   17.88n ± 1%  +4.87% (p=0.002 n=6)   18.07n ±  9%  +5.98% (p=0.002 n=6)   17.79n ±  1%  +4.34% (p=0.002 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-12    17.16n ± 1%   18.21n ± 2%  +6.09% (p=0.002 n=6)   18.22n ±  3%  +6.18% (p=0.002 n=6)   17.94n ±  2%  +4.55% (p=0.002 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-12    17.58n ± 3%   18.60n ± 2%  +5.80% (p=0.002 n=6)   18.11n ±  3%  +3.01% (p=0.026 n=6)   18.00n ±  5%  +2.42% (p=0.026 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-12    18.27n ± 2%   19.14n ± 3%  +4.76% (p=0.002 n=6)   18.10n ±  2%       ~ (p=0.589 n=6)   17.91n ±  1%       ~ (p=0.093 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-12    18.64n ± 3%   19.53n ± 3%  +4.83% (p=0.004 n=6)   18.15n ± 28%       ~ (p=0.093 n=6)   18.07n ±  4%       ~ (p=0.063 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-12    18.72n ± 6%   19.34n ± 2%       ~ (p=0.065 n=6)   17.96n ± 19%       ~ (p=0.394 n=6)   18.08n ± 21%       ~ (p=0.065 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-12    19.32n ± 3%   19.64n ± 3%       ~ (p=0.240 n=6)   18.22n ± 16%       ~ (p=0.065 n=6)   18.29n ± 15%       ~ (p=0.065 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-12   16.80n ± 5%   17.61n ± 2%  +4.82% (p=0.013 n=6)   17.95n ±  3%  +6.88% (p=0.002 n=6)   17.93n ±  4%  +6.70% (p=0.002 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-12   17.16n ± 2%   17.82n ± 6%  +3.88% (p=0.002 n=6)   18.11n ±  3%  +5.60% (p=0.002 n=6)   18.12n ±  2%  +5.65% (p=0.002 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-12   17.56n ± 4%   18.38n ± 2%  +4.64% (p=0.002 n=6)   18.34n ±  2%  +4.44% (p=0.004 n=6)   18.11n ±  2%  +3.10% (p=0.026 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-12   17.80n ± 1%   18.73n ± 2%  +5.22% (p=0.002 n=6)   18.21n ±  2%  +2.27% (p=0.015 n=6)   17.93n ±  1%       ~ (p=0.180 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-12   18.21n ± 1%   18.86n ± 3%  +3.54% (p=0.002 n=6)   18.25n ±  2%       ~ (p=0.416 n=6)   18.07n ±  2%       ~ (p=0.699 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-12   18.61n ± 8%   19.05n ± 2%       ~ (p=0.331 n=6)   18.46n ± 27%       ~ (p=0.974 n=6)   18.06n ± 27%       ~ (p=0.132 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-12   18.61n ± 2%   19.29n ± 1%  +3.63% (p=0.004 n=6)   18.07n ±  1%  -2.95% (p=0.002 n=6)   19.54n ± 10%       ~ (p=1.000 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-12   19.20n ± 7%   19.59n ± 2%       ~ (p=0.087 n=6)   18.15n ± 15%       ~ (p=0.058 n=6)   18.21n ±  6%  -5.11% (p=0.024 n=6)
geomean                                                17.94n        18.68n       +4.14%                 18.15n        +1.19%                 18.12n        +1.02%

It's interesting since this linear scan was an improvement when I wrote https://go.dev/cl/611189, but a lot has changed since then so I'm not too surprised.

What CPU are you testing on? It's also interesting that you seem to see more extreme results than me. I'm on Intel(R) Xeon(R) W-2135 (Skylake).

I'll also try out your new benchmarks, I just haven't gotten around to it yet.

@thepudds
Contributor Author

thepudds commented Mar 6, 2025

What CPU are you testing on? It's also interesting that you seem to see more extreme results than me. I'm on Intel(R) Xeon(R) W-2135 (Skylake).

Hi @prattmic, sorry I missed that question, which came in while I was traveling for the holidays.

For the benchmark results in the comments above, this is what my stored copy of those benchmark results says for cpu:

goos: linux
goarch: amd64
pkg: runtime
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz

GCE console and gcc currently both say that benchmark VM is running on Cascade Lake.

Also, I probably should have mentioned the results in the comments above are for GOAMD64=v4, though I'm not sure how much of a difference that might have made.

(The results in the comments above are from a cloud VM, where I used GOAMD64=v4. Locally on my laptop I was using GOAMD64=v3 and saw fairly similar but not identical improvements. I could dig up the laptop results, but my laptop was probably not as quiescent as it could have been, so if there is interest in seeing GOAMD64=v3 results, I'd probably just re-run on a cloud VM).

I did not benchmark on arm or with SIMD disabled. I'm happy to do that if you think it would be useful, or to re-run the benchmarks with GOAMD64=v3, or to follow any other suggestions you have for next steps.

(Also, my take is this change here is mostly independent of whether or not we tweak the benchmarks as suggested in #70700, though I'd also be OK to treat them more together if you think that's better).

Thanks!

edit: originally in this comment, I referenced the results in the CL, but meant to reference the results in the comments above. edited to fix.

@thepudds
Contributor Author

thepudds commented Mar 9, 2025

Hi @prattmic, I ran the benchmarks on arm64 (GCE Axion), which means there is no SIMD in these results.

I ran both the new benchmarks (with mostly unpredictable keys) and the old benchmarks (with mostly predictable keys). I did not set GOARM64, which I think means it defaults to GOARM64=v8.0.

The short version is the arm64 results seem broadly similar to the amd64 GOAMD64=v4 results in the comments above, with a caveat for the arm64 results for small map access hit with predictable keys (discussed with second table below).

The observed improvement with the fix is better for the benchmarks with mostly unpredictable keys (-30.45% geomean for these arm64 results) than with mostly predictable keys (-16.23% geomean for these arm64 results), which I think is expected (and as discussed in the comments above and in the CL).


The first table here of arm64 results is comparing the new benchmarks (mostly unpredictable keys) without the fix vs. with the fix. (This is similar to the first table of results in the first comment above in #70849 (comment), which was for amd64 with a -28.69% geomean, vs. a -30.45% geomean here).

goos: linux
goarch: arm64
pkg: runtime
                                                       │ no-fix-new-bmarks-arm │          fix-new-bmarks-arm          │
                                                       │        sec/op         │    sec/op     vs base                │
MapAccessHit/Key=smallType/Elem=int32/len=6-4                    17.610n ±  0%   9.331n ±  0%  -47.01% (p=0.000 n=25)
MapAccessHitHot/Key=smallType/Elem=int32/len=6/Hot=1-4            8.840n ±  2%   9.282n ±  0%   +5.00% (p=0.001 n=25)
MapAccessHitHot/Key=smallType/Elem=int32/len=6/Hot=3-4           15.620n ±  1%   9.296n ±  1%  -40.49% (p=0.000 n=25)
MapAccessMiss/Key=smallType/Elem=int32/len=6-4                    8.555n ± 21%   5.817n ±  0%  -32.00% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4                8.821n ±  0%   9.296n ±  0%   +5.38% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4               13.040n ±  0%   9.297n ±  0%  -28.70% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4               15.360n ±  0%   9.287n ±  0%  -39.54% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4               15.950n ±  0%   9.291n ±  0%  -41.75% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-4               16.710n ±  0%   9.286n ±  0%  -44.43% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-4               17.010n ±  0%   9.327n ±  0%  -45.17% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-4               17.640n ±  0%   9.336n ±  0%  -47.07% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-4               18.140n ±  0%   9.379n ±  0%  -48.30% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4               8.488n ±  9%   6.795n ±  0%  -19.95% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4               8.473n ±  3%   6.795n ±  0%  -19.80% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4               8.444n ±  4%   6.796n ±  0%  -19.52% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-4               9.307n ±  9%   6.796n ±  0%  -26.98% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-4               8.487n ±  4%   6.797n ±  0%  -19.91% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-4               8.750n ± 18%   6.796n ±  0%  -22.33% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-4               9.510n ± 11%   6.795n ±  0%  -28.55% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-4               9.336n ±  7%   7.853n ± 13%  -15.88% (p=0.000 n=25)
geomean                                                           11.61n         8.076n        -30.45%

The next table here for arm64 results is comparing the old benchmarks (mostly predictable keys) without the fix vs. with the fix. (This is similar to the second table of results in the comments above in #70849 (comment), which was for amd64 with a -8.36% geomean, vs. a -16.23% geomean here).

For these results on arm64, the small map access hit with predictable keys (MapSmallAccessHit) is roughly +8% worse on average with the fix for the 8 different sizes, but the small map access miss (MapSmallAccessMiss) has a bigger improvement here compared to amd64.

goos: linux
goarch: arm64
pkg: runtime
                                                    │ master-8391579ece-arm │    just-fix-with-old-bmarks-arm     │
                                                    │        sec/op         │   sec/op     vs base                │
MapAccessHit/Key=smallType/Elem=int32/len=6-4                   8.826n ± 4%   9.633n ± 0%   +9.14% (p=0.000 n=25)
MapAccessMiss/Key=smallType/Elem=int32/len=6-4                  8.703n ± 3%   5.615n ± 9%  -35.48% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4              8.433n ± 0%   9.092n ± 1%   +7.81% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4              8.427n ± 0%   9.095n ± 0%   +7.93% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4              8.522n ± 0%   9.652n ± 0%  +13.26% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4              8.466n ± 2%   9.099n ± 0%   +7.48% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-4              8.681n ± 3%   9.573n ± 0%  +10.28% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-4              8.759n ± 2%   9.568n ± 0%   +9.24% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-4              8.913n ± 2%   9.563n ± 0%   +7.29% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-4              8.917n ± 1%   9.118n ± 0%   +2.25% (p=0.001 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4             8.449n ± 3%   5.419n ± 0%  -35.86% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4             8.386n ± 0%   5.417n ± 0%  -35.40% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4             8.394n ± 0%   5.644n ± 0%  -32.76% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-4             8.621n ± 5%   5.415n ± 0%  -37.19% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-4             8.484n ± 4%   5.628n ± 0%  -33.66% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-4             8.704n ± 3%   5.620n ± 0%  -35.43% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-4             8.437n ± 3%   5.624n ± 8%  -33.34% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-4             8.657n ± 3%   5.424n ± 7%  -37.35% (p=0.000 n=25)
geomean                                                         8.597n        7.202n       -16.23%

Finally, a few days ago in my immediately prior comment #70849 (comment), I said "For the benchmark results in the CL..." and then described the benchmark hardware, but I meant to refer to the benchmark results in the issue comments above (rather than the results in the CL). I'll edit that comment to be clearer, and I'll also separately update the description in the CL.

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Mar 24, 2025
@dmitshur dmitshur moved this from Todo to All-But-Submitted in Go Compiler / Runtime Mar 24, 2025
@github-project-automation github-project-automation bot moved this from All-But-Submitted to Done in Go Compiler / Runtime Mar 31, 2025
7 participants