runtime: performance regression for small Swiss Table map access for non-specialized keys #70849
If we instead compare master from a week or so ago (8c3e391) vs. just making the map lookup behavior change (but without changing benchmark behavior), we get:
Note that both columns from this run are generally slower overall than the first benchmarks presented above. This is likely for a couple of reasons, including that these benchmarks still have a somewhat expensive mod operation in the benchmark code itself. Also, I initially poked at the implementation in master a bit.
Finally, still using the old benchmarks (with predictable keys), if we compare master as of a week or so ago with and without the Swiss Table enabled, we can see that on these benchmarks the default Swiss Table in master does seem slower than the old runtime map (geomean +9.31% worse). In other words, there is a decent chance that master has a performance regression in these lookup benchmarks compared to Go 1.23, though I did not check Go 1.23 directly.
(Those are still the slower version of the benchmarks, which include the somewhat expensive mod operation in the benchmark code, so the denominator on the percent change here is larger than in our first set of benchmarks above.) I'll add a caveat that I was juggling different machines and different versions, and apologies in advance if I crossed some wires here.
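As an aside for readers following along, here is a minimal sketch (hypothetical benchmark code, not the actual benchmarks from #70700) of the mod-in-the-loop overhead being discussed, plus the usual power-of-two masking that avoids it:

```go
package sketch

import "testing"

// BenchmarkWithMod cycles through keys with a modulus. Because len(m) is a
// runtime value, the % lowers to an actual divide instruction, which adds
// meaningful per-iteration cost relative to the small-map lookup itself.
func BenchmarkWithMod(b *testing.B) {
	m := map[int64]int64{0: 0, 1: 1, 2: 2, 3: 3}
	n := int64(len(m))
	var sink int64
	for i := int64(0); i < int64(b.N); i++ {
		sink += m[i%n]
	}
	_ = sink
}

// BenchmarkWithMask does the same key cycling with a mask, which is valid
// here because the number of keys is a power of two; it costs a single AND.
func BenchmarkWithMask(b *testing.B) {
	m := map[int64]int64{0: 0, 1: 1, 2: 2, 3: 3}
	mask := int64(len(m) - 1)
	var sink int64
	for i := int64(0); i < int64(b.N); i++ {
		sink += m[i&mask]
	}
	_ = sink
}
```

Either way, the key sequence in this sketch is sequential and hence highly predictable, which is the other benchmark property discussed in this thread.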
Change https://go.dev/cl/634396 mentions this issue:
Thanks for the thorough investigation! I can indeed reproduce this using the existing benchmarks as you did in #70849 (comment).
For reference, the same comparison against the old map implementation:
It's interesting, since this linear scan was an improvement when I wrote https://go.dev/cl/611189, but a lot has changed since then, so I'm not too surprised. What CPU are you testing on? It's also interesting that you seem to see more extreme results than me. I'm on an Intel(R) Xeon(R) W-2135 (Skylake). I'll also try out your new benchmarks, I just haven't gotten around to it yet.
Hi @prattmic, sorry I missed that question, which came in while I was traveling for the holidays. For the benchmark results in the comments above, this is what my stored copy of those benchmark results says for cpu:
The GCE console and gcc currently both say that benchmark VM is running on Cascade Lake.

Also, I probably should have mentioned that the results in the comments above are from a cloud VM where I used GOAMD64=v4. I did not benchmark on arm or with SIMD disabled. I'm happy to do that if you think useful, or re-run benchmarks with other settings.

(Also, my take is this change here is mostly independent of whether or not we tweak the benchmarks as suggested in #70700, though I'd also be OK to treat them more together if you think that's better.)

Thanks!

edit: originally in this comment, I referenced the results in the CL, but meant to reference the results in the comments above. Edited to fix.
Hi @prattmic, I ran the benchmarks on arm64 (GCE Axion), which means there is no SIMD in these results. I ran both the new benchmarks (with mostly unpredictable keys) and the old benchmarks (with mostly predictable keys). I did not set GOARM64, which I think means it defaults to GOARM64=v8.0.

The short version is the arm64 results seem broadly similar to the amd64 GOAMD64=v4 results in the comments above, with a caveat for the arm64 results for small map access hit with predictable keys (discussed with the second table below). The observed improvement with the fix is better for the benchmarks with mostly unpredictable keys (-30.45% geomean for these arm64 results) than with mostly predictable keys (-16.23% geomean for these arm64 results), which I think is expected (and as discussed in the comments above and in the CL).

The first table here of arm64 results compares the new benchmarks (mostly unpredictable keys) without the fix vs. with the fix. (This is similar to the first table of results in the first comment above in #70849 (comment), which was for amd64 with a -28.69% geomean, vs. a -30.45% geomean here.)
The next table here of arm64 results compares the old benchmarks (mostly predictable keys) without the fix vs. with the fix. (This is similar to the second table of results in the comments above in #70849 (comment), which was for amd64 with a -8.36% geomean, vs. a -16.23% geomean here.)

For these results on arm64, the small map access hit with predictable keys (MapSmallAccessHit) is roughly +8% worse on average with the fix across the 8 different sizes, but the small map access miss (MapSmallAccessMiss) has a bigger improvement here compared to amd64.
Finally, a few days ago in my immediately prior comment #70849 (comment), I said "For the benchmark results in the CL..." and then described the benchmark hardware, but I meant to refer to the benchmark results in the issue comments above (rather than the results in the CL). I'll edit that comment to be clearer, and I'll also separately update the description in the CL.
Edit: Initially, I thought this was an "optimization opportunity" and wrote it up as such, but later realized it is seemingly a regression from Go 1.23 (comment). Changed title accordingly.
I was reading through the (excellent!) new Swiss Table implementation from #54766 and noticed a potential optimization opportunity for lookups with small maps (<= 8 elems) for the normal case of non-specialized keys.
I suspect lookups for these maps might be better off if, instead of the current linear scan, we build a match bitmap and jump to the candidate match. I mailed https://go.dev/cl/634396 with this change.
For predictable keys, it might be a small win or close to a wash, but for unpredictable keys, it might be a larger win.
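To make the idea concrete, here is a rough non-SIMD sketch (hypothetical code in the spirit of the CL, not the runtime's actual implementation) of the two strategies: the linear scan branches on each slot in turn, while the bitmap version compares all eight control bytes at once and visits only the candidate slots:

```go
package sketch

import "math/bits"

const groupSlots = 8

// linearScanLookup checks each of the 8 slots in order. Each iteration has a
// data-dependent branch, which tends to mispredict when keys are
// unpredictable. Assumes h2 is a 7-bit hash fragment and that control bytes
// for empty slots are a sentinel with the high bit set, so they never match.
func linearScanLookup(ctrl *[groupSlots]uint8, keys *[groupSlots]uint64, h2 uint8, key uint64) (int, bool) {
	for i := 0; i < groupSlots; i++ {
		if ctrl[i] == h2 && keys[i] == key {
			return i, true
		}
	}
	return -1, false
}

// bitmapLookup takes the 8 control bytes packed into one word (byte i is
// slot i), builds a match bitmap with SWAR tricks, and jumps straight to
// candidate slots.
func bitmapLookup(ctrlWord uint64, keys *[groupSlots]uint64, h2 uint8, key uint64) (int, bool) {
	// XOR against h2 broadcast into every byte: matching bytes become zero.
	v := ctrlWord ^ (uint64(h2) * 0x0101010101010101)
	// Standard "find zero byte" trick; sets the high bit of each zero byte.
	// It can produce rare false positives, but the key comparison below
	// filters those out.
	match := (v - 0x0101010101010101) &^ v & 0x8080808080808080
	for match != 0 {
		i := bits.TrailingZeros64(match) / 8 // candidate slot index
		if keys[i] == key {
			return i, true
		}
		match &= match - 1 // clear lowest candidate bit and continue
	}
	return -1, false
}
```

With predictable keys, the branchy scan already predicts well, so building the bitmap mostly helps in the iterations the scan would otherwise mispredict, which lines up with the small-win vs. larger-win split above.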
To better illustrate this (as well as to help with analyzing a couple other experiments I'm trying), I also updated the benchmarks (#70700) to shuffle the keys for the newer benchmarks, and also added a new "Hot" benchmark that repeatedly looks up a single key (Hot=1 below) or a small number of random keys (Hot=3 below).
The geomean here is -28.69%. These results are for amd64 and use the SIMD optimizations. I did not test on arm or without the SIMD optimizations.
For the miss benchmarks, I suspect the fact that a miss is happening is predictable in all cases, but in some cases the predictable miss is on a predictable key, while in others it is a predictable miss on an unpredictable key. For a table with a single group, like these all have, that probably doesn't matter too much.
For the hits and misses, these are the ones that I expect to have predictable keys:
(Those three are also listed above in the larger list, but pulling them out here for commentary).
The first has predictable keys because a single key is looked up repeatedly per run (with 6 elements in the small map). The other two have predictable keys because there is only a single element. In other words, the keys are shuffled in all ~20 of these benchmarks, but in these three, the shuffling is effectively a no-op.
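For concreteness, here is a rough sketch of the "Hot" benchmark shape described above (hypothetical names and sizes, not the actual code from #70700): a fixed set of hot keys is looked up in a pre-shuffled order, so with Hot=1 the key is fully predictable, while with Hot=3 the key sequence is not:

```go
package sketch

import (
	"math/rand"
	"testing"
)

// benchmarkSmallMapHot repeatedly looks up `hot` distinct keys, in a
// pre-shuffled order, in a small 6-element map. With hot == 1 the looked-up
// key is fully predictable; with hot == 3 the key sequence is not.
func benchmarkSmallMapHot(b *testing.B, hot int) {
	m := make(map[int64]int64, 6)
	for i := int64(0); i < 6; i++ {
		m[i] = i
	}
	// Precompute a power-of-two-length key sequence so the loop below can
	// use a cheap mask rather than the more expensive mod discussed earlier.
	const seqLen = 1024
	r := rand.New(rand.NewSource(1))
	keys := make([]int64, seqLen)
	for i := range keys {
		keys[i] = int64(r.Intn(hot)) // hot distinct keys, random order
	}
	var sink int64
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink += m[keys[i&(seqLen-1)]]
	}
	_ = sink
}

func BenchmarkSmallMapHot1(b *testing.B) { benchmarkSmallMapHot(b, 1) } // predictable key
func BenchmarkSmallMapHot3(b *testing.B) { benchmarkSmallMapHot(b, 3) } // unpredictable keys
```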
CC @prattmic