proposal: runtime: optionally allow callers to intern strings #5160
And I'd like to be able to look up the interned strings from a []byte or string. Related: issue #3512
You could do this in user space if the runtime provided weak pointers: pointers that are set to nil if the object to which they point is garbage collected. (I don't see why the runtime package has any advantage in making access to the map less contentious, but perhaps I am missing something.)
Even if the runtime provided weak pointers, I'm still not exactly sure how I would implement it. What would my map key be? I might also need the runtime to give me the string's hash value, since doing my own hash function would probably be too slow. The runtime could do a better job since it could use a less contentious hashmap (non-iterable, pushing locks down further, aware of resizes, etc.) without having to offer that to the language.
Agree with Brad: even if it uses weak pointers inside, it's very difficult to implement and optimize. It doesn't necessarily have to live in the runtime; it could be in strings/bytes but use some runtime helpers. Btw, wrt []byte, do you mean that there will be a bold caution comment saying that you must not mutate them?
Answered the []byte question here: https://golang.org/issue/3512?c=5
@bradfitz I made you a thing: https://github.com/josharian/intern/. It's not exactly what you asked for, and it's a kinda horrible abuse of sync.Pool. I fiddled with a bunch of non-short, non-simple alternatives and didn't like any of them.
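The trick can be sketched roughly like this (a paraphrase of the general idea, not the actual source of github.com/josharian/intern): (ab)use sync.Pool as a holder for map[string]string interning tables, so that the runtime's behavior of clearing pools at GC keeps the tables from growing without bound.

```go
package main

import (
	"fmt"
	"sync"
)

// pool holds interning maps. sync.Pool is (ab)used here because the
// runtime clears pools during GC, so the table cannot leak forever.
var pool = sync.Pool{
	New: func() interface{} { return make(map[string]string) },
}

// Intern returns a canonical copy of s: repeated calls with equal
// contents usually return the same backing string, so duplicate
// allocations can be garbage collected.
func Intern(s string) string {
	m := pool.Get().(map[string]string)
	if c, ok := m[s]; ok {
		pool.Put(m)
		return c
	}
	m[s] = s
	pool.Put(m)
	return s
}

func main() {
	a := Intern(string([]byte("hello")))
	b := Intern(string([]byte("hello")))
	fmt.Println(a == b) // equal contents; backing data is typically shared
}
```

Note the caveat implied by the pool: canonicalization is best-effort, since a GC between the two calls can drop the table and a fresh copy will be stored.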
@josharian thanks for https://github.com/josharian/intern! In one of our internal apps that has to parse and process a lot of JSON (with a lot of repeated string values), we were able to slash memory usage by a significant amount (30% on the whole app, >75% just for the in-memory data coming from the parsed JSON) by using it. I know I'm jumping the gun on this one, but +1 for having interning, ideally even as an option when unmarshaling serialized payloads, i.e. something along the lines of:
(the rationale for doing interning during unmarshaling is this: consider the case of the
@CAFxX I'm pleasantly surprised that it ended up being useful. I think prior to exposing any API, it’d be worth experimenting with doing this automatically in the runtime. Do it:
I can't predict offhand how that would turn out, but it seems a simple enough experiment to run.
Full disclosure: those JSON structures we parse, and that contain significant amounts of repeated strings, are kept in memory as a form of local LRU cache to avoid lookups on a remote API. So this is probably a very favorable scenario for interning. At the same time, it's a scenario that is common enough to probably warrant consideration.
Absolutely agree, as long as this won't significantly impact regular string ops. I'm not an expert in GC/runtime internals, but it would be neat (not for the PoC) if this form of automatic interning was done only for strings that are either likely to survive GC or likely to be already in the interning table. And ideally the interning table should not keep strings alive during GC (like you do in josharian/intern).

Random idea for future consideration if this ever gets implemented: maybe interning could be done in multiple steps:
Just as a data point, in our case most of the strings were GUIDs... so it would have helped if the upper limit for "very small" was at least 36 bytes.
Change https://golang.org/cl/141641 mentions this issue:
CL 141641 is a quick-and-dirty experiment with doing this in the runtime.
@josharian I actually also tried (and got a little bit farther: I also hooked up the GC in
Cool! I recommend trying a handful of std package benchmarks as well as go1; the go1 benchmarks can be narrow in scope. And of course definitely run the runtime string benchmarks. :)
@CAFxX see also discussion with @randall77 in
I have pushed my current experiment to my fork (see master...CAFxX:intern). It is not pretty, as it's meant to be an experiment; specifically, there's a lot of duplication that would need to be sorted out. I'm also not 100% sure the GC part is done correctly (specifically, is it correct to assume that we don't need locking there?).

What is implemented:
The magic numbers 1024 and 64 are arbitrary; I haven't really attempted any tuning. Performance is a mixed bag: some benchmarks improve, others slow down. Allocation benchmarks strictly improve. I suspect some of the regressions could be avoided/mitigated by refactoring some logic in the standard library.

I read @randall77's comments and I can say that yes, we can try to start smaller (e.g. only do interning in encoding/json and then move from there). I still stand by what I suggested above, i.e. that if the runtime does the right thing with no knobs, it's better.
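As a user-space analogue of what a runtime-side table with those constraints might look like (the actual patch lives in master...CAFxX:intern; the constants below just mirror the arbitrary magic numbers mentioned above): a small fixed-size, direct-mapped table that only interns short strings, trading hit rate for O(1) lookups and strictly bounded memory.

```go
package main

import (
	"fmt"
	"hash/maphash"
)

const (
	tableSize = 1024 // number of slots (magic number, untuned)
	maxLen    = 64   // only intern strings up to this length (magic number, untuned)
)

var (
	seed  = maphash.MakeSeed()
	table [tableSize]string
)

// intern returns a canonical copy of s. The table is direct-mapped:
// a colliding entry simply gets evicted, so memory stays bounded and
// no locking of a growing map is needed.
func intern(s string) string {
	if len(s) == 0 || len(s) > maxLen {
		return s
	}
	i := maphash.String(seed, s) % tableSize
	if table[i] == s {
		return table[i] // hit: reuse the cached backing data
	}
	table[i] = s // miss or collision: overwrite the slot
	return s
}

func main() {
	a := intern(string([]byte("us-east-1")))
	b := intern(string([]byte("us-east-1")))
	fmt.Println(a == b) // equal contents; second call reuses the first copy
}
```

The eviction-on-collision policy is also roughly why results are workload-dependent: hot repeated strings win, while diverse strings pay the lookup cost for nothing.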
I managed to improve things a bit (removed code duplication and lowered the overhead for when interning is not supposed to happen) and now there are almost no regressions, apart from 3 benchmarks that I still need to analyze in detail (they could very well be artifacts). These are the current results.

I plan to run some additional benchmarks simulating our production workloads to validate the results. If you have suggestions about additional benchmarks to run, please let me know. It would also be advisable for somebody else to run the benchmarks on different hardware. I pushed the latest version here: master...CAFxX:intern

Assuming we manage to contain all regressions, would such an approach (which has the benefit of not exposing new APIs or modifying existing ones) be considered for inclusion?
Given the recent flurry of activity, and given that this is a significant change, and given that there are a variety of options (automatic interning, package runtime API, etc.), I've changed this to a proposal so that we can get some input from @golang/proposal-review.
@josharian What is the exact API proposal? I see a very long issue thread here and lots of links, but could you please post a note here with the API in the text of the note?
OK, it looks like https://github.com/josharian/intern/blob/master/intern.go is the API but I don't see what the proposed implementation is. (Probably not what's in that file.) Please help me.
@rsc there are several different paths we could take here. I'm not confident that I have clarity about optimal path(s) forward; I'm hoping to restart discussion.

One option is to automatically intern some strings. This has the benefit of no new API, but the downside of making program performance harder to predict. @CAFxX has experimented with this a bunch (above).

If we are going to do manual interning, this looks a lot like free lists and sync.Pool. If you have a way to handle string interning with the correct lifetime and control over concurrency, you should just do it there (much like with free lists). The harder case is where you don't have a good way to manage the string pool's lifetime or concurrency.

For the easy case, we might consider adding to package strings:

```go
type Interner struct {
	// ...
}

func (i *Interner) Intern(s string) string
func (i *Interner) InternBytes(b []byte) string
```

The argument for adding API for the easy case is that, unlike with free lists, the implementation is non-obvious and non-portable (it requires specific compiler optimization #3512). (Maybe that means it belongs in package runtime.)

The harder case needs to handle lifetime as well as concurrency; the analogy with sync.Pool suggests that package

Does this help, Russ?
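A minimal single-goroutine implementation of the easy-case API above might look like this (a sketch, not a proposed final implementation). InternBytes leans on the compiler optimization from #3512: the compiler recognizes m[string(b)] and avoids materializing the string on the lookup, so the hit path is allocation-free.

```go
package main

import "fmt"

// Interner deduplicates strings. The zero value is ready to use. It is
// not safe for concurrent use: the "easy case" leaves lifetime and
// concurrency to the caller.
type Interner struct {
	m map[string]string
}

// Intern returns the canonical copy of s, storing s on first sight.
func (i *Interner) Intern(s string) string {
	if c, ok := i.m[s]; ok {
		return c
	}
	if i.m == nil {
		i.m = make(map[string]string)
	}
	i.m[s] = s
	return s
}

// InternBytes interns the contents of b. On a hit, m[string(b)] does
// not allocate (compiler optimization, issue #3512); only the miss
// path pays for one string copy.
func (i *Interner) InternBytes(b []byte) string {
	if c, ok := i.m[string(b)]; ok {
		return c
	}
	s := string(b) // miss: one allocation, then cached
	if i.m == nil {
		i.m = make(map[string]string)
	}
	i.m[s] = s
	return s
}

func main() {
	var in Interner
	fmt.Println(in.InternBytes([]byte("go")) == in.Intern("go")) // true
}
```

This is exactly why the API is argued to be non-obvious and non-portable: written naively (converting b to a string before the lookup), InternBytes would allocate on every call, and nothing in the language spec promises the optimization.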
DO NOT SUBMIT

Updates golang#5160

This is the simplest thing I could think of to experiment with. It's incomplete: it doesn't cover all the ways to create strings, it's not optimized at all, and I picked the magic numbers haphazardly. Nevertheless, it doesn't do too badly when applied to the compiler:

```
name        old time/op       new time/op       delta
Template    187ms ± 3%        184ms ± 2%        -1.47%  (p=0.000 n=97+95)
Unicode     86.9ms ± 5%       87.3ms ± 4%       ~       (p=0.065 n=99+94)
GoTypes     658ms ± 2%        659ms ± 2%        ~       (p=0.614 n=99+97)
Compiler    2.94s ± 2%        2.94s ± 2%        ~       (p=0.945 n=95+95)
SSA         8.53s ± 1%        8.54s ± 1%        ~       (p=0.276 n=97+98)
Flate       125ms ± 3%        124ms ± 4%        -0.78%  (p=0.000 n=99+97)
GoParser    149ms ± 4%        149ms ± 3%        ~       (p=0.595 n=100+95)
Reflect     410ms ± 3%        412ms ± 4%        +0.48%  (p=0.047 n=97+96)
Tar         167ms ± 3%        166ms ± 5%        ~       (p=0.078 n=99+98)
XML         227ms ± 3%        227ms ± 4%        ~       (p=0.723 n=96+95)
[Geo mean]  388ms             387ms             -0.17%

name        old user-time/op  new user-time/op  delta
Template    223ms ± 3%        221ms ± 3%        -0.78%  (p=0.000 n=99+95)
Unicode     109ms ± 8%        111ms ± 6%        +1.36%  (p=0.001 n=99+99)
GoTypes     846ms ± 2%        848ms ± 2%        ~       (p=0.092 n=97+97)
Compiler    3.91s ± 2%        3.91s ± 2%        ~       (p=0.666 n=97+96)
SSA         12.1s ± 2%        12.1s ± 1%        ~       (p=0.128 n=92+96)
Flate       145ms ± 3%        144ms ± 4%        ~       (p=0.157 n=93+99)
GoParser    180ms ± 5%        181ms ± 5%        +0.63%  (p=0.004 n=90+94)
Reflect     522ms ± 3%        524ms ± 4%        ~       (p=0.055 n=95+96)
Tar         203ms ± 5%        203ms ± 5%        ~       (p=0.880 n=100+99)
XML         280ms ± 4%        281ms ± 4%        ~       (p=0.170 n=99+98)
[Geo mean]  487ms             488ms             +0.19%

name        old alloc/op      new alloc/op      delta
Template    36.3MB ± 0%       36.2MB ± 0%       -0.23%  (p=0.008 n=5+5)
Unicode     29.7MB ± 0%       29.7MB ± 0%       -0.07%  (p=0.008 n=5+5)
GoTypes     126MB ± 0%        126MB ± 0%        -0.27%  (p=0.008 n=5+5)
Compiler    537MB ± 0%        536MB ± 0%        -0.21%  (p=0.008 n=5+5)
SSA         2.00GB ± 0%       2.00GB ± 0%       -0.12%  (p=0.008 n=5+5)
Flate       24.6MB ± 0%       24.6MB ± 0%       -0.28%  (p=0.008 n=5+5)
GoParser    29.4MB ± 0%       29.4MB ± 0%       -0.20%  (p=0.008 n=5+5)
Reflect     87.3MB ± 0%       86.9MB ± 0%       -0.52%  (p=0.008 n=5+5)
Tar         35.6MB ± 0%       35.5MB ± 0%       -0.28%  (p=0.008 n=5+5)
XML         48.4MB ± 0%       48.4MB ± 0%       -0.16%  (p=0.008 n=5+5)
[Geo mean]  83.3MB            83.1MB            -0.23%

name        old allocs/op     new allocs/op     delta
Template    352k ± 0%         347k ± 0%         -1.16%  (p=0.008 n=5+5)
Unicode     341k ± 0%         339k ± 0%         -0.76%  (p=0.008 n=5+5)
GoTypes     1.28M ± 0%        1.26M ± 0%        -1.48%  (p=0.008 n=5+5)
Compiler    4.97M ± 0%        4.90M ± 0%        -1.38%  (p=0.008 n=5+5)
SSA         15.6M ± 0%        15.3M ± 0%        -2.11%  (p=0.016 n=4+5)
Flate       233k ± 0%         228k ± 0%         -1.92%  (p=0.008 n=5+5)
GoParser    294k ± 0%         290k ± 0%         -1.32%  (p=0.008 n=5+5)
Reflect     1.04M ± 0%        1.03M ± 0%        -1.73%  (p=0.008 n=5+5)
Tar         343k ± 0%         337k ± 0%         -1.62%  (p=0.008 n=5+5)
XML         432k ± 0%         426k ± 0%         -1.28%  (p=0.008 n=5+5)
[Geo mean]  813k              801k              -1.48%
```

Change-Id: I4cd95bf4a74479b0e8a8d339d77e248d1467a6e0
Now that the compiler is better about not copying for string(b) in various cases, is there any reason this needs to be in the standard library? Is there any runtime access it really needs? @bradfitz says he'd be OK with closing this.
|
It still seems too specialized. You can do interning with a plain map[string]string today, but then you have to know when to drop the map. It sounds like this API contributes "you don't have to think about when to drop the map". But then the runtime does. That's a lot of work, and it's unclear that we know how to do it well. We're not even managing sync.Pool that well. This is much more complicated, since it essentially depends on weak-reference logic to decide whether a string is needed or not. I'm skeptical we should spend the time to get this right versus other work we could be doing. Is it really worth the cost of a general solution here, versus people who have this need just using a map[string]string?

Brad suggests maybe there's some way to "just mark these intern-string pools last" in order to automatically evict strings from the pool once all the references are gone, but it's not obvious to me how to implement that with the distributed mark termination detection and lazy sweeping. It seems like it would possibly add a lot of complexity that would otherwise go unused. /cc @RLH @aclements
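For "you have to know when to drop the map," one purely user-space policy (a sketch of one possible answer, not something proposed in this thread) is generational rotation: keep two maps and rotate them when the current one grows past a limit, so entries that are not re-interned within a generation become garbage with no runtime or weak-reference support at all.

```go
package main

import "fmt"

// genInterner drops entries that go unused for a full generation: a
// poor man's answer to "when do you drop the map", needing no help
// from the runtime.
type genInterner struct {
	cur, prev map[string]string
	limit     int // rotate generations when cur grows past this
}

func newGenInterner(limit int) *genInterner {
	return &genInterner{
		cur:   make(map[string]string),
		prev:  make(map[string]string),
		limit: limit,
	}
}

func (g *genInterner) intern(s string) string {
	if c, ok := g.cur[s]; ok {
		return c
	}
	if c, ok := g.prev[s]; ok {
		g.cur[c] = c // promote survivors into the current generation
		return c
	}
	g.cur[s] = s
	if len(g.cur) > g.limit {
		// Rotate: anything left only in the old generation is dropped
		// and becomes collectible.
		g.prev = g.cur
		g.cur = make(map[string]string)
	}
	return s
}

func main() {
	g := newGenInterner(1024)
	fmt.Println(g.intern("x") == g.intern("x")) // true
}
```

This trades precision for simplicity: eviction is driven by table size rather than by actual reachability, which is roughly the objection above in miniature.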
Re-reading this: in an ideal world, we'd be able to build something sophisticated and sync.Pool-like as a third-party package. I tried this, using the finalizer hack from @CAFxX to detect GC cycles. The performance was an order of magnitude worse than my sync.Pool-based implementation, due to mutex overhead; we need #18802.

I'm going to close this proposal. For the simple case, it is easy enough to build yourself or have as a third-party package. For the harder case, I think we should provide the tools for people to build this well themselves: #18802 and #29696. As for doing it automatically in the runtime, I guess we'll just decide on a CL-by-CL basis.