[RFC] Breaking basic_format_string's ABI for performance improvements

This proposal describes a performance improvement for C++20’s std::format and similar functions. The improvement involves an ABI break, and I’d like feedback on how we want to tackle that ABI break in libc++.

Abstract

Something that has bothered me for a long time is that std::format validates its format-string both at compile-time and at runtime. The compile-time parsing information is not available at runtime, so we re-parse and re-validate the format-string again at runtime. The validation is useless: we already know the format-string is valid. The parsing also feels wasteful: we had that information during compilation and now calculate it again. I found a solution to this problem, but it requires an ABI break for the basic_format_string class. With the change, the compile-time parser stores data in a cache inside basic_format_string, and at runtime this cache is used to format the arguments. There are three approaches to this ABI break:

  1. Use the libc++ unstable ABI
  2. Use the libc++ stable ABI, but offer an opt-out
  3. Implement the change unconditionally

Option 1 is non-controversial, but means the feature is only available to people who opt in to the ABI break. In practice this means very few people will benefit from this change.
Option 2 is controversial, but allows users to opt-out of the ABI break.
Option 3 is controversial, but I think it would be worth the effort.

Below I’ll explain why I think option 3 is worth the effort. The last part of the post goes deeper into the details of the changes I plan to make. I’ve created a draft PR with a PoC.

Motivation

The PoC only optimizes the parsing part. The formatting is not optimized, so that part is slower than the original. This means the performance of std::format(""); goes from 6.18 ns to 21.6 ns with this PoC. Similarly, std::format("{}", 42); goes from 47.6 ns up to 60.5 ns. For the real patch I do not consider this performance degradation acceptable.

Looking at the case of std::format("{}", 42);, with a different number of characters before the “{}” in the format-string the performance is:

Prefix  Before [ns] After [ns]
0        47.6        60.5
5        51.1        64.1
10       61.6        59.8
20       86.9        57.6
40      158          81.1

The after values seem a bit noisy. Still, the pattern of the cached version is clear: it starts slower, reaches similar performance around 10 prefix characters, and is a clear win starting at 20 characters. The same pattern shows up in the other parts of the benchmark too.

For formatting a plain string the numbers are:

Characters  Before [ns] After [ns]
0             6.18      21.6 
5            16.5       26.0 
10           24.9       26.0
20           52.7       29.3
40          116         37.5

Here the caching seems more effective: in the first example the new version takes about 1/2 of the time, and in this example it takes about 1/3 of the time. (Note that formatting a plain string is silly; however, std::print uses the same code, and printing a plain string is common.)

So this shows caching the parsed result can improve the runtime performance.

The ABI break

The performance benefit comes from adding a cache to the std::basic_format_string class. Since this changes the size of the object it is considered an ABI break.
The class std::basic_format_string used to be an implementation detail, but has been changed to a standard-provided type. Objects of this type are implicitly created when calling std::format; these are temporaries that are destroyed when the call to std::format finishes. Users can also create std::basic_format_string objects themselves and use them to call std::format. In that case the object can cross an ABI boundary. Searching on GitHub found:

  • 153 use cases of std::basic_format_string, most of which are Standard library implementations. Two notable projects were found:
    • spdlog, a fast logger
    • BDE, a “boost-like project” provided by Bloomberg
      Both projects use it like std::format, so there are no possible ODR violations inside these projects.
  • 165 use cases of std::wformat_string, used similarly to std::basic_format_string.
  • 1.5K usages of std::format_string, used similarly to std::basic_format_string.

Most of these use cases wrap std::format and friends, i.e. they use the object as a temporary. I only found 2 cases where an object was stored, which may cause an ODR violation.

So at the moment there are fewer than 2K usages of this type, and only a small fraction of these use cases may have issues. In general the adoption of std::format is not very high yet. These are the number of times formatting functions were found in code on GitHub:

  • std::format 90.6 K
  • fmt::format 1.9 M (the library std::format is based upon)
  • sprintf 18 M (this includes C projects)
  • snprintf 4.4 M
  • printf 61.3 M

Which ABI approach to take?

First of all, at the moment there is only a PoC, not a complete implementation. I’d like to know the end goal of the change, since this will influence how I write the patches to implement it. The change will take quite a bit of time to complete, so it may not be done before LLVM 21. The approaches for the various options are:

  1. Make the feature conditionally available. Opt-in in ABI V1, opt-out in the Unstable ABI. This involves:
    • Add a new ABI flag _LIBCPP_ABI_BASIC_FORMAT_STRING_CACHE,
    • add this flag to _LIBCPP_ABI_VERSION,
    • the new feature is added with #ifdef in the existing class.
      If we get new ABI breaking ideas in 5 years we can apply them.
  2. Make the feature conditionally available. Opt-out in ABI V1 and the Unstable ABI. This involves:
    • Add a new ABI flag _LIBCPP_ABI_BASIC_FORMAT_STRING_CACHE,
    • add the new features with #ifdef in the existing class,
    • once the cache is fully complete we make the new ABI the default and offer an opt out.
      If we get new ABI breaking ideas in 5 years we need to evaluate whether these are worth another ABI break at that time.
  3. Replace the existing code with the new implementation. This involves:
    • Add a define _LIBCPP_HAS_EXPERIMENTAL_BASIC_FORMAT_STRING, similar to _LIBCPP_HAS_EXPERIMENTAL_LIBRARY,
    • copy the existing class and use the _LIBCPP_HAS_EXPERIMENTAL_BASIC_FORMAT_STRING to add the new features in the copy,
    • once the cache is fully complete we delete the original version and remove the _LIBCPP_HAS_EXPERIMENTAL_BASIC_FORMAT_STRING flag.
      If we get new ABI breaking ideas in 5 years we need to evaluate whether these are worth another ABI break at that time.

All options add a set of #ifdefs to the code. Options 1 and 2 keep these forever; option 3 has them only temporarily.

My preference and motivation

  • Option 3:
    • The ABI impact is minimal or non-existent.
    • The change feels in line with C++'s zero-overhead principle.
    • The class basic_format_string is rather new, and the C++ committee had no issue changing how format worked for several years after it shipped:
      • The original C++20, finished in February 2020, did not have this class.
      • P2216R3 “std::format improvements” added the exposition-only basic-format-string as a DR against C++20 in June 2021.
      • P2508R1 “Expose std::basic-format-string<charT, Args…>” added basic_format_string as a DR against C++20 in July 2022; it has been available since libc++ 15 (released in September 2022).
  • Option 2:
    • This adds a bit of maintenance overhead.
    • Still has the benefits of option 3.
  • Option 1:
    • This adds a bit of maintenance overhead.
    • Means the benefits are available, but it’s unknown how many users will be aware of and opt in to these changes.

Details of the changes

Caching strategy

The C++ formatting library has two types of formatting functions: the normal version and the v-version. This is similar to the approach of printf and vprintf. The signatures are:

  template<class... Args>
    std::string std::format(std::format_string<Args...> fmt, Args&&... args);

  std::string std::vformat(std::string_view fmt, format_args args);

The function vformat parses the fmt argument and writes the formatted output at runtime. When the fmt argument is incorrect it throws an exception.

The function format is implemented as:

  return std::vformat(fmt.str, std::make_format_args(args...));

So this does the same as vformat. However, the format_string does something more. A format_string object looks (slightly simplified) like:

 template<class... Args>
 struct format_string {
   string_view str;
   consteval format_string(string_view s);
 };

Note the constructor is consteval, so it is always evaluated at compile-time. This constructor parses the argument s. When the argument s is incorrect, the code is ill-formed and thus won’t compile.
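As a hedged illustration of this compile-time validation (not libc++'s parser): a simplified checker that rejects unmatched braces during constant evaluation. For portability of the sketch it uses constexpr; the real constructor is consteval and validates far more.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string_view>

// Hypothetical sketch: verify every '{' has a matching '}' and count the
// replacement-fields. The real parser also validates each format-spec
// against the argument types.
constexpr std::size_t count_replacement_fields(std::string_view fmt) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < fmt.size(); ++i) {
    if (fmt[i] == '{') {
      if (i + 1 < fmt.size() && fmt[i + 1] == '{') { // escaped "{{"
        ++i;
        continue;
      }
      while (i < fmt.size() && fmt[i] != '}')
        ++i;
      if (i == fmt.size()) // throwing makes constant evaluation fail,
        throw std::logic_error("unmatched '{'"); // i.e. a compile error
      ++count;
    }
  }
  return count;
}

// Evaluated at compile time; an invalid format-string would not compile.
static_assert(count_replacement_fields("The answer is {}.") == 1);
static_assert(count_replacement_fields("{0:x} and {0:b}") == 2);
```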

So format has the following effects:

  • parses the fmt argument at compile-time
  • parses the fmt argument at runtime
  • writes the formatted output

My gripe with this approach is that the compile-time parsing knows exactly how to format fmt, yet all that information is lost, just to be rediscovered at runtime. This feature adds a cache to format_string to record the information.
For example, std::format("hello world"); will:

  • compile-time
    • parse “hello world”
    • create one cache entry “write string” + std::string_view to the just parsed data
  • runtime
    • process cache entry 1 and write the string “hello world”

For example, std::format("The answer is {}.", 42); will

  • compile-time
    • parse "The answer is "
    • create one cache entry “write string” + std::string_view to the just parsed data
    • parse “{}”
    • create one cache entry “format argument” + index 0
    • parse “.”
    • create one cache entry “write string” + std::string_view to the just parsed data
  • runtime
    • process cache entry 1 and write the string "The answer is "
    • process cache entry 2, format the value 42, and write “42”
    • process cache entry 3 and write the string “.”

The compile-time parser only knows that the type of the first argument is an integer, so it can’t store the formatted value; the argument can be a function call that is evaluated at runtime. The cache stores all information about the format argument, so if fmt had been {:x} instead, the formatter would have written “2a” (hex) instead of “42”.
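The replay step described above can be sketched as follows; the entry_kind/cache_entry names and layout are hypothetical illustrations, not libc++'s actual representation:

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical cache entry: either "write this literal text" or
// "format argument number N".
enum class entry_kind { write_string, format_argument };

struct cache_entry {
  entry_kind kind;
  std::string_view text;  // valid for write_string entries
  std::size_t arg_index;  // valid for format_argument entries
};

// Runtime replay: walk the cache instead of re-parsing the format-string.
// The arguments themselves are still formatted at runtime; for simplicity
// this sketch takes them already formatted.
std::string replay(const std::vector<cache_entry>& cache,
                   const std::vector<std::string>& formatted_args) {
  std::string out;
  for (const cache_entry& e : cache) {
    if (e.kind == entry_kind::write_string)
      out += e.text;
    else
      out += formatted_args[e.arg_index];
  }
  return out;
}
```

For std::format("The answer is {}.", 42); the cache would hold the three entries listed above, and replaying with the formatted argument "42" yields "The answer is 42.".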

There is a caveat: there are two kinds of format arguments, “builtin types” and “handle types”. All “builtin types” have formatters of the same size that behave “the same”. Formatters of “handle types” may work completely differently. For example, the chrono header has a formatter for seconds that can be invoked as std::format("{:%H:%M:%S}", 10'000s);. The format string parses “%H:%M:%S” using a strftime-like parser. The standard does not specify how to implement this formatter. Some valid options are: using a string_view member “storing” “%H:%M:%S”, or using a string member storing “%H:%M:%S”. (The latter is not efficient, but it is a valid implementation.)

There are several approaches to handling format-strings that use “handle formatters”:

  1. Have more space in a cache entry and always store the formatter. Since we need to hard-code a size, this only works in some cases.
  2. Only store formatters that have the same size as the formatters for the “builtin types”. This is useful when users derive their formatter from a formatter of one of the “builtin types” and change the format function to forward a member of their type to the base formatter.
  3. Parse the replacement-field at compile-time for validation only, and parse and format the replacement-field at runtime. This approach is not efficient, but it is the status quo.
  4. Drop the cache and fall back to the current behaviour. Again not efficient, but it matches the status quo.

The PoC implements option 4.
I expect option 3 to be feasible; the real implementation should at least do this.
Option 2 means we need to store a type-erased formatter at compile-time. I need to investigate whether this is possible. (The handle class does something similar, but it stores a type-erased pointer to the value; here we need to store the entire object.) Another issue with option 2 is that a “handle formatter” can have an allocating member, for example the std::string member of the seconds formatter. Memory allocated at compile-time needs to be freed at compile-time, which makes it impossible to store that implementation of a seconds formatter.
If option 2 is feasible I would like to look at option 1. The Standard library has chrono and range-based formatters, which are larger than those of the “builtin types”. However, I have no benchmarks showing the performance improvement, and this approach has the disadvantage that every element in the cache becomes larger.
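Option 2’s same-size requirement can be illustrated with a sketch. Here builtin_int_formatter and point_formatter are hypothetical stand-ins; real code would derive from std::formatter<int> instead:

```cpp
#include <string>

// Hypothetical stand-in for a "builtin type" formatter such as
// formatter<int>; it carries the parsed format-spec as state.
struct builtin_int_formatter {
  char presentation = 'd';
  std::string format(int value) const { return std::to_string(value); }
};

struct point {
  int x, y;
};

// A user formatter that derives from the builtin one and only forwards
// members of its type to the base formatter. It adds no state of its own,
// so it has the same size as the base and would fit in a fixed-size
// cache slot (option 2).
struct point_formatter : builtin_int_formatter {
  std::string format(point p) const {
    return builtin_int_formatter::format(p.x) + "," +
           builtin_int_formatter::format(p.y);
  }
};

static_assert(sizeof(point_formatter) == sizeof(builtin_int_formatter),
              "a forwarding formatter adds no state");
```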

Cache size

Another item to consider is the cache size. Depending on the type of the entry, the required size differs. The largest entry is for a formatter, which stores:

  • a type tag, an enum without a fixed underlying type; due to alignment requirements this takes a 32-bit value,
  • the index of the format argument to process, a 32-bit value,
  • the formatter, whose size is 128 bits.

This is 192 bits or 24 bytes per entry. It is possible to use 16-bit types for the type and index fields, saving 4 bytes. (With a 16-bit index we can support 65K cache entries.) However, we also need to store a type and a string_view, where the alignment of the string_view may mean we still need 192 bits. In that case we can look at parallel arrays to reduce the padding overhead:

  • 128-bit for formatter/string_view
  • 16-bit for arg-id
  • 8-bit for the types

In this case we need 152 bits or 19 bytes per entry, plus possibly some final padding. We could even consider using some bits of the arg-id for the type information (this would need performance benchmarking). For now, assume the cost per cache entry will be 20 bytes in the binary’s data segment.
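The byte accounting above can be sketched with parallel arrays; the field names and the 16-byte payload slot are hypothetical, not the real layout:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical parallel-array cache layout. A single array of structs
// would pad each entry to 24 bytes; splitting the fields into parallel
// arrays avoids most of that padding.
template <std::size_t N>
struct format_cache {
  alignas(8) unsigned char payload[N][16];  // formatter or string_view, 128-bit
  std::uint16_t arg_id[N];                  // 16-bit argument index
  std::uint8_t kind[N];                     // 8-bit entry type
};

// 16 + 2 + 1 = 19 bytes of data per entry; with alignment the whole cache
// stays within roughly 20 bytes per entry.
static_assert(sizeof(format_cache<8>) <= 8 * 20,
              "parallel arrays keep the per-entry cost around 20 bytes");
```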

The question is how many entries we need to store. Take these calls:

  std::format("The answer is {}.", 42);
  std::format("The answer is {0}, "
              "or {0:o} for those who favour octal, "
              "the hexadecimal lovers get {0:x}, and "
              "last but not least {0:b} for those who prefer binary.", 42);

Both these calls instantiate std::format_string<int>. The first call needs 3 cache entries, the second call needs 9 entries. However, since these calls instantiate the same class, they have the same cache size. If we pick 3 entries it means:

  • It’s optimal for the first call,
  • the second call processes the first 3 elements from the cache and the other 6 need to be re-parsed at runtime.

If we pick 9 items it means:

  • We waste 6 * 20 = 120 bytes in the binary’s data segment,
  • it’s optimal for the second call.

The PoC uses a fixed-size buffer, regardless of the number of format arguments. The real implementation will, very likely, depend on the number of arguments, for example 4 + 2 * sizeof…(Args). These values “feel good”, but I have not yet done any analysis of real-world use cases. I want to look at real-world use cases before settling on an algorithm.
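Such a heuristic could look like this minimal sketch; cache_capacity and its constants are placeholders pending the real-world analysis mentioned above:

```cpp
#include <cstddef>

// Hypothetical cache-size heuristic from the text: 4 + 2 * sizeof...(Args).
// The constants are placeholders, not a settled algorithm.
template <class... Args>
constexpr std::size_t cache_capacity() {
  return 4 + 2 * sizeof...(Args);
}

static_assert(cache_capacity<>() == 4);     // e.g. std::format("...")
static_assert(cache_capacity<int>() == 6);  // e.g. std::format("{}", 42)
```

With one int argument this gives 6 slots: enough for the first example call (3 entries), but the second call (9 entries) would still re-parse its tail at runtime.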

If we only want to make this feature available in the unstable ABI, we can change the cache-size algorithm every release. If we want to make this change available in ABI V1 (always, or with the opt-out solution), we need to do more upfront investigation into the proper cache size.


Can you clarify whether option 3 would be a “loud” ABI break, i.e. everyone affected would get a linker error?

Thanks for the RFC, this is super interesting and a bit mind-bending that we could potentially do that.

FWIW, I don’t think that the ABI of std::format_string is something that we need to keep stable at all costs. I don’t mean that we should break it every release, but I do think that introducing a single ABI break early in its life cycle is probably acceptable. That being said, it does mean that we’d want to settle on something we’ll be happy with for the foreseeable future to avoid changing it often. So I think that if we manage to find something we’re happy with, strategy (3) to implement the change unconditionally is probably workable, perhaps with some transition mechanism.

In terms of the “deployment mechanism”, I would tend to introduce the new class under an ABI macro. That’s what we have precedent for (e.g. the “new optimized std::function”). But then, we can decide to make that ABI macro enabled by default, and to deprecate/remove the other ABI after e.g. 1 release, making it basically an unconditional change. But I’d piggy back on our ABI macro “framework” instead of _LIBCPP_HAS_EXPERIMENTAL_FOO to do this.

In terms of the proposal itself, I am a bit worried about the possible impact on binary size – I think that’s the obvious concern here. I am also a bit concerned about baking in many assumptions into std::format_string and then realizing that we’d like to make changes to it. I think that one of the main reasons why this new design is tricky ABI-wise is that we’re hardcoding the size of the cache into std::format_string, for lack of a better option. But if we could instead have a generic cache representation and just store a pointer to it, we’d know our std::format_string would never have to change again, and we’d be free to generate however large/small caches in the data segment as we want.

The reason why the above isn’t doable today is that we don’t support non-transient allocations in C++. I mean that we don’t support making an allocation at compile-time and then “baking it” into runtime by copying it into the data segment. There’s been some work around that, like propconst and recent work by Barry (p1974r0, p2670r1 ). I know this problem is being worked on pretty actively because it would be a huge improvement. If we had this ability, I think this proposal would operate under vastly different constraints.

IMO, we should wait for the non-transient allocation problem to be solved before we actually switch libc++'s default ABI to use that mechanism, but I wouldn’t oppose work on the feature as a whole on that basis. We should just develop it with the knowledge that at some point the underlying caching mechanism is going to be swapped.


The ABI break changes sizeof(format_string), so this is an ODR violation. AFAIK linkers will not give an error in this case.

I hope that in 2025 we can do better than that, leveraging ABI tags or inline namespaces, or something along these lines.

ABI tags and inline namespaces are both insufficient, sadly. Being able to do this well was indeed the initial goal of the abi_tag feature, but the fact that incomplete aggregate types exist rendered that goal infeasible.

Given namespace std { struct Bar {}; }: if I add a field to the standard-library type “std::Bar”, it causes an ABI incompatibility for anything using it. So, we’d like to automatically change the name-mangling for any function that mentions a std::Bar in its signature, for any global variable of type Bar, and for any aggregate type which has a member variable Bar (and then transitively for any function/global referring to that type).

Moving Bar to a different inline-namespace already handles use in function arguments: that’ll result in a different name-mangling of the function. But, return types are not a part of non-template function-name mangling, and global variables also don’t get name-mangled with their type. The nonstandard abi_tag attribute does get added to the name mangling in those two cases, so that’s some help.

But sadly, it cannot handle Bar being embedded in another aggregate. Ideally, we’d like a user-defined struct Foo { std::Bar a; } to automatically inherit any abi_tags from Bar (and thus get a new name-mangling when Bar is incompatibly changed). But that cannot possibly work. Another TU may declare only the incomplete struct Foo;. It needs to mangle the same as the full struct definition, but we have no way of knowing that the latter depends on Bar. So, sadly, it’s just not possible to fully solve this problem.
