This proposal describes a performance improvement for C++20's `std::format` and similar functions. This improvement involves an ABI break, and I'd like to get some feedback on how we want to tackle this ABI break in libc++.
Abstract
Something that has bothered me for a long time is that `std::format` validates its format-string both at compile-time and at runtime. The compile-time parsing information is not available at runtime, so we re-parse and re-validate the format-string again at runtime. The validation is useless; we know the format-string is valid. The parsing also feels wasteful; we had that information during compilation and now calculate it again. I found a solution to this problem, but it requires an ABI break for the `basic_format_string` class. The change means the compile-time parser stores data in a cache in `basic_format_string`, and at runtime this cache is used to format the arguments. There are 3 approaches to this ABI break:
- Use the libc++ unstable ABI
- Use the libc++ stable ABI, but offer an opt-out
- Implement the change unconditionally
Option 1 is non-controversial, but means the feature is only available to people who opt in to the ABI break. In practice this means very few people will benefit from this change.
Option 2 is controversial, but allows users to opt out of the ABI break.
Option 3 is controversial, but I think it would be worth the effort.
Below I'll explain why I think option 3 is worth the effort. The last part of the post goes deeper into the details of the changes I plan to make. I've created a draft PR with a PoC.
Motivation
The PoC only optimizes the parsing part. The formatting is not optimized, so that part is slower than the original. This means the performance of `std::format("");` goes from 6.18 ns to 21.6 ns with this PoC. Similarly, `std::format("{}", 42);` goes from 47.6 ns up to 60.5 ns. For the real patch I do not consider this performance degradation acceptable.
Looking at the case of `std::format("{}", 42);`, with a different number of characters before the "{}" in the format-string, the performance is:
| Prefix length | Before [ns] | After [ns] |
|--------------:|------------:|-----------:|
| 0             | 47.6        | 60.5       |
| 5             | 51.1        | 64.1       |
| 10            | 61.6        | 59.8       |
| 20            | 86.9        | 57.6       |
| 40            | 158         | 81.1       |
The after values seem a bit noisy. Still, the behaviour of the cached version is clear: it starts slower, reaches similar performance around 10 prefix characters, and is a clear win starting at 20 characters. This pattern is seen in the other parts of the benchmark too.
For formatting a plain string the numbers are:
| Characters | Before [ns] | After [ns] |
|-----------:|------------:|-----------:|
| 0          | 6.18        | 21.6       |
| 5          | 16.5        | 26.0       |
| 10         | 24.9        | 26.0       |
| 20         | 52.7        | 29.3       |
| 40         | 116         | 37.5       |
Here the caching seems more effective: in the first example the new version takes about 1/2 of the time, and in this example it takes about 1/3 of the time. (Note that formatting a plain string is silly; however, `std::print` uses the same code, and printing a plain string is common.)
So this shows caching the parsed result can improve the runtime performance.
The ABI break
The performance benefit comes from adding a cache to the `std::basic_format_string` class. Since this changes the size of the object, it is considered an ABI break.
The class `std::basic_format_string` used to be an implementation detail, but has been changed to a standard-provided type. Objects of this type are implicitly created when calling `std::format`; these are temporaries that are destroyed when the call to `std::format` finishes. Users can also create `std::basic_format_string` objects themselves and use them to call `std::format`. In that case the object can cross an ABI boundary. Searching on GitHub found:
- 153 use cases of `std::basic_format_string`, most of them in Standard libraries. Two notable projects were found:
  - spdlog, a fast logger
  - BDE, a "boost-like project" provided by Bloomberg

  Both projects use it like `std::format` does, so there are no possible ODR violations inside these projects.
- 165 use cases of `std::wformat_string`, used similarly to `std::basic_format_string`.
- 1.5K usages of `std::format_string`, used similarly to `std::basic_format_string`.
Most of these use cases use it as a wrapper around `std::format` and friends, i.e. as a temporary. I only found 2 cases where an object was stored, which may cause an ODR violation.
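Most hits follow a pattern like this minimal sketch (the `log` function is a made-up example, not code from the projects mentioned above):

```cpp
#include <format>
#include <iostream>
#include <utility>

// The format_string object is a temporary that never outlives the call,
// so it cannot cross an ABI boundary here.
template <class... Args>
void log(std::format_string<Args...> fmt, Args&&... args) {
  std::cout << std::format(fmt, std::forward<Args>(args)...) << '\n';
}
```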
So at the moment there are fewer than 2K usages of this object, and only a small fraction of these use cases may cause issues. In general the adoption of `std::format` is not too high. These are the number of times formatting functions were found in code on GitHub:
- `std::format`: 90.6 K
- `fmt::format`: 1.9 M (the library `std::format` is based upon)
- `sprintf`: 18 M (this includes C projects)
- `snprintf`: 4.4 M
- `printf`: 61.3 M
Which ABI approach to take?
First of all, at the moment there is only a PoC, not a complete implementation. I'd like to know the end goal of the change, since this will influence how I write the patches to implement it. This change will take quite a bit of time to complete, so it may not be done before LLVM 21. The approaches for the various options are:
- Make the feature conditionally available: opt-in in ABI V1, opt-out in the Unstable ABI. This involves:
  - adding a new ABI flag `_LIBCPP_ABI_BASIC_FORMAT_STRING_CACHE`,
  - adding this flag to `_LIBCPP_ABI_VERSION`,
  - adding the new feature with `#ifdef` in the existing class.

  If we get new ABI-breaking ideas in 5 years we can apply them.
- Make the feature conditionally available: opt-out in both ABI V1 and the Unstable ABI. This involves:
  - adding a new ABI flag `_LIBCPP_ABI_BASIC_FORMAT_STRING_CACHE`,
  - adding the new feature with `#ifdef` in the existing class,
  - once the cache is fully complete, making the new ABI the default and offering an opt-out.

  If we get new ABI-breaking ideas in 5 years we need to evaluate whether they are worth another ABI break at that time.
- Replace the existing code with the new implementation. This involves:
  - adding `#define _LIBCPP_HAS_EXPERIMENTAL_BASIC_FORMAT_STRING _LIBCPP_HAS_EXPERIMENTAL_LIBRARY`,
  - copying the existing class and using `_LIBCPP_HAS_EXPERIMENTAL_BASIC_FORMAT_STRING` to add the new features in the copy,
  - once the cache is fully complete, deleting the original version and removing the `_LIBCPP_HAS_EXPERIMENTAL_BASIC_FORMAT_STRING` flag.

  If we get new ABI-breaking ideas in 5 years we need to evaluate whether they are worth another ABI break at that time.
All options add a set of `#ifdef`s in the code. Options 1 and 2 keep these forever; option 3 has them temporarily.
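To make that concrete, a rough sketch of how the flag from options 1 and 2 could guard the new member (`__format_cache` and the member names are illustrative, not the PoC's actual code):

```cpp
// Sketch only: the cache member is compiled in or out by the ABI flag.
template <class _CharT, class... _Args>
struct basic_format_string {
  basic_string_view<_CharT> __str_;
#ifdef _LIBCPP_ABI_BASIC_FORMAT_STRING_CACHE
  __format_cache<_CharT, _Args...> __cache_; // grows the object: the ABI break
#endif
};
```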
My preference and motivation
- Option 3:
  - The ABI impact is minimal or non-existent.
  - The change feels in line with C++'s zero-overhead principle.
  - The class `basic_format_string` is rather new, and the C++ committee had no issue changing how format worked for several years after it shipped:
    - The original C++20, finished in February 2020, did not have this class.
    - P2216R3 "std::format improvements" added `basic-format-string` as a DR against C++20 in June 2021.
    - P2508R1 "Expose std::basic-format-string<charT, Args…>" added `basic_format_string` as a DR against C++20 in July 2022; it has been available since libc++ 15 (released in September 2022).
- Option 2:
  - This adds a bit of maintenance overhead.
  - Still has the benefits of option 3.
- Option 1:
  - This adds a bit of maintenance overhead.
  - The benefits are available, but it's unknown how many users will be aware of these changes and opt in.
Details of the changes
Caching strategy
The C++ formatting library has two types of formatting functions: the normal version and the v-version. This is similar to the approach of `printf` and `vprintf`. The signatures are:

```cpp
template<class... Args>
std::string std::format(std::format_string<Args...> fmt, Args&&... args);

std::string std::vformat(std::string_view fmt, std::format_args args);
```
The function `vformat` parses the `fmt` argument and writes the formatted output at runtime. When the argument `fmt` is incorrect, it throws an exception.
The function `format` is implemented as:

```cpp
return std::vformat(fmt.str, std::make_format_args(args...));
```
So this does the same as `vformat`. However, the `format_string` does something more. A `format_string` object looks (slightly simplified) like:

```cpp
template<class... Args>
struct format_string {
  string_view str;

  consteval format_string(string_view s);
};
```
Note the constructor is `consteval`, so it is always evaluated at compile-time. This constructor parses the argument `s`. When the argument `s` is incorrect, the code is ill-formed and thus won't compile.
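For example (a minimal illustration, not from the PoC):

```cpp
#include <format>
#include <string>

std::string ok = std::format("{}", 42);     // validated by the consteval constructor
// std::string bad = std::format("{", 42);  // ill-formed: the unmatched '{' is
//                                          // rejected at compile-time
```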
So `format` has the following effects:

- parses the `fmt` argument at compile-time,
- parses the `fmt` argument at runtime,
- writes the formatted output.
My gripe with this approach is that the compile-time parsing knows exactly how to format `fmt`. However, all that information is lost, just to be rediscovered at runtime. This feature adds a cache to `format_string` to record this information.
For example, `std::format("hello world");` will:

- compile-time
  - parse "hello world"
  - create one cache entry "write string" + `std::string_view` to the just-parsed data
- runtime
  - process cache entry 1 and write the string "hello world"
For example, `std::format("The answer is {}.", 42);` will:

- compile-time
  - parse "The answer is "
  - create one cache entry "write string" + `std::string_view` to the just-parsed data
  - parse "{}"
  - create one cache entry "format argument" + index 0
  - parse "."
  - create one cache entry "write string" + `std::string_view` to the just-parsed data
- runtime
  - process cache entry 1 and write the string "The answer is "
  - process cache entry 2, format the value `42`, and write "42"
  - process cache entry 3 and write the string "."
The compile-time parser only knows that the type of the first argument is an integer, so it can't store the formatted value; the argument can be a function call that is evaluated at runtime. The cache stores all information about the format argument, so if `fmt` had used `{:x}` instead, the formatter would have written "2a" (hex) instead of "42".
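A minimal sketch of what the cache entries and the runtime replay could look like (all names are illustrative, not the PoC's actual design):

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical cache entry matching the walkthroughs above.
struct cache_entry {
  enum class kind { write_string, format_argument };
  kind type;
  std::size_t arg_index; // valid when type == format_argument
  std::string_view text; // valid when type == write_string; views the format-string
  // ... plus the stored formatter state, e.g. the parsed "{:x}" spec
};

// Hypothetical runtime loop: no parsing or validation, just replay the cache.
template <class FormatArgAt>
void replay(const cache_entry* entries, std::size_t count, std::string& out,
            FormatArgAt format_arg_at) {
  for (std::size_t i = 0; i < count; ++i) {
    const cache_entry& e = entries[i];
    if (e.type == cache_entry::kind::write_string)
      out += e.text;                   // write the literal text verbatim
    else
      format_arg_at(e.arg_index, out); // format and write argument e.arg_index
  }
}
```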
There is a caveat: there are two types of format arguments, "builtin types" and "handle types". All "builtin types" have formatters of the same size that behave "the same". Formatters of "handle types" may work completely differently. For example, the chrono header has a formatter for seconds that can be invoked as `std::format("{:%H:%M:%S}", 10'000s);`. The "%H:%M:%S" part is parsed using a `strftime`-like parser. The standard does not specify how to implement this formatter. Some valid options are: using a `string_view` member "storing" "%H:%M:%S", or using a `string` member storing "%H:%M:%S". (The latter is not efficient, but it is a valid implementation.)
There are several approaches to handle format-strings that use "handle formatters":

1. Have more space in a cache entry and always store the formatter. Since we need to hard-code a size, this only works in some cases.
2. Only store formatters that have the same size as the formatters for the "builtin types". This is useful when users inherit their formatter from a formatter of the "builtin types" and change the format function to forward a member of their type to the base formatter.
3. Parse the replacement-field at compile-time to validate it; parse and format the replacement-field at runtime. This approach is not efficient, but it is the status quo.
4. Drop the cache and fall back to the current behaviour. Again not efficient, but it matches the status quo.
The PoC does option 4. I expect option 3 to be feasible; the real implementation should at least do this.
Option 2 means we need to store a type-erased formatter at compile-time. I need to investigate whether this is possible. (The handle class does something similar, but it stores a type-erased pointer to the value; here we need to store the entire object.) Another issue with option 2 is that a "handle formatter" can have an allocating member, for example the `std::string` member of the seconds formatter. When memory is allocated at compile-time it needs to be freed at compile-time. This makes it impossible to store this implementation of a seconds formatter.
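For completeness, a minimal sketch of the inheritance pattern option 2 targets (`Celsius` and its formatter are hypothetical):

```cpp
#include <format>
#include <string>

struct Celsius { double value; };

// Inherits parse() from the builtin double formatter and forwards a member;
// the derived formatter has the same size as the builtin one.
template <>
struct std::formatter<Celsius> : std::formatter<double> {
  auto format(Celsius c, std::format_context& ctx) const {
    return std::formatter<double>::format(c.value, ctx);
  }
};

static_assert(sizeof(std::formatter<Celsius>) == sizeof(std::formatter<double>));

std::string s = std::format("{:.1f}", Celsius{36.6}); // "36.6"
```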
If option 2 is feasible I would like to look at option 1. The Standard library has chrono and range-based formatters, which are larger than the ones for the "builtin types". However, I have no benchmarks that show the performance improvement. This approach has the disadvantage that the size of every element in the cache will be larger.
Cache size
Another item to consider is the cache size. Depending on the type of the entry, the required size differs. The largest size is used for a formatter, which stores:

- a type, which is an untyped enum; based on the alignment requirements this is a 32-bit value,
- the index of the format argument to process, a 32-bit value,
- the formatter itself, whose size is 128 bits.
This is 192 bits or 24 bytes per entry. It is possible to use 16-bit types for the type and index fields, saving 4 bytes. (With a 16-bit index we can support 65K cache entries.) However, we also need to store a type and a `string_view`, where the alignment of the `string_view` may mean we still need 192 bits. In this case we can look at parallel arrays to reduce the padding overhead:
- 128 bits for the formatter/`string_view`
- 16 bits for the arg-id
- 8 bits for the type
In this case we need 152 bits or 19 bytes per entry, plus maybe some final padding. We could even consider using some bits of the arg-id for the type information (this would need performance benchmarking). For now we consider the cost per cache entry to be 20 bytes in the binary's data segment.
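A sketch of that parallel-array layout (names and the fixed capacity are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical layout: three parallel arrays instead of an array of structs.
template <std::size_t N>
struct format_cache {
  // 128-bit slots, each holding either a formatter or a std::string_view.
  alignas(std::string_view) unsigned char data[N][16];
  std::uint16_t arg_id[N]; // 16-bit argument index
  std::uint8_t type[N];    // the entry type
};

// 16 + 2 + 1 = 19 bytes per entry; with N == 8 there is no trailing padding
// on a typical 64-bit platform.
static_assert(sizeof(format_cache<8>) == 8 * 19);
```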
The question is how many entries we need to store. Take these calls:

```cpp
std::format("The answer is {}.", 42);

std::format("The answer is {0}, "
            "or {0:o} for those who favour octal, "
            "the hexadecimal lovers get {0:x}, and "
            "last but not least {0:b} for those who prefer binary.", 42);
```
Both these calls instantiate `std::format_string<int>`. The first call needs 3 cache entries, the second needs 9. However, since both calls instantiate the same class, they have the same cache size. If we pick 3 entries it means:
- It’s optimal for the first call,
- the second call processes the first 3 elements from the cache and the other 6 need to be re-parsed at runtime.
If we pick 9 items it means:
- We waste 6 * 20 = 120 bytes in the binary's data segment,
- it’s optimal for the second call.
The PoC uses a fixed-size buffer, regardless of the number of format arguments. The real implementation will, very likely, depend on the number of arguments, for example `4 + 2 * sizeof...(Args)`. These values "feel good", but I have not yet done any analysis of real-world use cases. I want to look at real-world use cases before settling on an algorithm.
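Expressed as code, that heuristic would be (the constants are placeholders, to be tuned against real-world data):

```cpp
#include <cstddef>

// Hypothetical sizing heuristic: a few entries for literal text plus two
// entries (text + argument) per format argument.
template <class... Args>
inline constexpr std::size_t cache_capacity = 4 + 2 * sizeof...(Args);

static_assert(cache_capacity<int> == 6);         // e.g. std::format_string<int>
static_assert(cache_capacity<int, double> == 8);
```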
If we only want to make this feature available in the unstable ABI, we can change the cache-size algorithm every release. If we want to make this change available in ABI V1 (always, or with the opt-out solution), we need to do more upfront investigation into what the proper cache size is.