
Low Latency C++ for Fun and Profit
Carl Cook, Ph.D.

@ProgrammerCarl
[email protected]

1
Introduction

About me:
● Freelance software developer
● Experience is with trading companies (mainly)
● A member of ISO SG14 (gaming, low latency, trading)

Contents:
● A 30 second introduction to trading
● Performance techniques for low latency, and then some surprises
● Measurement of performance

Disclaimer: This is not a general discussion on every C++ optimization technique - it's a quick sampler into the life of developing high performance trading systems

2
What is electronic trading/HFT/market making/algo trading?

3
while (true) {
try_buy_low();
try_sell_high();
}

4
Why the need for speed?

Electronic market makers aim for the lowest latency possible:

● Fast reaction to market events
  ○ Allowing now-stale orders to be adjusted (before they lose money)
  ○ Being the first to spot a favorable order and trade against it

5
Solving this challenge has some nice spin-offs to other industries:

● More efficient code: longer battery life/drone flight time/power savings


● Faster/more responsive autonomous vehicles
● Better general application performance
● Continually improving hardware
● …

6
C++ in finance
[Chart: C++ usage in the finance industry. Source: JetBrains]

7
Technical challenges of low latency
trading

“If you’re not at all interested in performance, shouldn’t you be in the Python
room down the hall?”
– Scott Meyers

8
The ‘Hotpath’

● The “hotpath” is only exercised 0.01% of the time - the rest of the time, the
system is idle, or doing administrative work
● Operating systems, networks and hardware are focused on throughput and
fairness
● Jitter is unacceptable - it means bad trades
● A lot can go wrong in a few microseconds

9
Execution time is a limited resource

If the target is 3.5us wire to wire (for example), then:


● 1us for RX of market data message from exchange
● 1us for TX of order message to exchange
● Maybe 0.5us of misc IPC, and jitter that’s hard to get rid of
● Leaves approximately 1us for the actual trading code
○ Arguably around 3K CPU cycles/12K instructions
○ But think about memory latency, pipeline stalls, cache misses, etc

RX IPC { 1 us TX

{Time it takes for light to travel 300 metres}


10
The role of C++

From Bjarne Stroustrup:


“C++ enables zero-overhead abstraction to get us away from the hardware
without adding cost”

But: even though C++ is good at saying what will be done, there are other factors:
● Compiler (and version)
● Machine architecture
● 3rd party libraries
● Build and link flags

We need to check what C++ is doing in terms of machine instructions...

11
… luckily there's an app for that: [screenshot of an interactive compiler/assembly explorer - presumably Compiler Explorer, godbolt.org]

12
The importance of system tuning (results on the next page)

#include <benchmark/benchmark.h>
#include <algorithm>
#include <cstdlib>
#include <vector>

std::vector<int> items;

void SortVector(benchmark::State& state) {
  items.reserve(1024);
  for (auto _ : state) {
    const auto N = state.range(0);
    items.resize(N);
    for (int i = 0; i < N; ++i)
      items[i] = rand() % N;
    std::sort(items.begin(), items.end());
  }
}

BENCHMARK(SortVector)->Range(8, 1024);

13
Same:
● Hardware
● Operating system
● Binary
● Background load

One server is tuned for production (no hyperthreading, etc), the other is not

14
Low latency programming techniques

"When in doubt, use brute force."


– Ken Thompson

15
Slowpath removal

Avoid this:

if (checkForErrorA())
  handleErrorA();
else if (checkForErrorB())
  handleErrorB();
else if (checkForErrorC())
  handleErrorC();
else
  sendOrderToExchange();

Aim for this:

int64_t errorFlags;
...
if (!errorFlags)
  sendOrderToExchange();
else
  HandleError(errorFlags);

Tip: ensure that error handling code will not be inlined

16
Template-based configuration

● It’s convenient to have some things controlled via configuration files


○ However virtual functions (and even simple branches) can be expensive
● One possible solution:
○ Use templates (often overlooked, even though everyone uses the STL)
○ This removes branches, eliminates code that won’t be executed, etc

17
// 1st implementation
struct OrderSenderA {
  void SendOrder() {
    ...
  }
};

// 2nd implementation
struct OrderSenderB {
  void SendOrder() {
    ...
  }
};

template <typename T>
struct OrderManager : public IOrderManager {
  void MainLoop() final {
    // ... and at some stage in the future...
    mOrderSender.SendOrder();
  }
  T mOrderSender;
};

18
std::unique_ptr<IOrderManager> Factory(const Config& config) {
  if (config.UseOrderSenderA())
    return std::make_unique<OrderManager<OrderSenderA>>();
  else if (config.UseOrderSenderB())
    return std::make_unique<OrderManager<OrderSenderB>>();
  else
    throw std::invalid_argument("Unknown order sender");
}

int main(int argc, char *argv[]) {
  const Config config = ReadConfig(argc, argv); // hypothetical config loader
  auto manager = Factory(config);
  manager->MainLoop();
}

19
Memory allocation

● Allocations are of course costly:
  ○ Use a pool of preallocated objects (a minimal sketch follows below)
  ○ Reuse objects instead of deallocating:
    ■ delete involves no system calls (memory is not given back to the OS)
      ● But: glibc free has 400 lines of book-keeping code
    ■ Reusing objects helps avoid memory fragmentation as well
● If you must delete large objects, consider doing this from another thread
● Be aware that destructors may be inlined
  ○ This can start trampling your instruction cache
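
A minimal sketch of such a preallocated pool (the Order type, names, and fixed capacity are illustrative assumptions, not from the talk): every object is constructed up front, and acquire/release just move pointers, so the hot path never calls new or delete.

#include <array>
#include <cassert>
#include <cstddef>

struct Order { /* price, quantity, etc. */ };

template <typename T, std::size_t N>
class ObjectPool {
public:
  ObjectPool() {
    for (std::size_t i = 0; i < N; ++i)
      mFree[i] = &mStorage[i];   // every slot starts out free
    mFreeCount = N;
  }
  T* acquire() {                 // no allocation: pop a free slot
    assert(mFreeCount > 0);
    return mFree[--mFreeCount];
  }
  void release(T* obj) {         // no deallocation: push the slot back
    mFree[mFreeCount++] = obj;
  }
private:
  std::array<T, N> mStorage;     // all objects preallocated up front
  std::array<T*, N> mFree;
  std::size_t mFreeCount = 0;
};

ObjectPool<Order, 1024> orderPool; // sized once at startup, reused forever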

20
Exceptions in C++

● Don't be afraid to use exceptions (if using gcc, clang, msvc):
  ○ I've measured this in quite some detail:
    ■ They are basically zero cost if they don't throw
    ■ Maybe some slight code reordering, but the cost is negligible
● Don't use exceptions for control flow (see the sketch below):
  ○ That will get expensive:
    ■ My benchmarking suggests an overhead of at least 1.5us
  ○ Your code will look terrible
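
A minimal usage sketch (the function names are illustrative assumptions): the try block adds no cost while nothing throws, and the throw is reserved for genuinely rare failures.

#include <stdexcept>

void runHotPath();                          // throws only on genuine, rare failure
void handleFailure(const std::exception&);  // slowpath error handling

void MainLoop() {
  for (;;) {
    try {
      runHotPath();   // zero cost while nothing throws
    } catch (const std::exception& e) {
      // Rare slow path: pay the unwinding cost (1.5us+) only when something
      // is genuinely wrong - never as a way to steer normal control flow.
      handleFailure(e);
    }
  }
}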

21
Branch reduction

Branching approach:

enum class Side { Buy, Sell };

void RunLogic(Side side) {
  const float orderPrice = CalcPrice(side, fairValue, credit);
  CheckRiskLimits(side, orderPrice);
  SendOrder(side, orderPrice);
}

float CalcPrice(Side side, float value, float credit) {
  return side == Side::Buy ? value - credit : value + credit;
}

22
Templated approach:

template<>
void RunLogic<Side::Buy>() {
float orderPrice = CalcPrice<Side::Buy>(fairValue, credit);
CheckRiskLimits<Side::Buy>(orderPrice);
SendOrder<Side::Buy>(orderPrice);
}
template<>
float CalcPrice<Side::Buy>(float value, float credit) {
return value - credit;
}
template<>
float CalcPrice<Side::Sell>(float value, float credit) {
return value + credit;
}

23
Multi-threading

Multithreading is best avoided for latency-sensitive code:
● Synchronization of data via locking is going to be expensive
● Lock free code may still require locks at the hardware level
● Mind-bendingly complex to correctly implement parallelism
● Easy for the producer to accidentally saturate the consumer

24
If you must use multiple threads...

● Keep shared data to an absolute minimum
  ○ Multiple threads writing to the same cacheline will get expensive
● Consider passing copies of data rather than sharing data
  ○ E.g. a single writer, single reader lock free queue (see the sketch below)
● If you have to share data, consider not using synchronization, i.e.:
  ○ Maybe you can live with out-of-sequence updates
  ○ Maybe the machine architecture prevents torn reads/writes, and preserves the ordering of stores and loads (etc)
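
A minimal sketch of such a queue (illustrative, assuming a power-of-two capacity and exactly one producer thread and one consumer thread): each index is written by only one thread, so no locks are needed.

#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>  // N must be a power of two
class SpscQueue {
public:
  bool try_push(const T& item) {      // called only by the producer
    const auto head = mHead.load(std::memory_order_relaxed);
    if (head - mTail.load(std::memory_order_acquire) == N)
      return false;                   // full
    mBuffer[head & (N - 1)] = item;
    mHead.store(head + 1, std::memory_order_release);
    return true;
  }
  bool try_pop(T& item) {             // called only by the consumer
    const auto tail = mTail.load(std::memory_order_relaxed);
    if (mHead.load(std::memory_order_acquire) == tail)
      return false;                   // empty
    item = mBuffer[tail & (N - 1)];
    mTail.store(tail + 1, std::memory_order_release);
    return true;
  }
private:
  T mBuffer[N];
  std::atomic<std::size_t> mHead{0};  // written only by the producer
  std::atomic<std::size_t> mTail{0};  // written only by the consumer
};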

25
Data lookups

The software engineering textbooks would typically suggest:

struct Market {
  int32_t id;
  char shortName[4];
  int16_t quantityMultiplier;
  ...
};

struct Instrument {
  float price;
  int32_t marketId;
  ...
};

Message orderMessage;
orderMessage.price = instrument.price;
Market& market = Markets.FindMarket(instrument.marketId);
orderMessage.qty = market.quantityMultiplier * qty;
...

26
Actually, denormalized data is not a sin:
● Chances are there is spare space in the cacheline you already read, so the duplicated field comes along for free, avoiding an additional lookup

struct Market {
  int32_t id;
  char shortName[4];
  int16_t quantityMultiplier;
  ...
};

struct Instrument {
  float price;
  int16_t quantityMultiplier; // duplicated from Market, saving a lookup
  ...
};

This is better than trampling your cache to “save memory”

27
Fast associative containers (std::unordered_map)

[Diagram: an array of buckets (1..N); each bucket holds a chain of nodes, each node a std::pair<K, V>]

Default max_load_factor: 1
Average case insert: O(1)
Average case find: O(1)
See: N1456
28
10K elements, keyed in the range std::uniform_int_distribution(0, 1e+12)

Complexity of find:
● Average case: O(1)
● Worst case: O(N)

29
Run on (32 X 2892.9 MHz CPU s), 2017-09-08 11:39:44
Benchmark Time
----------------------------------------------
FindBenchmark<unordered_map>/10 14 ns
FindBenchmark<unordered_map>/64 16 ns
FindBenchmark<unordered_map>/512 16 ns
FindBenchmark<unordered_map>/4k 20 ns
FindBenchmark<unordered_map>/10k 24 ns
----------------------------------------------

# 56.54% frontend cycles idle


# 21.61% backend cycles idle
# 0.67 insns per cycle
# 0.84 stalled cycles per insn
branch-misses # 0.63% of all branches
cache-misses # 0.153% of all cache refs
30
Alternatively, consider open addressing, e.g. google’s dense_hash_map

[Diagram: Key/Value pairs laid out in one contiguous array]

✓ Key/Value pairs are in contiguous memory - no pointer following between nodes

✘ Complexity around collision management

31
A lesser-known approach: a hybrid of both chaining and open addressing

Goals:
● Minimal memory footprint
● Predictable cache access patterns (no jumping all over the place)

32
[Diagram: Key ➔ Hash ➔ Index into a compact array of (hash, pointer) slots; slots are probed in order until the stored hash matches (✓) or the probe misses (✘), and only then is a single pointer followed to the Key/Value pair]

It’s possible to implement this as a drop-in substitute for std::unordered_map

33
Run on (32 X 2892.9 MHz CPU s), 2017-09-08 11:40:08
Benchmark Time
----------------------------------------------
FindBenchmark<array_map>/10 7 ns
FindBenchmark<array_map>/64 7 ns
FindBenchmark<array_map>/512 7 ns
FindBenchmark<array_map>/4k 9 ns
FindBenchmark<array_map>/10k 9 ns
----------------------------------------------

# 38.26% frontend cycles idle


# 6.77% backend cycles idle
# 1.6 insns per cycle
# 0.24 stalled cycles per insn
branch-misses # 0.22% of all branches
cache-misses # 0.067% of all cache refs

34
Branch prediction hints

#define likely(x)   __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

● You may recognise these from the linux kernel source
● The compiler often picks the right case in the first place, but there's no guarantee

35
gcc with no hints:

int GetErrorCode() {
  return rand() % 255 + 1;
}

int main(int argc, char**) {
  if (argc > 1)
    return GetErrorCode();
  else
    return 0;
}

Generated assembly:

main:
  cmp edi, 1        // argc
  jle .L7
  sub rsp, 8
  call rand
  mov ecx, 255
  cdq
  idiv ecx
  lea eax, [rdx+1]
  pop rdx
  ret
.L7:
  xor eax, eax      // zeros eax (the return value)
  ret
36
Now with branch prediction hints:

int GetErrorCode() {
  return rand() % 255 + 1;
}

int main(int argc, char**) {
  if (unlikely(argc > 1))
    return GetErrorCode();
  else
    return 0;
}

Generated assembly (the common case now falls through first):

main:
  cmp edi, 1
  jg .L12
  xor eax, eax
  ret
.L12:
  sub rsp, 8
  call rand
  mov ecx, 255
  cdq
  idiv ecx
  lea eax, [rdx+1]
  pop rdx
  ret
37
● These “likely” attributes are useful if something called very rarely needs to be
fast when called (i.e. expect more efficient assembly code to be generated)
● In all other cases:
○ Write your code to avoid branches, and
○ Train the hardware branch predictor (more about this later)
■ This is the dominant factor

See https://wg21.link/P0479 for a proposal to standardize these attributes

See https://groups.google.com/a/isocpp.org/forum/#!forum/sg14 for a lively debate on this proposal

38
((always_inline)) and ((noinline))

● ((always_inline)) and ((noinline)) can be useful
  ○ Means: inlining is preferred / inlining should be avoided
  ○ But be careful: measure
● Please note that the inline keyword is not really what you are looking for
  ○ It mainly means: multiple definitions are permitted

A quick example: forcing a method to be not inlined (for good reason)

CheckMarket();
if (notGoingToSendAnOrder)
  ComplexLoggingFunction();
else
  SendOrder();

__attribute__((noinline))
void ComplexLoggingFunction() {
  ...
}

39
Default gcc generated code

int get_error_code() { ... }

int main(int argc, char**) {
  if (argc > 1)
    return get_error_code();
  else
    return 0;
}

Generated assembly:

get_error_code:
  ...
  ret
main:
  cmp edi, 1          // argc register
  jle .L6
  jmp get_error_code  // tail call
.L6:
  xor eax, eax        // zeros eax
  ret                 // eax is the return value

40
Forcing get_error_code to be inlined

__attribute__((always_inline))
int get_error_code() { ... }

int main(int argc, char**) {
  if (argc > 1)
    return get_error_code();
  else
    return 0;
}

Generated assembly:

main:
  cmp edi, 1
  jle .L6
  get_error_code instruction 1
  get_error_code instruction ..
  get_error_code instruction N
  mov eax, [error code]
  ret
.L6:
  xor eax, eax   // zeros eax
  ret

41
Combining inlining hints and branch prediction hints

Combining noinline with "unlikely" branch prediction:

__attribute__((noinline))
int get_error_code() { ... }

int main(int argc, char**) {
  if (unlikely(argc > 1))
    return get_error_code();
  else
    return 0;
}

Generated assembly:

get_error_code:
  ...
  ret
main:
  cmp edi, 1
  jg .L7
  xor eax, eax
  ret
.L7:
  jmp get_error_code

42
Other gcc compiler hints for cache locality

__attribute__((hot)):
Puts all such functions into a single section in the binary, including ancestor functions

__attribute__((cold)):
Puts functions into a different section (and avoids inlining them)

This is somewhat useful - it basically achieves the same effect as inlining hot functions and not inlining cold functions
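
A quick usage sketch (the function names are illustrative):

__attribute__((hot))  void SendOrderToExchange(); // grouped into the hot text section
__attribute__((cold)) void LogUnexpectedState();  // placed elsewhere, not inlined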

43
Prefetching

__builtin_prefetch can also be useful (if you know that the hardware branch
predictor won’t be able to work out the right pattern)

Example (the body of a binary search loop; note mid must be computed before the prefetches that reference it):

int mid = (low + high) / 2;

// next mid val after this iteration if we take the low path
__builtin_prefetch(&array[(low + mid - 1) / 2]);
// next mid val after this iteration if we take the high path
__builtin_prefetch(&array[(mid + 1 + high) / 2]);

if (array[mid] == key) return mid;
if (array[mid] < key) low = mid + 1; // search high path
else high = mid - 1;                 // search low path

Bonus: you can also prefetch the instruction cache
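
Putting the loop into a self-contained function (a hedged sketch; the signature is an illustrative assumption):

int PrefetchingBinarySearch(const int* array, int size, int key) {
  int low = 0, high = size - 1;
  while (low <= high) {
    const int mid = (low + high) / 2;
    // Warm the two possible next probes while array[mid] is being compared:
    __builtin_prefetch(&array[(low + mid - 1) / 2]);  // if we take the low path
    __builtin_prefetch(&array[(mid + 1 + high) / 2]); // if we take the high path
    if (array[mid] == key) return mid;
    if (array[mid] < key) low = mid + 1;
    else high = mid - 1;
  }
  return -1; // not found
}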


44
Compiler attributes <TL/DR>

Pick one:

● Code with no (or minimal) branches
● __attribute__((always_inline)) and __attribute__((noinline))
● __builtin_expect()
● __attribute__((hot)) and __attribute__((cold))
● __builtin_prefetch()

Usually you will see no further gain if you apply several of the above

45
Keeping the caches hot - a better way!

Remember, the full hotpath is only exercised very infrequently - your cache has
most likely been trampled by non-hotpath data and instructions

[Diagram: many messages hit the market data decoder; fewer reach the strategy, and only rarely does one travel the full path through to the execution engine - so the full path's code and data fall out of cache]
46
A simple solution: run a very frequent pre-warm path through your entire system, keeping both your data cache and instruction cache primed (a sketch follows after the diagram)

[Diagram: with pre-warming, every message runs the full market data decoder → strategy → execution engine path, so the path's code and data stay resident in cache]

Bonus: this also correctly trains the hardware branch predictor
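
One hedged way to implement this (all names here are illustrative, not from the talk): every pre-warm message runs through exactly the same code and data as a real one, and only the final wire write is suppressed.

#include <cstdint>

struct Packet { std::uint64_t payload; };
struct Order  { std::uint64_t id; };

Order BuildOrder(const Packet& p) { return Order{p.payload}; } // stands in for decode + strategy + order prep
void SendToExchange(const Order&); // the real wire write, defined elsewhere

void OnMessage(const Packet& packet, bool preWarmOnly) {
  Order order = BuildOrder(packet); // same instructions and data as the real path
  if (!preWarmOnly)
    SendToExchange(order);          // skipped on the frequent warming runs
}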

47
[Flowchart: "System running 5us slower than normal" → "Are pre-warming messages broken?" - Yes: "Fix pre-warming messages" → "Problem solved?" - Yes: "Done", No: keep looking; No: "You poor bastard"]

48
Hardware/architecture considerations

Quick recap:
● A server can have N physical CPUs (one CPU attaches to one socket)
  ○ Each CPU can have N cores (ignoring hyperthreading per core)
    ■ Each core has its own:
      ● L1 data cache (~32KB)
      ● L1 instruction cache (~32KB)
      ● Unified L2 cache (~512KB)
  ○ All cores share a unified L3 cache (~50MB)

[Diagram source: Intel Corporation]

49
[Die diagram: Intel Xeon E5 processor. Source: Intel Corporation]
50
● Don't share L3 - disable all other cores (or lock the cache)
  ○ This might mean paying for 22 cores but only using 1
● Choose your neighbours carefully (a thread-pinning sketch follows below):
  ○ Noisy neighbours should probably be moved to a different physical CPU
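
A hedged sketch of pinning the hot-path thread to one core (Linux-specific; the core number is illustrative, and in practice that core would also be isolated from the scheduler, e.g. via the isolcpus kernel parameter):

#include <pthread.h> // pthread_setaffinity_np is a GNU extension
#include <sched.h>

// Pin the calling thread to the given core so its caches stay its own.
void PinToCore(int core) {
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(core, &cpuset);
  pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}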

51
Surprises and war stories

"I have always wished for my computer to be as easy to use as my telephone;


my wish has come true because I can no longer figure out how to use my
telephone."
– Bjarne Stroustrup

52
Small string optimization support
std::unordered_map<std::string, Instrument> instruments;
return instruments.find({"IBM"}) != instruments.end();

● The temporary key above avoids a heap allocation only:
  ○ With gcc 5.1 or greater, and if the string is 15 characters or less
  ○ In clang, if the string is 22 characters or less
● In gcc, std::string has C.O.W. semantics (prior to gcc 5.1)
  ○ This gets expensive (during copying/destruction) due to atomics
  ○ First mentioned by Herb Sutter in 1999
● If you use an ABI-compatible linux distribution such as Redhat/Centos/Ubuntu/Fedora, then you are probably still using the old std::string implementation (even with the latest versions of gcc):
  ○ C.O.W. and no SSO support

53
std::string_view (to the rescue)

Provides allocation-free substrings and string literals

std::map<std::string, Instrument, std::less<>> instruments;
instruments.find(std::string_view{"FACEBOOK"})->second;

std::string_view name{"FACEBOOK"};
instruments.find(name.substr(1, 3)); // "ACE" - no allocation

Available in most C++17 compilers, and in C++14 as std::experimental::string_view

54
Avoiding std::string (and allocations)

● Consider something like inplace_string:
  ○ No allocation, compile time bounds checking, and a full std::string interface
  ○ https://github.com/david-grs/inplace_string

using InstrumentName = inplace_string<16>;
InstrumentName instrumentName{"IBM"};
assert(InstrumentName::npos == instrumentName.find("GOOGLE"));

● Implicitly convertible to std::string if required

std::string str{instrumentName};

● In production, with a sample size of 1024, inserting 6 elements into a vector:

std::string         min=918ns  mean=3,003ns  max=29,518ns
inplace_string<16>  min= 28ns  mean=   61ns  max= 1,829ns

55
Userspace networking vs cache

● Userspace means we can receive data (prices, etc) without any system calls
● But there can be too much of a good thing:
○ All secondary data goes through the cache, even if we don’t use the data
○ When items go into the cache, other items are evicted

[Diagram: order insert requests, latency-critical data, and secondary data all arrive at Core 1 via userspace communication; everything passes through Core 1's cache on its way to "orders to exchange", so secondary data evicts hot entries]
56
Alternative setup:

[Diagram: Core 1 receives order insert requests and latency-critical data via userspace communication, and sends orders to the exchange; secondary data goes to Core 2 instead, which forwards it to Core 1 in batches, infrequently, over a single writer/single reader lock free queue in shared memory]
57
Watch your enums and switches

enum Enum { Good, Bad, Ugly };

int main(int argc, char**) {
  switch ((Enum)argc) {
    case Good: Handle("GOOD"); break;
    case Bad:  Handle("BAD");  break;
    case Ugly: Handle("UGLY"); break;
  }
}

Generated assembly (a chain of compares, not a jump table):

main:
  sub rsp, 8
  test edi, edi
  je .L8
  cmp edi, 1
  je .L3
  cmp edi, 2
  je .L4
  ...

58
Overhead of C++11 static local variable initialization

struct Random {
  int get() {
    // threadsafe!
    static int i = rand();
    return i;
  }
};

int main() {
  Random r;
  return r.get();
}

Generated assembly:

Random::get():
  movzx eax, BYTE PTR guard var
  test al, al
  je .L13
  mov eax, DWORD PTR get()::i
  ret
.L13:
  // acquire and set the guard var

5-10% overhead compared to non-static access, even if the binary is single threaded
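
One hedged way to avoid the guard check on every call (illustrative): initialize the value eagerly, before the hot path runs, so get() compiles down to a plain load.

#include <cstdlib>

struct Random {
  Random() : i(rand()) {}       // pay for initialization once, at construction
  int get() const { return i; } // plain member load - no guard variable check
private:
  int i;
};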

59
std::pow can be slow, really slow

std::pow is a transcendental function; the implementation falls back to a second, much slower phase if the accuracy of the result isn't acceptable after the first phase.

auto base = 1.00000000000001, exp1 = 1.4, exp2 = 1.5;

std::pow(base, exp1) = 1.0000000000000140
std::pow(base, exp2) = 1.0000000000000151

Benchmark                         Time      Iterations
-------------------------------------------------------
pow(base, exp1) [glibc 2.17]      53 ns     13142054
pow(base, exp1) [glibc 2.21]      53 ns     13142821
pow(base, exp2) [glibc 2.17]  478195 ns         1457
pow(base, exp2) [glibc 2.21]   63348 ns        11113
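
If the exponent is a known constant (as with exp2 = 1.5 above), a hedged workaround is to rewrite the power in hardware-friendly terms and sidestep libm's slow phase entirely:

#include <cmath>

// x^1.5 == x * sqrt(x); sqrt maps to a fast hardware instruction.
double Pow15(double base) {
  return base * std::sqrt(base);
}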

60
Measurement of low latency systems

“Bottlenecks occur in surprising places, so don't try to second guess and put in
a speed hack until you've proven that's where the bottleneck is.”
– Rob Pike

61
Measurement of low latency systems

● Two common approaches:
  ○ Profiling: seeing what your code is doing (bottlenecks in particular)
  ○ Benchmarking: timing the speed of your system
● Caution: profiling is not necessarily benchmarking
  ○ Profiling is useful for catching unexpected things
  ○ Improvements in profiling results aren't a 100% guarantee that your system is now faster

62
✘ Sampling profilers (e.g. gprof) are not what you are looking for
○ They miss the key events
✘ Instrumentation profilers (e.g. valgrind) are not what you are looking for
○ They are too intrusive
○ They don’t catch I/O slowness/jitter (they don’t even model I/O)
✘ Microbenchmarks (e.g. google benchmark) are not what you are looking for
○ They are not representative of a realistic environment
○ Takes some effort to force the compiler to not optimize out the test
○ Heap fragmentation can have an impact on subsequent tests

They are all in some ways useful, but not for micro-optimization of code

63
? Performance counters can be useful (e.g. linux perf)
  ○ E.g. # of cache misses, # of pipeline stalls

? Consider just comparing certain types of instruction counts, e.g. jumps:
  ○ objdump -S my_binary | cut -c 33-34 | grep j | wc -l

? High-resolution timestamping can be useful (e.g. the hardware TSC - see the sketch below)
  ○ Doesn't need to be in sync with clock time
    ■ Just needs to be constant across samples
  ○ If you want actual nanoseconds:
    ■ Calibrate with wallclock time every few milliseconds
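
A hedged sketch of TSC-based timestamping (x86-specific; __rdtsc is a compiler intrinsic, and the function under test is an illustrative assumption):

#include <x86intrin.h> // __rdtsc
#include <cstdint>

void hotPathWork(); // hypothetical code under test

std::uint64_t TimeHotPathCycles() {
  const std::uint64_t start = __rdtsc(); // read the hardware timestamp counter
  hotPathWork();
  return __rdtsc() - start; // elapsed cycles: constant across samples;
                            // calibrate against wallclock for nanoseconds
}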

64
✓ Most useful: measure end-to-end time in a production-like setup
(Many trading companies do this)

[Test harness diagram:]
● A switch with high precision hardware-based timestamping (appended to each packet), connecting:
  ○ A server which replays exchange market data and accepts orders
  ○ The server under test, which listens to market data and sends orders
  ○ A server which captures and parses each network packet it sees, and calculates response time (accurate to a few nanoseconds)
65
Summary

“A language that doesn't affect the way you think about programming is not
worth knowing.”
– Alan Perlis

66
● Know C++ well, including your compiler
● Know the basics of machine architecture, and how it will impact your code
● Do as much work as possible at compile time
● Aim for very simple runtime logic
● Accurate measurement is essential
● Assume nothing: a lot can be surprising, and compilers, hardware and
operating systems are always changing

67
Thanks for listening!

68
