Slides
Slides
Profit
Carl Cook, Ph.D.
@ProgrammerCarl
[email protected]
1
Introduction
About me:
● Freelance software developer
● Experience mainly with trading companies
● A member of ISO SG14 (gaming, low latency, trading)
Contents:
● A 30 second introduction to trading
● Performance techniques for low latency, and then some surprises
● Measurement of performance
2
What is electronic trading/HFT/market making/algo trading?
3
while (true) {
try_buy_low();
try_sell_high();
}
4
Why the need for speed?
5
Solving this challenge has some nice spin-offs to other industries:
6
C++ in finance
Source:
Jetbrains
7
Technical challenges of low latency
trading
“If you’re not at all interested in performance, shouldn’t you be in the Python
room down the hall?”
– Scott Meyers
8
The ‘Hotpath’
● The “hotpath” is only exercised 0.01% of the time - the rest of the time, the
system is idle, or doing administrative work
● Operating systems, networks and hardware are focused on throughput and
fairness
● Jitter is unacceptable - it means bad trades
● A lot can go wrong in a few microseconds
9
Execution time is a limited resource
[Diagram: RX ➔ IPC ➔ TX, with a total budget of ~1 us]
But: even though C++ is good at saying what will be done, there are other factors:
● Compiler (and version)
● Machine architecture
● 3rd party libraries
● Build and link flags
11
… luckily there’s an app for that:
12
The importance of system tuning (results on the next page)
static void Sort(benchmark::State& state) {
    std::vector<int> items(state.range(0));  // unsorted input each run
    for (auto _ : state)
        std::sort(items.begin(), items.end());
}
BENCHMARK(Sort)->Range(8, 1024);
13
Same:
● Hardware
● Operating system
● Binary
● Background load
14
Low latency programming techniques
15
Slowpath removal
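This slide is a heading only; one common shape of slowpath removal (the names and flags here are hypothetical, not the talk's code) is to collapse every rare condition into a single cheap check and push all the messy handling into a non-inlined function:

```cpp
#include <cassert>
#include <cstdint>

__attribute__((noinline)) void HandleError(int64_t errorFlags);

// Hotpath: one aggregated check; if anything at all is wrong, we take the
// single call into the slowpath and keep the hot code small and branch-light.
bool TrySendOrder(int64_t errorFlags) {
    if (errorFlags == 0) {
        // ... build and send the order ...
        return true;
    }
    HandleError(errorFlags);  // rare: logging, alerts, teardown, etc.
    return false;
}

__attribute__((noinline)) void HandleError(int64_t) { /* slowpath work */ }
```

Keeping the slowpath out-of-line also keeps its instructions out of the instruction cache until it is actually needed.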
16
Template-based configuration
17
// 1st implementation
struct OrderSenderA {
    void SendOrder() {
        ...
    }
};

// 2nd implementation
struct OrderSenderB {
    void SendOrder() {
        ...
    }
};
18
std::unique_ptr<IOrderManager> Factory(const Config& config) {
    if (config.UseOrderSenderA())
        return std::make_unique<OrderManager<OrderSenderA>>();
    else if (config.UseOrderSenderB())
        return std::make_unique<OrderManager<OrderSenderB>>();
    else
        throw std::runtime_error{"no order sender configured"};
}
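A sketch of the OrderManager template the factory above instantiates (the interface and the senders' contents are assumed shapes, not the talk's code):

```cpp
#include <cassert>

struct IOrderManager {
    virtual ~IOrderManager() = default;
    virtual void MainLoop() = 0;
};

// Venue-specific senders (the talk shows only that each has SendOrder()).
struct OrderSenderA {
    void SendOrder() { ++sendCount; /* venue A wire format */ }
    int sendCount = 0;
};
struct OrderSenderB {
    void SendOrder() { ++sendCount; /* venue B wire format */ }
    int sendCount = 0;
};

// The sender is a compile-time parameter: SendOrder() calls resolve statically
// and can be inlined, so the configured hotpath pays no virtual dispatch -
// the one virtual call happens only at the MainLoop entry point.
template <typename OrderSender>
struct OrderManager : IOrderManager {
    void MainLoop() final { sender.SendOrder(); }
    OrderSender sender;
};
```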
19
Memory allocation
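A heading-only slide; the usual low-latency answer is to avoid the general-purpose allocator on the hotpath entirely, e.g. with preallocated pools. A minimal free-list pool sketch (an illustration, not the talk's implementation):

```cpp
#include <cassert>
#include <cstddef>

// Fixed-size object pool: Alloc() is a pointer pop and Free() a pointer push -
// no malloc, no locks, no syscalls on the hotpath.
template <typename T, size_t N>
class Pool {
public:
    Pool() {
        for (size_t i = 0; i + 1 < N; ++i) slots_[i].next = &slots_[i + 1];
        slots_[N - 1].next = nullptr;
        freeList_ = &slots_[0];
    }
    T* Alloc() {                       // returns raw storage for a T
        if (!freeList_) return nullptr;
        Slot* s = freeList_;
        freeList_ = s->next;
        return reinterpret_cast<T*>(s->storage);
    }
    void Free(T* p) {                  // returns the slot to the free list
        Slot* s = reinterpret_cast<Slot*>(p);
        s->next = freeList_;
        freeList_ = s;
    }
private:
    union Slot { Slot* next; alignas(T) unsigned char storage[sizeof(T)]; };
    Slot slots_[N];
    Slot* freeList_;
};
```

Callers placement-new into the returned storage; everything is reserved up front, so latency is deterministic.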
20
Exceptions in C++
21
Branch reduction
Branching approach:
22
Templated approach:
template<>
void RunLogic<Side::Buy>() {
float orderPrice = CalcPrice<Side::Buy>(fairValue, credit);
CheckRiskLimits<Side::Buy>(orderPrice);
SendOrder<Side::Buy>(orderPrice);
}
template<>
float CalcPrice<Side::Buy>(float value, float credit) {
return value - credit;
}
template<>
float CalcPrice<Side::Sell>(float value, float credit) {
return value + credit;
}
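The specializations above need a primary template to compile; here is a minimal self-contained version of the CalcPrice part (CheckRiskLimits and SendOrder are omitted, as on the slide):

```cpp
#include <cassert>

enum class Side { Buy, Sell };

// Primary template, specialized per side: the buy/sell branch disappears at
// compile time, leaving straight-line code in each instantiation.
template <Side side> float CalcPrice(float value, float credit);

template <> float CalcPrice<Side::Buy>(float value, float credit) {
    return value - credit;   // bid below fair value
}
template <> float CalcPrice<Side::Sell>(float value, float credit) {
    return value + credit;   // offer above fair value
}
```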
23
Multi-threading
24
If you must use multiple threads...
25
Data lookups
Message orderMessage;
orderMessage.price = instrument.price;
Market& market = Markets.FindMarket(instrument.marketId);
orderMessage.qty = market.quantityMultiplier * qty;
...
26
Actually, denormalized data is not a sin:
● Chances are there is space in the cacheline that you read to have pulled in the
extra field, avoiding an additional lookup
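A sketch of the denormalization idea, with hypothetical Instrument/Message layouts: copying quantityMultiplier into the instrument at startup removes the FindMarket() lookup from the hotpath entirely.

```cpp
#include <cassert>

// Denormalized: quantityMultiplier is duplicated from the market definition
// into each instrument at startup, so building an order message touches only
// the instrument's (already hot) cacheline.
struct Instrument {
    double price;
    int    quantityMultiplier;  // copied from the market at startup
    int    marketId;
};

struct Message { double price; int qty; };

Message BuildOrder(const Instrument& instrument, int qty) {
    Message m;
    m.price = instrument.price;
    m.qty   = instrument.quantityMultiplier * qty;  // no FindMarket() call
    return m;
}
```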
27
Fast associative containers (std::unordered_map)
[Diagram: bucket array; each bucket chains std::pair<K, V> nodes]
Default max_load_factor: 1
Average case insert: O(1) (see N1456)
Average case find: O(1)
28
10K elements, keyed in the range std::uniform_int_distribution(0, 1e+12)
Complexity of find:
29
Run on (32 X 2892.9 MHz CPU s), 2017-09-08 11:39:44
Benchmark Time
----------------------------------------------
FindBenchmark<unordered_map>/10 14 ns
FindBenchmark<unordered_map>/64 16 ns
FindBenchmark<unordered_map>/512 16 ns
FindBenchmark<unordered_map>/4k 20 ns
FindBenchmark<unordered_map>/10k 24 ns
----------------------------------------------
31
A lesser-known approach: a hybrid of both chaining and open addressing
Goals:
● Minimal memory footprint
● Predictable cache access patterns (no jumping all over the place)
32
[Diagram: Key ➔ Hash ➔ Index into an open-addressed array of (hash, ptr) entries; on a hash match (✓), follow the pointer to the chained key/value node; on a mismatch (✘), probe the next entry]
33
Run on (32 X 2892.9 MHz CPU s), 2017-09-08 11:40:08
Benchmark Time
----------------------------------------------
FindBenchmark<array_map>/10 7 ns
FindBenchmark<array_map>/64 7 ns
FindBenchmark<array_map>/512 7 ns
FindBenchmark<array_map>/4k 9 ns
FindBenchmark<array_map>/10k 9 ns
----------------------------------------------
34
Branch prediction hints
35
gcc with no hints:

int GetErrorCode() {
    return rand() % 255 + 1;
}

int main(int argc, char**) {
    if (argc > 1)
        return GetErrorCode();
    else
        return 0;
}

main:
    cmp edi, 1        // argc
    jle .L7
    sub rsp, 8
    call rand
    mov ecx, 255
    cdq
    idiv ecx
    lea eax, [rdx+1]
    pop rdx
    ret
.L7:
    xor eax, eax      // zeros eax
    ret
36
Now with branch prediction hints:

int GetErrorCode() {
    return rand() % 255 + 1;
}

int main(int argc, char**) {
    if (unlikely(argc > 1))
        return GetErrorCode();
    else
        return 0;
}

main:
    cmp edi, 1
    jg .L12
    xor eax, eax
    ret
.L12:
    sub rsp, 8
    call rand
    mov ecx, 255
    cdq
    idiv ecx
    lea eax, [rdx+1]
    pop rdx
    ret
37
● These “likely” attributes are useful if something called very rarely needs to be
fast when called (i.e. expect more efficient assembly code to be generated)
● In all other cases:
○ Write your code to avoid branches, and
○ Train the hardware branch predictor (more about this later)
■ This is the dominant factor
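unlikely() as used on these slides is not standard C++ (C++20 later added the [[likely]]/[[unlikely]] attributes); with gcc and clang it is conventionally a macro over __builtin_expect:

```cpp
#include <cassert>

// Conventional gcc/clang spelling of the hints used on the previous slides.
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int Classify(int argc) {
    if (unlikely(argc > 1))
        return 1;   // cold path: laid out away from the fall-through
    return 0;       // hot path: the straight-line case
}
```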
38
((always_inline)) and ((noinline))
CheckMarket();
if (notGoingToSendAnOrder)
    ComplexLoggingFunction();
else
    SendOrder();

__attribute__((noinline))
void ComplexLoggingFunction()
{
    ...
}
39
Default gcc generated code
40
Forcing get_error_code to be inlined
__attribute__((always_inline))
int get_error_code() { ... }

int main(int argc, char**) {
    if (argc > 1)
        return get_error_code();
    else
        return 0;
}

main:
    cmp edi, 1
    jle .L6
    get_error_code instruction 1
    get_error_code instruction ..
    get_error_code instruction N
    mov eax, [error code]
    ret
.L6:
    xor eax, eax      // zeros eax
    ret
41
Combining inlining hints and branch prediction hints
__attribute__((noinline))
int get_error_code() { ... }

int main(int argc, char**) {
    if (unlikely(argc > 1))
        return get_error_code();
    else
        return 0;
}

get_error_code:
    ...
    ret

main:
    cmp edi, 1
    jg .L7
    xor eax, eax
    ret
.L7:
    jmp get_error_code
42
Other gcc compiler hints for cache locality
__attribute__((hot)):
Places all attributed functions together into a single (hot) section of the binary, improving instruction cache locality
__attribute__((cold)):
Somewhat useful - similar in effect to inlining hot functions and not inlining cold ones; branches leading to cold functions are also treated as unlikely
43
Prefetching
__builtin_prefetch can also be useful (if you know the access pattern, but the
hardware prefetcher won't be able to work it out)
// next mid val after this iteration if we take the low path
__builtin_prefetch(&array[(low + mid - 1)/2]);
// next mid val after this iteration if we take the high path
__builtin_prefetch(&array[(mid + 1 + high)/2]);
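The two prefetch lines above come from a binary search; a self-contained sketch of how they fit in (the function name and layout are assumptions):

```cpp
#include <cassert>
#include <vector>

// Binary search over a sorted array that prefetches both possible next probe
// points before the comparison - useful when the array is too large and the
// access pattern too data-dependent for the hardware prefetcher.
int PrefetchingSearch(const std::vector<int>& array, int key) {
    int low = 0, high = static_cast<int>(array.size()) - 1;
    while (low <= high) {
        int mid = (low + high) / 2;
        // next mid val after this iteration if we take the low path
        __builtin_prefetch(&array[(low + mid - 1) / 2]);
        // next mid val after this iteration if we take the high path
        __builtin_prefetch(&array[(mid + 1 + high) / 2]);
        if (array[mid] == key) return mid;
        if (array[mid] < key)  low = mid + 1;
        else                   high = mid - 1;
    }
    return -1;  // not found
}
```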
Pick one:
Usually you will see no further gain if you apply several of the above
45
Keeping the caches hot - a better way!
Remember, the full hotpath is only exercised very infrequently - your cache has
most likely been trampled by non-hotpath data and instructions
[Diagram: market data decoder components on the hotpath]
46
A simple solution: run a very frequent pre-warm path through your entire system,
keeping both your data cache and instruction cache primed
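A toy sketch of the idea - warm-up messages exercise the same code path as real ones but are stopped just short of sending (all names here are hypothetical):

```cpp
#include <cassert>

enum class Mode { Warmup, Live };

struct Pipeline {
    int ordersSent = 0;

    // The full hotpath runs for every message, keeping its instructions and
    // data resident in cache; only live triggers actually reach the exchange.
    void Run(Mode mode) {
        Decode();
        Price();
        CheckRisk();
        if (mode == Mode::Live)
            ++ordersSent;
    }
    void Decode()    { /* parse market data */ }
    void Price()     { /* compute fair value */ }
    void CheckRisk() { /* validate limits */ }
};
```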
47
[Flowchart:]
System running 5us slower than normal
➔ Are pre-warming messages broken?
    ➔ Yes: fix pre-warming messages ➔ Problem solved?
        ➔ Yes: done
        ➔ No: you poor bastard
    ➔ No: you poor bastard
48
Hardware/architecture considerations
Quick recap:
● A server can have N physical CPUs (one CPU attaches to one socket)
○ Each CPU can have N cores (ignoring hyperthreading per core)
■ Each core has a:
● L1 data cache (~32KB)
● L1 instruction cache (~32KB)
● Unified L2 cache (~512KB)
○ All cores share a unified L3 cache (~50MB)
Source:
Intel Corporation
49
Intel Xeon E5 processor
Source:
Intel Corporation
50
● Don’t share L3 - disable all other cores (or lock the cache)
○ This might mean paying for 22 cores but only using 1
● Choose your neighbours carefully:
○ Noisy neighbours should probably be moved to a different physical CPU
51
Surprises and war stories
52
Small string optimization support
std::unordered_map<std::string, Instrument> instruments;
return instruments.find({"IBM"}) != instruments.end();
53
std::string_view (to the rescue)
std::string_view name{"FACEBOOK"};
instruments.find(name.substr(1, 3)); // "ACE" - no allocation
54
Avoiding std::string (and allocations)
55
Userspace networking vs cache
● Userspace means we can receive data (prices, etc) without any system calls
● But there can be too much of a good thing:
○ All secondary data goes through the cache, even if we don’t use the data
○ When items go into the cache, other items are evicted
[Diagram: secondary (non-critical) data flowing over userspace networking straight through the cache. Key: userspace communication vs shared memory communication]
56
Alternative setup:
[Diagram: only latency-critical data reaches the hotpath core's cache; orders go out to the exchange over userspace networking, while secondary data is handed off through shared memory via a:]
Single writer/single reader lock free queue
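A minimal single-writer/single-reader lock-free ring buffer of the kind referenced here (sizes, names, and memory-order choices are illustrative, not the talk's implementation):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// SPSC ring buffer: the producer only writes head_, the consumer only writes
// tail_, so no locks or compare-and-swap loops are needed.
template <typename T, size_t N>  // N must be a power of two
class SpscQueue {
public:
    bool TryPush(const T& item) {
        const size_t head = head_.load(std::memory_order_relaxed);
        if (head - tail_.load(std::memory_order_acquire) == N)
            return false;  // full
        buffer_[head & (N - 1)] = item;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }
    bool TryPop(T& item) {
        const size_t tail = tail_.load(std::memory_order_relaxed);
        if (head_.load(std::memory_order_acquire) == tail)
            return false;  // empty
        item = buffer_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }
private:
    T buffer_[N];
    std::atomic<size_t> head_{0};  // written only by the producer
    std::atomic<size_t> tail_{0};  // written only by the consumer
};
```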
57
Watch your enums and switches
58
Overhead of C++11 static local variable initialization
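What the overhead is: since C++11, dynamically initialized function-local statics are initialized thread-safely, so every call pays a guard-variable check (and a lock on first use). A sketch (MakeName is a stand-in initializer):

```cpp
#include <cassert>
#include <string>

std::string MakeName() { return "example"; }

// Dynamic initialization of a local static is thread-safe in C++11, so the
// compiler emits a guard-variable check on every call.
const std::string& Name() {
    static const std::string name = MakeName();  // guarded initialization
    return name;
}

// One alternative: eagerly initialized state at namespace scope, paying the
// initialization cost once at startup and no guard check per call.
namespace {
const std::string g_name = MakeName();
}
const std::string& FastName() { return g_name; }
```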
59
std::pow can be slow, really slow
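std::pow dispatches to a general-purpose libm routine that must handle arbitrary exponents, NaNs, and error reporting; for small integer exponents, explicit multiplication is far cheaper and exact. A trivial illustration (Cube is a hypothetical helper):

```cpp
#include <cassert>
#include <cmath>

// Hotpath-friendly replacement for std::pow(x, 3.0): two multiplies, no
// libm call, no errno handling.
inline double Cube(double x) { return x * x * x; }
```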
60
Measurement of low latency systems
“Bottlenecks occur in surprising places, so don't try to second guess and put in
a speed hack until you've proven that's where the bottleneck is.”
– Rob Pike
61
Measurement of low latency systems
62
✘ Sampling profilers (e.g. gprof) are not what you are looking for
○ They miss the key events
✘ Instrumentation profilers (e.g. valgrind) are not what you are looking for
○ They are too intrusive
○ They don’t catch I/O slowness/jitter (they don’t even model I/O)
✘ Microbenchmarks (e.g. google benchmark) are not what you are looking for
○ They are not representative of a realistic environment
○ It takes some effort to stop the compiler optimizing out the code under test
○ Heap fragmentation can have an impact on subsequent tests
They are all in some ways useful, but not for micro-optimization of code
63
? Performance counters can be useful (e.g. linux perf)
○ E.g. # of cache misses, # of pipeline stalls
64
✓ Most useful: measure end-to-end time in a production-like setup
(Many trading companies do this)
65
Summary
“A language that doesn't affect the way you think about programming is not
worth knowing.”
– Alan Perlis
66
● Know C++ well, including your compiler
● Know the basics of machine architecture, and how it will impact your code
● Do as much work as possible at compile time
● Aim for very simple runtime logic
● Accurate measurement is essential
● Assume nothing: a lot can be surprising, and compilers, hardware and
operating systems are always changing
67
Thanks for listening!
68