Caravaggio in Binary
Part 0 - Introduction
Part 1 - Hardware
Part 2 - Market Data
Part 3 - Limit Order Book
Part 4 - Measurements
Part 5 - Overclocking
Part 6 - Linux Tuning
Appendix - Costs of opening a one-man HFT firm
Changelog
——————————————————————————————————————————————————————————
Introduction
This text is a primer on how to develop a high-frequency-trading system/simulation lab, with a focus on the Nasdaq exchange and the ITCH protocol.
The reason for picking C instead of C++, when the latter is the de-facto language in the industry, is that C is a very simple language to understand (and optimize), which makes this primer more encompassing than if I were to write the same in C++ (which would require you to be very familiar with advanced concepts of the language).
By the end of this text, you will know how to set up your own simulation lab, how to build a very fast LOB, and how to measure the performance of your code, and you will be in a great position to do research on market microstructure and experiment with HFT strategies.
I am still working on this text so feel free to check for updates every few weeks.
——————————————————————————————————————————————————————————
Hardware
Note that even if you can't build a lab similar to the one described below, you will still benefit from reading, since you will learn how to write a very fast ITCH parser and order book (faster than the fastest implementation available in the open-source domain), as well as pick up a few tips on how to write high-performance code.
To get a simulation lab as close as possible to the live environment, you will need two computers (running Linux), two network cards (NICs) with both kernel bypass and hardware timestamping functionality, as well as two cables to connect them (one in and one out on each NIC).
One of the computers (computer A) will act as a fake exchange by sending the historical market data to the second computer (computer B - the trading server), which itself handles the parsing of the market data, builds the LOB and does the desired computations to decide when it should trade.
Regarding which Linux distro to use on computer B, I recommend Arch Linux if you have experience with Linux: since it is a rolling-release distro, you can very easily test updates made to compilers and see if they improve your code, and additionally you can run it without the X server (the graphical display layer), in which case the system occupies only around 300MB of RAM.
The objective of this setup is to accurately simulate a real-world live trading situation, which means that you will have zero control over computer A (the fake exchange) but full control over computer B (the trading server).
Frameworks, such as those that you find on GitHub, won't cut it, because when you are designing a low-latency system you have to consider, measure and optimize every single detail of your system, since your success/profitability will very likely depend on it.
Your system must be tuned, both hardware and software, and it must only do one thing (the trading logic), without rogue processes running in the background interfering with the CPU cache.
To get such high-precision measurements you need to physically simulate what happens once your server is live, and that is the reason why you need two physically independent computers.
As mentioned above, computer A simulates the exchange by sending historical market data over the NIC (in other words, it replays the market), and computer B acts as the trading server by receiving the data sent by computer A, parsing it, building the order book, doing any desired analysis and eventually sending a message (i.e. an order) back to computer A.
It's now easy to understand that if you were reading data from disk and processing it on the same computer, you would be unable to get proper measurements, since the computer would be performing other tasks (reading data, piping it to another process, etc), which would give you incorrect measurements.
You would also be unable to test and get an answer for questions such as:
How do you deal with the network side when you are trading in the real world and no longer reading data from disk?
How can you test if your server is able to handle spikes of millions of packets per second?
How much time does it take to copy the data from the NIC buffer to user space? And does that extra time have a negative impact on your trading algorithms?
on and on…
Before going a bit more in depth on how to accurately measure the performance of computer B (the trading server), I would first like to clearly define "performance".
Performance, in this context, is the sum of the time it takes to retrieve the data from the NIC, plus the time it takes to do all the computations you want to do, plus the time it takes to send data out of the NIC.
Now, there are two ways you can truly measure the performance of a system. (I alluded to one before - the one I took - but I will clarify both now.)
The first approach is to use a switch with hardware timestamping functionality and connect both computers to it. Such a switch simply timestamps every packet that passes through it, both coming from computer A and from computer B, and from there you can calculate how much time computer B takes to process, and reply to, incoming data.
These switches can cost from around $4k for the "old" Metamako units to $40k for the new Arista/Cisco ones.
The second approach is the one I took, and the one I recommend: I bought two second-hand NICs for around $200 each on eBay, and these NICs offer the same hardware timestamping functionality that you will find on the much more expensive switches, with a precision of around 10ns.
The specific models I bought were the Exablaze X2 and the Exablaze X4. You can buy two X2s or two X4s, it doesn't matter; I picked one of each because those were the ones for sale at the moment.
If you end up purchasing the Exablaze cards, check their documentation online, since it goes over installation and some tips on how to get the most out of the cards, and, most importantly, remember to update the firmware.
Note: You can also check some of the Solarflare cards, which can be had pretty cheap too - just make sure that the model you buy supports OpenOnload / ef_vi.
After plugging both cards into the respective computer motherboards (use the PCIe slot closest to the CPU!), download and install the libraries that allow you to access them, connect the two cables between the cards and run:
$ sudo exanic-config
Device exanic0:
  Port 0:
    Interface: enp1s0
    Bypass-only mode: on
    TX packets: 0
  Port 1:
    Interface: enp1s0d1
    Bypass-only mode: on
    TX packets: 0
If you see something similar to the above, you are ready to follow along, will be able to replicate the C code that I will share below, and will end up with a professional, high-accuracy simulation lab where you can test multiple ideas/algorithms and do research on market microstructure.
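As a quick sanity check of the hardware timestamping itself, here is a minimal sketch that prints the NIC timestamp of every frame arriving on port 0. (It assumes the libexanic API; exanic_expand_timestamp() and exanic_cycles_to_ns() live in exanic/time.h, so double-check the exact names against the headers of your installed version.)

// timestamp_check.c - print the hardware timestamp of each received frame.
// Compile with: clang -O3 timestamp_check.c -lexanic
#include <exanic/exanic.h>
#include <exanic/fifo_rx.h>
#include <exanic/time.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

int main(void) {
  exanic_t* nic = exanic_acquire_handle("exanic0");
  if (!nic) { fprintf(stderr, "%s\n", exanic_get_last_error()); return 1; }
  exanic_rx_t* rx = exanic_acquire_rx_buffer(nic, 0, 0);
  if (!rx) { fprintf(stderr, "%s\n", exanic_get_last_error()); return 1; }
  char frame[1536];
  for (;;) {
    uint32_t ts; // 32-bit cycle count stamped by the NIC on arrival.
    ssize_t sz = exanic_receive_frame(rx, frame, sizeof(frame), &ts);
    if (sz > 0) {
      // Expand the 32-bit hardware timestamp and convert it to nanoseconds.
      exanic_cycles_t cycles = exanic_expand_timestamp(nic, ts);
      printf("%zd byte frame at %llu ns\n", sz,
             (unsigned long long)exanic_cycles_to_ns(nic, cycles));
    }
  }
}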
I will now discuss which PC components are ideal for a low-latency server.
Remember that for computer A (the fake exchange) you really can use any computer (i.e. an old one). If you have the possibility to upgrade some of its components, my recommendation is that you purchase the fastest NVMe drive you can afford, since the faster you can read from disk, the faster you can push data out of the NIC.
In the next section, I will show you how to load the historical market data into memory and just push it out of the card as fast as possible - this is a great way to test how computer B (the trading server) behaves when you saturate it with data - however it's not always possible to have the entire dataset in memory, and that is when having a very fast disk really helps.
For computer B, however, the choice of components will very much impact the performance you can get. Let's go one by one.
CPU: You will want an Intel "K" CPU, that is, a CPU that allows you to overclock (OC) its core frequency and its cache frequency via the BIOS.
Personally, I have an 11900K which, while it didn't receive great feedback from the public due to its poor multi-core performance, is the fastest single-core processor you can buy at the moment, and hence perfect for an HFT application.
In a later section I will go in depth into overclocking and show you how much techniques such as disabling extra (unnecessary) cores, disabling sleep states and raising the aforementioned frequencies improve the overall performance of the system.
Another important functionality of the latest generation of Intel CPUs is that they support AVX-512 (while previous generations only supported up to AVX2), and I will also show you how using such intrinsics can lead to extremely high-performance code when doing real-time analysis of data.
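To give you a taste before we get there, here is a small sketch of mine (not part of the trading code that follows) of the kind of thing AVX-512 buys you: summing the outstanding volume across a range of 32-bit price levels (e.g. the total volume within a price band of the book), 16 levels per instruction. Compile with -mavx512f.

// avx512_sum.c - sum n 32-bit volumes, 16 at a time, using AVX-512.
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Assumes n is a multiple of 16; a scalar tail loop would handle the rest.
uint32_t sum_volumes(const uint32_t* volumes, size_t n) {
  __m512i acc = _mm512_setzero_si512();
  for (size_t i = 0; i < n; i += 16) {
    // One unaligned load pulls 16 price levels into a single register.
    __m512i v = _mm512_loadu_si512((const void*)(volumes + i));
    acc = _mm512_add_epi32(acc, v);
  }
  // Horizontal reduction of the 16 partial sums.
  return (uint32_t)_mm512_reduce_add_epi32(acc);
}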
Alternatives: The previous-generation 10850K is also a very decent CPU and, with a good cooling solution (and some luck on the specific bin), can be pushed up to 5.4GHz on all cores, or 5.5GHz in a one/two-core configuration.
Motherboard: Here you should definitely consider a 2-DIMM-slot motherboard. 2-DIMM motherboards have shorter traces to the CPU, each of the two slots is connected directly to one memory channel of the CPU, and, since there are fewer DIMMs, there is less signal loss.
These motherboards are commonly used for overclocking (do you see a pattern?), not only for the reasons mentioned above but also because their BIOSes allow super fine tuning of RAM/CPU parameters, so you can really get the best out of your system.
On my current system I have an Asus Apex XIII installed, and in my opinion the Asus BIOS is second to none in terms of organization, accessibility and fine-tuning abilities.
Alternatives: Asus Apex XII, Gigabyte Aorus Tachyon, EVGA (Z590) Dark and the ASRock OC Formula.
RAM: For memory you will want a 2x8GB B-die (Samsung chip) kit. B-die kits are the most stable memory to overclock (in comparison to something like Micron Rev-E), which means you can push more voltage through them/increase the frequency and they won't crash.
If you know you will need 32GB of RAM (2x16GB), note that those kits will be around 5-6ns slower than the 2x8GB variants. I currently own a pair of each of the Xtreem kits and they are excellent to OC.
Disk: Any NVMe/SSD will do. If you are looking for an NVMe drive, either the Samsung 980 Pro or the Sabrent Rocket 4 Plus are great picks.
If you want to store market data, which comes with a few caveats by itself (see Appendix - Costs of opening a one-man HFT firm), you could look into the enterprise Micron 9300 Pro, perhaps the best SSD-format disk, with up to 15TB of capacity.
Cooling: This is an extremely important part of the system, because when you OC both RAM and CPU they will get hotter, and you will need to cool them down, otherwise the motherboard will either lower their frequency or just shut down the system to avoid burning the components.
There are a few parts that you have to cool down to achieve maximum system performance and stability:
CPU cooler: I recommend that you go with an AIO (All-in-One) cooling solution with 360mm (3 x 120mm) fans. I currently own the Arctic Liquid Freezer II 360 and it's an excellent cooler which is also very well priced.
Alternatively, you could build your own liquid cooler by purchasing and assembling the necessary components from a place like AquaTuning, but a problem arises if you have to ship the server to a colocation: you would need to have someone fill up the cooling liquid at the colocation site, as shipping the server with the liquid already in the loop could cause leaks.
To cool the RAM chips and the NIC you should place a fan facing each of the components.
This will greatly improve memory stability while OC'ing, and while the cooling for the NIC is not as important, it does make a difference and provides extra airflow inside the case. You can also add one or two small fans to exhaust the air at the back of the case.
Regarding which fans to purchase, I have always had great results with Noctua fans (and the Arctic fans included with the AIO); they are very capable and silent, so you can easily work with the system by your side. However, if noise is not a concern (i.e. you have the server in a different room or it is ready to be shipped to the trading venue), you can look into more powerful fans such as those made by Delta Electronics.
PSU: Any 650W+ model from a known brand such as EVGA, Corsair, XPG, etc will be a good pick. I do recommend, however, that you pick one with a fully modular design (i.e. you only attach the cables that you need to the PSU), as this will lead to much less clutter inside the case and consequently improve airflow.
Case: Assuming that you want to colocate the server at a trading venue, you will need a 4U case, since otherwise the CPU cooler I recommended won't fit. Now, finding decent 4U cases is not the easiest task, and even if you find one it will probably be cluttered with unnecessary things.
I recommend that you remove the front panel, purchase some metallic mesh from a hardware store and use that instead; while this procedure takes a bit of work, it will greatly improve airflow. (Have a look at a few pictures of my system below, where I did just that.)
If your system is meant to be at home, any ATX desktop case with decent airflow will do.
I will end this section with a few pictures of my own lab. Hope you are enjoying the text so far.
——————————————————————————————————————————————————————————
Market Data
For the sake of accessibility and applicability of this primer, I made some decisions on how to actually proceed, and I will explain them now.
When market data is captured, it is captured in a format called pcap (packet capture), which means that you store the Ethernet packets exactly as they arrive, with all the different OSI-layer information, instead of only storing the payload (the messages from the exchange).
This is done mainly because having the raw data allows you to troubleshoot and do a deep analysis of your network. If you have the Exablaze/Cisco NICs, you can use the optimized capture utility that ships with their software to perform this capture.
I won't be using .pcap files in this primer, though, for two reasons: first, these files are huge (a single day of market data for a single exchange can be 80+ GB); second, acquiring market data in this format is quite expensive and very few data vendors sell it.
Instead of showing you how to send .pcap market data from computer A to computer B, I will show you how to send the actual payload that those files contain, and because Nasdaq makes some of these samples available for download at ftp://emi.nasdaq.com/ITCH/, you will be able to grab them for free and follow along implementing the code in this series.
Now, to clarify, the aforementioned payload follows a specific protocol called ITCH, which is a protocol developed by Nasdaq for disseminating market data.
Because the ITCH protocol is binary, it is extremely fast and easy to parse when compared to a protocol such as FIX (which is ASCII based), and this particularity makes it ideal to parse in an FPGA, since all fields/types have a fixed length and appear in exactly the same order in every message of the same type. These two aspects allow you to write FPGA logic that parses all message fields at the same time, which results in a massive performance increase.
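To make this concrete, here is a small sketch of how the fields of an ITCH 5.0 "Add Order" message map to fixed offsets (offsets taken from the ITCH 5.0 specification; the struct and function names are mine). In software this is a handful of fixed-offset loads and byte swaps, since ITCH integers are big-endian; in an FPGA, all of these fields can be latched in parallel.

#include <stdint.h>

typedef struct {
  uint16_t stock_locate; // Offset 1: ID of the stock for the current day.
  uint64_t order_ref;    // Offset 11: unique ID of this order.
  char side;             // Offset 19: 'B' (buy) or 'S' (sell).
  uint32_t shares;       // Offset 20.
  uint32_t price;        // Offset 32: 4 implied decimal places.
} add_order_fields;

static inline add_order_fields parse_add_order(const char* msg) {
  add_order_fields f;
  f.stock_locate = __builtin_bswap16(*(const uint16_t*)(msg + 1));
  f.order_ref    = __builtin_bswap64(*(const uint64_t*)(msg + 11));
  f.side         = msg[19];
  f.shares       = __builtin_bswap32(*(const uint32_t*)(msg + 20));
  f.price        = __builtin_bswap32(*(const uint32_t*)(msg + 32));
  return f;
}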
I made one more decision that I must clarify prior to continuing: computer A will be sending only one message in each Ethernet frame, instead of (possibly) multiple messages as you would see in a real-world environment. I decided to do this because it slightly simplifies the code on both computer A and computer B.
Basically, what will happen on computer B is that you receive one frame, parse the one message within, receive another frame, parse the one message within, and so on, while in a live environment you might receive multiple messages in the same Ethernet frame and will have to iterate over the receive buffer and parse all the messages within. The change is minimal, but I argue that it aids readability of both computer A's and computer B's code, and at this point that is more important in my opinion.
When I finish this primer, I will write an appendix showing how to send .pcap data from computer A to computer B, with multiple messages in the same frame, as well as how to handle multiple messages in the same frame on computer B; a small sketch of the latter follows.
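For reference, here is a sketch of what that iteration would look like in a live environment, where ITCH messages arrive wrapped in MoldUDP64 packets (header layout per the MoldUDP64 specification). handle_message() is a hypothetical stand-in for the dispatch switch you will see in main.c later.

#include <stddef.h>
#include <stdint.h>

void handle_message(const char* msg, uint16_t len); // hypothetical dispatch

// A MoldUDP64 packet starts with a 20-byte header (10-byte session, 8-byte
// sequence number, 2-byte message count), followed by `count` ITCH messages,
// each prefixed by its 2-byte big-endian length.
void parse_moldudp64_packet(const char* buf, size_t frame_len) {
  uint16_t count = __builtin_bswap16(*(const uint16_t*)(buf + 18));
  const char* p = buf + 20;
  for (uint16_t i = 0; i < count; i++) {
    uint16_t len = __builtin_bswap16(*(const uint16_t*)p);
    handle_message(p + 2, len); // The ITCH payload follows the length.
    p += 2 + len;
  }
  (void)frame_len; // A production parser would bounds-check against this.
}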
With that out of the way, here’s the code you should run on computer A:
// replay.c
//
// Computer A: loads a raw Nasdaq ITCH 5.0 file into memory and replays it out
// of the NIC, one message per Ethernet frame, as fast as possible. In these
// files, every message is prefixed by a 2-byte (big-endian) length field.
#include <exanic/exanic.h>
#include <exanic/fifo_tx.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

int main(void) {
  exanic_t* nic_handle = exanic_acquire_handle("exanic0");
  if (!nic_handle) {
    fprintf(stderr, "exanic_acquire_handle: %s\n", exanic_get_last_error());
    return -1;
  }
  exanic_tx_t* nic_buffer = exanic_acquire_tx_buffer(nic_handle, 0, 0);
  if (!nic_buffer) {
    fprintf(stderr, "exanic_acquire_tx_buffer: %s\n", exanic_get_last_error());
    return -1;
  }

  const char* file = "data.itch50"; // Placeholder: path to your ITCH 5.0 file.
  struct stat st;
  stat(file, &st);
  size_t file_sz = (size_t)st.st_size;
  char* ptr = malloc(file_sz + 3);
  if (ptr == NULL) {
    return EXIT_FAILURE;
  }

  // Open the file and copy its content to the previously allocated buffer.
  FILE* f = fopen(file, "rb");
  if (!f || fread(ptr, 1, file_sz, f) != file_sz) {
    return EXIT_FAILURE;
  }
  char* buf = ptr;
  // Zero the last three bytes: they act as an end-of-file sentinel, since
  // ITCH message types are ASCII letters and never 0.
  memset(buf + file_sz, 0, 3);

  size_t size = 0;
  char message[100];

  // -------------- Iterate over the file and send each individual ITCH message
  for (;;) {
    buf += 2; // Skip the 2-byte length prefix; the type byte gives the size.
    if (*buf) {
      // Message sizes as defined by the ITCH 5.0 specification.
      switch (*buf) {
        case 'A': size = 36; break; // Add order
        case 'D': size = 19; break; // Delete order
        case 'U': size = 35; break; // Replace order
        case 'E': size = 31; break; // Order executed
        case 'X': size = 23; break; // Order cancelled
        case 'I': size = 50; break; // Net order imbalance indicator
        case 'F': size = 40; break; // Add order (with MPID attribution)
        case 'P': size = 44; break; // Trade (non-cross)
        case 'L': size = 26; break; // Market participant position
        case 'C': size = 36; break; // Order executed with price
        case 'Q': size = 40; break; // Cross trade
        case 'Y': size = 20; break; // Reg SHO restriction
        case 'H': size = 25; break; // Stock trading action
        case 'R': size = 39; break; // Stock directory
        case 'V': size = 35; break; // MWCB decline levels
        case 'J': size = 35; break; // LULD auction collar
        case 'S': size = 12; break; // System event
        case 'K': size = 28; break; // IPO quoting period update
        case 'B': size = 19; break; // Broken trade
        case 'N': size = 20; break; // Retail price improvement indicator
        case 'W': size = 12; break; // MWCB status
        case 'h': size = 21; break; // Operational halt
        default: {
          return EXIT_FAILURE;
        }
      }
      // Copy the data to the message buffer that will be sent over the wire.
      memcpy(message, buf, size);
      buf += size;
      // Send it.
      if (exanic_transmit_frame(nic_buffer, message, size) != 0) {
        fprintf(stderr, "exanic_transmit_frame: %s\n", exanic_get_last_error());
        exit(1);
      }
    } else {
      break;
    }
  }

  // ----------------------------------------------------------------- Clean-up
  fclose(f);
  free(ptr);
  exanic_release_tx_buffer(nic_buffer);
  exanic_release_handle(nic_handle);
  return 0;
}
Note that this is the simplest "fake exchange" you can have, because as of now it doesn't receive orders from computer B, nor does it build an order book representation (which you could then trade against). Later on in this primer, or in a follow-up post, I will show you how to build a "fake exchange" that addresses those issues.
Before moving on to the next section, I would like to clarify what exactly "kernel bypass" is, why it is a necessary technique when you are designing a low-latency system, and why you need a specific NIC to achieve it.
When you receive data on a regular network card (such as the Ethernet card on your motherboard or a WiFi adapter), what happens is that the Linux kernel reads/copies those bytes from the hardware to a kernel buffer (which resides in kernel space) and then copies them again from that buffer to a buffer in user space (so whatever application is waiting on that data can use it).
These copies take a considerable amount of time (in HFT terms), and the kernel also has a limited buffer size, which means that if you are being flooded by packets, they will have to wait in a queue until they eventually get copied to user space, resulting in extra delay.
Similarly, when you want to send data, a system call is issued behind the scenes: the kernel takes control of the execution and, on behalf of the user-space process, copies the data to kernel space and eventually sends it.
What kernel-bypass libraries/NICs allow you to do is skip the copies made by the kernel, by giving you direct access to the hardware (NIC) buffer, which then allows you to copy the bytes directly from the hardware to user space and from user space to the hardware without having to go through the kernel. In other words, you bypass the kernel, hence the name.
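To see exactly what is being bypassed, compare the receive loop you will see in main.c below with its plain-UDP-socket counterpart, sketched here (the port number is an arbitrary example):

#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void) {
  int fd = socket(AF_INET, SOCK_DGRAM, 0);
  struct sockaddr_in addr;
  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(26400);
  if (bind(fd, (struct sockaddr*)&addr, sizeof(addr)) < 0) {
    perror("bind");
    return 1;
  }
  char frame[1536];
  for (;;) {
    // Every recv() is a system call, and by the time it returns the payload
    // has already been copied once from the NIC into a kernel buffer; this
    // is copy #2, kernel buffer -> user space.
    ssize_t sz = recv(fd, frame, sizeof(frame), 0);
    if (sz > 0) { /* parse the payload */ }
  }
}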
——————————————————————————————————————————————————————————
Limit Order Book
I will first explain some of the design choices I made in terms of data structures and overall architecture, and then go in detail over each function used. By the end of this section you will have a very minimal (i.e. easily expandable) foundation for a simulation lab, as well as a deep understanding of the mechanics of an order book.
A limit order book represents the will of all market participants at any given moment, and because of that it is the only source of truth at (any) time T. An exchange, such as Nasdaq, hosts a LOB for each tradable asset and disseminates each update to each LOB (which can be a buy order, a delete order, etc) to all market participants at the same time, via multicast, through a data feed called ITCH (the same name as the protocol).
The ITCH feed is a firehose of data, as all the updates to every LOB are pushed through it, which means that if you are receiving data from the ITCH feed you will receive updates on tradable assets that you might not even be interested in trading yourself.
To handle that, you will have to filter out the incoming messages from tradable assets that you are not interested in trading, and you can do that either in the NIC (by programming the logic in Verilog/VHDL directly in the card) or in software. (In this post I will cover the latter option.)
Another reason why you might want to filter messages is that you might decide to have specific servers only trade specific assets, which would allow you to load-balance the whole trading operation between different servers and reduce the latency on each individual server (since each will only be processing a subset of the messages received).
It's also important to note that not all types of messages received via the ITCH feed are useful for building the LOB, since not all of them directly represent a change in a LOB (some messages just signal the beginning/end of the trading day, whether there's a halt in trading, etc), and while the parser should still parse them, they are not mandatory.
In an attempt at brevity, I will only show you the functions that parse the messages that actually matter for building the LOB; writing the remaining functions should be trivial for you after this section (with the help of the ITCH protocol specification document).
Now, when you receive a message that affects the LOB, you must store it somewhere. For example, if you receive an "add order" message, you will need to store it so that, in case you later receive a "delete" message (with the same ID), you can make the appropriate changes to the order book.
Since all such messages have an ID, the most common data structure to use is a hash-map, using the ID as the key. This is what you will see in many LOB implementations, and it is a valid option, in particular if you are developing in C++ and pick a good hash-map implementation such as google-dense-hashmap or similar.
I will, however, show you a different approach to building the LOB, which has several performance benefits (and beauty) in comparison to a hash-map, at the expense of some complexity.
Remember that regardless of how fast the software is, it will never come even close to a full hardware system; if you are OK with that and are not trying to run an HFT market-making strategy, you can be on the second tier in terms of speed, and you are granted the ability to run more complex computations.
Without further ado, here's how you build a super-optimized LOB. I will start by showing you the header file - don't try to grasp everything right now, as it will only make sense when you see the implementation files.
// parser.h
#ifndef PARSER_H
#define PARSER_H
#include <stdbool.h>
#include <stdint.h>

// --------------------------------------------------------------------- Macros
// Number of price levels on each side of the book (100 levels = $1 of depth).
#define DEPTH 1000

// ---------------------------------------------------------------------- Stock
typedef struct {
  char symbol[8]; // Space-padded, exactly as it appears in ITCH messages.
  uint32_t previous_day_closing_price;
} stocks; // 12 bytes

// ---------------------------------------------------------------------- Order
typedef struct {
  uint32_t* order_book_volume; // Where this order's shares live in the book.
  uint32_t quantity;
  uint32_t side; // 0 = untracked, 1 = bid, 2 = ask.
} order; // 16 bytes

// ----------------------------------------------------------------- Order book
typedef struct {
  uint32_t prices[2 * DEPTH + 1]; // Consecutive price levels, in cents.
  uint32_t volume[2 * DEPTH + 1]; // Outstanding shares at each price level.
  uint32_t* highest_bid_price;
  uint32_t* highest_bid_volume;
  uint32_t* minimum_bid_price;
  uint32_t* minimum_bid_volume;
  uint32_t* lowest_ask_price;
  uint32_t* lowest_ask_volume;
  uint32_t* maximum_ask_price;
  uint32_t* maximum_ask_volume;
} order_book;

// ------------------------------------------------------------------ Functions
void add_order(char* message, bool* interested_in_trading, order* order_ids,
               order_book* order_books);
void delete_order(char* message, order* order_ids);
void reduce_order(char* message, order* order_ids);
void replace_order(char* message, bool* interested_in_trading,
                   order* order_ids, order_book* order_books);
void parse_stock_id_and_initialize_orderbook(char* message,
                                             bool* interested_in_trading,
                                             order_book* order_books);
#endif // PARSER_H
There are a few points that I want to explain from the code above:
The first one is DEPTH, which stands for the number of price levels an order book can have. Each $1 has 100 price levels (0, 1, ..., 99), so if you want to hold information for $10 of depth on each side of the book, you will need 1000 levels for the Bid side, 1000 levels for the Ask side, and 1 level for the initial spread.
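To make that concrete, with DEPTH = 1000 and a hypothetical stock whose previous close was $150.00 (15000 cents), the prices array of its order book is laid out like this after initialization:

  index:    0        ...    999     1000      1001    ...    2000
  price:  14000      ...   14999   15000     15001    ...   16000
          minimum bid       bid    initial    ask          maximum ask
                                   best bid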
In the order_book struct, you see that there are multiple pointers; these point exclusively to addresses that belong to either the prices or the volume arrays. In the order struct, the pointer within points exclusively to an address in an order_book volume array. (Note that I said "an", not "the": in my implementation, an order has no idea that it belongs to its symbol's order book; it only knows at which address its quantity is stored - you will see why soon.)
Also note that there are only 5 functions; these are all the functions you need to build and maintain a LOB. I will go over each of their implementations in detail soon, but before that, here's the "main" file of the program, which runs the loop that gets data from the NIC, checks the message type and delegates the message to the appropriate function to be parsed.
// main.c
#include <exanic/exanic.h>
#include <exanic/fifo_rx.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include "parser.h"

static void run(exanic_rx_t* nic_buffer, order* order_ids,
                order_book* order_books, bool* interested_in_trading) {
  char message[1536];
  for (;;) {
    // Negative values mean that an error occurred while retrieving the frame;
    // zero means that no new frame has arrived yet.
    ssize_t sz = exanic_receive_frame(nic_buffer, message, sizeof(message), NULL);
    if (sz <= 0) {
      continue;
    }
    switch (*message) {
      case 'A':   // Add order.
      case 'F': { // Add order (with MPID attribution).
        add_order(message, interested_in_trading, order_ids, order_books);
      } break;
      case 'D': { // Delete order.
        delete_order(message, order_ids);
      } break;
      case 'E':   // Order executed.
      case 'C':   // Order executed with price.
      case 'X': { // Order cancelled.
        reduce_order(message, order_ids);
      } break;
      case 'U': { // Replace order.
        replace_order(message, interested_in_trading, order_ids, order_books);
      } break;
      case 'R': { // Stock directory: check if this stock is one we want to be
        // trading.
        parse_stock_id_and_initialize_orderbook(message, interested_in_trading,
                                                order_books);
      } break;
      // Messages that don't directly alter the book; parsing them is optional.
      case 'S': case 'H': case 'Y': case 'L': case 'V': case 'W': case 'K':
      case 'J': case 'h': case 'P': case 'Q': case 'B': case 'I': case 'N': {
      } break;
      default: {
        exit(1);
      }
    }
  }
}

int main(void) {
  exanic_t* exanic = exanic_acquire_handle("exanic0");
  if (!exanic) {
    fprintf(stderr, "exanic_acquire_handle: %s\n", exanic_get_last_error());
    return EXIT_FAILURE;
  }
  exanic_rx_t* nic_buffer = exanic_acquire_rx_buffer(exanic, 0, 0);
  if (!nic_buffer) {
    fprintf(stderr, "exanic_acquire_rx_buffer: %s\n", exanic_get_last_error());
    return EXIT_FAILURE;
  }
  // Pre-allocate everything: 1 billion order slots, 10000 order books and
  // 10000 "interested in trading" flags (see the notes below the code).
  order* order_ids = calloc(1000000000, sizeof(order));
  order_book* order_books = calloc(10000, sizeof(order_book));
  bool* interested_in_trading = calloc(10000, sizeof(bool));
  if (!order_ids || !order_books || !interested_in_trading) {
    free(order_ids);
    free(order_books);
    free(interested_in_trading);
    exanic_release_rx_buffer(nic_buffer);
    exanic_release_handle(exanic);
    return EXIT_FAILURE;
  }
  printf("\nRunning\n");
  printf("-------\n");
  run(nic_buffer, order_ids, order_books, interested_in_trading);
  return EXIT_SUCCESS;
}
The most important things to understand from the code above are the following:
The array interested_in_trading holds 10000 bools, because Nasdaq has eight to nine thousand tradable assets (fewer than 10000).
In the initial messages of every trading day, information about the ID of each particular stock for that day is provided (i.e. NVDA = 5823), and if we are interested in trading NVDA stock on that day, we set interested_in_trading[5823] = 1. This allows other functions, such as add_order(), to check whether the particular message we are receiving belongs (via its stock ID) to one of the stocks we are interested in trading. If so, we parse the message; otherwise, we don't proceed further.
For the same reasons as above, I create 10000 order_books; again, only some of them will actually be used, but it is much preferable to pre-allocate everything.
In the array order_ids, I allocate 1 billion orders; a common day will have around 500-600 million orders, but it's important to have some headroom in case there's a trading day with more activity than usual. (Note that 1 billion 16-byte orders is 16GB of virtual address space; since calloc'ed pages are only committed when touched, resident memory grows with the number of order IDs actually seen.)
Note that order IDs start from 0 every day; if an order was left in the order book from the previous day, it will be re-entered the following day and given a new order ID.
And here's the final piece of the puzzle: the code which parses the ITCH protocol and builds the LOB.
// parser.c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include "parser.h"

// The stocks we are interested in trading. Symbols are space-padded to 8
// characters, exactly as they appear in the ITCH stock directory ('R')
// messages. The previous day closing price (in cents; the value below is
// just an example) would normally be requested from a database.
static stocks valid_stocks[] = {
    {{'N', 'V', 'D', 'A', ' ', ' ', ' ', ' '}, 19450},
};

// Walk the best bid/ask pointers until they sit on levels that actually have
// volume. (Deletes and executions can leave the best level empty.)
static void normalize_best(order_book* ob) {
  for (; ob->highest_bid_volume > ob->minimum_bid_volume &&
         !*ob->highest_bid_volume;
       ob->highest_bid_volume--, ob->highest_bid_price--) {
  }
  for (; ob->lowest_ask_volume < ob->maximum_ask_volume &&
         !*ob->lowest_ask_volume;
       ob->lowest_ask_volume++, ob->lowest_ask_price++) {
  }
}

void add_order(char* message, bool* interested_in_trading, order* order_ids,
               order_book* order_books) {
  uint16_t stock = __builtin_bswap16(*(uint16_t*)(message + 1));
  if (interested_in_trading[stock]) {
    uint64_t id = __builtin_bswap64(*(uint64_t*)(message + 11));
    uint32_t number_of_shares = __builtin_bswap32(*(uint32_t*)(message + 20));
    // ITCH prices carry 4 decimal places; we keep 2 (one level per cent).
    uint32_t price = __builtin_bswap32(*(uint32_t*)(message + 32)) / 100;
    order_book* ob = &order_books[stock];
    // Orders whose price falls outside of the predefined DEPTH range are not
    // added to the book and stay untracked (side == 0).
    if (price <= *ob->minimum_bid_price || price >= *ob->maximum_ask_price) {
      return;
    }
    if (message[19] == 'B') { // bid
      // Find the address of this price level's volume, relative to the
      // highest_bid_price.
      int64_t diff = (int64_t)*ob->highest_bid_price - (int64_t)price;
      uint32_t* volume_at = ob->highest_bid_volume - diff;
      *volume_at += number_of_shares;
      order_ids[id].order_book_volume = volume_at;
      order_ids[id].quantity = number_of_shares;
      order_ids[id].side = 1;
      // If the new bid is higher than the current bid, let the new bid be
      // the highest/best, and re-find the lowest ask above it.
      if (diff < 0) {
        ob->highest_bid_price -= diff;
        ob->highest_bid_volume -= diff;
        ob->lowest_ask_price = ob->highest_bid_price + 1;
        ob->lowest_ask_volume = ob->highest_bid_volume + 1;
      }
    } else { // ask
      // Find the address of this price level's volume, relative to the
      // lowest_ask_price.
      int64_t diff = (int64_t)*ob->lowest_ask_price - (int64_t)price;
      uint32_t* volume_at = ob->lowest_ask_volume - diff;
      *volume_at += number_of_shares;
      order_ids[id].order_book_volume = volume_at;
      order_ids[id].quantity = number_of_shares;
      order_ids[id].side = 2;
      // If the new ask price is lower than the current best ask, let the new
      // ask be the lowest/best, and re-find the highest bid below it. (A sell
      // priced at or below the highest bid would cross the book; in replayed
      // data such an order is matched by the exchange, and the corresponding
      // executions arrive as separate messages.)
      if (diff > 0) {
        ob->lowest_ask_price -= diff;
        ob->lowest_ask_volume -= diff;
        ob->highest_bid_price = ob->lowest_ask_price - 1;
        ob->highest_bid_volume = ob->lowest_ask_volume - 1;
      }
    }
    normalize_best(ob);
  }
}

void delete_order(char* message, order* order_ids) {
  uint64_t id = __builtin_bswap64(*(uint64_t*)(message + 11));
  // Only remove orders that we kept track of.
  if (order_ids[id].quantity) {
    *order_ids[id].order_book_volume -= order_ids[id].quantity;
    order_ids[id].quantity = 0;
  }
}

void reduce_order(char* message, order* order_ids) {
  // 'E' (executed), 'C' (executed with price) and 'X' (cancel) messages all
  // carry the order reference number at offset 11 and the number of shares
  // at offset 19, which is why one function can handle all three of them.
  uint64_t id = __builtin_bswap64(*(uint64_t*)(message + 11));
  uint32_t number_of_shares = __builtin_bswap32(*(uint32_t*)(message + 19));
  if (order_ids[id].quantity) {
    order_ids[id].quantity -= number_of_shares;
    *order_ids[id].order_book_volume -= number_of_shares;
  }
}

void replace_order(char* message, bool* interested_in_trading,
                   order* order_ids, order_book* order_books) {
  uint16_t stock = __builtin_bswap16(*(uint16_t*)(message + 1));
  if (interested_in_trading[stock]) {
    uint64_t original_id = __builtin_bswap64(*(uint64_t*)(message + 11));
    uint64_t new_id = __builtin_bswap64(*(uint64_t*)(message + 19));
    uint32_t number_of_shares = __builtin_bswap32(*(uint32_t*)(message + 27));
    uint32_t price = __builtin_bswap32(*(uint32_t*)(message + 31)) / 100;
    order_book* ob = &order_books[stock];
    // A replace message carries no side field, so the side must come from the
    // original order. If we didn't keep track of the original order then we
    // can not replace it, because we don't know its side. (The reason for
    // such to happen is that the original order fell outside of the
    // predefined price range.)
    uint32_t side = order_ids[original_id].side;
    if (side) {
      *order_ids[original_id].order_book_volume -=
          order_ids[original_id].quantity;
      order_ids[original_id].quantity = 0;
    } else {
      order_ids[new_id].side = side; // The new order is untracked too.
      return;
    }
    if (price <= *ob->minimum_bid_price || price >= *ob->maximum_ask_price) {
      return;
    }
    // Identical to add_order() from here on, except for the field offsets
    // parsed above.
    if (side == 1) { // bid
      int64_t diff = (int64_t)*ob->highest_bid_price - (int64_t)price;
      uint32_t* volume_at = ob->highest_bid_volume - diff;
      *volume_at += number_of_shares;
      order_ids[new_id].order_book_volume = volume_at;
      order_ids[new_id].quantity = number_of_shares;
      order_ids[new_id].side = side;
      // If the new bid is higher than the current bid, let the new bid be
      // the highest/best.
      if (diff < 0) {
        ob->highest_bid_price -= diff;
        ob->highest_bid_volume -= diff;
        ob->lowest_ask_price = ob->highest_bid_price + 1;
        ob->lowest_ask_volume = ob->highest_bid_volume + 1;
      }
    } else { // ask
      int64_t diff = (int64_t)*ob->lowest_ask_price - (int64_t)price;
      uint32_t* volume_at = ob->lowest_ask_volume - diff;
      *volume_at += number_of_shares;
      order_ids[new_id].order_book_volume = volume_at;
      order_ids[new_id].quantity = number_of_shares;
      order_ids[new_id].side = side;
      // If the new ask price is lower than the current best ask, let the new
      // ask be the lowest/best.
      if (diff > 0) {
        ob->lowest_ask_price -= diff;
        ob->lowest_ask_volume -= diff;
        ob->highest_bid_price = ob->lowest_ask_price - 1;
        ob->highest_bid_volume = ob->lowest_ask_volume - 1;
      }
    }
    normalize_best(ob);
  }
}

void parse_stock_id_and_initialize_orderbook(char* message,
                                             bool* interested_in_trading,
                                             order_book* order_books) {
  // 'R' (stock directory): the locate code at offset 1 is the ID by which
  // this stock will be referenced for the rest of the trading day; the
  // 8-character, space-padded symbol is at offset 11.
  uint16_t locate = __builtin_bswap16(*(uint16_t*)(message + 1));
  char* stock = message + 11;
  stocks* ptr = valid_stocks;
  stocks* endptr = valid_stocks + sizeof(valid_stocks) / sizeof(valid_stocks[0]);
  // Compare the current stock name with the stocks we are interested in
  // trading and set interested_in_trading[locate] to 1, as well as prepare
  // the order_book for that stock.
  while (ptr < endptr) {
    if (!memcmp(stock, ptr->symbol, 8)) {
      interested_in_trading[locate] = 1;
      order_book* ob = &order_books[locate];
      // Start with the best bid in the middle of the arrays.
      ob->highest_bid_price = &ob->prices[DEPTH];
      ob->highest_bid_volume = &ob->volume[DEPTH];
      // Point the minimum bid volume to the first element of the array.
      ob->minimum_bid_volume = ob->volume;
      ob->maximum_ask_volume = &ob->volume[2 * DEPTH];
      // Set the value of the highest_bid_price to the previous day closing
      // price. This is necessary because when new bids arrive to the book,
      // their position is computed relative to this value.
      *ob->highest_bid_price = ptr->previous_day_closing_price;
      // Fill the order book with prices increasing on the 'Ask' side and
      // decreasing on the 'Bid' side, one cent per level.
      uint32_t* value = ob->highest_bid_price;
      for (uint32_t i = 1; i <= DEPTH; i++) {
        ob->maximum_ask_price = ob->highest_bid_price + i;
        *ob->maximum_ask_price = *value + i;
        ob->minimum_bid_price = ob->highest_bid_price - i;
        *ob->minimum_bid_price = *value - i;
      }
      // The (initial) best ask sits right above the best bid.
      ob->lowest_ask_price = ob->highest_bid_price + 1;
      ob->lowest_ask_volume = ob->highest_bid_volume + 1;
      return;
    }
    ptr++;
  }
}
Compile with: clang -O3 parser.c main.c -Weverything -Werror -o program -lexanic
Start by reading the parse_stock_id_and_initialize_orderbook() function and then move on to add_order() and the remaining functions.
Still in the parse_stock_id_and_initialize_orderbook() function, it is important to understand that the value ptr->previous_day_closing_price would normally be requested from a database.
The function replace_order() is identical to the function add_order() except for one extra check. The reason why you can't just call add_order() from replace_order() after performing that check is that the message fields to be parsed are at different positions within the payload.
——————————————————————————————————————————————————————————
Measurements
Work in Progress
——————————————————————————————————————————————————————————
Overclocking
Work in Progress
——————————————————————————————————————————————————————————
Linux Tuning
Work in Progress
——————————————————————————————————————————————————————————
Appendix - Costs of opening a one-man HFT firm
In the simplest of setups, you will need to host a minimum of one server with a colocation provider such as TSN/Pico/Options-IT that provides hosting at Carteret (the data center where the Nasdaq Matching Engine (NME) is located), and for hosting a single server expect to pay around $3.5K MRC (monthly recurring charge) with a $2K NRC (non-recurring charge).
This will include layer-1 access to UDP full-book market data (1 hop), which means that your server will be connected to one very fast switch (either a Cisco or Arista L1 switch) which itself is directly connected to the NME.
The average latency from the market data being sent by the NME to it arriving at your network interface card (NIC) will be around 80ns.
To be able to place orders you will have to go through a broker, and if you are capital-constrained, probably your only option here will be Lime Execution (which will cost you around $7K MRC per server).
Your server will be connected to a (not as fast) switch owned by the colocation provider you chose, which itself is connected to an equally not-so-fast switch owned by Lime, which will receive your order, perform some checks (i.e. can you afford what you want to buy, do you own what you want to sell, etc) and finally send it to the NME.
From the data leaving your server to reaching Lime's server it will take around 2 x 380ns on average, plus at least a few microseconds for the checks. Assuming that it takes 3us for the checks to be made and the order to be sent to the NME, your order will take 760ns in transit plus 3us in processing, for a total of almost 4us from leaving your server to arriving at the NME.
That is ~4000 nanoseconds, and it is an important figure to understand, because there are currently systems (hardware/FPGA based) that have direct market access (DMA) to the exchange and can send an order to the exchange, in response to market data, in ~20ns (in those 20ns they parse the data, make some simple computations and send the order). This means that while your order is in traffic to the NME, such systems would have been able to place 200 orders (serially) in the same period of time, which could alter the market to the point that when your order arrives at the NME, the price/quantity you were considering trading at is no longer available.
Note that unless you have around $5M to invest, you won't be able to get "sponsored access"/DMA to the exchange, which is provided by one of the big investment banks, and so you are constrained to going through a broker and incurring the latency mentioned above.
There is also a somewhat hidden cost that you will have to pay in one form or another, and that is related to market data collection.
If you are recording the market data yourself you have a couple of options:
1. In your main process (where your parsing/trading logic is), you add extra functionality to save the data (and even if you are writing to RAM, you will eventually have to save to disk), which likely implies the use of threads.
2. You have a second process reading the data from the NIC and saving it. (You will have to synchronize it with the main process so that both get to read the data before it is evicted from the NIC.)
Whichever option you choose, you will eventually have to make system calls to save the data (which will add latency), you will be putting more stress on the kernel task scheduler (which is non-deterministic), and additionally you will be adding more complexity to the code, which by itself will put more stress on the instruction and data caches.
If you can’t accept the added latency/jitter, you also have a couple of options:
1. Colocate another server which basically only records market data (it doesn’t trade).
This will cost you another 3.5k MRC + 2k NRC + the cost of the server components (and any necessary maintenance).
2. Purchase the data from a data provider such as Maystreet (also check with your colocation provider, as they might offer such an add-on service) and get it by the end of the trading day. This will cost around $500-600 MRC, and it's probably the way to go if you are not in a position where you have rented a whole rack at the data center and have space for an extra server tasked only with recording market data.
As you can see, it is a substantial investment just to be able to "play the game"; however, if you have the capital and the expertise to keep pushing the boundaries, it is definitely worth considering.
——————————————————————————————————————————————————————————
Changelog
August 5th, 2021: Initial commit, covering the first 4 sections of the primer + the appendix "Costs of opening a one-man HFT firm".
August 10th, 2021: Minor change in the valid_stocks struct and in the parse_stock_id_and_initialize_orderbook() function. Thank you for the feedback, Anton.
August 14th, 2021: Improved tracking of orders that won't be added to the order_book (because they fall outside of the predefined range).
——————————————————————————————————————————————————————————
Diogo Flores
2022