High Frequency Trading Final Pres Slides
Project Goals
Design an FPGA-based infrastructure for
HFT that:
▪ Abstracts the details of networking &
financial encoding/decoding.
▪ Handles order book keeping
(pre-processing) to ease development of
HFT algorithms on FPGAs.
▪ Is extensible and easy to interface with,
but still has very low round-trip latency
(<1 µs including the network stack).
System Overview
Our system has five modules:
▪ UDP Network Stack - Xilinx's UDP network stack.
▪ Network Switch/Timestamper - Timestamps
incoming and outgoing packets to track latency.
▪ FAST Encoder/Decoder - Performs FAST
protocol conversion on incoming and outgoing
packets
▪ Order Book - Sorts all current valid bid and ask
orders by price, and passes these top values to the
application layer.
▪ Application Layer - Simple demo "client"
hardware that uses the order book data to execute
trades.
Network Switch & Timestamper
● Automatically tags incoming packets with a
timestamp
● This timestamp is passed throughout the
downstream system unchanged.
● If an incoming message triggers an order,
that message's timestamp gets transmitted
back to the timestamper alongside the
outgoing order.
● The timestamper then computes the latency
and appends it to the outgoing packet.
● Also provides multiplexing on the transmit
side for monitoring (this was not used in the
final project iteration).
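The latency bookkeeping described on this slide can be sketched in software. This is a minimal model under stated assumptions, not the hardware design: the free-running cycle counter, its 32-bit width, the 5 ns clock period (borrowed from the HLS estimates elsewhere in these slides), and the names `Timestamper`, `tag_incoming`, and `finish_outgoing` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

CLOCK_PERIOD_NS = 5  # assumed clock period; not stated for the timestamper itself
COUNTER_BITS = 32    # assumed width of the free-running cycle counter


@dataclass
class Timestamper:
    """Software model of a free-running cycle counter used for tagging."""
    cycle: int = 0

    def tick(self, cycles: int = 1) -> None:
        # Advance the counter, wrapping like a fixed-width hardware register.
        self.cycle = (self.cycle + cycles) % (1 << COUNTER_BITS)

    def tag_incoming(self, payload: bytes) -> Tuple[int, bytes]:
        # Attach the current counter value; downstream blocks pass it along unchanged.
        return self.cycle, payload

    def finish_outgoing(self, timestamp: int, order: bytes) -> bytes:
        # Latency is the wrapped difference between now and the ingress timestamp.
        latency_cycles = (self.cycle - timestamp) % (1 << COUNTER_BITS)
        # Append the latency (in cycles) to the outgoing order as 4 big-endian bytes.
        return order + latency_cycles.to_bytes(4, "big")


ts = Timestamper()
ts.tick(100)
stamp, data = ts.tag_incoming(b"quote")
ts.tick(38)  # pretend the downstream blocks took 38 cycles
packet = ts.finish_outgoing(stamp, b"order")
latency_cycles = int.from_bytes(packet[-4:], "big")
print(latency_cycles, latency_cycles * CLOCK_PERIOD_NS)  # cycles, nanoseconds
```

Taking the difference modulo the counter width keeps the result correct even when the counter wraps between ingress and egress.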
FAST Protocol - Recap
● Financial Information Exchange protocol
(FIX) adapted for streaming
○ Variable length message encoded in
bytes
○ Decoding depends on a template
○ Stop bits determine end of field
Example: 11011000 10000001 01100000 10011000 10011011
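To make the stop-bit rule concrete, here is a minimal sketch that splits the five example bytes shown on this slide into fields. It assumes every field is an unsigned integer carried in the low 7 bits of each byte; in real FAST the interpretation of each field depends on the template, and the function name `split_fast_fields` is hypothetical.

```python
def split_fast_fields(data: bytes) -> list:
    """Split a byte stream into fields; a set MSB marks a field's last byte."""
    fields, value = [], 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)  # keep the 7 payload bits
        if byte & 0x80:                       # stop bit set: field is complete
            fields.append(value)
            value = 0
    return fields


stream = bytes([0b11011000, 0b10000001, 0b01100000, 0b10011000, 0b10011011])
print(split_fast_fields(stream))  # [88, 1, 12312, 27]
```

The third field spans two bytes because its first byte (01100000) has a clear stop bit, so the five bytes carry four fields.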
FAST Decoder
● An input UDP packet arrives in 64-bit chunks
● Packets are buffered into bytes before processing
● Dataflow:
1) Stop bit detector inspects the MSB of each byte
2) Multiple decoders are run in parallel
3) Decoded message is sent to the Order Book
● Latency of 9 cycles @ 5 ns estimated by Vivado HLS
[Figure: FAST Decoder block diagram. The timestamp and UDP metadata pass through; the UDP packet data goes through the stop bit detector into parallel Decode Price, Decode Size, Decode ID, and Decode Type blocks, whose outputs are combined into a market order.]
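The decoder dataflow can be modelled in software as a rough sketch. Several things here are assumptions rather than facts from the slides: the big-endian byte order within each 64-bit chunk, the explicit `n_bytes` payload length, the field order (price, size, ID, type, taken from the order of the blocks in the diagram), and the names `MarketOrder`, `words_to_bytes`, and `decode_order`. The hardware runs the four field decoders in parallel; this model simply processes them in sequence.

```python
from typing import List, NamedTuple


class MarketOrder(NamedTuple):
    price: int
    size: int
    order_id: int
    order_type: int


def words_to_bytes(words: List[int], n_bytes: int) -> bytes:
    """Flatten 64-bit chunks into bytes (big-endian per chunk is an assumption)."""
    raw = b"".join(w.to_bytes(8, "big") for w in words)
    return raw[:n_bytes]  # drop padding in the final chunk


def decode_order(words: List[int], n_bytes: int) -> MarketOrder:
    """Detect stop bits, rebuild each field, and combine them into an order."""
    fields, value = [], 0
    for byte in words_to_bytes(words, n_bytes):
        value = (value << 7) | (byte & 0x7F)  # accumulate 7 payload bits
        if byte & 0x80:                       # MSB set: last byte of this field
            fields.append(value)
            value = 0
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d" % len(fields))
    return MarketOrder(*fields)


# Five payload bytes padded with zeros to fill one 64-bit chunk.
word = int.from_bytes(bytes([0xD8, 0x81, 0x60, 0x98, 0x9B, 0, 0, 0]), "big")
order = decode_order([word], n_bytes=5)
print(order)
```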
FAST Encoder
● An input market order is received from the Custom App module
● Dataflow:
1) Multiple encoders are run in parallel
2) Encoded message is sent to the Network layer in 64-bit chunks
● Latency of 0 cycles @ 5 ns estimated by Vivado HLS
[Figure: FAST Encoder block diagram. The market order feeds parallel Encode Price, Encode Size, Encode ID, and Encode Type blocks, whose outputs are combined into UDP packet data.]
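The encoding direction can be sketched the same way. This is an illustrative model only: it assumes non-negative integer fields, big-endian packing into 64-bit chunks, and zero padding of the last chunk; `encode_field` and `pack_words` are hypothetical names, and the field values are made up.

```python
from typing import List


def encode_field(value: int) -> bytes:
    """Encode a non-negative integer as 7-bit groups; MSB set on the last byte."""
    if value < 0:
        raise ValueError("only non-negative values in this sketch")
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()    # most significant group first
    groups[-1] |= 0x80  # stop bit on the final byte
    return bytes(groups)


def pack_words(payload: bytes) -> List[int]:
    """Pad to a multiple of 8 bytes and pack into 64-bit chunks (big-endian assumed)."""
    padded = payload + bytes(-len(payload) % 8)
    return [int.from_bytes(padded[i:i + 8], "big") for i in range(0, len(padded), 8)]


# Encode four example fields and pack them for the network layer.
payload = b"".join(encode_field(v) for v in (88, 1, 12312, 27))
words = pack_words(payload)
print(payload.hex(), [hex(w) for w in words])
```

Each field is encoded independently of the others, which is what lets the hardware run the four encoders side by side before combining their outputs.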
Order Book Keeping - Recap
▪ It is an essential pre-processing stage for
almost all financial algorithms.
▪ Keeps track of Bids (offers to buy) and Asks
(offers to sell).
[Table: Bid Order Book, with columns Order ID, Time, Size, Bid Price.]
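Order Book Keeping - Implementation
▪ For any node at index j:
– Left child has index 2j.
– Right child has index 2j+1.
[Figure: max-heap drawn as a binary tree with node indices 1-7. Node 1 = 26; nodes 2, 3 = 23, 22; node 4 = 21; nodes 5-7 empty.]
index 1  2  3  4  5  6  7
heap  26 23 22 21
holes (empty)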
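The index rule is easy to check in a few lines. The sketch below stores the example heap in a Python list with slot 0 unused so that the root sits at index 1; the helper names `left`, `right`, and `level` are hypothetical.

```python
def left(j: int) -> int:
    """Index of the left child of node j."""
    return 2 * j


def right(j: int) -> int:
    """Index of the right child of node j."""
    return 2 * j + 1


def level(j: int) -> int:
    """Depth of node j, with the root (index 1) at level 0."""
    return j.bit_length() - 1


heap = [None, 26, 23, 22, 21]  # slot 0 unused so the root sits at index 1
j = 2
print(heap[j], heap[left(j)], right(j), level(4))  # 23 21 5 2
```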
Order Book Keeping - Insertion
Using those two simple yet very helpful
observations, we can build a very efficient
priority queue (PQ).
Example: inserting 24 into the heap 26 23 22 21.
1. Get index of next empty node → 5
2. Get the path → 5-4 = 01 (left - right)
3. Compare to index 1 → 26 > 24 (No swap)
4. Move to left node → index 2x1 = 2
5. Compare to index 2 → 23 < 24 (Swap)
6. Move to right node → index (2x2)+1 = 5
7. Destination reached
[Figure: heap after the insertion. Node 1 = 26; nodes 2, 3 = 24, 22; nodes 4, 5 = 21, 23.]
index 1  2  3  4  5  6  7
heap  26 24 22 21 23
holes (empty)
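The seven steps above amount to a top-down insertion: the path from the root to the target slot is read from the target index, and the larger value stays at each node while the smaller one is carried further down. A minimal sketch follows, assuming a 1-indexed Python list and a target that is the next empty node (holes are not modelled here); the function name `insert` and its signature are hypothetical.

```python
def insert(heap: list, value: int, target: int) -> None:
    """Insert value into a 1-indexed max-heap, walking from the root to target."""
    depth = target.bit_length() - 1  # number of moves from the root
    path = target - (1 << depth)     # e.g. 5 - 4 = 0b01: left, then right
    if target == len(heap):
        heap.append(None)            # make room for the new leaf
    j = 1
    for step in range(depth - 1, -1, -1):
        if heap[j] < value:
            # Keep the larger value here and carry the smaller one down.
            heap[j], value = value, heap[j]
        go_right = (path >> step) & 1  # 0 = left child, 1 = right child
        j = 2 * j + go_right
    heap[j] = value                  # destination reached


heap = [None, 26, 23, 22, 21]
insert(heap, 24, target=5)
print(heap[1:])  # [26, 24, 22, 21, 23]
```

Because each comparison only involves the node being visited, the result of the first comparison at the root is known immediately, without waiting for the value to settle at its final position.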
Order Book Keeping - Specs
▪ Capacity of 4096 bids and 4096 asks.
▪ Streaming output after first comparison.
▪ II=1 for all insertion and deletion loops.
[Figure: example heap with root 24, children 23 and 22, and a hole at index 5.]
HLS Shortcoming
▪ When optimizing the order book for II=1,
we needed to partition the heap array such
that each level of the tree is in a separate
partition.
▪ Array partitioning with varying partition
sizes is not doable in HLS.
▪ We had to implement the heap array as a
2D array with dimensions equal to the
number of levels times the size of the tree
base.
▪ Wastes a lot of memory resources that can
be avoided by more complex coding.
[Figure: example heap 24 / 23 22 / 21 19, shown as a flat array and as one row per tree level.]
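The 2D workaround and its memory cost can be illustrated with a small software model. This sketch is not the HLS code: it uses a 3-level tree for readability, places the example prices at indices 1 to 5 purely for illustration, and the names `LEVELS`, `BASE`, and `to_level_offset` are hypothetical.

```python
LEVELS = 3                # small example; the real order book is much deeper
BASE = 1 << (LEVELS - 1)  # number of nodes in the bottom level


def to_level_offset(j: int):
    """Map a 1-based heap index to (level, offset within that level)."""
    lvl = j.bit_length() - 1
    return lvl, j - (1 << lvl)


# One row per level, each as wide as the tree base, mirroring the 2D workaround.
heap2d = [[None] * BASE for _ in range(LEVELS)]
for j, price in enumerate([24, 23, 22, 21, 19], start=1):
    lvl, off = to_level_offset(j)
    heap2d[lvl][off] = price

allocated = LEVELS * BASE     # slots reserved by the 2D array
usable = (1 << LEVELS) - 1    # slots a complete tree of this depth can use
print(heap2d, allocated, usable)
```

Here 12 slots are allocated for a tree that can hold at most 7 nodes, and the ratio gets worse as the tree grows deeper, since every level is padded out to the width of the base.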
Testing and Verification
▪ We used an incremental approach to
integrate the system.
Iter 1. Network stack & timestamping
Iter 2. FAST Encoder/Decoder w/ Network
Iter 3. Entire system inc. Order Book
▪ This allowed more robust verification in
hardware of components such as the FAST
encoder.
▪ The incremental approach also allowed
creating a latency breakdown of the individual
blocks once synthesized.
Testing and Verification
Two different monitoring mechanisms were
used in the hardware:
1 | MicroBlaze Monitoring
▪ Order Book exposes an AXI-Lite interface
with top bid/ask, which is identical to the
last streaming output.
▪ Top Bid/Ask were reported periodically
(~1s) through JTAG debug interface.
▪ Allows observation of the Order Book
state, which is normally "hidden" behind
the trading algorithm.
Testing and Verification
2 | Network-Based Testing
▪ System is tested as a black-box; can only see the
outgoing packets it transmits.
▪ Latency data is appended to the outgoing
order packets.
▪ Server-side software:
– Generates test data, encodes it as FAST orders
and sends them over Ethernet interface
– Receives and decodes orders from the
hardware
– Computes and displays an equivalent Order
Book state given the test data.
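A software reference model of the kind the server-side tooling needs can be quite small. The sketch below is not the project's script; it is a minimal stand-in that tracks only the best bid and best ask using Python's `heapq`, with a hypothetical class name `ReferenceBook` and made-up test prices.

```python
import heapq


class ReferenceBook:
    """Software model that tracks the best bid and the best ask."""

    def __init__(self):
        self._bids = []  # max-heap emulated by storing negated prices
        self._asks = []  # min-heap

    def add(self, side: str, price: int) -> None:
        if side == "bid":
            heapq.heappush(self._bids, -price)
        elif side == "ask":
            heapq.heappush(self._asks, price)
        else:
            raise ValueError(side)

    def top(self):
        # Highest bid and lowest ask, or None when a side is empty.
        best_bid = -self._bids[0] if self._bids else None
        best_ask = self._asks[0] if self._asks else None
        return best_bid, best_ask


book = ReferenceBook()
for side, price in [("bid", 23), ("ask", 30), ("bid", 26), ("ask", 28), ("bid", 24)]:
    book.add(side, price)
print(book.top())  # (26, 28)
```

Comparing this expected top of book with what the hardware reports is one way to check the order book without seeing its internal state.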
FPGA Platform
▪ Xilinx Kintex Ultrascale FPGA on the Alpha Data 8K5
– 10 Gigabit Ethernet
– Can reuse the same UDP/IP subsystem used for the Shell project
▪ The reported network switch latency shows that the streaming interface adds a
significant amount of latency.
▪ Adding the Order Book added only 12 cycles of latency, a minimal amount.
Area Results
[Table: resource utilization, with columns Resource, Utilization, Available, Utilization %.]
▪ Leber et al. [1] built a solution that decodes multiple FAST streams in parallel and
sends the decoded data to a software processor, at a total latency of 2.6 µs.
▪ Lockwood et al. [2] used the FIX financial exchange protocol for encoding and
decoding and report a latency of 200 ns.
– We measured our system with only the FAST encoder/decoder and achieved
an average latency of 192.3 ns.
HLS ROCKS
▪ Streaming interfaces used HLS library calls
– Easy for integration, but gave little control over the latency of
communication between HLS IP cores.
▪ Each exchange has its own FAST message template
– In C, encoding/decoding functions are simple to order in a way that
matches the message template
▪ Experts in financial trading algorithms can easily tinker/build onto our design
with basic understanding of hardware and no need for RTL expertise.
HLS ROCKS
▪ Design Space Exploration and Optimization
– HLS directives allow easier and faster tuning for performance.
– In HDL, all tuning is essentially manual code changes.
▪ Testing
– Getting RTL to work correctly requires lots of low level debugging.
– HLS design can be tested in C to verify the basic functionality before
worrying about hardware concerns.
▪ Empirical Observation
– Before this course used HLS, many projects finished after the summer.
– Our cohort finished all projects by early June; in our case, mid-May.
– We believe this indicates that HLS is ~2x more productive.
Database Organization
▪ Git/Bitbucket: cloud source code control
▪ Directory structure
– Each member worked on a project in
the hls folder
• Network switch
• FAST Protocol
• Order book
– Separate folder for IP Integration
• Used for individual, partial and full
integration
– Scripts folder contains Python files
used to send information through the
network to the FPGA
[Figure: directory tree with the following labels]
hft: base project folder
src: src code folder
hls: vivado hls projects
ip: built IP cores
build: vivado IP integrator projects
scripts: network scripts
References
[1] C. Leber, B. Geib and H. Litz, "High Frequency Trading Acceleration Using
FPGAs," 2011 21st International Conference on Field Programmable Logic and
Applications, Chania, 2011, pp. 317-322.
[2] J. W. Lockwood, A. Gupte, N. Mehta, M. Blott, T. English and K. Vissers, "A
Low-Latency Library in FPGA Hardware for High-Frequency Trading (HFT),"
2012 IEEE 20th Annual Symposium on High-Performance Interconnects, Santa
Clara, CA, 2012, pp. 9-16.
Thank You!
Order Book Keeping - Deletion
Using those two simple yet very helpful
observations, we can build a very efficient
priority queue (PQ).
Example: deleting the top (26) from the heap 26 24 22 21 23 19.
1. Return the top node.
2. Pick its larger child to replace it.
3. Go to the picked child & repeat.
4. Reaching a leaf node → Add to holes.
If an insertion occurs, fill the holes first before
the next empty node to maintain the heap
structure.
[Figure: heap after the deletion. Node 1 = 24; nodes 2, 3 = 23, 22; node 4 = 21; node 5 empty; node 6 = 19.]
index 1  2  3  4  5  6  7
heap  24 23 22 21 -  19
holes 5
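The deletion steps can be traced with a short sketch. It assumes a 1-indexed Python list in which `None` marks an empty slot, and it keeps the holes in a plain list; `delete_top` and `next_slot` are hypothetical names, and this is a software illustration rather than the pipelined hardware loop.

```python
def delete_top(heap: list, holes: list):
    """Remove and return the root of a 1-indexed max-heap; record the vacated leaf."""
    top = heap[1]
    j = 1
    while True:
        # Children that exist and are not empty.
        kids = [c for c in (2 * j, 2 * j + 1) if c < len(heap) and heap[c] is not None]
        if not kids:
            break                    # reached a leaf
        best = max(kids, key=lambda c: heap[c])
        heap[j] = heap[best]         # pull the larger child up
        j = best
    heap[j] = None                   # the leaf is now empty
    holes.append(j)
    return top


def next_slot(heap: list, holes: list) -> int:
    """Target for the next insertion: fill a hole first, else the next empty node."""
    return holes.pop() if holes else len(heap)


heap = [None, 26, 24, 22, 21, 23, 19]
holes = []
top = delete_top(heap, holes)
print(top, heap[1:], holes)  # 26 [24, 23, 22, 21, None, 19] [5]
slot = next_slot(heap, holes)
```

After the deletion, the next insertion targets index 5 rather than index 7, which is how the holes list keeps the tree from becoming sparse.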