FPGA-based TCP/IP Checksum Offloading Engine for 100 Gbps Networks
[email protected]
‡ Systems Group, Department of Computer Science, ETH Zürich, Switzerland

Abstract—End-to-end packet integrity in TCP/IP is ensured through checksums based on one's complement addition. Once a negligible part of the overall cost of processing a packet, increasing network speeds have turned checksum computation into a bottleneck. In fact, supporting 100 Gbps bandwidth is a challenge partially due to the difficulty of performing checksums at line rate. As part of a larger effort to implement a 100 Gbps TCP/IP stack on an FPGA-based NIC, in this paper we analyse the problem of checksum computation for 100+ Gbps TCP/IP links and describe an open-source solution for the 512-bit wide, 322 MHz buses used in the 100 Gbps Ethernet interfaces of Xilinx UltraScale devices. The proposed architecture computes thirty-three 16-bit one's complement additions in only 3.1 ns, more than enough to support 100 Gbps Ethernet links.

I. INTRODUCTION

The TCP/IP network protocol provides mechanisms to verify data integrity through the detection of corrupted packets. These mechanisms are based on computing a checksum of the packet at the source, before sending it, and then checking the checksum upon arrival at the destination. Checksums are used to protect both the packet headers and the payload of the packet, and there are both layer 3 (IP headers) and layer 4 (TCP segments or UDP datagrams) checksums. In both cases, the TCP/IP checksum is calculated as a one's complement addition over 16-bit words.

At low line rates, the overhead of computing the checksum is noticeable but not necessarily overly large when compared to that of other parts of the protocol. Yet, already early on, Clark et al. [1] suggested that the checksum could become a problem in packet processing. As network bandwidth increases and, thus, the amount of data being sent and received per unit of time grows, in-flight computation of the checksum becomes a major bottleneck and can affect the overall latency. In particular, the checksum must be computed not only at the source and destination but also during routing if there are any changes to the header. In the case of TCP and (optionally) UDP packets, the payload is also included in the checksum. Hence, computing the checksum of a packet potentially involves processing many 16-bit words. As checksum calculation is performed several times over the lifetime of a packet, doing it efficiently is key to achieving higher line rates and lower latency.

Nowadays, at 10 to 100 Gbps speeds, packet processing is offloaded to hardware, namely to the NIC (Network Interface Card), since otherwise processing is too slow. We carried out a simple test to confirm this. We set up a testbed with a 128 GByte, 2.20 GHz Xeon E5-2630 v4 server (A) and a 192 GByte, 2.60 GHz Xeon 6126 server (B), both running Gentoo Linux (kernel 4.14.7). We used Mellanox 100 GbE ConnectX-5 NICs on both servers, connected to each other via a QSFP28+ direct attach cable, and we ran iperf2 over TCP to measure the overall performance. Results show that, when server B acts as a client, performance decreases from 28.3 Gbps to 10.0 Gbps when checksum offloading is disabled (rx off and tx off options of ethtool).

As part of a larger effort to implement a 100 Gbps TCP/IP stack on programmable logic, intended for Software Defined Networks (SDN) and in-network data processing, we have confronted the need to implement an ultra-low latency checksum mechanism capable of sustaining a 100 Gbps line rate. While the checksum arithmetic is well known and comparatively simple (quoting RFC 793 [2]: "The checksum field is the 16-bit one's complement of the one's complement sum of all 16-bit words in the header and text"), the low latency requirement imposed by higher bandwidth is a major challenge. In this paper we describe our solution to compute the checksum at 100 Gbps on an FPGA. We aim for a one clock cycle implementation in order to support maximum throughput even for short packets. Our design covers computing the checksum for i) the IP header: from 10 to 30 × 16-bit words, to be processed in one clock cycle; and ii) the UDP/TCP/ICMP header plus payload: iterating at 33 × 16-bit words per clock cycle over several clock cycles. The target frequency is 322 MHz (3.1 ns), which is the one used by Xilinx's integrated CMAC core (the Ethernet Media Access Controller (MAC) plus Physical Coding Sublayer (PCS)). The results are available as two open source implementations, one for the IP header and another for ICMP, UDP, and TCP packets.
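To fix ideas, the arithmetic to accelerate is summarized by the following C software model (our sketch of the RFC 1071 procedure, not the hardware design itself):

    #include <stddef.h>
    #include <stdint.h>

    /* One's complement checksum of a byte buffer (RFC 1071 style).  */
    /* Software reference model; the hardware described in this      */
    /* paper performs the same arithmetic over 33 x 16-bit operands  */
    /* per clock cycle.                                              */
    uint16_t ones_complement_checksum(const uint8_t *data, size_t len)
    {
        uint32_t sum = 0;
        while (len > 1) {                  /* big-endian 16-bit words */
            sum += ((uint32_t)data[0] << 8) | data[1];
            data += 2;
            len  -= 2;
        }
        if (len == 1)                      /* odd length: zero padding */
            sum += (uint32_t)data[0] << 8;
        while (sum >> 16)                  /* swing the carries around */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;             /* transmitted checksum */
    }

At 100 Gbps, the difficulty is that this loop must effectively consume 64 bytes of input every 3.1 ns cycle.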
II. RELATED WORK

The checksum computation is a straightforward algorithm, described in RFC 793 [2] and RFC 1071 [3]. Nevertheless, the overhead of checksum processing when implemented in software is well known [4], [5]. Kay and Pasquale [6] showed that checksum calculation was the major processing overhead in a software TCP/IP implementation. Today's commercial NICs offer TCP/IP offloading, reducing the overhead on the CPU for any task related to packet processing, including the checksum. Indeed, the Xilinx AXI 1G/2.5G Ethernet subsystem also provides full checksum offloading capabilities. Unfortunately, these capabilities are available neither in the 10 Gigabit Ethernet Subsystem nor in the UltraScale Devices Integrated 100G Ethernet core.

One's complement addition has the interesting property of being associative. Thus, the 16-bit words can be added in any order. Although a hardware implementation was already suggested in RFC 1936 in 1996 [7], there have been only a few studies of low latency checksums in hardware, mostly for 10+ Gbps networking. Henriksson et al. [8] provide a 0.18 µm ASIC implementation for 10 Gbps Ethernet. On FPGAs, a Stratix III has been used to achieve a throughput of 14.2 Gbps [9]. Atomic Rules [10], along with other companies, offers a 10 to 400 GbE UDP Offload Engine with integrated checksum computation for the UltraScale(+) families, but no details about the implementation are available and only the 10G and 40G versions seem to be accessible at the moment. Recent implementations of a 10 Gbps TCP/IP stack for FPGAs include checksum computation [11], [12]. In this paper, we show how to perform the TCP/IP checksum computation at a minimum line rate of 100 Gbps.

III. USE CASES

The results of this paper are part of a larger effort to implement an open source TCP/IP stack capable of reaching a 100 Gbps line rate. Other projects that can benefit from the contributions of this work are, for instance, a 10 Gbps TCP/IP stack [12], which we use as a starting point, or projects using a NetFPGA [13]. We have focused this work on IPv4 (Internet Protocol Version 4).

We make no assumptions about the use of the stack, regarding whether it runs on a NIC as an end point of the network, on a router, or on a middle box; because of that, we aim for one-cycle latency. In all such cases, checksum computation is used quite often. For instance, routers recompute the checksum if headers change as a result of NAT (Network Address Translation) or port re-assignment. In NICs with offloading support, the host sends data through PCIe and the NIC has to segment and packetise such data, performing the checksum computation and including the result in the packet header. Finally, in accelerators using TCP/IP as a means of communication, e.g. [11], outgoing packets must include both IP and TCP checksums, and the checksum of the incoming packet must be verified before it is processed.

IV. HANDLING 100 GBPS IN THE FPGA

In the market we can find different FPGAs that can handle 100 Gbps links at the physical level. Xilinx has several development boards supporting that rate. In this paper, we focus on the VCU108 (XCVU095-2FFVA2104E UltraScale) [14] and VCU118 (XCVU9P-L2FLGA2104E UltraScale+) [15] boards. Both devices provide an integrated 100G Ethernet Subsystem [16], containing an Ethernet MAC and PCS core, complying with the IEEE 802.3-2012 specification. This integrated core presents a 4 × 128-bit segmented LBUS interface that can be easily adapted to a 512-bit AXI4-Stream. The operating frequency is 322 MHz (3.103 ns). Accordingly, to maintain line rate processing, any checksum design has to work at the same frequency, otherwise latency will increase. From now on, we will assume that the checksum implementation is interfaced with a 512-bit AXI4-Stream input and a 16-bit AXI4-Stream output. In the case of checksums that span more than 64 bytes (e.g., the checksum of a TCP segment), the message is split into as many 512-bit AXI4-Stream transactions as necessary.

In TCP/IP over Ethernet, the shortest and largest packets are as follows. For the smallest packets, the size is 60 bytes and the time between packets can be as small as 6.72 ns. With an MTU (Maximum Transmission Unit) of 1500 bytes, a packet has to be processed every 121.92 ns.
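Both figures follow from charging each frame with 24 bytes of Ethernet framing overhead (preamble, frame check sequence and inter-frame gap), an accounting we assume here:

    (60 + 24) byte × 8 bit/byte ÷ 100 Gbps = 6.72 ns
    (1500 + 24) byte × 8 bit/byte ÷ 100 Gbps = 121.92 ns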
V. IPV4 HEADER AND TCP/UDP CHECKSUM CALCULATION

Following the RFC 793 and RFC 1071 [2], [3] specifications, the transmitter side computes the checksum as follows:
i. The value of the checksum word (16-bit) is set to zero, since the checksum field is part of the packet for which the checksum has to be computed.
ii. The message is split into 16-bit words.
iii. All 16-bit words are added using one's complement arithmetic.
iv. The sum is complemented (flip the bits) and becomes the overall checksum.
v. The checksum is inserted into the header and sent with the data.

The receiver uses the following complementary steps for error detection:
i. The message (including the checksum field) is split into 16-bit words.
ii. Every word is added using one's complement arithmetic.
iii. The result is the computed checksum. If the value is 0, the message is considered to be correct.

The procedure is similar for the IP header, the TCP header and, optionally, the UDP header. In what follows, the differences between the three cases are explained and one's complement arithmetic is discussed in detail.

A. IP Header Checksum Computation

Figure 1a shows the IPv4 header with its 14 fields. Thirteen are mandatory, whereas the 14th field is optional and appropriately named: options.
Fig. 1: (a) IP Version 4 header. (b) TCP header.

Since an IPv4 header may contain a variable number of options, the Internet Header Length (IHL) field (4-bit) defines the size of the header in steps of 32-bit words, which also coincides with the offset to the data. The minimum value for this field is five, which indicates a length of 5 × 32-bit = 160-bit ≡ 20-byte ≡ 10 × 16-bit words. As a 4-bit field, the maximum value is fifteen words (15 × 32-bit, or 480-bit ≡ 60-byte ≡ 30 × 16-bit words).

Our interface is 512-bit wide. As a result, the whole IP header is contained in a single transaction. Processing complexity is mainly caused by the variable header length, which requires a variable sum ranging from 10 to 30 × 16-bit words. The proposed architecture considers the maximum possible number of words and a multiplexer selects whether each one is valid or not.
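As a software model of that approach (ours, not the RTL), the following C sketch sums all 30 candidate words unconditionally and zeroes out those beyond the IHL-encoded length, which is the role the multiplexer plays in hardware:

    #include <stdint.h>

    /* Software model of the IPv4 header checksum unit. hdr holds   */
    /* the header as up to 30 big-endian 16-bit words already       */
    /* unpacked into host integers; ihl is the 4-bit IHL field.     */
    uint16_t ipv4_header_checksum(const uint16_t hdr[30], unsigned ihl)
    {
        unsigned nwords = ihl * 2;        /* IHL counts 32-bit units */
        uint32_t sum = 0;
        for (unsigned i = 0; i < 30; i++) {
            uint16_t w = (i < nwords) ? hdr[i] : 0; /* invalid -> 0  */
            if (i == 5)
                w = 0;                    /* checksum field itself -> 0 */
            sum += w;
        }
        while (sum >> 16)                 /* fold end-around carries */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;
    }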
B. TCP/UDP Header Checksum Computation

The TCP checksum is a 16-bit field in the header (Fig. 1b). In this case the checksum covers a pseudo-header, the TCP/UDP header, and the payload. The pseudo-header has to be created prior to checksum calculation with information taken from the IP header: the source and destination IP addresses, the protocol, and the TCP segment length (Fig. 2). The message covered by this checksum is shown in Fig. 3. In the case of an odd number of bytes, zero padding is required to make the total number even. Note that the checksum field is at the very beginning of the header, meaning that the packet has to be stored and cannot be delivered until the last byte has been taken into account in the checksum. This is why the latency of the checksum computation impacts directly on the latency of the network protocol.
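For illustration, the pseudo-header of Fig. 2 can be sketched as the following 96-bit C structure (the layout follows the figure; the names are ours):

    #include <stdint.h>

    /* TCP/UDP pseudo-header: built from IP-layer information, used */
    /* only for checksum purposes, never transmitted on the wire.   */
    /* All fields are in network byte order.                        */
    struct pseudo_header {
        uint32_t src_addr;   /* source IPv4 address                 */
        uint32_t dst_addr;   /* destination IPv4 address            */
        uint8_t  zero;       /* fixed zero / reserved               */
        uint8_t  protocol;   /* 6 for TCP, 17 for UDP               */
        uint16_t length;     /* TCP segment / UDP datagram length   */
    };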
For UDP packets, the checksum is computed including the same pseudo-header as TCP (Figure 2). The size of the UDP header is fixed to 64 bits, comprising four fields of 2 bytes (16 bits) each. The maximum length of a UDP packet is 65,507 bytes (65,535-byte maximum IP packet size − 8-byte UDP header − 20-byte IP header). For the longest packet, 12 + 65,535 bytes = 65,547 bytes require 1025 × 512-bit AXI4-Stream transactions; we consider that case for a specialised network supporting jumbo frames. However, packets are hardly ever that long in a wide area network, because the Ethernet MTU is usually 1500 bytes.

In the case of TCP, the main difference with respect to UDP is that the TCP header might have optional fields, so its size varies between 20 and 60 bytes in 4-byte steps. In practical terms, the payload varies from 0 (acknowledgement packets without payload) to 1460 bytes (the maximum segment size that fits the Ethernet MTU). Hence, the shortest packets have 12 bytes (pseudo-header) + 20 bytes (header), i.e., they can be processed in one 512-bit transaction. For the longest packets, 12 + 20 + 1460 bytes = 1492 bytes require 24 × 512-bit AXI4-Stream transactions.

C. One's Complement Addition

The one's complement of a binary number K in an N-bit representation system is determined by inverting every bit (flipping 0s for 1s and vice versa). It is arithmetically equivalent to computing 2^N − 1 − K. Therefore, zero has two representations, (00...00) and (11...11). As a consequence, an N-bit one's complement numeral system can only represent integers in the range −(2^(N−1) − 1) to 2^(N−1) − 1, while two's complement can represent −2^(N−1) to 2^(N−1) − 1.

One's complement addition is calculated by summing as natural numbers and adding the carry out back into the result (known as swinging the bit(s) around). An interesting property of multi-operand one's complement addition is the possibility to add as natural numbers (equivalent to two's complement) and "recirculate" (swing around) all the carries at the end. As an example, consider the following 20-byte IP header, where the 16-bit word F96A is the checksum field:

    4500 0030
    0000 0000
    4006 F96A
    C0A8 0005
    C0A8 0008

In order to calculate the checksum, we sum each of the 16-bit values within the header in a two's complement fashion, skipping only the checksum field itself (considered as 0). Then we have the following addition (values in hexadecimal):

    4500 + 0030 + 0000 + 0000 + 4006 + 0 +
    C0A8 + 0005 + C0A8 + 0008 = 20693h (132755 dec).

Swinging the carries around gives 0693 + 2 = 0695, whose complement is F96A, the value inserted in the checksum field. In order to verify the checksum, all 16-bit numbers are added, including the checksum:

    4500 + 0030 + 0000 + 0000 + 4006 + F96A +
    C0A8 + 0005 + C0A8 + 0008 = 2FFFDh (196605 dec).

Recirculating the carries gives FFFD + 2 = FFFF, whose complement is 0, so the message is considered correct.

Fig. 2: TCP and UDP pseudo-header (source address, destination address, zero, protocol, TCP segment length).
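The example can be replayed mechanically. The following self-contained C program (our sketch) reproduces both sums, the carry folding, and the final comparison:

    #include <stdint.h>
    #include <stdio.h>

    /* Replays the worked example: 20-byte IP header, checksum 0xF96A. */
    int main(void)
    {
        const uint16_t hdr[10] = {
            0x4500, 0x0030, 0x0000, 0x0000, 0x4006,
            0xF96A,                        /* checksum field */
            0xC0A8, 0x0005, 0xC0A8, 0x0008
        };
        uint32_t tx = 0, rx = 0;
        for (int i = 0; i < 10; i++) {
            rx += hdr[i];                  /* receiver: checksum included */
            tx += (i == 5) ? 0 : hdr[i];   /* sender: checksum field as 0 */
        }
        printf("tx sum = 0x%05X\n", (unsigned)tx);   /* 0x20693 (132755) */
        printf("rx sum = 0x%05X\n", (unsigned)rx);   /* 0x2FFFD (196605) */
        while (tx >> 16) tx = (tx & 0xFFFF) + (tx >> 16);  /* -> 0x0695  */
        while (rx >> 16) rx = (rx & 0xFFFF) + (rx >> 16);  /* -> 0xFFFF  */
        printf("checksum = 0x%04X\n", (unsigned)(~tx & 0xFFFF)); /* F96A */
        printf("valid    = %d\n", (~rx & 0xFFFF) == 0);          /* 1    */
        return 0;
    }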
(Figure: bit-level view of the one's complement addition of two operands A and B, with carries C0 … Cn−1 recirculated into the result Re0 … Ren−1.)
Fig. 6: Reduction tree data arrangement. (a) First level of reductions. (b) Following levels of reductions.

In the 7-3 carry-save adder (CSA), the seven bits of each column are compressed into three: R0 represents the sum of the columns, R1 represents the first carry and, finally, R2 contains the second carry.

The first level of reduction is depicted in Fig. 6a, where each 7-bit column is reduced to a 3-bit result (the last two groups only cluster 6 elements). As a consequence, 33 × 16-bit numbers are reduced to 15 × 16-bit numbers, as shown in Level 2 (L2) of Fig. 6b. Observe that the white dots correspond to swinging the overflow bits from the dotted circles around. The second level clusters two 7-bit elements per column, leaving one row for the next level, thus reducing from 15 to 7 numbers. For the third level (L3), seven numbers are clustered and reduced to three numbers. As explained, with only three levels of logic, we are able to reduce 33 numbers to only 3 numbers without increasing the width of the numbers, 16 bits each. In what follows we discuss different alternatives to sum three numbers in one's complement arithmetic.

Reduction tree version 1

The first version of the reducer tree, called ArchRed1, finishes with a ternary adder plus two binary adders, as suggested in the following pseudo code.

    L4 = L3(2) + L3(1) + L3(0);
    sumPrev = L4(15:0) + L4(17:16);
    sumFinal = sumPrev(15:0) + sumPrev(16);

Reduction tree version 2

The second version, ArchRed2, tries to reduce the three successive additions, processing in parallel the possible results and selecting the correct one with a multiplexer. The two most significant bits of the sum are used as selector.

    L4(0) = L3(2) + L3(1) + L3(0);
    L4(1) = L3(2) + L3(1) + L3(0) + 1;
    L4(2) = L3(2) + L3(1) + L3(0) + 2;
    with (L4(0)(17:16)) select
        sumFinal = L4(0) when "00",
                   L4(1) when "01",
                   L4(2) when others;

Reduction tree version 3

The third version, ArchRed3, first compresses the three operands into two with a 3-2 CSA; the two possible final results are then computed in parallel and a multiplexer selects the correct one. The most significant bit of the sum is used as selector.

    (L4(1), L4(0)) = CSA(L3(2), L3(1), L3(0));
    L5(0) = L4(1) + L4(0);
    L5(1) = L4(1) + L4(0) + 1;
    sumFinal = L5(0) when (L5(0)(16) = '0') else L5(1);

Using this solution, only one 16-bit carry propagation is present in the critical path. In a Xilinx device, this means two CARRY8 components, therefore yielding the best delay result.
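To make the carry recirculation concrete, the following C model (ours; a word-level 3-2 formulation rather than the bit-level 7-3 counters of the actual design) reduces any set of 16-bit words with carry-save steps whose carry word is rotated left by one position, leaving a single carry-propagating addition at the end, as in ArchRed3:

    #include <stdint.h>
    #include <stdio.h>

    /* 3-2 carry-save step for one's complement addition:           */
    /* a + b + c == sum + carry (mod 2^16 - 1), where the carry     */
    /* word is rotated left by one so its end-around carries are    */
    /* pre-applied.                                                 */
    static void csa32(uint16_t a, uint16_t b, uint16_t c,
                      uint16_t *sum, uint16_t *carry)
    {
        uint16_t maj = (a & b) | (a & c) | (b & c);
        *sum   = a ^ b ^ c;
        *carry = (uint16_t)((maj << 1) | (maj >> 15)); /* rotl by 1 */
    }

    /* Reduces n >= 2 words to their 16-bit one's complement sum.   */
    uint16_t csa_reduce(uint16_t w[], int n)
    {
        while (n > 2) {                /* CSA levels: no carry ripple */
            int out = 0, i = 0;
            for (; i + 3 <= n; i += 3) {
                uint16_t s, c;
                csa32(w[i], w[i + 1], w[i + 2], &s, &c);
                w[out++] = s;
                w[out++] = c;
            }
            for (; i < n; i++)         /* leftovers pass through      */
                w[out++] = w[i];
            n = out;
        }
        uint32_t s = (uint32_t)w[0] + w[1]; /* the only 16-bit carry  */
        s = (s & 0xFFFF) + (s >> 16);       /* chain, as in ArchRed3  */
        s = (s & 0xFFFF) + (s >> 16);
        return (uint16_t)s;
    }

    int main(void)
    {
        /* The Section V example, checksum field zeroed, zero-padded */
        /* to the 33 operands handled per cycle.                     */
        uint16_t w[33] = { 0x4500, 0x0030, 0x0000, 0x0000, 0x4006,
                           0x0000, 0xC0A8, 0x0005, 0xC0A8, 0x0008 };
        uint16_t s = csa_reduce(w, 33);
        printf("sum = 0x%04X, checksum = 0x%04X\n",
               (unsigned)s, (unsigned)(~s & 0xFFFF)); /* 0695, F96A */
        return 0;
    }

The left rotation is valid because the whole computation is carried out modulo 2^16 − 1: rotating a 16-bit word left by one position is exactly multiplication by two in that modulus, so the weight-2 carries are folded in without propagating any carry chain.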
VII. EXPERIMENTAL EVALUATION

All the previous architectures have been synthesized and implemented using Vivado 2017.4 for the Xilinx UltraScale+ architecture, targeting the VCU118 development board [15]. Table I shows the delay and logic-depth breakdown for the studied circuits, where lgc and rt stand for the time spent in logic and routing, respectively. The Logic Levels column details the components in the critical path. Additionally, the area and delay of the different circuits described are presented. LUT and CLB usage are included, and the number of carry-logic components (CARRY8) is reported as well. The inputs and outputs were registered in order to obtain the post place and route timing report. The first element of the Max Delay column summarizes the worst path, expressed in ns.

The reducer trees are the only alternative that meets timing at 322 MHz, the target frequency for Xilinx's 100 Gbps Ethernet interfaces. The design ArchRed3 is the one with the minimum delay. The area penalty of this solution is almost negligible compared to the naïve implementation.

The delay results shown in Table I are valid for Virtex UltraScale+ (16 nm), whereas in the Virtex UltraScale (20 nm) the same circuits have a delay penalty ranging from 27% to 35% due to the use of a previous-generation process. In fact, in Virtex UltraScale, the best available design, ArchRed3, reaches only 3.7 ns. The best solution has also been included in a bigger design, a VCU118 project using more than 20% of the LUTs, with real producer and consumer logic, reaching similar delay results.
TABLE I: Delay and logic-depth breakdown for the studied circuits in an UltraScale+ architecture (lgc = logic time, rt = routing time)

Circuit  Max Delay                                        Logic Levels                                  LUTs  Carry8  CLBs
Bin16    4.684 ns (lgc 2.4 ns, 52.1%; rt 2.2 ns, 47.9%)   20 (CY8=13 LUT2=6 LUT6=1)                     541   98      117
Bin32    5.278 ns (lgc 2.9 ns, 54.6%; rt 2.4 ns, 45.4%)   25 (CY8=17 LUT2=7 LUT3=1)                     530   85      103
Tern16   4.073 ns (lgc 2.0 ns, 49.8%; rt 2.0 ns, 50.2%)   17 (CY8=11 LUT2=2 LUT3=3 LUT6=1)              368   53      96
Red1     3.691 ns (lgc 1.6 ns, 42.9%; rt 2.1 ns, 57.1%)   13 (CY8=6 LUT2=1 LUT6=4 MUXF7=2)              707   8       130
Red2     3.070 ns (lgc 0.9 ns, 29.0%; rt 2.2 ns, 71.0%)   11 (CY8=3 LUT5=1 LUT6=4 MUXF7=3)              748   5       138
Red3     2.979 ns (lgc 0.9 ns, 31.5%; rt 2.0 ns, 68.5%)   10 (CY8=2 LUT3=1 LUT5=1 LUT6=3 MUXF7=3)       734   5       140
Figure 7 shows the performance of the checksum architecture depending on the message size. The theoretical throughput of a 100 Gbps Ethernet link is also shown. It demonstrates that the maximum possible performance is achieved comfortably even for the smallest packets. The figure also shows that the implementation reaches 164.86 Gbps when the message size is a multiple of 64 bytes, since then there are no wasted bytes in the last AXI4-Stream transaction.

Fig. 7: Checksum computation performance. (Plot: Bandwidth (Gbps) versus Message Size (Bytes); series: Checksum Throughput and Ethernet Theoretical Throughput.)

To support a hypothetical 200 Gbps link, we assume that the bus width will double to reach a 1024-bit AXI4-Stream. In such a case, the checksum for TCP/UDP in the worst-case scenario can be reduced to a 65 × 16-bit word one's complement addition at 322 MHz. Following the same idea as described in the design ArchRed3, with an extra level of reduction, it is feasible to reach the required processing rate. However, the delay of this new level has to be cut out from the routing. In the current Virtex UltraScale+ architecture this is only viable with a careful relative placement of the logic. In conclusion, the throughput of the proposed design can be doubled, but further work is needed to do so.
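A rough count, assuming the same 7-3 grouping as in Fig. 6, shows where the extra level comes from:

    65 → 29 → 13 → 6 → 3 numbers,

i.e., four CSA levels instead of three before the final two-operand addition.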
VIII. CONCLUSION

In this paper we show how to efficiently calculate the one's complement checksum on 100 Gbps Ethernet links taking advantage of the Xilinx Virtex UltraScale+ architecture. Using Xilinx's integrated Ethernet Subsystem and its 512-bit wide interface, the problem can be summarized as the addition of 33 × 16-bit numbers in one's complement arithmetic at 322 MHz. After evaluating several alternatives, we conclude that the best solution uses three levels of 7-3 CSAs, which fit well in the slice architecture of Xilinx devices. This means that, after three levels of logic, the problem is transformed into the one's complement addition of 3 × 16-bit numbers. For the final addition, a 3-2 CSA adder, two binary adders in parallel, and a multiplexer are used. This architecture reaches the desired frequency with a negligible area penalty in Virtex UltraScale+ devices, achieving an initiation interval of one and one clock cycle of latency, extremely useful to reach maximum throughput with short packets. Extrapolating that idea, a checksum offloading engine at 200 Gbps is feasible with a meticulous relative placement of the logic. All the designs discussed are available as open source [18].

ACKNOWLEDGMENT

This work was partially supported by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund under the project TRÁFICA (MINECO/FEDER TEC2015-69417-C2-1-R), and by the European Commission under the METRO-HAUL project (grant agreement No. 761727) of the H2020 programme.

REFERENCES

[1] D. D. Clark, V. Jacobson, J. Romkey, and H. Salwen, "An Analysis of TCP Processing Overhead," IEEE Communications Magazine, vol. 27, no. 6, pp. 23–29, 1989.
[2] J. Postel et al., "Transmission Control Protocol," RFC 793, 1981.
[3] Network Working Group, "Computing the Internet Checksum," RFC 1071, 1988.
[4] J. H. Huang and C.-W. Chen, "On Performance Measurements of TCP/IP and its Device Driver," in Proceedings of the 17th Conference on Local Computer Networks. IEEE, 1992, pp. 568–575.
[5] G. Regnier, S. Makineni, R. Illikkal, R. Iyer, D. Minturn, R. Huggahalli, D. Newell, L. Cline, and A. Foong, "TCP Onloading for Data Center Servers," Computer, no. 11, pp. 48–58, 2004.
[6] J. Kay and J. Pasquale, "Profiling and Reducing Processing Overheads in TCP/IP," IEEE/ACM Transactions on Networking (TON), vol. 4, no. 6, pp. 817–828, 1996.
[7] J. Touch and B. Parham, "Implementing the Internet Checksum in Hardware," Network Working Group, Tech. Rep. (RFC 1936), 1996.
[8] T. Henriksson, N. Persson, and D. Liu, "VLSI Implementation of Internet Checksum Calculation for 10 Gigabit Ethernet," in Proceedings of Design and Diagnostics of Electronics, Circuits and Systems, pp. 114–121, 2002.
[9] E. B. Eyo and T. A. Nwodoh, "Designing TCP/IP Checksum Function for Acceleration in FPGA," Nigerian Journal of Technology, vol. 29, no. 3, pp. 31–41, 2010.
[10] Atomic Rules, "10/25/40/50/100/400 GbE UDP Offload Engine," Atomic Rules, Tech. Rep. [Online]. Available: http://www.atomicrules.com/wp-content/uploads/2016/04/AtomicRules_UOE_170822.pdf
[11] D. Sidler, G. Alonso, M. Blott, K. Karras, K. Vissers, and R. Carley, "Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware," in FCCM'15. IEEE, 2015, pp. 36–43.
[12] D. Sidler, Z. István, and G. Alonso, "Low-latency TCP/IP Stack for Data Center Applications," in FPL'16, 2016.
[13] N. Zilberman, Y. Audzevich, G. A. Covington, and A. W. Moore, "NetFPGA SUME: Toward 100 Gbps as Research Commodity," IEEE Micro, vol. 34, no. 5, pp. 32–41, 2014.
[14] Xilinx Inc., "VCU108 Evaluation Board, User Guide UG1066," Tech. Rep., Jul. 2016. [Online]. Available: http://www.xilinx.com/support/documentation/boards_and_kits/vcu108/ug1066-vcu108-eval-bd.pdf
[15] Xilinx Inc., "VCU118 Evaluation Board, User Guide UG1224," Tech. Rep., May 2018. [Online]. Available: https://www.xilinx.com/support/documentation/boards_and_kits/vcu118/ug1224-vcu118-eval-bd.pdf
[16] Xilinx Inc., "UltraScale+ Devices Integrated 100G Ethernet Subsystem v2.4," Tech. Rep., Apr. 2018. [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/cmac_usplus/v2_4/pg203-cmac-usplus.pdf
[17] J.-P. Deschamps, G. J. Bioul, and G. D. Sutter, Synthesis of Arithmetic Circuits: FPGA, ASIC and Embedded Systems. John Wiley & Sons, 2006.
[18] "Efficient Checksum Offload Engine," https://github.com/hpcn-uam/efficient_checksum-offload-engine.