FPGA-based TCP/IP Checksum Offloading Engine for 100 Gbps Networks
[email protected]
‡ Systems Group, Department of Computer Science, ETH Zürich, Switzerland

Abstract—End-to-end packet integrity in TCP/IP is ensured through checksums based on one's complement addition. Once a negligible part of the overall cost of processing a packet, increasing network speeds have turned checksum computation into a bottleneck. In fact, supporting 100 Gbps bandwidth is a challenge partially due to the difficulty of performing checksums at line rate. As part of a larger effort to implement a 100 Gbps TCP/IP stack on an FPGA-based NIC, in this paper we analyse the problem of checksum computation for 100+ Gbps TCP/IP links and describe an open-source solution for the 512-bit wide, 322 MHz buses used in the 100 Gbps Ethernet interfaces of Xilinx UltraScale devices. The proposed architecture computes thirty-three 16-bit one's complement additions in only 3.1 ns, more than enough to support 100 Gbps Ethernet links.

I. INTRODUCTION

The TCP/IP network protocol provides mechanisms to verify data integrity through the detection of corrupted packets. These mechanisms are based on computing a checksum of the packet at the source, before sending it, and then checking the checksum upon arrival at the destination. Checksums are used to protect both the packet headers and the payload of the packet, and there are both layer 3 (IP headers) and layer 4 (TCP segments or UDP datagrams) checksums. In both cases, the TCP/IP checksum is calculated as a one's complement addition over 16-bit words.

At low line rates, the overhead of computing the checksum is noticeable but not necessarily overly large when compared to that of other parts of the protocol. Yet, already early on, Clark et al. [1] suggested that the checksum could become a problem in packet processing. As network bandwidth increases and, thus, the amount of data being sent and received per unit of time grows, in-flight computation of the checksum becomes a major bottleneck and can affect the overall latency. In particular, the checksum must be computed not only at the source and destination but also during routing if there are any changes to the header. In the case of TCP and (optionally) UDP packets, the payload is also included in the checksum. Hence, computing the checksum of a packet potentially involves processing many 16-bit words. As checksum calculation is performed several times over the lifetime of a packet, doing it efficiently is key to achieving higher line rates and lower latency.

Nowadays, at 10 to 100 Gbps speeds, packet processing is offloaded to hardware, namely to the NIC (Network Interface Card), since otherwise processing is too slow. We carried out a simple test to confirm this. We set up a testbed with a 128 GByte, 2.20 GHz Xeon E5-2630 v4 server (A) and a 192 GByte, 2.60 GHz Xeon 6126 server (B), both running Gentoo Linux (kernel 4.14.7). We used Mellanox 100 GbE ConnectX-5 NICs on both servers, connected to each other via a QSFP28+ direct attach cable, and we ran iperf2 over TCP to measure the overall performance. Results show that, when server B acts as a client, performance decreases from 28.3 Gbps to 10.0 Gbps when checksum offloading is disabled (rx off and tx off options of ethtool).

As part of a larger effort to implement a 100 Gbps TCP/IP stack on programmable logic, intended for Software Defined Networks (SDN) and in-network data processing, we have confronted the need to implement an ultra-low latency checksum mechanism capable of sustaining a 100 Gbps line rate. While the checksum arithmetic is well known and comparatively simple (quoting RFC 793 [2]: "The checksum field is the 16-bit one's complement of the one's complement sum of all 16-bit words in the header and text"), the low latency requirement imposed by higher bandwidth is a major challenge. In this paper we describe our solution to compute the checksum at 100 Gbps on an FPGA. We aim for a one clock cycle implementation in order to support maximum throughput even for short packets. Our design covers computing the checksum for i) the IP header: from 10 to 30 × 16-bit words, to be processed in one clock cycle; and ii) the UDP/TCP/ICMP header plus payload: iterating at 33 × 16-bit words per clock cycle over several clock cycles. The target frequency is 322 MHz (3.1 ns), which is the one used by Xilinx's integrated CMAC core (the Ethernet Media Access Controller (MAC) plus Physical Coding Sublayer (PCS)). The results are available as two open source implementations, one for the IP header and another for ICMP, UDP, and TCP packets.
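To fix ideas, the arithmetic to accelerate is summarized by the following C software model (our sketch of the RFC 1071 procedure, not the hardware design itself):

    #include <stddef.h>
    #include <stdint.h>

    /* One's complement checksum of a byte buffer (RFC 1071 style).  */
    /* Software reference model; the hardware described in this      */
    /* paper performs the same arithmetic over 33 x 16-bit operands  */
    /* per clock cycle.                                              */
    uint16_t ones_complement_checksum(const uint8_t *data, size_t len)
    {
        uint32_t sum = 0;
        while (len > 1) {                  /* big-endian 16-bit words */
            sum += ((uint32_t)data[0] << 8) | data[1];
            data += 2;
            len  -= 2;
        }
        if (len == 1)                      /* odd length: zero padding */
            sum += (uint32_t)data[0] << 8;
        while (sum >> 16)                  /* swing the carries around */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;             /* transmitted checksum */
    }

At 100 Gbps, the difficulty is that this loop must effectively consume 64 bytes of input every 3.1 ns cycle.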
II. RELATED WORK

The checksum computation is a straightforward algorithm, described in RFC 793 [2] and RFC 1071 [3]. Nevertheless, the overhead of checksum processing when implemented in software is well known [4], [5]. Kay and Pasquale [6] showed that checksum calculation was the major processing overhead in a software TCP/IP implementation. Today's commercial NICs offer TCP/IP offloading, reducing the overhead on the CPU for any task related to packet processing, including the checksum. Indeed, the Xilinx AXI 1G/2.5G Ethernet subsystem also provides full checksum offloading capabilities. Unfortunately, these capabilities are available neither in the 10 Gigabit Ethernet Subsystem nor in the UltraScale Devices Integrated 100G Ethernet core.

One's complement addition has the interesting property of being associative. Thus, the 16-bit words can be added in any order. Although a hardware implementation was already suggested in RFC 1936 in 1996 [7], there have been only a few studies of low latency checksums in hardware, mostly for 10+ Gbps networking. Henriksson et al. [8] provide a 0.18 µm ASIC implementation for 10 Gbps Ethernet. On FPGAs, a Stratix III has been used to achieve a throughput of 14.2 Gbps [9]. Atomic Rules [10], along with other companies, offers a 10 to 400 GbE UDP Offload Engine with integrated checksum computation for the UltraScale(+) families, but no details about the implementation are available and only the 10G and 40G versions seem to be accessible at the moment. Recent implementations of a 10 Gbps TCP/IP stack for FPGAs include checksum computation [11], [12]. In this paper, we show how to perform the TCP/IP checksum computation at a minimum line rate of 100 Gbps.

III. USE CASES

The results of this paper are part of a larger effort to implement an open source TCP/IP stack capable of reaching a 100 Gbps line rate. Other projects that can benefit from the contributions of this work are, for instance, a 10 Gbps TCP/IP stack [12], which we use as a starting point, or projects using a NetFPGA [13]. We have focused this work on IPv4 (Internet Protocol Version 4).

We make no assumptions about the use of the stack, regarding whether it runs on a NIC as an end point of the network, on a router, or on a middle box; because of that, we aim for one-cycle latency. In all such cases, checksum computation is used quite often. For instance, routers recompute the checksum if headers change as a result of NAT (Network Address Translation) or port re-assignment. In NICs with offloading support, the host sends data through PCIe and the NIC has to segment and packetise such data, performing the checksum computation and including the result in the packet header. Finally, in accelerators using TCP/IP as a means of communication, e.g. [11], outgoing packets must include both IP and TCP checksums, and the checksum of the incoming packet must be verified before it is processed.

IV. HANDLING 100 GBPS IN THE FPGA

In the market we can find different FPGAs that can handle 100 Gbps links at the physical level. Xilinx has several development boards supporting that rate. In this paper, we focus on the VCU108 (XCVU095-2FFVA2104E UltraScale) [14] and VCU118 (XCVU9P-L2FLGA2104E UltraScale+) [15] boards. Both devices provide an integrated 100G Ethernet Subsystem [16], containing an Ethernet MAC and PCS core, complying with the IEEE 802.3-2012 specification. This integrated core presents a 4 × 128-bit segmented LBUS interface that can be easily adapted to a 512-bit AXI4-Stream. The operating frequency is 322 MHz (3.103 ns). Accordingly, to maintain line rate processing, any checksum design has to work at the same frequency, otherwise latency will increase. From now on, we will assume that the checksum implementation is interfaced with a 512-bit AXI4-Stream input and a 16-bit AXI4-Stream output. In the case of checksums that span more than 64 bytes (e.g., the checksum of a TCP segment), the message is split into as many 512-bit AXI4-Stream transactions as necessary.

In TCP/IP over Ethernet, the shortest and largest packets are as follows. For the smallest packets, the size is 60 bytes and the time between packets can be as small as 6.72 ns. With an MTU (Maximum Transmission Unit) of 1500 bytes, a packet has to be processed every 121.92 ns.
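Both figures follow from charging each frame with 24 bytes of Ethernet framing overhead (preamble, frame check sequence and inter-frame gap), an accounting we assume here:

    (60 + 24) byte × 8 bit/byte ÷ 100 Gbps = 6.72 ns
    (1500 + 24) byte × 8 bit/byte ÷ 100 Gbps = 121.92 ns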
V. IPV4 HEADER AND TCP/UDP CHECKSUM CALCULATION

Following the RFC 793 and RFC 1071 [2], [3] specifications, the transmitter side computes the checksum as follows:
i. The value of the checksum word (16-bit) is set to zero, since the checksum field is part of the packet for which the checksum has to be computed.
ii. The message is split into 16-bit words.
iii. All 16-bit words are added using one's complement arithmetic.
iv. The sum is complemented (flip the bits) and becomes the overall checksum.
v. The checksum is inserted into the header and sent with the data.

The receiver uses the following complementary steps for error detection:
i. The message (including the checksum field) is split into 16-bit words.
ii. Every word is added using one's complement arithmetic.
iii. The result is the computed checksum. If the value is 0, the message is considered to be correct.

The procedure is similar for the IP header, the TCP header and, optionally, the UDP header. In what follows, the differences between the three cases are explained and one's complement arithmetic is discussed in detail.

A. IP Header Checksum Computation

Figure 1a shows the IPv4 header with its 14 fields. Thirteen are mandatory, whereas the 14th field is optional and appropriately named: options.
Fig. 1: (a) IP Version 4 header. (b) TCP header.

Since an IPv4 header may contain a variable number of options, the Internet Header Length (IHL) field (4-bit) defines the size of the header in steps of 32-bit words, which also coincides with the offset to the data. The minimum value for this field is five, which indicates a length of 5 × 32-bit = 160-bit ≡ 20-byte ≡ 10 × 16-bit words. As a 4-bit field, the maximum value is fifteen words (15 × 32-bit, or 480-bit ≡ 60-byte ≡ 30 × 16-bit words).

Our interface is 512-bit wide. As a result, the whole IP header is contained in a single transaction. Processing complexity is mainly caused by the variable header length, which requires a variable sum ranging from 10 to 30 × 16-bit words. The proposed architecture considers the maximum possible number of words and a multiplexer selects whether each one is valid or not.
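As a software model of that approach (ours, not the RTL), the following C sketch sums all 30 candidate words unconditionally and zeroes out those beyond the IHL-encoded length, which is the role the multiplexer plays in hardware:

    #include <stdint.h>

    /* Software model of the IPv4 header checksum unit. hdr holds   */
    /* the header as up to 30 big-endian 16-bit words already       */
    /* unpacked into host integers; ihl is the 4-bit IHL field.     */
    uint16_t ipv4_header_checksum(const uint16_t hdr[30], unsigned ihl)
    {
        unsigned nwords = ihl * 2;        /* IHL counts 32-bit units */
        uint32_t sum = 0;
        for (unsigned i = 0; i < 30; i++) {
            uint16_t w = (i < nwords) ? hdr[i] : 0; /* invalid -> 0  */
            if (i == 5)
                w = 0;                    /* checksum field itself -> 0 */
            sum += w;
        }
        while (sum >> 16)                 /* fold end-around carries */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;
    }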
B. TCP/UDP Header Checksum Computation

The TCP checksum is a 16-bit field in the header (Fig. 1b). In this case the checksum covers a pseudo-header, the TCP/UDP header, and the payload. The pseudo-header has to be created prior to checksum calculation with information taken from the IP header: the source and destination IP addresses, the protocol, and the TCP segment length (Fig. 2). The message covered by this checksum is shown in Fig. 3. In the case of an odd number of bytes, zero padding is required to make the total number even. Note that the checksum field is at the very beginning of the header, meaning that the packet has to be stored and cannot be delivered until the last byte has been taken into account in the checksum. This is why the latency of the checksum computation impacts directly on the latency of the network protocol.
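For illustration, the pseudo-header of Fig. 2 can be sketched as the following 96-bit C structure (the layout follows the figure; the names are ours):

    #include <stdint.h>

    /* TCP/UDP pseudo-header: built from IP-layer information, used */
    /* only for checksum purposes, never transmitted on the wire.   */
    /* All fields are in network byte order.                        */
    struct pseudo_header {
        uint32_t src_addr;   /* source IPv4 address                 */
        uint32_t dst_addr;   /* destination IPv4 address            */
        uint8_t  zero;       /* fixed zero / reserved               */
        uint8_t  protocol;   /* 6 for TCP, 17 for UDP               */
        uint16_t length;     /* TCP segment / UDP datagram length   */
    };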
For UDP packets, the checksum is computed including the same pseudo-header as TCP (Figure 2). The size of the UDP header is fixed to 64 bits, comprising four fields of 2 bytes (16 bits) each. The maximum length of a UDP packet is 65,507 bytes (65,535-byte maximum IP packet size − 8-byte UDP header − 20-byte IP header). For the longest packet, 12 + 65,535 bytes = 65,547 bytes require 1025 × 512-bit AXI4-Stream transactions; we consider that case for a specialised network supporting jumbo frames. However, packets are hardly ever that long in a wide area network, because the Ethernet MTU is usually 1500 bytes.

In the case of TCP, the main difference with respect to UDP is that the TCP header might have optional fields, so its size varies between 20 and 60 bytes in 4-byte steps. In practical terms, the payload varies from 0 (acknowledgement packets without payload) to 1460 bytes (the maximum segment size that fits the Ethernet MTU). Hence, the shortest packets have 12 bytes (pseudo-header) + 20 bytes (header), i.e., they can be processed in one 512-bit transaction. For the longest packets, 12 + 20 + 1460 bytes = 1492 bytes require 24 × 512-bit AXI4-Stream transactions.

C. One's Complement Addition

The one's complement of a binary number K in an N-bit representation system is determined by inverting every bit (flipping 0s for 1s and vice versa). It is arithmetically equivalent to computing 2^N − 1 − K. Therefore, zero has two representations, (00...00) and (11...11). As a consequence, an N-bit one's complement numeral system can only represent integers in the range −(2^(N−1) − 1) to 2^(N−1) − 1, while two's complement can represent −2^(N−1) to 2^(N−1) − 1.

One's complement addition is calculated by summing as natural numbers and adding the carry out back into the result (known as swinging the bit(s) around). An interesting property of multi-operand one's complement addition is the possibility to add as natural numbers (equivalent to two's complement) and "recirculate" (swing around) all the carries at the end. As an example, consider the following 20-byte IP header, where the 16-bit word F96A is the checksum field:

    4500 0030
    0000 0000
    4006 F96A
    C0A8 0005
    C0A8 0008

In order to calculate the checksum, we sum each of the 16-bit values within the header in a two's complement fashion, skipping only the checksum field itself (considered as 0). Then we have the following addition (values in hexadecimal):

    4500 + 0030 + 0000 + 0000 + 4006 + 0 +
    C0A8 + 0005 + C0A8 + 0008 = 20693h (132755 dec).

Swinging the carries around gives 0693 + 2 = 0695, whose complement is F96A, the value inserted in the checksum field. In order to verify the checksum, all 16-bit numbers are added, including the checksum:

    4500 + 0030 + 0000 + 0000 + 4006 + F96A +
    C0A8 + 0005 + C0A8 + 0008 = 2FFFDh (196605 dec).

Recirculating the carries gives FFFD + 2 = FFFF, whose complement is 0, so the message is considered correct.

Fig. 2: TCP and UDP pseudo-header (source address, destination address, zero, protocol, TCP segment length).
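The example can be replayed mechanically. The following self-contained C program (our sketch) reproduces both sums, the carry folding, and the final comparison:

    #include <stdint.h>
    #include <stdio.h>

    /* Replays the worked example: 20-byte IP header, checksum 0xF96A. */
    int main(void)
    {
        const uint16_t hdr[10] = {
            0x4500, 0x0030, 0x0000, 0x0000, 0x4006,
            0xF96A,                        /* checksum field */
            0xC0A8, 0x0005, 0xC0A8, 0x0008
        };
        uint32_t tx = 0, rx = 0;
        for (int i = 0; i < 10; i++) {
            rx += hdr[i];                  /* receiver: checksum included */
            tx += (i == 5) ? 0 : hdr[i];   /* sender: checksum field as 0 */
        }
        printf("tx sum = 0x%05X\n", (unsigned)tx);   /* 0x20693 (132755) */
        printf("rx sum = 0x%05X\n", (unsigned)rx);   /* 0x2FFFD (196605) */
        while (tx >> 16) tx = (tx & 0xFFFF) + (tx >> 16);  /* -> 0x0695  */
        while (rx >> 16) rx = (rx & 0xFFFF) + (rx >> 16);  /* -> 0xFFFF  */
        printf("checksum = 0x%04X\n", (unsigned)(~tx & 0xFFFF)); /* F96A */
        printf("valid    = %d\n", (~rx & 0xFFFF) == 0);          /* 1    */
        return 0;
    }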
(Figure: bit-level view of the one's complement addition of two operands A and B, with carries C0 … Cn−1 recirculated into the result Re0 … Ren−1.)
Fig. 6: Reduction tree data arrangement. (a) First level of reductions. (b) Following levels of reductions.

In the 7-3 carry-save adder (CSA), the seven bits of each column are compressed into three: R0 represents the sum of the columns, R1 represents the first carry and, finally, R2 contains the second carry.

The first level of reduction is depicted in Fig. 6a, where each 7-bit column is reduced to a 3-bit result (the last two groups only cluster 6 elements). As a consequence, 33 × 16-bit numbers are reduced to 15 × 16-bit numbers, as shown in Level 2 (L2) of Fig. 6b. Observe that the white dots correspond to swinging the overflow bits from the dotted circles around. The second level clusters two 7-bit elements per column, leaving one row for the next level, thus reducing from 15 to 7 numbers. For the third level (L3), seven numbers are clustered and reduced to three numbers. As explained, with only three levels of logic, we are able to reduce 33 numbers to only 3 numbers without increasing the width of the numbers, 16 bits each. In what follows we discuss different alternatives to sum three numbers in one's complement arithmetic.

Reduction tree version 1

The first version of the reducer tree, called ArchRed1, finishes with a ternary adder plus two binary adders, as suggested in the following pseudo code.

    L4 = L3(2) + L3(1) + L3(0);
    sumPrev = L4(15:0) + L4(17:16);
    sumFinal = sumPrev(15:0) + sumPrev(16);

Reduction tree version 2

The second version, ArchRed2, tries to reduce the three successive additions, processing in parallel the possible results and selecting the correct one with a multiplexer. The two most significant bits of the sum are used as selector.

    L4(0) = L3(2) + L3(1) + L3(0);
    L4(1) = L3(2) + L3(1) + L3(0) + 1;
    L4(2) = L3(2) + L3(1) + L3(0) + 2;
    with (L4(0)(17:16)) select
        sumFinal = L4(0) when "00",
                   L4(1) when "01",
                   L4(2) when others;

Reduction tree version 3

The third version, ArchRed3, first compresses the three operands into two with a 3-2 CSA; the two possible final results are then computed in parallel and a multiplexer selects the correct one. The most significant bit of the sum is used as selector.

    (L4(1), L4(0)) = CSA(L3(2), L3(1), L3(0));
    L5(0) = L4(1) + L4(0);
    L5(1) = L4(1) + L4(0) + 1;
    sumFinal = L5(0) when (L5(0)(16) = '0') else L5(1);

Using this solution, only one 16-bit carry propagation is present in the critical path. In a Xilinx device, this means two CARRY8 components, therefore yielding the best delay result.
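To make the carry recirculation concrete, the following C model (ours; a word-level 3-2 formulation rather than the bit-level 7-3 counters of the actual design) reduces any set of 16-bit words with carry-save steps whose carry word is rotated left by one position, leaving a single carry-propagating addition at the end, as in ArchRed3:

    #include <stdint.h>
    #include <stdio.h>

    /* 3-2 carry-save step for one's complement addition:           */
    /* a + b + c == sum + carry (mod 2^16 - 1), where the carry     */
    /* word is rotated left by one so its end-around carries are    */
    /* pre-applied.                                                 */
    static void csa32(uint16_t a, uint16_t b, uint16_t c,
                      uint16_t *sum, uint16_t *carry)
    {
        uint16_t maj = (a & b) | (a & c) | (b & c);
        *sum   = a ^ b ^ c;
        *carry = (uint16_t)((maj << 1) | (maj >> 15)); /* rotl by 1 */
    }

    /* Reduces n >= 2 words to their 16-bit one's complement sum.   */
    uint16_t csa_reduce(uint16_t w[], int n)
    {
        while (n > 2) {                /* CSA levels: no carry ripple */
            int out = 0, i = 0;
            for (; i + 3 <= n; i += 3) {
                uint16_t s, c;
                csa32(w[i], w[i + 1], w[i + 2], &s, &c);
                w[out++] = s;
                w[out++] = c;
            }
            for (; i < n; i++)         /* leftovers pass through      */
                w[out++] = w[i];
            n = out;
        }
        uint32_t s = (uint32_t)w[0] + w[1]; /* the only 16-bit carry  */
        s = (s & 0xFFFF) + (s >> 16);       /* chain, as in ArchRed3  */
        s = (s & 0xFFFF) + (s >> 16);
        return (uint16_t)s;
    }

    int main(void)
    {
        /* The Section V example, checksum field zeroed, zero-padded */
        /* to the 33 operands handled per cycle.                     */
        uint16_t w[33] = { 0x4500, 0x0030, 0x0000, 0x0000, 0x4006,
                           0x0000, 0xC0A8, 0x0005, 0xC0A8, 0x0008 };
        uint16_t s = csa_reduce(w, 33);
        printf("sum = 0x%04X, checksum = 0x%04X\n",
               (unsigned)s, (unsigned)(~s & 0xFFFF)); /* 0695, F96A */
        return 0;
    }

The left rotation is valid because the whole computation is carried out modulo 2^16 − 1: rotating a 16-bit word left by one position is exactly multiplication by two in that modulus, so the weight-2 carries are folded in without propagating any carry chain.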
VII. EXPERIMENTAL EVALUATION

All the previous architectures have been synthesized and implemented using Vivado 2017.4 for the Xilinx UltraScale+ architecture, targeting the VCU118 development board [15]. Table I shows the delay and logic-depth breakdown for the studied circuits, where lgc and rt stand for the time spent in logic and routing, respectively. The Logic Levels column details the components in the critical path. Additionally, the area and delay of the different circuits described are presented. LUT and CLB usage are included, and the number of carry-logic components (CARRY8) is reported as well. The inputs and outputs were registered in order to obtain the post place and route timing report. The first element of the Max Delay column summarizes the worst path, expressed in ns.

The reducer trees are the only alternative that meets timing at 322 MHz, the target frequency for Xilinx's 100 Gbps Ethernet interfaces. The design ArchRed3 is the one with the minimum delay. The area penalty of this solution is almost negligible compared to the naïve implementation.

The delay results shown in Table I are valid for Virtex UltraScale+ (16 nm), whereas in the Virtex UltraScale (20 nm) the same circuits have a delay penalty ranging from 27% to 35% due to the use of a previous-generation process. In fact, in Virtex UltraScale, the best available design, ArchRed3, reaches only 3.7 ns. The best solution has also been included in a bigger design, a VCU118 project using more than 20% of the LUTs, with real producer and consumer logic, reaching similar delay results.
TABLE I: Delay and logic-depth breakdown for the studied circuits in an UltraScale+ architecture (lgc = logic time, rt = routing time)

Circuit  Max Delay                                        Logic Levels                                  LUTs  Carry8  CLBs
Bin16    4.684 ns (lgc 2.4 ns, 52.1%; rt 2.2 ns, 47.9%)   20 (CY8=13 LUT2=6 LUT6=1)                     541   98      117
Bin32    5.278 ns (lgc 2.9 ns, 54.6%; rt 2.4 ns, 45.4%)   25 (CY8=17 LUT2=7 LUT3=1)                     530   85      103
Tern16   4.073 ns (lgc 2.0 ns, 49.8%; rt 2.0 ns, 50.2%)   17 (CY8=11 LUT2=2 LUT3=3 LUT6=1)              368   53      96
Red1     3.691 ns (lgc 1.6 ns, 42.9%; rt 2.1 ns, 57.1%)   13 (CY8=6 LUT2=1 LUT6=4 MUXF7=2)              707   8       130
Red2     3.070 ns (lgc 0.9 ns, 29.0%; rt 2.2 ns, 71.0%)   11 (CY8=3 LUT5=1 LUT6=4 MUXF7=3)              748   5       138
Red3     2.979 ns (lgc 0.9 ns, 31.5%; rt 2.0 ns, 68.5%)   10 (CY8=2 LUT3=1 LUT5=1 LUT6=3 MUXF7=3)       734   5       140
Figure 7 shows the performance of the checksum architecture depending on the message size. The theoretical throughput of a 100 Gbps Ethernet link is also shown. It demonstrates that the maximum possible performance is achieved comfortably even for the smallest packets. The figure also shows that the implementation reaches 164.86 Gbps when the message size is a multiple of 64 bytes, since then there are no wasted bytes in the last AXI4-Stream transaction.

Fig. 7: Checksum computation performance. (Plot: Bandwidth (Gbps) versus Message Size (Bytes); series: Checksum Throughput and Ethernet Theoretical Throughput.)

To support a hypothetical 200 Gbps link, we assume that the bus width will double to reach a 1024-bit AXI4-Stream. In such a case, the checksum for TCP/UDP in the worst-case scenario can be reduced to a 65 × 16-bit word one's complement addition at 322 MHz. Following the same idea as described in the design ArchRed3, with an extra level of reduction, it is feasible to reach the required processing rate. However, the delay of this new level has to be cut out from the routing. In the current Virtex UltraScale+ architecture this is only viable with a careful relative placement of the logic. In conclusion, the throughput of the proposed design can be doubled, but further work is needed to do so.
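A rough count, assuming the same 7-3 grouping as in Fig. 6, shows where the extra level comes from:

    65 → 29 → 13 → 6 → 3 numbers,

i.e., four CSA levels instead of three before the final two-operand addition.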
VIII. CONCLUSION

In this paper we show how to efficiently calculate the one's complement checksum on 100 Gbps Ethernet links taking advantage of the Xilinx Virtex UltraScale+ architecture. Using Xilinx's integrated Ethernet Subsystem and its 512-bit wide interface, the problem can be summarized as the addition of 33 × 16-bit numbers in one's complement arithmetic at 322 MHz. After evaluating several alternatives, we conclude that the best solution uses three levels of 7-3 CSAs, which fit well in the slice architecture of Xilinx devices. This means that, after three levels of logic, the problem is transformed into the one's complement addition of 3 × 16-bit numbers. For the final addition, a 3-2 CSA adder, two binary adders in parallel, and a multiplexer are used. This architecture reaches the desired frequency with a negligible area penalty in Virtex UltraScale+ devices, achieving an initiation interval of one and one clock cycle of latency, extremely useful to reach maximum throughput with short packets. Extrapolating that idea, a checksum offloading engine at 200 Gbps is feasible with a meticulous relative placement of the logic. All the designs discussed are available as open source [18].

ACKNOWLEDGMENT

This work was partially supported by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund under the project TRÁFICA (MINECO/FEDER TEC2015-69417-C2-1-R), and by the European Commission under the METRO-HAUL project (grant agreement No. 761727) of the H2020 programme.

REFERENCES

[1] D. D. Clark, V. Jacobson, J. Romkey, and H. Salwen, "An Analysis of TCP Processing Overhead," IEEE Communications Magazine, vol. 27, no. 6, pp. 23–29, 1989.
[2] J. Postel et al., "Transmission Control Protocol," RFC 793, 1981.
[3] Network Working Group, "Computing the Internet Checksum," RFC 1071, 1988.
[4] J. H. Huang and C.-W. Chen, "On Performance Measurements of TCP/IP and its Device Driver," in Proceedings of the 17th Conference on Local Computer Networks. IEEE, 1992, pp. 568–575.
[5] G. Regnier, S. Makineni, R. Illikkal, R. Iyer, D. Minturn, R. Huggahalli, D. Newell, L. Cline, and A. Foong, "TCP Onloading for Data Center Servers," Computer, no. 11, pp. 48–58, 2004.
[6] J. Kay and J. Pasquale, "Profiling and Reducing Processing Overheads in TCP/IP," IEEE/ACM Transactions on Networking (TON), vol. 4, no. 6, pp. 817–828, 1996.
[7] J. Touch and B. Parham, "Implementing the Internet Checksum in Hardware," Network Working Group, Tech. Rep. (RFC 1936), 1996.
[8] T. Henriksson, N. Persson, and D. Liu, "VLSI Implementation of Internet Checksum Calculation for 10 Gigabit Ethernet," in Proceedings of Design and Diagnostics of Electronics, Circuits and Systems, pp. 114–121, 2002.
[9] E. B. Eyo and T. A. Nwodoh, "Designing TCP/IP Checksum Function for Acceleration in FPGA," Nigerian Journal of Technology, vol. 29, no. 3, pp. 31–41, 2010.
[10] Atomic Rules, "10/25/40/50/100/400 GbE UDP Offload Engine," Atomic Rules, Tech. Rep. [Online]. Available: http://www.atomicrules.com/wp-content/uploads/2016/04/AtomicRules_UOE_170822.pdf
[11] D. Sidler, G. Alonso, M. Blott, K. Karras, K. Vissers, and R. Carley, "Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware," in FCCM'15. IEEE, 2015, pp. 36–43.
[12] D. Sidler, Z. István, and G. Alonso, "Low-latency TCP/IP Stack for Data Center Applications," in FPL'16, 2016.
[13] N. Zilberman, Y. Audzevich, G. A. Covington, and A. W. Moore, "NetFPGA SUME: Toward 100 Gbps as Research Commodity," IEEE Micro, vol. 34, no. 5, pp. 32–41, 2014.
[14] Xilinx Inc., "VCU108 Evaluation Board, User Guide UG1066," Tech. Rep., Jul. 2016. [Online]. Available: http://www.xilinx.com/support/documentation/boards_and_kits/vcu108/ug1066-vcu108-eval-bd.pdf
[15] Xilinx Inc., "VCU118 Evaluation Board, User Guide UG1224," Tech. Rep., May 2018. [Online]. Available: https://www.xilinx.com/support/documentation/boards_and_kits/vcu118/ug1224-vcu118-eval-bd.pdf
[16] Xilinx Inc., "UltraScale+ Devices Integrated 100G Ethernet Subsystem v2.4," Tech. Rep., Apr. 2018. [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/cmac_usplus/v2_4/pg203-cmac-usplus.pdf
[17] J.-P. Deschamps, G. J. Bioul, and G. D. Sutter, Synthesis of Arithmetic Circuits: FPGA, ASIC and Embedded Systems. John Wiley & Sons, 2006.
[18] "Efficient Checksum Offload Engine," https://github.com/hpcn-uam/efficient_checksum-offload-engine.