Group 7 Report
Tianhao Huang
[email protected]
1 Introduction
The TCP/IP protocol stack is the de facto language spoken by networked systems. In a
widely adopted model, the protocol stack consists of four layers, from bottom up: the link layer,
the Internet layer, the transport layer and the application layer. The first three are also referred
to as L2, L3 and L4, respectively. For hosts (non-router nodes), NICs such as Ethernet controllers
provide key link layer functionality, while the middle two layers reside in the OS kernel and the
highest layer is left to user-level applications. Only a handful of core protocols live in the lower
layers: ARP, ICMP, IP, UDP and TCP are sufficient to form the basis of the Internet. In this
project, we explore opportunities for offloading some of these critical protocols to an FPGA. The
resulting design implements L2, L3 and limited L4 functionality. The performance objective of
this project is a duplex throughput of 10 Gbps.
The layered design of the network protocols naturally leads to a layered system implemen-
tation. We choose the commonplace Ethernet as our link layer implementation. Like most
software stacks, our system focuses on the critical protocols of the bottom three layers:
• ARP, which builds directly on the link layer and provides mappings between the physical
and IP addresses of remote hosts;
• IPv4, which builds on the link layer and implements the essential functionality of the
Internet layer;
Figure 1: Stack organization of the targeted protocols. Arrows point to the protocol packets
that a payload might contain. 1 octet = 8 bits.
• ICMP, which builds directly on the Internet layer and provides the functionality behind the
well-known ping command;
• UDP/TCP, which are part of the transport layer and support the application layer.
As the lowest level of data abstraction in our system, Ethernet frames are the input and
output data exchanged between our hardware stack and the external network device. Xilinx's
Ethernet module transmits and receives frames over an AXI-Stream interface. In our case,
Connectal and host-provided facilities are used together to implement a "virtual" Ethernet
transceiver, avoiding the time cost of bringing up the complicated hardware IP. As incoming
packets are streamed into our network stack cycle by cycle, the receiver side of the system
parses the packets to extract headers and redirects the payload data into different packet
processing pipelines, until the data is forwarded to the transmit side or passed to the
application. Figure 2 shows our stack design from a bird's-eye view, while Figure 3 shows how
the virtual Ethernet device is constructed.
[Figure 2: High-level organization of the stack. On the RX path, incoming frames pass through
the L2 Dispatcher, the L3 (IP layer) Handler, and the ICMP and UDP/TCP engines up to the
application; on the TX path, L4/L3/L2 headers are wrapped around the payload, with Packet
Arbiters at the merge points and the Ethernet Frame Assembler producing the output frames.]

[Figure 3: Construction of the virtual Ethernet device. Host-side software packet processing
and a tap device, driven by RX and TX workers, are connected through Connectal to the
hardware RX and TX pipelines.]
2.2 Test Plan
The system is developed and tested in a bottom-up fashion, starting from the virtual Ethernet
device and the lower-layer protocols and moving up to the higher-layer ones. After the virtual
Ethernet device is up and running, the L2 Dispatcher, ARP Engine and Ethernet Frame
Assembler are developed to set up the first RX-TX loop. We then use the arping command-line
tool to generate traffic and examine our implementation of the ARP protocol. Next, the IP
Handlers, ICMP Engine and Address Resolution Module bring up the second RX-TX loop,
which is tested with the ping command. We added packet schedulers to the TX pipeline to
merge the two loops. While the third loop through the UDP/TCP engines has not been
finished, its test plan would follow the same method with different host-side packet generators.
3 Microarchitecture Design
Although not shown in the high-level diagram, packet stream splitters and compactors are
fundamental building blocks of the modules that parse and compose packets. We use a datapath
of 8 bytes throughout our RX and TX pipelines, which allows our design to meet the throughput
goal at 156.25 MHz (8 bytes/cycle × 8 bits/byte × 156.25 MHz = 10 Gbps). However, not all
packet headers and payloads are aligned to 8 bytes. Therefore, a general splitter is designed to
extract headers and payloads from incoming 8-byte data beats in a streaming fashion. Conversely,
compactors do the dirty work of combining unaligned headers and payloads back into an 8-byte
data stream.
[Figure 4: Design of a packet splitter. Incoming data beats are fed into a data register; header
fields are extracted into a parsed-header structure while the remaining octets are pushed into
payload FIFOs. In this example, a header has 7 octets and one data beat has 4 octets.]
Each data beat of the stream also carries a flag indicating the end of a packet. The module is
implemented in a polymorphic fashion: given the protocol that a packet belongs to, a splitter
provides interfaces to extract a parsed header and its payload stream. Compactors play the
opposite role in the TX pipeline, packing the provided header and payload stream back into a
data stream. Since the length of a specific protocol header can be determined statically, the
widths of the FIFOs are known at compile time, enabling cheap circuit implementations.
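
To make this concrete, the following is a minimal BSV sketch of what the splitter and compactor
interfaces look like in spirit. The type and method names are illustrative rather than copied from
our source code, and the 8-byte beat format is assumed to carry a valid-byte count alongside the
end-of-packet flag.

    // Illustrative sketch only; names are placeholders, not the exact project code.
    // One 8-byte beat of the packet stream, with a valid-byte count and an
    // end-of-packet flag.
    typedef struct {
       Bit#(64) data;   // 8 octets of packet data
       Bit#(4)  bytes;  // number of valid octets in this beat (1..8)
       Bool     last;   // flag indicating the end of a packet
    } DataBeat deriving (Bits, Eq);

    // A splitter is polymorphic in the header type.  Because the header length
    // is known statically from the type, the internal registers and FIFO widths
    // are fixed at compile time.
    interface Splitter#(type hdrT);
       method Action feed(DataBeat beat);        // raw beats streamed in
       method ActionValue#(hdrT) extractHdr();   // parsed header, once complete
       method ActionValue#(DataBeat) payload();  // re-aligned payload stream
    endinterface

    // A compactor plays the opposite role on the TX side.
    interface Compactor#(type hdrT);
       method Action putHdr(hdrT hdr);           // protocol header for this packet
       method Action putPayload(DataBeat beat);  // higher-layer payload stream
       method ActionValue#(DataBeat) out();      // merged, 8-byte-aligned stream
    endinterface

Instantiating the splitter at a concrete header type then fixes the parser's shift and mask logic
at compile time, which is what keeps the circuit cheap.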
The main packet processing modules (L2 Dispatcher, Ethernet Frame Assembler, IP In/Out
Handlers, ARP/ICMP/UDP Engines) build upon splitters, compactors, or both. The main
difference among these modules is how they process the parsed packet headers.
A Packet Arbiter is used at each merge point of two same-level modules in the TX pipeline,
such as the ARP Engine and the IPv4 Out Handler. It arbitrates which module may send its
packet data to the lower layer. We use the round-robin arbitration provided by the Arbiter
library in Bluespec.
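
For illustration, the sketch below shows a hand-rolled two-input equivalent of this round-robin
policy (the actual design uses the library module, whose interface we do not reproduce here). It
reuses the DataBeat type from the earlier sketch; the grant rotates after each completed packet,
and a packet is never interleaved with another once it has started.

    // Illustrative two-way round-robin packet arbiter (not the library version).
    import FIFOF::*;

    interface PacketArbiter;
       method Action in0(DataBeat b);        // e.g. from the ARP Engine
       method Action in1(DataBeat b);        // e.g. from the IPv4 Out Handler
       method ActionValue#(DataBeat) out();  // merged stream to the lower layer
    endinterface

    module mkPacketArbiter(PacketArbiter);
       FIFOF#(DataBeat) q0   <- mkFIFOF;
       FIFOF#(DataBeat) q1   <- mkFIFOF;
       FIFOF#(DataBeat) outQ <- mkFIFOF;
       Reg#(Bit#(1))    owner <- mkReg(0);     // input that currently owns the output
       Reg#(Bool)       busy  <- mkReg(False); // mid-packet: do not interleave

       rule drain0 (q0.notEmpty && (busy ? owner == 0 : (owner == 0 || !q1.notEmpty)));
          let b = q0.first; q0.deq;
          outQ.enq(b);
          busy  <= !b.last;
          owner <= b.last ? 1 : 0;             // rotate priority after a full packet
       endrule

       rule drain1 (q1.notEmpty && (busy ? owner == 1 : (owner == 1 || !q0.notEmpty)));
          let b = q1.first; q1.deq;
          outQ.enq(b);
          busy  <= !b.last;
          owner <= b.last ? 0 : 1;
       endrule

       method in0 = q0.enq;
       method in1 = q1.enq;
       method ActionValue#(DataBeat) out();
          outQ.deq;
          return outQ.first;
       endmethod
    endmodule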
3.3 General Packet Processing Module
On the RX side, the design of the L2 Dispatcher, the IP input Handlers, and parts of the ARP,
ICMP and UDP engines can be described as a general packet processing module. It accepts a
parsed header and a payload stream from a splitter. The parsed header is passed into a protocol-
specific processor or FSM that interprets and checks the header fields. Depending on the results,
the payload stream is either dropped or forwarded to the next layer. The structure of such a
module is shown in Figure 5. Protocol-specific processing is not described in detail here because
it can easily be found in the protocol specifications. Note that we ignore some rare and
complicated features, such as IP fragmentation and IP options.
[Figure 5: Structure of a general RX-side packet processing module. A splitter separates the
parsed header from the payload; a protocol-specific header processor inspects fields such as the
packet type (ARP, IPv4, other), the sender address and the payload length, and decides whether
the payload is dropped or forwarded to the next layer.]
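
As a rough BSV sketch (again with illustrative names, building on the Splitter interface sketched
earlier), the skeleton of such an RX-side module could look as follows; the accept function stands
in for the protocol-specific header processor.

    // Illustrative RX-side skeleton: inspect the parsed header, then either drop
    // or forward the corresponding payload stream.
    import FIFOF::*;

    interface RxProcessor;
       method ActionValue#(DataBeat) toNextLayer();
    endinterface

    module mkRxProcessor#(Splitter#(hdrT) split,
                          function Bool accept(hdrT h))
                         (RxProcessor);
       FIFOF#(DataBeat)   outQ <- mkFIFOF;
       Reg#(Maybe#(Bool)) keep <- mkReg(tagged Invalid);  // decision for current packet

       // Consume the parsed header and decide the fate of this packet's payload.
       rule decide (keep matches tagged Invalid);
          let h <- split.extractHdr();
          keep <= tagged Valid (accept(h));
       endrule

       // Forward or silently drop payload beats until the end of the packet.
       rule movePayload (keep matches tagged Valid .k);
          let b <- split.payload();
          if (k) outQ.enq(b);
          if (b.last) keep <= tagged Invalid;
       endrule

       method ActionValue#(DataBeat) toNextLayer();
          outQ.deq;
          return outQ.first;
       endmethod
    endmodule

In this view, the protocol-specific part reduces to supplying the header type and the accept check
(plus any side effects such as cache updates), which is where the individual modules differ.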
Similarly, on the TX side, the IP output Handler, the Ethernet Frame Assembler and parts of the
ARP and ICMP engines use a common structure based on compactors. Such a module accepts a
few key fields, generates a complete protocol header from them, and merges the header with the
incoming higher-layer data stream to form a current-layer packet. In the following subsections,
we omit modules that are nothing more than a general packet processing module and only
describe those with substantial additions.
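
A correspondingly hedged sketch of the TX-side structure, instantiated here for the Ethernet
Frame Assembler, is given below. The EthHdr layout follows the standard Ethernet header, and
the Compactor interface is the one sketched earlier; names are illustrative.

    // Illustrative TX-side skeleton: generate a complete header from a few key
    // fields and merge it with the higher-layer stream through a compactor.
    typedef struct {
       Bit#(48) dstMac;
       Bit#(48) srcMac;
       Bit#(16) etherType;   // e.g. 16'h0806 for ARP, 16'h0800 for IPv4
    } EthHdr deriving (Bits, Eq);

    interface FrameAssembler;
       method Action request(Bit#(48) dstMac, Bit#(16) etherType); // key fields only
       method Action payload(DataBeat b);                          // higher-layer stream
       method ActionValue#(DataBeat) frameOut();                   // complete L2 frames
    endinterface

    module mkFrameAssembler#(Bit#(48) localMac,
                             Compactor#(EthHdr) comp)(FrameAssembler);
       method Action request(Bit#(48) dstMac, Bit#(16) etherType);
          comp.putHdr(EthHdr { dstMac: dstMac, srcMac: localMac, etherType: etherType });
       endmethod
       method payload  = comp.putPayload;
       method frameOut = comp.out;
    endmodule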
[Figure 6: Microarchitecture of the ARP Engine and Address Resolution Module. Packets from
the L2 Dispatcher enter a splitter; the ARP processing FSM updates the <IP, MAC> cache and
generates ARP replies through a compactor toward the next layer. The Address Resolution
Module serves cache lookups, returning Maybe#(MAC); on a miss the pending payload is
dropped and an ARP request is generated.]
The ARP Engine is a moderately complex module with an ARP cache of <IP, MAC> address
pairs. It provides mappings from Internet addresses to physical addresses for higher-layer
protocols. The ARP cache resides in the ARP-specific header processor of Figure 5. It is
implemented as a non-blocking direct-mapped cache using dual-port BRAMs. Because of the
limited storage space, we only use the last 12 bits of the IP address as the index; a better
solution would use a hash function, and only minor changes would be required. On an index
conflict, the existing entry is invalidated immediately.
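
A simplified BSV sketch of this cache is shown below. It is not our exact implementation (for
instance, it tracks only one outstanding lookup), but it illustrates the 12-bit index, the
full-address tag used to detect conflicts, and the dual-port BRAM with one port for updates and
one for lookups.

    // Illustrative direct-mapped ARP cache sketch (simplified).
    import BRAM::*;
    import DefaultValue::*;
    import FIFOF::*;

    typedef struct {
       Bool     valid;
       Bit#(32) ip;    // full IP kept as a tag to detect index conflicts
       Bit#(48) mac;
    } ArpEntry deriving (Bits, Eq);

    function Bit#(12) arpIndex(Bit#(32) ip) = truncate(ip);  // last 12 bits as index

    interface ArpCache;
       method Action update(Bit#(32) ip, Bit#(48) mac);      // from received ARP packets
       method Action lookupReq(Bit#(32) ip);                 // from the Address Resolution Module
       method ActionValue#(Maybe#(Bit#(48))) lookupResp();   // Valid mac on hit, Invalid on miss
    endinterface

    module mkArpCacheSketch(ArpCache);
       BRAM_Configure cfg = defaultValue;
       BRAM2Port#(Bit#(12), ArpEntry) cache <- mkBRAM2Server(cfg);
       FIFOF#(Bit#(32)) pending <- mkFIFOF;                  // IP of the in-flight lookup

       method Action update(Bit#(32) ip, Bit#(48) mac);
          cache.portA.request.put(BRAMRequest {
             write: True, responseOnWrite: False,
             address: arpIndex(ip),
             datain: ArpEntry { valid: True, ip: ip, mac: mac } });
       endmethod

       method Action lookupReq(Bit#(32) ip);
          pending.enq(ip);
          cache.portB.request.put(BRAMRequest {
             write: False, responseOnWrite: False,
             address: arpIndex(ip), datain: ? });
       endmethod

       method ActionValue#(Maybe#(Bit#(48))) lookupResp();
          let e <- cache.portB.response.get();
          let ip = pending.first; pending.deq;
          // Hit only when the entry is valid and the full address tag matches.
          return (e.valid && e.ip == ip) ? tagged Valid e.mac : tagged Invalid;
       endmethod
    endmodule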
There are two types of ARP messages. The first is ARP replies: if they target our stack, the ARP
cache is silently updated with the sender's <IP, MAC> pair; if not, they are dropped. The second
is ARP requests: the ARP cache is updated in the same way, and in addition an ARP reply is
generated and sent back to the requester. Our stack generates an ARP request when, say, an
IPv4 packet should be transmitted but there is no valid <IP, MAC> entry in the cache. Cache
lookups are delegated to the Address Resolution Module. On a cache read miss, the IPv4 packet
is dropped and the ARP Engine broadcasts an ARP request on the local network asking for the
desired MAC address. After the reply is received, later lookups for that IPv4 address resolve
successfully.
The ARP Engine covers both the RX and TX sides of the stack. As a result, it contains a splitter
near the input and a compactor near the output. Figure 6 illustrates the microarchitecture of the
module.

[Figure 7: Microarchitecture of the ICMP Engine. A splitter feeds the ICMP processing FSM,
which generates echo replies through a compactor toward the next layer; a second path lets
higher layers, via the IPv4 In Handler, request error replies such as Port Unreachable or
Protocol Unreachable.]
The ICMP Engine looks very similar to the ARP Engine, but is simpler because it has no cache.
We focus on replying to incoming ICMP echo requests (as used by the ping command); all ICMP
replies and other request types are simply dropped. The ICMP processing FSM copies the
identifier and sequence number fields, sends an ICMP echo reply header to the compactor, and
forwards the original payload of the echo request to the compactor. There is a second path for
generating ICMP replies: when an L4 packet is malformed, not supported or not allowed by our
stack, an ICMP Destination Unreachable message (e.g. Port Unreachable or Protocol Unreachable)
carrying part of the offending packet should be sent back to the sender.
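
The echo-reply path reduces to rewriting a small, fixed-format header. The sketch below shows an
illustrative ICMP echo header struct and the field copying performed by the processing FSM; the
checksum is left as a placeholder here, since in practice it has to be recomputed (or incrementally
adjusted) once the payload is accounted for.

    // Illustrative ICMP echo header and reply generation (not the exact code).
    typedef struct {
       Bit#(8)  msgType;     // 8 = echo request, 0 = echo reply
       Bit#(8)  code;        // 0 for echo messages
       Bit#(16) checksum;    // ones'-complement sum over header and payload
       Bit#(16) identifier;  // copied unchanged into the reply
       Bit#(16) seqNum;      // copied unchanged into the reply
    } IcmpEchoHdr deriving (Bits, Eq);

    // Build the echo-reply header handed to the compactor from the parsed request.
    function IcmpEchoHdr mkEchoReply(IcmpEchoHdr req);
       return IcmpEchoHdr {
          msgType:    0,               // turn the request (8) into a reply (0)
          code:       0,
          checksum:   0,               // placeholder; recomputed by a later stage
          identifier: req.identifier,
          seqNum:     req.seqNum
       };
    endfunction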
The UDP Engine usually interfaces directly with applications. Normally, DRAM is used as the
UDP/TCP payload storage, because the stack cannot know when the application will consume
the data. Thus, a memory allocator/deallocator is necessary for managing the packet memory.
Due to time limitations, these parts have not been implemented yet; the following describes the
design plan waiting to be materialized into hardware.
[Figure 8: Planned UDP payload path. Payloads are staged in DRAM buffers until the
application consumes them.]
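
Since the allocator itself has not been built, the following is only a sketch of the kind of interface
we have in mind: a free list of fixed-size DRAM buffers handed to the RX path and returned by
the application once the payload has been consumed. All names and sizes are placeholders.

    // Sketch of the planned (unimplemented) payload buffer allocator.
    import FIFOF::*;

    typedef Bit#(32) BufAddr;                  // DRAM address of a payload buffer

    interface PayloadAllocator;
       method ActionValue#(BufAddr) alloc();   // claim a free buffer for an RX payload
       method Action free(BufAddr a);          // returned once the application consumed it
    endinterface

    module mkPayloadAllocator#(Integer nBufs, Integer bufBytes)(PayloadAllocator);
       FIFOF#(BufAddr) freeList <- mkSizedFIFOF(nBufs);
       Reg#(Bit#(32))  initCnt  <- mkReg(0);

       // Fill the free list with evenly spaced buffer addresses at start-up.
       rule init (initCnt < fromInteger(nBufs));
          freeList.enq(initCnt * fromInteger(bufBytes));
          initCnt <= initCnt + 1;
       endrule

       method ActionValue#(BufAddr) alloc();
          freeList.deq;
          return freeList.first;
       endmethod
       method free = freeList.enq;
    endmodule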
Due to time limitations, the TCP Engine is not implemented as part of this project. A TCP
Engine requires complicated connection state transitions and out-of-order packet processing,
and would be an interesting module to study later on.
4 Implementation Evaluation
The network stack in Figure 2 excluding the UDP/TCP Engine is fully implemented in
hardware. Simulation works great, and it could correctly parse, accept and reply to incoming
ARP, IPv4 and ICMP packets. The implementation is also synthesized, place-and-routed
using Vivado. The hardware could handle ARP and IPv4 packets, but get into some trouble
when replying to ICMP packets. Therefore, for the delay and throughput test I primarily
use ARP packets. The following is a summary of results.
Item                      Results
Critical path delay       5.332 ns (clocked at 166.7 MHz)
Theoretical throughput    10.67 Gbps duplex
Resource utilization      30.1k LUTs (9.91%), 35k flip-flops, 277 kB BRAM
Lines of code             2231 (BSV), 384 (C++)
The speed of packet generation on the host side is increased to stress our system. In simulation,
the average reply delay reported by the host-side command is 0.666 ms, while on the FPGA the
number is 0.130 ms. In contrast, the average delay of pinging another host computer is roughly
1.05 ms. These results should only be treated as a reference, since our implementation never uses
a real Ethernet device. Moreover, our system appears to be seriously bottlenecked by the virtual
Ethernet device: its measured throughput is only 477.1 Mbps. Given a 60-byte packet, even if our
system processed one packet at a time, each packet would take less than 200 ns, which translates
into a throughput of at least 2.86 Gbps.
Looking back, it is hard to say that creating a virtual Ethernet device rather than using the
Xilinx-provided Ethernet IP was entirely a good decision. At least a quarter of our time was
spent on developing these facilities and debugging issues related to endianness and concurrency,
and the performance of the system remains limited by the throughput of the virtual Ethernet
device.