
The Design and Implementation of a Latency-Aware Packet Classification for OpenFlow Protocol based on FPGA
Yu-Kai Chiu, Shanq-Jang Ruan, Chung-An Shen, Chun-Chi Hung
National Taiwan University of Science and Technology, Department of Electrical & Computer Engineering
No.43, Keelung Rd., Sec.4, Da'an Dist.,
Taipei City 10607, Taiwan (R.O.C.)
+886-2-2737-6411
{m10402105, sjruan, cashen, m10602125}@mail.ntust.edu.tw

ABSTRACT
Packet classification has been recognized as one of the most significant functions in contemporary network infrastructures. Furthermore, a number of modern applications such as the IoT impose very strict constraints on the latency of network transmissions. This paper presents the design and implementation of a novel packet classification engine based on an FPGA architecture. The proposed design contains a Latency Compression Scheme (LCS) to achieve low-latency packet processing. Furthermore, the structure supports 12-tuple fields for modern Internet traffic. The experimental results show that the proposed packet classification scheme reduces the delay of packet processing by a factor of 2.18 compared to state-of-the-art works.

CCS Concepts
• Networks → Bridges and switches.

Keywords
Packet classification; Latency-Aware; OpenFlow; FPGA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICNCC 2018, December 14–16, 2018, Taipei City, Taiwan
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-6553-6/18/12…$15.00
https://doi.org/10.1145/3301326.3301368

1. INTRODUCTION
Explosive Internet traffic has led to a high bandwidth demand in backbone networks. To provide better user experiences and to support a variety of network functionalities, such as firewalls, quality of service (QoS), and other value-added services, it is important to realize critical functions of network management. Meanwhile, Software Defined Networking (SDN) [1] has been proposed as a flexible solution for next-generation Internet provision. With this flexibility, more advanced applications have taken advantage of SDN, such as the IoT [2], Industry 4.0 [3], and other innovative services.

From the viewpoint of network management, packet classification plays a significant role, as it classifies network traffic into flows based on a predefined set of rules. Such classification of network traffic is the foundation of applications like QoS, firewalls, and load balancing. However, the traditional packet classification approach considers only 5-tuple fields, which is not sufficient for today's complicated network requirements. For example, the widely used OpenFlow [4] protocol for the SDN paradigm employs 12-tuple header fields to support more advanced network services. In addition, more and more applications require ultra-low end-to-end latency of network communications, such as the control of critical resources or the remote manipulation of a robotic arm [5, 6]. Therefore, it is essential to design a low-latency packet processing engine.

On the other hand, multi-field packet classifications are traditionally designed based on ternary content addressable memory (TCAM) [7, 8]. TCAM-based packet classification approaches achieve high-performance processing owing to their parallel architecture. However, TCAM circuits incur very high cost and power consumption, which heavily degrades the system. Besides, TCAM-based structures also suffer from range expansion when converting ranges into prefixes [8]. As an alternative, the Field Programmable Gate Array (FPGA) platform offers a high level of flexibility and configurability and is an attractive choice for designing real-time network processing engines [9, 10]. For example, implementations of packet classification on FPGA have been reported in [9] and [10], where the work in [9] improved the performance of range matching and the approach in [10] enhanced throughput with a configurable architecture. However, these designs come at the cost of increased processing latency due to their lengthy pipeline stages. As a result, the per-packet efficiency of such classification engines cannot satisfy applications that require low-latency processing.

This paper presents a latency-aware packet classification engine based on an FPGA structure. The major contributions of this work can be summarized as follows:

 Scalability – The proposed design supports 12-tuple header fields so that it can be applied to novel networking schemes such as OpenFlow 1.0 and can support advanced Internet services.
 Flexibility – The design is implemented on the Xilinx Virtex-6 XC6VLX760 FPGA device [11], which is a promising platform for the realization of network processing engines.
 Latency-Aware – In order to achieve the requirement of low-latency processing, the proposed packet classification circuit contains a Latency Compression Scheme (LCS), which arranges the architecture in a specific manner. This scheme reduces at least 56.6% of the latency and 54.1% of the propagation time for packet processing.

The rest of the paper is organized as follows: Section 2 introduces the background and related work of packet classification. Section 3 presents the algorithm and the hardware architecture. The evaluated performance is discussed in Section 4, and Section 5 concludes the paper.

2. BACKGROUND AND RELATED WORK
2.1 Traditional vs. OpenFlow Classification
Packet classification is the process that categorizes packets into "flows" in an Internet router. We call a pre-defined entry a rule, and the data set which includes all the rules a rule set. Packet classification can be defined as follows: given a packet header and a rule set of size R, if a rule matches the packet header, the RID of that rule is exported.

The OpenFlow protocol, proposed by the Open Networking Foundation (ONF), was defined as the first standard communications interface of the SDN architecture. OpenFlow packet classification requires a larger number of packet header fields compared to the traditional packet. We show the comparison of packet information between traditional and OpenFlow packets in Table 1 and Table 2.

Table 1. Comparison of packet information

  Type         Traditional   OpenFlow 1.0
  # of fields  5             12
  # of bits    104           247

Table 2. Packet header fields of traditional and OpenFlow packets (Length/Match Type)

  Header Field              Symbol      Traditional   OpenFlow
  Ingress Port              Ingr_port   (N/A)         10/Exact
  Ethernet Source Address   Eth_src     (N/A)         48/Exact
  Ethernet Dest. Address    Eth_dst     (N/A)         48/Exact
  Ethernet Type             Eth_type    (N/A)         16/Exact
  VLAN ID                   V_id        (N/A)         12/Exact
  VLAN Priority             V_priority  (N/A)         3/Exact
  IP Source Address         SA          32/Prefix     32/Prefix
  IP Destination Address    DA          32/Prefix     32/Prefix
  IP Protocol               Prtl        8/Exact       8/Exact
  IP Type of Service        ToS         (N/A)         6/Exact
  Source Port               SP          16/Range      16/Range
  Destination Port          DP          16/Range      16/Range

2.2 Related Work of Packet Classification
Various algorithms for packet classification have been reported in the literature. To be specific, the Field-Split Bit Vector (FSBV) approach was proposed in [12]. In this approach, an N-bit ternary string consisting of {0, 1, ∗} indicates each rule in an N-bit field. The design further performs logical AND operations in a pipelined architecture on FPGA to generate the final match result from all the extracted bit vectors. However, FSBV suffers from range expansion when transforming ranges into prefixes [13] because it demands that rules be represented as ternary strings. To solve this problem, the design in [9] proposed a packet classification approach that supports range match effectively. This approach saves subranges in memory and executes subrange match operations sequentially, similar to the iterative 1-bit range check circuit of the extended TCAM [14]. It does not need to execute pre-computations against input headers. The advantage of this scheme is that it can sustain a throughput of more than 380 MHz with all match types. In prior approaches, the search of a header field usually aims at improving a specific match type, such as exact/prefix match or range match. In this paper, we efficiently employ heterogeneous PEs and propose an integrated solution to achieve low-latency packet classification.

3. ARCHITECTURE
The proposed latency-aware packet classification engine contains three major components: the Exact/Prefix Search Unit, the Range Search Unit, and the Modular Architecture. The detailed structures and operation of these units are introduced in this section.

3.1 Exact/Prefix Search Unit
As mentioned earlier, the FSBV approach uses memory efficiently by managing individual field searches. However, this method relies on rule set features. For the case of a traditional 5-field packet (104 bits), FSBV processes 104 stages in a single pipeline architecture, so it may suffer seriously from latency. For these reasons, we utilize the improved method named StrideBV [15] as the basic scheme of the exact/prefix search unit.

In contrast with FSBV, where each K-bit field is split into 1-bit sub-fields, StrideBV considers sub-fields of s bits by using a stride size of s. This solution is more practical than the FSBV work in real situations. The details of the algorithm and architecture are explained below.

Algorithm 1 Creation of Rule Table
Require: R rules, each indicated as a K-bit ternary string
Require: A K-bit string per rule to indicate the type of each bit (specified or wildcard)
Ensure: 2^s x ceil(K/s) R-bit vectors
 1: Initialization:
 2: Index <- Null
 3: for each rule do
 4:   for each stride do
 5:     if the stride contains no wildcard bit then
 6:       Index <- the stride value
 7:     else
 8:       Index <- RuleExtension(stride)
 9:     end if
10:     for each value in Index do
11:       RuleFilling(Index)
12:     end for
13:   end for
14: end for

Algorithm 1 and Algorithm 2 show the creation of the rule table and the process of packet lookup, respectively. Creating the lookup table from the R K-bit rules is the essential step of Algorithm 1. It accepts the K bits of each rule and generates a 2-dimensional array of size 2^s (height) x ceil(K/s) (width), in which s indicates the chosen stride.
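To make the stride-based bit-vector idea concrete, the following Python fragment is our own illustrative sketch (not the paper's hardware design); the comments map the helper steps onto the paper's RuleExtension and RuleFilling names, and the rule strings are invented for the example.

```python
from itertools import product

def build_rule_table(rules, s):
    """Algorithm 1 sketch: for each s-bit stride, build a 2**s-entry table whose
    entry v is an R-bit vector; bit i is 1 iff rule i's stride pattern covers v."""
    k = len(rules[0])
    assert k % s == 0, "assume K is divisible by the stride size"
    tables = []
    for j in range(k // s):
        table = [0] * (2 ** s)
        for i, rule in enumerate(rules):
            pattern = rule[j * s:(j + 1) * s]
            # RuleExtension: expand each wildcard bit ('*') into both 0 and 1
            choices = [('0', '1') if c == '*' else (c,) for c in pattern]
            for bits in product(*choices):
                table[int(''.join(bits), 2)] |= 1 << i  # RuleFilling
        tables.append(table)
    return tables

def pkt_lookup(tables, header, s):
    """Algorithm 2 sketch: AND the per-stride bit vectors; bit i of the result
    is 1 iff rule i matches the whole header."""
    result = (1 << 63) - 1  # assume fewer than 63 rules for this illustration
    for j, table in enumerate(tables):
        result &= table[int(header[j * s:(j + 1) * s], 2)]
    return result

# A 4-bit header matched against 3 rules with stride s = 2,
# the same setting as the example of Figure 1.
tables = build_rule_table(["01**", "0110", "1***"], s=2)
print(bin(pkt_lookup(tables, "0110", s=2)))  # rules 0 and 1 match -> 0b11
```

The per-stride tables are what the hardware stores in memory; the lookup needs only one memory read and one AND per stride, which is why a pipeline of ceil(K/s) stages suffices.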
First, the given Index indicates a partial value of the rule whose type bit is 0. Conversely, a type bit set to 1 represents a wildcard match, which implies that both 0 and 1 are accepted. Then the RuleExtension function executes and gives all the expanded results to Index. For example, consider a stride size of s = 2 and a rule value of {1*} for a particular stride. In this case, the values of the Index are {10, 11}, obtained by expanding {1*}. Finally, all corresponding bits are set to 1 at the relative Index entries by performing the function RuleFilling. In Algorithm 2, the function PktLookup accepts the packet header and checks the value from the rule table. After that, the final result is delivered to the output bit vector.

Figure 1 shows an example of applying Algorithm 1 and Algorithm 2 for matching a 4-bit packet header against 3 rules. For clarity, we consider the whole process with a stride size of s = 2.

Figure 1. Demonstration of packet processing.

Algorithm 2 Process of Packet Lookup
Require: A K-bit packet header
Require: The 2^s x ceil(K/s) R-bit vectors of the rule table
Require: A bit vector indicating the match results
Ensure: An R-bit vector representing the match results
1: Initialization: 00…0 {assume all rules mismatch initially}
2: for each stride do
3:   PktLookup(stride) {bit-wise AND}
4: end for

For the architecture of the exact/prefix search unit, we mainly refer to [15] and select stride sizes of s = 4 and s = 5 as the basic scheme. Each processing element supports R = 4 rule sets. The final match result is generated from the outputs of the lookup and the previous match by a bit-wise AND operation. Since the pipeline method is used in the complete architecture, a register is also needed in each unit. The PE of the exact/prefix search unit is illustrated in Fig. 2.

Figure 2. The PE of the exact/prefix search unit.

3.2 Range Search Unit
As can be seen in Table 2, both the traditional and the OpenFlow packet classification rule tables have the SP and DP fields. Each of these fields consists of 16 bits and demands the range match criterion. In this section, we take advantage of the approach in [9] to decide an appropriate subrange and reorganize a larger PE to fit our requirements. The details of the algorithm and architecture are discussed below.

Algorithm 3 The Process of Range Searching
Require: A W-bit packet header
Require: W-bit upper/lower bound addresses
Require: A chosen width d to indicate the data width of each stage
Ensure: Match bits for each stage (match, upper-bound match, lower-bound match)
Ensure: A bit indicating the final match result, generated after ceil(W/d) cycles
 1: Initialization: assume the rule mismatches initially
    {First stage process}
 2: if …
 3: else if …
 4: else if …
 5: end if
    {Middle stage process}
 6: for each middle stage do
 7:   if ( … )
 8:   else if …
 9:   else if …
10:   end if
11: end for
    {Final stage process}
12: if …
13: end if
14: …

The signals used in the algorithm and architecture are: Packet Header, Upper Bound Address, Lower Bound Address, Match, Upper Bound Match, Lower Bound Match, and Result. For simplicity and clarity, we also define four derived signals to represent the corresponding logical operations.

Algorithm 3 shows the process of range searching. Both the initial and the middle stages generate three signals, which are then delivered to the next stage. To avoid clock degradation caused by the logical operations, pipelining is needed. Notice that ceil(W/d) cycles are demanded for a whole process of a W-bit range search, which is divided into several processing units.
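The staged comparison of Algorithm 3 can be mimicked in software, with each loop iteration standing in for one pipeline stage. This is an illustrative sketch under our own naming; the paper's exact stage conditions and signal equations are not reproduced here.

```python
def range_match(pkt, lo, hi, d):
    """Check lo <= pkt <= hi by scanning W-bit binary strings d bits at a
    time, MSB first. Only four one-bit flags cross stage boundaries, which
    is what keeps the hardware pipeline cheap."""
    assert len(pkt) == len(lo) == len(hi) and len(pkt) % d == 0
    eq_lo = eq_hi = True    # still tied with the lower / upper bound
    gt_lo = lt_hi = False   # strictly above lower / below upper, already decided
    for j in range(0, len(pkt), d):
        p, l, h = (int(s[j:j + d], 2) for s in (pkt, lo, hi))
        if eq_lo and p != l:          # first difference settles the comparison
            gt_lo, eq_lo = p > l, False
        if eq_hi and p != h:
            lt_hi, eq_hi = p < h, False
    return (gt_lo or eq_lo) and (lt_hi or eq_hi)

# A 16-bit port field checked in d = 4-bit stages (4 cycles in hardware)
print(range_match(f"{443:016b}", f"{80:016b}", f"{1023:016b}", d=4))  # True
```

Because each stage only refines the comparison while the prefix is still tied with a bound, the final result after ceil(W/d) stages is exactly the inclusive range test.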
For example, we apply Algorithm 3 to process a packet header of W bits in d-bit stages. In the initial stage (stage 0), the stage match signals are generated after handling the first d bits. In the middle stages, more complex signals are combined with the previous stage's outputs to acquire the outcomes. The final operations are executed in the last stage to carry out the final result, which is set to "1" if and only if the input packet lies within the lower and upper bounds.

3.3 Modular Architecture
In this section, we propose a pipeline architecture called the Latency Compression Scheme (LCS) as the integrated solution. The details of our method and implementation are explored below.

According to the basic principles and algorithms of packet classification [16], the final result is not affected by the order in which the header fields are inspected. In other words, the processing elements may be arranged in any order we demand. Figure 4 shows the pipeline architecture of the classic pattern, which is used by most existing works. As can be seen, taking a 12-tuple header of 247 bits in length as an example, the packet latency will be ceil(247/4) = 62 clocks for a stride size of s = 4. For a stride size of s = 5, ceil(247/5) = 50 clock cycles are consumed by the classic pattern. In such cases, the classic pattern causes the processing delay to grow linearly with the header length. As header lengths increase, this is a serious drawback, especially for services that require low-latency operation.

Figure 4. The architecture of the classic pattern.

To meet such requirements, we propose the Latency Compression Scheme (LCS) in this paper. As mentioned earlier, since the final result is not affected by the order in which the header fields are inspected, we permute the search order of the processing elements in a special construction. The overview of the proposed architecture is shown in Fig. 3.

Figure 3. The architecture of LCS.

In order not to waste any bits of a PE and not to add extra bubble stages, the following case is one possible arrangement that we perform in LCS: Layer 1 (L1) {SP, DP, V_id, Prtl}; Layer 2 (L2) {Eth_src, Eth_type, V_priority[0]}; Layer 3 (L3) {Eth_dst, Ingr_port, ToS, V_priority[1]}; Layer 4 (L4) {SA, DA, V_priority[2]}. At the beginning of the process, the Packet Reorder is responsible for organizing the packet according to the LCS arrangement. Each layer has different PEs for the corresponding header fields. The design prototype of each PE is listed in Table 3. In brief, a match criterion catalogued as exact/prefix match uses the exact/prefix (Ex/Pr) PE with a stride size of s = 4 or s = 5, while a criterion classified as range match uses the range PE with a subrange size of r = 4. To merge the outcomes of the parallel productions, we employ an aggregator to generate the result in the last stage. With this arrangement, the packet latency can be reduced significantly. The reductions are 79% and 74% compared to the classic pattern with s = 4 and s = 5, respectively. This makes our method fit applications with low-latency operations.

Table 3. The design prototype of each header field

  Header Field                  Symbol      Length   Design Prototype
  Ingress Port                  Ingr_port   10 bits  Ex/Pr PE (s = 5)
  Ethernet Source Address       Eth_src     48 bits  Ex/Pr PE (s = 5)
  Ethernet Destination Address  Eth_dst     48 bits  Ex/Pr PE (s = 5)
  Ethernet Type                 Eth_type    16 bits  Ex/Pr PE (s = 5)
  VLAN ID                       V_id        12 bits  Ex/Pr PE (s = 4)
  VLAN Priority                 V_priority  3 bits   Ex/Pr PE (s = 5)
  IP Source Address             SA          32 bits  Ex/Pr PE (s = 5)
  IP Destination Address        DA          32 bits  Ex/Pr PE (s = 5)
  IP Protocol                   Prtl        8 bits   Ex/Pr PE (s = 4)
  IP Type of Service            ToS         6 bits   Ex/Pr PE (s = 5)
  Source Port                   SP          16 bits  Range PE (r = 4)
  Destination Port              DP          16 bits  Range PE (r = 4)

4. PERFORMANCE EVALUATION
4.1 Experimental Setup
We conducted experiments targeting the Virtex-6 XC6VLX760 FFG1760-2 FPGA [11], using the Xilinx ISE Design Suite 14.7. The Virtex-6 XC6VLX760 has 1,200 I/O pins, 864 DSP slices, 25,920 Kb of BRAM, and 118,560 logic slices, comprising 33,120 SLICEM and 85,440 SLICEL slices. In this work, each memory module is configured as dual-port for packet search to improve throughput. Clock rate and resource utilization are reported at the post-place & route level. We show the resource consumption of the proposed scheme in Table 4.

Table 4. Resource consumption of the proposed scheme (XC6VLX760)

  Resource              Available   Used      Utilization
  CLB registers         948,480     234,394   24.7%
  SLICEM slices         33,120      18,048    54.5%
  SLICEL slices         85,440      21,992    25.7%
  Distributed RAM (Kb)  8,280       4,512     54.5%
  User I/O              1,200       273       23%
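The classic-pattern latencies and the LCS reductions quoted in Section 3.3 follow from simple arithmetic on the 247-bit header; this small check (our own, using the 13-clock LCS latency reported later in Table 5) reproduces the 79% and 74% figures:

```python
from math import ceil

HEADER_BITS = 247  # 12-tuple OpenFlow 1.0 header (Table 1)
LCS_CLOCKS = 13    # latency of the proposed LCS arrangement (Table 5)

for s in (4, 5):
    classic = ceil(HEADER_BITS / s)  # classic pattern: one stride per pipeline stage
    cut = round((1 - LCS_CLOCKS / classic) * 100)
    print(f"s = {s}: classic {classic} clocks, LCS {LCS_CLOCKS} clocks ({cut}% reduction)")
```

Running it prints 62 clocks (79% reduction) for s = 4 and 50 clocks (74% reduction) for s = 5, matching the paper's numbers.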
4.2 Performance Metrics and Comparisons
The comparisons of performance are tabulated in Table 5 and visualized in Fig. 5 and Fig. 6. We consider the following metrics:

 Match types (M). The supported match types, including exact match, prefix match, and range match.
 Number of fields (F). The number of fields supported in each packet header.
 Number of rules (R). The number of rules supported in the rule table (in K rules).
 Throughput (T). The number of ports on the lookup table multiplied by the clock rate of each processing element (in MHz).
 Period (P). The clock period of each processing element (in ns).
 Latency (L). The total number of clock cycles per packet, from input to result.
 Delay (D). The total time consumed per packet, given by P x L (in ns).

In addition, we define Efficiency (E) as an essential metric representing the capability of packet processing. It is calculated by the following equation:

  Efficiency (E) = (F x R) / D.    (1)

Figure 5 shows the three metrics M, F, and R compared to other approaches. As can be seen, the proposed scheme accommodates up to R = 3K rules, and each rule includes F = 12 fields (247 bits). Moreover, the metric M shows that the proposed work supports all three types of match criteria: exact, prefix, and range match.

Figure 5. The match types M, number of fields F, and number of rules R.

The performance metrics of latency (L) and delay (D) are shown in Fig. 6. As we can see, the proposed approach achieves not only the lowest latency (L = 13) but also the lowest delay (D = 42.9). Furthermore, notice that D is up to 3.49 times lower than in the prior works.

Figure 6. The latency L and delay D.

The comparisons between our proposed scheme and the existing works are shown in Table 5. One of the earliest works, [15], needs only 30 clock cycles to handle an input packet; however, it supports only 5-tuple header fields, which may not cope with today's complex requirements. In the approach of [10], the throughput and the number of supported fields are the highest among all the existing works, but the latency and the number of contained rules degrade accordingly. The range field is supported and enhanced by the approach of [9]. However, additional block RAM storage is needed to save the rule sets in work [9]-I, and the other solution, [9]-II, takes the longest time to process a whole packet (D = 150) among all five approaches.

To the best of our knowledge, the proposed design is the first latency-aware method that supports all types of match. Furthermore, the better values of D and E also indicate that the result can be obtained earlier, rather than taking more memory resources to keep previous data, by applying our proposed LCS architecture.

Table 5. Performance comparisons

  Approach                          M             R (K)  F   L (Clocks)  T (MHz)  P (ns)  D (ns)   E
  1. StrideBV [15]                  prefix/exact  0.5    5   30          235      4.3     127.66   19.58
  2. Range-Enhanced I [9]           any           3      12  53          566      1.8     93.64    384.45
  3. Range-Enhanced II [9]          any           5      12  57          380      2.6     150.00   400.00
  4. High-perf. and Updatable [10]  prefix/exact  1      15  89          648      1.5     137.35   109.21
  5. Proposed Approach              any           3      12  13          303      3.3     42.90    839.08
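Equation (1) can be verified directly against Table 5. The snippet below (our own check; R is converted from K rules to rules) recomputes E for three of the rows and matches the published column to within the rounding of P and L:

```python
# (F fields, R rules, D in ns), taken from three rows of Table 5
rows = {
    "StrideBV [15]":        (5,  500,  127.66),
    "Range-Enhanced I [9]": (12, 3000, 93.64),
    "Proposed Approach":    (12, 3000, 42.90),
}
for name, (f, r, d) in rows.items():
    print(f"{name}: E = {f * r / d:.2f}")  # Efficiency (E) = (F x R) / D
```

For example, the StrideBV row gives 5 x 500 / 127.66 = 19.58, exactly the tabulated value; the Proposed row gives 839.16 versus the tabulated 839.08, the small gap being rounding in D.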

5. CONCLUSION
In this paper, we presented a modular architecture for low-latency and high-efficiency packet classification on the Xilinx Virtex-6 XC6VLX760 FPGA. By reorganizing the different models of PE, the proposed LCS architecture achieves better packet processing performance. This feature makes our approach suitable for network applications that require low-latency operation.

6. REFERENCES
[1] Software-Defined Networking (SDN) Definition. [Online]. Available: https://www.opennetworking.org/den-resources/sdn-definition/
[2] M. Ojo et al., "A SDN-IoT Architecture with NFV Implementation," in 2016 IEEE Globecom Workshops, 2016. doi: 10.1109/GLOCOMW.2016.7848825
[3] Y.-W. Ma et al., "SDN-enabled network virtualization for industry 4.0 based on IoTs and cloud computing," in 19th Int. Conf. Advanced Communication Technology, Bongpyeong, South Korea, 2017, pp. 199-202.
[4] OpenFlow Switch Specification V1.0.0, 2009. [Online]. Available: https://www.opennetworking.org/wp-content/uploads/2013/04/openflow-spec-v1.0.0.pdf
[5] J. M. Llopis et al., "Minimizing Latency of Critical Traffic through SDN," in 2016 IEEE Int. Conf. Networking, Architecture and Storage, 2016. doi: 10.1109/NAS.2016.7549408
[6] P. Schulz et al., "Latency Critical IoT Applications in 5G: Perspective on the Design of Radio Interface and Network Architecture," IEEE Commun. Mag., vol. 55, no. 2, pp. 70-78, Feb. 2017.
[7] F. Yu et al., "Efficient Multimatch Packet Classification and Lookup with TCAM," IEEE Micro, vol. 25, no. 1, pp. 50-59, 2005.
[8] K. Lakshminarayanan et al., "Algorithms for Advanced Packet Classification with Ternary CAMs," in Proc. ACM SIGCOMM, 2005, pp. 193-204.
[9] Y.-K. Chang and C.-S. Hsueh, "Range-Enhanced Packet Classification," IEEE Trans. Emerging Topics in Computing, vol. 4, no. 2, pp. 214-224, Jun. 2016.
[10] Y. R. Qu and V. K. Prasanna, "High-Performance and Dynamically Updatable Packet Classification Engine on FPGA," IEEE Trans. Parallel Distrib. Syst., vol. 27, no. 1, pp. 197-209, Jan. 2016.
[11] Xilinx, Inc., San Jose, CA, "Virtex-6 Family Overview," 2015. [Online]. Available: https://www.xilinx.com/support/documentation/data_sheets/ds150.pdf
[12] W. Jiang and V. K. Prasanna, "Field-split Parallel Architecture for High Performance Multi-match Packet Classification using FPGAs," in Proc. 21st Annu. Symp. Parallelism in Algorithms and Architectures, 2009, pp. 188-196.
[13] V. Srinivasan et al., "Fast and Scalable Layer Four Switching," in Proc. ACM SIGCOMM, 1998, pp. 191-202.
[14] E. Spitznagel et al., "Packet classification using extended TCAMs," in Proc. IEEE Int. Conf. Network Protocols, 2003, pp. 120-131.
[15] T. Ganegedara and V. K. Prasanna, "StrideBV: Single chip 400G+ packet classification," in Proc. IEEE 13th Int. Conf. High Performance Switching and Routing, 2012, pp. 1-6.
[16] P. Gupta and N. McKeown, "Algorithms for packet classification," IEEE Netw., vol. 15, no. 2, pp. 24-32, Mar. 2001.
