Chiu 2020
circuit contains a Latency Compression Scheme (LCS), which arranges the architecture in a specific manner. This scheme reduces the latency by at least 56.6% and the propagation time of packet processing by 54.1%.

The rest of the paper is organized as follows: Section 2 introduces the background and related work of packet classification. We present the algorithm and the hardware architecture in Section 3. The evaluated performance is discussed in Section 4. Section 5 concludes the paper.

2. BACKGROUND AND RELATED WORK
2.1 Traditional vs. OpenFlow Classification
Packet classification is the process that categorizes packets into “flows” in an Internet router. A pre-defined entry is called a rule, and the data set that includes all the rules is called the rule set. Packet classification can be defined as follows: given a packet header and a rule set of size R, if a rule matches the packet header, the RID of that rule is exported.

The OpenFlow protocol, proposed by the Open Networking Foundation (ONF), was defined as the first standard communications interface of the SDN architecture. OpenFlow packet classification requires a larger number of packet header fields than traditional packet classification. Table 1 and Table 2 compare the packet information of traditional and OpenFlow packets.

Table 1. Comparison of packet information
Type          Traditional   OpenFlow 1.0
# of fields   5             12
# of bits     104           247

Table 2. Packet header fields of traditional and OpenFlow packets
                                        Length/Match Type
Header Field              Symbol        Traditional   OpenFlow
Ingress Port              Ingr_port     (N/A)         10/Exact
Ethernet Source Address   Eth_src       (N/A)         48/Exact
Ethernet Dest. Address    Eth_dst       (N/A)         48/Exact
Ethernet Type             Eth_type      (N/A)         16/Exact
VLAN ID                   V_id          (N/A)         12/Exact
VLAN Priority             V_priority    (N/A)         3/Exact
IP Source Address         SA            32/Prefix     32/Prefix
IP Destination Address    DA            32/Prefix     32/Prefix
IP Protocol               Prtl          8/Exact       8/Exact
IP Type of Service        ToS           (N/A)         6/Exact
Source Port               SP            16/Range      16/Range
Destination Port          DP            16/Range      16/Range

2.2 Related Work of Packet Classification
Several packet classification algorithms have been reported in the literature. In particular, the Field-Split Bit Vector (FSBV) approach was proposed in [12]. In this approach, each rule of an N-bit field is represented by an N-bit ternary string, and each ternary string consists of symbols from {0, 1, ∗}. The design applies logical AND operations in a pipelined architecture on FPGA to generate the final match result from all the extracted bit vectors. However, FSBV suffers from range expansion when transforming ranges into prefixes [13], because it requires rules to be represented as ternary strings. To solve this problem, the design in [9] proposed a packet classification approach that supports range match effectively. This approach saves subranges in memory and executes subrange match operations sequentially. It resembles the iterative structure of the 1-bit range check circuit of the extended TCAM [14].
It does not need to execute pre-computations against input headers. The advantage of this scheme is that it can sustain a throughput of more than 380 MHz with all match types. In the prior approaches, searching a header field usually targets an improvement for one specific type, such as exact/prefix match or range match. In this paper, we efficiently employ heterogeneous PEs and propose an integrated solution to achieve low-latency packet classification.

3. ARCHITECTURE
The proposed latency-aware packet classification engine contains three major components: the Exact/Prefix Search Unit, the Range Search Unit, and the Modular Architecture. The detailed structures and operation of these units are introduced in this section.

3.1 Exact/Prefix Search Unit
As mentioned earlier, the FSBV approach uses memory efficiently by managing individual field search. However, this method relies on ruleset features. For a traditional 5-field packet (104 bits), FSBV processes 104 stages in a single pipeline architecture, so it may suffer seriously from latency. To address these problems, we use the improved method named StrideBV [15] as the basic scheme of the exact/prefix search unit.

In contrast with splitting each K-bit field into 1-bit sub-fields, StrideBV considers a sub-field length of s bits by using a stride size of s. This solution is more practical than FSBV in real situations. The details of the algorithm and architecture are explained below.

Algorithm 1 Creation of Rule Table
Require: R rules, each indicated as a K-bit ternary string: ,
Require: A -bit string with rules to indicate the type of each bit: ,
Ensure: ^ , -bit-vectors:
1: Initialization:
2: Index ← Null
3: For to do
4:   For to do
5:     if then
6:       Index
7:     else
8:       Index ← RuleExtension( )
9:     end if
10:    For to do
11:      RuleFilling(Index)
12:    end for
13:  end for
14: end for

Algorithm 1 and 2 show the creation of the rule table and the process of packet lookup. Creating the lookup table from R K-bit rules is the essential step of Algorithm 1. The accepts K bits of the rule and generates a 2-dimensional array of size (height) (width), in which s indicates the chosen stride.

First, the given Index indicates a partial value of whose is 0. Conversely, the which is set to 1 represents the type is
wildcard match. It implies both 0 and 1 are accepted. Then, the RuleExtension function executes and assigns all of its results to Index. For example, consider a stride size of s = 2 and a rule value of {1*} for a particular stride. In this case, the values of Index are {10, 11}, obtained by expanding {1*}. Finally, all values of are set to 1 against the relative Index by performing the function RuleFilling. In Algorithm 2, the function PktLookup accepts the packet header and checks the value from the rule table. After that, the final result is delivered to .

Figure 1 shows an example of applying Algorithm 1 and Algorithm 2 to match a 4-bit packet header against 3 rules. For clarity, we consider the whole process with a stride size of s = 2.

Require: A bit-vector indicates the match results
Ensure: bit-vector to represent the match results
1: Initialization: 00…0 Assume all rules mismatch initially
2: For to do {bit-wise AND}
3:   PktLookup( )
4: end for

3.2 Range Search Unit
As can be seen in Table 2, both the traditional and the OpenFlow packet classification rule tables contain the SP and DP fields. Each of these fields consists of 16 bits and requires range match. In this section, we take advantage of the approach in [9] to decide an appropriate subrange and reorganize the larger PE to fit our requirement. The details of the algorithm and architecture are discussed below.

Algorithm 3 The Process of Range Searching.
Require: A -bit packet header: ,
A -bit upper/lower bound address: ,
8: else if
9: else if
10: end if
11: end for
{Final stage process}
12: if
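The rule-table creation and lookup described for Algorithm 1 and Algorithm 2 can be sketched in Python as follows. This is only a minimal software illustration under stated assumptions, not the authors' hardware design: the names expand_chunk, build_rule_table and packet_lookup are hypothetical stand-ins for RuleExtension, RuleFilling and PktLookup, the three example rules are invented (they are not the rules of Figure 1), and the result is formed by ANDing the per-stage vectors directly rather than from an initial vector.

```python
from functools import reduce
from itertools import product


def expand_chunk(chunk):
    """Expand a ternary chunk such as '1*' into every binary index it covers."""
    choices = [("0", "1") if symbol == "*" else (symbol,) for symbol in chunk]
    return [int("".join(bits), 2) for bits in product(*choices)]


def build_rule_table(rules, stride):
    """Return table[stage][index]: an int whose bit r is 1 when rule r accepts index."""
    width = len(rules[0])
    assert width % stride == 0, "sketch assumes the stride divides the field width"
    stages = width // stride
    table = [[0] * (1 << stride) for _ in range(stages)]
    for rule_id, rule in enumerate(rules):
        for stage in range(stages):
            chunk = rule[stage * stride:(stage + 1) * stride]
            for index in expand_chunk(chunk):        # wildcard expansion
                table[stage][index] |= 1 << rule_id  # set this rule's bit
    return table


def packet_lookup(table, header, stride):
    """AND the per-stage bit vectors selected by each stride of the header."""
    vectors = [
        table[stage][int(header[stage * stride:(stage + 1) * stride], 2)]
        for stage in range(len(table))
    ]
    return reduce(lambda acc, vec: acc & vec, vectors)


# Hypothetical 4-bit rule set with stride s = 2 (assumption for illustration).
RULES = ["1*0*", "10**", "0011"]
TABLE = build_rule_table(RULES, stride=2)
MATCH = packet_lookup(TABLE, "1001", stride=2)
print(format(MATCH, "03b"))  # bit r, counted from the right, is 1 when rule r matches
```

With these example rules, the header 1001 selects rules 0 and 1, while rule 2 is rejected in the first stride. Each stage reads one table row per stride, so a K-bit header needs K/s stages instead of K.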
initial stage 0, the ] will be generated after handling the . For the , more complex signals are combined with to acquire the outcomes. are executed in the last stage to carry out the final result, which is set to “1” if and only if the input packet is within the and the .
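The staged range decision described above, in which the final result is 1 only when the input lies between the two bounds, can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the circuit of this paper or of [9]: the function name staged_range_match, the 4-bit stride and the MSB-first chunk order are all hypothetical choices.

```python
def staged_range_match(value, lower, upper, width=16, stride=4):
    """Return 1 if lower <= value <= upper, examining `stride` bits per stage, MSB first."""
    assert width % stride == 0, "sketch assumes the stride divides the field width"
    mask = (1 << stride) - 1
    above_lower = False  # True once the bits seen so far are strictly above `lower`
    below_upper = False  # True once the bits seen so far are strictly below `upper`
    for shift in range(width - stride, -1, -stride):
        v = (value >> shift) & mask
        lo = (lower >> shift) & mask
        hi = (upper >> shift) & mask
        if not above_lower:
            if v < lo:
                return 0            # value is already below the lower bound
            above_lower = v > lo
        if not below_upper:
            if v > hi:
                return 0            # value is already above the upper bound
            below_upper = v < hi
    return 1                        # never violated either bound


# Hypothetical example: a 16-bit destination port checked against [1024, 2047].
print(staged_range_match(1500, 1024, 2047))
print(staged_range_match(80, 1024, 2047))
```

Only two flags are carried from one stage to the next, which is what makes this kind of comparison easy to pipeline: each stage needs the current chunk of the header and of the two bounds plus the flags from the previous stage.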
4.2 Performance Metrics and Comparisons
The performance comparisons are tabulated in Table 5 and visualized in Fig. 5 and Fig. 6. We consider the following metrics to indicate the performance:
Match types (M). The support level of match types, including exact match, prefix match and range match.
Number of fields (F). The number of fields supported for each packet header.
Number of rules (R). The number of rules supported in the rule table (in K rules).
Throughput (T). The number of ports on the lookup table × the clock rate of each processing element (in MHz).
Period (P). The clock period of each processing element (in ns).
Latency (L). The total number of clock cycles per packet inspected from input to result.
Delay (D). The total time consumed per entire packet, given by P × L (in ns).

In addition, we take Efficiency (E) as an essential metric for representing the capability of packet processing. The metric can be calculated by using the following equation:
Efficiency (E) = (F × R) / D. (1)

Figure 5 shows the three metrics M, F and R compared to the other approaches. As can be seen, the proposed scheme accommodates up to R = 3K rules, and each rule includes F = 12 fields (247 bits). Moreover, the metric M shows that the proposed work supports all three types of match criterion: exact, prefix and range match.

Figure 5. The match types, M, # of fields, F and # of rules, R.

The performance metrics of latency (L) and delay (D) are shown in Fig. 6. As we can see, the proposed approach has not only the lowest latency (L = 13) but also the lowest delay (D = 42.9). Furthermore, notice that D is up to 3.49 times lower than in the prior works.

Figure 6. The latency, L and delay, D.

The comparisons between our proposed scheme and the existing works are shown in Table 5. For one of the most original works [15], only 30 clock cycles are needed to handle an input packet. However, this work supports only 5-tuple header fields, so it may not be able to cope with today’s complex requirements. For the approach in [10], the throughput and the number of fields supported are the highest among all the existing works, but the latency and the number of rules are worse. The range field is supported and enhanced by the approach in [9]. However, additional block RAM storage is needed to save the rule sets in work [9]-I. The other solution, [9]-II, took the longest time to process the whole packet (D = 150) among all five approaches.

To the best of our knowledge, the proposed design is the first latency-aware method that can support all types of match. Furthermore, the better D and E also indicate that, by applying our proposed LCS architecture, it is possible to obtain the result earlier rather than spending more memory resources to keep previous data.
Table 5. Performance comparisons
Approach                           M              R (K)   F    L (Clocks)   T (MHz)   P (ns)   D (ns)   E
1. StrideBV [15]                   prefix/exact   0.5     5    30           235       4.3      127.66   19.58
2. Range-Enhanced I [9]            any            3       12   53           566       1.8      93.64    384.45
3. Range-Enhanced II [9]           any            5       12   57           380       2.6      150.00   400.0
4. High-perf. and Updatable [10]   prefix/exact   1       15   89           648       1.5      137.35   109.21
5. Proposed Approach               any            3       12   13           303       3.3      42.90    839.08
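The D and E columns of Table 5 can be recomputed from the other columns with a few lines of Python. This sketch rests on assumptions that are not stated explicitly in the text but that reproduce the tabulated values: the period is taken as P = 1000 / T (ns, with T in MHz), equation (1) is read as E = (F × R) / D, and R is counted in rules rather than in K rules. The function and variable names are hypothetical.

```python
# (name, R in rules, F, L in clocks, T in MHz), copied from Table 5
ROWS = [
    ("StrideBV [15]", 500, 5, 30, 235),
    ("Range-Enhanced I [9]", 3000, 12, 53, 566),
    ("Range-Enhanced II [9]", 5000, 12, 57, 380),
    ("High-perf. and Updatable [10]", 1000, 15, 89, 648),
    ("Proposed Approach", 3000, 12, 13, 303),
]


def delay_ns(latency_clocks, rate_mhz):
    """D = P x L, assuming the period is P = 1000 / T nanoseconds."""
    return latency_clocks * 1000.0 / rate_mhz


def efficiency(fields, rules, delay):
    """E = (F x R) / D, with R counted in rules (assumption)."""
    return fields * rules / delay


RESULTS = {}
for name, rules, fields, latency, rate in ROWS:
    d = delay_ns(latency, rate)
    RESULTS[name] = (d, efficiency(fields, rules, d))
    print(f"{name:32s} D = {d:7.2f} ns   E = {RESULTS[name][1]:7.2f}")

# Ratio between the slowest prior delay and the proposed delay.
PROPOSED_D = RESULTS["Proposed Approach"][0]
WORST_PRIOR_D = max(d for name, (d, _) in RESULTS.items() if name != "Proposed Approach")
print(f"largest delay ratio: {WORST_PRIOR_D / PROPOSED_D:.3f}")
```

Under these assumptions the recomputed values agree with Table 5 to two decimal places, and the largest delay ratio (Range-Enhanced II against the proposed approach) is slightly below 3.5, consistent with the 3.49 quoted in the text.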
5. CONCLUSION
In this paper, we presented a modular architecture for low-latency and high-efficiency packet classification on a Xilinx Virtex-6 XC6VLX760 FPGA. By reorganizing the different models of PE, the proposed LCS architecture achieves better packet processing performance. This feature makes our approach suitable for network applications that require low-latency operation.

6. REFERENCES
[1] Software-Defined Networking (SDN) Definition. [Online]. Available: https://www.opennetworking.org/den-resources/sdn-definition/.
[2] M. Ojo et al., “A SDN-IoT Architecture with NFV Implementation,” in 2016 IEEE Globecom Workshops, 2016 © IEEE. doi: 10.1109/GLOCOMW.2016.7848825
[3] Y.-W. Ma et al., “SDN-enabled network virtualization for industry 4.0 based on IoTs and cloud computing,” in 19th Int. Conf. Advanced Communication Technology, Bongpyeong, South Korea, 2017, pp. 199-202.
[4] OpenFlow Switch Specification V1.0.0. [Online]. Available: https://www.opennetworking.org/wp-content/uploads/2013/04/openflow-spec-v1.0.0.pdf, 2009.
[5] J. M. Llopis et al., “Minimizing Latency of Critical Traffic through SDN,” in 2016 IEEE Int. Conf. Networking, Architecture and Storage, 2016 © IEEE. doi: 10.1109/NAS.2016.7549408
[6] P. Schulz et al., “Latency Critical IoT Applications in 5G: Perspective on the Design of Radio Interface and Network Architecture,” IEEE Commun. Magazine, vol. 55, no. 2, pp. 70-78, Feb. 2017.
[7] F. Yu et al., “Efficient Multimatch Packet Classification and Lookup with TCAM,” IEEE Micro, vol. 25, no. 1, pp. 50-59, 2005.
[8] K. Lakshminarayanan et al., “Algorithms for Advanced Packet Classification with Ternary CAMs,” in Proc. ACM SIGCOMM, 2005, pp. 193-204.
[9] Y.-K. Chang and C.-S. Hsueh, “Range-Enhanced Packet Classification,” IEEE Trans. Emerging Topics in Computing, vol. 4, no. 2, pp. 214-224, Jun. 2016.
[10] Y. R. Qu and V. K. Prasanna, “High-Performance and Dynamically Updatable Packet Classification Engine on FPGA,” IEEE Trans. Parallel Distrib. Syst., vol. 27, no. 1, pp. 197-209, Jan. 2016.
[11] Xilinx, Inc., San Jose, CA, “Virtex-6 Family Overview,” 2015. [Online]. Available: https://www.xilinx.com/support/documentation/data_sheets/ds150.pdf
[12] W. Jiang and V. K. Prasanna, “Field-split Parallel Architecture for High Performance Multi-match Packet Classification using FPGAs,” in Proc. 21st Annu. Symp. Parallelism in Algorithms and Architectures, 2009, pp. 188-196.
[13] V. Srinivasan et al., “Fast and Scalable Layer Four Switching,” in Proc. ACM SIGCOMM, 1998, pp. 191-202.
[14] E. Spitznagel et al., “Packet classification using extended TCAMs,” in Proc. IEEE Int. Conf. Network Protocols, 2003, pp. 120-131.
[15] T. Ganegedara and V. K. Prasanna, “StrideBV: Single chip 400G+ packet classification,” in Proc. IEEE 13th Int. Conf. High Performance Switching and Routing, 2012, pp. 1-6.
[16] P. Gupta and N. McKeown, “Algorithms for packet classification,” IEEE Netw., vol. 15, no. 2, pp. 24-32, Mar. 2001.