A Bypass-Based Low Latency Network-On-Chip Router
5 authors, including:
Ruizhi Chen
Chinese Academy of Sciences
All content following this page was uploaded by Ruizhi Chen on 22 December 2020.
References
IEICE Electronics Express, Vol.*, No.*, 1–12
[8] L. S. Peh and W. J. Dally: “Delay model and speculative architecture for
pipelined routers,” HPCA (2000) 255 (DOI: 10.1109/HPCA.2001.903268).
[9] Moscibroda et al.: “A case for bufferless routing in on-chip networks,”
ACM SIGARCH (2009) 196 (DOI: 10.1145/1555815.1555781).
[10] Hsu et al.: “A low power detection routing method for bufferless noc,”
ISQED (2013) 364 (DOI: 10.1109/ISQED.2013.6523636).
[11] Y. Lu et al.: “Exploring virtual-channel architecture in FPGA based
networks-on-chip,” SOCC (2011) 302 (DOI: 10.1109/SOCC.2011.6085089).
[12] Monemi et al.: “Low latency network-on-chip router microarchitecture
using request masking technique,” IJRC (2015) 2 (DOI: 10.1155/2015/570836).
[13] A. Kumar et al.: “Express virtual channels: towards the ideal
interconnection fabric,” ISCA (2007) 150 (DOI: 10.1145/1250662.1250681).
[14] Wang D et al.: “MaPU: A novel mathematical computing architecture,”
HPCA (2016) 457 (DOI: 10.1109/HPCA.2016.7446086).
1 Introduction
With the development of multi-core processors, inter-core communication has
become an important factor in overall performance [1]. Because conventional
shared-bus and crossbar-based interconnections cannot provide enough
flexibility and scalability, the network-on-chip (NoC) has been proposed as
a promising approach to connecting a large number of IP cores on a chip.
However, the transmission latency between nodes and the power consumption
of the network remain two major issues that need further optimization
[2][3].
The latency of a NoC is determined by the bandwidth of the network,
the length of the routing path, the number of pipeline stages in each
router, and the waiting time due to congestion [4]. For a given network,
the bandwidth and the routing length are hard to optimize, and congestion
is mainly determined by the topology and the routing algorithm. In this
paper, we focus on reducing the number of pipeline stages, which is decided
by the microarchitecture of the routers. Virtual channel (VC) based
routers, the most common router type, must complete four functions before
a flit departs from the desired output port: route computation, VC
allocation, switch allocation, and switch traversal. Limited by the clock
frequency, vanilla implementations normally take three or four cycles to
transfer one flit, which introduces significant latency [5][6]. Reducing
the number of pipeline stages is therefore regarded as a promising way to
lower NoC latency, and many techniques, such as look-ahead routing [7] and
the speculative switch allocator [8], have been proposed.
As for power consumption, NoCs can consume about 30% or even 40% of the
system power [3][9]. In particular, VC-based routers normally buffer flits
within the network, and these buffers consume significant energy. Some
studies have proposed bufferless NoCs to reduce power consumption [9][10].
A bufferless NoC removes the router buffers and transmits data packets by
deflecting them to a free output port. However, the latency of a bufferless
NoC is typically worse than that of a VC-based NoC, and bufferless NoCs are
not suitable for applications with a high injection ratio.
On the other hand, a key characteristic of real CMP systems running real
applications is that they are self-throttling, which prevents the injection
rate from staying very high over an extended period of time. Therefore, in
most cases a flit does not compete with flits from other input VCs for the
same output port and can be transferred to the output directly, which
removes the need for a complex arbitration pipeline and buffering.
In this work, we introduce a new bypass mechanism and propose a low latency
network-on-chip router (BNR) that takes advantage of the above features.
Combined with look-ahead routing and parallel switch traversal, BNR takes
only one or two stages to transfer one flit. In detail, non-conflicting
flits are forwarded to the output port directly in a single cycle, while
flits that encounter congestion are stored in the VC buffer in the first
cycle and sent out in the second stage. In addition, because BNR brings
significant area overhead, we divide the packets into long packets (LP)
and short packets (SP) and propose BNR-S, a simplified version of BNR that
bypasses only the short packets. In the experiment part, we first evaluate
these two routers with synthetic traffic at different injection rates: BNR
achieves speedups of 1.48x and 1.31x over the two baselines, while BNR-S
achieves 1.3x and 1.15x. Both routers also show clear performance
improvements on several typical applications, especially the
memory-intensive ones. Finally, we report the critical path, area, and
power consumption of the proposed designs and find that the bypass
mechanism reduces the dynamic power.
The rest of the paper is organized as follows. In Section 2 we describe
the related work. Section 3 introduces the motivation and overview of BNR.
Section 4 details the microarchitecture of BNR and BNR-S. Finally, the
experiments and conclusion are presented in Section 5 and Section 6.
2 Related work
Many designs have been presented to reduce the latency of NoC routers. In
[7], Galles et al. propose look-ahead routing computation (RC), which
computes the output port one router in advance and attaches the result to
the header flit. Hence, when the header flit arrives, the NoC router can
send the allocation request directly to the attached output port and
concurrently calculate the output port for the next router. In addition,
speculative allocation removes the dependency between VC allocation and
switch allocation by speculating that a waiting packet will be allocated an
output VC successfully [8]. In detail, packets that are awaiting VC
allocation are permitted to make speculative requests for switch
allocation. To reduce the performance penalty when the speculation fails,
the switch allocator prioritizes non-speculative requests over speculative
ones. However, this method can still lead to many speculation failures
under heavy traffic, which causes a significant performance drop.
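The look-ahead idea described above can be sketched in a few lines. This is a minimal illustration only: it assumes a 2D mesh with XY dimension-order routing, which this section does not specify, and the names `xy_route`, `neighbor`, and `lookahead_route` are hypothetical.

```python
# Hypothetical sketch of look-ahead route computation.
# Assumption: 2D mesh with XY dimension-order routing (not stated in the text).
LOCAL, EAST, WEST, NORTH, SOUTH = "L", "E", "W", "N", "S"

# Coordinate change when leaving a router through each port.
STEP = {EAST: (1, 0), WEST: (-1, 0), NORTH: (0, 1), SOUTH: (0, -1), LOCAL: (0, 0)}


def xy_route(cur, dst):
    """Output port taken at router `cur` for a packet headed to `dst`."""
    (cx, cy), (dx, dy) = cur, dst
    if dx > cx:
        return EAST
    if dx < cx:
        return WEST
    if dy > cy:
        return NORTH
    if dy < cy:
        return SOUTH
    return LOCAL


def neighbor(cur, port):
    """Coordinates of the router reached from `cur` through `port`."""
    sx, sy = STEP[port]
    return (cur[0] + sx, cur[1] + sy)


def lookahead_route(cur, attached_port, dst):
    """Use the port attached to the header flit immediately and compute
    the port that the next router will need."""
    next_router = neighbor(cur, attached_port)
    return attached_port, xy_route(next_router, dst)
```

For example, a header flit at router (1, 0) that carries the attached port `EAST` and is headed to (2, 1) requests `EAST` at once, while the port for router (2, 0) is computed in parallel.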
[Figure: router block diagram. Labels: input ports (flit in, credit out),
look-ahead routing computation, flow control (credit in), combination
allocator, switch traversal (flit out).]
this stage. Then the authorized flits are transferred to the destination port
at the second stage.
head flit and the subsequent flits just follow the determined path. Thus,
the congestion situations of LP and SP differ. The result is shown in
Fig. 3. We find that the congestion ratio is actually quite low. In detail,
the congestion ratio is about 2% when the injection ratio is 5% and 8%
when the injection ratio is 15%. Even when the injection ratio is as high
as 30%, the congestion ratio is less than 23%. In particular, the SP is
much less likely to conflict with other packets. Therefore, most flits do
not encounter congestion and can be processed in a simpler way.
[Fig. 3: plot with x-axis "Injection Rate" (5% to 30%) and y-axis in
percent.]
[Figure: block diagram. Labels: BP Failure, BP credit in, Credit in, Flow
control, Combination allocator, LP.]
and design complexity. On the other hand, in Section 3.2 we observe that
the LP is more prone to congestion. We therefore also propose a simplified
BNR (BNR-S), which bypasses only non-conflicting SPs and transfers all LPs
through the conventional path. The two routers are compared in detail in
the experiment part.
4 Microarchitecture
4.1 Bypass module of BNR
In the previous section, we proposed two routers to meet different area and
performance requirements. The main difference between these two routers is
the bypass module: BNR can bypass both the LP and the SP, while BNR-S
bypasses only the SP. In this subsection, we focus on the bypass flow in
BNR. Each LP consists of multiple flits. When a flit enters the router, it
is checked against the following conditions:
Non-Conflict: The flit does not compete for the same output port with
other flits.
If and only if all conditions are met, the flit traverses the bypass path.
Otherwise, it is sent to the conventional path. As for the SP, each packet
contains only one flit. Thus an SP can be bypassed if it meets the
Non-Conflict and Availability conditions.
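As an illustration of the Non-Conflict condition only, the following sketch marks the input ports whose flit is the sole requester of its output port in a given cycle. The function name and the dictionary-based interface are hypothetical, and the other conditions (such as Availability) are deliberately not modeled because their details are not given here.

```python
from collections import Counter


def non_conflict_inputs(requests):
    """Return the input ports whose flit is the only one requesting its
    output port in this cycle.

    `requests` maps each input port to the output port requested by its
    incoming flit, or to None when no flit arrives (hypothetical interface).
    """
    # Count how many incoming flits ask for each output port.
    demand = Counter(out for out in requests.values() if out is not None)
    # A flit is non-conflicting when nobody else wants its output port.
    return {
        inp
        for inp, out in requests.items()
        if out is not None and demand[out] == 1
    }
```

With requests `{"N": "E", "S": "E", "W": "L", "E": None, "L": "W"}`, the flits on inputs `W` and `L` qualify, while the two flits competing for output `E` do not.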
The detailed structure of BNR’s bypass module is shown in Fig. 5. The
structure grants the destination port and the OVC in parallel. First, the
possible destination ports are selected by combining the port select
module, the LP Dest module, the LP Failure module, and the routing result
of the conventional path. The body and tail flits of an LP must follow
their preceding flits, while the destination of other flits is calculated
directly. Then a set of P-1:1 arbiters decides the final port. As for the
OVC, the available signals of the LP OVCs are forwarded to the arbiter, and
multiplexers choose the proper channels. As with the output port, the
output channel is the same for all flits of one LP. For each LP, the bypass
module needs to register the destination port, assigned status, assigned
OVC number, and failure status. The first three items are the routing
information, while the last one records the bypass history. In case of
bypass failure, the routing information is pushed into the conventional
path, which can then route the flit based on this information. If the head
flit fails to bypass, the conventional path is responsible for performing
the routing calculation.
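The four registered fields can be pictured as a small per-LP record. The sketch below is a behavioral illustration, not the authors' RTL: the names `LPBypassState` and `on_flit` are hypothetical, and it assumes that once one flit of an LP fails to bypass, the remaining flits of that LP also take the conventional path, which the text suggests but does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LPBypassState:
    """Per-LP record kept by the bypass module (field names hypothetical)."""
    dest_port: Optional[str] = None  # destination port (routing information)
    assigned: bool = False           # assigned status (routing information)
    ovc: Optional[int] = None        # assigned OVC number (routing information)
    failed: bool = False             # failure status (bypass history)


def on_flit(state, is_head, can_bypass, head_dest=None, free_ovc=None):
    """Update `state` for one incoming LP flit and return the path it takes."""
    if is_head:
        # A new packet starts, so the previous history is cleared.
        state.dest_port, state.assigned, state.ovc = None, False, None
        state.failed = False
        if can_bypass:
            state.dest_port, state.assigned, state.ovc = head_dest, True, free_ovc
            return "bypass"
        # Head failed: the conventional path performs the routing calculation.
        state.failed = True
        return "conventional"
    # Body and tail flits reuse the registered port and OVC.
    if can_bypass and not state.failed:
        return "bypass"
    # Assumption: after one failure the rest of the packet stays conventional.
    state.failed = True
    return "conventional"
```

In this model, a body flit that fails after a successful head leaves `dest_port` and `ovc` intact, which corresponds to the routing information handed over to the conventional path.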
[Fig. 5: structure of BNR’s bypass module. Labels: flit in, dest port
select, conventional path routing, LP Dest., LP Failure, LP Head Status,
LP assigned, P-1:1 arbiters, LP OVC available, LP OVC arbiter and mux,
SP OVC available, SP OVC arbiter and mux, SP?, BP Flit, BP Dest.,
BP Credit.]
[Figure: bypass module block diagram. Labels: flit in, dest port select,
conventional path routing, LP, P-1:1 arbiters, SP OVC available, OVC
arbiter, MUX.]
support flits whose source port and destination port are the same.
Therefore, the switch crossbar is 4x4 in the bypass path but 5x5 in the
conventional path.
This structure brings several benefits. First, two independent switch
traversal structures allow the conventional path and the bypass path to
work in parallel without adding much timing overhead. Second, performing
look-ahead RC and switch traversal in parallel suits adaptive routing
algorithms well. With adaptive algorithms, there are multiple selectable
output ports. Many conventional routers perform look-ahead RC in the first
stage, so the number of required address calculations and routing
computing units increases linearly with the number of selectable ports.
Take the minimal path routing algorithm as an example: with 5 input ports
and two possible output ports per packet, 10 routing computing units are
needed. In our design, however, the output port is determined before
look-ahead RC and switch traversal, so we only need 5 routing computing
units. Moreover, once the output port is determined, the switch traversal
can be performed immediately, so the two operations can naturally run in
parallel.
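The unit count in this example follows from a simple product, sketched below. The function name and the boolean flag are hypothetical and only restate the arithmetic given in the text.

```python
def routing_units(input_ports, candidate_outputs, port_known_before_rc):
    """Number of look-ahead routing computing units a router needs.

    If the output port is already fixed when look-ahead RC runs, one unit
    per input port suffices; otherwise one unit is needed for every
    candidate output port of every input port.
    """
    per_port = 1 if port_known_before_rc else candidate_outputs
    return input_ports * per_port


# Minimal path routing on a 5-port router, two candidate outputs per packet.
conventional_units = routing_units(5, 2, port_known_before_rc=False)
bypass_design_units = routing_units(5, 2, port_known_before_rc=True)
```

Here `conventional_units` evaluates to 10 and `bypass_design_units` to 5, matching the figures quoted above.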
5 Experiment Results
5.1 Configuration
To evaluate BNR and BNR-S, we implement both routers in Verilog HDL. Each
router port is configured with 4 VCs for LPs and 2 VCs for SPs. The depth
of the channels is set to 4, and the data width of the routers is 64 bits.
For comparison, we take two designs as our baselines: Spec-R and Mask-R
[12]. Both are two-stage routers that use look-ahead routing. Spec-R uses
the speculative allocator, while Mask-R uses the request masking technique.
For the performance evaluation, we construct four NoC systems with the
above routers. All of them use a 6x6 mesh topology. Among the 36 nodes,
there are 2 ARM A53 cores, 4 DMAs, and 30 MaPU cores. The ARM cores act as
the controllers, while the DMAs are used to communicate with the external
DRAM. MaPU is a VLIW DSP dedicated to computation [14]. These cores all
run at 800MHz. Note that the original interface of these cores uses the
AXI 3.0 protocol, so we design an adaptor module to convert the AXI-based
communication into NoC packets. To avoid deadlock in the network, the
wrapper limits the number of out-of-order transmissions to no more than
the number of LP virtual channels. The adaptor also serves as a
clock-domain crossing module.
[Figure: critical path and clock frequency of each router.]
Router    Critical path    Frequency
Mask-R    480ps            1.2GHz
BNR-S     590ps            1.1GHz
Spec-R    690ps            1GHz
BNR       710ps            950MHz
the input buffer write, through the speculative allocator, until the flit
flow control. It takes 690ps in total. We omit some other time budgets,
such as setup time and clock uncertainty, from the figure, and the final
frequency is 1GHz. In Mask-R, the speculative allocator is replaced with
the request masking and combined allocator. Thus the critical path is
reduced to 480ps and the router runs at 1.2GHz. The critical paths of both
BNR-S and BNR start from the bypass module. For BNR-S, the critical path in
the bypass module is 590ps and the frequency is 1.1GHz. In contrast, the
bypass process for LP is much more complex, so the path takes 710ps and the
router runs at 950MHz. Although the bypass module lowers the achievable
frequency, we show in the next subsection that it still brings a
significant overall performance improvement.
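The gap between each critical path and the reported clock period can be worked out directly. The sketch below uses only the delays and frequencies quoted in this subsection; the helper names are hypothetical, and because the reported frequencies are presumably rounded, the resulting margins are approximate and should be read as the room left for the omitted budgets (setup time, clock uncertainty, and so on), not as measured values.

```python
# Critical path (ps) and reported frequency (GHz), as quoted in the text.
ROUTERS = {
    "Spec-R": (690, 1.0),
    "Mask-R": (480, 1.2),
    "BNR-S": (590, 1.1),
    "BNR": (710, 0.95),
}


def clock_period_ps(freq_ghz):
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / freq_ghz


def margin_ps(path_ps, freq_ghz):
    """Part of the clock period not covered by the reported critical path."""
    return clock_period_ps(freq_ghz) - path_ps


# Approximate margin per router, rounded to 0.1ps.
margins = {name: round(margin_ps(p, f), 1) for name, (p, f) in ROUTERS.items()}
```

Under these numbers, every router leaves roughly 310ps to 355ps of its period outside the reported path.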
5.3 Performance
We first evaluate the NoCs with synthetic traffic. In detail, the traffic
in the network is generated randomly under a given injection rate, which
is the rate at which each node injects flits into the network. As shown in
Fig. 9, BNR and BNR-S achieve much better performance than Spec-R and
Mask-R at all injection rates. On average, BNR achieves speedups of 1.48x
and 1.31x over Spec-R and Mask-R, while BNR-S achieves speedups of 1.3x
and 1.15x over them.
[Fig. 9: performance of BNR, BNR-S, Spec-R, and Mask-R under synthetic
traffic; x-axis: "Injection Rate" (5% to 30%).]
[Figure: bar chart with y-axis "Performance" (0.9 to 1.25) for the
applications DMA, V_MUL, V_MAX, FFT, 16QAM, OFDM, FIR, TURBO, and CRC.]
6 Conclusion
In this work, we propose two NoC routers: BNR and BNR-S. With the bypass
path in the router, BNR can transfer non-conflicting flits directly, which
not only reduces the NoC latency but also removes a lot of unnecessary
buffer activity. BNR-S is a simplified version of BNR. It balances the
performance benefits and the area overhead by supporting the bypass of SP
only. In the experiments with synthetic traffic, BNR shows speedups of
1.48x and 1.31x over the two baselines, while BNR-S shows 1.3x and 1.15x.
Both routers also achieve clear improvements on real applications,
especially the memory-intensive ones. Finally, the synthesis results show
that they are more energy-efficient than the baselines.