A Bypass-Based Low Latency Network-on-Chip Router

Article in IEICE Electronics Express · February 2019


DOI: 10.1587/elex.16.20181147

This article has been accepted and published on J-STAGE in advance of
copyediting. Content is final as presented.
IEICE Electronics Express, Vol.*, No.*, 1–12

Peng Guo (1,2,a), Qingbin Liu (1,2), Ruizhi Chen (1,2), Lei Yang (1),
Donglin Wang (1)
(1) Institute of Automation, Chinese Academy of Sciences, Beijing, China
(2) University of Chinese Academy of Sciences, China
a) [email protected]

Abstract: As the most critical components of a network-on-chip (NoC),
routers need to select suitable output ports and guarantee that every flit
accesses the hardware resources exclusively. They are therefore normally
designed with several pipeline stages. However, in real applications most
flits do not compete with other flits for the same output port. In this
work, we add a bypass path to the traditional router so that non-conflict
flits can be forwarded directly. Combined with several other optimizations,
we propose a bypass-based low latency NoC router (BNR). When no congestion
occurs, BNR transfers a flit through the bypass path in only one cycle.
Otherwise, the flit is transferred through the conventional path in two
cycles. We also present a simplified version, BNR-S. Compared with BNR, it
only bypasses short packets, which reduces the area overhead significantly.
For synthetic traffic with different injection rates, BNR achieves 1.48x
and 1.31x speedups over the two baselines, while BNR-S achieves 1.3x and
1.15x. Both routers also bring clear benefits for several real
applications. In addition, the experiments show that the proposed bypass
mechanism can reduce dynamic power.
Keywords: Network-on-chip, router, bypass, low latency
Classification: Integrated circuits

©IEICE 2019
Received December 29, 2018
Accepted January 21, 2019
Publicized February 6, 2019

References

[1] Paul Gratz et al.: “Implementation and Evaluation of On-Chip Network
Architectures,” ICCD (2006) 477 (DOI: 10.1109/ICCD.2006.4380859)
[2] Y. Hoskote et al.: “A 5-GHz mesh interconnect for a teraflops processor,”
IEEE Micro (2007) 51 (DOI: 10.1109/MM.2007.4378783)
[3] Michelogiannakis et al.: “Elastic-Buffer Flow Control for On-Chip
Networks,” HPCA (2009) 151 (DOI: 10.1109/HPCA.2009.4798250)
[4] Daniel U Becker et al.: “Efficient Microarchitecture For Network-on-Chip
Routers,” Ph.D Dissertation, Stanford University (2011)
[5] Robert Mullins et al.: “Low-Latency Virtual-Channel Routers for On-Chip
Networks,” ISCA (2004) 188 (DOI: 10.1109/ISCA.2004.1310774)
[6] W. J. Dally et al.: “Virtual-Channel Flow Control,” ISCA (1992) 194
(DOI: 10.1109/71.127260)
[7] M. Galles et al.: “Spider: A high-speed network interconnect,” IEEE Micro
(1997) 34 (DOI: 10.1109/40.566196)
[8] L. S. Peh and W. J. Dally: “Delay model and speculative architecture for
pipelined routers,” HPCA (2000) 255 (DOI: 10.1109/HPCA.2001.903268)
[9] Moscibroda et al.: “A case for bufferless routing in on-chip networks,”
ACM SIGARCH (2009) 196 (DOI: 10.1145/1555815.1555781)
[10] Hsu et al.: “A low power detection routing method for bufferless noc,”
ISQED (2013) 364 (DOI: 10.1109/ISQED.2013.6523636)
[11] Y. Lu et al.: “Exploring virtual-channel architecture in FPGA based
networks-on-chip,” SOCC (2011) 302 (DOI: 10.1109/SOCC.2011.6085089)
[12] Monemi et al.: “Low latency network-on-chip router microarchitecture
using request masking technique,” IJRC (2015) 2 (DOI: 10.1155/2015/570836)
[13] A. Kumar et al.: “Express virtual channels: towards the ideal
interconnection fabric,” ISCA (2007) 150 (DOI: 10.1145/1250662.1250681)
[14] Wang D et al.: “MaPU: A novel mathematical computing architecture,”
HPCA (2016) 457 (DOI: 10.1109/HPCA.2016.7446086)


1 Introduction
With the development of multi-core processors, inter-core communication has
become an important factor in overall performance [1]. Because conventional
shared-bus and crossbar-based interconnects cannot provide enough
flexibility and scalability, the NoC has been proposed as a promising
approach to connecting a large number of IP cores on a chip. However, the
transmission latency between nodes and the power consumption of the network
remain two major issues that need further optimization [2][3].
The latency of a NoC is determined by the bandwidth of the network, the
length of the routing path, the number of pipeline stages in each router,
and the waiting time due to congestion [4]. For a given network, the
bandwidth and the routing length are hard to optimize, and congestion is
mainly determined by the topology and the routing algorithm. In this paper,
we focus on reducing the number of pipeline stages, which is decided by the
microarchitecture of the routers. As the most common router type, virtual
channel (VC) based routers must complete four functions before a flit
departs from the desired output port: route computation, VC allocation,
switch allocation, and switch traversal. Constrained by the clock
frequency, vanilla implementations normally take three or four cycles to
transfer one flit, which introduces significant latency [5][6]. Reducing
the number of pipeline stages is therefore a promising way to lower NoC
latency, and techniques such as look-ahead routing [7] and speculative
switch allocation [8] have been proposed.
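
As a rough illustration of how the four latency factors listed above combine, the following Python sketch uses a simple additive model. The model itself, the function name `packet_latency_cycles`, and all parameter names are assumptions made for illustration; they are not taken from this paper.

```python
import math

def packet_latency_cycles(hops, stages_per_router, packet_bits, link_bits,
                          wait_cycles=0):
    """Additive latency estimate in cycles (illustrative assumption only).

    hops              -- number of routers the packet traverses
    stages_per_router -- pipeline stages paid at every router
    packet_bits       -- packet size in bits
    link_bits         -- link (flit) width in bits, i.e. the bandwidth term
    wait_cycles       -- extra cycles spent waiting because of congestion
    """
    if hops < 0 or stages_per_router < 1 or link_bits < 1:
        raise ValueError("invalid parameters")
    router_delay = hops * stages_per_router
    # Flits that follow the head flit add one cycle each (serialization).
    serialization = max(math.ceil(packet_bits / link_bits) - 1, 0)
    return router_delay + serialization + wait_cycles

# Under this model, halving the pipeline depth halves the router term:
# a single-flit packet over 6 routers needs 24 cycles with 4 stages
# and 12 cycles with 2 stages.
```

Under these assumptions, the number of pipeline stages multiplies the hop count, which is why it is an attractive optimization target when the topology and routing algorithm are fixed.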
As for power consumption, NoCs can consume about 30% or even 40% of the
system power [3][9]. In particular, VC based routers normally buffer flits
within the network, and these buffers consume significant energy. Some
studies have proposed bufferless NoCs to reduce power consumption [9][10].
A bufferless NoC removes the router buffers and transmits data packets by
deflecting them to a free output port. However, the latency of a bufferless
NoC is typically worse than that of a VC based NoC, and bufferless designs
are not suitable for applications with high injection rates.
On the other hand, a key characteristic of real CMP systems running real
applications is that they are self-throttling, which prevents injection
rates from staying very high over an extended period of time. Therefore, in
most cases a flit does not compete for the same output port with flits from
other input VCs and can be transferred to the output directly, eliminating
the need for a complex arbitration pipeline and buffering.
In this work, we introduce a new bypass mechanism and propose a low latency
network-on-chip router (BNR) that takes advantage of the above features.
Combined with look-ahead routing and parallel switch traversal, BNR takes
only one or two stages to transfer one flit. In detail, non-conflict flits
are forwarded to the output port directly in only one cycle, while flits
that encounter congestion are stored in the VC buffer in the first cycle
and sent out in the second. In addition, because BNR introduces significant
area overhead, we divide packets into long packets and short packets and
propose BNR-S, a simplified version of BNR that only bypasses short
packets. In the experiments, we first evaluate the two routers with
synthetic traffic at different injection rates: BNR achieves 1.48x and
1.31x speedups over the two baselines, while BNR-S achieves 1.3x and 1.15x.
Both routers also improve performance on several typical applications,
especially memory-intensive ones. Finally, we report the critical path,
area, and power consumption of the proposed designs and find that the
bypass mechanism reduces dynamic power.
The rest of the paper is organized as follows. Section 2 describes the
related work. Section 3 introduces the motivation and overview of BNR.
Section 4 details the microarchitecture of BNR and BNR-S. Finally, the
experiments and conclusion are presented in Section 5 and Section 6.

2 Related work
Many designs have been presented to reduce the latency of NoC routers. In
[7], Galles et al. propose look-ahead routing computation (RC), which
computes the output port one router in advance and attaches the result to
the header flit. Hence, when the header flit arrives, the NoC router can
send the allocation request directly to the attached output port and
concurrently calculate the output port for the next router. In addition,
speculative allocation removes the dependency between VC allocation and
switch allocation by speculating that a waiting packet will be allocated an
output VC successfully [8]. In detail, packets that are awaiting VC
allocation are permitted to make speculative requests for switch
allocation. To reduce the negative impact on performance when speculation
fails, the switch allocator prioritizes non-speculative requests over
speculative requests. However, this method can lead to too many speculation
failures during heavy traffic, which causes a significant performance drop.
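
The look-ahead idea can be sketched in a few lines of Python. This is a minimal illustration that assumes deterministic XY routing on a mesh with row-major router numbering; the port names (`E`, `W`, `S`, `N`, `L`) and the function names are hypothetical and do not come from [7] or from this paper.

```python
def xy_port(cur, dst, width):
    """Output port at router `cur` for destination `dst` (XY routing assumed)."""
    cx, cy = cur % width, cur // width
    dx, dy = dst % width, dst // width
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:
        return "S"
    if dy < cy:
        return "N"
    return "L"  # arrived: eject to the local node

def neighbor(cur, port, width):
    """Router reached when leaving `cur` through `port`."""
    offsets = {"E": 1, "W": -1, "S": width, "N": -width, "L": 0}
    return cur + offsets[port]

def lookahead_rc(cur, port_here, dst, width):
    """Given the port already attached to the head flit for `cur`,
    compute the port that the NEXT router will use."""
    return xy_port(neighbor(cur, port_here, width), dst, width)

# A head flit at router 1 of a 4-wide mesh, heading to router 10, already
# carries port "E"; while it is being switched, the router computes "S"
# for router 2 and attaches it to the flit.
next_port = lookahead_rc(1, "E", 10, 4)
```

Because the port for the current router arrives with the flit, route computation is off the critical path of allocation at every hop.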


Another way to relax the dependency between the VC allocator and the switch
allocator is to use a non-speculative combined VC/switch allocation unit
[11]. It replaces the VC allocator with a queue of available VCs. For the
head flit of a packet, a new VC is taken from this queue upon successful
switch allocation, so VC allocation and switch allocation are combined.
Although this method uses VC buffers efficiently, it also does not work
well during heavy traffic. The design in [12] further proposes a request
masking technique that filters out all switch allocation requests that
cannot pass flits to the output port. This eliminates the need to give any
input VC request a higher priority and makes efficient use of VC memory
buffers. Combined with non-speculative VC/switch allocation, it realizes a
two-cycle router.
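
The combination of a free-VC queue with request masking can be sketched as follows. This is a simplified illustration for head flits only; the fixed-priority arbitration (lowest input id wins), the data layout, and the name `combined_allocate` are assumptions for the sketch and are not the implementation of [11] or [12].

```python
from collections import deque

def combined_allocate(head_requests, free_vcs):
    """One allocation cycle for head flits.

    head_requests -- list of (input_vc_id, out_port) requests
    free_vcs      -- {out_port: deque of free output-VC ids}
    Returns {input_vc_id: (out_port, vc)} for the granted requests.
    """
    grants = {}
    granted_ports = set()
    # Assumed fixed priority: the lowest input id is served first.
    for in_id, port in sorted(head_requests):
        # Request masking: a request with no free VC at its output port
        # is filtered out before it reaches the arbiter.
        if not free_vcs.get(port):
            continue
        # Switch allocation: at most one winner per output port per cycle.
        if port in granted_ports:
            continue
        # The VC is taken from the queue only on a successful switch grant,
        # so VC allocation and switch allocation happen together.
        grants[in_id] = (port, free_vcs[port].popleft())
        granted_ports.add(port)
    return grants
```

In this sketch a masked request never competes, so no granted slot is wasted on a flit that could not advance, which is the property that speculation cannot guarantee.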
There are also approaches that do not optimize the VC based router
directly. For example, bufferless NoC routers [9][10] reduce power
consumption and circuit area by removing the router buffers completely and
transmit data packets by deflecting them to a free output port. However,
the permutation tree used to perform the routing algorithm also consumes
considerable power and area, and the approach is not suitable for heavy
traffic. In addition, Express Virtual Channels [13] let flits bypass many
routers by adding a special path between frequently communicating nodes,
which reduces the latency between these nodes significantly. However, this
method impairs the adaptability and scalability of the original NoC.
Our proposed NoC router adopts some of the above optimization methods, such
as look-ahead RC and the non-speculative combined VC/switch allocation. At
the same time, we introduce a new bypass path in each router to forward
non-conflict flits, which not only reduces the latency but also avoids
unnecessary VC buffer activity.

3 Basic structure of BNR and BNR-S


3.1 A Basic Two-Stage Pipeline Router
A vanilla VC based NoC router takes four steps to process one flit. First,
route computation is performed to determine the output port. Then the VC
allocation stage assigns an empty VC in the next router to the flit.
Because many flits may request the same VC, arbitration logic is required.
The third stage is switch allocation, which arbitrates among multiple
requests for the same output port. Finally, the flit is sent through the
crossbar during the switch traversal stage.
As mentioned in the related work, [12] proposes a two-stage pipeline router
by combining look-ahead routing, combined VC/SW allocation, and request
masking. We take this design as the baseline, and our optimization is based
on it. The diagram of this design is shown in Fig. 1. In the first stage,
the competition between multiple input VCs is arbitrated by the combined
VC/SW allocator. Request masking guarantees that every request reaching the
switch allocator has a valid unassigned VC and free space at the
corresponding output port. Look-ahead routing is also performed in this
stage. The granted flits are then transferred to the destination port in
the second stage.

Fig. 1. The conventional router architecture.

3.2 Analysis of the NoC Traffic


Fig. 2 shows a NoC with a 4x4 mesh topology. In addition to its adjacent
routers, each router is also connected to a local node. There are four
packet flows in the figure. Although the yellow and crimson flows both pass
through router 5, they do not use the same output port and thus cause no
congestion. On the other hand, the purple and blue flows compete for the
left port of router 10 if they arrive at that node at the same time, which
leads to congestion.
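
The distinction between sharing a router and sharing an output port can be checked with a small Python sketch. It assumes XY routing and row-major router numbering on the 4x4 mesh; the four flows used in the example are hypothetical and are not the flows drawn in Fig. 2. A shared (router, port) pair only indicates a potential conflict, since congestion also requires the flits to arrive at the same time.

```python
def xy_hops(src, dst, width=4):
    """List of (router, output_port) pairs along an XY route (assumed routing)."""
    hops, cur = [], src
    while cur != dst:
        cx, dx = cur % width, dst % width
        if dx != cx:  # move along the row first
            port, nxt = ("E", cur + 1) if dx > cx else ("W", cur - 1)
        else:         # same column: move along the column
            port, nxt = ("S", cur + width) if dst > cur else ("N", cur - width)
        hops.append((cur, port))
        cur = nxt
    hops.append((dst, "L"))  # eject to the local node
    return hops

def shared_ports(flow_a, flow_b, width=4):
    """(router, port) pairs used by both flows; each flow is (src, dst)."""
    return set(xy_hops(flow_a[0], flow_a[1], width)) & \
           set(xy_hops(flow_b[0], flow_b[1], width))

# Two hypothetical flows that cross at router 5 on different output ports:
no_conflict = shared_ports((4, 7), (1, 13))       # empty set
# Two hypothetical flows that both leave router 10 through its west port:
may_conflict = shared_ports((11, 8), (10, 13))    # {(10, "W")}
```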

Fig. 2. A 4x4 mesh NoC with four packet flows.

In fact, many structures in traditional routers, including VCs and
arbitration logic, are designed for the case of congestion. One question,
however, is how often congestion actually occurs. We evaluate this by
constructing a 6x6 mesh NoC with the basic router introduced above. All
nodes in the NoC originally use the AXI protocol, so we employ an adapter
to convert their communication into NoC traffic. We also divide the traffic
into two categories: long packets (LP) and short packets (SP). An LP is a
packet with more than one flit, while an SP is a packet with only one flit.
For an LP, routing and VC allocation are performed only for the head flit,
and the subsequent flits simply follow the determined path. Thus, the
congestion behavior of LPs and SPs differs. The result is shown in Fig. 3.
We find that the congestion rate is actually quite low: it is about 2% when
the injection rate is 5% and 8% when the injection rate is 15%. Even when
the injection rate is as high as 30%, the congestion rate is less than 23%.
In particular, SPs are much less likely to conflict with other packets.
Therefore, most flits do not encounter congestion and can be processed in a
simpler way.

Fig. 3. The congestion rate under different injection rates (x-axis:
injection rate, 5% to 30%; y-axis: congestion rate, 0% to 25%; series: LP,
SP, Total).

3.3 Overview of BNR and BNR-S


To facilitate the transmission of non-conflict flits, we add a bypass
technique to the basic two-stage NoC router. The architecture of BNR is
illustrated in Fig. 4. LPs and SPs are processed separately. For a flit
that belongs to an LP, the BP Process LP module first decides whether the
flit meets the bypass condition, which is explained later. If the condition
is met, the flit is transmitted through the purple and green data flows.
The green data flow corresponds to the switch traversal and look-ahead
routing. If not, the flit is sent to the red data flow, which is similar to
the conventional data path of the basic two-stage router. A flit of an SP
can likewise be bypassed to the yellow and green data flows or transferred
through the conventional red data flow. This bypass mechanism brings two
major advantages. First, a bypassed flit takes only one cycle to pass
through the router, which reduces the latency significantly. Second,
compared with the conventional path, the bypass path does not need to
buffer the flit and thus reduces the dynamic power consumption of the
buffers, leading to a lower-power NoC design.
Compared with the SP path, the LP path has an extra BP Failure module. Each
SP contains only one flit, so there are no ordering dependencies between
flits. This is not true for LPs. Each LP contains multiple flits, and the
flits of the same packet must arrive at the destination in their original
order. Therefore, they must be transferred through the same VC port. If one
flit in an LP fails to bypass, none of the subsequent flits can be
bypassed, and they must be arbitrated to the same output VC in the
conventional path. This introduces significant area overhead and design
complexity. On the other hand, Section 3.2 shows that LPs are more prone to
congestion. We therefore also propose a simplified BNR (BNR-S), which only
bypasses non-conflict SPs and transfers all LPs through the conventional
path. The two designs are compared in the experiment part.

Fig. 4. The architecture of BNR.

4 Microarchitecture
4.1 Bypass module of BNR
In the previous section, we proposed two routers to meet different area and
performance requirements. The main difference between the two routers is
the bypass module: BNR can bypass both LPs and SPs, while BNR-S only
bypasses SPs. In this subsection, we focus on the bypass flow in BNR. Each
LP consists of multiple flits. When a flit enters the router, it is checked
against the following conditions:

• Non-Conflict: The flit does not compete with other flits for the same
output port.

• Availability: There are available output VCs in the next router.

• Sequentiality: All previous flits in the same packet were bypassed
successfully.

A flit traverses the bypass path if and only if all conditions are met.
Otherwise, it is sent to the conventional path. Since each SP contains only
one flit, an SP can be bypassed if it meets the Non-Conflict and
Availability conditions.
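
The three conditions above can be written as a single predicate. The following Python sketch is only a behavioral restatement of the rules in the text; the function name `can_bypass` and its boolean arguments are hypothetical and do not correspond to signal names in the actual Verilog design.

```python
def can_bypass(is_short, conflict_free, ovc_available,
               prev_flits_bypassed=True):
    """Return True if a flit may take the bypass path.

    is_short            -- True for an SP flit, False for an LP flit
    conflict_free       -- Non-Conflict: no other flit wants the same output port
    ovc_available       -- Availability: a free output VC exists in the next router
    prev_flits_bypassed -- Sequentiality: all earlier flits of this LP were
                           bypassed (ignored for SPs, which have a single flit)
    """
    if not (conflict_free and ovc_available):
        return False
    if is_short:
        return True
    return prev_flits_bypassed
```

For example, an SP flit with a free port and a free OVC is bypassed regardless of the last argument, while an LP body flit with the same inputs is sent to the conventional path if any earlier flit of its packet failed to bypass.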
The detailed structure of BNR's bypass module is shown in Fig. 5. The
structure grants the destination port and the output VC (OVC) in parallel.
First, the candidate destination ports are selected by the combination of
the port select module, the LP Dest module, the LP Failure module, and the
conventional path routing result. The body and tail flits of an LP must
follow their previous flits, while the destination of other flits is
calculated directly. Then a set of P-1:1 arbiters decides the final port.
As for the OVC, the available signals of the LP OVCs are forwarded to the
arbiter, and multiplexers choose the proper channels. Like the output port,
the output channel is the same for all flits of one LP. For each LP, the
bypass module needs to register the destination port, assigned status,
assigned OVC number, and failure status. The first three items are the
routing information, while the last one records the bypass history. In case
of a bypass failure, the routing information is pushed into the
conventional path, which can then route the flit based on it. If the head
flit fails to bypass, the conventional path is responsible for performing
the routing calculation.
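
The per-LP registers and the failure handover described above can be modeled as a small state record. This Python sketch is a behavioral approximation under stated assumptions: the class `LPBypassState`, the function `handle_lp_flit`, and the choice to hand the routing information over exactly once (on the first failing body flit) are hypothetical and are not taken from the actual design.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LPBypassState:
    dest_port: Optional[str] = None  # destination port of the packet
    assigned: bool = False           # whether an OVC has been assigned
    ovc: Optional[int] = None        # assigned OVC number
    failed: bool = False             # bypass history of this packet

def handle_lp_flit(state, is_head, bypass_ok, dest_port=None, ovc=None):
    """Return (path, routing_info) for one LP flit.

    path is "bypass" or "conventional"; routing_info is the (dest_port, ovc)
    pair pushed to the conventional path, or None when nothing is pushed.
    """
    if is_head:
        # A new packet starts: clear the registers of the previous one.
        state.dest_port, state.ovc = None, None
        state.assigned, state.failed = False, False
        if bypass_ok:
            state.dest_port, state.ovc, state.assigned = dest_port, ovc, True
            return "bypass", None
        # A failed head flit is routed by the conventional path itself.
        state.failed = True
        return "conventional", None
    if bypass_ok and not state.failed:
        return "bypass", None
    first_failure = not state.failed
    state.failed = True  # all later flits of this packet must follow
    info = (state.dest_port, state.ovc) if first_failure and state.assigned else None
    return "conventional", info
```

The `failed` flag is what enforces the Sequentiality condition: once set, every later flit of the packet takes the conventional path even if its port and OVC are free.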

Fig. 5. The bypass module of BNR.

4.2 Bypass module of BNR-S


BNR-S only supports the bypass of SPs, so it only needs to check the
Non-Conflict and Availability conditions. In addition, each SP carries its
own routing information, so the bypass process module does not need to
provide routing information for the flits. Likewise, in case of a bypass
failure, a flit does not need to consider the routing information of
previous flits. Thus the bypass process module can be simplified
significantly. The concrete structure is shown in Fig. 6. Similar to BNR,
the destination addresses are first arbitrated to ensure that the requests
do not compete for the same output port. At the same time, a set of mux
units selects the final output channel from the available output channels.
The credit signals are used to communicate with the next router.
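
A behavioral view of this module is sketched below in Python. It assumes that a port also counts as contended when the conventional path already requests it, and that the mux simply picks the lowest-numbered free OVC; both choices, as well as the name `bnr_s_bypass` and the data layout, are illustrative assumptions and not details confirmed by the text.

```python
def bnr_s_bypass(sp_requests, conv_requests, free_ovcs):
    """Decide which arriving SP flits are bypassed in one cycle.

    sp_requests   -- {input_port: out_port} for SP flits arriving this cycle
    conv_requests -- set of out_ports requested by the conventional path
    free_ovcs     -- {out_port: list of free SP OVC ids in the next router}
    Returns {input_port: (out_port, ovc)} for the bypassed flits.
    """
    # Group the arriving flits by the output port they want.
    demand = {}
    for in_port, out_port in sp_requests.items():
        demand.setdefault(out_port, []).append(in_port)

    grants = {}
    for out_port, inputs in demand.items():
        # Non-Conflict: exactly one requester and no conventional-path request.
        non_conflict = len(inputs) == 1 and out_port not in conv_requests
        # Availability: at least one free OVC; pick the lowest id (assumed).
        if non_conflict and free_ovcs.get(out_port):
            grants[inputs[0]] = (out_port, min(free_ovcs[out_port]))
    return grants
```

Flits that are not granted here would fall back to the conventional two-stage path; no per-packet history is kept, which is the simplification relative to BNR.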

4.3 Parallel Switch Traversal Structure


Fig. 6. The bypass module of BNR-S.

In a conventional router, the look-ahead RC and switch traversal are
normally performed sequentially. For BNR and BNR-S, we propose a parallel
switch traversal structure. As shown in Fig. 7, there are two sets of
switch traversal structures, and in each one the look-ahead RC and switch
traversal are performed in parallel. In addition, the router is designed
with adaptive routing algorithms to better balance the workload in the
network. The required adaptive RC messages are 2 bits wide and correspond
to the top part of the figure. The bottom part corresponds to the data, and
we suppose each flit has N bits. For simplicity, the switch traversal in
the bypass path does not support flits whose source port and destination
port are the same. Therefore, the switch crossbar is 4x4 in the bypass path
and 5x5 in the conventional path.
This structure brings several benefits. First, the two independent switch
traversal structures allow the conventional path and the bypass path to
work in parallel without adding much timing overhead. Second, performing
look-ahead RC and switch traversal in parallel is well suited to adaptive
routing algorithms, which offer multiple selectable output ports. Many
conventional routers perform look-ahead RC in the first stage, so the
number of required address calculations and routing computing units
increases linearly with the number of selectable ports. Take the minimal
path routing algorithm as an example: with 5 input ports and two possible
output ports for each packet, 10 routing computing units are needed. In our
design, however, the output port is determined before look-ahead RC and
switch traversal, so only 5 routing computing units are needed. Moreover,
once the output port is determined, the switch traversal can be performed
immediately, so the two operations parallelize naturally.
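
The unit count in this argument is simple arithmetic, restated here as a short Python sketch. The function name `rc_units` and its parameters are hypothetical; only the numbers 5, 2, 10, and 5 come from the text.

```python
def rc_units(input_ports, candidate_outputs, output_fixed_first):
    """Number of routing computing units needed for look-ahead RC.

    If the output port is not yet known when look-ahead RC runs, one unit is
    needed per candidate output port of every input port. If the output port
    is fixed first, one unit per input port is enough.
    """
    per_input = 1 if output_fixed_first else candidate_outputs
    return input_ports * per_input

conventional = rc_units(5, 2, output_fixed_first=False)  # 10 units
proposed = rc_units(5, 2, output_fixed_first=True)       # 5 units
```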

Fig. 7. The parallel switch traversal structure.


5 Experiment Results
5.1 Configuration
To evaluate BNR and BNR-S, we implement both in Verilog HDL. Each port of
the routers is configured with 4 VCs for LPs and 2 VCs for SPs. The depth
of the channels is set to 4, and the data width of the routers is 64 bits.
For comparison, we take two designs as our baselines: Spec-R and Mask-R
[12] (see footnote 1). Both are two-stage routers with look-ahead routing.
Spec-R uses the speculative allocator, while Mask-R uses the request
masking technique.
For the performance evaluation, we construct four NoC systems with the
above routers. All of them use a 6x6 mesh topology. Among the 36 nodes,
there are 2 ARM A53 cores, 4 DMAs, and 30 MaPU cores. The ARM cores act as
controllers, while the DMAs are used to communicate with the external DRAM.
MaPU is a VLIW DSP dedicated to computation [14]. All cores run at 800MHz.
Note that the original interface of these cores uses the AXI 3.0 protocol,
so we design an adaptor module to convert AXI based communication into NoC
packets. To avoid deadlock in the network, the adaptor limits the number of
out-of-order transmissions to no more than the number of LP virtual
channels. The adaptor is also a clock-domain crossing module.

Fig. 8. The critical paths and frequency of the four routers.

Router   Critical path   Frequency
Mask-R   480ps           1.2GHz
BNR-S    590ps           1.1GHz
Spec-R   690ps           1GHz
BNR      710ps           950MHz

5.2 Critical Path


We synthesize all routers with Synopsys Design Compiler using a TSMC 28nm
library. We adopt a frequency-first strategy, and the final critical paths
and frequencies are shown in Fig. 8. For Spec-R, the critical path starts
at the input buffer write, passes through the speculative allocator, and
ends at the flit flow control. It takes 690ps in total. The figure omits
other timing budgets such as setup time and clock uncertainty, and the
final frequency is 1GHz. For Mask-R, the speculative allocator is replaced
with request masking and the combined allocator, so the critical path is
reduced to 480ps and the router runs at 1.2GHz. The critical paths of both
BNR-S and BNR start from the bypass module. For BNR-S, the critical path in
the bypass module is 590ps and the frequency is 1.1GHz. In contrast, the
bypass process for LPs in BNR is much more complex, so the path takes 710ps
and the router runs at 950MHz. Although the bypass module lowers the
achievable frequency, the next subsection shows that it still brings a
significant overall performance improvement.

Footnote 1: These two routers are open source and publicly released at
http://opencores.org/project,an-fpga-implementation-oflow-latency-noc-based-mpsoc
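
The relation between the reported critical paths and frequencies can be checked with a short calculation. The Python sketch below computes the clock period implied by each frequency and the remainder left for the budgets that the figure omits (setup time, clock uncertainty, and so on). The path and frequency values are the ones reported above; the names `ROUTERS`, `period_ps`, and `margin_ps` are hypothetical, and the remainders are derived values, not figures reported by the paper.

```python
# (critical path in ps, frequency in GHz) as reported for each router
ROUTERS = {
    "Spec-R": (690, 1.0),
    "Mask-R": (480, 1.2),
    "BNR-S": (590, 1.1),
    "BNR": (710, 0.95),
}

def period_ps(freq_ghz):
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / freq_ghz

def margin_ps(name):
    """Part of the clock period not covered by the reported critical path."""
    path, freq = ROUTERS[name]
    return period_ps(freq) - path

for name in ROUTERS:
    print(f"{name}: period {period_ps(ROUTERS[name][1]):.0f}ps, "
          f"remainder {margin_ps(name):.0f}ps")
```

In every case the reported path fits within the implied period, with roughly 310ps to 353ps left over for the omitted budgets.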

5.3 Performance
We first evaluate the NoCs with synthetic traffic, in which the traffic in
the network is generated randomly under a given injection rate. The
injection rate is the rate at which each node injects flits into the
network. As shown in Fig. 9, BNR and BNR-S achieve much better performance
than Spec-R and Mask-R at all injection rates. On average, BNR achieves
1.48x and 1.31x speedups over Spec-R and Mask-R, while BNR-S achieves 1.3x
and 1.15x speedups over them.

Fig. 9. The performance under traffic of different injection rates (x-axis:
injection rate, 5% to 30%; y-axis: average latency in ns; series: BNR,
BNR-S, Mask-R, Spec-R).

Furthermore, we deploy several typical applications on these NoC systems.
As shown in Fig. 10, we normalize the results by setting the performance of
Spec-R to 1. Although system performance depends on many aspects, including
the computing units, communication mode, and scheduling method, in general
the proposed bypass mechanism improves the performance of most
applications. In particular, BNR achieves a 20%-30% speedup over Spec-R for
memory-intensive algorithms such as DMA, vector multiplication (V-MUL), and
vector maximum (V-MAX). These results show that the proposed bypass
mechanism is effective in reducing NoC latency.

Fig. 10. Performance of several real applications (normalized to Spec-R;
applications: DMA, V_MUL, V_MAX, FFT, 16QAM, OFDM, FIR, TURBO, CRC; series:
BNR, BNR-S, Mask-R, Spec-R).

5.4 Area and Power


We estimate the area and power consumption with the report_area and
report_power commands of Design Compiler. The results are normalized and
shown in Table I. For the area, BNR-S greatly reduces the area overhead of
BNR and is only slightly larger than Mask-R, which shows that BNR-S can
effectively reduce the area overhead introduced by the bypass mechanism. As
for power consumption, the bypass path avoids a lot of buffer activity,
thus reducing the dynamic power and total power. In detail, although BNR is
larger than Spec-R and BNR-S is larger than Mask-R, the total power
consumption of BNR and BNR-S is lower than that of Spec-R and Mask-R,
respectively.

Table I. Comparison of area and power

Optimization   Dynamic Power (mW)   Leakage Power (mW)   Total Power (mW)   Area (µm²)
Spec-R [12]    5.86                 0.86                 6.72               67,739
Mask-R [12]    5.57                 0.67                 6.24               55,914
BNR            5.61                 0.91                 6.52               71,762
BNR-S          5.45                 0.66                 6.11               56,381
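
The comparisons drawn from Table I can be reproduced with a few lines of Python. The numbers below are copied from the table; the names `TABLE`, `total_matches`, and `ratio` are hypothetical, and the ratios are derived here for illustration rather than reported by the paper.

```python
# name: (dynamic mW, leakage mW, total mW, area in square micrometers)
TABLE = {
    "Spec-R": (5.86, 0.86, 6.72, 67739),
    "Mask-R": (5.57, 0.67, 6.24, 55914),
    "BNR": (5.61, 0.91, 6.52, 71762),
    "BNR-S": (5.45, 0.66, 6.11, 56381),
}
DYNAMIC, LEAKAGE, TOTAL, AREA = range(4)

def total_matches(name, tol=0.005):
    """Check that total power equals dynamic plus leakage within rounding."""
    row = TABLE[name]
    return abs(row[DYNAMIC] + row[LEAKAGE] - row[TOTAL]) <= tol

def ratio(a, b, col):
    """Value of column `col` for design `a` relative to design `b`."""
    return TABLE[a][col] / TABLE[b][col]

for proposed, baseline in (("BNR", "Spec-R"), ("BNR-S", "Mask-R")):
    print(f"{proposed} vs {baseline}: "
          f"area x{ratio(proposed, baseline, AREA):.3f}, "
          f"dynamic power x{ratio(proposed, baseline, DYNAMIC):.3f}, "
          f"total power x{ratio(proposed, baseline, TOTAL):.3f}")
```

With these values, BNR is about 5.9% larger than Spec-R and BNR-S is about 0.8% larger than Mask-R, while both proposed routers have lower dynamic and total power than their respective baselines.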

6 Conclusion
In this work, we propose two NoC routers: BNR and BNR-S. With the bypass
path in the router, BNR can transfer non-conflict flits directly, which not
only reduces the NoC latency but also removes a lot of unnecessary buffer
activity. BNR-S can be seen as a simplified version of BNR. It balances the
performance benefits and area overhead by only supporting the bypass of
SPs. In the experiments with synthetic traffic, BNR shows 1.48x and 1.31x
speedups over the two baselines, while BNR-S shows 1.3x and 1.15x. Both
routers also achieve clear improvements for the real applications,
especially the memory-intensive ones. Finally, the synthesis results show
that they are more energy-efficient than the baselines.
