Received August 30, 2020, accepted September 29, 2020, date of publication October 12, 2020, date of current version October 22, 2020.


Digital Object Identifier 10.1109/ACCESS.2020.3030606

NoC²: An Efficient Interfacing Approach for Heavily-Communicating NoC-Based Systems

AHMED A. MORGAN 1,2, AHMED S. HASSAN 3, M. WATHEQ EL-KHARASHI 3,4, AND AYMAN TAWFIK 5, (Member, IEEE)


1 Department of Computer Engineering, Faculty of Engineering, Cairo University, Giza 12613, Egypt
2 College of Computer and Information Systems, Umm Al-Qura University, Makkah 21955, Saudi Arabia
3 Computer and Systems Engineering Department, Faculty of Engineering, Ain Shams University, Cairo 11517, Egypt
4 Department of Electrical and Computer Engineering, University of Victoria, Victoria, BC V8W 3P6, Canada
5 Electrical Engineering Department, Ajman University, Ajman, United Arab Emirates

Corresponding author: M. Watheq El-Kharashi ([email protected])

ABSTRACT Current research in interfacing clusters within Hierarchical Networks-on-Chip (HNoC) as well
as interfacing NoC-based systems adopts a centralized approach. In this approach, a specific Processing
Element (PE) acts as a gateway between interfacing peripherals and the rest of NoC elements. This article
evaluates this approach and shows that it is not optimal for handling the inter-NoC communication. Routing
inter-NoC traffic through a system to its gateway PE deteriorates the network performance. Results show that
both the throughput and latency of the centralized approach degrade with the increase in the inter-NoC traffic
bandwidth. To alleviate this, we propose a novel distributed approach, which separates the inter-NoC traffic
from the intra-NoC one. Our approach relies on distributed buffers to allow PEs to efficiently communicate
with the interfacing peripheral. We evaluate our approach against other interfacing ones using synthetic traffic
as well as real benchmark applications. Our evaluation covers the whole system performance as well as its
inter- and intra-NoC parts. Results prove that the proposed approach outperforms previous interfacing ones in
terms of throughput and latency. The proposed approach significantly enhances the inter-NoC performance
without any deterioration in the intra-NoC one. Considering the inter-NoC performance, we achieve a
throughput that is close to the maximum possibly attainable one. Other approaches show major performance
degradation, reaching as low as 10% of this maximum attainable throughput.

INDEX TERMS Hierarchical networks-on-chip (HNoC), inter-NoC communication, intra-NoC communication, NoC benchmarks, NoC Ethernet, NoC high-speed interfacing, NoC time division multiple access (TDMA), NoC traffic.

The associate editor coordinating the review of this manuscript and approving it for publication was Nitin Nitin.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

185992 VOLUME 8, 2020
A. A. Morgan et al.: NoC²: An Efficient Interfacing Approach for Heavily-Communicating NoC-Based Systems

I. INTRODUCTION

Networks-on-Chip (NoCs) have become widely used as a communication infrastructure for multicore and High Performance Computing (HPC) systems. Modern and future applications, which run on these systems, require continuously increasing performance. From an NoC perspective, research efforts to fulfill this performance could be categorized into two main directions. In the first direction, the underlying NoC architecture is considered as one consolidated system, such that the hardware resources are not grouped into clusters [1]. Numerous research works are presented to realize the required performance of applications running on these non-clustered architectures. Many static and dynamic techniques are proposed to customize the NoC hardware according to the target application. For example, task-to-core mapping, routing, dynamic resource management, and other techniques are extensively researched. In this article, we neither contradict nor replace previous techniques in this direction. Rather, we acknowledge that they successfully accomplish the performance requirements of many NoC applications. Nevertheless, as the number of cores and the communication among these cores increase rapidly in modern and future NoC-based systems, these techniques would struggle to fulfill the required performance of such complicated systems [2]. Managing and controlling the underlying hardware for a complex application, or multiple simultaneously running applications, would definitely be a challenge. Execution time, chip area, and power consumption budgets of these techniques would prevent many of them from being widely used with future heavily-communicating NoC-based systems.

The second research direction appears more promising and manageable for future NoC-based systems. In this direction, the underlying NoC architecture is partitioned into multiple clusters. In the literature, these Clustered NoCs (CNoCs) are also named Hierarchical NoCs (HNoCs) [3], or region-based NoCs [4]. Each cluster has a limited number of cores and could be easily managed and controlled. For a complex application, its tasks are also split into different segments. Each segment is mapped to one of the hardware clusters. Due to this task splitting, clusters are interfaced and connected to each other in order to carry out the whole application execution. Many techniques are presented to accomplish the clustering process and to customize the underlying NoC architecture to the target application. A common and key objective of these techniques is to minimize the inter-cluster traffic. This traffic minimization is intended to avoid congestion over the few inter-cluster links, to minimize the latency, and to maximize the throughput. In this article, we are not replacing these techniques and we again acknowledge the good performance they achieve. Nevertheless, most of these techniques reduce the inter-cluster traffic, but they do not eliminate it. Therefore, an open research challenge is how to properly handle this inter-cluster traffic. In other words, an efficient interfacing approach would indeed help previous techniques reach the maximum attainable performance from the underlying hardware. On the other hand, the future prediction of NoC-based systems shows a significant increase in their inter-cluster traffic [5]. This clearly emphasizes the need for the aforementioned efficient interfacing approach. However, in the literature, very few techniques are proposed to carry out the inter-cluster communication. In this article, we will show that the performance of these techniques is not optimal. Moreover, supported by our initial work in [6], we present a more efficient interfacing approach. Our approach is meant to work in conjunction with previous CNoC architecture customization techniques to properly handle their inter-cluster traffic and further enhance their performance.

Another method to realize the required increasing performance of future NoC-based systems would be through adding more computational cores. In other words, a new off-chip, or maybe off-board, NoC-based system would be connected to an already running one. Connecting an NoC-based system to another one necessitates an efficient interfacing approach to handle the communication among these systems. This interfacing approach should be taken into account during the design phase and it should further employ a standard communication protocol. Despite the importance of this challenge, it is barely studied by the NoC research community. In this article, we target this open research area. Our proposed interfacing approach could not only be used in connecting clusters within a single NoC-based system, but also in connecting different systems. In the latter case, we are building a network of NoC-based systems. Therefore, we name our interfacing approach NoC². Finally, without loss of generality, our approach employs the Ethernet standard to carry out the communication among these systems.

The rest of the paper is organized as follows. Section II defines terms that are frequently utilized throughout this article and clarifies the used terminology. Section III verifies the necessity for an efficient NoC interfacing approach. It highlights the prediction for the future inter-cluster traffic and summarizes the drawbacks of current interfacing approaches. Our contributions are then enumerated in Section IV. Section V reviews the related work. It first covers some research efforts to customize NoC architectures with minimal inter-cluster traffic. Thereafter, it surveys previous NoC interfacing techniques. Section VI presents our proposal, NoC², for interfacing NoC-based clusters, or systems.¹ Section VII explains the experimental setup employed in evaluating different NoC interfacing approaches. The section describes the used benchmark applications, the employed clustering variants, and the environment for simulation as well as hardware implementation. Section VIII shows and discusses results of evaluating different NoC interfacing approaches. Finally, the paper is concluded in Section IX.

¹ For brevity, we will only use one of the two terms, either clusters or systems. Unless explicitly stated, the discussion however applies to both.

II. DEFINITIONS AND USED TERMINOLOGY

Terms within the NoC research community might be used differently from one article to another. To avoid confusing the reader and to make our discussion clear, we herein define the terms that we use throughout this article.

• NoC architecture: The actual hardware of the NoC-based system. In this article, we assume that the architecture consists of tiles, where each tile has the following three modules:²
  1) Processing Element (PE), or core: The actual execution module, which runs a task, or tasks, of the application. In this article, PEs are drawn as black square Quad Flat Package (QFP) chips.
  2) Network Interface (NI): The module that interfaces the PE to the network. It packetizes the traffic at the source PE and de-packetizes the traffic at the destination PE. In our proposed approach, it is also responsible for communicating with the interfacing peripheral. In this article, NIs are drawn as white rectangles, with the abbreviation NI written inside these rectangles.
  3) Router: The module that executes the routing protocol. It transfers packets through the network from one tile to another. In this article, routers are drawn as circles with small arrows inside.
  ² In our proposed approach, the communication module is also considered part of the architecture.
• Non-clustered architecture: The architecture, which is not partitioned into clusters. The system could simply be envisioned as a collection of tiles.


• Clustered or hierarchical architecture (CNoC or HNoC): The architecture, which is partitioned into clusters. Each cluster has multiple tiles. The system could be envisioned as a collection of clusters, which are interfaced to each other.
• NoC interfacing: Connecting clusters of a single system or connecting separate NoC-based systems through a standard communication protocol. Without loss of generality, Ethernet is used in this article.
• Intra-NoC traffic: The traffic exchanged among PEs of the same cluster, or the same non-clustered system.
• Inter-NoC traffic: The traffic exchanged among PEs of different clusters, or systems.
• Benchmark: A standard-like NoC application, which is widely used by the NoC research community in evaluating the performance of novel proposed techniques.
• Core graph: For each benchmark, it is a graph, which represents the required PEs to execute the benchmark and the traffic exchanged among these PEs.
• Useful throughput, or goodput: The amount of traffic that is successfully delivered to its destination PE in a unit of time. Throughout this article, we drop the word "useful" and the term "throughput" is solely used to represent this goodput.
• Total throughput, or router load: The total amount of traffic passing through all ports of a router in a unit of time. This includes useful traffic as well as overheads. To avoid any confusion with the goodput, we use the term "router load" throughout this article to represent this total throughput.
• Peak throughput: The maximum throughput that could theoretically be achieved. It is calculated using a zero-delay model and assuming a congestion-free NoC with unlimited buffers. It represents a ceiling of the throughput against which the performance of different interfacing approaches is compared.

III. NECESSITY OF EFFICIENT NoC INTERFACING APPROACHES

In this section, we aim to clearly explain the motivations behind our work. First, we discuss the expected increase in the inter-NoC traffic, which necessitates an efficient interfacing approach to handle it. Thereafter, we explore the drawbacks of current interfacing techniques, which encourage us to present a more efficient one.

A. PREDICTION OF FUTURE INTER-NoC TRAFFIC

NoC has become the main communication backbone within modern multicore systems. The number of cores within these systems increases rapidly. The Graphical Processing Unit (GPU) industry first led this trend by introducing cards with thousands of cores. Nickolls and Dally showed that the number of cores within modern GPUs increases according to Moore's rate, such that it doubles every 18 months [7]. In recent years, a significant increase in the number of cores has also occurred in the embedded domains, general-purpose processors, and Application-Specific Integrated Circuits (ASICs) [8]. As mentioned in Section I, these cores are partitioned into clusters that are interfaced to each other. Consequently, the amount of inter-NoC traffic is significantly increasing and it should be properly handled.

Manevich et al. analyzed and modeled the inter-NoC traffic resulting from interfacing clusters within modern NoC-based systems [5]. Their work is based on a Rentian traffic model [9]. They further considered two modeling scenarios. The first is an optimistic one of lightly communicating clusters, whereas the second is a worst-case pessimistic scenario of clusters that exchange high traffic volumes. As the number of cores doubles, they predicted that the inter-NoC traffic would increase by 63.3% and 77.2% for lightly and heavily communicating clusters, respectively. The authors further indicated that current interfacing techniques would soon suffer from network congestion and saturation, which directly affect the throughput, latency, and power consumption. Combined with the aforementioned Moore's rate observation, the inter-NoC traffic would double approximately every 20.3 months in the worst case. Indeed, this is a significant increase in the inter-NoC traffic, which necessitates an efficient interfacing approach to handle it.

B. DRAWBACKS OF CURRENT INTERFACING TECHNIQUES

In the literature, few techniques are presented to carry out the interfacing process. Most of these techniques deploy the same approach. In this article, we call it the centralized approach. In this approach, only one centralized module, i.e., a PE or an NI, is used as a gateway between all cores in its system and the Communication Peripheral Controller (CPC) [10], [11]. An analogous approach is presented by Dorai et al. to alleviate the interfacing burden from PEs and NIs [12]. As such, no centralized gateway PE, or NI, is used and cores within the NoC-based system are allowed to access the CPC in a Time-Division Multiple Access (TDMA) fashion. While this approach is not centralized around the gateway PE, or NI, it is still centralized around the TDMA controller. In summary, the centralized approach suffers from four main drawbacks.

1) Being centralized, the gateway module becomes a bottleneck. As the traffic increases, this module constitutes a hotspot in the system, degrading the overall system performance.
2) The centralized approach requires that the inter-NoC traffic, from all cores within the system, be first routed to its gateway module. Thereafter, the traffic goes from this module to the interfaced system. Routing the inter-NoC traffic through the system raises the underlying network load, overwhelms its resources, increases its overall latency, and probably pushes it into congestion.
3) Once a new NoC-based system is to be interfaced using the centralized approach, the gateway module as well as its running software should be modified. This


makes the interfacing process more difficult and time consuming.
4) The centralized module might serialize the accesses to the CPC. For example, when using TDMA, some inter-NoC traffic would stall awaiting its designated time slots. This stall consequently increases the overall latency and degrades the system performance.

IV. CONTRIBUTIONS

Motivated by the future prediction of inter-NoC traffic and the challenges of current interfacing techniques, we aim to present an efficient approach to handle the inter-NoC traffic and overcome the drawbacks of the centralized approach. Accordingly, our proposed NoC² approach eliminates the centralized gateway module in order to avoid creating hotspots in the system. It allows direct and concurrent access to the CPC in order to reduce the overall latency as well as the congestion probability. It requires no modifications to the running software in order to make interfacing NoC-based systems much easier.

As our approach is proposed to efficiently handle the inter-NoC traffic, it mainly targets two important NoC metrics, the throughput and the latency. Consequently, the performance of our approach is evaluated against that of the non-clustered, the centralized, and the TDMA techniques, in terms of these two metrics. On the other hand, the overheads associated with the approach should be tolerable. In other words, our approach should be customized to minimize its impact on the NoC design budget. Buffers are shown to be responsible for most of the design area and power consumption [13]. Therefore, our approach employs as many buffers as those used by the centralized approach. Nevertheless, these buffers are re-distributed throughout the network more wisely.

Our approach is evaluated using simulation as well as hardware implementation on FPGA. The evaluation is conducted using synthetic traffic and real benchmark applications. For each employed benchmark, its cores are partitioned into two clusters. These clusters are then interfaced to each other. Thereafter, both inter- and intra-NoC communication performances are assessed, in terms of throughput and latency. As the performance of the overall system would be dependent on the used clustering technique, we consider two clustering variants. The first variant maximizes the inter-NoC traffic in order to judge the efficiency of different interfacing approaches in the case of heavily-communicating systems. This variant would therefore evaluate whether a hotspot is created or congestion occurs. The second clustering variant yields the minimum inter-NoC traffic, and hence, the maximum intra-NoC one. This variant would consequently assess the effect of re-distributing NoC buffers on the intra-NoC performance. To this end, our contributions are twofold:

1) Presenting a novel approach, NoC², for interfacing NoC-based clusters or systems. Our approach does not overload the network in terms of buffers. It further enhances inter-NoC communication performance without affecting the intra-NoC one. Once an NoC-based system is designed according to our approach, interfacing it to a similar system would require modifying neither the cores nor the software running on them.
2) Evaluating different NoC interfacing approaches using synthetic traffic and real NoC benchmark applications through simulation as well as hardware implementation.

V. RELATED WORK

In this section, we review some of the related research efforts. First, in Subsection V-A, techniques that customize the NoC architecture with minimal inter-NoC traffic are surveyed. As mentioned in Section I, our proposed approach is to work in conjunction with these techniques to better enhance the performance of NoC-based systems. Thereafter, previous interfacing techniques are discussed in detail in Subsection V-B. Those techniques are the ones that we consider in our comparison in Section VIII to verify the efficiency of our proposed interfacing approach in handling the inter-NoC communication.

A. NoC CUSTOMIZATION TECHNIQUES WITH MINIMAL INTER-NoC TRAFFIC

Numerous techniques were presented to realize the best performance for NoC-based systems. Most of these techniques could be used with clustered and non-clustered architectures. A main objective of these techniques is to achieve a better traffic localization, such that the inter-NoC traffic is minimized [14]. As discussed in Section III, this traffic localization would not be easily achieved with the continuous increase in the number of cores and the traffic of future NoC-based systems. In summary, the most widely used NoC customization techniques are optimal task-to-core mapping, dynamic architecture reconfiguration, adaptive routing, traffic shaping and regulation, virtual channel partitioning, dynamic resource management, and network coding. Several of these techniques could be combined to achieve better performance. In the following paragraphs, we shed some light on these techniques.

Optimizing the task-to-core mapping is probably the most widely used technique for enhancing the performance of NoC-based systems. In this technique, mapping the application tasks onto the architecture cores is optimized for a certain metric, such as latency, throughput, power consumption, or reliability. For example, Xiao et al. proposed a new methodology, based on a Dynamic Application Dependency Graph (DADG), to optimize the clustering process [15]. The objectives of this clustering process are to minimize the inter-NoC traffic and to balance the workload between different clusters. First, the DADG is automatically generated from the application using a Low Level Virtual Machine Intermediate Representation (LLVM IR) compiler. Thereafter, a topological sort algorithm is used to map a maximum of one thread cluster to every core within the NoC, such that


the two aforementioned objectives are realized. The work is extended in [16] by considering heterogeneous systems and incorporating machine learning techniques. The target systems consist of CPUs, GPUs, and Hardware Accelerators (HWAs). At compile time, Neural Networks (NNs) are used to transform the application into a DADG and extract patterns within the generated graph. Thereafter, a Community Detection (CD) algorithm is used to partition the DADG into communities. At runtime, a dynamic resource manager intelligently maps the resultant communities onto the proper hardware category, i.e., CPUs, GPUs, or HWAs. The manager further maps the tasks within each community onto the cores of the target hardware category. The proposed framework proves useful in achieving a good tradeoff between programmability, efficiency, and energy consumption. In general, readers interested in enhancing the performance of NoC-based systems using optimal task-to-core mapping are referred to the surveys presented in [17]–[19].

Dynamic architecture reconfiguration is another widely adopted technique for enhancing the performance of NoC-based systems. In this technique, the underlying architecture is reconfigured at runtime according to the executed application. For example, Hollis et al. changed the topology and the characteristics of a regular mesh to dynamically fit the traffic requirements of the launched application [20]. The topology adaptation is built on the small world phenomenon [21]. As such, skip-links, or long-range links, are inserted to connect heavily-communicating cores. Experimental results show that the proposed technique successfully reduces the latency and the energy consumption. Xue and Bogdan proposed a general mathematical framework to formulate the dynamic reconfiguration of NoC-based systems [22]. A greedy algorithm is further presented, based on this formulation, to adaptively reconfigure the NoC for lower latency and power consumption. In general, readers interested in enhancing the performance of NoC-based systems using dynamic architecture reconfiguration are referred to the survey presented in [23]. Finally, it is worth mentioning that one problem with the dynamic reconfiguration technique is that it might only work with reconfigurable hardware, such as FPGAs, rather than all types of NoC-based systems.

Using adaptive routing is another widely used technique for enhancing the performance of NoC-based systems. In this technique, packets are adaptively routed through the network in order to avoid creating hotspots within it. For example, Qian et al. proposed a hub router and a deadlock-free adaptive routing protocol to enhance the performance of Express Virtual Channel (EVC) NoCs [24]. Presented results show a reduction of up to 80% in the latency with a power consumption overhead of only 15%. In general, readers interested in enhancing the performance of NoC-based systems using adaptive routing are referred to the survey presented in [25].

The techniques of traffic shaping and regulation enhance the performance of NoC-based systems by controlling the traffic injection. For example, Lu and Yao presented a dynamic traffic regulation approach to realize the target performance with the minimum buffering requirements [26]. Li and Louri used supervised learning techniques to analyze, predict, and consequently adjust the injected traffic into the network [27]. Traffic shaping and regulation techniques prove useful in enhancing the system performance as well as avoiding network congestion. However, some packets typically suffer from extra delay awaiting the proper network conditions to be injected.

The technique of virtual channel partitioning enhances the performance of NoC-based systems by controlling the assignment of available virtual channels. For example, Jindal et al. presented two virtual channel assignment strategies: Augmented Virtual Channel (AugVC) and Output port Directed Virtual Channel (ODVC). The authors also reused trace buffers, which are intended for debugging, to further enhance the performance [28]. Experimental results prove that the presented strategies successfully reduce the latency. Bose and Ghosal combined virtual channel assignment with a novel flow control strategy. A new router design is presented to carry out the assignment of virtual channels and realize the new flow control strategy [29]. Results show a reduction in the area, the energy consumption, the latency, and the packet drop rate. Partitioning the virtual channels between CPU and GPU traffic for heterogeneous systems is considered in [30]. Lee et al. proposed a feedback-directed virtual channel partitioning mechanism to efficiently share the NoC bandwidth between CPU and GPU cores. Experimental results reflect an increase in the system throughput.

The technique of dynamic resource management enhances the performance of NoC-based systems by optimally allocating the network resources, such as switch crossbar, buffers, and links. For example, Fang et al. enhanced the performance of heterogeneous NoC by intelligently allocating buffering resources [31]. Results show a 55.47% reduction in the latency and a 21% reduction in the energy consumption. Qian et al. adaptively changed the direction of links according to the runtime traffic [32]. A novel router architecture is presented to support this dynamic link customization. Li and Louri used supervised learning techniques to dynamically customize the switch crossbar, the buffer organization, and the routing protocol [27]. The proposed architecture manages to increase the throughput by 28% and to decrease the latency and the power consumption by 24% and 19%, respectively.

Finally, network coding is used to enhance the performance of NoC-based systems. For example, Xue and Bogdan used network coding techniques to combine multiple packets into a larger one [33]. The work is intended for multicast traffic. The proposed scheme encapsulates cooperation units, a corridor routing algorithm, and an adaptive flit dropping algorithm to avoid network congestion. Results prove the efficiency of the proposed scheme in reducing the utilization of links and increasing the overall system throughput.

B. NoC INTERFACING TECHNIQUES

Few techniques are presented to properly handle the traffic between interconnected NoC-based systems and efficiently


interface them. In [10], Wasicek et al. proposed an NoC-based Ethernet gateway to interface Multi-Processor Systems-on-Chip (MPSoCs). In their proposal, as shown in Figure 1, inter-NoC communication is handled by implementing the gateway logic on a dedicated PE. This PE has an off-system Ethernet bridge attached to it. The gateway PE treats the intra- and inter-NoC traffic differently. It consequently has to convert message formats between the NoC of its own system and that of the interfaced one. Furthermore, due to the use of a centralized gateway, Wasicek's proposal suffers from the drawbacks of the centralized interfacing approach, as mentioned in Subsection III-B. In short, these drawbacks are creating hotspots in the system, exhausting intra-NoC bandwidth in routing inter-NoC traffic, and necessitating a continuous modification to the gateway PEs or the software running on them.

FIGURE 1. Interfacing NoC-based systems using centralized PEs and Ethernet bridges [10].

Nejad et al. also proposed interfacing NoC-based systems using off-chip bridges [11]. However, as shown in Figure 2, their proposal relies on an Ethernet CPC connected to a dedicated NI, rather than a PE. Accordingly, PEs of a certain system, which wish to communicate with those of other systems, have to address this gateway NI. The intended recipient address should be encoded in the message payload. From the inter-NoC communication perspective, this approach is similar to that of Wasicek. The gateway NI and the bridge are centralized points of communication that could constitute hotspots and prevent the interfacing process from being done optimally. Moreover, PEs of a certain system should be aware of the structure of interfaced systems in order to be able to encode their addresses within messages. This indeed requires modifications whenever a new system is connected. Such a modification problem limits the portability of Nejad's proposal and prevents it from being widely used.

FIGURE 2. Interfacing NoC-based systems using centralized NIs and Ethernet bridges [11].

Dorai et al. proposed interfacing multi-FPGA NoC-based systems using GTX transceivers [12]. As shown in Figure 3, their proposal embeds a GTX transceiver within each NoC-based system. PEs of each system access this GTX transceiver through TDMA control logic. Paths for the inter-NoC traffic, which are drawn in red, are separated from those of the intra-NoC one, which are drawn in black. Although Dorai's proposal is not as centralized as those of Wasicek and Nejad, it still has four drawbacks. First, the TDMA controller constitutes a centralized module in the system and consequently suffers from all associated disadvantages. Second, identifying whether the transmitted packets represent inter-NoC traffic or intra-NoC traffic is still the responsibility of PEs. This accordingly consumes part of the computation power of these PEs and reduces their performance. Third, the required TDMA time slices would exhibit a quadratic growth with the increase in NoC dimensions. For large-scale NoC-based systems, this indeed would end up with many collisions between these TDMA time slices. Fourth, Dorai et al. do not specify how to configure their TDMA time slices. On one hand, these slices could be simply configured in a Round Robin (RR) manner. This could not guarantee the optimal performance under unbalanced inter-NoC traffic. In this article, we evaluate this TDMA-RR approach. Obtained results show that it sometimes gives worse performance than the centralized approach. On the other hand, time slices could be dynamically configured and tuned according to the shape of the inter-NoC traffic using a Traffic Analysis and Queue Prioritization (TAQP) module [34]. This of course would add more complexity to the system and consequently increase its overall latency and power consumption. Alternatively, a static Weighted Scheduling (WS) TDMA could be used. In this TDMA-WS approach, time slices are statically configured according to the bandwidth requirements of each PE. For any application whose inter-NoC communication bandwidth is known, this TDMA-WS emulates the results of using the dynamic TAQP module. Therefore, the TDMA-WS approach is also considered for our evaluation in this article.

FIGURE 3. Interfacing NoC-based systems using GTX transceivers [12].

In conclusion, previous NoC interfacing approaches could not ensure the optimal handling of inter-NoC communication.
modifying these PEs or the software running on them, when a In this article, we aim to fill this open research gap by


presenting a more efficient interfacing approach, NoC². In our approach, NIs are responsible for handling the inter-NoC traffic, not the PEs. To avoid creating a centralized hotspot in the system, the presented approach employs a distributed buffering method, where a dedicated buffer connects each individual NI to the CPC. Furthermore, the intra-NoC bandwidth is also not consumed in routing the inter-NoC traffic to a gateway PE or NI. Finally, connecting new NoC-based systems to one built using our approach does not require modifying any PEs or the software running on them.

VI. NoC²: AN EFFICIENT NoC INTERFACING APPROACH
NoC² is a novel approach for interfacing NoC-based systems and efficiently handling their inter-NoC traffic. The main characteristics of NoC² could be summarized as follows.
• It connects NoC-based systems via an interconnection standard. Any standard could be used with our approach. Without loss of generality, an Ethernet standard is employed in this article. Using a standard communication protocol accomplishes three main advantages. First, re-using a widely known and well tested standard facilitates and speeds up the generation of new NoC-based systems, which supports productivity and time-to-market. Second, it ensures the portability and compatibility with currently deployed Systems-on-Chip (SoC)-based systems. Third, the evolution of these NoC-based systems would be automatically realized by the evolution of the standard itself.
• NoC² requires no modifications to PEs or the software running on them. Once an NoC-based system is designed using our NoC² approach, PEs within the system, as well as the software running on them, are completely detached from the inter-NoC traffic management.
• NoC² is a general-purpose approach. It puts no restrictions on the interfaced systems, other than using the same communication standard. As will be explained in Subsection VI-C, any new system could dynamically be interfaced to an already running one after completing a handshaking process.
• In order to employ our approach within any NoC-based system, a communication module should be integrated into that system during the design time. This module uses distributed FIFO buffers to efficiently handle the inter-NoC traffic. Therefore, we call it Inter-NoC FIFO (ICFIFO). ICFIFO will be discussed in detail in Subsection VI-A.

Figure 4 shows the interconnection model of our NoC² approach. As shown in the figure, ICFIFO is considered part of the NoC architecture, not just a secondary device attached to a gateway PE. The design of PEs and routers within each tile of the architecture is not affected. Nevertheless, NIs are designed to directly communicate with ICFIFO, instead of routing flits around the network to the gateway PE. NIs are responsible for routing the intra-NoC traffic to the routers within their own tiles and the inter-NoC traffic directly to ICFIFO. Links between NIs and ICFIFO are bidirectional. Moreover, each NI has dedicated buffers inside ICFIFO to allow for simultaneous transmission and reception. As such, the inter-NoC traffic always has a deterministic single hop from the NI to ICFIFO.

FIGURE 4. NoC² interconnection model.

The system-level representation in Figure 4 shows a non-uniformity in the delay of NI/ICFIFO links. A concern that might consequently come to the mind of the reader is whether farther NIs could accomplish the transfer within the single cycle duration. Despite the soundness of this concern and the fact that the actual delays could only be extracted after the floorplanning, the following points verify that this single cycle operation would often be met.
• Previous studies analyzed the delays of global long-range links, based on the small world phenomenon [20], [21]. For most NoC-based systems with typical frequency ranges, these studies confirmed that the delay of these global links would fit within the single clock cycle. They further employed these long-range links to enhance the average latency of the whole system.
• In the actual implementation of the system, the floorplanning and the clock distribution could be adjusted to ensure the single cycle operation and to realize comparable delays over NI/ICFIFO links [35].
• Modern techniques are presented to ensure a single cycle transfer between distant tiles. For example, wireless links are used in [36], optical links are used in [37], and 3D technologies are used in [38]. These techniques could be integrated with our approach to accomplish the required single cycle operation.

In case none of the previous points is applicable, pipelining or the Globally Asynchronous Locally Synchronous (GALS) technique could be used for NI/ICFIFO links. Consequently, the delay of each link would be upscaled according to its length. Even with these extra delays, NoC² would still outperform previous interfacing techniques, due to the removal of arbitration, scheduling, and contention delays through routers. To verify the efficiency of our proposed approach for different implementation scenarios, we employ pipelined NI/ICFIFO links in all our simulation results. The number of cycles to traverse each of these links is calculated according to the ratio of its length to that of the inter-router one.
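The pipelined-link latency described above can be illustrated with a minimal sketch. The function name, the rounding-up of the length ratio, the one-cycle minimum, and the example link lengths are all assumptions made for illustration; the text only states that the cycle count follows the ratio of the NI/ICFIFO link length to the inter-router link length.

```python
import math


def link_cycles(link_length: float, inter_router_length: float) -> int:
    """Cycles needed to traverse a pipelined NI/ICFIFO link.

    Assumption: the length ratio is rounded up and every link costs at
    least one cycle. The article only says the count follows the ratio.
    """
    if link_length <= 0 or inter_router_length <= 0:
        raise ValueError("lengths must be positive")
    return max(1, math.ceil(link_length / inter_router_length))


# Hypothetical link lengths, expressed in units of one inter-router link.
example_links = {"near NI": 0.5, "mid NI": 2.0, "far NI": 3.25}
example_cycles = {
    name: link_cycles(length, 1.0) for name, length in example_links.items()
}
```

Under these assumptions, an NI located close to ICFIFO keeps the single-cycle transfer, while farther NIs pay a few extra cycles that grow with distance.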


A. ICFIFO STRUCTURE
ICFIFO is responsible for connecting PEs within its local system to those of another interconnected one. It queues the packets from its local system and initiates the CPC to transmit them. It also receives packets from the other interconnected system through the CPC and writes them into their appropriate NI buffers. Figure 5 shows the inputs, the outputs, and the structure of ICFIFO.

FIGURE 5. Internal structure of ICFIFO.

1) INPUTS
The inputs of ICFIFO could be summarized as follows.
• Intra-NoC inputs: ICFIFO has as many intra-NoC local inputs as the number of tiles within its own system. Each NI has a dedicated input for packet transmission, TxL. The traffic arriving on each of these inputs is written into its associated buffer, BTx.
• Inter-NoC input: ICFIFO has an inter-NoC global input, RG, which comes from the interfaced system. The traffic arriving on that input is written via the CPC into the appropriate buffer, BRx.

2) OUTPUTS
The outputs of ICFIFO could be summarized as follows.
• Intra-NoC outputs: ICFIFO has as many intra-NoC local outputs as the number of tiles within its own system. Each NI has a dedicated output for packet reception, RxL. The traffic transmitted through each of these outputs comes from its associated buffer, BRx.
• Inter-NoC output: ICFIFO has an inter-NoC global output, TG, which goes to the interfaced system. The traffic transmitted through this output comes from an NI buffer, BTx, via the CPC.

3) STRUCTURE
The main components of ICFIFO could be summarized as follows.
• The CPC, which constitutes the interfacing peripheral. An Ethernet CPC is used in this article.
• The controller, which invokes the main functionality of the ICFIFO and controls the CPC.
• The Register File (RF), which holds the status of the ICFIFO and is used as temporary storage for any internal operation.
• The bus, which constitutes the communication medium between different components of ICFIFO.
• The FIFO buffers, which hold packets until they are transmitted by the CPC or consumed by NIs. Each NI in the NoC-based system is associated with two FIFO buffers. One is for packet transmission, TxL, and the other is for packet reception, RxL. This is done to allow our ICFIFO to communicate with multiple NIs simultaneously and prevent it from becoming a bottleneck in the system. Compared to the centralized approach, ICFIFO could add its own extra buffers. However, this would increase the area and power consumption budgets of the underlying network. Therefore, in this article, we rather relocate some buffers from each NI/router interface to its corresponding NI/ICFIFO interface. As such, the overall number of buffers remains unchanged. For example, if a certain NI/router interface has a buffer size of four packets, we shorten this buffer to two packets only. The saved buffering is then used within ICFIFO for the same NI.

B. NI IMPLEMENTATION
In NoC², the NI is the module that controls the path for intra- and inter-NoC traffic. This relieves PEs of the burden of controlling these paths. Our approach employs a typical NI with two slight modifications. First, the controller within the NI is changed to properly adjust the path of the output traffic. Second, a special-purpose register is added to every NI in the system to continuously hold the total number of PEs within all interconnected NoC-based systems. Therefore, we call this register the PE Total Number (PETN) register. Once the total number of PEs is modified by connecting new systems or disconnecting old ones, the ICFIFO would send a control signal to each NI in its local system with the new number of PEs. Accordingly, each NI would update its internal PETN register.

C. NoC² OPERATION
1) PROCEDURE FOR INTERFACING NEW SYSTEMS
At startup, the PETN register of each NI is initialized with the total number of PEs within its local system. When another NoC-based system is connected, a handshaking process takes place between the two systems. The ICFIFO of each system transmits a packet to the other one with the total number of its associated PEs. Upon receiving this packet, the other ICFIFO sends a control message to its local NIs, containing the new value of the PETN register. NIs update their registers accordingly. Finally, each ICFIFO sends an


acknowledgment signal to the other one in order to terminate the handshaking process and conclude the startup phase. Inter-NoC communication could then take place safely.

2) EGRESS COMMUNICATION
When a PE transmits a flit, the NI examines its header. If the destination ID matches a local PE within its own system, the NI carries out a normal intra-NoC transmission. Otherwise, it queues this inter-NoC traffic into its corresponding ICFIFO buffer, BTx. Consequently, the ICFIFO dequeues the buffer and triggers the CPC for packet transmission. Figure 6 shows an example of the egress communication. In this example, PE #3 sends a flit, which is colored red, to PE #8 and another flit, which is colored blue, to PE #10. The sender PE is not concerned whether these destination PEs are intra- or inter-NoC ones. Instead, the NI examines each flit header and properly routes it to its corresponding destination. In more detail, the destination ID for the red flit matches the ID of a local PE. Therefore, the NI transfers the flit to the corresponding router to issue a normal intra-NoC transmission. On the other hand, the blue flit is destined to a PE within the interfaced system. Thus, the NI writes this flit into its ICFIFO buffer, BT3. Consequently, the ICFIFO controller triggers a CPC transmission. The CPC encapsulates multiple flits into a single packet, according to the communication protocol standard. This packet is finally sent through the global output, TG, to the interfaced system.

FIGURE 6. An example of egress traffic management.

3) INGRESS COMMUNICATION
When a packet is received by the CPC from the interfaced system, it is split into flits, according to the communication protocol standard. The ICFIFO controller then examines the header to figure out the destination PE. The traffic is consequently written into the appropriate NI buffer, BRx. When an NI finds that its Rx buffer becomes non-empty, it forwards the flits from the ICFIFO to its attached PE. Figure 7 shows an example of the ingress communication. In this example, the ICFIFO receives a packet from the interfaced system through the global input, RG. The packet is split into flits and the header is checked by the ICFIFO controller. As the destination is found to be PE #2, the flits are enqueued into its associated ICFIFO buffer, BR2. NI #2 finally dequeues these flits from the buffer to its corresponding PE.

FIGURE 7. An example of ingress traffic management.

VII. PERFORMANCE EVALUATION: EXPERIMENTAL SETUP
As mentioned in Section IV, our approach is evaluated using synthetic traffic as well as real benchmark applications. The evaluation is done using simulation and hardware implementation. For each considered benchmark, PEs are clustered into two sub-systems. These sub-systems are interfaced to each other using the centralized approach, the two TDMA approaches, and the NoC² approach. The performance of these approaches is then compared to one another. As such, in the following subsection, we start our discussion by introducing the considered benchmarks. Thereafter, Subsections VII-B and VII-C describe the employed evaluation environment for simulation and hardware implementation, respectively. For our hardware implementation, we employ an FPGA that runs at 16 MHz, whereas these benchmarks run at 1000 MHz. Therefore, we have to scale down the traffic exchanged between PEs of these benchmarks. Subsection VII-D explains this scaling process in detail. Finally, in Subsection VII-E, we discuss the two clustering variants that are considered for evaluating different NoC interfacing approaches.

A. BENCHMARK APPLICATIONS
Many workload modeling methodologies were presented in the literature to generate realistic and scalable benchmarks [39]. Considering the limited size of the employed FPGA, three widely-used NoC benchmark applications are adopted from the literature to evaluate different NoC interfacing approaches in this article. In our experimental work, we will further use an increasing number of PEs and synthetic traffic to evaluate the performance of our approach against the increase in the inter-NoC traffic. The three considered benchmarks are the Video Object Plane Decoder (VOPD) benchmark [40], [41], the Moving Picture Experts Group 4 (MPEG4) decoder benchmark [42]–[44], and the Multi Window Display (MWD) benchmark [43]–[45]. Figure 8 shows the PEs and the bandwidth requirements of each considered benchmark. The VOPD benchmark, as shown in Subfigure 8a, has 16 PEs and an aggregate bandwidth of 3,574 Megabits per second (Mbps). The MPEG4 benchmark, as shown in Subfigure 8b, has 12 PEs and an aggregate


bandwidth of 3,466 Mbps. The MWD benchmark, as shown in Subfigure 8c, has 9 PEs and an aggregate bandwidth of 1,184 Mbps.

FIGURE 8. Core graph of the employed benchmark applications (the bandwidths, in Mbps, are written on arrows; the ID of each PE is written inside it beneath its name).

B. SIMULATION ENVIRONMENT
The extended NoCTweak simulator [46], [47] is used for all our simulation experiments in this article. To the best of our knowledge, this simulator is the only one that allows NoC-based systems to be interfaced to each other with different degrees of inter-connectivity. Accordingly, in our experiments, we first partition every benchmark into two clusters. These clusters are instantiated as standalone systems in the simulator. These systems are then interconnected using the centralized as well as NoC² approaches. Unfortunately, the extended NoCTweak does not support the TDMA approach. Accordingly, in this article, this approach is only evaluated through hardware implementation. PEs within each of the two interfaced systems are mapped onto a mesh topology, because it is the only topology that is supported by the simulator. Furthermore, mapping PEs onto this mesh is done using the high-performance bandwidth-oriented mapping heuristic, NMAP [41]. This mapping heuristic is again the one employed by the simulator.

C. HARDWARE IMPLEMENTATION ENVIRONMENT
Figure 9 shows the hardware environment used in our evaluation. For each benchmark, the clusters, which result from the partitioning process, are emulated by traffic generators that run on two hardware modules. These modules are an Artix-7 35T FPGA [48] and a Linux-based PC. The two modules are interconnected using an Ethernet link. Due to the size limitation of the employed FPGA, it could maximally emulate six PEs. Therefore, in our partitioning process, we restrict the size of one cluster to six. This resultant 6-PE cluster is executed on the FPGA. The other cluster, which contains the remaining PEs, is executed on the PC. The PC also runs Wireshark [49] for capturing the inter-NoC Ethernet traffic. In order to be consistent with our simulation environment, PEs of each cluster are mapped using the NMAP heuristic onto a mesh topology. Figure 9 shows an example of how the MWD benchmark is executed on the employed hardware environment. The benchmark is partitioned into 6-PE and 3-PE clusters. The 6-PE cluster is executed on the FPGA, whereas the 3-PE one is executed on the PC.

D. BANDWIDTH SCALING
Benchmark applications have communication bandwidth requirements in Mbps, with PEs running at 1000 MHz. In our experiments, PEs are emulated by traffic generators. Based on the employed FPGA, these traffic generators run at 16 MHz. Therefore, the original bandwidth of each benchmark should be scaled down by a factor of 64. This value of 64 comes from rounding up the result of dividing the benchmark execution frequency by that of our FPGA implementation to the nearest power of two, 2^n, i.e., 1000/16 = 62.5, and the nearest 2^n is 64. Accordingly, the down-scaled bandwidth that is actually generated


by our traffic generators is represented as

$$B_{scaled} = \frac{B_{original}}{64} \tag{1}$$

where B_scaled is the bandwidth that is practically produced by our traffic generators and B_original is the original bandwidth of the benchmarks, as shown in Figure 8. Accordingly, for each benchmark, the traffic generator corresponding to every source PE will inject this scaled bandwidth into the network. For an idealistic scenario, in which this injected traffic experiences no delays in its path, the same bandwidth will be received by the destination PE. This represents the peak throughput, as defined in Section II. This idealistic scenario would indeed never happen practically, due to different types of delays in the network. However, the peak throughput is included in our evaluation results to represent the maximum attainable performance. The closer the throughput of an approach to this peak value, the better its performance is.

FIGURE 9. Hardware implementation environment and an example of where the two clusters of the MWD benchmark are executed.

In order to be consistent throughout the article, we present all our throughput results in Mbps. However, to accurately describe our implementation, it should be noted that our traffic generators actually transmit packets, not bit streams. Each packet is composed of 64 bytes. Accordingly, the aforementioned scaled bandwidth, in (1), is segmented into packets. In other words, the scaled bandwidth is divided by 8 to transform it into bytes and the result is further divided by 64 to convert it into packets. Moreover, in order to control and harmonize packet transmission over the execution period, we run each benchmark in a loop that executes 512 iterations per second. Transmitted packets per second should be further divided by 512 to get the number of packets transmitted in every loop iteration. As the repeated division might produce fractions, the resulting number is finally rounded up to the nearest integer. Consequently, the number of packet transmissions that a traffic generator will issue per loop iteration is calculated as

$$N_{TXiter} = \left\lceil \frac{B_{scaled}}{8 \times 64 \times 512} \right\rceil \tag{2}$$

where N_TXiter is the number of packets transmitted in a single loop iteration and B_scaled is the scaled bandwidth, as expressed by (1), taken in bits per second with 1 Mbps = 1024 × 1024 bits per second. The following example numerically clarifies the scaling process. An original benchmark bandwidth of 64 Mbps is first scaled down to 1 Mbps. This 1 Mbps is further divided into 2048 packets of 64 bytes each. Considering 512 loop iterations per second, 4 packets are actually transmitted every iteration. Based on this scaling, two points could be easily deduced. First, due to the ceiling process, the practically injected traffic might be slightly higher than the scaled bandwidth, as expressed by (1). However, it should be emphasized that this difference is too minor to affect the correctness or the fairness of our evaluation. Second, the smallest bandwidth that could be produced by any traffic generator is that corresponding to a single packet. This could be calculated as (1 × 512 × 64 × 8) / (1024 × 1024) = 0.25 Mbps of scaled traffic. In turn, this corresponds to an original bandwidth of 0.25 × 64 = 16 Mbps. Any bandwidth lower than this 16 Mbps will consequently be treated as a 16 Mbps one. Nevertheless, as shown in Figure 8, we only face this case twice in the MPEG4 benchmark. Again, the correctness of our evaluation is not affected by this insignificant change.

Before concluding this subsection, we need to clarify two more points. First, our simulation is conducted using the original bandwidth, rather than the scaled one. Second, the inter-NoC traffic would specifically be higher than the scaled one, as expressed by (1). This will occur irrespective of the ceiling process and even for our idealistic scenario. It actually happens due to the fact that Ethernet adds 16 additional bytes as control overhead. These additional bytes result in an Ethernet packet length of 80 bytes. Therefore, inter-NoC traffic is specifically scaled up by a factor of 80/64. For example, considering our idealistic scenario of no delays and for a scaled bandwidth of 1 Mbps, the peak observed bandwidth over the Ethernet link would be 1.25 Mbps.

E. CLUSTERING VARIANTS
As mentioned at the beginning of this section, PEs of each benchmark are clustered into two sub-systems, which are then interfaced to each other. In our experiments, we considered two variants for this clustering process. The first variant minimizes the inter-NoC traffic, and hence, maximizes the intra-NoC one. In contrast, the second variant maximizes the inter-NoC traffic, and hence, minimizes the intra-NoC one. Throughout this section, we call these two variants the optimal and the worst clustering, respectively. The rationale behind using these two clustering variants is to ensure that NoC² would perform well for any application with low or high inter-NoC communication requirements. On one extreme of a high inter-NoC traffic, we aim to ensure that NoC² achieves the best performance with respect to other interfacing approaches. On the other extreme of a high intra-NoC traffic, we aim to ensure that re-allocating some buffers from each NoC to ICFIFO would not degrade the performance of any of the two interfaced systems.

Both clustering variants are realized using the ParMETIS graph partitioning package [50]. ParMETIS automatically generates the optimal clustering by minimizing the total bandwidth, i.e., weight, of the cut edges. For the worst clustering, we negate all the bandwidths and again run ParMETIS for the minimal total weight. This consequently


generates the worst clustering. For each benchmark, the resultant optimal and worst clustering variants are shown in Figure 10. It is worth mentioning that the figure represents the scaled bandwidths, rather than the original ones. Table 1 summarizes the original and the scaled Ethernet bandwidths of the three considered benchmarks. The figure and the table clearly show that the optimal clustering of the three benchmarks has low inter-NoC traffic and high intra-NoC traffic. This is reversed for the worst clustering variant.

FIGURE 10. Optimal and worst clustering of the three considered benchmarks (bandwidth is the scaled one in Mbps).

VIII. EXPERIMENTAL RESULTS
In this section, we assess the performance of different NoC interfacing approaches using synthetic traffic as well as the three considered benchmarks. As mentioned in Section IV, our NoC² approach is mainly proposed to enhance the throughput and the latency. Therefore, we present many results to extensively evaluate different interfacing approaches with respect to these two metrics. However, due to the importance of power and area to any NoC-based design, we conclude this section by quickly verifying the efficiency of our approach with respect to these two latter metrics.

Throughout this section, the optimal as well as worst clustering variants are considered for evaluation. In all our experiments, the simulation and the hardware implementation results are found to be consistent with each other. Therefore, for the sake of brevity, we present only one of them. First, in Subsection VIII-A, we show and discuss our simulation results for evaluating the overall performance of the whole interconnected systems. This overall performance is then split into its inter-NoC and intra-NoC parts. The two parts are discussed in Subsections VIII-B and VIII-C, respectively. Unlike the


overall performance subsection, the hardware implementation results are shown and discussed in these two subsections. After evaluating the throughput and latency performance completely, we finally verify that our approach does not consume more power than the centralized counterpart and that its area overhead could be tolerated. Therefore, in Subsection VIII-D, we present sample results with synthetic traffic and an increasing number of PEs to confirm these two points.

TABLE 1. Original and scaled Ethernet inter-NoC bandwidths, in Mbps, for optimal and worst clustering of the three considered benchmarks.

FIGURE 11. Average router throughput and packet latency resulting from different approaches for the two clustering variants of the three considered benchmark applications.

A. OVERALL PERFORMANCE EVALUATION
In this subsection, we evaluate the overall performance of the whole system. Accordingly, the two sub-systems resulting from our clustering step are interconnected to each other using the centralized as well as our NoC² approaches. Unfortunately, the simulator does not support any of the TDMA approaches. Therefore, their results are not included in this subsection. We further implement a non-clustered architecture in which all PEs within a benchmark constitute only one system, as defined in Section II. This architecture abstracts the NoC customization techniques with minimal inter-NoC traffic, as discussed in Subsection V-A. The non-clustered architecture represents the maximum practical performance that could be attained if the inter-NoC traffic overheads did not exist. Finally, the theoretical peak performance is also included in the comparison.

Subfigure 11a shows the throughput resulting from the four approaches. The presented throughput is an average useful one over all routers within an architecture, i.e., a goodput as defined in Section II. For any benchmark or clustering variant, the peak, non-clustered, centralized, and NoC² approaches are represented by orange, green, gray, and black bars, respectively. From these results, we have three observations. First, the results clarify that our NoC² approach outperforms the centralized one in terms of average throughput. Second, the figure shows that the throughput of NoC² is very close to the maximum practical one of the non-clustered architecture. Except for the MPEG4 benchmark, NoC² is also close to the theoretical peak of the hypothetical zero-delay architecture. Numerically, relative to the non-clustered architecture, our NoC² approach achieves an average throughput of 96.50%, 92.60%, 94.60%, 88.30%, 98.32%, and 93.82% for VOPD-optimal, VOPD-worst, MPEG4-optimal, MPEG4-worst, MWD-optimal, and MWD-worst, respectively. In contrast, the centralized approach achieves 93.20%, 82.60%, 81.30%, 70.50%, 94.20%, and 81.10% for the six respective cases. These results clearly emphasize the superiority of our NoC² approach over the centralized one. Third, the figure shows that the most degraded performance of the centralized approach occurs for the MPEG4 benchmark as well as the worst clustering of the other two benchmarks. According to Table 1, all of these variants have high inter-NoC traffic. Apparently, the gateway PE in the centralized approach creates a hotspot that causes the performance to deteriorate. Moreover, increasing the network load, by routing the inter-NoC traffic to this communicating PE, is another reason that badly affects the throughput of the centralized approach. In contrast, NoC² does not create this hotspot and allows direct one-hop connection between NIs


FIGURE 12. Average router throughput and packet latency resulted from applying the stress test into a 64-PE system.

and ICFIFO. Consequently, it manages to attain a good performance for all considered simulation scenarios. This clearly emphasizes the efficiency of our approach in specifically handling heavily-communicating NoC-based systems.
Subfigure 11b shows the average packet latency of different approaches and simulation scenarios. Trivially, the hypothetical zero-delay architecture is not included in the figure. Prior to congestion, it is well known that the throughput and latency are tightly related to one another: the higher the throughput, the lower the latency, and vice versa. Therefore, the latency figure shows similar behavior to the throughput one. The figure again emphasizes the superiority of NoC 2 over the centralized approach, especially for heavily-communicating NoC-based systems. It also affirms that NoC 2 manages to achieve an average packet latency close to that of the non-clustered architecture. Numerically, relative to the non-clustered architecture, NoC 2 increases the average packet latency by 14.00%, 29.60%, 21.60%, 17.20%, 6.72%, and 24.72% for VOPD-optimal, VOPD-worst, MPEG4-optimal, MPEG4-worst, MWD-optimal, and MWD-worst, respectively. In contrast, the centralized approach increases the average packet latency by 47.50%, 99.00%, 93.50%, 118.00%, 29.00%, and 94.48% for the six respective cases.
As discussed in Subsection III-A, the number of PEs and the traffic volume for NoC-based systems would rapidly increase in the future. Therefore, we use the stress test technique [51] to evaluate the efficiency of our proposed approach under very high traffic load. In this technique, the NoC is pushed into congestion by continuously increasing the injected traffic. Congestion is identified when the throughput saturates irrespective of the increase in the input traffic. The performance of this stressed network, in terms of throughput and latency, is then measured. The stress test proves useful in validating the NoC performance under extreme traffic conditions. Moreover, in order to verify the performance of our NoC 2 approach for various NoC-based systems, we run several of these stress tests with an increasing number of PEs. We consequently implement different architectures, whose number of PEs increases from 16 to 100. Each PE injects its traffic using a uniform distribution. All results of these different architectures yield the same behavior, with different values. Figures 12 and 13 show two examples of these results for the 64-PE and 100-PE architectures. The 64 PEs are partitioned into sixteen 2 × 2 clusters. Similarly, the 100 PEs are partitioned into twenty-five 2 × 2 clusters. These clusters are interconnected using our NoC 2 approach as well as the centralized one. A non-clustered architecture is also included in the comparison to represent the ceiling of the practically achievable performance. In terms of average router throughput and average packet latency, the obtained results show the superiority of our NoC 2 approach over the centralized one. After going into congestion, NoC 2 achieves an average throughput that is 47 Mbps higher than the centralized approach. Moreover, for any specific injection rate, our approach achieves significantly higher throughput and lower latency than the centralized one. Numerically, post congestion, for both the 64-PE and the 100-PE architectures, NoC 2 realizes 78.00% of the throughput of the non-clustered architecture, whereas the centralized approach only realizes 56.00% of it. These results clearly show how powerful our approach is in handling very high traffic load. They reflect the efficiency of our approach in interfacing heavily-communicating NoC-based systems.

B. INTER-NoC PERFORMANCE EVALUATION
In this subsection, we evaluate the inter-NoC performance of different NoC interfacing approaches. Before presenting and discussing our results, three points should be emphasized. First, the non-clustered architecture does not appear in this subsection, because its PEs constitute only one system without any inter-NoC traffic. Second, Subsection VIII-A shows that the throughput and latency results are tightly related. We therefore confine ourselves to presenting the throughput results of our hardware implementation. As the simulation and hardware implementation results are consistent, we only include the hardware ones to avoid lengthy repeated discussion. The hardware results allow us to include the two TDMA approaches in our discussion. Third, the throughput results presented in this subsection are the inter-NoC traffic


FIGURE 13. Average router throughput and packet latency resulting from applying the stress test to a 100-PE system.

passing through the Ethernet link, as measured by Wireshark. The Ethernet throughput of the centralized, TDMA-RR, TDMA-WS, and NoC 2 approaches is presented. Moreover, the theoretical peak inter-NoC throughput of the idealistic zero-delay architecture is also included in the comparison.
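As a minimal sketch of how a measured Ethernet throughput can be expressed relative to the theoretical peak, the Python snippet below normalizes the saturation values reported for the stress test later in this subsection (an 80 Mbps peak, and 20, 27.5, 44.5, and 63 Mbps for the four approaches). The names `PEAK_MBPS`, `SATURATION_MBPS`, and `percent_of_peak` are illustrative assumptions and are not part of the authors' tooling.

```python
# Saturation throughputs (Mbps) quoted for the stress test in this subsection.
# The constant and function names are hypothetical, chosen for illustration.
PEAK_MBPS = 80.0
SATURATION_MBPS = {
    "centralized": 20.0,
    "TDMA-RR": 27.5,
    "TDMA-WS": 44.5,
    "NoC 2": 63.0,
}


def percent_of_peak(measured_mbps: float, peak_mbps: float = PEAK_MBPS) -> float:
    """Return a measured throughput as a percentage of the theoretical peak."""
    if peak_mbps <= 0:
        raise ValueError("peak throughput must be positive")
    return 100.0 * measured_mbps / peak_mbps


# Relative throughput of every interfacing approach.
relative = {name: percent_of_peak(mbps) for name, mbps in SATURATION_MBPS.items()}

if __name__ == "__main__":
    for name, pct in relative.items():
        print(f"{name:12s} {pct:6.2f}% of peak")
```

Rounded to the nearest integer, these ratios match the 25%, 34%, 56%, and 79% figures quoted for the stress test.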
Figure 14 shows the Ethernet throughput resulting from the five aforementioned approaches. From this figure, we have five observations. First, the figure clarifies that NoC 2 significantly outperforms the other approaches, especially for the worst clustering variant of the benchmarks. Indeed, the worst clustering has higher inter-NoC traffic and requires an efficient interfacing approach to properly handle that traffic. This agrees with our conclusion at the end of the previous overall performance evaluation subsection that the efficiency of our proposed approach becomes more apparent as the inter-NoC traffic increases. Second, for the centralized approach, the gateway PE apparently constitutes a bottleneck in the system that causes the throughput to deteriorate. Third, as the TDMA-RR is not tailored to the traffic, the round robin scheduling dedicates many of the time slots to cores that have no traffic. This wasted time significantly reduces the achieved inter-NoC throughput. Fourth, the TDMA-WS surprisingly achieves an average throughput between 25.00% and 85.00% of the theoretical peak. Although the TDMA-WS is statically tailored to the traffic, random delays within the system shift the transmission of packets for a certain PE from its assigned time slots. This nonalignment not only lowers the throughput, but it also prevents packets from being received in their statically designated time. Destination PEs will therefore stall, waiting for the rest of the packets to arrive once the source PE is again granted time slots to transmit. Unfortunately, these delays are random and could not be statically predicted. The accumulation of these delays eventually lowers the throughput of the TDMA-WS. Fifth, the figure shows that our approach is the closest one to the maximum attainable throughput of the theoretical peak. Numerically, relative to the theoretical peak throughput, our NoC 2 approach achieves an Ethernet throughput of 92.50%, 92.70%, 94.40%, 85.80%, 98.30%, and 97.80% for VOPD-optimal, VOPD-worst, MPEG4-optimal, MPEG4-worst, MWD-optimal, and MWD-worst, respectively. The centralized approach achieves 90.60%, 26.00%, 88.40%, 26.50%, 98.30%, and 81.10% for the six respective cases. Furthermore, TDMA-RR achieves 92.20%, 20.40%, 71.60%, 10.00%, 98.30%, and 64.60%, whereas TDMA-WS achieves 92.20%, 58.40%, 76.70%, 39.50%, 98.30%, and 88.70% for the same six cases, respectively. In summary, the presented results clearly show the efficiency of our approach over other ones in handling the communication between interfaced NoC-based systems.

FIGURE 14. Ethernet throughput resulting from different interfacing approaches for the three considered benchmark applications.

We again use the stress test to evaluate the performance of various interfacing approaches for future high-traffic NoC-based systems. Therefore, similar to Subsection VIII-A, different architectures are implemented. The injected traffic is uniformly increased and the resultant Ethernet throughput is measured. A sample of the obtained results for the aforementioned five approaches is shown in Figure 15. From these results, we notice that the peak throughput saturates at 80 Mbps. Thereafter, starting from the worst, the centralized, TDMA-RR, TDMA-WS, and NoC 2 approaches saturate at 20 Mbps, 27.5 Mbps, 44.5 Mbps, and 63 Mbps, respectively. In sequence, these values correspond to 25.00%, 34.00%, 56.00%, and 79.00% of the peak throughput. Once again, the degraded performance of the centralized approach is due to the competition on the gateway PE and the heavy network load resulting from routing the inter-NoC traffic through each cluster to this gateway PE. The performance deterioration of the TDMA-RR approach is due to the cycles


wasted by a PE waiting for a TDMA time slot to be granted. TDMA-WS is affected by the nonalignment between the actual packet transmissions and their statically assigned time slots. In conclusion, from the inter-NoC traffic perspective, the results of the stress test clearly prove that the proposed approach is the most efficient one in interfacing heavily-communicating NoC-based systems.

FIGURE 15. Ethernet throughput resulting from applying the stress test to different NoC interfacing approaches.

C. INTRA-NoC PERFORMANCE EVALUATION
In this subsection, we assess the intra-NoC performance of different interfacing approaches. We are concerned here with ensuring two points. First, we have to verify that the re-distribution of buffers between NIs and ICFIFO in our proposed approach does not affect the intra-NoC performance. Second, for each approach, we evaluate how routing the inter-NoC traffic through a system overloads its intra-NoC routers. Therefore, the router load, as defined in Section II, is used to quantify whether an interfacing approach burdens the routers with extra traffic or not. It is sufficient to show and discuss the router loads of one of the two interfaced clusters. Throughout this subsection, we present the load of each router within the cluster that is implemented on the FPGA. In order to measure the router loads at the hardware level, we dedicate a General-Purpose Input-Output (GPIO) pin to every router. This pin toggles for every packet passing through its corresponding router. As we have six routers inside the FPGA, six pins are used. The activity on these six pins is simultaneously monitored by a signal analyzer. Knowing the size of every packet, we then calculate each router load in Mbps.
For each benchmark, five interfacing approaches are implemented and considered for evaluation. These approaches are the non-clustered, the centralized, TDMA-RR, TDMA-WS, and NoC 2. On one hand, the non-clustered approach would have the minimum traffic going through its routers, as it has no inter-NoC communication. This provides us with a baseline reference of what the network load should be and how much extra overhead traffic is added by each interfacing approach. On the other hand, for the centralized approach, we present two sets of results. The first, which is named actual, represents the router load that is actually measured at the hardware level. The second, which is named ultimate, is a theoretically calculated one that indicates the load of each router if we use unlimited buffers and assume a congestion-free network. This ultimate load is included in the comparison to show how high the router load could become as a result of routing the inter-NoC traffic through the network.
Figure 16 shows the load of every FPGA router, for each benchmark and clustering variant. From this figure, we have three observations. First, the centralized approach has a significantly higher intra-NoC load, compared to the other interfacing approaches. This increased load is apparently due to routing the inter-NoC traffic through the network to the gateway PE. In turn, this gateway PE, which is represented by the last green bar, has the worst load impact. It receives additional traffic that is equal to the total inter-NoC traffic of all PEs, except itself. This heavy traffic would indeed affect the execution on this PE. Much of its computation power would certainly be dedicated to managing this heavy inter-NoC traffic. Second, the figure clarifies that the actual router load of the centralized approach is much lower than the ultimate one, especially for the worst clustering variant of each benchmark. On average over the six routers, the actual load is 54.00%, 52.00%, and 59.00% of the ultimate one for VOPD-worst, MPEG4-worst, and MWD-worst, respectively. This observation reflects a significant amount of congestion, which occurs in the centralized approach and prevents its actual load from reaching the ultimate one. The congestion would consequently increase the packet loss rate, the latency, and the power consumption. This observation again emphasizes the importance of our proposed approach for heavily-communicating NoC-based systems. Third, the figure shows that TDMA-RR, TDMA-WS, and our NoC 2 approach do not affect the intra-NoC performance. They have as much load as the non-clustered counterpart. As these three approaches completely separate the intra- and inter-NoC traffic, they do not increase the intra-NoC load. Most importantly for us, this observation clearly validates that re-distributing buffers in our proposed approach does not degrade the intra-NoC performance of interfaced systems. In a nutshell, the experimental results as a whole clearly prove the efficiency of NoC 2 in interconnecting NoC-based systems over other interfacing approaches. Without any additional buffer requirements, it significantly enhances the inter-NoC performance, without degrading the intra-NoC one.

D. POWER CONSUMPTION AND AREA EVALUATION
Our approach is mainly proposed to enhance the throughput and latency of future interconnected NoC-based systems. In the NoC domain, a novel approach is not adopted unless its power consumption and area overheads can be tolerated. Our approach requires more area than the currently deployed centralized one due to the extra hardware of ICFIFO, the extra links between NIs and ICFIFO, and the extra port of NIs. Therefore, in this subsection, we briefly evaluate the power and area of our NoC 2 approach versus the current centralized one. Similar to our stress test experiments, we implement


FIGURE 16. Intra-NoC load for FPGA routers resulting from different interfacing approaches for the three considered benchmark applications.

FIGURE 17. Power and area comparison between NoC 2 and the centralized approach for different numbers of cores.

multiple architectures, whose number of PEs increases from 16 to 100. These architectures are partitioned into 2 × 2 clusters that are interfaced to each other. PEs inject the traffic uniformly with a Flit Injection Rate (FIR) of 0.5. NoCTweak is used to evaluate the total power consumption of the two approaches, whereas ORION system-level area


modeling [52] is employed to calculate their total NoC area.
Subfigure 17a shows the average power consumption of the two approaches. Throughout the experimented range of PEs, the figure shows that our NoC 2 approach does not burden the power budget of the design. Rather, it significantly reduces the power consumption, compared to the centralized approach. The amount of saved power increases further as the number of PEs increases. For example, for 100 PEs, NoC 2 consumes only 58.1% of the power of the centralized approach. This is mainly due to the capability of our approach to avoid congestion, which consumes a noticeable amount of power, as discussed in Subsection VIII-C. These results certainly add another advantage to our approach. Besides enhancing the throughput and the latency, it also saves energy.
Subfigure 17b shows the total NoC area of NoC 2 as well as the centralized approach. This total area includes the area of routers, NIs, and links. The figure shows that the area overheads of our approach are reasonable. Furthermore, these overheads do not increase significantly as the number of PEs increases. Considering the significant enhancement realized by our approach for the throughput, latency, and power consumption, these overheads could indeed be tolerated.

IX. CONCLUSION
In this article, we presented a novel approach, NoC 2, to efficiently interface NoC-based systems. Our approach distributes buffers wisely through the interfaced systems. It does not overload the resources within these systems by routing the inter-NoC traffic through them. Once a system is designed according to our approach, its PEs and the software running on them would not be affected by the interfacing process. We evaluated our approach, as well as previous interfacing ones, using synthetic traffic and real benchmark applications. Results show a superior performance of NoC 2 over other approaches, especially for heavily-communicating systems. Compared to the currently deployed centralized approach, NoC 2 significantly increases the throughput and reduces the latency and the power consumption. Unlike its centralized counterpart, it also does not suffer from the congestion problem. These enhancements are achieved with a small increase in the NoC area, which could be tolerated. Numerically, our proposed approach achieves 79.00% of the theoretical peak inter-NoC throughput. In contrast, the centralized, TDMA-RR, and TDMA-WS approaches only achieve 25.00%, 34.00%, and 56.00% of this peak throughput, respectively. Finally, in the future, we plan to extend NoC 2 to support multiple CPC standards. Also, we would evaluate our approach against dynamic arbitration techniques, such as TDMA-WS with TAQP.

REFERENCES
[1] A. A. Morgan, M. W. El-Kharashi, H. Elmiligi, and F. Gebali, “Unified multi-objective mapping and architecture customisation of networks-on-chip,” IET Comput. Digit. Techn., vol. 7, no. 6, pp. 282–293, Nov. 2013.
[2] H. C. Freitas and P. O. A. Navaux, “A high-throughput multi-cluster NoC architecture,” in Proc. 11th IEEE Int. Conf. Comput. Sci. Eng., Sao Paulo, Brazil, Jul. 2008, pp. 56–63.
[3] C. Puttmann, J. C. Niemann, M. Porrmann, and U. Ruckert, “GigaNoC–a hierarchical network-on-chip for scalable chip-multiprocessors,” in Proc. 10th Euromicro Conf. Digital Syst. Design Archit., Methods Tools (DSD), Lubeck, Germany, Aug. 2007, pp. 495–502.
[4] C.-H. Huang, C.-Y. Chen, and H.-Y. Huang, “Hierarchical and dependency-aware task mapping for NoC-based systems,” in Proc. 11th Int. Workshop Netw. Chip Architectures (NoCArc), Fukuoka, Japan, Oct. 2018, pp. 1–6.
[5] R. Manevich, I. Cidon, and A. Kolodny, “Design and dynamic management of hierarchical NoCs,” Microprocessors Microsyst., vol. 40, pp. 154–166, Feb. 2016.
[6] A. S. Hassan, A. A. Morgan, and M. W. El-Kharashi, “Introducing NoC2: Interconnecting NoC-based systems through ethernet,” in Proc. 4th Int. Workshop Design Perform. Netw. Chip (DPNoC), Leuven, Belgium, Jul. 2017, pp. 473–478.
[7] J. Nickolls and W. J. Dally, “The GPU computing era,” IEEE Micro, vol. 30, no. 2, pp. 56–69, Mar. 2010.
[8] B. Bohnenstiehl, A. Stillmaker, J. J. Pimentel, T. Andreas, B. Liu, A. T. Tran, E. Adeagbo, and B. M. Baas, “KiloCore: A 32-nm 1000-processor computational array,” IEEE J. Solid-State Circuits, vol. 52, no. 4, pp. 891–902, Apr. 2017.
[9] D. Greenfield, A. Banerjee, J.-G. Lee, and S. Moore, “Implications of Rent’s rule for NoC design and its fault-tolerance,” in Proc. 1st Int. Symp. Netw.–Chip (NOCS), Princeton, NJ, USA, May 2007, pp. 283–294.
[10] A. Wasicek, “Embedding complex embedded systems in large ethernet–based networks,” Network, vol. 1, no. C2, p. C3, 2011.
[11] A. Beyranvand Nejad, A. Molnos, M. Escudero Martinez, and K. Goossens, “A hardware/software platform for QoS bridging over multi-chip NoC-based systems,” Parallel Comput., vol. 39, no. 9, pp. 424–441, Sep. 2013.
[12] A. Dorai, O. Sentieys, and H. Dubois, “Evaluation of NoC on multi-FPGA interconnection using GTX transceiver,” in Proc. 24th IEEE Int. Conf. Electron., Circuits Syst. (ICECS), Batumi, Georgia, Dec. 2017, pp. 170–173.
[13] J. Hu, U. Y. Ogras, and R. Marculescu, “System-level buffer allocation for application-specific Networks-on-Chip router design,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 12, pp. 2919–2933, Dec. 2006.
[14] M. Kandemir, Y. Zhang, J. Liu, and T. Yemliha, “Neighborhood-aware data locality optimization for NoC-based multicores,” in Proc. Int. Symp. Code Gener. Optim. (CGO), Chamonix, France, Apr. 2011, pp. 191–200.
[15] Y. Xiao, Y. Xue, S. Nazarian, and P. Bogdan, “A load balancing inspired optimization framework for exascale multicore systems: A complex networks approach,” in Proc. IEEE/ACM Int. Conf. Computer-Aided Design (ICCAD), Irvine, CA, USA, Nov. 2017, pp. 217–224.
[16] Y. Xiao, S. Nazarian, and P. Bogdan, “Self-optimizing and self-programming computing systems: A combined compiler, complex networks, and machine learning approach,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 6, pp. 1416–1427, Jun. 2019.
[17] P. K. Sahu and S. Chattopadhyay, “A survey on application mapping strategies for Network-on-Chip design,” J. Syst. Archit., vol. 59, no. 1, pp. 60–76, Jan. 2013.
[18] A. K. Singh, M. Shafique, A. Kumar, and J. Henkel, “Mapping on multi/many-core systems: Survey of current and emerging trends,” in Proc. 50th ACM/EDAC/IEEE Design Automat. Conf. (DAC), Austin, TX, USA, May/Jun. 2013, pp. 1–10.
[19] W. Amin, F. Hussain, S. Anjum, S. Khan, N. K. Baloch, Z. Nain, and S. W. Kim, “Performance evaluation of application mapping approaches for Network-on-Chip designs,” IEEE Access, vol. 8, pp. 63607–63631, 2020.
[20] S. J. Hollis, C. Jackson, P. Bogdan, and R. Marculescu, “Exploiting emergence in on-chip interconnects,” IEEE Trans. Comput., vol. 63, no. 3, pp. 570–582, Mar. 2014.
[21] U. Ogras and R. Marculescu, “‘It’s a small world after all’: NoC performance optimization via long-range link insertion,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 7, pp. 693–706, Jul. 2006.
[22] Y. Xue and P. Bogdan, “Improving NoC performance under spatio-temporal variability by runtime reconfiguration: A general mathematical framework,” in Proc. 10th IEEE/ACM Int. Symp. Netw.–Chip (NOCS), Nara, Japan, Sep. 2016, pp. 1–8.


[23] H. Leake Kidane and E.-B. Bourennane, “Run-time reconfigurable Network-On-chip: A survey,” in Proc. 15th Int. Multi-Conf. Syst., Signals Devices (SSD), Hammamet, Tunisia, Mar. 2018, pp. 846–851.
[24] Z. Qian, P. Bogdan, G. Wei, C.-Y. Tsui, and R. Marculescu, “A traffic-aware adaptive routing algorithm on a highly reconfigurable network-on-chip architecture,” in Proc. 8th IEEE/ACM/IFIP Int. Conf. Hardw./Softw. Codesign Syst. Synth. (CODES+ISSS), Tampere, Finland, Oct. 2012, pp. 161–170.
[25] Y. Wu, C. Lu, and Y. Chen, “A survey of routing algorithm for mesh Network-on-Chip,” Frontiers Comput. Sci., vol. 10, no. 4, pp. 591–601, Aug. 2016.
[26] Z. Lu and Y. Yao, “Dynamic traffic regulation in NoC-based systems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 2, pp. 556–569, Feb. 2017.
[27] Y. Li and A. Louri, “ALPHA: A learning-enabled high-performance Network-on-Chip router design for heterogeneous manycore architectures,” IEEE Trans. Sustain. Comput., early access, Mar. 17, 2020, doi: 10.1109/TSUSC.2020.2981340.
[28] N. Jindal, S. Gupta, D. P. Ravipati, P. R. Panda, and S. R. Sarangi, “Enhancing Network-on-Chip performance by reusing trace buffers,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 4, pp. 922–935, Apr. 2020.
[29] A. Bose and P. Ghosal, “Switching at flit level: A congestion efficient flow control strategy for Network-on-Chip,” in Proc. 28th Euromicro Int. Conf. Parallel, Distrib. Network-Based Process. (PDP), Västerås, Sweden, Mar. 2020, pp. 319–322.
[30] J. Lee, S. Li, H. Kim, and S. Yalamanchili, “Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures,” ACM Trans. Design Autom. Electron. Syst., vol. 18, no. 4, p. 48, Oct. 2013.
[31] J. Fang, Z. Chang, and D. Li, “Exploration on routing configuration of HNoC with intelligent on-chip resource management,” IEEE Access, vol. 8, pp. 12117–12129, 2020.
[32] Z. Qian, S. M. Abbas, and C.-Y. Tsui, “FSNoC: A flit-level speedup scheme for network on-chips using self-reconfigurable bidirectional channels,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 9, pp. 1854–1867, Sep. 2015.
[33] Y. Xue and P. Bogdan, “User cooperation network coding approach for NoC performance improvement,” in Proc. 9th Int. Symp. Netw.–Chip (NOCS), Vancouver, BC, Canada, 2015, pp. 1–8.
[34] F. Mashhadi, A. Asaduzzaman, and M. F. Mridha, “A novel resource scheduling approach to improve the reliability of shuffle-exchange networks,” in Proc. IEEE Int. Conf. Imag., Vis. Pattern Recognit. (icIVPR), Dhaka, Bangladesh, Feb. 2017, pp. 1–6.
[35] L. Xue, W. Ji, Q. Zuo, and Y. Zhang, “Floorplanning exploration and performance evaluation of a new Network-on-Chip,” in Proc. Design, Autom. Test Eur., Grenoble, France, Mar. 2011, pp. 625–630.
[36] K. Duraisamy and P. P. Pande, “Enabling high-performance SMART NoC architectures using on-chip wireless links,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 12, pp. 3495–3508, Dec. 2017.
[37] Y. Ye, J. Xu, B. Huang, X. Wu, W. Zhang, X. Wang, M. Nikdast, Z. Wang, W. Liu, and Z. Wang, “3-D mesh-based optical Network-on-Chip for multiprocessor System-on-Chip,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 4, pp. 584–596, Apr. 2013.
[38] B. K. Joardar, K. Duraisamy, and P. P. Pande, “High performance collective communication-aware 3D Network-on-Chip architectures,” in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Dresden, Germany, Mar. 2018, pp. 1351–1356.
[39] Y. Xue and P. Bogdan, “Scalable and realistic benchmark synthesis for efficient NoC performance evaluation: A complex network analysis approach,” in Proc. 2016 IEEE Int. Conf. Hardw./Softw. Codesign Syst. Synth. (CODES+ISSS), Pittsburgh, PA, USA, Oct. 2–7, 2016, pp. 1–10.
[40] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini, and G. De Micheli, “NoC synthesis flow for customized domain specific multiprocessor systems-on-chip,” IEEE Trans. Parallel Distrib. Syst., vol. 16, no. 2, pp. 113–129, Feb. 2005.
[41] S. Murali and G. De Micheli, “Bandwidth-constrained mapping of cores onto NoC architectures,” in Proc. Design, Autom. Test Eur. Conf. Exhib., vol. 2, Paris, France, 2004, pp. 896–901.
[42] E. B. Van Der Tol and E. G. Jaspers, “Mapping of MPEG-4 decoding on a flexible architecture platform,” Proc. SPIE, vol. 4674, pp. 362–375, Dec. 2001.
[43] S. Murali and G. De Micheli, “SUNMAP: A tool for automatic topology selection and generation for NoCs,” in Proc. 41st Annu. Design Automat. Conf. (DAC), San Diego, CA, USA, Jul. 2004, pp. 914–919.
[44] V. Dumitriu and G. N. Khan, “Throughput-oriented NoC topology generation and analysis for high performance SoCs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 10, pp. 1433–1446, Oct. 2009.
[45] E. G. T. Jaspers and P. H. N. de With, “Chip-set for video display of multimedia information,” IEEE Trans. Consum. Electron., vol. 45, no. 3, pp. 706–715, Aug. 1999.
[46] A. T. Tran and B. M. Baas, “NoCTweak: A highly parameterizable simulator for early exploration of performance and energy efficiency of networks on-chip,” VLSI Comput. Lab, ECE Dept., Univ. California, Davis, Davis, CA, USA, Tech. Rep. ECE-VCL-012-2, Jul. 2012. [Online]. Available: http://vcl.ece.ucdavis.edu/pubs/2012.07.techreport.noctweak/
[47] A. S. Hassan, A. A. Morgan, and M. W. El-Kharashi, “An enhanced network-on-chip simulation for cluster-based routing,” in Proc. 3rd Int. Workshop Design Perform. Netw. Chip (DPNoC), Montreal, QC, Canada, Aug. 2016, pp. 410–417.
[48] Artix-7. Accessed: Oct. 6, 2020. [Online]. Available: https://www.xilinx.com/products/silicon-devices/fpga/artix-7.html
[49] Wireshark. Accessed: Oct. 6, 2020. [Online]. Available: https://www.wireshark.org
[50] G. Karypis, K. Schloegel, and V. Kumar, “ParMETIS: Parallel graph partitioning and sparse matrix ordering library, version 3.1,” Univ. Minnesota, Minneapolis, MN, USA, Tech. Rep. TR 97-060, 1997. [Online]. Available: http://glaros.dtc.umn.edu/gkhome/fetch/sw/parmetis/OLD/ParMetis-3.1.tar.gz
[51] K. Lahiri, A. Raghunathan, and S. Dey, “Evaluation of the traffic-performance characteristics of system-on-chip communication architectures,” in Proc. VLSI Design. 14th Int. Conf. VLSI Design, Bangalore, India, 2001, pp. 29–35.
[52] A. B. Kahng, B. Lin, and K. Samadi, “Improved on-chip router analytical power and area modeling,” in Proc. 15th Asia South Pacific Design Autom. Conf. (ASP-DAC), Taipei, Taiwan, Jan. 2010, pp. 241–246.

AHMED A. MORGAN received the B.Sc. degree (Hons.) from the Faculty of Engineering at Shoubra, Benha University, Egypt, in 2000, the Diploma degree in electronic design automation (EDA) and VLSI design from the Information Technology Institute (ITI), Cairo, Egypt, in 2002, the M.Sc. degree from the Faculty of Engineering at Shoubra, Benha University, in 2005, and the Ph.D. degree from the University of Victoria, Victoria, BC, Canada, in 2011. He is currently an Assistant Professor with the Department of Computer Engineering, Cairo University, Egypt. He is currently on leave at the College of Computers and Information Systems, Umm Al-Qura University, Mecca, Saudi Arabia. He has about 25 publications that span journals, conferences, book chapters, and technical reports. His research interests include parallel architectures, multicore systems, digital VLSI design, wireless sensor networks, and networks-on-chip (NoC) modeling, optimization, and performance evaluation.

AHMED S. HASSAN received the B.Sc. degree in systems and biomedical engineering from Cairo University, Egypt, in 2011, and the M.Sc. degree from Ain Shams University, Cairo, in 2018. He is currently working on many-core systems-on-chip (SoC) analysis and design. He is also an Embedded Software Developer, specialized in multicore architecture, wireless connectivity, and automotive Ethernet.


M. WATHEQ EL-KHARASHI received the B.Sc. (Hons.) and M.Sc. degrees in computer engineering from Ain Shams University, Cairo, Egypt, in 1992 and 1996, respectively, and the Ph.D. degree in computer engineering from the University of Victoria, Victoria, BC, Canada, in 2002. He is currently a Professor of computer organization with the Department of Computer and Systems Engineering, Ain Shams University, and an Adjunct Professor with the Department of Electrical and Computer Engineering, University of Victoria. He has published 115 papers in refereed international journals and conferences. He has authored two books and seven book chapters. His general research interests include advanced system architectures, especially networks-on-chip (NoC), systems-on-chip (SoC), and secure hardware. His specific research interests include hardware architectures for networking (network processing units) and security; advanced microprocessor design, simulation, performance evaluation, and testability; and computer architecture and computer networks education.

AYMAN TAWFIK (Member, IEEE) received the B.Sc. (Hons.) and M.Sc. degrees in electrical engineering from Ain Shams University, Cairo, Egypt, in 1983 and 1989, respectively, and the Ph.D. degree in electrical engineering from the University of Victoria, Victoria, Canada, in 1995. He has worked as a Consultant for DND, Canada, and Egetronic, Egypt. He is currently the Head of the Electrical and Computer Engineering Department, College of Engineering and Information Technology, Ajman University, Ajman, United Arab Emirates. He has over 30 years of experience in teaching different academic courses and vast experience in ABET, accreditation, and reaccreditation of electrical engineering programs. He has published more than 60 research papers in renowned journals and conferences. His research interests include digital signal processing, digital image processing, VLSI signal processing, digital communication, the Internet of Things, computer organization, and education technology.
