An Efficient Interfacing Approach For Heavily-Communicating NoC-Based Systems
ABSTRACT Current research in interfacing clusters within Hierarchical Networks-on-Chip (HNoC) as well
as interfacing NoC-based systems adopts a centralized approach. In this approach, a specific Processing
Element (PE) acts as a gateway between interfacing peripherals and the rest of NoC elements. This article
evaluates this approach and shows that it is not optimal for handling the inter-NoC communication. Routing
inter-NoC traffic through a system to its gateway PE deteriorates the network performance. Results show that
both the throughput and latency of the centralized approach degrade with the increase in the inter-NoC traffic
bandwidth. To alleviate this, we propose a novel distributed approach, which separates the inter-NoC traffic
from the intra-NoC one. Our approach relies on distributed buffers to allow PEs to efficiently communicate
with the interfacing peripheral. We evaluate our approach against other interfacing ones using synthetic traffic
as well as real benchmark applications. Our evaluation covers the whole system performance as well as its
inter- and intra-NoC parts. Results prove that the proposed approach outperforms previous interfacing ones in
terms of throughput and latency. The proposed approach significantly enhances the inter-NoC performance
without any deterioration in the intra-NoC one. Considering the inter-NoC performance, we achieve a
throughput that is close to the maximum possibly attainable one. Other approaches show major performance
degradation, reaching as low as 10% of this maximum attainable throughput.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
185992 VOLUME 8, 2020
A. A. Morgan et al.: NoC²: An Efficient Interfacing Approach for Heavily-Communicating NoC-Based Systems
from being widely used with future heavily-communicating NoC-based systems.
The second research direction appears more promising and manageable for future NoC-based systems. In this direction, the underlying NoC architecture is partitioned into multiple clusters. In the literature, these Clustered NoCs (CNoCs) are also named Hierarchical NoCs (HNoCs) [3], or region-based NoCs [4]. Each cluster has a limited number of cores and can be easily managed and controlled. For a complex application, its tasks are also split into different segments. Each segment is mapped to one of the hardware clusters. Due to this task splitting, clusters are interfaced and connected to each other in order to carry out the whole application execution. Many techniques have been presented to accomplish the clustering process and to customize the underlying NoC architecture to the target application. A common and key objective of these techniques is to minimize the inter-cluster traffic. This traffic minimization is intended to avoid congestion over the few inter-cluster links, to minimize the latency, and to maximize the throughput. In this article, we are not replacing these techniques, and we again acknowledge the good performance they achieve. Nevertheless, most of these techniques reduce the inter-cluster traffic, but they do not eliminate it. Therefore, an open research challenge is how to properly handle this inter-cluster traffic. In other words, an efficient interfacing approach would help previous techniques reach the maximum attainable performance from the underlying hardware. On the other hand, the future prediction of NoC-based systems shows a significant increase in their inter-cluster traffic [5]. This emphasizes the need for the aforementioned efficient interfacing approach. However, in the literature, very few techniques are proposed to carry out the inter-cluster communication. In this article, we will show that the performance of these techniques is not optimal. Moreover, supported by our initial work in [6], we present a more efficient interfacing approach. Our approach is meant to work in conjunction with previous CNoC architecture customization techniques to properly handle their inter-cluster traffic and further enhance their performance.
Another method to realize the increasing performance required of future NoC-based systems would be through adding more computational cores. In other words, a new off-chip, or maybe off-board, NoC-based system would be connected to an already running one. Connecting an NoC-based system to another one necessitates an efficient interfacing approach to handle the communication among these systems. This interfacing approach should be taken into account during the design phase and it should further employ a standard communication protocol. Despite the importance of this challenge, it is barely studied by the NoC research community. In this article, we target this open research area. Our proposed interfacing approach could be used not only in connecting clusters within a single NoC-based system, but also in connecting different systems. In the latter case, we are building a network of NoC-based systems. Therefore, we name our interfacing approach NoC². Finally, without loss of generality, our approach employs the Ethernet standard to carry out the communication among these systems.
The rest of the paper is organized as follows. Section II defines the terms that are frequently used throughout this article and clarifies the terminology. Section III verifies the necessity for an efficient NoC interfacing approach. It highlights the prediction for the future inter-cluster traffic and summarizes the drawbacks of current interfacing approaches. Our contributions are then enumerated in Section IV. Section V reviews the related work. It first covers some research efforts to customize NoC architectures with minimal inter-cluster traffic. Thereafter, it surveys previous NoC interfacing techniques. Section VI presents our proposal, NoC², for interfacing NoC-based clusters, or systems.¹ Section VII explains the experimental setup employed in evaluating different NoC interfacing approaches. The section describes the benchmark applications used, the employed clustering variants, and the environment for simulation as well as hardware implementation. Section VIII shows and discusses the results of evaluating different NoC interfacing approaches. Finally, the paper is concluded in Section IX.

II. DEFINITIONS AND USED TERMINOLOGY
Terms within the NoC research community might be used differently from one article to another. To avoid confusing the reader and to make our discussion clear, we herein define the terms that we use throughout this article.
• NoC architecture: The actual hardware of the NoC-based system. In this article, we assume that the architecture consists of tiles, where each tile has the following three modules:²
  1) Processing Element (PE), or core: The actual execution module, which runs a task, or tasks, of the application. In this article, PEs are drawn as black square Quad Flat Package (QFP) chips.
  2) Network Interface (NI): The module that interfaces the PE to the network. It packetizes the traffic at the source PE and de-packetizes the traffic at the destination PE. In our proposed approach, it is also responsible for communicating with the interfacing peripheral. In this article, NIs are drawn as white rectangles, with the abbreviation NI written inside these rectangles.
  3) Router: The module that executes the routing protocol. It transfers packets through the network from one tile to another. In this article, routers are drawn as circles with small arrows inside.
• Non-clustered architecture: The architecture that is not partitioned into clusters. The system can simply be envisioned as a collection of tiles.
¹ For brevity, we will only use one of the two terms, either clusters or systems. Unless explicitly stated otherwise, the discussion applies to both.
² In our proposed approach, the communication module is also considered part of the architecture.
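As an illustration of the NI role defined above (packetizing at the source PE and de-packetizing at the destination PE), the following minimal Python sketch splits a message into fixed-size flits and reassembles it. The names `Packet`, `packetize`, `depacketize`, and the 4-byte flit payload are assumptions made for this example only; they are not taken from the article.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical flit payload size, chosen only for illustration.
FLIT_PAYLOAD_BYTES = 4


@dataclass
class Packet:
    src: int            # ID of the source PE
    dst: int            # ID of the destination PE
    flits: List[bytes]  # payload split into flit-sized pieces


def packetize(src: int, dst: int, message: bytes) -> Packet:
    """Source-side NI: split a PE message into flits."""
    flits = [
        message[i:i + FLIT_PAYLOAD_BYTES]
        for i in range(0, len(message), FLIT_PAYLOAD_BYTES)
    ]
    return Packet(src=src, dst=dst, flits=flits)


def depacketize(packet: Packet) -> bytes:
    """Destination-side NI: reassemble the original message."""
    return b"".join(packet.flits)


if __name__ == "__main__":
    pkt = packetize(src=0, dst=5, message=b"hello NoC")
    print(len(pkt.flits), depacketize(pkt))
```

A real NI would also add header information and flow control; the sketch only shows the split-and-reassemble step.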
makes the interfacing process more difficult and time-consuming.
4) The centralized module might serialize the accesses to the CPC. For example, when using TDMA, some inter-NoC traffic would stall awaiting its designated time slots. This stall consequently increases the overall latency and degrades the system performance.

IV. CONTRIBUTIONS
Motivated by the future prediction of inter-NoC traffic and the challenges of current interfacing techniques, we aim to present an efficient approach to handle the inter-NoC traffic and overcome the drawbacks of the centralized approach. Accordingly, our proposed NoC² approach eliminates the centralized gateway module in order to avoid creating hotspots in the system. It allows direct and concurrent access to the CPC in order to reduce the overall latency as well as the congestion probability. It requires no modifications to the running software, which makes interfacing NoC-based systems much easier.
As our approach is proposed to efficiently handle the inter-NoC traffic, it mainly targets two important NoC metrics, the throughput and the latency. Consequently, the performance of our approach is evaluated against that of the non-clustered, the centralized, and the TDMA techniques, in terms of these two metrics. On the other hand, the overheads associated with the approach should be tolerable. In other words, our approach should be customized to minimize its impact on the NoC design budget. Buffers are shown to be responsible for most of the design area and power consumption [13]. Therefore, our approach employs as many buffers as those used by the centralized approach. Nevertheless, these buffers are re-distributed throughout the network more wisely.
Our approach is evaluated using simulation as well as hardware implementation on FPGA. The evaluation is conducted using synthetic traffic and real benchmark applications. For each employed benchmark, its cores are partitioned into two clusters. These clusters are then interfaced to each other. Thereafter, both inter- and intra-NoC communication performances are assessed, in terms of throughput and latency. As the performance of the overall system would depend on the clustering technique used, we consider two clustering variants. The first variant maximizes the inter-NoC traffic in order to judge the efficiency of different interfacing approaches in the case of heavily-communicating systems. This variant would therefore evaluate whether a hotspot is created or congestion occurs. The second clustering variant yields the minimum inter-NoC traffic, and hence, the maximum intra-NoC one. This variant would consequently assess the effect of re-distributing NoC buffers on the intra-NoC performance. To this end, our contributions are twofold:
1) Presenting a novel approach, NoC², for interfacing NoC-based clusters or systems. Our approach does not overload the network in terms of buffers. It further enhances inter-NoC communication performance without affecting the intra-NoC one. Once an NoC-based system is designed according to our approach, interfacing it to a similar system would require modifying neither the cores nor the software running on them.
2) Evaluating different NoC interfacing approaches using synthetic traffic and real NoC benchmark applications through simulation as well as hardware implementation.

V. RELATED WORK
In this section, we review some of the related research efforts. First, in Subsection V-A, techniques that customize the NoC architecture with minimal inter-NoC traffic are surveyed. As mentioned in Section I, our proposed approach is meant to work in conjunction with these techniques to better enhance the performance of NoC-based systems. Thereafter, previous interfacing techniques are discussed in detail in Subsection V-B. Those techniques are the ones that we consider in our comparison in Section VIII to verify the efficiency of our proposed interfacing approach in handling the inter-NoC communication.

A. NoC CUSTOMIZATION TECHNIQUES WITH MINIMAL INTER-NoC TRAFFIC
Numerous techniques were presented to realize the best performance for NoC-based systems. Most of these techniques could be used with clustered and non-clustered architectures. A main objective of these techniques is to achieve a better traffic localization, such that the inter-NoC traffic is minimized [14]. As discussed in Section III, this traffic localization would not be easily achieved with the continuous increase in the number of cores and the traffic of future NoC-based systems. In summary, the most widely used NoC customization techniques are optimal task-to-core mapping, dynamic architecture reconfiguration, adaptive routing, traffic shaping and regulation, virtual channel partitioning, dynamic resource management, and network coding. Several of these techniques could be combined to achieve better performance. In the following paragraphs, we shed some light on these techniques.
Optimizing the task-to-core mapping is probably the most widely used technique for enhancing the performance of NoC-based systems. In this technique, mapping the application tasks onto the architecture cores is optimized for a certain metric, such as latency, throughput, power consumption, or reliability. For example, Xiao et al. proposed a new methodology, based on a Dynamic Application Dependency Graph (DADG), to optimize the clustering process [15]. The objectives of this clustering process are to minimize the inter-NoC traffic and to balance the workload between different clusters. First, the DADG is automatically generated from the application using a Low Level Virtual Machine Intermediate Representation (LLVM IR) compiler. Thereafter, a topological sort algorithm is used to map a maximum of one thread cluster to every core within the NoC, such that
the two aforementioned objectives are realized. The work is extended in [16] by considering heterogeneous systems and incorporating machine learning techniques. The target systems consist of CPUs, GPUs, and Hardware Accelerators (HWAs). At compile time, Neural Networks (NNs) are used to transform the application into a DADG and extract patterns within the generated graph. Thereafter, a Community Detection (CD) algorithm is used to partition the DADG into communities. At runtime, a dynamic resource manager intelligently maps the resultant communities onto the proper hardware category, i.e., CPUs, GPUs, or HWAs. The manager further maps the tasks within each community onto the cores of the target hardware category. The proposed framework proves useful in achieving a good tradeoff between programmability, efficiency, and energy consumption. In general, readers interested in enhancing the performance of NoC-based systems using optimal task-to-core mapping are referred to the surveys presented in [17]–[19].
Dynamic architecture reconfiguration is another widely adopted technique for enhancing the performance of NoC-based systems. In this technique, the underlying architecture is reconfigured at runtime according to the executed application. For example, Hollis et al. changed the topology and the characteristics of a regular mesh to dynamically fit the traffic requirements of the launched application [20]. The topology adaptation is built on the small world phenomenon [21]. As such, skip-links, or long-range links, are inserted to connect heavily-communicating cores. Experimental results show that the proposed technique successfully reduces the latency and the energy consumption. Xue and Bogdan proposed a general mathematical framework to formulate the dynamic reconfiguration of NoC-based systems [22]. A greedy algorithm is further presented, based on this formulation, to adaptively reconfigure the NoC for lower latency and power consumption. In general, readers interested in enhancing the performance of NoC-based systems using dynamic architecture reconfiguration are referred to the survey presented in [23]. Finally, it is worth mentioning that one problem with the dynamic reconfiguration technique is that it might only work with reconfigurable hardware, such as FPGAs, rather than all types of NoC-based systems.
Using adaptive routing is another widely used technique for enhancing the performance of NoC-based systems. In this technique, packets are adaptively routed through the network in order to avoid creating hotspots within it. For example, Qian et al. proposed a hub router and a deadlock-free adaptive routing protocol to enhance the performance of Express Virtual Channel (EVC) NoCs [24]. The presented results show up to an 80% reduction in the latency with a power consumption overhead of only 15%. In general, readers interested in enhancing the performance of NoC-based systems using adaptive routing are referred to the survey presented in [25].
The techniques of traffic shaping and regulation enhance the performance of NoC-based systems by controlling the traffic injection. For example, Lu and Yao presented a dynamic traffic regulation approach to realize the target performance with the minimum buffering requirements [26]. Li and Louri used supervised learning techniques to analyze, predict, and consequently adjust the traffic injected into the network [27]. Traffic shaping and regulation techniques prove useful in enhancing the system performance as well as avoiding network congestion. However, some packets typically suffer from extra delay while awaiting the proper network conditions to be injected.
The technique of virtual channel partitioning enhances the performance of NoC-based systems by controlling the assignment of available virtual channels. For example, Jindal et al. presented two virtual channel assignment strategies: Augmented Virtual Channel (AugVC) and Output port Directed Virtual Channel (ODVC). The authors also reused trace buffers, which are intended for debugging, to further enhance the performance [28]. Experimental results show that the presented strategies successfully reduce the latency. Bose and Ghosal combined virtual channel assignment with a novel flow control strategy. A new router design is presented to carry out the assignment of virtual channels and realize the new flow control strategy [29]. Results show a reduction in the area, the energy consumption, the latency, and the packet drop rate. Partitioning the virtual channels between CPU and GPU traffic for heterogeneous systems is considered in [30]. Lee et al. proposed a feedback-directed virtual channel partitioning mechanism to efficiently share the NoC bandwidth between CPU and GPU cores. Experimental results reflect an increase in the system throughput.
The technique of dynamic resource management enhances the performance of NoC-based systems by optimally allocating the network resources, such as the switch crossbar, buffers, and links. For example, Fang et al. enhanced the performance of heterogeneous NoCs by intelligently allocating buffering resources [31]. Results show a 55.47% reduction in the latency and a 21% reduction in the energy consumption. Qian et al. adaptively changed the link directions according to the runtime traffic [32]. A novel router architecture is presented to support this dynamic link customization. Li and Louri used supervised learning techniques to dynamically customize the switch crossbar, the buffer organization, and the routing protocol [27]. The proposed architecture manages to increase the throughput by 28% and to decrease the latency and the power consumption by 24% and 19%, respectively.
Finally, network coding is used to enhance the performance of NoC-based systems. For example, Xue and Bogdan used network coding techniques to combine multiple packets into a larger one [33]. The work is intended for multicast traffic. The proposed scheme encapsulates cooperation units, a corridor routing algorithm, and an adaptive flit dropping algorithm to avoid network congestion. Results show the efficiency of the proposed scheme in reducing the link utilization and increasing the overall system throughput.

B. NoC INTERFACING TECHNIQUES
Few techniques are presented to properly handle the traffic between interconnected NoC-based systems and efficiently
1) INPUTS
The inputs of ICFIFO could be summarized as follows.
• Intra-NoC inputs: ICFIFO has as many intra-NoC local inputs as the number of tiles within its own system. Each NI has a dedicated input for packet transmission, TxL. The traffic arriving on each of these inputs is written into its associated buffer, BTx.
• Inter-NoC input: ICFIFO has an inter-NoC global input, RG, which comes from the interfaced system. The traffic arriving on that input is written via the CPC into the appropriate buffer, BRx.

2) OUTPUTS
The outputs of ICFIFO could be summarized as follows.
• Intra-NoC outputs: ICFIFO has as many intra-NoC local outputs as the number of tiles within its own system. Each NI has a dedicated output for packet reception, RxL. The traffic transmitted through each of these outputs comes from its associated buffer, BRx.
• Inter-NoC output: ICFIFO has an inter-NoC global output, TG, which goes to the interfaced system. The traffic transmitted through this output comes from an NI buffer, BTx, via the CPC.

3) STRUCTURE
The main components of ICFIFO could be summarized as follows.

B. NI IMPLEMENTATION
In NoC², the NI is the module that controls the path for intra- and inter-NoC traffic. This relieves the PEs of the burden of controlling these paths. Our approach employs a typical NI with two slight modifications. First, the controller within the NI is changed to properly adjust the path of the output traffic. Moreover, a special-purpose register is added to every NI in the system to continuously hold the total number of PEs within all interconnected NoC-based systems. Therefore, we call this register the PE Total Number (PETN) register. Once the total number of PEs is modified by connecting new systems or disconnecting old ones, the ICFIFO sends a control signal to each NI in its local system with the new number of PEs. Accordingly, each NI updates its internal PETN register.

C. NoC² OPERATION
1) PROCEDURE FOR INTERFACING NEW SYSTEMS
At startup, the PETN register of each NI is initialized with the total number of PEs within its local system. When another NoC-based system is connected, a handshaking process takes place between the two systems. The ICFIFO of each system transmits a packet to the other one with the total number of its associated PEs. Upon receiving this packet, the other ICFIFO sends a control message to its local NIs, containing the new value of the PETN register. The NIs update their registers accordingly. Finally, each ICFIFO sends an
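The PE-count exchange and PETN update steps described above can be sketched as a small Python model. This is only an illustrative sketch under stated assumptions: the class and method names (`NetworkInterface`, `ICFIFO`, `handshake_packet`, `on_handshake_packet`, `interface`) are hypothetical, exactly two systems are connected, one NI per PE is assumed, and the new PETN value is taken to be the sum of the local and received PE counts. Any later handshake steps are not modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NetworkInterface:
    petn: int = 0  # PE Total Number register held by this NI

    def on_control_message(self, new_total: int) -> None:
        # The NI updates its PETN register when told by its ICFIFO.
        self.petn = new_total


@dataclass
class ICFIFO:
    local_pes: int
    total_pes: int = field(init=False)
    nis: List[NetworkInterface] = field(init=False)

    def __post_init__(self) -> None:
        # Startup: every NI holds the PE count of its local system
        # (assumption: one NI per local PE).
        self.total_pes = self.local_pes
        self.nis = [NetworkInterface(petn=self.local_pes)
                    for _ in range(self.local_pes)]

    def handshake_packet(self) -> int:
        # Packet carrying the number of PEs associated with this ICFIFO.
        return self.local_pes

    def on_handshake_packet(self, remote_pes: int) -> None:
        # Assumption: the new total is local count plus received count.
        self.total_pes += remote_pes
        for ni in self.nis:
            ni.on_control_message(self.total_pes)


def interface(a: ICFIFO, b: ICFIFO) -> None:
    """Exchange PE counts between two newly connected systems."""
    pkt_a, pkt_b = a.handshake_packet(), b.handshake_packet()
    b.on_handshake_packet(pkt_a)
    a.on_handshake_packet(pkt_b)


if __name__ == "__main__":
    fpga_side, pc_side = ICFIFO(local_pes=6), ICFIFO(local_pes=3)
    interface(fpga_side, pc_side)
    print({ni.petn for ni in fpga_side.nis + pc_side.nis})
```

With a 6-PE and a 3-PE system, every NI on both sides ends up holding the same total, so the PEs themselves need no changes when a new system is attached.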
FIGURE 8. Core graph of the employed benchmark applications (The bandwidths, in Mbps, are written on the arrows. The ID of each PE is written inside it beneath its name).
bandwidth of 3,466 Mbps. The MWD benchmark, as shown in Subfigure 8c, has 9 PEs and an aggregate bandwidth of 1,184 Mbps.

B. SIMULATION ENVIRONMENT
The extended NoCTweak simulator [46], [47] is used for all our simulation experiments in this article. To the best of our knowledge, this simulator is the only one that allows NoC-based systems to be interfaced to each other with different degrees of inter-connectivity. Accordingly, in our experiments, we first partition every benchmark into two clusters. These clusters are instantiated as standalone systems in the simulator. These systems are then interconnected using the centralized as well as the NoC² approaches. Unfortunately, the extended NoCTweak does not support the TDMA approach. Accordingly, in this article, this approach is only evaluated through hardware implementation. PEs within each of the two interfaced systems are mapped onto a mesh topology, because it is the only topology that is supported by the simulator. Furthermore, mapping PEs onto this mesh is done using the high-performance bandwidth-oriented mapping heuristic, NMAP [41]. This mapping heuristic is again the one employed by the simulator.

C. HARDWARE IMPLEMENTATION ENVIRONMENT
Figure 9 shows the hardware environment used in our evaluation. For each benchmark, the clusters, which result from the partitioning process, are emulated by traffic generators that run on two hardware modules. These modules are an Artix-7 35T FPGA [48] and a Linux-based PC. The two modules are interconnected using an Ethernet link. Due to the size limitation of the employed FPGA, it can emulate at most six PEs. Therefore, in our partitioning process, we restrict the size of one cluster to six. This resultant 6-PE cluster is executed on the FPGA. The other cluster, which contains the remaining PEs, is executed on the PC. The PC also runs Wireshark [49] for capturing the inter-NoC Ethernet traffic. In order to be consistent with our simulation environment, PEs of each cluster are mapped using the NMAP heuristic onto a mesh topology. Figure 9 shows an example of how the MWD benchmark is executed on the employed hardware environment. The benchmark is partitioned into 6-PE and 3-PE clusters. The 6-PE cluster is executed on the FPGA, whereas the 3-PE one is executed on the PC.

D. BANDWIDTH SCALING
Benchmark applications have communication bandwidth requirements in Mbps, with PEs running at 1000 MHz. In our experiments, PEs are emulated by traffic generators. Based on the employed FPGA, these traffic generators run at 16 MHz. Therefore, the original bandwidth of each benchmark should be scaled down by a factor of 64. This value of 64 comes from rounding the ratio of the benchmark execution frequency to that of our FPGA implementation up to the nearest 2^n, i.e., 1000/16 = 62.5, and the nearest 2^n is 64. Accordingly, the down-scaled bandwidth that is actually generated
FIGURE 10. Optimal and worst clustering of the three considered benchmarks (bandwidth is the scaled one in Mbps).
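The bandwidth-scaling rule of Subsection D (divide the benchmark frequency by the FPGA frequency, then round up to the nearest power of two) can be checked with a few lines of Python. The function names `scaling_factor` and `scale_bandwidth` are hypothetical helpers introduced for this sketch, and the scaled value of the MWD aggregate bandwidth is computed here purely as an arithmetic illustration of the rule.

```python
import math


def scaling_factor(benchmark_mhz: float, fpga_mhz: float) -> int:
    """Round the frequency ratio up to the nearest power of two."""
    ratio = benchmark_mhz / fpga_mhz          # 1000 / 16 = 62.5
    return 2 ** math.ceil(math.log2(ratio))   # 2^6 = 64


def scale_bandwidth(original_mbps: float, factor: int) -> float:
    """Down-scale an original benchmark bandwidth by the given factor."""
    return original_mbps / factor


if __name__ == "__main__":
    factor = scaling_factor(1000, 16)
    print(factor)                         # 64
    # Illustration: MWD aggregate bandwidth of 1,184 Mbps, scaled down.
    print(scale_bandwidth(1184, factor))  # 18.5
```

Using a power of two as the factor keeps the scaling a simple shift in hardware, which is one plausible reason for rounding 62.5 up to 64.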
generates the worst clustering. For each benchmark, the resultant optimal and worst clustering variants are shown in Figure 10. It is worth mentioning that the figure represents the scaled bandwidths, rather than the original ones. Table 1 summarizes the original and the scaled Ethernet bandwidths of the three considered benchmarks. The figure and the table clearly show that the optimal clustering of the three benchmarks has low inter-NoC traffic and high intra-NoC traffic. This is reversed for the worst clustering variant.

VIII. EXPERIMENTAL RESULTS
In this section, we assess the performance of different NoC interfacing approaches using synthetic traffic as well as the three considered benchmarks. As mentioned in Section IV, our NoC² approach is mainly proposed to enhance the throughput and the latency. Therefore, we present many results to extensively evaluate different interfacing approaches with respect to these two metrics. However, due to the importance of power and area to any NoC-based design, we conclude this section by quickly verifying the efficiency of our approach with respect to these two latter metrics.
Throughout this section, the optimal as well as the worst clustering variants are considered for evaluation. In all our experiments, the simulation and the hardware implementation results are found to be consistent with each other. Therefore, for brevity, we present only one of them. First, in Subsection VIII-A, we show and discuss our simulation results for evaluating the overall performance of the whole interconnected systems. This overall performance is then split into its inter-NoC and intra-NoC parts. The two parts are discussed in Subsections VIII-B and VIII-C, respectively. Unlike the
overall performance subsection, the hardware implementation results are shown and discussed in these two subsections. After evaluating the throughput and latency performance completely, we finally verify that our approach does not consume more power than the centralized counterpart and that its area overhead could be tolerated. Therefore, in Subsection VIII-D, we present sample results with synthetic traffic and an increasing number of PEs to confirm these two points.
TABLE 1. Original and scaled Ethernet inter-NoC bandwidths, in Mbps, for optimal and worst clustering of the three considered benchmarks.

A. OVERALL PERFORMANCE EVALUATION
In this subsection, we evaluate the overall performance of the whole system. Accordingly, the two sub-systems resulting from our clustering step are interconnected to each other using the centralized as well as our NoC² approaches. Unfortunately, the simulator does not support any of the TDMA approaches. Therefore, their results are not included in this subsection. We further implement a non-clustered architecture in which all PEs within a benchmark constitute only one system, as defined in Section II. This architecture abstracts the NoC customization techniques with minimal inter-NoC traffic, as discussed in Subsection V-A. The non-clustered architecture represents the maximum practical performance that could be attained if the inter-NoC traffic overheads did not exist. Finally, the theoretical peak performance is also included in the comparison.
FIGURE 11. Average router throughput and packet latency resulting from different approaches for the two clustering variants of the three considered benchmark applications.
Subfigure 11a shows the throughput resulting from the four approaches. The presented throughput is an average useful one over all routers within an architecture, i.e., a goodput as defined in Section II. For any benchmark or clustering variant, the peak, non-clustered, centralized, and NoC² approaches are represented by orange, green, gray, and black bars, respectively. From these results, we have three observations. First, the results show that our NoC² approach outperforms the centralized one in terms of average throughput. Second, the figure shows that the throughput of NoC² is very close to the maximum practical one of the non-clustered architecture. Except for the MPEG4 benchmark, NoC² is also close to the theoretical peak of the hypothetical zero-delay architecture. Numerically, relative to the non-clustered architecture, our NoC² approach achieves an average throughput of 96.50%, 92.60%, 94.60%, 88.30%, 98.32%, and 93.82% for VOPD-optimal, VOPD-worst, MPEG4-optimal, MPEG4-worst, MWD-optimal, and MWD-worst, respectively. In contrast, the centralized approach achieves 93.20%, 82.60%, 81.30%, 70.50%, 94.20%, and 81.10% for the six respective cases. These results clearly emphasize the superiority of our NoC² approach over the centralized one. Third, the figure shows that the most degraded performance of the centralized approach occurs for the MPEG4 benchmark as well as the worst clustering of the other two benchmarks. According to Table 1, all of these variants have high inter-NoC traffic. Evidently, the gateway PE in the centralized approach creates a hotspot that causes the performance to deteriorate. Moreover, increasing the network load, by routing the inter-NoC traffic to this communicating PE, is another reason that degrades the throughput of the centralized approach. In contrast, NoC² does not create this hotspot and allows direct one-hop connection between NIs
and ICFIFO. Consequently, it manages to attain a good performance for all considered simulation
scenarios. This clearly emphasizes the efficiency of our approach in specifically handling
heavily-communicating NoC-based systems.
Subfigure 11b shows the average packet latency of different approaches and simulation scenarios.
Trivially, the hypothetical zero-delay architecture is not included in the figure. Prior to
congestion, it is well known that the throughput and latency are tightly related to one another.
The higher the throughput, the lower the latency and vice versa. Therefore, the latency figure shows
similar behavior to the throughput one. The figure again emphasizes the superiority of NoC² over the
centralized approach, especially for heavily-communicating NoC-based systems. It also affirms that
NoC² manages to achieve an average packet latency close to that of the non-clustered architecture.
Numerically, relative to the non-clustered architecture, NoC² has an increased average packet latency
of 14.00%, 29.60%, 21.60%, 17.20%, 6.72%, and 24.72% for VOPD-optimal, VOPD-worst, MPEG4-optimal,
MPEG4-worst, MWD-optimal, and MWD-worst, respectively. In contrast, the centralized approach has an
increased average packet latency of 47.50%, 99.00%, 93.50%, 118.00%, 29.00%, and 94.48% for the six
respective cases.
As discussed in Subsection III-A, the number of PEs and the traffic volume for NoC-based systems are
expected to increase rapidly in the future. Therefore, we use the stress test technique [51] to
evaluate the efficiency of our proposed approach under very high traffic load. In this technique,
the NoC is pushed into congestion by continuously increasing the injected traffic. Congestion is
identified when the throughput saturates irrespective of the increase in the input traffic. The
performance of this stressed network, in terms of throughput and latency, is then measured. Stress
testing proves useful in validating the NoC performance under extreme traffic conditions. Moreover,
in order to verify the performance of our NoC² approach for various NoC-based systems, we run
multiple stress tests with an increasing number of PEs. We consequently implement different
architectures, whose number of PEs increases from 16 to 100. Each PE injects its traffic using a
uniform distribution. All results of these different architectures yield the same behavior, with
different values. Figures 12 and 13 show two examples of these results for the 64-PE and 100-PE
architectures. The 64 PEs are partitioned into 16 clusters of 2 × 2 PEs. Similarly, the 100 PEs are
partitioned into 25 clusters of 2 × 2 PEs. These clusters are interconnected using our NoC² approach
as well as the centralized one. A non-clustered architecture is also included in the comparison to
represent the ceiling of the practically achievable performance. In terms of average router
throughput and average packet latency, the obtained results show the superiority of our NoC² approach
over the centralized one. After going into congestion, NoC² achieves an average throughput that is
47 Mbps higher than that of the centralized approach. Moreover, for any specific injection rate, our
approach achieves significantly higher throughput and lower latency than the centralized one.
Numerically, after congestion in both the 64-PE and the 100-PE architectures, NoC² realizes 78.00% of
the throughput of the non-clustered architecture, whereas the centralized approach only realizes
56.00% of it. These results clearly show how powerful our approach is in handling very high traffic
load. They reflect the efficiency of our approach in interfacing heavily-communicating NoC-based
systems.
FIGURE 12. Average router throughput and packet latency resulting from applying the stress test to a 64-PE system.
B. INTER-NoC PERFORMANCE EVALUATION
In this subsection, we evaluate the inter-NoC performance of different NoC interfacing approaches.
Before presenting and discussing our results, three points should be emphasized. First, the
non-clustered architecture is absent from this subsection, because its PEs constitute only one system
without any inter-NoC traffic. Second, Subsection VIII-A shows that the throughput and latency
results are tightly related. We therefore present only the throughput results of our hardware
implementation. As the simulation and hardware implementation results are consistent, we only include
the hardware ones to avoid lengthy repeated discussion. The hardware results allow us to include the
two TDMA approaches in our discussion. Third, the throughput results presented in this subsection are
the inter-NoC traffic
FIGURE 13. Average router throughput and packet latency resulting from applying the stress test to a 100-PE system.
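The stress test identifies congestion as the point at which throughput stops growing while the injected traffic keeps increasing. The following is a minimal Python sketch of one way such a saturation point could be located in a sweep of measurements. It is an illustration only, not the authors' measurement procedure: the function name `find_saturation`, the 2% tolerance, and the sample values are all hypothetical.

```python
from typing import List, Optional, Tuple


def find_saturation(samples: List[Tuple[float, float]],
                    tolerance: float = 0.02) -> Optional[Tuple[float, float]]:
    """Return the first (injection_rate, throughput) sample after which
    throughput never grows by more than `tolerance` (relative), or None
    if throughput is still increasing at the end of the sweep.

    `samples` must be sorted by increasing injection rate.
    The tolerance value is an assumption chosen for illustration.
    """
    for i in range(len(samples) - 1):
        base = samples[i][1]
        if base <= 0:
            continue
        # Saturated if every later sample stays within `tolerance` of this one.
        if all(later[1] <= base * (1 + tolerance) for later in samples[i + 1:]):
            return samples[i]
    return None


# Hypothetical sweep: (flit injection rate, average throughput in Mbps).
sweep = [(0.1, 100.0), (0.2, 200.0), (0.3, 290.0),
         (0.4, 300.0), (0.5, 302.0), (0.6, 301.0)]
saturation = find_saturation(sweep)
```

In this made-up sweep, throughput flattens from the 0.4 sample onward, so that sample is reported as the onset of congestion. Throughput and latency measured at or beyond that point would then correspond to the "stressed network" figures.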
FIGURE 16. Intra-NoC load for FPGA routers resulting from different interfacing approaches for the three considered benchmark applications.
FIGURE 17. Power and area comparison between NoC² and the centralized approach for different numbers of cores.
multiple architectures, whose number of PEs increases from 16 to 100. These architectures are
partitioned into 2 × 2 clusters that are interfaced to each other. PEs inject the traffic uniformly
with a Flit Injection Rate (FIR) of 0.5. NoCTweak is used to evaluate the total power consumption of
the two approaches, whereas ORION system-level area modeling [52] is employed to calculate their
total NoC area.
Subfigure 17a shows the average power consumption of the two approaches. Throughout the experimented
range of PEs, the figure shows that our NoC² approach does not burden the power budget of the design.
Rather, it significantly reduces the power consumption, compared to the centralized approach. The
amount of saved power increases further as the number of PEs increases. For example, for 100 PEs,
NoC² consumes only 58.1% of the power of the centralized approach. This is mainly due to the
capability of our approach to avoid congestion, which consumes a noticeable amount of power, as
discussed in Subsection VIII-C. These results add another advantage to our approach. Besides
enhancing the throughput and the latency, it also saves energy.
Subfigure 17b shows the total NoC area of NoC² as well as the centralized approach. This total area
includes the areas of routers, NIs, and links. The figure shows that the area overheads of our
approach are reasonable. Furthermore, these overheads do not increase significantly as the number of
PEs increases. Considering the significant enhancement realized by our approach for the throughput,
latency, and power consumption, these overheads could indeed be tolerated.
IX. CONCLUSION
In this article, we presented a novel approach, NoC², to efficiently interface NoC-based systems. Our
approach distributes buffers wisely through the interfaced systems. It does not overload the
resources within these systems by routing the inter-NoC traffic through them. Once a system is
designed according to our approach, its PEs and the software running on them would not be affected by
the interfacing process. We evaluated our approach, as well as previous interfacing ones, using
synthetic traffic and real benchmark applications. Results show a superior performance of NoC² over
other approaches, especially for heavily-communicating systems. Compared to the currently deployed
centralized approach, NoC² significantly increases the throughput and reduces the latency and the
power consumption. It also does not suffer from the congestion problem that affects its centralized
counterpart. These enhancements are achieved with a small increase in the NoC area, which could be
tolerated. Numerically, our proposed approach achieves 79.00% of the theoretical peak inter-NoC
throughput. In contrast, the centralized, TDMA-RR, and TDMA-WS approaches only achieve 25.00%,
34.00%, and 56.00% of this peak throughput, respectively. Finally, in the future, we plan to extend
NoC² to support multiple CPC standards. We also plan to evaluate our approach against dynamic
arbitration techniques, such as TDMA-WS with TAQP.
REFERENCES
[1] A. A. Morgan, M. W. El-Kharashi, H. Elmiligi, and F. Gebali, “Unified multi-objective mapping and architecture customisation of networks-on-chip,” IET Comput. Digit. Techn., vol. 7, no. 6, pp. 282–293, Nov. 2013.
[2] H. C. Freitas and P. O. A. Navaux, “A high-throughput multi-cluster NoC architecture,” in Proc. 11th IEEE Int. Conf. Comput. Sci. Eng., Sao Paulo, Brazil, Jul. 2008, pp. 56–63.
[3] C. Puttmann, J. C. Niemann, M. Porrmann, and U. Ruckert, “GigaNoC–a hierarchical network-on-chip for scalable chip-multiprocessors,” in Proc. 10th Euromicro Conf. Digital Syst. Design Archit., Methods Tools (DSD), Lubeck, Germany, Aug. 2007, pp. 495–502.
[4] C.-H. Huang, C.-Y. Chen, and H.-Y. Huang, “Hierarchical and dependency-aware task mapping for NoC-based systems,” in Proc. 11th Int. Workshop Netw. Chip Architectures (NoCArc), Fukuoka, Japan, Oct. 2018, pp. 1–6.
[5] R. Manevich, I. Cidon, and A. Kolodny, “Design and dynamic management of hierarchical NoCs,” Microprocessors Microsyst., vol. 40, pp. 154–166, Feb. 2016.
[6] A. S. Hassan, A. A. Morgan, and M. W. El-Kharashi, “Introducing NoC²: Interconnecting NoC-based systems through ethernet,” in Proc. 4th Int. Workshop Design Perform. Netw. Chip (DPNoC), Leuven, Belgium, Jul. 2017, pp. 473–478.
[7] J. Nickolls and W. J. Dally, “The GPU computing era,” IEEE Micro, vol. 30, no. 2, pp. 56–69, Mar. 2010.
[8] B. Bohnenstiehl, A. Stillmaker, J. J. Pimentel, T. Andreas, B. Liu, A. T. Tran, E. Adeagbo, and B. M. Baas, “KiloCore: A 32-nm 1000-processor computational array,” IEEE J. Solid-State Circuits, vol. 52, no. 4, pp. 891–902, Apr. 2017.
[9] D. Greenfield, A. Banerjee, J.-G. Lee, and S. Moore, “Implications of Rent’s rule for NoC design and its fault-tolerance,” in Proc. 1st Int. Symp. Netw.–Chip (NOCS), Princeton, NJ, USA, May 2007, pp. 283–294.
[10] A. Wasicek, “Embedding complex embedded systems in large ethernet–based networks,” Network, vol. 1, no. C2, p. C3, 2011.
[11] A. Beyranvand Nejad, A. Molnos, M. Escudero Martinez, and K. Goossens, “A hardware/software platform for QoS bridging over multi-chip NoC-based systems,” Parallel Comput., vol. 39, no. 9, pp. 424–441, Sep. 2013.
[12] A. Dorai, O. Sentieys, and H. Dubois, “Evaluation of NoC on multi-FPGA interconnection using GTX transceiver,” in Proc. 24th IEEE Int. Conf. Electron., Circuits Syst. (ICECS), Batumi, Georgia, Dec. 2017, pp. 170–173.
[13] J. Hu, U. Y. Ogras, and R. Marculescu, “System-level buffer allocation for application-specific Networks-on-Chip router design,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 12, pp. 2919–2933, Dec. 2006.
[14] M. Kandemir, Y. Zhang, J. Liu, and T. Yemliha, “Neighborhood-aware data locality optimization for NoC-based multicores,” in Proc. Int. Symp. Code Gener. Optim. (CGO), Chamonix, France, Apr. 2011, pp. 191–200.
[15] Y. Xiao, Y. Xue, S. Nazarian, and P. Bogdan, “A load balancing inspired optimization framework for exascale multicore systems: A complex networks approach,” in Proc. IEEE/ACM Int. Conf. Computer-Aided Design (ICCAD), Irvine, CA, USA, Nov. 2017, pp. 217–224.
[16] Y. Xiao, S. Nazarian, and P. Bogdan, “Self-optimizing and self-programming computing systems: A combined compiler, complex networks, and machine learning approach,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 6, pp. 1416–1427, Jun. 2019.
[17] P. K. Sahu and S. Chattopadhyay, “A survey on application mapping strategies for Network-on-Chip design,” J. Syst. Archit., vol. 59, no. 1, pp. 60–76, Jan. 2013.
[18] A. K. Singh, M. Shafique, A. Kumar, and J. Henkel, “Mapping on multi/many-core systems: Survey of current and emerging trends,” in Proc. 50th ACM/EDAC/IEEE Design Automat. Conf. (DAC), Austin, TX, USA, May/Jun. 2013, pp. 1–10.
[19] W. Amin, F. Hussain, S. Anjum, S. Khan, N. K. Baloch, Z. Nain, and S. W. Kim, “Performance evaluation of application mapping approaches for Network-on-Chip designs,” IEEE Access, vol. 8, pp. 63607–63631, 2020.
[20] S. J. Hollis, C. Jackson, P. Bogdan, and R. Marculescu, “Exploiting emergence in on-chip interconnects,” IEEE Trans. Comput., vol. 63, no. 3, pp. 570–582, Mar. 2014.
[21] U. Ogras and R. Marculescu, “‘It’s a small world after all’: NoC performance optimization via long-range link insertion,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 7, pp. 693–706, Jul. 2006.
[22] Y. Xue and P. Bogdan, “Improving NoC performance under spatio-temporal variability by runtime reconfiguration: A general mathematical framework,” in Proc. 10th IEEE/ACM Int. Symp. Netw.–Chip (NOCS), Nara, Japan, Sep. 2016, pp. 1–8.
[23] H. Leake Kidane and E.-B. Bourennane, “Run-time reconfigurable Network-On-chip: A survey,” in Proc. 15th Int. Multi-Conf. Syst., Signals Devices (SSD), Hammamet, Tunisia, Mar. 2018, pp. 846–851.
[24] Z. Qian, P. Bogdan, G. Wei, C.-Y. Tsui, and R. Marculescu, “A traffic-aware adaptive routing algorithm on a highly reconfigurable network-on-chip architecture,” in Proc. 8th IEEE/ACM/IFIP Int. Conf. Hardw./Softw. Codesign Syst. Synth. (CODES+ISSS), Tampere, Finland, Oct. 2012, pp. 161–170.
[25] Y. Wu, C. Lu, and Y. Chen, “A survey of routing algorithm for mesh Network-on-Chip,” Frontiers Comput. Sci., vol. 10, no. 4, pp. 591–601, Aug. 2016.
[26] Z. Lu and Y. Yao, “Dynamic traffic regulation in NoC-based systems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 2, pp. 556–569, Feb. 2017.
[27] Y. Li and A. Louri, “ALPHA: A learning-enabled high-performance Network-on-Chip router design for heterogeneous manycore architectures,” IEEE Trans. Sustain. Comput., early access, Mar. 17, 2020, doi: 10.1109/TSUSC.2020.2981340.
[28] N. Jindal, S. Gupta, D. P. Ravipati, P. R. Panda, and S. R. Sarangi, “Enhancing Network-on-Chip performance by reusing trace buffers,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 4, pp. 922–935, Apr. 2020.
[29] A. Bose and P. Ghosal, “Switching at flit level: A congestion efficient flow control strategy for Network-on-Chip,” in Proc. 28th Euromicro Int. Conf. Parallel, Distrib. Network-Based Process. (PDP), Västerås, Sweden, Mar. 2020, pp. 319–322.
[30] J. Lee, S. Li, H. Kim, and S. Yalamanchili, “Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures,” ACM Trans. Design Autom. Electron. Syst., vol. 18, no. 4, p. 48, Oct. 2013.
[31] J. Fang, Z. Chang, and D. Li, “Exploration on routing configuration of HNoC with intelligent on-chip resource management,” IEEE Access, vol. 8, pp. 12117–12129, 2020.
[32] Z. Qian, S. M. Abbas, and C.-Y. Tsui, “FSNoC: A flit-level speedup scheme for network on-chips using self-reconfigurable bidirectional channels,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 9, pp. 1854–1867, Sep. 2015.
[33] Y. Xue and P. Bogdan, “User cooperation network coding approach for NoC performance improvement,” in Proc. 9th Int. Symp. Netw.–Chip (NOCS), Vancouver, BC, Canada, 2015, pp. 1–8.
[34] F. Mashhadi, A. Asaduzzaman, and M. F. Mridha, “A novel resource scheduling approach to improve the reliability of shuffle-exchange networks,” in Proc. IEEE Int. Conf. Imag., Vis. Pattern Recognit. (icIVPR), Dhaka, Bangladesh, Feb. 2017, pp. 1–6.
[35] L. Xue, W. Ji, Q. Zuo, and Y. Zhang, “Floorplanning exploration and performance evaluation of a new Network-on-Chip,” in Proc. Design, Autom. Test Eur., Grenoble, France, Mar. 2011, pp. 625–630.
[36] K. Duraisamy and P. P. Pande, “Enabling high-performance SMART NoC architectures using on-chip wireless links,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 12, pp. 3495–3508, Dec. 2017.
[37] Y. Ye, J. Xu, B. Huang, X. Wu, W. Zhang, X. Wang, M. Nikdast, Z. Wang, W. Liu, and Z. Wang, “3-D mesh-based optical Network-on-Chip for multiprocessor System-on-Chip,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 4, pp. 584–596, Apr. 2013.
[38] B. K. Joardar, K. Duraisamy, and P. P. Pande, “High performance collective communication-aware 3D Network-on-Chip architectures,” in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Dresden, Germany, Mar. 2018, pp. 1351–1356.
[39] Y. Xue and P. Bogdan, “Scalable and realistic benchmark synthesis for efficient NoC performance evaluation: A complex network analysis approach,” in Proc. 2016 IEEE Int. Conf. Hardw./Softw. Codesign Syst. Synth. (CODES+ISSS), Pittsburgh, PA, USA, Oct. 2–7, 2016, pp. 1–10.
[40] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini, and G. De Micheli, “NoC synthesis flow for customized domain specific multiprocessor systems-on-chip,” IEEE Trans. Parallel Distrib. Syst., vol. 16, no. 2, pp. 113–129, Feb. 2005.
[41] S. Murali and G. De Micheli, “Bandwidth-constrained mapping of cores onto NoC architectures,” in Proc. Design, Autom. Test Eur. Conf. Exhib., vol. 2, Paris, France, 2004, pp. 896–901.
[42] E. B. Van Der Tol and E. G. Jaspers, “Mapping of MPEG-4 decoding on a flexible architecture platform,” Proc. SPIE, vol. 4674, pp. 362–375, Dec. 2001.
[43] S. Murali and G. De Micheli, “SUNMAP: A tool for automatic topology selection and generation for NoCs,” in Proc. 41st Annu. Design Automat. Conf. (DAC), San Diego, CA, USA, Jul. 2004, pp. 914–919.
[44] V. Dumitriu and G. N. Khan, “Throughput-oriented NoC topology generation and analysis for high performance SoCs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 10, pp. 1433–1446, Oct. 2009.
[45] E. G. T. Jaspers and P. H. N. de With, “Chip-set for video display of multimedia information,” IEEE Trans. Consum. Electron., vol. 45, no. 3, pp. 706–715, Aug. 1999.
[46] A. T. Tran and B. M. Baas, “NoCTweak: A highly parameterizable simulator for early exploration of performance and energy efficiency of networks on-chip,” VLSI Comput. Lab, ECE Dept., Univ. California, Davis, Davis, CA, USA, Tech. Rep. ECE-VCL-012-2, Jul. 2012. [Online]. Available: http://vcl.ece.ucdavis.edu/pubs/2012.07.techreport.noctweak/
[47] A. S. Hassan, A. A. Morgan, and M. W. El-Kharashi, “An enhanced network-on-chip simulation for cluster-based routing,” in Proc. 3rd Int. Workshop Design Perform. Netw. Chip (DPNoC), Montreal, QC, Canada, Aug. 2016, pp. 410–417.
[48] Artix-7. Accessed: Oct. 6, 2020. [Online]. Available: https://www.xilinx.com/products/silicon-devices/fpga/artix-7.html
[49] Wireshark. Accessed: Oct. 6, 2020. [Online]. Available: https://www.wireshark.org
[50] G. Karypis, K. Schloegel, and V. Kumar, “ParMETIS: Parallel graph partitioning and sparse matrix ordering library, version 3.1,” Univ. Minnesota, Minneapolis, MN, USA, Tech. Rep. TR 97-060, 1997. [Online]. Available: http://glaros.dtc.umn.edu/gkhome/fetch/sw/parmetis/OLD/ParMetis-3.1.tar.gz
[51] K. Lahiri, A. Raghunathan, and S. Dey, “Evaluation of the traffic-performance characteristics of system-on-chip communication architectures,” in Proc. VLSI Design. 14th Int. Conf. VLSI Design, Bangalore, India, 2001, pp. 29–35.
[52] A. B. Kahng, B. Lin, and K. Samadi, “Improved on-chip router analytical power and area modeling,” in Proc. 15th Asia South Pacific Design Autom. Conf. (ASP-DAC), Taipei, Taiwan, Jan. 2010, pp. 241–246.
AHMED A. MORGAN received the B.Sc. degree (Hons.) from the Faculty of Engineering at Shoubra, Benha
University, Egypt, in 2000, the Diploma degree in electronic design automation (EDA) and VLSI design
from the Information Technology Institute (ITI), Cairo, Egypt, in 2002, the M.Sc. degree from the
Faculty of Engineering at Shoubra, Benha University, in 2005, and the Ph.D. degree from the
University of Victoria, Victoria, BC, Canada, in 2011. He is currently an Assistant Professor with
the Department of Computer Engineering, Cairo University, Egypt. He is currently on leave at the
College of Computers and Information Systems, Umm Al-Qura University, Mecca, Saudi Arabia. He has
about 25 publications that span journals, conferences, book chapters, and technical reports. His
research interests include parallel architectures, multicore systems, digital VLSI design, wireless
sensor networks, and networks-on-chip (NoC) modeling, optimization, and performance evaluation.
AHMED S. HASSAN received the B.Sc. degree in systems and biomedical engineering from Cairo
University, Egypt, in 2011, and the M.Sc. degree from Ain Shams University, Cairo, in 2018. He is
currently working on many-core systems-on-chip (SoC) analysis and design. He is also an Embedded
Software Developer, specialized in multicore architecture, wireless connectivity, and automotive
Ethernet.
M. WATHEQ EL-KHARASHI received the B.Sc. (Hons.) and M.Sc. degrees in computer engineering from Ain
Shams University, Cairo, Egypt, in 1992 and 1996, respectively, and the Ph.D. degree in computer
engineering from the University of Victoria, Victoria, BC, Canada, in 2002. He is currently a
Professor of computer organization with the Department of Computer and Systems Engineering, Ain Shams
University, and an Adjunct Professor with the Department of Electrical and Computer Engineering,
University of Victoria. He has published 115 papers in refereed international journals and
conferences. He has authored two books and seven book chapters. His general research interests
include advanced system architectures, especially networks-on-chip (NoC), systems-on-chip (SoC), and
secure hardware. His specific research interests include hardware architectures for networking
(network processing units) and security; advanced microprocessor design, simulation, performance
evaluation, and testability; and computer architecture and computer networks education.
AYMAN TAWFIK (Member, IEEE) received the B.Sc. (Hons.) and M.Sc. degrees in electrical engineering
from Ain Shams University, Cairo, Egypt, in 1983 and 1989, respectively, and the Ph.D. degree in
electrical engineering from the University of Victoria, Victoria, Canada, in 1995. He has worked as a
Consultant for DND, Canada, and Egetronic, Egypt. He is currently the Head of the Electrical and
Computer Engineering Department, College of Engineering and Information Technology, Ajman University,
Ajman, United Arab Emirates. He has over 30 years of experience in teaching different academic
courses and vast experience in ABET, accreditation, and reaccreditation of electrical engineering
programs. He has published more than 60 research papers in renowned journals and conferences. His
research interests include digital signal processing, digital image processing, VLSI signal
processing, digital communication, the Internet of Things, computer organization, and education
technology.