NETWORKS-ON-CHIPS
Theory and Practice

Edited by
FAYEZ GEBALI
HAYTHAM EL-MILIGI
M. WATHEQ EL-KHARASHI

CRC Press
Taylor & Francis Group
Boca Raton London New York
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
TK5105.546.N48 2009
621.3815’31--dc22 2009000684
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
3 Networks-on-Chip Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Michihiro Koibuchi and Hiroki Matsutani
© 2009 by Taylor & Francis Group, LLC
Fayez Gebali
Haytham El-Miligi
M. Watheq El-Kharashi
Victoria, BC, Canada
Fayez Gebali received a B.Sc. degree in electrical engineering (first class honors) from Cairo University, Cairo, Egypt, a B.Sc. degree in applied mathematics from Ain Shams University, Cairo, Egypt, and a Ph.D. degree in electrical engineering from the University of British Columbia, Vancouver, BC, Canada, in 1972, 1974, and 1979, respectively. For the Ph.D. degree he was a holder of an NSERC postgraduate scholarship. He is currently a professor in the Department of Electrical and Computer Engineering, University of Victoria, Victoria, BC, Canada. He joined the department at its inception in 1984, where he was an assistant professor from 1984 to 1986, associate professor from 1986 to 1991, and professor from 1991 to the present. Gebali has been a registered professional engineer in the Province of British Columbia, Canada, since 1985 and a senior member of the IEEE since 1983. His research interests include networks-on-chips, computer communications, computer arithmetic, computer security, parallel algorithms, processor array design for DSP, and optical holographic systems.
Contributors
Hiroki Matsutani
Department of Information and Computer Science
Keio University
Minato, Tokyo, Japan
[email protected]

Dragomir Milojevic
Université Libre de Bruxelles—ULB
Brussels, Belgium
[email protected]

Gianluca Palermo
Dipartimento di Elettronica e Informazione
Politecnico di Milano
Milano, Italy
[email protected]

Partha Pratim Pande
Washington State University
Pullman, Washington
[email protected]

Ioannis Papaefstathiou
Technical University of Crete
Kounoupidiana, Chania, Greece
[email protected]

Frederic Robert
Université Libre de Bruxelles—ULB
Brussels, Belgium
[email protected]

Resve Saleh
University of British Columbia
British Columbia, Canada
[email protected]

Mariagiovanna Sami
Dipartimento di Elettronica e Informazione
Politecnico di Milano
Milano, Italy
[email protected]

Antoine Scherrer
Laboratoire de Physique
Université de Lyon
ENS-Lyon, France
[email protected]

Julien Schmaltz
Institute for Computing and Information Sciences
Radboud University Nijmegen
Heijendaalseweg, The Netherlands
[email protected]
CONTENTS
1.1 Introduction.................................................................................................... 1
1.2 Related Work .................................................................................................. 3
1.3 Alternative Vertical Interconnection Topologies....................................... 5
1.4 Overview of the Exploration Methodology............................................... 7
1.5 Evaluation—Experimental Results ............................................................. 9
1.5.1 Experimental Setup........................................................................... 9
1.5.2 Routing Procedure .......................................................................... 12
1.5.3 Impact of Traffic Load..................................................................... 13
1.5.4 3D NoC Performance under Uniform Traffic.............................. 14
1.5.5 3D NoC Performance under Hotspot Traffic .............................. 16
1.5.6 3D NoC Performance under Transpose Traffic ........................... 19
1.5.7 Energy Dissipation Breakdown .................................................... 19
1.5.8 Summary .......................................................................................... 22
1.6 Conclusions .................................................................................................. 23
Acknowledgments ................................................................................................ 23
References............................................................................................................... 24
1.1 Introduction
Future integrated systems will contain billions of transistors [1], comprising tens to hundreds of IP cores. These IP cores, implementing emerging complex multimedia and network applications, should be able to deliver rich multimedia and networking services.
(e.g., efficient data transfers) can be achieved through innovations of on-chip
communication strategies.
The design of such complex systems includes several challenges. One chal-
lenge is designing on-chip interconnection networks that efficiently connect
the IP cores. Another challenge is application mapping that makes efficient
FIGURE 1.1
Positioning of the vertical interconnection links for each plane of the 3D NoC (each plane is a 6 × 6 grid). (a) Full vertical interconnection (100%) for a 3D NoC. (b) Uniform distribution of vertical links. (c) Positioning of vertical links at the center of the NoC. (d) Positioning of vertical links at the periphery of the NoC.
The placement of 2D and 3D routers for a 3D mesh NoC is illustrated in Figure 1.1. The figure shows a grid belonging to a 3D NoC in which both 2D and 3D routers exist.
• Full: Where all the routers of the NoC are 3D ones [number of 3D
routers: 64 (100%)].
• Uniform based: Pattern-based topologies with r value equal to three
[by_three pattern, as shown in Figure 1.1(b)], four (by_four), and five
(by_five). Correspondingly, the number of 3D routers is 44 (68.75%),
48 (75%), and 52 (81.25%).
• Odd: In this pattern, all the routers belonging to the same row are
of the same type. Two adjacent rows never have the same type of
router [number of 3D routers: 32 (50%)].
• Edges: Where the center (dimensions x × y) of the 3D NoC has only
2D routers [number of 3D routers: 48 (75%)].
• Center: Where only the center (dimensions x × y) of the 3D NoC
has 3D routers [number of 3D routers: 16 (25%)].
• Side based: Where a side (e.g., outer row) of each plane has 2D
routers. Patterns evaluated have one (one_side), two (two_side), or
three (three_side) sides as “2D routers only.” The number of 3D
routers for each pattern is 48 (75%), 36 (56.25%), and 24 (37.5%),
respectively.
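The router counts and percentages listed above can be cross-checked mechanically. The sketch below is illustrative only; it simply tabulates the counts from this section for the 64-router (4 × 4 × 4) NoC:

```python
# Number of 3D routers per vertical-link pattern for a 64-router (4 x 4 x 4)
# NoC, as listed in Section 1.3. The percentage quoted for each pattern is
# simply count / total routers.
PATTERN_3D_ROUTERS = {
    "full_connectivity": 64,  # 100%
    "by_three": 44,           # 68.75%
    "by_four": 48,            # 75%
    "by_five": 52,            # 81.25%
    "odd": 32,                # 50%
    "edges": 48,              # 75%
    "center": 16,             # 25%
    "one_side": 48,           # 75%
    "two_side": 36,           # 56.25%
    "three_side": 24,         # 37.5%
}

def share_of_3d_routers(pattern, total=64):
    """Fraction of routers that are 3D for a given vertical-link pattern."""
    return PATTERN_3D_ROUTERS[pattern] / total
```

For example, `share_of_3d_routers("odd")` yields 0.5, matching the 50% figure quoted for the odd pattern.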
FIGURE 1.2
An overview of the exploration methodology of alternative topologies for 3D Networks-on-Chip. [Block diagram: routing schemes (xy, odd-even, ...; adaptations of existing algorithms for 3D routing) and stimuli (synthetic traffic: uniform, transpose, hotspot; real application traffic) feed the NoC simulator, which reports metrics: latency and energy (link energy, crossbar energy, router energy, ..., total energy consumption).]
FIGURE 1.3
3D NoC architectures.
The output of the simulation is a log file that contains the relevant evalu-
ated cost factors, such as overall latency, average latency per packet, and the
energy breakdown of the NoC, providing values for link energy consump-
tion, crossbar and router energy consumption, etc. From these energy figures,
we calculate the total energy consumption of the 3D NoCs.
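The total-energy calculation described above amounts to summing the per-component figures from the simulation log. The following is an illustrative sketch; the field names and values are our own assumptions, not the simulator's actual log format:

```python
# Hypothetical per-component energy breakdown (in joules) as it might be
# read from the simulation log file; the keys and values are assumptions.
def total_energy(breakdown):
    """Sum the per-component energy figures reported for the 3D NoC."""
    return sum(breakdown.values())

log_breakdown = {"link": 1.2e-6, "crossbar": 0.8e-6, "router": 1.5e-6}
total = total_energy(log_breakdown)  # 3.5e-6 J for these sample figures
```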
The 3D architectures to be explored may have a mix of 2D and 3D routers,
ranging from very few 3D routers to only 3D routers. To steer the exploration,
we use different patterns (as presented in Section 1.3). The proposed 3D NoCs
can be constructed by placing a number of identical 2D NoCs on individual
planes, providing communication by interplane vias among vertically adja-
cent routers. This means that the position of silicon vias is exactly the same for
each plane. Hence, the router configuration is extended to the third dimen-
sion, whereas the structure of the individual logic blocks (IP cores) remains
unchanged.
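Because the via positions are identical on every plane, a 3D NoC can be generated from one plane template plus a single via mask. The sketch below is our own illustration of that construction, not the simulator's code, and the `center_mask` layout for a 4 × 4 plane is an assumption:

```python
# Sketch: build a 3D NoC by stacking identical 2D planes. One via mask,
# shared by every plane, marks which (x, y) positions get a vertical link;
# routers at those positions become 3D routers.

def build_3d_noc(x_dim, y_dim, planes, via_mask):
    """Return the set of 3D routers (x, y, z) given one shared via mask."""
    return {
        (x, y, z)
        for z in range(planes)
        for (x, y) in via_mask
        if 0 <= x < x_dim and 0 <= y < y_dim
    }

# "center" pattern on a 4 x 4 plane: only the 2 x 2 middle gets vertical
# links, so 4 vias per plane x 4 planes = 16 3D routers (25%).
center_mask = {(x, y) for x in (1, 2) for y in (1, 2)}
routers_3d = build_3d_noc(4, 4, 4, center_mask)
```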
1: function ROUTINGXYZ
2: src : type Node; //this is the source node
3: dst : type Node; //this is the destination node (final)
4:
5: findCoordinates(); //returns src.x, src.y, src.z, dst.x, dst.y, and dst.z
6:
7: for all plane ∈ NoC do
8: if packet passes from plane then
9: findTmpDestination(); //find a temporary destination of the packet for each plane of the NoC that the packet passes from
10: end if
11: end for
12: while tmpDestination NOT dst do //if we have not reached the final destination...
13: packet.header = tmpDestination;
14: end while
15: end function
16: function FINDTMPDESTINATION //for each plane that the packet is going to traverse
17: tmpDestination.x = dst.x
18: tmpDestination.y = dst.y
19: tmpDestination.z = src.z //for xyz routing
20:
21: for all valid Nodes ∈ plane do
22: if link NOT valid then //if the vertical link does not exist; this information is obtained through the vertical interconnection patterns input file
23: newLink = computeManhattanDistance(); //returns the position of the vertical link with the smallest Manhattan distance
24: tmpDestination = newLink;
25: else
26: tmpDestination = link;
27: end if
28: end for
29: end function
FIGURE 1.4
Routing algorithm modifications. (// denotes a comment in the algorithm)
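The FINDTMPDESTINATION pseudocode of Figure 1.4 can be rendered in Python as follows. This is a hedged sketch: the tuple-based coordinates and the `valid_links` set (standing in for the vertical interconnection patterns file) are assumptions for illustration, not the Worm_Sim data structures:

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) positions in a plane."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def find_tmp_destination(src, dst, valid_links):
    """Temporary destination for xyz routing on one plane (Figure 1.4).

    src, dst    -- (x, y, z) source and final destination of the packet
    valid_links -- set of (x, y) positions that have a vertical link in the
                   source plane (from the interconnection-pattern file)
    """
    tmp = (dst[0], dst[1], src[2])  # keep the packet in its current plane
    if (tmp[0], tmp[1]) in valid_links:
        return tmp
    # No vertical link at the desired position: pick the valid link with
    # the smallest Manhattan distance from the desired (x, y).
    nx, ny = min(valid_links, key=lambda pos: manhattan(pos, (tmp[0], tmp[1])))
    return (nx, ny, src[2])
```

For instance, with `valid_links = {(0, 0), (3, 3)}`, a packet at (0, 0, 0) headed for (2, 2, 2) is redirected to (3, 3, 0), since (3, 3) is 2 hops from the desired (2, 2) while (0, 0) is 4 hops away.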
We assume that the energy consumption of a vertical (interplane) link equals the consumption of a link between two neighboring routers at the same plane (if they have the same length).
More specifically, because 3D integration technology, which provides communication among layers using through-silicon vias (TSVs), has not yet been explored sufficiently, 3D-based systems design still needs to be addressed. Due to the large variation of 3D TSV parameters, such as diameter, length, dielectric thickness, and fill material, among alternative process technologies, a wide range of measured resistances, capacitances, and inductances has been reported in the literature. Typical values for the size (diameter) of TSVs are about 4 × 4 μm, with a minimum pitch around 8–10 μm, whereas their total length, starting from plane T1 and terminating on plane T3, is 17.94 μm, implying wafer thinning of planes T2 and T3 to approximately 10–15 μm [54–56].
The different TSV fabrication processes lead to a high variation in the corresponding electrical characteristics. The resistance of a single 3D via varies from 20 mΩ to as high as 600 mΩ [55,56], with a feasible value (in terms of fabrication) around 30 mΩ. Regarding the capacitances of these vias, the values vary from 40 fF to over 1 pF [57], with a feasible fabrication value around 180 fF. In the context of this work, we assume a resistance of 350 mΩ and a capacitance of 2.5 fF.
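Taking the assumed values at face value, the intrinsic RC product of a single via is easy to bound. A back-of-the-envelope sketch, using only the numbers quoted above:

```python
# Per-via electrical values assumed in this work: R = 350 mOhm, C = 2.5 fF.
R_VIA = 350e-3   # ohms
C_VIA = 2.5e-15  # farads

# Intrinsic RC time constant of one via, in seconds:
tau = R_VIA * C_VIA  # 8.75e-16 s, i.e. well under a femtosecond

# The assumed resistance lies inside the 20-600 mOhm range reported
# in [55,56]; the assumed capacitance is below the 40 fF - 1 pF range.
```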
Using our extended version of the NoC simulator, we have performed
simulations involving a 64-node and a 144-node architecture with 3D mesh
and torus topologies with synthetic traffic patterns. The configuration files
describing the corresponding link patterns are supplied to the simulator as
an input. The sizes of the 3D NoCs we simulated were 4 × 4 × 4 and 6 × 6 × 4,
whereas the equivalent 2D ones were 8 × 8 and 12 × 12. We have used three
types of input (synthetic traffic) and three traffic loads (heavy, normal, and low). The traffic schemes used are uniform, transpose, and hotspot.
We have used the three routing schemes presented in Worm_Sim [14], and
extended them in order to function in a 3D NoC as follows:
1. For each packet, we know the source and destination nodes and can
find the positions of these nodes in the topology. The on-chip “coor-
dinates” of the nodes for the destination one are dst.x, dst.y,
dst.z and for the source one are src.x, src.y, src.z.
2. By doing so, we can formulate the temporary destinations, one for each plane. For each plane a packet has to traverse to arrive at its final destination, the algorithm initially sets the route to a temporary destination located at position dst.x, dst.y, src.z. The algorithm takes into consideration the "direction" the packet is going to follow across the planes (i.e., whether it is going to an upper or lower plane relative to its "source" plane) and finds the nearest valid link at each plane; the outcome is used to properly update the z coefficient of the temporary destination's position. A valid link is any vertical interconnection link available in the plane that the packet traverses. This information is obtained from the vertical interconnection patterns file. A link is uniquely identified by the node to which it is connected and by its direction. So, for all the specified valid links located at the same plane, the header flit of the packet checks whether the desired route matches the destination's up or down link.
3. If there is no match between them, compute the Manhattan distance
(in case of 3D torus topology, we have modified it to produce the
correct Manhattan distance between the two nodes).
4. Finally, the valid link with the smallest Manhattan distance is cho-
sen, and its corresponding node is chosen to be the temporary des-
tination at each plane the packet is going to traverse.
5. After finding a set of temporary destinations (each one located at a
different plane), they are stored into the header flit of the packet. The
aforementioned temporary destinations may or may not be used,
as the packet is being routed during the simulation, so they are
“candidate” temporary destinations. The decision of being just a
candidate or the actual destination per plane is taken based on one
of two scenarios: (1) if a set of vertical links, which exhibited rela-
tively high utilization during a previous simulation with the same
network parameters, achieved the desired minimum link commu-
nication volume or (2) according to a given vertical link pattern such
as the one presented in Section 1.3.
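Step 3 above notes that the Manhattan distance computation is modified for the 3D torus: with wraparound links, each dimension contributes the shorter of the direct and the wrapped hop count. A minimal sketch of that modification (our own formulation, not the simulator's code):

```python
def torus_manhattan(a, b, dims):
    """Manhattan distance on a torus.

    For each dimension of size k, the distance is the smaller of the
    direct hop count |a_i - b_i| and the wraparound count k - |a_i - b_i|.
    """
    return sum(min(abs(ai - bi), k - abs(ai - bi))
               for ai, bi, k in zip(a, b, dims))

# On a 6 x 6 plane, nodes (0, 0) and (5, 5) are only 2 hops apart via the
# wraparound links, instead of 10 hops on a plain mesh.
```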
FIGURE 1.5
Impact of traffic load on 2D and 3D NoCs (for all different types of traffic used). [Bar chart: normalized latency (90%–150%) per interconnection pattern.]
FIGURE 1.6
Impact of traffic load on 2D and 3D mesh and torus NoCs (for uniform traffic). [Bar chart: normalized latency (50%–350%) per interconnection pattern; series: Mesh (heavy/medium/low uniform) and Torus (heavy/medium/low uniform).]
FIGURE 1.7
Uniform traffic (medium load) on a 3D NoC for alternative interconnection topologies. (a) Experimental results for a 4 × 4 × 4 3D mesh. [Bar charts of normalized results per interconnection pattern; the second panel covers the 144-node (12 × 12 / 6 × 6 × 4) configuration.]
better results. It is worth noticing that the overall performance of the 2D NoC
significantly decreases, exhibiting around 50% increase in energy and latency.
When we increase the traffic load by increasing the packet generation rate
by 50%, we see that all patterns have worse behavior than the full_connectivity
3D NoC. The reason is that by using a pattern-based 3D NoC, we decrease
the number of 3D routers by decreasing the number of vertical links, thereby
reducing the connectivity within the NoC. As expected, this reduced connectivity has a negative impact in cases of increased traffic.
For low traffic loads, the patterns can become beneficial because the need for communication resources is lower. This effect is illustrated in
Figure 1.8. The figure shows the experimental results for 64- and 144-node 2D
and 3D NoCs under low uniform traffic and xyz routing. The exception is the
edges pattern in the 64-node 3D NoC [Figure 1.8(a)], where all the 3D routers
reside on the edges of each plane of the 3D NoC. This results in a 7% increase
in the packet latency. Again it is worth noticing that as the NoC dimensions
increase, the performance of the 2D NoC decreases. This can be clearly seen
in Figure 1.8(b), where the 2D NoC has 38% increased energy dissipation.
We have also compared the performance of the proposed approach against
that achievable with a torus network, which provides wraparound links
added in a systematic manner. Note that the vertical links connecting the
bottom with the upper planes are not removed, as this is the additional fea-
ture of the torus topology when compared to the mesh. Our simulations
show that, using the transpose traffic scheme, the vertical link patterns exhibit notable results; this trend continues as the dimensions of the NoC grow.
The explanation is that the flow of packets between a source and a destina-
tion follows a diagonal course among the nodes at each plane. At the same
time, the wraparound links of the torus topology play a significant role in
preserving the performance even when some vertical links are removed. The
results show that increasing the dimensions of the NoC increases the energy
savings, when the link patterns are applied. But, this is not true for the case of
mesh topology. In particular, in the 6 × 6 × 4 3D torus architecture, using the
by_five, by_four, by_three, one_side, and two_side patterns show better results as
far as the energy consumption is concerned. For instance, the two_side pattern
exhibits 7.5% energy savings and 32.84 cycles increased latency relative to the
30 cycles of the fully vertical connected 3D torus topology.
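The latency penalty of the two_side pattern relative to the fully connected torus can be recomputed from the cycle counts quoted above. This is a back-of-the-envelope check using only the numbers in the text, not part of the original evaluation:

```python
def relative_change(new, baseline):
    """Signed relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline

# two_side pattern on the 6 x 6 x 4 torus vs. the fully vertically
# connected torus: 32.84 cycles vs. 30 cycles of average packet latency.
latency_penalty = relative_change(32.84, 30.0)  # roughly +9.5% latency
# The corresponding energy saving quoted in the text is 7.5%.
```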
FIGURE 1.8
Uniform traffic (low load) on a 3D NoC for alternative interconnection topologies. (a) Experimental results for a 4 × 4 × 4 3D mesh. [Bar charts of normalized results per interconnection pattern; the second panel covers the 144-node (12 × 12 / 6 × 6 × 4) configuration.]
FIGURE 1.9
Hotspot traffic (low load) on a 3D NoC for alternative interconnection topologies. (a) Experimental results for a 4 × 4 × 4 3D mesh. [Bar charts of normalized results per interconnection pattern; the second panel covers the 144-node (12 × 12 / 6 × 6 × 4) configuration.]
vertical connected architecture (that was expected due to the location where
the hotspot nodes were positioned).
In Figure 1.10, the simulation results for the two 3D NoC architectures when
triggered by a hotspot-type traffic are presented. Figures 1.10(a) and 1.10(b)
present the results for the mesh and torus architectures, respectively, showing
gains in energy consumption and area, with a negligible penalty in latency.
Again, the architectures where congestion is experienced are highlighted.
These results are also compared to their equivalent 2D architectures. For
the 8 × 8 2D NoC (same number of cores as the 4 × 4 × 4 architecture), we observe
25% increased latency and 40% increased energy consumption compared to
the one_side link pattern, whereas the 12 × 12 mesh (same number of cores as
the 6 × 6 × 4 architecture) shows 46% increase in latency and 49% increase
in energy consumption compared to the same pattern using uniform traffic.
In addition, comparing the by_four pattern on the 64-node architecture under
transpose traffic shows 31% and 18% reduced latency and total network con-
sumption, respectively. However, in the case of hotspot traffic and employing
the two_side link pattern, these numbers change to 24% reduced latency and
56% reduced energy consumption.
FIGURE 1.10
Hotspot traffic (medium load) on a 3D NoC for alternative interconnection topologies. [Two panels of bar charts: (a) mesh and (b) torus architectures, normalized results per interconnection pattern, including the 8 × 8 2D equivalent.]
FIGURE 1.11
Transpose traffic on a 3D NoC for alternative interconnection topologies. (a) Experimental results for a 4 × 4 × 4 3D mesh. [Bar charts of normalized results per interconnection pattern; congested architectures are highlighted; the second panel covers the 144-node (12 × 12) configuration.]
FIGURE 1.12
An overview of the energy breakdown in a 3D NoC (4 × 4 × 4 3D mesh, uniform traffic, xyz-old routing). [Bar chart: normalized energy (0%–160%) broken down into Link, Crossbar, Router, Arbiter, Buffer Read, and Buffer Write, for the 8 × 8 mesh and the by_five, by_four, by_three, center, edges, odd, one_side, three_side, two_side, and full_connectivity patterns.]
energy 62%. The normalized results of the energy consumption for a uniform
traffic on a 4 × 4 × 4 NoC are presented in Figure 1.12.
1.5.8 Summary
A summary of the experimental results is presented in Table 1.1. The energy
and latency values that were obtained are compared to the ones of the 3D
mesh full vertically interconnected NoC. The three types of traffic are shown
in the first column. The next two columns present the gains [min to max values (in %)] for the energy dissipation. The fourth and fifth columns show the min and max values for the average packet latency. It can be seen that energy reduction of up to 29% can be achieved. But gains in energy dissipation
TABLE 1.1
Experimental Results: Min–Max Impact on Costs (Energy and Latency) with Medium Traffic Load
[Header row: Traffic Patterns | Normalized Energy (min, max) | Normalized Latency (min, max); the data rows were not recovered.]
1.6 Conclusions
Networks-on-Chips are becoming more and more popular as a solution able
to accommodate large numbers of IP cores, offering an efficient and scalable
interconnection network. Three-dimensional NoCs are taking advantage of
the progress of integration and packaging technologies offering advantages
when compared to 2D ones. Existing 3D NoCs assume that every router of a
grid can communicate directly with the neighboring routers of the same grid
and with the ones of the adjacent planes. This communication can be achieved
by employing wire bonding, microbump, or through-silicon vias [35].
All of these technologies have their advantages and disadvantages. Reduc-
ing the number of vertical connections makes the design and final fabrication
of 3D systems easier. The goal of the proposed methodology is to find heterogeneous 3D NoC topologies, with a mix of 2D and 3D routers and vertical link interconnection patterns, that perform best for the incoming traffic. In
this way, the exploration process evaluates the incoming traffic and the in-
terconnection network, proposing an incoming traffic-specific alternative 3D
NoC. Aiming in this direction, we have presented a methodology showing that, by employing an alternative 3D NoC vertical link interconnection network (in essence, an NoC with fewer vertical links), we can achieve gains in energy consumption (up to 29%), in the average packet latency (up to 2%), and in the area occupied by the routers of the NoC (up to 18%).
Extensions of this work could include not only more heterogeneous 3D
architectures but also different router architectures, providing better adaptive routing algorithms and performing further customizations targeting heterogeneous NoC architectures. In this way, it would be possible to create even more heterogeneous 3D NoCs. As stimuli for the NoCs, a move toward using real applications, in addition to even more types of synthetic traffic, would be useful. By doing so, it would become feasible to propose
application-domain-specific 3D NoC architectures.
Acknowledgments
The authors would like to thank Dr. Antonis Papanikolaou (IMEC vzw.,
Belgium) for his helpful comments and suggestions. This research is sup-
ported by the 03ED593 research project, implemented within the framework
References
1. Semiconductor Industry Association, “International technology roadmap
for semiconductors,” 2006. [Online]. Available: http://www.itrs.net/Links/
2006Update/2006UpdateFinal.htm.
2. S. Murali and G. D. Micheli, “Bandwidth-constrained mapping of cores onto
NoC architectures,” In Proc. of DATE. Washington, DC: IEEE Computer Society,
2004, 896–901.
3. J. Hu and R. Marculescu, “Energy- and performance-aware mapping for regular
NoC architectures,” IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems 24 (2005) (4): 551–562.
4. L. Benini and G. de Micheli, “Networks on chips: a new SoC paradigm,”
Computer 35 (2002) (1): 70–78.
5. A. Jantsch and H. Tenhunen, eds., Networks on Chip. New York: Kluwer Academic
Publishers, 2003.
6. K. Goossens, J. Dielissen, and A. Radulescu, “The Æthereal network on chip:
Concepts, architectures, and implementations,” IEEE Des. Test, 22 (2005) (5):
414–421.
7. STMicroelectronics, “STNoC: Building a new system-on-chip paradigm,” White
Paper, 2005.
8. S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, et al.,
“An 80-tile 1.28 TFLOPS network-on-chip in 65nm CMOS,” In Proc. of Interna-
tional Solid-State Circuits Conference (ISSCC). IEEE, 2007, 98–589.
9. U. Ogras and R. Marculescu, “Application-specific network-on-chip architecture
customization via long-range link insertion,” In Proc. of ICCAD (6–10 Nov.) 2005,
246–253.
10. E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny, “Cost considerations in network
on chip,” Integr. VLSI J. 38 (2004) (1): 19–42.
11. E. Beyne, “3D system integration technologies,” In International Symposium on
VLSI Technology, Systems, and Applications, Hsinchu, Taiwan, April 2006, 1–9.
12. ——, “The rise of the 3rd dimension for system integration,” In Proc. of Interna-
tional Interconnect Technology Conference, Burlingame, CA 5–7 June, 2006, 1–5.
13. J. Joyner, R. Venkatesan, P. Zarkesh-Ha, J. Davis, and J. Meindl, “Impact of three-
dimensional architectures on interconnects in gigascale integration,” IEEE Trans-
actions on Very Large Scale Integration (VLSI) Systems, 9 (Dec. 2001) (6): 922–928.
14. R. Marculescu, U. Y. Ogras, and N. H. Zamora, “Computation and communica-
tion refinement for multiprocessor SoC design: A system-level perspective,” In
Proc. of DAC. New York: ACM Press, 2004, 564–592.
15. J. Duato, S. Yalamanchili, and N. Lionel, Interconnection Networks: An Engineering
Approach. San Francisco, CA: Morgan Kaufmann Publishers Inc., 2002.
16. W. Dally and B. Towles, Principles and Practices of Interconnection Networks.
San Francisco, CA: Morgan Kaufmann Publishers Inc., 2003.
CONTENTS
2.1 Introduction.................................................................................................. 29
2.2 Circuit Switching ......................................................................................... 33
2.3 Time Division Multiplexing Virtual Circuits ........................................... 37
2.3.1 Operation and Properties of TDM VCs........................................ 38
2.3.2 On-Chip TDM VCs ......................................................................... 39
2.3.3 TDM VC Configuration.................................................................. 40
2.3.4 Theory of Logical Network for TDM VCs................................... 41
2.3.5 Application of the Logical Network Theory for TDM VCs ...... 44
2.4 Aggregate Resource Allocation ................................................................. 46
2.4.1 Aggregate Allocation of a Channel .............................................. 46
2.4.2 Aggregate Allocation of a Network.............................................. 51
2.5 Dynamic Connection Setup ....................................................................... 53
2.6 Priority and Fairness ................................................................................... 56
2.7 QoS in a Telecom Application.................................................................... 58
2.7.1 Industrial Application .................................................................... 58
2.7.2 VC Specification .............................................................................. 59
2.7.3 Looped VC Implementation .......................................................... 60
2.8 Summary....................................................................................................... 62
References............................................................................................................... 62
2.1 Introduction
The provision of communication services with well-defined performance
characteristics has received significant attention in the NoC community
because for many applications it is not sufficient to simply maximize average performance. It is envisioned that complex NoC-based architectures will host complex, heterogeneous sets of applications. In a scenario
where many applications compete for shared resources, a fair allocation pol-
icy that gives each application sufficient resources to meet its delay, jitter, and
throughput requirements is critical. Each application, or each part of an appli-
cation, should obtain exactly those resources needed to accomplish its task,
not more nor less. If an application gets too small a share of the resources,
it will either fail completely, because of a critical deadline miss, or its utility
will be degraded, for example, due to bad video or audio quality. If an appli-
cation gets more of a resource than needed, the system is over-dimensioned
and not cost effective. Moreover, well-defined performance characteristics
are a prerequisite for efficient composition of components and subsystems
into systems [1]. If all subsystems come with QoS properties the system per-
formance can be statically analyzed and, most importantly, the impact of the
composition on the performance of individual subsystems can be understood
and limited. In the absence of QoS characteristics, all subsystems have to be
reverified because the interference with other subsystems may severely affect
a subsystem’s performance and even render it faulty. Thus, QoS is an enabling
feature for compositionality.
This chapter discusses resource allocation schemes that provide the shared
NoC communication resources with well-defined Quality of Service (QoS)
characteristics. We exclusively deal with the performance characteristics
delay, throughput, and, to a lesser extent, delay variations (jitter).
We group the resource allocation techniques into three main categories.
Circuit switching∗ allocates all necessary resources during the entire life-
time of a connection. Figure 2.1(b) illustrates this scheme. In every switch
there is a table that defines the connections between input ports and output
ports. The output port is exclusively reserved for packets from that particular
input port. In this way all the necessary buffers and links are allocated for a
connection between a specific source and destination. Before a data packet
can be sent, the complete connection has to be set up; and once it is done, the
communication is very fast because all contention and stalling is avoided.
The table can be implemented as an optimized hardware structure lead-
ing to a very compact and fast switch. However, setting up a new connec-
tion has a relatively high delay. Moreover, the setup delay is unpredictable
because it is not guaranteed that a new connection can be set up at all.
Circuit switching is justified only if a connection is stable over a long time
and utilizes the resources to a very high degree. With few exceptions such
as SoCBUS [2] and Crossroad [3], circuit switching has not been widely used
in NoCs because only few applications justify the exclusive assignment of
resources to individual connections. Also, the problem of predictable com-
munication is not avoided but only moved from data communication time
to circuit setup time. Furthermore, the achievable load of the network as a
whole is limited in practice because a given set of circuits blocks the setup
∗ Note that some authors categorize time division multiplexing (TDM) techniques as a circuit
switching scheme. In this chapter we reserve the term circuit switching for the case when resources
are allocated exclusively during the entire lifetime of a connection.
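The per-switch connection table described above can be illustrated with a
small sketch; the class and port names below are our own and not taken from
SoCBUS or any other cited design:

```python
class CircuitSwitch:
    """Sketch of a circuit-switched crossbar: each output port is
    exclusively reserved for one input port for a connection's lifetime."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.table = {}          # output port -> reserving input port

    def setup(self, in_port, out_port):
        """Try to reserve out_port for in_port; fails if already taken,
        which is why connection setup time is unpredictable."""
        if out_port in self.table:
            return False
        self.table[out_port] = in_port
        return True

    def teardown(self, out_port):
        self.table.pop(out_port, None)

    def route(self, in_port):
        """Forwarding is a fast table lookup, free of contention."""
        for out_port, reserved_in in self.table.items():
            if reserved_in == in_port:
                return out_port
        return None

sw = CircuitSwitch(["local", "north", "east", "south", "west"])
assert sw.setup("west", "east")        # connection A: west -> east
assert not sw.setup("north", "east")   # east is exclusively reserved
assert sw.route("west") == "east"
```

Once the exclusive reservation is in place, data forwarding never blocks,
which is exactly the property that makes the switch compact and fast.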
FIGURE 2.1
Resource allocation schemes based on TDM and circuit switching. (Sw = switch;
NI = network interface; A, B, C = traffic flows.) (a) TDM based resource
allocation: each switch holds a slot table assigning time slots to flows.
(b) Circuit switching based resource allocation: each switch holds a table
connecting input ports to output ports.
FIGURE 2.2
Aggregate resource allocation: each NI is assigned a traffic budget for sent
and received traffic. (Sw = switch; NI = network interface; A, B, C = traffic
flows.)
slot to appear. This is a problem for low delay, low throughput traffic because
it either gets much more bandwidth than needed or its delay is very high.
Aggregate resource allocation is a coarse-grained and flexible allocation
scheme. Figure 2.2 shows that each resource is assigned a traffic budget for
both sent and received traffic. The reason for this is that if all resources comply
with their budget bounds, the network is not overloaded and can guarantee
minimum bandwidth and maximum delay properties for all the flows. Traffic
budgets can be defined per resource or per flow, and they have to take into
account the communication distance to correctly reflect the load in the net-
work. Aggregate allocation schemes are flexible but provide looser delay
bounds and require larger buffers than more fine-grained control mecha-
nisms such as TDM and circuit switching. This approach has been elaborated
by Jantsch [1] and suitable analysis techniques can be adapted from flow
regulation theories in communication networks such as network calculus [4].
In the following sections we will discuss these three main groups of re-
source allocation in more detail. In Section 2.5 we take up dynamic setup of
connections, and in Section 2.6 we elaborate some aspects of priority schemes
and fairness of resource allocation. Finally we give an example of how to use
a TDM resource allocation scheme in a complex telecom system.
FIGURE 2.3
All the resources used for communication between a source and a destination
(NIs, buffers, links, crossbar ports) can be allocated in different ways.
communication from the entire left half of the network to the right half is pos-
sible. If these four connections live for a long time, they will completely block
a large set of new connections, independent of the routing policy employed,
even if they utilize only a tiny fraction of the network or link bandwidth. If
restrictive routing algorithms such as deterministic dimension order routing
FIGURE 2.4
A few active connections may inhibit the setup of new connections although
communication bandwidth is available. (Sw = switch; links marked "in use"
are allocated to active connections.)
are used, a few allocated links can already completely stall the communica-
tion between large parts of the system. For instance, if only one link, that is,
link A in Figure 2.4, is used in both nodes connected to Sw 1, then Sw 2 will
not be able to communicate to any of the nodes in the right half of the system
under X–Y dimension order routing.
Consequently, neither the setup delay nor the unpredictability of the setup
time is the most severe disadvantage of circuit switching when compared
to other resource allocation schemes, because the connection setup problem
is very similar in TDM-based techniques (see Section 2.5 for a discussion
on circuit setup). The major drawback of circuit switching is its inflexibility
and, from a QoS point of view, the limited options for selecting a particular
QoS level. For a given source–destination pair the only choice is to set up a
circuit switched connection which, once established, gives the minimal delay
(1 cycle per hop × the number of hops in SoCBUS) and the full bandwidth. If
an application requires many overlapping connections with moderate band-
width demands and varying delay requirements, a circuit switched network
has little to offer. Thus, a pure circuit switching allocation scheme can be used
with benefit in the following two scenarios:
1. If the application exhibits a well-understood, fairly static communi-
cation pattern with a relatively small number of traffic streams with
very high bandwidth requirements and long lifetime, these streams
can be mapped on circuit switched connections in a cost- and power-
efficient and low-delay implementation, as demonstrated in a study
by Chung et al. [3].
2. For networks with a small number of hops (up to two), connections
can be quickly built up and torn down. The setup overhead may be
compensated by efficient data traversal even if the packet length
is only a few words. Several proposals argue for circuit switch-
ing implementations based implicitly on this assumption [2,5,6].
But even for small-sized networks we have the apparent trade-off
between packet size and blocking time of resources. Longer pack-
ets decrease the relative overhead of connection setup but block the
establishment of other connections for a longer time.
For large networks and applications with communications having different
QoS requirements that demand more flexibility in allocating resources, circuit
switched techniques are only part of the solution at best.
This inflexibility of circuit switching can be addressed by duplicating some
bottleneck resources. For instance, if the resources in the NI are duplicated, as
shown in Figure 2.5(a), each node can entertain two concurrent connections
in each direction, which increases the overall utilization of the network.
A study by Millberg et al. [7] has demonstrated that by duplicating the
outgoing link capacity of the network, called dual packet exit [Figure 2.5(b)],
the average delay is reduced by 30% and the worst case delay by 50%. Even
though that study was not concerned with circuit switching, similar or higher
gains are expected in circuit switched approaches. Leroy et al. [8] essentially
FIGURE 2.5
Duplication of NI resources. (a) Duplication of selected resources can
increase the overall utilization of the network. (b) Dual packet exit doubles
the upstream buffers in the NI [7].
FIGURE 2.6
Duplication of switch resources. (a) Spatial division multiplexing (SDM)
assigns different wires to different connections on the links [8]. (b) A mixed
circuit switched and time shared allocation scheme of the Mango NoC, with VC
buffers shared over a link [9].
in the switch, the end-to-end delay (not considering the network interfaces)
is bounded by (Q · F + Δ) · h, where Δ denotes the constant delay in the crossbar
and the input buffer of the switch, and h is the number of hops.
The number of VCs determines the granularity of bandwidth allocation,
and the bandwidth allocated to a connection can be increased by assigning
more VCs.
One drawback of this method is that a connection exclusively uses a re-
source, a VC buffer, and to support many concurrent connections, many VCs
are required. This drawback is inherited from the exclusive resource allocation
of circuit switching, but it is a limited problem here because it is confined to the
VC buffers. Also, there is a trade-off between high granularity of bandwidth
allocation and the number of VCs. But this example demonstrates clearly that
the combination of different allocation schemes can offer significant benefits
in terms of increased flexibility and QoS control at limited costs.
FIGURE 2.7
An example of packet delivery on a VC (a packet is admitted in its time slot;
w = 6 in the example).
difference [14]. (2) Buffer and link allocations are coupled, as stated previously.
Because packets are transmitted over these shared resources without stall
and in a time-division fashion, we need only one buffer per link. This buffer
may be situated at the input or output of a switch. As can be observed in
Figure 2.7, we assumed that the buffer is located at the output. In terms of
QoS, a TDM VC provides strict delay and bandwidth guarantees at low cost.
Compared with circuit switching, it utilizes resources in a shared fashion
(but with exclusive time slots) and is thus more efficient. As with circuit switching,
it must be established before communication can start. The establishment can
be accomplished through configuring a routing table in switches. Routing for
VC packets is performed by looking up these tables to find the output port
along the VC path.
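The table lookup described above can be sketched as follows; the table,
mapping (slot index modulo the TDM period, input port) to an output port, is
illustrative and not taken from any particular NoC:

```python
# Sketch of TDM VC routing: each switch holds a slot table that maps
# (slot index modulo the TDM period, input port) to an output port.

PERIOD = 2

# Illustrative table: in both slots of a period-2 schedule, packets
# arriving on input "W" are forwarded to output "E".
slot_table_sw1 = {
    (0, "W"): "E",
    (1, "W"): "E",
}

def route(slot_table, time, in_port):
    """Look up the output port for a packet arriving at `time` on `in_port`."""
    return slot_table.get((time % PERIOD, in_port))

assert route(slot_table_sw1, 6, "W") == "E"   # slot 6 mod 2 == 0
assert route(slot_table_sw1, 7, "W") == "E"
assert route(slot_table_sw1, 6, "N") is None  # no VC uses this input in slot 0
```

Because the schedule is fixed per period, the lookup replaces arbitration
entirely: a packet arriving in its reserved slot can never lose the output
port to a competitor.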
Before discussing VC configuration, we introduce two representative TDM
VCs proposed for on-chip networks, the Æthereal VC [10] and the Nostrum
VC [11].
FIGURE 2.8
Open-ended virtual circuits v1 and v2 through switches sw1, sw2, and sw3,
with per-switch slot tables mapping (time slot, input port) to an output
port.
∗ We allow that a VC may comprise more than one source and one destination node.
FIGURE 2.9
Closed-loop virtual circuits v1 and v2 through switches sw1 to sw4, with
buffers b0 to b4 and per-switch slot tables.
FIGURE 2.10
LN construction by partitioning and mapping slots in the time and space
domain: slot sets s_0^2(b0) and s_1^2(b0) are mapped to the LNs ln_0^2(v1, b0)
and ln_1^2(v2, b0) over buffers b0 to b4 and slots 0 to 9.
packet or container advances one hop along its path each and every
slot. For example, v1 packets holding slot t at buffer b0, that is, pair
(t, b0), will consecutively take slot t+1 at b1 (pair (t+1, b1)), slot t+2
at b2 (pair (t+2, b2)), and slot t+3 at b3 (pair (t+3, b3)). In this way,
the slot partitionings are propagated to other buffers on the VC. In
Figure 2.10, after mapping the slot set s_0^2(b0) on v1 and s_1^2(b0) on
v2, we obtain two sets of slot sets {s_0^2(b0), s_1^2(b1), s_0^2(b2), s_1^2(b3)} and
{s_1^2(b0), s_0^2(b4)}, as marked by the dashed diagonal lines. We refer to
the logically networked slot sets in a set of buffers of a VC as an LN.
Thus an LN is a composition of associated (time slot, buffer) pairs on
a VC with respect to a buffer. We denote the two LNs as ln_0^2(v1, b0)
and ln_1^2(v2, b0), respectively. The notation ln_τ^T(v, b) represents the τth
LN of the total T LNs on v with respect to b. Figure 2.10 illustrates
the mapped slot sets for s_0^2(b0) and s_1^2(b0) and the resulting LNs. We
may also observe that slot mapping is a process of assigning VCs
to LNs. LNs can be viewed as the result of VC assignment to slot
sets, and an LN is a function of a VC. In our case, v1 subscribes to
ln_0^2(v1, b0) and v2 to ln_1^2(v2, b0).
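The (time slot, buffer) bookkeeping described above can be sketched in a few
lines; the function names are ours and purely illustrative:

```python
# Sketch of LN construction: a packet admitted in slot t at the first
# buffer of a VC path advances one hop per slot, so it occupies the
# (slot, buffer) pairs (t, b0), (t+1, b1), (t+2, b2), ...

def ln_pairs(t, buffers):
    """(slot, buffer) pairs visited by a packet admitted in slot t."""
    return [(t + i, b) for i, b in enumerate(buffers)]

def ln_partition(tau, T, buffers, n_slots):
    """All (slot, buffer) pairs of the tau-th of T LNs, i.e. of all
    packets admitted in slots congruent to tau modulo the TDM period T."""
    pairs = []
    for t in range(tau, n_slots, T):
        pairs.extend(ln_pairs(t, buffers))
    return pairs

# With T = 2 and a four-buffer path (as for v1 in Figure 2.10), the 0th
# LN uses even slots at b0 and b2 but odd slots at b1 and b3:
path = ["b0", "b1", "b2", "b3"]
assert ln_pairs(0, path) == [(0, "b0"), (1, "b1"), (2, "b2"), (3, "b3")]
assert all(t % 2 == path.index(b) % 2 for t, b in ln_partition(0, 2, path, 6))
```

The alternation of slot parity along the path is exactly the diagonal
propagation marked by the dashed lines in Figure 2.10.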
FIGURE 2.11
LN-oriented slot allocation: check reference consistency; if it holds,
consume LNs and return 1, otherwise return 0.
FIGURE 2.12
An example of LN-oriented slot allocation for VCs v1 and v2 over buffers
b1 to b6.
FIGURE 2.13
Shared channel (C = 32 Mb/s; L = 1 word = 32 bit; channel delay 2 μs per
word). (a) One channel is allocated to two flows. (b) The channel access is
arbitrated with a round-robin policy.
for all time intervals [t1, t2] with 0 ≤ t1 ≤ t2. Hence, in any interval the
number of bits moving in the flow cannot exceed the average amount ρ(t2 − t1)
by more than the burstiness σ. This concept is illustrated in Figure 2.14,
where the solid line shows a flow that is constrained by the function σ + ρt.
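A (σ, ρ) constraint of this form can be checked directly on a cumulative
traffic trace; the following is a sketch, and the function name is ours:

```python
# Sketch of the (sigma, rho) constraint: a cumulative traffic function F
# satisfies F(t2) - F(t1) <= sigma + rho * (t2 - t1) for all 0 <= t1 <= t2.

def is_sigma_rho_constrained(cumulative, sigma, rho):
    """cumulative[t] = total bits sent up to time t (non-decreasing)."""
    n = len(cumulative)
    return all(
        cumulative[t2] - cumulative[t1] <= sigma + rho * (t2 - t1)
        for t1 in range(n) for t2 in range(t1, n))

# A flow with an initial 3-bit burst, then 1 bit per time unit:
flow = [0, 3, 4, 5, 6]
assert is_sigma_rho_constrained(flow, sigma=2, rho=1)       # 3 <= 2 + 1
assert not is_sigma_rho_constrained(flow, sigma=1, rho=1)   # 3 > 1 + 1
```

The quadratic scan over all interval pairs is the literal definition; a
regulator in hardware would instead enforce the bound incrementally, e.g.
with a token bucket of depth σ refilled at rate ρ.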
FIGURE 2.14
A (σ, ρ)-regulated flow: the cumulative traffic flow F(t) stays below the
constraint function σ + ρt.
We use this notation in our shared channel example and model a round-
robin arbiter as a rate-latency server [4] that serves each input flow with a
minimum rate of C/2 after a maximum initial delay of L/C, assuming a
constant word length of L in both flows. Then, based on network calculus
theory, we can compute the maximum delay and backlog of flow A (D̄A, B̄A)
and flow B (D̄B, B̄B), and the characteristics of the output flows A* and B*, as
shown in Figure 2.15.
Due to the limited space we cannot derive these formulas here (see Le Boudec
[4] for detailed derivation and motivation), but we can make several obser-
vations. The delay in each flow consists of three components. The first two are
due to arbitration and the last one, 2 μs, is the channel delay. The term L/C
is the worst-case time it takes for a word in one flow to get access to
the channel if there are no other words of the same flow queued up before the
arbiter. The second term, 2σ/C, is the delay of a worst case burst. The formula
for the maximum backlog also consists of two terms: one due to the worst
FIGURE 2.15
The shared channel serves two regulated flows A ~ (σA, ρA) and B ~ (σB, ρB)
with round-robin arbitration (C = 32 Mb/s; L = 1 word = 32 bit; channel delay
2 μs per word). For ρA ≤ 0.5C and ρB ≤ 0.5C:
B̄A = σA + ρA L/C    D̄A = L/C + 2σA/C + 2 μs    A* ~ (σA + ρA L/C, ρA)
B̄B = σB + ρB L/C    D̄B = L/C + 2σB/C + 2 μs    B* ~ (σB + ρB L/C, ρB)
TABLE 2.1
Maximum Delay, Backlog, and Output Flow Characteristics for a Round-
Robin Arbitration. Delays Are in μs, Rates Are in Mb/s, and Backlog and
Delay Values Are Rounded Up to Full 32-Bit Words. Columns: (σA, ρA),
(σB, ρB), B̄A, D̄A, (σA*, ρA*), B̄B, D̄B, (σB*, ρB*).
case arbitration time (ρ L/C) and the other due to bursts (σ ). The rates of the
output flows are unchanged, as is expected, but the burstiness increases due
to the variable channel access delay in the arbiter. It can be seen in the for-
mulas that delay and backlog bounds and the output flow characteristics of
each flow do not depend on the characteristics of the other flow. This demon-
strates the strong isolation of the round-robin arbiter that in the worst case
always offers half the channel bandwidth to each flow. However, the average
delay and backlog values of one flow do depend on the actual behavior of
the other flow, because if one flow does not use its maximum share of the
channel bandwidth (0.5C), the arbiter allows the other flow to use it. This
dynamic reallocation of bandwidth will increase average performance and
channel utilization. However, note that these formulas are only valid under
the given assumptions, that is, the average rates of both flows must be lower
than 50% of the channel bandwidth. If one flow has a higher average rate, its
worst case backlog and delay are unbounded.
Table 2.1 shows how the delay and backlog bounds depend on input rates
and burstiness. In the upper half of the table, both flows have no burstiness
but the rate of flow A is varying. It can be seen that flow B is not influenced at
all and for flow A only the output rate changes but delay and backlog bounds
are not affected. This is because as long as the flow does not request more
than 50% of the channel bandwidth (16 Mb/s), both backlog and delay in the
arbiter are only caused by the arbitration granularity of one word. In the lower
part of the table, the burstiness of flow A is steadily increased. This affects
the backlog bound, the delay bound, and the output flow characteristics of
A. However, flow B is not affected at all, which underscores the isolation
property of round-robin arbitration under the given constraints.
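The round-robin bounds of Figure 2.15 can be evaluated numerically; the
following sketch uses our own variable names and the chapter's parameters:

```python
# Round-robin bounds for one flow (Figure 2.15): C in bits/s, L in bits,
# sigma in bits, rho in bits/s; the channel adds 2 us per word.
# Valid only while the flow's rate is at most C/2.

C = 32e6          # channel capacity: 32 Mb/s
L = 32            # word length: 32 bit
CHANNEL_DELAY = 2e-6

def rr_bounds(sigma, rho):
    assert rho <= 0.5 * C, "bounds hold only up to half the capacity"
    backlog = sigma + rho * L / C
    delay = L / C + 2 * sigma / C + CHANNEL_DELAY
    out = (sigma + rho * L / C, rho)   # output flow characteristics
    return backlog, delay, out

# A burst-free 8 Mb/s flow: backlog stays below one word, and the delay
# is the arbitration granularity L/C plus the 2 us channel delay.
backlog, delay, out = rr_bounds(sigma=0, rho=8e6)
assert backlog == 8.0                       # bits; rounds up to one word
assert abs(delay - (L / C + 2e-6)) < 1e-12
assert out == (8.0, 8e6)
```

Note that neither bound references the other flow's (σ, ρ), which is the
isolation property discussed above.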
To illustrate the impact of the arbitration policy on the QoS parameters
and the isolation of flows, we present priority-based arbitration as another
example. Figure 2.16 shows the same situation, but the arbiter gives flow A
a higher priority than flow B.
FIGURE 2.16
The shared channel serves two regulated flows A ~ (σA, ρA) and B ~ (σB, ρB)
with a priority arbitration (C = 32 Mb/s; L = 1 word = 32 bit; channel delay
2 μs per word). For ρA + ρB ≤ C:
B̄A = σA + ρA L/C    D̄A = (L + σA)/C + 2 μs    A* ~ (σA + ρA L/C, ρA)
B̄B = σB + ρB σA/(C − ρA)    D̄B = (σA + σB)/(C − ρA) + 2 μs    B* ~ (σB + ρB σA/(C − ρA), ρB)
TABLE 2.2
Maximum Delay, Backlog, and Output Flow Characteristics for an Arbitration
Giving A Higher Priority. Delays Are in μs, Rates Are in Mb/s, and Backlog
and Delay Values Are Rounded Up to Full 32-Bit Words. Columns: (σA, ρA),
(σB, ρB), B̄A, D̄A, (σA*, ρA*), B̄B, D̄B, (σB*, ρB*).
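The priority bounds of Figure 2.16 can be evaluated the same way; the sketch
below (our own names) makes the asymmetry explicit: flow B's bounds depend
on flow A's parameters, while A's do not depend on B's at all:

```python
# Priority bounds (Figure 2.16): flow A has strict priority over flow B.
C = 32e6          # channel capacity: 32 Mb/s
L = 32            # word length: 32 bit
CHANNEL_DELAY = 2e-6

def priority_bounds(sigma_a, rho_a, sigma_b, rho_b):
    assert rho_a + rho_b <= C
    b_a = sigma_a + rho_a * L / C
    d_a = (L + sigma_a) / C + CHANNEL_DELAY
    b_b = sigma_b + rho_b * sigma_a / (C - rho_a)      # depends on A
    d_b = (sigma_a + sigma_b) / (C - rho_a) + CHANNEL_DELAY
    return b_a, d_a, b_b, d_b

# With a burst-free high-priority flow A, flow B's backlog is only its
# own burstiness, and its delay shrinks to the residual-capacity term.
b_a, d_a, b_b, d_b = priority_bounds(sigma_a=0, rho_a=16e6,
                                     sigma_b=32, rho_b=8e6)
assert b_b == 32.0
assert abs(d_b - (32 / 16e6 + CHANNEL_DELAY)) < 1e-12
```

Increasing σA immediately inflates both B̄B and D̄B, which is why priority
schemes require the higher priority traffic to be well characterized.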
E_h = n_h d_h δ    (2.1)
where n_h is the number of packets A injects into the network during a given
window W, d_h is the shortest distance between A and B, and δ is the average
deflection factor. It expresses the average number of deflections a packet
experiences and is defined as
δ is load dependent and, as we will see in the following equations, the network
load has to be limited in order to bound δ. Call H_r^o and H_r^i the sets of all
outgoing and incoming connections of node r, respectively. We assign traffic
budgets to these: B_r^o and B_r^i constitute the traffic budgets for each node r,
and C_Net is the total communication capacity of the network during the time
window W. κ, with 0 ≤ κ ≤ 1, is called the traffic ceiling. It is an empirical
constant that has to be set properly to bound the deflection factor δ. A node
is allowed to set up a new connection as long as the constraints shown in
Equations (2.2) and (2.3) are met. In return, every connection is characterized
by the following bandwidth, average delay, and maximum delay bounds [1]:
BW_h = n_h / W    (2.5)
maxLat_h = 5DN    (2.6)
avgLat_h = d_h δ    (2.7)
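Equations (2.5) and (2.7) can be evaluated directly; the helper names below
are ours, and the deflection factor δ must be supplied from outside (it is
what the traffic ceiling κ keeps bounded):

```python
# Sketch of the per-connection guarantees of Equations (2.5) and (2.7):
# a connection h injecting n_h packets per window of W cycles obtains
# bandwidth n_h / W (packets per cycle) and an average latency of
# d_h * delta cycles, delta being the average deflection factor.

def bandwidth(n_h, W):
    return n_h / W                  # Equation (2.5)

def avg_latency(d_h, delta):
    return d_h * delta              # Equation (2.7)

# 8 packets per 64-cycle window over a shortest distance of 4 hops,
# with an average deflection factor of 1.25:
assert bandwidth(8, 64) == 0.125
assert avg_latency(4, 1.25) == 5.0
```

The admission check itself (Equations (2.2) and (2.3)) compares the aggregate
load of H_r^o and H_r^i against the budgets B_r^o and B_r^i and is not
reproduced here.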
TABLE 2.3
(κ, D1) Pairs for Various Network Sizes N and Emission Budgets per
Cycle B_r^o/W

B_r^o/W   N = 16         N = 30         N = 50         N = 70         N = 100
0.05      (0.04, 1.12)   (0.06, 1.12)   (0.07, 1.15)   (0.08, 1.16)   (0.09, 1.11)
0.10      (0.09, 1.12)   (0.11, 1.15)   (0.14, 1.23)   (0.16, 1.23)   (0.19, 1.23)
0.15      (0.13, 1.12)   (0.17, 1.30)   (0.21, 1.41)   (0.24, 1.35)   (0.28, 1.35)
0.20      (0.18, 1.36)   (0.22, 1.40)   (0.27, 1.46)   (0.32, 1.46)   (0.37, 1.55)
0.25      (0.22, 1.44)   (0.28, 1.45)   (0.34, 1.64)   (0.40, 1.80)   (0.46, sat.)
0.30      (0.27, 1.44)   (0.34, 1.61)   (0.41, 4.65)   (0.48, sat.)   (0.56, sat.)
0.35      (0.31, 1.60)   (0.39, 1.72)   (0.48, sat.)   (0.55, sat.)   (0.65, sat.)
0.40      (0.36, 1.60)   (0.45, 6.10)   (0.55, sat.)   (0.63, sat.)   (0.74, sat.)
0.45      (0.40, 1.80)   (0.50, sat.)   (0.62, sat.)   (0.71, sat.)   (0.83, sat.)
0.50      (0.44, 6.17)   (0.56, sat.)   (0.69, sat.)   (0.79, sat.)   (0.93, sat.)
This approach, while giving loose worst case bounds, optimizes the aver-
age performance, because all network resources are adaptively allocated to
the traffic that needs them. It is also cost effective, because in the network
there is no overhead for reserving resources and no sophisticated scheduling
algorithm for setting up connections is required. The budget regulation at
the network entry can be implemented cost efficiently and the decision for
setting up a new connection can be taken quickly, based on locally available
information. However, to check if the receiving node has sufficient incoming
traffic capacity is more time consuming because it requires communication
and acknowledgment across the network.
FIGURE 2.17
The three phases (request/ack setup, data transfer, cancel) of circuit
switched communication in SoCBUS [2]. (a) Setup with no retry. (b) Setup
with one retry after a nAck.
∗ For the sake of simplicity we ignore priority inversion. Priority inversion is a time period when
a high priority packet waits for a low priority packet. This period is typically limited and known,
for example, the term L/C in Figure 2.15.
FIGURE 2.18
Local fairness may be very unfair globally: packets of connection A pass
three round-robin arbitration points before channel X, so connection D
obtains four times the bandwidth of A on X.
In summary, we note that hard bounds on delay and bandwidth can only be
given if the rate and burstiness of all higher priority traffic is constrained
and known. Priority schemes work best with a relatively small number of
priority levels (2–8) and when well-characterized, low throughput traffic is
assigned to the high priority levels.
All arbitration policies should feature a certain fairness of access to a shared
resource. Which notion of fairness to apply is, however, less obvious. Local
versus global fairness is a case in point, illustrated in Figure 2.18. Packets of
connection A are subject to three arbitration points. At each point a round-
robin arbiter is fair to both connections. However, at channel X connection
D occupies four times the bandwidth of connection A and experiences 1/4 of
its delay. This example shows that if only local fairness is considered,
the number of arbitration points that a connection meets has a big impact
on its performance because its assigned bandwidth drops by a factor of two
at each arbitration point. Consequently, in multistage networks often age-
based fairness or priority schemes are used. This can be implemented with
a counter in the packet header that is set to zero when the packet enters the
network and is incremented in every cycle. For instance, Nostrum uses an
age-based arbitration scheme to guarantee the maximum delay bound given
in Section 2.4, Equation (2.6).
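The age counter described above can be sketched as a toy arbiter; the field
names are ours and not from Nostrum:

```python
# Sketch of age-based arbitration: every packet carries an age counter,
# set to zero on injection and effectively incremented each cycle; on
# contention the oldest packet wins, which bounds how long a packet can
# be bypassed regardless of how many arbitration points it traverses.

def arbitrate(packets, now):
    """Pick the packet that has waited longest (ties: first in the list)."""
    return max(packets, key=lambda p: now - p["injected_at"])

packets = [
    {"id": "new", "injected_at": 9},
    {"id": "old", "injected_at": 2},   # has waited longest
]
assert arbitrate(packets, now=10)["id"] == "old"
```

In hardware the age travels in the packet header, so the comparison is local
to each switch while the fairness it provides is global.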
Another potential negative effect of ill-conceived fairness is shown in
Figure 2.19. Assume we have two messages A and B, each consisting of 10
packets. Assume further that the delay of a message is determined
by the delay of its last packet. Packets of messages A and B compete for
channel X. If they are arbitrated fairly in a round-robin fashion, they occupy
the channel alternately. Assume it takes one cycle to cross channel X. If a
packet of A gets access first, the last A packet will have crossed the channel
after 19 cycles, and the last B packet after 20 cycles. If we opt for an
alternative strategy and assign channel X exclusively to message A, all A
packets will have crossed the channel after 10 cycles although all B packets
will still need 20 cycles. Thus, a winner-takes-it-all arbitration policy
would decrease the
delay of message A by half without adversely affecting the delay of mes-
sage B. Moreover, if the buffers are exclusively reserved for a message, both
FIGURE 2.19
Local fairness may lead to lower performance: packets of messages A and B
are interleaved on channel X by round-robin arbitration.
Local fairness may lead to lower performance.
messages will block their buffers for a shorter time period compared to the
fair round-robin arbitration.
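The arithmetic of this example can be checked with a small sketch (the
function name and schedule encoding are ours):

```python
# Sketch of the 10-packet example: messages A and B share channel X, one
# packet crosses per cycle, and a message completes when its last (10th)
# packet has crossed.

def last_packet_times(schedule):
    """schedule: per-cycle channel owner, e.g. 'ABAB...'; returns the
    cycle (1-based) in which each message's 10th packet crosses."""
    counts, done = {"A": 0, "B": 0}, {}
    for cycle, msg in enumerate(schedule, start=1):
        counts[msg] += 1
        if counts[msg] == 10 and msg not in done:
            done[msg] = cycle
    return done

round_robin = "AB" * 10               # fair alternation, A goes first
winner_takes_all = "A" * 10 + "B" * 10
assert last_packet_times(round_robin) == {"A": 19, "B": 20}
assert last_packet_times(winner_takes_all) == {"A": 10, "B": 20}
```

Message B finishes in cycle 20 under either policy, so serving A exclusively
first halves A's completion time at no cost to B.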
These examples illustrate that fairness issues require attention, and the
effects of arbitration policies on global fairness and performance are not
always obvious. For a complete trade-off analysis the cost of implementation
also has to be taken into account. For a discussion on implementation
of circuits, their size, and delay that realize different arbitration policies see
Dally and Towles [20, Chapter 18].
FIGURE 2.20
Node-to-node traffic flows for a radio system on a 4 × 4 mesh (nodes n1 to
n16). Traffic types (number of flows × bandwidth in Mbits/s): a × 3: 4096;
b × 6: 512; c × 4: 512; d × 2: 2048; e × 1: 512; f × 4: 128; g × 1: 64;
h × 3: 4096; i × 2: 512; j × 2: 512; k × 2: 512.
and others are unicast traffic. As the application requires strict bandwidth
guarantees for processing traffic streams, we use TDM VCs to serve the traffic
flows. In this case study, we use closed-loop VCs.
The case study comprises two phases: VC specification and VC configura-
tion. The VC specification phase defines a set of source and destination (sink)
nodes, and normalized bandwidth demand for each VC. The VC configura-
tion phase constructs VC implementations satisfying the VCs’ specification
requirement—one VC implementation for one VC specification. In this case
study, a VC implementation is a looped TDM VC. Note that a VC specification
only consists of source and destination nodes, although its corresponding VC
implementation consists of the source and destination nodes plus intermedi-
ate visiting nodes.
2.7.2 VC Specification
The VC specification phase consists of three steps: determining link capacity,
merging traffic flows, and normalizing VC bandwidth demand.
We first determine the minimum required link capacity by identifying the
critical (most heavily loaded) link. The most heavily loaded link is the link
from n5 to n9. The a-type traffic passes it and BW_a = 4096 Mbits/s.
To support BW_a, the link bandwidth BW_link must be not less than 4096
Mbits/s. We choose the minimum, 4096 Mbits/s, for BW_link. This is an initial
estimate and subject to adjustment and optimization, if necessary.
Because the VC path search space increases exponentially with the num-
ber of VCs, reducing the number of VCs when building a VC specification
set is crucial. In our case, we intend to define 11 VCs for the 11 types of
traffic. To this end, we merge traffic flows by taking advantage of the fact
that the VC loop allows multiple source and destination nodes (multinode
VCs) on it, functioning as a virtual bus supporting arbitrary communication
patterns [13]. Specifically, this merging can be done for multicast, multiple-
flow low-bandwidth, and round-trip (bidirectional) traffic. In the example,
for the two multicast traffic types a and h, we specify two multinode VCs
v̄a(n5, n9, n10, n11) and v̄h(n5, n6, n2, n3). For the multiple-flow low-
bandwidth type of traffic, we can specify a VC to include as many nodes as
a type of traffic spreads. For instance, traffic types c and f include 4
node-to-node flows each, and their node-to-node flows require lower bandwidth,
512 Mbits/s for traffic type c and 128 Mbits/s for traffic type f. For c, we
specify a five-node VC v̄c(n13, n14, n15, n16, n7); for f, a three-node VC
v̄f(n2, n3, n4). Furthermore, as we use a closed-loop VC, two simplex traffic
flows can be merged into one duplex flow. For instance, for the two i flows,
we specify only one VC v̄i(n6, n7).
This also applies to traffic b, d, j, and k.
Based on the results from the last two steps, we compute the normalized
bandwidth demand for each VC specification. With link capacity BW_link =
4096 Mbits/s, 512 Mbits/s is equivalent to 1/8 BW_link. While calculating
this, we need to be careful with duplex traffic. Because the VC implementation is
a loop, a container on it offers equal bandwidth in a round trip. Therefore,
duplex traffic can exploit this by utilizing bandwidth in either direction. For
example, traffic d has two flows, one from n16 to n12 , the other from n12 to n16 ,
requiring 1/2 bandwidth in each direction. By using a looped VC, the actual
bandwidth demand on the VC is still 1/2 (not 2 × 1/2). Because of this, the
bandwidth requirements on VCs for traffic b, d, f, i, j, and k are 1/8, 1/2,
1/16, 1/8, 1/8, and 1/8, respectively.
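The normalization step can be sketched as follows; the constant and helper
name are ours, and the duplex-merge reasoning follows the traffic d example
above:

```python
# Sketch of bandwidth normalization against the chosen link capacity of
# 4096 Mbits/s. A duplex pair merged onto one looped VC still demands only
# its one-direction share, because a container on the loop offers equal
# bandwidth in both directions of the round trip.

BW_LINK = 4096  # Mbits/s

def normalized_demand(flow_bw):
    """Demand of one (possibly duplex-merged) flow as a fraction of link
    capacity."""
    return flow_bw / BW_LINK

assert normalized_demand(2048) == 1/2    # traffic d (duplex, 2048 each way)
assert normalized_demand(512) == 1/8     # traffic i, j, k
```

This reproduces the 1/2 and 1/8 figures quoted in the text for traffic d and
for traffic i, j, and k.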
With the steps mentioned above, we obtain a set of VC specifications as
listed in Table 2.4.
TABLE 2.4
VC Specification for Traffic Flows. Columns: VC Spec., Traffic, BW (Mbits/s),
Number of Node-to-Node Flows, Source and Sink Nodes, BW Demand.
FIGURE 2.21
One solution of looped VC implementations with a snapshot of containers
on VCs.
TABLE 2.5
Looped TDM VC Implementations for Traffic Flows. Columns: VC Impl., Traffic,
Visiting Nodes, Loop Length, Containers, BW Supply.
2.8 Summary
We have addressed the provision of QoS for communication performance
from the perspective of resource allocation. We have seen that we can reserve
communication resources exclusively throughout the lifetime of a connection
(circuit switching) or during individual time slots (TDM). We have discussed
nonexclusive usage of resources in Section 2.4 and noticed that QoS guaran-
tees can be provided by analyzing the worst case interaction of all involved
connections. We have observed a general trade-off between the utilization of
resources and the tightness of bounds. If we exclusively allocate resources
to a single connection, their utilization may be very low because no other
connection can use them. But the delay of packets is accurately known and
the worst case is the same as the average and the best cases. At the other
extreme we have aggregate allocation of the entire network to a set of
connections. The utilization of resources is potentially very high because they are
adaptively assigned to packets in need. However, the worst case delay can be
several times the average case delay because many connections may compete
for the same resource simultaneously. Which solution to select depends on
the application’s traffic patterns, on the real-time requirements, and on what
constitutes an acceptable cost.
In practice all the presented techniques of resource allocation and arbitra-
tion can be mixed. By using different techniques for managing the various
resources such as links, buffers, crossbars, and NIs, a network can be opti-
mized for a given set of objectives while exploiting knowledge of application
features and requirements.
References
[1] A. Jantsch, “Models of computation for networks on chip.” In Proc. of Sixth
International Conference on Application of Concurrency to System Design, June
2006, invited paper.
[2] D. Wiklund and D. Liu, “SoCBUS: Switched network on chip for real time
embedded systems.” In Proc. of Parallel and Distributed Processing Symposium,
Apr. 2003.
[3] K.-C. Chang, J.-S. Shen, and T.-F. Chen, “Evaluation and design trade-offs
between circuit-switched and packet-switched NOCs for application-specific
SOCs.” In Proc. of 43rd Annual Conference on Design Automation, 2006, 143–
148.
[4] J.-Y. Le Boudec, Network Calculus. Lecture Notes in Computer Science, no. 2050.
Berlin: Springer Verlag, 2001.
[5] C. Hilton and B. Nelson, “A flexible circuit switched NOC for FPGA based
systems.” In Proc. of Conference on Field Programmable Logic (FPL), Aug. 2005,
24–26.
[6] A. Lines, “Asynchronous interconnect for synchronous SoC design,” IEEE Micro
24(1) (Jan-Feb 2004): 32–41.
[7] M. Millberg and A. Jantsch, “Increasing NoC performance and utilisation
using a dualpacket exit strategy.” In 10th Euromicro Conference on Digital System
Design, Lubeck, Germany, Aug. 2007.
[8] A. Leroy, P. Marchal, A. Shickova, F. Catthoor, F. Robert, and D. Verkest, “Spatial
division multiplexing: A novel approach for guaranteed throughput on NoCs.”
In Proc. of International Conference on Hardware/Software Codesign and System Syn-
thesis, Sept. 2005, 81–86.
[9] T. Bjerregaard and J. Sparso, “A router architecture for connection-oriented ser-
vice guarantees in the MANGO clockless network-on-chip.” In Proc. of Conference
on Design, Automation and Test in Europe—Volume 2, Mar. 2005, 1226–1231.
[10] K. Goossens, J. Dielissen, and A. Rădulescu, “The Æthereal network on
chip: Concepts, architectures, and implementations,” IEEE Design and Test of
Computers 22(5), (Sept-Oct 2005): 21–31.
[11] M. Millberg, E. Nilsson, R. Thid, and A. Jantsch, “Guaranteed bandwidth using
looped containers in temporally disjoint networks within the Nostrum network
on chip.” In Proc. of Design Automation and Test in Europe Conference, Paris, France,
Feb. 2004.
[12] Z. Lu and A. Jantsch, “Slot allocation using logical networks for TDM
virtual-circuit configuration for network-on-chip.” In International Conference
on Computer Aided Design (ICCAD), Nov. 2007.
[13] Z. Lu and A. Jantsch, “TDM virtual-circuit configuration for network-on-chip,”
IEEE Transactions on Very Large Scale Integration Systems 16(8), (August 2008).
[14] E. Nilsson and J. Öberg, “Reducing peak power and latency in 2-D mesh
NoCs using globally pseudochronous locally synchronous clocking.” In Proc.
of International Conference on Hardware/Software Codesign and System Synthesis,
Sep. 2004.
[15] A. Borodin, Y. Rabani, and B. Schieber, “Deterministic many-to-many hot potato
routing,” IEEE Transactions on Parallel and Distributed Systems 8(6) (1997): 587–
596.
[16] R. L. Cruz, “A calculus for network delay, part I: Network elements in isolation,”
IEEE Transactions on Information Theory 37(1) (January 1991): 114–131.
[17] H. Zhang, “Service disciplines for guaranteed performance service in packet-
switching networks,” Proc. IEEE, 83 (1995): 1374–1396.
[18] D. Wiklund, “Development and performance evaluation of networks on chip,”
Ph.D. dissertation, Department of Electrical Engineering, Linköping University,
SE-581 83 Linköping, Sweden, 2005, Linköping Studies in Science and Technol-
ogy, Dissertation No. 932.
[19] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny, “QNoC: QoS architecture and
design process for network on chip,” Journal of Systems Architecture, 50(2–3)
(Feb. 2004): 105–128.
[20] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks.
Morgan Kaufman Publishers, 2004.
CONTENTS
3.1 Introduction.................................................................................................. 66
3.2 Switch-to-Switch Flow Control.................................................................. 67
3.2.1 Switching Techniques ..................................................................... 67
3.2.1.1 Store-and-Forward (SAF) Switching............................. 67
3.2.1.2 Wormhole (WH) Switching............................................ 67
3.2.1.3 Virtual Cut-Through (VCT) Switching ......................... 68
3.2.2 Channel Buffer Management......................................................... 70
3.2.2.1 Go & Stop Control ........................................................... 70
3.2.2.2 Credit-Based Control....................................................... 70
3.2.3 Evaluation ........................................................................................ 71
3.2.3.1 Throughput and Latency................................................ 71
3.2.3.2 Amount of Hardware...................................................... 71
3.3 Packet Routing Protocols............................................................................ 73
3.3.1 Deadlocks and Livelocks of Packet Transfer ............................... 73
3.3.2 Performance Factors of Routing Protocols .................................. 74
3.3.3 Routing Algorithm.......................................................................... 77
3.3.3.1 k-ary n-cube Topologies .................................................. 78
3.3.3.2 Irregular Topologies ........................................................ 80
3.3.4 Subfunction of Routing Algorithms ............................................. 82
3.3.4.1 Output Selection Function (OSF) .................................. 83
3.3.4.2 Path Selection Algorithm................................................ 83
3.3.5 Evaluation ........................................................................................ 83
3.4 End-to-End Flow Control ........................................................................... 84
3.4.1 Injection Limitation......................................................................... 85
3.4.2 ACK/NACK Flow Control............................................................ 85
3.5 Practical Issues ............................................................................................. 86
3.5.1 Commercial and Prototype NoC Systems ................................... 86
3.5.2 Research Trend................................................................................. 88
3.6 Summary....................................................................................................... 90
References............................................................................................................... 91
65
© 2009 by Taylor & Francis Group, LLC
66 Networks-on-Chips: Theory and Practice
3.1 Introduction
In this chapter, we explain the NoC protocol family, that is, switching tech-
niques, routing protocols, and flow controls. These techniques are responsible
for low-latency packet transfer, and they strongly affect the performance,
hardware amount, and power consumption of on-chip interconnection networks.
Figure 3.1 shows an example NoC that consists of 16 tiles, each of which has
a processing core and a router. In these networks, source nodes (i.e., cores)
generate packets that consist of a header and payload data. On-chip routers
transfer these packets through connected links, and destination nodes
decompose them. High-quality communication that never loses data within
the network is required for on-chip communication, because delayed packets
of inter-process communication may degrade the overall performance of the
target (parallel) application.
Switching techniques, routing algorithms, and flow control have been studied
for several decades for off-chip interconnection networks. General discussion
of these techniques is provided by existing textbooks [1–3], and some
NoC-specific textbooks also describe them [4,5]. We introduce them from the
viewpoint of on-chip communication, and discuss their pros and cons in terms of
throughput, latency, hardware amount, and power consumption. We also
survey these techniques as used in various commercial and prototype NoC
systems.
The rest of this chapter is organized as follows. Section 3.2 describes switch-
ing techniques and channel buffer managements, and Section 3.3 explains
the routing protocols. End-to-end flow control is described in Section 3.4.
Section 3.5 discusses the trends of NoC protocols, and Section 3.6 summa-
rizes the chapter.
FIGURE 3.1
Network-on-Chip: routers, cores, and links.
FIGURE 3.2
Store-and-forward (SAF) and wormhole (WH) switching techniques: (a) store-and-forward; (b) wormhole.
FIGURE 3.3
Packet structure of the various switching techniques discussed in this section.
router, the router stores flits (which are the same size as its channel buffers).
Flits of the same packet can thus be stored at different routers. AWH switching
therefore accepts, in theory, an unlimited packet length, whereas VCT switching
can cope only with packets whose length is smaller than its channel buffer size.
Another variation of VCT switching customized for NoC purposes is based
on a cell structure using a fixed single-flit packet [6]. This is similar to the
asynchronous transfer mode (ATM), a traditional wide-area network protocol.
As mentioned above, the main drawback of WH switching is that the buffer is
smaller than the maximum packet size, which frequently causes HOL blocking.
To mitigate this problem, cell-based (CB) switching limits the maximum packet
size to a single flit, with each flit having its own routing information.
To simplify the packet management procedure, cell-based switching removes
the support of variable-length packets in routers and network interfaces.
Routing information is transferred on dedicated wires beside the data lines
in a channel with a single-flit packet structure (Figure 3.3).
The single-flit packet structure introduces a new problem: because control
information is attached to every transfer unit, the ratio of raw data (payload)
in each transfer unit decreases.
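The payload-ratio penalty can be illustrated with a short calculation (a hypothetical sketch; the 64-bit flit width and 8-bit control field are assumed values, not taken from the chapter):

```python
# A back-of-the-envelope look at payload efficiency (the 64-bit flit width
# and 8-bit routing/control field are assumed values, not from the chapter).

def wh_efficiency(payload_flits, flit_bits=64):
    """Wormhole: one header flit followed by the payload flits."""
    return payload_flits * flit_bits / ((1 + payload_flits) * flit_bits)

def cb_efficiency(flit_bits=64, ctrl_bits=8):
    """Cell-based: every single-flit packet carries routing information
    on dedicated control wires beside the data lines."""
    return flit_bits / (flit_bits + ctrl_bits)

# A long wormhole packet amortizes its header well...
print(round(wh_efficiency(payload_flits=15), 3))   # -> 0.938
# ...whereas cell-based switching pays the control overhead on every flit.
print(round(cb_efficiency(), 3))                   # -> 0.889
```

The comparison shows the trade-off: cell-based switching avoids HOL blocking but pays a fixed control overhead on every transfer unit.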
Table 3.1 and Figure 3.3 compare SAF, WH, VCT, AWH, and CB switching
techniques.
TABLE 3.1
Comparison of the Switching Techniques Discussed in This Section (criteria: control, channel buffer size, unloaded throughput, and latency∗).
FIGURE 3.4
Channel buffer management techniques: Go & Stop control (stop threshold on buffer occupancy) and credit-based control (a credit is incremented when a buffer is released).
router. Credit-based control makes the best use of channel buffers, and
can be implemented regardless of the link length or the sender and receiver
overheads.
With credit-based control, the receiver router sends a credit, which allows
the sender router to forward one more flit, as soon as a used buffer is
released (becomes free). The sender router can send a number of flits up to the
number of credits it holds, and it consumes a single credit whenever it sends a
flit, as shown in Figure 3.4. If the credit count reaches zero, the sender router
cannot forward a flit, and must wait for a new credit from the receiver router.
The main drawback of credit-based control is that it needs more control
signals between sender and receiver routers compared to Go & Stop
control.
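The credit mechanism described above can be sketched as follows (a minimal illustration; the buffer depth and flit representation are assumptions, not details from the chapter):

```python
from collections import deque

# A minimal sketch of credit-based flow control between a sender and a
# receiver router (buffer depth and flit values are illustrative).

class CreditLink:
    def __init__(self, buffer_depth=4):
        self.credits = buffer_depth      # sender-side credit counter
        self.rx_buffer = deque()         # receiver-side channel buffer

    def send(self, flit):
        """Sender forwards a flit only if it holds at least one credit."""
        if self.credits == 0:
            return False                 # no credit: sender must stall
        self.credits -= 1                # one credit consumed per flit
        self.rx_buffer.append(flit)
        return True

    def drain(self):
        """Receiver forwards a flit; the freed buffer returns one credit."""
        flit = self.rx_buffer.popleft()
        self.credits += 1                # credit travels back to the sender
        return flit

link = CreditLink(buffer_depth=2)
assert link.send("f0") and link.send("f1")
assert not link.send("f2")               # credits exhausted: back-pressure
link.drain()
assert link.send("f2")                   # a returned credit frees one slot
```

Because the sender stalls exactly when the receiver's buffer is full, no flit is ever dropped, which matches the loss-less requirement stated in the introduction.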
3.2.3 Evaluation
In this subsection, we compare the switching techniques in terms of through-
put, latency, and hardware amount. The switching technique and routing
protocol used are both important for determining the throughput and la-
tency. The impact of the routing protocol used on throughput and latency is
analyzed in the next section.
FIGURE 3.5
Throughput and latency of the switching techniques discussed in Section 3.2.3.1 (x-axis: accepted traffic [flit/cycle/core], 0.1–0.5; y-axis: latency [cycle], 0–2000; curves: CB(1), WH(1), AWH(2), and VCT(8), with the CB and WH curves nearly coinciding).
FIGURE 3.6
Hardware amount of the switching techniques discussed in Section 3.2.3.2 (CB(1VC), WH(1VC), WH(2VC), and VCT(1VC); each router broken down into crossbar, channel, and FIFO buffer).
Note that every virtual channel requires a buffer, and the virtual-channel
mechanism makes the structure of the arbiter and crossbar more complicated,
increasing the router hardware by 90%.
∗ We use the term “nodes” for IP cores that are connected on a chip.
FIGURE 3.7
Deadlocks in routing protocols.
Deadlock- and livelock-freedom are not strictly required of routing
algorithms in traditional LANs and WANs. This is because Ethernet usually
employs a spanning tree protocol that limits the topology to a tree, whose
structure does not cause deadlocks of paths; moreover, the Internet Protocol
allows packets to carry a time-to-live field that limits the maximum number
of transfers. However, NoC routing protocols cannot simply borrow the
techniques used by commodity LANs and WANs. Therefore, new research
fields dedicated to NoCs have emerged, similar to those in parallel
computers.
FIGURE 3.8
Taxonomy of routing algorithms.
FIGURE 3.9
The adaptivity property.
FIGURE 3.10
The different-paths property.
FIGURE 3.11
Taxonomy of routing implementations.
TABLE 3.2
Deadlock-Free Routing Algorithms (columns: routing algorithm, type, topology, and minimum number of VCs).
FIGURE 3.12
Prohibited turn sets of three routing algorithms in the turn model.
of the original turn model proposed by Glass and Ni. Thus, the odd-even turn
model has an advantage over the original ones, especially in networks with
faulty links, which require a higher path diversity to route around them.
Turn models can guarantee deadlock freedom in a 2-D mesh, but they cannot
remove deadlocks in rings and tori, whose wraparound channels allow cyclic
dependencies to form. A virtual-channel mechanism is typically used to cut
such cyclic dependencies: packets are first transferred using virtual-channel
number zero, and the virtual-channel number is then increased when the
packet crosses a wraparound channel.
Moreover, turn models achieve some degree of fault tolerance. Figure 3.14
shows an example of shortest paths avoiding a faulty link under the North-
Last turn model.
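As an illustration of how a turn model restricts routing choices, here is a minimal sketch of West-First routing on a 2-D mesh (the coordinate convention, with east = +x and north = +y, is our assumption):

```python
# A minimal sketch of West-First routing on a 2-D mesh: any westward hops
# are taken first (deterministically); afterward the packet may choose
# adaptively among the remaining productive directions, but it may never
# turn back to the west. Coordinates are (x, y) with east = +x, north = +y
# (an assumed convention).

def west_first_outputs(cur, dst):
    """Return the set of output directions the turn model allows."""
    (cx, cy), (dx, dy) = cur, dst
    if dx < cx:
        return {"west"}              # west hops must come first
    outputs = set()
    if dx > cx:
        outputs.add("east")          # any productive non-west direction
    if dy > cy:
        outputs.add("north")         # may be chosen adaptively
    if dy < cy:
        outputs.add("south")
    return outputs                   # empty set: packet has arrived

print(west_first_outputs((2, 2), (0, 3)))   # deterministic: only west
print(west_first_outputs((0, 0), (2, 2)))   # adaptive: east or north
```

Because no packet ever turns to the west after leaving the westward phase, the turns that would close a dependency cycle are prohibited, which is exactly how the turn model prevents deadlock.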
3.3.3.1.3 Duato’s Protocol
Duato gave a general theorem defining a criterion for deadlock freedom and
used the theorem to develop a fully adaptive, profitable, and progressive
FIGURE 3.13
Prohibited turn set in the odd-even turn model.
FIGURE 3.14
Paths avoiding a faulty link under the North-Last turn model (legend: faulty link; feasible output port).
protocol [11], called Duato's protocol or *-channel. The theorem states that
by separating the virtual channels on a link into escape and adaptive
partitions, fully adaptive routing can be performed and yet remain deadlock-
free. This result is not restricted to a particular topology or routing
algorithm. Cyclic dependencies between channels are allowed, provided that
there exists a connected channel subset free of cyclic dependencies.
A simple description of Duato's protocol is as follows:
a. Provide that every packet can always find a path toward its destination
whose channels are not involved in cyclic dependencies (escape
path).
b. Guarantee that every packet can be sent to any destination node
using an escape path or another path whose cyclic dependencies
are broken by the escape path (fully adaptive path).
By selecting between these two kinds of routes (escape path and fully adaptive
path) adaptively, deadlocks can be prevented while minimal paths are still used.
Three virtual channels are required on tori. Two virtual channels (we call
them CA and CH) are used for DOR: a packet that needs to use a wraparound
path is allowed to use only the CA channel, whereas a packet that does not
need a wraparound path is allowed to use both the CH and CA channels.
Under these restrictions, these channels provide an escape path, whereas
another virtual channel (called CF) is used for fully minimal adaptive routing.
Duato's protocol can be extended to irregular topologies by allowing more
routing restrictions and nonminimal paths [12].
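The channel selection on a torus can be sketched as follows (a simplified illustration using the CA/CH/CF names from the text; the preference order and the free-channel test are assumed local policies, not prescribed by the theorem):

```python
# A simplified sketch of virtual-channel selection in Duato's protocol on a
# torus, using the channel names from the text (CF: fully adaptive; CH and
# CA: escape channels driven by DOR). Preferring CF when free is an assumed
# local policy, not part of the theorem itself.

def select_vc(free_vcs, needs_wraparound):
    """Pick a virtual channel for the next hop, or None if all are busy."""
    if "CF" in free_vcs:
        return "CF"                      # fully minimal adaptive channel
    # Escape path: a wraparound packet may use only CA; others CH or CA.
    escape = {"CA"} if needs_wraparound else {"CH", "CA"}
    for vc in ("CH", "CA"):
        if vc in free_vcs and vc in escape:
            return vc
    return None                          # packet waits; the escape subset
                                         # is acyclic, so it drains eventually

assert select_vc({"CF", "CH"}, needs_wraparound=False) == "CF"
assert select_vc({"CH", "CA"}, needs_wraparound=True) == "CA"
assert select_vc({"CH"}, needs_wraparound=True) is None
```

The key property is that a blocked packet can always fall back to the restricted escape channels, whose dependency graph is acyclic, so fully adaptive use of CF never introduces deadlock.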
FIGURE 3.15
Topology with faults: a 4×4 mesh with two faulty links and the corresponding spanning-tree-based graph used for routing.
FIGURE 3.16
Example of up*/down* routing and VC-transition routing (using two and three VCs).
3.3.5 Evaluation
We use the same C++ simulator used in the previous section to compare the
different routing protocols.
Figure 3.17 shows the relation between the average latency and the accepted
traffic of up*/down* routing, DOR, the West-First turn model, and Duato’s
FIGURE 3.17
Throughput and latency of the routing protocols discussed in Section 3.3.5 (x-axis: accepted traffic [flit/cycle/core], 0.05–0.3; y-axis: latency [cycle], 0–2000; curves: up*/down*, West-First turn model (WF TM), DOR, and Duato).
TABLE 3.3
Switching, Flow Control, and Routing Protocols in Representative NoCs (columns: reference; topology and data width; VCs; switching and flow control; routing algorithm).
TABLE 3.4
Network Partitioning in Representative NoCs
MIT Raw microprocessor [22,23]: Four physical networks: two for static communication and two for dynamic communication.
Sony, Toshiba, IBM Cell BE EIB [30,31]: Four data rings: two for clockwise and two for counterclockwise transfers.
UT Austin TRIPS microprocessor [32,33]: Two physical networks: an on-chip network (OCN) and an operand network (OPN); OCN has four VCs.
Intel Teraflops NoC [36,37]: Two lanes (a): one for data transfers and one for instruction transfers.
Tilera TILE64 iMesh [38]: Five physical networks: a user dynamic network, an I/O dynamic network, a memory dynamic network, a tile dynamic network, and a static network.
(a) The lane is similar to a virtual channel.
can make routing algorithms that provide the deadlock-freedom and connectivity
properties only for the set of paths used by the target application [44]. The
routing algorithms explained in Section 3.3 are general techniques that establish
deadlock-free paths between all pairs of nodes, so their design requirements
are tighter than those of application-specific routings. Another benefit of
exploiting application knowledge is the ability to reduce the number of
routing-table entries and the amount of routing (address) information embedded
in every packet. Because the routing address only needs to identify the output
ports for packets generated by an application in which just a few pairs of nodes
communicate, it can be assigned and optimized for the routing paths actually
used by the target application; thus, the size of the routing (address)
information can be drastically reduced [6].
Here, we focus on how the power consumption is influenced by routing
algorithms, and we introduce a simple energy model. This model is useful
for estimating the average energy consumption needed to transmit a single
flit from a source to a destination. It can be estimated as

E_flit = w · H_ave · (E_sw + E_link)    (3.2)

where w is the flit width, H_ave is the average hop count, E_sw is the average
energy to switch 1-bit data inside a router, and E_link is the 1-bit energy
consumed in a link.
E_link can be calculated as

E_link = d · C_wire · V²

where d is the 1-hop distance (in millimeters), V is the supply voltage, and
C_wire is the wire capacitance per millimeter. These parameters can be extracted
from post place-and-route simulations of a given NoC.
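Numerically, the model can be evaluated as follows (a sketch assuming the common form E_flit = w · H_ave · (E_sw + E_link) with E_link = d · C_wire · V²; all parameter values are illustrative, not taken from the chapter):

```python
# A numerical sketch of the energy model above, assuming the common form
# E_flit = w * H_ave * (E_sw + E_link) with E_link = d * C_wire * V^2.
# All parameter values below are illustrative, not taken from the chapter.

def link_energy_pj(d_mm, c_wire_pf_per_mm, v_volt):
    """1-bit link energy in pJ (pF * V^2 -> pJ)."""
    return d_mm * c_wire_pf_per_mm * v_volt ** 2

def flit_energy_pj(w_bits, h_ave, e_sw_pj, e_link_pj):
    """Average energy to carry one flit from source to destination."""
    return w_bits * h_ave * (e_sw_pj + e_link_pj)

e_link = link_energy_pj(d_mm=1.0, c_wire_pf_per_mm=0.25, v_volt=1.0)
print(flit_energy_pj(w_bits=64, h_ave=3.0, e_sw_pj=0.1, e_link_pj=e_link))
```

The linear dependence on H_ave is what makes nonminimal routing unattractive from the energy point of view, as the following paragraph discusses.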
Sophisticated mechanisms (e.g., virtual channels) and an increased number
of ports make a router complex. As the switch complexity increases, E_sw in
Equation (3.2) increases. A complex switch-to-switch flow control that uses
more control signals also increases power, because of its increased channel
bit width. Regarding routing protocols, Equation (3.2) shows that the energy
consumption of a packet is proportional to its path hop count, so nonminimal
routing is at a disadvantage in energy consumption.
Energy-aware routing strategies try to minimize energy by improving the
routing algorithm [45], whereas other approaches make the best use of a
low-power network architecture, assuming that dynamic voltage and frequency
scaling (DVFS) and on/off link activation will be used in NoCs [46]. Voltage
and frequency scaling is a power-saving technique that reduces the operating
frequency and supply voltage according to the applied load. Dynamic power
consumption is proportional to the square of the supply voltage; because peak
performance is not always required during the whole execution time, adjusting
the frequency and supply voltage to just achieve the required performance can
reduce the dynamic power.
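The quadratic voltage dependence makes combined voltage/frequency scaling attractive, as a short calculation shows (P_dyn ∝ C · V² · f; all values are illustrative):

```python
# Dynamic power is proportional to C * V^2 * f (activity factor folded into
# the effective capacitance C; all numbers below are illustrative).

def dynamic_power(c_eff, v, f):
    """Relative dynamic power in arbitrary but consistent units."""
    return c_eff * v ** 2 * f

full   = dynamic_power(c_eff=1.0, v=1.2, f=500)   # nominal operating point
scaled = dynamic_power(c_eff=1.0, v=0.9, f=250)   # half speed, lower voltage
print(scaled / full)   # ~0.28: half the throughput, ~3.6x less power
```

Because lowering the frequency also permits a lower supply voltage, the power savings are better than linear in the performance given up.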
In the paper presented by Shang et al. [47], the frequency and the voltage of
3.6 Summary
This chapter presented the Networks-on-Chip (NoC) protocol family: switching
techniques, routing protocols, and flow control. These techniques and
protocols affect the network throughput, hardware amount, energy consumption,
and reliability of on-chip communications. The discussed protocols were
originally developed for parallel computers, but they are now evolving for
on-chip purposes in different ways, because the requirements for on-chip
networks differ from those for off-chip systems. One of the distinctive
concepts of NoCs is a loss-less, low-latency, and lightweight network
architecture. Channel buffer management between neighboring routers
References
[1] J. Duato, S. Yalamanchili, and L. M. Ni, Interconnection Networks: An Engineering
Approach. Morgan Kaufmann, 2002.
[2] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks.
Morgan Kaufmann, 2004.
[3] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Ap-
proach, Fourth Edition. Morgan Kaufmann, 2007.
[4] L. Benini and G. D. Micheli, Networks on Chips: Technology and Tools. Morgan
Kaufmann, 2006.
[5] A. Jantsch and H. Tenhunen, Networks on Chip. Kluwer Academic Publishers,
2003.
[6] M. Koibuchi, K. Anjo, Y. Yamada, A. Jouraku, and H. Amano, “A simple data
transfer technique using local address for networks-on-chips,” IEEE Transactions
on Parallel and Distributed Systems 17(12) (Dec. 2006): 1425–1437.
[7] H. Matsutani, M. Koibuchi, and H. Amano, “Performance, cost, and energy
evaluation of Fat H-Tree: A cost-efficient tree-based on-chip network.” In Proc.
of International Parallel and Distributed Processing Symposium (IPDPS’07), March
2007.
[8] H. Matsutani, M. Koibuchi, D. Wang, and H. Amano, “Adding slow-silent
virtual channels for low-power on-chip networks.” In Proc. of International Sym-
posium on Networks-on-Chip (NOCS’08), Apr. 2008, 23–32.
[9] C. J. Glass and L. M. Ni, “The turn model for adaptive routing.” In Proc.
of International Symposium on Computer Architecture (ISCA’92), May 1992, 278–
287.
[10] G.-M. Chiu, “The odd-even turn model for adaptive routing,” IEEE Transactions
on Parallel and Distributed Systems 11(7) (Nov. 2000): 729–738.
[11] J. Duato, “A necessary and sufficient condition for deadlock-free adaptive rout-
ing in wormhole networks,” IEEE Transactions on Parallel and Distributed Systems
6(10) (Jun. 1995): 1055–1067.
[12] F. Silla and J. Duato, “High-performance routing in networks of workstations
with irregular topology,” IEEE Transactions on Parallel and Distributed Systems
11(7) (Jul. 2000): 699–719.
[13] W. H. Ho and T. M. Pinkston, “A design methodology for efficient application-
specific on-chip interconnects,” IEEE Transactions on Parallel and Distributed Sys-
tems 17(2) (Feb. 2006): 174–190.
[14] M. D. Schroeder, A. D. Birrell, M. Burrows, H. Murray, R. M. Needham, and T. L.
Rodeheffer, “Autonet: A high-speed, self-configuring local area network using
point-to-point links,” IEEE Journal on Selected Areas in Communications 9 (October
1991): 1318–1335.
[47] L. Shang, L.-S. Peh, and N. K. Jha, “Dynamic voltage scaling with links for power
optimization of Interconnection Networks.” In Proc. of International Symposium
on High-Performance Computer Architecture (HPCA’03), Jan. 2003, 79–90.
[48] J. M. Stine and N. P. Carter, “Comparing adaptive routing and dynamic voltage
scaling for link power reduction,” IEEE Computer Architecture Letters 3(1)
(Jan. 2004): 14–17.
[49] J. Hu and R. Marculescu, “DyAD: Smart routing for networks-on-chip.” In Proc.
of Design Automation Conference (DAC’04), Jun. 2004, 260–263.
[50] M. Li, Q.-A. Zeng, and W.-B. Jone, “DyXY: A proximity congestion-aware
deadlock-free dynamic routing method for network on chip.” In Proc. of
Design Automation Conference (DAC), Jul. 2006, 849–852.
[51] U. Y. Ogras and R. Marculescu, “Prediction-based flow control for network-on-
chip traffic.” In Proc. of Design Automation Conference (DAC), Jul. 2006.
[52] J. W. van den Brand, C. Ciordas, K. Goossens, and T. Basten, “Congestion-
controlled best-effort communication for networks-on-chip.” In Proc. of Design
Automation and Test in Europe (DATE), Apr. 2007.
[53] J. Hu and R. Marculescu, “Energy- and performance-aware mapping for regular
NoC architectures,” IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems 24(4) (Apr. 2005): 551–562.
[54] S. Manolache, P. Eles, and Z. Peng, “Buffer space optimization with communi-
cation synthesis and traffic shaping for NoCs.” In Proc. of Design Automation and
Test in Europe (DATE), 1, 2006.
CONTENTS
4.1 Introduction.................................................................................................. 96
4.2 Statistical Traffic Modeling......................................................................... 97
4.2.1 On-Chip Processor Traffic .............................................................. 97
4.2.2 On-Chip Traffic Formalism ............................................................ 98
4.2.3 Statistical Traffic Modeling ............................................................ 99
4.2.4 Statistical Stationarity and Traffic Phases .................................. 100
4.2.4.1 Phase Decomposition.................................................... 101
4.2.5 Long-Range Dependence ............................................................. 102
4.2.5.1 Estimation of the Hurst Parameter ............................. 103
4.2.5.2 Synthesis of Long-Range Dependent Processes........ 103
4.3 Traffic Modeling in Practice ..................................................................... 104
4.3.1 Guidelines for Designing a Traffic Modeling
Environment .................................................................................. 105
4.3.1.1 Simulation Precision...................................................... 105
4.3.1.2 Trace Analysis ................................................................ 105
4.3.1.3 Platform Generation...................................................... 106
4.3.1.4 Traffic Analysis and Synthesis Flow ........................... 106
4.3.2 Multiphase Traffic Generation Environment ............................ 106
4.3.2.1 Key Features of the MPTG Environment ................... 112
4.3.3 Experimental Analysis of NoC Traffic........................................ 112
4.3.3.1 Speedup........................................................................... 112
4.3.3.2 Simulation Setup............................................................ 113
4.3.3.3 Multiphase ...................................................................... 114
4.3.3.4 Long-Range Dependence ............................................. 115
4.3.4 Traffic Modeling Accuracy........................................................... 117
4.4 Related Work and Conclusion ................................................................. 118
References............................................................................................................. 119
4.1 Introduction
Next-generation System-on-Chip (SoC) architectures will include many
processors on a single chip, performing the entire computation that used to be
done by hardware accelerators. They are referred to as MPSoC, for multi-
processor SoC. When multiprocessors were not on chip, as in the parallel
machines of 20 years ago, communication latency, synchronization, and network
contention were the most important obstacles to performance. This was mainly
due to the cost of communication compared to computation. For simple SoC
architectures, the communication latency is kept low and the communication
scheme is simple: most of the transactions occur between the processor and the
main memory. For an MPSoC, a Network-on-Chip (NoC), or at least a hierarchy of
buses, is needed, and communication has a major influence on the performance
and power consumption of the global system. Predicting communication
performance at design time is essential because it might influence physical
design parameters, such as the location of the various IPs on the chip.
MPSoC are highly programmable and can potentially target any application.
However, they are currently designed mostly for signal processing and
multimedia applications with real-time constraints, which are not as harsh as
those of avionics. To meet these real-time constraints, MPSoC are composed of
many master IPs (processors) and few slave IPs (memories and peripherals).
In this chapter, we investigate on-chip processor traffic for the performance
evaluation of NoCs. Dedicated IPs (e.g., MPEG-2, FFT) use predictable
communication schemes, so it is possible to generate traffic that looks like
what these IPs would produce. Such a traffic generator is usually designed
together with (or even before) the IP itself. The situation is very different
for processors. Processor traffic is much more difficult to model for two main
reasons: (1) cache behavior is difficult to predict (it is program and data
dependent), and (2) operating system interrupts lead to nondeterministic
behavior in terms of communication and contention. To build an efficient tool
for predicting communication performance for a given application, it is
therefore essential to model precisely the communications induced by
applications running on processors.
Predicting communication performance can be done by a precise (cycle-
accurate) simulation of the complete application or by using a traffic
generator instead of the real IPs. Simulation is usually impossible at early
stages of the design because IPs and programs are not yet available. Note also
that SoC cycle-accurate simulations are very time consuming, unless they are
performed on expensive hardware emulators (based on hundreds of FPGAs).
Traffic generators are preferred because they are parameterizable, faster to
simulate, and simpler to use. However, they are less precise because they do
not execute the real program.
Traffic generators can produce communications in many ways, ranging
from the replay of a previously recorded trace to the generation of sample
paths of stochastic processes, or the execution of a very simple code emulating
the communications of a dedicated IP. Note that random sources can have
parameters fitted to the statistical properties of observed traffic, or parameters
fixed by hand. Deciding which communication parameters (latency,
throughput, etc.) and statistical properties are to be emulated is an important
issue that must be addressed when designing an NoC traffic modeling
environment. This is the main topic of this chapter.
One of the main difficulties in modeling processor traffic is that processor
activity is not stationary (its behavior is not stable). It rather corresponds to a
sequence of traffic phases (corresponding to program phases [1]). In each
stationary phase, data can be fitted to well-known stochastic processes with
prescribed first-order (marginal distribution) and second-order (covariance)
statistics.
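Phase-wise fitting can be sketched as follows (a minimal illustration with hand-picked phase boundaries and a synthetic trace; a real environment detects phase boundaries automatically):

```python
import statistics

# A minimal sketch of phase-wise fitting: within each (assumed) stationary
# phase, the first- and second-order statistics are estimated separately.
# Phase boundaries are given by hand here; a real environment detects them.

def fit_phases(trace, boundaries):
    """Split 'trace' at 'boundaries' and return (mean, stdev) per phase."""
    fits, start = [], 0
    for end in list(boundaries) + [len(trace)]:
        phase = trace[start:end]
        fits.append((statistics.mean(phase), statistics.pstdev(phase)))
        start = end
    return fits

# Two synthetic phases with clearly different levels:
trace = [10, 11, 9, 10] + [30, 29, 31, 30]
print(fit_phases(trace, boundaries=[4]))   # roughly (10, 0.7) then (30, 0.7)
```

Fitting each phase separately is what makes the stationarity assumption tenable: a single global fit over this trace would report a meaningless average of the two regimes.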
The chapter is divided into two main parts: Section 4.2 gives background
on stochastic processes as well as on-chip processor traffic. In Section 4.3, we
discuss in detail the various steps involved in the design of a traffic generation
environment, and illustrate them with the MPTG environment [2]. Related
work and conclusions are reported in Section 4.4.
back (one line at a time). If a write buffer is present then the size is
variable, as the buffer is periodically emptied.
3. Other requests. Requests to noncached memory parts have a size
of one word, as for atomic reads/writes. If a cache coherency algo-
rithm is implemented then additional messages are also sent among
processors.
FIGURE 4.1
Traffic modeling formalism: A(k) is the target address, C(k) the command (read or write), S(k) the size of the transaction, D(k) the delay between the completion of one transaction and the beginning of the following one, and I(k) the interrequest time.
expecting a response (even for write requests), which is the case for most IP
communication interfaces such as VCI (virtual component interface [23]).
One can distinguish two main communication schemes used by IPs: the
nonsplit transactions scheme, where the IP is not able to send a request until
the response to the previous one has been received, and the split transactions
scheme in which new requests can be sent without waiting for the responses.
The nonsplit transaction scheme is widely used by processors and caches
(although, for cache, it might depend on the cache parameters), whereas the
split transaction scheme is used by dedicated IPs performing computation
on streams of data that are transmitted via direct memory access (DMA)
modules.
We can also distinguish between two ways of modeling the time at which
each transaction occurs, leading to different accuracy levels.
• Delay. Use the delay sequence D(k) representing the time (in cycles)
between the reception of the kth response and the start of the (k+1)th
request.
• Aggregated throughput. Use the sequence of aggregated through-
put of the processor Wδ (k); transactions can then be placed in various
ways within the aggregation window δ.
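As an illustration, the aggregated-throughput representation can be derived from a transaction log as follows (a sketch in Python; the field layout and the one-word-per-cycle convention are our assumptions, not part of the MPTG environment):

```python
# Sketch: deriving the aggregated-throughput series W_delta(i) from a
# transaction log. Field names and the window size are illustrative.

def aggregated_throughput(transactions, delta):
    """transactions: list of (start_cycle, size_in_words) pairs.
    Returns W_delta, where W_delta[i] is the number of words whose
    transactions start in window [i*delta, (i+1)*delta)."""
    if not transactions:
        return []
    last = max(t for t, _ in transactions)
    w = [0] * (last // delta + 1)
    for start, size in transactions:
        w[start // delta] += size
    return w

# Example: three transactions, aggregation window of 10 cycles.
trace = [(0, 4), (7, 4), (25, 8)]
print(aggregated_throughput(trace, 10))  # [8, 0, 8]
```

Note how the exact time lag between the first two transactions is lost: both fall into the same window, which is precisely the trade-off discussed above.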
TABLE 4.1
Some Classical Probability Distribution Functions (PDF)

PDF          Description
Gaussian     The most widely used PDF; used, for instance, for aggregated throughput
Exponential  Fast-decay PDF; used, for instance, for delay sequences
Gamma        Intermediate PDF between exponential and Gaussian
Lognormal    Asymmetric PDF
Pareto       Heavy-tailed PDF (slow decay)
TABLE 4.2
Some Classical Covariance Functions
Covariance Description
So, when modeling a time series, one should carefully check that station-
arity is a reasonable assumption. For on-chip processor traffic, the algorithms
executed on the processor have different phases resulting in different
communication patterns; most of the time, the traffic will not be globally
stationary. If signs of nonstationarity are present, one should consider
building a piecewise stationary model. This implies estimating the model
parameters on each stationary phase of the data. At simulation time, the
generator changes the model parameters when it switches between phases.
A traffic phase is a part of the transaction sequence T(k), i ≤ k ≤ j.
Because most multimedia algorithms are repetitive, it is likely that simi-
lar phases appear several times in the trace. For instance, in the MP3 de-
coding algorithm, each MP3 frame is decoded in a loop, leading to similar
processing for each frame.
Therefore, LRD reflects the ability of the process to be highly correlated with
its past, because even at large lags, the covariance function is not negligible.
This property is also linked to self-similarity, which is more general, and it
can be shown that asymptotic second order self-similarity implies LRD [17].
A long-range dependent process is usually modeled with a power-law
decay of the covariance function as follows:
γX(k) ∼ c·k^(−α)  as k → +∞,  with 0 < α ≤ 1
The exponent α (also called the scaling index) quantifies how strongly a
process is long-range dependent (0 < α ≤ 1). The Hurst exponent, denoted H,
is the classical parameter for describing self-similarity [15]. Because of the
analogy between LRD and self-similarity, it can be shown that a simple
relation exists between H and α: H = (2 − α)/2. As a consequence, H (1/2 <
H < 1) is the commonly used parameter for LRD. Note that when H = 0.5,
there is no LRD (this is also referred to as short-range dependence).
Moreover, it can also be shown that the time averages Sj for each scale j
(nj is the number of wavelet coefficients available at scale j):

Sj = (1/nj) Σ_{k=1..nj} |dX(j, k)|²    (4.2)

can be used as relevant, efficient, and robust estimators for E(dX(j, k)²) [17].
From Equations (4.1) and (4.2), the estimation of H is as follows: (1) plot log2 Sj
versus log2 2^j = j, and (2) perform a weighted linear regression of log2 Sj in
the coarsest scales (see for instance Figure 4.2). These plots are commonly
referred to as log-scale diagrams (LD). In such diagrams, LRD is evidenced
by a straight-line behavior in the limit of large scales. In particular, if the line
is horizontal, then H = 0.5 and there is no LRD.
To illustrate how we use this tool to evaluate the Hurst parameter, we
provide a typical LD extracted from an Internet trace in Figure 4.2. Along the
x axis are the different values of the scale j at which the process is observed.
For each scale, log2 S j is plotted together with its confidence interval (vertical
bars). The Hurst parameter can be estimated if the different points plotted are
aligned on a straight line for large scales.
FIGURE 4.2
Example of log-scale diagram (LD); the Hurst parameter is estimated from the slope of the dashed
line (here H = 0.83).
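The estimation procedure can be sketched as follows, using a Haar wavelet decomposition and a plain (unweighted) regression, a simplification of the weighted regression described above; all function names are ours:

```python
import math, random

def haar_details(x):
    """One Haar DWT step: returns (approximation, detail) coefficients."""
    a = [(x[2*i] + x[2*i+1]) / math.sqrt(2) for i in range(len(x) // 2)]
    d = [(x[2*i] - x[2*i+1]) / math.sqrt(2) for i in range(len(x) // 2)]
    return a, d

def hurst_estimate(x, j_min=3, j_max=8):
    """Wavelet-based estimate: regress log2(S_j) on j over the coarser
    scales; for LRD processes the slope is 2H - 1, so H = (slope + 1)/2."""
    log_s = {}
    a = list(x)
    j = 0
    while len(a) >= 2 and j < j_max:
        a, d = haar_details(a)
        j += 1
        if d:
            s_j = sum(c * c for c in d) / len(d)   # Equation (4.2)
            log_s[j] = math.log2(s_j)
    js = [j for j in log_s if j_min <= j <= j_max]
    n = len(js)
    mj = sum(js) / n
    ms = sum(log_s[j] for j in js) / n
    slope = (sum((j - mj) * (log_s[j] - ms) for j in js)
             / sum((j - mj) ** 2 for j in js))
    return (slope + 1) / 2

# White noise has no LRD: its LD is flat, so H should come out near 0.5.
random.seed(0)
noise = [random.gauss(0, 1) for _ in range(2**15)]
print(round(hurst_estimate(noise), 2))
```

A horizontal log-scale diagram (slope 0) gives H = 0.5, matching the "no LRD" case discussed in the text.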
Fractional Gaussian noise (FGN) is commonly used for this. However, if one wants to generate a
long-range dependent process whose marginal law is non-Gaussian, the prob-
lem is more complex. The inverse method [14] only guarantees an asymptotic
behavior of the covariance function. We have developed, for several common
laws (exponential, gamma, χ², etc.), an exact method of synthesis described
by Scherrer et al. [11]. We can thus produce synthetic long-range dependent
sample paths that can be used in traffic generation. It is important to note that
most elements of the transaction sequence of on-chip processor communica-
tions have non-Gaussian distributions. For instance, delay sequences typically
exhibit an exponential distribution, as we expect many small delays and few
large ones. With our synthesis method, we can produce a synthetic exponential
process with long-range dependence. Such non-Gaussian and LRD models
have been used for Internet traffic modeling as well [11].
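As a sketch of the inverse method mentioned above (not the exact synthesis of Scherrer et al. [11]): Gaussian samples are pushed through the Gaussian CDF and then through the inverse CDF of the target law, which yields an exactly exponential marginal while the covariance structure is only approximately preserved. Names and parameters are illustrative:

```python
import math, random

def gaussian_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def to_exponential(gaussian_samples, rate=1.0):
    """Inverse method: map each Gaussian sample through Phi, then through
    the inverse exponential CDF F^-1(u) = -ln(1 - u)/rate. If the input is
    a correlated (e.g., LRD) Gaussian process, the output keeps a related
    correlation structure but an exactly exponential marginal."""
    out = []
    for z in gaussian_samples:
        u = gaussian_cdf(z)
        u = min(u, 1.0 - 1e-12)          # guard against log(0)
        out.append(-math.log(1.0 - u) / rate)
    return out

random.seed(1)
g = [random.gauss(0.0, 1.0) for _ in range(20000)]
delays = to_exponential(g, rate=0.5)     # sample mean should be near 1/rate = 2
print(round(sum(delays) / len(delays), 2))
```

With IID Gaussian input this simply produces IID exponential delays; feeding it FGN instead would add the long-range correlation, up to the asymptotic covariance caveat noted in the text.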
We have introduced the major theoretical notions useful for a precise mod-
eling of on-chip traffic. We will now take a more practical view and
explain how these statistical modeling notions can be used in an SoC simula-
tion environment.
FIGURE 4.3
Example of NoC interconnect interface: Advanced VCI, defined by the OCP consortium. A VCI
master sends command signals to a VCI slave (1-bit CMD_VAL, CMD_ACK, and CMD_EOP;
32-bit CMD_ADDRESS, CMD_COMMAND, and CMD_WDATA; 4-bit CMD_TID) and receives
response signals (1-bit RSP_VAL, RSP_ACK, and RSP_EOP; 32-bit RSP_RDATA; 4-bit RSP_TID;
3-bit RSP_ERROR).
Within the SocLib framework, all components are connected via VCI ports
to an NoC interconnection. We used the DSPIN network-on-chip (an evolu-
tion of SPIN [25]), which uses wormhole routing and credit-based contention
control mechanisms. DSPIN uses a set of 4-port routers that can be interconnected
in a mesh topology to provide the desired packet-switched network archi-
tecture. The software running on the processors used in SocLib is compiled
with the GNU GCC tool suite. A tiny open-source operating system called
mutek [26] is used when several processors run in parallel. This OS can handle
multithreading on each processor.
The global MPTG flow is depicted in Figure 4.4. It is composed of three
main parts, described hereafter.
• Reference trace collection. This is the entry point of our MPTG flow.
Because we follow a trace-based approach, we perform a simula-
tion with a fixed-latency interconnection and get a reference trace.
It is important to understand that this reference trace can then be
used for many platform simulations (various interconnections, IP
placement, memory mapping, etc.) because with such an ideal inter-
connect we gather the intrinsic communication patterns of IPs. We
simply make the assumption that the behavior of IPs (order of trans-
actions, etc.) is not influenced by the latency of the network. Because
our traffic generator is aware of the network latency, the reference
FIGURE 4.4
Multiphase traffic generation flow: An initial trace is collected by simulation with an ideal inter-
connection. The trace is then analyzed and segmented to generate a configuration of the MPTG;
then the real simulation can take place.
FIGURE 4.5
MPTG configuration file example.
A generic traffic generator has been written, once and for all, for the SocLib
environment. This traffic generator is used as a standard IP during simu-
lations, and provides a master VCI interface. Transactions are generated by
MPTG according to a phase description file, and a sequencer is in charge of
switching between phases. Each phase consists either of a replay of a recorded
trace or of a stochastic model, with parameters adjusted by the fitting pro-
cedure. These traffic patterns can be arranged in sequences, which are used
during subsequent runs of the simulation. Figure 4.5 illustrates such
a configuration. The entry point of a configuration is the sequencer part, which
schedules the different phases of the traffic. Each phase is then described
in the file using its traffic shape and the associated packet size and address
(destination among the IPs on the NoC).
Designer’s choices made at this stage for the MPTG configuration can be
categorized using the following points, also illustrated in the configuration
file presented in Figure 4.5.
1. Timing modeling. We distinguish two types of placement of trans-
actions in time, as already mentioned in Section 4.2.3. On one hand,
the designer can choose to model the delay D(k); on the other hand,
they can model the aggregated throughput time series Wδ (i). This
choice depends on the context and purpose of the traffic generation.
Using the aggregated throughput, one loses specific information
concerning the time lag between transactions, but the traffic load
(on a time scale exceeding the size of the window δ) is preserved.
If the aggregated throughput is chosen, then two subgroups are
considered to be independent: addresses, commands,
and sizes [A(k), C(k), S(k)] on one hand, and aggregated throughput
[Wδ (i)] on the other.
2. Content modeling. Once the time modeling for transactions has
been decided, the designer must model the content of transactions
(address, command, and size). We have defined different types of
modeling to handle different situations.
• Random. In this mode, each element (address, control, time,
and size) is random, hence independent of the others. This
can be used for generating customizable random load on the
network.
• Cache. In this mode, the size of the read requests is constant
(equal to the size of a cache line). There is a mode for an instruction
cache mixed with a data cache, and a mode for a data cache only.
• Instruction cache. This mode is specific to an instruction
cache. It captures instruction-fetch specificities, meaning that ac-
cesses are only read requests of the size of a cache line.
3. Phase duration modeling. A phase may appear several times in a
trace, therefore it is necessary to characterize the size and number
of transactions for each phase.
4. Order of phases. This stage involves the configuration of the se-
quencer to choose the sequence of phases. It can basically play a
given sequence of phases, or can randomly shuffle them.
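Points 3 and 4 can be sketched with a toy sequencer (the class and parameter names are ours, not the actual MPTG interface):

```python
import random

class PhaseSequencer:
    """Plays phases either in a fixed order or randomly shuffled.
    Each phase is (name, n_transactions); a real MPTG phase would also
    carry its traffic model and its content parameters."""
    def __init__(self, phases, shuffle=False, seed=None):
        self.phases = list(phases)
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def schedule(self):
        order = list(self.phases)
        if self.shuffle:
            self.rng.shuffle(order)
        for name, n_transactions in order:
            for k in range(n_transactions):
                yield name, k   # one entry per generated transaction

seq = PhaseSequencer([("phase0", 2), ("phase1", 3)])
print(list(seq.schedule()))
# [('phase0', 0), ('phase0', 1), ('phase1', 0), ('phase1', 1), ('phase1', 2)]
```

Phase duration (point 3) is captured by `n_transactions`; phase order (point 4) by the `shuffle` flag.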
On top of the traffic content, the MPTG must also define modes for memory
access. Let us recall that one of the objectives of our traffic generation envi-
ronment is to be able, from a reference trace collected with a simple intercon-
nection, to generate traffic for a platform exhibiting an arbitrary interconnect.
To do so, we must ensure that the communication scheme is not affected
by the communication latency. From the point of view of the component, it
means that communications will be the same regardless of the latency of the
interconnect. In the general case of a CPU with a cache, we cannot guarantee
that, because the content of the transactions [the A(k), C(k), and S(k) series] may
be affected by the latency of the network. This is especially due to the presence
of the write buffer. The behavior of such a buffer may, in some cases, cause
modifications in the size of transactions sent on the network, depending on
the latency, especially in the case of large sequences of consecutive writes
(zero initialization of a portion of the memory, for instance).
This is why we use the time D(k), which is counted from the receipt of the
response (so that it is independent of the network latency), instead of the time
between two successive transactions. However, the problem of the overlap
of computations and communications remains. We must be able to determine
whether the delay D(k) is a time during which the component is awaiting a reply
(the component is blocked waiting for the response), or a time during
which the component keeps running, and thus may produce new communica-
tions. This led us to define different operating modes for the traffic generator,
described hereafter.
• Blocking requests. In this mode, whatever the command, the traffic
generator emits a burst of type C(k) to address A(k), and of size
S(k) bus words. Once the response is received, the traffic generator
waits D(k) cycles before issuing the next transaction [T(k+1)] on the
network. This characterizes a component that is blocked (pending)
while making a request.
• Nonblocking requests. In this mode, whatever the command, the
traffic generator emits a burst of type C(k) to address A(k) with a
size of S(k) words. Once the S(k) words have been sent, the traffic
generator restarts after D(k) cycles. Upon receipt of the answer, if D(k)
has already elapsed, then the next request is sent immediately; otherwise,
we wait until D(k) is reached. This allows modeling of a data-flow
component (e.g., a hardware accelerator) that is not blocked by com-
munications. It is unlikely to be used for processor traffic; we included
it for the sake of generality.
• Blocking/nonblocking read and write. We can also specify, more
precisely, whether read transactions and/or write transactions are block-
ing or not. For example, a write-through cache is not blocked by
writes (the processor keeps on running), but a processor reading
a block must wait before continuing its execution. The mode
“nonblocking writes, blocking reads” is thus a good approximation
of the behavior of a cache.
• Full data-flow mode. To emulate the traffic of data flow components,
we have finally established a communication mode in which only
the requests are considered (the arrival of the answer is not taken
into account). In this mode, the traffic generator issues a request,
waits D(k) cycles, and makes the following request, without concern
for the answer.
The definition of these operating modes allows us to deal with different
types of SoC platforms without loss of generality.
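The timing difference between the blocking and nonblocking modes can be illustrated with a toy timeline computation (a fixed round-trip latency and one word per cycle are simplifying assumptions of this sketch; the real traffic generator interacts with the simulated NoC):

```python
def issue_times(sizes, delays, latency, blocking):
    """Compute the cycle at which each request is issued.
    sizes[k]  : S(k), words sent (one word per cycle assumed)
    delays[k] : D(k), idle cycles as defined in the text
    latency   : fixed network round-trip latency (simplification)"""
    t, times = 0, []
    for s, d in zip(sizes, delays):
        times.append(t)
        response = t + s + latency        # response for request k arrives here
        if blocking:
            t = response + d              # wait for the response, then D(k)
        else:
            t = max(t + s + d, response)  # D(k) counted from end of send,
                                          # but still gated by the response
    return times

sizes, delays = [4, 4, 4], [2, 2, 2]
print(issue_times(sizes, delays, latency=10, blocking=True))   # [0, 16, 32]
print(issue_times(sizes, delays, latency=10, blocking=False))  # [0, 14, 28]
```

With the same S(k) and D(k), the nonblocking mode issues requests earlier because D(k) overlaps with the network round trip.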
4.3.3.1 Speedup
To evaluate the speedup of using a traffic generator instead of real IPs, we built
several platforms with different numbers of processors and different network
sizes. The results are reported in Table 4.3. We also compare our speedup
factor with that of a traffic generation environment [27] that performs smart
replay of a recorded trace.
The reference simulation time used for speedup (“S” columns) computa-
tion is the “MIPS without VCD” (processor simulation without recording the
TABLE 4.3
Speedup of the Simulation: Simulation Time in Seconds and Speedup Factor
(S Columns) for Various Platforms and Traffic Generation Schemes
Number of Processors
1 2 3 4
Mesh Size
0×0 2×2 3×3 4×4
Time S Time S Time S Time S
VCD trace file) configuration. The speedup factor for “MIPS with VCD” is
less than one because recording the trace takes a fair amount of time. The
simulation speedup is never greater than 2.27, which is obtained with no
interconnection (“0 × 0” mesh). However, the speedup increases with the
number of processors; as expected, large platforms benefit more from traffic
generator speedup than small ones. One can further note that the impact of
VCD recording decreases for large platforms and even becomes negligible
for a “4 × 4” mesh. Generation of stochastic processes (“sto.” and “lrd” lines)
does not have a big impact on simulation speedup, which means that reading
values from a file and generating random numbers (even LRD processes)
are almost equally costly in terms of computation time. The impact of the
number of phases is also very small.
The speedup factors obtained are of the same order of magnitude as those
obtained by Mahadevan et al. [27], and are quite small. Most of the simulation
time is spent in the core simulation engine and in the simulation of the
interconnection system, which cannot be reduced. Note that our conclusion
is opposite to that of Mahadevan et al., who claim a noteworthy speedup
factor. On the contrary, we found that the speedup is too small to be useful
for designers, and we believe that the real interest of a traffic generation
environment lies in its flexibility (various generation modes, ease of
configuration, etc.). This is illustrated in the following paragraphs.
FIGURE 4.6
Simulation platforms for initial trace collection (a MIPS R3000 processor with its cache, a RAM,
and a measurement point).
TABLE 4.4
Inputs Used in the Simulations
App. Input
4.3.3.3 Multiphase
We have processed each traffic trace with the segmentation algorithm de-
scribed in Section 4.2.4, using the delay as the representative element and for
different numbers of phases (k). The size of the intervals is set to L = 5000 trans-
actions. The choice of k is a trade-off between statistical accuracy (we need a
large interval for statistical estimators to converge) and phase grain (we need
many intervals to properly identify traffic phases). Figure 4.8(b), (c), and (d)
FIGURE 4.7
Simulation platforms for MPTG validation, including five memories, a terminal (TTY), a
MIPS processor, and a background traffic generator (Back TG) used to introduce contention in
the network.
FIGURE 4.8
Phases discovered by our algorithm on the MP3 traffic trace using the delay, for different phase
numbers: (a) original trace (normalized delay versus transaction index); (b) 3-phase clustering;
(c) 4-phase clustering; (d) 5-phase clustering (phase ID versus transaction index).
show the results for various numbers of phases. One can see that the algorithm
finds the analogy between the two frame-processing periods, and identifies
phases inside each of them. The segmentation appears to be valid and pertinent.
Because the segmentation is done with mean and variance as representative
vectors, we expect each identified phase to be stationary and therefore
amenable to stochastic analysis.
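The segmentation step can be sketched as follows: the delay trace is cut into windows of L transactions, each window is represented by its (mean, variance) vector, and these vectors are clustered, here with a basic 2-means procedure (the actual algorithm of Section 4.2.4 may differ in its details; all names are ours):

```python
import random

def window_features(delays, L):
    """Represent each window of L consecutive delays by (mean, variance)."""
    feats = []
    for i in range(0, len(delays) - L + 1, L):
        w = delays[i:i + L]
        m = sum(w) / L
        feats.append((m, sum((x - m) ** 2 for x in w) / L))
    return feats

def two_means(points, iters=10):
    """Basic 2-means clustering with deterministic farthest-first init;
    returns one phase label (0 or 1) per window."""
    def d2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    c0 = points[0]
    c1 = max(points, key=lambda p: d2(p, c0))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if d2(p, c0) <= d2(p, c1) else 1 for p in points]
        for c in (0, 1):
            pts = [p for p, l in zip(points, labels) if l == c]
            if pts:
                center = (sum(m for m, _ in pts) / len(pts),
                          sum(v for _, v in pts) / len(pts))
                if c == 0:
                    c0 = center
                else:
                    c1 = center
    return labels

# Toy trace: four segments alternating between two phases with clearly
# different mean delays (exponential marginals, as in the text).
random.seed(2)
trace = []
for phase in (0, 1, 0, 1):
    mean = 2.0 if phase == 0 else 10.0
    trace += [random.expovariate(1.0 / mean) for _ in range(1000)]
labels = two_means(window_features(trace, 500))
print(labels)  # windows belonging to the same phase share a label
```

Each label then selects which per-phase stochastic model (marginal and covariance) the generator should use.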
FIGURE 4.9
LD of the traffic trace corresponding to the MPEG-2 implementation. Ĥ = 0.56.
scale allows for a fine grain analysis of the traffic. For each application, we
comment on the LD presented in Figures 4.9, 4.10, and 4.11.
• MPEG-2 (Figure 4.9). The shape of the LD does not exhibit evidence
for LRD. Indeed the estimated value for the Hurst parameter, H =
0.56, indicates that LRD is not present in the trace (H = 0.5 means
no LRD). In this case, an IID (independent identically distributed)
process would be a good approximation of the traffic. One can note
a peak around scale 2^5, meaning that a recurrent operation with
this periodicity is present in the algorithm, which might have an
FIGURE 4.10
LD of the traffic trace corresponding to the MP3 implementation. Ĥ = 0.58.
FIGURE 4.11
LD of the traffic trace corresponding to the JPEG2000 implementation. Ĥ = 0.89.
TABLE 4.5
Accuracy of MPTG: Error (in Percent) on Various Metrics with
Respect to the Reference MIPS Simulation (NoC Platform)
Config. Delay Size Cmd Throughput Latency
the processors’ simulation and the traffic traces obtained from traffic genera-
tors. It is clear that one should not look at global metrics such as the average
delay or the average throughput. This would not highlight the interest of the
multiphase approach. As such, we define an accuracy measure by computing
the mean evolution of each transaction’s element (delay, size, command, and
throughput). The mean evolution is defined as the average value of the series,
computed in consecutive time windows of size L.
To summarize the results, we define the error as the mean of the absolute
values of the relative differences between two mean evolutions. Let Mref(i)
be the mean evolution of some element for the reference simulation, let M(i)
be the mean evolution of the same element for another simulation, and let n
be the number of points of both functions. The error (in percent) is:

Err = (1/n) Σi |Mref(i) − M(i)| / Mref(i) × 100

This is a classical signal processing technique to evaluate the distance between
two signals. Furthermore, we define the cycle error as the relative difference
between the numbers of simulated cycles.
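The accuracy metric can be written directly (a sketch; the helper names are ours):

```python
def mean_evolution(series, L):
    """Average value of the series in consecutive windows of size L."""
    return [sum(series[i:i+L]) / L for i in range(0, len(series) - L + 1, L)]

def error_percent(ref, other, L):
    """Err = (1/n) * sum_i |Mref(i) - M(i)| / Mref(i) * 100."""
    m_ref = mean_evolution(ref, L)
    m = mean_evolution(other, L)
    n = min(len(m_ref), len(m))
    return sum(abs(m_ref[i] - m[i]) / m_ref[i] for i in range(n)) / n * 100

ref   = [10, 10, 20, 20]   # e.g., delays from the reference MIPS simulation
other = [11, 9, 18, 22]    # same element from a traffic-generator run
print(error_percent(ref, other, L=2))  # window means match exactly -> 0.0
```

The example shows why the metric compares mean evolutions rather than raw series: transaction-level differences cancel out inside each window, so only phase-level drift is penalized.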
To illustrate those metrics, Table 4.5 shows accuracy results on the NoC
platform and the execution of the MP3 application.
As expected, the higher the phase number is, the more accurate the sim-
ulations are. In particular, the error on latency becomes very low when the
number of phases is greater than one. This is of major importance because
the latency of communications reflects the network state. It means that, from
a network performance point of view, multiphase traffic generation is
satisfactory. It therefore provides an interesting trade-off between
deterministic replay and random traffic.
and design that use deterministic traffic generation (trace replay) [27–29]. For
instance, the TG proposed by Mahadevan et al. [27] uses a trace compiler
that can generate a program for a reduced instruction set processor that will
replay the recorded transactions in a cycle accurate simulation without having
to simulate the complete processor. This TG is sensitive to the network latency:
changing network latency will produce a similar effect on the TG as on the
original IP. This is an important point that is also taken into account in our
environment.
An alternative solution for NoC performance analysis is to use stochastic
traffic generators, as used in many environments [30–33]. However, none of
these works proposes a fitting procedure to determine the adequate statis-
tical parameters that should be used to simulate traffic. Recently, the work
presented by Soteriou et al. [34] studies an LRD on-chip traffic model in
detail with fitting procedures. To our knowledge, no NoC traffic study has
introduced multiphase modeling. A complete traffic generation environment
should integrate both deterministic and stochastic traffic generation tech-
niques. Since the seminal work of Varatkar and Marculescu [14], long-range
dependence has been used in on-chip traffic generators [35]. Marculescu et al.
isolated long-range-dependent behavior in the communications between
different parts of a hardware MPEG-2 decoder at the macro-block level.
Rapid NoC design is a major concern for next-generation MPSoC design.
In this field, processor traffic emulation is a real bottleneck. This chapter has
investigated many issues related to the sizing of NoCs. In particular, it insists
on the fact that a serious statistical toolbox must be used to generate realistic
traffic patterns on the network.
References
[1] B. Calder, G. Hamerly, and T. Sherwood. Simpoint. Online: http://www.cse.
ucsd.edu/∼calder/simpoint/, April 2001.
[2] A. Scherrer. Analyses statistiques des communications sur puces. PhD thesis, ENS
Lyon, LIP, France, Dec. 2006.
[3] J. Archibald and J. L. Baer. Cache coherence protocols: Evaluation using a multi-
processor simulation model. ACM Transactions on Computer Systems 4, no. 4
(November 1986): 273–298.
[4] R. H. Katz, S. J. Eggers, D. A. Wood, C. L. Perkins, and R. G. Sheldon. Implement-
ing a cache consistency protocol. In Proc. of 12th Annual International Symposium
on Computer Architecture, 276–283. Boston, MA: IEEE Computer Society Press,
1985.
[5] R. Jain. The Art of Computer Systems Performance Analysis. New York: John Wiley
& Sons, 1991.
[6] P. J. Brockwell and R. A. Davis. Time Series: Theory and Methods, 2nd ed. Springer
Series in Statistics. New York: Springer, 1991.
CONTENTS
5.1 Introduction ............................................................................................... 124
5.2 Attack Taxonomy....................................................................................... 125
5.2.1 Attacks Addressing SoCs ............................................................. 125
5.2.1.1 Software Attacks ............................................................ 126
5.2.1.2 Physical Attacks ............................................................. 126
5.2.2 Attacks Exploiting NoC Implementations ................................ 130
5.2.2.1 Denial of Service ............................................................ 131
5.2.2.2 Illegal Access to Sensitive Information....................... 133
5.2.2.3 Illegal Configuration of System Resources ................ 133
5.2.3 Overview of Security Enhanced Embedded Architectures .... 133
5.3 Data Protection for NoC-Based Systems................................................ 135
5.3.1 The Data Protection Unit.............................................................. 135
5.3.2 DPU Microarchitectural Issues.................................................... 137
5.3.3 DPU Overhead Evaluation .......................................................... 139
5.4 Security in NoC-Based Reconfigurable Architectures ......................... 140
5.4.1 System Components ..................................................................... 140
5.4.1.1 Security and Configuration Manager ......................... 140
5.4.1.2 Secure Network Interface ............................................. 140
5.4.1.3 Secure Configuration of NIs......................................... 142
5.4.2 Evaluation of Cost ......................................................................... 142
5.5 Protection from Side-Channel Attacks ................................................... 143
5.5.1 A Framework for Cryptographic Keys Exchange in NoCs..... 143
5.5.1.1 Secure Messages Exchange .......................................... 145
5.5.1.2 Download of New Keys................................................ 146
5.5.1.3 Other Applications ........................................................ 148
5.5.1.4 Implementation Issues .................................................. 148
5.5.2 Protection of IP Cores from Side Channel Attacks................... 148
5.5.2.1 Countermeasures to Side-Channel Attacks ............... 149
5.6 Conclusions ................................................................................................ 150
5.7 Acknowledgments..................................................................................... 151
References............................................................................................................. 151
© 2009 by Taylor & Francis Group, LLC
124 Networks-on-Chips: Theory and Practice
5.1 Introduction
As computing and communications increasingly pervade our lives, security
and protection of sensitive data and systems are emerging as extremely im-
portant issues. This is especially true for embedded systems, often operating
in nonsecure environments, while at the same time being constrained by such
factors as computational capacity of microprocessor cores, memory size, and
in particular power consumption [1–3]. Due to such limitations, security so-
lutions designed for general-purpose computing are not suitable for this type
of system.
At the same time, viruses and worms for mobile phones have been reported
recently [4], and they are expected to develop and spread as the targeted sys-
tems increase in offered functionality and complexity. Known as malware,
these malicious programs are currently able to spread through Bluetooth con-
nections or MMS (Multimedia Messaging Service) messages and infect recip-
ients’ mobile phones with copies of the virus or the worm, hidden under the
appearance of common multimedia files [5,6]. As an example, the worm fam-
ily Beselo operates on devices based on the operating system (OS) Symbian
S60 Second Edition [7]. It is able to spread via Bluetooth and MMS as Symbian
SIS installation files. The SIS file is named with MP3, JPG, or RM extensions
to trick the recipient into thinking that it is a multimedia file. If the phone user
attempts to open the file, the Symbian OS will recognize it as an installation
file and will start the application installer, thereby infecting the device.
In the context of the overall embedded System-on-Chip (SoC)/device se-
curity, security-awareness is therefore becoming a fundamental concept to be
considered at each level of the design of future systems, and to be included as
good engineering practice from the early stages of the design of software and
hardware platforms. In fact, an attacker is more likely to direct an attack
at the weak points of a system than to try to break complex cryptographic
algorithms or secure transmission protocols by brute force in order
to access or decrypt the protected information. Networks-on-Chips (NoCs)
should be considered in the security-aware design process as well. In fact, the
advantages in terms of scalability, efficiency, and reliability given by the use of
such a complex communication infrastructure may lead to new weaknesses
in the system that can be critical and should be carefully studied and eval-
uated. On the other hand, NoCs can contribute to the overall security of the
system, providing additional means to monitor system behavior and detect
specific attacks [8,9]. In fact, communication architectures can effectively react
to security attacks by disallowing the offending communication transactions,
or by notifying appropriate components of security violations [10].
The particular characteristics of NoC architectures make it necessary to
address the security problem in a comprehensive way, encompassing all the
aspects ranging from silicon-related to network-specific ones, both with re-
spect to the families of attacks that should be expected and to the protective
countermeasures that must be created. To provide a guide along such lines, we
FIGURE 5.1
Attacks on embedded systems.
it or interfere with it. These types of attacks exploit the characteristic imple-
mentation of the system or some of its properties to break the security of the
device. The literature usually classifies them as invasive and noninvasive [13].
Invasive attacks require direct access to the internal components of the
system. For a system implemented on a circuit board, inter-component com-
munication can be eavesdropped by means of probes to retrieve the desired
information [1]. In the case of SoC, access to the internal information of the
chip implies the use of sophisticated techniques to depackage it and the use of
microprobes to observe internal structure and detect values on buses, mem-
ories, and interfaces. A typical microprobing attack would employ a probing
station, used in the manufacturing industry for manual testing of product line
samples, and consisting of a microscope and micromanipulators for position-
ing microprobes on the surface of the chip. After depackaging the chip by
dissolving the resin covering the silicon, the layout is reconstructed by using
the microscope in combination with the removal of the covering layers, inferring
at various levels of granularity the internal structure of the chip. Microprobes
or e-beam microscopy are therefore used to observe values inside the chip.
The cost of the infrastructure makes microprobing attacks difficult. However,
they can be employed to gather information on some sample devices (e.g.,
information on the floorplan of the chip and the distribution of its main com-
ponents) that can be used to perform other types of noninvasive attacks.
Noninvasive attacks exploit externally available information, unintention-
ally leaking from the observed system. Unlike invasive attacks, the device is
not opened or damaged during the attack. There are several types of non-
invasive attacks, exploiting different sources of information gained from the
physical implementation of a system, such as power consumption, timing
information, or electromagnetic leaks.
Timing attacks were first introduced by Kocher [14]. Figure 5.2 shows a rep-
resentation of a timing attack. The attacker knows the algorithm implemen-
tation and has access to measurements of the inputs and outputs of the secure
system. The attacker's goal is to discover the secret key stored inside the secure system.
The attacker exploits the observation that the execution time of computa-
tions is data-dependent, and hence secret information can be inferred from
FIGURE 5.2
Representation of the timing attack (a secure system containing a secret key, observed by the attacker).
its measurement. In those attacks, the attacker observes the time required by
the device to process a set of known inputs with the goal of recovering a se-
cret parameter (e.g., the cryptographic key inside a smart-card). The execution
time for hardware blocks implementing cryptographic algorithms usually
depends on the number of ‘1’ bits in the key. Although the number of ‘1’ bits
alone is not enough to recover the key, repeated executions with the same key
and different inputs can be used to perform statistical correlation analysis of
timing information and therefore recover the key completely. Delaying
computations so that they always take a multiple of some fixed amount of
time, or adding random noise or delays, increases the number of measurements required, but
does not prevent the attack. Techniques exist, however, to counteract timing
attacks at the physical, technological, or algorithmic level [13].
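As a toy illustration of this statistical approach, the sketch below simulates a device whose processing time leaks the Hamming weight of the input masked by the key. All names and timing constants here are hypothetical, chosen only so that the effect is visible above the simulated noise; this is a minimal sketch of the idea, not a real attack implementation.

```python
import random

random.seed(0)

# Hypothetical leakage parameters (illustrative values only).
T_BIT = 5e-6      # extra time when a set key bit processes a set input bit
NOISE = 2e-5      # bound on random measurement noise
KEY = 0b10110100  # the secret to be recovered
BITS = 8

def measure(x):
    """Toy device: processing time grows with the number of '1' bits
    in (x AND KEY), plus random noise."""
    return 1e-3 + bin(x & KEY).count("1") * T_BIT + random.uniform(0, NOISE)

def recover_key(samples=4000):
    """Statistical correlation of timings: average the measurements
    conditioned on each input bit to decide the matching key bit."""
    inputs = [random.randrange(1 << BITS) for _ in range(samples)]
    times = [measure(x) for x in inputs]
    key = 0
    for j in range(BITS):
        hi = [t for x, t in zip(inputs, times) if (x >> j) & 1]
        lo = [t for x, t in zip(inputs, times) if not (x >> j) & 1]
        # Inputs with bit j set take longer only if key bit j is 1.
        if sum(hi) / len(hi) - sum(lo) / len(lo) > T_BIT / 2:
            key |= 1 << j
    return key

assert recover_key() == KEY
```

Note how averaging many measurements makes the per-bit timing difference stand out even though each individual measurement is dominated by noise, which is exactly why adding random delays only increases the number of measurements required.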
Power analysis attacks [15] are based on the analysis of power consumption
of the device while performing the encryption operation. Main contributions
to power consumption are due to gate switching activity and to the parasitic
capacitance of the interconnect wires. The current absorbed by the device
is measured by very simple means. It is possible to distinguish between two
types of power analysis attacks: simple power analysis (SPA) and differential
power analysis (DPA).
SPA involves direct interpretation of power consumption measurements
collected during cryptographic operations. Observing the system’s power
consumption allows identifying sequences of instructions executed by the
attacked microprocessor to perform a cryptographic algorithm. In those im-
plementations of the algorithm in which the execution path depends on the
data being processed, SPA can be used directly to interpret the cryptographic
key employed. As an example, SPA can be used to break RSA implementations
by revealing differences between the multiplication and squaring operations
performed during the modular exponentiation operation [15]. If the squar-
ing operation is implemented (due to code optimization choices) differently
than the multiplication, two distinct consumption patterns will be associated
with the two operations, making it easier to correlate the power trace of the
execution of the exponentiator to the exponent’s value. Moreover, in many
cases SPA attacks can help reduce the search space for brute-force attacks.
Avoiding procedures that use secret intermediates or keys for conditional
branching operations will help protect against this type of attack [15].
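The data-dependent operation sequence described above can be made concrete with a minimal sketch (not the attacked implementation from [15]): a left-to-right square-and-multiply exponentiator whose operation sequence, as visible in an SPA power trace, encodes the exponent bit by bit.

```python
def modexp_with_trace(base, exponent, modulus):
    """Left-to-right square-and-multiply; also records the operation
    sequence that an SPA trace would reveal."""
    result, trace = 1, []
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus    # squaring: one power pattern
        trace.append("S")
        if bit == "1":
            result = (result * base) % modulus  # multiply: a distinct pattern
            trace.append("M")
    return result, "".join(trace)

def exponent_from_trace(trace):
    """Read the secret exponent off the trace: 'SM' -> 1, lone 'S' -> 0."""
    bits, i = [], 0
    while i < len(trace):
        if trace[i:i + 2] == "SM":
            bits.append("1")
            i += 2
        else:
            bits.append("0")
            i += 1
    return int("".join(bits), 2)

result, trace = modexp_with_trace(7, 0b101101, 1000003)
assert result == pow(7, 0b101101, 1000003)
assert exponent_from_trace(trace) == 0b101101  # exponent read from the trace
```

This is why avoiding key-dependent conditional branches (e.g., always performing a multiplication and discarding its result when the bit is 0) is a common countermeasure.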
DPA attacks are harder to prevent. In addition to the large-scale power
variations used in SPA, DPA exploits the correlation between the data val-
ues manipulated and the variation in power consumption. In fact, it allows
adversaries to retrieve extremely weak signals from noisy sample data, often
without knowing the design of the target system. To achieve this goal, these
attacks use statistical analysis and error-correction statistical methods to gain
information about the key. The power consumption of the target device is re-
peatedly and extensively sampled during the execution of the cryptographic
computations. The goal of the attacker is to find the secret key used to ci-
pher the data at the input of the device, by making guesses on a subset of
the key to discover and calculating the values of the processed data in the
FIGURE 5.3
Power traces of a DPA attack on a Kasumi S-box; the differential trace for the correct key (23) shows the highest current-absorption peak. (From Regazzoni, F. et al. In Proc. of International Symposium on Systems, Architectures, Modeling, and Simulation (SAMOS VII), Samos, Greece, July 2007.)
point of the cryptographic algorithm selected for the attack. Power traces
are collected and divided into two subsets, depending on the value predicted
for the selected bit. The differential trace, calculated as the difference between
the average traces of the two subsets, shows spikes in regions where the
predicted value is correlated to the values being processed. The correct value of the key
can thus be identified from the spikes in its differential trace. As an example,
Figure 5.3 shows a simulation of a DPA attack on a Kasumi S-box implemented
in CMOS technology [16]. The Kasumi block cipher is a Feistel cipher with
eight rounds, with a 64-bit input and a 64-bit output, and a secret key with a
length of 128 bits. Kasumi is used as a standardized confidentiality algorithm
in 3GPP (3rd Generation Partnership Project) [17]. In the figure it is possible
to note how the differential trace of the correct key (plotted in black) presents
the highest peak, being therefore clearly distinguishable from the remaining
ones and showing a clear correlation to the values processed by the block
cipher. For a more detailed discussion of DPA attacks, see Kocher et al. [15].
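A hedged miniature of this procedure is sketched below, using a published 4-bit S-box as a stand-in for Kasumi's and a simple Hamming-weight power model; for robustness on such a tiny example it ranks key guesses by correlating predicted intermediate weights with the noisy samples, a correlation-style variant of the difference-of-means test described above. Key value and noise level are hypothetical.

```python
import random

random.seed(1)

# A published 4-bit S-box (PRESENT), used here only as a stand-in
# for the attacked S-box; the key and noise level are hypothetical.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
SECRET_KEY = 0x9

def hw(v):
    return bin(v).count("1")  # Hamming weight

# Toy power model: each sample tracks the Hamming weight of the
# S-box output, buried in Gaussian noise.
plaintexts = [random.randrange(16) for _ in range(20000)]
samples = [hw(SBOX[p ^ SECRET_KEY]) + random.gauss(0, 2.0) for p in plaintexts]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def score(key_guess):
    """How well the intermediate values predicted under this key guess
    explain the noisy measurements."""
    predicted = [hw(SBOX[p ^ key_guess]) for p in plaintexts]
    return pearson(predicted, samples)

best = max(range(16), key=score)
assert best == SECRET_KEY  # the correct guess shows the strongest correlation
```

As in Figure 5.3, only the correct guess predicts intermediates that actually correlate with consumption; wrong guesses decorrelate and their scores stay near zero.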
Electromagnetic analysis (EMA) attacks exploit measurements of the elec-
tromagnetic radiations emitted by a device to reveal sensitive information.
This can be performed by placing coils in the neighborhood of the chip and
studying the measured electromagnetic field. The information collected can
therefore be analyzed with simple analysis (SEMA) and differential analysis
(DEMA), or with more advanced correlation attacks. Compared to power analysis
attacks, EMA attacks present a more flexible but more challenging measurement
phase (in some cases measurement can be carried out at a significant
distance from the device—15 feet [13]), and the collected signals offer a wide
spectrum of potential information. A deep knowledge of the layout
makes the attack much more efficient, allowing the isolation of the region
around which the measurement should be performed. Moreover, depackag-
ing the chip will avoid perturbations due to the passivation layers.
Fault induction attacks exploit some types of variations in external or envi-
ronmental parameters to induce faulty behavior in the components to inter-
rupt the normal functioning of the system or to perform privacy or precursor
attacks. Faulty computations are sometimes the easiest way to discover the
secret key used within the device. The results of erroneous operations and
behavior can leak information related to the secret parameter to be retrieved.
Faults can be induced by acting on the device's environment, putting it in
abnormal conditions. Typical fault induction attacks may involve
variation of voltage supply, clock frequency, operating temperature, and en-
vironmental radiations and light. As an example, refer to Boneh et al. [18],
where the use of the Chinese Remainder Theorem to improve performance
in the execution of RSA is exploited to mount a fault-based attack. Differential
fault analysis (DFA) has also been introduced to attack Data Encryption
Standard (DES) implementations [13].
Scan-based side-channel attacks exploit access to scan chains to retrieve secret
information stored in the device. The concept of scan design was introduced
over 30 years ago by Williams and Eichelberger [19] with the basic aim of
making the internal state of a finite state machine directly controllable and
observable. To this end, all (D-type) flip-flops in the FSM are substituted
by master-slave devices provided with a multiplexer on the data input, and
when the FSM is set to test mode they are connected in a “scan path,” that
is, a shift register accessible from external pins. This concept has been ex-
tended for general, complex chips (and boards) through the JTAG standard
(IEEE 1149.1) that allows various internal modes for the system and makes its
internal operation accessible to external commands and observation—when
in test mode—through the test port. JTAG compliance is by now a universal
standard, given the complexity of testing SoCs. Internal scan chains are
connected to the JTAG interface during the packaging of the chip, in order
to provide on-chip debug capability. To prevent access after the test phase, a
protection bit is set by using for instance fuses or anti-fuses, or the scan chain
is left unconnected. However, both techniques can be compromised allowing
the attacker to access the information stored in the scan chain [20].
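The leak is easy to picture with a toy model of a scan path: once the protection bit is defeated and the chain is clocked in test mode, the flip-flops act as a shift register that serializes the internal state, including any key register, out of an external pin. This is a minimal sketch, not a model of any particular chip.

```python
class ScanChain:
    """Toy scan path: in test mode the flip-flops of the FSM form a
    shift register accessible from external pins."""
    def __init__(self, bits):
        self.flops = list(bits)  # internal state of the flip-flops

    def shift(self, scan_in=0):
        scan_out = self.flops[-1]               # last flop drives the scan-out pin
        self.flops = [scan_in] + self.flops[:-1]
        return scan_out

# With test-mode access, the attacker clocks the chain and serializes
# any internal register, e.g. a round key, out of the chip.
secret = [1, 0, 1, 1, 0, 0, 1, 0]
chain = ScanChain(secret)
leaked = [chain.shift() for _ in range(len(secret))]
assert leaked[::-1] == secret  # the full internal state is recovered
```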
Network interfaces (NIs) provide a basic filter for requests and packets
maliciously injected into the network by compromised cores. However, illegal
access to NIs' configuration registers by an attacker may be exploited to carry
out the described types of attacks. Moreover, fault induction
techniques can be applied to modify information stored in such registers and
cause disruptions in inter-core communication.
Data and instruction tampering represents a serious threat to the system.
Unauthorized access to data and instructions in memory can compromise
the execution of programs running on the system, causing it to crash or to
behave in an unpredictable way. Therefore, protection of critical data repre-
sents an essential task, in particular in multiprocessor SoCs, where blocks of
memory are often shared among several processing units. Tampering of data
and instructions in memory can be performed when a processor writes out-
side the bounds of the allocated memory, for instance, in the case of an attack
exploiting buffer overflow techniques [12].
Draining attacks aim at reducing the operative life of a battery-powered
embedded system. In fact, the battery in mobile pervasive devices represents
a point of vulnerability that must be protected. If an attacker is able to drain
a device’s battery, for example, by having it execute energy-hungry tasks, the
device will not be of any use to the user. Literature by Martin et al. and Nash
et al. [22,23] presents the following three main methods by which an attacker
can drain the battery of a device:
trusted and protected from physical attacks, so that its internal state cannot
be tampered with or observed directly by physical means. On the contrary,
external memory and peripherals are assumed to be untrusted and subject to
observation and tampering. Therefore, their integrity and privacy are ensured
by a mechanism for integrity verification and encryption. The system is pro-
tected against untrusted OSs by a security kernel that operates with higher
privileges than a regular OS, or by a hardware secure context manager that
verifies the core functions of the OS.
Enhanced communication architectures have been proposed to facilitate
higher security in SoCs, monitoring and detecting violations, blocking at-
tacks, and providing diagnostic information for triggering suitable responses
and recovery mechanisms [10]. This can be implemented by adding specific
modules to typical communication architectures such as AMBA, to moni-
tor access to regions on the address space, configuration of peripherals, and
sequences of bus transactions.
Considering typical commercial embedded platforms, ARM’s approach to
enabling trusted computing within the embedded world is based on the con-
cept of the TrustZone Platform [28]. The entire TrustZone architecture can
be seen as subdivided into secure and nonsecure regions, allowing the se-
cure code and data to run alongside an OS securely and efficiently, without
being compromised or vulnerable to attack. A non-secure indicator bit (NS)
determines the security operation state of the various components and can
only be accessed through the “Secure Monitor” processor mode, accessible
only through a limited set of entry points. This mode is allowed to switch the
system between secure and nonsecure states, allowing a core in the secure
state to gain higher levels of privilege. With reference to the interconnection
system, the AMBA AXI Configurable Interconnect supports secure-aware
transactions. Transactions requested by masters are monitored by a specific
TrustZone controller, which is in charge of aborting those considered ille-
gal. Secure-aware memory blocks are supported through the AXI TrustZone
memory adapter, allowing sharing of single memory cells between secure and
nonsecure storage areas. A similar solution to protect memory access is pro-
vided by Sonics [29] in its SMART Interconnect solutions, where an on-chip
programmable security “firewall” is employed to protect the system integrity
and the media content passed between on-chip processing blocks and various
I/Os and the memory subsystem.
It is worth noting that the use of protected transactions is also included in the
specifications defined by the Open Core Protocol International Partnership
(OCP-IP) [30]. The standard OCP interface can be extended through a layered
profile to create a secure domain across the SoC and provide protection against
software and some selective hardware attacks. The secure domain might
include the CPU, memory, I/O, etc., which must be secured using a collection
of hardware and software features such as secured interrupts, secured
memory, or special instructions to access the secure mode of the processor.
In multiprocessor environments, protection of preinstalled applications
from native applications downloaded from untrusted sources can be assured
FIGURE 5.4
Data protection unit (DPU): basic idea.
(SourceID), its Role at the time of the request, the type of the target data (data
or instruction, D/I)) and the target Memory Address.
FIGURE 5.5
A simple example of a system with three initiators (μPs) and one target (Mem), showing the architecture using the DPU integrated at the target network interface.
FIGURE 5.6
DPU microarchitecture integrated at the target network interface (lookup table implemented as CAM, TCAM, and RAM sections, with a match-enable output and an upper-bound comparison on the target address).
and corresponds to the case in which a set of memory blocks might not require
any access verification.
The output enable line of the DPU is generated by a logical AND of the
access right obtained from the lookup, the check on the block boundaries,
and, in the more conservative version of the DPU, the match on the LUT.
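The access check can be sketched as follows. The rule fields and their encoding are hypothetical (the actual DPU uses CAM/TCAM/RAM lookups), but the enabling condition mirrors the logical AND of the lookup result, the bounds check, and the match.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """One hypothetical DPU entry: who may do what, and where."""
    source_id: int
    role: int          # e.g. 0 = user, 1 = supervisor
    base: int
    upper_bound: int
    rights: str        # subset of "rw" for data, "x" for instruction fetch

class DPU:
    """Sketch of a data protection unit at the target NI: a request is
    enabled only if a rule matches the requester AND the address is in
    bounds AND the rule grants the requested access type."""
    def __init__(self, rules):
        self.rules = rules

    def enable(self, source_id, role, addr, access):
        for r in self.rules:
            match = (r.source_id == source_id and r.role == role)
            in_bounds = r.base <= addr <= r.upper_bound
            granted = access in r.rights
            if match and in_bounds and granted:  # logical AND of the checks
                return True
        return False

dpu = DPU([Rule(source_id=1, role=0, base=0x1000, upper_bound=0x1FFF, rights="rw")])
assert dpu.enable(1, 0, 0x1800, "w")      # allowed write
assert not dpu.enable(1, 0, 0x2800, "w")  # out of bounds
assert not dpu.enable(2, 0, 0x1800, "r")  # unknown initiator
```

Placing this check at the target NI, rather than in each initiator, keeps the rules in one place per shared memory and filters every transaction that reaches it.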
FIGURE 5.7
DPU overhead as the number of entries varies from 8 to 128: (a) delay (critical path, ns), (b) area (mm²), (c) energy (nJ).
FIGURE 5.8
Protocol used for SNI reconfiguration (states INIT, RUN, and DPR, with transitions triggered by reconfiguration or alert events).
FIGURE 5.9
Framework for secure exchange of keys at the network level.
FIGURE 5.10
Protocol for secure exchange of messages within the NoC. On transmission, the message m is encrypted as c = E(K´n, m), and either an authentication tag t = mac(KMAC,i, c) or an integrity tag i = h(c) is computed; c ∥ t (or c ∥ i) is sent. On reception, the tag is recomputed (tt = mac(KMAC,j, c) or it = h(c)) and compared with the received one, and m = D(K´n, c) is recovered.
When receiving a new message (c ∥ t or c ∥ i), the following steps are
performed by the wrapper of the receiving secure core:
1. The key-keeper core receives the new key encrypted with the old
user key, EAES(Kuser, Kuser^new), and sends it to the secure core SCoreAES,i,
FIGURE 5.11
Protocol for user's key updating: the key-keeper sends E(K´n, EAES(Kuser, Kuser^new)); the secure core removes the network-level encryption, recovering EAES(Kuser, Kuser^new); decrypts with the old user key, obtaining Kuser^new = DAES(Kuser, EAES(Kuser, Kuser^new)); and returns E(K´n, Kuser^new).
Authentication of the encrypted new key will also be required to verify the
validity of the authority sending the message, involving in the procedure the
same or other secure cores.
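The transmission and reception flow of Figure 5.10 can be sketched as an encrypt-then-MAC exchange. The stream cipher below is a toy stand-in for the cipher E of the protocol, and HMAC-SHA256 stands in for the mac and hash functions; key names are illustrative only.

```python
import hashlib
import hmac

def E(key, data):
    """Toy stream 'cipher' (keystream from SHA-256 in counter mode);
    a stand-in for the block cipher E of the protocol."""
    out = bytearray()
    for i in range(0, len(data), 32):
        ks = hashlib.sha256(key + i.to_bytes(4, "big")).digest()
        out += bytes(a ^ b for a, b in zip(data[i:i + 32], ks))
    return bytes(out)

D = E  # XOR keystream: decryption is the same operation

def send(k_net, k_mac, m):
    c = E(k_net, m)                            # c = E(K'n, m)
    t = hmac.new(k_mac, c, "sha256").digest()  # t = mac(KMAC, c)
    return c + t                               # send c || t

def receive(k_net, k_mac, packet):
    c, t = packet[:-32], packet[-32:]
    tt = hmac.new(k_mac, c, "sha256").digest()  # recompute the tag
    if not hmac.compare_digest(tt, t):          # tt != t -> reject
        raise ValueError("authentication failed")
    return D(k_net, c)                          # m = D(K'n, c)

pkt = send(b"net-key", b"mac-key", b"sensitive configuration data")
assert receive(b"net-key", b"mac-key", pkt) == b"sensitive configuration data"
```

Verifying the tag before decrypting mirrors the figure: a message whose recomputed tag differs from the received one is discarded without ever being interpreted.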
the influence of the sensitive data (hiding), or randomizing the connection
between sensitive data and the observable physical values (masking). These
techniques can be realized in hardware or in software. Although software
implementations imply a reduction of system performance, hardware
countermeasures increase the area and power consumption of the system. This
section presents an overview of some of these techniques for enhancing the
security of IP cores.
5.6 Conclusions
This chapter addressed the problem of security in embedded devices, with
particular emphasis on systems adopting the NoC paradigm. We have dis-
cussed general security threats and analyzed attacks that could exploit weak-
nesses in the implementation of the communication infrastructure. Security
should be considered at each level of the design, particularly in embedded
systems, physically constrained by such factors as computational capacity of
microprocessor cores, memory size, and in particular power consumption.
This chapter therefore presented solutions proposed to counteract security
threats at three different levels. From the point of view of the overall sys-
tem design, we presented the implementation of a secure NoC-based system
suitable for reconfigurable devices. Considering data transmission and se-
cure memory transactions, we analyzed trade-offs in the implementation of
on-chip data protection units. Finally, at a lower level, we addressed phys-
ical implementation of the system and physical types of attacks, discussing
a framework to secure the exchange of cryptographic keys or more general
sensitive messages.
Although existing work addresses specific security threats and proposes
solutions for counteracting them, security in NoC-based systems remains an
open research topic. Much work remains to be done toward the overall goal
of providing a secure system at each level of the design and addressing the
security problem in a comprehensive way.
Future challenges in the security of NoC-based SoCs lie in including
security awareness at the early stages of system design, in order to limit the
flaws that could be exploited by attackers for their malicious purposes.
Moreover, modern secure systems should be able to counteract attempted
security violations efficiently and rapidly. As shown in this chapter, NoCs can
represent the ideal place where malicious behaviors are monitored and
detected. However, security has a cost.
5.7 Acknowledgments
Part of this work has been carried out under the MEDEA+ LoMoSA+ Project
and was partially funded by KTI—The Swiss Innovation Promotion Agency—
Project Nr. 7945.1 NMPP-NM. The authors would also like to acknowledge
the fruitful discussions about security in embedded systems they had with
Francesco Regazzoni and Slobodan Lukovic.
References
1. S. Ravi and A. Raghunathan, “Security in embedded systems: Design chal-
lenges,” ACM Transactions on Embedded Computing Systems 3(3)(August 2004):
461–491.
2. P. Kocher, R. Lee, G. McGraw, A. Raghunathan, and S. Ravi, “Security as a
new dimension in embedded system design.” In Proc. of 41st Design Automation
Conference (DAC’04), San Diego, CA, June 2004.
3. R. Vaslin, G. Gogniat, and J. P. Diguet, “Secure architecture in embedded
systems: An overview.” In Proc. of ReCoSoC’06, Montpellier, France, July 2006.
4. “Symbos.cabir,” Symantec Corporation, Technical Report, 2004.
5. J. Niemela, Beselo—Virus Descriptions, F-Secure, Dec. 2007. [Online]. Available:
http://www.f-secure.com/v-descs/worm_symbos_beselo.shtml.
6. J. Niemela, Skulls.D—Virus Descriptions, F-Secure, Oct. 2005. [Online]. Available:
http://www.f-secure.com/v-descs/skulls_d.shtml.
7. Symbian OS, Available: http://www.symbian.com.
8. L. Fiorin, C. Silvano, and M. Sami, “Security aspects in networks-on-chips:
Overview and proposals for secure implementations.” In Proc. of Tenth Euromi-
cro Conference on Digital System Design Architecture, Methods, and Tools (DSD’07),
Lübeck, Germany, August 2007.
9. J. P. Diguet, S. Evain, R. Vaslin, G. Gogniat, and E. Juin, “NoC-centric security
of reconfigurable SoC.” In Proc. of First International Symposium on Networks-on-
Chips (NOCS 2007), Princeton, NJ, May 2007.
10. J. Coburn, S. Ravi, A. Raghunathan, and S. Chakradhar, “SECA: Security-
enhanced communication architecture.” In Proc. of International Conference on
Compilers, Architectures, and Synthesis for Embedded Systems, San Francisco, CA,
September 2005.
11. S. Ravi, A. Raghunathan, and S. Chakradhar, “Tamper resistance mechanism
for secure embedded systems.” In Proc. of 17th International Conference on VLSI
Design (VLSID’04), Mumbai, India, January 2004.
12. E. Chien and P. Ször, Blended Attacks Exploits, Vulnerabilities and Buffer Overflow
Techniques in Computer Viruses, Symantec White Paper, September 2002.
13. F. Koeune and F. X. Standaert, Foundations of Security Analysis and Design III.
Berlin/Heidelberg: Springer, 2005.
14. P. C. Kocher, “Timing attacks on implementations of Diffie-Hellman, RSA, DSS,
and other systems.” In Proc. of 16th International Conference on Cryptology
(CRYPTO’96), Santa Barbara, CA, August 1996, 104–113.
15. P. C. Kocher, J. Jaffe, and B. Jun, “Differential power analysis.” In Proc. of 19th
International Conference on Cryptology (CRYPTO’99), Santa Barbara, CA, August
1999, 388–397.
16. F. Regazzoni, S. Badel, T. Eisenbarth, J. Großschädl, A. Poschmann, Z. Toprak,
M. Macchetti, et al., “A simulation-based methodology for evaluating the DPA-
resistance of cryptographic functional units with application to CMOS and
MCML technologies.” In Proc. of International Symposium on Systems, Architec-
tures, Modeling and Simulation (SAMOS VII), Samos, Greece, July 2007.
17. 35.202 Technical Specification version 3.1.1. Kasumi S-box function specifica-
tions, 3GPP, Technical Report, 2002, Available: http://www.3gpp.org/ftp/
Specs/archive/35_series/35.202.
18. D. Boneh, R. A. DeMillo, and R. J. Lipton, “On the importance of eliminating er-
rors in cryptographic computations,” Journal of Cryptology, 14 (December 2001):
101–119.
19. T. W. Williams and E. B. Eichelberger, “A logic design structure for LSI testabil-
ity.” In Proc. of 14th Design Automation Conference (DAC’77), June 1977.
20. B. Yang, K. Wu, and R. Karri, “Scan based side channel attack on dedicated hard-
ware implementations of Data Encryption Standard.” In Proc. of International Test
Conference 2004 (ITC’04), Charlotte, NC, October 2004, 339–344.
21. S. Evain and J. Diguet, “From NoC security analysis to design solutions.” In
Proc. of IEEE Workshop on Signal Processing Systems (SIPS’05), Athens, Greece,
Nov. 2005, 166–171.
22. T. Martin, M. Hsiao, D. Ha, and J. Krishnaswami, “Denial-of-service attacks
on battery-powered mobile computers.” In Proc. of Third International Conference
on Pervasive Computing and Communications (PerCom’04), Orlando, FL, March
2004.
23. D. C. Nash, T. L. Martin, D. S. Ha, and M. S. Hsiao, “Towards an intrusion de-
tection system for battery exhaustion attacks on mobile computing devices.” In
Proc. of Third International Conference on Pervasive Computing and Communications
(PerCom’05), Kauai Island, Hawaii, March 2005.
24. T. Simunic, S. P. Boyd, and P. Glynn, “Managing power consumption in network
on chips,” IEEE Transactions on VLSI Systems 12(1) (January 2004).
25. Digital Audio over IEEE1394, White Paper, Oxford Semiconductor, January 2003.
26. XOM Technical Information, Available: http://www-vlsi.stanford.edu/∼lie/
xom.htm.
27. G. Edward Suh, C. W. O’Donnell, I. Sachdev, and S. Devadas, “Design and
implementation of the AEGIS single-chip secure processor.” In Proc. of 32nd
Annual International Symposium on Computer Architecture (ISCA’05), Madison,
WI, June 2005, 25–26.
28. T. Alves and D. Felton, TrustZone: Integrated Hardware and Software Security,
White Paper, ARM, 2004.
29. SonicsMX SMART Interconnect Datasheet, Available: http://www.sonicsinc.
com.
30. Open Core Protocol Specification 2.2, Available: http://www.ocpip.org.
CONTENTS
6.1 Introduction: Validation of NoCs ............................................................ 156
6.1.1 Main Issues in NoC Validation.................................................... 156
6.1.2 The Generic Network-on-Chip Model ....................................... 157
6.1.3 State-of-the-Art .............................................................................. 158
6.2 Application of Formal Methods to NoC Verification ........................... 160
6.2.1 Smooth Introduction to Formal Methods .................................. 160
6.2.2 Theorem Proving Features........................................................... 161
6.3 Meta-Model and Verification Methodology .......................................... 164
6.4 A More Detailed View of the Model ....................................................... 165
6.4.1 General Assumptions ................................................................... 165
6.4.1.1 Computations and Communications.......................... 165
6.4.1.2 Generic Node and State Models .................................. 165
6.4.2 Unfolding GeNoC: Data Types and Overview ......................... 167
6.4.2.1 Interfaces ......................................................................... 167
6.4.2.2 Network Access Control............................................... 168
6.4.2.3 Routing............................................................................ 168
6.4.2.4 Scheduling ...................................................................... 168
6.4.2.5 GeNoC and GenocCore ................................................ 169
6.4.2.6 Termination..................................................................... 169
6.4.2.7 Final Results and Correctness...................................... 169
6.4.3 GeNoC and GenocCore: Formal Definition .............................. 170
6.4.4 Routing Algorithm........................................................................ 171
6.4.4.1 Principle and Correctness Criteria .............................. 171
6.4.4.2 Definition and Validation of Function Routing......... 173
6.4.5 Scheduling Policy .......................................................................... 173
6.5 Applications ............................................................................................... 174
6.5.1 Spidergon Network and Its Packet-Switched Mode................ 174
6.5.1.1 Spidergon: Architecture Overview ............................. 174
6.5.1.2 Formal Model Preliminaries: Nodes and State
Definition ........................................................................ 176
© 2009 by Taylor & Francis Group, LLC
developed yet. In this chapter, we present the initial abstract model from
which we will develop the refinement methodology.
The idiosyncratic aspects of our approach are (1) to consider a generic
or meta-model and (2) to provide a practical implementation. Our model
is characterized by components, which are not given a definition but only
characterized by properties, and how these components are interconnected.
Consequently, the global correctness of this interconnection only depends on
those properties, “local” to components. Our model represents all possible
instances of the components, provided these instances satisfy instances of
the properties. Our model is briefly introduced in Section 6.3 and a detailed
presentation is given in Section 6.4.
The power of the generic aspect of our model shows in its implementation.
The model, and the proof that the local properties imply a general property of
the interconnected components, constituted the largest effort. The proof that
particular instances also satisfy this global property reduces to the proof that
they satisfy the local properties. The “implemented” model generates all these
formulas automatically. Moreover, our implemented model can also be
executed on concrete test scenarios. The same model is used for simulation
and formal validation. Section 6.5 illustrates our approach with two complete
examples: the Spidergon and the HERMES NoCs.
The first section is concluded by an overview of the state-of-the-art design
and analysis of NoCs and communication structures. Before presenting our
approach and its applications, we introduce necessary basic notions about
formal methods in Section 6.2.
6.1.3 State-of-the-Art
Intensive research efforts have been devoted to the development of per-
formance, traffic, or behavior analyzers for NoCs. Most proposed solutions
are either simulation or emulation oriented. Orion [7] is an interconnection
network simulator that focuses on the analysis of power and performance
characteristics. A variety of design and exploration environments have been
described, such as the modeling environment for a specific NoC-based mul-
tiprocessor platform [8]. Examples of frameworks for NoC generation and
simulation have been proposed: NoCGEN [9] builds different routers from
parameterizable components, whereas MAIA [10] can be used for the sim-
ulation of SoC designs based on the HERMES NoC. An NoC design flow
based on the Æthereal NoC [11] provides facilities for performance analysis.
Genko et al. [12] describe an emulation framework implemented on an FPGA
that gives an efficient way to explore NoC solutions. Two applications are
reported: the emulations of a network of switches and of a full NoC.
Few approaches address the use of (semi-) formal methods, essentially
toward detection of faults or debugging. A methodology based on temporal
assertions [13] targets a two-level hierarchical ring structure. PSL (Property
Specification Language) [14] properties are used to express interface-level
requirements, and are transformed into synthesizable checkers (monitors).
Example
The formula ∀x ∀y (x ≤ y ⇔ ∃z (x + z = y)) is a first-order logic formula
where x, y and z are variables, and ≤, = and + are infix representations of the
corresponding functions.
The first mechanized proof systems date back to the 1970s. Nowadays there
exists a large variety of systems, more or less automatic. They are used for
function TIMES(x, y) =
  if natp(x) then
    if x = 0 then return 0
    else return y + TIMES(x − 1, y)
    end if
  else return 0
  end if
end function
Predefined data types are: Booleans, characters and strings, rational num-
bers, complex numbers, and lists. The language is not typed, that is, the types
of the function parameters are not explicitly declared. Rather, typing predi-
cates are used in the function bodies (for instance, natp(x) to check whether x
is a natural number).
A great advantage of ACL2 with respect to the previously mentioned
proof assistants is its high degree of automation (it qualifies as a theorem
prover). When provided with the necessary theories and libraries of pre-
proven lemmas, it may find a proof automatically: successive proof strategies
are applied in a predefined order. Otherwise, the user may suggest proof
hints or introduce intermediate lemmas. This automation is due to the logic
it implements, gained at the cost of expressiveness, but the ACL2 logic is
expressive enough for our purpose. Despite the fact that ACL2 is first-order
and does not support the explicit use of quantifiers, certain kinds of origi-
nally higher-order or quantified statements can be expressed. Very powerful
definition mechanisms, such as the encapsulation principle, allow one to ex-
tend the logic and reason on undefined functions that satisfy one or more
constraints.
Another characteristic of interest is that the specification language is an
applicative subset of the functional programming language Common Lisp. As
a consequence ACL2 provides both a theorem prover and an execution engine
in the same environment: theorems that express properties of the specified
functions can be proven, and the same function definitions can be executed
efficiently [49,50].
This prover has already been used to formally verify various complex
hardware architectures, such as microprocessors [51,52], floating point
operators [53], and many other structures [54]. In the next sections, we de-
fine a high-level generic formal model for NoC, and encode it into ACL2;
we can then perform high-level reasoning on a large variety of structures
and module generators, with possibly unbounded parameters. Our generic
model is intrinsically of higher-order, for example, quantification over func-
tions, whereas the ACL2 logic is first-order and quantifier-free. Applying a
systematic and reusable mode of expression [55], our model can be entirely
formalized in the ACL2 logic.
THEOREM 6.1
∀T ∀I ∀R ∀S, P1 (T ) ∧ P2 (I) ∧ P3 (R) ∧ P4 (S) ⇒ ℘ (GeNoC(T , I, R, S))
Roughly speaking, the property ℘ asserts that every message that arrives at some node n
was actually issued at some source node s and originally addressed to node n, and
that it reaches its destination without modification of its content.
Example 1
Figure 6.1(b) shows the instantiation of the generic node model for a 2D
mesh. The position is given by coordinates along the X- and Y-axes. There
are input and output ports for neighbors connected to all cardinal points,
as well as local ports.
FIGURE 6.1
Generic node model and its instantiation for a simple 2D mesh.
In a 2 × 2 mesh (Figure 6.2), examples of valid addresses are ⟨(0 0), east, o⟩
– which is connected to ⟨(1 0), west, i⟩ – and ⟨(0 1), east, i⟩ – which is con-
nected to ⟨(1 1), west, o⟩. But ⟨(0 0), south, o⟩ or ⟨(1 1), north, i⟩ are not valid
addresses.
FIGURE 6.2
2D mesh.
Each address has some storage elements, denoted mem. We make no assump-
tion about the structure of these elements. The generic global state of a network
consists of all tuples ⟨addr, mem⟩. Let st be such a state. We adopt the following
notation. The state element of address addr, that is, a tuple ⟨addr, mem⟩, is denoted
st.addr; the storage element is denoted st.addr.mem. We assume two generic func-
tions that manipulate a global network state. Function loadBuffer(addr, msg, st)
takes as arguments an address (addr), some content (msg), and a global state
(st). It returns a new state, where msg is added to the content of the buffer
with address addr. Function readBuffer(addr, st) returns the state element with
address addr, that is, st.addr. In Sections 6.5.1 and 6.5.2, we give instances of
the generic network state.
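As an illustration only (the chapter leaves the structure of mem abstract, and the actual formalization is in ACL2), the generic state and its two access functions can be sketched in Python with an applicative, copy-on-write dictionary; the list-based buffer is an assumption of the sketch:

```python
def load_buffer(addr, msg, st):
    """Return a NEW state in which msg is added to the buffer at addr;
    the input state st is left unchanged (applicative style)."""
    new_st = dict(st)
    new_st[addr] = st.get(addr, []) + [msg]
    return new_st

def read_buffer(addr, st):
    """Return the storage element at address addr (st.addr.mem here)."""
    return st.get(addr, [])

st = {}
addr = (("0", "0"), "east", "o")       # a valid address in a 2 x 2 mesh
st2 = load_buffer(addr, "frame-1", st)
print(read_buffer(addr, st2))          # ['frame-1']
print(read_buffer(addr, st))           # []  (st itself is unchanged)
```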
6.4.2.1 Interfaces
Interfaces model the encoding and the decoding of messages and frames
that are injected or received in the network. Interfaces are represented by
two functions. (1) Function send represents the mechanisms used to encode
messages into frames. (2) Function recv represents the mechanisms used to
decode frames.
FIGURE 6.3
Unfolding function GeNoC.
The main constraint associated with these functions expresses
that a receiver should be able to extract the injected information, that is, the
composition of functions recv and send (recv ◦ send) is the identity function.
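A minimal sketch of this constraint; the prefix-based frame encoding is chosen purely for illustration, the model only requires that recv recover what send injected:

```python
HEADER = "FRM:"  # illustrative frame header, not from the chapter

def send(msg):
    """Encode a message into a frame."""
    return HEADER + msg

def recv(frm):
    """Decode a frame back into the original message."""
    assert frm.startswith(HEADER)
    return frm[len(HEADER):]

# recv o send must be the identity function on messages
for msg in ["hello", "", "payload-42"]:
    assert recv(send(msg)) == msg
print("interface constraint holds")
```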
The main input of function GeNoC is a list of transactions. A transaction is
a tuple of the form ⟨id, org, msg, dest, flit, time⟩, where id is a unique identifier
(e.g., a natural number), msg is an arbitrary message, org and dest are the origin
and the destination of msg, flit is a natural number, which optionally denotes
the number of flits in the message (flit is set to 1 by default), and time is a natural
number which denotes the execution step when the message is emitted. The
origin and the destination must be valid addresses, that is, be members of the
set Addresses. The first operation of GeNoC is to encode messages into frames.
It applies function send of the interfaces to each transaction.
A missive results from converting the message of a transaction to a frame.
A missive is a transaction where the message is replaced by the frame with
an additional field containing the current position of the frame. The current
position must be a valid address.
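The conversion can be sketched as follows; the dictionary field names mirror the tuple components in the text, and the send encoder is a placeholder assumed for the example:

```python
def send(msg):
    """Placeholder interface encoder (illustrative only)."""
    return ("frame", msg)

def to_missive(transaction):
    """Replace the message by its frame and add the current-position
    field, initialized to the origin."""
    t = dict(transaction)
    t["frm"] = send(t.pop("msg"))
    t["curr"] = t["org"]
    return t

trans = {"id": 0, "org": (0, 0), "msg": "data", "dest": (1, 1),
         "flit": 1, "time": 0}
m = to_missive(trans)
print(m["frm"], m["curr"])  # ('frame', 'data') (0, 0)
```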
6.4.2.3 Routing
The routing algorithm on a given topology is represented by function Routing.
At each step of function GeNoC, function Routing computes for every frame
all the possible routes from the current address c to the destination address
d. The main constraint associated with function Routing is that each route from
c to d actually starts in c, uses only valid addresses, and ends in d.
The traveling missives chosen by function r4d are given to function Routing,
which computes for each frame routes from the current node to the destina-
tion. The result of this function is a list of travels. A travel is a tuple of the
form ⟨id, org, frm, Route, flit, time⟩, where Route denotes the possible routes of
the frame. The remaining fields equal the corresponding fields of the initial
missive.
6.4.2.4 Scheduling
The switching technique is represented by function Scheduling. The schedul-
ing policy participates in the management of conflicts, and computes a set of
possible simultaneous communications. Formally, these communications sat-
isfy an invariant. Scheduling a communication, that is, adding it to the current
set of authorized communications, must preserve the invariant, at all times
and in any admissible state of the network. The invariant is specific to the
scheduling policy. Examples are given in Sections 6.5.1 and 6.5.2.
Function Scheduling represents the execution of one network simulation step.
It takes as main arguments the list of travels produced by function Routing,
and the current global network state. Whenever possible, function Scheduling
moves a frame from its current address to the next address according to one
of the possible routes and the current network state. It returns three main
elements: the list EnRoute of frames that have not reached their destination, the
list Arrived of frames that have reached their destination, and a new state st′.
6.4.2.6 Termination
To make sure that GenocCore terminates, we associate a finite number of
attempts to every node. At each recursive call to GenocCore, every node with
a pending transaction consumes one attempt. This is performed by func-
tion ConsumeAttempts(att). The association list att stores the attempts and
att[i] denotes the number of remaining attempts for the node i. Function
SumOfAtt(att) computes the sum of the remaining attempts for all the nodes
and is used as the decreasing measure of parameter att. Function GenocCore
halts if all attempts have been consumed.
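The measure argument can be illustrated as follows; the concrete attempt bookkeeping (a dictionary from nodes to counters) is an assumption of the sketch:

```python
def consume_attempts(att, pending_nodes):
    """Every node with a pending transaction consumes one attempt."""
    return {n: (a - 1 if n in pending_nodes and a > 0 else a)
            for n, a in att.items()}

def sum_of_att(att):
    """The decreasing measure: total remaining attempts over all nodes."""
    return sum(att.values())

att = {0: 3, 1: 3, 2: 3}
pending = {0, 2}
att2 = consume_attempts(att, pending)
assert sum_of_att(att2) < sum_of_att(att)   # measure strictly decreases
print(sum_of_att(att), "->", sum_of_att(att2))  # 9 -> 7
```

Because the measure is a natural number that strictly decreases on each recursive call while work remains, the recursion of GenocCore must terminate.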
THEOREM 6.2
Correctness of GenocCore.
∀atr ∈ Arrived, ∃!m ∈ Missives, atr.id = m.id ∧ atr.org = m.org ∧ atr.frm = m.frm ∧ Last(atr.Route) = m.dest
∗ The union of lists EnRoute and Delayed is converted to proper missives. We do not detail this
operation.
of the travel equals the destination of the missive. This is preserved by the
list Arrived of travels produced by function Scheduling, because this list is a
sublist of the input of Scheduling (Proof Obligation 3 of Section 6.4.5). For more
details about a similar proof, we refer to the previous publications [56].
function ROUTINGCORE(s, d)
if s = d then return d //at destination
else return list(s, routingCore(L(s, d), d)) //make one hop
end if
end function
Example 2
Let us consider a 3 × 3 mesh and the XY routing algorithm.∗ Function Lxy
represents the routing logic of each node. It decides the next hop of a message
depending on its destination. In the following definition, sx or dx denotes the
coordinate along the X-axis, and sy or dy denotes the coordinate along the
Y-axis.
∗ In this short example, we omit ports and directions. We refer to Section 6.5.2 for a more detailed treatment.
function ROUTING(Missives)
if Missives = ∅ then return ∅ //∅ denotes the empty list
else
m := first(Missives) //first(l) returns the first element of list l
t := ⟨m.id, m.org, m.frm, routingCore(m.curr, m.dest), m.flit, m.time⟩
return list(t, Routing(tail(Missives))) //tail(l) returns l without its first element
end if
end function
At each scheduling round, all travels of list Travels are analyzed. If several
travels are associated to a single node, the node consumes one attempt for
the set of its travels. At each call to Scheduling, an attempt is consumed at
each node. If all attempts have not been consumed, the sum of the remaining
attempts after the application of function Scheduling is strictly less than the
sum of the attempts before the application of Scheduling. This strict decrease
constitutes the corresponding proof obligation.
The next two proof obligations show that there is no spontaneous gen-
eration of new travels, and that any travel of the lists EnRoute or Arrived
corresponds to a unique travel of the input argument of function Scheduling.
The first proof obligation (Proof Obligation 3) ensures that for every travel atr
of list Arrived, there exists exactly one travel v in Travels such that atr and v
have the same identifier, the same frame, the same origin, and that their route
ends with the same destination.
List EnRoute must satisfy a similar proof obligation (Proof Obligation 4).
6.5 Applications
6.5.1 Spidergon Network and Its Packet-Switched Mode
6.5.1.1 Spidergon: Architecture Overview
The Spidergon network, designed by STMicroelectronics [60,61], is an
extension of the Octagon network [62]. A basic Octagon unit consists of eight
nodes and twelve bidirectional links (Figure 6.4a). It has two main prop-
erties: the communication between any pair of nodes requires at most two
hops, and it has a simple, shortest-path routing algorithm [62]. Spidergon
(Figure 6.4b) extends the concept of the Octagon to an arbitrary even number
of nodes. Let NumNode be that number. Spidergon forms a regular architec-
ture, where all nodes are connected to three neighbors and a local IP. The
maximum number of hops is NumNode/4, if NumNode is a multiple of four. We
(a) Octagon Network (b) Spidergon Network
FIGURE 6.4
Octagon and Spidergon architectures.
restrict our formal model to the latter. We assume a global parameter N, such
that NumNode = 4 · N.
A Spidergon packet contains data that must be carried from the source node
to the destination node as the result of a communication request by the
source node. We consider a Spidergon network based on the packet switching
technique.
The routing of a packet is accomplished as follows. Each node compares
the address of a packet (PackAd) to its own address (NodeAd) to determine the
next action. The node computes the relative address of a packet as
RelAd = (PackAd − NodeAd) mod (4 · N).
Example 3
Consider that N = 4. Consider a packet Pack at node 2 sent to node 12. First,
(12 − 2) mod 16 = 10, so Pack is routed across to node 10. Then, (12 − 10) mod
16 = 2, so Pack is routed clockwise to node 11, and then to node 12. Finally,
(12 − 12) mod 16 = 0, so Pack has reached its final destination.
function COUNTERCLOCKWISE(s)
if s.dir = i then return ⟨s.id, ccw, o⟩
else return ⟨(s.id − 1) mod (4 · N), cw, i⟩
end if
end function
function ACROSS(s)
if s.dir = i then return ⟨s.id, acr, o⟩
else return ⟨(s.id + 2 · N) mod (4 · N), acr, i⟩
end if
end function
is consumed. If the relative address is positive and at most N, the message
moves clockwise. If this address is between 3N and 4N, it moves counter-
clockwise. Otherwise, it moves across.
function SPIDERGONLOGIC(s, d)
RelAd := (d.id − s.id) mod (4 · N)
if RelAd = 0 then return Local(s) //final destination reached
else if 0 < RelAd ≤ N then return Clockwise(s) //clockwise move
else if 3 · N ≤ RelAd ≤ 4 · N then
return Counterclockwise(s) //counterclockwise move
else return Across(s) //destination in opposite half
end if
end function
function SPIDERGONROUTINGCORE(s,d)
if s = d then return d //at destination
else//do one hop
return list(s, SpidergonRoutingCore(SpidergonLogic(s,d),d))
end if
end function
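Omitting ports and directions as in Example 3, the routing logic and route construction above can be executed in a Python sketch (nodes are plain integers; N is the global parameter, with NumNode = 4N):

```python
N = 4  # NumNode = 4 * N = 16, as in Example 3

def spidergon_logic(s, d):
    """Next hop from node s toward node d, per SpidergonLogic."""
    rel = (d - s) % (4 * N)
    if rel == 0:
        return s                      # final destination reached
    if 0 < rel <= N:
        return (s + 1) % (4 * N)      # clockwise move
    if 3 * N <= rel:                  # rel is always < 4N after mod
        return (s - 1) % (4 * N)      # counterclockwise move
    return (s + 2 * N) % (4 * N)      # across move

def spidergon_route(s, d):
    """Route construction, per SpidergonRoutingCore."""
    if s == d:
        return [d]
    return [s] + spidergon_route(spidergon_logic(s, d), d)

print(spidergon_route(2, 12))  # [2, 10, 11, 12], as in Example 3
```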
THEOREM 6.3
Validity of Spidergon Routes.
The proof obligation of Section 6.4.4 has been discharged, as well as some
minor proof obligations related to type checking. Function SpidergonRouting
is therefore a valid instance of function Routing.
To empty a buffer, one simply loads it with ∅, the empty buffer. Let ToLeave
be the set of addresses that have been left in step (2). Function free(ToLeave, st)
returns a new state where all addresses in ToLeave have an empty buffer.
function FREE(ToLeave, st)
if ToLeave = ∅ then return st
else
addr := first(ToLeave)
st' := SpidergonLoadBuffer(addr, ∅, st)
return free(tail(ToLeave), st')
end if
end function
The proof obligations of Section 6.4.5 have been discharged for this
function.
Because it has been proven with ACL2 that all the instantiated functions sat-
isfy the instantiated proof obligations, it automatically follows that function
SpidergonGenocCore satisfies the corresponding instance of Theorem 6.2.
FIGURE 6.5
HERMES switch [63].
function MOVEEAST(s)
if s.dir = i then return ⟨s.x, s.y, e, o⟩
else return ⟨s.x + 1, s.y, w, i⟩
end if
end function
function MOVEWEST(s)
if s.dir = i then return ⟨s.x, s.y, w, o⟩
else return ⟨s.x − 1, s.y, e, i⟩
end if
end function
These four functions are used in function XYRoutingLogic below. First, the
X-coordinates of the two nodes are compared. If they are different, then the
flit has to move to the east if the X-coordinate of the destination is higher than
that of the current node; otherwise it goes to the west. If the X-coordinates are
equal, then, if the Y-coordinate of the destination is higher than that of the current
node, the flit moves to the north; otherwise it moves to the south.
function XYROUTINGLOGIC(s, d)
if s.x ≠ d.x then //change X
if s.x < d.x then return moveEast(s)
else return moveWest(s)
end if
else //change Y
if s.y < d.y then return moveNorth(s)
else return moveSouth(s)
end if
end if
end function
function XYROUTINGCORE(s,d)
if s = d then return d //destination reached
else return list(s, XYRoutingCore(XYRoutingLogic(s,d), d)) //do one hop
end if
end function
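The same scheme can be executed in a port-free Python sketch, where nodes are (x, y) pairs; correcting X first and then Y is exactly the XY discipline described above:

```python
def xy_routing_logic(s, d):
    """Next hop from s toward d: fix the X-coordinate first, then Y."""
    sx, sy = s
    dx, dy = d
    if sx != dx:  # change X
        return (sx + 1, sy) if sx < dx else (sx - 1, sy)
    return (sx, sy + 1) if sy < dy else (sx, sy - 1)  # change Y

def xy_routing_core(s, d):
    """Route construction, per XYRoutingCore."""
    if s == d:
        return [d]  # destination reached
    return [s] + xy_routing_core(xy_routing_logic(s, d), d)

print(xy_routing_core((0, 0), (2, 1)))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```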
THEOREM 6.4
Validity of XY routing Routes.
The same proof obligations discharged for SpidergonRouting are verified for
XYRouting to prove that it is a valid instance of function Routing.
In the wormhole technique, it might be the case that not all the flits of a
frame can move. If the head of a worm makes one step, then all the remain-
ing flits can also make one step. This movement is computed by function
advanceFlits(moving, st), which takes as arguments a list of frames, the head of
which can make a step, and the current network state. It produces a new state.
In the case where the head of a frame is blocked at some address but (some
of) its flits can move towards it, we use function moveBlockedFlits(blocked, st),
where blocked is a list of frames, the head of which is blocked, and st is the
current network state. Function moveBlockedFlits will move flits of blocked,
whenever it is possible. It produces a new state. These two operations define
function free.
function FREE(blocked, moving, st)
return advanceFlits (moving, (moveBlockedFlits (blocked, st)))
end function
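The chapter leaves advanceFlits and moveBlockedFlits abstract; the following sketch uses an illustrative representation (a worm as a head-first list of flit positions, and a blocked worm's path as buffer slots with None marking a gap) that is an assumption of the example, not the chapter's data structure:

```python
def advance_flits(worm, next_hop):
    """Head takes next_hop; every remaining flit moves one step into
    its predecessor's slot (whole worm advances by one)."""
    return [next_hop] + worm[:-1]

def move_blocked_flits(route_slots):
    """route_slots: buffer contents along the path, tail end first and
    stalled head last; None is an empty slot. Tail flits slide forward
    toward the head wherever there is room."""
    flits = [f for f in route_slots if f is not None]
    return [None] * (len(route_slots) - len(flits)) + flits

print(advance_flits([(2, 0), (1, 0), (0, 0)], (2, 1)))
# [(2, 1), (2, 0), (1, 0)]
print(move_blocked_flits(["f2", None, "f1", "f0"]))
# [None, 'f2', 'f1', 'f0']
```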
The proof obligations of Section 6.4.5 have been discharged for this function.
Because ACL2 has proved that all the instantiated functions satisfy
the instantiated proof obligations, it automatically follows that function
HermesGenocCore satisfies the corresponding instance of Theorem 6.2.
6.6 Conclusion
In this chapter, we have formalized two dimensions of the NoC design space—
the communication infrastructure and the communication paradigm—as a
functional model in the ACL2 logic. For each essential design decision—
topology, routing algorithm, and scheduling policy—a meta-model has been
given. We have identified the properties and constraints that are required.
• The proof effort has been spent on the meta-model; proving its in-
stances is mechanized and largely automatic.
• Using an executable logic (we recall that the input to ACL2 is a
subset of Common Lisp) allows one to visualize the advancement
of messages and their interactions over the NoC on test cases, as in
any conventional simulator.
Much remains to be done before this type of approach can enter a rou-
tine design flow. First, the meta-model needs to be refined and a systematic
method elaborated, to progressively synthesize the very abstract view it pro-
vides into an RTL implementation. Correctness preserving transformations
and possibly additional proof obligations will lead to a modeling level that
can directly be translated to synthesizable HDL.
Another direction for future work concerns the proof of theorems about
other application independent properties, such as absence of deadlocks and
livelocks, absence of starvation, and the consideration of non-minimal adap-
tive routing algorithms. Again, we want to lay the groundwork for the
proof of properties over generic structures, and intend to proceed with a
similar approach, by which a meta-model is applicable to a large class of IP
generators.
References
1. J. A. Nacif, T. Silva, A. I. Tavares, A. O. Fernandes, and C. N. Coelho Jr, “Efficient
allocation of verification resources using revision history information,” In Proc.
of 11th IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems
(DDECS’08). Bratislava, Slovakia: IEEE, April 2008.
2. L. Loh, “Where should I use formal functional verification,” Jasper Design Auto-
mation white paper, July 2006. http://www.scdsource.com/download.php?id=4.
3. T. Bjerregaard and S. Mahadevan, “A survey of research and practices of
network-on-chip,” ACM Computing Surveys 38(1) (2006).
4. P. Pande, G. D. Micheli, C. Grecu, A. Ivanov, and R. Saleh, “Design, synthesis,
and test of networks on chips,” Design & Test of Computers 22 (2005) (5): 404.
5. L. Benini and G. D. Micheli, “Networks on chips: A new SoC paradigm,”
Computer 35 (2002) (1): 70.
6. U. Ogras, J. Hu, and R. Marculescu, “Key research problems in NoC design:
A holistic perspective.” In Proc. of International Conference on Hardware/Software
Codesign and System Synthesis (CODES+ISSS’05), 69. http://www.ece.cmu.edu/
~sld/pubs/pagers/f175-ogras.pdf.
7. H. Wang, X. Zhu, L. Peh, and S. Malik, “Orion: A power-performance simu-
lator for interconnection networks.” In Proc. of ACM/IEEE 35th Annual Interna-
tional Symposium on Microarchitecture (MICRO-35), 294. http://www.princeton.
edu/~peh/publications/orion.pdf.
8. J. Madsen, S. Mahadevan, K. Virk, and M. Gonzalez, “Network-on-Chip
modeling for system-level multiprocessor simulation.” In Proc. of the 24th
IEEE Real-Time Systems Symposium (RTSS 2003), 2003, 265. 10.1109/REAL.2003.
1253273.
9. J. Chan and S. Parameswaran, “NoCGEN: A template based reuse methodology
for networks on chip architecture.” In Proc. of 17th International Conference on VLSI
Design (VLSI Design 2004), 717. 10.1109/ICVD.2004.1261011.
10. L. Ost, A. Mello, J. Palma, F. Moraes, and N. Calazans, “MAIA—a framework
for networks on chip generation and verification.” In Proc. of 2005 Conference on
Asia South Pacific Design Automation (ASP-DAC 2005), 29. http://doi.acm.org/
10.1145/1120725.1120741.
11. K. Goossens, J. Dielissen, O. P. Gangwal, S. G. Pestana, A. Radulescu, and E.
Rijpkema, “A design flow for application-specific networks on chip with guar-
anteed performance to accelerate SoC design and verification.” In Proc. of Design,
Automation, and Test in Europe Conference (DATE’05), 1182. http://homepages.
inf.ed.ac.uk/kgoossen/2005-date.pdf.
12. N. Genko, D. Atienza, and G. D. Micheli, “NoC emulation on FPGA: HW/SW
synergy for NoC features exploration.” In Proc. of International Conference on
Parallel Computing (ParCo 2005), Malaga, Spain, September 2005.
13. J. S. Chenard, S. Bourduas, N. Azuelos, M. Boulé, and Z. Zilic, “Hardware asser-
tion checkers in on-line detection of network-on-chip faults.” In Proc. of Workshop
on Diagnostic Services in Networks-on-Chips, Nice, France, April 2007.
14. IEEE Std 1850-2005, IEEE Standard for Property Specification Language (PSL). IEEE,
2005.
15. K. Goossens, B. Vermeulen, R. van Steeden, and M. Bennebroek, “Transaction-
based communication-centric debug.” In Proc. of First Annual ACM/IEEE
International Symposium on Networks-on-Chip (NoCs’07), Princeton, NJ, May 2007,
95.
16. E. M. Clarke, O. Grumberg, and S. Jha, “Verifying parameterized networks,”
ACM Transactions on Programming Languages and Systems 19(5) (September 1997).
17. S. Creese and A. Roscoe, “Formal verification of arbitrary network topologies.”
In Proc. of 1999 International Conference on Parallel and Distributed Processing Tech-
niques and Applications (PDPTA’99). Las Vegas, NV: ACM/IEEE, June 1999.
CONTENTS
7.1 Test and Fault Tolerance Issues in NoCs ................................................ 192
7.2 Test Methods and Fault Models for NoC Fabrics ................................. 193
7.2.1 Fault Models for NoC Infrastructure Test.................................. 194
7.2.2 Fault Models for NoC Interswitch Links ................................... 194
7.2.3 Fault Models for FIFO Buffers in NoC Switches ...................... 194
7.2.4 Structural Postmanufacturing Test ............................................. 196
7.2.4.1 Test Data Transport........................................................ 197
7.2.5 Functional Test of NoCs ............................................................... 202
7.2.5.1 Functional Fault Models for NoCs.............................. 202
7.3 Addressing Reliability of NoC Fabrics through Error
Control Coding .......................................................................................... 203
7.3.1 Crosstalk Avoidance Coding ....................................................... 204
7.3.2 Forbidden Overlap Condition (FOC) Codes............................. 205
7.3.3 Forbidden Transition Condition (FTC) Codes .......................... 206
7.3.4 Forbidden Pattern Condition Codes .......................................... 207
7.4 Joint Crosstalk Avoidance and Error Control Coding.......................... 208
7.4.1 Duplicate Add Parity and Modified Dual Rail Code............... 209
7.4.2 Boundary Shift Code .................................................................... 210
7.4.3 Joint Crosstalk Avoidance and Double Error
Correction Code............................................................................. 211
7.5 Performance Metrics ................................................................................. 213
7.5.1 Energy Savings Profile of NoCs in Presence of Coding .......... 213
7.5.2 Timing Constraints in NoC Interconnection Fabrics
in Presence of Coding ................................................................... 218
7.6 Summary..................................................................................................... 219
References............................................................................................................. 219
© 2009 by Taylor & Francis Group, LLC
FIGURE 7.1
(a) Regular (mesh architecture) NoC. (b) Irregular NoC. Legend: functional core, switch.
FIGURE 7.2
MAF crosstalk errors (Y2: victim wire; Y1, Y3: aggressor wires): (a) delayed rise (dr), (b) speedy rise (sr), (c) delayed fall (df), (d) speedy fall (sf), (e) positive glitch (gp), (f) negative glitch (gn).
FIGURE 7.3
(a) 4-port NoC switch generic architecture, with per-port FIFOs and a routing logic block (RLB). (b) Dual-port NoC FIFO: write and read ports (WCK/RCK clocks), write and read control, memory array, and full/empty flags (FF, EF).
FIGURE 7.4
Test packet structure (T_start, Test_control, Test_header, test data).
FIGURE 7.5
Unicast and multicast switch modes: (a) unicast data transport in an NoC; (b) multicast data transport in an NoC. S and D are the source and destination nodes.
FIGURE 7.6
4-port NoC switch with multicast wrapper unit (MWU) for test data transport.
For instance, in the multicast step shown in Figure 7.5(b), only three switches
must possess the multicast feature. By exploring all the necessary multicast
steps to reach all destinations, we can identify the switches and ports that are
involved in the multicast transport, and subsequently implement the MWU
only for the required switches/ports. The header of a multidestination mes-
sage must carry the destination node addresses [13]. To route a multidesti-
nation message, a switch must be equipped with a method for determining
the output ports to which a multicast message must be simultaneously for-
warded. The multidestination packet header encodes information that allows
the switch to determine the output ports towards which the packet must be
directed. When designing multicast hardware and protocols with limited pur-
pose such as test data transport, a set of simplifying assumptions can be made
to reduce the complexity of the multicast mechanism. This set of assumptions
can be summarized as follows:
1. The test data traffic is fully deterministic.
2. Test traffic is scheduled off-line, prior to test application.
3. For each test packet, the multicast route can be determined exactly
at all times (i.e., routing of test packets is static).
4. For each switch, the set of I/O ports involved in multicast test data
transport is known and may be a subset of all I/O ports of the switch
(i.e., for each switch, only a subset of I/O ports may be required to
support multicast).
These assumptions help in reducing the hardware complexity of the multi-
cast mechanism by implementing the required hardware only for those switch
ports that must support multicast. For instance, in the example of Figure 7.6,
FIGURE 7.7
(a), (b) Unicast test transport. (c) Multicast test transport.
where Tl, L is the latency of the interswitch link, Tl, S is the switch latency [the
number of cycles required for a flit (see Section 7.5.1) to traverse an NoC
switch from input to output], and Tt, S is the time required to perform the
actual testing of the switch (i.e., Tt, S = TFIFO + TRLB ). Following the same
reasoning for the multicast transport case in Figure 7.7(c), the total test
time Tm2,3 for testing switches S2 and S3 can be written as:
Tm2,3 = Tl, L + Tl, S + Tt, S
From this simple example, we can infer that there are two mechanisms that
can be employed for reducing the test time: reducing the transport time of test
data, and reducing the effective test time of NoC components. The transport
time of test patterns can be reduced in two ways: (a) by delivering the test
patterns on the shortest path from the test source to the element under test;
(b) by transporting multiple test patterns on nonoverlapping paths to their
respective destinations.
Therefore, to reduce the test time, we would need to reevaluate the fault
models or the overall test strategy (i.e., to generate test data locally for each
element, with the respective incurred overhead [19]). Within the assumptions
in this work (all test data is generated off-line and transported to the ele-
ment under test), the only feasible way to reduce the effective test time per
element is to overlap the test of more NoC components. The direct effect is
the corresponding reduction of the overall test time. This can ultimately be
accomplished by employing the multicast transport and applying test data
simultaneously to more components. The graph representation of the NoC
infrastructure used to find the minimum test transport latency is obtained by
representing the NoC as a directed graph G = (S, L), where each
vertex si ∈ S is an NoC switch, and each edge li ∈ L is an interswitch link.
Each switch is tagged with a numerical pair (Tl,s , Tt,s ) corresponding to switch
latency and switch test time. Each link is similarly labeled with a pair (Tl, L ,
Tt, L ) corresponding to link latency and link test time, respectively. For each
edge and vertex, we define a symbolic toggle t that can take two values: N
and T. When t = N, the cost (weight) associated with the edge/vertex is the
latency term, which corresponds to the normal operation. When t = T, the
cost (weight) associated with the edge/vertex is the test time (of the link or
switch) and corresponds to the test operation.
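A sketch of this tagged-graph model follows; the latency and test-time values are made up for illustration. With the toggle set to N, a standard shortest-path search over the selected costs yields the minimum test transport latency:

```python
import heapq

# Made-up (latency, test_time) tags for a 3-switch example.
switches = {"s1": (2, 100), "s2": (2, 100), "s3": (2, 100)}  # (Tl,S, Tt,S)
links = {("s1", "s2"): (1, 40), ("s1", "s3"): (1, 40),
         ("s2", "s3"): (1, 40)}                              # (Tl,L, Tt,L)

def cost(pair, t):
    # The toggle t selects the weight: 'N' -> latency, 'T' -> test time.
    return pair[0] if t == "N" else pair[1]

def min_transport_latency(src, dst, t="N"):
    # Dijkstra over the tagged graph: traversing a link adds its cost,
    # entering a switch adds that switch's cost.
    dist = {src: cost(switches[src], t)}
    heap = [(dist[src], src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for (a, b), w in links.items():
            for x, y in ((a, b), (b, a)):   # links are bidirectional
                if x == u:
                    nd = d + cost(w, t) + cost(switches[y], t)
                    if nd < dist.get(y, float("inf")):
                        dist[y] = nd
                        heapq.heappush(heap, (nd, y))
    return None

print(min_transport_latency("s1", "s3"))  # 5 = Tl,S + Tl,L + Tl,S
```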
D Q
S
inputs outputs
NoC channel
CUT
pass/fail
expected
test inputs TC
outputs
test packets
FIGURE 7.8
Test packets processing and output comparison.
NoC test based on functional fault models has several advantages over struc-
tural test, the most important being lower hardware overhead and shorter test
time, mainly due to the reduced set of test data that has to be applied. For a
satisfactory fault coverage and yield, both structural and functional tests are
required for NoC platforms.
cosmic radiation, etc. [20,21]. These failures can alter the behavior of the NoC
fabrics and degrade the signal integrity. Providing resilience against such
failures is critical for the operation of NoC-based chips. There are many ways
to achieve signal integrity. Among different practical methods, the use of new
materials for device and interconnect, and tight control of device layouts may
be adopted in the NoC domain. Here, we propose to tackle this problem at
the design stage. Instead of depending on postdesign methods, we propose
to incorporate corrective intelligence in the NoC design flow. This will help
reduce the number of postdesign iterations. The corrective intelligence can
be incorporated into the NoC data stream by adding error control codes to
decrease vulnerability to transient errors. The basic operations of NoC infras-
tructures are governed by on-chip packet-switched networks. As NoCs are
built on packet-switching, it is easy to modify the data packets by adding
extra bits of coded information in space and time to protect against transient
malfunctions.
In the face of increased gate counts, designers are compelled to reduce
the power supply voltage to keep energy dissipation to a tolerable limit,
thus reducing noise margins [20]. The interconnects become more closely
packed and this increases mutual crosstalk effects. Faster switching can also
cause ground bounce. The switching current can cause the already low-supply
voltage to instantaneously go even lower, thus causing timing violations. All
these factors can cause transient errors in the ultra deep submicron (UDSM)
ICs [20]. Crosstalk is a prominent source of transient malfunction in NoC in-
terconnects. Crosstalk avoidance coding (CAC) schemes are effective ways
of reducing the worst-case switching capacitance of a wire by ensuring that
a transition from one codeword to another does not cause adjacent wires
to switch in opposite directions. Though CACs are effective in reducing
mutual interwire coupling capacitance, they do not protect against any other
transient errors. To make the system robust, in addition to CAC we need to
incorporate forward error correction coding (FEC) into the NoC data stream.
Among different FECs, single error correction codes (SECs) are the simplest
to implement. There are various joint CAC/SEC codes proposed by differ-
ent research groups. But aggressive supply-voltage scaling and increase in
DSM noise in future-generation NoCs will prevent these joint CAC/SEC
codes from satisfying reliability requirements. Hence, low-complexity joint
crosstalk avoidance and multiple error correction codes (CAC/MEC) suit-
able for applying to NoC fabrics need to be designed. Below we elaborate
characteristics of different CAC, joint CAC/SEC and CAC/MEC codes.
[Figure: aggressor/victim waveforms for three transition scenarios, panels (a)–(c), annotating victim rise time and aggressor rise/fall times.]
FIGURE 7.9
Different types of transitions causing crosstalk between adjacent wires.
The worst-case crosstalk occurs when two aggressors on either side of the
victim wire transition in the opposite direction to the victim, as shown in
Figure 7.9(c). Such a pattern of opposite transitions always increases the
delay by increasing the mutual switching capacitance between the wires.
In addition, it also causes extra energy dissipation due to the increase in
switching capacitance. One of the common crosstalk avoidance techniques
is to increase the spacing between adjacent wires. However, this doubles the
wire layout area [22]. For global wires in the higher metal layers that do not
scale as fast as the device geometries, this doubling of area is hard to justify.
Another simple technique can be shielding the individual wires with a
grounded wire in between them. Although this is effective in reducing
crosstalk to the same extent as increased spacing, it also necessitates the same
overhead in terms of wire routing requirements. By incorporating coding
mechanisms, the same reduction in crosstalk can be achieved at a lower
routing-area overhead [23]. These coding schemes, broadly termed crosstalk
avoidance codes (CAC), prevent worst-case crosstalk between adjacent wires
by preventing opposite transitions in neighboring wires. Thus CACs
enhance system reliability by reducing the probability of crosstalk-induced
soft errors, and also reduce the energy dissipation in UDSM buses and global
wires by reducing the coupling capacitance between adjacent wires.
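The worst-case pattern CACs are designed to avoid can be checked mechanically. The sketch below is illustrative only (bus states are modeled as bit strings): it flags any transition in which two adjacent wires switch in opposite directions.

```python
def has_worst_case_crosstalk(prev, curr):
    """Return True if any two adjacent wires transition in opposite
    directions between bus states `prev` and `curr` (bit strings)."""
    assert len(prev) == len(curr)
    for i in range(len(prev) - 1):
        a_rises = prev[i] == "0" and curr[i] == "1"
        a_falls = prev[i] == "1" and curr[i] == "0"
        b_rises = prev[i + 1] == "0" and curr[i + 1] == "1"
        b_falls = prev[i + 1] == "1" and curr[i + 1] == "0"
        # Opposite switching of neighbors maximizes coupling capacitance.
        if (a_rises and b_falls) or (a_falls and b_rises):
            return True
    return False
```

For example, the transition 010 → 101 of Figure 7.9(c), where the victim falls while both aggressors rise, is flagged, whereas 000 → 111 (all wires rising together) is not.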
Different crosstalk avoidance codes [24] are proposed in the literature. Here,
characteristics of three representative CACs that achieve different degrees of
reduction in coupling capacitance are described.
[Figure: a (7–0) input split into two (3–0) subchannels, each encoded by an FOC 4–5 block into a (4–0) output; the coded subchannels combine into a (9–0) output.]
FIGURE 7.10
Block diagram of combining adjacent subchannels in FOC coding.
codeword having the bit pattern 010 does not make a transition to a codeword
having the pattern 101 at the same bit positions. The codes that satisfy the
above condition are referred to as forbidden overlap condition (FOC) codes.
The simplest method of satisfying the forbidden overlap condition is half-
shielding, in which a grounded wire is inserted after every two signal wires.
Though simple, this method has the disadvantage of requiring a significant
number of extra wires. Another solution is to encode the data links such that
the codewords satisfy the forbidden overlap (FO) condition. However, encod-
ing all the bits at once is not feasible for wide links due to prohibitive size and
complexity of the coder-decoder (codec) hardware. In practice, partial coding
is adopted, in which the links are divided into subchannels that are encoded
using FOC. The subchannels are then combined in such a way as to avoid
forbidden patterns at their boundaries. In this case, two subchannels can be
placed next to each other without any shielding, as well as not violating the
FO condition as shown in Figure 7.10. The Boolean expressions relating to
the original input (d_3 to d_0) and coded bits (c_4 to c_0) for the FOC
scheme are expressed as follows:

    c_0 = d_1 + d_2 d_3
    c_1 = d_2 d_3
    c_2 = d_0
    c_3 = d_2 d_3
    c_4 = d_1 d_2 + d_3
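The forbidden overlap condition itself is easy to state programmatically. The sketch below is an illustration (not the book's codec): it checks whether a transition between two codewords contains the forbidden 010 → 101 (or 101 → 010) pattern on three adjacent bits, and whether a whole codebook is overlap-free.

```python
def violates_forbidden_overlap(u, v):
    """True if the transition u -> v has three adjacent bits going
    010 -> 101 (or 101 -> 010) at the same positions."""
    assert len(u) == len(v)
    for i in range(len(u) - 2):
        if (u[i:i + 3], v[i:i + 3]) in {("010", "101"), ("101", "010")}:
            return True
    return False

def codebook_is_foc(codewords):
    """A codebook satisfies the FO condition if no transition between
    any two of its codewords is forbidden."""
    return not any(violates_forbidden_overlap(u, v)
                   for u in codewords for v in codewords)
```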
[Figure: a (5–0) input split into two (2–0) subchannels, each encoded by an FTC 3–4 block into a (3–0) output; the coded subchannels combine into an (8–0) output.]
FIGURE 7.11
Block diagram of combining adjacent subchannels in FTC coding.
condition (FTC), and the CACs satisfying it are known as FTC codes. For
wider communication links, the message words can be subdivided into mul-
tiple subchannels, each having a three-bit width, and then each coded sub-
words recombined following the scheme shown in Figure 7.11. This scheme of
recombination simply places a shielded wire between each subchannel. This
ensures no forbidden transitions even at the boundaries of the subchannels.
The Boolean expressions relating the original input and coded bits for
the FTC scheme are expressed as follows:

    c_0 = d_1 + d_2 d_0
    c_1 = d_0 d_1 d_2 + d_0 d_1 d_2
    c_2 = d_0 + d_2
    c_3 = d_0 d_2 + d_1 d_2

The second set of expressions, corresponding to the FPC 4–5 encoding of
Figure 7.12, is:

    c_0 = d_0
    c_1 = d_1 d_1 + d_2 d_1 + d_1 d_3 + d_0 d_2 d_3
    c_2 = d_2 d_3 + d_1 d_2 + d_0 d_2 + d_1 d_0 d_3
    c_3 = d_2 d_3 + d_0 d_2 + d_2 d_1 + d_1 d_3 d_0
    c_4 = d_3
[Figure: a (6–0) input split across two FPC 4–5 blocks whose coded outputs combine into a (9–0) output.]
FIGURE 7.12
Block diagram of combining adjacent subchannels after FPC coding.
the recent past is the joint coding schemes that attempt to minimize crosstalk
while also performing forward error correction. These are called joint crosstalk
avoidance and single error correction codes (CAC/SEC) [30]. A few of these
joint codes have been proposed in the literature for on-chip buses. These
codes can be adopted in the NoC domain too. They include duplicate add
parity (DAP) [31], boundary shift code (BSC) [32], and modified dual rail
(MDR) [33], all of which are joint crosstalk-avoiding, single error
correcting codes. These coding schemes achieve the dual function of
reducing crosstalk and increasing resilience against multiple sources of
transient errors.
Most of the above work depended on simple SEC codes. But with tech-
nology scaling, SECs are not sufficient to protect NoCs from varied sources
of transient noise. This was acknowledged for the first time by Sridhara and
Shanbhag [30] in the context of traditional bus-based systems. It was pointed
out that with aggressive supply scaling and increase in DSM noise, more
powerful error correction schemes than the simple joint CAC/SEC codes will
be needed to satisfy reliability requirements. Hence, further
investigations into the performance of joint CAC/MEC codes in NoC fabrics
need to be made. Below, the characteristics of the joint crosstalk
avoidance and error correction coding schemes and their implementation
principles are discussed in detail, including a particular example of a
joint crosstalk avoidance and double error correction code (CADEC).
[Figure: DAP encoder duplicating inputs x0–x3 onto wires y0–y7 with a parity wire y8, and the corresponding multiplexer-based decoder with enable.]
FIGURE 7.13
Duplicate add parity encoder and decoder.
    c_i = d_i,    for i = 0, 1, ..., k − 1
    c_k = d_0 ⊕ d_1 ⊕ · · · ⊕ d_{k−1}
The modified dual rail (MDR) code is very similar to the DAP [33]. In the MDR
code, two copies of parity bit Ck are placed adjacent to the other codeword
bits to reduce crosstalk.
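The DAP mapping lends itself to a short behavioral sketch. The code below is illustrative only (the bit ordering and multiplexer structure of Figure 7.13 are abstracted away): the decoder selects the copy whose recomputed parity matches the transmitted parity bit, which tolerates any single bit error in either copy or on the parity wire itself.

```python
def parity(bits):
    """Even parity over the data bits: d_0 ^ d_1 ^ ... ^ d_{k-1}."""
    return sum(bits) % 2

def dap_encode(data):
    """Duplicate-add-parity: transmit two copies of the data bits
    plus the parity bit c_k."""
    return list(data), list(data), parity(data)

def dap_decode(copy_a, copy_b, p0):
    """Pick the copy whose recomputed parity matches the sent parity
    p0. A single error in one copy (or in p0) leaves at least one
    copy consistent with the selection rule, so the data survive."""
    return list(copy_a) if parity(copy_a) == p0 else list(copy_b)
```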
[Figure: BSC encoder and multiplexer-based decoder over wires y0–y8, clocked, with encoder and decoder enables.]
FIGURE 7.14
BSC encoder and decoder.
into different subchannels and then perform partial coding; DAP/BSC/MDR
coding/decoding can instead be performed on the link as a whole.
[Figure: CADEC encoder (Hamming (38,32) encoding of a 32-bit flit, duplication of the 38-bit codeword, and a parity bit over bits 0–76) and the two-stage decoder, which compares the parities p1 and p2 of the two 38-bit copies with the sent parity p0 before Hamming (38,32) decoding and double error detection produce the 32-bit output.]
FIGURE 7.15
CADEC encoder and decoder.
DAP or BSC schemes, is added to make the decoding process very energy
efficient as explained below.
CADEC decoder. The decoding procedure for the CADEC encoded flit can
be explained with the help of the flow diagram shown in Figure 7.16. The
decoding algorithm consists of the following simple steps:
[Figure: flow diagram — compute the parities p1 and p2 of the two copies (Stage-I); compare them with each other and with the sent parity p0; when needed, send copy b through double error detection and check whether it is error-free; choose copy a or copy b for the final stage and produce the output.]
FIGURE 7.16
CADEC decoding algorithm.
network, both the interswitch wires and the logic gates in the switches tog-
gle, resulting in energy dissipation. The flits from the source nodes need to
traverse multiple hops consisting of switches and wires to reach destinations.
The motivation behind incorporating CAC in the NoC fabric is to reduce
switching capacitance of the interswitch wires and hence make communica-
tion among different blocks more energy efficient. But this reduction in energy
dissipation is linear with the switching capacitance of the wires. By incorpo-
rating the joint coding schemes in an NoC data stream, the reliability of the
system is enhanced. Consequently, the supply voltage can be reduced with-
out compromising system reliability. As energy dissipation depends quadrat-
ically on the supply voltage, a significantly higher amount of savings is
possible to achieve by incorporating the joint codes. To quantify this pos-
sible reduction in supply voltage, a Gaussian distributed noise voltage of
magnitude VN and variance or power of σ 2N is considered that represents the
cumulative effect of all the different sources of UDSM noise. This gives the
probability of bit error, , also called the bit error rate (BER) as
Vdd
=Q (7.1)
2σ N
where, the Q-function is given by
∞
1 y2
Q(x) = √ e 2 dy (7.2)
2π x
The word error probability is a function of the channel BER ε. If
P_unc(ε) is the probability of word error in the uncoded case and
P_ecc(ε) is the residual probability of word error with error control
coding, then it is desirable that P_ecc(ε) ≤ P_unc(ε). Using
Equation (7.1), we can reduce the supply voltage in the presence of
coding to V̂_dd, given by

    V̂_dd = V_dd · [Q⁻¹(ε̂) / Q⁻¹(ε)]    (7.3)

In Equation (7.3), V_dd is the nominal supply voltage in the absence of
any coding, and ε̂ is such that P_ecc(ε̂) = P_unc(ε). Therefore, to
compute V̂_dd for the joint CAC and SEC schemes, the residual word error
probability of these schemes has to be computed. The various residual
word error probabilities in terms of the BER ε are listed in Table 7.1.
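Equations (7.1) through (7.3) can be evaluated directly. The sketch below uses the standard normal distribution from Python's library for Q and its inverse; the BER values passed in are hypothetical, not taken from the chapter.

```python
from statistics import NormalDist

_nd = NormalDist()

def q(x):
    """Gaussian tail function Q(x) of Equation (7.2)."""
    return 1.0 - _nd.cdf(x)

def q_inv(eps):
    """Inverse of Q: the x such that Q(x) = eps."""
    return _nd.inv_cdf(1.0 - eps)

def scaled_vdd(vdd, eps_hat, eps):
    """Equation (7.3): supply voltage usable with coding, where
    eps_hat is the (higher) channel BER the coded system tolerates
    while matching the uncoded word error rate at BER eps."""
    return vdd * q_inv(eps_hat) / q_inv(eps)
```

For instance, if coding lets the channel run at a BER of 1e-6 instead of 1e-12 for the same residual word error rate, `scaled_vdd(1.0, 1e-6, 1e-12)` yields a supply well below the nominal V_dd.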
Figure 7.17 shows the plot of possible voltage swing reduction for differ-
ent joint codes discussed here with increasing word error rates. As CADEC
has the highest error correction capability, it allows maximum voltage swing
reduction.
So, the metric of interest is the average savings in energy per flit with
coding compared to the uncoded case. All the schemes have different
numbers of bits in the encoded flit, so a fair comparison in terms of
energy savings demands that the redundant wires also be taken into
account when comparing the energy dissipation profiles. The relevant
metric used for comparison should
TABLE 7.1
Residual Word Error Probabilities of Different Coding Schemes (entries omitted)
take into account the savings in energy due to the reduced crosstalk, reduced
voltage swing on the interconnects, and additional energy dissipated in the
extra redundant wires and the codecs. The savings in energy per flit per hop
is given by
E savings, j = E link,uncoded − ( E link,coded + E codec ) (7.4)
where E link,uncoded and E link,coded are the energy dissipated by the uncoded flit
and the coded flit in each interswitch link, respectively. E codec is the energy
dissipated by each codec. The energy savings in transporting a single
flit, the i-th flit, through h_i hops can be calculated as

    E_savings,i = Σ_{j=1}^{h_i} E_savings,j    (7.5)
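Equations (7.4) and (7.5) amount to a simple accumulation, sketched below with hypothetical per-hop energies in pJ.

```python
def savings_per_hop(e_link_uncoded, e_link_coded, e_codec):
    """Equation (7.4): energy saved per flit on one hop."""
    return e_link_uncoded - (e_link_coded + e_codec)

def savings_per_flit(per_hop_savings):
    """Equation (7.5): total savings for the i-th flit over its
    h_i hops, given the per-hop savings along its route."""
    return sum(per_hop_savings)

# Hypothetical numbers: a flit crossing three identical hops.
total = savings_per_flit([savings_per_hop(10.0, 6.0, 1.0)] * 3)  # 9.0 pJ
```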
[Figure: plot of achievable V_dd (0.4–1.0 of nominal) versus word error rate (10^-20 to 10^-5) for the ED, DAP, and CADEC schemes.]
FIGURE 7.17
Variation of achievable voltage swing with bit error rate for different coding schemes.
[Figure: energy savings per flit (pJ) versus injection load (0.1–0.7), two panels, including the FOC scheme.]
FIGURE 7.18
Energy savings profile for a mesh-based NoC by incorporating CAC at (a) λ = 1; (b) λ = 6.
[Figure: energy savings per cycle (pJ) versus injection load, two panels.]
FIGURE 7.19
Energy savings profile for a mesh-based NoC by incorporating the joint codes at (a) λ = 1; (b) λ = 6.
NoC system as used before. Because the performances of BSC and MDR are
very similar to DAP, they are omitted from the plot for clarity.
As shown in the figure, the CADEC scheme achieves more energy savings
than the other joint codes. This is because the residual word error
probability of CADEC is much lower, as it can correct up to 2-bit errors
in the flits and hence can tolerate a much lower voltage swing.
be constrained within the one clock cycle limit, then the pipelined nature of
the communication will be maintained. However, there is an increasing drive
in the NoC research community for design of low-latency NoCs adopting
numerous techniques both at the routing as well as NI level [38,40]. There-
fore, it is not sufficient to just fit the codecs into separate pipelined stages as
this will increase message latency. To further enhance the performance, if the
delay of the codecs can be constrained so much that they can be merged with
existing stages of the NoC switch, then there will be no latency penalty at
all. Due to the crosstalk avoidance characteristic of the codes, the
crosstalk-induced bus delay (CIBD) of the interswitch wire segments will
decrease. Hence, alternatively, if the delay of the codec blocks can be
constrained so that they can be merged into the interswitch link
traversal stages, there will again be no additional latency penalty,
irrespective of the switch architecture.
7.6 Summary
NoC is emerging as a revolutionary methodology to integrate numerous
blocks in a single chip. It is the digital communications backbone that in-
terconnects the components on a multicore SoC. It is well known that with
shrinking geometries, NoCs will be increasingly exposed to permanent and
transient sources of error that could degrade manufacturability, signal in-
tegrity, and system reliability. The challenges of NoC testing lie in achieving
sufficient fault coverage under a set of fault models relevant to NoC charac-
teristics, under constraints such as test time, test power dissipation, low-area
overhead and test complexity. A fine balance must be achieved between test
quality and test resources.
To accomplish these goals, NoCs are augmented with design-for-test
features that allow efficient test data transport, built-in test data generation
and comparison, and postmanufacturing yield tuning. One of the effective
ways to protect the future nanoscale systems from transient errors is to apply
coding techniques similar to the domain of communication engineering. By
incorporating joint crosstalk avoidance and multiple error correction codes, it
is possible to protect the NoC fabrics against varied sources of transient noise
and yet lower the overall energy dissipation.
References
[1] M. L. Bushnell and V. D. Agrawal, Essentials of Electronic Testing for Digital, Mem-
ory, and Mixed-Signal VLSI Circuits. New York: Springer, 2000.
[2] A. Alaghi, N. Karimi, M. Sedghi, and Z. Navabi, “Online NoC switch fault
detection and diagnosis using a high level fault model.” In Proc. of 22nd IEEE
CONTENTS
8.1 Introduction................................................................................................ 224
8.2 Monitoring Objectives and Opportunities............................................. 226
8.2.1 Verification and Debugging......................................................... 226
8.2.2 Network Parameters Adaptation................................................ 227
8.2.3 Application Profiling .................................................................... 227
8.2.4 Run-Time Reconfigurability ........................................................ 228
8.3 Monitoring Information in Networks-on-Chips ................................... 228
8.3.1 A High-Level Model of NoC Monitoring .................................. 228
8.3.1.1 Events .............................................................................. 229
8.3.1.2 Programming Model ..................................................... 230
8.3.1.3 Traffic Management....................................................... 230
8.3.1.4 NoC Monitoring Communication Infrastructure ..... 231
8.3.2 Measurement Methods................................................................. 231
8.3.3 NoC Metrics ................................................................................... 233
8.4 NoC Monitoring Architecture ................................................................. 234
8.5 Implementation Issues .............................................................................. 238
8.5.1 Separate Physical Communication Links .................................. 239
8.5.2 Shared Physical Communication Links ..................................... 239
8.5.3 The Impact of Programmability on Implementation ............... 240
8.5.4 Cost Optimizations ....................................................................... 241
8.5.5 Monitor-NoC Codesign................................................................ 242
8.6 A Case Study .............................................................................................. 244
8.6.1 Software Assisted Monitoring Services .................................... 244
8.6.2 Monitoring Services Interacting with OS .................................. 245
8.6.3 Monitoring Services at Transaction Level
and Monitor-Aware Design Flow ............................................... 246
8.6.4 Hardware Support for Testing NoC ........................................... 248
8.6.5 Monitoring for Cost-Effective NoC Design............................... 248
8.6.6 Monitoring for Time-Triggered Architecture Diagnostics ...... 249
© 2009 by Taylor & Francis Group, LLC
8.1 Introduction
Network monitoring is the process of extracting information regarding the
operation of a network for purposes that range from management functions
to debugging and diagnostics. Originally started in bus-based systems for the
most basic and critical purpose of debugging, monitoring consisted of probes
that could relay bus transactions to an external observer (be it a human or
a circuit). The observability is crucial for debugging so that the behavior of
the system is recorded and can be analyzed, either on- or off-line. When
the behavior is recorded into a trace, the run-time evolution of the system
can be replayed, facilitating the debugging process. Robustness in time-
or life-critical applications also requires monitoring of the system and
real-time reaction upon faulty or misbehaving operation.
Research has already produced valuable results in providing observability
for bus-based systems, such as ARM’s Coresight technology [1]. Also First
Silicon’s on-chip instrumentation technology (OCI) provides on-chip logic
analyzers for AMBA AHB, OCP, and Sonics SiliconBackplane bus systems [2].
These solutions allow the user to capture bus activity at run-time, and can be
combined in a multicore-embedded debug system with in-system analyzers
for cores, for example, for MIPS cores.
Because buses offer limited bandwidth, these simple bus-based systems at
first evolved using hierarchies of multiple interconnected buses. This solution
offered the required increase in bandwidth but made the design more com-
plex and ad hoc, and proved difficult to scale. As systems increase in num-
ber of interconnected components, communication complexity, and band-
width requirements, we see a shift toward the use of generic networks
(Networks-on-Chips) that can meet the communication requirements of re-
cent and future complex Systems-on-Chips (SoC). Figure 8.1 shows the use
of a regular topology for the creation of a heterogeneous SoC. An exam-
ple of how a heterogeneous application can be mapped on this SoC is also
depicted. Of course the topology does not have to be regular, as shown in
Figure 8.2.
However, this change dramatically increased the complexity of monitoring
compared to the simpler systems for several reasons. First, the sheer increase
in communication bandwidth of each component increases the amount of
information that needs to be monitored or traced. Second, the structure of the
system does not provide the single, convenient central-monitoring location
any more. As communication in most cases is conducted in a point-to-point,
[Figure: mesh of routers (R) with network interfaces (NI) connecting tiles such as a CPU, a crypto core, and other IP cores.]
FIGURE 8.1
Network-on-Chip based on a regular topology, and an example with a heterogeneous application. Each node (or tile) is connected to a router, and the routers are interconnected to form the network. The nodes can be identical, creating a homogeneous system (i.e., CPUs), or can differ, leading to a heterogeneous system.
[Figure: irregular arrangement of routers (R) and network interfaces (NI) connecting IP cores, including a video core.]
FIGURE 8.2
Network-on-Chip based on an irregular topology. The nodes are again connected to (possibly multiple) routers, but the routers are interconnected on an ad hoc basis to customize the network to the application demands and achieve a better cost–performance ratio.
and also collects and delivers the possible responses. However, this type of
operation is intrusive and useful only for off-line testing.
Another important benefit of monitoring is to use it for debugging pur-
poses. When the system is in operation and we want to extract information,
we can track the system progress without affecting its operation (i.e., in a
nonintrusive manner). To achieve this goal, the testing wrappers should pro-
vide the necessary information to the monitoring infrastructure, which can
then deliver it to the tester without affecting the application’s behavior.
can readjust the speed of uncongested portions of the NoC to save power.
When links and routers do not support multiple voltage and corresponding
speed levels, the identified routers can be shutdown, and their (presumably
noncritical) traffic can be rerouted via other low utilization routers.
8.3.1.1 Events
In the high-level schemes, the data collected are modeled in the forms that
are called events [12]. Based on this approach, all the events have specific
predefined formats and are most frequently categorized because they may
have different meanings. According to Mansouri-Samani and Sloman [13],
“an event is a happening of interest, which occurs instantaneously at a cer-
tain time.” Therefore information characterizing an event consists of (a) a
timestamp giving the exact time the event occurred, (b) a source id that
defines the source of the event, (c) a special identifier specifying the cate-
gory that the event belongs to, and (d) the information that this event carries.
The information of the events are called attributes of the events, and consist of
an attribute identifier and a value. The exact attributes as well as the number
of them depend on the category to which the event belongs.
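The event structure described above can be sketched as a small record type. The field names and the example category below are illustrative; in practice the category field would hold one of the five event classes enumerated next.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class MonitoringEvent:
    """An NoC monitoring event: a timestamp, a source identifier, a
    category identifier, and category-dependent attributes, each an
    (attribute identifier -> value) pair."""
    timestamp: int      # cycle/time at which the event occurred
    source_id: int      # probe or component that produced the event
    category: str       # event class, e.g. "noc_alert" (illustrative)
    attributes: Dict[str, Union[int, str]] = field(default_factory=dict)
```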
Regarding the classification of the events, Ciordas et al. have grouped them
in five main classes: user configuration events, user data events, NoC config-
uration events, NoC alert events, and monitoring service internal events [12].
NoC monitoring hardware subsystems and the traffic for setting up connec-
tions for the transport of data from the actual NoC to the NoC-monitoring
processing system. On the other hand, the event traffic management sys-
tem deals with the traffic generated after the NoC has been thoroughly
configured.
• In-band traffic. In this case, the NoC traffic is transmitted over the
NoC links either by using time division multiplexing (TDM) tech-
niques or by sharing a network interface (NI).
• Out-of-band traffic. When hard real-time diagnostic services are
needed or when the NoC capacity is limited by communication-
bounded applications, a separate interconnection scheme is used
and the NoC monitoring traffic is considered out-of-band.
Considering that the employed monitoring services are used for debugging,
performance optimization, or power management, the choice of the
appropriate interconnection structure is critical: a poor choice may
degrade the overall efficiency of the NoC and work against the desired
objective.
A self-adapting monitor service could encompass programmable mecha-
nisms to adjust the generated monitoring traffic in a dynamic manner. Using
a hybrid methodology, the distributed NoC-monitoring subsystems or the
central-monitor controller can support an efficient traffic management scheme
and regulate the traffic from the NoC to the central diagnostic manager. How-
ever, placing extra functionality increases the overhead of the monitoring
probes, in terms of area or energy consumption.
[Figure: monitoring server connected over a monitor interconnect link to a probe that combines a transactor-level generator/decompressor with a message generator/compressor.]
FIGURE 8.3
Monitoring component combining the two alternatives: sniffing data and filtering up to transaction level, and streaming the messages using compression to reduce transferring large amounts of data.
[Figure: monitoring component layers, from bottom to top — physical (wire/bit) level, packet level, message level, and transaction level.]
FIGURE 8.4
Layered organization of a monitoring component.
Filtering is based on prior knowledge of the type and the format of data,
and in some cases also on the timing of the data of interest. Dimensioning the
filters though is a trade-off between flexibility and increased area cost. It is
feasible to use masks and even more intelligent event-based filters as long as
the total overhead is affordable. The benefits of appropriate conditioning focus
on reducing the traced data to only the critically needed pieces of information.
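A mask-based filter of the kind mentioned above can be sketched in a few lines; the header layout in the example is an assumption for illustration only.

```python
def make_mask_filter(mask, match):
    """Hardware-style mask filter: a header word passes when its bits
    selected by `mask` equal the corresponding bits of `match`."""
    def accept(header):
        return (header & mask) == (match & mask)
    return accept

# Example: keep only packets whose destination field (assumed here to
# occupy bits 4..7 of the header word) equals 0x5.
dest_filter = make_mask_filter(mask=0xF0, match=0x50)
```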
Monitoring services can be characterized as best effort (BE) assuming either
that the probing of a link is done periodically or that the messages sent and
henceforth the reaction to them is not strictly real-time. This type of service
is useful for observing liveness of a core or of a link. Moreover, because pri-
oritization of on-chip connections is a usual mechanism to differentiate and
ensure QoS for on-chip user traffic, the same prioritization should be ensured
also for the monitoring services.
Meanwhile, monitoring services that need guaranteed accuracy (GA) might
be required when an exact piece of information is needed, for example, to
calculate throughput based on bytes sent over a link or for debugging
purposes.
In addition, hard real-time performance necessitates the quality of GA ser-
vices in terms of low latency and complete view of the traffic or capacity
to sustain monitoring traffic at full throughput. Guarantees are obtained by
means of separate physical links or by means of TDMA slot reservations in
NoC interfaces.
Although this is a modular approach, it may suffer from transferring and
possibly keeping in memory large amounts of data. If on-chip memory
resources are used, the issue raised is the amount of available memory,
whereas if off-chip storage is used the issues shift to bandwidth needs,
pad limitations, or augmented system complexity. Additionally,
multiplexing with already used memory interfaces affects the available
bandwidth and may raise redesign considerations.
If on-the-fly analysis is desired, then such a monitoring must be application-
specific (i.e., hard monitor). Alternatively, an embedded software solution can
also perform such analysis provided that software latencies can be tolerated.
This is a viable option in low bandwidth configurations of an NoC or for a
monitoring application that is not critical in a real-time environment.
[Figure: four attachment options of a monitoring probe relative to a link, routers, and an IP core.]
FIGURE 8.5
Attachment options of a monitoring component: (a) sniffing packets from a link, (b) operating as a bridge observing and even injecting packets, (c) collecting data also from the core of an IP, (d) accessing also the internal status of a router.
[Figure: three monitoring architectures over a grid of routers (R), monitors (M), network interfaces (NI), and IP cores.]
FIGURE 8.6
Monitoring architectural options: (a) use a separate monitoring NoC for transferring monitor traffic, (b) share the user NoC also for monitoring, (c) use a separate bus-based interconnect.
[Figure: transaction monitor layers, from bottom to top — a link sniffer over the NoC interfaces or communication links, physical layers with guaranteed/best-effort filtering, data layers with connection and packet filtering, and a monitoring interface on top.]
FIGURE 8.7
Layered organization of a transaction monitor: Each filter layer can be configured at run-time via a command-based interface. The required functionality defines the number of layers of a monitoring probe.
[Figure: design flow steps including routing/path selection, monitor placement, and dimensioning, in panels (a) and (b).]
FIGURE 8.8
Integrated NoC-monitor design flow. Part (a) shows a simplified flow for simple NoCs, whereas part (b) shows how it is changed to integrate monitor placement and optimization.
in the design of the regular NoC and the monitoring network is through
the increased area needed to accommodate the monitors and the monitoring
network.
[Figure: CPU with instruction cache, interrupt controller, and data interface attached to NoC routers (R) through probes (P).]
FIGURE 8.9
Architecture of the hybrid monitoring system of a software monitoring manager assisted by hardware interface accelerators.
connected affects the number of NIs and the associated channels and not the
routers. Thus, full coverage requires a large number of transaction monitors
attached to the NIs. In other NoCs with cores attached to the same channel, a
lower number of transaction monitors will be required. From the real exam-
ples, the area-efficient solutions were achieved when all routers were probed.
Finally, in all designs, the area of the monitors is several times lower than the
area of the routers involved.
It must be noted that in the case of bottleneck designs, the number of routers
was inevitably increased. The situation might be even worse assuming that
an irregular topology might be in use, or in the case where TDMA was not
employed. The benchmarks showed a dependence between the slot table size
and the NoC topology; a mesh comprising fewer routers required a larger
slot table size. Even more important, it is noticeable that the
monitoring service itself is not considered in the design stage. It could
dynamically affect and ultimately alter the application mapped on the
NoC, so as to discover and avoid bottleneck situations or hotspots at
run-time.
There is also very little research on other synchronization and
arbitration mechanisms in NoCs and on the impact of monitoring traffic on
them. Additionally, the transaction monitors in all these studies follow
a centralized organization. The MSA, for example, configures the monitor
function layers and collects the sniffed data. A distributed control
monitoring scheme will obviously deviate from the previous conclusions
and needs investigation.
NoC, allowing a diagnostic unit to easily pinpoint the faulty components. The
diagnostic unit collects messages with failure indications of other components
at the application level and at the architecture level. Failure detection mes-
sages are sent on the same TT NoC. Each message is a tuple < type, timestamp,
location >, which provides information concerning the type of failure that
occurred (e.g., crash failure of a micro component, illegal resource allocation
requests), the time of detection with respect to the global time base, and the
location within the SoC.
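The failure-notification tuple described above can be sketched as a small data structure; the enum names, field types, and example values below are assumptions of this sketch, not taken from the TT NoC implementation.

```python
from dataclasses import dataclass
from enum import Enum, auto

class FailureType(Enum):
    # Example failure categories from the text; the names are assumptions.
    CRASH = auto()                # crash failure of a micro component
    ILLEGAL_ALLOCATION = auto()   # illegal resource allocation request

@dataclass(frozen=True)
class FailureMessage:
    """The <type, timestamp, location> tuple; field types are assumed."""
    type: FailureType
    timestamp: int    # detection time with respect to the global time base
    location: str     # position of the component within the SoC

# A hypothetical message as the diagnostic unit might receive it.
msg = FailureMessage(FailureType.CRASH, timestamp=1024, location="node(2,3)")
```

Correlating such messages along time and space, as the text describes, then amounts to grouping them by `location` and `timestamp` windows.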
To provide full coverage, failures within the diagnostic unit itself must
be detected and all the failure notifications analyzed by correlating failure
indications along time and space. The diagnostic unit can distinguish perma-
nent and transient failures, and determine the appropriate action: whether
to restart a component or to take the component off-line and call for a main-
tenance action. The authors conclude that the determinism inherent in the
TTA facilitates the detection of out-of-norm behavior and also find that their
encapsulation mechanisms were successful in preventing error propagation.
Arrays (FPGAs). In such a future system, the monitoring modules will de-
cide when and how the NoC infrastructure will be reconfigured based on a
number of different criteria such as the ones presented in the last paragraph.
Because the real-time reconfiguration can take a significant amount of time,
the relevant issues that should be covered are how the traffic will be routed
during the reconfiguration and how the different SoC interfaces connected to
the NoC will be updated after the reconfiguration is completed. This feature
will not only be employed in FPGAs but can also be used in standard ASIC
SoCs, because numerous field-programmable embedded cores are available,
which can be utilized within an SoC and offer the ability to be real-time re-
configured in a partial manner.
The monitoring systems can also be utilized, in the future, to change the
encoding schemes employed by the NoC. For example, when a certain power
consumption level is reached, the monitoring system may close down some of
the NoC individual links and adapt the encoding scheme to reduce the power
consumption at the cost of reduced performance. For such a system to be
efficient, the monitoring module should be able to communicate with and
update all the NoC interfaces so that they are aware of the new data
encoding scheme.
It would also be beneficial if the future monitoring systems are very mod-
ular and are combined with a relevant efficient design flow to offer flexibility
to the designer to instantiate only the modules needed for her or his specific
device. For example, in a low-cost, low-power multiprocessor system only
the basic modules will be utilized, which will allow the processors to have
full access directly to the monitoring statistics that would be collected in the
most power-efficient manner. On the other hand, in a heterogeneous system
consisting of numerous high-end cores, the monitoring system will include
the majority of the provided modules as well as one or more processors, which
will collect numerous different detailed statistics that will be further analyzed
and processed by the monitoring CPU(s).
8.8 Conclusions
Network monitoring is a very useful service that can be integrated in future
NoCs. Its benefits are expected to increase as the demand for short time to mar-
ket forces designers to create their SoCs with an incomplete list of features,
and rely on programmability to complete the feature list during the prod-
uct lifetime instead of before product creation. SoC reuse for multiple
applications or even a simple application’s extensions may lead to a product
behavior that is vastly different than the one originally imagined during the
design phase.
Monitoring the system operation is a vehicle to capture the changes in the
behavior of the system and enable mechanisms to adapt to these changes. Net-
work monitoring is a systematic and flexible approach and can be integrated
into the NoC design flow and process. When the monitored information can
be abstracted into higher-level constructs, such as complex events, and when
monitoring shares resources with the regular SoC communication, the
cost of supporting monitoring can be kept low. However, given the potential
benefits of monitoring during the SoC lifetime, supporting a more
detailed (lower level) monitoring abstraction can be acceptable, especially
when the monitoring resources are reused for traditional testing and verifi-
cation purposes.
References
[1] “Coresight,” ARM. [Online]. Available: http://www.arm.com/products/
solutions/CoreSight.html.
[2] R. Leatherman, “On-chip instrumentation approach to system-on-chip de-
velopment,” First Silicon Solutions, 1997. Available: http://www.fs2.com/
pdfs/OCI_Whitepaper.pdf.
[3] Érika Cota, L. Carro, and M. Lubaszewski, “Reusing an on-chip network for the
test of core-based systems,” ACM Transactions on Design Automation of Electronic
Systems (TODAES) 9 (2004) (4): 471–499.
[4] A. Ahmadinia, C. Bobda, J. Ding, M. Majer, J. Teich, S. Fekete, and J. van der Veen,
“A practical approach for circuit routing on dynamic reconfigurable devices,”
Rapid System Prototyping, 2005. (RSP 2005). In Proc. of the 16th IEEE International
Workshop, June 2005, 8–10, 84–90.
[5] T. Bartic, J.-Y. Mignolet, V. Nollet, T. Marescaux, D. Verkest, S. Vernalde, and
R. Lauwereins, “Topology adaptive network-on-chip design and implementa-
tion,” IEE Proceedings - Computers and Digital Techniques 152 (July 2005) (4):
467–472.
[6] C. Zeferino, M. E. Kreutz, and A. A. Susin, “Rasoc: A router soft-core for
networks-on-chip.” In Proc. of Design, Automation and Test in Europe conference
(DATE’04), February 2004, 198–203.
[7] B. Sethuraman, P. Bhattacharya, J. Khan, and R. Vemuri, “Lipar: A light-weight
parallel router for FPGA-based networks-on-chip.” In GLSVLSI ’05: Proc. of 15th
ACM Great Lakes Symposium on VLSI. New York: ACM, 2005, 452–457.
[8] S. Vassiliadis and I. Sourdis, “Flux interconnection networks on demand,” Jour-
nal of Systems Architecture 53 (2007) (10): 777–793.
[9] A. Amory, E. Briao, E. Cota, M. Lubaszewski, and F. Moraes, “A scalable test
strategy for network-on-chip routers.” In Proc. of IEEE International Test Confer-
ence (ITC 2005), November 2005, 9.
[10] L. Möller, I. Grehs, E. Carvalho, R. Soares, N. Calazans, and F. Moraes, “A NoC-
based infrastructure to enable dynamic self reconfigurable systems.” In Proc. of
3rd International Workshop on Reconfigurable Communication-Centric Systems-on-
Chip (ReCoSoC), 2007, 23–30.
[11] R. Mouhoub and O. Hammami, “NoC monitoring hardware support for fast
NoC design space exploration and potential NoC partial dynamic reconfigura-
tion.” In Proc. of International Symposium on Industrial Embedded Systems (IES ’06),
October 2006, 1–10.
CONTENTS
9.1 Energy and Power ..................................................................................... 257
9.1.1 Power Sources................................................................................ 257
9.1.1.1 Dynamic Power Consumption .................................... 258
9.1.1.2 Static Power Consumption........................................... 258
9.1.2 Energy Model for NoC ................................................................. 260
9.2 Energy and Power Reduction Technologies in NoC ............................ 261
9.2.1 Microarchitecture Level Techniques........................................... 261
9.2.1.1 Low-Swing Signaling .................................................... 261
9.2.1.2 Link Encoding ................................................................ 262
9.2.1.3 RTL Power Optimization.............................................. 263
9.2.1.4 Multithreshold (Vth ) Circuits ........................................ 263
9.2.1.5 Buffer Allocation............................................................ 263
9.2.1.6 Performance Enhancement .......................................... 264
9.2.1.7 Miscellaneous ................................................................. 264
9.2.2 System-Level Techniques ............................................................. 265
9.2.2.1 Dynamic Voltage Scaling .............................................. 265
9.2.2.2 On-Off Links................................................................... 268
9.2.2.3 Topology Optimization................................................. 269
9.2.2.4 Application Mapping.................................................... 270
9.2.2.5 Globally Asynchronous Locally
Synchronous (GALS)..................................................... 271
9.3 Power Modeling Methodology for NoC ................................................ 271
9.3.1 Analytical Model ........................................................................... 272
9.3.2 Statistical Model ............................................................................ 272
9.4 Summary..................................................................................................... 274
References............................................................................................................. 275
© 2009 by Taylor & Francis Group, LLC
256 Networks-on-Chips: Theory and Practice
P = W/T (9.1)
E = P·T (9.2)
where P is power, E energy, T a time interval, and W the total work performed
in that interval.
The concepts of energy and power are important because techniques that
reduce power do not necessarily reduce energy. For instance, the power con-
sumed by a network can be reduced by halving the operating clock frequency,
but if it takes twice as long to forward the same amount of data, the total
energy consumed will be similar.
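The frequency-halving argument can be checked directly with Equations (9.1) and (9.2); in the sketch below the wattage and timing figures are illustrative, not measurements.

```python
# Equations (9.1) and (9.2) applied to the frequency-halving example;
# the power and time values are invented for illustration.
def energy(power_w, time_s):
    """E = P * T (Equation 9.2)."""
    return power_w * time_s

p_full, t_full = 2.0, 1.0  # full clock: 2 W, data forwarded in 1 s
# Halving the clock roughly halves dynamic power, but forwarding the
# same amount of data now takes twice as long...
p_half, t_half = 1.0, 2.0
# ...so the total energy is unchanged: power fell, energy did not.
assert energy(p_full, t_full) == energy(p_half, t_half) == 2.0
```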
FIGURE 9.1
(a) Dynamic and (b) static power dissipation mechanisms in CMOS.
Equation (9.3) defines power consumption P as the sum of dynamic and static
components, Pdynamic and Pstatic , respectively:

P = Pdynamic + Pstatic (9.3)

Pdynamic = (1/2) αC V² f (9.4)
fmax = η (V − Vth)^β / V (9.5)
Equation (9.5) establishes the relationship between the supply voltage V and
the maximum operating frequency f max , where Vth is the threshold voltage,
and η and β are experimentally derived constants.
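A quick sketch of how Equations (9.4) and (9.5) interact under voltage scaling follows; the values chosen for α, C, η, β, and Vth are invented for illustration and are not constants from this chapter.

```python
# Sketch of Equations (9.4) and (9.5); all constants are illustrative.
def p_dynamic(alpha, C, V, f):
    """Pdynamic = (1/2) * alpha * C * V^2 * f  (Eq. 9.4)."""
    return 0.5 * alpha * C * V**2 * f

def f_max(V, Vth=0.3, eta=1.0e9, beta=1.3):
    """fmax = eta * (V - Vth)^beta / V  (Eq. 9.5)."""
    return eta * (V - Vth) ** beta / V

# Lowering V cuts dynamic power quadratically, but Eq. (9.5) also
# lowers the maximum usable frequency, so performance drops too.
for V in (1.2, 1.0, 0.8):
    f = f_max(V)
    print(f"V={V:.1f} V  fmax={f/1e9:.2f} GHz  "
          f"Pdyn={p_dynamic(0.5, 1e-9, V, f)*1e3:.0f} mW")
```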
Architectural efforts to control power dissipation have been directed pri-
marily at the dynamic component of power dissipation. There are four ways
to control dynamic power dissipation:
leakage current of the MOS transistor in the absence of any switching activity.
As the following equation illustrates, it is the product of the supply voltage
(V) and leakage current (Ileak ):

Pstatic = V · Ileak
FIGURE 9.2
Example of Network-on-Chip architecture: cores such as a CPU, RAM, ROM, DSP,
peripheral I/O, and a communication co-processor attached to the network through
network interfaces.
where Enetwork and Enetwork interface are the energy consumed by the network,
including links and switches, and by the network interface, respectively.
When a flit travels on the interconnection network, both links and switches
toggle. We use an approach similar to the one presented by Eisley and Peh [3]
to estimate the energy consumption for a network. E network can be further
decomposed as
where E link is the energy consumed by a flit when traversing a link between
adjacent switches, E switch is the energy consumed by each flit within the switch,
and H is the number of hops a flit traverses. A typical switch consists of
several microarchitectural components: buffers that house flits at input ports,
routing logic that steers flits toward appropriate output ports along its way
to destination, arbiter that regulates access to the crossbar, and a crossbar that
transports flits from input to output ports. E switch is the summation of energy
consumed on the internal buffer E buffer , arbitration logic E arbiter , and crossbar
E crossbar .
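The decomposition above can be sketched numerically; note that the per-hop accounting used below (each hop charged one switch plus one link traversal) is an assumption of this sketch, since published formulations differ on whether H or H − 1 link traversals are charged, and the picojoule values are invented.

```python
# Illustrative per-flit energy accounting for the model discussed above.
def e_switch(e_buffer, e_arbiter, e_crossbar):
    """Eswitch as the sum of buffer, arbiter, and crossbar energy."""
    return e_buffer + e_arbiter + e_crossbar

def e_network_per_flit(hops, e_sw, e_link):
    """Per-flit network energy over `hops` hops; one hop is charged one
    switch traversal and one link traversal (a convention assumed here)."""
    return hops * (e_sw + e_link)

sw = e_switch(e_buffer=1.2, e_arbiter=0.3, e_crossbar=0.9)  # pJ, invented
total = e_network_per_flit(hops=3, e_sw=sw, e_link=0.8)     # about 9.6 pJ
```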
FIGURE 9.3
Low-swing differential signaling.
FIGURE 9.4
Model of link encoding: b-bit data is encoded onto an n-bit link between sender
and receiver.
9.2.1.7 Miscellaneous
The crossbar is one of the most power-consuming components in NoC. Wang
et al. [25] investigated power efficiency of different microarchitectures: seg-
mented crossbar, cut-through crossbar, write-through buffer, and Express
Cube, evaluating their power-performance-area impact with power model-
ing and probabilistic analysis. Kim et al. [29] reduced the number of crossbar
ports, and Lee et al. [6] proposed a partially activated crossbar reducing ef-
fective capacitive loads.
Different types of interconnect wire have different trade-offs for power
consumption and area cost. Power consumption of RC wires with repeated
buffers increases linearly with the total wire length. Increasing the spacing
between wires can reduce power consumption, but results in additional on-
chip area. Using a transmission line is appropriate for long-distance high
frequency on-chip interconnection networks, but has complicated transmitter
and receiver circuits that may add to the overhead cost. Hu et al. [30] utilized
a variety of interconnect wire styles to achieve high-performance, low-power,
and on-chip communication.
Plink = Σ_{i=1}^{N} Pwire_i (9.13)
where η is the efficiency of the DC-DC converter and Cfilter is the filter capac-
itance of the power supply regulator.
Therefore, the total link energy with DVS is represented as
Elink = Σ_{i=1}^{M} Tfi · Plink(fi) + n · Elink-transition (9.15)
FIGURE 9.5
Network containing three traffic flows: R1 → R2, R3 → R6, and R7 → R2.
In this scenario, the goal is to find the configuration that maximizes energy
and power savings while delivering a prespecified level of performance. The
network in Figure 9.5 shows an example of an NoC architecture that consists
of eight nodes and seven links. Each node Ri represents a router and solid
line L j represents a link connection. There are three network traffic flows that
could occur simultaneously: (1) from node R1 to R2 ; (2) from R3 to R6 ; and (3)
from R7 to R2 . Assuming the same amount of traffic load for three flows, the
link traffics ξ L i on link L i are ordered as ξ L 7 > ξ L 2 > ξ L 1 , ξ L 3 , ξ L 5 , and ξ L 6 > ξ L 4 .
Thus, we can assign the link frequencies as f L 7 > f L 2 > f L 1 , f L 3 , f L 5 , and
f L 6 > f L 4 at that time period, reducing the energy and power of the links.
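The frequency assignment for the Figure 9.5 scenario can be sketched as a simple load-proportional policy; the numeric traffic levels below are chosen only to match the ordering stated in the text, and the proportional-scaling rule itself is an assumption of this sketch.

```python
# Traffic levels chosen only to reproduce the stated ordering
# (xiL7 > xiL2 > xiL1 = xiL3 = xiL5 = xiL6 > xiL4); numbers invented.
traffic = {"L7": 3, "L2": 2, "L1": 1, "L3": 1, "L5": 1, "L6": 1, "L4": 0}

F_MAX = 1.0e9  # assumed peak link frequency in Hz
peak = max(traffic.values())

# Scale each link's frequency with its load; an idle link drops to zero
# (or could be shut off entirely, as Section 9.2.2.2 discusses).
freq = {link: F_MAX * load / peak for link, load in traffic.items()}

assert freq["L7"] > freq["L2"] > freq["L1"] > freq["L4"]
```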
Wei and Kim proposed chip-to-chip parallel [32] and serial [33] link design
techniques where links can operate at different voltage and frequency levels
(Figure 9.6). When link frequency is adjusted, supply voltage can track to the
lower suitable value. It consists of components of a typical high-speed link: a
transmitter to convert digital binary signals into electrical signals; a signaling
channel usually modeled as a transmission line; a receiver to convert electrical
FIGURE 9.6
Example of a DVS link.
signals back to digital data; and a clock recovery block to compensate for delay
through the signaling channel. Although this link was not designed for both
dynamic voltage and frequency settings, the link architecture can be extended
to accommodate DVS [34].
In applying DVS policy to a link, we confront two general problems. One is
the estimation of link usage for a given application and the other is the algo-
rithm that adjusts the link frequency according to the time varying workload.
Elink = Σ_{i=1}^{L} (Pon Toni + Poff Toffi + ni EP) (9.16)
where Toni and Toffi are the lengths of the total power-on and power-off time
periods for link i, ni is the number of times link i has been reactivated, EP is an
energy penalty during the transition period, and L is the total number of links in
the network. By assuming Poff ≈ 0, the energy consumption of links can be
reduced to Elink ≈ Σ_{i=1}^{L} (Pon Toni + ni EP). There can be trade-offs based on the
values of ni and E P . For instance, link L 4 in Figure 9.5 can be turned off to
reduce the energy and power consumption for the network.
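Equation (9.16) and the on/off trade-off can be sketched for a single link; the convention that each off period incurs exactly one reactivation penalty, and all numeric values, are assumptions of this sketch.

```python
# Sketch of Equation (9.16) for one link; numbers are illustrative.
def link_energy_on_off(periods, p_on, p_off, e_penalty):
    """periods is a list of (t_on, t_off) tuples for one link."""
    t_on = sum(on for on, _ in periods)
    t_off = sum(off for _, off in periods)
    n = sum(1 for _, off in periods if off > 0)  # reactivations (assumed)
    return p_on * t_on + p_off * t_off + n * e_penalty

# Keeping the link up for 10 time units vs. duty-cycling it: shutdown
# pays off only while the saved P_on * T_off exceeds n * E_P.
always_on = link_energy_on_off([(10.0, 0.0)],
                               p_on=0.1, p_off=0.0, e_penalty=0.05)
duty_cycled = link_energy_on_off([(2.0, 3.0), (2.0, 3.0)],
                                 p_on=0.1, p_off=0.0, e_penalty=0.05)
assert duty_cycled < always_on
```

With a larger EP or more frequent reactivations the inequality flips, which is exactly the trade-off the text attributes to ni and EP.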
Dynamic link shutdown (DLS) [41] powers down links intelligently when
their utilizations are below a certain threshold level and a subset of highly
used links can provide connectivity in the network. An adaptive routing strat-
egy that intelligently uses a subset of links for communication was proposed,
thereby facilitating DLS for minimizing energy consumption. Soteriou and
Peh [42] explored the design space for communication channel turn-on/off
based on a dynamic power management technique depending on hardware
counter measurement obtained from the network during run-time.
Compiler-directed approaches have benefits as compared to hardware-
based approaches. Based on high-level communication analysis, these tech-
niques determine the point at which a given communication link is idle and
can be turned off to save power, without waiting for a certain period of time
FIGURE 9.7
Network topologies: (a) Mesh, (b) CMesh, and (c) hierarchical star.
to be certain that the link has truly become idle. Similarly, automatically
identifying the reactivation point eliminates the turn-on performance
penalty. Chen et al. [43] introduced a compiler-directed approach, which
increases the idle periods of communication channels by reusing the same
set of channels for as many communication messages as possible. Li et al. [44]
proposed a compiler-directed technique to turn off the communication chan-
nels to reduce NoC energy consumption.
where D is the distance from source to destination and E avg is the average
link traversal energy per unit length. Among these factors, H and D are
strongly influenced by the topology. For instance, the topology in Figure 9.5
can be changed to Figure 9.8(a), by adding additional links, while reducing
the number of hop counts. The power trade-offs are determined by interaction
of all factors dynamically, and the variation of one factor will clearly impact
other factors. For example, topology optimization can effectively reduce the
hop count, but it might inevitably increase router complexity, which increases
the switch energy (E switch ).
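The hop-count versus router-complexity trade-off can be illustrated with a toy per-flit cost of the form H · Eswitch + D · Eavg, following the factors discussed above; the accounting and every number here are assumptions for illustration only.

```python
# Toy per-flit cost combining hop count H, switch energy, and the
# distance term D * Eavg discussed in the text; all values invented.
def flit_energy(hops, e_switch, distance_mm, e_avg_per_mm):
    return hops * e_switch + distance_mm * e_avg_per_mm

# Base topology: 4 hops through simple, cheap routers.
base = flit_energy(hops=4, e_switch=1.0, distance_mm=8.0, e_avg_per_mm=0.25)
# Shortcut links halve the hop count but enlarge each router's crossbar,
# raising Eswitch; the change may or may not pay off.
wins = flit_energy(hops=2, e_switch=1.8, distance_mm=8.0, e_avg_per_mm=0.25)
loses = flit_energy(hops=2, e_switch=2.5, distance_mm=8.0, e_avg_per_mm=0.25)
assert wins < base < loses
```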
Energy efficiency of different topologies was derived and compared based
on the network size and architecture parameters for technology scaling [45].
Based on the model, Lee et al. [6] showed that hierarchical star topology has
the lowest energy and area cost for their application. For any given aver-
age point-to-point communication latency requirement, an algorithm which
FIGURE 9.8
(a) Topology optimization, (b) application mapping.
finds the optimal NoC topology from a given topology library (mesh, torus,
and hypercube) was proposed, balancing between NoC power efficiency and
communication latency [46]. Balfour and Dally [47] developed area and en-
ergy models for an on-chip interconnection network and described trade-
offs in a tiled CMP. Using these models, they investigated how aspects of
the network architecture including topology, channel width, routing strategy,
and buffer size affect performance, and impact area and energy efficiency.
Among the different topologies, the Concentrated Mesh (CMeshX2) network
was substantially the most efficient. Krishnan et al. [48] presented an MILP
formulation that addresses both wire and router energy by splitting the topol-
ogy generation problem into two distinct subproblems: (1) system-level floor
planning and (2) topology and route generation. A prohibitive greedy itera-
tive improvement strategy was used to generate an energy optimized appli-
cation specific NoC topology which supports both point-to-point and packet
switched networks [49].
high energy per instruction, whereas the smaller and slower cores execute the
parallel phase for lower energy per instruction, reducing power consumption
for similar performance.
of components. This level is less accurate but requires less simulation time.
Power models for NoC are targeted for power optimization, system perfor-
mance, and power trade-offs.
FIGURE 9.9
Power model generation methodology: a packet synthesizer (1) supplies traffic to
the test bench; the RTL design is synthesized (2) against a technology library;
gate-level simulation (3) produces switching information; power analysis (4) yields
cycle-accurate power reports; hierarchical modeling (5) and multiple regression
analysis (6) then produce the power model.
P̂ = α0 + A · X (9.18)
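The multiple-regression step of the methodology might look like the following sketch; the two predictors (flit arrivals and crossbar grants) and the synthetic "gate-level" power data are hypothetical, not the actual regressors or measurements used here.

```python
import numpy as np

# Hedged sketch of the regression step: fit power as a linear function
# of activity counters; predictor names and data are invented.
rng = np.random.default_rng(0)
flits = rng.integers(0, 100, size=50)    # e.g., flit arrivals per window
grants = rng.integers(0, 100, size=50)   # e.g., crossbar grants per window
power = 2.0 + 0.03 * flits + 0.05 * grants  # mW, synthetic ground truth

# Least-squares fit of P-hat = a0 + a1 * flits + a2 * grants.
A = np.column_stack([np.ones_like(flits), flits, grants]).astype(float)
coef, *_ = np.linalg.lstsq(A, power, rcond=None)
print(coef)  # close to [2.0, 0.03, 0.05]
```

The fitted coefficients then play the role of the regression model's intercept and coefficient vector.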
9.4 Summary
The key feature of on-chip interconnection network is the capability to provide
required communication bandwidth and low power and energy consump-
tion in the network. With the continuing progress in VLSI technology where
billions of transistors are available to the designer, power awareness becomes
the dominant enabler for a practical energy-efficient on-chip interconnection
network. This chapter discussed a few of the power and energy management
techniques for NoC. Ways to minimize the power consumption were covered
starting with microarchitectural-level techniques followed by system-level
approaches.
The microarchitectural-level power savings were presented by reducing
supply voltage and switching activity. RTL optimization enables circuit-level
power savings and multithreshold circuit reduces the static power consump-
tion for NoC components. There are trade-offs in performance and power
dissipation for buffer allocation and throughput of switches. System-level
power management allows the system power to scale with changing con-
ditions and performance requirements. Energy savings are achieved by DVS
References
[1] A. P. Chandrakasan, W. J. Bowhill, and F. Fox, Design of High-Performance Micro-
processor Circuits. Hoboken, NJ: Wiley-IEEE Press, 2000.
[2] N. S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin,
et al., “Leakage current: Moore’s law meets static power,” Computer 36 (2003)
(12): 68–75.
[3] N. Eisley and L.-S. Peh, “High-level power analysis for on-chip networks.” In
CASES ’04: Proc. of 2004 International Conference on Compilers, Architecture, and
Synthesis for Embedded Systems. New York: ACM, 2004, 104–115.
[4] H. Zhang and J. Rabaey, “Low-swing interconnect interface circuits.” In ISLPED
’98: Proc. of 1998 International Symposium on Low Power Electronics and Design. New
York: ACM, 1998, 161–166.
[5] C. Svensson, “Optimum voltage swing on on-chip and off-chip interconnect,”
IEEE Journal of Solid-State Circuits 36 (Jul. 2001) (7): 1108–1112.
[6] K. Lee, S.-J. Lee, and H.-J. Yoo, “Low-power network-on-chip for high-
performance SoC design,” IEEE Transactions on Very Large Scale Integration Sys-
tems 14 (2006) (2): 148–160.
[7] V. Venkatachalam and M. Franz, “Power reduction techniques for microproces-
sor systems,” ACM Computing Surveys 37 (2005) (3): 195–237.
[8] M. R. Stan and W. P. Burleson, “Bus-invert coding for low-power I/O,” IEEE
Transactions on Very Large Scale Integration Systems 3 (1995) (1): 49–58.
[9] C. N. Taylor, S. Dey, and Y. Zhao, “Modeling and minimization of intercon-
nect energy dissipation in nanometer technologies.” In DAC ’01: Proc. of 38th
Conference on Design Automation. New York: ACM, 2001, 754–757.
[10] B. Victor and K. Keutzer, “Bus encoding to prevent crosstalk delay.” In ICCAD
’01: Proc. of 2001 IEEE/ACM International Conference on Computer-Aided Design.
Piscataway, NJ: IEEE Press, 2001, 57–63.
[11] K. N. Patel and I. L. Markov, “Error-correction and crosstalk avoidance in DSM
busses,” IEEE Transactions on Very Large Scale Integration Systems 12 (2004) (10):
1076–1080.
[12] W.-W. Hsieh, P.-Y. Chen, and T. Hwang, “A bus architecture for crosstalk elimi-
nation in high performance processor design.” In CODES+ISSS ’06: Proc. of 4th
John Bainbridge
CONTENTS
10.1 CHAIN Works ........................................................................................ 282
10.2 Chapter Contents..................................................................................... 283
10.3 CHAIN NoC Building Blocks and Operation................................... 284
10.3.1 Differences in Operation as Compared to Clocked
Interconnect................................................................................ 284
10.3.2 Two-Layer Abstraction Model ................................................ 285
10.3.3 Link-Level Operation ............................................................... 286
10.3.4 Transmit and Receive Gateways and the CHAIN
Gateway Protocol ...................................................................... 288
10.3.5 The Protocol Layer Adapters................................................... 289
10.4 Architecture Exploration ........................................................................ 290
10.4.1 CSL Language............................................................................ 291
10.4.1.1 Global Definitions .................................................... 292
10.4.1.2 Endpoints and Ports ................................................ 292
10.4.1.3 Address Maps........................................................... 294
10.4.1.4 Connectivity Specification ...................................... 295
10.4.2 NoC Architecture Exploration Using CHAIN architect .... 296
10.4.3 Synthesis Algorithm ................................................................. 297
10.4.4 Synthesis Directives .................................................................. 299
10.5 Physical Implementation: Floorplanning, Placement,
and Routing .............................................................................................. 299
10.6 Design-for-Test (DFT).............................................................................. 301
10.7 Validation and Modeling........................................................................ 303
10.7.1 Metastability and Nondeterminism ....................................... 304
10.7.2 Equivalence Checking .............................................................. 305
10.8 Summary................................................................................................... 305
References............................................................................................................. 306
These concepts are somewhat different from the design principles to which a
clocked interconnect designer is accustomed and warrant further explanation.
The second major difference between clocked interconnect and the self-
timed approach used by CHAINworks is in the ability to use pipelining to
tune for bandwidth without having to consider latency. This stems from the
fact that the C-element∗ based half-buffer pipeline latch has only a single gate-
delay propagation time. This is very different from the use of clocked registers
where insertion of each extra register adds an additional whole clock-cycle
of latency to a communication. Clocked designers are thus accustomed to
having to use registers sparingly, requiring the P&R tool to perform substan-
tial buffering and struggle to meet timing. However, exactly the opposite
approach is best in the design of a CHAINworks system—copious use of
pipelining results in shorter wires and provides extra bandwidth slack, facil-
itating easier timing closure.
Finally, the combination of low-cost serializers and rate-matching FIFOs
enables simple bandwidth aggregation where narrow links merge to deliver
traffic onto a wider link [3]. Typically this is difficult to achieve in a clocked
environment and can only be performed with time-division multiplexing re-
quiring complex management of time-slots and global synchrony across the
system.
These differences from clocked interconnect implementation impact on the
architectural, logical and physical design of the CHAIN NoC, but are largely
hidden from the user, with just the benefits visible through the use of the
CHAINworks tools.
FIGURE 10.1
Link and logic structure showing two pipeline latches and a router. Each pipeline
stage pairs a latch with a completion detector and latch controller; the route
control uses a dual-rail encoding, the flit control a 1-of-3 encoding, and the
flit body 4-bit 3-of-6 code groups.
The CHAINworks concept uses fine-grained acknowledges [4] (here, each
grouping of 4 bits of the payload has its own acknowledge) to achieve high-
frequency operation and to partition the wide datapath into small bundles to
ease placement and routing. Such use of fine-grained acknowledges also al-
lows for skew tolerance between the route, control, and payload parts of the
flits and packets passing through the network. Skew of this nature is to be
expected as a result of the early-propagation techniques used in the switches
when they open a new route. Consideration of the steps involved in opening
a new route in a 1-to-2 port router with internal structure, as illustrated in
Figure 10.1, helps to explain how this happens.
• The bottom bits are tapped off the routing control signals.
• The tapped-off route bits are used to select which output port to
use.
∗ Dual-rail codes use two wires to convey a single bit by representing the logic level 0 using
signaling activity on one wire and logic level 1 by signaling activity on the other wire.
The first steps are performed with low latency, and the flit-control and up-
dated route symbols are output approximately together. Then, to perform the
final step, significant skew is introduced as a result of the C-element-based
pipeline tree (represented as simple buffers in the diagram) used to achieve
the fanout while maintaining the high frequency operation. The latency in-
troduced is of the order of log3 (width/4) gate delays for a datapath of width
bits implemented using 4-bit 3-of-6 code groups.
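The quoted log3(width/4) figure is easy to evaluate for concrete datapath widths; the widths below are chosen purely for illustration.

```python
import math

def fanout_skew_gate_delays(width_bits):
    """Skew of the order of log3(width/4) gate delays for a datapath of
    `width_bits` built from 4-bit 3-of-6 code groups (per the text)."""
    return math.log(width_bits / 4, 3)

for w in (16, 36, 144):
    print(f"{w}-bit datapath: ~{fanout_skew_gate_delays(w):.1f} "
          "gate delays of skew")
```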
All of the transport layer components are designed using this early prop-
agation technique to pass the performance-critical route control information
through to the next stage as quickly as possible. Consequently, all the com-
ponents can accommodate this variable skew, up to a maximum of half of a
four-phase return-to-zero handshake, and the receive gateway realigns the
data wavefront as part of the process of bridging data back into the receiving
clock domain.
FIGURE 10.2
CHAINarchitect architecture exploration flow: a CSL spec is compiled into reports,
SystemC and Verilog models, a topology, a floorplan, and the implementation.
Figure 10.2 shows the use model that can be used for this frontend of the
Silistix CHAINworks tools.
Using CHAINarchitect, it is possible to iterate over many variations on a
design, exploring the impact of different requirements and implementations
in a very short period of time. The tightness of this iterative loop is provided
by the fact that the CSL language is used to capture all of the requirements
of the interconnect such as the interface types and properties, and the com-
munication requirements between each of these interfaces. This formalized
approach leads to a more robust design process by requiring the system ar-
chitect to consider all of the IP blocks in the system rather than focusing only
on the interesting or high-performance blocks. When all blocks are consid-
ered, a much more complete picture of the actual traffic in the system
emerges, allowing reduced design margins; that is, the interconnect can be
more closely matched to the needs of the system.
Once the architect is satisfied with the results predicted by the CSL compiler
for his synthesized architecture, he can proceed to transaction- or cycle-level
modeling of the system with the SystemC and Verilog models, and finally to
the implementation.
There are four sections in a CSL source file: the global definitions, the ad-
dress maps, the port descriptions, and the connection descriptions. Explana-
tions and code-fragments for each of these are discussed below.
In each case, the exact options available and syntax used to capture the
specification are protocol specific, and many of the attributes have default
values. A typical CSL fragment for an endpoint port description is shown
below.
as part of the initiator port specification. The target entries from each inde-
pendent address map are then bound in the descriptions of the targets using
statements such as
//Read connections:
cpu.i_port <= dram.t_port(bandwidth=200 Mbs, latency = 120ns);
mpeg.i_port <= dram.t_port(bandwidth=800 Mbs);
dma.i_port <= dram.t_port(bandwidth=200 Mbs);
dma.i_port <= eth.t_port(bandwidth=50 Mbs);
//Write connections:
cpu.i_port => dram.t_port(bandwidth=200 Mbs);
mpeg.i_port => dram.t_port(bandwidth=200 Mbs);
dma.i_port => dram.t_port(bandwidth=200 Mbs);
dma.i_port => eth.t_port(bandwidth=100 Mbs);
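The kind of per-port aggregation the CSL compiler performs when provisioning links can be illustrated (purely hypothetically, outside the tools) by summing the requested bandwidths from the fragment above that converge on each target port:

```python
# Connections from the CSL fragment above: (initiator, target, Mb/s).
reads = [("cpu", "dram", 200), ("mpeg", "dram", 800),
         ("dma", "dram", 200), ("dma", "eth", 50)]
writes = [("cpu", "dram", 200), ("mpeg", "dram", 200),
          ("dma", "dram", 200), ("dma", "eth", 100)]

def port_load(conns, target):
    """Total requested bandwidth converging on one target port."""
    return sum(bw for _, tgt, bw in conns if tgt == target)

print(port_load(reads, "dram"))   # 1200 Mb/s of read traffic
print(port_load(writes, "dram"))  # 600 Mb/s of write traffic
```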
CSL Compiler Version 2008.0227 report run on Sun Feb 24 02 20:39:36 2008
command line = "-or:.\rep\silistix_training_demo_fpe_master.rep -nl -ga "
System Statistics
-----------------
Initiators: 3
Targets: 2
Adaptors: 5 (0.088 mm2, 38.3 kgates - 1.0%)
TX: 5 (0.063 mm2, 27.5 kgates - 0.7%)
Route: 1 (0.001 mm2, 0.6 kgates - 0.0%)
Serdes: 3 (0.003 mm2, 1.5 kgates - 0.0%)
..... ..... .....
Total fabric area: 0.242 mm2, 105.7 kgates (2.7%)
Fabric nominal power: 62.918457 mWatt
The bill of materials shows a list of the Silistix library components required
for the design, including the clockless hard-macro components implementing
the NoC transport layer and the protocol adapters coupling the endpoint
interfaces of those protocols onto the transport layer. The other important data
provided in the results is a hop-by-hop breakdown of the path through the
network for each communication showing the bandwidth and latency slack.
FIGURE 10.3
Traffic time-window usage model.
The use of the self-timed transport layer allows simple in-fabric serdes support
for changing the aspect ratio (and consequently duty cycle on the link) of
transactions as they move from one link width to another. This is an impor-
tant distinction from clocked implementations where the rigidity of the clock
makes changes to the aspect ratio or duty cycle much more challenging to
achieve.
The final part of this static time-window traffic model addresses the added
latency impact of contention and congestion in the system. When considered
from an idle situation, some transactions will have to wait because they en-
counter a link that is busy, the worst case delay being determined by the
sum of all of the duty terms of the other communications performed over the
same link. However, such “latency from idle” analysis is not representative
of a real system, where once a stable operating condition is achieved, which
is of course regulated by the traffic-generation rates of the endpoints, the
congestion-imposed latency is substantially lower than the theoretical worst
case and typically negligible.
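A minimal sketch of that worst-case bound, with hypothetical duty terms expressed as percentages of link time:

```python
def worst_case_wait(duty_pct, victim):
    """Upper bound on the from-idle wait of flow `victim` on a shared
    link: the sum of the duty terms (here in % of link time) of every
    other flow routed over the same link."""
    return sum(d for name, d in duty_pct.items() if name != victim)

# Hypothetical duty terms for three flows sharing one link:
flows = {"cpu": 20, "mpeg": 45, "dma": 10}
print(worst_case_wait(flows, "cpu"))  # 55 (% of a link period)
```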
Figure 10.4 shows a time-window illustration of two transfer sequences,
A and B, which are merged onto a shared link. Initially both sequences en-
counter added latency, with the first item of sequence B suffering the worst
delay (due to the arbiter in the merge unit resolving in favor of A on the first
transfer). Then the second transfer on A suffers a small delay waiting for the
link to become available. By the third and fourth transfers in the sequences
the congestion-imposed delays are minimal. Thus, although the upper limit
on the jitter introduced into each transfer sequence is of magnitude equal to
the duration of the activity from the other contending sequence, the average
jitter is substantially smaller and almost negligible once synchronization de-
lays experienced at the edges of the network are considered, provided the
contiguous flit-sequence lengths are short. If wider jitter can be tolerated, as
is often the case, then longer sequences can be used.
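The merge behavior just described can be mimicked with a toy round-robin arbiter (an illustrative model, not the actual merge-unit logic):

```python
def merge(seq_a, seq_b):
    """Round-robin merge of two ready queues onto one shared link,
    one flit per slot; returns the departure order."""
    out, a, b = [], list(seq_a), list(seq_b)
    turn = 0  # arbiter initially resolves in favor of A
    while a or b:
        pick_a = (a and not b) or (a and b and turn == 0)
        out.append(a.pop(0) if pick_a else b.pop(0))
        if a and b:
            turn ^= 1  # alternate only while both sequences contend
    return out

print(merge(["A1", "A2"], ["B1", "B2", "B3"]))
# ['A1', 'B1', 'A2', 'B2', 'B3']
```

Once one sequence drains, the other proceeds without added delay, which is why the congestion-imposed jitter is concentrated at the start of contention.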
FIGURE 10.4
Arbitration impact on future transfer alignment. (Sequences A1–A6 and B1–B7 are merged onto a shared link in the order A1 B1 B2 A2 B3 A3 B4 A4 B5 A5 B6 A6 B7.)
FIGURE 10.5
CHAINarchitect floor-plan estimate.
In the floorplan estimate, the small black blocks are the Silistix asynchronous logic,
and the pipeline latches are just visible as the really small blocks spanning
the distance between the other components. Also key here is the observation
noted earlier about the ease and nondamaging impact of over-provisioning
pipelining that is central to the methodology. This means that once CHAINar-
chitect has settled on a suitable floorplan and topology, it can calculate the
link widths and pipelining depths necessary for the implementation and suf-
ficiently over-provision the pipelining to ensure that the system will meet
its requirements while allowing for the uncertainty that is inherent in the
later steps of the physical design flow in moving from a rough floorplan
estimate constructed at an abstract level to a real physical implementation
post place-and-route. A basic spring model is used as part of this process to
evenly distribute the switching components and pipeline latches of the net-
work fabric across the physical distances to be spanned. For a typical SoC,
the runtime of the CHAINworks synthesis and provisioning algorithms is a
few minutes thereby allowing the system architect to rapidly iterate through
the exploration of a range of system architectures.
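A toy one-dimensional version of such a spring model, using Jacobi-style relaxation (the real tool works on a 2-D floorplan; this sketch only shows why equal springs yield evenly spaced latches):

```python
def relax_latches(n_latches, start, end, iters=200):
    """1-D spring relaxation: n pipeline latches between two fixed
    components repeatedly settle at the average of their neighbours,
    converging to an even spread along the link."""
    pos = [start] * n_latches
    for _ in range(iters):
        pts = [start] + pos + [end]
        pos = [(pts[i - 1] + pts[i + 1]) / 2 for i in range(1, n_latches + 1)]
    return [round(p, 3) for p in pos]

print(relax_latches(3, 0.0, 4.0))  # [1.0, 2.0, 3.0]
```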
Once the floorplan is finalized and the hard-macro components of the net-
work fabric are placed, along with any other hard macros in the design, the
placement and routing of other blocks is performed as normal. Timing con-
straints are output by CHAINcompiler in conventional SDC format for the
self-timed wires between the NoC transport-layer components for use with
the mainstream timing-driven flows. Using these, the place and route tools
perform buffer insertion and routing of the longer wires as appropriate.
FIGURE 10.6
Scan-latch locations for partial scan. (The scan chain weaves through the self-timed logic, cutting all global feedback loops and allowing control of the progress of vectors from the transmitter to the receiver clock domain; functional vectors are loaded into the datapath using scan at the transmitter and read out using scan at the receiver.)
features an integral scan latch. This approach is compatible with all existing
scan-chain manipulation tools and conventional ATPG approaches. The more
advanced and lower cost approach relies on a combination of functional pat-
terns and sequential, partial scan. In both cases similar 99.xx percent stuck-at
fault coverage is achieved as in regular clocked logic test. Consideration of a
pipelined path from a transmitter to a receiver, as shown in Figure 10.6, can
illustrate how the partial scan [8] approach works.
Scan flops are placed on targeted nodes, typically feedback loops, state-
machine outputs and select lines that intersect the datapath. These are shown
explicitly in the simplified pipeline of Figure 10.6, but in reality they are en-
capsulated in (and placed and routed as part of) the hard-macro components.
Testing the network is then a three-stage process.
• The first pass of the test process uses just the conventional scan flops
in the transmit and receive hard macros to check the interface with
the conventionally clocked logic.
• Second, in transport-layer test mode, the same scan flops are used
to shift functional vectors into the datapath at the transmitter. The
circuit is switched back into operational mode (where the global-
feedback partial-scan latches connect straight through without in-
terrupting the loops) and the vectors are then transmitted through
the network at speed and then read out at the receiver using its scan
chain. This achieves good coverage of all the datapath nodes and
many of the control-path nodes through the network. This step is not
essential, just more efficient, because all faults can also
be detected using the final pass below.
• The final pass uses the partial-scan flops on the global feedback
loops in non-bypass mode to break the global loops allowing access
The full set of required patterns is generated by the CHAINworks tools in
STIL format for use with conventional testers and test pattern processing tools.
Achievable coverage is verified using conventional third-party concurrent
fault simulators.
Highly efficient delay-fault testing is performed using a variant on the
functional-test approach. Patterns injected at a transmit unit are steered
through the network to a receiver and the total flight time from the transmitter's
clocked/asynchronous converter to the receiver's asynchronous/clocked
converter is measured. Any significant increase above the expected value is
indicative of a delay fault somewhere on the path between the two ends.
This approach is very efficient for detecting the absence or presence of delay
faults, but does not help in the exact location of a delay fault. However, the
scan access facilitates such localization of any faults detected.
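The pass/fail decision can be sketched as a simple threshold test; the 10 percent margin here is a hypothetical choice, not a figure from the tools:

```python
def delay_fault(flight_ns, expected_ns, tolerance=0.1):
    """Flag a path as suspect when the measured end-to-end flight time
    of a functional pattern exceeds the characterized value by more
    than the allowed fraction."""
    return flight_ns > expected_ns * (1 + tolerance)

print(delay_fault(13.8, 12.0))  # True: >10% slow, delay fault likely
print(delay_fault(12.5, 12.0))  # False: within margin
```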
FIGURE 10.7
Output Verilog netlist partitioning. (Six clock-domain partitions, domain 0 through domain 5, each comprising an endpoint IP, a protocol adapter, and a gateway, surround the switching fabric of switches, pipeline latches, and FIFOs; the domains and the switching fabric are stored in separate files, with CGP checkers/snoopers attached.)
block attached. The clocked logic generated is RTL, ready for synthesis or
simulation. For the self-timed logic that will be implemented as hard macros
in the realization of the system, behavioral models are provided that sim-
ulate substantially faster than a gate-level simulation of the real structural
netlist of the asynchronous circuits. These models are built using Verilog-2001
language constructs and their timing is calibrated against the characterized
hard macros allowing realistic time-accurate simulations of the system to be
performed.
The final level of accuracy possible with this flow is to simulate
a combination of RTL (or the synthesized gate-level netlists) of the clocked
components with the gate-level netlist of the asynchronous macros using
back-annotated timing. This gives very accurate timing simulation but at the
expense of substantial run-times.
10.8 Summary
This chapter introduces the CHAINworks tools, a commercially available
flow for the synthesis and deployment of NoC-style interconnect in SoC
designs. A new language for the capture of system-level communication re-
quirements has been presented and some of the implementation challenges
that impact the conventional ASIC design flow as a result of moving toward a
References
1. International Technology Roadmap for Semiconductors, 2007 edition,
http://www.itrs.net.
2. L. A. Plana, W. J. Bainbridge, and S. B. Furber, “The design and test of a smartcard
chip using a CHAIN self-timed network-on-chip.” In Proc. of Design, Automation
and Test in Europe Conference and Exhibition, Paris, France, February 2004.
3. L. A. Plana, J. Bainbridge, S. Furber, S. Salisbury, Y. Shi, and J. Wu, “An on-chip
and inter-chip communications network for the SpiNNaker massively-parallel
neural network.” In Proc. of 2nd IEEE International Symposium Networks on Chip,
Newcastle, United Kingdom, April 2008.
4. W. J. Bainbridge and S. B. Furber, “CHAIN: A delay insensitive CHip area
INterconnect,” IEEE Micro, 22, September–October 2002, (5): 16–23.
5. T. Verhoeff, “Delay-insensitive codes—An overview,” Distributed Computing, 3
(1988): 1–8.
6. W. J. Bainbridge, W. B. Toms, D. A. Edwards, and S. B. Furber, “Delay-insensitive,
point-to-point interconnect using m-of-n codes.” In Proc. of 9th IEEE International
Symposium on Asynchronous Circuits and Systems, Vancouver, Canada, May 2003,
pp. 132–140.
7. IEEE Std 1500–2005, “Standard testability method for embedded core-based
integrated circuits,” IEEE Press.
8. A. Efthymiou, J. Bainbridge, and D. Edwards, “Test pattern generation and
partial-scan methodology for an asynchronous SoC interconnect,” IEEE Trans-
actions on Very Large Scale Integration (VLSI) Systems, 13, December 2005, (12):
1384–1393.
9. C. L. Seitz, “System timing.” In Introduction to VLSI Systems, C. A. Mead and
L. A. Conway, eds. Reading, MA: Addison-Wesley, 1980.
CONTENTS
11.1 Introduction.............................................................................................. 308
11.2 Short Survey of Existing Interconnect Solutions................................. 310
11.3 Arteris NoC: Basic Building Blocks and EDA Tools........................... 311
11.3.1 NoC Transaction and Transport Protocol .............................. 311
11.3.1.1 Transaction Layer..................................................... 312
11.3.1.2 Transport Layer ........................................................ 313
11.3.1.3 Physical Layer........................................................... 313
11.3.2 Network Interface Units........................................................... 316
11.3.2.1 Initiator NIU Units................................................... 317
11.3.2.2 Target NIU Units ...................................................... 318
11.3.3 Packet Transportation Units .................................................... 319
11.3.3.1 Switching................................................................... 319
11.3.3.2 Routing ...................................................................... 320
11.3.3.3 Arbitration................................................................. 321
11.3.3.4 Packet Management................................................. 321
11.3.4 NoC Implementation Issues .................................................... 323
11.3.4.1 Pipelining .................................................................. 323
11.3.4.2 Clock Gating ............................................................. 324
11.3.5 EDA Tools for NoC Design ................................................ 325
11.3.5.1 NoCexplorer ............................................................. 325
11.3.5.2 NoCcompiler ............................................................ 326
11.4 MPSoC Platform ...................................................................................... 329
11.4.1 ADRES Processor ...................................................................... 331
11.4.2 Communication Assist ............................................................. 333
11.4.3 Memory Subsystem .................................................................. 334
11.4.4 NoC ............................................................................................. 335
11.4.5 Synthesis Results ....................................................................... 337
© 2009 by Taylor & Francis Group, LLC
11.5 Power Dissipation of the NoC for Video Coding Applications........ 340
11.5.1 Video Applications Mapping Scenarios................................. 340
11.5.1.1 MPEG-4 SP Encoder ................................................ 340
11.5.1.2 AVC/H.264 SP Encoder .......................................... 341
11.5.2 Power Dissipation Models of Individual NoC
Components ............................................................................... 345
11.5.2.1 Network Interface Units.......................................... 345
11.5.2.2 Switches..................................................................... 346
11.5.2.3 Links: Wires............................................................... 347
11.5.3 Power Dissipation of the Complete NoC .............................. 348
References............................................................................................................. 352
11.1 Introduction
In the near future, handheld, mobile, battery-operated electronic devices
will integrate different functionalities under the same hood, including
mobile telephony and Internet access, personal digital assistants, powerful 3D
game engines, and high-speed cameras capable of acquiring and processing
high-resolution images at realtime frame rates. All these functionalities will
result in huge computational complexity that will require multiple processing
cores to be embedded in the same chip, possibly using the Multi-Processor
System-on-Chip (MPSoC) computational paradigm. Together with increased
computational complexity, the communication requirements will grow, with
data streams of dozens, hundreds, and even thousands of megabytes per
second to be transferred. A quick look at state-of-the-art video encoding
algorithms, such as AVC/H.264 for high-resolution (HDTV) realtime image
compression applications, already indicates bandwidths of a few gigabytes
per second of traffic. Such bandwidth requirements
cannot be delivered through traditional communication solutions; therefore,
Networks-on-Chip (NoCs) are used more and more in the development of
such MPSoC systems. Finally, for battery-powered devices both processing
and communication will have to be low-power to increase the autonomy of
a device as much as possible.
In this chapter we will present an MPSoC platform, developed at the
Interuniversity Microelectronics Center (IMEC), Leuven, Belgium in partner-
ship with Samsung Electronics and Freescale, using Arteris NoC as communi-
cation infrastructure. This MPSoC platform is dedicated to high-performance
(HDTV image resolution), low-power (700 mW power budget for processing),
and real-time video coding applications (30 frames per second) using state-of-
the-art video encoding algorithms such as MPEG-4, AVC/H.264, and Scalable
Video Coding (SVC). The proposed MPSoC platform is built using six Coarse
Grain Array (CGA) ADRES processors also developed at IMEC, four on-chip
memory nodes, one external memory interface, one control processor, one
node that handles input and output of the video stream, and Arteris NoC as
communication infrastructure. The proposed MPSoC platform is supposed
to be flexible, allowing easy implementation of different multimedia applica-
tions and scalable to the future evolutions of the video encoding standards
and other mobile applications in general.
Although it is obvious that NoCs represent the future of the interconnects
in the large, high-performance, and scalable MPSoCs, it is less evident that
they are area- and power-efficient. With an NoC, the raw data has
to be encapsulated first in packets in the network interface unit on the master
side (IP to NoC protocol conversion) and these packets have to travel through
a certain number of routers. Depending on the routing strategy, a portion of
the packet (or the complete packet) will eventually have to be buffered in
some memory before reaching the next router. Finally, on the slave network
interface side, the raw data has to be extracted from packets before reaching
the target (here we assume a write operation; a read operation would require a
similar path for the data request, but would also include a path in the opposite
direction with actual data). Therefore all these NoC elements use some logic
resources and dissipate power.
In this work we show that in the context of a larger MPSoC system adapted
to today’s standards (64 mm2 die in 90 nm technology with 13 computa-
tional and memory nodes) and complex video encoding applications such as
MPEG-4 and AVC/H.264 encoders, the NoC accounts for less than three per-
cent of the total chip area and for less than five percent of the total power bud-
get. In absolute terms, this means less than 450 kgates and less than 25 mW
of power dissipation for a fully connected NoC mesh composed of 12 routers
and for a traffic of about 1 GB per second. Such communication performance,
area, and power budget are acceptable even for smaller MPSoC platforms.
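A quick back-of-envelope check of these figures (illustrative arithmetic only, derived from the numbers quoted above):

```python
# NoC cost figures quoted in the text, sanity-checked.
die_mm2 = 64.0
noc_area_share = 0.03                  # NoC < 3% of total chip area
print(round(die_mm2 * noc_area_share, 2))  # < 1.92 mm2 of NoC logic

power_mw = 25.0                        # < 25 mW NoC power
traffic_gbs = 1.0                      # ~1 GB/s through the fabric
print(round(power_mw / traffic_gbs, 1))    # < 25 mW per GB/s moved
```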
The remainder of this chapter is structured as follows: in Section 11.2 we
will briefly present a survey of different interconnect solutions. In Section 11.3
we will present the Arteris NoC in more detail. We will first introduce a
description of some of the basic NoC components (NoC protocol, network in-
terfaces, and routers) provided within the Arteris Danube NoC IP library. We
will also briefly describe the associated EDA tools that will be used for NoC
design space exploration, specification, RTL generation, and verification. In
Section 11.4, we will describe the architecture of the MPSoC platform in a
more detailed way. We will give a description of the ADRES CGA processor
architecture, memory subsystem, NoC topology and configuration. Finally,
we will present the MPSoC platform synthesis results and power dissipa-
tion figures. In Section 11.5 we will present the power models of different
NoC components (network interfaces, routers and wires) and will provide
the power model of the complete NoC. Such models will be used to derive
the power dissipation of the MPEG-4 and AVC/H.264 simple profile encoders
for different frame resolutions and different applications mapping scenarios.
The results obtained will be compared with some of the state-of-the-art
implementations already presented in the literature.
FIGURE 11.1
NTTP protocol layers mapped on NoC units and Media Independent NoC Interface—MINI. (An initiator NIU sends request packets from its Tx port through the request network to a target NIU, whose response packets return through the response network; each link carries the Data, Frm, Head, TailOfs, Pres., Vld, and RxRdy signals, layered as transaction, transport, and physical levels.)
Units (PTUs). The physical layer defines how packets are physically trans-
mitted over an interface.
An NTTP transaction is typically made of request packets, traveling
through the request network between the master and the slave NIUs, and
response packets that are exchanged between a slave NIU and a master NIU
through the response network. At this abstraction level, there is no assump-
tion on how the NoC is actually implemented (i.e., the NoC topology). Trans-
actions are handed off to the transport layer, which is responsible for deliv-
ering packets between endpoints of the NoC (using links, routers, muxes,
rate adapters, FIFOs, etc.). Between NoC components, packets are physi-
cally transported as cells across various interfaces, a cell being a basic data
unit being transported. This is illustrated in Figure 11.1, with one master and
one slave node, and one router in the request and response path.
As shown in Figure 11.1, requests from an initiator are sent through the master
NIU’s transmit port, Tx, to the NoC request network, where they are routed to
the corresponding slave NIU. Slave NIUs, upon reception of request packets
FIGURE 11.2
NTTP packet structure. (A request packet comprises a header cell with Info, Len, master address, slave address, Prs, and Opcode fields; a necker cell with Tag, Err, slave offset, StartOfs, and StopOfs fields; and data cells of byte-enable/data-byte pairs. A response packet comprises a header cell with Rsv, Len, Info, Tag, master address, Prs, and Opcode fields, and data cells each carrying a CE bit alongside the data.)
on their receive ports, Rx, translate requests so that they comply with the pro-
tocol used by the target third-party IP node. When the target node responds,
returning responses are again converted by the slave NIU into appropriate
response packets, then delivered through the slave NIU’s Tx port to the
response network. The network then routes the response packets to the re-
questing master NIU, which forwards them to the initiator. At the transaction
level, NIUs enable multiple protocols to coexist within the same NoC. From
the point of view of the NTTP modules, different third-party protocols are
just packets moving back and forth across the network.
maximum cell-width (header, necker, and data cell) and the link-width. One
link (represented in Figure 11.1) defines the following signals:
• Data—Data word of the width specified at design-time.
• Frm—When asserted high, indicates that a packet is being transmit-
ted.
• Head—When asserted high, indicates the current word contains a
packet header. When the link-width is smaller than single (SGL), the
header transmission is split into several word transfers. However,
the Head signal is asserted during the first transfer only.
• TailOfs—Packet tail: when asserted high, indicates that the current
word contains the last packet cell. When the link-width is smaller
than single (SGL), the last cell transmission is split into several word
transfers. However, the Tail signal is asserted during the first transfer
only.
• Pres.—Indicates the current priority of the packet used to define
preferred traffic class (or Quality of Service). The width is fixed
at design time, allowing multiple pressure levels within
the same NoC instance (bits 3–5 in Figure 11.2).
• Vld—Data valid: when asserted high, indicates that a word is being
transmitted.
• RxRdy—Flow control: when asserted high, the receiver is ready to
accept a word. When de-asserted, the receiver is busy.
This signal set, which constitutes the Media Independent NoC Interface
(MINI), is the foundation for NTTP communications.
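The Vld/RxRdy pair amounts to a classic valid/ready handshake; a cycle-level toy model (our own sketch, not Arteris code) makes the transfer condition explicit:

```python
def mini_transfers(vld, rxrdy):
    """Cycle-level model of MINI flow control: a word moves across the
    link only in cycles where Vld (sender) and RxRdy (receiver) are
    both asserted."""
    return [i for i, (v, r) in enumerate(zip(vld, rxrdy)) if v and r]

vld   = [1, 1, 1, 0, 1, 1]
rxrdy = [1, 0, 1, 1, 1, 0]
print(mini_transfers(vld, rxrdy))  # words accepted in cycles [0, 2, 4]
```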
Packet definition. Packets are composed of cells that are organized into
fields, with each field carrying specific information. Most of the fields in
header cells are parameterizable and in some cases optional, which makes
it possible to customize packets to meet the unique needs of an NoC instance.
The following list summarizes the different fields, their size, and function:
Opcode 4 bits/3 bits Packet type: 4 bits for requests, 3 bits for responses
MstAddr User Defined Master address
SlvAddr User Defined Slave address
SlvOfs User Defined Slave offset
Len User Defined Payload length
Tag User Defined Tag
Prs User defined (0 to 2) Pressure
BE 0 or 4 bits Byte enables
CE 1 bit Cell error
Data 32 bits Packet payload
Info User Defined Information about services supported by the NoC
Err 1 bit Error bit
For request packets, a data cell is typically 32 or 36 bits wide depending on the
presence of byte enables (this is fixed at design time). For response packets,
a data cell is always 33 bits wide. A possible instance of a packet structure
is illustrated in Figure 11.2. Header, necker, and data cells do not necessarily
have the same size. Different data cell widths and their relation to the cells
are illustrated in Figure 11.3.
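These width rules are small enough to state as code (an illustrative helper, not part of the Danube library):

```python
def data_cell_width(packet_kind, byte_enables=True):
    """Data-cell width in bits per the NTTP rules above: request cells
    are 32-bit words plus optional 4-bit byte enables (fixed at design
    time); response cells always add one cell-error (CE) bit."""
    if packet_kind == "request":
        return 32 + (4 if byte_enables else 0)
    if packet_kind == "response":
        return 32 + 1  # CE bit
    raise ValueError(packet_kind)

print(data_cell_width("request"))         # 36
print(data_cell_width("request", False))  # 32
print(data_cell_width("response"))        # 33
```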
To provide services to IP cores, the transaction layer relies primarily on Load
and Store transactions, which are converted into packets. The predominant
packet types are Store and Load for requests, and Data and Acknowledge for
responses. Control packets and Error Response packets are also provided for
NoC management.
Quality of Service (QoS). QoS is a very important feature of intercon-
nect infrastructures because it provides a regulation mechanism allowing
specification of guarantees on some of the parameters related to the traf-
fic. Usually the end users are looking for guarantees on bandwidth and/or
end-to-end communication latency. Different mechanisms and strategies have
been proposed in the literature. For instance, in Æthereal NoC [11,24] pro-
posed by NXP, a TDMA approach allows the specification of two traffic cat-
egories [25]: BE and GT.
In the Arteris NoC, the QoS is achieved by exploiting the pressure signal
embedded in the NTTP packet definition (Figures 11.1 and 11.2).

FIGURE 11.3
Packet, cells, and link width. (Header, necker, and data cells are mapped onto links of different widths.)

The pressure signal can be generated by the IP itself and is typically linked to a certain level
of urgency with which the transaction will have to be completed. For exam-
ple, we can imagine generating the pressure signal when a
certain threshold has been reached in the FIFO of the corresponding IP. This
pressure information will be embedded in the NTTP packet at the NIU level:
packets that have pressure bits equal to zero will be considered without QoS;
packets with a nonzero value of the pressure bit will indicate preferred traffic
class.∗ Such a QoS mechanism offers immediate service to the most urgent
inputs, and fair service whenever there are multiple contend-
ing inputs of equal urgency (BE). Within switches, arbitration decisions favor
preferred packets and allocate remaining bandwidth (after preferred packets
are served) fairly to contending packets. When there are contending preferred
packets at the same pressure level, arbitration decisions among them are also
fair.
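A sketch of that policy, highest pressure wins with round-robin rotation among equal-pressure contenders, follows (function and port names are hypothetical):

```python
def arbitrate(requests, rr_state):
    """Pick among contending packets: highest pressure wins; ties are
    broken round-robin so equal-pressure flows share fairly.
    `requests` maps port -> pressure; rr_state is a rotating port list."""
    top = max(requests.values())
    contenders = [p for p in rr_state if p in requests and requests[p] == top]
    winner = contenders[0]
    rr_state.remove(winner)
    rr_state.append(winner)  # move winner to the back of the rotation
    return winner

rr = ["p0", "p1", "p2"]
print(arbitrate({"p0": 0, "p1": 2, "p2": 2}, rr))  # p1 (pressure tie, RR)
print(arbitrate({"p0": 0, "p1": 2, "p2": 2}, rr))  # p2 next among equals
```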
The Arteris NoC supports the following four different traffic classes:
• Real time and low latency (RTLL)—Traffic flows that require the
lowest possible latency. Sometimes it is acceptable to have brief
intervals of longer latency as long as the average latency is low.
Care must be taken to avoid starving other traffic flows as a side
effect of pursuing low latency.
• Guaranteed throughput (GT)—Traffic flows that must maintain
their throughput over a relatively long time interval. The actual
bandwidth needed can be highly variable even over long intervals.
Dynamic pressure is employed for this traffic class.
• Guaranteed bandwidth (GBW)—Traffic flows that require a guar-
anteed amount of bandwidth over a relatively long time interval.
Over short periods, the network may lag or lead in providing this
bandwidth. Bandwidth meters may be inserted onto links in the
NoC to regulate these flows, using either of the two methods. If the
flow is assigned high pressure, the meter asserts backpressure (flow
control) to prevent the flow from exceeding a maximum bandwidth.
Alternatively, the meter can modulate the flow's pressure (priority)
dynamically as needed to maintain an average bandwidth.
• Best effort (BE)—Traffic flows that do not require guaranteed
latency or throughput but have an expectation of fairness.
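The backpressure variant of the GBW bandwidth meter behaves like a token bucket; a minimal sketch under that assumption (class name and parameters are ours):

```python
class BandwidthMeter:
    """GBW meter in backpressure mode: tokens accrue at the guaranteed
    rate and each transferred word spends one; when the bucket is
    empty the meter deasserts ready (backpressure)."""
    def __init__(self, rate, burst):
        self.rate, self.tokens, self.burst = rate, burst, burst

    def tick(self, wants_to_send):
        self.tokens = min(self.burst, self.tokens + self.rate)
        if wants_to_send and self.tokens >= 1:
            self.tokens -= 1
            return True   # word allowed through
        return False      # backpressure (or idle)

m = BandwidthMeter(rate=0.5, burst=1)    # cap at 1 word every 2 cycles
print([m.tick(True) for _ in range(6)])  # alternates True/False
```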
∗ Note that in the NTTP packet, the pressure field allows more than one bit, resulting in multiple pressure levels.
In the following, we will describe in more detail both initiator and target
NIU units for the AHB protocol, because this particular protocol has been
used for all nodes in the MPSoC platform.
FIGURE 11.4
Network interface unit: Initiator architecture. (The request path converts AHB accesses through a data-width converter, pipe, and header/necker control into NTTP packets; the response path applies AHB response flow control on the Rx port using information from the request path.)
burst at NoC rate as soon as a minimum amount of data has been received.
The width of the FIFO and the AHB data bus is identical, and the FIFO depth
is defined by the hardware parameter. This parameter indicates the amount of
data required to generate a Store packet: each time the FIFO is full, a Request
packet is sent on the Tx port. Of course, if the AHB access ends before the FIFO
is full, the NTTP request packet is sent. Because AHB can only tolerate a single
outstanding transaction, the AHB bus is frozen until the NTTP transaction
has been completed. That is
• During a read request, until the requested data arrives from the Rx
port
• During a nonbufferable write request, in which case only the last
access is frozen and the acknowledge occurs when the last NTTP
response packet has been received
• When an internal FIFO is full
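The FIFO-triggered packetization described above can be sketched as follows (an illustrative model; names and parameters are ours):

```python
def packetize(words, fifo_depth):
    """Initiator-NIU write path sketch: a Store packet is emitted each
    time the internal FIFO fills, and a final (short) packet flushes
    whatever remains when the AHB access ends."""
    packets, fifo = [], []
    for w in words:
        fifo.append(w)
        if len(fifo) == fifo_depth:   # FIFO full -> send Store packet
            packets.append(fifo)
            fifo = []
    if fifo:                          # access ended early -> flush
        packets.append(fifo)
    return packets

print(packetize(list(range(10)), fifo_depth=4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```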
FIGURE 11.5
Network interface unit: Target architecture. (The request path extracts header info, address, and control from packets arriving on the Rx port and drives the AHB master interface, passing write data through a pipe and shifter; the response path returns AHB responses as NTTP packets.)
32 bits wide, but the actual address space size may be downsized by setting a
hardware parameter. Unused AHB address bits are then driven to zero. The
NTTP request packet is then translated into one or more corresponding AHB
accesses, depending on the transaction type (word aligned or nonaligned ac-
cess). For example, if the request is an atomic Store, or a Load that can fit an
AHB burst of specified length, then such a burst is generated. Otherwise, an
AHB burst with unspecified length is generated.
11.3.3.1 Switching
The switching is done by accepting NTTP packets carried by input ports and
forwarding each packet transparently to a specific output port. The switch
operates fully synchronously and can be implemented as a full crossbar
(up to one data word transfer per port and per cycle), with automatic
removal of the hardware corresponding to unused
input/output port connections (port depletion). The switch uses wormhole
routing, for reduced latency, and can provide full throughput arbitration; that
is, up to one routing decision per input port and per cycle. An arbitrary num-
ber of switches can be connected in cascade, supporting any loopless network
topology. The QoS is supported in the switch using the pressure information
generated by the IP itself and embedded in NTTP packets.
A switch can be configured to meet specific application requirements by
setting the MINI-ports (Rx or Tx ports, as defined by the MINI interface
introduced earlier) attributes, routing tables, arbitration mode, and pipelining
strategy. Some of the features can be software-controlled at runtime through
the service network. There is one routing table per Rx port and one arbiter
per Tx port. Packet switching consists of the following four stages:
11.3.3.2 Routing
The switch extracts the destination address and possibly the scattering infor-
mation from the incoming packet header and necker cells, and then selects
an output port accordingly. For a request switch, the destination address
is the slave address and the scattering information is the master address
FIGURE 11.6
Packet transportation unit: Router architecture.
(as defined in packet structure, Figure 11.2). For a response switch, the desti-
nation address is the Master address and there is no scattering information.
The switch ensures that all input packets are routed to an output port. If the
destination address is wrong or if the routing table is not written properly,
the packet is forwarded to a default output port. In this way, an NTTP slave
will detect an error upon packet reception. The “default” output is the port of
highest index that is implemented: port n, or port n − 1 if port n is depleted,
or port n − 2 if ports n and n − 1 are depleted, and so on.
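The fallback rule above can be sketched in a few lines. This is a minimal illustration under assumed data shapes (a dict as routing table, a set of depleted port indices); the function names are hypothetical.

```python
def default_output(n, depleted):
    """Highest-index implemented port: n, else n-1, and so on."""
    port = n
    while port in depleted:
        port -= 1
    return port

def route(dest_addr, table, n, depleted=frozenset()):
    """Look up the output port; misses go to the default port."""
    return table.get(dest_addr, default_output(n, depleted))

table = {0x01: 0, 0x02: 1}
assert route(0x02, table, n=4) == 1                   # normal table hit
assert route(0x7F, table, n=4) == 4                   # miss -> default port n
assert route(0x7F, table, n=4, depleted={4, 3}) == 2  # ports 4 and 3 depleted
```

Routing every miss to a well-defined port means a misdirected packet still reaches some NTTP slave, which then flags the error, as the text describes.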
11.3.3.3 Arbitration
Each output port tracks the set of input ports requesting it. For each cycle in
which a new packet may be transmitted, the arbiter elects one input port in
that set. This election is conducted logically in two phases.
First, the pressure information used to define the preferred traffic class
(QoS) of the requesting inputs is considered. The pressure information is
explicitly carried by the MINI interface (signal Pres. in Figure 11.1), and in-
dicates the urgency for the current packet to get out of the way. It is the
maximum packet pressure backing up behind the current packet. The pres-
sure information is given top priority by the switch arbiter: among the set of
requesters, the input with the greatest pressure is selected. Additionally, the
maximum pressure of the requesters is directly forwarded to the output port.
Second, the election is held among the remaining requesters (i.e., inputs
with equal maximum pressure) according to the selected arbiter. Hardware
parameters enable the user to select a per “output port” arbiter from the
library, such as: random, round robin, least recently used (LRU), FIFO, or
fixed priority (software programmable).
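The two-phase election can be sketched as follows, with round robin as the second-phase arbiter. This is an illustrative model, not the switch RTL: phase 1 keeps only the requesters with maximal pressure, phase 2 picks the first surviving port after the last granted one, wrapping around.

```python
def arbitrate(requests, last_granted):
    """requests: {input_port: pressure}. Returns (winner, forwarded_pressure)."""
    top = max(requests.values())                  # phase 1: QoS pressure filter
    candidates = sorted(p for p, pr in requests.items() if pr == top)
    # phase 2: round robin among equal-pressure requesters -- the first
    # candidate strictly after the last granted port, wrapping around
    for port in candidates:
        if port > last_granted:
            return port, top
    return candidates[0], top  # the max pressure is also forwarded downstream

winner, pressure = arbitrate({0: 1, 2: 3, 5: 3}, last_granted=2)
# ports 2 and 5 tie at the highest pressure; round robin after port 2 -> port 5
```

Note that the winner's pressure is returned alongside the grant, matching the rule that the maximum pressure of the requesters is forwarded to the output port.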
In general, the detection of packet tail causes the output currently allocated
to that input to be released and become re-electable. Locked transactions are
a notable exception. If packet A enters the switch and is blocked waiting for
its output to become available, and if packet B enters the switch through a
different input port, but aims for the same output port, then when the output
port is released, at equal pressure, the selected arbitration mode must choose
between A and B. The pressure information on an input port can increase
while a packet is blocked waiting, typically because of a higher pressure
packet colliding at the rear of the jam (packet pressure propagates along
multiswitch paths). Thus, a given input can be swapped in or out of candidate
status while it is waiting.
The switch routes incoming packets without altering their contents. Never-
theless, it is sensitive to Lock/Unlock packets: when a Lock packet is received,
the connection between the input and the output as defined in the routing
table is kept until an Unlock packet is encountered. The packets framed by
Lock and Unlock packets, including the Unlock packet itself, are blindly
routed to the output allocated on behalf of the Lock packet. The input con-
troller extracts pertinent data from packet headers, forwards it to the routing
table, fetches back the target output number, and then sends a request to the
arbiter. After arbitration is granted, the input controller transmits the rest of
the packet to the crossbar. The request to the arbiter is sustained as long as
the last word of the packet has not been transferred. Upon transferring the
last cell of the packet, the arbiter is allowed to select a new input.
Lock packets, on the other hand, are treated differently. Once a Lock packet
has won arbitration, the arbitrated output locks on the selected input until the
last word of the pending unlock packet is transmitted. Thus packets between
lock and unlock packets are unconditionally routed to the output requested
by the lock packet.
Depending on the kind of routing table chosen, more than one cycle may
be required to make a decision. A delay pipeline is automatically inserted
in the input controller to keep data and routing information in phase, thus
guaranteeing one-word-per-cycle peak throughput. Routing tables select the
output port that a given packet must take. The route decision is based on the
tuple (destination address, scattering information) extracted from the packet
header and necker. In a request environment, the Destination Address is the
Slave Address and the Scattering Information is the Master Address. In a
response environment, the Destination Address is the Master address and
the Scattering Information is the Tag (Figure 11.2).
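The tuple-keyed lookup can be sketched as below. The table shape and addresses are purely illustrative: in a request network the key is (slave address, master address), in a response network (master address, tag).

```python
# Request network example: destination = slave address,
# scattering information = master address.
request_routes = {
    (0x05, 0x01): 0,   # slave 0x05 requested by master 0x01 -> output port 0
    (0x05, 0x02): 2,   # same slave, another master -> possibly a different port
}

def lookup(dest_addr, scatter_info, routes, default_port):
    """Route on the (destination address, scattering information) tuple."""
    return routes.get((dest_addr, scatter_info), default_port)

assert lookup(0x05, 0x02, request_routes, default_port=3) == 2
assert lookup(0x09, 0x01, request_routes, default_port=3) == 3  # miss -> default
```

Keying on the full tuple is what allows two masters addressing the same slave to take different paths through the network.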
For maximum flexibility, the routing tables actually used in the switch are
parameterizable for each input port of the switch. It is thus possible to use
different routing tables for each switch input. Routing tables can optionally be
programmed via the service network interface; in this case, their configuration
registers appear in the switch register address map.
The input pipe is optional and may be inserted individually for each input
port. It introduces a one-word-deep FIFO between the input controller and
the crossbar and can help timing closure, although at the expense of one
supplementary latency cycle.
The input shifter is optional and is implemented when arbiters are allowed
to run in two cycles (the late arbitration mode is fixed at design time). The
role of the shifter is to delay data by one cycle, according to the requests of
the arbiter. This option is common to all inputs.
The arbiter ensures that the connection matrix (a row per input and a
column per output) contains at most one connection per column, that is, a
given output is not fed by two inputs at the same time. The dual guarantee—
at most one connection per row—is handled by the input controller. Each
output has an arbiter that includes prefiltering. For maximum flexibility, each
port can specify its own arbiter from the list of available arbiters (random,
round robin, LRU, FIFO, or fixed priority). A late arbitration mode is avail-
able to ease timing closure; when activated, one additional cycle is required
to provide the arbitration result.
The crossbar implements datapath connection between inputs and outputs.
It uses the connection matrix produced by the arbiter to determine which
connections must be established. It is equivalent to a set of m muxes (one
per output port), each having n inputs (one per input port). If necessary, the
crossbar can be pipelined to enhance timing. The number of pipeline stages
can be as high as max(n, m).
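The mux view of the crossbar can be made concrete with a short sketch. This is an illustrative model: the connection matrix has one row per input and one column per output, the arbiter's invariant (at most one connection per column) is checked explicitly, and each output behaves as an n-input mux.

```python
def crossbar(inputs, matrix):
    """inputs: n data words; matrix[i][o] == 1 connects input i to output o."""
    n, m = len(matrix), len(matrix[0])
    for o in range(m):  # arbiter guarantee: at most one connection per column
        assert sum(matrix[i][o] for i in range(n)) <= 1
    outputs = [None] * m
    for i in range(n):
        for o in range(m):
            if matrix[i][o]:
                outputs[o] = inputs[i]  # the mux for output o selects input i
    return outputs

out = crossbar(["a", "b", "c"], [[0, 1], [1, 0], [0, 0]])
# input 0 -> output 1, input 1 -> output 0, input 2 unconnected
```

The dual invariant (at most one connection per row) is not checked here because, as the text notes, it is the input controller's responsibility.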
The output controller constructs the output stream. It is also responsible for
compensating crossbar latency. It contains a FIFO with as many words as there
are data pipelined in the crossbar. FIFO flow control is internally managed
with a credit mechanism. Although the FIFO is typically empty, should the output
port become blocked, it contains enough buffering to flush the crossbar. When
necessary for timing reasons, a pipeline stage can be introduced at the output
of the controller.
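The credit mechanism can be sketched as follows, assuming one credit per word pipelined in the crossbar; the class and method names are hypothetical.

```python
class OutputController:
    """Toy model of the output controller's credit-based flow control."""

    def __init__(self, crossbar_depth):
        self.credits = crossbar_depth  # FIFO sized to flush the crossbar
        self.fifo = []

    def can_accept(self):
        return self.credits > 0        # a word may enter the crossbar

    def enter_crossbar(self, word):
        assert self.can_accept()
        self.credits -= 1              # buffer space reserved in advance
        self.fifo.append(word)         # (crossbar latency elided here)

    def drain_one(self):
        self.credits += 1              # downstream consumed a word
        return self.fifo.pop(0)

oc = OutputController(crossbar_depth=2)
oc.enter_crossbar("w0")
oc.enter_crossbar("w1")
# with the output port blocked, the FIFO now holds the crossbar contents,
# and no further word may enter until a credit is returned
```

Reserving buffer space before a word enters the crossbar is what guarantees that a blocked output port can always absorb the words already in flight.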
The switch has a specific interface allowing connection to the service net-
work and a dedicated communication IP used for software configuration and
supervision.
parameter named fwdPipe: when set, this parameter introduces a true pipeline
register on the forward signals, and effectively breaks the forward path. The
parameter inserts the DFFs required to register a full data word as well as
the control signals, and a one-cycle delay is incurred by packets traveling this
path.
The unit comprises all the logic that is necessary to control the global clock-
gater, turning the clock off or on depending on the traffic. Note that the design
can apply the local clock-gating technique, the global clock-gating technique,
or both.
11.3.5.1 NoCexplorer
The NoCexplorer tool allows easy and fast NoC design space exploration
through modeling and simulation of the NoC, at different abstraction levels
(different NoC models can coexist within the same simulation instance). The
NoC models and associated traffic scenarios are first described using a scripting
language based on a subset of Python syntax and semantics. The NoC
models can be very abstract, defined with only a few parameters, or more
detailed, coming very close to the actual RTL model that will be defined within the
NoCcompiler environment. One NoC model (or all of them) is then simulated
for one (or all) traffic scenarios with a built-in simulation engine, producing
performance results for further analysis. Typically, the designer can analyze
bandwidths to and from all initiator and target nodes, the end-to-end latency
statistics, the FIFO fill levels, etc. These results are then interpreted to see if the
NoC and associated architectural choices (NoC topology and configuration)
meet the application requirements.
The NoCexplorer environment allows a very fast modeling and simulation
cycle. The effort of describing the NoC and its traffic depends heavily on the
complexity of the system, but typically requires less than an hour, even for the
specification of a complex system (provided the user is experienced). The
actual simulation of the model takes less than a minute on a standard
desktop computer, even for complicated systems containing dozens of nodes
and complex traffic patterns. This means that the designer can easily
test different traffic scenarios for different NoC topology specifications until
satisfactory results are reached, before moving to the NoC specification
for RTL generation. Note that the simulation cycle can be easily automated
in more complex frameworks for wider benchmarking.
In the NoCexplorer framework, a typical NoC model will include the de-
scription of the following items:
11.3.5.2 NoCcompiler
While the NoCexplorer tool enables fast exploration of the NoC design space
using high-level NoC models, the NoCcompiler tool is used to describe the
NoC at lower abstraction levels allowing automatic RTL generation after com-
plete specification of the NoC. Typical NoC design flow using NoCcompiler
can be divided into the following steps:
FIGURE 11.7
Arteris NoC design flow: NoC and traffic models (scripts) feed NoCexplorer; NoCcompiler
covers NoC topology, configuration, and assembly; verification and validation include a
connectivity test, minimum latency, peak throughput, and random transactions.
other EDA tools, such as CoWare, and used for transaction-level simulation
of the complete SoC platforms. With such simulation frameworks, one can
easily trade off between simulation speed and accuracy. This is a very useful
feature, especially when the design involves larger and more complex MPSoC
platforms running computationally intensive applications.
FIGURE 11.8
(a) Architecture of the MPSoC platform and (b) close-up of the communication assist architecture.
FIGURE 11.9
(a) Architecture of the ADRES CGA core and (b) close-up of the functional unit.
(RFUs), register files (RFs), and routing resources. The ADRES CGA processor
can be seen as if it were composed of the following two parts:
• the top row of the array that acts as a tightly coupled Very Long
Instruction Word (VLIW) processor (marked in light gray) and
• the bottom row (marked in dark gray) that acts as a reconfigurable
array matrix.
The two parts of the same ADRES instance share the same central RF and
load/store units. The computation-intensive kernels, typically data-flow
loops, are mapped onto the reconfigurable array by the compiler using the
modulo scheduling technique to implement software pipelining and to ex-
ploit the highest possible parallelism. The remaining code (control or se-
quential code) is mapped onto the VLIW processor. The data communica-
tion between the VLIW processor and the reconfigurable array is performed
through the shared RF and memory. The array mode is controlled from the
VLIW controller through an infinite loop between two (configuration mem-
ory) address pointers with a data-dependent loop exit signal from within the
array that is handled by the compiler. The ADRES architecture is a flexible
template that can be freely specified by an XML-based architecture specifica-
tion language as an arbitrary combination of those elements.
Figure 11.9(b) shows a detailed datapath of one ADRES FU. In contrast to
FPGAs, the FU in ADRES performs coarse-grained operations on 32 bits of
data, for example, ADD, MUL, Shift. To remove the control flow inside the
loop, the FU supports predicated operations for conditional execution. Good
timing is ensured by buffering the output of each FU in a register. The
results of the FU can be written to a local RF, which is usually small and has
fewer ports than the shared RF, or routed directly to the inputs of other FUs. The
multiplexors are used for routing data from different sources. The configura-
tion RAM acts as a (VLIW) instruction memory to control these components.
It stores a number of configuration contexts, locally, which are loaded on
a cycle-by-cycle basis. Figure 11.9(b) shows only one possible datapath;
heterogeneous FUs with different datapaths are equally possible.
The proposed ADRES architecture has been successfully used for mapping
different video compression kernels that are part of MPEG-4 and AVC/H.264
video encoders and decoders. More information on these implementations
can be found in studies by Veredas et al., Mei et al., and Arbelo et al. [32,34,35].
For this particular MPSoC platform, all ADRES instances were generated
using the same processor template, although the configuration context of each
processor, that is, the reconfigurable array matrix, can be fixed individually at
runtime. Each ADRES instance is composed of 4×4 reconfigurable functional
units and has separate data and instruction L1 cache memories of 32 kB each.
All ADRES cores in the system operate at the same frequency that can be either
150 MHz (the same as the NoC) or 300 MHz, depending on the computational
load. This is fixed by the MPSoC controller node, the ARM core running a
Quality of Experience Manager application.
• BT-Write—to move data blocks from a local memory, over the NoC
and into a destination memory. It is the task of the CA to generate the
proper memory addresses according to the geometrical parameters
of the source data block. The CA will send the data over the network
to some remote CA, as a stream of words using NoC transactions.
This remote CA will process the stream and write the data into
the memory by generating the proper memory addresses according
to the geometrical parameters of the target data block. When the
In the context of the MPSoC platform, any node in the system (typically
ADRES or ARM core) can set up a BT transfer for any other pairs of nodes (one
master and one slave node) in the system. In such a scenario, even a memory
node can act as a master for a BT transfer; the memory can then perform
a BT-Write or BT-Read operation to/from some other distant node. This is
possible because memories, like processors, access the NoC through the CAs
(provided that another node has programmed the CA). Also, each CA is
designed to support a certain number of concurrent BTs. This means that any
node can issue one or more BTs for any other pair of nodes in the system
(CAs implement communication through virtual channels). The number of
concurrent BTs is fixed at design time, and in the case of this MPSoC
platform we limit this number to four for processors and external memory
node (EMIF) and to 32 for L2 memory nodes, which is a design choice made to
balance performance versus area.∗ Finally, different CAs are part of the NoC
clock domain and they operate at 150 MHz.
∗ Implementing the virtual channel concept in the CA comes at an implementation cost in area,
which can be significant, especially in the context of an MPSoC system, where multiple instances of the
CAs are expected. The choice of four concurrent BTs per processor is derived from the fact that
for the majority of the computationally intensive kernels we foresee at most four concurrent BTs,
that is, at most four consecutive prefetching operations per processor and per loop. For memory
nodes, we want to maximize the number of concurrent transfers, taking into account the total
number of concurrent BTs in the system (depending on the number of nodes: six in this case).
Because the connection between the CA and the NoC is 32 bits wide
running at 150 MHz, the maximal throughput that can be achieved with one
data memory node is 1.2 GB per second when both read and write operations
are performed simultaneously to different memory banks (2 × 600 MB per
second, 2.4 GB per second for the whole memory cluster). Because the in-
struction memory nodes are single banked, only 600 MB per second per node
can be achieved (we assume that the instruction memories are read most of
the time, because the system is reconfigured only occasionally).
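These figures follow directly from the link parameters given in the text; a quick arithmetic check:

```python
# Throughput of the CA-to-NoC connection: 32-bit link at 150 MHz.
link_width_bytes = 32 // 8                  # 4 bytes per cycle
freq_hz = 150e6                             # NoC clock frequency
per_direction = link_width_bytes * freq_hz  # 600 MB/s per link direction
dual_bank = 2 * per_direction               # simultaneous read + write: 1.2 GB/s
cluster = 2 * dual_bank                     # two data memory nodes: 2.4 GB/s
assert (per_direction, dual_bank, cluster) == (600e6, 1.2e9, 2.4e9)
```

The single-banked instruction memories only reach the 600 MB/s per-direction figure, since read and write cannot proceed to separate banks in parallel.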
11.4.4 NoC
As shown in Figure 11.8, every node in the MPSoC is connected to the NoC
through a CA using a certain number of NIUs, depending on the node type.
ADRES CGA processors will require three NIUs: two NIUs are used for the
data port (one initiator and one target NIU) and one initiator NIU is used for
the instruction port. The ARM subsystem also uses three NIUs, two being
connected to the corresponding CA, while one NIU is connected directly to
the NoC for debugging purposes. Both data and instruction memories are
connected to the NoC through a pair of NIUs: one initiator and one target
NIU. The complete NoC instance, as shown in Figure 11.10, has a total of
20 initiator and 13 target NIUs. Note that all NIUs use the AHB protocol∗
and have the same configuration. The initiator NIUs are single width
(SGL) and can buffer up to four transactions. All transactions have a fixed
length (4, 8, or 16 beat transactions, imposed by AMBA-AHB protocol) and
introduce only one pipeline stage. Target NIUs are also single width (SGL)
and introduce one pipeline stage in the datapath. Finally, the NoC packet
configuration defines the master address (6 bits), slave address (5 bits), and
∗ The choice of the IP protocol for this MPSoC platform can be debated. AHB does not support
split and retry transactions, has a fixed burst length, and is therefore not well suited to
high-performance applications. We have chosen AHB only because all of our already developed
IPs (namely the ADRES processor and the CA) use AHB interfaces.
FIGURE 11.10
Topology of the NoC: Instruction and data networks with separated request and response
network paths.
slave offset (27 bits) with a total protocol overhead of 72 bits for both header
and necker cells of the request packet. The response packet overhead is 33 bits.
The adopted NoC topology shown in Figure 11.10 has been chosen to satisfy
different design objectives. First, we want to minimize the latency upon in-
struction cache miss, because this will greatly influence the final performance
of the whole system. Second, we want to maximize connectivity and band-
width between different nodes, because all video encoding applications are
bandwidth demanding, especially when considering high-resolution, high-
frame-rate video streams. Finally, we want to minimize the transfer latency,
because of the performance and scalability requirements. For these reasons,
the data and instruction networks are completely separated (in the following
we will refer to these as Instruction and Data NoC) and each of these two
networks is decomposed into separate request and response networks with
the same topology. The data network topology consists of a fully connected
mesh using 2 × 2 switches (routers) for both request (white switches) and
response networks (gray switches). It allows connections between any pair of
nodes in the system with the minimum and maximum traveling distances
of one and two hops, respectively.
The instruction network topology uses only one switch and enables ADRES
instruction ports to access L2_I$1 and L2_I$2 memory nodes in only one hop,
as shown in Figure 11.10. The only switch in this network is connected to the
data NoC so that the instruction memories can be reached from any other
node in the system. Typically, the application code is stored in the L3 memory
and will be transferred to both L2_I$s via EMIF. Note that the latency of such
transfers will require three hops, but this is not critical because we assume
that such transfers will occur only during the MPSoC configuration phase
and will not interfere with normal encoding (or decoding) operations.
In all, the different networks (data, instruction, request, and response) contain 10
switches (Figure 11.10) with varying numbers of input/output ports. All
switches in the NoC have the same configuration and introduce one pipeline
stage, whereas arbitration is based on the round-robin (RR) scheme repre-
senting a good compromise between implementation costs and arbitration
fairness. The routing is fixed, that is, there is one possible route for each
initiator/target pair, fixed at design time to minimize the gate count of the
router.
All links in the NoC have the same size, single cell width (SGL): links in
the request path contain 36 wires and those in the response path 32 wires,
plus 4 NTTP control wires each. Because the NoC operating fre-
quency has been set to 150 MHz, the maximal raw throughput (data plus
NoC protocol overhead) is 600 MB per second per NoC link.
In this NoC instance, we also implemented a service bus, which is a dedi-
cated communication infrastructure allowing runtime configuration and
error recovery of the NoC. Because this MPSoC instance does not require
any configuration parameters (all parameters are fixed at design time), the
service bus is used only for application debugging. Any erroneous trans-
action within the NoC will be logged in the target NIUs logging registers.
These registers can then be accessed at any time via service bus and from the
control node. Appropriate actions can be taken for identification of the erro-
neous transaction. The service bus adopts a token ring topology and uses
only eight data and four control wires, minimizing the implementation cost
of this feature. The access point from and to the NTTP protocol is provided by
the NTTP-to-Host IP (accessed from the ARM core in the case of the MPSoC
platform). To simplify the routing, the token ring follows the logical path
imposed by the floorplan: ARM, ADRES4, ADRES5, ADRES6, L2_D2, EMIF,
L2_D1, ADRES1, . . ., ARM (to keep Figure 11.10 simple, the service bus
is not represented).
FIGURE 11.11
Layout of the MPSoC platform.
The results presented are relative to the TSMC 90 nm GHP technology library,
worst case corner with Vdd = 1.08 V at 125°C. The implemented circuit has
been validated using VStation solution from Mentor Graphics. Figure 11.11
shows the layout of the MPSoC chip.
The complete circuit uses 17,295 kgates (45.08 mm2), resulting in an 8 × 8
mm square die. Figure 11.12 provides a detailed area breakdown per plat-
form node (surface and gate count) and the relative contribution of each node
with respect to the total MPSoC area. Note that the actual density of the
circuit is 70 percent, which is reasonable for the tool used. Maximum operat-
ing frequencies after synthesis are 364 MHz, 182 MHz, and 91 MHz for the
ADRES cores, the NoC, and the ARM subsystem, respectively. The NoC has been gen-
erated using the design flow described in Section 11.3.5 using NoCexplorer
and NoCcompiler tools version 1.4.14. The typical size of basic Arteris NoC
components for this particular instance as reported by the NoCcompiler es-
timator tool is given in Figure 11.13. This figure also gives the total NoC gate
count, based on the number of different instances and without taking into
account wires and placement/route overheads (difference of 450 kgates from
actual placement and route results in Figure 11.12). The power dissipation
breakdown of the complete platform and the relative contribution on a per-
instance basis are given in Figure 11.14.
FIGURE 11.12
Area breakdown of the complete MPSoC platform.
FIGURE 11.13
Area breakdown of the NoC.

Unit      Size [kgates]   Instances   Total [kgates]
NIU_I     4.6             20          92
NIU_T     6.2             13          80.6
Req. D    9               4           36
Req. I    3               2           6
Resp. D   9               4           36
Resp. I   3               2           6
Total                                 256.6
FIGURE 11.14
Power dissipation breakdown of the complete MPSoC and per component comparison.

Component   Power [mW]   Inst.   Total [mW]   Relative [%]
ADRES       91.1         6       546.6        84
L2D         20           2       40           6
L2I         15           2       30           5
ARM         10.5         1       10.5         2
NoC         25           1       25           4
Total                            652.1        100
FIGURE 11.15
Functional block diagram of the MPEG-4 SP encoder with bandwidth requirements expressed
in bytes/macroblock.
that have to be accessed for the computation of each new macroblock (MBL)∗
expressed in bytes per macroblock units (B/MBL). The following three
columns show throughput requirements (expressed in MB/s) for CIF, 4CIF,
and HDTV image resolutions at 30 frames per second corresponding to 11880,
47520, and 108000 computed MBLs per second. For this particular implemen-
tation, the total power budget of the circuit built in 180 nm, 1.62 V technology
node for the processing of 4CIF images at 30 frames per second rate is 71 mW,
from which 37 mW is spent on communication, including on-chip memory
accesses.
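The macroblock rates quoted above can be reproduced from the frame geometries (16 × 16-pixel MBLs at 30 frames per second; HDTV is taken here as 1280 × 720, an assumption consistent with the quoted 108000 MBL/s):

```python
# Macroblock-per-second rates for the three resolutions named in the text.
def mbl_per_second(width, height, fps=30):
    """Macroblocks per second for a width x height frame of 16x16 MBLs."""
    return (width // 16) * (height // 16) * fps

assert mbl_per_second(352, 288) == 11880     # CIF
assert mbl_per_second(704, 576) == 47520     # 4CIF
assert mbl_per_second(1280, 720) == 108000   # HDTV (720p assumed)
```

Multiplying these rates by the per-MBL byte counts in the figure gives the MB/s columns of the throughput tables.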
The application mapping used in this implementation scenario can be easily
adapted (although it may be not optimal) to the MPSoC platform with the
following functional pipeline:
a. Data split scenario. The input video stream is divided into six equal
substreams of data. Each substream is processed by a dedicated
ADRES subsystem.
∗ MBL is a data structure commonly used in block-based video encoding algorithms, such as
MPEG-4, AVC/H.264, or SVC. It is composed of 16 × 16 pixels, requiring 384 bytes when an MBL
is encoded using the 4:2:0 YCb Cr scheme.
TABLE 11.1
MPEG-4 SP (a) and AVC/H.264 Data Split Scenario (b) Encoder
Throughput Requirements When Mapped on an MPSoC Platform
CIF 4CIF HDTV
Source Target B [MB/s] [MB/s] [MB/s]
FIGURE 11.16
Functional block diagram of the AVC SP encoder with bandwidth requirements expressed in
bytes/macroblock.
TABLE 11.2
AVC/H.264 Encoder Throughput Requirements for
Functional (a) and Hybrid (b) Split Mapping Scenario
CIF 4CIF HDTV
Source Target B [MB/s] [MB/s] [MB/s]
caused by the encoder code size and the size of the L1 instruction memory.
The functional split solves this problem but at the expense of much heavier
traffic in the data NoC (which is more than doubled) and uneven computa-
tional load among different processors. Finally, the hybrid mapping scenario
offers a good compromise between pure data and pure functional split in
terms of total throughput requirements and even distribution of the compu-
tational load.
As with the MPEG-4 encoder, the application mapping scenarios are not
claimed to be optimal. It is obvious that for lower frame resolutions, for example,
it is not necessary to use all six ADRES cores. The real-time processing con-
straint could certainly be satisfied with fewer cores, with nonactive ones being
shut down, thus lowering the power dissipation of the whole system.
where Pidle is the power dissipation of the NIU when it is in an idle state, that
is, there is no traffic. The idle power component is mainly due to the static
power dissipation and the clock activity, and depends on NIU configuration
and NoC frequency. For a given configuration and frequency, the idle power
dissipation component of the NIU is constant, so
Pidle = c1 (11.2)
TABLE 11.3
Constant c2 (Dynamic Power Dissipation
Component) of Initiator and Target AHB
NIU for Different Payload Size
4 Bytes 16 Bytes 64 Bytes
Pdyn = c2 · A (11.3)
11.5.2.2 Switches
The power dissipation of a switch can be modeled in the same way as the NIU,
using Equations 11.1, 11.2, and 11.3. The activity A of a switch is expressed
as a portion of the aggregate bandwidth of the switch that is actually being
used. The aggregate bandwidth of a switch is computed as min(ni, no) · lbw,
where ni and no are the number of input and output ports of the switch and lbw
is the aggregate bandwidth of one link. Experiments have been carried
out to determine the values of the constants c1 and c2 for different arbitration
strategies and switch sizes (number of input/output ports). The influence of
the payload size on the power dissipation of a switch is small and will not be
taken into account in the following.
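The power model of Equations 11.1 through 11.3 and the switch-activity definition can be combined in a short sketch. The constants below are illustrative placeholders, not values from Tables 11.3 or 11.4.

```python
def component_power_mw(c1, c2, activity):
    """P = Pidle + Pdyn = c1 + c2 * A  (Equations 11.1-11.3)."""
    return c1 + c2 * activity

def switch_activity(used_bw, n_in, n_out, link_bw):
    """Activity A: used fraction of the aggregate bandwidth min(ni, no) * lbw."""
    return used_bw / (min(n_in, n_out) * link_bw)

# Example: a 6x6 switch with 600 MB/s links carrying 1.8 GB/s of traffic.
a = switch_activity(used_bw=1.8e9, n_in=6, n_out=6, link_bw=600e6)
# 1.8 GB/s out of a 3.6 GB/s aggregate -> activity 0.5
p = component_power_mw(c1=2.0, c2=4.0, activity=a)  # placeholder constants
```

An idle switch (A = 0) dissipates exactly c1, so the idle term dominates for response-network switches, consistent with the simplification adopted later in the text.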
Because we are targeting low-power applications, we chose a round-
robin arbitration strategy for all NoC switches, as it represents a good
TABLE 11.4
Constants c 1 and c 2 Used for Computation of the
Static and Dynamic Power Dissipation Components
of the Switch for Various Numbers of Input and
Output Ports
SW6×6 SW7×8 SW2×7 SW7×2
∗ In a FIFO arbitration scheme the order of requests is taken into account, with the highest priority
given to the least recently serviced requests.
Power dissipation of the wires in the request and the response networks is computed separately, because all transactions in the NoC are assumed to be writes (CA-to-CA protocol). This implies that in the request network both data and control wires will toggle, while in the response network only control wires will toggle. While data wires in the request network toggle at a rate that depends on the activity of that link, control wires toggle only occasionally by comparison. For the sake of simplicity, in the following we assume that the control wires in the request network toggle at the same frequency as the data wires. This hypothesis is quite pessimistic, but it can be used safely because there are only a few control wires in a link and their influence on the overall power dissipation of the NoC is quite small. As explained above, for the power dissipation of the response network we count only control wires, because there are no read operations. The activity of the control wires is fixed using the assumption that all packets carry 64 bytes of payload (16-beat AHB burst).
Because of the low activity of the switches in the response network (no data, control only) compared with those in the request network, the power dissipation of these switches is modeled with an idle power dissipation component only (the power dissipation of the response switches is thus constant across different application mapping scenarios).
Based on the circuit layout (Figure 11.11) we can easily derive the total
length of every link in the NoC for different mapping scenarios. The total
length of 132, 102, 38, and 57 mm has been found for the MPEG-4 application
and for three different scenarios for the AVC/H.264 encoder, respectively.
Note that we assume the same length of the request and response networks
for the same initiator–target pair. The power model of one wire segment pre-
sented earlier and the total length of the links can be combined to determine
the power dissipation of the wires in the NoC. As we mentioned earlier, NoC links do not transport a clock signal, so the power dissipation due to the insertion of the clock tree must be taken into account separately. Based on the layout, the total length of the clock tree has been estimated at 24 mm. For this length and a frequency of 150 MHz, the power dissipation has been evaluated at 1 mW. This value is systematically added to the total power dissipation of the wires in the NoC.
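The per-segment wire model itself is presented earlier in the chapter and is not reproduced here; the sketch below combines a hypothetical per-millimetre wire power coefficient with the link lengths from the layout and the fixed 1 mW clock-tree contribution. The mapping of the three AVC/H.264 lengths to scenario names is assumed from the order in the text:

```python
def noc_wire_power(link_length_mm, power_per_mm, clock_tree_mw=1.0):
    """Total wire power: per-length wire dissipation over all links, plus the
    separately estimated clock-tree power (1 mW for 24 mm at 150 MHz)."""
    return link_length_mm * power_per_mm + clock_tree_mw

# Total link lengths (mm) derived from the layout (Figure 11.11)
lengths_mm = {"MPEG-4": 132, "AVC data split": 102,
              "AVC functional split": 38, "AVC hybrid split": 57}

# power_per_mm = 0.02 mW/mm is a hypothetical coefficient, not the model's value
for scenario, length in lengths_mm.items():
    print(scenario, round(noc_wire_power(length, power_per_mm=0.02), 2))
```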
The power model described above has been used to calculate the power
dissipation of the Arteris NoC running at 150 MHz, for different mapping
scenarios of the MPEG-4 and AVC/H.264 SP encoder and for typical frame
resolutions (CIF, 4CIF and HDTV). Table 11.5 indicates leakage, static, and
total idle power dissipation of different IPs in the NoC. Finally, if we take
into account the NoC topology, we can easily derive the total idle power
dissipation of this NoC instance (10.7 mW).
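Summing the idle components over the NoC topology can be sketched as below; the per-unit idle powers and the switch count are invented placeholders (only the 30-NIU count and the 10.7 mW total come from the text, and the placeholders were chosen to reproduce that total):

```python
# Hypothetical per-unit idle power (mW); the real values are those of Table 11.5
idle_power_mw = {"niu": 0.25, "switch": 0.4}

# 30 NIUs per the text; the number of switches here is an assumption
instances = {"niu": 30, "switch": 8}

# Total idle power of the NoC instance: sum over all units in the topology
total_idle = sum(idle_power_mw[unit] * count for unit, count in instances.items())
print(round(total_idle, 1))  # 10.7 (mW) with these placeholder values
```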
It is, however, worth mentioning that the new local (isolation of one NIU
or router) and global (isolation of one cluster, the cluster being composed of
multiple NIUs and switches) clock-gating methods implemented in the latest
version of the Danube IP library (v.1.8.) enable significant reduction of the idle
power dissipation component. Each unit (NIU, switch) is capable of moni-
toring its inputs and cutting the clock when there are no packets and when
the processing of all packets currently in the pipeline is completed. When a
new packet arrives at the input, the units can restart their operation in one
clock cycle at most. Our preliminary observations show that the application of these local and global clock-gating methods can reduce the total idle power dissipation of the NoC to only 2 mW.

TABLE 11.5
Leakage, Static, and Total Idle Power Dissipation in mW for Different IPs of the NoC Instance
Leakage Static Total NoC

TABLE 11.6
Power Dissipation of the NoC for MPEG-4 and AVC/H.264 Simple Profile Encoders for Different Frame Resolutions (30 fps)
MPEG-4 | AVC/H.264 Data Split | AVC/H.264 Functional Split | AVC/H.264 Hybrid Split
Total power dissipation is presented in Table 11.6 and Figure 11.17. We also
show the relative contribution of different NoC IPs (NIUs, wires and switches)
to the total NoC power budget. The dynamic power component due to the instruction traffic is 4.3, 1.4, and 2.2 mW, respectively, depending on the mapping scenario.
The power dissipation of the NoC presented in this work can be compared
with the power dissipation of other interconnects for multimedia applications
already presented in the literature. Table 11.7 summarizes this comparison
[Figure 11.17 plot: power dissipation broken down into NIU_Init, SW_Rqst, Wires, SW_Resp, and NIU_Target components for CIF, 4CIF, and HDTV resolutions under each mapping scenario.]
FIGURE 11.17
Power dissipation of the NoC for MPEG-4 and AVC/H.264 SP encoder: total power dissipation
and breakdown per NoC component.
TABLE 11.7
Comparison of the Communication Infrastructure Power Dissipation for Different Multimedia Applications
Design | NoC Topology (Nodes, Routers) | BW [MB/s] | Process [nm, V] | Frequency [MHz] | Power [mW] | Scaled Power [mW]
with (1), (2), and (3) being the implementation presented here. The results are those for 4CIF resolution, chosen for the closest bandwidth requirements. Note that for easier comparison we scaled down the power dissipation figures of the designs made in other technologies to the 90-nm, 1.08-V technology node
used in our implementation (last column), using the expression suggested by Denolf et al. [38], where Vdd is the power supply voltage and λ the feature size:

P1 = P2 · ((Vdd2/Vdd1)^1.7 · (λ2/λ1)^1.5)^−1   (11.7)

The total power dissipation of the NoC varies from 14 to 22 mW, of which 10.7 mW is due to the idle power
dissipation (no traffic), and could be further reduced with more aggressive
clock-gating techniques. Note that an important part of the total power dissipation (from 60 to 70 percent) is due to the 30 NIUs (for only 13 nodes) and the embedded smart DMA circuits (Communication Assist, or CA, engines). It is also interesting to underline that an increase in the throughput requirements leads to a relatively small increase in the dissipated power. If we consider the functional split, which is a worst case from the required-bandwidth point of view, moving from CIF to HDTV resolution increases the data throughput almost fourfold (from 241 to 987 MB/s) but results in only a 35 percent increase in the total power dissipation of the NoC.
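The technology scaling of Equation 11.7 can be sketched as follows; the exponents follow the printed formula, and the 130 nm / 1.2 V input figures are a hypothetical example, not one of the designs in Table 11.7:

```python
def scale_power(p2_mw, vdd2, lam2_nm, vdd1=1.08, lam1_nm=90.0):
    """Scale a power figure from technology 2 (vdd2, lam2_nm) to technology 1,
    here the 90 nm, 1.08 V node of this implementation (Equation 11.7)."""
    return p2_mw * ((vdd2 / vdd1) ** 1.7 * (lam2_nm / lam1_nm) ** 1.5) ** -1

# Hypothetical: 100 mW measured in a 130 nm, 1.2 V technology
print(round(scale_power(100, vdd2=1.2, lam2_nm=130), 1))  # ~48.2 mW at 90 nm, 1.08 V
```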
The implementation cost of the NoC in terms of the silicon area is also more
than acceptable, because it represents less than three percent of the total area
budget (less than 450 kgates). When compared to other IPs in the system, on
a one-to-one basis, the NoC represents eight percent of one ADRES VLIW/CGA processor and twenty percent of one 256 kB memory, and is forty percent bigger than the ARM9 core. This is acceptable even for medium-sized MPSoC platforms targeting lower performance. As for the power dissipation,
note that in this particular design and due to the presence of the CAs allowing
block transfer type of communication, a considerable amount of the area is
taken by the NIU units.
Finally, the complete design cycle (including the learning period for the
NoC tools), NoC instance definition, specification with high- and low-level
NoC models, RTL generation, and final synthesis took only two man months.
This, combined with the achieved performance in terms of available bandwidth, power, and area budgets, clearly demonstrates the advantages of the NoC as a communication infrastructure in the design of high-performance, low-power MPSoC platforms.
References
[1] M. Millberg, E. Nilsson, R. Thid, S. Kumar, and A. Jantsch, “The Nostrum
backbone—A communication protocol stack for Networks on Chip.” In Proc.
of the VLSI Design Conference, Mumbai, India, Jan. 2004. [Online]. Available: http://www.imit.kth.se/∼axel/papers/2004/VLSI-Millberg.pdf.
[2] A. Jantsch and H. Tenhunen, eds., Networks on Chip. Hingham, MA: Kluwer
Academic Publishers, 2003.
[3] N. E. Guindi and P. Elsener, “Network on Chip: PANACEA—A Nostrum in-
tegration,” Swiss Federal Institute of Technology Zurich, Technical Report,
Feb. 2005. [Online]. Available: http://www.imit.kth.se/∼axel/papers/2005/
PANACEA-ETH.pdf.
[4] A. Jalabert, S. Murali, L. Benini, and G. D. Micheli, “xpipesCompiler: A tool for
instantiating application specific Networks on Chip.” In Design, Automation and
Test in Europe (DATE), Paris, France, February 2004.
[19] J. Bainbridge and S. Furber, “CHAIN: A delay insensitive CHip Area INterconnect,” IEEE Micro, Special Issue on Design and Test of System on Chip, 22(5) (September 2002): 16–23.
[20] J. Bainbridge, L. A. Plana, and S. B. Furber, “The design and test of a Smartcard
chip using a CHAIN self-timed Network-on-Chip.” In Proc. of the Design, Au-
tomation and Test in Europe Conference and Exhibition, Paris, France, 3 (February
2004): 274.
[21] J. Bainbridge, T. Felicijan, and S. Furber, “An asynchronous low latency arbiter
for Quality of Service (QoS) applications.” In Proc. of the 15th International Con-
ference on Microelectronics (ICM’03), Cairo, Egypt, Dec. 2003, 123–126.
[22] T. Felicijan and S. Furber, “An asynchronous on-chip network router with
Quality-of-Service (QoS) support.” In Proc. of IEEE International SOC Conference,
Santa Clara, CA, September 2004, 274–277.
[23] Arteris, “A comparison of network-on-chip and busses,” White paper, 2005.
[24] J. Dielissen, A. Rădulescu, K. Goossens, and E. Rijpkema, “Concepts and im-
plementation of the Philips Network-on-Chip,” IP-Based SOC Design, Grenoble,
France, November 2003.
[25] E. Rijpkema, K. G. W. Goossens, A. Radulescu, J. Dielissen, J. L. van Meerbergen,
P. Wielage, and E. Waterlander, “Trade-offs in the design of a router with both
guaranteed and best-effort services for Networks on Chip.” In DATE ’03: Proc.
of the Conference on Design, Automation and Test in Europe. Washington, DC: IEEE
Computer Society, 2003, 10350.
[26] P. Schumacher, K. Denolf, A. Chilira-Rus, R. Turney, N. Fedele, K. Vissers, and J. Bormans, “A scalable, multi-stream MPEG-4 video decoder for conferencing and surveillance applications.” In Proc. of the IEEE International Conference on Image Processing (ICIP 2005), Genova, Italy, September 11–14, 2005, 2: II-886–9.
[27] Y. Watanabe, T. Yoshitake, K. Morioka, T. Hagiya, H. Kobayashi, H.-J. Jang, H. Nakayama, Y. Otobe, and A. Higashi, “Low power MPEG-4 ASP codec IP macro for high quality mobile video applications.” In Digest of Technical Papers, International Conference on Consumer Electronics (ICCE 2005), Las Vegas, NV, January 8–12, 2005, 337–338.
[28] T. Fujiyoshi, S. Shiratake, S. Nomura, T. Nishikawa, Y. Kitasho, H. Arakida,
Y. Okuda, et al. “A 63-mW H.264/MPEG-4 audio/visual codec LSI with module-
wise dynamic voltage/frequency scaling,” IEEE Journal of Solid-State Circuits,
41(1) (January 2006): 54–62.
[29] C.-C. Cheng, C.-W. Ku, and T.-S. Chang, “A 1280×720 pixels 30 frames/s H.264/MPEG-4 AVC intra encoder.” In Proc. of the 2006 IEEE International Symposium on Circuits and Systems (ISCAS 2006), Kos, Greece, May 21–24, 2006, 4 pp.
[30] C. Mochizuki, T. Shibayama, M. Hase, F. Izuhara, K. Akie, M. Nobori, R. Imaoka, H. Ueda, K. Ishikawa, and H. Watanabe, “A low power and high picture quality H.264/MPEG-4 video codec IP for HD mobile applications.” In Proc. of the 2007 IEEE Asian Solid-State Circuits Conference (A-SSCC ’07), Jeju City, South Korea, November 12–14, 2007, 176–179.
[31] B. Mei, “A coarse-grained reconfigurable architecture template and its compi-
lation Techniques,” Ph.D. dissertation, IMEC, January 2005.
[32] F.-J. Veredas, M. Scheppler, W. Moffat, and M. Bingfeng, “Custom implemen-
tation of the coarse-grained reconfigurable ADRES architecture for multime-
dia purposes.” In Field Programmable Logic and Applications, 2005. International
Conference, Tampere, Finland, August 24–26, 2005, 106–111.