Figure. 16: Second phase of the Pareto-based NMAP approach.
Mappings PBNMAP(Cores C)
{
    Mappings M;

    M = PBNMAP_1st(C);     // first phase
    M = PBNMAP_2nd(M, C);  // second phase (Figure 16)
    return M;
}
Figure. 17: Pareto-based NMAP approach.
VI. Experiments
A. MPEG-4 Codec
In order to evaluate the various approaches in real traffic scenarios,
an MPEG-4 simple profile @ level 2 codec was used as a case study [34].
A general block diagram of the encoder and decoder is shown in Figure 18.
For the hardware/software partitioning, reference was made to the MoVa
architecture described in [22]. It adopts a macroblock-based pipeline
with 4 stages for the encoder and 3 for the decoder. More specifically,
the encoding section performs coarse motion estimation in the first
stage, fine motion estimation and motion compensation in the second
stage, discrete cosine transform and quantization in the third stage,
and finally reconstruction and production of the stream in the fourth
stage. In the decoding section, the first stage involves variable length
decoding of each data stream; the second stage performs, in sequence,
inverse cosine transformation, inverse quantization and motion
compensation; the third and final stage is reconstruction.

Core      Description
MEC       Motion estimation coarse
MEF       Motion estimation fine
MC        Motion compensation
VLC       Variable length coding
VLD       Variable length decoding
REC       Reconstruction
SP        Stream producer
DB        Deblocking
DCTQ      Discrete cosine transform & quantization
IQIDCT    Inverse discrete cosine transform & inverse quantization
RISC      32-bit RISC microprocessor
VIM       Video input module
VOM       Video output module
ISC       Input stream controller
MEME      Encoder memory
MEMD      Decoder memory
Table 1: Cores implementing the codec.

To obtain the traffic traces, the C application implementing the codec
[24] was modified by adding monitor code that records the volume of
incoming and outgoing traffic in the various functional blocks into
which the application is partitioned. Table 1 shows the 16 cores
implementing the codec. They were characterized in terms of timing by
using the clock cycle data in [22] for the execution of each operation
(DCT, MC, etc.). For power characterization, we used the mean values
given in the datasheets [27, 31]. For the interconnection system we used
an approach similar to the one presented in [17]. To characterize the
switches, a 5x5 switch was implemented in VHDL following the
architecture described in [6]. It was synthesized with Synopsys Design
Compiler using the Virtual Silicon 0.13 µm, 1.2 V technology library and
analyzed with Synopsys Design Power, applying different random data
streams to the switch inputs. The energy consumed by a flit for a hop
switch was estimated as being 0.181 nJ. We assumed the tile size to be
2 mm × 2 mm and that the tiles were arranged in a regular fashion on the
floorplan. The load wire capacitance was set to 0.50 fF per micron, so
considering an average of 25% switching activity the energy consumed by
a flit for a hop interconnect is 0.384 nJ.
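The two per-hop figures above can be combined into a rough per-flit energy estimate. The following is a minimal sketch, not the simulator used in the paper: it assumes minimal (Manhattan-distance) routing on the mesh, so that a flit travelling h hops crosses h + 1 switches and h links. The function names and the (row, column) tile encoding are illustrative choices.

```python
# Per-hop energy figures quoted in the text (nJ per flit).
E_SWITCH_NJ = 0.181
E_LINK_NJ = 0.384


def hops(src, dst):
    """Manhattan distance between two (row, col) tiles of a mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def flit_energy_nj(src, dst):
    """Energy of one flit from src to dst.

    Assumption (not from the paper): minimal routing, so the flit
    crosses hops + 1 switches and hops links.
    """
    h = hops(src, dst)
    return (h + 1) * E_SWITCH_NJ + h * E_LINK_NJ


if __name__ == "__main__":
    # A flit sent from tile (0, 0) to tile (1, 2) travels 3 hops.
    print(round(flit_energy_nj((0, 0), (1, 2)), 3))
```

Under these assumptions, the energy of a mapping grows with the hop distance between communicating cores, weighted by the traffic they exchange, which is why placement matters.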
Figure. 19: Application characterization graph of the MPEG-4 codec.
Figure 19 shows the application characterization graph of the MPEG-4
codec. Each vertex of the graph represents a core. An edge that connects
a core i to a core j defines a communication flow from core i to core j.
Each edge is characterized by a set of attributes such as the traffic
volume ($T_{i,j}$) and the minimum bandwidth requirement for the
communication ($B_{i,j}$). The latter is used as an exploration
constraint. More precisely, a mapping is rejected if it violates at
least one of these constraints. The constraints are set by profiling the
application and annotating the traffic volume exchanged between the
various application components. For example, to decode N frames at X fps
we have $B_{i,j} = T_{i,j} \cdot X / N$.
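The bandwidth constraint can be illustrated with a short sketch. It assumes that the traffic volume of a flow is the total recorded while processing N frames, so that sustaining X frames per second requires the volume multiplied by X / N per second. The flow names, the numbers and the `offered` dictionary are hypothetical and serve only to show how a mapping would be rejected.

```python
def min_bandwidth(traffic_volume, n_frames, fps):
    """Minimum bandwidth needed to move `traffic_volume` (recorded over
    `n_frames` frames) when frames are processed at `fps` frames/s."""
    return traffic_volume * fps / n_frames


def violated_flows(required, offered):
    """Return the flows whose offered bandwidth is below the requirement.

    `required` and `offered` map a (source, destination) pair to a
    bandwidth value; a flow missing from `offered` counts as zero.
    """
    return [flow for flow, need in required.items()
            if offered.get(flow, 0.0) < need]


if __name__ == "__main__":
    # Hypothetical volumes (in arbitrary units) recorded over 30 frames.
    volumes = {("VLD", "IQIDCT"): 3000, ("MC", "REC"): 1600}
    required = {f: min_bandwidth(v, 30, 15) for f, v in volumes.items()}
    offered = {("VLD", "IQIDCT"): 2000.0, ("MC", "REC"): 500.0}
    bad = violated_flows(required, offered)
    print("rejected" if bad else "accepted", bad)
```

A mapping is kept only when `violated_flows` returns an empty list, which mirrors the rule that a single unmet requirement is enough to reject it.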
The following values were used for the free parameters of the
exploration algorithm. For GAMAP we chose a population of 50 mappings, a
crossover probability of 0.7 and a mutation probability of 0.1. The R
parameter of the crossover operator was set to 2. These values were
chosen after numerous simulations and were the values that on average
led to better solutions or shorter convergence times. The number of
generations was set at runtime by means of a stop criterion based on
analysis of the convergence of the Pareto front [4]. For PBBB, the
parameter T was set to 100. Figure 20 gives the power values and traffic
clearing times for 10,000 random mappings. It also shows the Pareto
fronts obtained by GAMAP, PBNMAP, and PBBB, and the solutions found by
BB [17] and NMAP [28]. As can be seen, the solutions obtained by GAMAP
dominate those obtained by the other approaches. The figure also shows
the good trade-off between delay and power (a factor of 3 for delay and
2.5 for power).
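Since the comparison rests on Pareto dominance, a small sketch of the concept may help. It uses the standard definition for minimization problems: one point dominates another if it is no worse in every objective and strictly better in at least one. The sample (delay, power) values are invented for illustration and are not taken from Figure 20.

```python
def dominates(a, b):
    """True if point a dominates point b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return no_worse and better


def extract_pareto(points):
    """Return the non-dominated points, preserving input order."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical (delay, power) pairs for four mappings.
    mappings = [(0.3, 5.0), (0.5, 3.0), (0.6, 4.0), (0.9, 2.0)]
    print(extract_pareto(mappings))
```

Here (0.6, 4.0) is discarded because (0.5, 3.0) is better in both objectives, while the remaining three points are mutually non-dominated and form the front.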
Figure. 20: Evaluation of 10,000 random mappings and Pareto fronts
obtained by GAMAP, PBNMAP, and PBBB for a 4x4 NoC onto which the MPEG-4
codec is mapped. [Axis label: Time (s).]
Figure 21(a) gives the number of simulations (i.e., mappings evaluated
by GAMAP) for varying numbers of generations. It gives both the number
of simulations actually performed and the number that would have been
performed if no caching mechanism had been used. Figure 21(b) gives the
normalized delay and energy values for varying numbers of generations.
As can be seen, in both cases no mappings that yield appreciable
improvements in delay and energy consumption are found after the 20th
generation. At the 20th generation GAMAP had performed only 840
simulations as compared with 2,670 by PBNMAP and 7,238 by PBBB, thus
providing an exploration time speed-up of 3.2 and 8.6 respectively.
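The difference between real and virtual simulations can be sketched with a memoizing evaluator. This is only an illustration of the caching idea, not the mechanism implemented in GAMAP: the class name, the tuple encoding of a mapping and the stand-in `simulate` function are all assumptions. The last two lines reproduce the speed-up figures from the simulation counts quoted above.

```python
class CachedEvaluator:
    """Evaluate mappings, simulating each distinct mapping only once."""

    def __init__(self, simulate):
        self._simulate = simulate   # expensive evaluation function
        self._cache = {}
        self.real = 0               # simulations actually run
        self.virtual = 0            # evaluations requested

    def evaluate(self, mapping):
        key = tuple(mapping)        # hashable form of the mapping
        self.virtual += 1
        if key not in self._cache:
            self._cache[key] = self._simulate(key)
            self.real += 1
        return self._cache[key]


if __name__ == "__main__":
    # Stand-in for a simulation: returns a dummy (delay, energy) pair.
    ev = CachedEvaluator(lambda m: (sum(m), len(m)))
    for m in ([0, 1, 2], [0, 1, 2], [2, 1, 0]):
        ev.evaluate(m)
    print(ev.real, ev.virtual)

    # Speed-ups implied by the simulation counts in the text.
    print(round(2670 / 840, 1), round(7238 / 840, 1))
```

Because a genetic algorithm tends to regenerate individuals already seen in earlier generations, the gap between the two counters widens as the population converges.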
Figure. 22: Pareto mapping for the MPEG-4 codec.
Finally, Figure 22 shows a point (the minimum energy mapping) in the
Pareto set obtained by GAMAP. The cores specific to the encoding section
are shown against a dark gray background, whereas those specific to the
decoding are against a white background. The cores shared by the encoder
and decoder are shown against a light gray background and have been
mapped (in this case) in the centre of the NoC. In the decoding section,
the cores VOM and DB are topologically separated from VLD, MEMD and ISC
as there is no direct communication flow between these sets: they
communicate by means of a ring represented by the core REC. In the
encoding section there are also two separate parts which do not
communicate directly but through the set of shared cores.
B. Cell Phone
Figure 23(a) is a block diagram of a mobile phone application in which
it is possible not only to hold a normal conversation but also to listen
to an MP3, surf the web, receive and send images, and listen to emails.
The application example used is the airport scenario described in [30].
In this example the traffic flows are generated under certain
synchronization constraints. For example, as can be seen from Figure
23(b), which shows a fragment of the communication timeline, it is not
possible to read an email and perform MP3 streaming at the same time.
The application was partitioned into 13 cores [one for each block shown
in Figure 23(a)] and mapped onto a 4x4 NoC. The remaining 3 tiles host
the cores of a concurrent synthesized application in which each core
communicates at random with the others.
Figure. 24: Evaluation of 10,000 random mappings and Pareto fronts
obtained by GAMAP and PBBB for a 4x4 NoC and cell phone application.
[Axis label: delay metric.]
Figure 24 shows the solutions obtained by GAMAP and PBBB together with
the evaluation of 10,000 randomly generated mappings. In this case it
was not possible to complete the exploration using PBNMAP due to the
great number of Pareto mappings obtained at each iteration during the
first phase of the algorithm (Figure 15). The main reason for this
behavior lies in the characteristics of the traffic considered. More
specifically, in the first phase of the algorithm the mapping of a core
that does not communicate with any of the other cores already mapped
generates as many Pareto mappings as there are free tiles. In such
situations the ExtractPareto(M) function returns the same set M, the
mappings of which will be extended in the following iteration by the
MakeMappings(M, c) function to map the core c, generating a new set of
mappings $|M| \cdot f$ in size (where f indicates the number of free
tiles in the incomplete mapping). Obviously, the more often this
situation arises, the more quickly the number of mappings to be
evaluated (and thus the number of simulations to be performed) grows. In
this example it happens quite frequently because the application was
partitioned using a coarser granularity. The traffic flows, in fact,
involve on average fewer cores than the previous examples, thus reducing
the probability that a core being mapped will communicate with at least
one of the cores already mapped.
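The growth described above can be made concrete with a small sketch. It models only the worst case discussed in the text, in which every core being placed communicates with none of the cores already mapped, so that no extended mapping is dominated and the whole set survives Pareto extraction. The function `make_mappings` is an illustrative stand-in for MakeMappings(M, c), not the authors' code, and the dictionary encoding of a partial mapping is an assumption.

```python
def make_mappings(partials, core, n_tiles):
    """Extend every partial mapping by placing `core` on each free tile.

    A partial mapping is a dict {core_name: tile_index}. With |M|
    partial mappings and f free tiles each, the result has |M| * f
    entries.
    """
    extended = []
    for mapping in partials:
        used = set(mapping.values())
        for tile in range(n_tiles):
            if tile not in used:
                extended.append({**mapping, core: tile})
    return extended


def growth(cores, n_tiles):
    """Set sizes after each placement when no mapping is ever pruned."""
    partials = [{}]
    sizes = []
    for core in cores:
        partials = make_mappings(partials, core, n_tiles)
        sizes.append(len(partials))
    return sizes


if __name__ == "__main__":
    # Three mutually non-communicating cores on a 16-tile (4x4) mesh.
    print(growth(["c0", "c1", "c2"], 16))
```

After only three such placements the set already holds 16 × 15 × 14 = 3,360 partial mappings, each of which would have to be simulated in the next iteration.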
Going back to Figure 24 we can observe a great range of dispersion
between the points (2.3x for delay and 2.5x for energy consumption)
which once again requires efficient techniques to explore the mapping
space. In this example GAMAP and PBBB yield the same solution but the
former requires only 1,227 simulations as compared with the 9,893
required by the latter.
VII. Conclusions
In this paper we have proposed a strategy for topological mapping of
IPs/cores in a mesh-based NoC architecture. The approach uses heuristics
based on multi-objective genetic algorithms to explore the mapping space
and find the Pareto mappings that optimize performance and power
consumption. At the same time, two of the most widely known approaches
to mapping in mesh-based NoC architectures have been extended in order
to explore the mapping space in a multi-criteria mode. The approaches
have then been evaluated and compared, in terms of both accuracy and
efficiency, on a platform based on an event-driven trace-based simulator
which makes it possible to take account of important dynamic effects
that have a great impact on mapping. The experiments carried out on real
applications (an MPEG-4 encoder/decoder system and a cellular phone
application) confirm the efficiency, accuracy and scalability of the
proposed approach. Future developments will mainly address the
definition of more efficient genetic operators to improve the precision
and convergence speed of the algorithm. Evaluation will also be made of
the possibility of optimizing mapping by acting on other architectural
parameters such as routing strategies, switch buffer sizes, etc.
Author Biographies
Giuseppe Ascia received the Laurea degree in electronic engineering and
the Ph.D. degree in computer science from the Università di Catania,
Italy, in 1994 and 1998, respectively. In 1994, he joined the Institute
of Computer Science and Telecommunications at the Università di Catania.
Currently, he is an Associate Professor at the Università di Catania.
His research interests are soft computing, VLSI design, hardware
architectures, and low-power design.
Vincenzo Catania received the Laurea degree in electrical engineering
from the Università di Catania, Italy, in 1982. Until 1984, he was
responsible for testing microprocessor systems at STMicroelectronics,
Catania, Italy. Since 1985 he has cooperated in research on computer
networks with the Istituto di Informatica e Telecomunicazioni at the
Università di Catania, where he is a Full Professor of computer science.
His research interests include performance and reliability assessment in
parallel and distributed systems, VLSI design, low-power design, and
fuzzy logic.
Maurizio Palesi received the Dr.Eng. degree and the Ph.D. degree in
computer engineering from the Università di Catania, Italy, in 1999 and
2003 respectively. Since December 2003, he has held a research contract
as Assistant Professor at the Dipartimento di Ingegneria Informatica e
delle Telecomunicazioni, Facoltà di Ingegneria, Università di Catania.
His research focuses on platform-based system design, design space
exploration, low-power techniques for embedded systems, and
Network-on-Chip architectures.
Figure. 4: Switch interface (a), Behavioural annotated graph of a switch (b).
Figure. 18: Block diagram of the MPEG-4 codec. (a) Encoder. (b) Decoder.
[Plot legend: Minimum delay, Minimum energy.]
Figure. 21: Number of (virtual and real) mappings evaluated by GAMAP in varying numbers of generations (a). Normalized
minimum delay and power consumption values obtained by the GAMAP in varying numbers of generations (b).
Figure. 23: Example application of a cell phone (source [30]) (a). A portion of the communication timeline (b).