Mapping Cores On Network-on-Chip: ° Research India Publications HTTP://WWW - Ijcir.info
Mapping Cores On Network-on-Chip: ° Research India Publications HTTP://WWW - Ijcir.info
Mapping Cores On Network-on-Chip: ° Research India Publications HTTP://WWW - Ijcir.info
M
/
;
10 e l s e i f ( M ~M
/
)
11 M
n
= M
n
M ;
12 e l s e
13 M
n
= M
n
M, M
/
;
14 }
15 M = ExtraxtPareto( M
n
) ;
16 }
17
18 ret urn M;
19 }
Figure. 16: Second phase of the Pareto-based NMAP ap-
proach.
1 Mappings PBNMAP( Cores C)
2 {
3 Mappings M;
4
5 M = PBNMAP_1st(C ) ;
6 M = PBNMAP_2nd( M , C ) ;
7
8 ret urn M;
9 }
Figure. 17: Pareto-based NMAP approach.
VI. Experiments
A. MPEG-4 Codec
In order to evaluate the various approaches in real trafc sce-
narios, an MPEG-4 simple prole @ level 2 codec was used
as a case study [34]. A general block diagram of the encoder
and decoder is shown in Figure 18.
For the hardware/software partitioning reference was
made to the MoVa architecture described in [22]. It adopts a
macroblock-based pipeline with 4 stages for the encoder and
Mapping Cores on Network-on-Chip 119
Core Description
MEC Motion estimation coarse
MEF Motion estimation ne
MC Motion compensation
VLC Variable length coding
VLD Variable length decoding
REC Reconstruction
SP Stream producer
DB Deblocking
DCTQ Discrete cosine transform & quantization
IQIDCT Inverse discrete cosine transform & inverse quantiza-
tion
RISC 32 bit risc microprocessor
VIM Video input module
VOM Video output module
ISC Input stream controller
MEME Encoder memory
MEMD Decoder memory
Table 1: Cores implementing the codec.
3 for the decoder. More specically, the encoding section
performs coarse motion estimation in the rst stage, ne mo-
tion estimation ne and motion compensation in the second
stage, discrete cosine transform and quantization in the third
stage, and nally reconstruction and production of the stream
in the fourth stage. In the decoding section, the rst stage in-
volves variable length decoding of each data stream; in the
second stage it performs sequential inverse cosine transfor-
mation, inverse quantization and motion compensation; the
third and nal stage is reconstruction.
To obtain the trafc traces the C application implementing
the codec [24] was modied with the addition of a monitor
code to record the volume of incoming and outgoing trafc
in the various functional blocks into which the application
is partitioned. Table 1 shows the 16 cores implementing the
codec. They were characterized in terms of timing by using
the clock cycle data in [22] for the execution of each oper-
ation (DCT, MC, etc.). For power characterization, we used
the mean values given in the datasheets [27, 31]. For the in-
terconnection system we used an approach similar to the one
presented in [17]. To characterize the switches, a 5x5 switch
was implemented in VHDL following the architecture de-
scribed in [6]. It was synthesized with a Synopsys Design
Compiler using the Virtual Silicon 0.13m, 1.2V technolog-
ical library and analyzed using Synopsys Design Power us-
ing different random input data streams for the inputs of the
switch. The amount of power consumed by a it for a hop
switch was estimated as being 0.181nJ. We assumed the tile
size to be 2mm2mm and that the tiles were arranged in a
regular fashion on the oorplan. The load wire capacitance
was set to 0.50f F per micron, so considering an average of
25% switching activity the amount of power consumed by a
it for a hop interconnect is 0.384nJ.
Figure 19 shows the application characterization graph of
the MPEG-4 codec. Each vertex of the graph represents a
core. An edge that connects a core i to a core j denes
a communication ow from core i to core j. Each edge is
Figure. 19: Application characterization graph of the
MPEG-4 codec.
characterized by a set of attributes such as the trafc volume
(T
i, j
) and the minimum bandwidth requirement for the com-
munication (B
i, j
). The latter one is used as an exploration
constraint. More precisaly, a mapping is rejected if it does
not satisfy at least one of such constraints. These constraints
are set by performing a proling of the application and an-
notating the trafc volume exchanged between the various
application components. For example, to decode N frames at
X fps we have B
i, j
= T
i, j
X/N.
The following values were used for the free parameters of
the exploration algorithm. For GAMAP we chose a popula-
tion of 50 mappings, a crossover probability of 0.7 and a mu-
tation probability of 0.1. The R parameter of the crossover
operator was set to 2. These values were chosen after nu-
merous simulations and were the values that on average led
to better solutions or shorter convergence times. The number
of generations was set runtime by means of a stop criterion
based on analysis of the convergence of the Pareto-front [4].
For PBBB, the parameter T
pbbb
was set to 100. Figure 20
gives the power values and trafc clearing times for 10,000
random mappings. It also shows the Pareto fronts obtained
by GAMAP, PBNMAP, and PBBB, and the solutions found
by BB [17] and NMAP [28]. As can be seen, the solutions
obtained by GAMAP dominate those obtained by the other
approaches. The gure also shows the good trade-off be-
tween delay and power (respectively equal to a factor of 3
for delay and 2.5 for power).
Figure 21(a) gives the number of simulations (i.e. map-
pings evaluated by GAMAP) for varying numbers of genera-
tions. It gives the number of simulations actually performed
and those virtually performed if no caching mechanism had
been used. Figure 21(b) gives the normalized delay and en-
120 Giuseppe Ascia, Vincenzo Catania and Maurizio Palesi
0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.08
0.1
0.12
0.14
0.16
0.18
0.2
Time (s)
E
n
e
r
g
y
(
J
)
random
GAMAP
PBBB
PBNMAP
BB
NMAP
Figure. 20: Evaluation of 10,000 random mappings and
Pareto fronts obtained by GAMAP, PBNMAP, and PBBB for
a 4x4 NoC in which the MPEG-4 codec is mapped on.
ergy values for varying numbers of generations. As can be
seen, in both cases no mappings that determine appreciable
improvements in delay and energy consumption are found af-
ter the 20th generation. At the 20th generation GAMAP had
only performed 840 simulations as compared with 2,670 by
PBNMAP and 7,238 by PBBB, thus providing an exploration
time speed-up of 3.2 and 8.6 respectively.
Figure. 22: Pareto mapping for the lMPEG-4 codec.
Finally, Figure 22 shows a point (the minimum energy
mapping) in the Pareto set obtained by GAMAP. The cores
specic to the encoding section are shown against a dark
gray background, whereas those specic to the decoding are
against a white background. The cores shared by the encoder
and decoder are shown against a light gray background and
have been mapped (in this case) in the centre of the NoC. In
the decoding section, the cores VOM and DB are topolog-
ically separated from VLD, MEMD and ISC as there is no
direct communication ow between these sets: they commu-
nicate by means of a ring represented by the core REC. In the
encoding section there are also two separate parts which do
not communicate directly but through the set of shared cores.
B. Cell Phone
Figure 23(a) is a block diagramof a mobile phone application
in which it is possible not only to hold a normal conversation
but also to listen to an MP3, surf the web, receive and send
images, and listen to emails. The application example used
is the airport scenario described in [30]. In this example
the trafc ows are generated under certain synchronization
constraints. For example, as can be seen from Figure 23(b),
which shows a fragment of the communication timeline, it is
not possible to read an email and perform MP3 streaming at
the same time.
The application was partitioned into 13 cores [one for each
block shown in Figure 23(a)] and mapped onto a 44 NoC.
Cores for a concurrent synthesized application in which each
core communicates at random with the others were mapped
onto the remaining 3 tiles.
200 250 300 350 400 450 500 550 600
2000
2500
3000
3500
4000
4500
5000
5500
6000
6500
delay metric
e
n
e
r
g
y
m
e
t
r
i
c
random
GAMAP
PBBB
Figure. 24: Evaluation of 10,000 random mappings and
Pareto fronts obtained by GAMAP and PBBB for a 44 NoC
and cell phone application.
Figure 24 shows the solutions obtained by GAMAP and
PBBB together with the evaluation of 10,000 randomly gen-
erated mappings. In this case it was not possible to com-
plete the exploration using PBNMAP due to the great num-
ber of Pareto mappings obtained at each iteration during the
rst phase of the algorithm (Figure 15). The main reason
for this behavior lies in the characteristics of the trafc con-
sidered. More specically, in the rst phase of the algo-
rithm the mapping of a core that does not communicate with
any of the other cores already mapped generates as many
Pareto mappings as there are free tiles. In such situations
the ExtractPareto(M) function returns the same set M , the
mappings of which will be extended in the following itera-
tion by the MakeMappings(M, c) function to map the core
c, generating a new set of mappings [M[ f in size (where
f indicates the number of free tiles in the incomplete map-
ping). Obviously, the more often this situation arises, the
more quickly the number of mappings to be evaluated (and
Mapping Cores on Network-on-Chip 121
thus the number of simulations to be performed) grows. In
this example it happens quite frequently because the applica-
tion was partitioned using a coarser granularity. The trafc
ows, in fact, involve on average fewer cores than the previ-
ous examples, thus reducing the probability that a core being
mapped will communicate with at least one of the cores al-
ready mapped.
Going back to Figure 24 we can observe a great range
of dispersion between the points (2.3x for delay and 2.5x
for energy consumption) which once again requires efcient
techniques to explore the mapping space. In this example
GAMAP and PBBB yield the same solution but the former
requires only 1,227 simulations as compared with the 9,893
required by the latter.
VII. Conclusions
In this paper we have proposed a strategy for topological
mapping of IPs/cores in a mesh-based NoC architecture. The
approach uses heuristics based on multi-objective genetic al-
gorithms to explore the mapping space and nd the Pareto
mappings that optimize performance and power consump-
tion. At the same time, two of the most widely-known ap-
proaches to mapping in mesh-based NoC architectures have
been extended in order to explore the mapping space in a
multi-criteria mode. The approaches have been then evalu-
ated and compared, in terms of both accuracy and efciency,
on a platform based on an un event-driven trace-based sim-
ulator which makes it possible to take account of important
dynamic effects that have a great impact on mapping. The
experiments carried out on real applications (an MPEG-4 en-
coder/decoder system and a cellular phone application) con-
rm the efciency, accuracy and scalability of the proposed
approach. Future developments will mainly address the de-
nition of more efcient genetic operators to improve the pre-
cision and convergence speed of the algorithm. Evaluation
will also be made of the possibility of optimizing mapping
by acting on other architectural parameters such as routing
strategies, switch buffer sizes, etc.
References
[1] C. J. Alpert, L. W. Hagen, and A. B. Kahng. A hybrid
multilevel/genetic approach for circuit partitioning. In
Fifth ACM/SIGDA Physical Design Workshop, pages
100105, Apr. 1996.
[2] C. J. Alpert and A. B. Kahng. Recent developments
in netlist partitioning: A survey. VLSI Journal, 19(1
2):181, 1995.
[3] ARM. AMBA specication. http://www.arm.
com/, May 1999.
[4] G. Ascia, V. Catania, and M. Palesi. A GA based design
space exploration framework for parameterized system-
on-a-chip platforms. IEEE Transactions on Evolution-
ary Computation, 8(4):329346, Aug. 2004.
[5] G. Ascia, V. Catania, and M. Palesi. Multi-objective
mapping for mesh-based NoC architectures. In Second
IEEE/ACM/IFIP International Conference on Hard-
ware/Software Codesign and System Synthesis, pages
182187, Stockholm, Sweden, Sept. 810 2004.
[6] N. Banerjee, P. Vellanki, and K. S. Chatha. Apower and
performance model for network-on-chip architectures.
In Design, Automation and Test in Europe, pages 1250
1255, Feb. 1620 2004.
[7] M. S. Bright and T. Arslan. Synthesis of low-power
DSP systems using a genetic algorithm. IEEE Transac-
tions on Evolutionary Computation, 5(1):2740, 2001.
[8] H. Chang, L. Cooke, M. Hunt, G. Martin, A. McNelly,
and L. Todd. Surviving the SOC Revolution A Guide to
Platform-Based Design. Kluwer Academic Publishers,
1999.
[9] J.-M. Chang and M. Pedram. Codex-dp: Co-design of
communicating systems using dynamic programming.
IEEE Transactions on Computer-Aided Design of Inte-
grated Circuits and Systems, 19(7):732744, July 2000.
[10] C. A. C. Coello. A comprehensive survey of
evolutionary-based multiobjective optimization tech-
niques. Knowledge and Information Systems. An In-
ternational Journal, 1(3):269308, Aug. 1999.
[11] D. W. Corne, J. D. Knowles, and M. J. Oates. The
pareto envelope-based selection algorithm for multi-
objective optimization. In M. Schoenauer, K. Deb,
G. Rudolph, X. Yao, E. Lutton, J. J. Merelo, and H.-P.
Schwefel, editors, Parallel Problem Solving from Na-
ture - PPSN VI 6th International Conference, pages
839848. Springer. Lecture Notes in Computer Science
No. 1917, 2000.
[12] W. J. Dally and B. Towles. Route packets, not wires:
On-chip interconnection networks. In Design Automa-
tion Conference, pages 684689, Las Vegas, Nevada,
USA, 2001.
[13] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A
fast elitist multiobjective genetic algorithm: NSGA-
II. IEEE Transactions on Evolutionary Computation,
6(2):182197, Apr. 2002.
[14] M. R. Garey and D. S. Johnson. Computers and in-
tractability: a guide to the theory of NP-completeness.
Freeman and Company, 1979.
[15] C. J. Glass and L. M. Ni. The turn model for adaptive
routing. In 25 Years ISCA: Retrospectives and Reprints,
pages 441450, 1998.
122 Giuseppe Ascia, Vincenzo Catania and Maurizio Palesi
[16] A. Hemani, T. Meincke, S. Kumar, A. Postula, T. Ols-
son, P. Nilsson, J. Oberg, P. Ellervee, and D. Lundqvist.
Lowering power consumption in clock by using glob-
ally asynchronous locally synchronous design style. In
ACMIEEE Design Automation Conference, pages 873
878. ACM Press, 1999.
[17] J. Hu and R. Marculescu. Energy-aware mapping for
tile-based NoC architectures under performance con-
straints. In Asia & South Pacic Design Automation
Conference, Jan. 2003.
[18] J. Hu and R. Marculescu. DyAD - smart routing for
networks-on-chip. In ACM/IEEE Design Automation
Conference, pages 260263, San Diego, CA, USA,
June 711 2004.
[19] J. Hu and R. Marculescu. Energy- and performance-
aware mapping for regular NoC architectures. IEEE
Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 24(4):551562, Apr. 2005.
[20] IBM Corporation. The CoreConnect bus architecture.
http://www.ibm.com/.
[21] K. Keutzer, S. Malik, R. Newton, J. M. Rabaey, and
A. Sangiovanni-Vincentelli. System-level design: Or-
thogonalization of concerns and platform-based design.
IEEE Transactions on Computer Aided Design of Inte-
grated Circuits and Systems, 19(12):15231543, Dec.
2000.
[22] S.-M. Kim, J.-H. Park, S.-M. Park, B.-T. Koo, K.-S.
Shin, K.-B. Suh, I.-K. Kim, N.-W. Eum, , and K.-S.
Kim. Hardware-software implementation of MPEG-4
video codec. ETRI Journal, 25(6):489502, Dec. 2003.
[23] V. Kommu and I. Pomenraz. GAFAP: Genetic algo-
rithm for FPGA technology mapping. In European De-
sign Automation Conference, pages 300305, 1993.
[24] C. Lampert, M. Militzer, and P. Ross. XviD MPEG4
core library. http://www.xvid.org/.
[25] T. Lei and S. Kumar. A two-step genetic algorithm for
mapping task graphs to a network on chip architecture.
In Euromicro Symposium on Digital Systems Design,
Sept. 16 2003.
[26] D. Martin, S. Ahmad, and K. Khalilian. Platform-based
design: Report from the front. In Ninth IEEE/DATC
Electronic Design Processes Workshop, Apr. 2002.
[27] Mentor Graphics. Inventra intellectual property
cores. http://www.mentor.com/inventra/
cores/.
[28] S. Murali and G. D. Micheli. Bandwidth-constrained
mapping of cores onto NoC architectures. In Design,
Automation, and Test in Europe, pages 896901. IEEE
Computer Society, Feb. 1620 2004.
[29] Palmchip. SoC bus architecutre. http://www.
palmchip.com/.
[30] J. M. Paul, D. E. Thomas, and A. Bobrek. Benchmark-
based design strategies for single chip heteroge-
neous multiprocessors. In Proceedings of the 2nd
IEEE/ACM/IFIP International Conference on Hard-
ware/Software Codesign and System Synthesis, pages
5459, Stockholm, Sweden, Sept.810 2004. ACM
Press.
[31] Philips Electronics. Philips IP portfolio. http://
www.semiconductors.philips.com.
[32] J. D. Schaffer. Multiple objective optimization with
vector evaluated genetic algorithms. In Genetic Algo-
rithms and their Applications: Proceedings of the First
International Conference on Genetic Algorithms, pages
93100. Lawrence Erlbaum, 1985.
[33] International thechnology roadmap for semiconductors.
Semiconductor Industry Association, 2003.
[34] T. Sikora. The MPEG-4 video standard verication
model. IEEE Transactions on Circuits and Systems for
Video Technology, 7(1):1931, Feb. 1997.
[35] Silicore. WISHBONE system-on-chip (SoC) intercon-
nection architecture for portable IP cores. http://
www.silicore.net/, Sept. 2002.
[36] Sonics. SiliconBackplane III MicroNetwork IP.
http://www.sonicsinc.com/.
[37] D. Sylvester and K. Keutzer. Getting to the bottom of
deep submicron. In IEEE/ACM International Confer-
ence on Computer-aided design, pages 203211. ACM
Press, 1998.
[38] D. Sylvester and K. Keutzer. A global wiring para-
digm for deep submicron design. IEEE Transactions
on Computer Aided Design of Integrated Circuits and
Systems, 19(2):242252, Feb. 2000.
[39] C. Valenzuela. A simple evolutionary algorithm for
multi-objective optimization (SEAMO). In Proceed-
ings of the 2002 Congress on Evolutionary Computa-
tion CEC2002, pages 717722, 2002.
[40] C. L. Valenzuela and P. Y. Wang. VLSI placement and
area optimization using a genetic algorithm to breed
normalized postx expressions. IEEE Transactions on
Evolutionary Computation, 6(4):390401, 2002.
[41] D. A. V. Veldhuizen and G. B. Lamont. Multiobjective
evolutionary algorithms: Analyzing the state-of-the-art.
Evolutionary Computation, 8(2):125147, 2000.
[42] VSI Alliance. On-chip bus attributes specication ver-
sion 1. http://www.vsi.org/, Sept. 2001.
Mapping Cores on Network-on-Chip 123
[43] E. Zitzler, M. Laumanns, and L. Thiele. SPEA2: Im-
proving the performance of the strength pareto evolu-
tionary algorithm. In EUROGEN 2001. Evolutionary
Methods for Design, Optimization and Control with
Applications to Industrial Problems, pages 95100,
Athens, Greece, Sept. 2001.
[44] E. Zitzler and L. Thiele. Multiobjective evolutionary
algorithms: A comparative case study and the strength
pareto approach. IEEE Transactions on Evolutionary
Computation, 4(3):257271, Nov. 1999.
Author Biographies
Giuseppe Ascia received the Laurea degree in electronic engineering and
the Ph.D. degree in computer science from the Universit di Catania, Italy, in
1994 and 1998, respectively. In 1994, he joined the Institute of Computer
Science and Telecommunications at the Universit di Catania. Currently, he
is an Associate Professor at the Universit di Catania. His research interests
are soft computing, VLSI design, hardware architectures, and low-power
design.
Vincenzo Catania received the Laurea degree in electrical engineering
from the Universit di Catania, Italy, in 1982. Until 1984, he was responsible
for testing microprocessor system at STMicroelectronics, catania, Italy.
Since 1985 he has cooperated in research on computer network with the
Istituto di Informatica e Telecomunicazioni at the Universit di Catania,
where he is a Full Professor of computer science. His research interests
include performance and reliability assessment in parallel and distribuited
system, VLSI design, low-power design, and fuzzy logic.
Maurizio Palesi received the Dr.Eng. degree and the Ph.D. degree in com-
puter engineering from Universit di Catania, Italy, in 1999 and 2003 respec-
tively. Since December 2003, he has held a research contract as Assistant
Professor at the Dipartimento di Ingegneria Informatica e delle Telecomu-
nicazioni, Facolt di Ingegneria, Universit di Catania. His research focuses
on Platform based system design, design space exploration, low-power tech-
niques for embedded systems, and Network-on-Chip architectures.
124 Giuseppe Ascia, Vincenzo Catania and Maurizio Palesi
(a)
(b)
Figure. 4: Switch interface (a), Behavioural annotated graph of a switch (b).
Mapping Cores on Network-on-Chip 125
(a) (b)
Figure. 18: Block diagram of the MPEG-4 codec. (a) Encoder. (b) Decoder.
0 20 40 60 80 100 120 140 160
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
Generation
S
i
m
u
l
a
t
i
o
n
s
Virtual
Real
0 20 40 60 80 100 120 140 160
0.6
0.65
0.7
0.75
0.8
0.85
0.9
0.95
1
Generation
N
o
r
m
a
l
i
z
e
d
v
a
l
u
e
s
Minimum delay
Minimum energy
(a) (b)
Figure. 21: Number of (virtual and real) mappings evaluated by GAMAP in varying numbers of generations (a). Normalized
minimum delay and power consumption values obtained by the GAMAP in varying numbers of generations (b).
126 Giuseppe Ascia, Vincenzo Catania and Maurizio Palesi
(a) (b)
Figure. 23: Example application of a cell phone (source [30]) (a). A portion of the communication timeline (b).