Codesign Paper Presentation

This document presents a new codesign methodology called R-codesign for real-time reconfigurable embedded systems under energy constraints. R-codesign uses modeling and partitioning techniques to create a task allocation of software functions and hardware behaviors based on user constraints, using heuristics. It models system tasks probabilistically to predict performance, cost, and power of design tradeoffs. R-codesign takes hardware and software specifications as input and constructs partitions that are mapped to a heterogeneous multiprocessor system-on-chip platform with FPGAs, evaluating constraints during partitioning and mapping.


This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2018.2799852, IEEE Access

R-Codesign: Codesign Methodology for Real-time Reconfigurable Embedded Systems Under Energy Constraints

Ines Ghribi, Riadh Ben Abdallah, Mohamed Khalgui, Zhiwu Li, Fellow, IEEE, Khalid Alnowibet and Marco Platzner

Abstract: Hardware/software codesign involves various design problems including system specification, design space exploration, hardware/software co-verification, and system synthesis. An effective codesign process requires accurately predicting the performance, cost and power consequences of any design trade-off in algorithms and hardware configuration. This paper presents a new codesign methodology called R-codesign. Based on new modeling and partitioning techniques for reconfigurable embedded systems, R-codesign creates a task allocation of SW functions and HW behaviors based on the user constraints and using heuristics. The modeling approach relies on probabilistic estimations of the executions of system tasks. Hardware and software specifications are the inputs of R-codesign, which constructs partitions (clusters of tasks) and maps them to a specified heterogeneous MPSoC (Multiprocessor System-on-Chip) execution platform with FPGAs (Field-Programmable Gate Arrays). Several design constraints are evaluated and tested during the partitioning and mapping process. We have developed a visual environment called SPEX that implements this methodology. SPEX computes a control matrix, which is a pre-computation of the validated mappings to be applied when a system reconfiguration occurs. SPEX is open source, fast and provides efficient results for the codesign of reconfigurable embedded systems.

Index Terms: Embedded System, Reconfiguration, Real-time, Co-design, MPSoC, FPGA.

This work is supported by the Science and Technology Development Fund under Grant 078/2015/A3 (Corresponding authors: M. Khalgui and Z. Li).
I. Ghribi is with the School of Electrical and Information Engineering, Jinan University, Zhuhai 519070, China, also with the National Institute of Applied Sciences and Technology, University of Carthage, Tunis 1080, Tunisia, and also with the Faculty of Mathematical, Physical and Natural Sciences, University of Tunis-El Manar, Tunis 2092, Tunisia (e-mail: [email protected]).
R. Ben Abdallah is with Prince Sattam Bin Abdulaziz University, Al Kharj, Saudi Arabia, and also with the National Institute of Applied Sciences and Technology, University of Carthage, Tunisia (e-mail: [email protected]).
M. Khalgui is with the School of Electrical and Information Engineering, Jinan University (Zhuhai Campus), Zhuhai 519070, China, and with the National Institute of Applied Sciences and Technology (INSAT), University of Carthage, Tunis 1080, Tunisia (e-mail: khalgui. [email protected]).
Z. Li is with the Institute of Systems Engineering, Macau University of Science and Technology, Taipa 999078, Macau, and also with the School of Electro-Mechanical Engineering, Xidian University, Xi'an 710071, China (e-mail: [email protected]).
K. Alnowibet is with the Department of Statistics and Operations Research, King Saud University, Riyadh, P.O. Box 2455, Riyadh 11451, Saudi Arabia (e-mail: [email protected]).
M. Platzner is with the University of Paderborn, Paderborn 33098, Germany (e-mail: [email protected]).

I. INTRODUCTION

The complexity of designing embedded systems is constantly increasing, which motivates the need for more efficient tools and design methodologies. Methodologies that employ modeling techniques at a low level of abstraction are no longer applicable because of this complexity. We propose in this work a new codesign methodology, based on high-level abstraction modeling techniques, which deal with coarse-grain components that increase productivity. We recall that hardware/software co-design is the technique of concurrently designing the hardware and software components of an embedded system. Generally, hardware/software co-design starts with a specification step followed by a modeling step in which designers have to decide which part of the system should be mapped on hardware and which part on software. The hardware/software partitioning step follows. It is a combinatorial optimization problem that assigns the system functions to the software and hardware domains of the target architecture under the condition of meeting the design constraints. This is a key task in system-level design since the decisions made during this step directly impact the performance and cost of the final implementation. A limitation of many hardware/software design methodologies is their inability to roll back hardware/software partitioning decisions. Such flexibility is important because it allows designers to discover early the consequences of a particular hardware/software partitioning decision and, if it is deemed inappropriate, to explore another [2]. One way of achieving this goal is to develop abstract hardware/software models during the partitioning process which can be used to assess these decisions.

Another main performance issue in embedded systems design is guaranteeing results within a given time. Systems that must fulfill such timing constraints are called real-time systems [3]. Most of these real-time embedded systems interact with the external environment, which means that task executions are triggered by external events [4]. The system response should be modulated according to the stimulus from outside. Designers of such systems make use of reconfigurable components, and the system implementation becomes a kind of building-block game [5], [6]. These systems undergo unpredictable events that require adequate online decisions so as to maintain the desired performance and schedulability [7]. A reconfiguration event is defined as an internal/external event that leads to adding or removing tasks [8]. Consequently, any

2169-3536 (c) 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

reconfiguration scenario may increase the energy consumption and/or cause some tasks to violate their deadlines. Thus, this flexibility adds more complexity to the design process [9], [10]. Moreover, in real-time systems, task scheduling is critical and depends on the previous mapping steps among the system processing elements [11], [12].

In the present paper, we propose a methodology called R-codesign for reconfigurable co-design. It aims to find solutions for modeling and partitioning probabilistic real-time systems having multiple reconfiguration scenarios. R-codesign creates a task allocation of SW functions and HW behaviors based on the user constraints and using heuristics. Guaranteed available resources, feasible scheduling and the generation of a reconfiguration controller are the main concerns. The gain of the methodology resides in two aspects: (i) the estimation of the execution flow allows the functions most likely to be executed to be mapped and stored together, so that the communication costs are reduced and the overall performance is enhanced, and (ii) the precomputed mapping of the possible execution scenarios allows the system to be reconfigured at run-time with a minimum reconfiguration overhead.

Firstly, R-codesign presents an abstract model for hardware/software systems allowing early exploration of hardware/software executions and evaluation of design alternatives. This model supports incremental refinement and evaluation at multiple abstraction levels. The separation between software and hardware tasks is assumed to be done manually by users. The decision is made, after a complexity study of each module (node in the DAG), by an expert who knows the computational requirements of the system processing flow. Hardware is limited to specifically designed tasks that are, taken independently, very simple. Software implements algorithms that complete much more complex tasks. The entry point for R-codesign is a hardware/software specification modeled by a DAG (Directed Acyclic Graph) whose nodes are software functions or hardware behaviors. The edges of these DAGs are valued with a probabilistic estimation of the execution of the nodes they connect, along with the communication cost between the communicating nodes. The goal of the methodology is to partition and map all predefined possible configuration scenarios off-line onto a hardware target architecture that is mainly an MPSoC, and to implement a controller that will supervise and reconfigure the system on-line [13]. All the important and more likely configuration scenarios are pre-computed and given as input to the methodology. Each possible configuration is composed of a set of periodic tasks modeled according to the proposed DAG representation. Thus, we developed adequate partitioning and mapping techniques for the proposed hardware/software model. This partitioning/mapping approach is called I-codesign. Several design constraints are considered in this work, such as the inclusion/exclusion constraint, which is related to the functional specification of processors. An optimization phase is applied at the end of I-codesign using the Kernighan-Lin algorithm [14] in an attempt to find an optimal series of interchange operations between communicating elements in the DAGs. I-codesign treats the software functions and the hardware behaviors separately, and then a co-simulation step decides whether or not the mapping results satisfy the design constraints. If I-codesign fails to map the specification, R-codesign reports the issues to the user so that the input parameters can be tuned. Otherwise, the results are stored in a controller matrix which will be used by the reconfiguration controller during run-time. In the case of large systems, the global mapping matrix could be broken down into small matrices controlled by multiple controllers in a distributed fashion, which avoids reconfiguration fetch overhead.

Finally, R-codesign is a codesign methodology that allows hardware/software systems to be evaluated rapidly using abstract models. The partitioning and mapping results reveal its efficiency. Indeed, it guarantees a feasible solution and enhances the overall performance compared with existing methodologies.

The paper proceeds as follows. The next section describes useful background. Section III presents the system formalization and the notations used in this paper. In Section IV the R-codesign methodology is developed. Section V presents the experimental results, and finally we conclude this paper in Section VI.

II. STATE OF THE ART

Hardware/software codesign can be considered as the process of concurrent and coordinated design of an electronic system comprising hardware as well as software components based on a system description that is implementation-independent [15], [16]. One of the key problems in hardware/software codesign is hardware/software partitioning [17]. One of the most relevant works dealing with partitioning is presented in [18]: a very sophisticated integer linear programming model for the joint partitioning and scheduling problem for a wide range of target architectures. This integer program is part of a 2-phase heuristic optimization scheme which aims at gaining better timing estimates using repeated scheduling phases, and using the estimations in the partitioning phases. The work in [19] presents a method for allocation of hardware/software resources for optimal partitioning. During the allocation algorithm, an estimated hardware/software partition is also built. The algorithm for this is basically a greedy algorithm: it takes the components one by one, and allocates the most critical building block of the current component to hardware. The study in [20] shows an algorithm to solve the joint problem of partitioning and scheduling. It consists of basically two local search heuristics: one for partitioning and one for scheduling. The two algorithms operate on the same graph, at the same time. The work in [21] considers partitioning in the design of ASIPs (application-specific instruction-set processors). It presents a formal framework and proposes a partitioning algorithm based on branch and bound. The research in [22] presents an approach that is largely orthogonal to other partitioning methods: it deals with the problem of hierarchically matching tasks to resources. It also shows a method for weighting partially defined user preferences, which can be very useful for multiple-objective optimization problems [23].

Along with the partitioning and mapping problem, co-simulation has become an important area of research for the early


validation of design decisions [24], [25]. In co-simulation, the execution of software on CPUs is simulated using a virtual model of the processor hardware or with simulation models such as an ISS (Instruction Set Simulator). An ISS reduces the complexity of the system design compared with performing a pure gate-level or register transfer level (RTL) hardware simulation, which is typically too slow. The co-simulation problem lies in coupling different models to make the hardware simulation sufficiently accurate [26]. Nowadays, we already live in a third generation of co-design technology with cross-level design environments for the synthesis of complex electronic systems [27], [28]. During the last decade, many important milestones of progress with respect to the initial findings have been achieved [29], [30]. Several design environments have been developed: SARA [31], ADAS [32], PTOLEMY [33], and PML [34].

In the present work, we introduce a new co-design methodology based on constructive and iterative partitioning phases. The originality of this work compared with the approaches previously proposed in the literature resides in multiple aspects, including:
• A probabilistic estimation of the software models aiming to predict the execution flow, which leads to an improvement in the codesign results and a noticeable enhancement in the performance of the system,
• A codesign methodology based on multiple constraints and feasibility analysis that shows good performance enhancements, especially in terms of execution time and communication cost,
• A visual tool that implements the methodology and generates the controller table providing task mappings to anticipate all reconfiguration scenarios.

III. FORMALIZATION

This section presents the formalization of a hardware/software system specification. We also explain the partitioning techniques used in R-codesign.

A. System Model

In this study we target an MPSoC hardware architecture as an execution platform [35]. It is composed of a single master tile controlling multiple slave tiles. Each tile is composed of a CPU (Central Processing Unit), a local memory and reconfigurable hardware allowing the implementation of custom hardware used for acceleration purposes. The master tile includes I/O interfaces. In a classic use case, the master tile receives external data events from sensors through these interfaces and reconfigures the system tasks and hardware behaviors accordingly. The tiles communicate through a communication medium which can be a bus, a NoC (Network on Chip), a crossbar or a shared memory. A hybrid memory model is adopted, i.e., each tile has its own private memory. The considered tiles can also communicate through a global shared memory, where an upper bound on the time required to access the shared resource is considered. We assume that software tasks are those executed by programmable processors (e.g., GPP (General Purpose Processor) and DSP (Digital Signal Processor)).

Figure 1. Proposed Hardware Model. (Block labels: CPU, RHw, I/O, Mem, Com, Tile, Bus1, Bus2, Shared Memory.)

Their implementation is an executable code, while hardware tasks are those implemented by a specific programmable integrated circuit. Hardware tasks are generally provided as IPs (Intellectual Property) written in a hardware description language (e.g., VHDL, Verilog) and implemented by FPGAs or dedicated ASICs (Application Specific Integrated Circuits). We assume that all processors are homogeneous, and likewise all FPGAs, in order to simplify the understanding of our methodology through the presented examples. The proposed methodology remains valid even with heterogeneous processing elements. Figure 1 presents an example of the hardware model. A tile $Hw_i$, $i \in [1..L]$, is characterized by the 5-tuple $(S_{CPU}, S_{FPGA}, Pw_{FPGA}, Pw_{CPU}, Freq)$, where (i) $S_{CPU}$ is the available memory size in the case of a CPU, (ii) $S_{FPGA}$ is the FPGA area in terms of number of gates, (iii) $Pw_{FPGA}$ is the produced electrical power of an FPGA, (iv) $Pw_{CPU}$ is the produced electrical power of a CPU, and (v) $Freq$ is the range of the available operating frequency. The bandwidth $BW_{i,j}$ is defined for two communicating tiles $Hw_i$ and $Hw_j$. The memory and power parameters are common to all tiles.

The system model is divided into $\gamma$ configurations. A configuration $\zeta_l$, $l \in [1..\gamma]$, is a set of tasks to be executed when the configuration is initiated. Each task in a configuration is considered as a graph of elementary functions/behaviors with their intrinsic properties and constraints. A task $T_i \in \zeta_l$, $i \in [1..R]$, is represented by a directed acyclic graph $T_i = (V_i, E_i)$, where (i) $V_i$ is a set of nodes that correspond to behaviors or functions, and (ii) $E_i$ is a set of arcs which describe the connections between functions/behaviors. Each task $T_i$ is composed of $n_{i,1}$ behaviors and $n_{i,2}$ functions. A hardware behavior denoted by $B_{j,i}$ is described as a 5-tuple $B_{j,i} = (E_j^{hw}, M_j^{hw}, C_j^{hw}, D_j^{hw}, P_j^{hw})$, where $E_j^{hw}$ represents the execution time of the hardware behavior on FPGA, $M_j^{hw}$ stands for the number of gates necessary for implementing $B_{j,i}$, $C_j^{hw}$ denotes the power consumption of the hardware behavior, and $D_j^{hw}$ and $P_j^{hw}$ are respectively its relative deadline and period. A software function $F_{k,i}$ is described as a 5-tuple $F_{k,i} = (E_k^{sw}, M_k^{sw}, C_k^{sw}, D_k^{sw}, P_k^{sw})$, where $E_k^{sw}$ stands for the execution time of $F_{k,i}$ on CPU, $M_k^{sw}$ denotes the memory size in bytes required by $F_{k,i}$, $C_k^{sw}$ represents the power consumption of the software function, and $D_k^{sw}$ and $P_k^{sw}$ denote respectively its relative


deadline and period.

We also consider inclusion/exclusion constraints. They are used to force a pair of functions and/or behaviors to be executed either on the same computing unit or on different ones. These constraints are expressed by the following functions:
• $Exclu(F_{k,i})$ is a set that groups the functions which must not be executed on the same processor as the function $F_{k,i}$. This constraint is modeled within the task representation by marking the symbol $\not\subset$ on the function $F_{k,i}$, which means that $F_{k,i}$ should not be executed with its predecessors on the same computing unit.
• $Inclu(F_{k,i})$ is a set that groups the functions which have to be executed on the same processor as $F_{k,i}$. This constraint is modeled by marking the symbol $\subset$ on $F_{k,i}$, which means that $F_{k,i}$ should be executed with its predecessors on the same computing unit.

The edges are weighted with a couple $\prec Pr_k, Cc_k \succ$, where $Pr_k$ is the probability of executing this edge and $Cc_k$ is the communication cost of the data transfer between the two nodes. We consider that the data are always fetched by software functions and propagated to the hardware behaviors attached to them. Figure 2 presents an example of the proposed specification. The system is composed of $n_{i,1} = 8$ hardware behaviors and $n_{i,2} = 5$ software functions connected with valued edges. Inclusion constraints are visible on $F_{2,i}$, $F_{5,i}$, $B_{3,i}$ and $B_{8,i}$, while the exclusion constraints are present on $F_{4,i}$ and $B_{6,i}$. The design constraints are user-defined parameters in the system model which are set according to a prior performance study. The goal of the current paper is to provide a solution that places the tasks on the related devices under real-time, energy and memory constraints.

Figure 2. A task example. (DAG over the functions $F_{1,i}$ to $F_{5,i}$ and the behaviors $B_{1,i}$ to $B_{8,i}$; each edge is labeled probability/communication cost, e.g. 0.6/10.)

The aim of the design formalization is to generate a controller allowing an efficient system reconfiguration.

Problem Statement: Given target hardware following the previously described tile-based execution model and our R-codesign system model (DAG of tasks comprising software functions and hardware behaviors with a set of constraints along with their estimated execution probabilities), find an appropriate mapping for each possible configuration that respects the available hardware resources while satisfying real-time and energy constraints.

B. I-codesign Methodology

The R-codesign partitioning reuses the I-codesign partitioning algorithms [36] and extends them to hardware behaviors. The goal of I-codesign is to achieve a concurrent design between the probabilistic task model and the hardware architecture previously described in a manner that fulfills all the system requirements and respects the design constraints. Each phase of I-codesign has its own constraint(s) and terminates when all the nodes characterized by the specified constraint(s) are mapped. Firstly, we apply I-codesign to the software functions, ignoring the hardware behaviors. Then, in a second step, we perform the same steps on the hardware behaviors.

1) Functional Partitioning: Evaluates the inclusion/exclusion constraints between graph nodes. Then, this phase creates clusters depending on these constraints. The inclusion/exclusion constraint decides the number of physical components when placing the functions/behaviors.

Algorithm 1 Functional partitioning algorithm
procedure FUNCTIONAL-PARTITION(ttask: TAB-TASK, var tc: TAB-CLUSTER, var tf: TAB-FUNCTION)
    for i = 1 to length(ttask) do
        for j = 1 to length(ttask[i]) do
            if inclusion(ttask[i][j]) then
                Cluster_T(cluster1, ttask[i][j], Pred(ttask[i][j]))
                if OK_Memory(cluster1) & OK_Energy(cluster1) then
                    Add_C(cluster1, tc)
                    empty(cluster1)
                end if
            else if exclusion(ttask[i][j]) then
                Cluster(cluster1, ttask[i][j])
                Cluster(cluster2, Pred(ttask[i][j]))
                if OK_Memory(cluster1) & OK_Energy(cluster1) then
                    Add_C(cluster1, tc)
                    empty(cluster1)
                end if
                if OK_Memory(cluster2) & OK_Energy(cluster2) then
                    Add_C(cluster2, tc)
                    empty(cluster2)
                end if
            else
                Add_F(ttask[i][j], tf)
            end if
        end for
    end for
    if cluster1 ≠ ∅ then
        Add_C(cluster1, tc)
    end if
    if cluster2 ≠ ∅ then
        Add_C(cluster2, tc)
    end if
end procedure
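
The task model of Section III-A, which Algorithm 1 takes as input, can be sketched in Python as follows. This is only an illustrative sketch: the class names `Node` and `Task`, the field names, and all numeric values are assumptions made for the example and are not taken from SPEX or from Figure 2.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A software function F_{k,i} or a hardware behavior B_{j,i} (5-tuple E, M, C, D, P)."""
    name: str
    kind: str                  # "sw" (function) or "hw" (behavior)
    exec_time: float           # E: execution time on CPU or FPGA
    size: int                  # M: memory in bytes (sw) or number of gates (hw)
    power: float               # C: power consumption
    deadline: float            # D: relative deadline
    period: float              # P: period
    constraint: str = "none"   # "inclusion", "exclusion" or "none"


@dataclass
class Task:
    """A task T_i = (V_i, E_i); each edge carries the couple (Pr, Cc)."""
    nodes: dict = field(default_factory=dict)   # name -> Node
    edges: dict = field(default_factory=dict)   # (src, dst) -> (Pr, Cc)

    def add_node(self, node):
        self.nodes[node.name] = node

    def add_edge(self, src, dst, pr, cc):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        if not 0.0 <= pr <= 1.0:
            raise ValueError("Pr must be a probability in [0, 1]")
        self.edges[(src, dst)] = (pr, cc)

    def predecessors(self, name):
        return [src for (src, dst) in self.edges if dst == name]

    def counts(self):
        """Return (n_{i,1}, n_{i,2}): number of behaviors, number of functions."""
        hw = sum(1 for n in self.nodes.values() if n.kind == "hw")
        return hw, len(self.nodes) - hw


# Hypothetical three-node task; the values are made up for illustration.
task = Task()
task.add_node(Node("F1", "sw", 2.0, 512, 0.3, 20.0, 20.0))
task.add_node(Node("B1", "hw", 0.5, 1200, 0.8, 20.0, 20.0))
task.add_node(Node("F2", "sw", 1.5, 256, 0.2, 20.0, 20.0, constraint="inclusion"))
task.add_edge("F1", "B1", pr=0.6, cc=10)
task.add_edge("F1", "F2", pr=0.4, cc=8)
```

Keeping the couple (Pr, Cc) on the edge rather than on the node mirrors the edge weighting described above, so both the probability-driven and the cost-driven partitioning phases can read their inputs from the same structure.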

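The functional partitioning phase of Algorithm 1 can be illustrated with the following minimal Python sketch. It is a simplification under stated assumptions, not the SPEX implementation: it handles a single task, represents nodes as plain dictionaries, uses scalar memory and energy budgets (`mem_cap`, `energy_cap`, both hypothetical parameters), reports groups that exceed a budget in a separate `rejected` list instead of carrying them over, and does not merge overlapping constraint groups.

```python
def functional_partition(task, mem_cap, energy_cap):
    """Cluster the nodes of one task that carry inclusion/exclusion constraints.

    task maps a node name to a dict with keys "constraint", "preds", "mem", "energy".
    Returns (clusters, unconstrained, rejected).
    """
    clusters, rejected, owner = [], [], {}

    def place(names):
        # Assumption: a node that is already clustered stays where it is.
        group = {n for n in names if n not in owner}
        if not group:
            return
        mem = sum(task[n]["mem"] for n in group)
        energy = sum(task[n]["energy"] for n in group)
        if mem <= mem_cap and energy <= energy_cap:   # OK_Memory & OK_Energy
            clusters.append(group)
            for n in group:
                owner[n] = len(clusters) - 1
        else:
            rejected.append(group)

    for name, node in task.items():
        if node["constraint"] == "inclusion":
            # The node shares a cluster with its predecessors.
            place({name, *node["preds"]})
        elif node["constraint"] == "exclusion":
            # The node and its predecessors go to different clusters.
            place({name})
            place(set(node["preds"]))

    # Unconstrained nodes that were not pulled into a cluster are left
    # for the later partitioning phases (the table tf in Algorithm 1).
    unconstrained = [n for n, node in task.items()
                     if node["constraint"] == "none" and n not in owner]
    return clusters, unconstrained, rejected


def make_node(constraint, preds, mem=10, energy=1):
    return {"constraint": constraint, "preds": preds, "mem": mem, "energy": energy}


# Hypothetical task: F2 must run with F1, F4 must not run with F3.
example = {
    "F1": make_node("none", []),
    "F2": make_node("inclusion", ["F1"]),
    "F3": make_node("none", ["F1"]),
    "F4": make_node("exclusion", ["F3"]),
    "F5": make_node("none", ["F2"]),
}
clusters, unconstrained, rejected = functional_partition(example, mem_cap=64, energy_cap=8)
```

With these hypothetical budgets, `F1` and `F2` end up in one cluster, `F4` and `F3` in two separate clusters, and `F5` is the only function left unclustered.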
We place the rest of the functions/behaviors in the already used physical components within the limits of the available resources. In the case of a resource shortage, other physical components are allocated. If there are available memory and energy on a created cluster, then sub-tasks from different tasks can be associated with the cluster. The main rules applied at this level are:
• Rule 1: $\forall \zeta_l$, $l \in [1..\gamma]$, $\forall T_i \in \zeta_l$, $i \in [1..R]$, for each pair of functions $F_{k,i}$ and $F_{h,i}$ such that $F_{k,i} \in Inclu(F_{h,i})$, group $F_{k,i}$ and $F_{h,i}$ on the same cluster,
• Rule 2: $\forall \zeta_l$, $l \in [1..\gamma]$, $\forall T_i \in \zeta_l$, $i \in [1..R]$, for each pair of functions $F_{k,i}$ and $F_{h,i}$ such that $F_{k,i} \in Exclu(F_{h,i})$, put $F_{k,i}$ and $F_{h,i}$ on different clusters.

Algorithm 1 describes this partitioning phase, where (i) ttask is a table containing the tasks of a configuration, (ii) tc is a table that will hold the constructed clusters, (iii) tf is a table that will hold the functions that are not affected by the inclusion/exclusion constraints, (iv) Cluster_T(c, F, tab) is a function that stores a function F and all the elements of a table tab into a cluster c, (v) Cluster(c, F) is a function that stores a function F into a cluster c, (vi) OK_Memory(c) is a function that indicates memory availability in a cluster c, (vii) OK_Energy(c) is a function that indicates energy availability in a cluster c, (viii) Add_C(c, tab) is a function that adds a cluster c to a table tab, and (ix) Add_F(F, ta) is a function that adds a function F to a table ta.

Inclusion/exclusion is a hard constraint; thus clustered elements are locked and will not be moved to other clusters during the remaining process.

2) Hierarchical Partitioning: Clusters the remaining functions that have no inclusion/exclusion constraints. The functions are evaluated by the probabilities of their connecting edges, and high probability values are treated first. For each remaining function $F_{k,i}$, all its predecessors are assessed to determine the highest probability value of their connecting edges. $F_{k,i}$ is associated with the cluster where the predecessor having the highest edge probability value is located.

Algorithm 2 Hierarchical partitioning algorithm
procedure HIERARCHICAL-PARTITION(var tc: TAB-CLUSTER, tf: TAB-FUNCTION, ttask: TAB-TASK)
    for k = 1 to length(tf) do
        PT ← FetchTask(ttask, tf[k])
        Pred(ttask[PT], tf[k], Tpred)
        ok ← false
        repeat
            max ← maxProba(tf[k], Tpred)
            PC ← FetchCluster(tc, Tpred[max])
            if OK_Memory(tc[PC]) & OK_Energy(tc[PC]) then
                tc[PC] ← tf[k]
                ok ← true
            else
                Tpred[max] ← 0
            end if
        until ok
    end for
end procedure

Algorithm 2 describes this partitioning phase, where (i) ttask is a table containing the tasks of a configuration, (ii) tc is a table that will hold the constructed clusters, (iii) tf is a table that will hold the functions that are not affected by the inclusion/exclusion constraint, (iv) FetchTask(ttask, F) is a function that searches a table of tasks ttask in order to determine the index of the task that includes the function F, (v) maxProba(F, Tpred) determines the maximum probability value of the edges connected to a function F using the table of its predecessors Tpred, (vi) OK_Memory(c, F) is a boolean function that returns true if there is enough memory on a cluster c for a function F, (vii) OK_Energy(c) is a boolean function that returns true if there is enough energy on a cluster c for a function F, (viii) Pred(ta, func, Tpred) is a function that returns the predecessors of a function func using its corresponding task ta and stores them into a table Tpred, and (ix) FetchCluster(tc, F) is a function that searches a table of clusters tc in order to determine the index of the cluster storing the function F.

The rules to be applied are:
• Rule 3: for any remaining un-clustered $F_{k,i} \in T_i$, determine its predecessors $F_{k-1,i}$ in the DAG of $T_i$,
• Rule 4: for any $F_{k,i}$, extract the couple $\prec F_{k,i}, F_{k-1,i} \succ$ with the highest edge probability and cluster $F_{k,i}$ with its related clustered functions having the highest edge probability.

3) Kernighan-Lin: Optimizes the generated clusters. This phase evaluates both the probability and the communication cost on the edges connecting functions by a gain calculation. If the gain is positive, then the function is moved to another cluster.

Algorithm 3 Kernighan-Lin optimization algorithm
procedure KERNIGHAN-LIN-OPTIMIZATION(ttask: TAB-TASK, tf: TAB-FUNCTION, tc: TAB-CLUSTER)
    for h = 1 to length(tf) do
        PT ← FetchTask(ttask, tf[h])
        Pred(ttask[PT], tf[h], Tpred)
        Func ← maxGain(tf[h], Tpred)
        if Func ≠ NULL then
            PC ← FetchCluster(tc, Func)
            if OK_Memory(tc[PC]) & OK_Energy(tc[PC]) then
                tc[PC] ← tf[h]
                P ← FetchCluster(tc, tf[h])
                Remove(tc[P], tf[h])
            end if
        end if
    end for
end procedure

This step applies the following rules:
• Rule 5: start by choosing an unlocked function $F_{k,i}$,
• Rule 6: calculate the gain $G_F$ of moving $F_{k,i}$ from one partition to another, $G_F = (Cc_e \times Pr_e) - (Cc_h \times Pr_h)$, where (i) $Cc_e$ is the communication cost of the edges connecting $F_{k,i}$ with $F_{e,i}$ placed in another cluster, (ii) $Cc_h$ is the communication cost of the edges connecting $F_{k,i}$ with


$F_{h,i}$ placed in its own cluster, (iii) $Pr_e$ is the probability of edges connecting $F_{k,i}$ with $F_{e,i}$ placed in another cluster, and (iv) $Pr_h$ is the probability of edges connecting $F_{k,i}$ with $F_{h,i}$ placed in its own cluster.
• Rule 7: If $G_F \geq 0$ then we move $F_{k,i}$ to another cluster.

The Kernighan-Lin optimization algorithm is described in Algorithm 3 where (i) ttask is a table containing the tasks of a configuration, (ii) tc is a table that will hold the constructed clusters, (iii) tf is a table that will hold the functions that are not affected by the inclusion/exclusion constraints, (iv) maxGain(F, tab) is a function that returns, from a table tab storing the predecessors of F, the function with the maximum gain value when stored with F on the same cluster, and (v) Remove(c, F) is a function that removes a function F from a cluster c.

IV. R-CODESIGN
In this section, we present the R-codesign methodology. A system specification according to the proposed probabilistic modeling DAGs is the input of R-codesign. From these DAGs, hardware and software tasks are extracted and processed through the I-codesign engine. A mapping process follows the I-codesign algorithms and a mapping matrix is generated in order to be used further by the co-simulation module. A validation strategy is then applied. If the performance results from the validation module are not satisfactory, then the I-codesign module is called again and a new mapping is computed. There is no upper bound on the allocation created by the I-codesign module. The designed system should be capable of running different configurations. By applying our methodology, an allocation for each specified configuration is created.

Algorithm 4 R-codesign Algorithm
procedure R-CODESIGN(Tconf: TAB-CONFIGURATION, length[Tconf]: INTEGER)
    if length(Tconf) > 0 then
        NbT ← Number(Tconf[length(Tconf)])
        for k = 1 to NbT do
            TaskExtraction(T_k, DAG_sw, DAG_hw)
            repeat
                MappingTable ← Icodesign(DAG_hw, HW)
                MappingTable ← Icodesign(DAG_sw, SW)
                PerformanceResults ← CoSimulation(MappingTable)
            until PerformanceResults == "ok"
        end for
        R-codesign(Tconf, length[Tconf] − 1)
    end if
    GenerateController()
end procedure

Algorithm 4 implements the R-codesign methodology. The input specification can be composed of multiple configuration scenarios where each scenario executes a set of tasks. Thus we define Tconf as the configurations table that will be the main input of Algorithm 4, and Number(conf) is a function that returns the number of tasks per configuration conf. Figure 3 represents the flow diagram of the methodology. The R-codesign steps are stated as follows:

A. Task Extraction
R-codesign starts with extracting software functions and hardware behaviors from the system specification DAGs. It constructs a sub-DAG for each type of task element. The task extraction is performed as follows.
• $\forall \zeta_l,\ l \in [1..\gamma],\ \forall T_i \in \zeta_l,\ i \in [1..R],\ \forall F_{k,i}/B_{j,i} \in T_i$, add $\prec F_{k,i}, F_{k-1,i} \succ$ to the software sub-DAG and $\prec B_{j,i}, B_{j-1,i} \succ$ to the hardware sub-DAG.
During the extraction phase, we adjust the communication cost and the probability estimation of behavior edges that are separated by function(s), and of function edges that are separated by behavior(s). The transfer cost is the sum of the individual communication costs of the edges separating the communicating functions/behaviors. The edge probability is the product of the individual probabilities of the edges separating the communicating functions/behaviors. For example, in Figure 2, the probability on the edge connecting $F_{1,i}$ to $F_{2,i}$ is equal to 0.6 × 0.8 = 0.48 while the communication cost is 10 + 12 = 22. If more than one function/behavior communicates with another function/behavior, i.e., a node of the graph has more than one incoming edge, then we adjust the communication cost of the edge that will bind the two behaviors by summing the communication cost along each communicating path and taking the highest value. The adjusted probability is the sum of the probabilities of the multiple paths (its value is 1 when the sum exceeds 1). As mentioned earlier, the path probability is the product of the probabilities of all the individual edges constituting the path.

B. I-codesign for hardware behaviors
The inclusion/exclusion constraints are hard constraints that generally decide the number of clusters to be created and lock the concerned behaviors in terms of placement. If behaviors share the control of the same components, then they are deployed on the same FPGA. Since each behavior has its own design, FPGAs can have several implementations. Multiplexers can be used in order to switch from one implementation to another. The assignment of the behaviors based on this constraint is formalized as follows.

$$\forall B_{p,i}, B_{j,i} \in Assign(Cl),\ p, j \in [1..n_{i,1}],\ B_{p,i} \notin Exclu(B_{j,i}) \tag{1}$$

$$\forall B_{p,i}, B_{j,i} \,/\, B_{p,i} \in Inclu(B_{j,i}),\ \text{then}\ B_{p,i}, B_{j,i} \in Assign(Cl) \tag{2}$$

where $Cl$ is a cluster created based on the exclusion/inclusion constraint. The number of created clusters depends on the number of behaviors on the hardware DAG. $Inclu(B_{j,i})$ designates the behaviors that are related by inclusion to $B_{j,i}$. $Exclu(B_{j,i})$ designates the behaviors that are related by exclusion to $B_{j,i}$. $Assign(Cl)$ groups the set of behaviors assigned to the cluster $Cl$. We define $N_{Cl}$ as the number of elements associated with a cluster $Cl$.
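As an illustration of constraints (1) and (2), the following minimal sketch checks whether a candidate assignment of behaviors to clusters respects the exclusion and inclusion relations. The function name `violations`, the dictionary encoding of Exclu and Inclu, and the behavior names are assumptions made for this example only; they are not part of the R-codesign tooling.

```python
from typing import Dict, List, Set


def violations(clusters: List[Set[str]],
               exclu: Dict[str, Set[str]],
               inclu: Dict[str, Set[str]]) -> List[str]:
    """Return readable violations of constraints (1) and (2)."""
    problems: List[str] = []
    # Map each behavior to the index of the cluster it is assigned to.
    where = {b: idx for idx, cl in enumerate(clusters) for b in cl}
    # Constraint (1): two behaviors of one cluster must not exclude each other.
    for idx, cl in enumerate(clusters):
        for b in sorted(cl):
            for other in sorted(exclu.get(b, set()) & cl):
                problems.append(f"exclusion: {b} and {other} share cluster {idx}")
    # Constraint (2): behaviors related by inclusion must share a cluster.
    for b, partners in inclu.items():
        for other in sorted(partners):
            if where.get(b) != where.get(other):
                problems.append(f"inclusion: {b} and {other} are separated")
    return problems


# Hypothetical relations: B1 excludes B3, and B1 must be placed with B2.
exclu = {"B1": {"B3"}}
inclu = {"B1": {"B2"}}
print(violations([{"B1", "B2"}, {"B3"}], exclu, inclu))  # prints []
```

An empty list means the assignment is admissible with respect to (1) and (2); resource and timing constraints are checked separately.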


[Figure 3. R-codesign flow. Flow diagram in two panels, (a) and (b). Labels recoverable from the diagram: Specification, Task Extraction, Hw-Task/Sw-Task, I-codesign Module (Functional Partitioning, Hierarchical Partitioning, Kernighan-Lin Optimization, each followed by a constraint check leading to SwitchCluster()/CreateCluster() on "Unsatisfied Constraint" or continuing on "OK"), Add Task to Mapping Table, Mapping Results, Validation Strategy (Hw/Sw Co-simulation of C code and VHDL code, Performance Results), Controller Generation, Controller Matrix, Deployment.]
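The flow of Figure 3 and Algorithm 4 can be sketched as a small driver loop. This is an iterative restatement under stated assumptions rather than the authors' implementation: the callables `extract`, `icodesign` and `cosimulate` are hypothetical stand-ins for task extraction, the I-codesign module and the co-simulation step, and the `max_rounds` bound is added here only so that the sketch always terminates.

```python
from typing import Callable, Dict, List


def r_codesign(configurations: Dict[str, List[str]],
               extract: Callable[[str], tuple],
               icodesign: Callable[[object, str], Dict[str, str]],
               cosimulate: Callable[[Dict[str, str]], bool],
               max_rounds: int = 10) -> Dict[str, Dict[str, str]]:
    """Build one mapping table per configuration (the controller matrix)."""
    controller_matrix: Dict[str, Dict[str, str]] = {}
    for conf, tasks in configurations.items():
        mapping: Dict[str, str] = {}
        for task in tasks:
            dag_sw, dag_hw = extract(task)
            # Remap until the co-simulation accepts the candidate mapping.
            for _ in range(max_rounds):
                candidate = {**icodesign(dag_hw, "HW"), **icodesign(dag_sw, "SW")}
                if cosimulate(candidate):
                    mapping.update(candidate)
                    break
            else:
                raise RuntimeError(f"no valid mapping found for {task}")
        controller_matrix[conf] = mapping
    return controller_matrix
```

Passing the three steps as callables keeps the sketch independent of any particular partitioner or simulator.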

Each cluster created for hardware tasks, $Cl = \{B_{1,i}, B_{2,i}, \ldots, B_{N_{Cl},i}\}$, is composed of behaviors and will be implemented on a single FPGA unit. The reconfigurable hardware device offers a certain amount of computational resources, e.g., the configurable logic blocks of an FPGA, which is referred to as the $S_{FPGA}$ parameter of the device. At each iteration of the I-codesign methodology, a placement decision can affect the hardware units. Hence, the available area on the FPGA must be sufficient in order to execute the assigned behaviors. Thus, we propose to apply the following constraint:

$$\sum_{B_{j,i} \in Cl} M^{hw}_j(B_{j,i}) \leq S_{FPGA} \tag{3}$$

The third constraint concerns the bandwidth, which is correlated to the transmission data rate and expressed in bytes per second. The bandwidth affects the transmission capacity between the linked components (CPUs, FPGAs, IPs), particularly when there are data dependencies between two tasks located in different hardware units. This dependency constraint is defined as follows: a behavior $B_{p,i}$ placed in a cluster $Cl_v$ depends on a behavior $B_{j,i}$ placed in a cluster $Cl_u$:

$$\forall B_{p,i} \leftrightarrow B_{j,i},\ B_{p,i} \in Cl_v,\ B_{j,i} \in Cl_u \tag{4}$$

$$Bandwidth(Cl_v, Cl_u) = \sum_{B_{p,i} \in Cl_v,\, B_{j,i} \in Cl_u} Bandwidth(B_{p,i}, B_{j,i}) \leq BW_{v,l} \tag{5}$$

where $BW_{v,l}$ stands for the available bandwidth between two tiles $Hw_v$ and $Hw_l$. $B_{p,i} \leftrightarrow B_{j,i}$ means that $B_{p,i}$ and $B_{j,i}$ are placed on different clusters $Cl_u$ and $Cl_v$ and that they have a data dependency. The expression $Bandwidth(B_{p,i}, B_{j,i})$ corresponds to the bandwidth between $B_{p,i}$ and $B_{j,i}$. The verification step also includes the energy consumed by a given FPGA. Indeed, the energy consumption of a partition $Cl$ depends on the operating frequency $Freq$ selected for the current configuration and on the number of available gates $S_{FPGA}$ of the corresponding tile. The electrical power constraint is given by

$$S_{FPGA} \cdot [Freq]^3 \leq Pw_{FPGA} \tag{6}$$

In case of unavailable area or insufficient energy, the problem is reported to the I-codesign methodology and another FPGA is allocated. When all these constraints are satisfied, the real-time feasibility is evaluated.

$$U_k = \sum_{B_{j,i} \in Cl} E_{hw}(B_{j,i}) / P_{hw}(B_{j,i}) \leq 1 \tag{7}$$

According to the EDF scheduling algorithm, feasibility is tested using Eq. (7), where $U_k$ is the utilization of the cluster $Cl$. The constraints that have to be respected at each step of the co-design for the hardware behaviors are summarized by

$$\begin{cases}
\sum_{B_{j,i} \in Cl} M^{hw}_j(B_{j,i}) \leq S_{FPGA} \\
\sum_{B_{p,i} \in Cl_v,\, B_{j,i} \in Cl_u} Bandwidth(B_{p,i}, B_{j,i}) \leq BW \\
S_{FPGA} \cdot [Freq]^3 \leq Pw_{FPGA} \\
U_k = \sum_{B_{j,i} \in Cl} E_{hw}(B_{j,i}) / P_{hw}(B_{j,i}) \leq 1
\end{cases} \tag{8}$$
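The hardware constraints (3) and (5) to (7) can be evaluated mechanically for a candidate cluster. The sketch below is one possible encoding, assuming a hypothetical `Behavior` record and normalized, illustrative numbers; the units of $S_{FPGA}$, $Freq$ and $Pw_{FPGA}$ are left abstract because the text does not fix them.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Behavior:
    name: str
    exec_time: float  # E_hw
    period: float     # P_hw
    area: float       # M_hw, in the same unit as s_fpga


def cluster_checks(cluster: List[Behavior], s_fpga: float,
                   freq: float, pw_fpga: float) -> Dict[str, bool]:
    """Evaluate the per-cluster constraints: area, power and EDF utilization."""
    utilization = sum(b.exec_time / b.period for b in cluster)
    return {
        "area": sum(b.area for b in cluster) <= s_fpga,  # Eq. (3)
        "power": s_fpga * freq ** 3 <= pw_fpga,          # Eq. (6)
        "edf": utilization <= 1.0,                       # Eq. (7)
    }


def bandwidth_ok(cl_v: List[Behavior], cl_u: List[Behavior],
                 traffic: Dict[Tuple[str, str], float], bw: float) -> bool:
    """Eq. (5): the summed traffic between two clusters must fit the link."""
    total = sum(traffic.get((p.name, j.name), 0.0) for p in cl_v for j in cl_u)
    return total <= bw


# Illustrative values only (not taken from the case study).
cluster = [Behavior("B1", 3, 60, 0.10), Behavior("B3", 5, 90, 0.05)]
print(cluster_checks(cluster, s_fpga=1.0, freq=1.0, pw_fpga=2.0))
```

A cluster is accepted only when every entry of the returned dictionary is true and `bandwidth_ok` holds for each pair of communicating clusters; otherwise, following the text, another FPGA would be allocated.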


C. I-codesign for software functions
The previously described equations (1) and (2) are also applicable to the software functions when dealing with the inclusion/exclusion constraints. Each cluster $Cl$ created for software tasks by the I-codesign algorithms will be mapped to a CPU unit and is composed of $N_{Cl}$ functions. We assume that all the target MPSoC CPUs are homogeneous. Thus, they have common characteristics ($S_{CPU}$, ...). For a given software sub-DAG of $T_i$ composed of $n_{i,2}$ functions, the following constraints should be verified. The available memory on the designed CPUs at an iteration must be sufficient in order to place the assigned functions. An extra CPU is designated in case of resource shortage. We define the memory space constraint as follows.

$$\sum_{F_{k,i} \in Cl} M^{sw}_k(F_{k,i}) \leq S_{CPU} \tag{9}$$

The bandwidth constraint is applied by following equations (4) and (5), using functions $F_{k,i}$ instead of behaviors $B_{j,i}$. The energy constraint is verified using equation (10), where $V$ is the voltage and $C_a$ is the capacitance of the corresponding processor.

$$Freq \times V^2 \times C_a \leq Pw_{CPU} \tag{10}$$

Real-time feasibility is verified according to the EDF algorithm following the equation below.

$$U_k = \sum_{F_{k,i} \in Cl} E_{sw}(F_{k,i}) / P_{sw}(F_{k,i}) \leq 1 \tag{11}$$

where $U_k$ is the utilization of a cluster $Cl$, $E_{sw}$ is the execution time of $F_{k,i}$ and $P_{sw}$ is the period of $F_{k,i}$. The constraints that have to be respected at each step of the co-design for the software functions are summarized by

$$\begin{cases}
\sum_{F_{k,i} \in Cl} M^{sw}_k(F_{k,i}) \leq S_{CPU} \\
\sum_{F_{k,i} \in Cl_v,\, F_{v,i} \in Cl_u} Bandwidth(F_{k,i}, F_{v,i}) \leq BW \\
Freq \times V^2 \times C_a \leq Pw_{CPU} \\
U_k = \sum_{F_{k,i} \in Cl} E_{sw}(F_{k,i}) / P_{sw}(F_{k,i}) \leq 1
\end{cases} \tag{12}$$

D. Hardware software co-Simulation
Hardware/software co-simulation verifies the feasibility of mixed hardware/software descriptions in terms of timing constraints. Implementing the co-simulation consists in writing a set of HW components in VHDL and a set of SW components (C programs) and linking them together with communication interfaces. Running the co-simulation leads to two important results: (i) the execution status, i.e., success or failure, and (ii) an execution trace which can be used for further analysis. The correctness of this verification is defined as follows: the system fails when it reaches an undefined state, or when its predefined time frame is violated and no time-out action is defined [37]. If the co-simulation fails, then a remapping scheme is calculated using the I-codesign module.

E. Controller Generation
The software specification is divided into configurations. Each configuration has a set of tasks to be executed when initiated. Initially, a boot configuration is loaded. However, a reconfiguration can occur at run-time, which requires reconfiguring the system tasks (replacing them, changing their parameters, changing their functionality, etc.). Therefore, a pre-calculated mapping of all the possible configuration scenarios is necessary and will lead to better performance. Thus, we propose to build a controller module that manages the reconfiguration. It reacts to internal or external events that induce reconfigurations. When executing the R-codesign process, a matrix is constructed based on the output of each partitioned configuration (task set). The task mapping to the execution platform is stored in the controller matrix along with its corresponding execution scenario. Hence, when the system is executing and a reconfiguration occurs, the constructed table is consulted and the proper partitioning according to the I-codesign scheme is applied.

[Figure 4. The controller State Diagram. Labels recoverable from the diagram: Monitoring, Loading, Execution, Switching, Execute, Monitor, Reconfiguration, End of Execution, Terminate.]

The controller receives internal or external events and initiates the necessary reconfiguration. Figure 4 shows the state diagram of the controller. An example of the controller matrix is presented in Figure 5, where the specification is composed of three tasks $T_1$, $T_2$ and $T_3$ and two configurations $conf_1 = \{T_1, T_2, T_3\}$ and $conf_2 = \{T_1, T_2\}$. The number of rows in this matrix is equal to $\gamma$, the number of possible configurations. The number of columns is equal to $\sum_{i=1}^{R} (n_{i,1} + n_{i,2})$. The matrix associates each task function/behavior with a specified PE (Processing Unit: CPU, FPGA) when the corresponding configuration is selected.

        |        T1         |           T2            |           T3
        | F11   F12   B11   | F12   F22   F32   B12   | F13   F23   B13    B23
Conf1   | CPU1  CPU1  FPGA1 | CPU1  CPU2  CPU2  FPGA2 | CPU2  CPU2  FPGA2  FPGA1
Conf2   | CPU1  CPU1  FPGA2 | CPU1  CPU2  CPU1  FPGA1 | --    --    --     --

Figure 5. The controller Matrix.

V. EXPERIMENTAL RESULTS
A. R-Codesign Environment: SPEX
We develop a co-design execution environment that is called SPEX. It provides a toolbox in order to create a hardware/software system description according to the proposed


[Figure 6. The tool Architecture. Blocks recoverable from the diagram: Input Parameters, Task Set Generator, Task decomposition, Task DAG, Hw/Sw partitioner, Hardware Architecture, Cluster Scheduler, Controller, Controller Code, Execution Environment.]

design models and implements the co-design algorithms. It proposes a flexible task set generator for different scenarios and purposes. The tool places the software specification following several design constraints such as inclusion/exclusion parameters, probabilistic execution of the software tasks, available memory and energy on the hardware units, and real-time parameters. At each iteration, it constructs the controller table that stores all the possible execution scenarios. For simulation purposes the tool loads a specification file, reads the software and hardware characteristics, applies the co-design algorithms and generates the controller table along with memory and energy estimations. Figure 6 summarizes the general tool structure. The tool is composed of four different parts: 1) Task Set Generator (TSG), 2) Task Decompositioner (TD), 3) Task Partitioner (TP), 4) Execution Environment (EE). TSG is set with parameters such as CPU utilization and the desired number of tasks, and then creates a task set that is called a configuration. The design constraints (probability, inclusion/exclusion, communication costs, and dependency) are randomly generated by the tool. The generated configuration is passed as input to TD, which decomposes the tasks into elementary functions with design constraints. TG produces the task graphs. Then, TP performs the partitioning algorithms and generates optimized clusters. Figure 7 summarizes the TP structure. Finally, EE executes each cluster on the associated computing unit and collects results for reporting energy and memory utilization and controller status. The output results allow us to assess the use of the static partitioning stored in the controller table (generated by I-codesign) and to compare the I-codesign methodology with a legacy dynamic mapping scheme.

[Figure 7. The partitioning flow graph of SPEX. Labels recoverable from the diagram: Task specification and Architecture Templates (JSON description file), Task Partitioner, Partition, Multiprocessor Scheduler, PE Allocation, Controller Decision, Feasibility Evaluation (Fail/Success), Performance Results, Result.]

B. A Case Study
We consider a system having two possible configurations denoted by $conf_1$ and $conf_2$. This system is specified by a task graph composed of three tasks $T_1$, $T_2$ and $T_3$ according to the R-codesign modeling techniques. The required memory size of the application is 2.3 MB, where $T_1$, $T_2$ and $T_3$ require respectively 1.5 MB, 0.5 MB and 0.3 MB. The composition of each configuration is: $conf_1 = \{T_1, T_2\}$ and $conf_2 = \{T_1, T_3\}$. Figure 8 presents the DAG of the task $T_1$. The target hardware is an MPSoC composed of two identical tiles where each tile has a CPU, a reconfigurable unit and a local memory. The size of each local memory is 1.2 MB.

[Figure 8. The DAG of T1. The graph contains the functions F1,1 to F10,1 and the behaviors B1,1 to B13,1; each edge is labeled probability/communication cost (e.g., 0.6/10, 0.8/12).]

The reconfigurable device includes one million gates. The master and slave tiles are connected to a shared bus. Arbitration is resolved by a bus arbiter. It periodically examines pending requests from the master and grants access using arbitration mechanisms specified by the bus protocol.
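The memory figures quoted for the case study can be checked with a few lines of arithmetic. The sketch below assumes, for illustration only, that the quoted task sizes count entirely against the tiles' local memories; the function name `memory_report` is hypothetical.

```python
# Figures quoted in the case study, in MB.
task_memory = {"T1": 1.5, "T2": 0.5, "T3": 0.3}
configurations = {"conf1": ["T1", "T2"], "conf2": ["T1", "T3"]}
tile_memory = 1.2  # local memory per tile
tiles = 2


def memory_report(conf: str):
    """Return (required MB, fits on the platform, tasks larger than one tile)."""
    need = sum(task_memory[t] for t in configurations[conf])
    capacity = tiles * tile_memory
    # A task larger than one local memory cannot be placed whole on one tile.
    must_split = [t for t in configurations[conf] if task_memory[t] > tile_memory]
    return need, need <= capacity, must_split


for conf in configurations:
    print(conf, memory_report(conf))
```

Under this assumption both configurations fit within the 2.4 MB offered by the two tiles, but $T_1$ alone (1.5 MB) exceeds a single 1.2 MB local memory, which is consistent with its functions and behaviors being partitioned across both tiles.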


Table I
SOFTWARE PARAMETERS OF T1

Function   $E^{sw}_k$   $P^{sw}_k$   $D^{sw}_k$   $M^{sw}_k$
F1,1       5            120          150          0.2
F2,1       3            120          200          0.15
F3,1       2            90           210          0.11
F4,1       6            110          180          0.02
F5,1       2            120          190          0.17
F6,1       2            200          250          0.05
F7,1       2            90           210          0.07
F8,1       6            110          180          0.01
F9,1       2            120          190          0.08
F10,1      2            200          250          0.02

The maximum bandwidth of the communication links is 1 Mbps. Each CPU has a power consumption of 10 W. The operating frequency $Freq$ ranges over [150..250] MHz. The software and hardware parameters used in the R-codesign phases are listed in Tables I and II, respectively. In this case study, the partitioning results as well as the controller generation are the main focus. The SPEX steps applied to this case study are described below.

Table II
HARDWARE PARAMETERS OF T1

Behavior   $E^{hw}_j$   $P^{hw}_j$   $D^{hw}_j$   $M^{hw}_j$
B1,1       3            60           150          0.1
B2,1       3            80           200          0.02
B3,1       5            90           210          0.05
B4,1       5            110          180          0.04
B5,1       5            120          190          0.02
B6,1       4            160          210          0.035
B7,1       1            180          220          0.08
B8,1       4            190          260          0.1
B9,1       4            160          210          0.02
B10,1      1            180          220          0.03
B11,1      4            190          260          0.022
B12,1      4            190          260          0.01

1) Task Extraction: The task extraction builds the graphs of the software functions and of the hardware behaviors.

[Figure 9. Task extraction: Hardware graph. Edges are labeled probability/communication cost.]

The probability and the communication cost are recalculated according to the connection roots and leaves on the original task's DAG. Figures 9 and 10 present the extracted hardware and software DAGs, respectively.

[Figure 10. Task extraction: Software graph. Edges are labeled probability/communication cost.]

2) Functional Partitioning: This step evaluates the inclusion/exclusion constraints and generates initial clusters with locked functions/behaviors.

[Figure 11. Resulted clusters after the Functional Partitioning. Software clusters Cl 1 and Cl 2; hardware clusters Cl 1 and Cl 2.]

Then it optimizes the number of generated clusters since their creation depends on these inclusion/exclusion constraints. The verification phase applied to the created clusters succeeds since the placements meet constraints (8) and (12). Figure 11 shows the initial clusters created after applying the functional partitioning algorithm on the hardware/software DAGs.
3) Hierarchical Partitioning: This phase optimizes the communication costs on the communication links since it keeps the most probable traffic on the same processor. It also optimizes processor occupation by loading each processor as fully as possible. Figure 12 shows the resulting clusters after applying the hierarchical partitioning algorithm.

[Figure 12. Resulted Clusters after the Hierarchical Partitioning. Software clusters Cl 1 and Cl 2; hardware clusters Cl 1 and Cl 2.]

The verification phase applied to the created clusters succeeds since the placements respect (8) and (12).
4) Kernighan-Lin Optimization: This phase aims to optimize the clusters resulting from the hierarchical clustering phase by iterative improvements. In our partitioning process, a combination of two metrics is used in order to optimize the traffic circulation of the system: the communication cost and the probabilistic estimations of the executions. Figure 13 shows the optimized clusters after applying the


Kernighan-Lin algorithm. The verification phase applied to the optimized clusters succeeds since the placements respect (8) and (12).

[Figure 13. Resulted Clusters after the Kernighan Lin Optimization. Software clusters Cl 1 and Cl 2; hardware clusters Cl 1 and Cl 2.]

5) Controller Generation: For each configuration, SPEX runs R-codesign on the hardware/software specification and constructs the controller matrix. The generated matrix for this case study is shown in Figure 14. The output of SPEX is also presented in Figure 15.

Software   F1,1   F2,1   F3,1   F4,1   F5,1   F6,1   F7,1   F8,1   F9,1   F10,1
conf1      CPU1   CPU1   CPU1   CPU2   CPU1   CPU2   CPU2   CPU2   CPU2   CPU2
conf2      CPU2   CPU2   CPU2   CPU1   CPU2   CPU1   CPU1   CPU1   CPU1   CPU1

[Hardware part of the matrix: behaviors B1,1 to B13,1, each mapped to RHW1 or RHW2 for conf1 and conf2; the individual cell values are not legible in this copy.]

Figure 14. Resulted Controller Matrix of task T1.

[Figure 15. Output of SPEX.]

C. Evaluation
To evaluate the R-codesign methodology, several task sets of different dimensions are generated. The generated tasks are processed through SPEX and we obtain the generated controller matrix along with the mapping scheme of each execution scenario.
For performance simulation, we use the ARTS framework, which is a simulation tool for user-driven abstract MPSoC design explorations. The framework allows us to: (i) model processing elements (PE), memory units and interconnect, (ii) investigate PE utilization, memory usage, communication issues, and energy/power consumption, and (iii) analyze the causality between MPSoC components, i.e., resource constraints and inter-dependencies [38]. The application model is based on task graphs, where the exact functionality of a task is abstracted away and expressed using a set of timing constraints (execution time, deadline and offset). Recording files are generated that provide an overview of the architecture under test, the profile of the application, the PE utilization, and the memory and communication costs. The ARTS model captures the impact of dynamic and unpredictable behavior on processor, memory and communication performance. In particular, it focuses on analyzing the impact of application mapping on processor and memory utilization, taking the on-chip communication latency into account.
The evaluated performance parameters are the total communication costs of the system functions/behaviors, the total energy consumed during the system execution, the total number of exchanged messages, and the total execution time of the grouped task sets. The results obtained when varying the utilization of CPUs and FPGAs are compared with two partitioning and scheduling algorithms. The first, reported in [27], is a task allocation algorithm based on clustering; it finds a near-optimal solution and tries to minimize the total system cost by forming clusters of tasks in such a way that the cluster with the minimum execution cost is allocated first. The second [28] is an algorithm that extends battery life by partitioning and scheduling the input tasks. We also compare R-codesign (R-co) with the traditional approach (TA), which calculates the appropriate mapping of tasks to processors during the run-time reconfiguration.
Figures 16, 17, 18 and 19 present the performance results for randomly generated task sets. Figure 16 describes the communication costs in terms of transfer delays through the communication medium, while Figure 19 enumerates the transferred messages. The comparison between the evaluated approaches demonstrates that R-codesign offers better performance, particularly with large utilization factors and a high number of nodes in the specification DAGs. These enhancements are due to the probabilistic estimation of communicating functions/behaviors, which places dependent tasks that are likely to execute successively on the same PE. Another advantage of R-codesign is the pre-calculated mapping of the possible run-time reconfigurations. This step helps significantly to minimize the reconfiguration overhead, as the comparison with TA makes clear. Due to these changes, a reduction of 30% of the global execution time is observed. Since execution time is a crucial performance parameter in embedded system design, adopting the proposed idea significantly improves the response time. Compared with existing works, the graphs show that R-codesign improves the communication costs by an average of 10%, and the exchanged messages are consequently reduced by an average of 12%. Simulation results show that this contribution has two benefits: (i) the number of exchanged messages has been noticeably reduced, and (ii) the global execution time has been minimized. Hence, the energy consumption will be reduced as


[Figure 16. Simulation Results for Communication Costs. Three panels (Task set I, Task set II, Task set III) plotting communication cost against utilization (0 to 1) for R-co, Alg[27], Alg[28] and TA.]

[Figure 17. Simulation Results for Energy Consumption. Three panels (Task set I, Task set II, Task set III) plotting consumed energy against utilization (0 to 1) for R-co, Alg[27], Alg[28] and TA.]

[Figure 18. Simulation Results for Execution Time. Three panels (Task set I, Task set II, Task set III) plotting execution time against utilization (0 to 1) for R-co, Alg[27], Alg[28] and TA.]

[Figure 19. Simulation Results for the number of exchanged messages. Three panels (Task set I, Task set II, Task set III) plotting message number against utilization (0 to 1) for R-co, Alg[27], Alg[28] and TA.]


a result of decreased execution time. Another advantage of R-codesign is its validation tests (see equation systems (8) and (12)), which avoid any issues related to a lack of resources.

VI. CONCLUSION

In this paper, we propose a complete methodology for modeling, partitioning and validating reconfigurable embedded system designs. We present a probabilistic estimation of the executions and a mathematical formalization of the design constraints. The performance improvement of the proposed techniques in terms of communication costs (the number of exchanged messages), consumed energy and required CPU time has been verified. Furthermore, the new partitioning approach, which combines iterative, constructive and functional techniques, allows efficient and optimized placement of the software/hardware specification while respecting the resource constraints. We proposed an execution model for the R-codesign methodology that relies on a controller module, which stores all the possible reconfiguration scenarios and manages the system tasks whenever a reconfiguration event occurs. Finally, we developed the SPEX tool, which allows the designer to: (i) write a specification according to the R-codesign system model, (ii) apply the new partitioning techniques and (iii) generate the controller table. We are working on enhancing SPEX, and in future work we intend to explore task migration mechanisms for an optimal reconfiguration process [39].

ACKNOWLEDGMENT

The authors would like to extend their sincere appreciation to the Deanship of Scientific Research at King Saud University for its funding of this Research Group No. (RGP-1436-040).

REFERENCES

[1] M. Uzam, Z. Li, G. Gelen, and R. S. Zakariyya, “A divide-and-conquer-method for the synthesis of liveness enforcing supervisors for flexible manufacturing systems,” Journal of Intelligent Manufacturing, vol. 27, no. 5, pp. 1111–1129, Oct 2016.
[2] Y. Chen, Z. Li, K. Barkaoui, and M. Uzam, “New Petri net structure and its application to optimal supervisory control: Interval inhibitor arcs,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 44, no. 10, pp. 1384–1400, Oct 2014.
[3] T. Kim and S. Tak, “Experience with hardware-software codesign of network protocol stacks supporting real-time inter-task communication,” in Proc. 10th International Conference on Computer and Information Technology (CIT), IEEE, United Kingdom, June 2010, pp. 26–32.
[4] S. Zhang, N. Wu, Z. Li, T. Qu, and C. Li, “Petri net-based approach to short-term scheduling of crude oil operations with less tank requirement,” Information Sciences, vol. 417, pp. 247–261, 2017.
[5] H. Grichi, O. Mosbahi, M. Khalgui, and Z. Li, “Rwin: New methodology for the development of reconfigurable wsn,” IEEE Transactions on Automation Science and Engineering, vol. 14, no. 1, pp. 109–125, Jan 2017.
[6] M. Gasmi, O. Mosbahi, M. Khalgui, L. Gomes, and Z. Li, “R-node: New pipelined approach for an effective reconfigurable wireless sensor node,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. PP, no. 99, pp. 1–14, 2017.
[7] H. Grichi, O. Mosbahi, M. Khalgui, and Z. Li, “New power-oriented methodology for dynamic resizing and mobility of reconfigurable wireless sensor networks,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. PP, no. 99, pp. 1–11, 2017.
[8] M. O. Ben Salem, O. Mosbahi, M. Khalgui, Z. Jlalia, G. Frey, and M. Smida, “Brometh: Methodology to design safe reconfigurable medical robotic systems,” The International Journal of Medical Robotics and Computer Assisted Surgery, vol. 13, no. 3, pp. e1786–n/a, 2017.
[9] F. Ferrandi, P. L. Lanzi, C. Pilato, D. Sciuto, and A. Tumeo, “Ant colony optimization for mapping, scheduling and placing in reconfigurable systems,” in Proc. NASA/ESA Conference on Adaptive Hardware and Systems (AHS), Italy, June 2013, pp. 47–54.
[10] X. Wang, I. Khemaissia, M. Khalgui, Z. Li, O. Mosbahi, and M. Zhou, “Dynamic low-power reconfiguration of real-time systems with periodic and probabilistic tasks,” IEEE Transactions on Automation Science and Engineering, vol. 12, no. 1, pp. 258–271, Jan 2015.
[11] T. K. Liu, Y. P. Chen, and J. H. Chou, “Developing a multiobjective optimization scheduling system for a screw manufacturer: A refined genetic algorithm approach,” IEEE Access, vol. 2, pp. 356–364, 2014.
[12] W. Housseyni, O. Mosbahi, M. Khalgui, Z. Li, and L. Yin, “Multiagent architecture for distributed adaptive scheduling of reconfigurable real-time tasks with energy harvesting constraints,” IEEE Access, vol. PP, no. 99, pp. 1–1, 2017.
[13] S. Brandstätter and M. Huemer, “A novel mpsoc interface and control architecture for multistandard rf transceivers,” IEEE Access, vol. 2, pp. 771–787, 2014.
[14] S. Dutt, “New faster kernighan-lin-type graph-partitioning algorithms,” in Proceedings of 1993 International Conference on Computer Aided Design (ICCAD), Nov 1993, pp. 370–377.
[15] D. W. Franke and M. K. Purvis, “Design automation technology for codesign: status and directions,” in Proc. IEEE International Symposium on Circuits and Systems, ISCAS ’92, May 1992, pp. 2669–2672.
[16] K. Li, X. Tang, B. Veeravalli, and K. Li, “Scheduling precedence constrained stochastic tasks on heterogeneous cluster systems,” IEEE Transactions on Computers, vol. 64, no. 1, Jan 2015.
[17] C. C. Kao, “Performance-oriented partitioning for task scheduling of parallel reconfigurable architectures,” IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 3, pp. 858–867, March 2015.
[18] R. Niemann and P. Marwedel, “An algorithm for hardware/software partitioning using mixed integer linear programming,” Design Automation for Embedded Systems, vol. 2, no. 2, pp. 165–193, Mar 1997.
[19] J. Grode, P. V. Knudsen, and J. Madsen, “Hardware resource allocation for hardware/software partitioning in the lycos system,” in Proceedings Design, Automation and Test in Europe, Feb 1998, pp. 22–27.
[20] K. S. Chatha and R. Vemuri, “Magellan: Multiway hardware-software partitioning and scheduling for latency minimization of hierarchical control-dataflow task graphs,” in Proceedings of the Ninth International Symposium on Hardware/Software Codesign, ser. CODES ’01, 2001, pp. 42–47.
[21] N. N. Binh, M. Imai, A. Shiomi, and N. Hikichi, “A hardware/software partitioning algorithm for designing pipelined asips with least gate counts,” in 33rd Design Automation Conference Proceedings, 1996, Jun 1996, pp. 527–532.
[22] G. Quan, X. Hu, and G. Greenwood, “Preference-driven hierarchical hardware/software partitioning,” in Proceedings 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors (Cat. No.99CB37040), 1999, pp. 652–657.
[23] P. Arato, S. Juhasz, Z. A. Mann, A. Orban, and D. Papp, “Hardware-software partitioning in embedded system design,” in IEEE International Symposium on Intelligent Signal Processing, 2003, Sept 2003, pp. 197–202.
[24] C. Wang, A. Wang, Y. Chen, A. Al-Ahmari, and Z. Li, “On computation reduction of liveness-enforcing supervisors,” IEEE Access, vol. 5, pp. 14775–14786, Aug 2017.
[25] C. Li, Y. Chen, Z. Li, and K. Barkaoui, “Synthesis of liveness-enforcing Petri net supervisors based on a think-globally-act-locally approach and vector covering for flexible manufacturing systems,” IEEE Access, vol. 5, pp. 16349–16358, June 2017.
[26] J. Shi, W. Liu, M. Jiang, H. Che, and L. Chen, “Software hardware co-simulation and co-verification in safety critical system design,” in Proc. IEEE International Conference on Intelligent Rail Transportation (ICIRT), China, Aug 2013, pp. 71–74.
[27] P. Bhardwaj and V. Kumar, “An effective load balancing task allocation algorithm using task clustering,” International Journal of Computer Applications, vol. 77, no. 7, pp. 32–39, September 2013.
[28] R. Shi, S. Yin, C. Yin, L. Liu, and S. Wei, “Energy-aware task partitioning and scheduling algorithm for reconfigurable processor,” in Proc. IEEE 11th International Conference on Solid-State and Integrated Circuit Technology (ICSICT), China, Oct 2012, pp. 1–3.
[29] X. Wang, Z. Li, and W. M. Wonham, “Dynamic multiple-period reconfiguration of real-time scheduling based on timed des supervisory control,” IEEE Transactions on Industrial Informatics, vol. 12, no. 1, pp. 101–111, Feb 2016.


[30] G. Yi, J. H. Park, and S. Choi, “Energy-efficient distributed topology control algorithm for low-power iot communication networks,” IEEE Access, vol. 4, pp. 9193–9203, 2016.
[31] G. Estrin, R. S. Fenchel, R. R. Razouk, and M. K. Vernon, “Sara (system architects apprentice): Modeling, analysis, and simulation support for design of concurrent systems,” IEEE Transactions on Software Engineering, vol. SE-12, no. 2, pp. 293–311, Feb 1986.
[32] C. U. Smith, G. A. Frank, and J. L. Cuadrado, “An architecture design and assessment system for software/hardware codesign,” in Proc. 22nd ACM/IEEE Design Automation Conference, Piscataway, NJ, USA, 1985, pp. 417–424.
[33] A. Kalavade and E. A. Lee, “A hardware-software codesign methodology for dsp applications,” IEEE Design & Test of Computers, vol. 10, no. 3, pp. 16–28, Jul. 1993.
[34] F. Rose, T. Carpenter, S. Kumar, J. Shackleton, and T. S. Honeywell, “A model for the coanalysis of hardware and software architectures,” in Proc. 4th International Workshop on Hardware/Software Co-Design. Washington DC, USA: IEEE Computer Society, 1996, pp. 94–103.
[35] V. Boppana, S. Ahmad, I. Ganusov, V. Kathail, V. Rajagopalan, and R. Wittig, “Ultrascale+ mpsoc and fpga families,” in Proc. IEEE Hot Chips Symposium (HCS), Cupertino, CA, Aug 2015, pp. 1–37.
[36] I. Ghribi, R. Abdallah, M. Khalgui, and M. Platzner, “New co-design methodology for real-time embedded systems,” in Proc. 11th International Conference on Software Engineering and Applications, Portugal, 2016, pp. 185–195.
[37] R. Gumzej and M. Colnaric, “An approach to modeling and verification of real-time systems,” in Proc. 4th IEEE International Symposium on Object-Oriented Real-Time Distributed Computing, 2001, pp. 283–290.
[38] P. Chandraiah and R. Doemer, “Designer-controlled generation of parallel and flexible heterogeneous MPSoC specification,” in Proc. 44th ACM/IEEE Design Automation Conference, San Diego, CA, June 2007, pp. 787–790.
[39] A. R. Gaiduk and N. N. Prokopenko, “Design of nonlinear optimal systems on basis of controlled jordan form,” in 2017 IEEE East-West Design & Test Symposium (EWDTS), Sept 2017, pp. 1–4.
