Codesign Paper Presentation
2169-3536 (c) 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2018.2799852, IEEE Access
reconfiguration scenario may increase the energy consumption and/or cause some tasks to violate their deadlines. Thus, this flexibility adds more complexity to their design process [9], [10]. Moreover, in real-time systems, task scheduling is critical and depends on previous mapping steps among the system processing elements [11], [12].

In the present paper, we propose a methodology called R-codesign for reconfigurable co-design. It aims to find solutions for modeling and partitioning probabilistic real-time systems having multiple reconfiguration scenarios. R-codesign creates a task allocation of SW functions and HW behaviors based on the user constraints and using heuristics. Its main concerns are guaranteed available resources, feasible scheduling and the generation of a reconfiguration controller. The gain of the methodology resides in two aspects: (i) the estimation of the execution flow makes it possible to store together the functions that are most likely to be executed, which reduces the communication costs and enhances the overall performance, and (ii) the precomputed mapping of the possible execution scenarios makes it possible to reconfigure the system at run-time with a minimum reconfiguration overhead.

Firstly, R-codesign presents an abstract model for hardware/software systems allowing early exploration of hardware/software executions and evaluation of design alternatives. This model supports incremental refinement and evaluation at multiple abstraction levels. The separation between software and hardware tasks is assumed to be done manually by users. The decision is made, after a complexity study of each module (node in the DAG), by an expert who knows the computational requirements of the system processing flow. Hardware is limited to specifically designed tasks that are, taken independently, very simple. Software implements algorithms that complete much more complex tasks. The entry point for R-codesign is a hardware/software specification modeled by a DAG (Directed Acyclic Graph) where nodes are software functions or hardware behaviors. The edges of these DAGs are valued with a probabilistic estimation of the execution of their connecting nodes along with the communication cost of the communicating nodes. The goal of the methodology is to partition and map all predefined possible configuration scenarios off-line onto a hardware target architecture that is mainly an MPSoC, and to implement a controller that will supervise and reconfigure the system on-line [13]. All the important and more likely configuration scenarios are pre-computed and given as input to the methodology. Each possible configuration is composed of a set of periodic tasks modeled according to the proposed DAG representation. Thus, we developed adequate partitioning and mapping techniques for the proposed hardware/software model. This partitioning/mapping approach is called I-codesign. Several design constraints are considered in this work, such as the inclusion/exclusion constraint, which is related to the functional specification of processors. An optimization phase is applied at the end of I-codesign using the Kernighan-Lin algorithm [14] in an attempt to find an optimal series of interchange operations between communicating elements in the DAGs. I-codesign treats the software functions and the hardware behaviors separately, and then a co-simulation step decides whether or not the mapping results satisfy the design constraints. If I-codesign fails to map the specification, R-codesign reports the issues to the user so that the input parameters can be tuned. Otherwise, the results are stored in a controller matrix which will be used by the reconfiguration controller at run-time. In the case of large systems, the global mapping matrix could be broken down into small matrices controlled by multiple controllers in a distributed fashion, which avoids reconfiguration fetch overhead.

Finally, R-codesign is a codesign methodology that allows hardware/software systems to be evaluated rapidly using abstract models. The partitioning and mapping results reveal its efficiency. Indeed, it guarantees a feasible solution and enhances the overall performance compared with existing methodologies.

The paper proceeds as follows. The next section describes useful background. Section III presents the system formalization and the notations used in this paper. In Section IV, the R-codesign methodology is developed. Section V presents the experimental results, and finally we conclude this paper in Section VI.

II. STATE OF THE ART

Hardware/software codesign can be considered as the process of concurrent and coordinated design of an electronic system comprising hardware as well as software components based on a system description that is implementation-independent [15], [16]. One of the key problems in hardware/software codesign is hardware/software partitioning [17]. One of the most relevant works dealing with partitioning is presented in [18]: a very sophisticated integer linear programming model for the joint partitioning and scheduling problem for a wide range of target architectures. This integer program is part of a 2-phase heuristic optimization scheme which aims at gaining better timing estimates using repeated scheduling phases, and using the estimations in the partitioning phases. The work in [19] presents a method for allocation of hardware/software resources for optimal partitioning. During the allocation algorithm, an estimated hardware/software partition is also built. The algorithm for this is basically a greedy algorithm: it takes the components one by one, and allocates the most critical building block of the current component to hardware. The study in [20] shows an algorithm to solve the joint problem of partitioning and scheduling. It consists basically of two local search heuristics: one for partitioning and one for scheduling. The two algorithms operate on the same graph, at the same time. The work in [21] considers partitioning in the design of ASIPs (application-specific instruction-set processors). It presents a formal framework and proposes a partitioning algorithm based on branch and bound. The research in [22] presents an approach that is largely orthogonal to other partitioning methods: it deals with the problem of hierarchically matching tasks to resources. It also shows a method for weighting partially defined user preferences, which can be very useful for multiple-objective optimization problems [23].

Along with the partitioning and mapping problem, co-simulation becomes an important area of research for the early
validation of design decisions [24], [25]. In co-simulation, the execution of software on CPUs is simulated using a virtual model of the processor hardware or with simulation models such as an ISS (Instruction Set Simulator). An ISS reduces the complexity of the system design compared with performing a pure gate level or register transfer level (RTL) hardware simulation, which is typically too slow. The co-simulation problem lies in coupling different models to make the hardware simulation sufficiently accurate [26]. Nowadays, we already live in a third generation of co-design technology with cross-level design

GPP (General Purpose Processor) and DSP (Digital Signal Processor)).

[Figure residue: architecture diagram with blocks CPU, RHw, I/O, Mem, Com, Tile, Tile, Bus1]
ready used physical components within the limits of available resources. In the case of resource shortage, other physical components are allocated. If there are available memory and energy on a created cluster, then sub-tasks from different tasks can be associated to the cluster. The main rules applied at this level are:

• Rule 1: ∀ ζl, l ∈ [1..γ], ∀ Ti ∈ ζl, i ∈ [1..R], for each pair of functions Fk,i and Fh,i / Fk,i ∈ Inclu(Fh,i), group Fk,i and Fh,i on the same cluster,
• Rule 2: ∀ ζl, l ∈ [1..γ], ∀ Ti ∈ ζl, i ∈ [1..R], for each pair of functions Fk,i and Fh,i / Fk,i ∈ Exclu(Fh,i), put Fk,i and Fh,i on different clusters.

Algorithm 1 describes this partitioning phase where (i) ttask is a table containing tasks of a configuration, (ii) tc is a table that will hold the constructed clusters, (iii) tf is a table that will hold the functions that are not affected by the inclusion/exclusion constraints, (iv) Cluster_T(c, F, tab) is a function that stores a function F and all the elements of a table tab into a cluster c, (v) Cluster(c, F) is a function that stores a function F into a cluster c, (vi) OK_Memory(c) is a function that indicates memory availability in a cluster c, (vii) OK_Energy(c) is a function that indicates energy availability in a cluster c, (viii) Add_C(c, tab) is a function that adds a cluster c to a table tab, and (ix) Add_F(F, ta) is a function that adds a function F to a table ta.

Inclusion/exclusion is a hard constraint; thus, clustered elements are locked and will not be moved to other clusters during the remaining process.

2) Hierarchical Partitioning: Clusters the remaining functions that have no inclusion/exclusion constraints. The functions are evaluated by the probabilities of their connecting edges, and high probability values are treated first. For each remaining function Fk,i, all its predecessors are assessed to determine the highest probability value of their connecting edges. Fk,i is associated to the cluster where the predecessor having the highest edge probability value is located.

Algorithm 2 Hierarchical Partitioning Algorithm
procedure HIERARCHICAL-PARTITION(var tc: TAB-CLUSTER, tf: TAB-FUNCTION, ttask: TAB-TASK)
    for k = 1 to length(tf) do
        PT ← FetchTask(ttask, tf[k])
        Pred(ttask[PT], tf[k], Tpred)
        ok ← false
        repeat
            max ← maxProba(tf[k], Tpred)
            PC ← FetchCluster(tc, Tpred[max])
            if OK_Memory(tc[PC]) & OK_Energy(tc[PC]) then
                tc[PC] ← tf[k]
                ok ← true
            else
                Tpred[max] ← 0
            end if
        until ok
    end for
end procedure

Algorithm 2 describes this partitioning phase where (i) ttask is a table containing tasks of a configuration, (ii) tc is a table that will hold the constructed clusters, (iii) tf is a table that will hold the functions that are not affected by the inclusion/exclusion constraints, (iv) FetchTask(ttask, F) is a function that searches a table of tasks ttask in order to determine the index of the task that includes the function F, (v) maxProba(F, Tpred) determines the maximum probability value of edges connected to a function F using the table of its predecessors Tpred, (vi) OK_Memory(c, F) is a boolean function that returns true if there is enough memory on a cluster c for a function F, (vii) OK_Energy(c) is a boolean function that returns true if there is enough energy on a cluster c for a function F, (viii) Pred(ta, func, Tpred) is a function that returns the predecessors of a function func using its corresponding task ta and stores them into a table Tpred, and (ix) FetchCluster(tc, F) is a function that searches a table of clusters tc in order to determine the index of the cluster storing the function F.

The rules to be applied are:

• Rule 3: for any remaining un-clustered Fk,i ∈ Ti, determine its predecessors Fk−1,i in the DAG of Ti,
• Rule 4: for any Fk,i, extract the couple ≺Fk,i, Fk−1,i≻ with the highest edge probability and cluster Fk,i with the already clustered function having the highest edge probability.

3) Kernighan-Lin: Optimizes the generated clusters. This phase evaluates both probability and communication cost on the edges connecting functions by gain calculation. If the gain is positive, then the function is moved to another cluster.

Algorithm 3 Kernighan-Lin Optimization Algorithm
procedure KERNIGHAN-LIN-OPTIMIZATION(ttask: TAB-TASK, tf: TAB-FUNCTION, tc: TAB-CLUSTER)
    for h = 1 to length(tf) do
        PT ← FetchTask(ttask, tf[h])
        Pred(ttask[PT], tf[h], Tpred)
        Func ← maxGain(tf[h], Tpred)
        if Func ≠ NULL then
            PC ← FetchCluster(tc, Func)
            if OK_Memory(tc[PC]) & OK_Energy(tc[PC]) then
                tc[PC] ← tf[h]
                P ← FetchCluster(tc, tf[h])
                Remove(tc[P], tf[h])
            end if
        end if
    end for
end procedure

This step applies the following rules:

• Rule 5: start by choosing an unlocked function Fk,i,
• Rule 6: calculate the gain GF of moving Fk,i from one partition to another, GF = ((Cce × Pre) − (Cch × Prh)) where (i) Cce is the communication cost of edges connecting Fk,i with Fe,i placed in another cluster, (ii) Cch is the communication cost of edges connecting Fk,i with
Fh,i placed in its own cluster, (iii) Pre is the probability of edges connecting Fk,i with Fe,i placed in another cluster, and (iv) Prh is the probability of edges connecting Fk,i with Fh,i placed in its own cluster.
• Rule 7: if GF ≥ 0, then we move Fk,i to another cluster.

The Kernighan-Lin optimization algorithm is described by Algorithm 3 where (i) ttask is a table containing tasks of a configuration, (ii) tc is a table that will hold the constructed clusters, (iii) tf is a table that will hold the functions that are not affected by the inclusion/exclusion constraints, (iv) maxGain(F, tab) is a function that returns, from a table tab storing the predecessors of F, the function having the maximum gain value when stored with the function F on the same cluster, and (v) Remove(c, F) is a function that removes a function F from a cluster c.

IV. R-CODESIGN

In this section, we present the R-codesign methodology. A system specification according to the proposed probabilistic modeling DAGs is the input of R-codesign. From these DAGs, hardware and software tasks are extracted and processed through the I-codesign engine. A mapping process follows the I-codesign algorithms and a mapping matrix is generated in order to be used further by the co-simulation module. A validation strategy is then applied. If the performance results from the validation module are not satisfactory, then the I-codesign module is called again and a new mapping is calculated. There is no upper bound on the allocations created by the I-codesign module. The designed system should be capable of running different configurations. By applying our methodology, an allocation for each specified configuration is created.

Algorithm 4 R-codesign Algorithm
procedure R-CODESIGN(Tconf: TAB-CONFIGURATION, length[Tconf]: INTEGER)
    if length(Tconf) > 0 then
        NbT ← Number(Tconf[length(Tconf)])
        for k = 1 to NbT do
            TaskExtraction(Tk, DAGsw, DAGhw)
            repeat
                Mapping_Table ← Icodesign(DAGhw, HW)
                Mapping_Table ← Icodesign(DAGsw, SW)
                PerformanceResults ← CoSimulation(Mapping_Table)
            until PerformanceResults == "ok"
        end for
        R-codesign(Tconf, length[Tconf] − 1)
    end if
    GenerateController()
end procedure

Algorithm 4 implements the R-codesign methodology. The input specification can be composed of multiple configuration scenarios where each scenario executes a set of tasks. Thus, we define Tconf as the configurations table that will be the main input of Algorithm 4, and Number(conf) is a function that returns the number of tasks per configuration conf. Figure 3 represents the flow diagram of the methodology. R-codesign steps are stated as follows:

A. Task Extraction

R-codesign starts with extracting software functions and hardware behaviors from the system specification DAGs. It constructs a sub-DAG for each type of task elements. The task extraction is performed as follows.

• ∀ ζl, l ∈ [1..γ], ∀ Ti ∈ ζl, i ∈ [1..R], ∀ Fk,i/Bj,i ∈ Ti, add ≺Fk,i, Fk−1,i≻ to the software sub-DAG and ≺Bj,i, Bj−1,i≻ to the hardware sub-DAG.

During the extraction phase, we adjust the communication cost and the probability estimation of edges between behaviors that are separated by function(s), and of edges between functions that are separated by behavior(s). The transfer cost is the sum of the individual communication costs of the edges separating the communicating functions/behaviors. The edge probability is the product of the individual probabilities of the edges separating the communicating functions/behaviors. For example, in Figure 2, the probability on the edge connecting F1,i to F2,i is equal to 0.6×0.8 while the communication cost is 10+12. If more than one function/behavior communicates with another function/behavior, i.e., there is more than one incoming edge in the nodes of the graph, then we adjust the communication cost related to the edge that will bind the two behaviors by summing the communication cost on each communicating path and considering the highest value of the communication cost. The adjusted probability is the sum of the probabilities of the multiple paths (its value is 1 when the sum exceeds 1). As mentioned earlier, the path probability is the product of the probabilities of all the individual edges constituting the path.

B. I-codesign for hardware behaviors

The inclusion/exclusion constraints are hard constraints that generally decide the number of clusters to be created and lock the concerned behaviors in terms of placement. If behaviors share the control of the same components, then they are deployed on the same FPGA. Since each behavior has its own design, FPGAs can have several implementations. Multiplexers can be used in order to switch from one implementation to another. The assignment of the behaviors based on this constraint is formalized as follows.

$\forall B_{p,i}, B_{j,i} \in Assign(Cl),\ p, j \in [1..n_{i,1}],\ B_{p,i} \notin Exclu(B_{j,i})$ (1)

$\forall B_{p,i}, B_{j,i}\ /\ B_{p,i} \in Inclu(B_{j,i}),\ \text{then}\ B_{p,i}, B_{j,i} \in Assign(Cl)$ (2)

where Cl is a cluster created based on the exclusion/inclusion constraint. The number of the created clusters depends on the number of behaviors on the hardware DAG. Inclu(Bj,i) designates the behaviors that are related by inclusion to Bj,i. Exclu(Bj,i) designates the behaviors that are related by exclusion to Bj,i. Assign(Cl) groups the set of behaviors assigned to the cluster Cl. We define NCl as the number of elements associated to a cluster Cl.
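As a hedged illustration of constraints (1) and (2), the following Python sketch checks a candidate assignment of behaviors to clusters. The function name `violations`, the dictionary layout and the sample behaviors B1 to B3 are assumptions made for this example and do not come from the paper's tool; the relations are also assumed to be symmetric.

```python
from itertools import combinations


def violations(assign, inclu, exclu):
    """Return (kind, a, b) tuples for every violated constraint.

    assign: dict mapping behavior name -> cluster id (hypothetical layout).
    inclu / exclu: dicts mapping a behavior to the set of behaviors related
    to it by inclusion / exclusion (treated as symmetric in this sketch).
    """
    found = []
    for a, b in combinations(sorted(assign), 2):
        same_cluster = assign[a] == assign[b]
        excluded = b in exclu.get(a, set()) or a in exclu.get(b, set())
        included = b in inclu.get(a, set()) or a in inclu.get(b, set())
        if same_cluster and excluded:
            # Constraint (1): excluded behaviors must not share a cluster.
            found.append(("exclusion", a, b))
        if not same_cluster and included:
            # Constraint (2): included behaviors must share a cluster.
            found.append(("inclusion", a, b))
    return found


# Hypothetical data: B1 must be with B2 and must not be with B3.
inclu = {"B1": {"B2"}}
exclu = {"B1": {"B3"}}
good = {"B1": 0, "B2": 0, "B3": 1}
bad = {"B1": 0, "B2": 1, "B3": 0}

print(violations(good, inclu, exclu))
print(violations(bad, inclu, exclu))
```

An empty list means the assignment respects both constraints. Because these are hard constraints, a partitioner would reject or repair any assignment for which the list is non-empty before moving on to the later phases.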
[Figure residue: flow diagram, panel (b). Recoverable labels: Specification, Task, Task Extraction, Hw-Task, Sw-Task, I-codesign Module (Functional Partitioning, Hierarchical Partitioning, CreateCluster(), SwitchCluster()), Mapping Results, Validation Strategy, Hw/Sw Co-simulation, Executable Software, RT Level VHDL, Gate level VHDL, Unsatisfied Constraint, OK, Controller Generation, Deployment]
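The edge adjustment performed during task extraction (path probability as a product, path cost as a sum, and the rule for several paths reaching the same node) can be sketched in Python as follows. The function names and the second path in the usage example are assumptions for illustration; the first path reproduces the 0.6×0.8 and 10+12 example given in the task extraction step. The multi-path cost rule is read here as "sum each path, then keep the highest path cost", which is one interpretation of the description.

```python
from math import prod


def collapse_path(edges):
    """Collapse a chain of (probability, cost) edges into a single edge."""
    probability = prod(p for p, _ in edges)   # product of edge probabilities
    cost = sum(c for _, c in edges)           # sum of communication costs
    return probability, cost


def merge_paths(paths):
    """Combine several paths that reach the same node into one edge."""
    collapsed = [collapse_path(path) for path in paths]
    # Adjusted probability: sum over the paths, capped at 1.
    probability = min(1.0, sum(p for p, _ in collapsed))
    # Adjusted cost: highest of the summed path costs (assumed reading).
    cost = max(c for _, c in collapsed)
    return probability, cost


# Path from the text: two edges labelled 0.6/10 and 0.8/12.
p, c = collapse_path([(0.6, 10), (0.8, 12)])
print(round(p, 2), c)

# Hypothetical second path (0.4/8 then 0.9/8) reaching the same node.
mp, mc = merge_paths([[(0.6, 10), (0.8, 12)], [(0.4, 8), (0.9, 8)]])
print(round(mp, 2), mc)
```

The cap at 1 keeps the merged value a valid probability when several likely paths converge on the same node.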
Each cluster created for hardware tasks Cl = {B1,i, B2,i, ..., BNCl,i} is composed of behaviors and will be implemented on a single FPGA unit. The reconfigurable hardware device offers a certain amount of computational resources, e.g., the configurable logic blocks of an FPGA, which is also referred to as the SFPGA parameter of the device. At each iteration of the I-codesign methodology, a placement decision can affect the hardware units. Hence, the available area on the FPGA must be sufficient in order to execute the affected behaviors. Thus,

$Bandwidth(Cl_v, Cl_u) = \sum_{B_{p,i} \in Cl_v,\, B_{j,i} \in Cl_u} Bandwidth(B_{p,i}, B_{j,i}) \le BW_{v,l}$ (5)

where BWv,l stands for the available bandwidth between two tiles Hwv and Hwl. Bp,i ↔ Bj,i means that Bp,i and Bj,i are placed on different clusters Clu and Clv and that they have a data dependency. The expression Bandwidth(Bp,i, Bj,i) corresponds to the bandwidth between Bp,i and Bj,i. The verification step also includes the energy consumed by a given FPGA. Indeed, the energy consumption of a partition Cl depends on the selected operating frequency Freq based on the current configuration and the number of available gates SFPGA of the corresponding tile. The electrical power constraint is given by

$S_{FPGA} \cdot [Freq]^3 \le Pw_{FPGA}$ (6)

$\sum_{B_{p,i} \in Cl_v,\, B_{j,i} \in Cl_u} Bandwidth(B_{p,i}, B_{j,i}) \le BW \;\wedge\; S_{FPGA} \cdot [Freq]^3 \le Pw_{FPGA} \;\wedge\; U_k = \sum_{B_{j,i} \in Cl} E_{hw}(B_{j,i}) / P_{hw}(B_{j,i}) \le 1$ (8)
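A minimal Python sketch of how the recoverable terms of the hardware verification (bandwidth, power and utilization) might be evaluated for one cluster is given below. All names and numeric values are hypothetical. The "less than or equal" relation in the power term and the reading of E_hw and P_hw as execution time and period are assumptions, since neither is stated explicitly in this part of the text.

```python
def hardware_cluster_ok(cross_bandwidths, bw_limit, s_fpga, freq, pw_fpga, behaviors):
    """Evaluate the bandwidth, power and utilization terms for one cluster.

    cross_bandwidths: bandwidths of edges that leave the cluster.
    behaviors: list of (e_hw, p_hw) pairs, assumed to be execution time
    and period of each behavior placed on the cluster.
    Returns (feasible, utilization).
    """
    bandwidth_ok = sum(cross_bandwidths) <= bw_limit
    # Assumed relation: S_FPGA * Freq^3 must not exceed the power budget.
    power_ok = s_fpga * freq ** 3 <= pw_fpga
    utilization = sum(e / p for e, p in behaviors)
    return bandwidth_ok and power_ok and utilization <= 1, utilization


# Hypothetical values, units left unspecified.
ok, u = hardware_cluster_ok(
    cross_bandwidths=[3, 4], bw_limit=10,
    s_fpga=1.0, freq=2.0, pw_fpga=10.0,
    behaviors=[(1, 4), (2, 8)],
)
print(ok, u)
```

Returning the utilization alongside the verdict makes it easier to see how close a cluster is to the limit when a placement is rejected.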
$\sum_{F_{k,i} \in Cl_v,\, F_{v,i} \in Cl_u} Bandwidth(F_{k,i}, F_{v,i}) \le BW \;\wedge\; Freq \times V^2 \times C \le Pw_{CPU} \;\wedge\; U_k = \sum_{F_{k,i} \in Cl} E_{sw}(F_{k,i}) / P_{sw}(F_{k,i}) \le 1$ (12)

D. Hardware software co-Simulation

Hardware/software co-simulation makes it possible to verify the feasibility of mixed hardware/software descriptions in terms of timing constraints. Implementing the co-simulation consists

matrix is presented in Figure 5 where the specification is composed of three tasks T1, T2 and T3 and two configurations conf1 = {T1, T2, T3} and conf2 = {T1, T2}. The number of rows in this matrix is equal to γ, the number of possible configurations. The number of columns is equal to $\sum_{i}^{R} (n_{i,1} + n_{i,2})$. The matrix associates each task function/behavior with a specified PE (Processing Unit: CPU, FPGA) when the corresponding configuration is selected.

      | T1                 | T2                        | T3
      | F11   F12   B11    | F12   F22   F32   B12     | F13   F23   B13    B23
Conf1 | CPU1  CPU1  FPGA1  | CPU1  CPU2  CPU2  FPGA2   | CPU2  CPU2  FPGA2  FPGA1
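To make the role of the controller matrix concrete, here is a small Python sketch that stores the Conf1 row shown above as a lookup table. The data layout, the function name `processing_element` and the grouping of columns under T1, T2 and T3 are assumptions for illustration; column labels are copied as they appear in the extracted table, and only the Conf1 row is available in this excerpt.

```python
# Columns as (task, element) pairs; the task grouping is inferred from the
# element indices and is an assumption of this sketch.
COLUMNS = [
    ("T1", "F11"), ("T1", "F12"), ("T1", "B11"),
    ("T2", "F12"), ("T2", "F22"), ("T2", "F32"), ("T2", "B12"),
    ("T3", "F13"), ("T3", "F23"), ("T3", "B13"), ("T3", "B23"),
]

CONF1 = ["CPU1", "CPU1", "FPGA1", "CPU1", "CPU2", "CPU2",
         "FPGA2", "CPU2", "CPU2", "FPGA2", "FPGA1"]

# One row per configuration; each row maps a column to a processing element.
controller_matrix = {"conf1": dict(zip(COLUMNS, CONF1))}


def processing_element(config, task, element):
    """Return the PE assigned to a function/behavior in a configuration."""
    return controller_matrix[config][(task, element)]


print(processing_element("conf1", "T3", "B23"))
print(len(controller_matrix["conf1"]))
```

With this layout, switching configuration at run-time amounts to selecting another precomputed row, with no mapping computed on-line.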
[Figure residue: tool structure diagram. Recoverable labels: Input Parameters, Task, Task decomposition, Task DAG, Hw/Sw partitioner, Hardware Architecture, Cluster Scheduler, Execution Environment]

design models and implements the co-design algorithms. It proposes a flexible task set generator for different scenarios and purposes. The tool places the software specification following several design constraints such as inclusion/exclusion parameters, probabilistic execution of the software tasks, available memory and energy on the hardware units, and real-time parameters. At each iteration, it constructs the controller table that stores all the possible execution scenarios. For simulation purposes, the tool loads a specification file, reads the software and hardware characteristics, applies the co-design algorithms and generates the controller table along with memory and energy estimation. Figure 6 summarizes the general tool structure. The tool is composed of four different parts: 1) Task set Generator (TSG), 2) Task Decompositioner (TD), 3) Task Partitioner (TP), 4) Execution Environment (EE). TSG should be set with parameters such as CPU utilization and the desired number of tasks, and then it creates a task set that is called a configuration. The design constraints (probability, inclusion/exclusion, communication costs, and dependency) are randomly generated by the tool. The generated configuration is passed as input to TD, which decomposes the tasks into elementary functions with design constraints. TG produces the task graphs. Then, TP performs the partitioning algorithms and generates optimized clusters.

[Figure residue: Task specification, Architecture Templates, Description File, Json, F1, F2, CPU, FPGA, Feasibility Evaluation, Results, Success, Result]
Figure 7. The partitioning flow graph of SPEX.

Figure 7 summarizes the TP structure. Finally, EE executes each cluster on the associated computing unit and collects results for reporting energy and memory utilization and controller status. The output results allow the assessment of using static partitioning stored in the controller table (generated by I-codesign) and permit us to compare the I-codesign methodology with a legacy dynamic mapping scheme.

B. A Case Study

We consider a system having two possible configurations denoted by conf1 and conf2. This system is specified by a task graph composed of three tasks T1, T2 and T3 according to R-codesign modeling techniques. The required memory size of the application is 2.3 MB where T1, T2 and T3 require respectively 1.5 MB, 0.5 MB and 0.3 MB. The composition of each configuration is: conf1 = {T1, T2} and conf2 = {T1, T3}. Figure 8 presents the DAG of the task T1. The target hardware is an MPSoC composed of two identical tiles where each tile has a CPU, a reconfigurable unit and a local memory. The size of the local memory is 1.2 MB.

[Figure residue: DAG with nodes F1,1, B1,1, B2,1, F2,1, B3,1, B4,1, F3,1 and edge labels (probability/communication cost) 0.6/10, 0.4/8, 0.8/12, 0.2/7, 0.3/11, 0.7/9, 0.1/5, 0.5/7, 0.5/11, 1/12, 1/10, 0.9/8]

The reconfigurable device includes one million gates. The master and slave tiles are connected to a shared bus. Arbitration is resolved by a bus arbiter. It periodically examines pending requests from the master and grants access using arbitration mechanisms specified by the bus protocol.
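The memory figures of the case study can be checked with a few lines of Python. This is only a back-of-envelope sketch under the stated numbers (task sizes, two tiles, 1.2 MB of local memory each); the function names are hypothetical and the check is not part of SPEX.

```python
TASK_MEMORY_MB = {"T1": 1.5, "T2": 0.5, "T3": 0.3}
CONFIGS = {"conf1": ["T1", "T2"], "conf2": ["T1", "T3"]}
TILES = 2
LOCAL_MEMORY_MB = 1.2


def config_memory(name):
    """Total memory required by the tasks of a configuration, in MB."""
    return sum(TASK_MEMORY_MB[t] for t in CONFIGS[name])


def fits_in_aggregate(name):
    """True if the configuration fits in the local memories of all tiles."""
    return config_memory(name) <= TILES * LOCAL_MEMORY_MB


def tasks_exceeding_one_tile(name):
    """Tasks that cannot be stored whole in a single local memory."""
    return [t for t in CONFIGS[name] if TASK_MEMORY_MB[t] > LOCAL_MEMORY_MB]


for name in CONFIGS:
    print(name, fits_in_aggregate(name), tasks_exceeding_one_tile(name))
```

Under these numbers both configurations fit in the combined local memory, but T1 alone is larger than one local memory, which suggests that its functions and behaviors have to be spread over both tiles.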
[Table residue: rows "F3,1 2 90 210 0.11" and "F4,1 6 110 180 0.02"; column headers not recoverable]

respectively in Tables I and II. In this case study, the partitioning results as well as the controller generation are of main focus. Following are descriptions of SPEX steps applied to this case study.

[Figure residue: nodes F5,1, F7,1, B9,1, B12,1; edge labels 1/4, 1/16, 1/9, 1/15, 0.6/14, 1/21]
Figure 9. Task extraction: Hardware graph.

The probability and the communication cost are recalculated according to the connections between roots and leaves on the original task's DAG. Figures 9 and 10 present the extraction of the software and hardware DAGs.

[Figure residue: Software Clusters and Hardware Clusters; labels Cl 1, Cl 2, F5,1, F10,1, B6,1]
Figure 11. Resulted clusters after the Functional Partitioning.

The verification phase applied on the created clusters succeeds since the placements respect (8) and (12).

4) Kernighan-Lin Optimization: This phase aims to optimize the resulting clusters from the hierarchical clustering phase by iterative improvements. In our partitioning process, a combination of two metrics is used in order to optimize the traffic circulation of the system: the communication cost and the probabilistic estimations of the executions. Figure 13 shows the optimized clusters after applying the
Kernighan-Lin algorithm. The verification phase applied to the optimized clusters succeeds since the placements respect (8) and (12).
[Figure 13: software clusters Cl 1 and Cl 2 grouping functions F1,1 to F10,1, and hardware clusters Cl 1 and Cl 2 grouping behaviors B1,1 to B13,1.]
Figure 13. Resulting Clusters after the Kernighan-Lin Optimization.
5) Controller Generation: For each configuration, SPEX runs R-codesign on the hardware/software specification and constructs the controller matrix. The generated matrix for this case study is shown in Figure 14. The output of SPEX is presented in Figure 15.
Software  F1,1  F2,1  F3,1  F4,1  F5,1  F6,1  F7,1  F8,1  F9,1  F10,1
conf1     CPU1  CPU1  CPU1  CPU2  CPU1  CPU2  CPU2  CPU2  CPU2  CPU2
conf2     CPU2  CPU2  CPU2  CPU1  CPU2  CPU1  CPU1  CPU1  CPU1  CPU1
[Hardware part of Figure 14: behaviors B1,1 to B13,1 are each mapped to RHW1 or RHW2 under conf1 and conf2.]
Figure 14. Resulting Controller Matrix of task T1.
Figure 15. Output of SPEX.
C. Evaluation
To evaluate the R-codesign methodology, several task sets of different dimensions are generated. The generated tasks are processed through SPEX, which produces the controller matrix along with the mapping scheme of each execution scenario.
For performance simulation, we use the ARTS framework, a simulation tool for user-driven abstract MPSoC design exploration. The framework allows us to: (i) model processing elements (PE), memory units and interconnect, (ii) investigate PE utilization, memory usage, communication issues, and energy/power consumption, and (iii) analyze the causality between MPSoC components, i.e., resource constraints and inter-dependencies [38]. The application model is based on task graphs, where the exact functionality of a task is abstracted away and expressed using a set of timing constraints (execution time, deadline and offset). Recording files are generated that provide an overview of the architecture-under-test, the profile of the application, the PE utilization, and the memory and communication costs. The ARTS model captures the impact of dynamic and unpredictable behavior on processor, memory and communication performance. In particular, it focuses on analyzing the impact of application mapping on the processor and memory utilization, taking the on-chip communication latency into account.
The evaluated performance parameters are the total communication cost of the system functions/behaviors, the total energy consumed during system execution, the total number of exchanged messages, and the total execution time of the grouped task sets. The results obtained when varying the utilization of the CPUs and FPGAs are compared with two partitioning and scheduling algorithms. The first [27] proposes a task allocation algorithm based on clustering; it finds a near-optimal solution and tries to minimize the total system cost by forming clusters of tasks such that the cluster with the minimum execution cost is allocated first. The second [28] proposes an algorithm that extends battery life by partitioning and scheduling the input tasks wisely. We also compare R-codesign (R-co) with the traditional approach (TA), which calculates the appropriate mapping of tasks to processors during run-time reconfiguration.
Figures 16, 17, 18 and 19 present the performance results for randomly generated task sets. Figure 16 describes the communication costs in terms of transfer delays through the communication medium, while Figure 19 enumerates the transferred messages. The comparison between the evaluated approaches shows that R-codesign offers better performance, particularly with large utilization factors and a high number of nodes in the specification DAGs. These enhancements are due to the probabilistic estimation of the communicating functions/behaviors, which keeps dependent tasks that are likely to be executed successively on the same PEs. Another advantage of R-codesign is the pre-calculated mapping of the possible run-time reconfigurations. This step helps significantly to minimize the reconfiguration overhead, as the comparison with TA makes clear. Due to these changes, a reduction of 30% in the global execution time is observed. Since execution time is a crucial performance parameter in embedded system design, adopting the proposed idea significantly improves the response time. Compared with existing works, the graphs show that R-codesign improves the communication costs by 10% on average, and the exchanged messages are consequently reduced by 12% on average. Simulation results show that this contribution has two benefits: (i) the number of exchanged messages is noticeably reduced, and (ii) the global execution time is minimized. Hence, the energy consumption will be reduced as
[Figures 16 to 19: Communication Cost, Consumed Energy, Execution time and Message Number plotted against Utilization (0 to 1); legend entries Alg[27], Alg[28] and TA.]
a result of decreased execution time. Another advantage of R-codesign is its validation tests (see equation systems (8) and (12)), which avoid any issues related to a lack of resources.
VI. CONCLUSION
In this paper, we propose a complete methodology for modeling, partitioning and validating reconfigurable embedded system designs. We present a probabilistic estimation of the executions and a mathematical formalization of the design constraints. The performance improvement obtained with the proposed techniques in terms of communication costs (the number of exchanged messages), consumed energy and required CPU time has been verified. Furthermore, the new partitioning combination of iterative, constructive and functional techniques allows efficient and optimized placements of the software/hardware specification while respecting the constrained resources. We proposed an execution model for the R-codesign methodology that relies on a controller module, which stores all the possible reconfiguration scenarios and manages the system tasks whenever a reconfiguration event occurs. Finally, we developed the SPEX tool, which allows the user to: (i) write a specification according to the R-codesign system model, (ii) apply the new partitioning techniques, and (iii) generate the controller table. We are working on enhancing SPEX, and as future work we intend to explore task migration mechanisms for an optimal reconfiguration process [39].
ACKNOWLEDGMENT
The authors would like to extend their sincere appreciation to the Deanship of Scientific Research at King Saud University for funding this Research Group No. (RGP-1436-040).
REFERENCES
[1] M. Uzam, Z. Li, G. Gelen, and R. S. Zakariyya, “A divide-and-conquer-method for the synthesis of liveness enforcing supervisors for flexible manufacturing systems,” Journal of Intelligent Manufacturing, vol. 27, no. 5, pp. 1111–1129, Oct 2016.
[2] Y. Chen, Z. Li, K. Barkaoui, and M. Uzam, “New Petri net structure and its application to optimal supervisory control: Interval inhibitor arcs,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 44, no. 10, pp. 1384–1400, Oct 2014.
[3] T. Kim and S. Tak, “Experience with hardware-software codesign of network protocol stacks supporting real-time inter-task communication,” in Proc. 10th International Conference on Computer and Information Technology (CIT), IEEE, United Kingdom, June 2010, pp. 26–32.
[4] S. Zhang, N. Wu, Z. Li, T. Qu, and C. Li, “Petri net-based approach to short-term scheduling of crude oil operations with less tank requirement,” Information Sciences, vol. 417, pp. 247–261, 2017.
[5] H. Grichi, O. Mosbahi, M. Khalgui, and Z. Li, “Rwin: New methodology for the development of reconfigurable WSN,” IEEE Transactions on Automation Science and Engineering, vol. 14, no. 1, pp. 109–125, Jan 2017.
[6] M. Gasmi, O. Mosbahi, M. Khalgui, L. Gomes, and Z. Li, “R-node: New pipelined approach for an effective reconfigurable wireless sensor node,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. PP, no. 99, pp. 1–14, 2017.
[7] H. Grichi, O. Mosbahi, M. Khalgui, and Z. Li, “New power-oriented methodology for dynamic resizing and mobility of reconfigurable wireless sensor networks,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. PP, no. 99, pp. 1–11, 2017.
[8] M. O. Ben Salem, O. Mosbahi, M. Khalgui, Z. Jlalia, G. Frey, and M. Smida, “Brometh: Methodology to design safe reconfigurable medical robotic systems,” The International Journal of Medical Robotics and Computer Assisted Surgery, vol. 13, no. 3, pp. e1786–n/a, 2017.
[9] F. Ferrandi, P. L. Lanzi, C. Pilato, D. Sciuto, and A. Tumeo, “Ant colony optimization for mapping, scheduling and placing in reconfigurable systems,” in Proc. NASA/ESA Conference on Adaptive Hardware and Systems (AHS), Italy, June 2013, pp. 47–54.
[10] X. Wang, I. Khemaissia, M. Khalgui, Z. Li, O. Mosbahi, and M. Zhou, “Dynamic low-power reconfiguration of real-time systems with periodic and probabilistic tasks,” IEEE Transactions on Automation Science and Engineering, vol. 12, no. 1, pp. 258–271, Jan 2015.
[11] T. K. Liu, Y. P. Chen, and J. H. Chou, “Developing a multiobjective optimization scheduling system for a screw manufacturer: A refined genetic algorithm approach,” IEEE Access, vol. 2, pp. 356–364, 2014.
[12] W. Housseyni, O. Mosbahi, M. Khalgui, Z. Li, and L. Yin, “Multiagent architecture for distributed adaptive scheduling of reconfigurable real-time tasks with energy harvesting constraints,” IEEE Access, vol. PP, no. 99, pp. 1–1, 2017.
[13] S. Brandstätter and M. Huemer, “A novel MPSoC interface and control architecture for multistandard RF transceivers,” IEEE Access, vol. 2, pp. 771–787, 2014.
[14] S. Dutt, “New faster Kernighan-Lin-type graph-partitioning algorithms,” in Proceedings of 1993 International Conference on Computer Aided Design (ICCAD), Nov 1993, pp. 370–377.
[15] D. W. Franke and M. K. Purvis, “Design automation technology for codesign: status and directions,” in Proc. IEEE International Symposium on Circuits and Systems, ISCAS ’92, May 1992, pp. 2669–2672.
[16] K. Li, X. Tang, B. Veeravalli, and K. Li, “Scheduling precedence constrained stochastic tasks on heterogeneous cluster systems,” IEEE Transactions on Computers, vol. 64, no. 1, Jan 2015.
[17] C. C. Kao, “Performance-oriented partitioning for task scheduling of parallel reconfigurable architectures,” IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 3, pp. 858–867, March 2015.
[18] R. Niemann and P. Marwedel, “An algorithm for hardware/software partitioning using mixed integer linear programming,” Design Automation for Embedded Systems, vol. 2, no. 2, pp. 165–193, Mar 1997.
[19] J. Grode, P. V. Knudsen, and J. Madsen, “Hardware resource allocation for hardware/software partitioning in the LYCOS system,” in Proceedings Design, Automation and Test in Europe, Feb 1998, pp. 22–27.
[20] K. S. Chatha and R. Vemuri, “Magellan: Multiway hardware-software partitioning and scheduling for latency minimization of hierarchical control-dataflow task graphs,” in Proceedings of the Ninth International Symposium on Hardware/Software Codesign, ser. CODES ’01, 2001, pp. 42–47.
[21] N. N. Binh, M. Imai, A. Shiomi, and N. Hikichi, “A hardware/software partitioning algorithm for designing pipelined ASIPs with least gate counts,” in 33rd Design Automation Conference Proceedings, Jun 1996, pp. 527–532.
[22] G. Quan, X. Hu, and G. Greenwood, “Preference-driven hierarchical hardware/software partitioning,” in Proceedings 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors (Cat. No.99CB37040), 1999, pp. 652–657.
[23] P. Arato, S. Juhasz, Z. A. Mann, A. Orban, and D. Papp, “Hardware-software partitioning in embedded system design,” in IEEE International Symposium on Intelligent Signal Processing, Sept 2003, pp. 197–202.
[24] C. Wang, A. Wang, Y. Chen, A. Al-Ahmari, and Z. Li, “On computation reduction of liveness-enforcing supervisors,” IEEE Access, vol. 5, pp. 14775–14786, Aug 2017.
[25] C. Li, Y. Chen, Z. Li, and K. Barkaoui, “Synthesis of liveness-enforcing Petri net supervisors based on a think-globally-act-locally approach and vector covering for flexible manufacturing systems,” IEEE Access, vol. 5, pp. 16349–16358, June 2017.
[26] J. Shi, W. Liu, M. Jiang, H. Che, and L. Chen, “Software hardware co-simulation and co-verification in safety critical system design,” in Proc. IEEE International Conference on Intelligent Rail Transportation (ICIRT), China, Aug 2013, pp. 71–74.
[27] P. Bhardwaj and V. Kumar, “An effective load balancing task allocation algorithm using task clustering,” International Journal of Computer Applications, vol. 77, no. 7, pp. 32–39, September 2013.
[28] R. Shi, S. Yin, C. Yin, L. Liu, and S. Wei, “Energy-aware task partitioning and scheduling algorithm for reconfigurable processor,” in Proc. IEEE 11th International Conference on Solid-State and Integrated Circuit Technology (ICSICT), China, Oct 2012, pp. 1–3.
[29] X. Wang, Z. Li, and W. M. Wonham, “Dynamic multiple-period reconfiguration of real-time scheduling based on timed DES supervisory control,” IEEE Transactions on Industrial Informatics, vol. 12, no. 1, pp. 101–111, Feb 2016.
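As an illustration of the execution model summarized in the conclusion, the following minimal Python sketch stores the software rows of the controller matrix of Figure 14 as a lookup table, so that a reconfiguration event reduces to reading a pre-computed row instead of recomputing a mapping at run-time. The names CONTROLLER_MATRIX, mapping_for, functions_to_remap and cluster_of are hypothetical and are not part of SPEX; only the CPU assignments are taken from Figure 14, and the hardware rows are omitted.

```python
# Illustrative sketch only: names and structure are assumptions, not the SPEX API.
# The CPU assignments are transcribed from the software rows of Figure 14.
FUNCTIONS = [f"F{i},1" for i in range(1, 11)]

CONTROLLER_MATRIX = {
    "conf1": dict(zip(FUNCTIONS, ["CPU1", "CPU1", "CPU1", "CPU2", "CPU1",
                                   "CPU2", "CPU2", "CPU2", "CPU2", "CPU2"])),
    "conf2": dict(zip(FUNCTIONS, ["CPU2", "CPU2", "CPU2", "CPU1", "CPU2",
                                   "CPU1", "CPU1", "CPU1", "CPU1", "CPU1"])),
}


def mapping_for(configuration):
    """Return the pre-computed function-to-PE mapping of one configuration."""
    return CONTROLLER_MATRIX[configuration]


def functions_to_remap(old, new):
    """List the functions whose processing element differs between two configurations."""
    before, after = mapping_for(old), mapping_for(new)
    return [f for f in FUNCTIONS if before[f] != after[f]]


def cluster_of(configuration, pe):
    """List the functions placed on a given processing element."""
    mapping = mapping_for(configuration)
    return [f for f in FUNCTIONS if mapping[f] == pe]


if __name__ == "__main__":
    print(cluster_of("conf1", "CPU1"))
    print(len(functions_to_remap("conf1", "conf2")))
```

In this sketch a reconfiguration from conf1 to conf2 is a dictionary lookup, which is the property the comparison with TA relies on.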