Hardware/Software Codesign Using Binary Partitioning
CHAPTER 2
The hardware replacing the software portion must produce the same
transformation as the original pure software implementation to ensure that
the functionality does not change. Partitioning at the binary level makes the
method suitable for dynamic, on-the-fly partitioning of software into
hardware. The basic blocks of software binaries are transformed into dataflow
descriptions so that the partitioned software can be implemented in hardware. As a
first step, the portion of the software binary to be transformed into hardware is
identified using instruction-level profiling. The partitioned software (to be
[Figure: system specification, optimization, system partitioning, decompilation, hardware estimation of control construct, system on chip]
2.2.2 Optimization
not only to enhance the productivity of the designer and system developer, but
also to improve the quality of the final synthesis outcome. The present need in
codesign to move from synthesis-based technology to compiler-based
technology is pointed out. The data flow and control optimizations at a higher
abstraction level can lead to significant size and performance improvements
in both the synthesized hardware and software. It is recognized that guided
optimization can be applied on the internal design representation, independent
of the abstraction level, and need not be restricted to the final stages of
software assembly code generation or hardware synthesis. Function/Architecture
Optimization and Codesign of Embedded Systems will be of
primary interest to researchers, developers and professionals in the field of
embedded systems design.
move critical regions of software to the coprocessor platform are some of the
efforts that have gone into reconfigurable computing with partitioning at the
source-code level.
[Figure: flow diagram with blocks S/W source, compiler front-end, H/W S/W partitioning, compiler backend, HDL source, assembly and object files, assembler & linker, synthesis, binaries, netlists]
[Figure: flow diagram with blocks SW source, compilation, assembly and object files, assembler & linker, binaries, H/W S/W partitioning, HW source, synthesis, netlists, binaries]
2.2.4 Decompilation
The control logic of the partitioned hardware is estimated, for
hardware partitioned at both the source level and the binary level, using a
behavioral network graph.
between models are through signals and events. At this level of modeling, the
architecture and algorithms are verified. Any performance issues and
bottlenecks are studied and simulated. Once the architecture and algorithms
are verified, the next step is to determine which part of the system is to be
implemented in hardware and which part goes into software. This process is
called hardware/software partitioning. The software portion runs as embedded
software on the general-purpose microprocessor and the hardware portion is
implemented as an embedded ASIC core as shown in Figure 2.3.
[Figure: functional level, transaction level, behavioral synthesis, RTL model, gate netlist]
a full range of data types: char (8 bits, 1 byte), short (16 bits,
2 bytes), int (16 bits, 2 bytes), long (32 bits, 4 bytes) and float
(4 bytes, IEEE).
[Figure: flow diagram with blocks compilation, assembly and object files, assembler and linker, binaries, decompilation, CDFG, BNG, estimation, IP core, HW, partitioned HW]
2.6.3 Decompilation
The hardware partition, along with the software glue code, has to
realize precisely the effect of the software instructions that it
replaces in a partitioned system. Stitt and Vahid (2005) discuss a
decompilation technique for software binaries partitioned for hardware. This
hardware partition must maximize performance and minimize delay and
buffer requirements as the cost function.
[Figure: sequence of software instructions; S1, S2]
nodes, the algorithm checks against already created control nodes to avoid
duplication of nodes. The algorithm works by tracing execution paths. The
structure of the CF generator with register allocation is given in Appendix 2 of
the thesis. Similarly, copy propagation and dead code elimination are given in
Appendix 3 of the thesis. Copy propagation is used to create dead code,
i.e. when (regw,a) is preceded by (reg0,w+a), the latter is modified to (reg0,a+a), so
(regw,a) can be removed in order to reduce the space complexity. Both the CF
generator and the decompiler are designed using growing arrays. Hence, they can
be scaled up to any number of instructions. As an illustration, a code snippet
is shown in Figure 2.6 in pseudocode form.
[Figure: Basic Block 1; condition a<>0 (True/False); sequence of processing instructions; Basic Block 2; Basic Block 3; Basic Block 4; corresponding graph with nodes and condition a<>0 (True/False)]
Basic block 4 in the original code snippet has been duplicated.
The reason is that the “goto” instruction is not considered a branch instruction
in the proposed paradigm, since the destination of the branch is fixed
and only the distinct paths of execution are traced. An alternative would be to
treat “goto” like any other branch instruction and create another block in
the CDFG for block 4.
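
The duplication can be illustrated with a minimal sketch of path tracing, assuming an acyclic control-flow graph. The block layout below (B1 branching on a<>0 to B2 or B3, both continuing at B4) and the names `blocks`, `trace` and `occurrences` are hypothetical; they are not taken from Figure 2.6 or from the appendices of the thesis.

```python
# Hypothetical block table: block name -> terminator.
#   ("branch", cond, true_target, false_target)  conditional branch
#   ("goto", target)                             fixed destination
#   ("end",)                                     end of the traced region
blocks = {
    "B1": ("branch", "a<>0", "B2", "B3"),
    "B2": ("goto", "B4"),
    "B3": ("goto", "B4"),
    "B4": ("end",),
}

def trace(blocks, name):
    """Trace execution paths from `name`; assumes no backward jumps."""
    term = blocks[name]
    if term[0] == "goto":
        # Fixed destination: keep following the same path, no control node.
        return [name] + trace(blocks, term[1])
    if term[0] == "branch":
        # Control node: one sub-path per outcome of the condition.
        _, cond, on_true, on_false = term
        return [name, (cond, trace(blocks, on_true), trace(blocks, on_false))]
    return [name]

def occurrences(tree, name):
    """Count how many times block `name` appears in a traced tree."""
    total = 0
    for item in tree:
        if isinstance(item, tuple):
            _, on_true, on_false = item
            total += occurrences(on_true, name) + occurrences(on_false, name)
        elif item == name:
            total += 1
    return total

tree = trace(blocks, "B1")
```

Because each "goto" is simply followed, B4 is reached once on each path and therefore appears twice in `tree`. Treating "goto" as a branch would instead require a shared node for B4.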
1. For a basic block, if a register in the initial-state set is also
listed in the final-state set and its value is the same in both
(this can happen if registers are shadowed and used temporarily
as scratchpads), the entry can be removed from the final-state
list.
3. Identify the section of the graph that has only forward jumps
and no jumps from the remaining part of the graph into the
middle of the section. Then, the initial-state sets and final-state
sets of all the basic blocks in that section can be combined,
and the intermediate control nodes can be removed.
This transformation is termed the Merge operation. The
destination operand element in the combined final-state set
can have multiple values, each corresponding to a basic block
and tagged with a conditional expression corresponding to
the control nodes that lead to the execution of that basic block.
The conditional expressions for these values of an operand are
Before the Merge operation:

Basic Block 1
  Initial = {(reg1,x), (reg2,y), ...}
  Final   = {(reg3,x+y), (reg9,z)}
  Branch on reg7 ≠ 0: True -> Basic Block 2, False -> Basic Block 3

Basic Block 2
  Initial = {(reg8,e)}
  Final   = {(reg4,e), (reg9,2e), (reg2,e)}

Basic Block 3
  Initial = {(reg3,a), (reg1,b), (reg6,c)}
  Final   = {(reg4,a), (reg5,a+b+c)}

After the Merge operation:

Basic Block 1
  Initial = {(reg1,x), (reg2,y), (reg6,c), (reg7,d), (reg8,e), (reg5,f)}
  Final   = {(reg3,x+y),
             (reg4, {e | if d≠0, x+y | if d=0}),
             (reg5, {2x+y+c | if d=0, f | if d≠0}),
             (reg9, {z | if d=0, 2e | if d≠0}),
             (reg2, {e | if d≠0, y | if d=0})}
All constructs that have only forward jumps, and no jumps into them from
other parts of the graph, can be covered by recursively applying this
model structure.
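
The Merge operation on the three-block example can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: values are restricted to linear expressions stored as symbol-to-coefficient maps, the entry (reg7,d) is assumed to belong to the initial-state set of Basic Block 1, Basic Block 2 is assumed to execute when d≠0, and the helper names `E`, `substitute`, `merge` and the `fresh` argument are hypothetical.

```python
from collections import Counter

def E(**terms):
    """Linear expression as symbol -> coefficient, e.g. E(x=1, y=1) is x+y."""
    return Counter(terms)

def substitute(expr, env):
    """Replace every symbol of expr that has an entry in env."""
    out = Counter()
    for sym, coeff in expr.items():
        for s, c in env.get(sym, Counter({sym: 1})).items():
            out[s] += coeff * c
    return out

def merge(parent, branches, fresh):
    """Fold mutually exclusive successor blocks into the parent block.

    parent   : (initial, final) with initial reg -> symbol, final reg -> expr
    branches : condition string -> (initial, final) of the successor block
    fresh    : reg -> symbol for entry values that no block has named yet
    """
    p_init, p_final = parent
    m_init = dict(p_init)
    resolved = {}
    for cond, (b_init, b_final) in branches.items():
        env = {}
        for reg, sym in b_init.items():
            if reg in p_final:            # value produced by the parent
                env[sym] = p_final[reg]
            else:                         # value comes from the merged entry
                env[sym] = Counter({m_init.setdefault(reg, sym): 1})
        resolved[cond] = {r: substitute(e, env) for r, e in b_final.items()}
    m_final = {}
    for reg in sorted(set(p_final).union(*resolved.values())):
        values = {}
        for cond, written in resolved.items():
            if reg in written:
                values[cond] = written[reg]
            elif reg in p_final:
                values[cond] = p_final[reg]
            else:                         # untouched on this path
                if reg not in m_init:
                    m_init[reg] = fresh[reg]
                values[cond] = Counter({m_init[reg]: 1})
        vals = list(values.values())
        m_final[reg] = vals[0] if all(v == vals[0] for v in vals) else values
    return m_init, m_final

B1 = ({"reg1": "x", "reg2": "y", "reg7": "d"},
      {"reg3": E(x=1, y=1), "reg9": E(z=1)})
B2 = ({"reg8": "e"}, {"reg4": E(e=1), "reg9": E(e=2), "reg2": E(e=1)})
B3 = ({"reg3": "a", "reg1": "b", "reg6": "c"},
      {"reg4": E(a=1), "reg5": E(a=1, b=1, c=1)})
m_init, m_final = merge(B1, {"d!=0": B2, "d==0": B3}, fresh={"reg5": "f"})
```

With these inputs, `m_final["reg5"]` holds 2x+y+c under `d==0` and f under `d!=0`, which agrees with the merged final-state set of the example. Each conditional value set corresponds to one multiplexer in hardware.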
The detailed steps for executing the Merge operation of the Basic
Blocks as given in Figure 2.9 are:
6. Apply the same steps for B3 as was done for B2.
12. Repeat steps 9 to 11 for B3 as was done for B2. The
mutually exclusive multiple value sets can be synthesized in
hardware using multiplexers, with the conditional expressions
functioning as the select lines. As a further optimization step,
subexpression elimination/deduction can be used for
expression evaluations. Expressions can be tagged with
identifiers while building/merging the initial-state and final-state
sets, and registers can be mapped to these identifiers
instead of the expressions. This avoids duplication of the same
expression.
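
The tagging of expressions with identifiers can be sketched as a small lookup table. This is an illustrative assumption about how such tagging might be organized; the class `ExprTable`, its `tag` method and the identifier format `t0`, `t1`, ... are hypothetical and do not come from the thesis.

```python
class ExprTable:
    """Hands out one identifier per distinct expression (hypothetical helper)."""

    COMMUTATIVE = {"+", "*"}

    def __init__(self):
        self.ids = {}   # (op, operand, ...) -> identifier

    def tag(self, op, *args):
        # Operands of commutative operators are sorted so that
        # x+y and y+x map to the same identifier.
        if op in self.COMMUTATIVE:
            args = tuple(sorted(args))
        key = (op, *args)
        if key not in self.ids:
            self.ids[key] = f"t{len(self.ids)}"
        return self.ids[key]

table = ExprTable()
s1 = table.tag("+", "x", "y")     # x+y
s2 = table.tag("+", "y", "x")     # same expression, same identifier
s3 = table.tag("+", s1, "c")      # (x+y)+c reuses the identifier of x+y

# Registers are mapped to identifiers rather than to expressions.
final_state = {"reg3": s1, "reg4": s2, "reg5": s3}
```

Here `reg3` and `reg4` share one identifier, so the adder for x+y would be instantiated once and its output reused.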
ii. Copy propagation is used to create dead code, which has not
been reported earlier: when (regw,a) is preceded by (reg0,w+a),
the latter is modified to (reg0,a+a), so (regw,a) can be
removed in order to reduce the space complexity.
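
The interplay of copy propagation and dead code elimination can be sketched as below. The sketch assumes that the copy (regw,a) executes before the use (reg0,w+a), represents each entry as a destination with a tuple of operand names, and uses the hypothetical helpers `copy_propagate` and `eliminate_dead`; the actual routines are those in Appendix 3 of the thesis.

```python
def copy_propagate(entries):
    """Replace uses of a copied name by the name it was copied from.

    entries: list of (dest, operands) in execution order; an entry with a
    single operand is treated as a plain copy.
    """
    copies = {}                       # name -> name it currently mirrors
    out = []
    for dest, operands in entries:
        operands = tuple(copies.get(op, op) for op in operands)
        # dest is overwritten here, so copies involving it are no longer valid.
        copies = {k: v for k, v in copies.items() if k != dest and v != dest}
        if len(operands) == 1 and operands[0] != dest:
            copies[dest] = operands[0]
        out.append((dest, operands))
    return out

def eliminate_dead(entries, live_out):
    """Drop entries whose destination is never read afterwards."""
    live = set(live_out)
    kept = []
    for dest, operands in reversed(entries):
        if dest in live:
            kept.append((dest, operands))
            live.discard(dest)
            live.update(operands)
    kept.reverse()
    return kept

entries = [("w", ("a",)),            # (regw, a)
           ("reg0", ("w", "a"))]     # (reg0, w+a)
propagated = copy_propagate(entries)
optimized = eliminate_dead(propagated, live_out={"reg0"})
```

After propagation the second entry reads a+a, so the copy into w is no longer read and is removed by the dead code pass, provided w is not live after the block.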
1. For the registers in the final-state set whose values
are straight moves from those in the initial-state set, software code
to move the values is inserted into the software glue code
that calls the hardware. This glue code replaces the original
software block. These registers can be removed from the final-state
set. Also, registers in the initial-state set whose values are
not used in the final-state set after this step can be removed.
2. For each register in the final-state set, the blocks that can
follow the partition block in execution are scanned for a read
or write on the corresponding register. If the first operation
encountered is a write, the register can be removed from the final-state
set.
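
The two pruning rules above can be sketched as follows. This is a minimal illustration under stated assumptions: final-state values are plain strings, the blocks that can follow the partition are given as lists of (kind, register) accesses in execution order, a register with no access on some following path is conservatively kept, and the function names and sample values are hypothetical.

```python
import re

def split_glue_moves(initial, final):
    """Rule 1: move straight register copies into the software glue code."""
    reg_of = {sym: reg for reg, sym in initial.items()}
    moves, kept = [], {}
    for reg, expr in final.items():
        if expr in reg_of:
            moves.append((reg, reg_of[expr]))     # (destination, source)
        else:
            kept[reg] = expr
    # Initial-state entries are still needed only if a kept value uses them.
    used = {s for expr in kept.values()
            for s in re.findall(r"[A-Za-z_]\w*", expr)}
    needed = {reg: sym for reg, sym in initial.items() if sym in used}
    return moves, needed, kept

def drop_overwritten(final, successor_traces):
    """Rule 2: drop registers that are written before being read afterwards."""
    def first_access(reg, trace):
        for kind, r in trace:
            if r == reg:
                return kind
        return None                               # never touched on this path

    kept = {}
    for reg, expr in final.items():
        dead = bool(successor_traces) and all(
            first_access(reg, t) == "write" for t in successor_traces)
        if not dead:
            kept[reg] = expr
    return kept

initial = {"reg3": "a", "reg1": "b", "reg6": "c"}
final = {"reg4": "a", "reg5": "a+b+c", "reg9": "2*b"}
moves, needed, kept = split_glue_moves(initial, final)
pruned = drop_overwritten(kept, [[("write", "reg9"), ("read", "reg5")]])
```

In this example the move of reg3 into reg4 goes to the glue code, and reg9 is dropped because the following code overwrites it before reading it, leaving only reg5 to be produced by the hardware.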
Benchmark  Basic block 1 (mul int) /    S/W runtime   Partition runtime  Pure H/W     Speedup by
           Basic block 2 (mult long) /  in cycles     in cycles          (in 100 ns)  partitioning
           Loop                         (100 ns),     (100 ns),
                                        approx        approx
Dct        - -                          26667         23222              -            1.14
Diffeq     - -                          645           357                4.5          1.8
Ellip      - -                          952           538                0.375        1.76
FIR        - -                          769           402                0.192        1.9
IIR        - -                          714           366                0.35         1.95
Lattice    -                            13333         8855               -            1.5
Nc         - -                          16000         15161              -            1.05
Volterra   - -                          16000         15588              -            1.02
Wavelet    - -                          13333         12922              -            1.03
Wdf7       - -                          20000         13479              -            1.5
Benchmark  Total number     Instructions  Instructions  % instructions  % of runtime of   Speedup in %
           of instructions  partitioned   partitioned   partitioned     the instructions  by partitioning
                            for SW        for HW                        partitioned
Dct        3504             3481          23            0.7             12                114
Diffeq     410              330           80            17              59.6              180
Ellip      574              494           80            12.45           58.2              176
Fir        404              324           80            17.75           63.9              190
Iir        382              302           80            18.8            65.3              195
Lattice    2613             2589          24            9.2             38.1              150
Nc         3886             3862          24            0.6             5.3               105
Volterra   2664             2640          24            0.9             2.6               102
Wavelet    3617             3593          24            0.7             3.1               103
Wdf7       3958             3692          266           5.5             37.4              150
The total elapsed time to generate the hardware module from the
binary is given in Table 2.4 and is obtained using MATLAB.
Figure 2.11 shows the extent to which each program’s size, in terms
of the number of cycles consumed per run, is reduced. The improvement
for certain benchmarks (Dct, Nc, Volterra, Wavelet) may not be large,
but from Table 2.3 and Figure 2.12 it can be seen that the temporal size
of their partitions is very low compared to the other benchmarks. Temporal
size is the percentage of runtime of the partition, as defined by
Jantsch et al (1994). The partition was selected manually out of the candidate
partitions, with small size as a desired criterion to ease
dataflow extraction.
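
The link between a low temporal size and a small improvement can be made concrete with Amdahl's law, which is used here as an explanatory aid and is not part of the original analysis. If a fraction p of the runtime is spent in the partition, the overall speedup cannot exceed 1/(1-p), even if the hardware takes no time at all. The sketch below applies this bound to the temporal sizes and speedups listed in the tables above; the function name `amdahl_bound` is hypothetical.

```python
def amdahl_bound(temporal_size_percent):
    """Upper bound on overall speedup when only the partition is accelerated."""
    p = temporal_size_percent / 100.0
    return 1.0 / (1.0 - p)

# Temporal size (% of runtime of the partition) and measured speedup,
# copied from the tables above for the four low-improvement benchmarks.
temporal = {"Dct": 12, "Nc": 5.3, "Volterra": 2.6, "Wavelet": 3.1}
measured = {"Dct": 1.14, "Nc": 1.05, "Volterra": 1.02, "Wavelet": 1.03}

bounds = {name: amdahl_bound(p) for name, p in temporal.items()}
gaps = {name: abs(bounds[name] - measured[name]) for name in temporal}
```

For these four benchmarks the measured speedups lie within 0.01 of the bound, which suggests that the partitions are already accelerated about as far as their share of the runtime allows.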
The instruction rlcf, which means rotate left with carry, takes the
carry bit of the status register as input and generates the carry
bit again as output. This means the carry bit needs to be held in the
initial-state as well as the final-state. This emphasizes that the bits of status registers also need to be
held in the final-state if they are going to be affected by the present
instruction, and in the initial-state if they are going to be used in the
instruction execution. The structure of the CF generator is shown in Figure
2.16. Optimizations such as dead code elimination and copy propagation are
performed on the CFG as shown in Figure 2.17. The generated CDFG is shown in
Figure 2.18.
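
The role of the carry bit can be illustrated with a small model of rlcf. The sketch assumes an 8-bit file register and models only the carry bit of the status register; the key names `f` and `STATUS.C` are hypothetical labels for the state-set entries.

```python
def rlcf(value, carry_in):
    """Rotate an 8-bit value left through the carry bit."""
    carry_out = (value >> 7) & 1                   # old bit 7 becomes the carry
    result = ((value << 1) & 0xFF) | (carry_in & 1)  # old carry enters bit 0
    return result, carry_out

# The carry bit is read, so it belongs to the initial-state set ...
initial_state = {"f": 0b1000_0001, "STATUS.C": 0}
result, carry = rlcf(initial_state["f"], initial_state["STATUS.C"])
# ... and it is written, so it also belongs to the final-state set.
final_state = {"f": result, "STATUS.C": carry}
```

Dropping `STATUS.C` from either set would change the result of a following rotate, which is why the status bits are tracked alongside the ordinary registers.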
           Without merging          Merging                  Merging forward branching
Benchmark  Adders/      Registers   Adders/      Registers   Adders/      Registers
           subtractors              subtractors              subtractors
DCT        14           125         2            39          2            22
Diffeq     24           123         1            8           -            -
Ellip      24           123         1            8           -            -
Fir        24           123         1            8           -            -
Iir        24           123         1            8           -            -
Lattice    14           124         2            39          2            22
NC         14           124         2            39          2            22
Volterra   14           124         2            39          2            22
Wavelet    14           124         2            39          2            22
Wdf7       36           125         2            47          -            -
binaries are partitioned for hardware but it results in very good reduction of
hardware cost.