
CHAPTER 2

HARDWARE/SOFTWARE CODESIGN USING BINARY PARTITIONING

Hardware/software partitioning is the key issue in the codesign of embedded systems. The behavioral model is compiled into software binaries, which must then be partitioned between hardware and software. The portion of the binary assigned to hardware has to be decompiled into a hardware description language, so a binary-level approach requires both a partitioner and a decompiler; to obtain good results, the decompiler design must be well optimized. The buffer size has to be estimated, in both the binary-level and source-level approaches, to obtain the communication cost between hardware and software, and an estimator is needed to calculate the hardware required for the control constructs of the partitioned hardware.

2.1 PARTITIONING AT BINARY LEVEL

The hardware replacing a software portion must produce the same transformation as the original pure-software implementation, to ensure that the functionality does not change. Partitioning at the binary level makes the method suitable for dynamic, on-the-fly partitioning of software into hardware. The basic blocks of the software binary are transformed into dataflow descriptions for implementing the partitioned software in hardware. As a first step, the portion of the binary to be transformed into hardware is identified using instruction-level profiling. The partitioned software (to be transformed into hardware) is represented in terms of an initial state and a final state, each described by a set of register-value pairs. A CDFG representing the transformation from initial state to final state is derived by expressing the final state in terms of algebraic placeholders; this CDFG is then synthesized in hardware.

2.2 DESIGN FLOW FOR CODESIGN SYSTEM

The design process involves the following steps:

system specification
optimization
system partitioning
decompilation
hardware estimation of control construct
system on chip

2.2.1 System Specification

The system design process starts with a specification of the system. The requirements are typically specified in a non-formal language; however, there have been several attempts to formalize system specification to help automate the steps that follow. ESTEREL, for example, is a finite-state-machine-based specification language supported by codesign tools such as POLIS. The system specification also incorporates timing, area and power constraints.

2.2.2 Optimization

The approach incorporates both dataflow and control optimizations performed on a suitable intermediate design representation. The aim is not only to enhance the productivity of the designer and system developer, but also to improve the quality of the final synthesis outcome. This reflects the present need in codesign to move from synthesis-based technology to compiler-based technology. Dataflow and control optimizations at a higher abstraction level can lead to significant size and performance improvements in both the synthesized hardware and the software. Guided optimization can be applied on the internal design representation, independent of the abstraction level, and need not be restricted to the final stages of software assembly-code generation or hardware synthesis. These themes are treated in detail in Function/Architecture Optimization and Codesign of Embedded Systems.

2.2.3 System Partitioning

Partitioning refers to the technique of identifying which part of the system will be implemented in hardware (as an ASIC) and which part in software. It involves identifying the behavioral description of the different parts of the system; the hardware part, for example, might be described using VHDL or Verilog and the software part modeled in the C language. In addition, the interfacing logic, including any handshaking or bus logic, is also decided at this time.

2.2.3.1 Source-level Partitioning

In source-level partitioning, as outlined in Figure 2.1, partitioning takes the source code as input; the decision as to which part of the software is to be transformed into hardware is taken at the source-code level. The Berkeley BRASS project, the NAPA C compiler, the Nimble compiler, and Proceler's product for compiling C code to a processor with an FPGA coprocessor (using profiling information to detect and move critical regions of software to the coprocessor) are some of the efforts in reconfigurable computing with partitioning at the source-code level.

[Figure: S/W source → compiler front-end → H/W-S/W partitioning; software path: compiler back-end → assembly and object files → assembler & linker → binaries; hardware path: HDL source → synthesis → netlists]

Figure 2.1 Source-based Partitioning approach

2.2.3.2 Binary-level Partitioning

Partitioning an embedded-system application between a microprocessor and custom hardware has been shown to improve performance, and the advent of single-chip microprocessor/FPGA platforms makes such partitioning even more beneficial. Although source-code partitioning is preferable from a purely technical viewpoint, binary-level partitioning provides several very practical benefits for commercial use. The methodology for binary partitioning is represented in Figure 2.2.

[Figure: SW source → compilation → assembly and object files → assembler & linker → binaries → H/W-S/W partitioning; software path: binaries; hardware path: HW source → synthesis → netlists]

Figure 2.2 Binary based partitioning approach



2.2.3.3 Dynamic Partitioning

Another trend is towards dynamic reconfigurability of the microprocessor/coprocessor platform, to achieve on-the-fly partitioning in which critical regions of the software are placed on the coprocessor. The system architecture includes on-chip tools for profiling, decompilation, synthesis, and placement and routing on a simplified configurable logic fabric, so that dynamic partitioning can be performed on real benchmarks. Bingfeng Mei et al (2000) described a hardware/software partitioning and scheduling approach for dynamically reconfigurable embedded systems, using partial reconfiguration based on a genetic algorithm and list scheduling.

2.2.4 Decompilation

In software-oriented partitioning, the section of code partitioned for hardware has to be transformed into a synthesizable RTL model and subsequently realized as hardware. In hardware-oriented partitioning, the critical regions remain in hardware while the non-critical regions are moved to software, which requires decompiling them into software form. In binary partitioning, the critical kernel is moved to hardware, which requires decompiling it into a synthesizable model, while the remaining instructions stay in software.

2.2.5 Hardware Estimation for Control Construct

The control logic of the partitioned hardware is estimated, for hardware partitioned at both source level and binary level, using a behavioral network graph.

2.2.6 System on a Chip (SoC)

The functional-level model is an untimed model composed of function calls. The transaction-level model is a timed model in which interactions between models occur through signals and events. At this level of modeling the architecture and algorithms are verified, and any performance issues and bottlenecks are studied and simulated. Once the architecture and algorithms are verified, the next step is to determine which part of the system is to be implemented in hardware and which part goes into software; this process is called hardware/software partitioning. The software portion runs as embedded software on the general-purpose microprocessor, and the hardware portion is implemented as an embedded ASIC core, as shown in Figure 2.3.

[Figure: functional level → transaction level → H/W-S/W partitioning; software path: embedded software; hardware path: behavioral model → behavioral synthesis → RTL model → RTL and logic synthesis → gate netlist]

Figure 2.3 Typical SoC design flow

The hardware/software partitioning approach obtains a significant improvement in the performance of the embedded application compared to a pure-software implementation, while using fewer hardware resources than a pure-hardware implementation.

2.3 MODELING TOOLS FOR SOC DESIGN PARADIGM

Small Device C Compiler (SDCC) is a retargettable, optimizing ANSI-C compiler. Its features include:

a host of standard optimizations such as global subexpression elimination, loop optimizations (loop-invariant code motion, strength reduction of induction variables and loop reversal), constant folding and propagation, copy propagation, dead-code elimination and jump tables for 'switch' statements;

MCU-specific optimizations, including a global register allocator;

a full range of data types: char (8 bits, 1 byte), short (16 bits, 2 bytes), int (16 bits, 2 bytes), long (32 bits, 4 bytes) and float (4-byte IEEE);

the ability to add inline assembler code anywhere in a function.

GPSIM is a full-featured software simulator for Microchip PIC microcontrollers. It supports the standard simulation paradigm, including breakpoints, single stepping, disassembly, and memory inspection and modification. In addition, GPSIM supports many debugging features that are generally available only with in-circuit emulators: for example, a continuous trace buffer tracks every action of the simulator, and read and write breakpoints can be set that trigger even when a specific value is read from or written to a register.

ModelSim provides a comprehensive simulation and debug environment for complex ASIC and FPGA designs, with support for multiple languages including Verilog, SystemVerilog, VHDL and SystemC. The Virtex-II Pro, Virtex-4 and Virtex-5 FPGA families from Xilinx are of particular interest to system-on-a-chip (SoC) designers, since they can include up to two embedded IBM PowerPC cores. Such designs are capable of running a regular embedded OS (such as Linux or VxWorks) and interfacing with custom logic implemented in the surrounding fabric.

2.4 DESIGN METHODOLOGY FOR CODESIGN SYSTEM

The high-level language is transformed into binary form by the assembler and linker of the SDCC compiler. The generated binary is profiled using the GPSIM simulator by applying suitable test vectors. The critical kernel of the profiled binary is identified and partitioned. The control construct of the partitioned software module that is to be transformed into hardware is estimated, and the module is decompiled into HDL. These steps are shown in Figure 2.4.
[Figure: SW source → compilation → assembly and object files → assembler and linker → binaries → H/W-S/W partitioning; software path: binaries; hardware path: CDFG → decompilation → BNG → HW estimation → partitioned HW (IP core)]

Figure 2.4 Design methodology for Codesign System



2.5 ISSUES IN HARDWARE/SOFTWARE CODESIGN

The POLIS system (Balarin et al 1997) is intended only for control-dominated systems, implemented with a microcontroller for the tasks assigned to software and an ASIC for the tasks assigned to hardware. The proposed work is applicable to both control-dominated and data-dominated systems, which are converted into binary; the binary is partitioned into software and hardware modules, the software module is implemented on a microcontroller, and the hardware module is optimized and decompiled into HDL.

Hardware/software partitioning is manual in the POLIS approach. The proposed work covers binary-level partitioning, source-level partitioning and flexible partitioning, and a partitioning algorithm is proposed to partition the critical kernel of the software binaries.

POLIS supports languages such as ESTEREL, graphical FSMs, and subsets of Verilog or VHDL. In the proposed work, binary-level partitioning operates on binaries and is therefore independent of the high-level language; source-level partitioning and flexible partitioning support the C language.

In POLIS, HW-SW partitioning is based heavily on design experience and is very difficult to automate. In the proposed work, the structure of the partitioning algorithm for binary-level partitioning is included in Appendix 1.

For software synthesis, POLIS implements and optimizes the desired behavior of the CFSMs in a high-level, processor-independent representation of the decision process, similar to a control/data flow graph. In the proposed work, the intermediate representation is a control data flow graph.

As in POLIS, the final implementation of the system should be produced from this high-level abstraction using automatic synthesis as much as possible, to ensure an implementation that is "correct by construction", and validation should be done at the highest possible levels of abstraction.

2.6 PROFILING AND PARTITIONING

Hardware/software partitioning is the process of dividing an application into software running on a microprocessor and hardware coprocessors. On the pessimistic side, hardware/software partitioning has had limited commercial success, partly due to tool-flow problems. These arise primarily for the following reasons. First, a designer must use an appropriate profiler to detect the regions that contribute a large percentage of program execution. Second, a designer must use a compiler with partitioning capabilities to partition the software source; such compilers are rare and often resisted. Third, the designer must apply a synthesis tool to convert the partitioning compiler's hardware-description output into an FPGA configuration. A tool flow requiring the integration of profilers, special compilers and synthesis tools is far more complicated than a typical software design flow, demanding extra effort that most designers and companies are not willing to expend.

However, on the optimistic side, partitioning is a well-known technique that can achieve results superior to software-only solutions. The appearance of platforms incorporating a microprocessor and an FPGA (field-programmable gate array) on a single chip has recently made hardware/software partitioning even more attractive. Such platforms provide more efficient communication between the microprocessor and the FPGA than two-chip designs, resulting in improved performance and reduced power. In fact, single-chip platforms encourage partitioning by designers who might otherwise have created a software-only design: by treating the FPGA as an extension of the microprocessor, a designer can move critical software regions from the microprocessor onto FPGA hardware, improving performance and reducing energy consumption.

2.6.1 SDCC Compiler

The RISC architecture has been chosen as the software platform. The software is profiled at the instruction level (to identify bottlenecks in the binary) and run for a fixed number of cycles, from which the average number of clock cycles consumed per instruction is calculated. To generate the binary, a retargettable ANSI-C compiler, SDCC, is used. The compiler performs standard optimizations such as global subexpression elimination, loop optimizations (loop-invariant code motion, strength reduction of induction variables and loop reversal), constant folding and propagation, dead-code elimination and jump tables for 'switch' statements, as well as MCU-specific optimizations such as global register allocation. The generated output is profiled by applying test vectors in GPSIM, targeting the Microchip PIC 18F452 device.

The compiler accepts a behavioral description with a full range of data types: char (8 bits, 1 byte), short (16 bits, 2 bytes), int (16 bits, 2 bytes), long (32 bits, 4 bytes) and float (4-byte IEEE), and allows inline assembler code anywhere in a function. It accepts C as input and generates a ".cod" file as output. A binary partitioned for implementation in hardware is decompiled using dataflow extraction, creating equivalent HDL source for the partitioned binary; the presented technique is generic across processors. If the number of clock cycles consumed by an instruction exceeds the average by a certain margin, that instruction indicates the presence of a potential candidate block for pure hardware implementation.

2.6.2 Partitioning Algorithm

An instruction block is selected for hardware partitioning when the average clock-cycle consumption per instruction in the block exceeds the average clock-cycle consumption per instruction in the whole program. Scattered instructions may logically belong to the block, in terms of both control flow and cycle consumption, while being separated from it by instructions with lower cycle consumption; such scattered instructions can occur in the case of jumps. The source code of the partitioning algorithm, which gathers the sets of instructions that consume the most cycles, is given in Appendix 1.
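The selection rule can be summarized by the following minimal sketch; the (address, cycles) profile format and the helper name are illustrative assumptions, not the Appendix 1 implementation (the cutoff ratio of 1.5 is the one used in Section 2.13):

```python
# Hypothetical sketch of the cycle-threshold selection described above.

def partition(profile, cutoff_ratio=1.5):
    """profile: list of (address, cycles) pairs from instruction-level
    profiling. Returns contiguous blocks of addresses whose cycle count
    exceeds cutoff_ratio times the program-wide average."""
    average = sum(c for _, c in profile) / len(profile)
    cutoff = cutoff_ratio * average

    blocks, current = [], []
    for address, cycles in profile:
        if cycles > cutoff:
            current.append(address)      # hot instruction: extend the block
        elif current:
            blocks.append(current)       # cold instruction closes the block
            current = []
    if current:
        blocks.append(current)
    return blocks

# Example: the average is 2.8 cycles, the cutoff 4.2, so only addresses
# 2 and 3 qualify as a candidate block.
print(partition([(0, 1), (1, 1), (2, 6), (3, 5), (4, 1)]))   # [[2, 3]]
```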

2.6.3 Decompilation

The behavioral model in the high-level language is converted into software binaries, and optimization is performed on the partitioned software binary. The behavioral model is transformed into assembly instructions for the target architecture, the PIC 18F452, and profiled using GPSIM. The critical kernel is identified in the software binaries using the partitioning algorithm of Section 2.6.2. Decompilation is performed using a MATLAB-based decompiler: the assembly instructions are converted into a control data flow graph and optimized. An important step in achieving some of these optimizations is the recognition of high-level language constructs such as loops and arrays. Most profilers determine the execution time of loops by monitoring branch instructions that have a small negative target; if a loop is completely unrolled by the decompiler, these branch instructions no longer appear in the software binary.
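For concreteness, the loop-detection idea just described can be sketched as follows; the (address, target) record format and the window size are illustrative assumptions, not the PIC 18F452 instruction encoding:

```python
# Minimal sketch: a branch whose target lies a short distance behind it
# is flagged as a loop back edge.

def loop_backedges(branches, window=32):
    """branches: list of (branch_address, target_address) pairs."""
    return [(addr, tgt) for addr, tgt in branches
            if -window <= tgt - addr < 0]      # small negative displacement

# The branch at 0x120 jumping back to 0x110 is a loop candidate; the
# forward branch at 0x130 is not.
print(loop_backedges([(0x120, 0x110), (0x130, 0x200)]))   # [(288, 272)]
```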
Stitt and Vahid (2003) presented a decompiler that produces register transfers for each assembly instruction in the software binary; these are then converted into a control flow graph to determine basic blocks. Any region of code containing a dynamic jump instruction is excluded from hardware implementation. A data flow graph is generated for each basic block using definition-use and use-definition analysis on the register-transfer lists, connecting the tree for each register transfer into a full data flow graph. The majority of the optimization is performed using a behavioral synthesis tool; a few other optimizations are performed in the decompiler.

In a partitioned system, the hardware partition, together with the software glue code, has to realize precisely the effect of the software instructions it replaces. Stitt and Vahid (2005) discuss a decompilation technique for software binaries partitioned for hardware; the hardware partition must maximize performance while minimizing delay and buffer requirements as the cost function.

Mittal (2004) discusses the translation of binaries to FPGA hardware. For each instruction processed in the partitioning algorithm, a check is made for jump instructions and a list of all such jumps is created. The reported additional time complexity is O(j*p), where j is the number of jumps encountered and p the number of partitions gathered; similarly, the space complexity is O(j). To decompile the binary, a dataflow-extraction technique is proposed. The system-state representation is shown in Figure 2.5, where c and d are literals and a and b are algebraic placeholders for registers 0 and 1.
[Figure: for the instruction sequence MOV R(0), R(1); ADD c, R(1); MUL R(1), d — initial state S1 = {(0, a), (1, b)}; final state S2 = {(0, a), (1, (a+c)*d)}. In general, the initial state is S1 = {(i, value(i)) | for every register i in the system} and the final state is Sn = {(i, value(i)) | for every register i in the system}; a sequence of software instructions transforms S1 into Sn.]

Figure 2.5 System State representation

2.7 COMPLEXITY ANALYSIS OF PARTITIONING

The algorithm makes a single pass over the whole program and compares the cutoff value with the execution cycles consumed by each instruction. If n is the total number of instructions in the program, the algorithm's time complexity is O(n). The space complexity can be calculated similarly: the algorithm processes only one instruction at a time and holds only one instruction in memory if the list of instructions is maintained in a file. In the worst case the number of partitions is n/2, since n/2 sections of time-consuming code can be interleaved with n/2 sections of less time-consuming code. Hence the space complexity is O(1) in the best case, when few partitions are selected, and O(n) in the worst case, when many are.

2.8 DATA FLOW EXTRACTION

The partitioned instructions are transformed into a control data flow graph, which is then converted into HDL source. By decoding the instruction sequence, the initial state and the final state needed for the realization are separated, and algebraic variables are assigned to all the registers in the instruction sequence. A non-control-flow instruction has one or more source operands, one or more destination operands, and an operation that maps the source operands to the destination operands. The destination operands are changed by the operation in terms of the source operands, yielding an algebraic expression; the original values of the destination operands in the initial-state set are related to the final-state set values by this expression. The next instruction's initial-state set is then the final-state set of the previous instruction.

The initial-state set need contain only the register-value pairs whose values are used as source operands in the instructions, and the final-state set only those register-value pairs whose values have been modified from the initial state; registers that appear as neither source nor destination operands are omitted from both sets. With already populated initial-state and final-state sets, processing an instruction requires scanning the final-state set for the value of each source operand. If the lookup hits, the value is propagated into the expression for the destination operand. If it misses, the initial-state set is scanned and the value propagated from there. If this lookup also fails, the register is as yet neither a source nor a destination; in that case the register is added to the initial-state set, assigned a new algebraic value, and that value is propagated into the destination register's value expression.
At the start of processing a partition block, no instruction has yet been processed, so the initial-state and final-state sets are empty; they are populated as more instructions are processed. In effect, the two sets are incrementally updated to reflect the effect of execution from the beginning of the block up to the instruction in question. This kind of mapping from initial state to final state is straightforward for a basic block. When loops are involved, the mapping quickly degenerates into complex, computationally expensive expressions, so loops are unrolled using an array function.

2.9 CDFG GENERATION

Recurrence relations could be used to represent the repetitive modification of registers. Instead, the proposed algorithm splits the partition block into basic blocks alternating with control nodes; within each basic block an initial state and a final state can be defined, so a CDFG is obtained in which basic blocks representing dataflow alternate with control nodes. The algorithm is based on exploring the distinct paths of execution. It scans the instructions in sequence and, whenever a branch instruction is encountered (except goto, which does not select among alternatives at runtime), creates a control node and stacks the possible destination locations for later exploration of the associated control paths. A stacked destination is not processed if it has already been encountered in another branch instruction, which effectively avoids duplicating a basic block. When the boundary of the partition or a branch instruction is reached, the scan stops, marking the end of a basic block; control nodes are then created and the stack is popped for the scan of the next basic block. For each scan, an initial-state set and a final-state set are created, and when creating control nodes the algorithm checks against already-created nodes to avoid duplication. The algorithm thus works by tracing execution paths. The structure of the CF generator with register allocation is given in Appendix 2 of the thesis, and copy propagation and dead-code elimination in Appendix 3. Copy propagation is used to create dead code: when the copy (regw, a) is followed by (reg0, w+a), the latter is modified to (reg0, a+a), after which (regw, a) can be removed to reduce the space complexity. Both the CF generator and the decompiler are built on growing arrays, so they scale to any number of instructions. A minimal sketch of the exploration loop is given below.
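The sketch assumes helper callables is_branch() and targets(), hypothetical stand-ins for the instruction decoding in the MATLAB-based decompiler; per the text above, is_branch() returns False for goto.

```python
# Minimal sketch of the path-exploration loop described above. targets()
# is assumed to return both the taken and fall-through destinations.

def explore(instrs, start, end, is_branch, targets):
    """Split instrs[start:end] into basic blocks and control nodes."""
    stack = [start]                # destination locations pending exploration
    seen = set()                   # block starts already scanned
    blocks, control_nodes = [], []

    while stack:
        loc = stack.pop()
        if loc in seen or not (start <= loc < end):
            continue               # already scanned, or outside the partition
        seen.add(loc)
        block = []
        while loc < end:
            block.append(instrs[loc])
            if is_branch(instrs[loc]):        # a branch ends the basic block
                control_nodes.append(loc)
                stack.extend(targets(instrs[loc], loc))
                break
            loc += 1
        blocks.append(block)       # the partition boundary also ends a block
    return blocks, control_nodes
```

As an illustration, a code snippet is shown in Figure 2.6 in pseudocode form.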

add reg-b to reg-a               // basic block 1
// some more processing
if (reg-a <> 0) goto block(3)    // control node 1
add reg-c to reg-b               // basic block 2
// some more processing
goto block(4)
add reg-c to reg-a               // basic block 3
// some more processing
add reg-d to reg-b               // basic block 4
// some more processing

Figure 2.6 Example code snippet

The control flow can be visualized as shown in Figure 2.7.

[Figure: basic block 1 feeds a control node testing a <> 0; the false path goes through basic block 2 to basic block 4, and the true path through basic block 3 to basic block 4.]

Figure 2.7 Control flow and Block visualization in memory

Applying the dataflow-extraction algorithm yields the sets of basic blocks and control nodes shown in Figure 2.8.

[Figure: the derived CDFG contains basic block 1, a control node testing a <> 0, basic blocks 2 and 3 on the two paths, and basic block 4 duplicated at the end of each path.]

Figure 2.8 CDFG generated for the code snippet


The basic block 4 of the original code snippet has been duplicated. The reason is that "goto" is not treated as a branch instruction in the proposed paradigm: its destination is fixed, and only the distinct paths of execution are traced. An alternative would be to treat "goto" like any other branch instruction and create a separate block in the CDFG for block 4.

However, the separation of block 3 from block 4 can then be identified only by checking, for every instruction processed, whether it is the starting location of basic block 4. The rules of thumb for creating the initial-state and final-state sets can be stated as follows (a sketch implementing them follows the list):

1. Initialize both sets to the null set.

2. For every instruction scanned, if a source operand is listed in the final-state set, use the corresponding value listed there in the destination operand's value expression.

3. If step 2 fails and the source operand is listed in the initial-state set, use the corresponding value in the destination operand's expression.

4. If step 3 also fails, add the source operand to the initial-state set with a distinct algebraic value, and use this value in the destination operand's expression.

5. Add or update the destination operand with its expression in the final-state set as required.
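A minimal sketch of these rules, assuming instructions already decoded into (destination, operation, sources) triples and string-valued symbolic expressions (illustrative assumptions, not the Appendix 4 pseudocode; literals such as the c and d of Figure 2.5 are treated like registers for simplicity):

```python
import itertools

def extract_state(instructions):
    fresh = (f"v{i}" for i in itertools.count())  # distinct algebraic values
    initial, final = {}, {}                       # rule 1: start with null sets
    for dest, op, sources in instructions:
        values = []
        for src in sources:
            if src in final:                      # rule 2: final-state hit
                values.append(final[src])
            elif src in initial:                  # rule 3: initial-state hit
                values.append(initial[src])
            else:                                 # rule 4: new algebraic value
                initial[src] = next(fresh)
                values.append(initial[src])
        final[dest] = "(" + f" {op} ".join(values) + ")"   # rule 5
    return initial, final

# The Figure 2.5 sequence MOV R(0),R(1); ADD c,R(1); MUL R(1),d:
seq = [("R1", "mov", ["R0"]),
       ("R1", "+", ["R1", "c"]),
       ("R1", "*", ["R1", "d"])]
print(extract_state(seq))
# ({'R0': 'v0', 'c': 'v1', 'd': 'v2'}, {'R1': '(((v0) + v1) * v2)'})
```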

A minor constraint for this algorithm to work is that there must be no jumps from other parts of the code into the middle of the partition block. This check can easily be incorporated in the partitioning algorithm itself at a minor cost in time complexity: for each instruction processed in the partitioning algorithm, a check for jumps is made and a list of all such jump instructions is kept. The algorithm for dataflow extraction is given in pseudocode form in Appendix 4. For a dereferenced address in an instruction, the address register is placed in the initial-state set, and the destination/source register is represented in the initial/final-state sets by a dereferencing operator combined with the address register's value.

2.10 OPTIMIZATION OF CDFG

Once the CDFG is ready, the following optimization steps are applied:

1. For a basic block, if a register in the initial-state set is also listed in the final-state set with the same value in both (which can happen when registers are shadowed and used temporarily as scratchpads), the entry can be removed from the final-state set.

2. After step 1, values in the initial-state set that are not used in the final-state set can be removed from the initial-state set as well.

3. Identify a section of the graph that has only forward jumps and no jumps from the remaining part of the graph into its middle. The initial-state and final-state sets of all the basic blocks in that section can then be combined, and the intermediate control nodes removed; this transformation is termed the Merge operation. A destination-operand element in the combined final-state set can have multiple values, one per basic block, each tagged with a conditional expression corresponding to the control nodes that lead to the execution of that block. The conditional expressions for these values are mutually exclusive: only one of the values is selected at any instant of execution in the CDFG, depending on the evaluation of the expressions.

As an illustration of this last step, Figure 2.9 shows the result of merging a basic block (B1), a control node (C1) following it, and the two basic blocks (B2, B3) representing the two paths the control node can lead to. This is the basic structure for a merge operation.

[Figure: B1 has initial = {(reg1, x), (reg2, y)} and final = {(reg3, x+y), (reg9, z)}. A control node tests reg7 ≠ 0; one path leads to B2 with initial = {(reg8, e)}, final = {(reg4, e), (reg9, 2e), (reg2, e)}, the other to B3 with initial = {(reg3, a), (reg1, b), (reg6, c)}, final = {(reg4, a), (reg5, a+b+c)}. The merged block has initial = {(reg1, x), (reg2, y), (reg6, c), (reg7, d), (reg8, e), (reg5, f)} and final = {(reg3, x+y), (reg4, {e | if d ≠ 0, x+y | if d = 0}), (reg5, {2x+y+c | if d = 0, f | if d ≠ 0}), (reg9, {z | if d = 0, 2e | if d ≠ 0}), (reg2, {e | if d ≠ 0, y | if d = 0})}.]

Figure 2.9 Merge Operations for Basic Blocks



All constructs having only forward jumps, and no jumps from other parts of the graph into them, can be handled by applying this model structure recursively.

4. Loops are unrolled using an array function.

5. A linear-scan algorithm is used for register-allocation optimization (a sketch follows this list).

6. Loop rerolling is performed to calculate the buffer size.
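Item 5 refers to the classic linear-scan approach; a minimal sketch under simplifying assumptions (precomputed live intervals, no spill handling, illustrative register-file size) is:

```python
# Minimal sketch of linear-scan register allocation over live intervals
# given as (start, end, name) triples.

def linear_scan(intervals, num_regs):
    free = list(range(num_regs))          # free physical registers
    active = []                           # (end, reg, name) still live
    alloc = {}
    for start, end, name in sorted(intervals):
        for item in [a for a in active if a[0] <= start]:
            active.remove(item)           # interval expired: free its register
            free.append(item[1])
        if not free:
            raise RuntimeError("spill needed (not handled in this sketch)")
        reg = free.pop()
        active.append((end, reg, name))
        alloc[name] = reg
    return alloc

# Example: three values, two of them overlapping, fit in two registers.
print(linear_scan([(0, 4, "a"), (1, 3, "b"), (4, 6, "c")], num_regs=2))
# {'a': 1, 'b': 0, 'c': 0}
```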

2.10.1 Merge Operation for Basic Blocks

The detailed steps for executing the Merge operation on the basic blocks shown in Figure 2.9 are:

1. Propagate values for the registers in the set B1.final ∩ B2.initial from B1.final through B2.initial to B2.final.

2. Set B2.initial = B2.initial − (B1.final ∩ B2.initial).

3. Propagate values for the registers in the set B1.initial ∩ B2.initial from B1.initial through B2.initial to B2.final.

4. Set B2.initial = B2.initial − (B1.initial ∩ B2.initial).

5. Set B1.initial = B1.initial ∪ B2.initial. Discard B2.initial (it is not needed further).

6. Apply the same steps to B3 as were applied to B2.

7. Propagate values from B1.initial into control node C1's conditional expression, adding an element to B1.initial if necessary.

8. For registers in B3.final ∩ B2.final, assign both values (the one in B2.final and the one in B3.final) to the register, each tagged with the appropriate condition, e.g. (B2.final.reg-a | if C1 is true, B3.final.reg-a | if C1 is false).

9. For registers in (B1.final ∩ B2.final) − (B3.final ∩ B2.final), assign both values (from B1.final and B2.final) to the register, each tagged with the appropriate condition, e.g. (B2.final.reg-b | if C1 is true, B1.final.reg-b | if C1 is false).

10. For registers in (B2.final − (B1.final ∩ B2.final) − (B3.final ∩ B2.final)) ∩ B1.initial, assign both values (from B1.initial and B2.final) to the register, each tagged with the appropriate condition, e.g. (B2.final.reg-c | if C1 is true, B1.initial.reg-c | if C1 is false).

11. For the registers remaining in B2.final, i.e. B2.final − ((B2.final − (B1.final ∩ B2.final) − (B3.final ∩ B2.final)) ∩ B1.initial), add the registers to B1.initial with a distinct identifier.

12. Repeat steps 9 to 11 for B3 as was done for B2.

The mutually exclusive multiple-value sets can be synthesized in hardware using multiplexers, with the conditional expressions functioning as the select lines. As a further optimization step, subexpression elimination/deduction can be used in expression evaluation: expressions can be tagged with identifiers while building/merging the initial-state and final-state sets, and registers mapped to these identifiers instead of to the expressions, avoiding duplication of the same expression. A simplified sketch of the merge follows.
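The following deliberately simplified sketch merges B1, C1, B2 and B3. States are plain register-to-expression string maps, final-state expressions are assumed to be written in terms of initial-state values (as in Figure 2.5), string substitution stands in for symbolic algebra, and corner cases such as steps 7 and 11 are glossed over:

```python
def propagate(known, expr_map):
    """Rewrite successor expressions using the values flowing out of B1."""
    out = {}
    for reg, expr in expr_map.items():
        for r, v in known.items():
            expr = expr.replace(r, f"({v})")
        out[reg] = expr
    return out

def merge(b1_initial, b1_final, cond, b2_final, b3_final):
    known = {**b1_initial, **b1_final}        # values flowing out of B1
    b2 = propagate(known, b2_final)           # steps 1-6, collapsed
    b3 = propagate(known, b3_final)
    merged = dict(b1_final)
    for reg in set(b2) | set(b3):             # steps 8-12: tag with condition
        t = b2.get(reg, known.get(reg, reg))  # value on the true path
        f = b3.get(reg, known.get(reg, reg))  # value on the false path
        merged[reg] = f"({t} if {cond} else {f})"
    return b1_initial, merged

# Tiny example: B1 computes reg3 = x+1; B2 copies reg3 into reg4, B3 reg1.
init, fin = merge({"reg1": "x"}, {"reg3": "x+1"}, "d != 0",
                  {"reg4": "reg3"}, {"reg4": "reg1"})
print(fin["reg4"])   # ((x+1) if d != 0 else (x))
```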
In the proposed work, the optimizations performed during decompilation include:

i. Register allocation for the partitioned binary.

ii. Copy propagation, used to create dead code: when the copy (regw, a) is followed by (reg0, w+a), the latter is modified to (reg0, a+a), which has not been reported earlier; (regw, a) can then be removed to reduce the space complexity (a sketch of items ii and iii follows this list).

iii. Dead-code elimination.

iv. Loop unrolling using array functions, performed in the decompiler itself; Gaurav Mittal et al (2007) perform only manual unrolling of the input code to the PACT compiler.

v. Loop rerolling, done manually.

vi. Merging, performed for each block separately, which has not been reported earlier. The ellip benchmark takes 81 states without merging and 3 states with merging; the DCT benchmark takes 24 states without merging and 10 states with merging.

vii. Merging for forward branching, performed to reduce the number of states, which has not been reported earlier. The ellip benchmark has no forward branching (so no further reduction is possible); the DCT benchmark takes 10 states with merging and 3 states with merging for forward branching.
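A minimal sketch of items ii and iii on a register-transfer list, assuming (destination, expression) pairs with string expressions; the substring substitution and the '+'-only tokenizer are illustrative shortcuts:

```python
def copy_propagate(transfers):
    copies, out = {}, []
    for dest, expr in transfers:
        for src, val in copies.items():
            expr = expr.replace(src, val)      # substitute known copies
        if expr.isidentifier():
            copies[dest] = expr                # record a plain copy
        else:
            copies.pop(dest, None)             # dest redefined, copy is stale
        out.append((dest, expr))
    return out

def dead_code_eliminate(transfers, live_out):
    """Drop definitions whose destination is never read afterwards."""
    kept, needed = [], set(live_out)
    for dest, expr in reversed(transfers):
        if dest in needed:
            kept.append((dest, expr))
            needed.discard(dest)
            needed.update(expr.replace("+", " ").split())
        # else: the definition is dead and is dropped
    return list(reversed(kept))

# Item ii's example: the copy ('w', 'a') turns ('r0', 'w+a') into
# ('r0', 'a+a'), after which the copy itself is dead.
print(dead_code_eliminate(copy_propagate([("w", "a"), ("r0", "w+a")]),
                          live_out={"r0"}))   # [('r0', 'a+a')]
```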
2.11 COMPLEXITY ANALYSIS OF DATA FLOW EXTRACTION

Traditionally, consecutive sets of blocks are merged into one large block, thereby allowing the compiler to schedule more instructions in parallel (Gaurav Mittal 2007). In this work, by contrast, merging is performed for forward branching and for blocks, as represented in Figure 2.7.

The time complexity of the algorithm can be computed assuming each instruction has a single source operand and a single destination operand. If the number of instructions in a partition is n, and the number of instructions determining control flow (equivalently, the number of control nodes created) is c, the average length of a block is n/c. The number of blocks that can be created, accounting for possible duplication of blocks, is 2c. If, as a worst case, each operand of each instruction is assumed distinct, the number of comparisons needed to create the initial- and final-state sets for a block of length n/c is O((n/c)²), so the total over all 2c blocks is O(n²/c). The comparisons to determine the uniqueness of control nodes are O(c²), and similarly the determination of the uniqueness of basic-block nodes is O((2c)²) = O(c²). In summary, the time complexity of the algorithm is O(n²/c) + O(c²) + O(c²) = O(n²/c) + O(c²). The space complexity can be calculated similarly: for a basic block, the initial- and final-state sets can have O(2n/c) elements, so for 2c basic blocks the space complexity is O((2n/c)·c) = O(n); the space for control nodes is O(c); and the stack is bounded by the number of basic blocks 2c, hence O(c). The total space complexity is therefore O(n) + O(c).

2.12 COPROCESSOR SYNTHESIS

The condition for successful synthesis using the CDFG derived from dataflow extraction is that there be no jumps from the rest of the software into the middle of the software block being replaced by hardware. As stated previously, this check can be incorporated in the partitioning algorithm itself without much effect on its complexity. The steps for hardware synthesis of a basic block are as follows:

1. For registers in the final-state set whose values are straight moves from the initial-state set, software code to move the values is inserted into the software glue code that calls the hardware; this glue code replaces the original software block. These registers can then be removed from the final-state set, and registers in the initial-state set whose values are not used in the final-state set after this step can also be removed.

2. For each register in the final-state set, the blocks that can follow the partition block in execution are scanned for a read or write of the corresponding register. If the first operation encountered is a write, the register can be removed from the final-state set.

3. The elements remaining in the initial-state set constitute the input parameters of the coprocessor, which can be fed through ports from the microcontroller; the expressions in the final-state set constitute the output parameters, which are read from the coprocessor and moved to the appropriate registers by software code (a sketch follows this list).
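A minimal sketch of step 3, reusing the register-to-expression dicts from the earlier sketches (an illustrative representation, not the thesis tooling):

```python
def coprocessor_ports(initial, final):
    inputs = sorted(initial)     # initial-state registers: fed in via ports
    outputs = sorted(final)      # final-state expressions: read back out
    return inputs, outputs

# Example: two input ports and one output port.
print(coprocessor_ports({"reg1": "x", "reg2": "y"}, {"reg3": "x+y"}))
# (['reg1', 'reg2'], ['reg3'])
```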
2.13 EXPERIMENTAL RESULTS

The results obtained by applying the proposed algorithm to the benchmarks are shown in Table 2.1.

Table 2.1 Runtime Comparison

Benchmark | Loop | Basic block 1 (mul int) | Basic block 2 (mult long) | S/W runtime in cycles (100 ns, approx) | Partition runtime in cycles (100 ns, approx) | Pure H/W (in 100 ns) | Speedup by partitioning
Dct | - | - | | 26667 | 23222 | - | 1.14
Diffeq | - | - | | 645 | 357 | 4.5 | 1.8
Ellip | - | - | | 952 | 538 | 0.375 | 1.76
FIR | - | - | | 769 | 402 | 0.192 | 1.9
IIR | - | - | | 714 | 366 | 0.35 | 1.95
Lattice | - | | | 13333 | 8855 | - | 1.5
Nc | - | - | | 16000 | 15161 | - | 1.05
Volterra | - | - | | 16000 | 15588 | - | 1.02
Wavelet | - | - | | 13333 | 12922 | - | 1.03
Wdf7 | - | - | | 20000 | 13479 | - | 1.5

Louis-Noel Pouchet et al (2008) reported a performance distribution for dct with a maximum speedup of 1.4; in the proposed work the speedup achieved is 1.14. The hardware utilization of the partitioned module is shown in Table 2.2.
Table 2.2 Hardware Utilization of Partitioned Binary

Benchmark | Loop | Basic block 1 (mul int) | Basic block 2 (mult long) | H/W utilization for partition (M = multiplier, S = slice)
Dct | - | - | | 31/1408 S = 2%
Diffeq | - | - | | 3/12 M = 25%; 8/1408 S = 0%
Ellip | - | - | | 3/12 M = 25%; 8/1408 S = 0%
Fir | - | - | | 3/12 M = 25%; 8/1408 S = 0%
Iir | - | - | | 3/12 M = 25%; 8/1408 S = 0%
Lattice | - | | | 97/1408 S = 6%; 10/12 M = 83%
Nc | - | - | | 31/1408 S = 2%
Volterra | - | - | | 31/1408 S = 2%
Wavelet | - | - | | 31/1408 S = 2%
Wdf7 | - | - | | 66/1408 S = 4%; 10/12 M = 83%

Of the three sections of code most frequently encountered across the benchmarks, two are procedure calls implementing integer multiplication and long-integer multiplication, and the third is a conditional loop. These were selected for hardware synthesis from the candidate partitions produced by applying the partitioning algorithm with a cutoff ratio of 1.5.

The percentage of partitioned instructions is tabulated in Table 2.3; the circuits were synthesized for the target device Xilinx Virtex xc2v8000-5-ff1152.
Table 2.3 Percentage of instructions partitioned and corresponding speedup

Benchmark | Total number of instructions | Instructions partitioned for SW | Instructions partitioned for HW | % of instructions partitioned | % of runtime of the partitioned instructions | Speedup in % by partitioning
Dct | 3504 | 3481 | 23 | 0.7 | 12 | 114
Diffeq | 410 | 330 | 80 | 17 | 59.6 | 180
Ellip | 574 | 494 | 80 | 12.45 | 58.2 | 176
Fir | 404 | 324 | 80 | 17.75 | 63.9 | 190
Iir | 382 | 302 | 80 | 18.8 | 65.3 | 195
Lattice | 2613 | 2589 | 24 | 9.2 | 38.1 | 150
Nc | 3886 | 3862 | 24 | 0.6 | 5.3 | 105
Volterra | 2664 | 2640 | 24 | 0.9 | 2.6 | 102
Wavelet | 3617 | 3593 | 24 | 0.7 | 3.1 | 103
Wdf7 | 3958 | 3692 | 266 | 5.5 | 37.4 | 150

The total elapsed time to generate the hardware module from the binary, obtained using MATLAB, is given in Table 2.4.

Table 2.4 Total elapsed time for decompilation

Benchmark | Total number of instructions | Instructions partitioned for SW | Instructions partitioned for HW | Elapsed time for decompilation (sec)
Dct | 3504 | 3481 | 23 | 60.187
Diffeq | 410 | 330 | 80 | 64.969
Ellip | 574 | 494 | 80 | 100.687
Fir | 404 | 324 | 80 | 92.954
Iir | 382 | 302 | 80 | 88.516
Lattice | 2613 | 2589 | 24 | 76.547
Nc | 3886 | 3862 | 24 | 74.641
Volterra | 2664 | 2640 | 24 | 72.422
Wavelet | 3617 | 3593 | 24 | 73.906
Wdf7 | 3958 | 3692 | 266 | 206.578
Gaurav Mittal et al (2007) first decompile the software binaries to a high-level language, such as C, and then use a behavioral synthesis tool to generate the hardware from the high-level design. The speedup they achieve is 3.19 for the ellip benchmark, 2.06 for IIR and 10.30 for diffeq, using a DSP processor and Xilinx tools. The speedup achieved in the proposed work (Table 2.1) is lower in comparison because only a few instructions of the software binaries are partitioned for hardware, but this yields a very good reduction in hardware cost. Many of the benchmarks showed significant speedup on partitioning while using far fewer resources than a pure-hardware implementation. Particularly notable are the Ellip and IIR benchmarks, which show speedups of 1.76 and 1.95 respectively together with large savings in hardware resources compared to pure hardware (slice utilization of 0% and 0% in the partitioned approach against 55% and 18% in pure hardware, and multiplier utilization of 25% and 25% against 116% and 116%, the latter exceeding the multiplier resources available). These results are plotted in Figure 2.10.

Figure 2.10 Comparative Bar chart of Percentage of Hardware Utilization

A comparative bar chart showing the speedup obtained is plotted in Figure 2.11.

Figure 2.11 Comparative Bar chart of Speedup gained

Figure 2.11 shows the extent to which each program's size, in terms of the number of cycles consumed per run, is reduced. The improvement for certain benchmarks (Dct, Nc, Volterra, Wavelet) may not be large, but from Table 2.3 and Figure 2.12 it can be seen that the temporal size (the percentage of runtime of the partition, as defined by Jantsch et al (1994)) of the partitions selected manually from the candidates, with small size as a desired criterion to ease the processing of dataflow extraction, is very low for these benchmarks compared to the others.

Figure 2.12 Bar chart of Percentage of Instructions partitioned and corresponding speedup

2.14 PROFILING RESULTS

The profiling results for the Diffeq benchmark are shown in Figure 2.13. The first three columns show the instruction-word location, the binary opcode and the opcode in assembly format, respectively; the last column shows the number of times the instruction location was traversed. The part shown is the procedure for integer multiplication. The program was run repeatedly to accumulate data over a 10,000-cycle period, which averaged 645 clock cycles per run for the software runtime and 357 cycles per run for the partitioned runtime, a speedup of 1.8. The partitioned instructions, rerolled for Diffeq as shown in Figure 2.14, are synthesized to obtain the hardware clock period. 'I', 'J', 'H' and 'G' are the variables that need to be communicated between hardware and software, and they determine the buffer size when the loop is rerolled. The profiling result showed that this procedure call of 71 instructions had each instruction in it consuming 84 cycles. The Go instruction establishes communication from software to hardware.
Figure 2.13 Profiling results for Diffeq

Using the dataflow-extraction technique on this block, the set of input and output parameters for the coprocessor shown in Figure 2.14 was obtained.

Figure 2.14 Dataflow extraction for Diffeq


Similarly, Figure 2.15 shows the profiling results for the Dct benchmark.

Figure 2.15 Profiling Results for Dct

The instruction rlcf (rotate left through carry) takes as input the carry bit of the status register and generates the carry bit again as output. This means the carry bit needs to be held in both the initial state and the final state. More generally, the bits of status registers must be held in the final state if they are affected by the present instruction, and in the initial state if they are used in its execution. The structure of the CF generator is represented in Figure 2.16. Optimizations such as dead-code elimination and copy propagation are performed on the CFG as shown in Figure 2.17, and the generated CDFG is shown in Figure 2.18.

Figure 2.16 Structure of CF Generator

Figure 2.17 Optimization of CFG



Figure 2.18 CDFG for Dct

The status bits have been omitted from consideration in the proposed algorithm for the sake of brevity, since they are also part of a register. The CDFG derived after the merge operation is shown in Figure 2.19. The hardware realization of the CDFG represents the body of the loop, with registers to hold the result of computation for each iteration of the loop.

Figure 2.19 CDFG for DCT after Merging operation

The coprocessor communicates with the processor using a handshake protocol. The processor issues a "go" signal after placing all the input parameters required for the computation on the data lines. The coprocessor starts computation at the clock edge after receiving the go signal, and each clock pulse signifies the start of a new iteration. When computation for all iterations is complete, the coprocessor issues the done signal and places the output parameters on the data lines; hence there is no need for clock synchronization. The coprocessor can run independently of the processor, and both can run tasks simultaneously. Communication buffers can be used instead of the data lines, in which case the buffer size needed for communication is the maximum of the number of input parameters and the number of output parameters. The estimated buffer-size requirements for each partition block are shown in Table 2.5.

Table 2.5 Buffer size / Data bus width Estimates

Partition Block | Buffer size / Data bus size (bytes)
Loop1 | 7
Basic block 1 (mul-int) | 4
Basic block 2 (mul-long) | 8
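As a toy illustration of the rule just stated (buffer size = maximum of the input and output parameter counts); the parameter lists are hypothetical, loosely echoing the I, J, H and G variables communicated for Diffeq above:

```python
inputs = ["I", "J", "H", "G"]     # parameters sent from software to hardware
outputs = ["I", "J", "H", "G"]    # parameters returned after the computation
buffer_size = max(len(inputs), len(outputs))
print(buffer_size)                 # 4
```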

Table 2.6 shows the numbers of adders/subtractors and registers required for the different practical implementations, with and without merging for forward branching. The present work performs merging with forward branching, and hence there is a good improvement in resource utilization. For example, merging consecutive sets of blocks into one large block allows the compiler to schedule more instructions in parallel (Gaurav Mittal et al 2007); merging into large blocks requires two adders/subtractors and thirty-nine registers, whereas the novel use of merging with forward branching in this work reduces the number of registers to 22. Forward branching does not occur in Diffeq, Ellip, Fir, Iir and Wdf7.
Table 2.6 Merging for binary level partitioning

Benchmark | Without merging: adders/subtractors | Without merging: registers | Merging: adders/subtractors | Merging: registers | Merging forward branching: adders/subtractors | Merging forward branching: registers
DCT | 14 | 125 | 2 | 39 | 2 | 22
Diffeq | 24 | 123 | 1 | 8 | - | -
Ellip | 24 | 123 | 1 | 8 | - | -
Fir | 24 | 123 | 1 | 8 | - | -
Iir | 24 | 123 | 1 | 8 | - | -
Lattice | 14 | 124 | 2 | 39 | 2 | 22
NC | 14 | 124 | 2 | 39 | 2 | 22
Volterra | 14 | 124 | 2 | 39 | 2 | 22
Wavelet | 14 | 124 | 2 | 39 | 2 | 22
Wdf7 | 36 | 125 | 2 | 47 | - | -

Register allocation is performed for the critical kernel of the software binaries (the identified integer multiplier, long-integer multiplier and loop) using Maximal Clique, represented in Appendix 5. An initial and a final state are used to represent each block. Merging of basic blocks and merging of blocks with forward branching are performed, which has not been reported earlier, and loop rerolling is done manually. Since optimization is performed on software binaries, this methodology supports both control-dominated and data-dominated systems.


2.15 CHAPTER CONCLUSION

In this chapter, a partitioning algorithm is designed, and a decompiler is designed with known optimization techniques such as dead-code elimination, loop unrolling, loop rerolling, register allocation and merging of forward branching with sets of blocks. The novelty of the proposed method lies in merging for forward branching; in register allocation for software binaries, which has not been reported earlier; and in loop unrolling using array functions, performed in the decompiler itself, which has also not been reported earlier (Gaurav Mittal et al (2007) perform only manual unrolling of the input code to the PACT compiler). Loop rerolling is done manually. Merging is performed for each block separately, which has not been reported earlier: the ellip benchmark takes 81 states without merging and 3 states with merging, while the DCT benchmark takes 24 states without merging and 10 states with merging. Merging for forward branching is performed to reduce the number of states further: the ellip benchmark has no forward branching (so no further reduction is possible), while the DCT benchmark drops from 10 states with merging to 3 states with merging for forward branching. Optimization is often non-trivial, since multiple conflicting constraints are involved (e.g. performance versus cost versus power consumption in digital VLSI design) and optimization problems generally belong to the class of NP-complete problems; hence heuristics are used instead of exact algorithms to find good solutions and reduce complexity. The DCT implementation is one such non-trivial case, and its implementation is presented in this thesis. Buffer-size estimation is carried out for both the binary-level and source-level approaches and is observed to be the same for both when applied to an elliptic wave filter. The critical kernel is partitioned to hardware in binary-level partitioning, while the multipliers are partitioned to hardware in source-level partitioning.