In this chapter, we provide the background necessary to understand our work. We explain scheduling and assignment procedures for data path generation in high-level synthesis, test pattern generation and test response analysis for built-in self-test (BIST), and integer linear programming (ILP) for solving various tasks of high-level synthesis. We also review relevant works in high-level BIST synthesis.
High-level synthesis reads a high-level description and translates it into an intermediate form. An intermediate form should represent all the necessary information yet be simple enough to be usable in high-level synthesis. A control data flow graph (CDFG) is such an intermediate form. A data flow graph, in which control information is omitted from a CDFG, is used frequently for data path-intensive circuits (to which high-level synthesis applies), such as filters and signal processing applications; they often do not require control of the data flow. There are also controller-intensive circuits in which control of the data flow is required, and the control data flow graph representation is used for such circuits. More specific descriptions of the two types of circuits and their representations are given in the following two subsections.
[Figure 2.1 residue: a delay line of z^-1 elements with coefficient taps h0 through h6]

Figure 2.1. A 6th order FIR filter

Fig. 2.2 is a Silage program [45] that describes the behavior of the 6th order FIR filter. Delay elements (outside the dashed line in Fig. 2.1) are implemented as a shift register and are not described in the Silage program. Fig. 2.3 shows an unscheduled data flow graph of the 6th order FIR filter.

#define num8 num<8,7>
#define h0 0.717
...

func main(x0, x1, x2, x3, x4, x5, x6: num9) y: num8 =
begin
    t0 = num8(h0 * x0);
    t1 = num8(h1 * x1);
    t2 = num8(h2 * x2);
    t3 = num8(h3 * x3);
    t4 = num8(h4 * x4);
    t5 = num8(h5 * x5);
    t6 = num8(h6 * x6);
    y = (((((t0 + t1) + t2) + t3) + t4) + t5) + t6;
end;

Figure 2.2 Silage program of a 6th order FIR filter
[Figure 2.3 residue: multiplications h0*x0 through h6*x6 feeding a chain of additions that produces y]

Figure 2.3 Unscheduled data flow graph of a 6th order FIR filter
A description for controller-intensive circuits has constructs that control the flow of data, such as if-then-else and for (while) loops. The representation of such constructs greatly affects the performance of high-level synthesis tasks. Control data flow graphs for the control constructs if-then-else and while-loop are shown in Fig. 2.4 (a) and (b). Fig. 2.5 (a) shows a description of GCD (Greatest Common Divisor) [7], a controller-intensive MCNC high-level benchmark circuit [49], in VHDL, and Fig. 2.5 (b) shows its control data flow graph.
[Figure 2.4 residue: each construct is drawn with an "if stmt" decision node, true/false branches to sub-DFGs, and a joining "end if" node; in the while-loop the true (loop) branch re-enters the loop DFG and the false branch exits]

Figure 2.4 Basic control data flow graph elements for (a) if-then-else (b) while-loop

process
begin
    X := PortX;
    Y := PortY;
    if ( X == 0 ) or ( Y == 0 ) then
        GCD := 0;
    else
        while ( X != Y ) loop
            if ( X > Y ) then
                X := X - Y;
            else
                Y := Y - X;
            end if;
        end loop;
        GCD := X;
    end if;
    PortGCD <= GCD;
end process;
[Figure 2.5 (b) residue: the CDFG tests X == 0 or Y == 0 and X != Y, compares X > Y, performs the two subtractions, and assigns GCD, with end-if join nodes]

Figure 2.5 GCD description in (a) behavioral VHDL (b) control data flow graph
Unlike a data flow graph, a control data flow graph can possibly have many different representations.
2.1.2 Scheduling
Scheduling determines the precise start time of each operation of a given data flow graph [27],[30]. The start times must satisfy the data dependencies of the graph, which limit the amount of parallelism among the operations. This means that scheduling determines the concurrency of the resulting implementation, which, in turn, affects its performance. The maximum number of concurrent operations of any given type at any step of the schedule is a lower bound on the number of required hardware resources of that type. Therefore, the choice of a schedule affects the area and the BIST design. Three commonly used scheduling algorithms are ASAP (As Soon As Possible), ALAP (As Late As Possible), and list scheduling. In ASAP scheduling, the start time of each operation is assigned its as-soon-as-possible value. This scheduling solves the unconstrained minimum-latency scheduling problem in polynomial time. An example DFG (called paulin, from [50]) and the corresponding ASAP-scheduled DFG are shown in Fig. 2.6 and Fig. 2.7, respectively. In ALAP scheduling, the start time of each operation is assigned its as-late-as-possible value, and the scheduling is usually constrained in its latency. When it is applied to an unconstrained scheduling problem, the latency bound (the upper bound on latency) is the length of the schedule computed by the ASAP algorithm. When the ALAP algorithm is applied to the DFG in Fig. 2.6, the DFG shown in Fig. 2.8 is obtained; the latency for the DFG is 4.
[Figures 2.6, 2.7, and 2.8 residue: the example DFG paulin [50] with multiplications v1 through v6, subtractions v7 and v8, additions v9 and v10, and comparison v11; Fig. 2.7 shows its ASAP schedule and Fig. 2.8 its ALAP schedule over four control steps]
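The ASAP and ALAP computations described above can be sketched as a pair of recursive graph traversals. The sketch below uses a small hypothetical DFG with unit-delay operations (the vertex names echo Fig. 2.6, but the graph is not the exact paulin DFG); control steps are numbered from 1, as in the figures.

```python
# ASAP/ALAP scheduling sketch for a DAG of unit-delay operations.
# succ maps each operation to its successors; the graph is hypothetical.
succ = {'v1': ['v5'], 'v2': ['v5'], 'v3': ['v7'], 'v5': ['v7'], 'v7': []}
delay = {v: 1 for v in succ}

pred = {v: [] for v in succ}
for u, vs in succ.items():
    for v in vs:
        pred[v].append(u)

def asap(v):
    """Earliest start time: 1 for sources, else right after the latest predecessor."""
    if not pred[v]:
        return 1
    return max(asap(u) + delay[u] for u in pred[v])

def alap(v, latency):
    """Latest start time under the given latency bound."""
    if not succ[v]:
        return latency - delay[v] + 1
    return min(alap(w, latency) for w in succ[v]) - delay[v]

t_asap = {v: asap(v) for v in succ}
lat = max(t_asap[v] + delay[v] - 1 for v in succ)   # length of the ASAP schedule
t_alap = {v: alap(v, lat) for v in succ}            # here only v3 has nonzero mobility
```

The difference t_alap[v] - t_asap[v] is the mobility of operation v, which is exactly the slack the ILP formulation of Section 2.4 exploits to prune start-time variables.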
List scheduling [51], one of the most popular heuristic methods, is used to solve scheduling problems with resource constraints or latency constraints. A list scheduler maintains a priority list of the operations. A commonly used priority list is obtained by labeling each vertex with the length of its longest path to the sink and ranking the vertices in decreasing order; the most urgent operations are scheduled first. The algorithm constructs a schedule that satisfies the constraints; however, the computed schedule may not have the minimum latency. Fig. 2.9 shows the result of list scheduling for the DFG in Fig. 2.6.
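A minimal sketch of such a list scheduler is given below, assuming unit delays and a hypothetical five-operation DFG with one multiplier and one adder; the priority of an operation is the length of its longest path to the sink, and ready operations are served in decreasing priority.

```python
# Resource-constrained list scheduling sketch (unit delays). The DFG and
# resource set below are hypothetical, not the example of Fig. 2.6.
pred = {'m1': [], 'm2': [], 'm3': [], 'a1': ['m1', 'm2'], 'a2': ['a1', 'm3']}
optype = {'m1': '*', 'm2': '*', 'm3': '*', 'a1': '+', 'a2': '+'}
resources = {'*': 1, '+': 1}                  # one multiplier, one adder

succ = {v: [] for v in pred}
for v, ps in pred.items():
    for p in ps:
        succ[p].append(v)

def priority(v):
    """Label: length of the longest path from v to a sink."""
    return 1 + max((priority(s) for s in succ[v]), default=0)

start, step = {}, 1
unscheduled = list(pred)
while unscheduled:
    # An operation is ready once all its predecessors have finished.
    ready = [v for v in unscheduled
             if all(start.get(p, step) < step for p in pred[v])]
    ready.sort(key=priority, reverse=True)    # most urgent first
    used = {k: 0 for k in resources}
    for v in ready:
        if used[optype[v]] < resources[optype[v]]:
            start[v] = step                   # schedule v at this step
            used[optype[v]] += 1
    unscheduled = [v for v in unscheduled if v not in start]
    step += 1
```

With one multiplier, the three multiplications are serialized over steps 1 to 3, and the adds follow as their operands complete; the result satisfies the resource bound but, as noted above, is not guaranteed to be of minimum latency.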
[Figure 2.9 residue: the list schedule of the DFG in Fig. 2.6, occupying five control steps]
2.1.3 Transformation
Transformation modifies a given DFG while preserving the functionality of the original DFG. A transformed DFG usually yields a different hardware structure, which leads to different testability and power consumption. The following techniques are available for behavioral transformation of DFGs:
- Algebraic laws and redundancy manipulation: associativity, commutativity, and common sub-expressions
- Temporal transformation: retiming, pipelining, and rephasing
- Control (memory) transformation: loop folding, unrolling, and functional pipelining
- Sub-operation level transformation: add/shift multiplication and multiple constant multiplication
Among these techniques, pipelining and loop folding/unrolling are the most commonly used and are described in detail. There are two types of pipelining, structural pipelining and functional pipelining, which are also explained below.
[Figure residue: structurally and functionally pipelined schedules of the example DFG; the functionally pipelined version is partitioned into STAGE 1 and STAGE 2]
Loop folding is an optimization technique to reduce the execution delay of a loop. If it is possible to pipeline the loop body itself with a local data introduction interval l less than the latency of the loop body, the overall loop execution delay becomes l. For example, the loop body of the FIR data flow graph, shown in Fig. 2.12 (a), can be folded as shown in Fig. 2.12 (b) [52]; the loop execution delay is reduced from 4 to 3. Note that loop folding can be viewed as pipelining. However, it is not true pipelining, since new data can be consumed only when the loop exit condition is met.
[Figure 2.12 residue: (a) the original loop body of the FIR data flow graph with latency 4 (b) the folded loop body with latency 3]
and (b), respectively. After loop unrolling, several behavioral transformations such as pipelining and algebraic transformation may be applied to improve performance, for example by reducing power consumption. Loop folding, which is the reverse operation of loop unrolling, is an optimization technique to reduce the execution delay of a loop. While the unrolling transformation is simpler, the folding transformation is more complex: different folding descriptions result in different folded data flow graphs, while unrolling results in a unique data flow graph.
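The functional equivalence that unrolling must preserve can be checked numerically. The sketch below unrolls the first-order IIR recurrence yn = b0*xn + a1*yn-1 by a factor of two; the coefficient names follow Fig. 2.13, but the numeric values and the input sequence are made up for illustration.

```python
# Numeric sanity check of two-way loop unrolling on a first-order IIR
# recurrence y[n] = b0*x[n] + a1*y[n-1]. Coefficients and inputs are
# hypothetical values chosen for illustration.
b0, a1 = 0.5, 0.25
x = [1.0, 2.0, -1.0, 4.0, 0.5, 3.0]

def iir_rolled(x):
    y, prev = [], 0.0
    for xn in x:
        prev = b0 * xn + a1 * prev
        y.append(prev)
    return y

def iir_unrolled(x):
    # Two loop iterations merged into one; y[n-1] is substituted by its own
    # recurrence, exposing more operations per iteration for scheduling.
    y, y2 = [], 0.0                      # y2 holds y[n-2]
    for n in range(0, len(x), 2):
        yn1 = b0 * x[n] + a1 * y2
        yn2 = b0 * x[n + 1] + a1 * b0 * x[n] + a1 * a1 * y2
        y += [yn1, yn2]
        y2 = yn2
    return y
```

The unrolled body contains the composite coefficient a1*b0 and the two-cycle feedback a1*a1*y[n-2], which is exactly the structure suggested by the 2D delay element in the unrolled graph.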
[Figure 2.13 residue: (a) the original IIR loop body computing yn = b0*xn + a1*yn-1 with a delay element D (b) the two-iteration unrolled version using the composite coefficient a1b0 and a delay 2D]
From a scheduled data flow graph, it is easy to figure out the minimum number of hardware resources, such as modules and registers. Unlike the assignment task, which maps each operation (variable) to a specific module (register), allocation computes the number of necessary hardware resources for the given data flow graph. The minimum numbers of modules and registers can be identified by counting the operations and variables, respectively, that exist in the same control step. Scheduling combined with allocation can generate an area-optimized data path without a complex assignment task. However, it is still necessary to combine assignment with scheduling for better testability or lower power consumption.

Operations described in a behavioral language are mapped to data path modules available in the given library. Operations used in high-level synthesis can be classified into arithmetic and logical operations. Arithmetic operations include add, subtract, multiply, divide, and comparison; the corresponding register-level modules include adders, subtractors, multipliers, dividers, comparators, and so on. Logical operations include and, or, nand, nor, and so on. If a data path library has modules with the same functionality but different characteristics, high-level synthesis can achieve better performance. For example, an add operation can be mapped to a ripple-carry adder, a carry-lookahead adder, or a carry-save adder. Similarly, a multiply operation can be mapped to an array multiplier, a Wallace-tree multiplier, or a radix-n Booth multiplier. The trade-off among different characteristics enables the synthesized circuit to have smaller area, higher performance, or lower power consumption.

Operations in the same control step of a data flow graph must be assigned to different modules; such operations are called incompatible. Incompatible operations cannot share the same module. Fig. 2.14 (a) shows a scheduled data flow graph, and Fig. 2.14 (b) shows three different sets of modules.
When module set 1 is given, all the operations in the data flow graph can be assigned to any of the given modules unless there is an incompatibility. The incompatibility graph of operations for the data flow graph under module set 1 is shown in Fig. 2.14 (c); a vertex of the graph is an operation, and an edge exists between each pair of incompatible vertices. If multi-functional modules are not allowed, as in module set 2 in Fig. 2.14 (b), the number of necessary modules increases to 3. If module set 3 in Fig. 2.14 (b) is given, the number of necessary modules decreases to 2.
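The construction of the incompatibility graph can be sketched directly from a schedule. The schedule below is a hypothetical stand-in for Fig. 2.14 (a): any two operations that fall in the same control step receive an edge and therefore cannot share a module.

```python
from itertools import combinations

# Building the incompatibility graph of operations from a schedule.
# The schedule (control step -> operations) is hypothetical.
schedule = {1: ['+1'], 2: ['+2', '*1'], 3: ['*2', '/1'], 4: ['+3']}

edges = set()
for ops in schedule.values():
    for a, b in combinations(ops, 2):
        edges.add(frozenset((a, b)))      # same step => incompatible

# With fully multi-functional modules (as in module set 1), a lower bound
# on the module count is the largest set of mutually incompatible
# operations, i.e. the widest control step.
min_modules = max(len(ops) for ops in schedule.values())
```

Restricting module functionality (as in module sets 2 and 3 of Fig. 2.14 (b)) adds further constraints on top of this graph, which is why the necessary module count can grow or shrink with the library.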
[Figure 2.14 residue: (a) a scheduled data flow graph with additions, multiplications, and a division over four control steps (b) three module sets built from ALUs, multipliers, and multi-functional modules M2 and M3 (c) the incompatibility graph of operations for module set 1]

Figure 2.14 A data flow graph, given modules, and incompatibility graph
In register assignment, a delayed variable spanning n clock cycles is often split into n variables to increase the flexibility of the register assignment [1].
[Figure 2.15 residue: (a) a data flow graph with variables a through h (b) the incompatibility graph (c) the compatibility graph of variables]

Figure 2.15 A data flow graph and its incompatibility and compatibility graphs

Register assignment, like the other assignment tasks, can be solved by graph coloring. For graph coloring, there have been a number of algorithms, such as the left-edge algorithm [31], clique partitioning [32], the perfect vertex elimination scheme (PVES) [33], bipartite matching [54], and so on. All of these methods are heuristics; thus, they do not guarantee optimal solutions. In this section, PVES, the left-edge algorithm, and clique partitioning are described briefly.

The perfect vertex elimination scheme was first described in [33] and was recently applied to the register assignment problem to enhance BIST testability [20]. It starts by ordering the vertices of the graph, where the order is determined by the maximum clique size of each vertex. For example, the ordering of the variables in the data flow graph in Fig. 2.15 (a) is {h, f, g, a, b, c, d, e}, and the corresponding maximum clique sizes are {1, 2, 2, 3, 3, 3, 3, 3}. The register assignment is performed in this order: if a variable is compatible with the variable(s) in a previously assigned register, the variable is assigned to that register; if not, it is assigned to a new register. The register assignment for the example is R1 = {h, f, a, d}, R2 = {g, b, e}, and R3 = {c}.

The left-edge algorithm was proposed by Hashimoto and Stevens in [31] to assign wires to routing tracks. The same idea has been applied to register assignment by Kurdahi and Parker in [34], where variable intervals correspond to wires and routing tracks correspond to registers. The algorithm lists the variable intervals as shown in Fig. 2.16 (a). As long as the intervals of variables do not overlap, they are moved to the left edge, as shown in Fig. 2.16 (b). For the example, we have the register assignment R1 = {a, d, f, h}, R2 = {b, e, g}, and R3 = {c}, as shown in the figure. Note that the assignment is the same as that for the perfect vertex elimination scheme.
[Figure 2.16 residue: variable intervals for a through h listed against control steps, then packed into tracks R1, R2, and R3]

Figure 2.16 Left edge algorithm (a) variable intervals (b) assignment results
The clique partitioning approach is based on the heuristic proposed by Tseng and Siewiorek in [32]. Although it can be applied to any assignment task, we describe it only for register assignment. A complete graph is one in which every pair of vertices has an edge. A clique of a graph is a complete subgraph, and the size of a clique is the number of vertices of the clique. If a clique is not contained in any other clique, it is called maximum. For the compatibility graph in Fig. 2.15 (c), the subgraph {a, d, f, h} is a clique of size 4 and is maximum. A clique partition partitions a graph into a disjoint set of cliques; a maximum clique partition is a clique partition with the smallest number of cliques. Finding a maximum clique partition is an NP-complete problem [30]. Two measures, common neighbor and non-common edge, are used for clique partitioning. A vertex v is called a common neighbor of a subgraph provided that v is not contained in the subgraph and has an edge with every vertex of the subgraph. If a vertex v is not a common neighbor of a subgraph but has an edge with at least one vertex of the subgraph, the edge is called a non-common edge. For example, vertex h in the compatibility graph in Fig. 2.15 (c) is a common neighbor of the subgraph {c, d, f}, but vertex g is not. As g has the edge c-g with vertex c, the edge c-g is a non-common edge for the subgraph. With common neighbors and non-common edges, all the vertices are ordered. For the example, all the vertices are assigned to registers in the same order as in the perfect vertex elimination scheme.
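A simplified sketch of the clique-partitioning idea follows: repeatedly merge the pair of mutually compatible groups with the largest number of common neighbors, in the spirit of Tseng and Siewiorek [32]. The edge set and tie-breaking below are assumptions for illustration; the graph is similar to, but not exactly, the compatibility graph of Fig. 2.15 (c).

```python
# Greedy clique partitioning driven by common-neighbor counts.
# The compatibility edge set below is hypothetical.
edges = {frozenset(p) for p in
         [('a', 'd'), ('a', 'f'), ('a', 'h'), ('d', 'f'), ('d', 'h'),
          ('f', 'h'), ('b', 'e'), ('b', 'g'), ('e', 'g'), ('c', 'g')]}
vertices = {v for e in edges for v in e}

def compatible(u, v, edges):
    return frozenset((u, v)) in edges

cliques = [{v} for v in sorted(vertices)]
merged = True
while merged:
    merged, best = False, None
    for i in range(len(cliques)):
        for j in range(i + 1, len(cliques)):
            # Two groups may merge only if every cross pair is compatible.
            if all(compatible(u, v, edges)
                   for u in cliques[i] for v in cliques[j]):
                rest = vertices - cliques[i] - cliques[j]
                # Count common neighbors of the would-be merged clique.
                common = sum(
                    all(compatible(u, w, edges)
                        for u in cliques[i] | cliques[j])
                    for w in rest)
                if best is None or common > best[0]:
                    best = (common, i, j)
    if best:
        _, i, j = best
        cliques[i] |= cliques[j]
        del cliques[j]
        merged = True
```

On this hypothetical graph the greedy merging recovers the three register groups that the PVES example above produced, with c left alone because its only edges are non-common edges with respect to the other cliques.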
Figure 2.17 Various data path architectures

As the data path architecture becomes more complicated, it is more difficult to find an optimal solution. The bus-based architecture can be used to minimize the number of interconnections. As multiple hardware resources share the same bus, tri-state buffers are used, and conflicts in using a bus must be resolved. The architecture shown in Fig. 2.17 (d) uses register files [55], often referred to as multiport memories. The number of interconnections can be further reduced by using register files; however, register files require decoder circuits for multiport access, which degrade the performance.

One way to minimize the number of interconnections is to swap the input ports of a module or of an operation in a data flow graph. For example, suppose that we have a data path that performs R1+R3, R1+R5, R2+R3, R3+R4, and R4+R5. Our objective is to swap operands so that the resultant circuit has the smallest number of interconnections [56]. When there are two operands of an operation, both operands cannot be placed on one side, either left or right. Thus those operands are incompatible, i.e., have no edge, in the compatibility graph shown in Fig. 2.18
(a). From the compatibility graph, we have two cliques, {R1, R2, R4} and {R3, R5}. These cliques lead to the smallest number of interconnections, as shown in Fig. 2.18 (b).
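The clique claim can be verified mechanically. The sketch below derives the operand-side compatibility graph from the five additions and checks the register groups; note that the second clique is {R3, R5}, since R3 and R4 appear as operands of the same addition and are therefore incompatible.

```python
from itertools import combinations

# Operand-side compatibility for input swapping: the two operands of one
# operation can never sit on the same side of the module.
regs = ['R1', 'R2', 'R3', 'R4', 'R5']
ops = [('R1', 'R3'), ('R1', 'R5'), ('R2', 'R3'), ('R3', 'R4'), ('R4', 'R5')]

incompatible = {frozenset(p) for p in ops}
compat_edges = {frozenset(p) for p in combinations(regs, 2)} - incompatible

def is_clique(vs):
    """True if every pair in vs is compatible (may share one input side)."""
    return all(frozenset(p) in compat_edges for p in combinations(vs, 2))
```

Each clique shares one input port of the module, so the two cliques cover all five registers with only two multiplexed connections.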
[Figure 2.18 residue: (a) the compatibility graph over R1 through R5 (b) the two register groups multiplexed onto the two input ports of module M1]

Figure 2.18 Input swapping: (a) compatibility graph (b) simplified interconnection
Figure 2.19 Typical BIST structure

When BIST is performed during normal operation, it is referred to as on-line BIST, whereas off-line BIST is performed in a dedicated test mode. The BIST methods described in this thesis assume off-line BIST. Parallel BIST, which is based on random pattern testing, employs test pattern generators and test data evaluators for every module under test (which is usually a combinational circuit). Parallel BIST often achieves relatively high fault coverage compared with other BIST methods such as circular BIST [29]. In this thesis, we present a high-level BIST synthesis method that employs the parallel BIST structure. The objective of the BIST synthesis considered in the thesis is to assign registers so that the reconfiguration of the assigned registers incurs the least area overhead, while the processing time stays within a practical limit.

Two constraints are imposed in our BIST synthesis (and in most BIST synthesis systems). First, test pattern generators and signature registers are reconfigured from existing system registers; in other words, all test registers function as system registers during normal operation. Second, extra paths are not added for testing. The constraints can be met through the reconfiguration of existing registers into the four different types of test registers described below.

A system register may be converted into one of four different types of test registers: a test pattern generator (TPG), a multiple input signature register (signature register for short), a built-in logic block observer (BILBO) [57], or a concurrent BILBO (CBILBO) [58]. If a register should be a TPG and a signature register (SR) in the same sub-test session, it should be reconfigured as a CBILBO. If a register should behave as a TPG and an SR, but not at the same time, it should be reconfigured as a BILBO. Reconfiguration of a register into a CBILBO requires twice the number of flip-flops of the register; hence, it is expensive in hardware cost. A TPG can be shared between modules as long as each input of a module receives test patterns from a different TPG. (Application of the same test patterns to two or more inputs of a module does not achieve a high fault coverage due to the dependence of the test patterns.) However, an SR cannot be shared between modules tested in the same sub-test session. Therefore, the number of sub-test sessions necessary for a test session is usually determined by the number of modules sharing the same SR.

Consider a DFG in which all operators are assigned to N modules. Through an appropriate register assignment, it is possible that the N modules can be tested at least once using exactly k sub-test sessions, where k = 1, 2, ..., N. Test registers are reconfigured into TPGs and/or SRs in each sub-test session, and a subset of the modules is tested in each sub-test session. When a BIST design is intended to test all of the modules in k sub-test sessions, we say that the BIST design is for a k-test session. As extreme cases, a BIST design for a 1-test session tests all modules in one sub-test session, while a BIST design for an N-test session tests only one module in each sub-test session.
2.2.1 Test Pattern Generator (TPG) and Linear Feedback Shift Register (LFSR)
For test pattern generation, there are two approaches: exhaustive and pseudorandom testing. Exhaustive testing applies all 2^n test patterns, generated by counters or linear feedback shift registers (LFSRs), to a circuit with n inputs. While it can detect all combinational faults, it is applicable only to circuits with a small number of inputs. In pseudorandom testing, test patterns resemble random patterns but are deterministic. The test patterns for pseudorandom testing are generated by LFSRs, as shown in Fig. 2.20 (a). The connection polynomial p(x) determines the feedback points. Fig. 2.20 (b) shows an LFSR with the connection polynomial p(x) = x^4 + x + 1.
p(x) = x^n + c_{n-1} x^{n-1} + ... + c_1 x + 1 (c_i = 0 is open and c_i = 1 is short)

[Figure 2.20 residue: (a) a generic n-stage LFSR with feedback taps c_1 through c_{n-1} (b) the four-stage LFSR for p(x) = x^4 + x + 1]

Figure 2.20 LFSR for (a) generic case and (b) n = 4
When a primitive polynomial is used to construct an LFSR, the resultant LFSR generates all patterns except the all-zero pattern. Such an LFSR is called a maximum-length LFSR. There is at least one primitive polynomial for any order n. Table 2.1 shows primitive polynomials for some orders in the range of 2 to 32. From the table, p(x) = x^16 + x^5 + x^3 + x^2 + 1 is primitive for order 16; it is used for our 8-point discrete cosine transform circuit to be presented in Chapter 4.
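A behavioral sketch of the LFSR of Fig. 2.20 (b) follows: a Fibonacci-style shift register whose feedback taps come from the exponents of p(x) = x^4 + x + 1. Since the polynomial is primitive, the register cycles through all 2^4 - 1 nonzero states before repeating.

```python
# Fibonacci LFSR for the primitive polynomial p(x) = x^4 + x + 1.
# Bit 3 corresponds to the x^4 stage output, bit 0 to the x^1 stage.
def lfsr_states(seed=0b0001):
    states, s = [], seed
    while True:
        states.append(s)
        fb = ((s >> 3) ^ s) & 1            # tap x^4 (bit 3) XOR tap x^1 (bit 0)
        s = ((s << 1) | fb) & 0xF          # shift left, insert feedback bit
        if s == seed:                      # sequence has closed its cycle
            return states

states = lfsr_states()                     # 15 distinct nonzero states
```

Seeding with any nonzero value yields the same cycle, merely rotated; the all-zero state is the one fixed point the LFSR can never leave, which is why it is excluded from the maximum-length sequence.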
Table 2.1 Primitive polynomials for orders 2 to 32 (each entry lists the exponents of the terms other than x^n; e.g., order 16 with terms 5 3 2 0 denotes p(x) = x^16 + x^5 + x^3 + x^2 + 1)

Order      | Terms
2,3,4,6,7  | 1 0
5          | 2 0
8          | 6 5 1 0
9          | 4 0
10         | 3 0
11         | 2 0
12         | 7 4 3 0
13         | 4 3 1 0
14         | 12 11 1 0
15         | 1 0
16         | 5 3 2 0
32         | 28 27 1 0
[Figure residue: a D flip-flop implementation of the LFSR, and a BILBO register with mode-control inputs B1 and B2, a multiplexer, scan input Si, scan output So, parallel inputs Z0 through Z3, and outputs Q0 through Q3]
If a test register should be a TPG and an SR in the same sub-test session, it should be reconfigured as a CBILBO. A CBILBO, as shown in Fig. 2.23, uses two layers of LFSRs to provide TPG and SR functionality simultaneously. A CBILBO has three modes, as shown in Table 2.3: normal function, shift register (scan in/out), and simultaneous TPG/MISR. A CBILBO incurs a high area overhead and hence should be avoided if possible.
[Figure 2.23 residue: the CBILBO register built from two flip-flop layers (1D/2D with SEL), a multiplexer, control inputs B0 and B1, scan input Si, scan output So, inputs Z0 through Z3, and outputs Q0 through Q3]
A register whose output is fed back, through a module, to its own input is called a self-adjacent register, viz. register R1 in Fig. 2.24. A self-adjacent register often requires a CBILBO. Early high-level BIST synthesis methods focused on the avoidance of self-adjacent registers [17], [18]. However, self-adjacent registers do not necessarily require CBILBOs. Consider the part of a data flow graph given in Fig. 2.25 (a). The data path logic under the register assignment R1 = {c, e}, R2 = {a, f}, and R3 = {b, d} is shown in Fig. 2.25 (b). The necessary reconfiguration of registers to test module M1 is also indicated in the figure. Note that register R1 is self-adjacent, but it is reconfigured to an SR, not a CBILBO.
[Figure 2.25 residue: (a) part of a data flow graph with variables a through f and modules M1 through M3 (b) the BIST configuration for M1, with R2 and R3 as TPGs and the self-adjacent register R1 as an SR]

Figure 2.25 Self-adjacent register and its reconfiguration to an SR: (a) data flow graph (b) BIST configuration for M1

Parulkar et al.'s method aims to reduce the overall area overhead by sharing registers to their maximum capacity [20]. However, excessive sharing of signature registers is unnecessary; in fact, it may result in a higher area overhead. Consider the cases given in Fig. 2.26, where registers R1 and R2 are reconfigured into signature registers during testing. The sharing of R1 with M3 in Fig. 2.26 (a) is unnecessary for two test sessions, as shown in Fig. 2.26 (b), and it creates an unnecessary path that incurs a higher area overhead.
[Figure 2.26 residue: modules M1 through M3 feeding signature registers R1 and R2]

Figure 2.26 Allocation of signature registers: (a) excessive sharing (b) proper sharing
E is the set of edges (v_i, v_j), where v_i and v_j are the operations associated with the edge: E = {(v1, v5), (v2, v5), ...}.
C is the set of available control steps: C = {0, 1, 2, 3, ...}.
K is the set of module types: K = {1, 2, 3, 4}. Note that this definition is used only in this section; it represents the set of test sessions in Chapter 3.
τ(v_i) represents the resource type of operation v_i: τ(v_1) = 1.
t_i^S represents a lower bound on operation v_i's start time, measured in control steps, set by ASAP scheduling: t_4^S = 1.
t_i^L represents an upper bound on operation v_i's start time, measured in control steps, set by ALAP scheduling: t_4^L = 3.
d_i represents the execution delay of operation v_i: d_4 = 1.

Using an ILP model [30],[35],[36],[38], both minimum-latency scheduling under a resource constraint and minimum-resource scheduling under a latency constraint can be solved. Let us consider minimum-latency scheduling. A binary variable x_it is 1 only when the start time of operation v_i is assigned to control step t. Note that upper and lower bounds on the start times can be computed using the ASAP and ALAP algorithms. List scheduling (a fast heuristic) is used to derive an upper bound on the latency, which, in turn, is used in ALAP scheduling. Using the upper and lower bounds on the start times of operations, the number of ILP constraints can be reduced. For example, a binary variable x_it is necessarily zero for t < t_i^S or t > t_i^L for any operation v_i, i ∈ {0, 1, ..., n}, where n is the number of operations. We now consider the ILP formulation for minimum-latency scheduling under a resource constraint, in which the start time of each operation is unique. Thus,
Σ_t x_it = 1, ∀ i.    (2.1)

Therefore, the start time of any operation v_i can be stated in terms of x_it as t_i = Σ_t t · x_it. The data dependence relations represented in a data flow graph must be satisfied. Thus,

t_i ≥ t_j + d_j, ∀ i, j : (v_j, v_i) ∈ E    (2.2)

which is the same as:

Σ_t t · x_it ≥ Σ_t t · x_jt + d_j, ∀ i, j ∈ {0, 1, ..., n} : (v_j, v_i) ∈ E.    (2.3)
Finally, the number of type-k resources required at a control step t should not exceed the resource constraint a_k. Thus,

Σ_{i : τ(v_i) = k} Σ_{m = t − d_i + 1}^{t} x_im ≤ a_k, ∀ k ∈ K, ∀ t ∈ C.    (2.4)
Note that the inner sum over m spans d_i control steps, which is more than one for a multicycle operation. The objective of the formulation is to minimize the latency, that is, to minimize Σ_i t_i = Σ_i Σ_t t · x_it. To illustrate the ILP formulation for scheduling, the data flow graph in Fig. 2.6 is used in the following. The start time of every operation is bounded by the results of the ASAP and ALAP schedules. Let N_m, N_a, N_s, and N_c be the numbers of multipliers, adders, subtractors, and comparators, respectively; they are given as 2, 1, 1, and 1. First, all operations must start exactly once:
x1,1 = 1
x2,1 = 1
x3,1 + x3,2 = 1
x4,1 + x4,2 + x4,3 = 1
x5,2 = 1
x6,2 + x6,3 = 1
x7,3 = 1
x8,4 = 1
x9,2 + x9,3 + x9,4 = 1
x10,1 + x10,2 + x10,3 = 1
x11,2 + x11,3 + x11,4 = 1.
Second, the resources used in each control step must not exceed the given bounds:

x1,1 + x2,1 + x3,1 + x4,1 ≤ 2 = N_m
x3,2 + x4,2 + x5,2 + x6,2 ≤ 2 = N_m
x4,3 + x6,3 ≤ 2 = N_m
x7,3 ≤ 1 = N_s
x8,4 ≤ 1 = N_s
x10,1 ≤ 1 = N_a
x9,2 + x10,2 ≤ 1 = N_a
x9,3 + x10,3 ≤ 1 = N_a
x9,4 ≤ 1 = N_a
x11,2 ≤ 1 = N_c
x11,3 ≤ 1 = N_c
x11,4 ≤ 1 = N_c.
Finally, the data dependence relations, represented by the data flow graph, are

2 x6,2 + 3 x6,3 − 1 x3,1 − 2 x3,2 ≥ 1
2 x9,2 + 3 x9,3 + 4 x9,4 − 1 x4,1 − 2 x4,2 − 3 x4,3 ≥ 1
2 x11,2 + 3 x11,3 + 4 x11,4 − 1 x10,1 − 2 x10,2 − 3 x10,3 ≥ 1.

The objective of the ILP is to minimize Σ_i Σ_t t · x_it. An optimum solution for the data flow graph in Fig. 2.6 has a latency of 4 control steps and is shown in Fig. 2.27.
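Because the start-time domains derived from the ASAP/ALAP bounds are tiny, the feasible region of this ILP can be explored exhaustively. The sketch below enumerates all start-time assignments within the bounds, rejects those violating a dependence or resource constraint, and minimizes the objective; it is a brute-force stand-in for an ILP solver, not the solver used in the thesis.

```python
from itertools import product

# Exhaustive search over the ILP's feasible region for the example DFG.
# Start-time domains come from the ASAP/ALAP bounds quoted above.
domain = {1: [1], 2: [1], 3: [1, 2], 4: [1, 2, 3], 5: [2], 6: [2, 3],
          7: [3], 8: [4], 9: [2, 3, 4], 10: [1, 2, 3], 11: [2, 3, 4]}
optype = {1: '*', 2: '*', 3: '*', 4: '*', 5: '*', 6: '*',
          7: '-', 8: '-', 9: '+', 10: '+', 11: '<'}
limit = {'*': 2, '-': 1, '+': 1, '<': 1}       # Nm, Ns, Na, Nc
deps = [(3, 6), (4, 9), (10, 11)]              # t_succ >= t_pred + 1

ops = sorted(domain)
best = None
for choice in product(*(domain[i] for i in ops)):
    t = dict(zip(ops, choice))
    if any(t[j] < t[i] + 1 for i, j in deps):
        continue                                # dependence violated
    used = {}
    for i in ops:                               # resource usage per step
        used[(optype[i], t[i])] = used.get((optype[i], t[i]), 0) + 1
    if any(n > limit[k] for (k, _), n in used.items()):
        continue                                # resource bound violated
    cost = sum(t.values())                      # objective: sum of start times
    if best is None or cost < best[0]:
        best = (cost, t)

latency = max(best[1][i] for i in ops)          # all delays are one cycle here
```

The search confirms the claim above: the optimum schedule completes in 4 control steps under the given resource bounds.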
[Figure 2.27 residue: an optimum schedule of the data flow graph in Fig. 2.6 with a latency of 4 control steps]
Now, let us consider minimum-resource scheduling under a latency constraint for the example. The uniqueness constraints on the start times of the operations and the data dependence constraints are the same as before. However, the resource constraints are changed to unknown variables as follows:
x1,1 + x2,1 + x3,1 + x4,1 ≤ N_m
x3,2 + x4,2 + x5,2 + x6,2 ≤ N_m
x4,3 + x6,3 ≤ N_m
x7,3 ≤ N_s
x8,4 ≤ N_s
x10,1 ≤ N_a
x9,2 + x10,2 ≤ N_a
x9,3 + x10,3 ≤ N_a
x9,4 ≤ N_a
x11,2 ≤ N_c
x11,3 ≤ N_c
x11,4 ≤ N_c.
[Figure 2.28 residue: (a) a data flow graph with variables 0 through 7 and operations 8 through 11 (b) a data path with registers R0 through R2 and modules M3 and M4]

Figure 2.28 A data flow graph and a synthesized data path

Gray lines in the DFG denote clock cycle boundaries, called control steps. All input and output variables on a clock boundary must be stored in a register; in other words, a register must be assigned to each input or output variable, and this process is called register assignment. The data path in Fig. 2.28 (b) is obtained under the register assignment R0 = {0, 4}, R1 = {1, 3, 6}, and R2 = {2, 5, 7}. The horizontal crossing of a control step is the number of variables alive at that control step. Therefore, the minimum number of registers required for the synthesis of a DFG equals the maximal horizontal crossing of the DFG. For example, the horizontal crossing of control step 0 and of control step 1 is three, which is maximal for this DFG; hence, the minimum number of registers required for the synthesis of the DFG is three. The minimum number of modules necessary for a type of operation is obtained directly from the maximum concurrency of the operation. For example, if the maximum number of multiplication operations performed between any two consecutive clock steps is three for a DFG, then at least three multipliers are needed to synthesize the DFG. The data path in Fig. 2.28 (b) contains the minimum number of registers (three) and the minimum number of modules (two). We assume that the numbers of registers and modules to be used for the synthesis of a DFG are known a priori.

The following nomenclature is used for the ILP model described in the thesis.
Vo is the set of operations: Vo = {8, 9, 10, 11}.
Vv is the set of variables: Vv = {0, 1, 2, ..., 7}.
l is the label of an input port of an operation. The leftmost input port is labeled 0, the next port 1, and so on. An input port is designated by its label.
I(o) is the set of input ports of an operation o: I(8) = {0, 1} and I(9) = {0, 1}.
Ei is the set of ordered triples (v, o, l) defined for the inputs of all operations, where o is an operation, l is an input port of the operation, and v is the variable on input port l: Ei = {(0,8,0), (1,8,1), (3,9,0), (4,9,1), (4,10,0), (2,10,1), (5,11,0), (6,11,1)}.
Eo is the set of ordered pairs (o, v) defined for the outputs of all operations, where o is an operation and v is the output variable of the operation: Eo = {(8,4), (9,5), (10,6), (11,7)}.
T is the set of control steps: T = {0, 1, 2, 3}.
C is the set of constants: C = ∅.

The following nomenclature is defined for a data path logic to be synthesized from a data flow graph. Input ports of modules are labeled in the same manner as those of operators.
R is the set of registers: R = {0, 1, 2}.
M is the set of modules: M = {3, 4}.
I(m) is the set of input ports of a module m, where m ∈ M: I(3) = {0, 1} and I(4) = {0, 1}.
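The horizontal-crossing rule described above can be sketched as a small lifetime analysis. The lifetimes below are hypothetical (they are not the exact variable lifetimes of Fig. 2.28), chosen so that the maximum crossing, and hence the minimum register count, is three.

```python
# Minimum register count = maximum "horizontal crossing": the largest
# number of variables alive across any control-step boundary. The
# lifetimes (birth, death) below are hypothetical.
lifetimes = {0: (0, 1), 1: (0, 1), 2: (0, 2), 3: (1, 2),
             4: (1, 3), 5: (2, 3), 6: (2, 4), 7: (3, 4)}

def min_registers(lifetimes):
    boundaries = {t for b, d in lifetimes.values() for t in (b, d)}
    # A variable born at b and dying at d is alive across boundary t
    # whenever b <= t < d.
    return max(sum(b <= t < d for b, d in lifetimes.values())
               for t in boundaries)

n = min_registers(lifetimes)
```

The analogous bound for modules counts, per control step, the operations of each type instead of the live variables.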
A binary variable x_vr (x_om) is 1 only when a variable (operation) v (o) is assigned to a register (module) r (m), and is 0 otherwise. Each variable (operation) must be assigned to exactly one register (module). Hence,
Σ_{r ∈ R} x_vr = 1, ∀ v ∈ Vv    (2.5)

Σ_{m ∈ M} x_om = 1, ∀ o ∈ Vo.    (2.6)
From Eq. (2.5) and Eq. (2.6), we have the following set of equations:
x0,0 + x0,1 + x0,2 = 1
x1,0 + x1,1 + x1,2 = 1
...    (2.7)
From Eq. (2.7), the set of equations for the control step 0 is illustrated below.
From Eq. (2.8), the following set of equations is obtained for the interconnections for r=0 and m=3.
x0,0 + x8,3 − z0,3,0 ≤ 1
x1,0 + x8,3 − z0,3,1 ≤ 1
x3,0 + x9,3 − z0,3,0 ≤ 1
...
Similarly, a binary variable z_mr is 1 only if there is an interconnection between the output of a module m and a register r, and is 0 otherwise. The constraint is represented as

x_om + x_vr − z_mr ≤ 1, ∀ m ∈ M, ∀ r ∈ R, (o, v) ∈ Eo.    (2.9)
Commutative operations, in which the two input ports can be swapped, are modeled as follows [39]. Inputs are applied to the input ports of the operation through pseudo-input ports. Let a binary variable s_{l*,l,o} = 1 if a connection exists between a pseudo-input port l* and an input port l of a commutative operation o. All of the possible connections between the pseudo and original inputs of operation 8 are shown in Fig. 2.29 (a). The constraints for commutative operations can be written as
Σ_{l* ∈ I(o)} s_{l*,l,o} = 1, ∀ l ∈ I(o)    (2.10)

Σ_{l ∈ I(o)} s_{l*,l,o} = 1, ∀ l* ∈ I(o).    (2.11)
From Eq. (2.10) and Eq. (2.11), the constraints for operation 8 are obtained as
s0,0,8 + s1,0,8 = 1
s0,1,8 + s1,1,8 = 1
s0,0,8 + s0,1,8 = 1
s1,0,8 + s1,1,8 = 1

The above set of equations excludes the illegal connections shown in Fig. 2.29 (b).
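That exclusion can be checked by enumeration: over the 16 possible assignments of the four s variables of operation 8, only the two permutation-like connection patterns satisfy Eqs. (2.10) and (2.11). (The flat names s00 = s_{0,0,8}, s01 = s_{0,1,8}, and so on are just for this sketch.)

```python
from itertools import product

# Enumerate Eqs. (2.10) and (2.11) for operation 8: the row and column sums
# over s_{l*,l,8} force the pseudo-input/input connection matrix to be a
# permutation, so only the identity and the swap survive.
solutions = []
for s00, s01, s10, s11 in product((0, 1), repeat=4):
    rows = (s00 + s01 == 1) and (s10 + s11 == 1)   # each pseudo port used once
    cols = (s00 + s10 == 1) and (s01 + s11 == 1)   # each input port fed once
    if rows and cols:
        solutions.append((s00, s01, s10, s11))
```

Configurations such as s_{0,0,8} = s_{1,0,8} = 1, where both pseudo ports drive the same input, violate a column sum and are rejected, which is exactly the illegal case of Fig. 2.29 (b).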
[Figure 2.29 residue: (a) pseudo-input ports l* = 0, 1 connected to input ports l = 0, 1 of operation 8 (b) an illegal configuration with s0,0,8 = s1,0,8 = 1]

Figure 2.29 Connections between pseudo-input ports and input ports

Hence, if a module is commutative, the constraints for interconnection can be written as
x_vr + x_om + s_{l*,l,o} − 2 ≤ z_{r,m,l}, ∀ r ∈ R, ∀ m ∈ M, l ∈ I(o), (v, o, l*) ∈ Ei.    (2.12)
A synthesis procedure solves the above set of equations under a cost function such as area. The data path in Fig. 2.28 (b) is optimal in terms of the number of components and interconnections.
The TFB structure was later refined using a new structure called the extended TFB, as shown in Fig. 2.30 (b), which further reduces the area overhead [24].
Figure 2.30 A structure of testable functional blocks: (a) a TFB, in which multiplexers switch the ALU inputs between the normal data path and a TPG, with an SR on the output; (b) an extended TFB, which uses a testable register

B. RALLOC

Avra proposed a heuristic in [17] that guides the register assignment to avoid self-adjacent registers whenever possible. The heuristic is based on clique partitioning of a register conflict graph. A testability conflict edge is added between two nodes representing variables if one node is an input to an operation and the other node is the output of the same operation. This is illustrated using the data flow graph shown in Fig. 2.31 (a). Suppose that the * operations in the data flow graph are assigned to the same module. If both variables x and z are assigned to the same register, then that register becomes self-adjacent. The edge between x and z in the register conflict graph in Fig. 2.31 (b) prevents such an assignment. If two operations in consecutive clocks are assigned to the same block and the output of one operation is an input to the other, the variable associated with both the input and the output of the block must be assigned to a CBILBO register. If a variable (or value) is active across several control steps, it is split to minimize the number of CBILBOs. If splitting does not help reduce the number of CBILBOs, the variable is merged back to save multiplexer inputs. Thus, a weight is assigned to split variables (delayed values).
Figure 2.31 A data flow graph over the variables u, v, w1, w2, x, y, t1, t2, and z (a) and its register conflict graph (b)
C. BITS

The register assignment in BITS [20] is performed using the perfect vertex elimination scheme (PVES). The order of assignment is determined by a sharing degree and the size of the maximum clique of a conflict graph. The sharing degree of a register is the number of distinct modules connected to the register. The goal of the assignment is to maximize sharing: with highly shared data path logic, fewer TPGs and SRs are necessary to test the modules. In addition, the number of CBILBOs is minimized by checking each new variable against the previously assigned registers; a CBILBO is assigned only when there is no alternative. Let MCS(v) denote the size of the maximum clique containing a variable v, and SD(v) the sharing degree of v, that is, the number of distinct modules attached to the register to which v is to be assigned. Variables are ordered such that if a variable v appears before w, then SD(v) < SD(w), or MCS(v) < MCS(w) in case SD(v) = SD(w). Following this order, each variable is assigned as follows: if the variable conflicts with all existing registers, a new register is created. Otherwise, among the registers that do not conflict with the variable, one is picked with the following priorities: maximize the increase of the sharing degree from adding the new variable, maximize the sharing degree of the register, and minimize the interconnection cost.
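The three-level picking rule can be expressed as a single lexicographic key. The sketch below is my paraphrase of the BITS rule, not code from [20]; all data structures (the register-to-module map, the module set a variable would attach, the interconnection cost table) are hypothetical.

```python
# Sketch: among non-conflicting registers, prefer the largest sharing-degree
# increase, then the most-shared register, then the cheapest interconnection.

def pick_register(var, registers, conflicts, var_modules, icost):
    """registers: {name: set of modules already attached to it};
    conflicts: set of (var, register) pairs that are forbidden;
    var_modules[var]: modules the variable would newly connect;
    icost[(var, reg)]: interconnection cost of the pairing."""
    candidates = [r for r in registers if (var, r) not in conflicts]
    if not candidates:
        return None          # caller creates a new register instead
    return max(candidates,
               key=lambda r: (len(var_modules[var] - registers[r]),  # SD gain
                              len(registers[r]),                     # current SD
                              -icost[(var, r)]))                     # cheap wiring

# Hypothetical instance: r0 serves one adder, r1 serves an adder and a
# multiplier; variable v would attach the multiplier.
regs = {"r0": {"add1"}, "r1": {"add1", "mul1"}}
print(pick_register("v", regs, set(), {"v": {"mul1"}},
                    {("v", "r0"): 1, ("v", "r1"): 2}))   # r0
```

Here r0 wins because adding v raises its sharing degree from one module to two, whereas r1 already reaches the multiplier and gains nothing.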
For example, the data flow graph in Fig. 2.32 (a) has the conflict graph of variables shown in Fig. 2.32 (b). The PVES ordering for the data flow graph is {h, a, b, e, f, g, d, c}, with SDs and MCSs calculated as in Table 2.3 for each variable. Register assignment is performed sequentially in the reverse order, {c, d, g, f, e, b, a, h}.
Figure 2.32 Data flow graph and its conflict graph

Table 2.3 Ordering of variables in BITS

    Variable v | Sharing degree SD(v) | MaxClique size MCS(v) | Reverse order
    a          | 1                    | 2                     | 7
    b          | 1                    | 2                     | 6
    c          | 2                    | 3                     | 1
    d          | 2                    | 3                     | 2
    e          | 1                    | 3                     | 5
    f          | 2                    | 2                     | 4
    g          | 2                    | 2                     | 3
    h          | 1                    | 1                     | 8
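The PVES ordering can be reproduced by sorting on the pair (SD, MCS). The SD/MCS values below are taken from Table 2.3; note that variables with identical (SD, MCS) pairs, such as c and d, may appear in either relative order since the thesis does not specify the final tie-break.

```python
# Sketch: derive the BITS/PVES ordering from Table 2.3 by sorting
# variables on (SD(v), MCS(v)) ascending.

sd_mcs = {"a": (1, 2), "b": (1, 2), "c": (2, 3), "d": (2, 3),
          "e": (1, 3), "f": (2, 2), "g": (2, 2), "h": (1, 1)}

order = sorted(sd_mcs, key=lambda v: sd_mcs[v])   # PVES order, h first
assignment_order = list(reversed(order))          # registers assigned here
print(order, assignment_order)
```

The computed order starts with h (SD 1, MCS 1) and ends with the (2, 3) group {c, d}, matching Table 2.3 up to ties, and reversing it yields the sequence in which registers are actually assigned.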
process is guided to avoid such conflicts for the synthesized circuit. They reported that test time for example circuits is reduced (presumably at the cost of higher area overhead) through the proposed method [25], [26].
In the interval graph in Fig. 2.35, the intervals of two variables, v1 and v2, are completely contained in Zone1. The intervals of the variables v3, v6, v7, v8, v9, v10, and v11 cross the two zones. Processing the variables whose intervals cross the zones is the major issue of the algorithm; interested readers may refer to [37]. It is reported in [37] that the optimal ILP formulation applied to a sixth-order elliptic band-pass filter takes more than 1800 seconds, whereas the proposed ZS reduces the processing time to 1 to 10 seconds on the same filter. The cost of the speedup is an increase in control steps from 40 to 42, which is small.
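The zone bookkeeping amounts to classifying each lifetime interval against the zone boundary. The sketch below illustrates this step only; the interval values are hypothetical, since the actual lifetimes of Fig. 2.35 are not reproduced here.

```python
# Sketch: split variables into those whose lifetime interval is contained
# in Zone1 and those crossing into Zone2 (the cases ZS must treat specially).

def classify(intervals, zone_end):
    """intervals: {var: (start, end)} in control steps; Zone1 covers
    steps up to and including zone_end."""
    contained = [v for v, (s, e) in intervals.items() if e <= zone_end]
    crossing = [v for v, (s, e) in intervals.items() if s <= zone_end < e]
    return contained, crossing

# Hypothetical lifetimes; Zone1 ends at control step 2.
ivals = {"v1": (1, 2), "v2": (1, 2), "v3": (1, 3), "v6": (2, 4)}
print(classify(ivals, 2))   # (['v1', 'v2'], ['v3', 'v6'])
```

Variables in the first list can be assigned registers zone-locally; those in the second are the ones whose handling dominates the complexity of ZS.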
Figure 2.35 An interval graph of the variables v1 through v11 over the control steps, partitioned into Zone1 and Zone2