FlowMap
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 13, No. 1, January 1994
Abstract: The field programmable gate-array (FPGA) has become an important technology in VLSI ASIC designs. In the past few years, a number of heuristic algorithms have been proposed for technology mapping in lookup-table (LUT) based FPGA designs, but none of them guarantees optimal solutions for general Boolean networks, and little is known about how far their solutions are from the optimal ones. This paper presents a theoretical breakthrough which shows that the LUT-based FPGA technology mapping problem for depth minimization can be solved optimally in polynomial time. A key step in our algorithm is to compute a minimum height K-feasible cut in a network, which is solved optimally in polynomial time based on network flow computation. Our algorithm also effectively minimizes the number of LUT's by maximizing the volume of each cut and by several post-processing operations. Based on these results, we have implemented an LUT-based FPGA mapping package called FlowMap. We have tested FlowMap on a large set of benchmark examples and compared it with other LUT-based FPGA mapping algorithms for delay optimization, including Chortle-d, MIS-pga-delay, and DAG-Map. FlowMap reduces the LUT network depth by up to 7% and reduces the number of LUT's by up to 50% compared to the three previous methods.

I. INTRODUCTION

The short design cycle and low manufacturing cost have made FPGA an important technology for VLSI ASIC designs. The LUT-based FPGA architecture is a popular architecture used by several FPGA manufacturers, including Xilinx and AT&T [15], [28]. In an LUT-based FPGA chip, the basic programmable logic block is a K-input lookup table (K-LUT) which can implement any Boolean function of up to K variables. The technology mapping problem in LUT-based FPGA designs is to cover a general Boolean network (obtained by technology independent synthesis) using K-LUT's to obtain a functionally equivalent K-LUT network. This paper studies the LUT-based FPGA technology mapping problem for delay optimization.

The previous LUT-based FPGA mapping algorithms can be roughly divided into three classes. The algorithms in the first class emphasize minimizing the number of LUT's in the mapping solutions. This class includes MIS-pga and its enhancement, MIS-pga-new, by Murgai et al. based on several synthesis techniques [20], [22], Chortle and Chortle-crf by Francis et al. based on tree decomposition and bin packing techniques [11], [14], Xmap by Karplus based on the if-then-else DAG representation [17], the algorithm by Woo based on the notion of edge visibility [27], and the work by Sawkar and Thomas based on the clique partitioning approach [24]. The algorithms in the second class emphasize minimizing the delay of the mapping solutions. This class includes MIS-pga-delay by Murgai et al., which combines the technology mapping with layout synthesis [21], Chortle-d by Francis et al., which minimizes the depth increase at each bin packing step [12], and DAG-Map by Cong et al. [3], [7], based on Lawler's labeling algorithm. The mapping algorithms in the third class, including that proposed by Bhat and Hill [1], and that by Schlag, Kong, and Chan [26], have the objective of maximizing the routability of the mapping solutions. Although many existing mapping methods showed encouraging results, these methods are heuristic in nature, and it is difficult to determine how far the mapping solutions of these algorithms are from the optimal solution.¹ It has been of both theoretical and practical interest to CAD researchers to develop optimal FPGA mapping algorithms for general Boolean networks.

This paper presents a theoretical breakthrough which shows that the LUT-based FPGA technology mapping problem for depth minimization can be solved optimally in polynomial time for general K-bounded Boolean networks. A key step in our algorithm is to compute a minimum height K-feasible cut in a network, which is solved optimally in polynomial time based on efficient network flow computation. Our algorithm also effectively minimizes the number of LUT's by maximizing the volume of each cut and by several post-processing operations. Based on these results, we have implemented an LUT-based FPGA mapping package named FlowMap. We have tested FlowMap on a set of benchmark examples and compared it with other LUT-based FPGA mapping algorithms for delay optimization, including Chortle-d, MIS-pga-delay, and DAG-Map. FlowMap reduces the LUT network depth by up to 7% and reduces the number of LUT's by up to 50% compared to the three previous methods.

Manuscript received September 28, 1992; revised April 30, 1993. This research was supported in part by the NSF under Grant MIP-9110511, Xilinx Inc., and the State of California MICRO Program under Grant 92-030. This paper was recommended by Associate Editor L. Trevillyan.
The authors are with the Department of Computer Science, University of California, Los Angeles, CA 90024.
IEEE Log Number 9212334.
¹ Some previous algorithms achieve optimal mapping for restricted problem domains: Chortle is optimal when the input network is a tree, Chortle-crf and Chortle-d are optimal when the input network is a tree and $K \le 6$, and DAG-Map is optimal when the mapping constraint is monotone, which is true for trees.
Our result makes a sharp contrast with the fact that the conventional technology mapping problem in library-based designs is NP-hard for general Boolean networks [9], [18]. Due to the inherent difficulty, most conventional technology mapping algorithms decompose the input network into a forest of trees and then map each tree optimally [9], [18]. Such a methodology was also used in some existing FPGA mapping algorithms [11], [12], [14]. However, the result in this paper shows that K-LUT mapping can be carried out directly on general K-bounded Boolean networks to achieve depth-optimal solutions.

The remainder of this paper is organized as follows. Section II gives a precise problem formulation and some preliminaries. Section III presents our depth-optimal technology mapping algorithm for LUT-based FPGA designs. Section IV describes several enhancements of our algorithm for area minimization. Experimental results and comparative study are presented in Section V. Extensions and conclusions are presented in Section VI.

II. PROBLEM FORMULATION AND PRELIMINARIES

A Boolean network can be represented as a directed acyclic graph (DAG) where each node represents a logic gate² and a directed edge $(i, j)$ exists if the output of gate $i$ is an input of gate $j$. A primary input (PI) node has no incoming edge and a primary output (PO) node has no outgoing edge. We use $input(v)$ to denote the set of nodes which are fanins of gate $v$. Given a subgraph $H$ of the Boolean network, $input(H)$ denotes the set of distinct nodes outside $H$ which supply inputs to the gates in $H$. For a node $v$ in the network, a K-feasible cone at $v$, denoted $C_v$, is a subgraph consisting of $v$ and its predecessors³ such that $|input(C_v)| \le K$ and any path connecting a node in $C_v$ and $v$ lies entirely in $C_v$. The level of a node $v$ is the length of the longest path from any PI node to $v$. The level of a PI node is zero. The depth of a network is the largest node level in the network. A Boolean network is K-bounded if $|input(v)| \le K$ for each node $v$.

² In the rest of the paper gates and nodes are used interchangeably for Boolean networks.
³ $u$ is a predecessor of $v$ if there is a directed path from $u$ to $v$.

We assume that each programmable logic block in an FPGA is a K-input one-output lookup table (K-LUT) that can implement any K-input Boolean function. Thus, each K-LUT can implement any K-feasible cone of a Boolean network. The technology mapping problem for K-LUT based FPGA's is to cover a given K-bounded Boolean network with K-feasible cones, or equivalently, K-LUT's.⁴ Fig. 1 shows an example of mapping a Boolean network into a 3-LUT network. Note that we allow these cones to overlap, which means that the nodes in the overlapped region can be duplicated when generating K-LUT's. In fact, our algorithm is capable of duplicating nodes automatically when necessary, in order to achieve depth optimization. A technology mapping solution $S$ is a DAG in which each node is a K-feasible cone (equivalently, a K-LUT) and the edge $(C_u, C_v)$ exists if $u$ is in $input(C_v)$. Our main objective is to compute a mapping solution that results in the minimum delay.

⁴ If the input network is not K-bounded, it may not be covered with K-LUT's directly. In this case, nodes in the network with more than K fanins may have to be decomposed before covering. However, we consider such a decomposition step as part of the synthesis procedure.

The delay of an FPGA circuit is determined by two factors: the delay in K-LUT's and the delay in the interconnection paths. Each K-LUT contributes a constant delay (the access time of a K-LUT) independent of the function it implements. Since layout information is not available at this stage, we assume that each edge in the mapping solution contributes a constant delay. In this case, the delay is determined by the depth of the mapping solution, which is known as the unit delay model. We say that a mapping solution is optimal if its depth is minimum. The primary objective of our algorithm is to find an optimal mapping solution in terms of depth minimization, and the secondary objective is to reduce the number of K-LUT's used in the technology mapping solution.

Several concepts about cuts in a network will be used in our algorithm. Given a network $N = (V(N), E(N))$ with a source $s$ and a sink $t$, a cut $(X, \overline{X})$ is a partition of the nodes in $V(N)$ such that $s \in X$ and $t \in \overline{X}$. The node cut-size of $(X, \overline{X})$, denoted $n(X, \overline{X})$, is the number of nodes in $X$ that are adjacent to some node in $\overline{X}$, i.e.,

$$n(X, \overline{X}) = |\{x : (x, y) \in E(N),\ x \in X \text{ and } y \in \overline{X}\}|.$$

A cut $(X, \overline{X})$ is K-feasible if $n(X, \overline{X}) \le K$. Assume that each edge $(u, v)$ has a non-negative capacity $c(u, v)$. The edge cut-size of $(X, \overline{X})$, denoted $e(X, \overline{X})$, is the sum of the
CONG AND DING: FLOWMAP AN OPTIMAL TECHNOLOGY MAPPING ALGORITHM
capacities of the forward edges that cross the cut, i.e.,

$$e(X, \overline{X}) = \sum_{u \in X,\ v \in \overline{X}} c(u, v).$$

Throughout this paper, we assume that the capacity of each edge is one unless specified otherwise. The volume of a cut $(X, \overline{X})$, denoted $vol(X, \overline{X})$, is the number of nodes in $\overline{X}$, i.e., $vol(X, \overline{X}) = |\overline{X}|$. Moreover, assume that there is a given label $l(v)$ associated with each node $v$. The height of a cut $(X, \overline{X})$, denoted $h(X, \overline{X})$, is defined to be the maximum label in $X$, i.e.,

$$h(X, \overline{X}) = \max\{l(x) : x \in X\}.$$

Fig. 2 shows a cut $(X, \overline{X})$ in a network with given node labels, where $n(X, \overline{X}) = 3$, $e(X, \overline{X}) = 10$, $h(X, \overline{X}) = 2$, and $vol(X, \overline{X}) = 9$.

Fig. 2. A 3-feasible cut of edge cut-size 10, volume 9, and height 2.

III. AN OPTIMAL LUT-BASED FPGA MAPPING ALGORITHM FOR DEPTH MINIMIZATION

Our algorithm is applicable to any K-bounded Boolean network. Given a general Boolean network as input, if it is not K-bounded, there are a number of ways to transform it into a K-bounded network. For example, the Roth-Karp decomposition [23] was used in [20] to obtain a K-bounded network. We use the algorithm DMIG presented in [3], which is based on the Huffman coding tree construction [16], to decompose each multiple-input simple gate⁵ into a tree of two-input simple gates. According to the result in [3], such a decomposition procedure increases the network depth by at most a small constant factor. The reason for carrying out such a transformation is that if we think of FPGA technology mapping as a process of packing gates into K-LUT's, then smaller gates will be more easily packed, and the mapping algorithm will be able to pack more gates along critical paths into one K-LUT, resulting in smaller depth in the mapping solution. This argument is justified by our experimental results in Table III shown in Section V.

⁵ We can always obtain a simple gate network by representing each complex gate in the sum-of-products form and then replacing it with two levels of simple gates.

In the rest of this paper, we shall assume that the input networks are K-bounded networks. Although we transform an input network into a network of two-input simple gates, the optimality of our algorithm does not depend on the fact that each node in the given Boolean network is a two-input simple gate. The optimality of our mapping result holds as long as the input network is a K-bounded network, in which the gates need not be simple.

The fundamental difficulty in LUT-based FPGA mapping is that the constraint on the number of inputs of a programmable logic block is not a monotone clustering constraint. A clustering constraint r is monotone if knowing that a network $H$ satisfies r implies that any subnetwork of $H$ also satisfies r [19]. For example, if we assume that the constraint for each programmable logic block is the number of gates it may cover in the original network, it is a monotone clustering constraint. Unfortunately, limiting the number of distinct inputs of each programmable logic block is not a monotone clustering constraint. For example, Fig. 3 shows a network of three distinct inputs, which is 3-feasible. But the subnetwork consisting of nodes $t$, $v$ and $w$ has four distinct inputs, which is not 3-feasible. Clustering (or, similarly, mapping) for a monotone clustering constraint r is much easier because if a subnetwork $H$ does not satisfy the constraint r, we can conclude that $H$ is not a part of any cluster. It was shown that Lawler's labeling algorithm [19] can produce a minimum depth clustering solution in polynomial time whenever the clustering constraint is monotone. The DAG-Map algorithm developed by Cong et al. [3], [7] modified Lawler's algorithm and applied it to the LUT-based FPGA mapping problem. Although it achieved encouraging results for depth minimization, it was shown that the DAG-Map algorithm is not optimal [3].

Fig. 3. Constraint on the number of inputs to LUT is not monotone ($K = 3$).

The mapping algorithm presented in this paper successfully overcomes the difficulty due to the nonmonotone clustering constraint in LUT-based FPGA technology mapping. The algorithm runs in two phases. In the first phase, it computes a label for each node which reflects the level of the K-LUT implementing that node in an optimal mapping solution. In the second phase, it generates the K-LUT mapping solution based on the node labels computed in the first phase.

3.1. The Labeling Phase

Given a K-bounded Boolean network $N$, let $N_t$ denote the subnetwork consisting of node $t$ and all the predecessors of $t$. We define the label of $t$, denoted $l(t)$, to be the depth of the optimal K-LUT mapping solution of $N_t$. Clearly, the level
of the K-LUT rooted at $t$ (if it exists) in the optimal mapping solution of $N$ is at least $l(t)$, and the maximum label of all the PO's of $N$ is the depth of the optimal mapping solution of $N$.

The first phase of our algorithm computes the labels of all the nodes in $N$, according to the topological order starting from the PI's. The topological ordering guarantees that every node is processed after all of its predecessors have been processed. For each PI node $u$, we assign $l(u) = 0$. Suppose $t$ is the current node being processed. Then, for each node $u \ne t$ in $N_t$, the label $l(u)$ must have been computed. By including in $N_t$ an auxiliary node $s$ and connecting $s$ to all the PI nodes in $N_t$, we obtain a network with $s$ as the source and $t$ as the sink. For simplicity we still denote it as $N_t$. Fig. 4(a) shows part of a Boolean network in which gate $t$ is to be labeled, and Fig. 4(b) shows the construction of the network $N_t$. Let $LUT(t)$ be the K-LUT that implements node $t$ in an optimal K-LUT mapping solution of $N_t$. Let $\overline{X}$ denote the set of nodes in $LUT(t)$ and $X$ denote the remaining nodes in $N_t$. Then, $(X, \overline{X})$ forms a K-feasible cut between $s$ and $t$ in $N_t$ because the number of inputs of $LUT(t)$ is no more than $K$. Moreover, let $u$ be the node with the maximum label in $X$; then the level of $LUT(t)$ is $l(u) + 1$ in the optimal mapping solution of $N_t$. Recalling the definition of the height of a cut in Section II, we have $h(X, \overline{X}) = l(u)$. Therefore, in order to minimize the level of $LUT(t)$ in the mapping solution of $N_t$, we want to find a minimum height K-feasible cut $(X, \overline{X})$ in $N_t$.⁶ In other words,

$$l(t) = \min_{(X, \overline{X}) \text{ is K-feasible}} h(X, \overline{X}) + 1. \qquad (1)$$

⁶ We exclude the cuts $(X, \overline{X})$ where $\overline{X}$ contains a PI node. Our algorithm, to be shown later on, guarantees that such cuts are not generated.

Based on the above discussion, we have

Lemma 1: The label $l(t)$ computed by Eq. (1) gives the minimum depth of any mapping solution of $N_t$. □

Fig. 4(b) and (c) illustrate our labeling method. Since in Fig. 4(b) there is a minimum height 3-feasible cut in $N_t$ of height 1, we have $l(t) = 2$, and the optimal K-LUT mapping solution of $N_t$ is shown in Fig. 4(c).

Fig. 4. Computing the label $l(t)$ of node $t$ ($K = 3$). (a) The partial network. (b) Construction of $N_t$ and the highest 3-feasible cut. (c) Determining $l(t)$.

There is no existing algorithm for computing a minimum height K-feasible cut efficiently. One important contribution of our work is that we have developed an $O(Km)$ time algorithm for computing a minimum height K-feasible cut in $N_t$, where $m$ is the number of edges in $N_t$. First, we show that the node labels defined by our labeling scheme satisfy the following property.

Lemma 2: Let $l(t)$ be the label of node $t$; then $l(t) = p$ or $l(t) = p + 1$, where $p$ is the maximum label of the nodes in $input(t)$.

Proof: Let $t'$ be any node in $input(t)$. Then for any cut $(X, \overline{X})$ in $N_t$, either $t' \in X$, or $(X, \overline{X})$ also determines a K-feasible cut $(X', \overline{X'})$ in $N_{t'}$ with $h(X', \overline{X'}) \le h(X, \overline{X})$, where $X' = X \cap V(N_{t'})$ and $\overline{X'} = \overline{X} \cap V(N_{t'})$. If $(X, \overline{X})$ is a minimum height K-feasible cut in $N_t$, then, in the former case, we have $l(t) = h(X, \overline{X}) + 1 \ge l(t') + 1$, i.e., $l(t) > l(t')$; and in the latter case, we have $l(t') - 1 \le h(X', \overline{X'}) \le h(X, \overline{X}) = l(t) - 1$, which implies $l(t) \ge l(t')$. (Note this proves that the label of any node cannot be smaller than those of its predecessors.) Therefore, $l(t) \ge p$.

On the other hand, since the network $N$ is K-bounded, $|input(t)| \le K$. Therefore, $(V(N_t) - \{t\}, \{t\})$ is a K-feasible cut. Because each node in $V(N_t) - \{t\}$ is either in $input(t)$ or is a predecessor of some node in $input(t)$, the maximum label of the nodes in $V(N_t) - \{t\}$ is $p$. Therefore, $h(V(N_t) - \{t\}, \{t\}) = p$, i.e., $l(t) \le p + 1$. □

According to Lemma 2, our algorithm first checks if there is a K-feasible cut $(X_t, \overline{X_t})$ of height $p - 1$ in $N_t$. If there is such a cut, we assign $l(t) = p$, and node $t$ can be packed with the nodes in $\overline{X_t}$ into one K-LUT in the second phase of our algorithm for generating the mapping solution. Otherwise, the minimum height of the K-feasible cuts in $N_t$ is $p$ and $(V(N_t) - \{t\}, \{t\})$ is such a cut. In this case, we assign $l(t) = p + 1$ and we shall use a new K-LUT for node $t$ in the second phase.
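As a sanity check on Eq. (1) and Lemma 2, the labels of a very small network can be computed by exhaustive enumeration of cuts. The sketch below is only an illustration and not the paper's algorithm (it takes exponential time). The names `cone_of`, `brute_force_labels`, and `network` are hypothetical, and the network is assumed to be given as a dictionary from each node to its fanin list, with keys listed in topological order and PI nodes having empty fanin lists.

```python
from itertools import combinations

def cone_of(fanins, t):
    """Return the node set of N_t: t plus every predecessor of t."""
    seen, stack = {t}, [t]
    while stack:
        for u in fanins[stack.pop()]:
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen

def brute_force_labels(fanins, K):
    """Label every node by Eq. (1), enumerating all cuts of N_t (tiny inputs only)."""
    label = {}
    for t in fanins:                      # assumption: keys are in topological order
        if not fanins[t]:
            label[t] = 0                  # PI nodes get label 0
            continue
        nodes = cone_of(fanins, t)
        # Gates that may join t on the sink side; PI nodes always stay in X.
        gates = [v for v in nodes if v != t and fanins[v]]
        heights = []
        for r in range(len(gates) + 1):
            for extra in combinations(gates, r):
                x_bar = set(extra) | {t}
                x = nodes - x_bar
                # Node cut-size: nodes of X adjacent to some node of X-bar.
                cut_nodes = {u for v in x_bar for u in fanins[v] if u in x}
                if len(cut_nodes) <= K:
                    heights.append(max(label[u] for u in x))
        label[t] = min(heights) + 1       # Eq. (1); K-boundedness keeps heights non-empty
    return label

# Hypothetical example: five PI nodes and four two-input gates.
network = {
    "a": [], "b": [], "c": [], "d": [], "e": [],
    "g1": ["a", "b"], "g2": ["c", "d"],
    "g3": ["g1", "g2"], "g4": ["g3", "e"],
}
labels = brute_force_labels(network, K=3)
```

With $K = 3$, gate `g3` receives label $p + 1 = 2$ (no 3-feasible cut of height 0 exists in its cone), while `g4` receives label $p = 2$ because the cut with `g3` and `g4` on the sink side has three cut nodes and height 1. Both outcomes allowed by Lemma 2 therefore appear in this one example.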
Whether $N_t$ has a K-feasible cut of height $p - 1$ or not can be tested efficiently using the following method. Let $p$ be the maximum label of the nodes in $input(t)$, which is also the maximum label of the nodes in $N_t - \{t\}$. We first apply a network transformation on $N_t$ that collapses all the nodes in $N_t$ with label $\ge p$, together with $t$, into the new sink $t'$. Denoting the resulting network as $N_t'$, we have the following result.

Lemma 3: $N_t$ has a K-feasible cut of height $p - 1$ if and only if $N_t'$ has a K-feasible cut.

Proof: Let $H_t$ denote the set of nodes in $N_t$ that are collapsed into $t'$.

If $N_t'$ has a K-feasible cut $(X', \overline{X'})$, let $X = X'$ and $\overline{X} = (\overline{X'} - \{t'\}) \cup H_t$; then clearly $(X, \overline{X})$ is a K-feasible cut of $N_t$. Since no node in $X'$ ($= X$) has label $p$ or larger, we have $h(X, \overline{X}) \le p - 1$. Moreover, according to Lemma 2, $l(t) \ge p$, which implies $h(X, \overline{X}) \ge p - 1$. Therefore, $h(X, \overline{X}) = p - 1$.

On the other hand, if $N_t$ has a cut $(X, \overline{X})$ of height $p - 1$, then $X$ cannot contain any node of label $p$ or higher. Therefore, $H_t \subseteq \overline{X}$. In this case, $(X, (\overline{X} - H_t) \cup \{t'\})$ forms a K-feasible cut of $N_t'$. □

For example, Fig. 5(a) shows the network $N_t$ for node $t$ in Fig. 4(a), and Fig. 5(b) shows the induced network $N_t'$. In order to determine if $N_t'$ has a K-feasible cut, we apply another standard network transformation, called the node-splitting transformation, which reduces the node cut-size constraint to an edge cut-size constraint by splitting nodes into edges, so that we can use existing edge cut computation algorithms [8], [10]. Specifically, we construct a network $N_t''$ from $N_t'$ as follows. For each node $v$ in $N_t'$ other than $s$ and $t'$, we introduce two nodes $v_1$ and $v_2$ and connect them by an edge $(v_1, v_2)$ in $N_t''$, which is called a bridging edge. The source $s$ and sink $t'$ are also included in $N_t''$. For each edge $(s, v)$ in $N_t'$, there is an edge $(s, v_1)$ in $N_t''$; and for each edge $(v, t')$ in $N_t'$, there is an edge $(v_2, t')$ in $N_t''$. Moreover, for each edge $(u, v)$ in $N_t'$ ($u \ne s$ and $v \ne t'$), we introduce an edge $(u_2, v_1)$ in $N_t''$. We assign the capacity of each bridging edge to be one, and the capacity of each non-bridging edge to be infinity. Fig. 5(c) shows the resulting $N_t''$ obtained from $N_t'$ in Fig. 5(b). According to the result in [10] (pp. 23-26), we have

Lemma 4: $N_t'$ has a K-feasible cut if and only if $N_t''$ has a cut whose edge cut-size is no more than $K$. □

Based on the Max-flow Min-cut Theorem [8], [10], $N_t''$ has a cut whose edge cut-size is no more than $K$ if and only if the maximum flow⁷ between $s$ and $t'$ in $N_t''$ has value no more than $K$. Since we are only interested in testing if the maximum flow is of value $K$ or smaller, we apply the augmenting path algorithm in $N_t''$ to compute a maximum flow. (For the basic concepts of network flow and the details of the augmenting path algorithm, see [8], [10].) Since each bridging edge in $N_t''$ has unit capacity, each augmenting path in the flow residual graph of $N_t''$ from $s$ to $t'$ increases the flow by one unit. If we can find $K + 1$ augmenting paths, then the maximum flow in $N_t''$ has value more than $K$, and we can conclude that $N_t''$ does not have a cut $(X'', \overline{X''})$ with $e(X'', \overline{X''}) \le K$. Otherwise, the residual graph is disconnected before we find the $(K + 1)$th augmenting path. We can find a cut $(X'', \overline{X''})$ of edge cut-size no more than $K$ in $N_t''$ by performing a depth first search starting at the source $s$, and including in $X''$ all the nodes which are reachable from $s$ in the residual graph. Since finding an augmenting path takes $O(m)$ time, where $m$ is the number of edges in the residual graph of $N_t''$ (which is of the same order as the number of edges in $N_t$), we can determine in $O(Km)$ time whether $N_t''$ has a cut of edge cut-size no more than $K$ and find one if such a cut exists. Such a cut $(X'', \overline{X''})$ in $N_t''$ induces a K-feasible cut $(X', \overline{X'})$ in $N_t'$, which in turn gives a minimum height K-feasible cut $(X, \overline{X})$ in $N_t$.⁸ Based on the above discussions, we have

Theorem 1: A minimum height K-feasible cut in $N_t$ can be found in $O(Km)$ time, where $m$ is the number of edges in $N_t$. □

⁷ In this paper, a flow always means a flow from the source to the sink.
⁸ It is clear that for the resulting cut $(X, \overline{X})$ in $N_t$, $\overline{X}$ does not contain any PI nodes since any outgoing edge of the source $s$ in $N_t''$ has infinite capacity.

In the labeling phase, Theorem 1 is applied to each node in $N$ in topological order.
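The test described by Lemmas 3 and 4 can be sketched in code: collapse the high-label nodes into the sink, split every remaining node into a unit-capacity bridging edge, and look for at most $K + 1$ augmenting paths. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names `cone_of`, `cut_below_height_p`, and `flow_labels` are hypothetical, and the network is assumed to be a dictionary from node to fanin list with keys in topological order. The cut returned is simply the one found by the reachability search, with no attempt to maximize its volume.

```python
from collections import defaultdict

def cone_of(fanins, t):
    """Return the node set of N_t: t plus every predecessor of t."""
    seen, stack = {t}, [t]
    while stack:
        for u in fanins[stack.pop()]:
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen

def cut_below_height_p(fanins, label, t, K):
    """Return X-bar of a K-feasible cut of height p - 1 in N_t, or None if none exists."""
    nodes = cone_of(fanins, t)
    p = max(label[u] for u in fanins[t])
    sink_side = {v for v in nodes if v == t or label[v] >= p}   # collapsed into t'
    if any(not fanins[v] for v in sink_side):
        return None                       # a PI would be merged into the sink
    inf = float("inf")
    cap = defaultdict(dict)               # residual capacities of N_t''
    def add_edge(a, b, c):
        cap[a][b] = c
        cap[b].setdefault(a, 0)           # reverse residual edge
    for v in nodes - sink_side:
        add_edge((v, 1), (v, 2), 1)       # bridging edge with unit capacity
        if not fanins[v]:
            add_edge("s", (v, 1), inf)    # the source feeds every PI node
    for v in nodes:
        for u in fanins[v]:
            if u not in sink_side:
                add_edge((u, 2), "t'" if v in sink_side else (v, 1), inf)

    def search():
        """Depth-first search from s in the residual graph; returns the parent map."""
        parent, stack = {"s": None}, ["s"]
        while stack and "t'" not in parent:
            a = stack.pop()
            for b, c in cap[a].items():
                if c > 0 and b not in parent:
                    parent[b] = a
                    stack.append(b)
        return parent

    flow = 0
    while True:
        parent = search()
        if "t'" not in parent:
            break                         # residual graph disconnected: cut found
        flow += 1
        if flow > K:
            return None                   # (K + 1)th augmenting path: no such cut
        b = "t'"
        while parent[b] is not None:      # push one unit along the path
            a = parent[b]
            cap[a][b] -= 1
            cap[b][a] += 1
            b = a
    x_side = {v for v in nodes - sink_side if (v, 1) in parent}
    return nodes - x_side

def flow_labels(fanins, K):
    """Labeling phase: l(t) = p if a cut of height p - 1 exists, else p + 1."""
    label = {}
    for t in fanins:                      # assumption: keys are in topological order
        if not fanins[t]:
            label[t] = 0
        else:
            p = max(label[u] for u in fanins[t])
            found = cut_below_height_p(fanins, label, t, K)
            label[t] = p if found is not None else p + 1
    return label

# Hypothetical example network (five PI nodes, four two-input gates).
network = {
    "a": [], "b": [], "c": [], "d": [], "e": [],
    "g1": ["a", "b"], "g2": ["c", "d"],
    "g3": ["g1", "g2"], "g4": ["g3", "e"],
}
labels = flow_labels(network, K=3)
```

Pushing exactly one unit per augmenting path keeps the loop bounded by $K + 1$ searches, which mirrors the $O(Km)$ argument in the text. On the example, `cut_below_height_p(network, labels, "g4", 3)` returns the sink side containing `g3` and `g4`, so `g4` can share a 3-LUT with `g3`.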
Corollary 1: The labels of all the nodes in $N$ can be computed in $O(Kmn)$ time, where $n$ and $m$ are the number of nodes and edges in $N$, respectively. □

In fact, the result in Theorem 1 can be generalized for computing the minimum height K-feasible cut in a general network with arbitrary node labels and edge capacities as follows.

Theorem 2: Let $N$ be a general network with a non-negative integer capacity defined on each edge and a non-negative integer label defined on each node except the sink. For any positive integer $K$, a minimum height K-feasible cut $(X, \overline{X})$ can be found in $O(m \cdot \min\{K, \sqrt{n}\} \cdot \log L)$ time, and a minimum height cut $(X, \overline{X})$ with $e(X, \overline{X}) \le K$ can be found in $O(mn \log(n^2/m) \log L)$ time, where $n$ and $m$ are the number of nodes and edges in $N$, respectively, and $L$ is the number of different node labels. □

The proof and the detailed algorithm can be found in [4]. This result has been used for delay-optimal K-LUT technology mapping under arbitrary net-delay models [6].

3.2. The Mapping Phase

The second phase of our algorithm is to generate the K-LUT's in the optimal mapping solution. Let $L$ be the set of gates which are to be implemented using K-LUT's. Initially, $L$ contains all the PO nodes. We process the nodes in $L$ one by one. For each node $v$ in $L$, assume that $(X_v, \overline{X_v})$ is the minimum height K-feasible cut in $N_v$ that we have computed in the first phase by the labeling algorithm. We generate a K-LUT $v'$ to implement the function of gate $v$, using the input signals from $X_v$ to $\overline{X_v}$. That is, the K-LUT $v'$ includes all the gates in $\overline{X_v}$ and $input(v') = input(\overline{X_v})$. (Since the cut is K-feasible, we have $|input(\overline{X_v})| \le K$.) Then, we update the set $L$ to be $(L - \{v\}) \cup input(v')$. It is possible that a gate $w$ belongs to both $\overline{X_v}$ and $\overline{X_u}$ for two different gates $v$ and $u$ in $L$. In this case, gate $w$ is automatically duplicated and is included in both K-LUT's $v'$ and $u'$. It is also possible that no K-LUT is generated for a gate $w$ since it has been completely covered by the K-LUT's generated for some of its successors. In general, a K-LUT has to be generated for a gate $w$ if $w$ belongs to $input(u')$ of some K-LUT $u'$ which has been generated, since its output signal is required by $u'$ as input.

The second phase ends when $L$ consists of only PI nodes of the original network. It is clear that at the end of the execution we get a network of K-LUT's which is logically equivalent to the original network.

Our minimum depth LUT-based FPGA mapping algorithm, named FlowMap, is summarized in Fig. 6. Based on the above discussions, we have

Theorem 3: For any K-bounded Boolean network $N$, the FlowMap algorithm produces a K-LUT mapping solution with the minimum depth in $O(Kmn)$ time, where $n$ and $m$ are the number of nodes and edges in $N$.

Proof: By induction one can easily show that for any node $v$ in $N$, if a K-LUT $v'$ is generated for $v$ in the second phase, then the level of $v'$ in the mapping solution is no more than $l(v)$, which is the depth of the optimal mapping solution for $N_v$. Since any mapping solution for $N$ induces a solution for $N_v$, $l(v)$ is also the best possible depth for the K-LUT generated for $v$ in any mapping solution for $N$. Therefore, the mapping solution for $N$ generated by the FlowMap algorithm is optimal. Moreover, since the labeling phase takes $O(Kmn)$ time and the mapping phase takes $O(n + m)$ time, the total complexity of FlowMap is $O(Kmn)$. □

In the current LUT-based FPGA architecture, the typical value of $K$ is 4 or 5. Moreover, if the average number of fanins (or fanouts) of the nodes in $N$ is bounded by a constant (which is two in our implementation), we have $m = O(n)$. Therefore, the complexity of the FlowMap algorithm is $O(n^2)$ in practice.

Note that in a network of $n$ nodes, there are $O(n^K)$ K-feasible cuts. An exhaustive enumeration will result in a pseudo-polynomial time algorithm of complexity $O(n^K)$. Our algorithm, on the other hand, is strongly polynomial with respect to $K$, and thus it is much more efficient.

IV. ENHANCEMENT OF THE FLOWMAP ALGORITHM FOR AREA OPTIMIZATION

The secondary objective of our technology mapping algorithm is area optimization, i.e., to minimize the number of K-LUT's in the mapping solution. In FlowMap, area optimization is considered by maximizing the volume of each cut during the mapping process and by post-processing operations for K-LUT reduction.

4.1. Maximizing the Cut Volume During Mapping

According to the discussion in the preceding section, for each node $t$ in the input network $N$, the FlowMap algorithm computes a minimum-height K-feasible cut $(X, \overline{X})$ in $N_t$, and the nodes in $\overline{X}$ will be packed into one K-LUT $t'$ if a K-LUT is generated to implement $t$. In general, the minimum-height K-feasible cut is not unique. Intuitively, the larger $vol(X, \overline{X}) = |\overline{X}|$ is, the more nodes we can pack into the K-LUT $t'$, and the fewer K-LUT's we use in total. Therefore, our algorithm wants to maximize the volume of the cut during the minimum height K-feasible cut computation.

According to Lemmas 3 and 4, finding a minimum height K-feasible cut $(X, \overline{X})$ in $N_t$ is reduced to finding a K-feasible cut $(X', \overline{X'})$ in $N_t'$,⁹ which is further reduced to finding a cut $(X'', \overline{X''})$ with $e(X'', \overline{X''}) \le K$ in $N_t''$. According to the transformations in the preceding section, it is easy to see that $vol(X', \overline{X'}) = \frac{1}{2}\left(vol(X'', \overline{X''}) - e(X'', \overline{X''}) + 1\right)$, and if the number of nodes with the maximum label in $N_t$ is $P$, then $vol(X, \overline{X}) = vol(X', \overline{X'}) + P$. Note that for a given $N_t$, $P$ is a constant. Therefore, $vol(X, \overline{X})$ is maximized when $vol(X', \overline{X'})$ is maximized, and $vol(X', \overline{X'})$ is maximized when $vol(X'', \overline{X''}) - e(X'', \overline{X''})$ is maximized. Thus, we want to find a cut $(X'', \overline{X''})$ in $N_t''$ such that $e(X'', \overline{X''}) \le K$ and $vol(X'', \overline{X''}) - e(X'', \overline{X''})$ is maximum.

⁹ Assume that the minimum height of the K-feasible cuts in $N_t$ is $p - 1$. Otherwise, $(V(N_t) - \{t\}, \{t\})$ is a minimum height K-feasible cut in $N_t$, which is trivial to compute.

Therefore, we want to find a min-cut in $N_t''$ (i.e., a cut $(X'', \overline{X''})$ with the
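The mapping phase of Section 3.2 can be illustrated with a short sketch. It assumes that the labeling phase has already stored, for every gate $v$, the sink side $\overline{X_v}$ of its minimum height K-feasible cut. The function name `map_luts`, the dictionary `x_bar`, and the example data are hypothetical and chosen only for illustration.

```python
def map_luts(fanins, primary_outputs, x_bar):
    """Generate K-LUT's from stored cuts; returns {root gate: (covered gates, LUT inputs)}."""
    luts = {}
    pending = [v for v in primary_outputs if fanins[v]]    # the set L; PI nodes need no LUT
    while pending:
        v = pending.pop()
        if v in luts:
            continue                       # a LUT rooted at v already exists
        cone = set(x_bar[v])               # gates packed into the LUT v'
        # input(v'): nodes outside the cone that feed a gate inside it.
        inputs = {u for w in cone for u in fanins[w] if u not in cone}
        luts[v] = (cone, inputs)
        pending.extend(u for u in inputs if fanins[u])     # L = (L - {v}) U input(v')
    return luts

# Hypothetical example: cuts assumed to come from a labeling phase with K = 3.
network = {
    "a": [], "b": [], "c": [], "d": [], "e": [],
    "g1": ["a", "b"], "g2": ["c", "d"],
    "g3": ["g1", "g2"], "g4": ["g3", "e"],
}
x_bar = {"g1": {"g1"}, "g2": {"g2"}, "g3": {"g3"}, "g4": {"g3", "g4"}}
luts = map_luts(network, ["g4"], x_bar)
```

In this example no LUT is rooted at `g3`, because it is completely covered by the LUT generated for its successor `g4`. A gate that appeared in the cones of two different roots would be copied into both LUT's, which is the automatic duplication described in the text.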
also be determined in $O(n^K)$ time by enumerating all the K-feasible cuts. However, this method is too expensive when $n$ is large, even for $K = 5$. Therefore, we have developed a heuristic algorithm.

We define the rank of a cut $(X, \overline{X})$, denoted $r(X, \overline{X})$, to be an ordered pair $\langle n(X, \overline{X}), -vol(X, \overline{X}) \rangle$. The cuts can be ordered according to their ranks under the lexicographic ordering, i.e., for any two cuts $(X, \overline{X})$ and $(Y, \overline{Y})$, $r(X, \overline{X}) > r(Y, \overline{Y})$ if $n(X, \overline{X}) > n(Y, \overline{Y})$, or $n(X, \overline{X}) = n(Y, \overline{Y})$ and $vol(X, \overline{X}) < vol(Y, \overline{Y})$. Clearly, the maximum volume min-cut has the smallest rank.

Given a K-LUT network $M$ and a K-LUT $v$, our algorithm iteratively computes a sequence of cuts in $M_v$ whose rank is monotonically increasing, and at each step, we minimize the increase of the rank of the cut. There are two reasons to minimize the rank. First, we want to limit the increase of the node cut-size at each step since we are interested only in K-feasible cuts. Second, for the cuts of the same node cut-size, the smaller the rank is, the larger volume the cut has.

Specifically, we start with the maximum volume min-cut $(X_0, \overline{X_0})$, which has the minimum rank. In the $i$th iteration ($i \ge 1$), we compute a new cut $(X_i, \overline{X_i})$ from the previous cut $(X_{i-1}, \overline{X_{i-1}})$ in the sequence as follows. Let $n(X_{i-1}, \overline{X_{i-1}}) = k_{i-1}$, and let $w_1, w_2, \ldots, w_{k_{i-1}}$ be the nodes in $X_{i-1}$ that are adjacent to some node in $\overline{X_{i-1}}$. To compute $(X_i, \overline{X_i})$, we first collapse all the nodes in $\overline{X_{i-1}}$ into
TABLE I
COMPARISON WITH CHORTLE-D AND DAG-MAP

                     Chortle-d          DAG-Map           FlowMap
Circuit            #LUT’s  depth     #LUT’s  depth     #LUT’s  depth
5xp1 (104)            26      3         24      3         25      3
9sym (200)            63      5         61      5         61      5
9symml (191)          59      5         58      5         58      5
C499 (658)           382      6        207      5        154      5
C880 (548)           329      8        243      8        232      8
alu2 (393)           227      9        169      8        162      8
alu4 (726)           500     10        305     10        268     10
apex6 (779)          308      4        266      4        257      4
apex7 (247)          108      4         91      4         89      4
count (216)           91      4         81      4         76      3
des (3263)          2086      6       1433      6       1308      5
duke2 (392)          241      4        192      4        187      4
misex1 (57)           19      2         15      2         15      2
rd84 (141)            61      4         43      4         43      4
rot (647)            326      6        292      6        268      6
vg2 (120)             55      4         46      4         45      4
z4ml (48)             25      3         17      3         13      3
Total               4906     87       3543     85       3261     83
Comparison        +50.4%  +4.8%      +8.6%  +2.4%          1      1

TABLE II
COMPARISON WITH MIS-PGA-DELAY ALGORITHM

                 MIS-pga-delay         FlowMap
Circuit         #LUT’s  depth      #LUT’s  depth
5xp1               21      ?          22      3
9sym                7      3          60      5
9symml              7      2          55      5
C499              199      8          68      4
C880              259      ?         124      8
alu2              122      ?         155      9
alu4              155     11         253      9
apex6             274      5         238      5
apex7              95      4          79      4
count              81      4          31      5
des              1397     11        1310      5
duke2             164      ?         174      4
misex1             17      2          16      2
rd84               13      ?          46      4
rot               322      5         234      7
vg2                39      4          29      3
z4ml               10      4           5      2
Total            3182     90        2899     84
Comparison      +9.8%  +7.1%           1      1
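The Comparison rows of Tables I and II are consistent with a simple ratio of column totals against the FlowMap totals. The following is a minimal sketch that recomputes them, written in Python for brevity (the authors' implementation is in C and is not reproduced here). The helper names `percent_more` and `comparison_row` are hypothetical, and the MIS-pga-delay depth total is taken as 90, the value that agrees with the reported +7.1%.

```python
# Recompute the "Comparison" rows of Tables I and II from the "Total" rows.
# Assumption: each percentage is (other total / FlowMap total - 1) * 100,
# rounded to one decimal place. Helper names are hypothetical.

def percent_more(other_total: int, flowmap_total: int) -> float:
    """Percentage by which other_total exceeds flowmap_total."""
    return round((other_total / flowmap_total - 1.0) * 100.0, 1)

# (total number of 5-LUT's, total depth) from the Total rows.
TABLE_I = {"Chortle-d": (4906, 87), "DAG-Map": (3543, 85), "FlowMap": (3261, 83)}
TABLE_II = {"MIS-pga-delay": (3182, 90), "FlowMap": (2899, 84)}

def comparison_row(totals: dict) -> dict:
    """Map each non-FlowMap algorithm to (LUT % more, depth % more)."""
    base_luts, base_depth = totals["FlowMap"]
    return {
        name: (percent_more(luts, base_luts), percent_more(depth, base_depth))
        for name, (luts, depth) in totals.items()
        if name != "FlowMap"
    }

if __name__ == "__main__":
    for label, totals in (("Table I", TABLE_I), ("Table II", TABLE_II)):
        for name, (lut_pct, depth_pct) in comparison_row(totals).items():
            print(f"{label}: {name}: {lut_pct:+.1f}% LUT's, "
                  f"{depth_pct:+.1f}% depth relative to FlowMap")
```

Under this assumption the recomputed values agree with the percentages printed in both tables, which suggests the comparison is taken over totals rather than averaged per circuit.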
We also tested the above three algorithms on the input networks used by DAG-Map in [3]. The results showed that compared to FlowMap, DAG-Map used 5.6% more 5-LUT’s and had 1.2% larger network depth, while Chortle-d used 52.2% more 5-LUT’s and had 10.7% larger network depth. FlowMap produced consistently better results than the other two algorithms.

the sink $u$. Denote the reduced network as $M'_u$. Moreover, let $M'_u(j)$ denote the network obtained from $M'_u$ by collapsing $w_j$ ($1 \le j \le k_{i-1}$) into the sink $u$. We compute the maximum volume min-cut $(Y_j, \overline{Y}_j)$ in $M'_u(j)$ for every $j$, $1 \le j \le k_{i-1}$. Let $k_i$ be the minimum node cut-size of these cuts. Moreover, among those cuts that have node cut-size $k_i$, let $(Y, \overline{Y})$ be the one of maximum volume. Let $X_i = Y$ and $\overline{X}_i = V(M_u) - X_i$; we accept $(X_i, \overline{X}_i)$ as the resulting cut of the $i$th iteration.

It can be shown [4] that the cut $(X_i, \overline{X}_i)$ has the following properties:
(1) $vol(X_i, \overline{X}_i) > vol(X_{i-1}, \overline{X}_{i-1})$;
(2) $n(X_i, \overline{X}_i) > n(X_{i-1}, \overline{X}_{i-1})$; and
(3) $r(X_i, \overline{X}_i) \le r(X, \overline{X})$ for any cut $(X, \overline{X})$ such that $X \subset X_{i-1}$.

Therefore, the cut computed at each step is locally optimal. This iterative procedure ends when at some step $l$, $(X_l, \overline{X}_l)$ is not K-feasible. The last K-feasible cut, $(X_{l-1}, \overline{X}_{l-1})$, computed in the sequence, is used in the flow-pack algorithm such that $P_u = X_{l-1}$. Since the node cut-size is monotonically increasing, the number of iterations satisfies $l \le K$. Moreover, each iteration consists of at most $K$ maximum volume min-cut computations, each of time complexity $O(Km)$ (Theorem 4). Therefore, the time complexity for carrying out the flow-pack algorithm at each node is bounded by $O(K^3 m)$, where $m$ is the number of edges in the K-LUT network.

The flow-pack algorithm is implemented as a post-processing step of FlowMap. During the post-processing phase, we first carry out the gate-decomposition operation. (In fact, we carry out a maximum set of gate-decomposition operations simultaneously by computing a maximum cardinality matching among all pairs of gates which are eligible for gate-decomposition. Details of the matching based gate-decomposition algorithm can be found in [3].) Then, we apply the flow-pack operation to each K-LUT $u$ in the mapping solution so that $u$ is collapsed with a maximal subset of its predecessors into a single K-LUT.

The advantage of the flow-pack operation is clear: the flow-pack operation takes the information about the entire sub-network $M_u$ into consideration, while the predecessor packing examines only the nodes adjacent to $u$ locally. Therefore, in general flow-pack leads to more substantial reduction of the number of K-LUT’s. Our experimental results show that the flow-pack operation alone reduced the number of K-LUT’s by 13.5% on average.

V. EXPERIMENTAL RESULTS

We have implemented the FlowMap algorithm and its preprocessing and post-processing steps using the C language on Sun SPARC workstations. We used input/output routines and general utility functions provided by MIS [2] in our implementation. Given a general Boolean network as input, we first decompose it into a 2-input network of simple gates as described in [3]. We then apply the FlowMap algorithm to obtain a minimum depth K-LUT mapping solution. Next, we perform a matching-based gate-decomposition procedure on the K-LUT network, followed by the flow-pack operation to reduce the number of K-LUT’s in the mapping solution. We chose the size of the K-LUT to be $K = 5$, reflecting, e.g., the XC3000 FPGA family produced by Xilinx [28]. We tested FlowMap on a number of MCNC benchmark examples and the results were compared with those produced by Chortle-d [13], MIS-pga-delay [21], and DAG-Map [3]. The results are shown in Tables I and II.

In Table I, we used the initial networks provided by Robert Francis, which were used by Chortle-d to obtain the results
reported in [13], for all the three algorithms. These initial networks were obtained by a sequence of technology independent area and depth optimization steps using MIS. (Since these networks are already 2-input networks, we did not apply our preprocessing algorithm for FlowMap.) Overall, the solutions of Chortle-d used 50.4% more 5-LUT’s and had 4.8% larger network depth; the solutions of DAG-Map used 8.6% more 5-LUT’s and had 2.4% larger network depth. Note that FlowMap always results in the mapping solution of the smallest depth. Moreover, in terms of the number of 5-LUT’s used in the mapping solutions, FlowMap is consistently better than Chortle-d for all examples, and is better than or as good as DAG-Map in most cases.

In Table II, we cited the results of MIS-pga-delay from [21] since we were unable to run the program directly. The FlowMap results were obtained by first synthesizing the original benchmarks using the MIS optimization script used by Chortle-crf [14] and DAG-Map [3] for technology-independent optimization, then applying the FlowMap algorithm for technology mapping. Since MIS-pga-delay combines logic synthesis and technology mapping, in several cases it produced mapping solutions of smaller depth than those of FlowMap. However, overall MIS-pga-delay still used 9.8% more 5-LUT’s and had 7.1% larger depth.

We have also evaluated the impact of the choices of multi-input gate decomposition methods on the mapping results. We used the DMIG algorithm [3] to decompose the initial networks into two-, three-, four-, or five-input networks and applied FlowMap on the resulting networks. The initial networks for these decomposition algorithms are the same as those used to produce the FlowMap results in Table II. We summarize the results in Table III. It can be seen that two-input decomposition gives the best depth results. On the other hand, multi-input decompositions use slightly fewer K-LUT’s in some cases. It was observed during the experiments that this reduction is achieved mainly by the post-processing operations. Intuitively, when the gates have a larger number of fanins, the mapping phase may result in many unsaturated K-LUT’s, which gives more flexibility to the post-processing operations for further reduction of the number of K-LUT’s. However, this reduction is usually not worthwhile considering the substantial increase in the depth of the mapping solutions.

TABLE III
IMPACT OF DECOMPOSITION METHODS ON MAPPING RESULTS

Finally, we have tested the effectiveness of the post-processing phase for area minimization. The results are shown in Table IV. The initial networks used for this experiment are the same as those used in Table II. The post-processing phase reduced the number of K-LUT’s in the mapping solution by 15.1%, and the flow-pack operation alone reduced the number of K-LUT’s (after gate-decomposition operation) by 13.5%.

TABLE IV
EFFECTIVENESS OF THE FLOWMAP POST-PROCESSING PHASE FOR AREA MINIMIZATION

                               Number of 5-LUT’s
                      No Post-     Gate-Decomp.   Gate-Decomp. &
Circuit      Depth    processing   Only           Flow-Pack
5xp1           3          24           24              22
9sym           5          76           71              60
9symml         5          68           64              55
C499           4          80           76              68
C880           8         133          133             124
alu2           9         167          166             155
alu4           9         279          277             253
apex6          5         276          273             238
apex7          4          88           88              79
count          5          43           43              31
des            5        1539         1521            1310
duke2          4         196          193             174
misex1         2          20           18              16
rd84           4          52           50              46
rot            7         258          258             234
vg2            3          32           31              29
z4ml           2           5            5               5
Total         84        3336         3291            2899
Comparison                 1        -1.4%          -15.1%

The experiments were carried out on a Sun SPARC IPC workstation (14.8 MIPS). For each benchmark example, our
system took less than one minute of CPU time (in most cases a few seconds). Therefore, it is much faster than the Boolean optimization based algorithms.

VI. CONCLUSION AND FUTURE EXTENSIONS

In this paper, we have presented a technology mapping algorithm named FlowMap for depth minimization in LUT-based FPGA designs, which is optimal for any K-bounded Boolean network. It is based on efficient computation of minimum height K-feasible cuts in a network. A number of area optimization techniques also allow FlowMap to reduce the number of K-LUT’s significantly. Compared to the existing LUT-based FPGA technology mapping algorithms for delay optimization, FlowMap reduces the depth of the LUT network by up to 7% and reduces the number of LUT’s by up to 50%. FlowMap takes less than one minute of CPU time for each of the benchmarks in our test suite.

One extension is to use a more general delay model other than the unit delay model. For example, Chan, Schlag, and Kong [25] used the nominal delay model in FPGA designs where the interconnection delay of a signal net is estimated by the number of fanouts of the net. Their results showed that the nominal delay model estimates the interconnection delay quite well. Based on Theorem 2, we have generalized the FlowMap algorithm to perform delay-optimal mapping under arbitrary net-delay models, including the nominal delay model [6].

Another extension is to combine area and depth optimization in the mapping procedure. Note that the depth of every node is minimum in a FlowMap mapping solution, while in fact only the depths of the nodes on the critical paths need to be minimized to guarantee depth-optimal mapping. The slacks of the non-critical nodes can be utilized for area minimization without affecting the depth optimality. This method can be further extended to the general problem of area minimization under given depth constraint. Based on a set of depth relaxation operations defined for non-critical nodes, we have developed an algorithm that can produce a spectrum of area-optimized mapping solutions for different depth constraints, yielding smooth area and depth trade-off in LUT-based FPGA designs [5].

The area-optimal mapping problem for LUT-based FPGA designs is still open. Based on the concept of the maximum fanout-free cones, introduced in [5], we have developed a polynomial time algorithm for area-optimal K-LUT mapping without node duplication for any fixed K [5].

ACKNOWLEDGMENT

The authors thank Professor Jonathan Rose, Robert Francis, and Rajeev Murgai for their assistance in the authors’ comparative study.

REFERENCES

[1] N. Bhat and D. Hill, “Routable technology mapping for FPGA’s,” in First Int. ACM/SIGDA Workshop on Field Programmable Gate Arrays, Feb. 1992, pp. 143-148.
[2] R. K. Brayton, R. Rudell, and A. L. Sangiovanni-Vincentelli, “MIS: A multiple-level logic optimization,” IEEE Trans. Computer-Aided Design, pp. 1062-1081, Nov. 1987.
[3] K. C. Chen, J. Cong, Y. Ding, A. B. Kahng, and P. Trajmar, “DAG-Map: Graph-based FPGA technology mapping for delay optimization,” IEEE Design and Test of Computers, pp. 7-20, Sept. 1992.
[4] J. Cong and Y. Ding, “An optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs,” Tech. Rep. CSD-920022, UCLA Computer Science Dept., May 1992.
[5] J. Cong and Y. Ding, “On area/depth trade-off in LUT-based FPGA technology mapping,” in Proc. 30th ACM/IEEE Design Automation Conf., June 1993, pp. 213-218.
[6] J. Cong, Y. Ding, T. Gao, and K. Chen, “An optimal performance-driven technology mapping algorithm for LUT based FPGA’s under arbitrary net-delay models,” in Proc. 1993 Int. Conf. on Computer-Aided Design and Computer Graphics, Aug. 1993, pp. 599-603.
[7] J. Cong, A. Kahng, P. Trajmar, and K. C. Chen, “Graph based FPGA technology mapping for delay optimization,” in ACM Int. Workshop on Field Programmable Gate Arrays, Feb. 1992, pp. 77-82.
[8] T. Cormen, C. Leiserson, and R. Rivest, Algorithms. Cambridge, MA: MIT Press, 1990.
[9] E. Detjens, G. Gannot, R. Rudell, A. Sangiovanni-Vincentelli, and A. Wang, “Technology mapping in MIS,” in Proc. IEEE Int. Conf. Computer-Aided Design, 1987, pp. 116-119.
[10] L. R. Ford and D. R. Fulkerson, Flows in Networks. Princeton, NJ: Princeton Univ. Press, 1962.
[11] R. J. Francis, J. Rose, and K. Chung, “Chortle: A technology mapping program for lookup table-based field programmable gate arrays,” in Proc. 27th ACM/IEEE Design Automation Conf., 1990, pp. 613-619.
[12] R. J. Francis, J. Rose, and Z. Vranesic, “Technology mapping for delay optimization of lookup table-based FPGA’s,” in MCNC Logic Synthesis Workshop, 1991.
[13] R. J. Francis, J. Rose, and Z. Vranesic, “Technology mapping of lookup table-based FPGA’s for performance,” in Proc. IEEE Int. Conf. Computer-Aided Design, Nov. 1991, pp. 568-571.
[14] R. J. Francis, J. Rose, and Z. Vranesic, “Chortle-crf: Fast technology mapping for lookup table-based FPGA’s,” in Proc. 28th ACM/IEEE Design Automation Conf., 1991, pp. 613-619.
[15] D. Hill, “A CAD system for the design of field programmable gate arrays,” in Proc. ACM/IEEE Design Automation Conf., 1991, pp. 187-192.
[16] D. A. Huffman, “A method for the construction of minimum redundancy codes,” in Proc. IRE 40, 1952, pp. 1098-1101.
[17] K. Karplus, “Xmap: A technology mapper for table-lookup field-programmable gate arrays,” in Proc. 28th ACM/IEEE Design Automation Conf., 1991, pp. 240-243.
[18] K. Keutzer, “DAGON: Technology binding and local optimization by DAG matching,” in Proc. 24th ACM/IEEE Design Automation Conf., 1987, pp. 341-347.
[19] E. L. Lawler, K. N. Levitt, and J. Turner, “Module clustering to minimize delay in digital networks,” IEEE Trans. Computers, vol. C-18, pp. 47-57, Jan. 1969.
[20] R. Murgai, et al., “Logic synthesis algorithms for programmable gate arrays,” in Proc. 27th ACM/IEEE Design Automation Conf., 1990, pp. 620-625.
[21] R. Murgai, N. Shenoy, R. K. Brayton, and A. Sangiovanni-Vincentelli, “Performance directed synthesis for table look up programmable gate arrays,” in Proc. IEEE Int. Conf. Computer-Aided Design, Nov. 1991, pp. 572-575.
[22] R. Murgai, N. Shenoy, R. K. Brayton, and A. Sangiovanni-Vincentelli, “Improved logic synthesis algorithms for table look up architectures,” in Proc. IEEE Int. Conf. Computer-Aided Design, Nov. 1991, pp. 564-567.
[23] J. P. Roth and R. M. Karp, “Minimization over Boolean graphs,” IBM J. Res. Devel., pp. 227-238, Apr. 1962.
[24] P. Sawkar and D. Thomas, “Technology mapping for table-look-up based field programmable gate arrays,” in ACM/SIGDA Workshop on Field Programmable Gate Arrays, Feb. 1992, pp. 83-88.
[25] M. Schlag, P. Chan, and J. Kong, “Empirical evaluation of multilevel logic minimization tools for a field programmable gate array technology,” in Proc. 1st Int. Workshop on Field Programmable Logic and Applications, Sept. 1991.
[26] M. Schlag, J. Kong, and P. K. Chan, “Routability-driven technology mapping for lookup table-based FPGA’s,” in Proc. 1992 IEEE Int. Conf. Computer Design, Oct. 1992.
[27] N. S. Woo, “A heuristic method for FPGA technology mapping based on the edge visibility,” in Proc. 28th ACM/IEEE Design Automation Conf., 1991, pp. 248-251.
[28] Xilinx, The Programmable Gate Array Data Book. San Jose, CA: Xilinx, 1992.
Jason (Jingsheng) Cong (S’88-M’90) received the B.S. degree in computer science from the Peking University in 1985. He received the M.S. degree and Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign in 1987 and 1990, respectively.
Currently, he is an assistant professor in the Computer Science Department of University of California, Los Angeles. From 1986 to 1990, he was a research assistant in the Computer Science Department of the University of Illinois. He worked at the Xerox Palo Alto Research Center in the summer of 1987. He worked at the National Semiconductor Corporation in the summer of 1988. His research interests include computer-aided design of VLSI circuits, fault-tolerant design of VLSI systems, and design and analysis of efficient combinatorial and geometric algorithms. He has published over fifty research papers in these fields.
Dr. Cong received the Best Graduate Award from the Peking University in 1985. He was awarded a DEC Computer Science Fellowship in 1988. He received the Ross J. Martin Award for Excellence in Research from the University of Illinois at Urbana-Champaign in 1989. He received the National Science Foundation Research Initiation Award in 1991, and the National Science Foundation Young Investigator Award in 1993. He has served on the program committees of several VLSI CAD conferences. He was the chairman of the 4th ACM/SIGDA Physical Design Workshop.

Yuzheng Ding received the B.S. degree in computer science from Peking University, and the M.S. degree in computer science from Tsinghua University, both in Beijing, China. Currently he is a research assistant in the Department of Computer Science of University of California, Los Angeles, where he is pursuing his Ph.D. degree.
His research interests include computer-aided design of VLSI circuits, design and analysis of data structures and algorithms, and database systems.
Mr. Ding is a member of ACM.