The author has granted a non-exclusive licence allowing the National Library of Canada to
reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.
A Synthesis Oriented Omniscient
Manual Editor for FPGA Circuit Design
Abstract
Logic circuit designers for Field-Programmable Gate Arrays (FPGAs) put increasing
demands on Computer Aided Design (CAD) tools to provide higher logic circuit speeds than are
possible using a traditional CAD flow. The problem with the traditional CAD flow lies in that the
logic synthesis makes assumptions about how its logic optimizations will affect the speed of the
logic circuit post-routing. These assumptions are not always realized by the place and route tool,
which leads to a degradation in logic circuit speed. If the post-routing effect of a logic optimization
were known, then better logic circuit optimization decisions could be made. This work refers to such
knowledge as omniscience and explores its applications in the domain of manual physical synthesis.
This work develops a manual editor, named Augur, which uses omniscience in the context
of physical synthesis. The user is allowed to select physical synthesis transformations to improve the
speed of the logic circuit. After each modification the user is instantly informed about the effect of
the specified transformation on the speed of the logic circuit. Because of the manual nature of the
editor the size of logic circuits is limited to 1000 logic elements.
The manual editor was tested on a suite of 10 logic circuits, all implemented on a Xilinx
Virtex-E device. The post-routing timing analysis performed with commercial tools shows that the
application of omniscience improves the logic circuit maximum operating frequency by 9.9% on
average, with a low area penalty. In addition, several new logic synthesis transformations are
developed that arise from the architectural properties of the target device.
Acknowledgements
I would like to take this opportunity to express my thanks to my thesis supervisor, Professor
Jonathan Rose, for his guidance, advice and encouragement throughout the course of my research.
I wish to express my sincere gratitude for all his help. Without his knowledge and technical expertise
this work could not have been realized.
I would like to thank the professors at the University of Toronto who have taught me
throughout my undergraduate and graduate studies. I would like to express my gratitude to Professor
Stephen Brown, Professor Zvonko Vranesic and Professor Paul Chow who through their teaching
encouraged me to pursue graduate studies.
I would also like to thank Dr. Kevin Chung at Xilinx for answering my questions with
patience and dedication, and William Chow for providing the software basis for this thesis.
I would like to acknowledge my friends in graduate school: Anish Alex, Navid Azizi, Mehrdad
Eslami, Valavan Manohararajah and Lesley Shannon for their friendship and technical advice.
To my closest friends, Dagmara Biskupska, Mark Bourgeault, Borys Bradel, Jennifer Gee,
Henry Jo, Agnieszka and Cezary Piekacz, Gabriel Quan, and Chris Sun: I can only hope that in the
years to come I can do justice to the friendship you have blessed me with.
I would like to thank my father, Dr. Grzegorz Czajkowski, for everything that he has taught
me, both as a teacher and a father. My mother, Tatiana Czajkowska, has always supported me in all
my endeavors. My brother, Adam Czajkowski, always reminded me that there is more to life than
research.
I would also like to thank my academic thesis supervisor, the University of Toronto, Micronet
R&D and Xilinx Corporation for funding this research.
Table of Contents
1 Introduction
  1.1 Introduction to FPGA Circuit Design Process
  1.3 Research Goals
  1.4 Organization
2 Background
  2.1 Introduction
  2.2 Technology Mapping for FPGAs
    2.2.1 Terminology
    2.2.2 Basic Approach
    2.2.3 Depth Optimal Solution
    2.2.4 Improving Flowmap
    2.2.5 Modifying Initial Representation
    2.2.6 Mapping Logic Functions into Complex Logic Structures
  2.3 Physical Synthesis
    2.3.1 Estimating Net Delay to Improve Logic Synthesis
    2.3.2 Improving Interaction between the Logic Synthesis and the P&R
  2.4 Xilinx Virtex-E device and Xilinx CAD tools
    2.4.1 Xilinx Virtex-E Device Family
    2.4.2 Xilinx FPGA Editor
  2.5 EVE - An Omniscient Placer and Packer
3 Augur Context and Logic Synthesis Transformations
  3.1 Introduction
  3.2 Implementing Logic Transformations with Augur
  3.3 Remapping
    3.3.1 Carry Chain Remapping
    3.3.2 Multiplexor Mapping
  3.4 Duplication
  3.5 Merging
  3.6 Carry Chain Shortening
  3.7 Flip-flop Control Signal Extraction
  3.8 Summary
  5.3 Baseline Comparison
  5.4 Placement and Packing Results
  5.5 Results Including the New Logic Synthesis Transformations
  5.6 Optimization Strategies
    5.6.1 Promoting Nearest-Neighbour Interconnect
    5.6.2 Liberating Free Space for Critical Logic
    5.6.3 Increasing Packing Flexibility of Flip-Flops
    5.6.4 Stopping Criterion
  5.7 Summary
References
Appendix A
List of Figures
Figure 1-1: Traditional CAD Flow for FPGA circuit design
Figure 1-2: The Physical Synthesis approach
Figure 2-1: Sub-optimal solution produced with the basic approach [4]
Figure 2-2: Generic Logic Block Structure for Virtex-E and XC4000 series FPGAs
Figure 2-3: An Island-Style FPGA
Figure 2-4: Simplified view of a Virtex-E slice
Figure 2-5: The Nearest Neighbour Interconnect on the Virtex-E device
Figure 3-1: Logic circuit representation in Augur
Figure 3-2: The SubCircuit View
Figure 3-3: The Placement view used during resynthesis
Figure 3-4: Virtex-E Carry Chain Structure
Figure 3-5: Sample circuit to be mapped into a carry chain configuration
Figure 3-6: AND gate extracted from the forward LUT
Figure 3-7: Final implementation in carry chain configuration
Figure 3-8: Algorithm for mapping an AND or an OR gate into the carry chain
Figure 3-9: The Joint-LUT structure
Figure 3-10: Basic Joint-LUT mapping algorithm
Figure 3-11: The Joint-Slice structure
Figure 3-12: Example of mapping a 2 output logic function into the Joint-LUT structure
Figure 3-13: Mapping a multi-output logic function into the Joint-LUT structure
Figure 3-14: Mapping solution not found by the algorithm
Figure 3-15: Circuit before duplication
Figure 3-16: Circuit after duplication
Figure 3-17: Critical path before the application of carry chain shortening
Figure 3-18: Circuit after carry chain shortening
Figure 3-19: Two registers with incompatible control signals
Figure 3-20: Control signal functionality implemented in LUTs
Figure 4-1: The Placer and Packer View
Figure 4-2: View Controls
Figure 4-3: Delay profile showing nine delay bins
Figure 4-4: The Options window
Figure 4-5: Critical paths with delay greater than specified budget
Figure 4-6: Information box
Figure 4-7: The SubCircuit View
Figure 4-8: The SubCircuit view after a logic transformation
Figure 4-9: SubCircuit Placer view
Figure 4-10: The Placer and Packer view after accepting the logic synthesis transformation
Figure 4-11: Augur software Overview
Figure 4-12: Action list data structure
Figure 5-1: Procedure to obtain baseline performance for a benchmark circuit
Figure 5-2: Delay profile for the miim circuit
Figure 5-3: The 15 slowest paths in the miim circuit
Figure A-1: Detailed Xilinx Virtex-E Slice Schematic
List of Tables
Table 5-1: Statistics for benchmark logic circuits
Table 5-2: Results using only placement and packing modifications
Table 5-3: Speed improvement results using the new logic synthesis transformations
Table 5-4: Area change due to the new logic synthesis transformations
1 Introduction
[Figure 1-1: Traditional CAD Flow for FPGA circuit design]
[Figure 1-2: The Physical Synthesis approach]
the early stages of the design process by providing information from the later stages, as shown in
Figure 1-2. This is achieved through iteration of the traditional CAD flow, which provides the cost
functions in the early stages with the information about the overall effect of the optimizations on the
final circuit speed. With this information the early stages are able to better predict the effect of
optimizations they perform.
The goal of this research is to develop a manual editor, which provides the user with the
selection of logic synthesis optimizations to improve the speed of a digital circuit. The manual editor
adopts a physical synthesis approach, but instead of applying it to the entire circuit, or on a
macroscopic scale, the editor focuses on implementing small and incremental, or microscopic,
design modifications. After each modification the editor performs routing and timing analysis and
provides the user with almost instant feedback about the new circuit speed.
The complete placement, routing and timing analysis performed after each microscopic
modification provides the knowledge of the effect of a small change on the post-routed circuit
performance. This level of knowledge will be referred to as omniscience. The concept of
omniscience is used as a guide to evaluate incremental logic synthesis transformations.
The application of omniscience within the manual editor creates a manual CAD tool capable
of providing the user with all necessary information to make superior optimization decisions. The
manual editor is named Augur, after a religious official in ancient Rome, who interpreted “omens”
to guide the public, similarly to the way that the manual editor guides the user to improve the speed
of a digital circuit.
1.4 Organization
This dissertation consists of six chapters. Chapter two reviews the necessary background
information and terminology used throughout the document. After a brief summary of the user’s
view of the manual editor, the description of logic synthesis transformations provided by Augur is
given in chapter three.
In chapter four, Augur’s user interface is presented, along with all necessary information to
use Augur successfully. Augur’s user manual is followed by the description of the software design
of this work.
Chapter five presents results from the use of Augur in improving the speed of FPGA logic
circuits. The observations made by the author during this process are utilized to suggest optimization
strategies, which might be automated in CAD tools. The conclusion of this thesis and the avenues
for future work are presented in chapter six.
2 Background
2.1 Introduction
This chapter introduces concepts and algorithms from the field of logic synthesis required
as background to the work presented in this dissertation. Section 2.2 describes prior works related
to technology mapping of logic equations onto FPGAs. Section 2.3 describes physical synthesis
algorithms. The material covered in these Sections is relevant to the discussion about logic synthesis
transformations in chapter 3. We then describe the Xilinx Virtex-E FPGA, which is the target device
of this work. A short overview of the work of Chow and Rose [6], which is the basis for this
research, is given in Section 2.5.
2.2.1 Terminology
A logic circuit can be represented as a graph that consists of nodes and directed edges. The
nodes in a graph that represents a logic circuit correspond to logic gates, primary inputs and primary
outputs. A directed edge (a, b) in a graph is present between a pair of nodes a and b if the output of
node a is an input to node b. A primary input of a graph represents a primary input to the logic
circuit or an output of a flip-flop and has no incoming edges. Similarly, a primary output of a graph
corresponds to a primary output of the logic circuit or an input to a flip-flop and has no outgoing
edges [4].
For the remainder of this Section a graph representing a logic circuit will be considered to
be free of paths that start and end at the same node, also referred to as a graph without cycles. A graph
with directed edges and no cycles is called a Directed Acyclic Graph (DAG) [27].
In a DAG a node is said to be K-feasible [22] if the number of inputs to the node is less than
or equal to K. If all nodes in a graph are K-feasible then the graph is said to be a K-feasible graph.
In this Section we assume that a graph that is to be the input to the technology mapping algorithm
is K-feasible, so that any single node, or a group of nodes that implement a function of at most K
inputs, can always be implemented in a lookup table that has K inputs (K-LUT). The depth of a K-
LUT in a logic circuit is determined by the maximum number of K-LUTs on any path from any
primary input to that K-LUT. Therefore, the primary input has a depth of 0, while a node whose
inputs are only the primary inputs has a depth of 1.
The edges between the nodes in a DAG are directed edges and represent a connection
between a pair of logic gates. The connection has a source, which is the node where the directed edge
originates, and a destination, which is where the connection terminates. A set of connections that
have the same source are also referred to as a net.
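The terminology above can be illustrated with a minimal sketch. It assumes a circuit is stored as a
dictionary mapping each node to the list of nodes that drive it, and that every non-input node is
implemented as one K-LUT. The function names and the example circuit are hypothetical and do not
come from the cited works.

```python
from collections import defaultdict

def is_k_feasible(fanin, k):
    """A graph is K-feasible if every node has at most k inputs."""
    return all(len(inputs) <= k for inputs in fanin.values())

def lut_depths(fanin):
    """Depth of each node: 0 for primary inputs, else 1 + the deepest input."""
    depth = {}

    def visit(node):
        if node not in depth:
            inputs = fanin.get(node, [])
            depth[node] = 0 if not inputs else 1 + max(visit(i) for i in inputs)
        return depth[node]

    for node in fanin:
        visit(node)
    return depth

def nets(fanin):
    """Group connections by source: a net is one source and all its destinations."""
    result = defaultdict(list)
    for dest, inputs in fanin.items():
        for src in inputs:
            result[src].append(dest)
    return dict(result)

# Hypothetical example: a, b and c are primary inputs (no incoming edges).
fanin = {
    "a": [], "b": [], "c": [],
    "g1": ["a", "b"],
    "g2": ["g1", "c"],
    "g3": ["g1", "b", "c"],
}
depths = lut_depths(fanin)
circuit_nets = nets(fanin)
```

In this example g1 has depth 1 because all of its inputs are primary inputs, while g2 and g3 have
depth 2. The graph is 3-feasible but not 2-feasible, since g3 has three inputs.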
In the following subsection the above terminology is used to describe several technology
mapping algorithms. The description of the technology mapping algorithms begins with the basic
approach to technology mapping.
[Figure 2-1: Sub-optimal solution produced with the basic approach [4]. Labels: 3-feasible, Not 3-feasible]
the entire circuit only has 3 inputs, therefore only one 3-LUT is required to implement it. An
algorithm that covers this case was introduced by Cong and Ding. Their Flowmap [4] algorithm,
which guarantees a depth optimal mapping in polynomial time, is presented in the following
subsection.
during the mapping stage the logic gates are processed from primary output to primary input to
reduce the LUT depth. To address that problem a mechanism known as the network flow [27] is
used.
Consider a directed graph G, where a node s has only outgoing edges and node t only
incoming edges. The network flow of such a graph determines the number of paths between s and
t. If each edge is allowed to be used only once then the network flow will determine the maximum
number of paths between nodes s and t that have no common edges. Each edge is assigned a
capacity, which represents the maximum flow per unit time that the edge can support. In the process
the network flow locates the place in the network where a minimum number of edges can be
removed to disconnect all paths from s to t.
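The following is a minimal sketch of this idea, assuming every edge has unit capacity so that each
edge is used at most once. It uses breadth-first augmenting paths (the Edmonds-Karp method), which is
one standard way to compute a maximum flow and is not necessarily the method used in [4] or [27].
The function name and the example graph are hypothetical.

```python
from collections import deque

def max_flow_min_cut(edges, s, t):
    """Return (number of edge-disjoint s-t paths, edges of a minimum cut)."""
    cap, adj = {}, {}
    for u, v in edges:
        cap[(u, v)] = cap.get((u, v), 0) + 1   # unit capacity per edge
        cap.setdefault((v, u), 0)              # reverse edge for the residual graph
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    adj.setdefault(s, set())
    adj.setdefault(t, set())

    def bfs():
        """Breadth-first search over edges with remaining capacity."""
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs()
        if t not in parent:
            break
        # Find the bottleneck along the path, then push that much flow.
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[(parent[v], v)])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= bottleneck
            cap[(v, u)] += bottleneck
            v = u
        flow += bottleneck

    # Nodes still reachable from s lie on the source side of the minimum cut.
    reachable = set(parent)
    cut = sorted({(u, v) for u, v in edges if u in reachable and v not in reachable})
    return flow, cut

# Hypothetical example: both paths from s to t share the edge (c, t).
edges = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"), ("c", "t")]
flow, cut = max_flow_min_cut(edges, "s", "t")
```

Here only one edge-disjoint path exists, and removing the single edge (c, t) disconnects s from t,
so the flow value and the size of the minimum cut are both 1.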
The Flowmap algorithm, developed by Cong and Ding [4], uses the network flow to
overcome the problem illustrated in Figure 2-1. Similar to the basic approach, the Flowmap
algorithm processes the logic network in two phases: labeling and mapping. The difference is that
the network flow computation is utilized to determine the LUT depth that should be assigned to each
gate in the logic network.
The input to the Flowmap algorithm is a K-feasible DAG G, which represents the logic
network. Each node in G corresponds to a logic gate or a primary input, while each edge of G is a
connection between the output of one logic gate and the input of another. To determine the label for
each gate the graph G undergoes the following transformation to be suitable for use with the network
flow algorithm:
• a node s driving all primary inputs is created
• a node t is the node to be labeled. All nodes with label max{label(v) : v ∈ inputs(t)} and node
  t are put together to form node t’
• any node or edge of the graph that is not on a directed path from s to t is ignored
• each node v, except s and t’, is split in two. The resulting nodes v1 and v2 are such that:
  • all input edges are assigned to v1
  • all output edges are assigned to v2
  • a directed edge from v1 to v2 is created. This edge is called the bridging edge [4].
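The transformation listed above can be sketched as follows. This is an illustrative reading of the
steps rather than the implementation from [4]: the function name, the dictionary-based graph
representation and the tuple naming of the split nodes are all assumptions made for the example.

```python
def build_split_graph(fanin, label, t):
    """Build the node-split graph for labeling node t.

    fanin maps each node to the nodes driving it; label holds the labels
    already computed for every node in the fan-in cone of t.
    Returns (bridging edges, remaining edges).
    """
    # Keep only nodes that lie on a directed path to t.
    cone, stack = set(), [t]
    while stack:
        n = stack.pop()
        if n not in cone:
            cone.add(n)
            stack.extend(fanin.get(n, []))

    # Collapse t with all cone nodes carrying the maximum input label.
    p = max(label[v] for v in fanin[t])
    collapsed = {n for n in cone if label.get(n) == p} | {t}

    def head(n):   # where edges entering n attach
        return "t'" if n in collapsed else (n, "in")

    def tail(n):   # where edges leaving n attach
        return "t'" if n in collapsed else (n, "out")

    bridging, wires = [], []
    for n in sorted(cone):
        if n not in collapsed:
            bridging.append(((n, "in"), (n, "out")))   # the bridging edge
        if not fanin.get(n):
            wires.append(("s", head(n)))               # s drives every primary input
    for n in sorted(cone):
        for u in fanin.get(n, []):
            a, b = tail(u), head(n)
            if a != b:                                 # edges inside t' disappear
                wires.append((a, b))
    return bridging, wires

# Hypothetical example: g1 has label 1, so it is merged with t into t'.
fanin = {"a": [], "b": [], "c": [], "g1": ["a", "b"], "t": ["g1", "c"]}
label = {"a": 0, "b": 0, "c": 0, "g1": 1}
bridging, wires = build_split_graph(fanin, label, "t")
```

In this example only the three primary inputs keep bridging edges, so any cut through bridging
edges that separates s from t’ has size 3.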
In this new graph, G’, consider the bridging edges as outputs of logic gates. To find a mapping that
covers the node t’ the graph G’ must be divided, or cut, into two parts. One part contains the node
t’ and possibly a few other nodes, while the other part contains the rest of the graph G’. Cutting one
of the bridging edges means that the gate corresponding to the source of this edge will drive a LUT
that implements node t’. Finding a mapping that fits t’ into a K-LUT means separating the graph into
two parts by cutting only the bridging edges. One part of the graph must contain node s and the other
node t’. A successful mapping results in locating a cut of size K or less in the graph G’ by using the
network flow algorithm. If the cut is not found then a new LUT must be created to implement the
Figure 2-2: Generic Logic Block Structure for Virtex-E and XC4000 series FPGAs
be realized in a single K-LUT. Most commercial FPGAs, such as the Xilinx Virtex-E [7] and the
Xilinx XC4000 devices [19], contain more complex logic structures. The generic logic structure for
both of these devices is shown in Figure 2-2, where F, G and H are LUTs.
Boolean matching approaches for logic structures in Figure 2-2 have been presented by Cong
and Hwang [26]. This dissertation focuses on the mapping approach for the Virtex-E type logic
block, in which the LUT H implements a multiplexor controlled by signal x.
To map a logic function f into the Virtex-E type logic structure a logic function
decomposition must be found such that $f(X) = \bar{x} \cdot F + x \cdot G$, where $x \in X$. The logic function f can be
mapped into the Virtex-E type logic structure when Shannon’s expansion [9] $f(X) = x \cdot f_x + \bar{x} \cdot f_{\bar{x}}$ can
be found for some $x \in X$ and each of the cofactors ($f_x$ and $f_{\bar{x}}$) can be implemented in LUT G and F
respectively.
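As a small illustration of this condition, the sketch below searches for a select variable x whose
two cofactors each depend on at most four variables, assuming 4-input LUTs. It checks the functions
exhaustively, so it is only practical for small examples. It is not the Boolean matching algorithm
of [26], and the function names and the example function are hypothetical.

```python
from itertools import product

def cofactor(f, var, value):
    """Return f with variable var fixed to value."""
    return lambda assignment: f({**assignment, var: value})

def support(f, variables):
    """Variables that f actually depends on, found by exhaustive comparison."""
    deps = set()
    for values in product([0, 1], repeat=len(variables)):
        a = dict(zip(variables, values))
        for v in variables:
            if f(a) != f({**a, v: 1 - a[v]}):
                deps.add(v)
    return deps

def mux_select_candidates(f, variables, lut_inputs=4):
    """Variables x for which both cofactors of f fit in a LUT with lut_inputs inputs."""
    found = []
    for x in variables:
        rest = [v for v in variables if v != x]
        f_x = cofactor(f, x, 1)      # cofactor for x = 1, intended for LUT G
        f_xbar = cofactor(f, x, 0)   # cofactor for x = 0, intended for LUT F
        if (len(support(f_x, rest)) <= lut_inputs
                and len(support(f_xbar, rest)) <= lut_inputs):
            found.append(x)
    return found

# Hypothetical 6-input function: e selects between a 4-input XOR and a 4-input OR.
def example(v):
    if v["e"]:
        return v["a"] ^ v["b"] ^ v["c"] ^ v["d"]
    return v["a"] | v["b"] | v["c"] | v["g"]

variables = ["a", "b", "c", "d", "e", "g"]
candidates = mux_select_candidates(example, variables)
```

For this function only e qualifies as the select signal: fixing any other variable leaves at least
one cofactor that still depends on five variables.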
In this work the approach of Cong and Hwang [26] is extended to cover more complex logic
structures. The logic structures and the algorithms for mapping logic functions into them are
presented in Chapter 3.
2.3 Physical Synthesis
The previous Section described technology mapping algorithms that focus on improving the
delay in the logic circuit as measured by the number of logic components on all paths in the logic
circuit. Some of the algorithms went one step further by attempting to model interconnect delay.
However, all of these works maintained a strict separation between the logic synthesis and the
placement and routing (P&R) stages. Thus, a logic synthesis optimization that appeared to be
beneficial from the logic synthesis point of view, may in fact be detrimental to the speed of the logic
circuit, once placement and routing is performed. Improving the interaction between the logic
synthesis and P&R to improve the synthesis of the logic circuit is the domain of physical synthesis.
Physical synthesis algorithms use an iterative approach to converge on a good design
implementation. Each iteration provides additional information about the final implementation of
the logic circuit that affects the choice of the logic synthesis optimizations. In this process it is
important to have a delay model so that the delay between logic components can be computed
accurately. Furthermore, methods of leveraging logic synthesis optimization and improving the
interaction between logic synthesis and placement and routing are needed.
We review works that study the means by which an accurate delay model for the connection
between logic components can be obtained, and then give examples of physical synthesis algorithms.
These algorithms present various methods of improving the interaction between the logic synthesis
and the placement and routing (P&R) stages to improve the speed of the logic circuit.
into account wires of various lengths to better estimate the delay of a logic connection.
A recent work by Lin et al. [2] uses the location of the source and the target LUTs, in which
a pair of gates reside, to estimate the delay between a pair of connected gates. Their mapping
algorithm uses this information to decide how to modify the assignment of logic gates to the LUTs,
such that the performance of the circuit is improved. The algorithm moves a gate from one LUT to
another, sometimes necessitating the creation of an additional LUT due to the LUT’s input size
constraint. Every time a LUT is modified it is assigned a preferred location to guide the subsequent
placement iteration.
The work of Lu et al. [10] suggests taking the type of gate and the size of the gate fanout into
account when estimating the delay of a net. Their algorithm utilizes these models in an attempt to
decrease the estimated critical path delay through logic synthesis and placement perturbations. The
results show a 17% reduction in post-placement logic circuit delay when these models are used to
estimate net delay in comparison to SIS-1.2 [28].
Each of the aforementioned works does have some inaccuracies in the delay model. In the
present work, as discussed in Chapter 3, commercial FPGA devices and tools are used to estimate
the net delay. The Xilinx FPGA Editor has sufficient knowledge of Xilinx Virtex-E devices to
accurately model every delay element in the FPGA.
2.3.2 Improving Interaction between the Logic Synthesis and the P&R
In the physical synthesis CAD flow the logic synthesis stage precedes the placement and
routing (P&R) stage. This is because the P&R stage requires logic components to be created before
they can be placed and routed. However, to perform good logic synthesis optimizations a logic
synthesis tool needs accurate net delay estimates, which are only available after placement and
routing. Similarly, the P&R stage must be aware of the intention of a logic optimization so that it
does not undermine the efforts of the logic synthesizer. Therefore, a close interaction between the
logic synthesis and the placement and routing (P&R) stages is essential for improving the speed of
logic circuits.
There are currently three types of approaches that address this issue. The first approach
applies synthesis and placement in an iterative process. The second method is for the synthesizer to
specify to the placer where to place the synthesized logic components. This enables the placer to
better understand the decisions made by the synthesizer, and possibly accommodate them. The third
option is to permit the placer to evaluate several alternate logic mappings so that their placement can
be considered.
An example of the iterative approach is given by Lin et al. [2]. During each iteration the
mapping algorithm takes some of the gates from one LUT and places them in another, basing its
decisions on net delays between the gates. The new mapping is then placed again, using the last
placement as a guide. The iterative application of synthesis and placement that strives to minimize
the number of logic changes in each consecutive iteration is shown to yield a 12.3% speed
improvement, based on the post-routing delay using VPR [29] on a set of 10 logic circuits.
The work of Singh and Brown [3] proposes that the placer should be provided with an
incentive to situate logic components in a specific location on the device. Their approach starts with
a regular CAD flow to obtain a synthesized and placed logic circuit implementation. Then layout-
driven optimization techniques are used to reduce the delay on the critical paths. Each new logic
element, which is created in the process, is assigned a location that the placer aims to obtain for it
while minimizing the disruption to the entire logic circuit. This approach conveys the context in
which the synthesizer made its decision, thus allowing the placer to respond to it accordingly. The
results, based on a set of 10 logic circuits, presented in [3] show that performing logic re-synthesis
on small subsets of logic and assigning “preferred locations” [3] to newly synthesized components
improves convergence.
The previous two examples maintained the separation between the logic synthesis and the
placement stages. The approach proposed by Lou et al. [1] breaks this boundary by having the
synthesis stage provide several mapping solutions for a subcircuit it considers to be good. The placer
then chooses the mapping solution to improve the speed of the logic circuit, since speed is easier to
estimate during placement and routing stages. This approach was tested on a set of 13 logic circuits,
using a 0.35 µm ASIC library. The results have shown an average of 29% reduction in delay and 5%
increase in area.
The method applied in this thesis borrows from each of the three approaches. The iterative
process is used to let the user improve the circuit in an incremental fashion. The user is provided
with detailed timing analysis and visual cues, such as rubber bands and a visual representation of
circuit components on the device, to suggest where a component should be placed. Finally, the
user can explore more than one logic optimization alternative while performing placement.
[Figure 2-3: An Island-Style FPGA]
[Figure 2-4: Simplified view of a Virtex-E slice]
[Figure 2-5: The Nearest Neighbour Interconnect on the Virtex-E device]
This interconnect has a very low delay and can significantly improve performance of the circuit when
used properly. This resource is limited to two pairs of unidirectional wires between each pair of
CLBs, as shown in Figure 2-5.
A prior work by Chow and Rose [6] introduced a manual editor, which targeted the Virtex-E
architecture. The EVent Horizon Editor (EVE) [6] enabled the user to modify the placement and
packing incrementally. After every modification EVE conducted timing analysis to inform the user
about the new circuit performance, thereby providing omniscience.
EVE relies on Xilinx FPGA Editor to extract the delay information from the design, since
Xilinx FPGA implementation tools encompass more detailed information about Xilinx devices. The
timing information in this environment is therefore accurate, so the performance gains obtained using
the manual editor are realistic. Furthermore, accurate timing information is necessary to provide
omniscience.
To modify a logic circuit with EVE the user must first synthesize, place and route a logic
circuit using commercial tools. The implemented logic circuit is then read by EVE and represented
visually as a set of connected logic components on an FPGA grid. The FPGA grid is represented as
a two dimensional array of rectangular cells, where each cell of the grid corresponds to a single
Virtex-E CLB.
To improve the logic circuit the user selects logic components and moves them into a free
location on the FPGA. The user can select a single component, or a group of components, to move
from one place to another. After each placement and packing modification EVE makes the specified
modification in the FPGA Editor, which holds the exact copy of the logic circuit. After the FPGA
Editor implements the requested changes it performs timing analysis. The results of timing analysis
are transmitted to EVE so that the user is informed about the new speed of the logic circuit.
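The move-then-evaluate cycle described above can be sketched with a toy model. EVE obtains its
timing from the Xilinx FPGA Editor; the delay model below (one unit per component plus Manhattan
wire length) is only a stand-in assumption so that the example is self-contained, and all names
are hypothetical.

```python
def critical_delay(paths, placement):
    """Toy timing analysis: one unit per component plus Manhattan wire length."""
    def path_delay(path):
        wire = sum(
            abs(placement[a][0] - placement[b][0]) + abs(placement[a][1] - placement[b][1])
            for a, b in zip(path, path[1:])
        )
        return len(path) + wire
    return max(path_delay(p) for p in paths)

def try_move(paths, placement, component, new_location):
    """Move one component into a free cell and report the resulting delay.

    Returns (new placement, new critical delay), or None if the cell is occupied.
    """
    if new_location in placement.values():
        return None
    candidate = {**placement, component: new_location}
    return candidate, critical_delay(paths, candidate)

# Hypothetical circuit: one path u -> v -> w, with v placed far from its neighbours.
placement = {"u": (0, 0), "v": (5, 0), "w": (1, 0)}
paths = [["u", "v", "w"]]
before = critical_delay(paths, placement)
moved = try_move(paths, placement, "v", (0, 1))
blocked = try_move(paths, placement, "v", (1, 0))
```

In this toy example the delay falls from 12 to 6 units once v is moved next to u and w, while the
attempt to move v onto the cell already occupied by w is rejected.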
The EVE software was applied to a set of 8 benchmark logic circuits to modify their
placement and packing. On this set of benchmarks EVE achieved a 12.7% improvement in logic
circuit speed.
In addition to the placement and packing modifications, EVE also assists the user with
pipelining of the logic circuit. EVE allows the user to specify where a flip-flop should be inserted
into the logic circuit and then determines if the specified location is valid. If the location for the new
flip-flop is valid, EVE inserts a flip-flop at the user specified location and at any other location that
requires a flip-flop to be inserted to preserve logic circuit functionality. The pipelining feature of
EVE was tested [25] on two logic circuits, improving their operating frequency by 3.53% and
42.24%.
2.6 Summary
This chapter presented prior work concerning physical synthesis. There are two key points
suggested by these works. The first point is that a close interaction between the logic synthesis and
placement and routing stages is necessary to improve the speed of the logic circuit. The second point
is that accurate timing information is crucial to make better logic synthesis optimizations.
In this thesis omniscience is the means by which a close interaction between logic synthesis
and placement and routing is achieved. The timing information that is used to provide omniscience
comes from commercial tools, which have accurate timing information for the device they target.
The work of Chow and Rose [6] is the basis for the manual editor developed in this thesis. The
manual editor is enhanced to provide the user with the ability to perform various logic synthesis
transformations in the context of omniscience.
3 Augur Context and Logic Synthesis Transformations
3.1 Introduction
The goal of this research is to improve the speed of the implementation of logic circuits on
FPGAs by providing the user with the most correct, and complete, feedback possible on the
consequences of manually specified logic transformations. If logic synthesis transformations were
made with the full knowledge of their final effect after routing then the final result would likely be
better.
The methodology adopted in this work is to first implement a logic circuit using commercial
logic synthesis, placement and routing tools and then let the user improve the speed of the logic
circuit through manually specified logic transformations. To make informed decisions the user is
provided with information about how each microscopic logic circuit modification affects the speed
of the circuit. This information includes the maximum circuit operating frequency, delay distribution
of all paths in the circuit, the function implemented by each logic component and the placement of
logic components. We term the ability of a CAD tool to provide the above information after every
microscopic modification as omniscience.
In this research we develop a manual editor, called Augur, which uses logic synthesis
optimizations in the context of omniscience to improve the speed of a digital circuit. The focus of
Augur is on microscopic logic transformations, such that the user can observe the effect of a single
logic transformation on the speed of the circuit. After each manual modification resulting in the
change in the netlist and placement of the logic circuit, the editor performs routing and timing
analysis and provides the user with instant feedback about the new circuit speed, our so-called
omniscience.
in yellow with a plus sign (+) icon in the middle. The flip-flop is green with a flip-flop icon in the
middle. The user can select one or more of these icons to make a logic circuit modification. The two
types of logic circuit modifications available are placement/packing and logic synthesis
modifications.
To make a placement modification the user selects a set of components that can then be
moved, retaining their relative positions, to a new location. Augur does not allow components to be
moved into illegal positions, such as moving a LUT into a carry chain slot. A successful logic circuit
modification causes Augur to perform routing and timing analysis and report the new maximum
operating frequency of the logic circuit to the user.
To perform a logic synthesis transformation the user selects a set of logic components and
attempts to apply one of a set of available logic transformations by pressing the Transform button.
This causes Augur to present the user with an alternate representation of just the selected logic, as
shown in Figure 3-2.
Figure 3-2: The SubCircuit view
This view is termed the SubCircuit View, and it shows the logic subcircuit selected by the
user in isolation from the rest of the design. The inputs to the logic subcircuit are shown on the left
side of the screen, the subcircuit itself in the middle, while the outputs of the subcircuit are located
on the right side of the screen. The subcircuit itself is organized on the screen such that it presents
a topological view of the subcircuit, looking from left to right. The user can select the components
in this view and perform logic synthesis transformations on them, causing the creation of a different
netlist of components (still with LUT, carry and flip-flop cells).
At any point during the resynthesis process, the user may attempt to place the currently
synthesized subcircuit. This is an important step because once the netlist has been changed by a logic
synthesis transformation the old placement is no longer valid. By selecting a component in the view
in Figure 3-2 and pressing the Tab key, the selected component is shown on the FPGA grid, allowing
the user to select a suitable placement for it. The placement view is shown in Figure 3-3. Once the
user places all logic components of the subcircuit, the user can press the Accept button for the
changes to take effect. This will cause Augur to perform the routing and timing analysis and provide
the user with the new maximum operating frequency.
Augur provides a total of five types of logic transformations: remapping, logic duplication,
merging, carry chain shortening, and register control signal extraction, as described in the following
sections.
3.3 Remapping
The remapping operation attempts to transform a selected set of logic components into a
functionally equivalent set that fits into a slice or a CLB of the Virtex-E FPGA. The following
subsections describe two remapping algorithms: Carry Chain Mapping and Multiplexer mapping.
Figure 3-4: Virtex-E Carry Chain Structure
implemented in the carry logic. Since the carry multiplexor is very fast, the transformation leads to
an overall reduction in delay if the original pair of LUTs is on the critical path.
Consider the example pair of LUTs (A and B) illustrated in Figure 3-5, which shows the logic
function of each LUT as a schematic inside each LUT box. The highlighted AND gate at the right
of LUT A can be implemented in the Carry Chain of Figure 3-4 because one of its inputs comes
directly from LUT B. Figure 3-6 illustrates the isolation of this AND gate and Figure 3-7 shows the
final implementation of the function using the carry chain to implement an AND gate.
Figure 3-6: AND gate extracted from the forward LUT
Figure 3-7: Final implementation in carry chain configuration
To determine if a pair of serially connected LUTs can be mapped into a slice, the AND gate
(or OR gate) must be found, as shown in Figure 3-6. The search process is performed using
Shannon’s expansion of LUT A’s function with respect to LUT B. Let A be the function of LUT
A, and B be the signal generated by LUT B. The inputs to A are $x_1$, $x_2$, $x_3$ and $B$. From Shannon’s
theorem [9] the following equation is obtained:
$A = \overline{B} \cdot A(x_1, x_2, x_3, 0) + B \cdot A(x_1, x_2, x_3, 1)$ (1)
If the function $A(x_1, x_2, x_3, 1)$ evaluates to a constant 0, then equation (1) reduces to:
$A = A(x_1, x_2, x_3, 0) \cdot \overline{B}$ (2)
which produces the desired AND gate that can be mapped into the carry chain of the slice.
An OR gate can be detected by following a similar procedure. When the function
$A(x_1, x_2, x_3, 1)$ in equation (1) evaluates to a constant 1 then the following simplified equation is obtained:
$A = A(x_1, x_2, x_3, 0) + B$ (3)
which produces the OR gate to be implemented in the carry chain of the slice.
The algorithm in Figure 3-8 determines if a pair of serially connected LUTs can be
transformed in this manner. Its input is a user-selected pair of serially connected LUTs, and the
output is either the mapping of those LUTs into a slice, or a declaration that the mapping is not
possible.
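As a minimal sketch of this check (it is not the algorithm of Figure 3-8 itself), the C fragment below inspects the cofactor of LUT A taken at B = 1. The 16-bit truth-table encoding, the type names and the function name `match_carry_gate` are assumptions made for illustration only. When that cofactor is constant 0, Shannon's expansion leaves A(x1, x2, x3, 0) gated by the complement of B, which is consistent with the remark that LUT B may need to be inverted.

```c
#include <stdint.h>

/* Gate that the carry multiplexer could implement (hypothetical names). */
typedef enum { CARRY_NONE, CARRY_AND, CARRY_OR } carry_gate;

typedef struct {
    carry_gate gate;     /* gate found by the cofactor test            */
    uint8_t    residual; /* truth table of A(x1, x2, x3, 0), kept in a LUT */
} carry_match;

/* Assumed encoding: bit i of lut_a is the output of LUT A for the input
 * assignment i = (B << 3) | (x3 << 2) | (x2 << 1) | x1. */
carry_match match_carry_gate(uint16_t lut_a)
{
    carry_match m;
    uint8_t cof0 = (uint8_t)(lut_a & 0xFFu); /* A(x1, x2, x3, 0) */
    uint8_t cof1 = (uint8_t)(lut_a >> 8);    /* A(x1, x2, x3, 1) */

    m.residual = cof0;
    if (cof1 == 0x00u)
        m.gate = CARRY_AND;  /* A = A(.., 0) AND (NOT B) */
    else if (cof1 == 0xFFu)
        m.gate = CARRY_OR;   /* A = A(.., 0) OR B */
    else
        m.gate = CARRY_NONE; /* no single gate can be split off */
    return m;
}
```

Under this encoding the function (x1 AND x2) OR B has the truth table 0xFF88, so the sketch reports an OR gate with the residual table 0x88 for the remaining LUT.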
Input: A pair of series-connected LUTs A and B, with LUT B driving LUT A.
Output: On success, a mapped slice with the same functionality of A & B, but
mapped into carry logic
In addition, it is possible to map a pair of serially connected LUTs in which the LUT B has
fanout greater than 1. Notice that the carry chain configuration in Figure 3-4 allows the bottom LUT
to produce an output signal through pin X. To properly generate this secondary output it may be
necessary to add a single LUT that inverts the output of pin X, since the function of LUT B may need
to be inverted to implement an AND gate (or an OR gate) in the carry chain.
Figure 3-9: The Joint-LUT structure
mapping logic function into a structure with a single multiplexor that joins the output of two LUTs
has been presented before by Cong and Hwang [26]. This section presents the approach which allows
a mapping into such a structure and extends it to a more complex logic structure that contains three
multiplexors.
The first structure that makes use of the fast multiplexors is the Joint-LUT structure, shown
in Figure 3-9. This structure can implement a logic function containing up to 9 inputs. To implement
a 9-input logic function, the function has to be decomposed to determine if the mapping into the
Joint-LUT is possible. The decomposition process determines if it is possible to break up the logic
function into a multiplexer driven by two 4-input LUTs. We use Shannon’s decomposition theorem
[9] to perform the decomposition of the logic function. The basic idea is to perform Shannon’s
expansion with respect to every variable of the function [26]. A valid mapping of the 9-input
function into the Joint-LUT structure is found when both cofactors of the function in Shannon’s
expansion are functions with at most 4 inputs. The algorithm is summarized in Figure 3-10.
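The search described above can be sketched in C as follows. This is an illustrative reading of the text, not the algorithm of Figure 3-10: the callback representation of a logic function, all identifiers, and the two example functions are assumptions. For every candidate select variable the sketch counts how many variables each cofactor still depends on and accepts the first variable for which both counts are at most 4.

```c
#include <stdint.h>

/* Hypothetical representation: a logic function is a callback that maps an
 * input assignment (bit i of `in` is variable x_i) to 0 or 1. */
typedef int (*bool_fn)(uint32_t in);

/* Returns 1 if f, with x_sel fixed to sel_val, still depends on x_v. */
static int cofactor_depends_on(bool_fn f, int n, int sel, int sel_val, int v)
{
    for (uint32_t in = 0; in < (1u << n); in++) {
        /* Visit only assignments with x_sel = sel_val and x_v = 0. */
        if (((in >> sel) & 1u) != (uint32_t)sel_val || ((in >> v) & 1u))
            continue;
        if (f(in) != f(in | (1u << v)))
            return 1;
    }
    return 0;
}

/* Counts the variables, other than x_sel, that the cofactor depends on. */
static int cofactor_support(bool_fn f, int n, int sel, int sel_val)
{
    int count = 0;
    for (int v = 0; v < n; v++)
        if (v != sel && cofactor_depends_on(f, n, sel, sel_val, v))
            count++;
    return count;
}

/* Tries every variable as the multiplexer select. Returns its index when
 * both cofactors fit a 4-input LUT, or -1 when no variable qualifies. */
int find_joint_lut_select(bool_fn f, int n)
{
    for (int sel = 0; sel < n; sel++)
        if (cofactor_support(f, n, sel, 0) <= 4 &&
            cofactor_support(f, n, sel, 1) <= 4)
            return sel;
    return -1;
}

/* Example: x8 selects between AND(x0..x3) and OR(x4..x7). */
int example_mux9(uint32_t in)
{
    int and4 = (in & 0x0Fu) == 0x0Fu;
    int or4  = (in & 0xF0u) != 0;
    return ((in >> 8) & 1u) ? and4 : or4;
}

/* Example: 9-input parity; every cofactor depends on 8 variables. */
int example_parity9(uint32_t in)
{
    int p = 0;
    for (int i = 0; i < 9; i++)
        p ^= (int)((in >> i) & 1u);
    return p;
}
```

With these examples, `find_joint_lut_select(example_mux9, 9)` picks x8 as the select input, while the parity function has no valid select variable. The exhaustive enumeration is chosen for clarity and is practical only because the functions have at most 9 inputs.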
The advantage of using the Joint-LUT structure is that it exploits parallelism, which can
reduce the delay of signals passing through it. For example, when a pair of serially connected LUTs
is mapped into this structure, the function of both LUTs in the Joint-LUT structure is evaluated
simultaneously. Thus, the delay through the Joint-LUT structure is the delay of one LUT plus the
delay of the dedicated multiplexer. A pair of serially connected LUTs has a longer delay, which is
equal to the delay through the two LUTs plus the routing delay between the LUTs.
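The comparison can be written out as a short worked inequality. The symbols $t_{LUT}$, $t_{MUX}$ and $t_{route}$ are introduced here only for illustration and stand for the delay of one LUT, the delay of the dedicated multiplexer and the routing delay between the two LUTs.

```latex
\begin{align*}
t_{\text{Joint-LUT}} &= t_{LUT} + t_{MUX} \\
t_{\text{serial}}    &= 2\,t_{LUT} + t_{route} \\
t_{\text{serial}} - t_{\text{Joint-LUT}} &= t_{LUT} + t_{route} - t_{MUX}
\end{align*}
```

Under this simple model the Joint-LUT mapping is faster whenever $t_{MUX} < t_{LUT} + t_{route}$.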
It is possible to implement even more complex functions by merging two Joint-LUT
structures with a multiplexer available in the Virtex-E CLB, as illustrated in Figure 3-11. This will
be called the Joint-Slice structure. The Joint-Slice can implement some logic functions with up to
19 total inputs. The mapping algorithm is very similar to the Joint-LUT mapping algorithm above,
except that after Shannon’s decomposition of the logic function the algorithm looks for a successful
mapping of each of the two cofactors into a Joint-LUT structure instead of a LUT.
Figure 3-12: Example of mapping a 2 output logic function into the Joint-LUT structure
Hwang covered the case for the Joint-LUT, which we extended to the Joint-Slice structure. A more
important contribution of our multiplexor mapping algorithm is its ability to implement logic
functions with multiple outputs in the Joint-LUT and the Joint-Slice structures, as shown in Figure
3-12. The following discusses how the basic algorithm is modified to provide this ability.
Figure 3-9 shows that the Joint-LUT structure pin X produces the multiplexor output for the
structure. In addition to this output the structure has a secondary output that is generated by pin Y.
Notice that pin Y corresponds to the function of the top LUT in this structure. Because the logic
function produced through the multiplexor, and therefore output through pin X,
depends on the function of the top LUT, the logic function produced by pin X will be termed the
primary output function.
To map a logic function with two outputs into the Joint-LUT structure we must first
determine the primary and the secondary output function. This can be done by checking which
function is a subfunction of the other. Then the algorithm proceeds to decompose the primary output
function using Shannon’s decomposition, but a successful mapping is considered to be found if and
only if one of the cofactors of the expansion implements the secondary output function. The
algorithm is summarized in Figure 3-13.
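The acceptance test for the two-output case can be sketched as below. The sketch assumes that the primary and secondary output functions have already been identified, restricts itself to five variables stored as 32-bit truth tables, and uses hypothetical names throughout. It is a simplified illustration rather than the algorithm of Figure 3-13.

```c
#include <stdint.h>

/* Assumed encoding: functions of x0..x4 are 32-bit truth tables in which
 * bit i is the output for the input assignment i. */
static int eval(uint32_t tt, uint32_t in)
{
    return (int)((tt >> in) & 1u);
}

/* Returns 1 if the cofactor of `primary` with x_sel = val equals `secondary`
 * for every assignment. This also forces `secondary` to be independent of
 * x_sel, because both values of x_sel are compared with the same cofactor. */
static int cofactor_equals(uint32_t primary, uint32_t secondary,
                           int sel, int val)
{
    for (uint32_t in = 0; in < 32u; in++) {
        uint32_t fixed = val ? (in | (1u << sel)) : (in & ~(1u << sel));
        if (eval(primary, fixed) != eval(secondary, in))
            return 0;
    }
    return 1;
}

/* Returns the index of a select variable for which one cofactor of the
 * primary output implements the secondary output, or -1 if none exists. */
int find_two_output_select(uint32_t primary, uint32_t secondary)
{
    for (int sel = 0; sel < 5; sel++)
        for (int val = 0; val <= 1; val++)
            if (cofactor_equals(primary, secondary, sel, val))
                return sel;
    return -1;
}
```

For example, the primary function "x4 ? (x0 AND x1) : (x2 OR x3)" has the table 0x8888FFF0 and the secondary function "x0 AND x1" has the table 0x88888888, so the sketch selects x4. With only five variables each cofactor automatically fits a 4-input LUT, which is why no separate input count is needed here.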
Input: a function f(x_1, ..., x_n) with up to two outputs
Output: a Joint-LUT structure that implements f(x_1, ..., x_n)
Figure 3-13: Mapping a multi-output logic function into the Joint-LUT structure
Note that this algorithm will be unable to map structures like those illustrated in Figure 3-14,
even though this is a correct mapping. This is because during Shannon’s expansion one variable is
removed from the equation of each cofactor and it is assigned to drive the multiplexer selector input.
However, the case shown in Figure 3-14 requires the variable f to drive both the selector input of the
multiplexer and the top LUT.
Figure 3-14: Mapping solution not found by the algorithm
3.4 Duplication
The second transformation available in Augur is duplication, which creates a copy of a
selected component. By duplicating a component on the critical path, one can increase the freedom
to position and route critical connections [4][13]. Logic duplication is a particularly useful feature
that enables the use of fast Nearest-Neighbour (NN) interconnect for critical connections.
Figure 3-15: Circuit before duplication
Figure 3-16: Circuit after duplication
For example, consider the circuit in Figure 3-15. The critical path starts at a flip-flop at
location (5,5) and goes through the LUT at location (4,6). Generally, a good strategy is to put critical
connections on NN interconnect to speed them up, but in the current placement this strategy cannot be
used, because the first LUT on the critical path is not in the adjacent CLB. Moving the flip-flop from
(5,5) to (5,6) permits this connection to use the NN interconnects, but removes the NN connections
from the LUTs at (7,5). By duplicating the flip-flop and LUT at (5, 5) and placing them in the CLB
at location (5,6), as shown in Figure 3-16, NN connections can be used for all outputs of the flip-
flop.
3.5 Merging
It is sometimes beneficial to reverse duplication that was performed during earlier synthesis. The
transformation that does this is called merging. For example, after placement and routing
it may become clear that the distribution of connections between two duplicated components is causing
the performance to suffer. This happens because duplication during logic synthesis cannot predict the
final placement of the circuit.
Merging can be used to redistribute connections between duplicate components. To do this
the user selects a pair of identical logic components and uses the merging transformation to merge
them. Once the components are merged, the user applies the logic duplication transformation to
recreate the two logic components, but with a different connection distribution.
Figure 3-17: Critical path before the application of carry chain shortening
Figure 3-18: Circuit after carry chain shortening
Figure 3-19: Two registers with incompatible control signals
Figure 3-20: Control signal functionality implemented in LUTs
3.8 Summary
This chapter presented the logic synthesis transformations available in Augur. These
transformations are used in the context of omniscience to improve the speed of the logic circuit. Each
logic transformation is designed specifically for the Xilinx Virtex-E device, taking advantage of the
design of the Virtex-E slice to implement complex logic functions with low delay.
The logic synthesis transformations described here are provided by the manual editor, which
allows the user to apply these logic transformations in the context of omniscience. The user
experience with the manual editor and the explanation of the method the user can apply are the topics
of the next chapter.
4 The User Experience with the Manual Editor
4.1 Introduction
The previous chapter introduced a set of logic synthesis transformations that Augur performs.
These transformations take advantage of the CLB design of the Virtex-E and focus on improving the
speed of the logic circuit. This chapter focuses on how Augur provides the user with omniscience
and how the logic synthesis transformations are used in the context of omniscience.
Providing omniscience requires a CAD tool to perform routing and timing analysis after
every logic synthesis or placement transformation. In addition, a CAD tool must provide all
information about the implementation of a logic circuit. In this work omniscience is provided
through a manual editor called Augur, which is capable of implementing various physical and logical
transformations. Augur provides the user with detailed information about the design implementation
at every step of the improvement process.
The input to Augur is a placed and routed logic circuit. The user uses placement, packing and
logic synthesis transformations to improve the operating frequency of the logic circuit. The output
of Augur is a new placed and routed logic circuit. The logic circuit produced by Augur can be used
by commercial tools. The speed of the logic circuit will be as reported by Augur, as the results
obtained by Augur are verified by commercial tools at every step of the optimization process.
file format and then use the Xilinx Place and Route tool (called PAR) to generate placement and routing
for it. The synthesis of the logic circuit must follow three rules:
1. There can only be one clock signal, because Augur does not perform timing analysis of logic
circuits with more than one clock.
2. The Synthesis Tool must be set to not create I/O pins, because Augur is intended for use with
small modules. It is expected that the placed and routed module will be instantiated in a
higher level logic circuit. The assignment of pins should happen at a higher level, because
not all inputs or outputs of the logic circuit will be connected to I/O pins.
3. The primary output nets must have the prefix “END_”, to allow Augur to distinguish them
from other non-I/O nets.
The logic circuit synthesized this way into an EDF file can then be used to generate placement and
routing for the logic circuit using the PAR program. The PAR program will generate an NCD file,
which contains the synthesis, placement and routing information about the logic circuit.
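Rule 3 amounts to a prefix test on each net name. A minimal sketch in C is given below. The function name is hypothetical, and the case-sensitive comparison is an assumption, since the text does not say whether Augur treats the prefix as case sensitive.

```c
#include <string.h>

/* Returns 1 if the net name starts with "END_", which rule 3 reserves for
 * primary output nets, and 0 otherwise. Case-sensitive by assumption. */
int is_primary_output_net(const char *net_name)
{
    static const char prefix[] = "END_";
    return strncmp(net_name, prefix, sizeof prefix - 1) == 0;
}
```

A net called `END_sum` would be treated as a primary output, while `sum_END_` would not, because the marker must appear at the start of the name.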
The user starts the manual editor with the following command entered at the MS-DOS
prompt:
editor <filename.ncd>
where <filename.ncd> is the complete filename for the logic circuit that is to be improved.
The manual editor will launch the Xilinx FPGA Editor in the background and display
Augur’s graphical user interface. The following sections present how the user utilizes the graphical
user interface to interpret the data provided and to improve the speed of the logic circuit.
To demonstrate how the user utilizes all of these views, and the functions provided with
them, the following subsections will show an example of how each view is used. In each subsection
a specific view will be described, showing the information that can be gathered from it. This
information assists the user in making an informed optimization decision.
delay on the critical path is reduced. When the user moves a component that is on the critical path,
the manual editor will immediately implement the requested operation, if it is valid, and perform
timing analysis to provide the user with the updated logic circuit speed.
The most effective method of achieving the delay reduction through placement and packing
changes is to create placement that allows connections between logic components to utilize the
Nearest-Neighbour (NN) interconnect [8]. To do this the user must place logic components
horizontally across the CLB boundaries. In the view these boundaries are marked by vertical white
lines. To better see the CLB boundaries, and the logic circuit components, the user can scroll or
zoom the Placer View using the view controls.
The view controls, shown in Figure 4-2, consist of buttons which allow the Placer and Packer
view to shift up, down, left and right, as well as perform a zoom in and zoom out operation. The four
buttons in the top right corner of the window cause the view to shift Up, Down, Left, and Right. To
perform a zoom operation the user can use the Zoom In/Zoom Out pair of buttons, the Window
button or the Zoom Fit button. The Zoom In/Zoom Out buttons cause the view to zoom in (or out)
towards the center of the view. The Window button enables the user to use the mouse to select the
area of the design to be enlarged. The same effect is achieved by pressing the right mouse button in
the view area, selecting an area to zoom into and then pressing the left mouse button. Finally, the
Zoom Fit button causes Augur to display the entire logic circuit in the view area. This option is very
useful as it gets the user out of the zoom, making it easy to shift focus to a different location in the
design quickly.
Figure 4-2: View Controls
At this point in the optimization process the user has identified the critical path and zoomed
in the view to better observe why the highlighted path has the longest delay and determine if
anything can be done to remedy the situation. To proceed further, more information needs to be
gathered so that the next step of the optimization can be determined. There are three ways in which
the user can obtain information: visual inspection, delay profiling and information dialog box.
Visual Inspection
The user can visually inspect the connectivity of logic components as well as the delay
information for a subset of longest paths in the logic circuit. The Net button enables the user to
observe all, some or no connections between logic components. Initially, the Placer View does not
show any connection between components. This is indicated by the state of the Net button (None).
The user can toggle the Net button to be in two other states: Selected and All. When the Net
button is in “Selected” state, only the connections for selected components are shown, while in the
“All” state every connection in the logic circuit is shown. The “Selected” option is the most useful as
it shows the user only the connections for the components of interest. The “All” option is useful to
determine possible areas of high connectivity.
By looking at the fanout of a single component, the user can determine if it may be beneficial
to use logic duplication to reduce the fanout of the logic component in question. A component with
a large number of arrows that fan out from it is usually a good candidate for duplication. Furthermore,
by locating the logic components that drive the selected component, the user can locate components
that may be identical. Identical components will have the same type (LUT, Carry, or Flip-Flop) and
will be configured the same way, which would make them possible candidates for merging.
Delay Profiling
In addition to visual inspection, Augur provides the user with the ability to determine the
location of paths whose delay is in a user-specified range. Each path starts at a flip-flop or a primary
input and ends at a flip-flop or a primary output. Knowing where these paths are located helps in
selecting an appropriate logic optimization approach. There are two tools provided by Augur to help
the user determine the delay and location of paths in the logic circuit: the delay profile and the delay
budget [6].
The delay profile is a histogram of paths and their delays. Each path is associated with a bin,
where a bin has an upper and a lower bound on the delay of paths that belong to it. The delay profile
for the example in Figure 4-1 is shown in Figure 4-3. The bins are arranged so that bin 1 contains
all paths which operate within a speed of about 1 MHz of the critical path, while the delay range of
every consecutive bin grows by 50%. This arrangement of bins allows the user to determine the overall
number of paths that are close to critical and then use the delay budget to locate these paths in the
logic circuit.
Geometric bin delay profile:
Bin 9: 3.308ns-4.553ns (302.337MHz-219.613MHz), count = 1239
Bin 8: 4.553ns-5.384ns (219.613MHz-185.734MHz), count = 855
Bin 7: 5.384ns-5.938ns (185.734MHz-168.413MHz), count = 877
Bin 6: 5.938ns-6.307ns (168.413MHz-158.556MHz), count = 523
Bin 5: 6.307ns-6.553ns (158.556MHz-152.601MHz), count = 249
Bin 4: 6.553ns-6.717ns (152.601MHz-148.874MHz), count = 78
Bin 3: 6.717ns-6.826ns (148.874MHz-146.489MHz), count = 56
Bin 2: 6.826ns-6.899ns (146.489MHz-144.940MHz), count = 21
Bin 1: 6.899ns-6.948ns (144.940MHz-143.926MHz), count = 9
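The bin layout in the delay profile listing can be approximated with a short routine. The sketch below is an assumption about how such bins could be computed, not Augur's actual code: it takes the critical delay, the width of bin 1 (0.049 ns when read off the listing) and a growth factor of 1.5, and returns the bin that a given path delay falls into. The function name and parameters are hypothetical.

```c
#include <stddef.h>

/* Returns the 1-based bin index of a path delay, or 0 if the delay lies
 * outside all bins. Bin 1 ends at the critical delay and each following
 * bin is `growth` times wider than the one before it. */
size_t geometric_bin(double delay_ns, double critical_ns,
                     double first_width_ns, double growth, size_t nbins)
{
    double upper = critical_ns;
    double width = first_width_ns;

    for (size_t bin = 1; bin <= nbins; bin++) {
        double lower = upper - width;
        if (delay_ns <= upper && delay_ns > lower)
            return bin;
        upper = lower;   /* the next bin starts where this one ended */
        width *= growth; /* and covers a range 50% larger for growth 1.5 */
    }
    return 0;
}
```

With a critical delay of 6.948 ns, a path of 6.85 ns lands in bin 2 and a path of 5.0 ns lands in bin 8, which agrees with the ranges in the listing up to rounding of the bin edges.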
The delay budget is the means by which Augur displays paths that do not meet the timing
constraints. The Options window, shown in Figure 4-4, allows the user to specify the delay budget
at any time. To specify the delay budget for the logic circuit, the user must press the Options button
and then enter the logic circuit delay in the delay budget edit box [6].
Once the delay budget is set all paths in the logic circuit with the delay longer than specified
by the budget are highlighted. The user can use the Hilt button to toggle the display to show only the
critical path, the paths that do not meet the timing budget or no paths.
The Hilt button has three states: None, Max and Slow. The None state disables highlighting
of paths in the logic circuit, while the Max state displays only the longest path in the logic circuit.
The Slow state is used to display all the paths that have a delay longer than the specified budget.
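The three Hilt states can be read as a small decision rule applied to each path. The following C sketch illustrates that rule. The enumeration, the function name and its parameters are hypothetical and are not taken from Augur's source.

```c
/* States of the Hilt button as described in the text. */
typedef enum { HILT_NONE, HILT_MAX, HILT_SLOW } hilt_state;

/* Returns 1 if a path with the given delay should be highlighted.
 * max_delay_ns is the delay of the longest path in the circuit and
 * budget_ns is the delay budget entered in the Options window. */
int is_highlighted(hilt_state state, double path_delay_ns,
                   double max_delay_ns, double budget_ns)
{
    switch (state) {
    case HILT_MAX:
        return path_delay_ns >= max_delay_ns; /* only the longest path */
    case HILT_SLOW:
        return path_delay_ns > budget_ns;     /* paths over the budget */
    case HILT_NONE:
    default:
        return 0;                             /* highlighting disabled */
    }
}
```

For a budget of 6.899 ns and a longest path of 6.948 ns, a 6.92 ns path is highlighted in the Slow state but not in the Max state.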
The budget can be set to any delay value. However, using the delay profile to pick a good
delay budget aids the user in improving the speed of the logic circuit. For example, the logic circuit
in Figure 4-1 has the delay profile shown in Figure 4-3. Setting the delay budget to 6.899 ns, the
lower bound of bin 1, highlights the nine critical paths in that bin. Figure 4-5
shows the critical paths using the highlighting ability of Augur.
One of these paths traverses the carry chain, near the left edge of the figure, while the
remaining eight paths start at the same flip-flop, go through three common LUTs and then fanout
into two different directions. These paths overlap the longest path in the logic circuit, shown in
Figure 4-1. Therefore, improving the performance of the logic components that are shared between
the eight paths would improve the performance of eight critical paths.
Figure 4-5: Critical paths with delay greater than specified budget
One way to attempt improving the speed of these critical paths is to move the originating flip-
flop one row up, so that the Nearest Neighbour interconnect can be utilized. However, the flip-flop
is driven by the carry chain structure and moving it from its current location significantly increases
the delay of paths passing through the carry chain. Moving the LUTs one row lower and to the right
is also not a good idea as some of the critical paths will have their delay increased. Therefore, the
only alternative is to search for a logic transformation that will decrease the delay on these eight
paths. To find out if some of these logic components can be resynthesized such that the delay
through them is decreased the user needs the information about the logic function these components
implement. This is obtained using the Logic Component Information box.
implemented in the bottom half of the slice, also known as LUT F. The second LUT is implemented
in the top LUT of the same slice, also known as LUT G. Figure 4-6 states that the LUT G
implements an OR function, while the arrow connecting these logic components in the Placer View
indicates that one of the inputs to LUT G is the output of LUT F. This means that the entire function
has 5 inputs. Therefore it may be a candidate for a Joint-LUT mapping, described in Section 3.3.2.
To perform a logic transformation the user must select the two LUTs and then press the Transform
button.
Transform Button
This button allows the user to select one of the logic transformations available in Augur. When
pressed, the menu on the right hand side of the window presents the user with the following options:
1. Back - do not perform any logic transformation
2. Duplicate - perform logic duplication
3. Merge - perform logic merging
4. Remap - initiate logic remapping, which includes carry chain shortening
5. Ctrl Ext - perform flip-flop synchronous control signal extraction
Selecting any logic transformation switches the view to the SubCircuit view, where the user is able
to perform logic synthesis transformations. In this example, the user chooses the Remapping
transformation.
buttons responsible for a logic synthesis transformation, Augur will attempt to implement the
selected logic components using the selected logic synthesis transformation.
The SubCircuit view for the logic components selected in Figure 4-5 is shown in Figure 4-7.
To perform a logic synthesis transformation that implements these components in the Joint-LUT
structure the user must select both components and press the Joint-LUT button. A successful
transformation will generate the view shown in Figure 4-8.
In this view the dark gray box, which contains the two LUTs, represents a placement
dependency between the two LUTs. These two LUTs have been implemented such that a Joint-LUT
structure is formed. Therefore, when these LUTs are placed they must be in the same slice and in the
same relative position. Augur will always keep track of the relative position of these components and
will not allow them to be placed separately. To place the Joint-LUT structure created by the logic
synthesis transformation the user must select the structure and press the TAB key. This action
switches the view to the SubCircuit Placer view, where the user is able to look at the logic circuit
and place the newly synthesized components in a suitable location.
Figure 4-8: The SubCircuit view after a logic transformation
the Accept button. As a consequence, the logic components that are not a part of the subcircuit are
grayed-out to signify that they cannot be moved.
To help the user place a logic component, rubber bands are shown whenever a logic
component is being moved in the SubCircuit Placer view. The rubber bands are semi-transparent
arrows that show the sources and targets of connections associated with a particular logic
component. The rubber bands only show connections to placed logic components. Therefore, if a
logic component is being placed that has a connection to another logic component that is not yet
placed, then the rubber band that corresponds to the connection between these logic components will
not be shown.
The user can also remove logic components from the SubCircuit Placer view. To perform
this action the user must select the desired component and start moving it. If the user presses the
TAB key while the SubCircuit Placer view is moving the selected logic component, the user is
returned to the SubCircuit view and the selected component is removed from the SubCircuit
Placer view.
In Figure 4-9 the placement for the resynthesized logic has been selected. Since all of the
logic components have been placed the user can now press the Accept button to see the result of
implementing the logic synthesis transformation. The result of this transformation is presented in
Figure 4-10. This action takes the user back to the Placer and Packer view.
Once back in the Placer and Packer view the user can observe the effect of the logic synthesis
transformation on the speed of the logic circuit. From this point the process of logic circuit
modification becomes iterative. The user can again focus on a logic subcircuit, gather information
about it and select a logic circuit modification that best suits the given situation. Once the user has
completed all modifications, or the desired logic circuit speed is achieved, the file commands can
be used to save the logic circuit.
4.4 File Commands
While working with the manual editor, the user can keep intermediate results by using the
Save button. The Save button allows the user to create a copy of the logic circuit and save it in an
NCD file, which can be processed by Xilinx tools. This operation will not change the logic
circuit source file unless the name of the file the logic circuit is to be saved to is identical to the
initial file name.
It is helpful to manually save the logic circuit on occasion, as it allows the user to go back
to the logic circuit implementation that the user considered to be good. Another option which allows
the user to do a similar thing is Undo. When the user presses the Undo button, the manual editor
immediately undoes the last logic circuit modification, restoring the logic synthesis, placement and routing
of the logic circuit. The Undo command is capable of undoing all logic circuit modifications that
occurred since the start of the program.
Figure 4-11: Augur software overview
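typedef struct s_action_list {
    action_type type;                       /* kind of action */
    t_action_data *transformation_data;     /* data describing the transformation */
    struct s_action_list *previous, *next;  /* links to the previous and next actions */
} t_action_list;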
Finally, the previous and next fields specify a link to the last action that was performed and the next
action to be performed to implement the logic modification. Although it is sufficient to provide only
the link to the next action to be performed to complete the logic modification, the link to the
previous action allows the same set of actions to be reversed to undo the logic transformation in case
the modification could not be implemented.
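The role of the two links can be illustrated with a minimal sketch. It is not Augur's implementation: the `ACTION_MOVE` value, the fields of `t_action_data`, the integer placement array and the helper functions are all assumptions made for illustration. Actions are applied by following the next links, and if one action cannot be implemented, the previous links are followed to revert the actions already applied.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: the thesis defines its own action_type and
 * t_action_data, which are not reproduced here. */
typedef enum { ACTION_MOVE } action_type;

typedef struct {
    int component;      /* index into the placement array */
    int old_location;   /* location before the move */
    int new_location;   /* requested location; negative means illegal */
} t_action_data;

typedef struct s_action_list {
    action_type type;
    t_action_data *transformation_data;
    struct s_action_list *previous, *next;
} t_action_list;

/* Perform one action; returns 0 if it cannot be implemented. */
int apply_action(const t_action_list *action, int *placement)
{
    const t_action_data *d = action->transformation_data;
    if (d->new_location < 0)
        return 0;
    placement[d->component] = d->new_location;
    return 1;
}

/* Restore the state that existed before the action was applied. */
void revert_action(const t_action_list *action, int *placement)
{
    const t_action_data *d = action->transformation_data;
    placement[d->component] = d->old_location;
}

/* Apply the actions in order along the next links. On failure, walk the
 * previous links and revert the actions already applied, most recent
 * first. Returns 1 if every action succeeded, 0 otherwise. */
int apply_or_rollback(t_action_list *head, int *placement)
{
    assert(placement != NULL);
    for (t_action_list *a = head; a != NULL; a = a->next) {
        if (!apply_action(a, placement)) {
            for (t_action_list *done = a->previous; done != NULL;
                 done = done->previous)
                revert_action(done, placement);
            return 0;
        }
    }
    return 1;
}
```

With only next links the rollback would need a second traversal from the head or a separate stack; the previous link makes the reversal a simple backward walk.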
Once a placement, packing or synthesis modification is complete, the manual placer and
packer module uses the FPGA Editor Interface to communicate the changes to the FPGA Editor. The
FPGA Editor is a Xilinx software tool that runs in the background. The FPGA Editor allows Augur
to implement placement, packing or synthesis modifications on a real FPGA device and perform
timing analysis. The result of timing analysis is returned to the manual placer and packer module,
through the FPGA Editor Interface, so that it can be displayed to the user using the GUI.
4.7 Summary
This chapter has presented the user interface of Augur and a method of using the features of
Augur to improve the speed of the logic circuit. The two aspects of omniscience, which are the
gathering of information from the design and instant feedback after design modification, were
presented in the context of interaction with the user.
In this chapter the process of improving the logic circuit is also shown. The user begins the
process by analyzing the data in the Placer and Packer view to decide how to improve the logic
circuit. The user can use placement or packing modifications to improve the circuit or apply a logic
synthesis transformation. The SubCircuit view and the SubCircuit Placer view facilitate the
implementation of logic synthesis transformations. After each modification the logic circuit
undergoes timing analysis to provide the user with the effect of the logic circuit modification on the
speed of the logic circuit.
The approach presented in this chapter has been applied to a suite of benchmark logic
circuits. The improvements resulting from the use of Augur as well as the strategies used during the
logic circuit improvement process are presented in the following chapter.
5 Experimental Results
5.1 Introduction
The goal of this work was to create an omniscient manual editor (Augur), which uses logic
synthesis transformations as the means to improve the speed of logic circuits. In Chapter 3 the logic
synthesis transformations that are available in the manual editor were described, while Chapter 4
presented the method by which the user employs Augur. This chapter presents the results obtained
using the editor. In addition, several strategies that can be automated and form the basis for
algorithms in automatic CAD tools are presented.
To evaluate the performance improvement obtained using Augur, a fair basis of comparison
is needed. The reference point chosen in this work is a suite of 10 benchmark logic circuits, which
are synthesized, placed and routed using the latest commercial CAD tools. Each of the logic circuits in
the benchmark suite is briefly described in section 5.2. To ensure that the comparison of results
obtained by the use of Augur to those obtained by automatic CAD tools is fair, a rigorous method
of generating the benchmark suite was devised. This method is described in section 5.3.
To evaluate the effectiveness of this new approach, the results for each benchmark logic
circuit were obtained in two phases. The first phase was to apply the approach of Chow and Rose
[6], using only placement and packing modifications to improve the speed of the logic circuit. The
results of this approach are discussed in section 5.4. The second phase was to employ logic synthesis
transformations. The results obtained using these logic synthesis transformations are described in
section 5.5.
[Table header residue: Logic Circuit Name | Size (LUT + Carry, FFs) | Operating Frequency (MHz)]
Section 5.6 describes several optimization strategies, all of which can be automated, that
emerged during the course of improving logic circuits using Augur.
1. Set the target frequency to any value, then synthesize, place and route the design. The resulting
operating frequency is used as the initial setting for the remainder of this procedure.
2. Set the target frequency to the initial setting.
3. Synthesize using Synplify 7.1 Pro [14].
4. Place and route using the tool (par.exe) [15] provided with Xilinx ISE 5.1 Service Pack 3. The
placement and routing is performed 100 times, each time with a different seed.
5. Record the best result.
6. Repeat steps 3-5 for target frequencies of -10%, -5%, +5% and +10% with respect to the current setting.
7. If a better solution was obtained in step 6, repeat steps 2-6 using the frequency setting that
produced the better result as the initial setting.
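The procedure above can be sketched as a simple search loop. This is only an illustration under stated assumptions: `flow_fn` is a hypothetical wrapper that stands for one run of synthesis plus place and route (the real procedure synthesizes once per target and then varies only the place and route seed), `toy_flow` is an invented stand-in for the commercial tools so that the sketch can run, and `MAX_ROUNDS` is a safety cap that is not part of the procedure.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SEEDS 100   /* step 4: place and route 100 times */
#define MAX_ROUNDS 50   /* safety cap, not part of the thesis procedure */

/* Hypothetical tool wrapper: runs the flow for one target frequency and
 * seed, and returns the achieved operating frequency in MHz. */
typedef double (*flow_fn)(double target_mhz, int seed);

/* Toy stand-in for the real tools: results peak when the target is
 * 200 MHz and vary slightly with the seed. */
double toy_flow(double target_mhz, int seed)
{
    double miss = target_mhz - 200.0;
    if (miss < 0.0)
        miss = -miss;
    return 180.0 - 0.2 * miss + 0.01 * (seed % 10);
}

/* Steps 3-5: best achieved frequency over NUM_SEEDS seeds. */
double best_of_seeds(flow_fn flow, double target_mhz)
{
    double best = 0.0;
    for (int seed = 1; seed <= NUM_SEEDS; seed++) {
        double mhz = flow(target_mhz, seed);
        if (mhz > best)
            best = mhz;
    }
    return best;
}

/* Steps 1-7: returns the best frequency found; *best_target receives
 * the target setting that produced it. */
double find_baseline(flow_fn flow, double any_target_mhz, double *best_target)
{
    assert(flow != NULL && best_target != NULL);

    double setting = flow(any_target_mhz, 1);      /* step 1 */
    double best = best_of_seeds(flow, setting);    /* steps 2-5 */
    static const double offsets[] = { -0.10, -0.05, 0.05, 0.10 };

    for (int round = 0; round < MAX_ROUNDS; round++) {
        double base = setting;
        int improved = 0;
        /* Step 6: try the four offsets around the current setting. */
        for (size_t i = 0; i < sizeof offsets / sizeof offsets[0]; i++) {
            double target = base * (1.0 + offsets[i]);
            double mhz = best_of_seeds(flow, target);
            if (mhz > best) {
                best = mhz;
                setting = target;
                improved = 1;
            }
        }
        if (!improved)   /* step 7: stop once no offset gives a better result */
            break;
    }
    *best_target = setting;
    return best;
}
```

Calling `find_baseline(toy_flow, 100.0, &target)` walks the target setting upward in steps of 5% or 10% until none of the four neighbouring settings improves on the best result so far.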
memory to look up textures based on given (x, y) coordinates. Also a part of the hardware
ray-tracing engine [17].
8. Vidout - module used to display a rendered image using the VGA interface on the
Transmogrifier-3 [18], and also part of the hardware ray-tracing engine [17].
9. Raygencont - a circuit that generates all rays to be traced for a given view. Part of the
hardware ray-tracing engine [17].
10. Mult - a 4x4 bit multiplier.
The next section describes the process used to generate the baseline results for each of the
benchmark logic circuits.
Logic Circuit Name | Baseline Frequency (MHz) | Frequency after Placement and Packing (MHz) | Improvement (%)
Batcher            | 298.6                    | 314.0                                       | 5.1
Miim               | 155.0                    | 155.2                                       | 0.1
Vision             | 197.4                    | 197.8                                       | 0.2
Banyan             | 359.3                    | 367.8                                       | 2.4
Trap               | 381.0                    | 398.6                                       | 4.6
Boundcontroller    | 131.5                    | 137.9                                       | 4.8
Linearmap          | 108.0                    | 109.3                                       | 1.3
Vidout             | 134.4                    | 140.0                                       | 4.1
Raygencont         | 162.1                    | 173.2                                       | 6.8
Mult               | 122.2                    | 122.3                                       | 0.1
Average            |                          |                                             | 3.0
Table 5-2: Results using only placement and packing modifications
                   | After placement and packing modifications only | After re-synthesis                       | Improvement (%)
Logic Circuit Name | LUT + Carry | FFs | Operating Frequency (MHz)  | LUT + Carry | FFs | Operating Frequency (MHz) | With respect to baseline | Due to logic synthesis transformations
Batcher            | 252         | 436 | 314.0                      | 252         | 447 | 374.8                     | 25.5                     | 19.4
Miim               | 162         | 119 | 155.2                      | 164         | 119 | 155.6                     | 0.4                      | 0.2
Vision             | 310         | 243 | 197.8                      | 320         | 243 | 210.2                     | 6.5                      | 6.3
Banyan             | 176         | 335 | 367.8                      | 176         | 335 | 367.8                     | 2.4                      | 0.0
Trap               | 186         | 486 | 398.6                      | 187         | 501 | 418.4                     | 9.8                      | 5.0
Boundcontroller    | 473         | 466 | 137.9                      | 480         | 469 | 149.5                     | 13.7                     | 8.5
Linearmap          | 460         | 72  | 109.3                      | 460         | 76  | 125.7                     | 16.4                     | 14.9
Vidout             | 447         | 220 | 140.0                      | 454         | 221 | 155.6                     | 15.8                     | 11.2
Raygencont         | 211         | 118 | 173.2                      | 211         | 118 | 173.2                     | 6.8                      | 0.0
Mult               | 29          | 21  | 122.3                      | 28          | 25  | 124.3                     | 1.7                      | 1.6
Average            |             |     |                            |             |     |                           | 9.9                      | 6.7
Table 5-3: Speed improvement results using the new logic synthesis transformations
The new placement and routing tools were successful in performing a number of placement and
packing modifications that ISE 3.3 SP7 was not. In addition, Synplify 7.1 Pro utilized the
multiplexers in slices more often than version 6.2, essentially making use of the
Joint-LUT and Joint-Slice structures. While this can reduce the delay, it also places some restrictions
on placement, perhaps contributing to the diminished improvements obtained using just placement and
packing modifications. Finally, the baseline logic circuits were created using a more rigorous
procedure than in [6], which made it significantly harder to improve logic circuits using placement
and packing modifications only.
                   | After placement and packing modifications only | After re-synthesis | Area Increase (%)
Logic Circuit Name | LUT | Carry | FFs                              | LUT | Carry | FFs  | LUT + Carry | FFs
Batcher            | 252 | 0     | 436                              | 252 | 0     | 447  | 0           | 2.5
Miim               | 149 | 13    | 119                              | 151 | 13    | 119  | 1.2         | 0
Vision             | 195 | 115   | 243                              | 197 | 123   | 243  | 3.2         | 0
Banyan             | 176 | 0     | 335                              | 176 | 0     | 335  | 0           | 0
Trap               | 186 | 0     | 486                              | 187 | 0     | 501  | 0.5         | 3.1
Boundcontroller    | 383 | 90    | 466                              | 390 | 90    | 469  | 1.7         | 0.6
Linearmap          | 270 | 190   | 72                               | 270 | 190   | 76   | 0           | 5.6
Vidout             | 317 | 130   | 220                              | 325 | 129   | 221  | 1.6         | 0.5
Raygencont         | 166 | 45    | 118                              | 166 | 45    | 118  | 0           | 0
Mult               | 29  | 0     | 21                               | 28  | 0     | 25   | -3.5        | 19.0
Average            |     |       |                                  |     |       |      | 0.5         | 3.1
Table 5-4: Area change due to the new logic synthesis transformations
stopping logic circuit improvement are described as a part of the stopping criterion in Section 5.6.4.
The speed improvement presented in Table 5-3 was determined by calculating the per cent difference
between the speed of the logic circuit in Table 5-3, column 7, and the baseline logic circuit speed
in Table 5-1, column 4.
The results obtained for the 10 benchmark logic circuits show that using Augur improves
logic circuit speed by up to 25.5%, and by 9.9% on average. The speed improvement did not cause
a severe area penalty. In the worst case, a total of 16 LUTs, carry elements and flip-flops were added
to the logic circuit Trap. On average the number of LUTs and carry elements increased by 0.5%,
while the number of flip-flops increased by 3.1%. The speed and area results are summarized
in Tables 5-3 and 5-4 respectively.
The speed improvement results are divided into two categories. The first category is the logic
circuit speed improvement obtained using placement, packing and logic synthesis transformations
together, measured against the baseline logic circuit speed. The
second category is the logic circuit speed improvement that was possible due to the introduction of
logic synthesis transformations. The contribution of logic synthesis transformations was determined
by calculating the per cent difference between the frequencies in column 7 and column 4 of Table 5-3.
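As a worked illustration of the two per cent differences, the following sketch recomputes the Batcher entries of Table 5-3 from the frequencies reported in Tables 5-2 and 5-3. The function name is an assumption chosen for this example.

```c
#include <assert.h>

/* Per cent difference of new_mhz relative to reference_mhz. */
double percent_difference(double new_mhz, double reference_mhz)
{
    assert(reference_mhz > 0.0);
    return 100.0 * (new_mhz - reference_mhz) / reference_mhz;
}

/* Batcher: baseline 298.6 MHz, 314.0 MHz after placement and packing,
 * 374.8 MHz after re-synthesis.
 *   percent_difference(374.8, 298.6) is about 25.5 (with respect to baseline)
 *   percent_difference(374.8, 314.0) is about 19.4 (due to logic synthesis) */
```

The first call compares against the baseline circuit, the second against the circuit that had already been improved by placement and packing modifications alone.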
The following are the key steps involved in improving each logic circuit:
1. Batcher - the majority of the flip-flops in this design were synthesized to use distinct
control signals. These control signals were used to minimize the combinational logic part of
the circuit. However, the resulting packing allowed only one register to occupy a slice, which
spread the circuit over a larger area than necessary. The application of Flip-Flop control
signal extraction (described in Section 3.7) allowed previously incompatible flip-flops to
share a slice, resulting in an overall 25.5% improvement over the baseline logic circuit speed.
2. Miim - duplication (described in Section 3.4) was used to improve performance of some of
the paths in the circuit. However, the complexity of the design, as well as the congestion in
the critical region, allowed for only minor (0.6%) improvements.
3. Vision - this logic circuit suffered from improper synthesis of logic that controlled Flip-Flop
enable signals. A detailed analysis of these signals showed that each of the enable signals
was functionally identical, while the logic function for these signals was a 7-input AND gate.
The logic synthesis tools implemented this logic function as a pair of serially connected
LUTs and duplicated the forward LUT to reduce fanout. Logic merging (described in Section
3.5) was first used to create a single logic function to implement the enable signal. Then both
LUTs were duplicated (described in section 3.4) to reduce the fanout and distribute the
connection properly. Further improvement was obtained by implementing these two LUTs
in a carry chain, using the Carry Chain Remapping transformation (described in Section 3.3.1).
The resulting logic circuit speed exceeded the baseline speed by 6.5%.
4. Banyan - each path in this logic circuit contains at most one LUT. This was possible because
of the use of flip-flop control signals to reduce the logic depth of the logic circuit, when
synthesized by Synplify 7.1 Pro. We were only able to improve the performance of the logic
circuit by modifying the placement and packing. The resulting speed improvement was 2.4%
over baseline.
5. Trap - in this logic circuit a few registers used flip-flop control signals to decrease logic
depth. However, it was critical to circuit performance that these flip-flops had the freedom
to share a slice with other flip-flops. We used the control signal extraction transformation
(described in Section 3.7) to achieve placement flexibility for these flip-flops. Once these
flip-flops had the flexibility to share a slice with other flip-flops, we were able to modify the
placement and packing of the logic circuit effectively. In combination with logic duplication
(described in Section 3.4) the logic circuit speed was increased by 9.8%.
6. Boundcontroller - the design contained a number of Joint-LUT configurations. A closer
examination revealed that remapping certain pairs of them into Joint-Slice structures with
multiple outputs was possible. After these pairs of Joint-LUTs were remapped into
Joint-Slice type structures the placement of the logic components was rearranged to promote usage
of NN interconnect. These modifications resulted in the improvement in the logic circuit
speed by 13.7% compared to the baseline logic circuit speed.
7. Linearmap - the design contains mostly carry chain logic. The problem was that a few
registers were driving multiple carry chains and could not use NN interconnect for all
connections. Duplicating (described in Section 3.4) some of those registers and
re-synthesizing non-carry chain logic into Joint-LUT and Joint-Slice structures (described
in Section 3.3.2) improved the logic circuit speed by 16.4%.
8. Vidout - the logic circuit contained a carry chain that was unnecessarily long. The output of
the top carry cell was not driving the local register, which made it a candidate for the carry
chain shortening transformation (described in Section 3.6). Further analysis revealed that the
functionality implemented by the top segment of the carry chain and the logic function
implemented by the LUT that the carry chain was driving could be implemented in a single
LUT. This modification allowed for further logic optimization resulting in the improvement
in the logic circuit speed by 15.8% compared to the baseline logic circuit speed.
9. Raygencont - the critical path of this logic circuit traverses LUTs that could be
re-synthesized into wide AND gate carry chain structures (described in Section 3.3.1).
However, that causes the near critical paths to become critical with longer delay. Without
logic synthesis transformations we were able to modify the placement and packing of the
logic circuit to improve the speed by 6.8%.
10. Mult - the original placement and synthesis were good; however, performing logic duplication
and remapping improved the logic circuit speed by 1.7%.
5.6 Optimization Strategies
The previous Sections of this chapter presented the specific steps used to obtain logic circuit
improvement results and the results themselves. Although the results in this thesis are obtained
manually, one of the long-term goals of this research is to discover new optimization strategies that
could be fully automated. In this Section several such strategies are proposed, which could form the
basis of future algorithms.
then abandon this strategy, otherwise
e. If set S is empty and some connections on Pi are not utilizing NN interconnect then
for each connection that could use a NN interconnect on Pi
i. Locate the path Pj, which uses the NN interconnect that Pi needs
ii. Check if a component on Pj can be moved or remapped to free the NN
interconnect, while ensuring that the delay of Pj is less than the delay of Pi
iii. If step (ii) is unsuccessful then abandon this strategy
If successful, this should allow the critical logic to acquire the liberated NN connection.
ii. For each pair of serially connected logic components on P, determine the
pairs that can be remapped into Joint-LUT or Joint-Slice structures (as per
Section 3.3.2) and put them in set S
iii. Apply remapping to component pairs in S to liberate the space, while
maintaining, or lowering, the delay of P;
Move the critical logic into the liberated space
Bin 10: 1.715ns-3.314ns, count = 16
Bin 9: 3.314ns-4.379ns, count = 323
Bin 8: 4.379ns-5.090ns, count = 386
Bin 7: 5.090ns-5.563ns, count = 444
Bin 6: 5.563ns-5.879ns, count = 238
Bin 5: 5.879ns-6.089ns, count = 210
Bin 4: 6.089ns-6.230ns, count = 92
Bin 3: 6.230ns-6.323ns, count = 42
Bin 2: 6.323ns-6.385ns, count = 10
Bin 1: 6.385ns-6.427ns, count = 5
circuit speed improvement. In this work the stopping criterion is based on the distribution of path
delays.
Augur summarizes the path delays using a geometric delay profile, as described in section
4.3.1. The delay profile is used to determine the number of critical, and near-critical, paths as well
as their location in the logic circuit. Clearly it is easier to improve the speed of a logic circuit that
has just a few critical paths that do not share components or connections than that of a logic circuit
with many critical paths that share connections or components. In this work the optimization
of the logic circuit was stopped when the following two criteria were met:
1. The two slowest bins contained 15 or more paths and
2. The paths in the two slowest bins were:
a) Situated in close physical proximity to each other, and
b) A logic transformation that improved the delay on all these paths could not be found
The first criterion says that there is little point in continuing if there are too many paths that must be
improved to gain speed. The second notices that close-to-critical paths pose a problem if improving
one of them has a strong likelihood of increasing the delay of the other close-to-critical paths by
virtue of their close physical proximity.
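The stopping criterion can be sketched in a few lines. Two assumptions are made here and should be read as such: the bin widths of the geometric delay profile are taken to grow by a constant ratio of 1.5 from the slowest bin to the fastest, which is inferred from the bin boundaries listed above rather than stated in this section, and the two qualitative conditions (physical proximity, no suitable transformation found) are supplied by the user as flags. The function and parameter names are hypothetical.

```c
#include <assert.h>

#define NUM_BINS 10

/* Fill edges[0..NUM_BINS] with bin boundaries in ns. edges[0] is the
 * slowest delay; Bin k spans edges[k] to edges[k-1], and each bin is
 * `ratio` times wider than the previous (slower) one. */
void geometric_bin_edges(double min_ns, double max_ns, double ratio,
                         double edges[NUM_BINS + 1])
{
    assert(max_ns > min_ns && ratio > 1.0);

    double power = 1.0;                       /* ratio^NUM_BINS */
    for (int k = 0; k < NUM_BINS; k++)
        power *= ratio;

    /* Width of Bin 1, from the sum of a geometric series. */
    double width = (max_ns - min_ns) * (ratio - 1.0) / (power - 1.0);

    edges[0] = max_ns;
    for (int k = 1; k <= NUM_BINS; k++) {
        edges[k] = edges[k - 1] - width;
        width *= ratio;
    }
}

/* counts[0] is the path count of Bin 1 (slowest), counts[1] of Bin 2.
 * Returns 1 when both stopping conditions hold. */
int should_stop(const int counts[NUM_BINS], int paths_are_close,
                int transformation_found)
{
    int slow_paths = counts[0] + counts[1];
    return slow_paths >= 15 && paths_are_close && !transformation_found;
}
```

With a minimum delay of 1.715 ns, a maximum of 6.427 ns and a ratio of 1.5, the computed boundaries agree with the listed bins to within about a picosecond, and the listed counts of 5 and 10 paths in Bins 1 and 2 reach the threshold of 15.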
An example of the application of this strategy is shown in Figures 5-2 and 5-3. The geometric
delay profile of the logic circuit miim, shown in Figure 5-2, indicates its 15 slowest paths. These
15 paths are highlighted in red in Figure 5-3 and lie in close proximity to one another. No further
optimization of the circuit was performed, as none of our strategies could further improve the
logic circuit in a reasonable amount of time.
5.7 Summary
This chapter presented the results obtained using the manual editor Augur on a benchmark
suite of 10 logic circuits. The logic circuits in
the benchmark suite were synthesized, placed and routed using a rigorous method, which yielded a
high logic circuit speed and therefore a fair point of reference.
Using only placement and packing modifications the average logic circuit speed was
improved by 3.0 per cent. The introduction of logic synthesis transformations improved those results
by another 6.7 per cent, resulting in an average speed improvement with respect to the baseline of
9.9 per cent. These results show that the application of omniscience in the context of logic synthesis
yields significant logic circuit speed improvement.
In addition to the speed results, this chapter presented a set of logic circuit optimization
strategies. These strategies are based on the observations made during the logic circuit improvement
process that yielded the above results. The optimization strategies can be used as a base for
automated algorithms in commercial CAD tools.
The results presented in this chapter show that the development of Augur was a success.
However, there are still a number of things that can be improved. These improvements, along with
avenues for future research based on the approach presented in this thesis, are the topic of the
next chapter.
6 Conclusion
6.2 Future Work
While Augur is a good manual CAD tool, there are still a number of issues that need to be
addressed. Currently, the user is not informed of an estimated delay of the logic subcircuit while
performing a logic transformation. This information could better assist the user in performing a logic
optimization. Furthermore, the remapping transformations only result in the creation of a single
LUT, Joint-LUT or a Joint-Slice. An alternate approach would be to search for a mapping solution
which consists of multiple logic structures.
An interesting topic of future research is the development of logic block exploration
software based on the approach presented in this dissertation. The following discussion describes
how this approach could be applied in the context of architecture exploration.
There are several issues concerning the logic block design that need to be considered by the
logic block design exploration software. They are:
1. The number of outputs available to the CLB
2. The kind of hardwired logic to include and its relation to the LUTs
3. Output sharing
This work has shown that multiple outputs for a complex logic configuration can be beneficial. In
traditional CAD tools the problem of multiple-output functions is resolved by logic duplication.
Thus, a multi-output function becomes a set of single output logic functions, which implement the
same functionality. The mapping of each of these functions is performed independently, thus a
possible sharing of resources, which arises in the context of complex logic structures, may be
omitted. An alternate approach presented in this thesis shows that there is a significant benefit from
taking into consideration multi-output complex logic structures, such as the Joint-LUT and the Joint-
Slice.
A topic related to multiple-output CLBs is hardwired logic. In this work the hardwired
logic considered is the set of multiplexers in the Virtex-E slice. This thesis showed how to map
into logic structures of this type. When exploring FPGA architectures, it would be beneficial to
implement a black box instead of a multiplexer. Then, a study of which basic function yields the best
results on average could result in a different choice for a logic component.
Finally, there is a question of how the outputs of a CLB should be shared. Providing each
output of a CLB with a distinct pin would be very flexible. However, the area and delay penalty may
be too great if all output pins are not frequently used.
References
[1] J. Lou, W. Chen and M. Pedram, “Concurrent Logic Restructuring and Placement for Timing
Closure,” Proc. of the 1999 ACM/IEEE Int. Conf. on CAD, November 1999.
[2] J. Lin, A. Jagannathan and J. Cong, “Placement-Driven Technology Mapping for LUT-Based
FPGAs,” ACM/SIGDA Int. Symp. on FPGAs, February 2003, Monterey, California, USA.
[4] J. Cong and Y. Ding, “FlowMap: an optimal technology mapping algorithm for delay
optimization in lookup-table based FPGA designs,” IEEE Trans. on CAD of Integrated
Circuits and Systems, vol. 13, issue 1, 1994, pp. 1-12.
[6] W. Chow and J. Rose, “EVE: A CAD Tool for Manual Placement and Pipelining Assistance
of FPGA Circuits,” Proc. of ACM/SIGDA Int. Symp. on FPGAs, February 2002, Monterey,
California, USA.
0-07-016500-9
A. Lu, H. Eisenmann, G. Stenz, and F. M. Johannes, “Combining Technology Mapping with
Post-Placement Resynthesis for Performance Optimization,” Proc. of Int. Conf. on Computer
Design: VLSI in Computers and Processors, October 1998, pp. 616-621.
R. McCready and J. Rose, “Real-Time Face Detection on Configurable Hardware System,”
Tenth International Workshop on Field Programmable Logic, 2000, pp.157-162.
S. Chang, M. Marek-Sadowska, T. Hwang, “Technology Mapping for TLU FPGA’s based
on Decomposition of Binary Decision Diagrams,” IEEE Trans. On CAD of Integrated
Circuits and Systems, Vol. 5, No. 10, pp. , October 1996.
[13] K. Schabas and S. D. Brown, “Logic Synthesis and mapping: Using logic duplication to
improve performance in FPGAs,” Proc. of ACM/SIGDA Int. Symp. on FPGAs, February
2003, Monterey, California, USA
[14] Synplify 7.1 Pro is a product developed by Synplicity Incorporated. Online product link is:
http://www.synplicity.com/products/synplifypro/index.html
[15] Xilinx ISE 5.1, Service Pack 3, is a product of Xilinx Corporation. Online product link is:
http://www.xilinx.com/xlnx/xil_prodcat_landingpage.jsp?title=ISE+Foundation
[16] P. Bade, W. Chow, P. Kundarewich, N. Saniei, and A. Wong, “StarBurst ATM Chip project
at the University of Toronto,” October 2000.
[17] J. Fender and J. Rose, “A High Speed Ray Tracing Engine Built on a Field programmable
System,” submitted to the ACM/SIGDA International Conference on FPGAs, 2004.
[20] S. Hauck, M. Hosler, and T.Fry, “High-Performance Carry Chains for FPGAs,” Proc. of the
ACM/SIGDA Int. Symp. on FPGAs, Monterey, California, February 1998, pp. 223-233.
[21] K. Chen, J. Cong, Y. Ding, A. Kahng and P. Trajmar, “DAG-Map: Graph Based FPGA
Technology Mapping For Delay Optimization,” IEEE Design and Test, September 1992,
pp.7-20.
[22] H. Yang and D. F. Wong, “Edge-Map: Optimal Performance Driven Technology Mapping
for Iterative LUT Based FPGA Designs,” 1994, pp. 150-155.
[23] J. P. Roth and R. M. Karp, “Minimization over boolean graphs,” IBM Journal, April 1962,
pp 227-238.
E. Lehman, Y. Watanabe, J. Grodstein, H. Harkness, “Logic Decomposition During
[25] W. Chow, “EVE: A CAD Tool Providing Placement and Pipelining Assistance for High-
Speed FPGA Circuit Designs,” Master of Applied Science Thesis, University of Toronto,
2001.
[26] J. Cong and Y. Hwang, “Boolean Matching for LUT-Based Logic Blocks With Applications
to Architecture Evaluation and Technology Mapping,” IEEE Trans. on CAD of Integrated
Circuits and Systems, Vol. 20, No. 9, September 2001, pp. 1077-1090.
[27] Douglas B. West, Introduction to Graph Theory, Second Edition, Prentice-Hall Inc., 2001,
ISBN 0-13-014400-2.
[29] V. Betz and J. Rose, “VPR: A new packing, placement and routing tool for FPGA research,”
Seventh International Workshop on Field Programmable Logic, September 1997, pp. 213-
222.
A Appendix
[Figure: logic slice schematic; legible pin labels include COUT, G4, BY, XB, F5, F4, F3, CE, CLK]