Arithmetic Optimization Techniques for Hardware and Software Design
Obtain better system performance, lower energy consumption, and avoid hand-
coding arithmetic functions with this concise guide to automated optimization
techniques for hardware and software design. High-level compiler optimizations
and high-speed architectures for implementing FIR filters are covered, which can
improve performance in communications, signal processing, computer graphics,
and cryptography. Clearly explained algorithms and illustrative examples through-
out make it easy to understand the techniques and write software for their imple-
mentation. Background information on the synthesis of arithmetic expressions and
computer arithmetic is also included, making the book ideal for newcomers to
the subject. This is an invaluable resource for researchers, professionals, and
graduate students working in system level design and automation, compilers, and
VLSI CAD.
FARZAN FALLAH
Stanford University
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore,
São Paulo, Delhi, Dubai, Tokyo
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521880992
© Cambridge University Press 2010
1 Introduction 1
1.1 Overview 1
1.2 Salient features of this book 5
1.3 Organization 6
1.4 Target audience 7
3 Software compilation 21
3.1 Chapter overview 21
3.2 Basic software compiler structure 21
3.3 Algebraic transformations in optimizing software compilers 25
3.4 Summary 33
4 Hardware synthesis 35
4.1 Chapter overview 35
4.2 Hardware synthesis design flow 35
4.3 System specification 38
4.4 Program representation 39
4.5 Algorithmic optimization 44
4.6 Resource allocation 45
4.7 Operation scheduling 49
6 Polynomial expressions 95
6.1 Chapter overview 95
6.2 Polynomial expressions 95
6.3 Problem formulation 96
6.4 Related optimization techniques 96
6.5 Algebraic optimization of arithmetic expressions 99
6.6 Experimental results 113
6.7 Optimal solutions for reducing the number of operations
in arithmetic expressions 117
6.8 Summary 123
Index 182
Abbreviations
1.1 Overview
Arithmetic is one of the oldest topics in computing. It dates back to the many early
civilizations that used the abacus to perform arithmetic operations. The seventeenth
and eighteenth centuries brought many advances with the invention of mechanical
counting machines like the slide rule, Schickard’s Calculating Clock, Leibniz’s
Stepped Reckoner, the Pascaline, and Babbage’s Difference and Analytical Engines.
The vacuum tube computers of the early twentieth century were the first program-
mable, digital, electronic, computing devices. The introduction of the integrated
circuit in the 1950s heralded the present era where the complexity of computing
resources is growing exponentially. Today’s computers perform extremely advanced
operations such as wireless communication and audio, image, and video processing,
and are capable of performing over 10¹⁵ operations per second.
Because computer arithmetic is a well-studied field, it should
come as no surprise that there are many books on the various subtopics of
computer arithmetic. This book provides a focused view on the optimization of
polynomial functions and linear systems. The book discusses optimizations that
are applicable to both software and hardware design flows; e.g., it describes the
best way to implement arithmetic operations when your target computational
device is a digital signal processor (DSP), a field programmable gate array
(FPGA) or an application specific integrated circuit (ASIC).
Polynomials are among the most important functions in mathematics and are
used in algebraic number theory, geometry, and applied analysis. Polynomial
functions appear in applications ranging from basic chemistry and physics to
economics, and are used in calculus and numerical analysis to approximate other
functions. Furthermore, they are used to construct polynomial rings, a powerful
concept in algebra and algebraic geometry.
One of the most important computational uses of polynomials is function
evaluation, which lies at the core of many computationally intensive applications.
Elementary functions such as sin, cos, tan, sin⁻¹, cos⁻¹, sinh, cosh, tanh, exponen-
tiation and logarithm are often approximated using a polynomial function.
Producing an approximation of a function with the required accuracy in a rather
large interval may require a polynomial of a large degree. For instance, approximating
the function ln(1 + x) in the range [−1/2, 1/2] with an error less than 10⁻⁸ is one
such case.
[Figure: embedded system design flow, from computational analysis and system specification through hardware/software partitioning to the register transfer level description and the final embedded system.]
Hardware synthesis tools generate a register transfer level (RTL) description from the
behavior represented in the HDL [4]. In addition, the tools perform optimizations
such as redundancy elimination (common subexpression elimination (CSE) and
value numbering) and critical path minimization. The constant multiplications in
the linear systems and polynomials can be decomposed into shifts and additions
and the resulting complexity can be further reduced by eliminating common
subexpressions [5–8]. Furthermore, there are some numeric transformations of
the constant coefficients that can be applied to linear transforms to reduce the
strength of the operations [9, 10]. This book provides an in-depth discussion of
such transforms. The order and priorities of the various optimizations and
transformations are largely application dependent and are the subject of current
research. In most cases, this is done by evaluating a number of transformations
and selecting the one that best meets the constraints [11]. The RTL description is
then synthesized into a gate level netlist, which is subsequently placed and routed
using standard physical design tools.
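As a rough illustration of decomposing a constant multiplication into shifts and additions (mentioned above), the following Python sketch rewrites multiplication by a constant as a sum of left-shifted terms taken from the constant's binary representation; the function name is ours and it is not part of any particular synthesis tool. Eliminating shifted terms that are shared across several such expressions is then a common subexpression elimination problem.

def constant_mult_to_shift_adds(constant, var="x"):
    """Rewrite var * constant as a sum of left shifts of var.

    Each set bit k of the (non-negative) constant contributes the
    term (var << k); a multiplier is replaced by shifts and adds.
    """
    terms = []
    bit = 0
    while constant:
        if constant & 1:
            terms.append(var if bit == 0 else "({} << {})".format(var, bit))
        constant >>= 1
        bit += 1
    return " + ".join(terms) if terms else "0"

# 105 = 0b1101001, so y = 105 * x becomes x + (x << 3) + (x << 5) + (x << 6).
print(constant_mult_to_shift_adds(105, "x"))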
For the software portion of the design, custom instructions tuned to the
particular application may be added [12–14]. Certain computation intensive
kernels of the application may require platform dependent software in order to
achieve the best performance on the available architecture. This is often done
manually by selecting the relevant functions from optimized software libraries.
For some domains, including signal processing applications, automatic library
generators are available [11]. The software is then compiled using various trans-
formations and optimization techniques [15]. Unfortunately, these compiler
optimizations perform limited transformations for reducing the complexity of
polynomial expressions and linear systems. For some applications, the generated
assembly code is optimized (mostly manually) to improve performance, though it
is not practical for large and complex programs. An assembler and a linker are
then used to generate the executable code.
Opportunities for optimizing polynomial expressions and linear systems exist
for both the hardware and the software implementations. These optimizations
have the potential for huge impact on the performance and power consumption
of the embedded systems. This book presents techniques and algorithms for
performing such optimizations during both the hardware design flow and the
software compilation.
The unique feature of this book is its treatment of the hardware synthesis and
software compilation of arithmetic expressions. It is the first book to discuss
automated optimization techniques for arithmetic expressions. The previous
literature on this topic, e.g., [16] and [17], deals only with the details of implement-
ing arithmetic intensive functions, but stops short of discussing techniques to
optimize them for different target architectures. The book gives a detailed intro-
duction to the kind of arithmetic expressions that occur in real-life applications,
such as signal processing and computer graphics. It shows the reader the import-
ance of optimizing arithmetic expressions to meet performance and resource
constraints and improve the quality of silicon. The book describes in detail the
different techniques for performing hardware and software optimizations. It also
describes how these techniques can be tuned to improve different parameters such
as the performance, power consumption, and area of the synthesized hardware.
Though most of the algorithms described in it are heuristics, the book also shows
how optimal solutions to these problems can be modeled using integer linear
programming (ILP). The usefulness of these techniques is then verified by applying
them on real benchmarks.
In short, this book gives a comprehensive overview of an important problem
in the design and optimization of arithmetic intensive embedded systems.
It describes in detail the state of the art techniques that have been developed to
solve this problem. This book does not go into detail about the mathematics
behind the arithmetic expressions. It assumes that system designers have per-
formed an analysis of the system and have come up with a set of polynomial
equations that describe the functionality of the system, within an acceptable error.
Furthermore, it assumes that the system designer has decided what is the best
architecture (software, ASIC or FPGA or a combination of them) to implement
the arithmetic function. The book does not talk about techniques to verify the
precision of the optimized arithmetic expressions. Techniques such as those dis-
cussed in [2] and [18] can be used to verify if the expressions produce errors within
acceptable limits.
1.4 Target audience
When writing this book we had several audiences in mind. Much of the material
is targeted towards specialists, whether they be researchers in academia or
industry, who are designing both software and hardware for polynomial expres-
sions and/or linear systems. The book also provides substantial background of
the state of the art algorithms for the implementation of these systems, and
serves as a reference for researchers in these areas. This book is designed to
accommodate readers with different backgrounds, and the book includes some
basic introductory material on several topics including computer arithmetic,
software compilation, and hardware synthesis. These introductory chapters give
just enough background to demonstrate basic ideas and provide references to
gain more in-depth information. Most of the book can be understood by anyone
with a basic grounding in computer engineering. The book is suitable for
graduate students, either as a reference or as a textbook for a specialized class
on the topics of hardware synthesis and software compilation for linear systems
and polynomial expressions. It is also suitable for an advanced topics class for
undergraduate students.
References
[4] G.D. Micheli, Synthesis and optimization of digital circuits, New York, NY:
McGraw-Hill, 1994.
[5] M. Potkonjak, M.B. Srivastava, and A.P. Chandrakasan, Multiple constant
multiplications: efficient and versatile framework and algorithms for exploring
common subexpression elimination, IEEE Transactions on Computer Aided Design
of Integrated Circuits and Systems, 15(2), 151–65, 1996.
[6] R. Pasko, P. Schaumont, V. Derudder, V. Vernalde, and D. Durackova, A new
algorithm for elimination of common subexpressions, IEEE Transactions on Computer
Aided Design of Integrated Circuits and Systems, 18(1), 58–68, 1999.
[7] R. Pasko, P. Schaumont, V. Derudder, and D. Durackova, Optimization method for
broadband modem FIR filter design using common subexpression elimination,
International Symposium on System Synthesis, 1997. Washington, DC: IEEE Computer
Society, 1997.
[8] A. Hosangadi, F. Fallah, and R. Kastner, Common subexpression elimination
involving multiple variables for linear DSP synthesis, IEEE International Conference on
Application-Specific Architectures and Processors, 2004. Washington, DC: IEEE
Computer Society, 2004.
[9] A. Chatterjee, R.K. Roy, and M.A. D’Abreu, Greedy hardware optimization
for linear digital circuits using number splitting and refactorization, IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, 1(4), 423–31, 1993.
[10] H.T. Nguyen and A. Chatterjee, Number-splitting with shift-and-add decomposition
for power and hardware optimization in linear DSP synthesis, IEEE Transactions on
Very Large Scale Integration (VLSI) Systems, 8, 419–24, 2000.
[11] M. Puschel, B. Singer, J. Xiong, et al., SPIRAL: a generator for platform-adapted
libraries of signal processing algorithms, Journal of High Performance Computing and
Applications, 18, 21–45, 2004.
[12] R. Kastner, S. Ogrenci-Memik, E. Bozorgzadeh, and M. Sarrafzadeh, Instruction
generation for hybrid reconfigurable systems, International Conference on Computer
Aided Design. New York, NY: ACM, 2001.
[13] A. Peymandoust, L. Pozzi, P. Ienne, and G. De Micheli, Automatic instruction set
extension and utilization for embedded processors, IEEE International Conference on
Application-Specific Systems, Architectures, and Processors, 2003. Washington, DC:
IEEE Computer Society, 2003.
[14] Tensilica Inc., http://www.tensilica.com.
[15] S.S. Muchnick, Advanced Compiler Design and Implementation, San Francisco, CA:
Morgan Kaufmann Publishers, 1997.
[16] J.P. Deschamps, G.J.A. Bioul, and G.D. Sutter, Synthesis of Arithmetic Circuits:
FPGA, ASIC and Embedded Systems, New York, NY: Wiley-Interscience, 2006.
[17] U. Meyer-Baese, Digital Signal Processing with Field Programmable Gate Arrays,
third edition. Springer, 2007.
[18] C. Fang Fang, R.A. Rutenbar, M. Puschel, and T. Chen, Toward efficient static
analysis of Finite-Precision effects in DSP applications via affine arithmetic modeling,
Design Automation Conference. New York, NY: ACM, 2003.
2 Use of polynomial expressions
and linear systems
Polynomial expressions and linear systems are found in a wide range of applica-
tions: perhaps most fundamentally, Taylor’s theorem states that any differentiable
function can be approximated by a polynomial. Polynomial approximations are
used extensively in computer graphics to model geometric objects. Many of the
fundamental digital signal processing transformations are modeled as linear
systems, including FIR filters, DCT and H.264 video compression. Cryptographic
systems, in particular, those that perform exponentiation during public key
encryption, are amenable to modeling using polynomial expressions. Finally,
address calculation during data intensive applications requires a number of add
and multiply operations that grows larger as the size and dimension of the array
increases. This chapter describes these and other applications that require arith-
metic computation. We show that polynomial expressions and linear systems are
found in a variety of applications that are driving the embedded systems and high-
performance computing markets.
sin(x) = x − x³/3! + x⁵/5! − x⁷/7!.   (2.1)
This is a polynomial of degree 7 that approximates the sine function. Assuming
that the terms 1/3!, 1/5!, and 1/7! are precomputed (these will be denoted as S3, S5,
and S7, respectively), the approximation can be evaluated as
d1 = x·x,
d2 = S5 − S7·d1,
d3 = d2·d1 − S3,
d4 = d3·d1 + 1,
sin(x) = x·d4.
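The factored evaluation above can be checked with a few lines of Python; the helper name approx_sin and the use of double-precision arithmetic are our own choices for illustration.

import math

S3, S5, S7 = 1 / math.factorial(3), 1 / math.factorial(5), 1 / math.factorial(7)

def approx_sin(x):
    """Degree-7 approximation of sin(x) using the factored form above."""
    d1 = x * x            # x^2
    d2 = S5 - S7 * d1     # S5 - S7*x^2
    d3 = d2 * d1 - S3     # S5*x^2 - S7*x^4 - S3
    d4 = d3 * d1 + 1      # 1 - S3*x^2 + S5*x^4 - S7*x^6
    return x * d4         # x - S3*x^3 + S5*x^5 - S7*x^7

print(approx_sin(0.5), math.sin(0.5))   # the two values agree closely for small |x|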
A spline – a piecewise polynomial – is often preferred to having one high-degree polynomial approximation for the entire curve. In general,
the spline interpolation yields similar accuracy to modeling the same curve using one
higher-degree polynomial. Therefore, splines are less computationally complex.
Consider a quartic spline – a spline where the polynomials have degree less than
or equal to 4, i.e., k = 4. A quartic spline is smooth in both first and second
derivatives and continuous in the third derivative. The unoptimized polynomial
expression representing a quartic spline is

P = zu⁴ + 4au³v + 6bu²v² + 4wuv³ + qv⁴.   (2.2)

Extracting common subexpressions, this can be rewritten as

d1 = u²;  d2 = v²;  d3 = uv;
P = d1²z + 4a·d1·d3 + 6b·d1·d2 + 4w·d2·d3 + q·d2².   (2.3)
Note that three two-term common subexpressions were extracted: u², v², and uv.
This form requires 16 multiplications and 4 additions, reducing the number of
multiplications by seven from the straightforward implementation. Alternatively,
the Horner form can be used to reduce the number of operations. The Horner
form is a way of efficiently computing a polynomial by viewing it as a linear
combination of monomials. In the Horner form, a polynomial is converted into a
nested sequence of multiplications and additions, which is very efficient to com-
pute with multiply accumulate (MAC) operations; it is a popular form for evalu-
ating many polynomials in signal processing libraries including the GNU C library [3].
The Horner form of the quartic spline polynomial is

P = u(u(u(uz + 4av) + 6bv²) + 4wv³) + qv⁴.   (2.4)

Applying algebraic optimization techniques instead yields a cheaper form:

d1 = v²;  d2 = 4v;
P = u³(uz + a·d2) + d1(q·d1 + u(w·d2 + 6bu)).   (2.5)
The output value y[n] is computed by multiplying the L most recent input
samples from the input vector x by a set of constant coefficients stored in the h
vector, where |h| = L. Equivalently, h[k] represents the kth constant coefficient of
the filter, x[n] represents the input time series, and y[n] is the output time series.
The constants vary depending on the type of filter (e.g., low-pass, high-pass,
Butterworth).
There are many different implementations for an FIR filter. The conventional
tapped delay-line realization of this inner product is shown in Figure 2.1. The com-
putation of each output sample consists of L constant multiplications and L − 1
additions.
Figure 2.1 The tapped delay line representation of an FIR filter with L taps. Each output
sample requires L constant multiplications and L − 1 additions.
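A direct, unoptimized software realization of this tapped delay line might look as follows; the function name and the example coefficients are illustrative only.

def fir_filter(x, h):
    """Direct-form FIR filter: y[n] = sum over k of h[k] * x[n - k].

    Each output sample costs L multiplications and L - 1 additions,
    matching the tapped delay line of Figure 2.1.
    """
    L = len(h)
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(L):
            if n - k >= 0:
                acc += h[k] * x[n - k]
        y.append(acc)
    return y

# A 4-tap moving-average filter applied to a short input sequence.
print(fir_filter([1, 2, 3, 4, 5], [0.25, 0.25, 0.25, 0.25]))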
As an example, consider the DCT [4], which is commonly used for compression
in many signal processing systems. For example, the DCT is used in both JPEG
and MPEG compression [5]. The DCT expresses data as a sum of sinusoids, in a
Figure 2.2 The constant matrix for a four-point DCT. The matrix on the right provides
a simple substitution of the variables A, B, C, and D for the constants cos(0), cos(π/8),
cos(3π/8), and cos(π/4), respectively.
Figure 2.3 (a) A four-point DCT represented as a multiplication of input vector with
a constant matrix and (b) the corresponding set of equations.
similar manner to the Fourier transform. In fact, it is a special case of the DFT [6],
but uses only real numbers (corresponding to the cosine values of complex
exponentials, hence the name). The DCT has strong energy compaction, which
is ideal for compression of image and video data, where most of the signal infor-
mation is found in the lower-frequency components.
DCT can be modeled according to Equation (2.7), where the constant matrix
(C) and a vector of input samples (X) are multiplied to compute the output vector Y.
The constant matrix for a four-point DCT is shown in Figure 2.2. The matrix
multiplication with a vector of input samples is shown in Figure 2.3. In the figures,
A, B, C and D can be viewed as distinct constants. The straightforward computation
of this matrix multiplication requires 16 multiplications and 12 additions/subtrac-
tions. In general O(N2) operations are required for an N-point DCT. However,
by extracting common factors, these expressions can be rewritten as shown in
Figure 2.4. This implementation is cheaper than the original implementation by
ten multiplications and four additions/subtractions. In general, factorization of
DCT equations can reduce the number of operations to O(N log N). However, these
optimizations are typically done manually in hand-coded signal processing libraries;
the methods discussed in this book can extract these common factors and common
subexpressions automatically.
The 4 × 4 linear integer transform used in H.264 [5] is another example of
a linear system found in signal processing. H.264 is a digital video codec that achieves
a very high data compression rate. The integer transform for the video encoding
can be optimized in a similar manner; Figure 2.5 shows the result.
d1 = x0 + x3
d2 = x1 + x2
d3 = x1 – x2
d4 = x0 – x3
y0 = A × (d1 + d2)
y1 = B × d4 + C × d3
y2 = D × (d1 – d2)
y3 = C × d4 – B × d3
Figure 2.4 The four-point DCT after using techniques to eliminate common subexpressions.
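The savings can be checked with a short sketch (our own, with the constants taken from Figure 2.2): it evaluates a four-point DCT both in the straightforward form of Figure 2.3 and in the factored form of Figure 2.4, and the two produce the same outputs.

import math

# Constants from Figure 2.2 (unnormalized four-point DCT).
A = math.cos(0)
B = math.cos(math.pi / 8)
C = math.cos(3 * math.pi / 8)
D = math.cos(math.pi / 4)

def dct4_direct(x):
    """Straightforward form: 16 multiplications and 12 additions/subtractions."""
    x0, x1, x2, x3 = x
    return [A*x0 + A*x1 + A*x2 + A*x3,
            B*x0 + C*x1 - C*x2 - B*x3,
            D*x0 - D*x1 - D*x2 + D*x3,
            C*x0 - B*x1 + B*x2 - C*x3]

def dct4_factored(x):
    """Factored form of Figure 2.4: 6 multiplications and 8 additions/subtractions."""
    x0, x1, x2, x3 = x
    d1, d2 = x0 + x3, x1 + x2
    d3, d4 = x1 - x2, x0 - x3
    return [A*(d1 + d2), B*d4 + C*d3, D*(d1 - d2), C*d4 - B*d3]

print(dct4_direct([1, 2, 3, 4]))
print(dct4_factored([1, 2, 3, 4]))    # same four values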
¹ Let DCT A = H.264 A, DCT B = H.264 B, DCT C = H.264 A, DCT D = H.264 A.
Figure 2.5 H.264 integer transform after extracting common subexpressions and applying
strength reduction on the constant multiplications.
2.5 Cryptography
(a) Method of squaring:

t1 = x · x = x²              (10)
t2 = t1 · t1 = x⁴            (100)
t3 = t2 · x = x⁵             (101)
t4 = t3 · t3 = x¹⁰           (1010)
t5 = t4 · x = x¹¹            (1011)
t6 = t5 · t5 = x²²           (10110)
t7 = t6 · t6 = x⁴⁴           (101100)
t8 = t7 · x = x⁴⁵            (101101)
t9 = t8 · t8 = x⁹⁰           (1011010)
t10 = t9 · x = x⁹¹           (1011011)
t11 = t10 · t10 = x¹⁸²       (10110110)
t12 = t11 · t11 = x³⁶⁴       (101101100)
t13 = t12 · x = x³⁶⁵         (101101101)
t14 = t13 · t13 = x⁷³⁰       (1011011010)
t15 = t14 · x = x⁷³¹         (1011011011)
t16 = t15 · t15 = x¹⁴⁶²      (10110110110)
t17 = t16 · t16 = x²⁹²⁴      (101101101100)
t18 = t17 · x = x²⁹²⁵        (101101101101)

(b) Eliminating common computations:

P = 101101101101 = 2925
d1 = 101
d2 = d1 + d1 << 3
P = d2 + d2 << 6

t1 = x · x = x²              (10)
t2 = t1 · t1 = x⁴            (100)
t3 = t2 · x = x⁵             (101)  (d1)
t4 = t3 · t3 = x¹⁰           (1010)
t5 = t4 · t4 = x²⁰           (10100)
t6 = t5 · t5 = x⁴⁰           (101000)
t7 = t6 · t3 = x⁴⁵           (101101)  (d2)
t8 = t7 · t7 = x⁹⁰           (1011010)
t9 = t8 · t8 = x¹⁸⁰          (10110100)
t10 = t9 · t9 = x³⁶⁰         (101101000)
t11 = t10 · t10 = x⁷²⁰       (1011010000)
t12 = t11 · t11 = x¹⁴⁴⁰      (10110100000)
t13 = t12 · t12 = x²⁸⁸⁰      (101101000000)
t14 = t13 · t7 = x²⁹²⁵       (101101101101)
Figure 2.6 Exponentiation: (a) using the method of squaring, and (b) eliminating common
computations. The number next to the equations denotes the binary representation of
the current exponent.
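The reuse in Figure 2.6(b) is easy to sketch in Python; the helper names are ours, and the decomposition follows the repeated bit pattern 2925 = 101101101101₂, i.e., 2925 = 45 + 45 · 2⁶.

def power_by_squaring(x, e):
    """Left-to-right square-and-multiply exponentiation (Figure 2.6(a))."""
    result = 1
    for bit in bin(e)[2:]:
        result *= result            # squaring step for every bit
        if bit == "1":
            result *= x             # extra multiply for a set bit
    return result

def power_2925_shared(x):
    """Compute x^2925 by reusing x^45, as in Figure 2.6(b)."""
    d1 = power_by_squaring(x, 5)    # x^5   (binary 101)
    d2 = d1 * (d1 ** 8)             # x^45  (binary 101101) = x^5 * (x^5)^8
    t = d2
    for _ in range(6):              # six squarings: (x^45)^(2^6) = x^2880
        t *= t
    return d2 * t                   # x^45 * x^2880 = x^2925

x = 3
assert power_by_squaring(x, 2925) == power_2925_shared(x) == x ** 2925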
2.7 Summary
References
[11] B. Schneier, Applied Cryptography: Protocols, Algorithms and Source Code in C, second
edition. New York, NY: John Wiley and Sons Inc, 1996.
[12] P. Downey, B. Leong, and R. Sethi, Computing sequences with addition chains,
SIAM Journal of Computing, 10, 638–46, 1981.
[13] M. A. Miranda, F. V. M. Catthoor, M. Janssen, and H. J. De Man, High-level address
optimization and synthesis techniques for data-transfer-intensive applications, IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, 6, 677–86, 1998.
3 Software compilation
[Figure 3.1 flow: characters → lexical analysis → syntax tree → intermediate code generation → intermediate code → analysis, optimization, and code generation → output program.]
Figure 3.1 The basic structure of a compiler. Compilers are divided into two stages:
a frontend and a backend. The goal is to translate a source program into an output
program; this requires many different optimizations.
A compiler takes a source program as input and performs a series of steps to transform that program into an output program. We briefly describe each
of these steps in further detail.
Lexical analysis is the first step in compilation; it is often called “lexing” or
“scanning.” This is the act of breaking the input into a set of words or tokens.
A token is an atomic unit in the programming language and commonly includes
variable names, operations, type identifiers, keywords, numbers, and symbols.
One can draw a parallel between lexical analysis and converting letters to words.
Most specification languages specify the token syntax using a regular
language, and, therefore, valid tokens can be represented using a set of regular
expressions. Since every regular expression has an equivalent finite automaton,
we can recognize tokens by scanning the input program one character at a time,
following the appropriate transitions in the finite automaton, and outputting valid
tokens when we reach certain specified states. This stage can find only limited
types of errors, more specifically errors involved in creating tokens. For example,
it can determine that the characters “12abc” are not valid in the C language
since C specifies that variables must start with an alphabetic character.
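As a toy illustration of this idea (not the scanner of any real compiler), the sketch below uses regular expressions to split a character stream into tokens and to reject malformed tokens such as 12abc; the token classes and names are hypothetical.

import re

# Hypothetical token classes for a tiny C-like language.
TOKEN_SPEC = [
    ("BADNUM", r"\d+[A-Za-z_]\w*"),   # e.g., "12abc": digits glued to letters
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),      # identifiers start with a letter or '_'
    ("OP",     r"[+\-*/=<>]"),
    ("SYMBOL", r"[();{}]"),
    ("SKIP",   r"\s+"),
    ("ERROR",  r"."),                 # any other character is a lexical error
]
LEXER = re.compile("|".join("(?P<{}>{})".format(n, p) for n, p in TOKEN_SPEC))

def tokenize(text):
    for m in LEXER.finditer(text):
        kind = m.lastgroup
        if kind == "SKIP":
            continue
        if kind in ("BADNUM", "ERROR"):
            raise SyntaxError("invalid token {!r}".format(m.group()))
        yield kind, m.group()

print(list(tokenize("count = count + 12;")))
# tokenize("12abc = 4;") raises SyntaxError: invalid token '12abc'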
Syntactic analysis takes the set of tokens from the lexical analysis stage and
groups them into meaningful phrases. This is most often done by creating a tree of
tokens, a parse tree, which specifies the relationship between the tokens. The tree
3.2 Basic software compiler structure 23
is built according to the rules of the formal grammar as denoted in the input
language specification. The parse tree is used in the subsequent stages for analysis
and optimization. In some sense, this stage can be viewed as grouping words
(tokens) into sentences (valid structures in the language). This stage is also often
referred to as “parsing.”
Semantic checking analyzes the parse tree to verify that the input program abides
by the requirements of the specification language. Several properties are con-
firmed. For example, object binding associates the use of every variable/function
to its definition. Definite assignment verifies that every variable is defined before it
is used. Type checking is performed on expressions to ensure that operations are
being performed on variables of the appropriate type. A symbol table, which
stores each variable’s type and location, is built during this stage and used for
checking as well as in the later stages of compilation.
The frontend of the compiler ends with the intermediate code generation. This
stage transforms the syntax tree into another representation. This representation
varies from compiler to compiler and depends on the input specification lan-
guage(s) that the compiler accepts as well as the target output language(s) that
the compiler produces. Optimizing compilers often use more than one inter-
mediate representation. In general, the representation is the starting point of the
transformation into the final output program. Therefore, the intermediate code
often looks somewhat similar to the output code. The subsequent optimizations
perform transformations on this intermediate code; hence, the representation
must be easy to change. Furthermore, it should retain the important features
of the input code, while simplifying the code by removing the unimportant
features.
We now discuss two common models of computation used for intermediate
representations – the data flow graph (DFG) and the control flow graph (CFG).
These graphs show the dependencies between operations in the code. Figure 3.2
displays the CFG for an implementation of a factorial function. The function is
broken into a set of basic blocks, which are the nodes of a CFG. A basic block is a
sequence of consecutive intermediate language statements in which flow of control
can only enter at the beginning and leave at the end. In other words, a basic block
is an atomic sequence of statements, i.e., if one of the statements is executed it
means that all other statements will also be executed. The arrows in the CFG
define control dependencies amongst the basic blocks. More formally, a CFG is a
directed multigraph in which: (1) the nodes are basic blocks and (2) the edges
represent flow of control (branches or fall-through execution). Note that the CFG
is formed statically; therefore, we have no information about the values of the
data. Hence, an edge in the CFG simply means there is a possibility to take that
path. Many arithmetic optimizations are performed on a CFG as we discuss in
Section 3.3.
A DFG is a directed acyclic graph where each node is a single instruction or
operation and each edge denotes a direct data dependency between the output of
one node and the input of another. Figure 3.2 shows a simple two-node DFG
Figure 3.2 CFG and DFG representations of the factorial function. The CFG displays
the control dependencies in the function while the DFG exhibits the data dependencies
for the statements within the function.
corresponding to the two statements in one of the basic blocks of the factorial
function. There are two operations in this basic block and equivalently two nodes
in the DFG. The subtract operation produces a data value n that is used by the
subsequent operation. Hence there is an edge from the subtract node to the
multiply node.
Most intermediate representations use some sort of CFG and DFG to model
dependencies. Of course, there are intermediate representations which use other
models of computation. This book focuses primarily on the CFG and the DFG.
We refer the interested reader to more advanced compiler books [1, 2] for further
information.
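One minimal way to hold these two graphs in memory is sketched below for a factorial-style loop; the data structures and field names are illustrative, not those of any particular compiler.

from dataclasses import dataclass, field

@dataclass
class BasicBlock:
    name: str
    statements: list
    successors: list = field(default_factory=list)   # CFG edges (possible control flow)

# CFG: one node per basic block, one edge per possible branch or fall-through.
cfg = {
    "entry": BasicBlock("entry", ["result = n"], ["test"]),
    "test":  BasicBlock("test",  ["if n <= 1 goto exit"], ["body", "exit"]),
    "body":  BasicBlock("body",  ["n = n - 1", "result = result * n"], ["test"]),
    "exit":  BasicBlock("exit",  ["return result"], []),
}

# DFG for the 'body' block: the subtraction produces the n consumed by the
# multiplication, so a single data edge connects the two operation nodes.
dfg_nodes = {"sub": "n - 1", "mul": "result * n"}
dfg_edges = [("sub", "mul")]

print(cfg["body"].successors, dfg_edges)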
The development of a compiler frontend is a fairly straightforward process.
There are a number of standard tools (e.g., lex, yacc [3]) to perform each of the
steps and the methodology is quite mature. On the other hand, most compiler
research is focused on the backend, which is still evolving. The backend may vary
significantly across optimizing compilers. As such, we will discuss the backend
stages at a high level, and focus our discussion on the portions that are pertinent to
this book. Referring again to Figure 3.1, we can see there are three stages in the
backend: analysis, optimization, and code generation.
The analysis stage gathers general information about the program structure.
Some typical analyses include deriving information about the data flow, control
flow, function calls, pointers, etc. The previously discussed CFG and DFG are
usually built during this stage. In addition, the call graph, which models function
calls, is often created at this time.
3.3 Algebraic transformations in optimizing software compilers
This section presents some related work on the optimization of arithmetic compu-
tations used in modern software compilers. The presented techniques are applied
to general purpose programs and arithmetic expressions. In particular, the discus-
sion focuses on various techniques for redundancy elimination used in modern
software compilers.
(a) Original flowgraph:

B1:  b = 5 * a
     c = 2 * b − 7
     if (b < c)
B2:  b = 0
B3:  d = 5 * a

(b) After eliminating the common subexpression 5 * a:

B1:  t = 5 * a
     b = t
     c = 2 * b − 7
     if (b < c)
B2:  b = 0
B3:  d = t
Figure 3.3 An example of applying CSE: (a) the original flowgraph, and (b) the flowgraph
after eliminating common subexpression 5 * a.
Redundancy elimination optimizations such as CSE are typically performed on an intermediate form of the code, such as the flowgraph
shown in Figure 3.3.
The local CSE procedure iterates through the basic block, adding entries to and removing them
from the list of available expressions (AEBs) as appropriate, inserting instructions to save the expressions’ values in
temporary variables, and modifying existing instructions to use the values saved
in temporary variables. The iteration stops when no further common subexpres-
sion exists.
The global CSE procedure operates on the entire function, or equivalently the
CFG, and finds the available expressions. An expression exp is said to be available
at the entry to a basic block if there is an evaluation of exp on every control path
from the entry to this block that is not killed before the entry to the basic block
(an expression is killed if one or more of its operands is assigned a new value). The
set of available expressions can be found as follows. Assume that EVAL(i) is the set
of expressions evaluated in block i available at the block’s exit. Further, assume
KILL(i) denotes the set of expressions killed by block i. EVAL(i) is computed by
scanning block i from the beginning to the end, accumulating the expressions
evaluated in it, and deleting those expressions whose operands are later assigned
new values inside the block. AEin(i) and AEout(i) represent the sets of available
expressions on entry to and exit from block i, respectively, as shown in the
data-flow equations in (3.1):

AEin(i) = ∩_{j ∈ Pred(i)} AEout(j)
AEout(i) = EVAL(i) ∪ (AEin(i) − KILL(i))      (3.1)
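A small sketch of how these equations can be solved by straightforward iteration is shown below, assuming EVAL and KILL have already been computed for each block; the function and the encoding of Figure 3.3 are our own.

def available_expressions(blocks, preds, eval_sets, kill_sets):
    """Iteratively solve the data-flow equations (3.1) to a fixed point."""
    universe = set().union(*eval_sets.values())
    ae_in = {b: set() for b in blocks}
    ae_out = {b: set(universe) for b in blocks}     # optimistic initialization
    changed = True
    while changed:
        changed = False
        for b in blocks:
            ae_in[b] = (set.intersection(*(ae_out[p] for p in preds[b]))
                        if preds[b] else set())
            new_out = eval_sets[b] | (ae_in[b] - kill_sets[b])
            if new_out != ae_out[b]:
                ae_out[b], changed = new_out, True
    return ae_in, ae_out

# Figure 3.3: B1 branches to B2 and B3, and B2 falls through to B3.
blocks = ["B1", "B2", "B3"]
preds = {"B1": [], "B2": ["B1"], "B3": ["B1", "B2"]}
eval_sets = {"B1": {"5*a", "2*b-7"}, "B2": set(), "B3": {"5*a"}}
kill_sets = {"B1": set(), "B2": {"2*b-7"}, "B3": set()}
ae_in, _ = available_expressions(blocks, preds, eval_sets, kill_sets)
print(ae_in["B3"])   # {'5*a'}: the recomputation d = 5*a can become d = t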
Figure 3.4 An example of value numbering: (a) the original code, (b) the code
transformed using value numbering, and (c) the simplified code.
(a)  b = a + 3
     c = a
     d = c + 3

(b)  b = 5 * a
     if (a > 0)
         c = 5 * a
Figure 3.5 Examples showing the difference between value numbering and CSE’s capabilities:
(a) an example which can be simplified using value numbering, but not CSE, and (b) an
example which can be simplified using CSE, but not value numbering.
In Figure 3.5(a), value numbering determines that b and d always hold the same
value, while CSE cannot, because the expressions a + 3 and c + 3 do not match under a
simple lexicographic search. On the other hand, Figure 3.5(b) shows an example
where global CSE is able to determine that expression (5 * a) appears twice, but
value numbering cannot detect it because variables b and c are not always equal.
For example, if the value of a is not greater than 0, then c = 5 * a will not be
executed; therefore, b and c may have different values.
The original formulation of value numbering operates on individual basic
blocks, but has been extended to a global form [5, 6]. To use value numbering
for basic blocks, hashing is used to partition expressions into classes. Upon
encountering an expression, its hash value is computed. If it is not already among
the expressions with that hash value, it is added to them. The hash function
and the expression matching function are defined to take commutativity of the
operators into account.
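A hash-table-based sketch of local value numbering, in the spirit described above, is given below; the tuple encoding of statements and all of the names are our own simplifications.

def value_number_block(statements):
    """Local value numbering over (dest, op, arg1, arg2) tuples of one basic block."""
    next_vn = 0
    var_vn = {}       # variable name  -> value number
    expr_vn = {}      # (op, vn, vn)   -> value number
    vn_home = {}      # value number   -> a variable currently holding it
    rewritten = []

    def vn_of(name):
        nonlocal next_vn
        if name not in var_vn:
            var_vn[name], vn_home[next_vn] = next_vn, name
            next_vn += 1
        return var_vn[name]

    for dest, op, a, b in statements:
        if op == "copy":                       # dest = a: dest inherits a's value number
            var_vn[dest] = vn_of(a)
            rewritten.append((dest, op, a, b))
            continue
        if op in {"+", "*"}:                   # commutative: canonical operand order
            key = (op,) + tuple(sorted((vn_of(a), vn_of(b))))
        else:
            key = (op, vn_of(a), vn_of(b))
        if key in expr_vn:                     # this value was computed before
            vn = expr_vn[key]
            rewritten.append((dest, "copy", vn_home[vn], None))
        else:
            vn = expr_vn[key] = next_vn
            vn_home[vn] = dest
            next_vn += 1
            rewritten.append((dest, op, a, b))
        var_vn[dest] = vn
    return rewritten

# b = a + 3; c = a; d = c + 3: d is recognized as a copy of b (Figure 3.5(a)).
print(value_number_block([("b", "+", "a", "3"),
                          ("c", "copy", "a", None),
                          ("d", "+", "c", "3")]))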
(a)
for i = 1, 100 {
    a = i * (n + 1)
    for j = 1, 100
        b(i, j) = 100 * n + 10 * a + j
}

(b)
a1 = 10 * (n + 1)
a2 = 100 * n
for i = 1, 100 {
    a3 = a1 * i + a2
    for j = 1, 100
        b(i, j) = a3 + j
}
Figure 3.6 (a) An example of code having loop invariant computations and (b) the code
after transformation.
Figure 3.6(a) shows a piece of code in which there are several loop invariant
expressions. Figure 3.6(b) shows the code after moving the loop invariant expres-
sions outside of the loops. The original code performs two multiplications and two
additions per iteration of the inner loop. The outer loop performs 201 multipli-
cations and 201 additions during each iteration. Overall, the code executes 20 100
multiplications and 20 100 additions.
The modified code performs one addition at each iteration of the inner loop. The
outer loop requires 1 multiplication and 101 additions for each iteration, resulting
in a total of 102 multiplications and 10 101 additions. In this example, loop invari-
ant code motion saves 19 998 multiplications and 9 999 additions. This can have a
significant impact on the execution time of the code and the energy consumption of
the processor executing it. Further improvement can be achieved by modifying the
ranges of i and j in the FOR loops, e.g., instead of using 1 and 100 as the lower and
upper bounds for variable j, a3 + 1 and a3 + 100 can be used, respectively.
Figure 3.7 An example of the benefit of PRE: (a) the original partially redundant code,
and (b) the simplified code.
Figure 3.8 PRE can move loop invariant computations to reduce the number of operations:
(a) the original loop, and (b) the loop after moving the loop invariant computation
outside the loop.
(a)
for j = 1, 100
    a(5 * j + 2) = 1

(b)
t = 2
for j = 1, 100 {
    t = t + 5
    a(t) = 1
}
Figure 3.9 An example showing the benefits of strong strength reduction: (a) the original
loop and (b) the loop after performing strong strength reduction.
The advantages of the Horner form are the following:
(1) Evaluating a degree-n polynomial in its original form requires
n(n + 1)/2 multiplications and n additions,¹ while the Horner form uses n
multiplications and n additions. This substantially reduces the number of
multiplications if n is large. Since multiplication is an expensive operation in
terms of cycle time and energy consumption, transforming to the Horner form
is an effective way of reducing execution time and energy consumption of
software programs. In [9], the authors report an average 55% reduction in the
number of multiplications when the Horner form is used instead of unopti-
mized expressions for a set of applications.
(2) The special form of the resulting polynomial eases the use of MAC operations,
which exist in many processors especially digital signal processors. The poly-
nomial can be calculated by first computing P0 = a0·x + a1 using a MAC
operation, then calculating P1 = P0·x + a2 and so on. This means the calcula-
tion can be done using n MAC operations. Again, this significantly reduces the
execution time and the energy consumption.
(3) The Horner form increases the numerical stability. In the original form, even
when the value of P(x) is small, the intermediate values (e.g., xⁿ, a0·xⁿ, a0·xⁿ +
a1·xⁿ⁻¹) can be prohibitively large. Thus, it may not be possible to represent
them directly in a 32-bit or a 64-bit processor.² On the other hand, in the
Horner form, the intermediate values P0, P1, ..., Pn−1 can be small if, for example,
a0 and a1 have different signs.
(4) It is easy to write polynomials in the Horner form. Therefore, it can be
integrated into a compiler with little effort. Furthermore, the transformation
of polynomials into the Horner form can be done quickly, which means it will
not have a major impact on the compilation time which is very important in
general purpose compilers.
The disadvantages of the Horner form are the following:
(1) It optimizes only a single polynomial at a time; it does not look for common
subexpressions among a set of polynomials. Furthermore, it is not good at
optimizing multivariate polynomials, used for example in computer graphics
applications [10]. Equation (2.2) shows the quartic polynomial used in three-
dimensional computer graphics for modeling textures. The original polyno-
mial consists of 23 multiplications or 21 multiplications and 2 shift operations.
The polynomial in the Horner form (shown in Equation (2.4)) has 17 multipli-
cations or 15 multiplications and 2 shifts. Using algebraic methods [9], the
polynomial can be optimized to a form which requires 13 multiplications or
12 multiplications and 1 shift.
(2) The Horner form may not give the best result if some coefficients of the
polynomial are zero. For P(x) = a0x⁶ + a2x⁴ + a4x² + a6 the Horner method
results in P(x) = (((((a0x + 0)x + a2)x + 0)x + a4)x + 0)x + a6, which requires
six MAC operations or six multiplications and three additions, while it
is easy to write the polynomial using y = x · x as P(x) = a0y³ + a2y² + a4y
+ a6 = ((a0y + a2)y + a4)y + a6. In this case, a total of four MAC operations or
four multiplications and three additions are necessary (see the sketch below).

¹ Calculating the term aᵢxⁿ⁻ⁱ requires n − i multiplications. Therefore, the total number of necessary
multiplications is n + (n − 1) + ... + 1 = n(n + 1)/2.
² It is possible to use several words to represent a large number on a processor, but this significantly
reduces the performance.

Figure 3.10 An example illustrating the power of algebraic techniques: (a) the original
equations, (b) the equations optimized using an algebraic technique, and (c) the equations
optimized using CSE.
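The evaluation orders discussed above can be contrasted with the short sketch below (our own illustration): the straightforward form, the Horner form, and the rewriting in y = x · x from item (2).

def poly_naive(coeffs, x):
    """P(x) = a0*x^n + a1*x^(n-1) + ... + an, every power computed from scratch."""
    n = len(coeffs) - 1
    return sum(a * x ** (n - i) for i, a in enumerate(coeffs))

def poly_horner(coeffs, x):
    """Horner form: n multiplications and n additions (one MAC per coefficient)."""
    result = 0
    for a in coeffs:
        result = result * x + a
    return result

def poly_even_powers(even_coeffs, x):
    """a0*x^6 + a2*x^4 + a4*x^2 + a6 evaluated in y = x*x: 4 multiplications, 3 additions."""
    y = x * x
    return poly_horner(even_coeffs, y)

coeffs = [2, 0, -3, 0, 1, 0, 5]    # 2x^6 - 3x^4 + x^2 + 5
print(poly_naive(coeffs, 1.5),
      poly_horner(coeffs, 1.5),
      poly_even_powers([2, -3, 1, 5], 1.5))   # all three values agree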
3.4 Summary
This chapter presented some of the basic concepts in software design flows. We
started by describing the fundamental steps for a software compiler including
those in the frontend and the backend. Then we provided more detail on the
compilation process, including where arithmetic optimization can be performed.
³ Value numbering uses properties such as commutativity, but this is not specific to arithmetic
operations.
References
[1] S. S. Muchnick, Advanced Compiler Design and Implementation. San Francisco, CA:
Morgan Kaufmann Publishers, 1997.
[2] K. Kennedy and J. R. Allen, Optimizing Compilers for Modern Architectures:
A Dependence-based Approach. San Francisco, CA: Morgan Kaufmann Publishers,
2001.
[3] J. R. Levine, T. Mason, and D. Brown, Lex & yacc, second edition. Sebastopol, CA:
O’Reilly & Associates, 1995.
[4] J. Cocke and J. T. Schwartz, Programming Languages and Their Compilers:
Preliminary Notes, Technical Report, Courant Institute of Mathematical Sciences,
New York University, 1970.
[5] J. R. Reif and H. R. Lewis, Symbolic evaluation and the global value graph,
Proceedings of the 4th ACM SIGACT-SIGPLAN Symposium on Principles of Programming
Languages, Los Angeles, 1977, pp. 104–18. New York, NY: ACM, 1977.
[6] B. Alpern, M. N. Wegman and F. K. Zadeck, Detecting equality of variables in
programs, Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of
Programming Languages, San Diego, 1988, pp. 1–11. New York, NY: ACM, 1988.
[7] A. Sinha and A. P. Chandrakasan, JouleTrack – a web based tool for software energy
profiling, Proceedings of the 38th Conference on Design Automation, Las Vegas, 2001.
pp. 220–225. New York, NY: ACM, 2001.
[8] http://www.gnu.org/software/libc/
[9] A. Hosangadi, F. Fallah and R. Kastner, Factoring and eliminating common
subexpressions in polynomial expressions, Proceedings of the 2004 IEEE/ACM
International Conference on Computer-aided Design, San Jose, 2004, pp. 169–174.
Washington, DC: IEEE Computer Society, 2004.
[10] G. Nurnberger, J. W. Schmidt and G. Walz, Multivariate Approximation and Splines.
Basel: Birkhäuser, 1997.
4 Hardware synthesis
This chapter provides a brief summary of the stages in the hardware synthesis
design flow. It is designed to give unfamiliar readers a high-level understanding
of the hardware design process. The material in subsequent chapters describes
different hardware implementations of polynomial expressions and linear systems.
Therefore, we feel that it is important, though not necessarily essential, to have an
understanding of the hardware synthesis process.
The chapter starts with a high-level description of the hardware synthesis
design flow. It then proceeds to discuss the various components of this design
flow. These include the input system specification, the program representation,1
algorithmic optimizations, resource allocation, operation scheduling, and resource
binding. The chapter concludes with a case study using an FIR filter. This
provides a step-by-step example of the hardware synthesis process. Additionally,
it gives insight into the hardware optimization techniques presented in the
following chapters.
The initial stages of a hardware design flow are quite similar to the frontend of
a software compiler. One of the biggest differences is that the input system
specification languages are different. Hardware description languages must deal
with many features that are unnecessary in software, which for the most part
model execution in a serial fashion. Such features include the need to model
concurrent execution of the underlying resources, define a variety of different data
types specifically for different bit widths, and introduce some notion of time into
the language. Figure 4.1 gives a high-level view of the different stages of hardware
compilation.
Architectural synthesis is an automated design process that interprets an algo-
rithmic representation of a behavior and creates a hardware specification that implements it.
¹ We use the term "program representation," a common term in software compilation, due to the
absence of a widely used term in hardware synthesis.
[Figure 4.1 flow: system specification → lexical and syntactic analysis → program representation → algorithmic optimization (common subexpression elimination, constant folding, and other common compiler techniques, the focus of this book) → resource allocation and scheduling → resource binding → register transfer level description, followed by logic synthesis and physical synthesis (floorplanning, placement, and routing) down to GDSII.]
Figure 4.1 A high-level view of the stages of hardware compilation. These can be broadly
broken down to architectural, logic, and physical synthesis. The optimizations described in
this book are primarily focused on architectural synthesis, specifically on the algorithmic
optimization.
Throughput, power, clock frequency, and latency are some of the common
optimization objectives.
The first step of architectural synthesis is lexical and syntactic analysis, which
parses the input specification into a program representation. This step is very
similar to that of software compilation and more details of this can be found in
Section 3.2. The program representation is a description of the system specification
that is easily amenable to analysis, optimization, and translation to a more refined
specification, which in this case is the register transfer level description. There are
many examples of program representations; we discuss some of them later in this
chapter. The DFG is perhaps the most popular program representation for
architectural synthesis. We formally describe this in Section 4.4. However, in order
to progress our discussion to the next steps in the architectural synthesis process, we
will now informally define it as a directed graph consisting of vertices that represent
operations and directed edges that denote dependencies between operations.
The architectural synthesis problem can be defined in the following manner:
given a system specification, a set of fully characterized architectural resources,
a set of constraints, and an optimization function, determine a connected set of
resources (a structural representation) that conforms to the given constraints and
minimizes the objective function. The architectural synthesis problem can be split
into the following subproblems: algorithmic optimization, resource allocation,
operation scheduling, and operation binding.
Algorithmic optimization uses a set of techniques that transform the program
representation to make it run faster, use fewer operations, expose parallelism,
enable more accurate dependency analysis, improve memory usage, and so on.
These techniques are very often similar to those found in software compilers and
include optimizations such as CSE, loop unrolling, and dead code elimination. The
techniques that we present later in this book for polynomial and linear system
optimization can be used in this stage.
Resource allocation is the act of choosing the appropriate number and type of
components from a library. For example, you can choose to have two adders – one
ripple carry and one carry look-ahead – one multiplier, one divider, etc. Scheduling
determines the temporal ordering of the operations. Given a set of operations with
execution delays and a partial ordering, the scheduling problem assigns a start
time for each operation. The start times must follow the precedence constraints as
specified in the system specification. Additional restrictions such as timing and
area constraints may be added to the problem, depending on the target architec-
ture. The scheduling affects the resource allocation and vice-versa. Therefore, the
ordering of these two tasks is sometimes interchanged; some synthesis tools
perform scheduling, then resource allocation, while others allocate the resources
first, and then schedule the operations.
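As one concrete, deliberately simple example of a scheduling policy, the sketch below computes an as-soon-as-possible (ASAP) schedule over a small DFG while ignoring resource constraints; the operation names and delays are illustrative.

def asap_schedule(ops, deps, delay):
    """Start each operation as early as its DFG predecessors allow.

    ops:   operation names
    deps:  dict mapping an operation to the operations it depends on
    delay: dict mapping an operation to its execution delay in cycles
    """
    start = {}
    def earliest(op):
        if op not in start:
            start[op] = max((earliest(p) + delay[p] for p in deps[op]), default=0)
        return start[op]
    for op in ops:
        earliest(op)
    return start

# Two multiplications feed an addition; the addition cannot start before cycle 2.
ops = ["m1", "m2", "a1"]
deps = {"m1": [], "m2": [], "a1": ["m1", "m2"]}
delay = {"m1": 2, "m2": 2, "a1": 1}
print(asap_schedule(ops, deps, delay))   # {'m1': 0, 'm2': 0, 'a1': 2}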
Resource binding is the assignment of each operation to a specific hardware
component; it is an explicit mapping between operations and resources. The goal
of resource binding is to minimize the area by allowing multiple operations to
share a common resource. The scheduling limits the possible resource bindings.
38 Hardware synthesis
For example, operations that are scheduled at the same time cannot share the
same resource. To be more precise, any two operations can be bound to the same
resource if they are not executed concurrently, i.e., are not scheduled in overlap-
ping time steps. Some resources are capable of executing different operations, e.g.,
both an addition and subtraction can be bound to an arithmetic logic unit (ALU).
The resource binding can greatly affect the area and latency of the circuit as it
dictates the number of interconnect logic and storage elements of the circuit.
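A minimal greedy binding along these lines is sketched below: an operation may share a resource already in use only when the resource type matches and the two execution intervals do not overlap. The operation types, delays, and schedule are illustrative.

def bind_operations(schedule, delay, op_type):
    """Greedily bind scheduled operations to the fewest resources found this way."""
    resources = []                    # list of (type, [(start, end), ...]) already bound
    binding = {}
    for op in sorted(schedule, key=schedule.get):
        s, e = schedule[op], schedule[op] + delay[op]
        for idx, (rtype, intervals) in enumerate(resources):
            no_overlap = all(e <= s2 or s >= e2 for s2, e2 in intervals)
            if rtype == op_type[op] and no_overlap:
                intervals.append((s, e))
                binding[op] = idx
                break
        else:                          # no compatible resource: allocate a new one
            resources.append((op_type[op], [(s, e)]))
            binding[op] = len(resources) - 1
    return binding, len(resources)

# The two multiplications overlap in time, so two multipliers are needed;
# the addition is the only add-type operation and gets its own adder.
schedule = {"m1": 0, "m2": 0, "a1": 2}
delay = {"m1": 2, "m2": 2, "a1": 1}
op_type = {"m1": "mult", "m2": "mult", "a1": "add"}
print(bind_operations(schedule, delay, op_type))   # ({'m1': 0, 'm2': 1, 'a1': 2}, 3)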
Logic synthesis is the act of taking the register transfer level description that is
output from architectural synthesis and transforming it into a network of logic
gates. There are a number of optimizations that are performed during this stage.
The optimizations are generally grouped into two types – multi-level and two-
level. The two-level optimizations have roots in Boolean minimization, which
attempts to minimize the number of gates in a two-stage Boolean network.
Multi-level optimizations often view the problem as a network of logic gates and
attempt to minimize the number and the area of the gates as well as the critical
path or the delay of the network. An interested reader can find a vast amount of
literature on this topic. Reference [1] is a good introduction to the basic algorithms.
Physical synthesis or physical design looks at how the logical network can be
transformed into an integrated circuit that can be fabricated. The output is
essentially a set of planar geometric shapes that detail the size and the type of
materials needed to make the transistors and wires in a circuit. GDSII is one
common database format used to specify the layout of the integrated circuit. The
primary tasks of physical synthesis are floorplanning, placement, and routing.
Floorplanning creates a basic plan for the layout of the chip, indicating the general
area where hard macros, power and ground planes, input/output (I/O) and other
logic elements reside. Placement assigns an exact physical location for each of the
logic gates, while routing determines the precise wiring of the required intercon-
nections between the gates. Further information on the stages of physical synthe-
sis, as well as the algorithms used to implement these stages, can be found in [2].
Now that we have given an overview of the entire hardware synthesis process, we
will go into detail on a few of the topics that are needed to fully understand the
later chapters in this book. Specifically, we discuss the architectural synthesis
process. The majority of the optimizations in this book occur in the algorithmic
optimization stage; however, it is important to understand the other stages of
architectural synthesis, which we focus on in the remainder of the chapter.