We present a survey of the state-of-the-art techniques used in performing data and memory-
related optimizations in embedded systems. The optimizations are targeted directly or
indirectly at the memory subsystem, and impact one or more out of three important cost
metrics: area, performance, and power dissipation of the resulting implementation.
We first examine architecture-independent optimizations in the form of code transformations.
We next cover a broad spectrum of optimization techniques that address memory architectures
at varying levels of granularity, ranging from register files to on-chip memory, data caches,
and dynamic memory (DRAM). We end with memory addressing related issues.
Categories and Subject Descriptors: B.3 [Hardware]: Memory Structures; B.5.1 [Register-
Transfer-Level Implementation]: Design—Memory design; B.5.2 [Register-Transfer-Level
Implementation]: Design Aids—Automatic synthesis; Optimization; B.7.1 [Integrated Cir-
cuits]: Types and Design Styles—Memory technologies; D.3.4 [Programming Languages]:
Processors—Compilers; Optimization
Authors’ addresses: P. R. Panda, Synopsys, Inc., 700 E. Middlefield Rd., Mountain View, CA
94043; email: [email protected]; F. Catthoor, Inter-University Microelectronics Centre and
Katholieke Universiteit Leuven, Kapeldreef 75, Leuven, Belgium; email: [email protected]; N.
D. Dutt, Center for Embedded Computer Systems, University of California at Irvine, Irvine,
CA 92697; email: [email protected]; K. Danckaert, E. Brockmeyer, C. Kulkarni, and A.
Vandercappelle, Inter-University Microelectronics Centre, Kapeldreef 75, Leuven, Belgium;
email: [email protected]; brpcl,[email protected]; [email protected]; [email protected]; P. G.
Kjeldsberg, Norwegian University of Science and Technology, Trondheim, Norway; email:
[email protected].
Permission to make digital / hard copy of part or all of this work for personal or classroom use
is granted without fee provided that the copies are not made or distributed for profit or
commercial advantage, the copyright notice, the title of the publication, and its date appear,
and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to
republish, to post on servers, or to redistribute to lists, requires prior specific permission
and / or a fee.
© 2001 ACM 1084-4309/01/0400-0149 $5.00
1. INTRODUCTION
In the design of embedded systems, memory issues play a very important
role, and often impact significantly the embedded system’s performance,
power dissipation, and overall cost of implementation. Indeed, as new
processor families and processor cores begin to push the limits of high
performance, the traditional processor-memory gap widens and often be-
comes the dominant bottleneck in achieving high performance. While
embedded systems range from simple micro-controller-based solutions to
high-end mixed hardware/software solutions, embedded system designers
need to pay particular attention to issues such as minimizing memory
requirements, improving memory throughput, and limiting the power dis-
sipated by the system’s memory.
Traditionally, much attention has been paid to the role of memory system
design in the compiler, architecture, and CAD domains. Many of these
techniques, while applicable to some extent, do not fully exploit the
optimization opportunities in embedded system design. From an applica-
tion viewpoint, embedded systems are special-purpose, and so are amena-
ble to aggressive optimization techniques that can fully utilize knowledge
of the applications. Whereas many traditional memory-related hardware
and software optimizations had to account for variances due to general-
purpose applications, memory optimizations for embedded systems can be
tailored to suit the expected profile of code and data. Furthermore, from an
architectural viewpoint, the embedded system designer pays great atten-
tion to the customization of the memory subsystem (both on-chip, as well as
off-chip): this leads to many nontraditional memory organizations, with a
standard cache hierarchy being only one of many memory architectural
options. Finally, from a constraint viewpoint, the embedded system de-
signer needs to meet not only system performance goals, but also has to do
this within a power budget (especially for mobile applications), and meet
real-time constraints. The system performance should account not only for
the processor's speed but also for the system bus load to shared board-level
storage units such as main memory and disk. Even the L2 cache is shared
in a multiprocessor context. As a result of all this, the memory and bus
subsystem costs become a significant contributor to overall system costs,
and thus the embedded system designer attempts to minimize memory
requirements with the goal of lowering overall system costs.
Pingali 1992]; at Illinois [Padua and Wolfe 1986]; at Stanford [Wolf and
Lam 1991] and [Amarasinghe et al. 1995]; at Santa Clara [Shang et al.
1996]); and finally in the high-level synthesis community also (at the
University of Minnesota [Parhi 1989] and the University of Notre-Dame
[Passos and Sha 1994]).
Efficient parallelism is however partly coupled to locality of data access,
and this has been incorporated in a number of approaches. Examples are
the work on data and control flow transformations for distributed shared-
memory machines at the University of Rochester[Cierniak and Li 1995], or
heuristics to improve the cache hit ratio and execution time at the
University of Massachusetts at Amherst [McKinley et al. 1996]. Rice University has recently also
started investigating the actual memory bandwidth issues and the relation
to loop fusion [Ding and Kennedy 2000]. At E.N.S. Lyon, the effect of
several loop transformations on memory accesses has also been studied
[Fraboulet et al. 1999].
It is thus no surprise that these code rewriting techniques are also very
important in the context of data transfer and storage (DTS) solutions,
especially for embedded applications that permit customized memory orga-
nizations. As the first optimization step in the design methodology
proposed in Franssen et al. [1994], Greef et al. [1995], and Masselos et al.
[1999a], they significantly reduce the required amount of storage and
transfers and improve access behavior, thus enabling the ensuing steps of
more platform-dependent optimizations. As such, the
global loop transformations mainly increase the locality and regularity of
the accesses in the code. In an embedded context this is clearly good for
memory size (area) and memory accesses (power) [Franssen et al. 1994;
Greef et al. 1995], but of course also for pure performance [Masselos et al.
1999a], even though the two objectives do not fully lead to the same loop
transformation steering. The main distinction from the vast amount of
earlier related work in the compiler literature is that they perform these
transformations across all loop nests in the entire program [Franssen et al.
1994]. Traditional loop optimizations performed in compilers, where the
scope of loop transformations is limited to one procedure or usually even
one loop nest, can enhance the locality (and parallelization possibilities)
within that loop nest, but may not solve the global data flow and associated
buffer space needed between the loop nests or procedures. A recent
transformation framework including interprocedural analysis proposed in
McKinley [1998] is a step in this direction; it is focused on parallelization
for a shared-memory multiprocessor. The memory-related optimizations are still
performed on a loop-nest basis (and so are “local”); but the loops in that
loop nest may span different procedures and a fusing preprocessing step
tries to combine all compatible loop nests that do not have dependencies
blocking their fusing. The goal of the fusing is primarily to improve
parallelism.
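As a minimal illustration (not taken from the cited works), consider two loop
nests that produce and consume the same intermediate array. Fusing them
reduces the buffer between the nests to a scalar and improves locality; the
names and sizes below are hypothetical placeholders:

#define N 1024
static int f(int x) { return x + 1; }   /* placeholder computations */
static int g(int x) { return 2 * x; }

void unfused(const int a[N], int out[N])
{
    int tmp[N];                          /* N-element intermediate buffer */
    for (int i = 0; i < N; i++) tmp[i] = f(a[i]);
    for (int i = 0; i < N; i++) out[i] = g(tmp[i]);
}

void fused(const int a[N], int out[N])
{
    for (int i = 0; i < N; i++)          /* buffer reduced to a scalar */
        out[i] = g(f(a[i]));
}

The same principle, applied globally across all loop nests of a program
rather than within a single nest, is what distinguishes the approaches
discussed above.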
The global loop and control flow transformation step proposed in Greef et
al. [1995], Franssen et al. [1994], and Masselos et al. [1999a] can be viewed
as a precompilation phase, applied prior to conventional compiler loop
optimize the use of caches have been studied in several flavors and contexts
(see, e.g., Kulkarni and Stumm [1995] and Manjikian and Abdelrahman
[1995]). In an embedded context, the memory size and energy angles have
also been added, as illustrated in the early work of Franssen et al. [1994];
Catthoor et al. [1994]; and Greef et al. [1995] to increase locality and
regularity globally, and more recently in Fraboulet et al. [1999] and
Kandemir et al. [2000]. In addition, memory access scheduling has a clear
link to certain loop transformations to reduce the embedded implementa-
tion cost. This is illustrated by the work on local loop transformations to
reduce the memory access in procedural descriptions [Kolson et al. 1994];
the work on multidimensional loop scheduling for buffer reduction [Passos
et al. 1995]; and the PHIDEO project where “loop” transformations on
periodic streams were applied to reduce an abstract storage and transfer
cost [Verhaegh et al. 1996].
To automate the proposed loop transformations, the Franssen et al.
[1994] and Danckaert et al. [2000] approach makes use of a polytope model
[Franssen et al. 1993; Catthoor et al. 1998]. In this model, each n-level loop
nest is represented geometrically by an n-dimensional polytope. An exam-
ple is given in Figure 1, where the loop nest at the top is two-dimensional
and has a triangular polytope representation, because the inner loop bound
is dependent on the value of the outer loop index. The arrows in the figure
represent the data dependencies; they are drawn in the direction of the
data flow. The order in which the iterations are executed can be repre-
sented by an ordering vector that traverses the polytope. To perform global
loop transformations, a two-phase approach is used. In the first phase, all
polytopes are placed in one common iteration space. During this phase, the
polytopes are considered as merely geometrical objects, without execution
semantics. In the second phase, a global ordering vector is defined in this
global iteration space. In Figure 1, an example of this methodology is given.
At the top, the initial specification of a simple algorithm is shown; at the
bottom left, the polytopes of this algorithm are placed in the common
iteration space in an optimal way, and at the bottom right, an optimal
ordering vector is defined and the corresponding code is derived.
Most existing loop transformation strategies work directly on the code.
Moreover, they typically work on single loop nests, thereby omitting the
global transformations crucial for storage and transfers. Many of these
techniques also consider the body of each loop nest as one unit [Darte et
al. 1993], whereas in Franssen et al. [1993] each statement is represented
by a polytope, which allows more aggressive transformations. An exception
to the “black box” view on the loop body is formed by the “affine-by-
statement” [Darte and Robert 1992] techniques which transform each
statement separately. However, the two-phase approach still allows a more
global view on the data transfer and storage issues.
Fig. 1. Polytope model of global loop transformation. At the top of the
figure, the initial specification:

A: (i: 1..N)::
      (j: 1..N-i+1)::
         a[i][j] = in[i][j] + a[i-1][j];
B: (p: 1..N)::
         b[p][1] = f( a[N-p+1][p], a[N-p][p] );
C: (k: 1..N)::
      (l: 1..k)::
         b[k][l+1] = g( b[k][l] );

At the bottom, after placing the polytopes in a common iteration space and
defining a global ordering vector, the corresponding code:

for (j=1; j<=N; ++j) {
   for (i=1; i<=N-j+1; ++i)
      a[i][j] = in[i][j] + a[i-1][j];
   b[j][1] = f( a[N-j+1][j], a[N-j][j] );
   for (l=1; l<=j; ++l)
      b[j][l+1] = g( b[j][l] );
}
size and “time” of these copies based on the available locality of access. In
the Diguet et al. [1997] approach, a global exploration of the data reuse
copies is performed to globally optimize the size and timing of these copies
in the code. A custom memory hierarchy can then be designed on which
these copies can be mapped in a very efficient way (see, e.g., Wuytack et al.
[1998]). However, even for a predefined memory hierarchy, typically
present in a programmable processor context, the newly derived code from
this step implicitly steers the data reuse decisions and still results in a
major benefit to system bus load, system power budget, and cache miss
behavior (see, e.g., Kulkarni et al. [1998]). This compile-time exploration of
data reuse and code modification appears to be a unique approach not
investigated elsewhere.
Example 2. Consider the following example, which has already under-
gone the loop transformations discussed in the previous section:
for (i=0; i<N; ++i)
    for (j=0; j<=N-L; ++j) {
        b[i][j] = 0;
        for (k=0; k<L; ++k)
            b[i][j] += a[i][j+k];
    }
When this code is executed on a processor with a small cache, it performs
much better than the initial code. To map it on a custom memory hierarchy,
however, the designer has to know the optimal size of the different levels of
this hierarchy. To this end, signal copies (buffers) are added to the code in
order to make data reuse explicit. For the example, this results in the
following code (the initialization of a_buf[] has been left out for
simplicity):
int a_buf[L];
int b_buf;
for (i=0; i<N; ++i) {
    /* initialize a_buf */
    for (j=0; j<=N-L; ++j) {
        b_buf = 0;
        a_buf[(j+L-1)%L] = a[i][j+L-1];
        for (k=0; k<L; ++k)
            b_buf += a_buf[(j+k)%L];
        b[i][j] = b_buf;
    }
}
In this code, two data reuse buffers are present: a_buf, a small buffer of L
elements that holds the currently reused window of array a, and b_buf, a
register-level buffer that accumulates the value of b[i][j].
description, prove that this size is reduced when the partitioning becomes
more data oriented. Initially, this size is smaller for the first hybrid
partitioning (245 K), which is more data-oriented than the second hybrid
partitioning (282 K) and the task-level partitioning (287 K). However, this
can change after the transformations are applied. In terms of the number of
memory accesses to the intermediate signals the situation is simpler. The
number of accesses to these signals always decreases as the partitioning
becomes more data oriented. The table also shows the huge impact that
this platform-independent transformation stage can have on highly data-
dominated applications like this video coder. Experiments on several
processor platforms for different demonstrators [Danckaert et al. 1999]
have shown the importance of applying these optimizations.
list, the second a binary tree, and the third one a pointer array. Each of
these three levels is accessed by subkeys that are repartitioned from the
original keys. An automated technique for this exploration is proposed in
Ykman-Couvreur et al. [1999]. In a second main phase, dynamic allocation
and freeing duties are performed by a virtual memory manager, which
handles the typical tasks involved in maintaining the free list of blocks in
memory: keeping track of free blocks, choosing free blocks, freeing deleted
blocks, and splitting and merging blocks [Wilson et al. 1995]. An exploration
technique is proposed in da Silva et al. [1998], in which different memory
allocators are generated for different data types. Following this, a basic
group splitting operation splits the memory segment into smaller basic
groups to increase the allocation freedom by, for instance, splitting an
array of structures into its constituent fields. These logical memory seg-
ments are then mapped into physical memory modules in the Storage
Bandwidth Optimization (SBO) step as described in Section 3.2.
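To make the virtual memory manager tasks mentioned above concrete, the
following is a minimal first-fit free-list sketch (the structure and function
names are ours, not from the cited works), showing block tracking, choice,
splitting, and merging:

#include <stddef.h>

typedef struct Block {
    size_t size;
    int free;
    struct Block *next;
} Block;

static Block *free_list;   /* assumed initialized over a memory pool */

/* First-fit allocation with block splitting. */
void *vmm_alloc(size_t size)
{
    for (Block *b = free_list; b; b = b->next) {
        if (b->free && b->size >= size) {
            if (b->size >= size + sizeof(Block) + 8) {
                /* split: carve the tail into a new free block */
                Block *rest = (Block *)((char *)(b + 1) + size);
                rest->size = b->size - size - sizeof(Block);
                rest->free = 1;
                rest->next = b->next;
                b->size = size;
                b->next = rest;
            }
            b->free = 0;
            return b + 1;
        }
    }
    return NULL;   /* no fitting free block */
}

/* Freeing with immediate merging of an adjacent free successor. */
void vmm_free(void *p)
{
    Block *b = (Block *)p - 1;
    b->free = 1;
    if (b->next && b->next->free &&
        (char *)(b + 1) + b->size == (char *)b->next) {
        b->size += sizeof(Block) + b->next->size;   /* merge blocks */
        b->next = b->next->next;
    }
}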
An approach at a lower abstraction level is proposed by Semeria et al.
[2000]. It is specifically targeted to a hardware synthesis context and
assumes that the virtual memory managers are already fixed. So the
outcome of the above approach can be used directly as input for this step.
Here, the actual number and size of the memory modules are specified by
the designer, along with a hint of which malloc call is targeted at which
memory module. A general-purpose memory allocator module that per-
forms the block allocation and freeing tasks is also instantiated for each
memory module. However, the allocator can be optimized and simplified
when the size arguments to all malloc calls for a single module are
compile-time constants and when constant-size data is allocated and freed
within the same basic block. In the latter case, the dynamic allocation is
replaced by a static array declaration.
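A minimal sketch of that simplification, under the stated assumption that
the allocation size is a compile-time constant and the malloc/free pair
stays within one basic block:

#include <stdlib.h>

void before(void)
{
    int *buf = malloc(64 * sizeof(int));  /* constant-size allocation */
    /* ... use buf ... */
    free(buf);                            /* freed in the same basic block */
}

void after(void)
{
    int buf[64];   /* dynamic allocation replaced by a static declaration */
    /* ... use buf ... */
}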
can share the same physical memory location (the in-place mapping prob-
lem [Verbauwhede et al. 1989]), a more accurate estimate has to account
for mapping arrays and parts of arrays to the same place in memory. To
what degree it is possible to perform in-place mapping depends heavily on
the order in which the elements in the arrays are produced and consumed.
This is mainly determined by the execution ordering of the loop nests
surrounding the instructions accessing the arrays.
At the beginning of the design process, little information about the
execution order is known. Some is given from the data dependencies
between the instructions in the code, and the designer may restrict the
ordering, for example due to I/O constraints. In general, however, the
execution order is not fixed, giving the designer considerable freedom in the
implementation. As the process progresses, the designer takes decisions
that gradually fix the ordering, until the full execution ordering is known.
To steer this process, estimates of the upper and lower bounds on the
storage requirement are needed at each step, given the partially fixed
execution ordering.
The storage requirements for scalar variables can be determined by a
clique partitioning formulation for performing register allocation (described
in Section 3.1.1). However, such techniques break down for large multidi-
mensional arrays, due to the huge number of scalars present when each
array element is treated as a scalar. To overcome this shortcoming, several
research teams have tried to split the arrays into suitable units before or as
a part of the estimation. Typically, each instance of array element access
in the code is treated separately. Due to the code's loop structure, large
parts of an array can be produced or consumed by the same code instance.
This reduces the number of elements the estimator must handle compared
to the scalar approach.
Verbauwhede et al. [1994] use a production time axis to find the maxi-
mum difference between the production and consumption times for any two
dependent instances, giving the storage requirement for one array. The
total storage requirement is the sum of the requirements for each array.
Only in-place mapping internal to an array is considered, not the possibil-
ity of mapping arrays in place of each other. In Grun et al. [1998], the data
dependency relations between the array references in the code are used to
find the number of array elements produced or consumed by each assign-
ment. From this, a memory trace of upper and lower bounding rectangles
as a function of time is found with the peak bounding rectangle indicating
the total storage requirement. If the difference between the upper and
lower bounds for this critical rectangle is too large, the corresponding loop
is split into two and the estimation is rerun. In the worst-case situation, a
full loop unrolling is necessary to achieve a satisfactory estimate, which
can become expensive. Zhao and Malik [1999] describe a methodology based
on live variable analysis and integer point counting for intersection/union
of mappings of parameterized polytopes. They show that it is only neces-
sary to find the number of live variables for one instruction in each
innermost loop nest to get the minimum memory size estimate. However,
the live variable analysis is performed for each iteration of the loops, which
makes it computationally hard for large multidimensional loop nests. A
major limitation for all of these techniques is their requirement of a fully
fixed (imperative) execution ordering.
In contrast to the methods described in the previous paragraph, the
storage requirement estimation technique presented by Balasa et al. [1995]
does not assume an execution ordering. It starts with an extended data
dependency analysis, resulting in a number of nonoverlapping basic sets of
array elements and the dependencies between them. The size of the
dependency is the number of elements consumed (read) from one basic set
while producing the dependent basic set. The maximal combined size of
simultaneously alive basic sets gives the storage requirement.
The high-level estimation methodology described by Kjeldsberg et al.
[2000b] goes a step further, and takes into account partially fixed execution
ordering, achieved by an array data flow analysis preprocessing [Feautrier
1991; Pugh and Wonnacott 1993].
Example 3. Consider the simple application code example shown in
Figure 3. Two instructions, I.1 and I.2, produce elements of two arrays, A
and B. Elements from array A are consumed when elements of array B are
produced. This gives rise to a flow type data dependency between the
instructions [Banerjee 1998].
The loops around the operations define an iteration space [Banerjee
1998], as shown in Figure 3. Each point within this space represents one
execution of the operations inside the loop nest. For our example, at each of
these iteration points, one A-array element and, when the if clause condi-
tion is true, one B-array element is produced. In general, not all elements
produced by one operation are read by a depending operation. A depen-
dency part (DP) is defined containing all the iteration points for which
elements that are read by the depending operation are produced. Next, a
dependency vector (DV) is drawn from any iteration point in the DP
producing an array element to the iteration point producing the depending
element. This DV is usually drawn from the point in the DP that is nearest
to the origin. Finally, the chosen DV spans a rectangular dependency vector
polytope (DVP) in the N-dimensional space with sides parallel to the
iteration space axes. The N dimensions of this DVP are defined as spanning
dimensions (SD). Since normally the SD only comprises a subset of the
iterator space dimensions, the remaining dimensions are denoted nonspan-
ning dimensions (ND), but this set can be empty. For the DVP in Figure 3,
i and j are SDs while k is ND.
Fig. 3. Iteration space with dependency part, dependency vector, and dependency vector
polytope.
Using the concepts above, Kjeldsberg et al. [2000a] describe the details of
the size estimates of individual dependencies. The main contribution is the
use of the DP and DVP for calculating the upper and lower bounds on the
dependency size, respectively. As the execution ordering is fixed gradually
during the design phases, dimensions and array elements are removed
from the DP or added to the DVP to comprise tighter bounds until they
converge for a fully fixed ordering. Whether dimensions and array elements
are removed from the DP or added to the DVP is, in general, decided by the
partial fixation of spanning and nonspanning dimensions. It has been
shown that the size of a dependency is minimized if spanning dimensions
are fixed innermost and nonspanning dimensions outermost. Table II
summarizes estimation results for the dependency in Figure 3 for a number
of partially fixed execution orderings. The results are compared with those
achieved with the methodology in Balasa et al. [1995] where the execution
ordering is ignored, and with manually calculated exact results for best-
case (BC) and worst-case (WC) ordering.
In order to achieve a global view of the storage requirements for an
application, the combined size of simultaneously alive dependencies must
be taken into account [Kjeldsberg et al. 2000b]; but this falls outside the
scope of this survey. Applying this approach to the motion estimation
kernel of MPEG-4 [The ISO/IEC Moving Picture Experts Group 2001]
demonstrates how the designer can be guided in applying the critical early
loop transformations to the source code. Figure 4 shows estimates of upper and
lower bounds on the total storage requirement for two major arrays. In
Step (a) no ordering is fixed, leaving a large span between the upper and
lower bounds. At (b), one dimension is fixed outermost in the loop nest,
resulting in big changes in both upper and lower bounds. For step (c), an
alternative dimension is fixed outermost in the loop nest. Here the reduc-
tion of the upper bound is much larger than in (b), while the increase of the
lower bound is much smaller. Even with such limited information, it is
possible for the designer to conclude that the outer dimension used in (c) is
better than the one used in (b). At (d), there is an additional fixation of a
second outermost dimension with a reduced uncertainty in the storage
requirement as a result. Finally, at step (e), the execution ordering is fully
fixed, and the upper and lower bounds converge.
Table II. Dependency Size Estimates of a Simple Example (in number of scalar
dependencies)

Fig. 4. Storage requirement of the ME kernel: upper and lower bounds (on a
logarithmic scale) at ordering steps (a) through (e).
a register file or a memory module. The registers (or memory locations) can
no longer all be accessed simultaneously; the number of allowed simulta-
neous accesses is limited to the number of available ports in the memory.
This results in a stronger interaction of the memory allocation decision
with the scheduling phase of HLS.
Balakrishnan et al. [1988] present a technique to allocate multiport
memories in HLS. To exploit the increased efficiency of grouping several
registers into a single multiport memory, the technique attempts to merge
registers with disjoint access times. While clique partitioning is sufficient
to handle the case of a single port memory, a more general framework is
needed to handle multiport memories. The technique formulates a 0-1
linear programming problem by modeling the port types (read, write, and
read/write), the number of ports, and the accesses scheduled to each
register in each control step. Since the linear programming problem is
NP-complete, a branch-and-bound heuristic is employed.
Example 4. Consider a scheduled sequence with states S1 and S2
involving three registers R1, R2, and R3 to be mapped into a dual-port
memory:

S1 : R1 ← R2 + R3
S2 : R2 ← R1 + R1

With xi = 1 denoting that register Ri is assigned to the memory, the
accesses scheduled in each state must not exceed the two available ports:

x1 + x2 + x3 ≤ 2
x1 + x2 ≤ 2
Fig. 7. Storage bandwidth optimization: (a) data flow graph; (b) candidate schedule; (c)
conflict graph; (d) memory assignment; (e) alternate schedule; (f) new conflict graph; (g) new
assignment.
from the available memory types and port configurations. But the dimen-
sions of the memories are determined only in the second stage. When
arrays are assigned to memories, their sizes can be added and the maximal
bit-width can be taken to determine the required size and bit-width of the
memory. With this decision, the memory organization is fully determined.
Allocating more or fewer memories has an effect on the chip area and on
the energy consumption of the memory architecture (see Fig. 8). Large
memories consume more energy per access than small memories, due to the
longer word- and bit-lines. So the energy consumed by a single large
memory containing all the data is much larger than when the data is
distributed over several smaller memories. Also, the area of the one-
ACM Transactions on Design Automation of Electronic Systems, Vol. 6, No. 2, April 2001.
Data and Memory Optimization • 171
Fig. 8. Tradeoff between number of memories and cost during allocation and assignment.
Fig. 9. Mapping logical memories to physical ones while satisfying required access rates.
3.4 Memory Access Time versus Cost Exploration Using Pareto Curves
By combining the SCBD and MAA steps of the previous sections, we can
effectively explore and trade off different solutions in the performance,
power, and area space [Brockmeyer et al. 2000a]. Indeed, every step of the
SCBD-MAA combination generates a set of valid solutions for a different
cycle budget. Hence it becomes possible to make the right tradeoff within
this solution space. Note that without automated tool support, the types of
tradeoffs discussed here are not feasible on industrial-strength applications,
and thus designers may miss the opportunity to explore a larger
design space.
When the input behavior has a single thread and the goal is to reduce
power, the tradeoff can be based solely on tool output. The given cycle
budget defines a conflict graph that can be used for the MAA tool [Vande-
cappelle et al. 1999]. Obviously, the power and area costs increase when
the cycle budget is lowered: more bandwidth is needed, which requires
multiport memories (increases power) or more memories (increases area).
This is illustrated with a binary tree predictive coding (BTPC) application:
a complex, lossless or lossy image compression algorithm based on
multiresolution. The platform-independent code trans-
formation steps [Catthoor et al. 1998], discussed in Section 2, are applied
manually in this example and the platform-dependent steps (using tools)
give accurate feedback about performance and cost [Vandecappelle et al.
1999]. Figure 12 shows the relation between speed and power for the
original, intermediate, and fully optimized specifications. The off-
chip signals are stored in separate memory components. Four memories are
allocated for the on-chip signals. Every step leads to a significant perfor-
mance improvement without increasing the system cost. For every descrip-
tion, the cycle budget can be traded for system cost, demonstrating the
significant effect of the platform-independent code transformation [Brock-
meyer et al. 2000b].
This performance-power function, when generated per task, can be used
to trade off the cycles assigned to a task at the system level. Assigning too few
cycles to a single task causes the entire application to perform poorly. The
cycle and power estimates help the designer to assign tasks to processors
and to distribute the cycles within the processors over various tasks
[Brockmeyer et al. 2000a]. Minimizing the overall power within a processor
is possible by applying function minimization on all the power-cycle func-
tions together.
The interaction of the datapath power/performance with the memory
system creates another level of tradeoffs. The assignment of cycles to
memory accesses and to the data-path is important for overall power
consumption. A certain percentage of the overall time can be spent on
Fig. 12. Power (mW) versus cycle budget (Mcycles) for the BTPC example
with four on-chip memories, for the original specification (also with
off-chip signals dual ported) and after successive transformations: loop
merging, data reuse, basic group matching, hierarchy assignment, and
software pipelining (with and without hierarchy assignment).
Fig. 13. Tradeoff of cycles assigned to memory accesses and data path:
memory-related, data-path-related, and total power as a function of the
percentage of cycles assigned to memory accesses.
00000000
11111111
00000000
This represents the peak power dissipation on the bus because all 8 bits
transition at once. The Bus-Invert coding introduces a control bit to the
bus, which is 1 when the Hamming distance [Kohavi 1978] between
successive values is greater than half the bus width. The above sequence is
encoded as
000000000
000000001
000000000
The coding scheme incurs an area overhead due to the extra control bit
and the encoding and decoding circuitry as well as a possibly small
performance overhead due to the computation of the encoded data, but
lowers peak power dissipation to the case where at most half the
data bits toggle, i.e., by 50%. The authors have extended this coding
scheme to limited-weight codes, which are useful in protocols where data
values are represented by the presence of a bit-transition rather than by 1
or 0.
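A minimal sketch of a Bus-Invert encoder for an 8-bit bus, following the
description above (the function and variable names are ours):

#include <stdint.h>

static uint8_t bus_state;   /* value currently driven on the bus */

static int hamming8(uint8_t x, uint8_t y)
{
    int n = 0;
    for (uint8_t d = x ^ y; d; d >>= 1)
        n += d & 1;                       /* count differing bits */
    return n;
}

/* Returns the value to drive on the bus; *invert is the extra control bit. */
uint8_t bus_invert_encode(uint8_t data, int *invert)
{
    *invert = hamming8(data, bus_state) > 4;  /* more than half toggle? */
    if (*invert)
        data = (uint8_t)~data;                /* send the inverted value */
    bus_state = data;
    return data;
}

For the sequence above, the encoder drives 00000000, then 00000000 with
the invert bit asserted, then 00000000 again, matching the encoding shown
earlier: at most half the bus lines ever toggle in one cycle.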
The correlations expected in the memory address and data streams can
often be exploited to generate intelligent encodings that result in low-power
implementations. The most common scenario occurs in the processor-
memory interactions. High degrees of correlation are observed in the
instruction address stream due to the principle of locality of reference.
Spatial locality of reference occurs when consecutive references to memory
access nearby data. This is readily observed in the instruction
addresses generated by a processor: there is a high probability that the
next instruction executed after the current one lies in the next instruction
memory location. This correlation can be exploited in various ways to
encode the instruction addresses for low power.
To encode the streams of data that are known at design time (e.g.,
addresses for memories), Catthoor et al. [1994] first proposed a gray
coding technique [Kohavi 1978], relying on the well-known observation
that the gray code scheme results in exactly one bit transition for any two
consecutive numbers. Su and Despain [1995] applied this idea to the
instruction stream. Musoll et al. [1998] proposed a working-zone encoding,
observing that programs tend to spend a lot of time in small regions of code
(e.g., loops). They partition the address bus into two parts: the most
significant part identifies the working zone and the least significant part
carries an offset in the working zone. This ensures that, as long as the
processor executes instructions from the working zone, the most significant
bits will never change. An additional control bit is used to identify the case
when the address referenced does not belong to the working zone any more.
The T0 encoding [Benini et al. 1998b] relies on a similar principle. Here, an
additional control bit on the address bus indicates whether the next
address is consecutive or not. If it is consecutive, the control line is
asserted and remains so as long as successively executed instructions are
consecutive in memory. This is superior to the gray code, since in the
steady state the address bus does not switch at all. In cases where the
processor uses the same address bus to address both instruction and data
memory, a judicious combination of T0 and Bus-Invert encodings looks
promising [Benini et al. 1998a].
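Minimal sketches of two of these address encodings, with names of our own
choosing:

/* Gray coding: consecutive binary addresses differ in exactly one
   Gray-coded bit. */
unsigned to_gray(unsigned b)
{
    return b ^ (b >> 1);
}

/* T0-style encoding: when the next address is sequential, the INC
   control line is asserted and the address bus is left unchanged, so
   no bus line switches during sequential instruction fetch. */
typedef struct { unsigned bus; unsigned last; int inc; } T0Bus;

void t0_send(T0Bus *t, unsigned addr)
{
    t->inc = (addr == t->last + 1);
    if (!t->inc)
        t->bus = addr;   /* bus switches only on a control transfer */
    t->last = addr;
}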
Fig. 14. Inferring the tile size: (a) example code; (b) memory access trace
for array u; (c) inner loop execution trace; (d) new elements accessed in
each iteration; (e) tile.
since the analysis is done statically, it obviates the need for an expensive
encoding scheme for dynamically detecting runtime correlations. However,
data organization can also be combined with an encoding scheme to further
reduce switching activity.
A considerable amount of flexibility exists in the actual scheme used to
store arrays in memory. For example, two-dimensional arrays can be stored
in row-major or column-major style. Panda and Dutt [1999] evaluate the
impact of three storage schemes (row-major, column-major, and tile-based)
on the memory address bus switching activity.
Example 9. Tile-based storage of array data is illustrated in Figure 14.
For the example in Figure 14(a), the memory access trace for array u is
shown in Figure 14(b). New elements accessed in each iteration are shown
graphically in Figure 14(c). Note that one element is reused from the
previous iteration and can be registered instead of being accessed from
memory again. The tile shown in Figure 14(e) is the smallest rectangle
enclosing the access pattern of Figure 14(d). Array u can now be stored
tile-wise in order to exploit spatial locality.
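A sketch of the address computation for row-major versus tile-based
storage of an n x n array, assuming square t x t tiles with t dividing n
(a simplifying assumption of ours):

unsigned addr_row_major(unsigned i, unsigned j, unsigned n)
{
    return i * n + j;                               /* rows contiguous */
}

unsigned addr_tiled(unsigned i, unsigned j, unsigned n, unsigned t)
{
    unsigned tile  = (i / t) * (n / t) + (j / t);   /* which tile */
    unsigned inner = (i % t) * t + (j % t);         /* offset inside it */
    return tile * t * t + inner;
}

Accesses that stay within one tile then change only the low-order address
bits, which is what reduces the address bus switching activity.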
for (i8=0; i8<10; i8++) {
    v[0][i8] = AutoCorr[i8+1];
    u[0][i8] = AutoCorr[i8];
}
ac_inter[] has a dependency on only two of its earlier values, so only
three (earlier) integer values need to be stored for computing each of the
autocorrelated values. Thus, by performing intrasignal in-place data mapping,
as shown below, we can drastically reduce the size of this signal from
11 × 2400 = 26400 to 11 × 3 = 33 integer elements.
for (i6=0; i6<11; i6++) {
    ac_inter[i6][i6%3] = Hamwind[0] * Hamwind[i6];
    ac_inter[i6][(i6+1)%3] = Hamwind[1] * Hamwind[i6+1];
    for (i7=(i6+2); i7<2400; i7++)
        ac_inter[i6][i7%3] = ac_inter[i6][(i7-1)%3] +
            ac_inter[i6][(i7-2)%3] + (Hamwind[i7-i6] * Hamwind[i7]);
}
for (i8=0; i8<10; i8++) {
    v[0][i8] = ac_inter[i8+1][2]; /* 2399 % 3 == 2 */
    u[0][i8] = ac_inter[i8][2];
}
The signal AutoCorr[] is a temporary signal. By reusing the memory
space of signal ac_inter[] for storing AutoCorr[], we can further reduce
the total required memory space. This is achieved by intersignal in-place
mapping of array AutoCorr[] onto ac_inter[]. Initially, ac_inter[] could
not have been accommodated in the on-chip local memory, due to the large
size of this signal; the in-place mapping removes this problem. The result
is both a reduced memory size and a reduction in the associated power
consumption.
CAD techniques are needed to explore the many in-place mapping
opportunities; the theory behind this technique is not presented in this
survey. Effective techniques for the intrasignal mapping are described in
De Greef and Catthoor [1996], Lefebvre and Feautrier [1997], and Quillere
and Rajopadhye [1998], and for the intersignal mapping in Greef et al. [1997].
(3) Full knowledge of the application: The assumption that the compiler
has access to the entire application at once allows us to perform many
global optimizations skipped by traditional compilers, e.g., changing
data layouts.
longer delays of misses. However, the memory controller gets only a local
view of the program, and is unable to perform the kinds of global optimiza-
tions afforded by a compiler. Recent approaches to cache access scheduling
[Grun et al. 2000b] have proposed a more accurate timing model for
memory accesses. By attaching accurate timing information to cache hits
and misses, the compiler’s scheduler is able to hide the latency of the
lengthy cache miss operations better.
Prefetching was proposed as another solution to increase the cache hit
ratio, and was studied extensively by the compiler and architecture com-
munities. Hardware prefetching techniques [Jouppi 1990] use structures
such as stream buffers to recognize patterns in the stream of accesses in
the hardware (through some recognition/prediction mechanism), and allo-
cate streams to stream buffers, allowing prefetching of data from the main
memory. Software prefetching [Mowry et al. 1992] inserts prefetch instruc-
tions in the code that bring data from main memory into the cache well
before it is needed in a computation.
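A sketch of compiler-inserted software prefetching, using the GCC/Clang
__builtin_prefetch intrinsic for illustration (the prefetch distance of 16
is an assumed tuning parameter):

void scale(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&b[i + 16]);  /* request a future line */
        a[i] = 2.0f * b[i];
    }
}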
Fig. 16. Dividing data address space between Scratch Pad memory and
off-chip memory: addresses 0 to P-1 map to on-chip Scratch Pad memory
(1-cycle access), and addresses P to N-1 map to off-chip DRAM accessed
through the data cache (1 cycle on a hit, 10-20 cycles on a miss).
Fig. 17. (a) Procedure CONV; (b) memory access pattern in CONV.
small mask array in the Scratch pad memory. This assignment eliminates
all conflicts in the data cache—the data cache is now used for memory
accesses to source, which are very regular. Storing mask on-chip ensures
that frequently accessed data is never ejected off-chip, thereby significantly
improving the memory performance and energy dissipation.
The Panda et al. [2000] memory assignment first determines the total
conflict factor (TCF) for each array, based on access frequency and possibi-
lity of conflict with other arrays, and then considers the arrays for
assignment to scratch pad memory in the order of TCF/(array size), giving
priority to high-conflict/small-size arrays.
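A sketch of this greedy assignment order (the Array structure and the TCF
values are assumed to be computed beforehand; the names are ours):

#include <stdlib.h>

typedef struct { double tcf; unsigned size; int on_chip; } Array;

static int by_priority(const void *p, const void *q)
{
    double a = ((const Array *)p)->tcf / ((const Array *)p)->size;
    double b = ((const Array *)q)->tcf / ((const Array *)q)->size;
    return (a < b) - (a > b);             /* descending TCF/size */
}

void assign_scratchpad(Array *arr, int n, unsigned capacity)
{
    qsort(arr, n, sizeof(Array), by_priority);
    for (int i = 0; i < n; i++)
        if (arr[i].size <= capacity) {    /* place array on-chip */
            arr[i].on_chip = 1;
            capacity -= arr[i].size;
        }
}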
A scratch-pad memory storing a small amount of frequently accessed
data on-chip has an equivalent in the instruction cache. The idea of using a
small buffer to store blocks of frequently used instructions was first
introduced in Jouppi [1990a]. Recent extensions of this strategy are the
decoded instruction buffer [Bajwa et al. 1997] and the L-cache [Bellas et al.
2000].
Fig. 18. Histogram example: (a) variation of memory performance with
different mixes of cache and scratch-pad memory for a total on-chip memory
of 2 KB; (b) variation of memory performance with total on-chip memory space.
3.8.1 DRAM Modeling for HLS and Optimization. The DRAM memory
address is split internally into a row address, consisting of the most
significant bits, and a column address, consisting of the least significant
bits. The row address selects a page from the core storage and the column
address selects an offset within the page to arrive at the desired word.
When an address is presented to the memory during a READ operation, the
entire page addressed by the row address is read into the page buffer, in
anticipation of spatial locality. If future accesses are to the same page, then
there is no need to access the main storage area because it can just be read
off the page buffer, which acts like a cache. Hence, subsequent accesses to
the same page are very fast.
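A minimal model of this behavior (the timing constants match those used in
Example 14 below; the code itself is our sketch):

#define T_ROW_DECODE 45   /* ns, row decode on a page miss */
#define T_COL_DECODE 15   /* ns, column decode */
#define T_PRECHARGE  45   /* ns, precharge before a new row */

static long open_page = -1;   /* row held in the page buffer */

int dram_read_latency_ns(unsigned addr, unsigned page_size)
{
    long row = addr / page_size;
    if (row == open_page)
        return T_COL_DECODE;   /* page hit: column access only */
    open_page = row;
    return T_PRECHARGE + T_ROW_DECODE + T_COL_DECODE;   /* page miss */
}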
Panda et al. [1998] describe a scheme for modeling the various memory
access modes and use them to perform useful optimizations in the context
of an HLS environment.
Example 14. Figure 19(a) shows a simplified timing read cycle diagram
of a typical DRAM. The memory read cycle is initiated by the falling edge of
the RAS (row address strobe) signal, at which time the row address is
latched from the address bus. The column address is latched at the falling
edge of CAS (column address strobe) signal, which should occur at least
Tras = 45 ns later. Following this, the data is available on the data bus
after Tcas = 15 ns. Finally, the RAS signal is held high for at least Tp = 45 ns
to allow for bit-line precharge, which is necessary before the next memory
cycle can be initiated. In order to use the above information in an auto-
mated scheduling tool, we need to abstract a set of control data flow graph
(CDFG) nodes from the timing diagram [Panda et al. 1998]. For the
memory read operation, the CDFG node cluster consists of three stages
(Figure 19(b)): (1) row decode; (2) column decode; and (3) precharge. The
row and column addresses are available at the first and second stages,
respectively, and the output data is available at the beginning of the third
stage. Assuming a clock cycle of 15 ns, and a 1-cycle delay for the addition
and shift operations, we can derive the schedule in Figure 19(d) for the code
in Figure 19(c) using the memory read model in Figure 19(b). Since the four
accesses to array b are treated as four independent memory reads, each
incurs the entire read cycle delay of Trc = 105 ns, i.e., 7 cycles, requiring a
total of 7 × 4 = 28 cycles.
However, DRAM features such as page mode read can be exploited
efficiently to generate a much tighter schedule for behaviors such as the
FindAverage example, which accesses data in the same page in succession.
Figure 19(e) shows the timing diagram for the page mode read cycle, and
Figure 19(f) shows the schedule for the FindAverage routine using the page
mode read feature. Note that the page mode does not incur the long row
decode and precharge times between successive accesses, thereby eliminat-
ing a significant amount of delay from the schedule. In this case, the
column decode time is followed by a minimum pulse width duration for the
CAS signal, which is also 15 ns in our example. Thus, the effective cycle
time between successive memory accesses is greatly reduced, resulting
in an overall reduction of 50% in the total schedule length.
The key feature in reducing the schedule length in the example above is
the recognition that input behavior is characterized by memory access
patterns amenable to the page mode feature and the incorporation of this
observation in the scheduling phase. Some additional DRAM-specific opti-
mizations discussed in Panda et al. [1998] are as follows:
(1) a Read-Modify-Write (R-M-W) optimization that takes advantage of the
R-M-W mode in modern DRAMs, which provides support for a more
efficient realization of the common case where a specific address is read,
the data is involved in a computation, and the output is written back to
the same location;
(2) hoisting, where the row-decode node is scheduled ahead of a conditional
node if the first memory access in both branches is on the same page;
Fig. 19. (a) Read cycle timing of a typical DRAM (Trc = 105 ns; Tras = 45 ns;
Tcas = 15 ns; Tp = 45 ns); (b) three-stage CDFG model of a memory read: row
decode (45 ns), column decode (15 ns), precharge (45 ns); (c) FindAverage
code: av = (b[0] + b[1] + b[2] + b[3]) / 4; (d) schedule with four independent
reads; (e) page mode read cycle timing; (f) schedule for FindAverage using
page mode reads.
is associated with a low cost, guiding the partitioner to assign the arrays
together.
Bank assignment can also be seen as the array-to-memory assignment in
Section 3.2, when the appropriate cost function and I/O profile constraints
are introduced [Brockmeyer et al. 2000a].
3.8.3 Memory-Aware Compilation. Traditionally, the compiler is
shielded from the detailed organization and timing of the memory sub-
system; interactions with the memory subsystem are typically through read
and write operations with timing granularity at the level of cache hit and
miss delays. However, a memory-aware compiler approach can aggressively
exploit the detailed timing information of the memory subsystem to obtain
improved scheduling results. Grun et al. [2000a] present an algorithm,
called TIMGEN, to include DRAM characteristics in a compiler framework.
Detailed timing information on DRAM nodes is made available to the
compiler, which can then make intelligent scheduling decisions based on
timing knowledge. For each instruction, TIMGEN traces a detailed timing
path through the processor pipeline, including different possible memory
access modes. This information is then used during scheduling to generate
aggressive schedules that are on average 24% shorter than schedules that
assume no knowledge of memory timing.
4. ADDRESS GENERATION
One important consequence of all the above memory organization-related
steps is that the address sequences typically become much more complicated
than in the original nonoptimized application code. This is due first of
all to source code transformations such as in-place mapping, which introduces
modulo arithmetic, and loop transformations, which in turn generate
more index arithmetic and manifest local conditions. Additional complexity
is added by the very distributed memory organization used in embedded
processors, both custom and programmable. As a result, address genera-
tion, which involves generating efficient assembly code or hardware to
implement the translation of array references to actual memory addresses,
is a critical stage in the entire data management flow.
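For instance, the modulo indexing introduced by in-place mapping can be
strength-reduced to a wrap-around counter, in the spirit of the counter-based
address generators discussed below (a sketch with a hypothetical buffer size):

#define BUF_LEN 8

void fill_naive(int dst[BUF_LEN], const int src[], int n)
{
    for (int j = 0; j < n; j++)
        dst[j % BUF_LEN] = src[j];    /* modulo evaluated on every access */
}

void fill_reduced(int dst[BUF_LEN], const int src[], int n)
{
    int m = 0;                        /* running value of j % BUF_LEN */
    for (int j = 0; j < n; j++) {
        dst[m] = src[j];
        if (++m == BUF_LEN) m = 0;    /* wrap instead of dividing */
    }
}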
Initial work on address generation focused only on regular DSP applica-
tions mapped on hardware. Central to this early research is the observation
that if the generated addresses were known to be periodic (true for many
DSP applications accessing large data arrays), then there would be no need
to use a full-blown arithmetic circuit to generate the sequence; a simple
counter-based circuit could achieve the same effect. Initial research on
synthesizing hardware address generation focused on generating efficient
designs from a specified trace of memory addresses [Grant et al. 1989; Grant
and Denyer 1991] by recognizing the periodicity of the patterns and
automatically building a counter-based circuit for generating the sequence
of addresses. Since the problem of extracting the periodic behavior from an
arbitrary sequence of addresses can be extremely difficult, work like that of
Grant and Denyer [1991] relies on designer hints such as number of
memory accesses in the basic repeating pattern. The ZIPPO system [Grant
et al. 1994] solves a generalization of the above problem by considering
several address streams that are incident on different memory modules
on-chip, and synthesizing an address generator that is optimized by shar-
ing hardware. Multiplexing such sequences allows more exploratory free-
dom and produces better results [Miranda et al. 1994].
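A behavioral sketch of such a counter-based generator: a modulo counter
steps through one period of a stored address pattern (the pattern here is a
hypothetical placeholder):

#include <stddef.h>

static const unsigned pattern[] = { 0, 1, 8, 9 };   /* one period */
#define PERIOD (sizeof pattern / sizeof pattern[0])

unsigned next_address(unsigned base)
{
    static size_t phase;                    /* modulo counter */
    unsigned addr = base + pattern[phase];
    phase = (phase + 1) % PERIOD;           /* wrap at the period */
    return addr;
}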
Another simplification of address generation hardware can be achieved
by employing certain interesting properties of the exclusive OR (XOR)
function. Schmit and Thomas [1998] presented an address bit inversion
technique for generating a simplified address generator at the expense of a
small area overhead. The authors point out that if two arrays a and b have
sizes A and B (index ranges 0...A−1 and 0...B−1), respectively, such that
A & B = 0, i.e., the bitwise AND of the arrays' sizes is zero, then two
disjoint address spaces
are created by performing a bitwise XOR on the index of one array with the
size of the other.
Example 15. Suppose we have to store two arrays a and b with sizes of
3 and 4 words, respectively, in the same memory module. In order to access
random elements a[i] and b[j] from memory, the arrays would normally
be located contiguously in memory, and the addressing circuit would be
implemented as follows:
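The remainder of the example is a sketch of ours based on the scheme just
described: with sizes A = 3 and B = 4 (so A & B = 0), the conventional
layout needs an adder for b's base offset, while the XOR scheme needs only
XOR gates:

enum { A_SIZE = 3, B_SIZE = 4 };   /* bitwise AND of sizes is zero */

/* Conventional contiguous layout: b's address needs a base-offset adder. */
unsigned addr_a_contig(unsigned i) { return i; }            /* 0..2 */
unsigned addr_b_contig(unsigned j) { return A_SIZE + j; }   /* 3..6 */

/* XOR scheme: each index XORed with the other array's size gives the
   disjoint address spaces {4,5,6} and {0,1,2,3} without an adder. */
unsigned addr_a_xor(unsigned i) { return i ^ B_SIZE; }
unsigned addr_b_xor(unsigned j) { return j ^ A_SIZE; }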
5. CONCLUSIONS
We have presented a survey of contemporary and emerging data and
memory optimization techniques for embedded systems.
We first discussed platform-independent memory optimizations that op-
erate on a source-to-source level, and typically guarantee improved perfor-
mance, power, and cost metrics, irrespective of the implementation’s target
architecture. Next, we surveyed a number of data and memory optimization
techniques applicable to memory structures at different levels of architec-
tural granularity: from registers and register files, all the way up to
off-chip memory structures. Finally, we discussed the attendant address
generation optimizations that remove the address and local control over-
head that appears as a byproduct of both platform-independent, as well as
platform-dependent, data and memory optimizations.
Given the constraints on the length of this manuscript, we have at-
tempted to survey a wide range of both traditional approaches as well as
emerging techniques designed to handle memory issues in embedded
systems, from the viewpoint of performance, power, and area (cost). We
have not addressed the context of parallel platforms, including data-
and (dynamic) task-level parallelism. Many open issues remain in the
context of memory-intensive embedded systems, including testing, valida-
tion, and (formal) verification, embedded system reliability, and optimiza-
tion opportunities in the context of networked embedded systems.
As complex embedded systems-on-a-chip (SOCs) begin to proliferate, and
as the software content of these embedded SOCs dominates the design
process, memory issues will continue to be a critical optimization dimen-
sion in the design and development of future embedded systems.
ACKNOWLEDGMENTS
We gratefully acknowledge the input from our colleagues at IMEC and the
ACES laboratory at U.C. Irvine and their many research contributions,
which are partly summarized in this survey.
REFERENCES
AGARWAL, A., KRANZ, D., AND NATARAJAN, V. 1995. Automatic partitioning of parallel loops
and data arrays for distributed shared-memory multiprocessors. IEEE Trans. Parallel
Distrib. Syst. 6, 9 (Sept.), 943–962.
AHMAD, I. AND CHEN, C. Y. R. 1991. Post-processor for data path synthesis using multiport
memories. In Proceedings of the IEEE/ACM International Conference on Computer-Aided
Design (ICCAD ’91, Santa Clara, CA, Nov. 11-14). IEEE Computer Society Press, Los
Alamitos, CA, 276 –279.
AHO, A., SETHI, R., AND ULLMAN, J. 1986. Compilers: Principles, Techniques, and
Tools. Addison-Wesley, Reading, MA.
AMARASINGHE, S., ANDERSON, J., LAM, M., AND TSENG, C.-W. 1995. An overview of the SUIF
compiler for scalable parallel machines. In Proceedings of the SIAM Conference on Parallel
Processing for Scientific Computing (San Francisco, CA, Feb.). SIAM, Philadelphia, PA.
BAJWA, R. S., HIRAKI, M., KOJIMA, H., GORNY, D. J., NITTA, K., SHRIDHAR, A., SEKI, K., AND
SASAKI, K. 1997. Instruction buffering to reduce power in processors for signal
processing. IEEE Trans. Very Large Scale Integr. Syst. 5, 4, 417– 424.
BAKSHI, S. AND GAJSKI, D. D. 1995. A memory selection algorithm for high-performance
pipelines. In Proceedings of the European Conference EURO-DAC ’95 with EURO-VHDL ’95
on Design Automation (Brighton, UK, Sept. 18 –22), G. Musgrave, Ed. IEEE Computer
Society Press, Los Alamitos, CA, 124 –129.
BALAKRISHNAN, M., BANERJI, D. K., MAJUMDAR, A. K., LINDERS, J. G., AND MAJITHIA, J.
C. 1988. Allocation of multiport memories in data path synthesis. IEEE Trans. Comput.-
Aided Des. 7, 4 (Apr.), 536–540.
BALASA, F., CATTHOOR, F., AND DE MAN, H. 1994. Dataflow-driven memory allocation for
multi-dimensional signal processing systems. In Proceedings of the 1994 IEEE/ACM
International Conference on Computer-Aided Design (ICCAD ’94, San Jose, CA, Nov. 6 –10),
J. A. G. Jess and R. Rudell, Eds. IEEE Computer Society Press, Los Alamitos, CA, 31–34.
BALASA, F., CATTHOOR, F., AND DE MAN, H. 1995. Background memory area estimation for
multidimensional signal processing systems. IEEE Trans. Very Large Scale Integr. Syst. 3,
2 (June), 157–172.
BANERJEE, P., CHANDY, J., GUPTA, M., HODGES, E., HOLM, J., LAIN, A., PALERMO, D.,
RAMASWAMY, S., AND SU, E. 1995. The PARADIGM compiler for distributed-memory
multicomputers. IEEE Computer 28, 10 (Oct.), 37–47.
BANERJEE, U. 1998. Dependence Analysis for Supercomputing. Kluwer Academic Publishers,
Hingham, MA.
BANERJEE, U., EIGENMANN, R., NICOLAU, A., AND PADUA, D. A. 1993. Automatic program
parallelization. Proc. IEEE 81, 2 (Feb.), 211–243.
BELLAS, N., HAJJ, I. N., POLYCHRONOPOULOS, C. D., AND STAMOULIS, G. 2000. Architectural and
compiler techniques for energy reduction in high-performance microprocessors. IEEE
Trans. Very Large Scale Integr. Syst. 8, 3 (June), 317–326.
BENINI, L. AND DE MICHELI, G. 2000. System-level power optimization techniques and
tools. ACM Trans. Des. Autom. Electron. Syst. 5, 2 (Apr.), 115–192.
BENINI, L., DE MICHELI, G., MACII, E., PONCINO, M., AND QUER, S. 1998a. Power optimization
of core-based systems by address bus encoding. IEEE Trans. Very Large Scale Integr. Syst.
6, 4, 554–562.
BENINI, L., DE MICHELI, G., MACII, E., SCIUTO, D., AND SILVANO, C. 1998b. Address bus
encoding techniques for system-level power optimization. In Proceedings of the Conference
on Design, Automation and Test in Europe (DATE ’98). 861–866.
BENINI, L., MACII, A., AND PONCINO, M. 2000. A recursive algorithm for low-power memory
partitioning. In Proceedings of the IEEE International Symposium on Low Power Design
(Rapallo, Italy, Aug.). IEEE Computer Society Press, Los Alamitos, CA, 78–83.
BROCKMEYER, E., VANDECAPPELLE, A., AND CATTHOOR, F. 2000a. Systematic cycle budget
versus system power trade-off: a new perspective on system exploration of real-time data-
dominated applications. In Proceedings of the IEEE International Symposium on Low Power
Design (Rapallo, Italy, Aug.). IEEE Computer Society Press, Los Alamitos, CA, 137–142.
BROCKMEYER, E., WUYTACK, S., VANDECAPPELLE, A., AND CATTHOOR, F. 2000b. Low power
storage cycle budget tool support for hierarchical graphs. In Proceedings of the 13th
ACM/IEEE International Symposium on System-Level Synthesis (Madrid, Sept.). ACM
Press, New York, NY, 20–22.
CATTHOOR, F., DANCKAERT, K., KULKARNI, C., AND OMNES, T. 2000. Data transfer and storage
architecture issues and exploration in multimedia processors. In Programmable Digital
Signal Processors: Architecture, Programming, and Applications, Y. H. Hu, Ed. Marcel
Dekker, Inc., New York, NY.
CATTHOOR, F., JANSSEN, M., NACHTERGAELE, L., AND DE MAN, H. 1996. System-level data-flow
transformations for power reduction in image and video processing. In Proceedings of the
International Conference on Electronic Circuits and Systems (Oct.). 1025–1028.
CATTHOOR, F., WUYTACK, S., DE GREEF, E., BALASA, F., NACHTERGAELE, L., AND VANDECAPPELLE,
A. 1998. Custom Memory Management Methodology: Exploration of Memory Organization
for Embedded Multimedia System Design. Kluwer Academic, Dordrecht, Netherlands.
CATTHOOR, F., FRANSSEN, F., WUYTACK, S., NACHTERGAELE, L., AND DE MAN, H. 1994. Global
communication and memory optimizing transformations for low power systems. In Proceed-
ings of the International Workshop on Low Power Design. 203–208.
CHAITIN, G., AUSLANDER, M., CHANDRA, A., COCKE, J., HOPKINS, M., AND MARKSTEIN,
P. 1981. Register allocation via coloring. Comput. Lang. 6, 1, 47–57.
CHANG, H.-K. AND LIN, Y.-L. 2000. Array allocation taking into account SDRAM
characteristics. In Proceedings of the Asia and South Pacific Conference on Design
Automation (Yokohama, Jan.). 497–502.
CHEN, T.-S. AND SHEU, J.-P. 1994. Communication-free data allocation techniques for
parallelizing compilers on multicomputers. IEEE Trans. Parallel Distrib. Syst. 5, 9 (Sept.),
924–938.
CIERNIAK, M. AND LI, W. 1995. Unifying data and control transformations for distributed
shared-memory machines. SIGPLAN Not. 30, 6 (June), 205–217.
CRUZ, J.-L., GONZALEZ, A., VALERO, M., AND TOPHAM, N. 2000. Multiple-banked register file
architectures. In Proceedings of the 27th International Symposium on Computer Architec-
ture (ISCA-27, Vancouver, B.C., June). ACM, New York, NY, 315–325.
CUPPU, V., JACOB, B. L., DAVIS, B., AND MUDGE, T. N. 1999. A performance comparison of
contemporary DRAM architectures. In Proceedings of the International Symposium on
Computer Architecture (Atlanta, GA, May). 222–233.
DA SILVA, J. L., CATTHOOR, F., VERKEST, D., AND DE MAN, H. 1998. Power exploration for
dynamic data types through virtual memory management refinement. In Proceedings of the
1998 International Symposium on Low Power Electronics and Design (ISLPED ’98,
Monterey, CA, Aug. 10–12), A. Chandrakasan and S. Kiaei, Chairs. ACM Press, New York,
NY, 311–316.
DANCKAERT, K., CATTHOOR, F., AND DE MAN, H. 1996. System-level memory management for
weakly parallel image processing. In Proceedings of the Conference on EuroPar’96 Parallel
Processing (Lyon, France, Aug.). Springer-Verlag, New York, NY, 217–225.
DANCKAERT, K., CATTHOOR, F., AND DE MAN, H. 1999. Platform independent data transfer and
storage exploration illustrated on a parallel cavity detection algorithm. In Proceedings of
the International Conference on Parallel and Distributed Processing Techniques and Appli-
cations (PDPTA ’99). 1669–1675.
DANCKAERT, K., CATTHOOR, F., AND DE MAN, H. 2000. A preprocessing step for global loop
transformations for data transfer and storage optimization. In Proceedings of the Interna-
tional Conference on Compilers, Architecture and Synthesis for Embedded Systems (San Jose,
CA, Nov.).
DARTE, A., RISSET, T., AND ROBERT, Y. 1993. Loop nest scheduling and transformations. In
Environments and Tools for Parallel Scientific Computing, J. J. Dongarra and B. Tou-
rancheau, Eds. Advances in Parallel Computing series. Elsevier Sci. Pub. B. V.,
Amsterdam, The Netherlands, 309 –332.
DARTE, A. AND ROBERT, Y. 1995. Affine-by-statement scheduling of uniform and affine loop
nests over parametric domains. J. Parallel Distrib. Comput. 29, 1 (Aug. 15), 43–59.
DIGUET, J. PH., WUYTACK, S., CATTHOOR, F., AND DE MAN, H. 1997. Formalized methodology
for data reuse exploration in hierarchical memory mappings. In Proceedings of the 1997
International Symposium on Low Power Electronics and Design (ISLPED ’97, Monterey, CA,
Aug. 18–20), B. Barton, M. Pedram, A. Chandrakasan, and S. Kiaei, Chairs. ACM Press,
New York, NY, 30–35.
DING, C. AND KENNEDY, K. 2000. The memory bandwidth bottleneck and its amelioration by a
compiler. In Proceedings of the International Symposium on Parallel and Distributed
Processing (Cancun, Mexico, May). 181–189.
DE GREEF, E. AND CATTHOOR, F. 1996. Reducing storage size for static control programs
mapped onto parallel architectures. In Proceedings of the Dagstuhl Seminar on Loop
Parallelisation (Schloss Dagstuhl, Germany, Apr.).
FEAUTRIER, P. 1991. Dataflow analysis of array and scalar references. Int. J. Parallel
Program. 20, 1, 23–53.
FEAUTRIER, P. 1995. Compiling for massively parallel architectures: A
perspective. Microprocess. Microprogram. 41, 5-6 (Oct.), 425–439.
FRABOULET, A., HUARD, G., AND MIGNOTTE, A. 1999. Loop alignment for memory access
optimisation. In Proceedings of the 12th ACM/IEEE International Symposium on System-
Level Synthesis (San Jose, CA, Dec.). ACM Press, New York, NY, 70–71.
FRANSSEN, F., BALASA, F., VAN SWAAIJ, M., CATTHOOR, F., AND DE MAN, H. 1993. Modeling
multi-dimensional data and control flow. IEEE Trans. Very Large Scale Integr. Syst. 1, 3
(Sept.), 319–327.
FRANSSEN, F., NACHTERGAELE, L., SAMSOM, H., CATTHOOR, F., AND DE MAN, H. 1994. Control
flow optimization for fast system simulation and storage minimization. In Proceedings of
the International Conference on Design and Test (Paris, Feb.). 20–24.
GAJSKI, D., DUTT, N., LIN, S., AND WU, A. 1992. High-Level Synthesis: Introduction to Chip
and System Design. Kluwer Academic Publishers, Hingham, MA.
GAREY, M. R. AND JOHNSON, D. S. 1979. Computers and Intractability: A Guide to the Theory
of NP-Completeness. W. H. Freeman and Co., New York, NY.
GHEZ, C., MIRANDA, M., VANDECAPPELLE, A., CATTHOOR, F., AND VERKEST, D. 2000. Systematic
high-level address code transformations for piece-wise linear indexing: illustration on a
medical imaging algorithm. In Proceedings of the IEEE Workshop on Signal Processing
Systems (Lafayette, LA, Oct.). IEEE Press, Piscataway, NJ, 623–632.
GONZÁLEZ, A., ALIAGAS, C., AND VALERO, M. 1995. A data cache with multiple caching
strategies tuned to different types of locality. In Proceedings of the 9th ACM International
Conference on Supercomputing (ICS ’95, Barcelona, Spain, July 3–7), M. Valero,
Chair. ACM Press, New York, NY, 338–347.
GOOSSENS, G., VANDEWALLE, J., AND DE MAN, H. 1989. Loop optimization in register-transfer
scheduling for DSP-systems. In Proceedings of the 26th ACM/IEEE Conference on Design
Automation (DAC ’89, Las Vegas, NV, June 25-29), D. E. Thomas, Ed. ACM Press, New
York, NY, 826–831.
GRANT, D. AND DENYER, P. B. 1991. Address generation for array access based on modulus m
counters. In Proceedings of the European Conference on Design Automation (EDAC,
Feb.). 118–123.
GRANT, D., DENYER, P. B., AND FINLAY, I. 1989. Synthesis of address generators. In
Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD
’89, Santa Clara, CA, Nov.). ACM Press, New York, NY, 116–119.
GRANT, D. M., VAN MEERBERGEN, J., AND LIPPENS, P. E. R. 1994. Optimization of address
generator hardware. In Proceedings of the 1994 Conference on European Design and Test
(Paris, France, Feb.). 325–329.
GREEF, E. D., CATTHOOR, F., AND DE MAN, H. 1995. Memory organization for video algorithms
on programmable signal processors. In Proceedings of the IEEE International Conference on
Computer Design (ICCD ’95, Austin, TX, Oct.). IEEE Computer Society Press, Los Alamitos,
CA, 552–557.
GREEF, E. D., CATTHOOR, F., AND DE MAN, H. 1997. Array placement for storage size reduction
in embedded multimedia systems. In Proceedings of the International Conference on
Applic.-Spec. Array Processors (Zurich, July). 66–75.
GRUN, P., BALASA, F., AND DUTT, N. 1998. Memory size estimation for multimedia
applications. In Proceedings of the Sixth International Workshop on Hardware/Software
Codesign (CODES/CASHE ’98, Seattle, WA, Mar. 15–18), G. Borriello, A. A. Jerraya, and L.
Lavagno, Chairs. IEEE Computer Society Press, Los Alamitos, CA, 145–149.
GRUN, P., DUTT, N., AND NICOLAU, A. 2000a. Memory aware compilation through accurate
timing extraction. In Proceedings of the Conference on Design Automation (Los Angeles, CA,
June). ACM Press, New York, NY, 316–321.
GRUN, P., DUTT, N., AND NICOLAU, A. 2000b. MIST: An algorithm for memory miss traffic
management. In Proceedings of the IEEE/ACM International Conference on Computer-
Aided Design (San Jose, CA, Nov.). ACM Press, New York, NY, 431–437.
GRUN, P., DUTT, N., AND NICOLAU, A. 2001. Access pattern based local memory customization
for low power embedded systems. In Proceedings of the Conference on Design, Automation,
and Test in Europe (Munich, Mar.).
GUPTA, M., SCHONBERG, E., AND SRINIVASAN, H. 1996. A unified framework for optimizing
communication in data-parallel programs. IEEE Trans. Parallel Distrib. Syst. 7, 7,
689–704.
GUPTA, S., MIRANDA, M., CATTHOOR, F., AND GUPTA, R. 2000. Analysis of high-level address
code transformations for programmable processors. In Proceedings of the 3rd ACM/IEEE
Conference on Design and Test in Europe (Mar.). ACM Press, New York, NY, 9–13.
HALAMBI, A., GRUN, P., GANESH, V., KHARE, A., DUTT, N., AND NICOLAU, A. 1999a. EXPRESSION:
A language for architecture exploration through compiler/simulator retargetability. In
Proceedings of the Conference on Design, Automation and Test in Europe (Munich, Mar.).
HALAMBI, A., GRUN, P., TOMIYAMA, H., DUTT, N., AND NICOLAU, A. 1999b. Automatic software
toolkit generation for embedded systems-on-chip. In Proceedings of the International
Conference on VLSI and CAD (ICVC ’99).
HALL, M. W., HARVEY, T. J., KENNEDY, K., MCINTOSH, N., MCKINLEY, K. S., OLDHAM, J. D.,
PALECZNY, M. H., AND ROTH, G. 1993. Experiences using the ParaScope Editor: an
interactive parallel programming tool. SIGPLAN Not. 28, 7 (July), 33–43.
HALL, M., ANDERSON, J., AMARASINGHE, S., MURPHY, B., LIAO, S., BUGNION, E., AND LAM, M.
1996. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer 29,
12 (Dec.), 84–89.
HENNESSY, J. L. AND PATTERSON, D. A. 1996. Computer Architecture: A Quantitative
Approach. 2nd ed. Morgan Kaufmann Publishers Inc., San Francisco, CA.
HUANG, C.-Y., CHEN, Y.-S., LIN, Y.-L., AND HSU, Y.-C. 1990. Data path allocation based on
bipartite weighted matching. In Proceedings of the 27th ACM/IEEE Conference on Design
Automation (DAC ’90, Orlando, FL, June 24-28), R. C. Smith, Chair. ACM Press, New York,
NY, 499–504.
ISO/IEC MOVING PICTURE EXPERTS GROUP. 2001. The MPEG Home Page (http://www.cselt.it/mpeg/).
ITOH, K., SASAKI, K., AND NAKAGOME, Y. 1995. Trends in low-power RAM circuit
technologies. Proc. IEEE 83, 4 (Apr.), 524–543.
JHA, P. K. AND DUTT, N. 1997. Library mapping for memories. In Proceedings of the
Conference on European Design and Test (Mar.). 288–292.
JOUPPI, N. 1990. Improving direct-mapped cache performance by the addition of a small
fully-associative cache and prefetch buffers. In Proceedings of the 17th International
Symposium on Computer Architecture (ISCA ’90, Seattle, WA, May). IEEE Press, Piscat-
away, NJ, 364–373.
KANDEMIR, M., VIJAYKRISHNAN, N., IRWIN, M. J., AND YE, W. 2000. Influence of compiler
optimisations on system power. In Proceedings of the Conference on Design Automation (Los
Angeles, CA, June). ACM Press, New York, NY, 304–307.
KARCHMER, D. AND ROSE, J. 1994. Definition and solution of the memory packing problem for
field-programmable systems. In Proceedings of the 1994 IEEE/ACM International Confer-
ence on Computer-Aided Design (ICCAD ’94, San Jose, CA, Nov. 6–10), J. A. G. Jess and R.
Rudell, Eds. IEEE Computer Society Press, Los Alamitos, CA, 20–26.
KELLY, W. AND PUGH, W. 1992. Generating schedules and code within a unified reordering
transformation framework. UMIACS-TR-92-126. University of Maryland at College Park,
College Park, MD.
KHARE, A., PANDA, P. R., DUTT, N. D., AND NICOLAU, A. 1999. High-level synthesis with
SDRAMs and RAMBUS DRAMs. IEICE Trans. Fundam. Electron. Commun. Comput. Sci.
E82-A, 11 (Nov.), 2347–2355.
KIM, T. AND LIU, C. L. 1993. Utilization of multiport memories in data path synthesis. In
Proceedings of the 30th ACM/IEEE International Conference on Design Automation (DAC
’93, Dallas, TX, June 14–18), A. E. Dunlop, Ed. ACM Press, New York, NY, 298–302.
KIROVSKI, D., LEE, C., POTKONJAK, M., AND MANGIONE-SMITH, W. 1999. Application-driven
synthesis of memory-intensive systems-on-chip. IEEE Trans. Comput.-Aided Des. 18, 9
(Sept.), 1316–1326.
KJELDSBERG, P. G., CATTHOOR, F., AND AAS, E. J. 2000a. Automated data dependency size
estimation with a partially fixed execution ordering. In Proceedings of the IEEE/ACM
International Conference on Computer-Aided Design (San Jose, CA, Nov.). ACM Press, New
York, NY, 44–50.
KJELDSBERG, P. G., CATTHOOR, F., AND AAS, E. J. 2000b. Storage requirement estimation for
data-intensive applications with partially fixed execution ordering. In Proceedings of the
ACM/IEEE Workshop on Hardware/Software Co-Design (San Diego, CA, May). ACM Press,
New York, NY, 56–60.
KOHAVI, Z. 1978. Switching and Finite Automata Theory. McGraw-Hill, Inc., New York, NY.
KOLSON, D. J., NICOLAU, A., AND DUTT, N. 1994. Minimization of memory traffic in high-level
synthesis. In Proceedings of the 31st Annual Conference on Design Automation (DAC ’94,
San Diego, CA, June 6–10), M. Lorenzetti, Chair. ACM Press, New York, NY, 149–154.
KRAMER, H. AND MULLER, J. 1992. Assignment of global memory elements for multi-process
VHDL specifications. In Proceedings of the International Conference on Computer-Aided
Design. 496–501.
KULKARNI, C., CATTHOOR, F., AND DE MAN, H. 1999. Cache transformations for low power
caching in embedded multimedia processors. In Proceedings of the International Sympo-
sium on Parallel Processing (Orlando, FL, Apr.). 292–297.
KULKARNI, C., CATTHOOR, F., AND DE MAN, H. 2000. Advanced data layout organization for
multi-media applications. In Proceedings of the Workshop on Parallel and Distributed
Computing in Image Processing, Video Processing, and Multimedia (PDIVM 2000, Cancun,
Mexico, May).
KULKARNI, D. AND STUMM, M. 1995. Linear loop transformations in optimizing compilers for
parallel machines. Aust. Comput. J. 27, 2 (May), 41–50.
KURDAHI, F. J. AND PARKER, A. C. 1987. REAL: A program for REgister ALlocation. In
Proceedings of the 24th ACM/IEEE Conference on Design Automation (DAC ’87, Miami
Beach, FL, June 28-July 1), A. O’Neill and D. Thomas, Eds. ACM Press, New York, NY,
210–215.
LEE, H.-D. AND HWANG, S.-Y. 1995. A scheduling algorithm for multiport memory minimiza-
tion in datapath synthesis. In Proceedings of the Conference on Asia Pacific Design
Automation (CD-ROM) (ASP-DAC ’95, Makuhari, Japan, Aug. 29–Sept. 4), I. Shirakawa,
Chair. ACM Press, New York, NY, 93–100.
LEFEBVRE, V. AND FEAUTRIER, P. 1997. Optimizing storage size for static control programs in
automatic parallelizers. In Proceedings of the Conference on EuroPar. Springer-Verlag,
New York, NY, 356–363.
LEUPERS, R. AND MARWEDEL, P. 1996. Algorithms for address assignment in DSP code
generation. In Proceedings of the 1996 IEEE/ACM International Conference on Computer-
Aided Design (ICCAD ’96, San Jose, CA, Nov. 10–14), R. A. Rutenbar and R. H. J. M. Otten,
Chairs. IEEE Computer Society Press, Los Alamitos, CA, 109–112.
LI, W. AND PINGALI, K. 1994. A singular loop transformation framework based on non-
singular matrices. Int. J. Parallel Program. 22, 2 (Apr.), 183–205.
LI, Y. AND HENKEL, J.-R. 1998. A framework for estimating and minimizing energy dissipation
of embedded HW/SW systems. In Proceedings of the 35th Annual Conference on Design
Automation (DAC ’98, San Francisco, CA, June 15–19), B. R. Chawla, R. E. Bryant, and J.
M. Rabaey, Chairs. ACM Press, New York, NY, 188–193.
LI, Y. AND WOLF, W. 1998. Hardware/software co-synthesis with memory hierarchies. In
Proceedings of the 1998 IEEE/ACM International Conference on Computer-Aided Design
(ICCAD ’98, San Jose, CA, Nov. 8-12), H. Yasuura, Chair. ACM Press, New York, NY,
430–436.
LIEM, C., PAULIN, P., AND JERRAYA, A. 1996. Address calculation for retargetable compilation
and exploration of instruction-set architectures. In Proceedings of the 33rd Annual
Conference on Design Automation (DAC ’96, Las Vegas, NV, June 3–7), T. P. Pennino and E.
J. Yoffa, Chairs. ACM Press, New York, NY, 597–600.
LOVEMAN, D. B. 1977. Program improvement by source-to-source transformation. J. ACM 24,
1 (Jan.), 121–145.
LY, T., KNAPP, D., MILLER, R., AND MACMILLEN, D. 1995. Scheduling using behavioral
templates. In Proceedings of the 32nd ACM/IEEE Conference on Design Automation (DAC
’95, San Francisco, CA, June 12–16), B. T. Preas, Ed. ACM Press, New York, NY, 101–106.
MANJIKIAN, N. AND ABDELRAHMAN, T. 1995. Fusion of loops for parallelism and
locality. Tech. Rep. CSRI-315. Dept. of Computer Science, University of Toronto, Toronto,
Ont., Canada.
MASSELOS, K., CATTHOOR, F., GOUTIS, C. E., AND DE MAN, H. 1999a. A performance oriented
use methodology of power optimizing code transformations for multimedia applications
realized on programmable multimedia processors. In Proceedings of the IEEE Workshop on
Signal Processing Systems (Taipei, Taiwan). IEEE Computer Society Press, Los Alamitos,
CA, 261–270.
MASSELOS, K., DANCKAERT, K., CATTHOOR, F., GOUTIS, C. E., AND DE MAN, H. 1999b. A
methodology for power efficient partitioning of data-dominated algorithm specifications
within performance constraints. In Proceedings of the IEEE International Symposium on
Low Power Design (San Diego, CA, Aug.). IEEE Computer Society Press, Los Alamitos, CA,
270–272.
MCFARLING, S. 1989. Program optimization for instruction caches. In Proceedings of the 3rd
International Conference on Architectural Support for Programming Languages and Operat-
ing Systems (ASPLOS-III, Boston, MA, Apr. 3–6), J. Emer, Chair. ACM Press, New York,
NY, 183–191.
MCKINLEY, K. S. 1998. A compiler optimization algorithm for shared-memory multiprocessors.
IEEE Trans. Parallel Distrib. Syst. 9, 8, 769–787.
MCKINLEY, K. S., CARR, S., AND TSENG, C.-W. 1996. Improving data locality with loop
transformations. ACM Trans. Program. Lang. Syst. 18, 4 (July), 424–453.
MENG, T., GORDON, B., TSENG, E., AND HUNG, A. 1995. Portable video-on-demand in wireless
communication. Proc. IEEE 83, 4 (Apr.), 659–690.
MIRANDA, M., CATTHOOR, F., AND DE MAN, H. 1994. Address equation optimization and
hardware sharing for real-time signal processing applications. In Proceedings of the IEEE
Workshop on VLSI Signal Processing VII (La Jolla, CA, Oct. 26-28). IEEE Press, Piscat-
away, NJ, 208–217.
MIRANDA, M. A., CATTHOOR, F. V. M., JANSSEN, M., AND DE MAN, H. J. 1998. High-level
address optimization and synthesis techniques for data-transfer-intensive
applications. IEEE Trans. Very Large Scale Integr. Syst. 6, 4, 677–686.
MISHRA, P., GRUN, P., DUTT, N., AND NICOLAU, A. 2001. Processor-memory co-exploration
driven by a memory-aware architecture description language. In Proceedings of the
Conference on VLSI Design (Bangalore).
MOWRY, T. C., LAM, M. S., AND GUPTA, A. 1992. Design and evaluation of a compiler algorithm
for prefetching. SIGPLAN Not. 27, 9 (Sept.), 62–73.
MUSOLL, E., LANG, T., AND CORTADELLA, J. 1998. Working-zone encoding for reducing the
energy in microprocessor address buses. IEEE Trans. Very Large Scale Integr. Syst. 6, 4,
568–572.
NEERACHER, M. AND RUHL, R. 1993. Automatic parallelization of LINPACK routines on
distributed memory parallel processors. In Proceedings of the IEEE International Sympo-
sium on Parallel Processing (Newport Beach, CA, Apr.). IEEE Computer Society Press, Los
Alamitos, CA.
NICOLAU, A. AND NOVACK, S. 1993. Trailblazing: A hierarchical approach to percolation
scheduling. In Proceedings of the International Conference on Parallel Processing: Software
(Boca Raton, FL, Aug.). CRC Press, Inc., Boca Raton, FL, 120–124.
SHANG, W., HODZIC, E., AND CHEN, Z. 1996. On uniformization of affine dependence
algorithms. IEEE Trans. Comput. 45, 7 (July), 827–839.
SHANG, W., O’KEEFE, M. T., AND FORTES, J. A. B. 1992. Generalized cycle shrinking. In
Proceedings of the International Workshop on Algorithms and Parallel VLSI Architectures II
(Gers, France, June 3–6), P. Quinton and Y. Robert, Eds. Elsevier Sci. Pub. B. V.,
Amsterdam, The Netherlands, 131–144.
SHIUE, W. AND CHAKRABARTI, C. 1999. Memory exploration for low power, embedded
systems. In Proceedings of the 36th ACM/IEEE Conference on Design Automation (New
Orleans, LA, June). ACM Press, New York, NY, 140–145.
SHIUE, W.-T., TADAS, S., AND CHAKRABARTI, C. 2000. Low power multi-module, multi-port
memory design for embedded systems. In Proceedings of the IEEE Workshop on Signal
Processing Systems (Lafayette, LA, Oct.). IEEE Press, Piscataway, NJ, 529–538.
SLOCK, P., WUYTACK, S., CATTHOOR, F., AND DE JONG, G. 1997. Fast and extensive system-level
memory exploration for ATM applications. In Proceedings of the Tenth International
Symposium on System Synthesis (ISSS ’97, Antwerp, Belgium, Sept. 17–19), F. Vahid and F.
Catthoor, Chairs. IEEE Computer Society Press, Los Alamitos, CA, 74–81.
STAN, M. R. AND BURLESON, W. P. 1995. Bus-invert coding for low-power I/O. IEEE Trans.
Very Large Scale Integr. Syst. 3, 1 (Mar.), 49–58.
STOK, L. AND JESS, J. A. G. 1992. Foreground memory management in data path
synthesis. Int. J. Circuits Theor. Appl. 20, 3, 235–255.
SU, C.-L. AND DESPAIN, A. M. 1995. Cache design trade-offs for power and performance
optimization: a case study. In Proceedings of the 1995 International Symposium on Low
Power Design (ISLPD-95, Dana Point, CA, Apr. 23–26), M. Pedram, R. Brodersen, and K.
Keutzer, Eds. ACM Press, New York, NY, 63–68.
SUDARSANAM, A. AND MALIK, S. 2000. Simultaneous reference allocation in code generation for
dual data memory bank ASIPs. ACM Trans. Des. Autom. Electron. Syst. 5, 2 (Apr.), 242–264.
SYNOPSYS INC. 1997. Behavioral Compiler User Guide. Synopsys, Inc., Mountain View, CA.
THIELE, L. 1989. On the design of piecewise regular processor arrays. In Proceedings of the
IEEE International Symposium on Circuits and Systems (Portland, OR, May). IEEE Press,
Piscataway, NJ, 2239–2242.
TOMIYAMA, H., HALAMBI, A., GRUN, P., DUTT, N., AND NICOLAU, A. 1999. Architecture
description languages for systems-on-chip design. In Proceedings of the 6th Asia Pacific
Conference on Chip Design Languages (Fukuoka, Japan, Oct.). 109–116.
TOMIYAMA, H., ISHIHARA, T., INOUE, A., AND YASUURA, H. 1998. Instruction scheduling for
power reduction in processor-based system design. In Proceedings of the Conference on
Design, Automation and Test in Europe (DATE ’98). 855–860.
TOMIYAMA, H. AND YASUURA, H. 1996. Size-constrained code placement for cache miss rate
reduction. In Proceedings of the ACM/IEEE International Symposium on System Synthesis
(La Jolla, CA, Nov.). ACM Press, New York, NY, 96–101.
TOMIYAMA, H. AND YASUURA, H. 1997. Code placement techniques for cache miss rate
reduction. ACM Trans. Des. Autom. Electron. Syst. 2, 4, 410–429.
TSENG, C. AND SIEWIOREK, D. P. 1986. Automated synthesis of data paths in digital
systems. IEEE Trans. Comput.-Aided Des. 5, 3 (July), 379–395.
VANDECAPPELLE, A., MIRANDA, M., BROCKMEYER, E., CATTHOOR, F., AND VERKEST, D. 1999. Global
multimedia system design exploration using accurate memory organization feedback. In
Proceedings of the 36th ACM/IEEE Conference on Design Automation (New Orleans, LA,
June). ACM Press, New York, NY, 327–332.
VERBAUWHEDE, I., CATTHOOR, F., VANDEWALLE, J., AND DE MAN, H. 1989. Background memory
management for the synthesis of algebraic algorithms on multi-processor DSP chips. In
Proceedings of the IFIP 1989 International Conference on VLSI (IFIP VLSI ’89, Munich,
Aug.). IFIP, 209–218.
VERBAUWHEDE, I. M., SCHEERS, C. J., AND RABAEY, J. M. 1994. Memory estimation for high
level synthesis. In Proceedings of the 31st Annual Conference on Design Automation (DAC
’94, San Diego, CA, June 6–10), M. Lorenzetti, Chair. ACM Press, New York, NY, 143–148.
VERHAEGH, W., LIPPENS, P., AARTS, E., KORST, J., VAN MEERBERGEN, J., AND VAN DER WERF, A.
1995. Improved force-directed scheduling in high-throughput digital signal processing.
IEEE Trans. Comput.-Aided Des. 14, 8 (Aug.), 945–960.
VERHAEGH, W., LIPPENS, P., AARTS, E., VAN MEERBERGEN, J., AND VAN DER WERF, A. 1996.
Multi-dimensional periodic scheduling: model and complexity. In Proceedings of the
Conference on EuroPar’96 Parallel Processing (Lyon, France, Aug.). Springer-Verlag, New
York, NY, 226–235.
WILSON, P. R., JOHNSTONE, M., NEELY, M., AND BOLES, D. 1995. Dynamic storage allocation: A
survey and critical review. In Proceedings of the International Workshop on Memory
Management (Kinross, Scotland, Sept.).
WOLF, M. E. AND LAM, M. S. 1991. A loop transformation theory and an algorithm to
maximize parallelism. IEEE Trans. Parallel Distrib. Syst. 2, 4 (Oct.), 452–471.
WOLFE, M. 1991. The Tiny loop restructuring tool. In Proceedings of the 1991 International
Conference on Parallel Processing (Aug.).
WOLFE, M. 1996. High-Performance Compilers for Parallel Computing. Addison-Wesley,
Reading, MA.
WUYTACK, S., CATTHOOR, F., DE JONG, G., AND DE MAN, H. 1999a. Minimizing the required
memory bandwidth in VLSI system realizations. IEEE Trans. Very Large Scale Integr. Syst.
7, 4 (Dec.), 433–441.
WUYTACK, S., DA SILVA, J. L., CATTHOOR, F., DE JONG, G., AND YKMAN-COUVREUR, C. 1999b.
Memory management for embedded network applications. IEEE Trans. Comput.-Aided Des.
18, 5 (May), 533–544.
WUYTACK, S., DIGUET, J.-P., CATTHOOR, F. V. M., AND DE MAN, H. J. 1998. Formalized
methodology for data reuse exploration for low-power hierarchical memory mappings. IEEE
Trans. Very Large Scale Integr. Syst. 6, 4, 529–537.
YKMAN-COUVREUR, C., LAMBRECHT, J., VERKEST, D., CATTHOOR, F., AND DE MAN, H. 1999.
Exploration and synthesis of dynamic data sets in telecom network applications. In
Proceedings of the 12th ACM/IEEE International Symposium on System-Level Synthesis
(San Jose, CA, Dec.). ACM Press, New York, NY, 125–130.
ZHAO, Y. AND MALIK, S. 1999. Exact memory size estimation for array computation without
loop unrolling. In Proceedings of the 36th ACM/IEEE Conference on Design Automation
(New Orleans, LA, June). ACM Press, New York, NY, 811–816.