
To appear in ISCA-27 (2000)

Memory Access Scheduling


Scott Rixner1, William J. Dally, Ujval J. Kapasi, Peter Mattson, and John D. Owens

Computer Systems Laboratory


Stanford University
Stanford, CA 94305
{rixner, billd, ujk, pmattson, jowens}@cva.stanford.edu

1. Scott Rixner is an Electrical Engineering graduate student at the Massachusetts Institute of Technology.

Abstract

The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the “3-D” structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors where it enables the processor to make the most efficient use of scarce memory bandwidth.

1 Introduction

Modern computer systems are becoming increasingly limited by memory performance. While processor performance increases at a rate of 60% per year, the bandwidth of a memory chip increases by only 10% per year, making it costly to provide the memory bandwidth required to match the processor performance [14] [17]. The memory bandwidth bottleneck is even more acute for media processors with streaming memory reference patterns that do not cache well. Without an effective cache to reduce the bandwidth demands on main memory, these media processors are more often limited by memory system bandwidth than other computer systems.

To maximize memory bandwidth, modern DRAM components allow pipelining of memory accesses, provide several independent memory banks, and cache the most recently accessed row of each bank. While these features increase the peak supplied memory bandwidth, they also make the performance of the DRAM highly dependent on the access pattern. Modern DRAMs are not truly random access devices (equal access time to all locations) but rather are three-dimensional memory devices with dimensions of bank, row, and column. Sequential accesses to different rows within one bank have high latency and cannot be pipelined, while accesses to different banks or different words within a single row have low latency and can be pipelined. The three-dimensional nature of modern memory devices makes it advantageous to reorder memory operations to exploit the non-uniform access times of the DRAM. This optimization is similar to how a superscalar processor schedules arithmetic operations out of order. As with a superscalar processor, the semantics of sequential execution are preserved by reordering the results.

This paper introduces memory access scheduling in which DRAM operations are scheduled, possibly completing memory references out of order, to optimize memory system performance. The several memory access scheduling strategies introduced in this paper increase the sustained memory bandwidth of a system by up to 144% over a system with no access scheduling when applied to realistic synthetic benchmarks. Media processing applications exhibit a 30% improvement in sustained memory bandwidth with memory access scheduling, and the traces of these applications offer a potential bandwidth improvement of up to 93%.
To see the advantage of memory access scheduling, consider the sequence of eight memory operations shown in Figure 1A. Each reference is represented by the triple (bank, row, column). Suppose we have a memory system utilizing a DRAM that requires 3 cycles to precharge a bank, 3 cycles to access a row of a bank, and 1 cycle to access a column of a row. Once a row has been accessed, a new column access can issue each cycle until the bank is precharged. If these eight references are performed in order, each requires a precharge, a row access, and a column access for a total of seven cycles per reference, or 56 cycles for all eight references. If we reschedule these operations as shown in Figure 1B they can be performed in 19 cycles.

Figure 1. Time to complete a series of memory references without (A) and with (B) access reordering. The eight references (bank, row, column) are (0,0,0), (0,1,0), (0,0,1), (0,1,3), (1,0,0), (1,1,1), (1,0,1), and (1,1,2); in order they require 56 DRAM cycles, reordered they require 19. DRAM operations: P = bank precharge (3 cycle occupancy), A = row activation (3 cycle occupancy), C = column access (1 cycle occupancy).
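To make the arithmetic of this example explicit, the short sketch below (an illustration using the timing parameters assumed above, not the paper's simulator) counts the DRAM cycles consumed in order and shows how far simply grouping column accesses by row gets before any bank-level overlap:

    # Eight references of Figure 1, given as (bank, row, column) triples.
    refs = [(0, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 3),
            (1, 0, 0), (1, 1, 1), (1, 0, 1), (1, 1, 2)]

    T_PRECHARGE, T_ACTIVATE, T_COLUMN = 3, 3, 1   # cycles, as assumed in the text

    # In-order service: every reference pays a full precharge + activate + column access.
    in_order = len(refs) * (T_PRECHARGE + T_ACTIVATE + T_COLUMN)
    print(in_order)    # 56 cycles, matching Figure 1A

    # Grouping column accesses behind each row activation needs only one
    # precharge/activate pair per distinct (bank, row); overlapping those pairs
    # across the two banks is what brings the schedule of Figure 1B down to 19 cycles.
    grouped = len({(b, r) for b, r, c in refs}) * (T_PRECHARGE + T_ACTIVATE) + len(refs)
    print(grouped)     # 32 cycles even before exploiting bank parallelism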
The following section discusses the characteristics of modern DRAM architecture. Section 3 introduces the concept of memory access scheduling and the possible algorithms that can be used to reorder DRAM operations. Section 4 describes the streaming media processor and benchmarks that will be used to evaluate memory access scheduling. Section 5 presents a performance comparison of the various memory access scheduling algorithms. Finally, Section 6 presents related work to memory access scheduling.

2 Modern DRAM Architecture

As illustrated by the example in the Introduction, the order in which DRAM accesses are scheduled can have a dramatic impact on memory throughput and latency. To improve memory performance, a memory controller must take advantage of the characteristics of modern DRAM.

Figure 2 shows the internal organization of modern DRAMs. These DRAMs are three-dimensional memories with the dimensions of bank, row, and column. Each bank operates independently of the other banks and contains an array of memory cells that are accessed an entire row at a time. When a row of this memory array is accessed (row activation) the entire row of the memory array is transferred into the bank's row buffer. The row buffer serves as a cache to reduce the latency of subsequent accesses to that row. While a row is active in the row buffer, any number of reads or writes (column accesses) may be performed, typically with a throughput of one per cycle. After completing the available column accesses, the cached row must be written back to the memory array by an explicit operation (bank precharge) which prepares the bank for a subsequent row activation. An overview of several different modern DRAM types and organizations, along with a performance comparison for in-order access, can be found in [4].

For example, the 128Mb NEC µPD45128163 [13], a typical SDRAM, includes four internal memory banks, each composed of 4096 rows and 512 columns. This SDRAM may be operated at 125MHz, with a precharge latency of 3 cycles (24ns) and a row access latency of 3 cycles (24ns). Pipelined column accesses that transfer 16 bits may issue at the rate of one per cycle (8ns), yielding a peak transfer rate of 250MB/s. However, it is difficult to achieve this rate on non-sequential access patterns for several reasons. A bank cannot be accessed during the precharge/activate latency, a single cycle of high impedance is required on the data pins when switching between read and write column accesses, and a single set of address lines is shared by all DRAM operations (bank precharge, row activation, and column access). The amount of bank parallelism that is exploited and the number of column accesses that are made per row access dictate the sustainable memory bandwidth out of such a DRAM, as illustrated in Figure 1 of the Introduction.

A memory access scheduler must generate a schedule that conforms to the timing and resource constraints of these modern DRAMs. Figure 3 illustrates these constraints for the NEC SDRAM with a simplified bank state diagram and a table of operation resource utilization. Each DRAM operation makes different demands on the three DRAM resources: the internal banks, a single set of address lines, and a single set of data lines. The scheduler must ensure that the required resources are available for each DRAM operation it schedules.
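The cost of failing to exploit row locality and bank parallelism can be estimated directly from the figures quoted above for this part. The following back-of-the-envelope sketch (an illustration, not a measurement) assumes a worst case in which every access pays the full 7-cycle precharge/activate/column sequence:

    CLOCK_HZ = 125e6        # SDRAM clock of the NEC part described above
    BYTES_PER_COLUMN = 2    # 16-bit transfer per column access

    peak = CLOCK_HZ * BYTES_PER_COLUMN        # one column access per cycle
    print(peak / 1e6)                         # 250.0 MB/s, the quoted peak transfer rate

    worst = CLOCK_HZ / 7 * BYTES_PER_COLUMN   # precharge (3) + activate (3) + column (1)
    print(round(worst / 1e6, 1))              # ~35.7 MB/s, nearly an order of magnitude lower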
Figure 2. Modern DRAM organization. (Each bank contains a memory array with a row decoder, sense amplifiers acting as the row buffer, and a column decoder; Banks 0 through N share a single set of address lines and a single set of data lines.)

Figure 3. Simplified state diagram and resource utilization governing access to an internal DRAM bank. (A: each bank moves between the IDLE and ACTIVE states via row activation, column access, and bank precharge. B: the bank, address, and data resources occupied by the precharge, activate, read, and write operations over cycles 1-4.)

Each DRAM bank has two stable states: IDLE and ACTIVE, as shown in Figure 3A. In the IDLE state, the DRAM is precharged and ready for a row access. It will remain in this state until a row activate operation is issued to the bank. To issue a row activation, the address lines must be used to select the bank and the row being activated, as shown in Figure 3B. Row activation requires 3 cycles, during which no other operations may be issued to that bank, as indicated by the utilization of the bank resource for the duration of the operation. During that time, however, operations may be issued to other banks of the DRAM. Once the DRAM's row activation latency has passed, the bank enters the ACTIVE state, during which the contents of the selected row are held in the bank's row buffer. Any number of pipelined column accesses may be performed while the bank is in the ACTIVE state. To issue either a read or write column access, the address lines are required to indicate the bank and the column of the active row in that bank. A write column access requires the data to be transferred to the DRAM at the time of issue, whereas a read column access returns the requested data three cycles later. Additional timing constraints not shown in Figure 3, such as a required cycle of high impedance between reads and writes, may further restrict the use of the data pins.

The bank will remain in the ACTIVE state until a precharge operation is issued to return it to the IDLE state. The precharge operation requires the use of the address lines to indicate the bank which is to be precharged. Like row activation, the precharge operation utilizes the bank resource for 3 cycles, during which no new operations may be issued to that bank. Again, operations may be issued to other banks during this time. After the DRAM's precharge latency, the bank is returned to the IDLE state and is ready for a new row activation operation. Frequently, there are also timing constraints that govern the minimum latency between a column access and a subsequent precharge operation. DRAMs typically also support column accesses with automatic precharge, which implicitly precharges the DRAM bank as soon as possible after the column access.

The shared address and data resources serialize access to the different DRAM banks. While the state machines for the individual banks are independent, only a single bank can perform a transition requiring a particular shared resource each cycle. For many DRAMs, the bank, row, and column addresses share a single set of lines. Hence, the scheduler must arbitrate between precharge, row, and column operations that all need to use this single resource. Other DRAMs, such as Direct Rambus DRAMs (DRDRAMs) [3], provide separate row and column address lines (each with their own associated bank address) so that column and row accesses can be initiated simultaneously. To approach the peak data rate with serialized resources, there must be enough column accesses to each row to hide the precharge/activate latencies of other banks. Whether or not this can be achieved is dependent on the data reference patterns and the order in which the DRAM is accessed to satisfy those references. The need to hide the precharge/activate latency of the banks in order to sustain high bandwidth cannot be eliminated by any DRAM architecture without reducing the precharge/activate latency, which would likely come at the cost of decreased bandwidth or capacity, both of which are undesirable.
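A rough software model of the per-bank state machine and occupancies just described might look like the following sketch. It assumes the 3-cycle precharge and activation latencies and 1-cycle column occupancy used throughout this paper; the shared address and data buses and the read/write turnaround cycle are deliberately not modeled:

    class DRAMBank:
        # Minimal IDLE/ACTIVE model of one internal DRAM bank (illustrative only).
        T_PRECHARGE, T_ACTIVATE = 3, 3        # cycles the bank resource stays occupied

        def __init__(self):
            self.state = "IDLE"               # IDLE: precharged, ready for a row activation
            self.active_row = None
            self.busy_until = 0               # cycle at which the bank resource frees up

        def can_issue(self, now):
            return now >= self.busy_until

        def activate(self, row, now):
            assert self.state == "IDLE" and self.can_issue(now)
            self.state, self.active_row = "ACTIVE", row
            self.busy_until = now + self.T_ACTIVATE

        def column_access(self, row, now, is_read=True):
            # One column access per cycle while the row is held in the row buffer;
            # a read returns its data three cycles later, a write carries data at issue.
            assert self.state == "ACTIVE" and self.active_row == row and self.can_issue(now)
            self.busy_until = now + 1
            return now + 3 if is_read else now

        def precharge(self, now):
            assert self.state == "ACTIVE" and self.can_issue(now)
            self.state, self.active_row = "IDLE", None
            self.busy_until = now + self.T_PRECHARGE

    bank = DRAMBank()
    bank.activate(row=5, now=0)
    print(bank.column_access(5, now=3))       # 6: read data appears three cycles after issue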
Figure 4. Memory access scheduler architecture. (Pending references for each bank, each holding V, L/S, Row, Col, Data, and State fields, feed a per-bank precharge manager and row arbiter; a single column arbiter and a single address arbiter select the DRAM operations to issue.)

3 Memory Access Scheduling

Memory access scheduling is the process of ordering the DRAM operations (bank precharge, row activation, and column access) necessary to complete the set of currently pending memory references. Throughout the paper, the term operation denotes a command, such as a row activation or a column access, issued by the memory controller to the DRAM. Similarly, the term reference denotes a memory reference generated by the processor, such as a load or store to a memory location. A single reference generates one or more memory operations depending on the schedule.

Given a set of pending memory references, a memory access scheduler may choose one or more row, column, or precharge operations each cycle, subject to resource constraints, to advance one or more of the pending references. The simplest, and most common, scheduling algorithm only considers the oldest pending reference, so that references are satisfied in the order that they arrive. If it is currently possible to make progress on that reference by performing some DRAM operation, then the memory controller makes the appropriate access. While this does not require a complicated access scheduler in the memory controller, it is clearly inefficient, as illustrated in Figure 1 of the Introduction.

If the DRAM is not ready for the operation required by the oldest pending reference, or if that operation would leave available resources idle, it makes sense to consider operations for other pending references. Figure 4 shows the structure of a more sophisticated access scheduler. As memory references arrive, they are allocated storage space while they await service from the memory access scheduler. In the figure, references are initially sorted by DRAM bank. Each pending reference is represented by six fields: valid (V), load/store (L/S), address (Row and Col), data, and whatever additional state is necessary for the scheduling algorithm. Examples of state that can be accessed and modified by the scheduler are the age of the reference and whether or not that reference targets the currently active row. In practice, the pending reference storage could be shared by all the banks (with the addition of a bank address field) to allow dynamic allocation of that storage at the cost of increased logic complexity in the scheduler.

As shown in Figure 4, each bank has a precharge manager and a row arbiter. The precharge manager simply decides when its associated bank should be precharged. Similarly, the row arbiter for each bank decides which row, if any, should be activated when that bank is idle. A single column arbiter is shared by all the banks. The column arbiter grants the shared data line resources to a single column access out of all the pending references to all of the banks. Finally, the precharge managers, row arbiters, and column arbiter send their selected operations to a single address arbiter which grants the shared address resources to one or more of those operations.

The precharge managers, row arbiters, and column arbiter can use several different policies to select DRAM operations, as enumerated in Table 1. The combination of policies used by these units, along with the address arbiter's policy, determines the memory access scheduling algorithm. The address arbiter must decide which of the selected precharge, activate, and column operations to perform, subject to the constraints of the address line resources. As with all of the other scheduling decisions, the in-order or priority policies can be used by the address arbiter to make this selection. Additional policies that can be used are those that select precharge operations first, row operations first, or column operations first. A column-first scheduling policy would reduce the latency of references to active rows, whereas a precharge-first or row-first scheduling policy would increase the amount of bank parallelism.

If the address resources are not shared, it is possible for both a precharge operation and a column access to the same bank to be selected. This is likely to violate the timing constraints of the DRAM. Ideally, this conflict can be handled by having the column access automatically precharge the bank upon completion, which is supported by most modern DRAMs.
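Returning to the scheduler storage of Figure 4, a pending-reference entry and the per-bank queues could be modeled as below. This is a sketch of one possible software representation rather than the hardware described above; the age and active-row fields correspond to the examples of additional scheduler state mentioned earlier:

    from dataclasses import dataclass
    from typing import Dict, List, Optional

    @dataclass
    class PendingReference:
        valid: bool            # V: the entry holds a live reference
        is_load: bool          # L/S: True for a load, False for a store
        row: int               # row address within the bank
        col: int               # column address within the row
        data: Optional[int]    # store data, or None for a load awaiting its result
        age: int = 0           # extra scheduler state: cycles spent waiting
        hits_active_row: bool = False   # extra state: targets the bank's active row

    # Per-bank pending reference storage, as drawn in Figure 4 (four internal banks).
    bank_queues: Dict[int, List[PendingReference]] = {bank: [] for bank in range(4)}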
Table 1. Scheduling policies for the precharge managers, row arbiters, and column arbiter.

in-order (precharge, row, and column arbiters): A DRAM operation will only be performed if it is required by the oldest pending reference. While used by almost all memory controllers today, this policy yields poor performance compared to policies that look ahead in the reference stream to better utilize DRAM resources.

priority (precharge, row, and column arbiters): The operation(s) required by the highest priority ready reference(s) are performed. Three possible priority schemes include: ordered, older references are given higher priority; age-threshold, references older than some threshold age gain increased priority; and load-over-store, load references are given higher priority. Age-threshold prevents starvation while allowing greater reordering flexibility than ordered. Load-over-store decreases load latency to minimize processor stalling on stream loads.

open (precharge manager): A bank is only precharged if there are pending references to other rows in the bank and there are no pending references to the active row. The open policy should be employed if there is significant row locality, making it likely that future references will target the same row as previous references did.

closed (precharge manager): A bank is precharged as soon as there are no more pending references to the active row. The closed policy should be employed if it is unlikely that future references will target the same row as the previous set of references.

most pending (row and column arbiters): The row or column access to the row with the most pending references is selected. This allows rows to be activated that will have the highest ratio of column to row accesses, while waiting for other rows to accumulate more pending references. By selecting the column access to the most demanded row, that bank will be freed up as soon as possible to allow other references to make progress. This policy can be augmented by one of the priority schemes described above to prevent starvation.

fewest pending (column arbiter): The fewest pending policy selects the column access to the row targeted by the fewest pending references. This minimizes the time that rows with little demand remain active, allowing references to other rows in that bank to make progress sooner. A weighted combination of the fewest pending and most pending policies could also be used to select a column access. This policy can also be augmented by one of the priority schemes described above to prevent starvation.
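The precharge and column-arbiter policies above can be phrased as small selection functions. The sketch below is a rough software rendering under assumed data structures (a pending reference is reduced to a (bank, row, age) record, and active_row maps each bank to its open row or None); it is meant only to make the policy definitions concrete:

    from collections import namedtuple

    Ref = namedtuple("Ref", "bank row age")   # assumed minimal pending-reference record

    def precharge_closed(bank, active_row, pending):
        # closed: precharge as soon as no pending reference targets the active row.
        row = active_row[bank]
        return row is not None and not any(r.bank == bank and r.row == row for r in pending)

    def precharge_open(bank, active_row, pending):
        # open: precharge only when other rows of the bank are wanted and the active row is not.
        row = active_row[bank]
        if row is None:
            return False
        wants_active = any(r.bank == bank and r.row == row for r in pending)
        wants_other = any(r.bank == bank and r.row != row for r in pending)
        return wants_other and not wants_active

    def select_column(pending, active_row, scheme="ordered"):
        # Candidates are references whose row is currently active in their bank.
        ready = [r for r in pending if active_row.get(r.bank) == r.row]
        if not ready:
            return None
        demand = lambda r: sum(p.bank == r.bank and p.row == r.row for p in pending)
        if scheme == "ordered":           # oldest ready reference first
            return max(ready, key=lambda r: r.age)
        if scheme == "most pending":      # column access to the most demanded row
            return max(ready, key=demand)
        if scheme == "fewest pending":    # column access to the least demanded row
            return min(ready, key=demand)
        raise ValueError(scheme)

    # Example: bank 0 holds row 7 open; two references wait on it and one on another row.
    pending = [Ref(0, 7, age=5), Ref(0, 7, age=2), Ref(0, 3, age=9)]
    print(select_column(pending, {0: 7}, "ordered").age)   # 5: the oldest ready reference
    print(precharge_open(0, {0: 7}, pending))              # False: the active row is still wanted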

4 Experimental Setup

Streaming media data types do not cache well, so they require other types of support to improve memory performance. In a stream (or vector) processor, the stream transfer bandwidth, rather than the latency of any individual memory reference, drives processor performance. A streaming media processing system, therefore, is a prime candidate for memory access scheduling. To evaluate the performance impact of memory access scheduling on media processing, a streaming media processor was simulated running typical media processing applications.

4.1 Stream Processor Architecture

Media processing systems typically do not cache streaming media data types, because modern cache hierarchies cannot handle them efficiently [10]. In a media computation on long streams of data, the same operations are performed repeatedly on consecutive stream elements, and the stream elements are discarded after the operations are performed. These streams do not cache well because they lack temporal locality (stream elements are usually only referenced once) and they have a large cache footprint, which makes it likely that they will interfere with other data in the cache. In many media processing systems, stream accesses bypass the cache so as not to interfere with other data that does cache well. Many streams are accessed sequentially, so prefetching streams into the cache can sometimes be effective at improving processor performance [15]. However, this is an inefficient way to provide storage for streaming data because address translation is required on every reference, accesses are made with long addresses, tag overhead is incurred in the cache, and conflicts may evict previously fetched data.

The Imagine stream processor [16] employs a 64KB stream register file (SRF), rather than a cache, to capture the reference locality of streams. Entire streams are transferred between the DRAMs and the SRF. This is more efficient than a cache because a single instruction, rather than many explicit instructions, can be used to transfer a stream of data to or from memory.

Stream memory transfers (similar to vector memory transfers) are independent operations that are isolated from computation. Therefore, the memory system can be loading streams for the next set of computations and storing streams for the previous set of computations while the current set of computations are occurring. A computation cannot commence until all of the streams it requires are present in the stream register file. The Imagine streaming memory system consists of a pair of address generators, four interleaved memory bank controllers, and a pair of reorder buffers that place stream data in the SRF in the correct order. All of these units are on the same chip as the Imagine processor core.
The address generators support three addressing modes: constant stride, indirect, and bit-reversed. The address generators may generate memory reference streams of any length, as long as the data fits in the SRF. For constant stride references, the address generator takes a base, stride, and length, and computes successive addresses by incrementing the base address by the stride. For indirect references, the address generator takes a base address and an index stream from the SRF and calculates addresses by adding each index to the base address. Bit-reversed addressing is used for FFT memory references and is similar to constant stride addressing, except that bit-reversed addition is used to calculate addresses.
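As an illustration only (not Imagine's address generator hardware), the three modes might be expressed as follows. The bit-reversed generator here simply reverses the bits of each element index, one common way to realize the reordering that bit-reversed addition produces for power-of-two FFT strides:

    def constant_stride(base, stride, length):
        # Successive addresses: base, base + stride, base + 2*stride, ...
        return [base + i * stride for i in range(length)]

    def indirect(base, index_stream):
        # Indices are supplied by a stream held in the SRF.
        return [base + idx for idx in index_stream]

    def bit_reversed(base, length, word_size=4):
        # Assumes length is a power of two; reverse the bits of each element index.
        bits = length.bit_length() - 1
        rev = lambda i: int(format(i, f"0{bits}b")[::-1], 2)
        return [base + rev(i) * word_size for i in range(length)]

    print(constant_stride(0x1000, 4, 4))   # [4096, 4100, 4104, 4108]
    print(bit_reversed(0, 8))              # [0, 16, 8, 24, 4, 20, 12, 28]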
Figure 5 shows the architecture of the memory bank controllers. (Note that these are external memory banks, each composed of separate DRAM chips, in contrast to the internal memory banks within each DRAM chip.) References arriving from the address generators are stored in a small holding buffer until they can be processed. Despite the fact that there is no cache, a set of registers similar in function to the miss status holding registers (MSHRs) of a non-blocking cache [9] exist to keep track of in-flight references and to do read and write coalescing. When a reference arrives for a location that is already the target of another in-flight reference, the MSHR entry for that reference is updated to reflect that this reference will be satisfied by the same DRAM access. When a reference to a location that is not already the target of another in-flight reference arrives, a new MSHR is allocated and the reference is sent to the bank buffer. The bank buffer corresponds directly to the pending reference storage in Figure 4, although the storage for all of the internal DRAM banks is combined into one 32 entry buffer. The memory controller schedules DRAM accesses to satisfy the pending references in the bank buffer and returns completed accesses to the MSHRs. The MSHRs send completed loads to the reply buffer where they are held until they can be sent back to the reorder buffers. As the name implies, the reorder buffers receive out of order references and transfer the data to the SRF in order.

Figure 5. Memory bank controller architecture. (Each controller contains holding buffers fed by the two address generators, the MSHRs, a bank buffer, the memory access scheduler and memory controller, and a reply buffer that returns data to the two reorder buffers; the DRAM itself is off-chip.)

In this streaming memory system, memory consistency is maintained in two ways: conflicting memory stream references are issued in dependency order and the MSHRs ensure that references to the same address complete in the order that they arrive. This means that a stream load that follows a stream store to overlapping locations may be issued as soon as the address generators have sent all of the store's references to the memory banks.

For the simulations, it was assumed that the processor frequency was 500 MHz and that the DRAM frequency was 125 MHz, which corresponds to the expected clock frequency of the Imagine stream processor and the clock frequency of existing SDRAM parts. At this frequency, Imagine has a peak computation rate of 20GFLOPS on single precision floating point computations and 20GOPS on 32-bit integer computations. Each memory bank controller has two external NEC µPD45128163 SDRAM chips attached to it to provide a column access width of 32 bits, which is the word size of the Imagine processor. These SDRAM chips were briefly described earlier and a complete specification can be found in [13]. The peak bandwidth of the SDRAMs connected to each memory bank controller is 500MB/s, yielding a total peak memory bandwidth of 2GB/s in the system.

4.2 Benchmarks

The experiments were run on a set of microbenchmarks and five media processing applications. Table 2 describes the microbenchmarks, followed by the applications. For the microbenchmarks, no computations are performed outside of the address generators. This allows memory references to be issued at their maximum throughput, constrained only by the buffer storage in the memory banks. For the applications, the simulations were run both with the applications' computations and without. When running just the memory traces, dependencies were maintained by assuming the computation occurred at the appropriate times but was instantaneous. The application results show the performance improvements that can be gained by using memory access scheduling with a modern media processor. The application traces, with instantaneous computation, show the potential of these scheduling methods as processing power increases and the applications become entirely limited by memory bandwidth.
Table 2. Benchmarks.

Microbenchmarks:
Unit Load - Unit stride load stream accesses with parallel streams to different rows in different internal DRAM banks.
Unit - Unit stride load and store stream accesses with parallel streams to different rows in different internal DRAM banks.
Unit Conflict - Unit stride load and store stream accesses with parallel streams to different rows in the same internal DRAM banks.
Constrained Random - Random access load and store streams constrained to a 64KB range.
Random - Random access load and store streams to the entire address space.

Applications:
FFT - Ten consecutive 1024-point real Fast Fourier Transforms.
Depth - Stereo depth extraction from a pair of 320x240 8-bit grayscale images. (Depth performs depth extraction using Kanade's algorithm [8]. Only two stereo images are used in the benchmark, as opposed to the multiple cameras of the video-rate stereo machine.)
QRD - QR matrix decomposition of a 192x96 element matrix. (QRD uses blocked Householder transformations to generate an orthogonal Q matrix and an upper triangular R matrix such that Q·R is equal to the input matrix.)
MPEG - MPEG2 encoding of three frames of 360x288 24-bit color video.
Tex - Triangle rendering of a 720x720 24-bit color image with texture mapping. (Tex applies modelview, projection, and viewport transformations on its unmeshed input triangle stream and performs perspective-correct bilinear interpolated texture mapping on its generated fragments. A single frame of the SPECviewperf 6.1.1 Advanced Visualizer benchmark image was rendered.)

5 Experimental Results

A memory controller that performs no access reordering will serve as a basis for comparison. This controller performs no access scheduling, as it uses an in-order policy, described in Table 1, for all decisions: a column access will only be performed for the oldest pending reference, a bank will only be precharged if necessary for the oldest pending reference, and a row will only be activated if it is needed by the oldest pending reference. No other references are considered in the scheduling decision. This algorithm, or slight variations such as automatically precharging the bank when a cache line fetch is completed, can commonly be found in systems today.

The gray bars of Figure 6 show the performance of the benchmarks using the baseline in-order access scheduler. Unsurprisingly, unit load performs very well with no access scheduling, achieving 97% of the peak bandwidth (2GB/s) of the DRAMs. The 3% overhead is the combined result of the infrequent precharge/activate cycles and the start-up/shut-down delays of the streaming memory system.

The 14% drop in sustained bandwidth from the unit load benchmark to the unit benchmark shows the performance degradation imposed by forcing intermixed load and store references to complete in order. Each time the references switch between loads and stores a cycle of high impedance must be left on the data pins, decreasing the sustainable bandwidth. The unit conflict benchmark further shows the penalty of swapping back and forth between rows in the DRAM banks, which drops the sustainable bandwidth down to 51% of the peak. The random benchmarks sustain about 15% of the bandwidth of the unit load benchmark. This loss roughly corresponds to the degradation incurred by performing accesses with a throughput of one word every seven DRAM cycles (the random access throughput of the SDRAM) compared to a throughput of one word every DRAM cycle (the column access throughput of the SDRAM).

The applications' behavior closely mimics their associated microbenchmarks. The QRD and MPEG traces include many unit and small constant stride accesses, leading to a sustained bandwidth that approaches that of the unit benchmark. The Depth trace consists almost exclusively of constant stride accesses, but dependencies limit the number of simultaneous stream accesses that can occur. The FFT trace is composed of constant stride loads and bit-reversed stores. The bit-reversed accesses sustain less bandwidth than constant stride accesses because they generate sequences of references that target a single memory bank and then a sequence of references that target the next memory bank and so on. This results in lower bandwidth than access patterns that more evenly distribute the references across the four memory banks. Finally, the Tex trace includes constant stride accesses, but is dominated by texture accesses which are essentially random within the texture memory space. These texture accesses lead to the lowest sustained bandwidth of the applications. Note that for the applications, memory bandwidth corresponds directly to performance because the applications make the same number of memory references regardless of the scheduling algorithm. Therefore, increased bandwidth means decreased execution time.

5.1 First-ready Scheduling

The use of a very simple first-ready access scheduler improves performance by an average of over 25% on all of the benchmarks.
Figure 6. Sustained memory bandwidth using in-order and first-ready access schedulers (2 GB/s peak supplied bandwidth). Three panels plot Memory Bandwidth (MB/s) for the microbenchmarks, the applications, and the application memory traces, including a weighted mean for the applications and traces.

Figure 7. Sustained memory bandwidth of memory access scheduling algorithms (2 GB/s peak supplied bandwidth). Three panels plot Memory Bandwidth (MB/s) for the microbenchmarks, applications, and application memory traces under the in-order, first-ready, col/open, col/closed, row/open, and row/closed schedulers.
Table 3. Reordering scheduling algorithm policies.

col/open: column access = priority (ordered); precharging = open; row activation = priority (ordered); access selection = column first.
col/closed: column access = priority (ordered); precharging = closed; row activation = priority (ordered); access selection = column first.
row/open: column access = priority (ordered); precharging = open; row activation = priority (ordered); access selection = row first.
row/closed: column access = priority (ordered); precharging = closed; row activation = priority (ordered); access selection = row first.

First-ready scheduling uses the ordered priority scheme, as described in Table 1, to make all scheduling decisions. The first-ready scheduler considers all pending references and schedules a DRAM operation for the oldest pending reference that does not violate the timing and resource constraints of the DRAM. The most obvious benefit of this scheduling algorithm over the baseline is that accesses targeting other banks can be made while waiting for a precharge or activate operation to complete for the oldest pending reference. This relaxes the serialization of the in-order scheduler and allows multiple references to progress in parallel.
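A sketch of that decision (illustrative only, not the simulator's code): scan the pending references from oldest to youngest and issue the first required operation that the current DRAM state allows. The next_operation and can_issue helpers are assumed stand-ins for the bank state and timing rules of Section 2:

    def first_ready(pending, dram, now):
        # Return (reference, operation) for the oldest pending reference whose next
        # required DRAM operation (precharge, activate, or column access) can legally
        # issue this cycle, or None if nothing is ready.
        for ref in sorted(pending, key=lambda r: r.age, reverse=True):   # oldest first
            op = dram.next_operation(ref)       # assumed helper: operation this reference needs next
            if dram.can_issue(op, ref, now):    # assumed helper: timing/resource check
                return ref, op
        return None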
Figure 6 shows the sustained bandwidth of the in-order and first-ready scheduling algorithms for each benchmark. The sustained bandwidth is increased by 79% for the microbenchmarks, 17% for the applications, and 40% for the application traces. As should be expected, unit load shows little improvement as it already sustains almost all of the peak SDRAM bandwidth, and the random benchmarks show an improvement of over 125%, as they are able to increase the number of column accesses per row activation significantly.

5.2 Aggressive Reordering

When the oldest pending reference targets a different row than the active row in a particular bank, the first-ready scheduler will precharge that bank even if it still has pending references to its active row. More aggressive scheduling algorithms are required to further improve performance. In this section, four scheduling algorithms, enumerated in Table 3, that attempt to further increase sustained memory bandwidth are investigated. The policies for each of the schedulers in Table 3 are described in Table 1. The range of possible memory access schedulers is quite large, and covering all of the schedulers examined in Section 3 would be prohibitive. These four schedulers were chosen to be representative of many of the important characteristics of an aggressive memory access scheduler.

Figure 7 presents the sustained memory bandwidth for each memory access scheduling algorithm on the given benchmarks. These aggressive scheduling algorithms improve the memory bandwidth of the microbenchmarks by 106-144%, the applications by 27-30%, and the application traces by 85-93% over in-order scheduling.
Figure 8. Sensitivity to bank buffer size. Three panels plot Memory Bandwidth (MB/s) for the microbenchmarks, applications, and application memory traces with bank buffers of 4, 8, 16, 32, and 64 entries.


Unlike the rest of the applications, MPEG does not show a noticeable improvement in performance when moving to the more aggressive scheduling algorithms. On a stream architecture like Imagine, MPEG efficiently captures data locality within the SRF. This makes MPEG compute-bound, thereby eliminating any opportunity for performance improvement by improving the memory system bandwidth. However, the performance on the memory trace can be improved by the aggressive scheduling algorithms to over 90% of the peak bandwidth of the memory system.

The use of a column-first or a row-first access selection policy makes very little difference across all of the benchmarks. There are minor variations, but no significant performance improvements in either direction, except for FFT. This has less to do with the characteristics of the scheduling algorithm than with the fact that the FFT benchmark is the most sensitive to stream load latency, and the col/open scheduler happens to allow a store stream to delay load streams in this instance.

The benchmarks that include random or somewhat random address traces favor a closed precharge policy, in which banks are precharged as soon as there are no more pending references to their active row. This is to be expected as it is unlikely that there will be any reference locality that would make it beneficial to keep the row open. By precharging as soon as possible, the access latency of future references is minimized. For most of the other benchmarks, the difference between an open and a closed precharge policy is slight. Notable exceptions are unit load and FFT. Unit load performs worse with the col/closed algorithm. This is because column accesses are satisfied rapidly, emptying the bank buffer of references to a stream, allowing the banks to be precharged prematurely in some instances. This phenomenon also occurs in the QRD and MPEG traces with the col/closed algorithm. FFT performs much better with an open precharging policy because of the bit-reversed reference pattern. A bit-reversed stream makes numerous accesses to each row, but they are much further apart in time than they would be in a constant stride access. Therefore, leaving a row activated until that bank is actually needed by another reference is advantageous, as it eliminates the need to reactivate the row when those future references finally arrive.

Figure 8 shows the effects of varying the bank buffer size on sustained memory bandwidth when using memory access scheduling. The row/closed scheduling algorithm is used with bank buffers varying in size from 4 to 64 entries. The unit load benchmark requires only 8 entries to saturate the memory system. The unit conflict and random benchmarks require 16 entries to achieve their peak bandwidth. The unit and constrained random benchmarks are able to utilize additional buffer space to improve bandwidth.

A 16 entry buffer allows all of the applications to achieve their peak memory bandwidth, which is 7% higher than with a 4 entry buffer. Depth and MPEG are not sensitive to the bank buffer size at all because they are compute-bound on these configurations. The bandwidth of Tex improves as the buffer size is increased from 4 to 16 entries because the larger buffer allows greater flexibility in reordering its non-strided texture references. QRD benefits from increasing the buffer size from 4 to 16 because it issues many conflicting stream transfers that benefit from increased reordering. For the applications' memory traces, there is a slight advantage to further increasing the buffer size beyond 16; a 16 entry buffer improves bandwidth by 27% and a 64 entry buffer improves bandwidth by 30% over a 4 entry buffer. Again, the sustainable bandwidth of FFT fluctuates because of its extreme sensitivity to load latency.

To stabilize the sustainable bandwidth of FFT, load references must be given higher priority than store references. Write buffers are frequently used to prevent pending store references from delaying load references required by the processor [5]. As the bank buffer is already able to perform this function, the col/open and row/open scheduling algorithms can simply be augmented with a load-over-store priority scheme for their column access and row activation policies. This allows load references to complete sooner, by giving them a higher priority than store references, as described in Table 1. Figure 9 shows that with the addition of the load-over-store priority scheme, the FFT trace sustains over 97% of the peak memory system bandwidth with both the load/row/open and load/col/open schedulers. Using a load-over-store policy does not affect the other applications which are not as sensitive to load latency.

Figure 9. Sustained memory bandwidth for FFT with load-over-store scheduling. (The FFT application and its memory trace are shown under the in-order, first-ready, col/open, row/open, load/col/open, and load/row/open schedulers; vertical axis: Memory Bandwidth (MB/s).)
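The load-over-store augmentation amounts to changing the comparison key used when ranking candidate references in the column and row arbiters; a minimal sketch, assuming each pending reference carries is_load and age fields:

    from collections import namedtuple

    Ref = namedtuple("Ref", "is_load age")

    def load_over_store_key(ref):
        # Loads beat stores; among equals, the older reference (larger age) wins.
        return (ref.is_load, ref.age)

    ready = [Ref(is_load=False, age=9), Ref(is_load=True, age=2)]
    print(max(ready, key=load_over_store_key))   # Ref(is_load=True, age=2): the load wins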
6 Related Work

Stream buffers prefetch data structured as streams or vectors to hide memory access latency [7]. Stream buffers do not, however, reorder the access stream to take advantage of the 3-D nature of DRAM. For streams with small, fixed strides, references from one stream tend to make several column accesses for each row activation, giving good performance on a modern DRAM. However, conflicts with other streams and non-stream accesses often evict the active row, thereby reducing performance. McKee's Stream Memory Controller (SMC) extends a simple stream buffer to reduce memory conflicts among streams by issuing several references from one stream before switching streams [6] [12]. The SMC, however, does not reorder references within a single stream.

The Command Vector Memory System (CVMS) [2] reduces the processor to memory address bandwidth by transferring commands to the memory controllers, rather than individual references. A command includes a base and a stride which is expanded into the appropriate sequence of references by each off-chip memory bank controller. The bank controllers in the CVMS utilize a row/closed scheduling policy among commands to improve the bandwidth and latency of the SDRAM. The Parallel Vector Access unit (PVA) [11] augments the Impulse memory system [1] with a similar mechanism for transferring commands to the Impulse memory controller. Neither of these systems reorders references within a single stream. Conserving address bandwidth, as in the CVMS and PVA, is important for systems with off-chip memory controllers, but is largely orthogonal to memory access scheduling.

The SMC, CVMS, and PVA do not handle indirect (scatter/gather) streams. These references are usually handled by the processor cache, as they are not easily described to a stream prefetching unit. However, indirect stream references do not cache well because they are large and lack both spatial and temporal locality. These references also do not typically make consecutive column accesses to the same row, severely limiting the sustainable data bandwidth when those references are satisfied in order. The memory access scheduling techniques described here work for indirect streams as well as for strided streams, as demonstrated by the improvements in the random benchmarks and the Tex application.

Hitachi has proposed an access optimizer for embedded DRAM as part of a system-on-a-chip and has built a test chip containing the access optimizer and some DRAM [18]. This access optimizer implements a first-ready scheduler, occupies 1.5 mm², dissipates 26 mW, and runs at 100 MHz in a 0.18 µm process. While the more aggressive schedulers would require more logic, this should give a feel for the actual cost of memory access scheduling.

7 Conclusions

Memory bandwidth is becoming the limiting factor in achieving higher performance, especially in media processing systems. Processor performance improvements will continue to outpace increases in memory bandwidth, so techniques are needed to maximize the sustained memory bandwidth. To maximize the peak supplied data bandwidth, modern DRAM components allow pipelined accesses to a three-dimensional memory structure. Memory access scheduling greatly increases the bandwidth utilization of these DRAMs by buffering memory references and choosing to complete them in an order that both accesses the internal banks in parallel and maximizes the number of column accesses per row access, resulting in improved system performance.

Memory access scheduling realizes significant bandwidth gains on a set of media processing applications as well as on synthetic benchmarks and application address traces. A simple reordering algorithm that advances the first ready memory reference gives a 17% performance improvement on applications, a 79% bandwidth improvement for the microbenchmarks, and a 40% bandwidth improvement on the application traces. The application trace results give an indication of the performance improvement expected in the future as processors become more limited by memory bandwidth. More aggressive reordering, in which references are scheduled to increase locality and concurrency, yields substantially larger gains. Bandwidth for synthetic benchmarks improved by 144%, performance of the media processing applications improved by 30%, and the bandwidth of the application traces increased by 93%.

A comparison of alternative scheduling algorithms shows that on most benchmarks it is advantageous to employ a closed page scheduling policy in which banks are precharged as soon as the last column reference to an active row is completed. This is in part due to the ability of the DRAM to combine the bank precharge request with the final column access. There is little difference in performance between scheduling algorithms that give preference to row accesses over column accesses, except that the col/closed algorithm can sometimes close pages too soon, somewhat degrading performance. Finally, scheduling loads ahead of stores improves application performance for latency sensitive applications.

Contemporary cache organizations waste memory bandwidth in order to reduce the memory latency seen by the processor. As memory bandwidth becomes more precious, this will no longer be a practical solution to reducing memory latency. Media processing has already encountered this phenomenon, because streaming media data types do not cache well and require careful bandwidth management. As cache organizations evolve to be more conscious of memory bandwidth, techniques like memory access scheduling will be required to sustain a significant fraction of the available data bandwidth. Memory access scheduling is, therefore, an important step toward maximizing the utilization of the increasingly scarce memory bandwidth resources.
Acknowledgments

The authors would like to thank the other members of the Imagine project for their contributions. The authors would also like to thank Kekoa Proudfoot and Matthew Eldridge for providing the Tex benchmark triangle trace data. The research described in this paper was supported by the Defense Advanced Research Projects Agency under ARPA order E254 and monitored by the Army Intelligence Center under contract DABT63-96-C-0037.

References

[1] Carter, John, et al., Impulse: Building a Smarter Memory Controller. In Proceedings of the Fifth International Symposium on High Performance Computer Architecture (January 1999), pp. 70-79.
[2] Corbal, Jesus, Espasa, Roger, and Valero, Mateo, Command Vector Memory Systems: High Performance at Low Cost. In Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques (October 1998), pp. 68-77.
[3] Crisp, Richard, Direct Rambus Technology: The New Main Memory Standard. IEEE Micro (November/December 1997), pp. 18-28.
[4] Cuppu, Vinodh, et al., A Performance Comparison of Contemporary DRAM Architectures. In Proceedings of the International Symposium on Computer Architecture (May 1999), pp. 222-233.
[5] Emer, Joel S. and Clark, Douglas W., A Characterization of Processor Performance in the VAX-11/780. In Proceedings of the International Symposium on Computer Architecture (June 1984), pp. 301-310.
[6] Hong, Sung I., et al., Access Order and Effective Bandwidth for Streams on a Direct Rambus Memory. In Proceedings of the Fifth International Symposium on High Performance Computer Architecture (January 1999), pp. 80-89.
[7] Jouppi, Norman P., Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. In Proceedings of the International Symposium on Computer Architecture (May 1990), pp. 364-373.
[8] Kanade, Takeo, Kano, Hiroshi, and Kimura, Shigeru, Development of a Video-Rate Stereo Machine. In Proceedings of the International Robotics and Systems Conference (August 1995), pp. 95-100.
[9] Kroft, David, Lockup-Free Instruction Fetch/Prefetch Cache Organization. In Proceedings of the International Symposium on Computer Architecture (May 1981), pp. 81-87.
[10] Lee, Ruby B. and Smith, Michael D., Media Processing: A New Design Target. IEEE Micro (August 1996), pp. 6-9.
[11] Matthew, Binu K., et al., Design of a Parallel Vector Access Unit for SDRAM Memory Systems. In Proceedings of the Sixth International Symposium on High-Performance Computer Architecture (January 2000), pp. 39-48.
[12] McKee, Sally A. and Wulf, William A., Access Ordering and Memory-Conscious Cache Utilization. In Proceedings of the First Symposium on High Performance Computer Architecture (January 1995), pp. 253-262.
[13] NEC Corporation, 128M-bit Synchronous DRAM 4-bank, LVTTL Data Sheet. Document No. M12650EJ5V0DS00, 5th Edition, Revision K (July 1998).
[14] Patterson, David, et al., A Case for Intelligent RAM. IEEE Micro (March/April 1997), pp. 34-44.
[15] Ranganathan, Parthasarathy, et al., Performance of Image and Video Processing with General-Purpose Processors and Media ISA Extensions. In Proceedings of the International Symposium on Computer Architecture (May 1999), pp. 124-135.
[16] Rixner, Scott, et al., A Bandwidth-Efficient Architecture for Media Processing. In Proceedings of the International Symposium on Microarchitecture (December 1998), pp. 3-13.
[17] Saulsbury, Ashley, Pong, Fong, and Nowatzyk, Andreas, Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the International Symposium on Computer Architecture (May 1996), pp. 90-101.
[18] Watanabe, Takeo, et al., Access Optimizer to Overcome the "Future Walls of Embedded DRAMs" in the Era of Systems on Silicon. In IEEE International Solid-State Circuits Conference Digest of Technical Papers (February 1999), pp. 370-371.
