Memory Arbitration and Cache Management in Stream-Based Systems
Françoise Harmsze, Adwin Timmer and Jef van Meerbergen
(1) Philips Research Labs Eindhoven, The Netherlands
(2) Eindhoven University of Technology, The Netherlands
Abstract

This results in new methods for on- and off-chip communication.

1 Introduction

Compilation techniques are used to optimize the memory management [1],[4]. We focus however on an architectural method for the memory hierarchy which gives us flexibility at low area cost. In this paper we describe the stream caching scheme required for the media processors. For this scheme we will introduce a method to lock cache lines for periodic data streams. Figure 3 shows an instance of a multi-processor architecture called the CPA. This CoProcessor Array shares an external background memory with a CPU and peripherals.

Figure 3. Co-Processor Array (CPA) for video processing: video processors with a multiport stream cache, connected through a switch matrix and a control bus to the memory arbiter and control and the external SDRAM, which are shared with a CPU (with I$ and D$) and peripherals.
2 Background memory arbitration and response time calculations
In [2] a simple and efficient memory arbitration scheme is presented which supports continuous streams along with random requests. A service cycle of N clock cycles is defined, see Figure 4, in which M clock cycles are reserved for continuous (media) streams. These media streams do not require a low latency, since their demand for data is known well in advance. Pre-fetching from memory can thus be used for data streams coming from memory, and data streams towards memory can be buffered for a while. Given N and M, N - M clock cycles are available for random traffic such as generated by a CPU. The random traffic has highest priority, thus ensuring low latency, provided that enough cycles within the service cycle are remaining for the continuous streams. If this is not the case, the continuous streams will have priority. Of course, the value of M within N must be large enough to guarantee the periodic streams their required bandwidth.
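As an illustration of this scheme, the following C sketch shows one way a per-cycle grant decision could be made. The names, the arbiter structure and the one-request-per-cycle granularity are our own assumptions for the sketch, not the interface of [2].

#include <stdbool.h>

/* Hypothetical per-service-cycle arbiter state (names are ours).
 * N: service cycle length in clock cycles.
 * M: cycles reserved for the continuous (media) streams.
 * A fresh arbiter must start with cycle = 0 and stream_served = 0. */
typedef struct {
    int N;             /* service cycle length                         */
    int M;             /* cycles reserved for the streams              */
    int cycle;         /* current cycle within the service cycle       */
    int stream_served; /* stream cycles already granted in this cycle  */
} arbiter_t;

typedef enum { GRANT_NONE, GRANT_RANDOM, GRANT_STREAM } grant_t;

/* Decide who gets the next memory cycle.  Random traffic has the highest
 * priority as long as the remaining cycles still suffice to give the
 * streams their M reserved cycles; otherwise the streams take over.     */
grant_t arbitrate(arbiter_t *a, bool random_pending, bool stream_pending)
{
    int remaining = a->N - a->cycle;             /* cycles left in service cycle */
    int stream_needed = a->M - a->stream_served; /* stream cycles still owed     */
    grant_t g = GRANT_NONE;

    if (random_pending && remaining > stream_needed)
        g = GRANT_RANDOM;                        /* low latency for random traffic */
    else if (stream_pending && stream_needed > 0)
        g = GRANT_STREAM;                        /* guarantee the reserved cycles  */
    else if (stream_pending)
        g = GRANT_STREAM;
    else if (random_pending)
        g = GRANT_RANDOM;

    if (g == GRANT_STREAM)
        a->stream_served++;
    a->cycle++;
    if (a->cycle == a->N) {                      /* start a new service cycle */
        a->cycle = 0;
        a->stream_served = 0;
    }
    return g;
}

With this policy a burst of random requests is served at the start of a service cycle and the reserved stream cycles are pushed towards its end, matching the behaviour sketched in Figure 4.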
Figure 4. Service cycle of N clock cycles, showing a burst of random requests and the parameters τ and x on the time axis.
The random-request latency increases for a large service cycle or saturation. The latter can be understood from Figure 5. When N is increased, M must increase proportionally to allow the periodic streams their required bandwidth. For an overloaded system there are more random requests than those that can be handled in a period of N - M cycles. This will result in a longer period in which no random requests are allowed access to background memory. Therefore we see that for this case a large period of N, and therefore a large period of M cycles, will result in a large latency for those random requests that remain pending when the N - M cycles have already been consumed within a service cycle.

For our example with 20 streams we can calculate the required amount of buffering again if we take the worst case situation: τ and x can be calculated if we consider that all 20 streams share the M reserved cycles for their requests, and each request takes several clock cycles.

Because the streams are independent and because buffers in the cache memory are dynamically allocated and de-allocated, the cache memory will suffer from fragmentation. The run-time reconfiguration of each independent stream does not leave a singular moment when no stream is active. This means that there is no singular moment when the cache is not in use, and therefore garbage collection or de-fragmentation will be difficult. To avoid the disadvantage of cache fragmentation, an ordered list of non-assigned cache lines can be maintained. The initial situation is shown in Figure 6, where the HEAD of the list points to line #0. If a number of cache lines has to be assigned to a stream, they are taken from the head of this list.

Figure 6. Linked list: initial state (HEAD points to line #0, TAIL to line #7).
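A minimal sketch of such an ordered free list is given below, assuming a small table-based singly linked list; all names are ours, and a hardware implementation would keep the next pointers in a small RAM rather than in C structs.

#include <stdio.h>
#include <stdlib.h>

#define NUM_LINES 8

/* Singly linked free list over a static cache-line table (names are ours).
 * next[i] holds the index of the line after line #i, or -1 at the TAIL. */
typedef struct {
    int next[NUM_LINES];
    int head;   /* first unassigned line */
    int tail;   /* last unassigned line  */
} line_list_t;

/* Initial state as in Figure 6: lines #0..#7 linked in order. */
void init_list(line_list_t *l)
{
    for (int i = 0; i < NUM_LINES; i++)
        l->next[i] = (i + 1 < NUM_LINES) ? i + 1 : -1;
    l->head = 0;
    l->tail = NUM_LINES - 1;
}

/* Take up to n lines from the head of the free list and hand them to a
 * stream as a new list.  No search is needed and no fragmentation can
 * occur, because the list of unassigned lines always stays ordered.
 * Returns the tail of the stream's list, or -1 if nothing is available. */
int assign_lines(line_list_t *l, int n, int *out_head)
{
    if (n <= 0 || l->head < 0)
        return -1;
    int first = l->head, last = first;
    for (int i = 1; i < n && l->next[last] != -1; i++)
        last = l->next[last];
    l->head = l->next[last];   /* remaining unassigned lines   */
    l->next[last] = -1;        /* terminate the stream's list  */
    *out_head = first;
    return last;
}

/* Return a stream's lines to the free list by appending at the TAIL. */
void release_lines(line_list_t *l, int first, int last)
{
    if (l->head < 0)
        l->head = first;       /* free list was empty          */
    else
        l->next[l->tail] = first;
    l->tail = last;
    l->next[last] = -1;
}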
Removed: list of assigned cache lines of stream 0. Created: list of assigned cache lines of stream 0 (HEAD: line #0 through TAIL: line #4). Remaining: linked list of unassigned cache lines (HEAD: line #5 through TAIL: line #7).
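Continuing the sketch above, the scenario of the lists shown here (five lines assigned to stream 0, three lines remaining unassigned) can be reproduced as follows; the main function and its output format are ours.

int main(void)
{
    line_list_t list;
    int s0_head;

    init_list(&list);                                /* initial state: lines #0..#7 free */
    int s0_tail = assign_lines(&list, 5, &s0_head);  /* stream 0 receives lines #0..#4   */

    printf("stream 0:   HEAD line #%d, TAIL line #%d\n", s0_head, s0_tail);
    printf("unassigned: HEAD line #%d, TAIL line #%d\n", list.head, list.tail);

    release_lines(&list, s0_head, s0_tail);          /* de-allocation appends at the TAIL */
    printf("after release: HEAD line #%d, TAIL line #%d\n", list.head, list.tail);
    return 0;
}

After the release the unassigned list runs #5, #6, #7, #0, ..., #4, which shows why de-allocation never fragments the pool: freed lines simply rejoin the single ordered list.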
M_R = N/2 - M/2
5 Application: coprocessor array
We have applied both the arbitration scheme and the cache management scheme in the CPA shown in Figure 3. As mentioned in Section 1, the CPA shares an external background memory with a CPU and peripherals. The off-chip communication between the SDRAM and the memory arbiter uses arbitration at 3 levels, as shown in Figure 9.

Figure 9. Arbitration at 3 levels for the CPA. Level 3a arbitrates between the CPU + peripherals and GFX under control of GFX priority; level 3b arbitrates between configuration-time control and run-time control of the periodic streams; the debugger, the winning random request and the periodic streams meet at the higher levels.

Level 1 handles the arbitration between random requests
and periodic requests. This arbitration uses the scheme given in Section 2. Level 2 decides between debugger requests and other random requests, normally giving priority to debugging. Level 3a splits the remaining random requests into two types: requests from the CPU and requests from the graphics accelerator (GFX). The value of a variable, GFX priority, can be used to ensure that the GFX requests take precedence over the CPU requests when required.

The arbitration scheme in Figure 9 is valid for our application domain because it gives all random requests a low latency, provided they do not overload the memory. The scheme also gives all periodic requests a guaranteed bandwidth, while at the same time it gives periodic requests at configuration-time, such as programming parameters, a lower latency if required.
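The three levels can be summarized in a single decision routine. The following C sketch is our own rendering of Figure 9, with invented names and with the level-1 outcome of the Section 2 scheme passed in as a flag; it is an illustration, not the hardware arbiter itself.

#include <stdbool.h>

/* Pending requests at the memory arbiter (names are ours, after Figure 9). */
typedef struct {
    bool debugger, cpu, gfx;    /* random requestors                     */
    bool cfg_stream, rt_stream; /* periodic: configuration-/run-time     */
    bool gfx_priority;          /* level-3a control variable             */
    bool random_allowed;        /* level-1 outcome of the service-cycle  */
                                /* scheme of Section 2                   */
} requests_t;

typedef enum { NONE, DEBUGGER, CPU_REQ, GFX_REQ, CFG_STREAM, RT_STREAM } winner_t;

winner_t arbitrate3(const requests_t *r)
{
    bool any_random = r->debugger || r->cpu || r->gfx;

    /* Level 1: random requests win only while the service cycle still
     * leaves enough cycles for the periodic streams.                   */
    if (any_random && r->random_allowed) {
        /* Level 2: the debugger normally precedes other random traffic. */
        if (r->debugger) return DEBUGGER;
        /* Level 3a: CPU versus GFX, steered by the GFX priority flag.   */
        if (r->gfx && (r->gfx_priority || !r->cpu)) return GFX_REQ;
        if (r->cpu) return CPU_REQ;
        return GFX_REQ;
    }
    /* Level 3b: configuration-time streams (e.g. programming parameters)
     * get a lower latency than run-time streams if required.            */
    if (r->cfg_stream) return CFG_STREAM;
    if (r->rt_stream)  return RT_STREAM;
    return NONE;
}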
6 Conclusions