
Memory Arbitration and Cache Management in Stream-Based Systems

 
Françoise Harmsze, Adwin Timmer and Jef van Meerbergen
(1) Philips Research Labs Eindhoven, The Netherlands
(2) Eindhoven University of Technology, The Netherlands

Abstract

With the ongoing advancements in VLSI technology, the performance of an embedded system is determined to a large extent by the communication of data and instructions. This results in new methods for on- and off-chip communication and caching schemes. In this paper, we use an arbitration scheme that exploits the characteristics of continuous 'media' streams while minimizing the latency for random (e.g. CPU) memory accesses to background memory. We also introduce a novel caching scheme for a stream-based multiprocessor architecture, to limit as much as possible the amount of on-chip buffering required to guarantee the throughput of the continuous streams. With these two schemes we can build an architecture for media processing with optimal flexibility at run-time, while performance guarantees can be determined at compile-time.

[Figure 1. Basic view of the background memory arbitration: periodic requests (P) from signal processing and random requests (R) from the CPU and peripherals are arbitrated towards background memory.]

1 Introduction

In many embedded systems, the bandwidth to off-chip memory is becoming an important limiting factor for system performance. It becomes especially critical when a CPU, peripherals and other (co-)processors must use the same background memory in a unified memory architecture (UMA). In media systems, for instance, real-time performance is very important. Signal processing applications like video decoding and processing require a guaranteed bandwidth (otherwise a fall-back mechanism is necessary), since such systems normally do not have much headroom to catch up when the bandwidth requirements are temporarily not met. On the other hand, a CPU requires low latency for the best performance. These two requirements may easily clash, but in this paper it will be shown that by applying a proper arbitration and cache management scheme, both objectives can be met. For the media processing application domain on which we focus in this paper, we wish to have a flexible solution which allows for run-time reconfiguration of applications, while performance guarantees need to be known at compile-time. Exploiting the characteristics of media processing plays a very important role in obtaining the best solution. The signal processing usually consists of stream-based processing with FIFO periodic communication behaviour. We see that video processing is usually done on a field or frame basis. Any reconfiguration of the application can be done at the start of a new video field. For the video processing units we can thus distinguish between two parts of an application: the run-time of a program is when the video processing algorithm is being performed on a video field, and the configuration-time of a program is when any new parts of the program or parameters are fetched at the start of a new video field. In this paper we will explain how we use the difference between configuration-time and run-time of a program in our memory arbitration scheme.

While the signal processing shows a periodic communication behaviour, the CPU and its peripherals on the other hand show random burst behaviour (Figure 1). The CPU requires a low latency for a better performance. We wish to allow the CPU to access background memory in large bursts, as this will give a better average CPU performance in a well-balanced system: that is, a system in which the CPU is not asking for too much bandwidth too often. The high performance required for video processing cannot simply be obtained by increasing the clock frequency of the processor elements, as this will also drastically increase the power consumption of an IC.
The reason to consider multi-processor architectures is that for such applications true task-level parallelism is necessary to obtain the required high performance at a reasonable cost. In this paper we will concentrate on the communication aspects between (co-)processors and background memory, in the context of multi-processor architectures for high-throughput media (video) applications. In a multi-processor architecture with many (co)processors, the hierarchy of the memory architecture is very important. Figure 2 shows a generic multi-processor memory hierarchy and the different kinds of usage for a multi-mode cache. By multi-mode we mean that part of the cache can be used for FIFO-based caching of data for media processors, while the remainder of the cache can be used for 2nd-level or even 1st-level caching for CPUs, with standard caching and pre-fetching mechanisms. However, the latter mode of caching lies beyond the scope of this paper. In most related work, data flow analysis and compilation techniques are used to optimize the memory management [1],[4]. We focus instead on an architectural method for the memory hierarchy which gives us flexibility at low area cost. In this paper we describe the stream caching scheme required for the media processors. For this scheme we will introduce a method to lock cache lines for periodic data streams. Figure 3 shows an instance of a multi-processor architecture called the CPA. This CoProcessor Array shares an external background memory with a CPU and peripherals. Figure 3 shows that the coprocessors use on-chip communication via a switch matrix which handles its own arbitration [3]. We will use this architecture to illustrate our caching and memory arbitration scheme.

[Figure 2. Generic view on multiprocessor memory hierarchy: a CPU with register file and L1 instruction and data caches (I$, D$), coprocessors, and a multiport multi-mode cache with cache control in front of an SDRAM background memory.]

The problem addressed in this paper is: how to optimize the amount of on-chip buffering for handling data streams to and from background memory under the following three side conditions:

- compile-time guarantees of the performance;
- flexibility at run-time and configuration-time;
- CPU and peripherals have access to background memory in burst mode for optimal performance.

We will introduce a background memory arbitration scheme and a method for guaranteeing the bandwidth demands of high-throughput real-time tasks. The task scheduling of the CPU that results in the random streams shown in Figure 1 lies beyond the scope of this paper, but priority-based schemes [5] can be used for that arbitration. The arbitration for off-chip communication is addressed in Section 2. In Section 3 we will show that a separate cache for each (co)processor would make the system too large, and that a central cache is more advantageous. In a typical CPU cache, the data and instructions of one task can occupy the whole cache. In a multi-processor architecture with many independent streams, data from different tasks concurrently occupy a central cache. Therefore we require the notion of stream caching [7]. Section 4 describes a method to overcome the disadvantages of cache fragmentation, and all methods are illustrated in Section 5 using the CPA architecture.

[Figure 3. Co-Processor Array (CPA) for video processing: video processors connected by a switch matrix and a control bus, sharing a multiport stream cache, memory arbiter and control, and an off-chip SDRAM with a CPU (I$, D$) and peripherals.]

2 Background memory arbitration and response time calculations

In [2] a simple and efficient memory arbitration scheme is presented which supports continuous streams along with random requests. A service cycle of N clock cycles is defined, see Figure 4, in which M clock cycles are reserved for continuous (media) streams. These media streams do not require a low latency, since their demand for data is known well in advance. Pre-fetching from memory can thus be used for data streams coming from memory, and data streams towards memory can be buffered for a while. Given N and M, R = N - M clock cycles are available for random traffic such as that generated by a CPU. The random traffic has the highest priority, thus ensuring low latency, provided that enough cycles within the service cycle remain for the continuous streams. If this is not the case, the continuous streams will have priority. Of course, the value of M within N must be large enough to guarantee the periodic streams their required bandwidth.
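To make this rule concrete, the following C fragment sketches one way the per-service-cycle bookkeeping could look. This is a minimal sketch: the struct layout and all names are ours, not from [2].

    /* Sketch of the service-cycle rule described above. Random requests
     * keep priority as long as the cycles left in the service cycle
     * still cover the periodic work that remains to be served. */
    typedef struct {
        int N;              /* service cycle length in clock cycles       */
        int M;              /* cycles reserved for the continuous streams */
        int pos;            /* current position within the service cycle  */
        int periodic_owed;  /* reserved periodic cycles not yet served    */
    } svc_cycle_t;

    static void new_service_cycle(svc_cycle_t *c)
    {
        c->pos = 0;
        c->periodic_owed = c->M;
    }

    /* 1: grant the pending random request; 0: serve periodic (or idle). */
    static int grant_random(const svc_cycle_t *c, int random_pending)
    {
        if (!random_pending)
            return 0;
        /* Highest priority for random traffic, provided enough cycles
         * remain in this service cycle for the continuous streams. */
        return (c->N - c->pos) > c->periodic_owed;
    }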
[Figure 4. Definition of a service cycle: N clock cycles, numbered 1 ... N, of which R cycles may be taken by random requests.]

[Figure 5. Critical problem instance to find the worst case response time: a burst of random requests granted at time t0, x cycles after the start of a service cycle, with the resulting response time τ.]
Although the memory management scheme guarantees that the continuous streams can obtain sufficient memory bandwidth, there are still two issues. The first issue is that the arbitration between the different periodic streams has to be solved, and the second issue is that the worst-case response time for any individual periodic stream needs to be calculated. This worst-case response time directly determines the amount of on-chip buffering that is required to guarantee the throughput.

We have chosen a first-come-first-serve (FCFS) scheme for the arbitration of the continuous streams. This approach gives us relatively simple calculations of the worst case response time and the buffer requirements. For these calculations the following definitions are used. Let S be the set of periodic, continuous streams accessing the background memory. Let p_s be the minimum number of clock cycles between two memory requests from any stream s \in S, and let b_s be the maximum number of clock cycles needed for a burst request from any stream s to access the background memory. If N equals the number of clock cycles within a service cycle, then the maximum number of cycles M required for requests from the set S within one service cycle is

    M = \sum_{s \in S} \lceil N / p_s \rceil \, b_s    (1)
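As an illustration of equation 1, the sketch below computes M and the remaining random budget R = N - M. The two stream parameter sets are hypothetical values invented for the example, not taken from the paper.

    #include <stdio.h>

    /* Equation (1): cycles the periodic set S can claim per service cycle.
     * p[i] is the minimum inter-request distance of stream i in clock
     * cycles, b[i] the worst-case cycles per burst. A sketch; names ours. */
    static int cycles_for_streams(int N, const int p[], const int b[], int n)
    {
        int M = 0;
        for (int i = 0; i < n; i++)
            M += ((N + p[i] - 1) / p[i]) * b[i];   /* ceil(N/p_i) * b_i */
        return M;
    }

    int main(void)
    {
        int p[] = { 256, 512 };  /* hypothetical stream periods            */
        int b[] = { 18, 18 };    /* 16-word burst plus 2 cycles overhead   */
        int N = 1024;
        int M = cycles_for_streams(N, p, b, 2);
        printf("M = %d, R = N - M = %d\n", M, N - M);  /* M = 108, R = 916 */
        return 0;
    }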
The worst case response time of a background memory request can be calculated if we consider a critical instance of the problem. Consider t0, as shown in Figure 5, which is the moment within a service cycle at x cycles from the start of the service cycle. If the CPU did not issue any requests in this service cycle until t0, then it can obtain access to background memory for a period of 2R if its requests continue into the next service cycle. If at t0 all periodic streams issue a request at the same time the CPU is granted such a burst of requests of size 2R, it can be seen that we have the critical instance, during which the background memory is not accessible for continuous requests. In [6] it is proven that for all s \in S, the worst case response time t of a memory request of a continuous stream equals

    t = 2R + \sum_{s \in S} b_s    (2)

In practical cases, we want N to be large to allow for large bursts from a CPU. In that case, as we can see from equation 2, the value of R is dominating in the calculation of the worst case response time; for identical burst lengths b_s = b, equation 2 reduces to t = b |S| + 2R. This can also be seen intuitively by looking at Figure 5: the |S| periodic requests at time t0 lead to a worst case response time of b |S|, which must be added to the 2R cycles occupied by random requests. This means that the worst case response time for all continuous streams is more or less the same regardless of the arbitration scheme between these streams. Earlier in this section we decided to use a first-come-first-serve scheme for the arbitration of the continuous streams. We can now say that taking a simple arbitration scheme for the continuous streams comes at no extra cost in terms of response times.
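To make the response-time arithmetic concrete, this sketch evaluates equation 2, as reconstructed above, for the example treated in the next section (20 streams with b_s = 18 cycles and R = 512). The function name is ours.

    #include <stdio.h>

    /* Equation (2): worst case response time of a continuous stream,
     * given the random budget R per service cycle and the worst-case
     * burst lengths b_s. A sketch under the assumptions stated above. */
    static int worst_case_response(int R, const int b[], int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += b[i];
        return 2 * R + sum;      /* 2R random cycles + all periodic bursts */
    }

    int main(void)
    {
        int b[20];
        for (int i = 0; i < 20; i++)
            b[i] = 18;           /* 16-word burst + 2 cycles switch overhead */
        /* N = 1024 with M = R = 512, as in the Section 3 example. */
        printf("t = %d clock cycles\n", worst_case_response(512, b, 20)); /* 1384 */
        return 0;
    }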

3 Buffering of continuous streams

Since we wish to keep the total amount of on-chip memory as small as possible, it is necessary to use the buffering as efficiently as possible. One option would be to give every stream its own buffer memory, in which case every buffer would require enough locations to handle the peak bandwidth demands. The peak bandwidth demands depend on the application. This can result in a very large amount of on-chip buffering, as the following calculations will show.

For the video processing architecture from Figure 3 we can determine the number of buffers and the size of each buffer which we would require for a solution with separate buffer memories. For the CPA we have the case in which 20 individual streams require access to background memory. For the on-chip communication, each data stream can have a peak bandwidth of 128 MB/s. For the background memory we use a standard 32-bit SDRAM running at 96 MHz with a burst size L of 64 bytes (16 words). The overhead of switching between reading and writing is on average 2 clock cycles, so an average of 16 + 2 = 18 clock cycles is required for every memory access of one burst. The length of the service cycle has been set to N = 1024, since for this order of magnitude a reasonable part of the CPU cache can be refilled in one burst and the amount of on-chip buffering is still acceptable. In [6] it is proven that the situation where M = R = N/2 is the corner case resulting in the largest buffer requirements.
For this example each stream has its worst case response time t = 2 x 512 + 20 x 18 = 1384 clock cycles, in which, for a stream with a peak bandwidth of 128 MB/s, 1845 bytes can arrive. We do know that the total bandwidth of the background memory is 4 bytes x 96 MHz = 384 MB/s, which does not allow all streams to use the peak bandwidth at the same time. However, since each individual stream can use this peak bandwidth in some application, we have to give all streams the maximum required buffering for these circumstances. This means that for buffering the individual streams, 20 x 1845 bytes ≈ 37 kB of buffering would be required.

We wish to ensure that we can work with the average amount of buffering per stream rather than with the required peak amount for each individual stream. Therefore we use one large cache for all continuous streams to and from background memory.

Based on the worst case response time, the amount of buffering B for all streams can be calculated. It can be shown that the amount of buffering required for all strictly periodic streams is

    B = |S| \, ( L + (t / \bar{p}) \, L )    (3)

where \bar{p} is the average over the p_s, and where t, as given in equation 2, is proportional to N. This formula shows that a trade-off can be made between the size N of the service cycle and the amount of on-chip buffering. A large value of N allows for a better average latency for random requests. This is obtained at the cost of additional on-chip memory for the continuous streams, and possibly a larger deviation of the latency for the random requests in case of overload or saturation. The latter can be understood from Figure 5. When N is increased, M must increase proportionally to allow the periodic streams their required bandwidth. For an overloaded system there are more random requests than can be handled in a period of R cycles. This will result in a longer period in which no random requests are allowed access to background memory. Therefore we see that in this case a large period N, and therefore a large number of M cycles, will result in a large latency for those random requests that remain pending when the R cycles within a service cycle have already been consumed.

For our example with 20 streams we can calculate the required amount of buffering again if we take the corner case M = R = 512 as the worst case situation. Then t = 1384 clock cycles, and \bar{p} follows from the 20 streams sharing the M = 512 reserved cycles, each request taking 18 cycles, which gives \bar{p} = 512 x 18/20 = 460.8 clock cycles. This means that per stream we need 64 + (1384/460.8) x 64 ≈ 256 bytes. For 20 streams we therefore need 5 kB in one memory rather than 37 kB in 20 individual memories. The arithmetic is spelled out in the sketch below.
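The sketch instantiates equation 3 with the example values (L = 64 bytes, t = 1384 cycles, \bar{p} = 460.8 cycles) and compares the shared cache against peak-sized individual buffers. The variable names are ours.

    #include <stdio.h>

    int main(void)
    {
        /* Section 3 example; per stream, equation (3) gives
         * B_s = L + (t / p_avg) * L. Values from the text; names ours. */
        double L = 64.0;           /* burst (cache line) size in bytes       */
        double t = 1384.0;         /* worst case response time, clock cycles */
        double p_avg = 512.0 * 18.0 / 20.0;   /* average period: 460.8       */
        int    streams = 20;

        double per_stream = L + (t / p_avg) * L;   /* ~256 bytes             */
        double shared     = streams * per_stream;  /* one shared memory      */

        /* Peak-rate sizing: 128 MB/s during t cycles at 96 MHz.             */
        double peak_bytes = 128e6 * (t / 96e6);    /* ~1845 bytes per stream */
        double separate   = streams * peak_bytes;  /* 20 individual memories */

        printf("per stream: %.0f bytes\n", per_stream);
        printf("one shared memory: %.0f bytes (~5 kB)\n", shared);
        printf("20 separate memories: %.0f bytes (~37 kB)\n", separate);
        return 0;
    }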
4 Cache fragmentation

In the previous section it was shown that using one cache memory instead of separate buffer memories is more advantageous. To guarantee the real-time constraints for all data streams, each stream must be able to claim the necessary amount of buffering at any time. To allow all streams to use the same cache memory and claim a certain amount of buffering, buffer locations must be allocated within the memory for the separate streams. Each individual continuous stream is assigned a number of cache lines. To determine the required number of lines we use equation 3 for one single stream: B_s = L + \lceil t / p_s \rceil \, L, where p_s is the minimum inter-arrival time of requests, which is determined by the real-time characteristics of the stream. These assigned cache lines can only be occupied by data for that particular stream, so they are locked if the cache is to be used for regular caching mechanisms as well. This locking of the cache lines guarantees that the continuous streams obtain the desired buffering. Since continuous streams start and stop independently of the other streams, the locking of cache lines for a particular stream is also independent of the locking for others. Because all streams are independent, and because buffers in the cache memory are dynamically allocated and de-allocated, the cache memory will suffer from fragmentation. The run-time reconfiguration of each independent stream does not leave a singular moment when no stream is active. This means that there is no singular moment when the cache is not in use; therefore garbage collection or de-fragmentation would be difficult. To avoid the disadvantage of cache fragmentation, an ordered list of non-assigned cache lines can be maintained (a C sketch of this bookkeeping follows the figures below). The initial situation is shown in Figure 6, where the HEAD of the list points to line #0. If a number of cache lines has to be assigned to a certain stream, these lines are taken from the list, starting at the head. Subsequently, the new head of the list points to the first cache line that is not assigned (Figure 7), and a new list of cache lines for this particular stream has been created. If assigned lines become unassigned, they can be either appended (Figure 8) or prepended to the list. The advantage of this method is that we avoid difficult garbage collection schemes. The disadvantage is that only FIFO-based streams are easily supported in this mode of the cache with respect to (re)placement strategies. However, since we explained in Section 1 that for media processing FIFO-based behaviour is dominating, this special stream mode of our cache is not a disadvantage for the architecture.

[Figure 6. Linked list: initial state. The HEAD points to line #0; lines #0 through #7 are chained in order, with line #7 as the TAIL.]

[Figure 7. Linked list: lines assigned for stream #0. Lines #0 to #4 form the created list of assigned cache lines of stream #0; the remaining linked list of unassigned cache lines runs from line #5 (HEAD) to line #7 (TAIL).]

[Figure 8. Linked list: released lines appended to the list. The list of assigned cache lines of stream #0 is removed and its lines are appended to the remaining linked list of unassigned cache lines, which then runs from line #5 (HEAD) through line #7 and lines #0 to #4 (TAIL).]
5 Application: coprocessor array

We have applied both the arbitration scheme and the cache management scheme in the CPA shown in Figure 3. As mentioned in Section 1, the CPA shares an external background memory with a CPU and peripherals. The off-chip communication between the SDRAM and the memory arbiter uses arbitration at 3 levels, as shown in Figure 9. Level 1 handles the arbitration between random requests and periodic requests. This arbitration uses the scheme given in Section 2. Level 2a decides between debugger requests and other random requests, normally giving priority to debugging. Level 3a splits the remaining random requests into two types: requests from the CPU and requests from the graphics accelerator (GFX). The value of a variable, GFX priority, can be used to ensure that the GFX unit can claim some requests from the CPU, which in general has higher priority. Level 2b uses the scheme from Section 2 again, to handle the stream requests, which always occur at run-time, and the (periodic) control or instruction fetches, which require a low latency. The latter type of requests is again subdivided at level 3b, where a difference is made between control requests, which occur at configuration-time, and run-time parameter or instruction requests originating from specific video processing units that require extra information at run-time.

[Figure 9. Arbitration at 3 levels for the CPA. Level 1 splits a service cycle of N1 cycles between the random and periodic branches (M1 and R1 = N1 - M1 cycles). Level 2a arbitrates the debugger against the other random requests, and level 3a separates the CPU and peripherals from the GFX unit under control of GFX priority. Level 2b splits its N2 cycles (M2 and R2 = N2 - M2) between the periodic streams, which occur at run-time, and the control requests, which level 3b separates into configuration-time control and run-time control.]

The arbitration scheme in Figure 9 is valid for our application domain because it gives all random requests a low latency, provided they do not overload the memory. The scheme also gives all periodic requests a guaranteed bandwidth, while at the same time it gives periodic requests at configuration-time, such as programming parameters, a lower latency if required.
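The decision structure of Figure 9 can be summarized in a few lines of C. The request-type encoding and names below are ours (hypothetical), and the budget handling at levels 1 and 2b is assumed to reuse the service-cycle rule of Section 2.

    /* Sketch of the 3-level request selection of Figure 9. */
    typedef enum {
        REQ_NONE, REQ_DEBUGGER, REQ_CPU, REQ_GFX,        /* random branch   */
        REQ_STREAM, REQ_CONFIG_CTRL, REQ_RUNTIME_CTRL    /* periodic branch */
    } req_t;

    typedef struct { int pending[7]; int gfx_priority; } reqs_t;

    static req_t pick_random(const reqs_t *r)
    {
        if (r->pending[REQ_DEBUGGER]) return REQ_DEBUGGER;          /* 2a */
        if (r->pending[REQ_GFX] && r->gfx_priority) return REQ_GFX; /* 3a */
        if (r->pending[REQ_CPU]) return REQ_CPU;
        return r->pending[REQ_GFX] ? REQ_GFX : REQ_NONE;
    }

    static req_t pick_periodic(const reqs_t *r, int ctrl_turn)
    {
        if (ctrl_turn) {                 /* level 2b: low-latency share     */
            if (r->pending[REQ_CONFIG_CTRL]) return REQ_CONFIG_CTRL; /* 3b */
            if (r->pending[REQ_RUNTIME_CTRL]) return REQ_RUNTIME_CTRL;
        }
        return r->pending[REQ_STREAM] ? REQ_STREAM : REQ_NONE;
    }

    /* Level 1: random wins while its budget lasts, as in Section 2. */
    static req_t pick_request(const reqs_t *r, int random_budget_left,
                              int ctrl_turn)
    {
        req_t q = REQ_NONE;
        if (random_budget_left > 0)
            q = pick_random(r);
        if (q == REQ_NONE)
            q = pick_periodic(r, ctrl_turn);
        return q;
    }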
[Figure 10. Stream cache and cache control of the CPA: a multiport, multimode cache with linked-list administration, address generation and FCFS control between the coprocessors (with register files) and the SDRAM.]

Figure 10 shows the CPA instantiation of the stream cache and the cache control from Figure 3. To ensure that all video processors get sufficient bandwidth for transporting data to and from background memory, 5 parallel buses are connected to our cache. Small serial-to-parallel conversion buffers are used at the input to convert from 16-bit words to 128-bit words. The reverse is done at the output with parallel-to-serial conversion buffers. The maximum number of streams which can use the cache is 20. The Linked List block performs the function explained in Section 4, and the FCFS block is responsible for the first-come-first-serve arbitration between all periodic streams. The CPA has been processed in a 0.35 μm technology. The cache memory is 3 mm², with an additional 3 mm² required for address generation and multi-stream accessing. The linked list requires 0.8 mm², and the FCFS unit which collects the requests from all 20 streams is 1.3 mm². Figure 11 shows the complete CPA layout. The area within the white box comprises the cache and its address generation, the linked list and the background memory arbitration.

[Figure 11. Layout of the CPA IC, comprising the arbitration scheme and caching; the marked area contains the address generation, the FCFS unit and the linked list.]

6 Conclusions

In this paper we have presented a memory arbitration scheme and a cache management scheme which have both been effectively used in a video processing architecture. The memory arbitration scheme can be used for systems where both continuous high-throughput and random low-latency requests are present. The cache management scheme is very effective for stream-based buffering of data. By using one cache memory for several independent data streams, a cost-effective solution has been obtained. The scheme allows for flexibility in the reconfiguration of applications, while at the same time we can guarantee at compile-time that all run-time constraints will be met. This guarantee is obtained by using the calculated worst case response time to determine the required amount of buffering, and by locking the corresponding cache lines for continuous streams. This means that none of the video processing units requires a fall-back mechanism for cases in which the real-time constraints are not met. Both schemes offer good scalability for increasing numbers of processors. The memory arbitration scheme and the cache management scheme have both been used in a CoProcessor Array IC for video processing. This IC has been processed in a 0.35 μm technology, in which the total area for caching, cache control and memory arbitration is 8.1 mm².

References

[1] M. Hall, J. Anderson, S. Amarasinghe, B. Murphy, S.-W. Liao, E. Bugnion, and M. Lam. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer, December 1996.

[2] S. Hosseini-Khayat and A. Bovopoulos. A simple and efficient bus management scheme that supports continuous streams. ACM Transactions on Computer Systems, 13(2):112-140, 1995.
[3] J. Leijten, J. van Meerbergen, A. Timmer, and J. Jess. Stream communication between real-time tasks in a high-performance multiprocessor. Proc. Design, Automation and Test in Europe Conference, pages 125-131, 1998.

[4] B. Lin, G. DeJong, C. Verdonck, S. Wuytack, and F. Catthoor. Background memory management for dynamic data structures intensive processing systems. International Conference on Computer-Aided Design, November 1995.

[5] C. Liu and J. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. Journal of the Association for Computing Machinery, 20(1):46-61, 1973.

[6] A. Timmer, F. Harmsze, J. Leijten, M. Strik, and J. van Meerbergen. Guaranteeing on- and off-chip communication in embedded systems. Proc. IEEE Computer Society Workshop on VLSI '99, pages 93-98, 1999.

[7] D. Zucker, M. Flynn, and R. Lee. A comparison of hardware prefetching techniques for multimedia benchmarks. In Proceedings of the International Conference on Multimedia Computing and Systems, June 1996.
