
Synthetic Trace-Driven Simulation of Cache Memory

Rahman Hassan (Institute for System Level Integration, Livingston, UK; [email protected])
Antony Harris (ARM Ltd, Sheffield, UK; [email protected])
Nigel Topham and Aris Efthymiou (School of Informatics, University of Edinburgh, UK; {npt, aefthymi}@inf.ed.ac.uk)

Abstract

The widening gap between CPU and memory speed has made caches an integral feature of modern high-performance processors. The high degree of configurability of cache memory can require extensive design space exploration, which is generally performed using execution-driven or trace-driven simulation. Execution-driven simulators can be highly accurate but require a detailed development flow and may impose performance costs. Trace-driven simulators are an efficient alternative, but maintaining large traces can present storage and portability problems. We propose a distribution-driven trace generation methodology as an alternative to traditional execution- and trace-driven simulation. An adaptation of the Least Recently Used Stack Model is used to concisely capture the key locality features in a trace, and a two-state Markov chain model is used for trace generation. Simulation and analysis of a variety of embedded application traces demonstrate that the cacheability characteristics of the synthetic traces are generally very well preserved and similar to those of the real traces, and we also highlight the potential performance improvement over ISA emulation.

1. Introduction

Caches are highly configurable features of modern processors, and their architecture is characterised by parameters such as size, associativity, line (block) size, and replacement policy. Cache performance can vary considerably depending on the choice of configuration and workload, with the penalties for an incorrectly configured cache including increased latency and area overhead (from oversized caches). In order to maximise performance and reduce cost, much emphasis is therefore placed on finding the optimum configuration for an expected application workload.

Cache memory is usually evaluated using execution-driven or trace-driven simulation. Execution-driven simulation can be performed at various levels of abstraction, from the algorithmic level to bit-accurate and cycle-accurate RTL [1]. Instruction Set Architecture (ISA) emulators such as ARMulator [2] and SimpleScalar [3] can perform execution-driven cache simulation with a high level of behavioural and timing accuracy. However, execution-driven simulation can be slow and requires architectural models, application source code, and a development toolkit.

Trace-driven simulation is a faster and increasingly common way of evaluating memory systems. Trace-driven simulators such as DineroIII [4] accept a chronological stream of memory references and evaluate miss statistics for the selected configuration. Trace-driven simulation can be an attractive way of exploring multi-level cache designs and multiprocessor system caches. However, as applications become more complex, they generate increasingly larger traces. A key problem is the storage requirement of enormous trace files containing hundreds of millions of memory references. Truncation, trace-sampling, and compression can be employed to reduce the effective length of a trace, but often at the expense of accuracy and/or time. Synthetic trace generation can address the problems of maintaining large traces by using a distribution-driven (stochastic) model to generate an arbitrary number of memory references on-the-fly; the reference stream can then be passed to a trace-driven cache model for system evaluation. Unfortunately, synthetic trace models usually lack accuracy [11] unless detailed profiling procedures are employed, and as such may not always be suitable for fast and accurate cache design space exploration.

2. Related Work
An early proposal by Denning [6] considered generating memory references based on their independent probability. Known as the Independent Reference Model (IRM), the approach fails to capture the locality of memory references inherent in a real trace. Thiebaut [8] looks at the generation of synthetic program traces using an extension of the basic Distance Model [7]. The idea is to model the probability distribution of the jumps between consecutive memory references as a hyperbolic relationship and to generate new references using a random walk through an address space. The proposed model uses two parameters corresponding to the working-set size and the locality of reference. The working-set size for a cache is the storage needed at any one time and contains current and recent data. As the working-set changes during the execution of a program, determining a parameter to quantify its size requires a number of simulation runs. Eeckhout [9] proposes an approach of profiling instruction mixes, branches and dependencies to establish a pattern in the memory reference stream. The approach requires detailed trace information to perform the profiling procedures and operand evaluations. The Partial Markov Model (PMM) presented in Agarwal [10] is based on a two-state Markov chain. The first state produces sequential memory references and the second state generates random references. Maintaining a state or switching state is governed by data captured from the real trace. A problem with this approach is that a single probability threshold for state transitioning does not capture sufficient temporal information. Sorenson [11] highlights the need to capture both spatial and temporal information from a real trace and extends the original work by Grimsrud [14] on three-dimensional plots, called locality surfaces, for visualising reference locality. Sorenson does not present a practical approach to quantifying locality, but does evaluate some existing synthetic trace models and analyse their performance using the locality surface. Berg [12] captures trace locality by profiling the reuse distance and applies the distribution to a probabilistic cache model to estimate the miss ratio of fully-associative caches. The reuse distance is the number of intervening memory references between identical references. The author uses the reuse distance on the basis that the longer a memory reference has gone without being accessed, the less likely it is to remain in the cache, and it is this likelihood of eviction that the proposed cache model aims to exploit. However, the author's assumption that a larger reuse distance implies a higher probability of cache eviction is not necessarily true: the intermediate accesses could all be to the same memory location, and a large reuse distance would not reflect this pattern.

A much better measure of locality is presented by Mattson [13] in the form of the Least Recently Used Stack Model (LRUSM). The LRUSM is based on the stack distance, which is the number of unique intervening memory references between identical references and is a very effective measure of temporal locality. Grimsrud [14] analyses the efficiency of the stack distance model in preserving temporal locality using his locality surfaces, while Brehob [15] uses the stack distance model to implement a probabilistic cache model that evaluates miss ratio. The works of Mattson, Grimsrud, and Brehob emphasise that the LRUSM is a natural representation of least recently used behaviour. However, these works do not analyse the efficacy of the LRUSM in the generation of synthetic traces for trace-driven simulation of cache memory. In our work, we implement an adaptation of the LRUSM, apply it to an algorithm employing a two-state Markov chain, and show that we can generate accurate synthetic traces aimed at cache simulation using both least recently used and random block replacement policies. We use traces collated from application benchmarks executed on an embedded processor and evaluate the approach using a comprehensive range of cache configurations. We also compare the approach against ISA emulation to highlight the potential performance improvement.

3. Trace Locality

An important general rule of program execution is the 90/10 rule [16], also known as the Pareto Principle [17]. The rule states that 90% of a program's execution time is spent in only 10% of the code, and it highlights the significance of locality in evaluating and optimising cache performance.

Fundamentally, there are two types of locality that a cache exploits to achieve favourable hit rates: temporal locality, which is apparent when identical memory references occur close together in time; and spatial locality, which is present when physically proximate references occur close together in time. A cache takes advantage of temporal locality by keeping recently referenced data as long as possible, while spatial locality is exploited by performing block transfers (line-fills) on a miss. A synthetic memory reference model must seek to preserve the original temporal and spatial locality information. We capture spatial locality using line-size granularity and map memory references to cache line numbers; any unique references mapping to the same cache line are treated as identical references. We use a distribution based on the LRUSM to quantify temporal locality. The LRUSM is formulated from the number of unique intervening
memory references between two identical references. Although intended to apply to caches with LRU block replacement, the stack distance is a useful metric of temporal locality regardless of replacement policy and can in fact form the foundation for trace generation for caches with random block replacement. Listing 1 describes the pseudo-code of the trace profiling algorithm.

  procedure Profile(tracefile)
    declareFIFO(SDS)
    declareFIFO(L)
    declareStack(S)
    B:=32
    SIZE:=0
    LCOUNT:=0
    while forever
      R:=nextRef(tracefile)
      if R=null then break
      else
        R:=R/B
        for i:=0 to SIZE
          if R=S[i] then break
        end for
        if i=SIZE then
          sd:=-1
          pushBottom(SDS, sd)
          pushBottom(L, R)
          pushTop(S, R)
          SIZE:=SIZE+1
        else
          sd:=i
          pushBottom(SDS, sd)
          temp:=S[sd]
          remove(S, sd)
          pushTop(S, temp)
        end if
        LCOUNT:=LCOUNT+1
      end if
    end while
  end procedure

Listing 1. Trace profiling algorithm.

We use a dynamically growing integer stack (S). For each memory reference R we check whether it is resident in the stack. If R is not found then it is pushed directly to the top of the stack and assigned a stack distance (sd) value of -1. If R is found in the stack then it is removed from its position and pushed to the top; the depth from which R is fetched is the new sd. The stack distances are stored in a stack distance string data structure (SDS). New line accesses theoretically have a stack distance of ∞ as they have not been referenced previously, but we use a finite value of -1 to enable quantitative profiling for the trace generation algorithm. We also capture new line accesses and the order in which they appear (L), the number of which we describe as the full working-set size of the application program code. In addition, a count of the total number of references (LCOUNT) is maintained. A cache line size of 32 bytes is assumed.

SDS is profiled to generate a distribution of probabilities. Stack distance values with a probability lower than 0.001% are discarded to reduce the size of the distribution set and to improve the performance of the trace generation algorithm. The probability distribution is then integrated to generate the cumulative distribution of stack distance values as a non-decreasing function:

F_i = P_i + F_{i-1}  ∀i, where F_0 = P_0

The cumulative probability distribution F and the corresponding stack distance values SD are stored as numerically ordered probability vectors in separate data structures.

4. Trace Generation

The trace generation algorithm is modelled as a Markov chain. A Markov chain is a discrete-time stochastic process that describes the different states a system can assume at successive time intervals. The Markov property stipulates that a state transition depends only on the current state of the system and not on past or future states. We use a two-state Markov chain model, with the first state generating new memory references and the second state generating memory references based on a history of previous references. Both states are governed by the stack distance cumulative probability distribution vector F.

Stack distance values are generated by a pseudo-random number generator issuing a number in the interval [0:1] that is mapped to a stack distance using the Inverse Transform Sampling method [25], as illustrated in Figure 1.

[Figure 1: cumulative probability (0 to 1) plotted against stack distance (0 to 140).]
Figure 1. Stack distance mapping.

A stack distance value of -1 issues a new memory reference, while any other value generates a previous reference. Inter-state and intra-state probabilities are treated as stochastically independent in line with the Markov property, as shown in Figure 2.
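As a concrete illustration, the profiling pass of Listing 1 and the cumulative-distribution step can be sketched in Python. This is a hypothetical re-implementation for exposition, not the authors' code; the function names and the small example trace are our own:

```python
from collections import Counter

def profile(addresses, line_size=32):
    """Sketch of Listing 1: return the stack distance string (SDS) and
    the first-access order (L). A distance of -1 marks a cache line
    seen for the first time."""
    stack = []                 # LRU stack of line numbers; index 0 is the top
    sds, first_order = [], []
    for addr in addresses:
        line = addr // line_size       # spatial locality at line granularity
        if line in stack:
            depth = stack.index(line)  # unique intervening lines since last use
            sds.append(depth)
            stack.remove(line)
            stack.insert(0, line)      # promote to most recently used
        else:
            sds.append(-1)             # new line: finite stand-in for infinity
            first_order.append(line)
            stack.insert(0, line)
    return sds, first_order

def cumulative_distribution(sds):
    """Integrate the stack-distance probabilities into the ordered
    vectors SD and F, following F_i = P_i + F_(i-1)."""
    counts = Counter(sds)
    sd = sorted(counts)
    f, acc = [], 0.0
    for v in sd:
        acc += counts[v] / len(sds)
        f.append(acc)
    return sd, f

# Addresses 0, 4, 64, 0 map to lines 0, 0, 2, 0, giving
# SDS = [-1, 0, -1, 1] and first-access order L = [0, 2].
sds, first_order = profile([0, 4, 64, 0])
```

A production profiler would additionally drop stack distances with probability below the 0.001% threshold before building F, as described above.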
[Figure 2: two-state diagram with states S1 and S2; transitions are labelled F_i = 0 (issue a new reference) and F_i > 0 (reuse a previous reference).]
Figure 2. Markov model for trace generation.

The key to the algorithm for random block replacement caches is the maintenance of a FIFO data structure that schedules the order of memory references. The FIFO is initialised with the full working-set of memory references mapped as cache line numbers (L). On every request for a new reference (state S1), the element at the front of the FIFO is popped off and pushed to the back, before being mapped back to a memory reference and passed to the output. On every request for an existing reference (state S2), the element at the requested stack distance is read from the back of the FIFO and passed to the output. The depth at which the element is fetched from the FIFO must be less than the running total of newly generated references (NEWREF). This is achieved by dynamically scaling the random number before it is mapped to the stack distance cumulative distribution. Stack distance values are selected from the stack distance probability vector (SD) using its corresponding cumulative probability distribution (F). The maximum possible stack distance value is in theory the length of the FIFO, but in practice it is the value of the last element in SD. Both SD and F are numerically ordered vectors, as F is a monotonically increasing cumulative distribution function. Listing 2 summarises the procedure for arbitrary length trace generation.

  procedure TraceGen(L, SD, F, LCOUNT)
    declareFIFO(S)
    initialise(S, L)
    SIZE:=getLength(S)
    TLENGTH:=arbitrary
    B:=32
    NEWREF:=0
    for i:=0 to TLENGTH
      sd:=genStackDistance(SD, F, NEWREF)
      if sd=-1 then
        memRef:=S[0]
        popFront(S)
        pushBack(S, memRef)
        memRef:=memRef*B
        NEWREF:=NEWREF+1
      else
        memRef:=S[SIZE-1-sd]
        memRef:=memRef*B
      end if
    end for
  end procedure

Listing 2. Trace generation algorithm for random replacement caches.

For LRU replacement, the procedure is almost identical except for a slight modification in the scheduler. As before, memory references are output from the top of the stack for each new reference, while previous references use the bottom of the stack as the base and an offset equal to the requested stack distance. Additionally, however, each request for a previous reference causes the reference element at that depth to be removed and pushed to the bottom of the stack to reflect the fact that it was the most recently used. As the trace generation progresses, the stack organises itself such that the reference element at the bottom of the stack is the most recently used, with recency gradually reducing up to the least recently used reference element at the top of the stack. Listing 3 summarises the procedure. Listing 4 presents the algorithm for stack distance generation employed in both procedures.

  procedure TraceGen(L, SD, F, LCOUNT)
    declareStack(S)
    initialise(S, L)
    SIZE:=getLength(S)
    TLENGTH:=arbitrary
    B:=32
    NEWREF:=0
    for i:=0 to TLENGTH
      sd:=genStackDistance(SD, F, NEWREF)
      if sd=-1 then
        memRef:=S[0]
        popTop(S)
        pushBottom(S, memRef)
        memRef:=memRef*B
        NEWREF:=NEWREF+1
      else
        memRef:=S[SIZE-1-sd]
        pop(S, SIZE-1-sd)
        pushBottom(S, memRef)
        memRef:=memRef*B
      end if
    end for
  end procedure

Listing 3. Trace generation algorithm for LRU replacement caches.

  procedure genStackDistance(SD, F, NEWREF)
    SIZE:=getLength(SD)
    maxSD:=SD[SIZE-1]
    ran:=randomFloat(0,1)
    if NEWREF<=maxSD then
      k:=0
      while SD[k]<NEWREF
        k:=k+1
      end while
      ran:=ran*F[k-1]
    end if
    for k:=0 to SIZE
      if ran<F[k] then
        sd:=SD[k]
        return sd
      end if
    end for
  end procedure

Listing 4. Stack distance generation algorithm.
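The inverse-transform sampling of Listing 4 and the FIFO scheduling of Listing 2 can likewise be sketched in Python. Again this is a hypothetical re-implementation; the names, the clamp guarding against floating-point round-off in F, and the toy inputs are our own additions:

```python
import bisect
import random
from collections import deque

def gen_stack_distance(sd_values, cum_probs, new_refs, rng):
    """Sketch of Listing 4: sample a stack distance by inverse
    transform sampling. sd_values is the ordered vector SD (with -1
    first) and cum_probs the matching cumulative distribution F."""
    r = rng.random()
    # Scale r so that no previous reference deeper than the NEWREF
    # references generated so far can be requested.
    if new_refs <= sd_values[-1]:
        k = bisect.bisect_left(sd_values, new_refs)
        r *= cum_probs[k - 1]
    # First stack distance whose cumulative probability exceeds r
    # (clamped in case round-off leaves F[-1] fractionally below 1.0).
    idx = bisect.bisect_right(cum_probs, r)
    return sd_values[min(idx, len(sd_values) - 1)]

def generate_trace(first_order, sd_values, cum_probs, length,
                   line_size=32, seed=0):
    """Sketch of Listing 2 (random-replacement scheduler): a FIFO
    holding the full working-set serves new references (state S1) from
    the front and previous references (state S2) at the requested
    depth from the back."""
    rng = random.Random(seed)
    fifo = deque(first_order)        # working-set in first-access order
    new_refs, trace = 0, []
    for _ in range(length):
        sd = gen_stack_distance(sd_values, cum_probs, new_refs, rng)
        if sd == -1:                           # S1: issue a new reference
            line = fifo.popleft()
            fifo.append(line)
            new_refs += 1
        else:                                  # S2: reuse at depth sd from the back
            line = fifo[len(fifo) - 1 - sd]
        trace.append(line * line_size)         # map line number back to an address
    return trace
```

The resulting address stream can be written out or piped directly into a trace-driven simulator; the LRU variant of Listing 3 differs only in rotating the reused element to the bottom of the stack.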
5. Evaluation

We evaluated the approach using the ARMulator instruction set simulator [2, 20]. ARMulator simulates the instruction sets and architecture of a variety of ARM processors, as well as memory systems and peripherals. We selected an ARM926 processor model [18], which has a Harvard cached architecture and hosts an ARM9 32-bit integer core. It was connected to program and data memory models through separate AMBA AHB interfaces. We simulated a variety of application benchmarks that may typically run in an embedded system:

1. mpeg2enc – MPEG-2 format video encoder from the MediaBench benchmark suite [21].
2. djpeg – JPEG format image decoder from the EEMBC Consumer benchmark suite [22].
3. aes – security application from the EEMBC Consumer benchmark suite that implements the Advanced Encryption Standard using the Rijndael algorithm [22].
4. wcdma – application program that emulates the physical layer operation of the W-CDMA communications protocol [23].
5. go – artificial intelligence game from the SpecInt95 benchmark suite that plays the game Go against itself [24].
6. compress – compression algorithm from the SpecInt95 benchmark suite that employs Lempel-Ziv encoding [24].

Executable images of the application source code were created using the ARM development toolkit [19]. The source code was compiled with optimisation level -O2 and targeted specifically at the ARM9 core to maximise use of any static scheduling and instruction-set extensions. ARM program code supports static prefetching by way of conditional code generated by the compiler, in addition to the dynamic prefetching offered by the dedicated prefetch unit in the core. For our validation, we analysed traces of data transactions initiated by the core.

5.1 Trace Characterisation

We captured the cumulative distribution of stack distance for the data references of the application benchmarks. A stack distance value of zero is a cache line repetition, or in other words an intra-line memory reference, and is the single most frequent occurrence due to the naturally sequential nature of program execution and the atomic execution of multiple load/store operations. We chose not to include line repetition in our analysis as it has no bearing on the number of cache misses. The stack distance distribution of the references is illustrated in Figure 3. The relative smoothness of the curves indicates that the data memory locations are generally referenced in a progressive, orderly manner.

Figure 3. Cumulative distribution of stack distance for the data reference traces.

Table I summarises some of the characteristics of the data traces. The ratio of dynamic to static coverage is defined as the ratio of the number of new lines observed in the program execution (the full working-set) to the number of cache lines in the static image. We use a 32-byte line size. The static data size of the image is the combined size of the read-only data (constants, literals, etc.), read-write data and zero-initialised data; the full working-set additionally includes stack and heap accesses. The average stack distance can be a useful basic metric of locality and is defined as the mean number of unique cache line accesses between identical accesses:

Average SD = Σ_i SD_i × P(SD_i)

  | Static Data Size (Bytes) | Full Working-Set (lines) | Dynamic-Static Coverage ratio | Ave. SD
1 | 16820                    | 7781                     | 46%                           | 4.7
2 | 815184                   | 28375                    | 3.5%                          | 73.6
3 | 3660                     | 190                      | 5.2%                          | 5.3
4 | 992476                   | 41342                    | 4.2%                          | 29.2
5 | 571468                   | 20031                    | 3.5%                          | 33.7
6 | 44112988                 | 1380585                  | 3.1%                          | 398.6

Table I. Characteristics of the traces for mpeg2enc (1), djpeg (2), aes (3), wcdma (4), go (5), and compress (6).
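The average stack distance defined above is simply the expectation of the profiled stack-distance distribution. As a sanity check, it can be computed in one line; the distribution below is made up for illustration and is not taken from any of the Table I workloads:

```python
def average_sd(distribution):
    """Average SD = sum over i of SD_i * P(SD_i), for a
    {stack distance: probability} mapping (line repetitions, SD = 0,
    are excluded as described in Section 5.1)."""
    return sum(sd * p for sd, p in distribution.items())

# Hypothetical distribution for illustration only:
print(average_sd({1: 0.5, 4: 0.25, 16: 0.25}))  # 5.5
```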
5.2 Trace Simulation

The profile data of each application benchmark was passed to the synthetic trace generation algorithm, which was configured to generate traces of half their original length. This was a somewhat arbitrary cut-off, but it allowed sufficient time for cache warm-up. The synthetic traces were evaluated against their real counterparts using the DineroIII trace-driven cache simulator [4]. DineroIII was configured with a write-allocate write miss policy and a 32-byte cache line size. In order to demonstrate the performance of the approach, we looked at set-associative caches with all ways through to full associativity and cache size C ∈ {64:16K} bytes. Figures 4 and 5 illustrate the miss ratio results of the real traces (expected) versus the synthetic traces (observed) for random and LRU block replacement caches.

We note from the simulation results that the observed performance of the synthetic traces is generally representative of the expected behaviour. Although the synthetic trace for go consistently overestimates the miss ratio for random block replacement caches, it does so with an offset that is proportional to cache size. As a result, the synthetic trace still has the potential to accurately identify the best cache configuration for that application (based on criteria such as minimising miss ratio, area, and/or latency), since the observed results track the expected miss ratios very well, albeit with the proportional offset. While it is accepted that no synthetic trace generation model can consistently produce exact cache simulation results for every cache configuration and every input trace, due to the very nature of stochastic modelling, our results typically show that we are able to preserve the cacheability properties of the real trace and generate a synthetic trace with similar behaviour for random and LRU block replacement caches operating over a wide range of configurations. A comparison with the results presented by Sorenson [11] demonstrates the improved accuracy of the approach relative to some existing models.

5.3 Performance Evaluation

The speed of execution-driven cache simulation can be significantly affected not only by workload characteristics but also by the cache configuration. A notable trade-off exists between cache size and associativity. A smaller cache size causes a higher number of capacity misses, thereby increasing latency through the increased number of bus transactions. A higher associativity serves to reduce the number of conflict misses (and therefore bus usage), but the way-selection logic imposes its own latency overhead. Using ARMulator v1.4 running on an Intel Pentium IV 3.00GHz CPU under Microsoft Windows XP, we assessed the performance of ARMulator executing the mpeg2enc application benchmark for cache size C ∈ {1K, 16K} bytes and associativity A ∈ {direct-mapped:fully-associative}. The results were compared with the combined time of the synthetic trace generation algorithm and the corresponding trace-driven cache simulation performed in the manner described previously. Table II illustrates the results for 135×10^6 instruction executions. The table shows that the performance of ISA emulation is dependent on cache configuration and can deteriorate with increasing associativity and/or decreasing cache size. By contrast, the synthetic trace generation and simulation generally take a fixed length of time.

time (secs)
A   | C=1K: ISA | C=1K: Model | C=16K: ISA | C=16K: Model
1   | 138       | 76          | 121        | 76
2   | 131       | 76          | 117        | 76
4   | 125       | 76          | 117        | 76
8   | 133       | 77          | 119        | 76
16  | 127       | 78          | 122        | 77
32  | 415       | 81          | 142        | 78
64  | -         | -           | 170        | 78
128 | -         | -           | 182        | 79
256 | -         | -           | 192        | 80
512 | -         | -           | 1495       | 82

Table II. Performance evaluation results.

6. Conclusions

Exploration of cache design space is usually performed using execution-driven or trace-driven simulation. Achieving the accuracy of execution-driven simulation is a trade-off against performance and factors such as architectural simulation models and application source code. Trace-driven simulation is an efficient alternative, but large traces can present significant storage and portability problems.

We have presented a synthetic trace generation methodology for trace-driven cache simulation that uses an efficient adaptation of the LRUSM, employing cache line profiling to concisely capture trace locality. A trace generation algorithm using a two-state Markov chain model is used to generate arbitrary length traces independent of cache size and associativity. Extensive simulation and analysis of traces of a variety of application benchmarks show the
[Figure 4: six panels (mpeg2enc, djpeg, aes, wcdma, go, compress), each plotting % miss ratio against cache configuration (64 bytes to 16K bytes, all associativities), Expected versus Observed series.]
Figure 4. Results for random block replacement.

[Figure 5: the same six panels for LRU block replacement, Expected versus Observed.]
Figure 5. Results for LRU block replacement.

synthetic traces generally preserve the cacheability properties of the real trace for both LRU and random block replacement caches operating over a wide range of configurations. Performance evaluations against the ARMulator ISS show that the simulation speed of the approach is generally independent of cache architecture and has the potential to perform significantly faster than ISA emulation.

7. References

[1] J. Connell, ARM System-Level Modeling, ARM Ltd, 2003.
[2] RealView ARMulator ISS, ARM Ltd, 2004.
[3] T. Austin et al., SimpleScalar Tutorial v4, University of Michigan.
[4] D. M. Hill, DineroIII Cache Simulator, University of California, Berkeley, 1985.
[5] D. J. Lilja, Measuring Computer Performance: A Practitioner's Guide, Cambridge University Press, 2000.
[6] P. Denning and S. Schwartz, Properties of the Working-Set Model, Communications of the ACM, 1972.
[7] J. Spirn, Program Behavior: Models and Measurements, Elsevier, 1977.
[8] D. Thiebaut, J. L. Wolf, and H. S. Stone, Synthetic Traces for Trace-Driven Simulation of Cache Memories, IEEE Transactions on Computers, 1992.
[9] L. Eeckhout, K. De Bosschere, and H. Neefs, Performance Analysis Through Synthetic Trace Generation, IEEE Symposium on Performance Analysis of Systems and Software, 2000.
[10] A. Agarwal, M. Horowitz, and J. Hennessy, An Analytical Cache Model, ACM Transactions on Computer Systems, 1989.
[11] E. Sorenson and J. K. Flanagan, Evaluating Synthetic Trace Models using Locality Surfaces, IEEE Workshop on Workload Characterization, 2002.
[12] E. Berg and E. Hagersten, StatCache: A Probabilistic Approach to Efficient and Accurate Data Locality Analysis, IEEE Symposium on Performance Analysis of Systems and Software, 2004.
[13] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger, Evaluation Techniques for Storage Hierarchies, IBM Systems Journal, 1970.
[14] K. Grimsrud, J. Archibald, R. Frost, and B. Nelson, On the Accuracy of Memory Reference Models, Seventh International Conference on Modeling Techniques and Tools for Computer Performance Evaluation, 1994.
[15] M. Brehob and R. Enbody, An Analytical Model of Locality and Caching, Michigan State University, 1999.
[16] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, 2003.
[17] D. Spinellis, Code Quality: The Open Source Perspective, Addison Wesley, 2006.
[18] ARM926E-S Technical Reference Manual, ARM Ltd, 2001.
[19] ARM Developer Suite: Compiler, Linker, and Utilities Guide, ARM Ltd, 2000.
[20] RealView Developer Suite: AXD and armsd Debuggers Guide, ARM Ltd, 2004.
[21] C. Lee, M. Potkonjak, and H. Mangione-Smith, MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems, Micro-30, November 1997.
[22] EDN Embedded Microprocessor Benchmark Consortium, http://www.eembc.org.
[23] H. Lee, wcdmaBench, Software Defined Radio Group, University of Michigan, 2006.
[24] SPEC Benchmark Suite, http://www.spec.org.
[25] L. Devroye, Non-Uniform Random Variate Generation, Springer-Verlag, 1986.
