Synthetic Trace Driven Simulation
the top of the stack and assigned a stack distance (sd) value of -1. If R is found in the stack then it is removed from its position and then pushed to the top. The depth from which R is fetched is the new sd. The stack distances are stored in a stack distance string data structure (SDS). New line accesses theoretically have a stack distance of ∞, as they have not been referenced previously, but we use a finite value of -1 to enable quantitative profiling for the trace generation algorithm. We also capture new line accesses and the order in which they appear (L), the number of which we describe as the full working-set size of the application program code. In addition, a count of the total number of references is maintained. A cache line size of 32 bytes is assumed.

[Figure 1. Stack distance mapping: plot of probability against stack distance (0 to 140).]

A stack distance value of -1 issues a new memory reference while any other value generates a previous reference. Inter-state and intra-state probabilities are treated as stochastically independent in line with the Markov property, as shown in Figure 2.
[Figure 2. Markov model for trace generation: states S1 and S2 with transitions labelled Fi = 0 and Fi > 0.]

The key to the algorithm for random block replacement caches is the maintenance of a FIFO data structure that schedules the order of memory references. The FIFO is initialised with the full working-set of memory references mapped as cache line numbers (L). On every request for a new reference (state S1), the element at the front of the FIFO is popped off and pushed to the back, before being mapped back to a memory reference and passed to the output. On every request for an existing reference (state S2), the element at the requested stack distance is read from the back of the FIFO and passed to the output.

The depth at which the element is fetched from the FIFO must be less than the running total of newly generated references (NEWREF). This is achieved by dynamically scaling the random number before it is mapped to the stack distance cumulative distribution. Stack distance values are selected from the stack distance probability vector (SD) using its corresponding cumulative probability distribution (F). The maximum possible stack distance value is, in theory, the length of the FIFO, but in practice it is the value of the last element in SD. Both SD and F are numerically ordered vectors, as F is a monotonically increasing cumulative distribution function. Listing 2 summarises the procedure for arbitrary-length trace generation.

    procedure TraceGen(L, SD, F, LCOUNT)
        declareFIFO(S)
        initialise(S, L)
        SIZE:=getLength(S)
        TLENGTH:=arbitrary
        B:=32
        NEWREF:=0
        for i:=0 to TLENGTH
            sd:=genStackDistance(SD, F, NEWREF)
            if sd=-1 then
                memRef:=S[0]
                popFront(S)
                pushBack(S, memRef)
                memRef:=memRef*B
                NEWREF:=NEWREF+1
            else
                memRef:=S[SIZE-1-sd]
                memRef:=memRef*B
            end if
        end for
    end procedure

Listing 2. Trace generation algorithm for random replacement caches.

For LRU replacement, the procedure is almost identical except for a slight modification in the scheduler. As before, memory references are output from the top of the stack for each new reference, while previous references use the bottom of the stack as the base and an offset equal to the requested stack distance. Additionally, however, each request for a previous reference causes the reference element at that depth to be removed and pushed to the bottom of the stack, to represent the fact that it was the most recently used. As the trace generation progresses, the stack organises itself such that the reference element at the bottom of the stack is the most recently used, with frequency gradually reducing up to the least recently used reference element at the top of the stack. Listing 3 summarises the procedure. Listing 4 presents the algorithm for stack distance generation employed in both procedures.

    procedure TraceGen(L, SD, F, LCOUNT)
        declareStack(S)
        initialise(S, L)
        SIZE:=getLength(S)
        TLENGTH:=arbitrary
        B:=32
        NEWREF:=0
        for i:=0 to TLENGTH
            sd:=genStackDistance(SD, F, NEWREF)
            if sd=-1 then
                memRef:=S[0]
                popTop(S)
                pushBottom(S, memRef)
                memRef:=memRef*B
                NEWREF:=NEWREF+1
            else
                memRef:=S[SIZE-1-sd]
                pop(S, SIZE-1-sd)
                pushBottom(S, memRef)
                memRef:=memRef*B
            end if
        end for
    end procedure

Listing 3. Trace generation algorithm for LRU replacement caches.

    procedure genStackDistance(SD, F, NEWREF)
        SIZE:=getLength(SD)
        maxSD:=SD[SIZE-1]
        ran:=randomFloat(0,1)
        if NEWREF<=maxSD then
            k:=0
            while SD[k]<NEWREF
                k:=k+1
            end while
            ran:=ran*F[k-1]
        end if
        for k:=0 to SIZE
            if ran<F[k] then
                sd:=SD[k]
                return sd
            end if
        end for
    end procedure

Listing 4. Stack distance generation algorithm.
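A runnable Python sketch of the random-replacement generator and the stack distance draw may help make the scheduling concrete. This is our own transcription with hypothetical names, not the paper's implementation; it assumes SD is sorted with SD[0] = -1 and F is the matching cumulative distribution ending at 1.0:

```python
import random

def gen_stack_distance(sd_vec, cdf, newref):
    """Stack distance draw: sample from the (SD, F) distribution.

    While NEWREF has not yet exceeded the largest profiled distance,
    the random number is scaled so that only distances below NEWREF
    (or the new-reference sentinel -1) can be selected.
    """
    ran = random.random()
    if newref <= sd_vec[-1]:              # scaling still needed
        k = 0
        while sd_vec[k] < newref:
            k += 1
        ran *= cdf[k - 1]                 # restrict draw to SD[0..k-1]
    for k, f in enumerate(cdf):
        if ran < f:
            return sd_vec[k]
    return sd_vec[-1]                     # guard against float edge cases

def trace_gen(lines, sd_vec, cdf, tlength, line_size=32):
    """Trace generation for random-replacement caches.

    The FIFO is seeded with the working set of cache line numbers; a
    new reference rotates the front element to the back, while a
    previous reference is read at an offset from the back.
    """
    fifo = list(lines)
    size = len(fifo)
    newref = 0
    trace = []
    for _ in range(tlength):
        sd = gen_stack_distance(sd_vec, cdf, newref)
        if sd == -1:                      # new reference (state S1)
            ref = fifo.pop(0)
            fifo.append(ref)
            newref += 1
        else:                             # previous reference (state S2)
            ref = fifo[size - 1 - sd]
        trace.append(ref * line_size)     # map line number back to an address
    return trace
```

The LRU variant differs only in the scheduler: the `else` branch would additionally remove the element at depth `size - 1 - sd` and push its line number to the bottom of the stack.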
5. Evaluation

We evaluated the approach using the ARMulator instruction set simulator [2, 20]. ARMulator simulates the instruction sets and architecture of a variety of ARM processors, as well as memory systems and peripherals. We selected an ARM926 processor model [18], which has a Harvard cached architecture and hosts an ARM9 32-bit integer core. It was connected to program and data memory models through separate AMBA AHB interfaces. We simulated a variety of application benchmarks that may typically run in an embedded system:

repetition in our analysis as it has no bearing on the number of cache misses. The stack distance distribution of the references is illustrated in Figure 3. The relative smoothness of the curves indicates that the data memory locations are generally referenced in a progressive, orderly manner.
[Figure: six panels (go - rand, aes - rand, djpeg - rand, wcdma - rand, compress - rand, mpeg2enc - rand) plotting Expected against Observed percentages over Cache Configuration, for cache sizes from 256 B to 16 KB at varying associativities (2- to 64-way), under random replacement.]
[Figure: six panels (go - lru, aes - lru, djpeg - lru, wcdma - lru, compress - lru, mpeg2enc - lru) plotting Expected against Observed percentages over Cache Configuration, for cache sizes from 256 B to 16 KB at varying associativities (2- to 64-way), under LRU replacement.]