This three-part series shows how applying knowledge about the underlying computer hardware to the code for
a simple but computationally intensive algorithm can significantly improve performance. This second segment
focuses on memory accesses.
[Figure: CPU connected to a cache (Set 0 through Set 3) and main memory, with access latencies of 3 and 20 clock cycles.]

A Programming Example

Let's look back at our orientational correlation code from part 1 of this series to evaluate its spatial and temporal locality. The data to be analyzed is a series of N points, possibly from a microscope image, where each point has a location (with x and y values), as well as a local orientation. The data is stored in four separate input arrays: x[N], y[N], sin6[N], and cos6[N]. (Previously, we found it computationally advantageous to precalculate the sine and cosine for each given orientation angle—hence, the two arrays.) Calculating the orientational correlation function requires computing the distance between each possible pair of points i, j,

r_{i,j} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2},

and accumulating correlation data for each pair in the rth
…
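To make the memory access pattern concrete, here is a minimal sketch of such a pair loop. The function name, parameters, and binning step are illustrative assumptions, not the article's code (which appeared in part 1):

#include <math.h>

/* Sketch: accumulate orientational correlation over all pairs.
 * corr[] and count[] are per-distance-bin accumulation arrays. */
void correlate(int N, const float x[], const float y[],
               const float sin6[], const float cos6[],
               int nbins, float binwidth,
               float corr[], long count[])
{
    for (int i = 0; i < N; ++i)            /* i held fixed on the outer loop */
        for (int j = i + 1; j < N; ++j) {  /* every later point j */
            float dx = x[i] - x[j];
            float dy = y[i] - y[j];
            float r = sqrtf(dx*dx + dy*dy);
            int bin = (int)(r / binwidth); /* the "rth" distance bin */
            if (bin < nbins) {
                /* cos(6(a-b)) = cos(6a)cos(6b) + sin(6a)sin(6b), which is
                 * why the sines and cosines were precalculated */
                corr[bin] += cos6[i]*cos6[j] + sin6[i]*sin6[j];
                count[bin] += 1;
            }
        }
}

Note that the inner loop streams through all four input arrays in step, which is why their layout in memory matters for the discussion that follows.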
rows of Table 1. (All table entries reflect the use of the trigonometric optimization discussed in the first installment.) Apparently, this simple code modification cuts the cycles needed per pair by more than one-third, bringing our observed execution time closer to our theoretical estimate of about 50 clock cycles per iteration. Clearly, rearranging the data to reflect how it is accessed enhances the cache's ability to exploit spatial locality and streamlines the prefetching of data.
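The struct entries in Table 1, and the code modification just described, evidently amount to merging the four input arrays into a single array of per-point records. A minimal sketch of that merged layout, with assumed type and field names:

#define N 100000            /* example size, for illustration only */

/* Array merging: all four values for a point sit side by side,
 * so one cache-line fill delivers everything the inner loop
 * needs for that point. */
struct point {
    float x, y;             /* location */
    float sin6, cos6;       /* precalculated orientation terms */
};

struct point pts[N];        /* replaces x[N], y[N], sin6[N], cos6[N] */

With 16 bytes per point, a single 64-byte cache line would typically hold four consecutive points, instead of four unrelated values from four separate arrays.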
Improving Temporal Locality: Blocking

As mentioned earlier, our code already exhibits some temporal locality by keeping the index constant on the outer loop, resulting in each ith array element being repeatedly reused. During each outer loop iteration, however, it also cycles through the data for all possible values of j. Consequently, the time between subsequent uses of the same data is the time it takes to process all of the data for one iteration of the outer loop. For large numbers of points, the data for all values of j might be far too big to fit in either the L1 or L2 cache, so later values overwrite the first values of j in the cache. As a result, the cache can't exploit this data's eventual reuse.
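To put rough numbers on this: assuming 4-byte values, each point carries 16 bytes across the four arrays, so an illustrative data set of N = 100,000 points (an assumed size, not one given in the article) means 1.6 Mbytes of j data per outer iteration, roughly three times the 512-Kbyte L2 cache discussed below. Every j value would indeed be evicted long before its next use.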
A solution to this problem is to group the input data together in small blocks (of size blocksize) that will easily fit in the lowest level of cache. Then we can process every possible pair of points drawn from two such blocks before moving on to the next pair of blocks, as Figure 4 shows. This technique, called blocking, lets us perform more computations for every new data point that we read from the main memory. For example, if we store in the cache two blocks of data, each of blocksize = 100 points, we could process all 10,000 possible pairs of points before having to access the next block of 100 points. That's a big improvement over processing only a single pair for every point read. The modified portion of our code snippet is as follows:
for(A=0; A+2*blocksize<N; A+=blocksize)
    for(B=A+blocksize; B+blocksize<N; B+=blocksize)
        for(i=A; i<A+blocksize; ++i)
            for(j=B; j<B+blocksize; ++j) {
                ...
            }

The technique's disadvantage, of course, is that it makes the code more complex. First, it increases the number of nested loops from two to four. Second, it introduces several special cases (not handled in the sample code) for pairs of points within the same block and for any leftover points if the number of points N isn't an integer multiple of blocksize.
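For completeness, here is one way those special cases might be folded in. This is a hedged sketch rather than the article's code; the hypothetical process_pair() stands in for the distance calculation and accumulation work of the inner loop.

void process_pair(int i, int j);   /* distance + accumulation (assumed) */

/* Blocked traversal of all pairs i < j, including pairs inside a
 * single block and a short final block when N isn't a multiple
 * of blocksize. */
void process_all_pairs(int N, int blocksize)
{
    for (int A = 0; A < N; A += blocksize) {
        int Aend = (A + blocksize < N) ? A + blocksize : N;

        /* Case 1: pairs within block A itself. */
        for (int i = A; i < Aend; ++i)
            for (int j = i + 1; j < Aend; ++j)
                process_pair(i, j);

        /* Case 2: block A against every later block B; clamping
         * Bend to N absorbs the leftover points. */
        for (int B = Aend; B < N; B += blocksize) {
            int Bend = (B + blocksize < N) ? B + blocksize : N;
            for (int i = A; i < Aend; ++i)
                for (int j = B; j < Bend; ++j)
                    process_pair(i, j);
        }
    }
}

Each pair i < j is still visited exactly once, and the inner loops never span more than two blocks' worth of data at a time.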
Table 1 shows that using this blocking technique, whether in conjunction with structs or individual arrays, increases the speed by about 10 clock cycles per pair. This modest gain underscores that the hardware was already doing a pretty good job of prefetching the input array data. However, the fact that we're still above the estimated 50 clock cycles required for the calculations in each inner loop iteration indicates that some memory latency is still affecting our performance. Apparently, the L1 cache still isn't satisfying some of the data requests.

The only way to know for sure what causes the additional latency would be to run a detailed simulation of our calculation based on our memory system's known behavior. Fortunately, we can make some educated guesses based on the sizes and latencies of each level of cache on our CPU. It happens that for the set of input data we used to test our code, the required size of the accumulation array (with or without structs) is just over 64 Kbytes, which is just a bit too big for our L1 data cache. Worse, each pair of points from the input data could be any distance apart, so the accumulation array is accessed randomly, with neither spatial nor temporal locality. Given that the L1 cache is too small to hold the entire accumulation array and the two input data blocks at the same time, data is frequently evicted from the L1 cache and must subsequently be reloaded—most likely from the L2 cache, which, at 512 Kbytes, is large enough to hold everything. Each L2 cache access takes 20 clock cycles, which would be just about right to explain the additional latencies we observe.

The array-merging and blocking techniques, which are applicable to a variety of problems, improved our code's performance by approximately 40 percent. Coupled with the improvements we made to the code in the first installment of this series, we have reduced execution time dramatically—by more than a factor of four. But we're not done yet! There are additional ways we can improve our code that also use the general strategies of reducing instructions (both explicitly and implicitly, for address arithmetic) and increasing locality (both spatial and temporal). In the next installment, we will see how paying attention to computer architecture for a particular algorithm can yield even more dramatic performance improvements.
Figure 4. Increasing temporal locality using blocking. This technique lets us perform more computations for every new data point that we read from the main memory.