
Education
Editors: Michael Dennin, [email protected]; Steven Barrett, [email protected]

Why Computer Architecture Matters: Memory Access

By Cosmin Pancratov, Jacob M. Kurzer, Kelly A. Shaw, and Matthew L. Trawick

This three-part series shows how applying knowledge about the underlying computer hardware to the code for
a simple but computationally intensive algorithm can significantly improve performance. This second segment
focuses on memory accesses.

When scientists write code to implement an algorithm, we initially focus on just getting something that works. However, as we execute our code on large problem sizes, we sometimes discover that our initial implementation runs too slowly. Fortunately, there are some easy ways to speed up the code by tailoring it to how computer hardware actually works.

In the first installment of this three-part series (May/June 2008), we examined the individual instructions required by a simple but computationally intensive algorithm (an orientational correlation function) and explored how to reduce the number of clock cycles needed for each loop iteration. By avoiding long-running instructions and precalculating some values outside of the innermost loop, we cut the clock cycles needed for each iteration by more than half.

But our code still executed much more slowly than we anticipated. Although we estimated that each loop iteration should require roughly 50 clock cycles on our test system, we found that each actually required more than 130 clock cycles! The reason for this large discrepancy turned out to be the delay time, or latency, associated with retrieving data from the computer's main memory.

In this installment of our series, we show how to rearrange code so that it accommodates the peculiarities of the memory system and consequently reduces execution time.

Main Memory and Caches

In part 1 of this series, we saw that latencies for arithmetic instructions range from one to 10 cycles (with some select instructions taking longer). In contrast, a typical main memory access can take approximately 300 cycles. All the optimizations we made to decrease the number of cycles spent on computation are irrelevant if most memory accesses incur these long latencies.

Fortunately, hardware designers have developed a strategy to mask this problem. They place a small amount of faster memory, called a "cache," on or very near the CPU chip itself, where it can be accessed much more quickly than the main memory. Modern processors typically have two or more levels of cache (denoted L1, L2, and so on), each with a different trade-off of size versus latency.1 Figure 1 depicts the memory-system configuration used in our tests. Reading from the L1 cache (the smallest cache, located closest to the processor) is very fast, whereas reading from the main memory (which is large and farthest from the processor) is much slower.

Figure 1. The relative sizes and latencies of different parts of our test platform's memory system: CPU; L1 data cache (64 Kbytes), 3 clock cycles; L2 cache (512 Kbytes), 20 clock cycles; main memory (2 Gbytes), roughly 300 clock cycles.

To reduce the negative impact of memory accesses, the fast L1 cache must satisfy as many memory accesses as possible; we want to avoid going to the main memory at all costs. As programmers, we need to understand when the cache retains data—that way, we can ensure that the data we want will be cached when we need it.

Locality, Locality, Locality!

Caches work by retaining recently used data so that it can be retrieved more quickly on future accesses. Data is retained in the cache for as long as possible and is evicted only when other, more recently accessed data must be retained. One way to reduce the frequency of trips to the main memory is to reuse data as many times as possible within a short period of time, because the data will likely still be cached. This is known as temporal locality.

Caches also exploit spatial locality, which occurs when a program uses multiple pieces of contiguously stored data in a short time period—for instance, by accessing array elements sequentially. Caches exploit this spatial locality by retrieving and storing not just the specific bytes of data requested by the processor but an entire "line," or sequence of contiguous bytes of data. This cached line of data can then satisfy future requests for nearby memory addresses. Additionally, memory controllers usually "prefetch" multiple lines of data from the main memory into the cache when they suspect that they might be used in the near future—for example, if nearby memory locations have been recently accessed. So, even if data spans several cache lines, storing it contiguously improves the chance that it will be prefetched before it's needed.
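To make these two notions concrete, here is a small, self-contained illustration that isn't part of our correlation code (the matrix and its dimensions here are arbitrary). Both functions compute the same sum, but the first visits the elements in the order they're stored and benefits from spatial locality and prefetching, whereas the second strides across the array and touches a new cache line on almost every access:

// Illustrative only (not from the correlation code): summing a rows-by-cols
// matrix stored in row-major order in one contiguous buffer.
#include <vector>

double sum_row_major(const std::vector<double>& a, int rows, int cols) {
    double s = 0.0;
    for (int i = 0; i < rows; ++i)       // consecutive j values are adjacent in memory,
        for (int j = 0; j < cols; ++j)   // so each cache line loaded is fully used
            s += a[i * cols + j];
    return s;
}

double sum_column_major(const std::vector<double>& a, int rows, int cols) {
    double s = 0.0;
    for (int j = 0; j < cols; ++j)       // consecutive i values are cols*8 bytes apart,
        for (int i = 0; i < rows; ++i)   // so nearly every access touches a new cache line
            s += a[i * cols + j];
    return s;
}

Once the matrix is too large for the caches, the strided version typically runs several times slower, purely because of how it touches memory.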


There's another way in which a program's spatial locality can improve a cache's performance, based on a subtle aspect of how caches work. Figure 2 shows an example cache with eight cache lines. On the left, we also show an example memory space divided into cache lines. In an ideal world, any line of data could reside in any cache line. However, to limit the number of lines that must be searched and keep cache access times short, caches generally limit the number of cache lines in which a given piece of data can reside. In our example, every piece of data maps to two different cache lines, collectively referred to as a set. (We refer to this configuration as two-way set associative because each set contains two cache lines.) In the figure, green areas in the main memory can map to only one of the two green cache lines in the cache. If more than two areas map to the same set, a conflict results, causing some data to be evicted. Spatial locality can avoid such conflicts, because contiguous areas in the main memory are guaranteed to map to different sets in the cache.

Figure 2. Mapping between main memory (divided into cache lines) and a two-way set associative cache with a capacity of eight cache lines (four sets of two lines each).

A Programming Example

Let's look back at our orientational correlation code from part 1 of this series to evaluate its spatial and temporal locality. The data to be analyzed is a series of N points, possibly from a microscope image, where each point has a location (with x and y values) as well as a local orientation. The data is stored in four separate input arrays: x[N], y[N], sin6[N], and cos6[N]. (Previously, we found it computationally advantageous to precalculate the sine and cosine of each orientation angle—hence, the two arrays.) Calculating the orientational correlation function requires calculating the distance between each possible pair of points i, j,

r_{i,j} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2},

and accumulating correlation data for each pair in the rth element of the arrays g[r] and count[r]. The relevant portion of the code is as follows:

//Now, accumulate data for all pairs of
//points (i,j).
for(i=0; i<N; ++i)          //for each i < N
  for(j=i+1; j<N; ++j) {    //for each j < N
    Dx = x[i] - x[j];
    Dy = y[i] - y[j];
    r = sqrt(Dx*Dx + Dy*Dy);
    g[r] += cos6[i]*cos6[j] + sin6[i]*sin6[j];
    ++count[r];
  }



Figure 3. Array merging for increased spatial locality. (a) The data structure as originally coded, in six separate arrays (x, y, sin6, cos6, count, and g); (b) the revised data structures, as two arrays of structures: one interleaving x, y, cos6, and sin6 for each point, and one interleaving count and g for each distance bin.

Even without thinking too hard about what the code is actually doing, it's clear that it already exhibits some degree of temporal and spatial locality, which the cache is exploiting. The temporal locality comes from repeatedly reusing the ith element of the input arrays, as guaranteed by the nested for loops. The spatial locality comes from always accessing the four input arrays sequentially, with the index j incremented by one every iteration. The cache must be satisfying at least some of the program's memory requests—if not, the clock cycles per pair of points in our code would be significantly higher than the observed 130 cycles.

But there's clearly room for improvement. Each of the inner loop's iterations currently requires access to six different arrays, which means that six different locations in memory must be accessed. And because the cache can't hold all of the data for large values of N, the data will have to be read again from main memory every time the inner loop begins. Our goal as programmers is to find ways to improve spatial and temporal locality where we can, by changing either the order in which we store data or the order in which we access it. Fortunately, some broadly applicable strategies can help.

Improving Spatial Locality: Array Merging

Examining our sample program, we have declared six different large arrays of numbers. The declarations for these arrays are

int x[N], y[N];
float sin6[N], cos6[N];
int count[MAX_R];
float g[MAX_R];

As Figure 3a shows, these arrays are each stored sequentially in the main memory. Structuring the data into these parallel arrays would be ideal if our processing required accessing only the x or y values. But our program accesses the jth element of each of the four input arrays and the rth element of each of the two output arrays at the same time, so this data organization doesn't mimic how we access the data.

In our case, we can improve our program's spatial locality by storing contiguously in memory the values associated with each jth input point, as shown in Figure 3b, a technique known as array merging.2 In C, this is accomplished by defining a struct that contains all four values for each point, with our data held in a single array data[N] of those structs. When the cache retrieves the cache block containing the x value for point j, it will automatically bring in the y, cos6, and sin6 values for that point and could potentially prefetch values for several other points (such as j + 1, j + 2, and so on), depending on the cache line size. Similarly, because the rth elements of the arrays count and g are always accessed at the same time, we can define a second struct that contains both the g and count values; all MAX_R of these values are held in a single array of these structures, accum[MAX_R]. The revised declaration section and the code that uses these structures are as follows:

struct DataStruct {
  int x, y;
  float cos6, sin6;
};
DataStruct data[N];

struct AccumulationStruct {
  int count;
  float g;
};
AccumulationStruct accum[MAX_R];

//Now, accumulate data for all pairs of
//points (i,j).
for(i=0; i<N; ++i)          //for each i < N
  for(j=i+1; j<N; ++j) {    //for each j < N
    Dx = data[i].x - data[j].x;
    Dy = data[i].y - data[j].y;
    r = sqrt(Dx*Dx + Dy*Dy);
    accum[r].g += data[i].cos6 * data[j].cos6
                + data[i].sin6 * data[j].sin6;
    ++accum[r].count;
  }
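A quick, hypothetical sanity check (our addition; it assumes 4-byte int and float and 64-byte cache lines, none of which is stated above) shows why the merged layout cooperates so well with the cache: DataStruct packs into 16 bytes with no padding, so a single cache line delivers all four values for four consecutive points.

// Hypothetical check of the merged structures' footprint (assumed sizes).
#include <cstdio>

struct DataStruct { int x, y; float cos6, sin6; };
struct AccumulationStruct { int count; float g; };

int main() {
    std::printf("DataStruct: %zu bytes (%zu points per 64-byte line)\n",
                sizeof(DataStruct), 64 / sizeof(DataStruct));
    std::printf("AccumulationStruct: %zu bytes\n", sizeof(AccumulationStruct));
    return 0;
}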


We tested this code's speed and compared it to the code without array merging; the results appear in the first two rows of Table 1. (All table entries reflect the use of the trigonometric optimization discussed in the first installment.) Apparently, this simple code modification cuts the cycles needed per pair by more than one-third, bringing our observed execution time closer to our theoretical estimate of about 50 clock cycles per iteration. Clearly, rearranging the data to reflect how it is accessed enhances the cache's ability to exploit spatial locality and streamlines the prefetching of data.

Table 1. Effects of array merging and blocking on execution time.

                                      Execution time (seconds)   Clock cycles per pair
Without array merging or blocking     4,422                      133.6
With array merging only               2,917                       88.1
With blocking only                    4,086                      123.5
With array merging and blocking       2,634                       79.6

Improving Temporal Locality: Blocking

As mentioned earlier, our code already exhibits some temporal locality by keeping the index i constant on the outer loop, resulting in each ith array element being repeatedly reused. During each outer loop iteration, however, it also cycles through the data for all possible values of j. Consequently, the time between subsequent uses of the same data is the time it takes to process all of the data for one iteration of the outer loop. For large numbers of points, the data for all values of j might be far too big to fit in either the L1 or L2 cache, so later values overwrite the first values of j in the cache, and the cache can't exploit this data's eventual reuse.

A solution to this problem is to group the input data into small blocks (of size blocksize) that will easily fit in the lowest level of cache. Then we can process each possible pair of points from two such blocks before moving on to the next pair of blocks, as Figure 4 shows. This technique, called blocking, lets us perform more computations for every new data point that we read from the main memory. For example, if we store in the cache two blocks of data, each of blocksize = 100 points, we can process all 10,000 possible pairs of points before having to access the next block of 100 points. That's a big improvement over processing only a single pair for every point read. The modified portion of our code snippet is as follows:

for(A=0; A+2*blocksize<N; A+=blocksize)
  for(B=A+blocksize; B+blocksize<N; B+=blocksize)
    for(i=A; i<A+blocksize; ++i)
      for(j=B; j<B+blocksize; ++j) {
        ...
      }

The technique's disadvantage, of course, is that it makes the code more complex. First, it increases the number of nested loops from two to four. Second, it introduces several special cases (not handled in the sample code) for pairs of points within the same block and for any leftover points if the number of points N isn't an integer multiple of blocksize.
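One possible way to handle those special cases, sketched here under our own assumptions rather than taken from the article: treat the final, possibly short block like any other block, and add a separate pass over pairs that fall within a single block. The helper process_pair(i, j) is a hypothetical stand-in for the accumulation statements of the inner loop above.

// Sketch only (not the authors' code): a blocked traversal that covers every
// pair i < j exactly once, including same-block pairs and a short final block.
void process_pair(int i, int j);  // hypothetical: the accumulation body shown earlier

void accumulate_all_pairs(int N, int blocksize) {
    for (int A = 0; A < N; A += blocksize) {
        int Aend = (A + blocksize < N) ? A + blocksize : N;   // final block may be short
        // Pairs with both points inside block A.
        for (int i = A; i < Aend; ++i)
            for (int j = i + 1; j < Aend; ++j)
                process_pair(i, j);
        // Pairs spanning block A and each later block B.
        for (int B = Aend; B < N; B += blocksize) {
            int Bend = (B + blocksize < N) ? B + blocksize : N;
            for (int i = A; i < Aend; ++i)
                for (int j = B; j < Bend; ++j)
                    process_pair(i, j);
        }
    }
}

Because each block fits comfortably in the L1 cache, the extra same-block pass touches data that's already resident, so its additional cost is small.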
Table 1 shows that using this blocking technique, whether in conjunction with structs or with individual arrays, increases the speed by about 10 clock cycles per pair. This modest gain underscores that the hardware was already doing a pretty good job of prefetching the input array data. However, the fact that we're still above the estimated 50 clock cycles required for the calculations in each inner loop iteration indicates that some memory latency is still affecting our performance. Apparently, the L1 cache still isn't satisfying some of the data requests.

The only way to know for sure what causes the additional latency would be to run a detailed simulation of our calculation based on our memory system's known behavior. Fortunately, we can make some educated guesses based on the sizes and latencies of each level of cache on our CPU. It happens that for the set of input data we used to test our code, the required size of the accumulation array (with or without structs) is just over 64 Kbytes, which is just a bit too big for our L1 data cache. Worse, each pair of points from the input data could be any distance apart, so the accumulation array is accessed randomly, with neither spatial nor temporal locality. Because the L1 cache is too small to hold the entire accumulation array and the two input data blocks at the same time, data is frequently evicted from the L1 cache and must subsequently be reloaded—most likely from the L2 cache, which, at 512 Kbytes, is large enough to hold everything. Each L2 cache access takes 20 clock cycles, which is just about right to explain the additional latencies we observe.
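To put rough, illustrative numbers on this (our own estimate; the exact value of MAX_R isn't given here): each AccumulationStruct occupies 8 bytes, so an accumulation array of roughly 8,200 distance bins takes about 8,200 × 8 ≈ 66 Kbytes, just over the 64-Kbyte L1 capacity, while two 100-point blocks of 16-byte DataStructs add only about 3 Kbytes. It's the randomly indexed accumulation array, not the input blocks, that keeps spilling into the L2 cache.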
The technique’s disadvantage, of course, is that it makes In the next installment, we will see how paying attention to
the code more complex. First, it increases the number of computer architecture for a particular algorithm can yield
nested loops from two to four. Second, it introduces several even more dramatic performance improvements.



Figure 4. Increasing temporal locality using blocking: the merged data array (x, y, cos6, and sin6 for each point) is processed as pairs of blocks, block A together with block B, then with the next block B, and so on. This technique lets us perform more computations for every new data point that we read from the main memory.

References
1. J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 3rd ed., Morgan Kaufmann, 2003.
2. A.R. Lebeck and D.A. Wood, "Cache Profiling and the SPEC Benchmarks: A Case Study," Computer, vol. 27, no. 10, 1994, pp. 15–26.

Cosmin Pancratov is a research assistant and undergraduate student at the University of Richmond. His research interests include condensed matter physics and computer science. Contact him at [email protected].

Jacob M. Kurzer is a research assistant and undergraduate student at the University of Richmond. His research interests include algorithms and performance optimization. Contact him at [email protected].

Kelly A. Shaw is an assistant professor of computer science at the University of Richmond. Her research interests include the interaction of hardware and software in chip multiprocessors. Shaw has a PhD in computer science from Stanford University. Contact her at [email protected].

Matthew L. Trawick is an assistant professor of physics at the University of Richmond. His research interests include the physics of block copolymer materials and their applications in nanotechnology, as well as atomic force microscopy. Trawick has a PhD in physics from the Ohio State University. Contact him at [email protected].

