Memory Barriers: a Hardware View for Software Hackers
Paul E. McKenney
Linux Technology Center
IBM Beaverton
[email protected]
June 7, 2010
So what possessed CPU designers to cause them to inflict memory barriers on poor unsuspecting SMP software designers?

In short, because reordering memory references allows much better performance, and so memory barriers are needed to force ordering in things like synchronization primitives whose correct operation depends on ordered memory references.

Getting a more detailed answer to this question requires a good understanding of how CPU caches work, and especially what is required to make caches really work well. The following sections:

1. present the structure of a cache,

2. describe how cache-coherency protocols ensure that CPUs agree on the value of each location in memory, and, finally,

3. outline how store buffers and invalidate queues help caches and cache-coherency protocols achieve high performance.

We will see that memory barriers are a necessary evil that is required to enable good performance and scalability, an evil that stems from the fact that CPUs are orders of magnitude faster than are both the interconnects between them and the memory they are attempting to access.

1 Cache Structure

Modern CPUs are much faster than are modern memory systems. A 2006 CPU might be capable of executing ten instructions per nanosecond, but will require many tens of nanoseconds to fetch a data item from main memory. This disparity in speed, more than two orders of magnitude, has resulted in the multi-megabyte caches found on modern CPUs. These caches are associated with the CPUs as shown in Figure 1, and can typically be accessed in a few cycles.[1]

[1] It is standard practice to use multiple levels of cache, with a small level-one cache close to the CPU with single-cycle access time, and a larger level-two cache with a longer access time, perhaps roughly ten clock cycles. Higher-performance CPUs often have three or even four levels of cache.

[Figure 1: Modern Computer System Cache Structure. CPU 0 and CPU 1 each have a private cache, with both caches connected through an interconnect to a common memory.]

Data flows among the CPUs' caches and memory in fixed-length blocks called "cache lines", which are normally a power of two in size, ranging from 16 to 256 bytes. When a given data item is first accessed by
a given CPU, it will be absent from that CPU's cache, meaning that a "cache miss" (or, more specifically, a "startup" or "warmup" cache miss) has occurred. The cache miss means that the CPU will have to wait (or be "stalled") for hundreds of cycles while the item is fetched from memory. However, the item will be loaded into that CPU's cache, so that subsequent accesses will find it in the cache and therefore run at full speed.

After some time, the CPU's cache will fill, and subsequent misses will likely need to eject an item from the cache in order to make room for the newly fetched item. Such a cache miss is termed a "capacity miss", because it is caused by the cache's limited capacity. However, most caches can be forced to eject an old item to make room for a new item even when they are not yet full. This is due to the fact that large caches are implemented as hardware hash tables with fixed-size hash buckets (or "sets", as CPU designers call them) and no chaining, as shown in Figure 2.

         Way 0         Way 1
  0x0    0x12345000
  0x1    0x12345100
  0x2    0x12345200
  0x3    0x12345300
  0x4    0x12345400
  0x5    0x12345500
  0x6    0x12345600
  0x7    0x12345700
  0x8    0x12345800
  0x9    0x12345900
  0xA    0x12345A00
  0xB    0x12345B00
  0xC    0x12345C00
  0xD    0x12345D00
  0xE    0x12345E00    0x43210E00
  0xF

Figure 2: CPU Cache Structure

This cache has sixteen "sets" and two "ways" for a total of 32 "lines", each entry containing a single 256-byte "cache line", which is a 256-byte-aligned block of memory. This cache line size is a little on the large side, but makes the hexadecimal arithmetic much simpler. In hardware parlance, this is a two-way set-associative cache, and is analogous to a software hash table with sixteen buckets, where each bucket's hash chain is limited to at most two elements. The size (32 cache lines in this case) and the associativity (two in this case) are collectively called the cache's "geometry". Since this cache is implemented in hardware, the hash function is extremely simple: extract four bits from the memory address.
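To make the hashing concrete, here is a minimal C sketch (my own illustration, not anything taken from real hardware) of how an address decomposes for this example cache, with its 256-byte lines and sixteen sets:

#include <stdint.h>

#define LINE_SHIFT 8                    /* 256-byte cache lines */
#define SET_BITS   4                    /* sixteen sets */

/* Bits 0-7 are the offset within the line; bits 8-11 pick the set. */
static inline uint32_t set_index(uint32_t addr)
{
        return (addr >> LINE_SHIFT) & ((1 << SET_BITS) - 1);
}

/* The remaining high-order bits form the tag stored with the line. */
static inline uint32_t line_tag(uint32_t addr)
{
        return addr >> (LINE_SHIFT + SET_BITS);
}

For example, set_index(0x12345E00) and set_index(0x43210E00) both return 0xE, which is why those two lines compete for the same set in Figure 2.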
In Figure 2, each box corresponds to a cache entry, which can contain a 256-byte cache line. However, a cache entry can be empty, as indicated by the empty boxes in the figure. The rest of the boxes are flagged with the memory address of the cache line that they contain. Since the cache lines must be 256-byte aligned, the low eight bits of each address are zero, and the choice of hardware hash function means that the next-higher four bits match the hash line number.

The situation depicted in the figure might arise if the program's code were located at address 0x43210E00 through 0x43210EFF, and this program accessed data sequentially from 0x12345000 through 0x12345EFF. Suppose that the program were now to access location 0x12345F00. This location hashes to line 0xF, and both ways of this line are empty, so the corresponding 256-byte line can be accommodated. If the program were to access location 0x1233000, which hashes to line 0x0, the corresponding 256-byte cache line can be accommodated in way 1. However, if the program were to access location 0x1233E00, which hashes to line 0xE, one of the existing lines must be ejected from the cache to make room for the new cache line. If this ejected line were accessed later, a cache miss would result. Such a cache miss is termed an "associativity miss".

Thus far, we have been considering only cases where a CPU reads a data item. What happens when it does a write? Because it is important that all CPUs agree on the value of a given data item, before a given CPU writes to that data item, it must first cause it to be removed, or "invalidated", from other CPUs' caches. Once this invalidation has completed, the CPU may safely modify the data item. If the data item was present in this CPU's cache, but was read-only, this process is termed a "write miss". Once a given CPU has completed invalidating a given data item from other CPUs' caches, that CPU may repeatedly write (and read) that data item.
Later, if one of the other CPUs attempts to access the data item, it will incur a cache miss, this time because the first CPU invalidated the item in order to write to it. This type of cache miss is termed a "communication miss", since it is usually due to several CPUs using the data items to communicate (for example, a lock is a data item that is used to communicate among CPUs using a mutual-exclusion algorithm).

Clearly, much care must be taken to ensure that all CPUs maintain a coherent view of the data. With all this fetching, invalidating, and writing, it is easy to imagine data being lost or (perhaps worse) different CPUs having conflicting values for the same data item in their respective caches. These problems are prevented by "cache-coherency protocols", described in the next section.

2 Cache-Coherence Protocols

Cache-coherency protocols manage cache-line states so as to prevent inconsistent or lost data. These protocols can be quite complex, with many tens of states,[2] but for our purposes we need only concern ourselves with the four-state MESI cache-coherence protocol.

[2] See Culler et al. [CSG99] pages 670 and 671 for the nine-state and 26-state diagrams for SGI Origin2000 and Sequent (now IBM) NUMA-Q, respectively. Both diagrams are significantly simpler than real life.

2.1 MESI States

MESI stands for "modified", "exclusive", "shared", and "invalid", the four states a given cache line can take on using this protocol. Caches using this protocol therefore maintain a two-bit state "tag" on each cache line in addition to that line's physical address and data.

A line in the "modified" state has been subject to a recent memory store from the corresponding CPU, and the corresponding memory is guaranteed not to appear in any other CPU's cache. Cache lines in the "modified" state can thus be said to be "owned" by the CPU. Because this cache holds the only up-to-date copy of the data, this cache is ultimately responsible for either writing it back to memory or handing it off to some other cache, and must do so before reusing this line to hold other data.

The "exclusive" state is very similar to the "modified" state, the single exception being that the cache line has not yet been modified by the corresponding CPU, which in turn means that the copy of the cache line's data that resides in memory is up-to-date. However, since the CPU can store to this line at any time, without consulting other CPUs, a line in the "exclusive" state can still be said to be owned by the corresponding CPU. That said, because the corresponding value in memory is up to date, this cache can discard this data without writing it back to memory or handing it off to some other CPU.

A line in the "shared" state might be replicated in at least one other CPU's cache, so that this CPU is not permitted to store to the line without first consulting with other CPUs. As with the "exclusive" state, because the corresponding value in memory is up to date, this cache can discard this data without writing it back to memory or handing it off to some other CPU.

A line in the "invalid" state is empty; in other words, it holds no data. When new data enters the cache, it is placed into a cache line that was in the "invalid" state if possible. This approach is preferred because replacing a line in any other state could result in an expensive cache miss should the replaced line be referenced in the future.
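These four states, and the per-line metadata that carries them, can be summarized in C. The following sketch is my own illustration of the structure described above, not anything taken from a real cache design:

#include <stdint.h>

enum mesi_state {
        MESI_MODIFIED,  /* only up-to-date copy; memory is stale */
        MESI_EXCLUSIVE, /* sole cached copy, but memory is current */
        MESI_SHARED,    /* read-only; other caches may hold copies */
        MESI_INVALID,   /* holds no data */
};

struct cache_line {
        uint32_t tag;           /* which aligned block this line holds */
        enum mesi_state state;  /* the two-bit state "tag" */
        uint8_t data[256];      /* the cached copy of memory */
};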
Since all CPUs must maintain a coherent view of the data carried in the cache lines, the cache-coherence protocol provides messages that coordinate the movement of cache lines through the system.

2.2 MESI Protocol Messages

Many of the transitions described in the previous section require communication among the CPUs. If the CPUs are on a single shared bus, the following messages suffice:

Read: The "read" message contains the physical address of the cache line to be read.
Read Response: The "read response" message contains the data requested by an earlier "read" message. This "read response" message might be supplied either by memory or by one of the other caches. For example, if one of the caches has the desired data in "modified" state, that cache must supply the "read response" message.

Invalidate: The "invalidate" message contains the physical address of the cache line to be invalidated. All other caches must remove the corresponding data from their caches and respond.

Invalidate Acknowledge: A CPU receiving an "invalidate" message must respond with an "invalidate acknowledge" message after removing the specified data from its cache.

Read Invalidate: The "read invalidate" message contains the physical address of the cache line to be read, while at the same time directing other caches to remove the data. Hence, it is a combination of a "read" and an "invalidate", as indicated by its name. A "read invalidate" message requires both a "read response" and a set of "invalidate acknowledge" messages in reply.

Writeback: The "writeback" message contains both the address and the data to be written back to memory (and perhaps "snooped" into other CPUs' caches along the way). This message permits caches to eject lines in the "modified" state as needed to make room for other data.
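As a rough illustration (mine, not part of any protocol specification), these six message types might be encoded as follows:

enum mesi_msg {
        MSG_READ,            /* carries the physical address to read */
        MSG_READ_RESPONSE,   /* carries data, from memory or a cache */
        MSG_INVALIDATE,      /* address all other caches must drop */
        MSG_INVALIDATE_ACK,  /* sent after the line has been removed */
        MSG_READ_INVALIDATE, /* a "read" and an "invalidate" combined */
        MSG_WRITEBACK,       /* address and data headed back to memory */
};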
Interestingly enough, a shared-memory multiprocessor system really is a message-passing computer under the covers. This means that clusters of SMP machines that use distributed shared memory are using message passing to implement shared memory at two different levels of the system architecture.

Quick Quiz 1: What happens if two CPUs attempt to invalidate the same cache line concurrently?

Quick Quiz 2: When an "invalidate" message appears in a large multiprocessor, every CPU must give an "invalidate acknowledge" response. Wouldn't the resulting "storm" of "invalidate acknowledge" responses totally saturate the system bus?

Quick Quiz 3: If SMP machines are really using message passing anyway, why bother with SMP at all?

2.3 MESI State Diagram

A given cache line's state changes as protocol messages are sent and received, as shown in Figure 3.

[Figure 3: MESI Cache-Coherency State Diagram. The four states M, E, S, and I, linked by transition arcs (a) through (l).]

The transition arcs in this figure are as follows:

Transition (a): A cache line is written back to memory, but the CPU retains it in its cache and further retains the right to modify it. This transition requires a "writeback" message.

Transition (b): The CPU writes to the cache line that it already had exclusive access to. This transition does not require any messages to be sent or received.

Transition (c): The CPU receives a "read invalidate" message for a cache line that it has modified. The CPU must invalidate its local copy, then respond with both a "read response" and an "invalidate acknowledge" message, both sending the data to the requesting CPU and indicating that it no longer has a local copy.
Transition (d): The CPU does an atomic read-modify-write operation on a data item that was not present in its cache. It transmits a "read invalidate", receiving the data via a "read response". The CPU can complete the transition once it has also received a full set of "invalidate acknowledge" responses.

Transition (e): The CPU does an atomic read-modify-write operation on a data item that was previously read-only in its cache. It must transmit "invalidate" messages, and must wait for a full set of "invalidate acknowledge" responses before completing the transition.

Transition (f): Some other CPU reads the cache line, and it is supplied from this CPU's cache, which retains a read-only copy, possibly also writing it back to memory. This transition is initiated by the reception of a "read" message, and this CPU responds with a "read response" message containing the requested data.

Transition (g): Some other CPU reads a data item in this cache line, and it is supplied either from this CPU's cache or from memory. In either case, this CPU retains a read-only copy. This transition is initiated by the reception of a "read" message, and this CPU responds with a "read response" message containing the requested data.

Transition (h): This CPU realizes that it will soon need to write to some data item in this cache line, and thus transmits an "invalidate" message. The CPU cannot complete the transition until it receives a full set of "invalidate acknowledge" responses. Alternatively, all other CPUs eject this cache line from their caches via "writeback" messages (presumably to make room for other cache lines), so that this CPU is the last CPU caching it.

Transition (i): Some other CPU does an atomic read-modify-write operation on a data item in a cache line held only in this CPU's cache, so this CPU invalidates it from its cache. This transition is initiated by the reception of a "read invalidate" message, and this CPU responds with both a "read response" and an "invalidate acknowledge" message.

Transition (j): This CPU does a store to a data item in a cache line that was not in its cache, and thus transmits a "read invalidate" message. The CPU cannot complete the transition until it receives the "read response" and a full set of "invalidate acknowledge" messages. The cache line will presumably transition to "modified" state via transition (b) as soon as the actual store completes.

Transition (k): This CPU loads a data item in a cache line that was not in its cache. The CPU transmits a "read" message, and completes the transition upon receiving the corresponding "read response".

Transition (l): Some other CPU does a store to a data item in this cache line, but holds this cache line in read-only state due to its being held in other CPUs' caches (such as the current CPU's cache). This transition is initiated by the reception of an "invalidate" message, and this CPU responds with an "invalidate acknowledge" message.

Quick Quiz 4: How does the hardware handle the delayed transitions described above?

2.4 MESI Protocol Example

Let's now look at this from the perspective of a cache line's worth of data, initially residing in memory at address 0, as it travels through the various single-line direct-mapped caches in a four-CPU system. Table 1 shows this flow of data, with the first column showing the sequence of operations, the second the CPU performing the operation, the third the operation being performed, the next four the state of each CPU's cache line (memory address followed by MESI state), and the final two columns whether the corresponding memory contents are up to date ("V") or not ("I").

Initially, the CPU cache lines in which the data would reside are in the "invalid" state, and the data is valid in memory.
When CPU 0 loads the data at address 0, it enters the "shared" state in CPU 0's cache, and is still valid in memory. CPU 3 also loads the data at address 0, so that it is in the "shared" state in both CPUs' caches, and is still valid in memory. Next CPU 0 loads some other cache line (at address 8), which forces the data at address 0 out of its cache via an invalidation, replacing it with the data at address 8. CPU 2 now does a load from address 0, but this CPU realizes that it will soon need to store to it, and so it uses a "read invalidate" message in order to gain an exclusive copy, invalidating it from CPU 3's cache (though the copy in memory remains up to date). Next CPU 2 does its anticipated store, changing the state to "modified". The copy of the data in memory is now out of date. CPU 1 does an atomic increment, using a "read invalidate" to snoop the data from CPU 2's cache and invalidate it, so that the copy in CPU 1's cache is in the "modified" state (and the copy in memory remains out of date). Finally, CPU 1 reads the cache line at address 8, which uses a "writeback" message to push address 0's data back out to memory.

                                      CPU Cache              Memory
  Sequence #  CPU #  Operation      0     1     2     3      0   8
      0              Initial State  -/I   -/I   -/I   -/I    V   V
      1         0    Load           0/S   -/I   -/I   -/I    V   V
      2         3    Load           0/S   -/I   -/I   0/S    V   V
      3         0    Invalidation   8/S   -/I   -/I   0/S    V   V
      4         2    RMW            8/S   -/I   0/E   -/I    V   V
      5         2    Store          8/S   -/I   0/M   -/I    I   V
      6         1    Atomic Inc     8/S   0/M   -/I   -/I    I   V
      7         1    Writeback      8/S   8/S   -/I   -/I    V   V

Table 1: Cache Coherence Example

Note that we end with data in some of the CPUs' caches.

Quick Quiz 5: What sequence of operations would put the CPUs' caches all back into the "invalid" state?

3 Stores Result in Unnecessary Stalls

Although the cache structure shown in Figure 1 provides good performance for repeated reads and writes from a given CPU to a given item of data, its performance for the first write to a given cache line is quite poor. To see this, consider Figure 4, which shows a timeline of a write by CPU 0 to a cacheline held in CPU 1's cache. Since CPU 0 must wait for the cache line to arrive before it can write to it, CPU 0 must stall for an extended period of time.[3]

[3] The time required to transfer a cache line from one CPU's cache to another's is typically a few orders of magnitude more than that required to execute a simple register-to-register instruction.

[Figure 4: Writes See Unnecessary Stalls. CPU 0's write sends an "invalidate" message to CPU 1, and CPU 0 stalls until CPU 1's "acknowledgement" returns.]

But there is no real reason to force CPU 0 to stall for so long; after all, regardless of what data happens to be in the cache line that CPU 1 sends it, CPU 0 is going to unconditionally overwrite it.

3.1 Store Buffers

One way to prevent this unnecessary stalling of writes is to add "store buffers" between each CPU and its cache, as shown in Figure 5. With the addition of these store buffers, CPU 0 can simply record its write in its store buffer and continue executing. When the cache line does finally make its way from CPU 1 to CPU 0, the data will be moved from the store buffer to the cache line.

[Figure 5: Caches With Store Buffers. As in Figure 1, but with a store buffer interposed between each CPU and its cache.]

However, there are complications that must be addressed, which are covered in the next two sections.
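The following is a rough sketch (my own, with invented field names) of what a store-buffer entry might record; the CPU notes the address and the new value, then continues executing without waiting for the cache line:

#include <stdbool.h>
#include <stdint.h>

struct store_buffer_entry {
        uint32_t addr;  /* address of the pending store */
        uint8_t value;  /* data waiting for the cache line to arrive */
        bool marked;    /* set by smp_mb(); see Section 3.3 */
};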
3.2 Store Forwarding

To see the first complication, a violation of self-consistency, consider the following code with variables "a" and "b" both initially zero, and with the cache line containing variable "a" initially owned by CPU 1 and that containing "b" initially owned by CPU 0:
1 a = 1;
2 b = a + 1;
3 assert(b == 2);

One would not expect the assertion to fail. However, if one were foolish enough to use the very simple architecture shown in Figure 5, one would be surprised. Such a system could potentially see the following sequence of events:

1. CPU 0 starts executing the a=1.

2. CPU 0 looks "a" up in the cache, and finds that it is missing.

3. CPU 0 therefore sends a "read invalidate" message in order to get exclusive ownership of the cache line containing "a".

4. CPU 0 records the store to "a" in its store buffer.

5. CPU 1 receives the "read invalidate" message, and responds by transmitting the cache line and removing that cacheline from its cache.

6. CPU 0 starts executing the b=a+1.

7. CPU 0 receives the cache line from CPU 1, which still has a value of zero for "a".

8. CPU 0 loads "a" from its cache, finding the value zero.

9. CPU 0 applies the entry from its store queue to the newly arrived cache line, setting the value of "a" in its cache to one.

10. CPU 0 adds one to the value zero loaded for "a" above, and stores it into the cache line containing "b" (which we will assume is already owned by CPU 0).

11. CPU 0 executes assert(b==2), which fails.

The problem is that we have two copies of "a", one in the cache and the other in the store buffer.

This example breaks a very important guarantee, namely that each CPU will always see its own operations as if they happened in program order.
This guarantee is violently counter-intuitive to software types, so much so that the hardware guys took pity and implemented "store forwarding", where each CPU refers to (or "snoops") its store buffer as well as its cache when performing loads, as shown in Figure 6. In other words, a given CPU's stores are directly forwarded to its subsequent loads, without having to pass through the cache.

[Figure 6: Caches With Store Forwarding. As in Figure 5, but with each CPU's loads also snooping its own store buffer.]
8
CPU 0 and invalidates this cache line from its 3. CPU 0 executes smp_mb(), and marks all current
own cache. But it is too late. store-buffer entries (namely, the a=1).
9. CPU 0 receives the cache line containing “a” and 4. CPU 0 executes b=1. It already owns this cache
applies the buffered store just in time to fall vic- line (in other words, the cache line is already in
tim to CPU 1’s failed assertion. either the “modified” or the “exclusive” state),
but there is a marked entry in the store buffer.
Quick Quiz 6: In step 1 above, why does CPU 0
Therefore, rather than store the new value of “b”
need to issue a “read invalidate” rather than a simple
in the cache line, it instead places it in the store
“invalidate”?
buffer (but in an unmarked entry).
The hardware designers cannot help directly here,
since the CPUs have no idea which variables are re- 5. CPU 0 receives the “read” message, and trans-
lated, let alone how they might be related. There- mits the cache line containing the original value
fore, the hardware designers provide memory-barrier of “b” to CPU 1. It also marks its own copy of
instructions to allow the software to tell the CPU this cache line as “shared”.
about such relations. The program fragment must
be updated to contain the memory barrier: 6. CPU 1 receives the cache line containing “b” and
1 void foo(void) installs it in its cache.
2 {
3 a = 1; 7. CPU 1 can now finish executing while(b==0)
4 smp_mb(); continue, but since it finds that the value of
5 b = 1; “b” is still 0, it repeats the while statement.
6 } The new value of “b” is safely hidden in CPU 0’s
7 store buffer.
8 void bar(void)
9 { 8. CPU 1 receives the “read invalidate” message,
10 while (b == 0) continue; and transmits the cache line containing “a” to
11 assert(a == 1);
CPU 0 and invalidates this cache line from its
12 }
own cache.
The memory barrier smp_mb() will cause the CPU
to flush its store buffer before applying subsequent 9. CPU 0 receives the cache line containing “a” and
stores to their cache lines. The CPU could either applies the buffered store, placing this line into
simply stall until the store buffer was empty before the “modified” state.
proceeding, or it could use the store buffer to hold
subsequent stores until all of the prior entries in the 10. Since the store to “a” was the only entry in the
store buffer had been applied. store buffer that was marked by the smp_mb(),
With this latter approach the sequence of opera- CPU 0 can also store the new value of “b” —
tions might be as follows: except for the fact that the cache line containing
“b” is now in “shared” state.
1. CPU 0 executes a=1. The cache line is not in
CPU 0’s cache, so CPU 0 places the new value 11. CPU 0 therefore sends an “invalidate” message
of “a” in its store buffer and transmits a “read to CPU 1.
invalidate” message.
12. CPU 1 receives the “invalidate” message, in-
2. CPU 1 executes while(b==0)continue, but the validates the cache line containing “b” from its
cache line containing “b” is not in its cache. It cache, and sends an “acknowledgement” message
therefore transmits a “read” message. to CPU 0.
9
13. CPU 1 executes while(b==0)continue, but the cache line containing "b" is not in its cache. It therefore transmits a "read" message to CPU 0.

14. CPU 0 receives the "acknowledgement" message, and puts the cache line containing "b" into the "exclusive" state. CPU 0 now stores the new value of "b" into the cache line.

15. CPU 0 receives the "read" message, and transmits the cache line containing the new value of "b" to CPU 1. It also marks its own copy of this cache line as "shared".

16. CPU 1 receives the cache line containing "b" and installs it in its cache.

17. CPU 1 can now finish executing while(b==0)continue, and since it finds that the value of "b" is 1, it proceeds to the next statement.

18. CPU 1 executes the assert(a==1), but the cache line containing "a" is no longer in its cache. Once it gets this cache line from CPU 0, it will be working with the up-to-date value of "a", and the assertion therefore passes.

As you can see, this process involves no small amount of bookkeeping. Even something intuitively simple, like "load the value of a", can involve lots of complex steps in silicon.

4 Store Sequences Result in Unnecessary Stalls

Unfortunately, each store buffer must be relatively small, which means that a CPU executing a modest sequence of stores can fill its store buffer (for example, if all of them result in cache misses). At that point, the CPU must once again wait for invalidations to complete in order to drain its store buffer before it can continue executing. This same situation can arise immediately after a memory barrier, when all subsequent store instructions must wait for invalidations to complete, regardless of whether or not these stores result in cache misses.

This situation can be improved by making invalidate acknowledge messages arrive more quickly. One way of accomplishing this is to use per-CPU queues of invalidate messages, or "invalidate queues".

4.1 Invalidate Queues

One reason that invalidate acknowledge messages can take so long is that they must ensure that the corresponding cache line is actually invalidated, and this invalidation can be delayed if the cache is busy, for example, if the CPU is intensively loading and storing data, all of which resides in the cache. In addition, if a large number of invalidate messages arrive in a short time period, a given CPU might fall behind in processing them, thus possibly stalling all the other CPUs.

However, the CPU need not actually invalidate the cache line before sending the acknowledgement. It could instead queue the invalidate message with the understanding that the message will be processed before the CPU sends any further messages regarding that cache line.

4.2 Invalidate Queues and Invalidate Acknowledge

Figure 7 shows a system with invalidate queues. A CPU with an invalidate queue may acknowledge an invalidate message as soon as it is placed in the queue, instead of having to wait until the corresponding line is actually invalidated. Of course, the CPU must refer to its invalidate queue when preparing to transmit invalidation messages: if an entry for the corresponding cache line is in the invalidate queue, the CPU cannot immediately transmit the invalidate message; it must instead wait until the invalidate-queue entry has been processed.

Placing an entry into the invalidate queue is essentially a promise by the CPU to process that entry before transmitting any MESI protocol messages regarding that cache line. As long as the corresponding data structures are not highly contended, the CPU will rarely be inconvenienced by such a promise.
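A toy sketch of the idea (my own illustration; send_invalidate_ack() is a hypothetical routine) is shown below: the receiving CPU records the obligation and acknowledges at once, rather than stalling to do the invalidation immediately:

void send_invalidate_ack(uint32_t line_addr);

struct invalidate_queue {
        uint32_t pending[64];   /* cache-line addresses awaiting invalidation */
        int n;
};

static void recv_invalidate(struct invalidate_queue *q, uint32_t line_addr)
{
        q->pending[q->n++] = line_addr; /* a promise to process this later */
        send_invalidate_ack(line_addr); /* respond without doing the work */
}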
[Figure 7: Caches With Invalidate Queues. As in Figure 5, but with an invalidate queue interposed between each CPU's cache and the interconnect.]

However, the fact that invalidate messages can be buffered in the invalidate queue provides additional opportunity for memory-misordering, as discussed in the next section.

4.3 Invalidate Queues and Memory Barriers

Let us suppose that CPUs queue invalidation requests, but respond to them immediately. This approach minimizes the cache-invalidation latency seen by CPUs doing stores, but can defeat memory barriers, as seen in the following example.

Suppose the values of "a" and "b" are initially zero, that "a" is replicated read-only (MESI "shared" state), and that "b" is owned by CPU 0 (MESI "exclusive" or "modified" state). Then suppose that CPU 0 executes foo() while CPU 1 executes function bar() in the following code fragment:

1 void foo(void)
2 {
3 a = 1;
4 smp_mb();
5 b = 1;
6 }
7
8 void bar(void)
9 {
10 while (b == 0) continue;
11 assert(a == 1);
12 }

Then the sequence of operations might be as follows:

1. CPU 0 executes a=1. The corresponding cache line is read-only in CPU 0's cache, so CPU 0 places the new value of "a" in its store buffer and transmits an "invalidate" message in order to flush the corresponding cache line from CPU 1's cache.

2. CPU 1 executes while(b==0)continue, but the cache line containing "b" is not in its cache. It therefore transmits a "read" message.

3. CPU 1 receives CPU 0's "invalidate" message, queues it, and immediately responds to it.

4. CPU 0 receives the response from CPU 1, and is therefore free to proceed past the smp_mb() on line 4 above, moving the value of "a" from its store buffer to its cache line.

5. CPU 0 executes b=1. It already owns this cache line (in other words, the cache line is already in either the "modified" or the "exclusive" state), so it stores the new value of "b" in its cache line.

6. CPU 0 receives the "read" message, and transmits the cache line containing the now-updated value of "b" to CPU 1, also marking the line as "shared" in its own cache.

7. CPU 1 receives the cache line containing "b" and installs it in its cache.

8. CPU 1 can now finish executing while(b==0)continue, and since it finds that the value of "b" is 1, it proceeds to the next statement.
9. CPU 1 executes the assert(a==1), and, since the old value of "a" is still in CPU 1's cache, this assertion fails.

10. Despite the assertion failure, CPU 1 processes the queued "invalidate" message, and (tardily) invalidates the cache line containing "a" from its own cache.

Quick Quiz 7: In step 1 of the first scenario in Section 4.3, why is an "invalidate" sent instead of a "read invalidate" message? Doesn't CPU 0 need the values of the other variables that share this cache line with "a"?

There is clearly not much point in accelerating invalidation responses if doing so causes memory barriers to effectively be ignored. However, the memory-barrier instructions can interact with the invalidate queue, so that when a given CPU executes a memory barrier, it marks all the entries currently in its invalidate queue, and forces any subsequent load to wait until all marked entries have been applied to the CPU's cache. Therefore, we can add a memory barrier to function bar as follows:

1 void foo(void)
2 {
3 a = 1;
4 smp_mb();
5 b = 1;
6 }
7
8 void bar(void)
9 {
10 while (b == 0) continue;
11 smp_mb();
12 assert(a == 1);
13 }

Quick Quiz 8: Say what??? Why do we need a memory barrier here, given that the CPU cannot possibly execute the assert() until after the while loop completes???

With this change, the sequence of operations might be as follows:

1. CPU 0 executes a=1. The corresponding cache line is read-only in CPU 0's cache, so CPU 0 places the new value of "a" in its store buffer and transmits an "invalidate" message in order to flush the corresponding cache line from CPU 1's cache.

2. CPU 1 executes while(b==0)continue, but the cache line containing "b" is not in its cache. It therefore transmits a "read" message.

3. CPU 1 receives CPU 0's "invalidate" message, queues it, and immediately responds to it.

4. CPU 0 receives the response from CPU 1, and is therefore free to proceed past the smp_mb() on line 4 above, moving the value of "a" from its store buffer to its cache line.

5. CPU 0 executes b=1. It already owns this cache line (in other words, the cache line is already in either the "modified" or the "exclusive" state), so it stores the new value of "b" in its cache line.

6. CPU 0 receives the "read" message, and transmits the cache line containing the now-updated value of "b" to CPU 1, also marking the line as "shared" in its own cache.

7. CPU 1 receives the cache line containing "b" and installs it in its cache.

8. CPU 1 can now finish executing while(b==0)continue, and since it finds that the value of "b" is 1, it proceeds to the next statement, which is now a memory barrier.

9. CPU 1 must now stall until it processes all pre-existing messages in its invalidation queue.

10. CPU 1 now processes the queued "invalidate" message, and invalidates the cache line containing "a" from its own cache.

11. CPU 1 executes the assert(a==1), and, since the cache line containing "a" is no longer in CPU 1's cache, it transmits a "read" message.

12. CPU 0 responds to this "read" message with the cache line containing the new value of "a".

13. CPU 1 receives this cache line, which contains a value of 1 for "a", so that the assertion does not trigger.
With much passing of MESI messages, the CPUs arrive at the correct answer. This section illustrates why CPU designers must be extremely careful with their cache-coherence optimizations.

5 Read and Write Memory Barriers

In the previous section, memory barriers were used to mark entries in both the store buffer and the invalidate queue. But in our code fragment, foo() had no reason to do anything with the invalidate queue, and bar() similarly had no reason to do anything with the store queue.

Many CPU architectures therefore provide weaker memory-barrier instructions that do only one or the other of these two. Roughly speaking, a "read memory barrier" marks only the invalidate queue and a "write memory barrier" marks only the store buffer, while a full-fledged memory barrier does both. The effect is that a read memory barrier orders only loads on the CPU that executes it, and a write memory barrier orders only stores. Updated to use these weaker primitives, our code fragment appears as follows:

1 void foo(void)
2 {
3 a = 1;
4 smp_wmb();
5 b = 1;
6 }
7
8 void bar(void)
9 {
10 while (b == 0) continue;
11 smp_rmb();
12 assert(a == 1);
13 }

Some computers have even more flavors of memory barriers, but understanding these three variants will provide a good introduction to memory barriers in general.

6 Example Memory-Barrier Sequences

This section presents some seductive but subtly broken uses of memory barriers. Although many of them will work most of the time, and some will work all the time on some specific CPUs, these uses must be avoided if the goal is to produce code that works reliably on all CPUs. To help us better see the subtle breakage, we first need to focus on an ordering-hostile architecture.
[Figure: ordering-hostile architecture. Node 0 holds CPUs 0 and 1 and Node 1 holds CPUs 2 and 3, each CPU with its own message queue, all joined by an interconnect to a common memory.]

This hardware must obey the following ordering constraints [McK05a, McK05b]:

1. Each CPU will always perceive its own memory accesses as occurring in program order.

2. CPUs will reorder a given operation with a store only if the two operations are referencing different locations.

3. All of a given CPU's loads preceding a read memory barrier (smp_rmb()) will be perceived by all CPUs to precede any loads following that read memory barrier.

4. All of a given CPU's stores preceding a write memory barrier (smp_wmb()) will be perceived by all CPUs to precede any stores following that write memory barrier.

5. All of a given CPU's accesses (loads and stores) preceding a full memory barrier (smp_mb()) will be perceived by all CPUs to precede any accesses following that full memory barrier.
CPU 0      CPU 1                  CPU 2
a=1;       while(a==0);
           smp_mb();              y=b;
           b=1;                   smp_rmb();
                                  x=a;
                                  assert(y==0||x==1);

     CPU 0           CPU 1          CPU 2
1    a=1;
2    smp_wmb();
3    b=1;            while(b==0);   while(b==0);
4                    smp_mb();      smp_mb();
5                    c=1;           d=1;
6    while(c==0);
7    while(d==0);
8    smp_mb();
9    e=1;                           assert(e==0||a==1);
and stores to be reordered with atomic instructions. The seventh column, dependent reads reordered, requires some explanation, which is undertaken in the following section covering Alpha CPUs. The short version is that Alpha requires memory barriers for readers as well as updaters of linked data structures. Yes, this does mean that Alpha can in effect fetch the data pointed to before it fetches the pointer itself,
memory barriers in spinlocks already enforce MMIO ordering. The platforms with a non-no-op mmiowb() definition include some (but not all) IA64, FRV, MIPS, and SH systems. This primitive is relatively new, so relatively few drivers take advantage of it.

The smp_mb(), smp_rmb(), and smp_wmb() primitives also force the compiler to eschew any optimizations that would have the effect of reordering memory optimizations across the barriers. The smp_read_barrier_depends() primitive has a similar effect, but only on Alpha CPUs.

These primitives generate code only in SMP kernels; however, each also has a UP version (mb(), rmb(), wmb(), and read_barrier_depends(), respectively) that generates a memory barrier even in UP kernels. The smp_ versions should be used in most cases. However, these latter primitives are useful when writing drivers, because MMIO accesses must remain ordered even in UP kernels. In the absence of memory-barrier instructions, both CPUs and compilers would happily rearrange these accesses, which at best would make the device act strangely, and could crash your kernel or, in some cases, even damage your hardware.

So most kernel programmers need not worry about the memory-barrier peculiarities of each and every CPU, as long as they stick to these interfaces. If you are working deep in a given CPU's architecture-specific code, of course, all bets are off.

Furthermore, all of Linux's locking primitives (spinlocks, reader-writer locks, semaphores, RCU, ...) include any needed barrier primitives. So if you are working with code that uses these primitives, you don't even need to worry about Linux's memory-ordering primitives.

That said, deep knowledge of each CPU's memory-consistency model can be very helpful when debugging, to say nothing of when writing architecture-specific code or synchronization primitives.

Besides, they say that a little knowledge is a very dangerous thing. Just imagine the damage you could do with a lot of knowledge! For those who wish to understand more about individual CPUs' memory-consistency models, the next sections describe those of the most popular and prominent CPUs. Although nothing can replace actually reading a given CPU's documentation, these sections give a good overview.

1 struct el *insert(long key, long data)
2 {
3 struct el *p;
4 p = kmalloc(sizeof(*p), GFP_ATOMIC);
5 spin_lock(&mutex);
6 p->next = head.next;
7 p->key = key;
8 p->data = data;
9 smp_wmb();
10 head.next = p;
11 spin_unlock(&mutex);
12 }
13
14 struct el *search(long key)
15 {
16 struct el *p;
17 p = head.next;
18 while (p != &head) {
19 /* BUG ON ALPHA!!! */
20 if (p->key == key) {
21 return (p);
22 }
23 p = p->next;
24 };
25 return (NULL);
26 }

Figure 9: Insert and Lock-Free Search

7.1 Alpha

It may seem strange to say much of anything about a CPU whose end of life has been announced, but Alpha is interesting because, with the weakest memory-ordering model, it reorders memory operations the most aggressively. It therefore has defined the Linux-kernel memory-ordering primitives, which must work on all CPUs, including Alpha. Understanding Alpha is therefore surprisingly important to the Linux kernel hacker.

The difference between Alpha and the other CPUs is illustrated by the code shown in Figure 9. The smp_wmb() on line 9 of this figure guarantees that the element initialization in lines 6-8 is executed before the element is added to the list on line 10, so that the lock-free search will work correctly. That is, it makes this guarantee on all CPUs except Alpha.

Alpha has extremely weak memory ordering such that the code on line 20 of Figure 9 could see the old
garbage values that were present before the initialization on lines 6-8.

Figure 10 shows how this can happen on an aggressively parallel machine with partitioned caches, so that alternating cache lines are processed by the different partitions of the caches. Assume that the list header head will be processed by cache bank 0, and that the new element will be processed by cache bank 1. On Alpha, the smp_wmb() will guarantee that the cache invalidates performed by lines 6-8 of Figure 9 will reach the interconnect before that of line 10 does, but makes absolutely no guarantee about the order in which the new values will reach the reading CPU's core. For example, it is possible that the reading CPU's cache bank 1 is very busy, but cache bank 0 is idle. This could result in the cache invalidates for the new element being delayed, so that the reading CPU gets the new value for the pointer, but sees the old cached values for the new element. See the Web site called out earlier for more information, or, again, if you think that I am just making all this up.[6]

[6] Of course, the astute reader will have already recognized that Alpha is nowhere near as mean and nasty as it could be, the (thankfully) mythical architecture in Section 6.1 being a case in point.

[Figure 10: Why smp_read_barrier_depends() is Required. The writing and reading CPU cores each sit atop a pair of cache banks (bank 0 and bank 1), with (r)mb and (w)mb sequencing logic between the cores, the banks, and the interconnect.]

One could place an smp_rmb() primitive between the pointer fetch and dereference. However, this imposes unneeded overhead on systems (such as i386, IA64, PPC, and SPARC) that respect data dependencies on the read side. A smp_read_barrier_depends() primitive has been added to the Linux 2.6 kernel to eliminate overhead on these systems. This primitive may be used as shown on line 19 of Figure 11.

1 struct el *insert(long key, long data)
2 {
3 struct el *p;
4 p = kmalloc(sizeof(*p), GFP_ATOMIC);
5 spin_lock(&mutex);
6 p->next = head.next;
7 p->key = key;
8 p->data = data;
9 smp_wmb();
10 head.next = p;
11 spin_unlock(&mutex);
12 }
13
14 struct el *search(long key)
15 {
16 struct el *p;
17 p = head.next;
18 while (p != &head) {
19 smp_read_barrier_depends();
20 if (p->key == key) {
21 return (p);
22 }
23 p = p->next;
24 };
25 return (NULL);
26 }

Figure 11: Safe Insert and Lock-Free Search

It is also possible to implement a software barrier that could be used in place of smp_wmb(), which would force all reading CPUs to see the writing CPU's writes in order. However, this approach was deemed by the Linux community to impose excessive overhead on extremely weakly ordered CPUs such as Alpha. This software barrier could be implemented by sending inter-processor interrupts (IPIs) to all other CPUs. Upon receipt of such an IPI, a CPU would execute a memory-barrier instruction, implementing a memory-barrier shootdown. Additional logic is required to avoid deadlocks. Of course, CPUs that respect data dependencies would define such a barrier to simply be smp_wmb(). Perhaps this decision should be revisited in the future as Alpha fades off into the sunset.
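A hedged sketch of such a memory-barrier shootdown appears below. This is my own illustration of the scheme just described, not the Linux implementation; the IPI and acknowledgement routines are hypothetical:

void send_ipi_to_all_other_cpus(void);  /* hypothetical IPI broadcast */
void wait_for_all_acks(void);           /* hypothetical completion wait */
void send_ack(void);                    /* hypothetical acknowledgement */

void software_wmb(void)
{
        smp_mb();                       /* order this CPU's prior stores */
        send_ipi_to_all_other_cpus();
        wait_for_all_acks();
}

void ipi_handler(void)                  /* runs on each receiving CPU */
{
        smp_mb();                       /* the shootdown: a barrier everywhere */
        send_ack();
}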
The Linux memory-barrier primitives took their names from the Alpha instructions, so smp_mb() is mb, smp_rmb() is rmb, and smp_wmb() is wmb. Alpha is the only CPU where smp_read_barrier_depends() is an smp_mb() rather than a no-op.

For more detail on Alpha, see the reference manual [SW95].

7.2 AMD64

AMD64 is compatible with x86, and has recently updated its memory model [Adv07] to enforce the tighter ordering that actual implementations have provided for some time. The AMD64 implementation of the Linux smp_mb() primitive is mfence, smp_rmb() is lfence, and smp_wmb() is sfence. In theory, these might be relaxed, but any such relaxation must take SSE and 3DNOW instructions into account.

7.3 ARMv7-A/R

The ARM family of CPUs is extremely popular in embedded applications, particularly for power-constrained applications such as cellphones. There have nevertheless been multiprocessor implementations of ARM for more than five years. Its memory model is similar to that of Power (see Section 7.6), but ARM uses a different set of memory-barrier instructions [ARM10]:

1. DMB (data memory barrier) causes the specified type of operations to appear to have completed before any subsequent operations of the same type. The "type" of operations can be all operations or can be restricted to only writes (similar to the Alpha wmb and the POWER eieio instructions). In addition, ARM allows cache coherence to have one of three scopes: single processor, a subset of the processors ("inner"), and global ("outer").

2. DSB (data synchronization barrier) causes the specified type of operations to actually complete before any subsequent operations (of any type) are executed. The "type" of operations is the same as that of DMB. The DSB instruction was called DWB (drain write buffer or data write barrier, your choice) in early versions of the ARM architecture.

3. ISB (instruction synchronization barrier) flushes the CPU pipeline, so that all instructions following the ISB are fetched only after the ISB completes. For example, if you are writing a self-modifying program (such as a JIT), you should execute an ISB between generating the code and executing it.

None of these instructions exactly match the semantics of Linux's rmb() primitive, which must therefore be implemented as a full DMB. The DMB and DSB instructions have a recursive definition of accesses ordered before and after the barrier, which has an effect similar to that of POWER's cumulativity.

One difference between the ARMv7 and the POWER memory models is that while POWER respects both data and control dependencies, ARMv7 respects only data dependencies. The difference between these two CPU families can be seen in the following code fragment, which was discussed earlier in Section 4.3:

1 void foo(void)
2 {
3 a = 1;
4 smp_wmb();
5 b = 1;
6 }
7
8 void bar(void)
9 {
10 while (b == 0) continue;
11 assert(a == 1);
12 }

There is a control dependency between lines 10 and 11 in the above example. This control dependency would cause POWER to insert an implicit memory barrier between these two lines, but ARM would instead permit line 11 to be speculatively executed before line 10 completed.[7]

[7] Of course, as written above, the compiler would be within its rights to reorder lines 10 and 11. So please be careful out there!
On the other hand, these two CPUs would both respect the data dependency in the following code:

1 int oof(void)
2 {
3 struct el *p;
4
5 p = global_pointer;
6 return p->a;
7 }
with respect to subsequent loads. Interestingly enough, the lwsync instruction enforces the same ordering as does zSeries, and coincidentally, SPARC TSO.

3. eieio (enforce in-order execution of I/O, in case you were wondering) causes all preceding cacheable stores to appear to have completed before all subsequent stores. However, stores to cacheable memory are ordered separately from stores to non-cacheable memory, which means that eieio will not force an MMIO store to precede a spinlock release.

4. isync forces all preceding instructions to appear to have completed before any subsequent instructions start execution. This means that the preceding instructions must have progressed far enough that any traps they might generate have either happened or are guaranteed not to happen, and that any side-effects of these instructions (for example, page-table changes) are seen by the subsequent instructions.

Unfortunately, none of these instructions line up exactly with Linux's wmb() primitive, which requires all stores to be ordered, but does not require the other high-overhead actions of the sync instruction. But there is no choice: ppc64 versions of wmb() and mb() are defined to be the heavyweight sync instruction. However, Linux's smp_wmb() instruction is never used for MMIO (since a driver must carefully order MMIOs in UP as well as SMP kernels, after all), so it is defined to be the lighter-weight eieio instruction. This instruction may well be unique in having a five-vowel mnemonic. The smp_mb() instruction is also defined to be the sync instruction, but both smp_rmb() and rmb() are defined to be the lighter-weight lwsync instruction.

Power features "cumulativity", which can be used to obtain transitivity. When used properly, any code seeing the results of an earlier code fragment will also see the accesses that this earlier code fragment itself saw. Much more detail is available from McKenney and Silvera [MS09].

Many members of the POWER architecture have incoherent instruction caches, so that a store to memory will not necessarily be reflected in the instruction cache. Thankfully, few people write self-modifying code these days, but JITs and compilers do it all the time. Furthermore, recompiling a recently run program looks just like self-modifying code from the CPU's viewpoint. The icbi instruction (instruction cache block invalidate) invalidates a specified cache line from the instruction cache, and may be used in these situations.

7.7 SPARC RMO, PSO, and TSO

Solaris on SPARC uses TSO (total-store order), as does Linux when built for the "sparc" 32-bit architecture. However, a 64-bit Linux kernel (the "sparc64" architecture) runs SPARC in RMO (relaxed-memory order) mode [SPA94]. The SPARC architecture also offers an intermediate PSO (partial store order). Any program that runs in RMO will also run in either PSO or TSO, and similarly, a program that runs in PSO will also run in TSO. Moving a shared-memory parallel program in the other direction may require careful insertion of memory barriers, although, as noted earlier, programs that make standard use of synchronization primitives need not worry about memory barriers.

SPARC has a very flexible memory-barrier instruction [SPA94] that permits fine-grained control of ordering:

StoreStore: order preceding stores before subsequent stores. (This option is used by the Linux smp_wmb() primitive.)

LoadStore: order preceding loads before subsequent stores.

StoreLoad: order preceding stores before subsequent loads.

LoadLoad: order preceding loads before subsequent loads. (This option is used by the Linux smp_rmb() primitive.)

Sync: fully complete all preceding operations before starting any subsequent operations.
MemIssue: complete preceding memory operations before subsequent memory operations, important for some instances of memory-mapped I/O.

Lookaside: same as MemIssue, but only applies to preceding stores and subsequent loads, and even then only for stores and loads that access the same memory location.

The Linux smp_mb() primitive uses the first four options together, as in membar #LoadLoad | #LoadStore | #StoreStore | #StoreLoad, thus fully ordering memory operations.
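For illustration (my own sketch, not the kernel's actual definition), such a fully ordering membar could be issued from C as follows:

static inline void my_smp_mb(void)
{
        __asm__ __volatile__(
                "membar #LoadLoad | #LoadStore | #StoreStore | #StoreLoad"
                : : : "memory");
}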
So, why is membar #MemIssue needed? Because a membar #StoreLoad could permit a subsequent load to get its value from a write buffer, which would be disastrous if the write was to an MMIO register that induced side effects on the value to be read. In contrast, membar #MemIssue would wait until the write buffers were flushed before permitting the loads to execute, thereby ensuring that the load actually gets its value from the MMIO register. Drivers could instead use membar #Sync, but the lighter-weight membar #MemIssue is preferred in cases where the additional functionality of the more-expensive membar #Sync is not required.

The membar #Lookaside is a lighter-weight version of membar #MemIssue, which is useful when writing to a given MMIO register affects the value that will next be read from that register. However, the heavier-weight membar #MemIssue must be used when a write to a given MMIO register affects the value that will next be read from some other MMIO register.

It is not clear why SPARC does not define wmb() to be membar #MemIssue and smp_wmb() to be membar #StoreStore, as the current definitions seem vulnerable to bugs in some drivers. It is quite possible that all the SPARC CPUs that Linux runs on implement a more conservative memory-ordering model than the architecture would permit.

SPARC requires a flush instruction be used between the time that an instruction is stored and executed [SPA94]. This is needed to flush any prior value for that location from the SPARC's instruction cache. Note that flush takes an address, and will flush only that address from the instruction cache. On SMP systems, all CPUs' caches are flushed, but there is no convenient way to determine when the off-CPU flushes complete, though there is a reference to an implementation note.

7.8 x86

Since the x86 CPUs provide "process ordering" so that all CPUs agree on the order of a given CPU's writes to memory, the smp_wmb() primitive is a no-op for the CPU [Int04b]. However, a compiler directive is required to prevent the compiler from performing optimizations that would result in reordering across the smp_wmb() primitive.

On the other hand, x86 CPUs have traditionally given no ordering guarantees for loads, so the smp_mb() and smp_rmb() primitives expand to lock;addl. This atomic instruction acts as a barrier to both loads and stores.

More recently, Intel has published a memory model for x86 [Int07]. It turns out that Intel's actual CPUs enforced tighter ordering than was claimed in the previous specifications, so this model is in effect simply mandating the earlier de-facto behavior. Even more recently, Intel published an updated memory model for x86 [Int09], which mandates a total global order for stores, although individual CPUs are still permitted to see their own stores as having happened earlier than this total global order would indicate. This exception to the total ordering is needed to allow important hardware optimizations involving store buffers. Software may use atomic operations to override these hardware optimizations, which is one reason that atomic operations tend to be more expensive than their non-atomic counterparts. This total store order is not guaranteed on older processors.

However, note that some SSE instructions are weakly ordered (clflush and non-temporal move instructions [Int04a]). CPUs that have SSE can use mfence for smp_mb(), lfence for smp_rmb(), and sfence for smp_wmb().

A few versions of the x86 CPU have a mode bit that enables out-of-order stores, and for these CPUs, smp_wmb() must also be defined to be lock;addl.
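As an illustration (my own sketch of the idiom described above, not the kernel's actual definition), a lock;addl-based barrier can be written as inline assembly on 32-bit x86:

static inline void my_smp_mb(void)
{
        /* An atomic no-op add to the top of the stack orders
         * both loads and stores on x86. */
        __asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory");
}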
any special instructions, newer revisions of the x86 each thread would wait until memory was ready, with
architecture no longer requires x86 CPUs to be so ac- tens, hundreds, or even thousands of other threads
commodating. Interestingly enough, this relaxation making progress in the meantime. In such an archi-
comes just in time to inconvenience JIT implemen- tecture, there would be no need for memory barriers,
tors. because a given thread would simply wait for all out-
7.9 zSeries

The zSeries machines make up the IBM mainframe family, previously known
as the 360, 370, and 390 [Int04c]. Parallelism came late to zSeries,
but given that these mainframes first shipped in the mid-1960s, this is
not saying much. The bcr 15,0 instruction is used for the Linux
smp_mb(), smp_rmb(), and smp_wmb() primitives. The zSeries architecture
also has comparatively strong memory-ordering semantics, as shown in
Table 5, which should allow the smp_wmb() primitive to be a nop (and by
the time you read this, this change may well have happened). The table
actually understates the situation, as the zSeries memory model is
otherwise sequentially consistent, meaning that all CPUs will agree on
the order of unrelated stores from different CPUs.

As with most CPUs, the zSeries architecture does not guarantee a
cache-coherent instruction stream; hence, self-modifying code must
execute a serializing instruction between updating the instructions and
executing them. That said, many actual zSeries machines do in fact
accommodate self-modifying code without serializing instructions. The
zSeries instruction set provides a large set of serializing
instructions, including compare-and-swap, some types of branches (for
example, the aforementioned bcr 15,0 instruction), and test-and-set,
among others.
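A minimal sketch (one plausible rendering, not a quote from the kernel
sources) of the zSeries barrier definitions:

    /* Sketch: bcr 15,0 (branch on condition to register zero)
     * acts as a serializing instruction on zSeries. */
    #define smp_mb()   __asm__ __volatile__("bcr 15,0" : : : "memory")
    #define smp_rmb()  smp_mb()
    #define smp_wmb()  smp_mb()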
8 Are Memory Barriers Forever?

There have been a number of recent systems that are significantly less
aggressive about out-of-order execution in general and reordering
memory references in particular. Will this trend continue to the point
where memory barriers are a thing of the past?

The argument in favor would cite proposed massively multi-threaded
hardware architectures, in which each thread would wait until memory
was ready, with tens, hundreds, or even thousands of other threads
making progress in the meantime. In such an architecture, there would
be no need for memory barriers, because a given thread would simply
wait for all outstanding operations to complete before proceeding to
the next instruction. Because there would be potentially thousands of
other threads, the CPU would be completely utilized, so no CPU time
would be wasted.

The argument against would cite the extremely limited number of
applications capable of scaling up to a thousand threads, as well as
increasingly severe realtime requirements, which are in the tens of
microseconds for some applications. The realtime-response requirements
are difficult enough to meet as is, and would be even more difficult to
meet given the extremely low single-threaded throughput implied by the
massively multi-threaded scenarios.

Another argument in favor would cite increasingly sophisticated
latency-hiding hardware implementation techniques that might well allow
the CPU to provide the illusion of fully sequentially consistent
execution while still providing almost all of the performance
advantages of out-of-order execution. A counter-argument would cite the
increasingly severe power-efficiency requirements presented both by
battery-operated devices and by environmental responsibility.

Who is right? We have no clue, so we are preparing to live with either
scenario.
9 Advice to Hardware Designers

There are any number of things that hardware designers can do to make
the lives of software people difficult. Here is a list of a few such
things that we have encountered in the past, presented here in the hope
that it might help prevent future such problems:

1. I/O devices that ignore cache coherence. This charming misfeature
   can result in DMAs from memory missing recent changes to the output
   buffer, or, just as bad, cause input buffers to be overwritten by
   the contents of CPU caches just after the DMA completes. To make
   your system work in the face of such misbehavior, you must carefully
   flush the CPU caches of any location in any DMA buffer before
   presenting that buffer to the I/O device (see the sketch after this
   list). And even then, you need to be very careful to avoid pointer
   bugs, as even a misplaced read to an input buffer can result in
   corrupting the data input!

2. Overly kind simulators and emulators. Software that runs correctly
   on simulators and emulators that do not force memory reordering can
   get a nasty surprise when it first runs on the real hardware.
   Unfortunately, it is still the rule that the hardware is more
   devious than are the simulators and emulators, but we hope that this
   situation changes.

Again, we encourage hardware designers to avoid these practices!
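As an illustration of the first item, here is a hedged sketch using the
Linux kernel's DMA-mapping API; the device and function names are
hypothetical, but dma_map_single() is the kernel's real interface, and
on systems whose I/O devices ignore cache coherence it is the point at
which the CPU caches covering the buffer get flushed:

    #include <linux/dma-mapping.h>

    /* Sketch: hand "buf" to a device for output DMA.  On
     * non-cache-coherent systems, dma_map_single() flushes the
     * CPU caches covering the buffer before the device reads it. */
    static dma_addr_t start_output_dma(struct device *dev,
                                       void *buf, size_t len)
    {
            dma_addr_t handle;

            handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
            if (dma_mapping_error(dev, handle))
                    return 0;  /* This sketch treats 0 as "failed". */
            /* The device may now safely DMA from "handle". */
            return handle;
    }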
Answers to Quick Quizzes

Quick Quiz 1:
What happens if two CPUs attempt to invalidate the same cache line
concurrently?

Answer:
One of the CPUs gains access to the shared bus first, and that CPU
"wins". The other CPU must invalidate its copy of the cache line and
transmit an "invalidate acknowledge" message to the other CPU. Of
course, the losing CPU can be expected to immediately issue a "read
invalidate" transaction, so the winning CPU's victory will be quite
ephemeral.

Quick Quiz 2:
When an "invalidate" message appears in a large multiprocessor, every
CPU must give an "invalidate acknowledge" response. Wouldn't the
resulting "storm" of "invalidate acknowledge" responses totally
saturate the system bus?

Answer:
It might, if large-scale multiprocessors were in fact implemented that
way. Larger multiprocessors, particularly NUMA machines, tend to use
so-called "directory-based" cache-coherence protocols to avoid this and
other problems.
Quick Quiz 3:
If SMP machines are really using message passing anyway, why bother
with SMP at all?

Answer:
There has been quite a bit of controversy on this topic over the past
few decades. One answer is that the cache-coherence protocols are quite
simple, and therefore can be implemented directly in hardware, gaining
bandwidths and latencies unattainable by software message passing.
Another answer is that the real truth is to be found in economics,
namely the relative prices of large SMP machines and of clusters of
smaller SMP machines. A third answer is that the SMP programming model
is easier to use than that of distributed systems, but a rebuttal might
note the appearance of HPC clusters and MPI. And so the argument
continues.

Quick Quiz 4:
How does the hardware handle the delayed transitions described above?

Answer:
Usually by adding additional states, though these additional states
need not be actually stored with the cache line, due to the fact that
only a few lines at a time will be transitioning. The need to delay
transitions is but one issue that results in real-world cache-coherence
protocols being much more complex than the over-simplified MESI
protocol described in this appendix. Hennessy and Patterson's classic
introduction to computer architecture [HP95] covers many of these
issues.

Quick Quiz 5:
What sequence of operations would put the CPUs' caches all back into
the "invalid" state?

Answer:
There is no such sequence, at least in the absence of special "flush my
cache" instructions in the CPU's instruction set. Most CPUs do have
such instructions.

Quick Quiz 6:
In step 1 above, why does CPU 0 need to issue a "read invalidate"
rather than a simple "invalidate"?

Answer:
Because the cache line in question contains more than just the
variable a.

Quick Quiz 7:
In step 1 of the first scenario in Section 4.3, why is an "invalidate"
sent instead of a "read invalidate" message? Doesn't CPU 0 need the
values of the other variables that share this cache line with "a"?

Answer:
CPU 0 already has the values of these variables, given that it has a
read-only copy of the cache line containing "a". Therefore, all CPU 0
need do is to cause the other CPUs to discard their copies of this
cache line. An "invalidate" message therefore suffices.

Quick Quiz 8:
Say what??? Why do we need a memory barrier here, given that the CPU
cannot possibly execute the assert() until after the while loop
completes???

Answer:
CPUs are free to speculatively execute, which can have the effect of
executing the assertion before the while loop completes. That said,
some weakly ordered CPUs respect "control dependencies." Such CPUs
would execute an implicit memory barrier after each conditional branch,
such as the branch terminating the while loop. This example
nevertheless uses an explicit memory barrier, as would be required on
DEC Alpha.
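For reference, here is a minimal sketch of the kind of code this quiz
discusses, reconstructed under the assumption of two flag variables a
and b; the document's original listing may differ in detail:

    int a = 0, b = 0;

    void cpu0(void)  /* Runs on CPU 0. */
    {
            a = 1;
            smp_wmb();  /* Order the store to a before the store to b. */
            b = 1;
    }

    void cpu1(void)  /* Runs on CPU 1. */
    {
            while (b == 0)
                    continue;  /* Wait for CPU 0's store to b. */
            smp_rmb();         /* Order the loads of b before the load of a. */
            assert(a == 1);
    }

Without the smp_rmb(), a speculating CPU could execute the load of a
before the loop's final load of b, allowing the assertion to fire.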
Quick Quiz 9:
Does the guarantee that each CPU sees its own memory accesses in order
also guarantee that each user-level thread will see its own memory
accesses in order? Why or why not?

Answer:
No. Consider the case where a thread migrates from one CPU to another,
and where the destination CPU perceives the source CPU's recent memory
operations out of order. To preserve user-mode sanity, kernel hackers
must use memory barriers in the context-switch path. However, the
locking already required to safely do a context switch should
automatically provide the memory barriers needed to cause the
user-level task to see its own accesses in order. That said, if you are
designing a super-optimized scheduler, either in the kernel or at user
level, please keep this scenario in mind!

Quick Quiz 10:
Could this code be fixed by inserting a memory barrier between CPU 1's
"while" and assignment to "c"? Why or why not?

Answer:
No. Such a memory barrier would only force ordering local to CPU 1. It
would have no effect on the relative ordering of CPU 0's and CPU 1's
accesses, so the assertion could still fail. However, all mainstream
computer systems provide one mechanism or another to provide
"transitivity", which provides intuitive causal ordering: if B saw the
effects of A's accesses, and C saw the effects of B's accesses, then C
must also see the effects of A's accesses.
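A hedged sketch of the three-CPU causality pattern this answer
describes (not the document's original listing); with full memory
barriers between each CPU's accesses, transitivity guarantees that the
assertion cannot fire:

    int a = 0, b = 0, c = 0;

    void cpu0(void) { a = 1; smp_mb(); b = 1; }

    void cpu1(void)
    {
            while (b == 0)
                    continue;  /* CPU 1 (B) saw CPU 0's (A's) store. */
            smp_mb();
            c = 1;
    }

    void cpu2(void)
    {
            while (c == 0)
                    continue;  /* CPU 2 (C) saw CPU 1's (B's) store. */
            smp_mb();
            assert(a == 1);    /* Transitivity: C must also see A's store. */
    }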
Quick Quiz 11:
Suppose that lines 3-5 for CPUs 1 and 2 are in an interrupt handler,
and that CPU 2's line 9 is run at process level. What changes, if any,
are required to enable the code to work correctly, in other words, to
prevent the assertion from firing?

Answer:
The assertion will need to be coded so as to ensure that the load of
"e" precedes that of "a". In the Linux kernel, the barrier() primitive
may be used to accomplish this in much the same way that the memory
barrier was used in the assertions in the previous examples.
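A minimal sketch of such an assertion; the variables e and a belong to
the example this quiz refers to, and the exact predicate is an
assumption here. Because both loads run on the same CPU, a compiler
barrier suffices:

    int r1;

    r1 = e;     /* Load e first... */
    barrier();  /* ...and keep the compiler from moving the load of a up. */
    assert(r1 == 0 || a == 1);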
References

[Adv02] Advanced Micro Devices. AMD x86-64 Architecture Programmer's
Manual Volumes 1-5, 2002.

[Adv07] Advanced Micro Devices. AMD x86-64 Architecture Programmer's
Manual Volume 2: System Programming, 2007.

[ARM10] ARM Limited. ARM Architecture Reference Manual: ARMv7-A and
ARMv7-R Edition, 2010.
[CSG99] David E. Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel
Computer Architecture: a Hardware/Software Approach. Morgan Kaufman,
1999.

[Gha95] Kourosh Gharachorloo. Memory consistency models for
shared-memory multiprocessors. Technical Report CSL-TR-95-685, Computer
Systems Laboratory, Departments of Electrical Engineering and Computer
Science, Stanford University, Stanford, CA, December 1995. Available:
http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-9.pdf [Viewed:
October 11, 2004].

[HP95] John L. Hennessy and David A. Patterson. Computer Architecture:
A Quantitative Approach. Morgan Kaufman, 1995.

[IBM94] IBM Microelectronics and Motorola. PowerPC Microprocessor
Family: The Programming Environments, 1994.

[Int02a] Intel Corporation. Intel Itanium Architecture Software
Developer's Manual Volume 3: Instruction Set Reference, 2002.

[Int02b] Intel Corporation. Intel Itanium Architecture Software
Developer's Manual Volume 3: System Architecture, 2002.

[Int04a] Intel Corporation. IA-32 Intel Architecture Software
Developer's Manual Volume 2B: Instruction Set Reference, N-Z, 2004.
Available: ftp://download.intel.com/design/Pentium4/manuals/25366714.pdf
[Viewed: February 16, 2005].

[Int04b] Intel Corporation. IA-32 Intel Architecture Software
Developer's Manual Volume 3: System Programming Guide, 2004. Available:
ftp://download.intel.com/design/Pentium4/manuals/25366814.pdf [Viewed:
February 16, 2005].

[Int04c] International Business Machines Corporation. z/Architecture
Principles of Operation. Available:
http://publibz.boulder.ibm.com/epubs/pdf/dz9zr003.pdf [Viewed:
February 16, 2005], May 2004.

[Int07] Intel Corporation. Intel 64 Architecture Memory Ordering White
Paper, 2007. Available:
http://developer.intel.com/products/processor/manuals/318147.pdf
[Viewed: September 7, 2007].

[Int09] Intel Corporation. Intel 64 and IA-32 Architectures Software
Developer's Manual, Volume 3A: System Programming Guide, Part 1, 2009.
Available: http://download.intel.com/design/processor/manuals/253668.pdf
[Viewed: November 8, 2009].

[Kan96] Gerry Kane. PA-RISC 2.0 Architecture. Hewlett-Packard
Professional Books, 1996.

[LSH02] Michael Lyons, Ed Silha, and Bill Hay. PowerPC storage model
and AIX programming. Available:
http://www-106.ibm.com/developerworks/eserver/articles/powerpc.html
[Viewed: January 31, 2005], August 2002.

[McK05a] Paul E. McKenney. Memory ordering in modern microprocessors,
part I. Linux Journal, 1(136):52-57, August 2005. Available:
http://www.linuxjournal.com/article/8211
http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf
[Viewed: November 30, 2007].

[McK05b] Paul E. McKenney. Memory ordering in modern microprocessors,
part II. Linux Journal, 1(137):78-82, September 2005. Available:
http://www.linuxjournal.com/article/8212
http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf
[Viewed: November 30, 2007].
[MS09] Paul E. McKenney and Raul Silvera. Example POWER implementation
for C/C++ memory model. Available:
http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2009.02.27a.html
[Viewed: April 5, 2009], February 2009.
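[SPA94] SPARC International. The SPARC Architecture Manual, 1994.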