Programming Language Memory Models
The end result of making done atomic is that the program behaves as we want,
successfully passing the value in x from thread 1 to thread 2.
In the original program, after the compiler’s code reordering, thread 1 could
be writing x at the same moment that thread 2 was reading it. This is a data
race. In the revised program, the atomic variable done serves to synchronize ac-
cess to x: it is now impossible for thread 1 to be writing x at the same moment
that thread 2 is reading it. The program is data-race-free. In general, modern
languages guarantee that data-race-free programs always execute in a sequen-
tially consistent way, as if the operations from the different threads were inter-
leaved, arbitrarily but without reordering, onto a single processor. This is the
DRF-SC property from hardware memory models, adopted in the programming
language context.
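To make this concrete, here is a minimal C++ sketch of the message-passing pattern just described (the names x and done match the program above; everything else, including the use of std::thread, is my own framing, not part of the original example):

#include <atomic>
#include <cstdio>
#include <thread>

int x = 0;                    // ordinary data passed from thread 1 to thread 2
std::atomic<int> done{0};     // synchronizing atomic

int main() {
    std::thread t1([] {
        x = 1;                // write the data
        done.store(1);        // then publish it (sequentially consistent store)
    });
    std::thread t2([] {
        while (done.load() == 0) { /* loop */ }
        std::printf("%d\n", x);   // always prints 1: the program is data-race-free
    });
    t1.join();
    t2.join();
}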
As an aside, these atomic variables or atomic operations would more properly
be called “synchronizing atomics.” It’s true that the operations are atomic in the
database sense, allowing simultaneous reads and writes which behave as if run
sequentially in some order: what would be a race on ordinary variables is not
a race when using atomics. But it’s even more important that the atomics syn-
chronize the rest of the program, providing a way to eliminate races on the non-
atomic data. The standard terminology is plain “atomic”, though, so that’s what
this post uses. Just remember to read “atomic” as “synchronizing atomic” unless
noted otherwise.
The programming language memory model specifies the exact details of what
is required from programmers and from compilers, serving as a contract be-
tween them. The general features sketched above are true of essentially all mod-
ern languages, but it is only recently that things have converged to this point: in
the early 2000s, there was significantly more variation. Even today there is sig-
nificant variation among languages on second-order questions, including:
– What are the ordering guarantees for atomic variables themselves?
– Can a variable be accessed by both atomic and non-atomic opera-
tions?
– Are there synchronization mechanisms besides atomics?
– Are there atomic operations that don’t synchronize?
– Do programs with races have any guarantees at all?
After some preliminaries, the rest of this post examines how different languages
answer these and related questions, along with the paths they took to get there.
The post also highlights the many false starts along the way, to emphasize that
we are still very much learning what works and what does not.
proves they are different and usually helps us see whether, at least for that test
case, one is weaker or stronger than the other. For example, here is the litmus
test form of the program we examined earlier:
Litmus Test: Message Passing
Can this program see r1 = 1, r2 = 0?
// Thread 1      // Thread 2
x = 1            r1 = y
y = 1            r2 = x
[Figure: thread 1 executes W(x) and then S(a); thread 2 executes S(a) and then R(x), with a happens-before edge between the two S(a) instructions.]
We saw this program in the previous post too. Thread 1 and thread 2 execute a
synchronizing instruction S(a). In this particular execution of the program, the
two S(a) instructions establish a happens-before relationship from thread 1 to
thread 2, so the W(x) in thread 1 happens before the R(x) in thread 2.
Two events on different processors that are not ordered by happens-before
might occur at the same moment: the exact order is unclear. We say they exe-
cute concurrently. A data race is when a write to a variable executes concurrently
with a read or another write of that same variable. Processors that provide DRF-
SC (all of them, these days) guarantee that programs without data races behave
as if they were running on a sequentially consistent architecture. This is the fun-
damental guarantee that makes it possible to write correct multithreaded assem-
bly programs on modern processors.
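For concreteness, here is a minimal C++ sketch (mine, not from the discussion above) of a data race in exactly this sense: the write and the read of x are not ordered by happens-before, so they execute concurrently.

#include <thread>

int x = 0;   // ordinary, non-atomic variable

int main() {
    std::thread writer([] { x = 1; });               // write to x
    std::thread reader([] { int r = x; (void)r; });  // concurrent read: a data race
    writer.join();
    reader.join();
}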
As we saw earlier, DRF-SC is also the fundamental guarantee that modern
languages have adopted to make it possible to write correct multithreaded pro-
grams in higher-level languages.
// Thread 1      // Thread 2
x = 1;           while(done == 0) { /* loop */ }
done = 1;        print(x);
Because done is declared volatile, the loop is guaranteed to finish: the compiler
cannot cache it in a register and cause an infinite loop. However, the program
is not guaranteed to print 1. The compiler was not prohibited from reordering
the accesses to x and done, nor was it required to prohibit the hardware from
doing the same.
Because Java volatiles were non-synchronizing atomics, you could not use
them to build new synchronization primitives. In this sense, the original Java
memory model was too weak.
Coherence is incompatible with compiler optimizations
The original Java memory model was also too strong: mandating coher-
ence—once a thread had read a new value of a memory location, it could not
appear to later read the old value—disallowed basic compiler optimizations.
Earlier we looked at how reordering reads would break coherence, but you
might think, well, just don’t reorder reads. Here’s a more subtle way coherence
might be broken by another optimization: common subexpression elimination.
Consider this Java program:
// p and q may or may not point at the same object.
int i = p.x;
// ... maybe another thread writes p.x at this point ...
int j = q.x;
int k = p.x;
In this program, common subexpression elimination would notice that p.x is
computed twice and optimize the final line to k = i. But if p and q pointed to
the same object and another thread wrote to p.x between the reads into i and
j, then reusing the old value i for k violates coherence: the read into i saw an
old value, the read into j saw a newer value, but then the read into k reusing i
would once again see the old value. Not being able to optimize away redundant
reads would hobble most compilers, making the generated code slower.
Coherence is easier for hardware to provide than for compilers because hard-
ware can apply dynamic optimizations: it can adjust the optimization paths
based on the exact addresses involved in a given sequence of memory reads and
writes. In contrast, compilers can only apply static optimizations: they have to
write out, ahead of time, an instruction sequence that will be correct no mat-
ter what addresses and values are involved. In the example, the compiler cannot
easily change what happens based on whether p and q happen to point to the
same object, at least not without writing out code for both possibilities, leading
to significant time and space overheads. The compiler’s incomplete knowledge
about the possible aliasing between memory locations means that actually pro-
viding coherence would require giving up fundamental optimizations.
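Here is a sketch, with made-up names, of what a compiler would have to emit to keep the k = i reuse while still preserving coherence: a run-time aliasing check guarding every such reuse.

// Hypothetical transformed code; real compilers do not emit this. The point
// is the time and space cost such a dynamic check would add.
struct Obj { int x; };

void example(Obj* p, Obj* q) {
    int i = p->x;
    int j = q->x;
    int k;
    if (p == q) {
        k = j;   // same location: must use the later read's (newer) value
    } else {
        k = i;   // distinct locations: reusing the earlier read is safe
    }
    (void)i; (void)j; (void)k;
}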
Bill Pugh identified this and other problems in his 1999 paper “Fixing the Java
Memory Model.”
// Thread 1      // Thread 2
x = 1            y = 1
r1 = y           r2 = x
agree: technically, r1 and r2 can be left having read different values of x. That
is, this program can end with r1 and r2 holding different values. Of course, no
real implementation is going to produce different r1 and r2. Mutual exclusion
means there are no writes happening between those two reads. They have to get
the same value. But the fact that the memory model allows different reads shows
that it is, in a certain technical way, not precisely describing real Java implemen-
tations.
The situation gets worse. What if we add one more instruction, x = r1, be-
tween the two reads:
// Thread 1      // Thread 2      // Thread 3
lock(m1)         lock(m2)
x = 1            x = 2
unlock(m1)       unlock(m2)
                                  lock(m1)
                                  lock(m2)
                                  r1 = x
                                  x = r1 // !?
                                  r2 = x
                                  unlock(m2)
                                  unlock(m1)
Now, clearly the r2 = x read must use the value written by x = r1, so the pro-
gram must get the same values in r1 and r2. The two values r1 and r2 are now
guaranteed to be equal.
The difference between these two programs means we have a problem for
compilers. A compiler that sees r1 = x followed by x = r1 may well want to
delete the second assignment, which is “clearly” redundant. But that “optimiza-
tion” changes the second program, which must see the same values in r1 and r2,
into the first program, which technically can have r1 different from r2. There-
fore, according to the Java Memory Model, this optimization is technically in-
valid: it changes the meaning of the program. To be clear, this optimization
would not change the meaning of Java programs executing on any real JVM you
can imagine. But somehow the Java Memory Model doesn’t allow it, suggesting
there’s more that needs to be said.
For more about this example and others, see Ševčík and Aspinall’s paper.
Litmus Test: Racy Out Of Thin Air Values
Can this program see r1 = 42, r2 = 42?
// Thread 1      // Thread 2
r1 = x           r2 = y
y = r1           x = r2
(Obviously not!)
All the variables in this program start out zeroed, as always, and then this pro-
gram effectively runs y = x in one thread and x = y in the other thread. Can x
and y end up being 42? In real life, obviously not. But why not? The memory
model turns out not to disallow this result.
Suppose hypothetically that “r1 = x” did read 42. Then “y = r1” would write
42 to y, and then the racing “r2 = y” could read 42, causing the “x = r2” to write
42 to x, and that write races with (and is therefore observable by) the original
“r1 = x,” appearing to justify the original hypothetical. In this example, 42 is
called an out-of-thin-air value, because it appeared without any justification but
then justified itself with circular logic. What if the memory had formerly held a
42 before its current 0, and the hardware incorrectly speculated that it was still
42? That speculation might become a self-fulfilling prophecy. (This argument
seemed more far-fetched before Spectre and related attacks showed just how ag-
gressively hardware speculates. Even so, no hardware invents out-of-thin-air val-
ues this way.)
It seems clear that this program cannot end with r1 and r2 set to 42, but hap-
pens-before doesn’t by itself explain why this can’t happen. That suggests again
that there’s a certain incompleteness. The new Java Memory Model spends a lot
of time addressing this incompleteness, about which more shortly.
This program has a race—the reads of x and y are racing against writes in the
other threads—so we might fall back on arguing that it’s an incorrect program.
But here is a version that is data-race-free:
Litmus Test: Non-Racy Out Of Thin Air Values
Can this program see r1 = 42, r2 = 42?
// Thread 1         // Thread 2
r1 = x              r2 = y
if (r1 == 42)       if (r2 == 42)
    y = r1              x = r2
(Obviously not!)
Since x and y start out zero, any sequentially consistent execution is never go-
ing to execute the writes, so this program has no writes, so there are no races.
Once again, though, happens-before alone does not exclude the possibility that,
hypothetically, r1 = x sees the racing not-quite-write, and then following from
that hypothetical, the conditions both end up true and x and y are both 42 at
the end. This is another kind of out-of-thin-air value, but this time in a pro-
gram with no race. Any model guaranteeing DRF-SC must guarantee that this
program only sees all zeros at the end, yet happens-before doesn’t explain why.
The Java memory model spends a lot of words that I won’t go into to try to
exclude these kinds of acausal hypotheticals. Unfortunately, five years later, Sari-
ta Adve and Hans Boehm had this to say about that work:
Prohibiting such causality violations in a way that does not also pro-
hibit other desired optimizations turned out to be surprisingly diffi-
cult. … After many proposals and five years of spirited debate, the cur-
rent model was approved as the best compromise. … Unfortunately,
this model is very complex, was known to have some surprising be-
haviors, and has recently been shown to have a bug.
(Adve and Boehm, “Memory Models: A Case For Rethinking Parallel Languages
and Hardware,” August 2010)
if (i < 2) {
    foo: ...
    switch (i) {
    case 0:
        ...;
        break;
    case 1:
        ...;
        break;
    }
}
The claim is that a C++ compiler might be holding i in a register but then need
to reuse the registers if the code at label foo is complex. Rather than spill the
current value of i to the function stack, the compiler might instead decide to
load i a second time from the global x upon reaching the switch statement. The
result is that, halfway through the if body, i < 2 may stop being true. If the
compiler did something like compiling the switch into a computed jump using
a table indexed by i, that code would index off the end of the table and jump
to an unexpected address, which could be arbitrarily bad.
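Here is a sketch of that transformation, with the racy global written as x and everything else made up; the two functions show the code as written and the code as effectively compiled.

int x;   // shared global, racily written by another thread

void as_written() {
    int i = x;              // i is loaded from x exactly once
    if (i < 2) {
        /* foo: complex code that uses up the registers */
        switch (i) {
        case 0: /* ... */ break;
        case 1: /* ... */ break;
        }
    }
}

void as_compiled() {        // effective result of re-loading x instead of spilling i
    if (x < 2) {            // first load of x
        /* foo: complex code */
        switch (x) {        // second load: a racing write can yield a value >= 2
        case 0: /* ... */ break;
        case 1: /* ... */ break;
        }
    }
}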
From this example and others like it, the C++ memory model authors con-
clude that any racy access must be allowed to cause unbounded damage to the
future execution of the program. Personally, I conclude instead that in a mul-
tithreaded program, compilers should not assume that they can reload a local
variable like i by re-executing the memory read that initialized it. It may well
have been impractical to expect existing C++ compilers, written for a single-
threaded world, to find and fix code generation problems like this one, but in
new languages, I think we should aim higher.
Digression: Undefined behavior in C and C++
As an aside, the C and C++ insistence on the compiler’s ability to behave arbi-
trarily badly in response to bugs in programs leads to truly ridiculous results.
For example, consider this program, which was a topic of discussion on Twitter
in 2017:
#include <cstdlib>
typedef int (*Function)();
static Function Do;          // starts out as a null pointer
static int EraseAll() {
    return system("rm -rf slash");
}
void NeverCalled() {
    Do = EraseAll;
}
int main() {
    return Do();
}
If you were a modern C++ compiler like Clang, you might think about this pro-
gram as follows:
– In main, clearly Do is either null or EraseAll.
– If Do is EraseAll, then Do() is the same as EraseAll().
– If Do is null, then Do() is undefined behavior, which I can implement
however I want, including as EraseAll() unconditionally.
– Therefore I can optimize the indirect call Do() down to the direct call
EraseAll().
– I might as well inline EraseAll while I’m here.
The end result is that Clang optimizes the program down to:
int main() {
    return system("rm -rf slash");
}
You have to admit: next to this example, the possibility that the local variable i
might suddenly stop being less than 2 halfway through the body of if (i < 2)
does not seem out of place.
In essence, modern C and C++ compilers assume no programmer would dare
attempt undefined behavior. A programmer writing a program with a bug? In-
conceivable!
Like I said, in new languages I think we should aim higher.
Acquire/release atomics
C++ adopted sequentially consistent atomic variables much like (new) Java’s
volatile variables (no relation to C++ volatile). In our message passing example,
we can declare done as
atomic<int> done;
and then use done as if it were an ordinary variable, like in Java. Or we can de-
clare an ordinary int done; and then use
atomic_store(&done, 1);
and
while(atomic_load(&done) == 0) { /* loop */ }
to access it. Either way, the operations on done take part in the sequentially con-
sistent total order on atomic operations and synchronize the rest of the pro-
gram.
C++ also added weaker atomics, which can be accessed using atom-
ic_store_explicit and atomic_load_explicit with an additional memo-
ry ordering argument. Using memory_order_seq_cst makes the explicit calls
equivalent to the shorter ones above.
The weaker atomics are called acquire/release atomics, in which a release ob-
served by a later acquire creates a happens-before edge from the release to the
acquire. The terminology is meant to evoke mutexes: release is like unlocking a
mutex, and acquire is like locking that same mutex. The writes executed before
the release must be visible to reads executed after the subsequent acquire, just
as writes executed before unlocking a mutex must be visible to reads executed
after later locking that same mutex.
To use the weaker atomics, we could change our message-passing example to
use
atomic_store_explicit(&done, 1, memory_order_release);
and
while(atomic_load_explicit(&done, memory_order_acquire) == 0) { /* loop */ }
and it would still be correct. But not all programs would.
Recall that the sequentially consistent atomics required the behavior of all the
atomics in the program to be consistent with some global interleaving—a to-
tal order—of the execution. Acquire/release atomics do not. They only require
a sequentially consistent interleaving of the operations on a single memory lo-
cation. That is, they only require coherence. The result is that a program using
acquire/release atomics with more than one memory location may observe exe-
cutions that cannot be explained by a sequentially consistent interleaving of all
the acquire/release atomics in the program, arguably a violation of DRF-SC!
Litmus Test: Store Buffering
Can this program see r1 = 0, r2 = 0?
// Thread 1      // Thread 2
x = 1            y = 1
r1 = y           r2 = x
void Cond::notify() {
    done = 1;
    if (!waiting)
        return;
    // ... wake up waiter ...
}

void Cond::wait() {
    waiting = 1;
    if (done)
        return;
    // ... sleep ...
}
The important part about this code is that notify sets done before checking
waiting, while wait sets waiting before checking done, so that concurrent
calls to notify and wait cannot result in notify returning immediately and
wait sleeping. But with C++ acquire/release atomics, they can. And they probably would, only some fraction of the time, making the bug very hard to reproduce
and diagnose. (Worse, on some architectures like 64-bit ARM, the best way to
implement acquire/release atomics is as sequentially consistent atomics, so you
might write code that works fine on 64-bit ARM and only discover it is incor-
rect when porting to other systems.)
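To make the failure concrete, here is a sketch of the same Cond logic written with explicit C++ atomics (the wake and sleep details are elided, and the orderings shown are my choice, not a quotation of anyone's real code). With release stores and acquire loads, the store-buffering outcome is allowed: notify can load waiting == 0 while wait loads done == 0. Making all four operations memory_order_seq_cst rules that outcome out.

#include <atomic>

struct Cond {
    std::atomic<int> done{0};
    std::atomic<int> waiting{0};

    void notify() {
        done.store(1, std::memory_order_release);
        if (waiting.load(std::memory_order_acquire) == 0)
            return;                 // may wrongly conclude no one is waiting
        // ... wake up waiter ...
    }

    void wait() {
        waiting.store(1, std::memory_order_release);
        if (done.load(std::memory_order_acquire) != 0)
            return;
        // ... sleep ...
    }
};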
With this understanding, “acquire/release” is an unfortunate name for these
atomics, since the sequentially consistent ones do just as much acquiring and re-
leasing. What’s different about these is the loss of sequential consistency. It might
have been better to call these “coherence” atomics. Too late.
Relaxed atomics
C++ did not stop with the merely coherent acquire/release atomics. It
also introduced non-synchronizing atomics, called relaxed atomics (memo-
ry_order_relaxed). These atomics have no synchronizing effect at all—they
create no happens-before edges—and they have no ordering guarantees at all ei-
ther. In fact, there is no difference between a relaxed atomic read/write and an
ordinary read/write except that a race on relaxed atomics is not considered a
race and cannot catch fire.
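A typical use is a statistics counter that is intentionally racy but must not be a data race in the formal sense. Here is a sketch (the names are mine):

#include <atomic>

std::atomic<long> hits{0};   // updated from many threads

void record_hit() {
    // No happens-before edge is created; the only guarantee is that these
    // accesses are not a data race and each increment is applied atomically.
    hits.fetch_add(1, std::memory_order_relaxed);
}

long approximate_hits() {
    return hits.load(std::memory_order_relaxed);
}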
Much of the complexity of the revised Java memory model arises from
defining the behavior of programs with data races. It would be nice if C++’s
adoption of DRF-SC or Catch Fire—effectively disallowing programs with data
races—meant that we could throw away all those strange examples we looked
at earlier, so that the C++ language spec would end up simpler than Java’s. Un-
fortunately, including the relaxed atomics ends up preserving all those concerns,
meaning the C++11 spec ended up no simpler than Java’s.
Like Java’s memory model, the C++11 memory model also ended up incor-
rect. Consider the data-race-free program from before:
Litmus Test: Non-Racy Out Of Thin Air Values
Can this program see r1 = 42, r2 = 42?
// Thread 1         // Thread 2
r1 = x              r2 = y
if (r1 == 42)       if (r2 == 42)
    y = r1              x = r2
(Obviously not!)
C++11 (ordinary variables): no.
C++11 (relaxed atomics): yes!
In their paper “Common Compiler Optimisations are Invalid in the C11 Mem-
ory Model and what we can do about it” (2015), Viktor Vafeiadis and others
showed that the C++11 specification guarantees that this program must end
with x and y set to zero when x and y are ordinary variables. But if x and y are
relaxed atomics, then, strictly speaking, the C++11 specification does not rule
out that r1 and r2 might both end up 42. (Surprise!)
See the paper for the details, but at a high level, the C++11 spec had some for-
mal rules trying to disallow out-of-thin-air values, combined with some vague
words to discourage other kinds of problematic values. Those formal rules were
the problem, so C++14 dropped them and left only the vague words. Quoting
the rationale for removing them, the C++11 formulation turned out to be “both
like. In addition to those challenges, which are mostly the same as elsewhere, the
ES2017 definition had two interesting bugs that arose from a mismatch with the
semantics of the new ARMv8 atomic instructions. These examples are adapted
from Conrad Watt et al.’s 2020 paper “Repairing and Mechanising the JavaScript
Relaxed Memory Model.”
As we noted in the previous section, ARMv8 added ldar and stlr instruc-
tions providing sequentially consistent atomic load and store. These were target-
ed to C++, which does not define the behavior of any program with a data race.
Unsurprisingly, then, the behavior of these instructions in racy programs did not
match the expectations of the ES2017 authors, and in particular it did not sat-
isfy the ES2017 requirements for racy program behavior.
Litmus Test: ES2017 racy reads on ARMv8
Can this program (using atomics) see r1 = 0, r2 = 1?
// Thread 1      // Thread 2
x = 1            y = 1
r1 = y           x = 2 (non-atomic)
r2 = x
// Thread 1 // Thread 2
x = 1 x = 2
r1 = x
if (r1 == 1) {
r2 = x // non-atomic
}
Conclusions
Looking at C, C++, Java, JavaScript, Rust, and Swift, we can make the following
observations:
– They all provide sequentially consistent synchronizing atomics for co-
ordinating the non-atomic parts of a parallel program.
– They all aim to guarantee that programs made data-race-free using
proper synchronization behave as if executed in a sequentially consis-
tent manner.
– Java resisted adding weak (acquire/release) synchronizing atomics un-
til Java 9 introduced VarHandle. JavaScript has avoided adding them
as of this writing.
– They all provide a way for programs to execute “intentional” data races
without invalidating the rest of the program. In C, C++, Rust, and
Swift, that mechanism is relaxed, non-synchronizing atomics, a spe-
cial form of memory access. In Java, that mechanism is either ordi-
nary memory access or the Java 9 VarHandle “plain” access mode. In
JavaScript, that mechanism is ordinary memory access.
Acknowledgements
This series of posts benefited greatly from discussions with and feedback from
a long list of engineers I am lucky to work with at Google. My thanks to them.
I take full responsibility for any mistakes or unpopular opinions.