Proposal for a new LLVM concurrency memory model

Hi all,

Chandler, Owen, and I have written up a proposal for a new memory
model and atomic intrinsics in LLVM, which will make it possible to
support Java and the upcoming C++0x standard. The proposed changes to
the LangRef are at
<http://docs.google.com/View?docID=ddb4mhxz_22dz5g98dd&revision=_latest>,
and a rationale for some of the more surprising changes is at
<http://docs.google.com/View?docID=ddb4mhxz_24gh4ksvgh&revision=_latest>.
You're the first group to see this, so it's likely to need some
significant fixes based on your feedback. Let us know what you think.

Thanks,
Jeffrey, Chandler, and Owen

One question I had for LLVM folks while we were working on this: does it make sense to transition the atomic operations to instructions instead of intrinsics? I'm not sold on this in either direction; I'd just like to get people's thoughts on it.

Certainly for languages such as Java, they will make up a surprisingly large chunk of the loads and stores, and instructions have much more flexibility in terms of syntax. On the flip side, it's a lot of plumbing IIRC, and we'd really need to stick to the very minimal set of operations, supporting more obscure ones by pattern matching or intrinsics.

If you add it to the instructions, their syntax will be more complex
than they are today, and reading them could prove a challenge.

IMHO, we should keep it simple. I agree that multi-task is ubiquitous
nowadays but the detailed implementation might vary considerably from
language to language and making it explicit only helps, at least in
the beginning.

cheers,
--renato

Reclaim your digital rights, eliminate DRM, learn more at
http://www.defectivebydesign.org/what_is_drm

> Certainly for languages such as Java, they will make up a surprisingly large
> chunk of the loads and stores, and instructions have much more flexibility in
> terms of syntax. On the flip side, it's a lot of plumbing IIRC, and we'd
> really need to stick to the very minimal set of operations, supporting more
> obscure ones by pattern matching or intrinsics.
>
> If you add it to the instructions, their syntax will be more complex
> than they are today, and reading them could prove a challenge.

To be clear, Chandler wasn't suggesting any change to the existing
load and store instructions. Instead, we were wondering if people like
the idea of _new_ atomic_load, atomic_store, atomic_cmpxchg, and maybe
atomic_exchange and atomic_add instructions.

I see; in that case, I don't have any strong opinion. Maybe new
instructions would be simpler and cleaner...

I quite like the idea of having more expressive atomic operators, as
it'll be easier to map high-level synchronization concepts to IR.

cheers,
--renato


I just started reading this. Should I interpret "may not" as "must not"
throughout this document?

The language of this proposal needs to be much clearer. "may not" can
be interpreted in at least two different, opposite ways.

                        -Dave

The "must not" reading is correct. I've changed the document to reflect that. Thanks for pointing it out.

--Owen

I don't have a strong preference, but I am curious why all the atomics
are integer-only. I would like to see 32- and 64-bit float supported as
well, at minimum.

                             -Dave

I have no real problem with making the types of all of these analogous to the types supported by icmp, and providing floating point atomics as well. A couple of observations:

Allowing direct use of pointers would be convenient, I suspect -- pointers are very common operands to atomic operations, and it seems annoying to have to convert them to ints.

We can allow the IR to represent vectors, but unless hardware supports it, I think lowering these by decomposing them is more than LLVM should try to do. Because of that, I'm not sure we should support vectors here, even though elsewhere in the IR they degrade gracefully.

> Hi all,
>
> Chandler, Owen, and I have written up a proposal for a new memory
> model and atomic intrinsics in LLVM, which will make it possible to
> support Java and the upcoming C++0x standard. The proposed changes to
> the LangRef are at
> <http://docs.google.com/View?docID=ddb4mhxz_22dz5g98dd&revision=_latest>,

What's a "trap" and "trap value"? Is it some C++0x or Java thing?
It needs to be defined.

From the rationale, "trap" sounds like the IA64 trap value. That's
too target-specific for a proposal like this. Why not just say that
the use of an uninitialized value is undefined and leave it at that?

                           -Dave

>> I don't have a strong preference, but I am curious why all the atomics
>> are integer-only. I would like to see 32- and 64-bit float supported as
>> well, at minimum.

> I have no real problem with making the types of all of these analogous to
> the types supported by icmp, and providing floating point atomics as well.

Good.

> A couple of observations:
> Allowing direct use of pointers would be convenient I suspect -- pointers
> are very common operands to atomic operations, and it seems annoying to
> have to convert them to ints.

Yep.

> We can allow the IR to represent vectors, but unless hardware supports it,
> I think lowering these by decomposing them is more than LLVM should try to
> do. Because of that, I'm not sure we should support vectors as elsewhere
> they degrade gracefully.

Vector atomics are extremely useful on architectures that support them.

I'm not sure we need atomicity across vector elements, so decomposing
shouldn't be a problem, but I will have to think about it a bit.

                         -Dave

Hi David-

> Hi all,
>
> Chandler, Owen, and I have written up a proposal for a new memory
> model and atomic intrinsics in LLVM, which will make it possible to
> support Java and the upcoming C++0x standard. The proposed changes to
> the LangRef are at
> <http://docs.google.com/View?docID=ddb4mhxz_22dz5g98dd&revision=_latest>,
> and a rationale for some of the more surprising changes is at
> <http://docs.google.com/View?docID=ddb4mhxz_24gh4ksvgh&revision=_latest>.

At first glance, it looks like double-wide compare exchange is missing.

See the Wikipedia article on the ABA problem for an example of where it's
very useful.

Mark

What are the semantics for vectorization across atomic vector operations?

Suppose I atomically write to a vector with 64 elements in thread 1 and
read it in thread 2. Automatic vectorization would naively convert this
into N operations of 64/N-wide atomic writes and reads, but then there is
no guarantee that thread 2 reads block k only after thread 1 has written
it, e.g. if reads are much faster than writes.

I suppose one would have to take great care when doing such
transformations, to keep the same semantics. For instance, by splitting
the loop in two and putting a barrier between them, which takes us back
to the original design.

cheers,
--renato


You'll be able to get at double-wide compare exchange on 64-bit
platforms with @llvm.atomic.cmp.exchange.i128.p0i128 (same as you can
now). If your target doesn't support this width, it may explode when
you get to codegen. We don't have any plans for an intrinsic for
Itanium's cmp8xchg16 instruction (since we don't know of a language
that supports it that compiles to LLVM, and it doesn't fit well into
the API for ordinary cmpxchges), but someone could propose that as an
extension later.

The more abstract cmp.exchange on {i8*, i8*} would require extending
how intrinsics work or switching to instructions. For now, I think
casting to i128s will be ok for most frontends.

Thanks for taking a look,
Jeffrey

>> We can allow the IR to represent vectors, but unless hardware supports it,
>> I think lowering these by decomposing them is more than LLVM should try to
>> do. Because of that, I'm not sure we should support vectors as elsewhere
>> they degrade gracefully.
>
> Vector atomics are extremely useful on architectures that support them.

I'm curious about the architectures/instructions you're thinking of.
Something like 'lock; movdqa'?

> I'm not sure we need atomicity across vector elements, so decomposing
> shouldn't be a problem, but I will have to think about it a bit.

That's interesting. Naïvely, it seems to violate the whole point of
atomics, since it means their side-effects don't appear atomic to
other threads. We could relax that if it's useful, of course. How do
you imagine specifying the program's constraints on vector/word
tearing?

Jeffrey

Yes; it was just added 4 days ago, for an unrelated purpose. It's an
interesting concept which on the surface seems to be a good fit, though
there are some complicating issues.

Here's one:

In the "load widening" section of the proposed memory model, if there's
a race condition on the high bits of a widened load, but the high bits
are never observed, the low bits should remain defined. However, that's
not how the current trap concept works (yet).

Dan

So I think there are at least two cases here.

The first case is the one you outline: a producer-consumer relationship.
In that case we would have to respect atomicity across vector elements,
so that a read in thread 2 would not get some elements with the updated
value of the write in thread 1 and some elements with the old value.

The second case is a pure atomic update: a bunch of threads collaborate
to produce a set of values. A partial reduction, for example. A bunch
of threads in a loop atomically operate on a vector, for example
computing a vector sum into it via an atomic add. After this operation
the code does a barrier sync and continues with the next phase. In
this case there is no producer-consumer relationship within the loop
(everyone is producing/updating), so we don't need to worry about
respecting atomicity across elements.

My intuition is that the second case is more important (in the sense of
computation time) than the first, but I will have to talk to some people here
more familiar with the common codes than I am. The first case might be used
for boundary updates and that kind of thing while the second case is used for
the meat of the computation.

It shouldn't be very hard for the compiler to detect the second case.
It's a pretty straightforward pattern. For everything else it would
have to assume case #1.

So perhaps we want two kinds of vector atomic: one that respects
atomicity across elements and one that doesn't.

Of course this only matters when looking at decomposing vector atomics
into scalars. I think it is probably a better strategy just to not
generate the vector atomics in the first place if the target doesn't support
them. Then we only need one kind: the one that respects atomicity across
elements.

                               -Dave

The only problem is that threads are rarely part of language specs, so
the compiler would have to choose one particular flavour of threads or
"find out" what flavour is being used. That would require the compiler
to know about or implement all threading models, which is not only
time-consuming and error-prone, but also not the compiler's job in the
first place.

cheers,
--renato


>> Vector atomics are extremely useful on architectures that support them.

> I'm curious about the architectures/instructions you're thinking of.
> Something like 'lock; movdqa'?

Don't think X86. Think traditional vector machines like the Cray X1/X2.
Atomic vector adds and logicals are common operations.

>> I'm not sure we need atomicity across vector elements, so decomposing
>> shouldn't be a problem, but I will have to think about it a bit.

> That's interesting. Naïvely, it seems to violate the whole point of
> atomics, since it means their side-effects don't appear atomic to
> other threads. We could relax that if it's useful, of course. How do
> you imagine specifying the program's constraints on vector/word
> tearing?

I wrote up a response to Renato's question. In the end I don't think
the tearing idea is worth the effort. The compiler should just not
generate vector atomics if the target doesn't support them.

                            -Dave