One question I had for LLVM folks while we were working on this: does it make sense to transition the atomic operations to instructions instead of intrinsics? I'm not sold on this in either direction; I'd just like to get people's thoughts on it.
Certainly for languages such as Java, they will make up a surprisingly large chunk of the loads and stores, and instructions have much more flexibility in terms of syntax. On the flip side, it's a lot of plumbing IIRC, and we'd really need to stick to a very minimal set of operations, supporting more obscure ones by pattern matching or intrinsics.
If you add it to the instructions, their syntax will be more complex
than it is today, and reading them could prove a challenge.
IMHO, we should keep it simple. I agree that multithreading is ubiquitous
nowadays, but the detailed implementation might vary considerably from
language to language, and making it explicit only helps, at least in
the beginning.
>> Certainly for languages such as Java, they will make up a surprisingly
>> large chunk of the loads and stores, and instructions have much more
>> flexibility in terms of syntax. On the flip side, it's a lot of plumbing
>> IIRC, and we'd really need to stick to a very minimal set of operations,
>> supporting more obscure ones by pattern matching or intrinsics.

> If you add it to the instructions, their syntax will be more complex
> than it is today, and reading them could prove a challenge.
To be clear, Chandler wasn't suggesting any change to the existing
load and store instructions. Instead, we were wondering if people like
the idea of _new_ atomic_load, atomic_store, atomic_cmpxchg, and maybe
atomic_exchange and atomic_add instructions.
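For concreteness, here is a rough sketch of the difference. The intrinsic
name follows the @llvm.atomic.cmp.exchange.i128.p0i128 naming used elsewhere
in this thread, and the instruction syntax is purely hypothetical:

  ; today: atomic compare-and-exchange as an intrinsic call
  %old = call i32 @llvm.atomic.cmp.exchange.i32.p0i32(i32* %addr,
                                                      i32 %expected,
                                                      i32 %new)

  ; hypothetical instruction form of the same operation
  %old = atomic_cmpxchg i32* %addr, i32 %expected, i32 %new

The instruction form reads more naturally, at the cost of the plumbing
mentioned above.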
I don't have a strong preference, but I am curious why all the atomics
are integer-only. I would like to see 32- and 64-bit float supported as
well, at minimum.
I have no real problem with making the types of all of these analogous to the types supported by icmp, and providing floating-point atomics as well. A couple of observations:
Allowing direct use of pointers would be convenient, I suspect -- pointers are very common operands to atomic operations, and it seems annoying to have to convert them to ints (see the sketch after the next paragraph).
We can allow the IR to represent vector atomics, but unless the hardware supports them, lowering them by decomposing into scalars is more than LLVM should try to do. Because of that, I'm not sure we should support vectors here the way we do elsewhere, where they degrade gracefully.
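To make the pointer annoyance concrete, here is what a compare-and-exchange
on an i8* slot looks like today, assuming 64-bit pointers and an (address,
expected, new) operand order for the intrinsic; all names are illustrative:

  %slot.i = bitcast i8** %slot to i64*
  %cmp.i  = ptrtoint i8* %cmp to i64
  %new.i  = ptrtoint i8* %new to i64
  %old.i  = call i64 @llvm.atomic.cmp.exchange.i64.p0i64(i64* %slot.i,
                                                         i64 %cmp.i,
                                                         i64 %new.i)
  %old    = inttoptr i64 %old.i to i8*

Direct pointer support would collapse this to a single operation and get
rid of the cast noise.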
>> I don't have a strong preference, but I am curious why all the atomics
>> are integer-only. I would like to see 32- and 64-bit float supported as
>> well, at minimum.

> I have no real problem with making the types of all of these analogous to
> the types supported by icmp, and providing floating-point atomics as well.

Good.

> Allowing direct use of pointers would be convenient, I suspect -- pointers
> are very common operands to atomic operations, and it seems annoying to
> have to convert them to ints.

Yep.

> We can allow the IR to represent vector atomics, but unless the hardware
> supports them, lowering them by decomposing into scalars is more than LLVM
> should try to do. Because of that, I'm not sure we should support vectors
> here the way we do elsewhere, where they degrade gracefully.
Vector atomics are extremely useful on architectures that support them.
I'm not sure we need atomicity across vector elements, so decomposing
shouldn't be a problem, but I will have to think about it a bit.
What are the semantics of vectorization across atomic vector operations?
Suppose thread 1 atomically writes to a vector with 64 elements and
thread 2 atomically reads from it. Automatic vectorization would naively
convert this into N atomic writes and reads, each 64/N elements wide, but
reading block k on thread 2 would then not necessarily happen after it is
written on thread 1 -- for example, if reads are much faster than writes.
I suppose one would have to take great care when doing such
transformations to keep the same semantics. For instance, splitting
into two loops and putting a barrier between them, which brings us back
to the original design.
You'll be able to get at double-wide compare exchange on 64-bit
platforms with @llvm.atomic.cmp.exchange.i128.p0i128 (same as you can
now). If your target doesn't support this width, it may explode when
you get to codegen. We don't have any plans for an intrinsic for
Itanium's cmp8xchg16 instruction (since we don't know of a language
that supports it that compiles to LLVM, and it doesn't fit well into
the API for ordinary cmpxchges), but someone could propose that as an
extension later.
The more abstract cmp.exchange on {i8*, i8*} would require extending
how intrinsics work or switching to instructions. For now, I think
casting to i128s will be OK for most frontends.
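As a rough sketch of that casting approach on a 64-bit target, assuming
the {i8*, i8*} pair is contiguous and 16-byte aligned (names illustrative):

  ; reinterpret the pointer pair as one i128 and cmpxchg the whole thing
  %pair.i = bitcast { i8*, i8* }* %pair to i128*
  %old    = call i128 @llvm.atomic.cmp.exchange.i128.p0i128(i128* %pair.i,
                                                            i128 %expected,
                                                            i128 %new)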
>> We can allow the IR to represent vector atomics, but unless the hardware
>> supports them, lowering them by decomposing into scalars is more than LLVM
>> should try to do. Because of that, I'm not sure we should support vectors
>> here the way we do elsewhere, where they degrade gracefully.

> Vector atomics are extremely useful on architectures that support them.

I'm curious about the architectures/instructions you're thinking of.
Something like 'lock; movdqa'?

> I'm not sure we need atomicity across vector elements, so decomposing
> shouldn't be a problem, but I will have to think about it a bit.

That's interesting. Naïvely, it seems to violate the whole point of
atomics, since it means their side effects don't appear atomic to
other threads. We could relax that if it's useful, of course. How do
you imagine specifying the program's constraints on vector/word
tearing?
Yes; it was just added 4 days ago, for an unrelated purpose. It's an
interesting concept which on the surface seems to be a good fit, though
there are some complicating issues.
Here's one:
In the "load widening" section of the proposed memory model, if there's
a race condition on the high bits of a widened load, but the high bits
are never observed, the low bits should remain defined. However, that's
not how the current trap concept works (yet).
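As a sketch of that case, assuming a little-endian target and illustrative
names: a 16-bit load widened to 32 bits pulls in high bits that may be
racing, but only the low half is ever observed:

  ; widened form of what was originally an i16 load; the high 16 bits of
  ; %wide may be racing, but they are never observed
  %wide = load i32* %p
  %low  = trunc i32 %wide to i16

Under the proposed model %low should stay defined even if the racing high
bits are garbage; as I read the current trap rules, the race would mark
all of %wide as a trap value, taking the low bits with it.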
The first case is the one you outline: a producer-consumer relationship.
In that case we would have to respect atomicity across vector elements,
so that a read in thread 2 would not get some elements with the updated
value of the write in thread 1 and some elements with the old value.
The second case is a pure atomic update: a bunch of threads collaborate
to produce a set of values. A partial reduction, for example: a bunch
of threads in a loop atomically operate on a vector, say computing a
vector sum into it via an atomic add. After this operation the code
does a barrier sync and continues with the next phase. In this case
there is no producer-consumer relationship within the loop (everyone
is producing/updating), so we don't need to worry about respecting
atomicity across elements.
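A purely hypothetical sketch of this second case; no vector atomic
intrinsic actually exists, and the name below just extends the scalar
@llvm.atomic.load.add naming to a vector type:

  ; each thread folds its partial sums into the shared accumulator with
  ; a (hypothetical) vector atomic add, then synchronizes
  %old = call <4 x i32> @llvm.atomic.load.add.v4i32.p0v4i32(
             <4 x i32>* %sum, <4 x i32> %partial)
  call void @barrier()   ; stand-in for the runtime's barrier sync

Since every thread is only accumulating, it doesn't matter whether the
add tears across elements; the barrier is what publishes the final values.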
My intuition is that the second case is more important (in the sense of
computation time) than the first, but I will have to talk to some people here
more familiar with the common codes than I am. The first case might be used
for boundary updates and that kind of thing while the second case is used for
the meat of the computation.
It shouldn't be very hard for the compiler to detect the second case.
It's a pretty straightforward pattern. For everything else it would
have to assume case #1.
So perhaps we want two kinds of vector atomic: one that respects
atomicity across elements and one that doesn't.
Of course this only matters when looking at decomposing vector atomics
into scalars. I think it is probably a better strategy just to not
generate the vector atomics in the first place if the target doesn't support
them. Then we only need one kind: the one that respects atomicity across
elements.
The only problem is that threads are rarely part of language specs, so
the compiler would have to choose one particular flavour of threading or
"find out" which flavour is being used. That would require the compiler
to know about or implement every threading model, which would be not only
time-consuming and error-prone but also not the compiler's job in the
first place.
>> Vector atomics are extremely useful on architectures that support them.

> I'm curious about the architectures/instructions you're thinking of.
> Something like 'lock; movdqa'?
Don't think X86. Think traditional vector machines like the Cray X1/X2.
Atomic vector adds and logicals are common operations.
>> I'm not sure we need atomicity across vector elements, so decomposing
>> shouldn't be a problem, but I will have to think about it a bit.

> That's interesting. Naïvely, it seems to violate the whole point of
> atomics, since it means their side effects don't appear atomic to
> other threads. We could relax that if it's useful, of course. How do
> you imagine specifying the program's constraints on vector/word
> tearing?
I wrote up a response to Renato's question. In the end I don't think
the tearing idea is worth the effort. The compiler should just not
generate vector atomics if the target doesn't support them.