Implement boolean vectors #185
I'm not convinced that we should be adding boolean vectors to the vendor intrinsics. Vendor intrinsics are supposed to be a low-level interface to vendor-specific APIs, and none of the vendor-specific APIs (at least for Intel) have a concept of a boolean vector. One may be implied by the stated contracts of certain vendor intrinsics, but that means we need to go through all of them and vet them ourselves. We're already trying to improve the state of the world a little bit with more granular integer vector types, but only because others have (for the most part) already done that work for us. I grant that boolean vectors may be a nice option for the portable API.
It seems like a problem if the return type of the comparison methods or vendor operations changes on stable. In addition to what the initial comment says, I think it would be good to have API surface for safely bitcasting …
Every operation (including in the Intel case) whose output is defined to set all bits of a given lane to either zero or one conceptually returns a boolean vector, even if the concept isn't a stated part of the Intel C API. As for inputs, safe and convenient bitcasting from …
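To illustrate the "all bits set or all bits clear per lane" contract, here is a scalar sketch (the function name `cmpeq_epi16_model` is hypothetical, not the vendor intrinsic) of what `_mm_cmpeq_epi16` is documented to compute; this is exactly the invariant a boolean vector type would capture:

```rust
// Scalar model of the lane-wise contract of `_mm_cmpeq_epi16`:
// each 16-bit lane becomes 0xFFFF (all bits set) on equality and
// 0x0000 otherwise. Illustrative sketch only, not the intrinsic itself.
fn cmpeq_epi16_model(a: [u16; 8], b: [u16; 8]) -> [u16; 8] {
    let mut out = [0u16; 8];
    for i in 0..8 {
        out[i] = if a[i] == b[i] { 0xFFFF } else { 0x0000 };
    }
    out
}

fn main() {
    let a = [7, 1, 2, 3, 4, 5, 6, 7];
    let b = [7, 0, 2, 4, 3, 2, 1, 0];
    let m = cmpeq_epi16_model(a, b);
    // Only lanes 0 and 2 compare equal.
    assert_eq!(m, [0xFFFF, 0, 0xFFFF, 0, 0, 0, 0, 0]);
}
```

Any lane of such a result is trivially interpretable as a boolean, which is the basis for the "conceptually returns a boolean vector" argument.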
What I was thinking is not stabilizing those intrinsics where we are unsure whether we might want to make them return boolean vectors. That would mean that one must use Rust nightly to use those, and the nightly code might break if we switch them to boolean vectors.
Yes, that is what it would mean. So I wanted to do this, but before doing so, I wanted to cheat a bit and look at the …
But... I cannot find a single intrinsic in the whole … The …

@hsivonen, is this correct? If so, I don't understand why they can't be provided by a different crate. If what you wanted is for them to be used as the return type of some intrinsics, then I think you need to make a better case and tell us that a sufficient number of intrinsics require them (e.g. by reviewing the x86 SIMD intrinsics and letting us know what you find).
No. The …
As for examples of vendor intrinsics that …
It would be nice to have a comprehensive list so that we can make an informed decision.
Boolean vectors have been implemented in #338. Only …

I've taken a look at the assembly generated for the SSE2 intrinsic

```c
// Compares two vectors of packed 16-bit integers
__m128i _mm_cmpeq_epi16 (__m128i a, __m128i b)
```

which returns an `__m128i`. First I consider these two cases:

```rust
pub unsafe fn foo(x: i16) -> __m128i {
    let a = _mm_setr_epi16(x, 1, 2, 3, 4, 5, 6, 7);
    let b = _mm_setr_epi16(7, x, 2, 4, 3, 2, 1, 0);
    _mm_cmpeq_epi16(a, b)
}

pub unsafe fn bar(x: i16) -> i16x8 {
    let a = i16x8::new(x, 1, 2, 3, 4, 5, 6, 7);
    let b = i16x8::new(7, x, 2, 4, 3, 2, 1, 0);
    _mm_cmpeq_epi16(a.into_bits(), b.into_bits()).into_bits()
}
```

which produce identical assembly. Next:

```rust
pub unsafe fn bar2(x: i16) -> bool {
    let a = i16x8::new(x, 1, 2, 3, 4, 5, 6, 7);
    let b = i16x8::new(7, x, 2, 4, 3, 2, 1, 0);
    let b: i16x8 = _mm_cmpeq_epi16(a.into_bits(), b.into_bits()).into_bits();
    b.extract(3) == -1i16 // all bits set (0xffff)
}

pub unsafe fn baz(x: i16) -> bool {
    let a = i16x8::new(x, 1, 2, 3, 4, 5, 6, 7);
    let b = i16x8::new(7, x, 2, 4, 3, 2, 1, 0);
    let b = b8x8::from_bits(i8x8::from(i16x8::from_bits(_mm_cmpeq_epi16(a.into_bits(), b.into_bits()))));
    b.extract(3)
}
```

The functions … @rkruppe pointed out that LLVM comparisons actually return …
Thank you!
Why? I ported encoding_rs to …
(The results being compared are merges of the fastest results of four runs with the …)

To have the kind of performance that one would expect, it seems to me that an operation that returns a boolean vector should return a vector of the same bit width as the operands, and the …
Filed #362.
Note that these operate on UTF-16, so they compare …
Because converting from wider types to …

In particular, boolean operations always return …
What does "correctly" mean in this context?
I don't understand. If …
The pattern I'm using doesn't logically involve LLVM having to materialize the 64-bit type in RAM. Logically, everything should happen in 128-bit registers:

```rust
#[inline(always)]
pub fn simd_is_basic_latin(s: u16x8) -> bool {
    let above_ascii = u16x8::splat(0x80);
    s.lt(above_ascii).all()
}
```

Yet, performance suffers when going from …

It seems very brittle to me to define a narrower type (64 bits) and hope that LLVM never needs to actually materialize it in its 64-bit representation and that operations actually get done in 128-bit registers.

In some sense, part of the problem here is that all types need to have a concrete memory representation. After all, logically in this case there are just 8 bits we care about, not even 64, but it's crucial for performance that those 8 bits are stretched across a 128-bit register.

It seems to me that the design used in the …
To improve my understanding from a different angle: what concrete problem (apart from the abstract problem of the number of types proliferating) is solved by having a single boolean vector type for a given lane count (as in …)?

AFAICT, in realistic use cases, one wouldn't typically need to have boolean vectors of the same lane count but different lane widths interact without conversion ceremony.
In a way that conveys to LLVM that these vectors are boolean vectors.
It should compile to whatever instruction sequence makes your code run faster. If you don't use the result of that comparison, it should compile to nothing.
As mentioned in #362 this is currently an expected performance bug that needs to be fixed.
It makes it easier for LLVM to optimize sequences of SIMD instructions across all ISAs.
Can we not convey to LLVM that …?

Having a type whose declared bit width differs from the bit width that is performant on the machine still scares me. Also, it worries me not to be able to be sure that viewing a boolean vector as a bitmask is a zero-cost transmute.

encoding_rs doesn't have this code yet, but consider this SIMD implementation of converting x-user-defined to UTF-16 using the …:

```rust
#[inline(always)]
fn shift_upper(s: u16x8) -> u16x8 {
    let highest_ascii = u16x8::splat(0x7F);
    let offset = u16x8::splat(0xF700);
    let mask = s.gt(highest_ascii).to_repr().to_u16();
    s + (offset & mask)
}

#[inline(always)]
fn decode_x_user_defined_to_utf16(b: u8x16) -> (u16x8, u16x8) {
    let (a, b) = unpack(b); // zero-extends one u8x16 into two u16x8s
    (shift_upper(a), shift_upper(b))
}
```

With the …

With …

It may look like I'm flip-flopping by first complaining that comparisons don't return boolean vectors and then turning around and complaining that I don't get the result as integer lanes, but I have previously emphasized that it's important both that comparisons signal on the type-system level that all bits of each lane are either set or unset, and that this type layer can be waived at zero cost to get at the actual bits.
Is there a concrete example of this within the bounds of the target having the same register width as the code uses? (I.e. it's a priority for me that …)

Superficially, it seems unbelievable that …
Considering that sequences of operations on boolean vectors are bitwise ops on the whole register and the …
It's unclear to me whether you mean a performance bug in the sense that you have something simple in mind that makes …
Yes, we can. As stated previously, we are currently not doing it. So this is why this bug happens.
If it isn't zero-cost, then it's a bug.
If you do operations on floating-point vectors and then move them to integer vectors to do some more operations, LLVM often performs the floating-point operations on integer vectors directly to avoid the latency of switching domains on the CPU. This can happen implicitly if you return a floating-point vector from a function, for example, because they must be passed around as integer vectors due to the ABI.
LLVM has a function that does …

When you produce a boolean vector using a comparison, like …

When you don't produce a boolean vector using a comparison, Rust calls …

This is the bug. Also, this bug is completely independent of the width of a boolean vector. Even if we added …

So "Do we need wider boolean vector types?" and "boolean reductions produce bad codegen" are two completely unrelated issues.
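A scalar sketch of the reduction codegen one would hope for (the name `all_model` is hypothetical): instead of extracting and testing each lane, treat the whole 128-bit mask register as one wide integer and do a single compare against all-ones. This assumes the lanes already hold 0xFFFF/0x0000, which is exactly the invariant a comparison guarantees:

```rust
// Sketch of an efficient `all()` reduction: view the 8 mask lanes as one
// 128-bit integer and do a single compare, rather than per-lane extracts.
// (Assumes lanes are already all-ones or all-zeros, as comparisons ensure.)
fn all_model(mask: [u16; 8]) -> bool {
    let mut wide: u128 = 0;
    for (i, &lane) in mask.iter().enumerate() {
        wide |= (lane as u128) << (16 * i);
    }
    wide == u128::MAX // 8 lanes of 0xFFFF == 128 set bits
}

fn main() {
    assert!(all_model([0xFFFF; 8]));
    assert!(!all_model([0xFFFF, 0xFFFF, 0, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF]));
}
```

The point of the sketch is that the reduction's cost should not depend on scalarizing the lanes; when the comparison invariant is known, the whole register can be tested at once.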
If you want to argue for wider boolean vectors here using examples of bad codegen to motivate their need, you should be using examples that do not trigger the codegen bug. Otherwise we can't tell whether the bad codegen is due to the bug, or due to the type not being wide enough.
I intentionally filed the reduction perf bug separately and showed a bitmask example when discussing width. Regardless of reductions and regardless of the bitmask case, which I haven't benchmarked (due to not knowing the answer to the question below), I'm worried that having a type whose memory representation differs from what can be loaded into the efficient register representation using …
With the …
I understand the worry, but right now it is just that: a worry.
I don't see this case implemented. What am I missing?
Nothing, the error is correct; the implementation is currently missing. Looks like an oversight. EDIT: filed #370 to track this.
This has been implemented already. |
I've come around (thanks @hsivonen for being persuasive) and I think it makes sense to implement boolean vectors as part of `stdsimd`.

I think we could take them from the `simd` crate (we don't need to be 1:1 on par in functionality in the initial implementation; we can add more features to boolean vectors later in a backwards-compatible way).

I think that none of this must block stabilization (although @hsivonen might disagree). First, IIUC the timeline is as follows: someday somebody will integrate this as part of `core`, and `stdsimd` will be shipped to nightly users behind some feature flag. We should try hard not to break stuff from then on, but we can technically still break some things (like changing the return type of an unstable intrinsic to use boolean vectors). One release cycle later we will want to start stabilizing stuff. For that we will probably have to white-list intrinsics as stable on a case-by-case basis. Once we have done that, we need to write an RFC with all those intrinsics inside and submit it to the RFC repo.

If we feel that boolean vectors aren't ready by then, we don't stabilize them nor any intrinsic that we think might want to use them in the future. But I'd like to defer that discussion until then. For all we know, boolean vectors might turn out to be ready, and we might want to just ship them.
Thoughts?