Prepare portable packed vector types for RFCs #338


Merged
merged 32 commits into from
Mar 5, 2018

Conversation

gnzlbg
Contributor

@gnzlbg gnzlbg commented Mar 2, 2018

This commit cleans up the implementation of the Portable Packed SIMD Vectors
(PPSV), adds some new features, and makes some breaking changes.

The implementation is moved to coresimd/ppsv (they are
still exposed via coresimd::simd).

As before, the vector types of a certain width are implemented in the v{width}
submodules. The macros.rs file has been rewritten as an api module that
exposes the macros to implement each API.

It should now hopefully be really clear where each API is implemented, and which types
implement these APIs. It should also now be really clear which APIs are tested and how.

Additions

  • boolean vectors of the form b{element_size}x{number_of_lanes}.
  • reductions: arithmetic (sum, product), bitwise (and, or, xor), min/max, and boolean (all, any, none) - mainly implemented via llvm.experimental.vector.reduce.{...} modulo bugs.
  • FromBits trait analogous to {f32,f64}::from_bits that performs "safe" transmutes.
    Instead of writing From::from/x.into() (see below for breaking changes), one now writes
    FromBits::from_bits(x)/x.into_bits().
  • portable vector types implement Default and Hash
  • tests for all portable vector types and all portable operations (~2000 new tests; this hurts the compile time of cargo test a lot, and increases the memory requirements...).
  • (hopefully) comprehensive implementation of bitwise transmutes and lane-wise
    casts (before, From and the .as_... methods were implemented "when they were needed").
  • documentation for PPTV (not great yet, but better than nothing)
  • conversions/transmutes from/to x86 architecture specific vector types

Breaking changes

  • {store,load}{,_unchecked} API has been replaced with {store,load}_{aligned,unaligned}{,_unchecked}
  • eq,ne,lt,le,gt,ge APIs now return boolean vectors
  • The .as_{...} methods have been removed. Lane-wise casts are now performed via From/Into.
  • From/Into traits now perform lane-wise casts (see above). Previously they used to perform bitwise transmutes.
  • simd vectors' replace method's result is now #[must_use]; executing replace and dropping the result is an easy mistake to make.

stdsimd/mod.rs Outdated
@@ -240,101 +241,6 @@
/// we'll be using SSE4.1 features to implement hex encoding.
///
/// ```
/// #![feature(cfg_target_feature, target_feature, stdsimd)]
Contributor Author

I have no idea about what happened here. Did rustfmt delete all of this?

@gnzlbg gnzlbg Mar 2, 2018

Yep, it looks like rustfmt just deleted all of this (the fmt commit equals the previous one + cargo fmt --all). @nrc does this look familiar? It looks like rustfmt deleted a chunk of a comment :/

@alexcrichton alexcrichton left a comment

This looks awesome, thanks @gnzlbg!


#[cfg_attr(feature = "cargo-clippy", allow(expl_impl_clone_on_copy))]
impl Clone for $id {
#[inline] // currently needed for correctness
Member

FWIW I don't think this should be necessary any more w/ the changes in upstream rust-lang/rust

Contributor Author

The tests passed without this, I just saw that the types in the x86 module were still doing this and decided to add it for consistency. I can remove them there as well.

Member

Oh yeah I think at this point in time they can all move to #[derive(Clone)]

impl $id {
/// Lane-wise addition of the vector elements.
#[inline(always)]
pub fn add(self) -> $elem_ty {
Member

For these reductions I'd personally only expect add and mul to be here (but called sum and product to avoid shadowing Add::add and Mul::mul). The sub, div, and rem reductions seem odd, although perhaps someone's requested them before?

@gnzlbg gnzlbg Mar 3, 2018

So llvm only provides llvm.experimental.vector.reduce.{add, fadd, mul, fmul, and, or, xor, smax, smin, umax, umin}, which means sub, div, and rem cannot really be implemented here any better than in a third-party crate. I provided them for completeness, but I think it makes sense to leave them out. Writing tests for them felt weird.

Contributor Author

Done, I've renamed add/mul to sum/product (just like the Iterator methods) and removed the sub/div/rem reductions.

($id:ident, $elem_ty:ident) => {
impl $id {
/// Lane-wise bitwise `and` of the vector elements.
pub fn and(self) -> $elem_ty {
Member

Sort of like the div/rem reductions above are we sure these make sense to add as well? There's certainly nothing wrong with them they just seem a little odd I think in terms of functionality.

I think we'll also want to perhaps select different names to avoid conflicts with Or::or and such.

Member

Oh in the meantime though I think we'll want #[inline] on these methods.

Contributor Author

So all three of these are provided by llvm. They are necessary to implement the reductions of boolean vectors (all, any, none), but maybe we shouldn't provide them for the integer types (llvm supports that, though).

Contributor Author

I am leaving these here for now since I can just map them directly to the llvm intrinsics. We might just decide to never stabilize these and expose them only for boolean vectors via all,any,none.

}

impl ops::AddAssign for $id {
#[inline(always)]
Member

FWIW technically #[inline(always)] isn't needed for anything, but there's also not much harm vs #[inline] I think for such small methods.

Contributor Author

I thought about this and recalled that with ThinLTO #[inline(always)] shouldn't be necessary anymore, but then looked at what the types in the x86 module were doing (and the vector types before that), and they were all still using #[inline(always)].

I think we should do a pass through the library and see if we can replace most of the #[inline(always)] attributes with just #[inline].

Contributor Author

I've filed #340 for this.

if i > 0 {
write!(f, ", ")?;
}
write!(f, "{:#x}", self.extract(i))?;
Member

I think here if you do self.extract(i).fmt(f) it'll automatically forward formatting flags like # which means we may not need to hardcode the #

Contributor Author

Done.

}
unsafe {
let mut bytes: A = mem::uninitialized();
self.store_aligned_unchecked(&mut bytes.data);
Member

Could this function be simplified to:

A { vec: *self }.data.hash(state)

Contributor Author

I'll give it a try.

I should file an issue to discuss the semantics of this: it basically just hashes the vector as a slice of bytes. This is a bit different from how Hash works for arrays, where first the length is hashed, and then each element is hashed.

Contributor Author

Done.

stdsimd/mod.rs Outdated
//! slots[1] = hex(*byte & 0xf);
//! }
//! }
//! ```
Member

This was all actually intended to be the rustdoc documentation for the arch module (which will show up in libstd's docs soon), mind leaving it on the arch module instead of the stdsimd module?

Contributor Author

Ah! I got a complaint about the stdsimd module not having any documentation and thought this was the other way around, I'll change this.

Contributor Author

Done.

@gnzlbg
Contributor Author

gnzlbg commented Mar 3, 2018

@alexcrichton so I was able to implement most of the reductions on top of llvm.experimental.vector.reduce.{...}. However, for floating-point vector types, sum and product produce code-gen errors with everything I've tried (passing 0. as $elem_ty and mem::uninitialized() as an accumulator).

@alexcrichton
Member

Hm I wonder if the added tests are stressing out rustc a bit much? Travis looks like it's timing out quite a lot :(

@alexcrichton
Member

Looks good to me to merge modulo CI

@gnzlbg
Contributor Author

gnzlbg commented Mar 4, 2018

@alexcrichton yes build times have doubled. I could split the simd types into their own crate, and import it from coresimd.

@alexcrichton
Member

Hm, I don't think splitting crates will be feasible due to integration into libstd. Do you know why this takes so long to compile?

@gnzlbg
Contributor Author

gnzlbg commented Mar 5, 2018

I think it's because of the tests (compiling coresimd without tests is slower than before, but still pretty quick).

@gnzlbg
Contributor Author

gnzlbg commented Mar 5, 2018

I am also unable to recompile the crate without doing a cargo clean first. Doing a cargo test, a one-line change, then cargo test again makes the second cargo test require huge amounts of memory and HDD space.

gnzlbg added 19 commits March 5, 2018 11:26
This commit cleans up the implementation of the Portable Packed Vector Types
(PPTV), adds some new features, and makes some breaking changes.

The implementation is moved to `coresimd/src/ppvt` (they are
still exposed via `coresimd::simd`).

As before, the vector types of a certain width are implemented in the `v{width}`
submodules. The `macros.rs` file has been rewritten as an `api` module that
exposes the macros to implement each API.

It should now hopefully be really clear where each API is implemented, and which types
implement these APIs. It should also now be really clear which APIs are tested and how.

- boolean vectors of the form `b{element_size}x{number_of_lanes}`.
- reductions: arithmetic, bitwise, min/max, and boolean - only the facade,
  and a naive working implementation. These need to be implemented
  as `llvm.experimental.vector.reduce.{...}` but this needs rustc support first.
- FromBits trait analogous to `{f32,f64}::from_bits` that performs "safe" transmutes.
  Instead of writing `From::from`/`x.into()` (see below for breaking changes) now you write
  `FromBits::from_bits`/`x.into_bits()`.
- portable vector types implement `Default` and `Hash`
- tests for all portable vector types and all portable operations (~2000 new tests).
- (hopefully) comprehensive implementation of bitwise transmutes and lane-wise
  casts (before, `From` and the `.as_...` methods were implemented "when they were needed").
- documentation for PPTV (not great yet, but better than nothing)
- conversions/transmutes from/to x86 architecture specific vector types

- `store/load` API has been replaced with `{store,load}_{aligned,unaligned}`
- `eq,ne,lt,le,gt,ge` APIs now return boolean vectors
- The `.as_{...}` methods have been removed. Lane-wise casts are now performed by `From`.
- `From` now performs casts (see above). It used to perform bitwise transmutes.
- `simd` vectors' `replace` method's result is now `#[must_use]`.
@hanna-kruppe

Why are boolean vectors represented as integer vectors containing 0 or -1? I know that some popular platforms don't have native vectors with one bit elements but why does that impact the portable types?

@gnzlbg
Contributor Author

gnzlbg commented Mar 5, 2018

@rkruppe Those are the values that the simd comparison instructions return on at least x86 to represent true (0xFFF...) and false (0x000...), and those are the values that the portable vector comparisons (and the reductions) of LLVM also return. Since the boolean vectors are currently implemented as vectors of iX, !(0 as iX) and 0 as iX are what they use.

The API of the boolean vectors doesn't expose these values (it translates them to/from bool), and it also doesn't expose any conversions or bitwise transmutes from other vector types.

In any case, we should document this, since this is relevant for those explicitly calling mem::transmute, which is something one might want to do to transmute the result of an intrinsic returning a comparison into a boolean vector type.

@hanna-kruppe

Those are the values that the simd comparison instructions return on at least x86

I know, that's what I was alluding to, but again, why does that have to impact the portable types? This is like saying i64 arithmetic doesn't exist on 32-bit targets so the portable i64xN types should be implemented in terms of i32xM for M = 2N. When a target doesn't support some portable vector type, the type should be legalized by the backend. (Sometimes that doesn't work in practice, like for i128 on some targets, but I know that i1 vectors can be legalized on x86 at least.)

and those are the values that the portable vector comparisons (and the reductions) of LLVM also return.

That is not true. icmp eq <N x i32> returns <N x i1>, for example. I haven't checked reductions but since they are overloaded, in principle they should also work with i1.

Clang does lower C-language vector compares to icmp (returning an i1 vector) + sext so that the end result is an integer vector containing 0 and -1, but this is a front end decision (inherited from GCC), not anything inherent about LLVM IR.

@gnzlbg
Contributor Author

gnzlbg commented Mar 5, 2018

That is not true. icmp eq <N x i32> returns <N x i1>, for example. I haven't checked reductions but since they are overloaded, in principle they should also work with i1.

I'll give this a try.

@gnzlbg
Contributor Author

gnzlbg commented Mar 5, 2018

@rkruppe so I tried to use bools, but got a "SIMD vector element type should be machine type" error: https://play.rust-lang.org/?gist=a0331d2eb68fec6c5e32b8a49356cd8d&version=nightly

@hanna-kruppe

Come to think of it, I actually doubt there is a portable way to expose <N x i1> to Rust -- because Rust types can be stored in memory, but <N x i1> is stored either like <N x i8> (one byte per element) or iN (packed, individual bits not addressable) depending on the target. And indeed rustc won't let you do #[repr(simd)] struct BoolVec(bool, bool, ...) currently.

Probably best to use a memory layout compatible with [bool; N] or (bool, bool, ...): u8 elements, 0 or 1.

Cargo.toml Outdated
lto = false
debug-assertions = true
codegen-units = 1
panic = 'unwind'
Contributor

Minor nitpick: should be a newline at the end of this file.

Contributor Author

I will revert these changes before merging. These profiles were added only to test if setting codegen-units to 1 would improve either compiletimes or remove some issues of incremental compilation.

@gnzlbg
Contributor Author

gnzlbg commented Mar 5, 2018

@rkruppe the question is, should we only expose b8x{2,4,6,8,32,74} or should we also expose wider boolean vector types?

From the POV of portable operations exposing b8xN should be enough, because the actual size of the boolean vector type is irrelevant.

From the POV of the architecture specific intrinsics, some of them return "boolean vectors" of a larger width stored in either integer or floating-point registers. I don't know if b8xN is enough to provide type-safe, zero-runtime-cost wrappers around these intrinsics. The original simd crate had types like b32fx8 probably for this purpose.

@gnzlbg
Contributor Author

gnzlbg commented Mar 5, 2018

Come to think of it, I actually doubt there is a portable way to expose to Rust -- because Rust types can be stored in memory, but is stored either like (one byte per element) or iN (packed, individual bits not addressable) depending on the target.

Could you elaborate on this? (or are you on IRC?). My question is: why does this matter?

@gnzlbg
Contributor Author

gnzlbg commented Mar 5, 2018

@alexcrichton so this should be good to go modulo compile-times.

@alexcrichton
Member

Ok thanks! I'd like to dig in a bit first and investigate compile times; I'll do that now.

@alexcrichton
Member

Hm, ok, there may have been a recent rustc regression that was since fixed; in any case it looks like it's not too slow now. And yeah, it's almost entirely tests, which we can of course move around later if need be. Thanks again @gnzlbg!
