Simd-128 TC-39 JS
[Figure: vector processor diagram; values shown: 6.0, 5.0, 11.0]
DEMO: Mandelbrot // z(i+1) = z(i)^2 + c
// terminate when |z| > 2.0
// returns 4 iteration counts
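The demo comment above can be sketched in plain JavaScript without the SIMD module. This is a scalar emulation under stated assumptions: the function name `mandelbrot4` and its parameters are hypothetical, the four lanes are plain typed-array slots, and a real SIMD version would presumably replace the per-lane branch with a comparison mask.

```javascript
// Scalar emulation of a 4-lane Mandelbrot kernel (hypothetical helper,
// not part of the SIMD proposal). cx and cy hold the real and imaginary
// parts of c for each of the 4 lanes.
function mandelbrot4(cx, cy, maxIterations) {
  var zx = new Float32Array(4);    // real part of z per lane, starts at 0
  var zy = new Float32Array(4);    // imaginary part of z per lane
  var counts = new Int32Array(4);  // iteration count per lane
  var active = [true, true, true, true];

  for (var iter = 0; iter < maxIterations; iter++) {
    var anyActive = false;
    for (var lane = 0; lane < 4; lane++) {
      if (!active[lane]) continue;
      var x = zx[lane];
      var y = zy[lane];
      // |z| > 2.0 is tested as |z|^2 > 4.0 to avoid a square root.
      if (x * x + y * y > 4.0) {
        active[lane] = false;
        continue;
      }
      // z(i+1) = z(i)^2 + c, expanded into real and imaginary parts.
      zx[lane] = x * x - y * y + cx[lane];
      zy[lane] = 2.0 * x * y + cy[lane];
      counts[lane]++;
      anyActive = true;
    }
    if (!anyActive) break;  // every lane has escaped
  }
  return counts;  // 4 iteration counts
}

var counts = mandelbrot4([0, 3, 2, -1], [0, 0, 0, 0], 10);
```

Lanes that escape early simply stop counting while the others continue, which is the behaviour a masked SIMD loop would need to reproduce.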
● Fixed 128-bit vector types as close to the metal while remaining portable
○ SSE
○ NEON
○ Efficient scalar fallback could be implemented
● Polyfill + benchmarks
○ https://github.com/johnmccutchan/ecmascript_simd
SIMD in JavaScript
1. SIMD module
   a. New “value” types
   b. Composable operations
      i. Arithmetic
      ii. Logical
      iii. Comparisons
      iv. Reordering (shuffling)
      v. Conversions
2. Extension to Typed Data
   a. A new array type for each

float32x4        4 IEEE-754 32-bit Floating Point Numbers
int32x4          4 32-bit Signed Integers
float64x2        2 IEEE-754 64-bit Floating Point Numbers
Float32x4Array   Typed Array of float32x4
Int32x4Array     Typed Array of int32x4
Float64x2Array   Typed Array of float64x2
Object Hierarchy
SIMD
float32x4
(128-bits)
x y z w
“lanes”
Constructing
var a = SIMD.float32x4(1.0, 2.0, 3.0, 4.0);
var b = SIMD.float32x4.zero();
var c = a.withX(5.0);
5.0 2.0 3.0 4.0
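The value semantics of `withX` can be illustrated with a small scalar stand-in. This is only a sketch: `Float32x4Sketch` is a hypothetical name, not the proposed SIMD module, and it assumes that `withX` returns a new value and leaves the original untouched.

```javascript
// Hypothetical scalar stand-in for a float32x4 value (not the SIMD API).
function Float32x4Sketch(x, y, z, w) {
  // Math.fround rounds to the nearest 32-bit float, as a real lane would.
  this.x = Math.fround(x);
  this.y = Math.fround(y);
  this.z = Math.fround(z);
  this.w = Math.fround(w);
  Object.freeze(this);  // lanes cannot be mutated in place
}

Float32x4Sketch.zero = function () {
  return new Float32x4Sketch(0, 0, 0, 0);
};

// Returns a copy with lane x replaced; the receiver is unchanged.
Float32x4Sketch.prototype.withX = function (x) {
  return new Float32x4Sketch(x, this.y, this.z, this.w);
};

var a = new Float32x4Sketch(1.0, 2.0, 3.0, 4.0);
var b = Float32x4Sketch.zero();
var c = a.withX(5.0);  // lanes 5.0 2.0 3.0 4.0; a still starts with 1.0
```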
Arithmetic
var c = SIMD.float32x4.add(a,b);
function average(list) {
var n = list.length;
var sum = 0.0;
for (var i = 0; i < n; i++) {
sum += list[i];
}
return sum / n;
}
function average(f32x4list) {
var n = f32x4list.length;
var sum = SIMD.float32x4.zero();
for (var i = 0; i < n; i++) {
sum = SIMD.float32x4.add(sum, f32x4list.getAt(i));
}
var total = sum.x + sum.y + sum.z + sum.w;
return total / (n * 4);
}
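The two `average` functions can be compared in an engine without SIMD support by emulating the four lanes with a `Float32Array`. The sketch below assumes the input length is a multiple of 4; the names `averageScalar` and `averageBy4` are hypothetical.

```javascript
// Scalar reference: one addition per element.
function averageScalar(list) {
  var sum = 0.0;
  for (var i = 0; i < list.length; i++) {
    sum += list[i];
  }
  return sum / list.length;
}

// 4-lane emulation: each outer step consumes 4 consecutive elements,
// mirroring one float32x4 add. Assumes list.length is a multiple of 4.
function averageBy4(list) {
  var sum = new Float32Array(4);  // one partial sum per lane
  for (var i = 0; i < list.length; i += 4) {
    for (var lane = 0; lane < 4; lane++) {
      sum[lane] += list[i + lane];
    }
  }
  // Horizontal add of the 4 partial sums, as in sum.x + sum.y + sum.z + sum.w.
  var total = sum[0] + sum[1] + sum[2] + sum[3];
  return total / list.length;
}

var data = new Float32Array([1, 2, 3, 4, 5, 6, 7, 8]);
var scalarResult = averageScalar(data);
var laneResult = averageBy4(data);
```

Because the lane version adds in a different order, results can differ slightly from the scalar loop for inputs that are not exactly representable in 32-bit floats.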
Example
1.0 2.0 3.0 3.0 5.0 7.0 7.0 6.0 11.0 7.0 8.0 15.0
75.0
SIMD in JavaScript
;; Load list[i]
0x4ccddce 0f104c3807 movups xmm1,[eax+edi*0x1+0x7]   ; Load 4 floats
;; sum +=
0x4ccddde 0f59ca     addps xmm2,xmm1                 ; Add 4 floats
Shuffling
max = function(a, b) {
if (a > b) {
return a;
} else {
return b;
}
}
1.0 2.0 3.0 4.0
0.0 3.0 5.0 2.0
Branching
max = function(a, b) {
var greaterThan = SIMD.float32x4.greaterThan(a, b);
return SIMD.float32x4.select(a, b, greaterThan);
}
0xF 0x0
0xF 1.0 0.0 1.0
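The branch-free `max` can be sketched with plain typed arrays to show how a comparison mask drives a select. The helpers `greaterThan4`, `select4` and `max4` are hypothetical, and the argument order of `select4` (mask first) is chosen here for readability; it is not taken from the slides.

```javascript
// Per-lane comparison: all bits set (-1) where a > b, otherwise 0.
function greaterThan4(a, b) {
  var mask = new Int32Array(4);
  for (var lane = 0; lane < 4; lane++) {
    mask[lane] = a[lane] > b[lane] ? -1 : 0;
  }
  return mask;
}

// Bitwise blend: picks ifTrue where the mask is set, ifFalse elsewhere.
// The float bits are reinterpreted as integers so that & and | apply.
function select4(mask, ifTrue, ifFalse) {
  var t = new Int32Array(new Float32Array(ifTrue).buffer);
  var f = new Int32Array(new Float32Array(ifFalse).buffer);
  var out = new Int32Array(4);
  for (var lane = 0; lane < 4; lane++) {
    out[lane] = (mask[lane] & t[lane]) | (~mask[lane] & f[lane]);
  }
  return new Float32Array(out.buffer);
}

// No data-dependent branch decides the result; the mask does.
function max4(a, b) {
  return select4(greaterThan4(a, b), a, b);
}

var result = max4([1.0, 2.0, 3.0, 4.0], [0.0, 3.0, 5.0, 2.0]);
```

The ternary inside `greaterThan4` exists only because this is a scalar emulation; in hardware the comparison produces the mask directly.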
1. Unboxing
a. Boxed -> allocated in memory
b. Unboxed -> in CPU memory (in registers)
● Interpreter support:
○ In Nightly since early 2014. No flags needed
● IonMonkey:
○ Support has been prototyped for x86
○ Missing ARM port of register allocator
○ Ongoing refactoring of a generic register allocator before landing the JIT compiler support
○ Reuse work done for OdinMonkey
● OdinMonkey (for asm.js):
○ Current focus
○ Full x86 support planned for end of August in Nightly
Chrome/V8 implementation status
Benchmark         Scalar   SIMD   Speedup (scalar/SIMD)
MatrixMultiply        74     20   3.7
VectorTransform       30      6   5
MatrixMultiply        97     19   5.1
VectorTransform       33      8   4.1
● Practicality
○ Stream processing and auto vectorization have limited use cases
○ Variable width vectors cannot efficiently implement
■ Matrix multiplication
■ Matrix inversion
■ Vector transform
■ ….
● Portable performance
○ 128-bit is the only vector width supported by all architectures
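A vector transform illustrates why a fixed width of four 32-bit lanes fits these workloads: each column of a 4x4 matrix is exactly one 4-lane vector. The sketch below emulates that with typed arrays; `transform4` is a hypothetical name and the column-major layout is an assumption.

```javascript
// Multiplies a column-major 4x4 matrix by a 4-component vector.
// Each inner loop over `lane` corresponds to one 4-lane multiply-add:
// out += column[col] * splat(v[col]).
function transform4(m, v) {
  var out = new Float32Array(4);
  for (var col = 0; col < 4; col++) {
    for (var lane = 0; lane < 4; lane++) {
      out[lane] += m[col * 4 + lane] * v[col];
    }
  }
  return out;
}

// Translation by (10, 20, 30), stored column-major.
var translate = new Float32Array([
  1, 0, 0, 0,
  0, 1, 0, 0,
  0, 0, 1, 0,
  10, 20, 30, 1
]);
var moved = transform4(translate, [1, 2, 3, 1]);
```

With a vector width that is unknown at authoring time, the four column operations could not be written out this directly.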
Why fixed width and not variable width vectors (continued)?
● Abstraction
○ Stream processors can be built in software on top of SIMD-128
● Result of ‘typeof’:
○ “float32x4”, “float64x2”, “int32x4”
● Result of SIMD.float32x4(1,2,3,4).toString():
○ “float32x4(1,2,3,4)”
● Feature detection:
○ Fine grained feature detection
■ Something like: SIMD.optimized.<feature>
○ There are arch differences that will need exposure!
■ Two vector shuffle (Useful for 4x4 matrix transpose)
■ .signmask for NEON
■ Algorithm specific instructions where no overlap/equivalent exists
○ Inlined scalar fallbacks can help minimize performance hit across ISAs
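Fine-grained feature detection along these lines might look like the sketch below. Everything here is an assumption: `SIMD.optimized` is only the "something like" suggestion from the list above, and the feature name `twoVectorShuffle` and the helper names are hypothetical.

```javascript
// Returns true only if the (hypothetical) SIMD.optimized table reports
// that the given feature is natively accelerated.
function hasOptimized(simd, feature) {
  return Boolean(simd) &&
         Boolean(simd.optimized) &&
         simd.optimized[feature] === true;
}

// Picks an implementation strategy for a 4x4 matrix transpose.
function chooseTranspose(simd) {
  return hasOptimized(simd, "twoVectorShuffle") ? "simd" : "scalar";
}

// Guard against engines where the SIMD global does not exist at all.
var simdGlobal = typeof SIMD !== "undefined" ? SIMD : undefined;
var strategy = chooseTranspose(simdGlobal);
```

Code written this way falls back to the scalar path on an engine or architecture that lacks the instruction, instead of paying for a slow emulation.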
Stage 1 Ready?
● SSE*
○ V8
○ SpiderMonkey
○ Intel’s Crosswalk HTML5 runtime
● NEON
○ SpiderMonkey (In progress)
○ Dart VM*
Future Work
Polyfill repository
https://github.com/johnmccutchan/ecmascript_simd
Wikipedia
http://en.wikipedia.org/wiki/SIMD