Intro SPU Optimizations Part 1
March 5, 2010
Introduction
These slides are used internally at Naughty Dog to introduce new programmers to our SPU programming methods. Due to popular interest, we are now making them public. Note that some of the tools we use have not been released to the public, but there exist many alternatives that do similar things. The first set of slides introduces most of the SPU assembly instructions. Please read these carefully before reading the second set. Those slides go through a made-up example showing how one can improve performance drastically by knowing the hardware and by employing a technique called software pipelining.
In these slides, we will go through all of the assembly instructions that exist on the SPU, giving you a quick introduction to the power of the SPUs. Each SPU has 256 kB of local memory. This local memory can be thought of as 1 cycle memory. Programs and data exist in the same local memory space. There are no memory protections in local memory! The only way to access external memory is through DMA. There is a significant delay between when a DMA request is queued and when it finishes.
The SPU has 128 general purpose 128-bit wide registers. You can think of these as
2 doubles (64-bit floating-point values),
4 floats (32-bit floating-point values),
4 words (32-bit integer values),
8 half-words (16-bit integer values), or
16 bytes (8-bit integer values).
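The different views of one register can be illustrated with a small sketch. This is only a model in Python, not SPU code: a register is represented as 16 raw bytes, and the big-endian layout is the one these slides describe for the quadword. The variable names are made up for illustration.

```python
import struct

# One 128-bit register modeled as 16 raw bytes, most significant byte first.
reg = struct.pack(">4f", 1.0, 2.0, 3.0, 4.0)

as_floats = struct.unpack(">4f", reg)   # 4 x 32-bit floating-point values
as_words = struct.unpack(">4I", reg)    # 4 x 32-bit integer values
as_halves = struct.unpack(">8H", reg)   # 8 x 16-bit integer values
as_bytes = tuple(reg)                   # 16 x 8-bit integer values

print([hex(w) for w in as_words])
```

The same 16 bytes are reinterpreted each time; nothing is converted. Which view applies depends only on the instruction that consumes the register.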
Instruction Classes
The instruction set can be put in classes, where the instructions in the same class issue on the same pipeline (even or odd) and have the same latency (how long it takes for the result to be ready). The tag in braces gives the pipeline (e = even, o = odd) followed by the latency in cycles:
(SP) Single Precision {e6}
(FX) FiXed {e2}
(WS) Word Shift {e4}
(LS) Load/Store {o6}
(SH) SHuffle {o4}
(FI) Fp Integer {e7}
(BO) Byte Operations {e4}
(BR) BRanch {o-}
(HB) Hint Branch {o15}
(CH) CHannel Operations {o6}
(DP) Double Precision {e13}
The syntax here indicates that the operation in the comment is executed for each of the four 32-bit floating-point values in the register.
Example:
If the registers r1 and r2 contain
r1 = ( 1.0, 2.0, 3.0, 4.0 ),
r2 = ( 0.0, -2.0, 1.0, 4.0 ),
then after
fa r0, r1, r2 ; r0 = r1 + r2
r0 contains
r0 = ( 1.0, 0.0, 4.0, 8.0 ).
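The lane-wise behaviour of this example can be modeled in a few lines. This is a sketch in Python, not SPU code; the function name `fa` simply mirrors the mnemonic, and Python floats are doubles rather than the SPU's single precision (the values here are exact in both).

```python
def fa(j, k):
    """Model of `fa i, j, k`: add the four float lanes independently."""
    return tuple(a + b for a, b in zip(j, k))

r1 = (1.0, 2.0, 3.0, 4.0)
r2 = (0.0, -2.0, 1.0, 4.0)
r0 = fa(r1, r2)
print(r0)
```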
The FX class of instructions all have a latency of just 2 cycles and a throughput of 1 cycle. These are even instructions. There are quite a few of them, and we can further divide them into:
Integer Arithmetic Operations.
Immediate Load Operations.
Comparison Operations.
Select Bit Operation.
Logical Bit Operations.
Extensions and Misc Operations.
Example:
ilhu ones, 0x3f80 ; ones = (1.0, 1.0, 1.0, 1.0)
ila magic, 0x10203 ; magic = (0x00010203_00010203_00010203_00010203)
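To see why `ilhu` with 0x3f80 yields 1.0 in every lane, here is a small Python model of the two immediate loads used above. It is a sketch, assuming `ilhu` places its 16-bit immediate in the upper half of each word and `ila` places its 18-bit immediate, zero-extended, in each word; the helper `words_as_floats` is made up for illustration.

```python
import struct

def ilhu(imm16):
    """Immediate load half-word upper: imm16 goes into the top 16 bits of each word."""
    word = (imm16 & 0xFFFF) << 16
    return (word,) * 4

def ila(imm18):
    """Immediate load address: the 18-bit immediate is zero-extended into each word."""
    word = imm18 & 0x3FFFF
    return (word,) * 4

def words_as_floats(words):
    """Reinterpret four 32-bit words as four single-precision floats."""
    return struct.unpack(">4f", struct.pack(">4I", *words))

ones = ilhu(0x3F80)      # 0x3f800000 is the bit pattern of 1.0f
magic = ila(0x10203)
print(words_as_floats(ones), [hex(w) for w in magic])
```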
TRUE = 0xFF, FALSE = 0x00. (s) means signed and (u) means unsigned compares.
TRUE = 0xFFFF_FFFF, FALSE = 0x0000_0000. Note: All zeros are equal, e.g.: 0.0 == -0.0.
FX: Misc
generate borrow bit
bg i, j, k ; tmp.w[n] = (-j.w[n] + k.w[n])
           ; i.w[n] = tmp.w[n] < 0 ? 0 : 1
generate borrow bit with borrow
bgx i, j, k ; tmp.w[n] = (-j.w[n] + k.w[n] + (i.w[n] & 1) - 1)
            ; i.w[n] = tmp.w[n] < 0 ? 0 : 1
generate carry bit
cg i, j, k ; i.w[n] = (j.w[n] + k.w[n]) > 0xffffffff ? 1 : 0
generate carry bit with carry
cgx i, j, k ; tmp.w[n] = (j.w[n] + k.w[n] + (i.w[n] & 1))
            ; i.w[n] = tmp.w[n] > 0xffffffff ? 1 : 0
add with carry bit
addx i, j, k ; i.w[n] = (j.w[n] + k.w[n] + (i.w[n] & 1))
subtract with borrow bit
sfx i, j, k ; i.w[n] = (-j.w[n] + k.w[n] + (i.w[n] & 1) - 1)
sign-extend byte to half-word
xsbh i, j ; i.h[n] = ext(j.h[n] & 0xff)
sign-extend half-word to word
xshw i, j ; i.w[n] = ext(j.w[n] & 0xffff)
sign-extend word to double-word
xswd i, j ; i.d[n] = ext(j.d[n] & 0xffffffff)
count leading zeros
clz i, j ; i.w[n] = leadingZeroCount(j.w[n])
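The carry instructions exist so that wider additions can be built from 32-bit lanes. The following Python sketch models `cg` and `addx` for a single lane and chains them into a 64-bit add. The helper `add64` is hypothetical and only illustrates the idea; on real hardware the carry is produced in the lane of the low words and would first have to be moved into the lane of the high words, a step this model skips.

```python
MASK32 = 0xFFFFFFFF

def cg(j, k):
    """Generate carry bit: 1 if the 32-bit add j + k overflows, else 0."""
    return 1 if (j + k) > MASK32 else 0

def addx(i, j, k):
    """Add with carry bit: j + k plus the least significant bit of i."""
    return (j + k + (i & 1)) & MASK32

def add64(a_hi, a_lo, b_hi, b_lo):
    """64-bit add built from 32-bit pieces (hypothetical helper)."""
    carry = cg(a_lo, b_lo)            # carry out of the low words
    lo = (a_lo + b_lo) & MASK32       # plain 32-bit add of the low words
    hi = addx(carry, a_hi, b_hi)      # high words plus the carry
    return hi, lo

print(add64(0, 0xFFFFFFFF, 0, 1))
```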
Notice that there is an independent shift amount for each of the shlh and shl versions, i.e., this is truly SIMD!
Example
; Assume r0 = ( 1, 2, 4, 8 )
;        r1 = ( 1, 2, 3, 4 )
shl r2, r0, r1
; Now r2 = ( 1<<1, 2<<2, 4<<3, 8<<4 )
;        = ( 2, 8, 32, 128 )
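The per-lane shift can be checked with a short Python model. This is a sketch, assuming `shl` uses the low 6 bits of each lane's shift amount and produces zero for counts above 31.

```python
def shl(j, k):
    """Model of `shl i, j, k`: each word lane is shifted by its own amount."""
    out = []
    for value, amount in zip(j, k):
        count = amount & 0x3F                      # only 6 bits of the amount are used
        out.append((value << count) & 0xFFFFFFFF if count < 32 else 0)
    return tuple(out)

r0 = (1, 2, 4, 8)
r1 = (1, 2, 3, 4)
r2 = shl(r0, r1)
print(r2)
```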
Notice here that the shift amounts need to be negative in order to produce a proper shift. This is because this is actually a rotate left and then mask operation.
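The effect of the negative shift amount can be modeled as follows. This Python sketch assumes the word form `rotm`, where the effective right-shift count is the negated amount taken modulo 64 and counts of 32 or more give zero.

```python
def rotm(j, k):
    """Model of `rotm i, j, k`: logical right shift by the negated amount."""
    out = []
    for value, amount in zip(j, k):
        count = (-amount) & 0x3F                   # negate, keep 6 bits
        out.append((value & 0xFFFFFFFF) >> count if count < 32 else 0)
    return tuple(out)

# A shift amount of -4 shifts right by 4; a positive 4 shifts everything out.
print(rotm((0x80, 0x80, 0x80, 0x80), (-4, -1, 0, 4)))
```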
The load/store operations are odd instructions that work on the 256 kB local memory. They have a latency of 6 cycles, but the hardware has short-cuts in place so that you can read a written value immediately after the store. Do note:
Memory wraps around, so you can never access memory outside the local store (LS).
You can only load and store a whole quadword, so if you need to modify a part, you need to load the quadword value, merge the modified part into the value and store the whole quadword back.
Addresses are in units of bytes, unlike the VUs on the PS2.
The load/store operations will use the value in the preferred word of the address register, i.e.: the first word.
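The load, merge, store sequence for a partial update can be sketched in Python. This is only a model: `local_store` stands in for the 256 kB local store, and the helper names are made up for illustration. Addresses wrap around the local store and are rounded down to a quadword boundary, as described above.

```python
LS_SIZE = 256 * 1024
local_store = bytearray(LS_SIZE)

def quad_base(addr):
    """Wrap the address into the local store and align it to 16 bytes."""
    return addr & (LS_SIZE - 1) & ~0xF

def load_quadword(addr):
    base = quad_base(addr)
    return bytearray(local_store[base:base + 16])

def store_quadword(addr, quad):
    base = quad_base(addr)
    local_store[base:base + 16] = quad

def store_word(addr, value):
    """Write one 32-bit word by rewriting the quadword that contains it."""
    quad = load_quadword(addr)                     # 1. load the whole quadword
    offset = addr & 0xC                            # word slot inside the quadword
    quad[offset:offset + 4] = value.to_bytes(4, "big")  # 2. merge the new word
    store_quadword(addr, quad)                     # 3. store the quadword back

store_word(0x24, 0xDEADBEEF)
print(local_store[0x20:0x30].hex())
```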
LS: Loads
lqa i, label18 ; addr = label18
               ; range = 256kb (or +/- 128kb)
lqd i, qoff(j) ; addr = qoff * 16 + j.w[0]
               ; qoff is 10 bit signed, addr range = +/-8kb.
lqr i, label14 ; addr = ext(label14) + pc
               ; label14 range = +/- 8kb.
lqx i, j, k    ; addr = j.w[0] + k.w[0]
LS: Stores
stqa i, label18 ; addr = label18
                ; range = 256kb (or +/- 128kb)
stqd i, qoff(j) ; addr = qoff * 16 + j.w[0]
                ; qoff is 10 bit signed, addr range = +/-8kb.
stqr i, label14 ; addr = ext(label14) + pc
                ; label14 range = +/- 8kb.
stqx i, j, k    ; addr = j.w[0] + k.w[0]
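The address arithmetic of the d-form and x-form loads and stores can be modeled like this. It is a Python sketch, assuming a 256 kB local store, wrap-around addressing, and that the low four address bits are ignored because only whole quadwords are transferred. The function names are made up for illustration.

```python
LS_MASK = 0x3FFFF   # 256 kB local store

def quad_align(addr):
    """Wrap into the local store and drop the low four address bits."""
    return addr & LS_MASK & ~0xF

def d_form_address(qoff, j_word0):
    """lqd/stqd: signed 10-bit quadword offset plus the preferred word of j."""
    if not -512 <= qoff <= 511:
        raise ValueError("qoff must fit in 10 signed bits")
    return quad_align(qoff * 16 + j_word0)

def x_form_address(j_word0, k_word0):
    """lqx/stqx: sum of the preferred words of j and k."""
    return quad_align(j_word0 + k_word0)

print(hex(d_form_address(2, 0x1004)), hex(x_form_address(0x3FFF0, 0x20)))
```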
The shuffle operations all have 4 cycle latency and they are odd instructions. Most of the instructions in this class deal with the whole quadword. We can divide the SH class into:
The Shuffle Bytes Instruction.
Quadword left-shifts, rotates and right-shifts.
Creation of Shuffle Masks.
Form Select Instructions.
Gather Bit Instructions.
Reciprocal Estimate Instructions.
The ordering of bytes, half-words and words within the quadword is shown below. Notice that this is big-endian, not little-endian:
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | a | b | c | d | e | f |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|   0   |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       0       |       1       |       2       |       3       |
+---------------+---------------+---------------+---------------+
The shuffle byte instruction shufb takes three inputs: two source registers r0, r1, and a shuffle mask msk. The output register d is found by running the following logic on each byte within the input registers:
if x in 0x80 .. 0xbf: d.b[n] = 0x00
if x in 0xc0 .. 0xdf: d.b[n] = 0xff
if x in 0xe0 .. 0xff: d.b[n] = 0x80
This is very powerful stuff!
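The per-byte logic of shufb can be written out as a small Python model. This is a sketch, not SPU code: registers are 16-byte sequences, and mask bytes below 0x80 are assumed to select from the 32-byte concatenation of r0 and r1 using their low five bits, with the special ranges behaving as listed above.

```python
def shufb(r0, r1, msk):
    """Model of `shufb d, r0, r1, msk` on 16-byte registers."""
    src = bytes(r0) + bytes(r1)        # bytes 0x00-0x0f from r0, 0x10-0x1f from r1
    out = bytearray(16)
    for n, x in enumerate(msk):
        if x >= 0xE0:
            out[n] = 0x80
        elif x >= 0xC0:
            out[n] = 0xFF
        elif x >= 0x80:
            out[n] = 0x00
        else:
            out[n] = src[x & 0x1F]     # pick a source byte
    return bytes(out)

r0 = bytes(range(0x00, 0x10))
r1 = bytes(range(0x10, 0x20))
# Reverse the words of r0, then fill the last word with each special constant in turn.
word_reverse = bytes([12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3])
print(shufb(r0, r1, word_reverse).hex())
```

Because every output byte is chosen independently, one instruction can permute, broadcast, merge two registers, or insert constants.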
Using these masks, we can quickly create registers with all xs, ys, zs or ws:
shufb xs, v, v, s_AAAA ; xs = (v.x, v.x, v.x, v.x)
shufb zs, v, v, s_CCCC ; zs = (v.z, v.z, v.z, v.z)
SH: .dshuf
Because the shuffle instruction is so useful, our frontend tool supports quick creation of shuffle masks. Using the .dshuf directive, we create shuffle masks according to the following rules:
If the length of the string is 4, we assume word-sized shuffles; if 8, half-word sized; and if 16, byte-sized shuffles.
Upper-cased letters indicate sources from the first input, lower-cased ones indicate sources from the second input.
0 indicates zeros, X ones and 8 0x80s.
.dshuf "ABC0" ; 0x00010203_04050607_08090a0b_80808080
.dshuf "aX08" ; 0x10111213_c0c0c0c0_80808080_e0e0e0e0
.dshuf "aBC0aBC0" ; 0x1011_0203_0405_8080_1011_0203_0405_8080
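The .dshuf rules are simple enough to reproduce. The Python function below is a hypothetical re-implementation written from the rules and the three examples above; it is not the actual frontend tool, and its error handling is an assumption.

```python
def dshuf(pattern):
    """Build a 16-byte shuffle mask from a .dshuf-style string (hypothetical helper)."""
    if len(pattern) not in (4, 8, 16):
        raise ValueError("pattern must have 4, 8 or 16 characters")
    size = 16 // len(pattern)                  # bytes per element: 4, 2 or 1
    out = bytearray()
    for ch in pattern:
        if ch == "0":
            out += bytes([0x80]) * size        # shufb turns 0x80 into 0x00
        elif ch == "X":
            out += bytes([0xC0]) * size        # shufb turns 0xc0 into 0xff
        elif ch == "8":
            out += bytes([0xE0]) * size        # shufb turns 0xe0 into 0x80
        elif ch.isalpha():
            element = ord(ch.upper()) - ord("A")
            if element >= len(pattern):
                raise ValueError("element letter out of range: " + ch)
            base = element * size + (0x10 if ch.islower() else 0x00)
            out += bytes(range(base, base + size))
        else:
            raise ValueError("unexpected character: " + ch)
    return bytes(out)

print(dshuf("ABC0").hex(), dshuf("aX08").hex())
```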
n) & 0x8000 ) ? 0xff : 0x00
n) & 0x8000 ) ? 0xff : 0x00
n) & 0x80 ) ? 0xffff : 0x00
n) & 0x8 ) ? 0xffffffff : 0x00
Example:
fsmbi selABCd, 0x000f; make select mask to get XYZ from first arg
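How this mask selects XYZ from one register and W from another can be shown with a Python model. This is a sketch, assuming `fsmbi` expands each bit of its 16-bit immediate (most significant bit first) into one mask byte, and that the mask is then consumed by a bitwise select such as `selb`, which takes bits from the second source wherever the mask is set.

```python
def fsmbi(imm16):
    """Form select mask for bytes from an immediate: one mask byte per bit."""
    return bytes(0xFF if (imm16 << n) & 0x8000 else 0x00 for n in range(16))

def selb(j, k, mask):
    """Bitwise select: bits of k where the mask is 1, bits of j elsewhere."""
    return bytes((a & ~m & 0xFF) | (b & m) for a, b, m in zip(j, k, mask))

selABCd = fsmbi(0x000F)                 # only the last four bytes (the W word) are set
first = bytes(range(0x00, 0x10))
second = bytes(range(0x10, 0x20))
print(selb(first, second, selABCd).hex())
```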
; b is good to 12 bits precision
; (b and a can share register)
; (c and x can share register)
; b is good to 24 bits precision
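The jump from 12 to 24 bits of precision is what one Newton-Raphson step gives, since each step roughly squares the relative error. The Python sketch below illustrates this for a reciprocal. It is an assumption about the kind of refinement meant here, not a transcription of the instruction sequence on the slide; the 12-bit estimate is faked by rounding, and Python doubles stand in for single-precision floats.

```python
def refine_reciprocal(x, b):
    """One Newton-Raphson step for b ~ 1/x, written as two fused operations."""
    c = 1.0 - x * b        # residual, in the style of a negative multiply-subtract
    return c * b + b       # corrected estimate, in the style of a multiply-add

x = 3.0
estimate = round((1.0 / x) * 4096) / 4096   # stand-in for a 12-bit estimate
refined = refine_reciprocal(x, estimate)

print(abs(estimate * x - 1.0), abs(refined * x - 1.0))
```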
Here precis is the precision as an immediate, so that e.g. cuflt fp, val, 8; converts 0x80 into 0.5 Also, please note that these instructions saturate to the min and max values of their precision.
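The role of the precision immediate and the saturation can be modeled in Python. This is a sketch, assuming `cuflt` divides the unsigned integer by 2^precis and that the opposite conversion `cfltu` multiplies by 2^precis, truncates, and clamps to the unsigned 32-bit range. Python doubles stand in for single-precision floats.

```python
def cuflt(val, precis):
    """Unsigned integer to float, treating val as fixed point with precis fraction bits."""
    return (val & 0xFFFFFFFF) / (1 << precis)

def cfltu(fp, precis):
    """Float to unsigned integer with precis fraction bits, saturating."""
    scaled = int(fp * (1 << precis))            # truncate toward zero
    return max(0, min(0xFFFFFFFF, scaled))      # clamp to the 32-bit unsigned range

print(cuflt(0x80, 8), hex(cfltu(0.5, 8)), hex(cfltu(1e20, 0)), cfltu(-1.0, 0))
```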
Branches on the SPU are costly. If a branch is taken and it has not been predicted, there is an 18 cycle penalty so that the chip can restart the pipe. There is no penalty for falling through a non-predicted branch. However, if you have predicted a branch and it is not taken, then there is also an 18 cycle penalty. Branches and branch hints are all odd instructions. Note: Even a static branch needs to be predicted. Note: This is one of the reasons why diverging control-paths are so difficult to optimize for.
goto label address
and Set Link: gosub label address, i.w[0] = return address, (*)
goto i.w[0]
and Set Link: gosub j.w[0], i.w[0] = return address, (*)
goto brTo
and Set Link: gosub label address, i.w[0] = return address (*)
(*): These instructions have a 4 cycle latency for the return register. Note: The bi instructions have enable/disable interrupt versions, e.g.: bie, bid, bisle, bisld.
If you know the most likely (or only) outcome for a branch, you can make sure the branch is penalty free as long as the hint occurs at least 15 cycles before the branch is taken. If the hint occurs later, there may still be a benefit, since the penalty is lowered. However, if the hint arrives less than 4 cycles before the branch, there is no benefit. Please note that there is also a hardware bug w.r.t. the hbr instructions: one cannot hint a branch whose target lies forward of the branch and within the same 64-byte block as the branch.
hint for any BIxxx type branch
hint for any BRAxxx type branch
hint for any BRxxx type branch
prefetch code (*)
DP instructions have a latency of 13 cycles and are even. However, they stall the pipeline for 6 cycles (that is, all currently executing instructions are halted) while the instruction executes. Therefore, we do not recommend using double precision at all!
Questions?