
Introduction to SPU Optimizations

Part 1: Assembly Instructions

Pål-Kristian Engstad

March 5, 2010


Introduction

These slides are used internally at Naughty Dog to introduce new programmers to our SPU programming methods. Due to popular interest, we are now making them public. Note that some of the tools we use are not released to the public, but there are many alternatives out there that do similar things. The first set of slides introduces most of the SPU assembly instructions. Please read these carefully before reading the second set. Those slides go through a made-up example showing how one can improve performance drastically by knowing the hardware and by employing a technique called software pipelining.


SPU programming is Cool

In these slides, we will go through all of the assembly instructions that exist on the SPU, giving you a quick introduction to the power of the SPUs. Each SPU has 256 kB of local memory. This local memory can be thought of as 1-cycle memory. Programs and data exist in the same local memory space. There are no memory protections in local memory! The only way to access external memory is through DMA. There is a significant delay between when a DMA request is queued and when it finishes.


SPU Execution Environment

The SPU has 128 general-purpose registers, each 128 bits wide. You can think of each register as

    2 doubles (64-bit floating-point values),
    4 floats (32-bit floating-point values),
    4 words (32-bit integer values),
    8 half-words (16-bit integer values), or
    16 bytes (8-bit integer values).

An SPU executes an even and an odd instruction each cycle.

Even instructions are mostly arithmetic instructions, whereas the odd ones are load/store instructions, shuffles, branches and other special instructions.


Instruction Classes
The instruction set can be divided into classes. Instructions in the same class share the same pipeline (even or odd) and the same latency (how long it takes for the result to be ready). In the list below, {e6} means an even instruction with a latency of 6 cycles, and {o4} means an odd instruction with a latency of 4 cycles:

    (SP) Single Precision     {e6}
    (FX) FiXed                {e2}
    (WS) Word Shift           {e4}
    (LS) Load/Store           {o6}
    (SH) SHuffle              {o4}
    (FI) Fp Integer           {e7}
    (BO) Byte Operations      {e4}
    (BR) BRanch               {o-}
    (HB) Hint Branch          {o15}
    (CH) CHannel Operations   {o6}
    (DP) Double Precision     {e13}


Single Precision Floating Point Class (SP) [Even:6]

The SP class of instructions has a latency of 6 cycles and a throughput of 1 cycle. These are all even instructions.

    fa   a, b, c     ; a.f[n] = b.f[n] + c.f[n]
    fs   a, b, c     ; a.f[n] = b.f[n] - c.f[n]
    fm   a, b, c     ; a.f[n] = b.f[n] * c.f[n]
    fma  a, b, c, d  ; a.f[n] = b.f[n] * c.f[n] + d.f[n]
    fms  a, b, c, d  ; a.f[n] = b.f[n] * c.f[n] - d.f[n]
    fnms a, b, c, d  ; a.f[n] = -(b.f[n] * c.f[n] - d.f[n])

The syntax here indicates that for each of the 4 32-bit floating-point values in the register, the operation in the comment is executed.


Single Precision Floating Point Class (SP)

No broadcast versions. No dot-products or cross-products. No fnma instruction.

Example:
If the registers r1 and r2 contain

    r1 = ( 1.0,  2.0, 3.0, 4.0 ),
    r2 = ( 0.0, -2.0, 1.0, 4.0 ),

then after

    fa r0, r1, r2 ; r0 = r1 + r2

r0 contains

    r0 = ( 1.0, 0.0, 4.0, 8.0 ).


FiXed precision Class (FX) [Even:2]

The FX class of instructions all have a latency of just 2 cycles and a throughput of 1 cycle. These are even instructions. There are quite a few of them, and we can further divide them into:

    Integer Arithmetic Operations.
    Immediate Load Operations.
    Comparison Operations.
    Select Bit Operation.
    Logical Bit Operations.
    Extensions and Misc Operations.


FX: Arithmetic Operations

The integer arithmetic operations "add" and "subtract from" work on either 4 words at a time or 8 half-words at a time.

    ah   i, j, k    ; i.h[n] =  j.h[n] + k.h[n]
    ahi  i, j, s10  ; i.h[n] =  j.h[n] + ext(s10)
    a    i, j, k    ; i.w[n] =  j.w[n] + k.w[n]
    ai   i, j, s10  ; i.w[n] =  j.w[n] + ext(s10)
    sfh  i, j, k    ; i.h[n] = -j.h[n] + k.h[n]
    sfhi i, j, s10  ; i.h[n] = -j.h[n] + ext(s10)
    sf   i, j, k    ; i.w[n] = -j.w[n] + k.w[n]
    sfi  i, j, s10  ; i.w[n] = -j.w[n] + ext(s10)


FX: Arithmetic Operations & Examples

Notice the "subtract from" semantics. This is different from the floating-point subtract (fs) semantics. We think this was mainly due to the additional power of the immediate forms.

    ai   i, i, 1   ; i = i + 1, for each word in i
    ahi  i, i, -1  ; i = i - 1, for each half-word in i
    sfi  i, i, 0   ; i = (-i),  for each word in i
    sfhi x, x, 1   ; x = 1 - x, for each half-word in x
    sf   z, y, x   ; z = x - y, for each word


FX: Immediate Loads

The SPU has some instructions that enable us to quickly set up register values. These immediate loads are also 2-cycle FX instructions:

    il   i, s16  ; i.w[n] = ext(s16)
    ilh  i, u16  ; i.h[n] = u16
    ila  i, u18  ; i.w[n] = u18
    ilhu i, u16  ; i.w[n] = u16 << 16
    iohl i, u16  ; i.w[n] |= u16

Example:
    ilhu ones, 0x3f80    ; ones  = (1.0, 1.0, 1.0, 1.0)
    ila  magic, 0x10203  ; magic = (0x00010203_00010203_00010203_00010203)
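The immediate loads can be checked with a small Python model. This is only a sketch of the semantics listed above, not code from the slides: the function names mirror the mnemonics, a register is assumed to be a list of four 32-bit words, and `words_as_floats` is a hypothetical helper that reinterprets each word as a big-endian IEEE-754 float.

```python
import struct

def ilhu(u16):
    # Immediate Load Halfword Upper: every word becomes u16 << 16.
    return [(u16 & 0xFFFF) << 16] * 4

def ila(u18):
    # Immediate Load Address: every word becomes the 18-bit immediate.
    return [u18 & 0x3FFFF] * 4

def iohl(words, u16):
    # Immediate OR Halfword Lower: OR u16 into the low half of every word.
    return [w | (u16 & 0xFFFF) for w in words]

def words_as_floats(words):
    # Hypothetical helper: reinterpret each 32-bit word as a float.
    return [struct.unpack(">f", struct.pack(">I", w))[0] for w in words]

ones = ilhu(0x3F80)                       # bit pattern 0x3f800000 in every word
magic = ila(0x10203)
constant = iohl(ilhu(0x1234), 0x5678)     # ilhu + iohl builds any 32-bit value

print(words_as_floats(ones))
print([f"{w:08x}" for w in magic])
print([f"{w:08x}" for w in constant])
```

The last line illustrates why ilhu and iohl come as a pair: together they load an arbitrary 32-bit constant in two instructions.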


FX: Logical Bit Operations

These instructions work on each of the 128 bits in the registers.

    and  i, j, k ; i = j & k
    nand i, j, k ; i = ~(j & k)
    andc i, j, k ; i = j & ~k
    or   i, j, k ; i = j | k
    nor  i, j, k ; i = ~(j | k)
    orc  i, j, k ; i = j | ~k
    xor  i, j, k ; i = j ^ k
    eqv  i, j, k ; i = ~(j ^ k), i.e. bitwise j == k


FX: Logical Operations w/immediates

    andbi i, j, u8   ; i.b[n] = j.b[n] & u8
    andhi i, j, s10  ; i.h[n] = j.h[n] & ext(s10)
    andi  i, j, s10  ; i.w[n] = j.w[n] & ext(s10)
    orbi  i, j, u8   ; i.b[n] = j.b[n] | u8
    orhi  i, j, s10  ; i.h[n] = j.h[n] | ext(s10)
    ori   i, j, s10  ; i.w[n] = j.w[n] | ext(s10)
    xorbi i, j, u8   ; i.b[n] = j.b[n] ^ u8
    xorhi i, j, s10  ; i.h[n] = j.h[n] ^ ext(s10)
    xori  i, j, s10  ; i.w[n] = j.w[n] ^ ext(s10)


FX: Comparisons (Bytes)

    ceqb   i, j, k    ; i.b[n] = (j.b[n] == k.b[n]) ? TRUE : FALSE
    ceqbi  i, j, su8  ; i.b[n] = (j.b[n] == su8)    ? TRUE : FALSE
    cgtb   i, j, k    ; i.b[n] = (j.b[n] >  k.b[n]) ? TRUE : FALSE (s)
    cgtbi  i, j, su8  ; i.b[n] = (j.b[n] >  su8)    ? TRUE : FALSE (s)
    clgtb  i, j, k    ; i.b[n] = (j.b[n] >  k.b[n]) ? TRUE : FALSE (u)
    clgtbi i, j, su8  ; i.b[n] = (j.b[n] >  su8)    ? TRUE : FALSE (u)

    TRUE  = 0xFF
    FALSE = 0x00

(s) means signed and (u) means unsigned compares.


FX: Comparisons (Halves)

    ceqh   i, j, k    ; i.h[n] = (j.h[n] == k.h[n])   ? TRUE : FALSE
    ceqhi  i, j, s10  ; i.h[n] = (j.h[n] == ext(s10)) ? TRUE : FALSE
    cgth   i, j, k    ; i.h[n] = (j.h[n] >  k.h[n])   ? TRUE : FALSE (s)
    cgthi  i, j, s10  ; i.h[n] = (j.h[n] >  ext(s10)) ? TRUE : FALSE (s)
    clgth  i, j, k    ; i.h[n] = (j.h[n] >  k.h[n])   ? TRUE : FALSE (u)
    clgthi i, j, s10  ; i.h[n] = (j.h[n] >  ext(s10)) ? TRUE : FALSE (u)

    TRUE  = 0xFFFF
    FALSE = 0x0000


FX: Comparisons (Words)

    ceq   i, j, k    ; i.w[n] = (j.w[n] == k.w[n])   ? TRUE : FALSE
    ceqi  i, j, s10  ; i.w[n] = (j.w[n] == ext(s10)) ? TRUE : FALSE
    cgt   i, j, k    ; i.w[n] = (j.w[n] >  k.w[n])   ? TRUE : FALSE (s)
    cgti  i, j, s10  ; i.w[n] = (j.w[n] >  ext(s10)) ? TRUE : FALSE (s)
    clgt  i, j, k    ; i.w[n] = (j.w[n] >  k.w[n])   ? TRUE : FALSE (u)
    clgti i, j, s10  ; i.w[n] = (j.w[n] >  ext(s10)) ? TRUE : FALSE (u)

    TRUE  = 0xFFFF_FFFF
    FALSE = 0x0000_0000


FX: Comparisons (Floats)

    fceq  i, b, c ; i.w[n] = (b[n] == c[n])           ? TRUE : FALSE
    fcmeq i, b, c ; i.w[n] = (abs(b[n]) == abs(c[n])) ? TRUE : FALSE
    fcgt  i, b, c ; i.w[n] = (b[n] > c[n])            ? TRUE : FALSE
    fcmgt i, b, c ; i.w[n] = (abs(b[n]) > abs(c[n]))  ? TRUE : FALSE

    TRUE  = 0xFFFF_FFFF
    FALSE = 0x0000_0000

Note: All zeros are equal, e.g.: 0.0 == -0.0.


FX: Select Bits

This very important operation selects bits from j and k depending on the bits in the l register. It fits well with the comparison functions given previously.

    selb i, j, k, l ; i = (l==0) ? j : k, for each bit

Notice that if the bit in l is 0, then it selects the bit in j, and if not, then it selects the bit in k.

Example: SIMD min/max

    fcgt mask, a, b       ; mask is all 1s if a > b
    selb max, b, a, mask  ; select a if a > b, else b
    selb min, a, b, mask  ; select b if a > b, else a
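The min/max idiom can be traced with a small Python model. This is a sketch under stated assumptions rather than anything from the slides: registers are lists of four 32-bit words, `f2w` and `w2f` are hypothetical helpers that convert between floats and their bit patterns, and `fcgt` and `selb` follow the semantics given above.

```python
import struct

def f2w(x):
    # Hypothetical helper: float -> 32-bit pattern.
    return struct.unpack(">I", struct.pack(">f", x))[0]

def w2f(w):
    # Hypothetical helper: 32-bit pattern -> float.
    return struct.unpack(">f", struct.pack(">I", w))[0]

def fcgt(a, b):
    # Per-word compare on float values: all ones where a > b, else zero.
    return [0xFFFFFFFF if x > y else 0x00000000 for x, y in zip(a, b)]

def selb(j, k, l):
    # Per-bit select: take the bit from j where l is 0, from k where l is 1.
    return [((jw & ~lw) | (kw & lw)) & 0xFFFFFFFF for jw, kw, lw in zip(j, k, l)]

a = [1.0, 5.0, -2.0, 4.0]
b = [3.0, 2.0, -2.5, 4.0]
a_bits = [f2w(x) for x in a]
b_bits = [f2w(x) for x in b]

mask = fcgt(a, b)
vmax = [w2f(w) for w in selb(b_bits, a_bits, mask)]  # selb max, b, a, mask
vmin = [w2f(w) for w in selb(a_bits, b_bits, mask)]  # selb min, a, b, mask
print(vmax, vmin)
```

No branch is involved: the comparison produces a mask and the two selects pick lanes, which is why this pattern suits the SPU.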


FX: Misc
    generate borrow bit
    bg  i, j, k ; tmp.w[n] = (-j.w[n] + k.w[n])
                ; i.w[n] = tmp.w[n] < 0 ? 0 : 1
    generate borrow bit with borrow
    bgx i, j, k ; tmp.w[n] = (-j.w[n] + k.w[n] + (i.w[n] & 1) - 1)
                ; i.w[n] = tmp.w[n] < 0 ? 0 : 1
    generate carry bit
    cg  i, j, k ; i.w[n] = (j.w[n] + k.w[n]) > 0xffffffff ? 1 : 0
    generate carry bit with carry
    cgx i, j, k ; tmp.w[n] = (j.w[n] + k.w[n] + (i.w[n] & 1))
                ; i.w[n] = tmp.w[n] > 0xffffffff ? 1 : 0


FX: Misc
    add with carry bit
    addx i, j, k ; i.w[n] = (j.w[n] + k.w[n] + (i.w[n] & 1))
    subtract with borrow bit
    sfx  i, j, k ; i.w[n] = (-j.w[n] + k.w[n] + (i.w[n] & 1) - 1)
    sign-extend byte to half-word
    xsbh i, j    ; i.h[n] = ext(j.h[n] & 0xff)
    sign-extend half-word to word
    xshw i, j    ; i.w[n] = ext(j.w[n] & 0xffff)
    sign-extend word to double-word
    xswd i, j    ; i.d[n] = ext(j.d[n] & 0xffffffff)
    count leading zeros
    clz  i, j    ; i.w[n] = leadingZeroCount(j.w[n])


Word Shift Class (WS) [Even:4]

The WS class of instructions has a latency of 4 cycles and a throughput of 1 cycle. These are all even instructions.

    shlh  i, j, k    ; i.h[n] = j.h[n] << ( k.h[n] & 0x1f )
    shlhi i, j, imm  ; i.h[n] = j.h[n] << ( imm    & 0x1f )
    shl   i, j, k    ; i.w[n] = j.w[n] << ( k.w[n] & 0x3f )
    shli  i, j, imm  ; i.w[n] = j.w[n] << ( imm    & 0x3f )

Notice that in the shlh and shl versions there is an independent shift amount for each element, i.e., this is truly SIMD!


Example
    ; Assume r0 = ( 1, 2, 4, 8 )
    ;        r1 = ( 1, 2, 3, 4 )
    shl r2, r0, r1
    ; Now r2 = ( 1<<1, 2<<2, 4<<3, 8<<4 )
    ;        = ( 2, 8, 32, 128 )
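The example can be reproduced with a one-function Python model of shl. This is a sketch that assumes a register is a list of four 32-bit words; it follows the masking shown in the instruction table, so a shift amount of 32 to 63 clears the word and only the low 6 bits of the amount matter.

```python
def shl(j, k):
    # Per-word shift left: each word of j is shifted by the matching word of k.
    # The amount is masked to 6 bits; results are truncated to 32 bits.
    return [(value << (count & 0x3F)) & 0xFFFFFFFF for value, count in zip(j, k)]

r0 = [1, 2, 4, 8]
r1 = [1, 2, 3, 4]
r2 = shl(r0, r1)
print(r2)
```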


WS: Rotate left logical

    roth  i, j, k    ; i.h[n] = j.h[n] <^ ( k.h[n] & 0x0f )
    rothi i, j, imm  ; i.h[n] = j.h[n] <^ ( imm    & 0x0f )
    rot   i, j, k    ; i.w[n] = j.w[n] <^ ( k.w[n] & 0x1f )
    roti  i, j, imm  ; i.w[n] = j.w[n] <^ ( imm    & 0x1f )

<^ is my idiosyncratic symbol for rotate.


WS: Shift right logical

    rothm  i, j, k    ; i.h[n] = j.h[n] >> ( -k.h[n] & 0x1f )
    rothmi i, j, imm  ; i.h[n] = j.h[n] >> ( -imm    & 0x1f )
    rotm   i, j, k    ; i.w[n] = j.w[n] >> ( -k.w[n] & 0x3f )
    rotmi  i, j, imm  ; i.w[n] = j.w[n] >> ( -imm    & 0x3f )

Notice here that the shift amounts need to be negative in order to produce a proper shift. This is because this is actually a rotate left and then mask operation.
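The negative shift amounts are easy to get wrong, so here is a small Python model of rotm and of its arithmetic counterpart rotma. It is a sketch that assumes a register is a list of four 32-bit words and that counts are ordinary Python integers; the masking follows the instruction tables in these slides.

```python
def rotm(j, k):
    # Logical shift right by (-count & 0x3f); amounts of 32..63 give zero.
    return [(value & 0xFFFFFFFF) >> ((-count) & 0x3F) for value, count in zip(j, k)]

def rotma(j, k):
    # Arithmetic shift right by (-count & 0x3f); the sign bit is replicated.
    out = []
    for value, count in zip(j, k):
        shift = (-count) & 0x3F
        signed = value - (1 << 32) if value & 0x80000000 else value
        # Shifting a 32-bit value by 31 or more yields only copies of the sign bit.
        out.append((signed >> min(shift, 31)) & 0xFFFFFFFF)
    return out

print(rotm([0x80, 0x80], [-4, 4]))          # a count of -4 shifts right by 4
print([hex(w) for w in rotma([0xFFFFFF00], [-4])])
```

Passing a positive count of 4 gives a masked amount of 60, which clears the word. That is the usual symptom of forgetting to negate the shift amount.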


WS: Shift right arithmetic

    rotmah  i, j, k    ; i.h[n] = j.h[n] >> ( -k.h[n] & 0x1f )
    rotmahi i, j, imm  ; i.h[n] = j.h[n] >> ( -imm    & 0x1f )
    rotma   i, j, k    ; i.w[n] = j.w[n] >> ( -k.w[n] & 0x3f )
    rotmai  i, j, imm  ; i.w[n] = j.w[n] >> ( -imm    & 0x3f )


Load/Store Class (LS) [Odd:6]

The load/store operations are odd instructions that work on the 256 kB local memory. They have a latency of 6 cycles, but the hardware has shortcuts in place so that you can read a written value immediately after the store. Do note:

    Memory wraps around, so you can never access memory outside the local store (LS).
    You can only load and store a whole quadword, so if you need to modify a part, you need to load the quadword value, merge the modified part into the value and store the whole quadword back.
    Addresses are in units of bytes, unlike the VUs on the PS2.
    The load/store operations will use the value in the preferred word of the address register, i.e. the first word.


LS: Loads
    lqa i, label18  ; addr = label18
                    ; range = 256kb (or +/- 128kb)
    lqd i, qoff(j)  ; addr = qoff * 16 + j.w[0]
                    ; qoff is 10 bit signed, addr range = +/-8kb.
    lqr i, label14  ; addr = ext(label14) + pc
                    ; label14 range = +/- 8kb.
    lqx i, j, k     ; addr = j.w[0] + k.w[0]


LS: Stores
    stqa i, label18  ; addr = label18
                     ; range = 256kb (or +/- 128kb)
    stqd i, qoff(j)  ; addr = qoff * 16 + j.w[0]
                     ; qoff is 10 bit signed, addr range = +/-8kb.
    stqr i, label14  ; addr = ext(label14) + pc
                     ; label14 range = +/- 8kb.
    stqx i, j, k     ; addr = j.w[0] + k.w[0]


Shuffle Class (SH) [Odd:4]

The shuffle operations all have a 4-cycle latency and they are odd instructions. Most of the instructions in this class deal with the whole quadword. We can divide the SH class into:

    The Shuffle Bytes Instruction.
    Quadword left-shifts, rotates and right-shifts.
    Creation of Shuffle Masks.
    Form Select Instructions.
    Gather Bit Instructions.
    Reciprocal Estimate Instructions.


SH: Shuffle Bytes

The ordering of bytes, half-words and words within the quadword is shown below. Notice that this is big-endian, not little-endian:

    +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | a | b | c | d | e | f |
    +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    |   0   |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |       0       |       1       |       2       |       3       |
    +---------------+---------------+---------------+---------------+

The shuffle bytes instruction shufb takes three inputs: two source registers r0 and r1, and a shuffle mask msk. The output register d is found by running the following logic on each byte within the input registers:


SH: Shuffle Bytes

Let x = msk.b[n], where n goes from 0 to 15:

    if x in 0x00 .. 0x7f:
        If (x & 0x10) == 0x00, then d.b[n] = r0.b[x & 0x0f].
        If (x & 0x10) == 0x10, then d.b[n] = r1.b[x & 0x0f].
    if x in 0x80 .. 0xbf: d.b[n] = 0x00
    if x in 0xc0 .. 0xdf: d.b[n] = 0xff
    if x in 0xe0 .. 0xff: d.b[n] = 0x80

This is very powerful stuff!
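The rules above translate almost line for line into Python. The following is a minimal sketch of that logic, assuming each register is a 16-byte `bytes` object in the big-endian order shown earlier; the operand names r0, r1 and msk are taken from the description.

```python
def shufb(r0, r1, msk):
    # Build the 16 result bytes one at a time from the control bytes in msk.
    out = bytearray(16)
    for n, x in enumerate(msk):
        if x < 0x80:
            source = r1 if x & 0x10 else r0   # bit 0x10 picks the source register
            out[n] = source[x & 0x0F]         # low 4 bits pick the byte
        elif x < 0xC0:
            out[n] = 0x00
        elif x < 0xE0:
            out[n] = 0xFF
        else:
            out[n] = 0x80
    return bytes(out)

v = bytes(range(0xA0, 0xB0))
broadcast_word0 = bytes([0x00, 0x01, 0x02, 0x03] * 4)
xs = shufb(v, v, broadcast_word0)
print(xs.hex())
```

With the mask 0x00010203 repeated four times, the first word of v is copied into all four words, which is the broadcast the SP class lacks.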


SH: Shufb Examples

Previously, we mentioned that the SPU has no broadcast ability, but with a single shufb instruction we can broadcast one word into all words. We can create the shuffle masks using instructions directly, or else we could simply load them using an LS class instruction.

    ila  s_AAAA, 0x10203    ; s_AAAA = 0x00_01_02_03 x 4
                            ;        = 0x00010203_00010203_00010203_00010203
    orbi s_CCCC, s_AAAA, 8  ; s_CCCC = 0x08_09_0a_0b x 4

Using these masks, we can quickly create registers with all xs, ys, zs or ws:

    shufb xs, v, v, s_AAAA  ; xs = (v.x, v.x, v.x, v.x)
    shufb zs, v, v, s_CCCC  ; zs = (v.z, v.z, v.z, v.z)


SH: .dshuf
Because the shuffle instruction is so useful, our frontend tool supports quick creation of shuffle masks. Using the .dshuf directive, we create shuffle masks according to the following rules. If the length of the string is 4, we assume word-sized shuffles; if 8, half-word-sized; and if 16, byte-sized shuffles. Upper-case letters indicate sources from the first input, lower-case letters indicate sources from the second input, 0 indicates zeros, X ones and 8 0x80s.

    .dshuf "ABC0"      ; 0x00010203_04050607_08090a0b_80808080
    .dshuf "aX08"      ; 0x10111213_c0c0c0c0_80808080_e0e0e0e0
    .dshuf "aBC0aBC0"  ; 0x1011_0203_0405_8080_1011_0203_0405_8080
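Since the frontend tool is not public, here is a hedged Python sketch of what a .dshuf-style mask generator could look like. The function name `dshuf` and its error handling are assumptions; only the encoding rules (pattern length, letter case, and the 0, X and 8 characters) come from the description above.

```python
def dshuf(pattern):
    # Element size in bytes: 4 chars -> words, 8 -> half-words, 16 -> bytes.
    if len(pattern) not in (4, 8, 16):
        raise ValueError("pattern must have 4, 8 or 16 characters")
    size = 16 // len(pattern)
    special = {"0": 0x80, "X": 0xC0, "8": 0xE0}   # zeros, ones, 0x80s
    mask = bytearray()
    for ch in pattern:
        if ch in special:
            mask += bytes([special[ch]] * size)
            continue
        index = ord(ch.upper()) - ord("A")
        if not ch.isalpha() or index >= len(pattern):
            raise ValueError(f"invalid element {ch!r}")
        # Lower case selects the second input (control bytes 0x10..0x1f).
        base = (0x10 if ch.islower() else 0x00) + index * size
        mask += bytes(range(base, base + size))
    return bytes(mask)

for text in ("ABC0", "aX08", "aBC0aBC0"):
    print(text, dshuf(text).hex())
```

The three patterns printed here are the ones from the listing, so the output can be compared against the masks given there.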


SH: Another Shufb Example

We can create finite state machines, piping input into one end of the quadword while spitting out the result at another (e.g. the preferred word). Here's an example of such a delay machine:

    ; in the data section:
    m_bcdA: .dshuf "bcdA"
    ; in the init section:
    lqa s_bcdA, 0(m_bcdA)
    ; in the loop:
    shufb state, input, state, s_bcdA  ; state.x = state.y
                                       ; state.y = state.z
                                       ; state.z = state.w
                                       ; state.w = input.x


SH: Quadword Shift Left

These instructions take the preferred byte (byte 3) or an immediate value, shifting the whole quadword to the left. There are versions that shift in number of bytes as well as in number of bits. For bit shifts, the shift amount is clamped to be less than 8.

    SHift Left Quadword by BYtes
    shlqby   i, j, k    ; i = j << ((k.b[3] & 0x1f) * 8)
    SHift Left Quadword by BYtes Immediate
    shlqbyi  i, j, imm  ; i = j << ((imm & 0x1f) * 8)
    SHift Left Quadword by BYtes using BIt count
    shlqbybi i, j, k    ; i = j << (k.b[3] & 0xf8)
    SHift Left Quadword by BIts
    shlqbi   i, j, k    ; i = j << (k.b[3] & 0x07)
    SHift Left Quadword by BIts Immediate
    shlqbii  i, j, imm  ; i = j << (imm & 0x07)


SH: Quadword Rotate Left

These follow the same pattern as left shifts:

    ROTate (left) Quadword by BYtes
    rotqby   i, j, k    ; i = j <^ ((k.b[3] & 0x0f) * 8)
    ROTate (left) Quadword by BYtes Immediate
    rotqbyi  i, j, imm  ; i = j <^ ((imm & 0x0f) * 8)
    ROTate (left) Quadword by BYtes using BIt count
    rotqbybi i, j, k    ; i = j <^ (k.b[3] & 0x78)
    ROTate (left) Quadword by BIts
    rotqbi   i, j, k    ; i = j <^ (k.b[3] & 0x07)
    ROTate (left) Quadword by BIts Immediate
    rotqbii  i, j, imm  ; i = j <^ (imm & 0x07)


SH: Quadword Shift Right

The same goes for right shifts, though as in the WS class, we call them rotates with mask and use negative shift amounts:

    ROTate and Mask Quadword by BYtes
    rotmqby   i, j, k    ; i = j >> ((-k.b[3] & 0x1f) * 8)
    ROTate and Mask Quadword by BYtes Immediate
    rotmqbyi  i, j, imm  ; i = j >> ((-imm & 0x1f) * 8)
    ROTate and Mask Quadword by BYtes using BIt count
    rotmqbybi i, j, k    ; i = j >> (-(k.b[3] & 0xf8) & 0xf8) (*)
    ROTate and Mask Quadword by BIts
    rotmqbi   i, j, k    ; i = j >> (-(k.b[3] & 0x07))
    ROTate and Mask Quadword by BIts Immediate
    rotmqbii  i, j, imm  ; i = j >> (-imm & 0x07)


SH: Form Select Instructions

These instructions are designed to expand a small number of bits into many bits of ones, and they are good for use with the selb operation.

    Form Select Mask for Bytes Immediate
    fsmbi i, u16 ; i.b[n] = ( (u16    << n) & 0x8000 ) ? 0xff : 0x00
    Form Select Mask for Bytes
    fsmb  i, j   ; i.b[n] = ( (j.h[1] << n) & 0x8000 ) ? 0xff : 0x00
    Form Select Mask for Halfwords
    fsmh  i, j   ; i.h[n] = ( (j.b[3] << n) & 0x80 )   ? 0xffff : 0x0000
    Form Select Mask for Words
    fsm   i, j   ; i.w[n] = ( (j.b[3] << n) & 0x8 )    ? 0xffffffff : 0x00000000

Example:
fsmbi selABCd, 0x000f; make select mask to get XYZ from first arg


SH: Gather Bits Instructions

These are the opposite of the form select instructions, and can be used to quickly pack results from comparison operators into compact bytes or half-words. They all gather the rightmost bit from each element of the source register and pack it into a single bit in the target.

    Gather Bits from Bytes
    gbb i, j ; i=0;for(n=0;n<16;n++){i.w[0]<<=1;i.w[0]|=(j.b[n]&1);}
    Gather Bits from Halfwords
    gbh i, j ; i=0;for(n=0;n< 8;n++){i.w[0]<<=1;i.w[0]|=(j.h[n]&1);}
    Gather Bits (from Words)
    gb  i, j ; i=0;for(n=0;n< 4;n++){i.w[0]<<=1;i.w[0]|=(j.w[n]&1);}
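To see that gather bits really undoes form select, here is a small Python sketch of the byte variants. It assumes a register is a 16-byte `bytes` object and that the 16-bit mask is passed as a plain integer rather than in the preferred word; both functions follow the pseudocode given for fsmb and gbb.

```python
def fsmb(mask16):
    # Form Select Mask for Bytes: bit n (from the left) expands to byte n.
    return bytes(0xFF if (mask16 << n) & 0x8000 else 0x00 for n in range(16))

def gbb(j):
    # Gather Bits from Bytes: pack the rightmost bit of each byte,
    # byte 0 ending up as the most significant of the 16 gathered bits.
    result = 0
    for byte in j:
        result = (result << 1) | (byte & 1)
    return result

select_w = fsmb(0x000F)            # 0xff in the last four bytes only
print(select_w.hex(), hex(gbb(select_w)))
```

A typical use would be to run a byte compare, gather the result with gbb, and test the packed 16 bits with ordinary integer instructions.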


SH: How to generate masks for non-quadword stores.

As seen in the section on load/store, there are no non-quadword load/store operations. A way to store a non-quadword value is to load the destination quadword, shuffle the value with the loaded quadword, and store it back to the same location. To make it easier to generate these shuffle masks, there are a few instructions that generate the control masks:

    Generate Controls for Byte Insertion (d-form)
    cbd i, imm(j)
    Generate Controls for Byte Insertion (x-form)
    cbx i, j, k


SH: How to generate masks for non-quadword stores.

    Generate Controls for Halfword Insertion (d-form)
    chd i, imm(j)
    Generate Controls for Halfword Insertion (x-form)
    chx i, j, k
    Generate Controls for Word Insertion (d-form)
    cwd i, imm(j)
    Generate Controls for Word Insertion (x-form)
    cwx i, j, k
    Generate Controls for Doubleword Insertion (d-form)
    cdd i, imm(j)
    Generate Controls for Doubleword Insertion (x-form)
    cdx i, j, k


SH: How to generate masks for non-quadword stores.

Example: Store preferred byte into a table

    lqx   qword, table, offset
    cbx   mask, table, offset
    shufb qword, value, qword, mask
    stqx  qword, table, offset
    ai    offset, offset, 1
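The load, shuffle, store sequence can be modelled in Python as follows. This is a sketch with several assumptions: the local store is a `bytearray`, `store_byte` is a hypothetical wrapper around the four instructions, and `cbd` is assumed to produce the identity mask for the second shufb input (0x10 to 0x1f) with 0x03, the preferred byte of the first input, at the target position.

```python
def shufb(r0, r1, msk):
    # Same byte-selection rules as the shufb slide.
    out = bytearray(16)
    for n, x in enumerate(msk):
        if x < 0x80:
            out[n] = (r1 if x & 0x10 else r0)[x & 0x0F]
        elif x < 0xC0:
            out[n] = 0x00
        elif x < 0xE0:
            out[n] = 0xFF
        else:
            out[n] = 0x80
    return bytes(out)

def cbd(address):
    # Controls for byte insertion: keep the quadword, insert byte 3 of the value.
    mask = bytearray(range(0x10, 0x20))
    mask[address & 0x0F] = 0x03
    return bytes(mask)

def store_byte(local_store, address, value):
    base = address & ~0x0F                       # quadword-aligned address
    qword = bytes(local_store[base:base + 16])   # lqx
    mask = cbd(address)                          # cbx
    merged = shufb(value, qword, mask)           # shufb qword, value, qword, mask
    local_store[base:base + 16] = merged         # stqx

memory = bytearray(range(32))
value = bytes([0, 0, 0, 0x7F] + [0] * 12)        # 0x7f sits in the preferred byte
store_byte(memory, 21, value)
print(memory.hex())
```

Only the byte at address 21 changes; the other 15 bytes of the quadword are written back unmodified.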


SH: Reciprocal Estimate Instructions

The hardware supports two fast (4-cycle) instructions that estimate the reciprocal recip(x) = 1/x or the reciprocal square root rsqrt(x) = 1/sqrt(x). These instructions work in conjunction with the fi instruction, which we'll later explain in detail. After the interpolation instruction, results are accurate to a precision of 12 bits, which is about half the floating-point precision of 23. In order to improve the accuracy, one must perform another Taylor- or Euler-step. Do note that

\[ \sqrt{x} = \frac{\sqrt{x \cdot x}}{\sqrt{x}} = |x| \cdot \frac{1}{\sqrt{x}} = x \cdot \mathrm{rsqrt}(x), \]

since x ≥ 0, so there is no need for a separate square-root function.


Improving precision on the reciprocal function

Assuming we have the input in the x-register, we proceed to calculate

    frest a, x
    fi    b, x, a       ; b is good to 12 bits precision
    fnms  c, b, x, one  ;
    fma   b, c, b, b    ; b is good to 24 bits precision
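The fnms/fma pair computes c = 1 - b*x and then b = c*b + b, which is b*(2 - b*x). The Python sketch below checks numerically that this roughly doubles the number of correct bits. It is only an illustration: Python floats are doubles rather than SPU single-precision values, and `truncate_to_bits` is a hypothetical stand-in for the frest/fi estimate, producing a value that is good to about 12 bits.

```python
import math

def truncate_to_bits(value, bits=12):
    # Hypothetical stand-in for frest + fi: keep only `bits` mantissa bits.
    mantissa, exponent = math.frexp(value)
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)

def refine_reciprocal(b, x):
    c = 1.0 - b * x       # fnms c, b, x, one
    return c * b + b      # fma  b, c, b, b

x = 3.0
estimate = truncate_to_bits(1.0 / x)
refined = refine_reciprocal(estimate, x)

error_before = abs(estimate * x - 1.0)
error_after = abs(refined * x - 1.0)
print(error_before, error_after)
```

The relative error after the step is the square of the error before it, which is why a 12-bit estimate becomes a result with roughly 24 good bits.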


Improving precision on the reciprocal square-root function

    frsqest a, x
    fi      b, x, a        ; b is good to 12 bits precision
                           ; (b and a can share register)
    fm      c, b, x        ; (c and x can share register)
    fm      d, b, onehalf
    fnms    c, c, b, one
    fma     b, d, c, b     ; b is good to 24 bits precision
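The four instructions after fi compute b + (b/2)*(1 - b*b*x), a Newton-style refinement of the reciprocal square root. The Python sketch below follows them one by one. As before, it uses Python doubles instead of SPU floats, and `truncate_to_bits` is a hypothetical stand-in for the frsqest/fi estimate.

```python
import math

def truncate_to_bits(value, bits=12):
    # Hypothetical stand-in for frsqest + fi: keep only `bits` mantissa bits.
    mantissa, exponent = math.frexp(value)
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)

def refine_rsqrt(b, x):
    c = b * x             # fm   c, b, x
    d = b * 0.5           # fm   d, b, onehalf
    c = 1.0 - c * b       # fnms c, c, b, one
    return d * c + b      # fma  b, d, c, b

x = 2.0
estimate = truncate_to_bits(1.0 / math.sqrt(x))
refined = refine_rsqrt(estimate, x)

error_before = abs(estimate * math.sqrt(x) - 1.0)
error_after = abs(refined * math.sqrt(x) - 1.0)
square_root = x * refined   # sqrt(x) = x * rsqrt(x), one extra multiply
print(error_before, error_after, square_root)
```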


SH: Or Across - The Final Instruction

The last instruction in the SH class is a new addition.

    Or Across
    orx i, j ; i.w[0] = ( j.w[0] | j.w[1] | j.w[2] | j.w[3] )
             ; i.w[1] = i.w[2] = i.w[3] = 0


Floating point / Integer Class (FI) [Even:7]

The FI class of instructions has a latency of 7 cycles and a throughput of 1 cycle. These are all even instructions. There are basically three types of instructions: integer multiplies, interpolations for reciprocal calculations, and finally, fp/integer conversions.


FI: Integer Multiplies

    multiply lower halves signed
    mpy   i, j, k    ; i.w[n] = j.h[2n+1] * k.h[2n+1]
    multiply lower halves signed immediate
    mpyi  i, j, s10  ; i.w[n] = j.h[2n+1] * ext(s10)
    multiply lower halves unsigned
    mpyu  i, j, k    ; i.w[n] = j.h[2n+1] * k.h[2n+1]
    multiply lower halves unsigned immediate (immediate sign-extends)
    mpyui i, j, s10  ; i.w[n] = j.h[2n+1] * ext(s10)


FI: Integer Multiplies

    multiply lower halves, add word
    mpya i, j, k, l ; i.w[n] = j.h[2n+1] * k.h[2n+1] + l.w[n]
    multiply lower halves, shift result down 16 with sign extend
    mpys i, j, k    ; i.w[n] = j.h[2n+1] * k.h[2n+1] >> 16
    multiply upper half j by lower half k, shift up 16
    mpyh i, j, k    ; i.w[n] = j.h[2n] * k.h[2n+1] << 16
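All of these multiplies take 16-bit inputs, so a full 32-bit multiply has to be assembled from partial products. The slides do not show this, so treat the following Python sketch as an assumption about how the pieces fit together: it models one word lane, and `mul32` is a hypothetical name for the combination mpyh(a, b) + mpyh(b, a) + mpyu(a, b), truncated to 32 bits.

```python
MASK32 = 0xFFFFFFFF

def mpyu(j, k):
    # Unsigned product of the lower half-words of one word lane.
    return ((j & 0xFFFF) * (k & 0xFFFF)) & MASK32

def mpyh(j, k):
    # Upper half of j times lower half of k, shifted up by 16.
    return ((((j >> 16) & 0xFFFF) * (k & 0xFFFF)) << 16) & MASK32

def mul32(a, b):
    # (ah*2^16 + al) * (bh*2^16 + bl) mod 2^32
    #   = ((ah*bl + bh*al) << 16) + al*bl   (the ah*bh term shifts out entirely)
    return (mpyh(a, b) + mpyh(b, a) + mpyu(a, b)) & MASK32

a, b = 0x12345678, 0x9ABCDEF0
print(hex(mul32(a, b)), hex((a * b) & MASK32))
```

On the SPU the two additions would be ordinary `a` instructions on the three partial products; only the low 32 bits of the product are produced this way.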


FI: Integer Multiplies

    multiply upper halves signed
    mpyhh   i, j, k ; i.w[n] = j.h[2n] * k.h[2n]
    multiply upper halves unsigned
    mpyhhu  i, j, k ; i.w[n] = j.h[2n] * k.h[2n]
    multiply/accumulate upper halves
    mpyhha  i, j, k ; i.w[n] += j.h[2n] * k.h[2n]
    multiply/accumulate upper halves unsigned
    mpyhhau i, j, k ; i.w[n] += j.h[2n] * k.h[2n]


FI: Conversions and FI instruction

    fi    a, b, c       ; use after frest or frsqest
    cuflt a, j, precis  ; unsigned int to float
    csflt a, j, precis  ; signed int to float
    cfltu i, b, precis  ; float to unsigned int
    cflts i, b, precis  ; float to signed int

Here precis is the precision as an immediate, so that e.g.

    cuflt fp, val, 8 ; converts 0x80 into 0.5

Also, please note that these instructions saturate to the min and max values of their precision.
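The precision operand acts as a fixed-point scale: the integer is treated as having `precis` fractional bits. The Python sketch below models one lane of three of the conversions under that reading. It uses Python doubles, so it ignores single-precision rounding, and the truncation toward zero in the float-to-int direction is an assumption of this sketch.

```python
def cuflt(value, precis):
    # Unsigned int to float: divide by 2**precis.
    return (value & 0xFFFFFFFF) / (1 << precis)

def cfltu(x, precis):
    # Float to unsigned int: scale by 2**precis, saturate to [0, 2**32 - 1].
    scaled = int(x * (1 << precis))
    return max(0, min(0xFFFFFFFF, scaled))

def cflts(x, precis):
    # Float to signed int: scale by 2**precis, saturate to the int32 range.
    scaled = int(x * (1 << precis))
    return max(-0x80000000, min(0x7FFFFFFF, scaled))

print(cuflt(0x80, 8))      # 0x80 with 8 fractional bits
print(cfltu(0.5, 8))       # and back again
print(cfltu(-1.0, 0), cflts(1e20, 0))
```

The last line shows the saturation mentioned above: out-of-range inputs clamp to the limits instead of wrapping around.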


Byte Operations (BO) [Even: 4]

There are a couple of interesting instructions that help with multimedia and streaming logic.

    Count Ones in Bytes
    cntb  i, j    ; i.b[n] = numOneBits( j.b[n] )
    Average Bytes
    avgb  i, j, k ; i.b[n] = ( j.b[n] + k.b[n] + 1 ) / 2
    Absolute Difference in Bytes
    absdb i, j, k ; i.b[n] = abs( j.b[n] - k.b[n] )
    Sum Bytes into Half-words
    sumb  i, j, k ; i.h[0] = k.b[0] + k.b[1] + k.b[2] + k.b[3];
                  ; i.h[1] = j.b[0] + j.b[1] + j.b[2] + j.b[3];
                  ;   :
                  ; i.h[6] = k.b[12] + k.b[13] + k.b[14] + k.b[15];
                  ; i.h[7] = j.b[12] + j.b[13] + j.b[14] + j.b[15];


Branch Class (BR) [Odd:-]

Branches on the SPU are costly. If a branch is taken and it has not been predicted, there is an 18-cycle penalty so that the chip can restart the pipe. There is no penalty for falling through a non-predicted branch. However, if you have predicted a branch and it is not taken, there is also an 18-cycle penalty. Branches and branch hints are all odd instructions. Note: Even a static branch needs to be predicted. Note: This is one of the reasons why diverging control paths are so difficult to optimize for.


BR: Unconditional Branches

    Branch Relative
    br    brTo     ; goto label address
    Branch Relative and Set Link
    brsl  i, brTo  ; gosub label address, i.w[0] = return address (*)
    Branch Indirect
    bi    i        ; goto i.w[0]
    Branch Indirect and Set Link
    bisl  i, j     ; gosub j.w[0], i.w[0] = return address (*)
    BRanch Absolute
    bra   brTo     ; goto brTo
    BRanch Absolute and Set Link
    brasl i, brTo  ; gosub label address, i.w[0] = return address (*)

(*): These instructions have a 4-cycle latency for the return register.

Note: The bi instructions have enable/disable interrupt versions, e.g.: bie, bid, bisle, bisld.


BR: Conditional Branches (Relative)

    Branch on Zero
    brz   i, brTo ; branch if i.w[0] == 0
    Branch on Not Zero
    brnz  i, brTo ; branch if i.w[0] != 0
    Branch on Zero
    brhz  i, brTo ; branch if i.h[1] == 0
    Branch on Not Zero
    brhnz i, brTo ; branch if i.h[1] != 0


BR: Conditional Branches (Indirect)

    Branch Indirect on Zero
    biz   i, j ; branch to j.w[0] if i.w[0] == 0
    Branch Indirect on Not Zero
    binz  i, j ; branch to j.w[0] if i.w[0] != 0
    Branch Indirect on Zero
    bihz  i, j ; branch to j.w[0] if i.h[1] == 0
    Branch Indirect on Not Zero
    bihnz i, j ; branch to j.w[0] if i.h[1] != 0

Note: These instructions can enable/disable interrupts as well.


BR: Interrupt & Misc

    Interrupt RETurn
    iret   i    ; Return from interrupt
    Interrupt RETurn
    iretd  i    ; Return from interrupt, disable interrupts
    Interrupt RETurn
    irete  i    ; Return from interrupt, enable interrupts
    Branch Indirect and Set Link if External Data
    bisled i, j ; gosub j if channel 0 is non-zero


Hints Branch Class (HB) [Odd:15]

If you know the most likely (or only) outcome for a branch, you can make sure the branch is penalty-free as long as the hint occurs at least 15 cycles before the branch is taken. If the hint occurs later, there may still be a benefit, since the penalty is lowered. However, if the hint arrives less than 4 cycles before the branch, there is no benefit. Please note that there is also a hardware bug w.r.t. the hbr instructions: one cannot hint a branch whose target lies forward of the branch and within the same 64-byte block as the branch.


Hints Branch Instructions

    Hint Branch (Immediate)
    hbr  brFrom, j     ; branch hint for any BIxxx type branch
    Hint Branch Absolute
    hbra brFrom, brTo  ; branch hint for any BRAxxx type branch
    Hint Branch Relative
    hbrr brFrom, brTo  ; branch hint for any BRxxx type branch
    Hint Branch Prefetch
    hbrp               ; inline prefetch code (*)

(*) allows 15 LS instructions in a row without any instruction fetch stall.


CH: DMA Channel Ops

We will explain these in further talks, but for completeness we've included them here. They are all odd instructions with a latency of 6. Note that the latency may actually be much higher if channels are not ready.

    Read from Channel
    rdch    i, chn ; read i from channel chn
    Write to Channel
    wrch    chn, i ; write i into channel chn
    Read Channel Count
    rdchcnt i, chn ; read channel count for channel chn into i


DP: Double Precision

DP instructions have a latency of 13 and are even. However, they will stall pipelining for 6 cycles (that is, all currently executing instructions are halted) while the instruction is executed. Therefore, we do not recommend using double precision at all!


Questions?

That's all folks!
