0% found this document useful (0 votes)
33 views77 pages

Chapter 03

computer_3

Uploaded by

k0966493450.ee11
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
33 views77 pages

Chapter 03

computer_3

Uploaded by

k0966493450.ee11
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 77

COMPUTER ORGANIZATION AND DESIGN

The Hardware/Software Interface


6th
Edition

Chapter 3
Arithmetic for
Computers
§3.1 Introduction
Arithmetic for Computers/Processors
 Representations
 2’s complement representation for fixed-point N-bit INT
 Std. IEEE754 FP32/64 representation
 Fixed-point INT arithmetic vs. Floating-point (FP) arithmetic
 General operations: Addition/subtraction, multiplication, division
 Special DSP operations: fused multiply-and-accumulate (MAC),
butterfly unit, general matrix-matrix multiplication (GEMM), …
 Efficient multiplication/division algorithms
 Efficient implementation of adder, multiplier, and divider

 Should deal with the problem of overflow/underflow, divide by 0, …


 The representation of infinity, NAN, …

Chapter 3 — Arithmetic for Computers — 2


(Fixed-Point) Integer Addition
 Example: 7 + 6

 Overflow if result out of range


 Adding +ve and –ve operands, no overflow

 Adding two +ve operands,

 Overflow if result sign is 1

 Adding two –ve operands

 Overflow if result sign is 0

Chapter 3 — Arithmetic for Computers — 3


(Fixed-Point) Integer Subtraction
 Example: 7 – 6 = 7 + (–6)
+7: 0000 0000 … 0000 0111
–6: 1111 1111 … 1111 1010
+1: 0000 0000 … 0000 0001

 Overflow if result out of range


 Subtracting two +ve or two –ve operands, no overflow
 Subtracting +ve from –ve operand
 Overflow if result sign is 0
 Subtracting –ve from +ve operand
 Overflow if result sign is 1

Chapter 3 — Arithmetic for Computers — 4


Detecting Overflow
 No overflow when adding a positive and a negative number
 No overflow when signs are the same for subtraction
 Overflow occurs when the value affects the sign:
 overflow when adding two positives yields a negative
 or, adding two negatives gives a positive
 or, subtract a negative from a positive and get a negative
 or, subtract a positive from a negative and get a positive
 Overflow detection

Chapter 3 — Arithmetic for Computers — 5


Overflow Detection Logic
 Overflow occurs when adding:
 2 positive numbers and the sum is negative
 2 negative numbers and the sum is positive
=> sign bit is set with the value of the result
 Overflow if: Carry into MSB  Carry out of MSB
 Overflow = CarryIn[N-1] XOR CarryOut[N-1]

Chapter 3 — Arithmetic for Computers — 6


Dealing with Overflow
 Some languages (e.g., C) ignore overflow
 Use MIPS addu, addui, subu instructions
 Saturated arithmetic

 Other languages (e.g., Ada, Fortran) require raising an


exception
 Use MIPS add, addi, sub instructions

 On overflow, invoke exception handler


 Save PC in exception program counter (EPC) register
 Jump to predefined handler address
 mfc0 (move from coprocessor reg) instruction can retrieve EPC
value, to return after corrective action

Chapter 3 — Arithmetic for Computers — 7


Designing Arithmetic Logic Unit (ALU)
 ALU performs arithmetic and logical operations
 add, sub: two’s complement adder/subtractor with overflow

detection
 and, or, nor : logical AND, logical OR, logical NOR

 slt (set on less than): two’s complement adder with inverter,

check sign bit of result

ALUop 4 (ALUop) Function


A 0000 and
32 0001 or
Zero
ALU

0010 add
Result
32 0110 subtract
Overflow 0111 set-on-less-than
B 1100 nor
32
CarryOut
32-Bit ALU  Group Bit-Slice ALU
 Design trick 1: divide and conquer
 Break the problem into simpler problems, solve them and glue together
the solution
 Design trick 2: solve part of the problem and extend

A 32 B 32

a31 b31 a0 b0 4
m ALU0 m
ALU31
ALUop
c31 cin c0 cin
s31 s0

Overflow Zero
32 Result

—9
A 4-bit ALU Example
 Design trick 3: take pieces you know (or can imagine) and try to put
them together
4-bit ALU

1-bit ALU CarryIn0 Operation

CarryIn Operation A0 1-bit Result0


and ALU
A B0
0 CarryOut0
CarryIn1
A1 1-bit Result1
or ALU
Result B1
1 CarryOut1
CarryIn2
Mux

A2 1-bit Result2
ALU
B2
1-bit add
Full 2 CarryIn3 CarryOut2
B Adder A3 1-bit Result3
ALU
B3
CarryOut
CarryOut3
Overflow Detection Logic
 Overflow = CarryIn[N-1] XOR CarryOut[N-1]
CarryIn0

A0 1-bit Result0 X Y X XOR Y


B0 ALU
CarryOut0 0 0 0
CarryIn1
A1 1-bit 0 1 1
Result1
B1 ALU 1 0 1
CarryOut1
CarryIn2 1 1 0
A2 1-bit Result2
B2 ALU
CarryIn3 Overflow
A3 1-bit Result3
B3 ALU

CarryOut3

— 11
Arithmetic for Multimedia
 Graphics and media processing operates on vectors of
8-bit (byte) and 16-bit INT data

 SIMD (single-instruction, multiple-data) extension ISA


 Use 64-bit adder, with partitioned carry chain

 Operate on 8×8-bit, 4×16-bit, or 2×32-bit configurable ALU


operations

 On overflow, usually applying saturating arithmetic


 Result is replaced by the largest representable value

 E.g., clipping in audio, saturation in video

Chapter 3 — Arithmetic for Computers — 12


§3.3 Multiplication
Multiplication
 Start with long-multiplication approach

multiplicand
1000
multiplier
× 1001
1000
0000
0000
1000
product 1001000

Length of product is the


sum of that of operand
and multiplicand
Initially 0

Chapter 3 — Arithmetic for Computers — 13


3-Step Multiplication in MIPS
mult $t1, $t2 # t1 * t2
 No destination register: product could be ~264; need two
special registers to hold it
 3-step process:
$t1 01111111111111111111111111111111
X $t2 01000000000000000000000000000000

00011111111111111111111111111111 11000000000000000000000000000000

Hi Lo
mfhi $t3 $t3 00011111111111111111111111111111

mflo $t4 $t4 11000000000000000000000000000000

Chapter 3 — Arithmetic for Computers — 14


Multiply Algorithm (Ver. 1) Start

Multiplier0 = 1 1. Test Multiplier0 = 0


Multiplier0

1a. Add multiplicand to product and


place the result in Product register

Initially 0
0010 x 0011
2. Shift Multiplicand register left 1 bit
Product Multiplier Multiplicand
0000 0000 0011 0000 0010
0000 0010 0001 0000 0100 3. Shift Multiplier register right 1 bit
0000 0110 0000 0000 1000
0000 0110 0000 0001 0000 No: < 32 repetitions
32nd
0000 0110 0000 0010 0000 Done
repetition?

Yes: 32 repetitions
Done
— 15
Observations
 1 clock per cycle => too slow
 Ratio of multiply to add 5:1 to 100:1
 Half of the bits in multiplicand always 0
=> 64-bit adder is wasted
 0’s inserted in right of multiplicand as shifted
=> least significant bits of product never changed once formed
 Instead of shifting multiplicand to left, shift product to
right?
 Product register wastes space => combine Multiplier and
Product register

Chapter 3 — Arithmetic for Computers — 16


Multiply Algorithm (Ver. 2) Start

Product0 = 1 1. Test Product0 = 0


Product0

1a. Add multiplicand to left half of product and


place the result in left half of Product register

Multiplicand Product 2. Shift Product register right 1 bit


0010 0000 0011
0010 0011
0010 0001 0001
0011 0001
0010 0001 1000
32nd No: < 32 repetitions
0010 0000 1100
repetition?
0010 0000 0110
Add & shift perform in parallel Yes: 32 repetitions
Done
— 17
Optimized Multiplier
 Perform steps in parallel: add/shift

0-bit Multiplier register

 One cycle per partial-product addition


 That’s ok, if frequency of multiplications is low

Chapter 3 — Arithmetic for Computers — 18


Concluding Remarks
 2 steps per bit because multiplier and product registers
combined
 MIPS registers Hi and Lo are left and right half of Product
register
=> this gives the MIPS instruction MultU
 What about signed multiplication?
 The easiest solution is to make both positive and remember
whether to complement product when done (leave out sign bit,
run for 31 steps)
 Apply definition of 2’s complement
 sign-extend partial products and subtract at end
 Booth’s Algorithm is an elegant way to multiply signed numbers
using same hardware as before and save cycles
Chapter 3 — Arithmetic for Computers — 19
Faster Multiplier
 Uses multiple adders
 Cost/performance tradeoff

Adder Reduction Tree


 Can be pipelined
 Several multiplication performed in parallel

Chapter 3 — Arithmetic for Computers — 20


MIPS Multiplication Instructions
 Two 32-bit registers for product
 HI: most-significant 32 bits
 LO: least-significant 32-bits
 MIPS multiply instructions
 mult rs, rt / multu rs, rt
 64-bit product in HI/LO
 mfhi rd / mflo rd
 Move from HI/LO to rd
 Can test HI value to see if product overflows 32 bits
 mul rd, rs, rt
 Least-significant 32 bits of product –> rd

Chapter 3 — Arithmetic for Computers — 21


§3.4 Division
Long Division Algorithm
 Check for 0 divisor

quotient  Long division approach

dividend  If divisor ≤ dividend bits

1001  1 bit in quotient, subtract

1000 1001010  Otherwise


-1000 0 bit in quotient, bring down next dividend bit
divisor 

10
101  Restoring division
1010  Do the subtract, and if remainder goes < 0, add
-1000
divisor back
remainder 10
 Signed division
 Divide using absolute values

 Adjust sign of quotient and remainder as


required
Division Algorithm and Hardware (Ver.1)

Initially divisor
in left half

Initially dividend

2n-bit dividend and n-bit divisor yield


n-bit quotient and remainder

Chapter 3 — Arithmetic for Computers — 23


Division Example Start: Place Dividend in Remainder

1. Subtract Divisor register from


Remainder register, and place the
result in Remainder register
Quot. Divisor Rem.
0000 00100000 00000111 Remainder  0 Test Remainder < 0
11100111 Remainder
00000111
0000 00010000 00000111
11110111 2b. Restore original value by
2a. Shift Quotient
00000111 adding Divisor to Remainder,
register to left,
0000 00001000 00000111 place sum in Remainder, shift
setting new
11111111 Quotient to the left, setting new
rightmost bit to 1
00000111 least significant bit to 0
0000 00000100 00000111
00000011
0001 00000011 3. Shift Divisor register right 1 bit
0001 00000010 00000011
00000001
0011 00000001 33rd No: < 33 repetitions
0011 00000001 00000001 repetition?

Yes: 33 repetitions
Done — 24
Observations
 Half of the bits in divisor register always 0
=> 1/2 of 64-bit adder is wasted
=> 1/2 of divisor is wasted
 Instead of shifting divisor to right,
shift remainder to left?
 1st step cannot produce a 1 in quotient bit
(otherwise quotient is too big for the register)
=> switch order to shift first and then subtract
=> save 1 iteration
 Eliminate Quotient register by combining with Remainder
register as shifted left

Chapter 3 — Arithmetic for Computers — 25


Start: Place Dividend in Remainder
Divide Algorithm (Ver. 2)
1. Shift Remainder register left 1 bit

Step Remainder Div. 2. Subtract Divisor register from the


0 0000 0111 0010 left half of Remainder register, and place the
1.1 0000 1110 result in the left half of Remainder register
1.2 1110 1110
1.3b 0001 1100
Remainder  0 Test Remainder < 0
2.2 1111 1100 Remainder
2.3b 0011 1000
3.2 0001 1000
3.3a 0011 0001 3b. Restore original value by adding
4.2 0001 0001 3a. Shift
Remainder to left, Divisor to left half of Remainder, and
4.3a 0010 0011 place sum in left half of Remainder.
0001 0011 setting new
rightmost bit to 1 Also shift Remainder to left, setting
the new least significant bit to 0

32nd No: < 32 repetitions


repetition?
Yes: 32 repetitions
Done. Shift left half of Remainder right 1 bit
— 26
Optimized Divider
0-bit Multiplier/Quotient register

 One cycle per partial-remainder subtraction


 Looks a lot like a multiplier!
 Same hardware can be used for both

Chapter 3 — Arithmetic for Computers — 27


Faster Division
 Can’t use parallel hardware as in multiplier

 Subtraction is conditional on sign of remainder

 Faster dividers (e.g. SRT division) generate multiple


quotient bits per step

 Still require multiple steps

Chapter 3 — Arithmetic for Computers — 28


MIPS Division
 Use HI/LO registers for result

 HI: 32-bit remainder

 LO: 32-bit quotient

 Instructions

 div rs, rt / divu rs, rt

 No overflow or divide-by-0 checking

 Software must perform checks if required

 Use mfhi, mflo to access result

Chapter 3 — Arithmetic for Computers — 29


Concluding Remarks
 Observations: Divide vs. Multiply

 Divide can use the same hardware as multiply

 just need ALU to add or subtract, and 64-bit register to


shift left or shift right

 Hi and Lo registers in MIPS combine to act as 64-bit


register for multiply and divide

Chapter 3 — Arithmetic for Computers — 30


§3.5 Floating Point
Floating Point (FP)
 Representation for non-integral real-valued numbers
 Including very small and very large numbers
 Scientific notation
 –2.34 × 1056 normalized

 +0.002 × 10–4
not normalized
 +987.02 × 109
 In binary
( 1)S  (1 F)  2(EBias)
 ±1.xxxxxxx2 × 2yyyy
 The programming language C use the name float (or
double) for single-precision (or double-precision) FP
numbers.
Chapter 3 — Arithmetic for Computers — 31
Standard FP Representation
 Defined by IEEE Std 754-1985

 Developed in response to divergence of


representations
 Portability issues for scientific code

 Now almost universally adopted

 Two representations
 32-bit single-precision (SP) FP

 64-bit double-precision (DP) FP

Chapter 3 — Arithmetic for Computers — 32


IEEE 754 Standard (1/2)
 Regarding single precision (SP), DP similar
 Sign bit S: ( 1)S  (1 F)  2(EBias)
1 means negative
0 means positive
 Significand F:
 To pack more bits, leading 1 implicit for normalized numbers
 1 + 23 bits single, 1 + 52 bits double
 always true: 0  Significand < 1
(for normalized numbers)
 Note: 0 has no leading 1, so reserve exponent value 0
just for number 0

Chapter 3 — Arithmetic for Computers — 33


IEEE 754 Standard (2/2)
 Exponent E:
 Need to represent positive and negative exponents
 Also want to compare FP numbers as if they were integers, to
help in value comparisons
 If use 2’s complement to represent?
e.g., 1.0 x 2-1 versus 1.0 x2+1 (1/2 versus 2)

0 1111 1111 000 0000 0000 0000 0000 0000


1/2

2 0 0000 0001 000 0000 0000 0000 0000 0000

If we use integer comparison for these two


words, we will conclude that 1/2 > 2!!!

Chapter 3 — Arithmetic for Computers — 34


Biased (Excess) Notation
 let notation 0000 be most negative, and 1111 be most positive
 Example: Biased 7
0000 -7
0001 -6
0010 -5
0011 -4
0100 -3
0101 -2
0110 -1
0111 0
1000 1
1001 2
1010 3
1011 4
1100 5
1101 6
1110 7
1111 8 Chapter 3 — Arithmetic for Computers — 35
IEEE 754 Standard
 Using biased notation
 the bias is the number subtracted to get the real number
 IEEE 754 uses bias of 127 for single precision:
Subtract 127 from Exponent field to get actual value for exponent
 1023 is bias for double precision
 The example becomes ….

0 0111 1110 000 0000 0000 0000 0000 0000


1/2

2 0 1000 0000 000 0000 0000 0000 0000 0000

Chapter 3 — Arithmetic for Computers — 36


IEEE Floating-Point Format
single: 8 bits single: 23 bits
double: 11 bits double: 52 bits
S Exponent Fraction

x  ( 1)S  (1 Fraction)  2(ExponentBias)

 S: sign bit (0  non-negative, 1  negative)


 Normalize significand: 1.0 ≤ |significand| < 2.0
 Always has a leading pre-binary-point 1 bit, so no need to represent it
explicitly (hidden bit)
 Significand is Fraction with the “1.” restored
 Exponent: excess representation: actual exponent + Bias
 Ensures exponent is unsigned
 Single: Bias = 127; Double: Bias = 1203

Chapter 3 — Arithmetic for Computers — 37


Single-Precision Range
 Exponents 00000000 and 11111111 reserved
 Smallest value
 Exponent: 00000001
 actual exponent = 1 – 127 = –126
 Fraction: 000…00  significand = 1.0
 ±1.0 × 2–126 ≈ ±1.2 × 10–38
 Largest value
 exponent: 11111110
 actual exponent = 254 – 127 = +127
 Fraction: 111…11  significand ≈ 2.0
 ±2.0 × 2+127 ≈ ±3.4 × 10+38

Chapter 3 — Arithmetic for Computers — 38


Double-Precision Range
 Exponents 0000…00 and 1111…11 reserved
 Smallest value
 Exponent: 00000000001
 actual exponent = 1 – 1023 = –1022
 Fraction: 000…00  significand = 1.0
 ±1.0 × 2–1022 ≈ ±2.2 × 10–308
 Largest value
 Exponent: 11111111110
 actual exponent = 2046 – 1023 = +1023
 Fraction: 111…11  significand ≈ 2.0
 ±2.0 × 2+1023 ≈ ±1.8 × 10+308

Chapter 3 — Arithmetic for Computers — 39


Floating-Point Precision
 Relative precision
single: 23 bits
 all fraction bits are significant double: 52 bits

 SP : approx 2–23
 Equivalent to 23 × log102 ≈ 23 × 0.3 ≈ 6 decimal
digits of precision

 DP : approx 2–52
 Equivalent to 52 × log102 ≈ 52 × 0.3 ≈ 16 decimal
digits of precision

Chapter 3 — Arithmetic for Computers — 40


Floating-Point Representation Example
 Represent –0.75
 –0.75 = (–1)1 × 1.12 × 2–1
 S=1
 Fraction = 1000…002
 Exponent = –1 + Bias
 Single: –1 + 127 = 126 = 011111102
 Double: –1 + 1023 = 1022 = 011111111102

 SP : 1011111101000…00
 DP : 1011111111101000…00

Chapter 3 — Arithmetic for Computers — 41


Floating-Point Representation Example
 What number is represented by the single-precision float

11000000101000…00
 S=1

 Fraction = 01000…002

 Bias Exponent = 100000012 = 129

 Sol. x = (–1)1 × (1 + 012) × 2(129 – 127)


= (–1) × 1.25 × 22

= –5.0

Chapter 3 — Arithmetic for Computers — 42


Concluding Remarks
 What have we defined so far? (SP float)

Exponent Significand Object


0 0 ???
0 nonzero ???
1-254 anything +/- floating-point
255 0 ???
255 nonzero ???

Chapter 3 — Arithmetic for Computers — 43


Zero and Special Numbers
 Represent 0?
 exponent all zeroes
 significand all zeroes too
 What about sign?
 +0: 0 00000000 00000000000000000000000
 -0: 1 00000000 00000000000000000000000
 Why two zeroes?
 Helps in some limit comparisons

 Special numbers
 Range: 1.0  2-126  1.8  10-38
 What if result too small? (>0, < 1.8x10-38 => Underflow! )
 What if result too large? (> 3.4x1038 => Overflow! )

Chapter 3 — Arithmetic for Computers — 44


Gradual Underflow
 Represent denormalized numbers (denorms)
 Exponent : all zeroes
 Significand : non-zeroes
 Allow a number to degrade in significance until it
become 0 (gradual underflow)

 The smallest normalized number


 1.0000 0000 0000 0000 0000 0000  2-126

Chapter 3 — Arithmetic for Computers — 45


Representation for +/- Infinity
 In FP, divide by zero should produce +/- infinity, not
overflow
 Why?
 OK to do further computations with infinity, e.g., X/0 > Y may be
a valid comparison
 IEEE 754 represents +/- infinity
 Most positive exponent reserved for infinity
 Significands all zeroes

S 1111 1111 0000 0000 0000 0000 0000 000

Chapter 3 — Arithmetic for Computers — 46


Representation for Not a Number
 What do I get if I calculate sqrt(-4.0) or 0.0/0.0?
 If infinity is not an error, these should not be either
 They are called Not a Number (NaN)
 Exponent = 255, Significand nonzero
 Why is this useful?
 Hope NaNs help with debugging?
 They contaminate: op(NaN,X) = NaN
 OK if calculate but don’t use it

Chapter 3 — Arithmetic for Computers — 47


IEEE 754 Encoding of FP Numbers
 What have we defined so far? (single-precision)

Exponent Significand Object


0 0 0
0 nonzero denom
1-254 anything +/- fl. pt. #
255 0 +/- infinity
255 nonzero NaN

Chapter 3 — Arithmetic for Computers — 48


Floating-Point Addition
 Now consider a 4-digit binary example
 1.0002 × 2–1 + –1.1102 × 2–2 (i.e. 0.5 + –0.4375)
1. Align binary points
 Shift number with smaller exponent
 1.0002 × 2–1 + –0.1112 × 2–1
2. Add significands
 1.0002 × 2–1 + –0.1112 × 2–1 = 0.0012 × 2–1
3. Normalize result & check for over/underflow
 1.0002 × 2–4, with no over/underflow
4. Round and renormalize if necessary
 1.0002 × 2–4 (no change) = 0.0625

Chapter 3 — Arithmetic for Computers — 49


Floating-Point Addition Algorithm
Basic addition algorithm:
compute Ye - Xe (to align binary point)
(1) right shift the smaller number, say Xm, that many
positions to form Xm  2Xe-Ye
(2) compute Xm  2Xe-Ye + Ym

if demands normalization, then normalize:


(3) left shift result, decrement result exponent
right shift result, increment result exponent
(3.1) check overflow or underflow during the shift
(4) round the mantissa
continue until MSB of data is 1
(NOTE: Hidden bit in IEEE Standard)
(5) if result is 0 mantissa, set the exponent
FP Adder Hardware

Step 1

Step 2

Step 3

Step 4

Chapter 3 — Arithmetic for Computers — 51


Floating-Point Multiplication
 Now consider a 4-digit binary example
 1.0002 × 2–1 × –1.1102 × 2–2 (i.e. 0.5 × –0.4375)
1. Add exponents
 Unbiased: –1 + –2 = –3
 Biased: (–1 + 127) + (–2 + 127) = –3 + 254 – 127 = –3 + 127
2. Multiply significands
 1.0002 × 1.1102 = 1.1102  1.1102 × 2–3
3. Normalize result & check for over/underflow
 1.1102 × 2–3 (no change) with no over/underflow
4. Round and renormalize if necessary
 1.1102 × 2–3 (no change)
5. Determine sign: +ve × –ve  –ve
 –1.1102 × 2–3 = –0.21875

Chapter 3 — Arithmetic for Computers — 52


FP Arithmetic Hardware
 Much more complex than integer arithmetic
 Doing it in one clock cycle would take too long
 FP multiplier is of similar complexity to FP adder
 But uses a multiplier for significand instead of an adder
 FP arithmetic hardware usually does
 Addition, subtraction, multiplication, division, reciprocal,
square-root
 FP  integer conversion is not trivial
 Operations usually takes several cycles
 Can be pipelined

Chapter 3 — Arithmetic for Computers — 53


FP Instructions in MIPS (1/2)
 FP hardware is coprocessor 1
 Adjunct processor that extends the ISA
 Separate FP registers
 32 single-precision: $f0, $f1, … $f31
 Paired for double-precision: $f0/$f1, $f2/$f3, …
 Release 2 of MIPS ISA supports 32 × 64-bit FP reg’s
 FP instructions operate only on FP registers
 Programs generally don’t do integer ops on FP data, or vice
versa
 More registers with minimal code-size impact
 FP load and store instructions
 lwc1, ldc1, swc1, sdc1
 e.g., ldc1 $f8, 32($sp)

Chapter 3 — Arithmetic for Computers — 54


FP Instructions in MIPS (2/2)
 Single-precision arithmetic
 add.s, sub.s, mul.s, div.s
 e.g., add.s $f0, $f1, $f6
 Double-precision arithmetic
 add.d, sub.d, mul.d, div.d
 e.g., mul.d $f4, $f4, $f6
 Single- and double-precision comparison
 c.xx.s, c.xx.d (xx is eq, lt, le, …)
 Sets or clears FP condition-code bit
 e.g. c.lt.s $f3, $f4
 Branch on FP condition code true or false
 bc1t, bc1f more examples,
 e.g., bc1t TargetLabel please refer to Fig. 3.17-18,
p. 222-223
Chapter 3 — Arithmetic for Computers — 55
FP Example: °F to °C
 C code:
float f2c (float fahr) {
return ((5.0/9.0)*(fahr - 32.0));
}
 fahr in $f12, result in $f0, literals in global memory space
 Compiled MIPS code:
f2c: lwc1 $f16, const5($gp) #$f16=5.0(in Mem.)
lwc1 $f18, const9($gp) #$f18=9.0(in Mem.)
div.s $f16, $f16, $f18 #$f16=5.0/9.0
lwc1 $f18, const32($gp) #$f18=32.0(in Mem)
sub.s $f18, $f12, $f18 #f18=fahr-32.0
mul.s $f0, $f16, $f18 #$f0=(5/9)*(fahr-32)
jr $ra

Chapter 3 — Arithmetic for Computers — 56


FP Example: Matrix Multiplication (1/3)
 X=X+Y×Z
 All 32 × 32 matrices, 64-bit double-precision elements
 C code:
void mm (double x[][], double y[][], double z[][]) {
int i, j, k;
for (i = 0; i! = 32; i = i + 1)
for (j = 0; j! = 32; j = j + 1)
for (k = 0; k! = 32; k = k + 1)
x[i][j] = x[i][j] + y[i][k] * z[k][j];
}
 Addresses of x, y, z in $a0, $a1, $a2, and i, j, k in $s0, $s1, $s2

Chapter 3 — Arithmetic for Computers — 57


FP Example: Matrix Multiplication (2/3)
 MIPS code:
li $t1, 32 # $t1 = 32 (row size/loop end)
li $s0, 0 # i = 0; initialize 1st for loop
L1: li $s1, 0 # j = 0; restart 2nd for loop
L2: li $s2, 0 # k = 0; restart 3rd for loop
sll $t2, $s0, 5 # $t2 = i * 32 (size of row of x)
addu $t2, $t2, $s1 # $t2 = i * size(row) + j
sll $t2, $t2, 3 # $t2 = byte offset of [i][j]
addu $t2, $a0, $t2 # $t2 = byte address of x[i][j]
l.d $f4, 0($t2) # $f4 = 8 bytes of x[i][j]
L3: sll $t0, $s2, 5 # $t0 = k * 32 (size of row of z)
addu $t0, $t0, $s1 # $t0 = k * size(row) + j
sll $t0, $t0, 3 # $t0 = byte offset of [k][j]
addu $t0, $a2, $t0 # $t0 = byte address of z[k][j]
l.d $f16, 0($t0) # $f16 = 8 bytes of z[k][j]

Chapter 3 — Arithmetic for Computers — 58


FP Example: Matrix Multiplication (3/3)

sll $t0, $s0, 5 # $t0 = i*32 (size of row of y)
addu $t0, $t0, $s2 # $t0 = i*size(row) + k
sll $t0, $t0, 3 # $t0 = byte offset of [i][k]
addu $t0, $a1, $t0 # $t0 = byte address of y[i][k]
l.d $f18, 0($t0) # $f18 = 8 bytes of y[i][k]
mul.d $f16, $f18, $f16 # $f16 = y[i][k] * z[k][j]
add.d $f4, $f4, $f16 # f4=x[i][j] + y[i][k]*z[k][j]
addiu $s2, $s2, 1 # $k k + 1
bne $s2, $t1, L3 # if (k != 32) go to L3
s.d $f4, 0($t2) # x[i][j] = $f4
addiu $s1, $s1, 1 # $j = j + 1
bne $s1, $t1, L2 # if (j != 32) go to L2
addiu $s0, $s0, 1 # $i = i + 1
bne $s0, $t1, L1 # if (i != 32) go to L1

Chapter 3 — Arithmetic for Computers — 59


Variant FP Format

Chapter 3 — Arithmetic for Computers — 60


Accurate Arithmetic
 IEEE Std 754 specifies additional rounding control
 Extra bits of precision (guard, round, sticky)
 Choice of rounding modes
 Allows programmer to fine-tune numerical behavior of a
computation
 Not all FP units implement all options
 Most programming languages and FP libraries just use defaults
 Trade-off between hardware complexity, performance,
and market requirements

Chapter 3 — Arithmetic for Computers — 61


Extra Bits for Rounding
 Why rounding after addition?
 Because not every intermediate results is truncated
 To keep more precision

 Guard and round bits: extra bits to guard against loss of bits during
intermediate additions
 to the right of significand
 can later be shifted left into significand during normalization

 Sticky bit
 Additional bit to the right of the round digit
 Better fine tune rounding

Chapter 3 — Arithmetic for Computers — 62


Example
 Try to add 2.98x100 and 2.34x102
 only 3 decimal digits are allowed
2.34
+ 0.02
without guard bits
2.36

 with 2 more guard bits during computation


 perform rounding at last
2.3400
+ 0.0298
2.3698  rounding  2.37

 With guard bits and rounding  more accurate results

Chapter 3 — Arithmetic for Computers — 63


Rounding Methods
 Round to zero or Truncation
 The result closet to zero is returned.
 Nothing is added to the least significant bit.
 Round up
 The more positive result closest to the infinitely precise result is returned.
 If the result is positive and either the guard or the sticky bit is 1, the
result is rounded.
 If the result is negative, the result is not rounded because the unrounded
result is the most positive result that is closest to the infinitely precise
result.
 Round down
 The more negative result is returned.
 Round to nearest

Chapter 3 — Arithmetic for Computers — 64


Associativity
 Parallel programs may interleave operations in
unexpected orders

 Assumptions of associativity may fail


(x+y)+z x+(y+z)
x -1.50E+38 -1.50E+38
y 1.50E+38 0.00E+00
z 1.0 1.0 1.50E+38
1.00E+00 0.00E+00

 Need to validate parallel programs under varying


degrees of parallelism

Chapter 3 — Arithmetic for Computers — 65


§3.7 Real Stuff: Streaming SIMD Extensions and AVX in x86
Subword Parallellism
 Graphics and audio applications can take advantage of
performing simultaneous operations on short vectors
 Example: 128-bit adder:
 16x8-bit adds; 8x16-bit adds; 4x32-bit adds

 Also called data-level parallelism, vector parallelism, or


Single Instruction, Multiple Data (SIMD)
 ARM NEON multimedia instruction extension

 Intel SSE, SSE2 FP instructions

Chapter 3 — Arithmetic for Computers — 66


ARM NEON Instructions
 NEON supports all the subword data type you can imagine except 64-bit
FP numbers
 8-bit, 16-bit, 32-bit, and 64-bit signed and unsigned integers
 32-bit FP numbers

Chapter 3 — Arithmetic for Computers — 67


Right Shift and Division
 Left shift by i places multiplies an integer by 2i
 Right shift divides by 2i?
 Only for unsigned integers

 For signed integers


 Arithmetic right shift: replicate the sign bit
 e.g., –5 / 4
 111110112 >> 2 = 111111102 = –2
 Rounds toward –∞

 c.f. 111110112 >>> 2 = 001111102 = +62

Chapter 3 — Arithmetic for Computers — 68


Concluding Remarks
 ISAs support arithmetic
 Signed and unsigned integers
 Floating-point approximation to reals
 Bounded range and precision
 Operations can overflow and underflow
 MIPS ISA
 Core instructions: 54 most frequently used
 100% of SPECINT, 97% of SPECFP

 Other instructions: less frequent

Chapter 3 — Arithmetic for Computers — 69


APPENDIX

Chapter 3 — Arithmetic for Computers — 70


x86 FP Architecture
 Originally based on 8087 FP coprocessor
 8 × 80-bit extended-precision registers
 Used as a push-down stack
 Registers indexed from TOS: ST(0), ST(1), …
 FP values are 32-bit or 64 in memory
 Converted on load/store of memory operand
 Integer operands can also be converted
on load/store
 Very difficult to generate and optimize code
 Result: poor FP performance

Chapter 3 — Arithmetic for Computers — 71


x86 FP Instructions
Data transfer Arithmetic Compare Transcendental
FILD mem/ST(i) FIADDP mem/ST(i) FICOMP FPATAN
FISTP mem/ST(i) FISUBRP mem/ST(i) FIUCOMP F2XMI
FLDPI FIMULP mem/ST(i) FSTSW AX/mem FCOS
FLD1 FIDIVRP mem/ST(i) FPTAN
FLDZ FSQRT FPREM
FABS FPSIN
FRNDINT FYL2X

 Optional variations
 I: integer operand
 P: pop operand from stack
 R: reverse operand order
 But not all combinations allowed

Chapter 3 — Arithmetic for Computers — 72


Streaming SIMD Extension 2 (SSE2)
 Adds 4 × 128-bit registers
 Extended to 8 registers in AMD64/EM64T
 Can be used for multiple FP operands
 2 × 64-bit double precision
 4 × 32-bit double precision
 Instructions operate on them simultaneously
 Single-Instruction Multiple-Data

Chapter 3 — Arithmetic for Computers — 73


Matrix Multiply
 Unoptimized code:

1. void dgemm (int n, double* A, double* B, double* C)


2. {
3. for (int i = 0; i < n; ++i)
4. for (int j = 0; j < n; ++j)
5. {
6. double cij = C[i+j*n]; /* cij = C[i][j] */
7. for(int k = 0; k < n; k++ )
8. cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */
9. C[i+j*n] = cij; /* C[i][j] = cij */
10. }
11. }

Chapter 3 — Arithmetic for Computers — 74


Matrix Multiply
 x86 assembly code:
1. vmovsd (%r10),%xmm0 # Load 1 element of C into %xmm0
2. mov %rsi,%rcx # register %rcx = %rsi
3. xor %eax,%eax # register %eax = 0
4. vmovsd (%rcx),%xmm1 # Load 1 element of B into %xmm1
5. add %r9,%rcx # register %rcx = %rcx + %r9
6. vmulsd (%r8,%rax,8),%xmm1,%xmm1 # Multiply %xmm1,
element of A
7. add $0x1,%rax # register %rax = %rax + 1
8. cmp %eax,%edi # compare %eax to %edi
9. vaddsd %xmm1,%xmm0,%xmm0 # Add %xmm1, %xmm0
10. jg 30 <dgemm+0x30> # jump if %eax > %edi
11. add $0x1,%r11d # register %r11 = %r11 + 1
12. vmovsd %xmm0,(%r10) # Store %xmm0 into C element

Chapter 3 — Arithmetic for Computers — 75


Matrix Multiply
 Optimized C code:
1. #include <x86intrin.h>
2. void dgemm (int n, double* A, double* B, double* C)
3. {
4. for ( int i = 0; i < n; i+=4 )
5. for ( int j = 0; j < n; j++ ) {
6. __m256d c0 = _mm256_load_pd(C+i+j*n); /* c0 = C[i][j]
*/
7. for( int k = 0; k < n; k++ )
8. c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */
9. _mm256_mul_pd(_mm256_load_pd(A+i+k*n),
10. _mm256_broadcast_sd(B+k+j*n)));
11. _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */
12. }
13. }

Chapter 3 — Arithmetic for Computers — 76


Matrix Multiply
 Optimized x86 assembly code:
1. vmovapd (%r11),%ymm0 # Load 4 elements of C into %ymm0
2. mov %rbx,%rcx # register %rcx = %rbx
3. xor %eax,%eax # register %eax = 0
4. vbroadcastsd (%rax,%r8,1),%ymm1 # Make 4 copies of B element
5. add $0x8,%rax # register %rax = %rax + 8
6. vmulpd (%rcx),%ymm1,%ymm1 # Parallel mul %ymm1,4 A elements
7. add %r9,%rcx # register %rcx = %rcx + %r9
8. cmp %r10,%rax # compare %r10 to %rax
9. vaddpd %ymm1,%ymm0,%ymm0 # Parallel add %ymm1, %ymm0
10. jne 50 <dgemm+0x50> # jump if not %r10 != %rax
11. add $0x1,%esi # register % esi = % esi + 1
12. vmovapd %ymm0,(%r11) # Store %ymm0 into 4 C elements

Chapter 3 — Arithmetic for Computers — 77

You might also like