COMPUTER ORGANIZATION AND DESIGN

The Hardware/Software Interface

6th
Edition

Chapter 3
Arithmetic for
Computers
§3.1 Introduction
Arithmetic for Computers/Processors
 Representations
 2’s complement representation for fixed-point N-bit INT
 Std. IEEE754 FP32/64 representation
 Fixed-point INT arithmetic vs. Floating-point (FP) arithmetic
 General operations: Addition/subtraction, multiplication, division
 Special DSP operations: fused multiply-and-accumulate (MAC),
butterfly unit, general matrix-matrix multiplication (GEMM), …
 Efficient multiplication/division algorithms
 Efficient implementation of adder, multiplier, and divider

 Should deal with the problem of overflow/underflow, divide by 0, …

 The representation of infinity, NaN, …

Chapter 3 — Arithmetic for Computers — 2

(Fixed-Point) Integer Addition
 Example: 7 + 6

 Overflow if result out of range

 Adding +ve and –ve operands, no overflow

 Adding two +ve operands

 Overflow if result sign is 1

 Adding two –ve operands

 Overflow if result sign is 0
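The sign rule above can be sketched in C. This is only an illustration: the helper name add_overflows is hypothetical, and the addition is done on unsigned values so that wraparound is well defined in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: adds two 32-bit two's complement values and reports
   overflow with the sign rule (same-sign operands, different-sign result). */
bool add_overflows(int32_t a, int32_t b, uint32_t *sum_bits)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t s = ua + ub;              /* unsigned add wraps modulo 2^32 */
    uint32_t sa = ua >> 31;            /* sign bit of a */
    uint32_t sb = ub >> 31;            /* sign bit of b */
    uint32_t ss = s >> 31;             /* sign bit of the result */
    *sum_bits = s;                     /* raw 32-bit result pattern */
    return (sa == sb) && (ss != sa);   /* +ve and -ve operands never overflow */
}
```

For 7 + 6 the function reports no overflow and stores the bit pattern of 13; adding 1 to the largest positive value reports overflow.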

Chapter 3 — Arithmetic for Computers — 3

(Fixed-Point) Integer Subtraction
 Example: 7 – 6 = 7 + (–6)
+7: 0000 0000 … 0000 0111
–6: 1111 1111 … 1111 1010
+1: 0000 0000 … 0000 0001

 Overflow if result out of range

 Subtracting two +ve or two –ve operands, no overflow
 Subtracting +ve from –ve operand
 Overflow if result sign is 0
 Subtracting –ve from +ve operand
 Overflow if result sign is 1

Chapter 3 — Arithmetic for Computers — 4

Detecting Overflow
 No overflow when adding a positive and a negative number
 No overflow when signs are the same for subtraction
 Overflow occurs when the value affects the sign:
 overflow when adding two positives yields a negative
 or, adding two negatives gives a positive
 or, subtract a negative from a positive and get a negative
 or, subtract a positive from a negative and get a positive
 Overflow detection

Chapter 3 — Arithmetic for Computers — 5

Overflow Detection Logic
 Overflow occurs when adding:
 2 positive numbers and the sum is negative
 2 negative numbers and the sum is positive
=> the sign bit ends up holding a value bit of the result instead of the true sign
 Overflow if: Carry into MSB ≠ Carry out of MSB
 Overflow = CarryIn[N-1] XOR CarryOut[N-1]
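A minimal sketch of the carry rule, assuming a bit-serial software model of a ripple-carry adder. The function name ripple_add_overflow and the restriction to n ≤ 31 bits are choices made for this illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Adds two n-bit values (1 <= n <= 31) one full adder at a time.
   Returns true when the carry into the MSB differs from the carry out of it. */
bool ripple_add_overflow(uint32_t a, uint32_t b, unsigned n, uint32_t *sum)
{
    uint32_t carry = 0, carry_into_msb = 0, result = 0;
    for (unsigned i = 0; i < n; i++) {
        uint32_t ai = (a >> i) & 1u;
        uint32_t bi = (b >> i) & 1u;
        if (i == n - 1)
            carry_into_msb = carry;                    /* CarryIn[N-1] */
        result |= (ai ^ bi ^ carry) << i;              /* sum bit i */
        carry = (ai & bi) | (ai & carry) | (bi & carry);
    }
    *sum = result;
    return (carry_into_msb ^ carry) != 0;              /* carry is CarryOut[N-1] */
}
```

With n = 4, 0111 + 0110 gives 1101 and signals overflow, while 1111 + 1111 gives 1110 (that is, -1 + -1 = -2) and does not.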

Chapter 3 — Arithmetic for Computers — 6

Dealing with Overflow
 Some languages (e.g., C) ignore overflow
 Use MIPS addu, addiu, subu instructions
 Saturated arithmetic

 Other languages (e.g., Ada, Fortran) require raising an exception
 Use MIPS add, addi, sub instructions
 On overflow, invoke exception handler
 Save PC in exception program counter (EPC) register
 Jump to predefined handler address
 mfc0 (move from coprocessor reg) instruction can retrieve EPC value, to return after corrective action

Chapter 3 — Arithmetic for Computers — 7

Designing Arithmetic Logic Unit (ALU)
 ALU performs arithmetic and logical operations
 add, sub: two’s complement adder/subtractor with overflow detection
 and, or, nor: logical AND, logical OR, logical NOR
 slt (set on less than): two’s complement adder with inverter, check sign bit of result

[Figure: ALU block with 32-bit inputs A and B and a 4-bit ALUop control; outputs are the 32-bit Result plus the Zero, Overflow, and CarryOut flags]

ALUop   Function
0000    and
0001    or
0010    add
0110    subtract
0111    set-on-less-than
1100    nor
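The function table can be read as a behavioural model. The sketch below is an assumption-laden illustration, not the slide's gate-level design: the type AluOut and function alu are hypothetical names, and slt is computed as sign XOR overflow so that it stays correct when the internal subtraction overflows (the slide itself only mentions checking the sign bit).

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t result;
    bool zero, overflow, carry_out;
} AluOut;

/* Behavioural model of the 32-bit ALU; op is the 4-bit ALUop code. */
AluOut alu(uint32_t a, uint32_t b, unsigned op)
{
    AluOut o = {0, false, false, false};
    bool sub = (op == 0x6 || op == 0x7);       /* subtract and slt use a + ~b + 1 */
    uint32_t bb = sub ? ~b : b;
    uint64_t wide = (uint64_t)a + bb + (sub ? 1u : 0u);
    uint32_t sum = (uint32_t)wide;
    /* Overflow: both adder inputs have the same sign and the sum's sign differs. */
    bool ovf = (((a ^ sum) & (bb ^ sum)) >> 31) != 0;

    switch (op) {
    case 0x0: o.result = a & b; break;                        /* and */
    case 0x1: o.result = a | b; break;                        /* or  */
    case 0x2:                                                 /* add */
    case 0x6:                                                 /* subtract */
        o.result = sum;
        o.overflow = ovf;
        o.carry_out = ((wide >> 32) & 1u) != 0;
        break;
    case 0x7: o.result = ((sum >> 31) ^ (uint32_t)ovf) & 1u; break;  /* slt */
    case 0xC: o.result = ~(a | b); break;                     /* nor */
    default:  break;             /* unused codes leave result = 0 in this sketch */
    }
    o.zero = (o.result == 0);
    return o;
}
```

For example, alu(7, 6, 0x6) yields Result = 1, and alu(5, 5, 0x6) raises Zero.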
32-Bit ALU → Group Bit-Slice ALU
 Design trick 1: divide and conquer
 Break the problem into simpler problems, solve them, and glue the solutions together
 Design trick 2: solve part of the problem and extend

[Figure: 32-bit ALU built from 32 one-bit slices ALU0 … ALU31; slice i takes a_i, b_i, and a carry-in and produces s_i; the 4-bit ALUop goes to every slice; outputs are the 32-bit Result, Overflow, and Zero]

A 4-bit ALU Example
 Design trick 3: take pieces you know (or can imagine) and try to put them together

[Figure, left: 1-bit ALU in which an and gate, an or gate, and a 1-bit full adder (with CarryIn and CarryOut) feed a mux selected by Operation (0 = and, 1 = or, 2 = add) to produce Result]
[Figure, right: 4-bit ALU made of four 1-bit ALUs; CarryOut of each slice feeds CarryIn of the next, from CarryIn0 through CarryOut3; all slices share Operation]

Overflow Detection Logic
 Overflow = CarryIn[N-1] XOR CarryOut[N-1]

X  Y  X XOR Y
0  0  0
0  1  1
1  0  1
1  1  0

[Figure: 4-bit ripple-carry ALU; Overflow is the XOR of CarryIn3 and CarryOut3]

Arithmetic for Multimedia
 Graphics and media processing operates on vectors of
8-bit (byte) and 16-bit INT data

 SIMD (single-instruction, multiple-data) extension ISA

 Use 64-bit adder, with partitioned carry chain

 Configurable ALU operates on 8 × 8-bit, 4 × 16-bit, or 2 × 32-bit vectors

 On overflow, usually applying saturating arithmetic

 Result is replaced by the largest representable value

 E.g., clipping in audio, saturation in video
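Saturating arithmetic on a single 8-bit lane can be sketched as follows. The function names are hypothetical; the signed variant also clamps at the most negative value, which goes slightly beyond the slide's wording about the largest representable value.

```c
#include <stdint.h>

/* Unsigned 8-bit saturating add: clamp to 255 instead of wrapping around. */
uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
}

/* Signed 8-bit saturating add: clamp to the range [-128, 127]. */
int8_t sat_add_s8(int8_t a, int8_t b)
{
    int sum = (int)a + b;              /* int is wide enough to hold the exact sum */
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}
```

A pixel value of 200 brightened by 100 therefore stays at 255 rather than wrapping to 44.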

Chapter 3 — Arithmetic for Computers — 12

§3.3 Multiplication
Multiplication
 Start with long-multiplication approach

    1000   multiplicand
  × 1001   multiplier
  ------
    1000
   0000
  0000
 1000
 -------
 1001000   product

 Length of product is the sum of the operand lengths (multiplier and multiplicand)

[Figure: multiplication hardware; the Product register is initially 0]

Chapter 3 — Arithmetic for Computers — 13

3-Step Multiplication in MIPS
mult $t1, $t2   # t1 * t2
 No destination register: product could be ~$2^{64}$; need two special registers to hold it
 3-step process:

  $t1     01111111111111111111111111111111
× $t2     01000000000000000000000000000000
  Hi:Lo   00011111111111111111111111111111 11000000000000000000000000000000

mfhi $t3   # $t3 = 00011111111111111111111111111111 (Hi)
mflo $t4   # $t4 = 11000000000000000000000000000000 (Lo)

Chapter 3 — Arithmetic for Computers — 14

Multiply Algorithm (Ver. 1)
Start
1. Test Multiplier0
   Multiplier0 = 1: 1a. Add multiplicand to product and place the result in Product register
   Multiplier0 = 0: skip the add
2. Shift Multiplicand register left 1 bit
3. Shift Multiplier register right 1 bit
32nd repetition? No (< 32 repetitions): go back to step 1. Yes (32 repetitions): Done

Example: 0010 × 0011 (4-bit operands; Product initially 0)

Iteration  Product    Multiplier  Multiplicand
0          0000 0000  0011        0000 0010
1          0000 0010  0001        0000 0100
2          0000 0110  0000        0000 1000
3          0000 0110  0000        0001 0000
4          0000 0110  0000        0010 0000

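The Ver. 1 flowchart maps directly onto a loop. This is a sketch for unsigned 32-bit operands; the function name multiply_v1 is hypothetical.

```c
#include <stdint.h>

/* Shift-and-add multiply, Ver. 1: 64-bit Multiplicand and Product registers,
   32-bit Multiplier register, 32 repetitions. */
uint64_t multiply_v1(uint32_t multiplicand, uint32_t multiplier)
{
    uint64_t product = 0;              /* Product register, initially 0 */
    uint64_t mcand = multiplicand;     /* multiplicand starts in the right half */
    for (int i = 0; i < 32; i++) {
        if (multiplier & 1u)           /* 1.  test Multiplier0 */
            product += mcand;          /* 1a. add multiplicand to product */
        mcand <<= 1;                   /* 2.  shift Multiplicand left 1 bit */
        multiplier >>= 1;              /* 3.  shift Multiplier right 1 bit */
    }
    return product;
}
```

Calling multiply_v1(2, 3) follows the same register contents as the 4-bit trace, only with wider registers.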
Observations
 1 clock cycle per step => almost 100 clock cycles per 32-bit multiply: too slow
 Ratio of multiply to add 5:1 to 100:1
 Half of the bits in multiplicand always 0
=> 64-bit adder is wasted
 0’s inserted in right of multiplicand as shifted
=> least significant bits of product never changed once formed
 Instead of shifting multiplicand to left, shift product to
right?
 Product register wastes space => combine Multiplier and
Product register

Chapter 3 — Arithmetic for Computers — 16

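Multiply Algorithm (Ver. 2)
Start (multiplier initially in the right half of the Product register)
1. Test Product0
   Product0 = 1: 1a. Add multiplicand to left half of product and place the result in left half of Product register
   Product0 = 0: skip the add
2. Shift Product register right 1 bit
32nd repetition? No (< 32 repetitions): go back to step 1. Yes (32 repetitions): Done
 Add & shift can be performed in parallel

Example: 0010 × 0011

Multiplicand  Product    Step
0010          0000 0011  initial
0010          0010 0011  1a: add
0010          0001 0001  2: shift right
0010          0011 0001  1a: add
0010          0001 1000  2: shift right
0010          0000 1100  2: shift right
0010          0000 0110  2: shift right
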
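A sketch of Ver. 2 for unsigned 32-bit operands follows. The name multiply_v2 is hypothetical, and the explicit carry variable is an assumption of this model: it stands for the adder's carry-out, which the right shift moves into the top of the Product register.

```c
#include <stdint.h>

/* Shift-and-add multiply, Ver. 2: the multiplier lives in the right half of
   the 64-bit Product register and the adder only touches the left half. */
uint64_t multiply_v2(uint32_t multiplicand, uint32_t multiplier)
{
    uint64_t product = multiplier;     /* left half = 0, right half = multiplier */
    for (int i = 0; i < 32; i++) {
        uint64_t carry = 0;
        if (product & 1u) {                                   /* 1. test Product0 */
            uint64_t left = (product >> 32) + multiplicand;   /* 1a. may need 33 bits */
            carry = left >> 32;                               /* adder carry-out */
            product = (left << 32) | (product & 0xFFFFFFFFu);
        }
        product = (product >> 1) | (carry << 63);             /* 2. shift right 1 bit */
    }
    return product;
}
```

Compared with Ver. 1, only a 32-bit add is needed per step, and no separate Multiplier register exists.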
Optimized Multiplier
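 Perform steps in parallel: add/shift

[Figure: optimized multiplier hardware; no separate Multiplier register (0 bits), since the multiplier sits in the right half of the Product register]

 One cycle per partial-product addition
 That’s ok, if frequency of multiplications is low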

Chapter 3 — Arithmetic for Computers — 18

Concluding Remarks
 2 steps per bit because multiplier and product registers
combined
 MIPS registers Hi and Lo are left and right half of Product
register
=> this gives the MIPS instruction multu
 What about signed multiplication?
 The easiest solution is to make both positive and remember
whether to complement product when done (leave out sign bit,
run for 31 steps)
 Apply definition of 2’s complement
 sign-extend partial products and subtract at end
 Booth’s Algorithm is an elegant way to multiply signed numbers
using same hardware as before and save cycles
Chapter 3 — Arithmetic for Computers — 19
Faster Multiplier
 Uses multiple adders
 Cost/performance tradeoff

[Figure: adder reduction tree]

 Can be pipelined
 Several multiplications can be performed in parallel

Chapter 3 — Arithmetic for Computers — 20

MIPS Multiplication Instructions
 Two 32-bit registers for product
 HI: most-significant 32 bits
 LO: least-significant 32-bits
 MIPS multiply instructions
 mult rs, rt / multu rs, rt
 64-bit product in HI/LO
 mfhi rd / mflo rd
 Move from HI/LO to rd
 Can test HI value to see if product overflows 32 bits
 mul rd, rs, rt
 Least-significant 32 bits of product –> rd
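As an illustration of the HI/LO convention, the sketch below models multu in C and tests HI to see whether the unsigned product fits in 32 bits. The names HiLo, multu_model, and product_fits_32 are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t hi, lo; } HiLo;

/* Models multu: unsigned 32 x 32 -> 64-bit product, split into HI and LO. */
HiLo multu_model(uint32_t rs, uint32_t rt)
{
    uint64_t p = (uint64_t)rs * rt;
    HiLo r = { (uint32_t)(p >> 32), (uint32_t)p };
    return r;
}

/* An unsigned product fits in 32 bits exactly when HI is zero. */
bool product_fits_32(HiLo r)
{
    return r.hi == 0;
}
```

For the operands on the 3-step multiplication slide, multu_model returns HI = 0x1FFFFFFF and LO = 0xC0000000.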

Chapter 3 — Arithmetic for Computers — 21

§3.4 Division
Long Division Algorithm

                  1001     quotient
divisor 1000 ) 1001010     dividend
              -1000
                  10
                  101
                  1010
                 -1000
                    10     remainder

 Check for 0 divisor
 Long division approach
 If divisor ≤ dividend bits
 1 bit in quotient, subtract
 Otherwise
 0 bit in quotient, bring down next dividend bit
 Restoring division
 Do the subtract, and if remainder goes < 0, add divisor back
 Signed division
 Divide using absolute values
 Adjust sign of quotient and remainder as required

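Division Algorithm and Hardware (Ver.1)

[Figure: division hardware; the Divisor register initially holds the divisor in its left half, and the Remainder register initially holds the dividend]

 2n-bit dividend and n-bit divisor yield n-bit quotient and remainder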

Chapter 3 — Arithmetic for Computers — 23

Division Example
Start: Place Dividend in Remainder
1. Subtract Divisor register from Remainder register, and place the result in Remainder register
   Test Remainder
   Remainder ≥ 0: 2a. Shift Quotient register to left, setting new rightmost bit to 1
   Remainder < 0: 2b. Restore original value by adding Divisor to Remainder, place sum in Remainder, shift Quotient to the left, setting new least significant bit to 0
3. Shift Divisor register right 1 bit
33rd repetition? No (< 33 repetitions): go back to step 1. Yes (33 repetitions): Done

Example: 0000 0111 ÷ 0010 (4-bit operands, so 5 repetitions; step i.k is step k of repetition i)

Step   Quot.  Divisor    Rem.
0      0000   0010 0000  0000 0111
1.1    0000   0010 0000  1110 0111
1.2b   0000   0010 0000  0000 0111
1.3    0000   0001 0000  0000 0111
2.1    0000   0001 0000  1111 0111
2.2b   0000   0001 0000  0000 0111
2.3    0000   0000 1000  0000 0111
3.1    0000   0000 1000  1111 1111
3.2b   0000   0000 1000  0000 0111
3.3    0000   0000 0100  0000 0111
4.1    0000   0000 0100  0000 0011
4.2a   0001   0000 0100  0000 0011
4.3    0001   0000 0010  0000 0011
5.1    0001   0000 0010  0000 0001
5.2a   0011   0000 0010  0000 0001
5.3    0011   0000 0001  0000 0001

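The restoring-division loop can be sketched for unsigned 32-bit operands as below. DivResult and divide_v1 are hypothetical names. One modelling assumption: the sign test is done by comparing before the subtraction, because a 64-bit C integer has no 65th bit to hold the sign of the wrapped difference. The caller must check for a 0 divisor first.

```c
#include <stdint.h>

typedef struct { uint32_t quotient, remainder; } DivResult;

/* Restoring division, Ver. 1: divisor starts in the left half of a 64-bit
   register and moves right; 33 repetitions. Precondition: divisor != 0. */
DivResult divide_v1(uint32_t dividend, uint32_t divisor)
{
    uint64_t rem = dividend;                   /* place dividend in Remainder */
    uint64_t div = (uint64_t)divisor << 32;    /* divisor in left half */
    uint32_t quot = 0;
    for (int i = 0; i < 33; i++) {
        int negative = rem < div;      /* would Remainder - Divisor be < 0? */
        rem -= div;                    /* 1.  subtract (wraps when negative) */
        if (negative) {
            rem += div;                /* 2b. restore the original value */
            quot <<= 1;                /*     shift in a 0 */
        } else {
            quot = (quot << 1) | 1u;   /* 2a. shift in a 1 */
        }
        div >>= 1;                     /* 3.  shift Divisor right 1 bit */
    }
    DivResult r = { quot, (uint32_t)rem };
    return r;
}
```

divide_v1(7, 2) returns quotient 3 and remainder 1, matching the trace.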
Observations
 Half of the bits in divisor register always 0
=> 1/2 of 64-bit adder is wasted
=> 1/2 of divisor is wasted
 Instead of shifting divisor to right,
shift remainder to left?
 1st step cannot produce a 1 in quotient bit
(otherwise quotient is too big for the register)
=> switch order to shift first and then subtract
=> save 1 iteration
 Eliminate Quotient register by combining with Remainder
register as shifted left

Chapter 3 — Arithmetic for Computers — 25

Divide Algorithm (Ver. 2)
Start: Place Dividend in Remainder
1. Shift Remainder register left 1 bit
2. Subtract Divisor register from the left half of Remainder register, and place the result in the left half of Remainder register
   Test Remainder
   Remainder ≥ 0: 3a. Shift Remainder to left, setting new rightmost bit to 1
   Remainder < 0: 3b. Restore original value by adding Divisor to left half of Remainder, and place sum in left half of Remainder. Also shift Remainder to left, setting the new least significant bit to 0
32nd repetition? No (< 32 repetitions): go back to step 2. Yes (32 repetitions): Done. Shift left half of Remainder right 1 bit

Example: 0000 0111 ÷ 0010 (4-bit operands, so 4 repetitions; step i.k is step k of repetition i)

Step   Remainder  Div.
0      0000 0111  0010
1.1    0000 1110
1.2    1110 1110
1.3b   0001 1100
2.2    1111 1100
2.3b   0011 1000
3.2    0001 1000
3.3a   0011 0001
4.2    0001 0001
4.3a   0010 0011
Done   0001 0011

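A sketch of Ver. 2 for unsigned 32-bit operands is shown below. DivResult and divide_v2 are hypothetical names, and the Remainder register is modelled as two variables so that the left half can briefly hold a 33rd bit, a detail the hardware figure does not show. The compare-then-subtract form stands in for subtract, test, and restore. The caller must check for a 0 divisor first.

```c
#include <stdint.h>

typedef struct { uint32_t quotient, remainder; } DivResult;

/* Restoring division, Ver. 2: quotient bits are shifted into the right half
   of the Remainder register; 32 repetitions. Precondition: divisor != 0. */
DivResult divide_v2(uint32_t dividend, uint32_t divisor)
{
    uint64_t left;                 /* left half of Remainder (wider to keep bit 32) */
    uint32_t right = dividend;     /* right half: dividend bits, then quotient bits */

    left = right >> 31;            /* 1. shift Remainder left 1 bit */
    right <<= 1;
    for (int i = 0; i < 32; i++) {
        unsigned qbit = 0;
        if (left >= divisor) {     /* 2.  left half - Divisor would be >= 0 */
            left -= divisor;
            qbit = 1;              /* 3a. new rightmost bit is 1 */
        }                          /* 3b. otherwise left half keeps its old value */
        left = (left << 1) | (right >> 31);    /* shift whole register left */
        right = (right << 1) | qbit;
    }
    /* Done: shift left half of Remainder right 1 bit. */
    DivResult r = { right, (uint32_t)(left >> 1) };
    return r;
}
```

divide_v2(7, 2) returns quotient 3 and remainder 1, as in the final row 0001 0011 of the trace.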
Optimized Divider
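[Figure: optimized divider hardware; no separate Multiplier/Quotient register (0 bits), since the quotient is shifted into the right half of the Remainder register]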

 One cycle per partial-remainder subtraction

 Looks a lot like a multiplier!
 Same hardware can be used for both

Chapter 3 — Arithmetic for Computers — 27

Faster Division
 Can’t use parallel hardware as in multiplier

 Subtraction is conditional on sign of remainder

 Faster dividers (e.g. SRT division) generate multiple quotient bits per step

 Still require multiple steps

Chapter 3 — Arithmetic for Computers — 28

MIPS Division
 Use HI/LO registers for result

 HI: 32-bit remainder

 LO: 32-bit quotient

 Instructions

 div rs, rt / divu rs, rt

 No overflow or divide-by-0 checking

 Software must perform checks if required

 Use mfhi, mflo to access result
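Since the hardware does no checking, the software checks might look like the sketch below. The function name checked_div is hypothetical; the outputs are named after the LO (quotient) and HI (remainder) registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Signed 32-bit division with the two checks that div leaves to software.
   Returns false, leaving the outputs untouched, when no valid result exists. */
bool checked_div(int32_t rs, int32_t rt, int32_t *lo, int32_t *hi)
{
    if (rt == 0)
        return false;                  /* divide by 0 */
    if (rs == INT32_MIN && rt == -1)
        return false;                  /* quotient 2^31 does not fit in 32 bits */
    *lo = rs / rt;                     /* quotient, truncated toward zero */
    *hi = rs % rt;                     /* remainder, takes the dividend's sign */
    return true;
}
```

For example, checked_div(-7, 2, &q, &r) succeeds with q = -3 and r = -1.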

Chapter 3 — Arithmetic for Computers — 29

Concluding Remarks
 Observations: Divide vs. Multiply

 Divide can use the same hardware as multiply

 Just need ALU to add or subtract, and 64-bit register to shift left or shift right
 Hi and Lo registers in MIPS combine to act as 64-bit register for multiply and divide

Chapter 3 — Arithmetic for Computers — 30

§3.5 Floating Point
Floating Point (FP)
 Representation for non-integral real-valued numbers
 Including very small and very large numbers
 Scientific notation
 $-2.34 \times 10^{56}$ (normalized)
 $+0.002 \times 10^{-4}$ (not normalized)
 $+987.02 \times 10^{9}$ (not normalized)
 In binary
 $\pm 1.xxxxxxx_2 \times 2^{yyyy}$
 $(-1)^S \times (1 + F) \times 2^{(E - \text{Bias})}$
 The programming language C uses the name float (or double) for single-precision (or double-precision) FP numbers.
Chapter 3 — Arithmetic for Computers — 31
Standard FP Representation
 Defined by IEEE Std 754-1985

 Developed in response to divergence of representations
 Portability issues for scientific code

 Now almost universally adopted

 Two representations
 32-bit single-precision (SP) FP

 64-bit double-precision (DP) FP

Chapter 3 — Arithmetic for Computers — 32

IEEE 754 Standard (1/2)
 Regarding single precision (SP), DP similar
$(-1)^S \times (1 + F) \times 2^{(E - \text{Bias})}$
 Sign bit S:
1 means negative
0 means positive
 Significand F:
 To pack more bits, leading 1 implicit for normalized numbers
 1 + 23 bits single, 1 + 52 bits double
 always true: $0 \le F < 1$ (for normalized numbers)
 Note: 0 has no leading 1, so reserve exponent value 0 just for number 0

Chapter 3 — Arithmetic for Computers — 33

IEEE 754 Standard (2/2)
 Exponent E:
 Need to represent positive and negative exponents
 Also want to compare FP numbers as if they were integers, to
help in value comparisons
 If use 2’s complement to represent?
e.g., $1.0 \times 2^{-1}$ versus $1.0 \times 2^{+1}$ (1/2 versus 2)

1/2:  0 1111 1111 000 0000 0000 0000 0000 0000
2:    0 0000 0001 000 0000 0000 0000 0000 0000

If we use integer comparison for these two words, we will conclude that 1/2 > 2!

Chapter 3 — Arithmetic for Computers — 34

Biased (Excess) Notation
 let notation 0000 be most negative, and 1111 be most positive
 Example: Biased 7
0000 -7
0001 -6
0010 -5
0011 -4
0100 -3
0101 -2
0110 -1
0111 0
1000 1
1001 2
1010 3
1011 4
1100 5
1101 6
1110 7
1111 8

Chapter 3 — Arithmetic for Computers — 35

IEEE 754 Standard
 Using biased notation
 the bias is the number subtracted to get the real number
 IEEE 754 uses bias of 127 for single precision:
Subtract 127 from Exponent field to get actual value for exponent
 1023 is bias for double precision
 The example becomes:

1/2:  0 0111 1110 000 0000 0000 0000 0000 0000
2:    0 1000 0000 000 0000 0000 0000 0000 0000

Chapter 3 — Arithmetic for Computers — 36

IEEE Floating-Point Format
Field     Single   Double
S         1 bit    1 bit
Exponent  8 bits   11 bits
Fraction  23 bits  52 bits

$x = (-1)^S \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})}$

 S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)
 Normalize significand: 1.0 ≤ |significand| < 2.0
 Always has a leading pre-binary-point 1 bit, so no need to represent it explicitly (hidden bit)
 Significand is Fraction with the “1.” restored
 Exponent: excess representation: actual exponent + Bias
 Ensures exponent is unsigned
 Single: Bias = 127; Double: Bias = 1023
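To make the field layout concrete, the sketch below pulls S, Exponent, and Fraction out of a C float and rebuilds the value with the formula above. It assumes float is the IEEE 754 32-bit single format (true on mainstream platforms but not guaranteed by C); the names Fp32Fields, fp32_fields, and fp32_value are hypothetical, and fp32_value only handles normalized numbers (Exponent 1 to 254).

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    unsigned sign;        /* S: 1 bit */
    unsigned exponent;    /* biased Exponent: 8 bits */
    uint32_t fraction;    /* Fraction: 23 bits */
} Fp32Fields;

/* Split a float into its three fields (assumes IEEE 754 single precision). */
Fp32Fields fp32_fields(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);    /* reinterpret the 32 bits safely */
    Fp32Fields f = { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
    return f;
}

/* Rebuild (-1)^S x (1 + Fraction) x 2^(Exponent - 127) for a normalized number. */
double fp32_value(Fp32Fields f)
{
    double significand = 1.0 + (double)f.fraction / 8388608.0;   /* 2^23 */
    int e = (int)f.exponent - 127;                                /* remove the bias */
    double scale = 1.0;
    for (; e > 0; e--) scale *= 2.0;
    for (; e < 0; e++) scale /= 2.0;
    return f.sign ? -(significand * scale) : significand * scale;
}
```

For -0.75f the fields come out as S = 1, Exponent = 126, Fraction = 100…0 in binary.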

Chapter 3 — Arithmetic for Computers — 37

Single-Precision Range
 Exponents 00000000 and 11111111 reserved
 Smallest value
 Exponent: 00000001 ⇒ actual exponent = 1 – 127 = –126
 Fraction: 000…00 ⇒ significand = 1.0
 $\pm 1.0 \times 2^{-126} \approx \pm 1.2 \times 10^{-38}$
 Largest value
 Exponent: 11111110 ⇒ actual exponent = 254 – 127 = +127
 Fraction: 111…11 ⇒ significand ≈ 2.0
 $\pm 2.0 \times 2^{+127} \approx \pm 3.4 \times 10^{+38}$

Chapter 3 — Arithmetic for Computers — 38

Double-Precision Range
 Exponents 0000…00 and 1111…11 reserved
 Smallest value
 Exponent: 00000000001 ⇒ actual exponent = 1 – 1023 = –1022
 Fraction: 000…00 ⇒ significand = 1.0
 $\pm 1.0 \times 2^{-1022} \approx \pm 2.2 \times 10^{-308}$
 Largest value
 Exponent: 11111111110 ⇒ actual exponent = 2046 – 1023 = +1023
 Fraction: 111…11 ⇒ significand ≈ 2.0
 $\pm 2.0 \times 2^{+1023} \approx \pm 1.8 \times 10^{+308}$

Chapter 3 — Arithmetic for Computers — 39

Floating-Point Precision
 Relative precision
 all fraction bits are significant (single: 23 bits; double: 52 bits)
 SP: approx $2^{-23}$
 Equivalent to $23 \times \log_{10} 2 \approx 23 \times 0.3 \approx 6$ decimal digits of precision
 DP: approx $2^{-52}$
 Equivalent to $52 \times \log_{10} 2 \approx 52 \times 0.3 \approx 16$ decimal digits of precision

Chapter 3 — Arithmetic for Computers — 40

Floating-Point Representation Example
 Represent –0.75
 $-0.75 = (-1)^1 \times 1.1_2 \times 2^{-1}$
 S = 1
 Fraction = $1000\ldots00_2$
 Exponent = –1 + Bias
 Single: $-1 + 127 = 126 = 01111110_2$
 Double: $-1 + 1023 = 1022 = 01111111110_2$
 SP: 1 01111110 1000…00
 DP: 1 01111111110 1000…00

Chapter 3 — Arithmetic for Computers — 41

Floating-Point Representation Example
 What number is represented by the single-precision float 1 10000001 01000…00
 S = 1
 Fraction = $01000\ldots00_2$
 Biased exponent = $10000001_2 = 129$
 Sol. $x = (-1)^1 \times (1 + .01_2) \times 2^{(129 - 127)}$
$= (-1) \times 1.25 \times 2^2$
$= -5.0$

Chapter 3 — Arithmetic for Computers — 42

Concluding Remarks
 What have we defined so far? (SP float)

Exponent  Significand  Object
0         0            ???
0         nonzero      ???
1-254     anything     +/- floating-point
255       0            ???
255       nonzero      ???

Chapter 3 — Arithmetic for Computers — 43

Zero and Special Numbers
 Represent 0?
 exponent all zeroes
 significand all zeroes too
 What about sign?
 +0: 0 00000000 00000000000000000000000
 -0: 1 00000000 00000000000000000000000
 Why two zeroes?
 Helps in some limit comparisons

 Special numbers
 Range: $1.0 \times 2^{-126} \approx 1.2 \times 10^{-38}$
 What if result too small? (> 0, < $1.2 \times 10^{-38}$ => Underflow!)
 What if result too large? (> $3.4 \times 10^{38}$ => Overflow!)

Chapter 3 — Arithmetic for Computers — 44

Gradual Underflow
 Represent denormalized numbers (denorms)
 Exponent : all zeroes
 Significand : non-zeroes
 Allow a number to degrade in significance until it becomes 0 (gradual underflow)

 The smallest normalized number
 $1.00\ldots0_2 \times 2^{-126}$

Chapter 3 — Arithmetic for Computers — 45

Representation for +/- Infinity
 In FP, divide by zero should produce +/- infinity, not
overflow
 Why?
 OK to do further computations with infinity, e.g., X/0 > Y may be
a valid comparison
 IEEE 754 represents +/- infinity
 Most positive exponent reserved for infinity
 Significands all zeroes

S 1111 1111 0000 0000 0000 0000 0000 000

Chapter 3 — Arithmetic for Computers — 46

Representation for Not a Number
 What do I get if I calculate sqrt(-4.0) or 0.0/0.0?
 If infinity is not an error, these should not be either
 They are called Not a Number (NaN)
 Exponent = 255, Significand nonzero
 Why is this useful?
 Hope NaNs help with debugging?
 They contaminate: op(NaN,X) = NaN
 OK if calculate but don’t use it

Chapter 3 — Arithmetic for Computers — 47

IEEE 754 Encoding of FP Numbers
 What have we defined so far? (single-precision)

Exponent  Significand  Object
0         0            0
0         nonzero      denorm
1-254     anything     +/- fl. pt. #
255       0            +/- infinity
255       nonzero      NaN
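The encoding table translates into a small classifier. This sketch assumes float is the IEEE 754 32-bit single format; FpKind, fp32_from_bits, and fp32_classify are hypothetical names chosen so they do not clash with the classification macros of the C math library.

```c
#include <stdint.h>
#include <string.h>

typedef enum {
    FP_KIND_ZERO, FP_KIND_DENORM, FP_KIND_NORMAL, FP_KIND_INF, FP_KIND_NAN
} FpKind;

/* Build a float from a raw 32-bit pattern (handy for infinities and NaNs). */
float fp32_from_bits(uint32_t bits)
{
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Classify by the Exponent and Significand (fraction) fields; sign is ignored. */
FpKind fp32_classify(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t fraction = bits & 0x7FFFFFu;
    if (exponent == 0)
        return fraction == 0 ? FP_KIND_ZERO : FP_KIND_DENORM;
    if (exponent == 255)
        return fraction == 0 ? FP_KIND_INF : FP_KIND_NAN;
    return FP_KIND_NORMAL;             /* exponent 1-254 */
}
```

Both +0 and -0 classify as zero, and the pattern 0 11111111 000…0 classifies as infinity.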

Chapter 3 — Arithmetic for Computers — 48

Floating-Point Addition
 Now consider a 4-digit binary example
 $1.000_2 \times 2^{-1} + (-1.110_2 \times 2^{-2})$ (i.e. 0.5 + (–0.4375))
1. Align binary points
 Shift number with smaller exponent
 $1.000_2 \times 2^{-1} + (-0.111_2 \times 2^{-1})$
2. Add significands
 $1.000_2 \times 2^{-1} + (-0.111_2 \times 2^{-1}) = 0.001_2 \times 2^{-1}$
3. Normalize result & check for over/underflow
 $1.000_2 \times 2^{-4}$, with no over/underflow
4. Round and renormalize if necessary
 $1.000_2 \times 2^{-4}$ (no change) $= 0.0625$

Chapter 3 — Arithmetic for Computers — 49

Floating-Point Addition Algorithm
Basic addition algorithm:
compute $Y_e - X_e$ (to align binary point)
(1) right shift the smaller number, say $X_m$, that many positions to form $X_m \times 2^{X_e - Y_e}$
(2) compute $X_m \times 2^{X_e - Y_e} + Y_m$
if the result demands normalization, then normalize:
(3) left shift result, decrement result exponent, or right shift result, increment result exponent; continue until MSB of data is 1 (NOTE: hidden bit in IEEE Standard)
(3.1) check overflow or underflow during the shift
(4) round the mantissa
(5) if result is 0 mantissa, set the exponent to 0 as well

FP Adder Hardware

[Figure: FP adder hardware, annotated with Steps 1 to 4 of the addition algorithm]

Chapter 3 — Arithmetic for Computers — 51

Floating-Point Multiplication
 Now consider a 4-digit binary example
 $1.000_2 \times 2^{-1} \times (-1.110_2 \times 2^{-2})$ (i.e. 0.5 × (–0.4375))
1. Add exponents
 Unbiased: –1 + –2 = –3
 Biased: (–1 + 127) + (–2 + 127) – 127 = –3 + 254 – 127 = –3 + 127
2. Multiply significands
 $1.000_2 \times 1.110_2 = 1.110_2 \Rightarrow 1.110_2 \times 2^{-3}$
3. Normalize result & check for over/underflow
 $1.110_2 \times 2^{-3}$ (no change) with no over/underflow
4. Round and renormalize if necessary
 $1.110_2 \times 2^{-3}$ (no change)
5. Determine sign: +ve × –ve ⇒ –ve
 $-1.110_2 \times 2^{-3} = -0.21875$

Chapter 3 — Arithmetic for Computers — 52

FP Arithmetic Hardware
 Much more complex than integer arithmetic
 Doing it in one clock cycle would take too long
 FP multiplier is of similar complexity to FP adder
 But uses a multiplier for significand instead of an adder
 FP arithmetic hardware usually does
 Addition, subtraction, multiplication, division, reciprocal,
square-root
 FP ↔ integer conversion is not trivial
 Operations usually take several cycles
 Can be pipelined

Chapter 3 — Arithmetic for Computers — 53

FP Instructions in MIPS (1/2)
 FP hardware is coprocessor 1
 Adjunct processor that extends the ISA
 Separate FP registers
 32 single-precision: $f0, $f1, … $f31
 Paired for double-precision: $f0/$f1, $f2/$f3, …
 Release 2 of MIPS ISA supports 32 × 64-bit FP reg’s
 FP instructions operate only on FP registers
 Programs generally don’t do integer ops on FP data, or vice
versa
 More registers with minimal code-size impact
 FP load and store instructions
 lwc1, ldc1, swc1, sdc1
 e.g., ldc1 $f8, 32($sp)

Chapter 3 — Arithmetic for Computers — 54

FP Instructions in MIPS (2/2)
 Single-precision arithmetic
 add.s, sub.s, mul.s, div.s
 e.g., add.s $f0, $f1, $f6
 Double-precision arithmetic
 add.d, sub.d, mul.d, div.d
 e.g., mul.d $f4, $f4, $f6
 Single- and double-precision comparison
 c.xx.s, c.xx.d (xx is eq, lt, le, …)
 Sets or clears FP condition-code bit
 e.g. c.lt.s $f3, $f4
 Branch on FP condition code true or false
 bc1t, bc1f
 e.g., bc1t TargetLabel
(For more examples, please refer to Fig. 3.17-18, p. 222-223.)
Chapter 3 — Arithmetic for Computers — 55
FP Example: °F to °C
 C code:
float f2c (float fahr) {
return ((5.0/9.0)*(fahr - 32.0));
}
 fahr in $f12, result in $f0, literals in global memory space
 Compiled MIPS code:
f2c: lwc1 $f16, const5($gp) #$f16=5.0(in Mem.)
lwc1 $f18, const9($gp) #$f18=9.0(in Mem.)
div.s $f16, $f16, $f18 #$f16=5.0/9.0
lwc1 $f18, const32($gp) #$f18=32.0(in Mem)
sub.s $f18, $f12, $f18 #f18=fahr-32.0
mul.s $f0, $f16, $f18 #$f0=(5/9)*(fahr-32)
jr $ra

Chapter 3 — Arithmetic for Computers — 56

FP Example: Matrix Multiplication (1/3)
 X=X+Y×Z
 All 32 × 32 matrices, 64-bit double-precision elements
 C code:
void mm (double x[][32], double y[][32], double z[][32]) {
  int i, j, k;
  for (i = 0; i != 32; i = i + 1)
    for (j = 0; j != 32; j = j + 1)
      for (k = 0; k != 32; k = k + 1)
        x[i][j] = x[i][j] + y[i][k] * z[k][j];
}
 Addresses of x, y, z in $a0, $a1, $a2, and i, j, k in $s0, $s1, $s2

Chapter 3 — Arithmetic for Computers — 57

FP Example: Matrix Multiplication (2/3)
 MIPS code:
li $t1, 32 # $t1 = 32 (row size/loop end)
li $s0, 0 # i = 0; initialize 1st for loop
L1: li $s1, 0 # j = 0; restart 2nd for loop
L2: li $s2, 0 # k = 0; restart 3rd for loop
sll $t2, $s0, 5 # $t2 = i * 32 (size of row of x)
addu $t2, $t2, $s1 # $t2 = i * size(row) + j
sll $t2, $t2, 3 # $t2 = byte offset of [i][j]
addu $t2, $a0, $t2 # $t2 = byte address of x[i][j]
l.d $f4, 0($t2) # $f4 = 8 bytes of x[i][j]
L3: sll $t0, $s2, 5 # $t0 = k * 32 (size of row of z)
addu $t0, $t0, $s1 # $t0 = k * size(row) + j
sll $t0, $t0, 3 # $t0 = byte offset of [k][j]
addu $t0, $a2, $t0 # $t0 = byte address of z[k][j]
l.d $f16, 0($t0) # $f16 = 8 bytes of z[k][j]
…

Chapter 3 — Arithmetic for Computers — 58

FP Example: Matrix Multiplication (3/3)
…
sll $t0, $s0, 5 # $t0 = i*32 (size of row of y)
addu $t0, $t0, $s2 # $t0 = i*size(row) + k
sll $t0, $t0, 3 # $t0 = byte offset of [i][k]
addu $t0, $a1, $t0 # $t0 = byte address of y[i][k]
l.d $f18, 0($t0) # $f18 = 8 bytes of y[i][k]
mul.d $f16, $f18, $f16 # $f16 = y[i][k] * z[k][j]
add.d $f4, $f4, $f16 # $f4 = x[i][j] + y[i][k]*z[k][j]
addiu $s2, $s2, 1 # k = k + 1
bne $s2, $t1, L3 # if (k != 32) go to L3
s.d $f4, 0($t2) # x[i][j] = $f4
addiu $s1, $s1, 1 # j = j + 1
bne $s1, $t1, L2 # if (j != 32) go to L2
addiu $s0, $s0, 1 # i = i + 1
bne $s0, $t1, L1 # if (i != 32) go to L1

Chapter 3 — Arithmetic for Computers — 59

Variant FP Format

Chapter 3 — Arithmetic for Computers — 60

Accurate Arithmetic
 IEEE Std 754 specifies additional rounding control
 Extra bits of precision (guard, round, sticky)
 Choice of rounding modes
 Allows programmer to fine-tune numerical behavior of a
computation
 Not all FP units implement all options
 Most programming languages and FP libraries just use defaults
 Trade-off between hardware complexity, performance,
and market requirements

Chapter 3 — Arithmetic for Computers — 61

Extra Bits for Rounding
 Why rounding after addition?
 Because intermediate results need not be truncated at every step
 To keep more precision

 Guard and round bits: extra bits to guard against loss of bits during
intermediate additions
 to the right of significand
 can later be shifted left into significand during normalization

 Sticky bit
 Additional bit to the right of the round bit
 Better fine tune rounding

Chapter 3 — Arithmetic for Computers — 62

Example
 Try to add $2.98 \times 10^{0}$ and $2.34 \times 10^{2}$
 only 3 decimal digits are allowed
 without guard digits:

  2.34
+ 0.02
------
  2.36   (× 10^2)

 with 2 more guard digits during computation (the decimal analogue of guard bits)
 perform rounding at last

  2.3400
+ 0.0298
--------
  2.3698 → rounding → 2.37   (× 10^2)

 With guard digits and rounding ⇒ more accurate results

Chapter 3 — Arithmetic for Computers — 63

Rounding Methods
 Round to zero or Truncation
 The result closest to zero is returned.
 Nothing is added to the least significant bit.
 Round up
 The more positive result closest to the infinitely precise result is returned.
 If the result is positive and either the guard or the sticky bit is 1, the result is rounded up.
 If the result is negative, the result is not rounded because the unrounded result is the most positive result that is closest to the infinitely precise result.
 Round down
 The more negative result is returned.
 Round to nearest

Chapter 3 — Arithmetic for Computers — 64

Associativity
 Parallel programs may interleave operations in
unexpected orders

 Assumptions of associativity may fail

             (x+y)+z           x+(y+z)
x            -1.50E+38         -1.50E+38
y             1.50E+38          1.50E+38
z             1.0               1.0
first sum     x+y = 0.00E+00    y+z = 1.50E+38
result        1.00E+00          0.00E+00

 Need to validate parallel programs under varying degrees of parallelism
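The table can be reproduced in C with single-precision floats. The function names are hypothetical, and the volatile temporaries are only there to keep a compiler from reassociating or folding the sums.

```c
/* (x + y) + z: the two large values cancel first, so z survives. */
float sum_left(float x, float y, float z)
{
    volatile float t = x + y;
    return t + z;
}

/* x + (y + z): z is absorbed when added to the much larger y. */
float sum_right(float x, float y, float z)
{
    volatile float t = y + z;
    return x + t;
}
```

With x = -1.5e38f, y = 1.5e38f, and z = 1.0f, sum_left gives 1.0 while sum_right gives 0.0.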

Chapter 3 — Arithmetic for Computers — 65

§3.7 Real Stuff: Streaming SIMD Extensions and AVX in x86
Subword Parallelism
 Graphics and audio applications can take advantage of
performing simultaneous operations on short vectors
 Example: 128-bit adder:
 16x8-bit adds; 8x16-bit adds; 4x32-bit adds

 Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data (SIMD)
 ARM NEON multimedia instruction extension

 Intel SSE, SSE2 FP instructions
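A partitioned carry chain can be imitated in plain C on a 64-bit word, as a software stand-in for what a SIMD adder does in hardware. This is only a sketch: the name add8x8 is hypothetical, and it treats the word as eight independent 8-bit lanes with wraparound (not saturating) adds.

```c
#include <stdint.h>

/* Eight independent 8-bit adds inside one 64-bit word: carries are stopped
   at every lane boundary instead of rippling into the neighbouring lane. */
uint64_t add8x8(uint64_t a, uint64_t b)
{
    const uint64_t H = 0x8080808080808080ULL;   /* MSB of each 8-bit lane */
    uint64_t low = (a & ~H) + (b & ~H);         /* add the low 7 bits of each lane */
    return low ^ ((a ^ b) & H);                 /* add the lane MSBs, dropping carry-out */
}
```

For instance, a lane holding 0xFF plus a lane holding 0x01 wraps to 0x00 without disturbing the lane above it.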

Chapter 3 — Arithmetic for Computers — 66

ARM NEON Instructions
 NEON supports all the subword data types you can imagine except 64-bit FP numbers
 8-bit, 16-bit, 32-bit, and 64-bit signed and unsigned integers
 32-bit FP numbers

Chapter 3 — Arithmetic for Computers — 67

Right Shift and Division
 Left shift by i places multiplies an integer by $2^i$
 Right shift divides by $2^i$?
 Only for unsigned integers
 For signed integers
 Arithmetic right shift: replicate the sign bit
 e.g., –5 / 4
 $11111011_2$ >> 2 = $11111110_2$ = –2
 Rounds toward –∞
 c.f. $11111011_2$ >>> 2 = $00111110_2$ = +62
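The same –5 example in C, as a sketch with hypothetical function names. Note one assumption: right-shifting a negative signed value is implementation-defined in C, and this code relies on the arithmetic shift that mainstream compilers provide.

```c
#include <stdint.h>

/* Arithmetic right shift: sign bit replicated (assumed compiler behaviour). */
int8_t shift_arith(int8_t x, unsigned i)
{
    return (int8_t)(x >> i);
}

/* Logical right shift: zeros shifted in from the left. */
uint8_t shift_logical(uint8_t x, unsigned i)
{
    return (uint8_t)(x >> i);
}

/* True division by 2^i: C rounds the quotient toward zero. */
int8_t div_pow2(int8_t x, unsigned i)
{
    return (int8_t)(x / (1 << i));
}
```

shift_arith(-5, 2) gives -2 (rounded toward -∞), div_pow2(-5, 2) gives -1 (rounded toward zero), and shift_logical(0xFB, 2) gives 62.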

Chapter 3 — Arithmetic for Computers — 68

Concluding Remarks
 ISAs support arithmetic
 Signed and unsigned integers
 Floating-point approximation to reals
 Bounded range and precision
 Operations can overflow and underflow
 MIPS ISA
 Core instructions: 54 most frequently used
 100% of SPECINT, 97% of SPECFP

 Other instructions: less frequent

Chapter 3 — Arithmetic for Computers — 69

APPENDIX

Chapter 3 — Arithmetic for Computers — 70

x86 FP Architecture
 Originally based on 8087 FP coprocessor
 8 × 80-bit extended-precision registers
 Used as a push-down stack
 Registers indexed from TOS: ST(0), ST(1), …
 FP values are 32-bit or 64-bit in memory
 Converted on load/store of memory operand
 Integer operands can also be converted on load/store
 Very difficult to generate and optimize code
 Result: poor FP performance

Chapter 3 — Arithmetic for Computers — 71

x86 FP Instructions
Data transfer     Arithmetic           Compare        Transcendental
FILD mem/ST(i)    FIADDP mem/ST(i)     FICOMP         FPATAN
FISTP mem/ST(i)   FISUBRP mem/ST(i)    FIUCOMP        F2XM1
FLDPI             FIMULP mem/ST(i)     FSTSW AX/mem   FCOS
FLD1              FIDIVRP mem/ST(i)                   FPTAN
FLDZ              FSQRT                               FPREM
                  FABS                                FSIN
                  FRNDINT                             FYL2X

 Optional variations
 I: integer operand
 P: pop operand from stack
 R: reverse operand order
 But not all combinations allowed

Chapter 3 — Arithmetic for Computers — 72

Streaming SIMD Extension 2 (SSE2)
 Adds 8 × 128-bit registers (XMM0 to XMM7)
 Extended to 16 registers in AMD64/EM64T
 Can be used for multiple FP operands
 2 × 64-bit double precision
 4 × 32-bit single precision
 Instructions operate on them simultaneously
 Single-Instruction Multiple-Data

Chapter 3 — Arithmetic for Computers — 73

Matrix Multiply
 Unoptimized code:

void dgemm (int n, double* A, double* B, double* C)
{
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
    {
      double cij = C[i+j*n]; /* cij = C[i][j] */
      for (int k = 0; k < n; k++)
        cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */
      C[i+j*n] = cij; /* C[i][j] = cij */
    }
}

Chapter 3 — Arithmetic for Computers — 74

Matrix Multiply
 x86 assembly code:
vmovsd (%r10),%xmm0             # Load 1 element of C into %xmm0
mov    %rsi,%rcx                # register %rcx = %rsi
xor    %eax,%eax                # register %eax = 0
vmovsd (%rcx),%xmm1             # Load 1 element of B into %xmm1
add    %r9,%rcx                 # register %rcx = %rcx + %r9
vmulsd (%r8,%rax,8),%xmm1,%xmm1 # Multiply %xmm1, element of A
add    $0x1,%rax                # register %rax = %rax + 1
cmp    %eax,%edi                # compare %eax to %edi
vaddsd %xmm1,%xmm0,%xmm0        # Add %xmm1, %xmm0
jg     30 <dgemm+0x30>          # jump if %edi > %eax
add    $0x1,%r11d               # register %r11 = %r11 + 1
vmovsd %xmm0,(%r10)             # Store %xmm0 into C element

Chapter 3 — Arithmetic for Computers — 75

Matrix Multiply
 Optimized C code:
#include <x86intrin.h>
void dgemm (int n, double* A, double* B, double* C)
{
  for ( int i = 0; i < n; i+=4 )
    for ( int j = 0; j < n; j++ ) {
      __m256d c0 = _mm256_load_pd(C+i+j*n); /* c0 = C[i][j] */
      for( int k = 0; k < n; k++ )
        c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */
               _mm256_mul_pd(_mm256_load_pd(A+i+k*n),
               _mm256_broadcast_sd(B+k+j*n)));
      _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */
    }
}

Chapter 3 — Arithmetic for Computers — 76

Matrix Multiply
 Optimized x86 assembly code:
vmovapd (%r11),%ymm0             # Load 4 elements of C into %ymm0
mov     %rbx,%rcx                # register %rcx = %rbx
xor     %eax,%eax                # register %eax = 0
vbroadcastsd (%rax,%r8,1),%ymm1  # Make 4 copies of B element
add     $0x8,%rax                # register %rax = %rax + 8
vmulpd  (%rcx),%ymm1,%ymm1       # Parallel mul %ymm1, 4 A elements
add     %r9,%rcx                 # register %rcx = %rcx + %r9
cmp     %r10,%rax                # compare %r10 to %rax
vaddpd  %ymm1,%ymm0,%ymm0        # Parallel add %ymm1, %ymm0
jne     50 <dgemm+0x50>          # jump if %r10 != %rax
add     $0x1,%esi                # register %esi = %esi + 1
vmovapd %ymm0,(%r11)             # Store %ymm0 into 4 C elements

