Open Book 331

Computer Architecture

Uploaded by

tradecredible

Chapter 1 Solutions

1.1
Personal computer (includes workstation and laptop): Personal computers emphasize delivery of good performance to single users at low cost and usually execute third-party software.
Personal mobile device (PMD, includes tablets): PMDs are battery operated with wireless connectivity to the Internet and typically cost hundreds of dollars, and, like PCs, users can download software ("apps") to run on them. Unlike PCs, they no longer have a keyboard and mouse, and are more likely to rely on a touch-sensitive screen or even speech input.
Server: Computer used to run large problems and usually accessed via a network.
Warehouse scale computer: Thousands of processors forming a large cluster.
Supercomputer: Computer composed of hundreds to thousands of processors and terabytes of memory.
Embedded computer: Computer designed to run one application or one set of related applications and integrated into a single system.

1.2
a. Performance via Pipelining
b. Dependability via Redundancy
c. Performance via Prediction
d. Make the Common Case Fast
e. Hierarchy of Memories
f. Performance via Parallelism
g. Design for Moore's Law
h. Use Abstraction to Simplify Design

1.3 The program is compiled into an assembly language program, which is then assembled into a machine language program.

1.4
a. 1280 × 1024 pixels = 1,310,720 pixels => 1,310,720 × 3 = 3,932,160 bytes/frame.
b. 3,932,160 bytes × (8 bits/byte)/100E6 bits/second = 0.31 seconds

1.5
a. performance of P1 (instructions/sec) = 3 × 10^9/1.5 = 2 × 10^9
performance of P2 (instructions/sec) = 2.5 × 10^9/1.0 = 2.5 × 10^9
performance of P3 (instructions/sec) = 4 × 10^9/2.2 = 1.8 × 10^9
b. cycles(P1) = 10 × 3 × 10^9 = 30 × 10^9
cycles(P2) = 10 × 2.5 × 10^9 = 25 × 10^9
cycles(P3) = 10 × 4 × 10^9 = 40 × 10^9
c. No. instructions(P1) = 30 × 10^9/1.5 = 20 × 10^9
No. instructions(P2) = 25 × 10^9/1 = 25 × 10^9
No. instructions(P3) = 40 × 10^9/2.2 = 18.18 × 10^9
CPI_new = CPI_old × 1.2, then CPI(P1) = 1.8, CPI(P2) = 1.2, CPI(P3) = 2.6
f = No. instr. × CPI/time, then
f(P1) = 20 × 10^9 × 1.8/7 = 5.14 GHz
f(P2) = 25 × 10^9 × 1.2/7 = 4.28 GHz
f(P3) = 18.18 × 10^9 × 2.6/7 = 6.75 GHz

1.6
a. Class A: 10^5 instr. Class B: 2 × 10^5 instr. Class C: 5 × 10^5 instr. Class D: 2 × 10^5 instr.
Time = No. instr. × CPI/clock rate
Total time P1 = (10^5 + 2 × 10^5 × 2 + 5 × 10^5 × 3 + 2 × 10^5 × 3)/(2.5 × 10^9) = 10.4 × 10^-4 s
Total time P2 = (10^5 × 2 + 2 × 10^5 × 2 + 5 × 10^5 × 2 + 2 × 10^5 × 2)/(3 × 10^9) = 6.66 × 10^-4 s
CPI(P1) = 10.4 × 10^-4 × 2.5 × 10^9/10^6 = 2.6
CPI(P2) = 6.66 × 10^-4 × 3 × 10^9/10^6 = 2.0
b. clock cycles(P1) = 10^5 × 1 + 2 × 10^5 × 2 + 5 × 10^5 × 3 + 2 × 10^5 × 3 = 26 × 10^5
clock cycles(P2) = 10^5 × 2 + 2 × 10^5 × 2 + 5 × 10^5 × 2 + 2 × 10^5 × 2 = 20 × 10^5

1.7
a. CPI = T_exec × f/No. instr.
Compiler A CPI = 1.1
Compiler B CPI = 1.25
b. f_B/f_A = (No. instr.(B) × CPI(B))/(No. instr.(A) × CPI(A)) = 1.37

1.8
1.8.1 C = 2 × DP/(V^2 × F)
Pentium 4: C = 3.2E-8 F
Core i5 Ivy Bridge: C = 2.9E-8 F
1.8.2 Pentium 4: 10/100 = 10%
Core i5 Ivy Bridge: 30/70 = 42.9%
1.8.3 (S_new + D_new)/(S_old + D_old) = 0.90
D_new = C × V_new^2 × F
S_old = V_old × I
S_new = V_new × I
Therefore:
V_new = [D_new/(C × F)]^(1/2)
D_new = 0.90 × (S_old + D_old) - S_new
S_new = V_new × (S_old/V_old)
Pentium 4:
S_new = V_new × (10/1.25) = V_new × 8
D_new = 0.90 × 100 - V_new × 8 = 90 - V_new × 8
V_new = [(90 - V_new × 8)/(3.2E-8 × 3.6E9)]^(1/2)
V_new = 0.85 V
Core i5:
S_new = V_new × (30/0.9) = V_new × 33.3
D_new = 0.90 × 70 - V_new × 33.3 = 63 - V_new × 33.3
V_new = [(63 - V_new × 33.3)/(2.9E-8 × 3.4E9)]^(1/2)
V_new = 0.64 V

1.9
1.9.1
Processors  No. arith. instr.  No. L/S instr.  No. branch instr.  Cycles   Ex. time (s)  Speedup
1           2.56E9             1.28E9          2.56E8             7.94E10  39.7          1
2           1.83E9             9.14E8          2.56E8             5.67E10  28.3          1.4
4           9.12E8             4.57E8          2.56E8             2.83E10  14.2          2.8
8           4.57E8             2.29E8          2.56E8             1.42E10  7.10          5.6
1.9.2
Processors  Ex. time (s)
1           41.0
2           29.3
4           14.6
8           7.33
1.9.3 3

1.10
1.10.1
die area_15cm = wafer area/dies per wafer = pi × 7.5^2/84 = 2.10 cm^2
yield_15cm = 1/(1 + (0.020 × 2.10/2))^2 = 0.9593
die area_20cm = wafer area/dies per wafer = pi × 10^2/100 = 3.14 cm^2
yield_20cm = 1/(1 + (0.031 × 3.14/2))^2 = 0.9093
1.10.2
cost/die_15cm = 12/(84 × 0.9593) = 0.1489
cost/die_20cm = 15/(100 × 0.9093) = 0.1650
1.10.3
die area_15cm = wafer area/dies per wafer = pi × 7.5^2/(84 × 1.1) = 1.91 cm^2
yield_15cm = 1/(1 + (0.020 × 1.15 × 1.91/2))^2 = 0.9575
die area_20cm = wafer area/dies per wafer = pi × 10^2/(100 × 1.1) = 2.86 cm^2
yield_20cm = 1/(1 + (0.03 × 1.15 × 2.86/2))^2 = 0.9082
1.10.4
defects per area_0.92 = (1 - y^.5)/(y^.5 × die_area/2) = (1 - 0.92^.5)/(0.92^.5 × 2/2) = 0.043 defects/cm^2
defects per area_0.95 = (1 - y^.5)/(y^.5 × die_area/2) = (1 - 0.95^.5)/(0.95^.5 × 2/2) = 0.026 defects/cm^2

1.11
1.11.1 CPI = clock rate × CPU time/instr. count
clock rate = 1/cycle time = 3 GHz
CPI(bzip2) = 3 × 10^9 × 750/(2389 × 10^9) = 0.94
1.11.2 SPEC ratio = ref. time/execution time
SPEC ratio(bzip2) = 9650/750 = 12.86
1.11.3 CPU time = No. instr. × CPI/clock rate
If CPI and clock rate do not change, the CPU time increase is equal to the increase in the number of instructions, that is 10%.
1.11.4 CPU time(before) = No. instr. × CPI/clock rate
CPU time(after) = 1.1 × No. instr. × 1.05 × CPI/clock rate
CPU time(after)/CPU time(before) = 1.1 × 1.05 = 1.155. Thus, CPU time is increased by 15.5%.
1.11.5 SPECratio = reference time/CPU time
SPECratio(after)/SPECratio(before) = CPU time(before)/CPU time(after) = 1/1.155 = 0.86. The SPECratio is decreased by 14%.
1.11.6 CPI = (CPU time × clock rate)/No. instr.
CPI = 700 × 4 × 10^9/(0.85 × 2389 × 10^9) = 1.37
1.11.7 Clock rate ratio = 4 GHz/3 GHz = 1.33
CPI @ 4 GHz = 1.37, CPI @ 3 GHz = 0.94, ratio = 1.45
They are different because, although the number of instructions has been reduced by 15%, the CPU time has been reduced by a lower percentage.
1.11.8 700/750 = 0.933. CPU time reduction: 6.7%
1.11.9 No. instr. = CPU time × clock rate/CPI
No. instr. = 960 × 0.9 × 4 × 10^9/1.61 = 2146 × 10^9
1.11.10 Clock rate = No. instr. × CPI/CPU time.
Clock rate_new = No. instr. × CPI/(0.9 × CPU time) = 1/0.9 × clock rate_old = 3.33 GHz
1.11.11 Clock rate = No. instr. × CPI/CPU time.
Clock rate_new = No. instr. × 0.85 × CPI/(0.80 × CPU time) = 0.85/0.80 × clock rate_old = 3.18 GHz

1.12
1.12.1 T(P1) = 5 × 10^9 × 0.9/(4 × 10^9) = 1.125 s
T(P2) = 10^9 × 0.75/(3 × 10^9) = 0.25 s
clock rate(P1) > clock rate(P2), performance(P1) < performance(P2)
1.12.2 T(P1) = No. instr. × CPI/clock rate
T(P1) = 2.25 × 10^-1 s
T(P2) = N × 0.75/(3 × 10^9), then N = 9 × 10^8
1.12.3 MIPS = Clock rate × 10^-6/CPI
MIPS(P1) = 4 × 10^9 × 10^-6/0.9 = 4.44 × 10^3
MIPS(P2) = 3 × 10^9 × 10^-6/0.75 = 4.0 × 10^3
MIPS(P1) > MIPS(P2), performance(P1) < performance(P2) (from 1.12.1)
1.12.4 MFLOPS = No. FP operations × 10^-6/T
MFLOPS(P1) = .4 × 5E9 × 1E-6/1.125 = 1.78E3
MFLOPS(P2) = .4 × 1E9 × 1E-6/.25 = 1.60E3
MFLOPS(P1) > MFLOPS(P2), performance(P1) < performance(P2) (from 1.12.1)

1.13
1.13.1 T_fp = 70 × 0.8 = 56 s. T_new = 56 + 85 + 55 + 40 = 236 s. Reduction: 5.6%
1.13.2 T_new = 250 × 0.8 = 200 s, T_fp + T_l/s + T_branch = 165 s, T_int = 35 s. Reduction time INT: 58.8%
1.13.3 T_new = 250 × 0.8 = 200 s, T_fp + T_int + T_l/s = 210 s. NO

1.14
1.14.1 Clock cycles = CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.
T_CPU = clock cycles/clock rate = clock cycles/(2 × 10^9)
clock cycles = 512 × 10^6; T_CPU = 0.256 s
To halve the number of clock cycles by improving the CPI of FP instructions:
CPI_improved fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr. = clock cycles/2
CPI_improved fp = (clock cycles/2 - (CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.))/No. FP instr.
CPI_improved fp = (256 - 462)/50 < 0 ==> not possible
1.14.2 Using the clock cycle data from 1.14.1.
To halve the number of clock cycles by improving the CPI of L/S instructions:
CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_improved l/s × No. L/S instr. + CPI_branch × No. branch instr. = clock cycles/2
CPI_improved l/s = (clock cycles/2 - (CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_branch × No. branch instr.))/No. L/S instr.
The non-L/S instructions account for 512 - 4 × 80 = 192 (× 10^6) cycles, so
CPI_improved l/s = (256 - 192)/80 = 0.8
1.14.3 Clock cycles = CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.
T_CPU = clock cycles/clock rate = clock cycles/(2 × 10^9)
CPI_int = 0.6 × 1 = 0.6; CPI_fp = 0.6 × 1 = 0.6; CPI_l/s = 0.7 × 4 = 2.8; CPI_branch = 0.7 × 2 = 1.4
T_CPU (before improv.) = 0.256 s; T_CPU (after improv.) = 0.171 s

1.15
Processors  Exec. time/processor  Time w/overhead  Speedup              Actual speedup/ideal speedup
1           100
2           50                    54               100/54 = 1.85        1.85/2 = 0.93
4           25                    29               100/29 = 3.44        3.44/4 = 0.86
8           12.5                  16.5             100/16.5 = 6.06      6.06/8 = 0.76
16          6.25                  10.25            100/10.25 = 9.76     9.76/16 = 0.61

2.1
    addi f, h, -5    (note, no subi)
    add  f, f, g

2.2 f = g + h + i

2.3
    sub $t0, $s3, $s4
    add $t0, $s6, $t0
    lw  $t1, 16($t0)
    sw  $t1, 32($s7)

2.4 B[g] = A[f] + A[1+f];

2.5
    add $t0, $s6, $s0
    add $t1, $s7, $s1
    lw  $s0, 0($t0)
    lw  $t0, 4($t0)
    add $t0, $t0, $s0
    sw  $t0, 0($t1)

2.6
2.6.1
    temp = Array[0];
    temp2 = Array[1];
    Array[0] = Array[4];
    Array[1] = temp;
    Array[4] = Array[3];
    Array[3] = temp2;
2.6.2
    lw $t0, 0($s6)
    lw $t1, 4($s6)
    lw $t2, 16($s6)
    sw $t2, 0($s6)
    sw $t0, 4($s6)
    lw $t0, 12($s6)
    sw $t0, 16($s6)
    sw $t1, 12($s6)

2.7

2.8 2882400018

2.9
    sll  $t0, $s1, 2      # $t0 <-- 4*g
    add  $t0, $t0, $s7    # $t0 <-- Addr(B[g])
    lw   $t0, 0($t0)      # $t0 <-- B[g]
    addi $t0, $t0, 1      # $t0 <-- B[g]+1
    sll  $t0, $t0, 2      # $t0 <-- 4*(B[g]+1)
    add  $t0, $t0, $s6    # $t0 <-- Addr(A[B[g]+1]), assuming the base of A is in $s6
    lw   $s0, 0($t0)      # f <-- A[B[g]+1]

2.10 f = 2*(&A);

2.11
    Instruction            Type
    addi $t0, $s6, 4       I-type
    add  $t1, $s6, $0      R-type
    sw   $t1, 0($t0)       I-type
    lw   $t0, 0($t0)       I-type
    add  $s0, $t1, $t0     R-type

2.12
2.12.1 50000000
2.12.2 overflow
2.12.3 B0000000
2.12.4 no overflow
2.12.5 D0000000
2.12.6 overflow

2.13
2.13.1 128 + x > 2^31 - 1, x > 2^31 - 129 and 128 + x < -2^31, x < -2^31 - 128 (impossible)
2.13.2 128 - x > 2^31 - 1, x < -2^31 + 129 and 128 - x < -2^31, x > 2^31 + 128 (impossible)
2.13.3 x - 128 < -2^31, x < -2^31 + 128 and x - 128 > 2^31 - 1, x > 2^31 + 127 (impossible)

2.14

2.15 i-type, 0xAD490020

2.16 r-type, sub $v1, $v1, $v0, 0x00621822

2.17 i-type, lw $v0, 4($at), 0x8C220004

2.18
2.18.1 opcode would be 8 bits, rs, rt, rd fields would be 7 bits each
2.18.2 opcode would be 8 bits, rs and rt fields would be 7 bits each
2.18.3
more registers -> more bits per instruction -> could increase code size
more registers -> less register spills -> less instructions
more instructions -> more appropriate instruction -> decrease code size
more instructions -> larger opcodes -> larger code size

2.19
2.19.1 0xBABEFEF8
2.19.2 0xAAAAAAA0
2.19.3 0x00005545

2.20
    ori $t2, $0, 0x03ff
    sll $t2, $t2, 16
    ori $t2, $t2, 0xffff

2.21 nor $t1, $t2, $t2

2.22

2.23

2.24

2.25
2.25.1
2.25.2
    addi $t2, $t2, -1
    beq  $t2, $0, loop

2.26
2.26.1 20
2.26.2
    i = 10;
    do {
        B += 2;
        i = i - 1;
    } while (i > 0)
2.26.3

2.27
       addi $t0, $0, 0
       beq  $0, $0, TEST1
LOOP1: addi $t1, $0, 0
       beq  $0, $0, TEST2
LOOP2: add  $t3, $t0, $t1
       sll  $t2, $t1, 4
       add  $t2, $t2, $s2
       sw   $t3, ($t2)
       addi $t1, $t1, 1
TEST2: slt  $t2, $t1, $s1
       bne  $t2, $0, LOOP2
       addi $t0, $t0, 1
TEST1: slt  $t2, $t0, $s0
       bne  $t2, $0, LOOP1

2.28 14 instructions to implement and 158 instructions executed

2.29
    for (i = 0; i < 100; i++) {
        result += MemArray[s0];
        s0 = s0 + 4;
    }

2.30
       addi $t1, $s0, 400
LOOP:  lw   $s1, 0($t1)
       add  $s2, $s2, $s1
       addi $t1, $t1, -4
       bne  $t1, $s0, LOOP

2.31
fib:   addi $sp, $sp, -12     # make room on stack
       sw   $ra, 8($sp)       # push $ra
       sw   $s0, 4($sp)       # push $s0
       sw   $a0, 0($sp)       # push $a0 (N)
       bgt  $a0, $0, test2    # if n>0, test if n=1
       add  $v0, $0, $0       # else fib(0) = 0
       j    rtn
test2: addi $t0, $0, 1
       bne  $t0, $a0, gen     # if n>1, gen
       add  $v0, $0, $t0      # else fib(1) = 1
       j    rtn
gen:   addi $a0, $a0, -1      # n-1 (addi with a negative immediate; there is no subi)
       jal  fib               # call fib(n-1)
       add  $s0, $v0, $0      # copy fib(n-1)
       addi $a0, $a0, -1      # n-2
       jal  fib               # call fib(n-2)
       add  $v0, $v0, $s0     # fib(n-1)+fib(n-2)
rtn:   lw   $a0, 0($sp)       # pop $a0
       lw   $s0, 4($sp)       # pop $s0
       lw   $ra, 8($sp)       # pop $ra
       addi $sp, $sp, 12      # restore sp
       jr   $ra
# fib(0) = 12 instructions, fib(1) = 14 instructions,
# fib(N) = 26 + 18N instructions for N >= 2

2.32 Due to the recursive nature of the code, it is not possible for the compiler to in-line the function call.

2.33 after calling function fib:
old $sp -> 0x7ffffffc   ???
           -4           contents of register $ra for fib(N)
           -8           contents of register $s0 for fib(N)
    $sp -> -12          contents of register $a0 for fib(N)
there will be N-1 copies of $ra, $s0 and $a0

2.34
f:     addi $sp, $sp, -12
       sw   $ra, 8($sp)
       sw   $s1, 4($sp)
       sw   $s0, 0($sp)
       move $s1, $a2
       move $s0, $a3
       jal  func
       move $a0, $v0
       add  $a1, $s0, $s1
       jal  func
       lw   $ra, 8($sp)
       lw   $s1, 4($sp)
       lw   $s0, 0($sp)
       addi $sp, $sp, 12
       jr   $ra

2.35 We can use the tail-call optimization for the second call to func, but then we must restore $ra, $s0, $s1, and $sp before that call. We save only one instruction (jr $ra).
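The control flow of the recursive fib routine in 2.31 can be mirrored in a short Python sketch. This is an illustration only, not part of the original solution: the names `fib` and `call_count` are chosen here, and `call_count` simply counts invocations, each of which corresponds to one 12-byte stack frame pushed by the MIPS version.

```python
def fib(n: int) -> int:
    """Recursive Fibonacci with the same three cases as the MIPS routine."""
    if n <= 0:
        # Falls through the 'bgt $a0, $0, test2' check: fib(0) = 0
        return 0
    if n == 1:
        # test2 case: fib(1) = 1
        return 1
    # gen case: fib(n-1) is kept in a saved value ($s0 in the MIPS code)
    # while fib(n-2) is computed, then the two are added.
    saved = fib(n - 1)
    return saved + fib(n - 2)


def call_count(n: int) -> int:
    """Total number of fib invocations needed to evaluate fib(n)."""
    if n <= 1:
        return 1
    return 1 + call_count(n - 1) + call_count(n - 2)


if __name__ == "__main__":
    for n in range(6):
        print(n, fib(n), call_count(n))
```

Because every invocation re-enters the routine through `jal`, the body cannot simply be pasted into its caller, which is the point made in 2.32.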
2.36 Register $ra is equal to the return address in the caller function, registers $sp and $s3 have the same values they had when function f was called, and register $t5 can have an arbitrary value. For register $t5, note that although our function f does not modify it, function func is allowed to modify it, so we cannot assume anything about the value of $t5 after function func has been called.

2.37
MAIN:  addi $sp, $sp, -4
       sw   $ra, ($sp)
       addi $t6, $0, 0x30     # '0'
       addi $t7, $0, 0x39     # '9'
       add  $s0, $0, $0
       add  $t0, $a0, $0      # $t0 = address of the string (argument in $a0)
LOOP:  lb   $t1, ($t0)
       slt  $t2, $t1, $t6
       bne  $t2, $0, DONE
       slt  $t2, $t7, $t1
       bne  $t2, $0, DONE
       sub  $t1, $t1, $t6
       beq  $s0, $0, FIRST
       mul  $s0, $s0, 10
FIRST: add  $s0, $s0, $t1
       addi $t0, $t0, 1
       j    LOOP
DONE:  add  $v0, $s0, $0
       lw   $ra, ($sp)
       addi $sp, $sp, 4
       jr   $ra

2.38 0x00000011

2.39 Generally, all solutions are similar:
    lui $t1, top_16_bits
    ori $t1, $t1, bottom_16_bits

2.40 No, jump can go up to 0x0FFFFFFC.

2.41 No, range is 0x604 + 0x1FFFC = 0x0002 0600 to 0x604 - 0x20000 = 0xFFFE 0604.

2.42 Yes, range is 0x1FFFF004 + 0x1FFFC = 0x2001F000 to 0x1FFFF004 - 0x20000 = 0x1FFDF004.

2.43
trylk: li   $t1, 1
       ll   $t0, 0($a0)
       bnez $t0, trylk
       sc   $t1, 0($a0)
       beqz $t1, trylk
       lw   $t2, 0($a1)
       slt  $t3, $t2, $a2
       bnez $t3, skip
       sw   $a2, 0($a1)
skip:  sw   $0, 0($a0)

2.44
try:   ll   $t0, 0($a1)
       slt  $t1, $t0, $a2
       bnez $t1, skip
       move $t0, $a2
       sc   $t0, 0($a1)
       beqz $t0, try
skip:

2.45 It is possible for one or both processors to complete this code without ever reaching the SC instruction. If only one executes SC, it completes successfully. If both reach SC, they do so in the same cycle, but one SC completes first and then the other detects this and fails.

2.46
2.46.1 Answer is no in all cases. Slows down the computer.
CCT = clock cycle time
ICa = instruction count (arithmetic)
ICls = instruction count (load/store)
ICb = instruction count (branch)
new CPU time = 0.75 × oldICa × CPIa × 1.1 × oldCCT + oldICls × CPIls × 1.1 × oldCCT + oldICb × CPIb × 1.1 × oldCCT
The extra clock cycle time adds sufficiently to the new CPU time such that it is not quicker than the old execution time in all cases.
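The CPU-time expression in 2.46.1 can be evaluated with a small Python sketch. The function names and the instruction mix below (500M arithmetic at CPI 1, 300M load/store at CPI 10, 100M branch at CPI 3) are assumptions made for illustration and are not taken from the text above; only the 0.75 and 1.1 factors come from the formula.

```python
def cpu_time(ic_arith, ic_ls, ic_branch, cpi_arith, cpi_ls, cpi_branch, cct):
    """CPU time = (sum of instruction count x CPI per class) x clock cycle time."""
    cycles = ic_arith * cpi_arith + ic_ls * cpi_ls + ic_branch * cpi_branch
    return cycles * cct


def new_over_old(ic_arith, ic_ls, ic_branch, cpi_arith, cpi_ls, cpi_branch):
    """Ratio new CPU time / old CPU time for the change in 2.46.1:
    25% fewer arithmetic instructions and a 10% longer clock cycle."""
    old = cpu_time(ic_arith, ic_ls, ic_branch,
                   cpi_arith, cpi_ls, cpi_branch, 1.0)
    new = cpu_time(0.75 * ic_arith, ic_ls, ic_branch,
                   cpi_arith, cpi_ls, cpi_branch, 1.1)
    return new / old


# Hypothetical instruction mix, chosen only to exercise the formula.
ratio = new_over_old(500e6, 300e6, 100e6, 1, 10, 3)

if __name__ == "__main__":
    print(f"new/old CPU time = {ratio:.4f}")
```

A ratio above 1 means the modified design is slower for that mix; with a mix dominated by load/store cycles, the 10% longer clock outweighs the saved arithmetic instructions.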
2.46.2 107.04%, 113.43%

2.47
2.47.1 2.6
2.47.2 0.88
2.47.3 0.533333333

3.1 5730

3.2 5730

3.3 0101111011010100
The attraction is that each hex digit contains one of 16 different characters (0-9, A-F). Since with 4 binary bits you can represent 16 different patterns, in hex each digit requires exactly 4 binary bits. And bytes are by definition 8 bits long, so two hex digits are all that are required to represent the contents of 1 byte.

3.4 753

3.5 7777 (-3777)

3.6 Neither (63)

3.7 Neither (65)

3.8 Overflow (result = -179, which does not fit into an SM 8-bit format)

3.9 -105 - 42 = -128 (-147)

3.10 -105 + 42 = -63

3.11 151 + 214 = 255 (365)

3.12 62 × 12
[Table: Step, Action, Multiplier, Multiplicand, Product]

3.13 62 × 12
[Table: Step, Action, Multiplicand, Product/Multiplier]

3.14 For hardware, it takes 1 cycle to do the add, 1 cycle to do the shift, and 1 cycle to decide if we are done. So the loop takes (3 × A) cycles, with each cycle being B time units long.
For a software implementation, it takes 1 cycle to decide what to add, 1 cycle to do the add, 1 cycle to do each shift, and 1 cycle to decide if we are done. So the loop takes (5 × A) cycles, with each cycle being B time units long.
(3 × 8) × 4tu = 96 time units for hardware
(5 × 8) × 4tu = 160 time units for software

3.15 It takes B time units to get through an adder, and there will be A - 1 adders. Word is 8 bits wide, requiring 7 adders. 7 × 4tu = 28 time units.

3.16 It takes B time units to get through an adder, and the adders are arranged in a tree structure. It will require log2(A) levels. 8 bit wide word requires 7 adders in 3 levels. 3 × 4tu = 12 time units.
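The timing estimates in 3.14 through 3.16 follow directly from the word width A and the per-step delay B, and can be checked with a brief Python sketch. The function names are chosen here for illustration; the cycle counts per iteration (3 for hardware, 5 for software) are the ones stated in 3.14.

```python
def iterative_multiply_time(word_bits, cycle_time, cycles_per_iteration):
    """Loop-based multiplier: one iteration per multiplier bit (3.14)."""
    return cycles_per_iteration * word_bits * cycle_time


def serial_adder_time(word_bits, adder_delay):
    """A chain of A - 1 adders traversed one after another (3.15)."""
    return (word_bits - 1) * adder_delay


def tree_adder_time(word_bits, adder_delay):
    """Adders arranged in a tree with ceil(log2(A)) levels (3.16)."""
    levels = (word_bits - 1).bit_length()  # integer ceil(log2(word_bits))
    return levels * adder_delay


if __name__ == "__main__":
    A, B = 8, 4  # 8-bit word, 4 time units per step
    print(iterative_multiply_time(A, B, 3))  # hardware loop
    print(iterative_multiply_time(A, B, 5))  # software loop
    print(serial_adder_time(A, B))
    print(tree_adder_time(A, B))
```

Using `bit_length` keeps the level count in integer arithmetic, which avoids rounding surprises from a floating-point logarithm.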
3.17 0x33 × 0x55 = 0x10EF. 0x33 = 51, and 51 = 32 + 16 + 2 + 1. We can shift 0x55 left 5 places (0xAA0), then add 0x55 shifted left 4 places (0x550), then add 0x55 shifted left once (0xAA), then add 0x55. 0xAA0 + 0x550 + 0xAA + 0x55 = 0x10EF. 3 shifts, 3 adds.
(Could also use 0x55, which is 64 + 16 + 4 + 1, and shift 0x33 left 6 times, add to it 0x33 shifted left 4 times, add to that 0x33 shifted left 2 times, and add to that 0x33. Same number of shifts and adds.)

3.18 74/21 = 3 remainder 9
Step  Action            Quotient  Divisor          Remainder
0     Initial Vals      000 000   010 001 000 000  000 000 111 100
1     Rem=Rem-Div       000 000   010 001 000 000  101 111 111 100
      Rem<0, R+D, Q<<   000 000   010 001 000 000  000 000 111 100
      Rshift Div        000 000   001 000 100 000  000 000 111 100
2     Rem=Rem-Div       000 000   001 000 100 000  111 000 011 100
      Rem<0, R+D, Q<<   000 000   001 000 100 000  000 000 111 100
      Rshift Div        000 000   000 100 010 000  000 000 111 100
3     Rem=Rem-Div       000 000   000 100 010 000  111 100 101 100
      Rem<0, R+D, Q<<   000 000   000 100 010 000  000 000 111 100
      Rshift Div        000 000   000 010 001 000  000 000 111 100
4     Rem=Rem-Div       000 000   000 010 001 000  111 110 110 100
      Rem<0, R+D, Q<<   000 000   000 010 001 000  000 000 111 100
      Rshift Div        000 000   000 001 000 100  000 000 111 100
5     Rem=Rem-Div       000 000   000 001 000 100  111 111 111 000
      Rem<0, R+D, Q<<   000 000   000 001 000 100  000 000 111 100
      Rshift Div        000 000   000 000 100 010  000 000 111 100
6     Rem=Rem-Div       000 000   000 000 100 010  000 000 011 010
      Rem>0, Q<<1       000 001   000 000 100 010  000 000 011 010
      Rshift Div        000 001   000 000 010 001  000 000 011 010
7     Rem=Rem-Div       000 001   000 000 010 001  000 000 001 001
      Rem>0, Q<<1       000 011   000 000 010 001  000 000 001 001
      Rshift Div        000 011   000 000 001 000  000 000 001 001

3.19 In these solutions a 1 or a 0 was added to the Quotient if the remainder was greater than or equal to 0. However, an equally valid solution is to shift in a 1 or 0, but if you do this you must do a compensating right shift of the remainder (only the remainder, not the entire remainder/quotient combination) after the last step.
74/21 = 3 remainder 11
[Table: Step, Action, Divisor, Remainder/Quotient]
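The division traces for 74/21 (operands in octal) can be replayed with a minimal Python sketch of restoring division. This is an illustrative version of the subtract, restore, and shift steps with a separate quotient register, not a transcription of either table; the function name `restoring_divide` and the `bits` parameter are assumptions made here.

```python
def restoring_divide(dividend, divisor, bits=6):
    """Restoring division on unsigned operands that fit in `bits` bits.

    The divisor starts in the upper half of a 2*bits-wide register and is
    shifted right once per step, for bits + 1 steps in total.
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    remainder = dividend
    shifted_divisor = divisor << bits
    quotient = 0
    for _ in range(bits + 1):
        remainder -= shifted_divisor          # Rem = Rem - Div
        if remainder < 0:
            remainder += shifted_divisor      # restore, shift in a 0
            quotient = quotient << 1
        else:
            quotient = (quotient << 1) | 1    # keep result, shift in a 1
        shifted_divisor >>= 1                 # Rshift Div
    return quotient, remainder


if __name__ == "__main__":
    q, r = restoring_divide(0o74, 0o21)
    print(oct(q), oct(r))
```

The remainder comes out as 9 in decimal, which is 11 in octal; that is why the same result can be written as "remainder 9" or "remainder 11" depending on the base used.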
