The document contains a series of technical questions and answers related to computer architecture, covering topics such as pipeline design, cache memory, virtual memory, and RAID systems. It includes exercises on MIPS rates, data hazards, and the effects of design changes on performance metrics. Additionally, it discusses a new CPU design center and the specifications for a new processor called NCTU.


Contents

105 清大資工
105 交大資聯
105 台聯大電機
105 中央資工
105 中正資工
105 中正電機
105 中山電機
105 中山資工
105 中興資工
105 中興電機
105 台科大資工
105 台科大電子
105 成大資聯
105 成大電通
105 成大電機

105 清大資工
1. For the following questions we assume that the pipeline contains 5 stages: IF, ID, EX, M, and
W and each stage requires one clock cycle. A MIPS-like assembly is used in the representation.
(a) Explain the concept of forwarding in pipeline design.
(b) Identify all of the data dependencies in the following code. Which dependencies are data
hazards that will be resolved via forwarding?
ADD $2, $5, $4
ADD $4, $2, $5
SW $5, 100($2)
ADD $3, $2, $4
Answer:
(a) Forwarding is a method of resolving a data hazard by retrieving the missing data element
from internal buffers rather than waiting for it to arrive from programmer-visible registers or
memory.
(b)
Data dependencies: (1, 2), (1, 3), (1, 4), (2, 4)
Data hazards (resolved via forwarding): (1, 2), (1, 3), (2, 4)

2. Suppose out-of-order execution is adopted in a pipeline design. The figure below shows a generic
pipeline architecture with an out-of-order pipeline design.
(a) Explain the needed work in the reservation station and reorder buffer, respectively, in order
to support out of order execution.
(b) In order to reduce power usage, we want to power off a floating-point unit when it is no
longer in use. The following code fragment uses a new instruction called Power-off to
attempt to turn off the floating-point multiplier when it is no longer in use. To deal with
out-of-order execution, what work is needed in the reservation station and reorder buffer to
keep the "Power-off @multiplier" instruction from being moved in front of the set of multiply
instructions?
mul.d $f2, $f4, $f6
mul.d $f1, $f3, $f5
mul.d $f7, $f1, $f2
Power-Off @multiplier
/* Integer operations in the rest of the assembly code*/
ADD $2, $5, $4

END
Answer:
(a) A reservation station is a buffer within a functional unit that holds the operands and the
operation.
Reorder buffer is the buffer that holds results in a dynamically scheduled processor until it
is safe to store the results to memory or a register.
(b) The reservation station of the floating-point multiplier has to check that no floating-point
multiplication instructions remain, and the reorder buffer has to commit the
Power-off @multiplier instruction only after all floating-point multiplication instructions
have committed.

3. A non-pipelined processor A has an average CPI (Clock Per Instruction) of 4 and has a clock
rate of 40MHz.
(a) What is the MIPS (Million Instructions Per Second) rate of processor A?
(b) If an improved successor, called processor B, of processor A is designed with four-stage
pipeline and has the same clock rate, what is the maximum speedup of processor B as
compared with processor A?
(c) What is the CPI value of processor B?
Answer:
(a) MIPS of processor A = (40 × 10^6) / (4 × 10^6) = 10
(b) Speedup = (4 / 40M) / (1 / 40M) = 4
(c) The CPI value of processor B is 1
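The arithmetic above can be checked with a short script. This is only a sketch of the standard formulas (MIPS = clock rate / (CPI × 10^6); ideal pipeline speedup = ratio of CPIs at the same clock):

```python
def mips_rate(clock_hz, cpi):
    # MIPS = clock rate / (CPI * 10^6)
    return clock_hz / (cpi * 1e6)

# Processor A: 40 MHz clock, CPI = 4
print(mips_rate(40e6, 4))   # -> 10.0

# Ideal 4-stage pipelining reduces CPI from 4 to 1 at the same clock rate,
# so the maximum speedup is the ratio of the two CPIs.
print(4 / 1)                # -> 4.0
```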

4. Denote respectively si and ci the Sum and CarryOut of a one-bit adder Adder(ai, bi), where ai
and bi are input.
(a) For a 16-bit addition using a ripple-carry adder, how many gate delays are required to
generate c16 after bits A, B, and c0 are applied as input?
(b) Suppose we use a carry-lookahead adder in which the 16-bit adder is divided into four 4-bit
adders, each containing a 4-bit carry-lookahead circuit. Let pi = ai + bi and gi =
ai * bi, and assume there is one gate delay for each pi and gi, two gate delays for each
carry c4, c8, c12, c16, and a final three gate delays for s15. How many gate delays are required to
generate c16 after A, B, and c0 are applied as input?
(c) Another way to apply lookahead is to add a second-level lookahead circuit. With
the second-level lookahead circuit, how many gate delays are required to generate c16
after A, B, and c0 are applied as input?
Answer:
(a) 16 × 2 = 32 gate delays
(b) 1 + 2 + 2 + 2 + 2 = 9 gate delays
(c) 1 + 2 + 2 = 5 gate delays

5. Please fill in each "?" below with "up" or "down". For example, when
associativity goes up, the access time goes up.

Design change      | Effect on miss rate | Possible effects
Associativity (up) | conflict miss ?     | access time up
Cache size (down)  | capacity miss ?     | access time ?
Block size (down)  | spatial locality ?  | miss penalty ?
Answer:
Design change      | Effect on miss rate   | Possible effects
Associativity (up) | conflict miss down    | access time up
Cache size (down)  | capacity miss up      | access time down
Block size (down)  | spatial locality down | miss penalty down

6. In a 2-way set-associative cache, three numbers have already been inserted in the sequence 14, 2, 3. The
following table shows the time sequence as 14, 2, and 3 are inserted into the 2-way cache.
Insert the 6 numbers 6, 10, 22, 42, 11, 27 afterwards and show the time sequence using LRU replacement.
way:  0    1    2    3
     14
     14    2
     14    2    3

Answer:
way:  0    1    2    3
     14
     14    2
     14    2    3
      6   10    3
     22   42    3
     22   42    3   11
     22   42   27   11
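The final table can be reproduced with a small simulation. This is a sketch that assumes the set index is the value modulo the number of sets (even numbers map to ways 0-1, odd numbers to ways 2-3), which matches the layout above; within each set the simulation tracks LRU order rather than physical way position:

```python
from collections import OrderedDict

def simulate_2way_lru(accesses, num_sets=2, ways=2):
    # One OrderedDict per set; keys are the cached numbers, oldest (LRU) first.
    sets = [OrderedDict() for _ in range(num_sets)]
    for n in accesses:
        s = sets[n % num_sets]          # assumed index: value mod num_sets
        if n in s:
            s.move_to_end(n)            # hit: becomes most recently used
        else:
            if len(s) == ways:
                s.popitem(last=False)   # miss with full set: evict LRU block
            s[n] = True
    return [list(s) for s in sets]

print(simulate_2way_lru([14, 2, 3, 6, 10, 22, 42, 11, 27]))
# -> [[22, 42], [11, 27]]  (final contents per set, LRU first)
```

The final contents, {22, 42} in one set and {11, 27} in the other, agree with the last row of the table.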

105 交大資聯
Multiple-select questions
1. Which of the following statements are correct?
(a) Writing programs with a set of powerful instructions is the shortcut to yield high
performance.
(b) The number of pipeline stages affects latency, not throughput; thus pipelining improves the
performance of a processor by decreasing the latency of a job (i.e., an instruction) to be
done.
(c) For a given program, its average cycles per instruction (CPI) is affected not only by the
instruction set architecture, but also by the compiler used.
(d) By reducing the clock frequency of a processor from 2 GHz to 1.5 GHz, and also reducing
its supply voltage from 1.25 Volt to 1 Volt, the overall power consumption of this processor
will be reduced by 40% theoretically.
(e) The hexadecimal integer value 0xABBCCBBA has an identical storage sequence in the
(byte-addressable) memory no matter whether the machine is big-endian or little-endian.
Answer: (c)
Note (d): (1.5 × 1^2) / (2 × 1.25^2) = 0.48, so the power consumption will be reduced by 52%, not 40%.

2. Consider a byte-addressable memory hierarchy with 44-bit addresses. The memory hierarchy
adopts a 4-way set associative cache of 64 KB, with every block in a set containing 64 bytes.
Which of the following statements are correct?
(a) Among an address of 44 bits, the least-significant 6 bits is the “offset”.
(b) Among an address of 44 bits, the most-significant 30 bits is the “tag”.
(c) The total number of bits required to store the entire cache (including valid bits, tags, and
data) is (1 + 30 + 64 × 4 × 8) × 2^8 = 532,224 (bits).
(d) By increasing the block size while using the same size of storage for data, the total number
of bits required to store the entire cache will decrease.
(e) Given the following sequence of access addresses: (0E1B01AA050)_16, (0E1B01AA073)_16,
(0E1B2FE3057)_16, (0E1B4FFD85F)_16, (0E1B01AA04E)_16, and assuming that the cache is
initially empty, there will be three sets containing exactly one referenced block at the end
of the sequence.
Answer: (a), (b), (d), (e)
Note (c): The total number of bits = (1 + 30 + 64 × 8) × 4 × 2^8 bits
Note (e):
Address:     0E1B01AA050  0E1B01AA073  0E1B2FE3057  0E1B4FFD85F  0E1B01AA04E
Index (bin):    10000001     10000001     11000001     01100001     10000001
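The index values in the note can be recomputed by slicing each 44-bit address into tag / index / offset fields (6 offset bits for 64-byte blocks, 8 index bits for the 256 sets of a 64 KB 4-way cache); a sketch:

```python
def split_address(addr, offset_bits=6, index_bits=8):
    # Slice a byte address into (tag, index, offset) fields.
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

for a in [0x0E1B01AA050, 0x0E1B01AA073, 0x0E1B2FE3057,
          0x0E1B4FFD85F, 0x0E1B01AA04E]:
    tag, index, offset = split_address(a)
    print(f"{a:011X}: index = {index:08b}")
```

Three distinct indexes appear (10000001, 11000001, 01100001), and the three accesses to set 10000001 all fall in the same 64-byte block, so each of the three sets ends up holding exactly one referenced block, confirming statement (e).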

3. Which of the following statements (about virtual memory and page table) are correct?
(a) For virtual memory, write-through is more practical than write-back.
(b) For virtual memory, full associativity is typically used for minimizing page fault rate.
(c) Given a 32-bit virtual address space with 2 KB per page and 4 bytes per page table entry,
the total page table size is 8 MB.
(d) It is possible to miss in cache and translation look-aside buffer (TLB), but hit in page table.
(e) It is possible to miss in TLB, but hit in cache and page table.
Answer: (b), (c), (d), (e)

4. Which of the following statements are correct?


(a) It is possible that increasing block size could increase the overall miss rate.
(b) It is possible to decrease the occurrence of capacity misses by increasing block size.
(c) It is possible to decrease the occurrence of conflict misses by increasing associativity.
(d) It is possible to “eliminate” the occurrence of conflict misses by further increasing
associativity.
(e) It is possible that increasing associativity could increase the overall miss rate.
Answer: (a), (c), (d)

5. Which of the following statements are correct?


(a) RAID (redundant array of inexpensive/independent disks) not only enhances the reliability/
availability of data storage, but also boosts the performance of data access.
(b) RAID 0 provides fault tolerance by replicating data to mirror disks.
(c) The advantage of RAID 4 over RAID 3 is on the byte-level striping/interleaving of data
across disks, so as to allow more parallel data accesses.
(d) The advantage of RAID 5 over RAID 4 is on the distribution of parity blocks across disks, so
as to avoid a single parity disk from being the bottleneck.
(e) The advantage of RAID 6 over RAID 5 is on the higher degree of fault tolerance through
more disk redundancy.
Answer: (a), (d), (e)
Note (a): Strictly speaking, RAID enhances availability rather than reliability; however, the answer published by NCTU includes (a), so the published answer is taken as authoritative here.

ARM recently built a new CPU design center at Hsinchu in Taiwan, its first CPU design center in
Asia. You are the principal engineer on a team in charge of developing a brand-new CPU, called
NCTU (Next-generation Compute Terabit Unit). The following what-if questions rely on some
assumptions. The latencies of the logic blocks in the figure are listed below; the latency of the other
blocks is almost negligible:

Main control | I-mem | Adder | Registers | ALU  | Sign-extend | D-mem
50ps         | 100ps | 50ps  | 40ps      | 50ps | 10ps        | 100ps

The NCTU1, a basic standard 5-stage pipelined processor, is organized as in the following diagram.
The instructions executed by the pipelined processor break down as ALU: 40%, beq: 30%, lw:
25%, and sw: 5%.
[Figure: NCTU1 5-stage pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers: PC, instruction memory, PC + 4 and branch-target adders (with shift-left-2), register file, sign-extend, ALU, and data memory.]

6. Which of the following statements can be true for an ideal NCTU1 (assume there are no stalls and
all instructions are executed ideally)?
(a) The best clock rate of NCTU1 is 10GHz.
(b) The MIPS (million instructions per second) can be reached up to 10000.
(c) All pipeline registers between stages have the same width (number of bits).
(d) The pipeline clock cycle time is equal to the average of all stage latencies.
(e) The longest instruction will be the lw instruction. Reducing the number of lw instructions in a
program can improve the NCTU1 system throughput (e.g., CPI or instructions executed per
second).
Answer: (a), (b)

7. Ideal execution does not exist at all. Which of the following statements can be true for improving
NCTU1?
(a) The pipe stages look unbalanced. The ideal CPI can be reduced to less than 1 if we move
around the building blocks for better balance.
(b) The overall performance can be always improved by making the pipeline deeper as the
cycles are shorter.
(c) Solve the load-use data hazard (load instructions followed by a dependent instruction) by
inserting a hazard detection unit at the ID stage.
(d) Increase the register file with more read ports (three reads and a write) to avoid instruction
conflicts.
(e) If ALU control decoder (marked by dashed circle) needs to take 50ps, moving the ALU
control to the ID stage can improve NCTU1 for allowing more concurrent activities.
Answer: (c)
Note (b): When the cycle time is already very short, increasing the pipeline depth further does not necessarily improve performance.

8. For the code below (i is in $r8 and A is in $r10), pipeline dependences in NCTU1 are resolved
as in the textbook. How many cycles in total are needed for NCTU1 to execute it
all? (You may need to insert NOPs if necessary.)
// C or C++
A[i] += 10;
A[i] *= 2;

// MIPS
add $r1, $r8, $r10
lw $r2, 4($r1)
addi $r2, $r2, 10
multi $r2, $r2, 2
sw $r2, 12($r1)
(a) Total 9 cycles ideally if we don’t consider hazard problems.
(b) Total 19 cycles if there is no forwarding unit and no hazard detection unit, and we use
NOPs.
(c) Total 11 cycles if there is a forwarding unit and no hazard detection unit for load-use.
(d) Total 10 cycles if there are forwarding unit and hazard detection unit for load-use.
(e) The forwarding unit will not be used in the lw instruction (load-use scenario) when hazard
detection unit is triggered.
Answer: (a), (d)
Note (b): Total cycles = (5 – 1) + 5 + 8 = 17
Note (c): Total cycles = (5 – 1) + 5 + 1 = 10

9. Now we hope to design a reduced, low-cost version with fewer stages: a new 4-stage CPU
called NCTU2. We can remove the MEM stage and put the memory access in parallel with the
ALU. Which of the following statements will be true for NCTU2?
(a) It is feasible except for load/store instructions with nonzero offsets, which we have to
convert into register-based addressing only (no offset). Specifically, convert
lw $r3, 30($r5)
into
addi $r1, $r5, 30
lw $r3, ($r1)
(b) The NCTU2 has better overall system throughput, as instructions take fewer cycles to
complete.
(c) The NCTU2 will have a worse clock rate, as its cycle time is longer than that of NCTU1.
(d) NCTU2 still needs a stall hazard detection unit for load-use condition.
(e) The data forwarding unit is still needed for NCTU2.
Answer: (a), (e)

Branches account for a significant portion of program execution. In the current NCTU1, branch outcomes are
determined at the MEM stage. This is one of the key problems.
10. Which of the following statements can be true for NCTU1, considering branches only?
(a) The CPI with all branch stalls is 1.9 for NCTU1 (assume no other stalls.)
(b) If the NCTU1 is designed with deeper pipelined stages, the branch control hazard becomes
harder to solve.
(c) Assuming branches are not taken can continue execution down the sequential instruction
stream and thus improve the NCTU1 without changing hardware.
(d) If branches handling is moved to the EXE stage, it can improve CPI without increasing the
cycle time.
(e) If more hardware resources such as forwarding comparator are provided, the branch
resolving can be perfectly moved to the IF stage while still maintaining the normal
pipelining mechanism.
Answer: (a), (b)
Note (a): CPI = 1 + 0.3 × 3 = 1.9
Note (b): The AND gate (with the Zero and Branch inputs) would have to be moved from the MEM stage to the EXE
stage.

Question group A: Now we have a new NCTU3, derived from NCTU1, in which branch handling is moved to
the ID stage.
11. Which of the following statements will be true?
(a) It needs more hardware costs such as extra adder for subtraction and a forwarding unit.
(b) The branch target address adder has to be replicated.
(c) The good thing is that it improves CPI without increasing cycle time.
(d) The branch only CPI of new NCTU3 will become 1.1.
Answer: (a)
Note (d): CPI = 1 + 0.3 × 1 = 1.3

12. Even with NCTU3, we hope to improve the branch problem further. Which of the following
statements is NOT true?
(a) Delayed branch is a mechanism to delay the effect of branch. Branch delay slot in NCTU3
is 1. It also needs no extra hardware resources in NCTU3.
(b) Compiler can help in filling delay slots. If compiler can find any safe instruction for 50%
of branches, the CPI can be reduced to 1.15.
(c) Predicting branch as not-taken is a good mechanism and it needs no extra hardware
resources in NCTU3.
(d) Branch prediction buffer is good to predict the branch outcome, but it does not help in
predicting the branch target.
Answer: (c)

Question group B: Given a 16-byte cache (byte-addressable, initially empty) and the sequence of access
addresses (2)_10, (14)_10, (0)_10, (17)_10, (12)_10, (3)_10, please derive the corresponding hit/miss
sequence for each of the following cache designs. There is one question for each design.
You may use the attached tables to find the answers to the questions; filling out the tables is
for your own convenience and is not part of the answers to be submitted.

13. 8 bytes per block, direct-mapped


Address Cache index Hit or miss?
2
14
0
17
12
3
How many of these six accesses are “hit”?
(a) 0

(b) 1
(c) 2
(d) 3
(e) 4
Answer: (c)
Note:
Byte address | Block address | Tag | Index | Hit or miss?
2            | 0             | 0   | 0     | Miss
14           | 1             | 0   | 1     | Miss
0            | 0             | 0   | 0     | Hit
17           | 2             | 1   | 0     | Miss
12           | 1             | 0   | 1     | Hit
3            | 0             | 0   | 0     | Miss

14. 4 bytes per block, 2-way set associative


Address Cache index Hit or miss?
2
14
0
17
12
3
How many of these six accesses are “hit”?
(a) 0
(b) 1
(c) 2
(d) 3
(e) 4
Answer: (d)
Note:
Byte address | Block address | Tag | Index | Hit or miss?
2            | 0             | 0   | 0     | Miss
14           | 3             | 1   | 1     | Miss
0            | 0             | 0   | 0     | Hit
17           | 4             | 2   | 0     | Miss
12           | 3             | 1   | 1     | Hit
3            | 0             | 0   | 0     | Hit
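Both hit/miss tables can be reproduced with one generic cache simulator. This is a sketch; the geometries come from questions 13 and 14 (a 16-byte cache gives 2 blocks of 8 bytes for the direct-mapped case and 4 blocks of 4 bytes, in 2 sets, for the 2-way case), and LRU replacement is assumed for the 2-way design:

```python
def simulate(addresses, block_size, num_blocks, ways):
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # per-set tag lists, LRU first
    results = []
    for a in addresses:
        block = a // block_size
        index, tag = block % num_sets, block // num_sets
        s = sets[index]
        if tag in s:
            s.remove(tag)                  # hit: refresh LRU position
            s.append(tag)
            results.append("hit")
        else:
            if len(s) == ways:
                s.pop(0)                   # evict the least recently used tag
            s.append(tag)
            results.append("miss")
    return results

addrs = [2, 14, 0, 17, 12, 3]
print(simulate(addrs, 8, 2, 1))   # Q13: direct-mapped, 8-byte blocks -> 2 hits
print(simulate(addrs, 4, 4, 2))   # Q14: 2-way, 4-byte blocks -> 3 hits
```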

105 台聯大電機
1. Consider an implementation of an instruction set architecture. The instructions can be divided
into three classes according to their CPI (class A, B, and C). The CPIs are 1, 2, and X for the
three classes, respectively, and the clock rate is 4.4 GHz.
Given a program with a dynamic instruction count of 5 × 10^9 instructions divided into classes as
follows: 20% class A, 60% class B, and 20% class C. The program is compiled using compiler A.
(1) Find X (the CPI for class C) given that the global CPI is 2.2.
(2) Find the number of clock cycles required.
(3) What is the performance expressed in instructions per second (the global CPI is 2.2)?
(4) A new compiler, compiler B, is developed that uses only 10^9 instructions and has an
average CPI of 1.1. Assume the programs compiled using compilers A and B run on
Processor A and Processor B, respectively. If the execution time on Processor B is
half of the execution time on Processor A, how much faster is the clock of Processor A
than the clock of Processor B? (Find Clock A / Clock B.)
Answer:
(1) 2.2 = 1 × 0.2 + 2 × 0.6 + X × 0.2 ⇒ X = 4
(2) Clock cycles = 2.2 × 5 × 10^9 = 11 × 10^9
(3) Instructions per second = (4.4 × 10^9) / 2.2 = 2 × 10^9
(4) Execution time (Processor A) = (11 × 10^9) / (4.4 × 10^9) = 2.5 sec.
Execution time (Processor B) = 2.5 / 2 = 1.25 sec = (10^9 × 1.1) / Clock rate_B
⇒ Clock rate_B = 0.88 GHz
The clock of Processor A is 4.4 GHz / 0.88 GHz = 5 times as fast as that of Processor B.
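The chain of calculations above can be sketched in a few lines (all values come from the question; part (4) follows the same algebra with compiler B's instruction count and CPI):

```python
clock_a = 4.4e9                       # Hz
mix = [(0.2, 1), (0.6, 2), (0.2, 4)]  # (fraction, CPI) per class, with X = 4
cpi = sum(f * c for f, c in mix)      # global CPI, ~2.2
insts_a = 5e9

cycles = cpi * insts_a                # ~11e9 clock cycles
ips = clock_a / cpi                   # ~2e9 instructions per second

time_a = cycles / clock_a             # 2.5 s on processor A
time_b = time_a / 2                   # 1.25 s on processor B
clock_b = 1e9 * 1.1 / time_b          # compiler B: 1e9 instructions, CPI 1.1
print(cpi, cycles, ips, clock_a / clock_b)   # ratio ~5
```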

2. Assume there is a 16-bit half-precision format. The leftmost bit is still the sign bit, the exponent
is 5 bits wide with a bias of 15, and the mantissa is 10 bits long. A hidden 1 is assumed.
Please write down the bit pattern representing –2.1875 × 10^-1 in this 16-bit half-precision
format. (Note: 0.21875 = 7/32)
Answer:
–2.1875 × 10^-1 = –0.00111_2 = –1.11_2 × 2^-3. The half-precision pattern = 1 01100 1100000000
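Python's struct module supports exactly this IEEE 754 half-precision layout (format code 'e'), so the bit pattern can be checked directly; a quick sketch:

```python
import struct

value = -0.21875   # -2.1875e-1 = -7/32, exactly representable in binary
bits = struct.unpack('>H', struct.pack('>e', value))[0]

# sign = 1, exponent = -3 + 15 = 12 = 01100, fraction = 1100000000
print(f"{bits:016b}")   # -> 1011001100000000
```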

3. Given the following C program segment, the converted MIPS assembly code is shown
below. Assume that arguments n and s are located in $a0 and $a1.
int Sum(int n, int s) {
    if (n < 1) return 0;
    else return (n + Sum(n - s, s));
}
Sum: addi $sp, $sp, -12
     (a)
     sw   $a1, 4($sp)
     sw   $a0, 0($sp)
     slti $t0, $a0, 1
     beq  $t0, $zero, L1
     add  $v0, $zero, $zero
     (b)
     jr   $ra
L1:  sub  $a0, $a0, $a1
     jal  Sum
     lw   $a0, 0($sp)
     lw   $a1, 4($sp)
     (c)
     addi $sp, $sp, 12
     add  $v0, $a0, $v0
     jr   $ra
(1) Please fill in the blanks (a), (b), and (c) to complete this assembly code.
(2) Let the initial values of $a0, $a1, $t0, $sp be 6(hex), 2(hex), 8(hex), 6FFF808C(hex), respectively.
What are the final values of $v0 and $a0 after the completion of the program? What is the
value of $sp when the first "jr" instruction is encountered?
Answer:
(1)
(a) sw $ra, 8($sp)    (b) addi $sp, $sp, 12    (c) lw $ra, 8($sp)
(2)
$v0 = C(hex), $a0 = 6(hex), $sp = 6FFF808C(hex)
(Strictly, when the first jr executes, in the base call Sum(0, 2) after its addi $sp, $sp, 12, $sp is 6FFF8068(hex); $sp is back to 6FFF808C(hex) only once the whole program completes.)
Note: $v0 = Sum(6, 2) = 6 + Sum(4, 2) = 6 + 4 + Sum(2, 2) = 6 + 4 + 2 + Sum(0, 2) = 6 + 4 + 2 + 0 = C(hex)

4. Consider the following architecture. The latency of each block is given below. Assume
that a control block has zero delay if not specified.

[Figure: single-cycle MIPS datapath with main control and ALU control. Label X marks the PC + 4 value feeding the branch-target adder, label Z marks the next-PC input selected by PCSrc, and label Y marks the ALU operand selected by the ALUSrc mux.]
I-Mem | Add  | Mux  | ALU   | Regs | D-Mem | AND  | Sign-Extend | Shift-left-2
220ps | 50ps | 20ps | 120ps | 80ps | 250ps | 10ps | 20ps        | 10ps

1020       sub  $s4, $s3, $t3
1024       beq  $t2, $s4, Else
1028       addi $t2, $t2, -4
1032       add  $t0, $t1, $t2
1036       sw   $s2, 0($t0)
1040       j    Exit
1044 Else: addi $t2, $t2, 4
1048 Exit: ...
(1) Which instruction is not supported by this architecture? How should the architecture be
revised to support that instruction? Please explain by drawing the related blocks and wires.
(2) Excluding the instruction not supported in (1), which instruction needs the longest clock
cycle period? What would the cycle time be?
(3) Before this code segment, the register contents (in decimal) are given as follows. During
execution, for some clock cycle, if X = 1028, what are the values of signal Y (after the
MUX controlled by ALUSrc) and signal Z (the input of PC)?
$t0 | $t1 | $t2 | $t3 | $s2 | $s3 | $s4
48  | 8   | 12  | 288 | 75  | 300 | 102

Answer:
(1) The jump instruction j is not supported by this architecture. The datapath should be
revised as follows to support it.
[Figure: the single-cycle datapath extended for j: Instruction[25:0] is shifted left 2 and concatenated with PC + 4[31:28] to form the 32-bit jump address, which an additional Jump mux can select as the next PC.]

(2) Latency for R-type instruction = 220 + 80 + 20 + 120 + 20 + 80 = 540 ps
Latency for beq instruction = 220 + 80 + 20 + 120 + 10 + 20 = 470 ps
Latency for addi instruction = 220 + 80 + 120 + 20 + 80 = 520 ps
Latency for sw instruction = 220 + 80 + 120 + 250 = 670 ps
The sw instruction needs the longest clock cycle period, so the cycle time = 670 ps.
(3) When X = 1028, the instruction at address 1024 (the beq) is being executed.
After the first instruction executes, register $s4 contains 300 – 288 = 12; since $t2 also
equals 12, the beq branches to label Else.
Signal Y = 12 and signal Z = 1044.

5. Consider the following code sequence running on 5-stage pipeline MIPS:


sw r16, 12(r6)
lw r16, 8(r6)
beq r5, r4, Label # Assume r5 != r4
add r5, r1, r4
slt r5, r15, r4
Assume that the pipeline stages IF, ID, Exe, Mem, and WB have latencies of
200 ps, 120 ps, 200 ps, 190 ps, and 100 ps, respectively.

(1) Assume that all branches are perfectly predicted and that no delay slots are used. If we only
have one memory (for both instructions and data), there is a structural hazard every time
we need to fetch an instruction in the same cycle in which another instruction accesses
data. To guarantee forward progress, this hazard must always be resolved in favor of the
instruction that accesses data. What is the total execution time of this instruction sequence
in the 5-stage pipeline that has only one memory? We have seen that data hazards can be
eliminated by adding NOPs to the code. Can you do the same with this structural hazard?
Why?
(2) Assuming stall-on-branch and no delay slots, what speedup is achieved on this code if
branch outcomes are determined in the ID stage, relative to the execution where branch
outcomes are determined in the Exe stage?
(3) Assume that all branches are perfectly predicted and that no delay slots are used. If we
change load/store instructions to use a register (without an offset) as the address, these
instructions no longer need to use the ALU. As a result, Mem and Exe stages can be
overlapped and the pipeline has only 4 stages. Change this code to accommodate this
changed ISA. Assuming this change does not affect clock cycle time, what speedup is
achieved in this instruction sequence?
(4) Given these pipeline stage latencies, repeat the speedup calculation from (3), but take into
account the (possible) change in clock cycle time. When Exe and Mem are done in a single
stage, most of their work can be done in parallel. As a result, the resulting Exe/Mem stage
has a latency that is the larger of the original two, plus 20 ps needed for the work that could
not be done in parallel.
(5) Given these pipeline stage latencies, repeat the speedup calculation from (3), taking into
account the (possible) change in clock cycle time. Assume that the latency of the ID stage
increases by 50% and the latency of the Exe stage decreases by 10 ps when branch outcome
resolution is moved from Exe to ID.
Answer:
(1) ** represents a stall cycle in which an instruction cannot be fetched because a load or store
instruction is using the memory in that cycle.

Instruction        | C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11
sw r16, 12(r6)     | IF ID EX ME WB
lw r16, 8(r6)      |    IF ID EX ME WB
beq r5, r4, Label  |       IF ID EX ME WB
add r5, r1, r4     |          ** ** IF ID EX ME  WB
slt r5, r15, r4    |                   IF ID EX  ME  WB

Clock cycles = 11
We cannot add NOPs to the code to eliminate this hazard: NOPs need to be fetched just like
any other instructions.

(2) When branches execute in the EXE stage, each branch causes two stall cycles. When
branches execute in the ID stage, each branch causes only one stall cycle.

Cycles with branch in EXE | Cycles with branch in ID | Speedup
(5 – 1) + 5 + 1 × 2 = 11  | (5 – 1) + 5 + 1 × 1 = 10 | 11 / 10 = 1.10
(3)
Cycles with 5 stages | Cycles with 4 stages | Speedup
(5 – 1) + 5 = 9      | (4 – 1) + 5 = 8      | 9 / 8 = 1.13
(4) The cycle counts for the (normal) 5-stage and the (combined EX/MEM) 4-stage pipeline
were already computed in (3). The clock cycle time equals the latency of the longest-latency
stage; combining the EX and MEM stages affects the clock time only if the combined EX/MEM
stage becomes the longest-latency stage:

Cycle time with 5 stages | Cycle time with 4 stages | Speedup
200 ps                   | 220 ps                   | (9 × 200) / (8 × 220) = 1.02
(5)
New ID latency | New EX latency | New cycle time | Old cycle time | Speedup
180 ps         | 190 ps         | 200 ps (IF)    | 200 ps (IF)    | (9 × 200) / (8 × 200) = 1.13
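The speedups in parts (2) through (5) all follow one formula: total time = ((stages − 1) fill cycles + one cycle per instruction + stall cycles) × cycle time. A sketch using the numbers above:

```python
def total_time(n_instr, n_stages, stalls, cycle_ps):
    # (stages - 1) fill cycles + one cycle per instruction + stall cycles
    return ((n_stages - 1) + n_instr + stalls) * cycle_ps

# (2) one branch: resolved in EXE (2 stall cycles) vs in ID (1 stall cycle)
print(total_time(5, 5, 2, 200) / total_time(5, 5, 1, 200))   # -> 1.1

# (4) 4-stage pipeline whose combined EX/MEM stage stretches the cycle to 220 ps
print(total_time(5, 5, 0, 200) / total_time(5, 4, 0, 220))   # ~1.02

# (5) 4-stage pipeline, cycle time still set by the 200 ps IF stage
print(total_time(5, 5, 0, 200) / total_time(5, 4, 0, 200))   # -> 1.125
```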

6. Suppose you want to perform two sums: one is a sum of 10 scalar variables, and one is a matrix
sum of a pair of two-dimensional arrays, which have dimensions 20 by 20. Suppose that only
the matrix sum is parallelizable.
(1) Assuming the load is perfectly balanced, what speedup do you get with 40
processors?
(2) If one processor's load is 2 times higher than that of all the rest, what speedup do you get with 40
processors?
Answer: suppose t is the time required for one addition operation.
(1) Execution time on one processor = 9t + 400t = 409t
Execution time on 40 processors = 9t + 400t / 40 = 19t
Speedup = 409t / 19t ≈ 21.5
(2) Execution time on one processor = 9t + 400t = 409t
Execution time on 40 processors = 9t + max(20t, 380t / 39) = 29t
Speedup = 409t / 29t ≈ 14.1
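A sketch of the speedup computation, with the unbalanced case modeled as in the answer: one processor carries a double share (20t) while the other 39 split the remaining 380t:

```python
def speedup(serial, parallel, nproc, heavy_share=None):
    t1 = serial + parallel                 # time on a single processor
    if heavy_share is None:
        tp = serial + parallel / nproc     # perfectly balanced load
    else:
        # one processor does heavy_share; the rest split the remainder
        tp = serial + max(heavy_share, (parallel - heavy_share) / (nproc - 1))
    return t1 / tp

print(round(speedup(9, 400, 40), 2))                  # -> 21.53
print(round(speedup(9, 400, 40, heavy_share=20), 2))  # -> 14.1
```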

7. Some disks are quoted as having a 1,000,000-hour mean time to failure (MTTF). A data
center might have 50,000 servers. Suppose each server has 4 disks. Use the annual failure
rate (AFR) to calculate how many disks we would expect to fail per year.
Answer:
Failed disks per year = (50,000 × 4 disks × 8,760 hours/disk) / (1,000,000 hours/failure) = 1,752 disks.
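A quick check of the expected annual failures (AFR = hours per year / MTTF, applied to the whole disk population):

```python
disks = 50_000 * 4          # 50,000 servers with 4 disks each
hours_per_year = 8_760
mttf_hours = 1_000_000

afr = hours_per_year / mttf_hours               # ~0.876% per disk per year
expected_failures = disks * hours_per_year / mttf_hours
print(expected_failures)                        # -> 1752.0
```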

8. A new processor is advertised as having a total on-chip cache size of 168 Kbytes (KB) with a
byte-addressed two-level cache. It integrates 8 KB of L1 cache, split evenly between
instruction and data caches at 4 KB each. Meanwhile, the processor integrates 160 KB of
unified L2 cache within the chip packaging. The two-level cache has the following
characteristics:
                        | L1 cache                          | L2 cache
Address                 | Physical byte-address             | Virtual byte-address
Associativity           | Direct mapped                     | 5-way set associative
Sector size             | 2 blocks/sector                   | 2 blocks/sector
Block size              | 1 word/block (8 bytes/word)       | 1 word/block (8 bytes/word)
Caching method          | Write-through, no write allocate  | Write-through, write allocate
Hit time                | 1 clock                           | 5 clocks (after L1 miss)
Average local miss rate | 0.15                              | 0.05
The system has a 40-bit physical address space and a 52-bit virtual address space. L2 miss
(transport time) takes 50 clock cycles.
(1) What is the total number of bits within each L1 cache block, including status bits?
(2) What is the total number of bits within each L1 cache sector, including status bits?
(3) What is the total number of bits within each L2 set, including status bits?
(4) Compute the average memory access time (AMAT, in clocks) for the given 2-level cache.
(5) Consider a case in which a task is interrupted, causing the L1 and L2 caches described above
to be completely flushed by other tasks; the original task then begins execution again
and runs to completion, completely refilling the caches as it runs. What is the approximate
time penalty, in clocks, associated with refilling the caches when the original program
resumes execution? Note: "approximate" means you should compute the L1 and L2 penalties
independently and then add them, rather than trying to account for coupling between them.
Answer:
(1) 8 bytes per block + 1 valid bit = 65 bits
(2) Tag is 40 bits − 12 bits (4 KB) = 28 bits, and is the only sector overhead since the cache is
direct mapped: 28 + 65 + 65 = 158 bits
(3) 8 bytes per block + 1 valid bit + 1 dirty bit = 66 bits
The cache is 32 KB × 5 in size; tag is 40 bits − 15 bits (32 KB) = 25 tag bits per sector
Overhead bits per sector = 25 tag bits per sector + 3 LRU-counter bits per sector = 28 bits
Set size = ((66 × 2) + 28) × 5 = 800 bits per set
(4) AMAT = T_hit1 + P_miss1 × T_hit2 + P_miss1 × P_miss2 × T_transport
= 1 + (0.15 × 5) + (0.15 × 0.05 × 50) = 2.125 clocks
(5) For the L1 cache, 1K memory references (8 KB / 8 bytes) will suffer an L1 miss before the L1
cache is warmed up. But 15% of these would have been misses anyway. Extra L1 misses =
1K × 0.85 ≈ 870 misses.
The L2 cache will suffer (160K / 8 bytes) × 0.95 = 19456 extra misses.
Penalty due to misses is approximately:
870 × 5 + 19456 × 50 = 977,150 clocks of extra execution time due to the loss of cache context
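The arithmetic in answers (4) and (5) can be checked with a short script; this is a minimal sketch in Python, using only the constants given in the problem (rounding 870.4 to 870 as in the printed answer):

```python
# Recomputing answers (4) and (5); all constants come from the problem statement.
l1_hit, l2_hit, transport = 1, 5, 50      # hit/transport times in clocks
l1_miss, l2_miss = 0.15, 0.05             # average local miss rates

amat = l1_hit + l1_miss * l2_hit + l1_miss * l2_miss * transport
print(round(amat, 3))                     # 2.125 clocks

# Refill penalty after a complete flush
l1_blocks = (8 * 1024) // 8               # 8 KB L1 / 8-byte blocks = 1024
l2_blocks = (160 * 1024) // 8             # 160 KB L2 / 8-byte blocks = 20480
extra_l1 = round(l1_blocks * (1 - l1_miss))   # 870 extra L1 misses
extra_l2 = round(l2_blocks * (1 - l2_miss))   # 19456 extra L2 misses
penalty = extra_l1 * l2_hit + extra_l2 * transport
print(penalty)                            # 977150 clocks
```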

Note: this problem involves a sector-mapped cache, which is beyond the scope of the textbook.

105 中央資工
單選題
1. Use 32-bit IEEE 754 single precision to encode “-1234.6875”. If K is the number of “1” bits in this
32-bit value, then what is “K mod 5”? (A) 0 (B) 1 (C) 2 (D) 3 (E) 4
Answer: (b)
Note: −1234.6875₁₀ = −10011010010.1011₂ ⇒ K = 11
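The bit count K can be verified with Python's struct module, which packs a float into the IEEE 754 single-precision format used here:

```python
# Pack -1234.6875 as an IEEE 754 single and count the 1 bits.
import struct

bits = int.from_bytes(struct.pack('>f', -1234.6875), 'big')
print(f'{bits:032b}')   # 11000100100110100101011000000000
k = bin(bits).count('1')
print(k, k % 5)         # 11 1, so the answer is (B)
```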

2. Assume that the number of data dependencies existing in the following MIPS code segment is
K. What is “K mod 5”? (A) 0 (B) 1 (C) 2 (D) 3 (E) 4
add $6, $4, $5
add $2, $5, $6
sub $3, $6, $4
add $2, $2, $3
Answer: (e)
Note: the data dependencies between instructions are (1, 2), (1, 3), (2, 4), and (3, 4) ⇒ K = 4

3. Following the previous question. Assume that the code will be processed by the pipelined
datapath shown below and assume that the registers initially contain their number plus 100. For
example, $2 contains 102 and $5 contains 105, etc. In the fifth clock, calculate the sum of A, B,
C, D, E, F and G. If this sum is equal to K, what is “{Round(K/3)} mod 5”? (A) 0 (B) 1 (C) 2
(D) 3 (E) 4

[Figure: the pipelined datapath “b. With forwarding”: ID/EX, EX/MEM, and MEM/WB pipeline
registers; Forward A and Forward B multiplexers feed the ALU; data memory follows the ALU; a
forwarding unit compares Rs/Rt against EX/MEM.RegisterRd and MEM/WB.RegisterRd; the labels
A, B, C, D, E, F, and G mark the values to be read in the fifth clock.]
Answer: (a)
Note: K = 645 (A = 6, B = 4, C = 2, D = 6, E = 209, F = 104, G = 314); Round(645/3) = 215, and
215 mod 5 = 0

4. A system has a 256 Kbyte cache memory and the address format is
Tag: bits 31–14 | Set: bits 13–4 | Offset: bits 3–0
The cache should be (A) 2-way set associative (B) 4-way set associative (C) 8-way set
associative (D) a direct mapped cache (E) None of above.
Answer: (e)
Note: block size = 2^4 = 16 bytes; number of cache blocks = 256 KB / 16 B = 16K;
number of sets = 2^10 = 1K; associativity = 16K / 1K = 16
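A quick sketch of the same geometry, with the field widths taken from the address format above:

```python
# Derive the associativity from the address split: 4 offset bits, 10 set-index bits.
cache_bytes = 256 * 1024
offset_bits, set_bits = 4, 10
blocks = cache_bytes >> offset_bits   # 16384 blocks of 16 bytes each
assoc = blocks >> set_bits            # blocks per set
print(assoc)                          # 16, so none of (A)-(D): answer (E)
```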

5. In a 3-level memory hierarchy system, the hit time for each level is
T1 = 10ns (L1-cache)
T2 = 200ns (L2-cache)
T3 = 600ns (Main memory)
The local hit rate in each level is H1 = 0.9 (L1-cache), H2 = 0.8 (L2-cache), and H3 = 1. (It is
assumed that we can always find the data in main memory.) If the average access time (in ns) of
this memory system is K. What is “{Round(K*123)} mod 5"? (A) 0 (B) 1 (C) 2 (D) 3 (E) 4
Answer: (b)
Note: (global) miss rate for the L1-cache = 0.1; (global) miss rate for the L2-cache = 0.1 × 0.2 = 0.02;
AMAT = 10 + 0.1 × 200 + 0.02 × 600 = 42 ns ⇒ K = 42

6. Suppose you want to perform the following executions:


(a) 100 sums of scalar variables which can only be performed sequentially
(b) A matrix sum of a pair of two-dimensional arrays, size 1000 by 1000.
What speedup do you get with 1000 processors (compared with only one processor)?
(A) 900
(B) 909
(C) 919
(D) 929
(E) 939
Answer: (b)
Note: Execution time (1 processor) = 100 + 1,000,000 = 1,000,100
Execution time (1000 processors) = 100 + 1,000,000 / 1000 = 1100
Speedup = 1,000,100 / 1100 ≈ 909
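The speedup can be checked numerically; a sketch assuming one time unit per addition:

```python
# Speedup for 100 sequential sums plus a fully parallel 1000x1000 matrix sum.
seq = 100                   # inherently sequential scalar sums
par = 1000 * 1000           # fully parallelizable element-wise additions
t_1 = seq + par             # one processor
t_1000 = seq + par // 1000  # 1000 processors
print(t_1 / t_1000)         # ~909.18, so answer (B)
```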

7. Compute the clock cycle per instruction (CPI) for the following instruction mix. The mix
includes 20% loads, 20% stores, 35% R-format operations, 20% branches, and 5% jumps. The
number of clock cycles for each instruction class is listed as follows: 4 cycles for loads, 4 cycles
for stores, 4 cycles for R-format instructions, 3 cycles for branches, 3 cycles for jumps.

(A) 3.15
(B) 3.25
(C) 3.5
(D) 3.75
(E) 3.9
Answer: (d)
Note: CPI = 4 × 0.2 + 4 × 0.2 + 4 × 0.35 + 3 × 0.2 + 3 × 0.05 = 3.75
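The weighted average can be sketched directly:

```python
# CPI as the cycle count of each instruction class weighted by its frequency.
mix = [(0.20, 4), (0.20, 4), (0.35, 4), (0.20, 3), (0.05, 3)]  # (fraction, cycles)
cpi = sum(f * c for f, c in mix)
print(round(cpi, 2))   # 3.75, so answer (D)
```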

Multiple-selection questions (each option is scored separately; one point is deducted for each option answered incorrectly)
8. Which of the following statements are correct?
(A) The case of “TLB miss, Page Table hit, Cache hit” is possible.
(B) In a pipelined system, forwarding can eliminate all the data hazards.
(C) A write-through cache and a write-back cache will have the same miss rate.
(D) In the analysis using 3C miss model, a cache miss will be classified as one of the three
types of misses, i.e. compulsory miss, conflict miss and capacity miss.
(E) Given a 32-bit address, if the page size is 8K bytes, the virtual page number is 19 bits
long.
Answer: (a), (c), (d), (e)
Note (E): length of the virtual page number = 32 − 13 = 19

9. For the C code A = B + C;
Which of the following are correct?
(A) The accumulator style assembly would be
Load Address(B)
Add Address(C)
Store Address(A)
(B) The memory-to-memory style assembly code would be
Load R1, Address(B)
Load R2, Address(C)
Add R3, R1, R2
Store Address(A), R3
(C) The stack-style assembly would be
Push Address(A)
Push Address(B)
Add
Pop Address(C)
(D) The register-to-memory style assembly code would be
Load R1, Address(A)

Add R2, R1, Address(B)
Store Address(C), R2
(E) None of the above is correct.
Answer: (a)
Note: the correct versions are:
memory-to-memory: Add Addr.(A), Addr.(B), Addr.(C)
stack: Push Address(B); Push Address(C); Add; Pop Address(A)
register-to-memory: Load R1, Address(B); Add R2, R1, Address(C); Store Address(A), R2

10. Which of the following statements are true?


(A) Using registers is more efficient for a compiler than other forms of internal storage.
(B) Load interlock can be resolved by forwarding hardware.
(C) For a cache with write through strategy, read misses might result in writes.
(D) Compilers can schedule the instructions to avoid unnecessary hazards
(E) None of the above
Answer: (a), (d)
Note (B): a load interlock refers to the load-use data hazard.
Note (C): “read misses might result in writes” refers to writing to memory, not to the cache.

11. Assume the following execution time for different operations: MUL takes 6 cycles, DIV takes
15 cycles, and ADD and SUB both take 1 cycle to complete. The instructions are issued into the
pipeline in order, but out of order execution and completion is allowed. What hazards will be
encountered when executing this code sequence?
MUL F1, F3, F2
ADD F2, F7, F8
DIV F10, F1, F5
SUB F5, F8, F2
SD 8(R1), F2
(A) No hazards for this code sequence
(B) There will be a data hazard when the ADD instruction is executed.
(C) There will be a data hazard when the SD instruction is executed
(D) There will be a data hazard when the DIV instruction is executed.
(E) A Write After Read (WAR) hazard is possible when SUB instruction completes execution.
Answer: (d), (e)

105 中正資工
1. If we want to design an adder to compute the addition of two 16-bit unsigned numbers with
ONLY 4-bit carry-look-ahead (CLA) adders and 2-to-1 multiplexers. The delay times of a 4-bit
CLA adder and a 2-to-1 multiplexer are D_CLA and D_MX, respectively. In addition, D_MX is
equal to 0.2 × D_CLA. Please determine the minimum delay time for this 16-bit adder in terms of D_CLA.
Answer: The minimum delay time = 2 × D_CLA + 0.2 × D_CLA = 2.2 D_CLA
[Figure: carry-select organization: bits 7–0 are produced by two cascaded 4-bit CLAs; bits 15–8 are
computed twice in parallel by two more pairs of 4-bit CLAs, once with carry-in 0 and once with
carry-in 1, and 2-to-1 multiplexers select s15–s8 once the carry out of bit 7 is known.]

2. Based on IEEE 754 standard, the single precision numbers are stored in 32 bits with one sign
bit, eight exponent bits, and 23 mantissa bits. Please show the representation of -0.6875.
Answer: −0.6875₁₀ = −0.1011₂ = −1.011 × 2^−1
⇒ 1 01111110 01100000000000000000000

3. Caches take advantage of temporal locality and spatial locality to improve the performance of
the memory. Please briefly explain what is temporal locality and spatial locality.
Answer
Temporal locality: if an item is referenced, it will tend to be referenced again soon.
Spatial locality: if an item is referenced, items whose addresses are close by will tend to be
referenced soon.

4. The following techniques have been developed for cache optimizations: “Pipelined cache” and
“multi-banked cache.” Please briefly explain these techniques and how they work.
Answer
Pipelined cache: pipelining the cache access by inserting latches between modules in the cache
can achieve a high bandwidth cache. Cache access time is divided into decoding delay, wordline
to sense amplifier delay, and mux to data out delay. Using this technique, cache accesses can
start before the previous access is completely done, resulting in high bandwidth and a high
frequency cache.
Multi-banked cache: rather than treat the cache as a single monolithic block, divide into
independent banks that can support simultaneous accesses and can increase cache bandwidth.

5. The following circuit is composed of two D-type flip-flops (DFFs) and an OR gate, where the
DFFs are reset when c1k=1, Describe the function of the circuit and one possible application of
it.
[Figure: two D flip-flops and an OR gate: din drives the flip-flops, whose outputs are ORed to
produce dout; both flip-flops are reset (rst) while clk = 1. A timing diagram shows the clk, D, and
rst waveforms.]
Answer:
This circuit detects whether the din input toggles between 0 and 1 while the clock is at its low
level. One possible application is detecting whether a signal line is being disturbed by noise.

6. The following is a code segment in MIPS assembly, where JAL (jump and link) jumps to the
label j_target and saves its return address in R31, and JR returns to the next instruction after JAL.
JAL j_target
….
j_target: JR R31
List all possible hazards when the code is executed on a classical 5-stage pipelined datapath (i.e.
fetch, decode, execute, memory, and write-back). What are the minimum cycles for an interlock
unit to resolve each hazard without data forwarding?
Answer
The JAL and JR instructions each cause a control hazard, with a one-clock-cycle delay apiece. If the
JR instruction has a data dependency on a preceding instruction, a data hazard also occurs, with a
two-clock-cycle delay.

7. Briefly describe the basic ideas of the following terms in computer designs:
(a) TLB, (b) BTB, and (c) AMAT.
Answer
(a) TLB: a cache that keeps track of recently used address mappings to avoid an access to the
page table
(b) BTB: a structure that caches the destination PC or destination instruction for a branch. It is
usually organized as a cache with tags, making it more costly than a simple prediction buffer
(c) AMAT: average memory access time is the average time to access memory considering
both hits and misses and the frequency of different accesses.

105 中正電機
1. Explain three types of hazard in pipelining and give their solutions.
Answer
(1) Structural hazards: hardware cannot support the instructions executing in the same clock
cycle (limited resources)
(2) Data hazards: attempt to use item before it is ready (Data dependency)
(3) Control hazards: attempt to make a decision before condition is evaluated (branch
instructions)
Type | Solutions
Structural hazard | Add enough hardware (e.g., use two memories, one for instructions and one for
data) or stall the pipeline.
Data hazard | Software solution: (a) have the compiler insert enough no-operation (nop)
instructions; (b) reorder the instructions so that the data hazard disappears.
Hardware solution: (a) use forwarding to feed data to subsequent instructions that have a data
hazard; (b) for a load-use hazard, which forwarding alone cannot resolve, stall one clock cycle
and then forward.
Control hazard | Software solution: (a) have the compiler insert enough no-operation (nop)
instructions; (b) use delayed branches.
Hardware solution: (a) move the branch decision earlier to reduce the number of instructions
that must be flushed on a control hazard; (b) use a prediction scheme (static or dynamic): when
the prediction is correct the pipeline runs at full speed and branch instructions cost no
performance; only when the prediction is wrong must the incorrectly fetched instructions be
flushed.

2. The following formula is the 32-bit version of the IEEE 754 standard.


N = (−1)^S × 2^(E−127) × (1.M), 0 < E < 255
(a) Find the representation (E and M) of N = -3.25.
(b) Find the magnitude range of nonzero floating-point numbers.
Answer
(a) −3.25₁₀ = −11.01₂ = −1.101 × 2^1
E = 10000000; M = 10100000000000000000000
(b) FP: ±1.11111111111111111111111 × 2^127 ~ ±1.00000000000000000000000 × 2^−126
Denorm: ±0.11111111111111111111111 × 2^−126 ~ ±0.00000000000000000000001 × 2^−126
±∞

3. Give the hardware architecture of floating-point add unit for A + B.
A = X_M × 2^(X_E)
B = Y_M × 2^(Y_E)
Answer

4. Calculate Y × X (Y = 101110, X = 100011)


(a) By Robertson multiplication algorithm
n2
x  2n1 xn1   2i xi
i 0

(b) By Booth’s multiplication algorithm


Multiplicand Y = 101010
Multiplier X = 100011

Answer

(a) Robertson (multiplicand Y = 101110):
Iteration 0 (initial values): Product = 000000 100011
Iteration 1 (bit 1): prod = prod + Y → 101110 100011; shift right → 110111 010001
Iteration 2 (bit 1): prod = prod + Y → 100101 010001; shift right → 110010 101000
Iteration 3 (bit 0): no operation; shift right → 111001 010100
Iteration 4 (bit 0): no operation; shift right → 111100 101010
Iteration 5 (bit 0): no operation; shift right → 111110 010101
Iteration 6 (bit 1): prod = prod + (Y′ + 1) → 010000 010101; shift right → 001000 001010
(b) Booth (multiplicand Y = 101010):
Iteration 0 (initial values): Product = 000000 100011 0
Iteration 1 (bits 10): prod = prod − Mcand → 010110 100011 0; shift right → 001011 010001 1
Iteration 2 (bits 11): no operation; shift right → 000101 101000 1
Iteration 3 (bits 01): prod = prod + Mcand → 101111 101000 1; shift right → 110111 110100 0
Iteration 4 (bits 00): no operation; shift right → 111011 111010 0
Iteration 5 (bits 00): no operation; shift right → 111101 111101 0
Iteration 6 (bits 10): prod = prod − Mcand → 010011 111101 0; shift right → 001001 111110 1
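The iteration tables can be reproduced in software. Below is a minimal radix-2 Booth multiplier for 6-bit two's-complement operands; it is an illustrative sketch (the bit widths and register layout are assumptions taken from the tables), not exam material.

```python
def booth(mcand, mplier, n=6):
    """Radix-2 Booth multiply of two n-bit two's-complement bit patterns.

    Returns the 2n-bit product register (two's-complement pattern)."""
    mask = (1 << n) - 1
    full = (1 << 2 * n) - 1
    prod = mplier & mask          # low half holds the multiplier; extra bit starts at 0
    prev = 0
    for _ in range(n):
        pair = (prod & 1, prev)
        if pair == (1, 0):        # bits 10: subtract multiplicand from the high half
            prod = (prod + (((-mcand) & mask) << n)) & full
        elif pair == (0, 1):      # bits 01: add multiplicand to the high half
            prod = (prod + ((mcand & mask) << n)) & full
        prev = prod & 1
        sign = prod >> (2 * n - 1)                 # arithmetic right shift of the register
        prod = (prod >> 1) | (sign << (2 * n - 1))
    return prod

# 101010 (-22) x 100011 (-29) = 638, matching the final row 001001 111110
print(bin(booth(0b101010, 0b100011)))   # 0b1001111110 = 638
```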

5. Supposing that the industry trends show that a new process generation scales resistance by 1/3,
capacitance by 1/2, and clock cycle time by 1/2, by what factor does the dynamic power of the
digital system (without clock gating) scale?
Answer
Power_new = (2 × F_old) × (0.5 × C_old) × V_old² = F_old × C_old × V_old² = Power_old
Thus power scales by 1

6. If processor A has a higher clock rate than processor B, and processor A also has a higher MIPS
rating than processor B, explain whether processor A will always execute faster than processor
B. Why or why not?
Answer
No. MIPS = Clock rate / (CPI × 10^6), so the MIPS rating reflects only the clock rate and the CPI.
It says nothing about the instruction count (and the two processors may even use different
instruction sets). Lacking the instruction count, we cannot infer that processor A will always
execute faster than processor B.

7. Consider the following piece of codes:


int x = 0, y = 0; // The compiler puts x in $r1 and y in $r2
int i; // The compiler put i in $r3
int A[4096]; // A is in memory at address 0x10000

for (i = 0; i < 1024; i++)
    x += A[i];
for (i = 0; i < 1024; i++)
    y += A[i + 2048];
Assume that integers are 32-bits and the cache starts out empty.
(a) If the system has a 8192-byte, direct-mapped data cache with 16-byte blocks, what is the
series of data cache hits and misses for this snippet of code?
(b) If the cache is 2-way set associative with an LRU replacement policy and 16-byte sets
(8-byte blocks), what is the series of data cache hits and misses for this snippet of code?
Answer
(a) Each block holds 16 bytes, or 4 integers. For the first loop, one in every 4 elements causes a
miss. The series of hits and misses: Miss, Hit, Hit, Hit, Miss, Hit, Hit, Hit, Miss, …
For the second loop: 2048 × 4 = 8192 bytes, which is exactly the capacity of the direct-mapped
cache, so A[i+2048] maps to the same blocks starting from index 0. The sequence is therefore
the same as above: Miss, Hit, Hit, Hit, Miss, Hit, Hit, Hit, Miss, …
(b) The number of sets in this case will be 8192 / 16 = 512. Each block in a set holds 2 integers.
Since 1024 accesses are consecutive in the first loop, there is one miss per 2 elements. The
series of hits and misses: Miss, Hit, Miss, Hit,…..
The second loop accesses elements that in the same set as the first loop and hence results in
the same miss behavior.
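Part (a) can be simulated directly; a sketch assuming 4-byte ints and the base address 0x10000 from the code:

```python
# Simulate the direct-mapped cache of part (a): 8192 bytes, 16-byte blocks.
BLOCK = 16
NBLOCKS = 8192 // BLOCK
tags = [None] * NBLOCKS
misses = 0
for i in list(range(1024)) + [j + 2048 for j in range(1024)]:
    addr = 0x10000 + 4 * i            # address of A[i]
    index = (addr // BLOCK) % NBLOCKS
    tag = addr // (BLOCK * NBLOCKS)
    if tags[index] != tag:            # miss: fill the block
        tags[index] = tag
        misses += 1
print(misses)   # 512: one miss per 4 elements in each of the two loops
```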

8. What kind of hazards will happen when executing the following MIPS program? How to avoid
the hazard without stalls?
lw $t1, 0($t0)
lw $t2, 4($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
lw $t4, 8($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
Answer
Both add instructions have a data hazard because of their respective dependence on the
immediately preceding lw instruction. Notice that bypassing eliminates several other potential
hazards including the dependence of the first add on the first lw and any hazards for store
instructions. Moving up the third lw instruction eliminates both hazards:
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
Note: this is a textbook example, and the textbook assumes a forwarding unit is present. The problem
does not state whether there is a forwarding unit; the answer follows the textbook.

9. Given the full datapath of a MIPS CPU below.
(a) Please specify the required functional blocks for “Branch instructions”, “R-format
instructions”, and “Load/Store instructions”, respectively.
(b) If the CPU can only execute four instructions, including “bne”, “add”, “lw”, “sw”, which
instruction will result in the longest delay by assuming that the delay of each functional
block is the same.
[Figure: the single-cycle MIPS datapath: PC feeding the instruction memory; register file with two
read ports and one write port (RegWrite); sign-extend unit (16 → 32 bits); ALU with the ALUSrc
multiplexer and 4-bit ALU operation control; branch adder with Shift left 2 and the PCSrc
multiplexer; data memory with MemRead/MemWrite and the MemtoReg multiplexer on the
write-back path.]
Answer
(a)
Instruction Functional blocks used for each instruction
Branch IM Reg ALU
R-format IM Reg ALU Reg
Load IM Reg ALU DM Reg
Store IM Reg ALU DM
Remark: IM: instruction memory; Reg: register file; DM: data memory
(b) According to above table, lw instruction will result in the longest delay

105 中山電機
1. Terminology Explanation
(a) SuperScalar Processor
(b) Multi-core Processor
(c) Branch Prediction
(d) VLIW Architecture
(e) Booth’s multiplication algorithm
Answer
(a) Superscalar: an advanced pipelining technique that enables the processor to execute more
than one instruction per clock cycle
(b) A multi-core processor is an integrated circuit (IC) to which two or more processors have
been attached for enhanced performance, reduced power consumption, and more efficient
simultaneous processing of multiple tasks
(c) Branch Prediction: A method of resolving a branch hazard that assumes a given outcome
for the branch and proceeds from that assumption rather than waiting to ascertain the actual
outcome
(d) VLIW: a style of instruction set architecture that launches many operations that are defined
to be independent in a single wide instruction, typically with many separate opcode field.
(e) Booth’s multiplication algorithm is a multiplication algorithm that multiplies two
signed binary numbers in two's complement notation.

2. (a) Describe the definition of Amdahl’s law.


(b) Suppose we enhance a machine making all floating-point instructions run four times faster.
If the execution time of some benchmark before the floating-point enhancement is 60
seconds, what will the speedup be if three-fourth of the 60 seconds is spent executing
floating-point instructions?
Answer
(a) Amdahl’s law: a rule stating that the performance enhancement possible with a given
improvement is limited by the amount that the improved feature is used. The expression for
Amdahl’s law is shown below
Execution time_after improvement = (Execution time_affected by improvement / Amount of improvement) + Execution time_unaffected
(b) Speedup = 1 / [(0.75 / 4) + 0.25] = 2.29

3. A set associative cache has a block size of four 32-bit words and a set size of 4. The cache can
accommodate a total of 16K words. The main memory size that is cacheable is 64M × 32 bits.
Design the cache structure and show how the processor's addresses are interpreted.
Answer

Main memory is 64M × 32 bits, so a word address is 26 bits (a byte address is 28 bits). The cache
holds 16K words / 4 words per block = 4K blocks; with set size 4 there are 1K sets. The 26-bit word
address is therefore interpreted as: Tag = 14 bits | Index = 10 bits | Word offset = 2 bits (plus a 2-bit
byte offset in the 28-bit byte address).
[Figure: 4-way set-associative cache: each of the 1024 sets holds four entries of (valid bit, 14-bit tag,
128-bit data block); four tag comparators produce the hit signal and drive a 4-to-1 multiplexer that
selects the data.]

4. Design a Four-phase Stepper Motor Controller circuit with a clock input to generate the output
signals A, B, C and D to control a four-phase stepper motor whose stepping waveform is
described as table 1.
Table 1
step | output signals A B C D
1 | 1 1 0 0
2 | 1 0 0 1
3 | 0 0 1 1
4 | 0 1 1 0
1 | 1 1 0 0

Answer
State table
step | current Qa Qb | next Qa Qb | A B C D
1 | 0 0 | 0 1 | 1 1 0 0
2 | 0 1 | 1 0 | 1 0 0 1
3 | 1 0 | 1 1 | 0 0 1 1
4 | 1 1 | 0 0 | 0 1 1 0

A = Qa′, B = Qa′Qb′ + QaQb, C = Qa, D = Qa′Qb + QaQb′, Da = Qa′Qb + QaQb′, Db = Qb′

[Figure: implementation: Da and Db feed two D flip-flops driven by the clock; the flip-flop outputs
Qa and Qb (and their complements) are combined by gates to produce the outputs A, B, C, and D.]

5. Use the following code fragment:


Loop: LW R1, 0(R2)
ADDI R1, R1, #1
SW 0(R2), R1
ADDI R2, R2, #4
SUB R4, R3, R2
BNEZ R4, Loop
Assume the initial value of R3 is R2+100. Use the five-stage instruction pipeline (IF, DEC,
EXE, MEM, WB) and assume all memory accesses are one cycle operation. Furthermore,
branches are resolved in MEM stage.
(a) Show the timing of this instruction sequence for the five-stage instruction pipeline with
normal forwarding and bypassing hardware. Assume that branch is handled by predicting it
as not taken. How many cycles does this loop take to execute?
(b) Assuming the five-stage instruction pipeline with a single-cycle delayed branch and normal
forwarding and bypassing hardware, schedule the instructions in the loop including the
branch-delay slot. You may reorder instructions and modify the individual instruction
operands, but do not undertake other loop transformations that change the number of
op-code of instructions in the loop. Show a pipeline timing diagram and compute the
number of cycles needed to execute the entire loop.
Answer
(a)
Instructions Clock cycle number
1 2 3 4 5 6 7 8 9 10 11 12 13 14
LW R1, 0(R2) F D X M W
ADDI R1, R1, #1 F D D X M W
SW R1, 0(R2) F D X M W
ADDI R2, R2, #4 F D X M W
SUB R4, R3, R2 F D X M W
BNEZ R4, Loop F D D X M W
Flushed instruction F D X M W
Flushed instruction F D X M W
Flushed instruction F D X M
LW R1, 0(R2) F D X
The total number of iterations is 200 / 4 = 50.
There are 2 RAW hazards (2 stalls) and 3 flushes after the branch since the branch is taken.
For the first 49 iterations, it takes 11 cycles between loop instances.
The last iteration takes 8 cycles since the branch is not taken.
So, the total number of cycles is (5 − 1) + 49 × 11 + 8 = 551
(b)
LW R1, 0(R2)
ADDI R2, R2, #4
SUB R4, R3, R2
ADDI R1, R1, #1
BNEZ R4, Loop
SW R1, -4(R2)
Instructions Clock cycle number
1 2 3 4 5 6 7 8 9 10
LW R1, 0(R2) F D X M W
ADDI R2, R2, #4 F D X M W
SUB R4, R3, R2 F D X M W
ADDI R1, R1, #1 F D X M W
BNEZ R4, Loop F D X M W
SW R1, -4(R2) F D X M W
The total number of iterations is 200 / 4 = 50.
Single-cycle delayed branch means the branch is determined in the ID stage.
So, the total number of cycles is (5 − 1) + 50 × 6 = 304.

105 中山資工
1. SIMD, SIMT, VLIW, Superscalar, Superpipeline
(1) Explain the major features of single instruction multiple data (SIMD) processor. In other
words, point out the major differences between SIMD CPU and conventional CPU. Give
a practical example of SIMD instructions, CPU, or computer.
(2) Compare the differences of thread and process. Is context switching the change of threads
or the change of process?
(3) Explain the major features of single instruction multiple threads (SIMT) processor. Give a
practical example of SIMT computer.
(4) Explain the major features of very long instruction word (VLIW) processor. Which
processor category does VLIW belong to, static multiple issue or dynamic multiple issue?
(5) What is a superscalar processor? What is a superpipeline processor?
Answer:
(1) A conventional uniprocessor has a single instruction stream and single data stream. SIMD
computers operate on vectors of data. One practical example of SIMD instructions is SSE
instructions of x86
(2) A process is an executing instance of an application. A thread is a path of execution within a
process. Also, a process can contain multiple threads. When you start a running a program,
the operating system creates a process and begins executing the primary thread of that
process.
Context switching is the change of process
(3) A processor architecture that applies one instruction to multiple independent threads in
parallel.
Nvidia introduced the single-instruction multiple-thread (SIMT) execution model, in which
multiple independent threads execute concurrently using a single instruction.
(4) A style of instruction set architecture that launches many operations that are defined to be
independent in a single wide instruction, typically with many separate opcode fields.
VLIW belong to, static multiple issue
(5) Superscalar is an advanced pipelining that enables the processor to execute more than one
instruction per clock cycle by selecting them during execution.
Superpipeline is an advanced pipelining that increase the depth of the pipeline to increase
the clock rate

2. Cache
(1) Give two methods to reduce cache miss rate.
(2) Give two methods to reduce cache hit time.
(3) Give two methods to reduce cache miss penalty.
(4) For the following Figure 1, explain the reason why the miss rate goes down as the block
size increases to a certain level, and explain the reason why the miss rate goes up if the
block size is too large relative to the cache size.

[Figure 1: miss rate versus block size, for several cache sizes.]
(5) Explain the differences of write-allocate and no-write-allocate during cache write miss.
Answer:
(1) increase cache size; increase associativity
(2) use virtually addressed cache; use direct-mapped cache
(3) use early restart technique; use critical word first technique
(4) Larger blocks exploit spatial locality to lower miss rates. The miss rate may go up
eventually if the block size becomes a significant fraction of the cache size, because the
number of blocks that can be held in the cache will become small, and there will be a great
deal of competition for those blocks. As a result, a block will be bumped out of the cache
before many of its words are accessed.
(5) Write allocate: the block is fetched from memory and then the appropriate portion of the
block is overwritten.
No write allocate: the portion of the block in memory is updated but not put it in the cache

3. The execution of an instruction can be divided into five parts: instruction fetch (IF), register
read (RR), ALU operation (EX), data access (MEM), and register write (RW). The following
Table 1 shows the execution time of each part for several types of instructions, assuming that
the multiplexors, and control unit have no delay.
Instruction class | Instruction fetch | Register read | ALU operation | Data access | Register write
Load word (lw) | 200 ps | 100 ps | 200 ps | 200 ps | 100 ps
Store word (sw) | 200 ps | 100 ps | 200 ps | 200 ps | —
R-format (add, sub, and, or, slt) | 200 ps | 100 ps | 200 ps | — | 100 ps
Branch (beq) | 200 ps | 100 ps | 200 ps | — | —
If instructions are to be executed in a pipelined CPU with five pipeline stages, IF, RR, EX,
MEM, RW where the pipeline stages execute the corresponding operations mentioned above.
(1) What is the cycle time of the pipelined CPU? What is the maximum working frequency?
(2) What is the latency of executing the load-word instruction (lw) in the pipelined CPU?
(3) What is the latency of executing the add instruction (add) in the pipelined CPU?
(4) What is the maximum throughput of the pipelined CPU?
(5) Propose a design method to increase the throughput performance of the pipelined CPU.
Answer:
(1) Cycle time = 200ps
Maximum frequency = 1 / 200ps = 5GHz
(2) The latency of executing a lw instruction = 5 × 200 ps = 1000 ps
(3) The latency of executing an add instruction = 5 × 200 ps = 1000 ps
(4) The maximum throughput = 1 / 200 ps = 5G instructions/sec.
(5) Increase the depth of the pipeline

4. Memory Hierarchy
The following Figure 2 shows the process of going from a virtual address to a data item in
cache. Answer the following questions based on this figure.
(1) What is the purpose of the translation lookaside buffer (TLB)?
(2) Is the TLB direct mapped, set associative, or full associative? Is the data cache, direct
mapped, set associative, or fully associative?
(3) What is the page size? What is the block size of the data cache?
(4) What is the size of the data cache (excluding the tags)? What is the maximum size of
physical memory supported in this memory system?
(5) Is the cache virtually addressed or a physically addressed?

[Figure 2: the 32-bit virtual address is split into a 20-bit virtual page number (bits 31–12) and a
12-bit page offset. Each TLB entry holds valid and dirty bits, a tag, and a 20-bit physical page
number; on a TLB hit the physical page number is concatenated with the page offset to form the
physical address. The physical address is then split into an 18-bit cache tag, an 8-bit cache index, a
4-bit block offset, and a 2-bit byte offset; each cache entry holds a valid bit, a tag, and data, with a
comparator producing the cache hit signal and a multiplexer selecting the 32-bit data word.]
Answer:
(1) The purpose of a TLB is to speed up the translation of a virtual address to a physical address
(2) The TLB is a fully associative cache. The data cache is direct-mapped
(3) Page size = 2^12 bytes = 4 KB. Block size = 2^6 bytes = 64 bytes
(4) The size of the data cache is 2^8 × 64 bytes = 16 KB, and the maximum size of physical
memory is 2^20 × 2^12 bytes = 4 GB
(5) The cache is physically addressed

5. Answer the following questions.


(1) What are the differences of arithmetic mean and geometric mean? Is geometric mean or
arithmetic mean used by SPECINT2006 to summarize the speed measurements for several
benchmark programs?
(2) The following Figure 3 shows the growth in processor performance since mid-1980s.
Answer the following questions.
(a) Is the vertical axis of Figure 3 linear scale or log-scale? What is the growth rate,
linear or exponential, if it is a straight line in a particular time interval in Figure 3?
(b) Give two possible reasons for the large performance growth rate from 1986 to 2002.
(c) Give two possible reasons to explain why the growth rate slows down after 2002.

[Figure 3: growth in processor performance since the mid-1980s.]
(d) Give two design methods to increase the performance of executing conditional
branch instructions
Answer:
(1) The arithmetic mean (AM) is the average of the execution times and is directly proportional
to total execution time.
The geometric mean is the nth root of the product of n execution-time ratios.
SPECINT2006 uses the geometric mean to summarize the speed measurements for several
benchmark programs.
(2) (a) The vertical axis of Figure 3 is log-scale.
The growth rate is exponential
There is a straight line from 1986 to 2002
(b) VLSI technology improvement; advance architectural and organizational ideas
(c) Limits of power; available instruction-level parallelism; memory latency
(d) Branch prediction; move the branch decision earlier

105 中興資工
1. True or False
(a) The processor comprises two main components: datapath and register.
(b) In the immediate addressing mode, the operand is a constant within the instruction itself.
(c) TLB can be used to speed up the virtual-to-physical address translation.
(d) In cache write hit, we can use the write buffer to improve the performance of write-through
scheme.
(e) Program counter (PC) is the register that contains the data address in the program being
executed.
Answer
(a) (b) (c) (d) (e)
False True True True False
Note (a): the processor comprises two main components: datapath and control unit.
Note (e): the PC is the register that contains the address of the instruction being executed.

2. Answer the following questions briefly.


(a) Please define a basic block in the program.
(b) Suppose the cache size is 4096 bytes and the block size is 16 bytes. In the 32-bit address
format, find the tag size for a 2-way set associative cache.
(c) Suppose you want to achieve a speedup of 2 times faster with 6 processors. What
percentage of the original computation can be sequential?
Answer
(a) Basic block: a sequence of instructions without branches (except possibly at the end) and
without branch targets or branch labels (except possibly at the beginning)
(b) Number of blocks = 4096 / 16 = 256
Number of sets = 256 / 2 = 128 = 2^7
Tag size = 32 – 7 – 4 = 21
(c) Suppose that f is the fraction of the original computation that must be sequential:
1 / (f + (1 − f)/6) = 2 ⇒ f = 40%
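The same relation can be solved algebraically and checked in a short sketch:

```python
# Solve 1 / (f + (1 - f)/p) = s for the sequential fraction f.
p, s = 6, 2.0
f = (p / s - 1) / (p - 1)          # algebraic rearrangement of the speedup formula
print(f)                           # 0.4 -> 40% sequential
assert abs(1 / (f + (1 - f) / p) - s) < 1e-12
```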

105 中興電機
1. Please answer “Yes” or “No” for the following five comparisons between RISC and CISC
processors:
(a) The instruction format complexity of CISC is more than that of RISC.
(b) The clock cycle time of instructions in CISC is more than that in RISC.
(c) The clock cycles per instruction (CPI) in CISC is more than that in RISC.
(d) The performance of CISC is more than that of RISC.
(e) The power consumption of CISC is more than that of RISC.
Answer:
(a) (b) (c) (d) (e)
Yes Yes Yes No Yes

2. (a) Please write the five addressing categories of MIPS assembly language.
(b) To which category does the MIPS instruction “jr” belong?
Answer:
(a)
(1) Register addressing
(2) Base or displacement addressing
(3) Immediate addressing
(4) PC-relative addressing
(5) Pseudodirect addressing
(b) Register addressing

3. The following C code is compiled into the corresponding MIPS assembly code. Assume
that the two parameters array and size are found in the registers $a0 and $a1, and that p is
allocated to register $t0.
C codes:
clear(int *array, int size)
{
int *p;
for (p=&array[0]; p < &array[size]; p=p+1)
*p=0;
}
MIPS assembly codes:
move $t0, $a0
loop: OP1 $zero, 0($t0)
addi $t0, $t0, 4
add $t1, $a1, $a1

add $t1, $t1, OP3
OP2 $t2, OP4, $t1
slt $t3, $t0, $t2
bne $t3, $zero, loop
Please determine proper instructions for (OP1, OP2) and proper operands for (OP3, OP4).
Copy the following table (Table 1) to your answer sheet and fill in the two instructions and
two operands.
Table 1
Instruction/Operand
OP1
OP2
OP3
OP4

Answer:
Instruction/Operand
OP1 sw
OP2 add
OP3 $t1
OP4 $a0
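As a sanity check of the filled-in table, the completed loop (OP1 = sw, OP2 = add, OP3 = $t1, OP4 = $a0) can be emulated in Python on a word-indexed list standing in for memory; the function name and memory model are illustrative, not part of the MIPS code:

```python
# Minimal emulation of the completed loop: a0 is the byte address of the
# array, a1 is size, and mem is word-indexed (address // 4).
def clear(mem, a0, a1):
    t0 = a0                     # move $t0, $a0
    while True:
        mem[t0 // 4] = 0        # sw   $zero, 0($t0)    <- OP1 = sw
        t0 += 4                 # addi $t0, $t0, 4
        t1 = a1 + a1            # add  $t1, $a1, $a1    (2 * size)
        t1 = t1 + t1            # add  $t1, $t1, $t1    (4 * size)  <- OP3 = $t1
        t2 = a0 + t1            # add  $t2, $a0, $t1    <- OP2 = add, OP4 = $a0
        if not (t0 < t2):       # slt/bne: loop while $t0 < $t2
            break
    return mem

print(clear([7, 7, 7, 7], 0, 3))  # [0, 0, 0, 7]
```

Only the first `size` words are zeroed, matching the C loop bound `p < &array[size]`.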

4. For signed addition ($t0 = $t1 + $t2), the following sequence of MIPS codes can detect two
special conditions. The sequence codes are shown as follows:
addu $t0, $t1, $t2
xor $t3, $t1, $t2
slt $t3, $t3, $zero
bne $t3, $zero, T1

addu $t0, $t1, $t2


xor $t3, $t0, $t1
slt $t3, $t3, $zero
bne $t3, $zero, T2

Please describe which conditions the branches to T1 and T2 detect for signed addition.

Answer:
T1: no overflow is possible (the two operands have different signs)
T2: overflow has occurred (the result's sign differs from the operands' sign)
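The two tests can be sketched in Python over 32-bit two's-complement values (the helper names are my own):

```python
# Sketch of the two overflow tests in 32-bit two's complement.
def sign(x):          # sign bit of a 32-bit value
    return (x >> 31) & 1

def addu32(a, b):     # wrap-around addition, like addu
    return (a + b) & 0xFFFFFFFF

def operands_differ_in_sign(a, b):        # first sequence -> branch to T1
    return sign(a ^ b) == 1               # signs differ: overflow impossible

def result_sign_flipped(a, b):            # second sequence -> branch to T2
    return sign(addu32(a, b) ^ a) == 1    # result sign != operand sign: overflow

# 0x7FFFFFFF + 1 overflows in signed arithmetic:
print(operands_differ_in_sign(0x7FFFFFFF, 1))  # False (same sign)
print(result_sign_flipped(0x7FFFFFFF, 1))      # True  (overflow detected)
```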

5. Compute the following floating-point operations using IEEE 754 single precision. Please show
the floating point encoding in BINARY.
–344.3125 - 123.5625
Answer:
−344.3125₁₀ = −101011000.0101₂ = −1.010110000101₂ × 2^8
−123.5625₁₀ = −1111011.1001₂ = −1.1110111001₂ × 2^6
Step 1 (align exponents): −1.1110111001₂ × 2^6 = −0.011110111001₂ × 2^8
Step 2 (add significands): (−1.010110000101₂ − 0.011110111001₂) × 2^8 = −1.110100111110₂ × 2^8
Step 3 (normalize): the result is already normalized: −1.110100111110₂ × 2^8
Step 4 (round): the significand fits in 23 fraction bits, so no rounding is needed
Result: −1.110100111110₂ × 2^8 = −111010011.111₂ = −467.875₁₀
IEEE 754 single-precision encoding: sign = 1, exponent = 8 + 127 = 135 = 10000111₂,
fraction = 11010011111000000000000, i.e. 1 10000111 11010011111000000000000
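The hand computation above can be cross-checked with Python's struct module; all three values fit exactly in binary32, so no rounding noise appears:

```python
import struct

def f32_bits(x):
    # Raw binary32 encoding of x as a 32-bit binary string
    return format(struct.unpack('>I', struct.pack('>f', x))[0], '032b')

result = -344.3125 - 123.5625
print(result)  # -467.875
bits = f32_bits(result)
print(bits[0], bits[1:9], bits[9:])  # 1 10000111 11010011111000000000000
```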

6. We examine how pipelining affects the clock cycle time of the processor. Assume that
individual stages of the datapath have the following latencies:

IF ID EX MEM WB
250 ps 350 ps 150 ps 300 ps 200 ps

Also, assume that instructions executed by the processor are broken down as follows:

ALU BEQ LW SW
45% 20% 20% 15%

(a) What is the clock cycle time in a pipelined and non-pipelined processor?
(b) What is the total latency of an LW instruction in a pipelined and non-pipelined processor?
(c) If we can split one stage of the pipelined datapath into two new stages, each with half the
latency of the original stage, which stage would you split and what is the new clock cycle
of the processor?
(d) Assuming there are no stalls for hazards, what is the utilization of the data memory?
Answer:
(a) Pipelined: 350 ps; Single-cycle (non-pipelined): 1250 ps
(b) Pipelined: 5 × 350 = 1750 ps; Single-cycle (non-pipelined): 1250 ps
(c)
Stage to split New clock cycle time
ID 300ps
(d)
Data memory is utilized only by LW and SW instructions in the MIPS
ISA. So the utilization is 35% of the clock cycles
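The arithmetic behind (a)–(c) can be reproduced in a few lines (a cross-check of the stated answers, with the stage names and latencies taken from the problem):

```python
# Recompute Q6(a)-(c) from the stage latencies.
stages = {'IF': 250, 'ID': 350, 'EX': 150, 'MEM': 300, 'WB': 200}

cycle_pipelined = max(stages.values())   # slowest stage sets the clock
cycle_single    = sum(stages.values())   # one long cycle per instruction
lw_latency_pipe = 5 * cycle_pipelined    # lw passes through all 5 stages

# Splitting the slowest stage (ID) in half leaves MEM as the new bottleneck.
split = dict(stages, ID=stages['ID'] / 2)
new_cycle = max(split.values())

print(cycle_pipelined, cycle_single, lw_latency_pipe, new_cycle)
# 350 1250 1750 300
```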
7. Consider executing the following code on the pipelined datapath.
add $8, $5, $5
add $8, $5, $8
sub $3, $8, $4
add $2, $2, $3
add $4, $2, $3
(a) Show or list all of the dependencies in this program. For each dependency, indicate which
instructions and register are involved.
(b) At the end of the fifth cycle of execution, which registers are being read and which
register will be written?
(c) With regard to the program, explain what the forwarding unit is doing during the fifth
cycle of execution. If any comparisons are being made, describe them.
(d) With regard to the program, explain what the hazard detection unit is doing during the
fifth cycle of execution. If any comparisons are being made, describe them.
Answer:
(a)
Dependencies between instructions Register
(1, 2) $8
(2, 3) $8
(3, 4) $3
(3, 5) $3
(4, 5) $2
(b) Register $8 is being written and Registers $2 and $3 are being read.
(c) The forwarding unit compares EX/MEM.RegisterRd and MEM/WB.RegisterRd (both $8) with
ID/EX.RegisterRs ($8) and ID/EX.RegisterRt ($4) of the sub instruction in EX:
$8 = $8? $8 = $4? $8 = $8? $8 = $4?
(d) The hazard detection unit compares ID/EX.RegisterRt ($4, from the sub instruction) with
IF/ID.RegisterRs ($2) and IF/ID.RegisterRt ($3): $4 = $2? $4 = $3? Since the instruction in
EX is not a load, no stall is needed.

105 台科大資工

1. Assume a program requires the execution of 25 × 10^6 FP instructions, 110 × 10^6 INT instructions,
80 × 10^6 L/S instructions, and 16 × 10^6 branch instructions. The CPI for each type of instruction is
2, 1, 4, and 2, respectively. Assume that the processor has a 2 GHz clock rate.
(a) What are the clock cycles of the program?
(b) If we want the program to run two times faster, what should the CPI of L/S instructions be?
(c) What is the execution time of the program if the CPI of INT and FP is reduced by 40% and
the CPI of L/S and Branch is reduced by 30%?
Answer:
(a) Clock cycles = 2 × 25 × 10^6 + 1 × 110 × 10^6 + 4 × 80 × 10^6 + 2 × 16 × 10^6 = 512 × 10^6
(b) Halve the cycle count: 512 × 10^6 / 2 = 256 × 10^6
= 2 × 25 × 10^6 + 1 × 110 × 10^6 + CPI_L/S × 80 × 10^6 + 2 × 16 × 10^6
⇒ CPI_L/S = 0.8
(c) Execution time = (2 × 0.6 × 25 × 10^6 + 1 × 0.6 × 110 × 10^6 + 4 × 0.7 × 80 × 10^6 + 2 × 0.7 × 16
× 10^6) / 2 GHz = 342.4 × 10^6 / (2 × 10^9) = 0.1712 sec.
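The three answers can be cross-checked numerically (instruction counts and CPIs as given in the problem; the variable names are mine):

```python
# Cross-check of Q1: instruction counts (x10^6), per-class CPI, 2 GHz clock.
counts = {'FP': 25e6, 'INT': 110e6, 'LS': 80e6, 'BR': 16e6}
cpi    = {'FP': 2, 'INT': 1, 'LS': 4, 'BR': 2}
clock  = 2e9

cycles = sum(counts[k] * cpi[k] for k in counts)                    # (a)
# (b): halve the total cycles and solve for the L/S CPI
target = cycles / 2
cpi_ls = (target - sum(counts[k] * cpi[k] for k in ('FP', 'INT', 'BR'))) / counts['LS']
# (c): CPI of INT/FP reduced 40%, CPI of L/S and branch reduced 30%
cycles_c = (counts['FP'] * 2 * 0.6 + counts['INT'] * 1 * 0.6
            + counts['LS'] * 4 * 0.7 + counts['BR'] * 2 * 0.7)
print(cycles / 1e6, cpi_ls, round(cycles_c / clock, 4))  # 512.0 0.8 0.1712
```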

2. Assume that individual stages of the datapath have the following latencies:
IF ID EX MEM WB
250ps 360ps 150ps 300ps 200ps
Also, assume that instructions executed by the processor are broken down as follows:
alu beq lw sw
50% 20% 15% 15%
(a) What is the clock cycle time in a pipelined processor? What is the clock cycle time in a
non-pipelined processor?
(b) What is the total latency of an LW instruction in a pipelined processor? What is the total
latency of an LW instruction in a non-pipelined processor?
(c) If we can split one stage of the pipelined datapath into two new stages, each with half the
latency of the original stage, which stage should be split? What is the new clock cycle time
of the processor?
(d) Assuming there are no stalls or hazards, what is the utilization of the data memory?
(e) Assuming there are no stalls or hazards, what is the utilization of the write-register port of
the “Registers” unit?
Answer:
(a) Clock cycle time in a pipelined processor = 360 ps (the slowest stage, ID)
Clock cycle time in a non-pipelined processor = 250 + 360 + 150 + 300 + 200 = 1260 ps
(b) Total latency of LW in a pipelined processor = 360 × 5 = 1800 ps
Total latency of LW in a non-pipelined processor = 1260 ps
(c) The ID stage should be split into two new 180 ps stages
The new clock cycle time = 300 ps (now set by the MEM stage)
(d) The utilization of the data memory = 15% + 15% = 30%
(e) The utilization of the write-register port = 50% + 15% = 65%

105 台科大電子

1. What do CISC and RISC stand for? Compare and contrast CISC architecture with RISC
architecture, and then give an example for each of them.

Answer:
CISC stands for complex instruction set computer and is the name given to processors that use a
large number of complicated instructions, to try to do more work with each one.
RISC stands for reduced instruction set computer and is the generic name given to processors
that use a small number of simple instructions, to try to do less work with each instruction but
execute them much faster.

RISC: All instructions are the same size. CISC: Instructions vary in size.
RISC: Few addressing modes are supported. CISC: Many addressing modes are supported.
RISC: Only a few instruction formats. CISC: Many instruction formats.
RISC: Arithmetic instructions work only on registers. CISC: Arithmetic instructions can work on
memory operands.
RISC: Data in memory must be loaded into registers before processing. CISC: Data in memory can
be processed directly, without separate load/store instructions.

RISC example: MIPS


CISC example: IA-32

2. Let a half adder be expressed as follows, construct a full adder using this building block.
[Figure: a half-adder block with inputs A and B, and outputs S (sum) and Cout (carry)]

Answer: Truth table for the full adder:

A B Cin | S Cout
0 0 0 | 0 0
0 0 1 | 1 0
0 1 0 | 1 0
0 1 1 | 0 1
1 0 0 | 1 0
1 0 1 | 0 1
1 1 0 | 0 1
1 1 1 | 1 1

S = (A ⊕ B) ⊕ Cin
Cout = A′B·Cin + AB′·Cin + AB·Cin′ + AB·Cin = AB·(Cin′ + Cin) + (A′B + AB′)·Cin
     = AB + (A ⊕ B)·Cin

[Figure: full adder built from two half adders — the first half adder adds A and B; the second
adds (A ⊕ B) and Cin to produce S; an OR gate combines the two carry outputs to produce Cout]
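The construction can be verified exhaustively in Python — two half adders produce the partial sums and carries, and an OR gate merges the carries:

```python
# Full adder from two half adders plus an OR gate, matching the derivation:
# S = (A xor B) xor Cin, Cout = AB + (A xor B)Cin.
def half_adder(a, b):
    return a ^ b, a & b                # (sum, carry)

def full_adder(a, b, cin):
    s1, c1 = half_adder(a, b)          # first half adder: A, B
    s, c2 = half_adder(s1, cin)        # second half adder: (A xor B), Cin
    return s, c1 | c2                  # OR gate merges the two carries

# Exhaustive check against the truth table
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert s + 2 * cout == a + b + cin
print("all 8 input combinations verified")
```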

3. Explain the following terms: cache memory

Answer:
Cache memory is a small, fast memory that stores copies of the data from frequently used main
memory locations, and that the CPU can access more quickly than regular RAM.

105 成大資聯

1. Consider a MIPS processor with an additional floating point unit. Assume functional unit
delays in the processor are as follows: memory (2 ns), ALU and adders (2 ns), register file
access (1 ns), FPU add (8 ns), FPU multiply (16 ns), and the remaining units (0 ns). Also
assume instruction mix as follows: loads (31%), stores (21%), R-format instructions (27%),
branches (5%), jumps (2%), FP adds and subtracts (7%), and FP multiplies and divides (7%).
(a) What is the delay in nanoseconds to execute a load, store, R-format, branch, jump, FP
add/subtract, and FP multiply/divide instruction in a single-cycle MIPS design? (b) What is the
average delay in nanoseconds to execute a load, store, R-format, branch, jump, FP add/subtract,
and FP multiply/divide instruction in a multicycle MIPS design?
Answer:
(1) Each instruction needs 20 ns to execute: the single cycle must accommodate the slowest
instruction, FP mul/div (2 + 1 + 16 + 0 + 1 = 20 ns).

Instruction  Memory  Register  ALU/FPU  Memory  Register  Delay (ns)
load          2       1         2        2       1         8
store         2       1         2        2       0         7
R-format      2       1         2        0       1         6
branch        2       1         2        0       0         5
jump          2       0         0        0       0         2
FP add/sub    2       1         8        0       1         12
FP mul/div    2       1         16       0       1         20
(2) In a multicycle design the cycle time is 16 ns, set by the slowest functional unit (the FPU
multiplier). Average delay = (5 × 0.31 + 4 × 0.21 + 4 × 0.27 + 3 × 0.05 + 3 × 0.02 + 4 × 0.07
+ 4 × 0.07) × 16 ns = 4.24 × 16 = 67.84 ns

2. Consider a pipelined processor that executes the MIPS code shown in Figure 1 using the logic
of hazard detection and data forwarding unit shown in Figure 2. If the MIPS code cannot be
executed correctly, then how do we revise the logic shown in Figure 2 such that the code can be
correctly executed?
add $1, $1, $5
add $1, $1, $6
add $1, $1, $7
Figure 1: The MIPS code
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
then
ForwardA = 01
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
then
ForwardB = 01
Figure 2: The logic of hazard detection and data forwarding unit

Answer:
The logic should be revised as follows.
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
then
ForwardA = 01
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
then
ForwardB = 01
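To see why the added "not (EX/MEM ...)" term matters for the code in Figure 1, here is a small Python sketch (the function and field names are my own) of the EX-stage source selection with the correct priority:

```python
# When both pipeline registers hold a result for the same destination, the
# EX stage must take the newer EX/MEM value, not the older MEM/WB one.
def forward_a(ex_mem, mem_wb, id_ex_rs):
    # ex_mem / mem_wb: (reg_write, rd, value); returns the forwarded value
    if ex_mem[0] and ex_mem[1] != 0 and ex_mem[1] == id_ex_rs:
        return ex_mem[2]                        # ForwardA = 10 (newest)
    if mem_wb[0] and mem_wb[1] != 0 and mem_wb[1] == id_ex_rs:
        return mem_wb[2]                        # ForwardA = 01 (older)
    return None                                 # no forwarding needed

# While "add $1, $1, $7" is in EX: MEM/WB holds $1 from the first add,
# EX/MEM holds $1 from the second add -- the correct value to forward.
print(forward_a((True, 1, 'add2_result'), (True, 1, 'add1_result'), 1))
# add2_result
```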

3. What is the biased single precision IEEE 754 floating point format of 0.9375? What is the
purpose to bias the exponent of the floating point numbers?
Answer:
(1) 0.9375₁₀ = 0.1111₂ = 1.111₂ × 2^−1
S E F
0 01111110 11100000000000000000000
(2) Most floating-point operations must first compare exponents, e.g., to decide which operand's
significand to shift for alignment. With a biased representation, the exponent fields can be
compared directly as unsigned numbers, without separately handling the exponents' signs, so the
comparison is faster and the hardware simpler.
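The point of (2) can be demonstrated concretely: for positive floats, the raw binary32 bit patterns (with their biased exponents) compare in the same order as the values themselves. The helper below is an illustration using Python's struct module:

```python
import struct

def bits(x):
    # Raw binary32 encoding of x as an unsigned 32-bit integer
    return struct.unpack('>I', struct.pack('>f', x))[0]

# Larger positive value => larger raw bit pattern, because the biased
# exponent field orders correctly as an unsigned number.
print(bits(0.9375) < bits(1.5) < bits(344.3125))  # True
print(format(bits(0.9375), '032b'))
# 00111111011100000000000000000000 (E field = 01111110 = 126 = -1 + 127)
```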

4. Which of the following techniques can resolve control hazards?


(a) Branch prediction
(b) Stall
(c) Delayed branch
Answer:
All three techniques can resolve control hazards.
(a) Execution continues in the pipeline by predicting whether the branch will be taken.
If the prediction is wrong, the instructions that are being
fetched and decoded are discarded (flushed).
(b) Pipeline is stalled until the branch is complete. The penalty will be several clock
cycles.
(c) A delayed branch always executes the following instruction. Compilers and
assemblers try to place an instruction that does not affect the branch after the branch
in the branch delay slot.

105 成大電通
1. Design a direct memory access (DMA) controller.
(1) Show a detailed block diagram of a DMA controller, including registers used
(2) Describe the operation of the DMA controller.
Answer:
(1)
[Block diagram: the DMA controller contains a data count register, a data register, an address
register, and control logic. It connects to the system data lines and address lines, and
exchanges DMA request, DMA acknowledge, interrupt, and read/write signals with the processor
and the I/O device.]

(2) To transfer data between memory and I/O devices, the DMA controller takes over control of
the system bus from the processor, and the data transfer takes place over the bus. For this
purpose, the DMA controller must use the bus only when the processor does not need it, or it
must force the processor to suspend operation temporarily.

2. Show the design of the following memory hierarchy system for a conventional five-stage
pipeline processor. Assume both the virtual address and physical address are 32 bits. The page
size is 4KB. The L1 ITLB has 48 entries and is 3-way set associative. The L1 DTLB is
direct-mapped and has 32 entries. The L2 TLB is a unified TLB which has 128 entries using
4-way set associative design. The data cache and instruction cache are both physically
addressed; each has a cache size of 64KB, direct-mapped, line size 32 bytes.
(1) Show the L1 ITLB design and explain the translation of a virtual address to the physical
address for the ITLB.
(2) Show the L1 DTLB design integrated with the data cache.
(3) Show how to integrate the three TLBs, instruction cache, and the data cache to the pipeline.
Answer:
(1) The virtual address is first broken into a virtual page number and a page offset. The ITLB
which contains the virtual to physical address translation is indexed by the virtual page
number. If the mapping exists, the physical page number mapped from the ITLB constitutes
the upper portion of the physical address, while the page offset constitutes the lower portion.
If the mapping does not exist, the missing translation will be retrieved from the page table.
[Figure: ITLB organization — the 32-bit virtual address is split into a 16-bit tag, a 4-bit
index (48 entries / 3 ways = 16 sets), and a 12-bit page offset. The three tags of the indexed
set are compared in parallel; on a hit, the matching entry's 20-bit PPN forms the upper portion
of the physical address.]
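The field widths quoted in this answer follow mechanically from the stated parameters; a small script (illustrative bookkeeping only) recomputes them:

```python
# Field-width bookkeeping for the stated design (all sizes are powers of two).
from math import log2

page_offset = int(log2(4096))                 # 12-bit page offset
vpn = 32 - page_offset                        # 20-bit virtual page number

def tlb_fields(entries, ways):
    sets = entries // ways
    index = int(log2(sets))
    return index, vpn - index                 # (index bits, tag bits)

print(tlb_fields(48, 3))    # ITLB:  (4, 16)
print(tlb_fields(32, 1))    # DTLB:  (5, 15)
print(tlb_fields(128, 4))   # L2TLB: (5, 15)

# 64 KB direct-mapped cache with 32 B lines: offset 5, index 11, tag 16
offset = int(log2(32)); index = int(log2(65536 // 32))
print(offset, index, 32 - index - offset)     # 5 11 16
```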

(3)
[Figure: five-stage pipeline with address translation integrated — the ITLB sits in the IF
stage in front of the instruction cache, and the DTLB sits in the MEM stage in front of the
data cache; since both caches are physically addressed, every instruction fetch and data access
translates through its L1 TLB first. On an L1 ITLB or DTLB miss, the unified 4-way
set-associative L2 TLB is consulted before falling back to the page table.]
(2)
[Figure: DTLB integrated with the data cache — the 32-bit virtual address is split into a
15-bit tag, a 5-bit index (32 direct-mapped entries), and a 12-bit page offset. The DTLB
entry's 20-bit PPN is concatenated with the page offset to form the 32-bit physical address,
which the data cache splits into a 16-bit tag, an 11-bit index (64 KB / 32 B = 2048 lines),
and a 5-bit block offset.]
105 成大電機
1. Find the word or phrase from the list below that best matches the description in the following
questions. Each answer should be used only once. (a) assembler (b) bit (c) binary number (d)
cache (e) CPU (f) chip (g) compiler (h) control (i) defect (j) DRAM (k) memory (l) operating
system (m) semiconductor (n) Supercomputer (o) yield (p) die (q) loader (r) linker (s) SRAM (t)
coverage (u) procedure (v) pipeline (w) ISA.
(1) Integrated circuit commonly used to construct main memory.
(2) Location of programs when they are running, containing the data needed as well.
(3) Microscopic flaw in a wafer.
(4) Percentage of good dies from the total number of dies on the wafer.
(5) Program that translates a symbolic version of an instruction into binary version.
(6) Program that translates from a higher level notation to assembly language.
(7) Small, fast memory that acts as a buffer for the main memory.
(8) Substance that does not conduct electricity well.
(9) Base 2 number.
(10) Component of the processor that tells the datapath, memory, and I/O devices what to do
according to the instructions of the program.
Answer:
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10)
j k i o a g d m c h

2. Fill in the appropriate term or terminology for the underline fields:


(1) move $s1, $zero = add ___, ___, ___
(2) CPU execution time = Instruction count × ___ × clock cycle time.
(3) is a technique in which data blocks needed in the future are brought into the cache
early.
(4) For a 64-bit data item, if the least significant byte (B0) is stored at memory address 8N,
where N is an integer ≥ 0, this storage order is called ___ endian.
(5) For a 64-bit data item, if the least significant byte (B0) is stored at memory address
8N+3, where N is an integer ≥ 0, this storage order is called ___ endian.
Answer:
(1) (2) (3) (4) (5)
$s1, $zero, $zero CPI prefetching little big
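The two byte orders in (4) and (5) can be illustrated with Python's struct module — the hex dump shows at which offset from the 8-byte-aligned base address the least significant byte (0x08) lands:

```python
import struct

# Byte placement for a 64-bit value 0x0102030405060708 stored at address 8N
# (offsets 0..7 are relative to 8N).
x = 0x0102030405060708
little = struct.pack('<Q', x)
big    = struct.pack('>Q', x)
print(little.hex())  # 0807060504030201 -> LSB (0x08) at offset 0: little endian
print(big.hex())     # 0102030405060708 -> LSB (0x08) at offset 7: big endian
```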

Choose the correct answers for the following multiple choice problems. Each question may have
more than one answer.

3. Which of the following Statements is (are) true for stack operations?
(1) Callee is a procedure that executes a series of stored instructions based on parameters
provided by the caller and then returns control to the caller.
(2) “jal procedure_name” is the instruction that calls procedure_name and returns from the
callee.
(3) Stack is a data structure for spilling registers organized as a last-in-first-out queue.
(4) Stack pointer, sp, is the register containing the address of program being executed.
(5) Push operation can be achieved by executing a store instruction.
Answer: (3)
Note (2): jal only performs the call; the return from the callee is done by jr $ra.
Note (4): The PC, not $sp, is the register containing the address of the program being
executed; $sp holds the address of the top of the stack.
Note (5): A push requires adjusting $sp (e.g., addi $sp, $sp, −4) in addition to the store
instruction.

4. Which of the following statements is (are) true for IEEE 754 floating point representation?
(1) IEEE 754 standard defines the double precision number to be a 128-bit format.
(2) If a floating point number is shown in the form (−1)^S × (1 + F) × 2^E, where S defines the
sign of the number, F the fraction field, and E the exponent, this means the leading 1-bit of
normalized binary numbers is implicit.
(3) IEEE 754 binary representation of -0.75(10) is 10111111011000000000000000000000 for
single precision.
(4) The floating point number represented with a biased exponent actually has this value:
(−1)^S × (1 + F) × 2^(E−Bias)
(5) Since there is no way to get 0.0 from the form (−1)^S × (1 + F) × 2^E, we will not be able
to represent 0.0 in floating point format.
Answer: (2), (4)
Note (1): Double precision is a 64-bit format.
Note (3): −0.75₁₀ is 10111111010000000000000000000000.

5. Which of the following is (are) true about memory management unit?


(1) A TLB miss can be handled either in software via privileged instructions or hardware state
machine.
(2) A TLB miss invokes an operation called page walk which finds the missing entry from the
page table.
(3) The instruction causing the TLB miss is a restartable instruction once the TLB miss is
served.
(4) The faulting instruction of a page fault is not restartable.
(5) The TLB is a software cache for page table.
Answer: (1), (2), (3)
6. Which of the following is (are) true about virtual memory operations?
(1) The physical location of a page is managed by the linker.
(2) TLB is a cache for instruction and data.
(3) Page table is a data structure managed by the operating system.
(4) If a page fault occurs, this signals that the requested page is not in the main memory.
(5) A TLB miss also signals the occurrence of a page fault.
Answer: (3), (4)
Note (1): The physical location of a page is managed by the operating system, not the linker.
Note (2): The TLB is a cache for the page table, not for instructions or data.

7. Which of the following is (are) true about cache coherence?


(1) There is no cache coherence issue for using a write-through cache since the written result is
also updated in the next level of memory.
(2) There is no cache coherence issue for using a write-back cache since the written result is
not updated in the next level of memory.
(3) If the I/O data are non-cacheable, then there are no cache coherence issues for I/O
operations.
(4) The instruction cache has no cache coherence issue if self-modifying code is prohibited.
(5) To resolve cache coherence, one can use the write-through policy to enforce memory
consistency.
Answer: (4)
