The document contains a detailed examination of various computer architecture concepts, including processor design, memory hierarchy, and performance optimization techniques. It discusses specific cases, calculations, and comparisons between different architectures and configurations. Additionally, it addresses topics such as cache design, branch prediction, and the impact of register size on assembly program size.

Uploaded by 許宗祐

Contents

106 年台大資工
106 年台大電機
106 年台聯大電機
106 年清大資工
106 年交大資聯
106 年成大電機
106 年成大資聯
106 年成大電通
106 年中央資工
106 年中山電機
106 年中山資工
106 年中正電機
106 年中興電機
106 年中興資工
106 年台科大資工
106 年台科大電子
106 台師大資工
106 年彰師大電子、資工

106 年台大資工

1. For a 5-stage MIPS processor with separate L1 data/instruction caches (32KB/256KB), the
latency for each stage is listed below:

Pipeline  IF     ID     EXE   MEM    WB
Latency   150ps  100ps  50ps  200ps  100ps

The CPI with a perfect cache is 1, and both I/D caches are blocking caches. The cache miss
penalty is 200 CPU cycles. For the target benchmark suite, the I-cache miss rate is 5%, the
D-cache miss rate is 10%, and the frequency of load/store instructions is 35%. The architecture
team is considering increasing the data cache size to 512KB so that the data miss rate could be
reduced to 5% for the target benchmark, but doing so increases the MEM pipeline stage latency
to 250ps. Please justify whether increasing the data cache size is a good design decision.
Answer:
CPIold = 1 + 1 × 0.05 × 200 + 0.35 × 0.1 × 200 = 18
Instruction timeold = 18 × 200 ps = 3600 ps
CPInew = 1 + 1 × 0.05 × 200 + 0.35 × 0.05 × 200 = 14.5
Instruction timenew = 14.5 × 250 ps = 3625 ps
Since 3625 ps > 3600 ps, increasing the data cache size is not a good decision.
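The trade-off above can be checked with a short script using the problem's figures (the helper name is ours):

```python
# CPI and average instruction time for the blocking-cache pipeline:
# 200-cycle miss penalty, 5% I-cache miss rate, 35% load/store frequency.
def avg_instruction_time(d_miss_rate, cycle_ps, i_miss_rate=0.05,
                         penalty=200, ls_freq=0.35):
    cpi = 1 + i_miss_rate * penalty + ls_freq * d_miss_rate * penalty
    return cpi, cpi * cycle_ps

cpi_old, t_old = avg_instruction_time(0.10, 200)
cpi_new, t_new = avg_instruction_time(0.05, 250)
print(round(cpi_old, 2), round(t_old, 2))  # 18.0 3600.0
print(round(cpi_new, 2), round(t_new, 2))  # 14.5 3625.0 -> slower overall
```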

2. A stride-based prefetching scheme exploits regular streams of memory accesses to hide memory
latency. Which of the following code segments can get more benefit from stride-based
prefetching? Please explain why.
Code A:
int Y[100];
int i, x;
for (i = 0; i < 100; i++)
    Y[i] = Y[i] + x;
x++;
Code B:
struct node {
    int x;
    struct node *next;
};
while (node->next != 0) {
    node->x = 1;
    node = node->next;
}
Answer:
Code A benefits more from stride-based prefetching.
A stride-based prefetcher, on a miss, prefetches the address that is offset by a fixed distance
(the stride) from the missed address. In Code A, the array elements are placed sequentially in
memory, so the accesses form a regular stride that the prefetcher can predict. In Code B, the
list nodes are scattered across the heap, so consecutive accesses have no fixed stride and
stride prefetching is ineffective.

3. (Multiple choice; no partial credit) Figure 1 shows the roofline model of the target architecture
(peak memory bandwidth: 16GB/s, peak floating-point performance: 16GFLOPS). If the
arithmetic intensity of an application falls between 1/2 FLOPs/byte and 2 FLOPs/byte in different
program phases, which of the following techniques could be adopted to improve this application's
performance?
(a) Software prefetching
(b) Loop unrolling
(c) Apply SIMD (Single instruction Multiple Data)
(d) Rewrite the codes for better data locality

Figure 1
Answer: (a), (b), (c)
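The reasoning can be made concrete with the roofline formula, attainable GFLOP/s = min(peak compute, AI × peak bandwidth), using the stated peaks (the function name is ours):

```python
# Roofline model: performance is bandwidth-bound below the ridge point
# (16 GFLOPS / 16 GB/s = 1 FLOP/byte here) and compute-bound above it.
def attainable_gflops(ai, peak_gflops=16.0, peak_bw_gbs=16.0):
    return min(peak_gflops, ai * peak_bw_gbs)

print(attainable_gflops(0.5))  # 8.0  -> memory-bound phase
print(attainable_gflops(2.0))  # 16.0 -> compute-bound phase
```

Because the application crosses the ridge point, it has phases on both sides of the roofline, which is why techniques targeting both memory latency and compute throughput apply.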

4. (Multiple choice; no partial credit) For applications with high data-level parallelism, which of
the following architectural features are effective for performance improvement:
(a) SIMD (Single instruction Multiple Data)
(b) SIMT (Single instruction Multiple Threads)
(c) VLIW (Very Long instruction Word)
(d) SMT (Simultaneous Multi-Threading)
(e) Vector instruction Extension
Answer: (a), (b), (e)
5. For the MIPS instruction set (32-bit instruction/32 registers), if the register file size is increased
to 128 registers, what is the total number of bits needed for a R-type format instruction (register
instructions)? Could more registers decrease the size of a MIPS assembly program? Why?
Answer:
The total number of bits for an R-type instruction is 38:

Opcode  Rs  Rt  Rd  Shift amount  Funct. code
6       7   7   7   5             6

More registers → less register spilling → fewer load/store instructions → could reduce program
size
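The field widths can be tallied directly (a trivial check):

```python
# R-type fields with 7-bit register specifiers for 128 registers;
# opcode, shamt and funct keep their usual MIPS widths.
fields = {"opcode": 6, "rs": 7, "rt": 7, "rd": 7, "shamt": 5, "funct": 6}
print(sum(fields.values()))  # 38
```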

Benchmark        Machine 1           Machine 2           Observed CPI
                 Seconds  SPECratio  Seconds  SPECratio
400.perlbench    457      21.4       424      23.0       0.75
401.bzip2        539      17.9       622      15.5       0.85
403.gcc          338      23.9       293      27.5       1.72
429.mcf          280      32.6       204      44.7       10.00
445.gobmk        527      19.9       451      23.3       1.09
456.hmmer        270      34.6       440      21.2       0.80
458.sjeng        657      18.4       511      23.7       0.96
462.libquantum   96.5     215        190      109.0      1.61
464.h264ref      847      26.1       613      36.1       0.80
471.omnetpp      310      20.2       280      22.4       2.94
473.astar        380      18.5       447      15.7       1.79
483.xalancbmk    234      29.5       191      36.2       2.70

6. The above table gives measurements of running SPECint2006 on one Intel machine and one
AMD machine. Execution time is reported in seconds. SPECratio is the reference time, supplied
by SPEC, divided by the measured execution time.
(a) Which machine is faster? How much faster? (You must show your calculation, without
showing the calculation, your answer can only receive partial credits)
(b) Which benchmarks you would target for some hardware/software optimizations? Suggest a
few architecture and/or compiler optimizations that might speed up your selected programs.
Answer:
(a) AM for machine 1
= (457 + 539 + 338 + 280 + 527 + 270 + 657 + 96.5 + 847 + 310 + 380 + 234) / 12 = 411.29 seconds
AM for machine 2
= (424 + 622 + 293 + 204 + 451 + 440 + 511 + 190 + 613 + 280 + 447 + 191) / 12 = 388.83 seconds
Machine 2 is 411.29 / 388.83 ≈ 1.06 times faster than machine 1.
(b) 429.mcf, with the largest observed CPI (10.00), is a natural optimization target. It might be
sped up by compiler optimization techniques such as common sub-expression elimination,
unreachable-code elimination, dead-code elimination, and loop optimizations.
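The arithmetic means can be verified from the table's seconds columns:

```python
# Execution times (seconds) for machines 1 and 2, from the table above.
m1 = [457, 539, 338, 280, 527, 270, 657, 96.5, 847, 310, 380, 234]
m2 = [424, 622, 293, 204, 451, 440, 511, 190, 613, 280, 447, 191]
am1, am2 = sum(m1) / len(m1), sum(m2) / len(m2)
print(round(am1, 2), round(am2, 2))  # 411.29 388.83
print(round(am1 / am2, 2))           # 1.06
```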

7. New mobile phones are required to handle larger data sets; for example, recording 4K video and
taking high-resolution pictures often drives the need for 256GB of on-board memory. Buying
larger memory configurations or extended storage with Lightning flash drives are costly options.
Recently, some companies offer CME (Cloud Memory Extension) solutions that use WiFi and
cloud storage to extend the memory of mobile phones. The idea is to use the cloud storage as the
secondary storage and part of the on-board memory as a cache for that storage. Frequently used
videos, photos, and music can be cached, and less frequently used files can be moved to the
cloud storage over WiFi. This approach is similar to the idea of disk caches, which are often
used to speed up disk I/O. Please answer the following questions about this CME cache design.
(a) How is this CME cache different from on-chip caches in processor?
(b) What should be the block size of a CME cache line? Why?
(c) What should be the prefetch policy for such CME caches?
(d) Should the cache be fully associative, set associative or direct-mapped? Why?
(e) If set associative is the choice of (d), please suggest a replacement algorithm that works
better than LRU. Explain why it could be better.
(f) What should be used as the CME cache tag?
Answer:
(a) The CME cache is part of the on-board memory, while on-chip caches are SRAM arrays
residing on the processor chip.
(b) 4K video has two major resolutions: 3840 × 2160 and 4096 × 2160. An appropriate block
size is one frame, so the block size should be 3840 × 2160 × 4B ≈ 33.18MB or
4096 × 2160 × 4B ≈ 35.39MB.
(c) Sequential prefetching, which works well when accesses touch contiguous locations, could
be used for CME caches.
(d) Direct-mapped is a proper choice for the CME cache because multimedia data exploits
spatial locality rather than temporal locality; each item is accessed with roughly the same
probability. In addition, the hit time and cost of a direct-mapped cache are lower than those
of a fully or set-associative cache.
(e) Random replacement could work better than LRU, since each item is accessed with roughly
the same probability and random replacement is cheaper to implement in hardware.
(f) The video or picture identity (name) can be used as the CME cache tag.
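The frame-size arithmetic in (b), assuming 4 bytes per pixel and uncompressed frames:

```python
# One uncompressed frame at 4 bytes/pixel for the two common 4K shapes.
for w, h in [(3840, 2160), (4096, 2160)]:
    size = w * h * 4                      # bytes per frame
    print(f"{w}x{h}: {size / 1e6:.2f} MB")
# 3840x2160: 33.18 MB, 4096x2160: 35.39 MB
```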

8. Branch instructions can be divided into conditional and unconditional branches, or direct and
indirect branches. For example, bne $s1, $s2, 25 is a conditional and direct branch in the MIPS
architecture. So a branch instruction can be classified as one of the following four types:
(1) Conditional direct branch
(2) Conditional indirect branch
(3) Unconditional direct branch
(4) Unconditional indirect branch
(a) For each branch instruction type, please give one C construct for which the compiler would
generate that branch instruction.
(b) Which of the four types is easiest to predict? Which is most difficult to predict?
(c) What hardware structures are often used to predict type (2) and type (4) branches?
Answer:
(a)
Branch type                        C construct               MIPS instruction example
(1) Conditional direct branch      if-then-else statement    beq $s1, $s2, Label
(2) Conditional indirect branch    switch-case statement     beq $s1, $s2, $s3
(3) Unconditional direct branch    for-loop statement        j Label
(4) Unconditional indirect branch  function return           jr $ra
Note: beq $s1, $s2, $s3 is a hypothetical instruction meaning "if $s1 == $s2, go to the address
in $s3"; real MIPS has no conditional indirect branch.
(b) Easiest to predict: unconditional direct branch.
Most difficult to predict: conditional indirect branch.
(c) VPC (Virtual Program Counter) prediction, which treats an indirect branch as multiple
"virtual" conditional branches and reuses the conditional branch predictor, can be used to
predict type (2) and type (4) branches.

106 年台大電機
Multiple-choice questions
1. What are the approaches to reducing the average memory access time in a computer with
memory hierarchy?
(1) reducing the miss rate of cache
(2) using direct mapped cache
(3) increasing the associativity of cache
(4) increasing the cache size
Answer: (1)
Note: AMAT = hit time + miss rate × miss penalty
(2) a direct-mapped cache reduces hit time but increases miss rate
(3) increasing associativity reduces miss rate but increases hit time
(4) increasing cache size reduces miss rate but increases hit time

2. What are the features or facts related to a RISC (reduced instruction set computer) architecture?
(1) Instruction formats are short and of a single size
(2) The MIPS architecture is a RISC machine
(3) The main operations that affect memory are load and store instructions
(4) A larger register set comprised mostly of general-purpose registers, as compared to a CISC
(Complex Instruction Set Computer) computer
(5) Increased performance through the use of pipelining
Answer: (1), (2), (3), (4), (5)

3. A microprocessor with CMOS technology is operating at a frequency F and a voltage V. If we
reduce the operating voltage to 0.8V, by what percentage will the dynamic energy and dynamic
power be reduced?
Answer:
Dynamic energy ∝ C × V², so the new-to-old ratio is (0.8V)² / V² = 0.64; dynamic energy is
reduced by 36%.
Dynamic power ∝ C × V² × F, so with F and C unchanged the ratio is
(F × C × (0.8V)²) / (F × C × V²) = 0.64; dynamic power is also reduced by 36%.
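Numerically, with C and F held fixed:

```python
# Dynamic energy ~ C*V^2; dynamic power ~ C*V^2*F. Scaling V by 0.8
# scales both by 0.8^2 when C and F are unchanged.
ratio = 0.8 ** 2
print(round(ratio, 2), round(1 - ratio, 2))  # 0.64 0.36
```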

4. In a pipeline of a microprocessor, state the three types of instructions that would best fill the
branch delay slot and explain under what conditions they improve pipeline performance.
Answer:
There are three ways to fill the branch delay slot: (a) an instruction from before the branch,
(b) an instruction from the branch target, and (c) an instruction from the fall-through path.
(a) is the best choice and is always safe; use (b) or (c) when (a) is impossible because of a
data dependency. (b) improves performance only when the branch is taken, and it must be safe
to execute that instruction even when the branch is not taken. (c) improves performance only
when the branch is not taken, and it must be safe to execute that instruction even when the
branch is taken.
5. A multiprocessor system has four processor cores, each capable of generating 2 loads and 1 store
per clock cycle. The processor clock cycle is 2ns, while the cycle time of the SRAMs used in the
memory system is 4 ns. Calculate the minimum number of memory banks required to allow all
processors to run at full memory bandwidth.
Answer:
3 memory references per processor per clock → 3 × 4 = 12 total memory references per cycle
4 ns / 2 ns = 2 processor cycles pass for one SRAM cycle
Therefore, 2 × 12 = 24 banks are needed
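The same count as a one-liner:

```python
# banks = (references per core per cycle) x (cores) x (SRAM cycle time
# expressed in processor cycles)
refs_per_core, cores = 2 + 1, 4      # 2 loads + 1 store per cycle
sram_in_cpu_cycles = 4 // 2          # 4 ns SRAM cycle / 2 ns CPU clock
print(refs_per_core * cores * sram_in_cpu_cycles)  # 24
```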

106 年台聯大電機

1. (1) For a 32-bit address space, calculate the total number of bits required for the cache listed
below: 32 KiB direct-mapped, write-back data cache, 2 words per cache block, and 2
control bits per cache block.
(2) Given the cache size in problem (1), find the total size of the closest direct-mapped,
write-back data cache with 16-word blocks of equal or greater size. Explain why this data
cache, despite its larger size, might provide slower performance than that in (1).
(3) A generic memory hierarchy consists of a TLB and a cache. A memory reference can
encounter three different types of misses: a TLB miss, a page fault, and a cache miss.
Consider all the combinations of these three events with one or more occurrences. There are
seven possibilities. For each possibility, state whether the event can actually occur and
under what circumstances.
Answer
(1) Number of blocks in the cache = 32KiB / 8 bytes = 4K → index = 12 bits
Tag bits = 32 − 12 − 3 = 17
Total number of bits for the cache = (2 + 17 + 64) × 4K = 332 Kbit
(2) Suppose that there are 2^N blocks in the cache.
Total number of bits = (2 + (32 − N − 6) + 512) × 2^N ≥ 332 Kbit → N ≥ 10.
Pick N = 10 → total number of bits for the cache = 530 Kbit.
The larger block size increases the miss penalty, which might make this cache slower than
the one in (1).
(3)
TLB   Page table  Cache  Possible? / circumstances
hit   hit         miss   Possible, although the page table is never really checked if the TLB hits.
miss  hit         hit    TLB misses, but the entry is found in the page table; after retry, the data is found in the cache.
miss  hit         miss   TLB misses, but the entry is found in the page table; after retry, the data misses in the cache.
miss  miss        miss   TLB misses and is followed by a page fault; after retry, the data must miss in the cache.
hit   miss        miss   Impossible: cannot have a translation in the TLB if the page is not present in memory.
hit   miss        hit    Impossible: cannot have a translation in the TLB if the page is not present in memory.
miss  miss        hit    Impossible: data cannot be allowed in the cache if the page is not in memory.

2. Regarding computer arithmetic, is the following statement "True" or "False"? If True, please
give a brief explanation. If False, please give a counterexample.
(1) Associativity holds for a sequence of two's complement integer additions, even if the
computation overflows.
(2) Associativity holds for a sequence of floating-point additions.
Answer
(1) True. Two's complement addition is arithmetic modulo 2^32, which is associative regardless
of overflow: for any signed a, b, and c, (a + b) + c and a + (b + c) produce the same bit
pattern even when intermediate additions overflow.
(2) False. Suppose the following three numbers are all single precision:
x = −1.5 × 10^38, y = 1.5 × 10^38, z = 1.0
x + (y + z) = −1.5 × 10^38 + (1.5 × 10^38 + 1.0) = −1.5 × 10^38 + 1.5 × 10^38 = 0.0
(x + y) + z = (−1.5 × 10^38 + 1.5 × 10^38) + 1.0 = 0.0 + 1.0 = 1.0
∴ x + (y + z) ≠ (x + y) + z
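The counterexample behaves the same with Python's double-precision floats, since 1.0 is far below the precision of 1.5 × 10^38:

```python
x, y, z = -1.5e38, 1.5e38, 1.0
print(x + (y + z))   # 0.0 (the 1.0 is absorbed when added to y)
print((x + y) + z)   # 1.0
```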

3. When making changes to optimize part of a computer, it is often the case that speeding up one
type of instructions comes at the cost of slowing down something else.
(1) Suppose that floating-point operations take 20% of the original program’s execution time
and the new fast floating-point unit speeds up floating-point operation by, on average, 2
times. Ignoring the penalty to any other instructions, what is the overall speedup?
(2) Suppose that speeding up the floating-point unit would slow down data cache accesses,
resulting in a 1.5 times slowdown. Suppose that data cache accesses consume 10% of the
execution time. What is the overall speedup?
(3) After implementing the new floating-point unit, what percentage of execution time is spent
on floating-point operations? What percentage is spent on data cache accesses?
Answer
(1) Speedup = 1 / (0.2 / 2 + 0.8) = 1.11
(2) Speedup = 1 / (0.2 / 2 + 0.1  1.5 + 0.7) = 1.053
(3) The execution time percentage on floating-point operations = 0.1/(0.1+0.15+0.7) = 10.53%
The execution time percentage on data cache accesses = 0.15/(0.1+0.15+0.7) = 15.79%
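The three parts as straight Amdahl's-law arithmetic:

```python
s1 = 1 / (0.2 / 2 + 0.8)               # part (1): FP halved
s2 = 1 / (0.2 / 2 + 0.1 * 1.5 + 0.7)   # part (2): plus slower D-cache
print(round(s1, 2), round(s2, 3))       # 1.11 1.053
# part (3): fractions of the new 0.95 total
print(round(0.1 / 0.95 * 100, 2), round(0.15 / 0.95 * 100, 2))
# 10.53 15.79 (percent)
```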

4. A new I-type format instruction swu has been added to the MIPS instruction set. Its format is
swu rt, l(rs). It takes arguments register rt, register rs, and immediate l, and it stores the contents
of R[rt] at the memory address (R[rs] + l) and then increments R[rs] by l.
(1) Given the single-cycle datapath in the following (control signals are marked with dashed
lines), fill in the blanks in the table below for this new instruction. You must give the control
signals (0, 1, 2, X) for the MIPS instruction. Each control signal must be specified as 0, 1, 2
or X (don't care). Writing a 0 or 1 when an X is more accurate is not correct.
Opcode RegDst RegWrite ALUSrc ALUOp MemWrite MemRead MemToReg PCSrc
swu

[Figure: single-cycle MIPS datapath, with control signals PCSrc, RegWrite, RegDst, ALUSrc,
ALUOp, MemWrite, MemRead, and MemToReg marked with dashed lines]

This new I-type format instruction jin imm16(rs) is developed for the single-cycle processor,
which is a Jump Indirect instruction and will cause the processor to jump to the address stored in
the word at memory location imm16 + R[rs] (the same address computed by lw and sw).
(2) Draw the necessary modifications to implement the jin instruction on your sheet according
to the figure of the single-cycle datapath provided above.
(3) What is/are the new control signal(s) required to implement jin? Why is/are the control
signal(s) necessary?
Consider the 32-bit ALU design shown in the following. Now suppose that we wish to add
hardware support for xor (exclusive-OR), for example, xor $t0, $t1, $t2.
(4) Please clearly describe (a) the necessary changes to the ALU hardware showing the changes
to the ith ALU bit position by modifying the figure below, (b) the corresponding values of
the ALU control signals, and (c) describe briefly how the single-cycle datapath would
operate with these changes.

[Figure: 1-bit ALU slice; inputs a and b (with Binvert on b), CarryIn/CarryOut, and an
Operation multiplexor selecting among AND (0), OR (1), ADD (2), and Less (3)]

Answer
(1)
Opcode RegDst RegWrite ALUSrc ALUOp MemWrite MemRead MemToReg PCSrc
swu 2 1 1 0 1 0 0 0

(2)
[Figure: the single-cycle datapath of (1) extended for jin: the word read from data memory is
routed to an additional input of the PC-source multiplexor, selected by a new control signal
Jinc]

(3) A new control signal, jinc, is needed. When the jump-indirect instruction is executed, jinc
is set to 1 so that the jump target address read from memory can pass through the multiplexor
to the input of the PC.
(4)
(a) Add an XOR gate computing a ⊕ b as a new (fifth) input to each bit slice's result
multiplexor, and widen the Operation select to 3 bits.
(b) ALU control signal values:

Operation  Binvert  CarryIn  Op2 Op1 Op0
and        0        x        0   0   0
or         0        x        0   0   1
add        0        x        0   1   0
sub        1        1        0   1   0
slt        1        1        0   1   1
xor        0        x        1   0   0

(c) The datapath itself is unchanged; only the ALU control must be extended to generate the
new 3-bit operation code when an xor instruction is decoded.

5. Answer the following questions with respect to the MIPS program shown below. Note that this
simple program does not use the register-saving conventions followed in class. Assume that each
instruction is a native instruction and can be stored in one word. Further, assume that the data
segment starts at 0x10001000 and that the text segment starts at 0x00400000.
.data
label: .word 8, 16, 32, 64
.byte 64, 32
.text
.globl main
main: la $4, label
li $5, 16
jal func
done
func: move $2, $4
move $3, $5
add $3, $3, $2
move $9, $0
loop: lw $22, 0($2)
add $9, $9, $22
addi $2, $2, 4
slt $8, $2, $3
bne $8, $0, loop
move $2, $9
jr $31
(1) What does the program do?
(2) State the values of the labels loop and main?
(3) Please finish the hexadecimal encodings of the following instructions:

bne $8, $0, loop
Op-code (6-bit)  1st source register (5-bit)  2nd source register (5-bit)  OFFSET (16-bit)
5

add $9, $9, $22
Op-code (6-bit)  1st source register (5-bit)  2nd source register (5-bit)  Destination register (5-bit)  Shift amount (5-bit)  Function code (6-bit)
0

(4) Identify the local and global labels in the program?


Answer
(1) The program sums the word array elements stored starting at address label.
(2)
Label  Value
loop   0x00400020
main   0x00400000
(3)
bne $8, $0, loop
Op-code (6-bit)  1st source register (5-bit)  2nd source register (5-bit)  OFFSET (16-bit)
0x05             0x08                         0x00                         0xfffb

add $9, $9, $22
Op-code (6-bit)  1st source register (5-bit)  2nd source register (5-bit)  Destination register (5-bit)  Shift amount (5-bit)  Function code (6-bit)
0x00             0x09                         0x16                         0x09                          0x00                  0x20
(4) main is a global label; all other labels are local.
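The two encodings can be assembled field by field (the helper names are ours); the offset −5 counts instructions from the slot after bne back to loop:

```python
def i_type(op, rs, rt, imm):
    # 16-bit immediate stored in two's complement
    return op << 26 | rs << 21 | rt << 16 | imm & 0xFFFF

def r_type(rs, rt, rd, shamt, funct):
    return rs << 21 | rt << 16 | rd << 11 | shamt << 6 | funct

print(hex(i_type(0x05, 8, 0, -5)))     # 0x1500fffb  bne $8, $0, loop
print(hex(r_type(9, 22, 9, 0, 0x20)))  # 0x1364820   add $9, $9, $22
```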

6. Given a 5-stage pipelined MIPS ISA design, the individual pipeline stages are named as IF, ID,
EX, MEM, and WB, respectively. The latencies of five stages are given as 350 ps, 250 ps, 280
ps, 400 ps, 280 ps, respectively.
(1) What is the maximum operating frequency for this processor?
(2) The following fragment of MIPS codes is being executed. If there is no forwarding or
hazard detection, please insert nops and rewrite the assembly to ensure correct execution.
add $t4, $s2, $t1
or $t2, $t1, $t2
lw $s4, 20($t4)
sw $t1, 16($t4)
sub $s3, $s4, $t2
(3) How many clock cycles are required to complete the execution of these instructions?
Answer
(1) The maximum operating frequency = 1 / 400 ps = 2.5 GHz
(2) add $t4, $s2, $t1
or $t2, $t1, $t2
nop
lw $s4, 20($t4)
sw $t1, 16($t4)
nop
sub $s3, $s4, $t2
(With register writes in the first half of WB and reads in the second half of ID, one nop
before lw covers its $t4 dependence on add, and one nop before sub covers its $s4
dependence on lw.)
(3) (5 − 1) + 7 = 11 clock cycles are required to complete the execution of these instructions.
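The clock-rate and cycle-count arithmetic:

```python
# Part (1): the slowest stage (MEM, 400 ps) sets the cycle time.
cycle_ps = max([350, 250, 280, 400, 280])
print(1000 / cycle_ps, "GHz")   # 2.5 GHz
# Part (3): 4 cycles to fill the 5-stage pipeline, then one of the
# 7 instructions (5 original + 2 nops) completes per cycle.
print((5 - 1) + 7, "cycles")    # 11 cycles
```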

7. Assume that a pipelined processor has 8 pipeline stages as shown in the following. Due to
branch prediction, the IF stage needs two clock cycles, called IF1 and IF2. The EXE stage needs
three clock cycles to support multiplication and division; addition, subtraction, and logic
operations still complete in one clock cycle. The three EXE stages are named EXE1, EXE2, and
EXE3. The latencies of the 8 stages are 150 ps, 150 ps, 100 ps, 200 ps, 200 ps, 180 ps, 200 ps,
and 100 ps, respectively. The pipeline registers are named PP1 to PP7 sequentially.

[Figure: 8-stage pipeline IF1, IF2 (instruction fetch and branch prediction), ID (decode and
register read), EXE1, EXE2, EXE3 (mul/div; add/sub/logic finish in EXE1), MEM, WB, with
pipeline registers PP1–PP7 between successive stages]

Consider the following MIPS code.
1 lw $s2, 0($t4)
2 add $t0, $s2, $t1
3 mul $t2, $s3, $s5
4 sub $t0, $t0, $t2
5 sw $t0, -4($t4)

(1) If there is no forwarding or hazard detection, how many nops instructions should be
inserted between the first instruction (lw) and the second instruction (add)? Please also give
some simple explanation.
(2) If a forwarding path from the memory output result in pipeline registers PP7 to the input of
ALU at EXE1 stage is added to reduce the delay, how many nops instructions should be
inserted between the first instruction (lw) and the second instruction (add)? Please also give
some simple explanation.
(3) How to add the forwarding path to solve the data hazard with the minimum delay between
instruction 3 (mul) and instruction 4 (sub)? Please also give some simple explanation.
(4) How to add the forwarding path to Solve the data hazard with the minimum delay between
instruction 4 (sub) and instruction 5 (sw)? Please also give some simple explanation.
Answer
(1) 4 nops. The lw instruction writes the register file in the first half of its WB cycle
(stage 8), and the add instruction reads the register file in the second half of its ID cycle
(stage 3). With this 8-stage pipeline, 4 nops are needed to ensure correct execution.
(2) 3 nops. The lw instruction forwards the memory data from PP7 at the WB stage, and the add
instruction must then be entering its EXE1 stage to receive the correct value for the
addition, so 3 nops are needed.
(3) A forwarding path should be added from the MEM stage (PP6) to the EXE1 stage, since the
mul instruction finishes its multiplication at the MEM stage at the earliest.
(4) A forwarding path should be added from the WB stage (PP7) to the MEM stage, since the sw
instruction writes its data into memory at the MEM stage.
8. Assume an 8-core computer system can process database queries at a steady-state rate of
requests per second, and each transaction takes a fixed average amount of time to process. Some
results regarding the transaction latency and processing rate are given in the following table.
How many requests are being processed at any given instant per core for the following 2
cases? Please write the values for (1) and (2), respectively.

Case  Average Transaction Latency  Maximum Transaction Processing Rate  No. of Requests per Core
1     1 ms                         10,000/sec                           (1)
2     2 ms                         24,000/sec                           (2)

Answer
By Little's law, requests in flight = throughput × latency:
(1) (10,000 / 8) × 1 × 10⁻³ = 1.25
(2) (24,000 / 8) × 2 × 10⁻³ = 6
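Both cases apply Little's law per core (the helper name is ours):

```python
def per_core_concurrency(rate_per_s, latency_s, cores=8):
    # Little's law: N = throughput x latency, with the system
    # throughput split evenly over the cores.
    return rate_per_s / cores * latency_s

print(per_core_concurrency(10_000, 1e-3))  # 1.25
print(per_core_concurrency(24_000, 2e-3))  # 6.0
```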

106 年清大資工

1. What is the decimal value of "1111 0011", a one-byte 2's complement binary number?
Answer: -13
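A one-byte two's complement value can be decoded by subtracting 256 when the sign bit is set:

```python
bits = 0b1111_0011            # 0xF3 = 243 unsigned
value = bits - 256 if bits & 0x80 else bits
print(value)  # -13
```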

2. For a color display using 8 bits for each of the primary colors (red, green, blue) per pixel and
with a resolution of 1280 × 800 pixels, what should be the size (in bytes) of the frame buffer to
store a frame?
Answer: 1280 × 800 × 3 = 3,072,000 bytes

3. Determine the values of the labels ELSE and DONE in the following segment of instructions.
Assume that the first instruction is loaded into memory location F0008000hex.

slt $t2, $t0, $t0
bne $t2, $zero, ELSE
j DONE
ELSE: addi $t2, $t2, 2
DONE: … …

Answer:
Label  Value
ELSE   F000800Chex
DONE   F0008010hex

4. Translate the following C code into assembly code:

f = A[B[2]] – 4;

Use the following register assignment: f in $s1, the base address of A in $s2, and that of B in
$s3.
Answer:
lw $t0, 8($s3)
sll $t0, $t0, 2
add $t0, $t0, $s2
lw $s1, 0($t0)
addi $s1, $s1, -4

5. You are using a tool that transforms machine code written for the MIPS ISA into code for a
VLIW ISA. The VLIW ISA is identical to MIPS except that multiple instructions can be grouped
together into one VLIW instruction. Up to N MIPS instructions can be grouped together (N is the
machine width, which depends on the particular machine). The transformation tool can reorder
instructions to fill VLIW instructions, as long as loads and stores are not reordered relative to
each other (however, independent loads and stores can be placed in the same VLIW instruction).
You give the tool the following MIPS program (we have numbered the instructions for reference
below):
(01) lw $t0 ← 0($a0)
(02) lw $t2 ← 8($a0)
(03) lw $t1 ← 4($a0)
(04) add $t6 ← $t0, $t1
(05) lw $t3 ← 12($a0)
(06) sub $t7 ← $t1, $t2
(07) lw $t4 ← 16($a0)
(08) lw $t5 ← 20($a0)
(09) srlv $s2 ← $t6, $t7
(10) sub $s1 ← $t4, $t5
(11) add $s0 ← $t3, $t4
(12) sllv $s4 ← $t7, $s1
(13) srlv $s3 ← $t6, $s0
(14) sllv $s5 ← $s0, $s1
(15) add $s6 ← $s3, $s4
(16) add $s7 ← $s4, $s6
(17) srlv $t0 ← $s6, $s7
(18) srlv $t1 ← $t0, $s7
(a) Draw the dataflow graph of the program. Represent instructions as numbered nodes
(01~18), and flow dependences as directed edges (arrows).
(b) When you run the tool with its settings targeted for a particular VLIW machine, you find
that the resulting VLIW code has 9 VLIW instructions. What minimum value of N must the
target VLIW machine have?
(c) Based on the code above and the minimum value of N in (b), write down the MIPS
instruction numbers corresponding to each VLIW instruction in the table below. If there is
more than one MIPS instruction that could be placed into a VLIW instruction, please
choose the instruction that comes earliest in the original MIPS program.
MIPS MIPS MIPS MIPS MIPS MIPS MIPS
Instr. No. Instr. No. Instr. No. Instr. No. Instr. No. Instr. No. Instr. No.
VLIW Instruction 1:
VLIW Instruction 2:
VLIW Instruction 3:
VLIW Instruction 4:
VLIW Instruction 5:
VLIW Instruction 6:

VLIW Instruction 7:
VLIW Instruction 8:
VLIW Instruction 9:
(d) You find that the code is still not fast enough when it runs on the VLIW machine, so you
contact the VLIW machine vendor to buy a machine with a larger machine width N. What
minimum value of N would yield the maximum possible performance (i.e., the fewest
VLIW instructions), assuming that all MIPS instructions (and thus VLIW instructions)
complete with the same fixed latency and assuming no cache misses?
(e) Write the MIPS instruction numbers corresponding to each VLIW instruction, for this
optimal value of N. Again, as in part (c) above, pack instructions such that when more than
one instruction can be placed in a given VLIW instruction, the instruction that comes first
in the original code is chosen.
Answer:
(a) Flow dependences (producer → consumer):
01 → 04, 03 → 04; 02 → 06, 03 → 06; 04 → 09, 06 → 09; 07 → 10, 08 → 10;
05 → 11, 07 → 11; 06 → 12, 10 → 12; 04 → 13, 11 → 13; 10 → 14, 11 → 14;
12 → 15, 13 → 15; 12 → 16, 15 → 16; 15 → 17, 16 → 17; 16 → 18, 17 → 18

(b) The minimum value of N is 3


(c)
VLIW Instruction 1: 01, 02, 03
VLIW Instruction 2: 05, 07, 08
VLIW Instruction 3: 04, 06, 11
VLIW Instruction 4: 10, 09, 13
VLIW Instruction 5: 12, 14
VLIW Instruction 6: 15
VLIW Instruction 7: 16
VLIW Instruction 8: 17
VLIW Instruction 9: 18
(d) The minimum value of N that yields the maximum possible performance is 6.
(e)
VLIW Instruction 1: 01 02 03 05 07 08
VLIW Instruction 2: 04 06 11 10
VLIW Instruction 3: 09 13 12 14
VLIW Instruction 4: 15
VLIW Instruction 5: 16
VLIW Instruction 6: 17
VLIW Instruction 7: 18
6. Fine-Grained Multithreading (FGMT): consider a design "Machine Ⅰ" with five pipeline
stages: fetch, decode, execute, memory, and write back. Each stage takes 1 cycle. The
instruction and data caches always hit (i.e., there is never a stall for a cache miss). Branch directions and
targets are resolved in the execute stage. The pipeline stalls when a branch is fetched, until the
branch is resolved. Dependency check logic is implemented in the decode stage to detect flow
dependences. The pipeline does not have any forwarding paths, so it must stall on detection of a
flow dependence. In order to avoid these stalls, we will consider modifying MachineⅠ to use
fine-grained multithreading.
[Figure: the FGMT pipeline datapath. Each thread has its own PC and register file, selected
by a Thread ID; the instruction cache, ALU, and data cache are shared across threads.]
(a) In the five-stage pipeline of Machine I shown above, the machine's designer first focuses
on the branch stalls, and decides to use multithreading to keep the pipeline busy no matter
how many branch stalls occur. What is the minimum number of threads required to
achieve this? Why?
(b) The machine's designer now decides to eliminate dependency-check logic and remove the
need for flow-dependence stalls (while still avoiding branch stalls). How many threads are
needed to ensure that no flow dependence ever occurs in the pipeline?
A rival designer is impressed by the throughput improvements and the reduced complexity
that FGMT brought to Machine I. This designer decides to implement FGMT on another
machine, Machine II. Machine II is a pipelined machine with the following stages.
Fetch 1 stage
Decode 1 stage
Execute 8 stages (branch direction/target are resolved in the first execute stage)
Memory 2 stages
Writeback 1 stage
Assume everything else in Machine II is the same as in Machine I.
(c) Is the number of threads required to eliminate branch-related stalls in Machine II the same
as in Machine I? If YES, why? If NO, how many threads are required?
(d) Now consider flow-dependence stalls. Does Machine II require the same minimum number
of threads as Machine I to avoid the need for flow-dependence stalls? If YES, why? If NO,
how many threads are required?
Answer:
(a) 3 threads are required, since branch directions and targets are resolved in the third
(execute) stage: while one thread's branch is in execute, instructions of two other threads
occupy the fetch and decode stages, so the pipeline never fetches a bubble.
(b) Writeback is 3 stages after decode, so if the register file is written in the first half of a
cycle and read in the second half, instructions of the same thread issued 3 cycles apart can
never have a flow dependence in flight; 3 threads achieve this. The same round-robin
schedule of 3 threads also removes the branch stalls (the two requirements overlap rather
than add), so 3 threads suffice in total.
(c) Yes, since the branch directions and targets are still resolved in the third stage (the first
execute stage), 3 threads still suffice.
(d) No. In Machine II, writeback is the 11th stage after decode (8 execute stages and 2
memory stages lie in between); hence 11 threads are required to remove the
flow-dependence stalls.
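The thread-count arithmetic above can be sanity-checked with a short calculation. This is a sketch under the same assumptions as the answer (round-robin issue, branch resolved at the end of the named execute stage, register file written in the first half-cycle and read in the second); the helper names are hypothetical.

```python
# Thread counts for stall-free round-robin FGMT: with N threads, consecutive
# instructions of one thread are N cycles apart, so N must cover the distance
# from fetch to branch resolution (branch stalls) and from decode to
# writeback (flow dependences).

def min_threads_branch(fetch_stage, resolve_stage):
    # The next fetch of the same thread must come after the branch resolves.
    return resolve_stage - fetch_stage + 1

def min_threads_flow(decode_stage, writeback_stage):
    # With write-then-read register files, decode may coincide with writeback.
    return writeback_stage - decode_stage

# Machine I: F=1, D=2, E=3, M=4, W=5
print(min_threads_branch(1, 3))                                # (a) 3 threads
print(max(min_threads_branch(1, 3), min_threads_flow(2, 5)))   # (b) 3 threads

# Machine II: F=1, D=2, E1..E8=3..10, M1..M2=11..12, W=13 (resolved in E1=3)
print(min_threads_branch(1, 3))   # (c) still 3 threads
print(min_threads_flow(2, 13))    # (d) 11 threads
```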

7. Assume you developed the next greatest memory technology, MagicRAM. A MagicRAM cell
is non-volatile. The access latency of a MagicRAM cell is 2 times that of an SRAM cell but the
same as that of a DRAM cell. The read/write energy of MagicRAM is similar to the read/write
energy of DRAM. The cost of MagicRAM is similar to that of DRAM. MagicRAM has higher
density than DRAM. MagicRAM has one shortcoming, however: a MagicRAM cell stops
functioning after 2000 writes are performed to the cell.
(a) Is there an advantage of MagicRAM over DRAM? Why?
(b) Is there an advantage of MagicRAM over SRAM?
(c) Assume you have a system that has a 32KB L1 cache made of SRAM, an 8MB L2 cache
made of SRAM, and 2GB main memory made of DRAM, as shown in Fig 1.
Processor core → L1 cache (32KB) → L2 cache (8MB) → Main memory (2GB)
Fig. 1. Memory hierarchy of the system

Assume you have complete design freedom and add structures to overcome the shortcoming of
MagicRAM. You will be able to propose a way to reduce/overcome the shortcoming of MagicRAM
(note that you can design the hierarchy in any way you like, but cannot change MagicRAM itself).
Does it make sense to add MagicRAM somewhere in this memory hierarchy, given that you can
potentially reduce its shortcoming? If so, where would you place MagicRAM? If not, why not?
Explain below clearly and methodically. Depict in a figure clearly and describe why you made this
choice.
(d) Propose a way to reduce/overcome the shortcoming of MagicRAM by modifying the given
memory hierarchy. Be clear in your explanations and illustrate with drawings to aid
understanding.
Answer:
(a) Yes. MagicRAM has higher density than DRAM at similar cost, so the same capacity
takes less area; and because it is non-volatile, it needs no refresh, lowering power
consumption.
(b) Yes. A MagicRAM cell is non-volatile, so if the system crashes or loses power, the data
is retained and the system can be recovered, which matters for safety-critical applications;
MagicRAM is also denser and cheaper than SRAM.
(c)(d) MagicRAM can be placed in the hierarchy as a backup of the main memory, as shown
in the following figure. Used only to back up main memory (e.g., periodic checkpoints of
dirty data), MagicRAM is written infrequently, which mitigates its 2000-write endurance
limit.

Processor core → L1 cache (32KB) → L2 cache (8MB) → Main memory (2GB), with
MagicRAM attached to main memory as its backup
8. Consider the following three processors (X, Y, and Z) that are all of varying areas. Assume that
the single-thread performance of a core increases with the square root of its area.

Processor X: core area = A    Processor Y: core area = 4A    Processor Z: core area = 16A

(a) You are given a workload where S fraction of its work is serial and the 1-S fraction of its
work is infinitely parallelizable. If executed on a die composed of 16 Processor X's, what
value of S would give a speedup of 4 over the performance of the workload on just one
Processor X?
(b) Given a homogeneous die of area 16A, which of the three processors would you use on
your die to achieve maximal speedup? What is that speedup over just a single Processor X?
Assume the same workload in part (a).
(c) Now you are given a heterogeneous processor of area 16A to run the above workload. The
die consists of 1 Processor Y and 12 Processor X’s. When running the workload, all
sequential parts of the program will be run on the larger core while all parallel parts of the
program run exclusively on the smaller cores. What is the overall speedup achieved over a
single Processor X?
(d) One of the programmers decides to optimize the given workload so that it has 10% of its
work serial sections and 90 % of its work in parallel sections. Which configuration would
you use to run the workload if given the choices between the processors from part (a), part
(b), and part (c)? Please write down the speedups for the three configurations.
(e) Typically, for a realistic workload, the parallel fraction is not infinitely parallelizable. What
are the three fundamental reasons?
Answer:
(a) Speedup = 1 / (S + (1 - S)/16) = 4, so 15S = 3 and S = 0.2
(b) A die of area 16A holds 4 Processor Y's (each with single-thread performance √4 = 2) or
1 Processor Z (performance √16 = 4).
Speedup of processor Y = 1 / (0.2/2 + 0.8/(4 × 2)) = 5. Speedup of processor Z =
1 / (0.2/4 + 0.8/4) = 4.
So, we would use processor Y and get a speedup of 5.
(c) The serial part runs on Processor Y (performance 2) and the parallel part on the 12
Processor X's, so the overall speedup = 1 / (0.2/2 + 0.8/12) = 6
(d) part (a) speedup = 1 / (0.1 + 0.9/16) = 6.4
part (b) speedup = 1 / (0.1/2 + 0.9/8) = 6.15
part (c) speedup = 1 / (0.1/2 + 0.9/12) = 8
You would use the processor from part (c) because it gives you the maximum speedup.
(e) 1. Synchronization.
2. Load imbalance.
3. Resource contention.
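The speedup expressions can be checked numerically. This is a sketch; `speedup` is a hypothetical helper implementing the heterogeneous Amdahl's-law expression used above (serial fraction at one core's performance, parallel fraction at the aggregate rate).

```python
from math import sqrt

def speedup(serial, serial_perf, parallel_perf):
    # Speedup over one Processor X (performance 1): the serial fraction runs
    # at serial_perf, the parallel fraction at aggregate rate parallel_perf.
    return 1.0 / (serial / serial_perf + (1.0 - serial) / parallel_perf)

# Single-thread performance scales with sqrt(core area): X = 1, Y = 2, Z = 4.
perf_x, perf_y, perf_z = sqrt(1), sqrt(4), sqrt(16)

print(round(speedup(0.2, perf_x, 16 * perf_x), 2))  # (a) 16 X's     -> 4.0
print(round(speedup(0.2, perf_y, 4 * perf_y), 2))   # (b) 4 Y's      -> 5.0
print(round(speedup(0.2, perf_z, 1 * perf_z), 2))   # (b) 1 Z        -> 4.0
print(round(speedup(0.2, perf_y, 12 * perf_x), 2))  # (c) 1 Y + 12 X -> 6.0
print(round(speedup(0.1, perf_x, 16 * perf_x), 2))  # (d) part (a)   -> 6.4
print(round(speedup(0.1, perf_y, 4 * perf_y), 2))   # (d) part (b)   -> 6.15
print(round(speedup(0.1, perf_y, 12 * perf_x), 2))  # (d) part (c)   -> 8.0
```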

106 NCTU CS Joint Exam

1. Which of the following objects have identical binary representation no matter whether the
machine is little endian or big endian?
(a) 2's complement number -1
(b) int i = 0xABBAABBA
(c) A single precision number -0.0 (in IEEE754 encoding format)
(d) A C null pointer
Answer: (a), (d)
Note (d): In C, NULL can be expressed as the integer value 0; its all-zero byte pattern reads
the same in either byte order (as does the all-ones pattern of -1 in (a)).

2. For the pipelined implementation of the MIPS processor, which statements below are NOT
correct?
(a) All pipeline registers have the same length.
(b) The pipeline clock cycle time is the average of all stage latencies.
(c) Exceptions in a pipeline are handled like mis-predicted branches.
(d) In the pipelined data path, separate instruction and data memories are used to reduce data
hazards.
Answer: (a), (b), (d)
Note (d): Separate instruction and data memories are used to reduce structural hazards, not
data hazards.

3. Assume the register numbers of $s2 and $zero are 17 and 0, respectively. Given the MIPS code
sequence below, if we assume it starts at location 80004000h in memory, which of the following
statements are correct?
80004000h add $t0, $zero, $zero
loop: beq $s2, $zero, finish
add $t0, $t0, $s1
sub $s2, $s2, 1
j loop
finish: addi $t0, $t0,100
add $v0, $t0, $zero
(a) The MIPS machine codes of the beq (OP code is 4) in this code sequence is 12200003h.
(b) The MIPS machine code of the j (OP code is 2) in this code sequence is 08001000h.
(c) Assume both $s1 and $s2 initially contain integers 5 and 6, respectively. The value in $v0 is
30 after the execution of the whole code sequence.
(d) With the same assumption as (c) that both $s1 and $s2 initially contain integers 5 and 6,
respectively. The beq instruction will be executed 6 times for this code sequence.
Answer: (a)
Note (a): The register numbers of $s2 and $zero are 17 and 0, so the fields are
OP = 000100, rs = 10001, rt = 00000, offset = 0000000000000011,
giving 0001 0010 0010 0000 0000 0000 0000 0011 = 12200003hex.
Note (b): OP = 000010 and the 26-bit address field is 00000000000001000000000001
(the word address of loop = 80004004hex >> 2, lower 26 bits), giving 08001001hex, not
08001000hex.
Note (c): $v0 = 5 × 6 + 100 = 130, not 30.
Note (d): beq is executed 7 times, not 6 (six not-taken iterations while $s2 goes from 6 down
to 1, plus the final taken execution that exits the loop).
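The two machine-code checks in notes (a) and (b) can be reproduced by assembling the instruction fields directly. This is a sketch with hypothetical helper names, using the standard MIPS I-type and J-type layouts.

```python
# I-type: op(6) rs(5) rt(5) imm(16); J-type: op(6) target(26), where the
# 26-bit field holds bits 27..2 of the target byte address.

def encode_i(op, rs, rt, imm):
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def encode_j(op, target_addr):
    return (op << 26) | ((target_addr >> 2) & 0x3FFFFFF)

# beq $s2, $zero, finish: op=4, rs=17, rt=0, offset=+3 instructions
print(hex(encode_i(4, 17, 0, 3)))      # 0x12200003
# j loop: op=2, loop is at 0x80004004
print(hex(encode_j(2, 0x80004004)))    # 0x8001001
```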

4. Consider the following sequence of actual outcomes for a branch. T means the branch is taken.
N means not taken. Assume both predictors are initialized to predict taken. Which of following
statements are true?
Branch: T-N-T-N-N-T-N
(a) If 1-bit predictor is used, the predictions for this branch will be T-T-N-T-N-N-T.
(b) If 2-bit predictor is used, the predictions for this branch will be T-T-T-T-T-N-N.
(c) If the same pattern (i.e., T-N-T-N-N-T-N) are repeated thousands of times, the prediction
accuracy rate of 1-bit predictor is about 2/7.
(d) If the same pattern are repeated thousands of times, the prediction accuracy rate of 2-bit
predictor is about 4/7.
Answer: (a), (d)
Note (a): 1-bit predictor; the state shown is the one used to predict each branch:
Outcome:         T    N    T    N    N    T    N
State (predict): 1(T) 1(T) 0(N) 1(T) 0(N) 0(N) 1(T)
Note (b): 2-bit saturating counter, initialized to 3 (strongly taken):
Outcome:         T    N    T    N    N    T    N
State (predict): 3(T) 3(T) 2(T) 3(T) 2(T) 1(N) 2(T)
The predictions are T-T-T-T-T-N-T, not T-T-T-T-T-N-N.
Note (c): state used for each branch, round by round (each round starts in the final state of
the previous round):
Outcome:  T N T N N T N
Round 1:  1 1 0 1 0 0 1
Round 2:  0 1 0 1 0 0 1
Round 3:  0 1 0 1 0 0 1
Correct?  ✗ ✗ ✗ ✗ ✓ ✗ ✗
In steady state only 1 of the 7 branches is predicted correctly, so the accuracy is about 1/7,
not 2/7.
Note (d):
Outcome:  T N T N N T N
Round 1:  3 3 2 3 2 1 2
Round 2:  1 2 1 2 1 0 1
Round 3:  0 1 0 1 0 0 1
Round 4:  0 1 0 1 0 0 1
Correct?  ✗ ✓ ✗ ✓ ✓ ✗ ✓
In steady state 4 of the 7 branches are predicted correctly, so the accuracy is about 4/7.
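The steady-state accuracies can be confirmed by replaying the repeated pattern through saturating-counter predictors. This is a sketch; `simulate` is a hypothetical helper (predict taken when the counter is in its upper half).

```python
def simulate(outcomes, bits, init_state):
    state, max_state = init_state, (1 << bits) - 1
    correct = 0
    for taken in outcomes:
        predict_taken = state >= (max_state + 1) // 2
        correct += (predict_taken == taken)
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return correct

pattern = [True, False, True, False, False, True, False]   # T-N-T-N-N-T-N
rounds = 1000
print(simulate(pattern * rounds, 1, 1) / (7 * rounds))  # steady state ~ 1/7
print(simulate(pattern * rounds, 2, 3) / (7 * rounds))  # steady state ~ 4/7
```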

5. Given the operation times for the major functional units are: 200ps for memory access, 100ps for
ALU operation; and 50ps for register file read or write. Assuming that all the other delays (like
control unit, multiplexer, pipeline overheads, etc) are negligible. Assume only R-type, lw, sw are
supported. Which of the following statements are correct?
(a) For a single-cycle CPU where the instruction with the longest latency determines the clock
cycle time, the clock cycle time is 350ps.
(b) For a classic MIPS CPU with a 5-stage pipeline, the clock cycle time can be 200ps.
(c) It takes 3000ps to execute the code sequence below using the single-cycle CPU.
(d) It takes 1000ps to execute the code sequence below using the pipelined MIPS CPU
lw $t1, 0($s1)
sw $s1, 0($s2)
add $t2, $s2, $s3
add $t3, $s1, $s2
lw $t1,0($t2)
Answer: (b), (c)
Note (a): The single-cycle clock period is set by the slowest instruction, lw: 200 + 50 + 100 +
200 + 50 = 600 ps, not 350 ps. (Hence (c): 5 instructions × 600 ps = 3000 ps.)
Note (d): (assuming forwarding) Execution time = [(5 - 1) + 5] × 200 ps = 1800 ps, not
1000 ps.

6. Which of the following statements are correct?
(a) The number of pipeline stages affects latency, not throughput; thus pipelining improves the
performance of a processor by decreasing the latency of a job (i.e., an instruction) to be
done.
(b) For a given program, its average cycles per instruction (CPI) is affected not only by the
instruction set architecture, but also by the compiler used.
(c) By changing the clock frequency of a processor from 1.5 GHz to 2 GHz, and also changing
its supply voltage from 1 Volt to 1.25 Volt, the overall power consumption of this processor
will theoretically increase by more than 2X.
(d) According to Amdahl's law, it is theoretically possible to achieve a 6X speedup on executing
a program which is 87% parallelizable, by using a 16-core multiprocessor.
Answer: (b), (c)
Note (c): Power ∝ f × V², so consumption increases by (2 × 1.25²) / (1.5 × 1²) ≈ 2.08 > 2.
Note (d): Max. speedup = 1 / (0.87/16 + 0.13) ≈ 5.42 < 6.
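Both notes are direct plug-ins; a small check (hypothetical helper names):

```python
# Dynamic power scales with f * V^2; Amdahl's law caps the speedup of the
# 87%-parallelizable program on 16 cores.

def power_ratio(f_new, v_new, f_old, v_old):
    return (f_new * v_new ** 2) / (f_old * v_old ** 2)

def amdahl(parallel_fraction, cores):
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

print(round(power_ratio(2.0, 1.25, 1.5, 1.0), 2))   # 2.08 -> more than 2X
print(round(amdahl(0.87, 16), 2))                   # 5.42 -> less than 6X
```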
7. Consider a 1 KB, 4-way set associative cache (initially empty) with block size of 64 bytes. The
main memory consists of 256 blocks and the request for memory blocks is in the following
order: 0, 255, 1, 4, 3, 8, 142, 133, 159, 216, 113, 129, 63, 8, 17, 48, 32, 73, 92, 155. Which
one(s) of the following memory blocks will NOT be in the cache if LRU replacement policy is
used?
(a) 3 (b) 8 (c) 133 (d) 216
Answer: (c), (d)
Note: Number of blocks in the cache = 1KB / 64B = 16; number of sets = 16 / 4 = 4, so the
index is the block address mod 4.
Block address Tag Index Hit/Miss
0 0 0 Miss
255 63 3 Miss
1 0 1 Miss
4 1 0 Miss
3 0 3 Miss
8 2 0 Miss
142 35 2 Miss
133 33 1 Miss
159 39 3 Miss
216 54 0 Miss
113 28 1 Miss
129 32 1 Miss
63 15 3 Miss
8 2 0 Hit
17 4 1 Miss
48 12 0 Miss
32 8 0 Miss
73 18 1 Miss
92 23 0 Miss
155 38 3 Miss
Final contents (X→Y means block X was later evicted by block Y under LRU):
Set 0: 0→48, 4→32, 8, 216→92 (final: 48, 32, 8, 92)
Set 1: 1→17, 133→73, 113, 129 (final: 17, 73, 113, 129)
Set 2: 142
Set 3: 255→155, 3, 159, 63 (final: 155, 3, 159, 63)
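The trace above can be replayed with a short LRU simulation; this is a sketch with a hypothetical `run_trace` helper (1KB / 64B = 16 blocks in 4 sets, index = block mod 4).

```python
def run_trace(trace, num_sets=4, ways=4):
    sets = [[] for _ in range(num_sets)]    # each list is ordered LRU -> MRU
    hits = []
    for block in trace:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)                 # hit: promote to MRU
            hits.append(block)
        elif len(s) == ways:
            s.pop(0)                        # miss in a full set: evict LRU
        s.append(block)
    return sets, hits

trace = [0, 255, 1, 4, 3, 8, 142, 133, 159, 216, 113, 129,
         63, 8, 17, 48, 32, 73, 92, 155]
sets, hits = run_trace(trace)
cached = {b for s in sets for b in s}
print(hits)                          # [8] is the only hit
print(sorted(set(trace) - cached))   # [0, 1, 4, 133, 216, 255] were evicted
```

Among the answer choices, only 133 and 216 are absent from the final contents.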
8. Which of the following statements (about virtual memory and page table) are correct?
(a) For virtual memory, write-back is more practical than write-through.
(b) Given a 32-bit virtual address space with 4KB per page and 4 bytes per page table entry, the
total page table size is 4MB.
(c) It is possible to miss in cache and page table, but hit in translation look-aside buffer (TLB).
(d) It is possible to miss in TLB, but hit in cache and page table.
Answer: (a), (b), (d)
Note (b): Number of page table entries = 4GB / 4KB = 1M; page table size = 1M × 4 bytes
= 4MB.

9. Which of the following statements are correct?
(a) RAID can enhance the reliability/availability of data storage, and can also boost the
performance of data access.
(b) RAID 1 provides fault tolerance by replicating data to mirror disks.
(c) The advantage of RAID 4 over RAID3 is on the block-level striping/interleaving of data
across disks, so as to allow more parallel data accesses.
(d) The advantage of RAID 6 over RAID 5 is on the distribution of parity blocks across disks,
so as to avoid a single parity disk from being the bottleneck.
Answer: (a), (b), (c)
Note (d): Distributing parity blocks across disks is the advantage of RAID 5 over RAID 4;
the advantage of RAID 6 is a second, independent parity block, which tolerates two disk
failures.

Problem Set A:
Consider the Pipelined CPU with five stages (IF, D, EXE, MEM, WB) and the code sequence shown
below. Please answer the questions below.
add $s1, $t1, $t2
sub $s1, $s1, $s2
lw $s2, 0($s1)
add $t1, $s1, $s2
sub $t3, $t2, $s2
sw $s1, 0($t3)
or $t4, $t2, $s2
add $t1, $s2, $s1

30
[Figure: the classic five-stage pipelined MIPS datapath, with IF/ID, ID/EX, EX/MEM, and
MEM/WB pipeline registers, instruction and data memories, register file, ALU, sign-extend
unit, and the control signals PCSrc, RegWrite, ALUSrc, MemWrite, MemtoReg, MemRead,
ALUOp, and RegDst.]
A1. Assume forwarding and stall mechanisms have been designed (though they are not shown in
the figure). Which instruction is in the IF stage when the code sequence runs to the 7th cycle?
(a) or $t4, $t2, $s2
(b) sw $s1, 0($t3)
(c) sub $t3, $t2, $s2
(d) add $t1, $s1, $s2
Answer: (b)
Note: (the load-use hazard between lw and the following add costs one stall cycle)
          c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13
add $s1   IF ID EX ME WB
sub $s1      IF ID EX ME WB
lw  $s2         IF ID EX ME WB
add $t1            IF ID ID EX ME WB
sub $t3               IF IF ID EX ME WB
sw  $s1                     IF ID EX ME WB
or  $t4                        IF ID EX ME WB
add $t1                           IF ID EX ME WB

31
A2. With the same assumption as question A1, which instruction is in EXE stage when the code
sequence runs to the 7th cycle?
(a) or $t4, $t2, $s2
(b) sw $s1, 0($t3)
(c) sub $t3, $t2, $s2
(d) add $t1, $s1, $s2
Answer: (d)

A3. Assume only the stall mechanism has been designed (i.e., no forwarding paths) and assume
register read and write can be done in the same cycle. Which instruction is in the MEM stage when the code
sequence runs to the 7th cycle?
(a) or $t4, $t2, $s2
(b) sub $t3, $t2, $s2
(c) lw $s2,0($s1)
(d) sub $s1, $s1, $s2
Answer: (d)
Note: (each dependent instruction waits in ID until the producer writes back)
          c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13
add $s1   IF ID EX ME WB
sub $s1      IF ID ID ID EX ME WB
lw  $s2         IF IF IF ID ID ID EX ME WB
add $t1                  IF IF IF ID ID ID EX ME

Problem Set B:
A computer system has an L1 cache, an L2 cache, and a main memory unit connected as shown
below. The block size is 16 words for the L1 cache, and is 4 words for the L2 cache; the main
memory is 4-word wide. The access times are 2ns, 20ns, and 200 ns for the L1 cache, L2 cache,
and main memory, respectively.
[Figure: L1 Cache <--4-word data bus--> L2 Cache <--4-word data bus--> Main MEM]

B1. When the processor requests some 4-word data, but there is a miss in the L1 cache and a hit in
the L2 cache, how much is the total required time for data transfer upon this request?
(a) 20 ns
(b) 22 ns
(c) 80 ns
(d) 82ns
32
Answer: (d)
註:2 + (16 / 4)  20 = 82 ns

B2. When the processor requests some 4-word data, but there is a miss in both of the L1 cache and
the L2 cache, then a hit in the main memory, how much is the total memory access time for this
request?
(a) 220 ns
(b) 222 ns
(c) 282 ns
(d) 880ns
Answer: (c)
註:2 + (16 / 4)  20 + 200= 282 ns
106 NCKU EE

1. For a five-stage pipeline processor, answer the following questions:
(a) In the MIPS five stage pipeline, where is the earliest stage that can compute the branch
outcome? Why?
(b) Can we find a possible branch target at the IF stage? How?
(c) In the pipelined datapath, which components should be connected to the CPU clock?
(d) If the control unit is located at the ID stage, how might the control unit affect the CPU
clock?
(e) Is there a structural hazard between reading the source register operands at the ID stage
and writing back the destination register at the WB stage? Why or why not?
Answer
(a) ID stage. Both branch target address calculation and registers’ comparison (by an XOR
array) can be done at the ID stage. The branch outcome cannot be determined earlier at IF
stage because the contents of registers are not available at this stage.
(b) Yes. The branch target computation needs only PC + 4 and the 16-bit constant of the
fetched instruction. Both values are available at the IF stage, so the branch target can be
calculated at this stage.
(c) Program counter (PC) and all the pipeline registers should be connected to the CPU clock.
(d) If the latency of the control unit dominates the latency of the ID stage and the ID stage is
the longest stage among all pipeline stages, the control unit will affect the clock cycle time.
(e) No. We can design the register file to write the destination register in the first half of the
cycle and read the source registers in the second half, so no structural hazard occurs.

2. In a five-stage pipelined processor, multiple exceptions may occur at the same clock cycle.
Answer the following questions:
(a) Which stage can detect an illegal instruction?
(b) What might be the possible causes that result in an illegal instruction?
(c) Which stage(s) can detect memory access violations?
(d) TLB exception may occur at which stage(s)?
(e) If multiple exceptions have occurred at the same time, which PC should be identified and
saved for precise interrupt?
Answer
(a) ID stage
(b) - The program was compiled with processor-specific instructions and then runs on a
processor that does not support them.
- The program was compiled with the wrong compiler.
- The program runs on a processor that does not support those instructions.
(c) IF and MEM stages.
(d) IF and MEM stages.
(e) The program counter of the oldest instruction (the earlier instruction into pipeline) should
be identified and saved for precise interrupt.

Choose the most appropriate answers for the following multiple choice problems. Each question may
have more than one answer.
3. Using 16K  8 SRAM modules for on-chip memory system, which of the following is (are)
true?
(a) For 1 MB memory system, it needs 64 SRAM modules.
(b) The 16K  8 module has 14 address lines.
(c) The 16K  8 module has 16K address lines.
(d) It needs at least 8 modules for the connection to a 64-bit data bus. So the minimum memory
size is 128KB.
(e) It needs at least 8 modules for the connection to a 64-bit data bus. So the minimum memory
size is 64KB.
Answer: (a), (b), (d)
Note (a): (1M × 8) / (16K × 8) = 2^6 = 64 modules.
Note (d): (16K × 8 bits) × 8 modules = 16KB × 8 = 128KB.

4. For a conditional branch instruction such as beq, rs, rt, foo, which of the following statements
are true?
(a) The label "foo" defines the base address of the branch target.
(b) The label "foo" is an offset relative to the program counter which points to the next
sequential instruction of the branch instruction.
(c) The label "foo" is translated into an unsigned number.
(d) The label "foo" is coded into the instruction as a string.
(e) The label "foo" is coded into the instruction as a signed number.
Answer: (b), (e)

5. Which of the following is (are) true for the forwarding unit used in a typical five-stage pipelined
processor?
(a) The forwarding unit is used to bypass the write-back result due to RAW hazards.
(b) The forwarding unit is used to forward data to the instruction cache.
(c) The forwarding unit compares the source register number of the instructions in the MEM
and WB stages with the destination register numbers of the dependent instruction.
(d) The forwarding unit compares the destination register number of the instructions in the
MEM and WB stages with the source register numbers of the dependent instruction.
(e) The forwarding unit is a combinational logic.
Answer: (a), (d), (e)

6. Which of the following statements is (are) true for virtual memory system?
(a) The space on the disk or flash memory reserved for the full virtual memory space of a
process is called Swap Space.
(b) Virtual memory function can be enabled through software control.
(c) Virtual memory technique treats part of the main memory as a fully associative
write-back cache for program execution.
(d) A translation lookaside buffer can be seen as the cache of a page table.
(e) A page table is shared among the programs in execution.
Answer: (a), (c), (d)

7. Which of the following statements is (are) true?
(a) Checking the state of an I/O device to see if it is time for the next I/O operation is called I/O
polling.
(b) A multi-core system using a write-through cache as its private cache will prevent the Cache
coherence problem since the written data are also updated in the main memory.
(c) When an interrupt occurs, the processor always responds to the interrupt and enters the
interrupt service routine. This is called an interrupt request.
(d) ISA (instruction set architecture) is an abstraction that enables different implementations of
the same ISA for the processor, for example, a pipelined implementation or a non-pipelined
one.
(e) Using a DMA controller to perform memory-to/from I/O operations is called CPU I/O.
Answer: (a), (d)
106 NCKU CS Joint Exam

1. Consider the following figure.
[Figure: the CPU presents address A to the TLB; on a TLB hit, address B is presented to
the cache; on a cache miss, the block is fetched from main memory and the data is returned.
A TLB miss is resolved through the page table.]

(1) What is the name of the TLB’s input (Symbol A in the figure)? (Hint: XXX address.)
(2) What is the name of the TLB’s output (Symbol B in the figure)?
(3) What is the name of this type of cache (with B as its input)? (Hint: XXX cache.)
(4) Does cache aliasing occur in the cache design shown in the figure (Yes or No)? Explain
your answer.
(5) “We could have a hit in the cache, and get a TLB miss and a page table miss.” Is the
statement true (Yes or No)? Explain your answer.
(6) Consider the processor operating at 1GHz. The processor stalls during a cache miss and has
the following properties: (i) a cache access time of 2 clock cycles for a hit, (ii) a miss
penalty of 100 clock cycles, and (iii) a miss rate of 0.03 misses per reference. Please
compute the average memory access time.
Answer
(1) Virtual address
(2) Physical address
(3) Physically addressed cache
(4) No. The cache is indexed and tagged with physical addresses, so two virtual addresses
that map to the same physical address always resolve to the same cache block.
(5) No. A page table miss means the page is not in main memory, so its data cannot be in
the cache.
(6) AMAT = (2 + 0.03 × 100) cycles × 1 ns per cycle = 5 ns
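Part (6) is a direct application of AMAT = hit time + miss rate × miss penalty; a one-line check (hypothetical helper name):

```python
# At 1 GHz, one clock cycle is 1 ns, so the cycle count converts directly.

def amat_ns(hit_cycles, miss_rate, penalty_cycles, ns_per_cycle=1.0):
    return (hit_cycles + miss_rate * penalty_cycles) * ns_per_cycle

print(round(amat_ns(2, 0.03, 100), 2))   # 5.0 ns
```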

2. Assume the MIPS processor with 5 stages of the pipeline:
(i) IF for the instruction fetch stage,
(ii) ID for the instruction decode/register file read stage,
(iii) EX for the execution stage,
(iv) MEM for the memory access stage, and
(v) WB for the write-back stage.
Given the following instruction sequences.
I1: ADD R1, R2, R0
I2: LW R2, 16(R1)
I3: LW R1, 4(R3)
I4: SUB R5, R3, R4
(1) Find all data dependencies in the instruction sequence.
(2) Find all hazards in the instruction sequence for the processor with and without forwarding.
(3) Sometimes, even with forwarding, we would have to stall one stage for a data hazard.
Fortunately, the software technique, Reordering Code, would be adopted to avoid the
pipeline stalls. Please state if the instruction sequence suffers from a data hazard that cannot
be resolved by the forwarding (Yes or No). If your answer is Yes, please list your reordered
code.
Answer
(1) RAW: (R1) I1 to I2
WAR: (R2) I1 to I2; (R1) I2 to I3
WAW: (R1) I1 to I3
(2) With forwarding: none (the RAW on R1 from I1 to I2 is resolved by EX-to-EX
forwarding).
Without forwarding: (R1) I1 to I2.
(3) No
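The dependence classification for I1..I4 can be derived mechanically from each instruction's register reads and writes (a sketch; the pair format `(register, producer, consumer)` is my own):

```python
insts = [  # (written register, read registers)
    ("R1", ("R2", "R0")),   # I1: ADD R1, R2, R0
    ("R2", ("R1",)),        # I2: LW  R2, 16(R1)
    ("R1", ("R3",)),        # I3: LW  R1, 4(R3)
    ("R5", ("R3", "R4")),   # I4: SUB R5, R3, R4
]

raw, war, waw = [], [], []
for i, (wi, ri) in enumerate(insts):
    for j in range(i + 1, len(insts)):
        wj, rj = insts[j]
        if wi in rj: raw.append((wi, i + 1, j + 1))   # write then read
        if wj in ri: war.append((wj, i + 1, j + 1))   # read then write
        if wi == wj: waw.append((wi, i + 1, j + 1))   # write then write

print(raw)   # [('R1', 1, 2)]
print(war)   # [('R2', 1, 2), ('R1', 2, 3)]
print(waw)   # [('R1', 1, 3)]
```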

3. Determine whether each of the following statements is True (T) or False (F), and explain your
answer.
(1) Strong scaling is not limited by Amdahl's law.
(2) Both SMPs and message-passing computers rely on locks for synchronization.
(3) Multithreading technology in CPUs helps reduce the memory latency.
Answer
(1) F. Strong scaling (fixed total problem size) is limited by Amdahl's law; it is weak scaling
that can compensate for a serial portion of the program that would otherwise limit
scalability.
(2) F. Message-passing computers do not need locks: sending and receiving a message is an
implicit synchronization, as well as a way to share data.
(3) F. Multithreading hides (tolerates) memory latency by switching to another thread during
a stall; it does not reduce the latency itself.
106 NCKU Computer and Communication Engineering

1. For a MIPS-like 5-stage pipelined processor,
(a) Explain how the processor handles the control hazard for beq rs, rt, loop instruction.
(Assume that non-taken prediction is used; state your assumptions for the implementation
of the beq instruction).
(b) In a MIPS-like 5-stage pipelined processor, what is a load-use hazard?
(c) How to handle this load-use hazard in the pipeline?
(d) Assume that an exception handling unit is placed at the MEM stage of the pipeline. What is
the precise interrupt PC?
(e) Define precise interrupt.
Answer
(a) Assume branch not taken prediction is to assume that the branch will not be taken and thus
continue execution down the sequential instruction stream. If the branch is taken, the
instructions that are being fetched and decoded must be discarded. To discard instructions,
we must be able to flush instructions in the IF, ID, and EX stages of the pipeline if the
branch decision is made in MEM stage.
(b) Load-use hazard: a specific form of data hazard in which the data being loaded by a load
instruction has not yet become available when it is needed by another instruction.
(c) The pipeline stalls the instructions in the IF and ID stages for one cycle, and then
forwarding is used to deliver the loaded data from the MEM/WB register to the EX stage.
(d) The precise interrupt PC is the address of the offending instruction that caused the
exception, so that execution can be restarted correctly after the handler completes.
(e) Precise interrupt: also called precise exception. An interrupt or exception that is always
associated with the correct instruction in a pipelined computer.

2. Design a DTLB and data cache system. Assume both the virtual address and physical address are
32 bits. The page size is 4KB. The DTLB uses 4-way set associative structure and has a total of
32 entries. The data cache is physically addressed; cache size 32KB, direct-mapped, line size 32
bytes.
(a) Show the integrated design of the DTLB and the data cache.
(b) Show the integration of this sub-memory system into a 5-stage processor pipeline.
Answer
(a) DTLB: the 32-bit virtual address splits into a 20-bit virtual page number (VPN) and a
12-bit page offset. With 32 entries organized 4-way set associative there are 8 sets, so the
VPN further splits into a 17-bit TLB tag and a 3-bit set index. Each way of the indexed set
holds a valid bit, a tag, and a 20-bit physical page number (PPN); the four tags are compared
in parallel, and on a hit the matching PPN is concatenated with the page offset to form the
32-bit physical address.
Data cache: 32KB direct-mapped with 32-byte lines gives 1024 lines, so the physical address
splits into a 17-bit tag, a 10-bit index, and a 5-bit block offset. Each line holds a valid bit, a
17-bit tag, and 256 bits (32 bytes) of data.
(b) [Figure: in the 5-stage pipeline, the ITLB is placed in the IF stage, translating the fetch
address before the instruction cache, and the DTLB is placed in the MEM stage, translating
the data address before the data cache.]
106 NCU CS
Multiple-Answer Questions
1. In Boolean algebra, which of the following statements are true? (Note: Z' is the inverse of Z)
(1) X + YZ' = (X + Y)(Y + Z')
(2) (X + Y)(X' + Z) = XZ + X'Y
(3) (W' + Z + XY)(Z' + W' + XY) = Z' + XY
(4) (X + Y)(Y + Z)(X' + Z) = (X' + Z)(X + Y)
(5) XY'Z + YZ = XZ + YZ
Answer: (2), (4), (5)
Note: (1) RHS = Y + XZ'; with Z = 1, X = 0, Y = 1: LHS = 0 ≠ 1 = RHS.
(2) LHS = XZ + X'Y + YZ = XZ + X'Y + XYZ + X'YZ = XZ + X'Y = RHS (consensus theorem).
(3) LHS = W' + XY + ZZ' = W' + XY; with W = 1, Z = 0, X = 0: LHS = 0 ≠ 1 = RHS.
(4) LHS = (Y + XZ)(X' + Z) = X'Y + YZ + XZ; RHS = X'Y + YZ + XZ = LHS.
(5) RHS = XYZ + XY'Z + YZ = XY'Z + YZ = LHS.
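All five identities can also be settled by brute-force truth tables (a sketch; `equiv` is a hypothetical helper that compares both sides over all 16 input combinations):

```python
from itertools import product

def equiv(lhs, rhs):
    return all(lhs(*v) == rhs(*v) for v in product((False, True), repeat=4))

checks = [
    (1, lambda w, x, y, z: x or (y and not z),
        lambda w, x, y, z: (x or y) and (y or not z)),
    (2, lambda w, x, y, z: (x or y) and ((not x) or z),
        lambda w, x, y, z: (x and z) or ((not x) and y)),
    (3, lambda w, x, y, z: ((not w) or z or (x and y)) and ((not z) or (not w) or (x and y)),
        lambda w, x, y, z: (not z) or (x and y)),
    (4, lambda w, x, y, z: (x or y) and (y or z) and ((not x) or z),
        lambda w, x, y, z: ((not x) or z) and (x or y)),
    (5, lambda w, x, y, z: (x and (not y) and z) or (y and z),
        lambda w, x, y, z: (x and z) or (y and z)),
]
for n, lhs, rhs in checks:
    print(n, equiv(lhs, rhs))   # True only for (2), (4), (5)
```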

2. About single-cycle and multi-cycle implementation, which of the following statements are true?
(a) Single-cycle implementation of CPU is used in the mainstream processors nowadays.
(b) For single-cycle implementation of CPU, the clock cycle is determined by the longest
possible path.
(c) Single-cycle implementation of CPU allows a functional unit to be used more than once per
instruction.
(d) Compared to single-cycle implementation, multicycle implementation is more efficient.
(e) None of the above.
Answer: (b)

3. Consider the following code sequence:
SUB R2, R1, R3
AND R4, R2, R5
OR R7, R6, R2
ADD R8, R2, R2
SW R9, 100(R2)
Which of the following statements about the dependency in the code sequence are correct?
(a) The dependency between SUB and AND instructions can be detected by the logic:
EX/MEM.RegisterRd = ID/EX.RegisterRs
(b) The dependency between SUB and OR instruction can be detected by the logic:
MEM/WB.RegisterRd = ID/EX.RegisterRt
(c) The two dependencies between SUB and ADD instructions are not hazards.
(d) There is no data hazard between ADD and SW instructions.
(e) None of the above
Answer: (a), (b), (c), (d)

4. About the branch prediction, which of the following statements are true?
(a) For a one-bit dynamic branch predictor, the branch prediction buffer contains one bit to
record whether the branch instruction was recently taken or not.
(b) A branch prediction buffer is a small special-purpose memory indexed by the higher-order
bits of the address of the branch instruction.
(c) For a two-bit dynamic branch predictor, a prediction must be wrong twice before it is
changed.
(d) The assumption of dynamic branch prediction is that the underlying algorithms and the data
that is being operated on have regularities.
(e) If the delay branch slots can be scheduled with independent instructions from before the
branch, branch hazards can be avoided.
Answer: (a), (c), (d), (e)

5. For the instruction set design, which of the following statements are true?
(a) Compared to Mem-Mem architecture or Reg-Mem architecture, Reg-Reg architecture has
the disadvantage of having large variation in CPI.
(b) MIPS requires that objects must be aligned in the memory.
(c) Register indirect addressing mode can be used for accessing using a pointer or a computed
address. For example: Add R4, (R1) means that Regs[R4]  Regs[R4] + Mem [Regs[R1]]
(d) Modern compiler technology and its ability to effectively manipulate registers has led to a
decrease in register counts in more recent architectures.
(e) For PC-relative addressing mode, a displacement is added to the program counter.
Answer: (b), (c), (e)

Single-Answer Questions
6. Consider the following instruction mix for a processor.
ALU operations: 40%, uses 4 cycles
Branch operations: 30%, uses 4 cycles
Memory references: 30%, uses 5 cycles
The un-pipelined processor has a clock cycle time of 1ns. The pipelined processor has a clock
cycle time of 1.2ns. Suppose that we ignore any latency and hazards and assume that the
pipelined processor has an ideal CPI of 1. How much speedup can be achieved when
comparing the un-pipelined processor and the pipelined processor?
(a) 3.28 (b) 3.35 (c) 3.58 (d)3.86 (e) 3.98

Answer: (c)
Note: Instruction time for un-pipelined processor = 1 ns × (0.4 × 4 + 0.3 × 4 + 0.3 × 5) = 4.3 ns
Instruction time for pipelined processor = 1.2 ns × 1 = 1.2 ns
Speedup = 4.3 / 1.2 = 3.58
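The arithmetic can be checked directly (a quick Python sketch; the instruction mix and cycle counts come from the problem statement):

```python
# Quick check of the speedup computation.
mix = {"alu": (0.40, 4), "branch": (0.30, 4), "mem": (0.30, 5)}  # (freq, cycles)
unpipelined_time = 1.0 * sum(f * c for f, c in mix.values())     # ns; 1 ns cycle
pipelined_time = 1.2 * 1                                         # ns; ideal CPI = 1
speedup = unpipelined_time / pipelined_time
print(round(unpipelined_time, 2), round(speedup, 2))
```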

7. We know that the Boolean function F1(A, B, C, D) = AB' + CD has the minterms, M3, M7, M8,
M9, M10, M11, M15. Now, F2(A, B, C, D) = A'B'D + CD' + A'BC' + ABD. Xi = i if F2(A, B, C,
D) has a minterm Mi, and Xi = 0, otherwise. The sum of all the sixteen Xi is K, 0 ≤ i < 16, What
is (K mod 5)? (“mod” is the modulo operation.)
(a) 0 (b) 1 (c) 2 (d) 3 (e) 4
Answer: (d)
Note: F2 has the minterms M1, M2, M3, M4, M5, M6, M10, M13, M14, M15, so
K = 1 + 2 + 3 + 4 + 5 + 6 + 10 + 13 + 14 + 15 = 63, and 63 mod 5 = 3

      CD
AB    00  01  11  10
00         1   1   1
01     1   1       1
11         1   1   1
10                 1

8. Boolean function F3(A, B, C, D) = A'B'C' + A'C'D + ABD + BCD + ACD + AB'D' + BCD'. If
the number of essential prime implicants is K, what is (K mod 5)?
(a) 0 (b) 1 (c) 2 (d) 3 (e) 4
Answer: (d)
Note: F3 has the minterms M0, M1, M5, M6, M7, M8, M10, M11, M13, M14, M15:

      CD
AB    00  01  11  10
00     1   1
01         1   1   1
11         1   1   1
10     1       1   1

The prime implicants are BD (m5, m7, m13, m15), BC (m6, m7, m14, m15), AC (m10, m11, m14, m15), A'B'C' (m0, m1), A'C'D (m1, m5), B'C'D' (m0, m8), and AB'D' (m8, m10). M13 is covered only by BD, M6 only by BC, and M11 only by AC, so these three are essential; each of the remaining minterms (M0, M1, M8, M10) can be covered by two different prime implicants, so no other prime implicant is essential. Hence K = 3 and K mod 5 = 3, choice (d). (Reading K as the total number of prime implicants instead gives K = 7 and K mod 5 = 2, choice (c).)
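The prime implicants can also be found mechanically by searching all 3^4 = 81 cubes over the four variables (a Python sketch written for this cross-check; the search reports 7 prime implicants, of which 3 are essential):

```python
from itertools import product

# Brute-force prime-implicant search for
# F3 = A'B'C' + A'C'D + ABD + BCD + ACD + AB'D' + BCD'.
def f3(a, b, c, d):
    return ((not a and not b and not c) or (not a and not c and d) or
            (a and b and d) or (b and c and d) or (a and c and d) or
            (a and not b and not d) or (b and c and not d))

minterms = {i for i in range(16) if f3(i >> 3 & 1, i >> 2 & 1, i >> 1 & 1, i & 1)}

def covers(cube):
    """Minterms covered by a cube over (A,B,C,D); a literal is 0, 1, or None."""
    return {i for i in range(16)
            if all(v is None or (i >> (3 - p) & 1) == v for p, v in enumerate(cube))}

implicants = [c for c in product((0, 1, None), repeat=4) if covers(c) <= minterms]

def is_prime(cube):
    # Prime: no fixed literal can be dropped while staying inside the function.
    for p in range(4):
        if cube[p] is not None:
            wider = cube[:p] + (None,) + cube[p + 1:]
            if covers(wider) <= minterms:
                return False
    return True

primes = [c for c in implicants if is_prime(c)]
# Essential: covers at least one minterm that no other prime implicant covers.
essential = [c for c in primes
             if any(all(m not in covers(q) for q in primes if q != c)
                    for m in covers(c))]
print(len(primes), len(essential))
```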

9. Consider two different machines, A and B. The measurements on the two machines running a set
of benchmark programs are shown below:
Machine A (2GHz)
Instruction Type Instruction Count (millions) CPI
Arithmetic and logic 8 1
Load and store 4 3
Branch 2 4
Others 4 3

Machine B (2.2GHz)
Instruction Type Instruction Count (millions) CPI
Arithmetic and logic 10 1
Load and store 8 2
Branch 2 4
Others 4 3

If the MIPS of the slower machine is K, what is (Round(K) mod 5)?


(Note: Round (K) = floor(K+0.5))
(a) 0 (b) 1 (c) 2 (d) 3 (e) 4
Answer: (d)
Note: Execution time for MA = [(8 × 1 + 4 × 3 + 2 × 4 + 4 × 3) × 10^6] / (2 × 10^9) = 20 ms
Execution time for MB = [(10 × 1 + 8 × 2 + 2 × 4 + 4 × 3) × 10^6] / (2.2 × 10^9) ≈ 21 ms
MB is slower. The MIPS of MB = (10 + 8 + 2 + 4) / (21 × 10^-3) = 1142.86
Round(1142.86) mod 5 = 1143 mod 5 = 3
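As a numeric cross-check (a Python sketch): rounding MB's execution time to 21 ms first, as the solution does, or carrying full precision both leave a remainder of 3.

```python
# Instruction counts in millions and CPIs taken from the two tables.
a_counts, a_cpis = [8, 4, 2, 4], [1, 3, 4, 3]
b_counts, b_cpis = [10, 8, 2, 4], [1, 2, 4, 3]
t_a = sum(n * c for n, c in zip(a_counts, a_cpis)) * 1e6 / 2.0e9   # seconds
t_b = sum(n * c for n, c in zip(b_counts, b_cpis)) * 1e6 / 2.2e9
mips_b = sum(b_counts) * 1e6 / t_b / 1e6    # machine B is the slower one
print(round(t_b * 1e3, 1), round(mips_b))
```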

10. A processor with 2GHz has a base CPI of 4 when all the memory references hit in the primary
memory. Assume that a main memory access time is 100ns, including all the miss handling and
that the miss rate at the primary cache is 8%. We add a secondary cache that has a 20-ns access
time and helps to reduce the miss rate to the main memory to 2%. The resulting CPI now is K.
What is (Round(K  8) mod 5)? (Note: K  8 means K multiplied by 8)
(a) 0 (b) 1 (c) 2 (d) 3 (e) 4
Answer: (a)
Note: At 2 GHz the clock cycle is 0.5 ns, so the 20-ns secondary-cache access costs 40 cycles and the 100-ns main-memory access costs 200 cycles.
CPI = 4 + 0.08 × 40 + 0.02 × 200 = 11.2
Round(11.2 × 8) mod 5 = Round(89.6) mod 5 = 90 mod 5 = 0
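The same numbers fall out of a short sketch (penalties expressed in clock cycles at the 2 GHz rate):

```python
# Miss penalties in clock cycles at 2 GHz (0.5 ns per cycle).
cycle_ns = 0.5
l2_cycles = 20 / cycle_ns      # 40 cycles for a primary miss served by the L2
mem_cycles = 100 / cycle_ns    # 200 cycles for an access that reaches memory
cpi = 4 + 0.08 * l2_cycles + 0.02 * mem_cycles
print(cpi, round(cpi * 8) % 5)
```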

106 年中山電機

1. Terminology Explanation
(a) Pipeline Processing (b) Superscalar Processor (c) RISC (d) GPGPU
Answer
(a) Pipeline Processing: a category of techniques that provide simultaneous, or parallel,
processing within the computer. It refers to overlapping operations by moving data or
instructions into a conceptual pipe with all stages of the pipe processing simultaneously. For
example, while one instruction is being executed, the computer is decoding the next
instruction.
(b) Superscalar Processor: an advanced pipelining technique that enables the processor to
execute more than one instruction per clock cycle
(c) RISC: (Reduced instruction set computer) is a computer which is based on the design
strategy that simplified instructions can provide higher performance if this simplicity
enables much faster execution of each instruction.
(d) GPGPU: (General-purpose GPU): Using a GPU for general-purpose computation via a
traditional graphics API and graphics pipeline.

2. Suppose we are considering a change to an instruction set. The base machine is a load-store
machine. Measurements of the load-store machine showing the instruction mix and clock cycle
counts per instructions are given in the following table:
Instruction Type Frequency Clock cycle Count
ALU operations 40% 1
Loads 25% 2
Stores 15% 2
Branches 20% 2
Let's assume that 30% of the ALU operations directly use a loaded operand that is not used again. We
propose adding ALU instructions that have one source operand in memory. These new register-memory
instructions have a clock cycle count of 2. Suppose that the extended instruction set increases the clock
cycle count for branches by 1, but it does not affect the clock cycle time. Would this change improve CPU
performance? Explain your answer.
Answer
The frequency of register-memory ALU instructions = 0.4 × 0.3 = 0.12
The frequency of register-register ALU instructions becomes 0.4 – 0.12 = 0.28
The frequency of Load instructions becomes 0.25 – 0.12 = 0.13
The CPI for branch instructions becomes 2 + 1 = 3
CPIold = 0.4 × 1 + 0.25 × 2 + 0.15 × 2 + 0.2 × 2 = 1.6
CPInew = (0.28 × 1 + 0.13 × 2 + 0.15 × 2 + 0.2 × 3 + 0.12 × 2) / 0.88 = 1.91
ExTimeold = IC × CPI × T = 1.6 × IC × T
ExTimenew = 0.88 × IC × CPI × T = 0.88 × 1.91 × IC × T = 1.68 × IC × T
This change would not improve CPU performance
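The comparison can be reproduced with a small sketch (frequencies and cycle counts are the ones derived above; times are in units of the old IC × T):

```python
# Relative CPI and execution time for the proposed reg-mem ALU instructions.
cpi_old = 0.40 * 1 + 0.25 * 2 + 0.15 * 2 + 0.20 * 2
freq = {"rr_alu": 0.28, "load": 0.13, "store": 0.15, "branch": 0.20, "rm_alu": 0.12}
cpi = {"rr_alu": 1, "load": 2, "store": 2, "branch": 3, "rm_alu": 2}
new_ic = sum(freq.values())                      # 0.88 of the old instruction count
cpi_new = sum(freq[k] * cpi[k] for k in freq) / new_ic
time_old, time_new = cpi_old, cpi_new * new_ic   # relative execution times
print(round(cpi_new, 2), round(time_new, 2))
```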

3. A set associative cache has a block size of four 32-bit words and a set size of 4. The cache can
accommodate a total of 256K words. The main memory size that is cacheable is 1024M × 32 bits.
Design the cache structure, and show how the processor's addresses are interpreted.
Answer
Cacheable memory size is 1024M × 32 bits ⇒ word address length = 30 bits
Block size of four words ⇒ block offset = 2 bits
Number of sets = (256K / 4) / 4 = 16K ⇒ index length = 14 bits

The 30-bit word address is interpreted as:

Tag (14 bits) | Index (14 bits) | Block offset (2 bits)

Each of the 16K sets contains four ways; each way holds a valid bit, a 14-bit tag, and a 128-bit (four-word) data block. On an access, the index selects one set, the four stored tags are compared with the address tag in parallel, the comparator outputs (gated by the valid bits) are ORed to form the Hit signal, and a 4-to-1 multiplexer driven by the comparator outputs selects the 128-bit data of the matching way.

4. Given the 8-bits adder (named Add8), the 2-to-1 8-bits multiplexers (named MUX8 2to1) and the
basic gates such as NOT, AND, OR, NAND, and NOR, you are asked to design an ALU in
function block diagrams, which must match the following requirements:
(1) Support add, sub, and sgt (set on great than) functions. Their operation selection bits (op_sel)
are as follows: add(00), sub(10), sgt(11),
(2) Report the result status in sign, zero, overflow, and carry bits.
Answer
(1) One workable design: op1 (which is 1 for sub and sgt) selects, through a MUX8 2to1, between b and its bitwise complement (eight NOT gates), and also drives the Add8 carry-in c0, so the adder produces s = a + b when op1 = 0 and s = a − b in two's complement when op1 = 1. A second MUX8 2to1, selected by op0, produces the result r0-r7: the sum s0-s7 for add and sub (op0 = 0), or, for sgt (op0 = 1), seven 0 bits concatenated with a low-order bit that is 1 exactly when a > b. One correct wiring for that bit computes NOT(zero) AND NOT(s7 XOR overflow), i.e., the difference is nonzero and its sign, corrected for overflow, is positive.
(2) The status bits come directly from the adder outputs: zero = NOR(s0, ..., s7); sign = s7; overflow = c7 XOR c8 (the XOR of the carry into and the carry out of the most significant bit); carry = c8, the adder's carry-out.

5. Use the following code fragment:


Loop: LW R1, 0(R2)
ADDI R1, R1, #2
SW 0(R2), R1
ADDI R2, R2, #5
SUB R4, R3, R2
BNEZ R4, Loop
Assume the initial value of R3 is R2+100. Use the five-stage instruction pipeline (IF, DEC, EXE,
MEM, WB) and assume all memory accesses are one cycle operation. Furthermore, branches are
resolved in MEM stage.
(a) Show the timing of this instruction sequence for the five-stage instruction pipeline with
normal forwarding and bypassing hardware. Assume that branch is handled by predicting it
has not taken. How many cycles does this loop take to execute?
(b) Assuming the five-stage instruction pipeline with a single-cycle delayed branch and normal
forwarding and bypassing hardware, schedule the instructions in the loop including the branch-delay
slot. You may reorder instructions and modify the individual instruction operands, but do not
undertake other loop transformations that change the number or opcode of instructions in the loop.
Show a pipeline timing diagram and compute the number of cycles needed to execute the entire loop.
Answer
(a)
Instructions Clock cycle number
1 2 3 4 5 6 7 8 9 10 11 12 13 14
LW R1, 0(R2) F D X M W
ADDI R1, R1, #2 F D D X M W
SW R1, 0(R2) F D X M W
ADDI R2, R2, #5 F D X M W
SUB R4, R3, R2 F D X M W
BNEZ R4, Loop F D D X M W
Flushed instruction F D X M W
LW R1, 0(R2) F D X M W

Since R3 is initially R2 + 100 and R2 is incremented by 5 each pass, the loop executes 100 / 5 = 20 iterations.

There are 2 RAW hazards (2 stalls) and a flush after the branch since the branch is taken.
For the first 19 iterations, it takes 9 cycles between loop instances.
The last iteration takes 12 cycles since its latency cannot be overlapped with additional loop
instances. So, the total number of cycles is 19 × 9 + 12 = 183.
(b)
Loop: LW R1, 0(R2)
ADDI R2, R2, #5
SUB R4, R3, R2
ADDI R1, R1, #2
BNEZ R4, Loop
SW R1, -5(R2)      # branch-delay slot; the -5 offset compensates for the earlier ADDI

Instructions Clock cycle number


1 2 3 4 5 6 7 8 9 10
LW R1, 0(R2) F D X M W
ADDI R2, R2, #5 F D X M W
SUB R4, R3, R2 F D X M W
ADDI R1, R1, #2 F D X M W
BNEZ R4, Loop F D X M W
SW R1, -5(R2) F D X M W

The loop executes 100 / 5 = 20 iterations.

For the first 19 iterations, it takes 6 cycles between loop instances.
The last iteration takes 10 cycles since its latency cannot be overlapped with additional loop
instances. So, the total number of cycles is 19 × 6 + 10 = 124.

106 年中山資工

1. True or False. (If the statement is false, please explain the answer shortly)
(1) Increasing the block size of a cache is likely to take advantage of temporal locality.
(2) Increasing the page size tends to decrease the size of the page table.
(3) Virtual memory typically uses a write-back strategy, rather than a write-through Strategy.
(4) If the cycle time and the CPI both increase by 10% and the number of instruction decreases
by 20%, then the execution time will remain the same.
(5) In uniform memory access (UMA) designs, all processors use the same address space.
Answer:
(1) F Increasing the block size of a cache is likely to take advantage of spatial locality.
(2) T
(3) T
(4) F The execution time becomes 0.8 × 1.1 × 1.1 = 0.968 of the original, so it does not remain the same.
(5) T

2. Server farms such as Google and Yahoo! provide enough computing capacity for the highest
request rate of the day. Imagine that most of the time these servers operate at only 60% capacity.
Assume further that the power does not scale linearly with the load; that is, when the servers are
operating at 60% capacity, they consume 90% of maximum power. The servers could be turned
off, but they would take too long to restart in response to more load. A new system has been
proposed that allows for a quick restart but requires 20% of the maximum power while in this
"barely alive" state.
(1) How much power saving would be achieved by turning off 60% of the servers?
(2) How much power saving would be achieved by placing 60% of the servers in the “barely
alive” state?
Answer:
(1) 60%
(2) 0.4 + 0.6 × 0.2 = 0.52, which reduces the power to 52% of the original (a 48% saving)

3. A multicycle CPU has three implementations. The first one is a 5-cycle IF-ID-EX-MEM-WB
design running at 4.8GHz, where load takes 5 cycles; store/R-type 4 cycles and branch/jump 3
cycles. The second one is a 6-cycle design running 5.6GHz, with MEM replaced by MEM1 and
MEM2. The third is a 7-cycle design running at 6.4GHz, with IF further replaced by IF1 and
IF2. Assume we have an instruction mix: load 26%, store 10%, R-type 49%, branch/jump 15%.
(1) Do you think it is worthwhile to go for the 6-cycle design over the 5-cycle design?
(2) How about the 7-cycle design over the 6-cycle design, is it worthwhile?
Answer:
(1) The average CPI for implementation 1 is:
5 × 0.26 + 4 × 0.1 + 4 × 0.49 + 3 × 0.15 = 4.11
The execution time for an instruction in implementation 1 = 4.11/4.8G = 0.86 ns
The average CPI for implementation 2 is:
6 × 0.26 + 5 × 0.1 + 4 × 0.49 + 3 × 0.15 = 4.47
The execution time for an instruction in implementation 2 = 4.47/5.6G = 0.80 ns
It is worthwhile to go for the 6-cycle design over the 5-cycle design.
(2) The average CPI for implementation 3 is:
7 × 0.26 + 6 × 0.1 + 5 × 0.49 + 4 × 0.15 = 5.47
The execution time for an instruction in implementation 3 = 5.47/6.4G = 0.85 ns
It is not worthwhile to go for 7-cycle design over the 6-cycle design.
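Both conclusions follow from the average time per instruction, which a short sketch reproduces:

```python
# Average time per instruction (ns) for the three multicycle designs.
mix = (0.26, 0.10, 0.49, 0.15)            # load, store, R-type, branch/jump
designs = {"5-cycle": ((5, 4, 4, 3), 4.8),
           "6-cycle": ((6, 5, 4, 3), 5.6),
           "7-cycle": ((7, 6, 5, 4), 6.4)}  # (cycles per class, clock in GHz)
time_ns = {name: sum(f * c for f, c in zip(mix, cycles)) / ghz
           for name, (cycles, ghz) in designs.items()}
print({k: round(v, 2) for k, v in time_ns.items()})
```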

4. Identify all of the data dependencies in the following code running in a 5-stage pipelined MIPS
CPU. Which dependencies are data hazards that will be resolves via forwarding? Which
dependencies are data hazards that will cause a stall?
Line Instructions
1 add $3, $4, $2
2 sub $5, $3, $1
3 lw $6, 200($3)
4 add $7, $3, $6
Answer:
Data dependency (line 1, line 2), (1, 3), (1, 4), (3, 4)
Data hazard (1, 2), (1, 3), (3, 4)
Can be resolved via forwarding (1, 2), (1, 3)
Cause a stall (3, 4)

5. For a system with a 32-bit address, the CPU uses a 4-way set-associative cache with a block size
of 16 bytes. The cache has 1024 entries in total.
(1) Determine the tag size for each block.
(2) Assume each block requires 2 extra valid bits. What is the size of the cache memory?
Answer:
(1) Interpreting the 1024 entries as 1024 sets (10 index bits) with a 4-bit block offset: tag size = 32 - 10 - 4 = 18 bits
(2) The number of blocks in the cache = 4 × 1024 = 4096 = 4K
The number of bits in each block = 2 + 18 + 16 × 8 = 148 bits
The total size of the cache memory = 148 × 4K = 592 Kbits

6. Given the following datapath for the single-cycle implementation of a computer and the
definition of its instructions:
[Figure: the standard single-cycle MIPS datapath, with PC and PC+4 adder, instruction memory, register file, sign-extension unit, ALU, data memory, and the control signals PCSrc, RegWrite, ALUSrc, MemtoReg, MemWrite, MemRead, RegDst, and ALUOp.]

add $rd, $rs, $rt


lw $rt, addr($rs)
sw $rt, addr($rs)
beq $rs, $rt, addr
Assume that the instructions are fixed length and the operation time for the major functional
units in this implementation are as follows:
- Memory units: 2 ns
- ALU and adders: 2 ns
- Register file (read or write): 1 ns
- Multiplexers, control unit, PC accesses, sign extension unit, and wires: no delay
Please compute the required time for each instruction and explain why.
Answer:
Instruction  Instruction  Register         Data    Register  Total
Class        Memory       (Read)      ALU  Memory  (Write)   Time
add          2            1           2            1         6 ns
lw           2            1           2    2       1         8 ns
sw           2            1           2    2                 7 ns
beq          2            1           2                      5 ns
The execution time for each instruction is 8 ns since the clock cycle for the single cycle machine
is determined by the longest instruction, which is 8 ns.

7. The following series of branch outcomes occurs for a single branch in a program (T means the
branch is taken; N means the branch is not taken): T T T N N T T T. How many instances of this
branch instruction are mis-predicted with a 1-bit and a 2-bit local branch predictor, respectively?
Assume the Branch History Table (BHT) entries are initialized to the N state. You may assume
that this is the only branch in the program.
Answer:
1-bit predictor: 3 mispredictions; 2-bit predictor: 5 mispredictions

1-bit          T  T  T  N  N  T  T  T
Current state  0  1  1  1  0  0  1  1
Next state     1  1  1  0  0  1  1  1
Correct?       ✗  ✓  ✓  ✗  ✓  ✗  ✓  ✓

2-bit          T  T  T  N  N  T  T  T
Current state  0  1  2  3  2  1  2  3
Next state     1  2  3  2  1  2  3  3
Correct?       ✗  ✗  ✓  ✗  ✗  ✗  ✓  ✓

(For the 2-bit counter, states 0-1 predict not taken and states 2-3 predict taken.)
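Both counts can be reproduced with a small simulation (a sketch; state 0 is the strongest not-taken state, and the upper half of the counter range predicts taken):

```python
# Saturating-counter branch predictor simulation on the outcome stream above.
outcomes = "TTTNNTTT"

def simulate(bits):
    state, max_state = 0, (1 << bits) - 1   # start in the strongest N state
    wrong = 0
    for taken in (o == "T" for o in outcomes):
        predict_taken = state > max_state // 2
        wrong += predict_taken != taken
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return wrong

print(simulate(1), simulate(2))
```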

8. A computer whose processes have 1024 pages in their address spaces keeps its page tables in
memory. The overhead required for reading a word from the page table is 500 ns. In order to
reduce the overhead, the computer has Translation Lookaside Buffer (TLB), which holds 32
(virtual page, physical page frame) pairs, and can do a look up in 100 ns. What hit rate is needed
to reduce the mean overhead to 200 ns?
Answer:
Suppose the hit rate of TLB is H
100 ns + (1 – H)  500 ns = 200 ns  H = 0.8

106 年中正電機

1. Please explain the terms of following:


(1) Branch delay slot (2) Datapath (3) Hazard (4) Forwarding (5) Single-cycle CPU
Answer
(1) Branch delay slot: The slot directly after a delayed branch instruction, which in the MIPS
architecture is filled by an instruction that does not affect the branch.
(2) Datapath: A unit used to operate on or hold data within a processor. In the MIPS, the
datapath include the instruction and data memories, the register file, the ALU, and adders.
(3) Hazard: Situations in pipelining when the next instruction cannot execute in the following
clock cycle.
(4) Forwarding: A method of resolving a data hazard by retrieving the missing data element
from internal buffers rather than waiting for it to arrive from programmer-visible registers
or memory.
(5) Single-cycle CPU: An implementation of the CPU in which an instruction is executed in
one clock cycle.

2. Please resolve the following cache designs:


(1) A caches is designed with 128 blocks and 8 bytes for each block. For the memory byte
address 1280, what is the mapped cache block by using direct-map mechanism?
(2) Assume the address format of cache is 32-bit. What is the size for tag field?
Answer
(1) Memory block address = 1280 / 8 = 160
The mapped cache block number = 160 mod 128 = 32
(2) The tag field size = 32 – 7 – 3 = 22
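Both results can be reproduced in a few lines (a Python sketch):

```python
# Mapping arithmetic for a direct-mapped cache: 128 blocks of 8 bytes each.
byte_addr, block_size, num_blocks = 1280, 8, 128
block_addr = byte_addr // block_size     # memory block address
cache_block = block_addr % num_blocks    # direct-mapped placement
tag_bits = 32 - 7 - 3                    # 7 index bits, 3 byte-offset bits
print(block_addr, cache_block, tag_bits)
```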

3. By considering the following instructions for a five stages MIPS CPU:


lw $5, -10($5)
sw $5, -10($5)
sub $5, $5, $5
(1) By assuming there is not any forwarding in this pipelined processor, please plot the pipeline
diagram of instructions with bubbles (pipeline stall).
(2) Again please plot pipeline diagram with forwarding.
Answer
(1) c1 c2 c3 c4 c5 c6 c7 c8 c9
lw $5, -10($5) IF ID EX ME WB
sw $5, -10($5) IF ID ID ID EX ME WB
sub $5, $5, $5 IF IF IF ID EX ME WB

(2) c1 c2 c3 c4 c5 c6 c7 c8 c9
lw $5, -10($5) IF ID EX ME WB
sw $5, -10($5) IF ID ID EX ME WB
sub $5, $5, $5 IF IF ID EX ME WB

4. A CPU has a clock rate of 4GHz and voltage of 1 V. Assume that, on average, it consumes 30W
of static power and 40W of dynamic power. If the supply voltage is reduced by 10%, how much
percentage of total power saving can be achieved?
Answer
Suppose the leakage current remains the same when the voltage is reduced.
The leakage current I = 30 W / 1 V = 30 A
The new static power = 30 A × 0.9 V = 27 W
The capacitive load C = 40 W / (4 GHz × (1 V)^2) = 10 pF
The new dynamic power = 10 pF × 4 GHz × (0.9 V)^2 = 32.4 W
(27 W + 32.4 W) / (30 W + 40 W) = 0.85
The total power can be reduced by 15%
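The result can be checked numerically (a sketch; like the solution, it assumes the leakage current is unchanged when the supply voltage drops):

```python
# Static power scales with V (fixed leakage current); dynamic power with V^2.
v_old, v_new = 1.0, 0.9
static_old, dynamic_old = 30.0, 40.0               # watts
static_new = (static_old / v_old) * v_new          # I_leak x V = 27 W
dynamic_new = dynamic_old * (v_new / v_old) ** 2   # C and f unchanged: 32.4 W
saving = 1 - (static_new + dynamic_new) / (static_old + dynamic_old)
print(static_new, round(dynamic_new, 1), round(saving, 2))
```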

5. The individual stages of a datapath have the following latencies:


IF ID EX MEM WB
500 ps 400 ps 350 ps 450 ps 300ps
(1) What are the clock frequencies of a pipelined and non-pipelined processor?
(2) What is the total latency of a sw instruction in a non-pipelined processor?
(3) If we can combine two adjacent stages of the pipelined datapath and then split the combined
one into three new stages, which stages would you choose and what is the optimum clock
cycle time of the new processor?
Answer
(1) Clock rate of the pipelined processor = 1 / 500 ps = 2 GHz
Clock rate of the non-pipelined processor = 1 / (500 + 400 + 350 + 450 + 300) ps = 0.5 GHz
(2) The total latency of a sw instruction = 500 + 400 + 350 + 450 + 300 = 2000 ps
(3) Combine the IF and ID stages (500 + 400 = 900 ps) and split the result into three 300 ps stages; the clock cycle is then limited by MEM, so the optimum clock cycle time of the new processor = 450 ps

6. What's the difference between “response time” and “throughput of a CPU? Give an idea to
improve “response time” and “throughput” of a CPU, respectively
Answer:
Response time: Also called execution time. The total time required for the computer to
complete a task, including disk accesses, memory accesses, I/O activities, operating system
overhead, CPU execution time, and so on.
Throughput: Also called bandwidth. Another measure of performance, it is the number of
tasks completed per unit time.
To improve response time, shorten the execution of each individual task (e.g., a faster clock or a design that takes fewer cycles per instruction); to improve throughput, process more tasks concurrently, e.g., by pipelining the CPU.

7. When designing memory hierarchy in a computer system with DRAM and SRAM, which one is
used for cache memory and which one for main memory? Why?
Answer
Main memory is implemented with DRAM and caches are implemented with SRAM, because
DRAM is less costly per bit but has a longer latency than SRAM.

8. Implement the function “unsigned int Fib (unsigned int n)” which returns the value of the nth
Fibonacci number, Fib (0) = 0, Fib (1) = 1, Fib (2) = 1, Fib (3) = 2, ..., Fib (n) = Fib (n - 1) + Fib
(n - 2).
(1) Write the C code.
(2) Translate your C code into MIPS code. Assume that the argument n is in $a0, and the result
is in $v0.
Answer
(1) int fib(int n) {
        if (n <= 1)
            return n;
        else
            return fib(n - 1) + fib(n - 2);
    }
(2) fib: addi $sp, $sp, -12 # save registers on stack
sw $a0, 0($sp)
sw $s0, 4($sp)
sw $ra, 8($sp)
bgt $a0, 1, L1 # if n > 1 then goto L1
add $v0, $a0, $0 # result = n if n = 0 or n = 1
addi $sp, $sp, 12
jr $ra
L1: addi $a0, $a0, -1 # set argument to n - 1
jal fib # compute fib(n-1)
move $s0, $v0 # save fib(n-1)
addi $a0, $a0, -1 # set argument to n - 2
jal fib # compute fib(n-2)
add $v0, $v0, $s0 # $v0 = fib(n-2) + fib(n-1)
lw $a0, 0($sp) # restore registers from stack
lw $s0, 4($sp)
lw $ra, 8($sp)
addi $sp, $sp, 12
jr $ra

106 年中興電機

1. Please answer “Yes” or “No” for the MIPS addressing modes:


(1) Register addressing, where the operand is a constant.
(2) Immediate addressing, where the operand is a register within the instruction itself.
(3) PC-relative addressing, where the address is the sum of the PC and a register in the
instruction.
(4) Base addressing, where the operand is at the memory location whose address is the sum of a
register and a constant in the instruction.
(5) Pseudo-direct address, where the jump address is the 26 bits of the instruction concatenated
with the upper bits of the PC.
Answer
(1) (2) (3) (4) (5)
No No No Yes Yes

Note (1): the operand is a register, not a constant.


Note (2): the operand is a constant within the instruction, not a register.
Note (3): the address is the sum of the PC and a constant in the instruction, not a register.

2. The sequential circuit with a D flip-flop is shown in Fig. 1.


(a) Write the state equation.
(b) Write the state table.
(c) Plot the state diagram of the circuit.

Fig. 1
Answer
(a) A(t + 1) = A ⊕ x ⊕ y
(b)
Present state Input Next state
A x y A
0 0 0 0
0 0 1 1
0 1 0 1

0 1 1 0
1 0 0 1
1 0 1 0
1 1 0 0
1 1 1 1
(c) The state diagram has two states, A = 0 and A = 1: the machine stays in its current state when x ⊕ y = 0 (inputs xy = 00 or 11) and toggles to the other state when x ⊕ y = 1 (inputs xy = 01 or 10).

3. Suppose we have made the following measurements of average CPI for MIPS instructions
shown in Table 1. Please compute the effective CPI for the MIPS machine.
Table 1
MIPS Instruction    Examples   Frequency (Integer)  Frequency (Floating point)  Average CPI
Arithmetic          add, sub   24%                  48%                         1.0 clock cycles
Logic               and, sll   18%                  4%                          1.0 clock cycles
Data transfer       lw, sw     36%                  40%                         1.3 clock cycles
Conditional branch  beq, bne   18%                  6%                          1.6 clock cycles
Jump                jr, jal    4%                   2%                          1.1 clock cycles

Answer: Weighting the integer and floating-point frequencies equally, Effective CPI = 0.36 × 1 + 0.11 × 1 + 0.38 × 1.3 + 0.12 × 1.6 + 0.03 × 1.1 = 1.189
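The computation can be reproduced as follows (a sketch; the 50/50 weighting of the integer and floating-point columns is the assumption behind the 1.189 result):

```python
# Effective CPI with the integer and FP frequency columns averaged equally.
int_freq = {"arith": 0.24, "logic": 0.18, "data": 0.36, "branch": 0.18, "jump": 0.04}
fp_freq = {"arith": 0.48, "logic": 0.04, "data": 0.40, "branch": 0.06, "jump": 0.02}
cpi = {"arith": 1.0, "logic": 1.0, "data": 1.3, "branch": 1.6, "jump": 1.1}
effective_cpi = sum((int_freq[k] + fp_freq[k]) / 2 * cpi[k] for k in cpi)
print(round(effective_cpi, 3))
```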

4. The performance of a program depends on the algorithm, the language, the compiler, the
architecture, and the actual hardware. The following table (Table 2) will summarize how these
components affect the factors in the CPU performance. The factors may include “CPI”,
“instruction count”, “execution time”, “gate counts”, and “clock rate”. Please determine proper
factors for “Affects what?” in Table 2. Copy the following table (Table 2) to your answer sheet
and fill the factors for the fields, i.e. A1, A2, A3, and A4, respectively.
Table 2
Hardware or Software Component Affects what?
Instruction set architecture A1
Compiler A2
Programming language A3
Algorithm A4

Answer
Hardware or Software Component Affects what?
Instruction set architecture instruction count, CPI, clock rate
Compiler instruction count, CPI
Programming language instruction count, CPI
Algorithm instruction count, CPI

5. For each Pseudo-instruction in Table 3, it produces a minimal sequence of actual MIPS


instructions to accomplish the same thing. In Table 3, “big” refers to a specific number that
requires 32 bits to represent, and “small” refers to a number that can fit in 16 bits. Please
determine proper Pseudoinstructions for (OP1, OP2, OP3, OP4, OP5) in Table 3. Copy the
following table (Table 3) to your answer sheet and fill in the five Pseudo-instructions.
Table 3
Pseudo-instruction What it accomplishes
move $t1, $t2 $t1 = $t2
OP1 if($t1 = small) go to H
OP2 $t1 = big
OP3 if($t5 ≥ $t3) go to L
OP4 $t1 = $t2 + big
OP5 $t5 = Memory[$t0 + big]

Answer
Pseudo-instruction What it accomplishes
move $t1, $t2 $t1 = $t2
beq $t1, small , H if($t1 = small) go to H
li $t1, big $t1 = big
bge $t5, $t3, L if($t5 ≥ $t3) go to L
addi $t1, $t2, big $t1 = $t2 + big
lw $t5, big($t0) $t5 = Memory[$t0 + big]

6. Consider the following assembly language code:
I0: ADD R4 = R1 + R0;
I1: SUB R9 = R3 - R4;
I2: ADD R4 = R5 + R6;
I3: LDW R2 = MEM[R3 + 100];
I4: LDW R2 = MEM[R2 + 0];
I5: STW MEM[R4 + 100] = R2;
I6: AND R2 = R2 & R1;
I7: BEQ R9 == R1, Target;
I8: AND R9 = R9 & R1;
Consider a pipeline with forwarding, hazard detection, and 1 delay slot for branches. The
pipeline is the typical 5-stage IF, ID, EX, MEM, WB MIPS design. For the above code,
complete the pipeline diagram below (instructions on the left, cycles on top) for the code. Insert
the characters IF, ID, EX, MEM, WB for each instruction in the boxes. Assume that there are two
levels of bypassing, that the second half of the decode stage performs a read of source registers,
and that the first half of the write-back stage writes to the register file. Label all data stalls (Draw
an X in the box). Label all data forwards that the forwarding unit detects (arrow between the
stages handing off the data and the stages receiving the data). What is the final execution time of
the code?
(Copy the following table onto your answer sheet to answer.)
T0 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14
I0
I1
I2
I3
I4
I5
I6
I7
I8

Answer
T0 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14
I0 IF ID EX ME WB
I1 IF ID EX ME WB
I2 IF ID EX ME WB
I3 IF ID EX ME WB
I4 IF ID X EX ME WB

I5 IF X ID EX ME WB
I6 IF ID EX ME WB
I7 IF ID EX ME WB
I8 IF ID EX ME WB

The final execution time is 14 cycles (T0 through T13).

7. Floating-Point Representation
(a) What decimal number does the bit pattern 0x0C000000 represent if it is a floating point
number? Use the IEEE 754 standard.
(b) Write down the binary representation of the decimal 63.25 assuming the IEEE 754 single
precision format.
Answer
(a) 0x0C000000 = 0000 1100 0000 0000 0000 0000 0000 0000 (binary)
Sign Exponent Fraction
0 00011000 00000000000000000000000
Decimal number = +1 × 2^(24-127) = 2^-103
(b) 63.25 (decimal) = 111111.01 (binary) = 1.1111101 × 2^5
Sign Exponent Fraction
0 10000100 11111010000000000000000
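Both conversions can be cross-checked against the host's IEEE 754 single-precision codec via Python's struct module (a sketch):

```python
import struct

# (a) decode the bit pattern; (b) encode 63.25 and inspect the bit fields.
value = struct.unpack(">f", (0x0C000000).to_bytes(4, "big"))[0]
bits = struct.unpack(">I", struct.pack(">f", 63.25))[0]
print(value == 2.0 ** -103, f"{bits:032b}")
```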

8. In order to explore energy efficiency and its relationship with performance, we assume the
following energy consumption for activity in Instruction memory, Registers, and Data memory.
You can assume that the other components of the datapath spend a negligible amount of energy.

I-Mem    Register Read (each)    Register Write    D-Mem Read    D-Mem Write


140 pJ 70 pJ 60 pJ 140 pJ 120 pJ

Assume that components in the datapath have the following latencies. You can assume that the
other components of the datapath have negligible latencies.

I-Mem    Control    Register Read or Write    ALU    D-Mem Read or Write


200 ps 150 ps 90 ps 90 ps 250 ps

(1) How much energy is spent to execute an ADD instruction in a single-cycle design and in
the 5-stage pipelined design?
(2) What is the worst case MIPS instruction in terms of energy consumption, and what is the
energy spent to execute it?
(3) If energy reduction is important, how would you change the pipelined design? What is the
percentage reduction in the energy spent by an LW instruction after this change?
(4) What is the performance impact (clock cycle time) of your changes from (3).

Answer
(1) The energy for the two designs is the same: I-Mem is read, two registers are read, and a
register is written. We have:
140 pJ + 2 × 70 pJ + 60 pJ = 340 pJ
(2) Because the sum of memory read and register write energy is larger than memory write
energy, the worst-case instruction is a load instruction. For the energy spent by a load, we
have:
140 pJ + 2 × 70 pJ + 140 pJ + 60 pJ = 480 pJ
(3) We can avoid reading registers whose values are not going to be used. To do this, we must
add RegRead1 and RegRead2 control inputs to the Registers unit to enable or disable each
register read. With these new control signals, a lw instruction results in only one register
read (we still must read the register used to generate the address), so we have:

Energy before change Energy saved by change % Savings


140pJ + 2 × 70pJ + 140pJ + 60pJ = 480pJ 70 pJ 14.6%

(4) After the change, the latencies of Control and Register Read cannot be overlapped. This
increases the latency of the ID stage and could affect the processor’s clock cycle time if
the ID stage becomes the longest-latency stage. We have:

Clock cycle time before change Clock cycle time after change
250 ps (D-Mem in MEM stage) No change (150 ps + 90 ps < 250 ps)

106 年中興資工

1. True or False
(1) An interrupt is an event that causes an unexpected change in control flow but comes from
outside of the processor.
(2) Pipelining improves instruction throughput rather than individual instruction execution
time.
(3) DMA is implemented with a specialized controller that transfers data between an I/O device
and memory dependent on the processor.
(4) The forwarding technique can eliminate all pipeline stalls caused by data dependency.
(5) There are six data hazards in this program:
add R2, R4, R4
sub R1, R2, R1
add R3, R1,R2
add R1, R1, R3
Answer
(1) (2) (3) (4) (5)
T T F F F

Note (3): DMA transfers data between an I/O device and memory independently of the processor.
Note (4): The stall caused by a load-use data hazard cannot be eliminated by forwarding.
Note (5): There are 5 data hazards: (1, 2), (1, 3), (2, 3), (2, 4), (3, 4).

2. We assume that there is a computer having the characteristics shown in Table 1, where X
denotes that the stage is not required. What is its CPI with the variable clock design? If this
computer is implemented with the single clock design, what is its CPI? What is the improvement
in performance?
Table 1
            Distribution  IF    ID    EXE   MEM   WB
Load        30%           2 ns  1 ns  3 ns  1 ns  1 ns
Store       15%           2 ns  1 ns  3 ns  2 ns  X
Arithmetic  40%           2 ns  1 ns  1 ns  X     1 ns
Branch      15%           2 ns  1 ns  2 ns  X     X

Answer
CPI for variable clock design (multi-cycle machine) = 0.3 × 5 + 0.15 × 4 + 0.4 × 4 + 0.15 × 3 = 4.15
CPI for single clock design = 1
Instruction time for single clock design = 2 + 1 + 3 + 1 + 1 = 8 ns
Instruction time for variable clock design = 4.15 × 3 ns = 12.45 ns
Improvement in performance (speedup) = 12.45 / 8 = 1.56
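The comparison follows from the stage counts and latencies (a sketch; the 3 ns EXE stage sets the variable-design cycle time, and the load instruction sets the single-cycle time):

```python
# Variable-clock (multicycle) vs. single-clock comparison.
mix = {"load": (0.30, 5), "store": (0.15, 4), "arith": (0.40, 4), "branch": (0.15, 3)}
cpi_variable = sum(f * n for f, n in mix.values())
time_variable = cpi_variable * 3.0     # ns; the longest stage is 3 ns
time_single = 2 + 1 + 3 + 1 + 1        # ns; set by the longest instruction (load)
print(cpi_variable, time_variable, round(time_variable / time_single, 2))
```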

3. Assume that there is a processor with a CPI of 3 without memory stalls and a memory stall
penalty of 150 clock cycles. If 8% of the instructions of a program access data memory and the
miss rate of the cache is 0.08, find the CPI with memory stalls.
Answer
CPI with memory stalls = 3 + 1.08 × 0.08 × 150 = 15.96 (1.08 memory accesses per instruction: one instruction fetch plus 0.08 data accesses)

106 年台科大資工

1. A compiler designer is trying to decide between two code sequences for a particular computer.
The hardware designers have supplied the following facts:
CPI for instruction class
A B C
CPI 7 3 1

For a particular high-level-language statement, the compiler writer is considering two code
sequences that require the following instruction counts:

Instruction counts for instruction class


Code Sequence
A B C
1 1 1 8
2 2 1 1

(1) Which code sequence executes the most instructions?


(2) Which will be faster?
(3) What is the CPI for each sequence?
Suppose we measure the code for the same program from two different compilers and obtain
the following data:

Instruction counts (in billions) for instruction class


Code from
A B C
Compiler 1 3 1 1
Compiler 2 1 3 4

Assume that the computer's clock rate is 5 GHz.


(4) Which code sequence will execute faster according to execution time?
(5) Which code sequence will execute faster according to MIPS?
Answer:
(1) Instruction count for code sequence 1 = 1 + 1 + 8 = 10
Instruction count for code sequence 2 = 2 + 1 + 1 = 4
Hence, code sequence 1 executes the most instructions
(2) Clock cycles for code sequence 1 = 1 × 7 + 1 × 3 + 8 × 1 = 18
Clock cycles for code sequence 2 = 2 × 7 + 1 × 3 + 1 × 1 = 18
Hence, code sequence 1 is as fast as code sequence 2
(3) CPI for code sequence 1 = 18 / 10 = 1.8
CPI for code sequence 2 = 18 / 4 = 4.5
(4) Execution time for compiler 1 = [(3 × 7 + 1 × 3 + 1 × 1) × 10^9] / (5 × 10^9) = 5 s
Execution time for compiler 2 = [(1 × 7 + 3 × 3 + 4 × 1) × 10^9] / (5 × 10^9) = 4 s
According to execution time, the code from compiler 2 is faster.
(5) MIPS for compiler 1 = (3 + 1 + 1) × 10^9 / (5 × 10^6) = 1000
MIPS for compiler 2 = (1 + 3 + 4) × 10^9 / (4 × 10^6) = 2000
According to MIPS, the code from compiler 2 is faster.
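Parts (4) and (5) can be reproduced together (a sketch; counts are in billions of instructions, and MIPS divides the instruction count by execution time times 10^6):

```python
# Execution time and MIPS at a 5 GHz clock for the two compilers.
cpis = (7, 3, 1)                 # CPIs for instruction classes A, B, C
results = {}
for name, counts in {"compiler 1": (3, 1, 1), "compiler 2": (1, 3, 4)}.items():
    seconds = sum(n * c for n, c in zip(counts, cpis)) * 1e9 / 5e9
    mips = sum(counts) * 1e9 / seconds / 1e6
    results[name] = (seconds, mips)
print(results)
```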

2. Consider three different cache configurations below:


Cache 1: direct-mapped with two-word blocks.
Cache 2: two-way set associative with four-word blocks and LRU replacement.
Cache 3: fully associative with eight-word blocks and LRU replacement.
Assume that each cache has a total data size of 32 32-bit words and all of them are initially empty.
A 20-bit word address is used. Consider the following sequence of address references given as word
addresses: 36, 25, 2, 32, 53, 23, 31, 62, 3, 56 and 36. For caches 1, 2, and 3, please label each
reference in the list as a hit or a miss.
Answer:
Cache 1: number of blocks in cache = 32 / 2 = 16
Word address Block address Tag Index Hit/Miss
36 18 1 2 Miss
25 12 0 12 Miss
2 1 0 1 Miss
32 16 1 0 Miss
53 26 1 10 Miss
23 11 0 11 Miss
31 15 0 15 Miss
62 31 1 15 Miss
3 1 0 1 Hit
56 28 1 12 Miss
36 18 1 2 Hit

Cache 2: number of blocks in cache = 32 / 4 = 8; number of sets in cache = 8 / 2 = 4


Word address Block address Tag Set index Hit/Miss
36 9 2 1 Miss
25 6 1 2 Miss
2 0 0 0 Miss
32 8 2 0 Miss
53 13 3 1 Miss
23 5 1 1 Miss
31 7 1 3 Miss
62 15 3 3 Miss
3 0 0 0 Hit
56 14 3 2 Miss
36 9 2 1 Miss

Cache 3: number of blocks in cache = 32 / 8 = 4.


Word address Block address Tag Hit/Miss
36 4 4 Miss
25 3 3 Miss
2 0 0 Miss
32 4 4 Hit
53 6 6 Miss
23 2 2 Miss
31 3 3 Miss
62 7 7 Miss
3 0 0 Miss
56 7 7 Hit
36 4 4 Miss
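The three hit/miss tables can be reproduced with a small word-addressed cache simulator. This is an illustrative sketch added for verification; the cache geometries and the reference stream come from the problem, and `ways` equal to the block count models the fully associative case.

```python
# Minimal LRU cache simulator (sketch) to check the tables above.
def simulate(refs, total_words, block_words, ways):
    """Return a Hit/Miss label per word-address reference.
    ways=1 is direct mapped; ways == num_blocks is fully associative."""
    num_blocks = total_words // block_words
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each set holds tags, MRU last
    labels = []
    for addr in refs:
        block = addr // block_words
        s, tag = block % num_sets, block // num_sets
        if tag in sets[s]:
            sets[s].remove(tag)
            sets[s].append(tag)            # refresh LRU position
            labels.append("Hit")
        else:
            if len(sets[s]) == ways:
                sets[s].pop(0)             # evict least recently used
            sets[s].append(tag)
            labels.append("Miss")
    return labels

refs = [36, 25, 2, 32, 53, 23, 31, 62, 3, 56, 36]
print(simulate(refs, 32, 2, 1))   # Cache 1: hits at 3 and the final 36
print(simulate(refs, 32, 4, 2))   # Cache 2: hit at 3 only
print(simulate(refs, 32, 8, 4))   # Cache 3: hits at 32 and 56
```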

3. (1) Given the following MIPS code segment, please answer each question below
16 L1: addi $t0, $t0, 4
20 lw $s1, 0($t0)
24 sw $s1, 32($t0)
28 lw $s1, 64($t0)
32 slt $s0, $t1, $zero
Given a pipelined processor with 5 stages: IF, ID, EX, MEM, WB. Assume no forwarding
unit is available. There are hazards in the code; detect the hazards and point
out where to insert no-ops (or bubbles) to make the pipelined datapath execute the
code correctly. You don't need to rewrite the entire code segment; simply
indicate the locations where you would insert no-ops. For example, if you want to
insert 6 no-ops between the addi at address 16 and the lw at address 20, you
can state something like "6 no-ops between 16 and 20".
(2) Assume the following code is executed on a MIPS CPU with 5 pipeline stages and data
forwarding capability. If this CPU uses an "always assume branch not taken" strategy to
handle branch instructions, but the branch is taken in this example, how many clock cycles
are required to complete this program?
lw $4, 50($7)
beq $1, $4, 3
add $5, $3, $4
sub $6, $4, $3
or $7, $5, $2
slt $8, $5, $6
Answer:
(1) 2 no-ops between 16 and 20 (the lw reads $t0 before the addi has written it back)
2 no-ops between 20 and 24 (the sw reads $s1 before the lw has written it back)
No other no-ops are needed: the lw at 28 reads $t0 long after the addi has written it,
and the slt at 32 reads only $t1 and $zero, which no earlier instruction writes.
(2) 3 instructions complete (lw, beq, and the branch target slt); 2 stalls between lw and beq
(load-use data hazard, since beq needs $4 early); 1 instruction is flushed (wrong guess)
Total clock cycles = (5 – 1) + 3 + 2 + 1 = 10
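The cycle count in part (2) follows a simple accounting identity: a k-stage pipeline needs (k − 1) fill cycles, plus one cycle per committed instruction, stall, and flushed instruction. A minimal sketch (function name is illustrative, not from the original):

```python
# Sketch: total cycles for an in-order k-stage pipeline.
def pipeline_cycles(stages, executed, stalls, flushed):
    """(stages - 1) fill cycles, then one cycle per pipeline slot used."""
    return (stages - 1) + executed + stalls + flushed

# Part (2): lw, beq, slt commit; 2 load-use stalls before beq;
# 1 wrong-path instruction flushed when the branch resolves taken.
print(pipeline_cycles(5, 3, 2, 1))  # 10
```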

106 年台科大電子

1. Convert the following numbers


(1) 7B3₁₆ into base 8.
(2) 4753₈ into base 16.
Answer: (1) 7B3₁₆ = 0111 1011 0011₂ = 3663₈
(2) 4753₈ = 100 111 101 011₂ = 9EB₁₆
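Both conversions go through binary: regroup the bits in fours for hexadecimal and in threes for octal. A quick sketch using Python's built-in base handling confirms the results:

```python
# Check: both groupings describe the same integer.
assert int("7B3", 16) == int("3663", 8) == 0b011110110011
assert int("4753", 8) == int("9EB", 16) == 0b100111101011
print(oct(int("7B3", 16)))   # 0o3663
print(hex(int("4753", 8)))   # 0x9eb
```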

2. Please explain the following terms with examples:


(1) Structure hazard.
(2) Data hazard.
(3) Control hazard.
Answer:
(1) Structural hazards: hardware cannot support the instructions executing in the same clock cycle
(limited resources)
(2) Data hazards: attempt to use item before it is ready (data dependency)
(3) Control hazards: attempt to make a decision before condition is evaluated (branch instruction)
Example program:
1 lw $5, 50($2)
2 add $2, $5, $4
3 add $4, $2, $5
4 beq $8, $9, L1
5 sub $16, $17, $18
6 sw $5, 100($2)
7 L1:
(1) Structural hazard: assume the instructions above execute on a datapath with a single
memory. In clock cycle 4, instruction 1 reads its data operand from memory at the same
time that instruction 4 is being fetched from that memory; two instructions access the
one memory simultaneously, so a structural hazard occurs.
(2) Data hazard: instruction 1 does not write its result back to register $5 until clock
cycle 5, but instructions 2 and 3 need the contents of $5 in clock cycles 3 and 4,
respectively. They would read a stale value of $5, so a data hazard occurs.
(3) Control hazard: instruction 4 does not know whether the branch is taken until its MEM
stage completes; if taken it should fetch instruction 7, otherwise instruction 5. But by
the time instruction 4 is in MEM, instructions 5 and 6 have already entered the EX and
ID stages. If the branch is taken, a control hazard occurs.

3. Multi-core has become a popular technology for new-generation processors. However, the
performance gained by using a multi-core processor does not scale linearly with the
number of cores inside the chip. Why?
Answer: The improvement in performance gained by the use of a multi-core processor depends
very much on the software algorithms used and their implementation. In particular,
possible gains are limited by the fraction of the software that can run in
parallel simultaneously on multiple cores; this effect is described by Amdahl's law. In the
best case, so-called embarrassingly parallel problems may realize speedup factors near the
number of cores.
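Amdahl's law, mentioned in the answer, can be stated as a one-line formula: with a fraction p of the work parallelizable across n cores, speedup = 1 / ((1 − p) + p/n). A sketch (function name is illustrative):

```python
# Amdahl's law: the serial fraction (1 - p) bounds the speedup.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# A 90%-parallel program on 16 cores falls well short of 16x,
# and the limit as n grows is 1 / (1 - p) = 10x.
print(round(amdahl_speedup(0.9, 16), 2))  # 6.4
```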

4. For the memory hierarchy, describe the two different types of locality.
(1) Temporal locality.
(2) Spatial locality.
Answer:
(1) Temporal locality: if an item is referenced, it will tend to be referenced again soon.
(2) Spatial locality: if an item is referenced, items whose addresses are close by will tend to
be referenced soon.

5. For a 1-bit full adder with inputs A, B, Cin and outputs SUM and Carry:
(1) Please write down its truth table and express SUM and Carry in Sum-of-minterms.
(2) Please use K-map to obtain the minimal sum-of-product form.
Answer:
(1)
A B Cin (C) SUM Carry
0 0 0 0 0
0 0 1 1 0
0 1 0 1 0
0 1 1 0 1
1 0 0 1 0
1 0 1 0 1
1 1 0 0 1
1 1 1 1 1
SUM = A'B'C + A'BC' + AB'C' + ABC
Carry = A'BC + AB'C + ABC' + ABC

(2) (K-map grids not reproduced; only the results are recoverable from the figure.)
SUM = A'B'C + A'BC' + AB'C' + ABC (SUM is A ⊕ B ⊕ C; its K-map has no adjacent 1s, so
the sum-of-minterms is already minimal)
Carry = AB + AC + BC
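The two minimized expressions can be verified exhaustively against binary addition over all eight input combinations. A sketch (function name is illustrative):

```python
from itertools import product

def full_adder_sop(A, B, C):
    """SUM and Carry from the minimized SOP forms above."""
    s = A ^ B ^ C                        # SUM = A'B'C + A'BC' + AB'C' + ABC
    c = (A & B) | (A & C) | (B & C)      # Carry = AB + AC + BC
    return s, c

# Exhaustive check: SUM/Carry must equal the two bits of A + B + Cin.
for A, B, C in product((0, 1), repeat=3):
    total = A + B + C
    assert full_adder_sop(A, B, C) == (total & 1, total >> 1)
print("SOP forms match binary addition")
```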

106 年台師大資工

1. Convert the following hexadecimal numbers to binary numbers:


(1) 3A01₁₆
(2) 1FC6₁₆
Answer:
(1) 3A01₁₆ = 0011 1010 0000 0001₂
(2) 1FC6₁₆ = 0001 1111 1100 0110₂

2. Consider the MIPS code sequence shown below:


LW R1, 20(R12); R1 ← MEM[R12 + 20]
LW R2, 24(R12); R2 ← MEM[R12 + 24]
SUB R3, R2, R1; R3 ← R2 - R1
ADD R4, R8, R3; R4 ← R8 + R3
LW R10, 28(R12); R10 ← MEM[R12 + 28]
ADD R5, R10, R4; R5 ← R10 + R4
SW R5, 32(R12); MEM[R12 + 32] ← R5
LW R11, 36(R12); R11 ← MEM[R12 + 36]
The code sequence is executed by a MIPS CPU with separate instruction and data memories.
Suppose the execution of the code sequence takes 16 clock cycles. The clock cycle time is
1 ns.
(1) Find the clock cycles per instruction (CPI) of the code sequence,
(2) Find the CPU time of the code sequence,
(3) Find the number of accesses to the instruction memory,
(4) Find the number of accesses to the data memory.
(5) Identify all the instructions where the ALU of the CPU is used for the computation of
memory addresses for data accesses.
Answer
(1) CPI = 16 / 8 = 2
(2) CPU time = 16 × 1 ns = 16 ns
(3) The number of accesses to the instruction memory is 8
(4) The number of accesses to the data memory is 5
(5) LW R1, 20(R12)
LW R2, 24(R12)
LW R10, 28(R12)
SW R5, 32(R12)
LW R11, 36(R12)

3. Consider a cache with 64 blocks, and a block size of 64 bytes.
(1) Suppose the cache is direct mapped. What cache block number does byte address 16000
map to?
(2) Suppose the cache is direct mapped. What cache block number does byte address 32128
map to?
(3) Suppose the cache is 2-way set associative. How many sets are there in the cache?
(4) Suppose the cache is 2-way set associative. What cache set number does byte address
16000 map to?
Answer
(1) Memory block address = 16000 / 64 = 250
The cache block number to which address 16000 maps is 250 mod 64 = 58
(2) Memory block address = 32128 / 64 = 502
The cache block number to which address 32128 maps is 502 mod 64 = 54
(3) The number of sets in the cache = 64 / 2 = 32
(4) The cache set number to which address 16000 maps is 250 mod 32 = 26
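The mapping arithmetic above (block address = byte address / block size, then modulo the number of blocks or sets) can be sketched directly. Function names are illustrative, not from the original:

```python
# Sketch of direct-mapped and set-associative index arithmetic.
BLOCK_BYTES = 64  # from the problem

def dm_block(addr, num_blocks=64):
    """Direct mapped: block number = (address / block size) mod #blocks."""
    return (addr // BLOCK_BYTES) % num_blocks

def set_index(addr, num_sets=32):
    """2-way set associative: 64 blocks / 2 ways = 32 sets."""
    return (addr // BLOCK_BYTES) % num_sets

print(dm_block(16000), dm_block(32128), set_index(16000))  # 58 54 26
```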

4. Consider a MIPS processor with five-stage (i.e., IF, ID, EX, MEM, WB) pipeline. Suppose the
processor has separate instruction and data memories. The hazard detection and forwarding units
are also employed. Assume there is no structural hazard. There is also no cache miss. Find the
number of clock cycles required by the pipeline for the execution of each of the following
sequences.
(1) ADD R3, R2, R1 ; R3  R2 + R1
ADD R4, R10, R3 ; R4  R10 + R3
(2) LW R6, 12(R4) ; R6  MEM[R4+12]
ADD R8, R10, R6 ; R8  R10 + R6
ADD R12, R6, R8 ; R12  R6 + R8
Answer
(1) (5 – 1) + 2 = 6
(2) (5 – 1) + 3 + 1 = 8

5. Consider a two-level cache. Suppose the hit time of the L1 cache is 1 clock cycle. The local miss
rate of the L1 cache is 25%. The hit time of the L2 cache is 6 clock cycles. The local miss rate
for L2 is 4%. The L2 is connected to main memory. It will take 250 clock cycles to access a
cache block from main memory.
(1) Compute the average memory access time of L2 cache (in clock cycles),
(2) Compute the miss penalty for L1 cache (in clock cycles),
(3) Compute the average memory access time of L1 cache (in clock cycles).
Answer
(1) AMATL2 = 6 + 0.04 × 250 = 16
(2) L1 miss penalty = time to service a miss from L2 = AMATL2 = 16 (L2's hit time alone
understates the penalty, since 4% of L1 misses also miss in L2 and go to main memory)
(3) AMATL1 = 1 + 0.25 × 16 = 5
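The key step is that L1's miss penalty is the full average access time of the L2 level, hit time plus local-miss traffic to memory, not L2's hit time alone. A sketch of that computation (variable names are illustrative):

```python
# Two-level AMAT (sketch): each level's miss penalty is the AMAT below it.
l1_hit, l1_miss_rate = 1, 0.25    # cycles, local miss rate
l2_hit, l2_miss_rate = 6, 0.04
mem_access = 250

amat_l2 = l2_hit + l2_miss_rate * mem_access   # 6 + 0.04 * 250 = 16 cycles
amat_l1 = l1_hit + l1_miss_rate * amat_l2      # 1 + 0.25 * 16  = 5 cycles
print(amat_l2, amat_l1)
```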

106 年彰師大電子、資工

1. Consider the following C code (procedure swap(v, j) swaps v[j] and v[j+1]).
void sort(int v[], int n)
{
  int i, j;
  for (i = 0; i < n; i += 1) {
    for (j = i - 1; j >= 0 && v[j] > v[j+1]; j -= 1) {
      swap(v, j);
    }
  }
}
(1) Assume n = 5 and v[0] to v[4] are initially 3, 5, 1, 9, 7. List the temporary results of v[ ] at
the end of i = 0, 1, 2, 3, 4, respectively. In this example, how many swaps are actually
invoked totally?
(2) Implement the outer loop (i loop) with MIPS assembly language (Assume i is in $s0, j is in
$s1, v is in $s2, and n is in $s3).
Answer:
(1)
V[0] V[1] V[2] V[3] V[4] No. of swap
i=0 3 5 1 9 7 0
i=1 3 5 1 9 7 0
i=2 1 3 5 9 7 2
i=3 1 3 5 9 7 0
i=4 1 3 5 7 9 1
swap is invoked 3 times in total
(2)
move $s0, $zero
for1tst: slt $t0, $s0, $s3
beq $t0, $zero, exit1
...
(body of first for loop)
...
addi $s0, $s0, 1
j for1tst
exit1:
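The trace in part (1) can be reproduced by running the insertion sort directly and counting the swaps. A sketch in Python (the helper name is illustrative):

```python
# Sketch of the C sort: insertion sort with an explicit swap counter.
def sort_trace(v):
    swaps = 0
    for i in range(len(v)):
        j = i - 1
        while j >= 0 and v[j] > v[j + 1]:
            v[j], v[j + 1] = v[j + 1], v[j]   # swap(v, j)
            swaps += 1
            j -= 1
    return v, swaps

print(sort_trace([3, 5, 1, 9, 7]))  # ([1, 3, 5, 7, 9], 3)
```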

2. Transform the decimal real into IEEE754 single precision and vice versa.
(1) -1.625 (decimal real) note: use hex format to shorten your answer
(2) C0A00000 (IEEE754 single precision in hex)
Answer:
(1) BFD00000₁₆
-1.625₁₀ = -1.101₂ × 2⁰
S E F
1 01111111 10100000000000000000000
(2) -5.0₁₀
C0A00000₁₆ = 1 10000001 01000000000000000000000₂
(exponent field 10000001₂ = 129, so the exponent is 129 - 127 = 2)
→ -1.01₂ × 2² = -101₂ = -5₁₀

3. Consider 2 different implementations P1 and P2 of the same Instruction Set Architecture (ISA).
There are 4 classes of instructions A, B, C and D, as shown in the following table.
Clock rate CPIA CPIB CPIC CPID
P1 1.5 GHz 1 2 3 4
P2 2 GHz 2 2 2 2
(1) Given a program with 10⁶ instructions divided into classes as follows: 10% class A, 20%
class B, 50% class C and 20% class D, which implementation is faster?
(2) What is the global CPI for each implementation?
(3) Find the clock cycles required in both cases.
Answer:
(1) CPI for P1 = 1 × 0.1 + 2 × 0.2 + 3 × 0.5 + 4 × 0.2 = 2.8
The execution time for P1 = 10⁶ × 2.8 / 1.5 GHz ≈ 1.87 ms
CPI for P2 = 2
The execution time for P2 = 10⁶ × 2 / 2 GHz = 1 ms
P2 is faster than P1
(2) Global CPI for P1 = 2.8
Global CPI for P2 = 2
(3) The clock cycles for P1 = 2.8 × 10⁶ = 2.8M
The clock cycles for P2 = 2 × 10⁶ = 2M

4. Explain the following terminologies/registers based on your understanding.


(1) assembler (2) IEEE double precision (3) $sp (4) $ra (5) pseudo-instruction.
Answer:
(1) A program that translates a symbolic version of instructions into the binary version.
(2) A floating-point value represented in two 32-bit words.
(3) A register that holds the address of the top of the stack.
(4) A register that holds the return address of a procedure call (the point in the caller at
which execution resumes after the procedure returns).
(5) A common variation of an assembly language instruction that is often treated as if it were
an instruction in its own right.

5. Explain the following terminologies.
(1) Program counter (PC)
(2) Exception program counter (EPC)
(3) Cache miss
(4) Page fault
Answer:
(1) Program counter (PC): The register containing the address of the instruction in the program
being executed
(2) Exception program counter (EPC): A register which is used to store the address of the
offending instruction that was executing when the exception was generated.
(3) Cache miss: A request for data from the cache that cannot be filled because the data is not
present in the cache.
(4) Page fault: An event that occurs when an accessed page is not present in main memory.

6. What are the four steps in executing a MIPS instruction?


Answer:
1. Fetch the instruction from memory and increment the PC.
2. Decode the instruction and read the source registers.
3. Perform the ALU operation: an arithmetic/logic result, a memory-address calculation, or a
branch-target and condition computation.
4. Access data memory for loads and stores, and/or write the result back to the register file.

7. Explain the four-step process in handling an instruction cache miss.


Answer:
1. Send the original PC value (current PC – 4) to the memory.
2. Instruct main memory to perform a read and wait for the memory to complete its access.
3. Write the cache entry, putting the data from memory in the data portion of the entry, writing
the upper bits of the address (from the ALU) into the tag field, and turning the valid bit on.
4. Restart the instruction execution at the first step, which will refetch the instruction, this time
finding it in the cache.

8. For the single-cycle design in Fig. 1, determine what instructions (add, lw, sw, or beq) will be
affected and cannot execute correctly if some functional unit is removed. Fill in the blanks in
Table 1 with a letter “X” if the corresponding instruction is affected.
Table 1
Unit add lw sw beq
“Sign-extend” removed
“Shift left 2” removed
“Mux A” removed
“Mux B” removed

Fig. 1 (single-cycle MIPS datapath; figure not reproduced): PC, instruction memory, register
file, sign-extend unit, shift-left-2 unit, ALU with ALU control, data memory, the main control
unit (RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite, PCSrc), and the
multiplexers labeled Mux A and Mux B.
Answer:
Unit                     add   lw   sw   beq
"Sign-extend" removed          X    X    X
"Shift left 2" removed                   X
"Mux A" removed           X    X
"Mux B" removed           X    X    X    X
