106 NTU CSIE
106 NTU EE
106 UST EE (joint NTHU/NCTU/NCU)
106 NTHU CS
106 NCTU CS (joint admission)
106 NCKU EE
106 NCKU CS (joint admission)
106 NCKU Computer & Communication
106 NCU CSIE
106 NSYSU EE
106 NSYSU CSE
106 CCU EE
106 NCHU EE
106 NCHU CS
106 NTUST CSIE
106 NTUST Electronic Engineering
106 NTNU CSIE
106 NCUE Electronics & CS
106 NTU CSIE
1. For a 5-stage MIPS processor with separate L1 data/instruction caches (32KB/256KB), the
latency of each stage is listed below:
Pipeline IF ID EXE MEM WB
Latency 150ps 100ps 50ps 200ps 100ps
The CPI with a perfect cache is 1 and both I/D caches are blocking caches. The cache miss
penalty is 200 CPU cycles. For the target benchmark suite, the I-Cache miss rate is 5%, D-cache
miss rate is 10% and the frequency of the load/store instructions is 35%. The architecture team is
considering increasing the data cache size to 512KB, which would reduce the data miss rate to
5% for the target benchmark but increase the MEM pipeline stage latency to 250ps. Please
justify whether increasing the data cache size is a good design decision.
Answer:
CPI_old = 1 + 1 × 0.05 × 200 + 0.35 × 0.1 × 200 = 18
Instruction time_old = 18 × 200 ps = 3600 ps
CPI_new = 1 + 1 × 0.05 × 200 + 0.35 × 0.05 × 200 = 14.5
Instruction time_new = 14.5 × 250 ps = 3625 ps
Since 3625 ps > 3600 ps, increasing the data cache size is not a good design decision.
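The CPI and per-instruction-time arithmetic above can be double-checked with a short script (a sketch; the function and variable names are mine):

```python
# Miss-penalty model from the question: base CPI 1, 200-cycle miss penalty,
# every instruction accesses the I-cache, 35% also access the D-cache.
def cpi(i_miss, d_miss, ls_freq, penalty=200):
    return 1 + 1 * i_miss * penalty + ls_freq * d_miss * penalty

old = cpi(0.05, 0.10, 0.35) * 200   # cycle time = max stage latency, 200 ps
new = cpi(0.05, 0.05, 0.35) * 250   # bigger D-cache, but MEM stage now 250 ps
print(old, new)                      # ≈ 3600 ps vs ≈ 3625 ps per instruction
```

The larger cache lowers CPI from 18 to 14.5, but the slower clock more than cancels the gain.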
2. A stride-based prefetching scheme exploits regular streams of memory accesses to hide memory
latency. Which of the following code segments can get more benefit from stride-based
prefetching? Please explain why.
Code A:
int Y[100];
int i, x;
for (i = 0; i < 100; i++) {
    Y[i] = Y[i] + x;
    x++;
}
Code B:
struct node {
int x;
struct node *next;
};
while (node->next != 0) {
node->x = 1;
node = node->next;
}
Answer:
Code A benefits more from stride-based prefetching.
A stride-based prefetcher, on a miss, prefetches the address that is offset by a fixed
distance (the stride) from the missed address. In Code A, the elements of the array are placed
sequentially in memory, so stride-based accesses can be prefetched efficiently. Conversely, the
nodes in Code B are scattered across the heap, so stride prefetching is ineffective for it.
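A minimal sketch (mine, not part of the exam) of the decision a stride prefetcher makes: remember the last address and stride of a stream, and prefetch one stride ahead once the same stride repeats.

```python
class StridePrefetcher:
    """Remember the last address and stride of an access stream; once the
    same stride is seen twice in a row, prefetch one stride ahead."""
    def __init__(self):
        self.last = None
        self.stride = None

    def access(self, addr):
        prefetch = None
        if self.last is not None:
            stride = addr - self.last
            if stride == self.stride:      # stride confirmed: predict next
                prefetch = addr + stride
            self.stride = stride
        self.last = addr
        return prefetch

# Code A: int Y[100] is contiguous, so accesses have a constant 4-byte stride
# and almost every access after warm-up issues a useful prefetch.
p = StridePrefetcher()
issued = [p.access(a) for a in range(0, 40, 4)]
useful = sum(x is not None for x in issued)   # 8 of the 10 accesses prefetch
```

For Code B the successive node addresses have no fixed stride, so the `stride == self.stride` test almost never fires.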
3. (Multiple Choices: no partial points) Figure 1 shows the roofline model of the target architecture
(peak memory bandwidth:16GB/s and peak floating-point performance: 16GFLOPS/s). If the
arithmetic intensity of an application falls between 1/2 FLOPs/byte and 2 FLOPs/byte in different
program phases, which of the following techniques could be adopted to improve this application's
performance:
(a) Software prefetching
(b) Loop unrolling
(c) Apply SIMD (Single instruction Multiple Data)
(d) Rewrite the codes for better data locality
Figure 1
Answer: (a), (b), (c)
4. (Multiple, Choices: no partial points) For applications with high data-level parallelism, which of
the following architectural features are effective for performance improvement:
(a) SIMD (Single instruction Multiple Data)
(b) SIMT (Single instruction Multiple Threads)
(c) VLIW (Very Long instruction Word)
(d) SMT (Simultaneous Multi-Threading)
(e) Vector instruction Extension
Answer: (a), (b), (e)
5. For the MIPS instruction set (32-bit instruction/32 registers), if the register file size is increased
to 128 registers, what is the total number of bits needed for a R-type format instruction (register
instructions)? Could more registers decrease the size of a MIPS assembly program? Why?
Answer:
The total bits for a R-type instruction is 38.
Opcode Rs Rt Rd Shift amount Funct. code
6 7 7 7 5 6
Yes. More registers → less register spilling → fewer load/store instructions, which could reduce
the program size.
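The field widths can be tallied quickly (a sketch; the dictionary layout is mine):

```python
# R-type field widths once the three register fields grow from 5 to 7 bits
# (128 registers need 7 address bits); opcode, shamt, and funct are unchanged.
fields = {"opcode": 6, "rs": 7, "rt": 7, "rd": 7, "shamt": 5, "funct": 6}
total = sum(fields.values())   # 38 bits
```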
Benchmark | Machine 1 Seconds | Machine 1 SPECratio | Machine 2 Seconds | Machine 2 SPECratio | Observed CPI
400.perlbench | 457 | 21.4 | 424 | 23.0 | 0.75
401.bzip2 | 539 | 17.9 | 622 | 15.5 | 0.85
403.gcc | 338 | 23.9 | 293 | 27.5 | 1.72
429.mcf | 280 | 32.6 | 204 | 44.7 | 10.00
445.gobmk | 527 | 19.9 | 451 | 23.3 | 1.09
456.hmmer | 270 | 34.6 | 440 | 21.2 | 0.80
458.sjeng | 657 | 18.4 | 511 | 23.7 | 0.96
462.libquantum | 96.5 | 215 | 190 | 109.0 | 1.61
464.h264ref | 847 | 26.1 | 613 | 36.1 | 0.80
471.omnetpp | 310 | 20.2 | 280 | 22.4 | 2.94
473.astar | 380 | 18.5 | 447 | 15.7 | 1.79
483.xalancbmk | 234 | 29.5 | 191 | 36.2 | 2.70
6. The above table gives measurements of running SPECint2006 on one Intel machine and one
AMD machine. Execution time is reported in seconds. SPECratio is the reference time, supplied
by SPEC, divided by the measured execution time.
(a) Which machine is faster? How much faster? (You must show your calculation, without
showing the calculation, your answer can only receive partial credits)
(b) Which benchmarks you would target for some hardware/software optimizations? Suggest a
few architecture and/or compiler optimizations that might speed up your selected programs.
Answer:
(a) AM for machine 1
= (457 + 539 + 338 + 280 + 527 + 270 + 657 + 96.5 + 847 + 310 + 380 + 234) / 12 =
411.29
AM for machine 2
= (424 + 622 + 293 + 204 + 451 + 440 + 511 + 190 + 613 + 280 + 447 + 191) / 12 = 388.83
Machine 2 is 411.29 / 388.83 = 1.06 times faster than machine 1
(b) Benchmark 429.mcf might be sped up by compiler optimization techniques such as
common-subexpression elimination, unreachable-code elimination, dead-code elimination,
and loop optimization.
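The averages in (a) can be reproduced in a couple of lines (a sketch; the list names are mine):

```python
# Execution times in seconds, transcribed from the table above.
m1 = [457, 539, 338, 280, 527, 270, 657, 96.5, 847, 310, 380, 234]
m2 = [424, 622, 293, 204, 451, 440, 511, 190, 613, 280, 447, 191]

am1 = sum(m1) / len(m1)   # ≈ 411.3 s
am2 = sum(m2) / len(m2)   # ≈ 388.8 s
speedup = am1 / am2       # machine 2 is ≈ 1.06x faster
```

Note that SPEC officially summarizes results with the geometric mean of SPECratios; the arithmetic mean of raw seconds is what this answer uses.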
7. New mobile phones are required to handle larger data sets; for example, recording 4K video and
taking high-resolution pictures often drives the need for 256GB of on-board memory. Buying
larger memory configurations or extended storage with lightning flash drives are costly options.
Recently, some companies offer CME (Cloud Memory Extension) solutions to make use of WiFi
and Cloud storage to extend the memory of your mobile phones. The idea is to use the cloud
storage as the secondary storage, and uses part of the on-board memory as a cache for the
storage. Frequently used videos, photos, music can be cached and less frequently used files can
be transferred to the cloud storage using WiFi. This approach is similar to the idea of disk caches
which are often used to speed up disk I/O. Please answer the following questions of this CME
cache design.
(a) How is this CME cache different from on-chip caches in processor?
(b) What should be the block size of a CME cache line? Why?
(c) What should be the prefetch policy for such CME caches?
(d) Should the cache be fully associative, set associative or direct-mapped? Why?
(e) If set associative is the choice of (d), please suggest a replacement algorithm that works
better than LRU. Explain why it could be better.
(f) What should be used as the CME cache tag?
Answer:
(a) The CME cache is part of the on-board memory, while on-chip caches are SRAMs residing
in the processor chip.
(b) 4K video has two major resolutions: 3840 × 2160 and 4096 × 2160. An appropriate block
size is one frame, so the block size should be 3840 × 2160 × 4 B ≈ 33.18 MB or
4096 × 2160 × 4 B ≈ 35.39 MB.
(c) Sequential prefetching, which is mainly useful when accesses are to contiguous locations,
could be used for CME caches.
(d) Direct-mapped is a proper choice for CME cache because multimedia data exploit spatial
locality rather than temporal locality. That is, each item to be accessed has the same
probability. In addition, the hit time and cost for direct-mapped cache are lower than that of
fully or set associative cache.
(e) A random replacement algorithm can work better than LRU, since each item is accessed
with roughly the same probability and random replacement is cheaper and simpler to
implement in hardware.
(f) Video or picture identity (name) can be used as the CME cache tag.
8. Branch instructions can be divided into conditional and unconditional branches, or direct and
indirect branches. For example, bne $s1, $s2, 25 is a conditional and direct branch in the MIPS
architecture. So a branch instruction can be classified as one of the following four types:
(1) Conditional direct branch
(2) Conditional indirect branch
(3) Unconditional direct branch
(4) Unconditional indirect branch
(a) For each branch instruction type, please give one C construct for which the compiler would
generate that branch instruction.
(b) Which one of the four types is easiest to predict? Which one is most difficult to predict?
(c) What hardware structures are often used to predict type (2) and type (4) branches?
Answer:
(a)
Branch type | C construct | MIPS instruction example
(1) Conditional direct branch | if-then-else statement | beq $s1, $s2, Label
(2) Conditional indirect branch | switch-case statement | beq $s1, $s2, $s3
(3) Unconditional direct branch | for-loop statement | j Label
(4) Unconditional indirect branch | function return statement | jr $ra
Note: beq $s1, $s2, $s3 (if $s1 = $s2, go to the address in $s3)
(b) The easier to predict: unconditional direct branch
The most difficult to predict: conditional indirect branch
(c) VPC (Virtual Program Counter) prediction, which treats an indirect branch as
multiple "virtual" conditional branches and uses the conditional branch predictor, can be
used to predict type (2) and type (4) branches.
106 NTU EE
Multiple-answer questions
1. What are the approaches to reducing the average memory access time in a computer with
memory hierarchy?
(1) reducing the miss rate of cache
(2) using direct mapped cache
(3) increasing the associativity of cache
(4) increasing the cache size
Answer: (1)
Note: AMAT = hit time + miss rate × miss penalty
(2) using a direct-mapped cache reduces hit time but increases miss rate
(3) increasing the associativity reduces miss rate but increases hit time
(4) increasing the cache size reduces miss rate but increases hit time
2. What are the features or facts related to a RISC (reduced instruction set computer) architecture?
(1) Instruction format is short and being one size
(2) The MIPS architecture is a RISC machine
(3) The main operations that affect memory are load and store instructions.
(4) A larger register set comprised of most general purpose registers, as compared to CISC
(Complex Instruction Set Computer) computer
(5) Increase performance through the use of pipelining
Answer: (1), (2), (3), (4), (5)
4. In a pipeline of a microprocessor, state the three types of instructions that would best fill the
branch delay slot and explain under what conditions they improve pipeline performance.
Answer:
As shown in the following figure, there are three ways to schedule the branch delay slot: (a) from
before the branch, (b) from the branch target, and (c) from the fall-through path. (a) is the best
choice; use (b) or (c) only when (a) is impossible (due to a data dependency). (b) is valuable only
when the branch is taken, and is safe only if it is OK to execute that instruction when the branch
is not taken; (c) is valuable only when the branch is not taken, and is safe only if it is OK to
execute that instruction when the branch is taken.
5. A multiprocessor system has four processor cores, each capable of generating 2 loads and 1 store
per clock cycle. The processor clock cycle is 2ns, while the cycle time of the SRAMs used in the
memory system is 4 ns. Calculate the minimum number of memory banks required to allow all
processors to run at full memory bandwidth.
Answer:
3 memory references per processor → 3 × 4 = 12 total memory references per cycle
4 ns / 2 ns = 2 processor cycles pass for one SRAM cycle
Therefore, 2 × 12 = 24 banks are needed
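The same count, spelled out as a script (a sketch; variable names are mine):

```python
cores, loads, stores = 4, 2, 1
refs_per_cpu_cycle = cores * (loads + stores)        # 12 references per cycle
sram_busy_cpu_cycles = 4 // 2                        # 4 ns SRAM / 2 ns CPU clock
banks = refs_per_cpu_cycle * sram_busy_cpu_cycles    # 24 banks
```

Each bank is busy for 2 CPU cycles per access, so 12 new references per cycle need 24 banks to avoid conflicts in the steady state.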
106 UST EE
1. (1) For a 32-bit address Space, calculate the total number of bits required for the cache listed
below: 32 KiB direct-mapped, write-back data cache, 2 words per cache block, and 2
control bits per cache block.
(2) Given the cache size in problem (1), please find the total size of the closest direct-mapped,
write-back data cache with 16-word blocks of equal or greater size. Explain why this data
cache, despite its larger size, might provide slower performance than that in (1).
(3) A generic memory hierarchy would consist of a TLB and a cache. A memory reference can
encounter three different types of misses: a TLB miss, a page fault, and a cache miss,
Consider all the combinations of these three events with one or more occurrences. There are
seven possibilities. For each possibility, please state whether this event can actually occur
and under what circumstances.
Answer
(1) Number of blocks in the cache = 32KiB / 8 bytes = 4K → index = 12 bits
The tag bits = 32 − 12 − 3 = 17
The total number of bits for the cache = (2 + 17 + 64) × 4K = 332 Kbits
(2) Suppose that there are 2^N blocks in the cache.
The total number of bits for the cache = (2 + 32 − N − 6 + 512) × 2^N ≥ 332 Kbits → N ≥ 10.
Pick N = 10 → the total number of bits for the cache = 530 Kbits.
A larger block size increases the miss penalty, which might make performance slower than
in (1).
(3)
TLB | Page table | Cache | Identify/Explain
hit | hit | miss | Possible, although the page table is never really checked if the TLB hits.
miss | hit | hit | TLB misses, but entry found in page table; after retry, data is found in cache.
miss | hit | miss | TLB misses, but entry found in page table; after retry, data misses in cache.
miss | miss | miss | TLB miss is followed by a page fault; after retry, data must miss in cache.
hit | miss | miss | Impossible: cannot have a translation in TLB if page is not present in memory.
hit | miss | hit | Impossible: cannot have a translation in TLB if page is not present in memory.
miss | miss | hit | Impossible: data cannot be allowed in cache if the page is not in memory.
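The bit counting in parts (1) and (2) can be reproduced with a small helper (a sketch; the function name and interface are mine):

```python
def cache_bits(cache_kib, block_words, ctrl_bits, addr_bits=32):
    """Total storage bits for a direct-mapped cache: data + tag + control."""
    block_bytes = block_words * 4
    blocks = cache_kib * 1024 // block_bytes
    index = blocks.bit_length() - 1          # log2(number of blocks)
    offset = block_bytes.bit_length() - 1    # byte offset within a block
    tag = addr_bits - index - offset
    return blocks * (ctrl_bits + tag + block_words * 32)

part1 = cache_bits(32, 2, 2)    # 332 Kbits, matching part (1)
part2 = cache_bits(64, 16, 2)   # 530 Kbits: 1024 blocks of 16 words, part (2)
```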
2. Regarding computer arithmetic, is the following statement "True" or "False"? If True, please
give a brief explanation. If False, please give a counterexample.
(1) Associativity holds for a sequence of two’s complement integer additions, even if the
computation overflows.
(2) Associativity holds for a sequence of floating-point additions.
Answer
(1) True. Two's complement addition is addition modulo 2^32, so for any signed numbers a, b,
and c, (a + b) + c = a + (b + c); even if an intermediate addition overflows, both orders wrap
to the same value.
(2) False. Suppose that the following three numbers are all single precision:
x = −1.5 × 10^38, y = 1.5 × 10^38, z = 1.0
x + (y + z) = −1.5 × 10^38 + (1.5 × 10^38 + 1.0) = −1.5 × 10^38 + 1.5 × 10^38 = 0.0
(x + y) + z = (−1.5 × 10^38 + 1.5 × 10^38) + 1.0 = 0.0 + 1.0 = 1.0
x + (y + z) ≠ (x + y) + z
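Both claims can be demonstrated directly; the floating-point counterexample works even in Python's double precision, since 1.0 is far below one ulp of 1.5 × 10^38 (a sketch; `add32` is my helper):

```python
# Two's complement: 32-bit addition is arithmetic mod 2^32, so grouping never
# matters, even when an intermediate sum overflows.
def add32(a, b):
    return (a + b) & 0xFFFFFFFF

a, b, c = 0x7FFFFFFF, 1, (-1) & 0xFFFFFFFF   # a + b overflows 32 bits
assert add32(add32(a, b), c) == add32(a, add32(b, c))

# Floating point: rounding makes addition non-associative.
x, y, z = -1.5e38, 1.5e38, 1.0
left = x + (y + z)    # y + z rounds back to y, so this is 0.0
right = (x + y) + z   # x + y is exactly 0.0, so this is 1.0
```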
3. When making changes to optimize part of a computer, it is often the case that speeding up one
type of instructions comes at the cost of slowing down something else.
(1) Suppose that floating-point operations take 20% of the original program’s execution time
and the new fast floating-point unit speeds up floating-point operation by, on average, 2
times. Ignoring the penalty to any other instructions, what is the overall speedup?
(2) Suppose that speeding up the floating-point unit would slow down data cache accesses,
resulting in a 1.5 times slowdown. Suppose that data cache accesses consume 10% of the
execution time. What is the overall speedup?
(3) After implementing the new floating-point unit, what percentage of execution time is spent
on floating-point operations? What percentage is spent on data cache accesses?
Answer
(1) Speedup = 1 / (0.2 / 2 + 0.8) = 1.11
(2) Speedup = 1 / (0.2 / 2 + 0.1 × 1.5 + 0.7) = 1.053
(3) The execution-time percentage on floating-point operations = 0.1 / (0.1 + 0.15 + 0.7) = 10.53%
The execution-time percentage on data cache accesses = 0.15 / (0.1 + 0.15 + 0.7) = 15.79%
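This is Amdahl's law with several fractions changed at once; a short check (a sketch; the function is mine):

```python
def speedup(parts):
    """parts: list of (fraction of original time, local speedup factor)."""
    new_time = sum(f / s for f, s in parts)
    return 1.0 / new_time

# (1) FP ops are 20% of time and get 2x faster.
s1 = speedup([(0.2, 2), (0.8, 1)])                    # ≈ 1.11
# (2) ...but D-cache accesses (10% of time) get 1.5x slower.
s2 = speedup([(0.2, 2), (0.1, 1 / 1.5), (0.7, 1)])    # ≈ 1.053
```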
4. A new I-type format instruction swu has been added to the MIPS instruction set. Its format is
swu rt, l(rs). It takes arguments register rt, register rs, and immediate l, and it stores the contents
of R[rt] at the memory address (R[rs] + l) and then increments R[rs] by l.
(1) Given the single-cycle datapath in the following (control signals are marked with dashed
lines), fill in the blanks in the table below for this new instruction. You must give the control
signals (0, 1, 2, X) for the MIPS instruction. Each control signal must be specified as 0, 1, 2
or X (don't care). Writing a 0 or when an X is more accurate is not correct.
Opcode RegDst RegWrite ALUSrc ALUOp MemWrite MemRead MemToReg PCSrc
swu
[Figure: single-cycle MIPS datapath, with control signals (dashed lines): RegDst, RegWrite,
ALUSrc, ALUOp, MemWrite, MemRead, MemToReg, PCSrc]
This new I-type format instruction jin imm16(rs) is developed for the single-cycle processor,
which is a Jump Indirect instruction and will cause the processor to jump to the address stored in
the word at memory location imm16 + R[rs] (the same address computed by lw and sw).
(2) Draw the necessary modifications to implement the jin instruction on your sheet according
to the figure of the single-cycle datapath provided above.
(3) What is/are the new control signal(s) required to implement jin? Why is/are the control
signal(s) necessary?
Consider the 32-bit ALU design shown in the following. Now suppose that we wish to add
hardware support for xor (exclusive-OR). For example, xor $t0, $t1, $t2.
(4) Please clearly describe (a) the necessary changes to the ALU hardware showing the changes
to the ith ALU bit position by modifying the figure below, (b) the corresponding values of
the ALU control signals, and (c) describe briefly how the single-cycle datapath would
operate with these changes.
[Figure: 1-bit ALU slice — inputs a, b, Less, and CarryIn; Binvert selects b or its complement;
a 2-bit Operation select chooses AND (0), OR (1), the adder (2), or Less (3) as Result; CarryOut]
Answer
(1)
Opcode RegDst RegWrite ALUSrc ALUOp MemWrite MemRead MemToReg PCSrc
swu 2 1 1 0 1 0 0 0
(2)
[Figure: the datapath of (1) with an extra multiplexor in front of the PC, controlled by a new
signal Jinc, that selects the data-memory read output (the word at imm16 + R[rs]) as the next PC]
RegWrite MemWrite MemToReg
(3) A new control signal Jinc is needed. When the jump-indirect instruction is executed, Jinc is
set to 1 so that the jump target address read from memory can pass through the multiplexor
to the input of the PC.
(4)
(a) Add an XOR gate computing a XOR b, and connect its output to input 4 of the result
multiplexor (the Operation select widens to 3 bits).
(b) ALU control values:
Operation | Binvert | CarryIn | Op2 Op1 Op0
and | 0 | x | 0 0 0
or | 0 | x | 0 0 1
add | 0 | x | 0 1 0
sub | 1 | 1 | 0 1 0
slt | 1 | 1 | 0 1 1
xor | 0 | x | 1 0 0
(c) Only the setting of the ALU control signals changes; the rest of the single-cycle datapath
operates as before.
5. Answer the following questions with respect to the MIPS program shown below. Note that this
simple program does not use the register-saving conventions followed in class. Assume that each
instruction is a native instruction and can be stored in one word. Further, assume that the data
segment starts at 0x10001000 and that the text segment starts at 0x00400000.
.data
label: .word 8, 16, 32, 64
.byte 64, 32
.text
.globl main
main: la $4, label
li $5, 16
jal func
done
func: move $2, $4
move $3, $5
add $3, $3, $2
move $9, $0
loop: lw $22, 0($2)
add $9, $9, $22
addi $2, $2, 4
slt $8, $2, $3
bne $8, $0, loop
move $2, $9
jr $31
(1) What does the program do?
(2) State the values of the labels loop and main?
(3) Please finish the hexadecimal encodings of the following instructions?
bne $8, $0, loop
Op-code (6-bit) | 1st source register (5-bit) | 2nd source register (5-bit) | OFFSET (16-bit)
5 | | |
6. Given a 5-stage pipelined MIPS ISA design, the individual pipeline stages are named as IF, ID,
EX, MEM, and WB, respectively. The latencies of five stages are given as 350 ps, 250 ps, 280
ps, 400 ps, 280 ps, respectively.
(1) What is the maximum operating frequency for this processor?
(2) The following fragment of MIPS codes is being executed. If there is no forwarding or
hazard detection, please insert nops and rewrite the assembly to ensure correct execution.
add $t4, $s2, $t1
or $t2, $t1, $t2
lw $s4, 20($t4)
sw $t1, 16($t4)
sub $s3, $s4, $t2
(3) How many clock cycles are required to complete the execution of these instructions?
Answer
(1) The maximum operating frequency = 1 / 400 ps = 2.5 GHz
(2) add $t4, $s2, $t1
or $t2, $t1, $t2
nop
lw $s4, 20($t4)
sw $t1, 16($t4)
nop
sub $s3, $s4, $t2
(3) (5 – 1) + 7 = 11 clock cycles are required to complete the execution of these instructions.
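The cycle count in (3) follows the usual ideal-pipeline formula (a sketch; the function is mine):

```python
def total_cycles(instructions, stages=5, nops=0):
    """First instruction fills the pipeline in `stages` cycles; after that,
    one instruction (or inserted nop) completes per cycle."""
    return (stages - 1) + instructions + nops

# 5 original instructions plus the 2 inserted nops -> (5 - 1) + 7 = 11 cycles
cycles = total_cycles(5, nops=2)
```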
7. Assume that a pipelined processor has 8 pipeline stages as shown in the following. Due to
branch prediction, the IF stage needs two clock cycles, called IF1 and IF2. The EXE stage needs
three clock cycles to support multiplication and division; however, addition, subtraction,
and logic operations still complete in one clock cycle. The three EXE stages are named
EXE1, EXE2, and EXE3. The latencies of the 8 stages are 150 ps, 150 ps, 100 ps, 200 ps, 200 ps,
180 ps, 200 ps, and 100 ps, respectively. The pipeline registers are named PP1 to PP7
sequentially.
[Figure: 8-stage pipeline — IF1/IF2 (instruction fetch & branch prediction), ID (decode &
register read), EXE1–EXE3 (Mul/Div; Add/Sub/Logic finish in EXE1), MEM (memory access),
WB (write back) — with pipeline registers PP1–PP7 between the stages]
Consider the following MIPS code.
1 lw $s2, 0($t4)
2 add $t0, $s2, $t1
3 mul $t2, $s3, $s5
4 sub $t0, $t0, $t2
5 sw $t0, -4($t4)
(1) If there is no forwarding or hazard detection, how many nops instructions should be
inserted between the first instruction (lw) and the second instruction (add)? Please also give
some simple explanation.
(2) If a forwarding path from the memory output result in pipeline registers PP7 to the input of
ALU at EXE1 stage is added to reduce the delay, how many nops instructions should be
inserted between the first instruction (lw) and the second instruction (add)? Please also give
some simple explanation.
(3) How to add the forwarding path to solve the data hazard with the minimum delay between
instruction 3 (mul) and instruction 4 (sub)? Please also give some simple explanation.
(4) How to add the forwarding path to solve the data hazard with the minimum delay between
instruction 4 (sub) and instruction 5 (sw)? Please also give some simple explanation.
Answer
(1) 4 nops. The lw instruction writes to the register file in the first half of its WB-stage cycle,
and the add instruction reads from the register file in the second half of its ID-stage cycle;
4 nops are needed for these to line up and ensure correct execution.
(2) 3 nops. The lw instruction forwards the memory data from PP7 at the WB stage, and the
add instruction must then be at the EXE1 stage to receive the correct value for the addition;
3 nops are needed to ensure correct execution.
(3) A forwarding path should be added from the MEM stage (PP6) to the EXE1 stage, since the
mul instruction's result is first available in PP6, when mul reaches the MEM stage.
(4) A forwarding path should be added from the WB stage (PP7) to the MEM stage, since the
sw instruction writes its data into memory at the MEM stage.
Note: [Figure: the forwarding paths of (3) and (4) drawn on the 8-stage pipeline diagram]
8. Assume an 8-core computer system can process database queries at a steady-state rate of
requests per second, and each transaction takes, on average, a fixed amount of time to process.
Some results regarding the transaction latency and processing rate are given in the following
table. How many requests are being processed at any given instant per core for the following 2
cases? Please write the values for (1) and (2), respectively.
Case | Average Transaction Latency | Maximum Transaction Processing Rate | No. of Requests per Core
1 | 1 ms | 10,000/sec | (1)
2 | 2 ms | 24,000/sec | (2)
Answer
(1) (10,000 / 8) × 1 × 10⁻³ = 1.25
(2) (24,000 / 8) × 2 × 10⁻³ = 6
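This is Little's law: requests in flight = throughput × latency. Note that case 2 uses its own 2 ms latency, giving 6 requests per core (a sketch; the function is mine):

```python
def requests_per_core(rate_per_sec, latency_sec, cores=8):
    """Little's law: requests in the system = arrival rate x latency,
    evaluated per core for an evenly loaded 8-core machine."""
    return (rate_per_sec / cores) * latency_sec

case1 = requests_per_core(10_000, 1e-3)   # 1.25 requests per core
case2 = requests_per_core(24_000, 2e-3)   # 6 requests per core
```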
106 NTHU CS
1. What is the decimal value of ‘‘1111 0011’’, a one-byte 2's complement binary number?
Answer: -13
2. For a color display using 8 bits for each of the primary colors (red, green, blue) per pixel and
with a resolution of 1280 × 800 pixels, what should be the size (in bytes) of the frame buffer to
store a frame?
Answer: 1280 × 800 × 3 = 3,072,000 bytes
3. Determine the values of the labels ELSE and DONE of the following segment of instructions.
Assume that the first instruction is loaded into memory location F0008000hex
Answer:
labels value
ELSE F000800Chex
DONE F0008010hex
5. You are using a tool that transforms machine code written for the MIPS ISA into code in a
VLIW ISA. The VLIW ISA is identical to MIPS except that multiple instructions can be grouped
together into one VLIW instruction. Up to N MIPS instructions can be grouped together (N is the
machine width, which depends on the particular machine). The transformation tool can reorder
instructions to fill VLIW instructions, as long as loads and stores are not reordered relative to
each other (however, independent loads and stores can be placed in the same VLIW instruction).
You give the tool the following MIPS program (we have numbered the instructions for reference
below):
(01) lw $t0 ← 0($a0)
(02) lw $t2 ← 8($a0)
(03) lw $t1 ← 4($a0)
(04) add $t6 ← $t0, $t1
(05) lw $t3 ← 12($a0)
(06) sub $t7 ← $t1, $t2
(07) lw $t4 ← 16($a0)
(08) lw $t5 ← 20($a0)
(09) srlv $s2 ← $t6, $t7
(10) sub $s1 ← $t4, $t5
(11) add $s0 ← $t3, $t4
(12) sllv $s4 ← $t7, $s1
(13) srlv $s3 ← $t6, $s0
(14) sllv $s5 ← $s0, $s1
(15) add $s6 ← $s3, $s4
(16) add $s7 ← $s4, $s6
(17) srlv $t0 ← $s6, $s7
(18) srlv $t1 ← $t0, $s7
(a) Draw the dataflow graph of the program. Represent instructions as numbered nodes
(01~18), and flow dependences as directed edges (arrows).
(b) When you run the tool with its settings targeted for a particular VLIW machine, you find
that the resulting VLIW code has 9 VLIW instructions. What minimum value of N must the
target VLIW machine have?
(c) Based on the code above and the minimum value of N in (b), write down the MIPS
instruction numbers corresponding to each VLIW instruction in the table below. If there is
more than one MIPS instruction that could be placed into a VLIW instruction, please
choose the instruction that comes earliest in the original MIPS program.
VLIW Instruction | MIPS Instr. Nos. (up to 7 per bundle)
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
(d) You find that the code is still not fast enough when it runs on the VLIW machine, so you
contact the VLIW machine vendor to buy a machine with a larger machine width N. What
minimum value of N would yield the maximum possible performance (i.e., the fewest
VLIW instructions), assuming that all MIPS instructions (and thus VLIW instructions)
complete with the same fixed latency and assuming no cache misses?
(e) Write the MIPS instruction numbers corresponding to each VLIW instruction, for this
optimal value of N. Again, as in part (c) above, pack instructions such that when more than
one instruction can be placed in a given VLIW instruction, the instruction that comes first
in the original code is chosen.
Answer:
(a) Flow-dependence graph, arranged by dependence level (each instruction depends only on
instructions in the rows above it):
Level 1: 01 02 03 05 07 08
Level 2: 04 06 11 10
Level 3: 09 13 12 14
Level 4: 15
Level 5: 16
Level 6: 17
Level 7: 18
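The packing asked for in (b)–(e) can be checked mechanically with a greedy list scheduler (a sketch of mine; the dependence lists are transcribed from the numbered code, and all memory operations here are independent loads, so the load/store ordering constraint never bites):

```python
# deps[i] lists the instructions whose results instruction i reads.
deps = {
    1: [], 2: [], 3: [], 5: [], 7: [], 8: [],
    4: [1, 3], 6: [3, 2], 9: [4, 6], 10: [7, 8], 11: [5, 7],
    12: [6, 10], 13: [4, 11], 14: [11, 10],
    15: [13, 12], 16: [12, 15], 17: [15, 16], 18: [17, 16],
}

def schedule(width):
    """Greedy list scheduling: each cycle, pack up to `width` ready
    instructions, preferring the earliest in original program order."""
    done, bundles = set(), []
    while len(done) < len(deps):
        ready = [i for i in sorted(deps)
                 if i not in done and all(d in done for d in deps[i])]
        bundle = ready[:width]
        bundles.append(bundle)
        done.update(bundle)
    return bundles

n3 = len(schedule(3))   # 9 bundles -> consistent with part (b): N = 3
n6 = len(schedule(6))   # 7 bundles -> N = 6 already reaches the graph
                        # depth of 7, the fewest VLIW instructions possible
```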
6. Fine-Grained Multithreading (FGMT): consider a design "Machine I" with five pipeline
stages: fetch, decode, execute, memory, and write back. Each stage takes 1 cycle. The
instruction and data caches are ideal (i.e., there is never a stall for a cache miss). Branch
directions and targets are resolved in the execute stage. The pipeline stalls when a branch is
fetched, until the branch is resolved. Dependency-check logic is implemented in the decode
stage to detect flow dependences. The pipeline does not have any forwarding paths, so it must
stall on detection of a flow dependence. In order to avoid these stalls, we will consider
modifying Machine I to use fine-grained multithreading.
[Figure: five-stage FGMT pipeline — per-thread PCs and register files in front of the
instruction cache, ALU, and data cache, with a thread ID selecting among them at each stage]
(a) In the five-stage pipeline of Machine I shown above, the machine's designer first focuses
on the branch stalls, and decides to use multithreading to keep the pipeline busy no matter
how many branch stalls occur. What is the minimum number of threads required to
achieve this? Why?
(b) The machine's designer now decides to eliminate dependency-check logic and remove the
need for flow-dependence stalls (while still avoiding branch stalls). How many threads are
needed to ensure that no flow dependence ever occurs in the pipeline?
A rival designer is impressed by the throughput improvements and the reduction complexity
that FGMT brought to Machine I. This designer decides to implement FGMT on another
machine, Machine II. Machine II is a pipelined machine with the following stages.
Fetch 1 stage
Decode 1 stage
Execute 8 stages (branch direction/target are resolved in the first execute stage)
Memory 2 stages
Writeback 1 stage
Assume everything else in Machine II is the same as in Machine I.
(c) Is the number of threads required to eliminate branch-related stalls in Machine II the same
as in Machine I? If YES, why? If NO, how many threads are required?
(d) Now consider flow-dependence stalls. Does Machine II require the same minimum number
of threads as Machine I to avoid the need for flow-dependence stalls? If YES, why? If NO,
how many threads are required?
Answer:
(a) 3 threads are required, since branch directions and targets are resolved in the 3rd (execute)
stage: while one thread's branch moves from fetch to execute, the other two threads keep
the pipeline busy.
(b) The two requirements overlap rather than add. With the register file written in the first half
of a cycle and read in the second half, a thread's next instruction may enter decode 3 cycles
after its predecessor (the decode-to-writeback distance), so 3 threads remove the
flow-dependence stalls; these same 3 threads already hide the branch stalls. Hence 3
threads suffice in total.
(c) Yes, 3 threads, since branch directions and targets are still resolved in the 3rd stage (the
first execute stage).
(d) No. The decode-to-writeback distance is now 11 cycles (decode, 8 execute, 2 memory,
writeback); hence 11 threads are required to remove the flow-dependence stalls.
7. Assume you developed the next greatest memory technology, MagicRAM. A MagicRAM cell
is non-volatile. The access latency of a MagicRAM cell is 2 times that of an SRAM cell but the
same as that of a DRAM cell. The read/write energy of MagicRAM is similar to the read/write
energy of DRAM. The cost of MagicRAM is similar to that of DRAM. MagicRAM has higher
density than DRAM. MagicRAM has one shortcoming, however: a MagicRAM cell stops
functioning after 2000 writes are performed to the cell.
(a) Is there an advantage of MagicRAM over DRAM? Why?
(b) Is there an advantage of MagicRAM over SRAM?
(c) Assume you have a system that has a 32KB L1 cache made of SRAM, an 8MB L2 cache
made of SRAM, and 2GB main memory made of DRAM, as shown in Fig 1.
[Fig 1: Processor core → L1 cache (32KB) → L2 cache (8MB) → Main memory (2GB DRAM)]
Assume you have complete design freedom and add structures to overcome the shortcoming of
MagicRAM. You will be able to propose a way to reduce/overcome the shortcoming of MagicRAM
(note that you can design the hierarchy in any way you like, but cannot change MagicRAM itself).
Does it make sense to add MagicRAM somewhere in this memory hierarchy, given that you can
potentially reduce its shortcoming? If so, where would you place MagicRAM? If not, why not?
Explain below clearly and methodically. Depict in a figure clearly and describe why you made this
choice.
(d) Propose a way to reduce/overcome the shortcoming of MagicRAM by modifying the given
memory hierarchy. Be clear in your explanations and illustrate with drawings to aid
understanding.
Answer:
(a) Yes. Since MagicRAM has higher density than DRAM, the area and power consumption
of a MagicRAM memory will be lower than those of a DRAM memory of the same capacity.
(b) Yes. A MagicRAM cell is non-volatile, so in a system where safety or data preservation
is critical, the data is retained after a crash and the system can be recovered.
(c)(d) MagicRAM can be used in the memory hierarchy as a backup of main memory, as shown
in the following figure, for computer systems where safety or data preservation is
critical. Used only to back up main memory, the MagicRAM is not written frequently,
which mitigates its limited write endurance.
[Figure: Processor core → L1 cache (32KB) → L2 cache (8MB) → DRAM main memory, with
MagicRAM added beside main memory as its backup]
8. Consider the following three processors (X, Y, and Z) that are all of varying areas. Assume that
the single-thread performance of a core increases with the square root of its area.
(a) You are given a workload where S fraction of its work is serial and 1-S fraction of its work
is in infinitely parallelizable. If executed on a die composed of 16 Processor X’s, what
value of S would give a speedup of 4 over the performance of the workload on just one
Processor X?
(b) Given a homogeneous die of area 16A, which of the three processors would you use on
your die to achieve maximal speedup? What is that speedup over just a single ProcessorⅩ?
Assume the same workload in part (a).
(c) Now you are given a heterogeneous processor of area 16A to run the above workload. The
die consists of 1 Processor Y and 12 Processor X’s. When running the workload, all
sequential parts of the program will be run on the larger core while all parallel parts of the
program run exclusively on the smaller cores. What is the overall speedup achieved over a
Single Processor Ⅹ ?
(d) One of the programmers decides to optimize the given workload so that it has 10% of its
work serial sections and 90 % of its work in parallel sections. Which configuration would
you use to run the workload if given the choices between the processors from part (a), part
(b), and part (c)? Please write down the speedups for the three configurations.
(e) Typically, for a realistic workload, the parallel fraction is not infinitely parallelizable. What
are the three fundamental reasons?
Answer:
(a) S = 0.2
(b) Speedup of processor Y = . Speedup of processor Z = .
(d) You would use the configuration from part (c) because it gives you the maximum speedup.
(e) 1. Synchronization.
2. Load imbalance.
3. Resource contention.
106 年交大資聯
1. Which of the following objects have identical binary representations regardless of whether the
machine is little endian or big endian?
(a) 2's complement number -1
(b) int i = 0xABBAABBA
(c) A single precision number -0.0 (in IEEE754 encoding format)
(d) A C null pointer
Answer: (a), (d)
Note (d): In C, NULL can be expressed as the integer value 0.
2. For the pipelined implementation of the MIPS processor, which statements below are NOT
correct?
(a) All pipelined registers have the same length
(b) The pipeline clock cycle time is the average of all stage latencies.
(c) Exceptions in a pipeline are handled like mis-predicted branches.
(d) In the pipelined data path, separate instruction and data memories are used to reduce data
hazards.
Answer: (a), (b), (d)
Note (d): Separate instruction and data memories are used to reduce structural hazards.
3. Assume the register numbers of $s2 and $zero are 17 and 0, respectively. Given the MIPS code
sequence below, if we assume it starts at location 80004000h in memory, which of following
statements are correct?
80004000h add $t0, $zero, $zero
loop: beq $s2, $zero, finish
add $t0, $t0, $s1
sub $s2, $s2, 1
j loop
finish: addi $t0, $t0,100
add $v0, $t0, $zero
(a) The MIPS machine codes of the beq (OP code is 4) in this code sequence is 12200003h.
(b) The MIPS machine code of the j (OP code is 2) in this code sequence is 08001000h.
(c) Assume both $s1 and $s2 initially contain integers 5 and 6, respectively. The value in $v0 is
30 after the execution of the whole code sequence.
(d) With the same assumption as (c) that $s1 and $s2 initially contain integers 5 and 6,
respectively, the beq instruction will be executed 6 times for this code sequence.
Answer: (a)
Note (a): The register numbers of $s2 and $zero are 17 and 0.
OP = 000100, rs = 10001, rt = 00000, 16-bit offset = 0000000000000011
→ 000100 10001 00000 0000000000000011 = 12200003hex
Note (b): OP = 000010, 26-bit word address = 00000000000001000000000001
→ 000010 00000000000001000000000001 = 08001001hex (not 08001000hex)
Note (c): $v0 = 5 × 6 + 100 = 130
Note (d): 7 times
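A minimal Python sketch can double-check the machine codes in Notes (a) and (b); the field layout is the standard MIPS I-type/J-type encoding, and the helper names are ours:

```python
def encode_beq(rs, rt, offset):
    # I-type: opcode(6) | rs(5) | rt(5) | signed 16-bit word offset
    return (4 << 26) | (rs << 21) | (rt << 16) | (offset & 0xFFFF)

def encode_j(target_byte_addr):
    # J-type: opcode(6) | low 26 bits of the target's word address
    return (2 << 26) | ((target_byte_addr >> 2) & 0x03FFFFFF)

# beq $s2, $zero, finish: rs = 17 ($s2), rt = 0 ($zero), offset = 3 instructions
print(hex(encode_beq(17, 0, 3)))    # 0x12200003
# j loop, where loop sits at 0x80004004
print(hex(encode_j(0x80004004)))    # 0x8001001, i.e. 08001001h, not 08001000h
```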
4. Consider the following sequence of actual outcomes for a branch. T means the branch is taken.
N means not taken. Assume both predictors are initialized to predict taken. Which of following
statements are true?
Branch: T-N-T-N-N-T-N
(a) If 1-bit predictor is used, the predictions for this branch will be T-T-N-T-N-N-T.
(b) If 2-bit predictor is used, the predictions for this branch will be T-T-T-T-T-N-N.
(c) If the same pattern (i.e., T-N-T-N-N-T-N) are repeated thousands of times, the prediction
accuracy rate of 1-bit predictor is about 2/7.
(d) If the same pattern are repeated thousands of times, the prediction accuracy rate of 2-bit
predictor is about 4/7.
Answer: (a), (d)
Note (a):
                 T      N      T      N      N      T      N
State (predict)  1 (T)  1 (T)  0 (N)  1 (T)  0 (N)  0 (N)  1 (T)
Note (b):
                 T      N      T      N      N      T      N
State (predict)  3 (T)  3 (T)  2 (T)  3 (T)  2 (T)  1 (N)  2 (T)
Note (c):
          T  N  T  N  N  T  N
Round 1   1  1  0  1  0  0  1
Round 2   0  1  0  1  0  0  1
Round 3   0  1  0  1  0  0  1
Correct?  ✗  ✗  ✗  ✗  ✓  ✗  ✗   (steady-state accuracy is 1/7, not 2/7)
Note (d):
          T  N  T  N  N  T  N
Round 1   3  3  2  3  2  1  2
Round 2   1  2  1  2  1  0  1
Round 3   0  1  0  1  0  0  1
Round 4   0  1  0  1  0  0  1
Correct?  ✗  ✓  ✗  ✓  ✓  ✗  ✓   (steady-state accuracy is 4/7)
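The steady-state accuracies in Notes (c) and (d) can be reproduced by simulating an n-bit saturating-counter predictor over many repetitions of the pattern (a sketch; `accuracy` is our own helper):

```python
def accuracy(pattern, rounds=1000, nbits=1):
    top = (1 << nbits) - 1
    state = top                                       # initialized to predict taken
    hits = total = 0
    for _ in range(rounds):
        for taken in pattern:
            predict_taken = state >= (top + 1) // 2   # upper half of states predicts T
            hits += (predict_taken == taken)
            total += 1
            state = min(state + 1, top) if taken else max(state - 1, 0)
    return hits / total

pattern = [True, False, True, False, False, True, False]   # T-N-T-N-N-T-N
print(round(accuracy(pattern, nbits=1), 3))   # 0.143, about 1/7, so (c)'s 2/7 is wrong
print(round(accuracy(pattern, nbits=2), 3))   # 0.571, about 4/7, so (d) holds
```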
5. Given the operation times for the major functional units are: 200ps for memory access, 100ps for
ALU operation; and 50ps for register file read or write. Assuming that all the other delays (like
control unit, multiplexer, pipeline overheads, etc) are negligible. Assume only R-type, lw, sw are
supported. Which of following statements are correct?
(a) For a single-cycle CPU where the instruction with the longest latency determines the clock
cycle time, the clock cycle time is 350ps.
(b) For a classic MIPS CPU with a 5-stage pipeline, the clock cycle time can be 200ps.
(c) It takes 3000ps to execute the code sequence below using the single-cycle CPU.
(d) It takes 1000ps to execute the code sequence below using the pipelined MIPS CPU
lw $t1, 0($s1)
sw $s1, 0($s2)
add $t2, $s2, $s3
add $t3, $s1, $s2
lw $t1,0($t2)
Answer: (b), (c)
Note (a): The execution time for the lw instruction = 200 + 50 + 100 + 200 + 50 = 600 ps
Note (d): (Assuming forwarding) Execution time = [(5 − 1) + 5] × 200 ps = 1800 ps
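A quick check of the timing claims in (a), (c), and (d):

```python
# Stage latencies in ps: memory access 200, ALU 100, register read/write 50
lw_path = 200 + 50 + 100 + 200 + 50   # IF + reg read + ALU + MEM + reg write
single_cycle_time = lw_path           # the longest instruction (lw) sets the clock
print(single_cycle_time)              # 600, so (a)'s 350 ps is wrong

pipe_cycle = 200                      # pipelined cycle set by the slowest stage
print(5 * single_cycle_time)          # 3000 ps for the 5 instructions, so (c) holds
print((5 - 1 + 5) * pipe_cycle)       # 1800 ps with forwarding, so (d)'s 1000 ps is wrong
```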
7. Consider a 1 KB, 4-way set associative cache (initially empty) with block size of 64 bytes. The
main memory consists of 256 blocks and the request for memory blocks is in the following
order: 0, 255, 1, 4, 3, 8, 142, 133, 159, 216, 113, 129, 63, 8, 17, 48, 32, 73, 92, 155. Which
one(s) of the following memory blocks will NOT be in the cache if LRU replacement policy is
used?
(a) 3 (b) 8 (c) 133 (d) 216
Answer: (c), (d)
Note: Number of blocks in cache = 1KB / 64B = 16. The number of sets in cache = 16 / 4 = 4
Block address Tag Index Hit/Miss
0 0 0 Miss
255 63 3 Miss
1 0 1 Miss
4 1 0 Miss
3 0 3 Miss
8 2 0 Miss
142 35 2 Miss
133 33 1 Miss
159 39 3 Miss
216 54 0 Miss
113 28 1 Miss
129 32 1 Miss
63 15 3 Miss
8 2 0 Hit
17 4 1 Miss
48 12 0 Miss
32 8 0 Miss
73 18 1 Miss
92 23 0 Miss
155 38 3 Miss
Final contents (x → y means block x was later replaced by block y):
        Way 0      Way 1     Way 2  Way 3
Set 0   0 → 48     4 → 32    8      216 → 92
Set 1   1 → 17     133 → 73  113    129
Set 2   142
Set 3   255 → 155  3         159    63
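The whole table can be reproduced with a short LRU simulation (a sketch; `lru_4way` is our own helper, indexing each block by its low two address bits):

```python
from collections import OrderedDict

def lru_4way(refs, n_sets=4, ways=4):
    sets = [OrderedDict() for _ in range(n_sets)]
    for block in refs:
        s = sets[block % n_sets]          # index = block address mod number of sets
        if block in s:
            s.move_to_end(block)          # hit: refresh the LRU order
        else:
            if len(s) == ways:
                s.popitem(last=False)     # evict the least recently used block
            s[block] = True
    return {b for s in sets for b in s}

refs = [0, 255, 1, 4, 3, 8, 142, 133, 159, 216, 113, 129,
        63, 8, 17, 48, 32, 73, 92, 155]
final = lru_4way(refs)
print([b for b in (3, 8, 133, 216) if b not in final])    # [133, 216]
```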
8. Which of the following statements (about virtual memory and page table) are correct?
(a) For virtual memory, write-back is more practical than write-through.
(b) Given a 32-bit virtual address space with 4KB per page and 4 bytes per page table entry, the
total page table size is 4MB.
(c) It is possible to miss in cache and page table, but hit in translation look-aside buffer (TLB).
(d) It is possible to miss in TLB, but hit in cache and page table.
Answer: (a), (b), (d)
Note (b): Number of page table entries = 4GB / 4KB = 1M. The page table size = 1M × 4 bytes = 4MB
Problem Set A:
Consider the Pipelined CPU with five stages (IF, D, EXE, MEM, WB) and the code sequence shown
below. Please answer the questions below.
add $s1, $t1, $t2
sub $s1, $s1, $s2
lw $s2, 0($s1)
add $t1, $s1, $s2
sub $t3, $t2, $s2
sw $s1, 0($t3)
or $t4, $t2, $s2
add $t1, $s2, $s1
[Figure: pipelined MIPS datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers;
control signals (PCSrc, RegWrite, ALUSrc, MemWrite, MemtoReg, MemRead, RegDst, ALUOp, Branch);
instruction memory, register file, sign-extend unit, ALU, and data memory]
A1. Assume forwarding and stall mechanisms have been designed (though they are not shown in the
figure). Which instruction is in the IF stage when the code sequence runs to the 7th cycle?
(a) or $t4, $t2, $s2
(b) sw $s1, 0($t3)
(c) sub $t3, $t2, $s2
(d) add $t1, $s1, $s2
Answer: (b)
Note:
      c1  c2  c3  c4  c5  c6  c7  c8  c9  c10 c11 c12 c13
add   IF  ID  EX  ME  WB
sub       IF  ID  EX  ME  WB
lw            IF  ID  EX  ME  WB
add               IF  ID  ID  EX  ME  WB
sub                   IF  IF  ID  EX  ME  WB
sw                            IF  ID  EX  ME  WB
or                                IF  ID  EX  ME  WB
add                                   IF  ID  EX  ME  WB
A2. With the same assumption as question A1, which instruction is in EXE stage when the code
sequence runs to the 7th cycle?
(a) or $t4, $t2, $s2
(b) sw $s1, 0($t3)
(c) sub $t3, $t2, $s2
(d) add $t1, $s1, $s2
Answer: (d)
A3. Assume only stall mechanism has been designed (i.e., no forwarding paths) and assume register
read and write can be done in the same cycle, which instruction is in MEM stage when the code
sequence runs to the 7th cycle?
(a) or $t4, $t2, $s2
(b) sub $t3, $t2, $s2
(c) lw $s2,0($s1)
(d) sub $s1, $s1, $s2
Answer: (d)
Note:
      c1  c2  c3  c4  c5  c6  c7  c8  c9  c10 c11 c12 c13
add   IF  ID  EX  ME  WB
sub       IF  ID  ID  ID  EX  ME  WB
lw            IF  IF  IF  ID  ID  ID  EX  ME  WB
add                       IF  IF  IF  ID  ID  ID  EX  ME
Problem Set B:
A computer system has an L1 cache, an L2 cache, and a main memory unit connected as shown
below. The block size is 16 words for the L1 cache, and is 4 words for the L2 cache; the main
memory is 4-word wide. The access times are 2ns, 20ns, and 200 ns for the L1 cache, L2 cache,
and main memory, respectively.
[Figure: Processor ↔ L1 Cache ↔ (4-word data bus) ↔ L2 Cache ↔ (4-word data bus) ↔ Main MEM]
B1. When the processor requests some 4-word data, but there is a miss in the L1 cache and a hit in
the L2 cache, how much is the total required time for data transfer upon this request?
(a) 20 ns
(b) 22 ns
(c) 80 ns
(d) 82ns
Answer: (d)
Note: 2 + (16 / 4) × 20 = 82 ns
B2. When the processor requests some 4-word data, but there is a miss in both of the L1 cache and
the L2 cache, then a hit in the main memory, how much is the total memory access time for this
request?
(a) 220 ns
(b) 222 ns
(c) 282 ns
(d) 880ns
Answer: (c)
Note: 2 + (16 / 4) × 20 + 200 = 282 ns
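Both notes follow the same pattern: the L1 access, plus four 4-word bus transfers to fill the 16-word L1 block from L2, plus (for B2) one main-memory access. A quick check:

```python
L1_TIME, L2_TIME, MEM_TIME = 2, 20, 200        # access times in ns
transfers = 16 // 4                            # 16-word L1 block over a 4-word bus

b1 = L1_TIME + transfers * L2_TIME             # L1 miss, L2 hit
b2 = L1_TIME + transfers * L2_TIME + MEM_TIME  # miss in both, hit in main memory
print(b1, b2)                                  # 82 282
```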
106 年成大電機
2. In a five-stage pipelined processor, multiple exceptions may occur at the same clock cycle.
Answer the following questions:
(a) Which stage can detect an illegal instruction?
(b) What might be the possible causes that result in an illegal instruction?
(c) Which stage(s) can detect memory access violations?
(d) TLB exception may occur at which stage(s)?
(e) If multiple exceptions have occurred at the same time, which PC should be identified and
saved for precise interrupt?
Answer
(a) ID stage
(b) The program was compiled with some processor-specific optimizations and is then
run on a processor that fails to meet those requirements.
The program was compiled using the wrong compiler.
The program runs on a processor that does not support the instructions.
(c) IF and MEM stage.
(d) IF and MEM stage.
(e) The program counter of the oldest instruction (the earlier instruction into pipeline) should
be identified and saved for precise interrupt.
Choose the most appropriate answers for the following multiple choice problems. Each question may
have more than one answer.
3. Using 16K × 8 SRAM modules for an on-chip memory system, which of the following is (are)
true?
(a) For 1 MB memory system, it needs 64 SRAM modules.
(b) The 16K × 8 module has 14 address lines.
(c) The 16K × 8 module has 16K address lines.
(d) It needs at least 8 modules for the connection to a 64-bit data bus. So the minimum memory
size is 128KB.
(e) It needs at least 8 modules for the connection to a 64-bit data bus. So the minimum memory
size is 64KB.
Answer: (a), (b), (d)
Note (a): (1M × 8) / (16K × 8) = 2^6 = 64
Note (d): (16K × 8 bits) × 8 = 16KB × 8 = 128KB
4. For a conditional branch instruction such as beq rs, rt, foo, which of the following statements
are true?
(a) The label "foo" defines the base address of the branch target.
(b) The label "foo" is an offset relative to the program counter which points to the next
sequential instruction of the branch instruction.
(c) The label "foo" is translated into an unsigned number.
(d) The label "foo" is coded into the instruction as a string.
(e) The label "foo" is coded into the instruction as a signed number.
Answer: (b), (e)
5. Which of the following is (are) true for the forwarding unit used in a typical five-stage pipelined
processor?
(a) The forwarding unit is used to bypass the write-back result due to RAW hazards.
(b) The forwarding unit is used to forward data to the instruction cache.
(c) The forwarding unit compares the source register number of the instructions in the MEM
and WB stages with the destination register numbers of the dependent instruction.
(d) The forwarding unit compares the destination register number of the instructions in the
MEM and WB stages with the source register numbers of the dependent instruction.
(e) The forwarding unit is a combinational logic.
Answer: (a), (d), (e)
6. Which of the following statements is (are) true for virtual memory system?
(a) The space on the disk or flash memory reserved for the full virtual memory space of a
process is called Swap Space.
(b) Virtual memory function can be enabled through software control.
(c) Virtual memory technique treats part of the main memory as a fully associative
write-back cache for program execution.
(d) A translation lookaside buffer can be seen as the cache of a page table.
(e) A page table is shared among the programs in execution.
Answer: (a), (c), (d)
106 年成大資聯
[Figure: address translation datapath — input A goes through the TLB (backed by the page
table) to produce B, which is used to access the cache]
(1) What is the name of the TLB’s input (Symbol A in the figure)? (Hint: XXX address.)
(2) What is the name of the TLB’s output (Symbol B in the figure)?
(3) What is the name of this type of cache (with B as its input)? (Hint: XXX cache.)
(4) Does cache aliasing occur in the cache design shown in the figure (Yes or No)? Explain
your answer.
(5) “We could have a hit in the cache, and get a TLB miss and a page table miss.” Is the
statement true (Yes or No)? Explain your answer.
(6) Consider the processor operating at 1GHz. The processor stalls during a cache miss and has
the following properties: (i) a cache access time of 2 clock cycles for a hit, (ii) a miss
penalty of 100 clock cycles, and (iii) a miss rate of 0.03 misses per reference. Please
compute the average memory access time.
Answer
(1) Virtual address
(2) Physical address
(3) Physically addressed cache
(4) No. A physically addressed cache does not suffer from aliasing, since each physical
address maps to a unique cache location.
(5) No. A page table miss ⇒ the data is not in memory ⇒ the data cannot be in the cache.
(6) AMAT = (2 + 0.03 × 100) × 1 ns = 5 ns
I2: LW R2, 16(R1)
I3: LW R1, 4(R3)
I4: SUB R5, R3, R4
(1) Find all data dependencies in the instruction sequence.
(2) Find all hazards in the instruction sequence for the processor with and without forwarding.
(3) Sometimes, even with forwarding, we would have to stall one stage for a data hazard.
Fortunately, the software technique, Reordering Code, would be adopted to avoid the
pipeline stalls. Please state if the instruction sequence suffers from a data hazard that cannot
be resolved by the forwarding (Yes or No). If your answer is Yes, please list your reordered
code.
Answer
(1) RAW: (R1) I1 to I2
    WAR: (R2) I1 to I2; (R1) I2 to I3
    WAW: (R1) I1 to I3
(2) With forwarding: none
    Without forwarding: (R1) I1 to I2
(3) No
3. Determine whether each of the following statements is True (T) or False (F), and explain your
answer.
(1) Strong scaling is not limited by Amdahl's law.
(2) Both SMPs and message-passing computers rely on locks for synchronization.
(3) Multithreading technology in CPUs helps reduce the memory latency.
Answer
(1) F — Weak scaling can compensate for a serial portion of the program that would
otherwise limit scalability; strong scaling remains limited by Amdahl's law.
(2) F — Sending and receiving a message is an implicit synchronization, as well as
a way to share data.
(3) F — Multithreading does not reduce memory latency; it relies on parallelism to
get more efficiency from a chip.
106 年成大電通
2. Design a DTLB and data cache system. Assume both the virtual address and physical address are
32 bits. The page size is 4KB. The DTLB uses 4-way set associative structure and has a total of
32 entries. The data cache is physically addressed; cache size 32KB, direct-mapped, line size 32
bytes.
(a) Show the integrated design of the DTLB and the data cache.
(b) Show the integration of this sub-memory system into a 5-stage processor pipeline.
Answer
(a)
[Figure: DTLB — 4-way set associative, 8 sets of 4 entries; the 20-bit virtual page number
(32-bit virtual address with a 12-bit page offset) is split into a set index and a tag, and
four comparators select the matching entry to produce the physical page number.
Data cache — direct-mapped, 32KB with 32-byte lines: 1024 lines indexed by 10 bits, with a
5-bit line offset and a 17-bit tag; a hit delivers the 256-bit line.]
(b) [Figure: 5-stage pipeline with an ITLB in the IF stage translating the instruction
address before the instruction cache, and the DTLB in the MEM stage translating the data
address before the data cache]
106 年中央資工
Multiple-Answer Questions
1. In Boolean algebra, which of the following statements are true? (Note: Z' is the inverse of Z)
(1) X + YZ' = (X + Y)(Y + Z')
(2) (X + Y)(X' + Z) = XZ + X'Y
(3) (W' + Z + XY)(Z' + W' + XY) = Z' + XY
(4) (X + Y)(Y + Z)(X' + Z) = (X' + Z)(X + Y)
(5) XY'Z + YZ = XZ + YZ
Answer: (2), (4), (5)
Note: (1) RHS = Y + XZ'; with Z = 1, X = 0, Y = 1: LHS = 0 ≠ 1 = RHS
(2) LHS = XZ + X'Y + YZ = XZ + X'Y + XYZ + X'YZ = XZ + X'Y = RHS
(3) LHS = (W' + XY + ZZ') = W' + XY; with W = 1, Z = 0, X = 0: LHS = 0 ≠ 1 = RHS
(4) LHS = (Y + XZ)(X' + Z) = X'Y + YZ + XZ
RHS = X'Y + YZ + XZ = LHS
(5) RHS = XYZ + XY'Z + YZ = XY'Z + YZ = LHS
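These identities can also be checked exhaustively over all variable assignments; a sketch for (1), (2), and (5), where `equal` and `NOT` are our own helpers:

```python
from itertools import product

NOT = lambda a: 1 - a

def equal(f, g, n=3):
    # exhaustive truth-table comparison over n Boolean variables
    return all(f(*v) == g(*v) for v in product((0, 1), repeat=n))

# (1) X + YZ' =? (X + Y)(Y + Z')  -- false: distribution would need (X + Z')
print(equal(lambda x, y, z: x | (y & NOT(z)),
            lambda x, y, z: (x | y) & (y | NOT(z))))     # False
# (2) (X + Y)(X' + Z) =? XZ + X'Y -- true
print(equal(lambda x, y, z: (x | y) & (NOT(x) | z),
            lambda x, y, z: (x & z) | (NOT(x) & y)))     # True
# (5) XY'Z + YZ =? XZ + YZ        -- true
print(equal(lambda x, y, z: (x & NOT(y) & z) | (y & z),
            lambda x, y, z: (x & z) | (y & z)))          # True
```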
2. About single-cycle and multi-cycle implementation, which of the following statements are true?
(a) Single-cycle implementation of CPU is used in the mainstream processors nowadays.
(b) For single-cycle implementation of CPU, the clock cycle is determined by the longest
possible path.
(c) Single-cycle implementation of CPU allows a functional unit to be used more than once per
instruction.
(d) Compared to single-cycle implementation, multicycle implementation is more efficient.
(e) None of the above.
Answer: (b)
4. About the branch prediction, which of the following statements are true?
(a) For a one-bit dynamic branch predictor, the branch prediction buffer contains one bit to
record whether the branch instruction was recently taken or not.
(b) A branch prediction buffer is a small special-purpose memory indexed by the higher-order
bits of the address of the branch instruction.
(c) For a two-bit dynamic branch predictor, a prediction must be wrong twice before it is
changed.
(d) The assumption of dynamic branch prediction is that the underlying algorithms and the data
that is being operated on have regularities.
(e) If the delayed branch slots can be scheduled with independent instructions from before the
branch, branch hazards can be avoided.
Answer: (a), (c), (d), (e)
5. For the instruction set design, which of the following statements are true?
(a) Compared to Mem-Mem architecture or Reg-Mem architecture, Reg-Reg architecture has
the disadvantage of having large variation in CPI.
(b) MIPS requires that objects must be aligned in the memory.
(c) Register indirect addressing mode can be used for accessing using a pointer or a computed
address. For example: Add R4, (R1) means that Regs[R4] ← Regs[R4] + Mem[Regs[R1]]
(d) Modern compiler technology and its ability to effectively manipulate registers has led to a
decrease in register counts in more recent architectures.
(e) For PC-relative addressing mode, a displacement is added to the program counter.
Answer: (b), (c), (e)
Single-Answer Questions
6. Consider the following instruction mix for a processor.
ALU operations: 40%, uses 4 cycles
Branch operations: 30%, uses 4 cycles
Memory references: 30%, uses 5 cycles
The un-pipelined processor has a clock cycle time of 1ns. The pipelined processor has a clock
cycle time of 1.2ns. Suppose that we ignore any latency and hazards and assume that the
pipelined processor has an ideal CPI of 1. How much speedup can be achieved when
comparing the un-pipelined processor and the pipelined processor?
(a) 3.28 (b) 3.35 (c) 3.58 (d)3.86 (e) 3.98
Answer: (c)
Note: Instruction time for the un-pipelined processor = 1 ns × (0.4 × 4 + 0.3 × 4 + 0.3 × 5) = 4.3 ns
Instruction time for the pipelined processor = 1.2 ns × 1 = 1.2 ns
Speedup = 4.3 / 1.2 = 3.58
7. We know that the Boolean function F1(A, B, C, D) = AB' + CD has the minterms, M3, M7, M8,
M9, M10, M11, M15. Now, F2(A, B, C, D) = A'B'D + CD' + A'BC' + ABD. Xi = i if F2(A, B, C,
D) has a minterm Mi, and Xi = 0 otherwise. The sum of all the sixteen Xi is K, 0 ≤ i < 16. What
is (K mod 5)? (“mod” is the modulo operation.)
(a) 0 (b) 1 (c) 2 (d) 3 (e) 4
Answer: (d)
Note: F2 has the minterms M1, M2, M3, M4, M5, M6, M10, M13, M14, M15
Summing the minterms common to F1: (3 + 10 + 15) mod 5 = 28 mod 5 = 3
(summing all of F2's minterms gives 73, and 73 mod 5 = 3 as well)
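Taking the question exactly as stated (Xi = i whenever Mi is a minterm of F2), the minterms and K can be enumerated directly; a quick sketch, treating A as the most significant bit of the minterm index:

```python
def f2(a, b, c, d):
    # F2 = A'B'D + CD' + A'BC' + ABD
    return ((not a and not b and d) or (c and not d) or
            (not a and b and not c) or (a and b and d))

minterms = [i for i in range(16)
            if f2(i >> 3 & 1, i >> 2 & 1, i >> 1 & 1, i & 1)]
print(minterms)        # [1, 2, 3, 4, 5, 6, 10, 13, 14, 15]
K = sum(minterms)      # Xi = i for each minterm, 0 otherwise
print(K, K % 5)        # 73 3
```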
AB \ CD   00   01   11   10
   00          1    1    1
   01     1    1         1
   11          1    1    1
   10                    1
8. Boolean function F3(A, B, C, D) = A'B'C' + A'C'D + ABD + BCD + ACD + AB'D' + BCD'. If
the number of essential prime implicants is K, what is (K mod 5)?
(a) 0 (b) 1 (c) 2 (d) 3 (e) 4
Answer: (c)
註:
CD 00 01 11 10
AB
00 1 1 1
01 1 1
11 1 1
10 1 1 1
9. Consider two different machines, A and B. The measurements on the two machines running a set
of benchmark programs are shown below:
Machine A (2GHz)
Instruction Type Instruction Count (millions) CPI
Arithmetic and logic 8 1
Load and store 4 3
Branch 2 4
Others 4 3
Machine B (2.2GHz)
Instruction Type Instruction Count (millions) CPI
Arithmetic and logic 10 1
Load and store 8 2
Branch 2 4
Others 4 3
10. A processor with 2GHz has a base CPI of 4 when all the memory references hit in the primary
memory. Assume that a main memory access time is 100ns, including all the miss handling and
that the miss rate at the primary cache is 8%. We add a secondary cache that has a 20-ns access
time and helps to reduce the miss rate to the main memory to 2%. The resulting CPI now is K.
What is (Round(K × 8) mod 5)? (Note: K × 8 means K multiplied by 8)
(a) 0 (b) 1 (c) 2 (d) 3 (e) 4
Answer: (a)
Note: CPI = 4 + 0.08 × 40 + 0.02 × 200 = 11.2
Round(11.2 × 8) mod 5 = Round(89.6) mod 5 = 90 mod 5 = 0
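A quick check of the note's arithmetic at the 2GHz clock (0.5 ns per cycle):

```python
BASE_CPI, CLOCK_GHZ = 4, 2
l2_penalty = 20 * CLOCK_GHZ     # 20 ns L2 access    -> 40 cycles
mem_penalty = 100 * CLOCK_GHZ   # 100 ns main memory -> 200 cycles

# 8% of references miss L1 (pay the L2 access); 2% also miss L2 (pay main memory)
cpi = BASE_CPI + 0.08 * l2_penalty + 0.02 * mem_penalty
print(round(cpi, 1))            # 11.2
print(round(cpi * 8) % 5)       # 0
```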
106 年中山電機
1. Terminology Explanation
(a) Pipeline Processing (b) Superscalar Processor (c) RISC (d) GPGPU
Answer
(a) Pipeline Processing: a category of techniques that provide simultaneous, or parallel,
processing within the computer. It refers to overlapping operations by moving data or
instructions into a conceptual pipe with all stages of the pipe processing simultaneously. For
example, while one instruction is being executed, the computer is decoding the next
instruction.
(b) Superscalar Processor: an advanced pipelining technique that enables the processor to
execute more than one instruction per clock cycle
(c) RISC: (Reduced instruction set computer) is a computer which is based on the design
strategy that simplified instructions can provide higher performance if this simplicity
enables much faster execution of each instruction.
(d) GPGPU: (General-purpose GPU): Using a GPU for general-purpose computation via a
traditional graphics API and graphics pipeline.
2. Suppose we are considering a change to an instruction set. The base machine is a load-store
machine. Measurements of the load-store machine showing the instruction mix and clock cycle
counts per instructions are given in the following table:
Instruction Type Frequency Clock cycle Count
ALU operations 40% 1
Loads 25% 2
Stores 15% 2
Branches 20% 2
Let's assume that 30% of the ALU operations directly use a loaded operand that is not used again. We
propose adding ALU instructions that have one source operand in memory. These new register-memory
instructions have a clock cycle count of 2. Suppose that the extended instruction set increases the clock
cycle count for branches by 1, but it does not affect the clock cycle time. Would this change improve CPU
performance? Explain your answer.
Answer
The frequency of register-memory ALU instructions = 0.4 × 0.3 = 0.12
The frequency of register-register ALU instructions becomes 0.4 – 0.12 = 0.28
The frequency of Load instructions becomes 0.25 – 0.12 = 0.13
The CPI for branch instructions becomes 2 + 1 = 3
CPIold = 0.4 × 1 + 0.25 × 2 + 0.15 × 2 + 0.2 × 2 = 1.6
CPInew = (0.28 × 1 + 0.13 × 2 + 0.15 × 2 + 0.2 × 3 + 0.12 × 2) / 0.88 = 1.91
ExTimeold = IC × CPIold × T = 1.6 × IC × T
ExTimenew = 0.88 × IC × CPInew × T = 0.88 × 1.91 × IC × T = 1.68 × IC × T
This change would not improve CPU performance.
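The comparison can be reproduced numerically (a sketch; the instruction-class layout is ours):

```python
old = {'alu': (0.40, 1), 'load': (0.25, 2), 'store': (0.15, 2), 'branch': (0.20, 2)}
cpi_old = sum(f * c for f, c in old.values())          # 1.6

moved = 0.40 * 0.30     # ALU ops (and their loads) folded into reg-mem instructions
new = {'alu': (0.40 - moved, 1), 'load': (0.25 - moved, 2), 'store': (0.15, 2),
       'branch': (0.20, 3), 'regmem': (moved, 2)}
ic_ratio = sum(f for f, _ in new.values())             # 0.88 of the old count
cpi_new = sum(f * c for f, c in new.values()) / ic_ratio

print(round(cpi_old, 2), round(cpi_new, 2))            # 1.6 1.91
print(round(ic_ratio * cpi_new / cpi_old, 2))          # 1.05: the new ISA is slower
```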
3. A set-associative cache has a block size of four 32-bit words and a set size of 4. The cache can
accommodate a total of 256K words. The main memory size that is cacheable is 1024M × 32 bits.
Design the cache structure, and show how the processor's addresses are interpreted.
Answer
Cacheable memory size is 1024M × 32 bits ⇒ word address length = 30 bits
Block size of four words ⇒ block offset = 2 bits
Number of sets = (256K / 4) / 4 = 16K ⇒ index length = 14 bits, tag length = 30 − 14 − 2 = 14 bits
[Figure: 4-way set-associative cache with 16K sets; the 14-bit tag is compared against the four
tags of the indexed set, and a 4-to-1 MUX selects among the four 128-bit blocks to produce
Data and Hit]
4. Given the 8-bit adder (named Add8), the 2-to-1 8-bit multiplexers (named MUX8 2to1) and the
basic gates such as NOT, AND, OR, NAND, and NOR, you are asked to design an ALU in
function block diagrams, which must match the following requirements:
(1) Support add, sub, and sgt (set on greater than) functions. Their operation selection bits (op_sel)
are as follows: add(00), sub(10), sgt(11).
(2) Report the result status in sign, zero, overflow, and carry bits.
Answer
(1) [Figure: for sub/sgt, op_sel selects the complement of operand b through eight 2-to-1 MUXes
and feeds a 1 into the adder's carry-in c0, so Add8 adds a and the two's complement of b; the
sum bits s0–s7 are routed to the result r0–r7]
(2) [Figure: the same datapath with status outputs added: zero = NOR of s0–s7, sign = s7 (r7),
overflow = c7 XOR c8, carry = c8]
BNEZ R3, Loop
SW R1, -4(R2)
106 年中山資工
1. True or False. (If the statement is false, please explain the answer shortly)
(1) Increasing the block size of a cache is likely to take advantage of temporal locality.
(2) Increasing the page size tends to decrease the size of the page table.
(3) Virtual memory typically uses a write-back strategy, rather than a write-through strategy.
(4) If the cycle time and the CPI both increase by 10% and the number of instruction decreases
by 20%, then the execution time will remain the same.
(5) In uniform memory access (UMA) designs, all processors use the same address space.
Answer:
(1) F Increasing the block size of a cache is likely to take advantage of spatial locality.
(2) T
(3) T
(4) F The execution time becomes 0.8 × 1.1 × 1.1 = 0.968 of the original.
(5) T
2. Server farms such as Google and Yahoo! provide enough computing capacity for the highest
request rate of the day. Imagine that most of the time these servers operate at only 60% capacity.
Assume further that the power does not scale linearly with the load; that is, when the servers are
operating at 60% capacity, they consume 90% of maximum power. The servers could be turned
off, but they would take too long to restart in response to more load. A new system has been
proposed that allows for a quick restart but requires 20% of the maximum power while in this
"barely alive" state.
(1) How much power saving would be achieved by turning off 60% of the servers?
(2) How much power saving would be achieved by placing 60% of the servers in the “barely
alive” state?
Answer:
(1) 60%
(2) 0.4 + 0.6 × 0.2 = 0.52, which reduces the power to 52% of the original (a 48% saving)
3. A multicycle CPU has three implementations. The first one is a 5-cycle IF-ID-EX-MEM-WB
design running at 4.8GHz, where load takes 5 cycles; store/R-type 4 cycles and branch/jump 3
cycles. The second one is a 6-cycle design running 5.6GHz, with MEM replaced by MEM1 and
MEM2. The third is a 7-cycle design running at 6.4GHz, with IF further replaced by IF1 and
IF2. Assume we have an instruction mix: load 26%, store 10%, R-type 49%, branch/jump 15%.
(1) Do you think it is worthwhile to go for the 6-cycle design over the 5-cycle design?
(2) How about the 7-cycle design over the 6-cycle design, is it worthwhile?
Answer:
(1) The average CPI for implementation 1 is:
5 × 0.26 + 4 × 0.1 + 4 × 0.49 + 3 × 0.15 = 4.11
The execution time for an instruction in implementation 1 = 4.11 / 4.8G = 0.86 ns
The average CPI for implementation 2 is:
6 × 0.26 + 5 × 0.1 + 4 × 0.49 + 3 × 0.15 = 4.47
The execution time for an instruction in implementation 2 = 4.47 / 5.6G = 0.80 ns
It is worthwhile to go for the 6-cycle design over the 5-cycle design.
(2) The average CPI for implementation 3 is:
7 × 0.26 + 6 × 0.1 + 5 × 0.49 + 4 × 0.15 = 5.47
The execution time for an instruction in implementation 3 = 5.47 / 6.4G = 0.85 ns
It is not worthwhile to go for 7-cycle design over the 6-cycle design.
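The three designs can be compared in a few lines (a sketch; the per-class cycle counts are those derived above):

```python
mix = {'load': 0.26, 'store': 0.10, 'rtype': 0.49, 'branch': 0.15}
designs = {                       # cycles per instruction class, clock in GHz
    '5-cycle': ({'load': 5, 'store': 4, 'rtype': 4, 'branch': 3}, 4.8),
    '6-cycle': ({'load': 6, 'store': 5, 'rtype': 4, 'branch': 3}, 5.6),
    '7-cycle': ({'load': 7, 'store': 6, 'rtype': 5, 'branch': 4}, 6.4),
}
results = {}
for name, (cycles, ghz) in designs.items():
    cpi = sum(mix[k] * cycles[k] for k in mix)
    results[name] = (round(cpi, 2), round(cpi / ghz, 3))   # (CPI, ns per instruction)
    print(name, *results[name])
# 5-cycle 4.11 0.856 / 6-cycle 4.47 0.798 / 7-cycle 5.47 0.855
```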
4. Identify all of the data dependencies in the following code running in a 5-stage pipelined MIPS
CPU. Which dependencies are data hazards that will be resolves via forwarding? Which
dependencies are data hazards that will cause a stall?
Line Instructions
1 add $3, $4, $2
2 sub $5, $3, $1
3 lw $6, 200($3)
4 add $7, $3, $6
Answer:
Data dependencies: (line 1, line 2), (1, 3), (1, 4), (3, 4)
Data hazards: (1, 2), (1, 3), (3, 4)
Resolved via forwarding: (1, 2), (1, 3)
Causes a stall: (3, 4)
5. For a system with a 32-bit address, the CPU uses a 4-way set-associative cache with a block size
of 16 bytes. The cache has 1024 sets (entries) in total.
(1) Determine the tag size for each block.
(2) Assume each block requires 2 extra valid bits. What is the size of the cache memory?
Answer:
(1) Tag size = 32 – 10 – 4 = 18 bits
(2) The number of blocks in cache = 4 × 1024 = 4096 = 4K
The number of bits in each block = 2 + 18 + 16 × 8 = 148 bits
The total size of the cache memory = 148 × 4K = 592 Kbits
6. Given the following datapath for the single-cycle implementation of a computer and the
definition of its instructions:
[Figure: single-cycle MIPS datapath — PC and instruction memory; register file; sign-extend
unit; ALU with ALU control; data memory; and the control signals PCSrc, RegWrite, ALUSrc,
MemtoReg, MemWrite, MemRead, RegDst, ALUOp]
7. The following series of branch outcomes occurs for a single branch in a program. T means the
branch is taken; N means the branch is not taken: T T T N N T T T. How many instances of this
branch instruction are mis-predicted with a 1-bit and 2-bit local branch predictor, respectively?
Assume the Branch History Table (BHT) are initialized to the N state. You may assume that this
is the only one branch in this program.
Answer:
1-bit predictor: 3 times; 2-bit predictor: 5 times
1-bit          T  T  T  N  N  T  T  T
Current state  0  1  1  1  0  0  1  1
Next state     1  1  1  0  0  1  1  1
Correct?       ✗  ✓  ✓  ✗  ✓  ✗  ✓  ✓   (3 mispredictions)

2-bit          T  T  T  N  N  T  T  T
Current state  0  1  2  3  2  1  2  3
Next state     1  2  3  2  1  2  3  3
Correct?       ✗  ✗  ✓  ✗  ✗  ✗  ✓  ✓   (5 mispredictions)
8. A computer whose processes have 1024 pages in their address spaces keeps its page tables in
memory. The overhead required for reading a word from the page table is 500 ns. In order to
reduce the overhead, the computer has Translation Lookaside Buffer (TLB), which holds 32
(virtual page, physical page frame) pairs, and can do a look up in 100 ns. What hit rate is needed
to reduce the mean overhead to 200 ns?
Answer:
Suppose the hit rate of TLB is H
100 ns + (1 – H) × 500 ns = 200 ns ⇒ H = 0.8
106 年中正電機
(2) c1 c2 c3 c4 c5 c6 c7 c8 c9
lw $5, -10($5) IF ID EX ME WB
sw $5, -10($5) IF ID ID EX ME WB
sub $5, $5, $5 IF IF ID EX ME WB
4. A CPU has a clock rate of 4GHz and voltage of 1 V. Assume that, on average, it consumes 30W
of static power and 40W of dynamic power. If the supply voltage is reduced by 10%, how much
percentage of total power saving can be achieved?
Answer
Suppose the leakage current remains the same when the voltage is reduced.
The leakage current I = 30 W / 1 V = 30 A
The new static power = 30 A × 0.9 V = 27 W
The capacitive load C = 40 W / (4 GHz × 1² V²) = 10 nF
The new dynamic power = 10 nF × 4 GHz × (0.9 V)² = 32.4 W
(27 W + 32.4 W) / (30 W + 40 W) ≈ 0.85
The total power can be reduced by 15%
6. What's the difference between "response time" and "throughput" of a CPU? Give an idea to
improve "response time" and "throughput" of a CPU, respectively.
Answer:
Response time: Also called execution time. The total time required for the computer to
complete a task, including disk accesses, memory accesses, I/O activities, operating system
overhead, CPU execution time, and so on.
Throughput: Also called bandwidth. Another measure of performance, it is the number of
tasks completed per unit time.
Response time can be improved by speeding up the processor itself (e.g., a faster clock);
throughput can be improved by pipelining the CPU or by adding more processors.
7. When designing memory hierarchy in a computer system with DRAM and SRAM, which one is
used for cache memory and which one for main memory? Why?
Answer
Main memory is implemented with DRAM and caches are implemented with SRAM, because
DRAM is less costly per bit but has a longer access latency than SRAM.
8. Implement the function “unsigned int Fib (unsigned int n)” which returns the value of the nth
Fibonacci number, Fib (0) = 0, Fib (1) = 1, Fib (2) = 1, Fib (3) = 2, ..., Fib (n) = Fib (n - 1) + Fib
(n - 2).
(1) Write the C code.
(2) Translate your C code into MIPS code. Assume that the argument n is in $a0, and the result
is in $v0.
Answer
(1) unsigned int Fib(unsigned int n) {
    if (n <= 1)
        return n;
    else
        return Fib(n - 1) + Fib(n - 2);
}
(2) fib: addi $sp, $sp, -12 # save registers on stack
sw $a0, 0($sp)
sw $s0, 4($sp)
sw $ra, 8($sp)
bgt $a0,1, L1 # if n > 1 then goto L1
add $v0,$a0, $0 # output = input if n = 0 or n = 1
addi $sp, $sp, 12
jr $ra
L1: addi $a0, $a0, -1 # set argument = n-1
jal fib # compute fib(n-1)
move $s0, $v0 # save fib(n-1)
addi $a0, $a0, -1 # set argument to n-2
jal fib # compute fib(n-2)
add $v0, $v0, $s0 # $v0 = fib(n-2) + fib(n-1)
lw $a0, 0($sp) # restore registers from stack
lw $s0, 4($sp)
lw $ra, 8($sp)
addi $sp, $sp, 12
jr $ra
106 年中興電機
Fig. 1
Answer
(a) A(t + 1) = A ⊕ x ⊕ y
(b)
Present state Input Next state
A x y A
0 0 0 0
0 0 1 1
0 1 0 1
0 1 1 0
1 0 0 1
1 0 1 0
1 1 0 0
1 1 1 1
(c)
3. Suppose we have made the following measurements of average CPI for MIPS instructions
shown in Table 1. Please compute the effective CPI for the MIPS machine.
Table 1
Instruction          MIPS examples   Frequency (Integer)   Frequency (Floating point)   Average CPI
Arithmetic           add, sub        24%                   48%                          1.0 clock cycles
Logic                and, sll        18%                   4%                           1.0 clock cycles
Data transfer        lw, sw          36%                   40%                          1.3 clock cycles
Conditional branch   beq, bne        18%                   6%                           1.6 clock cycles
Jump                 jr, jal         4%                    2%                           1.1 clock cycles
Answer: Averaging the integer and floating-point frequencies for each class gives 36%, 11%, 38%, 12%, and 3%.
Effective CPI = 0.36 × 1 + 0.11 × 1 + 0.38 × 1.3 + 0.12 × 1.6 + 0.03 × 1.1 = 1.189
4. The performance of a program depends on the algorithm, the language, the compiler, the
architecture, and the actual hardware. The following table (Table 2) will summarize how these
components affect the factors in the CPU performance. The factors may include “CPI”,
“instruction count”, “execution time”, “gate counts”, and “clock rate”. Please determine proper
factors for “Affects what?” in Table 2. Copy the following table (Table 2) to your answer sheet
and fill the factors for the fields, i.e. A1, A2, A3, and A4, respectively.
Table 2
Hardware or Software Component Affects what?
Instruction set architecture A1
Compiler A2
Programming language A3
Algorithm A4
Answer
Hardware or Software Component Affects what?
Instruction set architecture instruction count, CPI, clock rate
Compiler instruction count, CPI
Programming language instruction count, CPI
Algorithm instruction count, CPI
Answer
Pseudo-instruction What it accomplishes
move $t1, $t2 $t1 = $t2
beq $t1, small, H if ($t1 == small) go to H
li $t1, big $t1 = big
bge $t5, $t3, L if($t5 ≥ $t3) go to L
addi $t1, $t2, big $t1 = $t2 + big
lw $t5, big($t0) $t5 = Memory[$t0 + big]
6. Consider the following assembly language code:
I0: ADD R4 = R1 + R0;
I1: SUB R9 = R3 - R4;
I2: ADD R4 = R5 + R6;
I3: LDW R2 = MEM[R3 + 100];
I4: LDW R2 = MEM[R2 + 0];
I5: STW MEM[R4 + 100] = R2;
I6: AND R2 = R2 & R1;
I7: BEQ R9 == R1, Target;
I8: AND R9 = R9 & R1;
Consider a pipeline with forwarding, hazard detection, and 1 delay slot for branches. The
pipeline is the typical 5-stage IF, ID, EX, MEM, WB MIPS design. For the above code,
complete the pipeline diagram below (instructions on the left, cycles on top) for the code. Insert
the characters IF, ID, EX, MEM, WB for each instruction in the boxes. Assume that there are two
levels of bypassing, that the second half of the decode stage performs a read of source registers,
and that the first half of the write-back stage writes to the register file. Label all data stalls (Draw
an X in the box). Label all data forwards that the forwarding unit detects (arrow between the
stages handing off the data and the stages receiving the data). What is the final execution time of
the code?
Copy the following table onto your answer sheet and fill it in.
T0 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14
I0
I1
I2
I3
I4
I5
I6
I7
I8
Answer
     T0  T1  T2  T3  T4  T5  T6  T7  T8  T9  T10 T11 T12 T13 T14
I0   IF  ID  EX  ME  WB
I1       IF  ID  EX  ME  WB
I2           IF  ID  EX  ME  WB
I3               IF  ID  EX  ME  WB
I4                   IF  ID  X   EX  ME  WB
I5                       IF  X   ID  EX  ME  WB
I6                               IF  ID  EX  ME  WB
I7                                   IF  ID  EX  ME  WB
I8                                       IF  ID  EX  ME  WB
I3→I4 is a load-use hazard and costs one stall cycle; forwarding covers the remaining
dependences (e.g., I0→I1 into EX, and I4's loaded value into I5 and I6). I8 (the branch
delay slot) finishes WB in T13, so the execution time of the code is 14 cycles.
7. Floating-Point Representation
(a) What decimal number does the bit pattern 0x0C000000 represent if it is a floating point
number? Use the IEEE 754 standard.
(b) Write down the binary representation of the decimal 63.25 assuming the IEEE 754 single
precision format.
Answer
(a) 0C000000₁₆ = 0000 1100 0000 0000 0000 0000 0000 0000₂
Sign Exponent Fraction
0 00011000 00000000000000000000000
Decimal number = +(1.0 × 2²⁴⁻¹²⁷) = 2⁻¹⁰³
(b) 63.25₁₀ = 111111.01₂ = 1.1111101₂ × 2⁵
Sign Exponent Fraction
0 10000100 11111010000000000000000
8. In order to explore energy efficiency and its relationship with performance, we assume the
following energy consumption for activity in Instruction memory, Registers, and Data memory.
You can assume that the other components of the datapath spend a negligible amount of energy.
Assume that components in the datapath have the following latencies. You can assume that the
other components of the datapath have negligible latencies.
(1) How much energy is spent to execute an ADD instruction in a single-cycle design and in
the 5-stage pipelined design?
(2) What is the worst case MIPS instruction in terms of energy consumption, and what is the
energy spent to execute it?
(3) If energy reduction is important, how would you change the pipelined design? What is the
percentage reduction in the energy spent by an LW instruction after this change?
(4) What is the performance impact (clock cycle time) of your changes from (3).
Answer
(1) The energy for the two designs is the same: I-Mem is read, two registers are read, and a
register is written. We have:
140 pJ + 2 × 70 pJ + 60 pJ = 340 pJ
(2) Because the sum of memory read and register write energy is larger than memory write
energy, the worst-case instruction is a load instruction. For the energy spent by a load, we
have:
140 pJ + 2 × 70 pJ + 140 pJ + 60 pJ = 480 pJ
(3) We can avoid reading registers whose values are not going to be used. To do this, we must
add RegRead1 and RegRead2 control inputs to the Registers unit to enable or disable each
register read. With these new control signals, a lw instruction results in only one register
read (we still must read the register used to generate the address), so we have:
140 pJ + 70 pJ + 140 pJ + 60 pJ = 410 pJ, a reduction of 70 pJ / 480 pJ ≈ 14.6%
(4) After the change, the latencies of Control and Register Read cannot be overlapped. This
increases the latency of the ID stage and could affect the processor’s clock cycle time if
the ID stage becomes the longest-latency stage. We have:
Clock cycle time before change Clock cycle time after change
250 ps (D-Mem in MEM stage) No change (150 ps + 90 ps < 250 ps)
106 年中興資工
1. True or False
(1) An interrupt is an event that causes an unexpected change in control flow but comes from
outside of the processor.
(2) Pipelining improves instruction throughput rather than individual instruction execution
time.
(3) DMA is implemented with a specialized controller that transfers data between an I/O device
and memory dependent on the processor.
(4) The forwarding technique can eliminate all pipeline stalls caused by data dependency.
(5) There are six data hazards in this program:
add R2, R4, R4
sub R1, R2, R1
add R3, R1,R2
add R1, R1, R3
Answer
(1) (2) (3) (4) (5)
T T F F F
Note (3): DMA transfers data between an I/O device and memory independent of the processor.
Note (4): The stall caused by a load-use data hazard cannot be eliminated by forwarding.
Note (5): There are 5 data hazards: (1, 2), (1, 3), (2, 3), (2, 4), (3, 4)
2. Assume there is a computer with the characteristics shown in Table 1, where X denotes that a
stage is not required. What is its CPI with a variable clock design? If this computer is
implemented with a single clock design, what is its CPI? What is the improvement in
performance?
Table 1
             Distribution   IF     ID     EXE    MEM    WB
Load         30%            2 ns   1 ns   3 ns   1 ns   1 ns
Store        15%            2 ns   1 ns   3 ns   2 ns   X
Arithmetic   40%            2 ns   1 ns   1 ns   X      1 ns
Branch       15%            2 ns   1 ns   2 ns   X      X
Answer
CPI for variable clock design (multi-cycle machine) = 0.3 × 5 + 0.15 × 4 + 0.4 × 4 + 0.15 × 3 = 4.15
CPI for single clock design = 1
Instruction time for single clock design = 2 + 1 + 3 + 1 + 1 = 8 ns
Instruction time for variable clock design = 4.15 × 3 ns = 12.45 ns
Improvement in performance (speedup) = 12.45 / 8 = 1.56
3. Assume there is a processor with a CPI of 3 without memory stalls and a memory stall
penalty of 150 clock cycles. If 8% of the instructions of a program require accessing
memory and the miss rate of the cache is 0.5, please find the CPI with memory stalls.
Answer
CPI with memory stalls = 3 + 1.08 × 0.08 × 150 = 15.96
106 年台科大資工
1. A compiler designer is trying to decide between two code sequences for a particular computer.
The hardware designers have supplied the following facts:
CPI for instruction class
      A   B   C
CPI   7   3   1
For a particular high-level-language statement, the compiler writer is considering two code
sequences that require the following instruction counts:
3. (1) Given the following MIPS instruction Code segment, please answer each question below
16 L1: addi $t0, $t0, 4
20 lw $s1, 0($t0)
24 sw $s1, 32($t0)
28 lw $s1, 64($t0)
32 slt $s0, $t1, $zero
Given a pipelined processor which has 5 stages: IF, ID, EX, ME, WB. Assume no forwarding
unit is available. There are hazards in the code; please detect the hazards and point
out where to insert no-ops (or bubbles) to make the pipeline datapath execute the
code correctly. You don't need to rewrite the entire code segment. You can simply
indicate the location where you would insert no-ops. For example, if you want to
insert 6 no-ops between the instruction addi at address 16 and lw at address 20, you
can state something like "6 no-ops between 16 and 20".
(2) Assume the following code is executed on a MIPS CPU with 5 pipeline stages and data
forwarding capability. If this CPU use “always assume branch not taken” strategy to
handle branch instruction but the branch is taken in this example, how many clock cycles
are required to complete this program?
lw $4, 50($7)
beq $1, $4, 3
add $5, $3, $4
sub $6, $4, $3
or $7, $5, $2
slt $8, $5, $6
Answer:
(1) 2 no-ops between 16 and 20
2 no-ops between 20 and 24
2 no-ops between 28 and 32
2 no-ops between 32 and 36
(2) 3 instructions are executed (beq is taken); 2 stalls between lw and beq (data hazard); 1
instruction is flushed (guess wrong)
Total clock cycles = (5 – 1) + 3 + 2 + 1 = 10
106 年台科大電子
3. Multi-core has become a popular technology for new generation processors. However, the
amount of performance gained by the use of a multi-core processor does not scale with the
number of cores inside the chip. Why?
Answer: The improvement in performance gained by the use of a multi-core processor depends
very much on the software algorithms used and their implementation. In particular,
possible gains are limited by the fraction of the software that can run in
parallel simultaneously on multiple cores; this effect is described by Amdahl's law. In the
best case, so-called embarrassingly parallel problems may realize speedup factors near the
number of cores.
4. For the memory hierarchy, describe the two different types of locality.
(1) Temporal locality.
(2) Spatial locality.
Answer:
(1) Temporal locality: if an item is referenced, it will tend to be referenced again soon.
(2) Spatial locality: if an item is referenced, items whose addresses are close by will tend to
be referenced soon.
5. For 1-bit full adder with inputs A, B, Cin and outputs SUM and Carry.
(1) Please write down its truth table and express SUM and Carry in Sum-of-minterms.
(2) Please use K-map to obtain the minimal sum-of-product form.
Answer:
(1)
A B Cin (C) SUM Carry
0 0 0 0 0 SUM = A’B’C + A’BC’ + AB’C’ + ABC
0 0 1 1 0
0 1 0 1 0 Carry = A’BC + AB’C + ABC’ + ABC
0 1 1 0 1
1 0 0 1 0
1 0 1 0 1
1 1 0 0 1
1 1 1 1 1
(2)
SUM                          Carry
       BC                           BC
       00  01  11  10               00  01  11  10
A = 0   0   1   0   1        A = 0   0   0   1   0
A = 1   1   0   1   0        A = 1   0   1   1   1
No two 1s are adjacent in the SUM map, so SUM cannot be simplified:
SUM = A'B'C + A'BC' + AB'C' + ABC = A ⊕ B ⊕ C
Grouping the three pairs of adjacent 1s in the Carry map gives the minimal sum-of-products:
Carry = AB + AC + BC
106 台師大資工
3. Consider a cache with 64 blocks, and a block size of 64 bytes.
(1) Suppose the cache is direct mapped. What cache block number does byte address 16000
map to?
(2) Suppose the cache is direct mapped. What cache block number does byte address 32128
map to?
(3) Suppose the cache is 2-way set associative. How many sets are there in the cache?
(4) Suppose the cache is 2-way set associative. What cache set number does byte address
16000 map to?
Answer
(1) Memory block address = 16000 / 64 = 250
The cache block number that address 16000 maps to is 250 mod 64 = 58
(2) Memory block address = 32128 / 64 = 502
The cache block number that address 32128 maps to is 502 mod 64 = 54
(3) The number of sets in the cache = 64 / 2 = 32
(4) The cache set number that address 16000 maps to is 250 mod 32 = 26
4. Consider a MIPS processor with five-stage (i.e., IF, ID, EX, MEM, WB) pipeline. Suppose the
processor has separate instruction and data memories. The hazard detection and forwarding units
are also employed. Assume there is no structural hazard. There is also no cache miss. Find the
number of clock cycles required by the pipeline for the execution of each of the following
sequences.
(1) ADD R3, R2, R1 ; R3 ← R2 + R1
ADD R4, R10, R3 ; R4 ← R10 + R3
(2) LW R6, 12(R4) ; R6 ← MEM[R4+12]
ADD R8, R10, R6 ; R8 ← R10 + R6
ADD R12, R6, R8 ; R12 ← R6 + R8
Answer
(1) (5 – 1) + 2 = 6
(2) (5 – 1) + 3 + 1 = 8
5. Consider a two-level cache. Suppose the hit time of the L1 cache is 1 clock cycle. The local miss
rate of the L1 cache is 25%. The hit time of the L2 cache is 6 clock cycles. The local miss rate
for L2 is 4%. The L2 is connected to main memory. It will take 250 clock cycles to access a
cache block from main memory.
(1) Compute the average memory access time of L2 cache (in clock cycles),
(2) Compute the miss penalty for L1 cache (in clock cycles),
(3) Compute the average memory access time of L1 cache (in clock cycles).
Answer
(1) AMATL2 = 6 + 0.04 × 250 = 16
(2) L1 cache miss penalty = AMATL2 = 6 + 0.04 × 250 = 16
(3) AMATL1 = 1 + 0.25 × 16 = 5
106 年彰師大電子、資工
1. Consider the following C codes (procedure swap(v, j) swaps v[j] and v[j+1]).
void sort ( int v[ ], int n )
{int i, j;
for (i = 0; i < n; i += 1) {
for ( j = i – 1; j >= 0 && v[j] > v[j+1]; j -= 1) {swap(v, j);}
}
}
(1) Assume n = 5 and v[0] to v[4] are initially 3, 5, 1, 9, 7. List the temporary results of v[ ] at
the end of i = 0, 1, 2, 3, 4, respectively. In this example, how many swaps are actually
invoked totally?
(2) Implement the outer loop (i loop) with MIPS assembly language (Assume i is in $s0, j is in
$s1, v is in $s2, and n is in $s3).
Answer:
(1)
V[0] V[1] V[2] V[3] V[4] No. of swap
i=0 3 5 1 9 7 0
i=1 3 5 1 9 7 0
i=2 1 3 5 9 7 2
i=3 1 3 5 9 7 0
i=4 1 3 5 7 9 1
swap is invoked 3 times in total
(2)
move $s0, $zero
for1tst: slt $t0, $s0, $s3
beq $t0, $zero, exit1
...
(body of first for loop)
...
addi $s0, $s0, 1
j for1tst
exit1:
2. Transform the decimal real into IEEE754 single precision and vice versa.
(1) -1.625 (decimal real) note: use hex format to shorten your answer
(2) C0A00000 (IEEE754 single precision in hex)
Answer:
(1) BFD00000₁₆
-1.625₁₀ = -1.101₂ × 2⁰
S E F
1 01111111 10100000000000000000000
(2) -5.0₁₀
C0A00000₁₆ = 1 10000001 01000000000000000000000₂
-1.01₂ × 2² = -101₂ = -5₁₀
3. Consider 2 different implementations P1 and P2 of the same Instruction Set Architecture (ISA).
There are 4 classes of instructions A, B, C and D, as shown in the following table.
Clock rate CPIA CPIB CPIC CPID
P1 1.5 GHz 1 2 3 4
P2 2 GHz 2 2 2 2
(1) Given a program with 10⁶ instructions divided into classes as follows: 10% class A, 20%
class B, 50% class C and 20% class D, which implementation is faster?
(2) What is the global CPI for each implementation?
(3) Find the clock cycles required in both cases.
Answer:
(1) CPI for P1 = 1 × 0.1 + 2 × 0.2 + 3 × 0.5 + 4 × 0.2 = 2.8
The execution time for P1 = 10⁶ × 2.8 / 1.5 GHz = 1.87 ms
CPI for P2 = 2
The execution time for P2 = 10⁶ × 2 / 2 GHz = 1 ms
P2 is faster than P1
(2) Global CPI for P1 = 2.8
Global CPI for P2 = 2
(3) The clock cycles for P1 = 2.8 × 10⁶ = 2.8M
The clock cycles for P2 = 2 × 10⁶ = 2M
5. Explain the following terminologies.
(1) Program counter (PC)
(2) Exception program counter (EPC)
(3) Cache miss
(4) Page fault
Answer:
(1) Program counter (PC): The register containing the address of the instruction in the program
being executed
(2) Exception program counter (EPC): A register which is used to store the address of the
offending instruction that was executing when the exception was generated.
(3) Cache miss: A request for data from the cache that cannot be filled because the data is not
present in the cache.
(4) Page fault: An event that occurs when an accessed page is not present in main memory.
8. For the single-cycle design in Fig. 1, determine what instructions (add, lw, sw, or beq) will be
affected and cannot execute correctly if some functional unit is removed. Fill in the blanks in
Table 1 with a letter “X” if the corresponding instruction is affected.
Table 1
Unit add lw sw beq
“Sign-extend” removed
“Shift left 2” removed
“Mux A” removed
“Mux B” removed
[Fig. 1: single-cycle MIPS datapath. The control unit decodes Instruction [31-26] into RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; Instruction [15-0] goes through sign extension (16 → 32 bits) and a shift-left-2 into the branch-target adder (PCSrc); Instruction [5-0] drives the ALU control.]
Fig. 1
Answer:
Unit                     add   lw    sw    beq
"Sign-extend" removed          X     X     X
"Shift left 2" removed                     X
"Mux A" removed                X     X
"Mux B" removed          X     X     X     X