105 清大資工
1. For the following questions we assume that the pipeline contains 5 stages: IF, ID, EX, M, and
W and each stage requires one clock cycle. A MIPS-like assembly is used in the representation.
(a) Explain the concept of forwarding in the pipeline design.
(b) Identify all of the data dependencies in the following code. Which dependencies are data
hazards that will be resolved via forwarding?
ADD $2, $5, $4
ADD $4, $2, $5
SW $5, 100($2)
ADD $3, $2, $4
Answer:
(a) Forwarding is a method of resolving a data hazard by retrieving the missing data element
from internal buffers rather than waiting for it to arrive from programmer-visible registers or
memory.
(b)
Data dependencies: (1, 2), (1, 3), (1, 4), (2, 4)
Data hazards resolved via forwarding: (1, 2), (1, 3), (2, 4)
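The dependence analysis above can be checked mechanically. A minimal sketch (the instruction encoding and helper name are illustrative, not from the exam):

```python
# Sketch: find read-after-write (RAW) dependencies in a straight-line
# MIPS-like sequence. Each entry is (op, dest, sources); sw writes no
# register and reads both its data register and its base register.
code = [
    ("add", "$2", ("$5", "$4")),   # 1: writes $2
    ("add", "$4", ("$2", "$5")),   # 2: writes $4, reads $2
    ("sw",  None, ("$5", "$2")),   # 3: reads $5 and base $2
    ("add", "$3", ("$2", "$4")),   # 4: reads $2 and $4
]

def raw_dependencies(code):
    """Return (producer, consumer) index pairs, 1-based."""
    deps = []
    for i, (_, dest, _) in enumerate(code):
        if dest is None:
            continue
        for j in range(i + 1, len(code)):
            _, d2, srcs = code[j]
            if dest in srcs:
                deps.append((i + 1, j + 1))
            if d2 == dest:     # register overwritten: later readers
                break          # depend on the new producer instead
    return deps

print(raw_dependencies(code))  # [(1, 2), (1, 3), (1, 4), (2, 4)]
```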
2. Suppose out-of-order execution is adopted in a pipeline design. The figure below gives a generic
pipeline architecture with an out-of-order pipeline design.
(a) Explain the needed work in the reservation station and reorder buffer, respectively, in order
to support out of order execution.
(b) In order to reduce power usage, we want to power off a floating-point unit when it is no
longer in use. The following code fragment uses a new instruction called Power-off to
attempt to turn off the floating-point multiplier when it is no longer in use. To deal with
out-of-order execution, what work is needed in the reservation station and reorder buffer
to prevent the “Power-off @multiplier” instruction from being moved in front of the set of
multiply instructions?
mul.d $f2, $f4, $f6
mul.d $f1, $f3, $f5
mul.d $f7, $f1, $f2
Power-Off @multiplier
/* Integer operations in the rest of the assembly code*/
ADD $2, $5, $4
…
END
Answer:
(a) A reservation station is a buffer within a functional unit that holds the operands and the
operation.
A reorder buffer is the buffer that holds results in a dynamically scheduled processor until it
is safe to store the results to memory or a register.
(b) The reservation station of the floating-point multiplier has to check that no floating-point
multiplication instructions remain, and the reorder buffer has to commit the
Power-off @multiplier instruction only after all floating-point multiplication instructions.
3. A non-pipelined processor A has an average CPI (Clock Per Instruction) of 4 and has a clock
rate of 40MHz.
(a) What is the MIPS (Million Instructions Per Second) rate of processor A
(b) If an improved successor, called processor B, of processor A is designed with four-stage
pipeline and has the same clock rate, what is the maximum speedup of processor B as
compared with processor A?
(c) What is the CPI value of processor B?
Answer:
(a) MIPS of processor A = (40 × 10⁶) / (4 × 10⁶) = 10
(b) Speedup = (4 / 40M) / (1 / 40M) = 4
(c) The CPI value of processor B is 1
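The arithmetic above can be sanity-checked in a few lines (variable names are illustrative):

```python
# Sanity-checking question 3's arithmetic.
clock_hz = 40e6                      # 40 MHz
cpi_a = 4.0                          # non-pipelined CPI

mips_a = clock_hz / (cpi_a * 1e6)    # (a) MIPS rate
speedup = cpi_a / 1.0                # (b) ideal 4-stage pipeline reaches CPI = 1
print(mips_a, speedup)               # 10.0 4.0
```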
4. Let sᵢ and cᵢ denote, respectively, the Sum and CarryOut of a one-bit adder Adder(aᵢ, bᵢ), where
aᵢ and bᵢ are inputs.
(a) For a 16-bit addition, if we use a ripple-carry adder, how many gate delays are required to
generate c₁₆ after bits A, B, and c₀ are applied as input?
(b) Suppose we use a carry-look-ahead adder in which the 16-bit adder is divided into four
4-bit adders, each containing a 4-bit carry-look-ahead circuit. Let pᵢ = aᵢ + bᵢ and gᵢ =
aᵢ · bᵢ, and assume one gate delay for each pᵢ and gᵢ, two gate delays for each carry
c₄, c₈, c₁₂, c₁₆, and a final three gate delays for s₁₅. How many gate delays are required to
generate c₁₆ after A, B, and c₀ are applied as input?
(c) Another way to apply a look-ahead circuit is to add a second-level look-ahead circuit. With
the second-level look-ahead circuit, how many gate delays are required to generate c₁₆
after A, B, and c₀ are applied as input?
Answer:
(a) 16 × 2 = 32 gate delays
(b) 1 + 2 + 2 + 2 + 2 = 9 gate delays
(c) 1 + 2 + 2 = 5 gate delays
5. Fill in each “?” below with “up” or “down”. For example, when associativity goes up, the
access time goes up.
Design change | Effect on miss rate | Possible effects
Associativity (up) | conflict miss ? | access time up
Cache size (down) | capacity miss ? | access time ?
Block size (down) | spatial locality ? | miss penalty ?
Answer:
Design change | Effect on miss rate | Possible effects
Associativity (up) | conflict miss down | access time up
Cache size (down) | capacity miss up | access time down
Block size (down) | spatial locality down | miss penalty down
6. In a 2-way set-associative cache, the 3 numbers 14, 2, and 3 have already been inserted, in that
order. The following table shows the time sequence as 14, 2, and 3 are inserted into the 2-way
cache. Insert the 6 numbers 6, 10, 22, 42, 11, 27 afterward and show the time sequence using LRU replacement.
0 | 1 | 2 | 3
14 | | |
14 | 2 | |
14 | 2 | 3 |
Answer:
0 | 1 | 2 | 3
14 | | |
14 | 2 | |
14 | 2 | 3 |
6 | 10 | 3 |
22 | 42 | 3 |
22 | 42 | 3 | 11
22 | 42 | 27 | 11
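The final row can be reproduced with a tiny LRU simulation. A sketch, assuming two sets selected by block number mod 2 and columns laid out as [set0-way0, set0-way1, set1-way0, set1-way1]:

```python
# Minimal 2-way set-associative cache with LRU replacement.
def lru_fill(refs, num_sets=2, num_ways=2):
    ways = [[None] * num_ways for _ in range(num_sets)]
    last_use = {}                            # (set, way) -> time of last use
    for t, block in enumerate(refs):
        s = block % num_sets
        row = ways[s]
        if block in row:                     # hit: just refresh recency
            w = row.index(block)
        elif None in row:                    # cold miss: fill an empty way
            w = row.index(None)
        else:                                # evict the least recently used way
            w = min(range(num_ways), key=lambda i: last_use[(s, i)])
        row[w] = block
        last_use[(s, w)] = t
    return [b for row in ways for b in row]  # flatten: set 0 ways, then set 1

final = lru_fill([14, 2, 3, 6, 10, 22, 42, 11, 27])
print(final)   # [22, 42, 27, 11]
```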
105 交大資聯
Multiple-answer questions
1. Which of the following statements are correct?
(a) Writing programs with a set of powerful instructions is the shortcut to yield high
performance.
(b) The number of pipeline stages affects latency, not throughput; thus pipelining improves the
performance of a processor by decreasing the latency of a job (i.e., an instruction) to be
done.
(c) For a given program, its average cycles per instruction (CPI) is affected not only by the
instruction set architecture, but also by the compiler used.
(d) By reducing the clock frequency of a processor from 2 GHz to 1.5 GHz, and also reducing
its supply voltage from 1.25 Volt to 1 Volt, the overall power consumption of this processor
will be reduced by 40% theoretically.
(e) Hexadecimal integer value 0xABBCCBBA has identical storage sequence in the
(byte-addressable) memory no matter the machine is big-endian or little-endian.
Answer: (c)
Note (d): (1.5 × 1²) / (2 × 1.25²) = 0.48, so the power consumption is reduced by 52%, not 40%.
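The note's figure follows from the dynamic-power relation P ∝ C·f·V², in which the capacitance term cancels when taking the ratio:

```python
# Dynamic power scales as P = C * f * V^2; C cancels in the ratio.
p_new_over_old = (1.5e9 * 1.0**2) / (2.0e9 * 1.25**2)
print(p_new_over_old)   # 0.48, i.e. a 52% reduction rather than 40%
```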
2. Consider a byte-addressable memory hierarchy with 44-bit addresses. The memory hierarchy
adopts a 4-way set associative cache of 64 KB, with every block in a set containing 64 bytes.
Which of the following statements are correct?
(a) Among an address of 44 bits, the least-significant 6 bits is the “offset”.
(b) Among an address of 44 bits, the most-significant 30 bits is the “tag”.
(c) The total number of bits required to store the entire cache (including valid bits, tags, and
data) is (1 + 30 + 64 × 4 × 8) × 28 = 532,224 (bits).
(d) By increasing the block size while using the same size of storage for data, the total number
of bits required to store the entire cache will decrease.
(e) Given the following sequence of access addresses: (0E1B01AA050)16, (0E1B01AA073)16,
(0E1B2FE3057)16, (0E1B4FFD85F)16, (0E1B01AA04E)16, assuming that the cache is
initially empty, there will be three sets containing exactly one referenced block at the end
of the sequence.
Answer: (a), (b), (d), (e)
Note (c): The total number of bits = (1 + 30 + 64 × 8) × 4 × 2⁸ bits
Note (e):
Address | 0E1B01AA050 | 0E1B01AA073 | 0E1B2FE3057 | 0E1B4FFD85F | 0E1B01AA04E
Index (bin) | 10000001 | 10000001 | 11000001 | 01100001 | 10000001
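The index row can be reproduced by splitting each address into its fields; with 64-byte blocks and 64 KB / (64 B × 4 ways) = 256 sets, there are 6 offset bits, 8 index bits, and 30 tag bits (the helper name is illustrative):

```python
# Split a 44-bit byte address into (tag, index, offset) for this cache.
def split(addr, offset_bits=6, index_bits=8):
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

addrs = [0x0E1B01AA050, 0x0E1B01AA073, 0x0E1B2FE3057,
         0x0E1B4FFD85F, 0x0E1B01AA04E]
indexes = [format(split(a)[1], "08b") for a in addrs]
print(indexes)
# ['10000001', '10000001', '11000001', '01100001', '10000001']
```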
3. Which of the following statements (about virtual memory and page table) are correct?
(a) For virtual memory, write-through is more practical than write-back.
(b) For virtual memory, full associativity is typically used for minimizing page fault rate.
(c) Given a 32-bit virtual address space with 2 KB per page and 4 bytes per page table entry,
the total page table size is 8 MB.
(d) It is possible to miss in cache and translation look-aside buffer (TLB), but hit in page table.
(e) It is possible to miss in TLB, but hit in cache and page table.
Answer: (b), (c), (d), (e)
ARM recently built a new CPU design center in Hsinchu, Taiwan, its first CPU design center in
Asia. You are the principal engineer on a team in charge of developing a brand-new CPU called
NCTU (Next-generation Compute Terabit Unit). The following what-if questions refer to some
assumptions. The latencies of the logic blocks in the figure are listed, while the latency of the other
blocks is almost negligible:
NCTU1, the basic standard 5-stage pipelined processor, is organized as in the following diagram.
All instructions executed by the pipelined processor break down as ALU: 40%, beq: 30%, lw:
25%, and sw: 5%.
(pipeline datapath diagram omitted: IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, PC-increment and branch-target adders, and a shift-left-2 unit)
6. Which of the following statements can be true for an ideal NCTU1 (assume there are no stalls
and all instructions are executed ideally)?
(a) The best clock rate of NCTU1 is 10GHz.
(b) The MIPS (million instructions per second) can be reached up to 10000.
(c) All pipeline registers between stages have the same width (number of bits).
(d) The pipeline clock cycle time is equal to the average of all stage latencies.
(e) The longest instruction will be lw instruction. Reducing number of lw instructions in a
program can improve the NCTU1 system throughput (e.g. CPI or instructions executed per
second).
Answer: (a), (b)
7. Ideal execution does not exist at all. Which of the following statements can be true for
improving NCTU1?
(a) The pipe stages look unbalanced. The ideal CPI can be reduced to less than 1 if we move
around the building blocks for better balance.
(b) The overall performance can be always improved by making the pipeline deeper as the
cycles are shorter.
(c) Solve the load-use data hazard (load instructions followed by a dependent instruction) by
inserting a hazard detection unit at the ID stage.
(d) Increase the register file with more read ports (three reads and a write) to avoid instruction
conflicts.
(e) If ALU control decoder (marked by dashed circle) needs to take 50ps, moving the ALU
control to the ID stage can improve NCTU1 for allowing more concurrent activities.
Answer: (c)
Note (b): When the cycle time is already very short, increasing the pipeline depth further does not necessarily improve performance.
8. For the statements in the following (i is $r8 and A is $r10), you solve the pipeline dependencies
of NCTU1 as in the textbook. How many cycles in total are needed for NCTU1 to execute them
all? (You may need to insert NOPs if necessary.)
// C or C++
A[i] += 10;
A[i] *= 2;
// MIPS
add $r1, $r8, $r10
lw $r2, 4($r1)
addi $r2, $r2, 10
multi $r2, $r2, 2
sw $r2, 12($r1)
(a) Total 9 cycles ideally if we don’t consider hazard problems.
(b) Total 19 cycles if there is no forwarding unit and no hazard detection unit, and we use
NOPs.
(c) Total 11 cycles if there is a forwarding unit and no hazard detection unit for load-use.
(d) Total 10 cycles if there are forwarding unit and hazard detection unit for load-use.
(e) The forwarding unit will not be used in the lw instruction (load-use scenario) when hazard
detection unit is triggered.
Answer: (a), (d)
Note (b): Total cycles = (5 – 1) + 5 + 8 = 17, not 19.
Note (c): Total cycles = (5 – 1) + 5 + 1 = 10, not 11.
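The cycle counts in these notes all follow the same pattern, (stages − 1) fill cycles plus one issue cycle per instruction plus stall cycles; a sketch (the helper is illustrative):

```python
# Cycle count on a K-stage pipeline: (K - 1) fill cycles, one issue
# cycle per instruction, plus any stall cycles.
def total_cycles(stages, instructions, stalls):
    return (stages - 1) + instructions + stalls

print(total_cycles(5, 5, 8))   # 17: no forwarding, NOPs counted as stalls
print(total_cycles(5, 5, 1))   # 10: forwarding plus one load-use stall
```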
9. Now we hope to design a reduced, low-cost version using fewer stages, a new 4-stage pipelined
CPU called NCTU2. We can remove the MEM stage and put the memory access in parallel with
the ALU. Which of the following statements will be true for NCTU2?
(a) It is feasible except all load/store instructions with nonzero offsets, whereas we have to
convert them into register-based only (no offset). Specifically, convert
lw $r3, 30($r5)
into
addi $r1, $r5, 30
lw $r3, ($r1)
(b) The NCTU2 has better overall system throughput, as instructions take less cycles to
complete.
(c) The NCTU2 will have a worse clock rate, as its cycle time is longer than that of NCTU1.
(d) NCTU2 still needs a stall hazard detection unit for load-use condition.
(e) The data forwarding unit is still needed for NCTU2.
Answer: (a), (e)
Branches make up a significant portion of program execution. In the current NCTU1, branch
outcomes are determined at the MEM stage. This is one of the key problems.
10. Which of the following statements can be true for NCTU1, considering branches only?
(a) The CPI with all branch stalls is 1.9 for NCTU1 (assume no other stalls.)
(b) If the NCTU1 is designed with deeper pipelined stages, the branch control hazard becomes
harder to solve.
(c) Assuming branches are not taken can continue execution down the sequential instruction
stream and thus improve the NCTU1 without changing hardware.
(d) If branches handling is moved to the EXE stage, it can improve CPI without increasing the
cycle time.
(e) If more hardware resources such as forwarding comparator are provided, the branch
resolving can be perfectly moved to the IF stage while still maintaining the normal
pipelining mechanism.
Answer: (a), (b)
Note (a): CPI = 1 + 0.3 × 3 = 1.9
Note (b): The AND gate (with the zero and branch inputs) should be moved from the MEM stage to
the EXE stage.
Question group A: Now we have a new NCTU3 derived from NCTU1, where the branch handling is
moved to the ID stage.
11. Which of the following statements will be true?
(a) It needs more hardware costs such as extra adder for subtraction and a forwarding unit.
(b) The branch target address adder has to be replicated.
(c) The good thing is that it improves CPI without increasing cycle time.
(d) The branch only CPI of new NCTU3 will become 1.1.
Answer: (a)
Note (d): CPI = 1 + 0.3 × 1 = 1.3, not 1.1.
12. Even with NCTU3, we hope to improve the branch problem further. Which of the following
statements is NOT true?
(a) Delayed branch is a mechanism to delay the effect of branch. Branch delay slot in NCTU3
is 1. It also needs no extra hardware resources in NCTU3.
(b) Compiler can help in filling delay slots. If compiler can find any safe instruction for 50%
of branches, the CPI can be reduced to 1.15.
(c) Predicting branch as not-taken is a good mechanism and it needs no extra hardware
resources in NCTU3.
(d) Branch prediction buffer is good to predict the branch outcome, but it does not help in
predicting the branch target.
Answer: (c)
(b) 1
(c) 2
(d) 3
(e) 4
Answer: (c)
Note:
Byte address | Block address | Tag | Index | Hit or Miss?
2 | 0 | 0 | 0 | Miss
14 | 1 | 0 | 1 | Miss
0 | 0 | 0 | 0 | Hit
17 | 2 | 1 | 0 | Miss
12 | 1 | 0 | 1 | Hit
3 | 0 | 0 | 0 | Miss
105 台聯大電機
1. Consider an implementation of an instruction set architecture. The instructions can be divided
into three classes according to their CPI (class A, B, and C). The CPIs are 1, 2, and X for the
three classes, respectively, and the clock rate is 4.4 GHz.
Give a program with a dynamic instruction count of 5 × 10⁹ instructions divided into classes as
follows: 20% class A, 60% class B, and 20% class C. The program is compiled using compiler A.
(1) Find X (the CPI of class C) given that the global CPI is 2.2.
(2) Find the number of clock cycles required.
(3) What is the performance expressed in instructions per second (the global CPI is 2.2)?
(4) A new compiler, compiler B, is developed that uses only 10⁹ instructions and has an
average CPI of 1.1. Assume the programs compiled using compilers A and B run on
Processor A and Processor B, respectively. If the execution time on Processor B is half
of the execution time on Processor A, how much faster is the clock of Processor A
versus the clock of Processor B? (Find Clock A / Clock B)
Answer:
(1) 2.2 = 1 × 0.2 + 2 × 0.6 + X × 0.2 ⇒ X = 4
(2) Clock cycles = 2.2 × 5 × 10⁹ = 11 × 10⁹
(3) Instructions per second = (4.4 × 10⁹) / 2.2 = 2 × 10⁹
(4) Execution time (Processor A) = (11 × 10⁹) / (4.4 × 10⁹) = 2.5 s
Execution time (Processor B) = 2.5 / 2 = 1.25 s = (10⁹ × 1.1) / Clock rate B
⇒ Clock rate B = 0.88 GHz
The clock of Processor A is 4.4 GHz / 0.88 GHz = 5 times faster than that of Processor B
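The whole computation can be replayed numerically (variable names are illustrative):

```python
# Replaying question 1 of this section numerically.
clock = 4.4e9                          # 4.4 GHz, in Hz
insns_a = 5e9                          # dynamic instruction count, compiler A

x = (2.2 - 1 * 0.2 - 2 * 0.6) / 0.2    # (1) CPI of class C
cycles = 2.2 * insns_a                 # (2) total clock cycles
ips = clock / 2.2                      # (3) instructions per second
time_a = cycles / clock                # 2.5 s on Processor A
time_b = time_a / 2                    # 1.25 s on Processor B
clock_b = 1e9 * 1.1 / time_b           # (4) Processor B's clock rate
print(round(x), round(ips), round(clock / clock_b))   # 4 2000000000 5
```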
2. Assume there is a 16-bit half precision format. The leftmost bit is still the sign bit, the exponent
is 5 bits wide and has a bias of 15, and the mantissa is 10 bits long. A hidden 1 is assumed.
Please write down the bit pattern representing –2.1875 × 10⁻¹ in this 16-bit half-precision
format. (Note: 0.21875 = 7/32.)
Answer:
–2.1875 × 10⁻¹ = –0.00111₂ = –1.11₂ × 2⁻³. The half-precision encoding is 1 01100 1100000000
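The encoding can be verified with Python's struct module, which supports the 16-bit half-precision format via the 'e' conversion:

```python
# Encoding -0.21875 in IEEE 754 half precision and printing its fields.
import struct

bits = int.from_bytes(struct.pack(">e", -0.21875), "big")
sign = bits >> 15
exponent = (bits >> 10) & 0x1F      # 5-bit exponent, bias 15
fraction = bits & 0x3FF             # 10-bit fraction
print(f"{sign:01b} {exponent:05b} {fraction:010b}")   # 1 01100 1100000000
```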
3. Given the following C program segment, the converted MIPS assembly code is shown below.
Assume that the arguments n and s are located in $a0 and $a1.
int Sum (int n, int s ) {
if (n < 1) return 0;
else return (n + sum (n – s, s))
};
Sum: addi $sp, $sp, -12
  (a)
  sw $a1, 4($sp)
  sw $a0, 0($sp)
  slti $t0, $a0, 1
  beq $t0, $zero, L1
  add $v0, $zero, $zero
  (b)
  jr $ra
L1: sub $a0, $a0, $a1
  jal sum
  lw $a0, 0($sp)
  lw $a1, 4($sp)
  (c)
  addi $sp, $sp, 12
  add $v0, $a0, $v0
  jr $ra
(1) Please fill in the blanks (a), (b), and (c) to complete this assembly codes.
(2) Let the initial values of $a0, $a1, $t0, and $sp be 0x6, 0x2, 0x8, and 0x6FFF808C,
respectively. What are the final values of $v0 and $a0 after the program completes? What
is the value of $sp when the first “jr” instruction is encountered?
Answer:
(1)
(a) sw $ra, 8($sp)
(b) addi $sp, $sp, 12
(c) lw $ra, 8($sp)
(2)
$v0 = 0xC, $a0 = 0x6; $sp = 0x6FFF808C when the first jr is encountered
4. Consider the following architecture. The latency of each block is given in the following.
Assume that the control block has zero delay if not specified.
(single-cycle MIPS datapath diagram with block latencies omitted)
Answer:
(1) The jump instruction j is not supported by this architecture, and the datapath should be
revised as follows to support it:
(revised datapath diagram omitted: Instruction[25:0] is shifted left 2 bits into a 28-bit value and concatenated with PC+4[31:28] to form the 32-bit jump address, which an added multiplexer can select as the next PC)
(1) Assume that all branches are perfectly predicted and that no delay slots are used. If we only
have one memory (for both instructions and data), there is a structural hazard every time
we need to fetch an instruction in the same cycle in which another instruction accesses
data, To guarantee forward progress, this hazard must always be resolved in favor of the
instruction that accesses data. What is the total execution time of this instruction sequence
in the 5-stage pipeline that only has one memory? We have seen that data hazards can be
eliminated by adding nops to the code. Can you do the same with this structural hazard?
Why?
(2) Assuming stall-on-branch and no delay slots, what speedup is achieved on this code if
branch outcomes are determined in the ID stage, relative to the execution where branch
outcomes are determined in the Exe stage?
(3) Assume that all branches are perfectly predicted and that no delay slots are used. If we
change load/store instructions to use a register (without an offset) as the address, these
instructions no longer need to use the ALU. As a result, Mem and Exe stages can be
overlapped and the pipeline has only 4 stages. Change this code to accommodate this
changed ISA. Assuming this change does not affect clock cycle time, what speedup is
achieved in this instruction sequence?
(4) Given these pipeline stage latencies, repeat the speedup calculation from (3), but take into
account the (possible) change in clock cycle time. When Exe and Mem are done in a single
stage, most of their work can be done in parallel. As a result, the resulting Exe/Mem stage
has a latency that is the larger of the original two, plus 20 ps needed for the work that could
not be done in parallel.
(5) Given these pipeline stage latencies, repeat the speedup calculation from (3), taking into
account the (possible) change in clock cycle time, Assume that the latency of the ID stage
increases by 50% and the latency of the Exe stage decreases by 10ps when branch outcome
resolution is moved from Exe to ID.
Answer:
(1) ** represents a stall when an instruction cannot be fetched because a load or store instruction
is using the memory in that cycle.

Instruction       | Pipeline stages
sw r16, 12(r6)    | IF ID EX ME WB
lw r16, 8(r6)     |    IF ID EX ME WB
beq r5, r4, Label |       IF ID EX ME WB
add r5, r1, r4    |          ** ** IF ID EX ME WB
slt r5, r15, r4   |                   IF ID EX ME WB

Clock cycles = 11
We cannot add NOPs to the code to eliminate this hazard: NOPs need to be fetched just like
any other instructions.
(2) When branches execute in the EXE stage, each branch causes two stall cycles. When
branches execute in the ID stage, each branch only causes one stall cycle.
Cycles with branch in EXE: (5 – 1) + 5 + 1 × 2 = 11
Cycles with branch in ID: (5 – 1) + 5 + 1 × 1 = 10
Speedup = 11 / 10 = 1.10
(3)
Cycles with 5 stages: (5 – 1) + 5 = 9
Cycles with 4 stages: (4 – 1) + 5 = 8
Speedup = 9 / 8 = 1.13
(4) The number of cycles for the (normal) 5-stage and the (combined EX/MEM) 4-stage pipeline
is already computed in (3). The clock cycle time is equal to the latency of the longest-latency
stage. Combining EX and MEM stages affects clock time only if the combined EX/MEM
stage becomes the longest-latency stage:
Cycle time with 5 stages: 200 ps
Cycle time with 4 stages: 220 ps
Speedup = (9 × 200) / (8 × 220) = 1.02
(5)
New ID latency: 180 ps; new EX latency: 190 ps
New cycle time: 200 ps (IF); old cycle time: 200 ps (IF)
Speedup = (9 × 200) / (8 × 200) = 1.13
6. Suppose you want to perform two sums: one is a sum of 10 scalar variables, and one is a matrix
sum of a pair of two-dimensional arrays, which have dimensions 20 by 20. Suppose that only
the matrix sum is parallelizable.
(1) Assuming the load is perfectly balanced, what speedup do you get with 40 processors?
(2) If one processor's load is 2 times higher than that of all the rest, what speedup do you get
with 40 processors?
Answer: let t be the time required for one addition operation.
(1) Execution time on one processor = 9t + 400t = 409t
Execution time on 40 processors = 9t + 400t / 40 = 19t
Speedup = 409t / 19t = 21.53
(2) Execution time on one processor = 9t + 400t = 409t
Execution time on 40 processors = 9t + max(20t, 380t / 39) = 29t
Speedup = 409t / 29t = 14.1
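The two speedups can be checked directly; the max() term models the most heavily loaded processor dominating the parallel phase:

```python
# Speedup check; times are in units of one addition (t).
serial, parallel, p = 9, 400, 40

t_balanced = serial + parallel / p               # 9t + 10t = 19t
t_skewed = serial + max(2 * parallel / p,        # one processor carries a
                        (parallel - 2 * parallel / p) / (p - 1))  # double share
print(round((serial + parallel) / t_balanced, 2))  # 21.53
print(round((serial + parallel) / t_skewed, 2))    # 14.1
```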
7. Some disks are quoted to have a 1,000,000-hour mean time to failure (MTTF). A data
center might have 50,000 servers. Suppose each server has 4 disks. Use the annual failure
rate (AFR) to calculate how many disks we would expect to fail per year.
Answer:
(50,000 × 4 disks × 8,760 hrs/year) / 1,000,000 hrs/failure = 1,752 disks expected to fail per year
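Equivalently, total disk-hours per year divided by the MTTF:

```python
# Expected annual disk failures from MTTF.
disks = 50_000 * 4
failures_per_year = disks * 8_760 / 1_000_000
print(failures_per_year)   # 1752.0
```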
8. A new processor is advertised as having a total on-chip cache size of 168 Kbytes (KB) with a
byte-addressed two-level cache. It integrates 8 KB of onboard L1 cache, split evenly between
instruction and data caches, at 4 KB each. Meanwhile, the processor integrates 160 KB of
unified L2 cache with the chip packaging. The two-level cache has the following
characteristics:
 | Address | Associativity | Sector size | Block size | Caching method | Hit time | Average local miss rate
L1 cache | Physical byte-address | Direct mapped | 2 blocks/sector | 1 word/block (8 bytes/word) | Write through, no write allocate | 1 clock | 0.15
L2 cache | Virtual byte-address | 5-way set associative | 2 blocks/sector | 1 word/block (8 bytes/word) | Write through, write allocate | 5 clocks (after L1 miss) | 0.05
The system has a 40-bit physical address space and a 52-bit virtual address space. L2 miss
(transport time) takes 50 clock cycles.
(1) What is the total number of bits within each L1 cache block, including status bits?
(2) What is the total number of bits within each L1 cache sector, including status bits?
(3) What is the total number of bits within each L2 set, including status bits?
(4) Compute the average memory access time (AMAT, in clocks) for the given 2-level cache.
(5) Consider a case in which a task is interrupted, causing the L1 and L2 caches described
above to be completely flushed by other tasks; then that original task begins execution
again and runs to completion, completely refilling the caches as it runs. What is the
approximate time penalty, in clocks, associated with refilling the caches when the original
program resumes execution? Note: “approximate” means you should compute the L1 and
L2 penalties independently and then add them, rather than try to figure out the coupling between them.
Answer:
(1) 64 data bits (8 bytes per block) + 1 valid bit = 65 bits
(2) Tag = 40 bits – 12 bits (4 KB) = 28 bits, which is the only sector overhead since the cache
is direct mapped: 28 + 65 + 65 = 158 bits
(3) 64 data bits per block + 1 valid bit + 1 dirty bit = 66 bits
The cache is 32 KB × 5 in size; tag = 40 bits – 15 bits (32 KB) = 25 tag bits per sector.
Overhead bits per sector = 25 tag bits + 3 LRU-counter bits = 28 bits
Set size = ((66 × 2) + 28) × 5 = 800 bits per set
(4) AMAT = Thit1 + Pmiss1 × Thit2 + Pmiss1 × Pmiss2 × Ttransport
= 1 + (0.15 × 5) + (0.15 × 0.05 × 50) = 2.125 clocks
(5) For the L1 cache, 1K memory references (8 KB / 8 bytes) will suffer an L1 miss before the
L1 cache is warmed up, but 15% of these would have been misses anyway. Extra L1
misses = 1K × 0.85 = 870 misses.
The L2 cache will suffer (160 KB / 8 bytes) × 0.95 = 19,456 extra misses.
Penalty due to misses ≈ 870 × 5 + 19,456 × 50 = 977,150 clocks of extra execution time due
to loss of cache context
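The AMAT formula in (4) can be evaluated directly; each level's penalty is weighted by the probability an access reaches it:

```python
# AMAT = Thit1 + Pmiss1*Thit2 + Pmiss1*Pmiss2*Ttransport, in clocks.
t_hit1, p_miss1 = 1, 0.15
t_hit2, p_miss2 = 5, 0.05
t_transport = 50

amat = t_hit1 + p_miss1 * t_hit2 + p_miss1 * p_miss2 * t_transport
print(round(amat, 3))   # 2.125
```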
Note: this question concerns sector-mapped caches, which are beyond the scope of the textbook.
105 中央資工
Single-answer questions
1. Use 32-bit IEEE 754 single precision to encode “-1234.6875”. If K is the number of “1” in this
32-bit value, then what is “K mod 5”? (A) 0 (B) 1 (C) 2 (D) 3 (E) 4
Answer: (B)
Note: –1234.6875₁₀ = –10011010010.1011₂, whose single-precision encoding contains K = 11 ones, so K mod 5 = 1.
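The bit pattern and popcount can be verified with the struct module (the value is exactly representable in single precision):

```python
# Encoding -1234.6875 in IEEE 754 single precision and counting its 1 bits.
import struct

bits = int.from_bytes(struct.pack(">f", -1234.6875), "big")
k = bin(bits).count("1")
print(hex(bits), k, k % 5)   # 0xc49a5600 11 1
```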
2. Assume that the number of data dependencies existing in the following MIPS code segment is
K. What is “K mod 5”? (A) 0 (B) 1 (C) 2 (D) 3 (E) 4
add $6, $4, $5
add $2, $5, $6
sub $3, $6, $4
add $2, $2, $3
Answer: (E)
Note: the data dependencies between instructions are (1, 2), (1, 3), (2, 4), and (3, 4), so K = 4.
3. Following the previous question, assume that the code will be processed by the pipelined
datapath shown below and that the registers initially contain their number plus 100 (for
example, $2 contains 102 and $5 contains 105). In the fifth clock cycle, calculate the sum of A, B,
C, D, E, F, and G. If this sum is equal to K, what is “{Round(K/3)} mod 5”? (A) 0 (B) 1 (C) 2
(D) 3 (E) 4
(pipelined datapath diagram with forwarding unit omitted; A through G label signals in the datapath)
Answer: (A)
Note: K = 645
A | B | C | D | E | F | G
6 | 4 | 2 | 6 | 209 | 104 | 314
4. A system has a 256-Kbyte cache memory and the address format is:
bits 31–14: Tag | bits 13–4: Set | bits 3–0: Offset
The cache should be (A) 2-way set associative (B) 4-way set associative (C) 8-way set
associative (D) a direct mapped cache (E) None of above.
Answer: (E)
Note: block size = 2⁴ = 16 bytes; number of cache blocks = 256 Kbytes / 16 bytes = 16K
Number of sets = 2¹⁰ = 1K; associativity = 16K / 1K = 16, i.e., 16-way set associative
5. In a 3-level memory hierarchy system, the hit time for each level is
T1 = 10 ns (L1 cache)
T2 = 200 ns (L2 cache)
T3 = 600 ns (main memory)
The local hit rate in each level is H1 = 0.9 (L1 cache), H2 = 0.8 (L2 cache), and H3 = 1. (It is
assumed that we can always find the data in main memory.) If the average access time (in ns) of
this memory system is K, what is “{Round(K × 123)} mod 5”? (A) 0 (B) 1 (C) 2 (D) 3 (E) 4
Answer: (B)
Note: the global miss rate for the L1 cache is 0.1, and the global miss rate for the L2 cache is 0.1 × 0.2 = 0.02;
AMAT = 10 + 0.1 × 200 + 0.02 × 600 = 42 ns ⇒ K = 42
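The same AMAT-with-global-miss-rates computation, checked numerically:

```python
# AMAT with global miss rates: 10% of accesses reach L2, and
# 10% * 20% = 2% of accesses reach main memory.
t1, t2, t3 = 10, 200, 600
h1, h2 = 0.9, 0.8

amat = t1 + (1 - h1) * t2 + (1 - h1) * (1 - h2) * t3
print(round(amat, 1))   # 42.0, so K = 42 and (42 * 123) % 5 == 1
```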
7. Compute the clock cycles per instruction (CPI) for the following instruction mix. The mix
includes 20% loads, 20% stores, 35% R-format operations, 20% branches, and 5% jumps. The
number of clock cycles for each instruction class is listed as follows: 4 cycles for loads, 4 cycles
for stores, 4 cycles for R-format instructions, 3 cycles for branches, 3 cycles for jumps.
(A) 3.15
(B) 3.25
(C) 3.5
(D) 3.75
(E) 3.9
Answer: (D)
Note: CPI = 4 × 0.2 + 4 × 0.2 + 4 × 0.35 + 3 × 0.2 + 3 × 0.05 = 3.75
Multiple-answer questions (each option is scored separately; one point is deducted for each incorrectly answered option)
8. Which of the following statements are correct?
(A) The case of “TLB miss, Page Table hit, Cache hit" is possible.
(B) In a pipelined system, forwarding can eliminate all the data hazards.
(C) A write-through cache and a write-back cache will have the same miss rate.
(D) In the analysis using 3C miss model, a cache miss will be classified as one of the three
types of misses, i.e. compulsory miss, conflict miss and capacity miss.
(E) Given a 32-bit long address, if a page size is 8K bytes, the virtual page number is of 19 bits
long.
Answer: (A), (C), (D), (E)
Note (E): length of the virtual page number = 32 – 13 = 19 bits
Add R2, R1, Address(B)
Store Address(C), R2
(E) None of the above is correct.
Answer: (A)
Note:
memory-to-memory | stack | register-to-memory
Add Addr.(A), Addr.(B), Addr.(C) | Push Address(B) | Load R1, Address(B)
 | Push Address(C) | Add R2, R1, Address(C)
 | Add | Store Address(A), R2
 | Pop Address(A) |
11. Assume the following execution time for different operations: MUL takes 6 cycles, DIV takes
15 cycles, and ADD and SUB both take 1 cycle to complete. The instructions are issued into the
pipeline in order, but out of order execution and completion is allowed. What hazards will be
encountered when executing this code sequence?
MUL F1, F3, F2
ADD F2, F7, F8
DIV F10, F1, F5
SUB F5, F8, F2
SD 8(R1), F2
(A) No hazards for this code sequence
(B) There will be a data hazard when the ADD instruction is executed.
(C) There will be a data hazard when the SD instruction is executed
(D) There will be a data hazard when the DIV instruction is executed.
(E) A Write After Read (WAR) hazard is possible when SUB instruction completes execution.
Answer: (D), (E)
105 中正資工
1. If we want to design an adder to compute the addition of two 16-bit unsigned numbers with
ONLY 4-bit carry-look-ahead (CLA) adders and 2-to-1 multiplexers. The delay time of a 4-bit
CLA adder and a 2-to-1 multiplexer are DCLA and DMX, respectively. In addition, DMX is equal to
0.2 DCLA. Please determine the minimum delay time for this 16-bit adder in terms of DCLA.
Answer: The minimum delay time = 2 × DCLA + 0.2 × DCLA = 2.2 DCLA (two cascaded CLA stages for the lower eight bits, plus one multiplexer stage selecting the precomputed upper eight bits)
(adder diagram omitted: the low-order bits a7..a0, b7..b0 pass through two cascaded 4-bit CLAs producing s7..s0, while the high-order bits a15..a8, b15..b8 are added twice, once with carry-in 1 and once with carry-in 0, and the correct sums are chosen by 2-to-1 multiplexers)
2. Based on IEEE 754 standard, the single precision numbers are stored in 32 bits with one sign
bit, eight exponent bits, and 23 mantissa bits. Please show the representation of -0.6875.
Answer: –0.6875₁₀ = –0.1011₂ = –1.011₂ × 2⁻¹
1 01111110 01100000000000000000000
3. Caches take advantage of temporal locality and spatial locality to improve the performance of
the memory. Please briefly explain what is temporal locality and spatial locality.
Answer
Temporal locality: if an item is referenced, it will tend to be referenced again soon.
Spatial locality: if an item is referenced, items whose addresses are close by will tend to be
referenced soon.
4. The following techniques have been developed for cache optimizations: “Pipelined cache” and
“multi-banked cache.” Please briefly explain these techniques and how they work.
Answer
Pipelined cache: pipelining the cache access by inserting latches between modules in the cache
can achieve a high bandwidth cache. Cache access time is divided into decoding delay, wordline
to sense amplifier delay, and mux to data out delay. Using this technique, cache accesses can
start before the previous access is completely done, resulting in high bandwidth and a high
frequency cache.
Multi-banked cache: rather than treat the cache as a single monolithic block, divide into
independent banks that can support simultaneous accesses and can increase cache bandwidth.
5. The following circuit is composed of two D-type flip-flops (DFFs) and an OR gate, where the
DFFs are reset when clk = 1. Describe the function of the circuit and one possible application
of it.
(circuit diagram omitted: two DFFs driven by din, their outputs ORed into dout, and both reset while clk = 1)
Answer: The circuit detects whether the din input toggles between 0 and 1 while the clock is at its
low level. It can be used to detect noise interference in a circuit.
6. The following is a code segment in MIPS assembly, where JAL (jump and link) jumps to the
label j_target and saves its return address in R31, and JR returns to the next instruction after JAL:
JAL j_target
….
j_target: JR R31
List all possible hazards when the code is executed on a classical 5-stage pipelined datapath (i.e.
fetch, decode, execute, memory, and write-back). What are the minimum cycles for an interlock
unit to resolve each hazard without data forwarding?
Answer
Executing the JAL and JR instructions causes control hazards, each incurring a one-clock-cycle
delay. If the JR instruction has a data dependency on a preceding instruction, a data hazard
occurs, incurring a two-clock-cycle delay.
7. Briefly describe the basic ideas of the following terms in computer designs:
(a) TLB, (b) BTB, and (c) AMAT.
Answer
(a) TLB: a cache that keeps track of recently used address mappings to avoid an access to the
page table
(b) BTB: a structure that caches the destination PC or destination instruction for a branch. It is
usually organized as a cache with tags, making it more costly than a simple prediction buffer
(c) AMAT: average memory access time is the average time to access memory considering
both hits and misses and the frequency of different accesses.
105 中正電機
1. Explain three types of hazard in pipelining and give their solutions.
Answer
(1) Structural hazards: hardware cannot support the instructions executing in the same clock
cycle (limited resources)
(2) Data hazards: attempt to use item before it is ready (Data dependency)
(3) Control hazards: attempt to make a decision before condition is evaluated (branch
instructions)
Type: Solutions
Structural hazard: add enough hardware (for example, use two memories, one for instructions
and one for data) or stall the pipeline.
Data hazard:
Software solutions: (a) have the compiler insert enough no-operation (nop) instructions;
(b) reorder the instructions so that the data hazard disappears.
Hardware solutions: (a) use forwarding to feed data to later instructions that have a data
hazard; (b) for a load-use hazard, which forwarding alone cannot resolve, stall one clock
cycle first and then forward.
Control hazard:
Software solutions: (a) have the compiler insert enough no-operation (nop) instructions;
(b) use delayed branches.
Hardware solutions: (a) move the branch decision earlier to reduce the number of
instructions that must be flushed when a control hazard occurs; (b) use prediction (static or
dynamic). When the prediction is correct, the pipeline runs at full speed and performance is
not degraded by branch instructions; only when the prediction is wrong must the wrongly
fetched instructions be flushed from the pipeline.
3. Give the hardware architecture of floating-point add unit for A + B.
A = X_M × 2^(X_E)
B = Y_M × 2^(Y_E)
Answer
(a)
Iteration  Step                         Multiplicand  Product
0          Initial values               101110        000000 100011
1          1: prod = prod + Y           101110        101110 100011
           Shift right product          101110        110111 010001
2          1: prod = prod + Y           101110        100101 010001
           Shift right product          101110        110010 101000
3          0: no operation              101110        110010 101000
           Shift right product          101110        111001 010100
4          0: no operation              101110        111001 010100
           Shift right product          101110        111100 101010
5          0: no operation              101110        111100 101010
           Shift right product          101110        111110 010101
6          1: prod = prod + (Y' + 1)    101110        010000 010101
           Shift right product          101110        001000 001010
(b)
Iteration  Step                         Multiplicand  Product
0          Initial values               101010        000000 100011 0
1          10: prod = prod – Mcand      101010        010110 100011 0
           Shift right product          101010        001011 010001 1
2          11: no operation             101010        001011 010001 1
           Shift right product          101010        000101 101000 1
3          01: prod = prod + Mcand      101010        101111 101000 1
           Shift right product          101010        110111 110100 0
4          00: no operation             101010        110111 110100 0
           Shift right product          101010        111011 111010 0
5          00: no operation             101010        111011 111010 0
           Shift right product          101010        111101 111101 0
6          10: prod = prod – Mcand      101010        010011 111101 0
           Shift right product          101010        001001 111110 1
5. Supposing that the industry trends show that a new process generation scales resistance by 1/3,
capacitance by 1/2, and clock cycle time by 1/2, by what factor does the dynamic power of the
digital system (without clock gating) scale?
Answer
Power_new = (2 × F_old) × (0.5 × C_old) × V_old² = F_old × C_old × V_old² = Power_old
Thus dynamic power scales by a factor of 1; it is unchanged.
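The scaling argument can be checked numerically (a small sketch using normalized old values of 1):

```python
def dynamic_power(freq, cap, volt):
    """Dynamic power is proportional to frequency x capacitance x voltage^2."""
    return freq * cap * volt ** 2

old = dynamic_power(1.0, 1.0, 1.0)
# Cycle time scales by 1/2, so frequency doubles; capacitance scales by 1/2.
new = dynamic_power(2.0 * 1.0, 0.5 * 1.0, 1.0)
print(new / old)  # 1.0
```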
6. If processor A has a higher clock rate than processor B, and processor A also has a higher MIPS
rating than processor B, explain whether processor A will always execute faster than processor
B. Why or why not?
Answer
MIPS = clock rate / (CPI × 10⁶)
As the expression above shows, the MIPS rating captures only the clock rate and the CPI.
Because it says nothing about the instruction count, we cannot infer that processor A will
always execute faster than processor B: the same program may require more instructions on A.
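A small numerical sketch with made-up machine parameters (hypothetical clock rates, CPIs, and instruction counts) illustrates the point: A can have the higher clock rate and MIPS rating yet take longer, because it executes more instructions:

```python
def mips_rating(clock_hz, cpi):
    """MIPS = clock rate / (CPI x 10^6)."""
    return clock_hz / (cpi * 1e6)

def exec_time(instr_count, cpi, clock_hz):
    return instr_count * cpi / clock_hz

# Hypothetical machines: A has the higher clock rate (and MIPS rating),
# but the program compiles to more instructions on A.
a_time = exec_time(5e9, 1.0, 4e9)   # A: 4 GHz, CPI 1, 5x10^9 instructions
b_time = exec_time(2e9, 1.0, 3e9)   # B: 3 GHz, CPI 1, 2x10^9 instructions
print(mips_rating(4e9, 1.0) > mips_rating(3e9, 1.0), a_time > b_time)  # True True
```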
8. What kind of hazards will happen when executing the following MIPS program? How to avoid
the hazard without stalls?
lw $t1, 0($t0)
lw $t2, 4($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
lw $t4, 8($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
Answer
Both add instructions have a data hazard because of their respective dependence on the
immediately preceding lw instruction. Notice that bypassing eliminates several other potential
hazards including the dependence of the first add on the first lw and any hazards for store
instructions. Moving up the third lw instruction eliminates both hazards:
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
Note: this problem is a textbook example, and the textbook assumes a forwarding unit. Since the
problem statement does not say whether a forwarding unit exists, the answer follows the textbook.
9. Given the full datapath of a MIPS CPU below.
(a) Please specify the required functional blocks for “Branch instructions”, “R-format
instructions”, and “Load/Store instructions”, respectively.
(b) If the CPU can only execute four instructions, including “bne”, “add”, “lw”, “sw”, which
instruction will result in the longest delay by assuming that the delay of each functional
block is the same.
[Figure: full MIPS datapath, including the PC adders (PC + 4 and branch target via shift-left-2), 16-to-32-bit sign extension, and control signals such as PCSrc and MemRead]
Answer
(a)
Instruction Functional blocks used for each instruction
Branch IM Reg ALU
R-format IM Reg ALU Reg
Load IM Reg ALU DM Reg
Store IM Reg ALU DM
Remark: IM: instruction memory; Reg: register file; DM: data memory
(b) According to above table, lw instruction will result in the longest delay
105 中山電機
1. Terminology Explanation
(a) SuperScalar Processor
(b) Multi-core Processor
(c) Branch Prediction
(d) VLIW Architecture
(e) Booth’s multiplication algorithm
Answer
(a) Superscalar: an advanced pipelining technique that enables the processor to execute more
than one instruction per clock cycle
(b) A multi-core processor is an integrated circuit (IC) to which two or more processors have
been attached for enhanced performance, reduced power consumption, and more efficient
simultaneous processing of multiple tasks
(c) Branch Prediction: A method of resolving a branch hazard that assumes a given outcome
for the branch and proceeds from that assumption rather than waiting to ascertain the actual
outcome
(d) VLIW: a style of instruction set architecture that launches many operations that are defined
to be independent in a single wide instruction, typically with many separate opcode fields.
(e) Booth’s multiplication algorithm is a multiplication algorithm that multiplies two
signed binary numbers in two's complement notation.
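As an illustration of (e), here is a minimal Python sketch of Booth's algorithm for fixed-width signed operands (the helper booth_multiply and its register layout are this sketch's own, not part of the original answer):

```python
def booth_multiply(m, r, bits):
    """Multiply two signed `bits`-wide integers using Booth's algorithm."""
    width = 2 * bits + 1                             # P register: [A half | Q half | extra bit]
    mask = (1 << width) - 1
    a = (m & ((1 << bits) - 1)) << (bits + 1)        # multiplicand in the upper half
    s = ((-m) & ((1 << bits) - 1)) << (bits + 1)     # its two's-complement negation
    p = (r & ((1 << bits) - 1)) << 1                 # multiplier with an appended 0 bit
    for _ in range(bits):
        if p & 0b11 == 0b01:                         # examined pair 01: add multiplicand
            p = (p + a) & mask
        elif p & 0b11 == 0b10:                       # examined pair 10: subtract multiplicand
            p = (p + s) & mask
        sign = p & (1 << (width - 1))
        p = (p >> 1) | sign                          # arithmetic shift right by one
    result = p >> 1                                  # drop the extra bit
    if result & (1 << (2 * bits - 1)):               # reinterpret the 2*bits result as signed
        result -= 1 << (2 * bits)
    return result

print(booth_multiply(3, -4, 4), booth_multiply(-7, -5, 4))  # -12 35
```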
3. A set associative cache has a block size of four 32-bit words and a set size of 4. The cache can
accommodate a total of 16K words. The cacheable main memory size is 64M × 32 bits.
Design the cache structure and show how the processor’s addresses are interpreted.
Answer
[Figure: 4-way set-associative cache with 1024 sets indexed by the 26-bit word address; a tag comparator on each way and a 4-to-1 MUX over the four 128-bit data blocks produce Hit and Data]
4. Design a Four-phase Stepper Motor Controller circuit with a clock input to generate the output
signals A, B, C and D to control a four-phase stepper motor whose stepping waveform is
described as table 1.
Table 1
Output signals
step
A B C D
1 1 1 0 0
2 1 0 0 1
3 0 0 1 1
4 0 1 1 0
1 1 1 0 0
Answer
State table: the two flip-flop outputs Qa and Qb encode the four steps; the next-state inputs Da, Db and the outputs A, B, C, D are functions of Qa and Qb.
[Figure: two-DFF circuit implementing the controller; D inputs Da and Db, outputs A, B, C, D decoded from Qa and Qb, driven by a common clock]
105 中山資工
1. SIMD, SIMT, VLIW, Superscalar, Superpipeline
(1) Explain the major features of single instruction multiple data (SIMD) processor. In other
words, point out the major differences between SIMD CPU and conventional CPU. Give
a practical example of SIMD instructions, CPU, or computer.
(2) Compare the differences of thread and process. Is context switching the change of threads
or the change of process?
(3) Explain the major features of single instruction multiple threads (SIMT) processor. Give a
practical example of SIMT computer.
(4) Explain the major features of very long instruction word (VLIW) processor. Which
processor category does VLIW belong to, static multiple issue or dynamic multiple issue?
(5) What is a superscalar processor? What is a superpipeline processor?
Answer:
(1) A conventional uniprocessor has a single instruction stream and single data stream. SIMD
computers operate on vectors of data. One practical example of SIMD instructions is SSE
instructions of x86
(2) A process is an executing instance of an application. A thread is a path of execution within a
process, and a process can contain multiple threads. When you start running a program, the
operating system creates a process and begins executing the primary thread of that process.
Context switching is the change of process.
(3) A processor architecture that applies one instruction to multiple independent threads in
parallel.
Nvidia introduced the single-instruction multiple-thread (SIMT) execution model, in which
multiple independent threads execute concurrently using a single instruction.
(4) A style of instruction set architecture that launches many operations that are defined to be
independent in a single wide instruction, typically with many separate opcode fields.
VLIW belongs to static multiple issue.
(5) Superscalar is an advanced pipelining technique that enables the processor to execute more
than one instruction per clock cycle by selecting them during execution.
Superpipelining is an advanced pipelining technique that increases the depth of the pipeline
to increase the clock rate.
2. Cache
(1) Give two methods to reduce cache miss rate.
(2) Give two methods to reduce cache hit time.
(3) Give two methods to reduce cache miss penalty.
(4) For the following Figure 1, explain the reason why the miss rate goes down as the block
size increases to a certain level, and explain the reason why the miss rate goes up if the
block size is too large relative to the cache size.
Figure 1
(5) Explain the differences of write-allocate and no-write-allocate during cache write miss.
Answer:
(1) increase cache size; increase associativity
(2) use virtually addressed cache; use direct-mapped cache
(3) use early restart technique; use critical word first technique
(4) Larger blocks exploit spatial locality to lower miss rates. The miss rate may go up
eventually if the block size becomes a significant fraction of the cache size, because the
number of blocks that can be held in the cache will become small, and there will be a great
deal of competition for those blocks. As a result, a block will be bumped out of the cache
before many of its words are accessed.
(5) Write allocate: the block is fetched from memory and then the appropriate portion of the
block is overwritten.
No write allocate: the portion of the block in memory is updated, but the block is not placed in the cache.
3. The execution of an instruction can be divided into five parts: instruction fetch (IF), register
read (RR), ALU operation (EX), data access (MEM), and register write (RW). The following
Table 1 shows the execution time of each part for several types of instructions, assuming that
the multiplexors and control unit have no delay.
Instruction                         Instruction fetch  Register read  ALU operation  Data access  Register write
Load word (lw)                      200 ps             100 ps         200 ps         200 ps       100 ps
Store word (sw)                     200 ps             100 ps         200 ps         200 ps       —
R-format (add, sub, and, or, slt)   200 ps             100 ps         200 ps         —            100 ps
Branch (beq)                        200 ps             100 ps         200 ps         —            —
If instructions are to be executed in a pipelined CPU with five pipeline stages (IF, RR, EX,
MEM, RW), where the pipeline stages execute the corresponding operations mentioned above:
(1) What is the cycle time of the pipelined CPU? What is the maximum working frequency?
(2) What is the latency of executing the load-word instruction (lw) in the pipelined CPU?
(3) What is the latency of executing the add instruction (add) in the pipelined CPU?
(4) What is the maximum throughput of the pipelined CPU?
(5) Propose a design method to increase the throughput performance of the pipelined CPU.
Answer:
(1) Cycle time = 200 ps
Maximum frequency = 1 / 200 ps = 5 GHz
(2) The latency of a lw instruction = 5 × 200 ps = 1000 ps
(3) The latency of an add instruction = 5 × 200 ps = 1000 ps
(4) The maximum throughput = 1 / 200 ps = 5G instructions/sec.
(5) Increase the depth of the pipeline
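The answers to (1)-(4) all follow from the slowest stage; a small sketch of the arithmetic:

```python
stage_latency = {"IF": 200, "RR": 100, "EX": 200, "MEM": 200, "RW": 100}  # in ps

cycle_time = max(stage_latency.values())       # the slowest stage sets the clock
max_freq_ghz = 1000 / cycle_time               # 1000 ps per ns -> GHz
lw_latency = cycle_time * len(stage_latency)   # every instruction traverses all 5 stages
print(cycle_time, max_freq_ghz, lw_latency)    # 200 5.0 1000
```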
4. Memory Hierarchy
The following Figure 2 shows the process of going from a virtual address to a data item in
cache. Answer the following questions based on this figure.
(1) What is the purpose of the translation lookaside buffer (TLB)?
(2) Is the TLB direct mapped, set associative, or fully associative? Is the data cache direct
mapped, set associative, or fully associative?
(3) What is the page size? What is the block size of the data cache?
(4) What is the size of the data cache (excluding the tags)? What is the maximum size of
physical memory supported in this memory system?
(5) Is the cache virtually addressed or a physically addressed?
[Figure 2: the 32-bit virtual address is split into a 20-bit virtual page number and a 12-bit page offset; a fully associative TLB (valid, dirty, tag, physical page number) supplies a 20-bit physical page number on a TLB hit; the resulting physical address then accesses the cache, whose tag comparison produces Cache hit and Data]
Answer:
(1) The purpose of a TLB is to speed up the translation of a virtual address to a physical address
(2) The TLB is a fully associative cache. The data cache is direct-mapped
(3) Page size = 2¹² bytes = 4 KB. Block size = 2⁶ bytes = 64 bytes
(4) The size of the data cache is 2⁸ × 64 bytes = 16 KB, and the maximum size of physical
memory is 2²⁰ × 2¹² bytes = 4 GB
(5) The cache is physically addressed
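The field arithmetic behind (3) and (4) can be sketched as follows (the widths are read off Figure 2 and the answer above):

```python
page_offset_bits = 12            # from Figure 2
vpn_bits = 20                    # from Figure 2 (physical page number is also 20 bits)
block_offset_bits = 6            # 64-byte cache blocks, per the answer
cache_index_bits = 8             # 256 direct-mapped lines, per the answer

page_size = 2 ** page_offset_bits
block_size = 2 ** block_offset_bits
cache_size = 2 ** cache_index_bits * block_size
physical_memory = 2 ** (vpn_bits + page_offset_bits)  # 20-bit PPN + 12-bit offset

print(page_size, block_size, cache_size, physical_memory)
# 4096 64 16384 4294967296
```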
[Figure 3: processor performance growth over time, plotted on a log-scale vertical axis]
(d) Give two design methods to increase the performance of executing conditional
branch instructions
Answer:
(1) The arithmetic mean (AM) is the average of the execution times and is directly proportional
to total execution time.
The geometric mean is the nth root of the product of n execution-time ratios.
SPECINT2006 uses the geometric mean to summarize the speed measurements of several
benchmark programs.
(2) (a) The vertical axis of Figure 3 is log-scale.
The growth rate is exponential
There is a straight line from 1986 to 2002
(b) VLSI technology improvement; advance architectural and organizational ideas
(c) Limits of power; available instruction-level parallelism; memory latency
(d) Branch prediction; move the branch decision earlier
105 中興資工
1. True or False
(a) The processor comprises two main components: datapath and register.
(b) In the immediate addressing mode, the operand is a constant within the instruction itself.
(c) TLB can be used to speed up the virtual-to-physical address translation.
(d) In cache write hit, we can use the write buffer to improve the performance of write-through
scheme.
(e) Program counter (PC) is the register that contains the data address in the program being
executed.
Answer
(a) (b) (c) (d) (e)
False True True True False
Note (a): The processor comprises two main components: datapath and control unit.
Note (e): PC is the register that contains the instruction address in the program being executed.
105 中興電機
1. Please answer "Yes" or "No" for the following five comparisons between RISC and CISC
processors:
(a) The instruction format complexity of CISC is more than that of RISC.
(b) The clock cycle time of instructions in CISC is more than that in RISC.
(c) The clock cycles per instruction in CISC is more than that in RISC.
(d) The performance of CISC is more than that of RISC.
(e) The power consumption of CISC is more than that of RISC.
Answer:
(a) (b) (c) (d) (e)
Yes Yes Yes No Yes
3. The following C codes are compiled into the corresponding MIPS assembly codes. Assume
that the two parameters array and size are found in the registers $a0 and $a1, and that p is
allocated to register $t0.
C codes:
clear(int *array, int size)
{
int *p;
for (p=&array[0]; p < &array[size]; p=p+1)
*p=0;
}
MIPS assembly codes:
move $t0, $a0
loop: OP1 $zero, 0($t0)
addi $t0, $t0, 4
add $t1, $a1, $a1
add $t1, $t1, OP3
OP2 $t2, OP4, $t1
slt $t3, $t0, $t2
bne $t3, $zero, loop
Please determine proper instructions for (OP1, OP2) and proper operands for (OP3, OP4).
Copy the following table (Table 1) to your answer sheet and fill in the two instructions and
two operands.
Table 1
Instruction/Operand
OP1
OP2
OP3
OP4
Answer:
Instruction/Operand
OP1 sw
OP2 add
OP3 $t1
OP4 $a0
4. For signed addition ($t0 = $t1 + $t2), the following sequence of MIPS code can detect two
special conditions. The sequence of code is shown as follows:
addu $t0, $t1, $t2
xor $t3, $t1, $t2
slt $t3, $t3, $zero
bne $t3, $zero, T1
xor $t3, $t0, $t1
slt $t3, $t3, $zero
bne $t3, $zero, T2
Please describe which conditions the branch targets T1 and T2 indicate for signed addition.
Answer:
T1 T2
No_overflow Overflow
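The detection logic can be mirrored in Python to see why branching when the operand signs differ means no overflow (the helper names are this sketch's own):

```python
def to_signed32(x):
    """Interpret the low 32 bits of x as a signed integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def signed_add_overflows(t1, t2):
    """Mirror the MIPS sequence: addu, then xor/slt on the operands decides the branch."""
    s = to_signed32(t1 + t2)            # addu: wraparound add, no trap
    if to_signed32(t1 ^ t2) < 0:        # sign bits differ -> overflow impossible (T1)
        return False
    return to_signed32(s ^ t1) < 0      # same signs: overflow iff the sum's sign flips (T2)

print(signed_add_overflows(2**31 - 1, 1), signed_add_overflows(5, -7))  # True False
```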
5. Compute the following floating-point operations using IEEE 754 single precision. Please show
the floating point encoding in BINARY.
–344.3125 - 123.5625
Answer:
−344.3125₁₀ = −101011000.0101₂ = −1.010110000101₂ × 2⁸
−123.5625₁₀ = −1111011.1001₂ = −1.1110111001₂ × 2⁶
Step 1: −1.1110111001₂ × 2⁶ = −0.011110111001₂ × 2⁸
Step 2: (−1.010110000101₂ − 0.011110111001₂) × 2⁸ = −1.110100111110₂ × 2⁸
Step 3: the result −1.110100111110₂ × 2⁸ is already normalized
Step 4: no rounding is needed: −1.110100111110₂ × 2⁸
−1.110100111110₂ × 2⁸ = −111010011.1110₂ = −467.875₁₀
Binary encoding: 1 10000111 11010011111000000000000
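A quick verification with Python's struct module (the helper f32_bits is this sketch's own): the result is −467.875.

```python
import struct

def f32_bits(x):
    """IEEE 754 single-precision bit pattern of x as a 32-char string."""
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    return format(b, "032b")

result = -344.3125 - 123.5625
bits = f32_bits(result)
print(result)                        # -467.875
print(bits[0], bits[1:9], bits[9:])  # 1 10000111 11010011111000000000000
```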
6. We examine how pipelining affects the clock cycle time of the processor. Assume that
individual stages of the datapath have the following latencies:
IF ID EX MEM WB
250 ps 350 ps 150 ps 300 ps 200 ps
Also, assume that instructions executed by the processor are broken down as follows:
ALU BEQ LW SW
45% 20% 20% 15%
(a) What is the clock cycle time in a pipelined and non-pipelined processor?
(b) What is the total latency of an LW instruction in a pipelined and non-pipelined processor?
(c) If we can split one stage of the pipelined datapath into two new stages, each with half the
latency of the original stage, which stage would you split and what is the new clock cycle
of the processor?
(d) Assuming there are no stalls for hazards, what is the utilization of the data memory?
Answer:
(a) Pipelined: 350 ps; Single-cycle: 1250 ps
(b) Pipelined: 1750 ps; Single-cycle: 1250 ps
(c)
Stage to split New clock cycle time
ID 300ps
(d)
Data memory is utilized only by LW and SW instructions in the MIPS
ISA. So the utilization is 35% of the clock cycles
7. Consider executing the following code on the pipelined datapath.
add $8, $5, $5
add $8, $5, $8
sub $3, $8, $4
add $2, $2, $3
add $4, $2, $3
(a) Show or list all of the dependencies in this program. For each dependency, indicate which
instructions and register are involved.
(b) At the end of the fifth cycle of execution, which registers are being read and which
register will be written?
(c) With regard to the program, explain what the forwarding unit is doing during the fifth
cycle of execution. If any comparisons are being made, describe them.
(d) With regard to the program, explain what the hazard detection unit is doing during the
fifth cycle of execution. If any comparisons are being made, describe them.
Answer:
(a)
Dependencies between instructions Register
(1, 2) $8
(2, 3) $8
(3, 4) $3
(3, 5) $3
(4, 5) $2
(b) Register $8 is being written and Registers $2 and $3 are being read.
(c) The forwarding unit is comparing the destination registers in EX/MEM and MEM/WB against the source registers in ID/EX: $8 = $8? $8 = $4? (from EX/MEM) and $8 = $8? $8 = $4? (from MEM/WB).
(d) The hazard detection unit is comparing $8 = $2? $8 = $3?
105 台科大資工
1. Assume a program requires the execution of 25×10⁶ FP instructions, 110×10⁶ INT instructions,
80×10⁶ L/S instructions, and 16×10⁶ branch instructions. The CPI for each type of instruction is
2, 1, 4, and 2, respectively. Assume that the processor has a 2 GHz clock rate.
(a) What are the clock cycles of the program?
(b) If we want the program to run two times faster, what should the CPI of L/S instructions be?
(c) What is the execution time of the program if the CPI of INT and FP is reduced by 40% and
the CPI of L/S and Branch is reduced by 30%?
Answer:
(a) Clock cycles = 2×25×10⁶ + 1×110×10⁶ + 4×80×10⁶ + 2×16×10⁶ = 512×10⁶
(b) 256×10⁶ = 2×25×10⁶ + 1×110×10⁶ + CPI_L/S×80×10⁶ + 2×16×10⁶
CPI_L/S = 0.8
(c) Execution time = (2×0.6×25×10⁶ + 1×0.6×110×10⁶ + 4×0.7×80×10⁶ + 2×0.7×16×10⁶)
/ 2 GHz = 0.1712 sec.
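The arithmetic for (a)-(c) can be sketched and checked in Python:

```python
# Instruction counts and CPIs from the problem statement.
counts = {"FP": 25e6, "INT": 110e6, "LS": 80e6, "BR": 16e6}
cpi = {"FP": 2, "INT": 1, "LS": 4, "BR": 2}
clock_hz = 2e9

cycles = sum(counts[k] * cpi[k] for k in counts)                            # (a)
cpi_ls = (cycles / 2 - (cycles - counts["LS"] * cpi["LS"])) / counts["LS"]  # (b)
new_cycles = (0.6 * (counts["FP"] * cpi["FP"] + counts["INT"] * cpi["INT"])
              + 0.7 * (counts["LS"] * cpi["LS"] + counts["BR"] * cpi["BR"]))
exec_time = new_cycles / clock_hz                                           # (c)

print(cycles, cpi_ls, round(exec_time, 4))  # 512000000.0 0.8 0.1712
```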
2. Assume that individual stages of the datapath have the following latencies:
IF ID EX MEM WB
250ps 360ps 150ps 300ps 200ps
Also, assume that instructions executed by the processor are broken down as follows:
alu beq lw sw
50% 20% 15% 15%
(a) What is the clock cycle time in a pipelined processor? What is the clock cycle time in a
non-pipelined processor?
(b) What is the total latency of an LW instruction in a pipelined processor? What is the total
latency of an LW instruction in a non-pipelined processor?
(c) If we can split one stage of the pipelined datapath into two new stages, each with half the
latency of the original stage, which stage should be split? What is the new clock cycle time
of the processor?
(d) Assuming there are no stalls or hazards, what is the utilization of the data memory?
(e) Assuming there are no stalls or hazards, what is the utilization of the write-register port of
the “Registers” unit?
Answer:
(a) Clock cycle time in a pipelined processor = 360 ps
Clock cycle time in a non-pipelined processor = 250 + 360 + 150 + 300 + 200 = 1260 ps
(b) Total latency of an LW in a pipelined processor = 5 × 360 ps = 1800 ps
Total latency of an LW in a non-pipelined processor = 1260 ps
(c) ID could be split into two new stages
The new clock cycle time = 300 ps
(d) The utilization of the data memory = 15% + 15% = 30%
(e) The utilization of the write-register port = 50% + 15% = 65%
105 台科大電子
1. What do CISC and RISC stand for? Compare and contrast CISC architecture with RISC
architecture, and then give an example for each of them.
Answer:
CISC stands for complex instruction set computer and is the name given to processors that use a
large number of complicated instructions, to try to do more work with each one.
RISC stands for reduced instruction set computer and is the generic name given to processors
that use a small number of simple instructions, to try to do less work with each instruction but
execute them much faster.
RISC: all instructions are the same size; few addressing modes are supported; only a few
instruction formats; arithmetic instructions can only work on registers; data in memory must be
loaded into registers before processing.
CISC: instructions are not the same size; many addressing modes are supported; many
instruction formats are supported; arithmetic instructions can work on memory; data in memory
can be processed directly without using load/store instructions.
Examples: MIPS and ARM are RISC processors, while the Intel x86 family is CISC.
2. Let a half adder be expressed as follows, construct a full adder using this building block.
[Figure: half-adder block with inputs A, B and outputs S, Cout]
[Figure: full adder with inputs A, B, Cin and outputs S, Cout, built from half-adder blocks]
Answer:
Cache memory is a smaller, faster memory which stores copies of the data from frequently used
main memory locations and a CPU can access more quickly than it can access regular RAM.
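The full-adder construction asked for in question 2 chains two half adders and ORs their carry outputs; a quick exhaustive check in Python (modeling each bit as 0/1):

```python
def half_adder(a, b):
    """Half adder: sum = a XOR b, carry = a AND b."""
    return a ^ b, a & b

def full_adder(a, b, cin):
    """Full adder from two half adders, with an OR gate combining the carries."""
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, cin)
    return s, c1 | c2

# Exhaustively verify against binary addition: a + b + cin = 2*cout + s.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert 2 * cout + s == a + b + cin
```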
105 成大資聯
1. Consider a MIPS processor with an additional floating point unit. Assume functional unit
delays in the processor are as follows: memory (2 ns), ALU and adders (2 ns), register file
access (1 ns), FPU add (8 ns), FPU multiply (16 ns), and the remaining units (0 ns). Also
assume instruction mix as follows: loads (31%), stores (21%), R-format instructions (27%),
branches (5%), jumps (2%), FP adds and subtracts (7%), and FP multiplies and divides (7%).
(a) What is the delay in nanoseconds to execute a load, store, R-format, branch, jump, FP
add/subtract, and FP multiply/divide instruction in a single-cycle MIPS design? (b) What is the
average delay in nanoseconds to execute a load, store, R-format, branch, jump, FP add/subtract,
and FP multiply/divide instruction in a multicycle MIPS design?
Answer:
(a) In a single-cycle design the clock must accommodate the slowest instruction, so every
instruction needs 20 ns to execute. The individual datapath delays are:

Instruction   Memory  Register  ALU / FPU add / FPU MPY  Memory  Register  Delay (ns)
load          2       1         2                        2       1         8
store         2       1         2                        2       0         7
R-format      2       1         2                        0       1         6
branch        2       1         2                        0       0         5
jump          2       0         0                        0       0         2
FP add/sub    2       1         8                        0       1         12
FP mul/div    2       1         16                       0       1         20

(b) With a multicycle clock of 16 ns (the slowest functional unit, the FPU multiplier):
Average delay = (5×0.31 + 4×0.21 + 4×0.27 + 3×0.05 + 3×0.02 + 4×0.07 + 4×0.07) × 16
= 4.24 × 16 = 67.84 ns
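The multicycle average can be checked numerically (cycle counts per instruction class are taken from the answer above):

```python
# Functional-unit delays (ns): memory 2, ALU 2, register access 1, FPU add 8, FPU multiply 16.
cycle_time = 16                      # multicycle clock period = slowest unit (FPU multiply)
cycles = {"load": 5, "store": 4, "R": 4, "branch": 3, "jump": 3, "fp_add": 4, "fp_mul": 4}
mix = {"load": 0.31, "store": 0.21, "R": 0.27, "branch": 0.05, "jump": 0.02,
       "fp_add": 0.07, "fp_mul": 0.07}

avg_cpi = sum(cycles[k] * mix[k] for k in cycles)
print(round(avg_cpi, 2), round(avg_cpi * cycle_time, 2))  # 4.24 67.84
```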
2. Consider a pipelined processor that executes the MIPS code shown in Figure 1 using the logic
of hazard detection and data forwarding unit shown in Figure 2. If the MIPS code cannot be
executed correctly, then how do we revise the logic shown in Figure 2 such that the code can be
correctly executed?
add $1, $1, $5
add $1, $1, $6
add $1, $1, $7
Figure 1: The MIPS code
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
then
ForwardA = 01
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
then
ForwardB = 01
Figure 2: The logic of hazard detection and data forwarding unit
Answer:
The logic should be revised as follows.
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
then
ForwardA = 01
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
then
ForwardB = 01
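The intent of the revision is that EX/MEM forwarding takes priority over MEM/WB, so the most recent write to $1 wins; a compact Python sketch of the ForwardA priority logic (the function and tuple layout are this sketch's own):

```python
def forward_a(ex_mem, mem_wb, id_ex_rs):
    """Forwarding control for ALU input A.
    ex_mem / mem_wb are (reg_write, register_rd) tuples from the pipeline registers."""
    ex_hit = ex_mem[0] and ex_mem[1] != 0 and ex_mem[1] == id_ex_rs
    wb_hit = mem_wb[0] and mem_wb[1] != 0 and mem_wb[1] == id_ex_rs
    if ex_hit:
        return "10"      # forward from EX/MEM: the most recent result wins
    if wb_hit:
        return "01"      # forward from MEM/WB only if EX/MEM does not also match
    return "00"          # no forwarding: read the register file

# Back-to-back writes to $1 (the Figure 1 code): both pipeline registers hold $1,
# and the newer EX/MEM value must be chosen.
assert forward_a((True, 1), (True, 1), 1) == "10"
assert forward_a((False, 0), (True, 1), 1) == "01"
```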
3. What is the biased single-precision IEEE 754 floating point format of 0.9375? What is the
purpose of biasing the exponent of floating point numbers?
Answer:
(1) 0.9375ten = 0.1111two = 1.111two × 2⁻¹
S E F
0 01111110 11100000000000000000000
(2) Most floating-point operations must first compare exponents to decide which operand's
significand to align. With a biased representation, the exponent fields can be compared directly
as unsigned numbers to determine which exponent is larger, without handling the sign
separately, which speeds up the comparison.
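The speed argument in (2) can be illustrated in Python: the raw exponent fields of biased single-precision floats order correctly as plain unsigned integers (the helper name is this sketch's own):

```python
import struct

def exponent_field(x):
    """Unsigned 8-bit biased exponent field of a single-precision float."""
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    return (b >> 23) & 0xFF

# 0.125 = 2^-3, 1.0 = 2^0, 32.0 = 2^5: the bias makes even the negative
# exponent compare smaller as an unsigned field.
fields = [exponent_field(0.125), exponent_field(1.0), exponent_field(32.0)]
print(fields)  # [124, 127, 132]
```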
105 成大電通
1. Design a direct memory access (DMA) controller.
(1) Show a detailed block diagram of a DMA controller, including registers used
(2) Describe the operation of the DMA controller.
Answer:
(1)
[Figure: DMA controller block diagram containing a data count register, a data register, an address register, and control logic; external connections are the data lines, address lines, DMA request, DMA acknowledge, interrupt, and read/write signals]
(2) To transfer data between memory and I/O devices, the DMA controller takes over control of
the system from the processor, and the data transfer takes place over the system bus. For this
purpose, the DMA controller must use the bus only when the processor does not need it, or it
must force the processor to suspend operation temporarily.
2. Show the design of the following memory hierarchy system for a conventional five-stage
pipeline processor. Assume both the virtual address and physical address are 32 bits. The page
size is 4KB. The L1 ITLB has 48 entries and is 3-way set associative. The L1 DTLB is
direct-mapped and has 32 entries. The L2 TLB is a unified TLB which has 128 entries using
4-way set associative design. The data cache and instruction cache are both physically
addressed; each has a cache size of 64KB, direct-mapped, line size 32 bytes.
(1) Show the L1 ITLB design and explain the translation of a virtual address to the physical
address for the ITLB.
(2) Show the L1 DTLB design integrated with the data cache.
(3) Show how to integrate the three TLBs, instruction cache, and the data cache to the pipeline.
Answer:
(1) The virtual address is first broken into a virtual page number and a page offset. The ITLB
which contains the virtual to physical address translation is indexed by the virtual page
number. If the mapping exists, the physical page number mapped from the ITLB constitutes
the upper portion of the physical address, while the page offset constitutes the lower portion.
If the mapping does not exist, the missing translation will be retrieved from the page table.
[Figure: 3-way set-associative L1 ITLB with 16 sets; the 20-bit virtual page number is split into a tag and a 4-bit set index; the matching way (0, 1, or 2) supplies the physical page number (PPN) and asserts Hit]
(3)
[Figure: the classical five-stage pipeline (IF, ID, EX, MEM, WB) with the ITLB and instruction cache integrated into the IF stage and the DTLB and data cache integrated into the MEM stage; the rest of the datapath (register file, ALU, sign extension, control, pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB) is unchanged]
(2)
[Figure: direct-mapped L1 DTLB with 32 entries (5-bit index and 15-bit tag over the 20-bit virtual page number) producing the physical page number; the 32-bit physical address then indexes the direct-mapped data cache (2048 lines of 32 bytes: 11-bit index, 5-bit block offset, 16-bit tag) to produce the data]
105 成大電機
1. Find the word or phrase from the list below that best matches the description in the following
questions. Each answer should be used only once. (a) assembler (b) bit (c) binary number (d)
cache (e) CPU (f) chip (g) compiler (h) control (i) defect (j) DRAM (k) memory (l) operating
system (m) semiconductor (n) Supercomputer (o) yield (p) die (q) loader (r) linker (s) SRAM (t)
coverage (u) procedure (v) pipeline (w) ISA.
(1) Integrated circuit commonly used to construct main memory.
(2) Location of programs when they are running, containing the data needed as well.
(3) Microscopic flaw in a wafer.
(4) Percentage of good dies from the total number of dies on the wafer.
(5) Program that translates a symbolic version of an instruction into binary version.
(6) Program that translates from a higher level notation to assembly language.
(7) Small, fast memory that acts as a buffer for the main memory.
(8) Substance that does not conduct electricity well.
(9) Base 2 number.
(10) Component of the processor that tells the datapath, memory, and I/O devices what to do
according to the instructions of the program.
Answer:
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10)
j k i o a g d m c h
Choose the correct answers for the following multiple choice problems. Each question may have
more than one answer.
3. Which of the following Statements is (are) true for stack operations?
(1) Callee is a procedure that executes a series of stored instructions based on parameters
provided by the caller and then returns control to the caller.
(2) Jal procedure_name, is the instruction that calls procedure_name and returns from the
callee.
(3) Stack is a data structure for spilling registers organized as a last-in-first-out queue.
(4) Stack pointer, sp, is the register containing the address of program being executed.
(5) Push operation can be achieved by executing a store instruction.
Answer: (3)
Note (1): the argument registers are maintained by the caller, not the callee.
Note (4): PC is the register containing the address of the program being executed.
Note (5): a push operation is achieved by executing a store plus an add instruction.
4. Which of the following statements is (are) true for IEEE 754 floating point representation?
(1) IEEE 754 standard defines the double precision number to be a 128-bit format.
(2) If a floating point number is shown in the form (−1)^S × (1 + F) × 2^E, where S defines the
sign of the number, F the fraction field, and E the exponent, this means the leading 1-bit of
normalized binary numbers is implicit.
(3) The IEEE 754 binary representation of −0.75(10) is 10111111011000000000000000000000 for
single precision.
(4) The floating point number represented with a biased exponent actually has this value:
(−1)^S × (1 + F) × 2^(E−Bias)
(5) Since there is no way to get 0.0 from the form (−1)^S × (1 + F) × 2^E, we will not be able to
represent 0.0 in floating point format.
Answer: (2), (4)
Note (1): double precision is a 64-bit format.
Note (3): −0.75(10) is 10111111010000000000000000000000.