Computer Architecture Midterm1 Cmu
Midterm Exam 1
Date: Wed., 3/6
Name:
Legibility & Name (5 Points):
Problem 1 (65 Points):
Instructions:
1. This is a closed book exam. You are allowed to have one letter-sized cheat sheet.
Tips:
Show work when needed. You will receive partial credit at the instructor's discretion.
Initials:
Reason 1.
Data dependences
Reason 2.
Reason 3.
Resource contention
The running program cannot continue if the exception is not immediately handled.
They are not needed for the running program's progress and are not critical for system
progress.
What is the downside of a design that does not use a branch target buffer? Please be concrete
(and use less than 20 words).
Assume you have a machine with a 4-entry return address stack, yet the code that is executing
has six levels of nested function calls each of which end with an appropriate return instruction.
What is the return address prediction accuracy of this code?
4/6
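The 4/6 figure can be reproduced with a small simulation. This is a minimal sketch, assuming the return address stack silently overwrites its oldest entry on overflow and mispredicts when it is empty; the function name `ras_accuracy` and the use of nesting levels as stand-ins for return addresses are illustrative assumptions, not part of the exam.

```python
from collections import deque

def ras_accuracy(depth, entries):
    """Return (correct, total) return predictions for `depth` nested calls.

    Assumption: the stack drops its oldest entry when a push overflows it,
    and an empty stack yields a misprediction.
    """
    ras = deque(maxlen=entries)      # oldest entry is discarded on overflow
    for level in range(depth):
        ras.append(level)            # each call pushes its return address
    correct = 0
    for level in reversed(range(depth)):
        predicted = ras.pop() if ras else None
        if predicted == level:       # prediction matches the real return target
            correct += 1
    return correct, depth

print(ras_accuracy(6, 4))  # (4, 6)
```

Under these assumptions the four innermost returns are predicted correctly and the two outermost ones are not.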
Yes.
What is the disadvantage of such a machine over one that has restartable and precise exceptions
or interrupts? Explain briefly.
Segment selector
TLB
In class, we covered several dataflow machines that implemented dataflow execution at the ISA
level. These machines included a structure/unit called the matching store. What is the function
of the matching store (in less than 10 words)?
Reservation stations.
[Figure: table with a "Tag" column (entries A-E and values 0, 1, 3, 16, 27); cell layout not recoverable.]
At compile time, the compiler does not know whether control dependences are taken or
not during execution. Hence, it does not know if an instruction can be moved above or
below a branch.
What can the compiler do to alleviate this problem? Describe two solutions we discussed in class.
Solution 1.
Solution 2.
Superblock or trace scheduling. The compiler profiles the code and determines likely branch directions and optimizes the instruction scheduling on
the frequently executed path.
Solution 1.
Solution 2.
NO
If YES, explain what type of code leads to inefficient (i.e., lower performance than it could be)
execution and why. (Leave blank if you answered NO above.)
Tag and result broadcast is delayed by a cycle, thus delaying the execution of dependent
instructions.
If YES, explain what you would recommend to your friend to eliminate the inefficiency. (Leave
blank if you answered NO above.)
Broadcast the tag and result as soon as an instruction finishes execution.
If NO, justify how the design is as efficient as Tomasulo's algorithm with full data forwarding can
be. (Leave blank if you answered YES above.)
BLANK
If NO, explain how the design can be simplified. (Leave blank if you answered YES above.)
BLANK
instructions.
(b) Assume N instructions are on the correct path of a program and assume a branch predictor
accuracy of A. Write the equation for the number of instructions that are fetched on this machine
in terms of N and A. (Please show your work for full credit.)
Note that if you assumed the wrong number of instructions in Part (a), you will only be
marked wrong for this in Part (a). You can still get full credit on this and the following
parts.
Correct-path instructions = $N$
Incorrect-path instructions = $N \times 0.2 \times (1 - A) \times 5 = N(1 - A)$
Fetched instructions = Correct-path instructions + Incorrect-path instructions
$= N + N(1 - A)$
$= N(2 - A)$
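The derivation above can be checked numerically. This is a rough sketch, assuming (as the factors 0.2 and 5 in the equation suggest) that 20% of instructions are branches and that each misprediction fetches 5 wrong-path instructions; the function and parameter names are illustrative.

```python
def fetched_instructions(n, accuracy, branch_fraction=0.2, wrong_path_per_mispredict=5):
    """Total instructions fetched for n correct-path instructions.

    Assumed defaults (taken from the equation above): 20% branches and
    5 wrong-path instructions fetched per mispredicted branch.
    """
    incorrect = n * branch_fraction * (1 - accuracy) * wrong_path_per_mispredict
    return n + incorrect

# With the assumed defaults this reduces to N * (2 - A).
n, a = 1000, 0.9
print(fetched_instructions(n, a), n * (2 - a))
```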
(c) Let's say we modified the machine so that it used dual path execution like we discussed in class
(where an equal number of instructions are fetched from each of the two branch paths). Assume
branches are resolved before new branches are fetched. Write how many instructions would be
fetched in this case, as a function of N . (Please show your work for full credit.)
(d) Now let's say that the machine combines branch prediction and dual path execution in the following way:
A branch confidence estimator, like we discussed in class, is used to gauge how confident the
machine is of the prediction made for a branch. When confidence in a prediction is high, the
branch predictor's prediction is used to fetch the next instruction; when confidence in a prediction
is low, dual path execution is used instead.
Assume that the confidence estimator estimates a fraction C of the branch predictions have high
confidence, and that the probability that the confidence estimator is wrong in its high confidence
estimation is M .
Write how many instructions would be fetched in this case, as a function of N , A, C, and M .
(Please show your work for full credit.)
[Figure: dataflow graph with copy, >0?, AND, and NOT nodes, arcs labeled T, an initial "false" token, and an "output" arc.]
Calculates the parity of X. (True if the number of set bits in X is odd and false otherwise.)
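For reference, the same result can be written as ordinary sequential code. This is only a sketch of what "parity of X" means for a non-negative integer; it does not claim to mirror the node-by-node structure of the dataflow graph, and the name `parity` is an assumption.

```python
def parity(x):
    """Return True if x has an odd number of set bits, False otherwise."""
    result = False
    while x > 0:
        if x & 1:            # lowest bit is set: flip the running parity
            result = not result
        x >>= 1              # move on to the next bit
    return result

print(parity(0b1011))  # True: three bits set
print(parity(0b0011))  # False: two bits set
```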
Instruction encoding:
Bits 15-12: 1010 (opcode)
Bits 11-9: DR
Bits 8-6: SR1
Bits 5-0: 000000
The modifications we make to the LC-3b datapath and the microsequencer are highlighted in the
attached figures (see the next two pages). We also provide the original LC-3b state diagram, in case
you need it. (As a reminder, the selection logic for SR2MUX is determined internally based on the
instruction.)
The additional control signals are
GateTEMP1/1: NO, YES
GateTEMP2/1: NO, YES
LD.TEMP1/1: NO, LOAD
LD.TEMP2/1: NO, LOAD
ALUK/3: OR1 (A|0x1), LSHF1 (A<<1), PASSA, PASS0 (Pass value 0), PASS16 (Pass value 16)
COND/4:
COND0000 ;Unconditional
COND0001 ;Memory Ready
COND0010 ;Branch
COND0011 ;Addressing mode
COND0100 ;Mystery 1
COND1000 ;Mystery 2
The microcode for the instruction is given in the table below.
State         Cond      J       Asserted Signals
001010 (10)   COND0000  001011  ALUK = PASS0, GateALU, LD.REG, DRMUX = DR (IR[11:9])
001011 (11)   COND0000  101000  ALUK = PASSA, GateALU, LD.TEMP1, SR1MUX = SR1 (IR[8:6])
101000 (40)   COND0000  110010  ALUK = PASS16, GateALU, LD.TEMP2
110010 (50)   COND1000  101101  ALUK = LSHF1, GateALU, LD.REG, SR1MUX = DR, DRMUX = DR (IR[11:9])
111101 (61)   COND0000  101101  ALUK = OR1, GateALU, LD.REG, SR1MUX = DR, DRMUX = DR (IR[11:9])
101101 (45)   COND0000  111111  GateTEMP1, LD.TEMP1
111111 (63)   COND0100  010010  GateTEMP2, LD.TEMP2
Code:
State 10: DR <- 0
State 11: TEMP1 <- value(SR1)
State 40: TEMP2 <- 16
State 50: DR = DR << 1
          if (TEMP1[0] == 0)
              goto State 45
          else
              goto State 61
State 61: DR = DR | 0x1
State 45: TEMP1 = TEMP1 >> 1
State 63: DEC TEMP2
          if (TEMP2 == 0)
              goto State 18
          else
              goto State 50
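The state sequence above can be traced in software. The sketch below is an assumption-laden model, not the datapath itself: it treats registers as 16-bit values, follows states 10, 11, 40, 50, 61, 45, and 63 in the order the code lists them, and uses the illustrative name `run_microcode`. Under that reading, DR ends up holding the bits of SR1 in reversed order.

```python
def run_microcode(sr1):
    """Software model of states 10, 11, 40, 50, 61, 45, 63 (16-bit registers assumed)."""
    dr = 0                        # state 10: DR <- 0
    temp1 = sr1 & 0xFFFF          # state 11: TEMP1 <- value(SR1)
    temp2 = 16                    # state 40: TEMP2 <- 16
    while True:
        dr = (dr << 1) & 0xFFFF   # state 50: DR = DR << 1
        if temp1 & 1:             # branch on TEMP1[0]
            dr |= 0x1             # state 61: DR = DR | 0x1
        temp1 >>= 1               # state 45: TEMP1 = TEMP1 >> 1
        temp2 -= 1                # state 63: DEC TEMP2
        if temp2 == 0:
            return dr             # back to state 18 (fetch)

print(hex(run_microcode(0x0001)))  # 0x8000
print(hex(run_microcode(0x00F0)))  # 0xf00
```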
[Figure, panels (a)-(c): DRMUX (inputs IR[11:9] and 111), SR1MUX (inputs IR[11:9] and IR[8:6]), and the BEN logic (inputs IR[11:9], N, Z, P).]
[Figure: original LC-3b state diagram, covering instruction fetch (states 18, 19, 33, 35), decode (state 32), and the per-opcode execution states for ADD, AND, XOR, BR, JMP, JSR, TRAP, SHF, LEA, LDB, LDW, STB, STW, and RTI.]
Program 2
(a) If Program 1 yields 8 K page faults, what is the size of a page in this architecture?
Assume the page size you calculated for the rest of this question.
(b) Consider Program 2. How many pages should the physical memory be able to store to ensure
that Program 2 experiences the same number of page faults as Program 1 does?
8K page faults is possible only if each page is brought into physical memory exactly
once, i.e., there shouldn't be any swapping.
Therefore, the physical memory must be large enough to retain all the pages.
PAGE_COUNT = A_SIZE / PAGE_SIZE = 256 MB / 32 KB = 8K
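The page-count arithmetic can be double-checked in a few lines. This sketch assumes binary units (1 KB = 1024 bytes, 1 MB = 1024 KB) and uses the 256 MB array size and 32 KB page size quoted above; the helper name `page_count` is illustrative.

```python
KB = 1024
MB = 1024 * KB

def page_count(array_bytes, page_bytes):
    """Number of pages needed to hold the whole array (sizes assumed to divide evenly)."""
    return array_bytes // page_bytes

pages = page_count(256 * MB, 32 * KB)
print(pages, pages == 8 * 1024)  # 8192 True
```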
(c) Consider Program 2. How many page faults would Program 2 experience if the physical memory
can store 1 page?
32K × 8K / 4 = 64M. The inner loop touches a page four times before moving on to a
different page.
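The 64M figure follows from dividing the total number of accesses by the number of consecutive touches per page. A minimal sketch, assuming 32K × 8K accesses in total and binary prefixes (K = 1024, M = 1024 × 1024); the function name is an assumption.

```python
K = 1024
M = K * K

def faults_with_one_frame(total_accesses, touches_per_visit):
    """Page faults when only one page fits in memory and every visit to a page
    performs `touches_per_visit` consecutive accesses before leaving it."""
    return total_accesses // touches_per_visit

faults = faults_with_one_frame(32 * K * 8 * K, 4)
print(faults == 64 * M)  # True
```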
32K × 8K / 4 = 64M. After touching a page four times, the inner loop touches all other
pages (256MB) before coming back to the same page.
(d) Now suppose the same matrix is stored in column-major order. And, the physical memory size is
32 MB.
How many page faults would Program 1 experience?
32K × 8K = 256M. After touching a page just once, the inner loop touches all other
pages (256MB) before coming back to the same page.
8K. The inner loop touches all of a page and never comes back to the same page.
(e) Suppose still that the same matrix is stored in column-major order. However, this time the
physical memory size is 8 MB.
How many page faults would Program 1 experience?
32K × 8K = 256M. After touching a page just once, the inner loop touches all other
pages (256MB) before coming back to the same page.
8K. The inner loop touches all of a page and never comes back to the same page.
Reorder Buffer (entries listed from oldest to youngest)
V  Exception?  Opcode  Rd   Rs    Rt     Dest. Value
1  0           LD      R1   R12   R2     ?
1  0           ADD     R4   R3/?  R4/R7  3
1  0           ADD     R7   R8    R9     17
1  1           DIV     R10  ?     R3     ?
Note that R12 + R2 = 32, which is a valid 4-byte-aligned, bit addressable address.
B1 */
B2 */
B3 */
B4 */
(a) Of the four branches, list all those that exhibit local correlation, if any.
Only B1.
B2, B3, B4 are not locally correlated. Just like consecutive outcomes of a die, an
element being a multiple of N (N is 2, 3, and 6, respectively for B2, B3, and B4) has
no bearing on whether the next element is also a multiple of N .
(b) Which of the four branches are globally correlated, if any? Explain in less than 20 words.
Now assume that the above piece of code is running on a processor that has a global branch predictor.
The global branch predictor has the following characteristics.
Global history register (GHR): 2 bits.
Pattern history table (PHT): 4 entries.
Pattern history table entry (PHTE): 11-bit signed saturating counter (possible values: -1024 to 1023)
Before the code is run, all PHTEs are initially set to 0.
As the code is being run, a PHTE is incremented (by one) whenever a branch that corresponds
to that PHTE is taken, whereas a PHTE is decremented (by one) whenever a branch that
corresponds to that PHTE is not taken.
(d) After 120 iterations of the loop, calculate the expected value for only the first PHTE and fill it
in the shaded box below. (Please write it as a base-10 value, rounded to the nearest ones digit.)
Hint. For a given iteration of the loop, first consider, what is the probability that both B1 and B2
are taken? Given that they are, what is the probability that B3 will increment or decrement the
PHTE? Then consider...
Show your work.
Without loss of generality, let's take a look at the numbers from 1 through 6. Given
that a number is a multiple of two (i.e., 2, 4, 6), the probability that the number is
also a multiple of three (i.e., 6) is equal to 1/3; let's call this value Q. Given that a
number is a multiple of two and three (i.e., 6), the probability that the number is also
a multiple of six (i.e., 6) is equal to 1; let's call this value R.
For a single iteration of the loop, the PHTE has four chances of being incremented/decremented, once at each branch.
B3's contribution to PHTE. The probability that both B1 and B2 are taken is denoted
as P(B1 T && B2 T), which is equal to P(B1 T)*P(B2 T) = 1*1/2 = 1/2. Given that
they are, the probability that B3 is taken is equal to Q = 1/3. Therefore, the PHTE
will be incremented with probability 1/2*1/3 = 1/6 and decremented with probability
1/2*(1-1/3) = 1/3. The net contribution of B3 to PHTE is 1/6-1/3 = -1/6.
B4's contribution to PHTE. P(B2 T && B3 T) = 1/6. P(B4 T | B2 T && B3 T) =
R = 1. B4's net contribution is 1/6*1 = 1/6.
B1's contribution to PHTE. P(B3 T && B4 T) = 1/6. P(B1 T | B3 T && B4 T) =
1. B1's net contribution is 1/6*1 = 1/6.
B2's contribution to PHTE. P(B4 T && B1 T) = 1/6*1 = 1/6. P(B2 T | B4 T &&
B1 T) = 1/2. B2's net contribution is 1/6*1/2 - 1/6*1/2 = 0.
For a single iteration, the net contribution to the PHTE, summed across all the four
branches, is equal to 1/6. Since there are 120 iterations, the expected PHTE value is
equal to 1/6*120 = 20.
GHR (bit 1 = older outcome, bit 0 = younger outcome) indexes the PHT:
TT: 1st PHTE
TN: 2nd PHTE
NT: 3rd PHTE
NN: 4th PHTE
PHTE2: TN
P(B1 T && B2 N)=1/2. P(B3 T | B1 T && B2 N)=1/3. PHTE=1/2*(1/3-2/3)=-1/6.
P(B2 T && B3 N)=1/3. P(B4 T | B2 T && B3 N)=0. PHTE=1/3*-1=-1/3.
P(B3 T && B4 N)=1/6. P(B1 T | B3 T && B4 N)=1. PHTE=1/6*1=1/6.
P(B4 T && B1 N)=0. P(B2 T | B4 T && B1 N)=X. PHTE = 0.
Answer: 120*(-1/6-1/3+1/6+0)=-40
PHTE3: NT
P(B1 N && B2 T)=0. P(B3 T | B1 N && B2 T)=X. PHTE=0.
P(B2 N && B3 T)=1/6. P(B4 T | B2 N && B3 T)=0. PHTE=1/6*-1=-1/6.
P(B3 N && B4 T)=0. P(B1 T | B3 N && B4 T)=X. PHTE=0.
P(B4 N && B1 T)=5/6. P(B2 T | B4 N && B1 T)=1/2. PHTE=5/6*(1/2-1/2)=0.
Answer: 120*(0-1/6+0+0)=-20
PHTE4: NN
P(B1 N && B2 N)=0. P(B3 T | B1 N && B2 N)=X. PHTE DELTA=0.
P(B2 N && B3 N)=1/3. P(B4 T | B2 N && B3 N)=0. PHTE DELTA=1/3*-1=-1/3.
P(B3 N && B4 N)=2/3. P(B1 T | B3 N && B4 N)=1. PHTE DELTA=2/3*1=2/3.
P(B4 N && B1 N)=0. P(B2 T | B4 N && B1 N)=X. PHTE DELTA = 0.
Answer: 120*(0-1/3+2/3+0) = 40.
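The four expected values (20, -40, -20, 40) can be cross-checked by exhaustive enumeration. This is a sketch under the same assumptions the derivations use: B1 is always taken, B2, B3, and B4 are taken when the current element is a multiple of 2, 3, and 6 respectively, consecutive elements are independent and uniform over the residues 1 through 6, and start-up effects of the first iteration are ignored. The function name `expected_phte` is illustrative.

```python
from fractions import Fraction
from itertools import product

def outcomes(n):
    """Outcomes of (B1, B2, B3, B4) for array element n, as 'T' or 'N'."""
    return ("T",
            "T" if n % 2 == 0 else "N",
            "T" if n % 3 == 0 else "N",
            "T" if n % 6 == 0 else "N")

def expected_phte(iterations=120):
    """Expected counter value per 2-bit history after `iterations` loop iterations."""
    totals = {"TT": Fraction(0), "TN": Fraction(0), "NT": Fraction(0), "NN": Fraction(0)}
    weight = Fraction(1, 36)  # previous and current element: 6 x 6 equally likely pairs
    for prev, cur in product(range(1, 7), repeat=2):
        _, _, pb3, pb4 = outcomes(prev)
        b1, b2, b3, b4 = outcomes(cur)
        # (history seen by the branch, outcome of the branch)
        updates = [(pb3 + pb4, b1), (pb4 + b1, b2), (b1 + b2, b3), (b2 + b3, b4)]
        for history, outcome in updates:
            totals[history] += weight * (1 if outcome == "T" else -1)
    return {h: iterations * v for h, v in totals.items()}

print(expected_phte())
```

Using `Fraction` keeps the arithmetic exact, so the results compare equal to the integer answers rather than to rounded floats.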
(b) After the first 120 iterations, let us assume that the loop continues to execute for another 1 billion
iterations. What is the accuracy of this global branch predictor during the 1 billion iterations?
(Please write it as a percentage, rounded to the nearest single-digit.)
Show your work.
Given a history of TT, expected correct predictions per iteration =
P(B1 T && B2 T) * P(B3 T | B1 T && B2 T) +
P(B2 T && B3 T) * P(B4 T | B2 T && B3 T) +
P(B3 T && B4 T) * P(B1 T | B3 T && B4 T) +
P(B4 T && B1 T) * P(B2 T | B4 T && B1 T)
= 7/12
Given a history of TN, expected correct predictions per iteration =
P(B1 T && B2 N) * P(B3 N | B1 T && B2 N) +
P(B2 T && B3 N) * P(B4 N | B2 T && B3 N) +
P(B3 T && B4 N) * P(B1 N | B3 T && B4 N) +
P(B4 T && B1 N) * P(B2 N | B4 T && B1 N)
= 2/3
(c) Without prior knowledge of the contents of the array, what is the highest accuracy that any type
of branch predictor can achieve during the same 1 billion iterations as above? (Please write it as
a percentage, rounded to the nearest single-digit.)
Show your work.
Per-branch accuracy:
B1: 100%
B2: 50% (half the numbers are even)
B3: 33% (a third of the numbers are a multiple of three)
B4: 100% (global correlation)
Average accuracy: (100% + 50% + 33% + 100%) / 4 = 70.8% ≈ 71%