2011 Quiz 4 Sol
Notes:
• Not all questions are of equal difficulty, so look over the entire exam and
budget your time carefully.
• Please carefully state any assumptions you make.
• Please write your name on every page in the quiz.
• You must not discuss a quiz's contents with other students who have not
yet taken the quiz. If you have inadvertently been exposed to a quiz prior
to taking it, you must tell the instructor or TA.
• You will get no credit for selecting multiple-choice answers without
giving explanations if the instructions ask you to explain your choice.
Name _________(answer key)________
Written in C, the code is as follows (in all questions, you should assume that all arrays do
not overlap in memory):
#define N 1024
double S[N],A[N],B[N],Y[N];
The code for this problem translates to the following VLIW operations:
A, B, Y, and S are immediates set by the compiler to point to the beginning of the A, B,
Y, and S arrays. Register $i is used to index the arrays. For this ISA, register $0 is read-
only and always returns the value zero.
This code will run on the VLIW machine that was presented in lecture and used in
PSet #4, shown here:
Our machine has no interlocks. All register values are read at the start of the instruction
before any writes from the same instruction take effect (i.e., no WAR hazards between
operations within a single VLIW instruction).
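As a rough illustration of the read-before-write rule, the sketch below models one VLIW instruction as a list of operations, reads every source register first, and only then commits the results. The `execute_bundle` helper and the tuple encoding of operations are assumptions made for this example, and it treats every operation as single-cycle, which the real machine does not.

```python
def execute_bundle(regs, bundle):
    """Execute one VLIW instruction under read-before-write semantics.

    regs   -- dict mapping register name to value (hypothetical register file)
    bundle -- list of (dest, func, sources) operations issued together
    """
    # Phase 1: every operation reads its sources from the OLD register state.
    results = [(dest, func(*(regs[s] for s in srcs))) for dest, func, srcs in bundle]
    # Phase 2: writes take effect only after every read has completed.
    for dest, value in results:
        regs[dest] = value
    return regs


# Two operations in the same instruction swap $a and $b without a temporary,
# because neither one observes the other's write.
regs = execute_bundle(
    {"$a": 1, "$b": 2},
    [("$a", lambda x: x, ["$b"]), ("$b", lambda x: x, ["$a"])],
)
```

Here `regs` ends with the two values exchanged, which is the behavior the no-WAR-hazard statement above implies.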
The naive scheduling of the above assembly code (one operation per instruction) turns
into the following schedule (another copy of this is provided in Appendix A. Feel free to
remove Appendix A from the test to help in answering Questions Q.1A and Q1.B):
Note: the array length for this program is statically declared as 1024, which will help you
make simplifying assumptions in the start-up and exit code. You may not need all entries
in the following table. For your convenience, an empty entry will be interpreted as a
NOP.
Hint: when indexing arrays, an operation such as ld $a, A+8($i) is perfectly valid
(remember that $i is a register, and A is the literal set by the compiler to point to the
beginning of the A array).
Extra Hint: For this problem, it is recommended that you name your registers $a0, $a1,
$b0, $b1, etc.
What is the resulting throughput of the code, in "floating point operations per cycle"?
Only consider the steady-state operation of the loop.
4
5 fmul $t0, $a0, $b0
7 ld $y1, Y+8($i)
8
9 fadd $s0, $t0, $y0
11
12
13 addi $n, $n, -2 st $s0, S($i)
Note that some ops, such as “ld $y, Y($i)” and “addi $i, $i, 8” are not on the critical path, and can
be correctly placed in many of the instructions (so long as immediates, etc. are handled correctly).
Note: the array for this program is statically allocated to 1024 elements, which allows the
compiler (i.e., you) to make simplifying assumptions about the prologue and epilogue code.
What is the resulting throughput of the code, in "floating point operations per cycle"?
Only consider the steady-state operation of the loop.
25 st $s, S($i)
(cycle 25)
Note: this assumes that a register cannot be read after an instruction that writes that same register has been
issued, until the write has finished (RAW). If this constraint is relaxed, a better schedule is possible.
-1/2 points: $n is initialized to 1021, because we need to leave the loop early and let the epilogue finish the last three
iterations as the loop drains.
-1/2 points: for each wrong immediate
-1 for throughput calculation
-3 points for epilogue
-3 for prologue
-5 for loop
-2 for missing an important operation
-3-5 points for being very suboptimal
“ld $y” is tricky, because the load can happen very early, but we have to be careful that its result is not overwritten
by later “ld $y” operations before $y is used.
Use (mtc1 vlr, RI) to set the machine’s vector length to RI. However, this will
throw an exception if RI is greater than the machine’s maximum vector length. Note
that the machine’s maximum vector length is held in a register called VLENMAX.
Q2.A: (8 points)
Written in C, the first code piece is as follows:
// N is the array size
double A[N],B[N];
; initial conditions
; RA <= A[0]
; RB <= B[0]
; RN <= N
; VLENMAX <= max vlen
The initial conditions are shown above, along with some example vector assembly code
to describe the code found in the inner-most loop (lacking the strip-mining).
Can this code be vectorized? Explain why or why not. If it can be vectorized, write the
vector assembly code. Make sure to add a strip-mine loop to deal with the case where N
is greater than the machine’s vector length. However, you can assume that N will be a
multiple of the vector length.
YES.
LV VA, RA
LV VB, RB
ADDV VA, VA, VB
SV VA, RA
Because we made the assumption that N will be a multiple of VLENMAX, you can just directly
set vlr to VLENMAX.
Also, no points were taken off for not checking for the N==0 corner case.
-2 points for not incrementing addresses correctly (-1 for each mistake)
-1 point for not updating RN correctly
-1 for not setting mtc1 vlr
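For reference, the strip-mined structure the question asks for can be sketched in Python rather than vector assembly. The sketch assumes the loop body is A[i] = A[i] + B[i] (inferred from the LV/ADDV/SV sequence above), and the `vlenmax` parameter stands in for VLENMAX. It illustrates strip-mining and is not the graded assembly.

```python
def strip_mined_add(a, b, vlenmax):
    """Add b into a, one vector-length strip at a time (A[i] = A[i] + B[i])."""
    n = len(a)
    # As in the quiz, N is assumed to be a multiple of the vector length,
    # so every strip can use the full VLENMAX.
    assert n % vlenmax == 0, "sketch assumes N is a multiple of VLENMAX"
    base = 0                      # plays the role of the RA/RB address pointers
    remaining = n                 # plays the role of RN
    while remaining > 0:
        va = a[base:base + vlenmax]              # LV   VA, RA
        vb = b[base:base + vlenmax]              # LV   VB, RB
        va = [x + y for x, y in zip(va, vb)]     # ADDV VA, VA, VB
        a[base:base + vlenmax] = va              # SV   VA, RA
        base += vlenmax           # advance both pointers by one strip
        remaining -= vlenmax      # update the element count
    return a


result = strip_mined_add([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0], vlenmax=2)
```

The three updates at the bottom of the loop correspond to the items the grading notes check: both address increments and the RN update.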
Q2.B: (7 points)
Written in C, the second code piece is as follows:
// N is the array size
double A[N+1],B[N];
Can this code be vectorized? Explain why or why not. If it can be vectorized, write the
vector assembly code. Make sure to add a strip-mine loop to deal with the case where N
is greater than the machine’s vector length. However, you can assume that N will be a
multiple of the vector length.
LOOP:
OR RI, N, VLENMAX ; set vlen = n % vlenmax
MTC1 vlr, RI
LV VC, RC
LV VB, RB
ADDV VA, VC, VB
SV VA, RA
Q2.C: (7 points)
Written in C, the third code piece is as follows:
// N is the array size
double A[N+1],B[N+1];
Can this code be vectorized? Explain why or why not. If it can be vectorized, write the
vector assembly code. Make sure to add a strip-mine loop to deal with the case where N
is greater than the machine’s vector length. However, you can assume that N will be a
multiple of the vector length.
NO. Computing A[i] in iteration “i” requires using the previously computed A[i-1] from
iteration “i-1”, which forces a serialization (you must compute the elements one at a time,
and in-order).
Notice that the following code will get the wrong answer:
LV VC, RC
LV VB, RB
ADDV VA, VC, VB ; A[i] = A[i-1] + B[i] ????
SV VA, RA
Again, the above vector assembly gets the wrong answer. Consider the following arrays:
A = {0, 1, 2, 3, 4, 5};
B = {0, 0, 0, 0, 0, 0};
Computed serially, each A[i] picks up the freshly written A[i-1], so the correct result is:
A = {0, 0, 0, 0, 0, 0};
But running the above vector assembly, for a VLEN of 6, we incorrectly get:
A = {0, 0, 1, 2, 3, 4};
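The mismatch can be reproduced with a small Python sketch that contrasts the serial recurrence with what the LV/ADDV/SV sequence computes. It assumes the loop is A[i] = A[i-1] + B[i] for i starting at 1 (taken from the comment on the ADDV line), that RC points at A[0], and that RA and RB point at A[1] and B[1]. The function names are made up for this illustration.

```python
def serial_recurrence(a, b):
    """Reference semantics: A[i] = A[i-1] + B[i], one element at a time."""
    a = list(a)
    for i in range(1, len(a)):
        a[i] = a[i - 1] + b[i]    # uses the value written in iteration i-1
    return a


def naive_vector(a, b):
    """What the vector sequence computes: all loads precede all stores."""
    a = list(a)
    vc = a[:-1]                                # LV VC, RC  (old A[0..n-2])
    vb = b[1:]                                 # LV VB, RB  (B[1..n-1])
    a[1:] = [x + y for x, y in zip(vc, vb)]    # ADDV, then SV into A[1..n-1]
    return a


A = [0, 1, 2, 3, 4, 5]
B = [0, 0, 0, 0, 0, 0]
serial = serial_recurrence(A, B)   # every element sees the updated neighbor
vector = naive_vector(A, B)        # every element sees the stale neighbor
```

The vector version reads all of the old A values before any store lands, which is why it yields a shifted copy of A instead of the serial result.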
Question 3: Multithreading
(14 points)
For this problem, we are interested in evaluating the effectiveness of multithreading using
the following numerical code.
#define N 1024
double S[N],A[N],B[N],Y[N];
Notice that, because this is an in-order pipeline, the “ld $y” must wait for the FMUL to
execute before the load can start.
A lot of students had “off by one” errors. Memory operations were stated as “takes 50
CPU cycles”. This means that a LD that starts in cycle 1 will end in cycle 50 (think about
an ALU instruction that takes 1 CPU cycle. If it starts in cycle 1, it will end in cycle 1.
Likewise, an instruction that takes 2 CPU cycles that starts in cycle 1 will end in cycle 2,
etc.).
On the other hand, FADD/FMUL both have a “use-delay of 5 cycles”. This means an
FADD that starts in cycle 103 will end in cycle 108, and the dependent ST can begin in
cycle 109 (e.g., for the standard 5-stage pipeline LD instructions have a use-delay of 1
cycle, which means if a LD started in cycle 1, it finished* in cycle 2, and a dependent
ADD could start in cycle 3).
*I say “finished” because it still has to be committed/written-back, but from the point of
view of when the LD value can be used, it finishes in cycle 2.
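The two latency conventions described above can be written down as a pair of small Python helpers. The function names are invented for this sketch, and the rule that a dependent of a "takes N cycles" operation may start in the cycle after it ends is an inference from the examples above, not something the quiz states directly.

```python
def takes_cycles(start, cycles):
    """'Takes N CPU cycles': the operation occupies cycles start .. start+N-1.

    Returns (end_cycle, first_cycle_a_dependent_can_start).
    """
    end = start + cycles - 1
    return end, end + 1


def use_delay(start, delay):
    """'Use-delay of D cycles': the result is ready D cycles after the start.

    Returns (end_cycle, first_cycle_a_dependent_can_start).
    """
    end = start + delay
    return end, end + 1


memory_load = takes_cycles(1, 50)   # LD with 50-cycle memory, started in cycle 1
fadd_to_st = use_delay(103, 5)      # FADD started in cycle 103, then dependent ST
classic_ld = use_delay(1, 1)        # 5-stage pipeline LD with use-delay of 1
```

Under these conventions the FADD example gives an end cycle of 108 and a dependent start of 109, matching the text, while the two conventions differ by exactly one cycle for the same nominal latency.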
-0 points for using 4 versus 5 cycles (fadd -> st); either is accepted as correct
-1/2 points for each off-by-one error (ld->fadd, ld->fmul)
-3 points for waiting for the store to complete
-3 points for counting latency of store, instead of steady-state of loop (throughput)
Each thread executes the above code, and is calculating its own independent piece of the
S array (i.e., there is no communication required between threads). Assuming an infinite
number of registers, what is the minimum number of threads we need to fully utilize the
processor? You are free to re-schedule the assembly as necessary to minimize the
number of threads required.
loop:
ld $a, A($i) ; 1
ld $b, B($i) ; N + 1
ld $y, Y($i) ; 2N + 1
addi $i, $i, 8 ; 3N + 1
addi $n, $n, -1 ; 4N + 1
fmul $t, $a, $b ; 5N + 1
fadd $s, $t, $y ; 6N + 1
st $s, S-8($i) ; 7N + 1
bnez $n, loop ; 8N + 1
Critical path: either LD $y -> FADD or LD $b -> FMUL (so long as N is greater than 5,
to hide the FMUL latency). Both pairs issue 4N cycles apart, so:
4N = 50, N = 12.5, which rounds up to 13 threads.
-1.5 points for correct analysis, not ideal (answers 25 threads by moving LD Y)
-2.5 for not rescheduling at all (answers 50 threads) between LD Y -> FADD
-3 points for wrong analysis, optimal rescheduling
-5 for getting non-optimal scheduling, wrong analysis
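The thread counts mentioned above all follow from the same calculation: with round-robin issue, two instructions that are k slots apart in program order issue k*N cycles apart, and that gap must cover the 50-cycle memory latency. A minimal Python sketch, with a helper name chosen for this illustration:

```python
import math


def min_threads(latency, slots_apart):
    """Smallest thread count N such that slots_apart * N >= latency."""
    return math.ceil(latency / slots_apart)


optimal = min_threads(50, 4)         # load and its use 4 slots apart: 4N = 50
partial = min_threads(50, 2)         # only 2 slots apart: 2N = 50
unscheduled = min_threads(50, 1)     # load immediately followed by its use
```

The first case rounds 12.5 up to a whole number of threads. The other two reproduce the 25-thread and 50-thread answers listed in the grading notes.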
Assume all resources have infinite size (i.e., physical registers, ROB, instruction-
window). There is still no cache (50 cycles to access memory) and the use-delay on
floating-point arithmetic is still 5 cycles. A maximum of four instructions can be issued
in a given cycle.
How many threads are required to saturate this machine, if every thread is running the
above code?
This is actually a trick question. If an OoO processor has infinite resources, because each
iteration of the loop is independent, the processor will dynamically unroll the loop in
hardware and use instruction-level-parallelism to saturate the machine and hide the 50
cycle memory latency.
Describe how you expect switching to each of the following architectures will affect
instructions/program and cycles/instruction (CPI) relative to a baseline 5-stage, in-order
processor.
Mark whether the following modifications will cause instruction/program and CPI to
increase, decrease, or whether the change will have no effect. Explain your reasoning
to receive credit.
Of course, the number of operations will increase, because NOPs will have to be added
to hide latencies and to fill slots where no operation could be scheduled for a given
functional unit. Thus, the static instruction bytes / program will likely increase.
Cycles/Instruction:
If the student said that inst/program decreases, then CPI must be the same or increase, as
the machine is more likely to see a stall.
If the student said that inst/program increases because of added NOP-only instructions,
then CPI decreases as it will now stall less.
Assume that the new processor is still an in-order, 5-stage-pipeline processor, but that it
has been modified to switch between two threads every clock cycle (fine-grain
multithreading). If a thread is not ready to be issued (e.g., a cache miss), a bubble is
inserted in the pipeline.
Cycles/Instruction:
From the POV of one thread, CPI increases, as multithreading adds contention for
pipeline resources, and thus structural hazards can now occur that will stall a thread that
is otherwise ready to execute.
From the POV of multiple threads, the aggregate CPI decreases, as they are less likely to
encounter a stall since independent instructions can be interleaved.
END OF QUIZ