

Computer Architecture and Engineering


CS152 Quiz #4
April 11th, 2011
Professor Krste Asanović

Name: <ANSWER KEY>

This is a closed book, closed notes exam.


80 Minutes
17 Pages

Notes:
• Not all questions are of equal difficulty, so look over the entire exam and
budget your time carefully.
• Please carefully state any assumptions you make.
• Please write your name on every page in the quiz.
• You must not discuss a quiz's contents with other students who have not
yet taken the quiz. If you have inadvertently been exposed to a quiz prior
to taking it, you must tell the instructor or TA.
• You will get no credit for selecting multiple-choice answers without
giving explanations when the instructions ask you to explain your choice.

Writing name on each sheet ________ 2 Points


Question 1 ________ 24 Points
Question 2 ________ 22 Points
Question 3 ________ 14 Points
Question 4 ________ 18 Points

TOTAL ________ 80 Points


Question 1: Scheduling for VLIW Machines


(24 points)
The following questions concern the scheduling of floating-point code on a VLIW
machine.

Written in C, the code is as follows (in all questions, you should assume that all arrays do
not overlap in memory):
#define N 1024
double S[N],A[N],B[N],Y[N];

... arrays are initialized ...

for(int i = 0; i < N; i++)


S[i] = A[i] * B[i] + Y[i];

The code for this problem translates to the following VLIW operations:

addi $n, $0, 1024


addi $i, $0, 0
loop:
ld $a, A($i)
ld $b, B($i)
fmul $t, $a, $b
ld $y, Y($i)
fadd $s, $t, $y
st $s, S($i)
addi $i, $i, 8
addi $n, $n, -1
bnez $n, loop

A, B, Y, and S are immediates set by the compiler to point to the beginning of the A, B,
Y, and S arrays. Register $i is used to index the arrays. For this ISA, register $0 is read-
only and always returns the value zero.

This code will run on the VLIW machine that was presented in lecture and used in
PSet #4, shown here:

Figure 1. The VLIW machine (issue slots: Int Op 1, Int Op 2, Mem Op 1, Mem Op 2, FP Addition, FP Multiply)


Our machine has six execution units:


- two ALU units, latency one-cycle (i.e., dependent ALU ops can be issued back-to-
back), also used for branches.
- two memory units, latency three cycles, fully pipelined, each unit can perform either
a load or a store.
- two FPU units, latency four cycles, fully pipelined, one unit can only perform fadd
operations, the other can only perform fmul operations.

Our machine has no interlocks. All register values are read at the start of the instruction
before any writes from the same instruction take effect (i.e., no WAR hazards between
operations within a single VLIW instruction).

The naive scheduling of the above assembly code (one operation per instruction) turns
into the following schedule (another copy of this is provided in Appendix A; feel free to
remove Appendix A from the test to help in answering Questions Q1.A and Q1.B):

Inst Int Op 1 Int Op 2 Mem Op 1 Mem Op 2 FP Add FP Mul


1 addi $n,$0,1024 -- -- -- -- --
2 addi $i, $0, 0 -- -- -- -- --
loop: 3 -- -- ld $a, A($i) -- -- --
4 -- -- ld $b, B($i) -- -- --
5 -- -- -- -- -- --
6 -- -- -- -- -- --
7 -- -- -- -- -- fmul $t, $a, $b
8 -- -- ld $y, Y($i) -- -- --
9 -- -- -- -- -- --
10 -- -- -- -- -- --
11 -- -- -- -- fadd $s, $t, $y --
12 -- -- -- -- -- --
13 -- -- -- -- -- --
14 -- -- -- -- -- --
15 -- -- st $s, S($i) -- -- --
16 addi $i, $i, 8 -- -- -- -- --
17 addi $n, $n, -1 -- -- -- -- --
18 bnez $n, loop -- -- -- -- --
(Inst. 3)


Q1.A: Loop Unrolling & General Optimizations


(12 points)
Loop unrolling will enable higher throughput over the naive implementation shown on
the previous page. Unroll the above code once, so that two iterations are in flight in each
pass of the VLIW loop. You should also consider other performance optimizations to
improve throughput (i.e., re-ordering operations, adding or removing operations, and
packing operations into a single VLIW instruction). However, do not do software
pipelining; that is for the next part. To receive full credit, your code should demonstrate
good throughput.

Note: the array length for this program is statically declared as 1024, which will help you
make simplifying assumptions in the start-up and exit code. You may not need all entries
in the following table. For your convenience, an empty entry will be interpreted as a
NOP.

Hint: when indexing arrays, an operation such as ld $a, A+8($i) is perfectly valid
(remember that $i is a register, and A is the literal set by the compiler to point to the
beginning of the A array).

Extra Hint: For this problem, it is recommended that you name your registers $a0, $a1,
$b0, $b1, etc.

What is the resulting throughput of the code, in "floating point operations per cycle"?
Only consider the steady-state operation of the loop.

4 FLOPS / 13 cycles in the loop =

Throughput (FLOPS/cycle) ___4/13__


Inst Int Op 1 Int Op 2 Mem Op 1 Mem Op 2 FP Add FP Mul

1 addi $n,$0,1024 addi $i, $0, 0


loop: 2 ld $a0, A($i) ld $b0, B($i)

3 ld $a1, A+8($i) ld $b1, B+8($i)

4
5 fmul $t0, $a0, $b0

6 ld $y0, Y($i) fmul $t1, $a1, $b1

7 ld $y1, Y+8($i)

8
9 fadd $s0, $t0, $y0

10 fadd $s1, $t1, $y1

11
12
13 addi $n, $n, -2 st $s0, S($i)

14 addi $i, $i, 16 bnez $n, loop st $s1, S+8($i)


(Inst. 2)
15
16
17
18
19
20
21
22

Table Q1.A (Loop unrolling)

-1/2 point for incrementing $i by 8 (instead of 16)


-1/2 point for decrementing $n by 1 (instead of by 2)
-1/2 point for each error in immediates
-1 point for issuing an operation too early
-1/2 point for suboptimal scheduling of an instruction/operation
-1 point for getting the throughput calculation incorrect
-3 points for not condensing the two iterations together

Note that some ops, such as “ld $y, Y($i)” and “addi $i, $i, 8” are not on the critical path, and can
be correctly placed in many of the instructions (so long as immediates, etc. are handled correctly).
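
For reference, here is a C-level sketch of what the unrolled schedule computes. This is only an illustration of the unroll-by-two structure (the function name is made up); the assignment of operations to VLIW slots is what Table Q1.A captures.

#define N 1024

/* Unroll-by-two sketch: two iterations per pass, matching $i advancing by 16 bytes
   and $n decrementing by 2 in Table Q1.A. Assumes N is even (true for N = 1024). */
void compute_unrolled(double S[N], const double A[N],
                      const double B[N], const double Y[N]) {
    for (int i = 0; i < N; i += 2) {
        double t0 = A[i]     * B[i];       /* fmul, iteration i     */
        double t1 = A[i + 1] * B[i + 1];   /* fmul, iteration i + 1 */
        S[i]     = t0 + Y[i];              /* fadd + st, iteration i     */
        S[i + 1] = t1 + Y[i + 1];          /* fadd + st, iteration i + 1 */
    }
}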


Q1.B: Software Pipelining (12 points)


Another optimization technique is software pipelining. Rewrite the assembly code to
leverage software pipelining (do not also use loop unrolling). You should show the loop
prologue and epilogue to initiate and drain the software pipeline. You may not need all
entries in the following table. However, it is okay if you run out of entries writing the
epilogue code.

Note: the array for this program is statically allocated to 1024 elements, which allows the
compiler (i.e., you) to make simplifying assumptions about the prologue and epilogue code.

What is the resulting throughput of the code, in "floating point operations per cycle"?
Only consider the steady-state operation of the loop.

Depends on the student’s implementation. For the following table:

2 FLOPS / 4 cycles in the loop =

Throughput (FLOPS/cycle) ___1/2__


Inst Int Op 1 Int Op 2 Mem Op 1 Mem Op 2 FP Add FP Mul

1 addi $n,$0,1021 addi $i, $0, 0


prologue: 2 ld $a, A($i) ld $b, B($i)
3
4
5 addi $i, $i, 8 ld $y, Y($i) fmul $t, $a, $b
6 ld $a, A($i) ld $b, B($i)
7
8
9 addi $i, $i, 8 ld $y, Y($i) fadd $s, $t, $y fmul $t, $a, $b
10 ld $a, A($i) ld $b, B($i)
11
12
loop: 13 addi $i, $i, 8 ld $y, Y($i) st $s, S-16($i) fadd $s, $t, $y fmul $t, $a, $b
14 ld $a, A($i) ld $b, B($i)
15 addi $n, $n, -1
16 bnez $n, loop
epilogue: 17 ld $y, Y($i) st $s, S-16($i) fadd $s, $t, $y fmul $t, $a, $b
18
19
20
21 st $s, S-8($i) fadd $s, $t, $y
23/24/25 ....

25 st $s, S($i)
(cycle 25)

Note this assumes that once an instruction that writes a register has issued, you cannot read that same
register (a RAW dependence) until the write has finished. If this constraint is relaxed, a better schedule is possible.
-1/2 points: $n is initialized to 1021, because we need to leave the loop early and let the epilogue finish the last three
iterations as the loop drains.
-1/2 points: for each wrong immediate
-1 for throughput calculation
-3 points for epilogue
-3 for prologue
-5 for loop
-2 for missing an important operation
-3 to -5 points for a very suboptimal schedule

“ld $y” is tricky, because the load can happen very early, but we have to be careful that $y is not
overwritten by a later “ld $y” before it is used.
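
At the C level, the software-pipelined schedule corresponds roughly to the sketch below: each pass of the steady-state loop does the loads for iteration i+2, the multiply for iteration i+1, and the add/store for iteration i. This only illustrates the structure (indices instead of registers, simplified prologue/epilogue, made-up function name), not the exact cycle-level schedule in the table.

#define N 1024

/* Software-pipelining sketch: three overlapping stages (load, multiply, add+store). */
void compute_swp(double S[N], const double A[N],
                 const double B[N], const double Y[N]) {
    /* Prologue: fill the pipeline. */
    double a = A[0], b = B[0], y = Y[0];
    double t = a * b;
    a = A[1]; b = B[1];

    /* Steady state: one completed result per pass. */
    for (int i = 0; i < N - 2; i++) {
        S[i] = t + y;          /* fadd + st for iteration i */
        t = a * b;             /* fmul for iteration i + 1  */
        a = A[i + 2];          /* loads for iteration i + 2 */
        b = B[i + 2];
        y = Y[i + 1];
    }

    /* Epilogue: drain the last two iterations. */
    S[N - 2] = t + y;
    S[N - 1] = a * b + Y[N - 1];
}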


Question 2: Vector Processors


(22 points)
We will now look at three different code segments. For each, comment on whether it can
be vectorized, and then write the strip-mined vector assembly code if it can be vectorized.

Use (mtc1 vlr, RI) to set the machine’s vector length to RI. However, this will
throw an exception if RI is greater than the machine’s maximum vector length. Note
that the machine’s maximum vector length is held in a register called VLENMAX.

Q2.A: (8 points)
Written in C, the first code piece is as follows:

// N is the array size
double A[N],B[N];

... arrays are initialized ...

for(int i = 0; i < N; i++)
    A[i] = A[i] + B[i];

; initial conditions
; RA <= A[0]
; RB <= B[0]
; RN <= N
; VLENMAX <= max vlen

; inner-most loop
LV VA, RA
LV VB, RB
ADDV VA, VA, VB
SV VA, RA

The initial conditions are shown above, along with some example vector assembly code
to describe the code found in the inner-most loop (lacking the strip-mining).

Can this code be vectorized? Explain why or why not. If it can be vectorized, write the
vector assembly code. Make sure to add a strip-mine loop to deal with the case where N
is greater than the machine’s vector length. However, you can assume that N will be a
multiple of the vector length.

(Write your answer on the following page).


Q2.A continued ...

for(int i = 0; i < N; i++)
    A[i] = A[i] + B[i];

; initial conditions
; RA = &(A[0])
; RB = &(B[0])
; RN = N
; VLENMAX = max vlen

YES.

BEQZ RN, DONE ; handle N == 0 case


LOOP:
MTC1 vlr, VLENMAX ; set vector length

LV VA, RA
LV VB, RB
ADDV VA, VA, VB
SV VA, RA

SLLI RK, VLENMAX, 3 ; convert # of elements to # of bytes


ADD RA, RA, RK ; advance RA to the start of the next strip of A
ADD RB, RB, RK ; advance RB to the start of the next strip of B
SUB RN, RN, VLENMAX
BNEZ RN, LOOP
DONE:

Because we made the assumption that N will be a multiple of VLENMAX, you can just directly
set vlr to VLENMAX.

Also, no points were taken off for not checking for the N==0 corner case.

- 3 points for not updating memory addresses (RA, RB).

-2 points for not incrementing addresses correctly (-1 for each mistake)
-1 point for not updating RN correctly
-1 for not setting mtc1 vlr

-6 for only saying yes, but providing no code
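
In scalar C, the strip-mined structure of this answer corresponds to the following sketch: the inner loop is what one LV/LV/ADDV/SV sequence performs with vlr = VLENMAX (names here are illustrative, not part of the quiz).

/* Strip-mining sketch for Q2.A; assumes n is a multiple of vlenmax, as the problem allows. */
void vadd_stripmined(double *A, const double *B, int n, int vlenmax) {
    for (int i = 0; i < n; i += vlenmax) {     /* one strip per pass: RA/RB advance, RN -= VLENMAX */
        for (int j = 0; j < vlenmax; j++)      /* done by one LV/LV/ADDV/SV sequence */
            A[i + j] = A[i + j] + B[i + j];
    }
}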


Q2.B: (7 points)
Written in C, the second code piece is as follows:
// N is the array size
double A[N+1],B[N];

... arrays are initialized ...

for(int i = 0; i < N; i++)


A[i] = A[i+1] + B[i];

Can this code be vectorized? Explain why or why not. If it can be vectorized, write the
vector assembly code. Make sure to add a strip-mine loop to deal with the case where N
is greater than the machine’s vector length. However, you can assume that N will be a
multiple of the vector length.

YES. Although iteration i reads A[i+1], which iteration i+1 later writes, this is only an
anti-dependence: the vector load of A[i+1] ... A[i+VL] completes before the vector store of
A[i] ... A[i+VL-1], so every element is read before it is overwritten, exactly as in the serial
code. It is therefore still possible to vectorize.

LOOP:
MTC1 vlr, VLENMAX ; set vector length (N is a multiple of VLENMAX)

ADDI RC, RA, 8 ; RC = &(A[i+1]), if you will

LV VC, RC
LV VB, RB
ADDV VA, VC, VB
SV VA, RA

SLLI RK, VLENMAX, 3 ; convert # of elements to # of bytes


ADD RA, RA, RK ; advance RA to the start of the next strip of A
ADD RB, RB, RK ; advance RB to the start of the next strip of B
SUB RN, RN, VLENMAX
BNEZ RN, LOOP

-1 for VC/RC being +1, and not +8
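
A sketch of why the per-strip ordering is safe: every element of A[i+1] ... A[i+VL] is loaded before any element of A[i] ... A[i+VL-1] is stored, so each read sees the original value, just as in the serial loop. The vector registers are modeled as plain arrays here; the 64-element bound and function name are assumptions for illustration only.

/* One strip of Q2.B, modeling the LV/LV/ADDV/SV ordering; assumes vl <= 64. */
void strip_shift_add(double *A, const double *B, int lo, int vl) {
    double vc[64], vb[64], va[64];                        /* stand-ins for vector registers */
    for (int j = 0; j < vl; j++) vc[j] = A[lo + j + 1];   /* LV VC, RC : A[i+1] read first  */
    for (int j = 0; j < vl; j++) vb[j] = B[lo + j];       /* LV VB, RB                      */
    for (int j = 0; j < vl; j++) va[j] = vc[j] + vb[j];   /* ADDV VA, VC, VB                */
    for (int j = 0; j < vl; j++) A[lo + j] = va[j];       /* SV VA, RA : writes happen last */
}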


Q2.C: (7 points)
Written in C, the third code piece is as follows:
// N is the array size
double A[N+1],B[N+1];

... arrays are initialized ...

for(int i = 1; i < N+1; i++)


A[i] = A[i-1] + B[i];

Can this code be vectorized? Explain why or why not. If it can be vectorized, write the
vector assembly code. Make sure to add a strip-mine loop to deal with the case where N
is greater than the machine’s vector length. However, you can assume that N will be a
multiple of the vector length.

-7 for saying ‘yes’

NO. Computing A[i] in iteration “i” requires using the previously computed A[i-1] from
iteration “i-1”, which forces a serialization (you must compute the elements one at a time,
and in-order).

Notice that the following code will get the wrong answer:

ADDI RC, RA, -8 ; RC = &(A[i-1]), if you will

LV VC, RC
LV VB, RB
ADDV VA, VC, VB ; A[i] = A[i-1] + B[i] ????
SV VA, RA

Again, the above vector assembly gets the wrong answer. Consider the following arrays:

A = {0, 1, 2, 3, 4, 5};
B = {0, 0, 0, 0, 0, 0};

Running the above C code, the correct and final solution is


A = {0, 0, 0, 0, 0, 0}
(because A[0] = 0, and A[i] = A[i-1] starting from iteration “i=1”).

But running the above vector assembly, for a VLEN of 6, we incorrectly get:
A = {0, 0, 1, 2, 3, 4};
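
A small C check makes the difference concrete, comparing serial execution against a load-all-then-store-all execution over one vector of length 6 (mirroring the example above):

#include <stdio.h>

int main(void) {
    double A1[6] = {0, 1, 2, 3, 4, 5};    /* serial result        */
    double A2[6] = {0, 1, 2, 3, 4, 5};    /* "vectorized" result  */
    double B[6]  = {0, 0, 0, 0, 0, 0};
    double VC[6], VB[6], VA[6];

    /* Serial loop: iteration i sees the A[i-1] just written by iteration i-1. */
    for (int i = 1; i < 6; i++)
        A1[i] = A1[i - 1] + B[i];

    /* Vector-style execution: every A[i-1] is loaded before any store happens. */
    for (int i = 1; i < 6; i++) { VC[i] = A2[i - 1]; VB[i] = B[i]; }   /* LV VC / LV VB */
    for (int i = 1; i < 6; i++) VA[i] = VC[i] + VB[i];                 /* ADDV          */
    for (int i = 1; i < 6; i++) A2[i] = VA[i];                         /* SV            */

    /* Prints A1 = {0,0,0,0,0,0} but A2 = {0,0,1,2,3,4}. */
    for (int i = 0; i < 6; i++)
        printf("%g %g\n", A1[i], A2[i]);
    return 0;
}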


Question 3: Multithreading
(14 points)
For this problem, we are interested in evaluating the effectiveness of multithreading using
the following numerical code.

#define N 1024
double S[N],A[N],B[N],Y[N];

... arrays are initialized ...

for(int i = 0; i < N; i++)


S[i] = A[i] * B[i] + Y[i];

Using the disassembler we obtain:

addi $n, $0, 1024


addi $i, $0, 0
loop:
ld $a, A($i)
ld $b, B($i)
fmul $t, $a, $b
ld $y, Y($i)
fadd $s, $t, $y
st $s, S($i)
addi $i, $i, 8
addi $n, $n, -1
bnez $n, loop

Assume the following:


- Our system does not have a cache.
- Each memory operation directly accesses main memory and takes 50 CPU cycles.
- The load/store unit is fully pipelined.
- After the processor issues a memory operation, it can continue executing instructions
until it reaches an instruction that is dependent on an outstanding memory operation.
- The fmul and fadd instructions both have a use-delay of 5 cycles.


Q3.A: In-order Single-threaded Processor (4 points)


How many cycles does it take to execute one iteration of the loop in steady-state for a
single-threaded processor? Do not re-order the assembly code.

loop: start / end


ld $a, A($i) 1 / 50
ld $b, B($i) 2 / 51
fmul $t, $a, $b 52 / 57
ld $y, Y($i) 53 / 102
fadd $s, $t, $y 103 / 108
st $s, S($i) 109
addi $i, $i, 8 110 / 110
addi $n, $n, -1 111 / 111
bnez $n, loop 112 / 112
LD $a, A($i) 113 ....

112 cycles. (113 - 1).

Notice that because this is an in-order pipeline, the “ld $y” must wait for the FMUL to
issue before the load can start.

A lot of students had “off by one” errors. Memory operations were stated as “takes 50
CPU cycles”. This means that a LD that starts in cycle 1 will end in cycle 50 (think about
an ALU instruction that takes 1 CPU cycle. If it starts in cycle 1, it will end in cycle 1.
Likewise, an instruction that takes 2 CPU cycles that starts in cycle 1 will end in cycle 2,
etc.).

On the other hand, FADD/FMUL both have a “use-delay of 5 cycles”. This means an
FADD that starts in cycle 103 will end in cycle 108, and the dependent ST can begin in
cycle 109 (e.g., for the standard 5-stage pipeline LD instructions have a use-delay of 1
cycle, which means if a LD started in cycle 1, it finished* in cycle 2, and a dependent
ADD could start in cycle 3).

*I say “finished” because it still has to be committed/written-back, but from the point of
view of when the LD value can be used, it finishes in cycle 2.

- no deduction: either 4 or 5 cycles for the fadd -> st spacing was accepted as correct
-1/2 point for each off-by-one error (ld -> fadd, ld -> fmul)
-3 points for waiting for the store to complete before starting the next iteration
-3 points for counting the latency through the store, instead of the steady-state throughput of the loop
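
The two timing conventions used above can be summarized as follows (a small sketch; the helper names are made up):

/* "Takes L CPU cycles": issue in cycle s, result ready at the end of cycle s + L - 1,
   so a dependent op can issue in cycle s + L.   e.g., ld $b issues @2 -> fmul @52.  */
int mem_dep_issue(int s, int L) { return s + L; }

/* "Use-delay of d cycles": a dependent op can issue in cycle s + d + 1.
   e.g., fadd issues @103 with use-delay 5 -> st @109.                               */
int fp_dep_issue(int s, int d) { return s + d + 1; }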


Q3.B: In-order Multi-threaded Processor (5 points)


Now consider multithreading the pipeline. Threads are switched every cycle using a
fixed round-robin schedule. If the thread is not ready to run on its turn, a bubble is
inserted into the pipeline.

Each thread executes the above code, and is calculating its own independent piece of the
S array (i.e., there is no communication required between threads). Assuming an infinite
number of registers, what is the minimum number of threads we need to fully utilize the
processor? You are free to re-schedule the assembly as necessary to minimize the
number of threads required.

loop:
ld $a, A($i) ; 1
ld $b, B($i) ; N + 1
ld $y, Y($i) ; 2N + 1
addi $i, $i, 8 ; 3N + 1
addi $n, $n, -1 ; 4N + 1
fmul $t, $a, $b ; 5N + 1
fadd $s, $t, $y ; 6N + 1
st $s, S-8($i) ; 7N + 1
bnez $n, loop ; 8N + 1

The critical path is either LD $y -> FADD or LD $b -> FMUL (so long as N is greater than 5,
the FMUL use-delay is also hidden).

For LD $y -> FADD:


(6N+1) - (2N+1) >= 50 cycles

4N >= 50, so N = 12.5

So 13 threads will keep the machine fully utilized.

-1.5 points for correct analysis, not ideal (answers 25 threads by moving LD Y)
-2.5 for not rescheduling at all (answers 50 threads) between LD Y -> FADD
-3 points for wrong analysis, optimal rescheduling
-5 for getting non-optimal scheduling, wrong analysis

-3 for not using round-robin
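
A quick numeric check of the 4N >= 50 condition (a hypothetical helper that just evaluates the slot arithmetic from the analysis above):

#include <stdio.h>

int main(void) {
    /* With T threads in round-robin, ld $y issues in slot 2T+1 and fadd in slot 6T+1,
       so the 50-cycle load latency is hidden when (6T+1) - (2T+1) = 4T >= 50.        */
    for (int T = 11; T <= 14; T++)
        printf("T = %d: gap = %d cycles -> %s\n",
               T, 4 * T, (4 * T >= 50) ? "load hidden" : "stall");
    return 0;   /* the smallest T with no stall is 13 */
}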


Q3.C: Out-of-Order SMT Processor (5 points)

Now consider a four-wide, out-of-order SMT (simultaneous multithreading) processor. It


can issue any ready instruction from any thread.

Assume all resources have infinite size (i.e., physical registers, ROB, instruction-
window). There is still no cache (50 cycles to access memory) and the use-delay on
floating-point arithmetic is still 5 cycles. A maximum of four instructions can be issued
in a given cycle.

How many threads are required to saturate this machine, if every thread is running the
above code?

This is actually a trick question. Because each iteration of the loop is independent and the
OoO processor has infinite resources, the processor will dynamically unroll the loop in
hardware and use instruction-level parallelism to saturate the machine and hide the
50-cycle memory latency.

Thus, only one thread is required to saturate the machine.


Question 4: Iron Law


(18 points)

Describe how you expect switching to each of the following architectures will affect
instructions/program and cycles/instruction (CPI) relative to a baseline 5-stage, in-order
processor.

Mark whether the following modifications will cause instruction/program and CPI to
increase, decrease, or whether the change will have no effect. Explain your reasoning
to receive credit.

Q4.A: VLIW (6 points)


How do instructions/program and CPI change when moving from a 5-stage-pipeline in-
order processor to a traditional VLIW processor?

Inst/Program: DECREASES. The same number of operations needs to be executed,


but some of these operations can now be combined and scheduled into a single
instruction, so instructions/program goes down. (However, if very few operations could be
scheduled together, and lots of NOP-only instructions had to be added to hide latencies in
software, then Inst/Program could go up, but then you’ve done something horribly wrong
by switching to VLIW.)

Of course, the number of operations will increase, because NOPs will have to be added
to hide latencies and for when some operations could not be scheduled onto a given
functional unit. Thus, the static instruction bytes / program will likely increase.

Cycles/Instruction:

Depends on your answer to inst/program.

If the student said that inst/program decreases, then CPI must be the same or increase, as
the machine is more likely to see a stall.

If the student said that inst/program increases because of added NOP-only instructions,
then CPI decreases as it will now stall less.

-1 point if student said that inst/program decreases and CPI decreases


Q4.B: Vector Processors (6 points)


How do instructions/program and CPI change when moving from a 5-stage-pipeline in-
order processor to a single-lane vector processor?

Inst/Program: DECREASES. Instructions will decrease because many operations can


now be expressed in a single instruction. Also, a vector processor can express a number
of loop iterations in a small number of vector instructions, eliminating the loop overhead
that was expended every iteration (it is now paid once per vector).

Cycles/Instruction: INCREASES. Vector instructions take multiple cycles to


execute, though chaining can help overlap execution of multiple vector instructions to
mitigate this issue.

Q4.C: Multithreaded Processor (6 points)


How do instructions/program and CPI change when moving from a 5-stage-pipeline in-
order processor to a multithreaded processor?

Assume that the new processor is still an in-order, 5-stage-pipeline processor, but that it
has been modified to switch between two threads every clock cycle (fine-grain
multithreading). If a thread is not ready to be issued (e.g., a cache miss), a bubble is
inserted in the pipeline.

Inst/Program: UNCHANGED. The ISA need not change to enable multithreading.

Cycles/Instruction:

The answer depends on your point of view.

From the POV of one thread, CPI increases, as multithreading adds contention for
pipeline resources, and structural hazards can now stall a thread that is otherwise ready
to execute.

From the POV of multiple threads, the aggregate CPI decreases, as they are less likely to
encounter a stall since independent instructions can be interleaved.

END OF QUIZ

