CS222 - COAL - SOLUTION - Final - Spring2023
Read each question completely before answering it. There are 3 questions.
In case of any ambiguity, you may make an assumption. Write it down, but your assumption must not contradict any statement in the question paper.
Write the answer in the space below each question. Answers written with pencil will get zero marks.
Do rough work with pencil and then write final answer with a pen. Use last few pages for rough work.
Read the whole paper first.
Whenever possible, show your work and your thought process. This will make it easier for us to
give you partial credit.
Marks distribution:
Pre-mid 30
Caches 30
Parallelism 20
Total 80
i. By how much must we improve the CPI of FP instructions if we want the program to run
two times faster?
ii. By how much is the execution time of the program improved if the CPI of INT and FP instructions is reduced by 40% and the CPI of L/S and Branch instructions is reduced by 30%?
Answer i:
Total clock cycles = CPIfp × No. FP instr. + CPIint × No. INT instr. + CPIl/s × No. L/S instr. + CPIbranch × No. branch instr.
= 512 × 10^6
To make the program run two times faster we’d have to reduce its time to half, i.e., half the
number of clock cycles by improving the CPI of FP instructions:
CPIimproved fp × No. FP instr. + CPIint × No. INT instr. + CPIl/s × No. L/S instr. + CPIbranch × No. branch instr. = clock cycles/2
CPIimproved fp = (clock cycles/2 − (CPIint × No. INT instr. + CPIl/s × No. L/S instr. + CPIbranch × No. branch instr.)) / No. FP instr.
Solving for CPIimproved fp gives a negative number of clock cycles per floating-point instruction, which is absurd. Therefore, it is impossible to make the program run two times faster by reducing only the CPI of floating-point instructions.
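The per-class instruction counts and CPIs come from the question statement and are not reproduced above. The sketch below uses illustrative figures chosen only because they are consistent with the stated total of 512 × 10^6 cycles; with any figures matching that total, the impossibility argument can be checked numerically:

```python
# Hypothetical per-class figures (NOT the official givens) chosen so the
# total matches the stated 512e6 clock cycles.
counts = {"FP": 50e6, "INT": 110e6, "L/S": 80e6, "Branch": 16e6}
cpi    = {"FP": 1,    "INT": 1,     "L/S": 4,    "Branch": 2}

total = sum(counts[c] * cpi[c] for c in counts)   # 512e6 cycles
target = total / 2                                # 2x faster => half the cycles
non_fp = total - counts["FP"] * cpi["FP"]         # cycles spent outside FP

# CPI the FP instructions would need for the program to hit the target:
cpi_fp_needed = (target - non_fp) / counts["FP"]
print(total, cpi_fp_needed)  # the required CPI comes out negative => impossible
```

Because the non-FP cycles alone already exceed half the total, the required FP CPI is negative no matter what.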
Answer ii:
Original clock cycles = CPIfp × No. FP instr. + CPIint × No. INT instr. + CPIl/s × No. L/S instr. + CPIbranch × No. branch instr.
= 512 × 10^6
New clock cycles = CPIfp-new × No. FP instr. + CPIint-new × No. INT instr. + CPIl/s-new × No. L/S instr. + CPIbranch-new × No. branch instr.
= 342 × 10^6
New time TCPU = new clock cycles / clock rate = (342 × 10^6) / (2 × 10^9) = 0.171 s
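The final step can be checked directly from the cycle count and the 2 GHz clock rate stated above:

```python
new_cycles = 342e6            # new clock cycles after the CPI improvements
clock_rate = 2e9              # 2 GHz clock, as given
t_cpu = new_cycles / clock_rate
print(t_cpu)                  # 0.171 s
```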
Part b:
Assuming the RISC-V calling convention, what does the following code do? The function f2 is called with two arguments:
f2(int arr[], int N); // arr is an array of 4-byte integers
// N is the size of the array; always a positive even integer
f2:
srli a1, a1, 1   // divide the array size by 2 to get half of it
addi t0, x0, 0   // initialize i to zero
loop:
bge t0, a1, ret  // if loop finished, return
lw t2, 0(a0)     // t2 = arr[i]
slli t1, a1, 2   // t1 = half*4; byte offset for arr[i+half]
add t3, a0, t1   // t3 = &arr[i+half]
lw t1, 0(t3)     // t1 = arr[i+half]
sw t1, 0(a0)     // arr[i] = t1
sw t2, 0(t3)     // arr[i+half] = t2
addi t0, t0, 1   // i++
addi a0, a0, 4   // update &arr[i]
beq x0, x0, loop // loop back
ret:
jalr x0, 0(x1)   // return
Answer: The code swaps the first half of the array with the second half: for i = 0 … N/2−1 it exchanges arr[i] with arr[i+N/2].
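A Python model of the loop above (a sketch of what the assembly computes, assuming N is even as stated; the name f2 mirrors the assembly label):

```python
def f2(arr):
    """Mirror of the assembly: swap the first half of arr with the second half."""
    half = len(arr) >> 1      # srli a1, a1, 1
    for i in range(half):     # t0 counts i while i < half
        # the two lw/sw pairs exchange arr[i] and arr[i+half]
        arr[i], arr[i + half] = arr[i + half], arr[i]
    return arr

print(f2([1, 2, 3, 4, 5, 6]))  # [4, 5, 6, 1, 2, 3]
```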
Part c:
Translate the following C code to RISC-V assembly code. Use a minimum number of instructions.
Assume that the values of a, b, i, and j are in registers x5, x6, x7, and x29, respectively. Also,
assume that register x10 holds the base address of the array D, which is an array of unsigned
short integers.
Comment each line of your code to tell what you are doing.
Answer:
addi x7, x0, 0 // Init i = 0
LOOPI:
bge x7, x5, ENDI // While i < a
addi x30, x10, 0 // x30 = &D
addi x29, x0, 0 // Init j = 0
LOOPJ: bge x29, x6, ENDJ // While j < b
slli x28, x7, 1 // i*2 for short ints offset
add x28, x30, x28
lhu x28, 0(x28) // x28 = D[i]
add x31, x28, x29 // x31 = D[i]+j
slli x28, x29, 3 // (j*4)*2 for short ints offset
add x28, x30, x28
sh x31, 0(x28) // D[4*j] = x31
addi x29, x29, 1 // j++
jal x0, LOOPJ
ENDJ: addi x7, x7, 1 // i++;
jal x0, LOOPI
ENDI:
// next insn
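Reading the comments back into C-like form, the assembly implements, in effect, `for (i = 0; i < a; i++) for (j = 0; j < b; j++) D[4*j] = D[i] + j;` over unsigned shorts. A Python model under that reading (a sketch; the function name `run` is hypothetical):

```python
def run(D, a, b):
    """Model of the assembly; D holds unsigned 16-bit values."""
    for i in range(a):                      # outer loop over i
        for j in range(b):                  # inner loop over j
            # lhu reloads D[i] each iteration, so earlier stores are visible;
            # sh keeps only the low 16 bits, hence the mask.
            D[4 * j] = (D[i] + j) & 0xFFFF
    return D

print(run([10, 20, 30, 40, 50, 60, 70, 80], 2, 2))
```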
Question 2:
Part a (5.1):
The following code is written in C, where elements within the same row are stored contiguously.
Assume each word is a 64-bit integer.
A[I][J]: A[I][0], A[I][1], A[I][2], … are accessed successively in the inner loop. They are stored contiguously in memory.
Locality is affected by both the reference order and data layout. The same computation can also
be written below in Matlab, which differs from C in that it stores matrix elements within the
same column contiguously in memory.
for I=1:8
for J=1:8000
A(I,J)=B(I,0)+A(J,I);
end
end
A(J, I): A(1, I), A(2, I), A(3, I), … are accessed one after the other in the inner loop. They are contiguous in RAM since MATLAB stores 2D arrays column-wise.
B(I, 0): B(1, 0), B(2, 0), B(3, 0), … are accessed once per outer-loop iteration. They are also contiguous in RAM since MATLAB stores 2D arrays column-wise.
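The locality argument can be made concrete by computing the linear address (in units of elements) that each reference touches under each layout; the helper names below are hypothetical:

```python
def rowmajor(i, j, ncols):   # C layout: elements of a row are contiguous
    return i * ncols + j

def colmajor(i, j, nrows):   # MATLAB layout: elements of a column are contiguous
    return j * nrows + i

# Successive inner-loop accesses A[I][0], A[I][1], ... in C:
c_stride = rowmajor(0, 1, 8000) - rowmajor(0, 0, 8000)  # stride 1: good locality
# Successive inner-loop accesses A(1,I), A(2,I), ... in MATLAB:
m_stride = colmajor(1, 0, 8) - colmajor(0, 0, 8)        # stride 1: good locality
# For contrast, A[J][I] traversed in C order jumps a whole row each step:
bad = rowmajor(1, 0, 8000) - rowmajor(0, 0, 8000)       # stride 8000: poor locality
print(c_stride, m_stride, bad)
```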
Part b (5.2.3): Given the following addresses in binary (ignore the underscores) by a processor,
in the given order, which of the following cache configurations will result in the maximum hit
rate? Assume all caches are direct mapped and the total size of each cache is 32 bytes (8
words). For each case, assume initially cache is empty and all valid bits are zero. For each
option, indicate how many bits of the address will be used for the offset, index, and tag. For each of them, explain in a line or two how you reached this conclusion.
Block size = 1 word = 4 bytes → 2 least significant bits for byte offset within a block
Num blocks = total_words / block_size = 8/1 = 8 → next 3 bits for block index (2^3 = 8)
Tag = rest of the bits
Block size = 2 words = 8 bytes → 3 least significant bits for byte offset within a block
Num blocks = total_words / block_size = 8/2 = 4 → next 2 bits for block index (2^2 = 4)
Tag = rest of the bits
Block size = 4 words = 16 bytes → 4 least significant bits for byte offset within a block
Num blocks = total_words / block_size = 8/4 = 2 → next 1 bit for block index (2^1 = 2)
Tag = rest of the bits
Among these, option 2 has the highest hit rate, with 2 hits out of 12 accesses (100 × 2/12 = 16.67%); therefore it is the best configuration for this series of memory accesses.
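The offset/index/tag split for any of these configurations can be sketched with a small helper (the address trace itself is given in the question and not reproduced here; `split` is a hypothetical name):

```python
def split(addr, block_bytes, cache_bytes=32):
    """Split an address for a direct-mapped cache into (tag, index, offset)."""
    offset_bits = block_bytes.bit_length() - 1       # log2(block size in bytes)
    num_blocks = cache_bytes // block_bytes
    index_bits = num_blocks.bit_length() - 1         # log2(number of blocks)
    offset = addr & (block_bytes - 1)                # low offset_bits bits
    index = (addr >> offset_bits) & (num_blocks - 1) # next index_bits bits
    tag = addr >> (offset_bits + index_bits)         # everything above
    return tag, index, offset

print(split(0x34, block_bytes=8))  # (1, 2, 4): 3 offset bits, 2 index bits
```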
Part c:
The average memory access time (AMAT) for a microprocessor with 1 level of cache is 2.4 clock cycles
- If data is present and valid in the cache, it can be found in 1 clock cycle
- If data is not found in the cache, 80 clock cycles are needed to get it from off-chip memory
i. What is the miss rate for this L1 cache?
We must first determine the miss rate of the L1 cache to use in the revised AMAT formula:
AMAT = Hit Time + Miss Rate x Miss Penalty
2.4 = 1 + Miss Rate × 80
Miss Rate = 1.4/80 = 1.75%
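Numerically, rearranging the AMAT formula:

```python
hit_time = 1        # cycles on an L1 hit, as given
penalty = 80        # cycles to fetch from off-chip memory, as given
amat = 2.4          # given AMAT
miss_rate = (amat - hit_time) / penalty
print(miss_rate)    # 0.0175, i.e. 1.75%
```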
Designers are trying to improve the average memory access time to obtain a 65% improvement in
average memory access time, and are considering adding a 2nd level of cache on-chip.
- This second level of cache could be accessed in 6 clock cycles
- The addition of this cache does not affect the first level cache’s access patterns or hit times
- Off-chip accesses would still require 80 additional cycles.
ii. What should be the hit rate for this L2 cache in order to achieve the desired speedup?
The terminology in this question is ambiguous: "65% improvement" can mean a 65% speedup, reducing the time by 65%, or reducing the time to 65% of the original. All three interpretations are worked out below.
Answer 1:
Next, we can calculate the target AMAT … i.e. AMATwith L2:
Speedup = Time (old) / Time (new)
1.65 = 2.4 / Time (new)
Time (new) = 1.4545 clock cycles
We can then again use the AMAT formula to solve for the highest acceptable L2 miss rate:
1.4545 = 1 + 0.0175 × (6 + Miss RateL2 × 80)
Solving for Miss RateL2 gives a highest possible miss rate of ~24.96%.
Thus, as the hit rate is (1 − Miss Rate), the L2 hit rate must be at least ~75%.
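Checking the speedup interpretation numerically (a sketch):

```python
amat_old = 2.4
mr_l1 = 0.0175                       # L1 miss rate from part i
target = amat_old / 1.65             # AMAT needed for a 1.65x speedup
# target = 1 + mr_l1 * (6 + mr_l2 * 80)  =>  solve for mr_l2
mr_l2 = ((target - 1) / mr_l1 - 6) / 80
hit_l2 = 1 - mr_l2
print(mr_l2, hit_l2)                 # ~0.2497 and ~0.7503
```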
Answer 2:
New time = 65% of old time = 0.65 × 2.4 = 1.56
1.56 = 1 + (0.0175 × 6) + (Miss RateL2 × 80) ⇒ Miss RateL2 ≈ 0.57%, if the L2 miss rate is a percentage of total accesses
1.56 = 1 + (0.0175 × 6) + ((0.0175 × Miss RateL2) × 80) ⇒ Miss RateL2 = 32.5%, if the L2 miss rate is a percentage of L2 accesses
Answer 3:
New time = (1 − 65%) of old time = 0.35 × 2.4 = 0.84
0.84 = 1 + (0.0175 × 6) + (Miss RateL2 × 80) ⇒ Miss RateL2 ≈ −0.33%, if the L2 miss rate is a percentage of total accesses
0.84 = 1 + (0.0175 × 6) + ((0.0175 × Miss RateL2) × 80) ⇒ Miss RateL2 ≈ −18.9%, if the L2 miss rate is a percentage of L2 accesses
Both interpretations require a negative miss rate, so reducing the AMAT to 35% of the original is impossible with this L2 configuration.
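The remaining two interpretations can be checked the same way (a sketch; this solves the version where the L2 miss rate is a fraction of L2 accesses, and a negative result means the target AMAT is unattainable):

```python
mr_l1 = 0.0175   # L1 miss rate from part i

def mr_l2(target_amat):
    """L2 miss rate (fraction of L2 accesses) needed to reach target_amat."""
    # target = 1 + mr_l1 * (6 + mr_l2 * 80)
    return ((target_amat - 1) / mr_l1 - 6) / 80

print(mr_l2(0.65 * 2.4))   # time reduced TO 65%: ~0.325 (32.5%)
print(mr_l2(0.35 * 2.4))   # time reduced BY 65%: negative => impossible
```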
Question 3:
Part a: Your future self is taking a Parallel Processing course in your 7th semester. You are asked to write a very compute-intensive program. You can make 90% of the program parallel, with the remaining 10% being sequential.
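The sub-parts of this question are not reproduced above, but with a 90% parallel fraction they revolve around Amdahl's law; a sketch of the relevant computation (the function name is hypothetical):

```python
def speedup(p, n):
    """Amdahl's law: overall speedup when fraction p runs in parallel on n processors."""
    return 1 / ((1 - p) + p / n)

print(speedup(0.9, 10))    # ~5.26x with 10 processors
print(1 / (1 - 0.9))       # ~10x: the ceiling as n grows without bound
```

The 10% sequential portion caps the achievable speedup at 10x no matter how many processors are added.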
part b:
Briefly explain (a few lines each) the workings of SISD, SIMD, MIMD, and SPMD processors.
SISD (Single Instruction, Single Data): a single instruction stream operates on a single data stream; each instruction manipulates one data item, e.g.
add x1, x2, x3
SIMD (Single Instruction, Multiple Data): a single instruction operates on many data items at once, e.g.
vadd.vv v1, v2, v3
will add 16 pairs of integers simultaneously, assuming v1, v2, and v3 are 512 bits wide and the integers are 32 bits.
MIMD (Multiple Instruction, Multiple Data): multiple processors each execute their own instruction stream on their own data, as in a multicore chip where each core runs independently.
SPMD (Single Program, Multiple Data): the common way of programming MIMD machines; all processors run copies of the same program on different portions of the data, with conditionals letting different processors take different paths.