CS222 - COAL - SOLUTION - Final - Spring2023

GIK Institute of Engineering Sciences and Technology, Topi

Spring 2023 (FCSE) Final Exam


22nd May 2023, 02:30 pm – 05:30 pm

Course Code: CS222    Course Name: Computer Organization and Assembly Language

Instructor name: Taj Muhammad Khan    Section: AI

Vetter name: Vetter signature:

Student Name: Registration No:

• Read each question completely before answering it. There are 3 questions.
• In case of any ambiguity, you may make an assumption. Write it down, but your assumption must not contradict any statement in the question paper.
• Write the answer in the space below each question. Answers written in pencil will get zero marks.
• Do rough work in pencil and then write the final answer in pen. Use the last few pages for rough work.
• Read the whole paper first.
• Whenever possible, show your work and your thought process. This will make it easier for us to give you partial credit.

Time: 150 minutes. Max Marks: 80

Question       Total Marks    Obtained

Pre-mid        30

Caches         30

Parallelism    20

Total          80

Question 1 (10+10+10 marks):


Part a (1.15): Assume a program requires the execution of 50 × 10⁶ Floating Point instructions, 110 × 10⁶ Integer instructions, 80 × 10⁶ Load/Store instructions, and 16 × 10⁶ Branch instructions. The CPI for each type of instruction is 1, 1, 4, and 2, respectively. Assume that the processor has a 2 GHz clock rate.

i. By how much must we improve the CPI of FP instructions if we want the program to run two times faster?
ii. By how much is the execution time of the program improved if the CPI of INT and FP instructions is reduced by 40% and the CPI of L/S and Branch is reduced by 30%?

Answer i:
Total clock cycles = CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_L/S × No. L/S instr. + CPI_branch × No. branch instr.

= 1 × 50 × 10⁶ + 1 × 110 × 10⁶ + 4 × 80 × 10⁶ + 2 × 16 × 10⁶

= 512 × 10⁶

Time T_CPU = clock cycles / clock rate = (512 × 10⁶) / (2 × 10⁹) = 0.256 s

To make the program run two times faster we would have to halve its execution time, i.e., halve the number of clock cycles by improving the CPI of FP instructions:

CPI_improved-fp × No. FP instr. + CPI_int × No. INT instr. + CPI_L/S × No. L/S instr. + CPI_branch × No. branch instr. = clock cycles / 2

CPI_improved-fp = (clock cycles/2 − (CPI_int × No. INT instr. + CPI_L/S × No. L/S instr. + CPI_branch × No. branch instr.)) / No. FP instr.

CPI_improved-fp = (256 × 10⁶ − 462 × 10⁶) / (50 × 10⁶) = −4.12 < 0

The result requires a negative number of clock cycles per Floating Point instruction, which is absurd. Therefore, it is impossible to make the program run two times faster by reducing only the CPI of Floating Point instructions.

Answer ii:

Original clock cycles = CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_L/S × No. L/S instr. + CPI_branch × No. branch instr.

= 1 × 50 × 10⁶ + 1 × 110 × 10⁶ + 4 × 80 × 10⁶ + 2 × 16 × 10⁶

= 512 × 10⁶

Time T_CPU = clock cycles / clock rate = (512 × 10⁶) / (2 × 10⁹) = 0.256 s

New clock cycles = CPI_fp-new × No. FP instr. + CPI_int-new × No. INT instr. + CPI_L/S-new × No. L/S instr. + CPI_branch-new × No. branch instr.

= 0.6 × 1 × 50 × 10⁶ + 0.6 × 1 × 110 × 10⁶ + 0.7 × 4 × 80 × 10⁶ + 0.7 × 2 × 16 × 10⁶

= (30 + 66 + 224 + 22.4) × 10⁶

= 342.4 × 10⁶

New time T_CPU = new clock cycles / clock rate = (342.4 × 10⁶) / (2 × 10⁹) ≈ 0.171 s. The execution time thus improves from 0.256 s to about 0.171 s, a speedup of roughly 1.5×.
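As a quick sanity check (not part of the original paper), the arithmetic of parts i and ii can be reproduced with a short C program:

#include <stdio.h>

int main(void) {
    double n_fp = 50e6, n_int = 110e6, n_ls = 80e6, n_br = 16e6;
    double clock_rate = 2e9;                                 /* 2 GHz */
    double cycles = 1*n_fp + 1*n_int + 4*n_ls + 2*n_br;      /* 512e6 */
    double cycles_new = 0.6*1*n_fp + 0.6*1*n_int
                      + 0.7*4*n_ls + 0.7*2*n_br;             /* 342.4e6 */
    printf("old: %.4f s  new: %.4f s  speedup: %.2f\n",
           cycles/clock_rate, cycles_new/clock_rate, cycles/cycles_new);
    return 0;
}

It prints old: 0.2560 s, new: 0.1712 s, and a speedup of 1.50.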

Part b:
Assuming the RISC-V calling convention, what does the following code do? The function f2 is called with two arguments:

f2(int arr[], int N);  // arr is an array of 4-byte integers
                       // N is the size of the array; always a positive even integer

f2:
    srli a1, a1, 1     // half = N/2
    addi t0, x0, 0     // initialize i to zero

loop:
    bge  t0, a1, ret   // if i >= half, loop is finished: return
    lw   t2, 0(a0)     // t2 = arr[i]
    slli t1, a1, 2     // t1 = half*4; byte offset of arr[i+half] from arr[i]
    add  t3, a0, t1    // t3 = &arr[i+half]
    lw   t1, 0(t3)     // t1 = arr[i+half]
    sw   t1, 0(a0)     // arr[i] = arr[i+half]
    sw   t2, 0(t3)     // arr[i+half] = old arr[i]
    addi t0, t0, 1     // i++
    addi a0, a0, 4     // advance a0 to &arr[i]
    beq  x0, x0, loop  // loop back

ret:
    jalr x0, 0(x1)     // return

Answer:

It swaps the two halves of the array, i.e., given an array arr = {1, 2, 3, 4, 5, 6, 7, 8} and N = 8, it would make the array {5, 6, 7, 8, 1, 2, 3, 4}. In C:

void f2(int arr[], int N) {
    int half = N / 2;
    for (int i = 0; i < half; i++) {
        int tmp = arr[i];
        arr[i] = arr[i + half];
        arr[i + half] = tmp;
    }
}

See the comments in the assembly code above for details.
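A hypothetical test driver (not in the original paper) for the C version above; it simply verifies the swap on the example array:

#include <stdio.h>

void f2(int arr[], int N) {
    int half = N / 2;
    for (int i = 0; i < half; i++) {
        int tmp = arr[i];
        arr[i] = arr[i + half];
        arr[i + half] = tmp;
    }
}

int main(void) {
    int arr[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    f2(arr, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", arr[i]);   /* prints: 5 6 7 8 1 2 3 4 */
    printf("\n");
    return 0;
}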

Part c:

Translate the following C code to RISC-V assembly code. Use a minimum number of instructions.
Assume that the values of a, b, i, and j are in registers x5, x6, x7, and x29, respectively. Also,
assume that register x10 holds the base address of the array D, which is an array of unsigned
short integers.

for (i = 0; i < a; i++)
    for (j = 0; j < b; j++)
        D[4*j] = D[i] + j;

Comment each line of your code to tell what you are doing.

Answer:
addi x7, x0, 0 // Init i = 0
LOOPI:
bge x7, x5, ENDI // While i < a
addi x30, x10, 0 // x30 = &D
addi x29, x0, 0 // Init j = 0
LOOPJ: bge x29, x6, ENDJ // While j < b
slli x28, x7, 1 // i*2 for short ints offset
add x28, x30, x28 // x28 = &D[i]
lhu x28, 0(x28) // x28 = D[i]
add x31, x28, x29 // x31 = D[i]+j
slli x28, x29, 3 // (j*4)*2 for short ints offset
add x28, x30, x28 // x28 = &D[4*j]
sh x31, 0(x28) // D[4*j] = x31
addi x29, x29, 1 // j++
jal x0, LOOPJ
ENDJ: addi x7, x7, 1 // i++;
jal x0, LOOPI
ENDI:
// next insn

Question 2:
Part a (5.1):

The following code is written in C, where elements within the same row are stored contiguously.
Assume each word is a 64-bit integer.

for (I = 0; I < 8; I++)
    for (J = 0; J < 8000; J++)
        A[I][J] = B[I][0] + A[J][I];

i. How many 64-bit integers can be stored in a 16-byte cache block?

2. A 64-bit integer takes 8 bytes, so two fit in a 16-byte block.

ii. Which variable references exhibit temporal locality?

I: accessed repeatedly in both loops
J: accessed repeatedly in the inner loop
B[I][0]: accessed repeatedly in the inner loop

iii. Which variable references exhibit spatial locality?

A[I][J]: A[I][0], A[I][1], A[I][2], … are accessed successively in the inner loop. They are stored contiguously in memory.

Locality is affected by both the reference order and the data layout. The same computation can also be written in Matlab, as below; Matlab differs from C in that it stores matrix elements within the same column contiguously in memory.

for I=1:8
for J=1:8000
A(I,J)=B(I,0)+A(J,I);
end
end

iv. Which variable references exhibit temporal locality?

I: accessed repeatedly in both loops
J: accessed repeatedly in the inner loop
B(I, 0): accessed repeatedly in the inner loop

v. Which variable references exhibit spatial locality?

A(J, I): A(1, I), A(2, I), A(3, I), … are accessed one after another in the inner loop. They are contiguous in RAM because Matlab stores 2D arrays column-wise.

B(I, 0): B(1, 0), B(2, 0), B(3, 0), … are accessed one after another in the outer loop. They are contiguous in RAM because Matlab stores 2D arrays column-wise.

Part b (5.2.3): Given the following addresses in binary (ignore the underscores), issued by a processor in the given order, which of the following cache configurations will result in the maximum hit rate? Assume all caches are direct-mapped and the total size of each cache is 32 bytes (8 words). For each case, assume the cache is initially empty and all valid bits are zero. For each option, indicate how many bits of the address are used for the offset, index, and tag, and explain in a line or two how you reached this conclusion.

Option 1: a cache with block size of 1 word:

Block size = 1 word = 4 bytes → 2 least significant bits for byte offset within a block
Num blocks = total words / block size = 8/1 = 8 → next 3 bits for block index (2³ = 8)
Tag = the remaining bits

Address         Tag       Index  Offset  Hit or miss?
0000_0000_1100  0000_000  011    00      Miss
0010_1101_0000  0010_110  100    00      Miss
0000_1010_1100  0000_101  011    00      Miss
0000_0000_1000  0000_000  010    00      Miss
0010_1111_1100  0010_111  111    00      Miss
0001_0110_0000  0001_011  000    00      Miss
0010_1111_1000  0010_111  110    00      Miss
0000_0011_1000  0000_001  110    00      Miss
0010_1101_0100  0010_110  101    00      Miss
0000_1011_0000  0000_101  100    00      Miss
0010_1110_1000  0010_111  010    00      Miss
0011_1111_0100  0011_111  101    00      Miss

Option 2: a cache with block size of 2 words:

Block size = 2 words = 8 bytes → 3 least significant bits for byte offset within a block
Num blocks = total words / block size = 8/2 = 4 → next 2 bits for block index (2² = 4)
Tag = the remaining bits

Address         Tag       Index  Offset  Hit or miss?
0000_0000_1100  0000_000  01     100     Miss
0010_1101_0000  0010_110  10     000     Miss
0000_1010_1100  0000_101  01     100     Miss
0000_0000_1000  0000_000  01     000     Miss
0010_1111_1100  0010_111  11     100     Miss
0001_0110_0000  0001_011  00     000     Miss
0010_1111_1000  0010_111  11     000     Hit
0000_0011_1000  0000_001  11     000     Miss
0010_1101_0100  0010_110  10     100     Hit
0000_1011_0000  0000_101  10     000     Miss
0010_1110_1000  0010_111  01     000     Miss
0011_1111_0100  0011_111  10     100     Miss
Option 3: a cache with block size of 4 words:

Block size = 4 words = 16 bytes → 4 least significant bits for byte offset within a block
Num blocks = total words / block size = 8/4 = 2 → next 1 bit for block index (2¹ = 2)
Tag = the remaining bits

Address         Tag       Index  Offset  Hit or miss?
0000_0000_1100  0000_000  0      1100    Miss
0010_1101_0000  0010_110  1      0000    Miss
0000_1010_1100  0000_101  0      1100    Miss
0000_0000_1000  0000_000  0      1000    Miss
0010_1111_1100  0010_111  1      1100    Miss
0001_0110_0000  0001_011  0      0000    Miss
0010_1111_1000  0010_111  1      1000    Hit
0000_0011_1000  0000_001  1      1000    Miss
0010_1101_0100  0010_110  1      0100    Miss
0000_1011_0000  0000_101  1      0000    Miss
0010_1110_1000  0010_111  0      1000    Miss
0011_1111_0100  0011_111  1      0100    Miss

Of the above options, option 2 has the highest hit rate with 2 hits out of 12 accesses (100 × 2/12 ≈ 16.67%), so it is the best configuration for this sequence of memory accesses. A small simulator that reproduces these counts is sketched below.
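This is a minimal direct-mapped cache simulator in C (a sketch, not part of the exam); it replays the 12 accesses for each block size and counts the hits:

#include <stdio.h>

/* Replay the 12 accesses through a direct-mapped cache holding
   8 words in total, with the given block size in words. */
static int count_hits(int block_words) {
    unsigned addrs[12] = {              /* the 12 binary addresses in hex */
        0x00C, 0x2D0, 0x0AC, 0x008, 0x2FC, 0x160,
        0x2F8, 0x038, 0x2D4, 0x0B0, 0x2E8, 0x3F4
    };
    int num_blocks = 8 / block_words;
    int block_bytes = block_words * 4;
    unsigned tags[8];
    int valid[8] = {0};
    int hits = 0;
    for (int i = 0; i < 12; i++) {
        unsigned block_addr = addrs[i] / block_bytes;
        unsigned index = block_addr % num_blocks;
        unsigned tag = block_addr / num_blocks;
        if (valid[index] && tags[index] == tag) {
            hits++;                     /* valid block with matching tag */
        } else {
            valid[index] = 1;           /* miss: fill the block */
            tags[index] = tag;
        }
    }
    return hits;
}

int main(void) {
    for (int bw = 1; bw <= 4; bw *= 2)
        printf("block size = %d word(s): %d hit(s)\n", bw, count_hits(bw));
    return 0;
}

It prints 0, 2, and 1 hits for options 1, 2, and 3 respectively, matching the tables above.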

Part c:
The average memory access time (AMAT) for a microprocessor with one level of cache is 2.4 clock cycles.
- If the data is present and valid in the cache, it can be found in 1 clock cycle.
- If the data is not found in the cache, 80 clock cycles are needed to fetch it from off-chip memory.
i. What is the miss rate for this L1 cache?

We must first determine the miss rate of the L1 cache using the AMAT formula:
AMAT = Hit Time + Miss Rate × Miss Penalty
2.4 = 1 + Miss Rate × 80
Miss Rate = 1.4/80 = 1.75%

Designers are trying to achieve a 65% improvement in average memory access time, and are considering adding a 2nd level of cache on-chip.
- This second level of cache could be accessed in 6 clock cycles.
- The addition of this cache does not affect the first-level cache's access patterns or hit times.
- Off-chip accesses would still require 80 additional cycles.
ii. What should be the hit rate for this L2 cache in order to achieve the desired speedup?

There was confusion about the terminology in this question, i.e., whether a 65% improvement means a 65% speedup, reducing the time by 65%, or reducing the time to 65%, and whether the L2 miss rate is a percentage of total accesses or a percentage of L2 accesses, so different students gave different answers.

I am therefore awarding full marks to everyone for this question.

Answer 1:
Next, we calculate the target AMAT, i.e., the AMAT with L2:
Speedup = Time(old) / Time(new)
1.65 = 2.4 / Time(new)
Time(new) = 1.4545 clock cycles
We can then again use the AMAT formula to solve for the highest acceptable L2 miss rate:
1.4545 = 1 + 0.0175 × (6 + Miss Rate_L2 × 80)
Solving for Miss Rate_L2 gives a highest possible miss rate of ~24.96%.
Thus, as the hit rate is (1 − Miss Rate), the L2 hit rate must be at least ~75%.
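A short C sketch of Answer 1's algebra (the variable names are mine, not from the paper):

#include <stdio.h>

int main(void) {
    double amat_old = 1 + 0.0175 * 80;   /* = 2.4 cycles */
    double target = amat_old / 1.65;     /* = 1.4545 cycles */
    /* AMAT = 1 + 0.0175 * (6 + miss_l2 * 80); solve for miss_l2 */
    double miss_l2 = ((target - 1) / 0.0175 - 6) / 80;
    printf("max L2 miss rate: %.2f%%, min hit rate: %.2f%%\n",
           100 * miss_l2, 100 * (1 - miss_l2));   /* ~25% and ~75% */
    return 0;
}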

Answer 2:
New time = 65% of old time = 0.65 × 2.4 = 1.56

1.56 = 1 + (0.0175 × 6) + (Miss Rate_L2 × 80)  ⇒  Miss Rate_L2 ≈ 0.57%  (if the L2 miss rate is a percentage of total accesses)
1.56 = 1 + (0.0175 × 6) + ((0.0175 × Miss Rate_L2) × 80)  ⇒  Miss Rate_L2 = 32.5%  (if the L2 miss rate is a percentage of L2 accesses)

Answer 3:
New time = (1 − 65%) of old time = 0.35 × 2.4 = 0.84

0.84 = 1 + (0.0175 × 6) + (Miss Rate_L2 × 80)  ⇒  Miss Rate_L2 = −0.33%  (if the L2 miss rate is a percentage of total accesses)
0.84 = 1 + (0.0175 × 6) + ((0.0175 × Miss Rate_L2) × 80)  ⇒  Miss Rate_L2 = −18.9%  (if the L2 miss rate is a percentage of L2 accesses)

Both miss rates come out negative, so under this interpretation the target is unachievable: even a perfect L2 cannot bring the AMAT below 1 + 0.0175 × 6 = 1.105 cycles.

Question 3:

Part a: Your future self is taking the Parallel Processing course in your 7th semester. You are asked to write a very compute-intensive program. You can make 90% of the program parallel, with 10% of it being sequential.
What speedup can you get on 10 processors?

This is solved using Amdahl's Law:

S_p = 1 / (f + (1 − f)/p) = 1 / (0.1 + 0.9/10) = 1 / 0.19 ≈ 5.26

where f = the non-parallelizable portion of the program and p = the number of processors.

What would be the maximum speedup on an infinite number of processors?

max S_p = 1 / (0.1 + 0.9/∞) = 1 / 0.1 = 10
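A small C sketch of Amdahl's Law (the helper name speedup is mine, not from the paper):

#include <stdio.h>

/* Amdahl's Law: f = serial (non-parallelizable) fraction, p = processors */
static double speedup(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    printf("10 processors: %.2f\n", speedup(0.1, 10));    /* 5.26 */
    printf("p -> infinity: %.2f\n", speedup(0.1, 1e12));  /* approaches 10 */
    return 0;
}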

part b:

Briefly explain (a few lines each) the workings of SISD, SIMD, MIMD, and SPMD processors.

SISD (Single Instruction, Single Data):

A single instruction stream operates on individual pieces of data. It is the classic case, e.g., an add instruction adds two integers:

add x1, x2, x3

Here the instructions of a single program's instruction stream manipulate individual data items.

SIMD (Single Instruction, Multiple Data):

A single instruction stream operates on multiple pieces of data, e.g., a single add instruction adds multiple pairs of integers. The data is loaded into SIMD or vector registers, which are wide enough to store multiple data items in the same register. So when an add instruction adds two such registers, it is in fact adding multiple integers at the same time.

vadd.vv v1, v2, v3

will add 16 pairs of integers simultaneously, assuming v1, v2, and v3 are 512 bits wide and the integers are 32 bits.

MIMD (Multiple Instruction, Multiple Data):

Here multiple instruction streams are executed simultaneously, each with its own data. A typical example is a multicore processor running multiple programs simultaneously, one on each core, with each program manipulating its own data.

SPMD (Single Program, Multiple Data):

Like MIMD above, there are multiple cores, but these cores run instructions from the same program, with each core manipulating different data. A simple example is a program that calculates the sum of the elements of a very large array by running the same summation code in parallel on multiple cores, each core summing a part of the array. On a quad-core system, the array can be divided into 4 portions; each core runs the summation code on its quarter of the array, producing a partial sum, and the partial sums are later added to get the total, as in the sketch below.
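A minimal SPMD-style sketch in C with OpenMP (an assumption on my part; the exam does not mention OpenMP). Every thread executes the same loop body on its own slice of the array, and the reduction combines the partial sums:

#include <stdio.h>
#include <omp.h>   /* compile with -fopenmp */

int main(void) {
    enum { N = 1000000 };
    static long a[N];
    for (long i = 0; i < N; i++)
        a[i] = 1;                     /* fill with known values */

    long sum = 0;
    /* Each thread sums its own chunk; partial sums are combined. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %ld\n", sum);       /* prints 1000000 */
    return 0;
}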
