HPCA Endsem SPR 2024

The document is an end-semester examination paper for the High-Performance Computer Architecture course at IIT Kharagpur, detailing instructions for two sections of students. It includes various questions related to microprocessor architecture, cache coherence, and RISC-V assembly code, requiring students to provide concise answers and complete tables based on given scenarios. The exam covers theoretical concepts and practical applications in computer architecture, with a total of 100 marks and a duration of 3 hours.


Indian Institute of Technology Kharagpur

Department of Computer Science and Engineering


End-semester Examination, Spring 2023-24
High Performance Computer Architecture (CS60003)
Students: 167 Full Marks: 100 Time: 3 hours

INSTRUCTIONS: This question paper has two parts: PART-I for Section-1 students, and PART-II
for Section-2 students. ONLY ATTEMPT THE APPROPRIATE PART OF THE QUESTION
PAPER. This test is closed-book and closed-notes. Calculators are allowed.

PART-I [for Section-1]


ANSWER ALL QUESTIONS

1. Give brief answers (two-three sentences maximum) to each of the following questions:
(a) Consider a single-core microprocessor with two levels of on-chip cache. Derive the overall Average
Memory Access Time (AMAT) of the cache system as a function of the Hit Time, Miss Penalty, Hit
Rate and Miss Rate of the two levels of the cache hierarchy (see the hint after this question). [3]
(b) Why is a Critical Word First strategy more effective in a cache with relatively large block size? [3]
(c) What policies are adopted by microprocessors that perform out-of-order execution (with or without
speculation) to enable precise exceptions? [3]
(d) Intel and AMD processors have a CISC ISA, but internally convert CISC instructions through a
hardware layer to RISC-type instructions before executing them. Why? [3]
(e) In the latest high-throughput microprocessors, the Reorder Buffer (ROB) has decreased in significance,
and only buffers control information. Why? [3]
(f) Explain the concept of macro-op fusion adopted in modern high-performance microprocessors. [3]
(g) Explain why it is rare to find an issue width greater than four in modern microprocessors. [3]
(h) Give two reasons why increasing microprocessor clock frequency to increase throughput has fallen out of
favour over the last two decades. [3]
(i) Explain why it is particularly challenging to accurately predict cache write latencies (even when there
has been a hit) in a modern multi-core microprocessor. [3]
(j) Distinguish between local node, home node, and remote node in the context of a directory-based cache
coherence protocol. [3]
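
Hint for question 1(a): one standard derivation treats the L1 miss penalty as the AMAT of the L2 cache
itself (a sketch, not the only acceptable form):

    AMAT = HitTime_L1 + MissRate_L1 x (HitTime_L2 + MissRate_L2 x MissPenalty_L2)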
2. Consider the following RISC-V code fragment:

addi x4,x1,#800 ; x4 = upper bound for X


foo: fld F2,0(x1) ; (F2) = X(i)
fmul.d F4,F2,F0 ; (F4) = a*X(i)
fld F6,0(x2) ; (F6) = Y(i)
fadd.d F6,F4,F6 ; (F6) = a*X(i) + Y(i)
fsd F6,0(x2) ; Y(i) = a*X(i) + Y(i)
addi x1,x1,#8 ; increment X index
addi x2,x2,#8 ; increment Y index
sltu x3,x1,x4 ; set x3 to 1 if x1 < x4
bnez x3,foo ; loop if needed, nothing to write to CDB

Consider a processor with the 64-bit RISC-V ISA implementing Tomasulo’s algorithm without speculation,
with the specifications of the Functional Units (FUs) as shown in Fig. 1, executing the above code
snippet. Assume the following:
• Functional units are not pipelined.

Figure 1: Functional Unit specifications in a processor with Tomasulo’s scheme.

• There is no forwarding between functional units; results are communicated over the Common Data Bus
(CDB).
• The execution stage does both the effective address calculation and the memory access for loads and
stores.
• Loads require one clock cycle.
• Issue and write-back of a result each require one clock cycle.
• There are five load buffer slots and five store buffer slots.
• The Branch on Not Equal to Zero (bnez) instruction requires one clock cycle to execute. Since there is
no speculation, a perfect branch predictor allows the first instruction of the next iteration to be issued
in the clock cycle immediately after bnez is issued, but that instruction waits for one clock cycle after
issue in its reservation station while bnez executes.
• bnez does not write to the CDB, and hence does not consume a clock cycle for “Write CDB”.

Complete the following table for the first three iterations of the loop. You may ignore the addi
instruction before the loop starts. The execution of the first two instructions of the first iteration of the loop
has been shown for your convenience in Fig. 2. Note that the Executes/memory entry indicates the start of
execution. [15]

Figure 2: Example table entries for a processor with Tomasulo’s scheme.

3. Suppose we have a 96-core future-generation processor, but on average only 54 cores can be busy. Suppose
that 90% of the time we can use all available cores; 9% of the time we can use 50 cores; and 1% of the time
execution is strictly serial.

(a) How much speedup might we expect from the above microprocessor? (See the hint after this question.) [5]
(b) Now assume that cores of the processor in part (a) can be turned off when not in use. How would the
multi-core speedup compare to that of a 24-core version that can use all of its processors 99% of the
time? [2]
(c) Explain (in brief) the advantage of the MESI protocol over the MSI protocol, in the context of ensuring
cache coherence. [3]
(d) Explain, with RISC-V assembly language code snippet, how a traditional spin lock based technique to
solve the critical section problem should be modified to take advantage of a coherent cache system. You
may assume the existence of an (effectively) atomic register↔memory swap operation denoted by EXCH.
[5]
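
Hint for part (a): the standard Amdahl-style accounting, assuming the three execution modes use 54, 50,
and 1 core(s) respectively, gives

    Speedup = 1 / (0.90/54 + 0.09/50 + 0.01/1) ≈ 35.1

Hint for part (d): a minimal test-and-test-and-set sketch of the kind of answer expected, which spins on
the locally cached copy of the lock and issues EXCH only when the lock appears free (register choices are
illustrative):

lockit: ld   x5,0(x1)   ; read lock from the (coherent) cached copy
        bnez x5,lockit  ; lock is held: spin locally, no bus traffic
        addi x5,x0,#1   ; x5 = 1 (the “locked” value)
        EXCH x5,0(x1)   ; atomically swap x5 with the lock variable
        bnez x5,lockit  ; old value was 1: someone else won; retry
        ...             ; critical section
        sd   x0,0(x1)   ; release the lock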

4. (a) Suppose we have an application running on a 100-core microprocessor, and assume that the application can
use 1, 50, or 100 cores. If we assume that 90% of the time we can use all 100 cores, how much of the
remaining 10% of the execution time must employ 50 cores if we want a speedup of 75? [3]
(b) Consider an 8-core microprocessor where each processor has its private L1 and L2 caches, and snooping
is performed on a shared bus among the L2 caches. Assume the average L2 request, whether for a
coherence miss or any other miss, takes 12 cycles. Assume a clock rate of 2.5 GHz, a CPI of 0.75, and a
load/store frequency of 45%. If our goal is that no more than 50% of the L2 bandwidth is consumed by
coherence traffic, what is the maximum coherence miss rate allowable per processor? [3]
(c) Consider a 64-processor CC-NUMA (Cache Coherent – Non-Uniform Memory Access) computer system.
Each processor has a single-level 128 KB on-chip cache. The sharing among the processors is at the
block-level, and the block size is 64 bytes. The size of the main memory connected to each node is 2 GB.
A directory-based cache coherence protocol is being used. Determine (i) the length of each directory
entry (in bits); (ii) the total space occupied by directories in the entire system. [2 + 2 = 4]
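Hint for part (c): with a full-bit-vector directory there is one entry per memory block of the home node.
Each node has 2 GB / 64 B = 2^31 / 2^6 = 2^25 blocks; assuming a 64-bit presence vector (one bit per
processor) plus 2 state bits (one common choice, adjust to your protocol), each entry is 66 bits, and the
entire system holds 64 × 2^25 × 66 bits ≈ 16.5 GB of directory state.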
5. Consider a microprocessor with two levels of on-chip data cache. The L1 cache is “Virtually Indexed,
Physically Tagged” (VIPT), direct-mapped, and holds 8 KB of data. The L2 cache is direct-mapped and holds
4 MB of data. Both L1 and L2 caches use 64-byte blocks. The page size is 8 KB. The Translation Lookaside
Buffer (TLB) is direct-mapped with 256 entries. Each Virtual Address is 64 bits long, and each Physical
Address is 41 bits long.
(a) Draw a block diagram depicting the microarchitecture of this data cache, clearly depicting the width of
each field in bits. [6]
(b) Determine the total size of the TLB, L1 cache, and L2 cache on the system. Assume each L1 entry is
accompanied by 3 bits of metadata, each L2 entry is accompanied by 4 bits of metadata, and each TLB
entry is accompanied by 5 bits of metadata. [6]
(c) Distinguish between true coherence miss and false coherence miss in the private cache of a multi-core
microprocessor, with a simple example. [4]
(d) Inspired by the success of the “critical-word-first” techniques in reducing the L1 cache miss penalty, the
designers for a microprocessor are considering the use of these techniques for the L2 cache. Assume that
the microprocessor is being planned to have a single 1 MB L2 cache with 64-byte blocks. Further assume
that the L2 cache can be written with 16 bytes every 4 processor cycles. The time to receive the first
16-bytes from the memory controller is 100 cycles, and transfer of each additional 16 bytes from the
main memory requires 16 cycles (excluding the 4 cycles required to write the 16 bytes to the L2 cache).
For each of the following two scenarios, determine how many cycles it would take to service an L2 cache
miss: (i) without using critical-word-first, and (ii) with the critical-word-first technique. [2 + 2 = 4]
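
Hint for part (d): one plausible accounting, assuming each 4-cycle L2 write overlaps the transfer of the
following 16-byte chunk. The four chunks of a 64-byte block arrive at cycles 100, 116, 132 and 148.
(i) Without critical-word-first, the miss completes when the last chunk has been written: 148 + 4 = 152
cycles. (ii) With critical-word-first, the requested chunk is transferred first and the miss can be serviced
as soon as it is written: 100 + 4 = 104 cycles, with the rest of the block filling in behind it.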
6. Suppose IIT-KGP’s ERP software runs on an 8-core RISC-V microprocessor-based computer. The receivables
of IIT-KGP consist of a large number of grants, and also the student fees. In the ERP software, a thread is
spawned whenever a receivable transaction is initiated. A thread corresponding to a receivable transaction
updates a shared variable balance. For updating balance with the amount received, each thread essentially
carries out the following operation:
balance = balance + amount; // You need to synchronize access to balance!
Assume that the variable balance is shared among the threads and its memory location is given by 0(x1).
The received amount is in register x2.
(a) Explain the operations of the Load Reserved (LR) and Store Conditional (SC) instructions of the RISC-V
Instruction Set Architecture. [4]
(b) Write a RISC-V code fragment that can (effectively) atomically perform the above operation, treating
it as a Critical Section Problem. Clearly mark the instructions corresponding to the entry
section, the critical section, and the exit section. [6]
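
Hint for part (b): a minimal LR/SC retry loop of the kind of answer expected, assuming the A-extension
operand order sc.d rd,rs2,(rs1), with rd set to 0 on success and non-zero on failure:

try: lr.d x5,(x1)    ; entry section: load-reserve balance
     add  x5,x5,x2   ; critical section: balance + amount
     sc.d x6,x5,(x1) ; exit section: store-conditionally write back
     bnez x6,try     ; exit section: reservation lost, retry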

——————————————- END OF PART-I —————————————————
