1.a 1
1.b
1.c
1.d
1.e
1.f
1.g
1.h
1.i
1.j
2.a 5
Marking Scheme: 2+3
2.b 5
3.a 5
Marking Scheme: 1+4
3.b 5
4.a 5
Reducing the miss rate in a cache system is critical for improving the performance of computer
architectures. The miss rate is the fraction of memory accesses that result in cache misses.
Various optimization techniques are used to reduce the miss rate, and these can be broadly
categorized into cache design strategies and programming optimizations.
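For illustration (numbers assumed), the effect of the miss rate can be quantified with the average memory access time: AMAT = hit time + miss rate × miss penalty. With a 1 ns hit time, a 100 ns miss penalty, and a 5% miss rate, AMAT = 1 + 0.05 × 100 = 6 ns; halving the miss rate to 2.5% brings AMAT down to 1 + 0.025 × 100 = 3.5 ns.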
1. Larger Cache Size
A larger cache can hold more data, which reduces capacity misses because it is less likely for a
required data block to be evicted due to lack of space. However, increasing cache size may lead
to higher access times and cost.
Example:
• If a 64 KB cache experiences frequent misses due to limited capacity, upgrading it to 128 KB
can reduce the miss rate.
2. Higher Associativity
Caches can be classified as direct-mapped, set-associative, or fully associative. Higher
associativity reduces conflict misses by allowing several blocks that map to the same set to
reside in the cache at the same time.
Example:
• A 4-way set-associative cache reduces conflict misses compared to a direct-mapped cache
because multiple blocks can coexist in the same set.
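For illustration (sizes assumed): a 32 KB cache with 64-byte lines has 512 lines. Direct-mapped, that gives 512 sets, so any two blocks whose addresses are a multiple of 32 KB apart map to the same line and repeatedly evict each other. Organized as 4-way set-associative, the same 512 lines form 128 sets of 4 lines each, so up to four such conflicting blocks can be resident at the same time.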
3. Victim Cache
A small, fully-associative victim cache stores blocks evicted from the main cache. This reduces
conflict misses by giving recently evicted blocks a second chance.
Example:
• A system using an 8-entry victim cache alongside a direct-mapped cache can catch frequently
re-accessed blocks and reduce misses.
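A minimal C sketch of the lookup order with a victim cache (the structures and sizes here are illustrative, not a real hardware interface):
typedef struct { int valid; unsigned tag; /* data omitted */ } Line;

Line main_cache[1024];   /* direct-mapped main cache */
Line victim_cache[8];    /* small, fully associative victim cache */

int lookup(unsigned tag, unsigned index)
{
    if (main_cache[index].valid && main_cache[index].tag == tag)
        return 1;                                 /* main-cache hit */
    for (int i = 0; i < 8; i++)
        if (victim_cache[i].valid && victim_cache[i].tag == tag) {
            Line evicted = main_cache[index];     /* swap the blocks */
            main_cache[index] = victim_cache[i];  /* promote the victim */
            victim_cache[i] = evicted;            /* keep the old block around */
            return 1;                             /* victim-cache hit */
        }
    return 0;                                     /* miss: fetch from next level */
}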
4. Prefetching
Prefetching fetches data into the cache before it is needed, based on predictable access
patterns. Hardware and software prefetching are commonly used.
Example:
• For a loop that accesses array elements sequentially, a hardware prefetcher can predict the
next memory accesses and bring them into the cache early.
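A software-prefetching sketch, assuming GCC or Clang where the __builtin_prefetch builtin is available; the prefetch distance DIST is illustrative and would be tuned to the memory latency:
#define DIST 16   /* illustrative prefetch distance, in elements */

double sum_array(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0, 1);  /* read access, low reuse hint */
        sum += a[i];   /* by the time a[i] is used, it was prefetched DIST iterations ago */
    }
    return sum;
}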
5. Cache Line Size Optimization
Increasing the cache line size can reduce compulsory misses by fetching more adjacent data on
each miss. However, larger lines increase the miss penalty and can waste bandwidth and cache
space when the extra data is not used.
Example:
• A 64-byte cache line may reduce compulsory misses in applications with spatial locality
compared to a 32-byte cache line.
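For illustration (sizes assumed): scanning an array of 4-byte integers sequentially, a 32-byte line brings in 8 elements per miss (about one compulsory miss every 8 accesses), while a 64-byte line brings in 16 elements per miss (about one every 16 accesses), roughly halving compulsory misses for this access pattern.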
6. Multi-level Caches
Using multiple cache levels (L1, L2, L3) helps reduce misses. Data that cannot fit in the L1 cache
can often be found in L2 or L3, reducing main memory accesses.
Example:
• Modern processors often use a small, fast L1 cache for frequently accessed data and larger
L2/L3 caches for less frequently accessed data.
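With illustrative numbers: an L1 hit time of 1 ns, a 5% L1 miss rate, an L2 hit time of 10 ns, a 20% local L2 miss rate, and 100 ns to main memory give AMAT = 1 + 0.05 × (10 + 0.20 × 100) = 2.5 ns, compared to 1 + 0.05 × 100 = 6 ns with no L2.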
7. Compiler Optimizations
Compilers can reorganize code to improve data locality, reducing misses.
Techniques:
• Loop Interchange: Reordering nested loops to improve spatial locality.
• Blocking (Tiling): Breaking large data sets into smaller blocks that fit in the cache (a sketch follows the code below).
Example:
// Without optimization: j is the outer loop, so consecutive inner-loop
// iterations access A[i][j] column by column, jumping a full row ahead each
// time (poor spatial locality for row-major C arrays):
for (j = 0; j < M; j++) {
    for (i = 0; i < N; i++) {
        A[i][j] = B[i][j] + C[i][j];
    }
}
// With loop interchange (better cache utilization): j is now the inner loop,
// so accesses within each row are sequential and reuse the same cache line:
for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
        A[i][j] = B[i][j] + C[i][j];
    }
}
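A minimal sketch of blocking (tiling) in the same style, using matrix multiplication; it assumes C99 variable-length array parameters, and the tile size B_SIZE is illustrative and would be tuned to the actual cache size:
#define B_SIZE 32   /* illustrative tile size */

/* Blocked (tiled) matrix multiplication: each B_SIZE x B_SIZE tile is reused
   while it is still resident in the cache, instead of streaming whole rows
   and columns through it. */
void matmul_blocked(int n, double A[n][n], double B[n][n], double C[n][n])
{
    for (int ii = 0; ii < n; ii += B_SIZE)
        for (int jj = 0; jj < n; jj += B_SIZE)
            for (int kk = 0; kk < n; kk += B_SIZE)
                for (int i = ii; i < ii + B_SIZE && i < n; i++)
                    for (int j = jj; j < jj + B_SIZE && j < n; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + B_SIZE && k < n; k++)
                            sum += A[i][k] * B[k][j];
                        C[i][j] = sum;
                    }
}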
8. Reducing Cache Pollution
Reducing unnecessary data loads into the cache can lower misses. Techniques like selective
caching or bypassing less useful data help achieve this.
Example:
• Streaming data that is unlikely to be reused can bypass the cache to prevent evicting useful
data.
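A sketch of cache bypassing with non-temporal (streaming) stores, assuming an x86 target with SSE2 and a 16-byte-aligned destination buffer:
#include <stddef.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Fill a large, write-only buffer with streaming stores so the written data
   bypasses the cache instead of evicting lines that are still useful. */
void fill_stream(int *dst, int value, size_t n)
{
    __m128i v = _mm_set1_epi32(value);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_stream_si128((__m128i *)&dst[i], v);  /* non-temporal store of 4 ints */
    for (; i < n; i++)
        dst[i] = value;                           /* scalar tail */
    _mm_sfence();                                 /* order the streaming stores */
}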
9. Cache Partitioning
Partitioning the cache among different threads or cores reduces contention and conflict misses
in multi-threaded environments.
Example:
• In a multi-core processor, assigning a portion of the cache to each core avoids frequent cache
evictions caused by other cores.
10. Software-Controlled Caches
Explicit control over cache behavior, such as cache hints in software, can help optimize data
placement and reduce misses.
Example:
• Programming models like CUDA allow developers to explicitly manage shared memory and
cache in GPU programming, reducing miss rates.
By combining these strategies effectively, system designers and programmers can significantly
reduce cache miss rates, leading to improved overall performance.
4.b 5
Marking Scheme: 2+2=1
5.a 5
5.b 5
6.a 5
Marking Scheme: Direct Mapped: 2.5 marks + Set Associative: 2.5 marks
6.b 5
Marking Scheme: WAW: 2.5 + WAR: 2.5 Marks
WAW Resolution: Register renaming through the register status table ensures that only the
most recent instruction to write a register actually updates it, so an earlier instruction
that finishes late cannot overwrite the newer value.
WAR Resolution: Source operands are copied (or tagged) into the reservation station at issue
time, so a later instruction's write cannot destroy a value that an earlier instruction still
needs to read.
The combination of reservation stations, the register status table, and the CDB dynamically
schedules instructions and tracks their dependencies, resolving both hazards without stalling
the pipeline.
1. WAW Hazard
● Tomasulo's approach allows instructions to issue and execute out of order.
● When an instruction issues, the register status entry of its destination register is
overwritten with the tag of that instruction's reservation station. A result broadcast on the
CDB updates a register only if the broadcasting tag still matches, so an earlier instruction
that completes late cannot overwrite the value of a later instruction writing the same register.
Example:
I1: ADD R1, R2, R3 # Writes result to R1
I2: MUL R1, R4, R5 # Also writes to R1
In Tomasulo's approach,
● Both instructions are issued to their respective reservation stations; once MUL (I2) issues,
the register status for R1 points to MUL's station.
● When ADD (I1) completes execution, it broadcasts its result on the CDB, but the register file
ignores it for R1 because the tag no longer matches.
● R1 therefore ends up with the result of MUL (I2) regardless of the order in which the two
instructions finish, which ensures the correctness of the final value in R1.
2. Write After Read (WAR) Hazard
● Tomasulo's algorithm removes WAR hazards by capturing source operands in the reservation
stations at issue time.
● Each source operand is recorded either as a value (if already available) or as the tag of the
station that will produce it, so a later instruction that writes one of those registers cannot
disturb the operand the earlier instruction will use.
Example:
I1: ADD R1, R2, R3 # Reads R2, R3
I2: MUL R2, R4, R5 # Writes to R2
ADD (I1) is issued to a reservation station, which captures the current values (or producer
tags) of R2 and R3.
MUL (I2) can then write R2 whenever it completes; ADD still computes with the copy of R2
captured at issue, so the earlier read is never corrupted. This is managed by the reservation
station holding its own copies of the operands rather than re-reading R2 later.
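The mechanism behind both resolutions can be sketched in C. This is a conceptual model only, with illustrative field names following the usual Vj/Vk/Qj/Qk convention, not an actual hardware description:
/* Minimal reservation-station entry (illustrative field names). */
typedef struct {
    int busy;        /* entry in use */
    int op;          /* operation to perform */
    int Vj, Vk;      /* captured operand values (valid when Qj/Qk == 0) */
    int Qj, Qk;      /* tags of producing stations; 0 means value already captured */
} RS;

int reg_value[32];   /* architectural register file */
int reg_status[32];  /* tag of the station that will write each register; 0 = none */

/* Issue step: capture each source operand as either a value or a producer tag,
   then claim the destination register by overwriting its status with this
   station's tag. Copying operands here removes WAR hazards; overwriting the
   destination tag removes WAW hazards. */
void issue(RS *rs, int my_tag, int op, int src1, int src2, int dst)
{
    rs->busy = 1;
    rs->op = op;
    if (reg_status[src1] == 0) { rs->Vj = reg_value[src1]; rs->Qj = 0; }
    else                       { rs->Qj = reg_status[src1]; }
    if (reg_status[src2] == 0) { rs->Vk = reg_value[src2]; rs->Qk = 0; }
    else                       { rs->Qk = reg_status[src2]; }
    reg_status[dst] = my_tag;  /* a later writer of dst simply replaces this tag */
}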
7.a 5
Marking Scheme: 3 marks for split cache + 2 marks for unified cache.
7.b 5
Marking Scheme: Best of the two answers will be awarded out of 5 marks. If only one option is
attempted, it is also marked out of 5 marks.
(i) Vector processor Architecture:
In computing, a vector processor is a central processing unit (CPU) whose instruction set is
designed to operate efficiently on large one-dimensional arrays of data called vectors. This is
in contrast to scalar processors, whose instructions operate on single data items only, even
when those scalar processors also provide single instruction, multiple data (SIMD) or
SIMD-within-a-register (SWAR) arithmetic units.
For specific kinds of computing applications, vector processing performs very well. Its key
features are as follows:
1. Simultaneous operations: This is achieved through the use of specialized
hardware that can process multiple data elements in parallel.
2. High performance: Vector processing can achieve high performance by
exploiting data parallelism and reducing memory access. This means that
vector processors can perform computations faster than traditional
processors, particularly for tasks that involve repeated operations on large
datasets.
3. Scalability: Vector processors can scale up to handle larger datasets without
sacrificing performance.
4. Limited instruction set: Vector processors have a limited instruction set
that’s optimized for numerical computations.
5. Data alignment: Vector processors require data to be aligned in memory to
achieve optimal performance. This means the data must be stored in
contiguous memory locations so that the processor can access it efficiently.
Vector processing can deliver higher performance than a traditional scalar CPU for such
workloads because each instruction operates on many data elements at once, which is especially
valuable in graphics and other data-intensive use cases. There are two main types of vector
processing: SIMD and MIMD.
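As a small illustration of the data parallelism described above, a sketch assuming an x86 target with SSE intrinsics; unaligned loads are used here, though aligned data (point 5 above) with _mm_load_ps would be more efficient:
#include <xmmintrin.h>   /* SSE intrinsics */

/* Add two float arrays four elements at a time: one vector instruction
   performs four scalar additions. */
void vec_add(float *a, const float *b, const float *c, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_loadu_ps(&c[i]);
        _mm_storeu_ps(&a[i], _mm_add_ps(vb, vc));
    }
    for (; i < n; i++)
        a[i] = b[i] + c[i];   /* scalar tail for leftover elements */
}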
(ii) VLIW processor Architecture:
In multiple issue processors, we increase the width of the pipeline. Several
instructions are fetched and decoded in the front-end of the pipeline. Several
instructions are issued to the functional units in the back-end. If m is the maximum number of
instructions that can be issued in one cycle, the processor is said to be m-issue wide. There
are two main variations of multiple issue processors: superscalar processors and VLIW (Very
Long Instruction Word) processors.
VLIW processors issue a fixed number of operations, formatted either as one large instruction
or as a fixed instruction packet, with the parallelism among the operations indicated explicitly
by the instruction itself. Hence, they are also known as EPIC (Explicitly Parallel Instruction
Computing) processors. Examples include the Intel/HP Itanium processor.
VLIW exploits instruction-level parallelism that is made explicit in the program: the compiler
decides which operations can execute in parallel and resolves any conflicts between them. VLIW
architectures therefore rely on the compiler for performance, which increases compiler
complexity but greatly reduces hardware complexity.
Features:
● The processors have multiple functional units and fetch very long instruction words from the
instruction cache.
● Multiple independent operations are grouped together in a single VLIW instruction and are
issued in the same clock cycle.
● Each operation is assigned an independent functional unit.
● All the functional units share a common register file.
● Instruction words are typically 64-1024 bits long, depending on the number of execution units
and the code length required to control each unit.
● Instruction scheduling and parallel dispatch of the word are done statically by the compiler.
● The compiler checks for dependencies before scheduling parallel execution of the instructions.