CA Module II Notes
Access Methods
Sequential Access: data is accessed in a fixed linear sequence, so access time depends on the location of the data; magnetic tape is the classic example.
Random Access: Each addressable location in memory has a unique, physically
wired-in addressing mechanism.
• Constant time
• Independent of the sequence of prior accesses
• Any location can be selected at random and directly accessed
• Main memory and some cache systems are random access.
Associative: a random-access type of memory that enables a comparison of desired bit locations
within a word against a specified match, done for all words simultaneously
• Word is retrieved based on a portion of its contents rather than its address
• Retrieval time is constant independent of location or prior access patterns
Performance
• Access time (latency)
• For RAM: time to perform a read or write operation
• Others: time to position the read-write head at desired location
• Memory cycle time: Primarily applied to RAM
• Access time + additional time required before a second access
• Required for electrical signals to be terminated/regenerated
• Concerns the system bus.
• Transfer time: Rate at which data can be transferred in/out of memory
• For RAM: 1 / Cycle time
For other memory types:
Tn = TA + n / R
where:
Tn: Average time to read or write n bits
TA: Average access time
n: Number of bits
R: Transfer rate, in bits per second (bps)
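As a quick illustration, the formula can be evaluated directly; the numbers below are assumed, not taken from the notes:

    # Average time to read n bits from a non-random-access memory: Tn = TA + n / R
    TA = 0.1e-3      # assumed average access time: 0.1 ms
    R  = 1e6         # assumed transfer rate: 1 Mbps
    n  = 512 * 8     # reading a 512-byte record = 4096 bits
    Tn = TA + n / R
    print(f"Tn = {Tn * 1e3:.3f} ms")   # about 4.196 ms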
Physical characteristics
• Volatile: information decays naturally or is lost when powered off;
• Nonvolatile: information remains without deterioration until changed:
• no electrical power is needed to retain information;
• E.g.: Magnetic-surface memories are nonvolatile;
• Semiconductor memory (memory on integrated circuits) may be either
volatile or non-volatile.
Memory Hierarchy
• Design constraints on memory can be summed up by three questions:
• How much?
• If memory exists, applications will likely be developed to use it.
• How fast?
• Best performance achieved when memory keeps up with the processor i.e. as
the processor executes instructions, memory should minimize pausing /
waiting for instructions or operands.
• How expensive?
• Cost of memory must be reasonable in relationship to other components;
Memory Hierarchy
• Trade-off among 3 characteristics: Capacity, Access time and Cost
• Faster access time, greater cost per bit
• Greater capacity – smaller cost per bit
• Greater capacity – slower access time
• Conclusion – Use a memory hierarchy instead of a single type of memory
• Supplement smaller, more expensive, faster memories with Larger, cheaper,
slower memories
Example
• Suppose that the processor has access to two levels of memory:
• Level 1 - L1:
• contains 1000 words and has an access time of 0.01µs;
• Level 2 - L2:
• contains 100,000 words and has an access time of 0.1µs.
• Assume that:
• if word ∈ L1, then the processor accesses it directly;
• If word ∈ L2, then word is transferred to L1 and then accessed by the
processor.
For simplicity:
Ignore time required for processor to determine whether word is in L1 or L2. Also,
let:
• H defines the fraction of all memory accesses that are found in L1
• T1 is the access time of L1
• T2 is the access time of L2
• Now consider the following scenario:
• Suppose 95% of the memory accesses are found in L1.
• Average time to access a word is:
• (0.95)(0.01µs) + (0.05)(0.01µs + 0.1µs) = 0.0095 + 0.0055 = 0.015µs
• Average access time is much closer to 0.01µs than to 0.1µs, as desired.
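A minimal sketch of the same calculation, parameterized by the hit ratio H (the function name is illustrative; times are in microseconds, taken from the example):

    # Two-level memory: average access time = H*T1 + (1 - H)*(T1 + T2)
    def avg_access_time(H, T1=0.01, T2=0.1):
        return H * T1 + (1 - H) * (T1 + T2)

    print(round(avg_access_time(0.95), 4))   # 0.015 µs, matching the worked example
    for H in (0.5, 0.8, 0.95, 0.99):
        print(H, round(avg_access_time(H), 4))   # higher hit ratio -> closer to T1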
Example
General shape of the curve for this situation:
For a high fraction of accesses found in L1 (a high hit ratio), the average total access time is much
closer to that of L1 than to that of L2
Figure: average access time versus the fraction of accesses involving only L1 (hit ratio)
Example
• Strategy to minimize access time should be:
• Organize data across the hierarchy such that
• % of accesses to lower levels is substantially less than that of upper levels
• i.e. L2 memory contains all program instructions and data:
• Data that is currently being used should be in L1;
• Eventually:
• Data ∈ L1 will be swapped to L2 to make room for new data
• On average, most references will be to data contained in L1.
Example
• This principle can be applied across more than two levels of memory:
• Processor registers:
• Fastest, smallest, and most expensive type of memory
• Followed immediately by the cache:
• Stages data movement between registers and main memory;
• Improves performance;
• Is not usually visible to the processor;
• Is not usually visible to the programmer.
• Followed by main memory:
• Principal internal memory system of the computer;
• Each location has a unique address.
Block Size:
Block size is the unit of information exchanged between cache and main memory.
Mapping Function:
When a new block of data is read into the cache, the mapping function determines which cache
location the block will occupy.
Replacement Algorithm:
The replacement algorithm chooses, within the constraints of the mapping function, which existing
block to replace when a new block is to be loaded into the cache and all of the cache slots that could
hold it are already occupied by other blocks.
Write Policy:
• If the contents of a block in the cache are altered, it is necessary to write the block back to
main memory before replacing it.
• The write policy dictates when that memory write operation takes place. At one extreme, the
write occurs every time the block is updated. At the other extreme, the write occurs only
when the block is replaced.
• The latter policy minimizes memory write operations but leaves the main memory in an
obsolete (stale) state.
• This can interfere with multiple-processor operation and with direct memory access by I/O
modules.
Cache Mapping
• The first m main memory blocks map one-to-one onto the m lines of the cache;
• The next m blocks of main memory map in the same manner:
o Block Bm maps into line L0 of the cache;
o Block Bm+1 maps into line L1; and so on.
• Modulo operation implies repetitive structure;
• Over time, each cache line can hold different main memory blocks
• We therefore need the ability to distinguish which block currently occupies a line
• The most significant bits of the address – the tag – serve this purpose
In the example:
• 24-bit main memory address (s + w = 24)
• w = 2-bit word identifier (2^2 = 4 bytes in a block)
• s = 24 - 2 = 22-bit block identifier
• r = 14-bit line identifier
• Tag = s - r = 22 - 14 = 8 bits
• Number of cache lines = 2^14 = 16,384
Direct Mapping- Organization of Cache
Direct Mapping- Cache hit or Miss
Number of bits in tag = Number of bits in physical address – (Number of bits in line
number + Number of bits in block offset)
Given-
Cache memory size = 16 KB
Block size = Line size = 256 bytes
Main memory size = 128 KB
Step 1:
• Main memory = 128 KB = 2^17 bytes
• Thus, the number of bits in the physical address = 17 bits (s + w)
Step 2:
• Block size = 256 bytes = 2^8 bytes
• Number of bits needed to address a word within a block = 8 bits
• w = 8 bits
Given: Cache memory size = 16 KB; Block size = Line size = 256 bytes;
Main memory size = 128 KB
Calculated: s + w = 17; w = 8
Step 3:
Total number of lines in cache = Cache size / Line size
= 16 KB / 256 bytes
= 2^14 bytes / 2^8 bytes
= 2^6 lines
So r = 6 bits
Step 4:
Number of bits in tag = (s + w) - (r + w)
= 17 bits - (6 bits + 8 bits)
= 17 bits - 14 bits
= 3 bits
Step 5:
Tag directory size
= Number of tags x Tag size
= Number of lines in cache x Number of bits in tag
= 2^6 x 3 bits = 192 bits
= 24 bytes (8 bits = 1 byte)
Thus, size of tag directory = 24 bytes
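A small sketch that reproduces Steps 1-5 for any byte-addressable, direct-mapped configuration (the helper function is illustrative, not a standard API):

    from math import log2

    def direct_mapped_params(main_mem_bytes, cache_bytes, block_bytes):
        """Return (address bits, offset bits, line bits, tag bits, tag directory size in bits)."""
        addr_bits   = int(log2(main_mem_bytes))   # s + w
        offset_bits = int(log2(block_bytes))      # w
        lines       = cache_bytes // block_bytes
        line_bits   = int(log2(lines))            # r
        tag_bits    = addr_bits - line_bits - offset_bits
        return addr_bits, offset_bits, line_bits, tag_bits, lines * tag_bits

    # Worked example: 128 KB main memory, 16 KB cache, 256-byte blocks
    print(direct_mapped_params(128 * 1024, 16 * 1024, 256))
    # -> (17, 8, 6, 3, 192)   i.e. tag directory = 192 bits = 24 bytes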
Consider a direct mapped cache of size 512 KB with block size 1 KB. There are 7 bits
in the tag. Find: 1. the size of main memory, 2. the tag directory size.
Given:
Cache memory size = 512 KB
Block size = Line size = 1 KB
Number of bits in tag = 7 bits
We consider that the memory is byte addressable.
Number of Bits in Block Offset-
Block size = 1 KB = 2^10 bytes
Number of bits in block offset = 10 bits
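One way to finish this exercise, assuming the byte-addressable configuration above (a sketch of the arithmetic, not the official solution):

    # Direct-mapped, 512 KB cache, 1 KB blocks, 7-bit tag (given)
    lines       = (512 * 1024) // (1 * 1024)   # 512 lines -> 9 bits for the line number
    line_bits   = 9
    offset_bits = 10                           # 1 KB block -> 10-bit block offset
    tag_bits    = 7
    addr_bits   = tag_bits + line_bits + offset_bits            # 26 bits
    print("main memory  =", 2 ** addr_bits // 2 ** 20, "MB")    # 64 MB
    print("tag directory =", lines * tag_bits, "bits =",
          lines * tag_bits // 8, "bytes")                       # 3584 bits = 448 bytes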
ASSOCIATIVE MAPPING
In fully associative mapping, any main memory block can go into any line of the cache.
Address format: | Tag (s bits) | Word id (w bits) |
Given:
Cache memory size = 16 KB
Block size = Frame size = Line size = 256 bytes
Main memory size = 128 KB
Main memory = 128 KB = 2^17 bytes
Number of bits in physical address = 17 bits
Number of Bits in Block Offset-
Block size = 256 bytes = 2^8 bytes
Thus, number of bits in block offset = 8 bits
Address format (17 bits): | Tag / Block number | Block offset (8 bits) |
Number of bits in tag
= Number of bits in physical address – Number of bits in block offset
= 17 bits – 8 bits
= 9 bits
A 4-way set associative mapped cache has block size of 4 KB. The size of main
memory is 16 GB and there are 10 bits in the tag. Find the size of cache memory and
the tag directory size.
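One way to work this out, assuming a byte-addressable main memory (a sketch, not the official solution):

    # 4-way set associative, 4 KB blocks, 16 GB main memory, 10 tag bits
    addr_bits   = 34                                   # 16 GB = 2**34 bytes
    offset_bits = 12                                   # 4 KB block
    tag_bits    = 10                                   # given
    set_bits    = addr_bits - tag_bits - offset_bits   # 12 -> 2**12 sets
    lines       = 4 * 2 ** set_bits                    # 4 lines per set -> 2**14 lines
    cache_size  = lines * 4 * 1024                     # 2**26 bytes
    tag_dir     = lines * tag_bits                     # in bits
    print(cache_size // 2 ** 20, "MB cache,",
          tag_dir // (8 * 1024), "KB tag directory")   # 64 MB cache, 20 KB tag directory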
Cache Performance
Varying Associativity over Cache Size
LRU Algorithm
• Replace block in the set that has been in the cache longest, with no references
to it.
o Maintains a list of indexes to all the lines in the cache:
o Whenever a line is used move it to the front of the list;
o Choose the line at the back of the list when replacing a block.
• LRU replacement can also be implemented by attaching a counter to each cache block to
indicate how recently the block has been used.
o On every CPU reference all of these counters are updated so that the smaller the
number, the more recently the block was used, i.e., the LRU block is always the one
with the largest number.
LRU Replacement -Example
Block L1 L2 L3 L4 Status
5 5 Miss
4 5 4 Miss
6 5 4 6 Miss
3 5 4 6 3 Miss
4 5 4 6 3 Hit
0 0 4 6 3 Miss
2 0 4 2 3 Miss
5 0 4 2 5 Miss
3 0 3 2 5 Miss
0 0 3 2 5 Hit
6 0 3 6 5 Miss
7 0 3 6 7 Miss
11 0 11 6 7 Miss
3 3 11 6 7 Miss
5 3 11 5 7 Miss
Given 8 Cache lines and the following reference string trace the block placement
in Cache: 4,3,25,8,19,6,25,8,16,35,45,22,7
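A minimal LRU simulator (the function name is illustrative) that can be used to reproduce the trace in the table above and to work through this exercise:

    def lru_trace(refs, num_lines):
        cache = []                              # front of the list = most recently used
        for block in refs:
            hit = block in cache
            if hit:
                cache.remove(block)             # refresh its recency
            elif len(cache) == num_lines:
                cache.pop()                     # evict the least recently used (back of list)
            cache.insert(0, block)
            # contents are printed in recency order, not by physical line number
            print(block, cache, "Hit" if hit else "Miss")

    lru_trace([5, 4, 6, 3, 4, 0, 2, 5, 3, 0, 6, 7, 11, 3, 5], 4)   # the table above
    lru_trace([4, 3, 25, 8, 19, 6, 25, 8, 16, 35, 45, 22, 7], 8)   # the exercise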
Disadvantages
1. High latency to evict an unused cache line.
2. It does not consider 'frequency' and 'spatial locality'.
FIFO Replacement
First-in-first-out (FIFO):
o Replace the block in the set that has been in the cache longest, regardless of how
recently or how often it has been referenced;
o Easily implemented as a round-robin or circular buffer technique
Disadvantages:
o Performance is often mediocre, because the time a block has spent in the cache says
nothing about how actively it is being used.
LFU
Least frequently used (LFU):
o Replace the block in the set that has experienced the fewest references;
o Implemented by associating a counter with each line.
Consider 4 block cache with following main memory references: 5,0, 1,3,2,4, 1,0, 5.
Identify the hit ratio with the given memory requests.
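A counter-based LFU sketch for the exercise above. The notes do not specify how ties between equally used blocks are broken; this sketch assumes the least recently used of the tied blocks is evicted, so the resulting hit ratio depends on that assumption:

    def lfu_hit_ratio(refs, num_blocks):
        """Simulate LFU with a reference counter per cached block; return the hit ratio."""
        cache, freq, last, hits = set(), {}, {}, 0
        for t, b in enumerate(refs):
            if b in cache:
                hits += 1
                freq[b] += 1
            else:
                if len(cache) == num_blocks:
                    # evict the block with the smallest count (ties broken by recency)
                    victim = min(cache, key=lambda x: (freq[x], last[x]))
                    cache.discard(victim)
                cache.add(b)
                freq[b] = 1            # counter starts fresh for a newly loaded block
            last[b] = t
        return hits / len(refs)

    print(lfu_hit_ratio([5, 0, 1, 3, 2, 4, 1, 0, 5], 4))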
LFU Disadvantages
o A separate counter is needed.
PERFORMANCE CONSIDERATIONS
• Two key factors in the commercial success of a computer are performance and
cost.
• Objective: Best possible performance for a given cost
• A common measure of success is the
price/ performance ratio
• The extent to which cache improves performance is dependent on how
frequently the requested instructions and data are found in the cache
Prefetching
• Normally, new data is brought into the cache only when it is first needed, so the processor
has to pause until the data arrives – the miss penalty.
• To avoid stalling the processor, it is possible to prefetch the data into the cache
before they are needed.
• A special prefetch instruction may be provided in the instruction set of the
processor.
• Executing this instruction causes the addressed data to be loaded into the
cache, as in the case of a Read miss but before they are needed in the program.
This avoids miss penalty.
• Hardware or Software (Compiler or Programmer)
Lockup-Free Cache
• While servicing a miss, the cache is said to be locked.
• This problem can be solved by modifying the basic cache structure to allow
the processor to access the cache while a miss is being serviced
• A cache that can support multiple outstanding misses is called lockup-free.
• Such a cache must include circuitry that keeps track of all outstanding misses.
• This may be done with special registers that hold the pertinent information
about these misses.
PIPELINING
Basic concept - Pipeline Organization and issues - Data Dependencies –Memory
Delays – Branch Delays – Resource Limitations - Performance Evaluation -
Superscalar operation –Pipelining in CISC Processors - Instruction Level
Parallelism –Parallel Processing Challenges – Flynn’s Classification – Hardware
multithreading –Multicore Processors: GPU, Multiprocessor Network
Topologies.
PIPELINING
• Overlaps the execution of successive instructions to achieve high performance
• Example: Manufacture of a product involving 3 processes
Without pipelining, each product (A, B, C) completes all three processes (P1, P2, P3) before the
next product starts, taking 9 time units:

Time  1  2  3  4  5  6  7  8  9
P1    A        B        C
P2       A        B        C
P3          A        B        C

With pipelining, the processes overlap and all three products are finished in 5 time units:

Time  1  2  3  4  5
P1    A  B  C
P2       A  B  C
P3          A  B  C
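The example generalizes: an ideal k-stage pipeline processing n items takes k + (n - 1) stage-times instead of n x k, so the speedup approaches k as n grows. A minimal sketch of that calculation:

    def pipeline_speedup(n, k):
        """Ideal k-stage pipeline with unit-time stages and no stalls."""
        non_pipelined = n * k
        pipelined     = k + (n - 1)
        return non_pipelined / pipelined

    print(pipeline_speedup(3, 3))               # 9 / 5 = 1.8, matching the example above
    print(round(pipeline_speedup(1000, 5), 2))  # approaches 5 for a long instruction stream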
Pipelining
The speed of execution of programs is influenced by many factors
• Using faster circuit technology to implement the processor and the main
memory
• Arranging the hardware so that more than one operation can be performed
at the same time – overall completion time is speeded up – individual
operation time remains the same
What is Pipelining?
• Pipelining is an implementation technique whereby multiple instructions
are overlapped in execution.
• Pipe stage (pipe segment)
• Commonly known as an assembly-line operation.
• Automobile Manufacturing
Idea of pipelining
• Original Five-stage processor organization allows instructions to be fetched
and executed one at a time
• Overlapping of instructions
Simple implementation of A RISC ISA
Five-cycle implementation
• Instruction fetch cycle (IF)
• Instruction decode/register fetch cycle (ID)
o Operand fetches
o Sign-extending the immediate field;
o Decoding is done in parallel with reading registers. This technique is
known as fixed-field decoding;
o Test the branch condition and compute the branch target address; branching can be
completed at the end of this cycle.
• Execution/Compute (EX)
o Memory reference;
o Register-Register ALU instruction;
o Register-Immediate ALU instruction;
• Memory access/branch completion cycle (MEM)
• Write-back cycle (WB)
o Register-Register ALU instruction;
o Register-Immediate ALU instruction;
o Load instruction;
5 stage Pipeline
Pipelining Hazards
• Hazard - situation that prevents the next instruction in the instruction stream
from executing during its designated clock cycle.
• Three Types of hazards
o Structural hazard: Arises from resource conflicts.
o Data hazard: Arises when an instruction depends on the results of a
previous instruction.
o Control hazard: Arises from branches and other instructions that change
the PC.
• A pipeline can be stalled by a hazard.
Data Dependencies
Sequence 1 (R1 is produced by the DSUB, an ALU instruction, so forwarding lets the later
instructions use it without stalling):
LD   R3, 0(R2)
DSUB R1, R2, R5
AND  R6, R1, R7
OR   R8, R1, R9
XOR  R8, R2, R4

Sequence 2 (R1 is produced by the LD and used by the very next instruction, so even with
forwarding the pipeline must stall):
LD   R1, 0(R2)
DSUB R4, R1, R5
AND  R6, R1, R7
OR   R8, R1, R9
XOR  R8, R2, R4
Subtract instruction is stalled for three cycles to delay reading register R2 until cycle 6
Operand Forwarding
Add R2, R3, #100
Subtract R9, R2, #30
The pipeline is stalled for 3 cycles – but the required value is already available at the end of Cycle 3.
Operand forwarding – instead of stalling the pipeline, the required value is sent directly to the ALU
in Cycle 4.
Add R2, R3, #100
Subtract R9, R2, #30
MEMORY DELAY
• Delays arising from memory accesses are another cause of pipeline stalls
• Load instruction may require more than one clock cycle to obtain its operand
from memory.
• This may occur because the requested instruction or data are not found in the
cache, resulting in a cache miss. A memory access may take ten or more
cycles.
Stalling by 3 Cycles

No dependence:
LD   R1, 45(R2)
DADD R5, R6, R7
DSUB R8, R6, R7
OR   R9, R6, R7

Dependence overcome by forwarding (R1 is used two instructions after the load):
LD   R1, 45(R2)
DADD R5, R6, R7
DSUB R8, R1, R7
OR   R9, R6, R7

Dependence requiring a stall (R1 is used by the instruction immediately after the load):
LD   R1, 45(R2)
DADD R5, R1, R7
DSUB R8, R6, R7
OR   R9, R6, R7
BRANCH DELAYS
• In an ideal pipeline a new instruction is fetched every cycle, while the
preceding instruction is still being decoded.
• Branch instructions can alter the sequence of execution but they must first be
executed to determine whether and where to branch
• The number of stalls introduced during branch operations in the pipelined
processor is known as branch penalty
• Various techniques can be used for mitigating impact of branch delays :
o Unconditional Branches
o Conditional Branches
o The Branch Delay Slot
o Branch Prediction
o Static Branch Prediction
o Dynamic Branch Prediction
o Branch Target Buffer for Dynamic Prediction
Unconditional Branches
• Ij – Branch instruction
• Ik – Branch target – computed only in Cycle 3
• So Ik is fetched in Cycle 4
• Two-cycle delay
Branch instructions represent about 20 % of the dynamic instruction count of most
programs.
Dynamic count – number of instruction executions – some instructions may get
executed multiple times.
Two-cycle branch penalty – increases execution time by nearly 40%.
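The 40% figure follows from a simple CPI argument, assuming an ideal base CPI of 1 and the 20% branch frequency quoted above:

    branch_fraction = 0.20
    branch_penalty  = 2          # cycles lost per branch
    base_cpi        = 1.0        # assumed ideal pipelined CPI
    effective_cpi   = base_cpi + branch_fraction * branch_penalty
    increase = (effective_cpi - base_cpi) / base_cpi
    print(effective_cpi, f"{increase:.0%}")   # 1.4 CPI -> roughly 40% longer execution time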
Reducing the branch penalty requires the branch target address to be computed earlier
– Decode stage
• Decode stage: instruction decoder determines that the instruction is a branch
instruction
• The computed target address will be available before the end of cycle 2
• Branch delay slot can be filled with a useful instruction which will be
executed irrespective of whether the branch is taken or not
• Move one of the instructions preceding the branch to the branch delay slot
• Logically, execution proceeds as though the branch instruction were placed
after the ADD instruction – Delayed branching
• If no useful instruction is found – NOP is placed and branch penalty of 1 is
incurred.
Branch_if_[R3]=0 TARGET
Add R7, R8, R9
Ij+1
..
..
TARGET: Ik
Branch Prediction
• To reduce the branch penalty further, the processor needs to anticipate that an
instruction being fetched is a branch instruction and predict its outcome to
determine which instruction should be fetched in cycle 2.
• Types of Branch Prediction
o Static Branch Prediction
o Dynamic Branch Prediction
o LT - Branch is likely to be taken
o LNT - Branch is likely not to be taken
1. Processor Limitations
Clock Speed: The clock speed of a processor, which defines how quickly it can
execute instructions, is often a limiting factor. Higher clock speeds lead to increased
power consumption and heat generation, which creates challenges for energy-efficient
designs.
Instruction Throughput: As the demand for processing power increases, modern
processors use techniques like superscalar execution and pipelining to increase
instruction throughput. However, these techniques have their limits in terms of how
many instructions can be processed simultaneously without causing issues like
instruction dependencies or pipeline stalls.
2. Memory Limitations
Cache Memory: The speed gap between the CPU and main memory (RAM) is a
significant bottleneck. Processors rely on caches (L1, L2, L3) to store frequently used
data for faster access. However, cache sizes are limited due to space and cost
constraints.
Memory Hierarchy: Efficient memory usage depends on memory hierarchy design
(registers, cache, main memory, and storage). Larger memory hierarchies can lead to
increased power consumption, and managing data flow between different levels of
memory presents a design challenge, as well as issues like cache coherence in multi-
core systems.
3. Interconnect Limitations
4. Storage Limitations
6. Scalability Limitations
7. Bandwidth-Delay Product
GPU Memory: Graphics Processing Units (GPUs) are designed to handle highly
parallel workloads but are constrained by their local memory (often much smaller
than CPU memory). Techniques like memory paging and streaming
multiprocessors (SMs) help mitigate this issue, but resource limitations can still
affect the performance of GPU-based systems.
Compute Units: While GPUs have many smaller cores that are optimized for parallel
execution, the total compute power is still constrained by the number of compute
units available, especially in workloads that do not fit well into the SIMD execution
model.
Performance evaluation in computer systems refers to assessing and quantifying how well a
system or component performs under specific workloads or conditions. This is essential for
understanding the efficiency, throughput, and overall capability of hardware and software.
1.1 Throughput
Throughput refers to the amount of work a system can perform in a given period,
typically measured in terms of tasks completed, data processed, or instructions
executed per unit of time. For processors, this is often quantified as instructions per
cycle (IPC), operations per second (OPS), or flops (floating-point operations per
second).
In parallel systems, throughput is particularly important in determining the system's capacity to
handle multiple tasks or data streams simultaneously.
1.2 Latency
Latency refers to the time taken to complete a single operation or request, from the moment it is
issued to the moment its result is available.
1.3 Execution Time
Execution time refers to the total time a system takes to execute a given program or workload. It is
a direct measure of the time taken to perform operations. Execution time can be broken down into:
o CPU time: Time spent on processing.
o I/O time: Time spent waiting for input/output operations.
o Memory access time: Time spent waiting for data to be fetched from memory
or caches.
1.4 Speedup
Speedup measures how much faster a workload runs on the improved (for example, parallel) system
than on the baseline system: Speedup = T_baseline / T_improved; for a parallel system this is
T_sequential / T_parallel.
1.5 Efficiency
Efficiency measures the utilization of the system's resources relative to the maximum possible
utilization. In parallel systems, efficiency is often calculated as
Efficiency = Speedup / Number of processors
High efficiency means that adding more processors leads to significant improvements
in performance, whereas low efficiency implies that resources are not being fully
utilized, often due to overheads such as synchronization or communication.
CPI (Cycles Per Instruction): This metric reflects the number of clock cycles
required to execute a single instruction. A CPU with a lower CPI is more efficient in
processing instructions.
o CPI Calculation: The CPU's overall performance is linked to its CPI, which
is determined by the instruction set architecture, data hazards, and instruction
scheduling.
o Example: A CPU with a CPI of 2 will take 2 clock cycles to complete each
instruction on average.
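A short sketch tying CPI to execution time through the classic performance equation, CPU time = Instruction count x CPI / Clock rate (the example values below are assumed):

    instruction_count = 2_000_000
    cpi               = 2.0        # cycles per instruction, as in the example above
    clock_rate_hz     = 1e9        # assumed 1 GHz clock
    cpu_time = instruction_count * cpi / clock_rate_hz
    print(f"{cpu_time * 1e3:.1f} ms")   # 4.0 ms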
MIPS (Million Instructions Per Second): This measures the number of millions of
instructions a processor can execute in one second. However, MIPS alone does not
provide an accurate performance measure because it doesn’t account for instruction
complexity.
o Formula: MIPS = Instruction count / (Execution time x 10^6) = Clock rate / (CPI x 10^6)
Synthetic benchmarks are specifically designed tests that simulate specific aspects
of system performance, such as memory access patterns, processor throughput, or I/O
performance. These benchmarks help identify how well a system performs in isolated
tasks but may not reflect real-world application performance.
Patterson & Hennessey mention that synthetic benchmarks can provide insights into
the raw capabilities of the system components but may fail to represent complex, real-
world workloads.
GPUs are highly parallel computing devices optimized for specific workloads, such
as graphics rendering and matrix computations. The performance evaluation of GPUs
often focuses on their ability to handle large amounts of data in parallel.
o Patterson & Hennessey discuss how GPU performance is typically
measured by the number of cores (thousands of small cores in GPUs),
memory bandwidth, and the efficiency of execution in parallel tasks.
The Energy-Delay Product (EDP) is a metric that evaluates the trade-off between
energy consumption and execution time. It is calculated as
EDP = Energy consumed x Execution time
In a superscalar system, the instruction fetch unit fetches multiple instructions from
memory in parallel, while the instruction dispatch unit dynamically assigns these
instructions to available pipelines.
Modern superscalar processors can fetch and decode several instructions per cycle,
making use of techniques such as out-of-order execution to maximize instruction
throughput.
Superscalar processors have several execution units (also called functional units),
each of which performs a specific type of operation. Common execution units in
superscalar processors include:
o ALUs (Arithmetic Logic Units) for integer operations.
o FPUs (Floating Point Units) for floating-point calculations.
o Load/Store units for memory operations.
The ability to execute multiple instructions in parallel depends on having multiple
functional units and scheduling them appropriately.
After decoding, the instructions are dispatched to available execution units based on
their type (e.g., integer operations go to the ALU, floating-point operations go to the
FPU).
The processor ensures that dependent instructions are scheduled in the correct order,
while independent instructions can be processed simultaneously.
Superscalar processors often support out-of-order execution. This means that the
processor can execute instructions as soon as their operands are available, rather than
strictly following the program’s sequential order.
Dynamic scheduling techniques, such as scoreboarding or Tomasulo's algorithm, help in
reordering instructions to avoid pipeline stalls and improve parallel execution.
The presence of multiple pipelines enables the processor to execute different types of
instructions (such as integer and floating-point) concurrently. This further increases
throughput by making efficient use of different execution units.
Superscalar processors can efficiently utilize the various execution units within the
CPU, improving resource utilization and reducing idle times for components like
ALUs and FPUs.
Even with multiple pipelines, the processor may encounter bottlenecks in instruction
dispatch. If the processor cannot quickly determine which functional unit should
handle each instruction, this can reduce the number of instructions that are executed in
parallel.
Advanced techniques like dynamic instruction scheduling and register renaming
help alleviate this issue by optimizing the use of execution units.
4.4 Limited Parallelism
The amount of independent work available in a program is finite, so even with many pipelines and
functional units the achievable parallelism is bounded by the data and control dependencies in the
code.
The Intel Pentium processors are classic examples of superscalar architecture. The
Pentium Pro and later models featured multiple pipelines, allowing for execution of
several instructions per clock cycle.
ARM processors, used in many mobile and embedded systems, often feature
superscalar designs to enhance performance while keeping power consumption low.
Similarly, AMD’s Ryzen processors use a superscalar architecture to deliver high
performance for both single-threaded and multi-threaded applications.
CISC processors, such as the x86 architecture, have a wide range of complex instructions,
and each instruction can vary greatly in length and execution time. These processors use
micro-operations (μ-ops) to break complex instructions into simpler steps, and pipelining is
used to optimize the execution of these instructions.
Pipelining in processors divides instruction execution into multiple stages, with each stage
performing a specific task. In a basic five-stage pipeline, these stages are typically:
1. Instruction Fetch (IF): The processor fetches the instruction from memory.
2. Instruction Decode (ID): The processor decodes the instruction and prepares the
necessary operands.
3. Execute (EX): The processor performs the operation specified by the instruction
(e.g., addition, subtraction, etc.).
4. Memory Access (MEM): If the instruction involves memory (e.g., load or store), the
memory is accessed.
5. Write Back (WB): The result of the instruction is written back to the register file or
memory.
These stages operate in parallel, so while one instruction is being decoded, another is being
executed, and a third may be in the memory access stage.
One of the main challenges in CISC pipelining is that CISC instructions can vary
significantly in length. For example, an instruction in the x86 architecture may be
just one byte long or several bytes long, depending on the operation.
This variability makes instruction fetching more complicated. In pipelining, the fetch
stage usually expects instructions to be of uniform length. In CISC processors, this
variability requires special handling mechanisms to correctly fetch and decode
instructions, ensuring that the correct instruction boundaries are identified.
2.2 Complex Decoding
CISC instructions are often quite complex, meaning they can require multiple
decoding steps. An instruction like MOV might involve directly moving a register's
value, while a more complex instruction like LODS (which loads a string) can involve
different addressing modes and different operations.
This complexity can cause delays during the decode stage of the pipeline. Multiple
levels of decoding might be necessary, which can create pipeline stall conditions.
Since CISC instructions are often broken into multiple micro-operations, pipelining
the micro-operations instead of the original instructions is a key optimization. This
allows independent operations to proceed in parallel without waiting for the
completion of the entire instruction.
Out-of-order execution is frequently used here to allow micro-operations that don’t
depend on each other to proceed, reducing the time spent waiting for other operations
to complete.
The Intel x86 architecture is one of the most widely known examples of a CISC processor
employing pipelining. In early designs, the x86 processors used simple, non-pipelined
execution models. However, as technology advanced, Intel began to incorporate pipelined
execution into their processors with multiple stages of instruction processing.
In the Pentium processor, for instance, multiple instructions can be fetched and decoded in
parallel, and the instructions are divided into simpler micro-operations that can be pipelined
individually. The Pentium Pro and later models used deeper pipelines, achieving high
throughput despite the complexity of CISC instructions.
5.1 Benefits
Pipelining helps to exploit ILP by breaking down the execution of each instruction
into multiple stages. These stages can be overlapped, allowing multiple instructions to
be processed simultaneously in different stages.
The combination of pipelining and ILP allows for the execution of more than one
instruction in parallel, increasing the overall throughput of the processor.
SIMD allows the same instruction to be applied to multiple data points in parallel.
This is particularly useful in applications like vector processing and multimedia,
where large datasets can be processed simultaneously.
ILP is exploited in SIMD by performing multiple data operations in parallel, with
each operation requiring the same instruction.
Data dependencies and control dependencies often limit the amount of parallelism
that can be achieved. For instance, a RAW hazard (true dependency) prevents two
instructions from being executed in parallel if one depends on the result of the other.
Despite advanced techniques like out-of-order execution, ILP is inherently limited
by these dependencies.
Resource contention occurs when multiple instructions require the same hardware
resource (e.g., ALUs, registers, memory). This can limit parallelism, as only one
instruction can use a resource at a time.
ILP can be improved by providing more resources (e.g., multiple ALUs) or by using
techniques like register renaming to avoid false dependencies.
3.4 Diminishing Returns
While ILP can significantly improve performance, there are diminishing returns. Even
with advanced techniques like out-of-order execution and speculative execution, the
degree of parallelism is limited by the dependencies in the program and the
availability of resources.
Amdahl’s Law highlights that improving ILP will not result in a proportional
speedup if there is a significant sequential portion in the program that cannot be
parallelized.
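Amdahl's Law can be written as Speedup = 1 / ((1 - f) + f / N), where f is the fraction of the work that can be parallelized and N is the number of processors. A small sketch showing how the serial portion caps the achievable speedup:

    def amdahl_speedup(parallel_fraction, n_processors):
        """Overall speedup when only a fraction of the work can be parallelized."""
        f = parallel_fraction
        return 1.0 / ((1 - f) + f / n_processors)

    for n in (2, 8, 64, 1024):
        print(n, round(amdahl_speedup(0.90, n), 2))   # capped near 10x when 10% is serial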
Superscalar processors exploit ILP by having multiple execution units that can
execute several instructions concurrently. Intel’s Pentium and AMD’s Ryzen
processors are examples of superscalar processors that use ILP to maximize
instruction throughput.
These processors can dynamically schedule instructions to different execution units,
making use of ILP to perform several operations in parallel.
Graphics Processing Units (GPUs) are highly optimized for ILP, especially in
SIMD workloads. GPUs can execute thousands of threads in parallel, with each
thread performing the same operation on different data, making them well-suited for
workloads that require high levels of parallelism, such as graphics rendering and
machine learning.
1. Scalability Issues
o The law suggests that as the number of processors increases, the speedup
grows, but only up to a limit determined by the non-parallelizable portion.
This puts a fundamental limit on how much performance can be gained by
simply adding more processors.
o In practice, as processors are added, the overhead of coordinating them (e.g.,
managing memory, handling communication) can outweigh the benefits,
limiting scalability.
Race conditions occur when multiple processes or threads attempt to access shared
resources (e.g., memory or I/O) simultaneously, and the final outcome depends on the
order of execution. In parallel systems, unsynchronized access to shared resources
can result in unpredictable and incorrect results.
Proper synchronization is required to prevent race conditions. This is typically done
using locks, mutexes, or semaphores, but these mechanisms can introduce delays
and reduce performance.
2.2 Deadlock
A deadlock occurs when two or more processes are waiting for each other to release
resources, leading to a situation where none of the processes can proceed.
o For example, Process A holds Resource 1 and waits for Resource 2, while
Process B holds Resource 2 and waits for Resource 1.
Deadlocks can significantly reduce the performance of parallel systems if they are not
properly managed. Techniques such as timeout mechanisms, deadlock detection,
and resource allocation graphs are used to prevent or resolve deadlocks.
Effective load balancing ensures that all processors in a parallel system are utilized
efficiently. If some processors are idle while others are overloaded, the overall
performance can be negatively impacted.
The challenge is distributing the workload in such a way that the tasks are evenly
spread across the processors and the load is balanced throughout the execution. This
can be difficult, especially when tasks vary in complexity or size.
o Dynamic load balancing techniques, where tasks are redistributed during
execution, are employed to ensure optimal performance.
3. Communication Overhead
Bandwidth limitations refer to the limited rate at which data can be transferred
between processors or between a processor and memory. This can be a bottleneck in
parallel systems, especially when large amounts of data need to be shared between
processors.
Modern processors and architectures, such as multi-core processors and GPUs, use
high-bandwidth memory and fast interconnects (e.g., InfinityFabric, NVLink) to
improve data transfer rates.
Memory latency refers to the delay between a processor requesting data from
memory and receiving it. In a parallel system, multiple processors may request data
from the same memory, leading to contention and increased latency.
This issue can be mitigated by having multiple memory banks or employing non-
uniform memory access (NUMA), where processors are assigned to specific
memory regions to reduce contention.
In multi-core systems, each core typically has its own local cache to store frequently
accessed data. However, when multiple cores modify the same memory locations, the
caches can become inconsistent, resulting in cache coherence problems.
Solutions to this include the MESI protocol (Modified, Exclusive, Shared, Invalid),
which ensures that all caches in a multi-core system are kept consistent, but
maintaining cache coherence can introduce significant overhead and reduce
performance.
False sharing occurs when different threads access different variables that happen to
reside on the same cache line. Even though the threads are not actually sharing data,
they may cause unnecessary cache invalidations, reducing performance.
To avoid false sharing, data must be carefully aligned and placed in memory to ensure
that frequently accessed data does not reside on the same cache line.
Soft errors, caused by environmental factors like radiation, can lead to bit flips in
memory or processor states. In parallel systems, soft errors are a significant concern,
as they may lead to incorrect results if not detected and corrected.
Error-correcting codes (ECC) and redundant execution are used to protect against
these types of errors in parallel systems.
7. Energy Efficiency
Flynn’s Classification
1. SISD (Single Instruction, Single Data)
1.1 Definition
A SISD system has a single processor executing a single instruction stream on data stored in a
single memory. Instructions are executed sequentially, one at a time.
1.2 Characteristics
Processor: Single processor that handles both instruction and data sequentially.
Memory: A single memory unit is used for both instructions and data.
Example Systems: Early mainframe computers, basic microprocessors, and
personal computers.
1.3 Limitations
SISD systems are constrained by the von Neumann bottleneck, where the processor
is limited by the speed of memory access, and sequential execution restricts
performance.
It is not capable of taking advantage of modern parallel computing demands.
2. SIMD (Single Instruction, Multiple Data)
2.1 Definition
SIMD systems are capable of applying the same instruction to multiple data elements
at once. This allows multiple data elements to be processed in parallel using a single
instruction.
SIMD is widely used in applications like vector processing, graphics processing,
and scientific computing, where the same operation needs to be applied to large
datasets.
2.2 Characteristics
Processor: Single control unit issuing the same instruction to multiple processing
elements.
Memory: Data is organized in such a way that the same instruction operates on
different pieces of data in parallel.
Example Systems: Graphics Processing Units (GPUs), Vector processors, SIMD
extensions in CPUs (e.g., Intel AVX and SSE instructions).
2.3 Strengths
SIMD systems can process large amounts of data in parallel with minimal overhead,
making them highly effective for data-parallel applications.
They are particularly beneficial for media processing, 3D rendering, and machine
learning, where the same operation needs to be performed on many pieces of data
simultaneously.
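A small illustration of the data-parallel style that SIMD favors: one operation applied element-wise to a whole array in a single call. NumPy is used here purely as an illustration; its vectorized operations are typically backed by the CPU's SIMD extensions (e.g., SSE/AVX), though that mapping is a library implementation detail rather than something the notes specify:

    import numpy as np

    a = np.arange(1_000_000, dtype=np.float32)
    b = np.ones_like(a)

    # Conceptually one "instruction" applied to many data elements at once,
    # instead of an explicit loop over individual elements.
    c = a * 2.0 + b
    print(c[:5])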
2.4 Limitations
SIMD is restricted to problems that exhibit data parallelism, where the same
operation can be performed on multiple data elements independently.
SIMD cannot be used for tasks that require task parallelism (e.g., different
instructions for different data).
3. MISD (Multiple Instruction, Single Data)
3.1 Definition
MISD is a more theoretical and rare category where multiple instruction streams
operate on a single data stream. This system would execute multiple instructions
concurrently on the same data, but the data itself remains unchanged by the different
operations.
It’s not commonly seen in practice and is more of a conceptual category.
3.2 Characteristics
Processor: Multiple processors each execute different instructions on the same data.
Memory: Single data stream, with each processor accessing the same data.
Example Systems: While no commercially viable systems exist for MISD, it could
potentially be useful in fault tolerance systems where different computations are
applied to the same data to check for consistency and accuracy.
3.3 Strengths
The main attraction is redundancy: applying different computations to the same data stream can be
used to detect faults or validate results, as noted above.
3.4 Limitations
Very few practical problems fit this model, and no commercially significant MISD machines have
been built.
4. MIMD (Multiple Instruction, Multiple Data)
4.1 Definition
MIMD systems are capable of executing multiple instruction streams concurrently on
multiple data streams. Each processor in an MIMD system can execute its own
instruction sequence on different data, which makes MIMD the most versatile and
widely used model in parallel computing.
MIMD is suitable for a wide range of applications, from supercomputing to
distributed systems.
4.2 Characteristics
Processor: Multiple processors, each with its own control unit, executing its own instruction stream.
Memory: Each processor operates on its own data; memory may be shared among the processors or
distributed across them.
Example Systems: Multi-core processors, clusters, and most modern supercomputers.
4.3 Strengths
MIMD systems can handle both task parallelism and data parallelism, allowing
them to address a wide variety of complex computational problems.
They can be scalable, supporting any number of processors, and can be optimized for
distributed processing in large clusters.
4.4 Limitations
Coordinating many independent processors introduces overheads such as synchronization,
communication, and cache coherence, and makes parallel programming considerably more complex.
Flynn's classification continues to play a critical role in the design of modern multi-
core processors and distributed systems.
o SIMD is used extensively in GPUs and media processing units, where the
same instruction is applied to multiple data elements simultaneously.
o MIMD is the foundation for most modern supercomputers and cloud
computing infrastructures, enabling the parallel execution of independent
tasks across many processors or nodes.
MIMD architectures, especially those used in multi-core processors, distributed
computing, and cluster-based systems, are the most flexible and widely adopted
systems in contemporary computing.
Flynn’s Classification provides a framework for understanding the basic architectures that
enable parallel processing. From the simple and sequential SISD to the highly flexible
MIMD, each category of architecture serves different types of computational needs. Modern
parallel computing continues to evolve within this framework, with SIMD and MIMD
architectures being central to the development of high-performance computing systems, such
as GPUs, multi-core processors, and supercomputers. Understanding Flynn’s categories
helps in selecting the right parallel processing model for a given application, ensuring optimal
performance based on the nature of the task and available hardware.
Hardware Multithreading
There are several types of hardware multithreading techniques, each with its unique approach
to managing multiple threads within a processor. The most common types include Fine-
Grained Multithreading, Coarse-Grained Multithreading, and Simultaneous
Multithreading (SMT).
In traditional single-threaded execution, the processor may experience idle cycles due
to waiting on memory or I/O operations. Hardware multithreading minimizes these
idle cycles by executing other threads during these stalls, thus keeping the processor
busy and making full use of its resources.
2.3 Reduced Latency
By switching between threads that are not stalled, hardware multithreading can reduce
the impact of latency caused by memory accesses, cache misses, or other long-latency
operations. This ensures that the processor can continue to perform useful work even
when one thread is waiting for data.
2.4 Scalability
Multithreading can scale with the number of threads and available hardware
resources. This is especially beneficial in systems with many processors or cores, as
each processor or core can handle multiple threads concurrently.
When multiple threads are executed on the same core or processor, they must share
the available resources (e.g., execution units, memory, cache). This can lead to
resource contention, where the threads compete for limited resources, potentially
reducing the performance benefits of multithreading.
Graphics Processing Units (GPUs), which are designed for high-throughput parallel
computing, use a form of multithreading that allows many threads to run in parallel on
different processing units within the GPU. CUDA cores in NVIDIA GPUs can
execute thousands of threads simultaneously, making GPUs well-suited for parallel
workloads like deep learning, scientific simulations, and video rendering.
4.3 Supercomputing
In the context of modern computing, multicore processors and Graphics Processing Units
(GPUs) are crucial components that significantly contribute to the efficiency and
performance of a wide range of applications, from scientific computing to graphics rendering
and machine learning. While multicore processors are designed for general-purpose
computing, GPUs are specialized hardware designed for handling highly parallel tasks.
1. Multicore Processors
1.1 Definition
A multicore processor integrates two or more independent processing cores on a single chip,
allowing several instruction streams to execute in parallel.
Cores: Each core can execute instructions concurrently, which means a multicore
processor can handle multiple tasks or threads simultaneously. This is essential for
handling complex, multithreaded applications that require large amounts of
computational power.
Shared Resources: In a multicore processor, multiple cores often share resources like
cache (L1, L2, and L3), memory bus, and I/O controllers. Efficient management of
shared resources is key to optimizing performance in multicore systems.
Parallelism: Multicore processors can perform both task parallelism (multiple tasks
on separate cores) and data parallelism (dividing a large task into smaller chunks for
multiple cores).
1.4 Challenges
Key challenges include managing shared resources (caches, memory bandwidth), maintaining cache
coherence across cores, and writing software that actually exploits the available parallelism.
2. Graphics Processing Units (GPUs)
2.1 Definition
A GPU is a specialized processor built around a very large number of simple cores, designed to
execute highly parallel, data-intensive workloads.
Processing Cores: Unlike CPU cores, which are designed to handle a few threads
with high clock speeds and complex instructions, GPU cores are simpler and
designed for parallelism, making them suitable for applications like image
processing, matrix multiplications, and simulations.
SIMD Model: Most modern GPUs operate on the SIMD (Single Instruction, Multiple
Data) model, where a single instruction is applied to multiple pieces of data
simultaneously. This is highly effective for data-intensive operations such as vector
and matrix operations.
Memory Hierarchy: GPUs have a distinct memory hierarchy optimized for high-
throughput data access:
o Global Memory: Large but slower memory shared across all cores.
o Shared Memory: Faster, smaller memory used by threads within a block.
o Registers: The fastest form of memory, used for thread-local data.
Graphics and Gaming: GPUs are best known for their use in rendering graphics for
video games, movies, and interactive applications. They handle the complex
mathematical computations required for tasks such as texture mapping, lighting
calculations, and 3D rendering.
Machine Learning: Deep learning and other machine learning techniques benefit
greatly from GPUs, as they can perform the massive matrix and vector operations
required for training neural networks in parallel.
Scientific Computing: Simulations in physics, chemistry, and biology can take
advantage of the GPU’s parallel processing capabilities to speed up computational
models.
In modern computing systems, multicore processors and GPUs are often used together to
exploit both types of parallelism—task parallelism from multicore CPUs and data
parallelism from GPUs. This combined architecture is increasingly common in high-
performance systems such as servers, supercomputers, and workstations.
Effective communication between the CPU and GPU is critical for achieving high
performance in heterogeneous systems. Direct Memory Access (DMA), PCI
Express (PCIe), and shared memory allow for high-speed data transfer between the
CPU and GPU.
CPU handles control and high-level logic, while the GPU performs the data-heavy
parallel computations, such as matrix multiplication, image processing, or
simulation tasks.
Deep Learning: In a deep learning model, the CPU may be responsible for
controlling the flow of data, managing input/output operations, and executing
sequential tasks, while the GPU performs the massive matrix calculations required for
training large neural networks.
4. Future Trends and Advancements
Multicore processors are scaling toward higher core counts, with processors like the
AMD Ryzen 9 and Intel Xeon offering up to 64 cores, designed to handle highly
parallel workloads in servers, workstations, and high-performance computers.
GPUs are becoming more powerful, with companies like NVIDIA and AMD pushing
the boundaries of parallel processing. For instance, NVIDIA’s A100 Tensor Core
GPU for deep learning provides over 54 teraflops of processing power, highlighting
the growing importance of GPUs in AI workloads.
The combination of multicore processors and GPUs represents the future of high-
performance computing. Multicore CPUs offer general-purpose processing power and handle
complex, serial workloads, while GPUs excel at parallelizing data-intensive tasks. Together,
they enable faster processing in a variety of fields, including machine learning, scientific
simulations, graphics rendering, and big data processing. As technology evolves, we can
expect even more powerful and efficient systems, integrating these components for enhanced
performance in a wide range of applications.
A well-designed topology ensures that data can be transferred efficiently between processors
and memory, minimizing latency and maximizing throughput.
Communication Pathways: How processors communicate with each other and with
memory units.
Bandwidth and Latency: The speed and efficiency of data transfer between
processors and memory, affecting the overall system performance.
Scalability: The ability to add more processors without significantly degrading
performance.
Fault Tolerance: The system’s ability to continue functioning even if one or more
components fail.
Multiprocessor systems are typically classified into two main types based on their
interconnection structure:
Both topologies have their specific advantages and challenges, with the design of the
interconnection network being key to efficient system operation.
Definition: In a bus-based topology, all processors and memory units are connected
to a single communication bus. The bus serves as the shared medium for data transfer
between processors and memory.
Characteristics:
o Simple Design: The design is simple and cost-effective for small-scale
multiprocessor systems.
o Single Communication Path: All processors share the same bus, meaning
that only one processor can transmit data at a time.
o Scalability Issues: As more processors are added, the bus becomes a
bottleneck, reducing performance due to congestion.
Advantages:
o Cost-effective and easy to implement for small systems.
o Easy to add new processors.
Disadvantages:
o Limited scalability due to the shared bus.
o High contention for bus access leads to performance degradation as the system
scales up.
Example: Bus-based topologies are typically used in small-scale systems like multi-
core desktop processors or embedded systems.
Definition: A fat tree is a type of network topology often used in data centers and
cloud computing environments. It uses a hierarchical, tree-like structure where the
inner nodes (routers) have more bandwidth than the outer nodes, ensuring that the
bottleneck does not occur in the network’s core.
Characteristics:
o High Bandwidth: The architecture ensures that there is sufficient bandwidth
at the core of the network to support large-scale data transfers without
congestion.
o Scalable: Fat tree topologies can be easily scaled by adding more layers or
branching.
o Redundant Paths: It provides multiple paths between any two processors,
improving fault tolerance and reliability.
Advantages:
o Fault Tolerant: Redundant paths ensure that the system remains operational
even if some paths fail.
o Balanced Traffic: The topology balances traffic across the network, reducing
bottlenecks.
Disadvantages:
o Complex Routing: Fat tree topologies require more complex routing
algorithms, and the management of such networks can be more intricate.
o Higher Cost: The design complexity and need for more interconnection
hardware increase the cost.
Example: Fat tree topologies are widely used in data center networks and cloud
computing environments.
1. Tree Topology
1.1 Definition
In a tree topology, processors (or switching nodes) are arranged in a hierarchy: a root node at the
top, with each node connected to a parent above it and to one or more children below it. Messages
between processors travel up toward a common ancestor and back down.
1.2 Characteristics
Communication is hierarchical, and the number of links grows only linearly with the number of
nodes, keeping the wiring cost low.
1.3 Advantages
Scalable: New processors can be added to the tree without affecting the existing
network too much.
Simple Design: The tree structure is relatively simple to design and manage.
1.4 Disadvantages
Single Point of Failure: A failure at the root or any higher-level node can disrupt
communication across the entire system.
Uneven Communication Delay: The distance between nodes can vary depending on
where they are in the tree, which can result in uneven communication latency.
2. Mesh-of-Trees Topology
2.1 Definition
A mesh-of-trees topology is a hybrid that overlays tree-structured connections on a mesh of
processors, so that nodes are linked both through the rows and columns of the grid and through
hierarchical tree links.
2.2 Characteristics
Redundancy: The mesh aspect of the topology ensures that processors can
communicate along multiple routes, avoiding potential bottlenecks in the network.
Hierarchical and Parallel: The topology combines hierarchical communication
(from the tree) and parallel communication (from the mesh), making it flexible for
various types of workloads.
2.3 Advantages
Fault Tolerance: Multiple communication paths ensure that the failure of a processor
or link will not disrupt the system completely.
Balanced Traffic: Traffic is distributed across the mesh and tree structure, reducing
congestion at any single point.
2.4 Disadvantages
Complex Design: The hybrid nature of the topology makes it more complex to
implement and manage compared to simpler topologies like bus or star topologies.
3. Fully Connected (Complete Graph) Topology
3.1 Definition
In a fully connected topology, every processor has a direct, dedicated link to every other processor,
forming a complete graph.
3.2 Characteristics
For n processors, n(n-1)/2 links are required, so the number of connections grows quadratically
with system size.
3.3 Advantages
Low Latency: Data can be sent directly from one processor to another without
passing through intermediate processors.
Fault Tolerance: The system can tolerate failures in individual processors or links, as
alternative paths are always available.
3.4 Disadvantages
High Cost and Complexity: A complete graph requires an extremely large number of
interconnections, which is impractical for large systems due to the high cost and
hardware complexity.
Scalability Issues: As the number of processors increases, the number of connections
grows quadratically, making this topology unscalable for large systems.
4. Star Topology
4.1 Definition
In a star topology, every processor is connected by a dedicated link to a single central node (a
switch or hub), and all communication between processors passes through that central node.
4.3 Advantages
Simple to implement and manage; adding a processor only requires one new link, and the failure of
an individual (non-central) processor does not affect the rest of the network.
4.4 Disadvantages
Single Point of Failure: The failure of the central processor or switch causes the
entire system to fail.
Potential Bottleneck: All communication passes through the central node, which can
become a performance bottleneck as the number of processors increases.
5. Torus Topology
5.1 Definition
A torus topology is a variation of the mesh topology in which the network is structured as a grid
with the first and last rows (and columns) connected to each other. This creates a wraparound effect,
turning each row and column into a continuous loop.
5.2 Characteristics
2D/3D Grid: In the simplest form, processors are arranged in a 2D grid, and
communication paths "wrap" around the edges, ensuring that every processor has a
direct connection to its neighbors.
5.3 Advantages
Reduced Latency: The wraparound feature reduces the overall distance for
communication between processors, improving latency compared to standard mesh
topologies.
Scalability: Like the mesh topology, torus topologies are scalable and can support a
large number of processors without congestion.
5.4 Disadvantages
Complex Routing: The routing algorithms become more complicated because of the
wraparound connections, especially as the number of processors increases.
Network Management: Managing a torus network can be more complex due to its
topology, requiring more sophisticated routing protocols.
6. Clusters and Clustered Interconnects
6.1 Definition
In a clustered network, processors are grouped into smaller sets (clusters), and each
cluster is connected to a central interconnection network. This type of topology is
often used in distributed systems and data centers.
6.2 Characteristics
Processors within a cluster communicate over a fast local interconnect, while traffic between
clusters travels over a higher-level network that links the clusters together.
6.3 Advantages
Modular Design: Clusters allow for modular expansion, making it easier to scale the
system.
Fault Tolerance: If one cluster fails, the rest of the system can continue functioning.
6.4 Disadvantages
Communication between clusters is slower than communication within a cluster, and the central
interconnect linking the clusters can become a bottleneck as the system grows.
The choice of multiprocessor network topology is a critical factor that influences the
performance, scalability, and fault tolerance of a multiprocessor system. Different
topologies offer varying levels of efficiency in terms of communication speed, data transfer
capacity, and system reliability. For example, bus-based topologies are simple but not
scalable, while mesh and hypercube topologies offer higher scalability and lower latency at
the cost of increased complexity. As systems continue to grow in size and demand, the role of
network topology becomes increasingly significant in the design and performance of
multicore and multiprocessor systems, especially in supercomputing, cloud computing, and
parallel processing applications.