UNIT-1.2
Architectural Classification Schemes
◦ Architectural classification schemes are as follows:
1. Flynn's Classification
2. Shore's Classification
3. Feng's Classification
•The sequence of instructions executed by the CPU forms the instruction stream. Similarly, there is a bidirectional flow of operands between the processor and memory; this flow of operands is called the data stream.
Flynn's classification
M. J. Flynn introduced a scheme for classifying computer system architectures. The classification is based on the multiplicity of instruction streams and data streams observed by the CPU during program execution.
1) Single Instruction and Single Data stream (SISD):
•In this organisation, sequential execution of instructions is performed by one CPU containing a
single processing element (PE), i.e., ALU under one control unit.
•SISD machines are conventional serial computers that process only one stream of instructions
and one stream of data.
•There is no instruction-level parallelism or data-level parallelism in this organisation.
2) Single Instruction and Multiple Data stream (SIMD):
•In this organisation, a single control unit broadcasts one instruction stream to multiple processing elements, and each processing element operates on its own data stream (SIMD machines are discussed in detail later in this unit).
3) Multiple Instruction and Single Data stream (MISD):
•Each control unit handles one instruction stream and processes it through its corresponding processing element, but each processing element processes only a single data stream at a time.
• Therefore, for handling multiple instruction streams and a single data stream, multiple control units and multiple processing elements are organised in this classification.
• All processing elements interact with the common shared memory for the organisation of a single data stream.
•This classification is not popular in commercial machines. But for specialized applications, the MISD organisation can be very helpful: for real-time computers that need to be fault tolerant, several processors execute on the same data to produce redundant results.
4) Multiple Instruction and Multiple Data stream (MIMD)
•In this organization, multiple processing elements and multiple control units are organized as in
MISD.
•But the difference is that now in this organization multiple instruction streams operate on multiple
data streams.
•Therefore, for handling multiple instruction streams, multiple control units and multiple processing
elements are organized such that multiple processing elements are handling multiple data streams
from the Main memory.
•The processors work on their own data with their own instructions. Tasks executed by different
processors can start or finish at different times. They are not lock-stepped, as in SIMD computers, but
run asynchronously.
•This classification represents the true parallel computer: in the real sense, the MIMD organisation is said to be a parallel computer. All multiprocessor systems fall under this classification.
• Ex: C.mmp, Burroughs D825, Cray-2, S1, Cray X-MP, HEP, Pluribus, IBM 370/168 MP, Univac 1100/80, Tandem/16, IBM 3081/3084, C.m*, BBN Butterfly, Meiko Computing Surface (CS-1), FPS T/40000, iPSC. The MIMD organization is the most popular for parallel computers; in the real sense, parallel computers execute their instructions in MIMD mode.
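To make this asynchronous, independent execution concrete, here is a minimal sketch using POSIX threads (the two task functions and their data are purely illustrative, not taken from any of the machines listed above): each thread runs its own instruction stream on its own data and finishes at its own pace, which is the essence of MIMD operation.

#include <pthread.h>
#include <stdio.h>

/* two different instruction streams operating on different data */
static void *sum_task(void *arg) {
    int n = *(int *)arg, s = 0;
    for (int i = 1; i <= n; i++) s += i;      /* stream 1: integer summation */
    printf("sum thread: %d\n", s);
    return NULL;
}

static void *scale_task(void *arg) {
    double *x = (double *)arg;
    *x *= 2.5;                                /* stream 2: floating-point scaling */
    printf("scale thread: %f\n", *x);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int n = 100;
    double x = 4.0;

    /* each "processor" executes its own program asynchronously (not lock-stepped) */
    pthread_create(&t1, NULL, sum_task, &n);
    pthread_create(&t2, NULL, scale_task, &x);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}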
2. Shore's Classification/Taxonomy
•Shore's taxonomy is a framework introduced by Shore for categorizing different types of parallel computer architectures based on how they organize and coordinate memory and processing elements.
•It provides a way to classify parallel systems into distinct categories, focusing on the relationship between processing elements, memory organization, and communication methods.
Type-I
•The classical von Neumann architecture falls in this category; it is similar to the SISD type of Flynn's classification. A Type-I machine contains a control unit, a processing unit, and data memory. The processor works on whole memory words and is therefore called word-serial, bit-parallel.
•Ex: CDC 7600, Cray-1
(Diagram: Type-I organization - Control Unit, Horizontal Processing Unit, Memory accessed as word slices)
•Horizontal Processing Unit (HPU): a processing unit that operates on one word at a time, handling all the bits of that word in parallel (word-serial, bit-parallel operation).
•Vertical Processing Unit (VPU): a processing unit that operates on one bit slice at a time, handling the same bit position of many words in parallel (word-parallel, bit-serial operation).
Type-II
•Computers in this category are similar to those of Type-I except that they work on bit-slices of memory (vertical slices) rather than word-slices (horizontal slices). The processing unit works in word-parallel, bit-serial mode.
• Ex: ICL DAP, Goodyear Aerospace STARAN.
(Diagram: Type-II organization - Control Unit, Vertical Processing Unit, Memory accessed as bit slices)
3. Feng's Classification
•Feng's classification groups machines by their degree of parallelism: the letters W and B indicate whether words and bits are processed, and the terms "serially" and "in parallel" describe the mode of operation.
WSBS: WSBS (Word-Serial, Bit-Serial) has been called bit-serial processing because one bit is processed at a time.
•Both bits and words are handled serially, meaning the system processes one bit of one word at a time before moving to
the next.
•This mode emphasizes minimal hardware complexity but has lower processing speed.
WPBS: It has been called bit-slice processing because an m-bit slice (the same bit position taken from m words) is processed at a time, i.e., bits are processed serially, but words are processed in parallel.
WSBP: It has been called word-slice processing because one word of n bits is processed at a time, i.e., bits are processed in parallel, but words are processed sequentially.
WPBP: It is known as fully parallel processing, in which an array of n × m bits is processed at one time. Both bits and words are processed in parallel, providing the highest speed.
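As a worked illustration (using the common textbook notation, which is not defined earlier in these notes, where n is the word length in bits and m is the number of words processed simultaneously): a conventional word-organized machine with 64-bit words works in WSBP mode with n = 64, m = 1, giving a maximum degree of parallelism of n × m = 64 bits per cycle; an associative bit-slice machine that touches one bit of 256 words at a time works in WPBS mode with n = 1, m = 256, giving a degree of parallelism of 256.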
Memory Access Classification
Parallel architectures can be classified into two major categories in terms of memory arrangement:
• Shared memory
• Message passing or distributed memory
This classification constitutes a subdivision of the MIMD parallel architecture; the two categories are also known as:
•Shared memory architecture: tightly coupled architecture
• Distributed memory architecture: loosely coupled architecture
Shared Memory Multiprocessor
•Multiple processors share a common memory unit comprising a single or
several memory modules
• All the processors have equal access to the memory modules.
•The memory modules are seen as a single address space by all the processors.
•The memory modules store data as well as serve to establish communication
among the processors via some bus arrangement.
•Communication is established through memory access instructions
◦ processors exchange messages between one another by one processor writing data
into the shared memory and another reading that data from the memory
• The executable programming codes are stored in the memory for each processor to execute
•Each program can gain access to all data sets present in the memory if necessary
•Access to these memory modules can easily be controlled through appropriate programming mechanisms.
•However, this architecture suffers from a bottleneck problem when a number of processors endeavor to access the global memory at the same time. This limits the scalability of the system.
The shared memory multiprocessor systems can further be divided into three models, based on the manner in which the shared memory is accessed.
Uniform Memory Access Model (UMA)
•In the case of UMA architectures, the memory access time to the different parts of the memory is almost the same.
•UMA architectures are also called symmetric multiprocessors.
A UMA architecture comprises two or more processors with identical characteristics, as follows.
The processors:
◦ share the same main memory and I/O facilities
◦ are interconnected by some form of bus-based interconnection scheme
The memory access time is approximately the same for all processors.
•Processors perform the same functions under control of an operating system, which provides interaction
between processors and their programs at the job, task, file and data element levels.
Non-Uniform Memory Access Model (NUMA)
•In shared memory multiprocessor systems, each processor can have its own local memory. Together,
all the local memories form the shared global memory. This means the global memory is distributed
across all processors.
•A processor can access its own local memory quickly and consistently since it is directly connected
to it. However, accessing the local memory of another processor is slower and depends on its
location. As a result, not all memory locations can be accessed at the same speed.
Cache-Only Memory Access Model (COMA)
•In shared memory multiprocessor systems, each processor may have a cache memory to speed
up instruction execution. In the NUMA model, if cache memories replace local memories, it
becomes the COMA model. Here, all cache memories together create a shared global memory
space. However, accessing a cache memory from another processor is also slower and varies
depending on its location, making the access non-uniform in this model.
Message Passing Multicomputer/Distributed Systems
•In this architecture, each processor has its own local memory, and there is no globally shared memory; processors must therefore communicate directly with one another, which is a drawback of the approach.
•The speed performance of distributed memory architecture largely depends upon how the processors are
connected to each other.
•It is impractical to connect each processor to every other processor through independent links; this can work for a very small number of processors but becomes impossible as the number of processors in the system increases.
The most common solution is to use specialized bus networks to connect all the processors in the system
in order that each processor can communicate with any other processor attached to the system.
Memory Issues: Shared Vs Distributed
(Comparison figure: shared memory organization vs. distributed memory organization)
SIMD
In synchronous parallel processing architecture, multiple processing elements (PEs) perform
tasks under a common control unit. Each PE executes the same machine language program, as
instructions and data are broadcast by the control unit.
•In these systems, a single control unit dispatches the same instruction to various processors, enabling
parallel execution across multiple data streams.
•This model is particularly efficient for vector and matrix computations, making SIMD machines
well-suited for applications requiring large-scale data processing. The sequence of different data
items processed by SIMD systems is often referred to as a vector.
Array processor & SIMD Machines
An array processor, a type of SIMD computer, handles single instruction streams applied to multiple data
streams. SIMD machines, also known as vector computers or processor arrays, excel at executing vector
and matrix operations efficiently.
The architecture typically features a single control unit that reads instructions pointed to by a single
program counter (PC). These instructions are decoded and control signals are sent to the processing units.
Data is supplied to and retrieved from the processing units via a memory system with multiple data paths.
1.SIMD Array Processor using interconnection network
Purpose:
◦ This architecture is designed to process large amounts of data in parallel by applying a single instruction across
multiple data points simultaneously.
◦ It's suitable for tasks like matrix computations, image processing, and other parallelizable workloads.
Components:
◦ Processing Elements (PEs):
◦ Each processing element is responsible for performing computations on a subset of the data.
◦ Each PE has its own local memory for storing data specific to its task.
◦ Global Control Unit and Scalar Processor:
◦ This unit sends the same instruction to all the processing elements. It acts as the "brain" of the SIMD system.
◦ The scalar processor handles non-parallel tasks that cannot be split among the PEs.
◦ Interconnection Network:
◦ The interconnection network connects the processing elements and allows them to communicate with each other
when data exchange or synchronization is needed.
Working:
◦ The global control unit sends a single instruction to all processing elements.
◦ Each PE operates on its own portion of the data, performing the same operation but on different pieces of data.
◦ The interconnection network allows data sharing or communication among the PEs as needed.
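A minimal sketch of this working in plain C (PE_COUNT, local_a, and the other names are illustrative, not part of any real SIMD machine's interface): the outer loop plays the role of the control unit broadcasting one instruction per element position, and the inner loop stands in for all PEs executing that instruction at the same time on their own local data.

#include <stdio.h>

#define PE_COUNT 4   /* number of processing elements (illustrative) */
#define LOCAL_N  2   /* elements held in each PE's local memory      */

int main(void) {
    /* each PE holds its own slice of the data in its local memory */
    int local_a[PE_COUNT][LOCAL_N] = {{1, 2}, {3, 4}, {5, 6}, {7, 8}};
    int local_b[PE_COUNT][LOCAL_N] = {{8, 7}, {6, 5}, {4, 3}, {2, 1}};
    int local_c[PE_COUNT][LOCAL_N];

    /* the control unit broadcasts ONE instruction ("c = a + b") for each
       element position; conceptually all PEs execute it simultaneously  */
    for (int i = 0; i < LOCAL_N; i++) {
        for (int pe = 0; pe < PE_COUNT; pe++) {   /* "all PEs at once" */
            local_c[pe][i] = local_a[pe][i] + local_b[pe][i];
        }
    }

    for (int pe = 0; pe < PE_COUNT; pe++)
        printf("PE%d: %d %d\n", pe, local_c[pe][0], local_c[pe][1]);
    return 0;
}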
2.SIMD Array Processor using Alignment network
Purpose:
This is another variation of the SIMD architecture, emphasizing data alignment for efficient memory access. It is more
structured to handle cases where memory and processing elements need precise alignment.
Components:
◦ Memory Modules:
Data is stored in multiple memory banks (Memory 1, Memory 2, ..., Memory N).
Each memory module corresponds to a specific processing element.
◦ Processing Elements (PEs):
These are the same as in the first diagram, performing parallel computations on subsets of data.
◦ Global Control Unit and Scalar Processor:
The central unit sends instructions to all processing elements and manages overall system control.
◦ Interconnection Network (Alignment Network):
This network connects memory modules to processing elements and ensures the correct data is aligned with the
respective PE.
It ensures data consistency and improves data access efficiency.
Working:
The global control unit sends a single instruction to all PEs.
Data from the memory modules is sent to the respective PEs through the alignment network.
The alignment network ensures that the data is correctly routed and synchronized with the corresponding PE for efficient
processing.
Internal structure of PE in SIMD
In SIMD, multiple processing elements (PEs) execute the same instruction simultaneously on different data. Internally, each PE typically contains its own ALU, a set of working registers, a local memory, and data-routing ports for exchanging operands with neighboring PEs.
Programming Constraints: To make SIMD work efficiently, certain rules must be followed:
◦ Data Alignment: Data must be stored neatly in memory at specific positions so the system can quickly access it.
Misaligned data slows things down.
◦ Data Arrangement: Data should be organized to minimize cache misses, ensuring efficient access during
computation.
◦ Memory System: SIMD uses a memory system divided into multiple "lanes" (called banks). This allows several
data points to be accessed or processed at the same time, speeding up operations.
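A small sketch of the banked ("lane") memory idea (the low-order interleaving scheme and the NUM_BANKS value are assumptions made for illustration): when consecutive vector elements are interleaved across banks, any group of NUM_BANKS consecutive elements falls into distinct banks and can therefore be fetched concurrently.

#include <stdio.h>

#define NUM_BANKS 8   /* number of memory banks / lanes (illustrative) */

int main(void) {
    /* low-order interleaving: element i of a unit-stride vector lives in
       bank i % NUM_BANKS, so elements 0..7 sit in 8 different banks and
       can be accessed in parallel                                       */
    for (int i = 0; i < 16; i++)
        printf("element %2d -> bank %d\n", i, i % NUM_BANKS);
    return 0;
}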
SIMD (Single Instruction, Multiple Data) processes multiple data points simultaneously using processing
elements (PEs).Each PE communicates with others through ports (left, right, top, bottom).
There are two main communication models in SIMD systems:
1. Single-Port Communication
2. All-Port Communication
Single-Port Communication:
•A PE processes a piece of data and sends the result to its neighbor through one port. At the same time, it can receive data from another neighbor using a different port.
•While this method is simple, it limits the speed and efficiency of data transfer in larger systems.
All-Port Communication:
•A PE in the middle of a grid sends data to all four neighbors (top, bottom, left, right) at the same time.
•This increases efficiency and reduces delays in SIMD operations, especially for tasks requiring high parallelism.
In SIMD (Single Instruction, Multiple Data) architectures, tasks like sorting or comparison are critical, especially when dealing with arrays or matrices. To handle these tasks efficiently, comparators and sorting networks are employed.
1. Comparators
A comparator takes two inputs and places them on its outputs in sorted order.
Types of Comparators:
◦ Increasing (ascending) comparators, which route the smaller input to the upper output and the larger input to the lower output.
◦ Decreasing (descending) comparators, which route the larger input to the upper output and the smaller input to the lower output.
2. Comparison Networks
A comparison network is a system of multiple comparators connected together to process data in parallel.
Key Features:
◦ Representation: A directed acyclic graph (DAG) where:
◦ Nodes: Represent inputs and outputs.
◦ Edges: Represent data flow through comparators.
◦ Depth: The number of stages (or levels) of comparators in the network.
3.Sorting Networks
A sorting network is a specific type of comparison network designed to ensure the outputs are sorted regardless
of the input order.
Key Features:
◦ Size: The total number of comparators used in the network.
◦ Depth: The number of stages (or levels) required to fully sort the input data.
◦ Monotonicity: Ensures that the output sequence is always sorted.
Sorting networks use comparators to sort inputs in parallel, which makes them highly efficient for SIMD systems.
A sorting network is a system that organizes numbers into a sorted order (like smallest to largest). It does this using
comparators, which compare two numbers at a time and arrange them in the correct order.
How It Works:
Input Wires: Numbers (data) enter the sorting network.
Columns of Comparators:
◦ Comparators are like little machines that take two numbers, compare them, and arrange them:
◦ The smaller number goes to one side.
◦ The larger number goes to the other side.
◦ These comparators are arranged in columns, and each column works on multiple pairs of numbers at the same time.
Interconnection Network:
◦ After one column finishes, the numbers are sent to the next column for further sorting.
Output Wires: By the end, the numbers come out fully sorted.
What is a Comparator?
A comparator is the basic unit in the sorting network. It compares two numbers:
If sorting in ascending order:
◦ The smaller number goes to the top.
◦ The larger number goes to the bottom.
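The following C sketch shows one comparator (compare_exchange) wired into a small sorting network for four inputs; the particular network used here (five comparators arranged in three stages) is a standard construction and appears only as an illustration. Comparators within the same stage touch disjoint wires, so each stage could be executed in parallel on a SIMD machine.

#include <stdio.h>

/* one ascending comparator: the smaller value goes to *top, the larger to *bottom */
static void compare_exchange(int *top, int *bottom) {
    if (*top > *bottom) {
        int tmp = *top;
        *top = *bottom;
        *bottom = tmp;
    }
}

int main(void) {
    int v[4] = {7, 1, 9, 3};

    /* stage 1: two independent comparators (could run in parallel) */
    compare_exchange(&v[0], &v[1]);
    compare_exchange(&v[2], &v[3]);
    /* stage 2: two more independent comparators */
    compare_exchange(&v[0], &v[2]);
    compare_exchange(&v[1], &v[3]);
    /* stage 3: one final comparator */
    compare_exchange(&v[1], &v[2]);

    printf("%d %d %d %d\n", v[0], v[1], v[2], v[3]);   /* prints: 1 3 7 9 */
    return 0;
}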
Vector Processing
•Vector processing involves performing operations on entire arrays or vectors in a single step, rather than on individual elements one at a time.
•Vector Processors are special-purpose computers designed to handle tasks that involve large datasets,
poor locality, and long run times.
•These processors are particularly effective for mathematical operations performed on multiple data
elements simultaneously.
•They use vector instructions, which provide a mechanism for efficient and high-speed computation.
•Vector processors are also known as array processors
Key Features of Vector Processors
Scalar and Vector Instructions:
•They support both scalar (single data element) and vector (multiple data elements) instructions.
•Vector instructions operate in a highly pipelined mode for efficiency.
Two Families of Vector Processors:
•Memory-to-Memory Architecture:
•Directly moves vector operands from memory to pipelines and back to memory.
•Register-to-Register Architecture:
•Uses vector registers to act as intermediaries between memory and functional pipelines.
Advantages of Vector Processors
Highly Pipelined Operations:
◦ This architecture allows for continuous and efficient processing of data streams.
Reduced Overhead:
◦ The hardware design minimizes loop overhead and address generation delays, improving performance for repetitive
tasks.
•Traditional supercomputers are often vector-based because of their ability to efficiently handle large-scale
computational problems.
•These systems are ideal for tasks in scientific computation like simulations, modeling, and numerical analysis.
1.Loop Overhead:
•In traditional processors, executing a loop involves extra steps, such as:
•Checking the loop condition (e.g., i < n).
•Incrementing or updating the loop counter (e.g., i++).
•Jumping back to the beginning of the loop if the condition is true.
•These repeated steps add "overhead," meaning extra work that doesn’t directly contribute to the actual
computation.
In vector processors:
A single vector instruction replaces the entire loop, eliminating the need for these repetitive checks and updates.
For example:
•Instead of looping over each element of an array, one vector instruction processes all elements at once.
2.Address Generation Delays:
•In traditional processors, each element in an array or memory location requires calculating its address individually
(e.g., A[i]).
•This can cause delays because the processor has to compute the address for each element repeatedly.
In vector processors:
•The hardware uses predictable addressing patterns (like strides), allowing it to generate addresses for the entire
vector efficiently without recalculating for every element.
3.Improved Performance for Repetitive Tasks:
•Tasks that involve repetitive operations (e.g., matrix multiplication, array addition) benefit significantly because:
•The loop overhead is removed.
•Address generation is faster and happens in parallel with computations.
Why is a vector processor needed?
•Each computation in a vector works independently of the others, so the hardware doesn't need to check for
data conflicts within a vector instruction.
Ex:
•In a vector instruction, each element in a vector (e.g., an array) is processed separately and does not depend
on the result of any other element in the same vector.
•Adding two vectors A and B to produce C:
C[i] = A[i] + B[i]
Each C[i] only depends on A[i] and B[i]. It does not depend on C[i-1] or C[i+1]. This independence allows
the hardware to compute all elements in parallel or in a pipeline without worrying about conflicts.
•The hardware only needs to check for data conflicts between two vector instructions once for each vector,
not for every individual element.
•Vector instructions that access memory follow a predictable pattern, making memory access more efficient.
•Since a single vector instruction can replace an entire loop, there are no control issues (like deciding when
to exit the loop) that usually happen with regular loops.
Vector Processing Principles
•A vector is a group of numbers (data items) of the same type stored in memory. These numbers are usually arranged with a fixed distance between each item, called the "stride".
•In a vector processor, a stride refers to the step size or the distance (in memory locations) between
consecutive elements of a vector that the processor needs to access. It determines how the processor accesses
data in memory for operations involving vectors.
•Why is Stride Important?
•In vector processors, data is often stored in memory in non-contiguous blocks.
•The stride tells the processor how far apart the elements of a vector are in memory.
• Stride enables efficient handling of both contiguous and non-contiguous data.
Types of Stride:
•Unit stride: consecutive elements are adjacent in memory (stride = 1), as when traversing a row of a row-major array.
•Non-unit (constant) stride: elements are separated by a fixed distance greater than one, as when traversing a column of a row-major matrix.
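A brief C sketch of unit versus non-unit stride (the matrix and its dimensions are illustrative): with row-major storage, the elements of a row form a unit-stride vector, while the elements of a column are separated by a constant stride equal to the row length.

#include <stdio.h>

#define ROWS 4
#define COLS 5

int main(void) {
    /* row-major storage: element (r, c) lives at index r * COLS + c */
    double m[ROWS * COLS];
    for (int i = 0; i < ROWS * COLS; i++) m[i] = i;

    /* a row is a unit-stride vector: consecutive memory locations */
    for (int c = 0; c < COLS; c++)
        printf("%4.0f ", m[2 * COLS + c]);    /* stride = 1 element */
    printf(" <- row 2\n");

    /* a column is a strided vector: stride = COLS elements */
    for (int r = 0; r < ROWS; r++)
        printf("%4.0f ", m[r * COLS + 3]);    /* stride = COLS elements */
    printf(" <- column 3\n");
    return 0;
}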
•To save time, both instructions and data are processed using pipelines, which help reduce
decoding time.
Primary Components of VMIPS (a vector extension of the MIPS architecture)
Vector Registers:
•VMIPS has 8 vector registers, each capable of holding 64 elements.
•Each register has at least two read ports and one write port for efficient operations.
Program Loading:
•Programs and data are loaded into memory using the main computer. The scalar control unit manages the decoding of
instructions.
Vector Operand:
•A vector operand consists of an ordered set of elements. The number of elements is called the vector length.
•Each element can be a floating-point number, integer, logical value, or character.
Vector Computers:
•These computers build on scalar processors and include an optional vector processor.
•When a vector instruction is decoded, it is sent to the vector processor’s control unit to manage data flow and execution.
Instruction Decoding:
•After decoding, instructions are checked for vector or scalar operations.
•Scalar operations are directly executed by the scalar processor, while vector operations are handled by the vector control
unit.
Data Elements in Registers:
•Each vector register holds multiple data elements, forming a vector.
•The vector length register keeps track of how many elements the vector contains.
•Registers work in sequence, ensuring data is accessed efficiently.
Vector Processor Models
There are two main types of architectures for vector processors:
Register-to-Register Model:
◦ All vector operations (except loading and storing data) happen between the vector registers.
◦ Uses a fixed number of vector registers to store operands, intermediate results, and final outputs.
◦ The length of each vector register is typically fixed (e.g., 64 elements of 64 bits each in the Cray series supercomputers).
Memory-to-Memory Model:
◦ Operates directly on data in memory, with primary memory holding both operands and results.
◦ Vector instructions handle fetching and storing data in large chunks (super words).
Examples: CDC Star-100 and TI ASC machines, which are memory-to-memory architectures.
Advantages of Vector Processors
Lower Instruction Bandwidth:
◦ Fewer fetches and decodes reduce the need for instruction bandwidth.
No Memory Wastage:
◦ Memory is used more efficiently with reduced overhead.
Scalable Performance:
◦ Performance improves by adding more hardware resources.
Disadvantages of Vector Processors
Interrupt Handling:
◦ Managing precise interrupts can be challenging.
Programming Complexity:
◦ Programmers or compilers must adapt programs for vectorization.
GPU Thread Organization (CUDA terminology)
Block:
◦ A collection of threads grouped together.
Grid:
◦ A collection of blocks that form a larger computational unit.
Warp:
◦ A group of 32 threads that execute together on the GPU.
Kernel:
◦ The program (or function) that runs on the GPU.
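A minimal CUDA sketch tying these four terms together (the vector-add kernel and the launch parameters are illustrative, not drawn from these notes): the kernel is the function every GPU thread executes; threads are grouped into blocks, the blocks form the grid, and the hardware schedules each block's threads in warps of 32.

#include <cstdio>
#include <cuda_runtime.h>

/* the kernel: the function that runs on the GPU, once per thread */
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    /* each thread handles one element, identified by its block and thread index */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));   /* unified memory, for brevity */
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;                                   /* one block = 256 threads = 8 warps */
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;    /* the blocks together form the grid */
    vec_add<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);   /* prints 3.000000 */
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}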
Tesla GPU Architecture
•Based on a scalable processor array.
•Key components:
1.TPC (Texture/Processor Cluster): Handles textures and processing.
2.SM (Streaming Multiprocessor): Executes parallel threads.
3.SP (Streaming Processor): The basic computation unit.
4.ROP (Raster Operation Processor): Manages rendering tasks like blending and depth calculations.
Block Diagram Example (GeForce 8800 GPU):
Contains 128 streaming processors (SPs) organized as 16 SMs grouped into 8 TPCs, together with 6 ROP/memory partitions connected to external DRAM.
Memory:
◦ Uses external DRAM for data storage.
◦ ROP units perform memory-related operations like blending and depth tests.
Interconnection Network:
◦ Transfers data between GPU components.
◦ Connects the SPA to DRAM through a two-level cache.
◦ Each TPC (Texture/Processor Cluster) in the SPA works like a mini processor, managing tasks efficiently.
◦ GPUs scale from small setups (1 TPC) to high-performance setups (up to 8 TPCs or more).
Work Distribution:
◦ The input assembler sends data to the processors (like geometry, vertex, and pixel shaders).
◦ Data is processed and distributed evenly among the GPU units to ensure smooth execution.
Memory Management in GPUs
◦ Data is read and written through load/store instructions that work with three types of memory spaces:
◦ Shared Memory: Fast memory shared by threads within the same group (Streaming Multiprocessor or SM).
◦ Global Memory: Large memory shared across all threads in the GPU.
◦ Local Memory: Per-thread private memory, used for data that does not fit in a thread's registers.
Memory Types:
◦ The GPU supports global, shared, and local memory for different tasks:
◦ Global Memory: Used for large datasets and shared among all threads.
◦ Shared Memory: Fast and used for data shared between threads in the same SM.
◦ Local Memory: Private to each individual thread.
◦ It supports advanced memory like DDR2 or GDDR3, with speeds up to 1 GHz, for high bandwidth and fast processing.
Partitions:
◦ Global memory is divided into 6 partitions, each handling 1/6 of the total physical memory space.
Instruction-Level Support for Parallel Programming
Parallelism refers to performing multiple tasks simultaneously to improve efficiency. It can occur at the following levels:
Job Level:
◦ Independent jobs run at the same time on the same system.
◦ Example: Running multiple programs (e.g., a video player and a browser) simultaneously.
Program Level:
◦ Multiple tasks work together to solve a single large problem.
◦ Example: Different software modules collaborate to process data.
Instruction Level:
◦ A single instruction (e.g., adding two numbers) is broken into smaller steps (sub-instructions).
◦ Sub-instructions are executed in parallel using techniques like pipelining.
◦ Example: Adding two arrays element-wise.
Bit Level:
◦ Individual bits in a word (binary representation) are processed.
◦ If bits are processed one after another, it is bit-serial. If processed at the same time, it is bit-parallel.
Levels of Parallelism
•Instruction-Level Parallelism (Fine-Grained)
•Loop-Level Parallelism (Fine-Grained)
•Procedure-Level Parallelism (Medium-Grained)
•Subprogram-Level Parallelism (Medium to Coarse-Grained)
•Job or Program-Level Parallelism (Coarse-Grained)
Instruction-Level Parallelism (ILP):Breaking a program into individual instructions that can run
simultaneously.
Grain:
◦ A grain is the smallest unit of work in parallel processing.
◦ It represents a portion of the overall computation that can be performed independently.
Hardware Examples:
◦ Processors like VLIWs (Very Long Instruction Word) and superscalar processors benefit from ILP.
Loop-Level Parallelism:
Instruction Count:
◦ If the number of instructions in a loop is less than 500, the loop can be considered for parallelism.
◦ If iterations of the loop are independent of each other, they can be handled either by a pipeline or a SIMD
(Single Instruction, Multiple Data) machine.
Optimized Programs:
◦ Well-optimized program structures are designed to execute efficiently on parallel or vector machines.
Challenges:
◦ Some loops (like recursive loops) are difficult to handle in parallel.
Granularity:
◦ Loop-level parallelism is still considered fine-grained computation, as it deals with individual iterations of
loops.
Cross-Iteration Parallelism in loop-level parallelism refers to the ability to execute tasks from
different iterations of a loop concurrently. This allows the parallelization to extend beyond the
boundaries of a single loop iteration, increasing the overall efficiency of parallel execution.
Procedure-Level Parallelism:
•Medium-sized grain, typically involving less than 2000 instructions.
•Difficult to detect parallelism due to fewer instructions and the need for inter-procedural dependence
analysis.
•Communication requirements are lower compared to instruction-level parallelism.
•Multitasking operations can benefit at this level.
Subprogram-Level Parallelism:
•Operates at the job step level, with thousands of instructions, typically medium- or coarse-grained.
•Job steps can overlap across different jobs, enabling multi-programming.
•No compilers are currently available to exploit medium- or coarse-grain parallelism.
Examples:
◦ Executing different functions in parallel in a multi-threaded program.
◦ Running different modules of a large application simultaneously.
Job or Program-Level Parallelism:
Job or Program-Level Parallelism refers to the highest level of parallelism where independent jobs or programs are
executed concurrently. This level of parallelism focuses on utilizing the computational power of a parallel computer to run
multiple, unrelated tasks simultaneously.
Key Characteristics
Suitable Hardware:
◦ Ideal for Machines with Fewer Powerful Processors:
◦ Fewer, high-performance processors are well-suited for executing independent programs that demand significant
computational power.
◦ Less Effective for Many Small Processors:
◦ Machines with numerous smaller processors (e.g., GPUs) are better suited for fine-grained parallelism rather than
job-level tasks.
Practical Use Cases
Cluster Computing:
◦ Independent jobs distributed across nodes in a cluster for parallel execution.
Cloud Computing:
◦ Running different user programs on virtual machines simultaneously.
Scientific Simulations:
◦ Multiple simulations or experiments running concurrently on a supercomputer.
Avoiding Dependencies in Parallel Execution
•Hardware Terminology:
•Data Hazards:
•RAW (Read After Write): a hazard occurs when an instruction needs a value before a previous instruction has finished writing it; this reflects a true data dependence.
•WAR (Write After Read): a hazard occurs when an instruction writes a resource before a previous instruction has finished reading it.
•WAW (Write After Write): a hazard occurs when two instructions write the same resource and the writes could complete in the wrong order.
•Software Terminology:
•Data Dependencies:
•Name Dependency: arises when two instructions use the same register or variable name without any actual flow of data between them.
•True Dependency: arises when an instruction depends on the actual data produced by a previous instruction.
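A small straight-line C fragment illustrating the three hazard types (the variable names are arbitrary): a pipelined or out-of-order processor executing these statements would have to respect each of the orderings marked in the comments.

#include <stdio.h>

int main(void) {
    int b = 2, c = 3, e = 4, f = 5, g = 6, h = 7, i = 8;
    int a, d;

    a = b + c;   /* I1: writes a                                                  */
    d = a + e;   /* I2: RAW - reads a, so it needs I1's result first              */
    e = f + g;   /* I3: WAR - writes e, which I2 must have read before this write */
    a = h + i;   /* I4: WAW - writes a again; its write must not land before I1's */

    printf("a=%d d=%d e=%d\n", a, d, e);   /* prints: a=15 d=9 e=11 */
    return 0;
}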
Pipelining and Stalls
•When an instruction in a pipeline needs a result that an earlier (dependent) instruction has not yet produced, the pipeline must stall, inserting idle cycles until the result becomes available; the techniques that follow reduce such stalls.
Loop Unrolling
Loop unrolling is an optimization technique used in programming and compiler design to
improve the performance of loops. It involves reducing the overhead of loop control by
executing multiple iterations of the loop within a single iteration.
◦ The compiler or programmer duplicates the body of the loop multiple times, reducing the number of iterations.
◦ The loop's control structure (e.g., incrementing the counter and checking the condition) executes less frequently, improving efficiency.
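A sketch of loop unrolling in C (the arrays and the unroll factor of 4 are illustrative; a real compiler would also generate cleanup code for trip counts that are not a multiple of the unroll factor):

#include <stdio.h>

#define N 1024   /* assumed to be a multiple of 4 for this sketch */

int main(void) {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* original loop: N iterations, so N compare/increment/branch steps */
    /* for (int i = 0; i < N; i++) c[i] = a[i] + b[i];                  */

    /* unrolled by 4: the body is duplicated, so the loop control
       (compare, increment, branch) runs only N/4 times                 */
    for (int i = 0; i < N; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }

    printf("c[10] = %f\n", c[10]);   /* prints 30.000000 */
    return 0;
}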
Vector Processing for Loop-Level Parallelism
◦ Vector processing is a method of loop-level parallelism where a single instruction operates on multiple data elements simultaneously.
◦ It is designed to handle repetitive operations within loops efficiently, particularly when the same operation is applied to a large set of data (such as arrays or matrices).
Alternative Methods:
◦ If vector processing is not used, parallelism can still be achieved through pipelined execution of independent instructions, which requires distinguishing parallel instructions from dependent ones:
Parallel Instructions:
◦ If two instructions are parallel, they can execute simultaneously in a pipeline of arbitrary depth without
causing stalls.
Dependent Instructions:
◦ If two instructions are dependent, they cannot execute simultaneously; they must be executed in order, although they may often be partially overlapped.
Cache Coherence
Two main methods are used to track the sharing status of data blocks:
1)Directory-based
2)Snooping
Directory-based protocol
◦ The sharing status of a block of memory is maintained in a central location, called a directory.
◦ This centralized directory tracks which processors are sharing the data.
◦ It is commonly used in Distributed Shared Memory (DSM) systems.
◦ It supports larger processor counts but has higher overhead compared to other methods.
Snooping:
•There is no central directory.
•Every cache with a copy of the data also keeps track of its sharing status.
•This method is used in systems with a common bus, where all caches can monitor ("snoop") the communication medium to check if they have the requested data block.
•It works well for systems with a shared broadcast medium.
Types of NUMA