
Parallel & Distributed Computing

UNIT-1.2
Architectural Classification Schemes
◦ The architectural classification schemes are as follows:

1. Flynn’s Classification

2. Shore's Classification

3. Feng’s Classification

Other types of architectural classification

4. Classification based on coupling between processing elements

5. Classification based on mode of accessing memory.


Instruction Stream and Data Stream
•The term ‘stream’ refers to a sequence or flow of either instructions or data operated on by the
computer.
•In the complete cycle of instruction execution, a flow of instructions from main memory to the CPU
is established. This flow of instructions is called instruction stream.

•Similarly, there is a bi-directional flow of operands between the processor and memory. This flow of
operands is called the data stream.
Flynn’s classification
M. J. Flynn introduced a classification of computer system architectures. The classification is
based on the multiplicity of instruction streams and data streams observed by the CPU during program execution.
1) Single Instruction and Single Data stream (SISD):
•In this organisation, sequential execution of instructions is performed by one CPU containing a
single processing element (PE), i.e., ALU under one control unit.
•SISD machines are conventional serial computers that process only one stream of instructions
and one stream of data.
There is no instruction-level or data-level parallelism.

Examples of SISD machines include:

• CDC 6600 which is unpipelined but has multiple functional units.

• CDC 7600 which has a pipelined arithmetic unit.

• Amdahl 470/6 which has pipelined instruction processing.

• Cray-1 which supports vector processing.


2) Single Instruction and Multiple Data stream (SIMD)
•In this organisation, multiple processing elements work under the control of a single control unit. It has
one instruction and multiple data stream.
•All the processing elements of this organization receive the same instruction broadcast from the CU.
•Main memory can also be divided into modules for generating multiple data streams acting as a
distributed memory.
•Therefore, all the processing elements simultaneously execute the same instruction and are said to be
'lock-stepped' together. Each processor takes its data from its own memory and hence operates on a
distinct data stream.
•Every processor must be allowed to complete its instruction before the next instruction is taken for
execution. Thus, the execution of instructions is synchronous.
•Ex: ILLIAC-IV, PEPE, BSP, STARAN, MPP, DAP and the Connection Machine (CM-1).
3) Multiple Instruction and Single Data stream (MISD)
•In this organization, multiple processing elements are organised under the control of multiple
control units. Systolic array is one example of MISD.

•Each control unit is handling one instruction stream and processed through its corresponding
processing element. But each processing element is processing only a single data stream at a
time.

• Therefore, for handling multiple instruction streams and a single data stream, multiple control
units and multiple processing elements are organised in this classification.

• All processing elements interact with a common shared memory for the organisation of the
single data stream.
This classification is not popular in commercial machines. However, the MISD organisation can be helpful for
specialized applications such as fault-tolerant real-time computers, where several processors process the
same data to produce redundant results.
4) Multiple Instruction and Multiple Data stream (MIMD)
•In this organization, multiple processing elements and multiple control units are organized as in
MISD.
•But the difference is that now in this organization multiple instruction streams operate on multiple
data streams.
•Therefore, for handling multiple instruction streams, multiple control units and multiple processing
elements are organized such that multiple processing elements are handling multiple data streams
from the Main memory.
•The processors work on their own data with their own instructions. Tasks executed by different
processors can start or finish at different times. They are not lock-stepped, as in SIMD computers, but
run asynchronously.
•This classification describes the parallel computer in the true sense: the MIMD organisation is what is
usually meant by a parallel computer. All multiprocessor systems fall under this classification.
• Ex: C.mmp, Burroughs D825, Cray-2, S1, Cray X-MP, HEP, Pluribus, IBM 370/168 MP, Univac
1100/80, Tandem/16, IBM 3081/3084, C.m*, BBN Butterfly, Meiko Computing Surface (CS-1),
FPS T/40000, iPSC. MIMD is the most popular organisation for parallel computers; in the real sense,
parallel computers execute instructions in MIMD mode.
2. Shore’s Classification/taxonomy
•Shore's taxonomy is a framework introduced by Shore for categorizing different types
of parallel computer architectures based on how they manage and coordinate memory and
processing elements.

•It provides a way to classify parallel systems into distinct categories, focusing on the relationship
between processing elements, memory organization, and communication methods.
Type-I
•Classical Von Neumann architectures is in this category. It is similar to SISD type of Flynn’s
classification. Type-I also contains PU, data &memory. The Processor works on memory words
and is to be called as word-serial bit parallel.
•Ex: CDC7600,Cray-1
[Block diagram: Control Unit → Horizontal Processing Unit → Memory (word slice)]
•Horizontal Processing Unit (HPU): A system that focuses on parallel processing, distributing
tasks across multiple units to handle operations simultaneously, emphasizing scalability and
efficiency.
•Vertical Processing Unit (VPU): A system that processes tasks sequentially or hierarchically,
focusing on specialization and depth within a single process or a series of dependent steps.
Type-II
•Computers in this category are similar to those of Type-I except that they work on bit-slices of
memory (vertical slices) rather than word-slices (horizontal slices). The processing unit works in
word-parallel, bit-serial mode.
• Ex: ICL DAP, Goodyear Aerospace STARAN.

[Block diagram: Control Unit → Vertical Processing Unit → Memory (bit slice)]


Type-III
Type-III is a combination of Type-I and Type-II. It can be characterized as having the memory organised as an
array of bits in which both horizontal and vertical reading and processing are possible. It contains both a
horizontal and a vertical processing unit: one works on data words and the other on bit slices.
Example: Sanders Associates OMEN-60 series of machines.

[Block diagram: Control Unit driving a Horizontal Processing Unit and a Vertical Processing Unit, both connected to the Data Memory]


Type-IV
Type-IV machines are similar to Type-I machines, except that the processing unit and the data memory
are replicated. Communication between the various processing elements can only occur through the
control unit. The machine is easy to expand by adding more processing elements, but the communication
bandwidth is limited by having all messages go through the central control unit. Machines in this
category could be called unconnected arrays.
[Block diagram: one Control Unit connected to several Processing Units, each with its own Memory and no links between the processing elements]


Type-V
This type is similar to Type-IV, with the addition of communication links between the processing
elements.

[Block diagram: one Control Unit connected to several Processing Unit/Memory pairs, with communication links between neighbouring processing elements]


Type-VI
•It is also called associative processor. This is sometimes called 'Logic In Memory Array' or
LIMA. It includes the logic in the memory itself.
•Combines logic and memory functions within the same hardware unit.
•Eliminates the need for frequent data transfers, which reduces latency and energy consumption.

[Block diagram: Control Unit connected to a combined Processing Unit & Memory (logic in memory)]
3.Feng’s Classification
Tse-yun Feng proposed a classification for parallel processing systems based on how bits in a
word and words themselves are processed. The classification results in four categories,
depending on whether the bits and words are handled in parallel or serially.
1. Bit-Serial, Word-Serial
2. Bit-Parallel, Word-Serial
3. Bit-Serial, Word-Parallel
4. Bit-Parallel, Word-Parallel
In the context of Feng's classification, the terms "words" and "bits" refer to the units of data being
processed, and "serially" and "in parallel" describe the mode of operation.

WSBS (Word-Serial, Bit-Serial): Called bit-serial processing because one bit is processed at a time.
•Both bits and words are handled serially, meaning the system processes one bit of one word at a time before moving to
the next.
•This mode has minimal hardware complexity but the lowest processing speed.

WPBS (Word-Parallel, Bit-Serial): Called bit-slice processing because an m-bit slice (the same bit position of m words)
is processed at a time, i.e. bits are processed serially, but words are processed in parallel.

WSBP (Word-Serial, Bit-Parallel): Called word-slice processing because one word of n bits is processed at a time,
i.e. bits are processed in parallel, but words are processed sequentially.

WPBP (Word-Parallel, Bit-Parallel): Known as fully parallel processing, in which an array of n × m bits is processed at
one time. Both bits and words are processed in parallel, providing the highest speed.
Memory Access Classification
Parallel architectures can be classified into two major categories in terms of memory arrangement:
• Shared memory
• Message passing or distributed memory
This classification constitutes a subdivision of MIMD parallel architecture and are also known as:
•Shared memory architecture → tightly coupled architecture
• Distributed memory architecture → loosely coupled architecture
Shared Memory Multiprocessor
•Multiple processors share a common memory unit comprising a single or
several memory modules
• All the processors have equal access to the memory modules.
•The memory modules are seen as a single address space by all the processors.
•The memory modules store data as well as serve to establish communication
among the processors via some bus arrangement.
•Communication is established through memory access instructions
◦ processors exchange messages between one another by one processor writing data
into the shared memory and another reading that data from the memory
• The executable programming codes are stored in the memory for each processor to execute

•The data related to each program is also stored in this memory

•Each program can gain access to all data sets present in the memory if necessary

•There is no direct processor-to-processor communication involved in the programming process


Communication is handled mainly via the shared memory modules.

•Access to these memory modules can easily be controlled through appropriate programming mechanisms.

•However, this architecture suffers from a bottleneck problem when a number of processors attempt to
access the global memory at the same time. This limits the scalability of the system.
The shared memory multiprocessor systems can further be divided into three modes which are based
on the manner in which shared memory is accessed.
Uniform Memory Access Model (UMA)
•In the case of UMA architectures, the memory access time to the different parts of the memory are almost
the same.
•UMA architectures are also called symmetric multiprocessors
A UMA architecture comprises two or more processors with identical characteristics, as follows.
The processors:
◦ share the same main memory and I/O facilities
◦ are interconnected by some form of bus-based interconnection scheme

The memory access time is approximately the same for all processors.
•Processors perform the same functions under control of an operating system, which provides interaction
between processors and their programs at the job, task, file and data element levels.
Non-Uniform Memory Access Model (NUMA)
•In shared memory multiprocessor systems, each processor can have its own local memory. Together,
all the local memories form the shared global memory. This means the global memory is distributed
across all processors.
•A processor can access its own local memory quickly and consistently since it is directly connected
to it. However, accessing the local memory of another processor is slower and depends on its
location. As a result, not all memory locations can be accessed at the same speed.
Cache-Only Memory Access Model (COMA)
•In shared memory multiprocessor systems, each processor may have a cache memory to speed
up instruction execution. In the NUMA model, if cache memories replace local memories, it
becomes the COMA model. Here, all cache memories together create a shared global memory
space. However, accessing a cache memory from another processor is also slower and varies
depending on its location, making the access non-uniform in this model.
Message Passing Multicomputer/Distributed Systems

•In a distributed memory architecture, each unit is a complete computer building block including the
processor, memory and I/O system.
•A processor can access the memory that is directly attached to it.
•Communication among the processors is established in the form of I/O operations through message
signals and bus networks.
•Certainly, access to local memory is faster than access to the memory of remote processors.
•Most importantly, the further the physical distance to the remote processor, the longer it will take to access
the remote data.

•This architecture suffers from the drawback of requiring direct communication from processor to
processor.

•The speed performance of distributed memory architecture largely depends upon how the processors are
connected to each other.

•It is impractical to connect each processor to the remaining processors through independent cables: it can
work for a very low number of processors but becomes impossible as the number of processors in the
system increases.

The most common solution is to use specialized bus networks to connect all the processors in the system
in order that each processor can communicate with any other processor attached to the system.
Memory Issues: Shared vs Distributed

[Comparison diagram: shared memory vs. distributed memory organisation]
SIMD
In synchronous parallel processing architecture, multiple processing elements (PEs) perform
tasks under a common control unit. Each PE executes the same machine language program, as
instructions and data are broadcast by the control unit.

This architecture supports fine-grained data parallelism: every processing element executes the same
instruction stream on a different data stream. Such systems are categorized as Single Instruction,
Multiple Data (SIMD) parallel computers.
•Array and vector processors are the most common examples of SIMD machines.

•In these systems, a single control unit dispatches the same instruction to various processors, enabling
parallel execution across multiple data streams.

•This model is particularly efficient for vector and matrix computations, making SIMD machines
well-suited for applications requiring large-scale data processing. The sequence of different data
items processed by SIMD systems is often referred to as a vector.
Array processor & SIMD Machines

An array processor, a type of SIMD computer, handles single instruction streams applied to multiple data
streams. SIMD machines, also known as vector computers or processor arrays, excel at executing vector
and matrix operations efficiently.

The architecture typically features a single control unit that reads instructions pointed to by a single
program counter (PC). These instructions are decoded and control signals are sent to the processing units.
Data is supplied to and retrieved from the processing units via a memory system with multiple data paths.
1. SIMD Array Processor using Interconnection Network
Purpose:
◦ This architecture is designed to process large amounts of data in parallel by applying a single instruction across
multiple data points simultaneously.
◦ It's suitable for tasks like matrix computations, image processing, and other parallelizable workloads.

Components:
◦ Processing Elements (PEs):
◦ Each processing element is responsible for performing computations on a subset of the data.
◦ Each PE has its own local memory for storing data specific to its task.
◦ Global Control Unit and Scalar Processor:
◦ This unit sends the same instruction to all the processing elements. It acts as the "brain" of the SIMD system.
◦ The scalar processor handles non-parallel tasks that cannot be split among the PEs.
◦ Interconnection Network:
◦ The interconnection network connects the processing elements and allows them to communicate with each other
when data exchange or synchronization is needed.
Working:
◦ The global control unit sends a single instruction to all processing elements.
◦ Each PE operates on its own portion of the data, performing the same operation but on different pieces of data.
◦ The interconnection network allows data sharing or communication among the PEs as needed.
2. SIMD Array Processor using Alignment Network
Purpose:
This is another variation of the SIMD architecture, emphasizing data alignment for efficient memory access. It is more
structured to handle cases where memory and processing elements need precise alignment.
Components:
◦ Memory Modules:
◦ Data is stored in multiple memory banks (Memory 1, Memory 2, ..., Memory N).
◦ Each memory module corresponds to a specific processing element.
◦ Processing Elements (PEs):
◦ These are the same as in the first diagram, performing parallel computations on subsets of data.
◦ Global Control Unit and Scalar Processor:
◦ The central unit sends instructions to all processing elements and manages overall system control.
◦ Interconnection Network (Alignment Network):
◦ This network connects memory modules to processing elements and ensures the correct data is aligned with the
respective PE.
◦ It ensures data consistency and improves data access efficiency.
Working:
The global control unit sends a single instruction to all PEs.
Data from the memory modules is sent to the respective PEs through the alignment network.
The alignment network ensures that the data is correctly routed and synchronized with the corresponding PE for efficient
processing.
Internal structure of PE in SIMD
In SIMD, multiple processing elements (PEs) execute the same instruction simultaneously on different data. The
diagram highlights the key components of a PE within this architecture:

1.General Registers (R1 to RN):


•The registers store data locally for computation.
•Each PE in SIMD processes its unique data stream, and these registers provide fast access to the data required for
computation.
2.Selectors:
•This component determines which register values will be fed into the Arithmetic Logic Unit (ALU) for
computation.
•It allows flexibility in selecting data inputs based on the active instruction.
3.Arithmetic Logic Unit (ALU):
•The ALU performs arithmetic (e.g., addition, multiplication) and logic (e.g., AND, OR, NOT) operations.
•In SIMD, all PEs execute the same operation on their respective data simultaneously.
4.Local Main Memory:
•This is the memory available exclusively to each PE.
•It stores the data and instructions specific to the PE.
•In SIMD, local memory helps reduce contention and speeds up access compared to a centralized memory model.
5.Instruction Processing Unit:
• Index Registers: Handle indexing operations, such as accessing elements of an array or data vectors.
• Address Register: Stores the address of data in memory, enabling efficient memory access.
• Mask: Plays a crucial role in SIMD by enabling conditional execution. For example, in masked operations, some
PEs can remain inactive (masked) while others continue computation (see the sketch after this list).
6.Communication Unit:
• This unit allows the PE to interact with other PEs or external systems.
• Communication can include data transfers, synchronization, or coordination with the central control unit or memory.
7.Connection to Network:
• Facilitates communication between PEs and external networks or control units.
• This enables coordination across the SIMD architecture for data transfer and result aggregation.
8.Shared Memory (Optional):
• While not explicitly shown in the diagram, SIMD architectures often include shared memory for storing global data
accessible to all PEs.
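The role of the mask can be made concrete with a small sketch. The plain-C emulation below of one masked SIMD step is illustrative only (the array size NUM_PE and the function name masked_simd_add are assumptions, not part of any real machine): every PE whose mask bit is 1 executes the broadcast add, while masked PEs stay idle for that instruction.

#include <stdio.h>

#define NUM_PE 8   /* hypothetical number of processing elements */

/* Scalar emulation of one masked SIMD step: every active PE executes the same
 * operation (here: add) on its own data element; PEs whose mask bit is 0 stay idle. */
void masked_simd_add(const int a[], const int b[], int c[], const int mask[]) {
    for (int pe = 0; pe < NUM_PE; pe++) {   /* conceptually, all PEs run in lock-step */
        if (mask[pe]) {
            c[pe] = a[pe] + b[pe];          /* active PE executes the broadcast instruction */
        }                                    /* a masked PE leaves c[pe] unchanged */
    }
}

int main(void) {
    int a[NUM_PE]    = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[NUM_PE]    = {10, 20, 30, 40, 50, 60, 70, 80};
    int c[NUM_PE]    = {0};
    int mask[NUM_PE] = {1, 1, 0, 1, 0, 0, 1, 1};  /* e.g., result of a previous comparison */

    masked_simd_add(a, b, c, mask);
    for (int pe = 0; pe < NUM_PE; pe++)
        printf("PE%d: %d\n", pe, c[pe]);
    return 0;
}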
SIMD Architecture and Programming Principles
SIMD architectures excel at performing the same operation on multiple data elements simultaneously.

Programming Constraints: To make SIMD work efficiently, certain rules must be followed:
◦ Data Alignment: Data must be stored neatly in memory at specific positions so the system can quickly access it.
Misaligned data slows things down.
◦ Data Arrangement: Data should be organized to minimize cache misses, ensuring efficient access during
computation.
◦ Memory System: SIMD uses a memory system divided into multiple "lanes" (called banks). This allows several
data points to be accessed or processed at the same time, speeding up operations.
SIMD (Single Instruction, Multiple Data) processes multiple data points simultaneously using processing
elements (PEs). Each PE communicates with others through ports (left, right, top, bottom).
The two main communication models in SIMD systems are:
1. Single-Port Communication

2. All-Port Communication

Single-Port Communication in SIMD


•In this model, a PE can use only one port at a time to send or receive data.

•A PE processes a piece of data and sends the result to its neighbor through one port. At the same time, it can
receive data from another neighbor using a different port.
•While this method is simple, it limits the speed and efficiency of data transfer in larger systems.

All-Port Communication in SIMD

A PE can use all ports simultaneously to send or receive data.

A PE in the middle of a grid sends data to all four neighbors (top, bottom, left, right) at the same time.

This increases efficiency and reduces delays in SIMD operations, especially for tasks requiring high
parallelism.
In SIMD (Single Instruction, Multiple Data)
architectures, tasks like sorting or comparison are
critical, especially when dealing with arrays or
matrices. To handle these tasks efficiently,
comparators and sorting networks are employed.

1. Comparators

A comparator is the building block used to compare two inputs and produce two outputs based on their
values. It works in parallel, making it ideal for SIMD operations.

Types of Comparators: a comparator can output its two inputs in either ascending or descending order
(see "What is a Comparator?" below).
2. Comparison Networks

A comparison network is a system of multiple comparators connected together to process data in parallel.

Key Features:
◦ Representation: A directed acyclic graph (DAG) where:
◦ Nodes: Represent inputs and outputs.
◦ Edges: Represent data flow through comparators.
◦ Depth: The number of stages (or levels) of comparators in the network.

3.Sorting Networks

A sorting network is a specific type of comparison network designed to ensure the outputs are sorted regardless
of the input order.

Key Features:
◦ Size: The total number of comparators used in the network.
◦ Depth: The number of stages (or levels) required to fully sort the input data.
◦ Monotonicity: Ensures that the output sequence is always sorted.

Sorting networks use comparators to sort inputs in parallel, which makes them highly efficient for SIMD systems.
A sorting network is a system that organizes numbers into a sorted order (like smallest to largest). It does this using
comparators, which compare two numbers at a time and arrange them in the correct order.
How It Works:
Input Wires: Numbers (data) enter the sorting network.
Columns of Comparators:
◦ Comparators are like little machines that take two numbers, compare them, and arrange them:
◦ The smaller number goes to one side.
◦ The larger number goes to the other side.
◦ These comparators are arranged in columns, and each column works on multiple pairs of numbers at the same time.

Interconnection Network:
◦ After one column finishes, the numbers are sent to the next column for further sorting.

Output Wires: By the end, the numbers come out fully sorted.
What is a Comparator?
A comparator is the basic unit in the sorting network. It compares two numbers:
If sorting in ascending order:
◦ The smaller number goes to the top.
◦ The larger number goes to the bottom.

If sorting in descending order:


◦ The larger number goes to the top.
◦ The smaller number goes to the bottom.

The Zero-One Principle


This principle makes it easy to test sorting networks:
If a sorting network can sort only 0s and 1s correctly, it will work for all types of numbers.
This simplifies testing because you only need to check with binary inputs (0s and 1s).
A sorting network is a comparison network for which the output sequence is monotonically
increasing (that is, b1 ≤ b2 ≤ … ≤ bn) for every input sequence.
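A minimal C sketch may help make comparators and sorting networks concrete. The compare_exchange function below is one ascending comparator, and sort4 wires five of them into the well-known 4-input sorting network of size 5 and depth 3; comparators within the same stage are independent, which is exactly what a SIMD machine can exploit. The function names and the driver in main are assumptions for the example.

#include <stdio.h>

/* An ascending comparator: the smaller value goes to *lo, the larger to *hi. */
static void compare_exchange(int *lo, int *hi) {
    if (*lo > *hi) {
        int tmp = *lo;
        *lo = *hi;
        *hi = tmp;
    }
}

/* A 4-input sorting network of size 5 (comparators) and depth 3 (stages).
 * Comparators in the same stage touch disjoint wires, so they could run in parallel. */
static void sort4(int v[4]) {
    /* stage 1 */
    compare_exchange(&v[0], &v[1]);
    compare_exchange(&v[2], &v[3]);
    /* stage 2 */
    compare_exchange(&v[0], &v[2]);
    compare_exchange(&v[1], &v[3]);
    /* stage 3 */
    compare_exchange(&v[1], &v[2]);
}

int main(void) {
    int v[4] = {7, 1, 9, 3};
    sort4(v);
    printf("%d %d %d %d\n", v[0], v[1], v[2], v[3]);  /* prints: 1 3 7 9 */
    return 0;
}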
Vector Processing
•Vector processing refers to a method of computing that deals with vectors, which are ordered sets of
numbers, rather than scalar values (individual numbers).

• It involves performing operations on entire arrays or vectors in a single step, rather than on individual
elements one at a time.

•Vector Processors are special-purpose computers designed to handle tasks that involve large datasets,
poor locality, and long run times.
•These processors are particularly effective for mathematical operations performed on multiple data
elements simultaneously.
•They use vector instructions, which provide a mechanism for efficient and high-speed computation.
•Vector processors are also known as array processors
Key Features of Vector Processors
Scalar and Vector Instructions:
•They support both scalar (single data element) and vector (multiple data elements) instructions.
•Vector instructions operate in a highly pipelined mode for efficiency.
Two Families of Vector Processors:
•Memory-to-Memory Architecture:
•Directly moves vector operands from memory to pipelines and back to memory.
•Register-to-Register Architecture:
•Uses vector registers to act as intermediaries between memory and functional pipelines.
Advantages of Vector Processors
Highly Pipelined Operations:
◦ This architecture allows for continuous and efficient processing of data streams.

Speedup for Arrays and Matrices:


◦ Traditional scientific computing tasks often involve arrays and matrices, which vector processors can handle faster
than scalar processors.

Reduced Overhead:
◦ The hardware design minimizes loop overhead and address generation delays, improving performance for repetitive
tasks.
•Traditional supercomputers are often vector-based because of their ability to efficiently handle large-scale
computational problems.
•These systems are ideal for tasks in scientific computation like simulations, modeling, and numerical analysis.
1.Loop Overhead:
•In traditional processors, executing a loop involves extra steps, such as:
•Checking the loop condition (e.g., i < n).
•Incrementing or updating the loop counter (e.g., i++).
•Jumping back to the beginning of the loop if the condition is true.
•These repeated steps add "overhead," meaning extra work that doesn’t directly contribute to the actual
computation.
In vector processors:
A single vector instruction replaces the entire loop, eliminating the need for these repetitive checks and updates.
For example:
•Instead of looping over each element of an array, one vector instruction processes all elements at once.
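As a concrete illustration, the C fragment below (array size N and function name assumed for the example) contrasts a scalar loop, which pays the control overhead on every iteration, with the idea of a single vector instruction; the vector mnemonic in the comment is purely illustrative.

#define N 1024

/* Scalar version: every iteration pays loop overhead
 * (compare i < N, increment i, branch back to the top). */
void add_scalar(const float a[N], const float b[N], float c[N]) {
    for (int i = 0; i < N; i++) {   /* condition check + counter update + branch */
        c[i] = a[i] + b[i];         /* the only "useful" work per iteration */
    }
}

/* Vector version (conceptual): a single vector add instruction such as
 *     ADDV   V3, V1, V2     ; C(1:N) = A(1:N) + B(1:N)
 * replaces the whole loop, so the condition test, counter update and branch
 * are executed once (or not at all) instead of N times. */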
2. Address Generation Delays:
•In traditional processors, each element in an array or memory location requires calculating its address individually
(e.g., A[i]).
•This can cause delays because the processor has to compute the address for each element repeatedly.
In vector processors:
•The hardware uses predictable addressing patterns (like strides), allowing it to generate addresses for the entire
vector efficiently without recalculating for every element.
3.Improved Performance for Repetitive Tasks:
•Tasks that involve repetitive operations (e.g., matrix multiplication, array addition) benefit significantly because:
•The loop overhead is removed.
•Address generation is faster and happens in parallel with computations.
Why is a vector processor needed?
•Each computation in a vector works independently of the others, so the hardware doesn't need to check for
data conflicts within a vector instruction.
Ex:
•In a vector instruction, each element in a vector (e.g., an array) is processed separately and does not depend
on the result of any other element in the same vector.
•Adding two vectors A and B to produce C:
C[i] = A[i] + B[i]
Each C[i] only depends on A[i] and B[i]. It does not depend on C[i-1] or C[i+1]. This independence allows
the hardware to compute all elements in parallel or in a pipeline without worrying about conflicts.

•The hardware only needs to check for data conflicts between two vector instructions once for each vector,
not for every individual element.
•Vector instructions that access memory follow a predictable pattern, making memory access more efficient.
•Since a single vector instruction can replace an entire loop, there are no control issues (like deciding when
to exit the loop) that usually happen with regular loops.
Vector Processing Principles
•A vector is a group of numbers (data items) of the same type stored in memory. These numbers are usually
arranged with a fixed distance between each item, called the "stride".
•In a vector processor, a stride refers to the step size or the distance (in memory locations) between
consecutive elements of a vector that the processor needs to access. It determines how the processor accesses
data in memory for operations involving vectors.
•Why is Stride Important?
•In vector processors, data is often stored in memory in non-contiguous blocks.
•The stride tells the processor how far apart the elements of a vector are in memory.
• Stride enables efficient handling of both contiguous and non-contiguous data.
Types of Stride:

1.Stride = 1 (Contiguous Access):


•When elements are stored in consecutive memory locations.
•Example: A vector [A[0], A[1], A[2], A[3]] is stored at memory addresses 100, 101, 102, 103.
•Stride = 1 means the processor accesses the elements one after another without skipping.
2.Stride > 1 (Non-Contiguous Access):
•When elements are spaced out in memory (not consecutive).
•Example: A vector [B[0], B[1], B[2], B[3]] is stored at memory addresses 200, 204, 208, 212.
•Here, the stride = 4, as the processor must skip 3 memory addresses to reach the next element.
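A small C example (a row-major matrix; the names are assumed for illustration) shows where the two stride cases above come from: walking along a row gives stride 1, while walking down a column gives a stride equal to the number of columns.

#include <stdio.h>

#define ROWS 4
#define COLS 4

int main(void) {
    /* The matrix is stored row-major, i.e. one row after another in memory. */
    double m[ROWS][COLS] = {
        { 0,  1,  2,  3},
        { 4,  5,  6,  7},
        { 8,  9, 10, 11},
        {12, 13, 14, 15},
    };
    double *base = &m[0][0];

    /* Row 1 as a vector: consecutive memory locations, stride = 1. */
    for (int i = 0; i < COLS; i++)
        printf("row element %d = %g\n", i, base[1 * COLS + i]);    /* addresses step by 1 */

    /* Column 1 as a vector: elements are COLS doubles apart, stride = COLS = 4. */
    for (int i = 0; i < ROWS; i++)
        printf("column element %d = %g\n", i, base[i * COLS + 1]); /* addresses step by 4 */

    return 0;
}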
•A vector processor is a system made up of various hardware components, such as vector
registers, pipelines for processing, and counters for managing operations.
•Vector processing happens when calculations (like arithmetic or logical operations) are
performed on vectors. Turning regular (scalar) code into vector code is called "vectorization".
•Vector processing can be 10 to 20 times faster than regular (scalar) processing. Special compilers,
called vectorizing compilers, help convert scalar code into vector code.
Properties of Vector Instructions
•Independent Computation: Each result is calculated separately, avoiding data conflicts and
enabling efficient pipelines.
•Efficient Workload: A single vector instruction performs the work of an entire loop, reducing
the need for extra instructions.
•Predictable Memory Access: Memory access follows a fixed pattern, making it easier to
manage and faster.
•No Control Issues: Vector instructions replace loops, so there are no decisions (like when to stop
the loop) to slow things down.
Vector Supercomputer
•A vector computer is built on a regular (scalar) processor but is designed to handle 1-D arrays of
data, called vectors. Each vector has multiple data elements, and the number of these elements is
known as the vector length.

•To save time, both instructions and data are processed using pipelines, which help reduce
decoding time.
Primary Components of VMIPS (a vector extension of the MIPS instruction set)
Vector Registers:
•VMIPS has 8 vector registers, each capable of holding 64 elements.
•Each register has at least two read ports and one write port for efficient operations.

Vector Functional Units:


•These units are fully pipelined, allowing new operations to start in every clock cycle.

Vector Load-Store Unit:


•This unit is fully pipelined, enabling data movement between vector registers and memory at a rate of 1 word per clock cycle

(after an initial delay).


Scalar Registers:
•These registers act as inputs to vector functional units and are used for tasks like providing addresses to the vector load-store unit.

Program Loading:
•Programs and data are loaded into memory using the main computer. The scalar control unit manages the decoding of

instructions.
Vector Operand:
•A vector operand consists of an ordered set of elements. The number of elements is called the vector length.
•Each element can be a floating-point number, integer, logical value, or character.
Vector Computers:
•These computers build on scalar processors and include an optional vector processor.
•When a vector instruction is decoded, it is sent to the vector processor’s control unit to manage data flow and execution.
Instruction Decoding:
•After decoding, instructions are checked for vector or scalar operations.
•Scalar operations are directly executed by the scalar processor, while vector operations are handled by the vector control
unit.
Data Elements in Registers:
•Each vector register holds multiple data elements, forming a vector.
•The vector length register keeps track of how many elements the vector contains.
•Registers work in sequence, ensuring data is accessed efficiently.
Vector Processor Models
There are two main types of architectures for vector processors:

Register-to-Register Model:
◦ All vector operations (except loading and storing data) happen between the vector registers.

Memory-Memory Vector Processor:


◦ All vector operations directly involve memory, working from memory to memory.
Register-to-Register Model:

•Uses a fixed number of vector registers to store operands, intermediate results, and final outputs.

•The length of each vector register is typically fixed (e.g., 64-bit components in Cray series
supercomputers).

•Operates on registers rather than memory, which improves efficiency.

•Commonly used in SIMD (Single Instruction, Multiple Data) architectures.


Memory-to-Memory Model:

•Operates directly on data in memory, with primary memory holding both operands and results.

•Vector instructions handle fetching and storing data in large chunks (super words).

•Results are written back to memory after operations.

Examples: CDC Star-100 and TI ASC machines, which are memory-memory architectures.
Advantages of Vector Processors
Lower Instruction Bandwidth:
◦ Fewer fetches and decodes reduce the need for instruction bandwidth.

Efficient Memory Access:


◦ Load/Store units can access memory using predictable patterns, simplifying memory operations.

No Memory Wastage:
◦ Memory is used more efficiently with reduced overhead.

Simplified Control Hazards:


◦ Eliminates control issues caused by loops, making execution smoother.

Scalable Performance:
◦ Performance improves by adding more hardware resources.

Smaller Code Size:


◦ A single instruction can handle multiple operations, reducing the overall code size.
Disadvantages of Vector Processors
Need for Scalar Units:
◦ A traditional scalar unit is still required for tasks that cannot be vectorized.

Interrupt Handling:
◦ Managing precise interrupts can be challenging.

Programming Complexity:
◦ Programmers or compilers must adapt programs for vectorization.

Inefficiency for Small Vectors:


◦ Not very effective when working with small vector sizes.

Limited Application Scope:


◦ May not perform well for all types of applications.

High Memory Requirements:


◦ Requires specialized, high-bandwidth memory systems.
GPU Co-processing
•A Graphics Processing Unit (GPU) is a specialized processor designed to handle and boost the
performance of tasks like video and graphics rendering.
•While CPUs process tasks sequentially, GPUs process large amounts of data in parallel using thousands
of threads, making them ideal for tasks that involve large datasets.

Key Uses of GPUs:


•GPUs are commonly used for graphics rendering (e.g., 3D animations in games) and other parallel
computing tasks.
•They are much faster than CPUs for tasks that can be parallelized, such as handling complex graphical
outputs or processing massive datasets.
•The first GPU was introduced by Nvidia in 1999, called the GeForce 256.
•It could process 10 million polygons per second and had 22 million transistors.
Specifications of the GeForce 256 (the first commercial GPU, released by NVIDIA):
PCI or AGP 4x bus.
32 MB SDR RAM.
2.6 GB/s bandwidth.
120 MHz core clock.
220 nm process technology.
4 Pixel shaders (cores).
Popularity Growth:
•As the demand for high-performance graphics increased (e.g., gaming, simulations), GPUs became more popular.
•GPUs are typically connected to the CPU via PCI or PCI-Express and have their own dedicated RAM for faster performance.
Applications:
•GPUs are heavily used in:
• Research (e.g., AI, deep learning, simulations).
• High-performance computing due to their ability to process data in highly parallel mode.
NVIDIA GPU History
Timeline of Major GPU Releases:
◦ 1999: First GPU released.
◦ 2006: CUDA architecture introduced for general-purpose GPU computing.
◦ 2007: Tesla GPUs released for high-performance tasks.
◦ 2009: Fermi architecture released for enhanced parallel computing.
◦ 2012: Kepler architecture launched for better performance and efficiency.
Basic Terminology in GPUs
Thread:
◦ The smallest unit of computation on a GPU.

Block:
◦ A collection of threads grouped together.

Grid:
◦ A collection of blocks that form a larger computational unit.

Warp:
◦ A group of 32 threads that execute together on the GPU.

Kernel:
◦ The program (or function) that runs on the GPU.
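A rough plain-C analogy can tie these terms together. The sketch below only models the indexing scheme (THREADS_PER_BLOCK, NUM_BLOCKS and kernel_add are invented names, and real GPU kernels are written in CUDA or a similar language, not launched with ordinary loops): the kernel is the function every thread runs, the two loops stand in for the grid of blocks and the threads per block, and each thread derives a global index from its block and thread IDs.

#include <stdio.h>

#define THREADS_PER_BLOCK 4   /* hypothetical launch configuration */
#define NUM_BLOCKS        3

/* The "kernel": every (block, thread) pair runs this same function on its own element. */
void kernel_add(int block_id, int thread_id, const int *a, const int *b, int *c, int n) {
    int i = block_id * THREADS_PER_BLOCK + thread_id;  /* global thread index */
    if (i < n)                                          /* guard: the last block may be partial */
        c[i] = a[i] + b[i];                             /* each thread handles one element */
}

int main(void) {
    enum { N = 10 };
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 10 * i; }

    /* "Launching" the kernel: on a GPU these iterations would run as parallel hardware threads. */
    for (int block = 0; block < NUM_BLOCKS; block++)
        for (int thread = 0; thread < THREADS_PER_BLOCK; thread++)
            kernel_add(block, thread, a, b, c, N);

    for (int i = 0; i < N; i++)
        printf("c[%d] = %d\n", i, c[i]);
    return 0;
}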
Tesla GPU Architecture
•Based on a scalable processor array.
•Key components:
1.TPC (Texture/Processor Cluster): Handles textures and processing.
2.SM (Streaming Multiprocessor): Executes parallel threads.
3.SP (Streaming Processor): The basic computation unit.
4.ROP (Raster Operation Processor): Manages rendering tasks like blending and depth calculations.
Block Diagram Example (GeForce 8800 GPU):
Contains:

•128 Streaming Processors (SPs).


•Organized into 16 Streaming Multiprocessors (SMs).
•Independent processing units called TPCs.
•Data flow:
•Starts at the host system (CPU), connects via the PCI-Express bus, and flows through various
GPU components.
How the GPU Works:
Scalable Streaming Processor Array (SPA):
◦ Executes parallel computations.
◦ Handles all GPU commands for graphics and general-purpose tasks.

Memory:
◦ Uses external DRAM for data storage.
◦ ROP units perform memory-related operations like blending and depth tests.

Interconnection Network:
◦ Transfers data between GPU components.
◦ Connects the SPA to DRAM through a two-level cache.

Communication with the Host CPU:


◦ The GPU interacts with the CPU to:
◦ Fetch data from system memory.
◦ Check commands.
◦ Maintain synchronization for consistent performance.
Processing Tasks:
◦ The Streaming Processor Array (SPA) handles both graphics tasks and general-purpose computing tasks.

◦ Each TPC (Texture/Processor Cluster) in the SPA works like a mini processor, managing tasks efficiently.

◦ GPUs scale from small setups (1 TPC) to high-performance setups (up to 8 TPCs or more).

Work Distribution:
◦ The input assembler sends data to the processors (like geometry, vertex, and pixel shaders).

◦ Data is processed and distributed evenly among the GPU units to ensure smooth execution.
Memory Management in GPUs

How Data Is Accessed:


◦ GPUs use integer byte addressing to improve memory efficiency.

◦ Data is read and written through load/store instructions that work with three types of memory spaces:

◦ Local Memory: Temporary memory used for private, per-thread data.

◦ Shared Memory: Fast memory shared by threads within the same group (Streaming Multiprocessor or SM).

◦ Global Memory: Large memory shared across all threads in the GPU.

Memory Types:
◦ The GPU supports global, shared, and local memory for different tasks:

◦ Global Memory: Used for large datasets and shared among all threads.

◦ Shared Memory: Fast and used for data shared between threads in the same SM.

◦ Local Memory: Private data for individual threads.


Memory Bandwidth:
◦ Each GPU memory partition is 64 bits wide.

◦ It supports advanced memory like DDR2 or GDDR3, with speeds up to 1 GHz, for high bandwidth and fast
processing.

Partitions:
◦ Global memory is divided into 6 partitions, each handling 1/6 of the total physical memory space.
Instruction-Level Support for Parallel Programming
Parallelism refers to performing multiple tasks simultaneously to improve efficiency. It can occur at
the following levels:

Job Level:
◦ Independent jobs run at the same time on the same system.
◦ Example: Running multiple programs (e.g., a video player and a browser) simultaneously.

Program Level:
◦ Multiple tasks work together to solve a single large problem.
◦ Example: Different software modules collaborate to process data.

Instruction Level:
◦ A single instruction (e.g., adding two numbers) is broken into smaller steps (sub-instructions).
◦ Sub-instructions are executed in parallel using techniques like pipelining.
◦ Example: Adding two arrays element-wise.
Bit Level:
◦ Individual bits in a word (binary representation) are processed.

◦ If bits are processed one after another, it is bit-serial. If processed at the same time, it is bit-parallel.

◦ Example: Processing 32 bits of data in parallel in a 32-bit processor.

levels of parallelism
•Instruction-Level Parallelism (Fine-Grained)
•Loop-Level Parallelism (Fine-Grained)
•Procedure-Level Parallelism (Medium-Grained)
•Subprogram-Level Parallelism (Medium to Coarse-Grained)
•Job or Program-Level Parallelism (Coarse-Grained)
Instruction-Level Parallelism (ILP):Breaking a program into individual instructions that can run
simultaneously.
Grain:
◦ A grain is the smallest unit of work in parallel processing.
◦ It represents a portion of the overall computation that can be performed independently.

Granularity: ILP usually involves fewer than 20 instructions per grain.

Granularity in parallelism refers to the size of tasks (units of work) and the frequency of communication or
synchronization between them. It determines how much computation is performed before tasks need to interact.
Challenges:
◦ Data dependencies: When one instruction depends on the result of another.

◦ Control dependencies: Managing the order of execution.

Hardware Examples:
◦ Processors like VLIWs (Very Long Instruction Word) and superscalar processors benefit from ILP.
Loop-Level Parallelism:
Instruction Count:
◦ If the number of instructions in a loop is less than 500, the loop can be considered for parallelism.
◦ If iterations of the loop are independent of each other, they can be handled either by a pipeline or a SIMD
(Single Instruction, Multiple Data) machine.

Optimized Programs:
◦ Well-optimized program structures are designed to execute efficiently on parallel or vector machines.

Challenges:
◦ Some loops (like recursive loops) are difficult to handle in parallel.

Granularity:
◦ Loop-level parallelism is still considered fine-grained computation, as it deals with individual iterations of
loops.

Cross-Iteration Parallelism in loop-level parallelism refers to the ability to execute tasks from
different iterations of a loop concurrently. This allows the parallelization to extend beyond the
boundaries of a single loop iteration, increasing the overall efficiency of parallel execution.
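The difference between independent and cross-iteration-dependent loops can be illustrated with a short C sketch (the function names and array size are assumptions for the example): the first loop can be pipelined or handled by a SIMD machine freely, while the second carries a dependence from one iteration to the next.

#define N 1000

/* Parallelizable loop: iterations are independent, so a pipeline or a SIMD
 * machine can process many iterations at the same time. */
void scale(const double a[N], double b[N], double k) {
    for (int i = 0; i < N; i++)
        b[i] = k * a[i];          /* b[i] depends only on a[i] */
}

/* Loop with a cross-iteration (loop-carried) dependence: iteration i needs the
 * result of iteration i-1, so the iterations cannot simply run in parallel. */
void prefix_sum(const double a[N], double s[N]) {
    s[0] = a[0];
    for (int i = 1; i < N; i++)
        s[i] = s[i - 1] + a[i];   /* depends on the previous iteration's result */
}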
Procedure-Level Parallelism:
•Medium-sized grain, typically involving less than 2000 instructions.
•Difficult to detect parallelism due to fewer instructions and the need for inter-procedural dependence
analysis.
•Communication requirements are lower compared to instruction-level parallelism.
•Multitasking operations can benefit at this level.
Subprogram-Level Parallelism:
•Operates at the job step level, with thousands of instructions, typically medium- or coarse-grained.
•Job steps can overlap across different jobs, enabling multi-programming.
•No compilers are currently available to exploit medium- or coarse-grain parallelism.
Examples:
◦ Executing different functions in parallel in a multi-threaded program.
◦ Running different modules of a large application simultaneously.
Job or Program-Level Parallelism:

Job or Program-Level Parallelism refers to the highest level of parallelism where independent jobs or programs are
executed concurrently. This level of parallelism focuses on utilizing the computational power of a parallel computer to run
multiple, unrelated tasks simultaneously.

Key Characteristics

Independent Jobs or Programs:


◦ Each job or program operates independently of others, with no inter-task dependencies.
◦ Example: Running separate programs like a web server, database server, and data analysis task on different processors
simultaneously.

Suitable Hardware:
◦ Ideal for Machines with Fewer Powerful Processors:
◦ Fewer, high-performance processors are well-suited for executing independent programs that demand significant
computational power.
◦ Less Effective for Many Small Processors:
◦ Machines with numerous smaller processors (e.g., GPUs) are better suited for fine-grained parallelism rather than
job-level tasks.
Practical Use Cases
Cluster Computing:
◦ Independent jobs distributed across nodes in a cluster for parallel execution.

Cloud Computing:
◦ Running different user programs on virtual machines simultaneously.

Scientific Simulations:
◦ Multiple simulations or experiments running concurrently on a supercomputer.
Avoiding Dependencies in Parallel Execution
•Hardware Terminology:
•Data Hazards:
•RAW (Read After Write): A hazard occurs when an instruction depends on the result of a previous
instruction.
•WAR (Write After Read): A hazard occurs when an instruction writes to a resource that an earlier instruction still
has to read; if the write happens first, the read gets the wrong (new) value.
•WAW (Write After Write): A hazard occurs when two instructions write to the same resource; if the writes complete
out of order, the wrong value is left behind.
•Software Terminology:
•Data Dependencies:
•Name Dependency: Dependencies caused due to variable names.
•True Dependency: Arises when an instruction depends on the actual data produced by a previous instruction.
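In software terms, the same dependence types can be seen in a few lines of C (an illustrative fragment, not tied to any particular machine):

/* Three statements on the same variables; read top-to-bottom they illustrate
 * the dependence/hazard types between "instructions" I1, I2 and I3. */
void hazards_demo(void) {
    int a = 1, b = 2, c, d;

    c = a + b;   /* I1: writes c */
    d = c * 2;   /* I2: reads c  -> RAW / true dependence on I1 */
    c = b - a;   /* I3: writes c -> WAR with I2 (I2 must read the old c first)
                    and WAW with I1 (the final value of c must come from I3) */
    (void)d;     /* silence "unused variable" warnings */
}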
Pipelining and Stalls
Loop Unrolling
Loop unrolling is an optimization technique used in programming and compiler design to
improve the performance of loops. It involves reducing the overhead of loop control by
executing multiple iterations of the loop within a single iteration.
•The compiler or programmer duplicates the body of the loop multiple times, reducing the number of iterations.
•The loop's control structure (e.g., incrementing the counter and checking the condition) executes less frequently, improving efficiency.
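As a concrete illustration, the C sketch below (array size N and function names assumed for the example) shows a simple loop and the same loop unrolled by a factor of 4.

#define N 1024   /* assumed to be a multiple of 4 for this sketch */

/* Original loop: N iterations, each paying the loop-control overhead. */
void add_one(float a[N]) {
    for (int i = 0; i < N; i++)
        a[i] += 1.0f;
}

/* The same loop unrolled by a factor of 4: the counter update, condition test
 * and branch now execute N/4 times instead of N times, and the four independent
 * statements give the hardware more room to overlap work. */
void add_one_unrolled(float a[N]) {
    for (int i = 0; i < N; i += 4) {
        a[i]     += 1.0f;
        a[i + 1] += 1.0f;
        a[i + 2] += 1.0f;
        a[i + 3] += 1.0f;
    }
}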
Vector Processing for Loop-Level Parallelism
◦ Vector processing is a method of loop-level parallelism where a single instruction operates on multiple
data elements simultaneously.
◦ It is designed to handle repetitive operations within loops efficiently, particularly when the same
operation is applied to a large set of data (such as arrays or matrices).
Alternative Methods:
◦ If vector processing is not used, parallelism can still be achieved through:

◦ Dynamic Techniques: Such as branch prediction.

◦ Static Techniques: Like loop unrolling by the compiler.

Parallel Instructions:
◦ If two instructions are parallel, they can execute simultaneously in a pipeline of arbitrary depth without
causing stalls.

Dependent Instructions:
◦ If two instructions are dependent:

◦ They are not parallel and must execute in order.

◦ However, partial overlap may sometimes be possible during execution.


Multiprocessor Caches and Cache Coherence

Coherence governs the behavior of reads and writes to the same memory location across multiple processors.
•Cache Coherence: In a multiprocessor system, two processors can maintain different values for the same
memory location in their caches.
• Figure demonstrates a cache coherence issue:
• Two processors (P and Q) interact with the same memory location (B).
• Processor P has B = 5 in its cache memory, while Processor Q has B = 7.
• The main memory value for B is 7.
• Inconsistent views of B lead to a cache coherence problem.

Memory System Coherence


•A memory system is coherent if:
1.Consistency: Any read operation returns the most recently written value.
2.Write Propagation: If processor P writes to a memory location (A), and no other processor writes to A,
subsequent reads of A must return the value written by P.
Cache Coherence Conditions
1.Read/Write Separation:
• If Processor P writes to location A and another processor reads from A, provided the read/write operations are
sufficiently separated, the read operation should return the value written by P.
2.Write Serialization:
• Writes to the same location by any two processors must be seen in the same order by all processors.
• Example:
• If values 1 and then 2 are written to a location, all processors must observe this order.
• It prevents inconsistencies where a processor observes the value 2 first and later sees the value revert to 1.
Principles of Cache Coherence

Preserve Program Order:


If a processor (P) writes to a memory location (A) and later reads from the same location (A), it
should get the value it wrote, as long as no other processor has written to A in between.

Coherent View of Memory:


If one processor writes to a memory location (A) and another processor reads from A afterward, the
reader must get the updated value, provided:
◦ There is enough time between the write and the read.

◦ No other writes happen to A in the meantime.


Coherence and Consistency
•Coherence ensures that a read returns the most recent written value for the same memory location.
•Consistency ensures that updates to one memory location are visible to all processors in a predictable
way.
Ways to Maintain Coherence
•When multiple processors access the same data, each may store a copy in their cache. Unlike I/O,
where such situations are rare, this is common and needs to be handled carefully.
•Migration: Data can be moved from shared memory to a processor’s local cache for faster access.
This reduces the delay in accessing remote data and minimizes the load on shared memory.
•Replication: When multiple processors need to read the same data, a copy is stored in each processor’s
cache. This reduces access delays and prevents competition for shared memory.
•Ensuring migration and replication is important for better performance when accessing shared data.
Instead of avoiding shared data through software, systems use hardware protocols to manage cache
coherence.
•Cache coherence protocols are used to keep track of shared data across multiple processors. These
protocols are essential for maintaining data accuracy and consistency.
Techniques/protocols for Tracking Sharing Status

Two main methods are used to track the sharing status of data blocks:

1)Directory-based

2)Snooping
Directory-based protocol
◦ The sharing status of a block of memory
is maintained in a central location, called
a directory.
◦ This centralized directory tracks which
processors are sharing the data.
◦ It is commonly used in Distributed Shared
Memory (DSM) systems.
◦ It supports larger processor counts but has
higher overhead compared to other
methods.
Snooping:
•There is no central directory.
•Every cache with a copy of the data also keeps
track of its sharing status.
•This method is used in systems with a common
bus, where all caches can monitor ("snoop")
the communication medium to check if they
have the requested data block.
•It works well for systems with a shared
broadcast medium.
Types of NUMA
