Parallel and Distributed Computing
CS 3006 (BCS-7A | BDS-7A)
Lecture 3
Danyal Farhat
FAST School of Computing
NUCES Lahore
Flynn’s Classical Taxonomy
and Processor-to-Memory
Connection Strategies
Hardware Architecture Classifications
• Flynn’s Classification
Differentiates multiprocessor computers according to the dimensions of
instruction and data streams
• Feng's Classification
Based mainly on the degree of serial versus parallel processing in the
computer system
• Handler's Classification
Based on the degree of parallelism and pipelining at various levels of
the system
Flynn’s Classical Taxonomy
• The most widely used classification of parallel computers
• Differentiates multiprocessor computers according to the
dimensions of instruction and data streams
Instruction stream: sequence of instructions fetched from memory to the control unit
Data stream: sequence of data moved between memory and the processing unit
• SISD: Single Instruction stream, Single Data stream
• SIMD: Single Instruction stream, Multiple Data stream
• MISD: Multiple Instruction stream, Single Data stream
• MIMD: Multiple Instruction stream, Multiple Data stream
Processor Organizations
SISD
• A serial (non-parallel) computer
• Single instruction: only one instruction stream is acted on per cycle
• Single data: only one data stream is used as input per cycle
• Simple, deterministic execution
Example:
• Single-CPU workstations
• Most workstations from HP, IBM, and SGI are SISD
machines
SISD (Cont.)
• Performance of a processor can be measured by its MIPS rate:
MIPS rate = f × IPC
Millions of instructions per second (MIPS) is an approximate measure of a
computer's raw processing power
f: processor clock frequency (in MHz); IPC: average instructions completed per cycle
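For example (hypothetical figures): a processor clocked at 2 GHz (f = 2000 MHz) that sustains an IPC of 2 achieves 2000 × 2 = 4000 MIPS.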
How can the performance of a uniprocessor be increased?
• Multithreading
• Increasing the clock frequency
• Increasing the number of instructions completed during a
processor cycle (multiple pipelines in a superscalar
architecture and/or out-of-order execution)
SISD – Multithreading
• Run multiple threads on the same core concurrently
• Context switching is implemented in hardware
• Minimum hardware support: replicate the architectural state
Every running thread must have its own context:
Multiple register sets in the core
Multiple state registers, such as:
Program Counter (PC)
Memory Address Register (MAR)
Accumulator Register (ACC)
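Hardware multithreading itself is invisible to software, but the notion of per-thread context can be sketched with ordinary OS threads. A minimal C++ sketch (the function and variable names are illustrative, not from the slides); each thread keeps its own program counter, stack, and register values while sharing the core:

    #include <iostream>
    #include <string>
    #include <thread>

    // Each thread has its own architectural context (program counter,
    // stack, register values); the hardware/OS interleaves the threads
    // without the threads having to coordinate.
    void worker(int id) {
        long local_sum = 0;                  // lives in this thread's own context
        for (int i = 0; i < 5; ++i)
            local_sum += i * id;
        std::cout << ("thread " + std::to_string(id) +
                      " sum " + std::to_string(local_sum) + "\n");
    }

    int main() {
        std::thread t1(worker, 1);           // two threads, two contexts
        std::thread t2(worker, 2);
        t1.join();
        t2.join();
    }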
SISD – Multithreading (Cont.)
Implicit Multithreading
• Concurrent execution of multiple threads extracted from a
single sequential program
• Managed entirely by the processor hardware
• Improves the performance of an individual application
Explicit Multithreading
• Concurrent execution of instructions from different explicit
threads, either by interleaving instructions from different
threads or by parallel execution on parallel pipelines
SISD-Explicit Multithreading
• Four approaches to explicit multithreading:
Interleaved multithreading (fine-grained): switching can occur at each
clock cycle; with only a few active threads, performance degrades
Blocked multithreading (coarse-grained): events such as a cache miss
trigger a switch
Simultaneous multithreading (SMT): the execution units of a superscalar
processor receive instructions from multiple threads
Chip multiprocessing: e.g., a dual-core processor (not SISD)
• Architectures like IA-64 use a Very Long Instruction Word (VLIW),
which packs multiple instructions (to be executed in parallel) into a
single word
SISD-Explicit Multithreading (Cont.)
Interleaved Multithreading (fine-grained):
• Instructions are fetched from different threads in consecutive cycles
• In every clock cycle an instruction is fetched for a different thread,
i.e., switching occurs at each clock cycle
• With only a few active threads, performance degrades
SISD-Explicit Multithreading (Cont.)
Blocked Multithreading (coarse-grained):
• Another thread starts when the current thread blocks
• Events such as a cache miss or waiting for I/O trigger the switch
• The processor switches to a different thread when a long-latency
event (e.g., an L2 cache miss) occurs
SISD-Explicit Multithreading (Cont.)
Simultaneous Multithreading (SMT):
• Instructions are fetched from different threads in a single cycle
• The execution units of a superscalar processor receive instructions
from multiple threads
• A superscalar processor is a CPU that implements instruction-level
parallelism within a single processor
Intel’s Hyper Threading Technology
• A single physical processor appears as two logical processors
by applying a two-threaded SMT approach
Example: Intel Pentium 4 in 2002
• Each logical processor maintains a complete copy of the architectural
state (general-purpose registers, control registers, …)
• Logical processors share nearly all other resources such as
caches, execution units, branch predictors, control logic and
buses
Intel’s Hyper Threading Technology (Cont.)
• Partitioned resources are recombined when only one thread is
active
• Adds less than 5% to the total chip area
• Improves performance by 16% to 28%
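Because each logical processor is visible to the operating system as a schedulable CPU, the effect of SMT can be observed from software. A minimal C++ sketch (the value reported depends on the OS and the machine):

    #include <iostream>
    #include <thread>

    int main() {
        // On a Hyper-Threading CPU this typically reports twice the
        // number of physical cores, i.e. the count of logical processors.
        unsigned n = std::thread::hardware_concurrency();
        std::cout << "logical processors: " << n << '\n';
    }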
SIMD
• Homogeneous processing units / processing elements (PEs)
• Single instruction: all processing units execute the same
instruction at any given time
• Multiple data: each processing unit can operate on a different
data element
Example: add A and B, C and D, and X and Z, all in parallel
SIMD (Cont.)
• Each processing element has an associated data memory,
so that each instruction is executed on a different set of data by the
different processors
• Used by vector and array processors
Suitable for vector and matrix calculations
• Vector processors act on arrays of similar data (only when
executing in vector mode), and in this case they are several
times faster than when executing in scalar mode
Example: NEC SX-8 processors run at 2 GHz for vector operations and 1 GHz
for scalar operations
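The SIMD idea can be sketched in C++ with an element-wise loop; assuming the compiler auto-vectorizes it (typically at -O3 on GCC or -O2 on Clang; this is compiler behaviour, not guaranteed), the single logical "add" is mapped onto SIMD instructions that process several floats at once:

    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 8;
        std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N);
        // One logical operation applied to many data elements:
        // an auto-vectorizing compiler emits SIMD adds here,
        // processing several floats per instruction.
        for (int i = 0; i < N; ++i)
            c[i] = a[i] + b[i];
        for (int i = 0; i < N; ++i)
            std::printf("%.1f ", c[i]);
        std::printf("\n");
    }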
SIMD - Example
• A good example is the processing of pixels on screen
• A sequential processor would examine each pixel one at a time
and apply the processing instruction
• An array or vector processor can process all the elements of an
array simultaneously
• Game consoles and graphics cards make heavy use of such
processors to shift those pixels
• Such designs are usually dedicated to a particular application
and are not commonly marketed for general-purpose computing
MISD
• A single data stream is transmitted to a set of processors, each
of which executes a different instruction sequence
• Each processing unit operates on the data independently via
independent instruction stream
• This structure is not commercially implemented
• An example of use could be multiple cryptography algorithms
attempting to crack a coded message
MISD (Cont.)
• Example: three processors execute three different instructions on the
same data set
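Although no commercial MISD machine exists, the cryptanalysis idea can be imitated with threads: one data stream is handed to several different instruction streams. A minimal C++ sketch (the three checksum functions are illustrative assumptions, not real cryptography):

    #include <iostream>
    #include <string>
    #include <thread>

    // Three different "instruction streams" applied to the SAME data.
    unsigned long sum_bytes(const std::string& s) {
        unsigned long r = 0;
        for (unsigned char c : s) r += c;
        return r;
    }
    unsigned long xor_bytes(const std::string& s) {
        unsigned long r = 0;
        for (unsigned char c : s) r ^= c;
        return r;
    }
    unsigned long djb2(const std::string& s) {
        unsigned long r = 5381;
        for (unsigned char c : s) r = r * 33 + c;
        return r;
    }

    int main() {
        const std::string data = "coded message";   // single data stream
        unsigned long r1, r2, r3;
        std::thread t1([&]{ r1 = sum_bytes(data); });
        std::thread t2([&]{ r2 = xor_bytes(data); });
        std::thread t3([&]{ r3 = djb2(data); });
        t1.join(); t2.join(); t3.join();
        std::cout << r1 << ' ' << r2 << ' ' << r3 << '\n';
    }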
MIMD
• Multiple instruction: Every processor may execute a different
instruction stream
• Multiple data: Every processor may work with a different data
stream
Examples:
• Most current supercomputers
• Grid computers
• Networked parallel computers
• Symmetric Multiprocessor (SMP) computers
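By contrast with the MISD sketch, MIMD pairs different instruction streams with different data streams. A minimal C++ sketch (illustrative functions and data, not from the slides):

    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int> xs = {1, 2, 3, 4};
        std::vector<int> ys = {5, 6, 7, 8};
        long sum = 0, prod = 1;
        // Different instruction streams operating on different data streams.
        std::thread t1([&]{ sum = std::accumulate(xs.begin(), xs.end(), 0L); });
        std::thread t2([&]{ for (int y : ys) prod *= y; });
        t1.join(); t2.join();
        std::cout << "sum=" << sum << " prod=" << prod << '\n';
    }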
MIMD (Cont.)
MIMD systems are mainly:
Shared Memory (SM) systems:
• Multiple CPUs, all of which share the same address space
(there is only one memory)
Distributed Memory (DM) systems:
• Each CPU has its own associated memory
• CPUs are connected by some network (e.g., clusters)
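A distributed-memory sketch using MPI (assuming an MPI implementation such as MPICH or Open MPI is installed; compile with mpic++ and launch with mpirun -np 2): each process owns a private memory and exchanges data only through messages.

    #include <cstdio>
    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's id
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // number of processes
        int value = rank * 10;                  // lives in private memory
        if (rank == 0 && size > 1) {
            int received;
            // No shared address space: data must be sent explicitly.
            MPI_Recv(&received, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            std::printf("rank 0 received %d from rank 1\n", received);
        } else if (rank == 1) {
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
    }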
MIMD - Shared Memory
• All processors have access to all memory as a global address
space
Uniform Memory Access (UMA)
• The data access time from every processing unit to the shared
memory is constant
• Mostly represented by Symmetric Multiprocessor (SMP)
machines
Non-Uniform Memory Access (NUMA)
• The data access time to the shared memory is not constant; it
depends on which processing unit accesses which part of memory
Shared Memory Interconnection Network
• The main problem is how to interconnect the CPUs with each
other and with the memory
There are three main network topologies available:
• Crossbar: n² connections; each datapath is dedicated (no sharing)
• Omega (Ω) network: n log2 n connections; log2 n switching stages,
with sharing along a path
• Central data bus: 1 connection, shared by all n processors
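For example, with n = 8 processors: a crossbar needs 8² = 64 connections (no sharing), an Ω-network needs 8 × log2 8 = 24 connections arranged in log2 8 = 3 switching stages (paths shared), and a central bus needs a single connection shared by all 8 processors.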
Thank You!