Chapter One
Parallel Computer Models
Computer Generations
First Generation (1945-1954):
• Single central processing unit (CPU).
• Performed fixed-point arithmetic; program execution was driven by a program counter
• Used machine or assembly languages
• Subroutine linkage was not implemented
• Built from vacuum-tube and relay-memory technology
• Representative systems: IBM 701, ENIAC, Princeton IAS
Second Generation (1955-1964):
• Floating-point arithmetic, multiplexed memory access, and index registers were introduced
• Subroutine libraries and compilers were implemented
• High-level languages (Fortran, COBOL) were established
• Register transfer language (RTL) was developed
• Representative systems: IBM 7030, Univac LARC, CDC 1604
Third Generation (1965-1974):
• Pipelining and cache memory were introduced
• Integrated circuits (ICs) and microprogrammed control were used to coordinate activities between the CPU and I/O devices for multiple users
• Time-sharing operating systems using virtual memory were developed for maximum use of resources
• Representative systems: IBM 360/370 series, CDC 6600/7600, ASC, PDP-8 series
Elements of Modern Computers
Computing problem
Algorithms and data structures
Hardware resources
Operating system
System software support
Compiler Support
Computing problem:
• A modern computer is an integrated system: machine hardware, an instruction set, system software, application programs, and user interfaces.
• Different types of problems demand different computing resources: numerical problems demand complex mathematical formulations; alphanumerical problems demand efficient transaction processing and large database management; artificial intelligence problems demand logic inference and symbolic manipulation. Some problems require a combination of all of these processes.
Algorithms and data structures:
• To specify the computations involved in the solution, particular data structures and
algorithms are needed.
• In most cases, numerical algorithms are deterministic, whereas symbolic processing may need nondeterministic approaches.
Hardware resources:
• Processors, memory, and peripheral devices form the hardware core of a computer system.
• This includes special hardware interfaces built into I/O devices such as network adapters, modems, workstations, display terminals, printers, and scanners.
Operating System:
• Manages the allocation and deallocation of resources during the execution of user programs.
• Application software and standard benchmark programs must be used for performance evaluation.
• Supports efficient mapping of programs onto the machine through compilation, processor scheduling, and memory management, exploiting parallelism at both compile time and run time.
System software support:
• Programs written in high-level languages must be translated into machine language, which requires good software support.
• Resource binding uses the compiler, assembler, loader, and OS kernel to map the program onto the physical machine for execution.
Compiler Support:
• There are three compiler support approaches: i) Preprocessor, ii) Precompiler, and iii) Parallelizing compiler.
• A preprocessor uses a sequential compiler together with a low-level library to implement high-level parallel constructs.
• A precompiler requires some program flow analysis and limited optimizations toward detecting parallelism, rather than full automation.
• A parallelizing compiler is a fully developed parallelizing/vectorizing compiler that can transform sequential code into parallel constructs (a sketch of such a construct follows this list).
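For illustration only (OpenMP is an assumption here; the source does not name a specific tool), a minimal C sketch of the kind of construct these approaches target. Compiled with an OpenMP-capable compiler (e.g., cc -fopenmp), the pragma asks the compiler to distribute the loop iterations across processors; a fully parallelizing compiler would have to discover the loop's independence in unannotated sequential code on its own.

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];

    /* Sequential initialization. */
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* Iterations are independent, so the compiler may run them in parallel. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}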
Flynn’s Classification of Computer Architectures (Michael Flynn, 1972):
• SISD (Single Instruction stream over a Single Data stream): a conventional sequential machine; one CU issues one IS to one PU, which operates on one DS held in the MU.
• SIMD (Single Instruction stream over Multiple Data streams): one CU broadcasts the same IS to many PEs, each operating on its own DS in its LM, as in array and vector processors.
• MISD (Multiple Instruction streams over a Single Data stream): several PUs apply different instructions to the same data stream; rarely built in practice.
• MIMD (Multiple Instruction streams over Multiple Data streams): multiple processors execute independent instruction streams over independent data streams, as in multiprocessors and multicomputers.
Legend: CU = Control Unit, PU = Processing Unit, MU = Memory Unit, IS = Instruction Stream, DS = Data Stream, PE = Processing Element, LM = Local Memory
System Performance Attributes
Clock Time/Clock Cycle:
• The clock time (cycle time), denoted by τ, is a constant for a given machine but varies from one architecture to another. (Usually measured in ns)
Clock Rate/Cycle Frequency:
• The clock rate, denoted by f, is the inverse of the clock time: f = 1/τ. (Measured in Hz, in practice usually MHz or GHz)
Instruction Count (Ic):
• The instruction count is the size of a program: the number of machine instructions that must be executed to complete the program.
CPI/Cycles per Instruction:
• An instruction consists of several micro-operations (fetch, decode, operand fetch, execute), each taking one or more clock cycles; different machine instructions may therefore require different numbers of clock cycles.
• CPI is the average number of clock cycles needed to execute one instruction.
• If the total number of CPU clock cycles for a program is C and the instruction count is Ic, then:
CPI = C / Ic
Execution Time/CPU Time:
• If Ic is the total number of instructions, CPI the cycles per instruction, and τ the clock cycle time, the total execution time T for a program is:
T = Ic * CPI * τ   or, equivalently,   T = Ic * CPI / f
• Carrying out an instruction involves several phases:
Instruction fetch
Decode
Operand fetch
Execution
Storing results back to memory
• The decode and execution phases are carried out in the CPU; the cycles they consume are processor cycles.
• The remaining three phases require memory access; the cycles they consume are memory cycles.
• A memory cycle is usually k times the processor cycle, where k is the ratio of the memory cycle time to the processor cycle time. The CPI can then be written as:
CPI = p + m * k
• Therefore, the total execution time T is:
T = Ic * (p + m * k) * τ
Where, p = number of processor cycles needed per instruction for decode and execution
m = number of memory references needed per instruction
k = ratio of the memory cycle time to the processor cycle time
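• Worked example (hypothetical numbers, not from the source): suppose p = 4, m = 2, k = 10, Ic = 10⁶, and τ = 2 ns. Then CPI = 4 + 2 * 10 = 24, and T = 10⁶ * 24 * 2 ns = 0.048 s.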
MIPS rate (Million Instructions Per Seconds):
• Evaluated as:
MIPS rate = Ic / (T * 10⁶)
Or, MIPS rate = f / (CPI * 10⁶) = (f * Ic) / (C * 10⁶) [derived from the total execution time, T = Ic * CPI * τ]
Math Problems
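One worked example under the formulas above (all numbers hypothetical, chosen for illustration): let Ic = 2 * 10⁶ instructions, CPI = 2.5, and f = 500 MHz (so τ = 2 ns). Then T = Ic * CPI / f = 0.01 s, and MIPS rate = Ic / (T * 10⁶) = f / (CPI * 10⁶) = 200. A minimal C sketch that evaluates the same formulas:

#include <stdio.h>

int main(void) {
    double Ic  = 2e6;    /* instruction count (hypothetical) */
    double CPI = 2.5;    /* average cycles per instruction (hypothetical) */
    double f   = 500e6;  /* clock rate in Hz, so tau = 1/f = 2 ns */

    double T    = Ic * CPI / f;    /* T = Ic * CPI * tau = Ic * CPI / f */
    double mips = Ic / (T * 1e6);  /* MIPS rate = Ic / (T * 10^6) */

    printf("T = %.4f s, MIPS rate = %.1f\n", T, mips);  /* prints T = 0.0100 s, MIPS rate = 200.0 */
    return 0;
}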
Multiprocessors and Multicomputers
Depending on how memory is accessed, two categories are presented here.
Shared-Memory Multiprocessors
• Uniform Memory Access (UMA) model
• Non-uniform Memory Access (NUMA) model
• Cache Only Memory Architecture (COMA) model
Distributed-Memory Multicomputers
Shared-Memory Multiprocessors
• The Uniform Memory Access (UMA) model:
The physical memory is uniformly shared by all the processors.
All processors have equal access time to all memory words, hence the name uniform memory access.
Each processor may have its own private cache.
The tightly coupled processors and the memory are interconnected by a common bus, a crossbar switch, or a multistage network.
When all the processors have equal access to all the peripherals, the system is called a symmetric multiprocessor.
When only one or a few processors have such access rights, it is called an asymmetric multiprocessor.
The UMA model is well suited to time-sharing applications with multiple users, and it can also be used to speed up the execution of a single large program.
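To make the shared-address-space idea concrete, a minimal sketch (not from the source; assumes a POSIX system, compiled with cc -pthread): all threads run in one address space and read and write the same array, just as the processors of a UMA machine share one physical memory.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000

double shared_data[N];  /* one address space, visible to every thread */

void *worker(void *arg) {
    long id = (long)arg;
    /* Each thread fills a disjoint slice of the shared array,
       so no locking is needed for these writes. */
    for (int i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
        shared_data[i] = 2.0 * i;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < NTHREADS; id++)
        pthread_join(t[id], NULL);
    printf("shared_data[N-1] = %f\n", shared_data[N - 1]);
    return 0;
}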
• The Nonuniform Memory Access (NUMA) model:
In the NUMA model, the access time varies with the location of the memory word.
The shared memory is physically distributed among all the processors as local memories.
Processors are divided into several clusters; the clusters themselves may be UMA or NUMA multiprocessors.
All processors belonging to a given cluster have uniform access to that cluster's shared memory.
All clusters have equal access to the global shared memory, but the access time to a cluster's own memory is shorter than the access time to the global memory.