CS 405
COMPUTER SYSTEM ARCHITECTURE
SUFYAN P
Assistant Professor
[email protected]
Computer Science and Engineering
MEA Engineering College, Perinthalmanna
TEXT BOOK:
K. Hwang and Naresh Jotwani, Advanced Computer Architecture: Parallelism, Scalability, Programmability, TMH, 2010.
Introduction to advanced computer architecture
❖ Computer Organization:
• It refers to the operational units and their interconnections
that realize the architectural specifications.
• It describes the function and design of the various units of a digital computer that store and process information
❖ Computer hardware:
• Consists of electronic circuits, displays, magnetic and optical
storage media, electromechanical equipment and
communication facilities.
❖ Computer Architecture:
• It is concerned with the structure and behavior of the
computer.
• It includes the information formats, the instruction set and
techniques for addressing memory.
Introduction to advanced computer architecture
Syllabus
• Basic concepts of parallel computer models
• SIMD computers
• Multiprocessors and multicomputers
• Cache Coherence Protocols
• Multicomputers
• Pipelining computers and Multithreading
MODULE-1
CONTENTS
• Parallel computer models
o Evolution of Computer Architecture
o System Attributes to performance.
• Amdahl's law
• Multiprocessors and Multicomputers
• Multivector and SIMD computers
• Architectural development tracks
• Conditions of parallelism.
Evolution of computer architecture
INTRODUCTION
• Study of computer architecture involves
both
o Hardware organization
o Programming
• Evolution of computer architecture started with the von Neumann architecture
o Built as a sequential machine
o Executing scalar data
• Major advancements came due to the following techniques
o Look-ahead technique
o Parallelism & pipelining
o Flynn’s classification
o Parallel / vector computers
Look-ahead Technique
• Introduced for enabling instruction prefetching
• Used to overlap I/E operations
o I/E➔ instruction fetch and execute
• Enable functional parallelism
o Different functions are distributed & they are performed concurrently
by processes or threads across different processors
Look-ahead Technique[2]
Flynn’s classification
• Classification is based on
o Instruction stream
o Data streams
• Classifications are:-
o SISD (single instruction stream over single data stream)
• E.g.: conventional sequential machines
o SIMD (single instruction stream over multiple data streams)
• E.g.: vector computers
o MIMD (multiple instruction streams over multiple data streams)
• E.g.: parallel computers
o MISD (multiple instruction streams over single data stream)
• E.g.: special purpose computers
Pipelining
• Pipelining is a technique where multiple instructions are
overlapped during execution.
• The pipeline is divided into stages, and these stages are connected with one another to form a pipe-like structure.
• Instructions enter from one end and exit from another
end.
• Pipelining increases the overall instruction throughput.
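The throughput gain can be illustrated with a small calculation. Below is a minimal sketch (in Python; the stage count and instruction count are illustrative, not from the text) comparing unpipelined and pipelined cycle counts, assuming one cycle per stage and no stalls.

# Unpipelined vs pipelined execution time for n instructions on a
# k-stage pipeline, assuming one clock cycle per stage and no stalls.

def unpipelined_cycles(n, k):
    # each instruction passes through all k stages before the next starts
    return n * k

def pipelined_cycles(n, k):
    # k cycles to fill the pipeline, then one instruction completes per cycle
    return k + (n - 1)

n, k = 100, 5
print(unpipelined_cycles(n, k))   # 500 cycles
print(pipelined_cycles(n, k))     # 104 cycles -> ~4.8x higher throughput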
Parallel computers
• Executions are carried out simultaneously
• 2 classes
o Shared memory multiprocessors
o Message passing multicomputers
• Distinction lies in
o Memory sharing
o Interprocessor communication
Shared memory multiprocessor
• Processors in the multiprocessor system use a common memory
• They communicate using shared variables
Message passing multicomputer
• Each computer node in the multicomputer system has a
local memory
• Interprocessor communication is done via message passing
Vector processor
• It is a processor whose instructions operate on
vector data
• 2 families of vector processor
o Memory to memory
o Register to register
• Memory to memory architecture
o supports pipelined flow of vector operands directly from memory to
pipelines & then back to the memory
• Register to register architecture
o Uses registers to interface between memory & pipelines
System attributes to performance
• Ideal performance of a system demands perfect
matching between
o machine capability
o program behavior
• Machine capability can be enhanced via
o Better h/w technology
o Innovative architectural features
o Efficient resource management
• Factors affecting program behavior
o Algorithm design
o Data structures
o Language efficiency
o Programmer skill
o Compiler technology
Performance factors
• Cycle time τ
• Clock rate f
o f= 1/τ
• CPI (cycles per instruction)
• Instruction count Ic
• Processor cycles per instruction p
• Memory references per instruction m
• Ratio between memory cycle & processor cycle k
• Cycle time
o Time taken to complete one clock cycle
• Clock rate
o Inverse of cycle time
• Instruction count
o No. of machine instructions to be executed in a program
• CPI (cycles per instruction)
o No. of cycles taken to execute one instruction
CPU time (T)
• CPU time is the time needed to execute a program
• It depends on following factors
o Ic
o CPI
o Cycle time
• T = Ic * CPI * τ ……….. (1)
• Instruction execution involves a cycle of events like
o Instruction fetch ➔ memory access
o Decode ➔ carried out by CPU
o Operand fetch ➔ memory access
o Execution ➔ carried out by CPU
o Store result ➔ memory access
CPI (cycles per instruction)
• It can be divided into 2 component terms
o Processor cycles p
o Memory cycles m
• Eq. (1) can be rewritten as
o T = Ic * (p + m*k) * τ ……….. (2)
• p ➔ no. of processor cycles needed for instruction decode & execution
• m ➔ no. of memory references needed
• k ➔ ratio between memory cycle and processor cycle
• τ ➔ processor cycle time
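A minimal sketch of eq. (2) in code (Python; the numerical values are illustrative, not from the text):

# CPU time from eq. (2): T = Ic * (p + m*k) * tau

def cpu_time(Ic, p, m, k, tau):
    # Ic: instruction count, p: processor cycles per instruction,
    # m: memory references per instruction, k: memory/processor cycle ratio,
    # tau: processor cycle time in seconds
    return Ic * (p + m * k) * tau

# e.g. 100000 instructions, 1 processor cycle and 0.5 memory references per
# instruction, memory 4 times slower than the processor, 25 ns cycle time
print(cpu_time(100000, 1, 0.5, 4, 25e-9))   # 0.0075 seconds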
• Memory cycle
o Time needed to complete one memory reference
o Usually k times the processor cycle τ
o k ➔ depends on
• speed of cache
• memory technology
• Processor memory interconnection scheme
• C ➔ total no. of clock cycles needed to execute a program
• CPI ➔ no. of clock cycles needed to execute a single instruction
• CPI = C / Ic ……….. (3)
• Eq. (1) can be rewritten as follows
• T = Ic * CPI * τ ➔ T = Ic * (C / Ic) * τ
➔ T = C * τ ……….. (4)
➔ T = C / f ……….. (5)
System attributes
• The 5 performance factors (Ic, p, m, k, τ) are influenced by 4 system attributes
o Instruction set architecture
o Compiler technology
o CPU implementation & control
o Cache and memory hierarchy
• The instruction-set architecture affects the program
length (Ic) and processor cycle needed (p).
• The compiler technology affects the values of Ic, p, and the memory reference count (m).
MIPS rate (million instructions per second)
• MIPS rate is based on following factors
o Clock rate f
o Instruction count Ic
o CPI of given machine
• MIPS rate = Ic / (T * 10^6) ……….. (6)
• In terms of eq. (1), the above eq. can be rewritten as follows
➔ MIPS rate = Ic / (Ic * CPI * τ * 10^6) = f / (CPI * 10^6) ……….. (7)
➔ MIPS rate = (Ic * f) / (C * 10^6) ……….. (8)
Throughput rate
• CPU throughput Wp
• It is the measure of how many programs can be executed
per second, based on the MIPS rate & average program
length
• Wp = f / (Ic * CPI) ……….. (9)
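Eqs. (6)-(9) translate directly into code. A small sketch (Python; the values of f, Ic, and CPI are illustrative, not from the text):

# MIPS rate (eq. 7) and CPU throughput (eq. 9)

def mips_rate(f, CPI):
    # f in Hz; result in millions of instructions per second
    return f / (CPI * 1e6)

def throughput(f, Ic, CPI):
    # Wp = f / (Ic * CPI), programs executed per second
    return f / (Ic * CPI)

f, Ic, CPI = 200e6, 4e6, 2.0         # illustrative values
print(mips_rate(f, CPI))             # 100.0 MIPS
print(throughput(f, Ic, CPI))        # 25.0 programs per second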
Problem 1
[Problem statement given as a figure: a 40 MHz processor runs a program of 100,000 instructions; the instruction mix accounts for 45,000 + 64,000 + 30,000 + 16,000 clock cycles. Find the effective CPI, MIPS rate, and execution time.]
Solution
• Total no. of cycles required to execute the complete program
➔ 45000 + 64000 + 30000 + 16000 = 155000 cycles
• C=155000 cycles
• Effective CPI = C / Ic = 155000 / 100000 = 1.55
• MIPS rate = f / (CPI * 10^6)
= (40 * 10^6) / (1.55 * 10^6)
= 25.8 MIPS
• Given f = 40 MHz ➔ τ = 1/40 µs = 0.025 µs
• T = Ic * CPI * τ
= 100000 * 1.55 * 0.025 µs
= 3875 µs = 3.875 ms
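The same arithmetic can be checked with a short script (a sketch reproducing the numbers above):

# Verify Problem 1: Ic = 100,000 instructions, f = 40 MHz
Ic = 100000
C = 45000 + 64000 + 30000 + 16000    # total clock cycles = 155000
f = 40e6                             # 40 MHz

CPI = C / Ic                         # 1.55
T = Ic * CPI * (1 / f)               # seconds
mips = f / (CPI * 1e6)               # ~25.8

print(CPI, T * 1e3, mips)            # 1.55, 3.875 ms, 25.8 MIPS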
Problem 2
Floating point operations per second
• Most computer applications use floating point
operations
• For those applications, performance is measured using FLOPS
• FLOPS ➔ no. of floating point operations per second
Implicit parallelism
o In this approach conventional languages like C, C++, or Fortran are used to write the source program
o Sequentially coded source program is translated to a parallel
object code
o This is done by the compiler
o The compiler must be able to detect parallelism
Explicit parallelism
• More effort is needed by the programmer
• Source program is developed using the parallel dialects of C, C++, or Fortran
• Parallelism is explicitly specified in the source program
• The compiler need not detect parallelism
• Reduces the burden on the compiler
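As a rough illustration (not from the text), explicit parallelism in Python might look like the sketch below: the programmer, not the compiler, decides how the work is partitioned and run in parallel.

# Explicit parallelism: the programmer splits the work into chunks and
# creates the parallel tasks; nothing is detected automatically.
from multiprocessing import Pool

def partial_sum(chunk):
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1000000))
    chunks = [data[i::4] for i in range(4)]     # explicit 4-way split
    with Pool(processes=4) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)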
MULTIPROCESSORS & MULTICOMPUTERS
Introduction
• Parallel computers are divided into 2
o Shared memory multiprocessors
o Message passing multicomputers
• Their difference is based on memory
o One has shared common memory
o Other has unshared distributed memory
Shared memory multiprocessor
• 3 models
o UMA model (uniform memory access)
o NUMA model (non-uniform memory access)
o COMA model (cache only memory architecture)
• These models differ in how the memory &
peripheral resources are shared or distributed
UMA model
• Physical memory is uniformly shared by all the
processors
• All processors have equal access time
• Peripherals are also shared
• Due to this high degree of resource sharing, multiprocessors are also called tightly coupled systems
• Communication & synchronization b/w processors are
done via shared variables
• System interconnection is done using
o Bus
o Crossbar switch
o Multistage network
UMA model
UMA model
• UMA is sometimes called CC-UMA - Cache Coherent
UMA.
• Cache coherent means if one processor updates a
location in shared memory, all the other processors
know about the update
Advantages
• Suitable for general purpose & time sharing
applications by multiple users
• Speed up the execution of a single large program
in time critical applications
Symmetric vs asymmetric multiprocessor system
• Symmetric multiprocessor system
o All processors have equal access to all peripheral devices
o All processors are equally capable of running executive programs such as the OS kernel and I/O routines
• Asymmetric multiprocessor system
o Only one or a subset of processors is executive-capable
o Master processor (MP) can execute OS and I/O routines
o Remaining have no I/O capabilities
o The remaining processors are called attached processors (AP)
o APs execute user code under the supervision of the master processor
NUMA model
• It is a shared memory system in which the access time
varies with the location of memory word
• There are two NUMA models
o Shared local memory model
o Hierarchical cluster model
Shared local memory model
• Shared memory is physically distributed to all
processors
• They are called local memories
o The collection of local memories forms a global address space
o This is accessible by all processors
• Access variations
o Access to the local memory attached with a local processor is faster
o Access to remote memory attached to other processors takes longer
• This is due to delay in interconnection n/w
Hierarchical cluster model
• Processors are divided into several clusters
• Each cluster itself is a UMA or NUMA multiprocessor
• Clusters are connected to global shared memory
modules (GSM)
o All clusters have equal access to global memory
• All processors belonging to the same cluster uniformly access the cluster shared memory modules (CSM)
• Access time to CSM is shorter than GSM
• Access rights to inter cluster memories can be specified
in various ways
COMA model
• It is a multiprocessor that uses only cache memory
• Special case of NUMA
• Distributed main memories are converted to caches
• No memory hierarchy
• All caches form a global address space
• Remote cache access is assisted by distributed cache
directories
Representative multiprocessors
Message passing multicomputer
• Also called a distributed-memory multicomputer
• This system consists of multiple computers known as nodes
• Nodes are interconnected by a message passing n/w
• Each node is an autonomous computer consisting of
o Processor
o Local memory
o Attached disks or I/O peripherals
Message passing multicomputer
Message passing multicomputer
• Message passing n/w provide point to point static
connections among nodes
• All local memories are private & are accessible only by local processors
• Therefore they are also called no-remote-memory-access machines (NORMA)
• Inter-node communications are carried out by message
passing
Advantages
• Scalability
• Fault tolerance
• Suitable for certain applications
Representative multicomputer
MULTIVECTOR AND SIMD COMPUTERS
Introduction
• In this section we introduce supercomputers and parallel processors for vector processing and data parallelism
• Supercomputers are classified as
o Vector supercomputers
• Uses powerful processors equipped with vector hardware
o SIMD supercomputers
• Provide massive data parallelism
Supercomputers
Vector supercomputers
• A vector computer is built on top of a scalar processor
o i.e., a vector processor is attached to the scalar processor
• Host computer loads the program & data to the main
memory
• Scalar control unit decodes all the instructions
• If the decoded instruction is a scalar operation or a
program control operation,
o it is directly executed by the scalar processor
o Execution is done using scalar functional pipelines
• If the decoded instruction is a vector operation,
o It is sent to vector control unit
• Vector control unit
o It supervises the flow of vector data b/w main memory & vector functional pipelines
o It coordinates the vector data flow
Vector processor models
• Register to register architecture
o Vector registers are used to hold the following
• Vector operands
• Intermediate vector results
• Final vector results
o Vector functional pipelines receive operands from these registers & put the results back into these registers
o All vector registers are programmable
o Each vector register has a component counter
• It keeps track of component registers used in
successive pipeline cycles
Vector processor models
• Memory to memory architecture
• A vector stream unit is used instead of vector registers
• Vector operands & results are directly retrieved from
and stored into the main memory
• E.g.: Cyber 205
Examples of vector supercomputers
SIMD Supercomputers
• Computers with multiple processing elements (PE)
• They perform the same operation on multiple data points simultaneously
• The operational model of an SIMD computer is specified by the 5-tuple
• M = (N, C, I, M, R)
SIMD Supercomputers
• N ➔ no. of processing elements (PEs) in the machine
• C ➔ set of instructions directly executed by the control unit (CU)
• I ➔ set of instructions broadcast by the CU to all PEs for parallel execution
o These include arithmetic, logic, and data routing operations executed by each PE over the data within that PE
• M ➔ set of masking schemes
o Each mask partitions the set of PEs into enabled & disabled subsets
• R ➔ set of data routing functions
o Used in the interconnection n/w for inter-PE communications
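A toy sketch of this operational model (Python; purely illustrative, since a real SIMD machine does this in hardware): the CU broadcasts one instruction, and every enabled PE applies it to its own local data.

# One SIMD step: a broadcast operation applied by all enabled PEs
N = 8                                   # number of processing elements
data = [1, 2, 3, 4, 5, 6, 7, 8]         # one operand held in each PE
mask = [True, True, False, True,        # masking scheme: enabled/disabled PEs
        True, False, True, True]

def broadcast(op, data, mask):
    # the broadcast instruction is applied only in the enabled PEs
    return [op(x) if enabled else x for x, enabled in zip(data, mask)]

data = broadcast(lambda x: x * 2, data, mask)
print(data)   # PEs 2 and 5 are masked off and keep their old values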
Examples of SIMD supercomputer
Amdahl’s law[1]
• It is named after computer scientist Gene Amdahl (a computer architect from IBM and Amdahl Corporation). It is also known as Amdahl's argument.
• It is a formula which gives the theoretical speedup in latency of the execution of a task at a fixed workload that can be expected of a system whose resources are improved.
• In other words, it is a formula used to find the maximum
improvement possible by just improving a particular
part of a system.
• It is often used in parallel computing to predict the
theoretical speedup when using multiple processors.
Amdahl’s law[2]
• Speed-up
• Speedup is defined as the ratio of performance for the entire task using the enhancement to the performance for the entire task without using the enhancement
• OR
• Speedup can be defined as the ratio of execution time for the entire task without using the enhancement to the execution time for the entire task using the enhancement.
Amdahl’s law[3]
• If Pe is the performance for entire task using the
enhancement when possible,
• Pw is the performance for entire task without using the
enhancement,
• Ew is the execution time for entire task without using
the enhancement and
• Ee is the execution time for entire task using the
enhancement when possible then,
• Speedup = Pe/Pw
or
Speedup = Ew/Ee
Amdahl’s law[4]
• Amdahl’s law uses two factors to find speedup from
some enhancement –
• Fraction enhanced
• Speedup enhanced
Amdahl’s law[5]
• Fraction enhanced –
• The fraction of the computation time in the original computer
that can be converted to take advantage of the enhancement.
• For example- if 10 seconds of the execution time of a program
that takes 40 seconds in total can use an enhancement , the
fraction is 10/40. This obtained value is Fraction Enhanced.
• Fraction enhanced is always less than 1. (<1)
Amdahl’s law[6]
• Speedup enhanced –
• The improvement gained by the enhanced execution
mode; that is, how much faster the task would run if the
enhanced mode were used for the entire program.
• For example – If the enhanced mode takes, say 3 seconds
for a portion of the program, while it is 6 seconds in the
original mode, the improvement is 6/3. This value is
Speedup enhanced.
• Speedup Enhanced is always greater than 1. (>1)
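Combining the two factors gives the usual statement of the law (consistent with the definitions above): Overall speedup = 1 / ((1 - Fraction enhanced) + Fraction enhanced / Speedup enhanced). A quick check in code using the example numbers from the previous slides (10 s out of 40 s can be enhanced, and the enhanced mode is twice as fast):

# Overall speedup predicted by Amdahl's law
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

print(amdahl_speedup(10 / 40, 6 / 3))   # ~1.14x overall speedup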
Amdahl’s law[7]
THANK YOU
CONDITIONS OF PARALLELISM
Introduction
• To execute several program segments in parallel, each
segment has to be independent of the other segment
• Dependency is the main challenge of parallelism
• A dependence graph shows the relations between program statements
o Nodes of the dependence graph ➔ program statements
o Directed edges with labels ➔ relations among the statements
Types of dependences
• Data dependence
• Control dependence
• Resource dependence
DATA DEPENDENCE
• Ordering relationships b/w statements are indicated by data dependence
• Types
o Flow dependence
o Anti-dependence
o Output dependence
o I/O dependence
o Unknown dependence
Flow dependence
• A statement S2 is flow dependent on statement S1,
• if an execution path exists from S1 to S2, and if at least
one output of S1 is fed as an input to S2
• Denoted as S1➔S2
o E.g.: consider the following instructions
• S1: LOAD R1, A
• S2: ADD R2, R1
• S2 is flow dependent on S1
o because the o/p of S1 is fed as i/p to S2
o i.e., variable A is loaded into register R1 by S1 and then read by S2
Anti-dependence
• Statement S2 is anti-dependent on statement S1 if,
• S2 follows S1 in program order and if o/p of S2 overlaps
the input to S1
• Denoted using a direct arrow crossed with a bar: S1 ↛ S2
o E.g.: consider the following statements
o S2: ADD R2, R1
o S3: MOVE R1, R3
• S3 is anti-dependent on S2 since the output of S3 (R1) overlaps the input to S2
• i.e., there is a conflict over the contents of register R1
Output dependence
• Two statements are output dependent if they produce
the same output variable
• Denoted as S1 o→ S2 (an arrow marked with a small circle)
• Eg:
• S1: LOAD R1,A
• S3: MOVE R1,R3
• S3 is output dependent on S1 because they both modify the same register R1
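The same three dependences can be seen in ordinary code. A small illustrative fragment (Python; not from the text), with the dependences noted in the comments:

x, y = 5, 10
a = x + 1      # S1: writes a
b = a * 2      # S2: reads a  -> flow dependence S1 -> S2
a = y - 3      # S3: writes a -> anti-dependence of S3 on S2 (S3 overwrites what S2 reads)
               #              -> output dependence of S3 on S1 (both write a)
print(a, b)    # reordering S2 and S3 would change the result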
I/O dependence
• I/O dependence occurs if the same file is referenced by both I/O statements
• Read and write are the I/O statements
• E.g.:
• S1: READ(4), A(I)
• S2: PROCESS
• S3: WRITE(4), B(I)
• S4: CLOSE(4)
• S1 and S3 are I/O dependent since both access the
same file
Control dependence
• Implies that order of execution of statements cannot be
determined before runtime
• Conditional statements will not be resolved until
runtime
• Conditional branching may eliminate or introduce data
dependencies
• Control dependency prohibits parallelism
• Solution
o Compiler techniques
o Hardware Branch prediction technique
Resource dependence
• It is concerned with the conflicts in using shared
resources in parallel areas
• Shared resources are
o Integer units
o Floating point units
o Registers
o Memory areas
• When the conflicting resource is an ALU ➔ ALU dependence
• If the conflict involves workplace storage ➔ storage dependence
Hardware and software parallelism
• To implement parallelism we require
o Hardware support
o Software support
• Joint efforts from hardware designers & software programmers are required to exploit parallelism
Hardware parallelism
• This is a type of parallelism defined by machine
architecture & hardware multiplicity
• It is a function of cost & performance tradeoffs
• Indicates peak performance of processor resources
• One way to characterize parallelism in a processor
o No. of instruction issues per machine cycle
• If a processor issues k instructions per cycle ➔ k-issue processor
• If a processor issues 1 instruction per cycle ➔ one-issue processor
o E.g.: a conventional pipelined processor
• Intel i960CA ➔ a three-issue processor
Software Parallelism
• This type of parallelism is revealed in program profile or
program flow graph
o Flow graph displays the simultaneously executable operations
• Software parallelism is a function of
o Algorithm
o Programming style
o Program design
• Types of software parallelism
o Control parallelism
o Data parallelism
Control Parallelism
• Allows 2 or more operations to be performed
simultaneously
• Control parallelism is achieved using
o pipelining
o Multiple functional units
• Limitations
o Length of pipeline
o Multiplicity of functional units
• Pipelining & functional parallelism are handled by h/w
• Programmers need not take any special efforts to invoke
them
Data Parallelism
• Same operation is performed over many data elements
by many processors simultaneously
• Offers highest potential for concurrency
• Practiced in SIMD & MIMD modes
• Data parallel code is easier to write & debug than control parallel code
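A minimal illustration of the idea (Python; not from the text): the same operation is applied to every element of the data set, with a small process pool standing in for multiple processors.

# Data parallelism: one operation, many data elements, processed in parallel
from multiprocessing import Pool

def scale(x):
    return 2.0 * x            # the single operation applied to every element

if __name__ == "__main__":
    data = list(range(16))
    with Pool(processes=4) as pool:       # 4 "processors"
        result = pool.map(scale, data)    # same operation on all elements
    print(result)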
Role of compilers
• Compiler techniques are used to improve performance
• Early compilers which exploit parallelism include:
o CDC STACKLIB
o Cray CFT
• Features included in existing compilers to improve parallelism are:
o Loop transformation
o s/w pipelining
• To exploit parallelism to the fullest,
o Design the compiler & h/w jointly at the same time