
COMPUTER ARCHITECTURE

PCC-CS 402

Module 3

Department of Computer Science & Engineering


2nd Year, 4th Semester
2022
Overview
Module 3

• Instruction-level parallelism: basic concepts
• Techniques for increasing ILP
• Superscalar, Super-pipelined and VLIW processor architectures
• Array and vector processors
Instruction Level Parallelism (ILP)

• A measure of how many INSTRUCTIONS in a program can be executed
SIMULTANEOUSLY (where possible).

• Difference between ILP and Concurrency:
• ILP is the parallel execution of a sequence of instructions belonging to a
SPECIFIC THREAD.
• Concurrency is the assignment of different THREADS of one or more
processes to a CPU’s cores.

• There are two approaches to achieving ILP:
• Hardware (Dynamic Parallelism)
• Software (Static Parallelism)
Dynamic Vs. Static Parallelism

• In Dynamic Parallelism (DP), the processor decides at run-time which instructions can be
executed in parallel.
Example – Pentium Processor.
• CPU examines the instruction stream and assigns instructions to each cycle.
• CPU resolves hazards using advanced techniques at runtime.
• Compiler can help by re-ordering instructions.

• In Static Parallelism (SP), the compiler decides which instructions are to be executed
in parallel.
Example – Itanium Processor.
• Compiler groups instructions to be issued.
• Compiler packages the instructions into ‘issue slots’.
• Compiler detects and avoids hazards.
Illustration 1

• Let us consider the following program:
1. e = a + b
2. f = c + d
3. m = e * f

• It can be observed that Operation 3 depends upon the outcome of Operation 1 and
Operation 2.

• However, Operation 1 and Operation 2 do not have interdependencies. Hence they can
be performed in parallel.

• If we assume that each operation consumes one unit of time, then without ILP, 3 units of
time will be consumed. With ILP, two units of time will be consumed. Therefore the
amount of ILP is 3/2.
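A minimal runnable sketch of this schedule in C (the input values are arbitrary; the cycle annotations assume one unit of time per operation, as above):

#include <stdio.h>

int main(void) {
    int a = 1, b = 2, c = 3, d = 4;   /* arbitrary inputs */

    /* Cycle 1: Operations 1 and 2 have no interdependencies, so an
       ILP-capable processor can issue both in the same cycle. */
    int e = a + b;
    int f = c + d;

    /* Cycle 2: Operation 3 must wait for both e and f. */
    int m = e * f;

    printf("m = %d\n", m);   /* 3 operations in 2 cycles => ILP = 3/2 */
    return 0;
}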
Illustration 2

• The simplest and most common way to increase the amount of parallelism
available among instructions is to exploit parallelism among iterations of a
loop. This type of parallelism is often called loop-level parallelism.

• Example 1:
for (i=1; i<=1000; i= i+1)
x[i] = x[i] + y[i]

• This is a parallel loop. Every iteration of the loop can overlap with any
other iteration, although within each loop iteration there is little opportunity
for overlap.
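The same independence can also be stated explicitly to the compiler. As an illustration only (the pragma is standard OpenMP, which distributes the iterations across threads rather than overlapping instructions, but it rests on exactly the same independence argument):

#define N 1000

/* Each iteration reads and writes only x[i] and y[i], so iterations can
   run in any order or at the same time. Compile with e.g. -fopenmp. */
void vector_add(double x[N + 1], double y[N + 1]) {
    #pragma omp parallel for
    for (int i = 1; i <= N; i = i + 1)
        x[i] = x[i] + y[i];
}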
Illustration 2

• Example 2:
for (i=1; i<=100; i= i+1)
{
a[i] = a[i] + b[i]; //s1
b[i+1] = c[i] + d[i]; //s2
}
• Is this loop parallel? If not how to make it parallel?
• Statement s1 uses the value assigned in the previous iteration by statement s2, so there is a loop-carried
dependency between s1 and s2. Despite this dependency, this loop can be made parallel because the
dependency is not circular:
1. neither statement depends on itself
2. while s1 depends on s2, s2 does not depend on s1.

• A loop is parallel unless there is a cycle in the dependencies, since the absence of a cycle means that the
dependencies give a partial ordering on the statements.
Illustration 2

• To expose the parallelism the loop must be transformed to conform to the partial order. Two observations are
critical to this transformation:
1. There is no dependency from s1 to s2, so interchanging the two statements will not affect
the execution of s2.
2. On the first iteration of the loop, statement s1 depends on the value of b[1] computed prior to
initiating the loop.

• This allows us to replace the loop above with the following code sequence, which makes possible
overlapping of the iterations of the loop:
a[1] = a[1] + b[1];
for (i=1; i<=99; i= i+1)
{
b[i+1] = c[i] + d[i];
a[i+1] = a[i+1] + b[i+1];
}
b[101] = c[100] + d[100];
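One quick way to convince yourself the two loops are equivalent is to run both on the same data and compare; a minimal self-checking sketch (the array contents are arbitrary test values, not from the slides):

#include <assert.h>

int main(void) {
    int a[102], b[102], c[102], d[102], a2[102], b2[102];
    for (int i = 0; i <= 101; i++) {              /* arbitrary test data */
        a[i] = a2[i] = i;  b[i] = b2[i] = 2 * i;
        c[i] = i + 1;      d[i] = i + 2;
    }

    /* Original loop. */
    for (int i = 1; i <= 100; i++) {
        a[i] = a[i] + b[i];
        b[i + 1] = c[i] + d[i];
    }

    /* Transformed loop. */
    a2[1] = a2[1] + b2[1];
    for (int i = 1; i <= 99; i++) {
        b2[i + 1] = c[i] + d[i];
        a2[i + 1] = a2[i + 1] + b2[i + 1];
    }
    b2[101] = c[100] + d[100];

    for (int i = 1; i <= 101; i++)
        assert(a[i] == a2[i] && b[i] == b2[i]);   /* identical results */
    return 0;
}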
Illustration 2

• Example 3:
for (i=1; i<=100; i= i+1)
{
a[i+1] = a[i] + c[i]; //S1
b[i+1] = b[i] + a[i+1]; //S2
}

• This loop is not parallel because it has cycles in the dependencies, namely the statements
S1 and S2 depend on themselves!
Implementation of ILP

• Instruction pipelining, where the execution of multiple instructions can be
partially overlapped.

• Superscalar execution, VLIW, and the closely related explicitly parallel
instruction computing concepts, in which multiple execution units are used to
execute multiple instructions in parallel.

• Speculative execution, which allows the execution of complete instructions
or parts of instructions before being certain whether this execution should
take place. A commonly used form of speculative execution is control flow
speculation, where instructions past a control flow instruction (e.g., a branch)
are executed before the target of the control flow instruction is determined.
Implementation of ILP

• Out-of-order execution where instructions execute in any order that does not violate data
dependencies.
Note that this technique is independent of both pipelining and superscalar execution.
Current implementations of out-of-order execution dynamically (i.e., while the program is
executing and without any help from the compiler) extract ILP from ordinary programs. An
alternative is to extract this parallelism at compile time and somehow convey this
information to the hardware.
• Register renaming, which refers to a technique used to avoid unnecessary serialization of
program operations imposed by the reuse of registers by those operations; it is used to
enable out-of-order execution (see the sketch after this list).

• Branch prediction, which is used to avoid stalling while control dependencies are
resolved. Branch prediction is used together with speculative execution.
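A minimal sketch of the false dependencies that renaming removes, written in C with a local variable standing in for a reused register (all names are illustrative):

/* 't' plays the role of a reused register. Reusing one name creates
   write-after-read / write-after-write hazards that serialize the code. */
int false_dependency(int a, int b, int c, int d) {
    int t, x, y;
    t = a + b;      /* write t */
    x = t * 2;      /* read t  */
    t = c + d;      /* write t again: WAR/WAW hazard with the lines above */
    y = t * 2;
    return x + y;
}

/* After renaming the second use to a fresh name t2, the (t1, x) and
   (t2, y) groups are fully independent and can overlap. */
int renamed(int a, int b, int c, int d) {
    int t1 = a + b;
    int x  = t1 * 2;
    int t2 = c + d;   /* no longer waits for the read of t1 */
    int y  = t2 * 2;
    return x + y;
}

Hardware register renaming performs the second rewrite automatically, mapping each write of an architectural register to a fresh physical register.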
ILP Architectures Classifications

• Sequential Architectures: the program is not expected to convey any explicit
information regarding parallelism. (Superscalar processors)
• Dependence Architectures: the program explicitly indicates the dependences
that exist between operations. (Dataflow processors)
• Independence Architectures: the program provides information as to which
operations are independent of one another. (VLIW processors)
Superscalar Processors

• Implements ILP within a single processor.

• In contrast to a scalar processor, which can execute at most one instruction per clock
cycle, a superscalar processor can execute more than one instruction during a clock cycle
by simultaneously dispatching multiple instructions to different execution units on the
processor.

• It therefore allows for more throughput (the number of instructions that can be executed
in a unit of time) than would otherwise be possible at a given clock rate.

• Each execution unit is not a separate processor (or a core if the processor is a multi-core
processor), but an execution resource within a single CPU such as an arithmetic logic
unit.
Superscalar Processors

• According to Flynn’s Classification, SISD is a Single Core Superscalar Processor.

• A multi-core superscalar processor is classified as an MIMD processor.

• A Superscalar machine executes multiple independent instructions in parallel. They are
pipelined as well.
• “Common” instructions (arithmetic, load/store, conditional branch) can be executed
independently.

• Equally applicable to RISC & CISC, but more straightforward in RISC machines.

• The order of execution is usually assisted by the compiler.


Superscalar Processors
Superscalar Processors - Limitations

• Dependent upon:
• Instruction level parallelism possible
• Compiler based optimization
• Hardware support

• Limited by
• Data dependency
• Procedural dependency
• Resource conflicts
Very Long Instruction Word - VLIW

• Refers to instruction set architectures designed to exploit ILP.
• Whereas conventional central processing units (CPUs) mostly allow programs to
specify instructions to execute in sequence only, a VLIW processor allows programs to
explicitly specify instructions to execute in parallel.
• VLIW means that multiple operations are grouped together into one very
long instruction.
• In a VLIW processor, multiple operations inside a single long instruction are issued in
parallel to an equal number of functional units.
• The compiler checks that only independent instructions are executed in parallel, so as
to extract as much parallelism as possible.
• One Program Counter points to one long instruction.
• Since multiple operations are packed into one instruction word, the instruction
words are much larger than in RISC and CISC (a schematic sketch follows).
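As a schematic illustration only (the slot mix and field names are assumptions, not any real ISA), a long instruction word can be pictured as a fixed group of operation slots that the compiler fills:

#include <stdint.h>

/* Hypothetical 4-slot VLIW bundle, one slot per functional unit.
   All four slots issue together in one cycle; the compiler must
   guarantee they are mutually independent. Empty slots carry NOPs. */
typedef struct {
    uint32_t alu_op;      /* integer ALU slot */
    uint32_t mul_op;      /* multiplier slot  */
    uint32_t mem_op;      /* load/store slot  */
    uint32_t branch_op;   /* branch slot      */
} vliw_bundle;            /* one 128-bit "very long" instruction word */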
Very Long Instruction Word - VLIW
VLIW Implementation

• The VLIW architecture is implemented through Static Scheduling. This means that they
are not done at runtime by the processor but are handled by the compiler.
• The compiler takes the complex instructions that need to be handled and compiles them
into Object Code.
• The Object Code is then passed to the Register File.
• It is this Object Code that is referred to as Very Long Instruction Word.
• The compiler must guarantee that the multiple instructions which group together are
independent so they can be executable in parallel.
• The Compiler pre-arranges the Object Code so the VLIW chip can quickly execute the
instructions in parallel.
VLIW limitations

• The need for a powerful compiler,
• Increased code size arising from aggressive scheduling policies,
• Larger memory bandwidth and register-file bandwidth,
• Limitations due to the lock-step operation,
• Binary compatibility across implementations with varying numbers of
functional units and latencies.
VLIW Vs. Superscalar

• Superscalar Processors generally use Dynamic Scheduling that transforms all ILP
complexity to the processor hardware.

• Hardware Complexity is more in case of Superscalar Architectures.

• VLIW chips don’t need most of the complex circuitry that Superscalar chips must use to
coordinate parallel execution at runtime.

• Thus, in the VLIW architecture, complexity is greatly reduced:
• The executable instructions are generated directly by the compiler.
• They are executed as Native Code by the functional units present in the hardware.
VLIW Vs. RISC Vs. CISC
Array and Vector Processors
History
• Development of Array Processors began in the 1960s with the Solomon Project.
• Solomon’s objective was to improve mathematical performance. To do this, it
employed a considerable number of maths coprocessors under the control of a
single CPU.
• The CPU fed a single (common) instruction to all the math units (ALUs), one
per cycle, but each unit worked on a different dataset.
• In 1962, the project was cancelled, but the University of Illinois continued it
as the ILLIAC IV.
• From 1972 – 1990, it was the fastest machine in the world.
Array Processor

• It is a synchronous parallel computer with multiple ALUs. These ALUs are
called Processing Elements (PEs), which can operate in parallel in lock-step
fashion.
• It comprises N identical PEs under the control of a single Control Unit. It
also has a number of Memory Modules.
• A pipeline can be efficiently implemented using an Array Processor, which
processes each small unit in the pipeline stages individually.
How Array Processors help

• Suppose a CPU has an instruction
C := A + B
• The data for A, B and C can theoretically be encoded in the instruction. In reality, the
address of the memory location where the data is stored is passed to the instruction. This
process is highly time consuming.
• In modern CPUs instruction pipelines are used in which the instruction passes through
sub-units in time.
• In array processors, instead of pipelining just the instructions, the data is also pipelined.
Hence decoding time is saved.
Example – Adding two groups of 10 numbers each

• Normal processor:
execute 10 times:
read the next instruction and decode
fetch this number, fetch that number
add them
put the result in the location
end loop

• Array processor:
read instruction and decode it
fetch these 10 numbers
fetch those 10 numbers
add them
put the result in the location.
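A software analogy of the two styles, sketched with the GCC/Clang vector extension (purely illustrative; the groups are shrunk from 10 to 8 elements because the extension wants a power-of-two vector size):

/* 8 ints per vector (32 bytes). */
typedef int v8si __attribute__((vector_size(32)));

/* Scalar style: the add is fetched and decoded once per element. */
void add_scalar(const int a[8], const int b[8], int c[8]) {
    for (int i = 0; i < 8; i++)
        c[i] = a[i] + b[i];
}

/* Array/SIMD style: one decoded operation updates all 8 lanes at once. */
void add_vector(const v8si *a, const v8si *b, v8si *c) {
    *c = *a + *b;
}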
How Array Processor is beneficial

• Only two address translations are needed.
• Fetching and decoding the instruction is done only once instead of 10 times.
• Code is smaller, hence memory is managed better.
• Performance improves by avoiding stalls. Stalls occur when an instruction waits
for data that is not yet ready.
Array Processor Classification

• SIMD (Flynn's Classification)
• It manipulates vector instructions by means of multiple functional units
responding to a common instruction.
• Attached Array Processor
• An auxiliary processor attached to a general-purpose processor.
• The job of this processor is to improve the performance of the host computer
in specific numeric calculation tasks.
Array Processor Architecture - SIMD

• Two basic configurations:
• Array processors using RAM, also known as Dedicated Memory Organization.
Examples: ILLIAC IV, CM-2, MP-1 machines.
• Associative processors using Content Addressable Memory, also known as
Global Memory Organization. Example: BSP machines.
SIMD Architecture – Array Processor using RAM

• A single Control Unit (CU) and multiple PEs can be observed.
• The CU controls all the PEs under it.
• The CU decodes all the instructions and decides which PE should receive
which instruction.
• The vector instruction is broadcast to all the PEs to achieve spatial
parallelism.
• The scalar instructions are executed inside the CU.
SIMD Architecture – Array Processor using RAM

• One PE is connected to another PE through the routing register R.
• During communication between PEs, the contents of R are transferred.
• All inputs and outputs are transferred through R.
• The D register is the Address register; it stores the 8-bit address of the PE.
• During an instruction cycle, only an enabled PE takes the operand sent to it,
while the other PEs discard the operands.
• For an enabled PE, status register S = 1; for a masked PE, S = 0.
• A = Accumulator; B = 2nd operand of binary operations.
Attached Array Processor

• In this configuration, the Attached Array Processor has an input/output interface to the
common processor and another interface to a local memory.

• The local memory connects to the main memory with the help of a high-speed
memory bus.
Why use Array Processors ?

• High-speed computation is the major benefit of an Array Processor.
• The design of most Array Processors optimizes their performance for repetitive
arithmetic operations better than the host CPU can. Since most Array Processors
operate asynchronously from the host CPU, they constitute a co-processor that
increases the capacity of the system.
• An Array Processor has its own local memory. On systems that have limited
physical memory, this attribute helps.
Vector Processors
