
COMPUTER ARCHITECTURE

ADVANCED COMPUTER ARCHITECTURE

Falguni Sinhababu
Government College of Engineering and Leather Technology

1
MULTICYCLE OPERATIONS

Advanced Computer Architecture 2


INTRODUCTION
▪ In our earlier discussions we considered a computer
instruction set consisting only of integer instructions,
with execution divided into 5 stages (IF, ID, EX, MEM and
WB), each stage requiring 1 clock cycle.
▪ A real implementation, however, will consist of both
integer and floating-point units.
▪ Floating-point operations are more complex than integer
operations.
▪ They will require more than one cycle in the EX stage.
▪ This makes pipeline scheduling and operation more complex.
▪ New types of data hazards appear that are otherwise not
possible in the integer pipeline.

Advanced Computer Architecture 3


(A) SOLUTION 1
▪ Do not make any change in the pipeline control.
▪ Use a slow clock such that the ALU operations in the
floating point instructions can finish in one clock
cycle (in EX stage).
▪ Drawback:
▪ Other operations are also slowed down, causing severe
degradation in performance.
▪ Not acceptable in practice.

Advanced Computer Architecture 4


(B) SOLUTION 2
▪ We allow the floating point arithmetic pipeline to have a
longer latency (time taken for an instruction to complete its
execution).
▪ EX cycle is considered to be repeated several times.
▪ The number of repetitions can vary depending on the operation.

▪ The EX stage will have multiple floating-point functional units.
▪ For example, one for addition/subtraction (pipelined), one for
multiplication (pipelined) and one for division (non-pipelined).
▪ The stall will occur in the pipeline if the instruction to be issued (ID
to EX stage) will either cause a structural hazard for the functional
unit, or a data hazard.
▪ Pipelining the functional units can avoid the structural hazard.
Advanced Computer Architecture 5
(B) SOLUTION 2
MUL  IF ID EX MEM WB
MUL     IF ID EX MEM WB

Without Pipeline

MUL  IF ID EX EX EX EX MEM WB
MUL     IF ID EX EX EX EX MEM WB

With Pipeline
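The timing difference can be sketched with a toy cycle model (a sketch, not the slides' own tool; it assumes an FP multiply occupies 4 EX cycles, as in the diagram above):

```python
EX_MUL = 4  # assumed number of EX cycles for an FP multiply, as in the diagram

def wb_cycle(ex_start):
    """Cycle in which WB completes, given the cycle in which EX begins."""
    return ex_start + EX_MUL + 1  # EX cycles, then MEM, then WB

# First MUL: IF in cycle 1, ID in cycle 2, EX starts in cycle 3
first = wb_cycle(3)                   # completes in cycle 8

# Pipelined FP unit: the second MUL enters EX one cycle behind the first
pipelined = wb_cycle(4)               # completes in cycle 9

# Non-pipelined FP unit: the second MUL's EX waits until the first drains
non_pipelined = wb_cycle(3 + EX_MUL)  # completes in cycle 12
```

The pipelined unit loses only one cycle between back-to-back multiplies; the non-pipelined unit loses the full EX depth.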

Advanced Computer Architecture 6


FLOATING POINT OPERATIONS
▪ Consider that there are
4 functional units
▪ Main integer unit that
handles loads and stores,
integer ALU operations
and branches
▪ Floating point adder/
subtractor
▪ Floating point and integer
multiplier
▪ Floating point and integer
divider.

Advanced Computer Architecture 7


MIPS32 FLOATING POINT EXTENSIONS
▪In the floating-point extension of MIPS32, there are
32 floating-point registers F0 to F31, each 32 bits
wide.
▪For double precision operations, register pairs
can be used to store the data:
▪Register pair <F0, F1> referred to as F0
▪Register pair <F2, F3> referred to as F2
▪Register pair <F30, F31> referred to as F30
Advanced Computer Architecture 8
FLOATING POINT INSTRUCTION EXAMPLES
▪ Load into a floating point register pair:
▪ L.D F6, 200(R2)
▪ F6= Mem(R2+200); F7 = Mem(R2+204);
▪ Store from a floating point register pair:
▪ S.D F4, 40(R5)
▪ Mem(R5+40) = F4; Mem(R5+44) = F5;
▪ Arithmetic operations for floating point register pairs:
▪ ADD.D F0, F4, F6
▪ SUB.D F12, F8, F20
▪ MUL.D F4, F6, F8
▪ DIV.D F8, F8, F10
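The register-pair semantics of L.D can be sketched in a small model (assumptions: memory is a dict of word-addressed 32-bit values, F is the 32-entry FP register file, and the base register R2 holds 0):

```python
# Toy model of L.D into a double-precision register pair
F = [0] * 32                              # FP register file F0..F31
mem = {200: 0x11111111, 204: 0x22222222}  # word-addressed memory

def load_double(ft, addr):
    """L.D Ft, offset(base): Ft gets the first word, Ft+1 the next word."""
    F[ft] = mem[addr]
    F[ft + 1] = mem[addr + 4]

load_double(6, 200)   # L.D F6, 200(R2), with R2 = 0
```

After the load, F6 holds the word at address 200 and F7 the word at address 204, mirroring the slide's semantics.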

Advanced Computer Architecture 9


LATENCY AND INITIATION INTERVAL
▪ The multi-cycle arithmetic units are often
pipelined to allow overlapped operations and
hence improve performance.
▪ Definition:
▪ Latency: the number of cycles between an
instruction producing a result and another
instruction using it.
▪ Initiation Interval: the number of cycles that
must elapse between issuing two operations of the
same type.
Advanced Computer Architecture 10
TYPICAL VALUES ASSUMED

Function Unit                     Latency   Initiation Interval
Integer ALU                          0               1
Data Memory (Integer/FP Load)        1               1
FP Add/Subtract                      3               1
FP Multiply                          6               1
FP Divide                           24              25

 Assumptions on number of EX stages:
 FP Add/Subtract: 4
 FP Multiply: 7
 FP Divide: 1 (not pipelined)

 It is possible to have up to:
 4 outstanding FP Add/Subtract
 7 outstanding FP Multiply
 1 FP Divide
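The latencies in the table determine how many stalls a dependent instruction suffers. A small helper makes the arithmetic concrete (a sketch; the operation names are illustrative labels for the table rows):

```python
# Latencies from the table: cycles a dependent instruction must wait after
# the producing instruction before it can use the result
LATENCY = {"int_alu": 0, "load": 1, "fp_add": 3, "fp_mul": 6, "fp_div": 24}

def stalls(producer, issue_gap):
    """Stall cycles when the consumer issues issue_gap cycles after the producer."""
    return max(0, LATENCY[producer] - (issue_gap - 1))

# An ADD.D issued immediately after the MUL.D it depends on stalls 6 cycles
mul_then_add = stalls("fp_mul", 1)
```

Separating the dependent pair by more cycles (for example by scheduling other work between them) reduces or removes the stalls.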
Advanced Computer Architecture 11
MULTI-CYCLE PIPELINE STRUCTURE

Advanced Computer Architecture 12


SOME SCENARIOS

Out of order completion of instructions

Stalls arising due to RAW Hazards

Advanced Computer Architecture 13


SOME SCENARIOS

▪ Three instructions are trying to write into the FP register bank simultaneously.
▪ WAW hazard for the last two conflicting instructions (both writing F6).

▪ No conflict in MEM as only the last instruction accesses memory.

Advanced Computer Architecture 14


EXPLOITING INSTRUCTION
LEVEL PARALLELISM

Advanced Computer Architecture 15


INTRODUCTION
▪ To keep the pipeline full, we try to exploit parallelism among
instructions
▪ Sequence of unrelated instructions that can be overlapped without
causing hazard
▪ Related instructions must be separated by appropriate number of clock
cycles equal to the pipeline latency between the pair of instructions

Advanced Computer Architecture 16


INTRODUCTION
▪ In addition, branches have one clock cycle delay.
▪ The functional units are fully pipelined (except
division), such that an operation can be issued on
every clock cycle.
▪ As an alternative, the functional units can also be
replicated.
▪ We now look at a simple compiler technique that
can create additional parallelism between
instructions.
▪ Help in reducing pipeline penalty
Advanced Computer Architecture 17
EXAMPLE
MIPS32 Code

Add a scalar s to a vector x

9 clock cycles per iteration (with 4 stalls)

Our program has 1000 iterations and there are 9
clock cycles per iteration, so there are a total of
9000 clock cycles.

Advanced Computer Architecture 18


EXAMPLE
▪ We now carry out instruction scheduling.
▪ Moving instructions around and making necessary changes to reduce stalls

Our program has 1000 iterations and now there
are 7 clock cycles per iteration, so there are a
total of 7000 clock cycles.

7 clock cycles per iteration (with 2 stalls)

Advanced Computer Architecture 19


LOOP UNROLLING
▪ We now carry out loop unrolling
▪ Replicating the body of the loop
multiple times, so that the loop
overhead per iteration reduces.
Unroll loop
3 times

▪ We use different registers for each


iteration.
▪ Number of stalls per loop = 3x4+1 = 13
▪ Clock cycles per loop = 14 + 13 = 27; cycles per iteration = 27/4 = 6.75
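The transformation itself can be illustrated in plain Python rather than MIPS32 (a sketch; it assumes the vector length is a multiple of 4, matching the unroll factor):

```python
def add_scalar(x, s):
    """Original loop: one element per iteration."""
    return [xi + s for xi in x]

def add_scalar_unrolled(x, s):
    """Body replicated 4 times; the loop overhead is paid once per 4 elements."""
    y = [0.0] * len(x)
    i = 0
    while i < len(x):
        y[i]     = x[i]     + s
        y[i + 1] = x[i + 1] + s
        y[i + 2] = x[i + 2] + s
        y[i + 3] = x[i + 3] + s
        i += 4
    return y
```

Both functions compute the same result; the unrolled version trades code size for fewer branch and index-update instructions per element.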

Advanced Computer Architecture 20


SCHEDULING THE UNROLLED LOOP

Scheduling
the unrolled
loop

No stalls. 14/4 = 3.5 cycles per iteration

Advanced Computer Architecture 21


LOOP UNROLLING: SUMMARY
▪ Lowering the total clock cycles from 9000 → 7000 →
3500
▪ Loop unrolling can expose more parallelism in
instructions that can be scheduled
▪ Effective way of improving pipeline performance
▪ Can be used to lower the CPI (to less than 1) in
architectures where more than one instruction can be
issued per cycle
a. Super Pipeline Architecture
b. Superscalar Architecture
c. Very Long Instruction Word (VLIW) architecture

Advanced Computer Architecture 22


SUPER PIPELINE ARCHITECTURE
▪ Super pipelining is the breaking of the stages of a given pipeline into
smaller stages (thus making the pipeline deeper) in an attempt to shorten
the clock period, and hence enhance instruction throughput by
keeping more and more instructions in flight at a time.

Advanced Computer Architecture 23


SUPERSCALAR ARCHITECTURE
▪ Superscalar Machines:
▪ Machines that can issue multiple independent
instructions per clock cycle when they are properly
scheduled by the compiler.
▪ Can result in a CPI of less than 1.
▪ How does it work ?
▪ The hardware can issue a small number (say 2 to 4) of
independent instructions in every clock cycle.
▪ The hardware checks for conflicts between
instructions.
▪ If the instructions are dependent, then only the first
instruction in the sequence will be issued.
Advanced Computer Architecture 24
SUPERSCALAR ARCHITECTURE SCHEMATIC

Advanced Computer Architecture 25


EXAMPLE
▪ Suppose two instructions can be issued every clock cycle.
a. One can be a load, store, branch or integer ALU operation.
b. The other can be a floating point operation.
▪ Used only for illustration.
▪ We have not shown how FP operations extend the EX cycle.
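A toy issue check captures the pairing rule described above (a sketch; the assumption is that slot 1 takes a load/store/branch/integer ALU operation and slot 2 an FP operation):

```python
# Operation classes for the two issue slots of the example machine
INT_SLOT = {"load", "store", "branch", "int_alu"}
FP_SLOT = {"fp_add", "fp_sub", "fp_mul", "fp_div"}

def can_dual_issue(op1, op2):
    """True if the pair fits the machine's (integer, FP) issue slots."""
    return op1 in INT_SLOT and op2 in FP_SLOT

pair_ok = can_dual_issue("load", "fp_add")     # one of each class
pair_bad = can_dual_issue("fp_add", "fp_mul")  # two FP operations: no
```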

Advanced Computer Architecture 26


INSTRUCTION DEPENDENCY CHECK
a. Can be checked dynamically by the hardware
(Superscalar Architecture)
b. The compiler can take the complete
responsibility of creating packets of
instructions that can be simultaneously
issued.
▪ Hardware does not dynamically take any decision
about multiple issues
▪ Also referred to as VLIW architecture

Advanced Computer Architecture 27


SOME ISSUES
▪ If we issue an integer and a FP operation in parallel, the
need for additional hardware is minimized.
▪ Different register set and functional units are used
▪ Only conflict is when the integer instruction is a FP load,
store or move.
▪ This creates contention for the FP register ports and can be
treated as a structural hazard.
▪ In the original MIPS32 pipeline, load instructions have a
latency of 1.
▪ In the superscalar version, the next 3 instructions cannot
use the result of the load without stalling.
▪ Branch delay also becomes 3 cycles.
Advanced Computer Architecture 28
VLIW ARCHITECTURE
▪ In a Very Long Instruction Word (VLIW) machine, an
instruction word is typically hundreds of bits in
length.
▪ Specifies a number of basic operations / instructions, each
using different functional units.
▪ Multiple functional units are used concurrently when a
VLIW “macro-instruction” is being executed.
▪ All the functional units share a common register file.
▪ Similar to the superscalar architecture in concept, but the
responsibility of identifying the set of instructions that
can execute concurrently lies with the compiler.
Advanced Computer Architecture 29
VLIW ARCHITECTURE SCHEMATIC

Advanced Computer Architecture 30


EXAMPLE
▪ We try to schedule this
unrolled program code on a
VLIW processor, assuming
that there are 4 functional
units:
▪ Two memory reference units
(to handle LOAD and
STORE).
▪ One floating point
arithmetic unit (Only the
ADD instruction).
▪ One integer operation and
branch unit (ADDI and
BNE).

Advanced Computer Architecture 31


SCHEDULING ON A VLIW PROCESSOR

The L.D instructions are issued on the two load/store units, followed by the two
ADD.D instructions. Before the S.D stores there is a delay of 2 cycles. The 4
iterations complete in 8 cycles.

Clock cycles/Iteration = 8/4 = 2.0
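Packing operations into a single long-instruction word can be sketched with a greedy slot filler (an illustration only, not a real scheduler; the slot counts match the 4 functional units assumed above — 2 memory, 1 FP, 1 integer/branch):

```python
def pack_word(ops):
    """ops: list of (unit, name). Fill one VLIW word; return (word, leftover)."""
    slots = {"mem": 2, "fp": 1, "int": 1}   # per-word slot capacities
    word, leftover = [], []
    for unit, name in ops:
        if slots.get(unit, 0) > 0:
            slots[unit] -= 1
            word.append(name)
        else:
            leftover.append((unit, name))   # must wait for the next word
    return word, leftover

word, rest = pack_word([("mem", "L.D F0"), ("mem", "L.D F6"),
                        ("mem", "L.D F10"), ("fp", "ADD.D F4")])
```

Only two of the three loads fit in one word; the third spills to the next word, which is exactly the kind of decision a VLIW compiler makes statically.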


Advanced Computer Architecture 32
VECTOR AND ARRAY PROCESSORS

Advanced Computer Architecture 33


VECTOR PROCESSOR
▪ Provide high level instructions that operate on entire
arrays of numbers (called vectors).
▪ A single vector instruction is equivalent to an entire
loop.
▪ No loop overheads are required.
▪ Example:
▪ A, B and C are three vectors containing 64 numbers each.
▪ The three vectors are mapped to vector registers V1, V2, V3
(say).
▪ The following vector instruction computes Ci = Ai + Bi
▪ ADDV V3, V1, V2
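The element-wise semantics of the vector add — one instruction standing in for a 64-iteration scalar loop — can be modeled directly (vector registers modeled as Python lists; a sketch, not vector ISA syntax):

```python
def addv(va, vb):
    """Semantics of a vector add: element-wise sum of two vector registers."""
    return [a + b for a, b in zip(va, vb)]

c = addv([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```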

Advanced Computer Architecture 34


BASIC VECTOR PROCESSOR ARCHITECTURE
▪A vector processor typically consists of an
ordinary pipelined scalar unit plus a
vector unit
▪All functional units within the vector unit
are deeply pipelined, resulting in a shorter
clock cycle time (CCT).
▪Deep pipelining of vector operations does not result
in hazards, since every element computation is
independent of the others.
Advanced Computer Architecture 35
VECTOR PROCESSOR ARCHITECTURE SCHEMATIC

Advanced Computer Architecture 36


VECTOR PROCESSOR ISA
▪ Vector Registers
▪ There are 8 vector registers V0, V1 … V7.
▪ Each vector register can hold 64 double-precision numbers.
▪ Each vector register has 2 read ports and 1 write port, to allow overlapping
operations.
▪ Vector functional units
▪ Each functional unit is fully pipelined and can start a new operation every clock
cycle.
▪ A hardware control unit detects hazards (conflicts for functional units and for
register accesses), and inserts stalls as required
▪ Vector Load/Store Unit:
▪ The load/store unit is fully pipelined and allows fast loading and storing of vectors.
▪ The memory system is also deeply interleaved to allow parallel access.
▪ After an initial latency (which indicates the access time of the memory), one word
can be accessed per clock cycle.

Advanced Computer Architecture 37


VECTOR PROCESSOR ISA
▪ Scalar registers:
▪ These are normal scalar and floating point registers.
▪ Can be used to provide data as input to the vector functional units, as
well as compute memory addresses for vector load/store.
▪ Vector control registers
▪ Vector Mask Register (VMASK)
▪ Indicates which elements of vector to operate on
▪ Vector length register (VLEN)
▪ Need to operate on vectors of different lengths
▪ Vector stride register (VSTR)
▪ Elements of a vector might be stored apart from each other in memory
▪ Stride: distance between two elements of a vector

Advanced Computer Architecture 38


EXAMPLE 1
▪ Consider the SAXPY or
DAXPY vector operation: Y MIPS32
= a*X + Y where X and Y Code
are vectors each of size 64
and a is a scalar.
▪ Rx contains starting address
of X
Vector
▪ Ry contains starting address Processor
of Y Code
▪ R1 contains the address of
the scalar ‘a’.
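A scalar reference for DAXPY shows what the short vector sequence in the figure computes (a sketch; the exact vector instructions are in the slide's figure, not reproduced here):

```python
def daxpy(a, x, y):
    """Y = a*X + Y, element-wise — the loop a few vector instructions replace."""
    return [a * xi + yi for xi, yi in zip(x, y)]

y = daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```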
Advanced Computer Architecture 39
SOME PROPERTIES
▪ The vector processor greatly reduces the dynamic
instruction bandwidth (number of instructions actually
getting executed) (from 514 to 6).
▪ The frequency of pipeline interlocks is greatly reduced.
▪ In the original MIPS32 version, every ADD.D must wait for
MUL.D, and S.D must wait for ADD.D.
▪ In the vector processor version, pipeline stalls are required
once per vector operation, rather than once per vector
element.
▪ Pipeline stall frequency is reduced almost 64 times.

Advanced Computer Architecture 40


VECTOR START-UP AND INITIATION RATE
▪ The running time of each vector operation in the vector processor
has two components:
a. Start-up Time: arises due to the pipeline latency of the vector
operation.
▪ After how much time the first result will be available.
▪ Mainly determined by the depth of the pipeline
▪ A latency of 8 clock cycles means that the operation takes 8 cycles, and
there are 8 stages in the pipeline.
b. Initiation Rate: time per result once the vector instruction is
running.
▪ Usually 1 per clock cycle for individual operations.

▪ Total time to complete a vector operation of length n (n ≤ 64) is:
▪ Start-up Time + (n × Initiation Rate)
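The formula above is simple enough to compute directly (the 8-cycle start-up value is the slide's own example; the initiation rate of 1 result per cycle is the usual assumption stated earlier):

```python
def vector_op_cycles(n, startup, initiation=1):
    """Total cycles = start-up time + n × initiation rate (slide formula)."""
    return startup + n * initiation

# An 8-deep pipeline producing one result per cycle, full vector of length 64
total = vector_op_cycles(64, startup=8)
```

For long vectors the start-up cost is amortized: 72 cycles for 64 results is close to the ideal of one result per cycle.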

Advanced Computer Architecture 41


VECTOR CHAINING
▪ Vector chaining: Data forwarding from one vector functional unit to another

V1 V2 V3 V4 V5
LV v1
MULV v3,v1,v2
ADDV v5, v3, v4

Chain Chain

Load Unit
Mult. Add

Memory
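The benefit of chaining can be sketched with a toy cycle model (an illustration under the assumption that each unit has a fixed start-up time and then produces one element per cycle; the start-up values used are examples, not measured figures):

```python
def unchained(n, s1, s2):
    """Second op waits for the first to finish all n elements."""
    return (s1 + n) + (s2 + n)

def chained(n, s1, s2):
    """Second op consumes each element the cycle it is produced."""
    return s1 + s2 + n

slow = unchained(64, 7, 6)   # multiply fully drains before the add starts
fast = chained(64, 7, 6)     # add is chained onto the multiply's output
```

Chaining turns two back-to-back vector passes into one longer pipeline, paying each start-up once but the per-element cost only once overall.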

Advanced Computer Architecture 42


OTHER VECTOR PROCESSING CONCEPTS
▪ Vector Length Register
▪ Specifies the length of any vector operation. For example, to operate on
only the first 30 elements of a 64-element vector register, this register is used.
▪ Loading and storing vectors with stride
▪ Vector elements are stored in memory with uniform spacing between
elements.
▪ Adjacent elements of a vector are not sequential in memory.
▪ Strip mining
▪ How to split loops if the original loop handles vectors that are larger
than that supported by the hardware?
▪ Suppose the vector register is of length 64 and the program operates on a
vector of 200 elements. We run the full-length loop 3 times (covering 192
elements), and the remaining 8 elements are handled using strip mining as
the fourth operation.
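The splitting rule can be written out explicitly (a sketch; the maximum vector length of 64 matches the register length assumed throughout these slides):

```python
MVL = 64  # maximum vector length supported by the hardware

def strips(n, mvl=MVL):
    """Split an n-element operation into (start, length) strips of at most mvl."""
    out, start = [], 0
    while start < n:
        length = min(mvl, n - start)
        out.append((start, length))
        start += length
    return out

# 200 elements: three full-length strips plus an 8-element remainder
parts = strips(200)
```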
Advanced Computer Architecture 43
SIMD PROCESSING
▪Single instruction operates on multiple data
elements
▪ In time or in space
▪Multiple processing elements
▪Time-space duality
▪ Array processor: Instruction operates on multiple
data elements at the same time
▪ Vector processor: Instruction operates on multiple
data elements in consecutive time steps
Advanced Computer Architecture 44
ARRAY VS. VECTOR PROCESSORS

Instruction stream (4-element vector):
  LD  VR ← A[3:0]
  ADD VR ← VR, 1
  MUL VR ← VR, 2
  ST  A[3:0] ← VR

Array processor — same op at the same time (across space), different ops over time:
  t1: LD0 LD1 LD2 LD3
  t2: AD0 AD1 AD2 AD3
  t3: MU0 MU1 MU2 MU3
  t4: ST0 ST1 ST2 ST3

Vector processor — same op in consecutive time steps, different ops at the same time:
  t1: LD0
  t2: LD1 AD0
  t3: LD2 AD1 MU0
  t4: LD3 AD2 MU1 ST0
  t5:     AD3 MU2 ST1
  t6:         MU3 ST2
  t7:             ST3
Advanced Computer Architecture 45
SIMD ARRAY PROCESSING VS. VLIW
VLIW Array processor

Advanced Computer Architecture 46


MULTICORE PROCESSORS

Advanced Computer Architecture 47


MULTI-CORE PROCESSORS
▪ A processing system composed of two or more independent
cores or CPUs.
▪ The cores are typically integrated onto a single integrated
circuit die, or they may be integrated on multiple dies in a
single chip package.
▪ Cores share memory:
▪ In modern multi-core systems, typically the L1 and L2 cache are
private to each core, while the L3 cache is shared among the cores.
▪ In symmetric multi-core systems, all the cores are identical.
▪ Example: multi-core processors used in computer systems
▪ In asymmetric multi-core systems, the cores may have
different functionalities.
Advanced Computer Architecture 48
WHY MULTI-CORES
▪ It is difficult to sustain Moore’s law and at the same
time meet performance demand of various
applications.
▪ Difficult to increase clock frequency, mainly due to power
consumption issues.
▪ Possible solution:
▪ Replicate hardware and run them at a lower clock rate to
reduce power consumption.
▪ 1 core running at 3 GHz has the same performance as 2
cores running at 1.5 GHz with lower power consumption.
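The argument rests on how dynamic power scales. A rough model (a sketch: P ∝ cores × V² × f, with the assumption that a lower clock permits a lower supply voltage; the scaling factors below are illustrative, not measured):

```python
def relative_power(cores, freq, voltage):
    """Dynamic power relative to a baseline, using P ∝ cores × V² × f."""
    return cores * voltage ** 2 * freq

one_fast = relative_power(1, 1.0, 1.0)   # baseline: one core at full clock
two_slow = relative_power(2, 0.5, 0.8)   # two cores at half clock, lower V
```

In this model the two slower cores deliver comparable aggregate throughput at well under the baseline power, which is the multi-core argument in miniature.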

Advanced Computer Architecture 49


TAXONOMY OF PARALLEL ARCHITECTURES
(FLYNN’S CLASSIFICATION)
▪ Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966

▪ Single instruction-stream single data-stream (SISD).


▪ Traditional uniprocessor systems.

▪ Multiple instruction-stream single data-stream (MISD).


▪ No commercial implementation exists.
▪ Closest form: systolic array processor, streaming processor

▪ Single instruction-stream multiple data-stream (SIMD).


▪ Array and vector processors.

▪ Multiple instruction-stream multiple data-stream (MIMD).


▪ Multiprocessor systems (various architectures exist).
▪ Multithreaded processor

Advanced Computer Architecture 50


FLYNN’S CLASSIFICATION
[Figure: stream diagrams for the four classes — SISD: one instruction stream
(I1 I2 I3) on one data stream (D1 D2 D3); SIMD: one instruction stream on
multiple data streams (D11 … Dm3); MISD: multiple instruction streams
(I11 … I3m) on one data stream; MIMD: multiple instruction streams on
multiple data streams]
Advanced Computer Architecture 51


SINGLE-CORE COMPUTER
▪ Falls under SISD
Category.
▪ Typically two busses:
▪ A high-speed CPU–memory bus, which also connects to the I/O bridge.
▪ A lower-speed I/O bus, connecting various peripherals.

Advanced Computer Architecture 52


SINGLE-CORE PROCESSOR

Advanced Computer Architecture 53


MOTHER BOARD ARCHITECTURE
▪ Chipset consisting of north bridge and south bridge

Advanced Computer Architecture 54


MOTHER BOARD VIEW

Advanced Computer Architecture 55


MULTI-CORE ARCHITECTURE

Advanced Computer Architecture 56


TRADITIONAL MULTIPROCESSOR ARCHITECTURES
▪ Tightly coupled multiprocessors:
▪ The processors access common shared memory.
▪ Inter-processor communication takes place through shared
memory.
▪ Multi-core architectures fall under this category.
▪ Loosely coupled multiprocessors (cluster computers):
▪ Memory is distributed among the processors.
▪ Processors typically communicate through a high speed
interconnection network.

Advanced Computer Architecture 57


TIGHTLY COUPLED MULTIPROCESSORS

▪ Difficult to extend to large number of processors.


▪ Memory bandwidth requirement increases with the number of processors.
▪ Memory access time for all processors is uniform
▪ Called uniform memory access (UMA)

Advanced Computer Architecture 58


LOOSELY COUPLED MULTIPROCESSORS

▪ Cost-effective way to scale memory bandwidth.


▪ Communicating data between processors is complex and has higher latency.
▪ Memory access time depends on the location of the data.
▪ Called Non Uniform Memory Access(NUMA).

Advanced Computer Architecture 59


THANK YOU

Advanced Computer Architecture 60
