
Lecture 5: Superscalar Processors

 Definition and motivation

 Superpipeline

 Dependency issues

 Parallel instruction
execution

Superscalar Architecture
 Superscalar is a computer designed to improve the
performance of the execution of scalar instructions.
 A scalar is a variable that can hold only one atomic
value at a time, e.g., an integer or a real.
 A scalar architecture processes one data item at a
time  the computers we discussed up till now.

 Examples of non-scalar variables:
 Arrays
 Matrices
 Records

Superscalar Architecture (Cont’d)
 In a superscalar architecture (SSA), several scalar
instructions can be initiated simultaneously and
executed independently.
 Pipelining allows also several instructions to be
executed at the same time, but they have to be in
different pipeline stages at a given moment.
 SSA includes all features of pipelining but, in addition,
there can be several instructions executing
simultaneously in the same pipeline stage.
 SSA introduces therefore a new level of parallelism,
called instruction-level parallelism.
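For example (a hypothetical instruction pair, written in the register notation used by the dependency examples later in this lecture), the two instructions below use different registers and different functional units, so an SSA processor could execute them in the same pipeline stage during the same cycle:

ADD R1,R2,R3 (R1 := R2 + R3)
MUL R5,R6,R7 (R5 := R6 * R7)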

General Superscalar Organization

In this example, two integer, two floating-point, and
one memory (load or store) operations can be
executed at the same time.

Motivation
 Most operations are on scalar quantities (about 80%).
 Speeding up these operations will lead to a large overall
performance improvement.

How to implement the idea?


 An SSA processor fetches multiple instructions at a time, and
attempts to find nearby instructions that are independent of each
other and therefore can be executed in parallel.
 Based on the dependency analysis, the processor may issue
and execute instructions in an order that differs from that of the
original machine code.
 The processor may eliminate some unnecessary dependencies
by the use of additional registers and renaming of register
references.

Superpipelining
 Superpipelining is based on dividing the stages of a pipeline into
several sub-stages, and thus increasing the number of
instructions which are handled by the pipeline at the same time.
 For example, by dividing each stage into two sub-stages, a
pipeline can perform at twice the speed in the ideal situation.
 Many pipeline stages may perform tasks that require less than half
a clock cycle.
 No duplication of hardware is needed for these stages.

[Figure: instruction timing diagrams for a base pipeline with stages FI and EI and for versions using the sub-stages FI1, FI2, EI1, EI2; the three configurations shown need 2, 4, and 3 hardware resources, respectively.]

Superpipelining (Cont’d)
 For a given architecture and the corresponding instruction set
there is an optimal number of pipeline stages/sub-stages.
 Increasing the number of stages/sub-stages over this limit
reduces the overall performance.
 Overhead of data buffering between the stages.
 Not all stages can be divided into (equal-length) sub-stages.
 The hazards will be more difficult to resolve.
 The clock skew problem.
 More complex hardware.


Superscalar vs. Superpipeline


 Base machine: 4-stage
pipeline
 Instruction fetch
 Operation decode
 Operation execution
 Result write back

 Superpipeline of degree 2
 A sub-stage often takes
half a clock cycle to
finish.

 Superscalar of degree 2
 Two instructions are
executed concurrently
in each pipeline stage.
 Duplication of hardware
is required by definition.

Superpipelined Superscalar Design
[Figure: timing diagram of instructions I1-I12 flowing through the FI, DI, EI and WO stages, each divided into three sub-stages, with four instructions handled per stage.]
Superpipeline of degree 3 and superscalar of degree 4:
12 times speed-up over the base machine, and
48 times speed-up over sequential execution.
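As a rough sanity check (assuming the ideal case with no stalls): the speed-up over the base pipelined machine is the product of the two degrees, 4 (superscalar) × 3 (superpipeline) = 12, and since the 4-stage base pipeline is itself about 4 times faster than purely sequential execution, the combined figure becomes 12 × 4 = 48.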

 This is a new trend in architecture design:
 Pentium Pro(P6): 3-degree superscalar, 12-stage
“superpipeline”.
 PowerPC 620: 4-degree superscalar, 4/6-stage pipeline.

Basic Superscalar Concepts


 SSA allows several instructions to be issued and
completed per clock cycle.
 It consists of a number of pipelines that are working in
parallel.
 Depending on the number and kind of parallel units
available, a certain number of instructions can be
executed in parallel.
 In the following example two floating point and two
integer operations can be issued and executed
simultaneously.
 Each unit is also pipelined and can execute several
operations in different pipeline stages.

An SSA Example
[Figure: block diagram of an example SSA processor, with an Instruction Cache and Instruction Fetch Unit; an Instruction Buffer; a Decode, Rename & Dispatch Unit; the Register File; an Instruction Issuing Unit with its Instruction Window (queues, reservation stations, etc.); two Integer Units, two Floating Point Units and a Memory Unit; the Data Cache; and the Commit stage.]

Lecture 5: Superscalar Processors

 Definition and motivation

 Superpipeline

 Dependency issues

 Parallel instruction
execution

Parallel Execution Limitation
 The situations which prevent instructions from being executed in
parallel by SSA are very similar to those which prevent efficient
execution on a pipelined architecture (pipeline hazards):
 Resource conflicts.
 Control (procedural) dependency.
 Data dependencies.

 Their consequences on SSA are more severe than those on
simple pipelines, because the potential of parallelism in SSA is
greater and, thus, a larger amount of performance will be lost.

 Instruction-level parallelism = the degree to which, on average,
the instructions of a program can be executed in parallel.

Resource Conflicts
 Several instructions compete for the same hardware
resource at the same time.
 e.g., two arithmetic instructions need the same
floating-point unit for execution.
 similar to structural hazards in pipeline.

 They can be solved partly by introducing several
hardware units for the same functions.
 e.g., have two floating-point units.
 the hardware units can also be pipelined to support
several operations at the same time.
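A hypothetical illustration of both points, in the register notation used by the dependency examples later on: with a single floating-point unit, the two independent instructions below compete for it and cannot start execution in the same cycle; with a second (possibly pipelined) floating-point unit, they could execute in parallel.

MULF R4,R3,R1 (R4 := R3 * R1)
ADDF R7,R5,R6 (R7 := R5 + R6)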

Procedural Dependency
 The presence of branches creates major problems in
assuring the optimal parallelism.
 cannot execute instructions after a branch in parallel
with instructions before the branch (see the example below).
 similar to control hazards in pipeline.

 If instructions are of variable length, they cannot be
fetched and issued in parallel, since an instruction has
to be decoded in order to identify the following one.
 therefore, superscalar techniques are more efficiently
applicable to RISCs, with fixed instruction length and
format.
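To make the branch problem concrete (borrowing the compiled loop from the Window of Execution example later in this lecture), the instructions after the branch below cannot be issued in parallel with the loads before it until the branch outcome is known, or at least predicted:

load  r9,(r3)
ble   r8,r9,L3
move  r3,r7
store r9,(r3)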

Data Conflicts
 Caused by data dependencies between instructions in
the program.
 similar to data hazards in pipeline.

 To address the problem and to increase the degree of
parallel execution, SSA provides great freedom in the
order in which instructions are issued and executed.

 Therefore, data dependencies have to be considered
and dealt with much more carefully.

Window of Execution
 Due to data dependencies, only some of the instructions are
potential subjects for parallel execution.
 In order to find instructions to be issued in parallel, the processor
has to select from a sufficiently large instruction sequence.
 There are usually a lot of data dependencies in a short instruction
sequence.

 Window of execution is defined as the set of instructions that is
considered for execution at a certain moment.

 The number of instructions in the window should be as large as
possible. However, this is limited by:
 Capacity to fetch instructions at a high rate.
 The problem of branches.
 The cost of hardware needed to analyze data dependencies.

Window of Execution Example


Source code:

for (i=0; i<last; i++) {
    if (a[i] > a[i+1]) {
        temp = a[i];
        a[i] = a[i+1];
        a[i+1] = temp;
        change++;
    }
}

Register usage:
r6: i (initially 0);   r7: address for a[i];   r3: address for a[i] & a[i+1];
r4: last;   r5: change (init. 0);   r8: a[i];   r9: a[i+1]

Compiled code:

L2   move  r3,r7
     load  r8,(r3)        r8 := a[i]
     add   r3,r3,#4       r3 := r3+4
     load  r9,(r3)        r9 := a[i+1]
     ble   r8,r9,L3       jump if r8 <= r9
     move  r3,r7
     store r9,(r3)        a[i] := r9
     add   r3,r3,#4
     store r8,(r3)        a[i+1] := r8
     add   r5,r5,#1       change++
L3   add   r6,r6,#1       i++
     add   r7,r7,#4
     blt   r6,r4,L2       jump if r6 < r4

Window of Execution Example
L2   move  r3,r7
     load  r8,(r3)        r8 := a[i]
     add   r3,r3,#4       r3 := r3+4
     load  r9,(r3)        r9 := a[i+1]
     ble   r8,r9,L3       jump if r8 <= r9

     move  r3,r7
     store r9,(r3)        a[i] := r9
     add   r3,r3,#4
     store r8,(r3)        a[i+1] := r8
     add   r5,r5,#1       change++

L3   add   r6,r6,#1       i++
     add   r7,r7,#4
     blt   r6,r4,L2       jump if r6 < r4

Basic Blocks (the code is partitioned into the three groups above)

Window of Execution (cont’d)


 The window of execution can be extended over basic
block borders by branch prediction.
 Speculative execution.

 With speculative execution, instructions of the predicted
path are entered into the window of execution.
 Instructions from the predicted path are executed
tentatively.
 If the prediction turns out to be correct, the state change
produced by these instructions will become permanent
and visible (the instructions commit);
 Otherwise, all effects are removed.
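For instance, in the compiled loop of the Window of Execution example, if the branch ble r8,r9,L3 is predicted taken, the instructions of the block at L3 (add r6,r6,#1; add r7,r7,#4; blt r6,r4,L2) can be entered into the window and executed tentatively; their results are committed only once the prediction has been verified, and are removed otherwise.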

Data Dependencies
 All instructions in the window of execution may begin
execution, subject to data dependence and resource
constraints.

 Three types of data dependencies can be identified:
 True data dependency
 Output dependency
 Anti-dependency
(the last two are also called artificial dependencies)

True Data Dependency


 True data dependencies exist when the output of one
instruction is required as an input to a subsequent instruction:
MUL R4,R3,R1 (R4 := R3 * R1)
. . .
ADD R2,R4,R5 (R2 := R4 + R5)
 can fetch and decode second instruction in parallel with first.
 can NOT execute second instruction until first is finished.

 They are intrinsic features of the user’s program, and cannot
be eliminated by compiler or hardware techniques.
 They have to be detected and handled by hardware.
 The addition above cannot be executed before the result of the
multiplication is available.
 The simplest solution is to stall the adder until the multiplier has
finished.
 To avoid leaving the adder idle, the hardware can find other
instructions which can be executed by the adder.

True Data Dependency Example

L2   move  r3,r7
     load  r8,(r3)
     add   r3,r3,#4
     load  r9,(r3)
     ble   r8,r9,L3
(e.g., load r8,(r3) needs the value of r3 produced by move r3,r7, and ble r8,r9,L3 needs the values loaded into r8 and r9)

 There are often a lot of true data dependencies in a small region
of a program.
 Increasing the window size can reduce the impact of these
dependencies.
 A compiler cannot help to eliminate them!

Output Dependency
 An output dependency exists if two instructions are writing into
the same location.
 If the second instruction writes before the first one, an error
occurs:
MUL R4,R3,R1 (R4 := R3 * R1)
. . .
ADD R4,R2,R5 (R4 := R2 + R5)
L2   move  r3,r7
     load  r8,(r3)
     add   r3,r3,#4
     load  r9,(r3)
     ble   r8,r9,L3
(here, move r3,r7 and add r3,r3,#4 both write into r3: an output dependency)

Anti-dependency
 An anti-dependency exists if an instruction uses a location as
an operand while a following one is writing into that location.
 If the first one is still using the location when the second one
writes into it, an error occurs:
MUL R4,R3,R1 (R4 := R3 * R1)
. . .
ADD R3,R2,R5 (R3 := R2 + R5)

L2   move  r3,r7
     load  r8,(r3)
     add   r3,r3,#4
     load  r9,(r3)
     ble   r8,r9,L3
(here, load r8,(r3) still uses r3 as an operand while the following add r3,r3,#4 writes into it: an anti-dependency)

Output and Anti-Dependencies
 Output dependencies and anti-dependencies are not intrinsic
features of the executed program.
 They are not real data dependencies but storage conflicts.
 They are due to the competition of several instructions for the
same register.
 They are only the consequence of the manner in which the
programmer or the compiler is using registers (or memory
locations).
 In the previous examples the conflicts are produced only
because:
 The output dependency: R4 is used by both instructions to store
the result (due to, for example, optimization of register usage);
 The anti-dependency: R3 is used by the second instruction to
store the result.

Output and Anti-Dependencies (Cont’d)
 Output dependencies and anti-dependencies can usually be
eliminated by using additional registers.
 This technique is called register renaming.

MUL R4,R3,R1 (R4 := R3 * R1)
. . .
ADD R4,R2,R5 (R4 := R2 + R5)

MUL R4,R3,R1 (R4 := R3 * R1)
. . .
ADD R3,R2,R5 (R3 := R2 + R5)
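A minimal sketch of how renaming could resolve the two conflicts above (the renamed registers R7 and R6 are chosen purely for illustration; the hardware would pick any free register):

MUL R4,R3,R1 (R4 := R3 * R1)
. . .
ADD R7,R2,R5 (R7 := R2 + R5)   output dependency removed: the ADD no longer writes R4

MUL R4,R3,R1 (R4 := R3 * R1)
. . .
ADD R6,R2,R5 (R6 := R2 + R5)   anti-dependency removed: R3 is no longer overwritten

Subsequent instructions that read the renamed destination are redirected by the hardware to the new register.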

Effect of Dependencies

[Figure: the effects of data dependency, procedural dependency, and resource dependency on parallel instruction execution.]

Lecture 5: Superscalar Processors

 Definition and motivation

 Superpipeline

 Dependency issues

 Parallel instruction
execution

Instruction vs Machine Parallelism


 Instruction-level parallelism (ILP)  the average number of
instructions in a program that a processor might be able to
execute at the same time.
 Mostly determined by the number of true (data) dependencies and
procedural (control) dependencies in relation to the number of
other instructions.
 Machine parallelism of a processor  the ability of the
processor to take advantage of the ILP of the program.
 Determined by the number of instructions that can be fetched and
executed at the same time, i.e., the capacity of the hardware.
 To achieve high performance, we need both ILP and machine
parallelism.
 The ideal situation is that we have the same ILP and machine
parallelism.

Division and Decoupling
To increase ILP, we should divide the instruction execution
into smaller tasks and decouple them. In particular, we
have three important activities:
 Instruction issue  an instruction is initiated and
starts execution.
 Instruction completion  an instruction has
completed its specified operations.
 Instruction commit  the results of the instruction
operations are written back to the register files or
cache.
 The machine state is changed.

SSA Instruction Execution Policies


 Instructions can be executed in an order different from
the strictly sequential one, with the requirement that
the results must be the same.

 Execution policies usually used:


 In-order issue with in-order completion.
 In-order issue with out-of-order completion.
 Out-of-order issue with out-of-order completion.
 Out-of-order issue with in-order completion.

In-Order Issue with In-Order Completion
 Instructions are issued in exact program order, and completed in
the same order (with parallel issue and completion, of course!).
 An instruction cannot be issued before the previous one has been
issued;
 An instruction cannot be completed before the previous one has
 To guarantee in-order completion, an instruction will stall when
there is a conflict and when a unit requires more than one cycle
to execute.

Example:
 Assume a processor that can issue and decode two instructions
per cycle, that has three functional units (two single-cycle integer
units, and a two-cycle floating-point unit), and that can complete
and write back two results per cycle.
 And an instruction sequence with the characteristics given in the
next slide.

IOI with IOC Example


I1 – needs two execute cycles (floating-point)
I2 –
I3 –
I4 – needs the same function unit as I3
I5 – needs data value produced by I4
I6 – needs the same function unit as I5

Cycle   Issue/Decode   Execute   Write/Complete
  1     I1  I2
  2     I3  I4         I1  I2
  3     I3  I4         I1        (stall)
  4     I5  I4         I3        I1  I2
  5     I5  I6         I4        I3
  6     I6             I5        I4
  7                    I6        I5
  8                              I6

IOI with IOC Discussion
 The processor detects and handles (by stalling) true
data dependencies and resource conflicts.
 Instructions are issued and completed in their strict
order.
 In this way, the exploited parallelism is very much
dependent on the way the program has been written
or compiled.
 Ex. if I3 and I6 switch position, the pairs I4/I6 and I3/I5
can be executed in parallel (see the following slides).
 To exploit such a parallelism improvement, the compiler
needs to perform elaborate data-flow analysis.

IOI with IOC Example (Cont’d)


I1 – needs two execute cycles (floating-point)
I2 –
I6 – needs the same function unit as I5
I4 – needs the same function unit as I3
I5 – needs data value produced by I4
I3 –

Cycle   Issue/Decode   Execute   Write/Complete
  1     I1  I2
  2     I6  I4         I1  I2
  3     I6  I4         I1        (stall)
  4     I5  I3         I6  I4    I1  I2
  5                    I5  I3    I6  I4
  6                              I5  I3
  7
  8

IOI with IOC Discussion (Cont’d)

 The basic idea of SSA is not to rely on compiler-based
techniques (compatibility considerations).
 SSA allows the hardware alone to detect instructions
which can be executed in parallel and to execute them
accordingly.
 IOI with IOC is not very efficient, but it simplifies the
hardware.

In-Order Issue with Out-of-Order Completion
 With out-of-order completion, a later instruction may complete
before a previous one.
 This mainly addresses the issue of long-latency operations such as
division.
Out-of-order completion example:

I1 – needs two cycles
I2 –
I3 –
I4 – conflict with I3
I5 – depending on I4
I6 – conflict with I5

Cycle   Issue/Decode   Execute   Write/Complete
  1     I1  I2
  2     I3  I4         I1  I2
  3     I5  I4         I1  I3    I2
  4     I5  I6         I4        I1  I3
  5     I6             I5        I4
  6                    I6        I5
  7                              I6
  8

Out-of-Order Issue with Out-of-Order Completion

 With in-order issue, no new instruction can be issued when the
processor has detected a conflict, and is stalled until after the
conflict has been resolved.
 The processor is not allowed to look ahead for further instructions,
which could be executed in parallel with the current ones.

 Out-of-order issue takes a set of decoded instructions, and issues
any instruction, in any order, as long as the program execution is
correct.
 Decouple the decode pipeline from the execution pipeline by introducing
an instruction window.
 When a functional unit becomes available, an instruction can be
executed.
 Since instructions have been decoded, the processor can look ahead.

OOI with OOC Example


I1 – needs two cycles
I2 –
I3 –
I4 – conflict with I3
I5 – depending on I4
I6 – conflict with I5

Cycle   Decode   Ins. Window   Execute   Write/Complete
  1     I1  I2
  2     I3  I4   I1, I2        I1  I2
  3     I5  I6   I3, I4        I1  I3    I2
  4              I4, I5, I6    I6  I4    I1  I3
  5              I5            I5        I4  I6
  6                                      I5
  7
  8

Speedup w/o Procedural Dependencies

[Figure: speed-up results, comparing the case with all dependencies against the case with only true dependencies.]

Summary
 The following techniques are the main features of superscalar processors:
 Several pipelined units which are working in parallel;
 Out-of-order issue and out-of-order completion;
 Register renaming.

 All of the above techniques are aimed at enhancing performance.

 Experiments have shown:


 Only adding additional functional units is not very efficient;
 Out-of-order issue is extremely important, as it allows the processor to
look ahead for independent instructions;
 Register renaming can improve performance by more than 30%; in this
case performance is limited only by true dependencies.
 It is important to provide sufficient fetching/decoding capacity so that the window of
execution is sufficiently large.

