Lec 14
Instruction-Level Parallelization of Processing
Dr.-Ing. Jorge Castro-Godínez
Escuela de Ingeniería Electrónica
EL4314 – Arquitectura de Computadoras I

Contents: Scalar Processors · Superscalar Processors · References
Scalar Processors

• Best possible CPI = 1, i.e., a best possible throughput of 1 IPC.
• IPC > 1 implies a superscalar processor.
• Traditional sequential processors permit the (true) parallel execution of only one instruction per cycle.
• Scalar processors are pipelined processors that are designed to fetch and issue at most one instruction every machine cycle.
• Superscalar processors are those that are designed to fetch and issue multiple instructions every machine cycle.

1.4.1.1 Processor Performance. In Section 1.3.1 we introduced the iron law of processor performance, as shown in Equation (1.1). That equation actually represents the inverse of performance as a product of instruction count, average CPI, and the clock cycle time. We can rewrite that equation to directly represent performance as a product of the inverse of instruction count, average IPC (IPC = 1/CPI), and the clock frequency, as shown in Equation (1.2):

    Performance = (1 / instruction count) × IPC × frequency    (1.2)

Looking at this equation, we see that performance can be increased by increasing the IPC, increasing the frequency, or decreasing the instruction count.
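The iron law and its IPC form can be checked numerically. A minimal sketch in Python (the function names are illustrative, not from the text):

```python
def execution_time(instruction_count, cpi, cycle_time_s):
    """Equation (1.1), the iron law: time = instructions x CPI x cycle time."""
    return instruction_count * cpi * cycle_time_s

def performance(instruction_count, ipc, frequency_hz):
    """Equation (1.2): performance = (1 / instruction count) x IPC x frequency.
    Since IPC = 1/CPI and frequency = 1/cycle time, this is 1/execution time."""
    return (1.0 / instruction_count) * ipc * frequency_hz

# A 1e9-instruction program at CPI = 2 on a 1-GHz clock:
t = execution_time(1e9, 2.0, 1e-9)     # about 2.0 seconds
p = performance(1e9, 1 / 2.0, 1e9)     # about 0.5 programs per second
assert abs(p - 1 / t) < 1e-12          # the two forms agree
```

Raising IPC, raising the frequency, or shrinking the instruction count each increases `performance`, exactly as the text states.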
• The use of CPI was popular during the days of scalar pipelined
processors.
• The performance penalties due to various forms of pipeline
stalls can be cleanly stated as different CPI overheads.
• The ultimate performance goal for scalar pipelined processors
was to reduce the average CPI to 1.
• In the superscalar domain, it becomes more convenient to use
IPC. The new performance goal for superscalar processors is to
achieve an average IPC > 1.
1.4.1.2 Parallel Processor Performance. The performance of a parallel machine is measured by the overall utilization of the N processors, or the fraction of time the N processors are busy. This efficiency E can be modeled as

    E = [h + N(1 - h)] / N = (h + N - Nh) / N    (1.3)

where h is the fraction of time the machine spends in scalar computation on a single processor.

Traditional supercomputers are parallel processors that perform both scalar and vector computations. During scalar computation only one processor is used; during vector computation all N processors are used to perform operations on array data. The computation performed by such a parallel machine can be depicted as in Figure 1.5 (Scalar + Vector Processing in a Traditional Supercomputer), where f represents the fraction of the program that can be parallelized to run in vector computation mode. Therefore, 1 - f represents the fraction of the program that must be executed sequentially. If T is the total time required to run the program, then the relative speedup S can be represented as

    S = T / [T(1 - f) + T(f/N)]    (1.4)

where T is the sum of T(1 - f), the time required to execute the sequential part, and T(f/N), the time required to execute the vectorizable part of the program [Amdahl, 1967].

As the number of processors N becomes very large, the efficiency E approaches 1 - h, which is the fraction of time the machine spends in vector computation. But as N becomes large, the amount of time spent in vector computation, T(f/N), becomes smaller and smaller and approaches zero; the fraction h then approaches 1, and the efficiency E approaches zero. This means that almost all the computation time is taken up with sequential scalar computation.
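Equations (1.3) and (1.4) can be sketched directly in Python to check the limiting behavior described above (a toy check, with h and f as defined in the text):

```python
def efficiency(h, n):
    """Equation (1.3): E = [h + N(1 - h)] / N, where h is the fraction of
    time spent in scalar computation on a single processor."""
    return (h + n * (1 - h)) / n

def speedup(f, n):
    """Equation (1.4): S = T / [T(1 - f) + T(f/N)] = 1 / [(1 - f) + f/N],
    where f is the fraction of the program that can run in vector mode."""
    return 1.0 / ((1 - f) + f / n)

# As N grows, E approaches 1 - h (the vector-computation time fraction) ...
assert abs(efficiency(0.3, 10**6) - 0.7) < 1e-5
# ... and S stays below the Amdahl bound 1 / (1 - f).
assert speedup(0.9, 10**6) < 1 / (1 - 0.9)
```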
1.4.1.3 Pipelined Processor Performance. Harold Stone proposed that a performance model similar to that for parallel processors can be developed for pipelined processors [Stone, 1987]. A typical execution profile of a pipelined processor is shown in Figure 1.6(a). The machine parallelism parameter N is now the depth of the pipeline.

[Figure 1.6: (a) execution profile of a pipelined processor, with pipeline filling, steady-state, and draining phases; (b) the idealized profile.]

Instead of remaining in the pipeline full phase for the duration of the entire execution, this steady state is interrupted by pipeline stalls, as shown in Figure 1.7(a). Each stall effectively induces a new pipeline draining phase and a new pipeline filling phase, due to the break in the pipeline full phase.

[Figure 1.7: (a) a real execution profile, in which each pipeline stall breaks the full phase into new draining and filling phases (Real -> stalling cycles); (b) the idealized profile (Ideal).]

With similar modifications, for pipelined processors N is now the number of pipeline stages, or the maximum speedup possible. The parameter g now becomes the fraction of time when the pipeline is filled, and the parameter 1 - g now represents the fraction of time when the pipeline is stalled. The speedup is then

    S = 1 / [(1 - g) + g/N]    (1.5)

The fraction of time when the pipeline is full, g, is analogous to f, the vectorizability of the program in the parallel processor model. Therefore, Amdahl's law applies to pipelined processors as well; in other words, the actual performance gain that can be obtained is limited by the fraction of time the pipeline is stalled.
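Equation (1.5) behaves like Amdahl's law with g in place of f. A quick numerical sketch:

```python
def pipeline_speedup(g, n):
    """Equation (1.5): S = 1 / [(1 - g) + g/N], with N the pipeline depth
    and g the fraction of time the pipeline is full."""
    return 1.0 / ((1 - g) + g / n)

# With no stalls (g = 1), a 6-stage pipeline achieves the full speedup N ...
assert abs(pipeline_speedup(1.0, 6) - 6.0) < 1e-9
# ... but if the pipeline is full only 80% of the time, speedup is halved.
assert abs(pipeline_speedup(0.8, 6) - 3.0) < 1e-9
```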
Superscalar proposal

Looking at the curve for Equation (1.9) in Figure 1.8, we see that the speedup is N when the vectorizability f is 100%, that is, when the program is perfectly vectorizable. As f drops off from 100%, the speedup drops off very quickly; as f becomes 0%, the speedup is one, that is, no speedup is obtained. With higher values of N, this speedup drop-off rate gets significantly steeper, and as f approaches 0%, all the speedups approach one, regardless of the value of N. Now assume that a minimum degree of parallelism of 2 can be achieved for the nonvectorizable portion of the program. The speedup now becomes

    S = 1 / [(1 - f)/2 + f/N]    (1.10)

[Figure 1.8: Easing of the Sequential Bottleneck with Instruction-Level Parallelism for Nonvectorizable Code (speedup vs. vectorizability f). Source: Agerwala and Cocke, 1987.]
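Comparing Equations (1.9) and (1.10) numerically shows how even a minimum parallelism of 2 in the sequential part eases the bottleneck (a sketch; the function names are mine, not from the text):

```python
def speedup_amdahl(f, n):
    """Equation (1.9): S = 1 / [(1 - f) + f/N]."""
    return 1.0 / ((1 - f) + f / n)

def speedup_min2(f, n):
    """Equation (1.10): the nonvectorizable fraction executes with a
    minimum degree of parallelism of 2 instead of serially."""
    return 1.0 / ((1 - f) / 2 + f / n)

# For f = 50% and N = 100, the speedup roughly doubles:
s9, s10 = speedup_amdahl(0.5, 100), speedup_min2(0.5, 100)
assert 1.9 < s9 < 2.0 and 3.9 < s10 < 4.0
```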
Pipelined processor

This form of speedup is restricted to comparisons within the domain of scalar processors; it focuses on the increased throughput that a (scalar) pipelined processor can obtain with respect to a (scalar) nonpipelined processor, one that does not overlap the processing of multiple instructions.

[Figure 1.9: Instruction Processing Profile of the Baseline Scalar Pipelined Machine, showing successive instructions overlapped in time.]
Superpipelined machine

On the other hand, a superpipelined machine issues instructions faster than they are executed. A superpipelined machine of degree m, that is, one that takes m minor cycles to execute a simple operation, can potentially achieve better performance than the baseline machine by a factor of m.

• In a superpipelined machine, the machine cycle time is shorter than that of the baseline machine and is referred to as the minor cycle time.
• A simple instruction still requires one baseline cycle, equal to m minor cycles, to execute, but the machine can issue a new instruction in every minor cycle.

Technically, traditional pipelined computers that require multiple cycles for executing simple operations should be classified as superpipelined. For example, the latency for performing fixed-point addition is three cycles in both the CDC 6600 [Thornton, 1964] and the CRAY-1 [Russell, 1978], and new instructions can be issued in every cycle. Hence, these are really superpipelined machines. In a way, the classification of superpipelined machines is somewhat artificial, because it depends on the choice of the baseline cycle and the definition of a simple operation. The key characteristic of a superpipelined machine is that it issues a new instruction in every minor cycle.
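The factor-of-m claim can be illustrated with a toy cycle count (a sketch under my own simplifying assumptions: k independent simple operations, and only the last instruction's latency counted beyond the issue stream):

```python
def baseline_cycles(k):
    """Baseline machine: one instruction issued per baseline cycle."""
    return k

def superpipelined_minor_cycles(k, m):
    """Degree-m superpipelined machine: one issue per minor cycle, but a
    simple operation still takes m minor cycles (one baseline cycle):
    k - 1 issue cycles plus m cycles for the last instruction to finish."""
    return (k - 1) + m

# Expressed in baseline cycles (each worth m minor cycles), throughput
# approaches m times the baseline for large k:
k, m = 1000, 4
t_super = superpipelined_minor_cycles(k, m) / m
assert baseline_cycles(k) / t_super > 3.9   # close to m = 4
```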
Example: MIPS R4000

The MIPS R4000 has an internal clock at double the external frequency; it simulates an 8-stage pipeline (Figure 1.11: The "Superpipelined" MIPS R4000 8-Stage Pipeline). Two minor cycles are required for D-cache access. These are noninterruptible operations; no data forwarding can involve the buffers between the IF and IS stages or the buffers between the DF and DS stages. Cache accesses, here considered "simple" operations, are pipelined and require an operation latency of two (minor) cycles.

Prof. Ing. Jeferson González G., Lección 9
Superscalar Processors

Superscalar pipeline: parallel pipelining
[Figure 4.4: (a) the five-stage i486 scalar pipeline; (b) the five-stage Pentium parallel pipeline of width s = 2, with U and V pipes. Both pipelines use the stages IF, D1, D2, EX, and WB.]
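For a parallel pipeline of width s, the best-case issue rate is s instructions per cycle. A trivial sketch of that bound:

```python
def ideal_ipc(width):
    """Best-case IPC of a parallel pipeline equals its width s
    (i486: s = 1; Pentium with U and V pipes: s = 2)."""
    return width

def min_cycles(instruction_count, width):
    """Lower bound on cycles, ignoring fill, drain, and stalls:
    ceil(instructions / width)."""
    return -(-instruction_count // width)

assert ideal_ipc(2) == 2
assert min_cycles(101, 2) == 51   # 50 dual issues plus 1 single issue
```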
Diversified pipelines

[Figure 4.5: A Diversified Parallel Pipeline with Four Execution Pipes.]

[Figure 4.6: The CDC 6600 with 10 Diversified Functional Units in Its CPU.]

[Figure: a diversified superscalar datapath with a bus interface unit, instruction and target-instruction caches, an instruction sequencer and branch unit, general and floating-point register files, a history buffer, a data cache, and diversified execution units (two integer units, a bit-field unit, a multiplier, a floating-point add unit, a divider, graphics add and pack units, and a load/store unit) connected by writeback busses.]
Dynamic pipelines

• Parallel pipelines must provide multientry buffers between stages.
• Each buffer entry can be accessed individually, which allows data to pass between stages independently (shift-register or FIFO organization). But what happens when there are dependences?
• Multientry buffers with reordering capability can reorder the instructions they hold, permitting out-of-order execution (OoOE).

[Figure: interstage buffering — (a) a single-entry buffer between stage i and stage i + 1 (in order); (b) a multientry buffer (in order); (c) a multientry buffer with reordering, from which instructions can leave out of order.]
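The contrast between a FIFO interstage buffer and a reordering multientry buffer can be sketched with a toy issue model (entirely illustrative: the instruction tuples, register names, and readiness rule are my assumptions, not from the text):

```python
from collections import deque

# Each instruction is (name, destination register, source registers);
# a source is "ready" once some earlier instruction has produced it.

def inorder_issue(instrs, ready):
    """FIFO buffer: a dependence at the head stalls everything behind it."""
    buf, order = deque(instrs), []
    while buf:
        name, dest, srcs = buf[0]
        if all(s in ready for s in srcs):
            buf.popleft(); ready.add(dest); order.append(name)
        else:
            break  # head blocked -> the whole FIFO waits
    return order

def outoforder_issue(instrs, ready):
    """Multientry buffer with reordering: any ready entry may issue (OoOE)."""
    buf, order, progress = list(instrs), [], True
    while buf and progress:
        progress = False
        for ins in list(buf):
            name, dest, srcs = ins
            if all(s in ready for s in srcs):
                buf.remove(ins); ready.add(dest); order.append(name)
                progress = True
    return order

# i1 waits on a not-yet-ready register r9; i2 is independent of it.
prog = [("i1", "r1", ("r9",)), ("i2", "r2", ("r3",))]
assert inorder_issue(prog, {"r3"}) == []           # head blocks the FIFO
assert outoforder_issue(prog, {"r3"}) == ["i2"]    # i2 slips past i1
```

The same program stalls completely in the FIFO buffer but makes progress in the reordering buffer, which is exactly the motivation for out-of-order execution given above.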
Instruction scheduling
References

SL – Chapters 1 and 4 (J. P. Shen and M. H. Lipasti, Modern Processor Design)