
Chapter 16
Instruction-Level Parallelism and Superscalar Processors
Table of contents
1. Overview
2. Design issues
3. Pentium 4
4. ARM Cortex-A8
01.
Overview
Superscalar
• Term first coined in 1987
• Refers to a machine that is designed to improve the performance of the execution of scalar instructions
• In most applications the bulk of the operations are on scalar quantities
• Represents the next step in the evolution of high-performance general-purpose processors
• Essence of the approach is the ability to execute instructions independently and concurrently in different pipelines
• Concept can be further exploited by allowing instructions to be executed in an order different from the program order
Superscalar organization vs. ordinary scalar organization (figure)
Comparison of the superscalar and superpipelined approaches (figure)
Constraints
• Instruction level parallelism
  • Refers to the degree to which the instructions of a program can be executed in parallel
  • A combination of compiler-based optimization and hardware techniques can be used to maximize instruction level parallelism
• Limitations (three of the data-related ones are illustrated in the sketch below):
  • True data dependency
  • Procedural dependency
  • Resource conflicts
  • Output dependency
  • Antidependency
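To make the data-related limitations concrete, here is a minimal Python sketch (all register names hypothetical): each tuple is (destination, sources) for one instruction, and the nested loop classifies true data dependencies (RAW), output dependencies (WAW), and antidependencies (WAR).

```python
# A minimal sketch with a hypothetical four-instruction stream.
instrs = [
    ("r3", ("r1", "r2")),  # I1: r3 = r1 + r2
    ("r4", ("r3",)),       # I2: true (RAW) dependency on I1 via r3
    ("r3", ("r5",)),       # I3: output (WAW) dependency on I1 via r3
    ("r5", ("r6", "r7")),  # I4: antidependency (WAR) on I3 via r5
]

for i, (dst_i, srcs_i) in enumerate(instrs):
    for j in range(i + 1, len(instrs)):
        dst_j, srcs_j = instrs[j]
        if dst_i in srcs_j:   # later instruction reads an earlier result
            print(f"I{j+1} has a true (RAW) dependency on I{i+1} via {dst_i}")
        if dst_i == dst_j:    # both write the same register
            print(f"I{j+1} has an output (WAW) dependency on I{i+1} via {dst_i}")
        if dst_j in srcs_i:   # later instruction overwrites an earlier source
            print(f"I{j+1} has an antidependency (WAR) on I{i+1} via {dst_j}")
```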
02.
Design issues
Instruction level parallelism and machine parallelism

Instruction level parallelism
• Exists when instructions in a sequence are independent
• Execution can be overlapped
• Governed by data and procedural dependency

Machine parallelism
• Ability to take advantage of instruction level parallelism
• Governed by number of parallel pipelines
Instruction issue policy

Instruction issue
• Refers to the process of initiating instruction execution in the processor’s functional units
• Instruction issue occurs when an instruction moves from the decode stage of the pipeline to the first execute stage of the pipeline

Instruction issue policy
• Refers to the protocol used to issue instructions

Three types of orderings are important
• The order in which instructions are fetched
• The order in which instructions are executed
• The order in which instructions update the contents of registers and memory locations

Superscalar instruction issue policies can be grouped into the following categories (a small issue sketch follows the list)
• In-order issue with in-order completion
• In-order issue with out-of-order completion
• Out-of-order issue with out-of-order completion
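As a rough illustration of the last category, the following Python sketch models a hypothetical 2-wide machine in which an instruction issues as soon as everything it depends on has completed, regardless of program order. It is a toy model under simplifying assumptions (single-cycle execution, made-up dependency list), not any real issue logic.

```python
# name -> set of instructions it depends on (hypothetical program).
instrs = {
    "I1": set(),
    "I2": {"I1"},   # I2 must wait for I1
    "I3": set(),
    "I4": {"I3"},   # I4 must wait for I3
}

done, cycle = set(), 0
while len(done) < len(instrs):
    # Ready = not yet issued and all dependencies completed.
    ready = [i for i in instrs if i not in done and instrs[i] <= done]
    issued = ready[:2]                  # at most two issues per cycle
    done.update(issued)
    print(f"cycle {cycle}: issue {issued}")
    cycle += 1
# cycle 0: issue ['I1', 'I3']   (I3 issues ahead of I2 -> out of order)
# cycle 1: issue ['I2', 'I4']
```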
Superscalar instruction issue and completion policies (figure)
Organization for out-of-order issue with out-of-order completion (figure)
Register renaming
• Output dependencies and antidependencies occur because register contents may not reflect the correct ordering from the program
• May result in a pipeline stall
• Registers are allocated dynamically (a minimal sketch follows)
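A minimal Python sketch of the idea, with hypothetical register names and free-list size: every architectural destination is given a fresh physical register, which removes output dependencies and antidependencies while true dependencies survive through the renamed sources.

```python
# Architectural r0..r7 initially map to physical p0..p7; fresh physical
# registers come from a (hypothetical, 4-entry) free list.
free_list = ["p8", "p9", "p10", "p11"]
rename_map = {f"r{i}": f"p{i}" for i in range(8)}

def rename(dst, srcs):
    srcs = tuple(rename_map[s] for s in srcs)   # read current mapping first
    rename_map[dst] = free_list.pop(0)          # fresh physical destination
    return rename_map[dst], srcs

print(rename("r3", ("r1", "r2")))  # I1 writes p8
print(rename("r4", ("r3",)))       # I2 reads p8: true dependency preserved
print(rename("r3", ("r5",)))       # I3 writes p10: WAW with I1 removed
print(rename("r5", ("r6", "r7")))  # I4 writes p11: WAR with I3 removed
```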


Speedups of various machine organizations without procedural dependencies (figure)
Branch prediction
• Any high-performance pipelined machine must address the issue of dealing with branches
• Intel 80486 addressed the problem by fetching both the next sequential instruction after a branch and speculatively fetching the branch target instruction
• RISC machines:
  • Delayed branch strategy was explored
  • Processor always executes the single instruction that immediately follows the branch
  • Keeps the pipeline full while the processor fetches a new instruction stream
• Superscalar machines:
  • Delayed branch strategy has less appeal
  • Have returned to pre-RISC techniques of branch prediction (a minimal predictor sketch follows)
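One classic dynamic prediction technique of this kind is a table of 2-bit saturating counters indexed by branch address. The sketch below is a generic illustration; the table size and the modulo indexing are arbitrary choices for the example, not a specific processor's design.

```python
class TwoBitPredictor:
    def __init__(self, entries=512):
        self.counters = [2] * entries          # states 0..3; start "weakly taken"

    def predict(self, pc):
        # Counter >= 2 means predict taken.
        return self.counters[pc % len(self.counters)] >= 2

    def update(self, pc, taken):
        # Saturate at 0 and 3 so one anomalous outcome doesn't flip the prediction.
        i = pc % len(self.counters)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bp = TwoBitPredictor()
for outcome in (True, True, False, True):      # a loop-like branch history
    print("predict taken:", bp.predict(0x400), "actual:", outcome)
    bp.update(0x400, outcome)
```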
Conceptual depiction of superscalar processing (figure)
Superscalar implementation
• Key elements:
  • Instruction fetch strategies that simultaneously fetch multiple instructions
  • Logic for determining true dependencies involving register values, and mechanisms for communicating these values to where they are needed during execution
  • Mechanisms for initiating, or issuing, multiple instructions in parallel
  • Resources for parallel execution of multiple instructions, including multiple pipelined functional units and memory hierarchies capable of simultaneously servicing multiple memory references
  • Mechanisms for committing the process state in correct order
03.
Pentium 4
Pentium 4 diagram
1. The processor fetches instructions from memory in the order of the static program
2. Each instruction is translated into one or more fixed-length RISC instructions, known as micro-operations (a hedged decode sketch follows the list)
3. The processor executes the micro-ops on a superscalar pipeline organization, so that the micro-ops may be executed out of order
4. The processor commits the results of each micro-op execution to the processor’s register set in the order of the original program flow
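As a hedged illustration of step 2, the sketch below breaks a memory-form x86 add into load / operate / store micro-ops. The mnemonics and the three-way split are illustrative only, not Intel's actual micro-op encoding.

```python
def decode(x86_instr):
    # A memory-destination CISC instruction becomes several RISC-like
    # micro-ops; a simple register-register op maps to a single micro-op.
    if x86_instr == "add [mem], eax":          # hypothetical memory form
        return ["uLOAD  t0, [mem]",
                "uADD   t0, t0, eax",
                "uSTORE [mem], t0"]
    return [x86_instr]

print(decode("add [mem], eax"))   # three micro-ops
print(decode("add ebx, eax"))     # one micro-op
```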
Pentium 4 pipeline
The Pentium 4 architecture implements a CISC instruction set architecture on a RISC microarchitecture. The inner RISC micro-ops pass through a pipeline with at least 20 stages; in some cases, a micro-op requires multiple execution stages, resulting in an even longer pipeline.
Front end
• Generation of micro-ops
  • The Pentium 4 organization has an in-order front end that feeds into an L1 instruction cache called the trace cache.
  • The fetch/decode unit fetches x86 machine instructions from the L2 cache 64 bytes at a time.
  • Branch prediction via the BTB & I-TLB unit may alter the sequential fetch operation.
  • Once instructions are fetched, the fetch/decode unit scans the bytes to determine instruction boundaries, and the decoder translates each machine instruction into one to four micro-ops.
  • The generated micro-ops are stored in the trace cache.
• Trace cache next instruction pointer
  • The Pentium 4 uses a dynamic branch prediction strategy, using the history information stored for a BTB entry to determine whether to predict that the branch is taken.
  • The Pentium 4 BTB is organized as a four-way set-associative cache with 512 lines, and each entry uses the address of the branch as a tag.
  • Conditional branches that do not have a history in the BTB are predicted using a static prediction algorithm.
• Trace cache fetch
  • The trace cache takes the already-decoded micro-ops from the instruction decoder and assembles them into program-ordered sequences of micro-ops called traces.
  • Micro-ops are fetched sequentially from the trace cache, subject to the branch prediction logic.
  • A few instructions require more than four micro-ops, and these instructions are transferred to microcode ROM.
• Drive
  • The fifth stage of the Pentium 4 pipeline delivers decoded instructions from the trace cache to the rename/allocator module.
Out-of-order execution logic
• In the allocate stage, resources required for execution are allocated, including a reorder buffer (ROB) entry, a register entry for the result data value, and possibly a load or store buffer. The ROB is a circular buffer that can hold up to 126 micro-ops and tracks their completion status.
• In the register renaming stage, references to the 16 architectural registers are remapped into a set of 128 physical registers to remove false dependencies. Micro-ops are then placed in one of two micro-op queues, one for memory operations and the other for micro-ops that do not involve memory references.
• In the micro-op scheduling and dispatching stage, the schedulers retrieve micro-ops from the queues and dispatch them for execution, up to six at a time, favoring in-order execution but with flexibility to allow out-of-order execution. (A small in-order-commit sketch follows.)
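The sketch below illustrates the in-order-commit role of the ROB described above: micro-ops may complete out of order, but results leave the buffer only from its head, in program order. The entry names are hypothetical; the 126-entry capacity matches the figure quoted above.

```python
from collections import deque

rob = deque(maxlen=126)                    # program order, oldest at the left
rob.extend([["u1", False], ["u2", False], ["u3", False]])  # [uop, completed?]

rob[2][1] = True                           # u3 completes first (out of order)
rob[0][1] = True                           # then u1 completes

# Commit only completed entries, and only from the head of the buffer.
while rob and rob[0][1]:
    print("commit", rob.popleft()[0])      # prints "commit u1", then stalls at u2
```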
Integer and floating-point execution units
• The integer and floating-point register files are used as the source for pending operations by the execution units.
• The execution units retrieve values from the register files and the L1 data cache.
• A separate pipeline stage computes flags such as zero and negative, typically used as input to a branch instruction.
• Branch checking is performed in a subsequent pipeline stage, comparing the actual branch result with the prediction.
  • If a branch prediction is wrong, micro-operations in various stages must be removed from the pipeline.
  • The branch predictor is provided with the correct branch destination during a drive stage, which restarts the pipeline from the new target address. (A small branch-check sketch follows.)
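A small sketch of the branch-check idea, under the simplifying assumption that a misprediction squashes all younger speculative micro-ops and returns the correct target for refetch. All names here are hypothetical.

```python
def branch_check(predicted_taken, actual_taken, younger_uops, correct_target):
    if predicted_taken == actual_taken:
        return younger_uops, None          # prediction right: keep the work
    return [], correct_target              # misprediction: flush and restart

print(branch_check(True, True,  ["u5", "u6"], 0x4000))  # (['u5', 'u6'], None)
print(branch_check(True, False, ["u5", "u6"], 0x4000))  # ([], 16384)
```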
04.
ARM Cortex-A8
Instruction fetch unit
 Predicts the instruction stream
 Fetches instructions from the L1 instruction cache
 Places the fetched instructions into a buffer for consumption by the decode pipeline
 Also includes the L1 instruction cache
 Fetches are speculative (there is no guarantee that fetched instructions are executed)
 A branch or exceptional instruction in the code stream can cause a pipeline flush
 Can fetch up to four instructions per cycle
 F0
   The address generation unit (AGU) generates a new virtual address
   Not counted as part of the 12-stage pipeline
 F1
   The calculated address is used to fetch instructions from the L1 instruction cache
   In parallel, the fetch address is used to access the branch prediction arrays
 F3
   Instruction data are placed in the instruction queue
   If an instruction results in a branch prediction, the new target address is sent to the address generation unit
Instruction decode unit
 Decodes and sequences all ARM and Thumb instructions
 Dual pipeline structure, pipe0 and pipe1
   Two instructions can progress at a time
   Pipe0 contains the older instruction in program order
   If the instruction in pipe0 cannot issue, the instruction in pipe1 will not issue
 All issued instructions progress in order
 Results are written back to the register file at the end of the execution pipeline
   Prevents WAR hazards
   Keeps track of WAW hazards and makes recovery from flush conditions straightforward
 Main concern of the decode pipeline is prevention of RAW hazards (a small dual-issue sketch follows)
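A minimal Python sketch of the in-order dual-issue rule above: pipe0 holds the older instruction, and the younger one in pipe1 may issue only if pipe0 issues too. An instruction is treated as issuable here when all of its source registers are ready, a stand-in for the real RAW check; the register names are hypothetical.

```python
def try_issue(pair, ready_regs):
    issued = []
    for dst, srcs in pair:                     # pair = (older, younger)
        if all(s in ready_regs for s in srcs): # sources ready -> no RAW stall
            issued.append(dst)
        else:
            break                              # the younger never passes the older
    return issued

ready = {"r1", "r2", "r3"}
print(try_issue((("r4", ("r1", "r2")), ("r5", ("r3",))), ready))  # both issue
print(try_issue((("r4", ("r9",)),      ("r5", ("r3",))), ready))  # neither issues
```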
Thank you!
