Presentation Cea Chapter16 2 Demo
1. OVERVIEW
2. DESIGN ISSUES
3. PENTIUM 4
4. ARM CORTEX-A8
01.
Overview
Superscalar
• Refers to a machine that is designed to improve the performance of the execution of scalar instructions
• In most applications the bulk of the operations are on scalar quantities
• Term first coined in 1987
Instruction issue
• Refers to the process of initiating instruction execution in the processor’s functional units
• Occurs when an instruction moves from the decode stage of the pipeline to the first execute stage of the pipeline
Instruction issue policy
• Refers to the protocol used to issue instructions
Superscalar instruction issue and completion policies
• Three orderings are important:
• The order in which instructions are fetched
• The order in which instructions are executed
• The order in which instructions update the contents of registers and memory locations
• Issue policies fall into three categories:
• In-order issue with in-order completion
• In-order issue with out-of-order completion
• Out-of-order issue with out-of-order completion
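The distinction between these policies can be made concrete with a small sketch (my own illustration, not from the slides): with in-order issue, instructions start one per cycle in program order, yet a long-latency instruction may still finish after a later, shorter one, producing out-of-order completion.

```python
# In-order issue, out-of-order completion: instructions issue one per
# cycle in program order; completion order depends on latency.

def completion_order(instrs):
    """instrs: list of (name, latency_in_cycles), issued one per cycle
    in program order. Returns the order in which they complete."""
    # finish time = issue cycle (the instruction's index) + its latency
    finished = sorted(enumerate(instrs), key=lambda e: e[0] + e[1][1])
    return [name for _, (name, _) in finished]

# I1 is a slow divide, I2 and I3 fast adds: both adds complete first.
print(completion_order([("I1_div", 10), ("I2_add", 1), ("I3_add", 1)]))
# -> ['I2_add', 'I3_add', 'I1_div']
```

Under an in-order-completion policy the processor would instead hold back the adds' results until the divide retires.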
Organization for out-of-order issue with out-of-order completion
Register renaming
• Output dependencies and antidependencies occur because register contents may not reflect the correct ordering of values from the program
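Renaming can be sketched as follows (a simplification of my own, not a specific machine's scheme): every write to an architectural register is assigned a fresh physical register, so output (WAW) and antidependencies (WAR) disappear and only true (RAW) dependencies remain.

```python
# Minimal register-renaming sketch: architectural registers are mapped
# to fresh physical registers p0, p1, ... on every write.

def rename(program):
    """program: list of (dest, src1, src2) architectural register names.
    Returns the same program rewritten over physical registers."""
    mapping = {}              # architectural reg -> newest physical reg
    counter = 0
    out = []
    for dest, src1, src2 in program:
        # sources read the newest physical copy (true RAW is preserved)
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        # the destination gets a brand-new physical register,
        # so a later rewrite of `dest` cannot conflict (no WAW/WAR)
        mapping[dest] = f"p{counter}"
        counter += 1
        out.append((mapping[dest], s1, s2))
    return out

# R3 is written twice (WAW) and read in between (RAW):
prog = [("R3", "R1", "R2"),   # R3 := R1 op R2
        ("R4", "R3", "R1"),   # reads the first R3
        ("R3", "R5", "R6")]   # second write of R3 no longer conflicts
print(rename(prog))
# -> [('p0', 'R1', 'R2'), ('p1', 'p0', 'R1'), ('p2', 'R5', 'R6')]
```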
• Intel 80486 addressed the problem by fetching both the next sequential instruction after
a branch and speculatively fetching the branch target instruction
• RISC machines:
• Delayed branch strategy was explored
• Processor always executes the single instruction that immediately follows the branch
• Keeps the pipeline full while the processor fetches a new instruction stream
• Superscalar machines:
• Delayed branch strategy has less appeal
• Have returned to pre-RISC techniques of branch prediction
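A common baseline for such branch prediction is a 2-bit saturating counter per branch. The sketch below is my own illustration of that general technique, not the exact scheme of the 80486 or any particular machine.

```python
# 2-bit saturating-counter branch predictor: two mispredictions in a
# row are needed to flip a "strong" prediction, so one anomalous
# outcome in a loop does not disturb the steady-state prediction.

class TwoBitPredictor:
    def __init__(self):
        self.counters = {}            # branch PC -> counter in 0..3

    def predict(self, pc):
        # 2 or 3 means "predict taken"; unseen branches start at 1
        # (weakly not-taken, an arbitrary choice for this sketch)
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        # saturate at 0 and 3 instead of wrapping around
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

bp = TwoBitPredictor()
outcomes = [True, True, False, True]  # a mostly-taken loop branch
hits = 0
for taken in outcomes:
    hits += bp.predict(0x40) == taken
    bp.update(0x40, taken)
print(hits)  # -> 2 (correct after warm-up; the lone False still misses)
```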
Conceptual depiction of superscalar processing
Superscalar implementation
• Key elements:
• Instruction fetch strategies that simultaneously fetch multiple instructions
• Logic for determining true dependencies involving register values, and mechanisms
for communicating these values to where they are needed during execution
• Mechanisms for initiating, or issuing, multiple instructions in parallel
• Resources for parallel execution of multiple instructions, including multiple
pipelined functional units and memory hierarchies capable of simultaneously
servicing multiple memory references
• Mechanisms for committing the process state in correct order
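The second element above, logic for determining true dependencies, can be sketched as a simple check (my own illustration): a later instruction has a true (RAW) dependency on an earlier one if it reads the earlier instruction's destination register, and only independent instructions may be issued in the same cycle.

```python
# True-dependency (RAW) check between two instructions in program order.

def raw_dependent(earlier, later):
    """Each instruction is (dest_register, source_registers).
    RAW exists if `later` reads the register `earlier` writes."""
    dest, _ = earlier
    _, srcs = later
    return dest in srcs

i1 = ("R1", ("R2", "R3"))   # R1 := R2 op R3
i2 = ("R4", ("R1", "R5"))   # reads R1 -> must wait for i1
i3 = ("R6", ("R7", "R8"))   # independent -> may issue alongside i1

print(raw_dependent(i1, i2), raw_dependent(i1, i3))
# -> True False
```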
03. Pentium 4
Pentium 4 diagram
1. The processor fetches instructions from memory
in the order of the static program
2. Each instruction is translated into one or more
fixed-length RISC instructions, known as micro-
operations
3. The processor executes the micro-ops on a
superscalar pipeline organization, so that the
micro-ops may be executed out of order
4. The processor commits the results of each micro-
op execution to the processor’s register set in the
order of the original program flow.
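Step 4, committing results in original program order even though micro-ops execute out of order, is typically handled with a reorder buffer. The following is a minimal sketch of that idea (my own simplification, not Intel's implementation):

```python
# Reorder-buffer retirement sketch: micro-ops may *finish* execution in
# any order, but results retire strictly from the head of the buffer,
# i.e. in program order.

def retire_order(program, finish_time):
    """program: micro-op names in program order.
    finish_time: name -> cycle at which its execution completes.
    Returns (name, retire_cycle) pairs: an op retires no earlier than
    its own finish and no earlier than every older op's retirement."""
    retired, earliest = [], 0
    for uop in program:                 # walk the buffer head to tail
        earliest = max(earliest, finish_time[uop])
        retired.append((uop, earliest))
    return retired

# u2 finishes first (cycle 2) but still retires after u1 (cycle 5):
print(retire_order(["u1", "u2", "u3"], {"u1": 5, "u2": 2, "u3": 6}))
# -> [('u1', 5), ('u2', 5), ('u3', 6)]
```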
Pentium 4 pipeline
• The integer and floating-point register files are used as the source for pending operations by the
execution units.
• The execution units retrieve values from the register files and the L1 data cache.
• A separate pipeline stage computes flags such as zero and negative, typically used as input to a branch
instruction.
• Branch checking is performed in a subsequent pipeline stage, comparing actual branch result with
prediction.
• If a branch prediction is wrong, micro-operations in various stages must be removed from the
pipeline.
• The Branch Predictor is provided with the correct branch destination during a drive stage, which
restarts the pipeline from the new target address.
04.
Arm
Cortex-A8
Instruction fetch unit
• Predicts the instruction stream
• Fetches instructions from the L1 instruction cache
• Places the fetched instructions into a buffer for consumption by the decode pipeline
• Also includes the L1 instruction cache
• Fetches are speculative (there is no guarantee that they are executed)
• A branch or exceptional instruction in the code stream can cause a pipeline flush
• Can fetch up to four instructions per cycle
F0
• Address generation unit (AGU) generates a new virtual address
• Not counted as part of the 12-stage pipeline
F1
• The calculated address is used to fetch instructions from the L1 instruction cache
• In parallel, the fetch address is used to access the branch prediction arrays
F2
• Instruction data are placed in the instruction queue
• If an instruction results in a branch prediction, the new target address is sent to the address generation unit
Instruction decode unit
• Decodes and sequences all ARM and Thumb instructions
• Dual pipeline structure, pipe0 and pipe1, so two instructions can progress at a time
• Pipe0 contains the older instruction in program order
• If the instruction in pipe0 cannot issue, the instruction in pipe1 will not issue
• All issued instructions progress in order
• Results are written back to the register file at the end of the execution pipeline
• This prevents WAR hazards and keeps tracking of WAW hazards and recovery from flush conditions straightforward
• The main concern of the decode pipeline is prevention of RAW hazards
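The issue rules above can be sketched as a dual-issue check (a simplification of my own, not ARM's actual logic): pipe0 holds the older instruction, pipe1 issues only if pipe0 does, and a RAW hazard between the pair restricts issue to pipe0 alone.

```python
# In-order dual-issue sketch for a pipe0/pipe1 structure.

def dual_issue(older, newer, ready):
    """older/newer: (dest_register, source_registers) in program order.
    ready: set of registers whose values are currently available.
    Returns the destinations of the instructions that issue this cycle."""
    d0, s0 = older
    d1, s1 = newer
    # pipe0 stalls if any of its sources is not ready ...
    if not all(r in ready for r in s0):
        return []                  # ... and then pipe1 must also wait
    # pipe1 must not read pipe0's still-unwritten result (RAW hazard)
    if d0 in s1 or not all(r in ready for r in s1):
        return [d0]                # only the older instruction issues
    return [d0, d1]                # both issue, in program order

add_i = ("R1", ("R2", "R3"))
mul_i = ("R4", ("R1", "R5"))       # RAW on R1: cannot pair with add_i
orr_i = ("R6", ("R7", "R8"))       # independent: can pair with add_i
ready = {"R2", "R3", "R5", "R7", "R8"}
print(dual_issue(add_i, mul_i, ready))   # -> ['R1']
print(dual_issue(add_i, orr_i, ready))   # -> ['R1', 'R6']
```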
Thank you!