
351 CS 33 Computer Organization And Architecture

UNIT 5

Processor Organization – Register Organization – User-visible registers, control and
status registers, Case Study – Register organization of microprocessor 8086, Instruction
cycle – The machine cycle and data flow, Instruction Pipelining – Pipelining strategy,
pipeline performance, pipeline hazards, dealing with branches, Case study – Pipelining
with Pentium, Instruction-level parallelism and superscalar processors – Superscalar
versus superpipelined, constraints, design issues – instruction-level and machine
parallelism, instruction issue policy, register renaming, machine parallelism, branch
prediction, superscalar execution and implementation. Case study – Pentium IV.

PROCESSOR ORGANIZATION

• Fetch instruction: The processor reads an instruction from memory (cache, main memory).

• Interpret instruction: The instruction is decoded to determine what action is required.

• Fetch data: The execution of an instruction may require reading data from memory or an I/O
module.

• Process data: The execution of an instruction may require performing some arithmetic or
logical operation on data.

• Write data: The results of an execution may require writing data to memory or an I/O module.

The major components of the processor are an arithmetic and logic unit (ALU) and a control unit
(CU). The ALU does the actual computation or processing of data. The control unit controls the
movement of data and instructions into and out of the processor and controls the operation of the

M.Briskilla, School of CS & IT,DMI- SJBU Page 1



ALU. In addition to that there is a minimal internal memory, consisting of a set of storage
locations, called registers.

The data transfer and logic control paths are indicated, including an element labeled internal
processor bus. There is a small collection of major elements (computer: processor, I/O, memory;
processor: control unit, ALU, registers) connected by data paths.

REGISTER ORGANIZATION

At higher levels of the hierarchy, memory is faster, smaller, and more expensive (per bit). Within
the processor, there is a set of registers that function as a level of memory above main memory
and cache in the hierarchy. The registers in the processor perform two roles:

• User-visible registers: Enable the machine- or assembly-language programmer to
minimize main memory references by optimizing use of registers.

• Control and status registers: Used by the control unit to control the operation of the
processor and by privileged operating system programs to control the execution of
programs.

USER VISIBLE REGISTERS

A user-visible register is one that may be referenced by means of the machine language that the
processor executes. We can characterize these in the following categories:


 General purpose
 Data
 Address
 Condition codes

General-purpose registers can be assigned to a variety of functions by the programmer.

Data registers may be used only to hold data and cannot be employed in the calculation of an
operand address.

Address registers may themselves be somewhat general purpose, or they may be devoted to a
particular addressing mode. Examples include the following:

• Segment pointers: In a machine with segmented addressing, a segment register holds the
address of the base of the segment. There may be multiple registers: for example, one for the
operating system and one for the current process.

• Index registers: These are used for indexed addressing and may be autoindexed.

• Stack pointer: If there is user-visible stack addressing, then typically there is a dedicated
register that points to the top of the stack. This allows implicit addressing; that is, push, pop, and
other stack instructions need not contain an explicit stack operand.

A final category of registers, which is at least partially visible to the user, holds condition codes
(also referred to as flags). Condition codes are bits set by the processor hardware as the result of
operations.

CONTROL AND STATUS REGISTERS

Four registers are essential to instruction execution:

• Program counter (PC): Contains the address of an instruction to be fetched

• Instruction register (IR): Contains the instruction most recently fetched

• Memory address register (MAR): Contains the address of a location in memory

• Memory buffer register (MBR): Contains a word of data to be written to memory or the word
most recently read.

Many processor designs include a register or set of registers, often known as
the program status word (PSW), that contains status information. The PSW typically contains
condition codes plus other status information. Common fields or flags include the following:

• Sign: Contains the sign bit of the result of the last arithmetic operation.

• Zero: Set when the result is 0.


• Carry: Set if an operation resulted in a carry (addition) into or borrow (subtraction) out of a
high-order bit. Used for multiword arithmetic operations.

• Equal: Set if a logical compare result is equality.

• Overflow: Used to indicate arithmetic overflow.

• Interrupt Enable/Disable: Used to enable or disable interrupts.

• Supervisor: Indicates whether the processor is executing in supervisor or user mode.
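As an illustrative sketch (not from the text), the Sign, Zero, Carry, and Overflow flags above can be computed for an 8-bit addition; the helper name `add8_flags` is hypothetical:

```python
# Hypothetical sketch: compute PSW-style flags for an 8-bit addition.
def add8_flags(a, b):
    """Add two 8-bit values; return (8-bit result, dict of flags)."""
    full = a + b                      # 9-bit intermediate sum
    result = full & 0xFF              # keep the low 8 bits
    flags = {
        "carry": full > 0xFF,                        # carry out of bit 7
        "zero": result == 0,                         # result is 0
        "sign": bool(result & 0x80),                 # bit 7 is the sign bit
        # overflow: operands share a sign but the result's sign differs
        "overflow": bool(~(a ^ b) & (a ^ result) & 0x80),
    }
    return result, flags
```

For example, adding 0x7F and 0x01 yields 0x80 with Overflow and Sign set but Carry clear, matching two's-complement behavior.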

CASE STUDY – REGISTER ORGANIZATION OF MICROPROCESSOR 8086

 The MC68000 partitions its 32-bit registers into eight data registers and nine address
registers. The eight data registers are used primarily for data manipulation and are also
used in addressing as index registers. The width of the registers allows 8-, 16-, and 32-bit
data operations, determined by opcode. The address registers contain 32-bit (no
segmentation) addresses; two of these registers are also used as stack pointers, one for
users and one for the operating system, depending on the current execution mode. Both
registers are numbered 7, because only one can be used at a time. The MC68000 also
includes a 32-bit program counter and a 16-bit status register.
 The Motorola team wanted a very regular instruction set, with no special purpose
registers. A concern for code efficiency led them to divide the registers into two
functional components, saving one bit on each register specifier. This seems a reasonable
compromise between complete generality and code compaction.
 The Intel 8086 takes a different approach to register organization. Every register is
special purpose, although some registers are also usable as general purpose. The 8086
contains four 16-bit data registers that are addressable on a byte or 16-bit basis, and four
16-bit pointer and index registers. The data registers can be used as general purpose in
some instructions. In others, the registers are used implicitly.

INSTRUCTION CYCLE

An instruction cycle includes the following stages:

• Fetch: Read the next instruction from memory into the processor.

• Execute: Interpret the opcode and perform the indicated operation.

• Interrupt: If interrupts are enabled and an interrupt has occurred, save the current
process state and service the interrupt.

THE MACHINE CYCLE AND DATA FLOW

The main line of activity consists of alternating instruction fetch and instruction execution
activities. After an instruction is fetched, it is examined to determine if any indirect addressing is
involved. If so, the required operands are fetched using indirect addressing. Following execution,
an interrupt may be processed before the next instruction fetch.

During the fetch cycle, an instruction is read from memory. The PC contains the address of the
next instruction to be fetched. This address is moved to the MAR and placed on the address bus.
The control unit requests a memory read, and the result is placed on the data bus and copied into
the MBR and then moved to the IR. Meanwhile, the PC is incremented by 1, preparatory for the
next fetch. Once the fetch cycle is over, the control unit examines the contents of the IR to
determine if it contains an operand specifier using indirect addressing. If so, an indirect cycle is
performed.
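The fetch-cycle data flow above can be sketched as explicit register transfers. This toy CPU class is an illustrative assumption (word-addressable memory, PC incremented by 1 per instruction), not a model of any specific processor:

```python
# Illustrative sketch of the fetch cycle as register transfers.
class CPU:
    def __init__(self, memory):
        self.memory = memory          # word-addressable memory (a list)
        self.PC = 0                   # program counter
        self.MAR = 0                  # memory address register
        self.MBR = 0                  # memory buffer register
        self.IR = 0                   # instruction register

    def fetch(self):
        self.MAR = self.PC                 # PC -> MAR (placed on address bus)
        self.MBR = self.memory[self.MAR]   # memory read -> MBR (via data bus)
        self.IR = self.MBR                 # MBR -> IR
        self.PC += 1                       # increment PC for the next fetch
```

After one fetch on a memory holding [0xA1, 0xB2], IR contains 0xA1 and PC points at the next instruction.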


The execute cycle takes many forms; the form depends on which of the various machine
instructions is in the IR. This cycle may involve transferring data among registers, reading from
or writing to memory or I/O, and/or invoking the ALU.

INSTRUCTION PIPELINING

Instruction pipelining is a technique for implementing instruction-level parallelism within a
single processor. Pipelining attempts to keep every part of the processor busy with some
instruction by dividing incoming instructions into a series of sequential steps performed by
different processor units, with different parts of instructions processed in parallel. It allows
greater CPU throughput.

PIPELINING STRATEGY

Pipelining is an implementation technique in which multiple instructions are overlapped in
execution. The computer pipeline is divided into stages. Each stage completes a part of an
instruction in parallel.

The pipeline has two independent stages. The first stage fetches an instruction and buffers
it. When the second stage is free, the first stage passes it the buffered instruction. While the
second stage is executing the instruction, the first stage takes advantage of any unused memory
cycles to fetch and buffer the next instruction. This is called instruction prefetch or fetch overlap.

In general, pipelining requires registers to store data between stages. This process will
speed up instruction execution. If the fetch and execute stages were of equal duration, the
instruction cycle time would be halved.


Decomposition of the instruction processing in pipelining:

• Fetch instruction (FI): Read the next expected instruction into a buffer.

• Decode instruction (DI): Determine the opcode and the operand specifiers.

• Calculate operands (CO): Calculate the effective address of each source operand. This
may involve displacement, register indirect, indirect, or other forms of address calculation.

• Fetch operands (FO): Fetch each operand from memory. Operands in registers need not
be fetched.

• Execute instruction (EI): Perform the indicated operation and store the result, if any, in
the specified destination operand location.

• Write operand (WO): Store the result in memory.

With this decomposition, the various stages will be of more nearly equal duration.
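Under the idealized assumption of one time unit per stage and no stalls, the six-stage decomposition above implies that instruction i (1-based) completes at time unit i + 5; a minimal sketch:

```python
# Sketch: time units at which instructions finish an ideal six-stage
# pipeline (FI, DI, CO, FO, EI, WO), one stage per time unit, no stalls.
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def completion_times(n_instructions, n_stages=len(STAGES)):
    # instruction i enters the pipeline at time i and finishes at i + k - 1
    return [i + n_stages - 1 for i in range(1, n_instructions + 1)]
```

With nine instructions, the first completes at time unit 6 and the last at time unit 14, i.e. one completion per time unit once the pipeline is full.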

Consider a six-stage pipeline in which instruction 3 is a conditional branch to instruction 15, and
the branch is taken. This is not determined until the end of time unit 7. At
this point, the pipeline must be cleared of instructions that are not useful. During time unit 8,
instruction 15 enters the pipeline. No instructions complete during time units 9 through 12; this is
the performance penalty incurred because the branch was not anticipated.


PIPELINE PERFORMANCE

The cycle time τ of an instruction pipeline is the time needed to advance a set of instructions one
stage through the pipeline. The cycle time can be determined as

τ = max[τi] + d = τm + d,  1 ≤ i ≤ k

where τi is the time delay of the circuitry in stage i of the pipeline, τm is the maximum stage
delay, k is the number of stages, and d is the time delay of a latch needed to advance signals and
data from one stage to the next. The total time Tk,n required to execute n instructions on a
k-stage pipeline is then Tk,n = [k + (n − 1)]τ.
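As a numerical sketch of the standard textbook timing relations (assumed here): the cycle time is the maximum stage delay plus the latch delay, n instructions take k + (n − 1) cycles on a k-stage pipeline, and the speedup over unpipelined execution is nk/(k + n − 1):

```python
# Sketch of pipeline performance relations (standard textbook model).
def cycle_time(stage_delays, latch_delay):
    # the slowest stage plus the latch delay sets the clock period
    return max(stage_delays) + latch_delay

def total_time(k, n, tau):
    # T(k, n) = [k + (n - 1)] * tau cycles to finish n instructions
    return (k + (n - 1)) * tau

def speedup(k, n):
    # S(k) = n*k / (k + n - 1), relative to executing n instructions
    # sequentially on an equivalent unpipelined machine
    return (n * k) / (k + n - 1)
```

For example, nine instructions on a six-stage pipeline take 14 cycles, giving a speedup of 54/14 ≈ 3.86, which approaches 6 as n grows.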

PIPELINE HAZARDS

A pipeline hazard occurs when the pipeline, or some portion of the pipeline, must stall because
conditions do not permit continued execution. Such a pipeline stall is also referred to as a
pipeline bubble. There are three types of hazards: resource (structural), data, and control.

RESOURCE HAZARDS: A resource hazard occurs when two (or more) instructions that are
already in the pipeline need the same resource. The result is that the instructions must be
executed in serial rather than in parallel for a portion of the pipeline. A resource hazard is
sometimes referred to as a structural hazard.

DATA HAZARDS: A data hazard occurs when there is a conflict in the access of an operand
location. For example, two instructions in a program are to be executed in sequence and both access a
particular memory or register operand. If the two instructions are executed in strict sequence, no
problem occurs. However, if the instructions are executed in a pipeline, then it is possible for the
operand value to be updated in such a way as to produce a different result than would occur with
strict sequential execution.

There are three types of data hazards;

• Read after write (RAW), or true dependency: An instruction modifies a register or memory
location and a succeeding instruction reads the data in that memory or register location. A hazard
occurs if the read takes place before the write operation is complete.

• Write after read (WAR), or antidependency: An instruction reads a register or memory location
and a succeeding instruction writes to the location. A hazard occurs if the write operation
completes before the read operation takes place.

• Write after write (WAW), or output dependency: Two instructions both write to the same
location. A hazard occurs if the write operations take place in the reverse order of the intended
sequence.
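The three hazard types can be sketched as set tests over the registers each instruction reads and writes. The representation used here, a (writes, reads) pair per instruction with i1 earlier in program order, is an illustrative assumption:

```python
# Sketch: classify the data hazards between two instructions, each given
# as (writes, reads) register-name sets, with i1 earlier in program order.
def hazards(i1, i2):
    w1, r1 = i1
    w2, r2 = i2
    found = set()
    if w1 & r2:
        found.add("RAW")   # true dependency: i2 reads what i1 writes
    if r1 & w2:
        found.add("WAR")   # antidependency: i2 writes what i1 reads
    if w1 & w2:
        found.add("WAW")   # output dependency: both write the same location
    return found
```

For instance, ADD EAX, ECX followed by MOV EBX, EAX is a pure RAW hazard: the second instruction reads the EAX value the first produces.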

CONTROL HAZARDS: A control hazard, also known as a branch hazard, occurs when the
pipeline makes the wrong decision on a branch prediction and therefore brings instructions into
the pipeline that must subsequently be discarded.

DEALING WITH BRANCHES

One of the major problems in designing an instruction pipeline is assuring a steady flow of
instructions to the initial stages of the pipeline. Until the instruction is actually executed, it is
impossible to determine whether the branch will be taken or not.

A variety of approaches have been taken for dealing with conditional branches:

• Multiple streams

• Prefetch branch target

• Loop buffer

• Branch prediction

• Delayed branch


MULTIPLE STREAMS

A simple pipeline suffers a penalty for a branch instruction because it must choose one of two
instructions to fetch next and may make the wrong choice. A brute-force approach is to replicate
the initial portions of the pipeline and allow the pipeline to fetch both instructions, making use of
two streams.

PREFETCH BRANCH TARGET

When a conditional branch is recognized, the target of the branch is prefetched, in addition to the
instruction following the branch. This target is then saved until the branch instruction is
executed. If the branch is taken, the target has already been prefetched.

The IBM 360/91 uses this approach.

LOOP BUFFER

A loop buffer is a small, very-high-speed memory maintained by the instruction fetch stage of
the pipeline and containing the n most recently fetched instructions, in sequence. If a branch is to
be taken, the hardware first checks whether the branch target is within the buffer. If so, the next
instruction is fetched from the buffer.

BRANCH PREDICTION

Various techniques can be used to predict whether a branch will be taken. Among the more
common are the following:

• Predict never taken

• Predict always taken

• Predict by opcode

• Taken/not taken switch

• Branch history table
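As one common realization of the taken/not taken switch, a 2-bit saturating counter kept per branch can be sketched as follows; the initial state chosen here is an arbitrary assumption:

```python
# Sketch of the "taken/not taken switch": a 2-bit saturating counter.
# States 0-1 predict not taken, states 2-3 predict taken; one step per
# actual outcome, so a single anomaly does not flip a strong prediction.
class TwoBitPredictor:
    def __init__(self):
        self.state = 2                # start weakly "taken" (an assumption)

    def predict(self):
        return self.state >= 2        # True means predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # saturate at strongly taken
        else:
            self.state = max(0, self.state - 1)   # saturate at strongly not taken
```

Two consecutive not-taken outcomes are needed before the prediction flips, which is why this scheme mispredicts only once per execution of a typical loop body.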

DELAYED BRANCH

It is possible to improve pipeline performance by automatically rearranging instructions within a
program, so that branch instructions occur later than actually desired.

CASE STUDY-PIPELINING WITH PENTIUM

Intel 80486 Pipelining


An instructive example of an instruction pipeline is that of the Intel 80486. The 80486
implements a five-stage pipeline:

• Fetch: Instructions are fetched from the cache or from external memory and placed into one of
the two 16-byte prefetch buffers. The objective of the fetch stage is to fill the prefetch buffers
with new data as soon as the old data have been consumed by the instruction decoder.

• Decode stage 1: All opcode and addressing-mode information is decoded in the D1 stage. The
required information, as well as instruction-length information, is included in at most the first 3
bytes of the instruction. Hence, 3 bytes are passed to the D1 stage from the prefetch buffers. The
D1 decoder can then direct the D2 stage to capture the rest of the instruction (displacement and
immediate data), which is not involved in the D1 decoding.

• Decode stage 2: The D2 stage expands each opcode into control signals for the ALU. It also
controls the computation of the more complex addressing modes.

• Execute: This stage includes ALU operations, cache access, and register update.

• Write back: This stage, if needed, updates registers and status flags modified during the
preceding execute stage. If the current instruction updates memory, the computed value is sent to
the cache and to the bus-interface write buffers at the same time.

INSTRUCTION LEVEL PARALLELISM AND SUPERSCALAR PROCESSORS

A superscalar implementation of a processor architecture is one in which common
instructions (integer and floating-point arithmetic, loads, stores, and conditional branches) can be
initiated simultaneously and executed independently. Such implementations raise a number of
complex design issues related to the instruction pipeline.

SUPERSCALAR VERSUS SUPERPIPELINED

The essence of the superscalar approach is the ability to execute instructions independently and
concurrently in different pipelines.


An alternative approach to achieving greater performance is referred to as superpipelining.
Superpipelining exploits the fact that many pipeline stages perform tasks that require less than
half a clock cycle. Thus, a doubled internal clock speed allows the performance of two tasks in
one external clock cycle.

The upper part of the diagram illustrates an ordinary pipeline, used as a base for comparison. The
base pipeline issues one instruction per clock cycle and can perform one pipeline stage per clock
cycle. The pipeline has four stages: instruction fetch, operation decode, operation execution, and
result write back. The next part of the diagram shows a superpipelined implementation that is
capable of performing two pipeline stages per clock cycle. An alternative way of looking at this
is that the functions performed in each stage can be split into two nonoverlapping parts and each
can execute in half a clock cycle. A superpipeline implementation that behaves in this fashion is
said to be of degree 2. Finally, the lowest part of the diagram shows a superscalar
implementation capable of executing two instances of each stage in parallel. Higher-degree
superpipeline and superscalar implementations are of course possible.

CONSTRAINTS/LIMITATIONS

The term instruction-level parallelism refers to the degree to which, on average, the instructions
of a program can be executed in parallel. A combination of compiler-based optimization and
hardware techniques can be used to maximize instruction-level parallelism.

Five limitations of ILP:

• True data dependency

• Procedural dependency

• Resource conflicts

• Output dependency

• Antidependency

TRUE DATA DEPENDENCY

ADD EAX, ECX ; EAX ← EAX + ECX

MOV EBX, EAX ; EBX ← EAX

The second instruction can be fetched and decoded but cannot execute until the first instruction
executes. The reason is that the second instruction needs data produced by the first instruction.
This situation is referred to as a true data dependency (also called flow dependency or read-after-
write [RAW] dependency).

PROCEDURAL DEPENDENCIES

The presence of branches in an instruction sequence complicates the pipeline operation. The
instructions following a branch (taken or not taken) have a procedural dependency on the branch
and cannot be executed until the branch is executed.

RESOURCE CONFLICT

A resource conflict is a competition of two or more instructions for the same resource at the
same time. Examples of resources include memories, caches, buses, register-file ports, and
functional units (e.g., ALU adder).


DESIGN ISSUES

INSTRUCTION LEVEL AND MACHINE PARALLELISM

Instruction-level parallelism exists when instructions in a sequence are independent and thus
can be executed in parallel by overlapping.

Consider the following two code fragments (three instructions each, left fragment and right
fragment):

Load  R1 ← R2           Add   R3 ← R3, "1"
Add   R3 ← R3, "1"      Add   R4 ← R3, R2
Add   R4 ← R4, R2       Store [R4] ← R0

The three instructions on the left are independent, and in theory all three could be executed in
parallel. In contrast, the three instructions on the right cannot be executed in parallel because the
second instruction uses the result of the first, and the third instruction uses the result of the
second. The degree of instruction-level parallelism is determined by the frequency of true data
dependencies and procedural dependencies in the code.
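The independence claim above can be checked mechanically. This sketch considers only true and output dependences (the ones the passage emphasizes, so WAR hazards are deliberately ignored); each instruction is modeled as a (destination, sources) pair, with "MEM" standing in for the store's memory destination:

```python
# Sketch: test whether a straight-line fragment is fully parallel, i.e.
# no instruction reads or rewrites a register written by an earlier one.
def fully_parallel(fragment):
    written = set()
    for dest, sources in fragment:
        # WAW: dest already written; RAW: a source was written earlier
        if dest in written or written & set(sources):
            return False
        written.add(dest)
    return True
```

Applied to the fragments above, the left one is fully parallel while the right one is not, because its second instruction reads the R3 produced by its first.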

Machine parallelism is a measure of the ability of the processor to take advantage of
instruction-level parallelism. Machine parallelism is determined by the number of instructions
that can be fetched and executed at the same time (the number of parallel pipelines) and by the
speed and sophistication of the mechanisms that the processor uses to find independent
instructions. Both instruction-level and machine parallelism are important factors in enhancing
performance.

INSTRUCTION ISSUE POLICY

The term instruction issue is used to refer to the process of initiating instruction execution in the
processor’s functional units and the term instruction issue policy to refer to the protocol used to
issue instructions.

In essence, the processor is trying to look ahead of the current point of execution to locate
instructions that can be brought into the pipeline and executed. Three types of orderings are
important in this regard:

• The order in which instructions are fetched

• The order in which instructions are executed

• The order in which instructions update the contents of register and memory locations


Superscalar instruction issue policies fall into the following categories:

• In-order issue with in-order completion

• In-order issue with out-of-order completion

• Out-of-order issue with out-of-order completion

IN-ORDER ISSUE WITH IN-ORDER COMPLETION

The simplest instruction issue policy is to issue instructions in the exact order that would be
achieved by sequential execution (in-order issue) and to write results in that same order (in-order
completion).


IN-ORDER ISSUE WITH OUT-OF-ORDER COMPLETION

Out-of-order completion is used in scalar RISC processors to improve the performance of
instructions that require multiple cycles.

With out-of-order completion, any number of instructions may be in the execution stage
at any one time, up to the maximum degree of machine parallelism across all functional units.
Instruction issuing is stalled by a resource conflict, a data dependency, or a procedural
dependency.

OUT-OF-ORDER ISSUE WITH OUT-OF-ORDER COMPLETION

With in-order issue, the processor will only decode instructions up to the point of a
dependency or conflict. No additional instructions are decoded until the conflict is resolved. As a
result, the processor cannot look ahead of the point of conflict to subsequent instructions that
may be independent of those already in the pipeline and that may be usefully introduced into the
pipeline.

To allow out-of-order issue, it is necessary to decouple the decode and execute stages of
the pipeline. This is done with a buffer referred to as an instruction window. With this
organization, after a processor has finished decoding an instruction, it is placed in the instruction
window. As long as this buffer is not full, the processor can continue to fetch and decode new
instructions. When a functional unit becomes available in the execute stage, an instruction from
the instruction window may be issued to the execute stage. The result of this organization is that
the processor has a lookahead capability, allowing it to identify independent instructions that can
be brought into the execute stage.
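The instruction-window lookahead described above can be sketched as follows. Dependencies are given explicitly by instruction name, which is an illustrative simplification (a real processor discovers them from register operands):

```python
# Sketch: out-of-order issue from an instruction window.
def issue_order(instructions):
    """instructions: list of (name, set of names it depends on).
    At each step, issue the first instruction in the window whose
    dependencies have all completed; repeat until the window drains."""
    window = list(instructions)       # decoded instructions awaiting issue
    done, order = set(), []
    while window:
        for instr in window:
            name, deps = instr
            if deps <= done:          # all operands available
                window.remove(instr)
                done.add(name)
                order.append(name)
                break
        else:
            raise ValueError("circular dependency")
    return order
```

Note how an instruction stalled on an operand does not block later, independent instructions from issuing ahead of it, which is exactly the lookahead capability the window provides.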

REGISTER RENAMING

Register renaming is a technique that deals with data dependences between instructions
by renaming their register operands. An assembly language programmer or a compiler specifies
these operands using architectural registers - the registers that are explicit in the instruction set
architecture. Renaming replaces architectural register names by, in effect, value names, with a
new value name for each instruction destination operand. This eliminates the name dependences
(output dependences and antidependences) between instructions and automatically recognizes
true dependences.

The recognition of true data dependences between instructions permits a more flexible life cycle
for instructions. By maintaining a status bit for each value indicating whether or not it has been
computed yet, it allows the execution phase of two instruction operations to be performed out of
order when there are no true data dependences between them. This is called out-of-order
execution.

Registers are allocated dynamically by the processor hardware, and they are associated with the
values needed by instructions at various points in time. When a new register value is created, a
new register is allocated for that value. Subsequent instructions that access that value as a source
operand in that register must go through a renaming process.

Example:

I1: R3b ← R3a op R5a
I2: R4b ← R3b + 1
I3: R3c ← R5a + 1
I4: R7b ← R3c op R4b

The register reference without the subscript refers to the logical register reference found in the
instruction. The register reference with the subscript refers to a hardware register allocated to
hold a new value. When a new allocation is made for a particular logical register, subsequent
instruction references to that logical register as a source operand are made to refer to the most
recently allocated hardware register.
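The renaming process in the example can be sketched as a small algorithm: each write to a logical register allocates the next versioned physical name, and each source refers to the most recent version (suffix letters as in the example above):

```python
# Sketch of register renaming: writes allocate fresh versioned names,
# reads refer to the most recently allocated version.
def rename(program):
    """program: list of (dest, [sources]) over logical register names.
    Returns the same list rewritten with versioned names (R3 -> R3a, ...)."""
    version = {}                      # logical register -> current version
    def name(reg):
        return reg + "abcdefgh"[version.get(reg, 0)]
    renamed = []
    for dest, sources in program:
        srcs = [name(s) for s in sources]         # sources: latest version
        version[dest] = version.get(dest, 0) + 1  # a write allocates anew
        renamed.append((name(dest), srcs))
    return renamed
```

Running this on the four-instruction example reproduces the subscripts shown above: the second write to R3 becomes R3c, so I4 correctly reads R3c rather than R3b, and the antidependence between I2 and I3 disappears.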

MACHINE PARALLELISM

In reported simulation results, the vertical axis of each graph corresponds to the mean speedup of
the superscalar machine over the scalar machine, and the horizontal axis shows the results for
four alternative processor organizations. The base machine does not duplicate any of the
functional units, but it can issue instructions out of order. The graphs, with and without register
renaming, yield some important conclusions. The first is that it is probably not worthwhile to add
functional units without register renaming.


There is some slight improvement in performance, but at the cost of increased hardware
complexity.

BRANCH PREDICTION

With the advent of RISC machines, the delayed branch strategy was explored. This allows the
processor to calculate the result of conditional branch instructions before any unusable
instructions have been prefetched. With this method, the processor always executes the single
instruction that immediately follows the branch. This keeps the pipeline full while the processor
fetches a new instruction stream.

SUPERSCALAR EXECUTION

The program to be executed consists of a linear sequence of instructions. This is the static
program as written by the programmer or generated by the compiler. The instruction fetch
process, which includes branch prediction, is used to form a dynamic stream of instructions. This
stream is examined for dependencies, and the processor may remove artificial dependencies. The
processor then dispatches the instructions into a window of execution. In this window,
instructions no longer form a sequential stream but are structured according to their true data
dependencies. The processor performs the execution stage of each instruction in an order
determined by the true data dependencies and hardware resource availability. Finally,
instructions are conceptually put back into sequential order and their results are recorded.

The final step mentioned in the preceding paragraph is referred to as committing, or retiring, the
instruction.


SUPERSCALAR IMPLEMENTATION

Key elements of a superscalar implementation:

• Instruction fetch strategies that simultaneously fetch multiple instructions, often by predicting
the outcomes of, and fetching beyond, conditional branch instructions. These functions require
the use of multiple pipeline fetch and decode stages, and branch prediction logic.

• Logic for determining true dependencies involving register values, and mechanisms for
communicating these values to where they are needed during execution.

• Mechanisms for initiating, or issuing, multiple instructions in parallel.

• Resources for parallel execution of multiple instructions, including multiple pipelined
functional units and memory hierarchies capable of simultaneously servicing multiple memory
references.

• Mechanisms for committing the process state in correct order.

CASE STUDY – PENTIUM IV


The operation of the Pentium 4 can be summarized as follows:

1. The processor fetches instructions from memory in the order of the static program.

2. Each instruction is translated into one or more fixed-length RISC instructions, known as
micro-operations, or micro-ops.

3. The processor executes the micro-ops on a superscalar pipeline organization, so that the
micro-ops may be executed out of order.

4. The processor commits the results of each micro-op execution to the processor’s register set in
the order of the original program flow.

TRACE CACHE FETCH

The trace cache takes the already-decoded micro-ops from the instruction decoder and assembles
them into program-ordered sequences of micro-ops called traces. Micro-ops are fetched
sequentially from the trace cache, subject to the branch prediction logic.

OUT-OF-ORDER EXECUTION LOGIC

This part of the processor reorders micro-ops to allow them to execute as quickly as their input
operands are ready.

ALLOCATE

The allocate stage allocates resources required for execution.

It performs the following functions:

• If a needed resource, such as a register, is unavailable for one of the three micro-ops arriving at
the allocator during a clock cycle, the allocator stalls the pipeline.

• The allocator allocates a reorder buffer (ROB) entry, which tracks the completion status of one
of the 126 micro-ops that could be in process at any time.

• The allocator allocates one of the 128 integer or floating-point register entries for the result data
value of the micro-op, and possibly a load or store buffer used to track one of the 48 loads or 24
stores in the machine pipeline.

• The allocator allocates an entry in one of the two micro-op queues in front of the instruction
schedulers.

REGISTER RENAMING


The rename stage remaps references to the 16 architectural registers (8 floating-point registers,
plus EAX, EBX, ECX, EDX, ESI, EDI,EBP, and ESP) into a set of 128 physical registers.

MICRO-OP SCHEDULING AND DISPATCHING

The schedulers are responsible for retrieving micro-ops from the micro-op queues and
dispatching them for execution. Each scheduler looks for micro-ops whose status indicates
that the micro-op has all of its operands.
