UNIT 5
PROCESSOR ORGANIZATION
• Fetch instruction: The processor reads an instruction from memory (cache, main memory).
• Fetch data: The execution of an instruction may require reading data from memory or an I/O
module.
• Process data: The execution of an instruction may require performing some arithmetic or
logical operation on data.
• Write data: The results of an execution may require writing data to memory or an I/O module.
The major components of the processor are an arithmetic and logic unit (ALU) and a control unit
(CU). The ALU does the actual computation or processing of data. The control unit controls the
movement of data and instructions into and out of the processor and controls the operation of the
ALU. In addition, there is a minimal internal memory, consisting of a set of storage
locations, called registers.
The data transfer and logic control paths are indicated, including an element labeled internal
processor bus. There is a small collection of major elements (computer: processor, I/O, memory;
processor: control unit, ALU, registers) connected by data paths.
REGISTER ORGANIZATION
At higher levels of the hierarchy, memory is faster, smaller, and more expensive (per bit). Within
the processor, there is a set of registers that function as a level of memory above main memory
and cache in the hierarchy. The registers in the processor perform two roles:
• User-visible registers: Enable the machine- or assembly-language programmer to minimize
main memory references by optimizing use of registers.
• Control and status registers: Used by the control unit to control the operation of the
processor and by privileged, operating system programs to control the execution of
programs.
A user-visible register is one that may be referenced by means of the machine language that the
processor executes. We can characterize these in the following categories:
• General purpose
• Data
• Address
• Condition codes
Data registers may be used only to hold data and cannot be employed in the calculation of an
operand address.
Address registers may themselves be somewhat general purpose, or they may be devoted to a
particular addressing mode. Examples include the following:
• Segment pointers: In a machine with segmented addressing, a segment register holds the
address of the base of the segment. There may be multiple registers: for example, one for the
operating system and one for the current process.
• Index registers: These are used for indexed addressing and may be autoindexed.
• Stack pointer: If there is user-visible stack addressing, then typically there is a dedicated
register that points to the top of the stack. This allows implicit addressing; that is, push, pop, and
other stack instructions need not contain an explicit stack operand.
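As a minimal illustration of implicit stack addressing, the following Python sketch models a dedicated stack pointer over a small memory array (the 16-word memory and the full-descending stack discipline are assumptions made for illustration). Note that neither push nor pop carries an explicit stack operand; both address the stack implicitly through SP.

    memory = [0] * 16
    sp = len(memory)          # stack pointer register; stack grows downward

    def push(value):
        """PUSH: decrement SP, then store -- no explicit stack operand."""
        global sp
        sp -= 1
        memory[sp] = value

    def pop():
        """POP: load from the location SP points to, then increment SP."""
        global sp
        value = memory[sp]
        sp += 1
        return value

    push(7)
    push(42)
    print(pop(), pop())       # 42 7 -- last in, first out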
A final category of registers, which is at least partially visible to the user, holds condition codes
(also referred to as flags). Condition codes are bits set by the processor hardware as the result of
operations.
Four registers are essential to instruction execution:
• Program counter (PC): Contains the address of an instruction to be fetched.
• Instruction register (IR): Contains the instruction most recently fetched.
• Memory address register (MAR): Contains the address of a location in memory.
• Memory buffer register (MBR): Contains a word of data to be written to memory or the word
most recently read.
Many processor designs include a register or set of registers, often known as the program status
word (PSW), that contains status information. The PSW typically contains condition codes plus
other status information. Common fields or flags include the following:
• Sign: Contains the sign bit of the result of the last arithmetic operation.
• Carry: Set if an operation resulted in a carry (addition) into or borrow (subtraction) out of a
high-order bit. Used for multiword arithmetic operations.
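As a rough sketch of how these two flags could be derived, the following Python function performs an addition and reports the sign and carry bits. The 8-bit word size is an assumption made for illustration.

    def add8(a, b):
        """Add two 8-bit values; return the result plus sign/carry flags."""
        total = a + b
        result = total & 0xFF             # keep the low 8 bits
        carry = 1 if total > 0xFF else 0  # carry out of the high-order bit
        sign = (result >> 7) & 1          # sign flag = most significant bit
        return result, {"sign": sign, "carry": carry}

    print(add8(0x80, 0x80))  # (0, {'sign': 0, 'carry': 1})
    print(add8(0x01, 0x7F))  # (128, {'sign': 1, 'carry': 0})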
The MC68000 partitions its 32-bit registers into eight data registers and nine address
registers. The eight data registers are used primarily for data manipulation and are also
used in addressing as index registers. The width of the registers allows 8-, 16-, and 32-bit
data operations, determined by opcode. The address registers contain 32-bit (no
segmentation) addresses; two of these registers are also used as stack pointers, one for
users and one for the operating system, depending on the current execution mode. Both
registers are numbered 7, because only one can be used at a time. The MC68000 also
includes a 32-bit program counter and a 16-bit status register.
The Motorola team wanted a very regular instruction set, with no special purpose
registers. A concern for code efficiency led them to divide the registers into two
functional components, saving one bit on each register specifier. This seems a reasonable
compromise between complete generality and code compaction.
The Intel 8086 takes a different approach to register organization. Every register is
special purpose, although some registers are also usable as general purpose. The 8086
contains four 16-bit data registers that are addressable on a byte or 16-bit basis, and four
16-bit pointer and index registers. The data registers can be used as general purpose in
some instructions. In others, the registers are used implicitly.
INSTRUCTION CYCLE
• Fetch: Read the next instruction from memory into the processor.
• Execute: Interpret the opcode and perform the indicated operation.
• Interrupt: If interrupts are enabled and an interrupt has occurred, save the current
process state and service the interrupt.
The main line of activity consists of alternating instruction fetch and instruction execution
activities. After an instruction is fetched, it is examined to determine if any indirect addressing is
involved. If so, the required operands are fetched using indirect addressing. Following execution,
an interrupt may be processed before the next instruction fetch.
During the fetch cycle, an instruction is read from memory. The PC contains the address of the
next instruction to be fetched. This address is moved to the MAR and placed on the address bus.
The control unit requests a memory read, and the result is placed on the data bus and copied into
the MBR and then moved to the IR. Meanwhile, the PC is incremented by 1, preparatory for the
next fetch. Once the fetch cycle is over, the control unit examines the contents of the IR to
determine if it contains an operand specifier using indirect addressing. If so, an indirect cycle is
performed.
The execute cycle takes many forms; the form depends on which of the various machine
instructions is in the IR. This cycle may involve transferring data among registers, read or write
from memory or I/O, and/or the invocation of the ALU.
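The following Python sketch ties the fetch and execute cycles together using the PC, MAR, MBR, and IR registers described above. The three-instruction program, the accumulator, and the opcode names are assumptions made up for illustration, not any particular machine's instruction set.

    memory = {0: ("LOAD", 10), 1: ("ADD", 11), 2: ("HALT", None),
              10: 5, 11: 7}               # addresses 10-11 hold data
    pc, acc = 0, 0                        # program counter, accumulator

    while True:
        # --- fetch cycle: PC -> MAR -> memory read -> MBR -> IR; PC += 1 ---
        mar = pc
        mbr = memory[mar]
        ir = mbr
        pc += 1
        # --- execute cycle: form depends on the instruction in the IR ---
        opcode, operand = ir
        if opcode == "LOAD":
            acc = memory[operand]         # fetch data from memory
        elif opcode == "ADD":
            acc += memory[operand]        # process data in the ALU
        elif opcode == "HALT":
            break

    print(acc)                            # 12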
INSTRUCTION PIPELINING
PIPELINING STRATEGY
As a simple approach, consider subdividing instruction processing into two stages: fetch
instruction and execute instruction. The pipeline thus has two independent stages. The first
stage fetches an instruction and buffers
it. When the second stage is free, the first stage passes it the buffered instruction. While the
second stage is executing the instruction, the first stage takes advantage of any unused memory
cycles to fetch and buffer the next instruction. This is called instruction prefetch or fetch overlap.
In general, pipelining requires registers to store data between stages. This process will
speed up instruction execution. If the fetch and execute stages were of equal duration, the
instruction cycle time would be halved.
To gain further speedup, the pipeline can have more stages. Consider the following
decomposition of the instruction processing:
• Fetch instruction (FI): Read the next expected instruction into a buffer.
• Decode instruction (DI): Determine the opcode and the operand specifiers.
• Calculate operands (CO): Calculate the effective address of each source operand. This
may involve displacement, register indirect, indirect, or other forms of address calculation.
• Fetch operands (FO): Fetch each operand from memory. Operands in registers need not
be fetched.
• Execute instruction (EI): Perform the indicated operation and store the result, if any, in
the specified destination operand location.
• Write operand (WO): Store the result in memory.
With this decomposition, the various stages will be of more nearly equal duration.
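A small Python sketch can make the benefit of this decomposition concrete: it prints the timing diagram for instructions flowing through the six stages, assuming no branches, no memory conflicts, and equal stage durations.

    STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

    def timing_chart(n):
        # Instruction i enters the pipeline one time unit after instruction i-1.
        for i in range(n):
            row = ["  "] * i + STAGES
            print(f"I{i + 1:<2} " + " ".join(f"{s:>2}" for s in row))

    timing_chart(4)
    # Instruction k completes at time unit k + 5, so 4 instructions finish
    # in 9 time units rather than the 24 a purely serial design would need.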
As an example, consider a six-stage pipeline in which instruction 3 is a conditional branch to
instruction 15 and the branch is taken. This is not determined until the end of time unit 7. At
this point, the pipeline must be cleared of instructions that are not useful. During time unit 8,
instruction 15 enters the pipeline. No instructions complete during time units 9 through 12; this is
the performance penalty incurred because the branch could not be anticipated.
PIPELINE PERFORMANCE
The cycle time of an instruction pipeline is the time needed to advance a set of instructions one
stage through the pipeline; The cycle time can be determined as
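The following Python snippet simply evaluates these formulas; the stage delays in nanoseconds are made-up values for illustration.

    def cycle_time(stage_delays, latch_delay):
        return max(stage_delays) + latch_delay     # tau = tau_m + d

    def total_time(k, n, tau):
        return (k + (n - 1)) * tau                 # T_k = [k + (n-1)] * tau

    def speedup(k, n):
        return (n * k) / (k + (n - 1))             # S_k = nk / [k + (n-1)]

    tau = cycle_time([10, 8, 10, 10, 7, 9], latch_delay=1)   # illustrative ns
    print(tau, total_time(6, 100, tau), round(speedup(6, 100), 2))
    # 11 ns cycle; 100 instructions take 1155 ns; speedup 5.71, approaching k = 6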
PIPELINE HAZARDS
A pipeline hazard occurs when the pipeline, or some portion of the pipeline, must stall because
conditions do not permit continued execution. Such a pipeline stall is also referred to as a
pipeline bubble. There are three types of hazards: resource (structural), data, and control.
RESOURCE HAZARDS: A resource hazard occurs when two (or more) instructions that are
already in the pipeline need the same resource. The result is that the instructions must be
executed in serial rather than in parallel for a portion of the pipeline. A resource hazard is sometimes
referred to as a structural hazard.
DATA HAZARDS: A data hazard occurs when there is a conflict in the access of an operand
location. For example, suppose that two instructions in a program are to be executed in sequence and both access a
particular memory or register operand. If the two instructions are executed in strict sequence, no
problem occurs. However, if the instructions are executed in a pipeline, then it is possible for the
operand value to be updated in such a way as to produce a different result than would occur with
strict sequential execution.
• Read after write (RAW), or true dependency: An instruction modifies a register or memory
location and a succeeding instruction reads the data in that memory or register location. A hazard
occurs if the read takes place before the write operation is complete.
• Write after read (WAR), or antidependency: An instruction reads a register or memory location
and a succeeding instruction writes to the location. A hazard occurs if the write operation
completes before the read operation takes place.
• Write after write (WAW), or output dependency: Two instructions both write to the same
location. A hazard occurs if the write operations take place in the reverse order of the intended
sequence.
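The three cases can be summarized in a short Python sketch that classifies the hazard between two instructions in program order, given the register sets each one reads and writes. The dictionary-based instruction format and the x86-flavored example are assumptions made for illustration.

    def classify(first, second):
        """Return the data-hazard types between two in-order instructions."""
        hazards = []
        if first["writes"] & second["reads"]:
            hazards.append("RAW (true dependency)")
        if first["reads"] & second["writes"]:
            hazards.append("WAR (antidependency)")
        if first["writes"] & second["writes"]:
            hazards.append("WAW (output dependency)")
        return hazards or ["none"]

    i1 = {"reads": {"EAX", "EBX"}, "writes": {"EAX"}}  # e.g. ADD EAX, EBX
    i2 = {"reads": {"EAX"}, "writes": {"ECX"}}         # e.g. SUB ECX, EAX
    print(classify(i1, i2))                            # ['RAW (true dependency)']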
CONTROL HAZARDS: A control hazard, also known as a branch hazard, occurs when the
pipeline makes the wrong decision on a branch prediction and therefore brings instructions into
the pipeline that must subsequently be discarded.
One of the major problems in designing an instruction pipeline is assuring a steady flow of
instructions to the initial stages of the pipeline. The primary impediment is the conditional
branch instruction: until the instruction is actually executed, it is impossible to determine
whether the branch will be taken or not.
A variety of approaches have been taken for dealing with conditional branches:
• Multiple streams
• Prefetch branch target
• Loop buffer
• Branch prediction
• Delayed branch
MULTIPLE STREAMS
A simple pipeline suffers a penalty for a branch instruction because it must choose one of two
instructions to fetch next and may make the wrong choice. A brute-force approach is to replicate
the initial portions of the pipeline and allow the pipeline to fetch both instructions, making use of
two streams.
PREFETCH BRANCH TARGET
When a conditional branch is recognized, the target of the branch is prefetched, in addition to the
instruction following the branch. This target is then saved until the branch instruction is
executed. If the branch is taken, the target has already been prefetched.
LOOP BUFFER
A loop buffer is a small, very-high-speed memory maintained by the instruction fetch stage of
the pipeline and containing the n most recently fetched instructions, in sequence. If a branch is to
be taken, the hardware first checks whether the branch target is within the buffer. If so, the next
instruction is fetched from the buffer.
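A hedged Python sketch of this mechanism appears below, assuming a buffer of n = 8 entries; on a taken branch, the fetch stage consults the buffer before going to memory.

    from collections import deque

    class LoopBuffer:
        def __init__(self, n=8):
            self.entries = deque(maxlen=n)   # (address, instruction) pairs

        def record(self, address, instruction):
            self.entries.append((address, instruction))

        def lookup(self, target):
            """Return the buffered instruction at `target`, or None (miss)."""
            for address, instruction in self.entries:
                if address == target:
                    return instruction
            return None

    buf = LoopBuffer()
    for addr in range(100, 108):
        buf.record(addr, f"insn@{addr}")
    print(buf.lookup(102))   # hit: a backward branch target already in the buffer
    print(buf.lookup(50))    # miss: fall back to a normal memory fetch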
BRANCH PREDICTION
Various techniques can be used to predict whether a branch will be taken. Among the more
common are the following:
• Predict never taken
• Predict always taken
• Predict by opcode
• Taken/not taken switch
• Branch history table
The first three approaches are static: they do not depend on the execution history up to the time
of the conditional branch instruction. The latter two approaches are dynamic: they depend on the
execution history.
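As an illustration of the dynamic approaches, the following Python sketch implements the classic two-bit taken/not-taken switch as a saturating counter per branch address; the branch address and the outcome history are made up for illustration.

    counters = {}                              # branch address -> state 0..3

    def predict(addr):
        return counters.get(addr, 0) >= 2      # states 2-3 predict "taken"

    def update(addr, taken):
        state = counters.get(addr, 0)
        counters[addr] = min(state + 1, 3) if taken else max(state - 1, 0)

    for outcome in [True, True, True, False, True]:   # a loop branch's history
        print(predict(0x40), end=" ")
        update(0x40, outcome)
    # prints: False False True True True -- after warming up, the single
    # not-taken outcome (the loop exit) does not flip the prediction.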
DELAYED BRANCH
With delayed branch, the processor always executes the single instruction that immediately
follows the branch (the delay slot), so that useful work is done while the branch is resolved; this
strategy is discussed further under BRANCH PREDICTION below.
INTEL 80486 PIPELINING
An instructive example of an instruction pipeline is that of the Intel 80486. The 80486
implements a five-stage pipeline:
• Fetch: Instructions are fetched from the cache or from external memory and placed into one of
the two 16-byte prefetch buffers. The objective of the fetch stage is to fill the prefetch buffers
with new data as soon as the old data have been consumed by the instruction decoder.
• Decode stage 1: All opcode and addressing-mode information is decoded in the D1 stage. The
required information, as well as instruction-length information, is included in at most the first 3
bytes of the instruction. Hence, 3 bytes are passed to the D1 stage from the prefetch buffers. The
D1 decoder can then direct the D2 stage to capture the rest of the instruction (displacement and
immediate data), which is not involved in the D1 decoding.
• Decode stage 2: The D2 stage expands each opcode into control signals for the ALU. It also
controls the computation of the more complex addressing modes.
• Execute: This stage includes ALU operations, cache access, and register update.
• Write back: This stage, if needed, updates registers and status flags modified during the
preceding execute stage. If the current instruction updates memory, the computed value is sent to
the cache and to the bus-interface write buffers at the same time.
SUPERSCALAR APPROACH
The essence of the superscalar approach is the ability to execute instructions independently and
concurrently in different pipelines.
The upper part of the diagram illustrates an ordinary pipeline, used as a base for comparison. The
base pipeline issues one instruction per clock cycle and can perform one pipeline stage per clock
cycle. The pipeline has four stages: instruction fetch, operation decode, operation execution, and
result write back. The next part of the diagram shows a superpipelined implementation that is
capable of performing two pipeline stages per clock cycle. An alternative way of looking at this
is that the functions performed in each stage can be split into two nonoverlapping parts and each
can execute in half a clock cycle. A superpipeline implementation that behaves in this fashion is
said to be of degree 2. Finally, the lowest part of the diagram shows a superscalar
implementation capable of executing two instances of each stage in parallel. Higher-degree
superpipeline and superscalar implementations are of course possible.
CONSTRAINTS/LIMITATIONS
The term instruction-level parallelism refers to the degree to which, on average, the instructions
of a program can be executed in parallel. A combination of compiler-based optimization and
hardware techniques can be used to maximize instruction-level parallelism.
The fundamental limitations to parallelism with which the system must cope are:
• True data dependency
• Procedural dependency
• Resource conflicts
• Output dependency
• Antidependency
TRUE DATA DEPENDENCY
Consider a sequence in which the second instruction uses a result produced by the first. The
second instruction can be fetched and decoded but cannot execute until the first instruction
executes. The reason is that the second instruction needs data produced by the first instruction.
This situation is referred to as a true data dependency (also called flow dependency or read after
write [RAW] dependency).
PROCEDURAL DEPENDENCIES
The presence of branches in an instruction sequence complicates the pipeline operation. The
instructions following a branch (taken or not taken) have a procedural dependency on the branch
and cannot be executed until the branch is executed.
RESOURCE CONFLICT
A resource conflict is a competition of two or more instructions for the same resource at the
same time. Examples of resources include memories, caches, buses, register-file ports, and
functional units (e.g., ALU adder).
DESIGN ISSUES
Instruction-level parallelism exists when instructions in a sequence are independent and thus
can be executed in parallel by overlapping.
Consider the following two code fragments:

    Load  R1 <- R2           Add   R3 <- R3, "1"
    Add   R3 <- R3, "1"      Add   R4 <- R3, R2
    Add   R4 <- R4, R2       Store [R4] <- R0

The three instructions on the left are independent, and in theory all three could be executed in
parallel. In contrast, the three instructions on the right cannot be executed in parallel because the
second instruction uses the result of the first, and the third instruction uses the result of the
second. The degree of instruction-level parallelism is determined by the frequency of true data
dependencies and procedural dependencies in the code.
The term instruction issue is used to refer to the process of initiating instruction execution in the
processor’s functional units and the term instruction issue policy to refer to the protocol used to
issue instructions.
In essence, the processor is trying to look ahead of the current point of execution to locate
instructions that can be brought into the pipeline and executed. Three types of orderings are
important in this regard:
• The order in which instructions are fetched
• The order in which instructions are executed
• The order in which instructions update the contents of register and memory locations
The simplest instruction issue policy is to issue instructions in the exact order that would be
achieved by sequential execution (in-order issue) and to write results in that same order (in-order
completion). In general terms, superscalar instruction issue policies fall into three categories:
in-order issue with in-order completion, in-order issue with out-of-order completion, and
out-of-order issue with out-of-order completion.
With out-of-order completion, any number of instructions may be in the execution stage
at any one time, up to the maximum degree of machine parallelism across all functional units.
Instruction issuing is stalled by a resource conflict, a data dependency, or a procedural
dependency.
With in-order issue, the processor will only decode instructions up to the point of a
dependency or conflict. No additional instructions are decoded until the conflict is resolved. As a
result, the processor cannot look ahead of the point of conflict to subsequent instructions that
may be independent of those already in the pipeline and that may be usefully introduced into the
pipeline.
To allow out-of-order issue, it is necessary to decouple the decode and execute stages of
the pipeline. This is done with a buffer referred to as an instruction window. With this
organization, after a processor has finished decoding an instruction, it is placed in the instruction
window. As long as this buffer is not full, the processor can continue to fetch and decode new
instructions. When a functional unit becomes available in the execute stage, an instruction from
the instruction window may be issued to the execute stage. The result of this organization is that
the processor has a lookahead capability, allowing it to identify independent instructions that can
be brought into the execute stage.
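The following Python sketch illustrates the lookahead this buys, using a made-up three-instruction fragment and two functional units; as a simplification, a result is treated as available from the cycle after its instruction issues.

    window = [
        {"op": "I1: R3 = R1 + R2", "needs": {"R1", "R2"}, "sets": "R3"},
        {"op": "I2: R4 = R3 * 2",  "needs": {"R3"},       "sets": "R4"},
        {"op": "I3: R6 = R5 - 1",  "needs": {"R5"},       "sets": "R6"},
    ]
    ready = {"R1", "R2", "R5"}        # operand values already computed

    cycle = 0
    while window:
        cycle += 1
        issuable = [i for i in window if i["needs"] <= ready]
        for instr in issuable[:2]:    # at most two functional units per cycle
            print(f"cycle {cycle}: issue {instr['op']}")
            window.remove(instr)
            ready.add(instr["sets"])  # result usable from the next cycle on

Here I3 issues in cycle 1 alongside I1, ahead of the stalled I2; with in-order issue, I3 would have had to wait behind it.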
REGISTER RENAMING
Register renaming is a technique that deals with data dependences between instructions
by renaming their register operands. An assembly language programmer or a compiler specifies
these operands using architectural registers - the registers that are explicit in the instruction set
architecture. Renaming replaces architectural register names by, in effect, value names, with a
new value name for each instruction destination operand. This eliminates the name dependences
(output dependences and antidependences) between instructions and automatically recognizes
true dependences.
The recognition of true data dependences between instructions permits a more flexible life cycle
for instructions. By maintaining a status bit for each value indicating whether or not it has been
computed yet, it allows the execution phase of two instruction operations to be performed out of
order when there are no true data dependences between them. This is called out-of-order
execution.
Registers are allocated dynamically by the processor hardware, and they are associated with the
values needed by instructions at various points in time. When a new register value is created, a
new register is allocated for that value. Subsequent instructions that access that value as a source
operand in that register must go through a renaming process: the register references in those
instructions are revised to refer to the register containing the needed value.
Eg:
    I1: R3b := R3a op R5a
    I2: R4b := R3b + 1
    I3: R3c := R5a + 1
    I4: R7b := R3c op R4b
The register reference without the subscript refers to the logical register reference found in the
instruction. The register reference with the subscript refers to a hardware register allocated to
hold a new value. When a new allocation is made for a particular logical register, subsequent
instruction references to that logical register as a source operand are made to refer to the most
recently allocated hardware register.
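A minimal Python sketch of this renaming process is shown below, applied to the four-instruction example above. The physical register names P0, P1, ... are made up for illustration; real hardware would draw them from a free list.

    rename_map = {}                  # logical register -> current physical register
    physical = iter(f"P{i}" for i in range(128))

    def rename(dest, sources):
        srcs = [rename_map.get(r, r) for r in sources]  # read the newest values
        rename_map[dest] = next(physical)               # fresh register per write
        return rename_map[dest], srcs

    program = [("R3", ["R3", "R5"]),   # I1: R3 := R3 op R5
               ("R4", ["R3"]),         # I2: R4 := R3 + 1
               ("R3", ["R5"]),         # I3: R3 := R5 + 1
               ("R7", ["R3", "R4"])]   # I4: R7 := R3 op R4

    for dest, sources in program:
        d, s = rename(dest, sources)
        print(f"{d} := {', '.join(s)}")
    # I3's write to R3 gets its own register (P2), removing the antidependency
    # and output dependency while preserving the true (RAW) dependences.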
MACHINE PARALLELISM
In each of the graphs, the vertical axis corresponds to the mean speedup of the superscalar
machine over the scalar machine. The horizontal axis shows the results for four alternative
processor organizations. The base machine does not duplicate any of the functional units, but it
can issue instructions out of order. The two graphs, combined, yield some important conclusions.
The first is that it is probably not worthwhile to add functional units without register renaming.
There is some slight improvement in performance, but at the cost of increased hardware
complexity.
BRANCH PREDICTION
With the advent of RISC machines, the delayed branch strategy was explored. This allows the
processor to calculate the result of conditional branch instructions before any unusable
instructions have been prefetched. With this method, the processor always executes the single
instruction that immediately follows the branch. This keeps the pipeline full while the processor
fetches a new instruction stream.
SUPERSCALAR EXECUTION
The program to be executed consists of a linear sequence of instructions. This is the static
program as written by the programmer or generated by the compiler. The instruction fetch
process, which includes branch prediction, is used to form a dynamic stream of instructions. This
stream is examined for dependencies, and the processor may remove artificial dependencies. The
processor then dispatches the instructions into a window of execution. In this window,
instructions no longer form a sequential stream but are structured according to their true data
dependencies. The processor performs the execution stage of each instruction in an order
determined by the true data dependencies and hardware resource availability. Finally,
instructions are conceptually put back into sequential order and their results are recorded.
The final step mentioned in the preceding paragraph is referred to as committing, or retiring, the
instruction.
SUPERSCALAR IMPLEMENTATION
• Instruction fetch strategies that simultaneously fetch multiple instructions, often by predicting
the outcomes of, and fetching beyond, conditional branch instructions. These functions require
the use of multiple pipeline fetch and decode stages, and branch prediction logic.
• Logic for determining true dependencies involving register values, and mechanisms for
communicating these values to where they are needed during execution.
The operation of a CISC machine built around a superscalar RISC core (the Pentium 4 is the
example developed here) can be summarized as follows:
1. The processor fetches instructions from memory in the order of the static program.
2. Each instruction is translated into one or more fixed-length RISC instructions, known as
micro-operations, or micro-ops.
3. The processor executes the micro-ops on a superscalar pipeline organization, so that the
micro-ops may be executed out of order.
4. The processor commits the results of each micro-op execution to the processor’s register set in
the order of the original program flow.
TRACE CACHE FETCH
The trace cache takes the already-decoded micro-ops from the instruction decoder and assembles
them in to program-ordered sequences of micro-ops called traces. Micro-ops are fetched
sequentially from the trace cache, subject to the branch prediction logic.
OUT-OF-ORDER EXECUTION LOGIC
This part of the processor reorders micro-ops to allow them to execute as quickly as their input
operands are ready.
ALLOCATE
• If a needed resource, such as a register, is unavailable for one of the three micro-ops arriving at
the allocator during a clock cycle, the allocator stalls the pipeline.
• The allocator allocates a reorder buffer (ROB) entry, which tracks the completion status of one
of the 126 micro-ops that could be in process at any time.
• The allocator allocates one of the 128 integer or floating-point register entries for the result data
value of the micro-op, and possibly a load or store buffer.
• The allocator allocates an entry in one of the two micro-op queues in front of the instruction
schedulers.
REGISTER RENAMING
The rename stage remaps references to the 16 architectural registers (8 floating-point registers,
plus EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP) into a set of 128 physical registers.
SCHEDULING AND DISPATCHING
The schedulers are responsible for retrieving micro-ops from the micro-op queues and
dispatching these for execution. Each scheduler looks for micro-ops whose status indicates
that the micro-op has all of its operands ready.