CS17303 Computer Architecture Notes On Lesson Unit IV - Sumathi

The document discusses parallelism and instruction cycles. It covers: 1) An instruction cycle involves fetching an instruction from memory, decoding it, executing the actions, and writing results back. Modern CPUs execute instructions concurrently through pipelining. 2) Pipelining overlaps the stages of instruction execution (fetch, decode, execute, write) to improve throughput. It allows the next instruction to begin processing before the previous one finishes. 3) Pipeline hazards like data dependencies, branches, and structural conflicts can cause the pipeline to stall. Various techniques like branch prediction aim to reduce stalls and penalties from hazards.

UNIT IV: PARALLELISM

Pipelining & Instruction cycle – pipelining strategy – pipeline hazards – dealing with branches –
RISC & CISC – Super scalar – Instruction level parallelism – Flynn’s taxonomy –
Multithreading - Multicore Processor - Case Study: Key Elements of ARM 11 MPCORE
INSTRUCTION CYCLE
An instruction cycle (also known as the fetch–decode–execute cycle or the fetch-execute cycle) is
the basic operational process of a computer. It is the process by which a computer retrieves
a program instruction from its memory, determines what actions the instruction dictates, and carries out
those actions. This cycle is repeated continuously by a computer's central processing unit (CPU),
from boot-up until the computer is shut down.
In simpler CPUs the instruction cycle is executed sequentially, each instruction being processed
before the next one is started. In most modern CPUs the instruction cycles are instead
executed concurrently, and often in parallel, through an instruction pipeline: the next instruction starts
being processed before the previous instruction has finished, which is possible because the cycle is
broken up into separate steps.
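As a rough illustration of this loop (not from the original notes), the following Python sketch runs a fetch–decode–execute cycle for a hypothetical accumulator machine; the opcodes and memory layout are invented for the example.

```python
# Illustrative sketch of the fetch-decode-execute cycle for a toy
# accumulator machine. The instruction set (LOAD/ADD/STORE/HALT) and the
# memory layout are hypothetical.
memory = {0: ("LOAD", 10), 1: ("ADD", 11), 2: ("STORE", 12), 3: ("HALT", None),
          10: 5, 11: 7, 12: 0}
pc, acc, running = 0, 0, True

while running:
    instruction = memory[pc]          # Fetch: read the instruction at the PC
    pc += 1                           # advance the program counter
    opcode, operand = instruction     # Decode: split into opcode and operand
    if opcode == "LOAD":              # Execute: carry out the dictated action
        acc = memory[operand]
    elif opcode == "ADD":
        acc += memory[operand]
    elif opcode == "STORE":
        memory[operand] = acc         # write the result back to memory
    elif opcode == "HALT":
        running = False

print(memory[12])                     # 12 (= 5 + 7)
```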

PIPELINING
Pipelining is defined as the overlapping of the various stages of successive instruction
execution. It helps to improve system throughput.
The processor executes a program by fetching and executing instructions, one after the
other. Let Fi and Ei refer to the fetch and execute steps for instruction Ii. Execution of a program
consists of a sequence of fetch and execute steps.
Figure 4.1 Basic idea of pipelining
The computer is controlled by a clock whose period is such that the fetch and execute
steps of any instruction can each be completed in one clock cycle. Operation of the computer
proceeds as in Figure 4.1c. In the first clock cycle, the fetch unit fetches an instruction I1 (step
F1) and stores it in buffer B1 at the end of the clock cycle. In the second clock cycle, the
instruction fetch unit proceeds with the fetch operation for instruction I2 (step F2). Meanwhile,
the execution unit performs the operation specified by instruction I1, which is available in buffer
B1 (step E1). By the end of the second clock cycle, the execution of instruction I1 is completed
and instruction I2 is available. Instruction I2 is stored in B1, replacing I1, which is no longer
needed. Step E2 is performed by the execution unit during the third clock cycle, while instruction
I3 is being fetched by the fetch unit.
A pipelined processor may process each instruction in four steps, as follows:
F Fetch: read the instruction from the memory.
D Decode: decode the instruction and fetch the source operand(s).
E Execute: perform the operation specified by the instruction.
W Write: store the result in the destination location.
Figure 4.2 A 4-stage pipeline
The sequence of events for this case is shown in Figure 4.2a. Four instructions are in
progress at any given time. This means that four distinct hardware units are needed, as shown in
Figure 4.2b. These units must be capable of performing their tasks simultaneously and without
interfering with one another. Information is passed from one unit to the next through a storage
buffer. As an instruction progresses through the pipeline, all the information needed by the stages
downstream must be passed along. For example, during clock cycle 4, the information in the
buffers is as follows:
 Buffer B1 holds instruction I3, which was fetched in cycle 3 and is being decoded by the
instruction-decoding unit.
 Buffer B2 holds both the source operands for instruction I2 and the specification of the
operation to be performed. This is the information produced by the decoding hardware in
cycle 3. The buffer also holds the information needed for the write step of instruction I2
(step W2). Even though it is not needed by stage E, this information must be passed on to
stage W in the following clock cycle to enable that stage to perform the required Write
operation.
 Buffer B3 holds the results produced by the execution unit and the destination
information for instruction I1.
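To make the overlap concrete, here is a small Python sketch (illustrative, not from the notes) that prints the timing diagram of this four-stage pipeline: instruction Ii enters stage F in cycle i, so once the pipeline is full, four instructions are in progress in every cycle.

```python
# Print a timing diagram for the 4-stage (F, D, E, W) pipeline of Figure 4.2.
stages = ["F", "D", "E", "W"]
n = 5                                    # instructions I1..I5
total = n + len(stages) - 1              # cycles needed with full overlap

print("     " + "".join(f"{'c' + str(c + 1):>5}" for c in range(total)))
for i in range(n):
    row = ["."] * total
    for s, name in enumerate(stages):
        row[i + s] = f"{name}{i + 1}"    # Ii occupies stage s during cycle i+s+1
    print(f"I{i + 1}: " + "".join(f"{cell:>5}" for cell in row))
```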
PIPELINE HAZARDS
Hazard: Any condition that causes the pipeline to stall is called a hazard. The resulting idle
periods are called stalls or bubbles in the pipeline.
Types of Hazard:
 Data hazard
 Instruction or control hazard
 Structural hazard
Data hazard: A data hazard is any condition in which either the source or the destination
operands of an instruction are not available at the time expected in the pipeline. Assume that
A=5, and consider the following two operations:
A←3+A
B←4×A
When these operations are performed in the order given, the result is B = 32. But if they
are performed concurrently, the value of A used in computing B would be the original value, 5,
leading to an incorrect result.
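The same dependency can be mimicked in Python (a sketch added for illustration): executing the two operations in order uses the new value of A, while "concurrent" execution reads the stale value.

```python
# In-order execution: B is computed from the updated A.
A = 5
A = 3 + A              # A <- 3 + A, giving 8
B_correct = 4 * A      # B <- 4 x A, giving 32

# "Concurrent" execution: B reads A before the first write completes.
A = 5
stale_A = A            # the value of A seen before the update
A = 3 + A
B_wrong = 4 * stale_A  # uses the original value 5, giving 20

print(B_correct, B_wrong)   # 32 20
```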

Figure 4.3 Pipeline stalled by data dependency between D2 and W1


Structural hazard: A third type of hazard that may be encountered in pipelined operation is
known as a structural hazard. This is the situation when two instructions require the use of a
given hardware resource at the same time. The most common case in which this hazard may
arise is in access to memory.
An example of a structural hazard is shown in Figure 4.4, which shows the effect of the
load instruction Load X(R1),R2 in the pipeline.
Figure 4.4 Effect of load instruction in pipelining
The memory address, X + [R1], is computed in step E2 in cycle 4, and
then memory access takes place in cycle 5. The operand read from memory is written into
register R2 in cycle 6. This means that the execution step of this instruction takes two clock
cycles (cycles 4 and 5). It causes the pipeline to stall for one cycle, because both instructions I2
and I3 require access to the register file in cycle 6.
Instruction or control hazard: The pipeline may also be stalled because of a delay in the
availability of an instruction. For example, this may be a result of a miss in the cache, requiring
the instruction to be fetched from the main memory. Such hazards are often called control
hazards or instruction hazards. The effect of a cache miss on pipelined operation is illustrated in Figure 4.5.

Figure 4.5 Pipeline stall caused by a cache miss in F2


DEALING WITH BRANCHES
Unconditional Branches
Instructions I1 to I3 are stored at successive memory addresses, and I2 is a branch
instruction. Let the branch target be instruction Ik. Figure 4.6 shows a sequence of instructions
being executed in a two-stage pipeline. In clock cycle 3, the fetch operation for instruction I3 is
in progress at the same time that the branch instruction is being decoded and the target address
computed. In clock cycle 4, the processor must discard I3, which has been incorrectly fetched,
and fetch instruction Ik. In the meantime, the hardware unit responsible for the Execute (E) step
must be told to do nothing during that clock period. Thus, the pipeline is stalled for one clock
cycle. The time lost as a result of a branch instruction is often referred to as the branch penalty.
In Figure 4.6, the branch penalty is one clock cycle. For a longer pipeline, the branch penalty
may be higher.

Figure 4.6 An idle cycle caused by a branch instruction


Figure 4.7 Branch Timing

For example, Figure 4.7 shows the effect of a branch instruction on a four-stage pipeline.
We have assumed that the branch address is computed in step E2. Instructions I3 and I4 must be
discarded, and the target instruction, Ik, is fetched in clock cycle 5. Thus, the branch penalty is
two clock cycles.
Instruction Queue and Prefetching
Either a cache miss or a branch instruction stalls the pipeline for one or more clock
cycles. To reduce the effect of these interruptions, many processors employ sophisticated fetch
units that can fetch instructions before they are needed and put them in a queue. Typically, the
instruction queue can store several instructions. A separate unit, which we call the dispatch unit,
takes instructions from the front of the queue and sends them to the execution unit. This leads to
the organization shown in Figure 4.8. The dispatch unit also performs the decoding function.

Figure 4.8 Use of an instruction queue in the hardware organization


CONDITIONAL BRANCHES AND BRANCH PREDICTION
A conditional branch instruction introduces the added hazard caused by the dependency
of the branch condition on the result of a preceding instruction. The decision to branch cannot be
made until the execution of that instruction has been completed.
The location following a branch instruction is called a branch delay slot. A technique
called delayed branching can minimize the penalty incurred as a result of conditional branch
instructions. The instructions in the delay slots are always fetched. The objective is to be able to
place useful instructions in these slots. If no useful instructions can be placed in the delay slots,
these slots must be filled with NOP instructions.

Figure 4.9 Reordering of Instructions for a delayed branch

Figure 4.10 Execution timing showing the delay slot being filled during the last two passes through the
loop
Branch Prediction
Another technique for reducing the branch penalty associated with conditional branches
is to attempt to predict whether or not a particular branch will be taken. The simplest form of
branch prediction is to assume that the branch will not take place and to continue to fetch
instructions in sequential address order. Until the branch condition is evaluated, instruction
execution along the predicted path must be done on a speculative basis.
Speculative execution means that instructions are executed before the processor is certain
that they are in the correct execution sequence.
The branch prediction decision is always the same every time a given instruction is
executed. Any approach that has this characteristic is called static branch prediction.
Another approach in which the prediction decision may change depending on execution
history is called dynamic branch prediction.
The algorithm may be described by the two-state machine in Figure 4.11a. The two states
are:
LT: Branch is likely to be taken
LNT: Branch is likely not to be taken
Suppose that the algorithm is started in state LNT. When the branch instruction is
executed and if the branch is taken, the machine moves to state LT. Otherwise, it remains in state
LNT. The next time the same instruction is encountered, the branch is predicted as taken if the
corresponding state machine is in state LT. Otherwise it is predicted as not taken.

Figure 4.11 State machine representation of branch-prediction algorithms


An algorithm that uses 4 states, thus requiring two bits of history information for each
branch instruction, is shown in Figure 4.11b. The four states are:
ST: Strongly likely to be taken
LT: Likely to be taken
LNT: Likely not to be taken
SNT: Strongly likely not to be taken
Again assume that the state of the algorithm is initially set to LNT. After the branch instruction
has been executed, and if the branch is actually taken, the state is changed to ST; otherwise, it is
changed to SNT.
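As an illustration (not from the original notes), the four states can be realized as a 2-bit saturating counter, a common implementation of this kind of predictor; the exact transitions of Figure 4.11b may differ slightly from the saturating-counter variant sketched here, and the state encoding is an assumption.

```python
# 2-bit saturating-counter sketch of a four-state branch predictor.
# Encoding (assumed for this sketch): SNT=0, LNT=1, LT=2, ST=3.
SNT, LNT, LT, ST = 0, 1, 2, 3

class TwoBitPredictor:
    def __init__(self, initial=LNT):
        self.state = initial

    def predict_taken(self):
        return self.state >= LT           # predict "taken" in states LT and ST

    def update(self, taken):
        # Move one step toward ST on a taken branch, toward SNT otherwise.
        self.state = min(self.state + 1, ST) if taken else max(self.state - 1, SNT)

p = TwoBitPredictor()
for actual in [True, True, False, True]:  # e.g. a mostly-taken loop branch
    print("predicted taken:", p.predict_taken(), "- actually taken:", actual)
    p.update(actual)
```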
Reduced Instruction Set Computer (RISC) & Complex Instruction Set Computer (CISC)
Major characteristics of CISC architecture:
1. A large number of instructions, typically from 100 to 250
2. Some instructions that perform specialized tasks and are used infrequently
3. A large variety of addressing modes, typically from 5 to 20 different modes
4. Variable-length instruction formats
5. Instructions that manipulate operands in memory
Major characteristics of RISC architecture:
1. Relatively few instructions
2. Relatively few addressing modes
3. Memory access limited to load and store instructions
4. All operations done within the registers of the CPU
5. Fixed-length, easily decoded instruction format
6. Single-cycle instruction execution
7. Hardwired rather than microprogrammed control
Detailed characteristics of RISC: A RISC processor design has separate digital circuitry in the
control unit, which produces all the necessary signals needed for the execution of each
instruction in the processor's instruction set.
1. RISC processors use a small and limited number of instructions: RISC processors only
support a small number of primitive and essential instructions. This puts the emphasis on
software and compiler design, due to the relatively simple instruction set.
2. RISC machines mostly use a hardwired control unit: Most RISC processors are based on
the hardwired control unit design approach. A hardwired control unit uses fixed logic
circuits to interpret instructions and generate control signals from them. It is significantly
faster than its microprogrammed counterpart, but rather inflexible.
3. RISC processors consume less power and give high performance: RISC processors are
typically heavily pipelined; this ensures that the hardware resources of the processor are
utilized to the maximum, giving higher throughput while also consuming less power.
4. Each instruction is very simple and consistent: Most instructions in a RISC instruction
set are very simple and execute in one clock cycle.
5. RISC processors use simple addressing modes: RISC processors do not have many
addressing modes, and the modes they do have are very simple. Most of the addressing
modes are for register operands and do not refer to memory.
6. RISC instructions are of uniform, fixed length: The decision of RISC processor designers
to provide simple addressing modes leads to uniform-length instructions. By contrast,
instruction length increases if an operand is in memory as opposed to in a register,
because the memory address must then be specified as part of the instruction encoding,
which takes many more bits and complicates instruction decoding and scheduling.
7. Large number of registers: The RISC design philosophy generally incorporates a larger
number of registers, to reduce the amount of interaction with memory.

Difference between RISC and CISC / Compare RISC and CISC Characteristics.

Memory access
RISC (reduced instruction set computers): restricted to load/store instructions; data
manipulation instructions are register-to-register.
CISC (complex instruction set computers): memory access is directly available to most
types of instructions.
Addressing modes
RISC: limited in number.
CISC: substantial in number.
Instruction formats
RISC: all of the same length.
CISC: of different lengths.
Instructions
RISC: perform elementary operations.
CISC: perform both elementary and complex operations.
Control unit
RISC: hardwired; high throughput and fast execution.
CISC: microprogrammed; facilitates compact programs and conserves memory.
(or)

1. RISC stands for Reduced Instruction Set Computer; CISC stands for Complex Instruction
Set Computer.
2. RISC processors have simple instructions taking about one clock cycle; the average Clock
cycles Per Instruction (CPI) of a RISC processor is 1.5. CISC processors have complex
instructions that take multiple clock cycles to execute; the average CPI of a CISC processor
is between 2 and 15.
3. In RISC, there are hardly any instructions that refer to memory; in CISC, most of the
instructions refer to memory.
4. RISC processors have a fixed instruction format; CISC processors have a variable
instruction format.
5. The RISC instruction set is reduced, i.e. it has only a few instructions, many of them very
primitive; the CISC instruction set has a variety of different instructions that can be used for
complex operations.
6. RISC has fewer addressing modes, and most instructions in the instruction set use
register-to-register addressing; CISC has many different addressing modes and can thus
represent higher-level programming language statements more efficiently.
7. In RISC, complex addressing modes are synthesized in software; CISC already supports
complex addressing modes.
8. RISC has multiple register sets; CISC has only a single register set.
9. RISC processors are highly pipelined; CISC processors are normally not pipelined, or less
pipelined.
10. In RISC, the complexity lies in the compiler that translates the program; in CISC, the
complexity lies in the microprogram.
11. The most common RISC microprocessors are Alpha, ARC, ARM, AVR, MIPS, PA-RISC,
PIC, Power Architecture, and SPARC. Examples of CISC processors are the System/360,
VAX, PDP-11, Motorola 68000 family, and AMD and Intel x86 CPUs.

SUPER SCALAR
A superscalar processor is a CPU that implements a form of parallelism called instruction-level
parallelism within a single processor. In contrast to a scalar processor, which can execute at most
one single instruction per clock cycle, a superscalar processor can execute more than one
instruction during a clock cycle by simultaneously dispatching multiple instructions to different
execution units on the processor. It therefore allows more throughput than would otherwise be
possible at a given clock rate. Each execution unit is not a separate processor, but an execution
resource within a single CPU, such as an arithmetic logic unit.
Example: Simple superscalar pipeline. By fetching and dispatching two instructions at a time,
a maximum of two instructions per cycle can be completed. (IF = Instruction Fetch, ID =
Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back, i =
Instruction number, t = Clock cycle [i.e., time])
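A minimal Python sketch (added for illustration) prints the timing this example describes: pairs of instructions move through IF, ID, EX, MEM, WB together, so at most two instructions complete per cycle.

```python
# Timing diagram for a simple dual-issue superscalar pipeline.
stages = ["IF", "ID", "EX", "MEM", "WB"]
n = 6                                     # instructions i = 1..6
width = 2                                 # dual issue: two instructions per cycle
total = (n + width - 1) // width + len(stages) - 1

print("      " + "".join(f"{'t' + str(t + 1):>5}" for t in range(total)))
for i in range(n):
    start = i // width                    # the cycle in which this pair enters IF
    row = ["."] * total
    for s, name in enumerate(stages):
        row[start + s] = name
    print(f"i={i + 1}: " + "".join(f"{cell:>5}" for cell in row))
```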

Figure 4.12 Superscalar Organization


FLYNN’S TAXONOMY
 In 1966, Michael Flynn proposed a classification for computer architectures based on the
number of instruction streams and data streams.
 Flynn uses the stream concept for describing a machine's structure.
 A stream simply means a sequence of items (data or instructions).
 An instruction stream is defined as the sequence of instructions performed by the processing
unit.
 A data stream is defined as the traffic exchanged between the memory and the processing unit.
 Computer architectures are classified by the number of instruction streams and data streams
(Flynn’s Taxonomy).
Figure: Instruction stream and data stream between the CPU and main memory

Flynn’s Classification of Computer Architectures
 Single instruction single data (SISD)
 Single instruction multiple data (SIMD)
 Multiple instructions single data(MISD)
 Multiple instructions multiple data(MIMD)
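Since the category depends only on whether each stream count is one or many, the classification can be captured in a few lines of Python (an illustrative toy, not from the notes):

```python
# Toy classifier for Flynn's taxonomy: the category depends only on whether
# the machine has one or many instruction streams and data streams.
def flynn_category(instruction_streams: int, data_streams: int) -> str:
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"

print(flynn_category(1, 1))   # SISD - traditional von Neumann uniprocessor
print(flynn_category(1, 64))  # SIMD - array/vector processing
print(flynn_category(4, 1))   # MISD - rarely used in practice
print(flynn_category(4, 4))   # MIMD - traditional multiprocessor
```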
Single Instruction Single Data Stream (SISD)
 SISD corresponds to the traditional mono-processor (von Neumann computer). A single
data stream is processed by one instruction stream.
 A single-processor computer (uni-processor with a single control unit) in which a single
stream of instructions is generated from the program.
 No parallelism.
 Examples
 CDC 6600, which is unpipelined but has multiple functional units.
 CDC 7600, which has a pipelined arithmetic unit.
 Amdahl 470/6, which has pipelined instruction processing.
 Cray-1, which supports vector processing.
 where PU – Processing Unit and CDC – Control Data Corporation.
Single Instruction Multiple Data Stream (SIMD)
 Each instruction is executed on a different set of data by different processors, i.e. multiple
processing units of the same type process multiple data streams.
 This group is dedicated to array processing machines.
 Sometimes, vector processors can also be seen as a part of this group.
 Examples ILLIAC-IV, PEPE, BSP, STARAN, MPP, DAP and Connection Machine
(CM-1).
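In software terms, the SIMD idea looks like the following NumPy sketch (an analogy added here, not part of the original notes): a single operation is applied in lock step to every element of a data stream.

```python
# One instruction ("add 1") applied in lock step to a whole data stream.
# NumPy is used purely as a software analogy for SIMD processing elements.
import numpy as np

data = np.array([10, 20, 30, 40])   # four data elements, one per processing unit
result = data + 1                   # a single operation over all elements at once
print(result)                       # [11 21 31 41]
```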
Multiple-Instruction Single Data Stream (MISD)

 Each processor executes a different sequence of instructions.


 In case of MISD computers, multiple processing units operate on one single-data stream.
 In practice, this kind of organization has never been used
Multiple-Instruction Multiple Data Stream (MIMD)

 Each processor has a separate program.


 An instruction stream is generated from each program.
 Each instruction operates on different data.
 This last machine type forms the group of traditional multiprocessors. Several
processing units operate on multiple data streams.
 Examples
C.mmp, C.m*, BBN, Burroughs D825,
Cray-2, S1, Cray X-MP, FPS T/40000, iPSC, HEP, Pluribus,
IBM 370/168 MP, Univac 1100/80, Tandem/16, IBM 3081/3084, Butterfly,
Meiko Computing Surface (CS-1)
Figure: Summary of Flynn’s Taxonomy

HARDWARE MULTITHREADING
Hardware multithreading allows multiple threads to share the functional units of a single
processor in an overlapping fashion. To permit this sharing, the processor must duplicate the
independent state of each thread. Utilization of the processor is increased by switching to
another thread when one thread is stalled.
Approaches to Hardware multithreading:
 Fine-grained multithreading
 Coarse-grained multithreading
 Simultaneous multithreading
Fine-grained Multithreading
 A version of hardware multithreading in which the processor switches between threads after
every instruction.
 Interleaving is done in round-robin fashion, skipping any threads that are stalled at that
time.
Advantages
 It hides the throughput losses that arise from both short and long stalls.
 Instructions from other threads can be executed when one thread stalls.
Disadvantages
 It slows down the execution of the individual threads.
 A thread that is ready to execute without stalls will still be delayed by instructions from
other threads.
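A small Python sketch of this policy (the workload is hypothetical, added for illustration): in every cycle the processor picks the next thread in round-robin order, skipping any thread that is stalled that cycle.

```python
# Each thread has a queue of instructions and a set of cycles during which it
# is stalled (e.g. waiting on memory). All values here are invented.
threads = {
    "T0": {"ops": ["i0", "i1", "i2"], "stalled": {2, 3}},
    "T1": {"ops": ["j0", "j1", "j2"], "stalled": set()},
    "T2": {"ops": ["k0", "k1", "k2"], "stalled": {0}},
}
names = list(threads)
pc = {t: 0 for t in names}

for cycle in range(9):
    for k in range(len(names)):            # round-robin, skipping stalled threads
        t = names[(cycle + k) % len(names)]
        has_work = pc[t] < len(threads[t]["ops"])
        if has_work and cycle not in threads[t]["stalled"]:
            print(f"cycle {cycle}: issue {threads[t]['ops'][pc[t]]} from {t}")
            pc[t] += 1
            break
    else:
        print(f"cycle {cycle}: bubble (no ready thread)")
```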
Coarse-grained Multithreading
 A version of hardware multithreading in which the processor switches between threads only
after significant events, such as a cache miss.
 It is limited in its ability to overcome throughput losses from shorter stalls.
Advantages
Useful for reducing the penalty of high-cost stalls, where the pipeline refill time is negligible
compared to the stall time.
Disadvantages
 It is difficult to overcome throughput losses due to shorter stalls.
 When a stall occurs, the pipeline must be emptied.
Simultaneous Multithreading
 Simultaneous multithreading (SMT) is hardware multithreading that uses the resources of a
multiple-issue, dynamically scheduled pipelined processor.
 It lowers the cost of multithreading by utilizing the resources needed for a multiple-issue,
dynamically scheduled microarchitecture.
 It exploits thread-level parallelism at the same time as it exploits instruction-level parallelism.
 Being a multiple-issue processor, it has more functional-unit parallelism available than a
single thread can use.
 It improves overall resource utilization by overlapping the latency of an instruction from one
thread with the execution of another thread.
 SMT shares functional units dynamically and flexibly between multiple threads.
Example

Advantages
 More functional-unit parallelism is available than single threads can use.
 Register renaming and dynamic scheduling are used.
 Multiple instructions from independent threads can be issued without regard to
the dependences among them.
 It does not switch resources every cycle.
MULTICORE PROCESSOR
Hardware multithreading improved the efficiency of processors at modest cost, but the
remaining difficulty is how to run old programs on parallel hardware. There are two solutions:
1. Shared Memory Multiprocessor(SMP):
 Provide single address space to all processors.
 Programs may be executed in parallel.
 All variables of the program must be available at any time to any processor.
2. Cluster and Message Passing Multi Processor:
 A separate address space is provided for each processor.
 Sharing must be explicit.
Shared Memory Multiprocessor:
 A parallel processor with a single physical address space is called a shared memory
multiprocessor.
 Processors communicate through shared variables in memory. All processors can access
any memory location via loads and stores.
 In such a system, independent jobs can be run in their own virtual address spaces, even if
they all share a physical address space.
Types of single address space multiprocessors:
 Uniform Memory Access(UMA)
 Non Uniform Memory Access(NUMA)
UMA
A multiprocessor in which latency to any word in main memory is about the same no matter
which processor requests the access.
NUMA
 A type of single address space multiprocessor in which some memory accesses are much
faster than others depending on which processor asks for which word.
 The main memory is divided and attached to different microprocessors or to different
memory controllers on the same chip.
 Programming challenges are harder for NUMA than for UMA.
 NUMA machines can scale to larger sizes.
 NUMAs can have lower latency to nearby memory.
Synchronization
The process of coordinating the behavior of two or more processes, which may be running on
different processors.
Lock
A synchronization device that allows access to data to only one processor at a time.
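As a software analogy (added here, not from the notes), Python threads sharing one address space show why a lock is needed: without it, concurrent increments of a shared variable can be lost.

```python
# Two workers update a shared counter; the lock serializes access so that
# no update is lost. Python threads stand in for processors sharing memory.
import threading

counter = 0
lock = threading.Lock()

def worker():
    global counter
    for _ in range(100_000):
        with lock:          # only one thread may touch the shared data at a time
            counter += 1

workers = [threading.Thread(target=worker) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(counter)              # 200000 every time, because the lock is held
```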
Cluster and other Message Passing Multiprocessors
 A multiprocessor with multiple private address spaces.
 Each processor has its own private physical address space.
 Multiprocessors communicate via an explicit message-passing scheme, hence the name
message-passing multiprocessors: communication between processors is done by
explicitly sending and receiving information.
 There are routines to send and receive messages. The send-message routine is used
to send a message to other processors in machines with private memories.
 The receive-message routine is used to receive messages from other processors in
machines with private memories.
 If a sending processor needs confirmation that the message has reached the receiver,
then the receiving processor can also send an acknowledgement message back to the
sender.
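A minimal Python sketch of the send/receive/acknowledge pattern (illustrative only; multiprocessing is used as an analogy for processors with private memories):

```python
# Two processes with private address spaces communicate only by explicit
# send/receive, with an acknowledgement sent back. multiprocessing.Pipe is
# a software stand-in for the interconnect between private-memory machines.
from multiprocessing import Pipe, Process

def receiver(conn):
    message = conn.recv()                 # receive-message routine
    print("receiver got:", message)
    conn.send("ack")                      # confirm delivery to the sender

if __name__ == "__main__":
    sender_end, receiver_end = Pipe()
    p = Process(target=receiver, args=(receiver_end,))
    p.start()
    sender_end.send({"op": "sum", "data": [1, 2, 3]})   # send-message routine
    print("sender got:", sender_end.recv())             # wait for the acknowledgement
    p.join()
```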
Figure: Cluster and other message-passing multiprocessors

INSTRUCTION LEVEL PARALLELISM


Pipelining can overlap the execution of instructions when they are independent of one
another. Instruction-level parallelism is a measure of the number of instructions that can be
performed simultaneously during a single clock cycle. The potential overlap among instructions
is called instruction-level parallelism.
Implementation of ILP:
 Pipelining
 Superscalar
 VLIW
 Multiprocessor computer
There are two primary methods for increasing the potential amount of instruction-level
parallelism. The first is increasing the depth of the pipeline to overlap more instructions. Another
approach is to replicate the internal components of the computer so that it can launch multiple
instructions in every pipeline stage. The general name for this technique is multiple issue.
Multiple issue:
A scheme whereby multiple instructions are launched in one clock cycle.
Types of Multiple Issue:
 Static multiple issue
 Dynamic multiple issue
Static multiple issue
An approach to implementing a multiple-issue processor where many decisions are made
by the compiler before execution.
Dynamic multiple issue
An approach to implementing a multiple-issue processor where many decisions are made
during execution by the processor. There are two primary and distinct responsibilities that must
be dealt with in a multiple-issue pipeline:
1. Packaging instructions into issue slots
2. Dealing with data and control hazards
Issue slots
The positions from which instructions could issue in a given clock cycle; by analogy,
these correspond to positions at the starting blocks for a sprint.
The Concept of Speculation
One of the most important methods for finding and exploiting more ILP is speculation.
Speculation
An approach whereby the compiler or processor guesses the outcome of an instruction to
remove it as a dependence in executing other instructions.
For example, the compiler can use speculation to reorder instructions, moving an instruction
across a branch or a load across a store.
Static Multiple Issue
In a static issue processor, the set of instructions issued in a given clock cycle is called an
issue packet; treating the packet as a unit assists with packaging instructions and handling
hazards. The packet may be determined statically by the compiler or dynamically by the processor.
Very Long Instruction Word (VLIW)
A style of instruction set architecture that launches many operations that are defined to be
independent in a single wide instruction, typically with many separate opcode fields. The
compiler’s responsibilities may include static branch prediction and code scheduling to reduce or
prevent all hazards.
Use Latency
Number of clock cycles between a load instruction and an instruction that can use the
result of the load without stalling the pipeline.
Loop Unrolling
A technique to get more performance from loops that access arrays, in which multiple
copies of the loop body are made and instructions from different iterations are scheduled
together.
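The transformation can be illustrated in Python (an added sketch; a compiler applies it to machine code, not source): four copies of the loop body run per iteration, and independent accumulators expose work that can be scheduled together. The example assumes the array length is a multiple of four.

```python
# A summation loop and a version unrolled by four. The four independent
# accumulators remove the serial dependence on a single running total,
# exposing instructions that a scheduler could issue together.
def sum_rolled(a):
    total = 0
    for i in range(len(a)):
        total += a[i]
    return total

def sum_unrolled_by_4(a):
    t0 = t1 = t2 = t3 = 0               # independent accumulators expose ILP
    for i in range(0, len(a), 4):       # one iteration handles four elements
        t0 += a[i]
        t1 += a[i + 1]
        t2 += a[i + 2]
        t3 += a[i + 3]
    return t0 + t1 + t2 + t3

data = list(range(16))                  # length assumed to be a multiple of 4
print(sum_rolled(data), sum_unrolled_by_4(data))   # 120 120
```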
Superscalar
An advanced pipelining technique that enables the processor to execute more than one
instruction per clock cycle by selecting them during execution. In the simplest superscalar
processors, instructions issue in order, and the processor decides whether zero, one, or more
instructions can issue in a given clock cycle.
Dynamic pipeline scheduling
Hardware support for reordering the order of instruction execution so as to avoid stalls.
Dynamic pipeline scheduling chooses which instructions to execute next, possibly reordering
them to avoid stalls. In such processors, the pipeline is divided into three major units: an
instruction fetch and issue unit, multiple functional units and a commit unit. The first unit fetches
instructions, decodes them, and sends each instruction to a corresponding functional unit for
execution. Each functional unit has buffers, called reservation stations, which hold the operands
and the operation. When the result is completed, it is sent to any reservation stations waiting for
this particular result, as well as to the commit unit, which buffers the result until it is safe to put
the result into the register file or, for a store, into memory. The buffer in the commit unit is often
called the reorder buffer.
Commit unit
The unit in a dynamic or out-of-order execution pipeline that decides when it is safe to
release the result of an operation to programmer-visible registers and memory.
Reservation station
A buffer within a functional unit that holds the operands and the operation.
Reorder buffer
The buffer that holds results in a dynamically scheduled processor until it is safe to store
the results to memory or a register.

Figure: Three primary units of a dynamically scheduled pipeline


Out-of-order execution
A situation in pipelined execution when an instruction blocked from executing does not cause the
following instructions to wait.
In-order commit
A commit in which the results of pipelined execution are written to the programmer visible state in the
same order that instructions are fetched.
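A heavily simplified Python sketch (added for illustration; the instruction format and single-cycle timing are assumptions, and writing results straight to the register file stands in for the result broadcast) ties these pieces together: ready instructions execute as soon as their operands are available, possibly out of order, while the reorder buffer commits strictly in fetch order.

```python
# Instructions are (dest, op, src1, src2); entry order is fetch order.
program = [
    ("r1", "+", "r2", "r3"),   # r1 = r2 + r3
    ("r4", "*", "r1", "r1"),   # depends on r1, so it must wait
    ("r5", "+", "r6", "r7"),   # independent: may execute before the multiply
]
regs = {"r2": 2, "r3": 3, "r6": 6, "r7": 7}                # register file
rob = [{"inst": inst, "done": False} for inst in program]  # reorder buffer
head, cycle = 0, 0

while head < len(rob):
    cycle += 1
    visible = dict(regs)                  # operands visible at the start of the cycle
    for entry in rob:                     # "reservation stations": fire anything ready
        dest, op, a, b = entry["inst"]
        if not entry["done"] and a in visible and b in visible:
            value = visible[a] + visible[b] if op == "+" else visible[a] * visible[b]
            regs[dest] = value            # simplified stand-in for the result broadcast
            entry["done"] = True
            print(f"cycle {cycle}: executed {entry['inst']}")
    while head < len(rob) and rob[head]["done"]:   # in-order commit from the head
        print(f"cycle {cycle}: committed {rob[head]['inst']}")
        head += 1
```

Running the sketch shows the third instruction executing in cycle 1, before the dependent multiply, yet committing after it: out-of-order execution with in-order commit.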

CASE STUDY: Key Elements of ARM11 MPCORE


The ARM11 MPCore is a multicore product based on the ARM11 processor family. The
ARM11 MPCore can be configured with up to four processors, each with its own L1 instruction
and data caches, per chip. Table below lists the configurable options for the system, including the
default values. Figure below presents a block diagram of the ARM11 MPCore. The key elements
of the system are as follows:
 Distributed interrupt controller (DIC): Handles interrupt detection and interrupt
prioritization. The DIC distributes interrupts to individual processors.
 Timer: Each CPU has its own private timer that can generate interrupts.
 Watchdog: Issues warning alerts in the event of software failures. If the watchdog
is enabled, it is set to a predetermined value and counts down to 0. It is
periodically reset. If the watchdog value reaches zero, an alert is issued.
 CPU interface: Handles interrupt acknowledgment, interrupt masking, and
interrupt completion acknowledgement.
 CPU: A single ARM11 processor. Individual CPUs are referred to as MP11
CPUs.
 Vector floating-point (VFP) unit: A coprocessor that implements floating point
operations in hardware.
 L1 cache: Each CPU has its own dedicated L1 data cache and L1 instruction
cache.
 Snoop control unit (SCU): Responsible for maintaining coherency among L1 data
caches.

Interrupt Handling
The Distributed Interrupt Controller (DIC) collates interrupts from a large number of sources. It
provides
 Masking of interrupts
 Prioritization of the interrupts
 Distribution of the interrupts to the target MP11 CPUs
 Tracking the status of interrupts
 Generation of interrupts by software
The DIC is designed to satisfy two functional requirements:
 Provide a means of routing an interrupt request to a single CPU or multiple CPUs,
as required.
 Provide a means of inter processor communication so that a thread on one CPU
can cause activity by a thread on another CPU.

The DIC can route an interrupt to one or more CPUs in the following three ways:
 An interrupt can be directed to a specific processor only.
 An interrupt can be directed to a defined group of processors. The MPCore views
the first processor to accept the interrupt, typically the least loaded, as being best
positioned to handle the interrupt.
 An interrupt can be directed to all processors.
From the point of view of an MP11 CPU, an interrupt can be
Inactive: An Inactive interrupt is one that is non-asserted, or which in a multiprocessing
environment has been completely processed by that CPU but can still be either Pending or Active
in some of the CPUs to which it is targeted, and so might not have been cleared at the interrupt
source.
Pending: A Pending interrupt is one that has been asserted, and for which processing has not
started on that CPU.
Active: An Active interrupt is one that has been started on that CPU, but processing is not
complete. An Active interrupt can be pre-empted when a new interrupt of higher priority
interrupts MP11 CPU interrupt processing.
